Efficient Similarity Search On Vector Sets
Abstract: Similarity search in database systems is becoming an increasingly important task in modern application domains such as multimedia, molecular biology, medical imaging, computer aided design and many others. Whereas most of the existing similarity models are based on feature vectors, there exist some models which use very complex object representations such as trees and graphs. A promising middle ground between too simple and too complex object representations in similarity search is the use of sets of feature vectors. In this paper, we first motivate the use of this modeling approach for complete object similarity search as well as for partial object similarity search. After introducing a distance measure between vector sets, suitable for many different application ranges, we present and discuss different filters which are indispensable for efficient query processing. In a broad experimental evaluation based on artificial and real-world test datasets, we show that our approach considerably outperforms both the sequential scan and metric index structures.
1 Introduction
In the last ten years, an increasing number of database applications has emerged for which efficient and effective support for similarity search is substantial. The importance of similarity search grows in application areas such as multimedia, medical imaging, molecular biology, computer aided engineering, marketing, purchasing assistance, and others. As distance functions form the foundation of similarity search, we need an object representation which allows efficient and meaningful distance computations. A common approach is to represent an object by a numerical feature vector. In this case, a feature transformation extracts distinguishable characteristics which are represented by numerical values and grouped together in a feature vector. On the basis of such a feature transformation and under the assumption that similarity corresponds to feature distance, it is possible to define a distance function between the corresponding feature vectors as a similarity measure for two data objects. Thus, searching for data objects similar to a given query object is transformed into proximity search in the feature space. Most applications use the Euclidean metric (L2) to evaluate the feature distance, but there are several other metrics commonly used, e.g. the Manhattan metric (L1) and the maximum metric (L∞). Furthermore, there exist quite a few much more complex similarity models based on graphs [KS03] and trees [KKSS04]. Generally, the more complex and precise these models are, the more exact are the results of a similarity search, but at the same time the computation cost rises as well.

In this paper, we present a distance measure for an approach somewhere in between single feature vectors and complex trees and graphs. We model an object by a set of feature vectors, which is a very suitable object representation for many different application ranges. In order to achieve efficient query processing, we present three different lower-bounding filters and discuss their properties.

The remainder of this paper is organized as follows. In Section 2, we motivate the use of vector set represented objects by presenting various application ranges which benefit from this modeling approach. In Section 3, we introduce the minimal matching distance between vector sets, which is a suitable distance measure for partial and complete similarity search. In Section 4, we sketch the paradigm of multi-step query processing and present appropriate filter techniques for the minimal matching distance on vector sets. In Section 5, we present the results of our experimental evaluation. We conclude this work in Section 6 with a short summary and a few remarks on future work.
2 Application Ranges for Vector Set Data

Using sets of feature vectors is a generalization of the use of just one large feature vector. It is always possible to restrict the model to a feature space in which a data object will be completely represented by just one feature vector. But in some applications the properties of vector set representations allow us to model the dependencies between the extracted features more precisely. As the development of conventional database systems in the recent two decades has shown, the use of more sophisticated ways to model data can enhance both the effectiveness and the efficiency of applications using large amounts of data. Another advantage of using sets of feature vectors is the better storage utilization. It is not necessary to force objects into a common size if they are represented by sets of different cardinality. In the following, we shortly sketch different application ranges which benefit from the use of vector set data.

CAD databases. In [KBK+03] voxelized spatial objects were modeled by sets of feature vectors, where each feature vector represents a 3D rectangular cover which approximates the object as well as possible. The vector set representation is able to avoid the problems that occur when storing a set of covers according to a strict order, i.e. in one high-dimensional feature vector. Thereby, it is possible to compare two objects more intuitively than with the distance calculation in the one-vector model. In a broad experimental evaluation it was shown that the use of sets of feature vectors greatly enhances the quality of the similarity model compared to the use of a single feature vector.

Soccer teams. As another example, let us assume that we want to measure the similarity between two soccer teams. It is beneficial to represent each player by a feature vector and the complete team as a set of feature vectors. A feature vector for one player may
consist of attributes like his age, his salary, the number of goals in the last season, etc. We can compare two players by computing the Euclidean distance between the corresponding feature vectors. This measures the similarity between two players rather well. But what is a suitable distance for comparing two teams? Assume we have a team A consisting of 10 very young players having a low salary and having scored only a few goals in the last season. Furthermore, team A has one highly paid, rather experienced and successful player. On the other hand, we have a team B with 10 rather old, highly paid, successful players and one young low-budget player. If we compare each player of team A to the most similar player in team B and vice versa, this yields that the two teams are very similar. This straightforward approach does not reflect the intuitive notion of similarity. On the other hand, if we compare each player from team A to a different player in team B, trying to minimize the average distance between two matched players, this results in a very accurate similarity measure. For partial similarity, it is advisable not to compare all players from team A to a different player in team B, but only the s most similar players. For low values of s, e.g. s = 2, the two teams A and B are very similar, as each team has an old player with a high salary and a young low-budget player. In this case, the distance between the teams A and B would be very small. For higher values of s, the two teams become more and more dissimilar. Let us note that for s = 11 the two notions of partial and complete similarity coincide. This behavior reflects the intuitive perception of similarity. To sum up, the use of vector sets allows us to adjust the degree of partial similarity in k discrete steps if we represent the objects by vector sets of cardinality k.

Further application areas. There exist a lot of further possible application fields for sets of feature vectors, e.g.:
- stock portfolios, where each stock is represented by the value of one share, the overall number of shares, how many days ago the shares were bought, the risk category, etc.
- shopping carts, where each consumer product corresponds to a feature vector containing the category, the price, the quantity, etc.
- multimedia CDs, where each media file is represented by the publisher, the artist, the title, the file size, the kind of content, etc.

To sum up, sets of feature vectors are a natural way to model a lot of complex real-world objects.
Effective distance functions which allow both complete and partial similarity search, as well as suitable filter techniques for efficient query processing, are indispensable for the general use of the powerful concept of sets of feature vectors.
3 The Minimal Matching Distance on Vector Sets

There are already several distance measures proposed on sets of vectors. In [EM97] the authors survey the following four measures, which are computable in polynomial time: the Hausdorff distance, the sum of minimum distances, the (fair-)surjection distance and the link distance. The Hausdorff distance does not seem to be suitable as a similarity measure, because it relies too much on the extreme positions of the elements of both sets. The last three distance measures are suitable for modeling similarity, but are not metric. This circumstance makes them unattractive, since there are only limited possibilities for processing similarity queries efficiently when using a non-metric distance function. In [EM97], the authors also introduce a method for expanding the distance measures into metrics, but as a side effect the complexity of distance calculation becomes exponential. Furthermore, the possibility to match several elements in one set to just one element in the compared set is questionable in the application areas presented in Section 2.

A distance measure on vector sets that demonstrates to be suitable for defining similarity is based on the minimum weight perfect matching of sets. This well known graph problem can be applied here by building a complete bipartite graph G = (X ∪ Y, E) between the vector sets X and Y. The weight of each edge (x, y) ∈ E, where x ∈ X and y ∈ Y, in this graph G is defined by the distance d(x, y). A perfect matching is a subset M ⊆ E that connects each x ∈ X to exactly one y ∈ Y and vice versa. A minimum weight perfect matching is a matching with a minimum sum of weights of its edges. Contrary to the second example of Section 2, where we considered vector sets of equal cardinality, i.e. soccer teams consisting of 11 players, there are a lot of application ranges where objects are naturally represented by a varying number of vectors. Since a perfect matching can only be found for sets of equal cardinality, we need to introduce suitable weights as a penalty for the unmatched vectors when defining a distance measure between objects of varying cardinality.

Definition 1 (permutation of a set) Let A be any finite set of arbitrary elements. Then a permutation π is a mapping that assigns to each a ∈ A a unique number i ∈ {1, ..., |A|}. This is written as π(A) = (a_1, ..., a_{|A|}). The set of all possible permutations of A is denoted by Π(A).

Definition 2 (minimal matching distance) Let V ⊆ R^d and let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function between two d-dimensional feature vectors. Furthermore, let W: V → R be a weight function for unmatched elements. Then the minimal matching distance D_mm^{D,W}: 2^V × 2^V → R is defined as follows:
$$D^{D,W}_{mm}(X, Y) = \min_{\pi \in \Pi(Y)} \left( \sum_{i=1}^{|X|} D(x_i, y_{\pi(i)}) + \sum_{i=|X|+1}^{|Y|} W(y_{\pi(i)}) \right)$$
The weight function W provides the penalty given to every unassigned element of the set having larger cardinality. Let us note that the minimal matching distance is a specialization of the netflow distance, which is proven to be a metric in [RB01]. The minimal matching distance D_mm^{D,W} is a metric if the distance function D is a metric and the weight function W meets the following conditions:

(1) W(x) > 0 for all x ∈ V
(2) W(x) + W(y) ≥ D(x, y) for all x, y ∈ V

The Kuhn-Munkres algorithm [Kuh55, Mun57] can be used to calculate the minimal matching distance in polynomial time. In a primary initialization step, a distance matrix between the two vector sets containing k d-dimensional vectors is computed. If D is an Lp-distance, this initialization takes O(k²d) time. The method itself is based on the successive augmentation of an alternating path between both sets. Since it is guaranteed that this path can be expanded by one further match within each step taking O(k²) time and there is a maximum of k steps, the overall complexity of a distance calculation is O(k³ + k²d) in the worst case.

The minimal matching distance can be adapted for partial similarity search in vector set represented data. The distance measure defined in the following is based on a partial minimal matching. Given two vector sets X and Y, |X| ≤ |Y|, we only match s ≤ |X| vectors to calculate the distance between X and Y.

Definition 3 (partial minimal matching distance) Let V ⊆ R^d and let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function between two d-dimensional feature vectors. Let s ≤ |X|. Then the partial minimal matching distance D_pmm^{D,s}: 2^V × 2^V → R is defined as follows:
$$D^{D,s}_{pmm}(X, Y) = \min_{\pi_1 \in \Pi(X),\, \pi_2 \in \Pi(Y)} \sum_{i=1}^{s} D(x_{\pi_1(i)}, y_{\pi_2(i)})$$
Unlike the minimal matching distance, the partial variant is not a metric. As the Kuhn-Munkres algorithm produces a partial minimal matching in each step as an intermediate result, we can use it to calculate the partial minimal matching distance D_pmm^{D,s}(X, Y). But we have to take into account all $\binom{|X|}{s}$ combinations of vectors in X to match with vectors in Y. Therefore, the time complexity for a single distance calculation is $O\big(\binom{k}{s} \cdot (s k^2 + k^2 d)\big)$. Thus, a filtering technique to speed up query processing is essential.
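The distance computations in this paper were implemented with the Kuhn-Munkres algorithm; as an illustration only, the following Python sketch computes the complete minimal matching distance with an off-the-shelf assignment solver. The function name minimal_matching_distance and the padding scheme are ours and not taken from the original implementation; the dummy rows encode the penalty W for unmatched elements of the larger set.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def minimal_matching_distance(X, Y, weight, p=2):
    """Minimal matching distance D_mm^{D,W} between two vector sets.

    X, Y: arrays of shape (|X|, d) and (|Y|, d); weight(v) is the penalty W(v)
    for an unmatched vector v of the larger set; D is the L_p distance.
    """
    if len(X) > len(Y):                            # ensure |X| <= |Y|
        X, Y = Y, X
    cost = cdist(X, Y, metric='minkowski', p=p)    # |X| x |Y| distance matrix
    if len(X) < len(Y):
        # one dummy row per missing element of X; matching y_j to a dummy
        # costs the penalty W(y_j)
        penalties = np.array([[weight(y) for y in Y]] * (len(Y) - len(X)))
        cost = np.vstack([cost, penalties])
    row, col = linear_sum_assignment(cost)         # minimum weight perfect matching
    return cost[row, col].sum()

# toy usage with W(v) = ||v||_2, i.e. the distance of v to the origin
X = np.array([[0.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.1, 0.0], [1.0, 0.9], [5.0, 5.0]])
print(minimal_matching_distance(X, Y, weight=np.linalg.norm))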
4 Efficient Query Processing

Complete similarity search on vector set data can be accelerated by using metric index structures, e.g. the M-tree [CPZ97]. For a detailed survey on metric index structures we refer the reader to [CNBYM01]. Another approach is to use the multi-step query processing paradigm which, in contrast to metric index structures, is also suitable for partial similarity search. The main goal of multi-step query processing is to reduce the number of complex and therefore time consuming distance calculations in the query process. In order to guarantee that no false drops occur, the filter distances used have to fulfill a lower-bounding distance criterion. For any two objects o1 and o2, a lower-bounding distance function D_f in the filter step has to return a value that is not greater than the exact object distance D_o of o1 and o2, i.e. D_f(o1, o2) ≤ D_o(o1, o2). With a lower-bounding distance function, it is possible to safely filter out all database objects which have a filter distance greater than the current query range, because the exact similarity distance of those objects cannot be less than the query range. The computation of the minimal matching distance on vector sets is a rather expensive operation. Thus, the employment of selective and efficiently computable filter distance functions for similarity search is very important. In the following, we present three different filter types for query processing on data objects represented by vector sets, namely the closest pair filter, the centroid filter and the norm vector filter.
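As a small illustration (ours, not from the paper), the filter-and-refine principle for an ε-range query can be sketched as follows; filter_dist and exact_dist are placeholder functions, and the only requirement is the lower-bounding property D_f ≤ D_o.

def range_query(db, q, eps, filter_dist, exact_dist):
    """Multi-step epsilon-range query.

    filter_dist must lower-bound exact_dist, i.e.
    filter_dist(q, o) <= exact_dist(q, o) for all objects o,
    which guarantees that no true result is dropped in the filter step.
    """
    results = []
    for obj in db:
        if filter_dist(q, obj) > eps:      # filter step: safe to prune
            continue
        if exact_dist(q, obj) <= eps:      # refinement step
            results.append(obj)
    return results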
4.1 Closest Pair Approach
The closest pair distance between two vector sets X and Y can be used as a filter distance for the minimal matching distance D_mm^{D,W_⊥} and is defined as follows.

Definition 4 (closest pair distance) Let V ⊆ R^d and ⊥ ∈ R^d \ V. Let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function. Let X̃ = {x̃_1, ..., x̃_{|Y|}} be a multiset where x̃_i = x_i for i ∈ {1, ..., |X|} and x̃_i = ⊥ for i ∈ {|X| + 1, ..., |Y|}. Then the closest pair distance D_cp^{D,⊥}: 2^V × 2^V → R is defined as follows:
$$D^{D,\perp}_{cp}(X, Y) = \max\left( \sum_{i=1}^{|Y|} \min_{j=1,\ldots,|Y|} D(\tilde{x}_i, y_j),\;\; \sum_{i=1}^{|Y|} \min_{j=1,\ldots,|Y|} D(\tilde{x}_j, y_i) \right)$$
Let us note that the closest pair filter works directly on the sets of vectors, i.e. on the original data, and not on approximated data. The filter distance can be computed by scanning the matrix of distance values between each pair of vectors in X and Y for the closest pairs. We will now show that the closest pair distance between two vector sets is a lower bound for the minimal matching distance.

Theorem 1 Let V ⊆ R^d and ⊥ ∈ R^d \ V. Let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function. Furthermore, let W_⊥: V → R, W_⊥(v) = D(v, ⊥), be a weight function for unmatched elements. Then the following inequality holds:
$$D^{D,\perp}_{cp}(X, Y) \le D^{D,W_\perp}_{mm}(X, Y)$$
[Figure 1: 2-dimensional examples for the filter distances: (a) closest pair, (b) centroid, (c) norm vector.]
Proof: See Appendix A.1.

A 2-dimensional example for the closest pair filter is depicted in Fig. 1(a), where |X| = |Y| = 3 and a'_3 + b_3 + c_3 = D_cp^{L_2,0}(X, Y) ≤ D_mm^{L_2,W_0}(X, Y) = a_3 + b_3 + c_3. As a'_3 < a_3, x_3 is matched to both y_1 and y_3 during the filter distance calculation, whereas the minimal matching distance is based on one-to-one matchings.

We adapt the closest pair filter to partial similarity search by adding up just the distances of the s closest pairs of vectors. Thus, the partial closest pair distance is defined as follows.

Definition 5 (partial closest pair distance) Let V ⊆ R^d and let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function. Let s ≤ |X|. Then the partial closest pair distance D_pcp^{D,s}: 2^V × 2^V → R is defined as follows:
$$D^{D,s}_{pcp}(X, Y) = \max\left( \min_{\pi \in \Pi(X)} \sum_{i=1}^{s} \min_{j=1,\ldots,|Y|} D(x_{\pi(i)}, y_j),\;\; \min_{\pi \in \Pi(Y)} \sum_{i=1}^{s} \min_{j=1,\ldots,|X|} D(x_j, y_{\pi(i)}) \right)$$
The partial closest pair distance is a lower bound for the partial minimal matching distance.

Theorem 2 Let V ⊆ R^d and let X, Y ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function. Let s ≤ |X|. Then the following inequality holds:
$$D^{D,s}_{pcp}(X, Y) \le D^{D,s}_{pmm}(X, Y)$$
As the partial closest pair distance can be computed rather efficiently by scanning the matrix of distance values between each pair of vectors in X and Y for the closest pairs and organizing the s closest distances in a heap structure, it is a very beneficial filter for the partial minimal matching distance. The overall runtime complexity is O(k²d) for the complete version and O(k²d log s) for the partial version of the closest pair distance, when an Lp-distance is used between vectors. Although this is more complex than the closest pair approach on norm vectors (cf. Section 4.3), it is a more selective filter that saves more of the very expensive calculations of the exact partial minimal matching distance.
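A possible realization of the closest pair filter and its partial variant is sketched below (our own code, with the origin used as the dummy vector ⊥); it scans the distance matrix for row-wise and column-wise minima and, in the partial case, adds up only the s smallest minima per direction.

import numpy as np
from scipy.spatial.distance import cdist

def closest_pair_distance(X, Y, dummy=None, s=None, p=2):
    """Closest pair filter D_cp (s=None) or partial variant D_pcp (s given)."""
    if len(X) > len(Y):
        X, Y = Y, X
    if s is None and len(X) < len(Y):
        # pad X with the dummy vector (e.g. the origin) up to |Y| elements
        if dummy is None:
            dummy = np.zeros(X.shape[1])
        pad = np.tile(dummy, (len(Y) - len(X), 1))
        X = np.vstack([X, pad])
    dist = cdist(X, Y, metric='minkowski', p=p)
    row_min = dist.min(axis=1)     # closest y for every x (or dummy)
    col_min = dist.min(axis=0)     # closest x (or dummy) for every y
    if s is None:
        return max(row_min.sum(), col_min.sum())
    # partial variant: only the s smallest closest-pair distances per direction
    return max(np.sort(row_min)[:s].sum(), np.sort(col_min)[:s].sum())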
4.2 Centroid Approach
This filter step is based on the relation between a set of feature vectors and its extended centroid [KBK+03].

Definition 6 (extended centroid) Let V ⊆ R^d and ⊥ ∈ R^d \ V. Let X = {x_1, ..., x_{|X|}} ∈ 2^V be a vector set where |X| ≤ k. Then the extended centroid C_{k,⊥}(X) is defined as follows:
$$C_{k,\perp}(X) = \frac{\sum_{i=1}^{|X|} x_i + (k - |X|) \cdot \perp}{k}$$
Note how the vector ⊥ is used as a dummy vector to fill up vector sets with a cardinality of less than k.

Theorem 3 Let V ⊆ R^d and ⊥ ∈ R^d \ V. Let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets where |X|, |Y| ≤ k and let C_{k,⊥}(X), C_{k,⊥}(Y) be their extended centroids. Furthermore, let W_⊥: V → R, W_⊥(v) = ||v − ⊥||_p, be a weight function for unmatched elements. Then the following inequality holds:

$$k \cdot \| C_{k,\perp}(X) - C_{k,\perp}(Y) \|_p \le D^{L_p, W_\perp}_{mm}(X, Y)$$
See [KBK+03] for the proof of this theorem. We have shown that the Lp-distance between the extended centroids multiplied by k is a lower bound for the minimal matching distance under the named preconditions. Therefore, when computing e.g. ε-range queries, we do not need to examine objects whose extended centroids have a distance to the query object q that is larger than ε/k. Often a good choice of ⊥ is 0, since 0 ∉ V holds for a lot of applications. Thus, Conditions (1) and (2) for the metric character of the minimal matching distance D_mm^{L_2,W_0} are satisfied. A 2-dimensional example for the extended centroid filter is depicted in Fig. 1(b), where |X| = |Y| = 2 and 2c_1 = 2 · ||C_{k,0}(X) − C_{k,0}(Y)||_2 ≤ D_mm^{L_2,W_0}(X, Y) = a_1 + b_1.
The centroid approach is not suitable as a filter for the partial minimal matching distance, as the centroid invariably aggregates information of all vectors contained in a vector set.
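The following sketch (ours, with ⊥ = 0 as suggested above) condenses a vector set into its extended centroid and evaluates the centroid filter distance, i.e. the Lp-distance between two extended centroids scaled by k, which by Theorem 3 lower-bounds the minimal matching distance with W_⊥(v) = ||v − ⊥||_p as weight function.

import numpy as np

def extended_centroid(X, k, dummy=None):
    """Extended centroid C_{k,dummy}(X) of a vector set with |X| <= k."""
    if dummy is None:
        dummy = np.zeros(X.shape[1])           # common choice: the origin
    return (X.sum(axis=0) + (k - len(X)) * dummy) / k

def centroid_filter_distance(X, Y, k, p=2):
    """k * ||C_k(X) - C_k(Y)||_p, a lower bound of D_mm^{Lp,W_0}(X, Y)."""
    cx, cy = extended_centroid(X, k), extended_centroid(Y, k)
    return k * np.linalg.norm(cx - cy, ord=p)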
4.3 Norm Vector Approach
Another possible filter for vector set represented data is based on the Lp-norms of all vector elements of a vector set. The idea is as follows: For all vectors x in a vector set X, |X| ≤ k, we compute the Lp-norms ||x||_p and organize these norm values in descending order in a k-dimensional vector. We call this filter the norm vector filter.

Definition 7 (norm vector) Let V ⊆ R^d. Let X ∈ 2^V be a vector set where |X| ≤ k. Let (||x_1||_p, ..., ||x_{|X|}||_p) be the sequence of the Lp-norm values of the vectors in X in descending order, i.e. for all i < j ∈ {1, ..., |X|} holds ||x_i||_p ≥ ||x_j||_p. Then the norm vector V_k(X) = (v_1, ..., v_k)^t ∈ R^k is defined as follows:

$$v_i = \begin{cases} \|x_i\|_p & \text{for } i = 1, \ldots, |X| \\ 0 & \text{for } i = |X|+1, \ldots, k \end{cases}$$
Note that if X has a cardinality smaller than k, dimensions |X| + 1 to k of the norm vector are filled with 0. We employ the Manhattan distance as a distance function between two norm vectors V_k(X) and V_k(Y). This distance measure fulfills the lower-bounding property with respect to the minimal matching distance if the Lp-norm is used as the weight function W.

Theorem 4 Let V ⊆ R^d and let X, Y ∈ 2^V be two vector sets. Their norm vectors are denoted by V_k(X) and V_k(Y). Furthermore, let W_0: V → R, W_0(v) = ||v||_p, be the Lp-norm used as a weight function for the minimal matching distance. Then the following inequality holds:

$$\| V_k(X) - V_k(Y) \|_1 \le D^{L_p, W_0}_{mm}(X, Y)$$

Proof: See Appendix A.2.

A 2-dimensional example for the norm vector filter is depicted in Fig. 1(c), where |X| = |Y| = 2 and a'_2 + b'_2 = ||V_k(X) − V_k(Y)||_1 ≤ D_mm^{L_2,W_0}(X, Y) = a_2 + b_2.

An approach for partial similarity search is to apply a parallel scan through the norm vectors V_k(X) and V_k(Y) and to build a heap structure containing the distances between the closest pairs of norm values found during the parallel scan. Finally, the sum of the s smallest elements of the heap is reported as the distance measure. This can be done very efficiently in O(k log s) time using the algorithm in Fig. 2. The algorithm corresponds to a closest pair approach on the norm values of the feature vectors, which lower-bounds the partial minimal matching distance.
algorithm partialNormVectorFilter(VectorSet X, VectorSet Y, Integer k, Integer s)
begin
  return max(comp(X, Y, k, s), comp(Y, X, k, s));
end;

algorithm comp(VectorSet X, VectorSet Y, Integer k, Integer s)
begin
  (x1, ..., xk) := Vk(X);                   // initialize the norm vectors
  (y1, ..., yk) := Vk(Y);
  j := 1;
  for i in 1..k do                          // parallel scan
    while j < k and |xi - yj| >= |xi - yj+1| do
      j := j + 1;
    end while;
    heap.insert(|xi - yj|);                 // distance of the closest pair for xi
  end for;
  dist := 0;                                // add up the s smallest distances
  for i in 1..s do
    dist := dist + heap.removeMin();
  end for;
  return dist;
end;

Figure 2: The partialNormVectorFilter algorithm.
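For the complete case, the norm vector filter can be sketched in a few lines (our own helper names, complementing the partial variant of Fig. 2); it builds the norm vectors of Definition 7 and returns their Manhattan distance, which by Theorem 4 lower-bounds the minimal matching distance with the Lp-norm as weight function.

import numpy as np

def norm_vector(X, k, p=2):
    """Norm vector V_k(X): L_p-norms of the vectors in X, sorted descending,
    padded with zeros up to dimension k."""
    norms = np.sort(np.linalg.norm(X, ord=p, axis=1))[::-1]
    return np.pad(norms, (0, k - len(norms)))

def norm_vector_filter_distance(X, Y, k, p=2):
    """||V_k(X) - V_k(Y)||_1, a lower bound of D_mm^{Lp,W_0}(X, Y)."""
    return np.abs(norm_vector(X, k, p) - norm_vector(Y, k, p)).sum()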
Theorem 5 Let V ⊆ R^d and let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let s ≤ |X|. Let X̄ = {||x_1||_p, ..., ||x_{|X|}||_p}, Ȳ = {||y_1||_p, ..., ||y_{|Y|}||_p} be multisets containing the Lp-norm values of the vectors in X and Y. Then the following inequality holds:

$$D^{L_p, s}_{pcp}(\bar{X}, \bar{Y}) \le D^{L_p, s}_{pmm}(X, Y)$$

Proof: See Appendix A.3.
4.4 Summary
As the computation of the minimal matching distance is rather time-consuming, we introduced three different filters. The centroid and the norm vector filtering techniques can be profitably combined. The exact distance computation is only performed if the results of both filter distance computations on the centroids and the norm vectors are small enough. This way, a good deal of the information in the vector sets is incorporated in the filter distance computation. Given d-dimensional data, the centroid filter maps each dimension to a single value, resulting in a d-dimensional vector. On the other hand, the norm vector filter maps each vector to a single value, resulting in a k-dimensional vector. Thus, the combined filter contains aggregated information over both the dimensions and the vectors and is therefore suitable for a lot of different data distributions. The time complexity for a combined filter distance evaluation is O(d + k). As the centroid approach is not applicable for partial similarity search, we cannot use the combined filter for this purpose. In contrast to the other two approaches, which derive a single feature vector for approximating a vector set, the closest pair filter works directly on the vector sets. The resulting distance measure lower-bounds the minimal matching distance and can be computed more efficiently than the exact minimal matching distance. The runtime complexities for partial and complete similarity distance calculations based on the three different filters are summed up in Table 1, where we assume vector sets containing k d-dimensional vectors, a partial similarity parameter s ∈ {1, ..., k}, and an Lp-distance between vectors.
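As a small illustration (ours), the combined filter simply takes the maximum of the two lower bounds computed by the centroid and the norm vector filter; it reuses the hypothetical centroid_filter_distance and norm_vector_filter_distance helpers sketched above and is itself a lower bound of the minimal matching distance, at least as tight as either filter alone.

def combined_filter_distance(X, Y, k, p=2):
    """Maximum of the centroid and norm vector lower bounds; still a lower
    bound of D_mm^{Lp,W_0}(X, Y), but at least as tight as each single filter."""
    return max(centroid_filter_distance(X, Y, k, p),
               norm_vector_filter_distance(X, Y, k, p))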
5 Experimental Evaluation
In this section, we present our experimental results. We generated and used two artificial datasets, each containing 100,000 random vector sets. The first dataset consists of vector sets containing 10 2-dimensional vectors each. The other dataset consists of vector sets containing 2 10-dimensional vectors each. The vectors are generated so that all of their components are uniformly distributed in the interval between 0 and 1. All distance measures between vector sets were implemented in Java 1.4 and the experiments were run on a workstation with a Xeon 2.4 GHz processor and 2 GB main memory under Linux.

Furthermore, we used the similarity model presented in [KBK+03], where CAD objects were represented by a vector set consisting of either 3, 5 or 7 vectors in 6D. All experiments were carried out on a dataset containing 5,000 CAD objects from an American aircraft producer. We conducted our experiments on top of the Oracle9i Server, using PL/SQL for the computational main memory based programming. We compared our different filters for vector set represented data to a PL/SQL implementation of the M-tree [CPZ97]. For the M-tree based k-nearest neighbor queries the ranking algorithm of [HS95] was used. The experiments were performed on a Pentium III/700 machine with IDE hard drives. The database block cache was set to 500 disk blocks with a block size of 8 KB and was used exclusively by one active session.

The minimal matching distances between sets of feature vectors were computed using an implementation of the Kuhn-Munkres algorithm. Throughout our experiments we used the Euclidean distance as the distance measure between two single vectors. The range queries were based on a sequential scan. The k-nn queries with exact distance calculations were also based on a sequential scan. For the filtered k-nn queries the filter distances between the query object and all vector sets in the database were calculated and sorted in ascending order. Then the optimal multi-step k-nn search algorithm [SK98] was used. In all tests, we processed 10 different similarity range queries as well as k-nn queries. The presented figures depict the average results from these tests.
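The optimal multi-step k-nn algorithm of [SK98] can be paraphrased as follows; this is a sketch in our own words, not the original code. Candidates are visited in ascending order of their filter distance, and refinement stops as soon as the next filter distance exceeds the current k-th smallest exact distance.

import heapq

def multistep_knn(db, q, k, filter_dist, exact_dist):
    """Optimal multi-step k-nn in the spirit of [SK98]. filter_dist must
    lower-bound exact_dist; objects are refined in ascending order of their
    filter distance until it exceeds the current k-th exact distance."""
    candidates = sorted((filter_dist(q, o), i, o) for i, o in enumerate(db))
    result = []                                    # max-heap via negated distances
    for fd, i, obj in candidates:
        if len(result) == k and fd > -result[0][0]:
            break                                  # no closer object can follow
        d = exact_dist(q, obj)
        heapq.heappush(result, (-d, i, obj))
        if len(result) > k:
            heapq.heappop(result)                  # drop the current worst
    return sorted(((-nd, obj) for nd, _, obj in result), key=lambda t: t[0])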
5.1 Complete Similarity Search
In a first experiment, we carried out range queries on the two artificial datasets. Figure 3 shows rather good results for the norm vector filter, while the centroid filter performs rather badly. The superiority of the norm vector filter is due to the fact that more information is preserved by approximating a vector set by a 10-dimensional norm vector than by the 2-dimensional centroid computed by the centroid approach. As expected, the situation is reversed in Fig. 4, where each vector set contains 2 10-dimensional vectors. In both tests, the closest pair filter has good to optimal selectivity, but due to its computational complexity the overall runtime is rather high, especially for high ε-values.

Using the CAD datasets, we carried out different range queries on vector sets consisting of 5 6-dimensional vectors each. Figure 5 shows that the selectivity of the closest pair filter is almost optimal, i.e. few unnecessary candidates are produced. Nevertheless, the overall runtime of this filter step is very high, as the runtime complexity of the filter step is almost as high as that of the computation of the minimal matching distance itself (cf. Fig. 5). Good results were obtained by using the centroid approach. The good performance of the centroid approach can slightly be increased by using the combined filter, i.e. the combination of the norm vector filter and the centroid filter, which can also be efficiently computed and has a slightly higher selectivity. Note that both the selectivity and the runtime behavior of the M-tree are outperformed by this combined filter for all ε-values.

Figure 6: Complete k-nn queries, CAD dataset, cardinality 7, dimensionality 6 (sequential scan took about 1014 sec. for each k).

Figure 6 shows the average results we obtained for carrying out different k-nn queries on CAD objects represented by vector sets containing 7 vectors. Basically, we made the same observations as for range queries. Although the closest pair filter has a rather good selectivity, it is rather expensive. The best trade-off is achieved by using the combination of the norm vector filter and the centroid filter. All filters have a rather good selectivity and accelerate the query process enormously. For instance, for k-nn queries where k is smaller than 20, the combined filter accelerates the query process on the 6-dimensional vector sets by more than one order of magnitude compared to the sequential scan. Again, the selectivity as well as the runtime behavior of the M-tree is clearly outperformed by this combined filter for all values of k, e.g. for k = 5 the combined filter outperforms the M-tree by an order of magnitude. We made the same observations for the CAD datasets with 3 and 5 vectors per vector set, except that the absolute runtime is higher for the larger vector sets. The average runtime for 7 vectors is about four times the average runtime for 3 vectors.

Figure 8: Partial k-nn queries for s = 3, CAD dataset, cardinality 5, dimensionality 6 (sequential scan took about 2123 sec. for each k).
5.2 Partial Similarity Search
In this section, we tested the closest pair algorithm both on the L2-norm values, i.e. the norm vector filter, and directly on the d-dimensional vectors, i.e. the closest pair filter. Let us note that detecting partial similarity is a very expensive operation. Furthermore, we cannot apply the M-tree, as the distance function is not a metric (cf. Definition 3).

Figure 7 shows the average of 10 range queries for varying ε-values on vector sets of 7 vectors. The partial similarity parameter s was set to 2. Again, the closest pair filter is very selective. As the exact distance function is very expensive, the closest pair filter can be beneficially used for small ε-values. For higher ε-values, the rather high evaluation cost of the closest pair filter becomes noticeable. On the other hand, the norm vector filter can safely be used for all values of ε, as there is no noteworthy overhead. For rather small ε-values, it even outperforms the closest pair filter, although the norm vector filter has a lower selectivity than the closest pair filter. This is because the lower computational cost of the norm vector filter still pays off compared to the slightly higher number of exact distance computations which have to be carried out.

Figure 8 shows the average of 10 k-nn queries for vector sets of 5 vectors, each having a dimensionality of 6, and a partial similarity parameter s = 3. For small values of k, the norm vector filter outperforms the exact distance computation by almost one order of magnitude. For higher values of k, the selectivity of the norm vector filter decreases and thus the overall response time increases. For k = 100, the norm vector filter still accelerates the query process by 100%. As already mentioned, the closest pair filter is rather expensive. Although it has an excellent selectivity, the norm vector filter is better for rather small values of k. For increasing values of k, the closest pair filter outperforms the norm vector filter because of its much better selectivity and the very expensive exact distance calculations.
6 Conclusions
In this paper, we motivated the use of vector set data by pointing out the different application areas of this promising representation technique. We introduced a suitable distance function on vector sets, which reflects the intuitive notion of similarity for the presented application ranges. Furthermore, we presented different filtering techniques with different runtime complexities. Our experimental evaluation and our analytical reasoning showed that the closest pair filter is the most selective filter. As this filter is rather expensive, it only pays off for partial similarity queries, which are extremely expensive themselves. For complete similarity queries, the combination of the norm vector filter and the centroid filter is the method of choice for a lot of different data distributions, as it can be computed efficiently and the information of each vector and each dimension is taken into consideration. The experimental evaluation on real-world datasets demonstrates that the presented filtering techniques accelerate similarity range queries and k-nn queries by up to one order of magnitude compared to metric index structures and the sequential scan. In our future work, we want to show how the paradigm of sets of feature vectors can be applied to effective and efficient data mining tasks, e.g. clustering and classification.
References
[CNBYM01] E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in Metric Spaces. ACM Computing Surveys, 33(3):273-321, 2001.

[CPZ97] P. Ciaccia, M. Patella, and P. Zezula. M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces. In Proc. 23rd Int. Conf. on Very Large Data Bases (VLDB'97), Athens, Greece, pages 426-435, 1997.

[EM97] T. Eiter and H. Mannila. Distance Measures for Point Sets and Their Computation. Acta Informatica, 34(2):103-133, 1997.

[HS95] G. R. Hjaltason and H. Samet. Ranking in Spatial Databases. In Proc. 4th Int. Symposium on Large Spatial Databases (SSD'95), volume 951 of Lecture Notes in Computer Science (LNCS), pages 83-95. Springer, 1995.

[KBK+03] H.-P. Kriegel, S. Brecheisen, P. Kröger, M. Pfeifle, and M. Schubert. Using Sets of Feature Vectors for Similarity Search on Voxelized CAD Objects. In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'03), San Diego, CA, 2003.

[KKSS04] K. Kailing, H.-P. Kriegel, S. Schönauer, and T. Seidl. Efficient Similarity Search for Hierarchical Data in Large Databases. In Proc. 9th Int. Conf. on Extending Database Technology (EDBT'04), Heraklion, Greece, 2004.

[KS03] H.-P. Kriegel and S. Schönauer. Similarity Search in Structured Data. In Proc. 5th Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK'03), Prague, Czech Republic, volume 2737 of Lecture Notes in Computer Science (LNCS), pages 309-319. Springer, 2003.

[Kuh55] H. W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2:83-97, 1955.

[Mun57] J. Munkres. Algorithms for the Assignment and Transportation Problems. Journal of the SIAM, 6:32-38, 1957.

[RB01] J. Ramon and M. Bruynooghe. A Polynomial Time Computable Metric Between Point Sets. Acta Informatica, 37:765-780, 2001.

[SK98] T. Seidl and H.-P. Kriegel. Optimal Multi-Step k-Nearest Neighbor Search. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 154-165, 1998.
Appendix
A Formal Proofs
We will use the following three lemmas to prove Theorems 1, 4 and 5.

Lemma 1 Let x, y ∈ R^d be two d-dimensional feature vectors. Then the difference between the Lp-norms of x and y underestimates the Lp-distance between x and y:

$$\big|\, \|x\|_p - \|y\|_p \,\big| \le \|x - y\|_p$$

Proof: By the triangle inequality, ||x||_p = ||x − 0||_p ≤ ||x − y||_p + ||y − 0||_p = ||x − y||_p + ||y||_p, and thus ||x||_p − ||y||_p ≤ ||x − y||_p follows. Analogously, ||y||_p = ||y − 0||_p ≤ ||x − y||_p + ||x − 0||_p, and thus ||y||_p − ||x||_p ≤ ||x − y||_p follows. Hence | ||x||_p − ||y||_p | = max(||x||_p − ||y||_p, ||y||_p − ||x||_p) ≤ ||x − y||_p. □
Lemma 2 Let V ⊆ R^d. Let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Then the following inequality holds:

$$\sum_{i=1}^{|X|} \big|\, \|x_i\|_p - \|y_i\|_p \,\big| \le \sum_{i=1}^{|X|} \|x_i - y_i\|_p$$

Proof: By Lemma 1, | ||x_i||_p − ||y_i||_p | ≤ ||x_i − y_i||_p holds for each i ∈ {1, ..., |X|}, and this yields the claim by summation over i. □
Lemma 3 Let V ⊆ R^d and let X, Y ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Their norm vectors are denoted by V_k(X) and V_k(Y). Let the sequences of the Lp-norm values of the vectors in X and Y in descending order be denoted by (||x_1||_p, ..., ||x_{|X|}||_p) and (||y_1||_p, ..., ||y_{|Y|}||_p). Let π ∈ Π(Y). Then the following inequality holds:

$$\|V_k(X) - V_k(Y)\|_1 \le \sum_{i=1}^{|X|} \big|\, \|x_i\|_p - \|y_{\pi(i)}\|_p \,\big| + \sum_{i=|X|+1}^{|Y|} \|y_{\pi(i)}\|_p$$
Proof: (Sketch) Let V_k(X) = (x_1, ..., x_k)^t and V_k(Y) = (y_1, ..., y_k)^t, where, slightly abusing notation, x_i and y_i here denote the components of the norm vectors, i.e. the Lp-norm values in descending order, padded with zeros. We first show that the following holds:

$$\|V_k(X) - V_k(Y)\|_1 = \sum_{i=1}^{k} |x_i - y_i| \le \sum_{i=1}^{k} |x_i - y_{\pi(i)}| \qquad (*)$$

Every given permutation π can be constructed from adjacent transpositions π_1, ..., π_n, such that π = π_1 ∘ ... ∘ π_n and for each l there is some q, such that π_l(q) = q + 1, π_l(q + 1) = q and π_l(q') = q' for all q' ∉ {q, q + 1}. Given π_l, we show that |x_q − y_{π_l(q)}| + |x_{q+1} − y_{π_l(q+1)}| ≥ |x_q − y_q| + |x_{q+1} − y_{q+1}|. There are in total six cases, because of the descending ordering within the norm vectors:

1. x_q ≥ x_{q+1} ≥ y_{π_l(q+1)} ≥ y_{π_l(q)}
2. x_q ≥ y_{π_l(q+1)} ≥ x_{q+1} ≥ y_{π_l(q)}
3. x_q ≥ y_{π_l(q+1)} ≥ y_{π_l(q)} ≥ x_{q+1}
4. y_{π_l(q+1)} ≥ x_q ≥ x_{q+1} ≥ y_{π_l(q)}
5. y_{π_l(q+1)} ≥ x_q ≥ y_{π_l(q)} ≥ x_{q+1}
6. y_{π_l(q+1)} ≥ y_{π_l(q)} ≥ x_q ≥ x_{q+1}

We exemplarily show the third case; the proofs of the other five cases are very similar. With y_{π_l(q)} = y_{q+1} and y_{π_l(q+1)} = y_q, the third case reads x_q ≥ y_q ≥ y_{q+1} ≥ x_{q+1}, and we obtain:

$$|x_q - y_{\pi_l(q)}| + |x_{q+1} - y_{\pi_l(q+1)}| = (x_q - y_{q+1}) + (y_q - x_{q+1}) = (x_q - y_q) + (y_q - y_{q+1}) + (y_{q+1} - x_{q+1}) + (y_q - y_{q+1})$$
$$= |x_q - y_q| + |x_{q+1} - y_{q+1}| + 2\,|y_q - y_{q+1}| \ge |x_q - y_q| + |x_{q+1} - y_{q+1}|$$

As for each application of a π_l the sum on the right side of proposition (*) will grow or remain equal, the sum will grow or remain equal when applying π. Thus, proposition (*) holds. Then the following holds:

$$\|V_k(X) - V_k(Y)\|_1 \overset{(*)}{\le} \sum_{i=1}^{k} |x_i - y_{\pi(i)}| = \sum_{i=1}^{|X|} \big|\, \|x_i\|_p - \|y_{\pi(i)}\|_p \,\big| + \sum_{i=|X|+1}^{|Y|} \big|\, 0 - \|y_{\pi(i)}\|_p \,\big| + \sum_{i=|Y|+1}^{k} |0 - 0|$$
$$= \sum_{i=1}^{|X|} \big|\, \|x_i\|_p - \|y_{\pi(i)}\|_p \,\big| + \sum_{i=|X|+1}^{|Y|} \|y_{\pi(i)}\|_p \qquad \square$$
A.1 Theorem 1
Proof: Let π ∈ Π(Y) be the permutation of Y that results from the minimum weight perfect matching of X and Y (with X padded by ⊥ to the multiset X̃ of cardinality |Y|), i.e.

$$D^{D,W_\perp}_{mm}(X, Y) = \sum_{i=1}^{|X|} D(x_i, y_{\pi(i)}) + \sum_{i=|X|+1}^{|Y|} D(\perp, y_{\pi(i)})$$

(1) $\sum_{i=1}^{|Y|} \min_{j=1,\ldots,|Y|} D(\tilde{x}_i, y_j) \le \sum_{i=1}^{|X|} D(x_i, y_{\pi(i)}) + \sum_{i=|X|+1}^{|Y|} D(\perp, y_{\pi(i)})$

The inequality holds if it holds for every pair of i-th addends. This is obviously the case, as we always pick the y_j ∈ Y which minimizes D(x̃_i, y_j), while the right-hand side matches x̃_i to y_{π(i)}.

(2) $\sum_{i=1}^{|Y|} \min_{j=1,\ldots,|Y|} D(\tilde{x}_j, y_{\pi(i)}) \le \sum_{i=1}^{|X|} D(x_i, y_{\pi(i)}) + \sum_{i=|X|+1}^{|Y|} D(\perp, y_{\pi(i)})$

Again, the inequality holds if it holds for every pair of i-th addends. This is obviously the case, as we always pick the x̃_j ∈ X̃ which minimizes D(x̃_j, y_{π(i)}) (note that ⊥ ∈ X̃ if |X| < |Y|). Since π is a permutation of Y, the left-hand side of (2) equals the second sum in Definition 4. Hence both sums of the closest pair distance are bounded by D_mm^{D,W_⊥}(X, Y), and so is their maximum. □
A.2 Theorem 4
Proof: Let the sequences of the Lp-norm values of the vectors in X and Y in descending order be denoted by (||x_1||_p, ..., ||x_{|X|}||_p) and (||y_1||_p, ..., ||y_{|Y|}||_p). We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let π ∈ Π(Y) be the permutation of Y that results from the minimum weight perfect matching of X and Y. We combine the results from Lemmas 2 and 3:

$$\|V_k(X) - V_k(Y)\|_1 \overset{\text{Lemma 3}}{\le} \sum_{i=1}^{|X|} \big|\, \|x_i\|_p - \|y_{\pi(i)}\|_p \,\big| + \sum_{i=|X|+1}^{|Y|} \|y_{\pi(i)}\|_p$$
$$\overset{\text{Lemma 2}}{\le} \sum_{i=1}^{|X|} \|x_i - y_{\pi(i)}\|_p + \sum_{i=|X|+1}^{|Y|} \|y_{\pi(i)}\|_p = D^{L_p, W_0}_{mm}(X, Y) \qquad \square$$
A.3 Theorem 5
Proof: According to Theorem 2, D_pcp^{L_p,s}(X̄, Ȳ) ≤ D_pmm^{L_p,s}(X̄, Ȳ) holds. To obtain D_pmm^{L_p,s}(X̄, Ȳ) ≤ D_pmm^{L_p,s}(X, Y) we have to show that

$$\min_{\pi_1 \in \Pi(X),\, \pi_2 \in \Pi(Y)} \sum_{i=1}^{s} \big|\, \|x_{\pi_1(i)}\|_p - \|y_{\pi_2(i)}\|_p \,\big| \le \min_{\pi_1 \in \Pi(X),\, \pi_2 \in \Pi(Y)} \sum_{i=1}^{s} \|x_{\pi_1(i)} - y_{\pi_2(i)}\|_p$$

This follows from Lemma 1: for the permutations minimizing the right-hand side, each addend on the left-hand side is bounded by the corresponding addend on the right-hand side, and the minimum on the left-hand side can only be smaller. □