Efficient Similarity Search On Vector Sets
Abstract: Similarity search in database systems is becoming an increasingly important task in modern application domains such as multimedia, molecular biology, medical imaging, computer aided design and many others. Whereas most of the existing similarity models are based on feature vectors, there exist some models which use very complex object representations such as trees and graphs. A promising middle ground between too simple and too complex object representations in similarity search is the use of sets of feature vectors. In this paper, we first motivate the use of this modeling approach for complete object similarity search as well as for partial object similarity search. After introducing a distance measure between vector sets, suitable for many different application ranges, we present and discuss different filters which are indispensable for efficient query processing. In a broad experimental evaluation based on artificial and real-world test datasets, we show that our approach considerably outperforms both the sequential scan and metric index structures.
1 Introduction
In the last ten years, an increasing number of database applications has emerged for which efficient and effective support for similarity search is substantial. The importance of similarity search grows in application areas such as multimedia, medical imaging, molecular biology, computer aided engineering, marketing, purchasing assistance, and others. As distance functions form the foundation of similarity search, we need an object representation which allows efficient and meaningful distance computations. A common approach is to represent an object by a numerical feature vector. In this case, a feature transformation extracts distinguishable characteristics which are represented by numerical values and grouped together in a feature vector. On the basis of such a feature transformation and under the assumption that similarity corresponds to feature distance, it is possible to define a distance function between the corresponding feature vectors as a similarity measure for two data objects. Thus, searching for data objects similar to a given query object is transformed into proximity search in the feature space. Most applications use the Euclidean metric (L2) to evaluate the feature distance, but there are several other metrics commonly used, e.g. the Manhattan metric (L1) and the maximum metric (L∞). Furthermore, there exist quite a few much more complex similarity models based on graphs [KS03] and trees [KKSS04]. Generally, the more complex and precise these models are, the more exact are the results of a similarity search, but at the same time the computation cost rises as well.

In this paper, we present a distance measure for an approach somewhere in between single feature vectors and complex trees and graphs. We model an object by a set of feature vectors, which is a very suitable object representation for many different application ranges. In order to achieve efficient query processing, we present three different lower-bounding filters and discuss their properties.

The remainder of this paper is organized as follows. In Section 2, we motivate the use of vector set represented objects by presenting various application ranges which benefit from this modeling approach. In Section 3, we introduce the minimal matching distance between vector sets, which is a suitable distance measure for partial and complete similarity search. In Section 4, we sketch the paradigm of multi-step query processing and present appropriate filter techniques for the minimal matching distance on vector sets. In Section 5, we present the results of our experimental evaluation. We conclude this work in Section 6 with a short summary and a few remarks on future work.
2 Application Ranges for Vector Set Data

Using sets of feature vectors is a generalization of the use of just one large feature vector. It is always possible to restrict the model to a feature space in which a data object will be completely represented by just one feature vector. But in some applications the properties of vector set representations allow us to model the dependencies between the extracted features more precisely. As the development of conventional database systems in the recent two decades has shown, the use of more sophisticated ways to model data can enhance both the effectiveness and the efficiency of applications using large amounts of data. Another advantage of using sets of feature vectors is the better storage utilization. It is not necessary to force objects into a common size if they are represented by sets of different cardinality. In the following, we shortly sketch different application ranges which benefit from the use of vector set data.

CAD databases. In [KBK+03] voxelized spatial objects were modeled by sets of feature vectors, where each feature vector represents a 3D rectangular cover which approximates the object as well as possible. The vector set representation is able to avoid the problems that occur when storing a set of covers according to a strict order, i.e. in one high-dimensional feature vector. Thereby, it is possible to compare two objects more intuitively than with the distance calculation in the one-vector model. In a broad experimental evaluation it was shown that the use of sets of feature vectors greatly enhances the quality of the similarity model compared to the use of a single feature vector.

Soccer teams. As another example, let us assume that we want to measure the similarity between two soccer teams. It is beneficial to represent each player by a feature vector and the complete team as a set of feature vectors. A feature vector for one player may
consist of attributes like his age, his salary, the number of goals in the last season, etc. We can compare two players by computing the Euclidean distance between the corresponding feature vectors. This measures the similarity between two players rather well. But what is a suitable distance for comparing two teams? Assume we have a team A consisting of 10 very young players having a low salary and having scored only a few goals in the last season. Furthermore, team A has one highly paid, rather experienced and successful player. On the other hand, we have a team B with 10 rather old, highly paid, successful players and one young low-budget player. If we compare each player of team A to the most similar player in team B and vice versa, this yields that the two teams are very similar. This straightforward approach does not reflect the intuitive notion of similarity. On the other hand, if we compare each player from team A to a different player in team B, trying to minimize the average distance between two matched players, this results in a very accurate similarity measure. For partial similarity, it is advisable not to compare all players from team A to a different player in team B, but only the s most similar players. For low values of s, e.g. s = 2, the two teams A and B are very similar, as each team has an old player with a high salary and a young low-budget player. In this case, the distance between the teams A and B would be very small. For higher values of s, the two teams become more and more dissimilar. Let us note that for s = 11 the two notions of partial and complete similarity coincide. This behavior reflects the intuitive perception of similarity. To sum up, the use of vector sets allows us to adjust the degree of partial similarity in k discrete steps if we represent the objects by vector sets of cardinality k.

Further application areas. There exist a lot of further possible application fields for sets of feature vectors, e.g.:
- stock portfolios, where each stock is represented by the value of one share, the overall number of shares, how many days ago the shares were bought, the risk category, etc.
- shopping carts, where each consumer product corresponds to a feature vector containing the category, the price, the quantity, etc.
- multimedia CDs, where each media file is represented by the publisher, the artist, the title, the file size, the kind of content, etc.

To sum up, sets of feature vectors are a natural way to model a lot of complex real-world objects.
Effective distance functions which allow both complete and partial similarity search, as well as suitable filter techniques for efficient query processing, are indispensable for the general use of the powerful concept of sets of feature vectors.
3 The Minimal Matching Distance on Vector Sets

There are already several distance measures proposed on sets of vectors. In [EM97] the authors survey the following four measures, which are computable in polynomial time: the Hausdorff distance, the sum of minimum distances, the (fair-)surjection distance and the link distance. The Hausdorff distance does not seem to be suitable as a similarity measure, because it relies too much on the extreme positions of the elements of both sets. The last three distance measures are suitable for modeling similarity, but are not metric. This circumstance makes them unattractive, since there are only limited possibilities for processing similarity queries efficiently when using a non-metric distance function. In [EM97], the authors also introduce a method for expanding the distance measures into metrics, but as a side effect the complexity of distance calculation becomes exponential. Furthermore, the possibility to match several elements in one set to just one element in the compared set is questionable in the application areas presented in Section 2.

A distance measure on vector sets that demonstrates to be suitable for defining similarity is based on the minimum weight perfect matching of sets. This well known graph problem can be applied here by building a complete bipartite graph G = (X ∪ Y, E) between the vector sets X and Y. The weight of each edge (x, y) ∈ E, where x ∈ X and y ∈ Y, in this graph G is defined by the distance d(x, y). A perfect matching is a subset M ⊆ E that connects each x ∈ X to exactly one y ∈ Y and vice versa. A minimum weight perfect matching is a matching with a minimum sum of weights of its edges. Contrary to the second example of Section 2, where we considered vector sets of equal cardinality, i.e. soccer teams consisting of 11 players, there are a lot of application ranges where objects are naturally represented by a varying number of vectors. Since a perfect matching can only be found for sets of equal cardinality, we need to introduce suitable weights as a penalty for the unmatched vectors when defining a distance measure between objects of varying cardinality.

Definition 1 (permutation of a set) Let A be any finite set of arbitrary elements. Then a permutation π is a mapping that assigns to each a ∈ A a unique number i ∈ {1, ..., |A|}. This is written as π(A) = (a_1, ..., a_{|A|}). The set of all possible permutations of A is denoted by Π(A).

Definition 2 (minimal matching distance) Let V ⊆ R^d and let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function between two d-dimensional feature vectors. Furthermore, let W: V → R be a weight function for unmatched elements. Then the minimal matching distance D_mm^{D,W}: 2^V × 2^V → R is defined as follows:
$$D^{D,W}_{mm}(X, Y) = \min_{\pi \in \Pi(Y)} \left( \sum_{i=1}^{|X|} D(x_i, y_{\pi(i)}) + \sum_{i=|X|+1}^{|Y|} W(y_{\pi(i)}) \right)$$
The weight function W provides the penalty given to every unassigned element of the set having larger cardinality. Let us note that the minimal matching distance is a specialization of the netflow distance, which is proven to be a metric in [RB01]. The minimal matching distance D_mm^{D,W} is a metric if the distance function D is a metric and the weight function W meets the following conditions:

(1) W(x) > 0 for all x ∈ V
(2) W(x) + W(y) ≥ D(x, y) for all x, y ∈ V

The Kuhn-Munkres algorithm [Kuh55, Mun57] can be used to calculate the minimal matching distance in polynomial time. In a primary initialization step, a distance matrix between the two vector sets containing k d-dimensional vectors is computed. If D is an Lp-distance, this initialization takes O(k²d) time. The method itself is based on the successive augmentation of an alternating path between both sets. Since it is guaranteed that this path can be expanded by one further match within each step taking O(k²) time and there is a maximum of k steps, the overall complexity of a distance calculation is O(k³ + k²d) in the worst case.

The minimal matching distance can be adapted for partial similarity search in vector set represented data. The distance measure defined in the following is based on a partial minimal matching. Given two vector sets X and Y, |X| ≤ |Y|, we only match s ≤ |X| vectors to calculate the distance between X and Y.

Definition 3 (partial minimal matching distance) Let V ⊆ R^d and let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function between two d-dimensional feature vectors. Let s ≤ |X|. Then the partial minimal matching distance D_pmm^{D,s}: 2^V × 2^V → R is defined as follows:
$$D^{D,s}_{pmm}(X, Y) = \min_{\pi_1 \in \Pi(X),\, \pi_2 \in \Pi(Y)} \sum_{i=1}^{s} D(x_{\pi_1(i)}, y_{\pi_2(i)})$$
Unlike the minimal matching distance, the partial variant is not a metric. As the Kuhn-Munkres algorithm produces a partial minimal matching in each step as an intermediate result, we can use it to calculate the partial minimal matching distance D_pmm^{D,s}(X, Y). But we have to take into account all $\binom{|X|}{s}$ combinations of vectors in X to match with vectors in Y. Therefore, the time complexity for a single distance calculation is $O\big(\binom{k}{s} \cdot (s k^2 + k^2 d)\big)$. Thus, a filtering technique to speed up query processing is essential.
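The distance computations in this paper were implemented with the Kuhn-Munkres algorithm; as an illustration only, the following Python sketch computes the complete minimal matching distance with an off-the-shelf assignment solver. The function name minimal_matching_distance and the padding scheme are ours and not taken from the original implementation; the dummy rows encode the penalty W for unmatched elements of the larger set.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def minimal_matching_distance(X, Y, weight, p=2):
    """Minimal matching distance D_mm^{D,W} between two vector sets.

    X, Y: arrays of shape (|X|, d) and (|Y|, d); weight(v) is the penalty W(v)
    for an unmatched vector v of the larger set; D is the L_p distance.
    """
    if len(X) > len(Y):                            # ensure |X| <= |Y|
        X, Y = Y, X
    cost = cdist(X, Y, metric='minkowski', p=p)    # |X| x |Y| distance matrix
    if len(X) < len(Y):
        # one dummy row per missing element of X; matching y_j to a dummy
        # costs the penalty W(y_j)
        penalties = np.array([[weight(y) for y in Y]] * (len(Y) - len(X)))
        cost = np.vstack([cost, penalties])
    row, col = linear_sum_assignment(cost)         # minimum weight perfect matching
    return cost[row, col].sum()

# toy usage with W(v) = ||v||_2, i.e. the distance of v to the origin
X = np.array([[0.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.1, 0.0], [1.0, 0.9], [5.0, 5.0]])
print(minimal_matching_distance(X, Y, weight=np.linalg.norm))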
4 Efficient Query Processing

Complete similarity search on vector set data can be accelerated by using metric index structures, e.g. the M-tree [CPZ97]. For a detailed survey on metric index structures we refer the reader to [CNBYM01]. Another approach is to use the multi-step query processing paradigm which, in contrast to metric index structures, is also suitable for partial similarity search. The main goal of multi-step query processing is to reduce the number of complex and therefore time consuming distance calculations in the query process. In order to guarantee that no false drops occur, the filter distances used have to fulfill a lower-bounding distance criterion. For any two objects o1 and o2, a lower-bounding distance function D_f in the filter step has to return a value that is not greater than the exact object distance D_o of o1 and o2, i.e. D_f(o1, o2) ≤ D_o(o1, o2). With a lower-bounding distance function, it is possible to safely filter out all database objects which have a filter distance greater than the current query range, because the exact similarity distance of those objects cannot be less than the query range. The computation of the minimal matching distance on vector sets is a rather expensive operation. Thus, the employment of selective and efficiently computable filter distance functions for similarity search is very important. In the following, we present three different filter types for query processing on data objects represented by vector sets, namely the closest pair filter, the centroid filter and the norm vector filter.
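As a small illustration (ours, not from the paper), the filter-and-refine principle for an ε-range query can be sketched as follows; filter_dist and exact_dist are placeholder functions, and the only requirement is the lower-bounding property D_f ≤ D_o.

def range_query(db, q, eps, filter_dist, exact_dist):
    """Multi-step epsilon-range query.

    filter_dist must lower-bound exact_dist, i.e.
    filter_dist(q, o) <= exact_dist(q, o) for all objects o,
    which guarantees that no true result is dropped in the filter step.
    """
    results = []
    for obj in db:
        if filter_dist(q, obj) > eps:      # filter step: safe to prune
            continue
        if exact_dist(q, obj) <= eps:      # refinement step
            results.append(obj)
    return results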
4.1 Closest Pair Approach
The closest pair distance between two vector sets X and Y can be used as a filter distance for the minimal matching distance D_mm^{D,W_⊥} and is defined as follows.

Definition 4 (closest pair distance) Let V ⊆ R^d and ⊥ ∈ R^d \ V. Let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function. Let X̃ = {x̃_1, ..., x̃_{|Y|}} be a multiset where x̃_i = x_i for i ∈ {1, ..., |X|} and x̃_i = ⊥ for i ∈ {|X| + 1, ..., |Y|}. Then the closest pair distance D_cp^{D,⊥}: 2^V × 2^V → R is defined as follows:
$$D^{D,\perp}_{cp}(X, Y) = \max\left( \sum_{i=1}^{|Y|} \min_{j=1,\ldots,|Y|} D(\tilde{x}_i, y_j),\;\; \sum_{i=1}^{|Y|} \min_{j=1,\ldots,|Y|} D(\tilde{x}_j, y_i) \right)$$
Let us note that the closest pair filter works directly on the sets of vectors, i.e. on the original data, and not on approximated data. The filter distance can be computed by scanning the matrix of distance values between each pair of vectors in X and Y for the closest pairs. We will now show that the closest pair distance between two vector sets is a lower bound for the minimal matching distance.

Theorem 1 Let V ⊆ R^d and ⊥ ∈ R^d \ V. Let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function. Furthermore, let W_⊥: V → R, W_⊥(v) = D(v, ⊥), be a weight function for unmatched elements. Then the following inequality holds:
$$D^{D,\perp}_{cp}(X, Y) \le D^{D,W_\perp}_{mm}(X, Y)$$
[Figure 1: 2-dimensional examples for the filter distances: (a) closest pair, (b) centroid, (c) norm vector.]
Proof: See Appendix A.1.

A 2-dimensional example for the closest pair filter is depicted in Fig. 1(a), where |X| = |Y| = 3 and a'_3 + b_3 + c_3 = D_cp^{L_2,0}(X, Y) ≤ D_mm^{L_2,W_0}(X, Y) = a_3 + b_3 + c_3. As a'_3 < a_3, x_3 is matched to both y_1 and y_3 during the filter distance calculation, whereas the minimal matching distance is based on one-to-one matchings.

We adapt the closest pair filter to partial similarity search by adding up just the distances of the s closest pairs of vectors. Thus, the partial closest pair distance is defined as follows.

Definition 5 (partial closest pair distance) Let V ⊆ R^d and let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function. Let s ≤ |X|. Then the partial closest pair distance D_pcp^{D,s}: 2^V × 2^V → R is defined as follows:
$$D^{D,s}_{pcp}(X, Y) = \max\left( \min_{\pi \in \Pi(X)} \sum_{i=1}^{s} \min_{j=1,\ldots,|Y|} D(x_{\pi(i)}, y_j),\;\; \min_{\pi \in \Pi(Y)} \sum_{i=1}^{s} \min_{j=1,\ldots,|X|} D(x_j, y_{\pi(i)}) \right)$$
The partial closest pair distance is a lower bound for the partial minimal matching distance.

Theorem 2 Let V ⊆ R^d and let X, Y ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let D: R^d × R^d → R be a distance function. Let s ≤ |X|. Then the following inequality holds:
$$D^{D,s}_{pcp}(X, Y) \le D^{D,s}_{pmm}(X, Y)$$
As the partial closest pair distance can be computed rather efficiently by scanning the matrix of distance values between each pair of vectors in X and Y for the closest pairs and organizing the s closest distances in a heap structure, it is a very beneficial filter for the partial minimal matching distance. The overall runtime complexity is O(k²d) for the complete version and O(k²d log s) for the partial version of the closest pair distance, when an Lp-distance is used between vectors. Although this is more complex than the closest pair approach on norm vectors (cf. Section 4.3), it is a more selective filter that saves more of the very expensive calculations of the exact partial minimal matching distance.
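A possible realization of the closest pair filter and its partial variant is sketched below (our own code, with the origin used as the dummy vector ⊥); it scans the distance matrix for row-wise and column-wise minima and, in the partial case, adds up only the s smallest minima per direction.

import numpy as np
from scipy.spatial.distance import cdist

def closest_pair_distance(X, Y, dummy=None, s=None, p=2):
    """Closest pair filter D_cp (s=None) or partial variant D_pcp (s given)."""
    if len(X) > len(Y):
        X, Y = Y, X
    if s is None and len(X) < len(Y):
        # pad X with the dummy vector (e.g. the origin) up to |Y| elements
        if dummy is None:
            dummy = np.zeros(X.shape[1])
        pad = np.tile(dummy, (len(Y) - len(X), 1))
        X = np.vstack([X, pad])
    dist = cdist(X, Y, metric='minkowski', p=p)
    row_min = dist.min(axis=1)     # closest y for every x (or dummy)
    col_min = dist.min(axis=0)     # closest x (or dummy) for every y
    if s is None:
        return max(row_min.sum(), col_min.sum())
    # partial variant: only the s smallest closest-pair distances per direction
    return max(np.sort(row_min)[:s].sum(), np.sort(col_min)[:s].sum())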
4.2 Centroid Approach
This filter step is based on the relation between a set of feature vectors and its extended centroid [KBK+03].

Definition 6 (extended centroid) Let V ⊆ R^d and ⊥ ∈ R^d \ V. Let X = {x_1, ..., x_{|X|}} ∈ 2^V be a vector set where |X| ≤ k. Then the extended centroid C_{k,⊥}(X) is defined as follows:
$$C_{k,\perp}(X) = \frac{\sum_{i=1}^{|X|} x_i + (k - |X|) \cdot \perp}{k}$$
Note how the vector ⊥ is used as a dummy vector to fill up vector sets with a cardinality of less than k.

Theorem 3 Let V ⊆ R^d and ⊥ ∈ R^d \ V. Let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets where |X|, |Y| ≤ k and let C_{k,⊥}(X), C_{k,⊥}(Y) be their extended centroids. Furthermore, let W_⊥: V → R, W_⊥(v) = ||v − ⊥||_p, be a weight function for unmatched elements. Then the following inequality holds:

$$k \cdot \| C_{k,\perp}(X) - C_{k,\perp}(Y) \|_p \le D^{L_p, W_\perp}_{mm}(X, Y)$$
See [KBK+03] for the proof of this theorem. We have shown that the Lp-distance between the extended centroids multiplied by k is a lower bound for the minimal matching distance under the named preconditions. Therefore, when computing e.g. ε-range queries, we do not need to examine objects whose extended centroids have a distance to the query object q that is larger than ε/k. Often a good choice of ⊥ is 0, since 0 ∉ V holds for a lot of applications. Thus, Conditions (1) and (2) for the metric character of the minimal matching distance D_mm^{L_2,W_0} are satisfied. A 2-dimensional example for the extended centroid filter is depicted in Fig. 1(b), where |X| = |Y| = 2 and 2c_1 = 2 · ||C_{k,0}(X) − C_{k,0}(Y)||_2 ≤ D_mm^{L_2,W_0}(X, Y) = a_1 + b_1.
The centroid approach is not suitable as a filter for the partial minimal matching distance, as the centroid invariably aggregates information of all vectors contained in a vector set.
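The following sketch (ours, with ⊥ = 0 as suggested above) condenses a vector set into its extended centroid and evaluates the centroid filter distance, i.e. the Lp-distance between two extended centroids scaled by k, which by Theorem 3 lower-bounds the minimal matching distance with W_⊥(v) = ||v − ⊥||_p as weight function.

import numpy as np

def extended_centroid(X, k, dummy=None):
    """Extended centroid C_{k,dummy}(X) of a vector set with |X| <= k."""
    if dummy is None:
        dummy = np.zeros(X.shape[1])           # common choice: the origin
    return (X.sum(axis=0) + (k - len(X)) * dummy) / k

def centroid_filter_distance(X, Y, k, p=2):
    """k * ||C_k(X) - C_k(Y)||_p, a lower bound of D_mm^{Lp,W_0}(X, Y)."""
    cx, cy = extended_centroid(X, k), extended_centroid(Y, k)
    return k * np.linalg.norm(cx - cy, ord=p)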
4.3 Norm Vector Approach
Another possible filter for vector set represented data is based on the Lp-norms of all vector elements of a vector set. The idea is as follows: For all vectors x in a vector set X, |X| ≤ k, we compute the Lp-norms ||x||_p and organize these norm values in descending order in a k-dimensional vector. We call this filter the norm vector filter.

Definition 7 (norm vector) Let V ⊆ R^d. Let X ∈ 2^V be a vector set where |X| ≤ k. Let (||x_1||_p, ..., ||x_{|X|}||_p) be the sequence of the Lp-norm values of the vectors in X in descending order, i.e. for all i < j ∈ {1, ..., |X|} holds ||x_i||_p ≥ ||x_j||_p. Then the norm vector V_k(X) = (v_1, ..., v_k)^t ∈ R^k is defined as follows:

$$v_i = \begin{cases} \|x_i\|_p & \text{for } i = 1, \ldots, |X| \\ 0 & \text{for } i = |X|+1, \ldots, k \end{cases}$$
Note that if X has a cardinality smaller than k, dimensions |X| + 1 to k of the norm vector are filled with 0. We employ the Manhattan distance as a distance function between two norm vectors V_k(X) and V_k(Y). This distance measure fulfills the lower-bounding property with respect to the minimal matching distance if the Lp-norm is used as the weight function W.

Theorem 4 Let V ⊆ R^d and let X, Y ∈ 2^V be two vector sets. Their norm vectors are denoted by V_k(X) and V_k(Y). Furthermore, let W_0: V → R, W_0(v) = ||v||_p, be the Lp-norm used as a weight function for the minimal matching distance. Then the following inequality holds:

$$\| V_k(X) - V_k(Y) \|_1 \le D^{L_p, W_0}_{mm}(X, Y)$$

Proof: See Appendix A.2.

A 2-dimensional example for the norm vector filter is depicted in Fig. 1(c), where |X| = |Y| = 2 and a'_2 + b'_2 = ||V_k(X) − V_k(Y)||_1 ≤ D_mm^{L_2,W_0}(X, Y) = a_2 + b_2.

An approach for partial similarity search is to apply a parallel scan through the norm vectors V_k(X) and V_k(Y) and to build a heap structure containing the distances between the closest pairs of norm values found during the parallel scan. Finally, the sum of the s smallest elements of the heap is reported as the distance measure. This can be done very efficiently in O(k log s) time using the algorithm in Fig. 2. The algorithm corresponds to a closest pair approach on the norm values of the feature vectors, which lower-bounds the partial minimal matching distance.
algorithm partialNormVectorFilter(VectorSet X, VectorSet Y, Integer k, Integer s)
begin
  return max(comp(X, Y, k, s), comp(Y, X, k, s));
end;

algorithm comp(VectorSet X, VectorSet Y, Integer k, Integer s)
begin
  (x1, ..., xk) := Vk(X);                   // initialize the norm vectors
  (y1, ..., yk) := Vk(Y);
  j := 1;
  for i in 1..k do                          // parallel scan
    while j < k and |xi - yj| >= |xi - yj+1| do
      j := j + 1;
    end while;
    heap.insert(|xi - yj|);                 // distance of the closest pair for xi
  end for;
  dist := 0;                                // add up the s smallest distances
  for i in 1..s do
    dist := dist + heap.removeMin();
  end for;
  return dist;
end;

Figure 2: The partialNormVectorFilter algorithm.
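For the complete case, the norm vector filter can be sketched in a few lines (our own helper names, complementing the partial variant of Fig. 2); it builds the norm vectors of Definition 7 and returns their Manhattan distance, which by Theorem 4 lower-bounds the minimal matching distance with the Lp-norm as weight function.

import numpy as np

def norm_vector(X, k, p=2):
    """Norm vector V_k(X): L_p-norms of the vectors in X, sorted descending,
    padded with zeros up to dimension k."""
    norms = np.sort(np.linalg.norm(X, ord=p, axis=1))[::-1]
    return np.pad(norms, (0, k - len(norms)))

def norm_vector_filter_distance(X, Y, k, p=2):
    """||V_k(X) - V_k(Y)||_1, a lower bound of D_mm^{Lp,W_0}(X, Y)."""
    return np.abs(norm_vector(X, k, p) - norm_vector(Y, k, p)).sum()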
Theorem 5 Let V ⊆ R^d and let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let s ≤ |X|. Let X̄ = {||x_1||_p, ..., ||x_{|X|}||_p}, Ȳ = {||y_1||_p, ..., ||y_{|Y|}||_p} be multisets containing the Lp-norm values of the vectors in X and Y. Then the following inequality holds:

$$D^{L_p, s}_{pcp}(\bar{X}, \bar{Y}) \le D^{L_p, s}_{pmm}(X, Y)$$

Proof: See Appendix A.3.
4.4 Summary
As the computation of the minimal matching distance is rather time-consuming, we introduced three different filters. The centroid and the norm vector filtering techniques can be profitably combined. The exact distance computation is only performed if the results of both filter distance computations on the centroids and the norm vectors are small enough. This way, a good deal of the information in the vector sets is incorporated in the filter distance computation. Given d-dimensional data, the centroid filter maps each dimension to a single value, resulting in a d-dimensional vector. On the other hand, the norm vector filter maps each vector to a single value, resulting in a k-dimensional vector. Thus, the combined filter contains aggregated information over both the dimensions and the vectors and is therefore suitable for a lot of different data distributions. The time complexity for a combined filter distance evaluation is O(d + k). As the centroid approach is not applicable for partial similarity search, we cannot use the combined filter for this purpose. In contrast to the other two approaches, which derive a single feature vector for approximating a vector set, the closest pair filter works directly on the vector sets. The resulting distance measure lower-bounds the minimal matching distance and can be computed more efficiently than the exact minimal matching distance. The runtime complexities for partial and complete similarity distance calculations based on the three different filters are summed up in Table 1, where we assume vector sets containing k d-dimensional vectors, a partial similarity parameter s ∈ {1, ..., k}, and an Lp-distance between vectors.
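As a small illustration (ours), the combined filter simply takes the maximum of the two lower bounds computed by the centroid and the norm vector filter; it reuses the hypothetical centroid_filter_distance and norm_vector_filter_distance helpers sketched above and is itself a lower bound of the minimal matching distance, at least as tight as either filter alone.

def combined_filter_distance(X, Y, k, p=2):
    """Maximum of the centroid and norm vector lower bounds; still a lower
    bound of D_mm^{Lp,W_0}(X, Y), but at least as tight as each single filter."""
    return max(centroid_filter_distance(X, Y, k, p),
               norm_vector_filter_distance(X, Y, k, p))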
5 Experimental Evaluation
In this section, we present our experimental results. We generated and used two artificial datasets, each containing 100,000 random vector sets. The first dataset consists of vector sets containing 10 2-dimensional vectors each. The other dataset consists of vector sets containing 2 10-dimensional vectors each. The vectors are generated so that all of their components are uniformly distributed in the interval between 0 and 1. All distance measures between vector sets were implemented in Java 1.4 and the experiments were run on a workstation with a Xeon 2.4 GHz processor and 2 GB main memory under Linux.

Furthermore, we used the similarity model presented in [KBK+03], where CAD objects were represented by a vector set consisting of either 3, 5 or 7 vectors in 6D. All experiments were carried out on a dataset containing 5,000 CAD objects from an American aircraft producer. We conducted our experiments on top of the Oracle9i Server, using PL/SQL for the computational main memory based programming. We compared our different filters for vector set represented data to a PL/SQL implementation of the M-tree [CPZ97]. For the M-tree based k-nearest neighbor queries the ranking algorithm of [HS95] was used. The experiments were performed on a Pentium III/700 machine with IDE hard drives. The database block cache was set to 500 disk blocks with a block size of 8 KB and was used exclusively by one active session.

The minimal matching distances between sets of feature vectors were computed using an implementation of the Kuhn-Munkres algorithm. Throughout our experiments we used the Euclidean distance as the distance measure between two single vectors. The range queries were based on a sequential scan. The k-nn queries with exact distance calculations were also based on a sequential scan. For the filtered k-nn queries the filter distances between the query object and all vector sets in the database were calculated and sorted in ascending order. Then the optimal multi-step k-nn search algorithm [SK98] was used. In all tests, we processed 10 different similarity range queries as well as k-nn queries. The presented figures depict the average results from these tests.
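The optimal multi-step k-nn algorithm of [SK98] can be paraphrased as follows; this is a sketch in our own words, not the original code. Candidates are visited in ascending order of their filter distance, and refinement stops as soon as the next filter distance exceeds the current k-th smallest exact distance.

import heapq

def multistep_knn(db, q, k, filter_dist, exact_dist):
    """Optimal multi-step k-nn in the spirit of [SK98]. filter_dist must
    lower-bound exact_dist; objects are refined in ascending order of their
    filter distance until it exceeds the current k-th exact distance."""
    candidates = sorted((filter_dist(q, o), i, o) for i, o in enumerate(db))
    result = []                                    # max-heap via negated distances
    for fd, i, obj in candidates:
        if len(result) == k and fd > -result[0][0]:
            break                                  # no closer object can follow
        d = exact_dist(q, obj)
        heapq.heappush(result, (-d, i, obj))
        if len(result) > k:
            heapq.heappop(result)                  # drop the current worst
    return sorted(((-nd, obj) for nd, _, obj in result), key=lambda t: t[0])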
5.1 Complete Similarity Search
In a first experiment, we carried out range queries on the two artificial datasets. Figure 3 shows rather good results for the norm vector filter, while the centroid filter performs rather badly. The superiority of the norm vector filter is due to the fact that more information is preserved by approximating a vector set by a 10-dimensional norm vector than by the 2-dimensional centroid computed by the centroid approach. As expected, the situation is reversed in Fig. 4, where each vector set contains 2 10-dimensional vectors. In both tests, the closest pair filter has good to optimal selectivity, but due to its computational complexity the overall runtime is rather high, especially for high ε-values.

Using the CAD datasets, we carried out different range queries on vector sets consisting of 5 6-dimensional vectors each. Figure 5 shows that the selectivity of the closest pair filter is almost optimal, i.e. few unnecessary candidates are produced. Nevertheless, the overall runtime of this filter step is very high, as the runtime complexity of the filter step is almost as high as that of the computation of the minimal matching distance itself (cf. Fig. 5). Good results were obtained by using the centroid approach. The good performance of the centroid approach can slightly be increased by using the combined filter, i.e. the combination of the norm vector filter and the centroid filter, which can also be efficiently computed and has a slightly higher selectivity. Note that both the selectivity and the runtime behavior of the M-tree are outperformed by this combined filter for all ε-values.

Figure 6: Complete k-nn queries, CAD dataset, cardinality 7, dimensionality 6 (sequential scan took about 1014 sec. for each k).

Figure 6 shows the average results we obtained for carrying out different k-nn queries on CAD objects represented by vector sets containing 7 vectors. Basically, we made the same observations as for range queries. Although the closest pair filter has a rather good selectivity, it is rather expensive. The best trade-off is achieved by using the combination of the norm vector filter and the centroid filter. All filters have a rather good selectivity and accelerate the query process enormously. For instance, for k-nn queries where k is smaller than 20, the combined filter accelerates the query process on the 6-dimensional vector sets by more than one order of magnitude compared to the sequential scan. Again, the selectivity as well as the runtime behavior of the M-tree is clearly outperformed by this combined filter for all values of k, e.g. for k = 5 the combined filter outperforms the M-tree by an order of magnitude. We made the same observations for the CAD datasets with 3 and 5 vectors per vector set, except that the absolute runtime is higher for the larger vector sets. The average runtime for 7 vectors is about four times the average runtime for 3 vectors.

Figure 8: Partial k-nn queries for s = 3, CAD dataset, cardinality 5, dimensionality 6 (sequential scan took about 2123 sec. for each k).
5.2 Partial Similarity Search
In this section, we tested the closest pair algorithm both on the L2-norm values, i.e. the norm vector filter, and directly on the d-dimensional vectors, i.e. the closest pair filter. Let us note that detecting partial similarity is a very expensive operation. Furthermore, we cannot apply the M-tree, as the distance function is not a metric (cf. Definition 3).

Figure 7 shows the average of 10 range queries for varying ε-values on vector sets of 7 vectors. The partial similarity parameter s was set to 2. Again, the closest pair filter is very selective. As the exact distance function is very expensive, the closest pair filter can be beneficially used for small ε-values. For higher ε-values, the rather high evaluation cost of the closest pair filter becomes noticeable. On the other hand, the norm vector filter can safely be used for all values of ε, as there is no noteworthy overhead. For rather small ε-values, it even outperforms the closest pair filter, although the norm vector filter has a lower selectivity than the closest pair filter. This is because the lower computational cost of the norm vector filter still pays off compared to the slightly higher number of exact distance computations which have to be carried out.

Figure 8 shows the average of 10 k-nn queries for vector sets of 5 vectors, each having a dimensionality of 6, and a partial similarity parameter s = 3. For small values of k, the norm vector filter outperforms the exact distance computation by almost one order of magnitude. For higher values of k, the selectivity of the norm vector filter decreases and thus the overall response time increases. For k = 100, the norm vector filter still accelerates the query process by 100%. As already mentioned, the closest pair filter is rather expensive. Although it has an excellent selectivity, the norm vector filter is better for rather small values of k. For increasing values of k, the closest pair filter outperforms the norm vector filter because of its much better selectivity and the very expensive exact distance calculations.
6 Conclusions
In this paper, we motivated the use of vector set data by pointing out the different application areas of this promising representation technique. We introduced a suitable distance function on vector sets, which reflects the intuitive notion of similarity for the presented application ranges. Furthermore, we presented different filtering techniques with different runtime complexities. Our experimental evaluation and our analytical reasoning showed that the closest pair filter is the most selective filter. As this filter is rather expensive, it only pays off for partial similarity queries, which are extremely expensive themselves. For complete similarity queries, the combination of the norm vector filter and the centroid filter is the method of choice for a lot of different data distributions, as it can be computed efficiently and the information of each vector and each dimension is taken into consideration. The experimental evaluation on real-world datasets demonstrates that the presented filtering techniques accelerate similarity range queries and k-nn queries by up to one order of magnitude compared to metric index structures and the sequential scan. In our future work, we want to show how the paradigm of sets of feature vectors can be applied to effective and efficient data mining tasks, e.g. clustering and classification.
References
[CNBYM01] E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in Metric Spaces. ACM Computing Surveys, 33(3):273-321, 2001.

[CPZ97] P. Ciaccia, M. Patella, and P. Zezula. M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces. In Proc. 23rd Int. Conf. on Very Large Data Bases (VLDB'97), Athens, Greece, pages 426-435, 1997.

[EM97] T. Eiter and H. Mannila. Distance Measures for Point Sets and Their Computation. Acta Informatica, 34(2):103-133, 1997.

[HS95] G. R. Hjaltason and H. Samet. Ranking in Spatial Databases. In Proc. 4th Int. Symposium on Large Spatial Databases (SSD'95), volume 951 of Lecture Notes in Computer Science (LNCS), pages 83-95. Springer, 1995.

[KBK+03] H.-P. Kriegel, S. Brecheisen, P. Kröger, M. Pfeifle, and M. Schubert. Using Sets of Feature Vectors for Similarity Search on Voxelized CAD Objects. In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'03), San Diego, CA, 2003.

[KKSS04] K. Kailing, H.-P. Kriegel, S. Schönauer, and T. Seidl. Efficient Similarity Search for Hierarchical Data in Large Databases. In Proc. 9th Int. Conf. on Extending Database Technology (EDBT'04), Heraklion, Greece, 2004.

[KS03] H.-P. Kriegel and S. Schönauer. Similarity Search in Structured Data. In Proc. 5th Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK'03), Prague, Czech Republic, volume 2737 of Lecture Notes in Computer Science (LNCS), pages 309-319. Springer, 2003.

[Kuh55] H. W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2:83-97, 1955.

[Mun57] J. Munkres. Algorithms for the Assignment and Transportation Problems. Journal of the SIAM, 6:32-38, 1957.

[RB01] J. Ramon and M. Bruynooghe. A Polynomial Time Computable Metric Between Point Sets. Acta Informatica, 37:765-780, 2001.

[SK98] T. Seidl and H.-P. Kriegel. Optimal Multi-Step k-Nearest Neighbor Search. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 154-165, 1998.
Appendix
A Formal Proofs
We will use the following three lemmas to prove Theorems 1, 4 and 5.

Lemma 1 Let x, y ∈ R^d be two d-dimensional feature vectors. Then the difference between the Lp-norms of x and y underestimates the Lp-distance between x and y:

$$\big|\, \|x\|_p - \|y\|_p \,\big| \le \|x - y\|_p$$

Proof: By the triangle inequality, ||x||_p = ||x − 0||_p ≤ ||x − y||_p + ||y − 0||_p = ||x − y||_p + ||y||_p, and thus ||x||_p − ||y||_p ≤ ||x − y||_p follows. Analogously, ||y||_p = ||y − 0||_p ≤ ||x − y||_p + ||x − 0||_p, and thus ||y||_p − ||x||_p ≤ ||x − y||_p follows. Hence | ||x||_p − ||y||_p | = max(||x||_p − ||y||_p, ||y||_p − ||x||_p) ≤ ||x − y||_p. □
Lemma 2 Let V ⊆ R^d. Let X = {x_1, ..., x_{|X|}}, Y = {y_1, ..., y_{|Y|}} ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Then the following inequality holds:

$$\sum_{i=1}^{|X|} \big|\, \|x_i\|_p - \|y_i\|_p \,\big| \le \sum_{i=1}^{|X|} \|x_i - y_i\|_p$$

Proof: By Lemma 1, | ||x_i||_p − ||y_i||_p | ≤ ||x_i − y_i||_p holds for each i ∈ {1, ..., |X|}, and this yields the claim by summation over i. □
Lemma 3 Let V ⊆ R^d and let X, Y ∈ 2^V be two vector sets. We assume w.l.o.g. |X| ≤ |Y| ≤ k. Their norm vectors are denoted by V_k(X) and V_k(Y). Let the sequences of the Lp-norm values of the vectors in X and Y in descending order be denoted by (||x_1||_p, ..., ||x_{|X|}||_p) and (||y_1||_p, ..., ||y_{|Y|}||_p). Let π ∈ Π(Y). Then the following inequality holds:

$$\|V_k(X) - V_k(Y)\|_1 \le \sum_{i=1}^{|X|} \big|\, \|x_i\|_p - \|y_{\pi(i)}\|_p \,\big| + \sum_{i=|X|+1}^{|Y|} \|y_{\pi(i)}\|_p$$
Proof: (Sketch) Let V_k(X) = (x_1, ..., x_k)^t and V_k(Y) = (y_1, ..., y_k)^t, where, slightly abusing notation, x_i and y_i here denote the components of the norm vectors, i.e. the Lp-norm values in descending order, padded with zeros. We first show that the following holds:

$$\|V_k(X) - V_k(Y)\|_1 = \sum_{i=1}^{k} |x_i - y_i| \le \sum_{i=1}^{k} |x_i - y_{\pi(i)}| \qquad (*)$$

Every given permutation π can be constructed from adjacent transpositions π_1, ..., π_n, such that π = π_1 ∘ ... ∘ π_n and for each l there is some q, such that π_l(q) = q + 1, π_l(q + 1) = q and π_l(q') = q' for all q' ∉ {q, q + 1}. Given π_l, we show that |x_q − y_{π_l(q)}| + |x_{q+1} − y_{π_l(q+1)}| ≥ |x_q − y_q| + |x_{q+1} − y_{q+1}|. There are in total six cases, because of the descending ordering within the norm vectors:

1. x_q ≥ x_{q+1} ≥ y_{π_l(q+1)} ≥ y_{π_l(q)}
2. x_q ≥ y_{π_l(q+1)} ≥ x_{q+1} ≥ y_{π_l(q)}
3. x_q ≥ y_{π_l(q+1)} ≥ y_{π_l(q)} ≥ x_{q+1}
4. y_{π_l(q+1)} ≥ x_q ≥ x_{q+1} ≥ y_{π_l(q)}
5. y_{π_l(q+1)} ≥ x_q ≥ y_{π_l(q)} ≥ x_{q+1}
6. y_{π_l(q+1)} ≥ y_{π_l(q)} ≥ x_q ≥ x_{q+1}

We exemplarily show the third case; the proofs of the other five cases are very similar. With y_{π_l(q)} = y_{q+1} and y_{π_l(q+1)} = y_q, the third case reads x_q ≥ y_q ≥ y_{q+1} ≥ x_{q+1}, and we obtain:

$$|x_q - y_{\pi_l(q)}| + |x_{q+1} - y_{\pi_l(q+1)}| = (x_q - y_{q+1}) + (y_q - x_{q+1}) = (x_q - y_q) + (y_q - y_{q+1}) + (y_{q+1} - x_{q+1}) + (y_q - y_{q+1})$$
$$= |x_q - y_q| + |x_{q+1} - y_{q+1}| + 2\,|y_q - y_{q+1}| \ge |x_q - y_q| + |x_{q+1} - y_{q+1}|$$

As for each application of a π_l the sum on the right side of proposition (*) will grow or remain equal, the sum will grow or remain equal when applying π. Thus, proposition (*) holds. Then the following holds:

$$\|V_k(X) - V_k(Y)\|_1 \overset{(*)}{\le} \sum_{i=1}^{k} |x_i - y_{\pi(i)}| = \sum_{i=1}^{|X|} \big|\, \|x_i\|_p - \|y_{\pi(i)}\|_p \,\big| + \sum_{i=|X|+1}^{|Y|} \big|\, 0 - \|y_{\pi(i)}\|_p \,\big| + \sum_{i=|Y|+1}^{k} |0 - 0|$$
$$= \sum_{i=1}^{|X|} \big|\, \|x_i\|_p - \|y_{\pi(i)}\|_p \,\big| + \sum_{i=|X|+1}^{|Y|} \|y_{\pi(i)}\|_p \qquad \square$$
A.1 Theorem 1
Proof: Let π ∈ Π(Y) be the permutation of Y that results from the minimum weight perfect matching of X and Y (with X padded by ⊥ to the multiset X̃ of cardinality |Y|), i.e.

$$D^{D,W_\perp}_{mm}(X, Y) = \sum_{i=1}^{|X|} D(x_i, y_{\pi(i)}) + \sum_{i=|X|+1}^{|Y|} D(\perp, y_{\pi(i)})$$

(1) $\sum_{i=1}^{|Y|} \min_{j=1,\ldots,|Y|} D(\tilde{x}_i, y_j) \le \sum_{i=1}^{|X|} D(x_i, y_{\pi(i)}) + \sum_{i=|X|+1}^{|Y|} D(\perp, y_{\pi(i)})$

The inequality holds if it holds for every pair of i-th addends. This is obviously the case, as we always pick the y_j ∈ Y which minimizes D(x̃_i, y_j), while the right-hand side matches x̃_i to y_{π(i)}.

(2) $\sum_{i=1}^{|Y|} \min_{j=1,\ldots,|Y|} D(\tilde{x}_j, y_{\pi(i)}) \le \sum_{i=1}^{|X|} D(x_i, y_{\pi(i)}) + \sum_{i=|X|+1}^{|Y|} D(\perp, y_{\pi(i)})$

Again, the inequality holds if it holds for every pair of i-th addends. This is obviously the case, as we always pick the x̃_j ∈ X̃ which minimizes D(x̃_j, y_{π(i)}) (note that ⊥ ∈ X̃ if |X| < |Y|). Since π is a permutation of Y, the left-hand side of (2) equals the second sum in Definition 4. Hence both sums of the closest pair distance are bounded by D_mm^{D,W_⊥}(X, Y), and so is their maximum. □
A.2 Theorem 4
Proof: Let the sequences of the Lp-norm values of the vectors in X and Y in descending order be denoted by (||x_1||_p, ..., ||x_{|X|}||_p) and (||y_1||_p, ..., ||y_{|Y|}||_p). We assume w.l.o.g. |X| ≤ |Y| ≤ k. Let π ∈ Π(Y) be the permutation of Y that results from the minimum weight perfect matching of X and Y. We combine the results from Lemmas 2 and 3:

$$\|V_k(X) - V_k(Y)\|_1 \overset{\text{Lemma 3}}{\le} \sum_{i=1}^{|X|} \big|\, \|x_i\|_p - \|y_{\pi(i)}\|_p \,\big| + \sum_{i=|X|+1}^{|Y|} \|y_{\pi(i)}\|_p$$
$$\overset{\text{Lemma 2}}{\le} \sum_{i=1}^{|X|} \|x_i - y_{\pi(i)}\|_p + \sum_{i=|X|+1}^{|Y|} \|y_{\pi(i)}\|_p = D^{L_p, W_0}_{mm}(X, Y) \qquad \square$$
A.3 Theorem 5
Proof: According to Theorem 2, D_pcp^{L_p,s}(X̄, Ȳ) ≤ D_pmm^{L_p,s}(X̄, Ȳ) holds. To obtain D_pmm^{L_p,s}(X̄, Ȳ) ≤ D_pmm^{L_p,s}(X, Y) we have to show that

$$\min_{\pi_1 \in \Pi(X),\, \pi_2 \in \Pi(Y)} \sum_{i=1}^{s} \big|\, \|x_{\pi_1(i)}\|_p - \|y_{\pi_2(i)}\|_p \,\big| \le \min_{\pi_1 \in \Pi(X),\, \pi_2 \in \Pi(Y)} \sum_{i=1}^{s} \|x_{\pi_1(i)} - y_{\pi_2(i)}\|_p$$

This follows from Lemma 1: for the permutations minimizing the right-hand side, each addend on the left-hand side is bounded by the corresponding addend on the right-hand side, and the minimum on the left-hand side can only be smaller. □