Cure: An Efficient Clustering Algorithm for Large Databases. Possamai Lino, 800509. Department of Computer Science, University of Venice. www.possamai.it/lino. Data Mining Lecture, September 13th, 2006
Introduction The main algorithms for clustering use either partitioning or hierarchical agglomerative techniques. They differ in direction: the former starts with one big cluster and, working downward step by step, splits the existing clusters until the desired number of clusters is reached. The second starts with single-point clusters and, working upward step by step, merges clusters until the desired number of clusters is reached. The second approach is used in this work.
Drawbacks of Traditional Clustering Algorithms The result of the clustering process depends on the approach used to represent each cluster. The centroid-based approach (using d_mean) considers only one point as representative of a cluster, the cluster centroid, and therefore cannot work well for non-spherical or arbitrarily shaped clusters. Another approach, all-points (based on d_min), uses all the points inside a cluster for its representation; this choice is extremely sensitive to outliers and to slight changes in the position of data points.
Contribution of CURE, ideas CURE employs a new hierarchical algorithm that adopts a middle ground between the centroid-based and all-points approaches. A constant number c of well-scattered points in each cluster is chosen as representatives; these points capture all the possible shapes the cluster could take. At each step, CURE merges the pair of clusters with the closest pair of representative points. Random sampling and partitioning are used to reduce the input data set.
CURE architecture
Random Sampling When the whole data set is used as input to the algorithm, execution time can be high due to I/O costs. Random sampling is the answer to this problem: it is shown that with only 2.5% of the original data set, the algorithm's results are better than those of traditional algorithms, execution times are lower, and the geometry of the clusters is preserved. To speed up the algorithm's operations, the random sample is made to fit in main memory. The overhead of generating a random sample is very small compared to the time needed to perform the clustering on the sample.
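This sampling step can be sketched in a few lines of Python (the 2.5% default mirrors the fraction quoted above; the function name and seed handling are illustrative, not from the paper):

```python
import random

def draw_sample(points, fraction=0.025, seed=0):
    """Draw a uniform random sample of the data set (2.5% by default,
    the fraction reported as sufficient above); the sample is small
    enough to be clustered entirely in main memory."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    k = max(1, int(len(points) * fraction))
    return rng.sample(points, k)
```

The clustering then runs on `draw_sample(data)` instead of `data`, which is where the I/O and running-time savings come from.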
Partitioning sample When the clusters in the data set become less dense, random sampling with few points becomes useless because it implies poor clustering quality, so the random sample must be enlarged. The authors propose a simple partitioning scheme to speed up the CURE algorithm. The scheme follows these steps: Partition the n data points into p partitions (n/p points each). Partially cluster each partition until the number of clusters in it reduces to n/(p*q), with q > 1. Cluster the partially clustered partitions, starting from the n/q clusters created. The advantage of partitioning the input is the reduced execution time. Each group of n/p points must fit in main memory for the partial clustering to be fast.
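The three steps above can be sketched as follows (a sketch only: the names are illustrative, and `cluster_until(clusters, target)` stands in for the hierarchical clustering routine, assumed to merge clusters until `target` of them remain):

```python
def partitioned_clustering(points, p, q, k, cluster_until):
    """Two-pass clustering over p partitions, following the scheme above.
    cluster_until(clusters, target) merges clusters until `target` remain."""
    n = len(points)
    size = n // p
    # Step 1: partition the n data points into p partitions of n/p each.
    partitions = [points[i * size:(i + 1) * size] for i in range(p)]
    # Step 2: partially cluster each partition down to n/(p*q) clusters.
    partial = []
    for part in partitions:
        partial.extend(cluster_until([[x] for x in part], n // (p * q)))
    # Step 3: cluster the ~n/q partial clusters down to the k final ones.
    return cluster_until(partial, k)
```

With p = 4 and q = 5, for example, a 100-point set is first reduced to 20 partial clusters (5 per partition, each partition small enough for main memory) before the final pass merges them down to k clusters.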
Hierarchical Clustering Algorithm A constant number c of well-scattered points in each cluster is chosen as representatives; these points capture all the possible shapes the cluster could take. The points are shrunk toward the mean of the cluster by a fraction α. If α = 0, the algorithm behaves like the all-points representation; if α = 1, CURE reduces to the centroid-based approach. Outliers are typically farther away from the mean of the cluster, so shrinking dampens their effect.
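The shrinking step follows directly from the definition (a minimal sketch; the cluster mean is passed in explicitly, and the function name is illustrative):

```python
def shrink_representatives(reps, mean, alpha):
    """Move each representative point a fraction alpha of the way toward
    the cluster mean: alpha = 0 leaves the scattered points untouched
    (all-points-like), alpha = 1 collapses them onto the centroid."""
    return [tuple(r[d] + alpha * (mean[d] - r[d]) for d in range(len(mean)))
            for r in reps]
```

Because outliers lie far from the mean, they are pulled in proportionally more than points in the dense core, which is exactly the dampening effect described above.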
Hierarchical Clustering Algorithm At each step, CURE merges the pair of clusters with the closest pair of representative points. As the number of points inside each cluster grows, choosing c new representative points from scratch can become very slow. For this reason, a new procedure is proposed: instead of choosing c new points from among all the points in the merged cluster, we select c points from the 2c scattered points of the two clusters being merged. The new points are still fairly well scattered.
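The two ingredients of this merge step can be sketched as follows, assuming Euclidean distance (the farthest-point rule below is one common way to pick well-scattered points, used here for illustration; the paper seeds the selection with the point farthest from the mean):

```python
import math

def cluster_distance(reps_u, reps_v):
    """Inter-cluster distance: the minimum distance over all pairs of
    representative points; the pair of clusters minimizing this is merged."""
    return min(math.dist(p, q) for p in reps_u for q in reps_v)

def select_scattered(points, c):
    """Greedily pick c well-scattered points: repeatedly take the point
    farthest from those already chosen. After a merge, `points` would be
    the 2c representatives of the two merged clusters, not all points."""
    chosen = [points[0]]
    while len(chosen) < min(c, len(points)):
        farthest = max(points,
                       key=lambda p: min(math.dist(p, q) for q in chosen))
        chosen.append(farthest)
    return chosen
```

Selecting from 2c candidates instead of the whole merged cluster keeps the cost of a merge independent of cluster size.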
Example
Handling Outliers CURE deals with outliers at different moments. Random sampling filters out the majority of outliers. Outliers, due to their larger distance from other points, tend to merge with other points less and typically grow at a much slower rate than actual clusters; thus, the number of points in a collection of outliers is typically much smaller than the number in a cluster. So, first, the clusters that are growing very slowly are identified and eliminated; second, at the end of the growing process, very small clusters are eliminated.
Labeling Data on Disk The process of sampling the initial data set excludes the majority of data points, which must then be assigned to one of the clusters created in the former phases. Each cluster is represented by a fraction of its randomly selected representative points, and each point excluded in the first phase is assigned to the cluster whose representative point is closest. This method differs from BIRCH, which employs only the centroids of the clusters for “partitioning” the remaining points: since the space defined by a single centroid is a sphere, BIRCH's labeling phase has a tendency to split clusters when they have non-spherical shapes or non-uniform sizes.
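The labeling pass then reduces to a nearest-representative search (a sketch, assuming Euclidean distance; `clusters` is a list of representative-point lists, one per cluster):

```python
import math

def label_points(points, clusters):
    """Assign each point left out of the sample to the cluster whose
    closest representative point is nearest, rather than to the nearest
    centroid as in BIRCH's labeling phase."""
    return [min(range(len(clusters)),
                key=lambda i: min(math.dist(p, r) for r in clusters[i]))
            for p in points]
```

Because each cluster is covered by several scattered representatives rather than one centroid, points of a non-spherical cluster are captured by whichever representative is nearest to them.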
Experimental Results In the experimental phase, CURE was compared to other clustering algorithms on the same data sets, and the results were plotted. The algorithms used for comparison are BIRCH and MST (Minimum Spanning Tree, the same as CURE when the shrink factor is 0). Data set 1 is formed by one big circular cluster, two small circular clusters, and two ellipsoids connected by a dense chain of outliers. Data set 2 is used for the execution-time comparison.
Experimental Results Quality of Clustering As we can see from the picture, BIRCH and MST compute a wrong result. BIRCH cannot distinguish between the big and the small clusters, and consequently splits the big one. MST merges the two ellipsoids because it cannot handle the chain of outliers connecting them.
Experimental Results Sensitivity to Parameters Another factor to take into account is α. Changing it yields good or poor clustering quality, as we can see from the picture below.
Experimental Results Execution Time To compare the execution times of the two algorithms, data set 2 was chosen because BIRCH and CURE produce the same results on it. Execution time is plotted against the number of data points: each cluster becomes denser as the points increase, but the geometry remains the same. CURE is more than 50% less expensive because BIRCH scans the entire data set, whereas CURE's sample always counts 2500 units; for CURE we must add only a very small contribution for sampling from a large data set.
Conclusion We have seen that CURE can detect clusters with non-spherical shapes and wide variance in size, using a set of representative points for each cluster. CURE also achieves good execution times on large databases by using random sampling and partitioning. CURE works well when the database contains outliers: these are detected and eliminated.
Index Introduction Drawbacks of Traditional Clustering Algorithms CURE algorithm Contribution of Cure, ideas CURE architecture Random Sampling Partitioning sample Hierarchical Clustering Algorithm Labeling Data on disk Handling Outliers Example Experimental Results
References Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. Cure: An Efficient Clustering Algorithm for Large Databases. Information Systems, Volume 26, Number 1, March 2001.
Ad

More Related Content

What's hot (20)

DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
Pınar Yahşi
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
Edureka!
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
Krish_ver2
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
Krish_ver2
 
Data reduction
Data reductionData reduction
Data reduction
kalavathisugan
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
Pabna University of Science & Technology
 
5.2 mining time series data
5.2 mining time series data5.2 mining time series data
5.2 mining time series data
Krish_ver2
 
2. visualization in data mining
2. visualization in data mining2. visualization in data mining
2. visualization in data mining
Azad public school
 
Feedforward neural network
Feedforward neural networkFeedforward neural network
Feedforward neural network
Sopheaktra YONG
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
International School of Engineering
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
Azad public school
 
Query trees
Query treesQuery trees
Query trees
Shefa Idrees
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
Justin Cletus
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
Krish_ver2
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
Benazir Income Support Program (BISP)
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
VARUN KUMAR
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine Learning
VARUN KUMAR
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
Akash Goel
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Jason Rodrigues
 
Dempster shafer theory
Dempster shafer theoryDempster shafer theory
Dempster shafer theory
Dr. C.V. Suresh Babu
 

Viewers also liked (12)

Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
pcherukumalla
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
DataminingTools Inc
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Venkata Reddy Konasani
 
Clique
Clique Clique
Clique
sk_klms
 
Difference between molap, rolap and holap in ssas
Difference between molap, rolap and holap  in ssasDifference between molap, rolap and holap  in ssas
Difference between molap, rolap and holap in ssas
Umar Ali
 
Database aggregation using metadata
Database aggregation using metadataDatabase aggregation using metadata
Database aggregation using metadata
Dr Sandeep Kumar Poonia
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
Krish_ver2
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
SSA KPI
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
Krish_ver2
 
Application of data mining
Application of data miningApplication of data mining
Application of data mining
SHIVANI SONI
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Jewel Refran
 
OLAP
OLAPOLAP
OLAP
Slideshare
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
pcherukumalla
 
Difference between molap, rolap and holap in ssas
Difference between molap, rolap and holap  in ssasDifference between molap, rolap and holap  in ssas
Difference between molap, rolap and holap in ssas
Umar Ali
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
Krish_ver2
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
SSA KPI
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
Krish_ver2
 
Application of data mining
Application of data miningApplication of data mining
Application of data mining
SHIVANI SONI
 
Ad

Similar to Cure, Clustering Algorithm (20)

Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Pushkar Mishra
 
An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data Fragments
IJMER
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
Rohit 10103543
Rohit 10103543Rohit 10103543
Rohit 10103543
Pulkit Chhabra
 
Data clustering using kernel based
Data clustering using kernel basedData clustering using kernel based
Data clustering using kernel based
IJITCA Journal
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
50120140505013
5012014050501350120140505013
50120140505013
IAEME Publication
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
IJORCS
 
Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS
TECSI FEA USP
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
IOSR Journals
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
Farzad Nozarian
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
A0310112
A0310112A0310112
A0310112
iosrjournals
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
IOSR Journals
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
GandhiMathy6
 
Clustering and Classification Algorithms Ankita Dubey
Clustering and Classification Algorithms Ankita DubeyClustering and Classification Algorithms Ankita Dubey
Clustering and Classification Algorithms Ankita Dubey
Ankita Dubey
 
D0931621
D0931621D0931621
D0931621
IOSR Journals
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
PRAWEEN KUMAR
 
Ijetr021251
Ijetr021251Ijetr021251
Ijetr021251
Engineering Research Publication
 
An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data Fragments
IJMER
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
Data clustering using kernel based
Data clustering using kernel basedData clustering using kernel based
Data clustering using kernel based
IJITCA Journal
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
IJORCS
 
Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS
TECSI FEA USP
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
IOSR Journals
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
Farzad Nozarian
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
IOSR Journals
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
GandhiMathy6
 
Clustering and Classification Algorithms Ankita Dubey
Clustering and Classification Algorithms Ankita DubeyClustering and Classification Algorithms Ankita Dubey
Clustering and Classification Algorithms Ankita Dubey
Ankita Dubey
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
PRAWEEN KUMAR
 
Ad

More from Lino Possamai (7)

Music Motive @ H-ack
Music Motive @ H-ack Music Motive @ H-ack
Music Motive @ H-ack
Lino Possamai
 
Metodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessiMetodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessi
Lino Possamai
 
Multidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex NetworksMultidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex Networks
Lino Possamai
 
Slashdot.Org
Slashdot.OrgSlashdot.Org
Slashdot.Org
Lino Possamai
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
Lino Possamai
 
A static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming ErrorsA static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming Errors
Lino Possamai
 
On Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic ProgramsOn Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic Programs
Lino Possamai
 
Music Motive @ H-ack
Music Motive @ H-ack Music Motive @ H-ack
Music Motive @ H-ack
Lino Possamai
 
Metodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessiMetodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessi
Lino Possamai
 
Multidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex NetworksMultidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex Networks
Lino Possamai
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
Lino Possamai
 
A static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming ErrorsA static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming Errors
Lino Possamai
 
On Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic ProgramsOn Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic Programs
Lino Possamai
 

Recently uploaded (20)

tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025

Cure, Clustering Algorithm

  • 1. Cure: An Efficient Clustering Algorithm for Large Databases Possamai Lino, 800509 Department of Computer Science University of Venice www.possamai.it/lino Data Mining Lecture - September 13th, 2006
  • 2. Introduction The main clustering algorithms use either partitioning or hierarchical agglomerative techniques. They differ in direction: the former starts with one big cluster and works downward, splitting existing clusters step by step until the desired number of clusters is reached; the latter starts with single-point clusters and works upward, merging clusters step by step until the desired number of clusters is reached. The latter approach is used in this work.
  • 3. Drawbacks of Traditional Clustering Algorithms The result of the clustering process depends on the approach used to represent each cluster. The centroid-based approach (using d_mean) considers only one point as representative of a cluster – the cluster centroid – and cannot work well for non-spherical or arbitrarily shaped clusters. Another approach, all-points (based on d_min), uses all the points inside the cluster for its representation; this choice is extremely sensitive to outliers and to slight changes in the position of data points.
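The two inter-cluster distances above can be sketched in a few lines; this is a minimal illustration for 2-D point clusters, with the function names (`d_mean`, `d_min`) chosen to match the slide's notation:

```python
import math

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def d_mean(c1, c2):
    """Centroid-based distance: distance between the two cluster means."""
    m1 = tuple(sum(x) / len(c1) for x in zip(*c1))
    m2 = tuple(sum(x) / len(c2) for x in zip(*c2))
    return dist(m1, m2)

def d_min(c1, c2):
    """All-points (single-link) distance: closest pair across the clusters."""
    return min(dist(p, q) for p in c1 for q in c2)
```

With `c1 = [(0, 0), (0, 2)]` and `c2 = [(3, 0), (5, 0)]`, `d_min` is driven entirely by the single closest pair, which is exactly why one outlier between two clusters can chain them together.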
  • 4. Contribution of CURE, ideas CURE employs a new hierarchical algorithm that adopts a middle ground between the centroid-based and all-points approaches. A constant number c of well-scattered points in a cluster is chosen as representatives; these points capture the possible shapes the cluster could have. At each step, CURE merges the pair of clusters with the closest pair of representative points. Random sampling and partitioning are used to reduce the input data set.
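One common way to pick well-scattered points is a greedy farthest-first pass; the paper's exact procedure is iterative in this spirit, so the sketch below (function name `scattered_points` is hypothetical) should be read as an illustration of the idea, not the definitive implementation:

```python
import math

def scattered_points(points, c):
    """Greedy farthest-first selection: start from the point farthest
    from the cluster mean, then repeatedly add the point whose distance
    to the already-chosen representatives is largest."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    mean = tuple(sum(x) / len(points) for x in zip(*points))
    reps = [max(points, key=lambda p: dist(p, mean))]
    while len(reps) < min(c, len(points)):
        reps.append(max((p for p in points if p not in reps),
                        key=lambda p: min(dist(p, r) for r in reps)))
    return reps
```

On a line of points `[(0, 0), (1, 0), (2, 0), (10, 0)]` with `c = 2`, the two extremes are chosen, which is the intended "scattered" behavior.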
  • 6. Random Sampling When the whole data set is used as input to the algorithm, execution time can be high due to I/O costs. Random sampling is the answer to this problem. It has been shown that with only 2.5% of the original data set, the algorithm's results are better than those of traditional algorithms, execution times are lower, and the geometry of the clusters is preserved. To speed up the algorithm's operations, the random sample is kept in main memory. The overhead of generating the random sample is very small compared to the time for performing the clustering on the sample.
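The slide only requires that a uniform random sample be drawn and kept in main memory; one standard way to do that in a single pass over a large on-disk data set is reservoir sampling (a technique chosen here for illustration, not prescribed by the paper):

```python
import random

def reservoir_sample(stream, s, seed=0):
    """One-pass uniform sample of size s: only the s-point reservoir
    (not the full data set) ever has to fit in main memory."""
    rng = random.Random(seed)
    sample = []
    for i, point in enumerate(stream):
        if i < s:
            sample.append(point)           # fill the reservoir first
        else:
            j = rng.randrange(i + 1)       # replace with decreasing probability
            if j < s:
                sample[j] = point
    return sample
```

For example, `reservoir_sample(range(100_000), 2500)` keeps a 2.5% sample, the fraction the slide quotes, while streaming the data once.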
  • 7. Partitioning sample When the clusters in the data set become less dense, a random sample with a limited number of points becomes useless because it implies poor clustering quality, so the random sample must be enlarged. The authors proposed a simple partitioning scheme to speed up the CURE algorithm. The scheme follows these steps: partition the n data points into p partitions (n/p points each); partially cluster each partition until the number of clusters in it reduces to n/(p*q), with q > 1; then cluster the partially clustered partitions, starting from the n/q clusters created. The advantage of partitioning the input is the reduced execution time. Each group of n/p points must fit in main memory to increase the performance of the partial clustering.
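The cluster-count bookkeeping of this scheme is easy to get wrong, so here is a small helper (the name `partition_plan` is ours, and for a clean example we assume p*q divides n) that just computes the sizes the three steps imply:

```python
def partition_plan(n, p, q):
    """Counts for the two-pass partitioning scheme: each of the p
    partitions holds n/p points and is pre-clustered down to n/(p*q)
    clusters, so the final pass starts from p * n/(p*q) = n/q clusters."""
    assert n % (p * q) == 0, "assume p*q divides n for a clean example"
    per_partition = n // p            # points handed to each partition
    after_pass_one = n // (p * q)     # clusters left in each partition
    final_input = p * after_pass_one  # = n // q clusters for the last pass
    return per_partition, after_pass_one, final_input
```

With n = 10000, p = 4, q = 2: each partition clusters 2500 points down to 1250 clusters, and the final pass merges the resulting 5000 clusters.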
  • 8. Hierarchical Clustering Algorithm A constant number c of well-scattered points in a cluster is chosen as representatives; these points capture the possible shapes the cluster could have. The points are then shrunk toward the mean of the cluster by a fraction α. If α = 0, the algorithm's behavior becomes similar to the all-points representation; at the other extreme (α = 1), CURE reduces to the centroid-based approach. Outliers are typically farther away from the mean of the cluster, so the shrinking dampens their effect.
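The shrinking step is a single linear interpolation per representative, r' = r + α(mean − r); a minimal sketch (the function name `shrink` is ours):

```python
def shrink(reps, mean, alpha):
    """Move each representative a fraction alpha of the way toward the
    cluster mean: alpha = 0 keeps the scattered points where they are
    (all-points-like), alpha = 1 collapses them onto the centroid."""
    return [tuple(r_i + alpha * (m_i - r_i) for r_i, m_i in zip(r, mean))
            for r in reps]
```

For instance, shrinking the representatives `(0, 0)` and `(4, 0)` of a cluster with mean `(2, 0)` by α = 0.5 moves them to `(1, 0)` and `(3, 0)`; a far-off outlier representative would be pulled in proportionally farther.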
  • 9. Hierarchical Clustering Algorithm At each step, CURE merges the pair of clusters with the closest pair of representative points. As the number of points inside each cluster increases, the process of choosing c new representative points could become very slow. For this reason, a new procedure is proposed: instead of choosing c new points from among all the points in the merged cluster, the c points are selected from the 2c scattered points of the two clusters being merged. The new points are still fairly well scattered.
  • 11. Handling Outliers CURE deals with outliers at different moments. Random sampling filters out the majority of outliers. Outliers, due to their larger distance from other points, tend to merge with other points less and typically grow at a much slower rate than actual clusters; thus, the number of points in a collection of outliers is typically much smaller than the number in a cluster. So, first, the clusters that are growing very slowly are identified and eliminated; second, at the end of the growing process, very small clusters are eliminated.
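The final elimination step amounts to a size threshold on the surviving clusters; a minimal sketch (the function name `drop_outliers` and the threshold parameter are ours):

```python
def drop_outliers(clusters, min_size):
    """Eliminate very small clusters: a collection of outliers grows
    much more slowly than a real cluster, so its point count stays low."""
    return [c for c in clusters if len(c) >= min_size]
```

Given `[[(0, 0)], [(1, 1), (1, 2), (2, 1)]]` and a threshold of 2, only the three-point cluster survives; the singleton is treated as outlier residue.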
  • 12. Labeling Data on Disk Sampling the initial data set excludes the majority of data points; these points must be assigned to some cluster created in the former phases. Each cluster is represented by a fraction of its randomly selected representative points, and each point excluded in the first phase is assigned to the cluster whose representative point is closest. This method differs from BIRCH, which employs only the centroids of the clusters for "partitioning" the remaining points. Since the space defined by a single centroid is a sphere, BIRCH's labeling phase has a tendency to split clusters when they have non-spherical shapes or non-uniform sizes.
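The labeling rule is a nearest-representative assignment; a minimal 2-D sketch (the function name `label` is ours), where `clusters_reps[i]` holds the representative points of cluster i:

```python
import math

def label(point, clusters_reps):
    """Assign an unsampled point to the cluster whose nearest
    representative is closest (rather than the nearest centroid,
    as in BIRCH's labeling phase)."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    return min(range(len(clusters_reps)),
               key=lambda i: min(dist(point, r) for r in clusters_reps[i]))
```

Because multiple representatives trace the shape of each cluster, a point near the far end of an elongated cluster is still labeled correctly, which a single centroid would get wrong.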
  • 13. Experimental Results In the experimental phase, CURE was compared with other clustering algorithms on the same data sets, and the results were plotted. The algorithms used for comparison are BIRCH and MST (Minimum Spanning Tree, the same as CURE when the shrink factor is 0). Data set 1 consists of one big circular cluster, two small circular clusters, and two ellipsoids connected by a dense chain of outliers. Data set 2 is used for the execution-time comparison.
  • 14. Experimental Results Quality of Clustering As we can see from the picture, BIRCH and MST compute a wrong result. BIRCH cannot distinguish between the big and the small clusters, and consequently splits the big one; MST merges the two ellipsoids because it cannot handle the chain of outliers connecting them.
  • 15. Experimental Results Sensitivity to Parameters Another parameter to take into account is the shrink factor α: as we can see from the picture below, changing it leads to good or poor clustering quality.
  • 16. Experimental Results Execution Time To compare the execution times of the two algorithms, the authors chose data set 2, because BIRCH and CURE produce the same results on it. Execution time is plotted against the number of data points: each cluster becomes denser as the points increase, but the geometry remains the same. CURE is more than 50% less expensive because BIRCH scans the entire data set, whereas CURE's sample always counts 2500 units; for CURE we must only add a very small extra contribution for sampling from a large data set.
  • 17. Conclusion We have seen that CURE can detect clusters with non-spherical shapes and wide variance in size by using a set of representative points for each cluster. CURE also achieves good execution times on large databases by using random sampling and partitioning. CURE works well when the database contains outliers: these are detected and eliminated.
  • 18. Index Introduction Drawbacks of Traditional Clustering Algorithms CURE Algorithm Contribution of CURE, Ideas CURE Architecture Random Sampling Partitioning Sample Hierarchical Clustering Algorithm Labeling Data on Disk Handling Outliers Example Experimental Results
  • 19. References Sudipto Guha, Rajeev Rastogi, Kyuseok Shim: CURE: An Efficient Clustering Algorithm for Large Databases. Information Systems, Volume 26, Number 1, March 2001.