Cure: An Efficient Clustering Algorithm for Large Databases. Possamai Lino, 800509. Department of Computer Science, University of Venice. www.possamai.it/lino. Data Mining Lecture, September 13th, 2006
Introduction The main algorithms for clustering use either partitioning or hierarchical agglomerative techniques. They differ in direction: the former starts with one big cluster and, working downward step by step, splits the existing clusters until the desired number of clusters is reached. The second starts with single-point clusters and, working upward step by step, merges clusters until the desired number of clusters is reached. The second approach is used in this work.
Drawbacks of Traditional Clustering Algorithms The result of the clustering process depends on the approach used to represent each cluster. The centroid-based approach (using d_mean) considers only one point as representative of a cluster, the cluster centroid, and therefore cannot work well for non-spherical or arbitrarily shaped clusters. Another approach, all-points (based on d_min), uses all the points inside a cluster for its representation; this choice is extremely sensitive to outliers and to slight changes in the position of data points.
Contribution of CURE, ideas CURE employs a new hierarchical algorithm that adopts a middle ground between the centroid-based and all-points approaches. A constant number c of well-scattered points in each cluster is chosen as representatives; these points capture all the possible shapes the cluster could take. At each step, CURE merges the pair of clusters with the closest pair of representative points. Random sampling and partitioning are used to reduce the input data set.
CURE architecture
Random Sampling When the whole data set is used as input to the algorithm, execution time can be high due to I/O costs. Random sampling is the answer to this problem: it is shown that with only 2.5% of the original data set, the algorithm's results are better than those of traditional algorithms, execution times are lower, and the geometry of the clusters is preserved. To speed up the algorithm's operations, the random sample is made to fit in main memory. The overhead of generating a random sample is very small compared to the time needed to perform the clustering on the sample.
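This sampling step can be sketched in a few lines of Python (the 2.5% default mirrors the fraction quoted above; the function name and seed handling are illustrative, not from the paper):

```python
import random

def draw_sample(points, fraction=0.025, seed=0):
    """Draw a uniform random sample of the data set (2.5% by default,
    the fraction reported as sufficient above); the sample is small
    enough to be clustered entirely in main memory."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    k = max(1, int(len(points) * fraction))
    return rng.sample(points, k)
```

The clustering then runs on `draw_sample(data)` instead of `data`, which is where the I/O and running-time savings come from.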
Partitioning sample When the clusters in the data set become less dense, random sampling with few points becomes useless because it implies poor clustering quality, so the random sample must be enlarged. The authors propose a simple partitioning scheme to speed up the CURE algorithm. The scheme follows these steps: Partition the n data points into p partitions (n/p points each). Partially cluster each partition until the number of clusters in it reduces to n/(p*q), with q > 1. Cluster the partially clustered partitions, starting from the n/q clusters created. The advantage of partitioning the input is the reduced execution time. Each group of n/p points must fit in main memory for the partial clustering to be fast.
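The three steps above can be sketched as follows (a sketch only: the names are illustrative, and `cluster_until(clusters, target)` stands in for the hierarchical clustering routine, assumed to merge clusters until `target` of them remain):

```python
def partitioned_clustering(points, p, q, k, cluster_until):
    """Two-pass clustering over p partitions, following the scheme above.
    cluster_until(clusters, target) merges clusters until `target` remain."""
    n = len(points)
    size = n // p
    # Step 1: partition the n data points into p partitions of n/p each.
    partitions = [points[i * size:(i + 1) * size] for i in range(p)]
    # Step 2: partially cluster each partition down to n/(p*q) clusters.
    partial = []
    for part in partitions:
        partial.extend(cluster_until([[x] for x in part], n // (p * q)))
    # Step 3: cluster the ~n/q partial clusters down to the k final ones.
    return cluster_until(partial, k)
```

With p = 4 and q = 5, for example, a 100-point set is first reduced to 20 partial clusters (5 per partition, each partition small enough for main memory) before the final pass merges them down to k clusters.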
Hierarchical Clustering Algorithm A constant number c of well-scattered points in each cluster is chosen as representatives; these points capture all the possible shapes the cluster could take. The points are shrunk toward the mean of the cluster by a fraction α. If α = 0, the algorithm behaves like the all-points representation; if α = 1, CURE reduces to the centroid-based approach. Outliers are typically farther away from the mean of the cluster, so shrinking dampens their effect.
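The shrinking step follows directly from the definition (a minimal sketch; the cluster mean is passed in explicitly, and the function name is illustrative):

```python
def shrink_representatives(reps, mean, alpha):
    """Move each representative point a fraction alpha of the way toward
    the cluster mean: alpha = 0 leaves the scattered points untouched
    (all-points-like), alpha = 1 collapses them onto the centroid."""
    return [tuple(r[d] + alpha * (mean[d] - r[d]) for d in range(len(mean)))
            for r in reps]
```

Because outliers lie far from the mean, they are pulled in proportionally more than points in the dense core, which is exactly the dampening effect described above.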
Hierarchical Clustering Algorithm At each step, CURE merges the pair of clusters with the closest pair of representative points. As the number of points inside each cluster grows, choosing c new representative points from scratch can become very slow. For this reason, a new procedure is proposed: instead of choosing c new points from among all the points in the merged cluster, we select c points from the 2c scattered points of the two clusters being merged. The new points are still fairly well scattered.
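The two ingredients of this merge step can be sketched as follows, assuming Euclidean distance (the farthest-point rule below is one common way to pick well-scattered points, used here for illustration; the paper seeds the selection with the point farthest from the mean):

```python
import math

def cluster_distance(reps_u, reps_v):
    """Inter-cluster distance: the minimum distance over all pairs of
    representative points; the pair of clusters minimizing this is merged."""
    return min(math.dist(p, q) for p in reps_u for q in reps_v)

def select_scattered(points, c):
    """Greedily pick c well-scattered points: repeatedly take the point
    farthest from those already chosen. After a merge, `points` would be
    the 2c representatives of the two merged clusters, not all points."""
    chosen = [points[0]]
    while len(chosen) < min(c, len(points)):
        farthest = max(points,
                       key=lambda p: min(math.dist(p, q) for q in chosen))
        chosen.append(farthest)
    return chosen
```

Selecting from 2c candidates instead of the whole merged cluster keeps the cost of a merge independent of cluster size.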
Example
Handling Outliers CURE deals with outliers at different moments. Random sampling filters out the majority of outliers. Outliers, due to their larger distance from other points, tend to merge with other points less and typically grow at a much slower rate than actual clusters; thus, the number of points in a collection of outliers is typically much smaller than the number in a cluster. So, first, the clusters that are growing very slowly are identified and eliminated; second, at the end of the growing process, very small clusters are eliminated.
Labeling Data on Disk The process of sampling the initial data set excludes the majority of data points, which must then be assigned to one of the clusters created in the former phases. Each cluster is represented by a fraction of its randomly selected representative points, and each point excluded in the first phase is assigned to the cluster whose representative point is closest. This method differs from BIRCH, which employs only the centroids of the clusters for “partitioning” the remaining points: since the space defined by a single centroid is a sphere, BIRCH's labeling phase has a tendency to split clusters when they have non-spherical shapes or non-uniform sizes.
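The labeling pass then reduces to a nearest-representative search (a sketch, assuming Euclidean distance; `clusters` is a list of representative-point lists, one per cluster):

```python
import math

def label_points(points, clusters):
    """Assign each point left out of the sample to the cluster whose
    closest representative point is nearest, rather than to the nearest
    centroid as in BIRCH's labeling phase."""
    return [min(range(len(clusters)),
                key=lambda i: min(math.dist(p, r) for r in clusters[i]))
            for p in points]
```

Because each cluster is covered by several scattered representatives rather than one centroid, points of a non-spherical cluster are captured by whichever representative is nearest to them.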
Experimental Results In the experimental phase, CURE was compared to other clustering algorithms on the same data sets, and the results were plotted. The algorithms used for comparison are BIRCH and MST (Minimum Spanning Tree, the same as CURE when the shrink factor is 0). Data set 1 is formed by one big circular cluster, two small circular clusters, and two ellipsoids connected by a dense chain of outliers. Data set 2 is used for the execution-time comparison.
Experimental Results Quality of Clustering As we can see from the picture, BIRCH and MST compute a wrong result. BIRCH cannot distinguish between the big and the small clusters, and consequently splits the big one. MST merges the two ellipsoids because it cannot handle the chain of outliers connecting them.
Experimental Results Sensitivity to Parameters Another factor to take into account is α. Changing it yields good or poor clustering quality, as we can see from the picture below.
Experimental Results Execution Time To compare the execution times of the two algorithms, data set 2 was chosen because BIRCH and CURE produce the same results on it. Execution time is plotted against the number of data points: each cluster becomes denser as the points increase, but the geometry remains the same. CURE is more than 50% less expensive because BIRCH scans the entire data set, whereas CURE's sample always counts 2500 units; for CURE we must add only a very small contribution for sampling from a large data set.
Conclusion We have seen that CURE can detect clusters with non-spherical shapes and wide variance in size, using a set of representative points for each cluster. CURE also achieves good execution times on large databases by using random sampling and partitioning. CURE works well when the database contains outliers: these are detected and eliminated.
Index Introduction Drawbacks of Traditional Clustering Algorithms CURE algorithm Contribution of Cure, ideas CURE architecture Random Sampling Partitioning sample Hierarchical Clustering Algorithm Labeling Data on disk Handling Outliers Example Experimental Results
References Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. Cure: An Efficient Clustering Algorithm for Large Databases. Information Systems, Volume 26, Number 1, March 2001.
Ad

More Related Content

What's hot (20)

DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
Pınar Yahşi
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
Edureka!
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
Krish_ver2
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
Krish_ver2
 
Data reduction
Data reductionData reduction
Data reduction
kalavathisugan
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
Pabna University of Science & Technology
 
5.2 mining time series data
5.2 mining time series data5.2 mining time series data
5.2 mining time series data
Krish_ver2
 
2. visualization in data mining
2. visualization in data mining2. visualization in data mining
2. visualization in data mining
Azad public school
 
Feedforward neural network
Feedforward neural networkFeedforward neural network
Feedforward neural network
Sopheaktra YONG
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
International School of Engineering
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
Azad public school
 
Query trees
Query treesQuery trees
Query trees
Shefa Idrees
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
Justin Cletus
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
Krish_ver2
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
Benazir Income Support Program (BISP)
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
VARUN KUMAR
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine Learning
VARUN KUMAR
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
Akash Goel
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Jason Rodrigues
 
Dempster shafer theory
Dempster shafer theoryDempster shafer theory
Dempster shafer theory
Dr. C.V. Suresh Babu
 

Viewers also liked (12)

Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
pcherukumalla
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
DataminingTools Inc
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Venkata Reddy Konasani
 
Clique
Clique Clique
Clique
sk_klms
 
Difference between molap, rolap and holap in ssas
Difference between molap, rolap and holap  in ssasDifference between molap, rolap and holap  in ssas
Difference between molap, rolap and holap in ssas
Umar Ali
 
Database aggregation using metadata
Database aggregation using metadataDatabase aggregation using metadata
Database aggregation using metadata
Dr Sandeep Kumar Poonia
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
Krish_ver2
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
SSA KPI
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
Krish_ver2
 
Application of data mining
Application of data miningApplication of data mining
Application of data mining
SHIVANI SONI
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Jewel Refran
 
OLAP
OLAPOLAP
OLAP
Slideshare
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
pcherukumalla
 
Difference between molap, rolap and holap in ssas
Difference between molap, rolap and holap  in ssasDifference between molap, rolap and holap  in ssas
Difference between molap, rolap and holap in ssas
Umar Ali
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
Krish_ver2
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
SSA KPI
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
Krish_ver2
 
Application of data mining
Application of data miningApplication of data mining
Application of data mining
SHIVANI SONI
 
Ad

Similar to Cure, Clustering Algorithm (20)

Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Pushkar Mishra
 
An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data Fragments
IJMER
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
Rohit 10103543
Rohit 10103543Rohit 10103543
Rohit 10103543
Pulkit Chhabra
 
Data clustering using kernel based
Data clustering using kernel basedData clustering using kernel based
Data clustering using kernel based
IJITCA Journal
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
50120140505013
5012014050501350120140505013
50120140505013
IAEME Publication
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
IJORCS
 
Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS
TECSI FEA USP
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
IOSR Journals
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
Farzad Nozarian
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
A0310112
A0310112A0310112
A0310112
iosrjournals
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
IOSR Journals
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
GandhiMathy6
 
Clustering and Classification Algorithms Ankita Dubey
Clustering and Classification Algorithms Ankita DubeyClustering and Classification Algorithms Ankita Dubey
Clustering and Classification Algorithms Ankita Dubey
Ankita Dubey
 
D0931621
D0931621D0931621
D0931621
IOSR Journals
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
PRAWEEN KUMAR
 
Ijetr021251
Ijetr021251Ijetr021251
Ijetr021251
Engineering Research Publication
 
An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data Fragments
IJMER
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
Extended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithmExtended pso algorithm for improvement problems k means clustering algorithm
Extended pso algorithm for improvement problems k means clustering algorithm
IJMIT JOURNAL
 
Data clustering using kernel based
Data clustering using kernel basedData clustering using kernel based
Data clustering using kernel based
IJITCA Journal
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
A PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering AlgorithmA PSO-Based Subtractive Data Clustering Algorithm
A PSO-Based Subtractive Data Clustering Algorithm
IJORCS
 
Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS Automated Clustering Project - 12th CONTECSI 34th WCARS
Automated Clustering Project - 12th CONTECSI 34th WCARS
TECSI FEA USP
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
IOSR Journals
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
Farzad Nozarian
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
IOSR Journals
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
GandhiMathy6
 
Clustering and Classification Algorithms Ankita Dubey
Clustering and Classification Algorithms Ankita DubeyClustering and Classification Algorithms Ankita Dubey
Clustering and Classification Algorithms Ankita Dubey
Ankita Dubey
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
PRAWEEN KUMAR
 
Ad

More from Lino Possamai (7)

Music Motive @ H-ack
Music Motive @ H-ack Music Motive @ H-ack
Music Motive @ H-ack
Lino Possamai
 
Metodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessiMetodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessi
Lino Possamai
 
Multidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex NetworksMultidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex Networks
Lino Possamai
 
Slashdot.Org
Slashdot.OrgSlashdot.Org
Slashdot.Org
Lino Possamai
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
Lino Possamai
 
A static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming ErrorsA static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming Errors
Lino Possamai
 
On Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic ProgramsOn Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic Programs
Lino Possamai
 
Music Motive @ H-ack
Music Motive @ H-ack Music Motive @ H-ack
Music Motive @ H-ack
Lino Possamai
 
Metodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessiMetodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessi
Lino Possamai
 
Multidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex NetworksMultidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex Networks
Lino Possamai
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
Lino Possamai
 
A static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming ErrorsA static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming Errors
Lino Possamai
 
On Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic ProgramsOn Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic Programs
Lino Possamai
 

Recently uploaded (20)

tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025

Cure, Clustering Algorithm

  • 1. Cure: An Efficient Clustering Algorithm for Large Databases Possamai Lino, 800509 Department of Computer Science University of Venice www.possamai.it/lino Data Mining Lecture - September 13th, 2006
  • 2. Introduction The main clustering algorithms use either partitioning or hierarchical agglomerative techniques. They differ in direction: the former starts with one big cluster and works downward, splitting existing clusters step by step until the desired number of clusters is reached; the latter starts with single-point clusters and works upward, merging clusters step by step until the desired number of clusters is reached. The latter approach is used in this work.
  • 3. Drawbacks of Traditional Clustering Algorithms The result of the clustering process depends on the approach used to represent each cluster. The centroid-based approach (using d_mean) considers only one point as representative of a cluster – the cluster centroid – and cannot work well for non-spherical or arbitrarily shaped clusters. Another approach, all-points (based on d_min), uses all the points inside the cluster for its representation; this choice is extremely sensitive to outliers and to slight changes in the position of data points.
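The two inter-cluster distances above can be sketched in a few lines; this is a minimal illustration for 2-D point clusters, with the function names (`d_mean`, `d_min`) chosen to match the slide's notation:

```python
import math

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def d_mean(c1, c2):
    """Centroid-based distance: distance between the two cluster means."""
    m1 = tuple(sum(x) / len(c1) for x in zip(*c1))
    m2 = tuple(sum(x) / len(c2) for x in zip(*c2))
    return dist(m1, m2)

def d_min(c1, c2):
    """All-points (single-link) distance: closest pair across the clusters."""
    return min(dist(p, q) for p in c1 for q in c2)
```

With `c1 = [(0, 0), (0, 2)]` and `c2 = [(3, 0), (5, 0)]`, `d_min` is driven entirely by the single closest pair, which is exactly why one outlier between two clusters can chain them together.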
  • 4. Contribution of CURE, ideas CURE employs a new hierarchical algorithm that adopts a middle ground between the centroid-based and all-points approaches. A constant number c of well-scattered points in a cluster is chosen as representatives; these points capture the possible shapes the cluster could have. At each step, CURE merges the pair of clusters with the closest pair of representative points. Random sampling and partitioning are used to reduce the input data set.
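One common way to pick well-scattered points is a greedy farthest-first pass; the paper's exact procedure is iterative in this spirit, so the sketch below (function name `scattered_points` is hypothetical) should be read as an illustration of the idea, not the definitive implementation:

```python
import math

def scattered_points(points, c):
    """Greedy farthest-first selection: start from the point farthest
    from the cluster mean, then repeatedly add the point whose distance
    to the already-chosen representatives is largest."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    mean = tuple(sum(x) / len(points) for x in zip(*points))
    reps = [max(points, key=lambda p: dist(p, mean))]
    while len(reps) < min(c, len(points)):
        reps.append(max((p for p in points if p not in reps),
                        key=lambda p: min(dist(p, r) for r in reps)))
    return reps
```

On a line of points `[(0, 0), (1, 0), (2, 0), (10, 0)]` with `c = 2`, the two extremes are chosen, which is the intended "scattered" behavior.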
  • 6. Random Sampling When the whole data set is used as input to the algorithm, execution time can be high due to I/O costs. Random sampling is the answer to this problem. It has been shown that with only 2.5% of the original data set, the algorithm's results are better than those of traditional algorithms, execution times are lower, and the geometry of the clusters is preserved. To speed up the algorithm's operations, the random sample is kept in main memory. The overhead of generating the random sample is very small compared to the time for performing the clustering on the sample.
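The slide only requires that a uniform random sample be drawn and kept in main memory; one standard way to do that in a single pass over a large on-disk data set is reservoir sampling (a technique chosen here for illustration, not prescribed by the paper):

```python
import random

def reservoir_sample(stream, s, seed=0):
    """One-pass uniform sample of size s: only the s-point reservoir
    (not the full data set) ever has to fit in main memory."""
    rng = random.Random(seed)
    sample = []
    for i, point in enumerate(stream):
        if i < s:
            sample.append(point)           # fill the reservoir first
        else:
            j = rng.randrange(i + 1)       # replace with decreasing probability
            if j < s:
                sample[j] = point
    return sample
```

For example, `reservoir_sample(range(100_000), 2500)` keeps a 2.5% sample, the fraction the slide quotes, while streaming the data once.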
  • 7. Partitioning sample When the clusters in the data set become less dense, a random sample with a limited number of points becomes useless because it implies poor clustering quality, so the random sample must be enlarged. The authors proposed a simple partitioning scheme to speed up the CURE algorithm. The scheme follows these steps: partition the n data points into p partitions (n/p points each); partially cluster each partition until the number of clusters in it reduces to n/(p*q), with q > 1; then cluster the partially clustered partitions, starting from the n/q clusters created. The advantage of partitioning the input is the reduced execution time. Each group of n/p points must fit in main memory to increase the performance of the partial clustering.
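The cluster-count bookkeeping of this scheme is easy to get wrong, so here is a small helper (the name `partition_plan` is ours, and for a clean example we assume p*q divides n) that just computes the sizes the three steps imply:

```python
def partition_plan(n, p, q):
    """Counts for the two-pass partitioning scheme: each of the p
    partitions holds n/p points and is pre-clustered down to n/(p*q)
    clusters, so the final pass starts from p * n/(p*q) = n/q clusters."""
    assert n % (p * q) == 0, "assume p*q divides n for a clean example"
    per_partition = n // p            # points handed to each partition
    after_pass_one = n // (p * q)     # clusters left in each partition
    final_input = p * after_pass_one  # = n // q clusters for the last pass
    return per_partition, after_pass_one, final_input
```

With n = 10000, p = 4, q = 2: each partition clusters 2500 points down to 1250 clusters, and the final pass merges the resulting 5000 clusters.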
  • 8. Hierarchical Clustering Algorithm A constant number c of well-scattered points in a cluster is chosen as representatives; these points capture the possible shapes the cluster could have. The points are then shrunk toward the mean of the cluster by a fraction α. If α = 0, the algorithm's behavior becomes similar to the all-points representation; at the other extreme (α = 1), CURE reduces to the centroid-based approach. Outliers are typically farther away from the mean of the cluster, so the shrinking dampens their effect.
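The shrinking step is a single linear interpolation per representative, r' = r + α(mean − r); a minimal sketch (the function name `shrink` is ours):

```python
def shrink(reps, mean, alpha):
    """Move each representative a fraction alpha of the way toward the
    cluster mean: alpha = 0 keeps the scattered points where they are
    (all-points-like), alpha = 1 collapses them onto the centroid."""
    return [tuple(r_i + alpha * (m_i - r_i) for r_i, m_i in zip(r, mean))
            for r in reps]
```

For instance, shrinking the representatives `(0, 0)` and `(4, 0)` of a cluster with mean `(2, 0)` by α = 0.5 moves them to `(1, 0)` and `(3, 0)`; a far-off outlier representative would be pulled in proportionally farther.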
  • 9. Hierarchical Clustering Algorithm At each step, CURE merges the pair of clusters with the closest pair of representative points. As the number of points inside each cluster increases, the process of choosing c new representative points could become very slow. For this reason, a new procedure is proposed: instead of choosing c new points from among all the points in the merged cluster, the c points are selected from the 2c scattered points of the two clusters being merged. The new points are still fairly well scattered.
  • 11. Handling Outliers CURE deals with outliers at different moments. Random sampling filters out the majority of outliers. Outliers, due to their larger distance from other points, tend to merge with other points less and typically grow at a much slower rate than actual clusters; thus, the number of points in a collection of outliers is typically much smaller than the number in a cluster. So, first, the clusters that are growing very slowly are identified and eliminated; second, at the end of the growing process, very small clusters are eliminated.
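The final elimination step amounts to a size threshold on the surviving clusters; a minimal sketch (the function name `drop_outliers` and the threshold parameter are ours):

```python
def drop_outliers(clusters, min_size):
    """Eliminate very small clusters: a collection of outliers grows
    much more slowly than a real cluster, so its point count stays low."""
    return [c for c in clusters if len(c) >= min_size]
```

Given `[[(0, 0)], [(1, 1), (1, 2), (2, 1)]]` and a threshold of 2, only the three-point cluster survives; the singleton is treated as outlier residue.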
  • 12. Labeling Data on Disk Sampling the initial data set excludes the majority of data points; these points must be assigned to some cluster created in the former phases. Each cluster is represented by a fraction of its randomly selected representative points, and each point excluded in the first phase is assigned to the cluster whose representative point is closest. This method differs from BIRCH, which employs only the centroids of the clusters for "partitioning" the remaining points. Since the space defined by a single centroid is a sphere, BIRCH's labeling phase has a tendency to split clusters when they have non-spherical shapes or non-uniform sizes.
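The labeling rule is a nearest-representative assignment; a minimal 2-D sketch (the function name `label` is ours), where `clusters_reps[i]` holds the representative points of cluster i:

```python
import math

def label(point, clusters_reps):
    """Assign an unsampled point to the cluster whose nearest
    representative is closest (rather than the nearest centroid,
    as in BIRCH's labeling phase)."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    return min(range(len(clusters_reps)),
               key=lambda i: min(dist(point, r) for r in clusters_reps[i]))
```

Because multiple representatives trace the shape of each cluster, a point near the far end of an elongated cluster is still labeled correctly, which a single centroid would get wrong.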
  • 13. Experimental Results In the experimental phase, CURE was compared with other clustering algorithms on the same data sets, and the results were plotted. The algorithms used for comparison are BIRCH and MST (Minimum Spanning Tree, the same as CURE when the shrink factor is 0). Data set 1 consists of one big circular cluster, two small circular clusters, and two ellipsoids connected by a dense chain of outliers. Data set 2 is used for the execution-time comparison.
  • 14. Experimental Results Quality of Clustering As we can see from the picture, BIRCH and MST compute a wrong result. BIRCH cannot distinguish between the big and the small clusters, and consequently splits the big one; MST merges the two ellipsoids because it cannot handle the chain of outliers connecting them.
  • 15. Experimental Results Sensitivity to Parameters Another parameter to take into account is the shrink factor α: as we can see from the picture below, changing it leads to good or poor clustering quality.
  • 16. Experimental Results Execution Time To compare the execution times of the two algorithms, the authors chose data set 2, because BIRCH and CURE produce the same results on it. Execution time is plotted against the number of data points: each cluster becomes denser as the points increase, but the geometry remains the same. CURE is more than 50% less expensive because BIRCH scans the entire data set, whereas CURE's sample always counts 2500 units; for CURE we must only add a very small extra contribution for sampling from a large data set.
  • 17. Conclusion We have seen that CURE can detect clusters with non-spherical shapes and wide variance in size by using a set of representative points for each cluster. CURE also achieves good execution times on large databases by using random sampling and partitioning. CURE works well when the database contains outliers: these are detected and eliminated.
  • 18. Index Introduction Drawbacks of Traditional Clustering Algorithms CURE Algorithm Contribution of CURE, Ideas CURE Architecture Random Sampling Partitioning Sample Hierarchical Clustering Algorithm Labeling Data on Disk Handling Outliers Example Experimental Results
  • 19. References Sudipto Guha, Rajeev Rastogi, Kyuseok Shim: CURE: An Efficient Clustering Algorithm for Large Databases. Information Systems, Volume 26, Number 1, March 2001.