Bhaumik-Project - C - Report K Mean Complexity
This report describes clustering algorithms. It mainly compares two clustering algorithms, K Means and Buckshot, along with some variations of these algorithms. Clustering is performed on the results obtained by Vector Space Similarity.
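As a concrete reference for the similarity measure used throughout (written sim(.,.) in the pseudocode that follows), here is a minimal Python sketch of vector space (cosine) similarity between two document vectors. The function name cosine_sim and the use of dense numpy vectors are illustrative assumptions, since the report itself works with the term vectors produced by the retrieval step.

    import numpy as np

    def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
        # Vector space (cosine) similarity between two document vectors.
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0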
K Means Algorithm
In the K Means algorithm, K initial centroids are selected randomly. Then, using vector space similarity, each document is assigned to the cluster with the maximum similarity, and the centroids are recomputed. All the data points are then reassigned to the closest centroids, and this process repeats until there is no change in the clusters between successive iterations. The algorithm is described below.

findClusters(top N docs, k){
    for i = 1 to k{
        do{
            randomly select a doc
        }while(doc already selected)
        cluster[i].centroid = selected doc
    }
    set change = true
    while(change){
        assignClusters()
        computeCentroids()
    }
}

assignClusters(){
    set change = false
    for i = 1 to N{
        set maxSim = 0
        for j = 1 to k{
            if sim(cluster[j].centroid, ith doc) > maxSim{
                closestCluster = j
                maxSim = sim(cluster[j].centroid, ith doc)
            }
        }
        assign ith doc to closestCluster
        if closestCluster != previousCluster then set change = true
    }
}

computeCentroids(){
    for i = 1 to k{
        set cluster[i].centroid = 0
        for each doc in cluster i{
            cluster[i].centroid += doc
        }
        cluster[i].centroid /= number of docs in cluster i
    }
}

Let I, d, k and n be the no. of iterations, the no. of dimensions of a vector (the average no. of terms), the no. of clusters and the no. of documents in the base set respectively. Then the time complexity of assignClusters is O(nkd) and the time complexity of computeCentroids is O(nd). So for I iterations the time complexity becomes O(k + I(nkd + nd)) = O(Idkn). To store the document vectors of the top n documents, space of O(nt) is required, where t is the average no. of terms in a document. To store the centroid vectors, space of O(kt) is required, where k is the no. of centroids. And to store the cluster assignment of each document, space of O(n) is required.
So the total space complexity is O(nt + kt + n) = O(nt). This algorithm is therefore low in cost, but its performance depends on the initial choice of centroids. The results can change dramatically across multiple runs of the K Means algorithm because the initial centroids are picked randomly. Various extensions are applied to K Means to improve its performance.
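To make the pseudocode above concrete, here is a minimal runnable Python sketch of the same loop, assuming documents are rows of a dense numpy matrix and using cosine similarity for assignment. The names find_clusters and cosine_sim are illustrative, not taken from the report's implementation.

    import numpy as np

    def cosine_sim(a, b):
        # vector space (cosine) similarity between two document vectors
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return a @ b / denom if denom else 0.0

    def find_clusters(docs, k, rng=np.random.default_rng()):
        # docs: (n, d) matrix of document term vectors; returns labels and centroids
        n = len(docs)
        # randomly pick k distinct documents as the initial centroids
        centroids = docs[rng.choice(n, size=k, replace=False)].astype(float)
        labels = np.full(n, -1)
        changed = True
        while changed:
            changed = False
            # assignClusters: each doc goes to the most similar centroid
            for i, doc in enumerate(docs):
                closest = max(range(k), key=lambda j: cosine_sim(centroids[j], doc))
                if closest != labels[i]:
                    labels[i] = closest
                    changed = True
            # computeCentroids: each centroid becomes the mean of its member documents
            for j in range(k):
                members = docs[labels == j]
                if len(members):
                    centroids[j] = members.mean(axis=0)
        return labels, centroids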
Buckshot Algorithm
The Buckshot algorithm tries to improve the performance of the K Means algorithm by choosing better initial cluster centroids. For that it uses the Hierarchical Agglomerative Clustering (HAC) algorithm. HAC considers each point as a separate cluster and repeatedly combines the pair of clusters with the maximum similarity, where the similarity between clusters is measured as the group average. When the no. of clusters left equals the no. of required clusters, the algorithm stops, and the centroids of these clusters are taken as the initial centroids for the K Means algorithm. For the Buckshot algorithm, HAC is performed on a sample of √(kn) documents.

findInitialSeeds(top N docs, k){
    create a clusterVector containing √(kn) docs randomly selected from the top n docs
    while(clusterVector.size > k){
        maxSim = 0
        for i = 1 to clusterVector.size{
            for j = i + 1 to clusterVector.size{
                if sim(clusterVector[i], clusterVector[j]) > maxSim{
                    cluster1 = clusterVector[i]
                    cluster2 = clusterVector[j]
                    maxSim = sim(clusterVector[i], clusterVector[j])
                }
            }
        }
        remove cluster1 and cluster2 from clusterVector
        combine the docs of cluster1 and cluster2 into a new cluster
        add the new cluster to clusterVector
    }
    return clusterVector
}

Let n be the number of documents to be clustered, k be the no. of clusters and d be the no. of dimensions. Then the time complexity of HAC is O(√(kn) + (√(kn) − k) · kn · d) = O((kn)^(3/2) · d).
The time complexity of the Buckshot algorithm is O((kn)^(3/2) · d + Idkn). The space complexity of HAC is O(nt + kt + kn), where n is the total no. of documents, t is the average no. of terms per document and k is the no. of clusters. So the space complexity of the Buckshot algorithm is O(nt + kt + n + nt + kt + kn) = O(nt).
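A straightforward (unoptimized) Python sketch of the seed-selection step described above: group-average HAC on a random sample of about √(kn) documents, whose surviving cluster means would then serve as the initial centroids for K Means. The names and the dense-vector representation are assumptions for illustration.

    import math
    import numpy as np

    def cosine_sim(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return a @ b / denom if denom else 0.0

    def find_initial_seeds(docs, k, rng=np.random.default_rng()):
        # group-average HAC on a sqrt(k*n) sample; returns k seed centroids
        n = len(docs)
        sample_size = min(n, int(math.sqrt(k * n)))
        sample = docs[rng.choice(n, size=sample_size, replace=False)]
        # start with one singleton cluster per sampled document
        clusters = [[vec] for vec in sample]
        while len(clusters) > k:
            # find the pair of clusters with the highest group-average similarity
            best, best_sim = None, -1.0
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    sim = np.mean([cosine_sim(a, b)
                                   for a in clusters[i] for b in clusters[j]])
                    if sim > best_sim:
                        best, best_sim = (i, j), sim
            i, j = best
            # merge the two most similar clusters into one
            merged = clusters[i] + clusters[j]
            clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
            clusters.append(merged)
        # the centroid (mean vector) of each remaining cluster seeds K Means
        return np.array([np.mean(c, axis=0) for c in clusters])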
Bisecting K Means Algorithm

        store the resulting clusters in cluster1 and cluster2
        }
    }
    add cluster1 and cluster2 to clusterVector
}

assignClusters(){
    set change = false
    for i = 1 to N{
        set maxSim = 0
        for j = 1 to 2{
            if sim(cluster[j].centroid, ith doc) > maxSim{
                closestCluster = j
                maxSim = sim(cluster[j].centroid, ith doc)
            }
        }
        assign ith doc to closestCluster
        if closestCluster != previousCluster then set change = true
    }
}

computeCentroids(){
    for i = 1 to 2{
        set cluster[i].centroid = 0
        for each doc in cluster i{
            cluster[i].centroid += doc
        }
        cluster[i].centroid /= number of docs in cluster i
    }
}

Let n be the number of documents to be clustered, k be the no. of clusters, i be the no. of iterations made on each run and d be the no. of dimensions. Then the time complexity of assignClusters is O(nd), and the time complexity of computeCentroids is O(nd) if the average cluster size is O(n). So the overall complexity of the bisecting K Means algorithm is O((k − 1) · i · (nd + nd)) = O(nkid). The time complexity of bisecting K Means is therefore linear with respect to the no. of documents. To store the document vectors of the top n documents, space of O(nt) is required, where t is the average no. of terms in a document. To store the centroid vectors, space of O(kt) is required, where k is the no. of centroids. And to store the cluster assignment of each document, space of O(n) is required. So the total space complexity is O(nt + kt + n) = O(nt). This algorithm is lower in cost compared to the Buckshot algorithm but still gives comparable performance.
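A minimal Python sketch of bisecting K Means under the same assumptions as above: start with one cluster and repeatedly split a cluster in two with a 2-means pass until k clusters remain. The report does not state which cluster is bisected at each step, so splitting the largest cluster is an illustrative choice here, and the 2-means pass below runs to convergence rather than for a fixed i iterations.

    import numpy as np

    def cosine_sim(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return a @ b / denom if denom else 0.0

    def bisect(docs, rng):
        # one K Means run with k = 2; returns a 0/1 label per row of docs
        # (assumes docs has at least two rows)
        centroids = docs[rng.choice(len(docs), size=2, replace=False)].astype(float)
        labels = np.full(len(docs), -1)
        changed = True
        while changed:
            changed = False
            for i, doc in enumerate(docs):
                closest = 0 if cosine_sim(centroids[0], doc) >= cosine_sim(centroids[1], doc) else 1
                if closest != labels[i]:
                    labels[i] = closest
                    changed = True
            for j in (0, 1):
                if (labels == j).any():
                    centroids[j] = docs[labels == j].mean(axis=0)
        return labels

    def bisecting_k_means(docs, k, rng=np.random.default_rng()):
        # repeatedly split the largest cluster until k clusters remain
        clusters = [np.arange(len(docs))]          # all documents start in one cluster
        while len(clusters) < k:
            largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
            idx = clusters.pop(largest)
            halves = bisect(docs[idx], rng)        # split it into two with 2-means
            clusters.append(idx[halves == 0])
            clusters.append(idx[halves == 1])
        return clusters                            # list of document-index arrays, one per cluster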
One algorithm performs better than another if it has higher intra-cluster similarity and lower inter-cluster similarity. From the graphs it is clear that the Buckshot algorithm consistently performs better than the other algorithms on both the intra-cluster and the inter-cluster similarity measures. For k = 3 and k = 6, bisecting K Means performs slightly better than the Buckshot algorithm on the intra-cluster similarity measure, but overall the Buckshot algorithm outperforms the other algorithms.
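The evaluation criterion can be made concrete with a short sketch. The report does not spell out the exact formulas behind the graphs, so the following reads intra-cluster similarity as the average pairwise similarity of documents within the same cluster and inter-cluster similarity as the average similarity between the centroids of different clusters; both definitions are assumptions for illustration. Clusters are given as a list of document-index arrays, as in the bisecting sketch above.

    import itertools
    import numpy as np

    def cosine_sim(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return a @ b / denom if denom else 0.0

    def intra_cluster_similarity(docs, clusters):
        # average pairwise similarity between documents in the same cluster
        sims = [cosine_sim(docs[i], docs[j])
                for idx in clusters
                for i, j in itertools.combinations(idx, 2)]
        return float(np.mean(sims)) if sims else 0.0

    def inter_cluster_similarity(docs, clusters):
        # average similarity between the centroids of different clusters
        centroids = [docs[idx].mean(axis=0) for idx in clusters if len(idx)]
        sims = [cosine_sim(a, b) for a, b in itertools.combinations(centroids, 2)]
        return float(np.mean(sims)) if sims else 0.0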
[Figure: parking decal query; similarity (0 to 0.9) plotted against no. of clusters K (3 to 10)]
Inter Cluster Similarity for all algorithms on Parking Decal query

From the 1st graph it is clear that the Buckshot and bisecting K Means algorithms perform almost equally in terms of intra-cluster similarity. As already mentioned, the Buckshot algorithm performs well because it selects better initial centroids. Bisecting K Means performs well compared to K Means because it produces clusters of almost uniform sizes.
Inter cluster similarity for 2 queries for K Means algorithm

From the 1st graph it is clear that as the no. of clusters increases, the intra-cluster similarity increases, and from the 2nd graph it is clear that as the no. of clusters increases, the inter-cluster similarity decreases. This is because as the no. of clusters increases, the cluster size becomes smaller, so the intra-cluster similarity increases and, correspondingly, the inter-cluster similarity decreases.
Given above are the results for the query information retrieval using the K Means algorithm with k = 5. The 1st cluster represents semantic web mining course pages. The 2nd cluster represents pages related to courses. Cluster 3 represents CSE course pages. Cluster 4 represents pages from Prof. Candan's directory. And cluster 5 has pages from the gcss directory. So almost all clusters contain pages forming a single category.

Computer Science

Cluster 1
www.eas.asu.edu%%~wcs%%index.html
www.eas.asu.edu%%~wcs%%events.htm
www.eas.asu.edu%%~wcs%%members.htm

Cluster 2
www.asu.edu%%provost%%smis%%ceas%%bse%%csebse.html
www.asu.edu%%provost%%smis%%ceas%%bs%%csbs.html
www.eas.asu.edu%%~gcss%%people%%nvf%%pubs.html
Cluster 3
www.asu.edu%%lib%%noble%%library%%bestind.htm
www.asu.edu%%provost%%smis%%clas%%bs%%psbs.html
www.asu.edu%%provost%%smis%%clas%%ba%%psba.html

Given above are the results for the computer science query using the Buckshot algorithm with K = 3. Cluster 1 represents pages from the wcs directory. Cluster 2 represents pages under the provost directory. Cluster 3 represents pages from the clas directory. So the Buckshot algorithm also produces clusters that can be named.
Computer Science
Cluster 1
www.asu.edu%%lrc%%computerlab.html
www.asu.edu%%vpsa%%lrc%%computerlab.html

Cluster 2
www.eas.asu.edu%%~csedept%%Students%%Internships%%internships.shtml
www.eas.asu.edu%%~csedept%%Students%%Scholarships%%scholarships.shtml
www.eas.asu.edu%%~csedept%%AcademicPrograms%%AcademicPrograms.shtml

Cluster 3
www.eas.asu.edu%%~gcss%%people%%nvf%%pubs.html
www.asu.edu%%provost%%smis%%ceas%%bs%%csbs.html
www.asu.edu%%provost%%smis%%ceas%%bse%%csebse.html
Given above are the results for the computer science query using the bisecting K Means algorithm with K = 3. Cluster 1 represents pages related to the computer lab. Cluster 2 represents pages related to scholarships. Cluster 3 represents pages under the ceas directory. So the bisecting K Means algorithm also produces clusters that can be named.