Quality of Clustering: Clustering (K-Means Algorithm)

K-means clustering is an unsupervised machine learning algorithm that groups unlabeled data points into a specified number of clusters (k) based on their similarities. It works by assigning data points to the cluster with the nearest mean and then recalculating the mean for each cluster until the means converge. The algorithm aims to minimize intra-cluster distances and maximize inter-cluster distances. It is commonly used for exploratory data analysis to find hidden patterns or groupings in the data.

Uploaded by Sk Arif Ahmed

Clustering (K-Means Algorithm)

Clustering means finding similarities between data objects on the basis of the characteristics found in the data, and grouping similar objects into clusters. It is an unsupervised learning technique (there is no dependent variable).

Quality of Clustering
A good clustering method produces high-quality clusters with minimum intra-cluster distance (high similarity within a cluster) and maximum inter-cluster distance (low similarity between clusters).
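These two criteria can be checked numerically. Below is a minimal sketch (the helper names and toy clusters are my own, not from the text) that measures intra-cluster tightness and inter-cluster separation with the Manhattan distance, the metric used in the worked example later in this document.

```python
def manhattan(p, q):
    # Manhattan distance between two 2-D points.
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def centroid(points):
    # Coordinate-wise mean of a list of 2-D points.
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def avg_intra_cluster_distance(points):
    # Average distance from each point to its own cluster centroid:
    # smaller means higher similarity within the cluster.
    c = centroid(points)
    return sum(manhattan(p, c) for p in points) / len(points)

def inter_cluster_distance(a, b):
    # Distance between two cluster centroids:
    # larger means lower similarity between the clusters.
    return manhattan(centroid(a), centroid(b))

cluster1 = [(1, 2), (2, 1), (2, 2)]   # a tight cluster
cluster2 = [(8, 9), (9, 8), (9, 9)]   # a second, well-separated cluster

print(avg_intra_cluster_distance(cluster1))        # small: tight cluster
print(inter_cluster_distance(cluster1, cluster2))  # large: well separated
```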

Ways to measure Distance


There are multiple ways to measure distance. The two most popular methods are as follows:

Euclidean distance: sqrt((x2-x1)^2 + (y2-y1)^2)

Manhattan distance: |x2-x1| + |y2-y1|

K-means Clustering
In the k-means clustering algorithm we take a number k as input, called the number of clusters. The value of k is defined by the user. Each data point is assigned to the cluster whose center is closest to it, where closeness is measured with a distance formula such as the Euclidean distance.

Steps to perform k-means clustering

1. Choose the number of clusters k.

2. Compute the centers of these clusters, i.e. the centroids or cluster seeds (the mean of the points in a cluster). We can take any k random objects as the initial centroids, or simply the first k objects in sequence.

3. Determine the distance of each object to the centroids (e.g. using the Euclidean distance).

4. Assign each object to the cluster with the minimum distance.

5. Compute new cluster seeds: recompute the centroids (centers) of these clusters by taking the mean of all points in each cluster formed above.

6. Repeat steps 3, 4 and 5 until the centroids no longer change (i.e. convergence is reached).
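The steps above can be sketched in Python. This is a minimal illustration, not the original author's code; the function and parameter names (`k_means`, `init`, `max_iter`) are my own assumptions, and it uses the Manhattan distance to match the worked example that follows.

```python
# Minimal k-means sketch following steps 1-6 above (illustrative names).

def manhattan(p, q):
    # Manhattan distance between two 2-D points.
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def mean(points):
    # Centroid (coordinate-wise mean) of a list of 2-D points.
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def k_means(data, k, init=None, max_iter=100):
    # Step 2: initial seeds - given explicitly, or the first k objects.
    seeds = list(init) if init else list(data[:k])
    clusters = []
    for _ in range(max_iter):
        # Steps 3-4: assign each point to its nearest seed.
        clusters = [[] for _ in range(k)]
        for p in data:
            i = min(range(k), key=lambda j: manhattan(p, seeds[j]))
            clusters[i].append(p)
        # Step 5: recompute the seeds as the cluster means.
        new_seeds = [mean(c) if c else seeds[i] for i, c in enumerate(clusters)]
        # Step 6: stop when the seeds no longer change.
        if new_seeds == seeds:
            break
        seeds = new_seeds
    return seeds, clusters

# The dataset and initial seeds from the worked example below:
data = [(2, 10), (2, 5), (8, 5), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
seeds, clusters = k_means(data, 3, init=[(2, 10), (5, 8), (1, 2)])
print([tuple(round(v, 2) for v in s) for s in seeds])
```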

Calculation Steps: How k-means clustering works

Dataset: A1(2,10), A2(2,5), A3(8,5), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)

Step 1: We choose k = 3 clusters.

Step 2: The initial cluster centers (means) are (2,10), (5,8) and (1,2), chosen arbitrarily as the points A1, B1 and C1. They are also called cluster seeds.

Step 3: We calculate the distance between each data point and the cluster centers.

For two points (x1,y1) and (x2,y2):

Euclidean distance = sqrt((x2-x1)^2 + (y2-y1)^2)

or Manhattan distance = |x2-x1| + |y2-y1|

In this example we use the Manhattan distance.
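Both distance formulas can be written as tiny functions. A sketch (the function names are mine), checked on the points A2(2,5) and A1(2,10) from the example below:

```python
from math import sqrt

def euclidean(p, q):
    # Straight-line distance between two 2-D points.
    return sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def manhattan(p, q):
    # Sum of absolute coordinate differences.
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

print(euclidean((2, 5), (2, 10)))  # 5.0
print(manhattan((2, 5), (2, 10)))  # 5
```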

1st Row:

Calculate the distance between data point A2 and the centroids A1, B1, C1.
Distance between A2(2,5) and A1(2,10) = |2-2| + |5-10| = 0 + 5 = 5
Distance between A2(2,5) and B1(5,8) = |2-5| + |5-8| = 3 + 3 = 6
Distance between A2(2,5) and C1(1,2) = |2-1| + |5-2| = 1 + 3 = 4
The nearest cluster center to A2 is C1.

2nd Row:

Calculate the distance between data point A3 and the centroids A1, B1, C1.
Distance between A3(8,5) and A1(2,10) = 11
Distance between A3(8,5) and B1(5,8) = 6
Distance between A3(8,5) and C1(1,2) = 10
The nearest cluster center to A3 is B1.

3rd Row:

Calculate the distance between data point B2 and the centroids A1, B1, C1.
Distance between B2(7,5) and A1(2,10) = 10
Distance between B2(7,5) and B1(5,8) = 5
Distance between B2(7,5) and C1(1,2) = 9
The nearest cluster center to B2 is B1.

4th Row:

Calculate the distance between data point B3 and the centroids A1, B1, C1.
Distance between B3(6,4) and A1(2,10) = 10
Distance between B3(6,4) and B1(5,8) = 5
Distance between B3(6,4) and C1(1,2) = 7
The nearest cluster center to B3 is B1.

5th Row:

Calculate the distance between data point C2 and the centroids A1, B1, C1.
Distance between C2(4,9) and A1(2,10) = 3
Distance between C2(4,9) and B1(5,8) = 2
Distance between C2(4,9) and C1(1,2) = 10
The nearest cluster center to C2 is B1.
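The five rows above can be reproduced in one loop. A small sketch (the helper and variable names are mine):

```python
def manhattan(p, q):
    # Manhattan distance between two 2-D points.
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

seeds = {"A1": (2, 10), "B1": (5, 8), "C1": (1, 2)}
points = {"A2": (2, 5), "A3": (8, 5), "B2": (7, 5), "B3": (6, 4), "C2": (4, 9)}

assigned = {}
for name, p in points.items():
    dists = {s: manhattan(p, c) for s, c in seeds.items()}
    assigned[name] = min(dists, key=dists.get)  # nearest cluster center
    print(name, dists, "-> nearest:", assigned[name])
```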

Step 4: Calculate the new cluster seeds (mean values).

The only point near cluster seed A1(2,10) is A1 itself, which was the old mean, so this cluster center remains the same.

The points near cluster seed B1(5,8) are B1(5,8), A3(8,5), B2(7,5), B3(6,4) and C2(4,9).
B1 mean value = ((5+8+7+6+4)/5, (8+5+5+4+9)/5) = (6, 6.2)

The points near cluster seed C1(1,2) are C1(1,2) and A2(2,5).
C1 mean value = ((1+2)/2, (2+5)/2) = (1.5, 3.5)

The updated cluster seeds are: A1(2,10), B1(6,6.2), C1(1.5,3.5)
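Step 4's mean computation can be sketched directly (the helper name is my own):

```python
def mean(points):
    # Coordinate-wise mean of a list of 2-D points.
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

# Points assigned to cluster B1 after the first pass, as listed above:
b1_points = [(5, 8), (8, 5), (7, 5), (6, 4), (4, 9)]
print(mean(b1_points))   # (6.0, 6.2)

c1_points = [(1, 2), (2, 5)]
print(mean(c1_points))   # (1.5, 3.5)
```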

Step 5: Go to the next iteration with the updated cluster seeds A1(2,10), B1(6,6.2), C1(1.5,3.5).

We calculate the distance from each data point to the updated centroids and re-assign the points.

The points nearest to seed A1(2,10) are A1(2,10) and C2(4,9).
A1 mean value = (3, 9.5)

The points nearest to seed B1(6,6.2) are B1(5,8), A3(8,5), B2(7,5) and B3(6,4).
B1 mean value = ((5+8+7+6)/4, (8+5+5+4)/4) = (6.5, 5.5)

The points nearest to seed C1(1.5,3.5) are C1(1,2) and A2(2,5).
C1 mean value = (1.5, 3.5)

The updated cluster seeds are: A1(3,9.5), B1(6.5,5.5), C1(1.5,3.5)
After iteration 2 the cluster seeds are not equal to the iteration 1 seeds, so we go on to iteration 3.

Step 6: Check convergence.

We repeat the assignment and update steps. In iteration 3 the point B1(5,8) moves to the cluster of seed A1, giving the seeds (3.67, 9), (7, 4.67) and (1.5, 3.5); in iteration 4 no point changes cluster, so the seeds no longer change and we stop.
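In code, this stopping test is a comparison of consecutive seed lists; since the means are floats, a small tolerance is safer than exact equality. A hypothetical helper:

```python
def converged(old_seeds, new_seeds, tol=1e-9):
    # Converged when every seed moved less than tol in both coordinates.
    return all(abs(a - c) < tol and abs(b - d) < tol
               for (a, b), (c, d) in zip(old_seeds, new_seeds))

print(converged([(2.0, 10.0)], [(2.0, 10.0)]))  # True: seed unchanged
print(converged([(2.0, 10.0)], [(3.0, 9.5)]))   # False: seed moved
```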

Limitations of k-means clustering

1. The number of clusters k must be chosen before running k-means.
2. It is sensitive to outliers and noise, because the mean is used as the cluster center.
3. When there are not many data points, the initial grouping (choice of seeds) strongly influences the final clusters.
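Limitation 2 can be seen with a tiny example (the numbers are illustrative): a single outlier drags a cluster mean far away from the bulk of its points, which can distort later assignments.

```python
def mean(pts):
    # Coordinate-wise mean of a list of 2-D points.
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

points = [(1, 1), (2, 1), (1, 2), (2, 2)]   # a tight cluster
outlier = (50, 50)                          # one far-away noise point

print(mean(points))              # (1.5, 1.5)
print(mean(points + [outlier]))  # (11.2, 11.2) - pulled far from the cluster
```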
