
Cluster Analysis

(K-means)
• Presenter: Vikas Dubey
Introduction:

1. Cluster analysis is a group of segmentation techniques designed to
classify respondents into groups or segments called clusters.
2. Cluster analysis, or clustering, is the task of grouping a set of objects
in such a way that objects in the same group (called a cluster) are
more similar (in some sense or another) to each other than to those in
other groups (clusters).
3. Clustering algorithms segment records by minimizing within-cluster
variance and maximizing between-cluster variation.
Objective of Cluster Analysis

• The main objective of cluster analysis is to address the heterogeneity in
a data set. Other objectives of cluster analysis are:
• Taxonomy description – identifying groups within the data
• Data simplification – the ability to analyze groups of similar
observations instead of all individual observations
• Hypothesis generation or testing – developing hypotheses based on the
nature of the data, or testing previously stated hypotheses
• Relationship identification – the simplified structure from cluster
analysis that describes relationships
Introduction of k-means

• An algorithm for partitioning (or clustering) N data points into K
disjoint subsets Sj so as to minimize the sum-of-squares criterion

    J = sum_{j=1}^{K} sum_{x_n in S_j} || x_n - mu_j ||^2

where x_n is a vector representing the n-th data point and mu_j is the
geometric centroid (cluster center) of the data points in Sj.
Introduction of k-means (Continue...)

• Simply speaking, k-means clustering is an algorithm that classifies or
groups objects, based on their attributes/features, into K groups.
• K is a positive integer.
• The grouping is done by minimizing the sum of squared distances
between each data point and its corresponding cluster centroid.
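The criterion being minimized can be written in a few lines of code. The following is a minimal sketch (an assumed implementation, not taken from the slides; the function name is ours):

```python
import numpy as np

# Sum-of-squares criterion J for a given assignment of points to clusters:
# J = sum over clusters j of sum over x_n in S_j of ||x_n - mu_j||^2.
def sum_of_squares(points, labels, centroids):
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    diffs = points - centroids[labels]   # x_n - mu_j for each point's cluster j
    return float((diffs ** 2).sum())

# Two points assigned to a centroid at the origin, one sitting on its centroid:
J = sum_of_squares([[1, 0], [0, 1], [3, 4]], [0, 0, 1], [[0, 0], [3, 4]])
# J = 1 + 1 + 0 = 2.0
```

K-means searches over assignments and centroids to make this quantity as small as possible.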
Basic Data Requirement

• K-means can only be applied to continuous (interval) data.
• It seeks homogeneity of variance within each cluster and heterogeneity
of variance between clusters.
• It assumes that you have selected the appropriate number of clusters and
have included all relevant variables.
• The sample size required for a k-means cluster analysis depends on the
expected number of resulting clusters (each final group should have a
representative base size).
• It is best to include a variety of statements so that it is easier to
identify different groups.
Steps of K-means:
Step 1: Decide the number of clusters, i.e., the value of K.
Step 2: Select initial cluster centres.
Step 3: Find the distance of each data point from all cluster
centres. Assign each data point to the cluster for which it has the minimum
distance.
Step 4: Update the cluster centres by taking the mean of the data points that fall
in each respective cluster.
Step 5: Repeat steps 3 and 4 until no object moves from one cluster
to another.
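The five steps above can be sketched as follows. This is an assumed implementation (not taken from the slides), using plain Euclidean distance:

```python
import numpy as np

def k_means(points, init_centroids, max_iter=100):
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float)  # Steps 1-2: K and initial centres
    labels = None
    for _ in range(max_iter):
        # Step 3: distance of each point from every centre; assign to the nearest.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop once no point moves to a different cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: each centre becomes the mean of the points assigned to it.
        centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, centroids

# The seven individuals from the worked example below, with its initial centres:
data = [[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
        [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]]
labels, centroids = k_means(data, [[1.0, 1.0], [5.0, 7.0]])
# labels -> [0, 0, 1, 1, 1, 1, 1], i.e. clusters {1, 2} and {3, 4, 5, 6, 7}
```

Note that the result can depend on the initial centres; this sketch simply takes them as given.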
A Simple example showing the
implementation of k-means algorithm
(using K=2)
Step 1:
Initialization: We randomly choose the following two centroids (K = 2) for the two
clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).


Individual Variable 1 Variable 2
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
Step 2:

• Now we calculate the distance between each observation and each centroid.
• The formula used is the Euclidean distance, given as

    d(a, b) = sqrt((a1 - b1)^2 + (a2 - b2)^2)

• Some calculations are, for example:

    d(individual 2, m1) = sqrt((1.5 - 1.0)^2 + (2.0 - 1.0)^2) = 1.12
    d(individual 2, m2) = sqrt((1.5 - 5.0)^2 + (2.0 - 7.0)^2) = 6.10
Continue..

Individual   Distance from Centroid 1   Distance from Centroid 2
1            0.00                       7.21
2            1.12                       6.10
3            3.61                       3.61
4            7.21                       0.00
5            4.72                       2.50
6            5.31                       2.06
7            4.30                       2.92

• Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}
(individual 3 is equidistant from both centroids; it is assigned to cluster 1).
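The distance table above can be reproduced with a few lines of Python (a sketch of ours, not from the slides):

```python
import math

# Recomputing the step-2 distance table for the worked example.
data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
        (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
m1, m2 = (1.0, 1.0), (5.0, 7.0)   # initial centroids from step 1

def euclid(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

for i, p in enumerate(data, start=1):
    print(i, round(euclid(p, m1), 2), round(euclid(p, m2), 2))
# Individual 2, for instance, is 1.12 from m1 and 6.1 from m2,
# matching the table; individuals 1-3 are nearer m1, 4-7 nearer m2.
```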
Continue…

The new centroids are the means of the points in each cluster:

m1 = ((1.0 + 1.5 + 3.0)/3, (1.0 + 2.0 + 4.0)/3) = (1.83, 2.33)
m2 = ((5.0 + 3.5 + 4.5 + 3.5)/4, (7.0 + 5.0 + 5.0 + 4.5)/4) = (4.13, 5.38)
Continue..

Step 3:
• Now using these centroids we compute the Euclidean distance of each
object, as shown in the table.

Individual   Distance from Centroid 1   Distance from Centroid 2
1            1.57                       5.38
2            0.47                       4.28
3            2.04                       1.78
4            5.64                       1.84
5            3.15                       0.73
6            3.78                       0.54
7            2.74                       1.08

• Therefore, the new clusters are {1, 2} and {3, 4, 5, 6, 7}.
• The next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).
Continue..

Step 4:
• Computing the distances from these new centroids gives:

Individual   Distance from Centroid 1   Distance from Centroid 2
1            0.56                       5.02
2            0.56                       3.92
3            3.05                       1.42
4            6.66                       2.20
5            4.16                       0.41
6            4.78                       0.61
7            3.75                       0.72

• The clusters obtained are again {1, 2} and {3, 4, 5, 6, 7}.
• Therefore, there is no change in the clusters.
• Thus, the algorithm halts here, and the final result consists of the 2
clusters {1, 2} and {3, 4, 5, 6, 7}.
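The halting condition can be checked numerically. The sketch below (ours, not from the slides) recomputes the centroids of the final clusters and confirms that reassignment changes nothing:

```python
import math

# Verify that the partition {1,2} / {3,4,5,6,7} is stable under one more step.
data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
        (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
cluster1, cluster2 = data[:2], data[2:]

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(points), sum(ys) / len(points))

m1, m2 = centroid(cluster1), centroid(cluster2)   # (1.25, 1.5) and (3.9, 5.1)

def nearest(p):
    d1 = math.hypot(p[0] - m1[0], p[1] - m1[1])
    d2 = math.hypot(p[0] - m2[0], p[1] - m2[1])
    return 1 if d1 <= d2 else 2

assignments = [nearest(p) for p in data]
# assignments -> [1, 1, 2, 2, 2, 2, 2]: no point changes cluster, so k-means halts.
```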
Weaknesses of K-Mean Clustering

1. When the base size is low, the initial grouping will determine
the clusters significantly.
2. The number of clusters, K, must be determined beforehand.
3. We never know the real clusters: with the same data, if the points
are input in a different order, the algorithm may produce different
clusters when the number of data points is small.
Any Questions…..
