
Unsupervised Learning:

Clustering
Classification VS Clustering
▪ Google News Recommendation
▪ Movie Recommendation
▪ YouTube Videos

1-2
Clustering
Hierarchical
K-means

1-3
Classroom example
Hierarchical (Agglomerative)
K-means

Scatterplot for visualization

1-4
Data From the class
Their preferences for movie genres

1-5
Scatter Plot

1-6
Hierarchical (Agglomerative) Clustering
The idea behind hierarchical agglomerative clustering is to start with each cluster
comprising exactly one record and then progressively agglomerate (combine)
the two nearest clusters until just one cluster, consisting of all the records,
is left at the end.

1-7
1-8
Club Arizona and Commonwealth
Club Arizona, Commonwealth, and Central

1-9
1-10
Repeat it for Classroom Data

1-11
Hierarchical methods can be either agglomerative or divisive.

▪ Agglomerative methods begin with n clusters and sequentially merge
similar clusters until a single cluster is obtained.
▪ Divisive methods work in the opposite direction, starting with one cluster
that includes all records.

▪ Preferred when we need to arrange the clusters into a natural hierarchy.

Example:
Within Gujarat, cluster the houses by city; then, within each city, cluster them
by location (posh vs. non-posh area); next, within the posh/non-posh areas, cluster
them by build quality; and so on.

1-12
Non-hierarchical methods, such as k-means, assign records to a prespecified
number of clusters. These methods are generally less computationally intensive
and are therefore preferred with very large datasets.

Is the natural clustering compromised?

1-13
Measuring Distance Between Two
Records

Euclidean Distance
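In standard notation, for two records x_i and x_j with p numeric measurements, the Euclidean distance is:

d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}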

1-14
Normalizing Numerical Measurements?
▪ The scale of each variable strongly influences the distance measure.
▪ Variables with larger scales (e.g., Sales) have a much greater influence.

▪ It is, therefore, customary to normalize continuous measurements before computing
the Euclidean distance.
▪ Normalizing a measurement means subtracting the average and dividing by the
standard deviation (normalized values are also called z-scores).
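A minimal sketch of this normalization in Python, assuming the spreadsheet used later in the slides contains raw columns named "Sales" and "Fuel" (these raw column names are assumptions, not confirmed by the slides):

import pandas as pd

Data = pd.read_excel("Clustering_Distance.xlsx")   # file from the later code slide
# z-score: subtract the column average and divide by the column standard deviation
Data["Norm_Sales"] = (Data["Sales"] - Data["Sales"].mean()) / Data["Sales"].std()
Data["Norm_Fuel"] = (Data["Fuel"] - Data["Fuel"].mean()) / Data["Fuel"].std()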

1-15
Euclidean distance
▪ It is highly scale-dependent. Changing the units of one variable (e.g., from cents to
dollars) can greatly influence the results.
▪ Unequal weighting should be considered if we want the clusters to depend more on
certain measurements and less on others.

▪ It completely ignores the relationship between the measurements. Thus, if the
measurements are, in fact, strongly correlated, a different distance
(such as the statistical distance) is likely to be a better choice.

▪ It is sensitive to outliers. If the data are believed to contain outliers and careful removal
is not an option, using a more robust distance (such as the Manhattan distance) is preferred.

1-16
Additional popular distance metrics
Correlation-based similarity. Sometimes, it is more natural or convenient
to work with a similarity measure between records rather than distance,
which measures dissimilarity.
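One common choice is Pearson's correlation between the measurement vectors of records i and j, where \bar{x}_i and \bar{x}_j denote each record's mean measurement:

r_{ij} = \frac{\sum_{m=1}^{p} (x_{im} - \bar{x}_i)(x_{jm} - \bar{x}_j)}{\sqrt{\sum_{m=1}^{p} (x_{im} - \bar{x}_i)^2} \, \sqrt{\sum_{m=1}^{p} (x_{jm} - \bar{x}_j)^2}}

A distance-like measure can then be derived from the similarity, for example 1 - r_{ij}.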

1-17
Statistical distance (also called Mahalanobis distance):

d(x_i, x_j) = \sqrt{(x_i - x_j)' \, S^{-1} \, (x_i - x_j)}

where x_i and x_j are the p-dimensional vectors of measurement values for
records i and j, respectively, and S is the covariance matrix for these vectors.
(The transpose operation ′ simply turns a column vector into a row vector;
S^{-1} is the inverse matrix of S, which is the p-dimensional extension of division.)

1-18
Manhattan distance (“city block”). This distance looks at the absolute differences
rather than the squared differences, and is defined by

d(x_i, x_j) = \sum_{m=1}^{p} |x_{im} - x_{jm}|

1-19
Distance Measures for Categorical Data
▪ Matching coefficient
▪ Jaccard’s coefficient
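For p binary variables, with a = the number of variables where both records equal 1, d = the number where both equal 0, and b, c = the numbers of mismatches, the standard definitions are:

Matching coefficient = (a + d) / (a + b + c + d)
Jaccard's coefficient = a / (a + b + c)

Jaccard's coefficient ignores the 0–0 matches, which is useful when shared absences carry little information.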

1-20
Measuring Distance Between Two Clusters (Example)
Minimum Distance
The distance between the pair of records Ai and Bj that are closest:

Maximum Distance
The distance between the pair of records Ai and Bj that are farthest:

Average Distance
The average distance of all possible distances between records in one
cluster and records in the other cluster:

Centroid Distance
The distance between the two cluster centroids.

A cluster centroid is the vector of measurement averages across all the records in that cluster.
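A minimal Python sketch of the four between-cluster distances, using two small clusters A and B (the points below are arbitrary illustrative values, not data from the slides):

import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[1.0, 2.0], [2.0, 1.0]])   # records in cluster A (arbitrary values)
B = np.array([[5.0, 6.0], [6.0, 5.0]])   # records in cluster B (arbitrary values)

pairwise = cdist(A, B, metric="euclidean")   # all record-to-record distances
min_dist = pairwise.min()                    # minimum distance
max_dist = pairwise.max()                    # maximum distance
avg_dist = pairwise.mean()                   # average distance
centroid_dist = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # centroid distance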

1-21
1-22
Domain knowledge is key when deciding among
clustering methods.

1-23
Data From the class
Their preferences for movie genres

1-24
Hierarchical (Agglomerative) Clustering
The idea behind hierarchical agglomerative clustering is to start with each cluster
comprising exactly one record and then progressively agglomerate (combine)
the two nearest clusters until just one cluster, consisting of all the records,
is left at the end.

1-25
1-26
Club Arizona and Commonwealth
Club Arizona, Commonwealth, and Central

1-27
Next Step

1-28
Linkage
Single Linkage: Minimum distance

Complete Linkage: Maximum distance

Average Linkage: Average distance between clusters

Centroid Linkage: Centroid distance

1-29
Linkage
Ward’s Method
▪ Ward’s method is also agglomerative in that it joins records and clusters
together progressively to produce larger and larger clusters but operates
slightly differently from the general approach described above.

▪ Ward’s method considers the “loss of information” that occurs when
records are clustered together. When each cluster has one record, there
is no loss of information, and all individual values remain available.
▪ When records are joined together and represented in clusters,
information about an individual record is replaced by the information
for the cluster to which it belongs. To measure this loss of information,
Ward’s method employs the “error sum of squares” (ESS), which
measures the difference between individual records and their group mean.
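In symbols, for clusters C_k with means \bar{x}_k, the error sum of squares is:

ESS = \sum_{k} \sum_{i \in C_k} \| x_i - \bar{x}_k \|^2

At each step, Ward's method merges the pair of clusters whose merger gives the smallest increase in ESS (scipy's linkage supports this via method="ward").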

1-30
Code for Euclidean distance
import numpy as np
import pandas as pd
from scipy.spatial import distance
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

Data = pd.read_excel("Clustering_Distance.xlsx")

records = Data[["Norm_Sales", "Norm_Fuel"]]

# Record names
record_names = Data['Manufacturer']
# Calculate Euclidean distances
distance_matrix = distance.cdist(records, records, metric='euclidean')
# Convert to DataFrame for better visualization
distance_df = pd.DataFrame(distance_matrix, index=record_names, columns=record_names)
# Create a mask for the upper triangle
mask = np.triu(np.ones(distance_df.shape), k=1).astype(bool)
# Replace upper triangle values with NaN for clarity
distance_df_masked = distance_df.mask(mask)

print("Lower Half Euclidean Distance Matrix using scipy:")
print(distance_df_masked)

# Calculate the linkage matrix using single linkage
Z = linkage(Data[['Norm_Sales', 'Norm_Fuel']], method='single')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, labels=Data['Manufacturer'].values, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram (Single Linkage)')
plt.xlabel('Cluster Name')
plt.ylabel('Distance')
plt.show()

1-31
Dendrogram: Displaying Clustering Process and Results

1-32
Validating Clusters
Cluster interpretability. Is the interpretation of the resulting clusters reasonable?
▪ Obtaining summary statistics (e.g., average, min, max) from each
cluster on each measurement that was used in the cluster analysis
▪ Examining the clusters for separation along some common feature
(variable) that was not used in the cluster analysis
▪ Labeling the clusters: based on the interpretation, try to assign a
name or label to each cluster

Cluster stability. Do cluster assignments change significantly if some of the
inputs are altered slightly?
Another way to check stability is to partition the data and see how well
clusters formed based on one part apply to the other part.
To do this:
▪ Cluster partition A.
▪ Use the cluster centroids from A to assign each record in partition B (each
record is assigned to the cluster with the closest centroid).
▪ Assess how consistent the cluster assignments are compared to the
assignments based on all the data.
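A minimal sketch of this check with scikit-learn's KMeans, assuming the two partitions are numeric arrays A and B and k = 3 (the variable names and the value of k are assumptions):

from sklearn.cluster import KMeans

km_A = KMeans(n_clusters=3, n_init=10, random_state=1).fit(A)   # cluster partition A
labels_B = km_A.predict(B)   # assign each record in B to the closest centroid from A
# Compare labels_B with the assignments B receives when the full dataset is clustered,
# e.g., using sklearn.metrics.adjusted_rand_score.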

1-33
Validating Clusters

Cluster separation. Examine the ratio of between-cluster variation to
within-cluster variation to see whether the separation is reasonable.
• There exist statistical tests for this task (an F-ratio), but their
usefulness is somewhat controversial.

Number of clusters. The number of resulting clusters must be useful,
given the purpose of the analysis.

• For example, suppose the goal of the clustering is to identify
categories of customers and assign labels to them for market
segmentation purposes.
• If the marketing department can only manage to sustain three
different marketing presentations, it would probably not make sense
to identify more than three clusters.

1-34
K-means Clustering
• The k-means clustering algorithm was proposed by J. Hartigan and M. A.
Wong [1979].

• Given a set of n distinct objects, the k-means algorithm partitions the
objects into k clusters such that intracluster similarity is high but
intercluster similarity is low.
• The idea is to minimize a measure of dispersion within the
clusters.
• Clusters are as homogeneous as possible with respect to the
measurements used.

• The user has to specify k, the number of clusters. The objects are assumed
to have numeric attributes, so any one of the distance metrics can be used
to demarcate the clusters.

1-35
k-Means Algorithm
The algorithm can be stated as follows.
• First, it selects k objects at random from the set of n objects.
These k objects are treated as the centroids, or centers of gravity,
of the k clusters.

• Each of the remaining objects is assigned to the closest centroid.
The collection of objects assigned to a centroid is called a cluster.

• Next, the centroid of each cluster is updated (by calculating the
mean of the attribute values of the objects in that cluster).

• The assignment and update procedure is repeated until a stopping
criterion is reached (e.g., a maximum number of iterations, or the
centroids and assignments no longer change).
1-36
k-Means Algorithm
Input: D, a dataset containing n objects; k, the number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.

2. For each object in D:
• Compute the distance between the current object and the k cluster
centroids.
• Assign the current object to the cluster whose centroid is closest.

3. Compute the “cluster centers” of each cluster. These become the new
cluster centroids.

4. Repeat steps 2-3 until the convergence criterion is satisfied.

5. Stop
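A minimal sketch of these steps with scikit-learn's KMeans, reusing the Data frame and normalized columns from the earlier code slide (k = 3 is an arbitrary choice here; scikit-learn uses k-means++ seeding by default rather than picking k objects purely at random):

from sklearn.cluster import KMeans

X = Data[["Norm_Sales", "Norm_Fuel"]]
km = KMeans(n_clusters=3, n_init=10, random_state=1)   # 10 restarts from different seeds
labels = km.fit_predict(X)      # steps 2-4, repeated until the assignments stop changing
centroids = km.cluster_centers_   # final cluster centroids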
1-37
Illustration of the k-Means clustering algorithm
The 16 objects with attributes A1 and A2:

A1    A2
6.8   12.6
0.8   9.8
1.2   11.6
2.8   9.6
3.8   9.9
4.4   6.5
4.8   1.1
6.0   19.9
6.2   18.5
7.6   17.4
7.8   12.2
6.6   7.7
8.2   4.5
8.4   6.9
9.0   3.4
9.6   11.1

Scatter plot of the 16 objects (A1 on the x-axis, A2 on the y-axis)

1-38
Illustration of the k-Means clustering algorithm
Table 2: Distance calculation (d1, d2, d3 are the distances to the three initial centroids)
The three randomly chosen initial centroids are c1 = (3.8, 9.9), c2 = (7.8, 12.2), and
c3 = (6.2, 18.5), i.e., the rows with a zero distance in the table.

A1    A2    d1    d2    d3    Cluster
6.8   12.6  4.0   1.1   5.9   2
0.8   9.8   3.0   7.4   10.2  1
1.2   11.6  3.1   6.6   8.5   1
2.8   9.6   1.0   5.6   9.5   1
3.8   9.9   0.0   4.6   8.9   1
4.4   6.5   3.5   6.6   12.1  1
4.8   1.1   8.9   11.5  17.5  1
6.0   19.9  10.2  7.9   1.4   3
6.2   18.5  8.9   6.5   0.0   3
7.6   17.4  8.4   5.2   1.8   3
7.8   12.2  4.6   0.0   6.5   2
6.6   7.7   3.6   4.7   10.8  1
8.2   4.5   7.0   7.7   14.1  1
8.4   6.9   5.5   5.3   11.8  2
9.0   3.4   8.3   8.9   15.4  1
9.6   11.1  5.9   2.1   8.1   2

Fig 2: Initial clusters with respect to Table 2
1-39
Illustration of the k-Means clustering algorithm
The new centroids of the three clusters, calculated as the means of the attribute values
A1 and A2 of their members, are shown in the table below. The clusters with the new
centroids are shown in Fig 3.

Calculation of new centroids

New Centroid   A1    A2
c1             4.6   7.1
c2             8.2   10.7
c3             6.6   18.6

Fig 3: Initial clusters with the new centroids
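For example, c3 is the mean of its three members (6.0, 19.9), (6.2, 18.5), and (7.6, 17.4):
A1 = (6.0 + 6.2 + 7.6)/3 = 6.6 and A2 = (19.9 + 18.5 + 17.4)/3 = 18.6.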

1-40
Illustration of the k-Means clustering algorithm
We next reassign the 16 objects to the three clusters by determining which centroid is
closest to each one. This gives the revised set of clusters shown in Fig 4.
Note that point p, the object at (8.4, 6.9), moves from cluster C2 to cluster C1.

Fig 4: Clusters after the first iteration

1-41
Illustration of the k-Means clustering algorithm
• The newly obtained centroids after the second iteration are given in the table below. Note that the
centroid c3 remains unchanged, while c1 and c2 change slightly.

• With respect to the newly obtained cluster centres, the 16 points are reassigned again. These are the same
clusters as before; hence, their centroids also remain unchanged.

• Taking this as the termination criterion, the k-means algorithm stops here. Hence, the final
clusters in Fig 5 are the same as in Fig 4.

Fig 5: Clusters after the second iteration

Cluster centres after the second iteration

Centroid   A1    A2
c1         5.0   7.1
c2         8.1   12.0
c3         6.6   18.6

1-42
Example

1-43
Assume k = 2 and that the initial clusters are
A = {Arizona, Boston} and
B = {Central, Commonwealth, Consolidated}.

Distance

1-44
A = {Arizona, Boston} and
B = {Central, Commonwealth, Consolidated}.

A = {Arizona, Central, Commonwealth} and
B = {Consolidated, Boston}.

1-45
Examples
News Recommendation

Movie Recommendation

Natural clustering of states based on socioeconomic factors

House pricing example with different clusters: different pricing
patterns in house prices, differentiated by location, build type
(flats vs. independent houses), and build quality

Other Examples?

1-46
Homework
1. How do we select K in K-means clustering?
2. Compare the two methods (hierarchical and K-means).
3. Using the data in Table 15.1, perform K-means and hierarchical clustering.

1-47
Thank You

1-48
