L18_19_Clustering
Clustering
Classification vs. Clustering
▪ Google News Recommendation
▪ Movie Recommendation
▪ YouTube Videos
1-2
Clustering
Hierarchical
K-means
1-3
Classroom example
Hierarchical (Agglomerative)
K-means
1-4
Data From the class
Their preferences about movie genres
1-5
Scatter Plot
1-6
Hierarchical (Agglomerative) Clustering
The idea behind hierarchical agglomerative clustering is to start with each cluster
comprising exactly one record and then to progressively agglomerate (combine)
the two nearest clusters until just one cluster, containing all the records, is left
at the end. A minimal sketch of this merging loop is shown below.
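As a rough illustration (not from the slides), here is a minimal from-scratch sketch of the merging loop using single linkage (minimum distance); the 2-D points are made up:

# Minimal sketch of agglomerative clustering with single linkage.
# The five 2-D points below are made up for illustration.
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Start with each record in its own cluster (lists of row indices).
clusters = [[i] for i in range(len(points))]

while len(clusters) > 1:
    # Find the pair of clusters whose closest members are nearest (single linkage).
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = min(np.linalg.norm(points[i] - points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
    d, a, b = best
    print(f"merge {clusters[a]} and {clusters[b]} at distance {d:.2f}")
    # Combine the two nearest clusters and drop the second one.
    clusters[a] = clusters[a] + clusters[b]
    del clusters[b]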
1-7
1-8
Club Arizona and Commonwealth
Club Arizona, Commonwealth, and Central
1-9
1-10
Repeat it for Classroom Data
1-11
Hierarchical methods can be either agglomerative or divisive.
Example (divisive):
Within Gujarat, cluster the houses by city; then within each city, cluster them by
location (posh vs. non-posh area); next, within each posh/non-posh area, cluster
by build quality; and so on.
1-12
Non-hierarchical methods, such as k-means, assign records to a prespecified
number of clusters. These methods are generally less computationally intensive
and are therefore preferred with very large datasets.
1-13
Measuring Distance Between Two Records
Euclidean Distance
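For reference, the standard definition: for two records $x_i = (x_{i1}, \ldots, x_{ip})$ and $x_j = (x_{j1}, \ldots, x_{jp})$ measured on $p$ variables, the Euclidean distance between them is
$$d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$$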
1-14
Normalizing Numerical Measurements?
▪ The scale of each variable strongly influences the distance measure.
▪ Variables with larger scales (e.g., Sales) have a much greater influence (a z-score normalization sketch follows).
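A minimal sketch of z-score normalization with pandas (the column names and values are made up for illustration):

import pandas as pd

# Illustrative data; the column names and values are made up.
df = pd.DataFrame({"Sales": [120.0, 3500.0, 900.0], "FuelCost": [0.6, 2.0, 0.3]})

# z-score normalization: subtract each column's mean and divide by its std,
# so every variable contributes on a comparable scale to the distance.
df_norm = (df - df.mean()) / df.std()
print(df_norm)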
1-15
Euclidean distance
▪ It is highly scale-dependent. Changing the units of one variable (e.g., from cents to
dollars) can greatly influence the results.
▪ Unequal weighting should be considered if we want the clusters to depend more on
certain measurements and less on others.
▪ It is sensitive to outliers. If the data are believed to contain outliers and careful removal
is not an option, using a more robust distance (such as the Manhattan distance) is preferred.
1-16
Additional popular distance metrics
Correlation-based similarity. Sometimes, it is more natural or convenient
to work with a similarity measure between records rather than distance,
which measures dissimilarity.
1-17
Statistical distance (also called Mahalanobis distance).
1-18
Manhattan distance ("city block"). This distance looks at the absolute differences
rather than the squared differences, and is defined by
$$d_{ij} = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
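A minimal sketch of these three metrics using scipy.spatial.distance, with made-up records; the covariance matrix needed for the Mahalanobis distance is estimated from a small made-up data matrix:

import numpy as np
from scipy.spatial import distance

# Two illustrative records measured on four variables (values are made up).
x = np.array([2.0, 4.0, 1.0, 7.0])
y = np.array([3.0, 1.5, 0.5, 6.0])

# Manhattan ("city block") distance: sum of absolute differences.
print("Manhattan:", distance.cityblock(x, y))

# Correlation-based similarity between the two records;
# scipy's correlation distance is 1 - Pearson correlation.
print("Correlation similarity:", 1 - distance.correlation(x, y))

# Mahalanobis (statistical) distance needs the inverse covariance matrix,
# estimated here from a small made-up data matrix (pseudo-inverse for safety).
data = np.array([[2.0, 4.0, 1.0, 7.0],
                 [3.0, 1.5, 0.5, 6.0],
                 [1.0, 3.0, 2.0, 5.0],
                 [4.0, 2.5, 1.5, 8.0],
                 [2.5, 3.5, 0.8, 6.5]])
VI = np.linalg.pinv(np.cov(data, rowvar=False))
print("Mahalanobis:", distance.mahalanobis(x, y, VI))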
1-19
Distance Measures for Categorical Data
▪ Matching coefficient
▪ Jaccard's coefficient (a short sketch of both coefficients follows)
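Both coefficients apply to binary (0/1) attributes. A minimal NumPy sketch with two made-up binary records:

import numpy as np

# Two illustrative binary records (1 = attribute present, 0 = absent).
x = np.array([1, 0, 1, 1, 0, 0, 1])
y = np.array([1, 1, 1, 0, 0, 0, 0])

a = np.sum((x == 1) & (y == 1))   # 1-1 matches
d = np.sum((x == 0) & (y == 0))   # 0-0 matches
n = len(x)

# Matching coefficient: all matches (including 0-0) over all attributes.
matching = (a + d) / n

# Jaccard's coefficient: ignores 0-0 matches, which can dominate sparse data.
jaccard = a / (n - d)

print(matching, jaccard)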
1-20
Measuring Distance Between Two Clusters (Example)
Minimum Distance
The distance between the pair of records Ai and Bj that are closest:
Maximum Distance
The distance between the pair of records Ai and Bj that are farthest:
Average Distance
The average of all pairwise distances between records in one cluster and
records in the other cluster.
Centroid Distance
The distance between the two cluster centroids.
A code sketch of these four measures follows.
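A minimal sketch with two made-up 2-D clusters, using scipy to compute all pairwise distances:

import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative clusters of 2-D records (values are made up).
A = np.array([[1.0, 2.0], [2.0, 2.5], [1.5, 1.0]])
B = np.array([[6.0, 7.0], [7.5, 8.0]])

# All pairwise Euclidean distances between records in A and records in B.
D = cdist(A, B)

print("minimum distance:", D.min())     # used by single linkage
print("maximum distance:", D.max())     # used by complete linkage
print("average distance:", D.mean())    # used by average linkage
print("centroid distance:",
      np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))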
1-21
1-22
Domain knowledge is key when deciding among
clustering methods.
1-23
Data From the class
Their preferences about movie genres
1-24
Hierarchical (Agglomerative) Clustering
The idea behind hierarchical agglomerative clustering is to start with each cluster
comprising exactly one record and then to progressively agglomerate (combine)
the two nearest clusters until just one cluster, containing all the records, is left
at the end.
1-25
1-26
Club Arizona and Commonwealth
Club Arizona, Commonwealth, and Central
1-27
Next Step
1-28
Linkage
Single Linkage: Minimum distance
1-29
Linkage
Ward's Method
▪ Ward's method is also agglomerative: it joins records and clusters together
progressively to produce larger and larger clusters, but it operates slightly
differently from the general approach described above. At each step it merges
the pair of clusters whose merger gives the smallest increase in the
within-cluster sum of squares. A scipy sketch of single linkage and Ward's
method follows.
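A minimal scipy sketch contrasting single linkage and Ward's method on made-up 2-D records, assuming matplotlib is available for the dendrogram plot:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Illustrative 2-D records (values are made up). In practice, normalize first.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0],
              [5.5, 4.5], [9.0, 1.0], [8.5, 1.5]])

# 'single' merges on the minimum distance between clusters; 'ward' merges the
# pair that gives the smallest increase in the within-cluster sum of squares.
Z_single = linkage(X, method="single")
Z_ward = linkage(X, method="ward")

# Cut the Ward tree into 3 clusters and plot the dendrogram.
labels = fcluster(Z_ward, t=3, criterion="maxclust")
print(labels)

dendrogram(Z_ward)
plt.show()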
1-30
Coding for Euclidean distance
import numpy as np
import pandas as pd
from scipy.spatial import distance

# Read the records (each row is one manufacturer)
Data = pd.read_excel("Clustering_Distance.xlsx")
# Record names
record_names = Data['Manufacturer']
# Numeric measurements only (everything except the name column)
records = Data.drop(columns=['Manufacturer']).values
# Calculate Euclidean distances between every pair of records
distance_matrix = distance.cdist(records, records, metric='euclidean')
# Convert to DataFrame for better visualization
distance_df = pd.DataFrame(distance_matrix, index=record_names, columns=record_names)
# Create a mask for the upper triangle
mask = np.triu(np.ones(distance_df.shape), k=1).astype(bool)
# Replace upper triangle values with NaN so only the lower triangle is shown
distance_df_masked = distance_df.mask(mask)
print(distance_df_masked)
1-32
Validating Clusters
Cluster interpretability. Is the interpretation of the resulting clusters reasonable?
▪ Obtaining summary statistics (e.g., average, min, max) from each
cluster on each measurement that was used in the cluster analysis
▪ Examining the clusters for separation along some common feature
(variable) that was not used in the cluster analysis
▪ Labeling the clusters: based on the interpretation, trying to assign a
name or label to each cluster
(A pandas sketch of the summary-statistics check appears below.)
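To illustrate the summary-statistics check, a minimal pandas sketch (the data and cluster labels are made up):

import pandas as pd

# Illustrative data with an assigned cluster label per record
# (column names and values are made up).
df = pd.DataFrame({
    "Sales": [9000, 3300, 6600, 5100],
    "FuelCost": [0.6, 2.0, 0.3, 1.1],
    "cluster": [1, 2, 1, 2],
})

# Summary statistics per cluster help judge interpretability:
# do the clusters differ in a way that makes practical sense?
print(df.groupby("cluster").agg(["mean", "min", "max"]))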
1-33
Validating Clusters
1-34
K-means Clustering
• The k-means clustering algorithm was proposed by J. Hartigan and M. A.
Wong [1979].
1-35
k-Means Algorithm
The algorithm can be stated as follows.
1. First, select k objects at random from the set of n objects. These k objects
are treated as the centroids (centers of gravity) of the k clusters.
2. Assign each object to the cluster whose centroid is nearest to it.
3. Compute the "cluster centers" of each cluster, i.e., the mean of the attribute
values of its objects. These become the new cluster centroids.
4. Repeat steps 2 and 3 until the cluster assignments no longer change.
A minimal code sketch of these steps follows.
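A minimal from-scratch sketch of these steps (the data values are made up for illustration; the sketch ignores the empty-cluster corner case):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following the steps above (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k records at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each record to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (assumes no cluster becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids (and hence the clusters) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative 2-D data (values are made up).
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 7.5], [9.0, 1.0], [8.0, 1.5]])
labels, centroids = kmeans(X, k=3)
print(labels)
print(centroids)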
[Sample data and scatter plot: 16 objects described by attributes A1 and A2, plotted with A1 on the x-axis and A2 on the y-axis; the full list of points appears in Table 2 below.]
1-38
Illustration of the k-Means Clustering Algorithm
Table 2: Distance calculation (d1, d2, d3 are the Euclidean distances from each object to the three initial centroids)
Fig 2: Initial clusters with respect to Table 2
A1   A2    d1    d2    d3   cluster
6.8 12.6 4.0 1.1 5.9 2
0.8 9.8 3.0 7.4 10.2 1
1.2 11.6 3.1 6.6 8.5 1
2.8 9.6 1.0 5.6 9.5 1
3.8 9.9 0.0 4.6 8.9 1
4.4 6.5 3.5 6.6 12.1 1
4.8 1.1 8.9 11.5 17.5 1
6.0 19.9 10.2 7.9 1.4 3
6.2 18.5 8.9 6.5 0.0 3
7.6 17.4 8.4 5.2 1.8 3
7.8 12.2 4.6 0.0 6.5 2
6.6 7.7 3.6 4.7 10.8 1
8.2 4.5 7.0 7.7 14.1 1
8.4 6.9 5.5 5.3 11.8 2
9.0 3.4 8.3 8.9 15.4 1
9.6 11.1 5.9 2.1 8.1 2
1-39
Illustration of the k-Means Clustering Algorithm
The new centroids of the three clusters, calculated as the means of the A1 and A2
values of their objects, are shown in the table below. The clusters with the new
centroids are shown in Fig 3.
New centroid   A1    A2
c1             4.6   7.1
c2             8.2   10.7
c3             6.6   18.6
1-40
Illustration of the k-Means Clustering Algorithm
We next reassign the 16 objects to three clusters by determining which centroid is
closest to each one. This gives the revised set of clusters shown in Fig 4.
Note that point p moves from cluster C2 to cluster C1.
1-41
Illustration of the k-Means Clustering Algorithm
• The centroids obtained after the second iteration are given in the table below. Note that
centroid c3 remains unchanged, while c1 and c2 change slightly.
• With respect to the newly obtained cluster centres, the 16 points are reassigned again. The
resulting clusters are the same as before, so their centroids also remain unchanged.
• Taking this as the termination criterion, the k-means algorithm stops here. Hence, the final
clusters in Fig 5 are the same as in Fig 4.
Fig 5: Clusters after the second iteration
1-42
Example
1-43
Suppose k = 2 and that the initial clusters are
A = {Arizona, Boston} and
B = {Central, Commonwealth, Consolidated}.
Distance
1-44
A = {Arizona, Boston} and
B = {Central, Commonwealth, Consolidated}.
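A minimal sketch of one assignment step for this example; the numeric values below are hypothetical placeholders (the actual measurements come from the textbook's utilities data, not shown here):

import pandas as pd
import numpy as np

# Hypothetical (made-up) normalized measurements for the five utilities.
data = pd.DataFrame(
    {"x1": [0.9, 0.7, -0.3, -0.5, -0.8],
     "x2": [1.1, 0.8, -0.2, -0.9, -0.6]},
    index=["Arizona", "Boston", "Central", "Commonwealth", "Consolidated"],
)
clusters = {"A": ["Arizona", "Boston"],
            "B": ["Central", "Commonwealth", "Consolidated"]}

# One k-means step: compute each cluster's centroid, then reassign every
# record to the nearest centroid.
centroids = {name: data.loc[members].mean() for name, members in clusters.items()}
for record in data.index:
    dists = {name: np.linalg.norm(data.loc[record] - c) for name, c in centroids.items()}
    print(record, "->", min(dists, key=dists.get))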
1-45
Examples
News Recommendation
Movie Recommendation
Other Examples?
1-46
Homework
1. How to select k in k-means clustering?
2. Compare both methods (k-means and hierarchical clustering).
3. Using the data in Table 15.1, perform k-means and hierarchical clustering.
1-47
Thank You
1-48