
Unsupervised Learning:

Clustering
Classification VS Clustering
▪ Google News Recommendation
▪ Movie Recommendation
▪ YouTube Videos

1-2
Clustering
Hierarchical
K-means

1-3
Classroom example
Hierarchical (Agglomerative)
K-means

Scatterplot for visualization

1-4
Data From the class
Their preferences for movie genres

1-5
Scatter Plot

1-6
Hierarchical (Agglomerative) Clustering
The idea behind hierarchical agglomerative clustering is to start with each cluster
comprising exactly one record and then progressively agglomerate (combine)
the two nearest clusters until just one cluster, consisting of all the records,
is left at the end.

1-7
1-8
Club Arizona and Commonwealth
Club Arizona, Commonwealth, and Central

1-9
1-10
Repeat it for Classroom Data

1-11
Hierarchical methods can be either agglomerative or divisive.

▪ Agglomerative methods begin with n clusters and sequentially merge
similar clusters until a single cluster is obtained.
▪ Divisive methods work in the opposite direction, starting with one cluster
that includes all records.

▪ Preferred when we need to arrange the clusters into a natural hierarchy.

Example:
Within Gujarat, cluster the houses by city; then, within each city, cluster them
by location (posh vs. non-posh area); next, within the posh/non-posh areas, cluster
them by build quality; and so on.

1-12
Non-hierarchical methods, such as k-means, assign records to a prespecified
number of clusters. These methods are generally less computationally intensive
and are therefore preferred with very large datasets.

Is the natural clustering compromised?

1-13
Measuring Distance Between Two
Records

Euclidean Distance
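In standard notation, for two records x_i and x_j with p numeric measurements, the Euclidean distance is:

d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}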

1-14
Normalizing Numerical Measurements?
▪ The scale of each variable strongly influences the distance measure.
▪ Variables with larger scales (e.g., Sales) have a much greater influence.

▪ It is, therefore, customary to normalize continuous measurements before computing
the Euclidean distance.
▪ Normalizing a measurement means subtracting the average and dividing by the
standard deviation (normalized values are also called z-scores).
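A minimal sketch of this normalization in Python, assuming the spreadsheet used later in the slides contains raw columns named "Sales" and "Fuel" (these raw column names are assumptions, not confirmed by the slides):

import pandas as pd

Data = pd.read_excel("Clustering_Distance.xlsx")   # file from the later code slide
# z-score: subtract the column average and divide by the column standard deviation
Data["Norm_Sales"] = (Data["Sales"] - Data["Sales"].mean()) / Data["Sales"].std()
Data["Norm_Fuel"] = (Data["Fuel"] - Data["Fuel"].mean()) / Data["Fuel"].std()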

1-15
Euclidean distance
▪ It is highly scale-dependent. Changing the units of one variable (e.g., from cents to
dollars) can greatly influence the results.
▪ Unequal weighting should be considered if we want the clusters to depend more on
certain measurements and less on others.

▪ It completely ignores the relationship between the measurements. Thus, if the
measurements are, in fact, strongly correlated, a different distance
(such as the statistical distance) is likely to be a better choice.

▪ It is sensitive to outliers. If the data are believed to contain outliers and careful removal
is not an option, using a more robust distance (such as the Manhattan distance) is preferred.

1-16
Additional popular distance metrics
Correlation-based similarity. Sometimes, it is more natural or convenient
to work with a similarity measure between records rather than distance,
which measures dissimilarity.
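One common choice is Pearson's correlation between the measurement vectors of records i and j, where \bar{x}_i and \bar{x}_j denote each record's mean measurement:

r_{ij} = \frac{\sum_{m=1}^{p} (x_{im} - \bar{x}_i)(x_{jm} - \bar{x}_j)}{\sqrt{\sum_{m=1}^{p} (x_{im} - \bar{x}_i)^2} \, \sqrt{\sum_{m=1}^{p} (x_{jm} - \bar{x}_j)^2}}

A distance-like measure can then be derived from the similarity, for example 1 - r_{ij}.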

1-17
Statistical distance (also called Mahalanobis distance):

d(x_i, x_j) = \sqrt{(x_i - x_j)' \, S^{-1} \, (x_i - x_j)}

where x_i and x_j are the p-dimensional vectors of measurement values for
records i and j, respectively, and S is the covariance matrix for these vectors.
(The transpose operation ′ simply turns a column vector into a row vector;
S^{-1} is the inverse matrix of S, which is the p-dimensional extension of division.)

1-18
Manhattan distance (“city block”). This distance looks at the absolute differences
rather than the squared differences, and is defined by

d(x_i, x_j) = \sum_{m=1}^{p} |x_{im} - x_{jm}|

1-19
Distance Measures for Categorical Data
▪ Matching coefficient
▪ Jaccard’s coefficient
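For p binary variables, with a = the number of variables where both records equal 1, d = the number where both equal 0, and b, c = the numbers of mismatches, the standard definitions are:

Matching coefficient = (a + d) / (a + b + c + d)
Jaccard's coefficient = a / (a + b + c)

Jaccard's coefficient ignores the 0–0 matches, which is useful when shared absences carry little information.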

1-20
Measuring Distance Between Two Clusters (Example)
Minimum Distance
The distance between the pair of records Ai and Bj that are closest:

Maximum Distance
The distance between the pair of records Ai and Bj that are farthest:

Average Distance
The average distance of all possible distances between records in one
cluster and records in the other cluster:

Centroid Distance
The distance between the two cluster centroids.

A cluster centroid is the vector of measurement averages across all the records in that cluster.
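A minimal Python sketch of the four between-cluster distances, using two small clusters A and B (the points below are arbitrary illustrative values, not data from the slides):

import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[1.0, 2.0], [2.0, 1.0]])   # records in cluster A (arbitrary values)
B = np.array([[5.0, 6.0], [6.0, 5.0]])   # records in cluster B (arbitrary values)

pairwise = cdist(A, B, metric="euclidean")   # all record-to-record distances
min_dist = pairwise.min()                    # minimum distance
max_dist = pairwise.max()                    # maximum distance
avg_dist = pairwise.mean()                   # average distance
centroid_dist = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # centroid distance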

1-21
1-22
Domain knowledge is key when deciding among
clustering methods.

1-23
Data From the class
Their preferences for movie genres

1-24
Hierarchical (Agglomerative) Clustering
The idea behind hierarchical agglomerative clustering is to start with each cluster
comprising exactly one record and then progressively agglomerate (combine)
the two nearest clusters until just one cluster, consisting of all the records,
is left at the end.

1-25
1-26
Club Arizona and Commonwealth
Club Arizona, Commonwealth, and Central

1-27
Next Step

1-28
Linkage
Single Linkage: Minimum distance

Complete Linkage: Maximum distance

Average Linkage: Average distance between clusters

Centroid Linkage: Centroid distance

1-29
Linkage
Ward’s Method
▪ Ward’s method is also agglomerative in that it joins records and clusters
together progressively to produce larger and larger clusters but operates
slightly differently from the general approach described above.

▪ Ward’s method considers the “loss of information” that occurs when
records are clustered together. When each cluster has one record, there
is no loss of information, and all individual values remain available.
▪ When records are joined together and represented in clusters,
information about an individual record is replaced by the information
for the cluster to which it belongs. To measure this loss of information,
Ward’s method employs the “error sum of squares” (ESS), which
measures the difference between individual records and their group mean.
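In symbols, for clusters C_k with means \bar{x}_k, the error sum of squares is:

ESS = \sum_{k} \sum_{i \in C_k} \| x_i - \bar{x}_k \|^2

At each step, Ward's method merges the pair of clusters whose merger gives the smallest increase in ESS (scipy's linkage supports this via method="ward").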

1-30
Code for Euclidean distance
import numpy as np
import pandas as pd
from scipy.spatial import distance
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

Data = pd.read_excel("Clustering_Distance.xlsx")

records = Data[["Norm_Sales", "Norm_Fuel"]]

# Record names
record_names = Data['Manufacturer']
# Calculate Euclidean distances
distance_matrix = distance.cdist(records, records, metric='euclidean')
# Convert to DataFrame for better visualization
distance_df = pd.DataFrame(distance_matrix, index=record_names, columns=record_names)
# Create a mask for the upper triangle
mask = np.triu(np.ones(distance_df.shape), k=1).astype(bool)
# Replace upper triangle values with NaN for clarity
distance_df_masked = distance_df.mask(mask)

print("Lower Half Euclidean Distance Matrix using scipy:")
print(distance_df_masked)

# Calculate the linkage matrix using single linkage
Z = linkage(Data[['Norm_Sales', 'Norm_Fuel']], method='single')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, labels=Data['Manufacturer'].values, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram (Single Linkage)')
plt.xlabel('Cluster Name')
plt.ylabel('Distance')
plt.show()

1-31
Dendrogram: Displaying Clustering Process and Results

1-32
Validating Clusters
Cluster interpretability. Is the interpretation of the resulting clusters reasonable?
▪ Obtaining summary statistics (e.g., average, min, max) from each
cluster on each measurement that was used in the cluster analysis
▪ Examining the clusters for separation along some common feature
(variable) that was not used in the cluster analysis
▪ Labeling the clusters: based on the interpretation, try to assign a
name or label to each cluster

Cluster stability. Do cluster assignments change significantly if some of the
inputs are altered slightly?
Another way to check stability is to partition the data and see how well
clusters formed based on one part apply to the other part.
To do this:
▪ Cluster partition A.
▪ Use the cluster centroids from A to assign each record in partition B (each
record is assigned to the cluster with the closest centroid).
▪ Assess how consistent the cluster assignments are compared to the
assignments based on all the data.
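A minimal sketch of this check with scikit-learn's KMeans, assuming the two partitions are numeric arrays A and B and k = 3 (the variable names and the value of k are assumptions):

from sklearn.cluster import KMeans

km_A = KMeans(n_clusters=3, n_init=10, random_state=1).fit(A)   # cluster partition A
labels_B = km_A.predict(B)   # assign each record in B to the closest centroid from A
# Compare labels_B with the assignments B receives when the full dataset is clustered,
# e.g., using sklearn.metrics.adjusted_rand_score.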

1-33
Validating Clusters

Cluster separation. Examine the ratio of between-cluster variation to
within-cluster variation to see whether the separation is reasonable.
• There exist statistical tests for this task (an F-ratio), but their
usefulness is somewhat controversial.

Number of clusters. The number of resulting clusters must be useful,
given the purpose of the analysis.

• For example, suppose the goal of the clustering is to identify
categories of customers and assign labels to them for market
segmentation purposes.
• If the marketing department can only manage to sustain three
different marketing presentations, it would probably not make sense
to identify more than three clusters.

1-34
K-means Clustering
• The k-means clustering algorithm was proposed by J. Hartigan and M. A.
Wong [1979].

• Given a set of n distinct objects, the k-means algorithm partitions the
objects into k clusters such that intracluster similarity is high but
intercluster similarity is low.
• The idea is to minimize a measure of dispersion within the
clusters.
• Clusters are as homogeneous as possible with respect to the
measurements used.

• The user has to specify k, the number of clusters. The objects are assumed
to have numeric attributes, so any one of the distance metrics can be used
to demarcate the clusters.

1-35
k-Means Algorithm
The algorithm can be stated as follows.
• First, it selects k objects at random from the set of n objects.
These k objects are treated as the centroids, or centers of gravity,
of the k clusters.

• Each of the remaining objects is assigned to the closest centroid.
The collection of objects assigned to a centroid is called a cluster.

• Next, the centroid of each cluster is updated (by calculating the
mean of the attribute values of the objects in that cluster).

• The assignment and update procedure is repeated until a stopping
criterion is reached (e.g., a maximum number of iterations, or the
centroids and assignments no longer change).
1-36
k-Means Algorithm
Input: D, a dataset containing n objects; k, the number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.

2. For each object in D:
• Compute the distance between the current object and the k cluster
centroids.
• Assign the current object to the cluster whose centroid is closest.

3. Compute the “cluster centers” of each cluster. These become the new
cluster centroids.

4. Repeat steps 2-3 until the convergence criterion is satisfied.

5. Stop
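A minimal sketch of these steps with scikit-learn's KMeans, reusing the Data frame and normalized columns from the earlier code slide (k = 3 is an arbitrary choice here; scikit-learn uses k-means++ seeding by default rather than picking k objects purely at random):

from sklearn.cluster import KMeans

X = Data[["Norm_Sales", "Norm_Fuel"]]
km = KMeans(n_clusters=3, n_init=10, random_state=1)   # 10 restarts from different seeds
labels = km.fit_predict(X)      # steps 2-4, repeated until the assignments stop changing
centroids = km.cluster_centers_   # final cluster centroids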
1-37
Illustration of the k-Means clustering algorithm
The 16 objects with attributes A1 and A2:

A1    A2
6.8   12.6
0.8   9.8
1.2   11.6
2.8   9.6
3.8   9.9
4.4   6.5
4.8   1.1
6.0   19.9
6.2   18.5
7.6   17.4
7.8   12.2
6.6   7.7
8.2   4.5
8.4   6.9
9.0   3.4
9.6   11.1

Scatter plot of the 16 objects (A1 on the x-axis, A2 on the y-axis)

1-38
Illustration of the k-Means clustering algorithm
Table 2: Distance calculation (d1, d2, d3 are the distances to the three initial centroids)
The three randomly chosen initial centroids are c1 = (3.8, 9.9), c2 = (7.8, 12.2), and
c3 = (6.2, 18.5), i.e., the rows with a zero distance in the table.

A1    A2    d1    d2    d3    Cluster
6.8   12.6  4.0   1.1   5.9   2
0.8   9.8   3.0   7.4   10.2  1
1.2   11.6  3.1   6.6   8.5   1
2.8   9.6   1.0   5.6   9.5   1
3.8   9.9   0.0   4.6   8.9   1
4.4   6.5   3.5   6.6   12.1  1
4.8   1.1   8.9   11.5  17.5  1
6.0   19.9  10.2  7.9   1.4   3
6.2   18.5  8.9   6.5   0.0   3
7.6   17.4  8.4   5.2   1.8   3
7.8   12.2  4.6   0.0   6.5   2
6.6   7.7   3.6   4.7   10.8  1
8.2   4.5   7.0   7.7   14.1  1
8.4   6.9   5.5   5.3   11.8  2
9.0   3.4   8.3   8.9   15.4  1
9.6   11.1  5.9   2.1   8.1   2

Fig 2: Initial clusters with respect to Table 2
1-39
Illustration of the k-Means clustering algorithm
The new centroids of the three clusters, calculated as the means of the attribute values
A1 and A2 of their members, are shown in the table below. The clusters with the new
centroids are shown in Fig 3.

Calculation of new centroids

New Centroid   A1    A2
c1             4.6   7.1
c2             8.2   10.7
c3             6.6   18.6

Fig 3: Initial clusters with the new centroids
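For example, c3 is the mean of its three members (6.0, 19.9), (6.2, 18.5), and (7.6, 17.4):
A1 = (6.0 + 6.2 + 7.6)/3 = 6.6 and A2 = (19.9 + 18.5 + 17.4)/3 = 18.6.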

1-40
Illustration of the k-Means clustering algorithm
We next reassign the 16 objects to the three clusters by determining which centroid is
closest to each one. This gives the revised set of clusters shown in Fig 4.
Note that point p, the object at (8.4, 6.9), moves from cluster C2 to cluster C1.

Fig 4: Clusters after the first iteration

1-41
Illustration of the k-Means clustering algorithm
• The newly obtained centroids after the second iteration are given in the table below. Note that the
centroid c3 remains unchanged, while c1 and c2 change slightly.

• With respect to the newly obtained cluster centres, the 16 points are reassigned again. These are the same
clusters as before; hence, their centroids also remain unchanged.

• Taking this as the termination criterion, the k-means algorithm stops here. Hence, the final
clusters in Fig 5 are the same as in Fig 4.

Fig 5: Clusters after the second iteration

Cluster centres after the second iteration

Centroid   A1    A2
c1         5.0   7.1
c2         8.1   12.0
c3         6.6   18.6

1-42
Example

1-43
Assume k = 2 and that the initial clusters are
A = {Arizona, Boston} and
B = {Central, Commonwealth, Consolidated}.

Distance

1-44
A = {Arizona, Boston} and
B = {Central, Commonwealth, Consolidated}.

A = {Arizona, Central, Commonwealth} and
B = {Consolidated, Boston}.

1-45
Examples
News Recommendation

Movie Recommendation

Natural clustering of states based on socioeconomic factors

House pricing example with different clusters: different pricing
patterns in house prices, differentiated by location, build type
(flats vs. independent houses), and build quality

Other Examples?

1-46
Homework
1. How do we select K in K-means clustering?
2. Compare the two methods (hierarchical and K-means).
3. Using the data in Table 15.1, perform K-means and hierarchical clustering.

1-47
Thank You

1-48
