Clustering Monograph DSBA
List of Equations
Equation 1: Euclidean Distance
Equation 2: Calculating Euclidean Distance
Equation 3: Squared Euclidean Distance
Equation 4: Manhattan Distance
Equation 5: Minkowski's Distance
3.1 Measures of Distance
Since the clustering process is controlled entirely by the distance between two points and the distance between two clusters, it is essential that the concept of distance is clear before we move on to the actual clustering process.
Clustering works only when the observations are multivariate; for univariate observations the concept of distance is trivial. We will also assume that the variables are numeric. If the variables are categorical or mixed, this simple measure of distance will not work; a few alternative distance measures are given in Section 6.
Let us introduce matrix notation to describe the observations. Let X denote the data matrix of
order n x p, i.e. there are n observations in the data and each observation has p dimensions. In
the matrix each row denotes an observation and each column, one dimension or attribute.
For the purpose of clustering, distance is defined between a pair of observations combining all the attributes. There are many possible distance measures. Distance is a non-negative quantity: the distance between two identical observations must be zero, and the distance between two non-identical observations must always be positive.
Pairwise distances among a set of n p-dimensional observations can be arranged in an n x n symmetric (square) matrix whose principal diagonal is 0.
Euclidean Distance: Between two observations X1 and X2, each of dimension p, the Euclidean distance is defined as

Equation 1: Euclidean Distance
d(X_1, X_2) = \sqrt{ \sum_{j=1}^{p} (x_{1j} - x_{2j})^2 }

For example, for Users 1 and 2 of Fig 1,

Equation 2: Calculating Euclidean Distance
d(X_1, X_2) = \sqrt{ (0.53 - 0.53)^2 + (3.65 - 3.65)^2 + (3.67 - 3.68)^2 } = 0.01
Squared Euclidean Distance: This is the Euclidean distance without the square root; the distance between two observations X1 and X2 is defined as

Equation 3: Squared Euclidean Distance
d^2(X_1, X_2) = \sum_{j=1}^{p} (x_{1j} - x_{2j})^2
Manhattan Distance: This is also known as the absolute value distance or L1 distance. The Manhattan distance between two observations X1 and X2 is defined as

Equation 4: Manhattan Distance
d(X_1, X_2) = \sum_{j=1}^{p} |x_{1j} - x_{2j}|

Minkowski's Distance: This is a generalization of the above distances, defined as

Equation 5: Minkowski's Distance
d(X_1, X_2) = \left( \sum_{j=1}^{p} |x_{1j} - x_{2j}|^q \right)^{1/q}

where q is a positive integer. Note that for q = 1, Minkowski's distance coincides with the Manhattan distance, and for q = 2 it is identical to the Euclidean distance.
There are many other important distance measures, among which the Mahalanobis Distance,
the Cosine Distance and the Jaccard Distance deserve mention.
The following is an illustration of distance matrices computed for the data given in Fig 1. Note that each distance matrix is symmetric with principal diagonal 0, so all the information is contained in its lower triangle.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# The five observations of Fig 1, one row per User and one column per attribute
pts = np.array([[0.53, 0.53, 0.54, 0.54, 0.53],
                [3.65, 3.65, 3.66, 3.66, 3.67],
                [3.67, 3.68, 3.68, 3.67, 3.66]]).transpose()
pts
array([[0.53, 3.65, 3.67],
[0.53, 3.65, 3.68],
[0.54, 3.66, 3.68],
[0.54, 3.66, 3.67],
[0.53, 3.67, 3.66]])
# Pairwise Euclidean distances, arranged as a square matrix
squareform(np.round(pdist(pts, metric='euclidean'), 2))
array([[0. , 0.01, 0.02, 0.01, 0.02],
[0.01, 0. , 0.01, 0.02, 0.03],
[0.02, 0.01, 0. , 0.01, 0.02],
[0.01, 0.02, 0.01, 0. , 0.02],
[0.02, 0.03, 0.02, 0.02, 0. ]])
[email protected]
9T6KVYUXDH
# Pairwise Manhattan (cityblock) distances
squareform(np.round(pdist(pts, metric='cityblock'), 2))
array([[0. , 0.01, 0.03, 0.02, 0.03],
[0.01, 0. , 0.02, 0.03, 0.04],
[0.03, 0.02, 0. , 0.01, 0.04],
[0.02, 0.03, 0.01, 0. , 0.03],
[0.03, 0.04, 0.04, 0.03, 0. ]])
Note that the numerical values of the distances may differ depending on which method is used to compute them. The distance between Users 1 and 2 is 0.01 under all three methods, while the distance between Users 1 and 3 depends on the method: the Euclidean distance between Users 1 and 3 is 0.02, whereas the Manhattan (cityblock) distance between them is 0.03. This indicates that the clustering may depend on the distance measure. In Python, the output for the Minkowski distance is the same as the Euclidean distance because the parameter p = 2 by default; p = 1 gives the Manhattan distance.
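As a quick check of this claim, the Minkowski metric can be computed explicitly for p = 2 and p = 1 on the same pts array used above:

# p = 2 reproduces the Euclidean matrix shown above
squareform(np.round(pdist(pts, metric='minkowski', p=2), 2))
# p = 1 reproduces the Manhattan (cityblock) matrix
squareform(np.round(pdist(pts, metric='minkowski', p=1), 2))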
Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster,
and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are
performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner.
Agglomerative clustering is used more often than divisive clustering. In this monograph we will not discuss divisive clustering in any detail (see Section 6).
Clustering of observations is the objective here. However, it is also possible to apply
hierarchical clustering to cluster attributes (variables).
[email protected]
9T6KVYUXDH4.2 Agglomerative Clustering
Figure 2 shows a graphical representation of agglomerative clustering. At each stage, two units, two clusters, or one unit and one cluster are combined into a larger cluster. The graphical representation is a tree diagram called a dendrogram. At the start of the algorithm, the number of clusters equals the number of observations n; at the end of the algorithm, the number of clusters is 1.
The optimum number of clusters is not pre-determined in this algorithm. The final number of clusters is determined by the height at which the dendrogram is cut.
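As an illustrative sketch of how such a dendrogram is drawn (using scipy and, for concreteness, the five observations in the pts array from Section 3.1):

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

Z = linkage(pts, method='single', metric='euclidean')   # merge history of the 5 units
dendrogram(Z, labels=[1, 2, 3, 4, 5])                   # units labelled 1 to 5, as in Figure 2
plt.title('Dendrogram (single linkage)')
plt.show()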
Let us use Euclidean distance for illustration. In Fig 2, cutting after the first merge (from the
bottom) of the dendrogram will yield clusters {1, 2} {5} {3, 4}. Cutting after the second
merge will yield clusters {1, 2, 5} {3, 4}, which is a coarser clustering, with a smaller number
of clusters but one or more clusters having larger size.
Once one or more clusters contain more than one element, a rule for computing the distance between two multi-element clusters needs to be defined. The commonly used rules, called linkage methods, are described below.
[email protected]
9T6KVYUXDHSingle Linkage: Distance between two clusters is defined to be the smallest distance between
a pair of observations (points), one from each cluster. This is the distance between two closest
points of the two clusters.
Complete Linkage: Distance between two clusters is defined to be the largest distance between a pair of points, one from each cluster. This is the distance between the two points, one from each cluster, that are farthest apart.
Centroid Linkage: The centroid of a cluster is defined as the vector of means of all attributes over all observations within that cluster. Centroid linkage between two clusters is the distance between the respective centroids.
[email protected]
9T6KVYUXDHNote that the choice of distance and linkage method uniquely define the outcome of a
clustering procedure. It is possible that the same set of observations may be clustered in
different partitions based on the choice of distance and linkage method.
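To see that the linkage choice can matter, one can cut the trees obtained under two different linkage rules and compare the resulting labels; a minimal sketch on the same small pts array (the choice of 2 clusters is illustrative):

from scipy.cluster.hierarchy import linkage, fcluster

for method in ['single', 'complete']:
    Z = linkage(pts, method=method, metric='euclidean')
    labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
    print(method, labels)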
Case Study:
Review ratings by 5456 travelers (Users) on 24 categories of tourist attractions across various sites in Europe are considered. This data is the primary input for an automatic recommender system. Each reviewer (User) has rated various attractions, such as churches, resorts, bakeries etc., and the ratings are all between 0 (lowest) and 5 (highest). The objective is to partition the sample of travelers into clusters, so that tailored recommendations may be made for their next travel destinations.
Statement of the problem: Cluster the Users into mutually exclusive groups according to their preferences for various tourist attractions.
import pandas as pd

df = pd.read_csv('ReviewRatings.csv')
round(df.describe().T, 2)   # summary statistics for each attribute
             count   mean   std   min   25%   50%   75%   max
Fast Food    5456.0  2.08   1.25  0.78  1.29  1.69  2.28  5.0
Juice Bars   5456.0  2.19   1.58  0.76  1.03  1.49  2.74  5.0
...
View Points  5456.0  1.75   1.60  0.00  0.74  1.03  2.07  5.0
There are 5456 reviewers' ratings in the data. For each attribute the maximum value is 5, i.e. the highest possible score has been reached; however, not all attributes have received the lowest possible rating of 0. The mean value is always greater than the median value, sometimes considerably, indicating that the distributions of the attributes are not symmetric but positively skewed.
The following figure gives a quick comparison of attribute means. While malls, restaurants, theatres etc. have high average ratings, swimming pools and gyms have the lowest averages.
import matplotlib.pyplot as plt

AvgR = df.mean()          # average rating for each category
AvgR = AvgR.sort_values()
plt.figure(figsize=(10,7))
plt.barh(np.arange(len(df.columns[1:])), AvgR.values, align='center')
plt.yticks(np.arange(len(df.columns[1:])), AvgR.index)
plt.ylabel('Categories')
plt.xlabel('Average Rating')
plt.title('Average Rating for every Category')
plt.show()
[email protected]
9T6KVYUXDH
15
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
# Clustering illustration on a small subset of Users
# No need for scaling, as all variables are ratings on the same 0-5 scale
x = pd.DataFrame(x)                     # x: the ratings columns of df (defined earlier)
x_subset1 = x.loc[100:120, :].values    # Users 100-120, used for the illustrative dendrogram
Now let us proceed to the final clustering involving all Users, using Euclidean distance and Ward linkage.
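The dendrogram in Figure 10 can be produced with scipy; a minimal sketch (x_full is assumed to hold the ratings of all 5456 Users, as in the k-means code later):

from scipy.cluster.hierarchy import linkage, dendrogram

# Ward linkage on Euclidean distances for the full data
Z_full = linkage(x_full, method='ward', metric='euclidean')
plt.figure(figsize=(15, 7))
dendrogram(Z_full, color_threshold=None)   # default colour threshold, as in Figure 10
plt.show()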
Figure 10: Dendrogram of the hierarchical clusters for the full data set
The dendrogram of Figure 10 includes all Users. The colours are controlled by the color_threshold parameter and can be adjusted. With the default color_threshold=None, there are 3 clusters. It is difficult for a business to understand segments from just 3 clusters. After multiple iterations, 12 was identified as a suitable number of clusters; hence, using the clustering function from the sklearn library, 12 clusters were created in the full data set and the cluster label was added to the data set.
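The exact call is not shown here; a minimal sketch of how the 12 labels might be created with sklearn's AgglomerativeClustering (the class and its arguments are an assumption based on the description above; newer sklearn versions use metric in place of affinity):

from sklearn.cluster import AgglomerativeClustering

# Cut the Euclidean/Ward hierarchy into 12 clusters and attach the labels
agg = AgglomerativeClustering(n_clusters=12, affinity='euclidean', linkage='ward')
x_analysis = x.copy()                        # x: the ratings columns (assumed defined earlier)
x_analysis['Cluster'] = agg.fit_predict(x)   # cluster label used in the profiling below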
Figure 11 does not show the full-grown tree but just the number of Users in each cluster. The cluster sizes are not equal across the 12 clusters, as indicated in the diagram. It is important to note that cluster membership and cluster size may vary with the linkage method and affinity (distance) selected.
Finally, to verify the differences between the identified clusters, a few important variables are
chosen and their means and standard deviations are compared below.
# Mean rating of each attribute within each hierarchical cluster
x_analysis.groupby('Cluster').mean()
Recall that the clustering process here is an input to a recommender system. Instead of pushing all information indiscriminately to every user, more efficient planning may be made based on the average rating of various attributes in the various groups. Consider, for example, that Clusters 1 and 11 have given very high ratings to Restaurants, whereas Clusters 0, 6 and especially 8 have given low ratings to Fast Food. It is clear that users belonging to Cluster 8 are discriminating eaters, and information about fast food will not be useful to them at all. Compared to them, Cluster 3 shows high ratings for Lodging and Fast Food but a medium rating for Restaurants. It is possible that these users are more convenience loving, and information about good accommodation and perhaps nearby fast-food joints will be appreciated by this group.
The most common algorithm uses an iterative refinement technique. Due to its ubiquity, it is often called "the k-means algorithm".
Given an initial set of k means m_1^(1), …, m_k^(1), the algorithm proceeds by alternating between two steps:
Assignment step: Assign each observation to the cluster whose mean has the least squared
Euclidean distance; this is intuitively the "nearest" mean.
Update step: Calculate the new means (centroids) of the observations in the new clusters.
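A minimal numpy sketch of one assignment-and-update iteration (the data X, the value k = 3 and the initialization are illustrative assumptions; the empty-cluster check is omitted for brevity):

import numpy as np

X = np.random.rand(100, 5)                                  # illustrative data: 100 observations, 5 attributes
centroids = X[np.random.choice(len(X), 3, replace=False)]   # k = 3 initial means chosen from the data

# Assignment step: each observation goes to the nearest centroid (squared Euclidean distance)
d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)

# Update step: each centroid becomes the mean of the observations assigned to it
centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])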
One advantage of k-means clustering is that computation of distances among all pairs of observations is not necessary. However, the number of clusters k needs to be pre-specified. Once that is determined, an arbitrary partition of the data into k clusters is the starting point. Sequential assignment and update will then find the clusters that are most separated. The cardinal rule of cluster building is that, at every step, the within-cluster variance will decrease (or stay the same) and the between-cluster variance will increase (or stay the same).
Random assignment of initial centroids eliminates bias. However, it is not an efficient algorithmic process. An alternative is to start with a set of pre-specified starting points, though it is possible that such a set of initial clusters, if it converges to a local optimum, may introduce bias. The following algorithm is a compromise between efficiency and randomness.
Leader Algorithm:
Step 1: Select the first item from the list. This item forms the centroid of the first cluster.
Step 2: Search through the subsequent items until an item is found that is at least distance δ away from every previously defined cluster centroid. This item forms the centroid of the next cluster.
Step 3: Step 2 is repeated until all k cluster centroids are obtained or no further items can be assigned.
Step 4: The initial clusters are obtained by assigning items to the nearest cluster centroids.
The distance δ, along with the number of clusters k, is the input to this algorithm. It is possible that domain knowledge or various other subjective considerations determine the value of k.
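A minimal sketch of the leader step (the function name and the threshold handling are illustrative):

import numpy as np

def leader_centroids(X, k, delta):
    # Step 1: the first item starts the first cluster
    centroids = [X[0]]
    # Steps 2 and 3: scan the remaining items until k centroids are found
    for item in X[1:]:
        if min(np.linalg.norm(item - c) for c in centroids) >= delta:
            centroids.append(item)
        if len(centroids) == k:
            break
    return np.array(centroids)

# Step 4 then assigns every item to its nearest centroid.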
There are many methods recommended for determining an optimal number of partitions. Unfortunately, there is no closed-form solution to the problem of determining k. The choice is somewhat subjective, and graphical methods are often employed.
The objective of partitioning is to separate the observations so that the 'most' similar items are put together. Recall that singleton clusters will have the lowest value of the within-cluster sum of squares (WCSS), but that is not useful. Hence finding k is a matter of striking a balance between WCSS and cluster size.
For a given number of clusters, the total within-cluster sum of squares (WCSS) is computed. That value of k is chosen as optimum at which the addition of one more cluster does not lower the total WCSS appreciably.
The Elbow method looks at the total WCSS as a function of the number of clusters.
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 20):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x_full)
    # the inertia_ attribute returns the WCSS for that model
    wcss.append(kmeans.inertia_)
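The stored wcss values are then plotted against the number of clusters to locate the elbow; a minimal plotting sketch (assuming matplotlib.pyplot has been imported as plt, as in the earlier plots):

plt.figure(figsize=(10, 6))
plt.plot(range(1, 20), wcss, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Total within-cluster sum of squares')
plt.title('Elbow plot')
plt.show()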
The silhouette method measures how tightly the observations are clustered and how far apart the clusters are. For each observation, a silhouette score is constructed from the average distance between the point and all other points in the cluster to which it belongs, and the average distance between the point and the points in the nearest cluster to which it does not belong. The maximum value of the average silhouette score indicates the optimum value of k.
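Concretely, if a(i) is the mean distance from observation i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest other cluster, then the silhouette score of i is

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \qquad -1 \le s(i) \le 1,

and the average of s(i) over all observations is the quantity reported below.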
from sklearn.metrics import silhouette_score

ss = {1: 0}
for i in range(2, 20):
    clusterer = KMeans(n_clusters=i, init='k-means++', random_state=42)
    y = clusterer.fit_predict(x_full)
    # the higher the score (up to 1), the better
    s = silhouette_score(x_full, y)
    ss[i] = round(s, 5)
    print("The Average Silhouette Score for {} clusters is {}".format(i, round(s, 5)))
It is clear from Fig 13 that the maximum value of average silhouette score is achieved for k = 12, which,
therefore, is considered to be the optimum number of clusters for this data.
However, there are a number of merits to using a smaller number of clusters. The objective of this particular clustering effort is to devise a suitable recommendation system, and it may not be practical to manage a very large number of tailor-made recommendations. Hence, the final decision regarding an appropriate number of clusters must be taken after considering the within-cluster and between-cluster sums of squares. Recall that the within-cluster sum of squares is the sum of squared Euclidean distances of all the points within a cluster from the cluster centroid, and the between-cluster sum of squares measures the separation of the cluster centroids (it equals the total sum of squares minus the total within-cluster sum of squares).
Let us now proceed with 12 clusters. Clustering, like any other predictive modelling exercise, is an iterative process; the final recommendation may be quite different from the initial starting point.
Let us partition the 5456 Users into 12 clusters. A cluster plot is a helpful way to look at the spread and overlap of the clusters. Ideally, the clusters will be well separated with no (or minimal) overlap.
The algorithm for k-means is a greedy algorithm and is sensitive to the random starting points.
k = 12
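The cluster plot of Figure 14 projects the data onto the first two principal components and colours the points by cluster label; a minimal sketch (mirroring the 6-cluster plot shown later):

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

labels_12 = KMeans(n_clusters=k, random_state=0).fit(x_full).labels_   # k = 12 from above
coords = PCA(2).fit_transform(x_full)        # 2-D projection used only for plotting
plt.figure(figsize=(12, 7))
plt.scatter(coords[:, 0], coords[:, 1], c=labels_12)
plt.title('Cluster Plot for 12 Clusters')
plt.show()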
Note that the elbow plot and the silhouette plot both recommended a very large number of clusters for this data set. Though these two instruments are probably the most used for the determination of k, the final decision must not be based solely on such automatic considerations. It is clear from Fig 14 that 12 clusters are too many and that considerable overlap exists among the clusters.
Now it is important to investigate the other outcomes of the clustering procedure, namely, the total sum
of squares, the within sum of squares for each cluster, the total within sum of squares, the between
cluster sum of squares and the cluster centers.
from scipy.cluster.vq import vq

# k = 12: cluster centroids and the associated sums of squares
centroids = KMeans(n_clusters=12, random_state=0).fit(x_full).cluster_centers_
partition, euc_distance_to_centroids = vq(x_full, centroids)
TSS = np.sum((x_full - x_full.mean(0))**2)    # total sum of squares
SSW = np.sum(euc_distance_to_centroids**2)    # total within-cluster sum of squares
SSB = TSS - SSW                               # between-cluster sum of squares
print('Between Sum of Squares is ', round(SSB, 2))
print('Total Sum of Squares is ', round(TSS, 2))
WSS = pd.DataFrame([euc_distance_to_centroids**2,
                    KMeans(n_clusters=12, random_state=0).fit(x_full).labels_]).T
print('The Within Sum of Squares is', np.round(WSS.groupby(1).sum().values, 2))
print('Total Within Sum of Squares is ', round(SSW, 2))
print('The Cluster Size is', WSS.groupby(1).count().values)
Since k = 12 is not a practical choice, let us consider several other values of k to determine a set of well-separated
meaningful clusters.
As before, the final decision must be based on a comparison of the within-cluster and between-cluster sums of squares.
# k = 3: repeat the computation
centroids = KMeans(n_clusters=3, random_state=0).fit(x_full).cluster_centers_
partition, euc_distance_to_centroids = vq(x_full, centroids)
TSS = np.sum((x_full - x_full.mean(0))**2)
SSW = np.sum(euc_distance_to_centroids**2)
SSB = TSS - SSW
print('Between Sum of Squares is ', round(SSB, 2))
print('Total Sum of Squares is ', round(TSS, 2))
WSS = pd.DataFrame([euc_distance_to_centroids**2,
                    KMeans(n_clusters=3, random_state=0).fit(x_full).labels_]).T
print('The Within Sum of Squares is', np.round(WSS.groupby(1).sum().values, 2))
print('Total Within Sum of Squares is ', round(SSW, 2))
print('The Cluster Size is', WSS.groupby(1).count().values)
Note that the total sum of squares in the data does not depend on the number of clusters. The total sum of squares for the current data set is 215751.36, which is a measure of the total variability in the data. Each cluster has a different within-cluster sum of squares. For k = 12, the largest within-cluster sum of squares is 13530.54 and the total within-cluster sum of squares is 110333.8; the between-cluster sum of squares is 105417.56. Let us now investigate the case k = 3. Here the between-cluster sum of squares is smaller than the within-cluster sum of squares for two of the clusters; hence k = 3 cannot be recommended.
Next, a few other values of k are considered; as shown below, k = 6 emerges as a suitable choice.
from sklearn.decomposition import PCA

# Project the data on the first two principal components and colour by cluster label
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(x_full)
plt.figure(figsize=(12, 7))
plt.scatter(x=plot_columns[:, 1], y=plot_columns[:, 0],
            c=KMeans(n_clusters=6, random_state=0).fit(x_full).labels_)
plt.title('Cluster Plot for 6 Clusters')
plt.show()
A comparison of the various sums of squares is given below for alternative values of k.
[email protected]
9T6KVYUXDH
import numpy as np
from scipy.cluster.vq import vq
#!pip install tabulate
from tabulate import tabulate

for i in range(3, 7):
    centroids = KMeans(n_clusters=i, random_state=0).fit(x_full).cluster_centers_
    partition, euc_distance_to_centroids = vq(x_full, centroids)
    wss = ((x_full - x_full.mean(0))**2).sum(0)   # per-attribute total sum of squares (not used below)
    TSS = np.sum((x_full - x_full.mean(0))**2)
    SSW = np.sum(euc_distance_to_centroids**2)
    SSB = TSS - SSW
    WSS = pd.DataFrame([euc_distance_to_centroids**2,
                        KMeans(n_clusters=i, random_state=0).fit(x_full).labels_]).T
    print(tabulate([[i, round(SSB, 2), round(TSS, 2),
                     np.round(WSS.groupby(1).sum().values, 2),
                     round(SSW, 2), WSS.groupby(1).count().values]],
                   headers=['Number of Clusters', 'B/w SS', 'Total SS', 'Within SS',
                            'Total Within SS', 'Size']))
    print('\n')
Except for cluster 2 having a small overlap with cluster 3, the clusters are well separated. For 6 clusters, all within-cluster sums of squares are small compared to the between-cluster sum of squares. Hence k = 6 is recommended as the optimum number of clusters in this case. The last step of the clustering procedure is cluster profiling: based on the mean values of the attributes in the different clusters, the recommender system might be developed.
Let us profile the clusters, as was done for hierarchical clustering, to identify patterns suitable for user segmentation.
# x: the ratings DataFrame with the k-means cluster label stored in column 'kclusters'
round(x.groupby('kclusters').mean(), 2).T
kclusters 0 1 2 3 4 5
Churches 1.50 0.79 1.09 1.30 2.09 1.94
Resorts 1.87 1.08 2.14 3.27 2.85 2.34
Beaches 1.96 1.68 1.97 3.49 2.67 2.96
Parks 2.85 1.67 2.13 3.63 2.48 3.63
Theatres 3.16 1.72 2.47 4.18 2.20 3.41
Museums 2.79 1.81 3.39 3.89 2.06 2.76
Malls 3.49 3.09 4.66 3.84 2.21 2.26
Zoo 3.31 1.94 3.49 2.48 1.69 1.78
Restaurants 3.83 2.87 4.59 2.77 2.09 2.08
Bars 3.73 2.85 3.54 2.70 1.57 2.21
6.1 Divisive Clustering
This algorithm is rarely used. The basic principle of divisive clustering was published as the DIANA (DIvisive ANAlysis Clustering) algorithm. Initially, all the data are in a single cluster, and the largest cluster is split repeatedly until every object is separate. At each split, DIANA chooses the object with the maximum average dissimilarity to the rest of its cluster to start a new splinter cluster, and then moves to it all objects that are more similar to the splinter cluster than to the remainder.
(https://github.com/div338/Divisive-Clustering-Analysis-Program-DIANA- )
6.2 Scaling
Standardization or scaling is an important aspect of data pre-processing. Many machine learning algorithms are sensitive to the scale of the data. Recall that scaling all numerical variables was strongly recommended in the case of PCA and FA. Similarly, for clustering too, scaling is usually applied.
However, in this case we have not applied scaling, because all the variables here are ratings with values between 0 and 5. The variables are naturally on a common scale, and it was not necessary to standardize them any further.
The main problem with using discrete attributes in clustering lies in computing the distance between two items. All the distance measures defined above (Euclidean, Minkowski's etc.) are defined for continuous variables only. To use binary or nominal variables in a clustering algorithm, a meaningful distance measure for discrete variables needs to be defined. There are several options, such as cosine similarity and the Mahalanobis Distance; we will not go into depth with these cases. In Python, the Gower distance may be used to deal with mixed data types.
Gower distance: For each variable type, a different distance is used and scaled to fall between 0 and 1. For continuous variables the (range-scaled) Manhattan distance is used. Ordinal variables are ranked first and then the Manhattan distance is used with a special adjustment for ties. Nominal variables are compared by simple matching: the contribution of a variable is 0 when the two categories agree and 1 when they differ. The overall Gower distance is the average of these per-variable contributions.
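As a small illustration of the idea, a Gower-type distance between observations with one numeric and one nominal attribute can be computed by hand; the data and helper function below are purely illustrative (dedicated packages exist for real use):

import pandas as pd

# Three observations with one numeric and one nominal attribute (illustrative data)
data = pd.DataFrame({'income': [40.0, 70.0, 55.0],
                     'city':   ['Rome', 'Paris', 'Rome']})

def gower_pair(a, b, num_range):
    # numeric part: absolute difference scaled by the variable's range (falls in 0..1)
    d_num = abs(a['income'] - b['income']) / num_range
    # nominal part: simple matching (0 if the categories agree, 1 otherwise)
    d_nom = 0.0 if a['city'] == b['city'] else 1.0
    return (d_num + d_nom) / 2          # average over the two variables

rng = data['income'].max() - data['income'].min()
print(gower_pair(data.loc[0], data.loc[1], rng))   # different cities: 1.0
print(gower_pair(data.loc[0], data.loc[2], rng))   # same city: 0.25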
k-Medians and k-Medoids: This is a variant of k-means clustering where, instead of the squared deviations from the means, the absolute deviations from the medians are minimized. The difficulty is that for multivariate data there is no universally accepted definition of the median. Another option is k-medoid partitioning.
In this algorithm, k representative observations, called medoids, are chosen. Each observation is assigned to the cluster whose medoid is closest, and the medoids are then updated. A swapping cost is involved when a current medoid is replaced by a new one. Various algorithms have been developed, namely PAM (Partitioning Around Medoids), CLARA and randomized sampling.
(https://github.com/letiantian/kmedoids)
Since the mean is not defined for categorical data, k-modes clustering is a sensible alternative. Once an optimal number of clusters is determined, initial cluster centres are arbitrarily selected. Each data object is assigned to the cluster whose centre is closest to it by a suitably defined distance; note that for cluster analysis the distance measure is all-important. After each allocation, the cluster modes are updated. A good discussion is given in "A review on k-mode clustering algorithm" by M. Goyal and S. Aggarwal, published in the International Journal of Advanced Research in Computer Science. (https://www.ijarcs.info/index.php/Ijarcs/article/view/4301/4008)
(https://pypi.org/project/kmodes/)
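A minimal sketch, assuming the third-party kmodes package linked above and an illustrative categorical data set (the class name and arguments follow that package's documented interface):

import numpy as np
from kmodes.kmodes import KModes   # pip install kmodes

# Illustrative categorical data: 100 objects, 4 categorical attributes
data = np.random.choice(['a', 'b', 'c'], size=(100, 4))

km = KModes(n_clusters=3, init='Huang', n_init=5, random_state=0)
labels = km.fit_predict(data)
print(km.cluster_centroids_)        # the modes of each cluster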
[Figure: clustering summary, showing Hierarchical and Partition approaches, the dendrogram, and the final partition.]
7. References
Hennig, C., Meila, M., Murtagh, F. and Rocci, R. (2015). Handbook of Cluster Analysis. Chapman and Hall/CRC.
Everitt, B. S., Landau, S., Leese, M. and Stahl, D. (2011). Cluster Analysis. 5th Ed. Wiley.
Kassambara, A. (2017). Practical Guide to Cluster Analysis in R. sthda.com.
https://online.stat.psu.edu/stat505/lesson/14
[email protected]
9T6KVYUXDH
35
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.