
Lecture 1 (UNIT 1)

Clustering
Supervised learning vs.
unsupervised learning
• Supervised learning: discover patterns in the data that relate data
attributes with a target (class) attribute.
• These patterns are then utilized to predict the values of the target attribute in
future data instances.
• Unsupervised learning: The data have no target attribute.
• We want to explore the data to find some intrinsic structures in them.
What is Unsupervised Learning?
• Unsupervised learning is a machine learning paradigm where the
algorithm learns patterns and structures from input data without
explicit supervision.
• Unlike supervised learning, there are no labeled target variables. The
algorithm must find inherent patterns on its own.
• The primary goal is to explore and uncover underlying relationships
and groupings in the data.
Cont.
 It’s about learning interesting/useful structures in the data
(unsupervisedly!)

 There is no supervision (no labels/responses), only inputs

 Some examples of unsupervised learning


 Clustering: Grouping similar inputs together (and dissimilar ones far apart)
 Dimensionality Reduction: Reducing the data dimensionality
 Estimating the probability density of data (which distribution “generated”
the data)
Clustering
Clustering
• Clustering is a technique for finding similarity groups in data, called clusters. I.e.,
• it groups data instances that are similar to (near) each other in one cluster and data instances that are very
different (far away) from each other into different clusters.
• Clustering is often called an unsupervised learning task as no class values denoting an a priori
grouping of the data instances are given, which is the case in supervised learning.
• Due to historical reasons, clustering is often considered synonymous with unsupervised learning.
• In fact, association rule mining is also unsupervised
• This chapter focuses on clustering.
Clustering techniques
• Clustering has been studied extensively for more than 40 years and across many disciplines due to its broad applications.
• We shall cover the following clustering techniques.

Partitioning methods:
• k-Means algorithm [1957, 1967]
• k-Medoids algorithm
• k-Modes [1998]
• Fuzzy c-means algorithm [1999]

Hierarchical methods:
• Divisive
• Agglomerative

Density-based methods:
• STING [1997]
• DBSCAN [1996]
• CLIQUE [1998]
k-Means Algorithm
• A widely used version of the k-Means clustering algorithm was proposed by J. A. Hartigan and M. A. Wong [1979].
• k-Means divides objects into clusters so that objects within a cluster share similarities and are dissimilar to the objects belonging to other clusters.
• Given a set of n distinct objects, the k-Means clustering algorithm partitions the objects into k clusters such that intra-cluster similarity is high but inter-cluster similarity is low.
• In this algorithm, the user has to specify k, the number of clusters. The objects are assumed to be described by numeric attributes, so any one of the standard distance metrics can be used to demarcate the clusters.
What is the intuition of k-Means?
The goal of k-Means is to locate the centroids around which the data are clustered; these are the "means" in "k-Means". If we know where these centroids are, the intuition behind the algorithm is that we can then classify each point by assigning it to its closest cluster centre.
k-Means Algorithm
The algorithm can be stated as follows.
• First, it selects k objects at random from the set of n objects. These k objects are treated as the centroids (centres of gravity) of k clusters.
• Each of the remaining objects is assigned to its closest centroid. The collection of objects assigned to a centroid is called a cluster.
• Next, the centroid of each cluster is updated (by calculating the mean of the attribute values of the objects in that cluster).
• The assignment and update procedure is repeated until some stopping criterion is reached (such as a maximum number of iterations, centroids remaining unchanged, or no reassignments).
k-Means Algorithm
Input: D, a dataset containing n objects; k, the number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each of the objects in D:
   • Compute the distance between the current object and the k cluster centroids.
   • Assign the current object to the cluster whose centroid is closest.
3. Compute the "cluster centre" (mean) of each cluster. These become the new cluster centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied.
5. Stop.
A minimal Python sketch of these steps is given below.
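The following is a minimal NumPy sketch of these steps (an illustrative sketch only, not the implementation used later in this lecture; the tolerance and seed parameters are assumptions, and handling of empty clusters is omitted).

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster of its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned objects
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move (convergence criterion)
        if np.allclose(new_centroids, centroids, atol=tol):
            break
        centroids = new_centroids
    return centroids, labels

# Example call on the eight points of the worked example later in this lecture
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centroids, labels = kmeans(X, k=3)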
k-Means Algorithm
Notes:
1) Objects are defined in terms of a set of attributes, where each attribute is of a continuous (numeric) data type.
2) Distance computation: any distance metric may be used, such as Euclidean or Manhattan distance, or cosine similarity.
3) Minimum distance is the measure of closeness between an object and a centroid.
4) Mean calculation: the centroid is the mean value of each attribute over all objects in the cluster.
5) Convergence criteria: any one of the following can serve as the termination condition of the algorithm.
• The maximum permissible number of iterations is reached.
• No change of centroid values in any cluster.
• Zero (or no significant) movement of objects from one cluster to another.
• Cluster quality reaches a certain acceptable level.
Suppose that you have to cluster points into three clusters, where the points are:
A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9).
The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the centres of the three clusters, i.e. the initial centroids are (2, 10), (5, 8) and (1, 2).

Iteration 1: distances to the centroids c1 = (2, 10), c2 = (5, 8), c3 = (1, 2)

Point        d to c1   d to c2   d to c3   Cluster
A1 (2, 10)   0.00      3.61      8.06      1
A2 (2, 5)    5.00      4.24      3.16      3
A3 (8, 4)    8.49      5.00      7.28      2
B1 (5, 8)    3.61      0.00      7.21      2
B2 (7, 5)    7.07      3.61      6.71      2
B3 (6, 4)    7.21      4.12      5.39      2
C1 (1, 2)    8.06      7.21      0.00      3
C2 (4, 9)    2.24      1.41      7.62      2

New centroids: c1 = (2, 10), c2 = (6, 6), c3 = (1.5, 3.5)
Iteration 2: distances to the centroids c1 = (2, 10), c2 = (6, 6), c3 = (1.5, 3.5)

Point        d to c1   d to c2   d to c3   Cluster   New Cluster
A1 (2, 10)   0.00      5.66      6.52      1         1
A2 (2, 5)    5.00      4.12      1.58      3         3
A3 (8, 4)    8.49      2.82      6.52      2         2
B1 (5, 8)    3.61      2.24      5.70      2         2
B2 (7, 5)    7.07      1.41      5.70      2         2
B3 (6, 4)    7.21      2.00      4.53      2         2
C1 (1, 2)    8.06      6.40      1.58      3         3
C2 (4, 9)    2.24      3.61      6.04      2         1

New centroids: c1 = (3, 9.5), c2 = (6.5, 5.25), c3 = (1.5, 3.5)
Iteration 3: distances to the centroids c1 = (3, 9.5), c2 = (6.5, 5.25), c3 = (1.5, 3.5)

Point        d to c1   d to c2   d to c3   Cluster   New Cluster
A1 (2, 10)   1.12      6.54      6.52      1         1
A2 (2, 5)    4.61      4.51      1.58      3         3
A3 (8, 4)    7.43      1.95      6.52      2         2
B1 (5, 8)    2.50      3.13      5.70      2         1
B2 (7, 5)    6.02      0.56      5.70      2         2
B3 (6, 4)    6.26      1.35      4.53      2         2
C1 (1, 2)    7.76      6.39      1.58      3         3
C2 (4, 9)    1.12      4.51      6.04      1         1

New centroids: c1 = (3.67, 9), c2 = (7, 4.33), c3 = (1.5, 3.5)
Iteration 4: distances to the centroids c1 = (3.67, 9), c2 = (7, 4.33), c3 = (1.5, 3.5)

Point        d to c1   d to c2   d to c3   Cluster   New Cluster
A1 (2, 10)   1.94      7.56      6.52      1         1
A2 (2, 5)    4.33      5.04      1.58      3         3
A3 (8, 4)    6.62      1.05      6.52      2         2
B1 (5, 8)    1.67      4.18      5.70      1         1
B2 (7, 5)    5.21      0.67      5.70      2         2
B3 (6, 4)    5.52      1.05      4.53      2         2
C1 (1, 2)    7.49      6.44      1.58      3         3
C2 (4, 9)    0.33      5.55      6.04      1         1

No object changes its cluster in this iteration, so the algorithm stops. Final centroids: c1 = (3.67, 9), c2 = (7, 4.33), c3 = (1.5, 3.5). A short scikit-learn check of this example is given below.
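As a quick check of this worked example, the following sketch runs scikit-learn's KMeans with the initial centroids fixed to A1, B1 and C1 (floating-point results may differ slightly from the hand-rounded values above).

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]])   # A1, A2, A3, B1, B2, B3, C1, C2
init = np.array([[2, 10], [5, 8], [1, 2]])            # initial centroids: A1, B1, C1

km = KMeans(n_clusters=3, init=init, n_init=1).fit(points)
print(km.labels_)             # final cluster index of each point
print(km.cluster_centers_)    # should be close to (3.67, 9), (7, 4.33) and (1.5, 3.5)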
Illustration of the k-Means clustering algorithm

Table 16.1: 16 objects with two attributes A1 and A2

A1    A2
6.8   12.6
0.8   9.8
1.2   11.6
2.8   9.6
3.8   9.9
4.4   6.5
4.8   1.1
6.0   19.9
6.2   18.5
7.6   17.4
7.8   12.2
6.6   7.7
8.2   4.5
8.4   6.9
9.0   3.4
9.6   11.1

Fig 16.1: Scatter plot (A1 vs A2) of the data in Table 16.1
Illustration of the k-Means clustering algorithm
• Suppose k = 3. Three objects are chosen at random, shown circled in Fig 16.1. These three centroids are listed below.

Initial centroids chosen randomly:
Centroid   A1    A2
c1         3.8   9.9
c2         7.8   12.2
c3         6.2   18.5

• Let us consider the Euclidean distance measure (L2 norm) as the distance measurement in our illustration.
• Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3 respectively. The distance calculations are shown in Table 16.2.
• The assignment of each object to the respective centroid is shown in the right-most column, and the clustering so obtained is shown in Fig 16.2.
Illustration of the k-Means clustering algorithm

Table 16.2: Distance calculation

A1    A2     d1     d2     d3     Cluster
6.8   12.6   4.0    1.1    5.9    2
0.8   9.8    3.0    7.4    10.2   1
1.2   11.6   3.1    6.6    8.5    1
2.8   9.6    1.0    5.6    9.5    1
3.8   9.9    0.0    4.6    8.9    1
4.4   6.5    3.5    6.6    12.1   1
4.8   1.1    8.9    11.5   17.5   1
6.0   19.9   10.2   7.9    1.4    3
6.2   18.5   8.9    6.5    0.0    3
7.6   17.4   8.4    5.2    1.8    3
7.8   12.2   4.6    0.0    6.5    2
6.6   7.7    3.6    4.7    10.8   1
8.2   4.5    7.0    7.7    14.1   1
8.4   6.9    5.5    5.3    11.8   2
9.0   3.4    8.3    8.9    15.4   1
9.6   11.1   5.9    2.1    8.1    2

Fig 16.2: Initial clusters with respect to Table 16.2
Illustration of the k-Means clustering algorithm
The new centroids of the three clusters, computed as the means of the attribute values A1 and A2 of their members, are shown in the table below. The clusters with the new centroids are shown in Fig 16.3.

Calculation of new centroids:
New Centroid   A1    A2
c1             4.6   7.1
c2             8.2   10.7
c3             6.6   18.6

Fig 16.3: Initial clusters with the new centroids
Illustration of k-Means clustering algorithms
We next reassign the 16 objects to three clusters by determining which
centroid is closest to each one. This gives the revised set of clusters shown
in Fig 16.4.
Note that point p moves from cluster C2 to cluster C1.

Fig 16.4: Cluster after first iteration


Illustration of the k-Means clustering algorithm
• The centroids newly obtained after the second iteration are given in the table below. Note that the centroid c3 remains unchanged, while c1 and c2 change a little.
• With respect to the newly obtained cluster centres, the 16 points are reassigned again. These are the same clusters as before. Hence, their centroids also remain unchanged.
• Considering this as the termination criterion, the k-Means algorithm stops here. Hence, the final clusters in Fig 16.5 are the same as in Fig 16.4.

Fig 16.5: Clusters after the second iteration

Cluster centres after the second iteration:
Centroid   A1    A2
c1         5.0   7.1
c2         8.1   12.0
c3         6.6   18.6
Problem
• Cluster the following eight points (with (x, y) representing locations)
into three clusters:
• A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4,
9)

• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
Program for K-means Clustering

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Two features of the sample points
a = [2, 5, 7, 9, 3, 11, 14, 16, 10, 12, 18]
b = [21, 19, 24, 17, 22, 25, 24, 22, 21, 25, 27]

# Visualize the raw data
plt.scatter(a, b)
plt.show()

data = list(zip(a, b))
inertias = []

# Run k-Means for k = 1..10 and record the inertia (SSE) of each run
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='x')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

# Cluster the data with the chosen number of clusters and colour points by label
kmeans = KMeans(n_clusters=3)
kmeans.fit(data)
plt.scatter(a, b, c=kmeans.labels_)
plt.show()
Algorithm Implementation on a Synthetic Dataset (make_blobs)

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate a synthetic 2-D dataset with three well-separated blobs
X, y = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

plt.scatter(X[:, 0], X[:, 1], c='blue', marker='o', s=30)
plt.grid()
plt.show()

# Fit k-Means with k = 3 and random initial centroids (10 restarts)
km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=0)
y_km = km.fit_predict(X)

# Plot each cluster with its own marker, plus the final centroids
plt.scatter(X[y_km == 0, 0], X[y_km == 0, 1], s=30, c='green', marker='s', label='cluster1')
plt.scatter(X[y_km == 1, 0], X[y_km == 1, 1], s=30, c='blue', marker='o', label='cluster2')
plt.scatter(X[y_km == 2, 0], X[y_km == 2, 1], s=30, c='red', marker='v', label='cluster3')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            s=200, marker='*', c='black', label='centroids')
plt.legend()
plt.grid()
plt.show()
WAP to implement K-means clustering on any dataset.

# Import libraries
import numpy as np
import matplotlib.pyplot as mtp
import pandas as pd
from sklearn.cluster import KMeans
from google.colab import files

# Upload and read the Mall_Customers.csv dataset (when running in Colab)
uploaded = files.upload()
dataset = pd.read_csv('/content/Mall_Customers.csv')
print(dataset.head())

# Select the 'Annual Income' and 'Spending Score' columns (columns 3 and 4)
x = dataset.iloc[:, [3, 4]].values

# Find the optimal number of clusters using the elbow method
wcss = []   # within-cluster sum of squares (inertia) for each k
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

mtp.plot(range(1, 11), wcss)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS (inertia)')
mtp.show()
print(wcss)

# Train the algorithm on the dataset with the chosen k = 5
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)
print(y_predict)

# Visualize the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=10, c='purple', label='Cluster 1')
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=10, c='pink', label='Cluster 2')
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=10, c='magenta', label='Cluster 3')
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=10, c='cyan', label='Cluster 4')
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=10, c='blue', label='Cluster 5')
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
Comments on k-Means algorithm
Let us analyse the k-Means algorithm and discuss the pros and cons of the algorithm.
We shall refer to the following notations in our discussion.
• Notations:
• x : an object under clustering
• n : number of objects under clustering
• Cᵢ : the i-th cluster
• cᵢ : the centroid of cluster Cᵢ
• nᵢ : number of objects in cluster Cᵢ
• c : the centroid of all objects
• k : number of clusters
Comments on k-Means algorithm
1. Value of k:
• The k-Means algorithm produces only one set of clusters, for which the user must specify the desired number k of clusters.
• In fact, k should be the best guess of the number of clusters present in the given data. Choosing the best value of k for a given dataset is, therefore, an issue.
• We may not have any idea of the possible number of clusters for high-dimensional data, or for data that cannot be scatter-plotted.
• Further, the possible number of clusters is hidden or ambiguous in image, audio, video and multimedia clustering applications, etc.
• There is no principled way to know what the value of k ought to be. We may try successive values of k starting with 2.
• The process is stopped when two consecutive values of k produce more-or-less identical results (with respect to some cluster quality estimation).
• Normally k ≪ n; a commonly cited heuristic is k ≈ √(n/2).
Comments on k-Means algorithm
k versus cluster quality
• Usually, there is some objective function to be met as a goal of clustering. One such objective function is the sum-of-squared error, denoted SSE and defined as

   SSE = Σ_{i=1..k} Σ_{x ∈ Cᵢ} ‖x − cᵢ‖²

• Here, ‖x − cᵢ‖ denotes the error if x is in cluster Cᵢ with cluster centroid cᵢ.
• Usually, this error is measured with a distance norm such as L1 or L2, or with cosine similarity. The sketch below computes this SSE directly.
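As an illustration, the following sketch computes the SSE directly from this definition and compares it with scikit-learn's inertia_ attribute, which stores the same quantity (the synthetic make_blobs data here is only for demonstration).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# SSE = sum of squared distances of each point to its own cluster centroid
sse = sum(np.sum((X[km.labels_ == i] - c) ** 2)
          for i, c in enumerate(km.cluster_centers_))
print(sse, km.inertia_)   # the two values should agree (up to rounding)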
Comments on k-Means algorithm
k versus cluster quality
• With reference to an arbitrary experiment, suppose the following results are obtained.

k    SSE
1    62.8
2    12.3
3    9.4
4    9.3
5    9.2
6    9.1
7    9.05
8    9.0

• With respect to this observation, we can choose k = 2: it is the smallest value of k that already gives a reasonably good result.
• Note: if k = n then SSE = 0; however, such a clustering is useless! This is another example of overfitting.
Comments on k-Means algorithm
2. Choosing initial centroids:
• Another requirement of the k-Means algorithm is to choose an initial cluster centroid for each of the k clusters.
• It is observed that the k-Means algorithm terminates whatever the initial choice of the cluster centroids.
• It is also observed that the initial choice influences the ultimate cluster quality. In other words, the result may be trapped in a local optimum if the initial centroids are not chosen properly.
• One technique that is usually followed to avoid this problem is to perform multiple runs, each with a different set of randomly chosen initial centroids, and then select the best clustering (with respect to some quality measurement criterion, e.g. SSE), as sketched below.
• However, an exhaustive strategy suffers from a combinatorial explosion due to the number of all possible solutions.
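A sketch of this multiple-runs strategy is shown below: k-Means is run several times with different random initial centroids and the clustering with the lowest SSE is kept (scikit-learn's n_init parameter automates exactly this; the dataset and seeds here are arbitrary).

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=1)

best = None
for seed in range(10):                       # 10 different random initializations
    km = KMeans(n_clusters=4, init='random', n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km                            # keep the clustering with the lowest SSE

print(best.inertia_)
# Equivalent one-liner: KMeans(n_clusters=4, init='random', n_init=10).fit(X)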
Comments on k-Means algorithm
2. Choosing initial centroids:
• A detailed calculation reveals that there are roughly kⁿ/k! possible combinations to examine in a search for the global optimum.
• For example, there are about 4²⁰/4! ≈ 4.6 × 10¹⁰ different ways to cluster 20 items into 4 clusters!
• Thus, the strategy has its own limitations and is practical only if
1) the sample is relatively small (~100-1000), and
2) k is relatively small compared to n (i.e., k ≪ n).
Comments on k-Means algorithm
3. Distance measurement:
• To assign a point to the closest centroid, we need a proximity measure that quantifies the notion of "closest" for the objects under clustering.
• Usually, Euclidean distance (L2 norm) is the best measure when object points are defined in an n-dimensional Euclidean space.
• Another measure, namely cosine similarity, is more appropriate when the objects are of document type.
• Further, there may be other types of proximity measures that are appropriate in the context of particular applications, for example, Manhattan distance (L1 norm), the Jaccard measure, etc.
Comments on k-Means algorithm
3. Distance measurement:
Thus, in the context of different measures, the objective function (i.e., the convergence criterion) of a clustering can be stated as follows.

Data in Euclidean space (L2 norm): the objective is to minimize the sum-of-squared error
   SSE = Σ_{i=1..k} Σ_{x ∈ Cᵢ} ‖x − cᵢ‖₂²

Data with Manhattan distance (L1 norm): the objective is to minimize the sum-of-absolute error, denoted SAE and defined as
   SAE = Σ_{i=1..k} Σ_{x ∈ Cᵢ} ‖x − cᵢ‖₁
Comments on k-Means algorithm
Distance with document objects
Suppose a set of n document objects is represented as a document-term matrix (DTM), a typical form of which is shown below.

Document   t1   t2   ...   tn
D1
D2
...
Dn

Here, the objective function, called total cohesion, is denoted TC and defined as
   TC = Σ_{i=1..k} Σ_{x ∈ Cᵢ} cos(x, cᵢ)
where
   cos(x, cᵢ) = (x · cᵢ) / (‖x‖ ‖cᵢ‖)
A toy calculation of TC is sketched below.
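The toy sketch below evaluates this total-cohesion objective for a small, made-up document-term matrix with an assumed two-cluster assignment.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

dtm = np.array([[2, 0, 1, 0],    # D1: term counts for t1..t4
                [1, 1, 0, 0],    # D2
                [0, 3, 0, 2],    # D3
                [0, 2, 1, 3]])   # D4
labels = np.array([0, 0, 1, 1])  # assume D1, D2 in cluster 0 and D3, D4 in cluster 1

tc = 0.0
for i in range(2):
    members = dtm[labels == i]
    centroid = members.mean(axis=0, keepdims=True)
    # add the cosine similarity of each member document to its cluster centroid
    tc += cosine_similarity(members, centroid).sum()
print(tc)   # a larger TC means a more cohesive clustering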
Comments on k-Means algorithm
Note: the criteria of the objective function with the different proximity measures are:
1. SSE (using the L2 norm): to minimize the SSE.
2. SAE (using the L1 norm): to minimize the SAE.
3. TC (using cosine similarity): to maximize the TC.
Comments on k-Means algorithm
4. Type of objects under clustering:
• The k-Means algorithm can be applied only when the mean of a cluster is defined (hence the name k-Means). The cluster mean (also called the centroid) of a cluster Cᵢ is defined as
   cᵢ = (1/nᵢ) Σ_{x ∈ Cᵢ} x
• In other words, the mean calculation assumes that each object is defined with numerical attribute(s). Thus, we cannot apply k-Means directly to objects that are defined with categorical attributes.
• More precisely, the k-Means algorithm requires that some definition of a cluster mean exists, but it does not necessarily have to be the one defined in the above equation.
• In fact, k-Means is a very general clustering algorithm and can be used with a wide variety of data types, such as documents, time series, etc.

? How to find the mean of objects with composite attributes?
Comments on k-Means algorithm
Note:
1) When SSE (L2 norm) is used as the objective function and the objective is to minimize it, then the best cluster centroid is the mean value of the objects in the cluster.
2) When the objective function is defined as SAE (L1 norm), minimizing the objective function implies that the cluster centroid is the median of the objects in the cluster.
The above two interpretations can be readily verified as shown in the next slides.
Comments on k-Means algorithm
Case 1: SSE
We know
   SSE = Σ_{i=1..k} Σ_{x ∈ Cᵢ} (cᵢ − x)²
To minimize SSE, we set the derivative with respect to the centroid cᵢ to zero:
   ∂SSE/∂cᵢ = ∂/∂cᵢ Σ_{i=1..k} Σ_{x ∈ Cᵢ} (cᵢ − x)² = 0
Thus,
   Σ_{x ∈ Cᵢ} 2(cᵢ − x) = 0
Or,
   nᵢ cᵢ = Σ_{x ∈ Cᵢ} x
Or,
   cᵢ = (1/nᵢ) Σ_{x ∈ Cᵢ} x
• Thus, the best centroid for minimizing the SSE of a cluster is the mean of the objects in the cluster.
Comments on k-Means algorithm
Case 2: SAE
We know
   SAE = Σ_{i=1..k} Σ_{x ∈ Cᵢ} |cᵢ − x|
To minimize SAE, we set the derivative with respect to cᵢ to zero:
   ∂SAE/∂cᵢ = ∂/∂cᵢ Σ_{i=1..k} Σ_{x ∈ Cᵢ} |cᵢ − x| = 0
Thus,
   Σ_{x ∈ Cᵢ} sign(cᵢ − x) = 0
Solving the above equation (the number of objects on either side of cᵢ must balance), we get
   cᵢ = median{ x | x ∈ Cᵢ }
• Thus, the best centroid for minimizing the SAE of a cluster is the median of the objects in the cluster.

? Interpret the best centroid for maximizing TC (with Cosine similarity measure) of
a cluster.

The above discussion is sufficient for the validation of the k-Means algorithm; a small numerical check of both results is given below.
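A quick numerical check of the two results above, on a made-up one-dimensional cluster: over a fine grid of candidate centroids, the SSE is minimized at the mean and the SAE at the median.

import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])           # a small 1-D "cluster"
candidates = np.linspace(0, 11, 1101)              # candidate centroid positions

sse = ((x[None, :] - candidates[:, None]) ** 2).sum(axis=1)
sae = np.abs(x[None, :] - candidates[:, None]).sum(axis=1)

print(candidates[sse.argmin()], x.mean())          # both are 3.6 (the mean)
print(candidates[sae.argmin()], np.median(x))      # both are 2.0 (the median)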
Comments on k-Means algorithm
5. Complexity analysis of the k-Means algorithm
Let us analyse the time and space complexities of the k-Means algorithm.

Time complexity:
The time complexity of the k-Means algorithm can be expressed as O(n · d · k · t), where
   n = number of objects
   d = number of attributes in the object definition
   k = number of clusters
   t = number of iterations

Thus, the time requirement is linear in the number of objects, and the algorithm runs in a modest time if k ≪ n and d ≪ n (the number of iterations t can be moderately controlled by capping its value).
Comments on k-Means algorithm
5. Complexity analysis of the k-Means algorithm

Space complexity: the storage requirement can be expressed as follows. The algorithm requires O(n · d) space to store the objects and O(n · k) space to store the proximity measures from the objects to the centroids of the k clusters.

Thus, the total storage complexity is O(n · (d + k)). That is, the space requirement is linear in n if d ≪ n and k ≪ n.
Comments on k-Means algorithm
6. Final comments:
Advantages:
• k-Means is simple and can be used for a wide variety of object types.
• It is also efficient from both the storage-requirement and execution-time points of view. By saving distance information from one iteration to the next, the actual number of distance calculations that must be made can be reduced (especially as the algorithm approaches termination).

? How can the similarity metric be utilized to run k-Means faster? What needs to be updated in each iteration?

Limitations:
• k-Means is not suitable for all types of data. For example, k-Means does not work on categorical data because the mean cannot be defined.
• k-Means finds a local optimum and may miss the global optimum.
Comments on k-Means algorithm
6. Final comments:
Limitations:
• k-Means has trouble clustering data that contain outliers. When SSE is used as the objective function, outliers can unduly influence the clusters that are produced. More precisely, in the presence of outliers the cluster centroids are not as representative as they would otherwise be, and the SSE measure is inflated as well.
• The k-Means algorithm cannot handle non-globular clusters, or clusters of different sizes and densities (see Fig 16.6 on the next slide).
• The k-Means algorithm does not entirely escape scalability issues (it is not always practical for very large databases).
Comments on k-Means algorithm

Fig 16.6: Some failure cases of the k-Means algorithm: clusters with different sizes, clusters with different densities, and non-convex shaped clusters
Common distance measures:
• The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.
They include:
1. The Euclidean distance (also called the 2-norm distance), given by:
   d(x, y) = √( Σⱼ (xⱼ − yⱼ)² )
2. The Manhattan distance (also called the taxicab norm or 1-norm), given by:
   d(x, y) = Σⱼ |xⱼ − yⱼ|
3. The maximum norm, given by:
   d(x, y) = maxⱼ |xⱼ − yⱼ|
4. The Mahalanobis distance corrects data for different scales and correlations in the variables:
   d(x, y) = √( (x − y)ᵀ S⁻¹ (x − y) ), where S is the covariance matrix
5. Inner product space: the angle between two vectors can be used as a distance measure when clustering high-dimensional data.
6. Hamming distance (sometimes called edit distance) measures the minimum number of substitutions required to change one member into another.
Several of these are illustrated in the short sketch below.
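Several of these measures can be computed with scipy.spatial.distance, as in this short sketch (the two example vectors are arbitrary; the Mahalanobis distance additionally needs the inverse covariance matrix, via distance.mahalanobis(x, y, VI)).

import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 10.0])
y = np.array([5.0, 8.0])

print(distance.euclidean(x, y))   # 2-norm: sqrt((2-5)^2 + (10-8)^2) ≈ 3.61
print(distance.cityblock(x, y))   # Manhattan / 1-norm: |2-5| + |10-8| = 5
print(distance.chebyshev(x, y))   # maximum norm: max(|2-5|, |10-8|) = 3
print(distance.cosine(x, y))      # 1 - cosine similarity (angle-based measure)
print(distance.hamming([1, 0, 1, 1], [1, 1, 0, 1]))   # fraction of differing positions = 0.5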
Hard vs Soft Clustering
Hard Clustering
• Also known as Exclusive Clustering or Crisp Clustering.
• Each data point is assigned to exactly one cluster.
• Results in non-overlapping clusters with clear boundaries.
• E.g., K-means
Advantages

• Simplicity and Ease of Implementation: Hard clustering algorithms are straightforward to


understand and implement.

• Computational Efficiency: Ideal for handling large datasets efficiently.

• Clear Cluster Membership: Each data point unambiguously belongs to a single cluster.

Disadvantages

• Sensitive to Initial Placement: Results can vary depending on the initial cluster centroids.

• Limited Handling of Overlapping Data: May struggle with complex data structures that have
overlapping clusters.

• Impact of Outliers: Outliers can significantly affect cluster assignments.


Soft Clustering
• Also known as Fuzzy Clustering/ Overlapping clustering.
• Allows data points to belong to multiple clusters with varying
membership degrees.
• Provides more flexibility and captures uncertainty in cluster
assignments.
• E.g., Fuzzy C-means (FCM); a small soft-assignment illustration is sketched after the advantages and disadvantages below.
Advantages
• Nuanced Cluster Assignments: Data points can have partial membership in multiple clusters,
providing a more nuanced representation.

• Handling Overlapping Data: Well-suited for datasets with complex or overlapping structures.

• Robustness to Outliers: Outliers may have low membership degrees in any cluster, reducing
their impact.
Disadvantages
• Computational Complexity: Soft clustering methods can be more computationally expensive
than their hard clustering counterparts.

• Determining the Number of Clusters: Requires the pre-specification of the number of


clusters or fuzziness coefficient.

• Interpretability Challenges: Fuzzy memberships might be more challenging to interpret.
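Fuzzy C-means itself is not included in scikit-learn; as a stand-in illustration of soft clustering, the sketch below uses a Gaussian mixture model, which likewise gives every point a degree of membership in each cluster (the data are synthetic and the parameters arbitrary).

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.2, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
memberships = gmm.predict_proba(X)         # shape (n_samples, 3); each row sums to 1
hard_labels = memberships.argmax(axis=1)   # collapse to a hard clustering if needed

print(np.round(memberships[:5], 3))        # soft membership degrees of the first 5 points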


How to Find the Optimal Number of Clusters
• The elbow method is a popular technique used to determine the
optimal number of clusters (k) in a clustering algorithm, such as K-
means.
• It involves plotting the cost (inertia) of clustering as a function of the
number of clusters and looking for the "elbow point" in the plot.
• The elbow point is the value of k at which the inertia starts to level off
or decrease at a slower rate.
• This point indicates that adding more clusters does not significantly
improve the clustering quality and suggests the appropriate number
of clusters for the data.
Cont.
• The elbow method helps us find the optimal number of clusters for
our data.
• It involves analyzing the inertia (within-cluster sum of squares) as a
function of the number of clusters.
Elbow Method: Step by Step
• Choose a range of k values to consider (e.g., 1 to 10).
• For each k value, run the K-means algorithm with k clusters on the
data.
• Calculate the inertia (sum of squared distances within clusters) for
each k.
• Plot the inertia values against the corresponding k values.
Cont.
Identifying the Elbow Point
• The elbow point is the optimal k value.
• It is the point where inertia starts to level off or decreases at a slower
rate.
• Adding more clusters beyond this point may not significantly improve
clustering quality.
Limitations of the Elbow Method
• The elbow method may not always yield a clear-cut elbow point,
especially for complex datasets.
• It is subjective, and the optimal k value may vary based on
interpretation.
Weaknesses of K-Means Clustering

1. When the number of data points is small, the initial grouping determines the clusters significantly.
2. The number of clusters, K, must be determined beforehand. A disadvantage is that the algorithm does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
3. We never know the real clusters: using the same data, inputting it in a different order may produce different clusters when the number of data points is small.
4. It is sensitive to the initial conditions. Different initial conditions may produce different clusterings, and the algorithm may be trapped in a local optimum.
Applications of K-Means Clustering
• It is relatively efficient and fast. It computes results in O(t·k·n) time, where n is the number of objects or points, k is the number of clusters and t is the number of iterations.
• k-Means clustering can be applied to many machine learning and data mining problems.
• It is used on acoustic data in speech understanding to convert waveforms into one of k categories (known as vector quantization).
• It is also used for choosing color palettes on old-fashioned graphical display devices, and for image quantization and image segmentation.
CONCLUSION
• The k-Means algorithm is useful for undirected knowledge discovery and is relatively simple. k-Means has found widespread usage in many fields, including unsupervised learning of neural networks, pattern recognition, classification analysis, artificial intelligence, image processing, machine vision, and many others.
Thank you!
