Lecture 1 (UNIT 1)
Clustering
Supervised learning vs. unsupervised learning
• Supervised learning: discover patterns in the data that relate data
attributes with a target (class) attribute.
• These patterns are then utilized to predict the values of the target attribute in
future data instances.
• Unsupervised learning: The data have no target attribute.
• We want to explore the data to find some intrinsic structures in them.
What is Unsupervised Learning?
• Unsupervised learning is a machine learning paradigm where the
algorithm learns patterns and structures from input data without
explicit supervision.
• Unlike supervised learning, there are no labeled target variables. The
algorithm must find inherent patterns on its own.
• The primary goal is to explore and uncover underlying relationships
and groupings in the data.
Cont.
It’s about learning interesting/useful structures in the data
(unsupervisedly!)
Clustering Techniques
• Hierarchical methods: divisive, agglomerative
• Density-based methods: DBSCAN [1996], STING [1997], CLIQUE [1998]
k-Means Algorithm
• The k-Means clustering algorithm was proposed by J. Hartigan and M. A. Wong [1979].
• k-Means divides objects into clusters such that objects within a cluster share similarities and are dissimilar to the objects belonging to other clusters.
• Given a set of n distinct objects, the k-Means clustering algorithm partitions the objects into k clusters such that intracluster similarity is high but intercluster similarity is low.
• In this algorithm, the user has to specify k, the number of clusters; the objects are assumed to be defined by numeric attributes, so any one of the distance metrics can be used to demarcate the clusters.
What is the intuition of k-means?
The goal of k-means is to locate the centroids around which the data are clustered; they are the "means" in "k-means". If we know where these points are, the intuition behind the algorithm is that we can then classify each point by assigning it to its closest cluster center.
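For instance, the assignment step amounts to a nearest-centroid lookup; a minimal sketch with illustrative centroid coordinates (not part of the original slides):

import numpy as np

# hypothetical centroids and a query point (illustrative values only)
centroids = np.array([[2.0, 10.0], [6.0, 6.0], [1.5, 3.5]])
point = np.array([4.0, 9.0])

# Euclidean distance from the point to every centroid
distances = np.linalg.norm(centroids - point, axis=1)

print(np.round(distances, 2))   # [2.24 3.61 6.04]
print(np.argmin(distances))     # 0 -> the point joins cluster 1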
k-Means Algorithm
The algorithm can be stated as follows.
1. Select k objects at random from the set of n objects. These k objects are treated as the centroids, or centers of gravity, of k clusters.
2. Assign each object to the cluster whose centroid is nearest to it.
3. Compute the "cluster centers" of each cluster, by calculating the mean of the attribute values of the objects in the cluster. These become the new cluster centroids.
4. Repeat steps 2 and 3 until the cluster assignments (equivalently, the centroids) no longer change.
5. Stop.
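These steps translate almost line for line into NumPy. The sketch below follows the algorithm as stated (random initialization, nearest-centroid assignment, mean update) and, for brevity, omits details such as empty-cluster handling:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-Means sketch: X is an (n, m) array of numeric objects."""
    rng = np.random.default_rng(seed)
    # step 1: pick k objects at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 2: assign each object to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels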
k-Means Algorithm
Note:
1) Objects are defined in terms of a set of attributes, where each attribute is of a continuous (numeric) data type.
Worked example: with current centroids c1 = (2, 10), c2 = (6, 6) and c3 = (1.5, 3.5), the distance of each object to each centroid and the resulting (re)assignment are:

Object  (A1, A2)   d(c1)  d(c2)  d(c3)  Old cluster  New cluster
A1      (2, 10)    0.00   5.66   6.52   1            1
A2      (2, 5)     5.00   4.12   1.58   3            3
A3      (8, 4)     8.49   2.83   6.52   2            2
B1      (5, 8)     3.61   2.24   5.70   2            2
B2      (7, 5)     7.07   1.41   5.70   2            2
B3      (6, 4)     7.21   2.00   4.53   2            2
C1      (1, 2)     8.06   6.40   1.58   3            3
C2      (4, 9)     2.24   3.61   6.04   2            1

New centroids: c1 = (3, 9.5), c2 = (6.5, 5.25), c3 = (1.5, 3.5)
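The centroid update in the table can be checked with a few lines of NumPy (the points and the new cluster labels are taken from the table above):

import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
labels = np.array([1, 3, 2, 2, 2, 2, 3, 1])  # new cluster of each object

# each new centroid is the mean of the objects assigned to that cluster
for j in (1, 2, 3):
    print(j, points[labels == j].mean(axis=0))
# 1 [3.   9.5 ]
# 2 [6.5  5.25]
# 3 [1.5  3.5 ]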
In the next iteration, the distances are recomputed with respect to the new centroids (3, 9.5), (6.5, 5.25) and (1.5, 3.5), and the process repeats until the cluster assignments no longer change.

[Figure: scatter plot of the 16 sample objects of the following illustration in the (A1, A2) plane; A1 on the x-axis (0 to 12), A2 on the y-axis (0 to 25).]
Illustration of the k-Means clustering algorithm
• Suppose k = 3. Three objects are chosen at random, shown circled (see Fig 16.1). These three centroids are shown below.

Initial centroids (chosen randomly):

Centroid   A1     A2
c1         3.8    9.9
c2         7.8   12.2
c3         6.2   18.5
Distances from each of the 16 objects to the three centroids, and the resulting cluster assignment:

Data point (A1, A2)   d(c1)   d(c2)   d(c3)   Cluster
(6.8, 12.6)            4.0     1.1     5.9    2
(0.8, 9.8)             3.0     7.4    10.2    1
(1.2, 11.6)            3.1     6.6     8.5    1
(2.8, 9.6)             1.0     5.6     9.5    1
(3.8, 9.9)             0.0     4.6     8.9    1
(4.4, 6.5)             3.5     6.6    12.1    1
(4.8, 1.1)             8.9    11.5    17.5    1
(6.0, 19.9)           10.2     7.9     1.4    3
(6.2, 18.5)            8.9     6.5     0.0    3
(7.6, 17.4)            8.4     5.2     1.8    3
(7.8, 12.2)            4.6     0.0     6.5    2
(6.6, 7.7)             3.6     4.7    10.8    1
(8.2, 4.5)             7.0     7.7    14.1    1
(8.4, 6.9)             5.5     5.3    11.8    2
(9.0, 3.4)             8.3     8.9    15.4    1
(9.6, 11.1)            5.9     2.1     8.1    2
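The distance and assignment columns can be reproduced with SciPy's cdist (the points and centroids are those listed above):

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[6.8, 12.6], [0.8, 9.8], [1.2, 11.6], [2.8, 9.6],
              [3.8, 9.9], [4.4, 6.5], [4.8, 1.1], [6.0, 19.9],
              [6.2, 18.5], [7.6, 17.4], [7.8, 12.2], [6.6, 7.7],
              [8.2, 4.5], [8.4, 6.9], [9.0, 3.4], [9.6, 11.1]])
centroids = np.array([[3.8, 9.9], [7.8, 12.2], [6.2, 18.5]])

dists = cdist(X, centroids)          # 16 x 3 distance matrix
clusters = dists.argmin(axis=1) + 1  # 1-based cluster indices
print(np.round(dists, 1))
print(clusters)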
Illustration of the k-Means clustering algorithm
The calculation of the new centroids of the three clusters, using the mean of the A1 and A2 attribute values of each cluster's objects, is shown in the table below. The clusters with the new centroids are shown in Fig 16.3.

New centroid   A1     A2
c1             4.6    7.1
c2             8.2   10.7
c3             6.6   18.6

Repeating the assignment and update steps once more yields the centroids of the next iteration:

Centroid   A1     A2
c1         5.0    7.1
c2         8.1   12.0
c3         6.6   18.6
Problem
• Cluster the following eight points (with (x, y) representing locations)
into three clusters:
• A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
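One way to check an answer to this exercise is to run scikit-learn's KMeans with the three given points as the initial centroids (init accepts an explicit array; n_init=1 keeps exactly that single initialization):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
init = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)  # A1, A4, A7

km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
print(km.labels_)           # final cluster of A1 ... A8
print(km.cluster_centers_)  # final centroids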
Program for K-means Clustering
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

a = [2, 5, 7, 9, 3, 11, 14, 16, 10, 12, 18]
b = [21, 19, 24, 17, 22, 25, 24, 22, 21, 25, 27]

plt.scatter(a, b)
plt.show()

# elbow method: fit k-Means for k = 1..10 and record the inertia (SSE)
data = list(zip(a, b))
inertias = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

# refit with the k suggested by the elbow (2 for this data) and plot the labels
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
plt.scatter(a, b, c=kmeans.labels_)
plt.show()
Algorithm Implementation on a Synthetic Dataset
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# generate 150 two-dimensional points in three well-separated blobs
X, y = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

# plot the raw data
plt.scatter(X[:, 0], X[:, 1], c='blue', marker='o', s=30)
plt.grid()
plt.show()

# cluster with k = 3
km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300,
            random_state=0)
y_km = km.fit_predict(X)

# plot each cluster with its own color/marker, plus the centroids
plt.scatter(X[y_km == 0, 0], X[y_km == 0, 1], s=30, c='green', marker='s', label='cluster 1')
plt.scatter(X[y_km == 1, 0], X[y_km == 1, 1], s=30, c='blue', marker='o', label='cluster 2')
plt.scatter(X[y_km == 2, 0], X[y_km == 2, 1], s=30, c='red', marker='v', label='cluster 3')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            s=200, marker='*', c='black', label='centroids')
plt.legend()
plt.grid()
plt.show()
WAP to implement K-means clustering on any dataset.
# import libraries
import numpy as np
import matplotlib.pyplot as mtp
import pandas as pd
from google.colab import files
uploaded = files.upload()
dataset = pd.read_csv('/content/Mall_Customers.csv')
print(dataset.head())
x = dataset.iloc[:, [3, 4]].values  # select the Annual Income and Spending Score columns
# finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  # within-cluster sum of squares (inertia) for each k
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()
# train with the k suggested by the elbow plot (5 for this dataset) and plot the clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)
for i in range(5):
    mtp.scatter(x[y_predict == i, 0], x[y_predict == i, 1], s=30,
                label='Cluster ' + str(i + 1))
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='black', marker='*', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
Comments on k-Means algorithm
Let us analyse the k-Means algorithm and discuss the pros and cons of the
algorithm.
We shall refer to the following notations in our discussion.
• Notation:
• $x$ : an object under clustering
• $n$ : the number of objects under clustering
• $C_i$ : the $i$-th cluster
• $c_i$ : the centroid of cluster $C_i$
• $n_i$ : the number of objects in cluster $C_i$
• $c$ : the centroid of all objects
• $k$ : the number of clusters
Comments on k-Means algorithm
1. Value of k:
• The k-Means algorithm produces only one set of clusters, for which the user must specify the desired number of clusters, k.
• In fact, k should be the best guess of the number of clusters present in the given data. Choosing the best value of k for a given dataset is, therefore, an issue.
• We may not have any idea of the possible number of clusters for high-dimensional data, or for data that cannot be scatter-plotted.
• Further, the possible number of clusters is hidden or ambiguous in image, audio, video and multimedia clustering applications, etc.
• There is no principled way to know what the value of k ought to be. We may try successive values of k starting with 2.
• The process is stopped when two consecutive values of k produce more-or-less identical results (with respect to some cluster-quality estimate); one such estimate is illustrated in the sketch below.
• Normally $k \ll n$, and a common heuristic is $k \approx \sqrt{n/2}$.
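A hedged sketch of this successive-k search, using the silhouette score from scikit-learn as the cluster-quality estimate and synthetic make_blobs data (both are assumptions, not part of the lecture):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# try successive values of k starting with 2 and score each clustering
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# choose the k at which the score peaks / stops improving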
Comments on k-Means algorithm
k versus cluster quality
• With a given value of k, the quality of the clustering is judged by an error measure: the aggregate distance of the objects from their cluster centroids.
• Usually, this error is measured using a distance norm such as L1, L2 or L3, or using cosine similarity, etc.
• Usually, the Euclidean distance (L2 norm) is the best measure when object points are defined in an n-dimensional Euclidean space. The objective function is then

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert_2^2$$
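As a concreteness check (not part of the original slides), the SSE above can be computed directly and compared against scikit-learn's inertia_ attribute, which stores the same quantity:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))  # illustrative data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# SSE = sum of squared L2 distances of each object to its cluster centroid
sse = sum(np.sum((X[km.labels_ == i] - c) ** 2)
          for i, c in enumerate(km.cluster_centers_))
print(round(sse, 4), round(km.inertia_, 4))  # the two values agree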
Comments on k-Means algorithm
Note: the criterion of the objective function changes with different proximity measures.
• In other words, the mean calculation assumes that each object is defined with numerical attribute(s). Thus, we cannot apply k-Means to objects which are defined with categorical attributes.
• More precisely, the k-Means algorithm requires that some definition of a cluster mean exists, but the mean need not be the one given in the above equation.
• In fact, k-Means is a very general clustering algorithm and can be used with a wide variety of data types, such as documents, time series, etc.
The above two interpretations can be readily verified as shown in the next slides.
Comments on k-Means algorithm
Case 1: SSE
We know,

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} (c_i - x)^2$$

To minimize the SSE of the $i$-th cluster, differentiate with respect to the centroid $c_i$ and set the result to zero:

$$\frac{\partial}{\partial c_i} \sum_{x \in C_i} (c_i - x)^2 = 0$$

Or,

$$\sum_{x \in C_i} 2\,(c_i - x) = 0$$

Or,

$$n_i \, c_i = \sum_{x \in C_i} x$$

Or,

$$c_i = \frac{1}{n_i} \sum_{x \in C_i} x$$

• Thus, the best centroid for minimizing the SSE of a cluster is the mean of the objects in the cluster.
Comments on k-Means algorithm
Case 2: SAE
We know,

$$SAE = \sum_{i=1}^{k} \sum_{x \in C_i} \lvert c_i - x \rvert$$

Differentiating with respect to $c_i$ and setting the result to zero gives

$$\frac{\partial}{\partial c_i} \sum_{x \in C_i} \lvert c_i - x \rvert = \sum_{x \in C_i} \operatorname{sign}(c_i - x) = 0$$

Or,

$$c_i = \operatorname{median} \{\, x \mid x \in C_i \,\}$$

• Thus, the best centroid for minimizing the SAE of a cluster is the median of the objects in the cluster.
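A quick numerical check of Case 2 with illustrative values (any one-dimensional cluster works): the median gives a smaller sum of absolute errors than the mean.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])  # one cluster, illustrative values

def sae(c):
    return np.abs(x - c).sum()

print(sae(np.median(x)))  # 10.0 at the median (2.5) -- the minimum
print(sae(np.mean(x)))    # 12.0 at the mean (4.0)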
? Interpret the best centroid for maximizing the TC (with the cosine similarity measure) of a cluster.
The above discussion is quite sufficient for the validation of the k-Means algorithm.
Comments on k-Means algorithm
5. Complexity analysis of k-Means algorithm
Time complexity:
The time complexity of the k-Means algorithm can be expressed as $O(n \cdot k \cdot t)$, where $n$ is the number of objects, $k$ the number of clusters and $t$ the number of iterations.
Thus, the time requirement is linear in the number of objects, and the algorithm runs in a modest time if $k \ll n$ and $t \ll n$ (the number of iterations can be moderately controlled to bound the value of $t$).
Comments on k-Means algorithm
5. Complexity analysis of k-Means algorithm
Space complexity:
It requires space to store the $n$ objects and the $k$ centroids, plus space for the proximity measures from the objects to the centroids of the clusters.
• k-Means is simple and can be used for a wide variety of object types.
• It is also efficient from both the storage-requirement and execution-time points of view. By saving distance information from one iteration to the next, the actual number of distance calculations that must be made can be reduced (especially as the algorithm approaches termination).
? How can a similarity metric be utilized to run k-Means faster? What is updated in each iteration?
Limitations:
• k-Means is not suitable for all types of data. For example, k-Means does not work on categorical data because a mean cannot be defined.
• k-Means finds a local optimum and may miss the global optimum.
Comments on k-Means algorithm
6. Final comments:
Limitations:
• k-Means has trouble clustering data that contain outliers. When the SSE is used as the objective function, outliers can unduly influence the clusters that are produced: in the presence of outliers, the cluster centroids are, in fact, not as representative as they would otherwise be, and the SSE measure is distorted as well.
• The k-Means algorithm is not really free of scalability issues (and is not so practical for very large databases).
Comments on k-Means algorithm
Advantages (hard clustering)
• Clear Cluster Membership: Each data point unambiguously belongs to a single cluster.
Disadvantages (hard clustering)
• Sensitive to Initial Placement: Results can vary depending on the initial cluster centroids.
• Limited Handling of Overlapping Data: May struggle with complex data structures that have overlapping clusters.
Advantages (soft clustering)
• Handling Overlapping Data: Well-suited for datasets with complex or overlapping structures.
• Robustness to Outliers: Outliers may have low membership degrees in any cluster, reducing their impact.
Disadvantages (soft clustering)
• Computational Complexity: Soft clustering methods can be more computationally expensive than their hard clustering counterparts.
1. When the number of data points is small, the initial grouping determines the resulting clusters significantly.
2. The number of clusters, k, must be determined beforehand. A further disadvantage is that the algorithm does not yield the same result on each run, since the resulting clusters depend on the initial random assignments.
3. We never know the true clusters: using the same data, if they are input in a different order, the algorithm may produce different clusters when the number of data points is small.
4. It is sensitive to the initial condition. A different initial condition may produce a different clustering result, and the algorithm may be trapped in a local optimum, as demonstrated in the sketch below.
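Point 4 is easy to demonstrate (a sketch with assumed synthetic make_blobs data): with a single random initialization per run, different seeds can end in different local optima with different final SSE (inertia) values.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=5, random_state=1)

# one random initialization per run: the final SSE varies with the seed
for seed in range(5):
    km = KMeans(n_clusters=5, init='random', n_init=1,
                random_state=seed).fit(X)
    print(seed, round(km.inertia_, 1))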
Applications of k-Means Clustering
• It is relatively efficient and fast: it computes the result in O(tkn) time, where n is the number of objects or points, k is the number of clusters and t is the number of iterations.
• k-Means clustering can be applied to machine learning and data mining tasks.
• It is used on acoustic data in speech understanding to convert waveforms into one of k categories (known as vector quantization), and similarly for image segmentation.
• It is also used for choosing color palettes on old-fashioned graphical display devices and for image quantization.
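As a sketch of the palette-selection/image-quantization application (the file name photo.jpg is hypothetical; any RGB image works), k-Means can map every pixel to one of k representative colors:

import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

# hypothetical input file; pixels become rows of a (n_pixels, 3) matrix
img = np.asarray(Image.open('photo.jpg'), dtype=np.float64) / 255.0
h, w, _ = img.shape
pixels = img.reshape(-1, 3)

k = 16  # palette size
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)

# replace every pixel by the centroid (palette color) of its cluster
quantized = km.cluster_centers_[km.labels_].reshape(h, w, 3)
Image.fromarray((quantized * 255).astype(np.uint8)).save('photo_16colors.png')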
CONCLUSION
• The k-Means algorithm is useful for undirected knowledge discovery and is relatively simple. k-Means has found widespread usage in many fields, ranging from unsupervised learning in neural networks to pattern recognition, classification analysis, artificial intelligence, image processing and machine vision, among many others.
Thank you!