Week 07 Lecture Material
Week 7: Clustering
Pabitra Mitra
Computer Science and Engineering, IIT Kharagpur
Clustering
• An unsupervised learning method
• Used for exploratory data analysis
What is clustering?
• Organizing data into classes such that there is
• high intra-class similarity
• low inter-class similarity
• Finding the class labels and the number of classes directly
from the data (in contrast to classification).
• More informally, finding natural groupings among objects.
What is a natural grouping among these objects?
Clustering is subjective
[Figure: alternative groupings of the same objects: Simpson's Family, School Employees, Females, Males]
What is similarity?
The quality or state of being similar; likeness; resemblance; as, a similarity of features.
Similarity is hard to define.
Defining distance measures
Definition: Let O1 and O2 be two objects from the universe of
possible objects. The distance (dissimilarity) between O1 and O2 is a
real number denoted by D(O1,O2)
[Figure: a distance function maps a pair of objects, e.g. "Peter" and "Piotr", to a single real number such as 0.23, 3, or 342.7]

What properties should a distance measure have?
• D(A, B) = D(B, A)        (Symmetry)
• D(A, B) = 0 iff A = B    (Reflexivity)

Example: string edit distance, defined by the recurrence
  d('', '') = 0
  d(s, '') = d('', s) = |s|   -- i.e. the length of s
  d(s1+ch1, s2+ch2) = min( d(s1, s2) + if ch1 = ch2 then 0 else 1 fi,
                           d(s1+ch1, s2) + 1,
                           d(s1, s2+ch2) + 1 )
The edit distance between "Peter" and "Piotr", for example, is 3; a Python sketch follows below.
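A minimal Python sketch of this recurrence, computed bottom-up with dynamic programming (the function name edit_distance is our own choice):

def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein edit distance between two strings, via dynamic programming."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                     # d(s, '') = |s|
    for j in range(n + 1):
        dp[0][j] = j                     # d('', s) = |s|
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,   # match or substitute
                           dp[i - 1][j] + 1,          # delete from s1
                           dp[i][j - 1] + 1)          # insert into s1
    return dp[m][n]

print(edit_distance("Peter", "Piotr"))   # prints 3, as on the slide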
Two types of clustering
• Partitional algorithms: Construct various partitions and then evaluate them by some
criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using
some criterion
Desirable Properties of clustering algorithm
• Scalability (in terms of both time and space)
• Ability to deal with different data types
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Incorporation of user-specified constraints
Summarizing similarity measurements
• The similarity measurements are summarized in a dendrogram.
• The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.
[Figure: anatomy of a dendrogram: root, internal nodes, internal branches, terminal branches, and leaves]
Hierarchical clustering using string edit distance
Pedro (Portuguese)
Petros (Greek), Peter (English), Piotr (Polish), Peadar
(Irish), Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian Alternative),
Petr (Czech), Pyotr (Russian)
Cristovao (Portuguese)
Christoph (German), Christophe (French), Cristobal
(Spanish), Cristoforo (Italian), Kristoffer (Scandinavian),
Krystof (Czech), Christopher (English)
Miguel (Portuguese)
Michalis (Greek), Michael (English), Mick (Irish!)
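As an illustrative sketch only, pairwise edit distances between a few of these names can be fed into SciPy's hierarchical clustering (uses the edit_distance function sketched above; the shortened name list and single linkage are our own choices, and drawing the dendrogram requires matplotlib):

import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

names = ["Pedro", "Petros", "Peter", "Piotr", "Pierre",
         "Cristovao", "Christoph", "Christopher",
         "Miguel", "Michael", "Michalis"]

# symmetric matrix of pairwise string edit distances
n = len(names)
D = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    D[i, j] = D[j, i] = edit_distance(names[i], names[j])

Z = linkage(squareform(D), method="single")   # single-linkage hierarchy
dendrogram(Z, labels=names)                   # plot the dendrogram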
Hierarchical clustering
• Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
• The number of possible dendrograms with n leaves is (2n − 3)! / [2^(n−2) (n − 2)!] (a quick check of this count is sketched below).
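A quick check of this count as code (our own helper; exact integer arithmetic):

from math import factorial

def num_dendrograms(n):
    # (2n - 3)! / (2^(n-2) * (n - 2)!)  for n >= 2 leaves
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

print([num_dendrograms(n) for n in range(2, 8)])   # [1, 3, 15, 105, 945, 10395]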
Distance matrix
We begin with a distance matrix which contains the distances between every pair of objects in our database.
[Figure: example 5 × 5 distance matrix (upper triangle shown), with two highlighted entries D(·,·) = 8 and D(·,·) = 1:
   0  8  8  7  7
      0  2  4  4
         0  3  3
            0  1
               0 ]
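A minimal Python sketch of this bottom-up procedure using single linkage on a distance matrix like the one above (the function and variable names are ours; library routines such as scipy.cluster.hierarchy.linkage are far more efficient):

import itertools

def agglomerative_single_linkage(dist):
    """dist: symmetric n x n distance matrix. Returns the sequence of merges."""
    clusters = {i: {i} for i in range(len(dist))}   # each item starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # closest pair of clusters; single linkage = minimum pairwise object distance
        (a, b), d = min(
            (((a, b), min(dist[i][j] for i in clusters[a] for j in clusters[b]))
             for a, b in itertools.combinations(clusters, 2)),
            key=lambda t: t[1])
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] | clusters[b]      # fuse the best pair
        del clusters[b]
    return merges

# the example distance matrix from above
dist = [[0, 8, 8, 7, 7],
        [8, 0, 2, 4, 4],
        [8, 2, 0, 3, 3],
        [7, 4, 3, 0, 1],
        [7, 4, 3, 1, 0]]
for a, b, d in agglomerative_single_linkage(dist):
    print(sorted(a), "+", sorted(b), "at distance", d)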
Minimal Spanning Tree – Single Linkage
• Build MST (Minimum Spanning Tree)
– Start with a tree that consists of any point
– In successive steps, look for the closest pair of points (p, q) such that
one point (p) is in the current tree but the other (q) is not
– Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering
• Use MST for constructing hierarchy of clusters
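A small Prim-style sketch of this MST construction from a distance matrix (names are ours); for the divisive strategy, one then repeatedly removes the largest remaining MST edge and takes the resulting connected components as clusters:

def mst_edges(D):
    """Build a minimum spanning tree from an n x n distance matrix D."""
    n = len(D)
    in_tree = {0}                       # start with an arbitrary point
    edges = []
    while len(in_tree) < n:
        # closest pair (p, q) with p in the current tree and q outside it
        p, q = min(((p, q) for p in in_tree for q in range(n) if q not in in_tree),
                   key=lambda e: D[e[0]][e[1]])
        edges.append((p, q, D[p][q]))   # add q to the tree via edge (p, q)
        in_tree.add(q)
    return edges

print(mst_edges([[0, 8, 8, 7, 7],
                 [8, 0, 2, 4, 4],
                 [8, 2, 0, 3, 3],
                 [7, 4, 3, 0, 1],
                 [7, 4, 3, 1, 0]]))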
Summary of hierarchical clustering
• No need to specify the number of clusters in advance.
• The hierarchical structure maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n²), where n is the total number of objects.
Partitional clustering
• Nonhierarchical, each instance is placed in
exactly one of K nonoverlapping clusters.
• Since only one set of clusters is output, the user
normally has to input the desired number of
clusters K.
k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the
nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found
above are correct.
5. If none of the N objects changed membership in the last iteration, exit.
Otherwise goto 3.
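A minimal NumPy sketch of these five steps with random initialization (function and parameter names are ours; empty clusters are not handled in this sketch):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: initialize the k cluster centers with k randomly chosen objects
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # step 3: assign each object to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # step 5: exit if no object changed membership in the last iteration
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 4: re-estimate each cluster center as the mean of its members
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

Because the initialization is random, different seeds can converge to different local optima, which matches the local-optimum remark in the evaluation below.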
K-means clustering: steps 1 to 5
Algorithm: k-means; Distance Metric: Euclidean Distance
[Figures: five snapshots of k-means on 2-D points with k = 3 (centers k1, k2, k3): initial placement of the centers, assignment of each point to its nearest center, re-estimation of the centers as cluster means, re-assignment of the points, and the final converged clustering]
Evaluation of K-means
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t
is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may be
found using techniques such as: deterministic annealing and genetic
algorithms
• Weakness
– Applicable only when mean is defined, then what about categorical
data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable for clusters with non-convex shapes
DBSCAN
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
  – A core point has at least MinPts points within Eps; such points lie in the interior of a cluster
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  – A noise point is any point that is not a core point or a border point.
[Figure: DBSCAN core, border, and noise points]
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
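A hedged usage sketch with scikit-learn's DBSCAN implementation (the two-moons data and the parameter values Eps = 0.2, MinPts = 4 are our own illustrative choices):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two half-moon shaped clusters: non-convex, a good fit for a density-based method
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=4).fit(X)   # eps = Eps, min_samples = MinPts
labels = db.labels_                          # cluster labels; -1 marks noise points
print("clusters:", len(set(labels) - {-1}), "noise points:", int(np.sum(labels == -1)))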
When DBSCAN does not work well:
• Varying densities
• High-dimensional data
[Figure: original points and DBSCAN results with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
DBSCAN: Determining Eps and MinPts
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its kth nearest neighbor; the knee of this curve suggests a suitable Eps (a sketch follows below)
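A minimal NumPy sketch of this k-distance plot (the function name and the choice k = MinPts = 4 are ours):

import numpy as np

def k_distance_values(X, k=4):
    """Sorted distance of every point to its k-th nearest neighbor."""
    # pairwise Euclidean distances
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)             # ignore each point's distance to itself
    kth = np.sort(d, axis=1)[:, k - 1]      # k-th nearest neighbor distance per point
    return np.sort(kth)                     # plot these values; the knee suggests Eps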
Summary of Clustering Algorithms
• K-Means – fast, works only for data where a mean can be defined, generates spherical clusters, sensitive to noise and outliers
• Single linkage – produces non-convex clusters, slow for large data sets,
sensitive to noise
• Complete linkage – produces non-convex clusters, very sensitive to
noise, very slow for large data sets
• DBSCAN – produces arbitrary shaped clusters – works only for low
dimensional data
Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall
• For clustering there are no class labels, so we instead use internal measures of cluster quality.
Internal Measures: Cohesion and Separation
• Cluster Cohesion: measures how closely related the objects in a cluster are
  – Example: SSE
• Cluster Separation: measures how distinct or well-separated a cluster is from the other clusters
• Example: Squared Error
  – Cohesion is measured by the within-cluster sum of squares (SSE):
      WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²
  – Separation is measured by the between-cluster sum of squares:
      BSS = Σ_i |C_i| (m − m_i)²
    where m is the overall mean of the data, m_i is the mean of cluster C_i, and |C_i| is its size
  – BSS + WSS = constant (the total sum of squares of the data)
[Figure: five 1-D points with overall mean m and cluster means m_1, m_2, illustrating cohesion (within clusters) and separation (between clusters)]
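A small NumPy sketch that verifies WSS + BSS = constant on 1-D toy data loosely following the figure (the data, labels, and variable names are our own illustration):

import numpy as np

X = np.array([1., 2., 3., 4., 5.])        # five 1-D points
labels = np.array([0, 0, 0, 1, 1])        # two clusters: {1, 2, 3} and {4, 5}

m = X.mean()                              # overall mean of the data
wss = bss = 0.0
for j in np.unique(labels):
    Cj = X[labels == j]
    mj = Cj.mean()                        # cluster mean m_j
    wss += ((Cj - mj) ** 2).sum()         # cohesion: within-cluster sum of squares
    bss += len(Cj) * (m - mj) ** 2        # separation: between-cluster sum of squares
tss = ((X - m) ** 2).sum()                # total sum of squares
print(wss, bss, tss)                      # wss + bss equals tss (2.5 + 7.5 = 10.0)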
Internal Measures: Silhouette Coefficient
• The Silhouette Coefficient combines the ideas of cohesion and separation, but for individual points, as well as clusters and clusterings
• For an individual point i:
  – Calculate a = average distance of i to the points in its cluster
  – Calculate b = min (average distance of i to points in another cluster)
  – The silhouette coefficient for the point is then
      s = 1 − a/b if a < b   (or s = b/a − 1 if a ≥ b, not the usual case)
• The closer s is to 1, the better the point fits its cluster
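A short NumPy sketch of this per-point computation from a distance matrix, following the a/b formula above (assumes at least two clusters; scikit-learn's silhouette_samples provides an equivalent):

import numpy as np

def silhouette_per_point(D, labels):
    """D: n x n pairwise distance matrix; labels: cluster label of each point."""
    labels = np.asarray(labels)
    n = len(labels)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, same].mean() if same.any() else 0.0    # avg distance within own cluster
        b = min(D[i, labels == c].mean()                # closest other cluster, on average
                for c in set(labels) if c != labels[i])
        s[i] = 1 - a / b if a < b else b / a - 1        # the formula above
    return s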
End of Clustering