Week 07 Lecture Material

Clustering is an unsupervised machine learning technique used to group unlabeled data points into clusters based on similarity. It can be used to find natural groupings within data to gain insight. Hierarchical clustering builds clusters hierarchically, either bottom-up by merging the closest pairs or top-down by splitting clusters. K-means clustering partitions data into k mutually exclusive clusters by minimizing distances between data points and assigned cluster centers.


Data Mining

Week 7: Clustering

Pabitra Mitra
Computer Science and Engineering, IIT Kharagpur

1
Clustering
• Unsupervised method
• Exploratory Data Analysis

• Useful in many applications such as market segment analysis

2
What is clustering?
• Organizing data into classes such that there is
• high intra-class similarity
• low inter-class similarity
• Finding the class labels and the number of classes directly
from the data (in contrast to classification).
• More informally, finding natural groupings among objects.

3
What is a natural grouping among these objects?

4
What is a natural grouping among these objects?

Clustering is subjective

5
[Figure: the same cartoon characters grouped in different ways - Simpson's Family, School Employees, Females, Males]
What is similarity?
The quality or state of being similar; likeness; resemblance; as, a similarity of features.

Similarity is hard to define.
Defining distance measures
Definition: Let O1 and O2 be two objects from the universe of
possible objects. The distance (dissimilarity) between O1 and O2 is a
real number denoted by D(O1,O2)

[Figure: example distance values between pairs of objects (e.g. the names Peter and Piotr): 0.23, 3, 342.7]

7
What properties should a distance measure have?
• D(A,B) = D(B,A)                  Symmetry
• D(A,B) = 0 iff A = B             Reflexive
• D(A,B) ≤ D(A,C) + D(B,C)         Triangle Inequality

[Figure: the string edit distance between "Peter" and "Piotr" is 3, computed with the recurrence
  d('', '') = 0
  d(s, '') = d('', s) = |s|   -- i.e., the length of s
  d(s1+ch1, s2+ch2) = min( d(s1, s2) + (0 if ch1 = ch2 else 1),
                           d(s1+ch1, s2) + 1,
                           d(s1, s2+ch2) + 1 ) ]
8
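The recurrence above can be implemented with dynamic programming. Below is a minimal sketch in Python (ours, not from the lecture), assuming unit costs for insertion, deletion, and substitution as in the recurrence:

def edit_distance(s1: str, s2: str) -> int:
    # Dynamic-programming version of the recurrence on the slide:
    # d[i][j] = edit distance between s1[:i] and s2[:j].
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                        # d(s, '') = |s|
    for j in range(n + 1):
        d[0][j] = j                        # d('', s) = |s|
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,   # substitute (or match)
                          d[i - 1][j] + 1,          # delete from s1
                          d[i][j - 1] + 1)          # insert into s1
    return d[m][n]

print(edit_distance("Peter", "Piotr"))     # -> 3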
Two types of clustering
• Partitional algorithms: Construct various partitions and then evaluate them by some
criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using
some criterion
[Figure: a hierarchical clustering (dendrogram) vs. a partitional clustering (flat partition)]

9
Desirable Properties of clustering algorithm
• Scalability (in terms of both time and space)
• Ability to deal with different data types
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Incorporation of user-specified constraints

• Interpretability and usability

10
Summarizing similarity measurements
The result of a hierarchical clustering is summarized with a dendrogram. The
similarity between two objects in a dendrogram is represented as the height of
the lowest internal node they share.

[Figure: dendrogram with its parts labeled - root, internal node, internal branch,
 terminal branch, leaf]
11
Hierarchical clustering using string edit distance
Pedro (Portuguese)
Petros (Greek), Peter (English), Piotr (Polish), Peadar
(Irish), Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian Alternative),
Petr (Czech), Pyotr (Russian)

Cristovao (Portuguese)
Christoph (German), Christophe (French), Cristobal
(Spanish), Cristoforo (Italian), Kristoffer (Scandinavian),
Krystof (Czech), Christopher (English)

Miguel (Portuguese)
Michalis (Greek), Michael (English), Mick (Irish!)

12
Hierarchical clustering
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best
pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-Down (divisive): Starting with all the data in a single cluster, consider every
possible way to divide the cluster into two. Choose the best division and recursively
operate on both sides.

The number of possible dendrograms with n leaves is (2n − 3)! / [2^(n−2) (n − 2)!],
which grows extremely quickly with n.
13
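As a quick check on how fast this count grows, the formula can be evaluated for small n; the following short Python snippet (ours, not part of the slides) reproduces the sequence 1, 3, 15, 105, ...:

from math import factorial

def num_dendrograms(n: int) -> int:
    # (2n - 3)! / (2^(n-2) * (n - 2)!): the number of rooted binary
    # trees (dendrograms) with n labeled leaves, for n >= 2.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in range(2, 8):
    print(n, num_dendrograms(n))   # 2:1, 3:3, 4:15, 5:105, 6:945, 7:10395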
Distance matrix
We begin with a distance matrix which contains the distances between every
pair of objects in our database. For the five pictured objects the matrix is

      0   8   8   7   7
          0   2   4   4
              0   3   3
                  0   1
                      0

so that, for example, D(object 1, object 2) = 8 and D(object 4, object 5) = 1.
14
Bottom-Up (agglomerative)
Starting with each item in its own cluster, find the best pair to merge into a new
cluster. Repeat until all clusters are fused together.

[Figure: at each step, consider all possible merges and choose the best one]
15
Bottom-Up (agglomerative), continued
[Figure: after the first merge, again consider all possible merges and choose the best]
16
Bottom-Up (agglomerative), continued
[Figure: after two merges, again consider all possible merges and choose the best]

17
Bottom-Up (agglomerative), continued
[Figure: the merging continues until all items are fused into a single cluster]

18
Extending distance measure to clusters
Having defined the distance between two objects, we also need to define the distance
between an object and a cluster, or between two clusters (a SciPy sketch follows the list below):
• Single linkage (nearest neighbor): In this method the distance between
two clusters is determined by the distance of the two closest objects (nearest
neighbors) in the different clusters.
• Complete linkage (farthest neighbor): In this method, the distances
between clusters are determined by the greatest distance between any two
objects in the different clusters (i.e., by the "furthest neighbors").
• Group average linkage: In this method, the distance between two clusters is
calculated as the average distance between all pairs of objects in the two
different clusters.
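As an illustration of these linkage choices, the sketch below (ours, not from the lecture) assumes SciPy is available and runs agglomerative clustering on a small synthetic dataset with single, complete, and average linkage:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),      # two well-separated blobs
               rng.normal(5, 0.5, (20, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # hierarchical merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes, e.g. [20 20]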

19
Minimal Spanning Tree – Single Linkage
• Build MST (Minimum Spanning Tree)
– Start with a tree that consists of any point
– In successive steps, look for the closest pair of points (p, q) such that
one point (p) is in the current tree but the other (q) is not
– Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering
• Use the MST for constructing a hierarchy of clusters (a minimal sketch follows below)
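A minimal sketch of the Prim-style construction described above, in plain Python with NumPy (ours, not from the lecture); cutting the k − 1 longest MST edges then yields a single-linkage-style partition into k clusters:

import numpy as np

def mst_edges(X):
    # Prim's algorithm: grow a tree from an arbitrary point, always adding the
    # closest outside point (single-linkage distance to the current tree).
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    in_tree, edges = [0], []
    while len(in_tree) < n:
        best = None
        for p in in_tree:
            for q in range(n):
                if q not in in_tree and (best is None or dist[p, q] < best[2]):
                    best = (p, q, dist[p, q])
        edges.append(best)
        in_tree.append(best[1])
    return edges                           # n - 1 edges (p, q, length)

def mst_clusters(X, k):
    # Keep only the n - k shortest MST edges (i.e. remove the k - 1 longest);
    # the connected components that remain are the single-linkage clusters.
    kept = sorted(mst_edges(X), key=lambda e: e[2])[:len(X) - k]
    labels = list(range(len(X)))           # simple union by relabeling
    for p, q, _ in kept:
        old, new = labels[q], labels[p]
        labels = [new if lab == old else lab for lab in labels]
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
print(mst_clusters(X, k=2))                # two groups, e.g. [0, 0, 0, 3, 3, 3]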
Summary of hierarchical clustering
• No need to specify the number of clusters in advance.
• The hierarchical nature maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n²), where n is the total
  number of objects.

22
Partitional clustering
• Nonhierarchical, each instance is placed in
exactly one of K nonoverlapping clusters.
• Since only one set of clusters is output, the user
normally has to input the desired number of
clusters K.

23
k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the
nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found
above are correct.
5. If none of the N objects changed membership in the last iteration, exit.
Otherwise go to step 3. (A minimal code sketch of these steps follows this slide.)

24
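A minimal NumPy sketch of the five steps above (ours, not from the lecture); the function name kmeans and the synthetic data are illustrative only, and Euclidean distance is assumed:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize the k cluster centers (here: k distinct random points).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each of the N objects to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        # Step 5: if no object changed membership, stop.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in (0, 3, 6)])
labels, centers = kmeans(X, k=3)
print(np.round(centers, 2))   # centers land near (0,0), (3,3), (6,6)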
K-means clustering: step 1
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: data points in the plane with three initial cluster centers k1, k2, k3 chosen at random]
K-means clustering: step 2
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: each point is assigned to its nearest center k1, k2, or k3]

26
K-means clustering: step 3
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: each center k1, k2, k3 is moved to the mean of the points assigned to it]
27
K-means clustering: step 4
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: points are reassigned to the updated centers k1, k2, k3]
28
K-means clustering: step 5
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figure: the centers k1, k2, k3 stop moving and the assignments no longer change - the algorithm has converged]
29
Evaluation of K-means
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t
is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may be
found using techniques such as: deterministic annealing and genetic
algorithms
• Weakness
– Applicable only when mean is defined, then what about categorical
data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable for clusters with non-convex shapes

30
DBSCAN
• DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)

– A point is a core point if it has more than a specified number of points
  (MinPts) within Eps
  • These are points that are in the interior of a cluster
– A border point has fewer than MinPts points within Eps, but is in the
  neighborhood of a core point
– A noise point is any point that is not a core point or a border point.
DBSCAN: Core, Border, and Noise Points
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
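For illustration, the following sketch (ours, not from the lecture) runs DBSCAN through scikit-learn, assuming that library is available; points labeled -1 are the noise points defined earlier:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),    # dense blob
               rng.normal(4, 0.3, (50, 2)),    # second dense blob
               rng.uniform(-2, 6, (10, 2))])   # scattered noise

labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points  :", np.sum(labels == -1))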
DBSCAN: Core, Border and Noise Points
[Figure: the original points and their types (core, border, noise) for Eps = 10, MinPts = 4]
When DBSCAN Works Well

[Figure: original points and the clusters DBSCAN finds]
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well

[Figure: original points and the clusterings obtained with (MinPts = 4, Eps = 9.75)
 and (MinPts = 4, Eps = 9.92)]

• Varying densities
• High-dimensional data
DBSCAN: Determining EPS and MinPts
• Idea is that for points in a cluster, their kth nearest neighbors are at
roughly the same distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its kth nearest neighbor; Eps can be
  chosen near the "knee" of this curve
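A possible sketch of that k-distance plot (ours, not from the lecture), assuming scikit-learn and matplotlib are available:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.uniform(-2, 2, (10, 2))])

k = 4                                           # use k = MinPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X) # +1: the query point itself is returned
dist, _ = nn.kneighbors(X)
kth = np.sort(dist[:, -1])                      # each point's distance to its kth neighbor

plt.plot(kth)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}th nearest neighbor")
plt.show()                                      # choose Eps near the 'knee' of the curve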
Summary of Clustering Algorithms
• K-Means – fast, works only for data where a mean can be defined,
generates spherical clusters, sensitive to noise and outliers
• Single linkage – produces non-convex clusters, slow for large data sets,
sensitive to noise
• Complete linkage – produces non-convex clusters, very sensitive to
noise, very slow for large data sets
• DBSCAN – produces arbitrary shaped clusters – works only for low
dimensional data

38
Cluster Validity
• For supervised classification we have a variety of measures to evaluate
how good our model is
– Accuracy, precision, recall

• For cluster analysis, the analogous question is how to evaluate the
"goodness" of the resulting clusters.
• But “clusters are in the eye of the beholder”!
• Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing
whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results,
e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without
reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to
determine which is better.
5. Determining the ‘correct’ number of clusters.
Measures of Cluster Validity
• Numerical measures that are applied to judge various aspects of
cluster validity are classified into the following three types.
– External Index: Used to measure the extent to which cluster labels match
externally supplied class labels.
• Entropy
– Internal Index: Used to measure the goodness of a clustering structure
without respect to external information.
• Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or clusters.
• Often an external or internal index is used for this function, e.g., SSE or entropy
Scatter Coefficient
• Cluster evaluation index
• Ratio of average inter-cluster distances to average intra-cluster
distances (the latter measured by the sum of squared error)

42
Internal Measures: Cohesion and Separation
• Cluster Cohesion: measures how closely related the objects in a cluster are
  – Example: SSE
  – Cohesion is measured by the within-cluster sum of squares:
    WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²
• Cluster Separation: measures how distinct or well-separated a cluster is from
  other clusters
  – Separation is measured by the between-cluster sum of squares:
    BSS = Σ_i |C_i| (m − m_i)²
  – where |C_i| is the size of cluster i, m_i is the mean of cluster i, and m is the
    overall mean


Internal Measures: Cohesion and Separation

• Example: SSE
  – BSS + WSS = constant
  – Data points 1, 2, 4, 5 on a line; overall mean m = 3; for K = 2 the cluster
    means are 1.5 and 4.5

  K = 1 cluster:   WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
                   BSS = 4 × (3 − 3)² = 0
                   Total = 10 + 0 = 10

  K = 2 clusters:  WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
                   BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
                   Total = 1 + 9 = 10
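A few lines of Python (ours, not from the lecture) confirm that WSS + BSS stays equal to the total sum of squares however the four points are split:

import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                    # overall mean = 3

def wss_bss(clusters):
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return float(wss), float(bss)

print(wss_bss([x]))                             # K = 1: WSS = 10, BSS = 0
print(wss_bss([x[:2], x[2:]]))                  # K = 2: WSS = 1,  BSS = 9 (total 10 either way)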
Internal Measures: Cohesion and Separation
• A proximity graph based approach can also be used for cohesion and
separation.
– Cluster cohesion is the sum of the weight of all links within a cluster.
– Cluster separation is the sum of the weights between nodes in the cluster and
nodes outside the cluster.

[Figure: links within a cluster (cohesion) vs. links between clusters (separation)]
Internal Measures: Silhouette Coefficient
• The Silhouette Coefficient combines ideas of both cohesion and separation, but for
  individual points, as well as clusters and clusterings
• For an individual point i:
  – Calculate a = average distance of i to the points in its own cluster
  – Calculate b = min (average distance of i to the points in another cluster)
  – The silhouette coefficient for the point is then given by
    s = 1 − a/b if a < b   (or s = b/a − 1 if a ≥ b, not the usual case)
  – Typically between 0 and 1; the closer to 1 the better.

• Can calculate the Average Silhouette width for a cluster or a clustering
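As an illustration, the average silhouette width can be computed with scikit-learn (a sketch under the assumption that scikit-learn is available; silhouette_score returns the mean coefficient over all points):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (40, 2)), rng.normal(4, 0.4, (40, 2))])

# Compare candidate values of k by their average silhouette width.
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # k = 2 should score highest here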


External Measures of Cluster Validity: Entropy and Purity
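Only the heading of this slide survives; as a rough sketch of the two measures it names (ours, not from the lecture), purity and entropy can be computed from the class-label counts inside each cluster:

import numpy as np

def purity_and_entropy(cluster_labels, class_labels):
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    n = len(class_labels)
    purity, entropy = 0.0, 0.0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        counts = np.unique(members, return_counts=True)[1]
        p = counts / counts.sum()                 # class distribution inside cluster c
        purity += counts.max() / n                # weighted majority-class fraction
        entropy += (counts.sum() / n) * -(p * np.log2(p)).sum()
    return purity, entropy

# Toy example: three clusters, true classes A/B.
pur, ent = purity_and_entropy([0, 0, 0, 1, 1, 2, 2, 2],
                              ['A', 'A', 'B', 'B', 'B', 'A', 'A', 'A'])
print(round(float(pur), 3), round(float(ent), 3))   # 0.875 0.344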
Outliers Detection
• Important in many applications like anomaly
detection
• Outliers are points not belonging to any
cluster
• Many outlier detection algorithms available

48
End of Clustering

49
