Cluster-Analysis
Pukar Karki
Assistant Professor
[email protected]
Contents
1. Basics and Algorithms
2. K-means Clustering
3. Hierarchical Clustering
4. DBSCAN Clustering
5. Issues : Evaluation, Scalability, Comparison
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes (i.e., learning by
observations vs. learning by examples: supervised)
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering for Data Understanding and Applications
Biology: taxonomy of living things: kingdom, phylum, class, order, family,
genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth observation
database
Marketing: Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs
City-planning: Identifying groups of houses according to their house type,
value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered
along continent faults
Climate: understanding Earth's climate, finding patterns of atmospheric and ocean data
Economic Science: market research
Clustering as a Preprocessing Tool (Utility)
Summarization:
Preprocessing for regression, PCA, classification, and association
analysis
Compression:
Image processing: vector quantization
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier detection
Outliers are often viewed as those “far away” from any cluster
Vector Quantization
[Figure: left, the original image; middle, reconstruction using 23.9% of the storage; right, reconstruction using 6.25% of the storage.]
K-means is often called “Lloyd’s algorithm” in computer science and engineering, and is
used in vector quantization for compression
Basic idea: run K-means clustering on 4 × 4 squares of pixels in an image, and keep only
the clusters and labels. Smaller K means more compression
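A minimal sketch of this compression scheme using NumPy and scikit-learn (the block size, K, and the grayscale image array img are illustrative assumptions, not values from the slide):

```python
import numpy as np
from sklearn.cluster import KMeans

def vq_compress(img, block=4, K=32):
    """Vector-quantize a grayscale image: cluster block x block patches with
    K-means and rebuild the image from the K centroid patches."""
    h, w = img.shape
    h, w = h - h % block, w - w % block           # crop so blocks tile exactly
    patches = (img[:h, :w]
               .reshape(h // block, block, w // block, block)
               .swapaxes(1, 2)
               .reshape(-1, block * block))        # one row per block
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(patches)
    coded = km.cluster_centers_[km.labels_]        # replace each block by its centroid
    return (coded.reshape(h // block, w // block, block, block)
                 .swapaxes(1, 2)
                 .reshape(h, w))
```

Each block is then stored as a small label (about log2 K bits) plus one shared codebook of K centroid patches, which is where the storage savings come from.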
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method
its implementation, and
its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function, typically metric:
d(i, j)
The definitions of distance functions are usually rather different for
interval-scaled, Boolean, categorical, ordinal, ratio, and vector variables (standard examples are given below)
Weights should be associated with different variables based on
applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that measures the
“goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
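For reference, the most commonly used metric distance functions for numeric (interval-scaled) variables, for objects i = (x_i1, …, x_ip) and j = (x_j1, …, x_jp), are special cases of the Minkowski distance:

\[
d(i, j) = \left( \sum_{f=1}^{p} \lvert x_{if} - x_{jf} \rvert^{h} \right)^{1/h},
\]

with h = 1 giving the Manhattan distance and h = 2 the Euclidean distance.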
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Requirements and Challenges
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
Major Clustering Approaches
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing
the sum of squared errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
Contents
1. Basics and Algorithms
2. K-means Clustering
3. Hierarchical Clustering
4. DBSCAN Clustering
5. Issues : Evaluation, Scalability, Comparison
Partitioning Algorithms: Basic Concept
Partitioning method: partition a database D of n objects into a set of k
clusters such that the sum of squared distances to the cluster representatives is
minimized (where ci is the centroid or medoid of cluster Ci)
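In symbols, the standard form of this objective is

\[
E \;=\; \sum_{i=1}^{k} \; \sum_{p \in C_i} d(p, c_i)^{2},
\]

where k-means uses the cluster mean (centroid) for c_i with Euclidean distance, and k-medoids uses the most centrally located object of C_i.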
An Example of K-Means Clustering
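A minimal NumPy sketch of the k-means (Lloyd) iteration, with random data standing in for the slide's example (the dataset X and k below are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # arbitrary initial centers
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Example: three well-separated 2-D blobs (illustrative data, not the slide's).
rng_data = np.random.default_rng(1)
X = np.vstack([rng_data.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (3, 3), (0, 3)]])
labels, centroids = kmeans(X, k=3)
```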
Comments on the K-Means Method
Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k)), where s is the sample size
Comment: Often terminates at a local optimum.
Weakness
Applicable only to objects in a continuous n-dimensional space
Using the k-modes method for categorical data
In comparison, k-medoids can be applied to a wide range of data
Need to specify k, the number of clusters, in advance (there are ways to
automatically determine the best k; see Hastie et al., 2009)
Sensitive to noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
Most variants of k-means differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers !
Since an object with an extremely large value may substantially distort the
distribution of the data
K-Medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located
object in a cluster
PAM: A Typical K-Medoids Algorithm
[Figure: PAM with K = 2 on ten points in a 10 × 10 grid; total cost of the initial clustering = 20, total cost after the candidate swap = 26]
Arbitrarily choose k objects as the initial medoids
Assign each remaining object to the nearest medoid
Do loop, until no change:
  Randomly select a non-medoid object, O_random
  Compute the total cost of swapping a medoid with O_random
  Swap if the quality is improved
The K-Medoid Clustering Method
K-Medoids Clustering: Find representative objects (medoids) in clusters
PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
Starts from an initial set of medoids and iteratively replaces one of the medoids
by one of the non-medoids if it improves the total distance of the resulting
clustering
PAM works effectively for small data sets, but does not scale well for large data
sets (due to the computational complexity)
Efficiency improvement on PAM
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
CLARANS (Ng & Han, 1994): Randomized re-sampling
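A compact Python sketch of the PAM swap search described above (a plain implementation for illustration; the precomputed dissimilarity matrix D and k are assumed inputs, and none of the CLARA/CLARANS sampling tricks are used):

```python
import numpy as np

def pam(D, k, max_iter=100, seed=0):
    """PAM-style k-medoids on a precomputed n x n dissimilarity matrix D."""
    n = len(D)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))     # initial medoids

    def total_cost(meds):
        # Each object contributes its distance to the closest medoid.
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        best = (cost, None)
        # Try every (medoid, non-medoid) swap and keep the best improvement.
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                c = total_cost(candidate)
                if c < best[0]:
                    best = (c, candidate)
        if best[1] is None:            # no improving swap: local optimum reached
            break
        cost, medoids = best
    labels = D[:, medoids].argmin(axis=1)   # cluster index of each object
    return medoids, labels
```

Because the only input is a dissimilarity matrix, the same sketch works for any data type for which pairwise dissimilarities can be defined, which is why k-medoids applies more widely than k-means.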
Contents
1. Basics and Algorithms
2. K-means Clustering
3. Hierarchical Clustering
4. DBSCAN Clustering
5. Issues : Evaluation, Scalability, Comparison
Hierarchical Clustering
There are two basic approaches for generating a hierarchical
clustering:
Agglomerative: Start with the points as individual clusters and, at each
step, merge the closest pair of clusters. This requires defining a notion
of cluster proximity.
Divisive: Start with one, all-inclusive cluster and, at each step, split a
cluster until only singleton clusters of individual points remain. In this
case, we need to decide which cluster to split at each step and how to
do the splitting.
Hierarchical Clustering
Uses the distance matrix as the clustering criterion.
This method does not require the number of clusters k as an input, but
needs a termination condition
[Figure: AGNES (agglomerative) builds the hierarchy bottom-up over steps 0–4, merging a+b → ab, d+e → de, c+de → cde, and ab+cde → abcde; DIANA (divisive) produces the same hierarchy top-down, splitting in the reverse order (steps 4 down to 0).]
Basic Agglomerative Hierarchical Clustering Algorithm
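In outline, the basic agglomerative algorithm is: compute the proximity matrix; then repeat — merge the two closest clusters and update the proximity matrix to reflect the new cluster — until only one cluster remains. A minimal SciPy sketch (the random data X and the choice of 'single' linkage are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.random((12, 2))                 # twelve 2-D points, like p1 ... p12 in the figures below

# linkage() runs the agglomerative algorithm; 'method' picks the inter-cluster
# proximity: 'single' (MIN), 'complete' (MAX), 'average', or 'ward'.
Z = linkage(X, method='single', metric='euclidean')

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters
# dendrogram(Z) draws the merge tree when matplotlib is available.
```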
Steps 1 and 2
[Figure: points p1 ... p12 each start as their own cluster, and the initial proximity matrix is computed.]
Intermediate Situation
✔ After some merging steps, we have clusters C1, C2, C3, C4, C5 and their proximity matrix.
Step 4
✔ We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
Step 5
✔ The question is "How do we update the proximity matrix?" The row and column for the merged cluster C2 ∪ C5 (its proximities to C1, C3, and C4) are the entries marked "?".
How to Define Inter-Cluster Similarity
[Figure: proximity matrix over points p1 ... p5]
✔ MIN (single link)
✔ MAX (complete link)
✔ Group Average
✔ Distance Between Centroids
✔ Other methods driven by an objective function
– Ward’s Method uses squared error
Distance between Clusters
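For reference (standard definitions, stated here rather than transcribed from the slide), the usual forms of these inter-cluster distances between clusters C_i and C_j, with sizes n_i, n_j and centroids m_i, m_j, are:

\[
\begin{aligned}
d_{\min}(C_i, C_j) &= \min_{p \in C_i,\; q \in C_j} d(p, q) && \text{(MIN / single link)}\\
d_{\max}(C_i, C_j) &= \max_{p \in C_i,\; q \in C_j} d(p, q) && \text{(MAX / complete link)}\\
d_{\mathrm{avg}}(C_i, C_j) &= \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q) && \text{(group average)}\\
d_{\mathrm{cen}}(C_i, C_j) &= d(m_i, m_j) && \text{(distance between centroids)}
\end{aligned}
\]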
Example: Single Link or MIN
For the single link or MIN version of hierarchical clustering, the proximity of
two clusters is defined as the minimum of the distance between any two points
in the two different clusters.
[Figure: six points (1–6) clustered with single link, shown as the original points, a two-cluster solution, and a three-cluster solution, together with the dendrogram (merge heights roughly 0.05–0.2, leaves ordered 3, 6, 2, 5, 4, 1).]
• Sensitive to noise
Example: Complete Link or MAX
For the complete link or MAX version of hierarchical clustering,
the proximity of two clusters is defined as the maximum of the
distance (minimum of the similarity) between any two points in
the two different clusters.
Complete link is less susceptible to noise and outliers, but it can
break large clusters and it favors globular shapes.
Hierarchical Clustering: MAX
[Figure: the same six points clustered with complete link (MAX): nested clusters including {3, 6, 4} and {2, 5}, and the corresponding dendrograms (leaves ordered 3, 6, 4, 1, 2, 5).]
Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}),
clusters {3, 6, 4} and {2, 5} are merged at the fourth stage.
Hierarchical Clustering: Group Average
✔ Strengths
– Less susceptible to noise
✔ Limitations
– Biased towards globular clusters
Cluster Similarity: Ward’s Method
✔ Similarity of two clusters is based on the increase in squared
error when two clusters are merged
– Similar to group average if distance between points is distance
squared
[Figure: comparison of the clusterings produced by MIN, MAX, Group Average, and Ward’s Method on the same six points.]
Hierarchical Clustering: Time and Space requirements
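✔ O(m²) space to store the proximity matrix for m points
✔ O(m³) time for a straightforward implementation (m − 1 merge steps, each scanning and updating an m × m matrix); roughly O(m² log m) is achievable by keeping each cluster’s proximities in a sorted list or heap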
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own
Contents
1. Basics and Algorithms
2. K-means Clustering
3. Hierarchical Clustering
4. DBSCAN Clustering
5. Issues : Evaluation, Scalability, Comparison
Density-Based Clustering Methods
Partitioning and hierarchical methods are designed to find spherical-shaped
clusters.
They have difficulty finding clusters of arbitrary shape, such as the "S"-shaped
and oval clusters in the figure.
Density-Based Clustering Methods
To find clusters of arbitrary shape, alternatively, we can model clusters as
dense regions in the data space, separated by sparse regions.
This is the main strategy behind density-based clustering methods, which can
discover clusters of nonspherical shape.
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected
points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
DBSCAN
The density of an object o can be measured by the number of
objects close to o.
DBSCAN (Density-Based Spatial Clustering of Applications
with Noise) finds core objects, that is, objects that have dense
neighborhoods.
It connects core objects and their neighborhoods to form dense
regions as clusters.
DBSCAN
“How does DBSCAN quantify the neighborhood of an object?”
→ A user-specified parameter ε > 0 specifies the radius of the neighborhood
considered for every object: the ε-neighborhood of an object o is the space
within radius ε centered at o.
DBSCAN
→ To determine whether a neighborhood is dense or not, DBSCAN uses
another user-specified parameter, MinPts, which specifies the density
threshold of dense regions.
DBSCAN
Given a set, D, of objects, we can identify all core objects with respect
to the given parameters, ε and MinPts.
The clustering task is therein reduced to using core objects and their
neighborhoods to form dense regions, where the dense regions are
clusters.
Directly Density-Reachable
For a core object q and an object p, we say that p is directly
density-reachable from q (with respect to ε and MinPts) if p is within
the ε-neighborhood of q.
Density-Reachable and Density-Connected
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if
there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is
directly density-reachable from pi
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts if there
is a point o such that both p and q are density-reachable from o
w.r.t. Eps and MinPts
Consider the following figure for a given ε represented by the radius of
the circles, and, say, let MinPts = 3.
DBSCAN: Determining EPS and MinPts
✔ Idea is that for points in a cluster, their kth nearest neighbors are
at close distance
✔ Noise points have the kth nearest neighbor at farther distance
✔ So, plot sorted distance of every point to its kth nearest neighbor
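A short scikit-learn/matplotlib sketch of this k-distance heuristic (k would typically be set to MinPts; the data array X is an assumption — the "knee" of the resulting curve is a reasonable choice of Eps):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k):
    """Plot every point's distance to its k-th nearest neighbor, sorted."""
    # k + 1 neighbors because each point is its own nearest neighbor (distance 0).
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    kth = np.sort(dists[:, -1])
    plt.plot(kth)
    plt.xlabel("points sorted by k-th NN distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
```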
DBSCAN: Core, Border and Noise Points
Core points: These points are in the interior of a density-based cluster. A point
is a core point if there are at least MinPts points within a distance of Eps of it,
where MinPts and Eps are user-specified parameters. In the figure, point A is a
core point for the indicated radius (Eps) if MinPts ≤ 7.
DBSCAN: Core, Border and Noise Points
Border points: A border point is not a core point, but falls within the
neighborhood of a core point. In the figure, point B is a border point. A border
point can fall within the neighborhoods of several core points.
DBSCAN: Core, Border and Noise Points
Noise points: A noise point is any point that is neither a core point nor a
border point. In the figure, point C is a noise point.
DBSCAN Algorithm
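A compact Python sketch of the standard DBSCAN procedure (brute-force neighborhood search, illustrative parameter names; it is not transcribed from the slide's own pseudocode):

```python
import numpy as np

def region_query(X, i, eps):
    """Indices of all points within distance eps of point i (its eps-neighborhood)."""
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            continue                    # not a core point; stays noise unless claimed as a border point later
        # Start a new cluster from core point i and expand it.
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:         # unassigned (or noise) -> joins this cluster as core or border
                labels[j] = cluster_id
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(X, j, eps)
                if len(j_neighbors) >= min_pts:   # j is also a core point: expand through it
                    seeds.extend(j_neighbors)
        cluster_id += 1
    return labels
```

With scikit-learn, the equivalent call is DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X), where the label −1 marks noise points.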
When DBSCAN Works Well
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Requirements and Challenges
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
Cluster Validity
✔ For supervised classification we have a variety of measures to evaluate
how good our model is
– Accuracy, precision, recall
[Figure: scatter plots (x and y in [0, 1]) of the same data clustered by K-means and by complete link.]
Measures of Cluster Validity
✔ Numerical measures that are applied to judge various aspects of
cluster validity are classified into the following two types.
– Supervised: Used to measure the extent to which cluster labels match
externally supplied class labels.
Entropy
Often called external indices because they use information external to the data
– Unsupervised: Used to measure the goodness of a clustering structure
without respect to external information.
Sum of Squared Error (SSE)
Often called internal indices because they only use information in the data
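As a concrete example of an internal (unsupervised) index, SSE can be computed directly from a clustering; X, labels, and centroids are assumed to come from something like the earlier k-means sketch (scikit-learn's KMeans exposes the same quantity as inertia_):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors: squared distance of every point to its cluster centroid."""
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))
```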