Slide-08-Chapter10-Cluster Analysis Basic Concept I
1
What is Cluster Analysis?
Cluster: A collection of data objects
◦ similar (or related) to one another within the same group
◦ dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
◦ Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes (i.e., learning by
observations vs. learning by examples: supervised)
Typical applications
◦ As a stand-alone tool to get insight into data distribution
◦ As a preprocessing step for other algorithms
2
Clustering for Data Understanding and Applications
Biology: taxonomy of living things: kingdom, phylum, class, order, family,
genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth observation
database
Marketing: Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs
City-planning: Identifying groups of houses according to their house type,
value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
Climate: understanding Earth's climate, finding patterns of atmospheric and ocean behavior
Economic Science: market research
3
Clustering as a Preprocessing Tool (Utility)
Summarization:
◦ Preprocessing for regression, PCA, classification, and association
analysis
Compression:
◦ Image processing: vector quantization
Finding K-nearest Neighbors
◦ Localizing search to one or a small number of clusters
Outlier detection
◦ Outliers are often viewed as those “far away” from any cluster
4
Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters
◦ high intra-class similarity: cohesive within clusters
◦ low inter-class similarity: distinctive between clusters
5
Measure the Quality of Clustering
Dissimilarity/Similarity metric
◦ Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
◦ The definitions of distance functions are usually rather different for interval-scaled, Boolean, categorical, ordinal, ratio, and vector variables (a small weighted-distance sketch follows at the end of this slide)
◦ Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
◦ There is usually a separate “quality” function that
measures the “goodness” of a cluster.
◦ It is hard to define “similar enough” or “good enough”
◦ The answer is typically highly subjective
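As a small illustration of a distance function d(i, j) with per-variable weights, here is a hedged NumPy sketch; the function name and the weight vector are assumptions for illustration, not part of the slide:

import numpy as np

def weighted_euclidean(x, y, w):
    # d(i, j): Euclidean distance with per-variable weights reflecting data semantics
    x, y, w = np.asarray(x, float), np.asarray(y, float), np.asarray(w, float)
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

# example: weight the second variable twice as heavily as the first
# weighted_euclidean([1.0, 2.0], [4.0, 6.0], w=[1.0, 2.0])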
6
Considerations for Cluster Analysis
Partitioning criteria
◦ Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
◦ Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
Similarity measure
◦ Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
Clustering space
◦ Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
7
Requirements and Challenges
Scalability
◦ Clustering all the data instead of only samples
Ability to deal with different types of attributes
◦ Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
◦ User may give inputs on constraints
◦ Use domain knowledge to determine input parameters
Interpretability and usability
Others
◦ Discovery of clusters with arbitrary shape
◦ Ability to deal with noisy data
◦ Incremental clustering and insensitivity to input order
◦ High dimensionality
8
Major Clustering Approaches (I)
Partitioning approach:
◦ Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
◦ Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
◦ Create a hierarchical decomposition of the set of data (or objects) using
some criterion
◦ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
◦ Based on connectivity and density functions
◦ Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach:
◦ Based on a multiple-level granularity structure
◦ Typical methods: STING, WaveCluster, CLIQUE
9
Major Clustering Approaches (II)
Model-based:
◦ A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
◦ Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
◦ Based on the analysis of frequent patterns
◦ Typical methods: p-Cluster
User-guided or constraint-based:
◦ Clustering by considering user-specified or application-specific constraints
◦ Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
◦ Objects are often linked together in various ways
◦ Massive links can be used to cluster objects: SimRank, LinkClus
10
Partitioning Algorithms: Basic Concept
A partitioning method constructs a partition of a set of n objects into k clusters so that the sum of squared errors is minimized:

E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2

where c_i is the centroid (mean point) of cluster C_i and the inner sum runs over all objects p assigned to C_i.
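A minimal NumPy sketch of evaluating this criterion, assuming X is an (n, d) array of objects, labels holds each object's cluster index, and centers holds the k centroids (all names are illustrative):

import numpy as np

def sse(X, labels, centers):
    # E = sum over clusters i, sum over points p in C_i, of ||p - c_i||^2
    diffs = X - centers[labels]          # vector from each object to its assigned centroid
    return float(np.sum(diffs ** 2))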
11
The K-Means Clustering Method
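A hedged NumPy sketch of the standard k-means loop (choose k initial centers, assign each object to the nearest center, recompute each center as the mean of its cluster, and repeat until the centers stop moving); the function and variable names are illustrative, not taken from the slide:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # k objects as initial centers
    for _ in range(n_iter):
        # assignment step: each object joins the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of the objects assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                      # converged: centers unchanged
            return labels, new_centers
        centers = new_centers
    return labels, centers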
12
An Example of K-Means Clustering
K=2
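For a runnable counterpart of a K = 2 example, scikit-learn's KMeans can be used; the toy data below is made up for illustration:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(30, 2)                                   # toy 2-D objects
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                                               # cluster index (0 or 1) per object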
13
Comments on the K-Means Method
Weakness
◦ Applicable only to objects in a continuous n-dimensional space
◦ Using the k-modes method for categorical data
◦ In comparison, k-medoids can be applied to a wide range of data
◦ Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
◦ Sensitive to noisy data and outliers
◦ Not suitable for discovering clusters with non-convex shapes
14
Variations of the K-Means Method
◦ Dissimilarity calculations
15
What Is the Problem of the K-Means Method?
◦ The k-means method is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data
16
PAM (Partitioning Around Medoids): A Typical K-Medoids Algorithm
[Figure: PAM swap example on a 2-D data set, Total Cost = 20]
◦ Arbitrarily choose k objects as the initial medoids
◦ Assign each remaining object to the nearest medoid
◦ Do loop, until no change:
  ◦ Randomly select a non-medoid object O_random
  ◦ Compute the total cost of swapping a medoid O with O_random
  ◦ Swap O and O_random if the quality of the clustering is improved
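A compact, unoptimized Python sketch of the swap phase described above (real PAM also uses a greedy BUILD phase and incremental cost updates); the names pam, total_cost, and X are assumptions:

import numpy as np

def total_cost(X, medoid_idx):
    # sum of distances from each object to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # arbitrary initial medoids
    improved = True
    while improved:                                             # do ... until no change
        improved = False
        for m in list(medoids):
            for o_random in range(len(X)):                      # candidate non-medoid object
                if o_random in medoids:
                    continue
                candidate = [o_random if x == m else x for x in medoids]
                if total_cost(X, candidate) < total_cost(X, medoids):   # quality improved: swap
                    medoids, improved = candidate, True
    return medoids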
17
The K-Medoid Clustering Method
18
Hierarchical Clustering
https://www.youtube.com/watch?v=ijUMKMC4f9I
20
Dendrogram: Shows How Clusters are Merged
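A hedged SciPy sketch of building and plotting such a dendrogram; the toy data and variable names are made up, and the method argument corresponds to the inter-cluster distances defined on a later slide:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(12, 2)              # toy 2-D data set
Z = linkage(X, method="average")       # merge tree; method may be "single", "complete",
                                       # "average", "centroid", ... (see the later slide
                                       # on distances between clusters)
dendrogram(Z)                          # plot merge heights as a tree
plt.show()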
21
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
22
Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = \min_{t_{ip} \in K_i,\, t_{jq} \in K_j} dist(t_{ip}, t_{jq})
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = \max_{t_{ip} \in K_i,\, t_{jq} \in K_j} dist(t_{ip}, t_{jq})
Average: average distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = \mathrm{avg}_{t_{ip} \in K_i,\, t_{jq} \in K_j} dist(t_{ip}, t_{jq})
Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
◦ Medoid: a chosen, centrally located object in the cluster
23
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
Centroid: the "middle" of a cluster

C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}

Radius: square root of the average distance from any point of the cluster to its centroid

R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}

Diameter: square root of the average mean squared distance between all pairs of points in the cluster

D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}
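A small NumPy sketch of these three quantities for one cluster stored as an (N, d) array pts (names are illustrative):

import numpy as np

def centroid(pts):
    return pts.mean(axis=0)                                    # C_m

def radius(pts):
    c = centroid(pts)
    return float(np.sqrt(np.mean(np.sum((pts - c) ** 2, axis=1))))   # R_m

def diameter(pts):
    n = len(pts)
    d2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=2)    # squared distances, all pairs
    return float(np.sqrt(d2.sum() / (n * (n - 1))))                  # D_m (diagonal terms are zero)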
24
Extensions to Hierarchical Clustering
Major weakness of agglomerative clustering methods
◦ Can never undo what was done previously
◦ Do not scale well: time complexity of at least O(n²), where n is the number of total objects
Source: https://jason-chen-1992.weebly.com/home/-dbscan
26
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-
connected points
Major features:
◦ Discover clusters of arbitrary shape
◦ Handle noise
◦ One scan
◦ Need density parameters as termination condition
Several interesting studies:
◦ DBSCAN: Ester, et al. (KDD’96)
◦ OPTICS: Ankerst, et al (SIGMOD’99).
◦ DENCLUE: Hinneburg & D. Keim (KDD’98)
◦ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
27
Density-Based Clustering: Basic Concepts
Two parameters:
◦ Eps: Maximum radius of the neighborhood
◦ MinPts: Minimum number of points in an Eps-neighborhood
of that point
N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
◦ p belongs to N_Eps(q)
◦ core point condition: |N_Eps(q)| ≥ MinPts
[Figure: p directly density-reachable from core point q, MinPts = 5]
28
Density-Reachable and Density-Connected
Density-reachable:
◦ A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
Density-connected:
◦ A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: density-reachability along a chain p1 = q, …, pn = p, and density-connectivity of p and q via o]
29
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
[Figure: core, border, and outlier (noise) points for Eps = 1 cm, MinPts = 5]
30
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p and DBSCAN visits
the next point of the database
Continue the process until all of the points have been processed
Source: https://zhuanlan.zhihu.com/p/395088759
31
DBSCAN: Sensitive to Parameters
32
DBSCAN: Pseudocode
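A straightforward, unoptimized Python sketch of the procedure from the previous slide; function and variable names are assumptions, not taken from the original pseudocode:

import numpy as np

def region_query(X, i, eps):
    # Eps-neighborhood of point i (including i itself)
    return [j for j in range(len(X)) if np.linalg.norm(X[i] - X[j]) <= eps]

def dbscan(X, eps, min_pts):
    labels = [None] * len(X)           # None = unvisited, -1 = noise, >= 0 = cluster id
    cluster = -1
    for i in range(len(X)):
        if labels[i] is not None:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:   # not a core point: tentatively mark as noise
            labels[i] = -1
            continue
        cluster += 1                   # p is a core point: a new cluster is formed
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                   # expand the cluster through density-reachable points
            j = seeds.pop()
            if labels[j] == -1:        # border point previously marked as noise
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(X, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels                      # points left at -1 are outliers/noise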
33
Probabilistic Model-Based Clustering
Cluster analysis aims to find hidden categories.
A hidden category (i.e., probabilistic cluster) is a distribution over the data space,
which can be mathematically represented using a probability density function (or
distribution function).
34
Model-Based Clustering
A set C of k probabilistic clusters C1, …,Ck with probability density functions f1, …, fk,
respectively, and their probabilities ω1, …, ωk.
Probability of an object o being generated by cluster Cj: ωj fj(o)
Probability of o being generated by the set of clusters C: P(o) = \sum_{j=1}^{k} ωj fj(o)
◦ Since objects are assumed to be generated independently, for a data set D = {o1, …, on}, we have P(D) = \prod_{i=1}^{n} P(oi) = \prod_{i=1}^{n} \sum_{j=1}^{k} ωj fj(oi)
35
Univariate Gaussian Mixture Model
Let O = {o1, …, on} be the n observed objects, Θ = {θ1, …, θk} the parameters of the k distributions, and Pj(oi | θj) the probability that oi is generated from the j-th distribution using parameter θj; we then have the mixture density below.
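In the usual formulation (with cluster weights ωj as introduced on the model-based clustering slide; this exact form is an assumption rather than a quotation from the slide):

P(o_i \mid \Theta) = \sum_{j=1}^{k} \omega_j \, P_j(o_i \mid \theta_j),
\qquad
P(O \mid \Theta) = \prod_{i=1}^{n} \sum_{j=1}^{k} \omega_j \, P_j(o_i \mid \theta_j)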
36
The EM (Expectation Maximization) Algorithm
The k-means algorithm has two steps at each iteration:
◦ Expectation Step (E-step): Given the current cluster centers, each object is
assigned to the cluster whose center is closest to the object: An object is
expected to belong to the closest cluster
◦ Maximization Step (M-step): Given the cluster assignment, for each cluster the algorithm adjusts the center so that the sum of distances from the objects assigned to this cluster to the new center is minimized
The (EM) algorithm: A framework to approach maximum likelihood or maximum a
posteriori estimates of parameters in statistical models.
◦ E-step assigns objects to clusters according to the current fuzzy clustering or
parameters of probabilistic clusters
◦ M-step finds the new clustering or parameters that minimize the sum of
squared error (SSE) or maximize the expected likelihood
37
Computing Mixture Models with EM
Given n objects O = {o1, …, on}, we want to mine a set of parameters Θ = {θ1, …, θk} s.t. P(O|Θ) is maximized, where θj = (μj, σj) are the mean and standard deviation of the j-th univariate Gaussian distribution
We initially assign random values to the parameters θj, then iteratively conduct the E- and M-steps until convergence or until the change is sufficiently small
At the E-step, for each object oi, calculate the probability that oi belongs to each distribution
◼ At the M-step, adjust the parameters θj = (μj, σj) so that the expected
likelihood P(O|Θ) is maximized
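A hedged NumPy sketch of these E- and M-steps for a univariate Gaussian mixture; variable and function names are illustrative, and a production implementation would add log-space computation and convergence checks:

import numpy as np

def norm_pdf(x, mu, sigma):
    # density of a univariate Gaussian with mean mu and standard deviation sigma
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(o, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(o, size=k, replace=False)      # random initial means
    sigma = np.full(k, o.std())                    # initial standard deviations
    w = np.full(k, 1.0 / k)                        # mixture weights
    for _ in range(n_iter):
        # E-step: probability (responsibility) that each o_i belongs to each distribution
        dens = np.array([w[j] * norm_pdf(o, mu[j], sigma[j]) for j in range(k)])   # shape (k, n)
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate parameters to maximize the expected likelihood
        nj = resp.sum(axis=1)
        mu = (resp @ o) / nj
        sigma = np.sqrt((resp * (o - mu[:, None]) ** 2).sum(axis=1) / nj)
        w = nj / len(o)
    return mu, sigma, w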
38
Advantages and Disadvantages of Mixture Models
Strength
◦ Mixture models are more general than partitioning and fuzzy clustering
◦ Clusters can be characterized by a small number of parameters
◦ The results may satisfy the statistical assumptions of the generative models
Weakness
◦ Converges to a local optimum (overcome: run multiple times with random initialization)
◦ Computationally expensive if the number of distributions is large, or the data
set contains very few observed data points
◦ Need large data sets
◦ Hard to estimate the number of clusters
39
Thanks for Your Attention
Q&A
40