Slide-08-Chapter10-Cluster Analysis Basic Concept I

Cluster analysis involves grouping similar data objects into clusters based on their characteristics, and is commonly used in various fields such as biology, marketing, and city planning. It can serve as a standalone tool for data insight or as a preprocessing step for other algorithms, with methods including partitioning, hierarchical, and density-based approaches. The quality of clustering is determined by intra-class and inter-class similarity, and various algorithms like k-means and k-medoids are utilized to achieve effective clustering.


Chapter 10. Cluster Analysis: Basic Concepts and Methods
HUI-YIN CHANG (張彙音)

1
What is Cluster Analysis?
Cluster: A collection of data objects
◦ similar (or related) to one another within the same group
◦ dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
◦ Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes (i.e., learning by
observations vs. learning by examples: supervised)
Typical applications
◦ As a stand-alone tool to get insight into data distribution
◦ As a preprocessing step for other algorithms

2
Clustering for Data Understanding and Applications
Biology: taxonomy of living things: kingdom, phylum, class, order, family,
genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth observation
database
Marketing: Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs
City-planning: Identifying groups of houses according to their house type,
value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered
along continental faults
Climate: understanding Earth's climate, finding patterns in the atmosphere and ocean
Economic Science: market research

3
Clustering as a Preprocessing Tool (Utility)

Summarization:
◦ Preprocessing for regression, PCA, classification, and association
analysis
Compression:
◦ Image processing: vector quantization
Finding K-nearest Neighbors
◦ Localizing search to one or a small number of clusters
Outlier detection
◦ Outliers are often viewed as those “far away” from any cluster

4
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters
◦ high intra-class similarity: cohesive within clusters
◦ low inter-class similarity: distinctive between clusters

The quality of a clustering method depends on


◦ the similarity measure used by the method
◦ its implementation, and
◦ its ability to discover some or all of the hidden patterns

5
Measure the Quality of Clustering
Dissimilarity/Similarity metric
◦ Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
◦ The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables
◦ Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
◦ There is usually a separate “quality” function that
measures the “goodness” of a cluster.
◦ It is hard to define “similar enough” or “good enough”
◦ The answer is typically highly subjective

6
Considerations for Cluster Analysis
Partitioning criteria
◦ Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
◦ Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
Similarity measure
◦ Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or
contiguity)
Clustering space
◦ Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)

7
Requirements and Challenges
Scalability
◦ Clustering all the data instead of only samples
Ability to deal with different types of attributes
◦ Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
◦ User may give inputs on constraints
◦ Use domain knowledge to determine input parameters
Interpretability and usability
Others
◦ Discovery of clusters with arbitrary shape
◦ Ability to deal with noisy data
◦ Incremental clustering and insensitivity to input order
◦ High dimensionality

8
Major Clustering Approaches (I)
Partitioning approach:
◦ Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
◦ Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
◦ Create a hierarchical decomposition of the set of data (or objects) using
some criterion
◦ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
◦ Based on connectivity and density functions
◦ Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach:
◦ based on a multiple-level granularity structure
◦ Typical methods: STING, WaveCluster, CLIQUE

9
Major Clustering Approaches (II)
Model-based:
◦ A model is hypothesized for each of the clusters, and the method tries to find
the best fit of that model to the data
◦ Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
◦ Based on the analysis of frequent patterns
◦ Typical methods: p-Cluster
User-guided or constraint-based:
◦ Clustering by considering user-specified or application-specific constraints
◦ Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
◦ Objects are often linked together in various ways
◦ Massive links can be used to cluster objects: SimRank, LinkClus

10
Partitioning Algorithms: Basic Concept

Partitioning method: Partitioning a database D of n objects into a set of k
clusters, such that the sum of squared distances is minimized (where ci is the
centroid or medoid of cluster Ci)

$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$

Given k, find a partition of k clusters that optimizes the chosen partitioning
criterion
◦ Global optimal: exhaustively enumerate all partitions
◦ Heuristic methods: k-means and k-medoids algorithms
◦ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the
center of the cluster
◦ k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87):
Each cluster is represented by one of the objects in the cluster
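As a concrete illustration, the criterion E above can be computed in a few lines of NumPy; the array names X (objects), labels (cluster assignments), and centers (one centroid or medoid per cluster) are assumptions of this sketch, not notation from the slides.

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared distances of each object to its cluster representative."""
    total = 0.0
    for i, c in enumerate(centers):
        members = X[labels == i]            # objects assigned to cluster i
        total += np.sum((members - c) ** 2)
    return total
```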

11
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:

1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current
   partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when the assignment no longer changes
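A minimal NumPy sketch of these four steps (illustrative only; it seeds the algorithm with k randomly chosen objects rather than a random partition, a common variant, and it does not handle clusters that become empty).

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial seed points
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean point of its cluster
        # (empty clusters are not handled in this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # stop when nothing changes
            break
        centers = new_centers
    return labels, centers
```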

12
An Example of K-Means Clustering

K = 2

(Figure: starting from the initial data set, arbitrarily partition the objects into k groups, update the cluster centroids, reassign objects to their nearest centroid, and loop if needed.)

◼ Partition objects into k nonempty subsets
◼ Repeat
◼ Compute the centroid (i.e., mean point) of each partition
◼ Assign each object to the cluster of its nearest centroid
◼ Until no change

13
Comments on the K-Means Method

Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
◦ Comparison: PAM (partition around medoids): O(k(n-k)²), CLARA: O(ks² + k(n-k))

Comment: Often terminates at a local optimum.

Weakness
◦ Applicable only to objects in a continuous n-dimensional space
◦ Using the k-modes method for categorical data
◦ In comparison, k-medoids can be applied to a wide range of data
◦ Need to specify k, the number of clusters, in advance (there are ways to
automatically determine the best k; see Hastie et al., 2009)
◦ Sensitive to noisy data and outliers
◦ Not suitable for discovering clusters with non-convex shapes

14
Variations of the K-Means Method

Most variants of k-means differ in

◦ Selection of the initial k means

◦ Dissimilarity calculations

◦ Strategies to calculate cluster means

Handling categorical data: k-modes

◦ Replacing means of clusters with modes

◦ Using new dissimilarity measures to deal with categorical objects

◦ Using a frequency-based method to update modes of clusters

◦ A mixture of categorical and numerical data: k-prototype method

15
What Is the Problem of the K-Means Method?

The k-means algorithm is sensitive to outliers !

◦ Since an object with an extremely large value may substantially distort the
distribution of the data

K-Medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located object
in a cluster


16
PAM: A Typical K-Medoids Algorithm
Total Cost = 20

(Figure: K = 2. Arbitrarily choose k objects as the initial medoids, then assign each remaining object to its nearest medoid.)

Total Cost = 26

(Figure: Randomly select a nonmedoid object O_random. Do loop until no change: compute the total cost of swapping a medoid O with O_random, and swap O and O_random if the quality is improved.)
17
The K-Medoid Clustering Method

K-Medoids Clustering: Find representative objects (medoids) in clusters


◦ PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
◦ Starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of the
resulting clustering
◦ PAM works effectively for small data sets, but does not scale well for large
data sets (due to the computational complexity)

Efficiency improvement on PAM


◦ CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
◦ CLARANS (Ng & Han, 1994): Randomized re-sampling
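A compact sketch of the PAM swap loop over a precomputed n × n distance matrix D; the function and variable names are illustrative, and real implementations (and CLARA/CLARANS) use far more efficient bookkeeping.

```python
import numpy as np

def pam(D, k, max_iter=100, seed=0):
    """PAM-style k-medoids on a precomputed distance matrix D (n x n)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))    # arbitrary initial medoids

    def total_cost(meds):
        # total distance from each object to its nearest medoid
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o                 # try swapping medoid i with non-medoid o
                c = total_cost(candidate)
                if c < cost:                     # keep the swap if quality is improved
                    medoids, cost, improved = candidate, c, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```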

18
Hierarchical Clustering
https://www.youtube.com/watch?v=ijUMKMC4f9I

Use a distance matrix as the clustering criterion. This method does not
require the number of clusters k as an input, but needs a
termination condition
Step 0 → Step 1 → Step 2 → Step 3 → Step 4: agglomerative (AGNES)
Step 4 → Step 3 → Step 2 → Step 1 → Step 0: divisive (DIANA)

(Figure: AGNES merges a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; DIANA splits in the reverse order.)
19
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
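A small sketch of single-link agglomerative clustering using SciPy rather than Splus; the toy data set below is made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])
Z = linkage(pdist(X), method='single')   # merge the least dissimilar pair at each step
print(Z)  # each row: the two clusters merged, their distance, and the new cluster size
```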


20
Dendrogram: Shows How Clusters are Merged

Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting
the dendrogram at the desired level; each
connected component then forms a cluster
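A short sketch of cutting a dendrogram at a chosen height with SciPy; the data and the threshold 2.0 are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])
Z = linkage(pdist(X), method='single')
labels = fcluster(Z, t=2.0, criterion='distance')  # cut the dendrogram at height 2.0
print(labels)  # objects with the same label form one connected component / cluster
```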

21
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., Splus

Inverse order of AGNES

Eventually each node forms a cluster on its own


22
Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in
the other, i.e., dist(Ki, Kj) = min{ d(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }

Complete link: largest distance between an element in one cluster and an element
in the other, i.e., dist(Ki, Kj) = max{ d(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }

Average: average distance between an element in one cluster and an element in the
other, i.e., dist(Ki, Kj) = avg{ d(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }

Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)

Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
◦ Medoid: a chosen, centrally located object in the cluster
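A sketch computing these five inter-cluster distances for two clusters given as NumPy arrays Ki and Kj; the medoid is taken here as the member with the smallest total distance to the rest of its own cluster, which is one common convention.

```python
import numpy as np

def cluster_distances(Ki, Kj):
    pair = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)  # all pairwise distances
    single   = pair.min()                 # smallest pairwise distance (single link)
    complete = pair.max()                 # largest pairwise distance (complete link)
    average  = pair.mean()                # average pairwise distance
    centroid = np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))
    def medoid(K):                        # member minimizing total distance to its own cluster
        d = np.linalg.norm(K[:, None, :] - K[None, :, :], axis=2)
        return K[d.sum(axis=1).argmin()]
    medoid_d = np.linalg.norm(medoid(Ki) - medoid(Kj))
    return single, complete, average, centroid, medoid_d
```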

23
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

Centroid: the "middle" of a cluster

$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$

Radius: square root of the average distance from any point of the
cluster to its centroid

$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$

Diameter: square root of the average mean squared distance between
all pairs of points in the cluster

$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
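These three quantities translate directly into NumPy; a sketch for a cluster stored as an N × d array T (the name T is an assumption of the sketch).

```python
import numpy as np

def centroid_radius_diameter(T):
    N = len(T)
    Cm = T.mean(axis=0)                                  # centroid
    Rm = np.sqrt(np.sum((T - Cm) ** 2) / N)              # radius
    diff = T[:, None, :] - T[None, :, :]                 # all pairwise differences
    Dm = np.sqrt(np.sum(diff ** 2) / (N * (N - 1)))      # diameter
    return Cm, Rm, Dm
```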

24
Extensions to Hierarchical Clustering
Major weakness of agglomerative clustering methods
◦ Can never undo what was done previously
◦ Do not scale well: time complexity of at least O(n²), where n is
the number of total objects

Integration of hierarchical & distance-based clustering


◦ BIRCH (1996): uses CF-tree and incrementally adjusts the quality
of sub-clusters
◦ CHAMELEON (1999): hierarchical clustering using dynamic
modeling
25
How to cluster the following data?

Source: https://jason-chen-1992.weebly.com/home/-dbscan
26
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-
connected points
Major features:
◦ Discover clusters of arbitrary shape
◦ Handle noise
◦ One scan
◦ Need density parameters as termination condition
Several interesting studies:
◦ DBSCAN: Ester, et al. (KDD’96)
◦ OPTICS: Ankerst, et al (SIGMOD’99).
◦ DENCLUE: Hinneburg & D. Keim (KDD’98)
◦ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

27
Density-Based Clustering: Basic Concepts
Two parameters:
◦ Eps: Maximum radius of the neighborhood
◦ MinPts: Minimum number of points in an Eps-neighborhood
of that point
NEps(p): {q ∈ D | dist(p, q) ≤ Eps}
Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
◦ p belongs to NEps(q)
◦ core point condition: |NEps(q)| ≥ MinPts

(Figure: p is directly density-reachable from q, with MinPts = 5 and Eps = 1 cm.)
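The Eps-neighborhood, the core-point condition, and direct density-reachability can be checked with a few lines of NumPy; a sketch with illustrative names, where X is an n × d array of points.

```python
import numpy as np

def eps_neighborhood(X, p_idx, eps):
    """Indices of all points within distance eps of point p, i.e. N_Eps(p)."""
    d = np.linalg.norm(X - X[p_idx], axis=1)
    return np.where(d <= eps)[0]

def is_core(X, q_idx, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(X, q_idx, eps)) >= min_pts

def directly_density_reachable(X, p_idx, q_idx, eps, min_pts):
    """p is directly density-reachable from q if p is in N_Eps(q) and q is a core point."""
    return p_idx in eps_neighborhood(X, q_idx, eps) and is_core(X, q_idx, eps, min_pts)
```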

28
Density-Reachable and Density-Connected

Density-reachable:
◦ A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a
chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly
density-reachable from pi

Density-connected:
◦ A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a
point o such that both p and q are density-reachable from o w.r.t. Eps and
MinPts

29
DBSCAN: Density-Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: A cluster is defined as a maximal set
of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise

(Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5.)

30
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p and DBSCAN visits
the next point of the database
Continue the process until all of the points have been processed
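A compact sketch of this procedure; the label bookkeeping (-1 for unassigned/noise) and the brute-force neighborhood queries are illustrative simplifications, and a library implementation such as scikit-learn's DBSCAN would normally be used instead.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / not yet assigned
    cluster_id = 0
    for p in range(n):
        if labels[p] != -1:
            continue
        neighbors = np.where(np.linalg.norm(X - X[p], axis=1) <= eps)[0]
        if len(neighbors) < min_pts:   # not a core point: skip (may become border later)
            continue
        labels[p] = cluster_id         # p is a core point: a cluster is formed
        queue = list(neighbors)
        while queue:                   # expand the cluster via density-reachability
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id
                q_neighbors = np.where(np.linalg.norm(X - X[q], axis=1) <= eps)[0]
                if len(q_neighbors) >= min_pts:      # q is also a core point
                    queue.extend(q_neighbors)
        cluster_id += 1
    return labels                      # points still labelled -1 are noise
```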

Source: https://zhuanlan.zhihu.com/p/395088759
31
DBSCAN: Sensitive to Parameters

32
DBSCAN pseudocode

33
Probabilistic Model-Based Clustering
Cluster analysis aims to find hidden categories.
A hidden category (i.e., probabilistic cluster) is a distribution over the data space,
which can be mathematically represented using a probability density function (or
distribution function).

◼ Ex. 2 categories for digital cameras sold


◼ consumer line vs. professional line
◼ density functions f1, f2 for C1, C2
◼ obtained by probabilistic clustering

◼ A mixture model assumes that a set of observed objects is a mixture


of instances from multiple probabilistic clusters, and conceptually
each observed object is generated independently
◼ Our task: infer a set of k probabilistic clusters that is most likely to
generate D using the above data generation process

34
Model-Based Clustering
A set C of k probabilistic clusters C1, …,Ck with probability density functions f1, …, fk,
respectively, and their probabilities ω1, …, ωk.
Probability of an object o generated by cluster Cj is $P(o \mid C_j) = \omega_j f_j(o)$
Probability of o generated by the set of clusters C is $P(o \mid C) = \sum_{j=1}^{k} \omega_j f_j(o)$
◼ Since objects are assumed to be generated
independently, for a data set D = {o1, …, on}, we have
$P(D \mid C) = \prod_{i=1}^{n} P(o_i \mid C)$

◼ Task: Find a set C of k probabilistic clusters s.t. P(D|C) is maximized


◼ However, maximizing P(D|C) is often intractable since the probability
density function of a cluster can take an arbitrarily complicated form
◼ To make it computationally feasible (as a compromise), assume the
probability density functions being some parameterized distributions

35
Univariate Gaussian Mixture Model
O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k
distributions), and Pj(oi | θj) is the probability that oi is generated from the j-th
distribution using parameter θj; we have
$P(o_i \mid \Theta) = \sum_{j=1}^{k} \omega_j\, P_j(o_i \mid \theta_j)$ and $P(O \mid \Theta) = \prod_{i=1}^{n} P(o_i \mid \Theta)$

◼ Univariate Gaussian mixture model


◼ Assume the probability density function of each cluster follows a 1-
d Gaussian distribution. Suppose that there are k clusters.
◼ The probability density function of each cluster is centered at μj
with standard deviation σj; with θj = (μj, σj), we have
$P_j(o_i \mid \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left(-\frac{(o_i - \mu_j)^2}{2\sigma_j^2}\right)$
36
The EM (Expectation Maximization) Algorithm
The k-means algorithm has two steps at each iteration:
◦ Expectation Step (E-step): Given the current cluster centers, each object is
assigned to the cluster whose center is closest to the object: An object is
expected to belong to the closest cluster
◦ Maximization Step (M-step): Given the cluster assignment, for each cluster, the
algorithm adjusts the center so that the sum of distances from the objects
assigned to this cluster to the new center is minimized
The (EM) algorithm: A framework to approach maximum likelihood or maximum a
posteriori estimates of parameters in statistical models.
◦ E-step assigns objects to clusters according to the current fuzzy clustering or
parameters of probabilistic clusters
◦ M-step finds the new clustering or parameters that minimize the sum of
squared error (SSE) or maximize the expected likelihood

37
Computing Mixture Models with EM
Given n objects O = {o1, …, on}, we want to mine a set of parameters Θ = {θ1, …, θk}
s.t. P(O | Θ) is maximized, where θj = (μj, σj) are the mean and standard deviation of the
j-th univariate Gaussian distribution
We initially assign random values to the parameters θj, then iteratively conduct the E-
and M-steps until convergence or the change is sufficiently small
At the E-step, for each object oi, calculate the probability that oi belongs to each
distribution, i.e., $P(\theta_j \mid o_i, \Theta) = \frac{\omega_j P_j(o_i \mid \theta_j)}{\sum_{l=1}^{k} \omega_l P_l(o_i \mid \theta_l)}$

◼ At the M-step, adjust the parameters θj = (μj, σj) so that the expected
likelihood P(O|Θ) is maximized
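A sketch of these E- and M-steps for a univariate Gaussian mixture, assuming equal cluster weights for simplicity (an assumption of this sketch, not a requirement of the method); o is a 1-D NumPy array of observed values.

```python
import numpy as np

def em_gmm_1d(o, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(o, size=k, replace=False).astype(float)   # random initial means
    sigma = np.full(k, o.std() + 1e-6)                        # initial standard deviations
    for _ in range(n_iter):
        # E-step: probability that each object belongs to each distribution
        dens = np.exp(-(o[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)         # responsibilities, shape (n, k)
        # M-step: re-estimate (mu_j, sigma_j) to maximize the expected likelihood
        weight = resp.sum(axis=0)
        mu = (resp * o[:, None]).sum(axis=0) / weight
        sigma = np.sqrt((resp * (o[:, None] - mu) ** 2).sum(axis=0) / weight) + 1e-6
    return mu, sigma
```

Calling em_gmm_1d on a sample drawn from two well-separated Gaussians typically recovers means near the true component centers, though, as the next slide notes, EM can converge to a local optimum.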

38
Advantages and Disadvantages of Mixture Models

Strength
◦ Mixture models are more general than partitioning and fuzzy clustering
◦ Clusters can be characterized by a small number of parameters
◦ The results may satisfy the statistical assumptions of the generative models

Weakness
◦ Converges to a local optimum (overcome by running multiple times with random initialization)
◦ Computationally expensive if the number of distributions is large, or the data
set contains very few observed data points
◦ Need large data sets
◦ Hard to estimate the number of clusters

39
Thanks for Your Attention
Q&A

40
