
Data Mining:

Concepts and Techniques


(3rd ed.)

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University

1
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Density-Based Methods

 Grid-Based Methods

 Evaluation of Clustering

 Summary
2
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group
 dissimilar (or unrelated) to the objects in other groups
 Cluster analysis (or clustering, data segmentation, …)
 Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
 Unsupervised learning:
 no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms

3
Clustering: Rich Applications and Multidisciplinary
Efforts

 Business intelligence
• Organize customers into groups and improve project management
 Pattern Recognition
 Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters or for other spatial mining tasks
 Image Processing
 WWW
 Document classification
 Cluster Web blog data to discover groups of similar access patterns

4
Clustering: Application Examples
 Biology:
 taxonomy of living things: kingdom, phylum, class, order, family, genus
and species
 Information retrieval: document clustering
 Land use:
 Identification of areas of similar land use in an earth observation
database
 City-planning: Identifying groups of houses according to their house type,
value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered
along continent faults
 Climate: understanding Earth's climate; finding patterns in atmospheric and
ocean data
 Economic Science: market research

5
Requirements and Challenges
 Scalability
 Clustering all the data instead of only samples; a large database
may contain millions or even billions of data objects
 Ability to deal with different types of attributes
 Numerical, binary, categorical, ordinal, linked, and mixtures of these
 New applications need to deal with graphs, sequences, images and
documents
 Discovery of clusters with arbitrary shape
 Algorithms based on distance measures tend to find only spherical clusters with similar size and density
 Ability to deal with noisy data
 Incremental clustering and insensitivity to input order
 Capability of clustering high-dimensionality data
 Requirements for domain knowledge (e.g., the required number of clusters)
 Constraint-based clustering
 Users may give inputs in the form of constraints
 Interpretability and usability
6
Quality: What Is Good Clustering?

 A good clustering method will produce high quality


clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
 the similarity measure used by the method
 its implementation, and
 its ability to discover some or all of the hidden patterns

7
Clustering Problem

8
Measure the Quality of Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function, typically
metric: d(i, j)
 The definitions of distance functions are usually rather different
for interval-scaled, boolean, categorical, ordinal, ratio, and vector
variables
 Weights should be associated with different variables based on
applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that measures the
“goodness” of a cluster.
 It is hard to define “similar enough” or “good enough”
 The answer is typically highly subjective
9
Major Clustering Approaches (I)
 Partitioning approach:
 Given n objects, a partitioning method constructs k partitions of the data, where
each partition represents a cluster, k ≤ n
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using
some criterion
 Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
 Quantize the object space into a finite number of cells that form a grid
structure
 Typical methods: STING, WaveCluster, CLIQUE
10
Major Clustering Approaches (II)
 Model-based:
 A model is hypothesized for each of the clusters, and the method tries to find
the best fit of the data to the given model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: p-Cluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific
constraints
 Typical methods: COD (obstacles), constrained clustering
 Link-based clustering:
 Objects are often linked together in various ways
 Massive links can be used to cluster objects: SimRank, LinkClus

11
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Density-Based Methods

 Grid-Based Methods

 Evaluation of Clustering

 Summary
12
Partitioning Algorithms: Basic Concept

 Partitioning method:
 Partition a dataset D of n objects into a set of k clusters such that
the sum of squared distances is minimized (where an object
p ∈ Ci and ci is the centroid or medoid of cluster Ci)

 Clustering by k-means minimizes the sum of squared errors:

E = \sum_{i=1}^{k} \sum_{p \in C_i} (d(p, c_i))^2

13
Partitioning Algorithms: Basic Concept

 Given k, find a partition of k clusters that optimizes the chosen


partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is
represented by the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in


four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the
clusters of the current partitioning (the centroid is
the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest
seed point
4. Go back to Step 2, stop when the assignment
does not change

15
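A minimal sketch of these four steps in Python (NumPy only; the random initial partition, the empty-cluster fallback, and the stopping test are implementation choices, not part of the slide):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array, returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: start from an arbitrary partition (here: random labels)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each cluster
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(len(X))] for j in range(k)])
        # Step 3: assign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the assignment no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids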
An Example of K-Means Clustering

K = 2
(figure: starting from the initial data set, arbitrarily partition the objects
into k groups and update the cluster centroids; then reassign objects and
update the centroids again, looping if needed)

 Partition objects into k nonempty
subsets
 Repeat
 Compute the centroid (i.e., mean point) of each partition
 (Re)assign each object to the
cluster of its nearest centroid
 Until no change
16
K-means example
 Use the k-means algorithm and Euclidean distance to cluster the
following 8 examples into 3 clusters: A1=(2,10), A2=(2,5), A3=(8,4),
A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).

 d(a,b) denotes the Euclidean distance between a and b


 d(a,b) = sqrt((x_b - x_a)^2 + (y_b - y_a)^2)
 Let the initial seeds be
 seed1 = A1 = (2,10)
 seed2 = A4 = (5,8)
 seed3 = A7 = (1,2)

17
K-means example

18
K-means example

19
K-means example

 This process is repeated until there is no
change in the newly formed clusters

20
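A quick way to check the worked example, assuming scikit-learn is available; the initial centroids are fixed to the seeds A1, A4, A7 given above (n_init=1 keeps exactly that initialization):

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]])   # A1 ... A8
seeds = np.array([[2, 10], [5, 8], [1, 2]])           # A1, A4, A7

km = KMeans(n_clusters=3, init=seeds, n_init=1).fit(points)
print(km.labels_)           # cluster index assigned to A1 ... A8
print(km.cluster_centers_)  # final centroids after convergence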
Comments on the K-Means Method
 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
 Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
 Comment: Often terminates at a local optimum
 Weakness
 Applicable only to objects in a continuous n-dimensional space
 For nominal data, a variant of k-means known as k-modes can
be used
 Need to specify k, the number of clusters, in advance
 Sensitive to noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes

21
What Is the Problem of the K-Means Method?

 The k-means algorithm is sensitive to outliers !

 Since an object with an extremely large value may substantially


distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the object in a
cluster as a reference point, medoids can be used, which is the most
centrally located object in a cluster


22
PAM: A Typical K-Medoids Algorithm

K = 2. Arbitrarily choose k objects as the initial medoids, and assign each
remaining object to the nearest medoid. Then loop: randomly select a
non-medoid object O_random, compute the total cost of swapping a current
medoid with O_random, and perform the swap if the quality of the clustering
is improved; stop when there is no change. (The example figure shows two
configurations, with total costs 20 and 26.)
23
The K-Medoid Clustering Method

 K-Medoids Clustering: Find representative objects (medoids) in clusters

 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)

 Starts from an initial set of medoids and iteratively replaces one


of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
 PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
 Efficiency improvement on PAM

 CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples

 CLARANS (Ng & Han, 1994): Randomized re-sampling


24
PAM Algorithm

25
PAM

 Consider an example; the table (figure) gives the distances between nodes


 Let us place the nodes into clusters randomly: {A, C, D} and {B, E} are the
clusters, A and B are the medoids, and {C, D, E} are non-medoid nodes
 Let us examine which nodes could replace the medoids A and B

26
PAM
 To replace the existing medoids, calculate the cost of replacement for
each pair of medoid and non-medoid
 e.g., we have to find TC_AC, TC_AD, TC_AE, TC_BC, TC_BD and TC_BE
 Here TC_AC is calculated as

TC_AC = C_AAC + C_BAC + C_CAC + C_DAC + C_EAC


= 1 + 0 - 2 - 1 + 0 = -2
Similarly, find the other costs:
TC_AD = -2, TC_AE = -1, TC_BC = -2
TC_BD = -2, TC_BE = -2
Swap medoids if the overall impact on the cost represents an improvement;
the iteration stops when no change reduces the cost.
Here the overall cost reduction is 2, and several swaps achieve it. If we
choose the first option, we get {C, D} and {B, A, E} as the clusters
27
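The swap costs can be checked mechanically. Below is a sketch of the computation under a hypothetical distance table (the slide's actual table is in the figure and is not reproduced here, so the distances and resulting TC values are illustrative only):

def total_cost(dist, medoids, objects):
    """Sum over all objects of the distance to their nearest medoid."""
    return sum(min(dist[(o, m)] for m in medoids) for o in objects)

def swap_costs(dist, medoids, objects):
    """TC for every (medoid, non-medoid) swap; negative values improve the clustering."""
    base = total_cost(dist, medoids, objects)
    non_medoids = [o for o in objects if o not in medoids]
    return {(m, h): total_cost(dist, [h if x == m else x for x in medoids], objects) - base
            for m in medoids for h in non_medoids}

# Hypothetical symmetric distances over the nodes A..E (illustrative only)
objs = list("ABCDE")
half = {("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
        ("B", "C"): 3, ("B", "D"): 4, ("B", "E"): 3,
        ("C", "D"): 1, ("C", "E"): 2, ("D", "E"): 3}
dist = {**half, **{(b, a): v for (a, b), v in half.items()},
        **{(x, x): 0 for x in objs}}
print(swap_costs(dist, ["A", "B"], objs))   # perform the most negative swap, if any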
PAM

28
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary

29
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
(figure: agglomerative clustering (AGNES) proceeds from step 0 to step 4,
merging a and b into ab, d and e into de, then c with de into cde, and
finally ab with cde into abcde; divisive clustering (DIANA) runs in the
reverse direction, from step 4 back to step 0)
30
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster


31
Dendrogram: Shows How Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree of
clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the dendrogram at
the desired level; each connected component then forms a cluster

32
DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own


33
Distance between Clusters
 Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(K_i, K_j) = min(d(t_ip, t_jq))

 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(K_i, K_j) = max(d(t_ip, t_jq))

 Average: average distance between an element in one cluster and an
element in the other, i.e., dist(K_i, K_j) = avg(d(t_ip, t_jq))

 Centroid: distance between the centroids of two clusters, i.e.,
dist(K_i, K_j) = d(C_i, C_j)

 Medoid: distance between the medoids of two clusters, i.e.,
dist(K_i, K_j) = d(M_i, M_j)
 Medoid: a chosen, centrally located object in the cluster
34
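A small sketch of the first four measures, assuming the two clusters are given as NumPy arrays of points:

import numpy as np

def linkage_distances(Ki, Kj):
    """Single, complete, average, and centroid distances between two clusters."""
    # Pairwise Euclidean distances between every element of Ki and every element of Kj
    pair = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)
    single   = pair.min()    # smallest pairwise distance
    complete = pair.max()    # largest pairwise distance
    average  = pair.mean()   # average pairwise distance
    centroid = np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))
    return single, complete, average, centroid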
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

 Centroid: the "middle" of a cluster:
C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}

 Radius: square root of the average distance from any point
of the cluster to its centroid:
R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}

 Diameter: square root of the average mean squared
distance between all pairs of points in the cluster:
D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}

35
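These three quantities are easy to compute directly; a sketch assuming the cluster is given as an (N, d) NumPy array:

import numpy as np

def cluster_stats(T):
    N = len(T)
    centroid = T.mean(axis=0)                                   # Cm
    radius = np.sqrt(((T - centroid) ** 2).sum(axis=1).mean())  # Rm
    # Dm: average squared distance over all ordered pairs i != j
    diff = T[:, None, :] - T[None, :, :]
    sq = (diff ** 2).sum(axis=2)
    diameter = np.sqrt(sq.sum() / (N * (N - 1)))
    return centroid, radius, diameter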
Single Link Algorithm

36
Single link - example

37
Single link - example

38
Complete link algorithm

39
Complete link algorithm

40
Complete link algorithm

41
Extensions to Hierarchical Clustering
 Major weakness of agglomerative clustering methods

 Can never undo what was done previously

 Do not scale well: time complexity of at least O(n^2),


where n is the total number of objects
 Integration of hierarchical & distance-based clustering

 BIRCH (1996): uses CF-tree and incrementally adjusts


the quality of sub-clusters
 CHAMELEON (1999): hierarchical clustering using
dynamic modeling
42
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
 Zhang, Ramakrishnan & Livny, SIGMOD’96
 Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent
clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to the order of the
data record

43
Clustering Feature Vector in BIRCH

Clustering Feature (CF): CF = (N, LS, SS)
 N: number of data points
 LS: linear sum of the N points: \sum_{i=1}^{N} X_i
 SS: square sum of the N points: \sum_{i=1}^{N} X_i^2

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))

44
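The CF triple for the five points above can be verified directly. CF vectors are also additive (the CF of a merged sub-cluster is the element-wise sum), which is what lets BIRCH maintain them incrementally. A sketch:

import numpy as np

def clustering_feature(X):
    """CF = (N, LS, SS) for a set of d-dimensional points X."""
    X = np.asarray(X, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))        # (5, array([16., 30.]), array([ 54., 190.]))

# Additivity: merging two disjoint sub-clusters just sums their CFs
n1, ls1, ss1 = clustering_feature(pts[:3])
n2, ls2, ss2 = clustering_feature(pts[3:])
print(n1 + n2, ls1 + ls2, ss1 + ss2)  # same as the CF of all five points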
CF-Tree in BIRCH
 Clustering feature:
 Summary of the statistics for a given subcluster: the 0-th, 1st,
and 2nd moments of the subcluster from the statistical point
of view
 Registers crucial measurements for computing cluster and
utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
 A nonleaf node in a tree has descendants or “children”
 The nonleaf nodes store sums of the CFs of their children
 A CF tree has two parameters
 Branching factor: max # of children
 Threshold: max diameter of sub-clusters stored at the leaf
45
BIRCH Algorithm
Input:  D = {t1, t2, …, tn}
        T   // threshold for CF tree construction
Output: K   // set of clusters
Algorithm BIRCH
  for each ti ∈ D do
    determine the correct leaf node for inserting ti
    if the threshold condition is not violated
      add ti to that cluster and update the CF triple
    else
      if there is room to insert ti
        insert ti as a single cluster and update the CF triples
      else
        split the leaf node and redistribute the CF features
46
The CF Tree Structure

(figure: a CF tree with branching factor B = 7 and leaf capacity L = 6.
The root holds entries CF1 … CF6, each pointing to a child node; non-leaf
nodes hold the CFs of their children; leaf nodes hold CF entries for
sub-clusters and are chained to one another with prev/next pointers)

47
The Birch Algorithm
 Cluster Diameter:
D = \sqrt{\frac{1}{n(n-1)} \sum_{i \ne j} (x_i - x_j)^2}

 For each point in the input


 Find closest leaf entry
 Add point to leaf entry and update CF
 If entry diameter > max_diameter, then split leaf, and possibly
parents
 Algorithm is O(n)
 Concerns
 Sensitive to insertion order of data points
 Since we fix the size of leaf nodes, so clusters may not be so
natural
 Clusters tend to be spherical given the radius and diameter
measures
48
CHAMELEON: Hierarchical Clustering Using
Dynamic Modeling (1999)
 CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
 Measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity
and closeness (proximity) between two clusters are
high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
 Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into
a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining
these sub-clusters
49
KNN Graphs & Interconnectivity
 k-nearest-neighbor graphs constructed from the original data in 2D:

 EC{Ci ,Cj } :The absolute inter-connectivity between Ci and Cj:


the sum of the weight of the edges that connect vertices in
Ci to vertices in Cj
 Internal inter-connectivity of a cluster Ci : the size of its
min-cut bisector ECCi (i.e., the weighted sum of edges that
partition the graph into two roughly equal parts)
 Relative Inter-connectivity (RI):
50
Relative Closeness & Merge of Sub-Clusters

 Relative closeness between a pair of clusters Ci and Cj :


the absolute closeness between Ci and Cj normalized
w.r.t. the internal closeness of the two clusters Ci and Cj

 \bar{S}_{EC_{C_i}} and \bar{S}_{EC_{C_j}} are the average weights of the edges that


belong in the min-cut bisector of clusters Ci and Cj, respectively,
and \bar{S}_{EC_{\{C_i, C_j\}}} is the average weight of the edges that connect
vertices in Ci to vertices in Cj
 Merge Sub-Clusters:
 Merges only those pairs of clusters whose RI and RC are both
above some user-specified thresholds
 Merge those maximizing the function that combines RI and RC
51
Overall Framework of CHAMELEON

(figure: overall framework. From the data set, construct a sparse k-NN graph,
in which p and q are connected if q is among the top-k closest neighbors of p;
partition the graph into many small sub-clusters; then merge sub-clusters
whose relative interconnectivity (connectivity of c1 and c2 over their
internal connectivity) and relative closeness (closeness of c1 and c2 over
their internal closeness) are high, to obtain the final clusters)
52
CHAMELEON (Clustering Complex Objects)

53
Probabilistic Hierarchical Clustering
 Algorithmic hierarchical clustering
 Nontrivial to choose a good distance measure
 Hard to handle missing attribute values
 Optimization goal not clear: heuristic, local search
 Probabilistic hierarchical clustering
 Use probabilistic models to measure distances between clusters
 Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
 Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
 In practice, assume the generative models adopt common distribution
functions, e.g., Gaussian distribution or Bernoulli distribution, governed
by parameters
54
Generative Model
 Given a set of 1-D points X = {x1, …, xn} for clustering
analysis & assuming they are generated by a Gaussian
distribution:

 The probability that a point x_i ∈ X is generated by the
model:
P(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}

 The likelihood that X is generated by the model:
L(\mu, \sigma^2 : X) = \prod_{i=1}^{n} P(x_i \mid \mu, \sigma^2)

 The task of learning the generative model: find the
parameters \mu and \sigma^2 such that the likelihood is
maximized (maximum likelihood estimation)

55
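For a single Gaussian the maximum-likelihood parameters have a closed form (the sample mean and the sample variance with 1/n); a sketch for the 1-D case:

import numpy as np

def fit_gaussian_mle(x):
    """Maximum-likelihood mu and sigma^2 for 1-D data x, plus the log-likelihood."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()       # MLE uses 1/n, not 1/(n-1)
    # log-likelihood of x under N(mu, sigma2)
    ll = -0.5 * (len(x) * np.log(2 * np.pi * sigma2)
                 + ((x - mu) ** 2).sum() / sigma2)
    return mu, sigma2, ll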
Gaussian Distribution

(figure: a bean machine dropping balls through pins, illustrating how a
1-d Gaussian arises, together with plots of 1-d and 2-d Gaussian
distributions. From Wikipedia and http://home.dei.polimi.it)

56
A Probabilistic Hierarchical Clustering Algorithm

 For a set of objects partitioned into m clusters C1, . . . , Cm, the quality
can be measured by
Q(\{C_1, \ldots, C_m\}) = \prod_{i=1}^{m} P(C_i)

where P() is the maximum likelihood

 If we merge two clusters Cj1 and Cj2 into a cluster Cj1 ∪ Cj2, the
change in quality of the overall clustering depends only on the ratio
\frac{P(C_{j1} \cup C_{j2})}{P(C_{j1}) \, P(C_{j2})}

 Distance between clusters Ci and Cj:
dist(C_i, C_j) = -\log \frac{P(C_i \cup C_j)}{P(C_i) \, P(C_j)}

 If dist(Ci, Cj) < 0, merge Ci and Cj


57
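A sketch of this merge test for 1-D data under the Gaussian model above: dist() compares the maximum log-likelihood of the merged cluster with the sum for the two separate clusters, and a negative value triggers the merge (the small epsilon added to the variance is an implementation convenience, not part of the slide):

import numpy as np

def max_log_likelihood(x):
    """Maximum log-likelihood of 1-D data x under a fitted Gaussian."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean() + 1e-12   # avoid log(0) for tiny clusters
    return -0.5 * (len(x) * np.log(2 * np.pi * sigma2)
                   + ((x - mu) ** 2).sum() / sigma2)

def dist(ci, cj):
    """-log( P(Ci u Cj) / (P(Ci) P(Cj)) ), expressed with log-likelihoods."""
    return -(max_log_likelihood(np.concatenate([ci, cj]))
             - max_log_likelihood(ci) - max_log_likelihood(cj))

# merge the two clusters whenever dist(ci, cj) < 0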
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary

58
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion), such
as density-connected points
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al (SIGMOD’99).
 DENCLUE: Hinneburg & D. Keim (KDD’98)
 CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-
based)
59
Density-Based Clustering: Basic
Concepts
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(q): {p belongs to D | dist(p,q) ≤ Eps}
 Directly density-reachable: A point p is directly
density-reachable from a point q w.r.t. Eps, MinPts if

 p belongs to N_Eps(q)
 core point condition: |N_Eps(q)| ≥ MinPts

(figure: p in the Eps-neighborhood of a core point q, with MinPts = 5 and Eps = 1 cm)
60
Density-Reachable and Density-Connected

 Density-reachable:
 A point p is density-reachable from a point q w.r.t. Eps, MinPts if
there is a chain of points p1, …, pn, with p1 = q and pn = p, such that
p_{i+1} is directly density-reachable from p_i
 Density-connected:
 A point p is density-connected to a point q w.r.t. Eps, MinPts if there
is a point o such that both p and q are density-reachable from o
w.r.t. Eps and MinPts
(figure: a chain from q through p1 to p; p and q both density-reachable from o)
61
DBSCAN: Density-Based Spatial
Clustering of Applications with Noise
 Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise

(figure: core, border, and outlier (noise) points, with Eps = 1 cm and MinPts = 5)

62
DBSCAN: The Algorithm
Input: D: a data set containing n objects; ε: radius; MinPts: neighborhood
threshold
Method:
  mark all objects as unvisited
  do
    randomly select an unvisited object p
    mark p as visited
    if the ε-neighborhood of p has at least MinPts objects
      create a new cluster C, and add p to C
      let N be the set of objects in the ε-neighborhood of p
      for each point p' in N
        if p' is unvisited
          mark p' as visited
          if the ε-neighborhood of p' has at least MinPts points
            add those points to N
        if p' is not yet a member of any cluster, add p' to C
      end for
      output C
    else mark p as noise
  until no object is unvisited
63
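For experimentation, a library implementation is the easy route; a minimal sketch assuming scikit-learn is installed (eps corresponds to ε, min_samples to MinPts, and a label of -1 marks noise):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
labels = DBSCAN(eps=3, min_samples=2).fit(X).labels_
print(labels)   # two clusters plus one noise point, e.g. [0 0 0 1 1 -1]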
DBSCAN: Sensitive to Parameters

DBSCAN online demo:


http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
64
OPTICS: A Cluster-Ordering Method (1999)

 OPTICS: Ordering Points To Identify the Clustering


Structure
 Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
 Produces a special order of the database wrt its
density-based clustering structure
 This cluster-ordering contains info equiv to the density-
based clusterings corresponding to a broad range of
parameter settings
 Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
 Can be represented graphically or using visualization
techniques
65
OPTICS: Some Extension from DBSCAN

 Index-based: k = # of dimensions, N: # of points


 Complexity: O(N*logN)
 Core Distance of an object p: the smallest value ε' such that
the ε'-neighborhood of p has at least MinPts objects
  Let N_ε(p) be the ε-neighborhood of p, where ε is a distance value
  core-distance_{ε,MinPts}(p) = Undefined, if card(N_ε(p)) < MinPts;
                                MinPts-distance(p), otherwise
 Reachability Distance of object p from core object q: the
minimum radius value that makes p density-reachable from q
  reachability-distance_{ε,MinPts}(p, q) = Undefined, if q is not a core object;
                                           max(core-distance(q), distance(q, p)), otherwise
66
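scikit-learn's OPTICS exposes these quantities directly; a small sketch (min_samples plays the role of MinPts and max_eps of ε; the reachability plot is obtained by indexing the reachability values with the cluster ordering):

import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [7, 7], [25, 80]])
opt = OPTICS(min_samples=2, max_eps=np.inf).fit(X)
print(opt.core_distances_)               # core-distance of each object
print(opt.ordering_)                     # cluster-order of the objects
print(opt.reachability_[opt.ordering_])  # reachability-distances in that order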
Core Distance & Reachability Distance

67
Reachability Plot

(figure: reachability-distance (undefined, ε, ε') plotted against the
cluster-order of the objects; valleys in the plot correspond to clusters)

68
Density-Based Clustering: OPTICS & Applications
demo: http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/Demo

69
DENCLUE: Using Statistical Density
Functions
 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
 Using statistical density functions:
  Influence of y on x:
f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}
  Total influence on x:
f^D_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}
  Gradient of x in the direction of x_i:
\nabla f^D_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}
 Major features
gradient of x in
 Solid mathematical foundation the direction of
xi
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets
 Significantly faster than existing algorithms (e.g., DBSCAN)
 But needs a large number of parameters
70
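These density functions are easy to evaluate directly; a sketch with NumPy (sigma, the smoothing parameter, is one of the many parameters DENCLUE requires):

import numpy as np

def gaussian_influence(x, y, sigma):
    """Influence of y on x: exp(-d(x, y)^2 / (2 sigma^2))."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

def total_density(x, data, sigma):
    """Total influence of all data points on x (the overall density at x)."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

def density_gradient(x, data, sigma):
    """Gradient at x; hill-climbing along it leads to a density attractor."""
    x = np.asarray(x, dtype=float)
    return sum((np.asarray(xi) - x) * gaussian_influence(x, xi, sigma) for xi in data)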
Denclue: Technical Essence
 Uses grid cells but only keeps information about grid cells that do
actually contain data points and manages these cells in a tree-based
access structure
 Influence function: describes the impact of a data point within its
neighborhood
 Overall density of the data space can be calculated as the sum of the
influence function of all data points
 Clusters can be determined mathematically by identifying density
attractors
 Density attractors are local maxima of the overall density function
 Center defined clusters: assign to each density attractor the points
density attracted to it
 Arbitrary shaped cluster: merge density attractors that are connected
through paths of high density (> threshold)

71
Density Attractor

72
Center-Defined and Arbitrary

73
Entropy-Based Measure (II):
Normalized mutual information (NMI)

 Mutual information: quantify the amount of shared information between
the clustering C and the partitioning T:

I(C, T) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{p_{C_i} \cdot p_{T_j}}
It measures the dependency between the observed joint probability pij


of C and T, and the expected joint probability pCi * pTj under the
independence assumption
When C and T are independent, pij = pCi * pTj, I(C, T) = 0. However,
there is no upper bound on the mutual information
 Normalized mutual information (NMI): I(C, T) normalized by the entropies
of C and T

Value range of NMI: [0,1]; a value close to 1 indicates a good clustering


74
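In practice this is a single call, assuming scikit-learn; it compares a clustering against a reference partitioning and returns a value in [0, 1]:

from sklearn.metrics import normalized_mutual_info_score

clustering   = [0, 0, 1, 1, 2, 2]   # cluster labels C
partitioning = [0, 0, 1, 1, 1, 1]   # ground-truth labels T
print(normalized_mutual_info_score(partitioning, clustering))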
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Density-Based Methods

 Grid-Based Methods

 Evaluation of Clustering

 Summary
75
Summary
 Cluster analysis groups objects based on their similarity and has
wide applications
 Measure of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
 K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
 Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
 DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
 STING and CLIQUE are grid-based methods, where CLIQUE is also a
subspace clustering algorithm
 Quality of clustering results can be evaluated in various ways
76
