Data Mining Clustering
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
Unsupervised learning:
no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Clustering: Rich Applications and Multidisciplinary
Efforts
Business intelligence
• Organize customers into groups, improve project management
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters or support other spatial mining tasks
Image Processing
WWW
Document classification
Cluster Web blog data to discover groups of similar access patterns
Clustering: Application Examples
Biology:
taxonomy of living things: kingdom, phylum, class, order, family, genus
and species
Information retrieval: document clustering
Land use:
Identification of areas of similar land use in an earth observation
database
City-planning: Identifying groups of houses according to their house type,
value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered
along continental faults
Climate: understanding Earth's climate, finding patterns of atmospheric and
ocean behavior
Economic Science: market research
Requirements and Challenges
Scalability
Clustering all the data instead of only samples; a large database
may contain millions or even billions of data objects
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
New applications – to deal with graphs, sequences, images and
documents
Discovery of clusters with arbitrary shape
Algorithms based on distance measures tend to find only spherical clusters with similar size and density
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
Capability of clustering high-dimensionality data
Requirements of domain knowledge (e.g., the required number of clusters)
Constraint-based clustering
• User may give inputs on constraints
Interpretability and usability
Quality: What Is Good Clustering?
Clustering Problem
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function, typically
metric: d(i, j)
The definitions of distance functions are usually rather different
for interval-scaled, boolean, categorical, ordinal, ratio, and vector
variables
Weights should be associated with different variables based on
applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that measures the
“goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
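For illustration, a minimal sketch (not from the slides) of one possible dissimilarity metric: a Euclidean distance with per-attribute weights; the weights and points are arbitrary assumptions.

import numpy as np

def weighted_euclidean(x, y, w=None):
    # d(i, j): Euclidean distance with per-attribute weights reflecting data semantics
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return np.sqrt(np.sum(w * (x - y) ** 2))

# weight the second attribute twice as heavily as the first
print(weighted_euclidean([1.0, 2.0], [4.0, 6.0], w=[1.0, 2.0]))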
Major Clustering Approaches (I)
Partitioning approach:
Given n objects, a partitioning method constructs k partitions of the data, where
each partition represents a cluster and k ≤ n
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: Diana, Agnes, BIRCH, CAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DENCLUE
Grid-based approach:
Quantize the object space into a finite number of cells that form a grid
structure
Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the
best fit of the data to the given model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: p-Cluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
Massive links can be used to cluster objects: SimRank, LinkClus
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Partitioning Algorithms: Basic Concept
Partitioning method:
Given a dataset D of n objects, partition it into a set of k clusters such that
the sum of squared distances is minimized (where an object p ∈ Ci and ci is the
centroid or medoid of cluster Ci)
Clustering by k-means minimizes
E = \sum_{i=1}^{k} \sum_{p \in C_i} (d(p, c_i))^2
An Example of K-Means Clustering (K = 2)
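Since the worked figures are not reproduced, here is a minimal NumPy sketch of Lloyd's k-means that minimizes the objective E above; the dataset, K = 2, the random seed, and the function name kmeans are illustrative assumptions.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Lloyd's k-means: alternate assignment and centroid update until stable
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # pick k initial centers
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # the objective E
    return labels, centroids, sse

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels, centers, sse = kmeans(X, k=2)
print(labels, sse)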
Comments on the K-Means Method
Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
Comparing: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k))
Comment: often terminates at a local optimum
Weakness
Applicable only to objects in a continuous n-dimensional space
For nominal data – a variant of k-means known as k-modes can
be used.
Need to specify k, the number of clusters, in advance
Sensitive to noisy data and outliers
Not suitable to discover clusters with non-convex shapes
What Is the Problem of the K-Means Method?
(figure omitted: two example scatter plots on a 0–10 grid)
PAM: A Typical K-Medoids Algorithm
Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid (total cost = 20 in the example)
Do loop until no change:
  Randomly select a non-medoid object O_random
  Compute the total cost of swapping a medoid O with O_random
  If the quality is improved, perform the swap
(figure omitted: 0–10 scatter plots illustrating each step)
The K-Medoid Clustering Method
PAM
To replace an existing medoid, calculate the cost of replacement for every pair of a
current medoid and a non-medoid object
E.g., with medoids A, B and non-medoids C, D, E, we have to find TC_AC, TC_AD, TC_AE, TC_BC, TC_BD, and TC_BE
Here TC_AC, the total cost of replacing medoid A with non-medoid C, is calculated as the sum over all objects j of the change in cost C_jAC caused by the swap
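A small sketch (an assumption, not the original slide's worked example) of how swap costs such as TC_AC can be computed from a distance matrix; the matrix, object indices, and function names are illustrative.

import numpy as np

def total_cost(dist, medoids):
    # sum of distances from every object to its nearest medoid
    return dist[:, medoids].min(axis=1).sum()

def swap_cost(dist, medoids, out_medoid, candidate):
    # TC of replacing out_medoid with non-medoid candidate (negative = improvement)
    new_medoids = [m for m in medoids if m != out_medoid] + [candidate]
    return total_cost(dist, new_medoids) - total_cost(dist, medoids)

# illustrative symmetric distance matrix over 5 objects; medoids start as objects 0 and 1
dist = np.array([[0, 2, 6, 7, 9],
                 [2, 0, 5, 6, 8],
                 [6, 5, 0, 1, 3],
                 [7, 6, 1, 0, 2],
                 [9, 8, 3, 2, 0]], dtype=float)
medoids = [0, 1]
for m in medoids:
    for c in [2, 3, 4]:
        print(f"TC of swapping medoid {m} with object {c}: {swap_cost(dist, medoids, m, c):+.1f}")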
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
(diagram) Agglomerative (AGNES), steps 0 to 4: start from singletons a, b, c, d, e; merge into {a, b} and {d, e}; then {c, d, e}; finally {a, b, c, d, e}
Divisive (DIANA), steps 4 to 0: start from {a, b, c, d, e} and split in the reverse order
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
(figure omitted: three 0–10 scatter plots showing successive single-link merges)
Dendrogram: Shows How Clusters are Merged
DIANA (Divisive Analysis)
Inverse order of AGNES: eventually each node forms its own cluster
(figure omitted: three 0–10 scatter plots showing successive splits)
Distance between Clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(K_i, K_j) = \min_{t_{ip} \in K_i,\, t_{jq} \in K_j} d(t_{ip}, t_{jq})
Single Link Algorithm
Single link - example
Complete link algorithm
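A brief usage sketch of single-link and complete-link agglomerative clustering with SciPy (scipy.cluster.hierarchy); the choice of library, the sample points, and the cut into three clusters are assumptions for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# illustrative 2-D points
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.3, 0.4],
              [5.0, 5.0], [5.2, 4.8], [9.0, 0.2]])

# single link: merge the pair of clusters with the smallest minimum inter-point distance
Z_single = linkage(X, method='single')
# complete link: merge the pair with the smallest maximum inter-point distance
Z_complete = linkage(X, method='complete')

# cut each dendrogram into 3 flat clusters
print(fcluster(Z_single, t=3, criterion='maxclust'))
print(fcluster(Z_complete, t=3, criterion='maxclust'))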
Extensions to Hierarchical Clustering
Major weaknesses of agglomerative clustering methods: merge (or split) decisions can
never be undone, and the methods do not scale well, with time complexity of at least O(n²) for n objects
Clustering Feature Vector in BIRCH
CF = (N, LS, SS): N is the number of data points in the subcluster, LS = \sum_{i=1}^{N} X_i (linear sum), and SS = \sum_{i=1}^{N} X_i^2 (square sum)
Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8): CF = (5, (16,30), (54,190))
(figure omitted: the five points plotted on a 0–10 grid)
CF-Tree in BIRCH
Clustering feature:
Summary of the statistics for a given subcluster: the 0-th, 1st,
and 2nd moments of the subcluster from the statistical point
of view
Registers crucial measurements for computing clusters and
utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
A nonleaf node in a tree has descendants or “children”
The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
Branching factor: max # of children
Threshold: max diameter of sub-clusters stored at the leaf
BIRCH Algorithm
Input:  D = {t1, t2, …, tn}
        T   // threshold for CF tree construction
Output: K   // set of clusters
Algorithm BIRCH:
  for each ti ∈ D do
    determine the correct leaf node for insertion of ti
    if the threshold condition is not violated then
      add ti to the cluster and update the CF triple
    else
      if there is room to insert ti then
        insert ti as a single cluster and update the CF triples
      else
        split the leaf node and redistribute the CF features
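A minimal sketch of the CF triple (N, LS, SS) and its additive update, mirroring the insertion step in the pseudocode above; the CF class, the diameter-based threshold test, and the threshold value 6.0 are simplifying assumptions.

import numpy as np

class CF:
    # clustering feature: (N, LS, SS) summarising a subcluster additively
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), p * p

    def merged_with(self, point):
        # return the CF triple after absorbing one more point (CFs are additive)
        p = np.asarray(point, dtype=float)
        out = CF(p)
        out.N, out.LS, out.SS = self.N + 1, self.LS + p, self.SS + p * p
        return out

    def diameter(self):
        # average pairwise distance of the subcluster, derived from (N, LS, SS) alone
        if self.N < 2:
            return 0.0
        sq = (2 * self.N * self.SS.sum() - 2 * (self.LS ** 2).sum()) / (self.N * (self.N - 1))
        return np.sqrt(max(sq, 0.0))

cf = CF([3, 4])
for p in [[2, 6], [4, 5], [4, 7], [3, 8]]:
    candidate = cf.merged_with(p)
    cf = candidate if candidate.diameter() <= 6.0 else cf  # absorb only if threshold T holds
print(cf.N, cf.LS, cf.SS)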
The CF Tree Structure
(figure: a CF tree; the root and non-leaf nodes hold entries CF1, CF2, … up to the branching factor, each pointing to a child node; leaf nodes hold CF entries and are chained together with prev/next pointers)
The Birch Algorithm
Cluster diameter: D = \sqrt{\frac{1}{n(n-1)} \sum_{i \neq j} (x_i - x_j)^2}
CHAMELEON: overall framework
Data set → construct a sparse k-NN graph (p and q are connected if q is among the top-k closest neighbors of p) → partition the graph → merge partitions → final clusters
Relative interconnectivity: connectivity of c1 and c2 over their internal connectivity
Relative closeness: closeness of c1 and c2 over their internal closeness
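A sketch of the first step only, building the sparse k-NN graph with scikit-learn's kneighbors_graph; the partitioning and merging phases are left to external graph tools, and the data, k = 5, and library choice are assumptions.

import numpy as np
from sklearn.neighbors import kneighbors_graph

# illustrative data: two elongated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [2.0, 0.3], size=(50, 2)),
               rng.normal([0, 4], [2.0, 0.3], size=(50, 2))])

# sparse k-NN graph: p is connected to its k closest neighbors q
A = kneighbors_graph(X, n_neighbors=5, mode='distance', include_self=False)
print(A.shape, A.nnz)  # adjacency in sparse CSR form, ready for graph partitioning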
CHAMELEON (Clustering Complex Objects)
Probabilistic Hierarchical Clustering
Algorithmic hierarchical clustering
Nontrivial to choose a good distance measure
Hard to handle missing attribute values
Optimization goal not clear: heuristic, local search
Probabilistic hierarchical clustering
Use probabilistic models to measure distances between clusters
Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
In practice, assume the generative models adopt common distribution
functions, e.g., Gaussian distribution or Bernoulli distribution, governed
by parameters
Generative Model
Given a set of 1-D points X = {x1, …, xn} for clustering analysis and assuming
they are generated by a Gaussian distribution N(\mu, \sigma^2), the density at a point x is
P(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
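A short sketch fitting this single Gaussian by maximum likelihood (sample mean and 1/n variance) and evaluating the log-likelihood of the points; the data and function names are illustrative assumptions.

import numpy as np

def fit_gaussian(x):
    # maximum-likelihood estimates of mu and sigma^2 for 1-D points
    x = np.asarray(x, dtype=float)
    return x.mean(), x.var()  # np.var uses the 1/n (MLE) form by default

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.array([1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9])
mu, sigma2 = fit_gaussian(x)
log_likelihood = np.log(gaussian_pdf(x, mu, sigma2)).sum()
print(mu, sigma2, log_likelihood)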
Gaussian Distribution
(figure: a bean machine, in which balls dropped through rows of pins pile up in an approximately Gaussian shape, alongside plots of 1-d and 2-d Gaussian densities; from Wikipedia and http://home.dei.polimi.it)
A Probabilistic Hierarchical Clustering Algorithm
For a set of objects partitioned into m clusters C1, …, Cm, the quality can be
measured by Q(\{C_1, \ldots, C_m\}) = \prod_{i=1}^{m} P(C_i), the product of the probabilities of the clusters under the generative model
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such
as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-
based)
Density-Based Clustering: Basic
Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
N_Eps(q): {p ∈ D | dist(p, q) ≤ Eps}
Directly density-reachable: A point p is directly
density-reachable from a point q w.r.t. Eps, MinPts if
p ∈ N_Eps(q) and q is a core object, i.e., |N_Eps(q)| ≥ MinPts
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if
there is a chain of points p1, …, pn with p1 = q and pn = p such that
each pi+1 is directly density-reachable from pi
(figure: core, border, and outlier (noise) points for Eps = 1 cm and MinPts = 5)
DBSCAN: The Algorithm
Input: D, a data set containing n objects; ε, the radius parameter; MinPts, the neighborhood density threshold
Method:
  mark all objects as unvisited
  do
    randomly select an unvisited object p
    mark p as visited
    if the ε-neighborhood of p has at least MinPts objects then
      create a new cluster C, and add p to C
      let N be the set of objects in the ε-neighborhood of p
      for each point p' in N
        if p' is unvisited then
          mark p' as visited
          if the ε-neighborhood of p' has at least MinPts points then
            add those points to N
        if p' is not yet a member of any cluster then add p' to C
      end for
      output C
    else
      mark p as noise
  until no object is unvisited
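Rather than re-implementing the pseudocode, a usage sketch with scikit-learn's DBSCAN on data with non-convex clusters; eps and min_samples correspond to ε and MinPts, and the dataset and parameter values are assumptions.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# non-convex clusters that k-means would split incorrectly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps plays the role of the radius ε, min_samples of MinPts; label -1 marks noise
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(np.unique(labels, return_counts=True))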
DBSCAN: Sensitive to Parameters
(figures omitted: DBSCAN results for different Eps/MinPts settings)
(figure: OPTICS reachability plot of reachability distance, with 'undefined' values, against the cluster ordering of the objects)
Density-Based Clustering: OPTICS & Applications
demo: http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/Demo
DENCLUE: Using Statistical Density
Functions
DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
Using statistical density functions:
Influence of y on x: f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}
Total influence on x: f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}
Gradient of x in the direction of x_i: \nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x, x_i)^2}{2\sigma^2}}
Major features
Solid mathematical foundation
Good for data sets with large amounts of noise
Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets
Significantly faster than existing algorithms (e.g., DBSCAN)
But needs a large number of parameters
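A small sketch evaluating the Gaussian density function and its gradient from the formulas above; the points, the query location, and σ = 0.5 are illustrative assumptions.

import numpy as np

def density(x, points, sigma):
    # total influence f^D(x) = sum_i exp(-||x - x_i||^2 / (2 sigma^2))
    d2 = ((points - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def gradient(x, points, sigma):
    # sum_i (x_i - x) exp(-||x - x_i||^2 / (2 sigma^2)); points uphill toward a density attractor
    d2 = ((points - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return ((points - x) * w[:, None]).sum(axis=0)

pts = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0]])
x = np.array([1.5, 1.5])
print(density(x, pts, sigma=0.5), gradient(x, pts, sigma=0.5))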
Denclue: Technical Essence
Uses grid cells but only keeps information about grid cells that do
actually contain data points and manages these cells in a tree-based
access structure
Influence function: describes the impact of a data point within its
neighborhood
Overall density of the data space can be calculated as the sum of the
influence function of all data points
Clusters can be determined mathematically by identifying density
attractors
Density attractors are local maxima of the overall density function
Center defined clusters: assign to each density attractor the points
density attracted to it
Arbitrary shaped cluster: merge density attractors that are connected
through paths of high density (> threshold)
Density Attractor
Center-Defined and Arbitrary-Shaped Clusters
Entropy-Based Measure (II):
Normalized mutual information (NMI)
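A usage sketch of NMI as an extrinsic quality measure, using scikit-learn's normalized_mutual_info_score; the ground-truth and predicted label vectors are made-up examples.

from sklearn.metrics import normalized_mutual_info_score

# ground-truth labels vs. labels produced by a clustering algorithm
truth     = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted = [1, 1, 1, 0, 0, 2, 2, 2, 2]

# NMI = 1 for a perfect match (up to label permutation), near 0 for independent labelings
print(normalized_mutual_info_score(truth, predicted))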
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Summary
Cluster analysis groups objects based on their similarity and has
wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a
subspace clustering algorithm
Quality of clustering results can be evaluated in various ways