
Chapter 7.

Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary

1
What is Cluster Analysis?
Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups

(Figure: intra-cluster distances are minimized; inter-cluster distances are maximized)
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
3
Clustering: Rich Applications and
Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature
spaces
Detect spatial clusters or for other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns
4
Applications of Cluster Analysis
Understanding
Group related documents for browsing, group genes and proteins that have
similar functionality, or group stocks with similar price fluctuations
Summarization
Reduce the size of large data sets (e.g., clustering precipitation in Australia)

Discovered clusters and their industry groups (stock example):
Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN,
Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN,
Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN,
SGI-DOWN, Sun-DOWN
Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN,
ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN,
Morgan-Stanley-DOWN
Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP,
Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
Land use: Identification of areas of similar land use in an earth
observation database
Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults

6
Quality: What Is Good Clustering?

A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns

7
Measure the Quality of Clustering

Dissimilarity/Similarity metric: similarity is expressed in
terms of a distance function, typically a metric d(i, j)
There is a separate "quality" function that measures the
"goodness" of a cluster.
The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables.
Weights should be associated with different variables
based on applications and data semantics.
It is hard to define "similar enough" or "good enough";
the answer is typically highly subjective.
8
Requirements of Clustering in Data Mining

Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
9
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary

10
Data Structures
Data matrix (two modes):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity matrix (one mode):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
11
Type of data in clustering analysis

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types

12
Interval-valued variables

E.g., weight, height, and temperature.
Standardize the data:
Calculate the mean absolute deviation
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$
where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})$.
Calculate the standardized measurement (z-score)
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
Using the mean absolute deviation is more robust than using the
standard deviation.
13
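For illustration only, a minimal Python sketch of this standardization; the function name and the sample weight values are made up for the example.

```python
def standardize(values):
    """Standardize one interval-scaled variable using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                           # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n     # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]        # z-scores z_if

# Example: four weight measurements
print(standardize([50.0, 60.0, 70.0, 80.0]))        # [-1.5, -0.5, 0.5, 1.5]
```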
Similarity and Dissimilarity Between
Objects

Distances are normally used to measure the similarity or
dissimilarity between two data objects
Some popular ones include the Minkowski distance:
$$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q}$$
where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are
two p-dimensional data objects, and q is a positive integer
If q = 1, d is the Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

14
Similarity and Dissimilarity Between
Objects (Cont.)

If q = 2, d is the Euclidean distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$

Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)

15
Example

Let x1 = (1, 2) and x2 = (3, 5)
Euclidean distance = sqrt(2^2 + 3^2) = 3.61
Manhattan distance = 2 + 3 = 5

16
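A small Python sketch of the Minkowski distance, checked against the example above (the helper name minkowski is just for illustration):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))   # Manhattan: |1-3| + |2-5| = 5
print(minkowski(x1, x2, 2))   # Euclidean: sqrt(2^2 + 3^2), about 3.61
```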
Binary Variables
Symmetric binary variable: both of its states,
i.e. 0 and 1, are equally valuable. Here we cannot
decide which outcome should be 0 and which
outcome should be 1.
Example: marital status of a person is "married or
unmarried".
Asymmetric binary variable: the outcomes of
the states are not equally important. An example of
such a variable is the presence or absence of a
relatively rare attribute.
Example: person is "handicapped or not
handicapped".
17
Binary Variables
A contingency table for binary data (object i vs. object j):

                 object j
                 1       0       sum
object i    1    a       b       a + b
            0    c       d       c + d
            sum  a + c   b + d   p

Distance measure for symmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
Distance measure for asymmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c}$$
Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i, j) = \frac{a}{a + b + c}$$
18
Dissimilarity between Binary Variables

Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

gender is a symmetric attribute


the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
19
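The same numbers can be reproduced with a short Python sketch of the asymmetric binary dissimilarity; the 0/1 encodings below are the asymmetric attributes (Fever, Cough, Test-1..Test-4) with Y/P mapped to 1 and N to 0, and the helper name is illustrative only.

```python
def asym_binary_dissim(i, j):
    """Dissimilarity for asymmetric binary variables: (b + c) / (a + b + c)."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)   # 1-1 matches
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)   # i=1, j=0
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)   # i=0, j=1
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```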
Nominal(Categorical) Variables

A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
Method 1: simple matching
m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$
Method 2: use a large number of binary variables,
creating a new binary variable for each of the M
nominal states

20
Example (Test-1, a single nominal variable, so p = 1; d(i, j) = 0 if the values match, 1 otherwise):

Object  Test-1
1       A
2       B
3       C
4       A

Dissimilarity matrix:
0
1  0
1  1  0
0  1  1  0

21
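A minimal Python sketch of simple matching on the Test-1 example (illustrative; the helper name is made up):

```python
def nominal_dissim(i, j):
    """Simple matching: d(i, j) = (p - m) / p, with m = number of matching variables."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

objects = [["A"], ["B"], ["C"], ["A"]]          # Test-1 values for objects 1..4
for x in objects:
    print([nominal_dissim(x, y) for y in objects])   # reproduces the matrix above
```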
Ordinal Variables

An ordinal variable can be discrete or continuous


Order is important, e.g., rank
Can be treated like interval-scaled
replace xif by their rank rif ∈ {1,..., M f }

map the range of each variable onto [0, 1] by replacing


i-th object in the f-th variable by
r if − 1
z if =
M f −1
compute the dissimilarity using methods for interval-
scaled variables

22
Example (Test-2): replace the values by their ranks (Fair = 1, Good = 2, Excellent = 3,
so M_f = 3), then normalize with z_{if} = (r_{if} - 1)/(M_f - 1), e.g. (3 - 1)/(3 - 1) = 1.0.

Object  Test-2      Rank  z
1       Excellent   3     1.0
2       Fair        1     0.0
3       Good        2     0.5
4       Excellent   3     1.0

Dissimilarity matrix:
0
1.0  0
0.5  0.5  0
0    1.0  0.5  0
23
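A short Python sketch of the ordinal treatment on the Test-2 example, assuming the rank order Fair < Good < Excellent:

```python
order = {"Fair": 1, "Good": 2, "Excellent": 3}      # rank r_if of each ordinal state
test2 = ["Excellent", "Fair", "Good", "Excellent"]

M_f = len(order)
z = [(order[v] - 1) / (M_f - 1) for v in test2]     # map onto [0, 1]
print(z)                                            # [1.0, 0.0, 0.5, 1.0]

# treat z like an interval-scaled variable, e.g. with Manhattan distance
print([[abs(a - b) for b in z] for a in z])         # reproduces the matrix above
```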
Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale,
such as Ae^{Bt} or Ae^{-Bt}
Methods:
treat them like interval-scaled variables: not a good
choice! (why? the scale can be distorted)
apply a logarithmic transformation
y_{if} = log(x_{if})
treat them as continuous ordinal data and treat their rank
as interval-scaled
24
Example (Test-3): apply the log transformation y_{if} = log10(x_{if}), then compute
dissimilarities on the transformed values.

Object  Test-3   log10(Test-3)
1       445      2.65
2       22       1.34
3       164      2.21
4       1210     3.08

Dissimilarity matrix:
0
1.31  0
0.44  0.87  0
0.43  1.74  0.87  0

25
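A short Python sketch of the log-transform treatment on the Test-3 example (using base-10 logarithms, as in the table above):

```python
import math

test3 = [445, 22, 164, 1210]
y = [round(math.log10(v), 2) for v in test3]            # 2.65, 1.34, 2.21, 3.08
print(y)
print([[round(abs(a - b), 2) for b in y] for a in y])   # e.g. d(2,1) = 1.31, d(4,2) = 1.74
```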
Vector Objects

Vector objects: keywords in documents, gene
features in micro-arrays, etc.
Broad applications: information retrieval, biologic
taxonomy, etc.
Cosine measure: s(x, y) = (x . y) / (||x|| ||y||)
A variant: the Tanimoto coefficient, (x . y) / (x . x + y . y - x . y)

26
Let x = (1, 1, 0, 0) and y = (0, 1, 1, 0)
s(x, y) = (1*0 + 1*1 + 0*1 + 0*0) / (sqrt(2) * sqrt(2)) = 1/2 = 0.5

27
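A minimal Python sketch of the cosine measure and the Tanimoto coefficient, reproducing the example above (function names are illustrative):

```python
import math

def cosine_sim(x, y):
    """Cosine measure: s(x, y) = (x . y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    """Tanimoto coefficient: (x . y) / (x . x + y . y - x . y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

x, y = (1, 1, 0, 0), (0, 1, 1, 0)
print(cosine_sim(x, y))   # 0.5
print(tanimoto(x, y))     # 1/3
```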
Mixed Types
A database may contain variables of mixed types; combine the per-variable
dissimilarities with a weighted average, here with equal weights:
d(i, j) = (d_nominal(i, j) + d_ordinal(i, j) + d_ratio(i, j)) / 3

Nominal dissimilarity matrix (Test-1):
0
1  0
1  1  0
0  1  1  0

Ordinal dissimilarity matrix (Test-2):
0
1.0  0
0.5  0.5  0
0    1.0  0.5  0

Ratio dissimilarity matrix (Test-3), rescaled onto [0, 1] by dividing by the largest value, 1.74:
0
0.75  0
0.25  0.50  0
0.25  1.00  0.50  0

Combined matrix, e.g. D(2,1) = (1(1) + 1(1) + 1(0.75)) / 3 = 0.92:
0
0.92  0
0.58  0.67  0
0.08  1.00  0.67  0
28
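A small Python sketch that combines the three per-variable matrices above with equal weights (a simplified version of the general weighted formula; the matrix values are copied from the example):

```python
nominal = [[0, 1, 1, 0],
           [1, 0, 1, 1],
           [1, 1, 0, 1],
           [0, 1, 1, 0]]
ordinal = [[0, 1.0, 0.5, 0],
           [1.0, 0, 0.5, 1.0],
           [0.5, 0.5, 0, 0.5],
           [0, 1.0, 0.5, 0]]
ratio   = [[0, 0.75, 0.25, 0.25],          # ratio matrix rescaled onto [0, 1]
           [0.75, 0, 0.50, 1.00],
           [0.25, 0.50, 0, 0.50],
           [0.25, 1.00, 0.50, 0]]

n = 4
combined = [[round((nominal[i][j] + ordinal[i][j] + ratio[i][j]) / 3, 2)
             for j in range(n)] for i in range(n)]
print(combined[1][0])   # D(2,1) = (1 + 1 + 0.75) / 3 = 0.92
```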
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary

29
Major Clustering Approaches (I)

Partitioning approach:
Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
using some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

30
Typical Alternatives to Calculate the Distance
between Clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = min(d(tip, tjq))
Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = max(d(tip, tjq))
Average: average distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = avg(d(tip, tjq))
Centroid: distance between the centroids of two clusters, i.e.,
dis(Ki, Kj) = d(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e.,
dis(Ki, Kj) = d(Mi, Mj)
Medoid: one chosen, centrally located object in the cluster
31
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
Centroid: the "middle" of a cluster:
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$
Radius: square root of the average distance from any point of the
cluster to its centroid:
$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$$
Diameter: square root of the average mean squared distance between
all pairs of points in the cluster:
$$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$

32
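A minimal Python sketch of these three measures for a small numerical data set (the function names and the sample points are made up for illustration):

```python
import math

def centroid(points):
    """Centroid: coordinate-wise mean of the N points."""
    n, dim = len(points), len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))

def radius(points):
    """Square root of the average squared distance from each point to the centroid."""
    c = centroid(points)
    n = len(points)
    return math.sqrt(sum(sum((p[d] - c[d]) ** 2 for d in range(len(c))) for p in points) / n)

def diameter(points):
    """Square root of the average squared distance between all pairs of distinct points."""
    n, dim = len(points), len(points[0])
    total = sum(sum((points[i][d] - points[j][d]) ** 2 for d in range(dim))
                for i in range(n) for j in range(n) if i != j)
    return math.sqrt(total / (n * (n - 1)))

pts = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
print(centroid(pts), radius(pts), diameter(pts))   # (2.0, 2.0), about 1.15, 2.0
```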
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
33
Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects
into a set of k clusters that minimizes the sum of squared distances
$$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$


Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means : Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition around medoids) Each cluster is
represented by one of the objects in the cluster

34
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the
clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
Assign each object to the cluster with the nearest
seed point
Go back to Step 2, stop when no more new
assignment

35
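For illustration, a minimal one-dimensional Python sketch of these steps; it follows the common variant that seeds with k arbitrary initial means (as in the figure on the next slide) rather than an initial partition, and it is not an optimized implementation. It can be run on the worked example a few slides below.

```python
import random

def kmeans_1d(values, k, max_iter=100):
    """Minimal 1-D k-means sketch; the result depends on the random initial means."""
    means = random.sample(values, k)                       # arbitrary initial seeds
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:                                   # assign each object to nearest mean
            clusters[min(range(k), key=lambda i: abs(v - means[i]))].append(v)
        new_means = [sum(c) / len(c) if c else means[i]    # recompute the cluster means
                     for i, c in enumerate(clusters)]
        if new_means == means:                             # stop when nothing changes
            break
        means = new_means
    return clusters, means

# The worked example below: k = 2 on {2, 4, 10, 12, 3, 20, 30, 11, 25};
# it typically converges to {2, 3, 4, 10, 11, 12} and {20, 25, 30}.
print(kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], 2))
```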
The K-Means Clustering Method

Example
(Figure, K = 2: arbitrarily choose K objects as the initial cluster centers; assign each
object to the cluster with the most similar center; update the cluster means; reassign
the objects and update the means again, until no object changes cluster.)
36
Flowchart

37
Comments on the K-Means Method

Strength: relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
Comment: Often terminates at a local optimum. The global optimum
may be found using techniques such as: deterministic annealing and
genetic algorithms
Weakness
Applicable only when mean is defined, then what about categorical
data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes

38
Variations of the K-Means Method

A few variants of the k-means which differ in

Selection of the initial k means

Dissimilarity calculations

Strategies to calculate cluster means

Handling categorical data:

Replacing means of clusters with modes

Using new dissimilarity measures to deal with categorical objects

Using a frequency-based method to update modes of clusters

A mixture of categorical and numerical data: k-prototype method

39
What Is the Problem of the K-Means Method?

The k-means algorithm is sensitive to outliers!
An object with an extremely large value may substantially
distort the distribution of the data.
K-Medoids: instead of taking the mean value of the objects in a
cluster as a reference point, a medoid can be used, which is the most
centrally located object in a cluster.

(Figure: the same data set clustered around means vs. around medoids)

40
Example
Given : {2,4,10,12,3,20,30,11,25}
Assume the number of clusters is k = 2.
Randomly assign means: m1= 3, m2 = 4
K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5,m2=16
K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18
K1={2,3,4,10},K2={12,20,30,11,25},
m1=4.75,m2=19.6
K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25
K1={2,3,4,10,11,12},K2={20,30,25}

41
Alternatively, randomly assign the values to the clusters.
Number of clusters = 2, therefore:
K1 = {2,10,3,30, 25}, Mean = 14
K2 = {4,12, 20, 11}, Mean = 11.75
Re-assign
K1 = {20, 30, 25}, Mean = 25
K2 = {2,4, 10, 12, 3, 11}, Mean= 7
Re-assign
K1 = {20, 30, 25}, Mean = 25
K2 = {2,4, 10, 12, 3, 11}, Mean= 7
So the final answer is K1={2,3,4,10,11,12},K2={20,30,25}
42
Use K-means algorithm to create 3 clusters for
given set of values
{2, 3, 6, 8, 9, 12, 15, 18, 22}

43
The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids)
starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale
well for large data sets

44
A Typical K-Medoids Algorithm (PAM)
(Figure, K = 2: arbitrarily choose k objects as the initial medoids and assign each
remaining object to its nearest medoid (total cost = 20). Randomly select a non-medoid
object O_random and compute the total cost of swapping it with a medoid (total cost = 26
in the illustration). If the quality is improved, perform the swap; repeat the loop until
no change.)
45
PAM (Partitioning Around Medoids) (1987)

PAM (Kaufman and Rousseeuw, 1987), built into Splus
Uses real objects to represent the clusters:
1. Select k representative objects arbitrarily
2. For each pair of a non-selected object h and a selected object i,
calculate the total swapping cost TC_ih
3. For each pair of i and h,
if TC_ih < 0, i is replaced by h;
then assign each non-selected object to the most
similar representative object
4. Repeat steps 2-3 until there is no change
46
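A simplified Python sketch in the spirit of PAM: it keeps trying medoid/non-medoid swaps while the total dissimilarity improves. It recomputes the full cost instead of the incremental TC_ih, so it is for illustration only (function names and the sample data are made up).

```python
def total_cost(points, medoids, dist):
    """Sum of each object's dissimilarity to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_sketch(points, k, dist):
    """Greedy PAM-style sketch: swap a medoid with a non-medoid while the cost improves."""
    medoids = list(points[:k])                        # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for h in points:
                if h in medoids:
                    continue
                candidate = [h if x == m else x for x in medoids]
                if total_cost(points, candidate, dist) < total_cost(points, medoids, dist):
                    medoids, improved = candidate, True
    return medoids

# Usage on 1-D data with the absolute difference as the dissimilarity
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
print(pam_sketch(data, 2, lambda a, b: abs(a - b)))
```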
PAM Clustering: total swapping cost TC_ih = Σ_j C_jih

(Figure: four cases for the contribution C_jih of an object j when medoid i is swapped
with non-medoid h, where t is another current medoid of j:
1. j moves from i to h: C_jih = d(j, h) - d(j, i)
2. j stays with its current medoid: C_jih = 0
3. j moves from i to t: C_jih = d(j, t) - d(j, i)
4. j moves from t to h: C_jih = d(j, h) - d(j, t))


What Is the Problem with PAM?

PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
PAM works efficiently for small data sets but does not
scale well for large data sets:
O(k(n-k)^2) for each iteration,
where n is # of data and k is # of clusters
Sampling-based method:
CLARA (Clustering LARge Applications)

48
Number x coordinate y coordinate

1 1.0 4.0

2 5.0 1.0

3 5.0 2.0

4 5.0 4.0

5 10.0 4.0

6 25.0 4.0

7 25.0 6.0

8 25.0 7.0

9 25.0 8.0

10 29.0 7.0

49
Objects 1 and 5 are the selected representative objects initially.

Object   Dissimilarity    Dissimilarity    Minimal         Closest
number   from object 1    from object 5    dissimilarity   representative object
1        0.00             9.00             0.00            1
2        5.00             5.83             5.00            1
3        4.47             5.39             4.47            1
4        4.00             5.00             4.00            1
5        9.00             0.00             0.00            5
6        24.00            15.00            15.00           5
7        24.08            15.13            15.13           5
8        24.19            15.30            15.30           5
9        24.33            15.52            15.52           5
10       28.16            19.24            19.24           5
Average minimal dissimilarity: 9.37

50
Object   Dissimilarity    Dissimilarity    Minimal         Closest
number   from object 4    from object 8    dissimilarity   representative object
1        4.00             24.19            4.00            4
2        3.00             20.88            3.00            4
3        2.00             20.62            2.00            4
4        0.00             20.22            0.00            4
5        5.00             15.30            5.00            4
6        20.00            3.00             3.00            8
7        20.10            1.00             1.00            8
8        20.22            0.00             0.00            8
9        20.40            1.00             1.00            8
10       24.19            4.00             4.00            8
Average minimal dissimilarity: 2.30

• Swapping cost = new cost – old cost = 2.30 – 9.37 = –7.07
• If Swapping Cost < 0
replace medoid with new selected object
• So new Medoids are object 4 and object 8.

51
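The two average costs can be checked with a few lines of Python (the coordinates are copied from the table of ten points above; math.dist needs Python 3.8+):

```python
import math

pts = {1: (1, 4), 2: (5, 1), 3: (5, 2), 4: (5, 4), 5: (10, 4),
       6: (25, 4), 7: (25, 6), 8: (25, 7), 9: (25, 8), 10: (29, 7)}

def avg_cost(medoids):
    """Average dissimilarity of every object to its closest representative object."""
    return sum(min(math.dist(p, pts[m]) for m in medoids) for p in pts.values()) / len(pts)

print(round(avg_cost([1, 5]), 2))   # about 9.37 (old medoids)
print(round(avg_cost([4, 8]), 2))   # about 2.30 (candidate medoids), so the swap is accepted
```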
CLARA (Clustering Large Applications) (1990)

CLARA (Kaufmann and Rousseeuw in 1990)


Built in statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
52
CLARANS (“Randomized” CLARA) (1994)

CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)
CLARANS draws sample of neighbors dynamically
The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
If the local optimum is found, CLARANS starts with new
randomly selected node in search for a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)
53
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
54
Hierarchical Clustering

Uses the distance matrix as the clustering criterion. This method
does not require the number of clusters k as an input,
but needs a termination condition.

(Figure: agglomerative clustering (AGNES) runs from step 0 to step 4,
merging a, b, c, d, e into ab, de, cde, and finally abcde;
divisive clustering (DIANA) traverses the same hierarchy in the reverse direction.)
55
Hierarchical clustering technique

Basic Algorithm:
1. Compute the proximity matrix (i.e. distance
matrix)
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

56
single-linkage clustering
D(r,s) = Min { d(i,j) : Where object i is in
cluster r and object j is in cluster s }
complete-linkage clustering
D(r,s) = Max { d(i,j) : Where object i is in
cluster r and object j is in cluster s }.
average-linkage clustering
D(r,s)= mean{ d(i,j) : Where object i is in
cluster r and object j is in cluster s }

57
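A minimal Python sketch of the basic algorithm above, parameterized by the linkage: passing min gives single-linkage and max gives complete-linkage (average-linkage would need a small averaging helper). The proximity-matrix representation and the bookkeeping are illustrative, not taken from the slides.

```python
def agglomerative(dist, n, linkage=min, target_k=1):
    """dist is a full symmetric n x n proximity matrix (list of lists);
    linkage=min -> single link, linkage=max -> complete link."""
    clusters = [[i] for i in range(n)]                        # each point starts as a cluster
    while len(clusters) > target_k:
        best = None
        for a in range(len(clusters)):                        # find the two closest clusters
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        print("merge", clusters[a], "and", clusters[b], "at distance", d)
        clusters[a] = clusters[a] + clusters[b]               # merge the two closest clusters
        del clusters[b]
    return clusters
```

Run with the 6-point distance matrix of the worked example below, this should reproduce the hand computation: (p3, p6) merges first, then (p2, p5), then those two clusters.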
AGNES (Agglomerative Nesting)

Implemented in statistical analysis packages, e.g., Splus
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster

(Figure: three scatter plots showing the clusters being merged step by step)

58
Dendrogram: Shows How the Clusters are Merged

Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; then each connected
component forms a cluster.

59
DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own

(Figure: three scatter plots showing one cluster being split step by step)

60
Assume that the database D is given by the table below. Follow
single link technique to find clusters in D. Use Euclidean distance
measure.
D =
      X     Y
P1    0.40  0.53
P2    0.22  0.38
P3    0.35  0.32
P4    0.26  0.19
P5    0.08  0.41
P6    0.45  0.30

Euclidean distance matrix:
      p1    p2    p3    p4    p5    p6
p1    0
p2    0.24  0
p3    0.22  0.15  0
p4    0.37  0.20  0.15  0
p5    0.34  0.14  0.28  0.29  0
p6    0.23  0.25  0.11  0.22  0.39  0

61
After merging p3 and p6 (the closest pair):
          p1    p2    (p3,p6)  p4    p5
p1        0
p2        0.24  0
(p3,p6)   0.22  0.15  0
p4        0.37  0.20  0.15     0
p5        0.34  0.14  0.28     0.29  0

dist((p3,p6), p1) = MIN(dist(p3,p1), dist(p6,p1)) = MIN(0.22, 0.23) = 0.22

62
After merging p2 and p5:
          p1    (p2,p5)  (p3,p6)  p4
p1        0
(p2,p5)   0.24  0
(p3,p6)   0.22  0.15     0
p4        0.37  0.20     0.15     0

dist((p3,p6), (p2,p5)) = MIN(dist(p3,p2), dist(p6,p2), dist(p3,p5), dist(p6,p5))
                       = MIN(0.15, 0.25, 0.28, 0.39) = 0.15

63
After merging (p2,p5) and (p3,p6):
                 p1    (p2,p5,p3,p6)  p4
p1               0
(p2,p5,p3,p6)    0.22  0
p4               0.37  0.15           0

64
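For a cross-check, the same single-link merges can be obtained with SciPy (assuming scipy and numpy are installed); the exact merge distances differ slightly from the rounded matrix above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

P = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])   # P1..P6
Z = linkage(pdist(P), method='single')   # single-link agglomerative clustering
print(np.round(Z, 2))
# Each row lists the two clusters merged, the merge distance, and the new cluster size;
# the first merges are P3 with P6, then P2 with P5, matching the hand computation above.
```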
65
66
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
67
What Is Outlier Discovery?

What are outliers?
A set of objects that are considerably dissimilar from the
remainder of the data
Example: sports: Michael Jordan, Wayne Gretzky, ...
Problem: Define and find outliers in large data sets
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis

68
Outlier Discovery:
Statistical Approaches

Assume a model of the underlying distribution that generates
the data set (e.g., a normal distribution)
Use discordancy tests depending on
data distribution
distribution parameter (e.g., mean, variance)
number of expected outliers
Drawbacks
most tests are for single attribute
In many cases, data distribution may not be known
69
Outlier Discovery: Distance-Based Approach

Introduced to counter the main limitations imposed by
statistical methods
We need multi-dimensional analysis without knowing the
data distribution
Distance-based outlier: A DB(p, D)-outlier is an object O
in a dataset T such that at least a fraction p of the objects
in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm

70
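A naive nested-loop Python sketch of the DB(p, D)-outlier test (illustrative only; the tiny 1-D data set is made up):

```python
def db_outliers(points, p, D, dist):
    """Return objects O such that at least a fraction p of all objects
    lies at a distance greater than D from O."""
    n = len(points)
    return [o for o in points
            if sum(1 for x in points if dist(o, x) > D) >= p * n]

data = [1.0, 1.2, 0.9, 1.1, 9.0]
print(db_outliers(data, p=0.75, D=3.0, dist=lambda a, b: abs(a - b)))   # [9.0]
```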
Density-Based Local
Outlier Detection
Distance-based outlier detection is based on the global distance distribution
It encounters difficulties identifying outliers if the data is not uniformly distributed
Ex.: C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points,
plus 2 outlier points o1, o2
A distance-based method cannot identify o2 as an outlier
Need the concept of a local outlier
Local outlier factor (LOF):
Assume "outlier" is not crisp
Each point has a LOF

71
Outlier Discovery: Deviation-Based Approach

Identifies outliers by examining the main characteristics
of objects in a group
Objects that “deviate” from this description are
considered outliers
Sequential exception technique
simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
OLAP data cube technique
uses data cubes to identify regions of anomalies in
large multidimensional data

72
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary

73
Summary
Cluster analysis groups objects based on their similarity
and has wide applications
Measure of similarity can be computed for various types
of data
Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
There are still lots of research issues on cluster analysis

74
Problems and Challenges

Considerable progress has been made in scalable
clustering methods:
Partitioning: k-means, k-medoids, CLARANS
Hierarchical: BIRCH, ROCK, CHAMELEON
Density-based: DBSCAN, OPTICS, DenClue
Grid-based: STING, WaveCluster, CLIQUE
Model-based: EM, Cobweb, SOM
Frequent pattern-based: pCluster
Constraint-based: COD, constrained-clustering
Current clustering techniques do not address all the
requirements adequately, still an active area of research
75
