
Chapter 7.

Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary

1
What is Cluster Analysis?
Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups

(Figure: intra-cluster distances are minimized; inter-cluster distances are maximized)
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
3
Clustering: Rich Applications and
Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature
spaces
Detect spatial clusters or for other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns
4
Applications of Cluster Analysis
Understanding
Group related documents for browsing, group genes and proteins that have
similar functionality, or group stocks with similar price fluctuations
Summarization
Reduce the size of large data sets (e.g., clustering precipitation in Australia)

Discovered clusters and their industry groups (stock example):
Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN,
Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN,
Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN,
SGI-DOWN, Sun-DOWN
Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN,
ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN,
Morgan-Stanley-DOWN
Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP,
Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
Land use: Identification of areas of similar land use in an earth
observation database
Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults

6
Quality: What Is Good Clustering?

A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns

7
Measure the Quality of Clustering

Dissimilarity/Similarity metric: similarity is expressed in
terms of a distance function, typically a metric d(i, j)
There is a separate "quality" function that measures the
"goodness" of a cluster.
The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables.
Weights should be associated with different variables
based on applications and data semantics.
It is hard to define "similar enough" or "good enough";
the answer is typically highly subjective.
8
Requirements of Clustering in Data Mining

Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
9
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary

10
Data Structures
Data matrix (two modes):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity matrix (one mode):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
11
Type of data in clustering analysis

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types

12
Interval-valued variables

E.g., weight, height, and temperature.
Standardize the data:
Calculate the mean absolute deviation
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$
where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})$.
Calculate the standardized measurement (z-score)
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
Using the mean absolute deviation is more robust than using the
standard deviation.
13
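For illustration only, a minimal Python sketch of this standardization; the function name and the sample weight values are made up for the example.

```python
def standardize(values):
    """Standardize one interval-scaled variable using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                           # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n     # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]        # z-scores z_if

# Example: four weight measurements
print(standardize([50.0, 60.0, 70.0, 80.0]))        # [-1.5, -0.5, 0.5, 1.5]
```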
Similarity and Dissimilarity Between
Objects

Distances are normally used to measure the similarity or
dissimilarity between two data objects
Some popular ones include the Minkowski distance:
$$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q}$$
where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are
two p-dimensional data objects, and q is a positive integer
If q = 1, d is the Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

14
Similarity and Dissimilarity Between
Objects (Cont.)

If q = 2, d is the Euclidean distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$

Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)

15
Example

Let x1 = (1, 2) and x2 = (3, 5)
Euclidean distance = sqrt(2^2 + 3^2) = 3.61
Manhattan distance = 2 + 3 = 5

16
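A small Python sketch of the Minkowski distance, checked against the example above (the helper name minkowski is just for illustration):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))   # Manhattan: |1-3| + |2-5| = 5
print(minkowski(x1, x2, 2))   # Euclidean: sqrt(2^2 + 3^2), about 3.61
```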
Binary Variables
Symmetric binary variable: both of its states,
i.e. 0 and 1, are equally valuable. Here we cannot
decide which outcome should be 0 and which
outcome should be 1.
Example: marital status of a person is "married or
unmarried".
Asymmetric binary variable: the outcomes of
the states are not equally important. An example of
such a variable is the presence or absence of a
relatively rare attribute.
Example: person is "handicapped or not
handicapped".
17
Binary Variables
A contingency table for binary data (object i vs. object j):

                 object j
                 1       0       sum
object i    1    a       b       a + b
            0    c       d       c + d
            sum  a + c   b + d   p

Distance measure for symmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
Distance measure for asymmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c}$$
Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i, j) = \frac{a}{a + b + c}$$
18
Dissimilarity between Binary Variables

Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

gender is a symmetric attribute


the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
19
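The same numbers can be reproduced with a short Python sketch of the asymmetric binary dissimilarity; the 0/1 encodings below are the asymmetric attributes (Fever, Cough, Test-1..Test-4) with Y/P mapped to 1 and N to 0, and the helper name is illustrative only.

```python
def asym_binary_dissim(i, j):
    """Dissimilarity for asymmetric binary variables: (b + c) / (a + b + c)."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)   # 1-1 matches
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)   # i=1, j=0
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)   # i=0, j=1
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```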
Nominal(Categorical) Variables

A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
Method 1: simple matching
m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$
Method 2: use a large number of binary variables,
creating a new binary variable for each of the M
nominal states

20
Example (Test-1, a single nominal variable, so p = 1; d(i, j) = 0 if the values match, 1 otherwise):

Object  Test-1
1       A
2       B
3       C
4       A

Dissimilarity matrix:
0
1  0
1  1  0
0  1  1  0

21
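A minimal Python sketch of simple matching on the Test-1 example (illustrative; the helper name is made up):

```python
def nominal_dissim(i, j):
    """Simple matching: d(i, j) = (p - m) / p, with m = number of matching variables."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

objects = [["A"], ["B"], ["C"], ["A"]]          # Test-1 values for objects 1..4
for x in objects:
    print([nominal_dissim(x, y) for y in objects])   # reproduces the matrix above
```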
Ordinal Variables

An ordinal variable can be discrete or continuous


Order is important, e.g., rank
Can be treated like interval-scaled
replace xif by their rank rif ∈ {1,..., M f }

map the range of each variable onto [0, 1] by replacing


i-th object in the f-th variable by
r if − 1
z if =
M f −1
compute the dissimilarity using methods for interval-
scaled variables

22
Example (Test-2): replace the values by their ranks (Fair = 1, Good = 2, Excellent = 3,
so M_f = 3), then normalize with z_{if} = (r_{if} - 1)/(M_f - 1), e.g. (3 - 1)/(3 - 1) = 1.0.

Object  Test-2      Rank  z
1       Excellent   3     1.0
2       Fair        1     0.0
3       Good        2     0.5
4       Excellent   3     1.0

Dissimilarity matrix:
0
1.0  0
0.5  0.5  0
0    1.0  0.5  0
23
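A short Python sketch of the ordinal treatment on the Test-2 example, assuming the rank order Fair < Good < Excellent:

```python
order = {"Fair": 1, "Good": 2, "Excellent": 3}      # rank r_if of each ordinal state
test2 = ["Excellent", "Fair", "Good", "Excellent"]

M_f = len(order)
z = [(order[v] - 1) / (M_f - 1) for v in test2]     # map onto [0, 1]
print(z)                                            # [1.0, 0.0, 0.5, 1.0]

# treat z like an interval-scaled variable, e.g. with Manhattan distance
print([[abs(a - b) for b in z] for a in z])         # reproduces the matrix above
```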
Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale,
such as Ae^{Bt} or Ae^{-Bt}
Methods:
treat them like interval-scaled variables: not a good
choice! (why? the scale can be distorted)
apply a logarithmic transformation
y_{if} = log(x_{if})
treat them as continuous ordinal data and treat their rank
as interval-scaled
24
Example (Test-3): apply the log transformation y_{if} = log10(x_{if}), then compute
dissimilarities on the transformed values.

Object  Test-3   log10(Test-3)
1       445      2.65
2       22       1.34
3       164      2.21
4       1210     3.08

Dissimilarity matrix:
0
1.31  0
0.44  0.87  0
0.43  1.74  0.87  0

25
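A short Python sketch of the log-transform treatment on the Test-3 example (using base-10 logarithms, as in the table above):

```python
import math

test3 = [445, 22, 164, 1210]
y = [round(math.log10(v), 2) for v in test3]            # 2.65, 1.34, 2.21, 3.08
print(y)
print([[round(abs(a - b), 2) for b in y] for a in y])   # e.g. d(2,1) = 1.31, d(4,2) = 1.74
```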
Vector Objects

Vector objects: keywords in documents, gene
features in micro-arrays, etc.
Broad applications: information retrieval, biologic
taxonomy, etc.
Cosine measure: s(x, y) = (x . y) / (||x|| ||y||)
A variant: the Tanimoto coefficient, (x . y) / (x . x + y . y - x . y)

26
Let x = (1, 1, 0, 0) and y = (0, 1, 1, 0)
s(x, y) = (1*0 + 1*1 + 0*1 + 0*0) / (sqrt(2) * sqrt(2)) = 1/2 = 0.5

27
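A minimal Python sketch of the cosine measure and the Tanimoto coefficient, reproducing the example above (function names are illustrative):

```python
import math

def cosine_sim(x, y):
    """Cosine measure: s(x, y) = (x . y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    """Tanimoto coefficient: (x . y) / (x . x + y . y - x . y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

x, y = (1, 1, 0, 0), (0, 1, 1, 0)
print(cosine_sim(x, y))   # 0.5
print(tanimoto(x, y))     # 1/3
```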
Mixed Types
A database may contain variables of mixed types; combine the per-variable
dissimilarities with a weighted average, here with equal weights:
d(i, j) = (d_nominal(i, j) + d_ordinal(i, j) + d_ratio(i, j)) / 3

Nominal dissimilarity matrix (Test-1):
0
1  0
1  1  0
0  1  1  0

Ordinal dissimilarity matrix (Test-2):
0
1.0  0
0.5  0.5  0
0    1.0  0.5  0

Ratio dissimilarity matrix (Test-3), rescaled onto [0, 1] by dividing by the largest value, 1.74:
0
0.75  0
0.25  0.50  0
0.25  1.00  0.50  0

Combined matrix, e.g. D(2,1) = (1(1) + 1(1) + 1(0.75)) / 3 = 0.92:
0
0.92  0
0.58  0.67  0
0.08  1.00  0.67  0
28
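A small Python sketch that combines the three per-variable matrices above with equal weights (a simplified version of the general weighted formula; the matrix values are copied from the example):

```python
nominal = [[0, 1, 1, 0],
           [1, 0, 1, 1],
           [1, 1, 0, 1],
           [0, 1, 1, 0]]
ordinal = [[0, 1.0, 0.5, 0],
           [1.0, 0, 0.5, 1.0],
           [0.5, 0.5, 0, 0.5],
           [0, 1.0, 0.5, 0]]
ratio   = [[0, 0.75, 0.25, 0.25],          # ratio matrix rescaled onto [0, 1]
           [0.75, 0, 0.50, 1.00],
           [0.25, 0.50, 0, 0.50],
           [0.25, 1.00, 0.50, 0]]

n = 4
combined = [[round((nominal[i][j] + ordinal[i][j] + ratio[i][j]) / 3, 2)
             for j in range(n)] for i in range(n)]
print(combined[1][0])   # D(2,1) = (1 + 1 + 0.75) / 3 = 0.92
```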
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary

29
Major Clustering Approaches (I)

Partitioning approach:
Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
using some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

30
Typical Alternatives to Calculate the Distance
between Clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = min(d(tip, tjq))
Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = max(d(tip, tjq))
Average: average distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = avg(d(tip, tjq))
Centroid: distance between the centroids of two clusters, i.e.,
dis(Ki, Kj) = d(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e.,
dis(Ki, Kj) = d(Mi, Mj)
Medoid: one chosen, centrally located object in the cluster
31
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
Centroid: the "middle" of a cluster:
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$
Radius: square root of the average distance from any point of the
cluster to its centroid:
$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$$
Diameter: square root of the average mean squared distance between
all pairs of points in the cluster:
$$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$

32
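A minimal Python sketch of these three measures for a small numerical data set (the function names and the sample points are made up for illustration):

```python
import math

def centroid(points):
    """Centroid: coordinate-wise mean of the N points."""
    n, dim = len(points), len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))

def radius(points):
    """Square root of the average squared distance from each point to the centroid."""
    c = centroid(points)
    n = len(points)
    return math.sqrt(sum(sum((p[d] - c[d]) ** 2 for d in range(len(c))) for p in points) / n)

def diameter(points):
    """Square root of the average squared distance between all pairs of distinct points."""
    n, dim = len(points), len(points[0])
    total = sum(sum((points[i][d] - points[j][d]) ** 2 for d in range(dim))
                for i in range(n) for j in range(n) if i != j)
    return math.sqrt(total / (n * (n - 1)))

pts = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
print(centroid(pts), radius(pts), diameter(pts))   # (2.0, 2.0), about 1.15, 2.0
```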
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
33
Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects
into a set of k clusters that minimizes the sum of squared distances
$$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$


Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means : Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition around medoids) Each cluster is
represented by one of the objects in the cluster

34
The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:
Partition objects into k nonempty subsets
Compute seed points as the centroids of the
clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
Assign each object to the cluster with the nearest
seed point
Go back to Step 2, stop when no more new
assignment

35
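For illustration, a minimal one-dimensional Python sketch of these steps; it follows the common variant that seeds with k arbitrary initial means (as in the figure on the next slide) rather than an initial partition, and it is not an optimized implementation. It can be run on the worked example a few slides below.

```python
import random

def kmeans_1d(values, k, max_iter=100):
    """Minimal 1-D k-means sketch; the result depends on the random initial means."""
    means = random.sample(values, k)                       # arbitrary initial seeds
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:                                   # assign each object to nearest mean
            clusters[min(range(k), key=lambda i: abs(v - means[i]))].append(v)
        new_means = [sum(c) / len(c) if c else means[i]    # recompute the cluster means
                     for i, c in enumerate(clusters)]
        if new_means == means:                             # stop when nothing changes
            break
        means = new_means
    return clusters, means

# The worked example below: k = 2 on {2, 4, 10, 12, 3, 20, 30, 11, 25};
# it typically converges to {2, 3, 4, 10, 11, 12} and {20, 25, 30}.
print(kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], 2))
```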
The K-Means Clustering Method

Example
(Figure, K = 2: arbitrarily choose K objects as the initial cluster centers; assign each
object to the cluster with the most similar center; update the cluster means; reassign
the objects and update the means again, until no object changes cluster.)
36
Flowchart

37
Comments on the K-Means Method

Strength: relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
Comment: Often terminates at a local optimum. The global optimum
may be found using techniques such as: deterministic annealing and
genetic algorithms
Weakness
Applicable only when mean is defined, then what about categorical
data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes

38
Variations of the K-Means Method

A few variants of the k-means which differ in

Selection of the initial k means

Dissimilarity calculations

Strategies to calculate cluster means

Handling categorical data:

Replacing means of clusters with modes

Using new dissimilarity measures to deal with categorical objects

Using a frequency-based method to update modes of clusters

A mixture of categorical and numerical data: k-prototype method

39
What Is the Problem of the K-Means Method?

The k-means algorithm is sensitive to outliers!
An object with an extremely large value may substantially
distort the distribution of the data.
K-Medoids: instead of taking the mean value of the objects in a
cluster as a reference point, a medoid can be used, which is the most
centrally located object in a cluster.

(Figure: the same data set clustered around means vs. around medoids)

40
Example
Given : {2,4,10,12,3,20,30,11,25}
Assume the number of clusters is k = 2.
Randomly assign means: m1= 3, m2 = 4
K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5,m2=16
K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18
K1={2,3,4,10},K2={12,20,30,11,25},
m1=4.75,m2=19.6
K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25
K1={2,3,4,10,11,12},K2={20,30,25}

41
Alternatively, randomly assign the values to the clusters.
Number of clusters = 2, therefore:
K1 = {2,10,3,30, 25}, Mean = 14
K2 = {4,12, 20, 11}, Mean = 11.75
Re-assign
K1 = {20, 30, 25}, Mean = 25
K2 = {2,4, 10, 12, 3, 11}, Mean= 7
Re-assign
K1 = {20, 30, 25}, Mean = 25
K2 = {2,4, 10, 12, 3, 11}, Mean= 7
So the final answer is K1={2,3,4,10,11,12},K2={20,30,25}
42
Use K-means algorithm to create 3 clusters for
given set of values
{2, 3, 6, 8, 9, 12, 15, 18, 22}

43
The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids)
starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale
well for large data sets

44
A Typical K-Medoids Algorithm (PAM)
(Figure, K = 2: arbitrarily choose k objects as the initial medoids and assign each
remaining object to its nearest medoid (total cost = 20). Randomly select a non-medoid
object O_random and compute the total cost of swapping it with a medoid (total cost = 26
in the illustration). If the quality is improved, perform the swap; repeat the loop until
no change.)
45
PAM (Partitioning Around Medoids) (1987)

PAM (Kaufman and Rousseeuw, 1987), built into Splus
Uses real objects to represent the clusters:
1. Select k representative objects arbitrarily
2. For each pair of a non-selected object h and a selected object i,
calculate the total swapping cost TC_ih
3. For each pair of i and h,
if TC_ih < 0, i is replaced by h;
then assign each non-selected object to the most
similar representative object
4. Repeat steps 2-3 until there is no change
46
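A simplified Python sketch in the spirit of PAM: it keeps trying medoid/non-medoid swaps while the total dissimilarity improves. It recomputes the full cost instead of the incremental TC_ih, so it is for illustration only (function names and the sample data are made up).

```python
def total_cost(points, medoids, dist):
    """Sum of each object's dissimilarity to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_sketch(points, k, dist):
    """Greedy PAM-style sketch: swap a medoid with a non-medoid while the cost improves."""
    medoids = list(points[:k])                        # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for h in points:
                if h in medoids:
                    continue
                candidate = [h if x == m else x for x in medoids]
                if total_cost(points, candidate, dist) < total_cost(points, medoids, dist):
                    medoids, improved = candidate, True
    return medoids

# Usage on 1-D data with the absolute difference as the dissimilarity
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
print(pam_sketch(data, 2, lambda a, b: abs(a - b)))
```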
PAM Clustering: total swapping cost TC_ih = Σ_j C_jih

(Figure: four cases for the contribution C_jih of an object j when medoid i is swapped
with non-medoid h, where t is another current medoid of j:
1. j moves from i to h: C_jih = d(j, h) - d(j, i)
2. j stays with its current medoid: C_jih = 0
3. j moves from i to t: C_jih = d(j, t) - d(j, i)
4. j moves from t to h: C_jih = d(j, h) - d(j, t))


What Is the Problem with PAM?

PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
PAM works efficiently for small data sets but does not
scale well for large data sets:
O(k(n-k)^2) for each iteration,
where n is # of data and k is # of clusters
Sampling-based method:
CLARA (Clustering LARge Applications)

48
Number x coordinate y coordinate

1 1.0 4.0

2 5.0 1.0

3 5.0 2.0

4 5.0 4.0

5 10.0 4.0

6 25.0 4.0

7 25.0 6.0

8 25.0 7.0

9 25.0 8.0

10 29.0 7.0

49
Objects 1 and 5 are the selected representative objects initially.

Object   Dissimilarity    Dissimilarity    Minimal         Closest
number   from object 1    from object 5    dissimilarity   representative object
1        0.00             9.00             0.00            1
2        5.00             5.83             5.00            1
3        4.47             5.39             4.47            1
4        4.00             5.00             4.00            1
5        9.00             0.00             0.00            5
6        24.00            15.00            15.00           5
7        24.08            15.13            15.13           5
8        24.19            15.30            15.30           5
9        24.33            15.52            15.52           5
10       28.16            19.24            19.24           5
Average minimal dissimilarity: 9.37

50
Object   Dissimilarity    Dissimilarity    Minimal         Closest
number   from object 4    from object 8    dissimilarity   representative object
1        4.00             24.19            4.00            4
2        3.00             20.88            3.00            4
3        2.00             20.62            2.00            4
4        0.00             20.22            0.00            4
5        5.00             15.30            5.00            4
6        20.00            3.00             3.00            8
7        20.10            1.00             1.00            8
8        20.22            0.00             0.00            8
9        20.40            1.00             1.00            8
10       24.19            4.00             4.00            8
Average minimal dissimilarity: 2.30

• Swapping cost = new cost – old cost = 2.30 – 9.37 = –7.07
• If Swapping Cost < 0
replace medoid with new selected object
• So new Medoids are object 4 and object 8.

51
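The two average costs can be checked with a few lines of Python (the coordinates are copied from the table of ten points above; math.dist needs Python 3.8+):

```python
import math

pts = {1: (1, 4), 2: (5, 1), 3: (5, 2), 4: (5, 4), 5: (10, 4),
       6: (25, 4), 7: (25, 6), 8: (25, 7), 9: (25, 8), 10: (29, 7)}

def avg_cost(medoids):
    """Average dissimilarity of every object to its closest representative object."""
    return sum(min(math.dist(p, pts[m]) for m in medoids) for p in pts.values()) / len(pts)

print(round(avg_cost([1, 5]), 2))   # about 9.37 (old medoids)
print(round(avg_cost([4, 8]), 2))   # about 2.30 (candidate medoids), so the swap is accepted
```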
CLARA (Clustering Large Applications) (1990)

CLARA (Kaufmann and Rousseeuw in 1990)


Built in statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
52
CLARANS (“Randomized” CLARA) (1994)

CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)
CLARANS draws sample of neighbors dynamically
The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
If the local optimum is found, CLARANS starts with new
randomly selected node in search for a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)
53
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
54
Hierarchical Clustering

Uses the distance matrix as the clustering criterion. This method
does not require the number of clusters k as an input,
but needs a termination condition.

(Figure: agglomerative clustering (AGNES) runs from step 0 to step 4,
merging a, b, c, d, e into ab, de, cde, and finally abcde;
divisive clustering (DIANA) traverses the same hierarchy in the reverse direction.)
55
Hierarchical clustering technique

Basic Algorithm:
1. Compute the proximity matrix (i.e. distance
matrix)
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

56
single-linkage clustering
D(r,s) = Min { d(i,j) : Where object i is in
cluster r and object j is in cluster s }
complete-linkage clustering
D(r,s) = Max { d(i,j) : Where object i is in
cluster r and object j is in cluster s }.
average-linkage clustering
D(r,s)= mean{ d(i,j) : Where object i is in
cluster r and object j is in cluster s }

57
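A minimal Python sketch of the basic algorithm above, parameterized by the linkage: passing min gives single-linkage and max gives complete-linkage (average-linkage would need a small averaging helper). The proximity-matrix representation and the bookkeeping are illustrative, not taken from the slides.

```python
def agglomerative(dist, n, linkage=min, target_k=1):
    """dist is a full symmetric n x n proximity matrix (list of lists);
    linkage=min -> single link, linkage=max -> complete link."""
    clusters = [[i] for i in range(n)]                        # each point starts as a cluster
    while len(clusters) > target_k:
        best = None
        for a in range(len(clusters)):                        # find the two closest clusters
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        print("merge", clusters[a], "and", clusters[b], "at distance", d)
        clusters[a] = clusters[a] + clusters[b]               # merge the two closest clusters
        del clusters[b]
    return clusters
```

Run with the 6-point distance matrix of the worked example below, this should reproduce the hand computation: (p3, p6) merges first, then (p2, p5), then those two clusters.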
AGNES (Agglomerative Nesting)

Implemented in statistical analysis packages, e.g., Splus
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster

(Figure: three scatter plots showing the clusters being merged step by step)

58
Dendrogram: Shows How the Clusters are Merged

Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; then each connected
component forms a cluster.

59
DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own

(Figure: three scatter plots showing one cluster being split step by step)

60
Assume that the database D is given by the table below. Follow
single link technique to find clusters in D. Use Euclidean distance
measure.
D =
      X     Y
P1    0.40  0.53
P2    0.22  0.38
P3    0.35  0.32
P4    0.26  0.19
P5    0.08  0.41
P6    0.45  0.30

Euclidean distance matrix:
      p1    p2    p3    p4    p5    p6
p1    0
p2    0.24  0
p3    0.22  0.15  0
p4    0.37  0.20  0.15  0
p5    0.34  0.14  0.28  0.29  0
p6    0.23  0.25  0.11  0.22  0.39  0

61
After merging p3 and p6 (the closest pair):
          p1    p2    (p3,p6)  p4    p5
p1        0
p2        0.24  0
(p3,p6)   0.22  0.15  0
p4        0.37  0.20  0.15     0
p5        0.34  0.14  0.28     0.29  0

dist((p3,p6), p1) = MIN(dist(p3,p1), dist(p6,p1)) = MIN(0.22, 0.23) = 0.22

62
After merging p2 and p5:
          p1    (p2,p5)  (p3,p6)  p4
p1        0
(p2,p5)   0.24  0
(p3,p6)   0.22  0.15     0
p4        0.37  0.20     0.15     0

dist((p3,p6), (p2,p5)) = MIN(dist(p3,p2), dist(p6,p2), dist(p3,p5), dist(p6,p5))
                       = MIN(0.15, 0.25, 0.28, 0.39) = 0.15

63
After merging (p2,p5) and (p3,p6):
                 p1    (p2,p5,p3,p6)  p4
p1               0
(p2,p5,p3,p6)    0.22  0
p4               0.37  0.15           0

64
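For a cross-check, the same single-link merges can be obtained with SciPy (assuming scipy and numpy are installed); the exact merge distances differ slightly from the rounded matrix above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

P = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])   # P1..P6
Z = linkage(pdist(P), method='single')   # single-link agglomerative clustering
print(np.round(Z, 2))
# Each row lists the two clusters merged, the merge distance, and the new cluster size;
# the first merges are P3 with P6, then P2 with P5, matching the hand computation above.
```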
65
66
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
67
What Is Outlier Discovery?

What are outliers?
A set of objects that are considerably dissimilar from the
remainder of the data
Example: sports: Michael Jordan, Wayne Gretzky, ...
Problem: Define and find outliers in large data sets
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis

68
Outlier Discovery:
Statistical Approaches

Assume a model of the underlying distribution that generates
the data set (e.g., a normal distribution)
Use discordancy tests depending on
data distribution
distribution parameter (e.g., mean, variance)
number of expected outliers
Drawbacks
most tests are for single attribute
In many cases, data distribution may not be known
69
Outlier Discovery: Distance-Based Approach

Introduced to counter the main limitations imposed by
statistical methods
We need multi-dimensional analysis without knowing the
data distribution
Distance-based outlier: A DB(p, D)-outlier is an object O
in a dataset T such that at least a fraction p of the objects
in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm

70
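A naive nested-loop Python sketch of the DB(p, D)-outlier test (illustrative only; the tiny 1-D data set is made up):

```python
def db_outliers(points, p, D, dist):
    """Return objects O such that at least a fraction p of all objects
    lies at a distance greater than D from O."""
    n = len(points)
    return [o for o in points
            if sum(1 for x in points if dist(o, x) > D) >= p * n]

data = [1.0, 1.2, 0.9, 1.1, 9.0]
print(db_outliers(data, p=0.75, D=3.0, dist=lambda a, b: abs(a - b)))   # [9.0]
```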
Density-Based Local
Outlier Detection
Distance-based outlier detection is based on the global distance distribution
It encounters difficulties identifying outliers if the data is not uniformly distributed
Ex.: C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points,
plus 2 outlier points o1, o2
A distance-based method cannot identify o2 as an outlier
Need the concept of a local outlier
Local outlier factor (LOF):
Assume "outlier" is not crisp
Each point has a LOF

71
Outlier Discovery: Deviation-Based Approach

Identifies outliers by examining the main characteristics
of objects in a group
Objects that “deviate” from this description are
considered outliers
Sequential exception technique
simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
OLAP data cube technique
uses data cubes to identify regions of anomalies in
large multidimensional data

72
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Outlier Analysis
7. Summary

73
Summary
Cluster analysis groups objects based on their similarity
and has wide applications
Measure of similarity can be computed for various types
of data
Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
There are still lots of research issues on cluster analysis

74
Problems and Challenges

Considerable progress has been made in scalable
clustering methods:
Partitioning: k-means, k-medoids, CLARANS
Hierarchical: BIRCH, ROCK, CHAMELEON
Density-based: DBSCAN, OPTICS, DenClue
Grid-based: STING, WaveCluster, CLIQUE
Model-based: EM, Cobweb, SOM
Frequent pattern-based: pCluster
Constraint-based: COD, constrained-clustering
Current clustering techniques do not address all the
requirements adequately, still an active area of research
75
