
UNIT: 7

CLUSTERING

Unit 7: Clustering (06 hours)

Sr.No.  Topic                                                      Hours
1       Introduction, categorization of major clustering methods    1
2       Partitioning methods - K-Means                               1
3       Hierarchical method                                          1
4       Hierarchical - Agglomerative                                 1
5       Hierarchical - Divisive methods                              1
6       Model-based - Expectation                                    1
7       Model-based - Maximization                                   1

2
Clustering: Basic Idea

▷ Groups similar data together into clusters
▷ Clustering is accomplished by determining the similarity among the data on predefined attributes
▷ Most similar data are grouped into clusters
▷ Partitioning or segmentation

▷ Clustering is unsupervised, because
  ○ No previous categorization known
  ○ Totally data driven
  ○ Classes are formed after examining the data

3
Clustering : Application

▷ Marketing Management: Discover distinct groups


in customer bases, and then use this knowledge to
develop targeted marketing programs
▷ WWW
○ Cluster Weblog data to discover groups of
similar access patterns
▷ Text Mining
○ Grouping documents with similar characteristics

4
Clustering : Example

▷ A good clustering method will produce high quality clusters

[Figure: a scatter plot of points (*) falling into a few well-separated groups]

5
Similarity?

▷ Groups of similar customers
  ○ Similar demographics
  ○ Similar buying behavior
  ○ Similar health
▷ Similar products
  ○ Similar cost
  ○ Similar function
  ○ Similar store
  ○ …
▷ Similarity usually is domain/problem specific

6
Clustering (Contd.)

▷ The output of a clustering algorithm consists


of a summarized representation of each
cluster.
▷ The type of summarized representation
depends strongly on the type & shape of
clusters the algorithm computes.

7
Types of Clustering

▷ Hard Clustering:
○ Each record is in one and only one cluster
▷ Soft Clustering:
○ Each record has a probability of being in each
cluster

8
Type of Clustering Algorithms

▷ Hierarchical clustering
  ○ Divisive clustering (top down)
  ○ Agglomerative clustering (bottom up)
▷ Partitional clustering
  ○ K-means clustering
  ○ K-medoids clustering
  ○ EM (expectation maximization) clustering
▷ Density-based methods
  ○ Regions of dense points separated by sparser regions of relatively low density

9
Type of Clustering Algorithms

▷ A hierarchical clustering algorithm generates a tree of clusters.
▷ It is categorized as agglomerative or divisive.
▷ The agglomerative approach starts with each object in its own cluster; nearby clusters are repeatedly merged into larger and larger clusters, until a stopping criterion is met or all objects end up in a single cluster at the highest level of the hierarchy.
▷ The divisive approach starts by putting all data objects into one cluster and repeatedly splits clusters into smaller and smaller ones, until a stopping criterion is met or each cluster contains only one object.
10
Agglomerative algorithm

1. Bottom-up approach.
2. Initially each item x1, . . . , xn is in its own cluster C1, . . . , Cn.
3. Repeat until there is only one cluster left:
   merge the nearest clusters, say Ci and Cj.

▷ The result is a cluster tree. One can cut the tree at any level to produce different clusterings.

11
Agglomerative algorithm

1. Allocate each point to a cluster of its own, so start with n clusters for n objects.
2. Using a distance measure, at each step identify the two clusters to be merged.
   ○ Single link approach: the distance between two clusters is the smallest distance between any of their members (nearest neighbour).
   ○ Complete link approach: the distance between two clusters is the largest distance between any of their members (farthest neighbour).
   ○ There are other approaches too……

13
Agglomerative algorithm
Single link approach
3. Two clusters are merged if the minimum distance between any two of their points is <= the threshold distance:

   d(ci, cj) = min { d(x, y) : x ∈ ci, y ∈ cj }

4. Identify the pair of clusters and merge them.
5. Compute all distances from the new cluster and update the distance matrix after the merger.
6. Continue the merging process until only one cluster is left, or stop merging once the distance between clusters is above the threshold.
▷ Single linkage tends to produce stringy or elongated clusters; that is, clusters get chained together by single objects that happen to be close together.
14
Single Link Example

15
Agglomerative algorithm

Complete link approach

3. Two clusters are merged if the maximum distance between any two of their points is <= the threshold distance:

   d(ci, cj) = max { d(x, y) : x ∈ ci, y ∈ cj }

4. Identify the pair of clusters and merge them.
5. Compute all distances from the new cluster and update the distance matrix after the merger.
6. Continue the merging process until only one cluster is left, or stop merging once the distance between clusters is above the threshold.
▷ Clusters tend to be compact and roughly equal in diameter.
16
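As a concrete illustration of steps 3-6, here is a minimal pure-Python sketch of agglomerative merging on a precomputed distance matrix; passing min gives the single link variant and max the complete link variant. The distance values and labels are illustrative, not taken from the slides.

def agglomerative(dist, labels, linkage=min):
    # Start with one cluster per object (step 1 of the algorithm above).
    clusters = [[i] for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        # The distance between two clusters is the linkage (min or max) over all cross-pairs.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, "".join(labels[i] for i in clusters[a]),
                          "".join(labels[i] for i in clusters[b])))
        clusters[a].extend(clusters[b])   # merge the two nearest clusters
        del clusters[b]
    return merges

# Illustrative symmetric distance matrix for four objects A-D.
dist = [[0, 2, 6, 10],
        [2, 0, 3, 9],
        [6, 3, 0, 7],
        [10, 9, 7, 0]]
print(agglomerative(dist, "ABCD", linkage=min))   # single link merge order and heights
print(agglomerative(dist, "ABCD", linkage=max))   # complete link merge order and heights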
Complete Link Example

17
Agglomerative algorithm

Average link approach

   d(ci, cj) = ( Σ_{x ∈ ci} Σ_{x' ∈ cj} d(x, x') ) / ( |ci| · |cj| )

▷ Stop merging once the average distance between clusters is above the threshold.
▷ Somewhere between single-linkage and complete-linkage.
▷ and a million other ways you can think of …

18
Dendrogram: Hierarchical Clustering

• Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

19
▷ Use single and complete link
agglomerative clustering to group the
data described by the following distance
matrix. Show the dendrograms.

21
Single link

After merging A and B:

      AB   C   D
  AB   0   2   5
  C        0   3
  D            0

After merging {A,B} with C (distance 2):

      ABC  D
  ABC   0   3
  D         0

22
Complete link

After merging A and B:

      AB   C   D
  AB   0   4   6
  C        0   3
  D            0

After the next merge:

      ABC  D
  ABC   0   6
  D         0

24
complete link: distance between two clusters is the longest distance
between a pair of elements from the two clusters

25
▷ Use single and complete link agglomerative clustering to group the data described by the following distance matrix. Show the dendrograms.

      A   B   C   D   E
  A   0   1   2   2   3
  B   1   0   2   4   3
  C   2   2   0   1   5
  D   2   4   1   0   3
  E   3   3   5   3   0

26
Single link

Starting from the distance matrix above, {A,B} and {C,D} are merged first (d(A,B) = 1, d(C,D) = 1):

      AB  CD  E
  AB   0
  CD   2   0
  E    3   3   0

Then {AB} and {CD} are merged (single-link distance = 2):

      ABCD  E
  ABCD   0
  E      3   0

[Dendrogram: A-B and C-D join at height 1, AB-CD at height 2, E joins at height 3]

27
Single link

d   K   Clusters                 Comment
0   5   {A},{B},{C},{D},{E}      Start with each point as its own cluster
1   3   {AB},{CD},{E}            Merge {A},{B} and {C},{D}: d(A,B) = 1 <= 1 and d(C,D) = 1 <= 1
2   2   {ABCD},{E}               Merge {AB} and {CD}: closest cross-pair d(A,C) = 2 <= 2
3   1   {ABCDE}                  Merge {E}

28
Complete link

Starting from the same distance matrix, {A,B} and {C,D} are merged first (d(A,B) = 1, d(C,D) = 1):

      AB  CD  E
  AB   0
  CD   4   0
  E    3   5   0

Then {E} is merged with {AB} (complete-link distance = 3):

      ABE  CD
  ABE   0
  CD    5   0

[Dendrogram: A-B and C-D join at height 1, E joins AB at height 3, ABE-CD at height 5]

29
d   K   Clusters                 Comment
0   5   {A},{B},{C},{D},{E}      Start with each point as its own cluster
1   3   {AB},{CD},{E}            Merge {A},{B} and {C},{D}: d(A,B) = 1 <= 1 and d(C,D) = 1 <= 1
2   3   {AB},{CD},{E}            No merge: d(AB,E) = 3 > 2 and d(CD,E) = 5 > 2
3   2   {ABE},{CD}               Merge {E} with {AB}: d(AB,E) = 3 <= 3
4   2   {ABE},{CD}               No merge: d(ABE,CD) = 5 > 4
5   1   {ABCDE}                  Merge the remaining clusters: d(ABE,CD) = 5 <= 5

30
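The merge heights in the two tables above can be cross-checked with SciPy's hierarchical clustering (a sketch assuming SciPy and NumPy are installed; linkage() expects the condensed form of the distance matrix).

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Distance matrix for points A-E from the exercise above.
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

condensed = squareform(D)                      # condensed (upper-triangle) form
for method in ("single", "complete"):
    Z = linkage(condensed, method=method)
    print(method, "merge heights:", Z[:, 2])   # single: 1, 1, 2, 3 / complete: 1, 1, 3, 5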
Average link

▷ In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters "r" and "s" in the figure is the average length of the arrows connecting the points of one cluster to the points of the other.

31
Average link cluster

32
Merge the clusters

39
Agglomerative algorithm

NOTE:
Start with:
▷ Initial number of clusters = n.
▷ Initial value of the threshold distance d = 0.
▷ Increment d in each iteration.
▷ Check the condition (min dist <= d) for merging.
▷ Continue until you get one cluster.

40
Weakness of Agglomerative

▷ Major weaknesses of agglomerative clustering methods
  ○ Do not scale well: time complexity of at least O(n^2), where n is the total number of objects
  ○ Can never undo what was done previously

41
Hierarchical Clustering algorithms

▷ Divisive (top-down):
  ○ Generate a spanning tree (no loop in the graph)
  ○ Start with all documents belonging to the same cluster.
  ○ Eventually each node forms a cluster on its own.
▷ Does not require the number of clusters k in advance
▷ Needs a termination/readout condition
  ○ The final level of the hierarchy in both the agglomerative and divisive cases is of no use by itself.

42
Overview

▷ Divisive Clustering starts by placing all


objects into a single group. Before we
start the procedure, we need to decide on
a threshold distance.
The procedure is as follows:
▷ The distance between all pairs of objects
within the same group is determined and
the pair with the largest distance is
selected.

43
Hierarchical Clustering

▷ Divisive Approaches
  Initialization: all objects stay in one cluster
  Iteration: select a cluster and split it into two sub-clusters
  Until each leaf cluster contains only one object

[Figure: top-down splitting of {a,b,c,d,e} into {a,b} and {c,d,e}, then {c,d,e} into {c} and {d,e}, and so on down to single objects (Step 0 to Step 4)]

44
Overview-contd

▷ This maximum distance is compared to the


threshold distance.
○ If it is larger than the threshold, this group is
divided in two. This is done by placing the
selected pair into different groups and using
them as seed points. All other objects in this
group are examined, and are placed into the new
group with the closest seed point. The
procedure then returns to Step 1.
○ If the distance between the selected objects is
less than the threshold, the divisive clustering
stops.
▷ To run a divisive clustering, you simply need
to decide upon a method of measuring the
distance between two objects.
45
[Figure: five points A-E connected in a chain with distances d(A,B) = 1, d(B,C) = 2, d(C,D) = 1, d(D,E) = 3]

• Max distance d(E,D) = 3, so split into {E} and {ABCD}
• Next, max distance d(B,C) = 2, so split into {E}, {AB}, {CD}
• Finally {A}, {B}, {C}, {D}, {E}

46
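A rough sketch of one round of the divisive split just described (find the most distant pair, use the two points as seeds, assign everything else to the nearest seed). The data and helper names are hypothetical; only the procedure follows the slides.

def divisive_split(points, dist, threshold):
    # Step 1: find the pair inside the group with the largest distance.
    pairs = [(dist(a, b), a, b) for i, a in enumerate(points) for b in points[i + 1:]]
    d_max, seed1, seed2 = max(pairs)
    if d_max <= threshold:
        return [points]                       # below the threshold: stop splitting
    # Step 2: use the selected pair as seeds and put every other object
    # into the group whose seed is closest.
    group1, group2 = [seed1], [seed2]
    for p in points:
        if p in (seed1, seed2):
            continue
        (group1 if dist(p, seed1) <= dist(p, seed2) else group2).append(p)
    return [group1, group2]

# Hypothetical 1-D example: five values, distance = absolute difference.
values = {0: 1.0, 1: 2.0, 2: 2.5, 3: 8.0, 4: 9.0}
d = lambda a, b: abs(values[a] - values[b])
print(divisive_split(list(values), d, threshold=3.0))   # -> [[0, 1, 2], [3, 4]]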
Type of Clustering Algorithms

▷ A Partitional clustering algorithm partitions the


data into k groups such that some criterion that
evaluates the clustering quality is optimized.
▷ The number of clusters k is a parameter whose
value is specified by the user.

47
Partitioning Algorithms

▷ Partitioning method: Construct a partition of a


database D of n objects into a set of k clusters
▷ Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
○ Global optimal: exhaustively enumerate all
partitions
○ Heuristic methods: k-means and k-medoids
algorithms
○ k-means (MacQueen, 1967): Each cluster is
represented by the center of the cluster
○ k-medoids or PAM (Partition around medoids)
(Kaufman & Rousseeuw, 1987): Each cluster is
represented by one of the objects in the cluster
48
K-Means Clustering

▷ Given k, the k-means algorithm consists of


four steps:
○ Select initial centroids at random.
○ Assign each object to the cluster with the nearest
centroid.
○ Compute each centroid as the mean of the objects
assigned to it.
○ Repeat previous 2 steps until no change.

49
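A compact sketch of these four steps, assuming NumPy; the data, k and random seed are illustrative, and empty clusters are not handled.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1. random initial centroids
    for _ in range(n_iter):
        # 2. assign each object to the cluster with the nearest centroid
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        # 3. recompute each centroid as the mean of the objects assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # 4. repeat until no change
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
print(kmeans(X, k=2))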
K-Means Clustering (contd.)

▷ Example

[Figure: four scatter plots on 10 x 10 axes showing successive k-means iterations - initial centroids, assignment of points to the nearest centroid, recomputed centroids, and the final clustering]

50
Example:

▷ Suppose we want to group the visitors to a website using just their age (a one-dimensional space) as follows:
▷ n = 19
▷ 15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65

52
▷ Initial clusters (random
centroid or average): k = 2
▷ c1 = 16
c2 = 22

53
Iteration 1: new centroids c1 = 15.33, c2 = 36.25

54
Iteration 2: new centroids c1 = 18.56, c2 = 45.90

55
Iteration 3: new centroids c1 = 19.50, c2 = 47.89

56
Iteration 4: new centroids c1 = 19.50, c2 = 47.89

57
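The iterations above can be reproduced with a short loop (a sketch; it hard-codes the initial centroids 16 and 22 from the earlier slide, and assigns ties to the second cluster).

ages = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]
c1, c2 = 16.0, 22.0                                       # initial centroids
for it in range(1, 5):
    g1 = [a for a in ages if abs(a - c1) < abs(a - c2)]   # nearer to c1 (ties go to c2)
    g2 = [a for a in ages if a not in g1]
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    print(f"iteration {it}: c1 = {c1:.2f}, c2 = {c2:.2f}")
# prints 15.33/36.25, 18.56/45.90, 19.50/47.89, 19.50/47.89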
Do it yourself

▷ Use k-means algorithm to cluster the


following points into 2 clusters:
▷ {2, 4, 10, 12, 3, 20, 30, 11, 25}
▷ Suppose that initial cluster centers (seed)
are 3 & 4. Run K-means algorithm for
three iterations.

58
x    |x - m1| (m1 = 3)    |x - m2| (m2 = 4)    cluster   new centroid
2          1                    2                 1
3          0                    1                 1        m1 = 2.5
4          1                    0                 2
10         7                    6                 2
11         8                    7                 2
12         9                    8                 2
20        17                   16                 2
25        22                   21                 2
30        27                   26                 2        m2 = 16

x    |x - m1| (m1 = 2.5)  |x - m2| (m2 = 16)   cluster   new centroid
2         0.5                  14                 1
3         0.5                  13                 1
4         1.5                  12                 1        m1 = 3
10        7.5                   6                 2
11        8.5                   5                 2
12        9.5                   4                 2
20       17.5                   4                 2
25       22.5                   9                 2
30       27.5                  14                 2        m2 = 18

59
x    |x - m1| (m1 = 3)    |x - m2| (m2 = 18)   cluster   new centroid
2          1                   16                 1
3          0                   15                 1
4          1                   14                 1
10         7                    8                 1        m1 = 4.75
11         8                    7                 2
12         9                    6                 2
20        17                    2                 2
25        22                    7                 2
30        27                   12                 2        m2 = 19.6

x    |x - m1| (m1 = 4.75) |x - m2| (m2 = 19.6)  cluster   new centroid
2         2.75                17.6                1
3         1.75                16.6                1
4         0.75                15.6                1
10        5.25                 9.6                1
11        6.25                 8.6                1
12        7.25                 7.6                1        m1 = 7
20       15.25                 0.4                2
25       20.25                 5.4                2
30       25.25                10.4                2        m2 = 25

60
x    |x - m1| (m1 = 7)    |x - m2| (m2 = 25)   cluster   new centroid
2          5                   23                 1
3          4                   22                 1
4          3                   21                 1
10         3                   15                 1
11         4                   14                 1
12         5                   13                 1        m1 = 7
20        13                    5                 2
25        18                    0                 2
30        23                    5                 2        m2 = 25

The assignments and centroids no longer change, so the algorithm has converged with clusters {2, 3, 4, 10, 11, 12} and {20, 25, 30}.

61
Measurement metric

63
Euclidean

  d(X, Y) = sqrt( Σ (yi - xi)^2 )

X = (2, 0)
Y = (-2, -2)

  sqrt( [-2 - 2]^2 + [-2 - 0]^2 ) = sqrt( 4^2 + 2^2 ) = sqrt(20) = 4.47

[Figure: X and Y plotted on the V1-V2 axes]

4.47 represents the "multivariate dissimilarity" of X and Y.

64
Squared Euclidean

  d(X, Y) = Σ (yi - xi)^2

X = (2, 0)
Y = (-2, -2)

  [-2 - 2]^2 + [-2 - 0]^2 = 4^2 + 2^2 = 20

Squared Euclidean is a little better at "noticing" strays:
• remember that we use a square root transform to "pull in" outliers
• leaving the value squared makes the strays stand out a bit more

65
Manhattan distance

▷ The function computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed. The Manhattan distance between two items is the sum of the differences of their corresponding components.
▷ The formula for this distance between a point X = (X1, X2, …) and a point Y = (Y1, Y2, …) is:

  d(X, Y) = Σ |Xi - Yi|

66
Manhattan distance or city block

  d(X, Y) = Σ |yi - xi|

X = (2, 0)
Y = (-2, -2)

  |-2 - 2| + |-2 - 0| = 4 + 2 = 6

So named because in a city you have to go "around the block"; you can't "cut the diagonal".

67
Chebyshev metric

Chebyshev distance is also known as maximum value distance. It is the largest absolute difference between the coordinates of two objects:

  d(X, Y) = max_i |Xi - Yi|

68
Chebyshev

  d(X, Y) = max |yi - xi|

X = (2, 0)
Y = (-2, -2)

  |-2 - 2| = 4
  |-2 - 0| = 2
  max(4, 2) = 4

Uses the "greatest univariate difference" to represent the multivariate dissimilarity.

69
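All four metrics for the running example X = (2, 0), Y = (-2, -2), computed with SciPy's distance helpers (a sketch assuming NumPy and SciPy are available).

import numpy as np
from scipy.spatial import distance

x, y = np.array([2.0, 0.0]), np.array([-2.0, -2.0])

print("Euclidean:        ", distance.euclidean(x, y))     # sqrt(4^2 + 2^2) = 4.47
print("Squared Euclidean:", distance.sqeuclidean(x, y))   # 4^2 + 2^2 = 20
print("Manhattan:        ", distance.cityblock(x, y))     # 4 + 2 = 6
print("Chebyshev:        ", distance.chebyshev(x, y))     # max(4, 2) = 4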
▷ A simple example showing the implementation of the k-means algorithm (using K = 2). Use the Euclidean distance metric.

70
▷ A simple example showing the implementation of the k-means algorithm (two dimensions, K = 3). Use the Manhattan distance metric.

Point   X   Y
A1      2   10
A2      2   5
A3      8   4
A4      5   8
A5      7   5
A6      6   4
A7      1   2
A8      4   9

Initial centroids: c1 = (2, 10), c2 = (5, 8), c3 = (1, 2)

78
K-means

First pass (centroids c1 = (2,10), c2 = (5,8), c3 = (1,2)) and second pass (centroids c1' = (2,10), c2' = (6,6), c3' = (1.5,3.5)), Manhattan distances:

Point  X  Y   d(c1)  d(c2)  d(c3)  Cluster |  d(c1')  d(c2')  d(c3')  Cluster
A1     2  10    0      5      9      C1    |    0        8       7      C1
A2     2   5    5      6      4      C3    |    5        5       2      C3
A3     8   4   12      7      9      C2    |   12        4       7      C2
A4     5   8    5      0     10      C2    |    5        3       8      C2
A5     7   5   10      5      9      C2    |   10        2       7      C2
A6     6   4   10      5      7      C2    |   10        2       5      C2
A7     1   2    9     10      0      C3    |    9        9       2      C3
A8     4   9    3      2     10      C2    |    3        5       8      C1

Second iteration, new centroids:
c1 => {A1} => c1 = (2, 10)
c2 => {A3, A4, A5, A6, A8} => ((8+5+7+6+4)/5, (4+8+5+4+9)/5) => (30/5, 30/5) => c2 = (6, 6)
c3 => {A2, A7} => ((2+1)/2, (5+2)/2) => c3 = (1.5, 3.5)

Third iteration:
c1 = (3, 9.5)
c2 = (6.5, 5.25)
c3 = (1.5, 3.5)

79
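A sketch of the same computation (K = 3, Manhattan distance, the eight points and initial centroids from the table); as on the slides, centroids are still updated with the mean even though assignment uses the Manhattan distance. Assumes NumPy and SciPy.

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centroids = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)     # c1, c2, c3

for it in range(1, 4):
    labels = cdist(X, centroids, metric="cityblock").argmin(axis=1)   # assign by Manhattan distance
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])
    print(f"after pass {it}:", centroids.round(2))
# pass 1: (2, 10), (6, 6), (1.5, 3.5)   pass 2: (3, 9.5), (6.5, 5.25), (1.5, 3.5)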
Comments on the K-Means Method

81
EM Model

82
Soft Clustering

▷ Clustering typically assumes that each


instance is given a “hard” assignment to
exactly one cluster.
▷ Does not allow uncertainty in class
membership or for an instance to belong to
more than one cluster.
▷ Soft clustering gives probabilities that an
instance belongs to each of a set of clusters.
▷ Each instance is assigned a probability
distribution across a set of discovered
categories (probabilities of all categories
must sum to 1). 83
Model based clustering
▷ Algorithm optimizes a probabilistic model criterion
▷ Soft clustering is usually done by the Expectation
Maximization (EM) algorithm
○ Gives a soft variant of the K-means algorithm
○ Assume k clusters: {c1, c2,… ck}
○ Assume a probabilistic model of categories that
allows computing P(ci | E) for each category, ci, for
a given example, E.
○ For text, typically assume a naïve Bayes category
model.
  ○ Parameters θ = {P(ci), P(wj | ci) : i ∈ {1, …, k}, j ∈ {1, …, |V|}}
84
Expectation Maximization (EM)
Algorithm
▷ EM algorithm provides a general approach to learning in
presence of unobserved variables
▷ Iterative method for learning probabilistic
categorization model from unsupervised data.
▷ Initially assume random assignment of examples to
categories.
▷ Learn an initial probabilistic model by estimating the model parameters θ from this randomly labeled data.
▷ Iterate following two steps until convergence:
○ Expectation (E-step): Compute P(ci | E) for each example
given the current model, and probabilistically re-label
the examples based on these posterior probability
estimates.

○ Maximization (M-step): Re-estimate the model parameters θ from the probabilistically re-labeled data.
85
Applications

▷ Filling in missing data in samples


▷ Discovering the value of hidden variables
▷ Estimating the parameters of Hidden
Markov Model (HMM)
▷ Unsupervised learning of clusters
▷ Semi-supervised classification and
clustering.

86
Example

87
EM Example

91
Step 1

▷ Randomly select two probabilities:
▷ θa = probability of heads for coin C1
▷ θb = probability of heads for coin C2

▷ Suppose: θa = 0.6
           θb = 0.5

93
Step 2:

▷ Take the first experiment, for which we will calculate the individual likelihoods under C1 and C2. Its outcome is:

▷ H: 5
▷ T: 5

94
Step 3: compute the likelihood

95
Step 4

▷ Normalize: compute, from the calculated likelihoods, the probability that each coin produced the observed outcome.
▷ These normalized values for coins A and B are the estimated (E-step) quantities.

96
Step 4 (contd.)

▷ Repeat the process for all the experiments; in our case there are 5 experiments.

97
Step 5

98
Step 6:

▷ Compute the total H and T attributed to C1 and C2; in our case:

C1 => 21.3 H and 8.6 T
C2 => 11.7 H and 8.4 T

99
Step 7

Calculate the new probability of heads for C1 and C2.
This is the maximization (M-step) of the parameter θ for each coin.

100
Step 8

▷ Repeat the process until you get the same probabilities again.
▷ In our case it is,

101
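A sketch of the full E/M loop for this two-coin setting. The per-experiment head/tail counts below are an assumption (the actual toss records are in the slide figures, which are not reproduced here), chosen so that one E-step gives totals close to the ones quoted above (about 21.3 H / 8.6 T for C1 and 11.7 H / 8.4 T for C2).

def coin_em(counts, theta_a=0.6, theta_b=0.5, n_iter=10):
    """counts: list of (heads, tails) per experiment; returns the estimated (theta_a, theta_b)."""
    for _ in range(n_iter):
        ha = ta = hb = tb = 0.0
        for h, t in counts:
            # E-step: responsibility of each coin for this experiment
            # (the binomial coefficient cancels in the normalization).
            la = theta_a ** h * (1 - theta_a) ** t
            lb = theta_b ** h * (1 - theta_b) ** t
            wa = la / (la + lb)
            wb = 1.0 - wa
            ha += wa * h; ta += wa * t       # expected heads/tails attributed to coin 1
            hb += wb * h; tb += wb * t       # expected heads/tails attributed to coin 2
        # M-step: re-estimate each coin's probability of heads.
        theta_a, theta_b = ha / (ha + ta), hb / (hb + tb)
    return theta_a, theta_b

experiments = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]   # assumed 10-toss experiments
print(coin_em(experiments, n_iter=1))    # one round: roughly (0.71, 0.58)
print(coin_em(experiments, n_iter=20))   # run until the estimates stop changing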
EM Example (Do it yourself)

102
Step 1

▷ Randomly select two probabilities:
▷ θa = probability of heads for coin C1
▷ θb = probability of heads for coin C2

▷ Suppose: θa = 0.6
           θb = 0.5

103
Step 2:

▷ Take the first experiment, for which we will calculate the individual likelihoods under C1 and C2. Its outcome is:

▷ H: 3
▷ T: 2

104
Step 3: compute the likelihood
First round = 3 heads and 2 tails

Likelihood of A = (0.6)^3 * (0.4)^2 = 0.03456
Likelihood of B = (0.5)^3 * (0.5)^2 = 0.03125

After normalization:

Expected value of A = 0.03456 / (0.03456 + 0.03125) = 0.5251
Expected value of B = 0.03125 / (0.03456 + 0.03125) = 0.4748

105
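The two likelihoods and their normalization can be checked in a few lines (the 3-heads / 2-tails counts are the ones given above).

la = 0.6 ** 3 * 0.4 ** 2                  # likelihood of 3 H, 2 T under coin A (theta_a = 0.6)
lb = 0.5 ** 3 * 0.5 ** 2                  # likelihood of 3 H, 2 T under coin B (theta_b = 0.5)
print(la, lb)                             # 0.03456, 0.03125
print(la / (la + lb), lb / (la + lb))     # normalized weights: about 0.5251 and 0.4749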
Step 4

▷ Repeat the process for all the experiments; in our case there are 5 experiments.

106
Experiment   Coin A (Heads)        Coin A (Tails)        Coin B (Heads)        Coin B (Tails)
3 H, 2 T     0.5251 * 3 = 1.57     0.5251 * 2 = 1.05     0.4748 * 3 = 1.424    0.4748 * 2 = 0.9496
5 H, 0 T     0.5251 * 5 = 2.6255   0.5251 * 0 = 0        0.4748 * 5 = 2.374    0.4748 * 0 = 0
4 H, 1 T     0.5251 * 4 = 2.1      0.5251 * 1 = 0.5251   0.4748 * 4 = 1.899    0.4748 * 1 = 0.4748
2 H, 3 T     0.5251 * 2 = 1.05     0.5251 * 3 = 1.57     0.4748 * 2 = 0.9496   0.4748 * 3 = 1.424
4 H, 1 T     0.5251 * 4 = 2.1      0.5251 * 1 = 0.5251   0.4748 * 4 = 1.899    0.4748 * 1 = 0.4748

             Total Head A          Total Tail A          Total Head B          Total Tail B
             9.4455                3.67                  8.545                 3.319

107
▷ Calculate the new probability of heads for C1 and C2.
▷ This is the maximization (M-step) of the parameter θ for each coin.
▷ Θa = 9.44 / (9.44 + 3.67) = 0.72
▷ Θb = 8.5456 / (8.5456 + 3.32) = 0.72

▷ Repeat the process until you get the same probabilities again.

108
Limitations

▷ The EM algorithm can be very slow, even on fast computers. It works best when only a small percentage of the data is missing and the dimensionality of the data is not too large. The higher the dimensionality, the slower the E-step; for high-dimensional data the E-step may run extremely slowly as the procedure approaches a local maximum.

109