Unit 7: Clustering
Data Mining
BSc. CSIT, 7th Semester
Interval-scaled variables
Binary variables
Standardize data:
Calculate the mean absolute deviation:
  s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)
where m_f = (1/n)(x_1f + x_2f + ... + x_nf) is the mean of variable f.
Calculate the standardized measurement (z-score):
  z_if = (x_if − m_f) / s_f
Minkowski distance:
  d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q)
where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer.
If q = 1, d is the Manhattan distance:
  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance:
  d(i, j) = sqrt(|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + ... + |x_ip − x_jp|^2)
Properties:
  d(i, j) ≥ 0 (non-negativity)
  d(i, i) = 0
  d(i, j) = d(j, i) (symmetry)
  d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
Also, one can use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
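The formulas above translate directly into a few lines of Python. A minimal sketch (the function name minkowski is ours, not from the slides):

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points x and y.
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

i, j = (7, 4), (10, 1)
print(minkowski(i, j, q=1))  # Manhattan: |7-10| + |4-1| = 6
print(minkowski(i, j, q=2))  # Euclidean: sqrt(9 + 9) = 4.24...
```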
Binary Variables
A contingency table for binary data:

                 Object j
                  1      0     sum
  Object i   1    a      b     a+b
             0    c      d     c+d
           sum   a+c    b+d     p

Distance measure for symmetric binary variables:
  d(i, j) = (b + c) / (a + b + c + d)
Distance measure for asymmetric binary variables:
  d(i, j) = (b + c) / (a + b + c)
Jaccard coefficient (similarity measure for asymmetric binary variables):
  sim_Jaccard(i, j) = a / (a + b + c)
Dissimilarity between Binary Variables
Example:
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
Using simple matching, the dissimilarity between two objects can be computed as
  d(i, j) = (p − m) / p
where m is the number of variables on which i and j match and p is the total number of variables.
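A short sketch of the asymmetric binary distance applied to the example above. Two assumptions not stated explicitly on the slide: gender is a symmetric attribute and is left out, and the remaining values are coded Y/P -> 1, N -> 0 (the usual convention for this example):

```python
def asymmetric_binary_distance(x, y):
    """d(i, j) = (b + c) / (a + b + c) for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # both 1
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # i = 1, j = 0
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # i = 0, j = 1
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 coded as 1/0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(asymmetric_binary_distance(jack, mary))  # (0+1)/(2+0+1) = 0.33
print(asymmetric_binary_distance(jack, jim))   # (1+1)/(1+1+1) = 0.67
print(asymmetric_binary_distance(jim, mary))   # (1+2)/(1+1+2) = 0.75
```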
Pros
Simple.
Fast for low-dimensional data.
It can find pure sub-clusters if a large number of clusters is specified.
Cons
K-means cannot handle non-globular clusters or clusters of different sizes and densities.
It is unable to handle noisy data and outliers.
It is sensitive to the initialization of the centroids (mean points): a centroid initialized to a far-away point might end up with no points associated with it, and more than one centroid might be initialized into the same cluster, resulting in poor clustering (see the sketch after this list).
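The behaviour described above is easy to see in a minimal k-means sketch (plain Python; random initial centroids and squared-Euclidean assignment are our assumed choices):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: random initial centroids, then alternate the
    assignment and update steps until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # results depend heavily on this draw
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d2.index(min(d2))].append(p)
        # Update step: move each centroid to its cluster mean. An empty
        # cluster keeps its old centroid (one of the failure modes above).
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
               else centroids[i] for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

pts = [(7, 4), (8, 3), (5, 9), (3, 3), (1, 3), (10, 1)]
centroids, clusters = kmeans(pts, k=3)
print(centroids)
print(clusters)
```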
2) K-means++
Example: Consider the dataset {(7,4), (8,3), (5,9), (3,3), (1,3), (10,1)} and cluster it with the help of k-means++ (take k = 3).
Solution: The number of clusters to be created is k = 3, so initially choose one point at random as the first cluster center; let it be the data point C1 = (7,4).
First, find the squared distance from C1 to every point in the dataset (X).
Step 1:
  d(C1, X1)² = d((7,4), (7,4))² = 0
  d(C1, X2)² = d((7,4), (8,3))² = 2
  d(C1, X3)² = d((7,4), (5,9))² = 29
  d(C1, X4)² = d((7,4), (3,3))² = 17
  d(C1, X5)² = d((7,4), (1,3))² = 37
  d(C1, X6)² = d((7,4), (10,1))² = 18
Step 2:
Now compute the probability of choosing each data point as the next centroid, P(x) = d(x)² / Σd(x)², where Σd(x)² = 0 + 2 + 29 + 17 + 37 + 18 = 103.

Dataset (X)   Probability
(7,4)         0
(8,3)         2/103 = 0.0194
(5,9)         29/103 = 0.2815
(3,3)         17/103 = 0.165
(1,3)         37/103 = 0.3592
(10,1)        18/103 = 0.1747

The point with the highest probability is chosen as the next centroid, so C2 = (1,3).
Step 3:
Now find the distance from C1 and C2 to every point of the dataset (X):
  d(C1, X1)² = 0
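A sketch of the seeding steps above (function names are ours). It reproduces the squared distances and probabilities; note that a full k-means++ implementation samples the next centroid in proportion to these probabilities rather than always taking the maximum, as the slide does:

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

X = [(7, 4), (8, 3), (5, 9), (3, 3), (1, 3), (10, 1)]
centers = [(7, 4)]  # C1, chosen at random

# Steps 1-2: squared distance of every point to its nearest chosen center,
# then the selection probabilities P(x) = d(x)^2 / sum of all d(x)^2.
d2 = [min(sq_dist(x, c) for c in centers) for x in X]
total = sum(d2)                      # 0 + 2 + 29 + 17 + 37 + 18 = 103
probs = [d / total for d in d2]
print(d2)      # [0, 2, 29, 17, 37, 18]
print(probs)   # [0.0, 0.0194..., 0.2815..., 0.165..., 0.3592..., 0.1747...]

# The slide takes the most probable point as C2 = (1, 3).
centers.append(X[probs.index(max(probs))])

# Step 3: recompute d(x)^2 as the minimum over both C1 and C2.
d2 = [min(sq_dist(x, c) for c in centers) for x in X]
print(d2)      # [0, 2, 29, 4, 0, 18]
```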
Compute the total cost, S, of swapping a representative object, Oj, with Orandom.
If S < 0, then swap Oj with Orandom to form the new set of k representative objects.
Repeat until there is no change.
With initial medoids C1 = (7, 4) and C2 = (3, 4), the costs are:

J    Data obj   D1 = |x−7| + |y−4|   D2 = |x−3| + |y−4|   Cost
1    (7, 6)     2                    6                    C1 = 2
2    (2, 6)     7                    3                    C2 = 3
3    (3, 8)     8                    4                    C2 = 4
4    (8, 5)     2                    6                    C1 = 2
5    (7, 4)     0                    4                    C1 = 0
6    (4, 7)     6                    4                    C2 = 4
7    (6, 2)     3                    5                    C1 = 3
8    (7, 3)     1                    5                    C1 = 1
9    (6, 4)     1                    3                    C1 = 1
10   (3, 4)     4                    0                    C2 = 0

Based on the distances shown in the table above, the initial clusters are:
C1 = {(7,6), (8,5), (7,4), (6,2), (7,3), (6,4)}
C2 = {(2,6), (3,8), (4,7), (3,4)}
Now calculate the total cost: Σ d(j, medoid) = 2+3+4+2+0+4+3+1+1+0 = 20
Now choose some other point to be a medoid. Let us take data point 8 as the new medoid, K3 = (7, 3). Repeating the same steps as earlier, we obtain the following table:

J    Data obj   D1 = |x−7| + |y−3|   D2 = |x−3| + |y−4|   Cost
1    (7, 6)     3                    6                    C1 = 3
2    (2, 6)     8                    3                    C2 = 3
3    (3, 8)     9                    4                    C2 = 4
4    (8, 5)     3                    6                    C1 = 3
5    (7, 4)     1                    4                    C1 = 1
6    (4, 7)     7                    4                    C2 = 4
7    (6, 2)     2                    5                    C1 = 2
8    (7, 3)     0                    5                    C1 = 0
9    (6, 4)     2                    3                    C1 = 2
10   (3, 4)     5                    0                    C2 = 0
Based on the distances shown in the table above, the clusters are:
C1 = {(7,6), (8,5), (7,4), (6,2), (7,3), (6,4)}
C2 = {(2,6), (3,8), (4,7), (3,4)}
Now calculate the total cost: Σ d(j, medoid) = 3+3+4+3+1+4+2+0+2+0 = 22
Since the new cost (22) is greater than the previous cost (20), the swap cost S = 22 − 20 = 2 > 0, so the swap is rejected and the medoids remain C1 = (7,4) and C2 = (3,4).
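Both totals can be checked with a small sketch that sums the Manhattan distance from every point to its nearest medoid:

```python
def manhattan(p, q):
    """Manhattan (city-block) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

data = [(7, 6), (2, 6), (3, 8), (8, 5), (7, 4),
        (4, 7), (6, 2), (7, 3), (6, 4), (3, 4)]

def total_cost(medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in data)

print(total_cost([(7, 4), (3, 4)]))   # 20, cost of the initial medoids
print(total_cost([(7, 3), (3, 4)]))   # 22, cost after the trial swap
# S = 22 - 20 = 2 > 0, so the swap is rejected.
```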
Example: If eps = 2 and minpts = 2, what are the clusters that DBSCAN would discover in the following dataset?

Data points   X   Y
A1            2   10
A2            2   5
A3            8   4
A4            5   8
A5            7   5
A6            6   4
A7            1   2
A8            4   9

Solution: To find the clusters by DBSCAN, we first need to calculate the distances among all pairs of the given data points. Let us use the Euclidean distance measure for the distance calculation.
Data   A1     A2     A3     A4     A5     A6     A7     A8
A1     0      5      8.49   3.6    7.07   7.21   8.06   2.23
A2     5      0      6.08   4.24   5      4.12   3.16   4.47
A3     8.49   6.08   0      5      1.41   2      7.28   6.4
A4     3.6    4.24   5      0      3.6    4.12   7.21   1.41
A5     7.07   5      1.41   3.6    0      1.41   6.7    5
A6     7.21   4.12   2      4.12   1.41   0      5.38   5.38
A7     8.06   3.16   7.28   7.21   6.7    5.38   0      7.61
A8     2.23   4.47   6.4    1.41   5      5.38   7.61   0
Now we have to find the clusters. Let us select point A1 as the first point and find all the data points that lie in the eps-neighborhood of A1. Since no other point has a distance <= 2 from A1, the eps-neighborhood of A1 is empty.
Similarly, point A2 has no other point within distance <= 2, so the eps-neighborhood of A2 is also empty.
Point A3 has two data points within distance <= 2, namely A5 and A6, so the eps-neighborhood of A3 = {A5, A6}.
Point A4 has one neighbor: its eps-neighborhood = {A8}.
Point A5 has two neighbors: its eps-neighborhood = {A3, A6}.
Point A6 has two neighbors: its eps-neighborhood = {A3, A5}.
Point A7 has no neighbors at all.
Point A8 has one neighbor: its eps-neighborhood = {A4}.
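A minimal DBSCAN sketch that finishes the example from these neighborhoods. One assumption to flag: minpts here counts the point itself (scikit-learn's convention), so with minpts = 2 a point is core if it has at least one neighbor within eps:

```python
from math import dist  # Euclidean distance (Python 3.8+)

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}

def neighbours(name, eps):
    """All points (other than name itself) within eps of name."""
    return [m for m in points
            if m != name and dist(points[m], points[name]) <= eps]

def dbscan(eps, minpts):
    labels = {}
    cluster = 0
    for name in points:
        if name in labels:
            continue
        nb = neighbours(name, eps)
        if len(nb) + 1 < minpts:       # not a core point
            labels[name] = "noise"
            continue
        cluster += 1                   # start a new cluster at this core point
        labels[name] = cluster
        queue = list(nb)
        while queue:                   # expand through density-reachable points
            m = queue.pop()
            if labels.get(m) == "noise":
                labels[m] = cluster    # noise becomes a border point
            if m in labels:
                continue
            labels[m] = cluster
            nb_m = neighbours(m, eps)
            if len(nb_m) + 1 >= minpts:
                queue.extend(nb_m)     # m is core, keep expanding
    return labels

print(dbscan(eps=2, minpts=2))
# Clusters: {A3, A5, A6} and {A4, A8}; noise points: A1, A2, A7
```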
Advantages
3) The DBSCAN algorithm is able to find arbitrarily sized and arbitrarily shaped clusters.
Disadvantages