
Unit 7 Clustering

The document provides an overview of clustering, defining it as the process of organizing similar objects into groups while distinguishing them from others. It discusses various types of data used in clustering analysis, similarity and dissimilarity measures, and details several clustering algorithms including K-Means, K-Means++, Mini-Batch K-Means, and K-Medoids. Each algorithm's methodology, advantages, and limitations are also highlighted, emphasizing the importance of choosing appropriate methods based on data characteristics.


Data Warehousing and Data Mining
BSc. CSIT 7th Semester

UNIT 7: CLUSTERING

Clustering
 A cluster is a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
 Clustering is "the process of organizing objects into groups whose members are similar in some way".
 "Cluster Analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual."
 Clustering is an unsupervised learning technique.
 Hence, clustering is concerned with grouping things together.
Types of data in clustering analysis

 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types


Interval-valued variables
 Standardize data
 Calculate the mean absolute deviation:

   s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)

 where m_f = (1/n)(x_1f + x_2f + ... + x_nf)

 Calculate the standardized measurement (z-score):

   z_if = (x_if - m_f) / s_f
 Using mean absolute deviation is more robust than using
standard deviation
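As a sketch (in Python, with illustrative variable names of my own), the standardization above can be computed as:

```python
def standardize(values):
    """Standardize one variable f: z_if = (x_if - m_f) / s_f,
    where s_f is the mean absolute deviation (more robust than
    the standard deviation, as noted above)."""
    n = len(values)
    m_f = sum(values) / n                          # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]

print(standardize([2.0, 4.0, 6.0, 8.0]))  # [-1.5, -0.5, 0.5, 1.5]
```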
Similarity and Dissimilarity between Objects
 The similarity measure is a way of measuring how data samples are related or close to each other.
 On the other hand, the dissimilarity measure tells how distinct the data objects are.
 In clustering, similar data samples are grouped into one cluster, and all other data samples are grouped into different ones.
 The similarity measure is usually expressed as a numerical value: it gets higher when the data samples are more alike.
 It is often expressed as a number between zero and one by conversion: zero means low similarity (the data objects are dissimilar), one means high similarity (the data objects are very similar).
Similarity and Dissimilarity Between Objects
 Distances are normally used to measure the similarity or dissimilarity between two data objects.
 A popular one is the Minkowski distance:

   d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)

 where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer.
 If q = 1, d is the Manhattan distance:

   d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
Similarity and Dissimilarity Between Objects (Cont.)
 If q = 2, d is the Euclidean distance:

   d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)

 Properties:
   d(i, j) >= 0
   d(i, i) = 0
   d(i, j) = d(j, i)
   d(i, j) <= d(i, k) + d(k, j)
 One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
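The Minkowski family above can be sketched as follows (a minimal illustration; the function name is my own):

```python
def minkowski(i, j, q):
    """Minkowski distance between two p-dimensional points i and j."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)

# q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
print(minkowski((1, 2), (4, 6), 1))  # 7.0
print(minkowski((1, 2), (4, 6), 2))  # 5.0
```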
Binary Variables
 A contingency table for binary data:

                    Object j
                    1        0        sum
   Object i   1     a        b        a+b
              0     c        d        c+d
             sum    a+c      b+d      p

 Distance measure for symmetric binary variables:

   d(i, j) = (b + c) / (a + b + c + d)

 Distance measure for asymmetric binary variables:

   d(i, j) = (b + c) / (a + b + c)

 Jaccard coefficient (similarity measure for asymmetric binary variables):

   sim_Jaccard(i, j) = a / (a + b + c)
Dissimilarity between Binary Variables

 Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0

   d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
   d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
   d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
Nominal Variables
 A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
 Method 1: Simple matching
   m: # of matches, p: total # of variables

   d(i, j) = (p - m) / p

 Method 2: Use a large number of binary variables
   Create a new binary variable for each of the M nominal states.
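Simple matching can be sketched as:

```python
def simple_matching_dissim(i, j):
    """d(i, j) = (p - m) / p, where m = # of matching variables."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

# 2 of 3 variables match, so d = 1/3
print(simple_matching_dissim(["red", "small", "round"],
                             ["red", "large", "round"]))
```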
Ordinal Variables
 An ordinal variable can be discrete or continuous.
 Order is important, e.g., rank.
 Can be treated like interval-scaled:
   replace x_if by its rank r_if in {1, ..., M_f}
   map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

     z_if = (r_if - 1) / (M_f - 1)

   compute the dissimilarity using methods for interval-scaled variables.
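The rank mapping can be sketched as follows (the explicit `order` argument is my own addition, since alphabetical sorting would not respect the ordinal order):

```python
def ordinal_to_interval(values, order):
    """Map ordinal values onto [0, 1] via z_if = (r_if - 1) / (M_f - 1),
    where `order` lists the states from lowest to highest."""
    rank = {v: r + 1 for r, v in enumerate(order)}  # r_if in {1, ..., M_f}
    m_f = len(order)                                # M_f ordered states
    return [(rank[v] - 1) / (m_f - 1) for v in values]

print(ordinal_to_interval(["low", "high", "medium"],
                          order=["low", "medium", "high"]))  # [0.0, 1.0, 0.5]
```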
Ratio-Scaled Variables
 Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at an exponential scale, such as Ae^(Bt) or Ae^(-Bt).
 Methods:
   treat them like interval-scaled variables (not a good choice! the scale can be distorted)
   apply a logarithmic transformation: y_if = log(x_if)
   treat them as continuous ordinal data and treat their ranks as interval-scaled.
Variables of Mixed Types
 A database may contain all six types of variables:
   symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.
 One may use a weighted formula to combine their effects:

   d(i, j) = ( sum_{f=1..p} delta_ij(f) * d_ij(f) ) / ( sum_{f=1..p} delta_ij(f) )

 If f is binary or nominal:
   d_ij(f) = 0 if x_if = x_jf, or d_ij(f) = 1 otherwise
 If f is interval-based: use the normalized distance
 If f is ordinal or ratio-scaled:
   compute the rank r_if, set z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled
Vector Objects
 Vector objects: keywords in documents, gene features in micro-arrays, etc.
 Broad applications: information retrieval, biological taxonomy, etc.
 Cosine measure

 A variant: Tanimoto coefficient
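Both measures can be sketched as follows (a minimal illustration for dense vectors):

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def tanimoto(x, y):
    """Tanimoto coefficient: (x . y) / (||x||^2 + ||y||^2 - x . y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

print(cosine_similarity([1, 1, 0], [1, 1, 1]))  # 2/sqrt(6), about 0.816
print(tanimoto([1, 1, 0], [1, 1, 1]))           # 2/3
```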


Clustering Algorithms
1) K-Means Algorithm

 The k-means algorithm is a partitioning method of clustering where each cluster's center is represented by the mean value of the objects in the cluster.
 Algorithm
1. Choose k, the number of clusters to be determined.
2. Choose k objects randomly as the initial cluster centers.
3. Repeat
   3.1 Assign each object to its closest cluster.
   3.2 Compute the new clusters, i.e., recalculate the mean points.
4. Until
   4.1 No change in cluster entities, OR
   4.2 No object changes its cluster.
K-Means Clustering: Details

 Initial centroids are often chosen randomly.
 Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the cluster.
 'Closeness' is measured mostly by Euclidean distance, cosine similarity, correlation, etc.
 K-means will converge for the common similarity measures mentioned above.
 Most of the convergence happens in the first few iterations.
 Often the stopping condition is changed to 'until relatively few points change clusters'.
 Complexity is O(n * K * I * d), where
   n = number of points, K = number of clusters,
   I = number of iterations, d = number of attributes.
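The algorithm above can be sketched as follows (a teaching illustration, not a production implementation; names are my own):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: random initial centers, assign each point to its
    nearest center, recompute centers as cluster means, stop on no change."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)       # step 2: random initial centers
    clusters = []
    for _ in range(max_iter):
        # step 3.1: assign each object to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        # step 3.2: recompute each center as the mean of its cluster
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])   # keep an empty cluster's center
        if new_centers == centers:               # step 4: stop when nothing changes
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(pts, 2)
```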
K-Means Algorithm Example
(the worked example was presented as figures in the original slides)
Pros and Cons of K-Means Clustering Algorithm
Pros
 Simple.
 Fast for low-dimensional data.
 It can find pure sub-clusters if a large number of clusters is specified.

Cons
 K-means cannot handle non-globular data or clusters of different sizes and densities.
 Unable to handle noisy data and outliers.
 It is sensitive to the initialization of the centroids (mean points). If a centroid is initialized to a far-away point, it might end up with no points associated with it.
 More than one centroid might be initialized into the same cluster, resulting in poor clustering.
2) K- means ++
 To overcome the initialization drawback of k-means, we use k-means++.
 This algorithm ensures a smarter initialization of the centroids and improves the quality of the clustering.
 After initialization, the rest of the algorithm is the same as the k-means algorithm.
K-means ++ algorithm
1. Choose k, the number of clusters.
2. Choose an initial centroid uniformly at random from the input data points.
3. For each data point, compute its squared distance from the nearest previously chosen centroid.
4. Select the next centroid from the data points such that the probability of choosing a point as a centroid is directly proportional to its squared distance from the nearest previously chosen centroid, i.e., the point having the maximum distance from the nearest centroid is most likely to be selected as the next centroid.
5. Repeat steps 3 and 4 until k centroids have been sampled.
6. Proceed as with the standard k-means algorithm.
K-Means++ Example:

 Consider the dataset {(7,4), (8,3), (5,9), (3,3), (1,3), (10,1)} and cluster it with the help of k-means++ (take k = 3).
Solution: The number of clusters to be created is k = 3. Initially choose a point randomly as the first cluster center; let it be the data point C1 = (7,4).
 First, find the squared distance from C1 to every point in the dataset (X).
Step 1:
 d(C1, X1)^2 = d^2{(7,4), (7,4)} = 0
 d(C1, X2)^2 = d^2{(7,4), (8,3)} = 2
 d(C1, X3)^2 = d^2{(7,4), (5,9)} = 29
 d(C1, X4)^2 = d^2{(7,4), (3,3)} = 17
 d(C1, X5)^2 = d^2{(7,4), (1,3)} = 37
 d(C1, X6)^2 = d^2{(7,4), (10,1)} = 18
Step 2:
 Now compute the probability of each data point being chosen as the next centroid (proportional to its squared distance from C1; the total is 103):

Data point (X)   Probability
(7,4)            0
(8,3)            2/103 = 0.0194
(5,9)            29/103 = 0.2815
(3,3)            17/103 = 0.1650
(1,3)            37/103 = 0.3592
(10,1)           18/103 = 0.1747

 The probability is highest for (1,3), so the next centroid is C2 = (1,3).
Step 3:
 Now find, for each data point, the squared distance to the nearest of C1 and C2. (The step-3 distance table was shown as a figure in the original slides; d(C1, X1) = 0, so X1 remains with C1.)
 The squared distance to the nearest centroid is now highest for (5,9), so the next centroid is C3 = (5,9).
 Now we have the centroid for the first cluster C1 = (7,4), the centroid for the second cluster C2 = (1,3), and the centroid for the third cluster C3 = (5,9).
 All data points are then assigned to these three centroids using the standard k-means algorithm.
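The seeding in this worked example always picks the point with the maximum squared distance, so it can be reproduced deterministically (standard k-means++ instead samples with probability proportional to D^2):

```python
def kmeanspp_seeds(points, k, first=0):
    """Seed k centroids as in the worked example above: repeatedly take
    the point with the maximum squared distance D^2 to its nearest chosen
    centroid.  `first` is the index of the initial centroid."""
    centroids = [points[first]]
    while len(centroids) < k:
        # D^2(p): squared distance to the nearest already-chosen centroid
        def d2(p):
            return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        centroids.append(max(points, key=d2))
    return centroids

data = [(7, 4), (8, 3), (5, 9), (3, 3), (1, 3), (10, 1)]
print(kmeanspp_seeds(data, 3))  # [(7, 4), (1, 3), (5, 9)]
```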
3) Mini-Batch K-Means

 The mini-batch k-means clustering algorithm is a variant of the standard k-means algorithm.
 The primary idea behind the mini-batch k-means method is to create small random batches of data of a predetermined size so that they can be kept in memory.
 Each iteration obtains a fresh random sample from the dataset, uses it to update the clusters, and the process continues until convergence.
Mini-Batch K-Means Algorithm
(the algorithm pseudocode was shown as a figure in the original slides)

Why is mini-batch k-means better than normal k-means?
 Mini-batch k-means is considered better than the normal k-means algorithm because it uses mini-batches of samples rather than the entire dataset, which can make it faster for large datasets, and perhaps more robust to statistical noise.
 Mini-batch k-means is faster but gives slightly different results than normal k-means.
 As the number of clusters and the amount of data increase, the relative saving in computational time also increases. The saving in computational time is more noticeable only when the number of clusters is very large.
 The effect of the batch size on the computational time is also more evident when the number of clusters is larger. Increasing the number of clusters decreases the similarity of the mini-batch k-means solution to the k-means solution.
 Although the agreement between the partitions decreases as the number of clusters increases, the objective function does not degrade at the same rate. This means that the final partitions are different, but close in quality.
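A minimal sketch of the mini-batch update, assuming the common streaming rule where each center moves toward its assigned batch points with a per-center learning rate of 1/count (details vary between implementations):

```python
import random

def mini_batch_kmeans(points, k, batch_size=4, n_iter=100, seed=1):
    """Mini-batch k-means sketch: each iteration draws a random batch and
    nudges each point's nearest center toward it; the 1/count learning
    rate makes a center the running mean of all points ever assigned to it."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        for p in batch:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            counts[j] += 1
            eta = 1.0 / counts[j]                  # per-center learning rate
            centers[j] = [(1 - eta) * cc + eta * pc
                          for cc, pc in zip(centers[j], p)]
    return [tuple(c) for c in centers]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = mini_batch_kmeans(pts, 2)
```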
4) K-Medoids

 K-medoids clustering is an alternative to k-means that is more robust to noise and outliers.
 Instead of using the mean point as the center of a cluster, k-medoids uses an actual point in the cluster to represent it.
 A medoid is the most centrally located object of the cluster.
Algorithm
 K-medoids PAM (Partitioning Around Medoids) is a k-medoids algorithm for partitioning based on medoid or central objects.
Input:
 k: the number of clusters.
 D: a data set containing n objects.
Output: A set of k clusters.
Method:
 Arbitrarily choose k objects in D as the initial representative objects.
 Repeat
   Assign each remaining object to the cluster with the nearest representative object.
   For each representative object Oj:
     Randomly select a non-representative object, Orandom.
     Compute the total cost, S, of swapping representative object Oj with Orandom.
     If S < 0, swap Oj with Orandom to form the new set of k representative objects.
 Until no change.
For a given k = 2, cluster the following data set.

Point   X   Y
1       7   6
2       2   6
3       3   8
4       8   5
5       7   4
6       4   7
7       6   2
8       7   3
9       6   4
10      3   4

Initialize two cluster representatives from the given data set. Let us choose data point 5 as the first medoid, K1 = (7, 4), and data point 10 as the second medoid, K2 = (3, 4). Suppose we use the Manhattan distance metric as the distance measure.
Now, calculate the distance of each data point from each cluster representative.

J    Data obj   D1 = |x-7|+|y-4|   D2 = |x-3|+|y-4|   Cost
1    (7, 6)     0+2 = 2            4+2 = 6            C1 = 2
2    (2, 6)     7                  3                  C2 = 3
3    (3, 8)     8                  4                  C2 = 4
4    (8, 5)     2                  6                  C1 = 2
5    (7, 4)     0                  4+0 = 4            C1 = 0
6    (4, 7)     6                  4                  C2 = 4
7    (6, 2)     3                  5                  C1 = 3
8    (7, 3)     1                  5                  C1 = 1
9    (6, 4)     1                  3                  C1 = 1
10   (3, 4)     4+0 = 4            0                  C2 = 0

Based on the distances shown in the table above, the initial clusters are
C1 = {(7,6), (8,5), (7,4), (6,2), (7,3), (6,4)}
C2 = {(2,6), (3,8), (4,7), (3,4)}
Now calculate the total cost: sum of d(j, nearest medoid) = 2+3+4+2+0+4+3+1+1+0 = 20.
34
Now, choose some other point to be a mediod lets take data point 8 as new mediod
K3=(7, 3) and repeat the same steps as earlier we obtain the following table
J Data obj C1=(7,3) C2=(3,4) cost
D1=|x-7| +|y-3| D2=|x-3| +|y-4|

1 7, 6 3 6 C1=3
2 2, 6 8 3 C2=3
3 3, 8 9 4 C2=4
4 8, 5 3 6 C1=3
5 7, 4 1 4+0=4 C1=1
6 4, 7 7 4 C2=4
7 6, 2 2 5 C1=2
8 7, 3 0 5 C1=0
9 6, 4 2 3 C1=2
10 3, 4 5 0 C2=0
Based on the distance shown in table above our initial cluster are
C1={ (7,6), (8, 5), (7, 4), (6, 2), (7, 3), (6, 4)}
C2={(2, 6), (3, 8), (4,7), (3,4)}
Now calculating the distance Σd(j, previous mediods) = 3+3+4+3+1+4+2+0+2+0=22
 Now compute the total cost, S, of swapping the medoid (7, 4) with Orandom = (7, 3):
   Total cost S = cost with new medoid - cost with old medoid = 22 - 20 = 2.
 Here, the total cost of swapping K1 with Orandom is 2, which is greater than zero, i.e., S > 0, so K1 should not be exchanged with Orandom: the total cost when (7,3) is the medoid is greater than the total cost when (7,4) was the medoid.
 Hence (7, 4) should be kept instead of (7, 3) as the medoid. The clusters finally obtained are:
 C1 = {(7,6), (8,5), (7,4), (6,2), (7,3), (6,4)}
 C2 = {(2,6), (3,8), (4,7), (3,4)}
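The two cost totals in this example can be checked with a short sketch:

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    """Sum over all points of the Manhattan distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

data = [(7, 6), (2, 6), (3, 8), (8, 5), (7, 4),
        (4, 7), (6, 2), (7, 3), (6, 4), (3, 4)]

old = total_cost(data, [(7, 4), (3, 4)])   # 20, medoids K1 and K2
new = total_cost(data, [(7, 3), (3, 4)])   # 22, after the trial swap
s = new - old                              # swap cost S = 2 > 0: reject the swap
print(old, new, s)
```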
Hierarchical clustering
 Partitioning clustering algorithms always try to make clusters of similar size, and we have to decide the number of clusters at the beginning of the algorithm.
 Hierarchical clustering creates a hierarchy of clusters and therefore does not require a predefined number of clusters.
 A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then it repeatedly executes the following steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most comparable clusters. These steps continue until all the clusters are merged together.
 Depending on whether it works bottom-up or top-down, it is classified into two categories:
 Agglomerative hierarchical clustering
 Divisive hierarchical clustering
Agglomerative hierarchical clustering
 Agglomerative clustering is one of the most common types of hierarchical clustering used to group similar objects into clusters.
 Agglomerative clustering is also known as AGNES (Agglomerative Nesting).
 In agglomerative clustering, each data point acts as an individual cluster, and at each step data objects are grouped in a bottom-up manner.
 Initially, each data object is in its own cluster. At each iteration, the clusters are combined with other clusters until one cluster is formed.
Agglomerative Hierarchical Clustering Algorithm
(the algorithm and worked example were shown as figures in the original slides)
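An agglomerative (AGNES-style) merge loop can be sketched as follows (single linkage by default; a naive illustration, not an efficient implementation):

```python
def agglomerative(points, k, linkage=min):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the two closest ones until k clusters remain.
    `linkage=min` gives single linkage; pass `max` for complete linkage."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        # find the pair of clusters with the smallest linkage distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist2(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

out = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], 3)
print(out)
```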
Divisive hierarchical clustering Algorithm
 Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering.
 In divisive hierarchical clustering, all the data points are initially considered one single cluster, and in every iteration the data points that are not similar are separated from the cluster.
 The separated data points are treated as individual clusters. Finally, we are left with N clusters.
Divisive Hierarchical Clustering Algorithm
(shown as a figure in the original slides)
Advantages and Disadvantages of Hierarchical Clustering
(shown as a figure in the original slides)
Density-Based Clustering (DBSCAN)

 Partitioning and hierarchical methods are suitable for finding spherical-shaped clusters. They are also affected by the presence of outliers.
 DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
 It depends on a density-based notion of a cluster.
 It can also identify clusters of arbitrary shape in a spatial database with outliers.
DBSCAN
The DBSCAN algorithm requires two parameters:
eps: It defines the neighborhood around a data point, i.e., if the distance between two points is less than or equal to 'eps' then they are considered neighbors. If the eps value is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, the clusters will merge and the majority of the data points will end up in the same cluster.
MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1. The minimum value of MinPts should be at least 3.
DBSCAN
In this algorithm, there are three types of data points:
1. Core point: A point is a core point if it has at least MinPts points within its eps-neighborhood.
2. Border point: A point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
3. Noise or outlier: A point that is neither a core point nor a border point.
DBSCAN Basic terminology:
Neighborhood: The neighborhood of a data point P is the set of all data points within distance eps of P.
Directly density-reachable: A point p is directly density-reachable from a point q with respect to eps and MinPts if p belongs to the neighborhood of q and q satisfies the core-point condition.
Density-reachable: A data point p is density-reachable from a data point q with respect to eps and MinPts if there is a chain of points p1, p2, ..., pn, where p1 = q and pn = p, such that p(i+1) is directly density-reachable from pi.
 For example, a data point p is density-reachable from the data point q if there is a chain of points q, p1, p, where p1 is directly density-reachable from q and p is directly density-reachable from p1.
Density-connected: A data point p is density-connected to a point q with respect to eps and MinPts if there is a data point o such that both p and q are density-reachable from o with respect to eps and MinPts.
 For example, points p and q are density-connected if both are density-reachable from a common data point o.
DBSCAN: The Algorithm
 Arbitrarily select a point p.
 Retrieve all points density-reachable from p w.r.t. eps and MinPts.
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
 Continue the process until all of the points have been processed.
DBSCAN Example:

 If eps = 2 and MinPts = 2, what are the clusters that DBSCAN would discover in the following data set?

Data point   X   Y
A1           2   10
A2           2   5
A3           8   4
A4           5   8
A5           7   5
A6           6   4
A7           1   2
A8           4   9

Solution: To find the clusters by DBSCAN, we first calculate the distance between every pair of data points. Let us use the Euclidean distance measure.

Data   A1     A2     A3     A4     A5     A6     A7     A8
A1     0      5      6      3.6    7.07   7.21   8.06   2.23
A2     5      0      6.08   4.24   5      4.12   3.16   4.47
A3     6      6.08   0      5      1.41   1.41   7.28   6.4
A4     3.6    4.24   5      0      3.6    4.12   7.21   1.41
A5     7.07   5      1.41   3.6    0      1.41   6.7    5
A6     7.21   4.12   1.41   4.12   1.41   0      5.38   5.38
A7     8.06   3.16   7.28   7.21   6.7    5.38   0      7.61
A8     2.23   4.47   6.4    1.41   5      5.38   7.61   0

Now we find the clusters. Select point A1 first and find all data points that lie in the eps-neighborhood of A1. Since no other point has distance <= 2 from A1, the eps-neighborhood of A1 is empty.
Similarly, A2 has no other point within distance <= 2, so the eps-neighborhood of A2 is also empty.
A3 has two data points within distance <= 2, namely A5 and A6, so N(A3) = {A5, A6}.
A4 has one neighbor: N(A4) = {A8}.
A5 has two neighbors: N(A5) = {A3, A6}.
A6 has two neighbors: N(A6) = {A3, A5}.
A7 has no neighbors.
A8 has one neighbor: N(A8) = {A4}.
 Now, clusters are constructed on the basis of MinPts. Since MinPts = 2, we get points A4 and A8 within cluster C1, and A3, A5, and A6 within cluster C2.
 Since the data points A1, A2, and A7 have no neighbors, i.e., they neither belong to the neighborhood of any core point nor are core points themselves, these points are outliers.

# Practice Question
If epsilon = 2 and MinPts = 2, what are the core points, border points, and outliers that DBSCAN would find in the data set A(3,10), B(2,3), C(3,4), D(6,7), and E(7,6)?
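The worked example above can be reproduced with a DBSCAN sketch (here a point's eps-neighborhood is counted as including the point itself, which matches the clusters found above):

```python
def dbscan(points, eps, min_pts):
    """DBSCAN sketch: a point is a core point when its eps-neighborhood
    (including itself) holds at least min_pts points.
    Returns (clusters, outliers)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def neighbors(p):
        return [q for q in points if dist(p, q) <= eps]  # includes p itself

    labels = {}                      # point -> cluster id
    cid = 0
    for p in points:
        if p in labels or len(neighbors(p)) < min_pts:
            continue                 # skip labeled and non-core points
        cid += 1                     # grow a new cluster from core point p
        frontier = [p]
        while frontier:
            q = frontier.pop()
            if q in labels:
                continue
            labels[q] = cid
            if len(neighbors(q)) >= min_pts:   # expand only through core points
                frontier.extend(n for n in neighbors(q) if n not in labels)
    clusters = {}
    for p, c in labels.items():
        clusters.setdefault(c, set()).add(p)
    outliers = {p for p in points if p not in labels}
    return list(clusters.values()), outliers

data = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
clusters, outliers = dbscan(data, eps=2, min_pts=2)
print(clusters, outliers)
```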
Advantages and Dis-advantages of DBSCAN
Advantages
1) Does not require a priori specification of the number of clusters.
2) Able to identify noise data while clustering.
3) DBSCAN is able to find clusters of arbitrary size and shape.

Disadvantages
1) DBSCAN fails in the case of varying-density clusters.
2) Fails in the case of "neck"-type datasets.
3) Does not work well with high-dimensional data.
Outlier Analysis
 Outlier analysis is the process of identifying outliers, or abnormal observations, in a dataset.
 An outlier is an element of a data set that distinctly stands out from the rest of the data. In other words, outliers are data points that lie outside the overall pattern of the distribution.
 Outlier analysis is also known as outlier detection.
 Outlier analysis is an important step in data analysis, as it removes erroneous or inaccurate observations which might otherwise skew conclusions.
 Outlier detection can be used in credit card fraud detection, telecom fraud detection, customer segmentation, and medical analysis.
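As one simple illustration (my own, not from the slides), outliers can be flagged with a z-score test:

```python
def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score magnitude exceeds `threshold`,
    i.e. values that stand out from the overall distribution."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return [x for x in values if abs((x - mean) / std) > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 95]
print(zscore_outliers(data, threshold=2.0))  # [95]
```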