Clustering
U.A.NULI
1
Definitions
Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same group are more similar to one another than to data
points in other groups.
Clustering is a technique for grouping objects based on distance or similarity.
The data points that are in the same group should have similar properties and/or features,
while data points in different groups should have highly dissimilar properties and/or
features.
The clustering-based learning method is an unsupervised learning task: the learning starts
with no specific target attribute in mind, and the data is explored with the goal of finding
intrinsic structures in it.
2
The primary goal of the clustering technique is finding similar or homogeneous
groups in data; these groups are called clusters.
Data instances that are similar, that is, near to each other, are grouped in one cluster,
and instances that are different are grouped into a different cluster.
Clustering refers to the grouping of records, observations, or cases into classes of
similar objects.
A cluster is a collection of records that are similar to one another and dissimilar to records
in other clusters.
3
Clustering differs from classification in that there is no target variable for clustering.
The clustering task does not try to classify, estimate, or predict the value of a target
variable.
Instead, clustering algorithms seek to segment the entire data set into relatively
homogeneous subgroups or clusters, where the similarity of the records within the
cluster is maximized, and the similarity to records outside this cluster is minimized.
4
5
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use
this knowledge to develop targeted marketing programs.
Land use: Identification of areas of similar land use in an earth observation database.
Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost.
City-planning: Identifying groups of houses according to their house type, value, and
geographical location.
6
Main issues in clustering:
• how to measure similarity
• how to measure distance for categorical variables
• how to standardize or normalize numerical variables
• how many clusters
7
How to measure similarity
Similarity is measured using a distance metric. The most common distance metric is
Euclidean distance; other distances can also be used:
d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xm − ym)²)
where x = (x1, x2, … , xm) and y = (y1, y2, … , ym) represent the m attribute values of
two records.
8
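As a small illustration (not part of the original slides), a minimal Python sketch of the Euclidean distance between two records might look like this:

```python
import numpy as np

def euclidean_distance(x, y):
    # Euclidean distance between two records with m attribute values each
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Two records with m = 2 attributes (hypothetical values)
print(euclidean_distance([1.0, 1.0], [4.0, 5.0]))  # 5.0
```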
how to measure distance for categorical
variables
For categorical variables, we may again define a "different from" function for
comparing the ith attribute values of a pair of records:
different(xi, yi) = 0 if xi = yi, and 1 otherwise,
where xi and yi are categorical values. We may then substitute different(xi, yi) for the
ith term (xi − yi)² in the Euclidean distance metric above.
9
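A tiny illustrative sketch of this "different from" indicator in Python:

```python
def different(xi, yi):
    # "Different from" indicator for categorical values:
    # 0 when the two values match, 1 when they differ.
    return 0 if xi == yi else 1

# For a categorical attribute, different(xi, yi) replaces (xi - yi)**2
# in the Euclidean sum above.
print(different("red", "red"), different("red", "blue"))  # 0 1
```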
how to standardize or normalize numerical
variables
For optimal performance, clustering algorithms, just like algorithms for classification,
require the data to be normalized so that no particular variable or subset of variables
dominates the analysis. Analysts may use either min–max normalization or Z-score
standardization:
Min–max normalization: X* = (X − Min(X)) / Range(X)
Z-score standardization: X* = (X − Mean(X)) / SD(X)
where Range(X) = Max(X) − Min(X) and SD(X) is the standard deviation of X.
10
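The two normalizations above can be sketched in Python as follows (a minimal illustration, not from the slides):

```python
import numpy as np

def min_max_normalize(x):
    # Min-max normalization: (X - Min(X)) / Range(X), rescales X to [0, 1]
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_score_standardize(x):
    # Z-score standardization: (X - Mean(X)) / SD(X)
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

ages = np.array([20, 30, 40, 50, 60])
print(min_max_normalize(ages))    # [0.   0.25 0.5  0.75 1.  ]
print(z_score_standardize(ages))  # values with mean 0 and SD 1
```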
All clustering methods have as their
goal the identification of groups of
records such that similarity within a
group is very high while the similarity to
records in other groups is very low.
In other words, clustering algorithms
seek to construct clusters of records
such that the between-cluster
variation is large compared to the
within-cluster variation.
11
Requirements of Clustering Algorithms
Scalability − We need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to
any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of
detecting clusters of arbitrary shape. It should not be limited to distance measures that
tend to find spherical clusters of small size.
High dimensionality − The clustering algorithm should be able to handle not only low-dimensional
data but also high-dimensional data.
12
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some
algorithms are sensitive to such data and may lead to poor quality clusters.
Interpretability − The clustering results should be interpretable, comprehensible, and usable.
13
Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
14
Partition Method
Suppose we are given a database of 'n' objects and the partitioning method constructs
'k' partitions of the data. Each partition represents a cluster, and k ≤ n.
This means the method classifies the data into k groups, which satisfy the following requirements:
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method creates an initial partitioning.
• It then uses an iterative relocation technique to improve the partitioning by moving objects from
one group to another.
15
Algorithms in the Partitioning Method:
K-means clustering: each cluster is represented by the center (mean) of the
cluster.
K-medoids or PAM (Partitioning Around Medoids): each cluster is
represented by one of the objects in the cluster.
16
K-means Clustering Algorithm
K-means clustering intends to partition n objects into k clusters in which each object belongs to
the cluster with the nearest mean.
This method produces exactly k different clusters of greatest possible distinction.
The best number of clusters k, leading to the greatest separation (distance), is not known a
priori and must be computed from the data.
17
K-means algorithm
Step 1: Select the number of clusters k into which the data set should be partitioned.
Step 2: Randomly assign k records to be the initial clusters (here, the first k records are
assigned to the k clusters).
Step 3: Calculate the centroid of each cluster.
Step 4: For each record, find the nearest cluster center and add the record to that cluster.
Step 5: For each of the k clusters, find the cluster centroid, and update the location of
each cluster center to the new value of the centroid.
Step 6: Repeat steps 4 and 5 until convergence or termination (the centroids do not change).
The centroid of a cluster is the mean value of the elements in that cluster.
A Python sketch of these steps is shown below.
18
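The steps above can be sketched in Python for a one-dimensional data set (an illustrative implementation under the slide's assumptions, e.g. seeding the clusters with the first k records and assuming no cluster ever becomes empty):

```python
import numpy as np

def kmeans_1d(data, k, max_iter=100):
    # Steps 1-3: the first k records seed the clusters and act as the
    # initial centroids (as assumed in the worked example that follows).
    data = np.asarray(data, dtype=float)
    centroids = data[:k].copy()
    for _ in range(max_iter):
        # Step 4: assign each record to the nearest cluster centre
        labels = np.argmin(np.abs(data[:, None] - centroids[None, :]), axis=1)
        # Step 5: recompute each centroid as the mean of its cluster
        new_centroids = np.array([data[labels == j].mean() for j in range(k)])
        # Step 6: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans_1d([2, 5, 7, 12, 26, 30, 40, 50], k=3)
print(centroids)  # [ 3.5  9.5 36.5]
```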
K-means Clustering Example:
Dataset = {2, 5, 7, 12, 26, 30, 40, 50}
K = 3
1. Initially create three empty clusters C1, C2, and C3.
2. Add the first three elements to the clusters:
C1 = {2}, C2 = {5}, C3 = {7}
3. Find the centroid of each cluster:
C1 = {2}, centroid = 2; C2 = {5}, centroid = 5; C3 = {7}, centroid = 7
19
Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Centroids: C1 = 2, C2 = 5, C3 = 7

Element   Dist. to C1   Dist. to C2   Dist. to C3
2              0             3             5
5              3             0             2
7              5             2             0
12            10             7             5
26            24            21            19
30            28            25            23
40            38            35            33
50            48            45            43

Resulting clusters: C1 = {2}, C2 = {5}, C3 = {7, 12, 26, 30, 40, 50}
20
3. Find the centroid of each cluster:
C1 = {2}, centroid = 2; C2 = {5}, centroid = 5; C3 = {7, 12, 26, 30, 40, 50}, centroid = 27.5
21
Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Centroids: C1 = 2, C2 = 5, C3 = 27.5

Element   Dist. to C1   Dist. to C2   Dist. to C3
2              0             3          25.5
5              3             0          22.5
7              5             2          20.5
12            10             7          15.5
26            24            21           1.5
30            28            25           2.5
40            38            35          12.5
50            48            45          22.5

Resulting clusters: C1 = {2}, C2 = {5, 7, 12}, C3 = {26, 30, 40, 50}
22
3. Find the centroid of each cluster:
C1 = {2}, centroid = 2; C2 = {5, 7, 12}, centroid = 8; C3 = {26, 30, 40, 50}, centroid = 36.5
23
Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Centroids: C1 = 2, C2 = 8, C3 = 36.5

Element   Dist. to C1   Dist. to C2   Dist. to C3
2              0             6          34.5
5              3             3          31.5
7              5             1          29.5
12            10             4          24.5
26            24            18          10.5
30            28            22           6.5
40            38            32           3.5
50            48            42          13.5

Resulting clusters: C1 = {2, 5}, C2 = {7, 12}, C3 = {26, 30, 40, 50}
(Element 5 is equidistant from C1 and C2; the tie is broken in favour of C1.)
24
3. Find the centroid of each cluster:
C1 = {2, 5}, centroid = 3.5; C2 = {7, 12}, centroid = 9.5; C3 = {26, 30, 40, 50}, centroid = 36.5
25
Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Centroids: C1 = 3.5, C2 = 9.5, C3 = 36.5

Element   Dist. to C1   Dist. to C2   Dist. to C3
2            1.5           7.5          34.5
5            1.5           4.5          31.5
7            3.5           2.5          29.5
12           8.5           2.5          24.5
26          22.5          16.5          10.5
30          26.5          20.5           6.5
40          36.5          30.5           3.5
50          46.5          40.5          13.5

Resulting clusters: C1 = {2, 5}, C2 = {7, 12}, C3 = {26, 30, 40, 50}
The assignments (and hence the centroids) no longer change, so the algorithm has converged.
26
Final clusters:
C1 = {2, 5} (centroid = 3.5), C2 = {7, 12} (centroid = 9.5), C3 = {26, 30, 40, 50} (centroid = 36.5)
27
When to stop?
The clustering algorithm may terminate when some convergence criterion is met, such
as no significant shrinkage in the mean squared error (MSE):
MSE = SSE / (N − k) = [Σ over clusters i Σ over records x in cluster i d(x, mi)²] / (N − k)
where mi is the centroid of cluster i, N is the total number of records, and k is the number of clusters.
28
N = 8, k = 3
Clusters: C1 = {2} (centroid = 2), C2 = {5} (centroid = 5), C3 = {7, 12, 26, 30, 40, 50} (centroid = 27.5)
MSE = [(2-2)² + (5-5)² + (7-27.5)² + (12-27.5)² + (26-27.5)² + (30-27.5)² + (40-27.5)² + (50-27.5)²] / (8-3)
MSE = (0 + 0 + 420.25 + 240.25 + 2.25 + 6.25 + 156.25 + 506.25) / 5
MSE = 1331.5 / 5
MSE = 266.3
29
N = 8, k = 3
Clusters: C1 = {2} (centroid = 2), C2 = {5, 7, 12} (centroid = 8), C3 = {26, 30, 40, 50} (centroid = 36.5)
MSE = [(2-2)² + (5-8)² + (7-8)² + (12-8)² + (26-36.5)² + (30-36.5)² + (40-36.5)² + (50-36.5)²] / (8-3)
MSE = (0 + 9 + 1 + 16 + 110.25 + 42.25 + 12.25 + 182.25) / 5
MSE = 373 / 5
MSE = 74.6
30
N = 8, k = 3
Clusters: C1 = {2, 5} (centroid = 3.5), C2 = {7, 12} (centroid = 9.5), C3 = {26, 30, 40, 50} (centroid = 36.5)
MSE = [(2-3.5)² + (5-3.5)² + (7-9.5)² + (12-9.5)² + (26-36.5)² + (30-36.5)² + (40-36.5)² + (50-36.5)²] / (8-3)
MSE = (2.25 + 2.25 + 6.25 + 6.25 + 110.25 + 42.25 + 12.25 + 182.25) / 5
MSE = 364 / 5
MSE = 72.8
31
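A short Python check of the MSE computed above (an illustrative helper, using MSE = SSE / (N − k) as in the worked example):

```python
import numpy as np

def mse(clusters, k):
    # Within-cluster mean squared error: SSE divided by (N - k)
    n = sum(len(c) for c in clusters)
    sse = sum(((np.asarray(c, dtype=float) - np.mean(c)) ** 2).sum()
              for c in clusters)
    return sse / (n - k)

final_clusters = [[2, 5], [7, 12], [26, 30, 40, 50]]
print(mse(final_clusters, k=3))  # 72.8
```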
Cluster Quality
The clustering algorithms seek to construct clusters of records such that the between-
cluster variation is large compared to the within-cluster variation. Because this concept is
analogous to the analysis of variance, we define a pseudo-F statistic as follows:
pseudo-F = MSB / MSE, where MSB = SSB / (k − 1)
where MSE is defined as above, MSB is the mean square between, and SSB is the sum
of squares between clusters, defined as:
SSB = Σ over clusters i of ni · d(mi, M)²
where ni is the number of records in cluster i, mi
is the centroid (cluster center) for cluster i, and
M is the grand mean of all the data.
32
MSB represents the between-cluster variation and MSE represents the within-cluster
variation.
Thus, a “good” cluster would have a large value of the pseudo-F statistic, representing a
situation where the between-cluster variation is large compared to the within-cluster
variation.
Hence, as the k-means algorithm proceeds, and the quality of the clusters increases, we
would expect MSB to increase, MSE to decrease, and F to increase.
33
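A hedged Python sketch of this pseudo-F statistic, evaluated on the final clustering of the worked example (the numerical value is not given on the slides; it follows from the formulas above):

```python
import numpy as np

def pseudo_f(clusters):
    # pseudo-F = MSB / MSE, with MSB = SSB / (k - 1) and MSE = SSE / (N - k)
    k = len(clusters)
    points = np.concatenate([np.asarray(c, dtype=float) for c in clusters])
    n, grand_mean = len(points), points.mean()
    ssb = sum(len(c) * (np.mean(c) - grand_mean) ** 2 for c in clusters)
    sse = sum(((np.asarray(c, dtype=float) - np.mean(c)) ** 2).sum()
              for c in clusters)
    return (ssb / (k - 1)) / (sse / (n - k))

print(round(pseudo_f([[2, 5], [7, 12], [26, 30, 40, 50]]), 2))  # about 12.61
```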
K-means Clustering summary
Advantages:
• Simple and understandable
• Items are automatically assigned to clusters
Disadvantages:
• Must pick the number of clusters beforehand
• Often terminates at a local optimum
• All items are forced into a cluster
• Too sensitive to outliers
34
K-medoid Algorithm
Medoids are representative objects of a data set, or of a cluster within a data set, whose average
dissimilarity to all the objects in the cluster is minimal.
Medoids are similar in concept to means or centroids, but medoids are always restricted to be
members of the data set.
Medoids are most commonly used on data where a mean or centroid cannot be defined, such as
graphs.
A medoid of a finite data set is a data point from this set whose average dissimilarity to all the data
points is minimal, i.e. it is the most centrally located point in the set.
35
Mathematical Formulation for K-means
36
D = {x1, x2, …, xi, …, xm}: a data set of m records
xi = (xi1, xi2, …, xin): each record is an n-dimensional vector
37
Finding Cluster Centres that Minimize Distortion:
Solution can be found by setting the partial derivative of Distortion w.r.t. each cluster centre
to zero.
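The distortion formula and the derivative condition are not reproduced here; a sketch of both, reconstructed from the surrounding text for the Euclidean case, is:

```latex
\text{Distortion} \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2,
\qquad
\frac{\partial\,\text{Distortion}}{\partial c_j}
  \;=\; -2 \sum_{x_i \in C_j} (x_i - c_j) \;=\; 0
\;\;\Longrightarrow\;\;
c_j \;=\; \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i .
```

That is, the cluster centre that minimizes the distortion is the mean (centroid) of the records assigned to that cluster.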
38
To choose k, examine how the distortion changes as k increases. The value of k should be such
that increasing it further produces little or no reduction in distortion; the point where the
distortion curve flattens is called the "elbow".
This is the ideal value of k for the clusters created.
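A minimal elbow-method sketch using scikit-learn's KMeans (assumed to be available), applied to the data set from the worked example:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([2, 5, 7, 12, 26, 30, 40, 50], dtype=float).reshape(-1, 1)
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the total within-cluster squared distance (the distortion)
    print(k, round(km.inertia_, 2))
# Plotting distortion against k, the point where the curve flattens
# (the "elbow") suggests a reasonable choice of k.
```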
Hierarchical Methods:
This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −
• Agglomerative Approach
• Divisive Approach
In hierarchical clustering, a treelike cluster structure (dendrogram) is created through
recursive partitioning (divisive methods) or combining (agglomerative) of existing
clusters.
39
In hierarchical clustering, we organize the objects into a hierarchy similar to a tree-like
diagram, which is called a dendrogram.
Dendrogram:
The standard output of hierarchical clustering is a
dendrogram.
A dendrogram is a cluster tree diagram in which the distance of each
split or merge is recorded.
A dendrogram is a visualization of hierarchical clustering.
40
Using the dendrogram, we can easily specify the cutting point to determine the number of
clusters. For example, in the left dendrogram below, we set the cutting distance at 2 and we
obtain two clusters out of 6 objects. The first cluster consists of 4 objects (numbers 4, 6, 5,
and 3) and the second cluster consists of two objects (numbers 1 and 2). Similarly, in the
right dendrogram, setting the cutting distance at 1.2 will produce 3 clusters.
41
• Agglomerative Approach
This approach is also known as the bottom-up approach. We start
with each object forming a separate cluster, and keep merging the
objects or clusters that are closest to one another. This continues until
all of the clusters are merged into one or until the termination condition
holds.
• Divisive Approach
This approach is also known as the top-down approach. We start
with all of the objects in the same cluster. In each successive iteration, a
cluster is split into smaller clusters. This continues until each object is in its own
cluster or the termination condition holds. This method is rigid, i.e., once a
merging or splitting is done, it can never be undone.
42
Steps for Hierarchical Clustering –
Agglomerative approach
1. Compute the distance matrix from the object features.
2. Set each object as an independent cluster (if there are 5 objects, then there will be
5 clusters).
3. Iterate until the number of clusters is equal to 1:
A. Merge the two closest clusters.
B. Update the distance matrix.
43
Example:
Assume we have six objects A, B, C, D, E, and F, each having two attributes X1 and X2.
The distance between two objects is calculated using the Euclidean distance formula on
their attributes X1 and X2.
For example, the distance between A and B can be calculated as:
d(A, B) = √((1 − 1.5)² + (1 − 1.5)²) = 0.71
44
Object X1 X2
A 1 1
B 1.5 1.5
C 5 5
D 3 4
E 4 4
F 3 3.5
45
46
Distance matrix
Here objects/clusters D and F are closest (distance 0.5), hence these two clusters will be merged first.
47
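Since the distance-matrix figure is not reproduced here, a short SciPy sketch (assumed available) recomputes it from the object table above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

names = ["A", "B", "C", "D", "E", "F"]
X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])
dist_matrix = squareform(pdist(X, metric="euclidean"))
print(np.round(dist_matrix, 2))
# The smallest off-diagonal entry is d(D, F) = 0.5, so D and F merge first.
```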
How do we calculate the distance between the new cluster (D, F) and the other clusters A, B, C, and E?
48
Linkages between Objects
The rules of hierarchical clustering lie in how objects should be grouped
into clusters. Given a distance matrix, linkages between objects can be
computed through a criterion that defines the distance between groups.
The most common and basic criteria are:
1. Single Linkage: minimum distance criterion
49
2. Complete Linkage: maximum distance criterion
50
3. Average Group: average distance criterion
51
4. Centroid distance criterion
52
Using single linkage (Minimum
Distance Approach)
53
Now the distance between A and B is the minimum, so they can be grouped together to form the (A, B) cluster
54
55
Similarly, the other distances can be calculated
56
The next minimum distance is between E and (D,F), so they can be grouped to form the ((D,F),E) cluster
57
Group ((D,F),E) and C to form a single cluster
58
59
60
The results of the computation are summarized as follows:
1. In the beginning we have 6 clusters: A, B, C, D, E, and F.
2. We merge clusters D and F into cluster (D, F) at distance 0.50.
3. We merge cluster A and cluster B into (A, B) at distance 0.71.
4. We merge cluster E and (D, F) into ((D, F), E) at distance 1.00.
5. We merge cluster ((D, F), E) and C into (((D, F), E), C) at distance 1.41.
6. We merge cluster (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50.
7. The last cluster contains all the objects, which concludes the computation.
61
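This sequence of merges can be reproduced with SciPy's single-linkage clustering (an illustrative sketch, assuming SciPy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])
Z = linkage(pdist(X), method="single")   # single linkage = minimum distance
print(np.round(Z, 2))  # each row: the two clusters merged, merge distance, size
# dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])  # draws the cluster tree
```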
The final dendrogram
62
How do we determine the distance between clusters of
records?
There are several criteria for determining distance between arbitrary clusters A and B:
Single linkage:
Single linkage, sometimes termed the nearest-neighbour approach, is based on
the minimum distance between any record in cluster A and any record in cluster B.
In other words, cluster similarity is based on the similarity of the most similar
members from each cluster.
Single linkage tends to form long, slender clusters, which may sometimes lead to
heterogeneous records being clustered together.
63
Complete linkage:
Complete linkage, sometimes termed the farthest-neighbor approach, is based
on the maximum distance between any record in cluster A and any record in
cluster B.
In other words, cluster similarity is based on the similarity of the most dissimilar members from each
cluster.
Complete linkage tends to form more compact, spherelike clusters.
64
65