Clustering
U.A.NULI
1
Definitions
Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same group are more similar to one another than to data
points in other groups.
Clustering is a technique for grouping objects based on distance or similarity.
The data points that are in the same group should have similar properties and/or features,
while data points in different groups should have highly dissimilar properties and/or
features.
The clustering-based learning method is an unsupervised learning task: the learning starts
with no specific target attribute in mind, and the data is explored with the goal of finding
intrinsic structures in it.
2
The primary goal of the clustering technique is finding similar or homogeneous
groups in data; these groups are called clusters.
Data instances that are similar, that is, near to each other, are grouped in one cluster,
and instances that are different are grouped into a different cluster.
Clustering refers to the grouping of records, observations, or cases into classes of
similar objects.
A cluster is a collection of records that are similar to one another and dissimilar to records
in other clusters.
3
Clustering differs from classification in that there is no target variable for clustering.
The clustering task does not try to classify, estimate, or predict the value of a target
variable.
Instead, clustering algorithms seek to segment the entire data set into relatively
homogeneous subgroups or clusters, where the similarity of the records within the
cluster is maximized, and the similarity to records outside this cluster is minimized.
4
5
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use
this knowledge to develop targeted marketing programs.
Land use: Identification of areas of similar land use in an earth observation database.
Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost.
City-planning: Identifying groups of houses according to their house type, value, and
geographical location.
6
Main issues in clustering:
• how to measure similarity
• how to measure distance for categorical variables
• how to standardize or normalize numerical variables
• how many clusters
7
How to measure similarity
Similarity is measured using a distance metric. The most common distance metric is
Euclidean distance; other distances can also be used:
d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xm − ym)²)
where x = (x1, x2, … , xm) and y = (y1, y2, … , ym) represent the m attribute values of
two records.
8
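As a small illustration (not part of the original slides), a minimal Python sketch of the Euclidean distance between two records might look like this:

```python
import numpy as np

def euclidean_distance(x, y):
    # Euclidean distance between two records with m attribute values each
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Two records with m = 2 attributes (hypothetical values)
print(euclidean_distance([1.0, 1.0], [4.0, 5.0]))  # 5.0
```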
how to measure distance for categorical
variables
For categorical variables, we may again define a "different from" function for
comparing the ith attribute values of a pair of records:
different(xi, yi) = 0 if xi = yi, and 1 otherwise,
where xi and yi are categorical values. We may then substitute different(xi, yi) for the
ith term (xi − yi)² in the Euclidean distance metric above.
9
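A tiny illustrative sketch of this "different from" indicator in Python:

```python
def different(xi, yi):
    # "Different from" indicator for categorical values:
    # 0 when the two values match, 1 when they differ.
    return 0 if xi == yi else 1

# For a categorical attribute, different(xi, yi) replaces (xi - yi)**2
# in the Euclidean sum above.
print(different("red", "red"), different("red", "blue"))  # 0 1
```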
how to standardize or normalize numerical
variables
For optimal performance, clustering algorithms, just like algorithms for classification,
require the data to be normalized so that no particular variable or subset of variables
dominates the analysis. Analysts may use either min–max normalization or Z-score
standardization:
Min–max normalization: X* = (X − Min(X)) / Range(X)
Z-score standardization: X* = (X − Mean(X)) / SD(X)
where Range(X) = Max(X) − Min(X) and SD(X) is the standard deviation of X.
10
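The two normalizations above can be sketched in Python as follows (a minimal illustration, not from the slides):

```python
import numpy as np

def min_max_normalize(x):
    # Min-max normalization: (X - Min(X)) / Range(X), rescales X to [0, 1]
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_score_standardize(x):
    # Z-score standardization: (X - Mean(X)) / SD(X)
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

ages = np.array([20, 30, 40, 50, 60])
print(min_max_normalize(ages))    # [0.   0.25 0.5  0.75 1.  ]
print(z_score_standardize(ages))  # values with mean 0 and SD 1
```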
All clustering methods have as their
goal the identification of groups of
records such that similarity within a
group is very high while the similarity to
records in other groups is very low.
In other words, clustering algorithms
seek to construct clusters of records
such that the between-cluster
variation is large compared to the
within-cluster variation.
11
Requirements of Clustering Algorithms
Scalability − We need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to
any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of
detecting clusters of arbitrary shape. It should not be limited to distance measures that
tend to find spherical clusters of small size.
High dimensionality − The clustering algorithm should be able to handle not only low-dimensional
data but also high-dimensional data.
12
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some
algorithms are sensitive to such data and may lead to poor quality clusters.
Interpretability − The clustering results should be interpretable, comprehensible, and usable.
13
Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
14
Partition Method
Suppose we are given a database of 'n' objects and the partitioning method constructs
'k' partitions of the data. Each partition represents a cluster, and k ≤ n.
This means the method classifies the data into k groups, which satisfy the following requirements:
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method creates an initial partitioning.
• It then uses an iterative relocation technique to improve the partitioning by moving objects from
one group to another.
15
Algorithms in the Partitioning Method:
K-means clustering: each cluster is represented by the center (mean) of the
cluster.
K-medoids or PAM (Partitioning Around Medoids): each cluster is
represented by one of the objects in the cluster.
16
K-means Clustering Algorithm
K-means clustering intends to partition n objects into k clusters in which each object belongs to
the cluster with the nearest mean.
This method produces exactly k different clusters of greatest possible distinction.
The best number of clusters k, leading to the greatest separation (distance), is not known a
priori and must be computed from the data.
17
K-means algorithm
Step 1: Select the number of clusters k into which the data set should be partitioned.
Step 2: Randomly assign k records to be the initial clusters (here, the first k records are
assigned to the k clusters).
Step 3: Calculate the centroid of each cluster.
Step 4: For each record, find the nearest cluster center and add the record to that cluster.
Step 5: For each of the k clusters, find the cluster centroid, and update the location of
each cluster center to the new value of the centroid.
Step 6: Repeat steps 4 and 5 until convergence or termination (the centroids do not change).
The centroid of a cluster is the mean value of the elements in that cluster.
A Python sketch of these steps is shown below.
18
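The steps above can be sketched in Python for a one-dimensional data set (an illustrative implementation under the slide's assumptions, e.g. seeding the clusters with the first k records and assuming no cluster ever becomes empty):

```python
import numpy as np

def kmeans_1d(data, k, max_iter=100):
    # Steps 1-3: the first k records seed the clusters and act as the
    # initial centroids (as assumed in the worked example that follows).
    data = np.asarray(data, dtype=float)
    centroids = data[:k].copy()
    for _ in range(max_iter):
        # Step 4: assign each record to the nearest cluster centre
        labels = np.argmin(np.abs(data[:, None] - centroids[None, :]), axis=1)
        # Step 5: recompute each centroid as the mean of its cluster
        new_centroids = np.array([data[labels == j].mean() for j in range(k)])
        # Step 6: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans_1d([2, 5, 7, 12, 26, 30, 40, 50], k=3)
print(centroids)  # [ 3.5  9.5 36.5]
```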
K-means Clustering Example:
Dataset = {2, 5, 7, 12, 26, 30, 40, 50}
K = 3
1. Initially create three empty clusters C1, C2, and C3.
2. Add the first three elements to the clusters:
C1 = {2}, C2 = {5}, C3 = {7}
3. Find the centroid of each cluster:
C1 = {2}, centroid = 2; C2 = {5}, centroid = 5; C3 = {7}, centroid = 7
19
Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Centroids: C1 = 2, C2 = 5, C3 = 7

Element   Dist. to C1   Dist. to C2   Dist. to C3
2              0             3             5
5              3             0             2
7              5             2             0
12            10             7             5
26            24            21            19
30            28            25            23
40            38            35            33
50            48            45            43

Resulting clusters: C1 = {2}, C2 = {5}, C3 = {7, 12, 26, 30, 40, 50}
20
3. Find the centroid of each cluster:
C1 = {2}, centroid = 2; C2 = {5}, centroid = 5; C3 = {7, 12, 26, 30, 40, 50}, centroid = 27.5
21
Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Centroids: C1 = 2, C2 = 5, C3 = 27.5

Element   Dist. to C1   Dist. to C2   Dist. to C3
2              0             3          25.5
5              3             0          22.5
7              5             2          20.5
12            10             7          15.5
26            24            21           1.5
30            28            25           2.5
40            38            35          12.5
50            48            45          22.5

Resulting clusters: C1 = {2}, C2 = {5, 7, 12}, C3 = {26, 30, 40, 50}
22
3. Find the centroid of each cluster:
C1 = {2}, centroid = 2; C2 = {5, 7, 12}, centroid = 8; C3 = {26, 30, 40, 50}, centroid = 36.5
23
Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Centroids: C1 = 2, C2 = 8, C3 = 36.5

Element   Dist. to C1   Dist. to C2   Dist. to C3
2              0             6          34.5
5              3             3          31.5
7              5             1          29.5
12            10             4          24.5
26            24            18          10.5
30            28            22           6.5
40            38            32           3.5
50            48            42          13.5

Resulting clusters: C1 = {2, 5}, C2 = {7, 12}, C3 = {26, 30, 40, 50}
(Element 5 is equidistant from C1 and C2; the tie is broken in favour of C1.)
24
3. Find the centroid of each cluster:
C1 = {2, 5}, centroid = 3.5; C2 = {7, 12}, centroid = 9.5; C3 = {26, 30, 40, 50}, centroid = 36.5
25
Step 4: For each record, find the nearest cluster centre and add the record to that cluster.
Centroids: C1 = 3.5, C2 = 9.5, C3 = 36.5

Element   Dist. to C1   Dist. to C2   Dist. to C3
2            1.5           7.5          34.5
5            1.5           4.5          31.5
7            3.5           2.5          29.5
12           8.5           2.5          24.5
26          22.5          16.5          10.5
30          26.5          20.5           6.5
40          36.5          30.5           3.5
50          46.5          40.5          13.5

Resulting clusters: C1 = {2, 5}, C2 = {7, 12}, C3 = {26, 30, 40, 50}
The assignments (and hence the centroids) no longer change, so the algorithm has converged.
26
Final clusters:
C1 = {2, 5} (centroid = 3.5), C2 = {7, 12} (centroid = 9.5), C3 = {26, 30, 40, 50} (centroid = 36.5)
27
When to stop?
The clustering algorithm may terminate when some convergence criterion is met, such
as no significant shrinkage in the mean squared error (MSE):
MSE = SSE / (N − k) = [Σ over clusters i Σ over records x in cluster i d(x, mi)²] / (N − k)
where mi is the centroid of cluster i, N is the total number of records, and k is the number of clusters.
28
N = 8, k = 3
Clusters: C1 = {2} (centroid = 2), C2 = {5} (centroid = 5), C3 = {7, 12, 26, 30, 40, 50} (centroid = 27.5)
MSE = [(2-2)² + (5-5)² + (7-27.5)² + (12-27.5)² + (26-27.5)² + (30-27.5)² + (40-27.5)² + (50-27.5)²] / (8-3)
MSE = (0 + 0 + 420.25 + 240.25 + 2.25 + 6.25 + 156.25 + 506.25) / 5
MSE = 1331.5 / 5
MSE = 266.3
29
N = 8, k = 3
Clusters: C1 = {2} (centroid = 2), C2 = {5, 7, 12} (centroid = 8), C3 = {26, 30, 40, 50} (centroid = 36.5)
MSE = [(2-2)² + (5-8)² + (7-8)² + (12-8)² + (26-36.5)² + (30-36.5)² + (40-36.5)² + (50-36.5)²] / (8-3)
MSE = (0 + 9 + 1 + 16 + 110.25 + 42.25 + 12.25 + 182.25) / 5
MSE = 373 / 5
MSE = 74.6
30
N = 8, k = 3
Clusters: C1 = {2, 5} (centroid = 3.5), C2 = {7, 12} (centroid = 9.5), C3 = {26, 30, 40, 50} (centroid = 36.5)
MSE = [(2-3.5)² + (5-3.5)² + (7-9.5)² + (12-9.5)² + (26-36.5)² + (30-36.5)² + (40-36.5)² + (50-36.5)²] / (8-3)
MSE = (2.25 + 2.25 + 6.25 + 6.25 + 110.25 + 42.25 + 12.25 + 182.25) / 5
MSE = 364 / 5
MSE = 72.8
31
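A short Python check of the MSE computed above (an illustrative helper, using MSE = SSE / (N − k) as in the worked example):

```python
import numpy as np

def mse(clusters, k):
    # Within-cluster mean squared error: SSE divided by (N - k)
    n = sum(len(c) for c in clusters)
    sse = sum(((np.asarray(c, dtype=float) - np.mean(c)) ** 2).sum()
              for c in clusters)
    return sse / (n - k)

final_clusters = [[2, 5], [7, 12], [26, 30, 40, 50]]
print(mse(final_clusters, k=3))  # 72.8
```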
Cluster Quality
The clustering algorithms seek to construct clusters of records such that the between-
cluster variation is large compared to the within-cluster variation. Because this concept is
analogous to the analysis of variance, we define a pseudo-F statistic as follows:
pseudo-F = MSB / MSE, where MSB = SSB / (k − 1)
where MSE is defined as above, MSB is the mean square between, and SSB is the sum
of squares between clusters, defined as:
SSB = Σ over clusters i of ni · d(mi, M)²
where ni is the number of records in cluster i, mi
is the centroid (cluster center) for cluster i, and
M is the grand mean of all the data.
32
MSB represents the between-cluster variation and MSE represents the within-cluster
variation.
Thus, a “good” cluster would have a large value of the pseudo-F statistic, representing a
situation where the between-cluster variation is large compared to the within-cluster
variation.
Hence, as the k-means algorithm proceeds, and the quality of the clusters increases, we
would expect MSB to increase, MSE to decrease, and F to increase.
33
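A hedged Python sketch of this pseudo-F statistic, evaluated on the final clustering of the worked example (the numerical value is not given on the slides; it follows from the formulas above):

```python
import numpy as np

def pseudo_f(clusters):
    # pseudo-F = MSB / MSE, with MSB = SSB / (k - 1) and MSE = SSE / (N - k)
    k = len(clusters)
    points = np.concatenate([np.asarray(c, dtype=float) for c in clusters])
    n, grand_mean = len(points), points.mean()
    ssb = sum(len(c) * (np.mean(c) - grand_mean) ** 2 for c in clusters)
    sse = sum(((np.asarray(c, dtype=float) - np.mean(c)) ** 2).sum()
              for c in clusters)
    return (ssb / (k - 1)) / (sse / (n - k))

print(round(pseudo_f([[2, 5], [7, 12], [26, 30, 40, 50]]), 2))  # about 12.61
```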
K-means Clustering summary
Advantages:
• Simple and understandable
• Items are automatically assigned to clusters
Disadvantages:
• Must pick the number of clusters beforehand
• Often terminates at a local optimum
• All items are forced into a cluster
• Too sensitive to outliers
34
K-medoid Algorithm
Medoids are representative objects of a data set, or of a cluster within a data set, whose average
dissimilarity to all the objects in the cluster is minimal.
Medoids are similar in concept to means or centroids, but medoids are always restricted to be
members of the data set.
Medoids are most commonly used on data where a mean or centroid cannot be defined, such as
graphs.
A medoid of a finite data set is a data point from this set whose average dissimilarity to all the data
points is minimal, i.e. it is the most centrally located point in the set.
35
Mathematical Formulation for K-means
36
D = {x1, x2, …, xi, …, xm}: a data set of m records
xi = (xi1, xi2, …, xin): each record is an n-dimensional vector
37
Finding Cluster Centres that Minimize Distortion:
Solution can be found by setting the partial derivative of Distortion w.r.t. each cluster centre
to zero.
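The distortion formula and the derivative condition are not reproduced here; a sketch of both, reconstructed from the surrounding text for the Euclidean case, is:

```latex
\text{Distortion} \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2,
\qquad
\frac{\partial\,\text{Distortion}}{\partial c_j}
  \;=\; -2 \sum_{x_i \in C_j} (x_i - c_j) \;=\; 0
\;\;\Longrightarrow\;\;
c_j \;=\; \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i .
```

That is, the cluster centre that minimizes the distortion is the mean (centroid) of the records assigned to that cluster.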
38
To choose k, examine how the distortion changes as k increases. The value of k should be such
that increasing it further produces little or no reduction in distortion; the point where the
distortion curve flattens is called the "elbow".
This is the ideal value of k for the clusters created.
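A minimal elbow-method sketch using scikit-learn's KMeans (assumed to be available), applied to the data set from the worked example:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([2, 5, 7, 12, 26, 30, 40, 50], dtype=float).reshape(-1, 1)
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the total within-cluster squared distance (the distortion)
    print(k, round(km.inertia_, 2))
# Plotting distortion against k, the point where the curve flattens
# (the "elbow") suggests a reasonable choice of k.
```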
Hierarchical Methods:
This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −
• Agglomerative Approach
• Divisive Approach
In hierarchical clustering, a treelike cluster structure (dendrogram) is created through
recursive partitioning (divisive methods) or combining (agglomerative) of existing
clusters.
39
In hierarchical clustering, we organize the objects into a hierarchy similar to a tree-like
diagram, which is called a dendrogram.
Dendrogram:
The standard output of hierarchical clustering is a
dendrogram.
A dendrogram is a cluster tree diagram in which the distance of each
split or merge is recorded.
A dendrogram is a visualization of hierarchical clustering.
40
Using the dendrogram, we can easily specify the cutting point to determine the number of
clusters. For example, in the left dendrogram below, we set the cutting distance at 2 and we
obtain two clusters out of 6 objects. The first cluster consists of 4 objects (numbers 4, 6, 5,
and 3) and the second cluster consists of two objects (numbers 1 and 2). Similarly, in the
right dendrogram, setting the cutting distance at 1.2 will produce 3 clusters.
41
• Agglomerative Approach
This approach is also known as the bottom-up approach. We start
with each object forming a separate cluster, and keep merging the
objects or clusters that are closest to one another. This continues until
all of the clusters are merged into one or until the termination condition
holds.
• Divisive Approach
This approach is also known as the top-down approach. We start
with all of the objects in the same cluster. In each successive iteration, a
cluster is split into smaller clusters. This continues until each object is in its own
cluster or the termination condition holds. This method is rigid, i.e., once a
merging or splitting is done, it can never be undone.
42
Steps for Hierarchical Clustering –
Agglomerative approach
1. Compute the distance matrix from the object features.
2. Set each object as an independent cluster (if there are 5 objects, then there will be
5 clusters).
3. Iterate until the number of clusters is equal to 1:
A. Merge the two closest clusters.
B. Update the distance matrix.
43
Example:
Assume we have six objects A, B, C, D, E, and F, each having two attributes X1 and X2.
The distance between two objects is calculated using the Euclidean distance formula on
their attributes X1 and X2.
For example, the distance between A and B can be calculated as:
d(A, B) = √((1 − 1.5)² + (1 − 1.5)²) = 0.71
44
Object X1 X2
A 1 1
B 1.5 1.5
C 5 5
D 3 4
E 4 4
F 3 3.5
45
46
Distance matrix
Here objects/clusters D and F are closest (distance 0.5), hence these two clusters will be merged first.
47
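Since the distance-matrix figure is not reproduced here, a short SciPy sketch (assumed available) recomputes it from the object table above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

names = ["A", "B", "C", "D", "E", "F"]
X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])
dist_matrix = squareform(pdist(X, metric="euclidean"))
print(np.round(dist_matrix, 2))
# The smallest off-diagonal entry is d(D, F) = 0.5, so D and F merge first.
```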
How do we calculate the distance between the new cluster (D, F) and the other clusters A, B, C, and E?
48
Linkages between Objects
The rules of hierarchical clustering lie in how objects should be grouped
into clusters. Given a distance matrix, linkages between objects can be
computed through a criterion that defines the distance between groups.
The most common and basic criteria are:
1. Single Linkage: minimum distance criterion
49
2. Complete Linkage: maximum distance criterion
50
3. Average Group: average distance criterion
51
4. Centroid distance criterion
52
Using single linkage (Minimum
Distance Approach)
53
Now the distance between A and B is the minimum, so they can be grouped together to form the (A, B) cluster
54
55
Similarly, the other distances can be calculated
56
The next minimum distance is between E and (D,F), so they can be grouped to form the ((D,F),E) cluster
57
Group ((D,F),E) and C to form a single cluster
58
59
60
The results of the computation are summarized as follows:
1. In the beginning we have 6 clusters: A, B, C, D, E, and F.
2. We merge clusters D and F into cluster (D, F) at distance 0.50.
3. We merge cluster A and cluster B into (A, B) at distance 0.71.
4. We merge cluster E and (D, F) into ((D, F), E) at distance 1.00.
5. We merge cluster ((D, F), E) and C into (((D, F), E), C) at distance 1.41.
6. We merge cluster (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50.
7. The last cluster contains all the objects, which concludes the computation.
61
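This sequence of merges can be reproduced with SciPy's single-linkage clustering (an illustrative sketch, assuming SciPy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])
Z = linkage(pdist(X), method="single")   # single linkage = minimum distance
print(np.round(Z, 2))  # each row: the two clusters merged, merge distance, size
# dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])  # draws the cluster tree
```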
The final dendrogram
62
How do we determine the distance between clusters of
records?
There are several criteria for determining distance between arbitrary clusters A and B:
Single linkage:
Single linkage, sometimes termed the nearest-neighbour approach, is based on
the minimum distance between any record in cluster A and any record in cluster B.
In other words, cluster similarity is based on the similarity of the most similar
members from each cluster.
Single linkage tends to form long, slender clusters, which may sometimes lead to
heterogeneous records being clustered together.
63
Complete linkage:
Complete linkage, sometimes termed the farthest-neighbor approach, is based
on the maximum distance between any record in cluster A and any record in
cluster B.
In other words, cluster similarity is based on the similarity of the most dissimilar members from each
cluster.
Complete linkage tends to form more compact, spherelike clusters.
64
65