
JASC: Journal of Applied Science and Computations ISSN NO: 1076-5131

Performance Measures on Cluster Analysis


Dr. G. Abel Thangaraja
Assistant Professor
Department of Computer Science,
SNMV College of Arts & Science,
Coimbatore.
[email protected]
Abstract—Cluster analysis is used as a key concept for identifying and selecting vertices as well as clusters. General problems of clustering require greater computing power and have recently been analysed. There are appealing and advanced methods to evaluate clustering data, and several criteria which isolate remarkable aspects of the performance of clustering. In this paper, three different measures are discussed. First, quality metrics are evaluated based on intra- and inter-cluster densities and random walk techniques. Secondly, the similarity of clusters is measured using the edges linking clusters which belong to different regions. Finally, the strength of the cluster structure is studied based on the number of edges.

Keywords—Clustering, quality, similarity, strength.

I. INTRODUCTION

Graph clustering is a prime technique for understanding the relations among vertices in a graph. Clustering is the task of assigning a set of vertices to categories so that vertices belonging to the same cluster are more similar to each other than to those in other clusters. This idea is basic to various research fields such as statistics, data analysis, bioinformatics and image processing.

In the early twentieth century, researchers discussed classical clustering methods such as connectivity clustering, centroid clustering, density clustering and so on. Benchmark graphs play a vital role in evaluating and comparing clustering algorithms: clustering algorithms on benchmarks have used a single metric to measure the amount of 'gold standard' information recovered by each algorithm. Clustering algorithms can be classified into partitioning methods, hierarchical methods, density-based methods and grid-based methods. Cluster analysis is the organization of a collection of patterns into clusters based on similarity. In contrast to discriminant analysis, clustering is useful in exploratory pattern analysis, grouping, decision-making and machine-learning situations. In the literature, there are many applications of clustering; a few of them are part-family formation for group technology, image segmentation, information retrieval, web-page grouping, market segmentation and engineering analysis.

There are many real-life examples of the formation of vertices, edges and clusters: citizens migrate from one region to other regions, and students from villages or towns travel to different institutions.

In clustering analysis, the selection and measurement of various metrics is a great task, so the metrics must first be selected along with a corresponding method to measure them. Suppose there are some clusters in different clusterings; the similarity of the clusters is then computed through their vertices and edges. Based on the number of edges connecting one vertex to other vertices in the same cluster as well as in different clusters, a parameter is estimated which indicates the strength of the cluster structure. The related probabilities are also computed and justified.

II. LITERATURE REVIEW

Benson et al. (2016) have utilized graph data in many scientific areas such as social network analysis, bioinformatics and computer network analysis. The size of graph data has grown dramatically; a social network graph generally has millions of vertices and hundreds of millions of edges.

Scott Emmons et al. (2016) have studied the relation between cluster quality metrics and information recovery metrics through four specific network clustering algorithms. They have considered the quality metrics of modularity, conductance and coverage along with synthetic graphs and empirical data sets of large sizes.

Honglei Zhang et al. (2016) have proposed a random-walk-based graph clustering algorithm, namely the limited random walk algorithm. The proposed method can tackle

Volume VI, Issue VI, JUNE/2019 Page No:1942



the computational complexity using a parallel programming scheme, and it has been applied to both global and local graph clustering. They have analysed simulated and real graph data.

Xin et al. (2016) have discussed random-walk-based methods that use the Markov chain model to analyse the graph. In this regard, vertices and edges indicate the states and the transitions between states respectively, and the graph structure represents the probabilities.

Chung and Kempton (2013) have analysed big graph data, in which the concept becomes more challenging; researchers are interested in finding the cluster for a seed vertex, which is known as the "local clustering problem".

Spielman and Teng (2008) have analysed the graph conductance measurement as the fitness function.

Andersen et al. (2007) have introduced random walks to find important vertices around the seed vertex. Random walk methods have gained great attention for local graph clustering problems, since the walk starts from the seed vertex.

Schaeffer (2007) has studied graph clustering algorithms which reveal the heterogeneity and the relations among vertices.

Newman (2004) has defined the modularity measurement based on the probability of a connection between any two vertices. He has applied a greedy search method to optimize this modularity function over the clusters of a graph.

Girvan and Newman (2002) have given much attention to the performance of algorithms on benchmark graphs for evaluating and comparing network clustering algorithms. Benchmark graphs are synthetic graphs which reveal that the best and worst clustering algorithms recover the most and the least information respectively.

Rand (1971) has proposed several criteria which isolate specific aspects of the performance of a method, including its stability and its sensitivity to re-sampling in the light of new data. These criteria are based on the similarity between two different clusterings of the same set of data.

Lance and Williams (1967) have analysed the general problems of clustering, including the necessity and availability of huge computing power.

Friedman and Rubin (1967) have studied theoretical findings toward the investigation and development of specific methods in certain situations.

In this paper, different concepts are analysed in the light of graph clustering. At first, quality metrics are identified based on mathematical measurements obtained through a stochastic random walk model. The similarities of clusters are then discussed and computed using the number of edges connecting any two clusters which belong to different regions. Finally, the strength of the cluster structure is obtained, and the probability values are computed for pairs of vertices belonging to the same cluster and to different clusters.

III. QUALITY METRICS ON CLUSTERS

In a network, a cluster is defined as a set of densely connected vertices which are also connected to other clusters in the graph. There is a variety of metrics for evaluating the quality of clustering based on intra-cluster and inter-cluster densities. From this list, only three metrics are considered in this study, namely modularity, conductance and coverage. Here we discuss these three metrics mathematically along with a real-life example.

A. Modularity

The term 'modularity' is mostly used in technology and management systems. The concept of multiple modules linking and combining to establish a complete system is known as modularity: modularity is the degree to which the components of a system may be separated and recombined. Its meaning is stated as follows:

► In modular programming, modularity refers to the compartmentalization and inter-relation of software concepts.
► In software design, modularity refers to a logical partitioning of that design.

In short, modularity provides greater software development manageability.

Modularity is a most useful metric which compares the presence of each intra-cluster edge of the graph with the probability that the edge would exist in a random graph. Some popular clustering algorithms use modularity as an objective function. Modularity is mathematically defined as

m = Σ_i (e_ii − a_i²)    (1)

where e_ii is the probability of intra-cluster edges in cluster C_i




and a_i is the probability of either an intra-cluster edge in cluster C_i or an inter-cluster edge incident on cluster C_i.

In other words, e_ii and a_i can be stated empirically as follows:

e_ii is the ratio of the favorable number of forward and backward edges within a cluster to the total number of edges in the graph.

a_i is the ratio of the favorable number of forward and backward edges within a cluster, plus the number of inter-cluster edges linking it with neighboring clusters, to the total number of edges in the graph.

B. Conductance

The conductance of a graph is often called the Cheeger constant of the graph. It measures how 'well-knit' the graph is; it controls how fast a random walk on the graph converges to the uniform distribution.

The conductance of a cluster is defined as the ratio of the number of inter-cluster edges to the minimum of either the number of edges with an end point in the cluster or the number of edges that do not have an end point in the cluster.

Let C_i (i = 1, 2, 3, …, n) be the ith cluster and

A(C_i) = Σ_{j∈C_i} Σ_{k∈V} A_jk − Σ_{j,k∈C_i} A_jk    (2)

Here, A(C_i) refers to the number of edges with an end point in C_i, and A_jk is the edge between the jth and kth vertices.

Then, the conductance of a cluster is stated as

∅(C_i) = Σ_{j∈C_i, k∉C_i} A_jk / min{A(C_i), A(C̄_i)}    (3)

Now, the conductance of a graph is defined as the average of the conductances of the clusters, subtracted from unity, i.e.,

∅(G) = 1 − (1/n) Σ_{i=1}^{n} ∅(C_i)    (4)

It is observed that the conductance of a graph always ranges from 0 to 1. It is noted that there are different ways to define the conductance of a well-clustered graph. Here, we utilize only inter-cluster conductance, but in the case of coverage the intra-cluster view is adopted. Even in recent days, the definition of conductance emphasizes the notion of inter-cluster sparsity but does not capture intra-cluster density. Almeida et al. (2011) have discussed measures of conductance for the inter-cluster as well as the intra-cluster case.

C. Coverage

Coverage is defined as the ratio of the number of intra-cluster edges in the graph to the total number of edges in the graph. It is mathematically stated as

C = Σ_{j,k} A_jk δ(C_j, C_k) / Σ_{j,k} A_jk    (5)

where C_j is the cluster to which vertex j is assigned, and δ(α, β) is 1 if α = β and 0 otherwise. The measure of coverage lies between 0 and 1; 1 is the optimal score, attained when all vertices are assigned to the same cluster, since coverage captures only the notion of intra-cluster density.

There is a variety of graphs in network analysis with which to measure the quality metrics. In this study a stochastic random walk model, otherwise called a drunken random walk, is taken to measure the metrics.

Fig. 1: Drunken random walk
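As a concrete illustration of the three metrics above, the following sketch evaluates modularity (1), graph conductance (3)-(4) and coverage (5) on a small hand-made graph. The graph, its cluster labels and the function names are illustrative assumptions, not the paper's drunken-walk data, and a_i is taken in the common convention as the fraction of edge endpoints incident on cluster C_i.

```python
# Toy undirected graph: two dense triangles joined by one bridge.
edges = [(0, 1), (0, 2), (1, 2),   # dense cluster A
         (3, 4), (3, 5), (4, 5),   # dense cluster B
         (2, 3)]                   # single inter-cluster edge
cluster = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}

def modularity(edges, cluster):
    """Eq. (1): m = sum_i (e_ii - a_i^2), with a_i the fraction of
    edge endpoints incident on cluster i."""
    m = len(edges)
    score = 0.0
    for c in set(cluster.values()):
        e_ii = sum(cluster[u] == c and cluster[v] == c
                   for u, v in edges) / m
        ends = sum((cluster[u] == c) + (cluster[v] == c) for u, v in edges)
        score += e_ii - (ends / (2 * m)) ** 2
    return score

def conductance(edges, cluster):
    """Eqs. (3)-(4): one minus the average cluster conductance."""
    m = len(edges)
    names = set(cluster.values())
    total = 0.0
    for c in names:
        cut = sum((cluster[u] == c) != (cluster[v] == c) for u, v in edges)
        vol = sum((cluster[u] == c) + (cluster[v] == c) for u, v in edges)
        total += cut / min(vol, 2 * m - vol)
    return 1 - total / len(names)

def coverage(edges, cluster):
    """Eq. (5): fraction of edges lying inside a cluster."""
    return sum(cluster[u] == cluster[v] for u, v in edges) / len(edges)

print(round(modularity(edges, cluster), 4),
      round(conductance(edges, cluster), 4),
      round(coverage(edges, cluster), 4))   # → 0.3571 0.8571 0.8571
```

On this toy graph the single bridging edge keeps coverage and graph conductance high while modularity stays moderate, showing that the three metrics score the same clustering differently.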




Fig. 1 shows that the original path of a drunkard from origin to destination is a straight line, but owing to his unsteadiness he forms various junctions (vertices) and edges by a zig-zag walk in different directions. These vertices and edges construct a graph with some clusters. Using the network components of Fig. 1 and expressions (1), (4) and (5), the measures of the quality metrics, namely modularity, conductance and coverage, are obtained as follows:

Modularity, m = 0.53482
Conductance, ∅(G) = 0.66785
Coverage, C = 0.83333

IV. SIMILARITY OF CLUSTERS

Clustering analysis has been applied as a prime term for the techniques which deal with the specified problem. The selection of clusters, and of the vertices of clusters that belong to different regions, is a great task. The nature of clusters and their vertices along with their edges has been studied depending on their closeness and other characteristics, and the similarity of clusters has been analysed with the help of vertices and edges.

Consider a clustering problem formed as a triplet (X, Y, m). Here X is a set of N objects to be clustered, X = {X_1, X_2, X_3, …, X_N}. Y is a specific partitioning of these objects into k disjoint sets, Y = {Y_1, Y_2, Y_3, …, Y_k}, where each Y_i contains some of the objects. The set of all such partitions is denoted Y, and m is the method of choosing a particular Y given the set X. It is also noted that Y′ is another partition of X into specified disjoint sets. Generally speaking, clustering methods have two components, namely a criterion and a technique. The criterion assigns to each clustering a numerical value that indicates its relative worth; the technique selects a particular member of Y which optimizes the criterion. The technique is a vital part of any operational process. There are two ways of comparing clustering methods: the first concerns the processing itself and the second evaluates the steps of the process. The first way is computer oriented, in which execution time and storage requirements are considered. Even though a good method with fast processing is always better than a bad method with slow processing, the choice between a fast bad method and a slow good method requires a special quantification of what makes a method good or bad.

The performance of a clustering method requires comparison of its results either to standard results or to the results of another method. The purpose of this comparison is to measure the similarity between clusterings. One of the classification problems is discriminant analysis, which provides a correct classification against others.

In clustering situations, there is no proper methodology to measure clusterings and their similarities; the comparison of two clusterings is the suitable method. In a clustering method, the similarity measure motivates the following three considerations, which form the basis for the clustering problem.

► Clustering is discrete, and every point is unequivocally assigned to a particular cluster.
► Clusters are defined just as much by the points they do not contain as by the points they do contain.
► Equal importance is given to all points in the justification of a clustering.

These three considerations lead to the basic unit of comparison between two clusterings: how pairs of points are clustered. If the elements of an individual point-pair are placed together in a cluster in each of the two clusterings, or if they are assigned to different clusters in both clusterings, this shows a similarity between the clusterings.

Fig. 2: Clusters on regions

Let C(Y, Y′) be the similarity between clusterings Y and Y′. This measure of similarity is equal to the number




of similar assignments of point-pairs normalized by the total number of point-pairs. To estimate the similarity, a mathematical form must be established.

Let N be the total number of vertices, and let n_ij (i = 1, 2, …, n_1; j = 1, 2, …, n_2) be the number of vertices simultaneously in the ith cluster of Y and the jth cluster of Y′; otherwise stated, it is the number of edges linking the vertices of the ith cluster and the vertices of the jth cluster.

With these assumptions, the similarity between Y and Y′ is defined as

C(Y, Y′) = [ N(N−1)/2 − [ ½ { Σ_{i=1}^{n1} (Σ_{j=1}^{n2} n_ij)² + Σ_{j=1}^{n2} (Σ_{i=1}^{n1} n_ij)² } − Σ_{i=1}^{n1} Σ_{j=1}^{n2} n_ij² ] ] / [ N(N−1)/2 ]    (6)

The basic properties of C are:

► The similarity lies between 0 and 1: 0 ≤ C ≤ 1.
► If C = 0, the two clusterings have no similarity.
► If C = 1, the clusterings are identical.
► (1 − C) is a measure of distance.

Table 1: Measure of similarity

Total number of edges (n_ij)    C
40    0.63590
47    0.49615
51    0.36667
56    0.26410
57    0.24744
60    0.16154

On observing the vertices of Y and Y′ in Fig. 2, the numbers of edges connecting the vertices of the ith cluster of Y and those of the jth cluster of Y′ are counted. The counted edges are used in equation (6), which gives the measures of similarity; the total number of edges and the similarity measure are presented in Table 1. It is observed that the similarity decreases as the total number of edges increases. When the number of edges is very small the clusterings become identical; on the other hand, for a large number of edges there is null similarity.

V. STRENGTH OF GRAPH CLUSTERING

The process of graph clustering is a challenging and cumbersome task. During the last decades, many graph clustering algorithms have been proposed; the criteria-based approaches try to optimize clustering fitness functions by applying optimization techniques. The size of graph data has grown rapidly, and in general a social network graph contains a countless number of vertices and edges. Processing these data is very challenging and time consuming, since such graphs are also heterogeneous.

Now, formulate a clustering model to estimate the strength of the cluster structure and the related probabilities.

Let G(n, p, c, q) be the clustering model which generates the graphs. The notations of the model are explained as follows:

n: the number of vertices
p: the probability of an edge connecting two vertices
c: the number of clusters
d_in: the expected number of edges connecting one vertex to other vertices inside the same cluster
d_out: the expected number of edges connecting one vertex of a particular cluster to vertices of other clusters
q: the parameter that represents the strength of the cluster structure, i.e.,

q = d_in / d_out    (7)

A greater q represents a stronger cluster structure and a lesser q a weaker one.

Let d be the expected degree of a vertex, defined as

d = d_in + d_out = p(n − 1)    (8)




Expression (8) is useful for generating graphs with c clusters in which each cluster contains an equal number of vertices. Consider a pair of vertices belonging either to the same cluster or to different clusters.

Let P(SC) and P(DC) be the probabilities that a given pair of vertices belong to the same cluster and to different clusters respectively:

P(SC) = q p c (n − 1) / [(q + 1)(n − c)]    (9)

and

P(DC) = p c (n − 1) / [n (q + 1)(c − 1)]    (10)

Here it is noted that P(SC) is always greater than P(DC).

For the purpose of illustrating this model, consider a graph with three clusters of six vertices each, since constructing and analyzing a graph with a large number of vertices and edges is very tedious. The number of edges from each vertex of a cluster connecting to the vertices of other clusters is taken to be equal in each case; for estimating the strength of the cluster structure and the probabilities, this number of edges is taken as 3, 4, 5 and 6.

Fig. 3: Vertices and edges

The strength of the cluster structure from equation (7) and the probabilities from equations (8), (9) and (10) are computed and presented in Table 2.

Table 2: Strength and probabilities

Number of edges    Strength (q)    P(SC)     P(DC)
3                  1.1296          0.6778    0.2500
4                  0.8472          0.6778    0.3333
5                  0.6778          0.6778    0.4167
6                  0.5648          0.6778    0.5000

The computational results show that, as the number of edges increases, the strength of the cluster structure decreases while the probability that edges connect vertices of different clusters increases rapidly. On the other hand, for edges connecting vertices belonging to the same cluster, the probability remains the same.

VI. CONCLUSION AND DISCUSSION

This paper is designed in five sections together with this conclusion. The preliminaries relating to clustering and the past studies are presented in the first and second sections respectively. In the other three sections, the quality metrics, the similarity and the strength of the cluster structure are analysed; the methodologies of the performance measures are described and numerical results are obtained.

From a group of metrics, modularity, conductance and coverage were selected for this study. The measures of these metrics identify that modularity is the superior metric and coverage the weaker one. Similarly, the similarity of clusters is inversely proportional to the number of edges: a large number of edges yields a very small similarity, tending to zero and showing null similarity between the clusters, while for a small number of edges a high similarity is attained, showing that the clusterings are identical. The difference of the similarity from unity is a measure of distance. As in the case of similarity, the strength of the cluster structure also decreases when the number of edges increases. On studying the probability analysis in clustering, if the given vertices belong to different clusters, then the probability increases as the number of edges increases.
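Table 2 can be reproduced from equations (7), (9) and (10) with a short script. The sketch below assumes n = 18 vertices in c = 3 equal clusters as in the paper's illustration; the value d_in ≈ 3.3889 is inferred from the published strengths rather than stated in the text, so it is an assumption of this example.

```python
# Sketch reproducing Table 2 from eqs. (7), (9) and (10).
# n and c follow the paper's illustration (three clusters of six vertices);
# d_in is inferred from the published table, not stated explicitly.
n, c = 18, 3
d_in = 3.3889

rows = []
for d_out in (3, 4, 5, 6):
    q = d_in / d_out                     # eq. (7): strength of structure
    p = (d_in + d_out) / (n - 1)         # eq. (8): d = d_in + d_out = p(n-1)
    p_sc = q * p * c * (n - 1) / ((q + 1) * (n - c))   # eq. (9)
    p_dc = p * c * (n - 1) / (n * (q + 1) * (c - 1))   # eq. (10)
    rows.append((d_out, round(q, 4), round(p_sc, 4), round(p_dc, 4)))

for r in rows:
    print(r)   # matches Table 2 row by row
```

Note that eq. (9) simplifies to d_in·c/(n − c), which does not depend on d_out; this is why P(SC) stays at 0.6778 across the whole table while P(DC) = d_out·c/[n(c − 1)] grows with the number of inter-cluster edges.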




But the probability remains the same when the vertices belong to the same cluster.

The results obtained from this study reveal that the quality metrics, the similarity of clusters and the strength of the cluster structure may fluctuate and lead to other conclusions if the researcher applies other methodologies and logics. The above techniques are applicable to most real-life problems even when there are huge numbers of vertices and edges.

VII. REFERENCES

[1] Almeida, H., Guedes, D., Meira, W. Jr. and Zaki, M.J. (2011), "Is there a best quality metric for graph clusters?", in Gunopulos, D., Hofmann, T., Malerba, D. and Vazirgiannis, M. (eds.), Machine Learning and Knowledge Discovery in Databases, 44-59, Springer, Berlin Heidelberg.

[2] Andersen, R., Chung, F. and Lang, K. (2007), "Using PageRank to locally partition a graph", Internet Mathematics, 4(1), 35-64.

[3] Benson, A.R., Gleich, D.F. and Leskovec, J. (2016), "Higher-order organization of complex networks", Science, 353, 163-166.

[4] Nagarajan, M. and Srinivasan, K. (2017), "Various node deployment strategies in wireless sensor network", IPASJ International Journal of Computer Science, 5(8).

[5] Chung, F. and Kempton, M. (2013), "A local clustering algorithm for connection graphs", in Algorithms and Models for the Web Graph, 26-43.

[6] Friedman, H.P. and Rubin, J. (1967), "On some invariant criteria for grouping data", Journal of the American Statistical Association, 62, 1159-1178.

[7] Girvan, M. and Newman, M.E.J. (2002), "Community structure in social and biological networks", Proceedings of the National Academy of Sciences, 99(12), 7821-7826.

[8] Zhang, H., Raitoharju, J., Kiranyaz, S. and Gabbouj, M. (2016), "Limited random walk algorithm for big graph data clustering", Journal of Big Data, 3(26), 1-22.

[9] Nagarajan, M. and Karthikeyan, S. (2012), "A new approach to increase the life time and efficiency of wireless sensor network", IEEE International Conference on Pattern Recognition, Informatics and Medical Engineering (PRIME), 231-235.

[10] Lance, G.N. and Williams, W.T. (1967), "A general theory of classificatory sorting strategies. II. Clustering systems", The Computer Journal, 10, 271-277.

[11] Newman, M.E.J. (2004), "Fast algorithm for detecting community structure in networks", Physical Review E, 69(6), 066133.

[12] Rand, W.M. (1971), "Objective criteria for the evaluation of clustering methods", Journal of the American Statistical Association, 66(336), 846-850.

[13] Schaeffer, S.E. (2007), "Graph clustering", Computer Science Review, 1(1), 27-64.

[14] Ezhilarasi, M. and Krishnaveni, V., "A survey on wireless sensor network: energy and lifetime perspective", Taga Journal, 14, 3099-3113, ISSN 1748-0345.

[15] Emmons, S., Kobourov, S., Gallant, M. and Borner, K. (2016), "Analysis of network clustering algorithms and cluster quality metrics at scale", PLOS ONE, e0159161, 1-18.

[16] Spielman, D.A. and Teng, S.H. (2008), "A local clustering algorithm for massive graphs and its application to nearly-linear time graph partitioning", arXiv:0809.3232.

[17] Ezhilarasi, M. and Krishnaveni, V. (2019), "An evolutionary multipath energy-efficient routing protocol (EMEER) for network lifetime enhancement in wireless sensor networks", Soft Computing, doi:10.1007/s00500-019-03928-1.

[18] Xin, Y., Xie, Z.Q. and Yang, J. (2016), "The adaptive dynamic community detection algorithm based on the non-homogeneous random walking", Physica A: Statistical Mechanics and its Applications, 450, 241-252.

