An Efficient Community Detection Method Based on Rank Centrality
Physica A
1. Introduction
Many network systems are made of individuals or organizations that are related to each other by various interdependencies like friendship, kinship, etc. These members can be represented as vertices in a graph, and the relationships can be represented as edges between the vertices. There are many different networks in the real world, such as biological networks, ecological (species and their trophic interactions) networks, social (people and their interactions) networks, etc. With the development of complex network research, scientists have found many particular properties in complex networks, e.g., the small world property [1], scale-free degree distributions [2] and modularity [3,4]. Modularity means that there are community structures in networks: communities are subgroups of vertices such that the edges between vertices in the same group are much denser than those between different groups. Identifying communities in complex networks is very important for many real applications. For instance, a community in a social network implies common beliefs among people, and a community in a WWW network indicates a common topic among pages. So far, many algorithms have been developed to detect communities in complex networks. These algorithms can be broadly divided into two classes: vertex clustering methods based on traditional clustering, and newly developed methods based on network topology.
Vertex clustering methods based on traditional clustering include K-means [5], AP (affinity propagation) [6], minimum cut clustering [7] and NMF (nonnegative matrix factorization) clustering [8], etc. Clustering methods are efficient algorithms for the problem of community detection. However, some clustering algorithms have their own defects: for example, K-means is sensitive to its initial seeds, and AP's performance depends heavily on its parameters. Newly developed methods based on network topology include GN [3], modularity-maximizing methods [9–12], OSLOM [13], infomap [14], LPA [15], etc. GN [3] is a classical method which repeatedly deletes the edge with maximal betweenness; but due to the high complexity of computing betweenness, GN is not efficient in large networks with thousands of vertices. Another class of classical methods
partitions networks by maximizing the modularity function of a network [9–12], but all of these suffer from the resolution limit [16]. OSLOM [13] is based on a statistical model and finds statistically significant communities in networks. Infomap [14] is an information-theoretic approach based on random walks, which uses the probability flow of random walks on a network as a proxy for information flows. Raghavan et al. [15] employed a simple label propagation algorithm (LPA) to find communities in large real-world networks. Although some of these algorithms are very fast, they sometimes have low accuracy when the community structure of the network is not clear.
Among these clustering algorithms, K-means [5] is widely used and efficient because of its fast convergence, so in this paper we focus on community detection using the basic idea of K-means. As mentioned above, K-means is very sensitive to its initial seeds; in particular, if the community number is large, it may produce an empty cluster or bad clustering results. Although K-means++ [17] was introduced to solve this problem, it has high time complexity when choosing the initial seeds. Therefore, in this paper we propose a new efficient clustering algorithm called K-rank, based on rank centrality, which is much faster than K-means++ and at the same time solves K-means's seeding problem. First, K-rank finds K seeds via a new initial-seed choice strategy. Second, K-rank classifies the vertices using vertex similarities and then updates the seeds, iterating until convergence. Like the K-means algorithm, K-rank is simple and converges quickly when finding communities. Furthermore, K-rank can be easily extended to directed, weighted and overlapping networks.
The rest of this paper is organized as follows. Section 2 presents the K-rank algorithm. In Section 3 we discuss how K-rank can be extended to directed, weighted and overlapping networks. Experimental results on synthetic and real-world networks are shown in Section 4. Section 5 contains the conclusions and summary.
2. K-rank algorithm
In cluster analysis, how to define vertex similarities is significant. In order to treat community detection as a clustering problem, the most crucial step is how to measure the similarities between vertices. At present, there are many ways to define vertex similarities; Ref. [18] gives an excellent survey, classifying the similarity indices into three types: local, global and quasi-local. In this paper, we choose the signal similarity [19] based on global network topology, which turns the network topology into a geometrical structure of vectors in n-dimensional Euclidean space.
In this section, we present the components of the K-rank algorithm: the signal similarity, how to choose the initial seeds, the parameter-choice problem, the overall K-rank procedure, and how K is chosen.
Signal [19] is a vertex similarity definition based on signaling propagation, which turns the network topology into a geometrical structure of vectors in n-dimensional Euclidean space. For a network with n vertices, every vertex is assumed to be a source which can send, receive, and record signals. In this signaling process, all the vertices record the amount of signals they have received, and at every step each vertex sends all the signals it currently holds to its neighbors and to itself. After c steps, the distribution of signal amounts over the vertices can be viewed as the influence of the source vertex on the whole network. Naturally, compared with vertices in other communities, the vertices of the same community have a similar influence on the whole network. Therefore, after normalizing these n vectors, the distance between each pair of vectors represents the similarity of the corresponding vertices. In fact, the propagation process can be described by a simple and clear mathematical formula. Suppose $A$ is the adjacency matrix of the network and $I_n$ is the $n$-dimensional identity matrix; then the matrix

$$U = (I_n + A)^c \qquad (1)$$

represents the effect of each source vertex on the whole network after c steps.
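For illustration, the propagation of Eq. (1) amounts to a few lines of matrix code. The sketch below is our own, not taken from the paper: it assumes a NumPy environment, builds $U = (I_n + A)^c$, normalizes each row so that only the distribution of influence matters, and measures vertex similarity as the negative Euclidean distance between normalized vectors, the convention the authors adopt later when discussing the threshold µ.

```python
import numpy as np

def signal_vectors(A, c):
    """Simulate c steps of signaling (Eq. (1)): U = (I_n + A)^c.
    Row i of U is the influence vector of source vertex i; each row is
    normalized so that only the shape of the influence vector matters."""
    n = A.shape[0]
    U = np.linalg.matrix_power(np.eye(n) + A, c)
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def similarity(U, i, j):
    """Negative Euclidean distance between influence vectors, so larger
    values (closer to 0) mean more similar vertices."""
    return -np.linalg.norm(U[i] - U[j])
```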
Similar to K-means and K-means++, the choice of initial seeds influences the results of the K-rank algorithm to a great extent. Many techniques have been proposed to solve this problem [20]; among them, K-means++ [17] is the most popular. The difference between K-means and K-means++ is that K-means++ makes the chosen seeds as far away from each other as possible. However, K-means++ needs to compare almost all the vertices in the network before choosing each new initial seed, so if the network is large the time complexity of this process is very high. The proposed K-rank algorithm adopts an effective method based on rank centrality and vertex similarities which also purposely places the chosen seeds as far away from each other as possible, but does not need to compare all the vertices of the network each time.
• Rank centrality. For vertex ranking, we adopt the PageRank authority centrality, originally developed by Brin and Page [21] to rank the authority of web pages using the hyperlink structure of the web. Assume that a random walker follows the structure of the network via the transition matrix $P$ and sometimes jumps to a random vertex according to the probability distribution $v$; then the PageRank vector $r$ satisfies the following equation [21]:

$$r^T = \alpha r^T P + (1 - \alpha) v^T \qquad (2)$$

in which $r$ is the steady-state distribution of the random walk governed by the transition matrix $\alpha P + (1 - \alpha)\mathbf{1}v^T$, where $\mathbf{1}$ is a column of ones. If $v$ is uniform over $V$, the steady-state vector $r$ is referred to as the global PageRank vector (GPR). (A power-iteration sketch is given after this list.)
• Choice of K seeds. After ranking the vertices of the network, we know the order of the vertices by their PageRank (PR) value. The larger a vertex's PR value, the more likely that vertex is to be a good seed. From this view, simply choosing the K vertices with the largest PR values might seem an appropriate way to find the seeds. But as described in Ref. [21], if a page (vertex) has a very large PR value, then the PR values of the other pages (vertices) which link to it are also large. As a result, if we choose the top K vertices by PR value, it is very common to get K seeds that are very near to each other, which violates the rule that the chosen seeds should be as far away from each other as possible. Therefore, this seed-choice method is only slightly better than random choice and is not the best one. How can we find seeds which have large PR values and are at the same time as far away from each other as possible? This is where vertex similarities come in. First, we choose the vertex with the largest PR value as the first seed. Then we consider the vertex with the second largest PR value; if the similarity between it and the first seed is smaller than a threshold µ, we choose it as the second seed. In the same way, at the tth step we choose a vertex as a new seed if its similarities to all the chosen seeds are smaller than µ, continuing until we have K seeds (see the code sketch after this list). In this way, the value of µ determines how separated the seeds are: the smaller µ is, the farther apart the seeds are. Once µ is fixed, we can obtain K seeds by rank centrality and vertex similarities.
• The threshold µ. Parameter selection is a hard problem in machine learning. In seed choice, the threshold µ is an important factor, and we believe that how to estimate it is still an open problem. A small µ means that it is hard to find K seeds among a minority of the vertices: we have to scan a majority of the vertices in order to make sure that the similarities between the seeds are all smaller than the threshold. In the extreme case, all the vertices are considered once, and the choice of seeds is poor because the PR values of most seeds are very low. On the contrary, a large µ means that it is easy to find K seeds among a minority of the vertices; in the extreme case, µ is large enough that the top K vertices by PR value are all chosen as seeds. This choice is also poor because these seeds are not far from each other. So the threshold µ should be neither too large nor too small. Though it is hard to define its exact value, we can estimate it by heuristic and empirical means. If K is small, a smaller µ is needed, whereas a large K means that a larger µ is better. Generally speaking, when choosing seeds we should consider 10%–80% of the vertices to guarantee that the chosen seeds are far from each other and simultaneously have large PR values. Finally, if we use the negative Euclidean distance to measure the similarity between the vertex vectors [19], our experimental results indicate that [−1, −0.6] is a good empirical range for the threshold µ.
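To make the seeding procedure concrete, the sketch below combines the two ingredients just described: a power-iteration solver for Eq. (2) and the greedy, similarity-thresholded scan of vertices in decreasing PR order. This is our own illustrative code, not the authors' implementation; `signal_vectors` and `similarity` refer to the earlier signal-similarity sketch, and parameter defaults such as `alpha=0.85` are conventional assumptions rather than values from the paper.

```python
import numpy as np

def pagerank(A, alpha=0.85, tol=1e-10, max_iter=1000):
    """Solve Eq. (2) by power iteration: r^T <- alpha r^T P + (1 - alpha) v^T,
    with a row-stochastic P built from A and a uniform jump vector v (GPR)."""
    n = A.shape[0]
    deg = A.sum(axis=1).astype(float)
    deg[deg == 0] = 1.0                  # avoid dividing by zero on isolated vertices
    P = A / deg[:, None]
    v = np.full(n, 1.0 / n)
    r = v.copy()
    for _ in range(max_iter):
        r_next = alpha * (r @ P) + (1 - alpha) * v
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r

def choose_seeds(r, U, K, mu, similarity):
    """Scan vertices in decreasing PR order; accept a vertex as a new seed
    only if its similarity to every already-chosen seed is below mu.
    May return fewer than K seeds if mu is too strict for the network."""
    order = np.argsort(-r)               # vertices sorted by decreasing PR value
    seeds = [int(order[0])]              # the largest-PR vertex is the first seed
    for v in order[1:]:
        if len(seeds) == K:
            break
        if all(similarity(U, int(v), s) < mu for s in seeds):
            seeds.append(int(v))
    return seeds
```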
First, we focus on a simple undirected, unweighted network G without multiple links or self-loops. The whole K-rank algorithm based on rank centrality for detecting communities is summarized as Algorithm 1. Due to the fast convergence of the K-means and PageRank algorithms, the K-rank algorithm also converges quickly. In fact, K-rank can be regarded as an extended and improved K-means or K-means++. One difference between K-rank and K-means(++) is the manner of updating seeds: K-rank adopts the effective and fast PageRank algorithm to update the seeds. Moreover, K-means and K-means++ need a number of iterations to converge, while K-rank needs few iterations, indeed only one iteration if an appropriate µ is chosen. If the number of vertices and K are very large, K-rank may still produce empty clusters. In our implementation, we deal with this by abandoning the seeds of empty clusters and choosing replacement seeds from the other vertices randomly until no empty cluster remains.
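Since Algorithm 1 itself is not reproduced in this excerpt, the following schematic sketch records our own reading of the loop: choose seeds, assign every vertex to its most similar seed, then replace each community's seed by its highest-PageRank member and repeat until the seeds stabilize. The helpers come from the earlier sketches, and the defaults `c=3` and `mu=-0.8` are arbitrary illustrative choices within the ranges discussed above.

```python
import numpy as np

def k_rank(A, K, c=3, mu=-0.8, alpha=0.85, max_iter=20):
    """Schematic K-rank loop: seed selection, assignment, seed update."""
    U = signal_vectors(A, c)                     # Eq. (1), earlier sketch
    seeds = choose_seeds(pagerank(A, alpha), U, K, mu, similarity)
    for _ in range(max_iter):
        # Assignment step: each vertex joins the community of the seed it is
        # most similar to (a seed is trivially most similar to itself).
        labels = np.array([max(range(K), key=lambda q: similarity(U, v, seeds[q]))
                           for v in range(A.shape[0])])
        # Update step: the new seed of a community is its highest-PageRank member.
        new_seeds = []
        for q in range(K):
            members = np.where(labels == q)[0]
            if members.size == 0:
                # The paper re-seeds empty clusters randomly; this sketch
                # simply keeps the old seed instead.
                new_seeds.append(seeds[q])
                continue
            sub_r = pagerank(A[np.ix_(members, members)], alpha)
            new_seeds.append(int(members[np.argmax(sub_r)]))
        if new_seeds == seeds:
            break
        seeds = new_seeds
    return labels, seeds
```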
In the first step of the K-rank algorithm, we must specify the number of communities (K) as additional information. In this paper, we use the F statistic [19,22] to estimate a proper K. Suppose $U = \{u_1, u_2, \ldots, u_n\}$ is the set of vectors of all vertices and $u_j = (x_{j1}, x_{j2}, \ldots, x_{jn})$, where $x_{jk}$ is the $k$th character quantity of $u_j$. Suppose $K$ is the number of communities and $n_i$ is the number of vertices of the $i$th community, whose vertex vectors are $u_{i1}, u_{i2}, \ldots, u_{in_i}$. Let $\bar{x}_{ik} = \frac{1}{n_i}\sum_{j=1}^{n_i} u_{ij}(k)$, $k = 1, 2, \ldots, n$, be the mean characters of the $i$th community, $\bar{u}_i = (\bar{x}_{i1}, \bar{x}_{i2}, \ldots, \bar{x}_{in})$ be the $i$th community's center, and $\bar{u} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$ be the center of all the vertices, where $\bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}$, $k = 1, 2, \ldots, n$. Then the F statistic is defined as

$$F = \frac{\sum_{i=1}^{K} n_i \,\|\bar{u}_i - \bar{u}\|^2 \,/\, (K - 1)}{\sum_{i=1}^{K}\sum_{j=1}^{n_i} \|u_{ij} - \bar{u}_i\|^2 \,/\, (n - K)} \qquad (3)$$

where $\|\bar{u}_i - \bar{u}\| = \sqrt{\sum_{k=1}^{n} (\bar{x}_{ik} - \bar{x}_k)^2}$ is the distance between $\bar{u}_i$ and $\bar{u}$, and $\|u_{ij} - \bar{u}_i\|$ is the distance between vertex $u_{ij}$ of the $i$th community and its center $\bar{u}_i$. The numerator of F measures the distance between communities and the denominator the distance within communities, so F is larger when the inter-community distances are larger and the intra-community distances are smaller. When F achieves its maximum, we get the best K. As shown in Ref. [19], the clearer the community structure, the more distinct the maximal value of the F statistic.
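As an illustration of this selection rule, the F statistic of Eq. (3) can be evaluated over a range of candidate K values and the maximizer taken as the community number. The sketch below is our own; it assumes the label vector produced by a clustering run (e.g. the `k_rank` sketch above) and the normalized vertex vectors `U` from the earlier sketches.

```python
import numpy as np

def f_statistic(U, labels, K):
    """F statistic of Eq. (3): between-community variance (numerator)
    over within-community variance (denominator) of the vertex vectors."""
    n = U.shape[0]
    u_bar = U.mean(axis=0)                      # center of all vertex vectors
    between = within = 0.0
    for i in range(K):
        members = U[labels == i]
        u_i = members.mean(axis=0)              # center of community i
        between += len(members) * np.sum((u_i - u_bar) ** 2)
        within += np.sum((members - u_i) ** 2)
    return (between / (K - 1)) / (within / (n - K))
```

Sweeping K over a plausible range and keeping the K with the largest F then implements the selection rule described above.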
The complexity of K-rank consists of the signaling process [19], the seed choice and the iterative process. The time complexity of signal diffusion is $O(cn^3)$ [19] when matrix multiplication is used to simulate the process, where c is the number of propagation steps and n is the number of vertices. But if we simulate the process on the network directly, the corresponding time complexity is $O(c(k + 1)n^2)$ [19], where k is the average degree of the vertices. Compared with the signaling process, the time complexity of the seed choice is trivial, because K is much less than n and the PageRank algorithm is very fast; the GPR can be easily computed by power iteration. The iterative process is similar to the K-means algorithm [5]; the only difference is the strategy of updating the seeds. K-means computes the means of all the community members, while K-rank finds the new seeds by calculating the rank centrality of the communities. As the communities are subnetworks much smaller than the original network, K-rank's seed-update step does not cost too much. If the number of communities K is fixed, the time complexity of K-means clustering is $O(nKt)$, where t is the number of iterations. Consequently, the total complexity of K-rank is $O(c(k + 1)n^2 + nKt)$.
3. Extending K-rank to directed, weighted and overlapping networks

Some existing methods proposed especially for community detection can only find communities in undirected and unweighted networks and cannot be easily extended to directed and weighted ones. However, in the real world there are many networks where edge direction and weights (indicating the strength of the interaction between vertices) are essential features, such as citation networks, web pages, etc. Besides, in some real-world networks, like social networks, vertices sometimes belong to more than one community. Such communities are called overlapping communities, and in practical situations it is very common for communities to overlap, but not all community detection methods can handle these types of networks. Compared with detecting communities in directed and weighted networks, finding overlapping communities is more difficult. In this section, we discuss how K-rank can be extended to directed, weighted and overlapping networks.
Suppose we have a weighted and directed network with n vertices. It can be represented mathematically by an adjacency matrix $W$ whose element $W_{ij}$ denotes the connection strength from vertex $i$ to vertex $j$. Since we are dealing with directed networks, in general $A_{ij} \neq A_{ji}$ and $W_{ij} \neq W_{ji}$. Then in the first step of K-rank we use [19]

$$U = (I_n + W)^c \qquad (4)$$

instead of $U = (I_n + A)^c$. In this way, we can compute vertex similarities incorporating the edges' weights and direction naturally. In the same way, if we regard the original network as a weighted and directed graph, the weights and direction are also easily incorporated into the ranking, because PageRank works on a weighted and directed web graph. Thereafter, the rest of the algorithm is the same as the original K-rank algorithm.
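Concretely, the only change in this first step is that the (generally asymmetric) weight matrix replaces the 0/1 adjacency matrix in the propagation; a variant of the earlier sketch, again our own illustration:

```python
import numpy as np

def signal_vectors_weighted(W, c):
    """Eq. (4): signaling on a weighted/directed network, with the weight
    matrix W in place of the 0/1 adjacency matrix A of Eq. (1)."""
    n = W.shape[0]
    U = np.linalg.matrix_power(np.eye(n) + W, c)
    return U / np.linalg.norm(U, axis=1, keepdims=True)
```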
We have found that overlapping vertices always lie in the boundary area between different communities and that their memberships in the different communities are almost the same. So we first define the vertex membership measure B.
• Membership measures of vertices. Suppose a network is divided into K communities $C_1, C_2, \ldots, C_K$, and we have already got the vertex similarity $S(i, j)$ which denotes the similarity between vertices $i$ and $j$ ($i \neq j$). Then we can define four membership measures of vertices as follows.
(1) Vertex i's maximum membership measure with regard to community j:

$$B_{\max}(i, j) = \max_{x \in C_j,\, x \neq i} S(i, x). \qquad (5)$$

(2) Vertex i's minimum membership measure with regard to community j:

$$B_{\min}(i, j) = \min_{x \in C_j,\, x \neq i} S(i, x). \qquad (6)$$

(3) Vertex i's average membership measure with regard to community j:

$$B_{\mathrm{average}}(i, j) = \frac{1}{N_j} \sum_{x \in C_j,\, x \neq i} S(i, x), \qquad (7)$$

where $N_j$ denotes the number of vertices in community j excluding vertex i.
(4) Vertex i's central membership measure with regard to community j:

$$B_{\mathrm{center}}(i, j) = S(i, e), \qquad (8)$$

where $e$ is the seed of community j and $i \neq e$.
All these membership measures tell us how closely vertex i belongs to community j: the larger the membership measure, the more likely i belongs to community j. It is easy to see that the maximum and minimum membership measures are both sensitive to noise, while the other two measures are robust. Therefore, we choose the central membership measure to quantify the possibility of a vertex belonging to a community.
• Finding overlapping vertices. According to the definitions above, the central membership measure can be regarded as a kind of vertex similarity. Given the community structure of the network and the seed of each community, we can obtain each vertex's central membership measure from the vertex similarities, and based on this measure we can judge whether a vertex is overlapping. Therefore, we first partition the network using the K-rank algorithm to get the communities and the seeds. We then define another threshold ϵ which denotes the minimum value of the similarities between the seeds and the overlapping vertices in their own communities. The value of ϵ cannot be fixed in advance because we do not yet know which vertices are overlapping. So we set the threshold ϵ to different values and calculate all the non-seed vertices' central membership measures with regard to all the communities. If vertex i's central membership measure with regard to community j is no smaller than the threshold ϵ, then vertex i is assigned to community j. After that, if a vertex belongs to more than one community, it is overlapping. Thus, different values of ϵ correspond to different sets of overlapping vertices. Finally, we use an evaluation function Qov [23] (a modularity function for overlapping networks) to find the best cover of the overlapping communities (a code sketch of this procedure is given at the end of this section).
• Estimating the threshold ϵ. In the process of overlapping-vertex detection, the threshold ϵ determines the result of the algorithm. As mentioned above, ϵ denotes the minimum value of the similarities between the seeds and the overlapping vertices in their own communities. Although ϵ is not a fixed value, and in general different networks have different values of ϵ, we can still estimate it from its upper and lower bounds.
Suppose $C_1, C_2, \ldots, C_K$ is a partition of the network G, $e_i$ is the seed of community $C_i$, $x$ is a vertex in community $C_i$, $S(i, j)$ is the similarity between vertices $i$ and $j$, and $K$ is the number of communities. Then the upper bound of ϵ is $\min_{i=1,\ldots,K} \min_{x \in C_i} S(e_i, x)$. We consider two cases. In the first case, if the result of the algorithm is the same as the ground truth, then the value of ϵ is exactly $\min_{i=1,\ldots,K} \min_{x \in C_i} S(e_i, x)$: no matter whether a vertex is overlapping, it must be close to the one or more communities it belongs to, and the similarities between them are all no smaller than ϵ. In the second case, if the result of the algorithm is not the best, then some overlapping vertices were classified as if they were non-overlapping, and the value of ϵ should be smaller than $\min_{i=1,\ldots,K} \min_{x \in C_i} S(e_i, x)$. As for the lower bound of ϵ, we cannot obtain a strict bound; but based on the basic assumption of cluster analysis, that similarities within a cluster are larger than those between clusters, we can relax the constraints and use the similarities between the seeds of different communities to approximate it. A relatively loose lower bound of ϵ is then $\max_{i,j \in 1,\ldots,K,\, i \neq j} S(e_i, e_j)$. To sum up, the value of ϵ should lie in the range

$$\left[\; \max_{i,j \in 1,\ldots,K,\, i \neq j} S(e_i, e_j),\;\; \min_{i=1,\ldots,K} \min_{x \in C_i} S(e_i, x) \;\right] \qquad (9)$$

and the appropriate value of ϵ can be found near the upper bound (see the sketch at the end of this section).
• Why it works. Above, we compute only the non-seed vertices' central membership measures to detect overlapping vertices. But what if the seeds themselves are overlapping? In fact, we do not need to consider whether the seeds are overlapping, because we make three basic assumptions.
Table 1
Parameters of LFR artificial networks for various figures.
Parameter Fig. 2(a, b) Fig. 4(a, b) Fig. 3(a, b) Fig. 5(a, b) Fig. 6(a, b)
Hypothesis 2. Each non-overlapping vertex must take its own community's seed as its center; if the vertex is overlapping, this center is not the only one.
Hypothesis 3. Each seed must take itself as the center of its community.
The networks we deal with must satisfy the basic assumptions above. If a network satisfies them, we obtain the following Hypothesis 4.
Hypothesis 4. If a network satisfies Hypotheses 1–3, then the overlapping vertices in the network cannot be the seeds of communities.
The proof is by contradiction. Assume that overlapping vertices belong to exactly two different communities. In the first case, suppose vertex $e_i$ is an overlapping vertex belonging to different communities $C_i$ and $C_{i'}$, and moreover $e_i$ is the seed of both $C_i$ and $C_{i'}$. Then all the other vertices of $C_i$ and $C_{i'}$ take $e_i$ as their center; in other words, the vertices of the two communities are actually in the same community, which contradicts the assumption that they are in different communities. In the second case, suppose vertex $e_i$ is an overlapping vertex belonging to two different communities $C_i$ and $C_j$, and $e_i$ is the seed of $C_i$ but not of $C_j$. Then by Hypothesis 1 there must be another seed $e_j$ in community $C_j$ which is the center of $e_i$; as a result, the seed $e_i$ takes another vertex as its center, which contradicts Hypothesis 3.
Therefore, based on the above hypotheses, seeds cannot be overlapping, and we can detect the overlapping vertices among the non-seed vertices without considering the seeds.
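As referenced above, the ϵ-bound computation of expression (9) and the ϵ-sweep with Qov selection can be sketched together. This is our own illustrative code, not the authors' implementation: `similarity` is the signal-similarity function from the earlier sketches, `labels` and `seeds` come from a K-rank run, and `qov` stands for some implementation of the overlapping modularity Qov [23], which we do not reproduce here.

```python
def eps_bounds(U, seeds, labels, similarity):
    """Bounds of expression (9): lower bound = the largest similarity between
    two distinct seeds; upper bound = the smallest seed-to-member similarity."""
    K = len(seeds)
    lower = max(similarity(U, seeds[i], seeds[j])
                for i in range(K) for j in range(K) if i != j)
    upper = min(similarity(U, e, v)
                for c, e in enumerate(seeds)
                for v in range(len(labels)) if labels[v] == c and v != e)
    return lower, upper

def best_overlapping_cover(U, seeds, labels, eps_values, similarity, qov):
    """Assign each non-seed vertex to every community whose seed it resembles
    at least eps; keep the cover that maximizes Qov over the eps candidates."""
    best_cover, best_q = None, float("-inf")
    for eps in eps_values:
        cover = []
        for v in range(len(labels)):
            if v in seeds:                        # seeds are never overlapping
                comms = {seeds.index(v)}
            else:
                comms = {c for c, e in enumerate(seeds)
                         if similarity(U, v, e) >= eps}
                if not comms:                     # fall back to the K-rank label
                    comms = {labels[v]}
            cover.append(comms)
        q = qov(cover)
        if q > best_q:
            best_cover, best_q = cover, q
    return best_cover
```

In practice the `eps_values` grid would be taken inside the interval returned by `eps_bounds`, concentrated near its upper end, as the text suggests.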
4. Experimental results

In this section, we evaluate the K-rank algorithm on artificial and real-world networks, including undirected, directed, unweighted, weighted and overlapping networks. For the non-overlapping networks with known community structure, we use accuracy and NMI (normalized mutual information) [24] to measure the agreement between the planted partitions and the results of the algorithms. For the overlapping networks, the extended version of NMI for overlapping communities [25] is used instead.
The algorithms are first compared on two classes of benchmark networks, namely the Girvan and Newman [3] and the Lancichinetti et al. [26] (LFR) benchmark networks. For the former, we generate a set of artificial networks with 128 vertices divided into 4 communities of 32 vertices each. The average degree of each vertex is set to 16; the average number of edges from a vertex to others in its community, denoted by Zin, is varied from 8 to 16, and the average number of edges to vertices in other communities, denoted by Zout, is varied from 8 to 0, such that Zin + Zout = 16 on average. Therefore, the larger Zin is, the more easily we can detect the communities. Lancichinetti et al. [26] present a class of artificial networks whose degree and community size distributions are power laws, as in many real-world networks. The parameters of the latter benchmark networks used in our experiments are given in Table 1: the number of vertices (N), the mixing parameter (mu), the average degree of the vertices (k), the maximum degree of the vertices (kmax), the minimum community size (minc), the maximum community size (maxc), and the exponents of the power-law degree and community size distributions (t1 and t2, respectively). The mixing parameter mu is defined such that every vertex shares a fraction 1 − mu of its links with other vertices in its community and a fraction mu with vertices outside its community.
In all of the experiments, we compare K-rank with five other classical algorithms: LPA [15], BGLL [11], infomap [14], OSLOM [13] and K-means [5] (K-means++'s results are similar to those of K-means, so we list only the K-means results). Because LPA [15] and K-means [5] are both sensitive to their initial input, we run each of them ten times and report the best results.
The results for different Zin in Girvan and Newman's networks [3] are shown in Fig. 1. For both accuracy and NMI, all the methods detect the exact communities when Zin ≥ 11, but when Zin ≤ 10 their performances differ. In general, K-rank and K-means are the best among all the methods, especially when Zin = 8. Infomap and BGLL give bad results when Zin = 8, 9. Though OSLOM's accuracy is 1 when Zin = 9, its performance drops when Zin = 8. LPA has the worst results because it does not work when Zin < 11.

Fig. 1. Comparison of K-rank and other algorithms in Girvan and Newman's networks: (a) accuracy and (b) NMI versus Zin.

Fig. 2. Comparison of K-rank and other algorithms in LFR networks (N = 1000, [minc, maxc] = [10, 50]): (a) accuracy and (b) NMI versus mu.
Figs. 2–6 show the results of the different methods in LFR networks with the parameters listed in Table 1. As in the above analysis, we compare the six algorithms by the accuracy and NMI criteria; the only difference is that we use mu, instead of the Zin used in Girvan and Newman's networks, to control how clear the communities are. The results on LFR networks are similar to those on Girvan and Newman's networks. When mu is small, almost all the methods perform well, but as mu increases some algorithms' performances drop greatly: for example, LPA (mu ≥ 0.6), BGLL and OSLOM (mu ≥ 0.7), and infomap (mu ≥ 0.8) in Fig. 2(a). However, the performances of K-rank and K-means do not drop dramatically; furthermore, they give better results when mu is large. It should be noted that there are differences between the performances of K-means and K-rank. Though K-means is sometimes better than K-rank, it lacks stability when mu is small, such as mu ≤ 0.5 in Figs. 2(a) and 4(a). What is more, as mentioned above, if the network is large (Figs. 3, 5 and 6), the number of communities K is usually large as well, and then K-means hardly finds the communities, because a random choice of initial seeds easily leads to an empty cluster or a bad clustering result during the iterative process. K-rank, however, avoids bad results using our initial-seed choice strategy described above; the experimental results in Figs. 3, 5 and 6 confirm this conclusion. In a word, compared with the other algorithms, the K-rank algorithm performs better and is steadier when dealing with small networks. As for the larger networks (Fig. 6), K-rank's performance is slightly worse than the other four methods when mu is small, but it is the best of all when mu is large. Therefore, K-rank is fit for networks whether or not they have clear communities.
Fig. 3. Comparison of K-rank and other algorithms in LFR networks (N = 5000, [minc, maxc] = [10, 50]): (a) accuracy and (b) NMI versus mu.

Fig. 4. Comparison of K-rank and other algorithms in LFR networks (N = 1000, [minc, maxc] = [20, 100]): (a) accuracy and (b) NMI versus mu.

Fig. 5. Comparison of K-rank and other algorithms in LFR networks (N = 5000, [minc, maxc] = [20, 100]): (a) accuracy and (b) NMI versus mu.

Fig. 6. Comparison of K-rank and other algorithms in LFR networks (N = 10 000, [minc, maxc] = [10, 100]): (a) accuracy and (b) NMI versus mu.

Fig. 7. Comparison of K-rank and other algorithms in LFR directed and weighted networks (N = 1000, [minc, maxc] = [20, 50]): (a) accuracy and (b) NMI versus mut.
Weighted and directed artificial networks are also generated by the LFR benchmark [26]. This type of network needs three new parameters: the mixing parameter for the topology (mut), the mixing parameter for the weights (muw) and the exponent of the weight distribution (beta). In this section, the parameters of the networks we generated are as follows: N = 1000, k = 15, kmax = 50, mut = muw = 0.1–0.9, minc = 20, maxc = 50, beta = 0.6, t1 = 2, t2 = 1. The results for different mut in these LFR networks [26] are shown in Fig. 7. They show that in both accuracy and NMI, K-rank and K-means are the best of all the methods, especially when mut is large. Infomap and BGLL give bad results when mut ≥ 0.7, and OSLOM's performance drops when mut ≥ 0.8. LPA has the worst results because it does not work when mut ≥ 0.6. Similarly, K-means suffers from instability when mut is small (Fig. 7(a)), though it is better than K-rank when mut ≥ 0.7. We should note that the methods we compare with are the extended versions for weighted and directed networks, but their results leave room for improvement. The proposed K-rank algorithm uses vertex similarities and PageRank, which incorporate both the edges' weights and their direction naturally, so the results show that it outperforms the other methods and is better suited to weighted and directed networks.
In order to test our algorithm on overlapping networks, we compare K-rank with four other overlapping community detection algorithms, CFinder [27], CONGA [28], COPRA [29] and LFM [25], on LFR networks. LFR networks with overlapping communities need two new parameters: the number of overlapping vertices (on) and the number of memberships of the overlapping vertices (om). In our experiments, we generate four types of artificial networks; their parameters are as follows:
Table 2
Results of the comparison in networks with 8 overlapping vertices.
Methods Zin
16 15 14 13 12 11 10 9 8
Qov
K-rank 0.75 0.75 0.75 0.75 0.75 0.75 0.74 0.74 0.70
CFinder 0.74 0.74 – 0.07 0.06 0.06 0.06 0.07 0.06
CONGA 0.75 0.64 0.35 0.63 0.28 0.12 0.11 0.08 0.07
COPRA 0.74 0.74 0.74 0.74 0.74 0.74 0.00 0.00 0.00
LFM 0.74 0.74 0.74 0.74 0.74 0.73 0.73 – –
Accuracy
K-rank 1.00 1.00 1.00 1.00 1.00 1.00 0.92 0.86 0.51
CFinder 0.93 0.93 – 0.22 0.22 0.22 0.22 0.21 0.23
CONGA 1.00 0.63 0.31 0.27 0.22 0.23 0.22 0.22 0.23
COPRA 0.93 0.93 0.93 0.91 0.89 0.77 0.23 0.23 0.23
LFM 0.92 0.92 0.91 0.91 0.83 0.82 0.84 – –
NMI
K-rank 1.00 1.00 1.00 1.00 1.00 1.00 0.78 0.53 0.66
CFinder 0.45 0.57 – 0.47 0.46 0.46 0.47 0.46 0.46
CONGA 1.00 0.43 0.61 0.60 0.50 0.47 0.47 0.47 0.47
COPRA 0.58 0.70 0.72 0.57 0.56 0.42 0.37 0.37 0.37
LFM 0.57 0.57 0.68 0.68 0.43 0.55 0.54 – –
Table 3
Results of the comparison in networks with 16 overlapping vertices.
Methods Zin
16 15 14 13 12 11 10 9 8
Qov
K-rank 0.75 0.75 0.75 0.75 0.74 0.74 0.74 0.72 0.71
CFinder 0.74 – 0.07 – 0.06 – 0.06 – –
CONGA 0.75 0.64 0.32 0.46 0.27 0.15 0.19 0.09 0.06
COPRA 0.74 0.74 0.74 0.74 0.62 0.14 0.00 0.00 0.00
LFM 0.74 0.74 0.74 0.74 0.73 0.73 – – –
Accuracy
K-rank 1.00 1.00 1.00 1.00 0.98 0.98 0.85 0.55 0.32
CFinder 0.87 – 0.21 – 0.21 – 0.21 – –
CONGA 1.00 0.50 0.22 0.21 0.21 0.21 0.21 0.21 0.21
COPRA 0.87 0.85 0.81 0.81 0.45 0.29 0.21 0.21 0.21
LFM 0.85 0.86 0.85 0.85 0.84 0.78 – – –
NMI
K-rank 1.00 1.00 1.00 1.00 0.97 0.97 0.76 0.48 0.39
CFinder 0.44 – 0.46 – 0.47 – 0.46 – –
CONGA 1.00 0.74 0.48 0.49 0.49 0.47 0.47 0.47 0.47
COPRA 0.56 0.56 0.43 0.43 0.41 0.41 0.37 0.37 0.37
LFM 0.43 0.43 0.43 0.43 0.53 0.52 – – –
N = 128, k = 16, kmax = 16, maxc = minc = [38, 44, 56, 80], t1 = 2, t2 = 1, on = [8, 16, 32, 64], om = 4. The mu corresponding to each Zin can be calculated as mu = (k − Zin)/k; for example, Zin = 15 means mu = (16 − 15)/16 = 0.0625.
Accuracy and NMI are used to estimate these methods' performances; besides, in order to evaluate the modularity of the results, we compute Qov [23], a modularity function for overlapping networks.
Based on the different Zin values and the different numbers of overlapping vertices, the five methods' experimental results are shown in Tables 2–5. From the tables we can see that K-rank is the best among all the algorithms in most cases, except that CONGA performs well in a very few cases. In the tables, "–" means that the algorithm cannot find meaningful communities in that case. It should be noted that K-rank is sometimes much better than the other algorithms. As we have discussed, K-rank is based on a clustering technique which needs to compute vertex similarities, and we adopt the signal similarity based on the global network topology, so we obtain global information about every pair of vertices. By contrast, among the other four methods, CONGA [28] splits vertices based on local rules, COPRA [29] adopts the idea of label propagation, LFM [25] optimizes a local fitness function, and CFinder [27] finds all the κ-cliques in the network; these four methods use only local vertex information. We believe that it is this use of global information that makes K-rank highly effective, and the experimental results in this section confirm this conclusion.
Table 4
Results of the comparison in networks with 32 overlapping vertices.
Methods Zin
16 15 14 13 12 11 10 9 8
Qov
K-rank 0.75 0.74 0.74 0.74 0.74 0.73 0.72 0.74 0.73
CFinder – 0.06 – – – – – – –
CONGA 0.61 0.53 0.12 0.23 0.07 0.17 0.08 0.09 0.08
COPRA 0.74 0.73 0.72 0.72 0.05 0.00 0.00 0.00 0.00
LFM 0.74 0.73 0.62 0.73 – – – – –
Accuracy
K-rank 1.00 0.97 0.96 0.90 0.89 0.69 0.61 0.41 0.24
CFinder – 0.18 – – – – – – –
CONGA 0.39 0.31 0.19 0.18 0.18 0.18 0.17 0.18 0.18
COPRA 0.75 0.74 0.59 0.44 0.17 0.18 0.18 0.18 0.18
LFM 0.74 0.70 0.54 0.68 – – – – –
NMI
K-rank 1.00 0.83 0.82 0.67 0.45 0.41 0.39 0.39 0.40
CFinder – 0.46 – – – – – – –
CONGA 0.62 0.60 0.46 0.48 0.47 0.48 0.46 0.47 0.47
COPRA 0.43 0.43 0.41 0.37 0.38 0.37 0.37 0.37 0.37
LFM 0.41 0.41 0.39 0.41 – – – – –
Table 5
Results of the comparison in networks with 64 overlapping vertices.
Methods Zin
16 15 14 13 12 11 10 9 8
Qov
K-rank 0.74 0.74 0.74 0.74 0.74 0.74 0.75 0.75 0.74
CFinder – – – – – – – – –
CONGA 0.14 0.13 0.08 0.11 0.11 0.11 0.11 0.08 0.06
COPRA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
LFM – – – – – – – – 0.56
Accuracy
K-rank 0.91 0.81 0.78 0.60 0.51 0.51 0.50 0.50 0.51
CFinder – – – – – – – – –
CONGA 0.12 0.12 0.13 0.12 0.12 0.11 0.12 0.12 0.12
COPRA 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12
LFM – – – – – – – – 0.11
NMI
K-rank 0.45 0.74 0.42 0.68 0.47 0.57 0.37 0.37 0.38
CFinder – – – – – – – – –
CONGA 0.18 0.37 0.47 0.47 0.18 0.47 0.28 0.47 0.47
COPRA 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37
LFM – – – – – – – – 0.24
Table 6
The real-world networks used in our experiments.
No. Network N E K Ref
In this section, we test the six algorithms on real-world networks. Table 6 lists the networks used, where N is the number of vertices, E is the number of edges and K is the number of communities (prior knowledge). The results
Table 7
Comparison of K -rank and other algorithms in real-world networks.
Methods No.
1 2 3 4 5 6 7 8
Accuracy
K-rank 1.00 0.98 1.00 0.71 0.85 0.93 0.89 0.91
LPA 0.91 0.85 0.77 0.72 0.82 0.80 0.63 0.87
BGLL 0.64 0.85 0.50 0.77 0.72 0.90 0.79 0.88
infomap 0.82 0.85 0.58 0.76 0.78 0.91 0.58 0.75
OSLOM 0.97 0.42 0.87 0.59 0.80 0.91 0.88 0.85
K-means 1.00 0.98 1.00 0.62 0.85 0.90 0.82 –
NMI
K-rank 1.00 0.96 1.00 0.80 0.57 0.92 0.51 0.94
LPA 0.64 0.90 0.60 0.74 0.53 0.86 0.29 0.94
BGLL 0.58 0.94 0.46 0.81 0.51 0.93 0.33 0.95
infomap 0.69 0.94 0.53 0.78 0.53 0.92 0.29 0.90
OSLOM 0.84 0.42 0.55 0.64 0.55 0.91 0.50 0.93
K-means 1.00 0.96 1.00 0.66 0.55 0.90 0.38 –
of the different algorithms on the real-world networks are shown in Table 7. From the table we can see that K-rank outperforms the other algorithms in most cases, though BGLL and K-means sometimes perform well. "–" means that K-means cannot find meaningful communities in the larger PPI network due to a bad choice of initial seeds. We think K-rank works better on these real-world networks largely because of the clustering framework and the high quality of the vertex similarities; besides, the initial-seed choice strategy makes K-rank more effective than K-means.
5. Conclusion
In this paper, we proposed an efficient clustering algorithm, K-rank, based on rank centrality. Similar to the K-means algorithm, the proposed method first finds K seeds which have the largest rank centrality in the network, and then updates these seeds by an iterative technique. K-rank can be easily extended to directed, weighted and overlapping networks. Besides, K-rank is much faster than K-means++ when choosing the initial seeds and can avoid producing empty clusters during the iterative process. The results on synthetic and real-world networks have shown that our method is more efficient than the state-of-the-art algorithms.
Acknowledgments
We thank Bian-Fang Chai, Ya-Fang Li and Li-Yan Ma for their spelling and grammar checks. This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 60905029, 81230086), the Beijing Natural Science Foundation (Grant No. 4112046) and the Fundamental Research Funds for the Central Universities.
References
[1] D.J. Watts, S.H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature 393 (1998) 440.
[2] A.L. Barabási, R. Albert, Emergence of scaling in random networks, Science 286 (1999) 509.
[3] M. Girvan, M.E.J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99 (2002) 7821.
[4] M.E.J. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69 (2004) 026113.
[5] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, Berkeley, 1967, p. 281.
[6] B.J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (2007) 972.
[7] B. Yang, D.Y. Liu, et al., Complex network clustering algorithms, J. Softw. 20 (2009) 54.
[8] C. Ding, X. He, H.D. Simon, On the equivalence of nonnegative matrix factorization and spectral clustering, in: Proceedings of the Fifth SIAM International Conference on Data Mining, Newport Beach, CA, 2005.
[9] A. Clauset, M.E.J. Newman, C. Moore, Finding community structure in very large networks, Phys. Rev. E 70 (2004) 066111.
[10] M.E.J. Newman, Finding community structure in networks using the eigenvectors of matrices, Phys. Rev. E 74 (2006) 036104.
[11] V.D. Blondel, J. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, J. Stat. Mech. (2008) 10008.
[12] M.J. Barber, J.W. Clark, Detecting network communities by propagating labels under constraints, Phys. Rev. E 80 (2009) 026129.
[13] A. Lancichinetti, F. Radicchi, J.J. Ramasco, S. Fortunato, Finding statistically significant communities in networks, PLoS One 6 (2011) 0018961.
[14] M. Rosvall, C. Bergstrom, Maps of random walks on complex networks reveal community structure, Proc. Natl. Acad. Sci. USA 105 (2008) 1118.
[15] U.N. Raghavan, R. Albert, S. Kumara, Near linear time algorithm to detect community structures in large-scale networks, Phys. Rev. E 76 (2007) 036106.
[16] S. Fortunato, M. Barthelemy, Resolution limit in community detection, Proc. Natl. Acad. Sci. USA 104 (2007) 36.
[17] D. Arthur, S. Vassilvitskii, k-means++: the advantages of careful seeding, in: Proceedings of the Eighteenth Annual ACM–SIAM Symposium on Discrete Algorithms, 2007, p. 1027.
[18] Linyuan Lü, Tao Zhou, Link prediction in complex networks: a survey, Physica A 390 (2011) 1150.
[19] H. Yanqing, L. Menghui, Z. Peng, et al., Community detection by signaling on complex networks, Phys. Rev. E 78 (2008) 016115.
[20] D. Steinley, M.J. Brusco, Initializing K -means batch clustering: a critical evaluation of several techniques, J. Classification 24 (2007) 99.
[21] L. Page, S. Brin, R. Motwani, T. Winograd, The pagerank citation ranking: bringing order to the web, Technical Report, Stanford University, 1998.
[22] A. Li, Fuzzy Mathematics and Application, Metallurgical Industry Press, Beijing, 2005.
[23] V. Nicosia, G. Mangioni, V. Carchiolo, M. Malgeri, Extending the definition of modularity to directed graphs with overlapping communities, J. Stat.
Mech. (2009) 03024.
[24] L. Danon, A. Daz-Guilera, J. Duch, A. Arenas, Comparing community structure identification, J. Stat. Mech. (2005) 09008.
[25] A. Lancichinetti, S. Fortunato, J. Kertész, Detecting the overlapping and hierarchical community structure in complex networks, New J. Phys. 11 (2009)
033015.
[26] A. Lancichinetti, S. Fortunato, F. Radicchi, Benchmark graphs for testing community detection algorithms, Phys. Rev. E 78 (2008) 046110.
[27] G. Palla, I. Dernyi, I. Farkas, et al., Uncovering the overlapping community structure of complex networks in nature and society, Nature 435 (2005)
814.
[28] S. Gregory, An algorithm to find overlapping community structure in networks, in: PKDD, vol. 4702, 2007, p. 91.
[29] S. Gregory, Finding overlapping communities in networks by label propagation, New J. Phys. 12 (2010) 103018.
[30] W.W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33 (1977) 452.
[31] L. Donetti, M.A. Muñoz, Detecting network communities: a new systematic and efficient algorithm, J. Stat. Mech. (2004) P10012.
[32] D. Lusseau, K. Schneider, O.J. Boisseau, P. Haase, E. Slooten, S.M. Dawson, The bottlenose dolphin community of doubtful sound features a large
proportion of long-lasting associations, Behavioral Ecology and Sociobiology 54 (2003) 396.
[33] D.E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley, Reading, MA, 1993.
[34] V. Krebs, (unpublished). http://www.orgnet.com/.
[35] L.A. Adamic, N. Glance, The political blogosphere and the 2004 US election: divided they blog, in: Proceedings of the International Workshop on Link
Discovery, 2005, p. 36.
[36] J. Vlasblom, S.J. Wodak, Markov clustering versus affinity propagation for the partitioning of protein interaction graphs, BMC Bioinformatics 10 (2009)
99.