An efficient algorithm for approximate betweenness centrality computation

ABSTRACT
Betweenness centrality is an important centrality measure widely used in social network analysis, route planning, etc. However, even for mid-size networks, it is practically intractable to compute exact betweenness scores. In this paper, we propose a generic randomized framework for unbiased approximation of betweenness centrality. The proposed framework can be adapted with different sampling techniques to give diverse methods. We discuss the conditions a promising sampling technique should satisfy to minimize the approximation error and present a sampling method partially satisfying the conditions. We perform extensive experiments and show the high efficiency and accuracy of the proposed method.

Categories and Subject Descriptors
G.2.2 [Discrete Mathematics]: Graph Theory—Graph algorithms, Path and circuit problems; E.1 [Data]: Data structures—Graphs and networks

General Terms
Theory

Keywords
Centrality, betweenness centrality, social network analysis, approximate algorithms.

1. INTRODUCTION
Betweenness centrality of a vertex, introduced by Linton Freeman [6], is defined as the number of shortest paths (geodesic paths) from all (source) vertices to all others that pass through that vertex. He used it as a measure for quantifying the control of a human over the communication between other humans in a social network [6]. Betweenness centrality is also used in some well-known algorithms for clustering and community detection in social and information networks [8].

Although betweenness centrality computation is tractable in theory, in the sense that there exist polynomial time and space algorithms, the most efficient existing exact method is Brandes's algorithm [3], which requires O(nm) time for unweighted graphs and O(nm + n² log n) time for weighted graphs (n is the number of vertices and m is the number of edges in the network). Therefore, exact betweenness centrality computation is not practically applicable, even for mid-size networks. The next piece of bad news is that computing the exact betweenness centrality of a single vertex is no easier than computing the betweenness centrality of all vertices, so the above-mentioned worst-case bounds hold even if one wants to compute the betweenness centrality of only one or a few vertices. Yet in many applications it might be required to compute the betweenness centrality of only one or a few vertices. For example, the index might be computed only for core vertices of communities in social/information networks [12], or hubs in communication networks.

To make betweenness centrality practically computable, several algorithms have been proposed in recent years for approximate betweenness centrality computation [4], [1] and [7]. Existing algorithms fall into one of the following categories.

1. Some algorithms, like [4] and [7], try to approximate the betweenness centrality of all vertices in the network. For such methods, the value computed for every vertex is not of high importance. Instead, the main goal is to correctly estimate the relative rank of all vertices.

2. Some others, like the method presented in [1], aim to approximate the betweenness centrality of a single vertex (or a few vertices) in time faster than computing the betweenness centrality of all vertices. For such methods, the accuracy of the estimated betweenness centrality is important.

Our focus in this paper is the second category of algorithms, i.e. we aim at developing an efficient and accurate algorithm for betweenness centrality computation of a single vertex. In this paper, we propose a generic randomized framework for unbiased approximation of betweenness centrality. In the proposed framework, a source vertex i is selected by some strategy, single-source betweenness scores of all vertices on i are computed, and the scores are scaled as estimations of betweenness centralities. While our method might seem similar to the method of Brandes and Pich [4], they have a key difference. In the method of [4], single-source betweenness scores are computed for a few source vertices and extrapolated for the rest; betweenness centralities are then the sum of all single-source betweenness scores (which are either computed or extrapolated). In our method, single-source betweenness scores are computed for one single source chosen randomly, and the obtained scores are scaled as estimations of betweenness centralities. We discuss the condition a promising sampling technique should satisfy to minimize the approximation error for a single vertex.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CIKM'13, October 27-November 1, 2013, Burlingame, CA, USA.
Copyright 2013 ACM 978-1-4503-2263-8/13/10 ...$15.00.
http://dx.doi.org/10.1145/2505515.2507826
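The scaling step described above is plain inverse-probability weighting; a minimal numeric sketch with made-up single-source scores and probabilities (powers of two, so the arithmetic is exact; none of these numbers come from the paper) shows why a single scaled score is unbiased:

```python
# Hypothetical single-source scores and sampling probabilities,
# chosen as exact binary fractions; NOT data from the paper.
delta = [4.0, 1.0, 0.0, 3.0]     # delta_i(v) for sources i = 0..3
p = [0.5, 0.125, 0.25, 0.125]    # sampling probabilities, sum to 1

exact = sum(delta)               # the quantity we want to estimate
# One sample: pick i with probability p[i], report delta[i] / p[i].
# Its expectation is sum_i p[i] * (delta[i] / p[i]) = sum_i delta[i].
expectation = sum(p[i] * (delta[i] / p[i]) for i in range(len(p)))
print(exact, expectation)        # 8.0 8.0
```

The estimate from one sample can be far from the true sum, but on average it is correct, which is exactly the unbiasedness property the framework relies on.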
Since it might be computationally expensive to find such a sampling, we propose a sampling technique which partially satisfies the condition. While the algorithm of [1] is intuitively presented for high centrality vertices, in our method the sampling technique can be revised to optimize itself for both high centrality and low centrality vertices. The proposed method can also be used to compute similar centrality notions, like stress centrality, which is likewise based on counting shortest paths. We perform extensive experiments on real-world networks from different domains, and show that, compared to existing exact and inexact algorithms, our method works with higher accuracy or gives significant speedups.

The rest of this paper is organized as follows. In Section 2, preliminaries and definitions related to betweenness centrality computation are given. In Section 3, we present a generic randomized algorithm for betweenness centrality computation. In Section 4, we discuss sampling methods. We empirically evaluate the proposed method in Section 5 and show its efficiency and high accuracy. Finally, the paper is concluded in Section 6.

2. PRELIMINARIES
Throughout the paper, G refers to a graph (network). For simplicity, we suppose G is a connected and loop-free graph without multi-edges. Throughout the paper, we assume G is an unweighted graph, unless it is explicitly mentioned that G is weighted. V(G) and E(G) refer to the set of vertices and the set of edges of G, respectively. Throughout the paper, n denotes |V(G)| and m denotes |E(G)|. For an edge e = (u, v) ∈ E(G), u and v are the two endpoints of e. A shortest path (also called a geodesic path) between two vertices u, v ∈ V(G) is a path whose size is minimum among all paths between u and v. For two vertices u, v ∈ V(G), we use d(u, v) to denote the size (the number of edges) of a shortest path connecting u and v. By definition, d(u, u) = 0 and d(u, v) = d(v, u). For s, t ∈ V(G), σ_st denotes the number of shortest paths between s and t, and σ_st(v) denotes the number of shortest paths between s and t that also pass through v. We have σ_s(v) = Σ_{t ∈ V(G)\{s,v}} σ_st(v). Betweenness centrality of a vertex v is defined as:

    BC(v) = Σ_{s,t ∈ V(G)\{v}} σ_st(v) / σ_st    (1)

A notion widely used for counting the number of shortest paths in a graph is the directed acyclic graph (DAG) containing all shortest paths starting from a vertex s (see e.g. [3]). In this paper, we refer to it as the shortest-path-DAG, or SPD for short, rooted at s. For every vertex s in a graph G, the SPD rooted at s is unique, and it can be computed in O(m) time for unweighted graphs and in O(m + n log n) time for weighted graphs [3]. In [3], the author introduced the notion of the dependency score of a vertex s ∈ V(G) on a vertex v ∈ V(G) \ {s}, defined as:

    δ_s•(v) = Σ_{t ∈ V(G)\{v,s}} σ_st(v) / σ_st    (2)

We have: BC(v) = Σ_{s ∈ V(G)\{v}} δ_s•(v). As mentioned in [3], given the SPD rooted at s, the dependency scores of s on all other vertices can be computed in O(m) time.

3. A GENERIC RANDOMIZED ALGORITHM
First, the following probabilities are computed:

    p_1, p_2, ..., p_n ≥ 0 such that Σ_{i=1}^{n} p_i = 1    (3)

Then, at every iteration t of the loop in Lines 8-15 of Algorithm 1: (I) a vertex i ∈ {1, ..., n} is selected with probability p_i, (II) the SPD rooted at i is computed, (III) the dependency score of vertex v on i, δ_i•(v), is computed, and (IV) δ_i•(v) / p_i is the estimation of BC(v) in iteration t. The average of the betweenness centralities calculated in different iterations is returned as the estimation of betweenness centrality.

Algorithm 1 High level pseudo code of the algorithm for approximate betweenness centrality computation.
1:  APPROXIMATEBETWEENNESS
2:  Require: A network (graph) G, the number of samples T.
3:  Ensure: Betweenness centrality of vertices of G.
4:  Compute probabilities p_1, ..., p_n
5:  for all vertices v ∈ V(G) do
6:      B[v] ← 0
7:  end for
8:  for t = 1 to T do
9:      Select a vertex i with probability p_i
10:     Form the SPD D rooted at i
11:     Compute the dependency score of every vertex v on i
12:     for all vertices v ∈ V(G) do
13:         B[v] ← B[v] + δ_i•(v) / p_i
14:     end for
15: end for
16: for all i ∈ {1, ..., n} do
17:     B[i] ← B[i] / T
18: end for
19: return B

LEMMA 1. In Algorithm 1, for every vertex v we have

    E(B[v]) = BC(v)    (4)

and

    Var(B[v]) = (1/T) Σ_{i=1}^{n} δ_i•(v)² / p_i − BC(v)² / T    (5)

PROOF. We have:

    E(B[v]) = (1/T) · T · Σ_{i=1}^{n} p_i · (δ_i•(v) / p_i) = Σ_{i=1}^{n} δ_i•(v) = BC(v)

and, writing B_t[v] for the estimate of a single iteration t,

    Var(B_t[v]) = E(B_t[v]²) − E(B_t[v])² = Σ_{i=1}^{n} δ_i•(v)² / p_i − BC(v)²

Since B[v] is the average of T independent copies of B_t[v],

    Var(B[v]) = (1/T) Σ_{i=1}^{n} δ_i•(v)² / p_i − BC(v)² / T    (6)
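A runnable sketch of Algorithm 1 together with the dependency-score computation of [3] for unweighted graphs. The adjacency-dict graph representation, the function names, and the use of Python's random.choices are our illustrative assumptions, not the paper's implementation:

```python
import random
from collections import defaultdict, deque

def dependency_scores(graph, s):
    """delta_{s.}(v) for all v, via the BFS and back-propagation
    scheme of [3]; graph maps each vertex to its neighbours."""
    sigma = defaultdict(int)            # shortest-path counts sigma_{sv}
    sigma[s] = 1
    dist = {s: 0}
    preds = defaultdict(list)           # predecessors in the SPD rooted at s
    order = []                          # vertices in non-decreasing distance
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in graph[u]:
            if w not in dist:           # first time w is reached
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:  # (u, w) is an edge of the SPD
                sigma[w] += sigma[u]
                preds[w].append(u)
    delta = defaultdict(float)
    for w in reversed(order):           # accumulate from the leaves up
        for u in preds[w]:
            delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
    delta[s] = 0.0                      # the source itself gets no score
    return delta

def approximate_bc(graph, p, T, rng=random):
    """Algorithm 1: average T inverse-probability-scaled dependency scores."""
    vertices = list(graph)
    B = {v: 0.0 for v in vertices}
    weights = [p[v] for v in vertices]
    for _ in range(T):
        i = rng.choices(vertices, weights=weights)[0]  # source with prob. p_i
        delta = dependency_scores(graph, i)
        for v in vertices:
            B[v] += delta[v] / p[i]
    return {v: B[v] / T for v in vertices}
```

By Lemma 1 the estimator is unbiased for any valid choice of the p_i; the choice of p_i only affects the variance.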
Note that instead of taking T samples, we can define a criterion for the termination of the loop in Lines 8-15 of Algorithm 1. For example, similar to the algorithm of [1], we can stop when B[v] ≥ cn for some constant c.

4. SAMPLING METHODS
Suppose that we want to approximate the betweenness centrality of a vertex v. The following lemma defines the probabilities minimizing the approximation error.

LEMMA 2. If in Algorithm 1 source vertices i are selected with probabilities

    p_i = δ_i•(v) / Σ_{j=1}^{n} δ_j•(v)    (7)

the approximation error (i.e. the variance of B[v]) is minimized. In this case, the variance of B[v] will be 0.¹

PROOF. Omitted due to lack of space.

¹ What this lemma suggests somehow contradicts the source vertex selection procedure presented in [7]. In the method of [7], the scheme for aggregating dependency scores is changed so that vertices do not profit from being near the selected source vertices. However, Lemma 2 says it is better to select source vertices based on their dependency scores on v, which, as we will see later, might result in preferring source vertices that are closer to v. The reason for this contradiction is that while here we aim at precisely approximating the betweenness centrality of some specific vertex v, the method of [7] aims to rank all vertices based on their betweenness scores.

Therefore, using the probabilities p_i defined in Equation 7 gives an exact method, in the sense that it makes the approximation error 0. However, the time complexity of computing the optimal p_i's is the same as that of exact betweenness centrality computation. Although it is not practically efficient to use the probabilities p_i defined in Equation 7, they can help us define the properties of a good sampling. From Equation 7, we can conclude that in a good sampling, for every two vertices i and i′, the following must hold:

    p_i ≥ p_{i′} ⇔ δ_i•(v) ≥ δ_{i′}•(v)    (8)

which means vertices with higher dependency scores on v must be selected with a higher probability.

However, finding probabilities p_1, p_2, ..., p_n which satisfy Equation 8 might be computationally expensive, since it requires computing the dependency scores of all vertices on v, which is as bad as computing the dependency scores of every source vertex on all vertices. In order to design practically efficient sampling techniques, we consider relaxations of Equation 8. Consider two vertices i and i′ such that d(i′, v) > d(i, v). If in the SPD rooted at v there exists an ancestor-descendant relationship between i and i′, and i is the only ancestor of i′ at level d(i, v), then it can be shown that for k ∈ {i, i′}, the probability p_k defined as

    p_k = (1 / d(k, v)) / Σ_{j=1}^{n} (1 / d(j, v))    (9)

satisfies Equation 8.

The positive aspect of the sampling technique presented in Equation 9 is that it only needs the distance between vertex v and every vertex in the graph: the single-source shortest path, or SSSP for short, problem. For unweighted graphs, this problem can be solved in O(m) time, and for weighted graphs, using a Fibonacci heap, it is solvable in O(m + n log n) time [5]. This means that the sampling method presented in Equation 9 is practically efficient.

Therefore, with the probabilities p_i defined in Equation 9, a vertex i is selected, the dependency score δ_i•(v) is computed, and the result is scaled. For unweighted graphs, this gives an O(Tm) time algorithm for approximate betweenness centrality computation. For weighted graphs (with positive weights), the time complexity of the approximation algorithm will be O(Tm + Tn log n).

5. EXPERIMENTAL RESULTS
Our experiments were done on one core of a single AMD Processor 270 clocked at 2.0 GHz with 8 GB main memory and 2 × 1 MB L2 cache, running Ubuntu Linux 12.0. The program was compiled by the GNU C++ compiler 4.0.2 using optimization level 3. We compare our proposed method with the algorithm presented in [1]. As mentioned earlier, methods like [4] and [7] aim to rank vertices based on betweenness scores (and the betweenness score of an individual vertex is not very important for them); therefore, they are not suitable for our comparisons. We refer to the algorithm of [1] as uniform sampling, since it chooses source vertices uniformly at random. We refer to our proposed method as distance-based sampling. We also compare the methods with Brandes's algorithm for exact betweenness centrality computation [3].

We performed extensive experiments on different real-world networks to assess the quantitative and qualitative behavior of the proposed algorithm. We used two DBLP citation networks, dblp0305 and dblp0507 [2], the Wiki-Vote social network [9], the Enron-Email communication network [10], and the CA-CondMat collaboration network [11]. Table 1 summarizes the specifications of our real-world networks.

Table 1: Summary of real-world networks.
Dataset       # vertices   # edges    Avg. degree
dblp-0305     109,044      233,961    4.29
dblp-0507     135,116      290,363    4.28
Enron-Email   36,692       367,662    20.04
Wiki-Vote     7,115        103,689    29.14
CA-CondMat    23,133       93,497     8.08

For a vertex v, the empirical betweenness approximation error (which is reported in the experiments) is defined as:

    err(v) = |App(v) − BC(v)| / BC(v) × 100    (10)

where App(v) is the computed approximate score.

In our experiments, we consider several vertices of a dataset and for every vertex we compute distance-based probabilities, exact betweenness scores, and approximate betweenness scores using distance-based and uniform samplings. Table 2 summarizes the average results (i.e. the sum of the results of all vertices divided by their number) obtained for different datasets. In all experiments, for both uniform and distance-based samplings, the number of samples (the number of source vertices selected) is 10% of the number of vertices in the network. Therefore, the running time of the approximate methods is around 10% of the running time of the exact method.

The first dataset is Wiki-Vote, which is dense (its average degree is 29.14). For most vertices of Wiki-Vote, distance-based sampling gives a better approximation. The extra time needed by distance-based sampling to compute the required shortest path distances is quite tiny and ignorable compared to the running time of the whole process. Since for all datasets it is a tiny value varying in different runs, we report an upper bound for it.
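The distance-based probabilities of Equation 9 used in these experiments need only one SSSP computation rooted at v; a sketch for unweighted graphs, where giving v itself (distance 0) probability 0 is our assumption, since the text leaves that case implicit:

```python
from collections import deque

def distance_based_probabilities(graph, v):
    """Equation 9: p_k proportional to 1/d(k, v), from one BFS rooted at v.
    Assumption: v itself (d = 0) gets probability 0."""
    dist = {v: 0}
    queue = deque([v])
    while queue:                        # plain BFS: unweighted distances to v
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    weight = {k: (1.0 / d if d > 0 else 0.0) for k, d in dist.items()}
    total = sum(weight.values())
    return {k: w / total for k, w in weight.items()}
```

On a path a–b–c–d with v = b, this yields p_a = p_c = 0.4 and p_d = 0.2, so closer vertices are indeed preferred, in line with Equation 8.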
Table 2: Comparing average approximation error and average running time of uniform sampling, distance-based sampling, and exact method, for single vertices in different datasets.

                  Exact BC                        Approximate BC
                                                  Distance-based sampling                    Uniform sampling
Database          Avg. BC score   Avg. time (sec)  Avg. error (%)  Avg. dist. comp. time (sec)  Avg. error (%)  Avg. time (sec)  # iterations
Wiki-Vote         76056.85        515.09           37.0            <1                           41.13           46.05            10%
Email-Enron       2775100.8       9033.11          15.75           <1                           25.28           925.80           10%
dblp0305          564246.41       19149.8          7.59            <2                           64.73           1747.15          10%
dblp0507          798125.00       35140            7.19            <2                           50.17           2863.82          10%
CA-CondMat        691667          3026.9           10.8            <1                           20.81           315.3            10%
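The per-vertex errors averaged in Table 2 follow Equation 10; a trivial sketch, with hypothetical scores rather than values from the table:

```python
def approximation_error(app_v, bc_v):
    """err(v) of Equation 10: relative approximation error, in percent."""
    return abs(app_v - bc_v) / bc_v * 100.0

# Hypothetical numbers, only to show the formula's use:
print(approximation_error(63.0, 50.0))  # ~26.0
```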
For the Wiki-Vote dataset, it is always smaller than 1 second. The next dataset is Email-Enron. Compared to Wiki-Vote, it is less dense (but still dense) and larger. Over this dataset, the approximation error of distance-based sampling is better than that of uniform sampling.

Dblp0305 and dblp0507 are large and relatively sparse datasets. As reflected in Table 2, over these datasets distance-based sampling works much better than uniform sampling. This means that on sparse datasets the difference between the approximation quality of the two methods is more considerable. This has several reasons. The first reason is that in very dense datasets, many vertices have the same (and small) distance from v (v is the vertex whose betweenness centrality is approximated). Therefore, distance-based sampling becomes closer to uniform sampling. The second reason is that in sparse networks, in the SPD rooted at v, the probability that a vertex i has only one ancestor at some level k is higher than this probability in dense graphs. Figure 1 compares these two situations. It means that in sparse networks, distance-based sampling is closer to the optimal sampling, because with distance-based sampling a larger number of vertices will satisfy the condition expressed in Equation 8. As a result, on sparse networks, distance-based sampling becomes much more effective than uniform sampling.

Figure 1: In sparse graphs, distance-based sampling is closer to optimal sampling. The graph on the left shows an SPD in a dense graph, and the graph on the right shows an SPD in a sparse graph.

Finally, the methods were compared on the CA-CondMat dataset, which contains scientific collaborations between authors of papers submitted to the Condensed Matter category [11]. The average degree in this dataset is 8.08, which means it is denser than dblp0305 and dblp0507, but less dense than Wiki-Vote and Email-Enron. Over this dataset, the approximation error of uniform sampling is almost twice the approximation error of distance-based sampling.

6. CONCLUSION
In this paper, we presented a generic randomized framework for unbiased approximation of betweenness centrality. In the proposed framework, a source vertex i is selected by some strategy, single-source betweenness scores of all vertices on i are computed, and the scores are scaled as estimations of betweenness centralities. Our proposed framework can be adapted with different sampling techniques to give diverse methods for approximating betweenness centrality. We discussed the conditions a promising sampling technique should satisfy to minimize the approximation error, and proposed a sampling technique which partially satisfies the conditions. Our experiments show the high efficiency and quality of the proposed method.

7. ACKNOWLEDGMENTS
This work was partially supported by ERC Starting Grant 240186 "MiGraNT: Mining Graphs and Networks: a Theory-based approach".

8. REFERENCES
[1] D. A. Bader, S. Kintali, K. Madduri and M. Mihail. Approximating Betweenness Centrality. WAW, 124-137, 2007.
[2] M. Berlingerio, F. Bonchi, B. Bringmann and A. Gionis. Mining Graph Evolution Rules. ECML/PKDD (1), 115-130, 2009.
[3] U. Brandes. A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology, 25(2), 163-177, 2001.
[4] U. Brandes and C. Pich. Centrality Estimation in Large Networks. Intl. Journal of Bifurcation and Chaos, 17(7), 2303-2318, 2007.
[5] M. L. Fredman and R. E. Tarjan. Fibonacci Heaps and Their Uses in Improved Network Optimization Algorithms. Journal of the ACM, 34(3), 596-615, 1987.
[6] L. C. Freeman. A Set of Measures of Centrality Based upon Betweenness. Sociometry, 40, 35-41, 1977.
[7] R. Geisberger, P. Sanders and D. Schultes. Better Approximation of Betweenness Centrality. ALENEX, 90-100, 2008.
[8] M. Girvan and M. E. J. Newman. Community Structure in Social and Biological Networks. Proc. Natl. Acad. Sci. USA, 99, 7821-7826, 2002.
[9] J. Leskovec, D. Huttenlocher and J. Kleinberg. Signed Networks in Social Media. CHI, 2010.
[10] J. Leskovec, K. Lang, A. Dasgupta and M. Mahoney. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics, 6(1), 29-123, 2009.
[11] J. Leskovec, J. Kleinberg and C. Faloutsos. Graph Evolution: Densification and Shrinking Diameters. ACM Transactions on Knowledge Discovery from Data (ACM TKDD), 1(1), 2007.
[12] Y. Wang, Z. Di and Y. Fan. Identifying and Characterizing Nodes Important to Community Structure Using the Spectrum of the Graph. PLoS ONE, 6(11), e27418, 2011.