Engineering Parallel Algorithms For Community Detection in Massive Networks

arXiv:1304.4453v4 [cs.DC] 2 Feb 2015

Abstract: The amount of graph-structured data has recently experienced an enormous growth in many applications. To transform such data into useful information, fast analytics algorithms and software tools are necessary. One common graph analytics kernel is disjoint community detection (or graph clustering). Despite extensive research on heuristic solvers for this task, only few parallel codes exist, although parallelism will be necessary to scale to the data volume of real-world applications. We address the deficit in computing capability by a flexible and extensible community detection framework with shared-memory parallelism. Within this framework we design and implement efficient parallel community detection heuristics: A parallel label propagation scheme; the first large-scale parallelization of the well-known Louvain method, as well as an extension of the method adding refinement; and an ensemble scheme combining the above. In extensive experiments driven by the algorithm engineering paradigm, we identify the most successful parameters and combinations of these algorithms. We also compare our implementations with state-of-the-art competitors. The processing rate of our fastest algorithm often reaches 50M edges/second. We recommend the parallel Louvain method and our variant with refinement as both qualitatively strong and fast. Our methods are suitable for massive data sets with billions of edges.¹

Keywords: Disjoint community detection, graph clustering, parallel Louvain method, parallel algorithm engineering, network analysis

¹ A preliminary version of this paper appeared in Proceedings of the 42nd International Conference on Parallel Processing (ICPP 2013) [35].

I. INTRODUCTION

The data volume produced by electronic devices is growing at an enormous rate. Important classes of such data can be modeled by complex networks, which are increasingly used to represent phenomena as varied as the WWW, social relations, and brain topology. The resulting graph data sets can easily reach billions of edges for many relevant applications. Analyzing data of this volume in near real-time challenges the state of the art in terms of hardware, software, and algorithms. A particular challenge is not only the amount of data, but also its structure. Complex networks have topological features which pose computational challenges different from traditional HPC applications: In a scale-free network, the presence of a few high-degree nodes (hubs) among many low degree nodes generates load balancing issues. In a small-world network, the entire graph can be visited in only a few hops from any source node, which negatively affects cache performance. To enable network analysis methods to scale, we need algorithmic methods that harness parallelism and apply specifically to complex networks.

In this work, we deal with the task of community detection (also known as graph clustering) in large networks, i.e. the discovery of dense subgraphs. Among manifold applications, community detection has been used to counteract search engine rank manipulation [32], to discover scientific communities in publication databases [34], to identify functional groups of proteins in cancer research [15], and to organize content on social media sites [12]. So far, extensive research on community detection in networks has given rise to a variety of definitions of what constitutes a good community and a variety of methods for finding such communities, many of which are described in surveys by Schaeffer [32] and Fortunato [10]. Among these definitions, the lowest common denominator is that a community is an internally dense node set with sparse connections to the rest of the graph. While it can be argued that communities can overlap, we restrict ourselves to finding disjoint communities, i.e. a partition of the node set which uniquely assigns a node to a community. The quality measure modularity [14] formalizes the notion of a good community detection solution by comparing its coverage (fraction of edges within communities) to an expected value based on a random edge distribution model which preserves the degree distribution. Modularity is not without flaws (like the resolution limit [11], which can be partially overcome by different techniques [2], [18], [23]) nor alternatives [37], but has emerged as a well-accepted measure of community quality. This makes modularity our measure of choice. While optimizing modularity is NP-hard [6], efficient heuristics have been introduced which explicitly increase modularity.

For graphs with millions to billions of edges, only (near) linear-time community detection algorithms are practical. Several fast methods have been developed in recent years. Yet, there is a lack of research in adapting these methods to take advantage of parallelism. A recent attempt at assessing the state of the art in community detection was the 10th DIMACS Implementation Challenge on Graph Partitioning and Graph Clustering [1]. DIMACS challenges are scientific competitions in which the participants solve problems from a specified test set, with the aim of high solution quality and high speed. Only two of the 15 submitted implementations for modularity optimization relied on parallelism and only very few could handle graphs with billions of edges in reasonable time.

Accordingly, our objective is the development and implementation of parallel community detection heuristics which are able to handle massive graphs quickly while also producing a high-quality solution. In the following, the competitors of the DIMACS challenge will be used for a comparative experimental study. In the design of such heuristics, we necessarily trade off solution quality against running time. The DIMACS challenge also showed that there is no consensus on what running times are acceptable and how desirable an increase in the third decimal place of modularity is. We therefore need to clarify our design goals as follows: In the comparison with other proposed methods, we want to place our algorithms on the Pareto frontier so that they are not dominated, i.e. surpassed in speed and quality at the same time. Secondly, we target a usage scenario: Our algorithms should be suitable as part of interactive data analysis workflows, performed by a data analyst operating a multicore workstation. Networks with billions of edges should be processed in minutes rather than hours, and the solution quality should be competitive with the results of well-established sequential methods.

We implement three standalone parallel algorithms: Label propagation [28] is a simple procedure where nodes adopt the community assignment (label) which is most frequent among their neighbors until stable communities emerge. We implement a parallel version of the approach as the PLP algorithm. The Louvain method [5] is a multilevel technique in which nodes are repeatedly moved to the community of a neighbor if modularity can be improved. We are the first to present a parallel implementation of the method for large inputs, named PLM. We also extend the method by adding a refinement phase on every level, which yields the PLMR algorithm. In addition to these basic algorithms, we also implement a two-phase approach that combines them. It is inspired by ensemble learning, in which the output of several weak classifiers is combined to form a strong one. In our case, multiple base algorithms run in parallel as an ensemble. Their solutions are then combined to form the core communities, representing the consensus of all base algorithms. The graph is coarsened according to the core communities, and then assigned to a single final algorithm. Within this extensible framework, which we call the ensemble preprocessing method (EPP), we apply PLP as base algorithms and PLMR as the final algorithm.

With our shared-memory parallel implementation of community detection by label propagation (PLP), we provide an extremely fast basic algorithm that scales well with the number of processors (considering the heterogeneous structure of the input). The processing rate of PLP reaches 50M edges per second for large graphs, making it suitable for massive data sets. With PLM, we present the first parallel implementation of the Louvain community detection method for massive inputs, and demonstrate that it is both fast and qualitatively strong. We show that solution quality can be further improved by extending the method with a refinement phase on every level of the hierarchy, yielding the PLMR algorithm. The EPP ensemble algorithm can yield a good quality-speed tradeoff on some instances when an even lower time to solution is desired. In comparative experiments, our implementations perform well in comparison to other state-of-the-art algorithms (Sec. V-E and V-F): Three of our algorithms are on the Pareto frontier.

Our community detection software framework, written in C++, is flexible, extensible, and supports rapid iteration between design, implementation and testing required for algorithm engineering [25]. In this work, we focus on specific configurations of algorithms, but future combinations can be quickly evaluated. We distribute our community detection code as a component of NetworKit [36], our open-source network analysis package, which is under continuous development.

II. RELATED WORK

This section gives a short overview over related efforts. For a comprehensive overview of community detection in networks, we refer the interested reader to aforementioned surveys [32], [10]. Recent developments and results are also covered by the 10th DIMACS Implementation Challenge [1].

Among efficient heuristics for community detection we can distinguish between those based on community agglomeration and those based on local node moves. Agglomerative algorithms successively merge pairs of communities so that an improvement with respect to community quality is achieved. In contrast, local movers search for quality gains which can be achieved by moving a node to the community of a neighbor.

A globally greedy agglomerative method known as CNM [8] runs in O(md log n) for graphs with n nodes and m edges, where d is the depth of the dendrogram of mergers and typically d ∼ log n. Among the few parallel implementations competing in the DIMACS challenge, Fagginger Auer and Bisseling [9] submitted an agglomerative algorithm with an implementation for both the GPU (using NVIDIA CUDA) and the CPU (using Intel TBB). The algorithm weights all edges with the difference in modularity resulting from a contraction of the edge, then computes a heavy matching M and contracts according to M. This process continues recursively with a hierarchy of successively smaller graphs. The matching procedure can adapt to star-like structures in the graph to avoid insufficient parallelism due to small matchings. In the challenge, the CPU implementation competed as CLU_TBB and proved exceptionally fast. Independently, Riedy et al. [30] developed a similar method, which follows the same principle but does not provide the adaptation to star-like structures. An improved implementation, labeled CEL in the following, corresponds to the description in [29].

Community detection by label propagation belongs to the class of local move heuristics. It has originally been described by Raghavan et al. [28]. Several variants of the algorithm exist, one of them (under the name peer pressure clustering) is due to Gilbert et al. [13]. The latter use the algorithm as a prototype application within a parallel toolbox that uses numerical algorithms for combinatorial problems. Unfortunately, Gilbert et al. report running times only for a different algorithm, which solves a very specific benchmark problem and is not applicable in our context. A variant of label propagation by Soman and Narang [33] for multicore and GPU architectures exists, which seeks to improve quality by re-weighting the graph.
A locally greedy multilevel-algorithm known as the Louvain method [5] combines local node moves with a bottom-up multilevel approach. Bhowmick and Srinivasan [3] presented a previous parallel version of the algorithm. According to their experimental results, our implementation is about four orders of magnitude faster. Noack and Rotta [31] evaluate similar sequential multilevel algorithms, which combine agglomeration with refinement.

Ovelgönne and Geyer-Schulz [27] apply the ensemble learning paradigm to community detection. They develop what they call the Core Groups Graph Clusterer scheme, which we adapt as the Ensemble Preprocessing (EPP) algorithm. They also introduce an iterated scheme in which the core communities are again assigned to an ensemble, creating a hierarchy of solutions/coarsened graphs until quality does not improve any more. Within this framework, they employ Randomized Greedy (RG), a variant of the aforementioned CNM algorithm. It avoids a loss in solution quality that arises from highly unbalanced community sizes. The resulting CGGC algorithm emerged as the winner of the Pareto part of the DIMACS challenge, which related quality to speed according to specific rules. Recently Ovelgönne [26] presented a distributed implementation (based on the big data framework Hadoop) of an ensemble preprocessing scheme using label propagation as a base algorithm. This implementation processes a 3.3 billion edge web graph in a few hours on a 50 machine Hadoop cluster [26, p. 73]. (Our OpenMP-based implementation of the similar EPP algorithm requires only 4 minutes on a shared-memory machine with 16 physical cores.)

From an algorithmic perspective, disjoint community detection is related to graph partitioning (GP). Although the problems are different in important aspects (unbalanced vs balanced blocks, unknown vs known number of blocks, different objectives), algorithms such as the Louvain method or PLMR bear conceptual resemblance to multilevel graph partitioners. Exploiting parallelism has been studied extensively for GP. Several established tools are discussed in recent surveys [4], [7], most of them for machines with distributed memory. Often employed techniques are parallel matchings for coarsening and parallel variants of Fiduccia-Mattheyses for local improvement. These techniques are at best partially helpful in our scenario since vanilla matching-based coarsening is ineffective on complex networks and distributed-memory parallelism is not necessary for us. Related to our work is a recent study on multithreaded GP by LaSalle and Karypis [20], who explore the design space of multithreaded GP algorithms. Their results provide interesting insights, but are not completely transferable to our scenario. Very recently they presented Nerstrand [21], a fast parallel community detection algorithm based on modularity maximization and the multilevel paradigm, using different aggregation schemes. Our work on PLP in this paper has also inspired a very promising parallel multilevel algorithm for partitioning massive complex networks [24].

We observe that most efficient disjoint community detection heuristics make use of agglomeration or local node moves, possibly in combination with multilevel or ensemble techniques. Both basic approaches can be adapted for parallelism, but this is currently the exception rather than the norm in our scenario. In this work we compare our own algorithms with the best currently available, sequential and parallel alike.

III. ALGORITHMS

In this section we formulate and describe our parallel variants of existing sequential community detection algorithms, as well as ensemble techniques which combine them. Implementation details are also discussed. We use the following notation: A graph, the abstraction of a network data set, is denoted as G = (V, E) with a node set V of size n and an edge set E of size m. In the following, edges {u, v} are undirected and have weights $\omega : E \to \mathbb{R}^+$. The weight of a set of edges E′ ⊆ E is denoted as $\omega(E') := \sum_{\{u,v\} \in E'} \omega(u, v)$. A community detection solution ζ = {C_1, . . . , C_k} is a partition of the node set V into disjoint subsets called communities. Equivalently, such a solution can be understood as a mapping where ζ(v) returns the community containing node v. For our implementation, the nodes have consecutive integer identifiers id(v) and edges are pairs of node identifiers. A solution is represented as an array indexed by integer node identifiers and containing integer community identifiers.

A. Parallel Label Propagation (PLP)

a) Algorithm: Community detection by label propagation, as originally introduced by Raghavan et al. [28], extracts communities from a labelling V → N of the node set. Initially, each node is assigned a unique label, and then multiple iterations over the node set are performed: In each iteration, every node adopts the most frequent label in its neighborhood (breaking ties arbitrarily). Densely connected groups of nodes thus agree on a common label, and eventually a globally stable consensus is reached, which usually corresponds to a good solution for the network. Label propagation therefore finds communities in nearly linear time: Each iteration takes O(m) time, and the algorithm has been empirically shown to reach a stable solution in only a few iterations, though not mathematically proven to do so. The number of iterations seems to depend more on the graph structure than the size. More theoretical analysis is done by Kothapalli et al. [16]. The algorithm can be described as a locally greedy coverage maximizer, i.e. it tries to maximize the fraction of edges which are placed within communities rather than across. With its purely local update rule, it tends to get stuck in local optima of coverage which implicitly are good solutions with respect to modularity: A label is likely to propagate through and cover a dense community, but unlikely to spread beyond bottlenecks. The local update rule and the absence of global variables make label propagation well-suited for a parallel implementation.

Algorithm 1 denotes PLP, our parallel variant of label propagation. We adapt the algorithm in a straightforward way to make it applicable to weighted graphs. Instead of the most frequent label, the dominant label in the neighborhood is chosen, i.e. the label l that maximizes $\sum_{u \in N(v) : \zeta(u) = l} \omega(v, u)$. We continue the iteration until the number of nodes which changed their labels falls below a threshold θ.
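As a rough illustration of the weighted update rule and the threshold-based stopping criterion described above, the following C++ sketch runs label propagation sequentially on a small adjacency-list graph. It is not the authors' PLP implementation: the `Graph` alias, the function name `labelPropagation`, and the deterministic tie-breaking toward the smaller label are assumptions made for this example, and the sweep is deliberately sequential so that the result is reproducible.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical weighted adjacency list: adj[v] holds (neighbor, edge weight) pairs.
using Graph = std::vector<std::vector<std::pair<int, double>>>;

// Sequential sketch of weighted label propagation.
// theta is the update threshold: sweeps stop once at most theta nodes changed.
std::vector<int> labelPropagation(const Graph& adj, std::size_t theta) {
    const std::size_t n = adj.size();
    std::vector<int> zeta(n);
    for (std::size_t v = 0; v < n; ++v) zeta[v] = static_cast<int>(v);  // unique labels

    std::size_t updated = n;
    while (updated > theta) {
        updated = 0;
        for (std::size_t v = 0; v < n; ++v) {
            if (adj[v].empty()) continue;  // isolated nodes keep their label
            // Accumulate the incident edge weight per neighboring label.
            std::unordered_map<int, double> weight;
            for (const auto& e : adj[v]) {
                assert(e.second > 0.0);  // weights are assumed to be positive
                weight[zeta[e.first]] += e.second;
            }
            // Dominant label: maximum accumulated weight, ties toward the smaller label
            // (an assumption of this sketch; the text allows arbitrary tie-breaking).
            int best = zeta[v];
            const auto current = weight.find(best);
            double bestWeight = (current != weight.end()) ? current->second : 0.0;
            for (const auto& entry : weight) {
                if (entry.second > bestWeight ||
                    (entry.second == bestWeight && entry.first < best)) {
                    best = entry.first;
                    bestWeight = entry.second;
                }
            }
            if (best != zeta[v]) {
                zeta[v] = best;
                ++updated;
            }
        }
    }
    return zeta;
}
```

Calling `labelPropagation(g, 0)` iterates until no label changes. A larger `theta` trades a few unsettled nodes for fewer sweeps, in the spirit of the threshold θ above.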
Algorithm 1: PLP: Parallel Label Propagation
  Input: graph G = (V, E)
  Result: communities ζ : V → N
   1  parallel for v ∈ V
   2      ζ(v) ← id(v)
   3  updated ← n
   4  V_active ← V
   5  while updated > θ do
   6      updated ← 0
   7      parallel for v ∈ {u ∈ V_active : deg(u) > 0}
   8          l* ← arg max_l { Σ_{u ∈ N(v) : ζ(u) = l} ω(v, u) }
   9          if ζ(v) ≠ l* then
  10              ζ(v) ← l*
  11              updated ← updated + 1
  12              V_active ← V_active ∪ N(v)
  13          else
  14              V_active ← V_active \ {v}
  15  return ζ

b) Implementation: We make a few modifications to the original algorithm. In the original description [28], nodes are traversed in random order. Since the cost of explicitly randomizing the node order in parallel is not insignificant, we make this optional and rely on some randomization through parallelism otherwise. We also observe that forgoing randomization has a negligible effect on quality. We avoid unnecessary computation by distinguishing between active and inactive nodes. It is unnecessary to recompute the label weights for a node whose neighborhood labels have not changed in the previous iteration. Nodes which already have the heaviest label become inactive (Algorithm 1, line 14), and are only reactivated if a neighboring node is updated (line 12). We restrict iteration to the set of active nodes. Iterations are repeated until the number of nodes updated falls below a threshold value. The motivation for setting threshold values other than zero is that on some graph instances, the majority of iterations are spent on updating only a very small fraction of high-degree nodes (see Fig. 12 in the supplementary material for an example). Since preliminary experiments have shown that time can be saved and quality is not significantly degraded by simply omitting these iterations, we set an update threshold of $\theta = n \cdot 10^{-5}$. Note that we do not use the termination criterion specified in [27] as it does not lead to convergence on some inputs. The original criterion is to stop when all nodes have the label of the relative majority in their neighborhood [28].

Label propagation can be parallelized easily by dividing the range of nodes among multiple threads which operate on a common label array. This parallelization is not free of race conditions, since by the time the neighborhood of a node u is evaluated in iteration i to set ζ_i(u), a neighbor v might still have the previous iteration's label ζ_{i−1}(v) or already ζ_i(v). The outcome thus depends on the order of threads. However, these race conditions are acceptable and even beneficial in an ensemble setting since they introduce random variations and increase base solution diversity. This also corresponds to asynchronous updating, which has been found to avoid oscillation of labels on bipartite structures [28]. When dealing with scale-free networks whose degree distribution follows a power law, assigning node ranges of equal size to each thread will lead to load imbalance as computational cost depends on the node degree. Instead of statically dividing the iteration among the threads, guided scheduling (with #pragma omp parallel for schedule(guided)) assigns node ranges of decreasing size from a queue to available threads. This way it can help to overcome load balancing issues, since threads processing large neighborhoods will receive fewer vertices in later phases of the dynamical assignment process. This introduces some overhead, but we observed that guided scheduling is generally superior to static parallelization for PLP and similar methods.

B. Parallel Louvain Method (PLM)

Algorithm: The Louvain method for community detection was first presented by Blondel et al. [5]. It can be classified as a locally greedy, bottom-up multilevel algorithm and uses modularity as the objective function. In each pass, nodes are repeatedly moved to neighboring communities such that the locally maximal increase in modularity is achieved, until the communities are stable. Algorithm 2 denotes this move phase. Then, the graph is coarsened according to the solution (by contracting each community into a supernode) and the procedure continues recursively, forming communities of communities. Finally, the communities in the coarsest graph determine those in the input graph by direct prolongation.

Computation of the objective function modularity is a central part of the algorithm. Let $\omega(u, C) := \sum_{\{u,v\} : v \in C} \omega(u, v)$ be the weight of all edges from u to nodes in community C, and define the volume of a node and a community as $\mathrm{vol}(u) := \sum_{\{u,v\} : v \in N(u)} \omega(u, v) + 2 \cdot \omega(u, u)$ and $\mathrm{vol}(C) := \sum_{u \in C} \mathrm{vol}(u)$, respectively. The modularity of a solution is defined as

$$\mathrm{mod}(\zeta, G) := \sum_{C \in \zeta} \left( \frac{\omega(C)}{\omega(E)} - \frac{\mathrm{vol}(C)^2}{4\,\omega(E)^2} \right) \qquad \text{(III.1)}$$

Note that the change in modularity resulting from a node move can be calculated by scanning only the local neighborhood of a node, because the difference in modularity when moving node u ∈ C to community D is:

$$\Delta\mathrm{mod}(u, C \to D) = \frac{\omega(u, D \setminus \{u\}) - \omega(u, C \setminus \{u\})}{\omega(E)} + \frac{\left( \mathrm{vol}(C \setminus \{u\}) - \mathrm{vol}(D \setminus \{u\}) \right) \cdot \mathrm{vol}(u)}{2 \cdot \omega(E)^2}$$

We introduce a shared-memory parallelization of the Louvain method (PLM, Algorithm 3) in which node moves are evaluated and performed in parallel instead of sequentially. This approach may work on stale data so that a monotonous modularity increase is no longer guaranteed. Suppose that during the evaluation of a possible move of node u other threads might have performed moves that affect the ∆mod scores of u. In some cases this can lead to a move of
u that actually decreases modularity. Still, such undesirable decisions can also be corrected in a following iteration, which is why the solution quality is not necessarily worse. Working only on independent sets of vertices in parallel does not provide a solution since the sets would have to be very small, limiting parallelism and/or leading to the undesirable effect of a very deep coarsening hierarchy. Concerns about termination turned out to be theoretical for our set of benchmark graphs, all of which can be successfully processed with PLM. The community size resolution produced by PLM can be varied through a parameter γ in the range [0, 2m], 0 yielding a single community, 1 being standard modularity and 2m producing singletons. Tuning this parameter is a possible practical remedy [18] against modularity's resolution limit.

Algorithm 2: move: Local node moves for modularity gain
  Input: graph G = (V, E), communities ζ : V → N
  Result: communities ζ : V → N
  repeat
      parallel for u ∈ V
          δ ← max_{v ∈ N(u)} {∆mod(u, ζ(u) → ζ(v))}
          C ← ζ(arg max_{v ∈ N(u)} {∆mod(u, ζ(u) → ζ(v))})
          if δ > 0 then
              ζ(u) ← C
  until ζ stable
  return ζ

Algorithm 3: PLM: Parallel Louvain Method
  Input: graph G = (V, E)
  Result: communities ζ : V → N
  ζ ← ζ_singleton(G)
  ζ ← move(G, ζ)
  if ζ changed then
      [G′, π] ← coarsen(G, ζ)
      ζ′ ← PLM(G′)
      ζ ← prolong(ζ′, G, G′, π)
  return ζ

Implementation: The main idea of PLM (Algorithm 3) is to parallelize both the node move phase and the coarsening phase of the Louvain method. Since the computation of the ∆mod scores is the most frequent operation, it needs to be very fast. We store and update some interim values, which is not apparent from the high-level pseudocode in Algorithm 3. An earlier implementation associated with each node a map in which the edge weight to neighboring communities was stored and updated when node moves occurred. A lock for each vertex v protected all read and write accesses to v's map since std::map is not thread-safe. Although this was meant to avoid redundant computation, we later discovered that it introduces too much overhead (map operations, locks). Recomputing the weight to neighbor communities each time a node is evaluated turned out to be faster. The current implementation only stores and updates the volume of each community. An additional optimization to the PLM implementation eliminated the overhead associated with using an std::map to store for each node the weights of edges leading to neighboring communities. The mechanism was replaced by one std::vector for each of the p threads, leading to an acceleration of a factor of 2 on average, at the cost of a memory overhead of O(p · n). The former version (referred to as PLM*) can still be used optionally under tighter memory constraints.

Graph coarsening according to communities is performed in a straightforward way such that the nodes of a community in G are aggregated to a single node in G′. An edge between two nodes in G′ receives as weight the sum of weights of inter-community edges in G, while self-loops preserve the weight of intra-community edges. A mapping π of nodes in the fine graph to nodes in the coarse graph is also returned. In earlier versions of PLM, the graph coarsening phase proved to be a major sequential bottleneck. We address this problem with a parallel coarsening scheme: Each thread first scans a portion of the edges in G and constructs a coarse graph G′_t of its own. These partial graphs are then combined into G′ by processing each node of G′ in parallel and merging the adjacencies stored in each G′_t.

C. Parallel Louvain Method with Refinement (PLMR)

Following up on the work by Noack and Rotta on multilevel techniques and refinement heuristics [31], we extend the Louvain method by an additional move phase after each prolongation. This makes it possible to re-evaluate node assignments in view of the changes that happened on the next coarser level, giving additional opportunities for modularity improvement at the cost of additional iterations over the node set in each level of the hierarchy. We denote the method and implementation as PLMR for Parallel Louvain Method with Refinement. We present a recursive implementation in Algorithm 4 which uses the same concepts as PLM.

Algorithm 4: PLMR: Parallel Louvain Method with Refinement
  Input: graph G = (V, E)
  Result: communities ζ : V → N
  ζ ← ζ_singleton(G)
  ζ ← move(G, ζ)
  if ζ changed then
      [G′, π] ← coarsen(G, ζ)
      ζ′ ← PLMR(G′)
      ζ ← prolong(ζ′, G, G′, π)
      ζ ← move(G, ζ)
  return ζ

D. Ensemble Preprocessing (EPP)

In machine learning, ensemble learning is a strategy in which multiple base classifiers or weak classifiers are combined to form a strong classifier. Classification in this context can be understood as deciding whether a pair of nodes should belong to the same community. We follow this general idea, which has been applied successfully to graph clustering before [27]. Subsequently, we describe the ensemble technique EPP. We also briefly describe algorithms for combining multiple base solutions.

Algorithm 5: EPP: Ensemble Preprocessing
  Input: graph G = (V, E), ensemble size b
  Result: communities ζ : V → N
  parallel for i ∈ [1, b]
      ζ_i ← Base_i(G)
  ζ̄ ← combine(ζ_1, . . . , ζ_b)
  [G′, π] ← coarsen(G, ζ̄)
  ζ′ ← Final(G′)
  ζ ← prolong(ζ′, G, G′, π)
  return ζ

In a preprocessing step, assign G to an ensemble of base algorithms. The graph is then coarsened according to the core communities ζ̄, which represent the consensus of the base algorithms. Coarsening reduces the problem size considerably, and implicitly identifies the contested and the unambiguous parts of the graph. After the preprocessing phase, the coarsened graph G′ is assigned to the final algorithm, whose result is applied to the input graph by prolongation. Our implementation of the ensemble technique EPP is agnostic to the base and final algorithms and can be instantiated with a variety of such algorithms. We instantiate the scheme with PLP as a base algorithm and PLMR as the final algorithm. Thus we achieve massive nested parallelism with several parallel PLP instances running concurrently in the first phase, and proceed in the second phase with the more expensive but qualitatively superior PLMR. This constitutes the EPP algorithm (Algorithm 5). We write EPP(b, Base, Final) to indicate the size of the ensemble b and the types of base and final algorithm.

Implementation: A consensus of b > 1 base algorithms is formed by combining the base solutions ζ_i in the following way: Only if a pair of nodes is classified as belonging to the same community in every ζ_i, then it is assigned to the same community in the core communities ζ̄. Formally, for all node pairs u, v ∈ V:

$$\left( \forall i \in [1, b] : \zeta_i(u) = \zeta_i(v) \right) \iff \bar{\zeta}(u) = \bar{\zeta}(v) \qquad \text{(III.2)}$$

We introduce a highly parallel combination algorithm based on hashing. With a suitable hash function h(ζ_1(v), . . . , ζ_b(v)), the community identifiers of the base solutions are mapped to a new identifier ζ̄(v) in the core communities. Except for unlikely hash collisions, a pair of nodes will be assigned to the same community only if the criterion above is satisfied. We use a relatively simple function called djb2 due to Bernstein,² which appears sufficient for our purposes. The use of a b-way hash function is fast due to a high degree of parallelism.

IV. IMPLEMENTATION AND EXPERIMENTAL SETUP

A. Framework and Settings

The language of choice for all implementations is C++ according to the C++11 standard, allowing us to use object-oriented and functional programming concepts while also compiling to native code. We implemented all algorithms on top of a general-purpose adjacency array graph data structure. Basically, it represents the adjacencies of each node by storing them in an std::vector, allowing for efficient insertions and deletions of nodes and edges. A high-level interface encapsulates the data structure and enables a clear and concise notation of graph algorithms. In particular, our interface conveniently supports parallel programming through parallel node and edge iteration methods which receive a function and apply it to all elements in parallel. Parallelism is achieved in the form of loop parallelization with OpenMP, using the parallel for directive with schedule(guided) where appropriate for improved load balancing.

We publish our source code under a permissive free software license to encourage reproduction, reuse and contribution by the community. Implementations of all community detection algorithms mentioned are part of NetworKit [36], our growing toolkit for network analysis.³ The software combines fast parallel algorithms written in C++ with an interactive Python interface for flexible and interactive data analysis workflows. For representative experiments we average quality and speed values over multiple runs in order to compensate for fluctuations. Table I provides information on the multicore platform used for all experiments.

Table I: Platform for experiments
  phipute1.iti.kit.edu
  compiler   gcc 4.8.1
  CPU        2 x 8 Cores: Intel(R) Xeon(R) E5-2680 0 @ 2.70GHz, 32 threads
  RAM        256 GB
  OS         SUSE 13.1-64

B. Networks

We perform experiments on a variety of graphs from different categories of real-world and synthetic data sets. Our focus is on real-world complex networks, but to add variety some non-complex and synthetic instances are included as well. The test set includes web graphs (uk-2002, eu-2005, in-2004, web-BerkStan), internet topology networks (as-22july06, as-Skitter, caidaRouterLevel), social networks (soc-LiveJournal, fb-Texas84, com-youtube, wiki-Talk, soc-pokec, com-orkut), scientific coauthorship networks (coAuthorsCiteseer, coPapersDBLP), a connectome graph (con-fiber_big), a street network (europe-osm) and synthetic graphs (G_n_pin_pout, kron_g500-simple-logn20, hyperbolic-268M). Therefore, we cover a range of
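To make the local computation of ∆mod from Section III-B concrete, the following C++ sketch evaluates the ∆mod formula for a single node move and compares it against a from-scratch evaluation of Eq. (III.1). It is a sequential illustration under stated assumptions rather than the PLM code: the `Graph` alias and all function names are hypothetical, the graph is assumed to be undirected without self-loops, and community identifiers are assumed to lie in [0, n).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical weighted adjacency list: adj[u] holds (v, weight) pairs.
// Assumes an undirected graph without self-loops and community ids in [0, n).
using Graph = std::vector<std::vector<std::pair<int, double>>>;

// omega(E): every undirected edge is stored in two adjacency lists.
double totalWeight(const Graph& adj) {
    double sum = 0.0;
    for (const auto& nbrs : adj)
        for (const auto& e : nbrs) sum += e.second;
    return sum / 2.0;
}

// vol(C) for every community id, as the sum of vol(u) over its members.
std::vector<double> communityVolumes(const Graph& adj, const std::vector<int>& zeta) {
    std::vector<double> vol(adj.size(), 0.0);
    for (std::size_t u = 0; u < adj.size(); ++u)
        for (const auto& e : adj[u]) vol[zeta[u]] += e.second;
    return vol;
}

// Modularity as in Eq. (III.1), computed from scratch.
double modularity(const Graph& adj, const std::vector<int>& zeta) {
    const double omegaE = totalWeight(adj);
    const std::vector<double> vol = communityVolumes(adj, zeta);
    std::vector<double> intra(adj.size(), 0.0);  // omega(C): weight inside each community
    for (std::size_t u = 0; u < adj.size(); ++u)
        for (const auto& e : adj[u])
            if (zeta[e.first] == zeta[u]) intra[zeta[u]] += e.second / 2.0;
    double q = 0.0;
    for (std::size_t c = 0; c < adj.size(); ++c)
        q += intra[c] / omegaE - vol[c] * vol[c] / (4.0 * omegaE * omegaE);
    return q;
}

// Delta-mod for moving u from its community C to D != C, scanning only N(u)
// and using the stored community volumes.
double deltaMod(const Graph& adj, const std::vector<int>& zeta,
                const std::vector<double>& vol, double omegaE, int u, int D) {
    const int C = zeta[u];
    assert(C != D);
    double toC = 0.0, toD = 0.0, volU = 0.0;
    for (const auto& e : adj[u]) {
        volU += e.second;
        if (zeta[e.first] == C) toC += e.second;       // omega(u, C \ {u})
        else if (zeta[e.first] == D) toD += e.second;  // omega(u, D \ {u})
    }
    const double volCwithoutU = vol[C] - volU;  // vol(C \ {u})
    const double volD = vol[D];                 // u is not in D, so vol(D \ {u}) = vol(D)
    return (toD - toC) / omegaE
         + (volCwithoutU - volD) * volU / (2.0 * omegaE * omegaE);
}
```

On a small example, the value returned by `deltaMod` should match the difference of two `modularity` calls before and after the move, up to floating-point rounding. Only the neighborhood of u and two community volumes are touched, which is consistent with the remark that the implementation stores and updates only the volume of each community.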
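The parallel node iteration described in Section IV-A, which receives a function and applies it to all nodes with guided scheduling, could look roughly like the following C++ sketch. The helper `parallelForNodes`, the `Graph` alias and the example function `weightedDegrees` are hypothetical names chosen for this illustration and are not claimed to match the interface of the authors' framework. Without OpenMP support enabled at compile time the pragma is ignored and the loop simply runs sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical weighted adjacency list: adj[v] holds (neighbor, edge weight) pairs.
using Graph = std::vector<std::vector<std::pair<int, double>>>;

// Hypothetical helper: applies f to every node id, handing out chunks of
// decreasing size to the threads (guided scheduling) to soften load imbalance
// caused by skewed degree distributions.
template <typename F>
void parallelForNodes(const Graph& adj, F f) {
    const std::int64_t n = static_cast<std::int64_t>(adj.size());
    #pragma omp parallel for schedule(guided)
    for (std::int64_t v = 0; v < n; ++v) {
        f(static_cast<int>(v));
    }
}

// Example use: weighted degree of every node. Each iteration writes only
// its own slot of the result vector, so no synchronization is needed.
std::vector<double> weightedDegrees(const Graph& adj) {
    std::vector<double> deg(adj.size(), 0.0);
    parallelForNodes(adj, [&](int v) {
        for (const auto& e : adj[v]) deg[v] += e.second;
    });
    return deg;
}
```

Passing the loop body as a function object keeps the scheduling decision in one place, so an algorithm written against such a helper does not need to repeat the OpenMP directive.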
graph-structural properties. Real-world complex networks are
2 hash functions: http://www.cse.yorku.ca/~oz/hash.html 3 NetworKit: https://networkit.iti.kit.edu/
7
size and inherent community structure, which may or may not be distinctive, and varies widely among the instances. The majority of test networks are taken from the collection compiled for the 10th DIMACS Implementation Challenge⁴ as well as the Stanford Large Network Dataset Collection⁵ and are freely available on the web. They are undirected, unweighted graphs. Table II gives an overview over graph sizes as well as some structural features: A high maximum node degree (maxdeg) indicates possible load balancing issues. The number of connected components (comp) points to isolated single nodes or small groups of nodes. A high average local clustering coefficient (lcc) is an indicator for the presence of dense subgraphs. We evaluate solution quality and running time for all of our own algorithms as well as several relevant competitors on this set. For those algorithms that can process in reasonable time the largest real-world graph available to us, a web graph of the .uk domain with m ≈ 3.3 · 10^9, we add further experiments (see Section V-H). To measure strong scaling, we run our parallel algorithms on this web graph.

V. EXPERIMENTS AND RESULTS

In this section we report on a representative subset of our experimental results for our different parallel algorithms, as well as competing codes. Figures 6 and 7 (as well as Figures 16 and 17 in the supplementary material) show running time and quality differences broken down by the networks of our test set. The bars of the charts are in ascending order of graph size. We have selected a diverse test set and show

4 DIMACS collection: http://www.cc.gatech.edu/dimacs10/downloads.shtml
5 Stanford collection: http://snap.stanford.edu/data/index.html

Figure 1: PLP strong scaling on the uk-2007-05 web graph (left: time [s], right: speedup, each for 1 to 32 threads)

B. Parallel Louvain Method (PLM)

For PLM we observe only small deviations in quality between single-threaded and multi-threaded runs, supporting the argument that the algorithm is able to correct undesirable decisions due to stale data. PLM detects communities with relatively high modularity in the majority of networks. Even large instances are processed in no more than a few minutes. Figure 2 shows the scaling behavior of PLM. Since both the node move phase and the coarsening phase have been parallelized, PLM profits from increased parallelism as well, achieving a speedup of factor 9 for 32 threads. In comparison to PLP (Figure 6b), we observe that PLP can solve instances in only half the time required by PLM, but at a significant loss of modularity. As discussed in Sec. VI, the communities detected by the two algorithms can be markedly different. Because the Louvain method for community detection is well-known and accepted, we choose the performance of PLM as our baseline (Figure 6a) and present quality and running time of other algorithms relative to PLM.

C. Parallel Louvain Method with Refinement (PLMR)

As shown by Figure 6c, adding a refinement phase generally leads to a (sometimes significant) improvement in modularity.
This improvement is paid for by a small increase in running time. The results indicate that our proposed extension of the original Louvain method by a refinement phase can efficiently increase solution quality. We also evaluate the scaling behavior of each phase of the PLMR algorithm. In Figure 3 a yellow bar indicates the running time on the finest graph while the red bar stops at the total running time of the phase. Time spent on the finest graph clearly dominates all running times. Our experiments show that the move and refinement phases scale well with the number of threads, while the coarsening phase only partially profits from parallelization. The results on this graph are representative for the trend of the scaling behavior for the algorithm's phases: Figure 4 shows speedup factors for each of the phases, aggregated over the test set of 20 graphs.

Figure 2: PLM strong scaling on the uk-2007-05 web graph (left: time [s], right: speedup, each for 1 to 32 threads)

Figure 3: PLMR strong scaling of the move, coarsening and refinement phases (top to bottom) on uk-2007-05 (time [s] for 1 to 32 threads)

Figure 4: PLMR strong scaling of the move, coarsening and refinement phases (top to bottom), speedup factors aggregated

D. Ensemble Preprocessing (EPP)

Figure 15 in the supplementary material demonstrates the effectiveness of the ensemble approach. Results were generated by an EPP instance with a 4-piece PLP ensemble and PLMR as final algorithm in comparison to a single PLP instance. We observe that the approach of EPP pays off in the form of improved modularity on most instances, exploiting differences in the base solutions and spending extra time on classifying contested nodes. For larger networks, this comes at a cost of about 5 times the running time of PLP alone. It also becomes clear that for small networks the approach does not pay off as running time becomes dominated by the overhead of the ensemble scheme. In comparison to PLM (Figure 6d), the ensemble approach can be slightly faster on some networks, but quality is slightly worse in most cases. We conclude that the ensemble technique EPP is effective in improving on the quality of a single algorithm. While somewhat lower in modularity, the communities detected are similar (see Sec. VI) to those of the Louvain method. In practice, our acceleration of the PLM algorithm has made the ensemble approach less relevant.

E. Comparison with State-of-the-Art Competitors

In this section we present results for an experimental comparison with several relevant competing community detection codes. These are mainly those which excelled in the DIMACS challenge either by solution quality or time to solution: The agglomerative algorithms CLU_TBB⁶ and RG, as well as CGGC and CGGCi⁷, ensemble algorithms based on RG. We also include the widely used original sequential Louvain⁸ implementation, as well as the agglomerative algorithm CEL. In contrast to the DIMACS challenge, we run all codes on the same multicore machine (Tab. I) and measure time to solution for sequential and parallel ones alike.

6 CLU_TBB: http://www.staff.science.uu.nl/~faggi101/
7 RG etc: http://www.umiacs.umd.edu/~mov/
8 Louvain: https://sites.google.com/site/findcommunities/

a) Louvain: Although not submitted to the DIMACS competition, the original sequential implementation of the Louvain method is still relatively fast (Figure 7a). The marginally different modularity values in comparison to PLM may be caused by subtle differences in the implementation. For
example, Louvain explicitly randomizes the order in which nodes are visited, while we rely on implicit randomization through parallelism. For the smallest graphs, running time values are missing because the implementation reported a running time of zero. Louvain eventually falls behind the parallel algorithm for large graphs, confirming that the overhead and complexity introduced by parallelism is eventually justified when we target massive datasets.

b) CLU_TBB and CEL: CLU_TBB, one of the few parallel entries in the DIMACS competition, is a very fast implementation of agglomerative modularity maximization, solving the larger instances more quickly than PLM (Figure 7b). Qualitatively however, PLM is clearly superior on most networks. Both in terms of modularity and running time, CLU_TBB occupies a middle ground between PLP and PLM, and is qualitatively very similar to our ensemble algorithm EPP. CEL, as another fast parallel program, produced con[...] PLP.

c) RG, CGGC and CGGCi: Ovelgönne and Geyer-Schulz entered the DIMACS challenge with an ensemble approach conceptually similar to what we have developed in this paper. Their base algorithm is the sequential agglomerative RG, and two ensemble variants exist: CGGC implements an ensemble technique very similar to EPP, while CGGCi iterates the approach. The RG algorithm achieves a high solution quality, surpassing PLM by a small margin on most networks (Figure 7c). Quality is again slightly improved by the ensemble approach CGGC and its iterated version CGGCi (Figure 7d, and 16a in the SM), with the latter surpassing any other heuristic known to us. However, all three are very expensive in terms of computation time, often taking orders of magnitude longer than PLM. We consider running times of several hours for many of our networks no longer viable for the scenario we target, namely interactive data analysis on a parallel workstation.

F. Pareto Evaluation

We have so far presented results broken down by data set to stress that observed effects may vary strongly from one network to another, a sign of the heterogeneity of real-world complex networks. Additionally, we want to give a condensed picture of the results. For this purpose we use the previous experimental data to compute a score for running time and solution quality. The time score is the geometric mean of running time ratios over our test set of networks with the running time of PLM as the baseline, while the modularity score is the arithmetic mean of absolute modularity differences. Figure 5 shows the resulting points. It becomes clear that all algorithms except CEL and EPP are placed on or close to the Pareto frontier. PLP is unrivaled in terms of time to solution, but solution quality is suboptimal. In the middle ground between label propagation and Louvain method, the parallel CLU_TBB achieves about the same modularity but beats the ensemble approach in terms of speed. PLM and PLMR emerge as qualitatively strong and fast candidates, closest to the lower right corner. (Their more memory-efficient implementation PLM* is about a factor of 2 slower.) It is also evident that our extended version PLMR can improve solution quality for a small computational extra charge. We recommend both PLM and PLMR as the default algorithms for parallel community detection in large networks. The original sequential implementation of the Louvain method is thus no longer on the Pareto frontier since it cannot benefit from multicore systems. RG and its ensemble combinations have the best modularity scores by a narrow margin, while they are by far the most computationally expensive ones, which places them outside of the application scenario we target.

Figure 5: Pareto evaluation of community detection algorithms (vertical axis: time score, logarithmic; horizontal axis: modularity score from −0.25 to 0.10; labeled points include CGGCi, CGGC, Louvain, CEL, PLM*, EPP, PLMR, CLU_TBB, PLM and PLP)

G. LFR Benchmark

The LFR benchmark [19] is an established method for evaluating community detection algorithms: A generator produces graphs that resemble real complex networks and contain dense communities, which are more sparsely connected to each other the lower the mixing parameter µ. Algorithm performance is measured as the accuracy in recognizing the ground truth communities supplied by the generator, in view of increasing difficulty (µ). Although there are real-world networks that come with supposed ground truth communities (e. g. interest-based groups of online social networks in the SNAP collection), we consider only a synthetic ground truth reliable enough for our purposes. In Figure 8 we plot the agreement (graph-structural Rand index, where 1 is complete agreement) between detected and ground truth communities for our algorithms, and show that the PLM method is able to detect the ground truth even with strong noise (µ = 0.8), while PLP (and hence EPP) is somewhat less robust.
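The agreement measure used in the LFR benchmark can be illustrated with a small sketch. It assumes that the graph-structural Rand index is the fraction of edges on which two partitions agree about whether the endpoints share a community; this reading, the function name and the toy data are assumptions made for illustration and are not taken from the paper's implementation.

```python
def graph_structural_rand_index(edges, zeta, eta):
    """Fraction of edges {u, v} on which two partitions agree.

    Assumption: an edge counts as agreement if both partitions place u and v
    in the same community, or both place them in different communities.
    """
    if not edges:
        raise ValueError("graph has no edges")
    agree = 0
    for u, v in edges:
        same_in_zeta = zeta[u] == zeta[v]
        same_in_eta = eta[u] == eta[v]
        if same_in_zeta == same_in_eta:
            agree += 1
    return agree / len(edges)


# Toy graph: two triangles joined by one bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
ground_truth = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
detected_exact = {0: 7, 1: 7, 2: 7, 3: 9, 4: 9, 5: 9}  # same grouping, other ids
detected_merged = {v: 0 for v in range(6)}              # one big community

score_exact = graph_structural_rand_index(edges, ground_truth, detected_exact)
score_merged = graph_structural_rand_index(edges, ground_truth, detected_merged)
```

In this toy case the measure depends only on the grouping and not on the community identifiers, and merging both triangles is penalized only on the bridge edge.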
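Returning to the Pareto evaluation of Section V-F, the two scores and the notion of a Pareto frontier can be sketched as follows. This is a minimal illustration under stated assumptions: the modularity difference is taken as the signed difference to the baseline, all function names are hypothetical, and the measurements are invented toy values, not data from the paper.

```python
import math


def time_score(times, baseline_times):
    """Geometric mean of running time ratios relative to the baseline."""
    ratios = [times[g] / baseline_times[g] for g in baseline_times]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def modularity_score(mods, baseline_mods):
    """Arithmetic mean of signed modularity differences to the baseline."""
    diffs = [mods[g] - baseline_mods[g] for g in baseline_mods]
    return sum(diffs) / len(diffs)


def pareto_frontier(points):
    """Names of points that no other point dominates.

    points maps a name to (time_score, modularity_score); a lower time score
    and a higher modularity score are better.
    """
    frontier = []
    for name, (t, q) in points.items():
        dominated = any(
            (t2 <= t and q2 >= q) and (t2 < t or q2 > q)
            for other, (t2, q2) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


# Hypothetical measurements on two graphs (not data from the paper).
base_time = {"g1": 2.0, "g2": 8.0}
base_mod = {"g1": 0.80, "g2": 0.60}
runs = {
    "baseline": (base_time, base_mod),
    "fast": ({"g1": 1.0, "g2": 2.0}, {"g1": 0.70, "g2": 0.50}),
    "slow": ({"g1": 4.0, "g2": 16.0}, {"g1": 0.75, "g2": 0.55}),
}
points = {
    name: (time_score(t, base_time), modularity_score(q, base_mod))
    for name, (t, q) in runs.items()
}
frontier = pareto_frontier(points)
```

The geometric mean is a natural choice for ratios because a run that is twice as fast on one graph and twice as slow on another averages to a ratio of 1. In the toy data, "slow" is dominated by the baseline, which is both faster and better in modularity.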
[Figure 6 panels: horizontal bar charts with one bar per network, in ascending order of graph size: as-22july06, G_n_pin_pout, caidaRouterLevel, coAuthorsCiteseer, fb-Texas84, com-youtube, wiki-Talk, web-BerkStan, as-Skitter, in-2004, coPapersDBLP, eu-2005, soc-pokec, soc-LiveJournal, kron_g500-simple-logn20, con-fiber_big, europe-osm, com-orkut, uk-2002, hyperbolic-268M]
(a) PLM: absolute quality and speed serve as baseline for comparison (axes: modularity; time [s], logarithmic)
(b) PLP (axes: modularity difference; time ratio)
(c) PLMR (axes: modularity difference; time ratio)
(d) EPP (axes: modularity difference; time ratio, logarithmic)
Figure 6: Performance of our algorithms in comparison: PLM serves as the baseline. 32 threads used.
[Figure 7 panels: horizontal bar charts with one bar per network, in the same order as in Figure 6; panel (b) has no entry for kron_g500-simple-logn20]
(a) Louvain
(b) CLU_TBB (axes: modularity difference; time ratio)
(c) RG (axes: modularity difference; time ratio)
(d) CGGC (axes: modularity difference; time ratio)
Figure 7: Performance of competitors relative to baseline PLM. 32 threads used for CLU_TBB.
Figure 8: LFR benchmark (n = 10^5): accuracy in recognizing ground truth while increasing inter-community edges (accuracy over mixing parameter µ for PLMR, PLM, PLP and EPP)

H. One More Massive Network

In addition to the experiments that went into the Pareto evaluation, we run our parallel algorithms on the web graph uk-2007-05, at about 3.3 billion edges the largest real-world data set currently available to us. CLU_TBB fails at reading the input file. This leaves us with five of our own parallel algorithms for Figure 9: EPP(4,PLP,PLMR) takes about 219 seconds, while PLM requires about 156 seconds to arrive at a slightly higher modularity. As expected, PLP is by far the fastest algorithm and terminates in less than a minute. If a certain modularity loss (here 0.02) is acceptable, PLP is also an appropriate choice for quickly detecting communities in billion-edge networks. The processing rate for PLP is over 53M edges/second and over 21M edges/second for PLM with respect to a complete run of each algorithm. These rates confirm the suitability of our algorithms for analyzing massive complex networks on a commodity shared-memory server.

Figure 9: Modularity and running time at 32 threads for our parallel algorithms on the massive web graph uk-2007-05 (modularity axis from 0.90 to 1.00; time [s] axis logarithmic)
algorithm          time
PLP                52 s
PLMR               168 s
PLM*               203 s
PLM                156 s
EPP(4,PLP,PLMR)    219 s

I. Weak Scaling

For weak scaling experiments, we use a series of synthetic graphs where each graph has twice the size of its predecessor (from log m = 25 ... 30), and double the number of threads simultaneously from 1 to 32. The graphs were created using a generator [22] based on a unit-disk graph model in hyperbolic geometry [17] (HUD), which produces both a power law degree distribution and distinctive dense communities. Figure 10 shows the results of weak scaling experiments for PLP and PLM. It must be noted that perfect scaling cannot be expected due to the complex structure of the input. The results of the respective last column have been obtained with hyperthreading, which explains the steeper increase. Fig. 14 in the supplementary material shows results for additional weak scaling experiments on synthetic graphs generated with the R-MAT model.

Figure 10: PLP (left) and PLM (right) weak scaling on the series of HUD graphs (time [s] over log(m) = 25 to 30)

VI. QUALITATIVE ASPECTS

In this work we concentrate on achieving a good tradeoff between high modularity, a widely accepted quality measure for community detection, and low running time. Ideally one should also look for further validation of the detected communities beyond good modularity. This is a difficult task for several reasons. For most networks, we do not have a reliable ground-truth partition, especially because community structure is likely a multi-factorial phenomenon in real networks. Our task is to uncover the hidden community structure of the network. In order to know whether we have succeeded in this data mining task, we would have to check whether the solution helps us to formulate hypotheses to predict and explain real-world phenomena on the basis of network data. Whether one solution is more appropriate than another may strongly depend on the domain of the network. Domain-specific validation of this kind goes beyond the scope of this paper as we focus on parallelization aspects. Also, most sequential counterparts of our algorithms have been validated before, see e. g. [5].

However, we give an example to illustrate differences between our algorithms in a more qualitative way. Coarsening the input graph according to the detected communities yields a community graph, which we then visualize by drawing the size of nodes proportional to the size of the respective community. Figure 11 shows community graphs for the PGPgiantcompo graph, a social network and web of trust resulting from signatures on PGP keys. From top to bottom, the solutions were produced by PLP, PLM, PLMR and EPP(4, PLP, PLMR). It is apparent that PLP has a much finer resolution and detects ca. 1000 small communities. This is true for most of our data sets, but the inverse case also appears. On this network, higher modularity is associated with coarser resolution. PLM, PLMR and EPP(4, PLP, PLMR) have a
very similar resolution and divide the network into ca. 100
communities. While PGPgiantcompo is admittedly a very
small graph, this example shows how community detection
can help to reduce the complexity of networks for visual
representation.
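The construction of such a community graph can be sketched as follows. This is a minimal illustration that assumes an undirected graph given as a weighted edge list and a partition given as a dictionary; the function name and the toy data are hypothetical and do not reproduce the paper's parallel coarsening code.

```python
from collections import Counter


def community_graph(edges, zeta):
    """Coarsen a graph according to a partition.

    edges: iterable of (u, v, weight) for an undirected graph.
    zeta:  dict mapping each node to its community id.
    Returns (sizes, coarse_edges): the number of nodes per community and the
    summed edge weight between each pair of communities. Edges inside a
    community are aggregated into a self-loop of the coarse node.
    """
    sizes = Counter(zeta.values())
    coarse_edges = Counter()
    for u, v, w in edges:
        cu, cv = zeta[u], zeta[v]
        key = (cu, cv) if cu <= cv else (cv, cu)  # canonical undirected key
        coarse_edges[key] += w
    return dict(sizes), dict(coarse_edges)


# Toy example: two triangles joined by a single edge.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
zeta = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
sizes, coarse_edges = community_graph(edges, zeta)
```

For a drawing like the one described above, the entries of `sizes` would determine the node sizes, and the inter-community weights in `coarse_edges` would determine which coarse nodes are connected.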
© 2015 IEEE. Citation information: DOI 10.1109/TPDS.2015.2390633, IEEE Transactions on Parallel and Distributed Systems. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7006796
APPENDIX

Figure 12: Number of active and updated labels per iteration of PLP for the web graph uk-2002. (logarithmic count of active and updated labels over iterations 0 to 120)

[Figure: running time [ms] (logarithmic) per iteration; caption not recovered]

Figure 14: PLP (left) and PLM (right) weak scaling on the series of R-MAT graphs. (time [s] over log(n) = 19 to 24)

Figure 15: Difference in quality (left) and running time ratio (right) of EPP(4,PLP,PLMR) compared to a single PLP. (one bar per network, in the same order as in Figure 6)

(a) CGGCi (axes: modularity difference; time ratio)
Figure 16: Performance of the competitor algorithm CGGCi relative to baseline PLM.

(a) PLM* (axes: modularity difference; time ratio, logarithmic; figure caption not recovered)