Porter - Communities in Networks
October 2009
context, the term module is typically used to refer to a single cluster of nodes. Given a network
that has been partitioned into nonoverlapping
modules in some fashion (although some methods also allow for overlapping communities), one
can continue dividing each module in an iterative
fashion until each node is in its own singleton
community. This hierarchical partitioning process
can then be represented by a tree, or dendrogram (see Figure 2). Such processes can yield
a hierarchy of nested modules (see Figure 3),
or a collection of modules at one mesoscopic
level might be obtained in an algorithm independently from those at another level. However
obtained, the community structure of a network
refers to the set of graph partitions obtained at
each reasonable step of such procedures. Note
that community structure investigations rely implicitly on using connected network components.
(We will assume such connectedness in our discussion of community-detection algorithms below.)
Community detection can be applied individually
to separate components of networks that are not
connected.
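The restriction to connected components can be handled as a preprocessing step. The following is a minimal Python sketch, assuming an undirected network stored as a dictionary that maps each node to its set of neighbors; the function name `connected_components` and the toy graph are illustrative and do not come from the article. Each returned node set could then be passed separately to whichever community-detection method one prefers.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph.

    `adj` maps each node to the set of its neighbors (an assumed input format).
    """
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Toy network with two components: a triangle and a single edge.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
parts = connected_components(toy)
```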
Many real-world networks possess a natural hierarchy. For example, the committee assignment
network of the U.S. House of Representatives includes the House floor, groups of committees, committees, groups of subcommittees within larger
committees, and individual subcommittees [100,
101]. As shown in Figure 4, different House committees are resolved into distinct modules within
this network. At a different hierarchical level, small
groups of committees belong to larger but less
densely connected modules. To give an example
closer to home, let's consider the departmental
organization at a university and suppose that
the network in Figure 3 represents collaborations
among professors. (It actually represents grassland species interactions [23].) At one level of
inspection, everybody in the mathematics department might show up in one community, such
as the large one in the upper left. Zooming in,
however, reveals smaller communities that might
represent the department's subfields.
Although network community structure is almost always fairly complicated, several forms of
it have nonetheless been observed and shown
to be insightful in applications. The structures
of communities and between communities are
important for the demographic identification of
network components and the function of dynamical processes that operate on networks (such as the
spread of opinions and diseases) [39]. A community in a social network might indicate a circle of
friends, a community in the World Wide Web might
indicate a group of pages on closely related topics,
and a community in a cellular or genetic network
might be related to a functional module. In some
cases, a network can contain several identical replicas of small communities known as motifs [75].
Consider a transcription network that controls
gene expression in bacteria or yeast. The nodes
represent genes or operons, and the edges represent direct transcriptional regulation. A simple
motif called a feed-forward loop has been shown
both theoretically and experimentally to perform
signal-processing tasks such as pulse generation.
Naturally, the situation becomes much more complicated in the case of people (doesn't it always?).
However, monitoring electronically recorded behavioral data, such as mobile phone calls, allows
one to study underlying social structures [49, 95].
Although these pairwise interactions (phone calls)
are short in duration, they are able to uncover
social groups that are persistent over time [97].
One interesting empirical finding, hypothesized by
Granovetter [51], is that links within communities
tend to be strong and links between them tend
to be weak [95]. This structural configuration has
important consequences for information flow in
social systems [95] and thus affects how the underlying network channels the circulation of social
and cultural resources. (See below for additional
discussion.)
With methods and algorithms drawn from statistical physics, computer science, discrete mathematics, nonlinear dynamics, sociology, and other
subjects, the investigation of network community
structure (and more general forms of data clustering) has captured the attention of a diverse group
of scientists [39, 54, 88, 110]. This breadth of interest has arisen partly because the development
A Simple Example
Identifying Communities
To set the stage for our survey of community-detection algorithms below, consider the ubiquitous but illustrative example of the Zachary Karate
Club, in which an internal dispute led to the schism
of a karate club into two smaller clubs [131]. We
show a visualization of the friendships between
members of the original club in Figure 2. When the
club split in two, its members chose preferentially
to be in the one with most of their friends. Sociologist Wayne Zachary, who was already studying the
club's friendships when the schism occurred, realized that he might have been able to predict the
split in advance. This makes the Zachary Karate
Club a useful benchmark for community-detection
algorithms, as one expects any algorithmically produced division of the network into communities
to include groups that are similar to the actual
memberships of the two smaller clubs.
In Figure 2, we show the communities that we obtained using a spectral-partitioning optimization
of a quality function known as modularity [86].
(This method is described below.) Keeping in mind
the hierarchical organization that often occurs as
part of network community structure, we visualize
the identified divisions using a polar coordinate
dendrogram and enumerate the network's nodes
around its exterior. Each distinct radius of the
dendrogram corresponds to a partition of the
original graph into multiple groups. That is, the
community assignments at a selected level of the
dendrogram are indicated by a radial cut in the
bottom panel of Figure 2; one keeps only connections (of nodes to groups) that occur outside this
cut. The success of the community identification
is apparent in the Zachary Karate Club example, as
the two main branches in the dendrogram reflect
the actual memberships of the new clubs.
As shown in Figure 2, this community-detection
method subsequently splits each of the two main
branches. Hence, we see that the Zachary Karate
Club network has a natural hierarchy of decompositions: a coarse pair of communities that correspond precisely to the observed membership
split and a finer partition into four communities. In larger networks, for which algorithmic
methods of investigation are especially important,
the presence of multiple such partitions indicates
network structures at different mesoscopic resolution levels. At each level, one can easily compare
the set of communities with identifying characteristics of the nodes (e.g., the post-split Zachary
betweenness for the remaining edges. The recalculation step is important because the removal of
an edge can cause a previously low-traffic edge to
have much higher traffic. An iterative implementation of these steps gives a divisive algorithm for
detecting community structure, as it deconstructs
the initial graph into progressively smaller connected chunks until one obtains a set of isolated
nodes.
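The divisive procedure described above (commonly attributed to Girvan and Newman) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the implementation used for the article's figures: the graph is unweighted and undirected, it is stored as a dictionary of neighbor sets, and the names `components`, `edge_betweenness`, and `split_once` are hypothetical. Betweenness is computed with a Brandes-style breadth-first accumulation.

```python
from collections import deque


def components(adj):
    """Connected components of an undirected graph {node: set_of_neighbors}."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    part.add(nb)
                    queue.append(nb)
        parts.append(part)
    return parts


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        # BFS from `source`, counting shortest paths (sigma) to each node.
        dist = {source: 0}
        sigma = {node: 0 for node in adj}
        sigma[source] = 1
        preds = {node: [] for node in adj}
        order = []
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
                if dist[nb] == dist[node] + 1:
                    sigma[nb] += sigma[node]
                    preds[nb].append(node)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {node: 0.0 for node in adj}
        for node in reversed(order):
            for p in preds[node]:
                share = sigma[p] / sigma[node] * (1.0 + delta[node])
                score[frozenset((p, node))] += share
                delta[p] += share
    # Every undirected path was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}


def split_once(adj):
    """Remove highest-betweenness edges until one more component appears."""
    adj = {node: set(nbs) for node, nbs in adj.items()}  # leave the input untouched
    target = len(components(adj)) + 1
    removed = []
    while len(components(adj)) < target and any(adj.values()):
        scores = edge_betweenness(adj)  # recalculated after every removal
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append(frozenset((u, v)))
    return components(adj), removed


# Two triangles joined by a single bridge edge (2, 3).
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
communities, removed_edges = split_once(barbell)
```

Calling `split_once` again on each resulting piece, until only isolated nodes remain, would produce the nested sequence of partitions that a dendrogram records. Recomputing all betweenness scores after each removal is what makes the approach costly on large networks.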
Betweenness-based methods have been generalized to use network components other than
edges, to bipartite networks [100], and to use
other sociological notions of centrality [39]. Note,
however, that although centrality-based community detection is intuitively appealing, it can be
too slow for many large networks (unless they are
very sparse), and it tends to give relatively poor
results for dense networks.
Modularity Optimization
One of the most popular quality functions is modularity, which attempts to measure how well a
given partition of a network compartmentalizes
its communities [84, 86, 87, 89, 90]. The problem
of optimizing modularity is equivalent to an instance of the famous MAX-CUT problem [86], so
it is not surprising that it has been proven to
be NP-complete [17]. There are now numerous
community-finding algorithms that try to optimize modularity or similarly constructed quality
functions in various ways [6, 12, 26, 39]. In the
original definition of modularity, an unweighted
and undirected network that has been partitioned
into communities has modularity [84, 90]
\[
Q = \sum_i \left( e_{ii} - b_i^2 \right), \tag{1}
\]
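To make equation (1) concrete, the following Python sketch evaluates Q for a given partition. It assumes the standard reading of the symbols, which is not spelled out in the surrounding text: e_ii is the fraction of edge ends that belong to edges lying entirely within group i, and b_i is the fraction of all edge ends attached to nodes in group i. The function name `modularity`, the edge-list input format, and the toy network are illustrative choices.

```python
def modularity(edges, community):
    """Q = sum_i (e_ii - b_i^2) for an unweighted, undirected edge list.

    `community` maps node -> group label (an assumed input format).
    """
    two_m = 2 * len(edges)  # total number of edge ends
    e_within = {}  # fraction of edge ends on edges that stay inside group i
    b = {}         # fraction of all edge ends attached to group i
    for u, v in edges:
        cu, cv = community[u], community[v]
        b[cu] = b.get(cu, 0.0) + 1.0 / two_m
        b[cv] = b.get(cv, 0.0) + 1.0 / two_m
        if cu == cv:
            e_within[cu] = e_within.get(cu, 0.0) + 2.0 / two_m
    return sum(e_within.get(group, 0.0) - b[group] ** 2 for group in b)


# Two triangles joined by one bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
single = {node: "A" for node in range(6)}

q_split = modularity(edges, split)    # the two triangles as separate groups
q_single = modularity(edges, single)  # everything in one group
```

For this toy network, the two-triangle partition gives Q = 5/14, whereas placing every node in a single community gives Q = 0.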
\[
J_{ij} = \frac{A_{ij} - \lambda P_{ij}}{W}. \tag{7}
\]
One can alternatively incorporate a resolution parameter into Aij or elsewhere in the definition
of a quality function (see, e.g., [6]). This allows
one to zoom in and out in order to find communities of different sizes and thereby explore
both the modular and the hierarchical structures
of a graph. Fixing $\lambda$ in (7) corresponds to setting the scale at which one is examining the
network: Larger values of $\lambda$ yield smaller communities (and vice versa). Resolution parameters
have now been incorporated (both explicitly and
implicitly) into several methods that use modularity [12, 104], other quality functions [6, 69], and
other perspectives [8, 97].
Although introducing a resolution parameter
using equations like (7) seems ad hoc at first, it
can yield very interesting insights. For example,
$J_{ij} = (A_{ij} - \lambda)/W$ gives a uniform null model in
which a given fixed average edge weight occurs
between each node. This can be useful for correlation and similarity networks, such as those
produced from matrices of yea and nay votes.
Nodes i and j want to be in the same community
if and only if they voted the same way more than
some threshold fraction of times that is specified
by the value of $\lambda$.
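The thresholding behavior of the uniform null model can be illustrated with a small Python sketch. It assumes that A_ij is the fraction of votes on which legislators i and j agree, that W is the total edge weight (each unordered pair counted once), and that a positive coupling favors placing i and j in the same community; the voting records, the function names, and the value of the resolution parameter `lam` are all hypothetical.

```python
from itertools import combinations


def agreement_matrix(votes):
    """A[(i, j)] = fraction of roll calls on which i and j voted the same way."""
    names = sorted(votes)
    A = {}
    for i, j in combinations(names, 2):
        same = sum(a == b for a, b in zip(votes[i], votes[j]))
        A[(i, j)] = same / len(votes[i])
    return A


def uniform_null_couplings(A, lam):
    """J_ij = (A_ij - lam) / W, with W taken to be the total edge weight (assumed)."""
    W = sum(A.values())
    return {pair: (a - lam) / W for pair, a in A.items()}


# Hypothetical yea (Y) and nay (N) votes on four roll calls.
votes = {"a": "YYNY", "b": "YYNN", "c": "NNYN"}
A = agreement_matrix(votes)
J = uniform_null_couplings(A, lam=0.5)

# Pairs whose agreement exceeds the threshold have a positive coupling.
same_side = [pair for pair, coupling in J.items() if coupling > 0]
```

Because W is positive, the sign of each coupling depends only on whether the agreement fraction exceeds `lam`, so raising `lam` leaves fewer pairs that favor sharing a community.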
Even more exciting, one can relate resolution
parameters to the time scales of dynamical processes unfolding on a network [27,68,98,108]. Just
as we can learn about the behavior of a dynamical
system by studying the structural properties of
the network on which it is occurring, we can also
learn about the network's structural properties
by studying the behavior of a given dynamical
process. This suggests the intuitive result that the
choice of quality function should also be guided by
the nature of the dynamical process of interest. In
addition to revealing that resolution parameters
arise naturally, this perspective shows that the
Potts method arises as a special case of placing
a continuous time random walk with Poisson-distributed steps on a network [27]. Freezing the
dynamics at a particular point in time yields the
modularity-maximizing partition. Freezing at earlier times yields smaller communities (because
the random walker hasn't explored as much of
the graph), and waiting until later times results in
larger communities. The $t \to \infty$ limit reproduces
the partitioning from Miroslav Fiedler's original
spectral method [35].
Applications
Armed with the above ideas and algorithms, we
turn to selected demonstrations of their efficacy.
The increasing rapidity of developments in network community detection has resulted in part
at the selection of topics and citations in this section.) In this spirit, we use scientific coauthorship
networks as our first example.
A bipartite (two-mode) coauthorship network, with scientists linked to papers that they authored
or coauthored, can be defined by letting $\delta_i^p = 1$
if scientist $i$ was a coauthor on paper $p$ and
0 otherwise. Such a network was collected and
examined from different databases of research
papers in [80–82]. To represent the collaboration
strength between scientists $i$ and $j$, one can define
\[
A_{ij} = \sum_p \frac{\delta_i^p \delta_j^p}{n_p - 1} \tag{8}
\]
as the components of a weighted unipartite (one-mode) network, where $n_p$ is the number of authors
of paper p and the sum runs over multiple-author
papers only. Applying betweenness-based community detection to a network derived from Santa Fe
Institute working papers using (8) yields communities that correspond to different disciplines [48].
The statistical physics community can then be further subdivided into three smaller modules that
are each centered around the research interests
of one dominant member. Similar results have
been found using various community-finding algorithms and numerous coauthorship networks,
such as the network of network scientists [86]
(see Figures 1 and 5), which has become one of
the standard benchmark examples in community-detection papers.
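The projection in (8) from papers to a weighted one-mode network can be sketched as follows. This is a minimal illustration, assuming that each paper is given simply as a list of author names; the function name `collaboration_weights` and the sample papers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def collaboration_weights(papers):
    """Weight each coauthor pair by the sum of 1 / (n_p - 1) over shared papers."""
    weights = defaultdict(float)
    for authors in papers:
        authors = sorted(set(authors))
        n_p = len(authors)
        if n_p < 2:
            continue  # single-author papers are excluded from the sum
        for i, j in combinations(authors, 2):
            weights[(i, j)] += 1.0 / (n_p - 1)
    return dict(weights)


# Hypothetical author lists: one two-author paper, one three-author paper,
# and one single-author paper (which contributes nothing).
papers = [["Ann", "Bob"], ["Ann", "Bob", "Cy"], ["Cy"]]
weights = collaboration_weights(papers)
```

Each pair is stored once with its names in sorted order, so the symmetric entry A_ji is implied. The factor 1 / (n_p - 1) means that a paper with many authors contributes less to each pairwise tie than a two-author paper does.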
Mobile Phone Networks
Several recent papers have attempted to uncover
the large-scale characteristics of communication
and social networks using mobile phone data
sets [49, 95, 96]. Like many of the coauthorship
data sets studied recently [96], mobile phone networks are longitudinal (time-dependent). However,
in contrast to the ties in the coauthorship networks
above, links in phone networks arise from instant
communication events and capture relationships
as they happen. This means that at any given instant, the network consists of the collection of ties
connecting the people who are currently having a
phone conversation. To probe longer-term social
structures, one needs to aggregate the data over a
time window.
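Aggregation over a time window can be sketched in Python as follows. The event format (caller, callee, start time, duration) and the choice of total call duration as the tie weight are assumptions made for illustration; the function name `aggregate_calls` is hypothetical.

```python
from collections import Counter


def aggregate_calls(events, t_start, t_end):
    """Sum call durations per unordered pair for calls starting in [t_start, t_end)."""
    weights = Counter()
    for caller, callee, start, duration in events:
        if t_start <= start < t_end and caller != callee:
            # An unordered pair: who placed the call is ignored.
            weights[frozenset((caller, callee))] += duration
    return dict(weights)


# Hypothetical call records: (caller, callee, start time, duration), in seconds.
events = [
    ("a", "b", 0, 60),
    ("b", "a", 100, 30),
    ("a", "c", 500, 10),
    ("c", "d", 50, 5),
]
window = aggregate_calls(events, t_start=0, t_end=200)
```

Widening the window adds ties and strengthens existing ones, so the choice of window length is itself a modeling decision.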
In 2007 one research group used a society-wide
communication network containing the mobile
phone interaction patterns of millions of individuals in an anonymous European country to explore
the relationship of microscopic, mesoscopic, and
macroscopic network structures to the strength
of ties between individuals on a societal level [95].
They obtained very interesting insights into Mark
Granovetter's famous weak tie hypothesis, which
states that the relative overlap of the friendship circles of two individuals increases with the strength
example, that communities at Princeton, Georgetown, and the University of North Carolina at
Chapel Hill are organized predominantly by class
year, whereas those at Caltech are based almost
exclusively on House (dormitory) affiliation. As
we illustrate in Figure 7, community structure can
also be used to make simple yet intelligent guesses
about withheld user characteristics. Naturally, this
haven't had space to discuss (see the review articles [26, 39, 110] for more information on many of
them). Despite this wealth of technical advances,
much work remains. As Mark Newman recently wrote [88], "The development of methods for
finding communities within networks is a thriving
sub-area of the field, with an enormous number of different techniques under development.
Methods for understanding what the communities
mean after you find them are, by contrast, still
quite primitive, and much needs to be done if
we are to gain real knowledge from the output
of our computer programs." One of our primary
purposes in writing this article is as a call to
arms for the mathematics community to be a
part of this exciting endeavor. Accordingly, we
close our discussion with additional comments
about important unresolved issues.
The remarkable advances of the past few years
have been driven largely by a massive influx
of data. Many of the fascinating networks that
have been constructed using such data are enormous (with millions of nodes or more). Given
that optimization procedures, such as maximizing graph modularity, have been proven to be
NP-complete [17], much of the research drive has
been to formulate fast methods that still find
a reasonable community structure. Some of the
existing algorithms scale well enough to be used
on large networks, whereas others must be restricted to smaller ones. The wealth of data has
also led to an increasing prevalence (and, we hope,
cognizance) of privacy issues. However, although
the study of network communities has become so
prominent, this research area has serious flaws
from both theoretical and applied perspectives:
There are almost no theorems, and few methods
have been developed to use or even validate the
communities that we find.
We hope that some of the mathematically minded Notices readers will be sufficiently excited
by network community detection to contribute
by developing new methods that address important graph features and make existing techniques
more rigorous. When analyzing networks constructed from real-world data, the best practice
right now is to use several of the available computationally tractable algorithms and trust only
those structures that are similar across multiple
methods in order to be confident that they are
properties of the actual data rather than byproducts of the algorithms used to produce them.
Numerous heuristics and analytical arguments
are available, but there aren't any theorems, and
even the notion of community structure is itself
based on the methodology selected to compute
it. There also appear to be deep but uncharacterized connections between methods that have
been developed in different fields [39, 110]. Additionally, it would be wonderful if there were a
few algorithms that can be applied to such multiplex situations without constructing individual
graphs for each category, and further development will likely require the application of ideas
from multilinear algebra [65,115]. It would also be
desirable to detect communities in hypergraphs
and to be able to consider connections between
agents that are given by interval ranges rather
than precise values. Finally, to be able to study
interactions between dynamical processes on networks and the structural dynamics of networks
themselves (e.g., if somebody spends a day at
home with the flu, the network structure in the
workplace is different than usual that day), a lot
more work is needed on both overlap between
communities and on the community structure
of time-dependent and parameter-dependent networks. Analyzing time- and parameter-dependent
networks currently relies on ad hoc amalgamation
of different snapshots rather than on a systematic
approach, so it is necessary to develop community-detection methods that incorporate the network
structure at multiple times (or parameter values)
simultaneously [58, 96, 116]. More generally, this
will also have important ramifications for studies
of clustering in correlated time series.
We stress that research on network communities has focused on using exclusively structural information (i.e., node connectivity and link
weights) to deduce structural communities as imperfect proxies for functional communities [39,
116, 122]. While this seems to be sufficient for
some applications [39], in most situations it is
not at all clear that structural communities actually map well to the organization of actors in
social networks, functions in biological networks,
etc. Hence, it is necessary to develop tools for the
detection of functional communities that, whenever possible, incorporate node characteristics and
other available information along with the network's structural information. The elephant in the
literature is simply elucidated with just one question: Now that we have all these ways of detecting
communities, what do we do with them?
Acknowledgements
Our views on network community structure have
been shaped by numerous discussions with our
colleagues and students over the last several years.
We particularly acknowledge Aaron Clauset, Santo
Fortunato, Nick Jones, Eric Kelsic, Mark Newman, Stephen Reid, and Chris Wiggins. We also
thank Joe Blitzstein, Tim Elling, Santo Fortunato, James Fowler, A. J. Friend, Roger Guimerà,
Nick Jones, David Kempe, Franziska Klingner, Renaud Lambiotte, David Lazer, Sune Lehmann, Jim
Moody, and David Smith for useful comments on
this manuscript, and Christina Frost and Amanda
Traud for assistance in preparing some of the figures. We obtained data from Adam D'Angelo and
Facebook, the House of Representatives Office of
the Clerk (Congressional committee assignments),
Mark Newman (network scientist coauthorship),
James Fowler (Congressional legislation cosponsorship), and Keith Poole (Congressional roll call
votes). PJM was funded by the NSF (DMS-0645369)
and by start-up funds provided by the Institute
for Advanced Materials, Nanoscience and Technology and the Department of Mathematics at
the University of North Carolina at Chapel Hill.
MAP acknowledges a research award (#220020177)
from the James S. McDonnell Foundation. JPO is
supported by the Fulbright Program.
References
[1] G. Agarwal and D. Kempe, Modularity-maximizing network communities using mathematical programming, The European Physical
Journal B, 66 (2008), 409–418.
[2] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, Communities and hierarchical organization of links in
complex networks. arXiv:0903.3178 (2009).
[3] R. Albert and A.-L. Barabási, Statistical mechanics of complex networks, Reviews of Modern
Physics, 74 (2002), 47–97.
[4] J. M. Anthonisse, The rush in a graph. Technical Report BN 9/71, Stichting Mathematische
Centrum, Amsterdam, 1971.
[5] A. Arenas, A. Fernández, S. Fortunato, and
S. Gómez, Motif-based communities in complex networks, Journal of Physics A: Mathematical and
Theoretical, 41, 224001 (2008).
[6] A. Arenas, A. Fernández, and S. Gómez, Analysis
of the structure of complex networks at different resolution levels, New Journal of Physics, 10,
053039 (2008).
[7] J. P. Bagrow, Evaluating local community methods in networks, Journal of Statistical Mechanics:
Theory and Experiment, P05001 (2008).
[8] J. P. Bagrow and E. M. Bollt, A local method
for detecting communities, Physical Review E, 72,
046108 (2005).
[9] S. Bansal, S. Khandelwal, and L. A. Meyers,
Evolving clustered random networks. arXiv:
0808.0509 (2008).
[10] M. J. Barber, Modularity and community detection in bipartite networks, Physical Review E, 76,
066102 (2007).
[11] M. Blatt, S. Wiseman, and E. Domany, Superparamagnetic clustering of data, Physical Review
Letters, 76 (1996), 3251–3254.
[12] V. D. Blondel, J.-L. Guillaume, R. Lambiotte,
and E. Lefebvre, Fast unfolding of communities
in large networks, Journal of Statistical Mechanics:
Theory and Experiment, P10008 (2008).
[13] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez,
and D.-U. Hwang, Complex networks: Structure and dynamics, Physics Reports, 424 (2006),
175–308.
[57] J. M. Hofman and C. H. Wiggins, Bayesian approach to network modularity, Physical Review
Letters, 100, 258701 (2008).
[58] J. Hopcroft, O. Khan, B. Kulis, and B. Selman,
Tracking evolving communities in large linked networks, Proceedings of the National Academy of
Sciences, 101 (2004), 5249–5253.
[59] T. Hogg, D. Wilkinson, G. Szabo, and M. Brzozowski, Multiple relationship types in online
communities and social networks, in Proceedings of
the AAAI Spring Symposium on Social Information
Processing, AAAI Press (2008).
[60] G. C. Homans, The human group, Harcourt, Brace,
New York, 1950.
[61] S. C. Johnson, Hierarchical clustering schemes,
Psychometrika, 32 (1967), 241–254.
[62] T. Kamada and S. Kawai, An algorithm for drawing
general undirected graphs, Information Processing
Letters, 31 (1989), 7–15.
[63] F. Képès, ed., Biological networks, Complex Systems and Interdisciplinary Science, Vol. 3, World
Scientific, Hackensack, NJ, 2007.
[64] B. W. Kernighan and S. Lin, An efficient heuristic
procedure for partitioning graphs, The Bell System
Technical Journal, 49 (1970), 291–307.
[65] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM Review, 51:3 (2009), in
press.
[66] C. P. Kottak, Cultural anthropology, McGraw-Hill,
New York, 5th ed., 1991.
[67] R. Kumar, J. Novak, and A. Tomkins, Structure
and evolution of online social networks. 12th International Conference on Knowledge Discovery and Data
Mining (2006).
[68] R. Lambiotte, J.-C. Delvenne, and M. Barahona, Dynamics and modular structure in networks.
arXiv:0812.1770 (2008).
[69] A. Lancichinetti, S. Fortunato, and J. Kertész,
Detecting the overlapping and hierarchical community structure of complex networks, New Journal of
Physics, 11, 033015 (2009).
[70] A. Lancichinetti, S. Fortunato, and F. Radicchi,
Benchmark graphs for testing community detection
algorithms, Physical Review E, 78, 046110 (2008).
[71] S. Lehmann, M. Schwartz, and L. K. Hansen, Biclique communities, Physical Review E, 78, 016108
(2008).
[72] E. A. Leicht and M. E. J. Newman, Community structure in directed networks, Physical Review Letters,
100, 118703 (2008).
[73] K. Lewis, J. Kaufman, M. Gonzalez, M. Wimmer,
and N. A. Christakis, Tastes, ties, and time: A
new (cultural, multiplex, and longitudinal) social network dataset using Facebook.com, Social Networks,
30 (2008), 330–342.
[74] R. Luce and A. Perry, A method of matrix analysis of group structure, Psychometrika, 14 (1949),
95–116.
[75] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan,
D. Chklovskii, and U. Alon, Network motifs: Simple building blocks of complex networks, Science, 298
(2002), 824–827.
[76] M. Molloy and B. Reed, A critical point for random graphs with a given degree sequence, Random
Structures and Algorithms, 6 (1995), 161–180.