Community Structure in Social and Biological Networks: M. Girvan and M. E. J. Newman
M. Girvan*†‡ and M. E. J. Newman*§
*Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501; †Department of Physics, Cornell University, Clark Hall, Ithaca, NY 14853-2501; and
§Department of Physics, University of Michigan, Ann Arbor, MI 48109-1120
Edited by Lawrence A. Shepp, Rutgers, State University of New Jersey–New Brunswick, Piscataway, NJ, and approved April 6, 2002 (received for review
December 6, 2001)
This paper was submitted directly (Track II) to the PNAS office.
‡To whom reprint requests should be addressed. E-mail: [email protected].

A number of recent studies have focused on the statistical properties of networked systems such as social networks and the Worldwide Web. Researchers have concentrated particularly on a few properties that seem to be common to many networks: the small-world property, power-law degree distributions, and network transitivity. In this article, we highlight another property that is found in many networks, the property of community structure, in which network nodes are joined together in tightly knit groups, between which there are only looser connections. We propose a method for detecting such communities, built around the idea of using centrality indices to find community boundaries. We test our method on computer-generated and real-world graphs whose community structure is already known and find that the method detects this known structure with high sensitivity and reliability. We also apply the method to two networks whose community structure is not well known (a collaboration network and a food web) and find that it detects significant and informative community divisions in both cases.

Many systems take the form of networks, sets of nodes or vertices joined together in pairs by links or edges (1). Examples include social networks (2–4) such as acquaintance networks (5) and collaboration networks (6), technological networks such as the Internet (7), the Worldwide Web (8, 9), and power grids (4, 5), and biological networks such as neural networks (4), food webs (10), and metabolic networks (11, 12). Recent research on networks among mathematicians and physicists has focused on a number of distinctive statistical properties that most networks seem to share. One such property is the "small world effect," which is the name given to the finding that the average distance between vertices in a network is short (13, 14), usually scaling logarithmically with the total number n of vertices. Another is the right-skewed degree distributions that many networks possess (8, 9, 15–17). The degree of a vertex in a network is the number of other vertices to which it is connected, and one finds that there are typically many vertices in a network with low degree and a small number with high degree, the precise distribution often following a power-law or exponential form (1, 5, 15).

A third property that many networks have in common is clustering, or network transitivity, which is the property that two vertices that are both neighbors of the same third vertex have a heightened probability of also being neighbors of one another. In the language of social networks, two of your friends will have a greater probability of knowing one another than will two people chosen at random from the population, on account of their common acquaintance with you. This effect is quantified by the clustering coefficient C (4, 18), defined by

$$
C = \frac{3 \times (\text{number of triangles on the graph})}{(\text{number of connected triples of vertices})}. \tag{1}
$$

This number is precisely the probability that two of one's friends are friends themselves. It is 1 on a fully connected graph (everyone knows everyone else) and has typical values in the range of 0.1 to 0.5 in many real-world networks.

In this article, we consider another property, which, as we will show, appears to be common to many networks, the property of community structure. (This property is also sometimes called clustering, but we refrain from this usage to avoid confusion with the other meaning of the word clustering introduced in the preceding paragraph.) Consider for a moment the case of social networks: networks of friendships or other acquaintances between individuals. It is a matter of common experience that such networks seem to have communities in them: subsets of vertices within which vertex–vertex connections are dense, but between which connections are less dense. A figurative sketch of a network with such a community structure is shown in Fig. 1. (Certainly it is possible that the communities themselves also join together to form metacommunities, and that those metacommunities are themselves joined together, and so on in a hierarchical fashion. This idea is discussed further in the next section.) The ability to detect community structure in a network could clearly have practical applications. Communities in a social network might represent real social groupings, perhaps by interest or background; communities in a citation network (19) might represent related papers on a single topic; communities in a metabolic network might represent cycles and other functional groupings; communities on the web might represent pages on related topics. Being able to identify these communities could help us to understand and exploit these networks more effectively.

In this article we propose a method for detecting community structure and apply it to the study of a number of different social and biological networks. As we will show, when applied to networks for which the community structure is already known from other studies, our method appears to give excellent agreement with the expected results. When applied to networks for which we do not have other information about communities, it gives promising results that may help us understand better the interplay between network structure and function.

Detecting Community Structure
In this section we review existing methods for detecting community structure and discuss the ways in which these approaches may fail, before describing our own method, which avoids some of the shortcomings of the traditional techniques.

Traditional Methods. The traditional method for detecting community structure in networks such as that depicted in Fig. 1 is hierarchical clustering. One first calculates a weight W_ij for every pair i,j of vertices in the network, which represents in some sense how closely connected the vertices are. (We give some examples of possible such weights below.) Then one takes the n vertices in the network, with no edges between them, and adds edges between pairs one by one in order of their weights, starting with the pair with the strongest weight and progressing to the weakest. As edges are added, the resulting graph shows a nested set of [...]
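As an illustration of the edge-addition procedure described under Traditional Methods, the following Python sketch starts from n isolated vertices and adds pairs in order of decreasing weight, recording each merge of two groups. The function name `hierarchical_clustering`, the toy `weights` dictionary, and the choice to read the nested groups off the connected components (tracked with a union-find structure) are assumptions made for this example; the text does not prescribe a particular weight or data structure.

```python
def hierarchical_clustering(n, weights):
    """Add vertex pairs in order of decreasing weight and record merges.

    n       -- number of vertices, labelled 0..n-1
    weights -- dict mapping a pair (i, j) to its closeness weight W_ij
    Returns a list of (weight, merged_component) events, strongest first.
    """
    parent = list(range(n))               # union-find forest
    members = {v: [v] for v in range(n)}  # root -> vertices in that component

    def find(v):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    merges = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # this edge falls inside an existing component
        parent[rj] = ri
        members[ri] = sorted(members[ri] + members.pop(rj))
        merges.append((w, list(members[ri])))
    return merges


# Hypothetical weights: pairs (0, 1) and (2, 3) are closely connected,
# and the two pairs are only weakly tied to each other.
weights = {
    (0, 1): 0.9, (2, 3): 0.8,
    (1, 3): 0.2, (0, 2): 0.1,
    (0, 3): 0.05, (1, 2): 0.05,
}
merges = hierarchical_clustering(4, weights)
for w, component in merges:
    print(w, component)
```

Reading the merge list from top to bottom gives the nesting: the tightly weighted pairs form first, and they are joined into a single group only when the weaker weights are reached.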
[...] independently at random, with probability P_in for vertices belonging to the same community and P_out for vertices in different communities, with P_out < P_in. The probabilities were chosen so as to keep the average degree z of a vertex equal to 16. This produces graphs that have known community structure, but which are essentially random in other respects. Feeding these graphs into our algorithm, we measured the fraction of vertices that were classified by the algorithm into their correct communities, as a function of the average number of intercommunity edges per vertex. The results are shown in Fig. 3 (circles). As Fig. 3 shows, the algorithm performs nearly perfectly when z_out < 6, classifying 90% or more of the vertices correctly. Only for z_out ≥ 6 does the fraction correctly classified start to fall off substantially. In other words, the algorithm performs very well almost to the point at which each vertex has as many intercommunity as intracommunity connections.

For comparison we also show in Fig. 3 (squares) the fraction of vertices classified correctly by a standard hierarchical clustering calculation based on independent path counts computed by using max-flow. As Fig. 3 shows, the performance of this method is far inferior to that of our method.

Zachary's Karate Club Study. Although computer-generated networks provide a reproducible and well controlled test bed for our community-structure algorithm, it is clearly desirable to test the algorithm on data from real-world networks as well. To this end, we have selected two datasets representing real-world networks for which the community structure is already known from other sources. The first of these is drawn from the well known karate [...]

Fig. 4a shows the network, with the instructor and the administrator represented by nodes 1 and 34, respectively. Fig. 4b shows the hierarchical tree of communities produced by our method. The most fundamental split in the network is the first one at the top of the tree, which divides the network into two groups of roughly equal size. This split corresponds almost perfectly with the actual division of the club members after the break-up, as revealed by which club they attended afterward. Only one node, node 3, is classified incorrectly. In other words, the application of our algorithm to the empirically observed network of friendships is a good predictor of the subsequent social evolution of the group.

For comparison we also have performed a traditional hierarchical clustering based on edge-independent paths for the karate club network; the resulting tree is shown in Fig. 4c. As Fig. 4c shows, this method correctly identifies the core vertex sets {1,2,3} and {33,34} of the two communities, but otherwise there appears to be little correlation with the actual split of the club, indicating once again that our method is significantly more accurate and sensitive than the standard method.

College Football. As a further test of our algorithm, we turn to the world of United States college football. (Football here means American football, not soccer.) The network we look at is a representation of the schedule of Division I games for the 2000 season: vertices in the graph represent teams (identified by their college names) and edges represent regular-season games between the two teams they connect. What makes this network [...]
Girvan and Newman PNAS | June 11, 2002 | vol. 99 | no. 12 | 7823
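The computer-generated test graphs described above, in which edges are placed independently with probability P_in inside a community and P_out between communities, can be sketched in a few lines of Python. This is only an illustrative generator under stated assumptions: the sizes used here (four groups of 32 vertices), the function names `planted_partition` and `degree_summary`, and the way the probabilities are derived from target values z_in and z_out are choices made for this example and are not taken from this excerpt.

```python
import random
from itertools import combinations


def planted_partition(n_groups, group_size, z_in, z_out, seed=None):
    """Random graph with known communities.

    Each pair of vertices is joined independently with probability p_in
    (same group) or p_out (different groups). The probabilities are set
    so that a vertex has on average z_in intracommunity and z_out
    intercommunity edges.
    """
    rng = random.Random(seed)
    n = n_groups * group_size
    group = [v // group_size for v in range(n)]
    p_in = z_in / (group_size - 1)   # group_size - 1 possible partners inside
    p_out = z_out / (n - group_size)  # n - group_size possible partners outside
    edges = set()
    for u, v in combinations(range(n), 2):
        p = p_in if group[u] == group[v] else p_out
        if rng.random() < p:
            edges.add((u, v))
    return group, edges


def degree_summary(group, edges):
    """Return (mean degree, mean number of intercommunity edges per vertex)."""
    n = len(group)
    inter = sum(1 for u, v in edges if group[u] != group[v])
    return 2 * len(edges) / n, 2 * inter / n


# Assumed illustrative sizes: 4 groups of 32 vertices, z_in + z_out = 16.
group, edges = planted_partition(n_groups=4, group_size=32,
                                 z_in=12.0, z_out=4.0, seed=1)
mean_degree, mean_z_out = degree_summary(group, edges)
print(round(mean_degree, 2), round(mean_z_out, 2))
```

Sweeping z_out upward while holding z_in + z_out at 16 reproduces the kind of experiment the text describes, with the community labels in `group` serving as the known answer against which a detection method can be scored.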
Fig. 4. (a) The friendship network from Zachary’s karate club study (26) as
described in the text. Nodes associated with the club administrator’s faction
are drawn as circles, those associated with the instructor’s faction are drawn
as squares. (b) Hierarchical tree showing the complete community structure
for the network calculated by using the algorithm presented in this article. The
initial split of the network into two groups is in agreement with the actual
factions observed by Zachary, with the exception that node 3 is misclassified.
(c) Hierarchical tree calculated by using edge-independent path counts, which
fails to extract the known community structure of the network.
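The hierarchical trees in Fig. 4 come from the authors' edge-betweenness method, whose details are not spelled out in this excerpt. The Python sketch below assumes the commonly described formulation: compute the betweenness of every edge (the number of shortest paths between vertex pairs that run along it, with ties shared equally), remove the edge with the highest value, recompute, and repeat until the graph splits. The function names, the breadth-first accumulation scheme, and the toy graph are assumptions for illustration, not the authors' implementation.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness of each edge of an undirected graph.

    adj maps each vertex to the set of its neighbours. Returns a dict
    keyed by frozenset({u, v}).
    """
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds = {s: 0}, {s: 1.0}, {s: []}
        order, queue = [], deque([s])
        while queue:                       # breadth-first search from s
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    sigma[v] = 0.0
                    preds[v] = []
                    queue.append(v)
                if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in order}
        for v in reversed(order):          # accumulate from the farthest vertex
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                bet[frozenset((u, v))] += share
                delta[u] += share
    # Every vertex pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def split_once(adj):
    """Remove highest-betweenness edges, recomputing after each removal,
    until the number of components increases."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:
            break                                  # no edges left to remove
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bridge_score = edge_betweenness(toy)[frozenset((2, 3))]
parts = sorted(sorted(c) for c in split_once(toy))
print(bridge_score, parts)
```

In the toy graph every shortest path between the two triangles must cross the edge 2-3, so that edge collects the highest score and is removed first. Applying `split_once` recursively to each resulting part would yield a tree of nested divisions like those drawn in the figure.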
[...] show that in these cases it can help us to understand the make-up of otherwise complex and tangled datasets. Our first example is a collaboration network of scientists; our second is a food web of marine organisms in the Chesapeake Bay.

Collaboration Network. We have applied our community-finding method to a collaboration network of scientists at the Santa Fe Institute, an interdisciplinary research center in Santa Fe, New Mexico (and current academic home to both authors of this article). The 271 vertices in this network represent scientists in residence at the Santa Fe Institute during any part of calendar year 1999 or 2000 and their collaborators. An edge is drawn between a pair of scientists if they coauthored one or more articles during the same time period. The network includes all journal and book publications by the scientists involved, along with all papers that appeared in the institute's technical reports series. On average, each scientist coauthored articles with approximately five others.

In Fig. 6 we illustrate the results from the application of our algorithm to the largest component of the collaboration graph (which consists of 118 scientists). Vertices are drawn as different shapes according to the primary divisions detected. We find that the algorithm splits the network into a few strong communities, with the divisions running principally along disciplinary lines. The community indicated by diamonds is the least well defined and represents a group of scientists using agent-based models to study problems in economics and traffic flow. The algorithm further divides this group into smaller components that correspond [...]

[...] be undirected.

Applying our algorithm to this network, we find two well defined communities of roughly equal size, plus a small number of vertices that belong to neither community (see Fig. 7). As Fig. 7 shows, the split between the two large communities corresponds quite closely with the division between pelagic organisms (those that dwell principally near the surface or in the middle depths of the bay) and benthic organisms (those that dwell near the bottom). Interestingly, the algorithm includes within each group organisms from a variety of different trophic levels. This finding contrasts with other techniques that have been used to analyze food webs (28), which tend to cluster taxa according to trophic level rather than habitat. Our results seem to imply that pelagic and benthic organisms in the Chesapeake Bay can be separated into reasonably self-contained ecological subsystems. The separation is not perfect: a small number of benthic organisms find their way into the pelagic community, presumably indicating that these species play a substantial role in the food chains of their surface-dwelling colleagues. This finding suggests that the simple traditional division of taxa into pelagic or benthic may not be an ideal classification in this case.

We also have applied our method to a number of other food webs. Interestingly, although some of these show clear community structure similar to that of Fig. 7, some others do not. This could be because some ecosystems are genuinely not composed of separate communities, but it also could be because many food webs, unlike other networks, are dense, i.e., the number of edges scales as the square of the number of vertices rather than scaling linearly (29). Our algorithm was designed with sparse networks
in mind, and it is possible that it may not perform as well on dense networks.

Fig. 7. Hierarchical tree for the Chesapeake Bay food web described in the text.

Conclusions
In this article we have investigated community structure in networks of various kinds, introducing a method for detecting such structure. Unlike previous methods that focus on finding the strongly connected cores of communities, our approach works by using information about edge betweenness to detect community peripheries. We have tested our method on computer-generated graphs and have shown that it detects the known community structure with a high degree of success. We have also tested it on two real-world networks with well documented structure and find the results to be in excellent agreement with expectations. In addition, we have given two examples of applications of the algorithm to networks whose structure was not previously well documented and find that in both cases it extracts clear communities that appear to correspond to plausible and informative divisions of the network nodes.

A number of extensions or improvements of our method may be possible. First, we hope to generalize the method to handle both weighted and directed graphs. Second, we hope that it may be possible to improve the speed of the algorithm. At present, the algorithm runs in time O(n^3) on sparse graphs, where n is the number of vertices in the network. This makes it impractical for very large graphs. Detecting communities in, for instance, the large collaboration networks (6) or subsets of the web graph (9) that have been studied recently, would be entirely unfeasible. Perhaps, however, the basic principles of our approach (focusing on the boundaries of communities rather than their cores, and making use of edge betweenness) can be incorporated into a modified method that scales more favorably with network size.

We hope that the ideas and methods presented here will prove useful in the analysis of many other types of networks. Possible further applications range from the determination of functional clusters within neural networks to analysis of communities on the Worldwide Web, as well as others not yet thought of. We hope to see such applications in the future.

We thank Jennifer Dunne, Neo Martinez, Matthew Salganik, Steve Strogatz, and Doug White for useful conversations, and Jennifer Dunne, Sarah Knutson, Matthew Salganik, and Doug White for help with compiling the data for the food web, collaboration, college football, and karate club networks, respectively. This work was funded in part by National Science Foundation Grants DMS-0109086, DGE-9870631, and PHY-9910217.
1. Strogatz, S. H. (2001) Nature (London) 410, 268–276.
2. Wasserman, S. & Faust, K. (1994) Social Network Analysis (Cambridge Univ. Press, Cambridge, U.K.).
3. Scott, J. (2000) Social Network Analysis: A Handbook (Sage, London), 2nd Ed.
4. Watts, D. J. & Strogatz, S. H. (1998) Nature (London) 393, 440–442.
5. Amaral, L. A. N., Scala, A., Barthélémy, M. & Stanley, H. E. (2000) Proc. Natl. Acad. Sci. USA 97, 11149–11152.
6. Newman, M. E. J. (2001) Proc. Natl. Acad. Sci. USA 98, 404–409.
7. Faloutsos, M., Faloutsos, P. & Faloutsos, C. (1999) Comput. Commun. Rev. 29, 251–262.
8. Albert, R., Jeong, H. & Barabási, A.-L. (1999) Nature (London) 401, 130–131.
9. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. & Wiener, J. (2000) Comput. Networks 33, 309–320.
10. Williams, R. J. & Martinez, N. D. (2000) Nature (London) 404, 180–183.
11. Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. & Barabási, A.-L. (2000) Nature (London) 407, 651–654.
12. Fell, D. A. & Wagner, A. (2000) Nat. Biotechnol. 18, 1121–1122.
13. Pool, I. de S. & Kochen, M. (1978) Social Networks 1, 1–48.
14. Milgram, S. (1967) Psychol. Today 2, 60–67.
15. Barabási, A.-L. & Albert, R. (1999) Science 286, 509–512.
16. Krapivsky, P. L., Redner, S. & Leyvraz, F. (2000) Phys. Rev. Lett. 85, 4629–4632.
17. Dorogovtsev, S. N., Mendes, J. F. F. & Samukhin, A. N. (2000) Phys. Rev. Lett. 85, 4633–4636.
18. Newman, M. E. J., Strogatz, S. H. & Watts, D. J. (2001) Phys. Rev. E 64, 026118.
19. Redner, S. (1998) Eur. Phys. J. B 4, 131–134.
20. Menger, K. (1927) Fundamenta Mathematicae 10, 96–115.
21. White, D. R. & Harary, F. (2001) Sociol. Methodol. 31, 305–359.
22. Ahuja, R. K., Magnanti, T. L. & Orlin, J. B. (1993) Network Flows: Theory, Algorithms, and Applications (Prentice–Hall, Upper Saddle River, NJ).
23. Katz, L. (1953) Psychometrika 18, 39–43.
24. Freeman, L. (1977) Sociometry 40, 35–41.
25. Newman, M. E. J. (2001) Phys. Rev. E 64, 016131.
26. Zachary, W. W. (1977) J. Anthropol. Res. 33, 452–473.
27. Baird, D. & Ulanowicz, R. E. (1989) Ecol. Monogr. 59, 329–364.
28. Yodzis, P. & Winemiller, K. O. (1999) Oikos 87, 327–340.
29. Martinez, N. D. (1992) Am. Nat. 139, 1208–1218.