Radius Plots for Mining Tera-byte Scale Graphs:
Algorithms, Patterns, and Observations

U Kang (SCS, CMU), Charalampos E. Tsourakakis (SCS, CMU), Ana Paula Appel* (CSD, USP at São Carlos), Christos Faloutsos (SCS, CMU), Jure Leskovec† (CSD, Stanford)

* Work performed while visiting CMU.
† Work performed while at CMU.

Abstract

Given large, multi-million node graphs (e.g., FaceBook, web-crawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers of the graphs? We show that the Radius Plot (pdf of node radii) can answer these questions. However, computing the Radius Plot is prohibitively expensive for graphs reaching the planetary scale.

There are two major contributions in this paper: (a) we propose HADI (HAdoop DIameter and radii estimator), a carefully designed and fine-tuned algorithm to compute the diameter of massive graphs, which runs on top of the HADOOP/MAPREDUCE system with excellent scale-up on the number of available machines; (b) we run HADI on several real-world datasets, including YahooWeb (6B edges, 1/8 of a Terabyte), one of the largest public graphs ever analyzed.

Thanks to HADI, we report fascinating patterns on large networks, like the surprisingly small effective diameter, the multi-modal/bi-modal shape of the Radius Plot, and its palindrome motion over time.

1 Introduction

What do real, Terabyte-scale graphs look like? Is it true that the nodes with the highest degree are the most central ones, i.e., have the smallest radius? How do we compute the diameter and node radii in graphs of such size?

Graphs appear in numerous settings, such as social networks (FaceBook, LinkedIn), computer network intrusion logs, who-calls-whom phone networks, search engine click-streams (term-URL bipartite graphs), and many more. The contributions of this paper are the following:

1. Design: We propose HADI, a scalable algorithm to compute the radii and the diameter of a network. As shown in Figure 1, our method is 7.6x faster than the naive version.
2. Optimization and Experimentation: We carefully fine-tune our algorithm, and we test it on one of the largest public web graphs ever analyzed, with several billions of nodes and edges, spanning 1/8 of a Terabyte.
3. Observations: Thanks to HADI, we find interesting patterns and observations, like the "Multi-modal and Bi-modal" pattern and the surprisingly small effective diameter of the Web. For example, see the multi-modal pattern in the Radius Plot of Figure 1, which also shows the effective diameter and the center node of the Web ('google.com').

The HADI algorithm (implemented in HADOOP) and several datasets are available at http://www.cs.cmu.edu/~ukang/HADI. The rest of the paper is organized as follows: Section 2 defines related terms and a sequential algorithm for the Radius Plot. Section 3 describes large-scale algorithms for the Radius Plot, and Section 4 analyzes the complexity of the algorithms and provides a possible extension. In Section 5 we present timing results, and in Section 6 we observe interesting patterns. After describing background in Section 7, we conclude in Section 8.

Figure 1: (Left) Radius Plot (Count versus Radius) of the YahooWeb graph. Notice the effective diameter is surprisingly small. Also notice the peak (marked 'S') at radius 2, due to star-structured disconnected components. (Middle) Radius Plot of the GCC (Giant Connected Component) of the YahooWeb graph. The only node with radius 5 (marked 'C') is google.com. (Right) Running time of HADI with and without optimizations for Kronecker and Erdős-Rényi graphs with billions of edges, run on the M45 HADOOP cluster using 90 machines for 3 iterations. HADI-OPT is up to 7.6x faster than HADI-plain.

2 Preliminaries; Sequential Radii Calculation

2.1 Definitions In this section, we define several terms related to the radius and the diameter. Recall that, for a node v in a graph G, the radius r(v) of v is the distance between v and a reachable node farthest away from v. The diameter d(G) of a graph G is the maximum radius of nodes v ∈ G. That is, d(G) = max_v r(v).

Since the radius and the diameter are susceptible to outliers (e.g., long chains), we follow the literature and define the effective radius and diameter as follows.

DEFINITION 1. (EFFECTIVE RADIUS) For a node v in a graph G, the effective radius r_eff(v) of v is the 90th percentile of all the distances from v.
DEFINITION 2. (EFFECTIVE DIAMETER) The effective diameter d_eff(G) of a graph G is the minimum number of hops in which 90% of all connected pairs of nodes can reach each other.

We will use the following three Radius-based Plots:

1. Static Radius Plot (or just "Radius Plot") of graph G shows the distribution (count) of the effective radius of nodes at a specific time, as shown in Figure 1.
2. Temporal Radius Plot shows the distributions of the effective radius of nodes at several times (see Figure 9 for an example).
3. Radius-Degree Plot shows the scatter plot of the effective radius r_eff(v) versus the degree d_v for each node v, as shown in Figure 8.

Table 1 lists the symbols used in this paper.

Table 1: Table of symbols
Symbol   | Definition
G        | a graph
n        | number of nodes in a graph
m        | number of edges in a graph
d        | diameter of a graph
h        | number of hops
N(h)     | number of node-pairs reachable in ≤ h hops (neighborhood function)
N(h, i)  | number of neighbors of node i reachable in ≤ h hops
b(h, i)  | Flajolet-Martin bitstring for node i at h hops
b̂(h, i)  | partial Flajolet-Martin bitstring for node i at h hops

2.2 Computing Radius and Diameter To generate the Radius Plot, we need to calculate the effective radius of every node. In addition, the effective diameter is useful for tracking the evolution of networks. Therefore, we describe our algorithm for computing the effective radius and the effective diameter of a graph. As described in Section 7, existing algorithms do not scale well. To handle graphs with billions of nodes and edges, we use the following two main ideas:

1. We use an approximation rather than an exact algorithm.
2. We design a parallel algorithm for HADOOP/MAPREDUCE (the algorithm can also run in a parallel RDBMS).

To approximate the effective radius and the effective diameter, we use the Flajolet-Martin algorithm [17][29] for counting the number of distinct elements in a multiset. While many other applicable algorithms exist (e.g., [6], [10], [18]), we choose the Flajolet-Martin algorithm because it gives an unbiased estimate, as well as a tight O(log n) bound for the space complexity [3].

The main idea is that we maintain K Flajolet-Martin (FM) bitstrings b(h, i) for each node i and current hop number h. b(h, i) encodes the number of nodes reachable from node i within h hops, and can be used to estimate radii and diameter as shown below. The bitstrings b(h, i) are iteratively updated until the bitstrings of all nodes stabilize. At the h-th iteration, each node receives the bitstrings of its neighboring nodes, and updates its own bitstrings b(h - 1, i) handed over from the previous iteration:

(2.1) b(h, i) = b(h - 1, i) BIT-OR {b(h - 1, j) | (i, j) ∈ E}

where "BIT-OR" denotes bitwise OR. After h iterations, node i has K bitstrings that encode the neighborhood function N(h, i), that is, the number of nodes within h hops from node i. N(h, i) is estimated from the K bitstrings by

(2.2) N(h, i) = (1 / 0.77351) · 2^{(1/K) Σ_{l=1}^{K} b_l(i)}

where b_l(i) is the position of the leftmost '0' bit of the l-th bitstring of node i. The iterations continue until the bitstrings of all nodes stabilize, for which it is a necessary condition that the current iteration number h exceeds the diameter d(G). After the iterations finish at h_max, we can calculate the effective radius for every node and the effective diameter of the graph, as follows:

• r_eff(i) is the smallest h such that N(h, i) ≥ 0.9 · N(h_max, i).
• d_eff(G) is the smallest h such that N(h) = Σ_i N(h, i) ≥ 0.9 · N(h_max).

Algorithm 1 shows the summary of the algorithm described above.
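To make Equations (2.1) and (2.2) concrete, here is a minimal single-machine sketch of the FM machinery. The hashing scheme and all function names are ours (a simplified stand-in; HADI's actual bitstrings follow [17]):

```python
import hashlib

K = 32        # number of FM bitstrings per node; the paper typically uses 32
BITS = 32     # length of each bitstring

def fm_position(x, l):
    """Hash item x for bitstring l: bit p is set with probability 2^-(p+1)."""
    h = int.from_bytes(hashlib.sha1(f"{l}:{x}".encode()).digest()[:8], "big")
    p = (h & -h).bit_length() - 1 if h else 0   # index of lowest set bit
    return min(p, BITS - 1)

def new_fm_bitstring(node):
    """b(0, node): K bitstrings encoding the singleton set {node}."""
    return [1 << fm_position(node, l) for l in range(K)]

def bit_or(a, b):
    """Equation (2.1): elementwise bitwise OR of two K-bitstring vectors."""
    return [x | y for x, y in zip(a, b)]

def leftmost_zero(bits):
    """b_l: position of the leftmost '0' bit (bit 0 is 'leftmost' here)."""
    p = 0
    while (bits >> p) & 1:
        p += 1
    return p

def estimate_count(bs):
    """Equation (2.2): N ≈ (1 / 0.77351) * 2^(mean leftmost-zero position)."""
    mean = sum(leftmost_zero(b) for b in bs) / len(bs)
    return (2.0 ** mean) / 0.77351
```

One hop of the iteration is then just OR-ing a node's bitstring vector with those of its out-neighbors; because BIT-OR is idempotent, re-adding already-counted nodes never inflates the estimate.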
Algorithm 1 Computing Radii and Diameter
Input: Input graph G and integers MaxIter and K
Output: r_eff(i) of every node i, and d_eff(G)
1: for i = 1 to n do
2:   b(0, i) ← NewFMBitstring(n);
3: end for
4: for h = 1 to MaxIter do
5:   Changed ← 0;
6:   for i = 1 to n do
7:     for l = 1 to K do
8:       b_l(h, i) ← b_l(h - 1, i) BIT-OR {b_l(h - 1, j) | ∀j adjacent from i};
9:     end for
10:    if ∃l s.t. b_l(h, i) ≠ b_l(h - 1, i) then
11:      increase Changed by 1;
12:    end if
13:  end for
14:  N(h) ← Σ_i N(h, i);
15:  if Changed equals 0 then
16:    h_max ← h, and break for loop;
17:  end if
18: end for
19: for i = 1 to n do {estimate eff. radii}
20:   r_eff(i) ← smallest h' where N(h', i) ≥ 0.9 · N(h_max, i);
21: end for
22: d_eff(G) ← smallest h' where N(h') ≥ 0.9 · N(h_max);

The parameter K is typically set to 32 [17], and MaxIter is set to 256, since real graphs have relatively small effective diameters. The NewFMBitstring() function in line 2 generates K FM bitstrings [17]. The effective radius r_eff(i) is determined at line 20, and the effective diameter d_eff(G) is determined at line 22.

Algorithm 1 runs in O(dm) time, since the algorithm iterates at most d times, with each iteration running in O(m) time. By using approximation, Algorithm 1 runs faster than previous approaches (see Section 7 for discussion). However, Algorithm 1 is a sequential algorithm; it requires O(n log n) space, and thus it cannot handle extremely large graphs (beyond billions of nodes and edges) which cannot fit into a single machine. In the next sections we present efficient parallel algorithms.

3 Proposed Method

In the next two sections we describe HADI, a parallel radius and diameter estimation algorithm. As mentioned in Section 2, HADI can run on top of both a MAPREDUCE system and a parallel SQL DBMS. In the following, we first describe the general idea behind HADI and show the algorithm for MAPREDUCE. The algorithm for a parallel SQL DBMS is sketched in Section 4.

3.1 HADI Overview HADI follows the flow of Algorithm 1; that is, it uses the FM bitstrings and iteratively updates them using the bitstrings of each node's neighbors. The most expensive operation in Algorithm 1 is line 8, where the bitstrings of each node are updated. Therefore, HADI focuses on the efficient implementation of this operation using the MAPREDUCE framework.

It is important to notice that HADI is a disk-based algorithm; indeed, a memory-based algorithm is not possible for Tera- and Peta-byte scale data. HADI saves two kinds of information to a distributed file system (such as HDFS, the Hadoop Distributed File System, in the case of HADOOP):

• Edges, in the format (srcid, dstid).
• Bitstrings, in the format (nodeid, bitstring_1, ..., bitstring_K).

Combining the bitstrings of each node with those of its neighbors is a very expensive operation which needs several optimizations to scale up near-linearly. In the following sections we describe three HADI algorithms in a progressive way: we first describe HADI-naive, to give the big picture and explain why such a naive implementation should not be used in practice; then HADI-plain; and finally HADI-optimized, the proposed method that should be used in practice. We use HADOOP to describe the MAPREDUCE version of HADI.

3.2 HADI-naive in MAPREDUCE HADI-naive is inefficient, but we present it for ease of explanation.

Data The edge file is saved as a sparse adjacency matrix in HDFS. Each line of the file contains a nonzero element of the adjacency matrix of the graph, in the format of (srcid, dstid).
Also, the bitstrings of each node are saved in a file in the format of (nodeid, flag, bitstring_1, ..., bitstring_K). The flag records information about the status of the node (e.g., a 'Changed' flag to check whether one of the bitstrings changed or not). Notice that we do not know the physical distribution of the data in HDFS.

Main Program Flow The main idea of HADI-naive is to use the bitstrings file as a logical "cache" for the machines which contain the edge files. The bitstring update operation in Equation (2.1) requires that the machine which updates the bitstrings of node i have access to (a) all edges adjacent from i, and (b) all bitstrings of the adjacent nodes. To meet requirement (a), we need to reorganize the edge file so that edges with the same source id are grouped together. That can be done with an Identity mapper which outputs the given input edges in (srcid, dstid) format. The simplest, yet naive, way to meet requirement (b) is to send the bitstrings to every machine which receives the reorganized edge file.

Thus, HADI-naive iterates over two stages of MAPREDUCE. The first stage updates the bitstrings of each node and sets the 'Changed' flag if at least one of the bitstrings of the node differs from the previous bitstring. The second stage counts the number of changed nodes and stops iterating when the bitstrings have stabilized, as illustrated in the swim-lane diagram of Figure 2.

Figure 2: One iteration of HADI-naive. First stage: bitstrings of all nodes are sent to every reducer. Second stage: sums up the count of changed nodes. The multiple arrows at the beginning of Stage 2 mean that there may be many machines containing bitstrings.

Although conceptually simple and clear, HADI-naive is unnecessarily expensive, because it ships all the bitstrings to all reducers. Thus, we propose HADI-plain and additional optimizations, which we explain next.

3.3 HADI-plain in MAPREDUCE HADI-plain improves HADI-naive by copying only the necessary bitstrings to each reducer. The details are next.

Data As in HADI-naive, the edges are saved in the format (srcid, dstid), and bitstrings are saved in the format (nodeid, flag, bitstring_1, ..., bitstring_K) in files over HDFS. The initial bitstring generation, which corresponds to lines 1-3 of Algorithm 1, can be performed in a completely parallel way. The flag of each node records the following information:

• Effective Radii and Hop Numbers, to calculate the effective radius.
• Changed flag, to indicate whether at least one bitstring has been changed or not.

Main Program Flow As mentioned in the beginning, HADI-plain copies only the necessary bitstrings to each reducer. The main idea is to replicate the bitstrings of node j exactly x times, where x is the in-degree of node j. A replicated bitstring of node j is called a partial bitstring and is represented by b̂(h, j). The replicated b̂(h, j)'s are used to update b(h, i), the bitstring of node i where (i, j) is an edge in the graph. HADI-plain iteratively runs three-stage MAPREDUCE jobs until all bitstrings of all nodes stop changing. Algorithms 2, 3, and 4 show HADI-plain. We use h for the current iteration number, starting from h = 1. Output(a, b) means to output a pair of data with the key a and the value b.

Stage 1 We generate (key, value) pairs, where the key is a node id i and the value is the partial bitstrings b̂(h, j), where j ranges over all the neighbors adjacent from node i. To generate such pairs, the bitstrings of node j are grouped together with the edges whose dstid is j. Notice that at the very first iteration, bitstrings of nodes do not exist; they have to be generated on the fly, and we use the Bitstring Creation Command for that. Notice also that line 22 of Algorithm 2 is used to propagate the bitstrings of one's own node. These bitstrings are compared to the newly updated bitstrings at Stage 2 to check convergence.

Stage 2 Bitstrings of node i are updated by combining the partial bitstrings of itself and of the nodes adjacent from i. For this purpose, the mapper is the Identity mapper (output the input without any modification). The reducer combines them, generates new bitstrings, and sets the flag by recording (a) whether at least one bitstring changed or not, and (b) the current iteration number h and the neighborhood value N(h, i) (line 9). This h and N(h, i) are used to calculate the effective radius of nodes after all bitstrings converge, i.e., stop changing. Notice that only the last neighborhood N(h_last, i) and the other neighborhoods N(h', i) that satisfy N(h', i) ≥ 0.9 · N(h_last, i) need to be saved to calculate the effective radius. The output of Stage 2 is fed into the input of Stage 1 at the next iteration.

Stage 3 We calculate the number of changed nodes and sum up the neighborhood values of all nodes to calculate N(h). We use only two unique keys (key_for_changed and key_for_neighborhood), which correspond to the two calculated values. The analysis of line 2 can be done by checking the flag field and using Equation (2.2) in Section 2.

When all bitstrings of all nodes have converged, a MAPREDUCE job to finalize the effective radius and diameter is performed, and the program finishes. Compared to HADI-naive, the advantage of HADI-plain is clear: bitstrings and edges are evenly distributed over machines, so that the algorithm can handle as much data as possible, given sufficiently many machines.
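To make the data flow of Stages 1 and 2 concrete, here is a toy, in-memory imitation of the two jobs (a Python stand-in for the actual HADOOP jobs; bitstrings are lists of ints, and all names are ours):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy MapReduce: map every record, shuffle (group by key), then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(reducer(k, groups[k]))
    return out

# Stage 1: replicate b(h-1, j) once per edge (i, j), grouping by dstid j.
def stage1_map(kind, payload):
    if kind == 'bits':                 # (node, bitstrings)
        node, bits = payload
        yield node, ('self', bits)
    else:                              # ('edge', (src, dst))
        src, dst = payload
        yield dst, ('edge', src)

def stage1_reduce(j, values):
    bits = next(b for tag, b in values if tag == 'self')
    for tag, src in values:
        if tag == 'edge':              # partial bitstring of j, routed to src
            yield 'partial', (src, bits)
    yield 'partial', (j, bits)         # propagate one's own bitstring

# Stage 2: BIT-OR all partial bitstrings of each node (Equation (2.1)).
def stage2_map(kind, payload):         # Identity mapper
    yield payload[0], payload[1]

def stage2_reduce(i, bitstrings):
    combined = bitstrings[0]
    for b in bitstrings[1:]:
        combined = [x | y for x, y in zip(combined, b)]
    yield 'bits', (i, combined)

def one_iteration(edges, bits):
    records = [('edge', e) for e in edges] + [('bits', nb) for nb in bits]
    partial = run_mapreduce(records, stage1_map, stage1_reduce)
    return run_mapreduce(partial, stage2_map, stage2_reduce)
```

On a single edge (1, 2), node 1 ends up with the OR of its own bitstring and node 2's, while node 2 keeps its own, mirroring how each node collects the bitstrings of the nodes adjacent from it.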
Algorithm 2 HADI Stage 1
Input: Edge data E = {(i, j)},
  Current bitstring B = {(i, b(h - 1, i))} or
  Bitstring Creation Command BC = {(i, cmd)}
Output: Partial bitstring B' = {(i, b(h - 1, j))}
1: Stage1-Map(key k, value v);
2: if (k, v) is of type B or BC then
3:   Output(k, v);
4: else if (k, v) is of type E then
5:   Output(v, k);
6: end if
7:
8: Stage1-Reduce(key k, values V[]);
9: SRC ← [];
10: for v ∈ V do
11:   if (k, v) is of type BC then
12:     b̂(h - 1, k) ← NewFMBitstring();
13:   else if (k, v) is of type B then
14:     b̂(h - 1, k) ← v;
15:   else if (k, v) is of type E then
16:     Add v to SRC;
17:   end if
18: end for
19: for src ∈ SRC do
20:   Output(src, b̂(h - 1, k));
21: end for
22: Output(k, b̂(h - 1, k));

Algorithm 3 HADI Stage 2
Input: Partial bitstring B = {(i, b̂(h - 1, j))}
Output: Full bitstring B = {(i, b(h, i))}
1: Stage2-Map(key k, value v); // Identity Mapper
2: Output(k, v);
3:
4: Stage2-Reduce(key k, values V[]);
5: b(h, k) ← 0;
6: for v ∈ V do
7:   b(h, k) ← b(h, k) BIT-OR v;
8: end for
9: Update flag of b(h, k);
10: Output(k, b(h, k));

Algorithm 4 HADI Stage 3
Input: Full bitstring B = {(i, b(h, i))}
Output: Number of changed nodes, neighborhood N(h)
1: Stage3-Map(key k, value v);
2: Analyze v to get (changed, N(h, i));
3: Output(key_for_changed, changed);
4: Output(key_for_neighborhood, N(h, i));
5:
6: Stage3-Reduce(key k, values V[]);
7: Changed ← 0;
8: N(h) ← 0;
9: for v ∈ V do
10:   if k is key_for_changed then
11:     Changed ← Changed + v;
12:   else if k is key_for_neighborhood then
13:     N(h) ← N(h) + v;
14:   end if
15: end for
16: Output(key_for_changed, Changed);
17: Output(key_for_neighborhood, N(h));

3.4 HADI-optimized in MAPREDUCE HADI-optimized further improves HADI-plain. It uses two orthogonal ideas, "block operation" and "bit shuffle encoding"; both address some subtle performance issues. Specifically, HADOOP has the following two major bottlenecks:

• Materialization: at the end of each map/reduce stage, the output is written to disk, and it is also read at the beginning of the next reduce/map stage.
• Sorting: at the Shuffle stage, data is sent to each reducer and sorted before being handed over to the Reduce stage.

HADI-optimized addresses these two issues.

Block Operation Our first optimization is the block encoding of the edges and the bitstrings. The main idea is to group each w-by-w sub-matrix of the adjacency matrix E into a super-element, and to group w bitstrings into a super-bitstring. HADI-plain is then performed on these super-elements and super-bitstrings, instead of the original edges and bitstrings. Of course, appropriate decoding and encoding is necessary at each stage. Figure 3 shows an example of converting data to blocks.

By this block operation, the performance of HADI-plain changes as follows:

• Input size decreases in general, since we can use fewer bits to index elements inside a block.
• Sorting time decreases, since the number of elements to sort decreases.
• Network traffic decreases, since the result of matching a super-element and a super-bitstring is a bitstring which can be at most block-width times smaller than that of HADI-plain.
• Map and Reduce functions take more time, since each block must be decoded to be processed, and encoded back into block format.
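The w-by-w grouping behind the block operation can be sketched as follows (illustrative only; the function names and the dictionary representation are ours, not HADI's on-disk format):

```python
from collections import defaultdict

def edges_to_blocks(edges, w):
    """Group edges into w-by-w super-elements, keyed by block coordinates.
    Endpoints inside a block are re-indexed with small local offsets, so
    fewer bits are needed per element; empty blocks are never materialized."""
    blocks = defaultdict(list)
    for src, dst in edges:
        blocks[(src // w, dst // w)].append((src % w, dst % w))
    return dict(blocks)

def bitstrings_to_blocks(bits, w):
    """Group w consecutive nodes' bitstrings into one super-bitstring."""
    supers = defaultdict(dict)
    for node, b in bits.items():
        supers[node // w][node % w] = b
    return dict(supers)
```

With w = 2 and edges {(0,0), (1,0), (3,3)}, only two of the four possible super-elements are produced, matching the example of Figure 3 where an all-zero block is simply absent. A clustered adjacency matrix therefore yields fewer blocks, which is why the optimization pays off most on community-structured graphs.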
For reasonable-size blocks, the performance gains (smaller input size, faster sorting, less network traffic) outweigh the delays (more time to perform the Map and Reduce functions). Also notice that the number of edge blocks depends on the community structure of the graph: if the adjacency matrix is nicely clustered, we will have fewer blocks. See Section 5, where we show results from block-structured graphs ('Kronecker graphs' [24]) and from random graphs ('Erdős-Rényi graphs' [15]).

Figure 3: Converting the original edge and bitstring to blocks. The 4-by-4 edge and length-4 bitstring are converted to 2-by-2 super-elements and length-2 super-bitstrings. Notice the lower-left super-element of the edge is not produced, since there is no nonzero element inside it.

Bit Shuffle Encoding In our effort to decrease the input size, we propose an encoding scheme that can compress the bitstrings. Recall that in HADI-plain we use K (e.g., 32 or 64) bitstrings for each node, to increase the accuracy of our estimator. Since HADI requires K · ((n + m) log n) space, the amount of data increases when K is large. For example, the YahooWeb graph in Section 6 spans 120 GBytes (with 1.4 billion nodes and 6.6 billion edges). However, the required disk space for just the bitstrings is 32 · (1.4B + 6.6B) · 8 bytes = 2 Terabytes (assuming 8 bytes for each bitstring), which is more than 16 times larger than the input graph.

The main idea of Bit Shuffle Encoding is to carefully reorder the bits of the bitstrings of each node, and then use run-length encoding. By construction, the leftmost part of each bitstring is almost full of ones, and the rest is almost full of zeros. Specifically, we make the reordered bitstrings contain long sequences of 1's and 0's: we take all the first bits from all K bitstrings, then all the second bits, and so on. As a result we get a single bit-sequence of length K · |bitstring|, where most of the first bits are '1's and most of the last bits are '0's. Then we encode only the length of each bit sequence, achieving good space savings (and, eventually, time savings, through fewer I/Os).

4 Analysis and Discussion

In this section, we analyze the time and space complexity of HADI and its possible implementation on an RDBMS.

4.1 Time and Space Analysis We analyze the complexity of HADI with M machines, for a graph G with n nodes and m edges with diameter d. We are interested in the time complexity as well as the space complexity.

LEMMA 4.1. (TIME COMPLEXITY OF HADI) HADI takes O((d(n + m)/M) log((n + m)/M)) time.

Proof. (Sketch) The Shuffle step after Stage 1 takes O(((n + m)/M) log((n + m)/M)) time per iteration, which dominates the time complexity; there are at most d iterations.

Notice that the time complexity of HADI is less than that of the previous approaches of Section 7 (O(n^2 + nm), at best). Similarly, for space we have:

LEMMA 4.2. (SPACE COMPLEXITY OF HADI) HADI requires O((n + m) log n) space.

Proof. (Sketch) The maximum space, K · ((n + m) log n), is required at the output of Stage1-Reduce. Since K is a constant, the space complexity is O((n + m) log n).

4.2 HADI in parallel DBMSs Using relational database management systems (RDBMS) for graph mining is a promising research direction, especially given the findings of [31]. We mention that HADI can be implemented on top of an Object-Relational DBMS (parallel or serial): it needs repeated joins of the edge file with the appropriate file of bitstrings, and a user-defined function for bit-OR-ing. See [20] for details.

5 Scalability of HADI

In this section, we perform experiments to answer the following questions:

• Q1: How fast is HADI?
• Q2: How does it scale up with the graph size and the number of machines?
• Q3: How do the optimizations help performance?

5.1 Experimental Setup We use both the real and synthetic graphs in Table 2 for our experiments and analysis in Sections 5 and 6, with the following details.

• YahooWeb: web pages and their hypertext links, indexed by the Yahoo! Altavista search engine in 2002.
• Patents: U.S. patents, citing each other (from 1975 to 1999).
• LinkedIn: people connected to other people (from 2003 to 2006).
• Kronecker: synthetic Kronecker graphs [24] using a chain of length two as the seed graph.

Table 2: Datasets. B: Billion, M: Million, K: Thousand, G: Gigabytes
Graph       | Nodes  | Edges   | File  | Description
YahooWeb    | 1.4 B  | 6.6 B   | 116G  | page-page
LinkedIn    | 7.5 M  | 58 M    | 1G    | person-person
Patents     | 6 M    | 16 M    | 264M  | patent-patent
Kronecker   | 177 K  | 1,977 M | 25G   | synthetic
            | 120 K  | 1,145 M | 13.9G |
            | 59 K   | 282 M   | 3.3G  |
Erdős-Rényi | 177 K  | 1,977 M | 25G   | random G_{n,p}
            | 120 K  | 1,145 M | 13.9G |
            | 59 K   | 282 M   | 3.3G  |

Figure 4: Running time versus number of edges with HADI-OPT on Kronecker graphs for three iterations. Notice the excellent scalability: linear on the graph size (number of edges).

For the performance experiments, we use the synthetic Kronecker and Erdős-Rényi graphs. The reason for this choice is
that we can generate these two types of graphs at any size, and that Kronecker graphs mirror several real-world graph characteristics, including small and constant diameters, power-law degree distributions, etc. The numbers of nodes and edges of the Erdős-Rényi graphs have been set to the same values as those of the corresponding Kronecker graphs. The main difference of Kronecker compared to Erdős-Rényi graphs is the emergence of a block-wise structure in the adjacency matrix, from its construction [24]. We will see how this characteristic affects the running time of our block optimization in the next sections.

HADI runs on M45, one of the fifty most powerful supercomputers in the world. M45 has 480 hosts (each with 2 quad-core Intel Xeon 1.86 GHz CPUs, running RHEL5), 3 TB of aggregate RAM, and over 1.5 Petabytes of disk.

Finally, we use the following notations to indicate the different optimizations of HADI:

• HADI-BSE: HADI-plain with bit shuffle encoding.
• HADI-BL: HADI-plain with block operation.
• HADI-OPT: HADI-plain with both bit shuffle encoding and block operation.

5.2 Running Time and Scale-up Figure 4 gives the wall-clock time of HADI-OPT versus the number of edges in the graph. Each curve corresponds to a different number of machines used (from 10 to 90). HADI has excellent scalability, with its running time being linear in the number of edges. The other HADI versions (HADI-plain, HADI-BL, and HADI-BSE) were slower, but had a similar, linear trend; they are omitted to avoid clutter.

Figure 5 gives the throughput 1/T_M of HADI-OPT. We also tried HADI with one machine; however, it did not complete, since the machine would take so long that it would often fail in the meantime. For this reason, we do not report the typical scale-up score s = T_1/T_M (the ratio of the time with 1 machine over the time with M machines), and instead we report just the inverse of T_M. HADI scales up near-linearly with the number of machines M, close to the ideal scale-up.

Figure 5: "Scale-up" (throughput 1/T_M) versus number of machines M, for the Kronecker graph (2B edges). Notice the near-linear growth in the beginning, close to the ideal (dotted line).

5.3 Effect of Optimizations Among the optimizations that we mentioned earlier, which one helps the most, and by how much? Figure 1 plots the running time of different graphs versus different HADI optimizations. For the Kronecker graphs, we see that the block operation is more efficient than bit shuffle encoding. Here, HADI-OPT achieves 7.6x better performance than HADI-plain. For the Erdős-Rényi graphs, however, block operations do not help more than bit shuffle encoding, because the adjacency matrix has no block structure, as Kronecker graphs do. Also notice that HADI-BL and HADI-OPT run faster on Kronecker graphs than on Erdős-Rényi graphs of the same size. Again, the reason is that Kronecker graphs have fewer nonzero blocks (i.e., "communities") by their construction, and the block operation yields more savings.
6 HADI At Work
HADI reveals new patterns in massive graphs which we
present in this section.
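Given per-node effective radii (e.g., HADI's output), a Radius Plot is just their histogram; a minimal helper (ours, for illustration) producing the (radius, count) pairs to plot:

```python
from collections import Counter

def radius_plot(effective_radii):
    """Count-versus-radius pairs, sorted by radius, ready to plot."""
    return sorted(Counter(effective_radii).items())
```

For instance, radii [2, 2, 3, 7, 7, 7] yield the points (2, 2), (3, 1), (7, 3); the peaks and dips discussed below are features of exactly this kind of curve.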
6.1 Static Patterns
Diameter What is the effective diameter of the Web?
Barabasi et al. [2] conjectured that it is around 19 for the
1.4 billion-node Web, and Broder et al. [7] reported 6.83 by
sampling from ≈ 200 million-nodes Web. What should be
the diameter, for a significantly larger crawl of the web, with
billions of nodes? Figure 1 gives the surprising answer:
O BSERVATION 1. (S MALL W EB ) The effective diameter of
Figure 7: Radius plot (Count versus radius) for several
the YahooWeb graph (year: 2002) is surprisingly small (≈
connected components of the Patent data in 1985. In red:
7 ∼ 8).
the distribution for the GCC (Giant Connected Component);
Shape of Distribution The next question is, how are rest colors: several DC (Disconnected Component)s. Notice
the radii distributed in real networks? Is it Poisson? Log- that the first peak from Figure 6(a) disappeared.
normal? Figure 1 gives the surprising answer: multimodal!
In other relatively small networks, however, have bi-modal
structures. As shown in the Radius Plot of Patent and
LinkedIn network in Figure 6, they have a peak at zero, a dip
at a small radius value (9, and 4, respectively) and another
peak very close to the dip. Other small networks (includ-
ing IMDB), had similar bi-modal behavior but we omitted
here for brevity. Given the prevalence of bi-modal shape,
our conjecture is that the multi-modal shape of YahooWeb
is possibly due to a mixture of relatively smaller sub-graphs,
which got loosely connected recently.
OBSERVATION 2. (MULTI-MODAL AND BI-MODAL) The radius distribution of the Web graph has a multi-modal structure. Many smaller networks have bi-modal structures.

About the bi-modal structures, a natural question to ask is: what are the common properties of the nodes that belong to the first peak? Similarly, of the nodes in the first dip, and of the nodes in the second peak? After investigation, the former are nodes that belong to the disconnected components ('DC's); nodes in the dip are usually core nodes in the giant connected component (GCC); and the nodes at the second peak are the vast majority of well-connected nodes in the GCC. Figure 7 shows exactly the radius distribution for the nodes of the GCC (in red) and for the nodes of the few largest remaining components. Notice that the first peak disappears, precisely because it consists of nodes from the DCs, which we omitted there.

In Figure 6, 'outsiders' are nodes in the disconnected components, responsible for the first peak and the negative slope down to the dip. 'Core' nodes are the central nodes of the giant connected component. 'Whiskers' [26] are the nodes connected to the GCC by long paths (resembling a whisker), and they are the reason for the second negative slope.

(a) U.S. Patent (b) LinkedIn
Figure 6: Static Radius Plot (Count versus Radius) of U.S. Patent and LinkedIn. Notice the bi-modal structure with 'outsiders', 'core', and 'whiskers'.

Radius plot of GCC Figure 1(b) shows a striking pattern: all nodes of the GCC of the YahooWeb graph have radius 6 or more, except for one (only!). Inspection shows that this node is google.com. We were surprised, because we would expect a few more popular nodes to be in the same situation (e.g., Yahoo, eBay, Amazon).

"Whisker" nodes The next question is: what can we say about the connectivity of the core nodes and the whisker nodes? For example, is it true that the highest-degree nodes are the most central ones (i.e., have minimum radius)? The answer is given by the "Radius-Degree" plot in Figure 8: a scatter plot with one dot for every node, plotting the degree of the node versus its radius, with the nodes of the GCC color-coded in red and the rest in blue.

Figure 8: Radius-Degree plot of Patent at 1985. Notice that the hubs are not necessarily the nodes with the smallest radius within the GCC, and that whiskers have small degree.

OBSERVATION 3. (HIGH DEGREE NODES) The highest-degree nodes (a) belong to the GCC but (b) are not necessarily the ones with the smallest radius.
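To make the radius plot concrete at toy scale, the sketch below computes it naively: one BFS per node to obtain that node's eccentricity, then a histogram of eccentricities. (It uses exact eccentricity rather than the paper's effective radius, and a hypothetical toy graph; this is the per-node-BFS approach that Section 7 notes is infeasible at web scale, which is exactly why HADI approximates it.)

```python
from collections import deque, Counter

def bfs_eccentricity(adj, src):
    """Longest shortest-path distance from src to any reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

def radius_plot(adj):
    """Map radius r -> number of nodes with eccentricity r."""
    return Counter(bfs_eccentricity(adj, u) for u in adj)

# Hypothetical toy graph: a small "GCC" (nodes 0-4) plus a disconnected
# 2-node component (nodes 5-6), mimicking the bi-modal structure:
# the DC nodes produce the low-radius peak, the GCC the second peak.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3), (5, 6)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

print(radius_plot(adj))
```

Running n BFS traversals costs O(n(n+m)) time, which is the bottleneck this sketch illustrates.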
The next observation is that whisker nodes have small degree; that is, they belong to chains (as opposed to more complicated shapes).

6.2 Temporal Patterns Here we study how the radius distribution changes over time. We know that the diameter of a graph typically grows with time, spikes at the 'gelling point', and then shrinks [28], [25]. Indeed, this holds for our datasets (plots omitted for brevity).

The question is: how does the radius distribution change over time? Does it still have the bi-modal pattern? Do the peaks and slopes change over time? We show the answer in Figure 9 and Observation 4.

OBSERVATION 4. (EXPANSION-CONTRACTION) The radius distribution expands to the right until it reaches the gelling point. Then, it contracts to the left.

(a) Patent-Expansion (b) Patent-Contraction
Figure 9: Radius distribution over time. "Expansion": the radius distribution moves to the right until the gelling point. "Contraction": the radius distribution moves to the left after the gelling point.

Another striking observation is that the two decreasing segments seem to be well fit by a line on log-lin axes, indicating an exponential decay.

OBSERVATION 5. (EXPONENTIAL DECAYS) The decreasing segments of several real radius plots seem to decay exponentially, that is,

(6.3) count(r) ∝ exp(−c r)

for every time tick after the gelling point, where count(r) is the number of nodes with radius r and c is a constant.

For the Patents dataset, the correlation coefficient was excellent (typically −0.98 or better).

7 Background
We briefly present related work on algorithms for radius and diameter computation, as well as on large graph mining.

Computing Radius and Diameter The typical algorithms to compute the radius and the diameter of a graph include Breadth-First Search (BFS) and Floyd's algorithm ([11]). Both approaches are prohibitively slow for large graphs, requiring O(n² + nm) and O(n³) time, respectively, where n and m are the number of nodes and edges. For the same reason, related BFS- or all-pairs-shortest-path-based algorithms like [16], [4], [27], [34] cannot handle large graphs.

A sampling approach starts BFS from a subset of nodes, typically chosen at random as in [7]. Despite its practicality, this approach has no obvious way of choosing a representative sample of nodes for the BFS.

Large Graph Mining There are numerous papers on large graph mining and indexing: mining subgraphs ([22], [39], ADI [37], gSpan [38]), graph clustering ([33], Graclus [13], METIS [21]), partitioning ([12], [9], [14]), tensors ([23]), triangle counting ([5], [35], [36]), and minimum cut ([1]), to name a few. However, none of the above computes the diameter of the graph or the radii of the nodes.

Large-scale data processing using scalable and parallel algorithms has attracted increasing attention due to the need to process web-scale data. Because of the volume of the data, platforms for this type of processing adopt a "shared-nothing" architecture. Two promising platforms for such large-scale data analysis are (a) MAPREDUCE and (b) parallel RDBMSs.

The MAPREDUCE programming framework processes huge amounts of data in a massively parallel way, using thousands or millions of commodity machines. It has the advantages of (a) fault-tolerance, (b) familiar concepts from functional programming, and (c) low cost of building the cluster. HADOOP, the open-source version of MAPREDUCE, is a very promising tool for massively parallel graph mining applications (e.g., cross-associations [30], connected components [20]). Other advanced MAPREDUCE-like systems include [19], [8], and [32].
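Observation 5 can be checked numerically: fit a straight line to log count(r) versus r over the decreasing segment and inspect the correlation coefficient, exactly as done for the Patents dataset. A minimal sketch, on hypothetical (made-up) counts rather than the actual data:

```python
import math

# Hypothetical post-gelling-point radius counts, decaying roughly as
# count(r) ~ C * exp(-c * r). Illustrative only; not the Patents data.
radii  = [5, 6, 7, 8, 9, 10]
counts = [12000, 4300, 1600, 580, 210, 80]

# Work on log-lin axes: a straight line there means exponential decay.
ys = [math.log(c) for c in counts]
n = len(radii)
mx = sum(radii) / n
my = sum(ys) / n
sxx = sum((x - mx) ** 2 for x in radii)
sxy = sum((x - mx) * (y - my) for x, y in zip(radii, ys))
syy = sum((y - my) ** 2 for y in ys)

slope = sxy / sxx                    # least-squares estimate of -c
corr = sxy / math.sqrt(sxx * syy)    # Pearson correlation on log-lin axes

print(f"decay rate c ≈ {-slope:.3f}, correlation = {corr:.3f}")
```

A correlation close to −1 (as the paper reports, −0.98 or better for Patents) indicates that the log-lin fit, and hence the exponential-decay model, is a good one.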
Parallel RDBMS systems, including Vertica and Aster Data, are based on traditional database systems and provide high performance using distributed processing and query optimization. Their strength lies in processing structured data. For a detailed comparison of the two approaches, see [31].

Again, none of the above articles shows how to use such platforms to efficiently compute the diameter of a graph.

8 Conclusions
Our main goal is to develop an open-source package to mine Giga-byte, Tera-byte and eventually Peta-byte networks. We designed HADI, an algorithm for computing the radii and diameter of Tera-byte scale graphs, and analyzed large networks to observe important patterns. The contributions of this paper are the following:

• Design: We developed HADI, a scalable MAPREDUCE algorithm for diameter and radius estimation on massive graphs.
• Optimization: Careful fine-tuning of HADI, leading to up to 7.6× faster computation, linear scalability in the size of the graph (number of edges), and near-linear speed-up in the number of machines. The experiments ran on the M45 HADOOP cluster of Yahoo!, one of the 50 largest supercomputers in the world.
• Observations: Thanks to HADI, we could study the diameter and radius distribution of one of the largest public web graphs ever analyzed (over 6 billion edges); we also observed the "Small Web" phenomenon, multi-modal/bi-modal radius distributions, and palindrome motions of radius distributions over time in real networks.

Future work includes algorithms for additional graph mining tasks, like computing eigenvalues and outlier detection, for graphs that span Tera- and Peta-bytes.

Acknowledgments
This work was partially funded by the National Science Foundation under Grants No. IIS-0705359 and IIS-0808661, by CAPES (PDEE project number 3960-07-2), CNPq, and Fapesp, and under the auspices of the U.S. Dept. of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. We would like to thank YAHOO! for the web graph and access to the M45, and Adriano A. Paterlini for feedback. The opinions expressed are those of the authors and do not necessarily reflect the views of the funding agencies.

References
[1] C. C. Aggarwal, Y. Xie, and P. S. Yu. GConnect: A connectivity index for massive disk-resident graphs. PVLDB, 2009.
[2] R. Albert, H. Jeong, and A.-L. Barabasi. Diameter of the world wide web. Nature, (401):130–131, 1999.
[3] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments, 1996.
[4] D. A. Bader and K. Madduri. A graph-theoretic analysis of the human protein-interaction network using multicore parallel algorithms. Parallel Comput., 2008.
[5] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In KDD, 2008.
[6] K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. SIGMOD, 2007.
[7] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks 33, 2000.
[8] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. VLDB, 2008.
[9] D. Chakrabarti, S. Papadimitriou, D. S. Modha, and C. Faloutsos. Fully automatic cross-associations. In KDD, 2004.
[10] M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. PODS, 2000.
[11] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Press, 1990.
[12] S. Daruru, N. M. Marin, M. Walker, and J. Ghosh. Pervasive parallelism in data mining: dataflow solution to co-clustering large and sparse netflix data. In KDD, 2009.
[13] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE TPAMI, 2007.
[14] I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In KDD, 2003.
[15] P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae, 1959.
[16] J.-A. Ferrez, K. Fukuda, and T. Liebling. Parallel computation of the diameter of a graph. In HPCSA, 1998.
[17] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 1985.
[18] M. N. Garofalakis and P. B. Gibbons. Approximate query processing: Taming the terabytes. VLDB, 2001.
[19] R. L. Grossman and Y. Gu. Data mining using high performance data clouds: experimental studies using sector and sphere. KDD, 2008.
[20] U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system - implementation and observations. ICDM, 2009.
[21] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning for irregular graphs. SIAM Review, 41(2):278–300, 1999.
[22] Y. Ke, J. Cheng, and J. X. Yu. Top-k correlative graph mining. SDM, 2009.
[23] T. G. Kolda and J. Sun. Scalable tensor decompositions for multi-aspect data mining. In ICDM, 2008.
[24] J. Leskovec, D. Chakrabarti, J. M. Kleinberg, and C. Faloutsos. Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication. PKDD, pages 133–145, 2005.
[25] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD, 2005.
[26] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In WWW '08, 2008.
[27] J. Ma and S. Ma. Efficient parallel algorithms for some graph theory problems. JCST, 1993.
[28] M. McGlohon, L. Akoglu, and C. Faloutsos. Weighted graphs and disconnected components: patterns and a generator. KDD, 2008.
[29] C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: a fast and scalable tool for data mining in massive graphs. KDD, pages 81–90, 2002.
[30] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with map-reduce. ICDM, 2008.
[31] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. Dewitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. SIGMOD, June 2009.
[32] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with sawzall. Scientific Programming Journal, 2005.
[33] V. Satuluri and S. Parthasarathy. Scalable graph clustering using stochastic flows: applications to community discovery. KDD, 2009.
[34] B. P. Sinha, B. B. Bhattacharya, S. Ghose, and P. K. Srimani. A parallel algorithm to compute the shortest paths and diameter of a graph and its vlsi implementation. IEEE Trans. Comput., 1986.
[35] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos. Doulion: Counting triangles in massive graphs with a coin. KDD, 2009.
[36] C. E. Tsourakakis, M. N. Kolountzakis, and G. L. Miller. Approximate triangle counting. CoRR, abs/0904.3761, 2009.
[37] C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. Scalable mining of large disk-based graph databases. KDD, 2004.
[38] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. ICDM, 2002.
[39] C. H. You, L. B. Holder, and D. J. Cook. Learning patterns in the dynamics of biological networks. In KDD, 2009.