
Power Laws, Pareto Distributions and Zipf's Law

This document discusses power laws, Pareto distributions, and Zipf's law. It summarizes that when the probability of measuring a particular value varies inversely with the power of that value, the quantity follows a power law distribution. Power laws appear widely in many fields including physics, biology, economics, and the social sciences. Examples where quantities follow power laws include the distributions of city sizes, earthquakes, solar flares, and personal fortunes. The origin of power-law behavior has been debated for over a century and different theories have been proposed to explain its emergence.

Copyright
© Attribution Non-Commercial (BY-NC)
Power laws, Pareto distributions and Zipf’s law

M. E. J. Newman
Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor,
MI 48109. U.S.A.
arXiv:cond-mat/0412004v3 [cond-mat.stat-mech] 29 May 2006

When the probability of measuring a particular value of some quantity varies inversely as a power
of that value, the quantity is said to follow a power law, also known variously as Zipf’s law or the
Pareto distribution. Power laws appear widely in physics, biology, earth and planetary sciences,
economics and finance, computer science, demography and the social sciences. For instance,
the distributions of the sizes of cities, earthquakes, solar flares, moon craters, wars and people’s
personal fortunes all appear to follow power laws. The origin of power-law behaviour has been
a topic of debate in the scientific community for more than a century. Here we review some of
the empirical evidence for the existence of power-law forms and the theories proposed to explain
them.

I. INTRODUCTION

Many of the things that scientists measure have a typical size or “scale”, a typical value around which individual measurements are centred. A simple example would be the heights of human beings. Most adult human beings are about 180cm tall. There is some variation around this figure, notably depending on sex, but we never see people who are 10cm tall, or 500cm. To make this observation more quantitative, one can plot a histogram of people’s heights, as I have done in Fig. 1a. The figure shows the heights in centimetres of adult men in the United States measured between 1959 and 1962, and indeed the distribution is relatively narrow and peaked around 180cm. Another telling observation is the ratio of the heights of the tallest and shortest people. The Guinness Book of Records claims the world’s tallest and shortest adult men (both now dead) as having had heights 272cm and 57cm respectively, making the ratio 4.8. This is a relatively low value; as we will see in a moment, some other quantities have much higher ratios of largest to smallest.

Figure 1b shows another example of a quantity with a typical scale: the speeds in miles per hour of cars on the motorway. Again the histogram of speeds is strongly peaked, in this case around 75mph.

[Figure 1: two histograms. Vertical axis: percentage. Horizontal axes: heights of males (left), speeds of cars (right).]

FIG. 1 Left: histogram of heights in centimetres of American males. Data from the National Health Examination Survey, 1959–1962 (US Department of Health and Human Services). Right: histogram of speeds in miles per hour of cars on UK motorways. Data from Transport Statistics 2003 (UK Department for Transport).

But not all things we measure are peaked around a typical value. Some vary over an enormous dynamic range, sometimes many orders of magnitude. A classic example of this type of behaviour is the sizes of towns and cities. The largest population of any city in the US is 8.00 million for New York City, as of the most recent (2000) census. The town with the smallest population is harder to pin down, since it depends on what you call a town. The author recalls in 1993 passing through the town of Milliken, Oregon, population 4, which consisted of one large house occupied by the town’s entire human population, a wooden shack occupied by an extraordinary number of cats and a very impressive flea market. According to the Guinness Book, however, America’s smallest town is Duffield, Virginia, with a population of 52. Whichever way you look at it, the ratio of largest to smallest population is at least 150 000. Clearly this is quite different from what we saw for heights of people. And an even more startling pattern is revealed when we look at the histogram of the sizes of cities, which is shown in Fig. 2.

In the left panel of the figure, I show a simple histogram of the distribution of US city sizes. The histogram is highly right-skewed, meaning that while the bulk of the distribution occurs for fairly small sizes (most US cities have small populations), there is a small number of cities with population much higher than the typical value, producing the long tail to the right of the histogram. This right-skewed form is qualitatively quite different from the histograms of people’s heights, but is not itself very surprising. Given that we know there is a large dynamic range from the smallest to the largest city sizes, we can immediately deduce that there can only be a small number of very large cities. After all, in a country such as America with a total population of 300 million people, you could at most have about 40 cities the size of New York. And the 2700 cities in the histogram of Fig. 2 cannot have a mean population of more than $3 \times 10^8 / 2700 = 110\,000$.

[Figure 2: two histograms. Vertical axis: percentage of cities. Horizontal axis: population of city. Left panel on linear scales, right panel on logarithmic scales.]

FIG. 2 Left: histogram of the populations of all US cities with population of 10 000 or more. Right: another histogram of the same data, but plotted on logarithmic scales. The approximate straight-line form of the histogram in the right panel implies that the distribution follows a power law. Data from the 2000 US Census.

What is surprising on the other hand, is the right panel of Fig. 2, which shows the histogram of city sizes again, but this time replotted with logarithmic horizontal and vertical axes. Now a remarkable pattern emerges: the histogram, when plotted in this fashion, follows quite closely a straight line. This observation seems first to have been made by Auerbach [1], although it is often attributed to Zipf [2]. What does it mean? Let p(x) dx be the fraction of cities with population between x and x + dx. If the histogram is a straight line on log-log scales, then $\ln p(x) = -\alpha \ln x + c$, where α and c are constants. (The minus sign is optional, but convenient since the slope of the line in Fig. 2 is clearly negative.) Taking the exponential of both sides, this is equivalent to:

$$p(x) = C x^{-\alpha}, \tag{1}$$

with $C = e^c$.

Distributions of the form (1) are said to follow a power law. The constant α is called the exponent of the power law. (The constant C is mostly uninteresting; once α
is fixed, it is determined by the requirement that the distribution p(x) sum to 1; see Section III.A.)

Power-law distributions occur in an extraordinarily diverse range of phenomena. In addition to city populations, the sizes of earthquakes [3], moon craters [4], solar flares [5], computer files [6] and wars [7], the frequency of use of words in any human language [2, 8], the frequency of occurrence of personal names in most cultures [9], the numbers of papers scientists write [10], the number of citations received by papers [11], the number of hits on web pages [12], the sales of books, music recordings and almost every other branded commodity [13, 14], the numbers of species in biological taxa [15], people’s annual incomes [16] and a host of other variables all follow power-law distributions.^1

Power-law distributions are the subject of this article. In the following sections, I discuss ways of detecting power-law behaviour, give empirical evidence for power laws in a variety of systems and describe some of the mechanisms by which power-law behaviour can arise. Readers interested in pursuing the subject further may also wish to consult the reviews by Sornette [18] and Mitzenmacher [19], as well as the bibliography by Li.^2

^1 Power laws also occur in many situations other than the statistical distributions of quantities. For instance, Newton’s famous $1/r^2$ law for gravity has a power-law form with exponent α = 2. While such laws are certainly interesting in their own way, they are not the topic of this paper. Thus, for instance, there has in recent years been some discussion of the “allometric” scaling laws seen in the physiognomy and physiology of biological organisms [17], but since these are not statistical distributions they will not be discussed here.

^2 http://linkage.rockefeller.edu/wli/zipf/.
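
The step from a straight line on log-log axes to Eq. (1) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the values of `alpha` and `C` are arbitrary illustrative choices and are not taken from the city data. It evaluates $p(x) = C x^{-\alpha}$ on a grid and fits a straight line to $\ln p$ against $\ln x$, which should recover a slope of $-\alpha$ and an intercept of $c = \ln C$.

```python
import numpy as np

# Illustrative values only (assumptions, not taken from the article's data).
alpha = 2.3   # exponent of the power law
C = 5.0       # normalization constant

# Evaluate p(x) = C * x**(-alpha) on a logarithmically spaced grid.
x = np.logspace(0, 6, 50)
p = C * x ** (-alpha)

# Fit a straight line to (ln x, ln p); np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(np.log(x), np.log(p), 1)

print(f"slope          = {slope:.6f}  (expected {-alpha})")
print(f"exp(intercept) = {np.exp(intercept):.6f}  (expected {C})")
```

The fit is exact here only because the input is a noiseless power law; for sampled data, fitting a line to a histogram is not a reliable way to estimate α.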
II. MEASURING POWER LAWS

Identifying power-law behaviour in either natural or man-made systems can be tricky. The standard strategy makes use of a result we have already seen: a histogram of a quantity with a power-law distribution appears as a straight line when plotted on logarithmic scales. Just making a simple histogram, however, and plotting it on log scales to see if it looks straight is, in most cases, a poor way to proceed.

Consider Fig. 3. This example shows a fake data set: I have generated a million random real numbers drawn from a power-law probability distribution $p(x) = C x^{-\alpha}$ with exponent α = 2.5, just for illustrative purposes.^3 Panel (a) of the figure shows a normal histogram of the numbers, produced by binning them into bins of equal size 0.1. That is, the first bin goes from 1 to 1.1, the second from 1.1 to 1.2, and so forth. On the linear scales used this produces a nice smooth curve.

^3 This can be done using the so-called transformation method. If we can generate a random real number r uniformly distributed in the range 0 ≤ r < 1, then $x = x_{\min} (1 - r)^{-1/(\alpha - 1)}$ is a random power-law-distributed real number in the range $x_{\min} \le x < \infty$ with exponent α. Note that there has to be a lower limit $x_{\min}$ on the range; the power-law distribution diverges as x → 0; see Section II.A.

[Figure 3: four panels (a) to (d). Vertical axis: samples in panels (a) to (c), samples with value > x in panel (d). Horizontal axis: x. Panel (a) on linear scales, panels (b) to (d) on logarithmic scales.]

FIG. 3 (a) Histogram of the set of 1 million random numbers described in the text, which have a power-law distribution with exponent α = 2.5. (b) The same histogram on logarithmic scales. Notice how noisy the results get in the tail towards the right-hand side of the panel. This happens because the number of samples in the bins becomes small and statistical fluctuations are therefore large as a fraction of sample number. (c) A histogram constructed using “logarithmic binning”. (d) A cumulative histogram or rank/frequency plot of the same data. The cumulative distribution also follows a power law, but with an exponent of α − 1 = 1.5.

To reveal the power-law form of the distribution it is better, as we have seen, to plot the histogram on logarithmic scales, and when we do this for the current data we see the characteristic straight-line form of the power-law distribution, Fig. 3b. However, the plot is in some respects not a very good one. In particular the right-hand end of the distribution is noisy because of sampling errors. The power-law distribution dwindles in this region, meaning that each bin only has a few samples in it, if any. So the fractional fluctuations in the bin counts are large and this appears as a noisy curve on the plot. One way to deal with this would be simply to throw out the data in the tail of the curve. But there is often useful information in those data and furthermore, as we will see in Section II.A, many distributions follow a power law only in the tail, so we are in danger of throwing out the baby with the bathwater.

An alternative solution is to vary the width of the bins in the histogram. If we are going to do this, we must also normalize the sample counts by the width of the
bins they fall in. That is, the number of samples in a bin of width Δx should be divided by Δx to get a count per unit interval of x. Then the normalized sample count becomes independent of bin width on average and we are free to vary the bin widths as we like. The most common choice is to create bins such that each is a fixed multiple wider than the one before it. This is known as logarithmic binning. For the present example, for instance, we might choose a multiplier of 2 and create bins that span the intervals 1 to 1.1, 1.1 to 1.3, 1.3 to 1.7 and so forth (i.e., the sizes of the bins are 0.1, 0.2, 0.4 and so forth). This means the bins in the tail of the distribution get more samples than they would if bin sizes were fixed, and this reduces the statistical errors in the tail. It also has the nice side-effect that the bins appear to be of constant width when we plot the histogram on log scales.

I used logarithmic binning in the construction of Fig. 2b, which is why the points representing the individual bins appear equally spaced. In Fig. 3c I have done the same for our computer-generated power-law data. As we can see, the straight-line power-law form of the histogram is now much clearer and can be seen to extend for at least a decade further than was apparent in Fig. 3b.

Even with logarithmic binning there is still some noise in the tail, although it is sharply decreased. Suppose the bottom of the lowest bin is at $x_{\min}$ and the ratio of the widths of successive bins is a. Then the kth bin extends from $x_{k-1} = x_{\min} a^{k-1}$ to $x_k = x_{\min} a^k$ and the expected number of samples falling in this interval is

$$\int_{x_{k-1}}^{x_k} p(x)\,dx = C \int_{x_{k-1}}^{x_k} x^{-\alpha}\,dx = C\,\frac{a^{\alpha-1} - 1}{\alpha - 1}\,(x_{\min} a^k)^{-\alpha+1}. \tag{2}$$

Thus, so long as α > 1, the number of samples per bin goes down as k increases and the bins in the tail will have more statistical noise than those that precede them. As we will see in the next section, most power-law distributions occurring in nature have 2 ≤ α ≤ 3, so noisy tails are the norm.

Another, and in many ways a superior, method of plotting the data is to calculate a cumulative distribution function. Instead of plotting a simple histogram of the data, we make a plot of the probability P(x) that x has a value greater than or equal to x:

$$P(x) = \int_x^\infty p(x')\,dx'. \tag{3}$$

The plot we get is no longer a simple representation of the distribution of the data, but it is useful nonetheless. If the distribution follows a power law $p(x) = C x^{-\alpha}$, then

$$P(x) = C \int_x^\infty x'^{-\alpha}\,dx' = \frac{C}{\alpha - 1}\,x^{-(\alpha - 1)}. \tag{4}$$

Thus the cumulative distribution function P(x) also follows a power law, but with a different exponent α − 1, which is 1 less than the original exponent. Thus, if we plot P(x) on logarithmic scales we should again get a straight line, but with a shallower slope.

But notice that there is no need to bin the data at all to calculate P(x). By its definition, P(x) is well-defined for every value of x and so can be plotted as a perfectly normal function without binning. This avoids all questions about what sizes the bins should be. It also makes much better use of the data: binning of data lumps all samples within a given range together into the same bin and so throws out any information that was contained in the individual values of the samples within that range. Cumulative distributions don’t throw away any information; it’s all there in the plot.

Figure 3d shows our computer-generated power-law data as a cumulative distribution, and indeed we again see the tell-tale straight-line form of the power law, but with a shallower slope than before. Cumulative distributions like this are sometimes also called rank/frequency plots for reasons explained in Appendix A. Cumulative distributions with a power-law form are sometimes said to follow Zipf’s law or a Pareto distribution, after two early researchers who championed their study. Since power-law cumulative distributions imply a power-law form for p(x), “Zipf’s law” and “Pareto distribution” are effectively synonymous with “power-law distribution”. (Zipf’s law and the Pareto distribution differ from one another in the way the cumulative distribution is plotted: Zipf made his plots with x on the horizontal axis and P(x) on the vertical one; Pareto did it the other way around. This causes much confusion in the literature, but the data depicted in the plots are of course identical.^4)

^4 See http://www.hpl.hp.com/research/idl/papers/ranking/ for a useful discussion of these and related points.

We know the value of the exponent α for our artificial data set since it was generated deliberately to have a particular value, but in practical situations we would often like to estimate α from observed data. One way to do this would be to fit the slope of the line in plots like Figs. 3b, c or d, and this is the most commonly used method. Unfortunately, it is known to introduce systematic biases into the value of the exponent [20], so it should not be relied upon. For example, a least-squares fit of a straight line to Fig. 3b gives α = 2.26 ± 0.02, which is clearly incompatible with the known value of α = 2.5 from which the data were generated.

An alternative, simple and reliable method for extracting the exponent is to employ the formula

$$\alpha = 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}} \right]^{-1}. \tag{5}$$

Here the quantities $x_i$, $i = 1 \ldots n$ are the measured values of x and $x_{\min}$ is again the minimum value of x. (As
discussed in the following section, in practical situations $x_{\min}$ usually corresponds not to the smallest value of x measured but to the smallest for which the power-law behaviour holds.) An estimate of the expected statistical error σ on (5) is given by

$$\sigma = \sqrt{n} \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}} \right]^{-1} = \frac{\alpha - 1}{\sqrt{n}}. \tag{6}$$

The derivation of both these formulas is given in Appendix B.

Applying Eqs. (5) and (6) to our present data gives an estimate of α = 2.500 ± 0.002 for the exponent, which agrees well with the known value of 2.5.

A. Examples of power laws

In Fig. 4 we show cumulative distributions of twelve different quantities measured in physical, biological, technological and social systems of various kinds. All have been proposed to follow power laws over some part of their range. The ubiquity of power-law behaviour in the natural world has led many scientists to wonder whether there is a single, simple, underlying mechanism linking all these different systems together. Several candidates for such mechanisms have been proposed, going by names like “self-organized criticality” and “highly optimized tolerance”. However, the conventional wisdom is that there are actually many different mechanisms for producing power laws and that different ones are applicable to different cases. We discuss these points further in Section IV.

[Figure 4: twelve panels (a) to (l) on logarithmic scales, except the horizontal axis of panel (f). Horizontal axes: (a) word frequency, (b) citations, (c) web hits, (d) books sold, (e) telephone calls received, (f) earthquake magnitude, (g) crater diameter in km, (h) peak intensity, (i) intensity, (j) net worth in US dollars, (k) name frequency, (l) population of city.]

FIG. 4 Cumulative distributions or “rank/frequency plots” of twelve quantities reputed to follow power laws. The distributions were computed as described in Appendix A. Data in the shaded regions were excluded from the calculations of the exponents in Table I. Source references for the data are given in the text. (a) Numbers of occurrences of words in the novel Moby Dick by Herman Melville. (b) Numbers of citations to scientific papers published in 1981, from time of publication until June 1997. (c) Numbers of hits on web sites by 60 000 users of the America Online Internet service for the day of 1 December 1997. (d) Numbers of copies of bestselling books sold in the US between 1895 and 1965. (e) Number of calls received by AT&T telephone customers in the US for a single day. (f) Magnitude of earthquakes in California between January 1910 and May 1992. Magnitude is proportional to the logarithm of the maximum amplitude of the earthquake, and hence the distribution obeys a power law even though the horizontal axis is linear. (g) Diameter of craters on the moon. Vertical axis is measured per square kilometre. (h) Peak gamma-ray intensity of solar flares in counts per second, measured from Earth orbit between February 1980 and November 1989. (i) Intensity of wars from 1816 to 1980, measured as battle deaths per 10 000 of the population of the participating countries. (j) Aggregate net worth in dollars of the richest individuals in the US in October 2003. (k) Frequency of occurrence of family names in the US in the year 1990. (l) Populations of US cities in the year 2000.

The distributions shown in Fig. 4 are as follows.

(a) Word frequency: Estoup [8] observed that the frequency with which words are used appears to follow a power law, and this observation was famously examined in depth and confirmed by Zipf [2]. Panel (a) of Fig. 4 shows the cumulative distribution of the number of times that words occur in a typical piece of English text, in this case the text of the novel Moby Dick by Herman Melville.^5 Similar distributions are seen for words in other languages.

^5 The most common words in this case are, in order, “the”, “of”, “and”, “a” and “to”, and the same is true for most written English texts. Interestingly, however, it is not true for spoken English. The most common words in spoken English are, in order, “I”, “and”, “the”, “to” and “that” [21].

(b) Citations of scientific papers: As first observed by Price [11], the numbers of citations received by scientific papers appear to have a power-law distribution. The data in panel (b) are taken from the Science Citation Index, as collated by Redner [22], and are for papers published in 1981. The plot shows the cumulative distribution of the number of citations received by a paper between publication and June 1997.

(c) Web hits: The cumulative distribution of the number of “hits” received by web sites (i.e., servers, not pages) during a single day from a subset of the users of the AOL Internet service. The site with the most hits, by a long way, was yahoo.com. After Adamic and Huberman [12].

(d) Copies of books sold: The cumulative distribution of the total number of copies sold in America of the 633 bestselling books that sold 2 million or more copies between 1895 and 1965. The data were compiled painstakingly over a period of several decades by Alice Hackett, an editor at Publisher’s Weekly [23]. The best selling book during the period covered was Benjamin Spock’s The Common Sense Book of Baby and Child Care. (The Bible, which certainly sold more copies, is not really a single book, but exists in many different translations, versions and publications, and was excluded by Hackett from her statistics.) Substantially better data on book sales than Hackett’s are now available from operations such as Nielsen BookScan, but unfortunately at a price this author cannot afford. I should be very interested to see a plot of sales figures from such a modern source.

(e) Telephone calls: The cumulative distribution of the number of calls received on a single day by 51 million users of AT&T long distance telephone service in the United States. After Aiello et al. [24]. The largest number of calls received by a customer in that day was 375 746, or about 260 calls a minute (obviously to a telephone number that has many people manning the phones). Similar distributions are seen for the number of calls placed by users and also for the numbers of email messages that people send and receive [25, 26].

(f) Magnitude of earthquakes: The cumulative distribution of the Richter (local) magnitude of earthquakes occurring in California between January 1910 and May 1992, as recorded in the Berkeley Earthquake Catalog. The Richter magnitude is defined as the logarithm, base 10, of the maximum amplitude of motion detected in the earthquake, and hence the horizontal scale in the plot, which is drawn as linear, is in effect a logarithmic scale of amplitude. The power law relationship in the earthquake distribution is thus a relationship between amplitude and frequency of occurrence. The data are from the National Geophysical Data Center, www.ngdc.noaa.gov.

(g) Diameter of moon craters: The cumulative distribution of the diameter of moon craters. Rather than measuring the (integer) number of craters of
a given size on the whole surface of the moon, the vertical axis is normalized to measure number of craters per square kilometre, which is why the axis goes below 1, unlike the rest of the plots, since it is entirely possible for there to be less than one crater of a given size per square kilometre. After Neukum and Ivanov [4].

(h) Intensity of solar flares: The cumulative distribution of the peak gamma-ray intensity of solar flares. The observations were made between 1980 and 1989 by the instrument known as the Hard X-Ray Burst Spectrometer aboard the Solar Maximum Mission satellite launched in 1980. The spectrometer used a CsI scintillation detector to measure gamma-rays from solar flares and the horizontal axis in the figure is calibrated in terms of scintillation counts per second from this detector. The data are from the NASA Goddard Space Flight Center, umbra.nascom.nasa.gov/smm/hxrbs.html. See also Lu and Hamilton [5].

(i) Intensity of wars: The cumulative distribution of the intensity of 119 wars from 1816 to 1980. Intensity is defined by taking the number of battle deaths among all participant countries in a war, dividing by the total combined populations of the countries and multiplying by 10 000. For instance, the intensities of the First and Second World Wars were 141.5 and 106.3 battle deaths per 10 000 respectively. The worst war of the period covered was the small but horrifically destructive Paraguay-Bolivia war of 1932–1935 with an intensity of 382.4. The data are from Small and Singer [27]. See also Roberts and Turcotte [7].

(j) Wealth of the richest people: The cumulative distribution of the total wealth of the richest people in the United States. Wealth is defined as aggregate net worth, i.e., total value in dollars at current market prices of all an individual’s holdings, minus their debts. For instance, when the data were compiled in 2003, America’s richest person, William H. Gates III, had an aggregate net worth of $46 billion, much of it in the form of stocks of the company he founded, Microsoft Corporation. Note that net worth doesn’t actually correspond to the amount of money individuals could spend if they wanted to: if Bill Gates were to sell all his Microsoft stock, for instance, or otherwise divest himself of any significant portion of it, it would certainly depress the stock price. The data are from Forbes magazine, 6 October 2003.

(k) Frequencies of family names: Cumulative distribution of the frequency of occurrence in the US of the 89 000 most common family names, as recorded by the US Census Bureau in 1990. Similar distributions are observed for names in some other cultures as well (for example in Japan [28]) but not in all cases. Korean family names for instance appear to have an exponential distribution [29].

(l) Populations of cities: Cumulative distribution of the size of the human populations of US cities as recorded by the US Census Bureau in 2000.

Few real-world distributions follow a power law over their entire range, and in particular not for smaller values of the variable being measured. As pointed out in the previous section, for any positive value of the exponent α the function $p(x) = C x^{-\alpha}$ diverges as x → 0. In reality therefore, the distribution must deviate from the power-law form below some minimum value $x_{\min}$. In our computer-generated example of the last section we simply cut off the distribution altogether below $x_{\min}$ so that p(x) = 0 in this region, but most real-world examples are not that abrupt. Figure 4 shows distributions with a variety of behaviours for small values of the variable measured; the straight-line power-law form asserts itself only for the higher values. Thus one often hears it said that the distribution of such-and-such a quantity “has a power-law tail”.

Extracting a value for the exponent α from distributions like these can be a little tricky, since it requires us to make a judgement, sometimes imprecise, about the value $x_{\min}$ above which the distribution follows the power law. Once this judgement is made, however, α can be calculated simply from Eq. (5).^6 (Care must be taken to use the correct value of n in the formula; n is the number of samples that actually go into the calculation, excluding those with values below $x_{\min}$, not the overall total number of samples.)

^6 Sometimes the tail is also cut off because there is, for one reason or another, a limit on the largest value that may occur. An example is the finite-size effects found in critical phenomena; see Section IV.E. In this case, Eq. (5) must be modified [20].

Table I lists the estimated exponents for each of the distributions of Fig. 4, along with standard errors and also the values of $x_{\min}$ used in the calculations. Note that the quoted errors correspond only to the statistical sampling error in the estimation of α; they include no estimate of any errors introduced by the fact that a single power-law function may not be a good model for the data in some cases or for variation of the estimates with the value chosen for $x_{\min}$.

In the author’s opinion, the identification of some of the distributions in Fig. 4 as following power laws should be considered unconfirmed. While the power law seems to be an excellent model for most of the data sets depicted, a tenable case could be made that the distributions of web hits and family names might have two different power-law regimes with slightly different exponents.^7

^7 Significantly more tenuous claims to power-law behaviour for other quantities have appeared elsewhere in the literature, for
8 Power laws, Pareto distributions and Zipf’s law

    quantity                              minimum x_min   exponent α
(a) frequency of use of words                         1      2.20(1)
(b) number of citations to papers                   100      3.04(2)
(c) number of hits on web sites                       1      2.40(1)
(d) copies of books sold in the US            2 000 000      3.51(16)
(e) telephone calls received                         10      2.22(1)
(f) magnitude of earthquakes                        3.8      3.04(4)
(g) diameter of moon craters                       0.01      3.14(5)
(h) intensity of solar flares                       200      1.83(2)
(i) intensity of wars                                 3      1.80(9)
(j) net worth of Americans                        $600m      2.09(4)
(k) frequency of family names                    10 000      1.94(1)
(l) population of US cities                      40 000      2.30(5)

TABLE I Parameters for the distributions shown in Fig. 4. The labels on the left refer to the panels in the figure. Exponent values were calculated using the maximum likelihood method of Eq. (5) and Appendix B, except for the moon craters (g), for which only cumulative data were available. For this case the exponent quoted is from a simple least-squares fit and should be treated with caution. Numbers in parentheses give the standard error on the trailing figures.

FIG. 5 Cumulative distributions of some quantities whose distributions span several orders of magnitude but that nonetheless do not follow power laws. (a) The number of sightings of 591 species of birds in the North American Breeding Bird Survey 2003. (b) The number of addresses in the email address books of 16 881 users of a large university computer system [33]. (c) The size in acres of all wildfires occurring on US federal land between 1986 and 1996 (National Fire Occurrence Database, USDA Forest Service and Department of the Interior). Note that the horizontal axis is logarithmic in frames (a) and (c) but linear in frame (b).

And the data for the numbers of copies of books sold cover rather a small range, little more than one decade horizontally. Nonetheless, one can, without stretching the interpretation of the data unreasonably, claim that power-law distributions have been observed in language, demography, commerce, information and computer sciences, geology, physics and astronomy, and this on its own is an extraordinary statement.
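The exponents in Table I are quoted as maximum likelihood estimates with standard errors. The following is a minimal sketch of such a fit, assuming Eq. (5) has the standard form $\alpha = 1 + n\,[\sum_i \ln(x_i/x_{\min})]^{-1}$ with standard error $(\alpha-1)/\sqrt{n}$. The function names, the synthetic data and the seed are illustrative choices, not taken from the paper.

```python
import math
import random


def sample_power_law(n, alpha, x_min, rng):
    """Draw n samples from p(x) ~ x^(-alpha) for x >= x_min (inverse transform).

    Uses P(X > x) = (x / x_min)^(-(alpha - 1)), so x = x_min * u^(-1/(alpha - 1))
    with u uniform on (0, 1].
    """
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_exponent(data, x_min):
    """Return (alpha_hat, standard_error, n_tail) for the samples with x >= x_min.

    Assumed estimator: alpha = 1 + n / sum(ln(x_i / x_min)),
    assumed error:     sigma = (alpha - 1) / sqrt(n).
    """
    tail = [x for x in data if x >= x_min]
    n_tail = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    alpha_hat = 1.0 + n_tail / log_sum
    sigma = (alpha_hat - 1.0) / math.sqrt(n_tail)
    return alpha_hat, sigma, n_tail


# Synthetic check: generate data with a known exponent and try to recover it.
rng = random.Random(12345)
data = sample_power_law(50_000, alpha=2.5, x_min=1.0, rng=rng)
alpha_hat, sigma, n_tail = fit_exponent(data, x_min=1.0)
print(f"alpha = {alpha_hat:.3f} +/- {sigma:.3f} from {n_tail} samples")
```

On real data the choice of `x_min` matters, since only the tail above it is assumed to follow the power law.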

(Footnote 7, continued) instance in the discussion of the distribution of the sizes of electrical blackouts [30, 31]. These however I consider insufficiently substantiated for inclusion in the present work.

B. Distributions that do not follow a power law

Power-law distributions are, as we have seen, impressively ubiquitous, but they are not the only form of broad distribution. Lest I give the impression that everything interesting follows a power law, let me emphasize that there are quite a number of quantities with highly right-skewed distributions that nonetheless do not obey power laws. A few of them, shown in Fig. 5, are the following:

(a) The abundance of North American bird species, which spans over five orders of magnitude but is probably distributed according to a log-normal. A log-normally distributed quantity is one whose logarithm is normally distributed; see Section IV.G and Ref. [32] for further discussions.

(b) The number of entries in people's email address books, which spans about three orders of magnitude but seems to follow a stretched exponential. A stretched exponential is a curve of the form $e^{-ax^b}$ for some constants a, b.

(c) The distribution of the sizes of forest fires, which spans six orders of magnitude and could follow a power law but with an exponential cutoff.

This being an article about power laws, I will not discuss further the possible explanations for these distributions, but the scientist confronted with a new set of data having a broad dynamic range and a highly skewed distribution should certainly bear in mind that a power-law model is only one of several possibilities for fitting it.

III. THE MATHEMATICS OF POWER LAWS

A continuous real variable with a power-law distribution has a probability p(x) dx of taking a value in the interval from x to x + dx, where

$$ p(x) = C x^{-\alpha}, \tag{7} $$

with α > 0. As we saw in Section II.A, there must be some lowest value $x_{\min}$ at which the power law is obeyed, and we consider only the statistics of x above this value.

A. Normalization

The constant C in Eq. (7) is given by the normalization requirement that

$$ 1 = \int_{x_{\min}}^{\infty} p(x)\,dx = C \int_{x_{\min}}^{\infty} x^{-\alpha}\,dx = \frac{C}{1-\alpha} \Bigl[ x^{-\alpha+1} \Bigr]_{x_{\min}}^{\infty}. \tag{8} $$

We see immediately that this only makes sense if α > 1, since otherwise the right-hand side of the equation would diverge: power laws with exponents less than unity cannot be normalized and don't normally occur in nature. If α > 1 then Eq. (8) gives

$$ C = (\alpha - 1)\,x_{\min}^{\alpha-1}, \tag{9} $$

and the correct normalized expression for the power law itself is

$$ p(x) = \frac{\alpha-1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha}. \tag{10} $$

Some distributions follow a power law for part of their range but are cut off at high values of x. That is, above some value they deviate from the power law and fall off quickly towards zero. If this happens, then the distribution may be normalizable no matter what the value of the exponent α. Even so, exponents less than unity are rarely, if ever, seen.

B. Moments

The mean value of our power-law distributed quantity x is given by

$$ \langle x \rangle = \int_{x_{\min}}^{\infty} x\,p(x)\,dx = C \int_{x_{\min}}^{\infty} x^{-\alpha+1}\,dx = \frac{C}{2-\alpha} \Bigl[ x^{-\alpha+2} \Bigr]_{x_{\min}}^{\infty}. \tag{11} $$

Note that this expression becomes infinite if α ≤ 2. Power laws with such low values of α have no finite mean. The distributions of sizes of solar flares and wars in Table I are examples of such power laws.

What does it mean to say that a distribution has an infinite mean? Surely we can take the data for real solar flares and calculate their average? Indeed we can and necessarily we will always get a finite number from the calculation, since each individual measurement x is itself a finite number and there are a finite number of them. Only if we had a truly infinite number of samples would we see the mean actually diverge.

However, if we were to repeat our finite experiment many times and calculate the mean for each repetition, then the mean of those many means is itself also formally divergent, since it is simply equal to the mean we would calculate if all the repetitions were combined into one large experiment. This implies that, while the mean may take a relatively small value on any particular repetition of the experiment, it must occasionally take a huge value, in order that the overall mean diverge as the number of repetitions does. Thus there must be very large fluctuations in the value of the mean, and this is what the divergence in Eq. (11) really implies. In effect, our calculations are telling us that the mean is not a well defined quantity, because it can vary enormously from one measurement to the next, and indeed can become arbitrarily large. The formal divergence of $\langle x \rangle$ is a signal that, while we can quote a figure for the average of the samples we measure, that figure is not a reliable guide to the typical size of the samples in another instance of the same experiment.

For α > 2 however, the mean is perfectly well defined, with a value given by Eq. (11) of

$$ \langle x \rangle = \frac{\alpha-1}{\alpha-2}\,x_{\min}. \tag{12} $$

We can also calculate higher moments of the distribution p(x). For instance, the second moment, the mean square, is given by

$$ \langle x^2 \rangle = \frac{C}{3-\alpha} \Bigl[ x^{-\alpha+3} \Bigr]_{x_{\min}}^{\infty}. \tag{13} $$

This diverges if α ≤ 3. Thus power-law distributions in this range, which includes almost all of those in Table I, have no meaningful mean square, and thus also no meaningful variance or standard deviation. If α > 3, then the second moment is finite and well-defined, taking the value

$$ \langle x^2 \rangle = \frac{\alpha-1}{\alpha-3}\,x_{\min}^2. \tag{14} $$

These results can easily be extended to show that in general all moments $\langle x^m \rangle$ exist for m < α − 1 and all higher moments diverge. The ones that do exist are given by

$$ \langle x^m \rangle = \frac{\alpha-1}{\alpha-1-m}\,x_{\min}^m. \tag{15} $$

C. Largest value

Suppose we draw n measurements from a power-law distribution. What value is the largest of those measurements likely to take? Or, more precisely, what is the probability π(x) dx that the largest value falls in the interval between x and x + dx?

The definitive property of the largest value in a sample is that there are no others larger than it. The probability that a particular sample will be larger than x is given by

the quantity P(x) defined in Eq. (3):

$$ P(x) = \int_x^{\infty} p(x')\,dx' = \frac{C}{\alpha-1}\,x^{-\alpha+1} = \left( \frac{x}{x_{\min}} \right)^{-\alpha+1}, \tag{16} $$

so long as α > 1. And the probability that a sample is not greater than x is 1 − P(x). Thus the probability that a particular sample we draw, sample i, will lie between x and x + dx and that all the others will be no greater than it is $p(x)\,dx \times [1 - P(x)]^{n-1}$. Then there are n ways to choose i, giving a total probability

$$ \pi(x) = n\,p(x)\,[1 - P(x)]^{n-1}. \tag{17} $$

Now we can calculate the mean value $\langle x_{\max} \rangle$ of the largest sample thus:

$$ \langle x_{\max} \rangle = \int_{x_{\min}}^{\infty} x\,\pi(x)\,dx = n \int_{x_{\min}}^{\infty} x\,p(x)\,[1 - P(x)]^{n-1}\,dx. \tag{18} $$

Using Eqs. (10) and (16), this is

$$ \begin{aligned}
\langle x_{\max} \rangle &= n(\alpha-1) \int_{x_{\min}}^{\infty} \left( \frac{x}{x_{\min}} \right)^{-\alpha+1} \left[ 1 - \left( \frac{x}{x_{\min}} \right)^{-\alpha+1} \right]^{n-1} dx \\
&= n\,x_{\min} \int_0^1 \frac{y^{n-1}}{(1-y)^{1/(\alpha-1)}}\,dy \\
&= n\,x_{\min}\,B\bigl( n, (\alpha-2)/(\alpha-1) \bigr),
\end{aligned} \tag{19} $$

where I have made the substitution $y = 1 - (x/x_{\min})^{-\alpha+1}$ and B(a, b) is Legendre's beta-function,⁸ which is defined by

$$ B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}, \tag{20} $$

with Γ(a) the standard Γ-function:

$$ \Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\,dt. \tag{21} $$

The beta-function has the interesting property that for large values of either of its arguments it itself follows a power law.⁹ For instance, for large a and fixed b, $B(a, b) \sim a^{-b}$. In most cases of interest, the number n of samples from our power-law distribution will be large (meaning much greater than 1), so

$$ B\bigl( n, (\alpha-2)/(\alpha-1) \bigr) \sim n^{-(\alpha-2)/(\alpha-1)}, \tag{22} $$

and

$$ \langle x_{\max} \rangle \sim n^{1/(\alpha-1)}. \tag{23} $$

Thus, as long as α > 1, we find that $\langle x_{\max} \rangle$ always increases as n becomes larger.¹⁰

⁸ Also called the Eulerian integral of the first kind.
⁹ This can be demonstrated by approximating the Γ-functions of Eq. (20) using Stirling's formula.
¹⁰ Equation (23) can also be derived by a simpler, although less rigorous, heuristic argument: if P(x) = 1/n for some value of x then we expect there to be on average one sample in the range from x to ∞, and this of course will be the largest sample. Thus a rough estimate of $\langle x_{\max} \rangle$ can be derived by setting our expression for P(x), Eq. (16), equal to 1/n and rearranging for x, which immediately gives $\langle x_{\max} \rangle \sim n^{1/(\alpha-1)}$.

D. Top-heavy distributions and the 80/20 rule

Another interesting question is where the majority of the distribution of x lies. For any power law with exponent α > 1, the median is well defined. That is, there is a point $x_{1/2}$ that divides the distribution in half so that half the measured values of x lie above $x_{1/2}$ and half lie below. That point is given by

$$ \int_{x_{1/2}}^{\infty} p(x)\,dx = \tfrac{1}{2} \int_{x_{\min}}^{\infty} p(x)\,dx, \tag{24} $$

or

$$ x_{1/2} = 2^{1/(\alpha-1)}\,x_{\min}. \tag{25} $$

So, for example, if we are considering the distribution of wealth, there will be some well-defined median wealth that divides the richer half of the population from the poorer. But we can also ask how much of the wealth itself lies in those two halves. Obviously more than half of the total amount of money belongs to the richer half of the population. The fraction of the money in the richer half is given by

$$ \frac{\int_{x_{1/2}}^{\infty} x\,p(x)\,dx}{\int_{x_{\min}}^{\infty} x\,p(x)\,dx} = \left( \frac{x_{1/2}}{x_{\min}} \right)^{-\alpha+2} = 2^{-(\alpha-2)/(\alpha-1)}, \tag{26} $$

provided α > 2 so that the integrals converge. Thus, for instance, if α = 2.1 for the wealth distribution, as indicated in Table I, then a fraction $2^{-0.091} \simeq 94\%$ of the wealth is in the hands of the richer 50% of the population, making the distribution quite top-heavy.

More generally, the fraction of the population whose personal wealth exceeds x is given by the quantity P(x), Eq. (16), and the fraction of the total wealth in the hands of those people is

$$ W(x) = \frac{\int_x^{\infty} x'\,p(x')\,dx'}{\int_{x_{\min}}^{\infty} x'\,p(x')\,dx'} = \left( \frac{x}{x_{\min}} \right)^{-\alpha+2}, \tag{27} $$

assuming again that α > 2. Eliminating $x/x_{\min}$ between (16) and (27), we find that the fraction W of the wealth in the hands of the richest P of the population is

$$ W = P^{(\alpha-2)/(\alpha-1)}, \tag{28} $$

of which Eq. (26) is a special case. This again has a power-law form, but with a positive exponent now. In Fig. 6 I show the form of the curve of W against P for various values of α. For all values of α the curve is concave downwards, and for values only a little above 2 the curve has a very fast initial increase, meaning that a large fraction of the wealth is concentrated in the hands of a small fraction of the population. Curves of this kind are called Lorenz curves, after Max Lorenz, who first studied them around the turn of the twentieth century [34].

FIG. 6 The fraction W of the total wealth in a country held by the fraction P of the richest people, if wealth is distributed following a power law with exponent α. If α = 2.1, for instance, as it appears to in the United States (Table I), then the richest 20% of the population hold about 86% of the wealth (dashed lines). [Plot of fraction of wealth W against fraction of population P, with curves for α = 2.1, 2.2, 2.4, 2.7 and 3.5.]

Using the exponents from Table I, we can for example calculate that about 80% of the wealth should be in the hands of the richest 20% of the population (the so-called "80/20 rule", which is borne out by more detailed observations of the wealth distribution), the top 20% of web sites get about two-thirds of all web hits, and the largest 10% of US cities house about 60% of the country's total population.

If α ≤ 2 then the situation becomes even more extreme. In that case, the integrals in Eq. (27) diverge at their upper limits, meaning that in fact they depend on the value of the largest sample, as described in Section III.B. But for α > 1, Eq. (23) tells us that the expected value of $x_{\max}$ goes to ∞ as n becomes large, and in that limit the fraction of money in the top half of the population, Eq. (26), tends to unity. In fact, the fraction of money in the top anything of the population, even the top 1%, tends to unity, as Eq. (27) shows. In other words, for distributions with α < 2, essentially all of the wealth (or other commodity) lies in the tail of the distribution. The distribution of family names in the US, which has an exponent α = 1.9, is an example of this type of behaviour. For the data of Fig. 4k, about 75% of the population have names in the top 15 000. Estimates of the total number of unique family names in the US put the figure at around 1.5 million. So in this case 75% of the population have names in the most common 1%, a very top-heavy distribution indeed. The line α = 2 thus separates the regime in which you will with some frequency meet people with uncommon names from the regime in which you will rarely meet such people.

E. Scale-free distributions

A power-law distribution is also sometimes called a scale-free distribution. Why? Because a power law is the only distribution that is the same whatever scale we look at it on. By this we mean the following.

Suppose we have some probability distribution p(x) for a quantity x, and suppose we discover or somehow deduce that it satisfies the property that

$$ p(bx) = g(b)\,p(x), \tag{29} $$

for any b. That is, if we increase the scale or units by which we measure x by a factor of b, the shape of the distribution p(x) is unchanged, except for an overall multiplicative constant. Thus for instance, we might find that computer files of size 2kB are 1/4 as common as files of size 1kB. Switching to measuring size in megabytes we also find that files of size 2MB are 1/4 as common as files of size 1MB. Thus the shape of the file-size distribution curve (at least for these particular values) does not depend on the scale on which we measure file size.

This scale-free property is certainly not true of most distributions. It is not true for instance of the exponential distribution. In fact, as we now show, it is only true of one type of distribution, the power law.

Starting from Eq. (29), let us first set x = 1, giving p(b) = g(b)p(1). Thus g(b) = p(b)/p(1) and (29) can be written as

$$ p(bx) = \frac{p(b)\,p(x)}{p(1)}. \tag{30} $$

Since this equation is supposed to be true for any b, we can differentiate both sides with respect to b to get

$$ x\,p'(bx) = \frac{p'(b)\,p(x)}{p(1)}, \tag{31} $$

where p′ indicates the derivative of p with respect to its argument. Now we set b = 1 and get

$$ x\,\frac{dp}{dx} = \frac{p'(1)}{p(1)}\,p(x). \tag{32} $$

This is a simple first-order differential equation which has the solution

$$ \ln p(x) = \frac{p'(1)}{p(1)} \ln x + \text{constant}. \tag{33} $$
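As a brief worked aside on what the criterion (29) demands, one can substitute two trial forms directly. This check is not part of the original derivation, and the rate λ of the exponential is a symbol introduced only for the illustration.

```latex
\begin{align*}
p(x) = C x^{-\alpha}:&\quad
  p(bx) = C\,b^{-\alpha} x^{-\alpha} = b^{-\alpha}\,p(x),
  \quad\text{so } g(b) = b^{-\alpha}; \\
p(x) = \lambda e^{-\lambda x}:&\quad
  \frac{p(bx)}{p(x)} = e^{-\lambda (b-1) x},
  \quad\text{which still depends on } x.
\end{align*}
```

For the power law the ratio p(bx)/p(x) depends on b alone, as Eq. (29) requires. For the exponential it depends on x as well, so no function g(b) can satisfy Eq. (29).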

Setting x = 1 we find that the constant is simply ln p(1), and then taking exponentials of both sides

$$ p(x) = p(1)\,x^{-\alpha}, \tag{34} $$

where α = −p′(1)/p(1). Thus, as advertised, the power-law distribution is the only function satisfying the scale-free criterion (29).

This fact is more than just a curiosity. As we will see in Section IV.E, there are some systems that become scale-free for certain special values of their governing parameters. The point defined by such a special value is called a "continuous phase transition" and the argument given above implies that at such a point the observable quantities in the system should adopt a power-law distribution. This indeed is seen experimentally and the distributions so generated provided the original motivation for the study of power laws in physics (although most experimentally observed power laws are probably not the result of phase transitions; a variety of other mechanisms produce power-law behaviour as well, as we will shortly see).

F. Power laws for discrete variables

So far I have focused on power-law distributions for continuous real variables, but many of the quantities we deal with in practical situations are in fact discrete, usually integers. For instance, populations of cities, numbers of citations to papers or numbers of copies of books sold are all integer quantities. In most cases, the distinction is not very important. The power law is obeyed only in the tail of the distribution where the values measured are so large that, to all intents and purposes, they can be considered continuous. Technically however, power-law distributions should be defined slightly differently for integer quantities.

If k is an integer variable, then one way to proceed is to declare that it follows a power law if the probability $p_k$ of measuring the value k obeys

$$ p_k = C k^{-\alpha}, \tag{35} $$

for some constant exponent α. Clearly this distribution cannot hold all the way down to k = 0, since it diverges there, but it could in theory hold down to k = 1. If we discard any data for k = 0, the constant C would then be given by the normalization condition

$$ 1 = \sum_{k=1}^{\infty} p_k = C \sum_{k=1}^{\infty} k^{-\alpha} = C\,\zeta(\alpha), \tag{36} $$

where ζ(α) is the Riemann ζ-function. Rearranging, we find that C = 1/ζ(α) and

$$ p_k = \frac{k^{-\alpha}}{\zeta(\alpha)}. \tag{37} $$

If, as is usually the case, the power-law behaviour is seen only in the tail of the distribution, for values $k \ge k_{\min}$, then the equivalent expression is

$$ p_k = \frac{k^{-\alpha}}{\zeta(\alpha, k_{\min})}, \tag{38} $$

where $\zeta(\alpha, k_{\min}) = \sum_{k=k_{\min}}^{\infty} k^{-\alpha}$ is the generalized or incomplete ζ-function.

Most of the results of the previous sections can be generalized to the case of discrete variables, although the mathematics is usually harder and often involves special functions in place of the more tractable integrals of the continuous case.

It has occasionally been proposed that Eq. (35) is not the best generalization of the power law to the discrete case. An alternative and often more convenient form is

$$ p_k = C\,\frac{\Gamma(k)\,\Gamma(\alpha)}{\Gamma(k+\alpha)} = C\,B(k, \alpha), \tag{39} $$

where B(a, b) is, as before, the Legendre beta-function, Eq. (20). As mentioned in Section III.C, the beta-function behaves as a power law $B(k, \alpha) \sim k^{-\alpha}$ for large k and so the distribution has the desired asymptotic form. Simon [35] proposed that Eq. (39) be called the Yule distribution, after Udny Yule who derived it as the limiting distribution in a certain stochastic process [36], and this name is often used today. Yule's result is described in Section IV.D.

The Yule distribution is nice because sums involving it can frequently be performed in closed form, where sums involving Eq. (35) can only be written in terms of special functions. For instance, the normalizing constant C for the Yule distribution is given by

$$ 1 = C \sum_{k=1}^{\infty} B(k, \alpha) = \frac{C}{\alpha-1}, \tag{40} $$

and hence C = α − 1 and

$$ p_k = (\alpha-1)\,B(k, \alpha). \tag{41} $$

The first and second moments (i.e., the mean and mean square of the distribution) are

$$ \langle k \rangle = \frac{\alpha-1}{\alpha-2}, \qquad \langle k^2 \rangle = \frac{(\alpha-1)^2}{(\alpha-2)(\alpha-3)}, \tag{42} $$

and there are similarly simple expressions corresponding to many of our earlier results for the continuous case.

IV. MECHANISMS FOR GENERATING POWER-LAW DISTRIBUTIONS

In this section we look at possible candidate mechanisms by which power-law distributions might arise in natural and man-made systems. Some of the possibilities that have been suggested are quite complex, notably the

physics of critical phenomena and the tools of the renormalization group that are used to analyse it. But let us start with some simple algebraic methods of generating power-law functions and progress to the more involved mechanisms later.

A. Combinations of exponentials

A much more common distribution than the power law is the exponential, which arises in many circumstances, such as survival times for decaying atomic nuclei or the Boltzmann distribution of energies in statistical mechanics. Suppose some quantity y has an exponential distribution:

$$ p(y) \sim e^{ay}. \tag{43} $$

The constant a might be either negative or positive. If it is positive then there must also be a cutoff on the distribution (a limit on the maximum value of y) so that the distribution is normalizable.

Now suppose that the real quantity we are interested in is not y but some other quantity x, which is exponentially related to y thus:

$$ x \sim e^{by}, \tag{44} $$

with b another constant, also either positive or negative. Then the probability distribution of x is

$$ p(x) = p(y)\,\frac{dy}{dx} \sim \frac{e^{ay}}{b\,e^{by}} = \frac{x^{-1+a/b}}{b}, \tag{45} $$

which is a power law with exponent α = 1 − a/b.

A version of this mechanism was used by Miller [37] to explain the power-law distribution of the frequencies of words as follows (see also [38]). Suppose we type randomly on a typewriter,¹¹ pressing the space bar with probability $q_s$ per stroke and each letter with equal probability $q_l$ per stroke. If there are m letters in the alphabet then $q_l = (1 - q_s)/m$. (In this simplest version of the argument we also type no punctuation, digits or other non-letter symbols.) Then the frequency x with which a particular word with y letters (followed by a space) occurs is

$$ x = \left[ \frac{1 - q_s}{m} \right]^y q_s \sim e^{by}, \tag{46} $$

where $b = \ln(1 - q_s) - \ln m$. The number (or fraction) of distinct possible words with length between y and y + dy goes up exponentially as $p(y) \sim m^y = e^{ay}$ with a = ln m. Thus, following our argument above, the distribution of frequencies of words has the form $p(x) \sim x^{-\alpha}$ with

$$ \alpha = 1 - \frac{a}{b} = \frac{2 \ln m - \ln(1 - q_s)}{\ln m - \ln(1 - q_s)}. \tag{47} $$

For the typical case where m is reasonably large and $q_s$ quite small this gives α ≃ 2 in approximate agreement with Table I.

¹¹ This argument is sometimes called the "monkeys with typewriters" argument, the monkey being the traditional exemplar of a random typist.

This is a reasonable theory as far as it goes, but real text is not made up of random letters. Most combinations of letters don't occur in natural languages; most are not even pronounceable. We might imagine that some constant fraction of possible letter sequences of a given length would correspond to real words and the argument above would then work just fine when applied to that fraction, but upon reflection this suggestion is obviously bogus. It is clear for instance that very long words simply don't exist in most languages, although there are exponentially many possible combinations of letters available to make them up. This observation is backed up by empirical data. In Fig. 7a we show a histogram of the lengths of words occurring in the text of Moby Dick, and one would need a particularly vivid imagination to convince oneself that this histogram follows anything like the exponential assumed by Miller's argument. (In fact, the curve appears roughly to follow a log-normal [32].)

There may still be some merit in Miller's argument however. The problem may be that we are measuring word "length" in the wrong units. Letters are not really the basic units of language. Some basic units are letters, but some are groups of letters. The letters "th" for example often occur together in English and make a single sound, so perhaps they should be considered to be a separate symbol in their own right and contribute only one unit to the word length?

Following this idea to its logical conclusion we can imagine replacing each fundamental unit of the language (whatever that is) by its own symbol and then measuring lengths in terms of numbers of symbols. The pursuit of ideas along these lines led Claude Shannon in the 1940s to develop the field of information theory, which gives a precise prescription for calculating the number of symbols necessary to transmit words or any other data [39, 40]. The units of information are bits and the true "length" of a word can be considered to be the number of bits of information it carries. Shannon showed that if we regard words as the basic divisions of a message, the information y carried by any particular word is

$$ y = -k \ln x, \tag{48} $$

where x is the frequency of the word as before and k is a constant. (The reader interested in finding out more about where this simple relation comes from is recommended to look at the excellent introduction to information theory by Cover and Thomas [41].)

But this has precisely the form that we want. Inverting it we have $x = e^{-y/k}$ and if the probability distribution of

the "lengths" measured in terms of bits is also exponential as in Eq. (43) we will get our power-law distribution. Figure 7b shows the latter distribution, and indeed it follows a nice exponential, much better than Fig. 7a.

FIG. 7 (a) Histogram of the lengths in letters of all distinct words in the text of the novel Moby Dick. (b) Histogram of the information content à la Shannon of words in Moby Dick. The former does not, by any stretch of the imagination, follow an exponential, but the latter could easily be said to do so. (Note that the vertical axes are logarithmic.)

This is still not an entirely satisfactory explanation. Having made the shift from pure word length to information content, our simple count of the number of words of length y (that it goes exponentially as $m^y$) is no longer valid, and now we need some reason why there should be exponentially more distinct words in the language of high information content than of low. That this is the case is experimentally verified by Fig. 7b, but the reason must be considered still a matter of debate. Some possibilities are discussed by, for instance, Mandelbrot [42] and more recently by Mitzenmacher [19].

Another example of the "combination of exponentials" mechanism has been discussed by Reed and Hughes [43]. They consider a process in which a set of items, piles or groups each grows exponentially in time, having size $x \sim e^{bt}$ with b > 0. For instance, populations of organisms reproducing freely without resource constraints grow exponentially. Items also have some fixed probability of dying per unit time (populations might have a stochastically constant probability of extinction), so that the times t at which they die are exponentially distributed $p(t) \sim e^{at}$ with a < 0.

These functions again follow the form of Eqs. (43) and (44) and result in a power-law distribution of the sizes x of the items or groups at the time they die. Reed and Hughes suggest that variations on this argument may explain the sizes of biological taxa, incomes and cities, among other things.

B. Inverses of quantities

Suppose some quantity y has a distribution p(y) that passes through zero, thus having both positive and negative values. And suppose further that the quantity we are really interested in is the reciprocal x = 1/y, which will have distribution

$$ p(x) = p(y)\,\frac{dy}{dx} = -\frac{p(y)}{x^2}. \tag{49} $$

The large values of x, those in the tail of the distribution, correspond to the small values of y close to zero and thus the large-x tail is given by

$$ p(x) \sim x^{-2}, \tag{50} $$

where the constant of proportionality is p(y = 0).

More generally, any quantity $x = y^{-\gamma}$ for some γ will have a power-law tail to its distribution $p(x) \sim x^{-\alpha}$, with α = 1 + 1/γ. It is not clear who the first author or authors were to describe this mechanism,¹² but clear descriptions have been given recently by Bouchaud [44], Jan et al. [45] and Sornette [46].

¹² A correspondent tells me that a similar mechanism was described in an astrophysical context by Chandrasekhar in a paper in 1943, but I have been unable to confirm this.

One might argue that this mechanism merely generates a power law by assuming another one: the power-law relationship between x and y generates a power-law distribution for x. This is true, but the point is that the mechanism takes some physical power-law relationship between x and y (not a stochastic probability distribution) and from that generates a power-law probability distribution. This is a non-trivial result.

One circumstance in which this mechanism arises is in measurements of the fractional change in a quantity. For instance, Jan et al. [45] consider one of the most famous systems in theoretical physics, the Ising model of a magnet. In its paramagnetic phase, the Ising model has a magnetization that fluctuates around zero. Suppose we measure the magnetization m at uniform intervals and calculate the fractional change δ = (∆m)/m between each successive pair of measurements. The change ∆m is roughly normally distributed and has a typical size set by the width of that normal distribution. The 1/m on the other hand produces a power-law tail when small values of m coincide with large values of ∆m, so that the tail of the distribution of δ follows $p(\delta) \sim \delta^{-2}$ as above.

In Fig. 8 I show a cumulative histogram of measurements of δ for simulations of the Ising model on a square lattice, and the power-law distribution is clearly visible. Using Eq. (5), the value of the exponent is α = 1.98 ± 0.04, in good agreement with the expected value of 2.
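The reciprocal mechanism of Eqs. (49) and (50) can be illustrated with a small simulation. The sketch below is not the Ising calculation of Fig. 8. It simply assumes y is drawn from a standard normal distribution, which passes smoothly through zero, and fits the tail of x = 1/y with the maximum likelihood estimator, assumed here to have the standard form $\alpha = 1 + n\,[\sum_i \ln(x_i/x_{\min})]^{-1}$. The function name, sample size, cutoff and seed are illustrative choices.

```python
import math
import random


def reciprocal_tail_exponent(n_samples, x_min, seed=0):
    """Estimate the tail exponent of x = 1/y for y drawn from a standard normal.

    Only the positive tail x >= x_min is used. Returns (alpha_hat, sigma, n_tail),
    where sigma = (alpha_hat - 1) / sqrt(n_tail) is the assumed standard error.
    """
    rng = random.Random(seed)
    log_sum = 0.0
    n_tail = 0
    for _ in range(n_samples):
        y = rng.gauss(0.0, 1.0)
        if y <= 0.0:
            continue  # skip the negative side (and an exact zero)
        x = 1.0 / y
        if x >= x_min:
            log_sum += math.log(x / x_min)
            n_tail += 1
    alpha_hat = 1.0 + n_tail / log_sum
    sigma = (alpha_hat - 1.0) / math.sqrt(n_tail)
    return alpha_hat, sigma, n_tail


# x >= 10 corresponds to 0 < y <= 0.1, where p(y) is nearly constant.
alpha_hat, sigma, n_tail = reciprocal_tail_exponent(200_000, x_min=10.0, seed=1)
print(f"alpha = {alpha_hat:.3f} +/- {sigma:.3f} from {n_tail} tail samples")
```

If the argument above holds, the estimate should come out close to 2, because the large-x tail samples only the region near y = 0 where p(y) is essentially flat.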

FIG. 8 Cumulative histogram of the magnetization fluctuations of a 128 × 128 nearest-neighbour Ising model on a square lattice. The model was simulated at a temperature of 2.5 times the spin-spin coupling for 100 000 time steps using the cluster algorithm of Swendsen and Wang [47] and the magnetization per spin measured at intervals of ten steps. The fluctuations were calculated as the ratio $\delta_i = 2(m_{i+1} - m_i)/(m_{i+1} + m_i)$. [Axes: fractional change in magnetization δ (horizontal) against number of fluctuations observed of size δ or greater (vertical), both logarithmic.]

C. Random walks

Many properties of random walks are distributed according to power laws, and this could explain some power-law distributions observed in nature. In particular, a randomly fluctuating process that undergoes "gambler's ruin",¹³ i.e., that ends when it hits zero, has a power-law distribution of possible lifetimes.

¹³ Gambler's ruin is so called because a gambler's night of betting ends when his or her supply of money hits zero (assuming the gambling establishment declines to offer him or her a line of credit).

Consider a random walk in one dimension, in which a walker takes a single step randomly one way or the other along a line in each unit of time. Suppose the walker starts at position 0 on the line and let us ask what the probability is that the walker returns to position 0 for the first time at time t (i.e., after exactly t steps). This is the so-called first return time of the walk and represents the lifetime of a gambler's ruin process. A trick for answering this question is depicted in Fig. 9. We consider first the unconstrained problem in which the walk is allowed to return to zero as many times as it likes, before returning there again at time t. Let us denote the probability of this event as $u_t$. Let us also denote by $f_t$ the probability that the first return time is t. We note that both of these probabilities are non-zero only for even values of their arguments since there is no way to get back to zero in any odd number of steps.

FIG. 9 The position of a one-dimensional random walker (vertical axis) as a function of time (horizontal axis). The probability $u_{2n}$ that the walk returns to zero at time t = 2n is equal to the probability $f_{2m}$ that it returns to zero for the first time at some earlier time t = 2m, multiplied by the probability $u_{2n-2m}$ that it returns again a time 2n − 2m later, summed over all possible values of m. We can use this observation to write a consistency relation, Eq. (51), that can be solved for $f_t$, Eq. (59).

As Fig. 9 illustrates, the probability $u_t = u_{2n}$, with n integer, can be written

$$ u_{2n} = \begin{cases} 1 & \text{if } n = 0, \\ \sum_{m=1}^{n} f_{2m}\,u_{2n-2m} & \text{if } n \ge 1, \end{cases} \tag{51} $$

where m is also an integer and we define $f_0 = 0$ and $u_0 = 1$. This equation can conveniently be solved for $f_{2n}$ using a generating function approach. We define

$$ U(z) = \sum_{n=0}^{\infty} u_{2n} z^n, \qquad F(z) = \sum_{n=1}^{\infty} f_{2n} z^n. \tag{52} $$

Then, multiplying Eq. (51) throughout by $z^n$ and summing, we find

$$ \begin{aligned}
U(z) &= 1 + \sum_{n=1}^{\infty} \sum_{m=1}^{n} f_{2m}\,u_{2n-2m}\,z^n \\
&= 1 + \sum_{m=1}^{\infty} f_{2m} z^m \sum_{n=m}^{\infty} u_{2n-2m}\,z^{n-m} \\
&= 1 + F(z)\,U(z).
\end{aligned} \tag{53} $$

So

$$ F(z) = 1 - \frac{1}{U(z)}. \tag{54} $$

The function U(z) however is quite easy to calculate. The probability $u_{2n}$ that we are at position zero after 2n steps is

$$ u_{2n} = 2^{-2n} \binom{2n}{n}, \tag{55} $$

so¹⁴
\[
U(z) = \sum_{n=0}^{\infty} \binom{2n}{n} \frac{z^n}{4^n} = \frac{1}{\sqrt{1 - z}}. \tag{56}
\]

¹⁴ The enthusiastic reader can easily derive this result for him or herself by expanding $(1 - z)^{-1/2}$ using the binomial theorem.

And hence
\[
F(z) = 1 - \sqrt{1 - z}. \tag{57}
\]
Expanding this function using the binomial theorem thus:
\[
\begin{aligned}
F(z) &= \tfrac{1}{2} z + \frac{\tfrac{1}{2} \times \tfrac{1}{2}}{2!} z^2 + \frac{\tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{3}{2}}{3!} z^3 + \ldots \\
     &= \sum_{n=1}^{\infty} \frac{\binom{2n}{n}}{(2n - 1)\, 2^{2n}} z^n
\end{aligned}
\tag{58}
\]
and comparing this expression with Eq. (52), we immediately see that
\[
f_{2n} = \frac{\binom{2n}{n}}{(2n - 1)\, 2^{2n}}, \tag{59}
\]
and we have our solution for the distribution of first return times.

Now consider the form of $f_{2n}$ for large $n$. Writing out the binomial coefficient as $\binom{2n}{n} = (2n)!/(n!)^2$, we take logs thus:
\[
\ln f_{2n} = \ln (2n)! - 2 \ln n! - 2n \ln 2 - \ln(2n - 1), \tag{60}
\]
and use Stirling’s formula $\ln n! \simeq n \ln n - n + \tfrac{1}{2} \ln n$ to get $\ln f_{2n} \simeq \tfrac{1}{2} \ln 2 - \tfrac{1}{2} \ln n - \ln(2n - 1)$, or
\[
f_{2n} \simeq \sqrt{\frac{2}{n (2n - 1)^2}}. \tag{61}
\]
In the limit $n \to \infty$, this implies that $f_{2n} \sim n^{-3/2}$, or equivalently
\[
f_t \sim t^{-3/2}. \tag{62}
\]
So the distribution of return times follows a power law with exponent $\alpha = \tfrac{3}{2}$. Note that the distribution has a divergent mean (because $\alpha \le 2$). As discussed in Section III.C, this implies that the mean is finite for any finite sample but can take very different values for different samples, so that the value measured for any one sample gives little or no information about the value for any other.

As an example application, the random walk can be considered a simple model for the lifetime of biological taxa. A taxon is a branch of the evolutionary tree, a group of species all descended by repeated speciation from a common ancestor.¹⁵ The ranks of the Linnean hierarchy (genus, family, order and so forth) are examples of taxa. If a taxon gains and loses species at random over time, then the number of species performs a random walk, the taxon becoming extinct when the number of species reaches zero for the first (and only) time. (This is one example of “gambler’s ruin”.) Thus the time for which taxa live should have the same distribution as the first return times of random walks.

¹⁵ Modern phylogenetic analysis, the quantitative comparison of species’ genetic material, can provide a picture of the evolutionary tree and hence allow the accurate “cladistic” assignment of species to taxa. For prehistoric species, however, whose genetic material is not usually available, determination of evolutionary ancestry is difficult, so classification into taxa is based instead on morphology, i.e., on the shapes of organisms. It is widely acknowledged that such classifications are subjective and that the taxonomic assignments of fossil species are probably riddled with errors.

In fact, it has been argued that the distribution of the lifetimes of genera in the fossil record does indeed follow a power law [48]. The best fits to the available fossil data put the value of the exponent at $\alpha = 1.7 \pm 0.3$, which is in agreement with the simple random walk model [49].¹⁶

¹⁶ To be fair, I consider the power law for the distribution of genus lifetimes to fall in the category of “tenuous” identifications to which I alluded in footnote 7. This theory should be taken with a pinch of salt.

D. The Yule process

One of the most convincing and widely applicable mechanisms for generating power laws is the Yule process, whose invention was, coincidentally, also inspired by observations of the statistics of biological taxa as discussed in the previous section.

In addition to having a (possibly) power-law distribution of lifetimes, biological taxa also have a very convincing power-law distribution of sizes. That is, the distribution of the number of species in a genus, family or other taxonomic group appears to follow a power law quite closely. This phenomenon was first reported by Willis and Yule in 1922 for the example of flowering plants [15]. Three years later, Yule [36] offered an explanation using a simple model that has since found wide application in other areas. He argued as follows.

Suppose first that new species appear but they never die; species are only ever added to genera and never removed. This differs from the random walk model of the last section, and certainly from reality as well. It is believed that in practice all species and all genera become extinct in the end. But let us persevere; there is nonetheless much of worth in Yule’s simple model.

Species are added to genera by speciation, the splitting of one species into two, which is known to happen by a variety of mechanisms, including competition for resources, spatial separation of breeding populations and genetic drift. If we assume that this happens at some stochastically constant rate, then it follows that a genus with $k$ species in it will gain new species at a rate proportional to $k$, since each of the $k$ species has the same chance per unit time of dividing in two. Let us further suppose that occasionally, say once every $m$ speciation events, the new species produced is, by chance, sufficiently different from the others in its genus as to be considered the founder member of an entire new genus. (To be clear, we define $m$ such that $m$ species are added to pre-existing genera and then one species forms a new genus. So $m + 1$ new species appear for each new genus and there are $m + 1$ species per genus on average.) Thus the number of genera goes up steadily in this model, as does the number of species within each genus.

We can analyse this Yule process mathematically as follows.¹⁷ Let us measure the passage of time in the model by the number of genera $n$. At each time-step one new species founds a new genus, thereby increasing $n$ by 1, and $m$ other species are added to various pre-existing genera which are selected in proportion to the number of species they already have. We denote by $p_{k,n}$ the fraction of genera that have $k$ species when the total number of genera is $n$. Thus the number of such genera is $n p_{k,n}$. We now ask what the probability is that the next species added to the system happens to be added to a particular genus $i$ having $k_i$ species in it already. This probability is proportional to $k_i$, and so when properly normalized is just $k_i / \sum_i k_i$. But $\sum_i k_i$ is simply the total number of species, which is $n(m + 1)$. Furthermore, between the appearance of the $n$th and the $(n + 1)$th genera, $m$ other new species are added, so the probability that genus $i$ gains a new species during this interval is $m k_i / (n(m + 1))$. And the total expected number of genera of size $k$ that gain a new species in the same interval is
\[
\frac{m k}{n(m + 1)} \times n p_{k,n} = \frac{m}{m + 1}\, k p_{k,n}. \tag{63}
\]

¹⁷ Yule’s analysis of the process was considerably more involved than the one presented here, essentially because the theory of stochastic processes as we now know it did not yet exist in his time. The master equation method we employ is a relatively modern innovation, introduced in this context by Simon [35].

Now we observe that the number of genera with $k$ species will decrease on each time step by exactly this number, since by gaining a new species they become genera with $k + 1$ instead. At the same time the number increases because of genera that previously had $k - 1$ species and now have an extra one. Thus we can write a master equation for the new number $(n + 1) p_{k,n+1}$ of genera with $k$ species thus:
\[
(n + 1) p_{k,n+1} = n p_{k,n} + \frac{m}{m + 1} \bigl[ (k - 1) p_{k-1,n} - k p_{k,n} \bigr]. \tag{64}
\]
The only exception to this equation is for genera of size 1, which instead obey the equation
\[
(n + 1) p_{1,n+1} = n p_{1,n} + 1 - \frac{m}{m + 1}\, p_{1,n}, \tag{65}
\]
since by definition exactly one new such genus appears on each time step.

Now we ask what form the distribution of the sizes of genera takes in the limit of long times. To do this we allow $n \to \infty$ and assume that the distribution tends to some fixed value $p_k = \lim_{n \to \infty} p_{k,n}$ independent of $n$. Then Eq. (65) becomes $p_1 = 1 - m p_1 / (m + 1)$, which has the solution
\[
p_1 = \frac{m + 1}{2m + 1}. \tag{66}
\]
And Eq. (64) becomes
\[
p_k = \frac{m}{m + 1} \bigl[ (k - 1) p_{k-1} - k p_k \bigr], \tag{67}
\]
which can be rearranged to read
\[
p_k = \frac{k - 1}{k + 1 + 1/m}\, p_{k-1}, \tag{68}
\]
and then iterated to get
\[
\begin{aligned}
p_k &= \frac{(k - 1)(k - 2) \cdots 1}{(k + 1 + 1/m)(k + 1/m) \cdots (3 + 1/m)}\, p_1 \\
    &= (1 + 1/m)\, \frac{(k - 1) \cdots 1}{(k + 1 + 1/m) \cdots (2 + 1/m)},
\end{aligned}
\tag{69}
\]
where I have made use of Eq. (66). This can be simplified further by making use of a handy property of the Γ-function, Eq. (21), that $\Gamma(a) = (a - 1)\Gamma(a - 1)$. Using this, and noting that $\Gamma(1) = 1$, we get
\[
\begin{aligned}
p_k &= (1 + 1/m)\, \frac{\Gamma(k)\, \Gamma(2 + 1/m)}{\Gamma(k + 2 + 1/m)} \\
    &= (1 + 1/m)\, B(k, 2 + 1/m),
\end{aligned}
\tag{70}
\]
where $B(a, b)$ is again the beta-function, Eq. (20). This, we note, is precisely the distribution defined in Eq. (39), which Simon called the Yule distribution. Since the beta-function has a power-law tail $B(a, b) \sim a^{-b}$, we can immediately see that $p_k$ also has a power-law tail with an exponent
\[
\alpha = 2 + \frac{1}{m}. \tag{71}
\]
The mean number $m + 1$ of species per genus for the example of flowering plants is about 3, making $m \simeq 2$ and $\alpha \simeq 2.5$. The actual exponent for the distribution

found by Willis and Yule [15] is $\alpha = 2.5 \pm 0.1$, which is in excellent agreement with the theory.

Most likely this agreement is fortuitous, however. The Yule process is probably not a terribly realistic explanation for the distribution of the sizes of genera, principally because it ignores the fact that species (and genera) become extinct. However, it has been adapted and generalized by others to explain power laws in many other systems, most famously city sizes [35], paper citations [50, 51], and links to pages on the world wide web [52, 53]. The most general form of the Yule process is as follows.

Suppose we have a system composed of a collection of objects, such as genera, cities, papers, web pages and so forth. New objects appear every once in a while as cities grow up or people publish new papers. Each object also has some property $k$ associated with it, such as number of species in a genus, people in a city or citations to a paper, that is reputed to obey a power law, and it is this power law that we wish to explain. Newly appearing objects have some initial value of $k$ which we will denote $k_0$. New genera initially have only a single species, $k_0 = 1$, but new towns or cities might have quite a large initial population: a single person living in a house somewhere is unlikely to constitute a town in their own right but $k_0 = 100$ people might do so. The value of $k_0$ can also be zero in some cases: newly published papers usually have zero citations for instance.

In between the appearance of one object and the next, $m$ new species/people/citations etc. are added to the entire system. That is, some cities or papers will get new people or citations, but not necessarily all will. And in the simplest case these are added to objects in proportion to the number that the object already has. Thus the probability of a city gaining a new member is proportional to the number already there; the probability of a paper getting a new citation is proportional to the number it already has. In many cases this seems like a natural process. For example, a paper that already has many citations is more likely to be discovered during a literature search and hence more likely to be cited again. Simon [35] dubbed this type of “rich-get-richer” process the Gibrat principle. Elsewhere it also goes by the names of the Matthew effect [54], cumulative advantage [50], or preferential attachment [52].

There is a problem however when $k_0 = 0$. For example, if new papers appear with no citations and garner citations in proportion to the number they currently have, which is zero, then no paper will ever get any citations! To overcome this problem one typically assigns new citations not in proportion simply to $k$, but to $k + c$, where $c$ is some constant. Thus there are three parameters $k_0$, $c$ and $m$ that control the behaviour of the model.
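The growth rule just described can be tried out numerically. The following is only a minimal sketch under stated assumptions, not the procedure used in the paper: the function name `simulate_yule` and its arguments are hypothetical, $m$ is taken to be an integer, and each round adds the $m$ units to existing objects before the new object appears. An object is chosen with probability proportional to $k + c$ by mixing two draws: one proportional to $k$ (a random entry of a list holding one label per unit) and one uniform over objects.

```python
import random


def simulate_yule(n_objects, m, k0=1, c=0.0, seed=None):
    """Grow a generalized Yule process and return the final k of each object.

    Illustrative sketch: each round adds m units to existing objects, chosen
    with probability proportional to (k + c), then creates one new object
    with initial value k0.
    """
    if k0 + c <= 0:
        raise ValueError("need k0 + c > 0, otherwise nothing can ever grow")
    rng = random.Random(seed)
    k = [k0]              # k[i] is the current value of object i
    stubs = [0] * k0      # one entry per unit of k, labelled by its owner
    while len(k) < n_objects:
        for _ in range(m):
            n = len(k)
            total_k = len(stubs)
            # With probability total_k / (total_k + c*n) pick proportionally
            # to k, otherwise pick uniformly; overall this is prop. to k + c.
            if rng.random() * (total_k + c * n) < total_k:
                i = stubs[rng.randrange(total_k)]
            else:
                i = rng.randrange(n)
            k[i] += 1
            stubs.append(i)
        k.append(k0)
        stubs.extend([len(k) - 1] * k0)
    return k


if __name__ == "__main__":
    sizes = simulate_yule(20000, m=2, k0=1, c=0.0, seed=12345)
    frac_size_one = sum(1 for s in sizes if s == 1) / len(sizes)
    print("fraction of objects with k = 1:", frac_size_one)
```

For the original Yule process ($k_0 = 1$, $c = 0$) the fraction of objects with $k = 1$ printed here can be compared with the stationary value $p_1 = (m + 1)/(2m + 1)$ of Eq. (66); agreement is only expected up to finite-size and sampling fluctuations.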

By an argument exactly analogous to the one given above, one can then derive the master equation
\[
(n + 1) p_{k,n+1} = n p_{k,n} + m\, \frac{k - 1 + c}{k_0 + c + m}\, p_{k-1,n} - m\, \frac{k + c}{k_0 + c + m}\, p_{k,n}, \qquad \text{for } k > k_0, \tag{72}
\]
and
\[
(n + 1) p_{k_0,n+1} = n p_{k_0,n} + 1 - m\, \frac{k_0 + c}{k_0 + c + m}\, p_{k_0,n}, \qquad \text{for } k = k_0. \tag{73}
\]
(Note that $k$ is never less than $k_0$, since each object appears with $k = k_0$ initially.)
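The step from the growth rule to the coefficients in Eqs. (72) and (73) can be sketched as follows. This is a reconstruction along the lines of the argument leading to Eq. (63), not text from the original derivation, and it assumes $n$ is large enough that the total of all $k_i$ is well approximated by $n(k_0 + m)$, since each object arrives with $k_0$ and $m$ further units are added per object.

```latex
\begin{align*}
\sum_i (k_i + c) &\simeq n(k_0 + m) + nc = n(k_0 + c + m), \\
\Pr(\text{a given addition goes to object } i)
  &= \frac{k_i + c}{n(k_0 + c + m)}, \\
\text{expected number of objects of value } k \text{ gaining a unit per step}
  &= m\,\frac{k + c}{n(k_0 + c + m)} \times n p_{k,n}
   = m\,\frac{k + c}{k_0 + c + m}\, p_{k,n}.
\end{align*}
```

Setting $k_0 = 1$ and $c = 0$ in the last line gives $m k p_{k,n}/(m + 1)$, which is Eq. (63), so the generalized master equation reduces to Eq. (64) in that case.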

Looking for stationary solutions of these equations as before, we define $p_k = \lim_{n \to \infty} p_{k,n}$ and find that
\[
p_{k_0} = \frac{k_0 + c + m}{(m + 1)(k_0 + c) + m}, \tag{74}
\]
and
\[
\begin{aligned}
p_k &= \frac{(k - 1 + c)(k - 2 + c) \cdots (k_0 + c)}{(k - 1 + c + \alpha)(k - 2 + c + \alpha) \cdots (k_0 + c + \alpha)}\, p_{k_0} \\
    &= \frac{\Gamma(k + c)\, \Gamma(k_0 + c + \alpha)}{\Gamma(k_0 + c)\, \Gamma(k + c + \alpha)}\, p_{k_0},
\end{aligned}
\tag{75}
\]
where I have made use of the Γ-function notation introduced for Eq. (70) and, for reasons that will become clear in just a moment, I have defined $\alpha = 2 + (k_0 + c)/m$. As before, this expression can also be written in terms of the beta-function, Eq. (20):
\[
p_k = \frac{B(k + c, \alpha)}{B(k_0 + c, \alpha)}\, p_{k_0}. \tag{76}
\]
Since the beta-function follows a power law in its tail, $B(a, b) \sim a^{-b}$, the general Yule process generates a power-law distribution $p_k \sim k^{-\alpha}$ with exponent related to the three parameters of the process according to
\[
\alpha = 2 + \frac{k_0 + c}{m}. \tag{77}
\]
For example, the original Yule process for number of species per genus has $c = 0$ and $k_0 = 1$, which reproduces the result of Eq. (71). For citations of papers or links to web pages we have $k_0 = 0$ and we must have $c > 0$ to get any citations or links at all. So $\alpha = 2 + c/m$. In his work

on citations Price [50] assumed that c = 1, so that paper


citations have the same exponent α = 2 + 1/m as the
standard Yule process, although there doesn’t seem to be
any very good reason for making this assumption. As we
saw in Table I (and as Price himself also reported), real
citations seem to have an exponent α ≃ 3, so we should
expect c ≃ m. For the data from the Science Citation
Index examined in Section II.A, the mean number m of
citations per paper is 8.6. So we should put c ≃ 8.6
too if we want the Yule process to match the observed
exponent.
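The arithmetic in this paragraph is just the relation $\alpha = 2 + (k_0 + c)/m$ of Eq. (77) solved for $c$. A small sketch makes the inversion explicit; the function names `yule_exponent` and `offset_for_exponent` are hypothetical helpers introduced here for illustration.

```python
def yule_exponent(m, k0=1, c=0.0):
    """Exponent alpha = 2 + (k0 + c) / m of the generalized Yule process."""
    return 2.0 + (k0 + c) / m


def offset_for_exponent(alpha, m, k0=0):
    """Value of c that yields a target exponent alpha, from Eq. (77)."""
    return (alpha - 2.0) * m - k0


if __name__ == "__main__":
    # Flowering plants: m of about 2, k0 = 1, c = 0 (original Yule process).
    print(yule_exponent(2))
    # Citations: k0 = 0, mean m = 8.6 citations per paper, target alpha = 3.
    print(offset_for_exponent(3.0, 8.6, k0=0))
```

With the citation figures quoted above ($k_0 = 0$, $m = 8.6$, $\alpha \simeq 3$) the second call returns $c = 8.6$, i.e. $c \simeq m$ as stated in the text.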
The most widely studied model of links on the web,
that of Barabási and Albert [52], assumes c = m so that
α = 3, but again there doesn’t seem to be a good reason
for this assumption. The measured exponent for numbers of links to web sites is about $\alpha = 2.2$, so if the Yule process is to match the data in this case, we should put $c \simeq 0.2m$.

However, the important point is that the Yule process is a plausible and general mechanism that can explain a number of the power-law distributions observed in nature and can produce a wide range of exponents to match the observations by suitable adjustments of the parameters. For several of the distributions shown in Fig. 4, especially citations, city populations and personal income, it is now the most widely accepted theory.

E. Phase transitions and critical phenomena

A completely different mechanism for generating power laws, one that has received a huge amount of attention over the past few decades from the physics community, is that of critical phenomena.

Some systems have only a single macroscopic length-scale, size-scale or time-scale governing them. A classic example is a magnet, which has a correlation length that measures the typical size of magnetic domains. Under certain circumstances this length-scale can diverge, leaving the system with no scale at all. As we will now see, such a system is “scale-free” in the sense of Section III.E and hence the distributions of macroscopic physical quantities have to follow power laws. Usually the circumstances under which the divergence takes place are very specific ones. The parameters of the system have to be tuned very precisely to produce the power-law behaviour. This is something of a disadvantage; it makes the divergence of length-scales an unlikely explanation for generic power-law distributions of the type highlighted in this paper. As we will shortly see, however, there are some elegant and interesting ways around this problem.

The precise point at which the length-scale in a system diverges is called a critical point or a phase transition. More specifically it is a continuous phase transition. (There are other kinds of phase transitions too.) Things that happen in the vicinity of continuous phase transitions are known as critical phenomena, of which power-law distributions are one example.

To better understand the physics of critical phenomena, let us explore one simple but instructive example, that of the “percolation transition”. Consider a square lattice like the one depicted in Fig. 10 in which some of the squares have been coloured in. Suppose we colour each square with independent probability $p$, so that on average a fraction $p$ of them are coloured in. Now we look at the clusters of coloured squares that form, i.e., the contiguous regions of adjacent coloured squares. We can ask, for instance, what the mean area $\langle s \rangle$ is of the cluster to which a randomly chosen square belongs. If that square is not coloured in then the area is zero. If it is coloured in but none of the adjacent ones is coloured in then the area is one, and so forth.

FIG. 10 The percolation model on a square lattice: squares on the lattice are coloured in independently at random with some probability $p$. In this example $p = \tfrac{1}{2}$.

When $p$ is small, only a few squares are coloured in and most coloured squares will be alone on the lattice, or maybe grouped in twos or threes. So $\langle s \rangle$ will be small. This situation is depicted in Fig. 11 for $p = 0.3$. Conversely, if $p$ is large (almost 1, which is the largest value it can have) then most squares will be coloured in and they will almost all be connected together in one large cluster, the so-called spanning cluster. In this situation we say that the system percolates. Now the mean size of the cluster to which a vertex belongs is limited only by the size of the lattice itself and as we let the lattice size become large $\langle s \rangle$ also becomes large. So we have two distinctly different behaviours, one for small $p$ in which $\langle s \rangle$ is small and doesn’t depend on the size of the system, and one for large $p$ in which $\langle s \rangle$ is much larger and increases with the size of the system.

And what happens in between these two extremes? As we increase $p$ from small values, the value of $\langle s \rangle$ also increases. But at some point we reach the start of the regime in which $\langle s \rangle$ goes up with system size instead of staying constant. We now know that this point is at $p = 0.5927462\ldots$, which is called the critical value of $p$ and is denoted $p_c$. If the size of the lattice is large, then $\langle s \rangle$ also becomes large at this point, and in the limit where the lattice size goes to infinity $\langle s \rangle$ actually diverges. To

FIG. 11 Three examples of percolation systems on 100 × 100 square lattices with p = 0.3, p = pc = 0.5927 . . . and p = 0.9. The
first and last are well below and above the critical point respectively, while the middle example is precisely at it.

illustrate this phenomenon, I show in Fig. 12 a plot of $\langle s \rangle$ from simulations of the percolation model and the divergence is clear.

Now consider not just the mean cluster size but the entire distribution of cluster sizes. Let $p(s)$ be the probability that a randomly chosen square belongs to a cluster of area $s$. In general, what forms can $p(s)$ take as a function of $s$? The important point to notice is that $p(s)$, being a probability distribution, is a dimensionless quantity (just a number) but $s$ is an area. We could measure $s$ in terms of square metres, or whatever units the lattice is calibrated in. The average $\langle s \rangle$ is also an area and then there is the area of a unit square itself, which we will denote $a$. Other than these three quantities, however, there are no other independent parameters with dimensions in this problem. (There is the area of the whole lattice, but we are considering the limit where that becomes infinite, so it’s out of the picture.)

If we want to make a dimensionless function $p(s)$ out of these three dimensionful parameters, there are three dimensionless ratios we can form: $s/a$, $a/\langle s \rangle$ and $s/\langle s \rangle$ (or their reciprocals, if we prefer). Only two of these are independent however, since the last is the product of the other two. Thus in general we can write
\[
p(s) = C f\!\left( \frac{s}{a}, \frac{a}{\langle s \rangle} \right), \tag{78}
\]
where $f$ is a dimensionless mathematical function of its dimensionless arguments and $C$ is a normalizing constant chosen so that $\sum_s p(s) = 1$.

But now here’s the trick. We can coarse-grain or rescale our lattice so that the fundamental unit of the lattice changes. For instance, we could double the size of our unit square $a$. The kind of picture I’m thinking of is shown in Fig. 13. The basic percolation clusters stay roughly the same size and shape, although I’ve had to fudge things around the edges a bit to make it work. For this reason this argument will only be strictly correct for large clusters $s$ whose area is not changed appreciably by the fudging. (And the argument thus only tells us that the tail of the distribution is a power law, and not the whole distribution.)

FIG. 12 The mean area of the cluster to which a randomly chosen square belongs for the percolation model described in the text, calculated from an average over 1000 simulations on a 1000 × 1000 square lattice. The dotted line marks the known position of the phase transition. [Plot: percolation probability (horizontal axis, 0 to 1) against mean cluster size (vertical axis, 0 to 200).]

FIG. 13 A site percolation system is coarse-grained, so that the area of the fundamental square is (in this case) quadrupled. The occupation of the squares in the coarse-grained lattice (right) is chosen to mirror as nearly as possible that of the squares on the original lattice (left), so that the sizes and shapes of the large clusters remain roughly the same. The small clusters are mostly lost in the coarse-graining, so that the arguments given in the text are valid only for the large-$s$ tail of the cluster size distribution.
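The quantity $\langle s \rangle$ used throughout this section is straightforward to estimate numerically. The following is a minimal sketch, not the code behind Fig. 12: the function names `mean_cluster_area` and `random_lattice` are hypothetical, clusters are taken to be groups of coloured squares joined along edges (nearest neighbours), and the lattice is kept small so the example runs quickly. Since every square in a cluster of area $s$ contributes $s$ and uncoloured squares contribute zero, $\langle s \rangle$ is the sum of $s^2$ over clusters divided by the number of squares.

```python
import random


def mean_cluster_area(grid):
    """Mean area of the cluster to which a randomly chosen square belongs.

    grid is a list of rows of booleans (True = coloured). Uncoloured squares
    count as area zero, so the result is sum(s**2 over clusters) / squares.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    total = 0
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] and not seen[i][j]:
                # Flood-fill one cluster with an explicit stack.
                seen[i][j] = True
                stack = [(i, j)]
                size = 0
                while stack:
                    x, y = stack.pop()
                    size += 1
                    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                        if (0 <= nx < rows and 0 <= ny < cols
                                and grid[nx][ny] and not seen[nx][ny]):
                            seen[nx][ny] = True
                            stack.append((nx, ny))
                total += size * size
    return total / (rows * cols)


def random_lattice(L, p, rng):
    """L x L lattice with each square coloured independently with probability p."""
    return [[rng.random() < p for _ in range(L)] for _ in range(L)]


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.3, 0.5, 0.5927, 0.7, 0.9):
        print(p, mean_cluster_area(random_lattice(100, p, rng)))
```

On a finite lattice the estimate grows rapidly as $p$ passes the critical value and is then limited by the lattice size, which is the behaviour described in the text; averaging over many lattices, as in Fig. 12, would smooth out the fluctuations of a single run.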

The probability $p(s)$ of getting a cluster of area $s$ is unchanged by the coarse-graining since the areas themselves are, to a good approximation, unchanged, and the mean cluster size is thus also unchanged. All that has changed, mathematically speaking, is that the unit area $a$ has been rescaled $a \to a/b$ for some constant rescaling factor $b$. The equivalent of Eq. (78) in our coarse-grained system is
\[
p(s) = C' f\!\left( \frac{s}{a/b}, \frac{a/b}{\langle s \rangle} \right) = C' f\!\left( \frac{bs}{a}, \frac{a}{b \langle s \rangle} \right). \tag{79}
\]
Comparing with Eq. (78), we can see that this is equal, to within a multiplicative constant, to the probability $p(bs)$ of getting a cluster of size $bs$, but in a system with a different mean cluster size of $b \langle s \rangle$. Thus we have related the probabilities of two different sizes of clusters to one another, but on systems with different average cluster size and hence presumably also different site occupation probability. Note that the normalization constant must in general be changed in Eq. (79) to make sure that $p(s)$ still sums to unity, and that this change will depend on the value we choose for the rescaling factor $b$.

But now we notice that there is one special point at which this rescaling by definition does not result in a change in $\langle s \rangle$ or a corresponding change in the site occupation probability, and that is the critical point. When we are precisely at the point at which $\langle s \rangle \to \infty$, then $b \langle s \rangle = \langle s \rangle$ by definition. Putting $\langle s \rangle \to \infty$ in Eqs. (78) and (79), we then get $p(s) = C' f(bs/a, 0) = (C'/C)\, p(bs)$. Or equivalently
\[
p(bs) = g(b)\, p(s), \tag{80}
\]
where $g(b) = C/C'$. Comparing with Eq. (29) we see that this has precisely the form of the equation that defines a scale-free distribution. The rest of the derivation below Eq. (29) follows immediately, and so we know that $p(s)$ must follow a power law.

This in fact is the origin of the name “scale-free” for a distribution of the form (29). At the point at which $\langle s \rangle$ diverges, the system is left with no defining size-scale, other than the unit of area $a$ itself. It is “scale-free”, and by the argument above it follows that the distribution of $s$ must obey a power law.

In Fig. 14 I show an example of a cumulative distribution of cluster sizes for a percolation system right at the critical point and, as the figure shows, the distribution does indeed follow a power law. Technically the distribution cannot follow a power law to arbitrarily large cluster sizes since the area of a cluster can be no bigger than the area of the whole lattice, so the power-law distribution will be cut off in the tail. This is an example of a finite-size effect. This point does not seem to be visible in Fig. 14 however.

FIG. 14 Cumulative distribution of the sizes of clusters for (site) percolation on a square lattice of 40 000 × 40 000 sites at the critical site occupation probability $p_c = 0.592746\ldots$ [Log-log plot: area of cluster $s$ (horizontal axis) against clusters with area $s$ or greater (vertical axis).]

The kinds of arguments given in this section can be made more precise using the machinery of the renormalization group. The real-space renormalization group makes use precisely of transformations such as that shown in Fig. 13 to derive power-law forms and their exponents for distributions at the critical point. An example application to the percolation problem is given by Reynolds et al. [55]. A more technically sophisticated technique is the k-space renormalization group, which makes use of transformations in Fourier space to accomplish similar aims in a particularly elegant formal environment [56].

F. Self-organized criticality

As discussed in the preceding section, certain systems develop power-law distributions at special “critical” points in their parameter space because of the divergence of some characteristic scale, such as the mean cluster size in the percolation model. This does not, however, provide a plausible explanation for the origin of power laws in most real systems. Even if we could come up with some model of earthquakes or solar flares or web hits that had such a divergence, it seems unlikely that the parameters of the real world would, just coincidentally, fall precisely at the point where the divergence occurred.

As first proposed by Bak et al. [57], however, it is possible that some dynamical systems actually arrange themselves so that they always sit at the critical point, no matter what state we start off in. One says that such systems self-organize to the critical point, or that they display self-organized criticality. A now-classic example of such a system is the forest fire model of Drossel and Schwabl [58], which is based on the percolation model we have already seen.

Consider the percolation model as a primitive model of a forest. The lattice represents the landscape and a single tree can grow in each square. Occupied squares

fires of size s or greater


6
10

FIG. 15 Lightning strikes at random positions in the forest


fire model, starting fires that wipe out the entire cluster to
which a struck tree belongs.

1 10 100 1000
represent trees and empty squares represent empty plots
of land with no trees. Trees appear instantaneously at size of fire s
random at some constant rate and hence the squares of
the lattice fill up at random. Every once in a while a FIG. 16 Cumulative distribution of the sizes of “fires” in a
wildfire starts at a random square on the lattice, set off simulation of the forest fire model of Drossel and Schwabl [58]
by a lightning strike perhaps, and burns the tree in that for a square lattice of size 5000 × 5000.
square, if there is one, along with every other tree in
the cluster connected to it. The process is illustrated in
Fig. 15. One can think of the fire as leaping from tree
to adjacent tree until the whole cluster is burned, but
the fire cannot cross the firebreak formed by an empty
square. If there is no tree in the square struck by the
lightning, then nothing happens. After a fire, trees can
grow up again in the squares vacated by burnt trees, so it follows a power law closely. The exponent of the dis-
the process keeps going indefinitely. tribution is quite small in this case. The best current
If we start with an empty lattice, trees will start to ap- estimates give a value of α = 1.19 ± 0.01 [59], meaning
pear but will initially be sparse and lightning strikes will that the distribution has an infinite mean in the limit of
either hit empty squares or if they do chance upon a tree they will burn it and its cluster, but that cluster will be small and localized because we are well below the percolation threshold. Thus fires will have essentially no effect on the forest. As time goes by however, more and more trees will grow up until at some point there are enough that we have percolation. At that point, as we have seen, a spanning cluster forms whose size is limited only by the size of the lattice, and when any tree in that cluster gets hit by the lightning the entire cluster will burn away. This gets rid of the spanning cluster so that the system does not percolate any more, but over time as more trees appear it will presumably reach percolation again, and so the scenario will play out repeatedly. The end result is that the system oscillates right around the critical point, first going just above the percolation threshold as trees appear and then being beaten back below it by fire. In the limit of large system size these fluctuations become small compared to the size of the system as a whole and to an excellent approximation the system just sits at the threshold indefinitely. Thus, if we wait long enough, we expect the forest fire model to self-organize to a state in which it has a power-law distribution of the sizes of clusters, or of the sizes of fires.

In Fig. 16 I show the cumulative distribution of the sizes of fires in the forest fire model and, as we can see,

large system size. For all real systems however the mean is finite: the distribution is cut off in the large-size tail because fires cannot have a size any greater than that of the lattice as a whole and this makes the mean well-behaved. This cutoff is clearly visible in Fig. 16 as the drop in the curve towards the right of the plot. What’s more, the distribution of the sizes of fires in real forests, Fig. 5d, shows a similar cutoff and is in many ways qualitatively similar to the distribution predicted by the model. (Real forests are obviously vastly more complex than the forest fire model, and no one is seriously suggesting that the model is an accurate representation of the real world. Rather it is a guide to the general type of processes that might be going on in forests.)

There has been much excitement about self-organized criticality as a possible generic mechanism for explaining where power-law distributions come from. Per Bak, one of the originators of the idea, wrote an entire book about it [60]. Self-organized critical models have been put forward not only for forest fires, but for earthquakes [61, 62], solar flares [5], biological evolution [63], avalanches [57] and many other phenomena. Although it is probably not the universal law that some have claimed it to be, it is certainly a powerful and intriguing concept that potentially has applications to a variety of natural and man-made systems.
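The self-organizing dynamics of the forest fire model described above can be sketched in a few lines of Python. This is only an illustrative toy, not the simulation used for Fig. 16: the function names `burn_cluster` and `simulate`, the parameter `growth_per_strike`, the open boundaries, and the choice of one lightning strike after each burst of tree growth are all assumptions made here for the sake of a runnable example.

```python
import random

def burn_cluster(grid, start):
    """Burn the cluster of trees connected to `start`; return its size."""
    L = len(grid)
    r0, c0 = start
    if not grid[r0][c0]:
        return 0                      # lightning hit an empty square
    grid[r0][c0] = False
    stack = [(r0, c0)]
    burned = 0
    while stack:
        r, c = stack.pop()
        burned += 1
        # nearest neighbours on a square lattice with open boundaries
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < L and 0 <= nc < L and grid[nr][nc]:
                grid[nr][nc] = False  # mark when pushed so it is counted once
                stack.append((nr, nc))
    return burned

def simulate(L=64, steps=20000, growth_per_strike=50, seed=1):
    """Alternate slow random tree growth with single lightning strikes."""
    rng = random.Random(seed)
    grid = [[False] * L for _ in range(L)]
    fire_sizes = []
    for _ in range(steps):
        # growth: trees appear at randomly chosen squares
        for _ in range(growth_per_strike):
            grid[rng.randrange(L)][rng.randrange(L)] = True
        # fire: one strike at a random square burns the whole cluster
        size = burn_cluster(grid, (rng.randrange(L), rng.randrange(L)))
        if size:
            fire_sizes.append(size)
    return fire_sizes
```

The list returned by `simulate` holds one entry per fire, so sorting it in decreasing order and plotting rank against size on logarithmic scales gives a cumulative distribution of the kind discussed in Appendix A. How closely a run resembles a power law will depend on the lattice size and on the ratio of growth to lightning, both of which are free choices in this sketch.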
G. Other mechanisms for generating power laws

In the preceding sections I’ve described the best known and most widely applied mechanisms that generate power-law distributions. However, there are a number of others that deserve a mention. One that has been receiving some attention recently is the highly optimized tolerance mechanism of Carlson and Doyle [64, 65]. The classic example of this mechanism is again a model of forest fires and is based on the percolation process. Suppose again that fires start at random in a grid-like forest, just as we considered in Sec. IV.F, but suppose now that instead of appearing at random, trees are deliberately planted by a knowledgeable forester. One can ask what the best distribution of trees is to optimize the amount of lumber the forest produces, subject to random fires that could start at any place. The answer turns out to be that one should plant trees in blocks, with narrow firebreaks between them to prevent fires from spreading. Moreover, one should make the blocks smaller in regions where fires start more often and larger where fires are rare. The reason for this is that we waste some valuable space by making firebreaks, space in which we could have planted more trees. If fires are rare, then on average it pays to put the breaks further apart: more trees will burn if there is a fire, but we also get more lumber if there isn’t.

Carlson and Doyle show both by analytic arguments and by numerical simulation that for quite general distributions of starting points for fires this process leads to a distribution of fire sizes that approximately follows a power law. The distribution is not a perfect power law in this case, but on the other hand neither are many of those seen in the data of Fig. 4, so this is not necessarily a disadvantage. Carlson and Doyle have proposed that highly optimized tolerance could be a model not only for forest fires but also for the sizes of files on the World Wide Web, which appear to follow a power law [6].

Another mechanism, which is mathematically similar to that of Carlson and Doyle but quite different in motivation, is the coherent noise mechanism proposed by Sneppen and Newman [66] as a model of biological extinction. In this mechanism a number of agents or species are subjected to stresses of various sizes, and each agent has a threshold for stress above which an applied stress will wipe that agent out, i.e., the species becomes extinct. Extinct species are replaced by new ones with randomly chosen thresholds. The net result is that the system self-organizes to a state where most of the surviving species have high thresholds, but the exact distribution depends on the distribution of stresses in a way very similar to the relation between block sizes and fire frequency in highly optimized tolerance. No conscious optimization is needed in this case, but the end result is similar: the overall distribution of the numbers of species becoming extinct as a result of any particular stress approximately follows a power law. The power-law form is not exact, but it’s as good as that seen in real extinction data. Sneppen and Newman have also suggested that their mechanism could be used to model avalanches and earthquakes.

One of the broad distributions mentioned in Sec. II.B as an alternative to the power law was the log-normal. A log-normally distributed quantity is one whose logarithm is normally distributed. That is

$$ p(\ln x) \sim \exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right), \tag{81} $$

for some choice of the mean µ and standard deviation σ of the distribution. Distributions like this typically arise when we are multiplying together random numbers. The log of the product of a large number of random numbers is the sum of the logarithms of those same random numbers, and by the central limit theorem such sums have a normal distribution essentially regardless of the distribution of the individual numbers.

But Eq. (81) implies that the distribution of x itself is

$$ p(x) = p(\ln x)\,\frac{d \ln x}{dx} = \frac{1}{x} \exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right). \tag{82} $$

To see how this looks if we were to plot it on log scales, we take logarithms of both sides, giving

$$ \ln p(x) = -\ln x - \frac{(\ln x - \mu)^2}{2\sigma^2} = -\frac{(\ln x)^2}{2\sigma^2} + \left( \frac{\mu}{\sigma^2} - 1 \right) \ln x - \frac{\mu^2}{2\sigma^2}, \tag{83} $$

which is quadratic in ln x. However, any quadratic curve looks straight if we view a sufficiently small portion of it, so p(x) will look like a power-law distribution when we look at a small portion on log scales. The effective exponent α of the distribution is in this case not fixed by the theory: it could be anything, depending on which part of the quadratic our data fall on.

On larger scales the distribution will have some downward curvature, but so do many of the distributions claimed to follow power laws, so it is possible that these distributions are really log-normal. In fact, in many cases we don’t even have to restrict ourselves to a particularly small portion of the curve. If σ is large then the quadratic term in Eq. (83) will vary slowly and the curvature of the line will be slight, so the distribution will appear to follow a power law over relatively large portions of its range. This situation arises commonly when we are considering products of random numbers.

Suppose for example that we are multiplying together 100 numbers, each of which is drawn from some distribution such that the standard deviation of the logs is around 1, i.e., the numbers themselves vary up or down by about a factor of e. Then, by the central limit theorem, the standard deviation for ln x will be σ ≃ 10 and ln x will have to vary by about ±10 for changes in (ln x)²/σ² to be apparent. But such a variation in the logarithm corresponds to a variation in x of more than four orders of magnitude. If our data span a domain smaller than this, as many of the plots in Fig. 4 do, then
we will see a measured distribution that looks close to a power law. And the range will get quickly larger as the number of numbers we are multiplying grows.

One example of a random multiplicative process might be wealth generation by investment. If a person invests money, for instance in the stock market, they will get a percentage return on their investment that varies over time. In other words, in each period of time their investment is multiplied by some factor which fluctuates from one period to the next. If the fluctuations are random and uncorrelated, then after many such periods the value of the investment is the initial value multiplied by the product of a large number of random numbers, and therefore should be distributed according to a log-normal. This could explain why the tail of the wealth distribution, Fig. 4j, appears to follow a power law.

Another example is fragmentation. Suppose we break a stick of unit length into two parts at a position which is a random fraction z of the way along the stick’s length. Then we break the resulting pieces at random again and so on. After many breaks, the length of one of the remaining pieces will be ∏_i z_i, where z_i is the position of the ith break. This is a product of random numbers and thus the resulting distribution of lengths should follow a power law over a portion of its range. A mechanism like this could, for instance, produce a power-law distribution of meteors or other interplanetary rock fragments, which tend to break up when they collide with one another, and this in turn could produce a power-law distribution of the sizes of meteor craters similar to the one in Fig. 4g.

In fact, as discussed by a number of authors [67, 68, 69], random multiplication processes can also generate perfect power-law distributions with only a slight modification: if there is a lower bound on the value that the product of a set of numbers is allowed to take (for example if there is a “reflecting boundary” on the lower end of the range, or an additive noise term as well as a multiplicative one) then the behaviour of the process is modified to generate not a log-normal, but a true power law.

Finally, some processes show power-law distributions of times between events. The distribution of times between earthquakes and their aftershocks is one example. Such power-law distributions of times are observed in critical models and in the coherent noise mechanism mentioned above, but another possible explanation for their occurrence is a random extremal process or record dynamics. In this mechanism we consider how often a randomly fluctuating quantity will break its own record for the highest value recorded. For a quantity with, say, a Gaussian distribution, it is always in theory possible for the record to be broken, no matter what its current value, but the more often the record is broken the higher the record will get and the longer we will have to wait until it is broken again. As shown by Sibani and Littlewood [70], this non-stationary process gives a distribution of waiting times between the establishment of new records that follows a power law with exponent α = 1. Interestingly, this is precisely the exponent observed for the distribution of waiting times for aftershocks of earthquakes. The record dynamics has also been proposed as a model for the lifetimes of biological taxa [71].

V. CONCLUSIONS

In this review I have discussed the power-law statistical distributions seen in a wide variety of natural and man-made phenomena, from earthquakes and solar flares to populations of cities and sales of books. We have seen many examples of power-law distributions in real data and seen how to analyse those data to understand the behaviour and parameters of the distributions. I have also described a number of physical mechanisms that have been proposed to explain the occurrence of power laws. Perhaps the two most important of these are:

1. The Yule process, a rich-get-richer mechanism in which the most populous cities or best-selling books get more inhabitants or sales in proportion to the number they already have. Yule and later Simon showed mathematically that this mechanism produces what is now called the Yule distribution, which follows a power law in its tail.

2. Critical phenomena and the associated concept of self-organized criticality, in which a scale factor of a system diverges, either because we have tuned the system to a special critical point in its parameter space or because the system automatically drives itself to that point by some dynamical process. The divergence can leave the system with no appropriate scale factor to set the size of some measured quantity and as we have seen the quantity must then follow a power law.

The study of power-law distributions is an area in which there is considerable current research interest. While the mechanisms and explanations presented here certainly offer some insight, there is much work to be done both experimentally and theoretically before we can say we really understand the physical processes driving these systems. Without doubt there are many exciting discoveries still waiting to be made.

Acknowledgements

The author thanks Jean-Philippe Bouchaud, Petter Holme, Cris Moore, Cosma Shalizi, Eduardo Sontag, Didier Sornette, and Erik van Nimwegen for useful conversations and suggestions, and Lada Adamic for the web site hit data. This work was funded in part by the National Science Foundation under grant number DMS–0405348.
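As a numerical footnote to the log-normal argument of Sec. IV.G, the worked example of multiplying 100 random numbers can be checked with a short Python sketch. Two things here are assumptions rather than part of the text: the logs of the individual factors are drawn from a unit Gaussian (any distribution with unit standard deviation of the logs should behave similarly), and the helper `effective_exponent` is simply the local slope obtained by differentiating Eq. (83) with respect to ln x, namely α = 1 + (ln x − µ)/σ².

```python
import math
import random

def log_of_product(n_factors, rng):
    """ln of a product of n_factors random numbers whose logs are N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) for _ in range(n_factors))

def effective_exponent(ln_x, mu, sigma):
    """Local slope -d ln p / d ln x of Eq. (83): 1 + (ln x - mu) / sigma^2."""
    return 1.0 + (ln_x - mu) / sigma ** 2

rng = random.Random(0)
samples = [log_of_product(100, rng) for _ in range(5000)]
mean = sum(samples) / len(samples)
# sample standard deviation of ln x; the central limit theorem suggests ~10
sigma = math.sqrt(sum((s - mean) ** 2 for s in samples) / (len(samples) - 1))

# Change in the apparent exponent across a window of width 10 in ln x
# (a little over four decades in x), using the theoretical values mu = 0, sigma = 10.
drift = effective_exponent(10.0, 0.0, 10.0) - effective_exponent(0.0, 0.0, 10.0)
decades = 10.0 / math.log(10.0)
```

Under these assumptions the apparent exponent drifts by only about 0.1 over more than four orders of magnitude in x, which is small enough to be hard to distinguish from a straight line in noisy data.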
APPENDIX A: Rank/frequency plots

Suppose we wish to make a plot of the cumulative distribution function P(x) of a quantity such as, for example, the frequency with which words appear in a body of text (Fig. 4a). We start by making a list of all the words along with their frequency of occurrence. Now the cumulative distribution of the frequency is defined such that P(x) is the fraction of words with frequency greater than or equal to x. Or alternatively one could simply plot the number of words with frequency greater than or equal to x, which differs from the fraction only in its normalization.

Now consider the most frequent word, which is “the” in most written English texts. If x is the frequency with which this word occurs, then clearly there is exactly one word with frequency greater than or equal to x, since no other word is more frequent. Similarly, for the frequency of the second most common word (usually “of”) there are two words with that frequency or greater, namely “of” and “the”. And so forth. In other words, if we rank the words in order, then by definition there are n words with frequency greater than or equal to that of the nth most common word. Thus the cumulative distribution P(x) is simply proportional to the rank n of a word. This means that to make a plot of P(x) all we need do is sort the words in decreasing order of frequency, number them starting from 1, and then plot their ranks as a function of their frequency. Such a plot of rank against frequency was called by Zipf [2] a rank/frequency plot, and this name is still sometimes used to refer to plots of the cumulative distribution of a quantity. Of course, many quantities we are interested in are not frequencies (they are the sizes of earthquakes or people’s personal wealth or whatever), but nonetheless people still talk about “rank/frequency” plots although the name is not technically accurate.

In practice, sorting and ranking measurements and then plotting rank against those measurements is usually the quickest way to construct a plot of the cumulative distribution of a quantity. All the cumulative plots in this paper were made in this way, except for the plot of the sizes of moon craters in Fig. 4g, for which the data came already in cumulative form.

APPENDIX B: Maximum likelihood estimate of exponents

Consider the power-law distribution

$$ p(x) = C x^{-\alpha} = \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha}, \tag{B1} $$

where we have made use of the value of the normalization constant C calculated in Eq. (9).

Given a set of n values x_i, the probability that those values were generated from this distribution is proportional to

$$ P(x|\alpha) = \prod_{i=1}^{n} p(x_i) = \prod_{i=1}^{n} \frac{\alpha - 1}{x_{\min}} \left( \frac{x_i}{x_{\min}} \right)^{-\alpha}. \tag{B2} $$

This quantity is called the likelihood of the data set. To find the value of α that best fits the data, we need to calculate the probability P(α|x) of a particular value of α given the observed {x_i}, which is related to P(x|α) by Bayes’ law thus:

$$ P(\alpha|x) = P(x|\alpha)\,\frac{P(\alpha)}{P(x)}. \tag{B3} $$

The prior probability of the data P(x) is fixed since x itself is fixed (x is equal to the particular set of observations we actually made and does not vary in the calculation), and it is usually assumed, in the absence of any information to the contrary, that the prior probability of the exponent P(α) is uniform, i.e., a constant independent of α. Thus P(α|x) ∝ P(x|α). For convenience we typically work with the logarithm of P(α|x), which, to within an additive constant, is equal to the log of the likelihood, denoted L and given by

$$ L = \ln P(x|\alpha) = \sum_{i=1}^{n} \left[ \ln(\alpha - 1) - \ln x_{\min} - \alpha \ln \frac{x_i}{x_{\min}} \right] = n \ln(\alpha - 1) - n \ln x_{\min} - \alpha \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}}. \tag{B4} $$

Now we calculate the most likely value of α by maximizing the likelihood with respect to α, which is the same as maximizing the log likelihood, since the logarithm is a monotonic increasing function. Setting ∂L/∂α = 0, we find

$$ \frac{n}{\alpha - 1} - \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}} = 0, \tag{B5} $$

or

$$ \alpha = 1 + n \left[ \sum_{i} \ln \frac{x_i}{x_{\min}} \right]^{-1}. \tag{B6} $$

We also wish to know what the expected error is on our value of α. We can estimate this from the width of the maximum of the likelihood as a function of α. Taking the exponential of Eq. (B4), we find that the likelihood has the form

$$ P(x|\alpha) = a\, e^{-b\alpha} (\alpha - 1)^n, \tag{B7} $$

where b = Σ_{i=1}^{n} ln(x_i/x_min) and a is an unimportant normalizing constant. Assuming that α > 1 so that the distribution (B1) is normalizable, the mean and mean square of α in this distribution are given by

$$ \langle \alpha \rangle = \frac{\int_1^\infty e^{-b\alpha} (\alpha - 1)^n\, \alpha \, d\alpha}{\int_1^\infty e^{-b\alpha} (\alpha - 1)^n \, d\alpha} $$
$$ \phantom{\langle \alpha \rangle} = \frac{e^{-b}\, b^{-2-n} (n + 1 + b)\, \Gamma(n + 1)}{e^{-b}\, b^{-1-n}\, \Gamma(n + 1)} = \frac{n + 1 + b}{b} \tag{B8} $$

and

$$ \langle \alpha^2 \rangle = \frac{\int_1^\infty e^{-b\alpha} (\alpha - 1)^n\, \alpha^2 \, d\alpha}{\int_1^\infty e^{-b\alpha} (\alpha - 1)^n \, d\alpha} = \frac{e^{-b}\, b^{-3-n} (n^2 + 3n + b^2 + 2b + 2nb + 2)\, \Gamma(n + 1)}{e^{-b}\, b^{-1-n}\, \Gamma(n + 1)} = \frac{n^2 + 3n + b^2 + 2b + 2nb + 2}{b^2}, \tag{B9} $$

where Γ(x) is the Γ-function of Eq. (21). Then the variance of α is

$$ \sigma^2 = \langle \alpha^2 \rangle - \langle \alpha \rangle^2 = \frac{n^2 + 3n + b^2 + 2b + 2nb + 2}{b^2} - \frac{(n + 1 + b)^2}{b^2} = \frac{n + 1}{b^2}, \tag{B10} $$

and the error on α is

$$ \sigma = \frac{\sqrt{n + 1}}{b} = \sqrt{n + 1} \left[ \sum_{i} \ln \frac{x_i}{x_{\min}} \right]^{-1}. \tag{B11} $$

In most cases we will have n ≫ 1 and it is safe to approximate n + 1 by n, giving

$$ \sigma = \sqrt{n} \left[ \sum_{i} \ln \frac{x_i}{x_{\min}} \right]^{-1} = \frac{\alpha - 1}{\sqrt{n}}, \tag{B12} $$

where α in this expression is the maximum likelihood estimate from Eq. (B6).

References

[1] F. Auerbach, Das Gesetz der Bevölkerungskonzentration. Petermanns Geographische Mitteilungen 59, 74–76 (1913).
[2] G. K. Zipf, Human Behaviour and the Principle of Least Effort. Addison-Wesley, Reading, MA (1949).
[3] B. Gutenberg and C. F. Richter, Frequency of earthquakes in California. Bulletin of the Seismological Society of America 34, 185–188 (1944).
[4] G. Neukum and B. A. Ivanov, Crater size distributions and impact probabilities on Earth from lunar, terrestrial-planet, and asteroid cratering data. In T. Gehrels (ed.), Hazards Due to Comets and Asteroids, pp. 359–416, University of Arizona Press, Tucson, AZ (1994).
[5] E. T. Lu and R. J. Hamilton, Avalanches and the distribution of solar flares. Astrophysical Journal 380, 89–92 (1991).
[6] M. E. Crovella and A. Bestavros, Self-similarity in World Wide Web traffic: Evidence and possible causes. In B. E. Gaither and D. A. Reed (eds.), Proceedings of the 1996 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pp. 148–159, Association of Computing Machinery, New York (1996).
[7] D. C. Roberts and D. L. Turcotte, Fractality and self-organized criticality of wars. Fractals 6, 351–357 (1998).
[8] J. B. Estoup, Gammes Stenographiques. Institut Stenographique de France, Paris (1916).
[9] D. H. Zanette and S. C. Manrubia, Vertical transmission of culture and the distribution of family names. Physica A 295, 1–8 (2001).
[10] A. J. Lotka, The frequency distribution of scientific production. J. Wash. Acad. Sci. 16, 317–323 (1926).
[11] D. J. de S. Price, Networks of scientific papers. Science 149, 510–515 (1965).
[12] L. A. Adamic and B. A. Huberman, The nature of markets in the World Wide Web. Quarterly Journal of Electronic Commerce 1, 5–12 (2000).
[13] R. A. K. Cox, J. M. Felton, and K. C. Chung, The concentration of commercial success in popular music: an analysis of the distribution of gold records. Journal of Cultural Economics 19, 333–340 (1995).
[14] R. Kohli and R. Sah, Market shares: Some power law results and observations. Working paper 04.01, Harris School of Public Policy, University of Chicago (2003).
[15] J. C. Willis and G. U. Yule, Some statistics of evolution and geographical distribution in plants and animals, and their significance. Nature 109, 177–179 (1922).
[16] V. Pareto, Cours d’Economie Politique. Droz, Geneva (1896).
[17] G. B. West, J. H. Brown, and B. J. Enquist, A general model for the origin of allometric scaling laws in biology. Science 276, 122–126 (1997).
[18] D. Sornette, Critical Phenomena in Natural Sciences, chapter 14. Springer, Heidelberg, 2nd edition (2003).
[19] M. Mitzenmacher, A brief history of generative models for power law and lognormal distributions. Internet Mathematics 1, 226–251 (2004).
[20] M. L. Goldstein, S. A. Morris, and G. G. Yen, Problems with fitting to the power-law distribution. Eur. Phys. J. B 41, 255–258 (2004).
[21] H. Dahl, Word Frequencies of Spoken American English. Verbatim, Essex, CT (1979).
[22] S. Redner, How popular is your paper? An empirical study of the citation distribution. Eur. Phys. J. B 4, 131–134 (1998).
[23] A. P. Hackett, 70 Years of Best Sellers, 1895-1965. R. R. Bowker Company, New York, NY (1967).
[24] W. Aiello, F. Chung, and L. Lu, A random graph model for massive graphs. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, pp. 171–180, Association of Computing Machinery, New York (2000).
[25] H. Ebel, L.-I. Mielsch, and S. Bornholdt, Scale-free topology of e-mail networks. Phys. Rev. E 66, 035103 (2002).
[26] B. A. Huberman and L. A. Adamic, Information dynamics in the networked world. In E. Ben-Naim, H. Frauenfelder, and Z. Toroczkai (eds.), Complex Networks, number 650 in Lecture Notes in Physics, pp. 371–398, Springer, Berlin (2004).
[27] M. Small and J. D. Singer, Resort to Arms: International and Civil Wars, 1816-1980. Sage Publications, Beverly Hills (1982).
[28] S. Miyazima, Y. Lee, T. Nagamine, and H. Miyajima, Power-law distribution of family names in Japanese societies. Physica A 278, 282–288 (2000).
[29] B. J. Kim and S. M. Park, Distribution of Korean family names. Preprint cond-mat/0407311 (2004).
[30] J. Chen, J. S. Thorp, and M. Parashar, Analysis of electric power disturbance data. In 34th Hawaii International Conference on System Sciences, IEEE Computer Society (2001).
[31] B. A. Carreras, D. E. Newman, I. Dobson, and A. B. Poole, Evidence for self-organized criticality in electric power system blackouts. In 34th Hawaii International Conference on System Sciences, IEEE Computer Society (2001).
[32] E. Limpert, W. A. Stahel, and M. Abbt, Log-normal distributions across the sciences: Keys and clues. Bioscience 51, 341–352 (2001).
[33] M. E. J. Newman, S. Forrest, and J. Balthrop, Email networks and the spread of computer viruses. Phys. Rev. E 66, 035101 (2002).
[34] M. O. Lorenz, Methods of measuring the concentration of wealth. Publications of the American Statistical Association 9, 209–219 (1905).
[35] H. A. Simon, On a class of skew distribution functions. Biometrika 42, 425–440 (1955).
[36] G. U. Yule, A mathematical theory of evolution based on the conclusions of Dr. J. C. Willis. Philos. Trans. R. Soc. London B 213, 21–87 (1925).
[37] G. A. Miller, Some effects of intermittent silence. American Journal of Psychology 70, 311–314 (1957).
[38] W. Li, Random texts exhibit Zipf’s-law-like word frequency distribution. IEEE Transactions on Information Theory 38, 1842–1845 (1992).
[39] C. E. Shannon, A mathematical theory of communication I. Bell System Technical Journal 27, 379–423 (1948).
[40] C. E. Shannon, A mathematical theory of communication II. Bell System Technical Journal 27, 623–656 (1948).
[41] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley, New York (1991).
[42] B. B. Mandelbrot, An information theory of the statistical structure of languages. In W. Jackson (ed.), Symp. Applied Communications Theory, pp. 486–502, Butterworth, Woburn, MA (1953).
[43] W. J. Reed and B. D. Hughes, From gene families and genera to incomes and internet file sizes: Why power laws are so common in nature. Phys. Rev. E 66, 067103 (2002).
[44] J.-P. Bouchaud, More Lévy distributions in physics. In M. F. Shlesinger, G. M. Zaslavsky, and U. Frisch (eds.), Lévy Flights and Related Topics in Physics, number 450 in Lecture Notes in Physics, Springer, Berlin (1995).
[45] N. Jan, L. Moseley, T. Ray, and D. Stauffer, Is the fossil record indicative of a critical system? Adv. Complex Syst. 2, 137–141 (1999).
[46] D. Sornette, Mechanism for powerlaws without self-organization. Int. J. Mod. Phys. C 13, 133–136 (2001).
[47] R. H. Swendsen and J.-S. Wang, Nonuniversal critical dynamics in Monte Carlo simulations. Phys. Rev. Lett. 58, 86–88 (1987).
[48] K. Sneppen, P. Bak, H. Flyvbjerg, and M. H. Jensen, Evolution as a self-organized critical phenomenon. Proc. Natl. Acad. Sci. USA 92, 5209–5213 (1995).
[49] M. E. J. Newman and R. G. Palmer, Modeling Extinction. Oxford University Press, Oxford (2003).
[50] D. J. de S. Price, A general theory of bibliometric and other cumulative advantage processes. J. Amer. Soc. Inform. Sci. 27, 292–306 (1976).
[51] P. L. Krapivsky, S. Redner, and F. Leyvraz, Connectivity of growing random networks. Phys. Rev. Lett. 85, 4629–4632 (2000).
[52] A.-L. Barabási and R. Albert, Emergence of scaling in random networks. Science 286, 509–512 (1999).
[53] S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin, Structure of growing networks with preferential linking. Phys. Rev. Lett. 85, 4633–4636 (2000).
[54] R. K. Merton, The Matthew effect in science. Science 159, 56–63 (1968).
[55] P. J. Reynolds, W. Klein, and H. E. Stanley, A real-space renormalization group for site and bond percolation. J. Phys. C 10, L167–L172 (1977).
[56] K. G. Wilson and J. Kogut, The renormalization group and the ε-expansion. Physics Reports 12, 75–199 (1974).
[57] P. Bak, C. Tang, and K. Wiesenfeld, Self-organized criticality: An explanation of the 1/f noise. Phys. Rev. Lett. 59, 381–384 (1987).
[58] B. Drossel and F. Schwabl, Self-organized critical forest-fire model. Phys. Rev. Lett. 69, 1629–1632 (1992).
[59] P. Grassberger, Critical behaviour of the Drossel-Schwabl forest fire model. New Journal of Physics 4, 17 (2002).
[60] P. Bak, How Nature Works: The Science of Self-Organized Criticality. Copernicus, New York (1996).
[61] P. Bak and C. Tang, Earthquakes as a self-organized critical phenomenon. Journal of Geophysical Research 94, 15635–15637 (1989).
[62] Z. Olami, H. J. S. Feder, and K. Christensen, Self-organized criticality in a continuous, nonconservative cellular automaton modeling earthquakes. Phys. Rev. Lett. 68, 1244–1247 (1992).
[63] P. Bak and K. Sneppen, Punctuated equilibrium and criticality in a simple model of evolution. Phys. Rev. Lett. 71, 4083–4086 (1993).
[64] J. M. Carlson and J. Doyle, Highly optimized tolerance: A mechanism for power laws in designed systems. Phys. Rev. E 60, 1412–1427 (1999).
[65] J. M. Carlson and J. Doyle, Highly optimized tolerance: Robustness and design in complex systems. Phys. Rev. Lett. 84, 2529–2532 (2000).
[66] K. Sneppen and M. E. J. Newman, Coherent noise, scale invariance and intermittency in large systems. Physica D 110, 209–222 (1997).
[67] D. Sornette and R. Cont, Convergent multiplicative processes repelled from zero: Power laws and truncated power laws. Journal de Physique I 7, 431–444 (1997).
[68] D. Sornette, Multiplicative processes and power laws. Phys. Rev. E 57, 4811–4813 (1998).
[69] X. Gabaix, Zipf’s law for cities: An explanation. Quarterly Journal of Economics 114, 739–767 (1999).
[70] P. Sibani and P. B. Littlewood, Slow dynamics from noise adaptation. Phys. Rev. Lett. 71, 1482–1485 (1993).
[71] P. Sibani, M. R. Schmidt, and P. Alstrøm, Fitness optimization and decay of the extinction rate through biological evolution. Phys. Rev. Lett. 75, 2055–2058 (1995).
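Supplementary sketch for Appendix A. The sort-and-rank recipe for building a cumulative distribution can be written in a few lines of Python. The function name `rank_frequency`, the toy sentence, and the handling of ties (keeping the largest rank for each distinct value, so that the rank equals the number of measurements greater than or equal to x) are choices made here for illustration and are not taken from the paper.

```python
from collections import Counter

def rank_frequency(values):
    """Return (x, rank) pairs, where rank = number of measurements >= x."""
    ordered = sorted(values, reverse=True)   # decreasing order
    points = []
    for rank, x in enumerate(ordered, start=1):
        if points and points[-1][0] == x:
            # tie: overwrite with the larger rank for this value
            points[-1] = (x, rank)
        else:
            points.append((x, rank))
    return points

text = "the cat and the dog and the bird"
freq = Counter(text.split())                 # word -> frequency of occurrence
points = rank_frequency(freq.values())
```

Plotting the second element of each pair against the first on logarithmic axes gives the rank/frequency plot; dividing the ranks by the total number of measurements changes only the normalization.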

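Supplementary sketch for Appendix B. Equations (B6) and (B12) translate directly into code. In the Python sketch below, `fit_power_law` and `sample_power_law` are hypothetical helper names, x_min is assumed to be known in advance, and the synthetic test data are generated by inverse-transform sampling of the distribution (B1), which is an assumption of this example rather than part of the appendix.

```python
import math
import random

def fit_power_law(xs, xmin):
    """Return (alpha, sigma) from Eqs. (B6) and (B12) using data with x >= xmin."""
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    b = sum(math.log(x / xmin) for x in tail)   # b as defined below Eq. (B7)
    alpha = 1.0 + n / b                          # Eq. (B6)
    sigma = (alpha - 1.0) / math.sqrt(n)         # Eq. (B12), valid for n >> 1
    return alpha, sigma

def sample_power_law(alpha, xmin, n, rng):
    """Draw n values from Eq. (B1) via x = xmin * (1 - r)^(-1/(alpha - 1))."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

rng = random.Random(42)
data = sample_power_law(alpha=2.5, xmin=1.0, n=10000, rng=rng)
alpha_hat, sigma_hat = fit_power_law(data, xmin=1.0)
```

With 10000 samples the estimated error from Eq. (B12) should be of order (α − 1)/√n ≈ 0.015, so `alpha_hat` is expected to land close to the value used to generate the data. The sketch does not address how to choose x_min, and it will fail with a division by zero if every data point equals x_min.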