Quantum
Physica A
journal homepage: www.elsevier.com/locate/physa
1. Introduction
Quantitative analysis of large text samples has revealed regularities in the behavior of various text parameters. The
empirical laws found in texts, such as Zipf’s law, are known to hold in various domains, in particular the distribution of
nucleotides in genomes and other fields of biology [1–4], regularities in social sciences [5–9], etc.
Approaches from the domain of statistical physics can be used to study systems composed of many units in general, and
texts are suitable for such studies as well. The application of physical techniques in linguistics is quite common [10–14], and
other domains are also successfully covered by physical approaches; see [15].
In this work, we analyze the quantitative behavior of texts by finding an analogy with a bosonic system within the
grand canonical ensemble. In doing so, we demonstrate the possibility of assigning some new parameters characterizing
the frequency structure of texts, one of which can be conventionally called ‘‘temperature’’.
The notion of the ‘‘temperature of texts’’ has been discussed from different points of view by several authors.
Mandelbrot [16] suggested the name ‘‘informational temperature of texts’’ for a parameter in a rank–frequency distribution
(known as the Zipf–Mandelbrot law). Such a parameter is related to ‘‘good’’ or ‘‘bad’’ employment of words, especially rare
words [17]. The ‘‘temperature’’ as a measure of communicative ability was introduced in [18]. Recently, Miyazima and
Yamamoto [19] used the classical Boltzmann distribution to define the ‘‘temperature of texts’’ from the frequency data
of the most frequent words. We propose a different approach, mainly addressing the behavior of low-frequency vocabulary.
The paper is organized as follows. In Section 2, we present the main notions used subsequently, namely the principles
of obtaining a rank–frequency distribution as well as the term hapax legomena. Section 3 contains the main part of the
paper, where the physical analogy with the Bose distribution is discussed in detail and the parameters of text frequency
distribution are given a suitable interpretation. The results of text analyses in three languages are given in Section 4, and
Section 5 contains a brief discussion of the presented approach.
∗ Corresponding address: Department for Theoretical Physics, 12 Drahomanov St., Lviv, UA-79005, Ukraine. Tel.: +380 32 2614443.
E-mail addresses: [email protected], [email protected] (A. Rovenchak).
0378-4371/$ – see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.physa.2010.12.009
A. Rovenchak, S. Buk / Physica A 390 (2011) 1326–1331 1327
Fig. 1. Typical rank–frequency distribution. The absolute frequency f is shown versus the rank r for orthographic words of Perekhresni stežky (The Cross-
Paths), a Ukrainian novel by Ivan Franko. Data are obtained by the authors in the preliminary stage of compiling the frequency dictionary of the novel [21].
2. Rank–frequency distribution
In this work, we analyze texts on the word level. While the notion of ‘‘word’’ has no unique definition, see [20], we restrict
ourselves to the so-called ‘‘orthographic word’’ defined as an alphanumeric sequence between two spaces or punctuation
marks. Different word forms, like ‘hand’ and ‘hands’, ‘write’ and ‘wrote’, etc., are considered as different words for simplicity.
To obtain a rank–frequency distribution, one should first compile the frequency list from a given sample. Then, the item
with the highest frequency is given rank 1, the second most frequent item is given rank 2, and so on. Items with the same
frequency are given a consecutive range of ranks, the ordering within which can be arbitrary.
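The ranking procedure just described can be illustrated with a minimal Python sketch. The tokenizer (lowercased runs of word characters), the alphabetical tie-breaking, and the sample sentence are assumptions made for this illustration only; they are not the procedure used to build the frequency dictionary [21].

```python
import re
from collections import Counter

def orthographic_words(text):
    """Split text into lowercased runs of word characters.

    This only approximates the 'orthographic word' of the text:
    case folding and the \\w character class are assumptions.
    """
    return re.findall(r"\w+", text.lower())

def rank_frequency(text):
    """Return a list of (rank, word, frequency), most frequent word first."""
    counts = Counter(orthographic_words(text))
    # Sort by decreasing frequency. Ties are broken alphabetically, which is
    # one arbitrary but reproducible ordering within a plateau.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Hypothetical sample, far too short for any statistical use.
sample = "the cat saw the dog and the dog saw a bird"
table = rank_frequency(sample)
for rank, word, freq in table:
    print(rank, word, freq)
```

Any other tie-breaking rule would serve equally well, since the ordering within a group of equal frequencies is arbitrary.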
The studies of rank–frequency distributions originate from text analysis. The regularities found there are known to hold
in various domains, though texts still remain the most easily accessible material having a good variety of sorts to be analyzed.
A typical rank–frequency distribution has the shape shown in Fig. 1.
Horizontal plateaus in the domain of high ranks/low frequencies correspond to a large number of words having the same
frequency. The longest plateau corresponds to frequency 1. Such words are known as hapax legomena, the term originating
from Bible studies.
Hapax legomena is the plural of the Classical Greek term hapax legomenon, translated as ‘[something] said
[only] once’. That is, this term corresponds to the tokens appearing only once in a given sample. Examples from the Bible
include [22] ‘Lilith’ (a word of obscure meaning) or ‘gopher wood’ (used to build Noah’s Ark). Other often
cited examples are a kind of plough (Hesiod, [Opera et Dies = Works and Days], 433) and honorificabilitudinitatibus
‘the state of being able to achieve honors’ (Shakespeare, Love’s Labour’s Lost, act 5, scene 1 [23, p. 372]).
For large text samples, about 40–60% of occurring words are hapaxes, depending on the text size [24, p. 72]. The relative
number of hapax legomena slightly decreases as the text becomes longer. Various quantities depending on the text size N
are well described by a power law [25], and the number of hapaxes fits into this family as well:
$$N_{\mathrm{hapax}} = A N^{b}.$$
Note, however, that for statistical studies texts must be sufficiently long.
Indeed, even in such a long sentence as this, having twenty-five tokens, all the words are hapaxes, except for ‘‘hapaxes’’ itself
since it occurs twice.
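The parameters A and b of the power law for the number of hapaxes can be estimated, for instance, by ordinary least squares on log-transformed data. The sketch below is only one possible procedure, since the fitting method for this relation is not specified here. The sample sizes and hapax counts are synthetic, generated from the hypothetical values A = 2 and b = 0.8, so the fit simply recovers them.

```python
import math

def fit_power_law(sizes, hapax_counts):
    """Least-squares fit of ln(N_hapax) = ln(A) + b * ln(N); returns (A, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(h) for h in hapax_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope of the log-log line
    A = math.exp(mean_y - b * mean_x)   # intercept, transformed back
    return A, b

# Synthetic data from hypothetical A = 2, b = 0.8 (not taken from any text).
sizes = [1000, 5000, 20000, 100000]
hapax_counts = [2.0 * n ** 0.8 for n in sizes]
A, b = fit_power_law(sizes, hapax_counts)
print(A, b)
```

With real counts, a fit in log-log space weights small and large samples differently from a direct nonlinear fit, so the two approaches need not agree exactly.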
3. Physical analogy
The rank–frequency distribution of words in texts has clear similarities with the Bose distribution in statistical physics.
We suggest identifying the energy level numbers j with word frequencies (the number of occurrences in a given text). Thus,
the words with frequency 1 occupy the level j = 1, the words with frequency 2 occupy the level j = 2, etc. The level
occupation then corresponds to the number of different words with the same frequency. Since the level occupation can
reach any value (in particular, significantly larger than unity) the use of the Bose distribution is appropriate. The lowest
level corresponds to hapax legomena, and in this scheme can be identified with the Bose condensate.
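The identification of levels with frequencies can be made concrete with a short Python sketch. It assumes that the text has already been split into tokens; the sample sentence and the function name are hypothetical.

```python
from collections import Counter

def level_occupations(tokens):
    """Return {j: N_j}: for each frequency j, the number of distinct words
    that occur exactly j times in the token list."""
    word_freq = Counter(tokens)                  # word -> frequency
    occupations = Counter(word_freq.values())    # frequency j -> N_j
    return dict(sorted(occupations.items()))

# Hypothetical sample, far too short for any statistical use.
tokens = "the cat saw the dog and the dog saw a bird".split()
occupations = level_occupations(tokens)
print(occupations)

# The occupation of level j = 1 is the number of hapax legomena.
n_hapax = occupations.get(1, 0)
```

In this sample four words occur once, two occur twice, and one occurs three times; summing j times the occupation over all levels gives back the total number of tokens.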
Table 1
The parameters of the ‘‘energy spectrum’’ and ‘‘temperature’’ of texts.

        N       α         T    ln T / ln N      T/N
Moby-Dick (ENG)
    5 942    1.97     470.4       0.708       0.0792
   39 363    1.60    1773.3       0.707       0.0451
   66 916    1.56    2639.7       0.709       0.0394
  107 503    1.48    3622.3       0.707       0.0337
  132 968    1.48    4207.3       0.707       0.0316
  165 746    1.48    4968.4       0.708       0.0300
  191 040    1.47    5476.5       0.708       0.0287
  215 270    1.45    5791.3       0.706       0.0269
Perekhresni stežky (The Cross-Paths) (UKR)
      343    1.57      26.6       0.562       0.0774
    1 052    2.03     119.1       0.687       0.1132
   12 949    1.68     812.0       0.708       0.0627
   28 010    1.73    1610.1       0.721       0.0575
   40 811    1.72    2270.7       0.728       0.0556
   54 361    1.70    2964.3       0.733       0.0545
   70 330    1.64    3597.4       0.734       0.0512
   96 083    1.57    4561.4       0.734       0.0475
As shown further, the power energy spectrum gives a proper description for lower levels:
$$\varepsilon_j = (j - 1)^{\alpha}. \tag{2}$$
Unity is subtracted to ensure that the lowermost level has zero energy.
Due to the nature of the frequency distribution, a simple model of a very weak log-of-log growth is appropriate for the
energy spectrum at high levels, $\varepsilon_j \propto \ln \ln j$ for $j \gg 1$; see Fig. 2. Note, however, that a log-of-log spectrum requires the
number of levels to be bounded from above by some $j_{\max}$.
It must be emphasized that our study focuses on the structure of low-frequency data (small j values, j = 1–20). This
is unlike many traditional studies of word frequencies, which concentrate mainly on the analysis of high (and also medium)
frequencies.
We define the parameters in Eq. (1) in two steps. First, the parameter z, interpreted as the fugacity in physics, is
determined from the occupation number of the lowermost state, i.e., the number of hapax legomena:
$$N_{\mathrm{hapax}} = \frac{z}{1 - z}. \tag{3}$$
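Equation (3) can be solved for z in closed form, z = N_hapax / (N_hapax + 1). The Python sketch below shows this first step; the function names and the hapax counts are hypothetical and serve only to show how quickly z approaches unity.

```python
def fugacity_from_hapax(n_hapax):
    """Solve Eq. (3), N_hapax = z / (1 - z), for the fugacity z."""
    return n_hapax / (n_hapax + 1.0)

def hapax_from_fugacity(z):
    """Eq. (3) itself: occupation of the lowest level for a fugacity z < 1."""
    return z / (1.0 - z)

# Hypothetical hapax counts (not taken from any of the analyzed texts).
for n_hapax in (10, 1_000, 10_000):
    print(n_hapax, fugacity_from_hapax(n_hapax))
```

Since 1 − z = 1 / (N_hapax + 1), any sample with many hapaxes has z very close to 1.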
‘‘Temperature’’ T and the exponent α in Eq. (2) are found simultaneously by fitting the occupation of higher energy levels
to
$$N_j = \frac{1}{z^{-1} e^{(j-1)^{\alpha}/T} - 1} \tag{4}$$
via two parameters, α and T. Sample fitting results are presented in Fig. 2; see also Table 1. These calculations, as well
as the others given later, were made using the nonlinear least-squares Marquardt–Levenberg algorithm implemented in the fit
procedure of gnuplot, version 4.0.
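As a rough cross-check of such a fit, Eq. (4) can be linearized: for j ≥ 2 it is equivalent to ln ln[z(1 + 1/N_j)] = α ln(j − 1) − ln T, so that α and T follow from a straight-line regression. The Python sketch below implements this linearized estimate, which is a swapped-in technique and not the Marquardt–Levenberg fit described above; on real data the two weight the residuals differently and will generally give somewhat different values. The occupations are synthetic, generated from the hypothetical values z = 0.999, α = 1.5, and T = 100.

```python
import math

def occupation(j, z, alpha, T):
    """Eq. (4): N_j = 1 / (z**-1 * exp((j - 1)**alpha / T) - 1)."""
    return 1.0 / (math.exp((j - 1) ** alpha / T) / z - 1.0)

def fit_alpha_T(occupations, z):
    """Estimate (alpha, T) from {j: N_j} by regressing
    ln ln[z (1 + 1/N_j)] on ln(j - 1).

    Only levels j >= 2 are used, and every N_j must be smaller than
    z / (1 - z), otherwise the inner logarithm is not positive.
    """
    xs, ys = [], []
    for j, n_j in occupations.items():
        if j < 2:
            continue  # level j = 1 fixes z and says nothing about alpha, T
        xs.append(math.log(j - 1))
        ys.append(math.log(math.log(z * (1.0 + 1.0 / n_j))))
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, math.exp(-intercept)  # alpha, T

# Synthetic occupations for levels j = 2..20 from hypothetical parameters.
z = 999 / 1000
synthetic = {j: occupation(j, z, 1.5, 100.0) for j in range(2, 21)}
alpha, T = fit_alpha_T(synthetic, z)
print(alpha, T)
```

Because the synthetic data obey Eq. (4) exactly, the regression recovers the generating parameters; with empirical occupation numbers the estimate is best used as a starting point for a nonlinear fit.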
One should note that the parameter T is dimensionless in our case, as is the energy $\varepsilon_j$. Such a definition differs, for example,
from that in [19], where a distribution of some standard text was used to set the reference temperature in kelvins.
The state with T = 0 corresponds to all the frequencies equal to unity, that is, the whole text is composed of hapax
legomena. This could be the case of a very short text, not longer than just one or a couple of sentences (see the example at
the end of Section 2).
We fit the first 10–20 levels using the power excitation spectrum (2). Higher levels are neglected since a different
dependence on j must be applied to ensure good fitting of the occupation data $N_j$, as suggested above.
The parameter T obtained in this way scales (very precisely) as $N^{\beta}$ with $\beta < 1$. The scaling is related to the definition of the
‘‘thermodynamic limit’’ for the problem under consideration. To recall, in a system of N bosons trapped in a D-dimensional
harmonic oscillator potential with frequency ω, the thermodynamic limit is given by $\omega N^{1/D} = \mathrm{const}$ as $N \to \infty$,
$\omega \to 0$ [26]. Since ω (or ħω if Planck’s constant ħ is not set equal to unity) is a natural unit for the oscillator energy, the
power-like scaling of the quantities measured in energy units is to be expected for systems with a power energy spectrum as
well.
Fig. 2. (Color online) The fit of the power energy spectrum to the level occupations. Blue crosses correspond to the data obtained by the authors on the
basis of the first 40 chapters (of 60 in total) of the text mentioned in the caption of Fig. 1. The solid line is the fitting curve (4) for the first 20 values of the
occupation numbers $N_j$.
Fig. 3. (Color online) The behavior of ‘‘temperature’’ as the size of text grows. MD: Moby-Dick; PS: Perekhresni stežky; So: Sobor. The lines correspond to
the linear fits of the data represented by the respective symbols.
Curiously, the ratio ln T / ln N exhibits an insignificant variation with the size of the text sample (for a sufficiently long
text); see Table 1. This makes it a good variable for comparative linguistic studies.
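The near-constancy of this ratio can be checked directly from the Moby-Dick rows of Table 1. The Python sketch below recomputes ln T / ln N from the tabulated N and T and, as an additional illustration, estimates the scaling exponent β by a straight-line fit of ln T against ln N. The regression is an assumption of this sketch and is not stated to be the procedure behind the linear fits in Fig. 3.

```python
import math

# (N, T) pairs for Moby-Dick, copied from Table 1.
moby_dick = [
    (5942, 470.4), (39363, 1773.3), (66916, 2639.7), (107503, 3622.3),
    (132968, 4207.3), (165746, 4968.4), (191040, 5476.5), (215270, 5791.3),
]
# The ln T / ln N column of Table 1 for the same rows.
tabulated_ratio = [0.708, 0.707, 0.709, 0.707, 0.707, 0.708, 0.708, 0.706]

ratios = [math.log(T) / math.log(N) for N, T in moby_dick]

def loglog_slope(pairs):
    """Least-squares slope of ln T versus ln N, an estimate of beta."""
    xs = [math.log(N) for N, _ in pairs]
    ys = [math.log(T) for _, T in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

beta = loglog_slope(moby_dick)
for (N, T), ratio in zip(moby_dick, ratios):
    print(N, T, round(ratio, 3))
print("beta estimate:", beta)
```

The recomputed ratios should agree with the tabulated column to within rounding, and the slope comes out below unity, in line with β < 1.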
4. Some results
So far, we have performed analyses of some texts written in English (Germanic language), Ukrainian (Slavic language),
and Guinean Maninka (in the Nko script; a language from the Mande family). Such a vast range has been used to check the
approach on significantly different language materials in order to reveal both universal and unique features of the parameter
behavior.
Fig. 3 demonstrates the ‘‘temperature’’ behavior of an English text (Moby-Dick by Herman Melville) and two novels in
Ukrainian (Perekhresni stežky [The Cross-Paths] by Ivan Franko and Sobor [The Cathedral] by Oles Honchar).
Table 1 shows the numerical data for the parameter T calculated by using increasing numbers of chapters. The data based
on an article from a Guinean journal Yélén` are also given.
The values of z in all cases are close to 1. A better resolution might be achieved by introducing an analog of the chemical
potential µ in a standard way: $z = e^{\mu/T}$.
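For orientation, Eq. (3) can be inverted and combined with this definition of µ. The short derivation below is a sketch that uses only these two relations and assumes that the number of hapaxes is large:

$$z = \frac{N_{\mathrm{hapax}}}{N_{\mathrm{hapax}} + 1}, \qquad \mu = T \ln z = -T \ln\!\left(1 + \frac{1}{N_{\mathrm{hapax}}}\right) \approx -\frac{T}{N_{\mathrm{hapax}}} \quad (N_{\mathrm{hapax}} \gg 1).$$

Thus µ is a small negative quantity governed by the ratio of T to the number of hapaxes, whereas z differs from unity only by 1/(N_hapax + 1).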
The fitting gives the values of the exponent α slightly decreasing as the text size grows. An interpretation in terms of an
external potential can be applied to justify such a change. Indeed, if the presence of an external potential is treated in the
semiclassical approach [27], the decreasing values of the exponent in a power excitation spectrum effectively correspond
to weakening of the steepness of an external potential. That is, as a text becomes longer, it suffers less from some external
influences.
Fig. 4. (Color online) The behavior of ln T / ln N for texts from different genres. Open letters, private letters, and sermons are shown.
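Indeed, in one dimension, a power energy spectrum $\varepsilon_p \propto p^{\alpha}$ leads to the density of states
$$g(\varepsilon) \propto \varepsilon^{\frac{1}{\alpha} - 1}. \tag{5}$$
On the other hand, non-interacting particles confined in a trapping potential $U(x) \propto x^{\eta}$ in the semiclassical approach [27]
have the density of states
$$g(\varepsilon) \propto \varepsilon^{\frac{1}{\eta} - \frac{1}{2}}. \tag{6}$$
Equating the exponents in Eqs. (5) and (6) gives
$$\alpha = \frac{2\eta}{\eta + 2}, \tag{7}$$
leading to α = 3/2 for η = 6 and α = 1 for η = 2.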
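The correspondence in Eq. (7) and its inverse, η = 2α / (2 − α), can be evaluated exactly with rational arithmetic. The following Python sketch is an illustration only; the function names are hypothetical, and the inverse is meaningful only for α < 2.

```python
from fractions import Fraction

def alpha_from_eta(eta):
    """Eq. (7): alpha = 2 * eta / (eta + 2) for a potential U(x) ~ x**eta."""
    eta = Fraction(eta)
    return 2 * eta / (eta + 2)

def eta_from_alpha(alpha):
    """Inverse of Eq. (7): eta = 2 * alpha / (2 - alpha), valid for alpha < 2."""
    alpha = Fraction(alpha)
    if alpha >= 2:
        raise ValueError("Eq. (7) has no positive solution eta for alpha >= 2")
    return 2 * alpha / (2 - alpha)

print(alpha_from_eta(6))                # the case eta = 6 quoted in the text
print(alpha_from_eta(2))                # the harmonic case eta = 2
print(eta_from_alpha(Fraction(3, 2)))   # back from alpha = 3/2 to eta
```

Since α grows monotonically with η and approaches 2 only as η tends to infinity, a decreasing α corresponds to a less steep potential, as stated above.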
The obtained values of T and of the exponent α correlate with the analyticity level of the language. Lower values correspond to
higher analyticity (less word inflection), as can be seen from the opposition between English and Ukrainian (both Indo-
European languages). So far, we do not have sufficient data to make further statements, in particular, for the language from
an unrelated language family (Mande), whose data are given for curiosity and future reference. As should be expected, a low
value of α for the Maninka sample suggests a high level of analyticity.
Finally, in Fig. 4 we present the results of the ‘‘temperature’’ calculation made for short Ukrainian texts of different
genres [28]. Close values denote weak genre dependence of this parameter. A multivariate discriminant analysis is required
to study this issue in more detail.
5. Brief discussion
The presented results are a preliminary attempt to define a new set of parameters describing the frequency structure of
texts. Further application of this approach to a larger number of texts from different languages written by different authors
is required to establish the correlation between parameter values and text language/authorship. The shape of the spectrum
in the whole domain of the variation of j must be considered in further studies to give a proper description of the level
occupation. Also, more parameters can be calculated within the ‘‘thermodynamic approach’’ (like some analogs of total
energy, specific heat, etc.; see [18]). One of the outcomes which we expect from such calculations is the possibility of
automatic text attribution, which would be useful for automated language processing. Applications beyond linguistics, in
genetics and social sciences, for example, are also possible in the future.
Acknowledgements
We are grateful to the anonymous referees for useful critical comments and suggestions, which were helpful in improving
the manuscript.
This work was partly supported by Projects ΦΦ -14Φ (No. 0109U002096) and M/6-2009 (No. 0109U001786) from the
Ministry of Education and Sciences of Ukraine and WTZ Project UA 05/2009 from the Österreichischer Austauschdienst.
References