
Quality & Quantity

https://doi.org/10.1007/s11135-020-00976-w

Topic modeling, long texts and the best number of topics. Some Problems and solutions

Stefano Sbalchiero1 · Maciej Eder2

© Springer Nature B.V. 2020

Abstract
The main aim of this article is to present the results of different experiments focused on the model fitting process in topic modeling and on its accuracy when applied to long texts. The digital era has made available both enormous quantities of textual data and technological advances that have facilitated the development of techniques to automate the data coding and analysis processes. In the field of topic modeling, different procedures have been developed to analyze ever larger collections of texts, namely corpora, but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. Therefore, through a series of different experiments, this article starts from the following consideration: taking into account Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS), 101(Supplement 1):5228–5235, 2004), the problem of fitting the model is crucial because the LDA algorithm demands that the number of topics be specified a priori. Needless to say, the number of topics to detect in a corpus is a parameter which affects the analysis results. Since there is a lack of experiments applied to long texts, our article tries to shed new light on the complex relationship between text length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, and we formulate it in the form of a mathematical model.

Keywords Topic modeling · Latent Dirichlet Allocation · Long texts · Log-likelihood for
the model · Best number of topics

* Stefano Sbalchiero
[email protected]
1  Department of Philosophy, Sociology, Education and Applied Psychology (FISPPA), Section of Sociology, University of Padova, Padova, Italy
2  Polish Academy of Sciences and Pedagogical University of Kraków, Kraków, Poland


1 Introduction

Among the plethora of theories and methods in the area of text analysis—including text
classification, clustering, information retrieval, semantic analysis, and so forth—methods
of discovering latent thematic structure in large collections of texts are attracting a good
share of attention not only among scholars, but also a broader audience of text analysis
practitioners. Perhaps the best known technique of this sort is topic modeling, usually computed using Latent Dirichlet Allocation (LDA). In recent years topic modeling has gained ground in influential text mining research agendas, and today LDA represents one of the most widespread topic modeling algorithms, widely considered a milestone in the panorama of topic models (Sbalchiero 2018). In the last few years, numerous probabilistic models based on LDA have been proposed, such as the author-topic model (Rosen-Zvi et al. 2004), the dynamic topic model (Blei and Lafferty 2006), the pachinko allocation model (Li and McCallum 2006), and the correlated topic model (Blei and Lafferty 2007). We chose LDA on the one hand because it is by far the most popular (in terms of its applications, implementations, and research studies) and the most tested; on the other hand, because the subsequent extensions also require that the number of topics to detect in a corpus be identified a priori, which is the main object of the present paper. The basic idea behind this algorithm is that topics emerge directly from the data in a generative probabilistic model applied to a collection of texts (Blei and Lafferty 2009). Given this premise, a corpus is represented by a set of “latent topics”, which means that “one common way of modelling the contributions of different topics to a document is to treat each topic as a probability distribution over words, viewing a document as a probabilistic mixture of these topics” (Griffiths and Steyvers 2004, p. 5228). In other words, since topics cannot be observed directly in a corpus, while words and documents are available, the algorithm infers the hidden structure of the text through the generative process that produces the documents of the corpus, by identifying the relevance of topics to documents and of words to topics, where relevance stands for probability distributions. As a result, the probabilistic distribution of topics over documents is characterized by a set of probable terms (Blei et al. 2003). Many studies have demonstrated the usefulness of LDA applied to different communicative contexts and to short texts (Hall et al. 2008; Hong and Davison 2010; Tong and Zhang 2016; Puschmann and Scheffler 2016; Maier et al. 2018; Giordan et al. 2018; Sbalchiero and Tuzzi 2018). Even if it has been demonstrated that topic modeling performs quite well with short texts (Sbalchiero 2018), there is a lack of empirical evidence when it is applied to long texts. It is true that some studies suggest a simple rule of thumb, namely dividing the input texts into 1000-word chunks (Jockers and Mimno 2013), but no systematic studies exist that would support such a claim. In fact, this limitation of topic modeling seems to be problematic in many other text analysis routines as well. While a short text (e.g. an abstract of an article, a social media post, a journal article, a newspaper article, a Wikipedia article and so on) can provide concise information about its main contents, text mining techniques applied to a long text have to cope with its much lower content density (Michel et al. 2011). In the case of topic modeling methods, it is hardly feasible to infer a unique and coherent topic from a long text (a book), because it usually contains a wide variety of topics.
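To make the mixture assumption quoted above from Griffiths and Steyvers (2004) concrete, each word token w_i in a document is assumed to be drawn by first sampling a topic assignment z_i and then a word from that topic's distribution, so that

```latex
P(w_i) \;=\; \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)
```

where T is the number of topics. Estimating these distributions for a fixed T is precisely where the a priori choice of the number of topics, discussed throughout this paper, enters the analysis.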
The main aim of the present article, then, is to perform a systematic series of tests
with short and long text samples, and to identify a relationship between the texts’
length and the optimal number of topics using a measure of topics’ coherence. The
paper is structured as follows: the second section focuses on the model fitting process.
The choice of the number of topics to consider in a corpus is a sensitive parameter


because the validity of the results depends on the capacity of the model to identify an
adequate number of topics, and the LDA algorithm demands that the number of topics
is specified a priori. The third section illustrates the corpus and the data pre-processing, an important step to make the subsequent experiments comparable in terms of text chunks of similar length. We go on to illustrate the empirical research, based on the demonstration and validation of the results. The validation of the results follows in three stages, each of them representing an empirical experiment. In the fifth section, we discuss our empirical findings and their theoretical implications. In the conclusions, the paper depicts a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, which we formulate in the form of a mathematical model.

2 Method: fitting the model

Referring the reader to the literature for further details on the LDA algorithm and model estimation (Hall et al. 2008; Blei and Lafferty 2009), it is worthwhile to dwell on some aspects relevant to this contribution. In a bag-of-words perspective (Köhler and Galle 1993; Lebart et al. 1998; Popescu et al. 2009), where neither the word positions in the document nor the order of documents in the corpus are taken into account, we can observe that the algorithm considers documents as mixtures of topics. For example, the document D1 may refer to Topic no. 1 (T1), following the per-document topic probabilities, but also covers T2 and, marginally, T3; thus “each document Dj, for j = 1,…,n, is generated based on a distribution over the k topics, where k defines the maximum number of topics. This value is fixed and defined a priori. In the LDA model, each document in the corpus may be generated according to all k topics with different intensities” (Savoy 2013, p. 347). Needless to say, the a priori number of topics to detect in a corpus is a parameter which affects the results substantially. The k-parameter of the algorithm is crucial mainly because the validity of the results depends on the model fitting process. On theoretical grounds, an excessively small number of topics may generate broad and heterogeneous topics; on the contrary, a large value of k will produce overly specific topics. In both cases, they will be difficult to interpret. Various ways to deal with this issue have been proposed (Cao et al. 2009; Arun et al. 2010; Deveaud et al. 2014), but one of the most appreciated for its simplicity is based on a Bayesian approach: the log-likelihood is computed for all the possible models in a given interval (e.g. 2–N), using the Gibbs sampling algorithm to obtain samples from the posterior distribution of topics within documents, in order to identify the best number of topics to fit the model (Griffiths and Steyvers 2004). To assess the consequences of different numbers of topics Tj, for j = 1,…,n, the authors suggest to “compute the posterior probability of that set of models given the observed data […], the data are the words in the corpus, w, and the model is specified by the number of topics, T, so we wish to compute the likelihood P(w|T)” (Griffiths and Steyvers 2004, p. 5231). In other words, this procedure makes it possible to evaluate to what extent the latent topic structure generates documents that reflect the observed data, that is, the collection of documents. Intuitively, P(w|T) increases to a peak and then decreases: the maximum value suggests the best number of topics to fit the model.
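As a rough illustration of this grid search (not the authors' exact code), the procedure can be sketched with the topicmodels R package (Grün and Hornik 2011); here dtm stands for a document-term matrix built from the text chunks described in Sects. 3 and 4, and the candidate grid and Gibbs control settings are illustrative assumptions.

```r
library(topicmodels)

# 'dtm' is assumed to be a DocumentTermMatrix built from the text chunks
# (see Sects. 3-4); candidate_k spans the interval explored in the paper.
candidate_k <- 2:100

log_liks <- sapply(candidate_k, function(k) {
  fit <- LDA(dtm, k = k, method = "Gibbs",
             control = list(seed = 1234, burnin = 1000, iter = 2000))
  as.numeric(logLik(fit))   # log-likelihood of the fitted model
})

# The k maximising the log-likelihood is taken as the best number of topics
best_k <- candidate_k[which.max(log_liks)]
```

In practice one might thin the grid (e.g. every fifth value of k) to keep the computation tractable, given the running times reported in Sect. 3.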


3 Corpus and data

To evaluate the impact of different text lengths on the best number of topics T suggested by the aforementioned method, we used a benchmark corpus of 100 English novels. This large corpus covers the 19th and the beginning of the 20th century and contains novels by 33 authors (Table 1). The novels in the collection contain an average of 10,654 word types
(N) and 143,200 word tokens (V).
The corpus used in the experiments is available in a GitHub repository.1 The corpus was pre-processed with the dedicated R package tm (R Development Core Team 2016; Feinerer et al. 2008). The pre-processing phase (Grün and Hornik 2011) covered: parsing (words are sequences of letters isolated by means of separators), tokenization, normalization by replacing uppercase with lowercase letters, punctuation removal, and stop word removal.
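For illustration only (this is not the authors' script), these steps could be carried out with tm roughly as follows; the folder name and file pattern are placeholder assumptions.

```r
library(tm)

# Assumed layout: one plain-text file per novel in a local folder 'novels_txt/'
novels <- VCorpus(DirSource("novels_txt", pattern = "\\.txt$"))

# Normalization, punctuation removal and stop word removal, as listed above
novels <- tm_map(novels, content_transformer(tolower))
novels <- tm_map(novels, removePunctuation)
novels <- tm_map(novels, removeWords, stopwords("english"))
novels <- tm_map(novels, stripWhitespace)  # tidy up the leftover whitespace
```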
Our experiments were aimed at assessing the relationship between text length and the number of topics, by exploring the variation of the best number of topics (suggested by the method described in Sect. 2) in relation to the length of the text chunks. Three different controlled experiments were designed, in which particular samples of text chunks were assessed independently (one by one) to identify the best number of topics. The following procedure was applied (Sect. 4). Also, it is worth mentioning that for each sample experiment, a computer with 62 GB of RAM took on average 4 h of processing, and all the experiments took 176 h in total.

4 Results of the experiments

Our first experiment was conducted on the entire corpus described above. In a first stage, for exploratory purposes and to determine the following phases, we decided to split the novels into text chunks of 500, 1000, 2000, 5000, 10,000, 20,000 and 50,000 words. For each sample size, we calculated the log-likelihood for all the possible models in the interval 2–100 to identify the best number of topics to fit the model (Table 2).
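A minimal sketch of the chunking step, assuming the pre-processed novels corpus from Sect. 3; chunk_tokens is a hypothetical helper, and 1000 words is just one of the sizes listed above.

```r
library(tm)

# Hypothetical helper: cut a vector of word tokens into consecutive
# chunks of at most 'chunk_size' tokens each
chunk_tokens <- function(tokens, chunk_size) {
  split(tokens, ceiling(seq_along(tokens) / chunk_size))
}

# Tokenize each pre-processed novel, cut it into 1000-word chunks,
# and paste every chunk back into a single string
chunks <- unlist(lapply(novels, function(doc) {
  tokens <- scan(text = as.character(doc), what = character(), quiet = TRUE)
  vapply(chunk_tokens(tokens, 1000), paste, character(1), collapse = " ")
}))

# Document-term matrix over the chunks: the input for the grid search of Sect. 2
dtm <- DocumentTermMatrix(VCorpus(VectorSource(chunks)))
```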
The results of the first experiment provided a first clue. The best number of topics suggested by the log-likelihood method for all the possible models in the interval stays stable for a while and then decreases for text chunks of lengths between 20,000 and 50,000 words. In general, this empirical evidence is most visible when the length of text chunks exceeds 10,000 words. A general observation can be formulated: there exists a dependence between the predefined text chunk size and the optimal number of topics. More specifically, it seems that the inferred optimal number of topics decreases when the size of text chunks increases. On theoretical grounds, one could expect such a result. We were interested, however, in the extent to which the observed phenomenon was systematic.
To scrutinize the above intuition, aware that the experiments would take a long time, we ran a second experiment using a sub-corpus composed of the first 10 novels of the corpus (see Table 1). This time, we aimed at testing a fine-grained range of shorter and longer samples in their relation to the optimal number of topics. The ten novels were split

1  https://github.com/computationalstylistics/100_english_novels.


Table 1  The 100 novels included in the corpus


No. Label (author_novel_year) N V No. Label (author_novel_year) N V

1 Anon_Clara_1864 15,430 199,308 51 Gissing_Warburton_1903 7884 86,961


2 Barclay_Ladies_1917 9050 122,832 52 Gissing_Women_1893 9819 141,050
3 Barclay_Postern_1911 5070 40,372 53 Haggard_Mines_1885 7940 83,287
4 Barclay_Rosary_1909 9039 106,603 54 Haggard_She_1887 10,313 114,381
5 Bennet_Babylon_1902 7330 69,497 55 Haggard_Sheallan_1921 8586 121,772
6 Bennet_Helen_1910 6850 53,665 56 Hardy_Jude_1895 12,126 149,528
7 Bennet_Imperial_1930 16,707 262,014 57 Hardy_Madding_1874 13,523 140,585
8 Blackmore_Erema_1877 11,655 167,641 58 Hardy_Tess_1891 13,628 153,004
9 Blackmore_Lorna_1869 15,060 273,939 59 James_Ambassadors_1903 9,785 171,431
10 Blackmore_Springhaven_1887 14,159 203,133 60 James_Roderick_1875 10,458 134,074
11 Braddon_Audley_1862 11,390 150,691 61 James_Tragic_1890 12,330 216,152
12 Braddon_Phantom_1883 13,930 182,599 62 Kipling_Captains_1896 7983 55,524
13 Braddon_Quest_1871 10,820 176,619 63 Kipling_Kim_1901 11,313 108,431
14 Burnett_Garden_1911 5469 82,757 64 Kipling_Light_1891 8132 74,746
15 Burnett_Lord_1886 5055 59,602 65 Lawrence_Peacock_1911 11,149 126,720
16 Burnett_Princess_1905 5511 67,687 66 Lawrence_Serpent_1926 12,435 173,610
17 Cbronte_Jane_1847 14,357 190,035 67 Lawrence_Women_1920 13,010 185,898
18 Cbronte_Shirley_1849 16,958 219,907 68 Lee_Albany_1884 8503 63,290
19 Cbronte_Villette_1853 16,816 197,312 69 Lee_Brown_1884 6147 48,642
20 Chesterton_Innocence_1911 8742 80,153 70 Lee_Penelope_1903 3872 21,948
21 Chesterton_Napoleon_1904 7341 55,397 71 Lytton_Kenelm_1873 14,717 194,284
22 Chesterton_Thursday_1908 6956 58,638 72 Lytton_Novel_1853 22,775 460,336
23 Conrad_Almayer_1895 6911 63,718 73 Lytton_What_1858 22,014 341,064
24 Conrad_Nostromo_1904 13,722 172,943 74 Meredith_Feverel_1859 14,674 171,971
25 Conrad_Rover_1923 7902 88,570 75 Meredith_Marriage_1895 15,787 158,284
26 Corelli_Innocent_1914 12,555 126,676 76 Meredith_Richmond_1871 17,220 217,725
27 Corelli_Romance_1886 9992 100,904 77 Morris_Roots_1890 8109 154,638
28 Corelli_Satan_1895 13,707 170,015 78 Morris_Water_1897 7408 148,018
29 Dickens_Bleak_1853 17,843 362,076 79 Morris_Wood_1894 4276 50,103
30 Dickens_Expectations_1861 12,416 188,928 80 Schreiner_Farm_1883 8008 101,616
31 Dickens_Oliver_1839 11,968 161,805 81 Schreiner_Trooper_1897 3108 25,168
32 Doyle_Hound_1902 6367 59,995 82 Schreiner_Undine_1929 7464 91,522
33 Doyle_Lost_1912 8694 76,754 83 Stevenson_Arrow_1888 7784 80,254
34 Doyle_Micah_1889 15,245 178,973 84 Stevenson_Catriona_1893 8354 102,009
35 Eliot_Adam_1859 13,188 222,549 85 Stevenson_Treasure_1883 6734 69,830
36 Eliot_Daniel_1876 16,856 313,968 86 Thackeray_Esmond_1852 13,049 189,059
37 Eliot_Felix_1866 14,094 185,423 87 Thackeray_Pendennis_1850 19,927 365,389
38 Ford_Girl_1907 5016 36,269 88 Thackeray_Virginians_1859 19,339 360,914
39 Ford_Post_1926 9568 71,336 89 Trollope_Angel_1881 9784 221,650
40 Ford_Soldier_1915 7513 77,373 90 Trollope_Phineas_1869 10,627 268,963
41 Forster_Angels_1905 6080 51,054 91 Trollope_Warden_1855 7415 72,823


Table 1  (continued)
No. Label (author_novel_year) N V No. Label (author_novel_year) N V

42 Forster_Howards_1910 10,302 112,992 92 Ward_Ashe_1905 13,794 144,276


43 Forster_View_1908 7568 68,840 93 Ward_Harvest_1920 9179 76,686
44 Galsworthy_Man_1906 10,337 112,150 94 Ward_Milly_1881 3947 48,821
45 Galsworthy_River_1933 8649 89,676 95 Wcollins_Basil_1852 8981 119,034
46 Galsworthy_Saints_1919 8572 97,218 96 Wcollins_Legacy_1889 7684 121,894
47 Gaskell_Lovers_1863 12,681 195,330 97 Wcollins_Woman_1860 11,675 249,132
48 Gaskell_Ruth_1855 10,639 163,599 98 Woolf_Lighthouse_1927 7292 70,505
49 Gaskell_Wives_1865 12,913 275,618 99 Woolf_Night_1919 11,449 169,689
50 Gissing_Unclassed_1884 9433 126,981 100 Woolf_Years_1937 9479 131,207

Table 2  Results of experiment No. 1. Corpus: 100 English novels

Length of text chunks (No. of words)   No. of text chunks   Best number of topics
500                                    29,008               95
1000                                   14,479               95
2000                                   7215                 95
5000                                   2857                 92
10,000                                 1407                 86
20,000                                 678                  67
50,000                                 254                  36

Table 3  Results of experiment No. 2. Corpus: 10 English novels

Length of text chunks (No. of words)   No. of text chunks   Best number of topics
100                                    15,217               63
500                                    3040                 56
1000                                   1518                 42
2000                                   756                  32
3000                                   503                  26
4000                                   376                  25
5000                                   299                  24
6000                                   249                  21
7000                                   212                  21
8000                                   185                  19
9000                                   163                  20
10,000                                 147                  19
15,000                                 96                   16
20,000                                 72                   19
50,000                                 29                   19


Fig. 1  Log-likelihood for different sample sizes (length of text chunks: 500, 1000, 5000, 10,000, 20,000 and 50,000 words)

into fifteen samples of text chunks of 100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10,000, 15,000, 20,000 and 50,000 words (Table 3).
The results of the second experiment confirm the previous intuition: it is clear that the
number of topics decreases when the length of text chunks increases. Moreover, it is clear
that at a certain point—in this particular corpus it falls between the text chunks size of
8000 and 10,000 words—the best number of topics stabilizes. Presumably, the point of
stabilization is the very value we want to identify, since it shows that the model reaches its
saturation. In the case of our 10 novels, topic model estimation should involve splitting the
texts into portions of about 10,000 words, followed by setting the k-parameter (‘How many
topics’) to about 20. As one can see from the representation of some sample sizes (Fig. 1), as the length of text chunks increases, the best number of topics decreases: the best log-likelihood estimate for each model corresponds to the maximum value reached by the curves.
The observed phenomenon, that is, the dependence between the chunk size and the optimal number of topics, is even more evident when the particular results are represented side by side in one plot. To this end, we normalized (scaled) all the log-likelihood values to the range [0, 1] and show them in Fig. 2. As one can see, the optimal number of topics across the particular chunk sizes tends to stabilize for chunks of, roughly, 5000–10,000 words in a sample.
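The normalization used for Fig. 2 is a simple min-max rescaling; a sketch of the idea, with loglik_by_size as an assumed list holding one log-likelihood curve per chunk size, is:

```r
# Min-max scale each log-likelihood curve to [0, 1] so that curves obtained
# for different chunk sizes can be plotted side by side, as in Fig. 2
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# 'loglik_by_size' is assumed to be a named list with one numeric vector of
# log-likelihood values (over the candidate k) per chunk size
scaled <- lapply(loglik_by_size, rescale01)
```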
Finally, a third experiment aimed at validating the previous results was performed. In fact, given that a corpus differs in the number of text chunks depending on their length (a few long chunks vs. many short chunks), we split the ten novels into ten samples of text chunks of 10, 25, 50, 100, 200, 300, 400, 500, 1000 and 2000 words, and we randomly extracted 100 text chunks from each sample. In this case, the log-likelihood for all the possible models in the interval 2–100 was applied to the same number of text chunks of different lengths (Table 4).


Fig. 2  Log-likelihood for all the analyzed sample sizes shown side by side

Table 4  Results of experiment No. 3

Length of text chunks (No. of words)   No. of text chunks   Best number of topics
10                                     100                  94
25                                     100                  55
50                                     100                  33
100                                    100                  28
200                                    100                  20
300                                    100                  16
400                                    100                  17
500                                    100                  15
1000                                   100                  16
2000                                   100                  16

The results of the third experiment confirm the previous intuition: even with the same number of text chunks, the number of topics decreases as the length of the text chunks increases, until the optimal number of topics stabilizes.

5 Discussion and conclusion

This article explored the problem of applying topic modeling to long texts. We ran three experiments on samples of text chunks of different lengths. The length of the text chunks is a variable that explains changes in the best number of topics to detect in a corpus: at different chunk lengths, the log-likelihood computed for all the possible models pointed to different numbers of topics. Our three experiments, applied to a large corpus (100 English novels), then to a reduced corpus (10 English novels), and then, for the sake of validation, to an equal number of text chunks, exhibited a relation that resembles an inverse proportionality.


Fig. 3  Power regression of the experiment No. 2

And indeed, as Fig. 2 seems to show, we deal here with a very clear-cut power-law relation between the optimal number of topics and the analyzed sample size.
Following our results, we present the Sbalchiero-Eder rule: given a corpus, the best
number of topics is inversely proportional to the length of text chunks, i.e. the larger the
portions of text, the lower the best number of topics. The relation is not linear and a pos-
sible way to represent it is through
y = f (x)

where y is the best number of topics; x is the size in word tokens of the text chunks; and f
is a steep decreasing function that might be expressed in various ways, for example, by a
power or logarithmic model.
In the case of a power model,

y = a·x^(−b)

where a and b are parameters, a, b ∈ R.


For illustrative purposes, the power trend that best fits the data of experiment No. 2 shows the non-linear relation between the best number of topics and the length of text chunks, with a high R-squared value (Fig. 3).
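Since the fitting procedure behind Fig. 3 is not spelled out in the text, one simple way to reproduce a power trend from the Table 3 data is to exploit the fact that y = a·x^(−b) is linear on the log-log scale, log y = log a − b·log x, and to fit it by ordinary least squares; the sketch below rests on that assumption and is not necessarily the authors' exact method.

```r
# Data from experiment No. 2 (Table 3): chunk length vs. best number of topics
chunk_len <- c(100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000,
               8000, 9000, 10000, 15000, 20000, 50000)
best_k    <- c(63, 56, 42, 32, 26, 25, 24, 21, 21, 19, 20, 19, 16, 19, 19)

# Power model fitted as a linear regression on the log-log scale
fit <- lm(log(best_k) ~ log(chunk_len))
a <- exp(coef(fit)[1])    # estimate of a
b <- -coef(fit)[2]        # estimate of b, sign flipped to match y = a * x^(-b)
summary(fit)$r.squared    # goodness of fit on the log-log scale

# Predicted best number of topics for, say, 12,000-word chunks
a * 12000^(-b)
```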
The exploratory nature of the research presented here opens, in our opinion, new horizons for comparative research, and further experiments would shed more light on the relationship between length and the best number of topics. The paper's contribution is twofold. First, the experiments made it possible to distil an empirical observation on the relation between the number of topics and the length of texts. Second, and worth further investigation, is the choice of the length of text chunks when analyzing long texts. A solution could be to evaluate the resulting curve through the elbow method (Kodinariya and Makwana 2013). More precisely, if one plots the percentage of variance explained by the length against the best number of topics, initially it will explain a considerable amount of variance, which then drops, giving an angle: this is the elbow criterion. For example, if we consider Fig. 3, the location of a ‘knee’ in the curve is taken as an indicator of the appropriate solution. In our case, the elbow criterion is met for text chunks of 10,000 words, which corresponds to 19 topics.
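One generic way to locate such a knee (not necessarily the criterion applied by the authors) is the geometric heuristic of picking the point farthest from the straight line joining the two endpoints of the rescaled curve; applied to the Table 3 data it flags the flattening region around 6000–8000-word chunks with about 19–21 topics, in the same neighbourhood as the 10,000-word, 19-topic solution discussed here.

```r
# Best number of topics from experiment No. 2 (Table 3)
chunk_len <- c(100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000,
               8000, 9000, 10000, 15000, 20000, 50000)
best_k    <- c(63, 56, 42, 32, 26, 25, 24, 21, 21, 19, 20, 19, 16, 19, 19)

# Rescale both axes to [0, 1] so that neither dominates the geometry
x <- (chunk_len - min(chunk_len)) / diff(range(chunk_len))
y <- (best_k - min(best_k)) / diff(range(best_k))

# Distance of every point from the line joining the first and last points;
# the farthest point is taken as the candidate 'knee' of the curve
x1 <- x[1]; y1 <- y[1]; x2 <- x[length(x)]; y2 <- y[length(y)]
d <- abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) /
  sqrt((y2 - y1)^2 + (x2 - x1)^2)

chunk_len[which.max(d)]   # chunk length at the candidate elbow
best_k[which.max(d)]      # corresponding best number of topics
```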


Table 5  Excerpt of ten most probable words for each topic (decreasing order of probability). Suggested
model solution based on text chunks of 10,000 words and 19 topics
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10

evelyn thing man shelfer prioress evelyn john man helen boy
hotel major thing dear antoni sir man good jame miss
miss good time isola mari violet thing time mrs martha
palac man lorna clara hand miss anni carn olleren- dear
shaw
violet hockin love don sister room good mind emanuel christobel
orcham lord great long mother henri great thing prockter professor
mrs miss mother thing ladi hotel doon great tea ann
sir time john heart thee palac knew long room blue
thought father doon hand door night hor make year hand
room dear make made knight ceria snow father hand eye
Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16 Topic 17 Topic 18 Topic 19

bishop love dolli thing racksol mother jane evelyn time


hugh father admir father princ john garth graci thing
knight time man man jule thee nur time lorna
love hand good miss nella thing hand room mother
mora eye young time babylon great face love great
hand lili time long tel time love don long
eye long captain great man boy thing thought answer
prioress man thing water eugen made rosemari smile john
heart life carn day aribert cri doctor door fear
lord heart great uncl theodor word miss girl peopl

The results of LDA applied to the corpus of 10 English novels we used for this experiment (Table 5) suggest that this can be an appropriate solution.
As can be seen from the optimal combination of input parameters (text chunks of 10,000 words and 19 topics), the inferred most probable words for each topic are semantically associated. Such a semantic relationship, understood as rough associations of words and not necessarily seen through the lenses of linguistic theories, is not only discovered by the unsupervised log-likelihood procedure, but is also graspable by human judgement. However, we do not show the numerous other topics inferred using different combinations of input parameters, because human judgement is subject to confirmation bias and to other forms of seeing what one expects to see in the data. To give an impression, however, of how difficult it is for the human eye to distinguish optimal from suboptimal topics, we show two examples in Tables 6 and 7. From a human judgement perspective, one could convincingly claim that there exists some redundancy of words between topics and, consequently, a lack of homogeneity in the topics' contents (Tables 6 and 7). To avoid any risk of bias caused by such naked-eye comparisons, in our study we rely exclusively on the log-likelihood values as indicators of topics' coherence.


Table 6  Excerpt of ten most probable words for each topic (decreasing order of probability). Model solu-
tion based on text chunks of 5000 words and 24 topics
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8

miss time miss prioress jane racksol helen evelyn


time light good antoni garth princ jame sir
hand long mrs mari hand tel mrs henri
mother eye time mother nur jule prockter palac
room love thing sister love babylon emanuel hotel
eye snow eye reverend face nella ollerenshaw man
door thing shelfer hand doctor miss tea thought
father back mind ladi thing room room graci
mrs hand hand door dear rocco don good
dear heart put cell rosemari man hous savott
Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16

boy evelyn evelyn thing racksol evelyn man princ


miss sir graci man man Graci good aribert
dear hotel hotel chapter hazel Room carn eugen
professor palac love great night thought time racksol
martha denni car firm eugen Tabl long man
day henri time time miss don young million
christobel father day water princ girl make nella
love savott night uncl jule night father smiss
ann sharehold thought good thing smile thing high
charteri imperi girl sam spencer sir made room
Topic 17 Topic 18 Topic 19 Topic 20 Topic 21 Topic 22 Topic 23 Topic 24

farmer bishop violet thing man john man father


thee hugh miss major good lorna good love
man knight evelyn man thing mother time lili
huxtabl love room hockin time man thing time
love hand sir good men good admir clara
time mora cousin time john time make long
long eye hotel father lorna thing captain mother
mother prioress ceria miss doon anni great eye
poor heart thought dear great great young day
great day palac lord made love long made

Intuitively, choosing a combination of excessively small text chunks with a large number of topics (Table 6: text chunks of 5000 words and 24 topics) could generate minimal topics that are specific and redundant, while choosing an excessively small number of topics based on larger text chunks (Table 7: text chunks of 20,000 words and 19 topics) will give rise to topics that are too broad and heterogeneous. Consequently, either way, they will be difficult to interpret (Sbalchiero 2018). Our log-likelihood values seem to confirm the above intuition.


Table 7  Excerpt of ten most probable words for each topic (decreasing order of probability). Model solu-
tion based on text chunks of 20,000 words and 19 topics
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10

bishop boy chapter hand john lorna jane chapter miss boy
knight miss boy window good good man man time day
prioress dear water light man time lorna time good time
hand ann john room thing john thing eye mrs hand
mari mother great good time man good long shelfer aunt
antoni love time door anni mother time young father blue
hugh good good eye great thing garth make love back
eye man made time mother doon love thing long call
love eye make heart make long dal hand eye father
sister day back thought made great made good day made
Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16 Topic 17 Topic 18 Topic 19

evelyn violet good garth evelyn chapter racksol thing man


graci evelyn dolli nur jame time princ miss carn
jane sir duchess rosemari helen father babylon major good
doctor miss dear love sir man eugen man admir
hand room thing face henri made miss father young
love palac man time girl thing aribert good time
face ceria ladi jane hotel good nella time dolli
time hotel faith hand thought great tel lord captain
room floor made room man long spencer hockin great
man henri jane thought don day jule castlewood thing

Finally, the results should be verified across literary genres: it is unlikely that scientific papers show the same behavior as literary novels, because they may be denser in content, and therefore the portions of text in particular chunks might need to be shorter. In any case, on the basis of our results, we suggest evaluating the best number of topics for different samples of text chunk lengths and examining the resulting curve, for example with the elbow method. As it encourages further studies and investigations, the findings of the present experimental research add complexity to the debate on topic modeling, the analysis of long texts and the sensitive parameters used to identify the best number of topics, enriching it with new ideas and issues that characterize the current directions of the debate on topic detection algorithms.

Acknowledgements The study was conducted at the intersection of two research projects: M.E. was funded
by the Polish National Science Center (SONATA-BIS 2017/26/E/HS2/01019), whereas S.S. was supported
by the COST Action "Distant Reading for European Literary History" (CA16204). We are grateful to prof.
Arjuna Tuzzi (University of Padova, Italy) for the inspiring discussions and her valuable suggestions.

Compliance with ethical standards


Conflict of interest The authors declare that they have no conflict of interest.


References
Arun, R., Suresh, V., Veni Madhavan, C.E., Narasimha Murthy, M.N.: On finding the natural number of
topics with latent Dirichlet allocation some observations. In: Zaki, M.J., Yu, J.X., Ravindran, B.,
Pudi, V. (eds.) Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer
Science, pp. 391–402. Springer, Berlin (2010)
Blei, D.M, Lafferty, J.D.: Dynamic topic models. In: Proceedings of the 23rd International Conference
on Machine Learning, pp. 113–120 (2006)
Blei, D., Lafferty, J.: A correlated topic model of Science. Ann. Appl. Stat. 1(1), 17–35 (2007)
Blei, D.M., Lafferty, J.D.: Topic Models. In: Srivastava, A., Sahami, M. (eds.) Text Mining: Classifica-
tion, Clustering, and Applications, pp. 71–93. Chapman & Hall/CRC Press, Cambridge (2009)
Blei, D.M., Ng, A., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Cao, J., Xia, T., Li, J., Zhang, Y., Tang, S.: A density-based method for adaptive LDA model selection.
Neurocomputing 72(7–9), 1775–1781 (2009)
Deveaud, R., SanJuan, É., Bellot, P.: Accurate and effective latent concept modeling for ad hoc informa-
tion retrieval. Document numérique 17(1), 61–84 (2014)
Feinerer, I., Hornik, K., Meyer, D.: Text mining infrastructure in R. J. Stat. Softw. 25(5), 1–54 (2008)
Giordan, G., Saint-Blancat, C., Sbalchiero, S.: Exploring the history of american sociology through
topic modeling. In: Tuzzi, A. (ed.) Tracing the Life-Course of Ideas in the Humanities and Social
Sciences, pp. 45–64. Springer, Berlin (2018)
Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences of
the United States of America (PNAS) 101(Supplement 1), 5228–5235 (2004)
Grün, B., Hornik, K.: Topicmodels: an R package for fitting topic models. J. Stat. Softw. 40(13), 1–30
(2011)
Hall, D., Jurafsky, D., Manning, C.D.: Studying the history of ideas using topic models. In: EMNLP ‘08
Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 363–
371 (2008)
Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the SIGKDD
Workshop on SMA, pp. 80–88 (2010)
Jockers, M.L., Mimno, D.: Significant themes in 19th-century literature. Poetics 41(6), 750–769 (2013)
Kodinariya, T.M., Makwana, P.R.: Review on determining number of cluster in k-means clustering.
International Journal of Advance Research in Computer Science and Management Studies 1(6),
90–95 (2013)
Köhler, R., Galle, M.: Dynamic aspects of text characteristics. In: Hrebícek, L., Altmann, G. (eds.)
Quantitative Text Analysis, pp. 46–53. Wissenschaftlicher, Trier (1993)
Lebart, L., Salem, A., Berry, L.: Exploring textual data. Kluwer Academic Publishers, Dordrecht (1998)
Li, W., McCallum, A.: Pachinko allocation: DAG-structured mixture models of topic correlations. In:
Proceedings of the 23rd International Conference on Machine Learning, pp. 577–584 (2006)
Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G.,
Reber, U., Häussler, T., Schmid-Petri, H., Adam, S.: Applying LDA topic modeling in communica-
tion research: toward a valid and reliable methodology. Commun. Methods Meas. 12(2–3), 93–118
(2018)
Michel, J.-B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Pickett, J.P., Hoiberg, D., Clancy, D.,
Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative analysis of culture using
millions of digitized books. Science 331(6014), 176–182 (2011)
Popescu, I., Macutek, J., Altmann, G.: Aspects of Word Frequencies. Studies in Quantitative Linguistics.
RAM Verlag, Ludenscheid (2009)
Puschmann, C., Scheffler, T.: Topic modeling for media and communication research: a short primer.
HIIG Discussion Paper Series No. 2016-05. Available at SSRN: https://doi.org/10.2139/ssrn.2836478 (2016)
R Development Core Team: R: a language and environment for statistical computing [software]. R
Foundation for Statistical Computing. Retrieved from http://www.r-project.org. Accessed Jan 2020 (2016)
Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents.
In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494
(2004)
Savoy, J.: Authorship attribution based on a probabilistic topic model. Inf. Process. Manag. 49, 341–354
(2013)


Sbalchiero, S.: Finding topics: a statistical model and a quali-quantitative method. In: Tuzzi, A. (ed.)
Tracing the Life-Course of Ideas in the Humanities and Social Sciences, pp. 189–210. Springer,
Berlin (2018)
Sbalchiero, S., Tuzzi, A.: What’s old and new? Discovering Topics in the American Journal of Sociol-
ogy. In: Iezzi, D.F., Celdardo, L., Misuraca, M. (eds.) Proceedings of 14th International Conference
on Statistical Analysis of Textual Data, pp. 724–732. UniversItalia Editore, Rome (2018)
Tong, Z., Zhang, H.: A text mining research based on LDA topic modelling. In: Jordery School of Computer
Science, pp. 201–210 (2016)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
