Topic modeling, long texts and the best number of topics. Some problems and solutions
Stefano Sbalchiero · Maciej Eder
https://doi.org/10.1007/s11135-020-00976-w
Abstract
The main aim of this article is to present the results of different experiments focused on the
model fitting process in topic modeling and its accuracy when applied to long texts. The
digital era has made available both enormous quantities of textual data and technological
advances that have facilitated the development of techniques to automate the data coding
and analysis processes. In the ambit of topic modeling, different procedures have been
developed to analyze larger and larger collections of texts, namely corpora, but this has
posed, and continues to pose, a series of methodological questions that urgently need to be
resolved. Through a series of different experiments, this article therefore builds on the
following consideration: in Latent Dirichlet Allocation (LDA), a generative probabilistic
model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava,
Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC
Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS),
101(Supplement 1):5228–5235, 2004), the problem of model fitting is crucial because the
LDA algorithm demands that the number of topics be specified a priori. Needless to say, the
number of topics to detect in a corpus is a parameter which substantially affects the analysis
results. Since there is a lack of experiments applied to long texts, our article tries to shed
new light on the complex relationship between text length and the optimal number of topics.
In the conclusions, we present a clear-cut power-law relation between the optimal number of
topics and the analyzed sample size, and we formulate it in the form of a mathematical model.
Keywords Topic modeling · Latent Dirichlet Allocation · Long texts · Log-likelihood for
the model · Best number of topics
* Stefano Sbalchiero
[email protected]
1 Department of Philosophy, Sociology, Education and Applied Psychology (FISPPA) - Section of Sociology, University of Padova, Padova, Italy
2 Polish Academy of Sciences and Pedagogical University of Kraków, Kraków, Poland
1 Introduction
Among the plethora of theories and methods in the area of text analysis—including text
classification, clustering, information retrieval, semantic analysis, and so forth—methods
of discovering latent thematic structure in large collections of texts are attracting a good
share of attention not only among scholars, but also a broader audience of text analysis
practitioners. Perhaps the best known technique of this sort is topic modeling, usually com-
puted using Latent Dirichlet Allocation (LDA). In recent years topic modeling has gained
ground in influential text mining research agendas, and today LDA is one of the most
widespread topic modeling algorithms, widely considered a milestone in the panorama of
topic models (Sbalchiero 2018). In the last few years, numerous probabilistic models based
on LDA have been proposed, such as the author-topic model (Rosen-Zvi et al. 2004),
the dynamic topic model (Blei and Lafferty 2006), the pachinko allocation model (Li and
McCallum 2006), and the correlated topic model (Blei and Lafferty 2007). We chose LDA
on the one hand because it is by far the most popular (in terms of its applications,
implementations, and research studies) and the most tested; on the other hand, because the
subsequent extensions likewise require that the number of topics to detect in a corpus be
identified a priori. This a priori choice is the main object of the present paper.
The basic idea behind this algorithm
is that topics emerge directly from data in a generative probabilistic model applied to a col-
lection of texts (Blei and Lafferty 2009). Given this premise, we can infer that a corpus is
represented by a set of “latent topics” and this means that “one common way of modelling
the contributions of different topics to a document is to treat each topic as a probability
distribution over words, viewing a document as a probabilistic mixture of these topics”
(Griffiths and Steyvers 2004, p. 5228). In other words, since topics cannot be observed
directly in a corpus, while words and documents can, the algorithm infers the hidden text
structure from the generative process that produces the documents of the corpus,
identifying the relevance of topics to documents and of words to topics, where relevance
stands for probability distributions. As a result, each topic in the probabilistic
distribution of topics over documents is characterized by a set of probable terms (Blei et al.
2003). Many studies have demonstrated the usefulness of the LDA applied to both different
communicative contexts and short texts (Hall et al. 2008; Hong and Davison 2010; Tong
and Zhang 2016; Puschmann and Scheffler 2016; Maier et al. 2018; Giordan et al. 2018;
Sbalchiero and Tuzzi 2018). Even if it has been demonstrated that topic modeling performs
quite well with short texts (Sbalchiero 2018), there is a lack of empirical evidence when it
is applied to long texts. It is true that some studies suggest a simple rule of thumb—divid-
ing the input texts into 1000-word chunks (Jockers and Mimno 2013)—but no systematic
studies exist that would support such a claim. In fact, the above limitation of topic
modeling seems to be problematic in many other text analysis routines. While a short text
(e.g. an abstract of an article, a social media post, a journal article, a newspaper article,
a Wikipedia article, and so on) can provide concise information about its main contents,
text mining techniques applied to a long text have to deal with its lower content density
(Michel et al. 2011). In the case of topic modeling methods, it is hardly feasible to infer
a unique and coherent topic from a long text (a book), because it usually covers a wide
variety of topics.
The main aim of the present article, then, is to perform a systematic series of tests
with short and long text samples, and to identify a relationship between the texts’
length and the optimal number of topics using a measure of topics’ coherence. The
paper is structured as follows: the second section focuses on the model fitting process.
The choice of the number of topics to consider in a corpus is a sensitive parameter
because the validity of the results depends on the capacity of the model to identify an
adequate number of topics, and the LDA algorithm demands that the number of topics
is specified a priori. The third section illustrates the corpus and data pre-processing, an
important step to make the subsequent experiments comparable in terms of text chunks
of similar length. We go on to illustrate the empirical research based on demonstra-
tion and validation of the results. The validation of the results follows in three stages,
each of them representing an empirical experiment. In the fifth section, we discuss
our empirical findings and their theoretical implications. In the conclusions, the paper
depicts a clear-cut power-law relation between the optimal number of topics and the
analyzed sample size, which we formulate in the form of a mathematical model.
2 The model fitting process
Referring the reader to the literature for further details on the LDA algorithm and model
estimation (Hall et al. 2008; Blei and Lafferty 2009), it is worthwhile to dwell on some aspects
related to this contribution. In a bag-of-words perspective (Köhler and Galle 1993;
Lebart et al. 1998; Popescu et al. 2009), where neither the positions of words in a document
nor the order of documents in the corpus are taken into account, we can observe that
the algorithm considers documents as a mixture of topics. For example, the document
D1 may refer to Topic no. 1 (T1), following the per-document topic probabilities, but
also covers T2 and marginally T3; thus “each document Dj, for j = 1,…,n, is generated
based on a distribution over the k topics, where k defines the maximum number of top-
ics. This value is fixed and defined a priori. In the LDA model, each document in the
corpus may be generated according to all k topics with different intensities” (Savoy
2013, p. 347). Needless to say, the a priori number of topics to detect in a corpus is a
parameter which affects the results substantially. The k-parameter of the algorithm is
crucial mainly because the validity of the results depends on the model fitting process.
On theoretical grounds, an excessively small number of topics may generate broad and
heterogeneous topics; on the contrary, an excessively large k will produce overly specific
topics. In both cases, they will be difficult to interpret. Various ways to deal with
this issue have been proposed (Cao et al. 2009; Arun et al. 2010; Deveaud et al. 2014),
but one of the most appreciated for its simplicity is a solution based on a Bayesian
approach: it involves computing the log-likelihood for all the possible models in a
given interval (e.g. 2–N), using the Gibbs sampling algorithm to obtain samples from
the posterior distribution of topics within documents, in order to identify the best
number of topics to fit the model (Griffiths and Steyvers 2004). To assess the consequences
of different numbers of topics Tj, for j = 1,…,n, the authors suggest that one should “compute the pos-
terior probability of that set of models given the observed data […], the data are the
words in the corpus, w, and the model is specified by the number of topics, T, so we
wish to compute the likelihood P(w|T)” (Griffiths and Steyvers 2004, p. 5231). In other
words, this procedure makes it possible to evaluate the extent to which the latent topic
structure generates documents that reflect the observed data, that is, the collection of
documents. It is quite intuitive that P(w|T) increases to a peak and then decreases:
the maximum value suggests the best number of topics to fit the model.
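To make the procedure concrete, the following minimal sketch shows how the log-likelihood of all candidate models in a given interval can be computed in R with the topicmodels package (Grün and Hornik 2011). It is an illustration under stated assumptions, not our original script: dtm stands for a document-term matrix built from the corpus, and the Gibbs control settings are arbitrary examples.

library(topicmodels)

# fit one LDA model per candidate number of topics k with Gibbs sampling
# and record the log-likelihood of each fit
select_k <- function(dtm, k_range = 2:100, seed = 1234) {
  loglik <- sapply(k_range, function(k) {
    model <- LDA(dtm, k = k, method = "Gibbs",
                 control = list(seed = seed, burnin = 1000, iter = 1000))
    as.numeric(logLik(model))  # log P(w | T = k)
  })
  data.frame(k = k_range, loglik = loglik)
}

# the best number of topics is the k that maximizes the log-likelihood:
# res <- select_k(dtm)
# res$k[which.max(res$loglik)]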
3 Corpus and pre-processing
To evaluate the impact of different text lengths on the best number of topics T suggested by
the aforementioned method, we used a benchmark corpus of 100 English novels. This large
corpus covers the 19th and the beginning of the 20th century and contains novels by 33
authors (Table 1). The novels in the collection contain an average of 10,654 word types (V)
and 1,432,000 word tokens (N).
The corpus used in the experiments is available in a GitHub repository.1 The corpus has
been pre-processed with the dedicated R package tm (R Development Core Team 2016;
Feinerer et al. 2008). The pre-processing phase (Grün and Hornik 2011) covered: parsing
(words are sequences of letters isolated by means of separators), tokenization, normalization
by replacing uppercase with lowercase letters, punctuation removal, and stop word
removal.
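For illustration, the steps above can be reproduced in R with the tm package roughly as follows. This is a sketch, not our original script; the folder name is hypothetical and the options shown are common defaults.

library(tm)

corpus <- VCorpus(DirSource("100_english_novels", encoding = "UTF-8"))
corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase normalization
corpus <- tm_map(corpus, removePunctuation)             # punctuation removal
corpus <- tm_map(corpus, removeWords, stopwords("en"))  # stop word removal
corpus <- tm_map(corpus, stripWhitespace)

# bag-of-words representation of the pre-processed texts
dtm <- DocumentTermMatrix(corpus)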
4 Experiments
Our experiments were aimed at assessing the relationship between text length and the
number of topics, by exploring the variation of the best number of topics (suggested by the
method in section No. 2) in relation to the length of text chunks. Three different controlled
experiments were designed, in which particular samples of text chunks were assessed
independently (one by one) to identify the best number of topics, following the procedure
described in this section. It is also worth mentioning that each sample experiment took on
average 4 h of processing on a computer with 62 GB of RAM, and all the experiments took
176 h in total.
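A possible chunking helper in R is sketched below; it is an assumption for illustration, not our original code, and simply cuts a tokenized text into consecutive chunks of a fixed number of words, dropping the incomplete tail.

split_into_chunks <- function(words, chunk_size = 1000) {
  n_chunks <- floor(length(words) / chunk_size)  # the incomplete tail is dropped
  split(words[seq_len(n_chunks * chunk_size)],
        rep(seq_len(n_chunks), each = chunk_size))
}

# e.g. 500-word chunks from one tokenized novel:
# chunks <- split_into_chunks(tokens, chunk_size = 500)
# docs   <- sapply(chunks, paste, collapse = " ")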
Our first experiment was conducted on the entire above-described corpus. At a first stage,
for exploratory purposes and to determine the following phases, we decided to split the
novels into text chunks of 500, 1000, 2000, 5000, 10,000, 20,000, and 50,000 words. For
each sample size, we calculated the log-likelihood for all the possible models in the
interval 2–100 to identify the best number of topics to fit the model (Table 2).
The results of the first experiment provided a first clue. The best number of topics sug-
gested by the log-likelihood method for all the possible models in the interval stays stable
for some time and then decreases for text chunks of lengths between 20,000 and 50,000
words. In general, this empirical evidence is most visible when the length of text chunks
exceeds 10,000 words. A general observation can be formulated that there exists a depend-
ence between the predefined text chunk size and the optimal number of topics. More spe-
cifically, it seems that the inferred optimal number of topics decreases when the size of text
chunks increases. On theoretical grounds, one could expect such a result. We were
interested, however, in the extent to which the observed phenomenon was systematic.
To scrutinize the above intuition, aware that the experiments would take a long time,
we ran a second experiment using a sub-corpus composed of the first 10 novels of
the corpus (see Table 1). This time, we aimed at testing a fine-grained range of shorter and
longer samples in their relation to the optimal number of topics. The ten novels were split
1 https://github.com/computationalstylistics/100_english_novels.
Table 2 Results of experiment No. 1

Length of text chunks (No. of words)  No. of text chunks  Best number of topics
500      29,008  95
1000     14,479  95
2000     7215    95
5000     2857    92
10,000   1407    86
20,000   678     67
50,000   254     36
Table 3 Results of experiment No. 2

Length of text chunks (No. of words)  No. of text chunks  Best number of topics
100      15,217  63
500      3040    56
1000     1518    42
2000     756     32
3000     503     26
4000     376     25
5000     299     24
6000     249     21
7000     212     21
8000     185     19
9000     163     20
10,000   147     19
15,000   96      16
20,000   72      19
50,000   29      19
Fig. 1 Log-likelihood for different sample sizes (chunk lengths: 500, 1000, 5000, 10,000, 20,000, and 50,000 words)
into fifteen samples of text chunks of 100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000,
8000, 9000, 10,000, 15,000, 20,000, and 50,000 words (Table 3).
The results of the second experiment confirm the previous intuition: it is clear that the
number of topics decreases when the length of text chunks increases. Moreover, it is clear
that at a certain point—in this particular corpus it falls between text chunk sizes of
8000 and 10,000 words—the best number of topics stabilizes. Presumably, the point of
stabilization is the very value we want to identify, since it shows that the model reaches its
saturation. In the case of our 10 novels, topic model estimation should involve splitting the
texts into portions of about 10,000 words, followed by setting the k-parameter (‘How many
topics’) to about 20. As can be seen in the representation of some sample sizes (Fig. 1),
as the length of text chunks increases, the best number of topics decreases; the best
log-likelihood estimation for each model corresponds to the maximum value reached by
the curves.
The observed phenomenon—the dependence between the chunk size and the optimal number
of topics—is more evident when the particular results are represented side by side in one
plot. To this end, we normalized (scaled) all the log-likelihood values to the range [0, 1]
and show them in Fig. 2. As one can see, the optimal number of topics across the particular
chunk sizes tends to stabilize for chunks of, roughly, 5000–10,000 words in a sample.
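The scaling can be expressed in R as follows; this is a sketch, where loglik_by_size is a hypothetical list holding one log-likelihood curve (a numeric vector over k = 2, …, 100) per chunk size.

rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# scale every curve to [0, 1] before plotting them side by side:
# scaled <- lapply(loglik_by_size, rescale01)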
Finally, a third experiment, aimed at validating the previous results, was performed.
Given that a corpus differs in the number of text chunks depending on their lengths
(a few long chunks vs. many short chunks), we split the ten novels into ten samples of text
chunks of 10, 25, 50, 100, 200, 300, 400, 500, 1000, and 2000 words, and we randomly
extracted 100 text chunks from each sample. In this case, the log-likelihood for all the
possible models in the interval 2–100 was applied to the same number of
text chunks of different lengths (Table 4).
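The sampling design of this experiment can be sketched in R as follows, reusing the hypothetical chunking helper introduced above; names and seed are assumptions for illustration.

set.seed(42)  # reproducibility of the random extraction
sample_chunks <- function(chunks, n = 100) {
  chunks[sample(seq_along(chunks), size = n)]
}

# lengths <- c(10, 25, 50, 100, 200, 300, 400, 500, 1000, 2000)
# samples <- lapply(lengths, function(len)
#   sample_chunks(split_into_chunks(tokens, len), n = 100))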
Fig. 2 Log-likelihood for all the analyzed sample sizes shown side by side
Table 4 Results of experiment No. 3

Length of text chunks (No. of words)  No. of text chunks  Best number of topics
10     100  94
25     100  55
50     100  33
100    100  28
200    100  20
300    100  16
400    100  17
500    100  15
1000   100  16
2000   100  16
The results of the third experiment confirm the previous intuition: even with the same
number of text chunks considered, as the length of the text chunks increases, the number of
topics decreases, until the optimal number of topics stabilizes.
5 Discussion and conclusions
This article explored the problem of applying topic modeling to long texts. We ran three
experiments on samples of text chunks of different lengths. The length of text chunks
proved to be a variable that explains the change in the best number of topics to detect in
a corpus: our research indicated that for different text chunk lengths, the log-likelihood
for all the possible models suggested different numbers of topics. Our three experiments,
applied to a large corpus (100 English novels), then to a reduced corpus (10 English
novels), and then, for the sake of validation, to an equal number of text chunks, exhibited a relation
that resembles an inverse proportionality. And indeed, as Fig. 2 seems to show, we deal
here with a very clear-cut power-law relation between the optimal number of topics and the
analyzed sample size.
Following our results, we present the Sbalchiero-Eder rule: given a corpus, the best
number of topics is inversely related to the length of text chunks, i.e. the larger the
portions of text, the lower the best number of topics. The relation is not linear, and a
possible way to represent it is through
y = f(x)
where y is the best number of topics, x is the size in word tokens of the text chunks, and f
is a steeply decreasing function that might be expressed in various ways, for example by a
power or a logarithmic model.
In the case of the power model,
y = ax^(−b)
where a and b are positive constants to be estimated from the observed data.
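As an illustration of how such a power model could be estimated, the following R sketch fits a and b by ordinary least squares on the log-log scale, using a subset of the values reported in Table 3; it is an example of the estimation idea, not the procedure used to obtain the figures discussed below.

x <- c(100, 500, 1000, 2000, 5000, 10000)  # chunk sizes in word tokens (Table 3)
y <- c(63, 56, 42, 32, 24, 19)             # corresponding best numbers of topics

fit <- lm(log(y) ~ log(x))  # least squares on the log-log scale
a <- exp(coef(fit)[1])      # scale parameter of y = a * x^(-b)
b <- -coef(fit)[2]          # positive exponent

# predicted best number of topics for 10,000-word chunks:
# round(a * 10000^(-b))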
Table 5 Excerpt of ten most probable words for each topic (decreasing order of probability). Suggested model solution based on text chunks of 10,000 words and 19 topics

Topic 1   Topic 2  Topic 3  Topic 4  Topic 5   Topic 6  Topic 7  Topic 8  Topic 9      Topic 10
evelyn    thing    man      shelfer  prioress  evelyn   john     man      helen        boy
hotel     major    thing    dear     antoni    sir      man      good     jame         miss
miss      good     time     isola    mari      violet   thing    time     mrs          martha
palac     man      lorna    clara    hand      miss     anni     carn     ollerenshaw  dear
violet    hockin   love     don      sister    room     good     mind     emanuel      christobel
orcham    lord     great    long     mother    henri    great    thing    prockter     professor
mrs       miss     mother   thing    ladi      hotel    doon     great    tea          ann
sir       time     john     heart    thee      palac    knew     long     room         blue
thought   father   doon     hand     door      night    hor      make     year         hand
room      dear     make     made     knight    ceria    snow     father   hand         eye

Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16 Topic 17 Topic 18 Topic 19
For text chunks of 10,000 words, the suggested model corresponds to 19 topics. The results
of LDA applied to the corpus of 10 English novels used for this experiment (Table 5)
suggest that this is an appropriate solution.
As can be seen from the optimal combination of input parameters (text chunks
of 10,000 words and 19 topics), the inferred most probable words for each topic are
semantically associated. Such a semantic relationship—defined as rough associations
of words, not necessarily seen through the lenses of linguistic theories—is not only
discovered by the unsupervised log-likelihood procedure, but is also graspable for
human judgement. However, we do not show the numerous other topics inferred using
different combinations of input parameters, because human judgement is subject to
confirmation bias and to other forms of seeing what one expects to see in the data.
To give an impression, however, of how difficult it is for the human eye to distinguish
optimal and suboptimal topics, we show two examples in Tables 6 and 7. From a human
judgement perspective, one could convincingly claim that there exists some redundancy
of words between topics and, consequently, a lack of homogeneity in the topics’
contents (Tables 6 and 7). To avoid any risk of bias caused by such naked-eye
comparisons, in our study we rely exclusively on the log-likelihood values as indicators of
topics’ coherence.
Table 6 Excerpt of ten most probable words for each topic (decreasing order of probability). Model solution based on text chunks of 5000 words and 24 topics

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
Table 7 Excerpt of ten most probable words for each topic (decreasing order of probability). Model solution based on text chunks of 20,000 words and 19 topics

Topic 1   Topic 2  Topic 3  Topic 4  Topic 5  Topic 6  Topic 7  Topic 8  Topic 9  Topic 10
bishop    boy      chapter  hand     john     lorna    jane     chapter  miss     boy
knight    miss     boy      window   good     good     man      man      time     day
prioress  dear     water    light    man      time     lorna    time     good     time
hand      ann      john     room     thing    john     thing    eye      mrs      hand
mari      mother   great    good     time     man      good     long     shelfer  aunt
antoni    love     time     door     anni     mother   time     young    father   blue
hugh      good     good     eye      great    thing    garth    make     love     back
eye       man      made     time     mother   doon     love     thing    long     call
love      eye      make     heart    make     long     dal      hand     eye      father
sister    day      back     thought  made     great    made     good     day      made

Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16 Topic 17 Topic 18 Topic 19
Finally, our results should be verified across literary genres: it is hardly likely that
scientific papers show the same behavior as literary novels, because they could be denser
in content, and therefore the portions of text in particular chunks might need to be
shorter. In any case, on the basis of our results, what we suggest is to evaluate the best
number of topics for different samples of text chunk lengths and to inspect the resulting
curve, for example with the elbow method (a possible implementation is sketched at the end
of this section). As it encourages further studies and investigations, the findings of the
present experimental research add complexity to the debate on topic modeling, the analysis
of long texts, and the sensitive parameters used to identify the best number of topics,
enriching it with new ideas and issues that characterize the directions of the contemporary
debate on topic detection algorithms.
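A possible way to locate such an elbow programmatically is sketched below in R. The heuristic (the point farthest from the chord joining the curve’s endpoints) is our illustrative assumption, not a procedure used in the experiments above.

find_elbow <- function(x, y) {
  # rescale both axes to [0, 1] so that neither dominates the distance
  xs <- (x - min(x)) / (max(x) - min(x))
  ys <- (y - min(y)) / (max(y) - min(y))
  chord <- c(xs[length(xs)] - xs[1], ys[length(ys)] - ys[1])
  chord <- chord / sqrt(sum(chord^2))
  # perpendicular distance of each point from the chord
  d <- sapply(seq_along(xs), function(i) {
    v <- c(xs[i] - xs[1], ys[i] - ys[1])
    abs(v[1] * chord[2] - v[2] * chord[1])
  })
  which.max(d)
}

# e.g. with the values of experiment No. 2 (Table 3):
# lengths <- c(100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000)
# topics  <- c(63, 56, 42, 32, 26, 25, 24, 21, 21, 19)
# lengths[find_elbow(lengths, topics)]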
Acknowledgements The study was conducted at the intersection of two research projects: M.E. was funded
by the Polish National Science Center (SONATA-BIS 2017/26/E/HS2/01019), whereas S.S. was supported
by the COST Action "Distant Reading for European Literary History" (CA16204). We are grateful to
Prof. Arjuna Tuzzi (University of Padova, Italy) for the inspiring discussions and her valuable suggestions.
References
Arun, R., Suresh, V., Veni Madhavan, C.E., Narasimha Murthy, M.N.: On finding the natural number of
topics with latent Dirichlet allocation some observations. In: Zaki, M.J., Yu, J.X., Ravindran, B.,
Pudi, V. (eds.) Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer
Science, pp. 391–402. Springer, Berlin (2010)
Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: Proceedings of the 23rd International Conference
on Machine Learning, pp. 113–120 (2006)
Blei, D.M., Lafferty, J.D.: A correlated topic model of Science. Ann. Appl. Stat. 1(1), 17–35 (2007)
Blei, D.M., Lafferty, J.D.: Topic Models. In: Srivastava, A., Sahami, M. (eds.) Text Mining: Classifica-
tion, Clustering, and Applications, pp. 71–93. Chapman & Hall/CRC Press, Cambridge (2009)
Blei, D.M., Ng, A., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Cao, J., Xia, T., Li, J., Zhang, Y., Tang, S.: A density-based method for adaptive LDA model selection.
Neurocomputing 72(7–9), 1775–1781 (2009)
Deveaud, R., SanJuan, É., Bellot, P.: Accurate and effective latent concept modeling for ad hoc informa-
tion retrieval. Document numérique 17(1), 61–84 (2014)
Feinerer, I., Hornik, K., Meyer, D.: Text mining infrastructure in R. J. Stat. Softw. 25(5), 1–54 (2008)
Giordan, G., Saint-Blancat, C., Sbalchiero, S.: Exploring the history of American sociology through
topic modeling. In: Tuzzi, A. (ed.) Tracing the Life-Course of Ideas in the Humanities and Social
Sciences, pp. 45–64. Springer, Berlin (2018)
Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences of
the United States of America (PNAS) 101(Supplement 1), 5228–5235 (2004)
Grün, B., Hornik, K.: Topicmodels: an R package for fitting topic models. J. Stat. Softw. 40(13), 1–30
(2011)
Hall, D., Jurafsky, D., Manning, C.D.: Studying the history of ideas using topic models. In: EMNLP ‘08
Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 363–
371 (2008)
Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the SIGKDD
Workshop on SMA, pp. 80–88 (2010)
Jockers, M.L., Mimno, D.: Significant themes in 19th-century literature. Poetics 41(6), 750–769 (2013)
Kodinariya, T.M., Makwana, P.R.: Review on determining number of cluster in k-means clustering.
International Journal of Advance Research in Computer Science and Management Studies 1(6),
90–95 (2013)
Köhler, R., Galle, M.: Dynamic aspects of text characteristics. In: Hrebícek, L., Altmann, G. (eds.)
Quantitative Text Analysis, pp. 46–53. Wissenschaftlicher, Trier (1993)
Lebart, L., Salem, A., Berry, L.: Exploring textual data. Kluwer Academic Publishers, Dordrecht (1998)
Li, W., McCallum, A.: Pachinko allocation: DAG-structured mixture models of topic correlations. In:
Proceedings of the 23rd International Conference on Machine Learning, pp. 577–584 (2006)
Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G.,
Reber, U., Häussler, T., Schmid-Petri, H., Adam, S.: Applying LDA topic modeling in communica-
tion research: toward a valid and reliable methodology. Commun. Methods Meas. 12(2–3), 93–118
(2018)
Michel, J.-B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Pickett, J.P., Hoiberg, D., Clancy, D.,
Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative analysis of culture using
millions of digitized books. Science 331(6014), 176–182 (2011)
Popescu, I., Macutek, J., Altmann, G.: Aspects of Word Frequencies. Studies in Quantitative Linguistics.
RAM Verlag, Ludenscheid (2009)
Puschmann, C., Scheffler, T.: Topic modeling for media and communication research: a short primer.
HIIG Discussion Paper Series No. 2016-05. Available at SSRN: https://doi.org/10.2139/ssrn.2836478 (2016)
R Development Core Team: R: a language and environment for statistical computing [software]. R
foundation for statistical computing. Retrieved from http://www.r-project.org. Accessed Jan 2020
(2016)
Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents.
In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494
(2004)
Savoy, J.: Authorship attribution based on a probabilistic topic model. Inf. Process. Manag. 49, 341–354
(2013)
Sbalchiero, S.: Finding topics: a statistical model and a quali-quantitative method. In: Tuzzi, A. (ed.)
Tracing the Life-Course of Ideas in the Humanities and Social Sciences, pp. 189–210. Springer,
Berlin (2018)
Sbalchiero, S., Tuzzi, A.: What’s old and new? Discovering topics in the American Journal of Sociology.
In: Iezzi, D.F., Celardo, L., Misuraca, M. (eds.) Proceedings of 14th International Conference
on Statistical Analysis of Textual Data, pp. 724–732. UniversItalia Editore, Rome (2018)
Tong, Z., Zhang, H.: A text mining research based on LDA topic modelling. In: Jordery School of Computer
Science, pp. 201–210 (2016)