2022 IEEE 26th International Conference on Intelligent Engineering Systems (INES) | 978-1-6654-9209-6/22/$31.00 ©2022 IEEE | DOI: 10.1109/INES56734.2022.9922609

Abstract—Topic modeling is widely used to obtain the most visible topics from a given text corpus. In this work, a demonstration of modeling the most discussed topics is presented on articles from the Reuters news website. These articles are collected and subsequently processed with the Latent Dirichlet Allocation (LDA) unsupervised learning algorithm. The main goal is to build the model(s) that most accurately produce the most discussed topics. Such model(s) can be used in real life to instantly obtain information about current news, to classify documents in a given dataset, and to extract the dominant topics with their keywords. This helps, for example, to build correlations with user preferences and to recommend interesting content. There are works that use different models to evaluate texts and obtain statistics about them, such as the most popular opinions of people on some question, or the popular and dominating subtopics of a specific topic dataset (e.g., medicine articles). As a result of this work, we were able to create a generic LDA model trained on Wikipedia articles. The model successfully analyzed Reuters articles and extracted their topics as keyword sets. These can then be used to recommend content that is interesting to the target user, for example, based on the recommended content tags.

Index Terms—Topic Modeling, Latent Dirichlet Allocation, Reuters Articles, Wikipedia, Ukraine, War, Covid, NLP

I. INTRODUCTION

Today, natural language processing (NLP) methods are widely used. They help humans process documents, extract needed information from text, analyze its content, and build graphs and diagrams with statistics. NLP methods for text processing save a lot of time and optimize the process of searching for and processing information. For example, topic modeling methods can be used to predict the topics of a text, helping to recommend interesting articles and videos to users.

In medicine, NLP methods have allowed us to transform raw data from unstructured clinical information on patients into a structured form [1]. Previously, this unstructured information took doctors a long time to process, as they had to read the "free" text and search it for possible symptoms. Thus, the use of NLP helped to solve two important problems:
• the large amount of time spent by physicians on text analysis of electronic health records on a regular basis,
• the possibility of managing and mining large volumes of clinical data on large time scales.

Topic modeling is a broad field of NLP, and its results are used in people's daily lives, in their work, and in different fields of science. It helps to analyze large volumes of text, extracting and summarizing information, building graphics, and visualizing statistics. Therefore, to be able to achieve different results and use topic modeling properly on different datasets, many different models have been developed. This work mainly focuses on the LDA model, which is one of the most popular methods in topic modeling. The application of algorithms for text preprocessing, model creation, learning, and finally model usage on real datasets is presented in the following parts of this work.

II. RELATED WORK

The use of NLP is very helpful in the analysis of social networks. Each social network contains many chats, polls, and groups with the opinions of users on a certain topic or situation. Sometimes it may be very useful to get summarized information in graphs and tables for faster analysis, but the process of detecting and processing these data is time-consuming. This problem was very relevant during the COVID-19 epidemic, when governments and world organizations wanted to know people's points of view on preventive measures such as mask wearing, antigen tests, and vaccination.

Another goal was described in [2]. The authors studied the possibility of using NLP to detect disease-gene associations within large volumes of data containing many complicated associations. They described a computational framework that discovers latent disease mechanisms by dissecting disease-gene associations from more than 25 million PubMed articles. They used the LDA model and network-based analysis because of their ability to detect latent associations within text and to reduce noise in large volumes of data.

A good use of the LDA model was demonstrated in [3]. There, the topics of the created LDA model were used to conduct a literature review on papers from online databases such as Web of Science, Scopus, or Google Scholar. The resulting topics were then merged into clusters to obtain top-level topics from the former ones. This helps to understand the correlation between them, so a concept map can be made from the keywords of the models to describe their topics as precisely as possible.

The main objective of this work is the practical application of topic modeling methods on real datasets such as text-containing news articles. In this work, the process and results
Authorized licensed use limited to: Slovak University of Technology Trial User. Downloaded on December 08,2022 at 12:26:49 UTC from IEEE Xplore. Restrictions apply.
M. Kretinin and G. Nguyen • Topic Modeling on News Articles using Latent Dirichlet Allocation
of the use of the LDA model are described, and tests are carried out on models trained with different hyperparameters to compare their results and choose the most successful model for future use. Moreover, this paper discusses the possibility of using a generic model trained on the Wikipedia article dataset [4], [5] instead of a specific model trained on the Reuters articles. It is expected that the accuracy of the Wikipedia model will be lower, but it should still be able to predict a document's topics precisely, and it can be used on almost any text, as it contains words from a wide range of topics. All the datasets used are unlabeled, so the models are trained with an unsupervised learning approach.
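Choosing "the most successful model" among candidates trained with different hyperparameters requires a quantitative criterion; a common one is topic coherence. The sketch below is a standard-library-only, simplified UMass-style score over hypothetical keyword sets and documents. It is not the evaluation procedure of this paper, only an illustration of how such a comparison can be automated: a keyword set whose words co-occur in documents scores higher than one mixing unrelated topics.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """Simplified UMass-style coherence for one topic's keyword list:
    sum of log((D(wi, wj) + 1) / D(wj)) over keyword pairs, where D
    counts the documents containing the word(s). Higher is better."""
    doc_sets = [set(d) for d in docs]
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        d_j = sum(1 for d in doc_sets if wj in d)
        d_ij = sum(1 for d in doc_sets if wi in d and wj in d)
        if d_j:  # ignore keywords absent from the corpus
            score += math.log((d_ij + 1) / d_j)
    return score

# Hypothetical keyword sets produced by two candidate models:
docs = [["war", "army", "tank"], ["covid", "vaccine", "mask"],
        ["war", "tank"], ["covid", "mask"]]
coherent = ["war", "army", "tank"]   # keywords that co-occur in documents
mixed = ["war", "vaccine", "tank"]   # keywords drawn from unrelated topics
assert umass_coherence(coherent, docs) > umass_coherence(mixed, docs)
```

In practice a library implementation of coherence would be applied to each trained model's keyword sets, and the model with the best score kept.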
III. LATENT DIRICHLET ALLOCATION

Latent Dirichlet Allocation (LDA) [6], [7] is a topic modeling method that allows users to get a probabilistic distribution of the topics in a document. Topics are represented by keywords, which are the most "popular" words in the documents assigned to the given topic.

1) General requirements for the process: To start topic modeling with the LDA model, the model must first be trained, so a corpus is needed for the training process. A corpus is usually represented as a bag-of-words (BoW): a list of "word: number of occurrences" pairs, which does not preserve the order and relations of the words, only their counts in the text.

2) Data gathering: First, the data should be acquired. In the data collection process, general topic bias should be taken seriously, especially if the majority of the collected documents come from the same closely located source. If such a situation occurs, the resulting model can easily become overfitted to some particular topic(s), and such a model cannot accurately analyze other topics in deployment.

3) Corpus and Dictionary: The goal of this step is to transform the text into a form that the model can use, be trained on, or analyze. To obtain the corpus, the text must first be processed. This step includes the removal of special characters, punctuation, and stop words, and lemmatization [8], [9] of the remaining text. The text is then transformed into a list of words, which is later transformed into the corpus. In addition, words with a very low number of occurrences can be removed as well, to reduce the corpus and speed up the model training process; however, this can affect the quality of the model. From the corpus we can get the "word-id" relation, or the dictionary of the corpus, because some implementations require it to work. This is the case for the gensim LDA library (https://radimrehurek.com/gensim/models/ldamulticore.html), which was used in the experiments in this paper.

A. Model training

When the corpus is ready, it can be used to train the model. As a result, the trained model will be able to use the information gained about the topics and the words assigned to them in the analysis of unseen text.

[Fig. 1: Words assigned to the topic to which they most likely belong]

The process of creating and training LDA models may be described in five steps, which together create the following algorithm [10]:
1) Choose the number k of topics that should be created by the model.
2) Distribute these k topics among the document m by assigning a topic to each of the words in the text. This distribution is named α.
3) Suppose that each word w in the text has been assigned the wrong topic, but that every other word is assigned the correct topic.
4) Assign word w a topic based on the probability of two things:
   • which topics are actually in the analyzed document m,
   • how many times word w has been assigned to the particular topic z across all of the documents.
5) Repeat this process for each document to get k topics with assigned words.

This training process is iterative, which means that it has to be repeated N times to obtain a better result. Executing this algorithm only once or twice on a text will not give a very good result compared to a higher number of iterations.

In Fig. 2, the relations of the variables can be seen, where:
• α is the per-document topic density,
• β is the per-topic word density,
• θ is the topic distribution for document m,
• η is the word distribution for a specific topic,
• z are the topics of the document m,
• w is the specific word.

α and β are vectors of real numbers that are usually the same for all topics/words, respectively. θ and η are matrices, where θ(i, j) represents the probability that the i-th document contains the j-th topic and η(i, j) represents the probability that the i-th topic contains the j-th word [11]. Among all these variables, only w is grayed out because it is the only observable variable in the system, while the others are latent variables.
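The five training steps above can be sketched as a small collapsed-Gibbs-style sampler in plain Python. This is an illustrative sketch, not the method of the paper: the gensim library used in the experiments trains LDA with online variational Bayes rather than this sampler, and the function name, smoothing constants, and toy corpus below are all hypothetical.

```python
import random
from collections import defaultdict

def train_lda_gibbs(docs, k, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Steps 1-5 above as a collapsed-Gibbs-style loop (hypothetical helper)."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    # Steps 1-2: choose k topics and distribute them over the words at random.
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_topic = [[0] * k for _ in docs]                # topic counts per document m
    topic_word = [defaultdict(int) for _ in range(k)]  # word counts per topic z
    topic_total = [0] * k
    for m, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = assignments[m][i]
            doc_topic[m][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(iterations):                        # repeat N times
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Step 3: treat w's current topic as wrong and remove it.
                z = assignments[m][i]
                doc_topic[m][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Step 4: reassign w by weighting (a) the topics actually in
                # document m and (b) how often w gets topic z across all docs.
                weights = [
                    (doc_topic[m][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + len(vocab) * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[m][i] = z
                doc_topic[m][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Step 5 outcome: k topics, each summarized by its most frequent keywords.
    return [sorted(tw, key=tw.get, reverse=True)[:3] for tw in topic_word]

# Toy corpus with two obvious themes:
docs = [["war", "army", "war", "tank"], ["covid", "vaccine", "mask", "covid"],
        ["army", "tank", "war"], ["covid", "vaccine", "mask"]]
print(train_lda_gibbs(docs, k=2, iterations=100))
```

On such a toy corpus the returned keyword lists typically separate into a "war" topic and a "covid" topic, mirroring the idea of Fig. 1: each word ends up assigned to the topic to which it most likely belongs.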
INES 2022 • 26th IEEE International Conference on Intelligent Engineering Systems • August 12-15, 2022 • Crete, Greece
able to successfully predict the main topic of the documents. Therefore, every approach to model training has its own pros and cons, but for this work, balanced but generalized Wikipedia dumps were enough to experiment with the LDA model, as they were able to accurately predict the topics of the documents in the testing dataset.

V. CONCLUSION

In this work, a possible usage of the LDA model to analyze news articles has been described, with preliminary acquisition, preprocessing, and model training on text datasets. To achieve better results and higher accuracy, a hyperparameter tuning process was used, through which we chose a model for dataset analysis. The results demonstrated that, with the statistics acquired on the analyzed datasets, we can identify the dominant topics by their keywords. In this way, popular topics can be compared over a particular period of time, so it is possible, for example, to monitor how topics change over time according to readers' interests. From the results, we were able to state that in October the popular topics were politics and Covid, while in February the relevant articles were about the war in Ukraine and Covid. These topics accurately reflected the real situation in the world and reaffirmed the development and change of the relevant topics. Moreover, the acquired Intertopic Distance Maps are able to separate words that occur only in the current topic from those distributed among many topics. With that, it is possible to get more information on any topic of the model. As an additional way to use the acquired topics, their keywords and tags can be used to search for similar content on the Internet. Therefore, if a user's previous search results were analyzed in this way, it would be possible to find similar content, which would save a lot of time and help find potentially interesting information.

ACKNOWLEDGEMENTS

This work is supported by VEGA 2/0125/20 New Methods and Approaches for Distributed Scalable Computing and by the Operational Programme Integrated Infrastructure for the project International Center of Excellence for Research on Intelligent and Secure Information and Communication Technologies and Systems – Phase II (ITMS code: 313021W404), co-funded by the European Regional Development Fund (ERDF).

REFERENCES

[1] K. Kreimeyer, M. Foster, A. Pandey, et al., "Natural language processing systems for capturing and standardizing unstructured clinical information: A systematic review," Journal of Biomedical Informatics, vol. 73, pp. 14–29, 2017. DOI: 10.1016/j.jbi.2017.07.012. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1532046417301685.
[2] Y. Zhang, F. Shen, M. R. Mojarad, et al., "Systematic identification of latent disease-gene associations from PubMed articles," PLoS ONE, vol. 13, no. 1, e0191568, 2018. DOI: 10.1371/journal.pone.0191568. [Online]. Available: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0191568.
[3] M. Weiss and S. Muegge, "Conceptualizing a new domain using topic modeling and concept mapping: A case study of managed security services for small businesses," Technology Innovation Management Review, vol. 9, pp. 55–64, 2019, ISSN: 1927-0321. DOI: 10.22215/timreview/1261. [Online]. Available: https://timreview.ca/article/1261.
[4] E. Cambria and B. White, "Jumping NLP curves: A review of natural language processing research," IEEE Computational Intelligence Magazine, vol. 9, no. 2, pp. 48–57, 2014. DOI: 10.1109/MCI.2014.2307227. [Online]. Available: https://www.gwern.net/docs/ai/2014-cambria.pdf.
[5] S. Dlugolinsky, G. Nguyen, M. Laclavik, and M. Seleng, "Character gazetteer for named entity recognition with linear matching complexity," in Third World Congress on Information and Communication Technologies (WICT 2013), IEEE, 2013, pp. 361–365. DOI: 10.1109/WICT.2013.7113096.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003. [Online]. Available: https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.
[7] M. Weiss and S. Muegge, "Conceptualizing a new domain using topic modeling and concept mapping: A case study of managed security services for small businesses," Technology Innovation Management Review, vol. 9, no. 8, 2019. [Online]. Available: https://www.timreview.ca/article/1261.
[8] D. Khyani, B. Siddhartha, N. Niveditha, and B. Divya, "An interpretation of lemmatization and stemming in natural language processing," vol. 22, pp. 350–357, 2020.
[9] D. Maier, A. Waldherr, P. Miltner, et al., "Applying LDA topic modeling in communication research: Toward a valid and reliable methodology," Communication Methods and Measures, vol. 12, no. 2–3, pp. 93–118, 2018. DOI: 10.1080/19312458.2018.1430754.
[10] T. Doll, "LDA topic modeling: An explanation," 2018, accessed 02.03.2022. [Online]. Available: https://towardsdatascience.com/lda-topic-modeling-an-explanation-e184c90aadcd.
[11] T. Ganegedara, "Intuitive guide to latent Dirichlet allocation," 2018, accessed 02.03.2022. [Online]. Available: https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158.