2022 IEEE 26th International Conference on Intelligent Engineering Systems (INES) | 978-1-6654-9209-6/22/$31.00 ©2022 IEEE | DOI: 10.1109/INES56734.2022.9922609

Abstract—Topic modeling is widely used to obtain the most visible topics from a given text corpus. In this work, a demonstration of modeling the most discussed topics is presented on articles from the Reuters news website. These articles are collected and subsequently processed with the Latent Dirichlet Allocation (LDA) unsupervised learning algorithm. The main goal is to build the model(s) that most accurately produce the most discussed topics. Such model(s) can be used in real life to instantly obtain information about current news, to classify documents in a given dataset, and to extract the dominant topics with their keywords. This helps, for example, to build correlations with user preferences and to recommend interesting content. There are works that use different models to evaluate texts and obtain statistics about them, such as the most popular opinions of people on some question, or the popular and dominating subtopics of a specific topic dataset (e.g., medicine articles). As a result of this work, we were able to create a generic LDA model trained on Wikipedia articles. The model successfully analyzed Reuters articles and extracted their topics as keyword sets. These can then be used to recommend content that is interesting to the target user, for example, based on the recommended content tags.

Index Terms—Topic Modeling, Latent Dirichlet Allocation, Reuters Articles, Wikipedia, Ukraine, War, Covid, NLP

I. INTRODUCTION

Today, natural language processing (NLP) methods are widely used. They help humans process documents, extract needed information from text, analyze its content, and build graphs and diagrams with statistics. NLP methods for text processing save a lot of time and optimize the process of searching for and processing information. For example, topic modeling methods can be used to predict the topics of a text, helping to recommend interesting articles and videos to users.

In medicine, NLP methods have allowed us to transform raw data from unstructured clinical information on patients into a structured form [1]. Previously, this unstructured information took doctors a long time to process, as they had to read the "free" text and search it for possible symptoms. Thus, the use of NLP helped to solve two important problems:
• the large amount of time spent by physicians on text analysis of electronic health records on a regular basis,
• the possibility of managing and mining large volumes of clinical data on large time scales.

Topic modeling is a broad field of NLP, and its results are used in people's daily lives, in their work, and in different fields of science. It helps to analyze large volumes of text, extracting and summarizing information, building graphics, and visualizing statistics. Therefore, to be able to achieve different results and use topic modeling properly on different datasets, many different models have been developed. This work mainly focuses on the LDA model, which is one of the most popular methods in topic modeling. The application of algorithms for text preprocessing, model creation, learning, and finally model usage on real datasets is presented in the following parts of this work.

II. RELATED WORK

The use of NLP is very helpful in the analysis of social networks. Each social network contains many chats, polls, and groups with the opinions of users on a certain topic or situation. Sometimes it may be very useful to get summarized information in graphs and tables for faster analysis, but the process of detecting and processing these data is time-consuming. This problem was very relevant during the COVID-19 epidemic, when governments and world organizations wanted to know people's points of view on preventive measures such as mask wearing, antigen tests, and vaccination.

Another goal was described in [2]. The authors studied the possibility of using NLP to detect disease-gene associations within large volumes of data containing many complicated associations. They described a computational framework that discovers latent disease mechanisms by dissecting disease-gene associations from more than 25 million PubMed articles. They used the LDA model and network-based analysis because of their ability to detect latent associations within text and to reduce noise in large volumes of data.

A good use of the LDA model was demonstrated in [3]. There, the topics of the created LDA model were used to conduct a literature review on papers from online databases such as Web of Science, Scopus, or Google Scholar. The resulting topics were then merged into clusters to obtain top-level topics from the former ones. This helps to understand the correlation between them, so a concept map can be made from the keywords of the models to describe their topics as precisely as possible.

The main objective of this work is the practical application of topic modeling methods on real datasets such as text-containing news articles. In this work, the process and results
Authorized licensed use limited to: Slovak University of Technology Trial User. Downloaded on December 08,2022 at 12:26:49 UTC from IEEE Xplore. Restrictions apply.
M. Kretinin and G. Nguyen • Topic Modeling on News Articles using Latent Dirichlet Allocation
of the use of the LDA model are described, and tests are carried out on models trained with different hyperparameters to compare their results and choose the most successful model for future use. Moreover, this paper discusses the possibility of using a generic model trained on the Wikipedia article dataset [4], [5] instead of a specific model trained on the Reuters articles. It is expected that the accuracy of the Wikipedia model will be lower, but it should still be able to predict a document's topics precisely, and it can be used on almost any text, as it contains words from a wide range of topics. All the datasets used are unlabeled, so the models are trained with an unsupervised learning approach.
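Choosing "the most successful model" among candidates trained with different hyperparameters requires a quantitative criterion; a common one is topic coherence. The sketch below is a standard-library-only, simplified UMass-style score over hypothetical keyword sets and documents. It is not the evaluation procedure of this paper, only an illustration of how such a comparison can be automated: a keyword set whose words co-occur in documents scores higher than one mixing unrelated topics.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """Simplified UMass-style coherence for one topic's keyword list:
    sum of log((D(wi, wj) + 1) / D(wj)) over keyword pairs, where D
    counts the documents containing the word(s). Higher is better."""
    doc_sets = [set(d) for d in docs]
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        d_j = sum(1 for d in doc_sets if wj in d)
        d_ij = sum(1 for d in doc_sets if wi in d and wj in d)
        if d_j:  # ignore keywords absent from the corpus
            score += math.log((d_ij + 1) / d_j)
    return score

# Hypothetical keyword sets produced by two candidate models:
docs = [["war", "army", "tank"], ["covid", "vaccine", "mask"],
        ["war", "tank"], ["covid", "mask"]]
coherent = ["war", "army", "tank"]   # keywords that co-occur in documents
mixed = ["war", "vaccine", "tank"]   # keywords drawn from unrelated topics
assert umass_coherence(coherent, docs) > umass_coherence(mixed, docs)
```

In practice a library implementation of coherence would be applied to each trained model's keyword sets, and the model with the best score kept.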
III. LATENT DIRICHLET ALLOCATION

Latent Dirichlet Allocation (LDA) [6], [7] is a topic modeling method that allows users to get a probabilistic distribution of the topics in a document. Topics are represented by keywords, which are the most "popular" words in the documents assigned to the given topic.

1) General requirements for the process: To start topic modeling with the LDA model, the model must first be trained, so a corpus is needed for the training process. A corpus is usually represented as a bag-of-words (BoW): a list of "word: number of occurrences" pairs, which does not preserve the order and relations of the words, only their counts in the text.

2) Data gathering: First, the data should be acquired. In the data collection process, general topic bias should be taken seriously, especially if the majority of the collected documents come from the same closely located source. If such a situation occurs, the resulting model can easily become overfitted to some particular topic(s), and such a model cannot accurately analyze other topics in deployment.

3) Corpus and Dictionary: The goal of this step is to transform the text into a form that the model can use, be trained on, or analyze. To obtain the corpus, the text must first be processed. This step includes the removal of special characters, punctuation, and stop words, and lemmatization [8], [9] of the remaining text. The text is then transformed into a list of words, which is later transformed into the corpus. In addition, words with a very low number of occurrences can be removed as well, to reduce the corpus and speed up the model training process; however, this can affect the quality of the model. From the corpus we can get the "word-id" relation, or the dictionary of the corpus, because some implementations require it to work. This is the case for the gensim LDA library (https://radimrehurek.com/gensim/models/ldamulticore.html), which was used in the experiments in this paper.

A. Model training

When the corpus is ready, it can be used to train the model. As a result, the trained model will be able to use the information gained about the topics and the words assigned to them in the analysis of unseen text.

[Fig. 1: Words assigned to the topic to which they most likely belong]

The process of creating and training LDA models may be described in five steps, which together create the following algorithm [10]:
1) Choose the number k of topics that should be created by the model.
2) Distribute these k topics among the document m by assigning a topic to each of the words in the text. This distribution is named α.
3) Suppose that each word w in the text has been assigned the wrong topic, but that every other word is assigned the correct topic.
4) Assign word w a topic based on the probability of two things:
   • which topics are actually in the analyzed document m,
   • how many times word w has been assigned to the particular topic z across all of the documents.
5) Repeat this process for each document to get k topics with assigned words.

This training process is iterative, which means that it has to be repeated N times to obtain a better result. Executing this algorithm only once or twice on a text will not give a very good result compared to a higher number of iterations.

In Fig. 2, the relations of the variables can be seen, where:
• α is the per-document topic density,
• β is the per-topic word density,
• θ is the topic distribution for document m,
• η is the word distribution for a specific topic,
• z are the topics of the document m,
• w is the specific word.

α and β are vectors of real numbers that are usually the same for all topics/words, respectively. θ and η are matrices, where θ(i, j) represents the probability that the i-th document contains the j-th topic and η(i, j) represents the probability that the i-th topic contains the j-th word [11]. Among all these variables, only w is grayed out because it is the only observable variable in the system, while the others are latent variables.
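The five training steps above can be sketched as a small collapsed-Gibbs-style sampler in plain Python. This is an illustrative sketch, not the method of the paper: the gensim library used in the experiments trains LDA with online variational Bayes rather than this sampler, and the function name, smoothing constants, and toy corpus below are all hypothetical.

```python
import random
from collections import defaultdict

def train_lda_gibbs(docs, k, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Steps 1-5 above as a collapsed-Gibbs-style loop (hypothetical helper)."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    # Steps 1-2: choose k topics and distribute them over the words at random.
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_topic = [[0] * k for _ in docs]                # topic counts per document m
    topic_word = [defaultdict(int) for _ in range(k)]  # word counts per topic z
    topic_total = [0] * k
    for m, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = assignments[m][i]
            doc_topic[m][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(iterations):                        # repeat N times
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Step 3: treat w's current topic as wrong and remove it.
                z = assignments[m][i]
                doc_topic[m][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Step 4: reassign w by weighting (a) the topics actually in
                # document m and (b) how often w gets topic z across all docs.
                weights = [
                    (doc_topic[m][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + len(vocab) * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[m][i] = z
                doc_topic[m][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Step 5 outcome: k topics, each summarized by its most frequent keywords.
    return [sorted(tw, key=tw.get, reverse=True)[:3] for tw in topic_word]

# Toy corpus with two obvious themes:
docs = [["war", "army", "war", "tank"], ["covid", "vaccine", "mask", "covid"],
        ["army", "tank", "war"], ["covid", "vaccine", "mask"]]
print(train_lda_gibbs(docs, k=2, iterations=100))
```

On such a toy corpus the returned keyword lists typically separate into a "war" topic and a "covid" topic, mirroring the idea of Fig. 1: each word ends up assigned to the topic to which it most likely belongs.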
INES 2022 • 26th IEEE International Conference on Intelligent Engineering Systems • August 12-15, 2022 • Crete, Greece
able to successfully predict the main topic of the documents. Therefore, every approach to model training has its own pros and cons, but for this work, balanced but generalized Wikipedia dumps were enough to experiment with the LDA model, as they were able to accurately predict the topics of the documents in the testing dataset.

V. CONCLUSION

In this work, a possible usage of the LDA model to analyze news articles has been described, with preliminary acquisition, preprocessing, and model training on text datasets. To achieve better results and higher accuracy, a hyperparameter tuning process was used, through which we chose a model for dataset analysis. The results demonstrated that, with the statistics acquired on the analyzed datasets, we can identify the dominant topics by their keywords. In this way, popular topics can be compared over a particular period of time, so it is possible, for example, to monitor how topics change over time according to readers' interests. From the results, we were able to state that in October the popular topics were politics and Covid, while in February the relevant articles were about the war in Ukraine and Covid. These topics accurately reflected the real situation in the world and reaffirmed the development and change of the relevant topics. Moreover, the acquired Intertopic Distance Maps are able to separate words that occur only in the current topic from those distributed among many topics. With that, it is possible to get more information on any topic of the model. As an additional way to use the acquired topics, their keywords and tags can be used to search for similar content on the Internet. Therefore, if a user's previous search results were analyzed in this way, it would be possible to find similar content, which would save a lot of time and help find potentially interesting information.

ACKNOWLEDGEMENTS

This work is supported by VEGA 2/0125/20 New Methods and Approaches for Distributed Scalable Computing and by the Operational Programme Integrated Infrastructure for the project International Center of Excellence for Research on Intelligent and Secure Information and Communication Technologies and Systems – Phase II (ITMS code: 313021W404), co-funded by the European Regional Development Fund (ERDF).

REFERENCES

[1] K. Kreimeyer, M. Foster, A. Pandey, et al., "Natural language processing systems for capturing and standardizing unstructured clinical information: A systematic review," Journal of Biomedical Informatics, vol. 73, pp. 14–29, 2017. DOI: 10.1016/j.jbi.2017.07.012. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1532046417301685.
[2] Y. Zhang, F. Shen, M. R. Mojarad, et al., "Systematic identification of latent disease-gene associations from PubMed articles," PLoS ONE, vol. 13, no. 1, e0191568, 2018. DOI: 10.1371/journal.pone.0191568. [Online]. Available: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0191568.
[3] M. Weiss and S. Muegge, "Conceptualizing a new domain using topic modeling and concept mapping: A case study of managed security services for small businesses," Technology Innovation Management Review, vol. 9, pp. 55–64, 2019, ISSN: 1927-0321. DOI: 10.22215/timreview/1261. [Online]. Available: https://timreview.ca/article/1261.
[4] E. Cambria and B. White, "Jumping NLP curves: A review of natural language processing research," IEEE Computational Intelligence Magazine, vol. 9, no. 2, pp. 48–57, 2014. DOI: 10.1109/MCI.2014.2307227. [Online]. Available: https://www.gwern.net/docs/ai/2014-cambria.pdf.
[5] S. Dlugolinsky, G. Nguyen, M. Laclavik, and M. Seleng, "Character gazetteer for named entity recognition with linear matching complexity," in Third World Congress on Information and Communication Technologies (WICT 2013), IEEE, 2013, pp. 361–365. DOI: 10.1109/WICT.2013.7113096.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003. [Online]. Available: https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.
[7] M. Weiss and S. Muegge, "Conceptualizing a new domain using topic modeling and concept mapping: A case study of managed security services for small businesses," Technology Innovation Management Review, vol. 9, no. 8, 2019. [Online]. Available: https://www.timreview.ca/article/1261.
[8] D. Khyani, B. Siddhartha, N. Niveditha, and B. Divya, "An interpretation of lemmatization and stemming in natural language processing," vol. 22, pp. 350–357, 2020.
[9] D. Maier, A. Waldherr, P. Miltner, et al., "Applying LDA topic modeling in communication research: Toward a valid and reliable methodology," Communication Methods and Measures, vol. 12, no. 2–3, pp. 93–118, 2018. DOI: 10.1080/19312458.2018.1430754.
[10] T. Doll, "LDA topic modeling: An explanation," 2018, accessed 02.03.2022. [Online]. Available: https://towardsdatascience.com/lda-topic-modeling-an-explanation-e184c90aadcd.
[11] T. Ganegedara, "Intuitive guide to latent Dirichlet allocation," 2018, accessed 02.03.2022. [Online]. Available: https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158.