SATLabel: A Framework for Sentiment and Aspect Terms Based Automatic Topic Labeling
1 Introduction
Twitter is now considered one of the most important social media platforms
for explaining the characteristics and predicting the status of the pandemic [9].
At the end of 2019, a novel coronavirus that causes COVID-19 was reported in
Wuhan by the World Health Organisation (WHO). The declaration of COVID-19
as a public health emergency of international concern by WHO
2 K. T. Shahriar et al.
was reported on January 30, 2020 [1]. During the pandemic, the use of Twitter
increased immensely, and the platform played a critical role by reflecting
real-time public panic and providing rich information to raise public awareness
through posts and comments. Mining and analysing text from social media
platforms such as Twitter to extract the necessary information has therefore
become a pressing issue. Moreover, it is a great challenge to extract meaningful
topic labels by machine instead of relying on the diverse human interpretations
of the manual labelling approach [?]. Hence, in this paper, we propose SATLabel,
a framework that automatically identifies key topic labels of tweets from a
large Twitter dataset to reduce the human effort of cumbersome topic labelling
tasks.
Traditional supervised methods require a large labelled dataset, and obtaining
such a dataset for topic labelling is difficult and expensive. In this paper,
we use LDA [3], an unsupervised probabilistic algorithm for text documents, so
SATLabel does not need any dataset labelled with topics. LDA discovers the set
of topics present in the documents. Sentiment terms express the emotions in
tweets, and aspect terms describe the features of an entity [19]. We create a
sentiment terms cluster and an aspect terms cluster for each LDA-generated
topic. The unigram model is a probabilistic language model that is used
extensively in natural language processing and text mining to capture the
context of texts. SATLabel takes the top unigram features from the sentiment
terms cluster and the aspect terms cluster, respectively, and creates attribute
tags by concatenating the two top unigram features (first the sentiment term,
then the aspect term). To assign a meaningful label to an LDA-generated topic,
we select the attribute tag that has the highest soft cosine similarity value
with respect to the tweets of that topic. Our experimental results show that the
label generated by SATLabel has a higher soft cosine similarity value with the
tweets of the same topic than the label from the manual labelling approach. The
main contributions of this paper can be summarized as follows:
The rest of the paper is organized as follows. Related work is reviewed
in Section 2. In Section 3, we present the methodology of the proposed framework.
In Section 4, we evaluate our framework by conducting experiments on the
Twitter dataset. Next, we present the discussion, and finally, we conclude
the paper and highlight directions for future work.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 5 April 2022 doi:10.20944/preprints202204.0026.v1
2 Related Work
COVID-19 tweets can help identify meaningful topic labels that highlight user
conversations and reveal people’s needs and interests. Many researchers have
used the LDA algorithm to extract the hidden themes of documents. Patil et al.
[12] used a frequency-based technique to extract topics from people’s reviews
but did not describe a proper labelling technique for the topics. Hingmire et
al. [7] constructed an LDA-based topic model, but expert involvement is required
to assign the topics to class labels. Hourani [8] classified articles according
to their topics, which requires a labelled dataset. Asmussen and Møller [2]
proposed a topic modelling method for researchers, but topic labelling depends
on the researcher’s view and no automatic method is provided. Wang et al. [18]
proposed an approach that reduces the problem of data sparsity but does not
label key topics specifically. Zhu et al. [20] presented how the number of texts
on each topic changes over time, following a manual topic labelling approach.
Satu et al. [15] proposed a framework that extracts topics from the best cluster
of sentiment classification, but its manual explanation of topic labels is prone
to misinterpretation. Kee et al. [10] used LDA to extract higher-order arbitrary
topics, but only 61.3% were evaluated as clear collective themes. Maier et al.
[11] addressed the accessibility and applicability of LDA for communication
researchers, using a topic labelling approach that depends manually on broader
context knowledge. In our previous work [17], we considered only the top unigram
feature of the aspect terms cluster to identify and label the key topics
extracted with LDA. Elgesem et al. [5] analysed the discussion of the Snowden
affair using a manual topic labelling approach. Guo et al. [6] compared
dictionary-based analysis and LDA analysis, also using a manual topic labelling
approach.
In summary, most of the above works rely on a manual topic labelling approach
to categorize documents and obtain an overview, which is expensive,
time-consuming, and requires cumbersome human interpretation. An automatic and
effective topic labelling approach would therefore reduce human effort and save
time. In this paper, we develop a framework named SATLabel that automatically
generates significant topic labels to highlight users’ conversations on
Twitter.
3 Methodology
opinion of the text. Nouns and noun phrases are considered aspect terms
of a text. Objects of verbs are often regarded as aspect terms that describe the
features of an entity, product, or event [19]. We use part-of-speech tagging,
which is an efficient approach, to extract sentiment terms and aspect
terms from texts. Examples of sentiment terms and aspect terms of sample
tweets are shown in Table 1.
Sample Tweet | Sentiment Terms | Aspect Terms
Please read the thread. | read | thread
To enjoy and relax for your dinner it is a great place. | enjoy, relax, great | dinner, place
Links with info on communicating with children regarding COVID-19. | communicate, covid | links, info, children
The retail store owners right now | retail, right | owners, store
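The split between sentiment terms and aspect terms can be sketched roughly as follows. This is a minimal illustration that assumes the tweet has already been tagged with Penn Treebank part-of-speech tags. Treating nouns as aspect terms follows the text, whereas treating verbs, adjectives, and adverbs as sentiment terms is an assumption made here, as are the function name `split_terms` and the hand-assigned tags.

```python
from typing import List, Tuple

# Penn Treebank tag prefixes. Nouns as aspect terms follows the text;
# verbs, adjectives and adverbs as sentiment terms is an assumption.
ASPECT_PREFIXES = ("NN",)
SENTIMENT_PREFIXES = ("VB", "JJ", "RB")


def split_terms(tagged_tokens: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split (token, POS tag) pairs into sentiment terms and aspect terms."""
    sentiment, aspect = [], []
    for token, tag in tagged_tokens:
        word = token.lower()
        if tag.startswith(ASPECT_PREFIXES):
            aspect.append(word)
        elif tag.startswith(SENTIMENT_PREFIXES):
            sentiment.append(word)
    return sentiment, aspect


# Hand-tagged version of the first sample tweet (tags are illustrative).
tweet = [("Please", "UH"), ("read", "VB"), ("the", "DT"), ("thread", "NN"), (".", ".")]
sentiment_terms, aspect_terms = split_terms(tweet)
print(sentiment_terms, aspect_terms)
```

In practice the tags would come from a part-of-speech tagger rather than being written by hand.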
sentiment terms and aspect terms of documents without any human
interpretation. Based on the topic coherence score, we choose the model
with 20 topics as the optimal number of topics. Then we determine the
dominant topic of each tweet to understand the distribution of topics across the
tweets in the dataset.
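Determining the dominant topic can be illustrated with a short sketch. It assumes that the topic model yields, for each tweet, a list of topic probabilities, and it takes the topic with the highest probability as the dominant one. The function names and the sample probabilities are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def dominant_topic(topic_probs: List[float]) -> int:
    """Return the index of the topic with the highest probability."""
    return max(range(len(topic_probs)), key=lambda k: topic_probs[k])


def topic_distribution(doc_topic_probs: List[List[float]]) -> Dict[int, int]:
    """Count how many documents have each topic as their dominant topic."""
    return dict(Counter(dominant_topic(p) for p in doc_topic_probs))


# Hypothetical per-tweet topic probabilities for a 3-topic model.
doc_topic_probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.15, 0.75],
    [0.60, 0.30, 0.10],
]
print([dominant_topic(p) for p in doc_topic_probs])
print(topic_distribution(doc_topic_probs))
```

The resulting counts give the distribution of topics across the tweets.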
The steps to generate significant topic labels automatically as output from the
topics extracted by LDA are discussed below:
4 Experiments
4.1 Dataset
coherence value before flattening out, which makes better sense when the
coherence score seems to keep growing, as shown in Fig. 2. For the next steps,
we choose the model with 20 topics.
An expert annotator manually assigns topic labels to a randomly selected set of
tweets using the word probabilities in the LDA-generated topics. In Table 2,
we present a portion of the tweets together with the topic labels generated by
SATLabel. Table 2 shows that the SATLabel-generated topic labels are well
aligned and closely coherent with the content of the tweets. We can extract
useful information related to a topic simply by categorizing the tweets using
the key label that SATLabel generates for that topic.
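To illustrate how candidate labels of this kind might be formed, the following minimal sketch builds attribute tags from the most frequent unigrams of a sentiment terms cluster and an aspect terms cluster, placing the sentiment term first. The function names, the choice of two top unigrams per cluster, the pairing of every top sentiment term with every top aspect term, and the sample clusters are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from typing import List


def top_unigrams(terms: List[str], n: int) -> List[str]:
    """Return the n most frequent unigrams in a term cluster."""
    return [term for term, _ in Counter(terms).most_common(n)]


def attribute_tags(sentiment_cluster: List[str], aspect_cluster: List[str], n: int = 2) -> List[str]:
    """Concatenate top sentiment and aspect unigrams, sentiment term first."""
    top_sentiment = top_unigrams(sentiment_cluster, n)
    top_aspect = top_unigrams(aspect_cluster, n)
    return [f"{s} {a}" for s, a in product(top_sentiment, top_aspect)]


# Hypothetical term clusters for a single topic.
sentiment_cluster = ["great", "enjoy", "great", "relax", "enjoy", "great"]
aspect_cluster = ["place", "dinner", "place", "dinner", "place"]
print(attribute_tags(sentiment_cluster, aspect_cluster))
```

Each resulting tag is a candidate topic label. The final label would be chosen among the candidates by a similarity measure.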
In this section, we calculate the Soft Cosine Similarity (SCS) values of the
topic labels detected by SATLabel and by the manual approach for the 20
LDA-generated topics. SCS measures the semantic similarity between two
documents. A high SCS value indicates high similarity, and the value is
smaller for unrelated documents. We train a word2vec embedding model to
compute SCS. Fig. 5 compares SATLabel and the manual labeling approach
for all LDA-generated topics in terms of SCS value. For topics no. 4, 8, 10,
and 14, the proposed SATLabel yields SCS values of 0.77, 0.61, 0.53, and 0.64,
respectively, while the manual approach generates 0.09, 0.06, 0.02, and 0.07 SCS
scores for those topics, which are very low. Diverse human interpretation of
topics is a possible reason for the large difference in SCS scores between the
proposed SATLabel and the manual approach. For topics no. 3, 6, 9, 12, 16, and
18, we get the same SCS values for SATLabel and the manual approach because both
approaches generate identical topic labels. From Fig. 5, we can observe that the
topic labels generated by SATLabel produce higher SCS values than the manual
labeling approach for most topics. Hence, our proposed framework is more
effective and finds better topic labels from unlabelled datasets, reducing the
cumbersome task of manual labeling.
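For readers unfamiliar with the measure, the soft cosine between two bag-of-words vectors a and b is the sum of s_ij * a_i * b_j over all term pairs, divided by the square roots of the same sum for (a, a) and for (b, b), where s_ij is the similarity between terms i and j. The sketch below is a minimal pure-Python illustration of this formula, not the implementation used in the experiments. The term similarities in `PAIRS` are hypothetical stand-ins for values that would be derived from the word2vec model, and comparing each candidate label with a single tweet is a simplification.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Similarity = Callable[[str, str], float]


def soft_dot(a: Dict[str, int], b: Dict[str, int], sim: Similarity) -> float:
    """Sum of sim(w_i, w_j) * a_i * b_j over all term pairs."""
    return sum(sim(wi, wj) * ai * bj for wi, ai in a.items() for wj, bj in b.items())


def soft_cosine(doc_a: List[str], doc_b: List[str], sim: Similarity) -> float:
    """Soft cosine similarity between two tokenized documents."""
    a, b = Counter(doc_a), Counter(doc_b)
    denom = math.sqrt(soft_dot(a, a, sim)) * math.sqrt(soft_dot(b, b, sim))
    return soft_dot(a, b, sim) / denom if denom else 0.0


# Hypothetical term similarities (assumed values, not from a trained model).
PAIRS = {frozenset(("place", "restaurant")): 0.8, frozenset(("great", "good")): 0.7}


def sim(w1: str, w2: str) -> float:
    """Return 1.0 for identical terms, else the stored similarity or 0.0."""
    if w1 == w2:
        return 1.0
    return PAIRS.get(frozenset((w1, w2)), 0.0)


tweet = "good restaurant for dinner".split()
candidates = ["great place", "read thread"]
# Pick the candidate label that is most similar to the tweet.
best = max(candidates, key=lambda label: soft_cosine(label.split(), tweet, sim))
print(best)
```

With these assumed similarities, "great place" scores about 0.53 against the tweet even though the two share no words, while an ordinary cosine similarity would give 0. This is why a soft measure suits short labels.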
5 Discussion
References
1. Adhikari, S.P., Meng, S., Wu, Y.J., Mao, Y.P., Ye, R.X., Wang, Q.Z., Sun, C.,
Sylvia, S., Rozelle, S., Raat, H., et al.: Epidemiology, causes, clinical manifestation
and diagnosis, prevention and control of coronavirus disease (COVID-19) during the
early outbreak period: a scoping review. Infectious Diseases of Poverty 9(1), 1–12
(2020)
2. Asmussen, C.B., Møller, C.: Smart literature review: a practical topic modelling
approach to exploratory literature review. Journal of Big Data 6(1), 1–18 (2019)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of
Machine Learning Research 3, 993–1022 (2003)
4. Boussaadi, S., Aliane, H., Abdeldjalil, P.O.: The researchers profile with topic
modeling. In: 2020 IEEE 2nd International Conference on Electronics, Control,
Optimization and Computer Science (ICECOCS). pp. 1–6. IEEE (2020)
5. Elgesem, D., Feinerer, I., Steskal, L.: Bloggers’ responses to the Snowden affair:
Combining automated and manual methods in the analysis of news blogging.
Computer Supported Cooperative Work (CSCW) 25(2-3), 167–191 (2016)
6. Guo, L., Vargo, C.J., Pan, Z., Ding, W., Ishwar, P.: Big social data analytics in
journalism and mass communication: Comparing dictionary-based text analysis
and unsupervised topic modeling. Journalism & Mass Communication Quarterly
93(2), 332–359 (2016)
7. Hingmire, S., Chougule, S., Palshikar, G.K., Chakraborti, S.: Document
classification by topic labeling. In: Proceedings of the 36th International ACM SIGIR
Conference on Research and Development in Information Retrieval. pp. 877–880
(2013)
8. Hourani, A.S.: Arabic topic labeling using naïve Bayes (NB). In: 2021 12th
International Conference on Information and Communication Systems (ICICS). pp.
478–479. IEEE (2021)
9. Jahanbin, K., Rahmanian, V., et al.: Using Twitter and web news mining to predict
COVID-19 outbreak. Asian Pacific Journal of Tropical Medicine 13(8), 378 (2020)
10. Kee, Y.H., Li, C., Kong, L.C., Tang, C.J., Chuang, K.L.: Scoping review of
mindfulness research: A topic modelling approach. Mindfulness 10(8), 1474–1488 (2019)
11. Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A.,
Pfetsch, B., Heyer, G., Reber, U., Häussler, T., et al.: Applying LDA topic modeling
in communication research: Toward a valid and reliable methodology.
Communication Methods and Measures 12(2-3), 93–118 (2018)
12. Patil, P.P., Phansalkar, S., Kryssanov, V.V.: Topic modelling for aspect-level
sentiment analysis. In: Proceedings of the 2nd International Conference on Data
Engineering and Communication Technology. pp. 221–229. Springer (2019)
13. Sarker, I.H.: Deep learning: A comprehensive overview on techniques, taxonomy,
applications and research directions. SN Computer Science 2(6), 1–20 (2021)
14. Sarker, I.H.: Machine learning: Algorithms, real-world applications and research
directions. SN Computer Science 2(3), 1–21 (2021)
15. Satu, M.S., Khan, M.I., Mahmud, M., Uddin, S., Summers, M.A., Quinn, J.M.,
Moni, M.A.: TClustVID: a novel machine learning classification model to investigate
topics and sentiment in COVID-19 tweets. Knowledge-Based Systems 226, 107126
(2021)
16. Sidorov, G., Gelbukh, A., Gómez-Adorno, H., Pinto, D.: Soft similarity and soft
cosine measure: Similarity of features in vector space model. Computación y Sistemas
18(3), 491–504 (2014)
17. Tayef Shahriar, K., Sarker, I.H., Nazrul Islam, M., Moni, M.A.: A dynamic topic
identification and labeling approach of COVID-19 tweets. In: International
Conference on Big Data, IoT and Machine Learning (BIM 2021). Taylor and Francis
(2021)
18. Wang, B., Liakata, M., Zubiaga, A., Procter, R.: A hierarchical topic modelling
approach for tweet clustering. In: International Conference on Social Informatics.
pp. 378–390. Springer (2017)
19. Wang, W., Pan, S.J., Dahlmeier, D., Xiao, X.: Coupled multi-layer attentions for
co-extraction of aspect and opinion terms. In: Proceedings of the AAAI Conference
on Artificial Intelligence. vol. 31 (2017)
20. Zhu, B., Zheng, X., Liu, H., Li, J., Wang, P.: Analysis of spatiotemporal
characteristics of big data on social media sentiment with COVID-19 epidemic topics.
Chaos, Solitons & Fractals 140, 110123 (2020)