
Topic Modeling for

Customer Insights
A Comparative Analysis of LDA
and BERTopic in Categorizing
Customer Calls
Henrik Axelborn & John Berggren

Master’s Thesis, 30 Credits


M.Sc. in Industrial Engineering and Management, 300 Credits
Spring Term 2023
Topic Modeling for Customer Insights
A Comparative Analysis of LDA and BERTopic in Categorizing Customer Calls

Henrik Axelborn
John Berggren †

Copyright © 2023 Henrik Axelborn & John Berggren


All Rights Reserved
Supervisors:
Robert Andersson Krohn, Partner Company
Per Arnqvist, Umeå University
Examiner:
Armin Eftekhari, Umeå University
Degree Project in Industrial Engineering and Management, 30 Credits
Department of Mathematics and Mathematical Statistics
Umeå University
SE-901 87 Umeå, Sweden

† Equal contribution. In order to ensure fairness and impartiality, the order of authors was determined by three rounds of spin-the-wheel.
Abstract
Customer calls serve as a valuable source of feedback for financial service providers, potentially containing a
wealth of unexplored insights into customer questions and concerns. However, these call data are typically
unstructured and challenging to analyze effectively. This thesis project focuses on leveraging Topic Modeling
techniques, a sub-field of Natural Language Processing, to extract meaningful customer insights from recorded
customer calls to a European financial service provider. The objective of the study is to compare two widely
used Topic Modeling algorithms, Latent Dirichlet Allocation (LDA) and BERTopic, in order to categorize
and analyze the content of the calls. By leveraging the power of these algorithms, the thesis aims to provide
the company with a comprehensive understanding of customer needs, preferences, and concerns, ultimately
facilitating more effective decision-making processes.
Through a literature review and dataset analysis, i.e., pre-processing to ensure data quality and consistency,
the two algorithms, LDA and BERTopic, are applied to extract latent topics. The performance is then
evaluated using quantitative and qualitative measures, i.e., perplexity and coherence scores as well as in-
terpretability and usefulness of topic quality. The findings contribute to knowledge on Topic Modeling for
customer insights and enable the company to improve customer engagement, satisfaction and tailor their
customer strategies.
The results show that LDA outperforms BERTopic in terms of topic quality and business value. Although
BERTopic demonstrates a slightly better quantitative performance, LDA aligns much better with human
interpretation, indicating a stronger ability to capture meaningful and coherent topics within the company’s
customer call data.
Keywords: Customer Insights, Natural Language Processing, Topic Modeling, Latent Dirichlet Allocation,
BERTopic.

Sammanfattning
Customer calls are a valuable source of feedback for financial service providers, potentially containing a
wealth of unexplored insights into customers’ questions and experiences. However, these call data are
typically unstructured and challenging to analyze effectively. This thesis project explores the application of
Topic Modeling techniques, a sub-field of Natural Language Processing, to extract customer insights from
recorded customer calls at a European financial service provider. The objective of the study is to compare
two popular Topic Modeling algorithms, Latent Dirichlet Allocation (LDA) and BERTopic, in order to
categorize and analyze the content of the calls. By leveraging the power of these algorithms, the thesis aims
to give the company a comprehensive understanding of customer needs, preferences and concerns, ultimately
facilitating more effective decision-making processes.
Through a literature review and dataset analysis, i.e., pre-processing to ensure data quality and consistency,
the two algorithms, LDA and BERTopic, are applied to extract latent topics. The model performance is then
evaluated using quantitative and qualitative measures, through the perplexity and coherence metrics as well
as the interpretability and usefulness of topic quality. The findings contribute to knowledge on Topic
Modeling for customer insights and enable the company to improve customer engagement and satisfaction
and to tailor their customer strategies.
The results show that LDA outperforms BERTopic in terms of topic quality and business value. Although
BERTopic demonstrates a slightly better quantitative performance, LDA aligns much better with human
interpretation, indicating a stronger ability to capture meaningful and coherent topics within the company’s
customer call data.

Acknowledgements
First and foremost, we would like to express our sincere gratitude to our supervisor at the partner company,
Robert Andersson Krohn, for his guidance and support throughout the entire thesis. We are also grateful
to the company for the opportunity to collaborate with the Data and ML team and gain insights into the
application of Topic Modeling with customer data.
A special thanks also goes to our supervisor from Umeå University, Associate Professor Per Arnqvist, for his
valuable direction in navigating the complex realm of theoretical writing and for his prompt assistance
whenever needed. We would also like to extend our appreciation to our classmates for the enriching
and memorable years we have spent together at Umeå University.
Lastly, we would like to express our gratitude to our families and friends for their constant support and en-
couragement as well as unwavering patience and understanding throughout the thesis and our entire academic
journey. Their love and patience have been instrumental in our success and well-being.
Thank you!

Henrik Axelborn
John Berggren

Stockholm, June 7, 2023

Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Aim of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.2 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Confidentiality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Theory 3
2.1 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Topic Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Data Cleaning and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Lemmatization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 Lowercase and Non-Alphabetical Character Removal . . . . . . . . . . . . . . . . . . . 5
2.2.4 Stopword Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.5 WordPiece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Text Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.1 Bag-of-Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.2 Word Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.3 TF-IDF - Term Frequency-Inverse Document Frequency . . . . . . . . . . . . . . . . . 6
2.4 Metrics for Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4.1 Euclidean Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.2 Cosine Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5.1 Generative Process for LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.2 Hierarchical Bayesian Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5.3 Dirichlet Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6.1 Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6.2 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6.4 Tokenizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.5 Weighting Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.6 Fine-tuning Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Topic Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7.1 Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7.2 Perplexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Method 20
3.1 Modeling Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Initial Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Data Description and Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5 Creation of Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5.1 Transcription and Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5.2 Concatenating the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.7 Text Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.7.1 LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.7.2 BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.8 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.8.1 LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.8.2 BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.9 Evaluation and Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.9.1 LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.9.2 BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4 Results 29
4.1 LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.1 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.2 Final LDA model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.3 Extracting Topic Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 Default BERTopic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.2 Final BERTopic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5 Discussion 37
5.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2.1 Transcription and Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2.2 Size of Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2.3 Sampling Data from Specific Time Periods . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.4 Stopwords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.5 Validation Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Comparing LDA and BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3.1 Advantages and Limitations of LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3.2 Advantages and Limitations of BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.4 Topic Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.4.1 Interpretation and Usefulness of Perplexity . . . . . . . . . . . . . . . . . . . . . . . . 41
5.4.2 Coherence vs. Business Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.5 Reflection on the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

6 Conclusion 43
6.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 Recommendations for Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

Appendices 48
A Stopwords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B Model Building LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
C Model Building BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
D Model Fit and Topic Extraction Functions BERTopic . . . . . . . . . . . . . . . . . . . . . . . 53
E Visualization Functions BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

List of Figures
1 Model architecture of the Transformer. Figure from [72]. . . . . . . . . . . . . . . . . . . . . . 4
2 Graphical model representation of LDA. Nodes represent random variables, links between
nodes are conditional dependencies and boxes are replicated components. . . . . . . . . . . . 8
3 Illustration of BERTopic’s architecture and its modularity throughout a variety of sub-models.
Figure from [20]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Algorithm overview of BERTopic’s default model. Figure from [20]. . . . . . . . . . . . . . . 11
5 Input structure for BERT. Figure from [8]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
6 Algorithm overview for keyword extraction in topic n with KeyBERTInspired which is based
on KeyBERT. Figure from [26]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
7 Illustration of the project processes covered in Section 3 . . . . . . . . . . . . . . . . . . . . 20
8 Schematic representation of the LDA modeling workflow, from the initial dataset, to prepro-
cessing, text representation, modeling and finally visualization of topics and extracting topic
information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
9 Schematic representation of the BERTopic modeling workflow, from the original dataset to
topic visualization and information extraction. . . . . . . . . . . . . . . . . . . . . . . . . . . 22
10 Word count distribution for the complete dataset of 5000 transcribed call recordings. . . . . . 25
11 Example of a small corpus represented in bag-of-words format, i.e., a list of (token_id,
token_count) tuples. In this example, the corpus consists of three documents and the bag-of-
words representation contains three list objects each consisting of four, three and six (token_id,
token_count) tuples respectively. For example, this representation shows that token_id = 3
was counted one time in the first and second document, and three times in the third document. 26
12 Coherence and Perplexity Scores for the LDA model with different numbers of topics T . The
coherence score graph has peaks for T = 5, 7, 9, 11 and 15. T = 10 is also interesting since
it lies in-between two peaks and has a high coherence score with a lower number of topics
than the two peaks to its right (T = 10 < T = 11 < T = 15), and it therefore has a
higher perplexity score. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
13 Visualization of topics with PyLDAvis for the six candidate final models. Each subfigure
represents one of the six candidates LDA models with the number of topics that produced the
best coherence and perplexity scores, as presented in Table 4 and Figure 12. The size of each
bubble is proportional to the percentage of unique tokens attributed to each topic. . . . . . . 31
14 Topic visualization with PyLDAvis for the final LDA model with 10 topics (T = 10). . . . . . 32
15 Visualization in a word cloud of the top n = 10 terms t for class c. Each subfigure represents
one of the topics presented in Table 10. Due to confidentiality reasons, each specific term t
cannot be explicitly presented. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
16 Visualization in a word cloud of the top n = 10 terms t for class c. Each subfigure represents
one of the topics presented in Table 12. Due to confidentiality reasons, each specific term t
cannot be explicitly presented. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
17 LDA model implementation in Python using the models.ldamodel module from the Gensim
library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
18 BERTopic model implementation in Python using the BERTopic module from the bertopic
library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
19 Functions for topic extraction within the bertopic library in Python. Figure from [19]. . . . 53
20 Functions for topic visualization within the bertopic library in Python. Figure from [19]. . . 54

vi
List of Tables
1 Applied filters for downloading the call recordings. . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Illustration of how the concatenated transcriptions are saved as output in the .csv file. . . . 24
3 Descriptive statistics for the concatenated dataset of transcribed call recordings. . . . . . . . 24
4 Coherence and Perplexity Scores for different numbers of topics. The rows highlighted in green
represent the best models from observing the coherence and perplexity score graphs from
Figure 12, with regards to the "peaks" in Figure 12a and the largest (best) perplexity scores
in Figure 12b. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5 Evaluation metrics for the final LDA model (T = 10). . . . . . . . . . . . . . . . . . . . . . . 32
6 Relative size of each topic with regards to the percentage of tokens assigned to that topic. . . 32
7 Output from 5 randomly selected documents dn (n = {1, 602, 974, 3501, 4982}) within the
document corpus D, showing the probability distribution of each t ∈ {1, ..., T = 10} for
those documents. The numbers are rounded to 4 decimal points for a better overview. The
probabilities in each row sum to 1 (if not rounded). . . . . . . . . . . . . . . . . . . . . . . 33
8 Distribution of the n = 10 words with the highest probability of belonging to each topic t. Due
to confidentiality reasons, each specific word w_{φt,n} for topic t given the word-topic distribution
φt cannot be explicitly presented. The probability of each word w_{φt,n} given the word-topic
distribution φt is denoted as φ_{w,z} where z is the word-topic assignment, as defined in 2.5. . . 33
9 The number of documents, i.e., customer calls, assigned to each topic obtained from the default
model together with the model Coherence Score. . . . . . . . . . . . . . . . . . . . . . . . . . 34
10 The top n = 10 terms t and their importance to a class c, i.e., the c-TF-IDF score Wt,c which is calculated
as defined in Section 2.6.5. Due to confidentiality reasons, each specific term t for class c cannot
be explicitly presented. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
11 The number of documents, i.e., customer calls, assigned to each topic obtained from the final
model together with its Coherence Score. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
12 The top n = 10 terms t and their importance to a class c, i.e., the c-TF-IDF score Wt,c which is calculated
as defined in Section 2.6.5. Due to confidentiality reasons, each specific word t for class c cannot
be explicitly presented. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
13 Comparison between LDA and BERTopic in the context of practical application scenarios.
Table inspired by [13]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
14 Stopwords used in both LDA and BERTopic. A total of 354 stopwords, using the standard
NLTK stopword library for English, as well as added stopwords that are project specific. The
added stopwords are mainly mistranslations and the most frequent names. . . . . . . . . . . . 48

List of Algorithms
1 Generative process for LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Algorithm for BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 DBSCAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

List of Acronyms
API Application Programming Interface

BERT Bidirectional Encoder Representations from Transformers


BOW Bag of Words

CER Character Error Rate


CPU Central Processing Units
CSV Comma-Separated Values
c-TF-IDF class-based Term Frequency - Inverse Document Frequency

DBSCAN Density-Based Spatial Clustering of Applications with Noise

GCP Google Cloud Platform


GDPR General Data Protection Regulation
GPU Graphical Processing Units

HDBSCAN Hierarchical Density-Based Spatial Clustering of Applications with Noise

LDA Latent Dirichlet Allocation


LLM Large Language Model
LSI Latent Semantic Indexing

ML Machine Learning
MLM Masked Language Modeling
MT-DNN Multi-Task Deep Neural Network

NLP Natural Language Processing


NLTK Natural Language Toolkit
NPMI Normalized Point-Wise Mutual Information
NSP Next Sentence Prediction

PLSA Probabilistic Latent Semantic Analysis


POC Proof of Concept

RNN Recurrent Neural Network

SBERT Sentence-BERT
SER Sentence Error Rate

TF-IDF Term Frequency - Inverse Document Frequency

UMAP Uniform Manifold Approximation and Projection

VM Virtual Machine

WAV Waveform Audio File Format


WER Word Error Rate


1 Introduction
In this section, the subject and objectives of this master thesis project are presented. This section will provide
a problem description, followed by the aim and purpose of the thesis along with the project delimitations
and confidentiality disclosures.

1.1 Background
This thesis is conducted in collaboration with a European financial service provider. Due to confidentiality
reasons, the name of the company will be kept anonymous and will be referred to as "the partner company"
or "the company" in this thesis.

1.2 Problem Description


Today, the company receives approximately 10,000-20,000 phone calls every quarter to their customer service
and sales department. The calls are recorded and saved, but no analysis on the call recordings is performed.
This data can be seen as an unexplored gold mine of insights into what the customers are calling about. By
analyzing the data from these calls, the company could potentially identify trends and patterns in customer
behavior that could help them improve their services.
This master thesis is conducted at the Data & ML team, who are responsible for building and deploying
machine learning models in the company’s production environment. Together with the company, we have
identified one main use-case for this thesis project - to build automated categorizations of customer calls
based on what the customers are calling about.

1.2.1 Categorization
What are the customers calling about?
As it stands today, no systematic, data-driven or automatic extraction of what the customers are calling
about is implemented. All such information is gathered from the employees and their general experience
and expertise in this area. This can lead to biased information as the same conversation may be perceived
differently between employees. Further, this could potentially result in a misrepresentation of conversation
categories, whereas a data-driven solution is more consistent in its interpretation of topics given a word
distribution or the semantic structure of a conversation.
The company has a vision to find categorizations based on different business areas, and over time track and
analyze why the customers are calling, i.e., what the customers are calling about. This would allow the
company to quickly identify new categories and trends, and seamlessly adapt the information shared in calls
or on the website accordingly.

1.3 Aim of Thesis


The categorization problem requires us to build and compare unsupervised machine learning approaches to
extract information of the latent topics that occur in the calls. In this thesis, this will be done using two
different topic modeling approaches, Latent Dirichlet Allocation (LDA) and BERTopic.
Topic models are generative machine learning models for discovering latent topics that occur in a corpus of
documents [36]. Specifically, topic modeling is a field of Natural Language Processing, or NLP for short,
which is a field of Artificial Intelligence that aims towards closing the gap between human language and the
way systems interpret this information [53].
LDA was introduced in 2003 and is a probabilistic generative model with a three-level hierarchical Bayesian
structure [4], and is widely considered to be the tried-and-tested best-practice approach in topic modeling.
BERTopic is a cutting-edge topic modeling algorithm introduced in 2019, which uses contextualized word
representations learned from a large corpus of text using a deep neural network, called BERT embeddings.
This embedding component allows BERTopic to capture the semantic meaning and contextual relationships
among words [29], compared to LDA which doesn’t capture any semantic meaning or contextual information.
The aim of the thesis project is to build and compare the efficiency of implementing LDA and BERTopic for
categorization of the company’s customer calls, which can hopefully act as a proof of concept (POC) that
there are valuable insights to obtained from the call recordings. If the results provide meaningful insights on
the customer experience, there is an intention from the company to further explore other use-cases connected
to this call data to gather valuable insights and streamline their customer service.


1.3.1 Research Questions


Based on the aim of the thesis, the project will try to answer the following research questions:
• Based on what the customers are calling about, can we find general topics and categorize
them based on the different business areas?
• Can the acquired categorizations provide insights on how the company can improve and
streamline their customer service?
• Compare the performance of two different topic modeling methods, LDA and BERTopic.
Based on evaluation measures and human interpretation of topic quality, which is the
most suitable method using the company’s customer call data?

1.3.2 Purpose
The insights obtained from the project can hopefully be used in several different departments within the
company. With the ability to track and analyze how the topics in phone calls change over time, they can
implement changes to improve the overall customer experience. Some examples of questions that have the
potential to be answered with our study are:
– If the company receives a lot of calls with information available on their web-page, should they change
how and where the information is presented? Is some information missing that should be added or
clarified?
– If the analysis indicates that customers often complain about a particular issue, should the company
take steps to address that issue and improve the customer experience? For example, if queue time is
a common complaint, should they investigate their resource allocation in the customer service depart-
ment?
– Can analyzing this call data help the company identify areas where their employees may need
additional training or support?

1.4 Delimitations
The most crucial limitation of this thesis project is the 20-week time frame, during the Swedish spring term
of 2023, in which we have access to the data and can perform our analysis. A large part of the time is
dedicated to data preprocessing and construction of the dataset. As a consequence, less time is available to
the modeling process which requires us to delimit our scope as follows:
The main delimitation is that we won’t have time to evaluate the data quality, which can significantly affect
the results obtained. The call recordings first need to be transcribed to text, which can introduce bias in
the data from errors in the transcription. Additionally, the transcribed texts also need to be translated
from Swedish to English to utilize the best pre-existing libraries trained on the English language. This could
also lead to a potential source of error, where mistranslations could introduce bias in the dataset and affect
the results. This brings a challenge for us to carefully consider the reliability of the results due to potential
sources of error that could affect the data quality negatively.
Another delimitation is that we only cover incoming calls to the customer service department, and therefore
no outgoing calls or calls to the sales department are included in the dataset. The extracted calls used in the
project also have a minimum call duration of 60 seconds, which excludes all calls shorter than that duration.

1.5 Confidentiality
The data used for this master thesis comprises confidential customer cases containing private customer
information. To honor the company’s request, no information that can be linked directly to a person,
organization, or location will be revealed. Tables and diagrams will be censored to prevent confidential
information from being exposed. However, the evaluation and model selection processes will be presented in
a way that allows readers to replicate the methods used for similar tasks.


2 Theory
In this section, the underlying theory for this thesis is provided. The section is divided into seven subsections:
2.1 Natural Language Processing, 2.2 Data Cleaning and Preprocessing, 2.3 Text Representation, 2.4 Metrics
for Distance, 2.5 Latent Dirichlet Allocation, 2.6 BERTopic and 2.7 Topic Model Evaluation. Each subsection
will explore the theoretical underpinnings of the algorithms used to accomplish specific tasks in the thesis
work and provide an overview of the machine learning models that were analyzed and implemented.

2.1 Natural Language Processing


While Natural Language Processing (NLP) is primarily considered a subfield of Artificial Intelligence, it has
its foundations in linguistic theory. By combining long-established linguistic theory with modern machine
learning techniques, a system can generate a knowledge base from text and present it in a user-friendly manner
[38]. NLP in the context of artificial intelligence focuses on enabling computers to understand, interpret and
generate human language. NLP combines computational linguistics and computer science to analyze and
synthesize naturally occurring language data [53].
NLP has a wide range of applications, including machine translation, sentiment analysis, speech recognition,
and text generation. These applications are made possible by various techniques and algorithms used in NLP,
such as tokenization and lemmatization [38].
One of the main challenges in NLP is dealing with the ambiguity and complexity of human language. To
overcome this challenge, NLP systems often use machine learning algorithms to learn from large amounts of
data and improve their performance over time. Overall, NLP is a rapidly growing field with many exciting
developments and applications. Its ability to enable computers to process human language has the potential
to revolutionize the way we interact with technology.

2.1.1 Topic Modeling


In statistics and NLP, a topic model is a probabilistic generative model for discovering latent topics that occur
in a corpus of documents. Topic modeling originates from latent semantic indexing (LSI), a non-probabilistic
method for automatic indexing and retrieval of semantic structures within texts, introduced by Deerwester et
al. in 1990 [7]. Based on LSI, probabilistic latent semantic analysis (PLSA) was proposed as a first genuine
probabilistic topic model by Hofmann in 2001 [31], which was further generalized by Blei et al. in 2003 [4]
who introduced the perhaps most commonly used topic model today, latent Dirichlet allocation (LDA).
The relevant theory used to describe the two topic modeling techniques used in this thesis work, Latent
Dirichlet Allocation and BERTopic, is presented in Section 2.5 and Section 2.6 respectively.

2.1.2 Transformers
The Transformer model is a neural network architecture designed for NLP tasks and was proposed by Vaswani
et al. in 2017 [72]. Since then, it has become one of the most widely used models in the field as it offers
significant improvements in comparison with, for example, Recurrent Neural Network (RNN) which was the
previous best-performing sequential model. This improvement was largely due to the use of a self-attention
mechanism, which replaced the need for a recurrent component in the architecture. This mechanism allows
different positions of a sequence to relate to each other to generate a sequence representation. By using this,
the Transformer is able to better understand dependencies in long sequences which results in an improved
efficiency and reduced computation time.
The Transformer consists of an encoder and a decoder, both composed of multiple layers of self-attention and
feed-forward neural networks. The encoder and decoder components of the Transformer model are described
by Vaswani et al. [72] as follows:
Encoder
The encoder takes an input sequence and produces a sequence of hidden states that capture the meaning
of each token in the input sequence. Each layer in the encoder consists of two sub-layers: a multi-head
self-attention mechanism and a position-wise feed-forward network. The self-attention mechanism computes
a weighted sum of the input sequence tokens based on their similarity to each other, while the feed-forward
network applies a non-linear transformation to each token independently.


Decoder
The decoder takes as input the output sequence produced by the encoder and generates a new sequence
token by token. Each layer in the decoder has three sub-layers: a multi-head self-attention mechanism, an
encoder-decoder attention mechanism, and a position-wise feed-forward network. The self-attention mecha-
nism computes a weighted sum of the decoder’s own output tokens based on their similarity to each other,
while the encoder-decoder attention mechanism computes a weighted sum of the encoder’s output tokens
based on their similarity to each decoder output token.

Figure 1: Model architecture of the Transformer. Figure from [72].

As mentioned, the self-attention mechanism allows the model to attend to different parts of the input sequence,
while the feed-forward neural networks provide a non-linear transformation of the hidden states at each
layer, which allows the model to capture complex patterns in the input sequence [72]. The architecture of the
transformers is presented in Figure 1.
Self-Attention
In the paper proposing the Transformer model by Vaswani et al. [72], one of the key contributions to the
field is the self-attention mechanism. The self-attention mechanism allows the Transformer model to compute
representations of its input and output sequences by attending to different positions of the same sequence.
In other words, it allows the model to weigh the importance of different positions of the input sequence when
computing a representation for each position.
Vaswani et al. describes the inner workings of self-attention as computing three matrices from the input
sequence: a query matrix, a key matrix, and a value matrix. Each matrix is computed by multiplying the
input sequence with a learned weight matrix. The query matrix is then used to compute a set of attention
scores between each position in the input sequence and every other position. These attention scores are used
to weight the value matrix, which is then summed up to produce a weighted representation of each position
in the input sequence [72].
As mentioned above in the encoder and decoder descriptions, the self-attention mechanism is used in both
the encoder and decoder components of the Transformer model. In the encoder, it allows the model to attend
to different positions of the input sequence when computing hidden representations for each position. In the
decoder, it allows the model to attend to different positions of the output sequence when generating each
token.
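To make the computation concrete, the following minimal Python sketch implements scaled dot-product self-attention with NumPy. The matrix names, dimensions and random inputs are illustrative assumptions and are not taken from [72] or from the thesis implementation.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) token representations; W_q, W_k, W_v: learned weight matrices.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # query, key and value matrices
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # attention scores between all positions
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of values per position

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # toy sequence: 4 tokens, model dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)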


2.2 Data Cleaning and Preprocessing


Data cleaning and preprocessing are important steps when analyzing textual data. These steps are performed
before the vectorization of the data. The preprocessing and data cleaning techniques used in this thesis
include tokenization, lemmatization, stopword removal, lowercasing and non-alphabetical character removal.
These techniques aim to prepare the data for NLP computations and further remove noise and meaningless
information from the data, which results in a decrease in complexity and size of the dataset.

2.2.1 Tokenization
Tokenization is the process of breaking down a text document into smaller units called tokens, which are
usually words, sub-words or symbols. It is an essential step within NLP where the tokens serve as the basic
building blocks for subsequent NLP tasks such as language modeling, text classification, sentiment analysis
and machine translation [57].
The tokenization process involves separating words and punctuation marks from each other to create a list of
individual units that can be analysed and processed. This is typically done by using a tokenization strategy,
which can be rule-based or statistical. Rule-based tokenizers rely on pre-defined rules to identify tokens,
while statistical tokenizers use machine learning algorithms to learn patterns in the text [3]. The choice of
tokenization method often depends on the type of text being analysed, language in the text and the specific
NLP task being performed.

2.2.2 Lemmatization
Lemmatization takes the process of breaking down a text document a step further by analyzing the meaning
behind each word. While tokenization breaks a text into individual units, lemmatization is the process of
reducing the inflectional forms of each word to a common base or root, known as a lemma. This enables
different forms of a word to be treated as the same word and improves the accuracy of the text analysis [43].

2.2.3 Lowercase and Non-Alphabetical Character Removal


Converting all the words in textual data to lowercase is a common way to reduce the noise in the data. This
allows a normalization of the text to lowercase so that the distinction between words such as The and the is
ignored [3]. For the purposes of topic modeling where the desired output is topics strictly based on words,
another preprocessing technique to reduce noise is to remove all non-alphabetical characters. This includes
numbers and special characters such as exclamation marks, semicolons, colons and question marks.

2.2.4 Stopword Removal


Stopwords are words that are commonly used in a language and are usually removed from texts before
processing for analysis. Removing stopwords in NLP tasks can be beneficial for several reasons. One reason is
that they do not carry much meaning and can be considered noise in the data. Removing stopwords also helps
reduce the size of the text and avoids irrelevant information [3]. These words include “the”, “a”, “an”, “in”,
“on”, “at”, etc. However, removing stopwords is not a hard and fast rule in NLP, and the specific stopwords
depend on the task at hand. For a discussion on the use of stopwords in this project, see Section 5.2.4.
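As an illustration of how the steps in Sections 2.2.1-2.2.4 can be combined, the following is a minimal Python sketch using the NLTK library; the example sentence, the exact ordering of the steps and the use of the standard English stopword list are our own illustrative choices, not a prescription from the cited sources.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# The tokenizer models, stopword list and WordNet data may need to be downloaded first:
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

def preprocess(text):
    tokens = word_tokenize(text.lower())              # tokenize and lowercase
    tokens = [t for t in tokens if t.isalpha()]       # remove non-alphabetical tokens
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]    # remove stopwords
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]  # reduce each word to its lemma

print(preprocess("The customers were calling about their delayed payments!"))
# e.g. ['customer', 'calling', 'delayed', 'payment']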

2.2.5 WordPiece
WordPiece is a subword tokenization method introduced by Wu et al. in 2016 [74] to break down words into
smaller subword units. This approach is effective in handling out-of-vocabulary words and morphologically
complex languages, and it reduces the number of vocabulary entries needed as well as the use of unknown tokens. For instance, consider the
following sentence:

An example that will show how, with the help of wordpiece, BERT tokenize

After using WordPiece tokenizer the sentence would transform to:

"an" "example" "that" "will" "show" "how" "with" "the" "help" "of" "word" "##piece" "bert" "token"
"##ize"

Words that do not exist in the vocabulary get split into subwords, where ## marks the sectioning.
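A minimal sketch of the same behaviour using the WordPiece tokenizer bundled with a pre-trained BERT model in the Hugging Face transformers library; the checkpoint name is an assumed example, and any BERT model with a WordPiece vocabulary behaves similarly.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed example checkpoint
print(tokenizer.tokenize(
    "An example that will show how, with the help of wordpiece, BERT tokenize"
))
# Words missing from the vocabulary are split into subwords marked with ##,
# e.g. 'wordpiece' may become 'word', '##piece'.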


2.3 Text Representation


In order for machines to understand and decode the information contained within language, the raw text
must be converted into a suitable numerical form, a process which in the field of NLP is referred to as text
representation. This section will describe methods for transforming text, along with a discussion of key
elements in the field.

2.3.1 Bag-of-Words
Within NLP, Bag-of-words (BOW) is a processing technique of text modeling used to represent textual data
as a collection of words, without considering grammar or the order of which the words appear in the text. It
involves representing each document as a fixed-length vector with length equal to the vocabulary size. Each
dimension of this vector corresponds to the count or occurrence of a word in a document [37]. This technique
makes variable-length documents more amenable for use with a large variety of ML models and tasks.
A fixed-length document representation means that you can easily feed documents with varying length into
ML models. This allows you to perform clustering or topic classification on documents. The structural
information of the document is removed and models have to discover which vector dimensions are semantically
similar. Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance
between items is based on the likeness of their meaning or semantic content as opposed to lexicographical
similarity [37].
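A minimal Python sketch of a bag-of-words representation using the Gensim library, which produces the list of (token_id, token_count) tuples illustrated later in Figure 11; the three toy documents are our own illustration.

from gensim.corpora import Dictionary

docs = [["invoice", "payment", "late"],            # three toy documents, already tokenized
        ["payment", "card", "blocked"],
        ["invoice", "invoice", "question"]]

dictionary = Dictionary(docs)                      # maps each unique token to an integer id
bow_corpus = [dictionary.doc2bow(d) for d in docs]
print(bow_corpus)
# Each document becomes a list of (token_id, token_count) tuples; for example,
# the third document counts the token "invoice" twice.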

2.3.2 Word Embedding


Word embeddings are dense, fixed-length vectors that represent words in a continuous vector space. They
are used in NLP and text analysis to capture the meaning of words and their relationships with other words
in a corpus [41]. The method aims to assign a real-valued vector to each word that encodes its meaning and
semantics in such a way that similar words have vectors that are close together in the vector space. This
is often achieved by training a model to map words and sentences to meaningful real-valued vectors, which
enables comparisons between text units based on the distance and angle between vectors. Once trained,
these models can be used to generate word embeddings for new words or to compare the similarity between
different words [1].

2.3.3 TF-IDF - Term Frequency-Inverse Document Frequency


TF-IDF stands for Term Frequency-Inverse Document Frequency and is a numerical statistic that reflects
how important a word is to a document in a collection or corpus. It is used to examine the relevance of key-
words to documents in a corpus. TF-IDF is a combination of Term Frequency, TF, and Inverse Document
Frequency, IDF.
TF represents how many times a word is present in a document in relation to how long the document is. TF
is calculated as:

TF = \frac{\text{number of times the term appears in the document}}{\text{total number of terms in the document}}.

IDF, on the other hand, tries to address the issue of different keywords having the same importance. IDF
assigns low weight to words that occur more frequently and high weight to words that are infrequent. IDF
is calculated as:

IDF = \log\left(\frac{\text{number of documents in the corpus}}{\text{number of documents in the corpus containing the term}}\right).

The TF-IDF value increases proportionally to the number of times a word appears in the document but is
offset by the frequency of the word in the corpus. This helps adjust for words that appear frequently across
all documents and therefore may not be as important [59].
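The two quantities can be computed directly from the definitions above; the following minimal Python sketch does so for a toy corpus. The documents are our own illustration, and in practice a library implementation (e.g., scikit-learn's TfidfVectorizer) would normally be used.

import math

corpus = [["loan", "rate", "loan"],
          ["rate", "question"],
          ["card", "blocked", "card", "card"]]

def tf(term, document):
    # Share of the document's terms that equal `term`.
    return document.count(term) / len(document)

def idf(term, corpus):
    # Log of (number of documents) / (number of documents containing the term).
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

print(tf_idf("loan", corpus[0], corpus))  # frequent in the document, rare in the corpus: high weight
print(tf_idf("rate", corpus[0], corpus))  # appears in two of three documents: lower weight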

2.4 Metrics for Distance


Distance measures play an important role in machine learning and NLP. A distance measure is an objective
score that summarizes the relative difference between two objects in a set of points, called a space. A distance
measure is a function d(x, y) that takes two points in the space as arguments and produces a real number
and satisfies the following axioms:


1. d(x, y) ≥ 0 (no negative distances).


2. d(x, y) = 0 if and only if x = y (distances are positive, except for the distance from a point to itself).
3. d(x, y) = d(y, x) (distance is symmetric).
4. d(x, y) ≤ d(x, z) + d(z, y) (the triangle inequality).

The fourth axiom, the triangle inequality, is the most complex condition. It states that if we are traveling
from point x to point y, we cannot obtain any benefit if we are forced to travel via some particular third
point z. The triangle-inequality axiom is what makes all distance measures behave as if distance describes
the length of a shortest path from one point to another [40].

2.4.1 Euclidean Distance


The Euclidean distance is a measure of distance between two points in a two- or multi-dimensional space.
It is calculated as the square root of the sum of the squared differences between corresponding elements of
the two vectors. In other words, if we have two vectors A and B with n elements each, then their Euclidean
distance is given by:
d_{\text{Euclidean}}(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}

where the Euclidean distance is always greater than or equal to 0, i.e., d_Euclidean(A, B) ≥ 0 [40].

2.4.2 Cosine Similarity


Cosine similarity measures the cosine of the angle between two vectors projected in a multi-dimensional space.
It is commonly used in text analysis because it can capture semantic similarity between words represented
in vector space.
The cosine similarity is defined as follows:
\cos(\theta) = \frac{A \cdot B}{\|A\|_2 \, \|B\|_2}

where A and B are two vectors and θ is the angle between them.
Cosine similarity is not affected by the length of a document and the score always falls between 0 and 1. A
higher score indicates that the vectors are more similar to each other [32].
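A minimal Python sketch computing both measures for two toy vectors with NumPy; the vectors are our own illustration.

import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 2.0, 1.0])

# Euclidean distance: square root of the sum of squared element-wise differences.
euclidean = np.sqrt(np.sum((A - B) ** 2))

# Cosine similarity: dot product divided by the product of the L2 norms.
cosine = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))

print(euclidean)  # about 2.24
print(cosine)     # about 0.80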

2.5 Latent Dirichlet Allocation


LDA is an extension of Hofmann’s work on PLSA, mentioned in Section 2.1.1, which introduces sparse
Dirichlet prior distributions over document-topic and topic-word distributions. While PLSA was a useful
step toward probabilistic modeling of text, it is incomplete in that it provides no probabilistic model at the
level of documents [4].
Thus, compared to PLSA, LDA is a more complete probabilistic generative model with a three-level hierar-
chical Bayesian structure, described in Section 2.5.2, for its components: documents, topics and words [39].
LDA is considered state of the art in topic modeling and is a powerful textual analysis technique that uses
statistical correlation between words in multiple documents to find and quantify the underlying topics [36].
Notation
Formally, the terminology used to describe LDA by Blei et al. [4] is defined as follows:
• A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, ..., V }.
Words are represented using unit-basis vectors that have a single component equal to 1 and all other
components equal to 0. Thus, using superscripts to denote components, the v:th word in the vocabulary
is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v.
• A document is a sequence of N words denoted by d = (w1 , w2 , ..., wN ), where wn is the n:th word in
the sequence.


• A corpus is a collection of M documents denoted by D = {d1, d2, ..., dM}, where dm is the m:th document
in the corpus.
The basic idea of LDA is that documents are represented as random mixtures over latent topics, where each
topic is characterized by a distribution over words. The words that have the highest probability on each topic
are usually used to determine what the topic is. LDA assumes that every document can be represented as
a probabilistic distribution over latent topics, as shown in Figure 2. The topic distribution in all documents
shares a common Dirichlet prior. Each latent topic in the LDA model is also represented as a probabilistic
distribution over words, with the word distributions of topics sharing a common Dirichlet prior. The objective
of LDA is not only to find a probabilistic model of a corpus that assigns high probability to members of the
corpus, but also assigns high probability to other "similar" documents [4].

Figure 2: Graphical model representation of LDA. Nodes represent random variables, links between nodes
are conditional dependencies and boxes are replicated components.

2.5.1 Generative Process for LDA


LDA assumes the following generative process for each document d in a corpus D consisting of M documents
having N_d words (d ∈ {1, ..., M}):

Algorithm 1 Generative process for LDA


(1) Choose a multinomial distribution φ_t for topic t ∈ {1, ..., T} from a Dirichlet distribution with parameter β.
(2) Choose a multinomial distribution θ_d for document d ∈ {1, ..., M} from a Dirichlet distribution with
parameter α.
(3) For each word w_n (n ∈ {1, ..., N_d}) in document d:
(a) Choose a topic z_n from θ_d.
(b) Choose a word w_n from φ_{z_n}, a multinomial probability conditioned on the topic z_n.

The generative process above has only the words in documents as observed variables, while the rest are latent
variables (φ and θ) and hyperparameters (α and β). To find the latent variables and hyperparameters, the
probability of the observed data D is calculated and maximized as follows:


p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d

The topic Dirichlet prior has parameter α and the word-topic distributions come from a Dirichlet distribution
with parameter β. T is the number of topics, M the number of documents, and N_d the number of words in
document d. The Dirichlet-multinomial pair (α, θ) is used for the topic distributions in the whole corpus.
The Dirichlet-multinomial pair (β, φ) is used for the word distributions in each topic. The variables θ_d are
the document-level variables, while z_{dn} and w_{dn} are word-level variables sampled for each word in each
document [11].
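To make the generative process in Algorithm 1 concrete, the following minimal NumPy simulation draws a small synthetic corpus; the number of topics, vocabulary size, corpus size and hyperparameter values are toy choices of our own, not those used later in the thesis.

import numpy as np

rng = np.random.default_rng(0)
T, V, M, N_d = 3, 20, 5, 15            # topics, vocabulary size, documents, words per document
alpha, beta = 0.5, 0.1                 # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=T)   # step (1): word distribution phi_t per topic

documents = []
for d in range(M):
    theta = rng.dirichlet(np.full(T, alpha))    # step (2): topic distribution theta_d
    words = []
    for n in range(N_d):
        z = rng.choice(T, p=theta)              # step (3a): draw a topic for word n
        w = rng.choice(V, p=phi[z])             # step (3b): draw a word from that topic
        words.append(w)
    documents.append(words)

print(documents[0])                             # word indices into the vocabulary {0, ..., V-1}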

2.5.2 Hierarchical Bayesian Structure


LDA has a three-layer hierarchical Bayesian structure for its components: documents, topics and words,
where each layer has its own prior distribution that depends on the parameters in the previous layer. A
hierarchical Bayesian model has a hierarchical structure to make inferences about parameters in the model
using Bayesian statistics, based on Bayes’ Theorem [39].
Bayesian statistics is a branch of statistics that involves using prior knowledge and probability distributions
to update and make inferences about the parameters of a statistical model. It is based on the Bayesian
interpretation of probability where available prior knowledge about parameters in a statistical model is
updated with the information in observed data [71].
One of the key advantages of Bayesian statistics is its ability to incorporate prior knowledge into the analysis.
This prior knowledge can come from previous studies, expert opinion, or other sources. By incorporating
this prior knowledge into the analysis, Bayesian statistics can often provide more precise estimates of the
parameters of interest, especially when data are limited.
Components of Bayesian Statistics
There are three essential components underlying Bayesian statistics first described by T. Bayes in 1763 [2]:
The first component, the prior distribution, refers to the prior knowledge on the model parameters being
tested, i.e., all knowledge available before observing any data. The variance of the prior distribution is
expressed as precision, which is simply the inverse of the variance and an indication of the uncertainty about
the population value of our parameter of interest; the larger the variance, the more uncertainty there is [71].
The second component, the likelihood function, is the observed evidence expressed in terms of the likelihood
of the data given the parameter values. The likelihood function asks what the likelihood of the data in hand
is, given a set of parameters such as the mean and/or the variance [71].
The third component, the posterior distribution, is based on combining the prior distribution and likelihood
function, and is the updated probability distribution that represents the uncertainty about the parameter
after observing the data [71].
Bayes’ Theorem
The three components of Bayesian statistics described above constitutes Bayes’ theorem, which describes the
probability of an event, based on prior knowledge that might be related to the event. This probability is
commonly referred to as the conditional probability or the posterior probability. The theorem expresses how
a degree of belief, expressed as a probability, should change to account for the availability of related prior
evidence [2].
Bayes’ theorem is stated mathematically by the following equation:

p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)},

where A and B are events and p(B) ≠ 0.
• p(A|B) is the posterior probability of A given B.
• p(B|A) is the likelihood of B given a fixed A.
• p(A) and p(B) are the probabilities of observing A and B without any given conditions; they are also
called the prior probability and the marginal probability respectively.


Specifically, the theorem states that the posterior probability of an event is proportional to the product of
the prior probability of the event and the likelihood of the data given the event.
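As a small worked example of the theorem in Python, with numbers invented purely for illustration:

# Suppose 1% of calls concern fraud, p(A) = 0.01; a keyword flag fires for 90% of fraud
# calls, p(B|A) = 0.9; and the flag fires for 5% of all calls, p(B) = 0.05.
p_A, p_B_given_A, p_B = 0.01, 0.9, 0.05
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # 0.18: posterior probability that a flagged call concerns fraud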

2.5.3 Dirichlet Distribution


The Dirichlet distribution is a multivariate continuous distribution in probability theory and statistics and one of the
most basic models for proportional data, such as the mix of vocabulary words in a body of text. The Dirichlet
distribution is a generalization of the Beta distribution describing the probabilities of outcomes. Unlike the
Beta distribution which describes the probability of one of two outcomes of a Bernoulli trial, the Dirichlet
distribution describes the probability of T outcomes. The Beta distribution can therefore be seen as the
special case of the Dirichlet Distribution with T = 2 [52]. The (prior) probability density function for the
Dirichlet distribution can be defined as:

f(\theta; \alpha) = \frac{1}{B(\alpha)} \prod_{t=1}^{T} \theta_t^{\alpha_t - 1},

where:

B(\alpha) = \frac{\prod_{t=1}^{T} \Gamma(\alpha_t)}{\Gamma\left(\sum_{t=1}^{T} \alpha_t\right)}

is the multivariate Beta function, as defined in [5].


The Dirichlet distribution has support on the interval [0, 1] such that θ_t ∈ [0, 1] and ∑_{t=1}^{T} θ_t = 1, where Γ(x) is the
gamma function. The parameters are α = (α_1, α_2, ..., α_T) with α_t > 0 for all t.
In Bayesian statistics, the Dirichlet distribution is commonly used as a conjugate prior for the multinomial
distribution, which is the case in LDA. This means that if we have observed data from a multinomial
distribution, we can update our beliefs about the probabilities of the categories by multiplying the Dirichlet
prior with the multinomial likelihood, and obtain a new posterior Dirichlet distribution [12].
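As a minimal sketch of this conjugate update, assume a symmetric Dirichlet(1, 1, 1) prior over three topic proportions and some hypothetical observed counts; the posterior is again a Dirichlet whose parameters are the prior parameters plus the counts.

import numpy as np
from scipy.stats import dirichlet

alpha_prior = np.array([1.0, 1.0, 1.0])   # symmetric Dirichlet prior over 3 topics
counts = np.array([10, 3, 7])             # hypothetical multinomial counts

# Conjugacy: Dirichlet prior + multinomial counts -> Dirichlet posterior.
alpha_posterior = alpha_prior + counts

# Posterior mean of the topic proportions theta.
print(dirichlet.mean(alpha_posterior))    # approximately [0.478, 0.174, 0.348]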

2.6 BERTopic
BERTopic is a cutting-edge topic modeling algorithm introduced by Maarten Grootendorst in his paper
"BERTopic: Neural topic modeling with a class-based TF-IDF procedure" from 2022 [29]. The name BERTopic
combines "BERT", Bidirectional Encoder Representations from Transformers, with "topic", which refers to
the goal of the algorithm, i.e., extracting meaningful topics from text data.
The key innovation of BERTopic is its use of BERT embeddings, which are contextualized word representa-
tions learned from a large corpus of text using a deep neural network. These embeddings capture the rich
conceptual information of each word in a document, allowing BERTopic to capture the semantic meaning
and contextual relationships among words, resulting in more accurate topic modeling [29].
In addition to BERT embeddings, BERTopic also utilizes a class-based Term Frequency-Inverse Document
Frequency (c-TF-IDF) procedure to filter out irrelevant words and prioritize the more important ones for topic
modeling. This procedure uses the class labels of each document, such as the document source or category,
to assign class-specific weights to words in the TF-IDF calculation. This helps BERTopic to emphasize words
that are more informative for a given topic and disregard words that are less relevant [29].
BERTopic has been shown to outperform other topic modeling techniques, such as LDA, in terms of topic
coherence and interpretability. The algorithm is capable of generating high-quality and interpretable topics
from text data, making it suitable for a wide range of applications. The technical details of BERTopic,
including the steps for generating topics and the class-based TF-IDF procedure, are explained more thoroughly
in the paper by Grootendorst [29] than in this thesis.
The high-level algorithm BERTopic uses to create its topic representation consists of the five to six steps
presented in Figure 3. Within each step there is a variety of sub-models to choose from, which makes
BERTopic quite modular and allows users to build their own topic model. Each step needs to be carefully
selected; the steps are somewhat independent from one another, even though there is of course some
influence between them.


Figure 3: Illustration of BERTopic's architecture and its modularity across a variety of sub-models.
Figure from [20].

The default values for each sequence are presented to the right in Figure 4, with Fine-tune Representation as an
optional step in the model structure [20]. Each of these steps and the theory behind them will be presented
starting from the bottom and going upwards.

Figure 4: Algorithm overview of BERTopic's default model. Figure from [20].


Algorithm BERTopic
To summarize, the basic idea and high-level algorithm for BERTopic presented in Figure 3 and Figure 4 is
as follows (a minimal code sketch of such a pipeline is given after the algorithm),

Algorithm 2 Algorithm for BERTopic


(1) Select an embedding technique:
(a) SBERT, SpaCy, . . . , Transformers.
(2) Select a dimensionality reduction technique:
(a) UMAP, PCA, . . . , TruncatedSVD.
(3) Select a clustering technique:
(a) HDBSCAN, k-means, . . . , BIRCH.
(4) Select a tokenizer technique:
(a) CountVectorizer, Jieba, . . . , POS.
(5) Select a weighting scheme technique:
(a) c-TF-IDF, c-TF-IDF + BM25, . . . , c-TF-IDF + Normalization.
(6) Select a representation tuning technique (optional):
(a) GPT / T5, KeyBERT, . . . , MMR.
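As a minimal sketch, the pipeline above can be run with its default sub-models in a few lines of Python; docs is assumed to be a list of raw text documents and the parameter values are illustrative only.

from bertopic import BERTopic

# docs: a list of text documents (here, the transcribed customer calls).
# With no sub-models specified, BERTopic falls back to its defaults:
# SBERT embeddings, UMAP, HDBSCAN, CountVectorizer and c-TF-IDF.
topic_model = BERTopic(language="english", calculate_probabilities=False)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())  # overview of the extracted topics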

2.6.1 Embeddings
BERTopic starts by converting the documents into a numerical representation through word embeddings, as
described in Section 2.3.2. The default method in BERTopic for doing so is sentence-transformers,
or more specifically Sentence-BERT (SBERT), which is a modification and improvement of the pretrained
BERT network [63].
BERT
Bidirectional Encoder Representations from Transformers (BERT) was presented in late 2018 by Devlin et al.
[8] as a fine-tuning approach for NLP. As the name suggests, the model uses a transformer-based
architecture, described in Section 2.1.2. The main difference between transformers and BERT is that,
while transformers consist of both an encoder and a decoder stack, BERT consists of an encoder stack only. Its
objective is to learn language, which it achieves by producing meaningful representations of text using the
self-attention mechanism. In addition, BERT is bidirectional, i.e., it encodes tokens using information from
both directions, where the classification token [CLS] is added at the beginning of every sequence while the [SEP]
token indicates the end, see Figure 5. The input sequence length is fixed at 512 tokens: sequences shorter
than 512 tokens are padded with a [PAD] token, while sequences longer than 512 tokens are truncated. This
means that important insights and information may be lost for documents longer than 512 tokens. Finally,
BERT also uses a WordPiece embedding with a vocabulary of 30,000 word pieces, described in Section 2.2.5 [8].

Figure 5: Input structure for BERT. Figure from [8].

BERT is a pre-training strategy that succeeds in extracting deep semantic information from a sentence. It’s
trained on a large amount of text data in an unsupervised manner using two main tasks, Masked Language
Modeling (MLM) and Next Sentence Prediction (NSP) [8].
In the first unsupervised task, MLM, the advantages of a bidirectional approach can be exploited. In traditional
left-to-right or right-to-left models, the goal is to predict the next token; with a bidirectional approach, an
opportunity is created to look ahead at the next word and "cheat" when predicting it. The

MLM method, however, replaces approximately 15% of the tokens with a [MASK] token, enabling the use
of a bidirectional model and forcing the model to predict the missing word without cheating. This forces
the model to learn to predict the masked word based on the context provided by the surrounding words and
allows BERT to learn bidirectional representations of language. To avoid discrepancies between pre-training
and fine-tuning, where the [MASK] token is not present, the [MASK] token replaces the masked words only
80% of the time. In the remaining 20%, two strategies are used: in half of the cases a random token is used,
and in the other half the i-th token is left unchanged [8].
The second unsupervised task, NSP, is trained to predict whether two sentences are consecutive in a text or
not, which helps it understand sentence-level relationships. Given two sentences, A and B, the objective is
to predict whether the second sentence is the correct subsequent sentence or not. Half of the time, sentence
B is the true next sentence with respect to sentence A, and the other half a random sentence from the corpus
is selected. The corpus contains 800 million words; to put that into context, the English Wikipedia
comprises 2,500 million words, so it is a large corpus [8].
After pre-training, BERT can be fine-tuned on a specific downstream task using a smaller dataset. During
fine-tuning, BERT's parameters are updated to better fit the specific task at hand, and the generated
embeddings can be used as input features. As a result, the fine-tuning of the model can be performed with
only one additional output layer to create a state-of-the-art model. Without any substantial task-specific
architecture modifications the model is conceptually simple and empirically powerful, and can be used for a
wide range of tasks such as text classification, named entity recognition or question answering [8].
SBERT
It has been shown that the standard BERT implementation is suboptimal for sentence similarity tasks that require
the use of standard similarity measures such as Euclidean distance or cosine similarity, presented in Section
2.4.1 and Section 2.4.2 respectively. For instance, computing the pairwise similarities between 10,000 sentences
with BERT would require around 50 million inference computations, which is time-consuming and unsuitable
for unsupervised tasks like clustering. To address this issue, Reimers and Gurevych introduced Sentence BERT
(SBERT) in 2019 [63] as a solution that aims to make up for the weaknesses of BERT and further improve the
sentence-level embeddings. SBERT, a modification of the pre-trained BERT network, utilizes Siamese and
triplet network structures to derive meaningful embeddings and reduces the time required to find the most
similar pair among 10,000 sentences from around 65 hours to about five seconds.
A Siamese network consists of two identical neural networks that share the same weights and are trained on
pairs of similar and dissimilar sentences. The network further learns to encode the semantic meaning of each
sentence into a vector representation and then compares the similarity of the vectors of two sentences using
a similarity function, such as one of the measures mentioned earlier. The key contribution of SBERT, on the
other hand, is the pooling technique that is applied to the output of the second-to-last layer of the BERT model
to generate a fixed-length vector representation of the sentence. The resulting vector representation captures
the most informative features and meaning of the sentence in a compact and robust form. A more
comprehensive description of SBERT can be found in the original paper by Reimers and Gurevych [63].
Sentence Transformers
Sentence Transformer and SBERT generally refer to the same thing in the context of NLP. However, since
the introduction in 2019, many researchers and developers have built upon and extended the original SBERT
method, and they may use different names, such as sentence transformer, to refer to their implementation
[63]. The SentenceTransformer package, for example, implements various modifications and enhancements of
the SBERT method and provides an easy-to-use interface for generating sentence embeddings [61]. A large
collection of pre-trained models tuned for various tasks can be found in [62], hosted by the HuggingFace
Model Hub [35].
all-MiniLM-L6-v2
’all-MiniLM-L6-v2’ is a pre-trained transformer-based language model designed to be fine-tuned for a wide
range of NLP tasks, such as text classification, and is available on Hugging Face’s Sentence Transformers
library [33]. The model was trained on a large corpus of text data using a masked language modeling objective,
which allows it to learn contextual representations of words and sentences.
Despite its small size, 'all-MiniLM-L6-v2' offers good quality in comparison with other pre-trained transformer-
based language models. The sentence embedding model with the best average performance, 'all-mpnet-base-v2',
is only slightly better than 'all-MiniLM-L6-v2' but five times slower, which often makes 'all-MiniLM-L6-v2'
the go-to model. Its small size and fast inference speed make it suitable for deployment in resource-constrained
environments [62].


The model architecture consists of 6 transformer layers, fewer than the 12 of the original MiniLM model by
W. Wang et al. [73]. Compared to the standard BERT model, the main contribution is its approach to training:
it employs a student-teacher approach, where the student model is trained to emulate the self-attention
module of the teacher's final transformer layer. The limitation of the MiniLM model, however, is its
restriction to processing sequences with a maximum of 256 tokens, which may be considered a disadvantage
since longer texts become truncated.
all-mpnet-base-v2
'All-mpnet-base-v2' is a pre-trained neural network model that belongs to the family of models known as
Sentence Transformers and is available in Hugging Face's Sentence Transformers library [34]. The model
is based on a variant of the transformer architecture called "multi-task deep neural network" (MT-DNN) and
has 12 transformer layers, which perform multi-head self-attention and feedforward operations. Additional
information can be found in the original paper [42].
Compared to 'all-MiniLM-L6-v2', this model is, as mentioned, significantly larger and therefore requires more
computing power and can be time-consuming. On the other hand, it performs better in all cases presented in
[62] and could therefore be worth using despite its size disadvantage. An additional disadvantage compared
to the standard BERT model is that text longer than 384 tokens is truncated; compared to 'all-MiniLM-L6-v2',
on the other hand, this should be considered a benefit.
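A minimal sketch of generating fixed-length sentence embeddings with the SentenceTransformers package and comparing them with cosine similarity is shown below; the two example sentences are hypothetical, and either of the pre-trained models described above could be substituted.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["I want to change the due date of my invoice.",
             "Can the payment date of my bill be moved?"]

# Each sentence is mapped to one fixed-length embedding vector.
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings
# (util.pytorch_cos_sim in older versions of the package).
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(float(similarity))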

2.6.2 Dimensionality Reduction


Dimensionality reduction is a crucial step in the BERTopic algorithm as it enables efficient and accurate
identification of topics in large text datasets. This type of dataset is typically high-dimensional, with each
word represented as a vector in a high-dimensional space and as the number of unique words in a text
dataset grows, so does the dimensionality of the vector space. Techniques for dimensionality reduction, such
as UMAP, aim to address this problem by reducing the number of dimensions while preserving both the local
and global structure of the data [29].
UMAP
Uniform Manifold Approximation and Projection (UMAP) was introduced by McInnes et al. in their research
paper "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction" from 2018 [48].
This is a non-linear dimensionality reduction technique that aims to preserve both local and global structure
of high-dimensional data. It is based on the idea of constructing a topological representation of the data
manifold using a graph-based approach, where points close to each other in the high-dimensional space are
connected by edges in the graph. When measuring the distance between points in the high-dimensional space,
the choice of metric can have a significant impact on the quality of the resulting embedding, as different
metrics capture different aspects of the data structure. For example, the Euclidean metric measures the
straight-line distance between two points while the cosine metric measures the angle between two vectors,
described in Section 2.4.1 and 2.4.2 respectively. UMAP supports a wide range of metrics, but these two are
the most common ones and the Euclidean is the default choice.
Once the graph has been constructed, UMAP constructs a low-dimensional embedding of the data that
preserves the topological structure of the graph. This is achieved by optimizing an objective function that
balances the preservation of the global and local structure. Meaning, preserve the overall shape of the data
manifold as well as the relationships between neighboring points [48].
The algorithm is computationally efficient and can handle large datasets since there are no restrictions on
embedding dimension, making it suitable for machine learning and various applications, such as visualization,
clustering, and classification. Additionally, UMAP allows for the incorporation of additional information, such
as labels or weights, to guide the embedding process [48].
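A minimal sketch of this reduction step, assuming a hypothetical matrix of 384-dimensional sentence embeddings and illustrative (not tuned) parameter values, could look as follows.

import numpy as np
from umap import UMAP

# Hypothetical embedding matrix: 1,000 documents x 384 dimensions.
embeddings = np.random.rand(1000, 384)

# Reduce to 5 dimensions while trying to preserve local and global structure.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
reduced_embeddings = umap_model.fit_transform(embeddings)
print(reduced_embeddings.shape)  # (1000, 5)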

2.6.3 Clustering
The reduced embeddings created in Section 2.6.2 are now clustered with the purpose of grouping similar doc-
uments together based on their semantic content. This process plays a crucial role in the accuracy of topic
representations, as the effectiveness of the clustering technique directly impacts the quality of the results.
A variety of different clustering models are available for use, such as HDBSCAN, k-Means and BIRCH, pre-
sented in Figure 3, and since there is not one perfect clustering model, this modularity allows you to come as
close as possible. The default setting HDBSCAN is a density-based clustering algorithm that provides a
powerful and flexible tool for analyzing large collections of text data [27]. The resulting clusters can be used


to identify topics or themes in the collection of documents, allowing users to gain insights into the content of
the documents and the underlying patterns and trends within the data [44].
HDBSCAN
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is an extension of
DBSCAN, which is thoroughly presented below in this section. HDBSCAN utilizes a hierarchical density-
based approach to cluster data points and is designed to be highly efficient and effective, even for large and
complex datasets. The algorithm works by first constructing a hierarchical clustering tree, which then uses a
mutual-reachability graph to identify clusters within the tree. Finally, it applies a cluster stability analysis to
select the optimal clustering solution [44]. To calculate the distance between points in the data space when
constructing a distance matrix, which represents the pairwise distances between all data points, a metric
needs to be chosen. This is a critical aspect of HDBSCAN, and the characteristics of the data as well as the
research question should be taken into consideration when choosing. For example, the Euclidean metric
measures the straight-line distance between two points while the cosine metric measures the angle between
two vectors, described in Section 2.4.1 and 2.4.2 respectively. HDBSCAN supports a wide range of metrics,
but these two are the most common ones and Euclidean is the default setting [45, 46].
When using HDBSCAN in BERTopic, a number of outlier documents might also be created if they do not
fall within any of the created topics. This topic is labeled "-1" and will contain the outlier documents that
cannot be assigned to a topic given an automatically calculated probability threshold. This threshold can be
modified in the model parameter settings to reduce the number of outliers, as described in [23].
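A minimal sketch of clustering the reduced embeddings with the hdbscan package is shown below; reduced_embeddings is assumed to be the output of the UMAP step and the parameter values are illustrative only.

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                            metric="euclidean",
                            cluster_selection_method="eom",
                            prediction_data=True)
labels = clusterer.fit_predict(reduced_embeddings)

# Points that cannot be assigned to any cluster get the label -1,
# which BERTopic exposes as the outlier topic "-1".
print((labels == -1).sum(), "outlier documents")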
DBSCAN
The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an algorithm that identifies
clusters of points in a dataset based on their density. The algorithm is known for its ability to handle noise
and outliers efficiently and is widely used when mining large spatial datasets. By relying on the concept
of density, it is able to identify clusters of arbitrary shapes and sizes, making it a valuable tool for many
applications. In addition, it requires only a minimal number of input parameters, two: the radius of the
cluster, Eps, and the minimum number of points required inside the cluster, MinPts [56].
A Density Based Notion of Clusters
DBSCAN was introduced by M. Ester et al. in their 1996 paper "A density-based algorithm for discovering
clusters in large spatial databases with noise" [10], which is based on the idea that clusters are dense regions
of objects in the data space separated by regions of lower density. In their paper they provide a detailed
description of the algorithm and its implementation, as well as the following six definitions that constitute
the notion of density-based clusters.
Definition 1 - Eps-neighborhood of a point:
The Eps-neighborhood of a point p, denoted by \(N_{Eps}(p)\), is defined by
\[
N_{Eps}(p) = \{ q \in D \mid \mathrm{dist}(p, q) \leq Eps \}.
\]
There are two kinds of points in a cluster: points inside the cluster (core points) and points on the border of the cluster (border points).
Definition 2 - Directly density-reachable:
A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
1. \(p \in N_{Eps}(q)\), and
2. \(|N_{Eps}(q)| \geq MinPts\) (core point condition).
Definition 3 - Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps and MinPts if there is a chain of points \(p_1, \ldots, p_n\), \(p_1 = q\), \(p_n = p\) such that \(p_{i+1}\) is directly density-reachable from \(p_i\).
Definition 4 - Density-connected:
A point p is density-connected to a point q w.r.t. Eps and MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.


Definition 5 - Cluster:
Let D be a database of points. A cluster C w.r.t. Eps and MinPts is a non-empty subset of D satisfying the following conditions:
1. \(\forall p, q\): if \(p \in C\) and q is density-reachable from p w.r.t. Eps and MinPts, then \(q \in C\) (Maximality)
2. \(\forall p, q \in C\): p is density-connected to q w.r.t. Eps and MinPts (Connectivity)
Definition 6 - Noise:
Let \(C_1, \ldots, C_k\) be the clusters of the database D w.r.t. parameters \(Eps_i\) and \(MinPts_i\), \(i = 1, \ldots, k\). The noise is defined as the set of points in the database D not belonging to any cluster \(C_i\), i.e., \(\mathrm{noise} = \{ p \in D \mid \forall i : p \notin C_i \}\).
Algorithm DBSCAN
The basic idea and algorithm behind DBSCAN [56] is as follows,

Algorithm 3 DBSCAN
(1) Select an arbitrary point p
(2) Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
(3) If p is a core point, a cluster is formed.
(4) If p is a border point, no points are density reachable from p and DBSCAN visits the next point of the
database.
(5) Continue the process until all the points have been processed.

2.6.4 Tokenizer
Before the next step, Weighting Scheme, where the creation of the topic representation in the BERTopic algorithm
is initiated, a technique that allows for modularity needs to be selected. When using HDBSCAN as a cluster
model, the clusters may have different levels of density and shapes, which means that a centroid-based topic
representation technique may not be suitable. Therefore, a technique that makes little to no assumption
about the expected structure of the clusters is preferred, such as a bag-of-words method. To achieve this bag-
of-words representation on a cluster level, all documents in a cluster need to be treated as a single document
by simply concatenating them and then counting the frequency of each word in each cluster [20].
Since the quality of the topic representation is key when understanding the patterns, interpreting the topics and
communicating the results, it is of utmost importance that the best possible method for this is chosen for the
data at hand. The flexibility within BERTopic for vectorization algorithms is wide, as presented in Figure 3,
with methods such as CountVectorizer, Jieba and POS available. The default method for this sequence
in BERTopic is CountVectorizer, described below [28].
CountVectorizer
The CountVectorizer is a method in scikit-learn that converts a collection of text documents to a matrix of
token counts to extract features from text. Specifically, it converts the text documents into a bag-of-words
representation, where each document is represented by a vector that counts the frequency of each word in the
document. In addition, it performs several text pre-processing steps such as tokenizing the text, lowercasing
and removing stop words, see Sections 2.2.1, 2.2.3 and 2.2.4 respectively for more detailed information. Additional
information about CountVectorizer can be found in scikit-learn's documentation [58].
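A minimal sketch of the CountVectorizer applied to a hypothetical mini-corpus, illustrating the resulting bag-of-words matrix, could look as follows.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The invoice was sent yesterday",
        "I never received the invoice",
        "My card payment was declined"]

# Tokenization, lowercasing and stopword removal are handled by the vectorizer.
vectorizer = CountVectorizer(stop_words="english", lowercase=True)
bag_of_words = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary (get_feature_names in older scikit-learn)
print(bag_of_words.toarray())              # token counts per document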

2.6.5 Weighting Scheme


From the generated bag-of-words representation, described in Section 2.6.4, what we are after is knowledge
regarding what distinguishes one cluster from another, for example whether there are words that are typical
for one cluster but not so much for the others. To solve this, BERTopic utilizes a modified TF-IDF, described
in Section 2.3.3, called class-based TF-IDF (c-TF-IDF), so that topics, i.e., clusters, are considered instead of
documents [20]. In addition, due to the topic representation being relatively independent of the clustering
step, we have the flexibility to modify the appearance of the c-TF-IDF representation. This modularity
enables us to experiment with different weighting schemes, as presented in Figure 3 [17].
c-TF-IDF
The idea of c-TF-IDF is to assign each cluster one topic. We want to know what makes one topic based on
its cluster-word representation and thereby distinguishes the different topics from each other. As described
in Section 2.3.3 TF-IDF is a numerical statistic that reflects how important a word is to a document in a


collection or corpus. The classical procedure of TF-IDF combines the two statistics, term frequency (TF),
and inverse document frequency (IDF),

\[
W_{t,d} = \mathrm{tf}_{t,d} \cdot \log\!\left(\frac{N}{\mathrm{df}_t}\right),
\]

where the TF models the frequency f of term t in document d while the IDF models how much information
a term provides a document. The latter is calculated by taking the logarithm of N which is the number of
documents in a corpus, divided by the total number of documents d that contain t [29].
BERTopic, on the other hand, uses a custom class-based TF-IDF for topic representation, which means that
it measures a term's importance to a topic instead. To do so, all documents in a cluster need to be
treated as a single document by simply concatenating them. TF-IDF is then adjusted to account for this
representation by translating documents to clusters,

\[
W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\!\left(1 + \frac{A}{\mathrm{tf}_t}\right),
\]

where the TF models the frequency f of term t in class c, where class c is the collection of documents
concatenated into a single document for each cluster. In this case, the IDF is replaced by the inverse class
frequency to measure how much information a term provides to a class. This is calculated by taking the
logarithm of A, the average number of words per class, divided by the frequency f of term t across
all classes. One is added inside the logarithm so that only positive values are output [29].
Since the goal of this class-based TF-IDF procedure is to evaluate the importance of words in clusters of
documents, rather than in individual documents, this approach enables the creation of topic-word distribu-
tions for each cluster. To reduce the number of topics to a user-specified level, we can iteratively combine
the c-TF-IDF representations of the least common topic with its most similar one [29].
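A minimal sketch of the class-based weighting formula above, applied to a hypothetical term-frequency matrix in which each row is one concatenated cluster, could look as follows; this illustrates the formula rather than BERTopic's internal implementation.

import numpy as np

# Hypothetical term frequencies: 3 classes (clusters) x 3 terms.
tf = np.array([[10, 0, 2],
               [1,  8, 0],
               [0,  1, 9]], dtype=float)

A = tf.sum(axis=1).mean()   # average number of words per class
tf_t = tf.sum(axis=0)       # frequency of each term across all classes

# c-TF-IDF: term frequency per class times the inverse class frequency.
ctfidf = tf * np.log(1.0 + A / tf_t)
print(ctfidf.round(2))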

2.6.6 Fine-tuning Representation


With the use of c-TF-IDF, described in Section 2.6.5, an accurate topic representation, i.e., a list of words that
describe a collection of documents, is generated quickly. However, due to the rapid advancements within the
field of NLP, new and exciting techniques that can be used for additional fine-tuning of the model are
introduced weekly. The generated representation from c-TF-IDF can further be viewed as potential
topics that include keywords and representative documents. These documents offer a significant advantage
in fine-tuning, as they allow computation on a smaller set of documents through which an improved topic
representation can be achieved. Techniques for this, such as KeyBERT, are already integrated into BERTopic
for easy use and experimentation [20].
KeyBERT
KeyBERT is a basic but powerful technique developed by Maarten Grootendorst [18] that allows users
to extract the most relevant keywords and keyphrases from a document based on its content. This method
leverages BERT embeddings, Section 2.6.1, and cosine similarity, Section 2.4.2, to identify the words and
sub-phrases within a document that are most similar to the document itself.
KeyBERTInspired
KeyBERTInspired is a representation model in BERTopic's Python library whose algorithm follows the
same principles as KeyBERT, but it also performs some optimizations to speed up inference. The model for
this procedure is illustrated in Figure 6 below [26].


Figure 6: Algorithm overview for keyword extraction in topic n with KeyBERTInspired which is based on
KeyBERT. Figure from [26].
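A minimal sketch of plugging KeyBERTInspired into BERTopic as an optional representation model is shown below; docs is assumed to be the list of documents, all other sub-models are left at their defaults, and a recent BERTopic version is assumed.

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# Fine-tune the c-TF-IDF topic representation with KeyBERTInspired.
representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)

# docs: a list of text documents.
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic(0))  # top words of the largest topic after fine-tuning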

2.7 Topic Model Evaluation


Topic models extract word sets, called topics, from document word counts without requiring semantic anno-
tations. To evaluate how well a topic model predicts previously unseen data, perplexity has been used
as a common metric [69] to capture the level of "uncertainty" in a model's prediction results. However, topics
may not be easily interpretable, and a study by Chang et al. [6] found that perplexity was often negatively
correlated with human judgements of topic quality. Therefore, coherence measures have also been proposed
as an alternative measure, more focused on real-world task performance, to distinguish between good and
bad topics [66].

2.7.1 Coherence
Coherence measures are used in NLP to evaluate topics constructed by a topic model, i.e., how well the model
captures the underlying themes in a corpus of text [9]. These measures evaluate the quality of the topics
generated by a topic model and can thus be used to compare the outcomes of different topic models on the
same corpus of texts.
Topic Coherence
Topic coherence has been proposed as an intrinsic evaluation method for topic models and is defined as the
average or median of pairwise word similarities formed by the top words of a given topic [66]. There are several
variations of topic coherence measures, see for example [65], but the most commonly used coherence measure
in both LDA and BERTopic to estimate the optimal number of topics is \(c_v\). The \(c_v\) coherence measure was
proposed by Röder et al. [65] and is based on a sliding window, a one-set segmentation of the top words and
an indirect confirmation measure that uses normalized point-wise mutual information (NPMI) and the cosine
similarity, described in Section 2.4.2. The formula for computing the \(c_v\) coherence measure is as follows:

\[
c_v = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \log \frac{D(w_i, w_j) + e}{D(w_i) + D(w_j) + e},
\]

where \(n\) is the number of top words in the topic, \(D(w_i, w_j)\) is the number of documents that contain both
words \(w_i\) and \(w_j\), \(D(w_i)\) is the number of documents that contain word \(w_i\), and \(e\) is a smoothing parameter.
The \(c_v\) coherence measure ranges between 0 and 1; a higher value indicates that the words in the topic are
more coherent, and therefore better.

2.7.2 Perplexity
Perplexity is a confusion metric used to evaluate topic models and accounts for the level of "uncertainty" in
a model’s prediction result. It measures how well a probability distribution or probability model predicts a
sample, and is used in topic modeling to measure how well a model predicts previously unseen data [69].


Perplexity is calculated by splitting a dataset into two parts - a training set and a test set. The idea is to train
a topic model using the training set and then test the model on a test set that contains previously unseen
documents (i.e., held-out documents). The measure traditionally used for topic models is the perplexity of
held-out documents, Dtest which can be defined as:
\[
\mathrm{perplexity}(D_{test}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \sum_{i=1}^{N_d} \log p(w_{di})}{\sum_{d=1}^{M} N_d} \right\},
\]

where \(w_{di}\) is the \(i\)-th word in document \(d\), \(p(w_{di})\) is the probability of word \(w_{di}\) given the topic model, \(N_d\)
is the number of words in document \(d\), and \(M\) is the number of documents in the test set \(D_{test}\). A lower
perplexity score indicates that the model is better at predicting new data [69].
Perplexity Score in LDA
In LDA modelling, the standard perplexity measure is the output statistic from Gensim's log_perplexity
function. The output statistic is a negative number, and the perplexity is calculated as 2^(-bound), as described
in the documentation [75]. The mathematical formula for the bound is however not explicitly stated in the
documentation, just that the function calculates the per-word likelihood bound using a chunk of documents
as evaluation corpus.
When reading about the output statistic from log_perplexity, there is some confusion on how it should
be interpreted. For example, the user Rafs on StackExchange suggests that the log_perplexity function,
counter intuitively doesn’t output a perplexity after all, but a likelihood bound which is utilized in the
perplexity’s lower bound equation [70].
This interpretation of the log_perplexity output statistic suggests that a smaller bound value implies
deterioration, and therefore bigger values mean that the model is better. A post by the creator of Gensim,
Radim Řehůřek, from the Google Gensim group seems to support this interpretation. In this post, the
question is whether the perplexity is improving when the output from log_perplexity is decreasing, to which
Řehůřek replies: "No, the opposite: a smaller bound value implies deterioration. For example, bound
-6000 is 'better' than -7000 (bigger is better)" [14]. For further discussion of the interpretation of the
perplexity score output from the LDA model, see Section 5.4.
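A minimal sketch of obtaining this output statistic and, following the 2^(-bound) relation from the documentation, converting it into a perplexity is shown below; lda_model and corpus are assumed to be a trained Gensim LdaModel and its bag-of-words corpus.

import numpy as np

# log_perplexity returns the per-word likelihood bound, not the perplexity itself.
bound = lda_model.log_perplexity(corpus)

# Perplexity as 2^(-bound): a larger (less negative) bound gives a lower perplexity.
perplexity = np.power(2.0, -bound)
print(bound, perplexity)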


3 Method
In this section, the overall methodology and workflow of the thesis work are presented. To answer the research
questions stated in Section 1.3.1, the project work was divided into a number of different phases, as presented
in Figure 7.

Figure 7: Illustration over the project processes covered in Section 3

Section 3.1 provides a visual overview of the modeling workflow for the LDA and BERTopic models. Section
3.2 summarizes the initial process of the project work, followed by Section 3.3 where the software and tools
used are presented. Section 3.4 provides an overall description of the data and the data extraction process,
followed by Section 3.5 with an explanation of how the dataset was created. Section 3.6 describes the
procedure for data cleaning and preprocessing, followed by Section 3.7 where the text representation steps are
explained. Section 3.8 gives a summary of the modeling phase, which is divided into two subsections, Section
3.8.1 which focuses on the LDA modeling and Section 3.8.2 on the BERTopic modeling. Finally, Section 3.9
provides a description of the methods used for model evaluation and model selection.

3.1 Modeling Overview


This subsection aims to provide a visual overview of the modeling process for the two topic modeling methods
used in this project. It is divided into two separate subsections, Section 3.1.1 for the LDA overview and Section
3.1.2 for the BERTopic overview.


3.1.1 LDA
To summarize the steps for the LDA modeling, a schematic representation of the complete modeling workflow,
from the initial dataset, to preprocessing, text representation, modeling and finally visualization of topics
and topic information extraction, is presented in Figure 8.

Figure 8: Schematic representation of the LDA modeling workflow, from the initial dataset, to preprocessing,
text representation, modeling and finally visualization of topics and extracting topic information.


3.1.2 BERTopic
A schematic representation of the BERTopic modeling workflow, from the original dataset to topic visualiza-
tion and information extraction, is presented in Figure 9. The steps involved in the topic modeling process
include initial text representation, clustering, topic representation, and finally visualization of the extracted
topics and its information.

Figure 9: Schematic representation of the BERTopic modeling workflow, from the original dataset to topic
visualization and information extraction.


3.2 Initial Process


Several unstructured qualitative interviews were conducted with managers from the customer service, data
& analytics and data & ML teams during the start-up phase of the project. These interviews served as a
basis for structuring the scope and delimitations of the project. In addition, the interviews provided the
information needed to get a clear picture of the problem and of what the customer service workflow
looked like at the company. During the later stages of the project work, a continuous dialogue was also kept
with managers from these teams to ensure that the project was kept in line with the scope and the desired
outcome for the organization.
The area of text analysis and NLP has grown and progressed swiftly in the past couple of years, especially
with the introduction of large language models (LLMs) such as OpenAI's ChatGPT, introduced to the public
in November 2022 [54]. This has transformed the area of NLP and rendered many of the standard procedures
and benchmark tasks for text analysis surpassed and even outdated. To ensure that we kept up to date
with the latest methods and models in the literature, the first weeks of the project work largely consisted of
a thorough literature study and domain research. The choice of methods was therefore a combination of the
initial literature study and model experimentation, where several approaches and methods were tested.

3.3 Software
For the entirety of this thesis work, the Python programming language was used. Python has for some time
been the programming language of choice when implementing NLP models, due to its proficiency and ease of
use with a wide range of open-source libraries and packages. To ensure sufficient computational power for
running large NLP-models, all the programming, including transcription, translation and modeling, was done
in a Linux environment on a Virtual Machine (VM) on the company’s Google Cloud Platform (GCP). On
the VM, Python version 3.7.13 was installed since it is a stable version with support for all the dependencies
and libraries used in the project. The additional computing power from the VM with dedicated graphical
processing units (GPUs) allowed a significant reduction in the time needed for running the models, compared
to running the models on local machines.

3.4 Data Description and Extraction


Every customer phone call to the company’s customer service is recorded and saved to a database which can
be accessed through an internal web-portal. From our agreement with the company, we had permission to
download and transcribe 5000 call recordings to use in our dataset. These 5000 call recordings were collected
and downloaded through the web-portal using the desired filtering options provided by the company, presented
in Table 1.

Table 1: Applied filters for downloading the call recordings.

Filtering Options Applied Filters


Department Customer service
Call duration ≥ 60 seconds
Direct dial-ins (DDI’s) Support number & Company IVR
Call direction Incoming

The call recordings were downloaded in five batches of 1000, one batch collected each week from week 10,
12, 13, 14 and 15 of 2023. The call recordings were downloaded in a .wav audio format, and therefore no
additional formatting was necessary before transcribing the files.
As suggested by the company, phone calls with a duration of less than 60 seconds were excluded to reduce
noise in the dataset. These calls could for example be interrupted calls or erroneous calls not intended for the company.
The other applied filters suggested by the company were intended to limit the type of calls to incoming direct
dial-ins to the customer service department.


3.5 Creation of Dataset


With the 5000 call recordings retrieved based on the provided filters, presented in Table 1, the process of
turning the audio files into a workable dataset of text could be initiated. This subsection describes the steps
for transcribing and translating the call recordings from audio files in Swedish, to text files in English.

3.5.1 Transcription and Translation


In order to create a dataset suitable for NLP-models, the call recordings first needed to be transcribed to
text files. Two different methods for transcribing the audio files were compared, Google’s Speech to Text
API [16] and OpenAI’s Whisper model [60]. After some research on the capabilities of these two methods,
and a comparison on a small subset of call recordings, we could conclude that OpenAI’s Whisper model
was the preferred choice for both transcription and translation. The motivation for choosing Whisper over
Google's service was twofold: the first reason was the cost of using Google's API, where a small amount is
charged for every transcription [15], compared to Whisper which is free to use. The second was the quality
of the transcriptions, where OpenAI's Whisper model outperformed Google's quite substantially in both
transcribing and translating when manually comparing the quality of the outputted test files.
Whisper is a speech recognition model developed by OpenAI that can also do language identification and
speech translation across a number of languages [60]. The model is open sourced and it comes in 5 sizes for
the parameter space of the model, scaling from "tiny" to "large". To achieve the best possible quality of
the dataset, the "large" model setting was used both to transcribe and translate the call recordings.
The translation process was simply done by adding the argument task = 'translate' to Whisper's
transcription model, as follows: model.transcribe(audio.wav, task = 'translate'). This automati-
cally translates the text to English if the detected language is not already English. The reason for translating
all text to English was to take advantage of the multiple packages and embeddings trained on the English
language, with the hope of achieving the best possible results from the analysis.
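A minimal sketch of this transcription and translation step with the open-source whisper package is shown below; the file name is a placeholder and a GPU is assumed for the "large" model to run in reasonable time.

import whisper

# Load the "large" Whisper model used in this project.
model = whisper.load_model("large")

# Transcribe one call recording and translate it to English in the same step.
result = model.transcribe("call_recording.wav", task="translate")
print(result["text"])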

3.5.2 Concatenating the Dataset


To obtain a .csv file suitable as input for the NLP-models, each transcribed call recording was placed in a
list with two columns. For each transcription, the corresponding index n ∈ {1, ..., 5000} was placed in the
first column and the string containing the content of the transcribed call in the second column. This was
repeated for all 5000 transcriptions and concatenated into a .csv file, as illustrated in Table 2.

Table 2: Illustration of how the concatenated transcriptions are saved as output in the .csv file.

Call Recording Index (n ∈ {1, ..., 5000}) Transcribed Call (one call per cell/row)
1 "Welcome to XXX, you are talking to..."
2 "Welcome to XXX, this is John..."
.. ..
. .

5000 "Welcome to XXX, how may I help you?..."

The finished .csv file containing all 5000 transcribed and translated call recordings was then saved and
ready for data preprocessing. In Table 3, we provide a summary of descriptive statistics for the content of
the concatenated dataset. To get an intuition of the call duration variability in the dataset, a word count
distribution was also computed, as seen in Figure 10, which shows the distribution of word count frequencies.

Table 3: Descriptive statistics for the concatenated dataset of transcribed call recordings.

Dataset Summary - Transcriptions


Average word count per call 353
Maximum word count for a call 2841
Minimum word count for a call 41
Number of calls 5000


Figure 10: Word count distribution for the complete dataset of 5000 transcribed call recordings.

3.6 Data Preprocessing


The following subsection presents the methodology for data preprocessing which included methods for cleaning
and preparing the data for text analysis. To prepare the text data, a preprocessing procedure was applied to
the dataset which included the following stages: lowercasing, non-alphabetical character removal, stopword
removal, tokenization and lemmatization.
The data preprocessing methods for LDA and BERTopic differed somewhat in the tools and packages used.
For LDA, the NLTK package was used for stopword removal, tokenization and lemmatization. The specific
lemmatizer used from the NLTK package was the WordNetLemmatizer, which uses the Open Multilingual
WordNet, a lexical database for the English language [3]. In the tokenization process, all single- and two
character words were also removed to reduce unwanted noise in the data.
In addition to the predefined stopword list for the English language provided by the NLTK package, a number of
project-specific stopwords were added to eliminate other unwanted words. The added stopwords were decided
upon together with the company; for instance, words such as "absolutely", "welcome" and "great" were added
to the list of stopwords. See Appendix A for a complete table of the stopwords used. For the lowercasing
and non-alphabetical character removal, Python's built-in functions lower() and re.sub() were used.
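A minimal sketch of this preprocessing pipeline is shown below; it assumes the required NLTK resources have been downloaded and uses only the standard English stopword list, to which the project-specific stopwords in Appendix A would be added.

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet").
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                          # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)        # remove non-alphabetical characters
    tokens = word_tokenize(text)                 # tokenization
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    return [lemmatizer.lemmatize(t) for t in tokens]   # lemmatization

print(preprocess("Welcome to XXX, I have a question about my invoices!"))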
For BERTopic, no manual preprocessing was required on the dataset before implementing the model. As
described in Section 2.6.4, BERTopic has a CountVectorizer component from scikit-learn as the default
setting, which performs tokenization, lowercasing and stopword removal automatically. The only manual
step that was performed was to set the list of stopwords to the same as for LDA, presented in Appendix A,
to allow a fair comparison of the results.

3.7 Text Representation


This subsection explains the process of converting the unstructured textual data into a structured numerical
format that can be used by the NLP algorithms implemented in our thesis work. The process of text
representation is divided into the specific methods used for LDA and BERTopic, in Section 3.7.1 and Section
3.7.2 respectively.

3.7.1 LDA
In order to present the dataset in a machine-readable way for an LDA topic model, a bag-of-words represen-
tation, as described in Section 2.3.1, of the tokenized and lemmatized text was created. This was done using
the doc2bow function from the Gensim library, which is a platform independent library aimed at automatic


extraction of semantic topics from documents. The doc2bow function converts the documents of a corpus into
a bag-of-words format, i.e., a list of (token_id, token_count) tuples [76]. An example of a bag-of-words
representation is illustrated in Figure 11.

Figure 11: Example of a small corpus represented in bag-of-words format, i.e., a list of (token_id,
token_count) tuples. In this example, the corpus consists of three documents and the bag-of-words repre-
sentation contains three list objects each consisting of four, three and six (token_id, token_count) tuples
respectively. For example, this representation shows that token_id = 3 was counted one time in the first
and second document, and three times in the third document.

In order to know which word each token_id represents, a dictionary was also created where each token_id
is paired with its corresponding word. This dictionary was created using the corpora.Dictionary function
from the Gensim library [76].
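A minimal sketch of this text representation step, using a hypothetical set of already preprocessed documents, could look as follows.

from gensim import corpora

# Hypothetical token lists produced by the preprocessing step in Section 3.6.
processed_docs = [["invoice", "payment", "late"],
                  ["card", "blocked", "payment"],
                  ["invoice", "invoice", "question"]]

# Map every unique token to an integer token_id.
dictionary = corpora.Dictionary(processed_docs)

# Convert each document into a list of (token_id, token_count) tuples.
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
print(corpus)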

3.7.2 BERTopic
To enable the BERTopic model to interpret the dataset, the data had to be transformed into a machine-
readable format which BERTopic does in several steps during the modeling process. These steps were as
follows:
1. Initially, the text representation was done using BERT-embeddings, described in Section 2.6.1, where
the SentenceTransformers() function from Hugging Face’s Python library was utilized to implement
context-based representation, using the pre-trained models all-MiniLM-L6-v2 [33] and all-mpnet-base-
v2 [34], both described in Section 2.6.1.
2. After the numerical text representation of the documents was created, dimensionality reduction with the
UMAP technique, presented in Section 2.6.2, was applied through the UMAP() function from the Python
library with the same name [51].
3. With the reduced BERT-embeddings, the data was clustered with the use of HDBSCAN, described in
Section 2.6.3, through the HDBSCAN() function from sklearn's Python library [47].
4. Within each cluster a bag-of-words representation, described in Section 2.3.1, was generated through
the CountVectorizer() function from sklearn's Python library, described in Section 2.6.4. In addition,
this function also performs several text pre-processing steps such as tokenizing the text, lowercasing
and removing stopwords, see Sections 2.2.1, 2.2.3 and 2.2.4 respectively. See Appendix A for all used
stopwords [58].
5. From the generated bag-of-words representation of each cluster, we want to distinguish the clusters from one
another, which was solved by using c-TF-IDF, presented in Section 2.6.5. This was generated through
the ClassTfidfTransformer() function from bertopic's Python library, which is a modified version of
scikit-learn's TfidfTransformer() class [21, 58].
6. Lastly, on top of the generated topic representations, some fine-tuning was performed using KeyBERT,
described in Section 2.6.6. This was done through the KeyBERTInspired() function in the bertopic
library in Python [18].
The steps presented above are also illustrated and clarified in Figure 9, and a code sketch assembling these components is given below.
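A minimal sketch assembling the sub-models from the steps above is shown below; the parameter values are illustrative defaults rather than the tuned settings in Appendix C, and docs is assumed to be the list of transcribed calls.

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean", prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english")   # plus the Appendix A stopwords
representation_model = KeyBERTInspired()

topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       vectorizer_model=vectorizer_model,
                       representation_model=representation_model)

topics, probs = topic_model.fit_transform(docs)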

3.8 Modeling
After data preprocessing and text representation, the dataset was ready for model implementation. This
subsection is divided into two parts describing the topic modeling phase, where Section 3.8.1 describes the
implementation of LDA and Section 3.8.2 the implementation of BERTopic.


3.8.1 LDA
For the entire implementation of LDA, the Gensim library was used. The Gensim library was developed
for Python by Radim Řehůřek in 2008, and allows both LDA model estimation from a training corpus and
inference of topic distribution on new, unseen documents [75]. Gensim requires NumPy, a fundamental
package for scientific computing with Python [30], as well as a Python version of 3.6 [77].
Model Building
With the corpus of text represented as a bag-of-words, the implementation of the LDA model was done
through importing the models.ldamodel module from the Gensim library and utilizing the LdaModel func-
tion. This function has several parameters that can be used to customize the LDA model. For the two
Dirichlet hyperparameters ↵ and , which controls the sparsity of the topic-document and the word-topic
distributions respectively, the standard ’auto’ setting was used. When set to ’auto’, the model will learn
an asymmetric prior directly from the data, meaning that the model will learn the best value of alpha and
beta for each document based on the data [75]. A full description of model parameters and the settings used
is shown in Appendix B.
The LDA model was then saved with the chosen parameter settings and can later be used to extract and
visualize topics.
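A minimal sketch of this model building step is shown below; the number of topics and the random seed are illustrative, the remaining parameters are simplified relative to Appendix B, and corpus and dictionary are assumed to be the objects created in Section 3.7.1.

from gensim.models import LdaModel

lda_model = LdaModel(corpus=corpus,
                     id2word=dictionary,
                     num_topics=9,        # illustrative; varied during model selection
                     alpha="auto",        # asymmetric prior learned from the data
                     eta="auto",
                     random_state=42)

lda_model.save("lda_model")  # persist the trained model for later use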
Topic Extraction and Visualization
Once the model was trained, several different actions could be performed to extract topic information as well as
to visualize topics. For extracting information about the topics, the get_document_topics and show_topics
functions from the Gensim library were used to retrieve the distribution of topics for each document, and the
corresponding word distribution for each topic.
For visualizing the topics, the PyLDAvis library was used. PyLDAvis is a Python library for interactive topic
model visualization. It is a port of the R package by Carson Sievert and Kenny Shirley [68]. PyLDAvis
produces a bubble-chart, where each bubble on the left-hand side of the chart represents a topic. The larger
the bubble, the more prevalent is that topic, since the size of each bubble is proportional to the percentage
of unique tokens attributed to each topic. A good topic model will have fairly big, non-overlapping bubbles
scattered throughout the chart instead of being clustered in one quadrant. A model with too many topics
will typically have many overlaps, small sized bubbles clustered in one region of the chart. The visualizations
created from PyLDAvis can be seen in Section 4.1.
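A minimal sketch of the topic extraction and visualization calls described above is shown below; lda_model, corpus and dictionary are assumed to be the objects created earlier, and the PyLDAvis module path depends on the installed version.

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # "pyLDAvis.gensim" in older versions

# Word distribution per topic and topic distribution for one document.
print(lda_model.show_topics(num_topics=-1, num_words=10))
print(lda_model.get_document_topics(corpus[0]))

# Interactive bubble-chart visualization of the topics.
vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")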

3.8.2 BERTopic
For the implementation of BERTopic, the BERTopic module developed by Maarten Grootendorst, the creator
of BERTopic, was used. The necessary dependencies for installing and importing the BERTopic library were
also installed, as described by Grootendorst in [25].
Model Building
The implementation of the BERTopic modeling was done through a simple line of code in which the user calls
the imported BERTopic function with the desired sub-models for each sequence as inputs. The main steps for
topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN and c-TF-IDF, and in addition
a fine-tuning process can be added with KeyBERT. Within each of the different sequences of BERTopic,
several different parameter settings can be used; the main settings for each step can be found in Appendix C.
As mentioned, these are the main settings for the different sequences, but additional modification and tuning
can also be done.
When a model setup was chosen, it needed to be trained on the specific data at hand, which was done through
the .fit_transform function within the bertopic library. This and similar functions for model fitting in the
bertopic library can be found in Appendix D.
Topic Extraction and Visualization
Once the model was trained, several different actions were performed to extract topic information and visualize
it. For topic extraction, functions in the bertopic library such as .get_topic and .get_document_info
were used together with the other functions presented in Appendix D. From the results, an initial representation
of the different topics was obtained and could be analyzed.
Through the different topic visualizations tools presented in Appendix E more insights were gained regarding
each topic.


3.9 Evaluation and Model Selection


This section provides an overview of the evaluation and model selection procedures. It is divided into two
subsections, Section 3.9.1 outlines the evaluation methods for LDA, while Section 3.9.2 explains the procedures
for BERTopic.

3.9.1 LDA
The best model was evaluated and selected as the model with the optimal number of topics for the data. The
optimal number of topics was determined through comparing the topic coherence, cv , and perplexity scores,
as described in Section 2.7, for different number of topics in the model settings. In addition to evaluating the
models using coherence and perplexity metrics, human interpretation of the topic quality was done through
analyzing the visualizations generated with PyLDAvis. This was done to ensure that the final model would
have topics that were easily interpretable and provided categorizations with the highest possible business
value for the company.
To obtain the perplexity and coherence scores, a performance function was built to calculate these metrics us-
ing the CoherenceModel with the coherence parameter specified as coherence=’c_v’, and the log_perplexity
function with the document corpus represented as bag-of-words as input. Both modules were imported from
the Gensim library [75].
The coherence and perplexity scores for the number of topics T = {2, ..., 20} were saved and presented in a joint
table as well as in separate graphs to determine the model with the optimal number of topics with regards to
these metrics. See Section 4.1.1 for these results.
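A minimal sketch of such a performance function, looping over candidate numbers of topics and collecting both metrics, could look as follows; corpus, dictionary and processed_docs are assumed to be the objects created in Sections 3.6-3.7, and the settings are simplified relative to Appendix B.

from gensim.models import CoherenceModel, LdaModel

scores = []
for T in range(2, 21):
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=T,
                     alpha="auto", eta="auto", random_state=42)
    coherence = CoherenceModel(model=model, texts=processed_docs,
                               dictionary=dictionary, coherence="c_v").get_coherence()
    perplexity = model.log_perplexity(corpus)   # Gensim's likelihood bound
    scores.append((T, coherence, perplexity))

for T, c, p in scores:
    print(T, round(c, 4), round(p, 4))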

3.9.2 BERTopic
To obtain the best model possible for extracting topic insights two different methods were used for model
evaluation and comparison. The first method was comparing the topic coherence scores, \(c_v\), as described
in Section 2.7, which was done through a built-in function called CoherenceModel with the parameter setting
coherence='c_v'. The second method was to perform human interpretation of the topic quality, which was
analyzed through word cloud representations and the built-in function wordcloud.
The perplexity measure is used in LDA as a measure of model fit, to find the optimal number of topics for
the final model. In BERTopic, however, the number of topics does not need to be pre-determined by the user
when using HDBSCAN. Therefore, perplexity was not needed as an evaluation metric in the BERTopic model
selection phase.
The model selection procedure was initiated with use of the default model presented in Section 2.6 and
more specifically Figure 4. From this model baseline, the model selection procedure continued with several
modifications and experiments regarding the different parameter settings presented in Appendix C. As
presented in the appendix, different model settings can be employed during the selection process, and when
we believed that the best possible model had been obtained, based on our two evaluation methods, a discussion
with the stakeholders at the company was conducted. Since the results based on the second metric, human
interpretation of the topic quality, can be considered arbitrary, this dialogue was of utmost importance to
ensure that a model that maximizes business value was created.


4 Results
In this section, the results of the project are presented. The section is divided into two subsections, Section 4.1
where the results from the LDA modeling are presented, and Section 4.2 where the results from the BERTopic
modeling are presented.

4.1 LDA
This subsection presents the results of various LDA models applied to the dataset, which were evaluated based
on coherence score, perplexity score and relevance, i.e., human interpretation of topic quality. It includes
visual representations of topic clusters and aims to provide a concise overview of the LDA topic modeling
results as well as demonstrate its efficacy in extracting meaningful topic information.

4.1.1 Model Selection


The process of model selection for LDA includes evaluating the model based on the evaluation metrics
coherence score and perplexity score, as well as human interpretation of topic quality.
Coherence and Perplexity
From Table 4 and Figure 12, we saw that the perplexity score decreased with an increasing number of
topics T in the model, meaning that, according to this metric, more topics indicate deterioration and a worse model fit. As mentioned
in Section 2.7.2, the perplexity score used in our LDA models was the output statistic from Gensim's log_perplexity
function. This value, in contrast to the held-out document perplexity measure, also defined in Section 2.7.2, should be
interpreted as "bigger is better", and therefore a larger value is preferred. With this in mind, the optimal
choice of LDA model with regard to the coherence and perplexity metrics should be the model with the highest
coherence score, with as low a T as possible.
The candidates for the final model with the optimal number of topics are highlighted in green in Table 4 and
with green vertical lines in Figure 12a. These models were chosen as candidates by looking both at the models
with the highest coherence "peaks" in the coherence graph, Figure 12a, and at the best (largest) possible
perplexity scores. This includes the models at each peak in Figure 12a (T = 5, 7, 9, 11 and
15), and the model with T = 10, which lies between the highest and second-highest peak. One thing to note is also that
there seems to be a diminishing return for adding more topics after T = 15, seeing as the trend points to a
lower coherence score after this point.

Table 4: Coherence and Perplexity Scores for different number of topics. The rows highlighted in green
represent the best models from observing the coherence and perplexity score graphs from Figure 12, with
regards to the "peaks" in Figure 12a and the largest (best) perplexity scores in Figure 12b.

LDA Model Selection


Num of topics T Coherence Score Perplexity Score
2 0.2981 -6.592
3 0.3266 -6.595
4 0.3586 -6.5837
5 0.3973 -6.6089
6 0.3639 -6.6478
7 0.4238 -6.6895
8 0.42 -6.7476
9 0.4714 -6.8272
10 0.4591 -6.9289
11 0.4684 -7.0875
12 0.4261 -7.3103
13 0.4276 -7.5983
14 0.438 -7.9255
15 0.4684 -8.1613
16 0.4143 -8.3434
17 0.3961 -8.4631
18 0.4145 -8.5657
19 0.4155 -8.6916
20 0.4117 -8.7929


(a) Coherence scores for number of topics T = {2,...,20} (b) Perplexity scores for number of topics T = {2,...,20}

Figure 12: Coherence and Perplexity Scores for the LDA model with different numbers of topics T . The
coherence score graph has peaks for T = 5, 7, 9, 11 and 15. T = 10 is also interesting since it lies in-between
two peaks and has a high coherence score with a lower number of topics compared to the two peaks to its
right (T = 10 < T = 11 < T = 15), and it therefore has a higher perplexity score.

Human Interpretation of Topic Quality


By observing Table 4 and Figure 12, the model with the best coherence score was the one with T = 9.
However, looking at the topic visualization for the model with T = 9, as seen in Figure 13c, this model has
three highly overlapping small topic clusters (t = 7, 8 and 9) in the top-right quadrant of the graph. The size
of each bubble is proportional to the percentage of unique tokens attributed to that topic. When evaluating
this model with regards to human interpretation of topic quality, the topics were also quite hard to categorize
in a business context when interpreting the most frequent words in each topic. The most frequent words for
each topic cannot be presented in this report due to confidentiality reasons. However, the interpretation of
topic quality was done in collaboration with project stakeholders at the company to ensure that the topics
were interpretable in a business context.
After the human interpretation of the topics, in combination with evaluating the coherence and perplexity
scores, we could conclude that the model with T = 10 was the preferred one. This is because there was less
overlap of the topic clusters, and a better separation of smaller topic clusters, compared to the model with T
= 9. The model with 10 topics also had the third-best coherence score, and a better (larger) perplexity score than the
models with the shared second-highest coherence score (T = 11 and T = 15).


(a) T = 5 (b) T = 7

(c) T = 9 (d) T = 10

(e) T = 11 (f) T = 15

Figure 13: Visualization of topics with PyLDAvis for the six candidate final models. Each subfigure represents
one of the six candidate LDA models with the number of topics that produced the best coherence and
perplexity scores, as presented in Table 4 and Figure 12. The size of each bubble is proportional to the
percentage of unique tokens attributed to each topic.


4.1.2 Final LDA model


After the model selection process, as described in Section 4.1.1, the final model was the one with 10 topics,
i.e., T = 10. The topic visualization of the final LDA model is presented in Figure 14, and the evaluation
metrics coherence score and perplexity score are presented in Table 5.

Figure 14: Topic visualization with PyLDAvis for the final LDA model with 10 topics (T = 10).

Table 5: Evaluation metrics for the final LDA model (T = 10).

Evaluation Metrics Final LDA Model


Number of topics T Coherence Score Perplexity Score
10 0.4591 -6.9289

4.1.3 Extracting Topic Information


LDA provides a probability distribution of topics for each document. It also provides a probability distribution
for each topic, which is characterized by a distribution over words. The output from LDA can therefore be
used to identify the topics that are present in the corpus and the words that are associated with each topic.
From Figure 14, we observe the relative size of each topic, represented by the size of the bubbles. Another topic
size comparison is shown in Table 6, where the size of each topic, in terms of the percentage of tokens
(unique words in the corpus) assigned to it, is presented.
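As an illustration, the sketch below shows how both distributions can be read from a trained Gensim LDA model; the variable names (lda_model, corpus) are assumptions and the printed values are only illustrative.

```python
# Hedged sketch: reading the document-topic and topic-word distributions from a trained
# Gensim LDA model.

# Probability distribution over topics for the first document (bag-of-words vector)
doc_topics = lda_model.get_document_topics(corpus[0], minimum_probability=0.0)
# e.g. [(0, 0.57), (1, 0.32), ...]

# The ten most probable words (and their probabilities) for each of the ten topics
for topic_id, word_probs in lda_model.show_topics(num_topics=10, num_words=10, formatted=False):
    print(topic_id, [(word, round(prob, 4)) for word, prob in word_probs])
```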

Table 6: Relative size of each topic with regards to the percentage of tokens assigned to that topic.

Topic Sizes for Each Topic t ∈ {1, ..., T = 10}


t=1 t=2 t=3 t=4 t=5 t=6 t=7 t=8 t=9 t = 10
% of Tokens 53.7 % 14.1 % 13.8 % 7.3 % 4.4 % 4.1 % 1.4 % 0.6 % 0.5 % 0.1 %


Distribution of Topics per Document


For each document, the results from the LDA model gave us a mix of topics that make up that document. To
be precise, we got a probability distribution over topics t ∈ {1, ..., T = 10} for each document. In Table 7, this
probability distribution of topics for 5 randomly selected documents is presented. This information provides
insights into the topics that are present in the corpus and how they relate to each other. For example, when
several documents have similar probability distributions over topics, the contents of those documents may be
related to each other.
Table 7: Output from 5 randomly selected documents dn (n = {1, 602, 974, 3501, 4982}) within the document
corpus D, showing the probability distribution over topics t ∈ {1, ..., T = 10} for those documents. The numbers
are rounded to 4 decimal places for a better overview. The probabilities in each row sum to 1 (if not
rounded).

Probability Distribution of Topics per Document


Document dn   t = 1   t = 2   t = 3   t = 4   t = 5   t = 6   t = 7   t = 8   t = 9   t = 10
n = 1         0.5681  0.3179  0.0726  0.0066  0.0130  0.0189  0.0010  0.0008  0.0009  0.0002
...
n = 602       0.7255  0.1036  0.0430  0.0181  0.0260  0.0815  0.0008  0.0006  0.0007  0.0001
...
n = 974       0.3191  0.0296  0.5960  0.0155  0.0105  0.0190  0.0037  0.0029  0.0032  0.0006
...
n = 3501      0.3513  0.5322  0.0444  0.0096  0.0457  0.0103  0.0023  0.0018  0.0020  0.0003
...
n = 4982      0.3645  0.1087  0.1281  0.0038  0.3776  0.0147  0.0009  0.0007  0.0008  0.0001

Distribution of Words per Topic


Each word in a document is attributed to a particular topic with a probability given by its distribution.
This output can be used to identify the most probable words associated with each topic, which can provide
insights into how each topic should be categorized in a business setting. The distribution of the top 10 most
probable words for each topic is presented in Table 8. Due to confidentiality reasons, the explicit words cannot
be presented in the table. However, this information is shared with the company to provide insights on how
to categorize and name each topic based on the business context and the most probable words.

Table 8: Distribution of the n = 10 words with the highest probability of belonging to each topic t. Due to
confidentiality reasons, each specific word w_{φ_t,n} for topic t given the word-topic distribution φ_t cannot be
explicitly presented. The probability of each word w_{φ_t,n} given the word-topic distribution φ_t is denoted as
φ_{w,z}, where z is the word-topic assignment, as defined in Section 2.5.

Probability Distribution of Words per Topic


Topic 1 φ_{w,z}    Topic 2 φ_{w,z}    Topic 3 φ_{w,z}    Topic 4 φ_{w,z}    Topic 5 φ_{w,z}
w_{φ_1,1} 0.0553   w_{φ_2,1} 0.0617   w_{φ_3,1} 0.0470   w_{φ_4,1} 0.0846   w_{φ_5,1} 0.1388
w_{φ_1,2} 0.0329   w_{φ_2,2} 0.0579   w_{φ_3,2} 0.0354   w_{φ_4,2} 0.0576   w_{φ_5,2} 0.0400
...
w_{φ_1,10} 0.0178  w_{φ_2,10} 0.0160  w_{φ_3,10} 0.0152  w_{φ_4,10} 0.0164  w_{φ_5,10} 0.0122
Topic 6 φ_{w,z}    Topic 7 φ_{w,z}    Topic 8 φ_{w,z}    Topic 9 φ_{w,z}    Topic 10 φ_{w,z}
w_{φ_6,1} 0.0619   w_{φ_7,1} 0.0881   w_{φ_8,1} 0.1132   w_{φ_9,1} 0.0526   w_{φ_10,1} 0.0829
w_{φ_6,2} 0.0603   w_{φ_7,2} 0.0664   w_{φ_8,2} 0.0897   w_{φ_9,2} 0.0521   w_{φ_10,2} 0.0504
...
w_{φ_6,10} 0.0204  w_{φ_7,10} 0.0135  w_{φ_8,10} 0.0110  w_{φ_9,10} 0.0205  w_{φ_10,10} 0.0001


4.2 BERTopic
This subsection presents the outcome of two different BERTopic models applied to the given dataset, the
Default Model and the Final Model, presented in Section 4.2.1 and Section 4.2.2 respectively. These were
evaluated based on two different methods as described in Section 3.9.2, coherence score and relevance, i.e.,
human interpretation of topic quality. It includes visual representations of topic clusters and aims to provide
a concise overview of BERTopic’s topic modeling results as well as demonstrate its efficacy in discovering
meaningful insights. To maintain uniformity in the notation for topic affiliation, according to the theory
presented in 2.6.5, a class c in BERTopic is equivalent to a topic, which is represented by t in the LDA
notation.

4.2.1 Default BERTopic Model


The topic representation from the default model resulted in two different topics, as presented in Table 9. The
number of documents assigned to these topics, as well as the coherence score of the model, is also presented.

Table 9: The number of documents, i.e., customer calls, assigned to each topic obtained from the default
model, together with the model's Coherence Score.

Results Default BERTopic Model


Topic Count
Topic 0 4494
Topic 1 506
Coherence Score: 0.302

From Table 9, we could establish that almost 90% of the total dataset was assigned to one of the two topics,
which is not preferable when trying to define topics from a human interpretation perspective. However, despite
the disappointing topic sectioning, we received a reasonable coherence score from the default model, which
indicates that further tuning might yield a useful result. Below, in Table 10, the term importance, i.e., the c-TF-
IDF score, for the top 10 most important terms of each topic is presented. These 10 words are also illustrated
in a word cloud for each topic in Figure 15, but due to confidentiality reasons, the explicit words cannot be
presented in this report. However, this information is shared with the company to provide insights on
how to categorize and name each topic based on the business context and the most probable words.
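For reference, a minimal sketch of how such a word cloud can be rendered from a topic's top c-TF-IDF terms is given below; it assumes a fitted topic_model and the wordcloud package, and is an illustration rather than the exact project code.

```python
# Hedged sketch: rendering a word cloud from the top c-TF-IDF terms of BERTopic's Topic 0.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

term_weights = dict(topic_model.get_topic(0))  # {term: c-TF-IDF weight} for Topic 0
cloud = WordCloud(background_color="white").generate_from_frequencies(term_weights)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```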

Table 10: The importance of the top n = 10 terms t to a class c, i.e., the c-TF-IDF score Wt,c , which is calculated
as defined in Section 2.6.5. Due to confidentiality reasons, each specific term t for class c cannot be explicitly
presented.

Term Importance to a Topic


Topic 0 Wt,c Topic 1 Wt,c
t1,0 0.1208 t1,1 0.1282
t2,0 0.0906 t2,1 0.0974
...

t10,0 0.0545 t10,1 0.0580


(a) Default model - Topic 0 (b) Default model - Topic 1

Figure 15: Visualization in a word cloud of the top n = 10 terms t for class c. Each subfigure represents one
of the topics presented in Table 10. Due to confidentiality reasons, each specific term t cannot be explicitly
presented.

4.2.2 Final BERTopic Model


The topic representation from the final model resulted in three different topics as well as an outlier topic,
as presented in Table 11. The number of documents assigned to these topics and the coherence score of the
model are also presented.

Table 11: The number of documents, i.e., customer calls, assigned to each topic obtained from the final model,
together with its Coherence Score.

Results Final BERTopic Model


Topic Count
Topic -1 (Outlier Topic) 24
Topic 0 4506
Topic 1 439
Topic 2 31
Coherence Score: 0.486

To our delight, Table 11 shows that the Coherence Score increased from 0.302 to 0.486 and the number of
topics from 2 to 3, in addition to the outlier topic (Topic -1), compared to the results from the default model
presented in Section 4.2.1, which enabled a fairer and more user-friendly topic representation. However, we
could also once again establish that almost 90% of the total dataset was assigned to Topic 0, which, as
mentioned, is not preferable when trying to define topics from the perspective of human interpretation and
business value. In addition, the remaining 10% were almost exclusively assigned to Topic 1, apart from a
small portion shared by Topic 2 and Topic -1. With that said, the topic representation was almost the
same in both cases.
Below, in Table 12, the term importance, i.e., the c-TF-IDF score, for the top 10 most important terms of each
topic is presented. These 10 words are also illustrated in a word cloud for each topic in Figure 16 but, due to
confidentiality reasons, the explicit words cannot be presented in this paper. However, this information is
shared with the company to provide insights on how to categorize and name each topic based on the business
context and the most probable words.

Table 12: The importance of the top n = 10 terms t to a class c, i.e., the c-TF-IDF score Wt,c , which is calculated
as defined in Section 2.6.5. Due to confidentiality reasons, each specific term t for class c cannot be explicitly
presented.

Term Importance to a Topic


Topic -1 Wt,c    Topic 0 Wt,c    Topic 1 Wt,c    Topic 2 Wt,c
t1,-1 0.1158     t1,0 0.0624     t1,1 0.1930     t1,2 0.1422
t2,-1 0.0623     t2,0 0.0438     t2,1 0.0716     t2,2 0.0535
...

t10,-1 0.0231    t10,0 0.0198    t10,1 0.0196    t10,2 0.0317


(a) Final model - Topic -1 (b) Final model - Topic 0

(c) Final model - Topic 1 (d) Final model - Topic 2

Figure 16: Visualization in a word cloud of the top n = 10 terms t for class c. Each subfigure represents one
of the topics presented in Table 12. Due to confidentiality reasons, each specific term t cannot be explicitly
presented.

From the results obtained in Table 11, Table 12 and Figure 16, we could establish that despite the improvement
in coherence score and the increased number of topics, the human interpretability of the topic quality was still subpar. Therefore, in
discussion with the company, no further model experimentation was performed with BERTopic, since the
results were too poor with regard to business value.


5 Discussion
In this section, we discuss some of the challenges and opportunities associated with our process and provide
some thoughts on how to improve the use of topic modeling to extract insights from this type of customer
call data. This section is divided into five subsections, where Section 5.1 provides a discussion of the project
limitations, Section 5.2 a discussion of the data quality, Section 5.3 a comparison between LDA and BERTopic,
Section 5.4 a discussion of evaluation methods, and finally Section 5.5 some reflections on how our results
can be interpreted.

5.1 Limitations
The most significant limitation of this project has been the data privacy concerns regarding data accessibility
and whether analysis of the data was possible from a GDPR point of view. As the call recordings contain
sensitive personal information about the customers, it became a lengthy process to determine whether analysis
of this type was possible from a privacy perspective. This was not accounted for in our initial planning
of the project timeline, and therefore limited how much of the initial project scope we managed to
complete.

5.2 Data Quality


Data quality is a critical factor in topic modeling. The quality of the data used to train the model can have
a significant impact on the accuracy and usefulness of the results. In this subsection, we discuss some of the
key considerations for ensuring high-quality data in topic modeling and explore some of the challenges and
opportunities associated with our process.

5.2.1 Transcription and Translation


In Section 3.5.1, we mention that OpenAI’s Whisper was our preferred model for transcribing and translating
the call recordings. Whisper is trained on 680,000 hours of multilingual and multitask supervised data
collected from the web [60]. The use of such a large and diverse dataset leads to improved robustness to
accents, background noise and technical language. However, there are some concerns about the quality of
the transcriptions and translations that are produced.
One concern is that the quality of the transcriptions and translations may not be as good as that of human tran-
scriptionists or translators. This is because the system is still learning and may not be able to accurately
transcribe or translate certain words or phrases, specifically words that are specific to the company’s services.
Another concern is that the system may not be able to accurately transcribe or translate certain accents or
dialects, or audio of low quality, for example due to background noise or low microphone quality.
To evaluate the quality of the transcriptions and translations that are produced, one approach to consider is
to use metrics such as word error rate (WER), character error rate (CER), and sentence error rate (SER)
[55]. These metrics can help determine how accurate the transcriptions and translations are compared to
human transcriptionists or translators. This would give an indication of how reliable the transcriptions and
translations are, and provide valuable input on the quality of the data.
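A minimal sketch of such an evaluation, assuming the jiwer package and a small set of human reference transcripts (neither of which is part of the original pipeline), could look as follows.

```python
# Hedged sketch: word and character error rates for one Whisper transcript against a
# human reference transcript, using the jiwer package (an assumption, not the project code).
from jiwer import wer, cer

reference = "I can't see this information on the web page"   # human reference transcript
hypothesis = "I can't see this information on the webpage"   # Whisper output
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```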
To ensure the best possible data quality of the transcriptions and translations, using human evaluators to
review a sample of the produced transcriptions and translations is also an alternative. However, this
would require far more time and resources than evaluating the data quality using error metrics.

5.2.2 Size of Dataset


The size of the dataset is an important factor to consider when doing topic modeling to extract insights from
customer phone calls. A larger dataset can provide more representative and diverse examples of customer
feedback, which can help to improve the accuracy and reliability of the model.
Generally speaking, in ML and NLP, the more high-quality data available for training the model, the better
the accuracy and reliability of the results produced. It is therefore important to carefully consider the research
question and determine the appropriate sample size based on the available resources and the desired level of
precision. The suggested minimum amount of data for BERTopic is 1000 documents in the corpus, which
our 5000 documents cover by some margin. However, a bigger dataset does not necessarily result in less bias
in the data. There is also the question of what time periods the data is sampled from, which brings us to the
next point of discussion - sampling data from specific time periods.


5.2.3 Sampling Data from Specific Time Periods


In addition to the size of the dataset, sampling data from specific time periods may result in data biases
when applying topic modeling to the dataset.
If we only transcribe calls from specific time periods, we may create bias in the data by excluding important
topics or perspectives that are not represented in the sample. For example, if we only transcribe calls from a
period when a particular topic was popular, we may miss important feedback or insights about other topics
that were not as frequent at that time.
To avoid this problem, it is important to carefully consider the sampling strategy and to ensure that the
sample is representative of the population of interest. This may involve collecting data from multiple time
periods and using random sampling techniques to select calls for transcription. In our case, this was done by
sampling data in batches of 1000, where each batch was sampled from every day for a specific week. However,
this resulted in our dataset only having samples from five weeks, which could introduce some bias to the
topic representation.
One of the future goals of this project work is to implement topic modeling in production to track and analyze
trends over time in what the customers are calling about. In this case, it is important to carefully consider
the time periods when the data is sampled, to ensure that the analysis represents the time period of interest.

5.2.4 Stopwords
As described in Section 2.2.4, stopwords are commonly used in topic modeling to remove words that are
considered to be unimportant or irrelevant for analysis. However, adding stopwords to a topic model may
actually create more bias in the data by removing words that may be of importance for understanding the
underlying themes and patterns in the text. For example, if we remove words like "see", we may miss
important connections between topics or lose valuable context that could help us better understand the
meaning of the call. This specific word could have different meanings in different contexts, for example "I
see" in the context of affirmation doesn’t necessarily provide any insight on what the call is about, but "I
can’t see this information on the web-page" on the other hand gives the word "see" an entirely different
weight to the context of the call. It is therefore highly subjective and hard to determine what stopwords to
remove to achieve the best possible topic model with regards to getting representative and coherent topics.
In the paper "Understanding Text Pre-Processing for Latent Dirichlet Allocation" by Schofield and Mimno,
the authors conclude that aside from extremely frequent stopwords, removal of stopwords does little to
impact the inference of topics on non-stopwords. The authors add that removing determiners, conjunctions,
and prepositions can improve model fit and quality, but that further removal has little effect on inference for
non-stopwords and thus can wait until after model training [67].
To address this problem, it is important to carefully consider which stopwords to include or exclude from the
analysis and to test the impact of these decisions on the results. In some cases, it may be more appropriate
to only use techniques such as lemmatization for reducing noise in the data, which can help to preserve more
of the original meaning of the text while still reducing noise and improving the accuracy of the model.
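A small sketch of this lighter pre-processing alternative is shown below, assuming NLTK and an illustrative project-specific stopword list; the additions to the stopword set are hypothetical examples, not the actual list from Appendix A.

```python
# Hedged sketch: lemmatize tokens and remove only the standard NLTK stopwords plus a few
# illustrative project-specific additions, instead of an aggressive stopword list.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english")) | {"blah", "goodbye"}  # illustrative additions

def light_preprocess(tokens):
    # Lowercase, lemmatize, and drop stopwords while preserving most of the original meaning
    return [lemmatizer.lemmatize(tok.lower()) for tok in tokens
            if tok.lower() not in stop_words]
```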

5.2.5 Validation Data


Unsupervised machine learning techniques such as topic modeling can be a powerful tool for extracting
insights from customer phone calls. However, these techniques also have some limitations and pitfalls that
should be considered. As mentioned, topic models can be very sensitive to the quality and representativeness
of the data, and may not always capture the full range of topics or perspectives that are present in the text.
One way to address the sensitivity to the representativeness of the data can be to manually create a validation
set of labeled calls by reading a set of transcriptions and manually assigning them to topics. This can help to
ensure that the model is accurately capturing the themes and patterns in the text, and can provide a useful
benchmark for evaluating the performance of the model. Additionally, manual validation can help to identify
errors or inconsistencies in the data that may not be apparent from unsupervised analysis alone.
In conclusion, while unsupervised machine learning techniques such as topic modeling can be a powerful tool
for extracting insights from customer phone calls, it is important to carefully consider their limitations and
perhaps implement complementary data quality procedures such as manual validation and error metrics to
ensure that the results are accurate and reliable.


5.3 Comparing LDA and BERTopic


In this thesis, we presented two different approaches to topic modeling, LDA and BERTopic. In this subsection,
we will explore the differences between the two approaches and discuss some of the key considerations for
choosing the best technique for our specific use-case. Table 13 provides an overview of the key characteristics
of LDA and BERTopic models in the context of practical application scenarios, which will be addressed in this
section.

Table 13: Comparison between LDA and BERTopic in the context of practical application scenarios. Table
inspired by [13].

Topic Modeling Comparison


Metric | LDA | BERTopic
Data preprocessing | Pre-processing is essential | Pre-processing is not needed in most cases
Number of topics | The number of topics must be known beforehand | Automatically finds the number of topics
Topic relationship for each document | Each document can be a mixture of topics | Each document is assigned to one topic only
Topic representation | Bag-of-words representation, disregards semantics | Semantic embeddings lead to more meaningful and coherent topics
Finding the optimal number of topics | More complex | Support for hierarchical topic reduction
Outliers | No outliers | HDBSCAN leads to more coherent and consistent topics, but at the price of having a significant portion of outliers (can be avoided with other techniques such as k-means)
Longer input documents | Handled regardless of document length | Most embedding models have a limit on the number of input tokens
Shorter input documents | Handled regardless of document length | Better performance with shorter documents
Small datasets (<1000 docs) | Can handle small datasets | May be less effective with small datasets
Large datasets (>1000 docs) | Can handle large datasets | Scales better with larger corpora
Speed & Resources | Effective and inexpensive computational resources (CPU) | Longer training times compared to LDA and potentially expensive computational resources (GPU)
Visualization | pyLDAvis for visualization | Advanced visualization tools

5.3.1 Advantages and Limitations of LDA


LDA is considered a state-of-the-art method for topic modeling, and there are many reasons why it is
still widely used today, 20 years after its introduction in 2003. The practical advantages, as shown in Table
13, include that it is effective and computationally inexpensive, and that it handles both shorter and longer
input documents as well as small and large datasets. For context, the size of the dataset used for this project
was 5000 documents with an average of 353 words per document. When implementing the LDA model, the
training procedure took less than one minute, running on an ordinary laptop CPU. In comparison,
the same training process took roughly 40 minutes for BERTopic, while running on a VM with a dedicated
NVIDIA T4 GPU.
While being computationally lightweight, LDA has been shown to be effective in identifying topics in large
collections of documents and to provide topics that are easily interpretable. The generative assumption of LDA
also confers one of the main advantages - that LDA can generalize the model that separates documents into
topic distributions to documents outside the training corpus. However, there are limitations of LDA, several
of which have been addressed in newer approaches such as BERTopic.
One limitation of LDA is that it is hard to know when LDA is working - topics are soft clusters, so there is
no objective metric to say “this is the right number of topics”. The number of topics must be specified by the
user, and there is no explicit way to determine the optimal number of topics. The coherence and perplexity
scores used in this thesis are two common ways of evaluating how many topics should be used for an LDA
model; however, there is some question as to how reliable they are as metrics. This is discussed in more detail
in Section 5.4.
Perhaps the most significant limitation of LDA, one that BERTopic aims to improve on, is the core premise
of LDA - the bag-of-words representation. In LDA, documents are considered a probabilistic mixture of
latent topics, with each topic having a probability distribution over words, and each document is represented
using a bag-of-words model. The bag-of-words representation makes LDA an adequate model for learning
hidden themes, but it does not account for a document’s deeper semantic structure. Not capturing the semantic
representation of a word can be an essential element in acquiring accurate results from a topic model. For
example, for the sentence "The girl became the queen of Sweden", the bag-of-words representation will not
detect that the words "girl" and "queen" are semantically related. This could potentially result in LDA
missing the "true" meaning of a sentence, if the semantic structure highly affects how the words in a sentence
should be interpreted.

5.3.2 Advantages and Limitations of BERTopic


Due to the limitations addressed above for LDA, new generative models have been proposed, using the
traditional approach from LDA as a base to improve upon its limitations. One of those improved models is
BERTopic, which mainly addresses the limitation of semantic understanding with the use of an embedding
component to generate meaningful topics understandable to a human. BERTopic is a deep learning approach
that has no limitation on how large datasets it can handle, and it has been shown to achieve competitive
performance compared to other state-of-the-art topic modeling algorithms, such as LDA.
One advantage of BERTopic is that, with the use of embeddings, it is possible for the user to choose from a
variety of embedding layers or even create custom ones tailored to the type of data used. As described in
Section 2.3.2, embedding layers are a type of hidden layer in a neural network that maps input information
from a high-dimensional to a lower-dimensional space, allowing the network to learn more about the rela-
tionship between inputs and to process the data more efficiently. By choosing the most appropriate
embeddings for the specific data, the user can tailor the model to their specific use-case and
improve its performance; a small sketch of this is shown below. Compared to LDA, the concern of finding the optimal number of topics is also ad-
dressed in BERTopic. BERTopic has support for hierarchical topic reduction, which allows for a more nuanced
understanding of the relationships between topics.
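The sketch below illustrates this idea, with the pre-trained all-mpnet-base-v2 model [34] chosen purely as an example; the variable docs is assumed to hold the transcribed customer calls.

```python
# Hedged sketch: plugging a custom sentence-embedding model into BERTopic.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-mpnet-base-v2")  # or a domain-specific model
topic_model = BERTopic(embedding_model=embedding_model, nr_topics="auto")
topics, probs = topic_model.fit_transform(docs)  # docs: list of transcribed customer calls
```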
The advantages presented above are an improvement on some aspects compared to traditional methods like
LDA; however, like all models, BERTopic is not perfect, and it has some limitations that need to be taken into
consideration.
The limitation that probably affected our results the most is that BERTopic assumes that each document
only contains a single topic, which does not reflect the reality that documents may contain multiple topics.
Since we are dealing with customer phone calls to a financial service provider, there are probably some parts of
every conversation that are very similar regardless of the topic of the conversation. This intuitively suggests
that most customer calls in our dataset contain more than one topic, making a single-topic-per-document
representation from BERTopic sub-optimal for producing granularity in topic categorizations.
Finally, there are some practical limitations of implementing the BERTopic algorithm as shown in Table
13. Mainly the time needed for running and fine-tuning the model, which is significantly more resource
demanding and computationally expensive than LDA. Since it is a deep learning model, it also scales better
with larger corpora and can have difficulties with giving accurate topic representations on smaller datasets.

5.4 Topic Model Evaluation


Topic model evaluation is an important step in the topic modeling process that can help to ensure that the
results are accurate and reliable. There are many different evaluation metrics that can be used to assess the
quality of a topic model, where perplexity and coherence are considered the two main metrics. However, it
should be addressed that these evaluation metrics are not always correlated with human interpretation of
topic quality. It is therefore of importance to understand the strengths and weaknesses of these evaluation
methods to ensure that they are not relied on blindly to determine the optimal topic model.
In this subsection, we will explore some of the key considerations for evaluating topic models and discuss the
pros and cons of using coherence and perplexity as evaluation measures. We will also explore some of the
challenges associated with the trade-off between topic coherence and business value, i.e., human interpretation
of topic quality.

5.4.1 Interpretation and Usefulness of Perplexity


Although the perplexity measure, as described in Section 2.7.2, has been a common method for the evaluation
of topic models, a study of Chang et al. [6] from 2009 found that this was often negatively correlated
with human judgements of topic quality, and suggested that evaluation should be focused upon real-world
task performance, inferring that the coherence measure is a better approach. Hence, while perplexity is
a mathematically sound approach for evaluating topic models, it may not be a good indicator of human-
interpretable topics.
Another concern with the perplexity measure, specifically when using the log_perplexity function from the
Gensim library, is that it is not entirely clear how the output statistic should be interpreted.
As mentioned at the end of Section 2.7.2, the mathematical formula is not explicitly stated in the documentation,
which leads to confusion regarding whether it should be interpreted as "lower is better", according to the standard
definition of perplexity, or the opposite. The latter seems to be the supported interpretation, especially after reading a post on the
Gensim Google Group by the creator of Gensim, Radim Řehůřek. He writes: "No, the opposite: a smaller
bound value implies deterioration. For example, bound -6000 is "better" than -7000 (bigger is better)." [14].
In this post, Řehůřek also adds that a better perplexity does not necessarily mean better topics, referring to
the previously mentioned paper by Chang et al. [6], but that it can still be useful for within-model comparisons.
These uncertainties on how to interpret perplexity measures raise some concerns about how reliable it is as
a method for topic model evaluation. Much suggests that topic coherence is a more intuitive metric that
is better at evaluating real-world task performance for a topic model. Coherence is therefore the main
metric, together with human interpretation of topic quality, used to evaluate the LDA and BERTopic models
presented in this thesis.

5.4.2 Coherence vs. Business Value


Topic coherence and business value are two important factors to consider when evaluating the quality of a
topic model. Topic coherence measures the degree of semantic similarity between the words in each topic,
while business value is based on human interpretation of topic quality and measures the usefulness of the
topics for achieving specific business goals. While both of these factors are important, they can sometimes
be in tension with each other, and it is important to consider the trade-offs between them when evaluating
the quality of a topic model for each specific use-case.
For example, a highly coherent topic model may not always be the most useful for achieving specific business
goals if the topics are not aligned with the needs and interests of the stakeholders. Similarly, a highly useful
topic model may not always be the most coherent if the topics are too broad or too narrow to provide
meaningful insights. To address these trade-offs, it is important to carefully consider the research question
when selecting the final model, and perhaps consider the use of complementary methods mentioned earlier, such
as manual validation, to ensure that the results are accurate and reliable.

5.5 Reflection on the Results


As presented in Section 4, the results for LDA and BERTopic were widely different with regards to optimal
number of topics and the human interpretation of topic quality. LDA gave significantly better results with
regards to human interpretation of topic quality, and a higher level of granularity in the amount of distinct
topics. BERTopic on the other hand created the topics while taking semantic understanding into consid-
eration, and managed to create a model with a slightly higher coherence score (0.486 compared to LDA’s
0.459). However, as mentioned previously, coherence score does not tell the whole story of the quality of a
topic model.
Since topic modeling is an unsupervised technique, there are no labels to rely on during evaluation making
topics difficult to evaluate. From a business perspective, it could therefore be wiser to rely more on the
interpretability of topics when selecting what model to use. We believe that evaluation metrics such as
coherence proved to be a good tool for selecting the candidates for a final model, but the final pick mainly
came down to topic interpretability rather than hard metrics.
With this in mind, the model that created the best business value for the company was definitely the LDA
model, since the interpretation of topic quality is crucial for business value. We believe that this was mainly
due to the limitation for BERTopic in which it assumes that each document only contains a single topic. As
mentioned, LDA assumes that each document is a mixture of topics, while BERTopic is a clustering-based
model that uses contextualized embeddings to group similar documents together, and assigns each document
to one topic using c-TF-IDF. In this case, a document-topic distribution from LDA gives a probability
representation of all extracted topics for each document, which probably suited the data better since a
conversation with the company’s customer service easily could shift between topics.
Looking at the acquired topics from LDA, topic 1 represented 53.7% of the tokens in the corpus, as seen in
Table 6. This suggests that this was most likely a topic that more often than not occurred in calls, regardless
of whether other topics were discussed at some point in the conversation. We believe that this made BERTopic
prone to categorizing most of the calls into this "general" topic, while ignoring other topics that occurred in
the conversations since they probably accumulated lower probabilities.


6 Conclusion
This section aims to address our answers to the research questions, summarize the results and discussions,
and outline potential future work in the area.

6.1 Research Questions


• Based on what the customers are calling about, can we find general topics and categorize
them based on the different business areas?
By utilizing topic modeling to analyze the contents of customer calls received by the company, it was
possible to identify general topics and categorize them based on different business areas. This catego-
rization allows for a comprehensive understanding of the customer concerns and the ability to address
them effectively within specific domains.

• Can the acquired categorizations provide insights on how the company can improve and
streamline their customer service?
Yes, this categorization enables the company to gain insights into the specific issues related to each
business area. Further research and analysis into the categorization could facilitate targeted
improvements, efficient resource allocation, and effective customer service. It could allow the company
to better comprehend the needs and preferences of its customers, tailor its services accordingly, and
proactively address any recurring or emerging concerns. This would enhance the overall customer sat-
isfaction, strengthen customer relationships, and support the organization’s growth and success.

• Compare the performance of two different topic modeling methods, LDA and BERTopic.
Based on evaluation measures and human interpretation of topic quality, which is the
most suitable method using the company’s customer call data?
When analyzing the results from the quantitative evaluation measures as well as the human interpretation
of topic quality, it could be concluded that LDA outperformed BERTopic in terms of topic quality.
Although BERTopic demonstrated a slightly higher coherence score, LDA aligned much better with
human interpretation, indicating a stronger ability to capture meaningful and coherent topics within
the company’s customer call data.

6.2 Recommendations for Further Work


One of the potential shortcomings of our analysis was the quality of the transcribed and translated data. We
believe that evaluating the quality of transcriptions and translations using error metrics should be a priority
going forward to ensure that the data going into the models contain as few errors as possible. Reducing
systematic errors in transcription and translation could potentially lead to completely new words and topics
that were previously lost due to mistranscriptions or mistranslations.
To use the insights from the acquired categories to improve and streamline the customer service, the ability
to track and analyze trends in topics would be very beneficial. For example, by building a dashboard and
running the model daily or weekly on new calls, it would allow the company to get statistics on developments
in how topics change over time, and identify percentage increases and decreases in the subjects their customers
are calling about. In order to analyze the defined topics and dig deeper into what kind of phone calls occur
in each category, additional models could also be built to analyze the documents specific to each topic to get
a higher level of granularity in the insights obtained.
One further improvement could be to utilize the word search capabilities of the LDA model. This would
allow the company to input a word of interest and get information on the most likely topics given that word.
This could be a valuable tool for the company stakeholders to analyze the trends of specific words and get
information on how they change over time; a small sketch of this idea is shown below.
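A minimal sketch of this idea, assuming a trained Gensim LDA model and its dictionary, is given below; the query word "invoice" is purely illustrative.

```python
# Hedged sketch: looking up the most likely topics for a given word with Gensim's
# get_term_topics; "invoice" is an illustrative query word.
word_id = dictionary.token2id.get("invoice")
if word_id is not None:
    term_topics = lda_model.get_term_topics(word_id, minimum_probability=0.0)
    # e.g. [(0, 0.031), (3, 0.012), ...] -> sort by probability to rank the topics
    print(sorted(term_topics, key=lambda pair: pair[1], reverse=True))
```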
The techniques and models used in the field of topic modeling and NLP are currently under heavy devel-
opment, and the future of this project relies on continuous and extensive research in the field. It is also
important to note that the NLP technologies used today work best when coupled with human interpretation
and intelligence, making the synergy between data-driven implementation and human interpretation crucial
to getting the most valuable insights from the findings.


References
[1] Felipe Almeida and Geraldo Xexéo. Word embeddings: A survey, 2019.
[2] Thomas Bayes and Richard Price. LII. An essay towards solving a problem in the doctrine of chances. By
the late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S.
Philosophical Transactions of the Royal Society of London, 53:370–418, 1763.
[3] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with python. In Analyzing
Text with the Natural Language Toolkit, chapter 1-3, pages 1–50, 67–110. O’Reilly Media, 1 edition, 2009.
[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res.,
3(null):993–1022, mar 2003.
[5] Justin Bois. Dirichlet distribution. https://distribution-explorer.github.io/multivariate_
continuous/dirichlet.html, Jan 2022. Accessed: April 25, 2023.
[6] Jonathan Chang, Sean Gerrish, Chong Wang, Jordan Boyd-graber, and David Blei. Reading tea leaves:
How humans interpret topic models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and
A. Culotta, editors, Advances in Neural Information Processing Systems, volume 22. Curran Associates,
Inc., 2009.
[7] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman.
Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–
407, 1990.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding, 2018.
[9] Igor Douven and Willem Meijs. Measuring coherence. Synthese, 156(3):405–425, 2007.
[10] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discov-
ering clusters in large spatial databases with noise. In Knowledge Discovery and Data Mining, 1996.
[11] Akhmedov Farkhod, Akmalbek Abdusalomov, Fazliddin Makhmudov, and Young Im Cho. Lda-based
topic modeling sentiment analysis using topic/document/sentence (tds) model. Applied Sciences, 11(23),
2021.
[12] Thomas S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics,
1(2):209 – 230, 1973.
[13] Shubham Garg. Topic modeling with lsa, plsa, lda and nmf:
Bertopic and top2vec - a comparison. https://towardsdatascience.com/
topic-modeling-with-lsa-plsa-lda-nmf-bertopic-top2vec-a-comparison-5e6ce4b1e4a5,
2021. Accessed: May 10, 2023.
[14] Gensim. Gensim google group. https://groups.google.com/g/gensim/c/iK692kdShi4, 2011. Ac-
cessed: May 5, 2023.
[15] Google Cloud. Pricing | cloud speech-to-text | google cloud. https://cloud.google.com/
speech-to-text/pricing, 2023. Accessed: April 23, 2023.
[16] Google Cloud. Speech-to-text: Automatic speech recognition. https://cloud.google.com/
speech-to-text/, 2023. Accessed: April 23, 2023.
[17] Maarten Graven. c-tf-idf. https://maartengr.github.io/BERTopic/getting_started/ctfidf/
ctfidf.html, 2021. Accessed: May 10, 2023.
[18] Maarten Grootendorst. Keybert: Minimal keyword extraction with bert., 2020.
[19] Maarten Grootendorst. Bertopic. https://maartengr.github.io/BERTopic/index.html, 2021. Ac-
cessed: April 27, 2023.
[20] Maarten Grootendorst. Bertopic algorithm. https://maartengr.github.io/BERTopic/algorithm/
algorithm.html#visual-overview, 2021. Accessed: April 21, 2023.
[21] Maarten Grootendorst. Bertopic c-tf-idf. https://maartengr.github.io/BERTopic/getting_
started/ctfidf/ctfidf.html#reduce_frequent_words, 2021. Accessed: April 26, 2023.


[22] Maarten Grootendorst. Bertopic keybert representation. https://maartengr.github.io/BERTopic/


api/representation/keybert.html, 2021. Accessed: May 9, 2023.
[23] Maarten Grootendorst. Bertopic outlier reduction. https://maartengr.github.io/BERTopic/
getting_started/outlier_reduction/outlier_reduction.html, 2021. Accessed: April 25, 2023.
[24] Maarten Grootendorst. Bertopic parameter tuning. https://maartengr.github.io/BERTopic/
getting_started/parameter%20tuning/parametertuning.html, 2021. Accessed: May 6, 2023.
[25] Maarten Grootendorst. Bertopic quickstart. https://maartengr.github.io/BERTopic/getting_
started/quickstart/quickstart.html, 2021. Accessed: April 21, 2023.
[26] Maarten Grootendorst. Bertopic representation. https://maartengr.github.io/BERTopic/getting_
started/representation/representation.html, 2021. Accessed: May 9, 2023.
[27] Maarten Grootendorst. Clustering. https://maartengr.github.io/BERTopic/getting_started/
clustering/clustering.html, June 2021. Accessed: May 10, 2023.
[28] Maarten Grootendorst. Vectorizers. https://maartengr.github.io/BERTopic/getting_started/
vectorizers/vectorizers.html, 2021. Accessed: May 10, 2023.
[29] Maarten Grootendorst. Bertopic: Neural topic modeling with a class-based tf-idf procedure, 2022.
[30] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David
Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus,
Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark
Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser,
Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature,
585(7825):357–362, September 2020.
[31] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine learning,
42(1-2):177, 2001.
[32] Anna-Lan Huang. Similarity measures for text document clustering. 2008.
[33] Hugging Face. all-minilm-l6-v2. https://huggingface.co/sentence-transformers/
all-MiniLM-L6-v2. Accessed: May 5, 2023.
[34] Hugging Face. all-mpnet-base-v2. https://huggingface.co/sentence-transformers/
all-mpnet-base-v2, 2021. Accessed: May 9, 2023.
[35] Hugging Face. Sentence transformers. https://huggingface.co/sentence-transformers, 2021. Ac-
cessed: May 5, 2023.
[36] Hamed Jelodar, Yongli Wang, Chi Yuan, and Xia Feng. Latent dirichlet allocation (LDA) and topic
modeling: models, applications, a survey. CoRR, abs/1711.04305, 2017.
[37] Krishna Juluru, Hao-Hsin Shih, Krishna Nand Keshava Murthy, and Pierre Elnajjar. Bag-of-words
technique in natural language processing: A primer for radiologists. RadioGraphics, 41(5):1420–1426,
2021. PMID: 34388050.
[38] Diksha Khurana, Aditya Koli, Kiran Khatter, and Sukhdev Singh. Natural language processing: State
of the art, current trends and challenges - multimedia tools and applications, Jul 2022.
[39] Josip Kunsabo and Jasminka Dobša. A systematic literature review on topic modelling and sentiment
analysis. 09 2022.
[40] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge
University Press, 3 edition, 2020.
[41] Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In Proceedings of the 52nd Annual
Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308,
2014.
[42] Xiaodong Liu et al. Multi-task deep neural networks for natural language understanding. In Proceedings
of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496, Florence,
Italy, July 2019. Association for Computational Linguistics.
[43] Christopher D. Manning, Prabhakar Raghavan, and Schutze Hinrich. Stemming and lemmatization.
Cambridge University Press, 2019.


[44] Leland McInnes and John Healy. Accelerated hierarchical density based clustering. In 2017 IEEE
International Conference on Data Mining Workshops (ICDMW), pages 33–42. IEEE, 2017.
[45] Leland McInnes and John Healy. HDBSCAN: Choosing the right parameters. https://hdbscan.
readthedocs.io/en/latest/parameter_selection.html, 2021. Accessed: May 6, 2023.
[46] Leland McInnes and John Healy. Hdbscan documentation: How hdbscan works. https://hdbscan.
readthedocs.io/en/latest/how_hdbscan_works.html, 2021. Accessed: May 11, 2023.
[47] Leland McInnes, John Healy, and Steve Astels. hdbscan: Hierarchical density based clustering. The
Journal of Open Source Software, 2(11):205, 2017.
[48] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projec-
tion for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[49] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Pro-
jection. https://github.com/lmcinnes/umap, 2021. Accessed: May 2, 2023.
[50] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projec-
tion. https://umap-learn.readthedocs.io/en/latest/api.html#umap.UMAP, 2021. Accessed: May
2, 2023.
[51] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. Umap: Uniform manifold approx-
imation and projection. The Journal of Open Source Software, 3(29):861, 2018.
[52] T.P. Minka. Estimating a dirichlet distribution. Annals of Physics, 2000(8):1–13, 2003.
[53] Prakash M Nadkarni, Lucila Ohno-Machado, and Wendy W Chapman. Natural language processing: an
introduction. Journal of the American Medical Informatics Association, 18(5):544–551, 09 2011.
[54] OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, Nov 2022. Accessed: April 18,
2023.
[55] OpenAI. Speech to text - openai api. https://platform.openai.com/docs/guides/speech-to-text,
2022. Accessed: May 5, 2023.
[56] M Parimala, Daphne Lopez, and NC Senthilkumar. A survey on density based clustering algorithms for
mining large spatial databases. International Journal of Advanced Science and Technology, 31(1):59–66,
2011.
[57] Kyubyong Park, Joohong Lee, Seongbo Jang, and Dawoon Jung. An empirical study of tokenization
strategies for various korean nlp tasks, 2020.
[58] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-
esnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830,
2011.
[59] Shahzad Qaiser and Ramsha Ali. Text mining: Use of tf-idf to examine the relevance of words to
documents. International Journal of Computer Applications, 181(1):25–29, Jul 2018.
[60] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.
Robust speech recognition via large-scale weak supervision, 2022.
[61] Nils Reimers. Sentencetransformers: Multilingual sentence embeddings using bert, roberta, xlm-roberta
and co. with pytorch. https://www.sbert.net/, 2019. Accessed: May 7, 2023.
[62] Nils Reimers. SBERT: Pretrained models. https://www.sbert.net/docs/pretrained_models.html,
Accessed 2023. Accessed: May 8, 2023.
[63] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks,
2019.
[64] Nils Reimers and Iryna Gurevych. Sentence-transformers: Multilingual sentence embeddings using
bert, roberta, xlm-roberta and co. with pytorch. https://www.sbert.net/docs/package_reference/
SentenceTransformer.html, 2021. Accessed: May 5, 2023.
[65] Michael Röder, Andreas Both, and Alexander Hinneburg. Exploring the space of topic coherence mea-
sures. In Proceedings of the eighth ACM international conference on Web search and data mining, pages
399–408. ACM, 2015.


[66] Frank Rosner, Alexander Hinneburg, Michael Röder, Martin Nettling, and Andreas Both. Evaluating
topic coherence measures. arXiv preprint arXiv:1403.6397, 2014.
[67] Alexandra Schofield and David Mimno. Understanding text pre-processing for latent dirichlet allocation.
In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1:
Long Papers), pages 553–562, 2017.
[68] Carson Sievert and Kenny Shirley. pyldavis: Python library for interactive topic model visualization.
https://github.com/bmabey/pyLDAvis, 2014. Accessed: April 28, 2023.
[69] Nur Tresnasari, Teguh Adji, and Adhistya Permanasari. Social-child-case document clustering based
on topic modeling using latent dirichlet allocation. IJCCS (Indonesian Journal of Computing and
Cybernetics Systems), 14:179, 04 2020.
[70] StatsExchange user:Rafs. Inferring the number of topics for gensim’s lda - per-
plexity, cm, aic and bic. https://stats.stackexchange.com/questions/322809/
inferring-the-number-of-topics-for-gensims-lda-perplexity-cm-aic-and-bic, 2018. Ac-
cessed: May 5, 2023.
[71] Rens van de Schoot, David Kaplan, Jaap Denissen, Jens B. Asendorpf, Franz J. Neyer, and Marcel A.G.
van Aken. A gentle introduction to bayesian analysis: Applications to developmental research. Child
Development, 85(3):842–860, 2014.
[72] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 30. Curran Associates, Inc., 2017.
[73] Wenhui Wang et al. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained
transformers. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in
Neural Information Processing Systems, volume 33, pages 5776–5788. Curran Associates, Inc., 2020.
[74] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson,
Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith
Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick,
Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation
system: Bridging the gap between human and machine translation, 2016.
[75] Radim Řehůřek. Gensim: Topic modelling for humans. https://radimrehurek.com/gensim/models/
ldamodel.html, Dec 2022. Accessed: April 23, 2023.
[76] Radim Řehůřek. Gensim: Topic modelling for humans. https://radimrehurek.com/gensim/corpora/
dictionary.html, Dec 2022. Accessed: April 25, 2023.
[77] Radim Řehůřek. Gensim: Topic modelling for humans. https://radimrehurek.com/gensim/, Dec
2022. Accessed: April 27, 2023.


Appendices
A Stopwords

Table 14: Stopwords used in both LDA and BERTopic. A total of 354 stopwords, consisting of the standard NLTK stopword list for English as well as added project-specific stopwords. The added stopwords are mainly mistranslations and the most frequent names.

Stopwords
a abdul abdullah about above absolutely adam address
adnan after again against agnes agneta ahlqvist ahmad
ain albin alex alexander all alma also amelia
an and anders andreas anna anton any are
aren’t as ask at axel be because been
before being below between björn blah boris both
but by bye byebye call camilla can carola
cecilia christian claes clara couldn’t d daniel david
dennis did didn’t do doe does doesn’t doing
don douglas down during each elin elisabeth emil
emilia emma erik erika eriksson etc. eva exactly
few filip first for fred fredrik from further
gabriel gabriela gabriella good goodbye great gunn had
hadn’t hamlet hanna hannes has hasn’t have haven
haven’t having he hello henrik her here herman
hers herself him himself his how hugo hussein
håkan i ida if ill in into is
isn’t it’s itll its itself jacob jan jasmin
jenny jens jessica jeppe jim joakim joel johan
johanna johannes johansson john johnny jonas josef josefin
just jörgen karin karlsson katarina kim kristian lars
larsson lasse last leon let lina linda linnea
linus lisa lucas luke lundin länder länderplats ländervapen
lönnbro m ma madelaine magnus malin marcus maria
marianne marie matilda mattias me mia michael michelle
mightn mightn’t mikael mohamed mohammed monica more most
mustn mustn’t my myself name needn’t nice nicklas
niklas no nor not now number o of
off okay oliver olle on once only or
oskar other our ours ourselves out over own
patrik person personal persson peter please rasmus re
regina richard right robert roger roland s same
sandra say sebastian see shan shan’t she she’s
should should’ve shouldn’t so some sonja stefan stella
stig such surname susanna t than thank thanks
that that’ll thats the their theirs them themselves
then there therese these they theyll this thomas
those through to tom too tove try ulrik
under understand until up vas ve very victor
victoria viktoria viola vägen wait wallgren was wasn
wasn’t we welcome were weren weren’t what when
where which while who whom why will william
with won won’t wouldn wouldn’t y yeah yes
ylva you you’d you’ll you’re you’ve your yours
yourself yourselves


B Model Building LDA

Figure 17: LDA model implementation in Python using the models.ldamodel module from the Gensim
library.

Each parameter in the model settings, as shown in Figure 17, can be described as follows; a minimal code sketch of the call is given after the list:
• corpus: The corpus of documents that the model will be trained on. This is the created dataset after
the transcribed files have gone through the data preprocessing and text representation stages.
• num_topics: This controls the number of requested latent topics that the model will identify. This
hyperparameter can be changed to produce new models with different numbers of topics fitted to
the data. During the evaluation and model selection phase, described in Section 3.9.1, this parameter is
varied to determine the model with the best coherence and perplexity scores, as described in Section
2.7.
• id2word: The dictionary that maps every token_id to its corresponding word.
• passes: The number of times the model will pass over the corpus during training. The standard setting
for this parameter is 10, which is kept unchanged in our model.
• alpha: This hyperparameter is the Dirichlet prior parameter that controls the sparsity of the topic
distribution, as described in Section 2.5.1. With a higher alpha, documents are assumed to be made
up of more topics, resulting in a less sparse topic distribution per document. When set to 'auto',
which is the case in our model implementation, the model will learn an asymmetric prior directly from
the data, meaning that the model will learn the best value of alpha for each document based on the
data [75].
• chunksize: The number of documents to consider at once, i.e., the number of documents processed in each
chunk as the model passes through the corpus during training. The standard setting for this parameter is 100, which is kept unchanged in our model.
• update_every: Number of documents to be iterated through for each update. The standard setting
for this parameter is 1, which is kept unchanged in our model.
• random_state: This serves as a random seed and can be utilized to get reproducible results during
the different runs. For consistency reasons, this parameter was kept at 100 during the entire modeling
phase.
• per_word_topics: This setting is set to True, which makes the model compute a sorted list of the
most likely topics for each word.
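A minimal sketch of the LdaModel call described above, assuming a toy pre-processed corpus; the documents, the num_topics value, and the printed output are illustrative rather than the exact configuration used in this project.

```python
# Hedged sketch of the LDA model building step; toy corpus for illustration.
from gensim import corpora
from gensim.models import LdaModel

documents = [
    ["invoice", "payment", "reminder"],
    ["card", "blocked", "replacement"],
    ["invoice", "due", "date"],
]
id2word = corpora.Dictionary(documents)               # maps token_id -> word
corpus = [id2word.doc2bow(doc) for doc in documents]  # bag-of-words per document

lda_model = LdaModel(
    corpus=corpus,
    num_topics=2,            # varied during model selection
    id2word=id2word,
    passes=10,
    alpha="auto",            # learn an asymmetric prior from the data
    chunksize=100,
    update_every=1,
    random_state=100,        # reproducibility across runs
    per_word_topics=True,    # sorted list of most likely topics per word
)

print(lda_model.print_topics(num_words=3))
```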


C Model Building BERTopic

Figure 18: BERTopic model implementation in Python using the BERTopic module from the bertopic library.

SentenceTransformer
The text representation was done using BERT-embeddings, described in Section 2.6.1, using the SentenceTransformer()
function from the sentence-transformers Python library (with pre-trained models hosted on Hugging Face), which
provides access to several different pre-trained embedding models. The main parameter settings for this function are [64] (a minimal usage sketch follows the list):
• model_name_or_path: This parameter was used to specify the pre-trained model to be used for
generating embeddings. It can either be a string with the name of the model or the path to a local
directory containing the pre-trained model files. When the language of the data is English, the default
model is all-MiniLM-L6-v2, but other pre-trained models such as all-mpnet-base-v2, both described in
Section 2.6.1, can be chosen.
• device: Specifies the device that should be used for computation, such as 'cuda' or 'cpu'. If None, the
function checks for a Graphics Processing Unit (GPU); otherwise the Central Processing Unit (CPU) is used.
• batch_size: This parameter specifies the batch size for generating embeddings. (Default = 32)
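A minimal sketch of the embedding step, assuming the all-MiniLM-L6-v2 model and two toy sentences; these are illustrative, not the exact data or configuration used in the thesis.

```python
# Hedged sketch of generating sentence embeddings with SentenceTransformer.
from sentence_transformers import SentenceTransformer

docs = ["I want to block my card.", "I have a question about my latest invoice."]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
embeddings = embedding_model.encode(docs, batch_size=32)

print(embeddings.shape)  # (2, 384) for all-MiniLM-L6-v2
```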
UMAP
When reducing the dimensionality, the UMAP technique, described in Section 2.6.2, was used, which allows both
the local and the global structure of the data to be preserved. This was done through the UMAP() function from the
Python library of the same name, and its main parameter settings are [49, 50] (a minimal usage sketch follows the list):
• n_neighbors: Controls the number of neighboring points used in the construction of the initial
high-dimensional graph. Increasing the value of n_neighbors can lead to more global structure being
preserved at the loss of local, detailed structure. Values in the range 5 to 50 are recommended, with 10 to 15
as a sensible default. (Default = 15)
• n_components: Specifies the number of dimensions in the low-dimensional embedding. In general a
value should be between 2 and 100. (Default = 2)
• metric: Specifies the distance metric used to measure the distance between points, i.e. meaningful
words, in the high-dimensional embedding space. Metrics such as euclidean distance or cosine distance,
described in Sections 2.4.1 and 2.4.2 respectively, can be used. (Default = Euclidean distance)


• min_dist: Controls the minimum distance between points in the embedding space. Increasing the
value gives a more even distribution of points but can also cause the loss of some small-scale structure. Values
should in general be in the range 0.001 to 0.5 and be set relative to the spread value. (Default = 0.1)
• spread: Controls the scale of the low-dimensional embedding and determines, together with min_dist,
how clustered/clumped the embedded points should be. (Default = 1)
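A minimal sketch of the dimensionality-reduction step; the random vectors stand in for the sentence embeddings, and the parameter values are illustrative (five output components is a common choice when UMAP is used with BERTopic, not the UMAP default of 2).

```python
# Hedged sketch of dimensionality reduction with UMAP on stand-in embeddings.
import numpy as np
from umap import UMAP

embeddings = np.random.rand(100, 384)  # stand-in for SentenceTransformer output

umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.1,
    spread=1.0,
    metric="cosine",
)
reduced_embeddings = umap_model.fit_transform(embeddings)

print(reduced_embeddings.shape)  # (100, 5)
```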
HDBSCAN
After the dimensionality reduction, a clustering procedure took place using HDBSCAN through the HDBSCAN()
function from sklearn's Python library. The most important parameters of this function are as follows
[45, 47] (a minimal usage sketch follows the list):
• min_cluster_size: This parameter sets the minimum number of points required to form a cluster,
i.e. controls the granularity of the resulting clusters. Larger values will produce fewer, larger clusters,
while smaller values the opposite. (Default = None)
• min_samples: The minimum number of samples in a neighborhood for a point to be considered a
core point. This parameter affects the density threshold for forming clusters where smaller values will
produce more clusters, while larger values the opposite. (Default = 5)
• metric: The distance metric to use when computing the mutual reachability distance between points.
Metrics such as euclidean distance or cosine distance, described in Sections 2.4.1 and 2.4.2 respectively, can
be used. (Default = Euclidean distance)
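A minimal sketch of the clustering step. The standalone hdbscan package is used here, which is what BERTopic relies on by default; an equivalent HDBSCAN class is also available in recent scikit-learn versions. The input is a random stand-in for the UMAP output, so most points may end up labelled as outliers.

```python
# Hedged sketch of density-based clustering with HDBSCAN.
import numpy as np
from hdbscan import HDBSCAN

reduced_embeddings = np.random.rand(100, 5)  # stand-in for UMAP output

hdbscan_model = HDBSCAN(
    min_cluster_size=10,
    min_samples=5,
    metric="euclidean",
)
labels = hdbscan_model.fit_predict(reduced_embeddings)

print(set(labels))  # cluster ids; -1 marks points treated as outliers
```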
CountVectorizer
To perform a bag-of-words representation within each cluster, the CountVectorizer method, presented in
Section 2.6.4, was used. This was done through the CountVectorizer() function from sklearn's Python library, and
its main parameter settings are [58] (a minimal usage sketch follows the list):
• stop_words: This parameter specifies a list of words to be removed from the input text data which
often can improve the accuracy and efficiency of NLP models as described in section 2.2.4. (Default =
None, which means that no stop words are removed)
• max_df : Specifies the maximum document frequency of a token to be included in the output vocab-
ulary. A lower value can filter out tokens that appear too frequently in the input data and are not
informative. (Default = 1, which means that all tokens are included)
• min_df : Specifies the minimum document frequency of a token to be included in the output vo-
cabulary. A higher value can filter out tokens that occur too infrequently and may not be useful for
modeling. (Default = 1, which means that all tokens that occur at least once in the input data are
included)
• tokenizer: This parameter is used to transform each document into a list of tokens. Here you can pass
a custom tokenizer function to extract tokens in a more sophisticated way. (Default = None).
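A minimal sketch of the bag-of-words step; the short stopword list and toy documents are illustrative and not the 354-word list in Appendix A.

```python
# Hedged sketch of building a bag-of-words representation with CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "question about my invoice",
    "my card is blocked",
    "invoice payment question",
]
stopwords = ["my", "is", "about"]  # toy stopword list

vectorizer_model = CountVectorizer(
    stop_words=stopwords,
    max_df=0.95,  # drop tokens appearing in more than 95% of the documents
    min_df=1,     # keep tokens that occur in at least one document
)
bag_of_words = vectorizer_model.fit_transform(docs)

print(vectorizer_model.get_feature_names_out())
```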
ClassTfidfTransformer
To distinguish the clusters from one another and create topics, the c-TF-IDF method, described in
Section 2.6.5, was used, which is a modified version of TF-IDF. This was generated through the ClassTfidfTransformer()
function from bertopic's Python library, and its most important parameters are as follows [21, 58]:
• sublinear_tf : A boolean value indicating whether or not to apply sublinear scaling to the term
frequency. (Default = False)
• smooth_idf : A boolean value indicating whether or not to apply smoothing to the inverse document
frequency. (Default = True)
• use_idf : A boolean value indicating whether or not to enable inverse-document-frequency reweighting.
(Default = True)
KeyBERTInspired
Finally, the fine-tuning of the topics was done with KeyBERT, described in Section 2.6.6, and in particular
the KeyBERTInspired() function in the bertopic library in Python. Its main parameter settings are [22]:
• top_n_words: An integer value indicating how many top n words to extract per topic. (Default =
10)
• nr_repr_docs: An integer value indicating the number of representative documents to extract per
cluster. (Default = 5)


BERTopic
In addition, the BERTopic() function itself has some hyperparameters to tune besides the choice of sentence-embedding
model, and the most important ones are [24] (a sketch assembling the full pipeline follows the list):
• nr_topics: Controls the number of topics to extract from the corpus. If this parameter is not specified
or set to a too large number, BERTopic will use an algorithm to estimate the optimal number of topics
automatically. (Default = None)
• min_topic_size: Controls the minimum number of documents required for a topic to be considered
valid. Topics with fewer documents are discarded. (Default = 10)
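A minimal sketch assembling the components above into a single BERTopic pipeline; parameter values are illustrative, and the ClassTfidfTransformer and KeyBERTInspired models are used with their default settings rather than the exact configuration used in this project.

```python
# Hedged sketch of a full BERTopic pipeline built from the sub-models above.
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.1, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples=5,
                        metric="euclidean", prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer()
representation_model = KeyBERTInspired(top_n_words=10)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
    nr_topics=None,      # let the clustering determine the number of topics
    min_topic_size=10,
)
```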


D Model Fit and Topic Extraction Functions BERTopic


The input parameter "docs" represents the dataset.

Figure 19: Functions for topic extraction within the bertopic library in python. Figure from [19].
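A minimal sketch of typical fit and topic-extraction calls in the bertopic library, corresponding to the functions shown in Figure 19. The public 20 Newsgroups dataset is used as a stand-in for the confidential call transcripts, and default BERTopic settings are assumed.

```python
# Hedged sketch of fitting BERTopic and extracting topics on a public dataset.
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:1000]

topic_model = BERTopic(min_topic_size=10)
topics, probabilities = topic_model.fit_transform(docs)   # one topic id per document

topic_info = topic_model.get_topic_info()                 # one row per topic: id, size, top words
top_words = topic_model.get_topic(0)                      # (word, c-TF-IDF score) pairs for topic 0
document_info = topic_model.get_document_info(docs)       # topic assignment per document

print(topic_info.head())
```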


E Visualization Functions BERTopic

Figure 20: Functions for topic visualization within the bertopic library in python. Figure from [19].
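A minimal sketch of common visualization calls in the bertopic library, corresponding to the functions shown in Figure 20. It assumes topic_model is an already fitted BERTopic model, e.g. the one from the sketch in Appendix D; each call returns an interactive Plotly figure.

```python
# Hedged sketch of BERTopic's built-in topic visualizations.
fig_map = topic_model.visualize_topics()                  # intertopic distance map
fig_bar = topic_model.visualize_barchart(top_n_topics=8)  # top words per topic
fig_hier = topic_model.visualize_hierarchy()              # hierarchical topic structure
fig_heat = topic_model.visualize_heatmap()                # topic similarity matrix

fig_map.show()
```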
