Customer Insights
A Comparative Analysis of LDA
and BERTopic in Categorizing
Customer Calls
Henrik Axelborn & John Berggren
† Equal contribution. In order to ensure fairness and impartiality, the order of authors was determined by three rounds of spin-the-wheel.
Abstract
Customer calls serve as a valuable source of feedback for financial service providers, potentially containing a
wealth of unexplored insights into customer questions and concerns. However, these call data are typically
unstructured and challenging to analyze effectively. This thesis project focuses on leveraging Topic Modeling
techniques, a sub-field of Natural Language Processing, to extract meaningful customer insights from recorded
customer calls to a European financial service provider. The objective of the study is to compare two widely
used Topic Modeling algorithms, Latent Dirichlet Allocation (LDA) and BERTopic, in order to categorize
and analyze the content of the calls. By leveraging the power of these algorithms, the thesis aims to provide
the company with a comprehensive understanding of customer needs, preferences, and concerns, ultimately
facilitating more effective decision-making processes.
Through a literature review and dataset analysis, i.e., pre-processing to ensure data quality and consistency,
the two algorithms, LDA and BERTopic, are applied to extract latent topics. The performance is then
evaluated using quantitative and qualitative measures, i.e., perplexity and coherence scores as well as interpretability and usefulness of topic quality. The findings contribute to knowledge on Topic Modeling for customer insights and enable the company to improve customer engagement and satisfaction and to tailor their customer strategies.
The results show that LDA outperforms BERTopic in terms of topic quality and business value. Although
BERTopic demonstrates a slightly better quantitative performance, LDA aligns much better with human
interpretation, indicating a stronger ability to capture meaningful and coherent topics within the company's customer call data.
Keywords: Customer Insights, Natural Language Processing, Topic Modeling, Latent Dirichlet Allocation,
BERTopic.
Sammanfattning

Customer calls are a valuable source of feedback for financial service providers, potentially containing a wealth of unexplored insights into customer questions and experiences. However, these call data are typically unstructured and challenging to analyze effectively. This thesis explores the application of Topic Modeling techniques, a sub-field of Natural Language Processing, to extract customer insights from recorded customer calls at a European financial service provider. The aim of the study is to compare two popular Topic Modeling algorithms, Latent Dirichlet Allocation (LDA) and BERTopic, in order to categorize and analyze the content of the calls. By leveraging the power of these algorithms, the thesis aims to provide the company with a comprehensive understanding of customer needs, preferences and problems, ultimately facilitating more effective decision-making processes.

Through a literature review and dataset analysis, i.e., preprocessing to ensure data quality and consistency, the two algorithms, LDA and BERTopic, are applied to extract latent topics. Model performance is then evaluated with quantitative and qualitative measures, through the metrics perplexity and coherence, as well as the interpretability and usefulness of topic quality. The findings contribute to knowledge on Topic Modeling for customer insights and enable the company to improve customer engagement and satisfaction and to tailor their customer strategies.

The results show that LDA outperforms BERTopic in terms of topic quality and business value. Although BERTopic demonstrates slightly better quantitative performance, LDA aligns much better with human interpretation, indicating a stronger ability to capture meaningful and coherent topics within the company's customer call data.
Acknowledgements
First and foremost, we would like to express our sincere gratitude to our supervisor at the partner company,
Robert Andersson Krohn, for his guidance and support throughout the entire thesis. We are also grateful
to the company for the opportunity to collaborate with the Data and ML team and gain insights into the
application of Topic Modeling with customer data.
A special thanks also goes to our supervisor from Umeå University, Associate Professor Per Arnqvist, for his
valuable direction and assistance in navigating the complex realm of theoretical writing and for his prompt
assistance whenever needed. We would also like to extend our appreciation to our classmates for the enriching
and memorable years we have spent together at Umeå University.
Lastly, we would like to express our gratitude to our families and friends for their constant support and en-
couragement as well as unwavering patience and understanding throughout the thesis and our entire academic
journey. Their love and patience have been instrumental in our success and well-being.
Thank you!
Henrik Axelborn
John Berggren
Contents
1 Introduction
1.1 Background
1.2 Problem Description
1.2.1 Categorization
1.3 Aim of Thesis
1.3.1 Research Questions
1.3.2 Purpose
1.4 Delimitations
1.5 Confidentiality
2 Theory
2.1 Natural Language Processing
2.1.1 Topic Modeling
2.1.2 Transformers
2.2 Data Cleaning and Preprocessing
2.2.1 Tokenization
2.2.2 Lemmatization
2.2.3 Lowercase and Non-Alphabetical Character Removal
2.2.4 Stopword Removal
2.2.5 WordPiece
2.3 Text Representation
2.3.1 Bag-of-Words
2.3.2 Word Embedding
2.3.3 TF-IDF - Term Frequency-Inverse Document Frequency
2.4 Metrics for Distance
2.4.1 Euclidean Distance
2.4.2 Cosine Similarity
2.5 Latent Dirichlet Allocation
2.5.1 Generative Process for LDA
2.5.2 Hierarchical Bayesian Structure
2.5.3 Dirichlet Distribution
2.6 BERTopic
2.6.1 Embeddings
2.6.2 Dimensionality Reduction
2.6.3 Clustering
2.6.4 Tokenizer
2.6.5 Weighting Scheme
2.6.6 Fine-tuning Representation
2.7 Topic Model Evaluation
2.7.1 Coherence
2.7.2 Perplexity
3 Method
3.1 Modeling Overview
3.1.1 LDA
3.1.2 BERTopic
3.2 Initial Process
3.3 Software
3.4 Data Description and Extraction
3.5 Creation of Dataset
3.5.1 Transcription and Translation
3.5.2 Concatenating the Dataset
3.6 Data Preprocessing
3.7 Text Representation
3.7.1 LDA
3.7.2 BERTopic
3.8 Modeling
3.8.1 LDA
3.8.2 BERTopic
3.9 Evaluation and Model Selection
3.9.1 LDA
3.9.2 BERTopic
4 Results
4.1 LDA
4.1.1 Model Selection
4.1.2 Final LDA Model
4.1.3 Extracting Topic Information
4.2 BERTopic
4.2.1 Default BERTopic Model
4.2.2 Final BERTopic Model
5 Discussion
5.1 Limitations
5.2 Data Quality
5.2.1 Transcription and Translation
5.2.2 Size of Dataset
5.2.3 Sampling Data from Specific Time Periods
5.2.4 Stopwords
5.2.5 Validation Data
5.3 Comparing LDA and BERTopic
5.3.1 Advantages and Limitations of LDA
5.3.2 Advantages and Limitations of BERTopic
5.4 Topic Model Evaluation
5.4.1 Interpretation and Usefulness of Perplexity
5.4.2 Coherence vs. Business Value
5.5 Reflection on the Results
6 Conclusion
6.1 Research Questions
6.2 Recommendations for Further Work
Appendices
A Stopwords
B Model Building LDA
C Model Building BERTopic
D Model Fit and Topic Extraction Functions BERTopic
E Visualization Functions BERTopic
List of Figures
1 Model architecture of the Transformer. Figure from [72].
2 Graphical model representation of LDA. Nodes represent random variables, links between nodes are conditional dependencies and boxes are replicated components.
3 Illustration of BERTopic's architecture and its modularity across a variety of sub-models. Figure from [20].
4 Algorithm overview of BERTopic's default model. Figure from [20].
5 Input structure for BERT. Figure from [8].
6 Algorithm overview for keyword extraction in topic n with KeyBERTInspired, which is based on KeyBERT. Figure from [26].
7 Illustration of the project processes covered in Section 3.
8 Schematic representation of the LDA modeling workflow, from the initial dataset, to preprocessing, text representation, modeling and finally visualization of topics and extraction of topic information.
9 Schematic representation of the BERTopic modeling workflow, from the original dataset to topic visualization and information extraction.
10 Word count distribution for the complete dataset of 5000 transcribed call recordings.
11 Example of a small corpus represented in bag-of-words format, i.e., a list of (token_id, token_count) tuples. In this example, the corpus consists of three documents, and the bag-of-words representation contains three list objects consisting of four, three and six (token_id, token_count) tuples respectively. For example, this representation shows that token_id = 3 was counted once in the first and second documents, and three times in the third document.
12 Coherence and perplexity scores for the LDA model with different numbers of topics T. The coherence score graph has peaks at T = 5, 7, 9, 11 and 15. T = 10 is also interesting since it lies between two peaks and has a high coherence score with a lower number of topics than the two peaks to its right (T = 10 < T = 11 < T = 15), and it therefore has a higher perplexity score.
13 Visualization of topics with PyLDAvis for the six candidate final models. Each subfigure represents one of the six candidate LDA models with the number of topics that produced the best coherence and perplexity scores, as presented in Table 4 and Figure 12. The size of each bubble is proportional to the percentage of unique tokens attributed to each topic.
14 Topic visualization with PyLDAvis for the final LDA model with 10 topics (T = 10).
15 Word cloud visualization of the top n = 10 terms t for class c. Each subfigure represents one of the topics presented in Table 10. For confidentiality reasons, the specific terms t cannot be explicitly presented.
16 Word cloud visualization of the top n = 10 terms t for class c. Each subfigure represents one of the topics presented in Table 12. For confidentiality reasons, the specific terms t cannot be explicitly presented.
17 LDA model implementation in Python using the models.ldamodel module from the Gensim library.
18 BERTopic model implementation in Python using the BERTopic module from the bertopic library.
19 Functions for topic extraction within the bertopic library in Python. Figure from [19].
20 Functions for topic visualization within the bertopic library in Python. Figure from [19].
List of Tables
1 Applied filters for downloading the call recordings.
2 Illustration of how the concatenated transcriptions are saved as output in the .csv file.
3 Descriptive statistics for the concatenated dataset of transcribed call recordings.
4 Coherence and perplexity scores for different numbers of topics. The rows highlighted in green represent the best models based on the coherence and perplexity score graphs in Figure 12, with regard to the "peaks" in Figure 12a and the largest (best) perplexity scores in Figure 12b.
5 Evaluation metrics for the final LDA model (T = 10).
6 Relative size of each topic with regard to the percentage of tokens assigned to that topic.
7 Output for 5 randomly selected documents $d_n$ ($n = \{1, 602, 974, 3501, 4982\}$) within the document corpus D, showing the probability distribution over each topic $t \in \{1, ..., T = 10\}$ for those documents. The numbers are rounded to 4 decimal places for a better overview. The probabilities in each row sum to 1 (if not rounded).
8 Distribution of the n = 10 words with the highest probability of belonging to each topic t. For confidentiality reasons, the specific words $w_{\varphi_t, n}$ for topic t given the word-topic distribution $\varphi_t$ cannot be explicitly presented. The probability of each word $w_{\varphi_t, n}$ given the word-topic distribution $\varphi_t$ is denoted $\varphi_{w,z}$, where z is the word-topic assignment, as defined in Section 2.5.
9 The number of documents, i.e., customer calls, assigned to each topic obtained from the default model, together with the model's coherence score.
10 The importance of the top n = 10 terms t to a class c, i.e., the c-TF-IDF score $W_{t,c}$, calculated as defined in Section 2.6.5. For confidentiality reasons, the specific terms t for class c cannot be explicitly presented.
11 The number of documents, i.e., customer calls, assigned to each topic obtained from the final model, together with its coherence score.
12 The importance of the top n = 10 terms t to a class c, i.e., the c-TF-IDF score $W_{t,c}$, calculated as defined in Section 2.6.5. For confidentiality reasons, the specific terms t for class c cannot be explicitly presented.
13 Comparison between LDA and BERTopic in the context of practical application scenarios. Table inspired by [13].
14 Stopwords used in both LDA and BERTopic. A total of 354 stopwords, using the standard NLTK stopword library for English, as well as added project-specific stopwords. The added stopwords are mainly mistranslations and the most frequent names.
List of Algorithms
1 Generative process for LDA
2 Algorithm for BERTopic
3 DBSCAN
List of Acronyms
API Application Programming Interface
ML Machine Learning
MLM Masked Language Modeling
MT-DNN Multi-Task Deep Neural Network
SBERT Sentence-BERT
SER Sentence Error Rate
VM Virtual Machine
1 Introduction
In this section, the subject and objectives of this master thesis project are presented. The section provides a problem description, followed by the aim and purpose of the thesis, along with the project delimitations and confidentiality disclosures.
1.1 Background
This thesis is conducted in collaboration with a European financial service provider. Due to confidentiality
reasons, the name of the company will be kept anonymous and will be referred to as "the partner company"
or "the company" in this thesis.
1.2.1 Categorization
What are the customers calling about?
As it stands today, no systematic, data-driven or automatic extraction of what the customers are calling
about is implemented. All such information is gathered from the employees and their general experience
and expertise in this area. This can lead to biased information, as the same conversation may be perceived differently by different employees. Further, this could potentially result in a misrepresentation of conversation
categories, whereas a data-driven solution is more consistent in its interpretation of topics given a word
distribution or the semantic structure of a conversation.
The company has a vision to find categorizations based on different business areas, and over time track and
analyze why the customers are calling, i.e., what the customers are calling about. This would allow the
company to quickly identify new categories and trends, and seamlessly adapt the information shared in calls
or on the website accordingly.
1.3.2 Purpose
The insights obtained from the project can hopefully be used in several different departments within the
company. With the ability to track and analyze how the topics in phone calls change over time, they can
implement changes to improve the overall customer experience. Some examples of questions that have the
potential to be answered with our study are:
– If the company receives a lot of calls with information available on their web-page, should they change
how and where the information is presented? Is some information missing that should be added or
clarified?
– If the analysis indicates that customers often complain about a particular issue, should the company
take steps to address that issue and improve the customer experience? For example, if queue time is
a common complaint, should they investigate their resource allocation in the customer service depart-
ment?
– Can analyzing this call data help the company identify areas where their employees may need
additional training or support?
1.4 Delimitations
The most crucial limitation of this thesis project is the 20-week time frame, during the Swedish spring term of 2023, in which we have access to the data and can perform our analysis. A large part of the time is dedicated to data preprocessing and construction of the dataset. As a consequence, less time is available for the modeling process, which requires us to delimit our scope as follows:
The main delimitation is that we will not have time to evaluate the data quality, which can significantly affect the results obtained. The call recordings first need to be transcribed to text, which can introduce bias in the data through transcription errors. Additionally, the transcribed texts need to be translated from Swedish to English to utilize the best pre-existing libraries trained on the English language. This is another potential source of error, where mistranslations could introduce bias in the dataset and affect the results. We therefore need to carefully consider the reliability of the results, given the potential sources of error that could negatively affect data quality.
Another delimitation is that we only cover incoming calls to the customer service department; no outgoing calls or calls to the sales department are included in the dataset. The extracted calls used in the project also have a minimum call duration of 60 seconds, which excludes all shorter calls.
1.5 Confidentiality
The data used for this master thesis comprises confidential customer cases containing private customer
information. To honor the company’s request, no information that can be linked directly to a person,
organization, or location will be revealed. Tables and diagrams will be censored to prevent confidential
information from being exposed. However, the evaluation and model selection processes will be presented in
a way that allows readers to replicate the methods used for similar tasks.
2 Theory
In this section, the underlying theory for this thesis is provided. The section is divided into seven subsections:
2.1 Natural Language Processing, 2.2 Data Cleaning and Preprocessing, 2.3 Text Representation, 2.4 Metrics
for Distance, 2.5 Latent Dirichlet Allocation, 2.6 BERTopic and 2.7 Topic Model Evaluation. Each subsection
will explore the theoretical underpinnings of the algorithms used to accomplish specific tasks in the thesis
work and provide an overview of the machine learning models that were analyzed and implemented.
2.1.2 Transformers
The Transformer model is a neural network architecture designed for NLP tasks and was proposed by Vaswani
et al. in 2017 [72]. Since then, it has become one of the most widely used models in the field, as it offers significant improvements over, for example, Recurrent Neural Networks (RNNs), previously the best-performing sequential models. This improvement was largely due to the use of a self-attention mechanism, which replaced the need for a recurrent component in the architecture. This mechanism allows different positions of a sequence to relate to each other to generate a sequence representation. By using this, the Transformer is able to better understand dependencies in long sequences, which results in improved efficiency and reduced computation time.
The Transformer consists of an encoder and a decoder, both composed of multiple layers of self-attention and
feed-forward neural networks. The encoder and decoder components of the Transformer model are described
by Vaswani et al. [72] as follows:
Encoder
The encoder takes an input sequence and produces a sequence of hidden states that capture the meaning
of each token in the input sequence. Each layer in the encoder consists of two sub-layers: a multi-head
self-attention mechanism and a position-wise feed-forward network. The self-attention mechanism computes
a weighted sum of the input sequence tokens based on their similarity to each other, while the feed-forward
network applies a non-linear transformation to each token independently.
Decoder
The decoder takes as input the output sequence produced by the encoder and generates a new sequence
token by token. Each layer in the decoder has three sub-layers: a multi-head self-attention mechanism, an
encoder-decoder attention mechanism, and a position-wise feed-forward network. The self-attention mechanism computes a weighted sum of the decoder's own output tokens based on their similarity to each other,
while the encoder-decoder attention mechanism computes a weighted sum of the encoder’s output tokens
based on their similarity to each decoder output token.
As mentioned, the self-attention mechanism allows the model to attend to different parts of the input sequence, while the feed-forward neural networks provide a non-linear transformation of the hidden states at each layer, which allows the model to capture complex patterns in the input sequence [72]. The architecture of the Transformer is presented in Figure 1.
Self-Attention
In the paper proposing the Transformer model by Vaswani et al. [72], one of the key contributions to the
field is the self-attention mechanism. The self-attention mechanism allows the Transformer model to compute
representations of its input and output sequences by attending to different positions of the same sequence.
In other words, it allows the model to weigh the importance of different positions of the input sequence when
computing a representation for each position.
Vaswani et al. describe the inner workings of self-attention as computing three matrices from the input
sequence: a query matrix, a key matrix, and a value matrix. Each matrix is computed by multiplying the
input sequence with a learned weight matrix. The query matrix is then used to compute a set of attention
scores between each position in the input sequence and every other position. These attention scores are used
to weight the value matrix, which is then summed up to produce a weighted representation of each position
in the input sequence [72].
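To make the mechanism concrete, the following minimal numpy sketch (not from the thesis; all values invented for illustration) computes scaled dot-product attention for a toy sequence, with random matrices standing in for the learned projection weights:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise position similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of the values

# Toy sequence of 4 token vectors of dimension 8; W_q, W_k, W_v stand in for
# the learned query, key and value projection matrices.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (4, 8): one contextualized vector per position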
As mentioned above in the encoder and decoder descriptions, the self-attention mechanism is used in both
the encoder and decoder components of the Transformer model. In the encoder, it allows the model to attend
to different positions of the input sequence when computing hidden representations for each position. In the
decoder, it allows the model to attend to different positions of the output sequence when generating each
token.
2.2.1 Tokenization
Tokenization is the process of breaking down a text document into smaller units called tokens, which are
usually words, sub-words or symbols. It is an essential step within NLP where the tokens serve as the basic
building blocks for subsequent NLP tasks such as language modeling, text classification, sentiment analysis
and machine translation [57].
The tokenization process involves separating words and punctuation marks from each other to create a list of
individual units that can be analysed and processed. This is typically done by using a tokenization strategy,
which can be rule-based or statistical. Rule-based tokenizers rely on pre-defined rules to identify tokens,
while statistical tokenizers use machine learning algorithms to learn patterns in the text [3]. The choice of
tokenization method often depends on the type of text being analysed, language in the text and the specific
NLP task being performed.
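As a brief illustration (not part of the original text), the following sketch uses NLTK's rule-based Punkt tokenizer on an invented sentence:

import nltk
nltk.download("punkt", quiet=True)   # Punkt tokenizer models
from nltk.tokenize import word_tokenize

text = "The customer called about an invoice, didn't they?"
print(word_tokenize(text))
# ['The', 'customer', 'called', 'about', 'an', 'invoice', ',', 'did', "n't", 'they', '?']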
2.2.2 Lemmatization
Lemmatization takes the breakdown of a text document a step further by analyzing the meaning behind each word. While tokenization breaks a text into individual units, lemmatization is the process of reducing the inflectional forms of each word to a common base or root, known as a lemma. This enables different forms of a word to be treated as the same word and improves the accuracy of the text analysis [43].
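A minimal illustration using NLTK's WordNetLemmatizer (the word choices are invented for this example):

import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("calls"))             # 'call' (treated as a noun by default)
print(lemmatizer.lemmatize("calling", pos="v"))  # 'call' (as a verb)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' (as an adjective)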
2.2.5 WordPiece
WordPiece is a subword tokenization method introduced by Wu et al. in 2016 [74] to break down words into
smaller subword units. This approach is effective for handling out-of-vocabulary words and morphologically complex languages, and it reduces both the size of the vocabulary needed and the use of unknown tokens. For instance, consider the following sentence:

An example that will show how, with the help of wordpiece, BERT tokenize

which is tokenized as:

"an" "example" "that" "will" "show" "how" "with" "the" "help" "of" "word" "##piece" "bert" "token" "##ize"

Words that do not exist in the vocabulary are split into subwords, where ## marks the sectioning.
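For illustration, this behaviour can be reproduced with the Hugging Face transformers tokenizer for bert-base-uncased; note that, unlike the listing above, the tokenizer also emits the punctuation tokens:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence = "An example that will show how, with the help of wordpiece, BERT tokenize"
print(tokenizer.tokenize(sentence))
# ['an', 'example', 'that', 'will', 'show', 'how', ',', 'with', 'the', 'help',
#  'of', 'word', '##piece', ',', 'bert', 'token', '##ize']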
2.3.1 Bag-of-Words
Within NLP, bag-of-words (BOW) is a text modeling technique used to represent textual data as a collection of words, without considering grammar or the order in which the words appear in the text. It
involves representing each document as a fixed-length vector with length equal to the vocabulary size. Each
dimension of this vector corresponds to the count or occurrence of a word in a document [37]. This technique
makes variable-length documents more amenable for use with a large variety of ML models and tasks.
A fixed-length document representation means that you can easily feed documents with varying length into
ML models. This allows you to perform clustering or topic classification on documents. The structural
information of the document is removed and models have to discover which vector dimensions are semantically
similar. Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance
between items is based on the likeness of their meaning or semantic content as opposed to lexicographical
similarity [37].
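A minimal sketch of this representation with Gensim, matching the (token_id, token_count) format shown in Figure 11 (the toy documents are invented):

from gensim import corpora

docs = [["invoice", "payment", "late", "fee"],
        ["card", "payment", "blocked"],
        ["invoice", "address", "change", "invoice", "invoice", "wrong", "number", "update"]]

dictionary = corpora.Dictionary(docs)                 # token -> token_id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
print(bow_corpus)
# Three lists of four, three and six (token_id, token_count) tuples, e.g.
# [[(0, 1), (1, 1), (2, 1), (3, 1)], [(3, 1), (4, 1), (5, 1)], ...];
# in the third document "invoice" appears with count 3.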
2.3.3 TF-IDF - Term Frequency-Inverse Document Frequency

The TF-IDF value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. This adjusts for words that appear frequently across all documents and are therefore less informative [59].
The fourth axiom, the triangle inequality, is the most complex condition. It states that if we are traveling
from point x to point y, we cannot obtain any benefit if we are forced to travel via some particular third
point z. The triangle-inequality axiom is what makes all distance measures behave as if distance describes
the length of a shortest path from one point to another [40].
2.4.1 Euclidean Distance

The Euclidean distance between two vectors A and B is defined as
$$d_{\mathrm{Euclidean}}(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2},$$
where the Euclidean distance is always larger than or equal to 0, i.e., $d_{\mathrm{Euclidean}}(A, B) \geq 0$ [40].
2.4.2 Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors:
$$\mathrm{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert},$$
where A and B are two vectors and $\theta$ is the angle between them.

Cosine similarity is not affected by the length of a document, and the score always falls between 0 and 1. A higher score indicates that the vectors are more similar to each other [32].
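Both measures are straightforward to compute; a small numpy sketch with invented vectors illustrates the difference between them:

import numpy as np

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice the magnitude
print(euclidean_distance(a, b))  # ~3.742: grows with vector magnitude
print(cosine_similarity(a, b))   # 1.0: magnitude is ignored, only angle matters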
2.5 Latent Dirichlet Allocation

• A corpus is a collection of M documents denoted by $D = \{d_1, d_2, ..., d_M\}$, where $d_n$ is the n-th document in the corpus.
The basic idea of LDA is that documents are represented as random mixtures over latent topics, where each
topic is characterized by a distribution over words. The words with the highest probability in each topic are usually used to determine what the topic is. LDA assumes that every document can be represented as
a probabilistic distribution over latent topics, as shown in Figure 2. The topic distribution in all documents
shares a common Dirichlet prior. Each latent topic in the LDA model is also represented as a probabilistic
distribution over words, with the word distributions of topics sharing a common Dirichlet prior. The objective
of LDA is not only to find a probabilistic model of a corpus that assigns high probability to members of the corpus, but also one that assigns high probability to other "similar" documents [4].
Figure 2: Graphical model representation of LDA. Nodes represent random variables, links between nodes
are conditional dependencies and boxes are replicated components.
The generative process above has only the words in documents as observed variables, while the rest are latent variables ($\varphi$ and $\theta$) and hyperparameters ($\alpha$ and $\beta$). To find the latent variables and hyperparameters, the probability of the observed data D is calculated and maximized as follows:
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$
The topic Dirichlet prior has parameters $\alpha$, and the word-topic distributions come from a Dirichlet distribution with parameters $\beta$. $T$ is the number of topics, $M$ the number of documents, and $N$ the size of the vocabulary. The Dirichlet-multinomial pair $(\alpha, \theta)$ is used for the topic distributions over the whole corpus, while the Dirichlet-multinomial pair $(\beta, \varphi)$ is used for the word distributions in each topic. The variables $\theta_d$ are document-level variables, while $z_{dn}$ and $w_{dn}$ are word-level variables sampled for each word in each document [11].
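In practice the latent variables are estimated with approximate inference. A hedged sketch of fitting an LDA model with Gensim's models.ldamodel, the module used in this project (Appendix B); the hyperparameter values below are illustrative rather than the thesis settings, and bow_corpus and dictionary are the objects from the bag-of-words sketch above:

from gensim.models import LdaModel

# alpha and eta correspond to the Dirichlet priors alpha and beta.
lda = LdaModel(corpus=bow_corpus,
               id2word=dictionary,
               num_topics=10,       # T, the number of latent topics
               alpha="auto",        # learn an asymmetric document-topic prior
               eta="auto",          # learn the topic-word prior
               passes=10,
               random_state=42)

# Top words per topic, i.e., the word-topic distributions.
for topic_id, words in lda.show_topics(num_topics=10, num_words=10):
    print(topic_id, words)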
2.5.2 Hierarchical Bayesian Structure

Bayes' theorem states that
$$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)},$$
where A and B are events and $p(B) \neq 0$.
• p(A|B) is the posterior probability of A given B.
• p(B|A) is the likelihood of B given a fixed A.
• p(A) and p(B) are the probabilities of observing A and B without any given conditions; they are also
called the prior probability and the marginal probability respectively.
Specifically, the theorem states that the posterior probability of an event is proportional to the product of
the prior probability of the event and the likelihood of the data given the event.
2.5.3 Dirichlet Distribution

The probability density function of the Dirichlet distribution is
$$f(\theta; \alpha) = \frac{1}{B(\alpha)} \prod_{t=1}^{T} \theta_t^{\alpha_t - 1},$$
where:
$$B(\alpha) = \frac{\prod_{t=1}^{T} \Gamma(\alpha_t)}{\Gamma\!\left(\sum_{t=1}^{T} \alpha_t\right)}$$
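To build intuition for the role of the concentration parameter $\alpha$, a small numpy sketch (illustrative only) draws topic mixtures $\theta$ from Dirichlet priors with different values:

import numpy as np

rng = np.random.default_rng(0)

# Each draw theta ~ Dirichlet(alpha) is a probability vector over T = 5 topics.
# Small alpha gives sparse mixtures (few dominant topics per document),
# large alpha gives near-uniform mixtures.
for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(alpha * np.ones(5))
    print(alpha, np.round(theta, 3), round(theta.sum(), 6))  # components sum to 1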
2.6 BERTopic
BERTopic is a cutting-edge topic modeling algorithm introduced by Maarten Grootendorst in his paper
"BERTopic: Neural topic modeling with a class-based TF-IDF procedure" from 2019 [29]. The term BERTopic
is an acronym or abbreviation for the combination of "BERT", Bidirectional Encoder Representations from
Transformers, and "topic" which refers to the goal of the algorithm, i.e., extracting meaningful topics from
text data.
The key innovation of BERTopic is its use of BERT embeddings, which are contextualized word representa-
tions learned from a large corpus of text using a deep neural network. These embeddings capture the rich
conceptual information of each word in a document, allowing BERTopic to capture the semantic meaning
and contextual relationships among words, resulting in more accurate topic modeling [29].
In addition to BERT embeddings, BERTopic also utilizes a class-based Term Frequency-Inverse Document Frequency (c-TF-IDF) procedure to filter out irrelevant words and prioritize the more important ones for topic modeling. This procedure uses the class labels of each document, such as the document source or category, to assign class-specific weights to words in the TF-IDF calculation. This helps BERTopic emphasize words that are more informative for a given class and disregard words that are less relevant [29].
BERTopic has been shown to outperform other topic modeling techniques, such as LDA, in terms of topic
coherence and interpretability. The algorithm is capable of generating high-quality and interpretable topics
from text data, making it suitable for a wide range of applications. In the paper by Grootendorst [29],
the technical details of BERTopic are more thoroughly explained than in this thesis, including the steps for
generating topics and the class-based TF-IDF procedure.
The high-level algorithm for BERTopic to create its topic representation consists of the five to six steps presented in Figure 3. Within each step there is a variety of sub-models to choose from, which makes BERTopic quite modular and allows you to build your own topic model. Each step needs to be carefully selected; the steps are somewhat independent from one another, even though there is of course some influence between them.
Figure 3: Illustration of BERTopic's architecture and its modularity across a variety of sub-models.
Figure from [20].
The default value for each step is presented to the right in Figure 4, with Fine-tune Representation as an optional step in the model structure [20]. Each of these steps and the theory behind them will be presented starting from the bottom and going upwards.
Algorithm BERTopic

To summarize, the basic idea and high-level algorithm for BERTopic presented in Figure 3 and Figure 4 is as follows, with the default sub-model for each step in parentheses; a sketch of the pipeline in code is given after the list.

1. Embed the documents as numerical vectors (SBERT).
2. Reduce the dimensionality of the embeddings (UMAP).
3. Cluster the reduced embeddings into groups of semantically similar documents (HDBSCAN).
4. Tokenize the documents in each cluster into a bag-of-words representation (CountVectorizer).
5. Create the topic representations with a weighting scheme (c-TF-IDF).
6. Optionally, fine-tune the topic representations (e.g., KeyBERTInspired).
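A hedged sketch of assembling this pipeline with the bertopic package; the component settings below are illustrative rather than the exact defaults or the configuration used in this thesis:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# Each step of the pipeline is passed in explicitly, and every component can
# be swapped for another sub-model independently of the rest.
topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),   # 1. embeddings
    umap_model=UMAP(n_components=5, metric="cosine"),          # 2. dimensionality reduction
    hdbscan_model=HDBSCAN(min_cluster_size=10),                # 3. clustering
    vectorizer_model=CountVectorizer(stop_words="english"),    # 4. bag-of-words tokenizer
)   # 5. the c-TF-IDF weighting is built into BERTopic itself

# docs would be the list of transcribed calls; fit_transform returns one
# topic id per document plus the assignment probabilities.
# topics, probs = topic_model.fit_transform(docs)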
2.6.1 Embeddings
BERTopic starts by converting the documents into numerical representations through word embeddings, as described in Section 2.3.2. The default method for doing so is sentence-transformers, more specifically Sentence-BERT (SBERT), which is a modification and improvement of the pretrained BERT network [63].
BERT
Bidirectional Encoder Representations from Transformers (BERT) was presented in late 2018 by Devlin et al. [8] as a fine-tuning approach for NLP. As its name suggests, the model uses a transformer-based architecture, described in Section 2.1.2. The main difference between the Transformer and BERT is that, while the Transformer consists of both an encoder and a decoder stack, BERT consists of an encoder stack only. Its
objective is to learn language and achieve this by producing meaningful representations of text using the
self-attention mechanism. In addition, BERT is bidirectional, i.e., it encodes tokens using information from both directions; the classification token [CLS] is added at the beginning of every sequence, while the [SEP] token indicates the end, see Figure 5. A consistent input sequence length of 512 tokens is required: shorter sequences are padded with [PAD] tokens, while longer sequences are truncated. This means that important insights and information may be lost for documents longer than 512 tokens. Finally, BERT also uses a WordPiece embedding with a vocabulary of 30,000 word pieces, described in Section 2.2.5 [8].
BERT is a pre-training strategy that succeeds in extracting deep semantic information from a sentence. It’s
trained on a large amount of text data in an unsupervised manner using two main tasks, Masked Language
Modeling (MLM) and Next Sentence Prediction (NSP) [8].
In the first unsupervised task, MLM, the advantages of a bidirectional approach come into play. In traditional left-to-right or right-to-left models, the goal is to predict the next token; a bidirectional approach would create an opportunity to look ahead at the next word and "cheat" when predicting it. The MLM method instead replaces approximately 15% of the tokens with a [MASK] token, enabling the use of a bidirectional model while forcing it to predict the missing word without cheating. The model thus learns to predict the masked word based on the context provided by the surrounding words, which allows BERT to learn bidirectional representations of language. To avoid discrepancies between pre-training and fine-tuning, where the [MASK] token is not present, the [MASK] token replaces the masked words only 80% of the time. In the remaining 20%, two strategies are used: half of the time a random token is substituted, and the other half the i-th token is left unchanged [8].
The second unsupervised task, NSP, is trained to predict whether two sentences are consecutive in a text or
not, which helps it understand sentence-level relationships. Given two sentences, A and B, the objective is
to predict whether the second sentence is the correct subsequent sentence or not. Half of the time, sentence
B is the true next sentence with respect to sentence A, and the other half a random sentence from the corpus
is selected. The training data comprise the BooksCorpus (800 million words) and, to put that into context, the English Wikipedia (2,500 million words), so it is a large corpus [8].
After pre-training, BERT can be fine-tuned on a specific downstream task using a smaller dataset. During
fine-tuning, BERT's parameters are updated to better fit the specific task at hand, and the generated embeddings can be used as input features. As a result, the fine-tuning of the model can be performed with
only one additional output layer to create a state-of-the-art model. Without any substantial task-specific
architecture modifications the model is conceptually simple and empirically powerful, and can be used for a
wide range of tasks such as text classification, named entity recognition or question answering [8].
SBERT
It has been shown that the standard BERT implementation is suboptimal for sentence similarity tasks that require the use of standard similarity measures such as Euclidean distance or cosine similarity, presented in Section 2.4.1 and Section 2.4.2 respectively. To address this issue, Reimers and Gurevych introduced Sentence-BERT (SBERT) in 2019 [63] as a solution that aims to make up for the weaknesses of BERT and further improve sentence-level embeddings. For instance, computing the pairwise similarities between 10,000 sentences with BERT would require around 50 million inference computations, which is time-consuming and unsuitable for unsupervised tasks like clustering. SBERT, a modification of the pre-trained BERT network, utilizes Siamese and triplet network structures to derive meaningful embeddings, reducing the time required to find the most similar pair among 10,000 sentences from 65 hours to about five seconds.
A Siamese network consists of two identical neural networks that share the same weights and are trained on pairs of similar and dissimilar sentences. The network learns to encode the semantic meaning of each sentence into a vector representation and then compares the vectors of two sentences using a similarity function, such as one of the measures mentioned earlier. A key contribution of SBERT is the pooling operation applied to the output of the second-to-last layer of the BERT model to generate a fixed-length vector representation of the sentence. The resulting vector representation captures the most informative features and meaning of the sentence in a compact and robust form. A more comprehensive description of SBERT can be found in the original paper by Reimers and Gurevych [63].
Sentence Transformers
"Sentence Transformer" and SBERT generally refer to the same thing in the context of NLP. However, since its introduction in 2019, many researchers and developers have built upon and extended the original SBERT method, and they may use different names, such as "sentence transformer", to refer to their implementations [63]. The SentenceTransformer package, for example, implements various modifications and enhancements of the SBERT method and provides an easy-to-use interface for generating sentence embeddings [61]. A large collection of pre-trained models tuned for various tasks can be found in [62], hosted on the HuggingFace Model Hub [35].
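A minimal usage sketch of the SentenceTransformer package (the sentences are invented; the 'all-MiniLM-L6-v2' model is described below):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["I never received my invoice.",
             "The invoice did not arrive.",
             "How do I block my card?"]
embeddings = model.encode(sentences)   # shape (3, 384) for this model
print(embeddings.shape)

# Semantically close sentences receive a higher cosine similarity.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high
print(util.cos_sim(embeddings[0], embeddings[2]))  # lower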
all-MiniLM-L6-v2
’all-MiniLM-L6-v2’ is a pre-trained transformer-based language model designed to be fine-tuned for a wide
range of NLP tasks, such as text classification, and is available on Hugging Face’s Sentence Transformers
library [33]. The model was trained on a large corpus of text data using a masked language modeling objective,
which allows it to learn contextual representations of words and sentences.
Despite its small size, 'all-MiniLM-L6-v2' offers good quality in comparison with other pre-trained transformer-based language models. The best-performing sentence embedding model on average, 'all-mpnet-base-v2', is only slightly better but five times slower, which often makes 'all-MiniLM-L6-v2' the go-to model. Its small size and fast inference speed make it suitable for deployment in resource-constrained environments [62].
The model architecture consists of 6 transformer layers, fewer than the 12 of the original MiniLM model by W. Wang et al. [73]. Compared to the standard BERT model, its main contribution is its approach to training: it employs a student-teacher approach, where the student model is trained to emulate the self-attention module of the teacher's final transformer layer. A limitation of the MiniLM model, however, is its restriction to sequences of at most 256 tokens; longer texts become truncated.
all-mpnet-base-v2
'all-mpnet-base-v2' is a pre-trained neural network model that belongs to the family of models known as Sentence Transformers and is available in Hugging Face's Sentence Transformers library [34]. The model is based on a variant of the transformer architecture called "multi-task deep neural network" (MT-DNN) and has 12 transformer layers, which perform multi-head self-attention and feedforward operations. Additional information can be found in the original paper [42].

Compared to 'all-MiniLM-L6-v2', this model is, as mentioned, significantly larger; it therefore requires more computing power and can be time-consuming. On the other hand, it performs better in all cases presented in [62] and could therefore be worth using despite its size disadvantage. A disadvantage compared with the standard BERT model is that text longer than 384 tokens is truncated; compared to the 256-token limit of 'all-MiniLM-L6-v2', however, this should be considered a benefit.
2.6.3 Clustering
The reduced embeddings created in Section 2.6.2 are now clustered with the purpose of grouping similar documents together based on their semantic content. This step plays a crucial role in the accuracy of the topic representations, as the effectiveness of the clustering technique directly impacts the quality of the results. A variety of clustering models are available, such as HDBSCAN, k-Means and BIRCH, presented in Figure 3, and since there is no single perfect clustering model, this modularity allows you to come as close as possible. The default setting, HDBSCAN, is a density-based clustering algorithm that provides a powerful and flexible tool for analyzing large collections of text data [27]. The resulting clusters can be used
to identify topics or themes in the collection of documents, allowing users to gain insights into the content of
the documents and the underlying patterns and trends within the data [44].
HDBSCAN
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is an extension of DBSCAN, which is thoroughly presented below in this section. HDBSCAN utilizes a hierarchical density-based approach to cluster data points and is designed to be highly efficient and effective, even for large and complex datasets. The algorithm works by first constructing a hierarchical clustering tree, then using a mutual-reachability graph to identify clusters within the tree. Finally, it applies a cluster stability analysis to select the optimal clustering solution [44]. To calculate the distances between points when constructing the distance matrix, which represents the pairwise distances between all data points, a metric needs to be chosen. This is a critical aspect of HDBSCAN, and both the characteristics of the data and the research question should be taken into consideration when choosing. For example, the Euclidean metric measures the straight-line distance between two points, while the cosine metric measures the angle between two vectors, as described in Sections 2.4.1 and 2.4.2 respectively. HDBSCAN supports a wide range of metrics, but these two are the most common ones, and Euclidean is the default setting [45, 46].
When using HDBSCAN in BERTopic, a number of outlier documents may also arise if they do not fall within any of the created topics. These are assigned to a topic labeled "-1", which contains the outlier documents that cannot be assigned to a topic given an automatically calculated probability threshold. This threshold can be modified in the model parameter settings to reduce the number of outliers, as described in [23].
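A small sketch of running HDBSCAN on its own with the hdbscan package; the two synthetic Gaussian blobs below stand in for UMAP-reduced document embeddings:

import numpy as np
import hdbscan

# Stand-in for the reduced document embeddings (5-dimensional here).
rng = np.random.default_rng(0)
reduced_embeddings = np.vstack([rng.normal(0.0, 0.3, size=(100, 5)),
                                rng.normal(3.0, 0.3, size=(100, 5))])

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric="euclidean")
labels = clusterer.fit_predict(reduced_embeddings)
print(set(labels))  # one cluster id per document; -1 marks outliers ("topic -1")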
DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an algorithm that identifies clusters of points in a dataset based on their density. The algorithm is known for its ability to handle noise and outliers efficiently and is widely used for mining large spatial datasets. By relying on the concept of density, it is able to identify clusters of arbitrary shapes and sizes, making it a valuable tool for many applications. In addition, it requires a minimal number of input parameters, two: the radius of the cluster, Eps, and the minimum number of points required inside the cluster, MinPts [56].
A Density Based Notion of Clusters
DBSCAN was introduced by M. Ester et al. in their 1996 paper "A density-based algorithm for discovering clusters in large spatial databases with noise" [10], which is based on the idea that clusters are dense regions of objects in the data space separated by regions of lower density. In their paper they provide a detailed description of the algorithm and its implementation, as well as the following six definitions that constitute the notion of density-based clusters.
Definition 1 - Eps-neighborhood of a point:
The Eps-neighborhood of a point p, denoted by $N_{Eps}(p)$, is defined by
$$N_{Eps}(p) = \{q \in D \mid dist(p, q) \leq Eps\}.$$
There are two kinds of points in a cluster: points inside the cluster (core points) and points on the border of the cluster (border points).
Definition 2 - Directly density-reachable:
A point p is directly density-reachable from a point q wrt. Eps and MinPts if
1. $p \in N_{Eps}(q)$, and
2. $|N_{Eps}(q)| \geq MinPts$ (core point condition).
Definition 3 - Density-reachable:
A point p is density-reachable from a point q wrt. Eps and MinPts if there is a chain of points $p_1, ..., p_n$, $p_1 = q$, $p_n = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$.
Definition 4 - Density-connected:
A point p is density-connected to a point q wrt. Eps and MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
Definition 5 - Cluster:
Let D be a database of points. A cluster C wrt. Eps and MinPts is a non-empty subset of D satisfying the following conditions:
1. $\forall p, q$: if $p \in C$ and q is density-reachable from p wrt. Eps and MinPts, then $q \in C$ (Maximality)
2. $\forall p, q \in C$: p is density-connected to q wrt. Eps and MinPts (Connectivity)
Definition 6 - Noise:
Let $C_1, ..., C_k$ be the clusters of the database D wrt. parameters $Eps_i$ and $MinPts_i$, $i = 1, ..., k$. The noise is defined as the set of points in D not belonging to any cluster $C_i$, i.e., $noise = \{p \in D \mid \forall i : p \notin C_i\}$.
Algorithm DBSCAN
The basic idea and algorithm behind DBSCAN [56] is as follows,
Algorithm 3 DBSCAN
(1) Select an arbitrary point p
(2) Retrieve all points density-reachable from p wrt. Eps and MinPts.
(3) If p is a core point, a cluster is formed.
(4) If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
(5) Continue the process until all the points have been processed.
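For illustration, the scikit-learn implementation of DBSCAN on a synthetic dataset; eps and min_samples correspond to Eps and MinPts above, and the values below are illustrative:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that are not linearly separable; a
# density-based method can still recover them.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # e.g. {0, 1}, plus -1 for any noise points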
2.6.4 Tokenizer
Before the next step, Weighting Scheme, where the creation of the topic representation in the BERTopic algorithm is initiated, a technique that allows for modularity needs to be selected. When using HDBSCAN as the cluster model, the clusters may have different densities and shapes, which means that a centroid-based topic representation technique may not be suitable. Therefore, a technique that makes little to no assumption about the expected structure of the clusters, such as a bag-of-words method, is preferred. To achieve this bag-of-words representation at the cluster level, all documents in a cluster are treated as a single document by simply concatenating them, after which the frequency of each word in each cluster is counted [20].
Since the quality of the topic representation is key to understanding the patterns, interpreting the topics and communicating the results, it is of utmost importance that the best possible method is chosen for the data at hand. BERTopic offers wide flexibility in vectorization algorithms, as presented in Figure 3, with methods such as CountVectorizer, Jieba and POS available. The default method for this step in BERTopic is CountVectorizer, described below [28].
CountVectorizer
The CountVectorizer is a method in scikit-learn that converts a collection of text documents into a matrix of token counts in order to extract features from text. Specifically, it converts the text documents into a bag-of-words representation, where each document is represented by a vector that counts the frequency of each word in the document. In addition, it performs several text pre-processing steps such as tokenizing the text, lowercasing and removing stop words; see Sections 2.2.1, 2.2.3 and 2.2.4 respectively for more detailed information. Additional information about CountVectorizer can be found in scikit-learn's documentation [58].
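As a brief sketch of how this works in practice (the documents here are invented placeholders, not actual call data):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the payment did not go through",
            "a question about the invoice",
            "the invoice payment is late"]

    # Tokenizes, lowercases and removes English stop words, then builds
    # a sparse document-term matrix of token counts.
    vectorizer = CountVectorizer(lowercase=True, stop_words="english")
    X = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())  # vocabulary, e.g. ['invoice', 'late', 'payment', 'question']
    print(X.toarray())                         # one count row per document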
2.6.5 c-TF-IDF
TF-IDF is a statistic intended to reflect how important a word is to a document in a collection or corpus. The classical procedure of TF-IDF combines the two statistics, term frequency (TF) and inverse document frequency (IDF),
and inverse document frequency (IDF),
W_{t,d} = tf_{t,d} · log(N / df_t)
where the TF models the frequency of term t in document d, while the IDF models how much information a term provides to a document. The latter is calculated by taking the logarithm of N, the total number of documents in the corpus, divided by df_t, the number of documents that contain t [29].
BERTopic, on the other hand, uses a custom class-based TF-IDF for topic representation, meaning that it instead measures a term's importance to a topic. To do so, all documents in a cluster are treated as a single document by simply concatenating them. Then, TF-IDF is adjusted to account for this representation by translating documents to clusters,
W_{t,c} = tf_{t,c} · log(1 + A / tf_t)
where the TF models the frequency of term t in class c, where class c is the collection of documents concatenated into a single document for each cluster. In this case, the IDF is replaced by the inverse class frequency to measure how much information a term provides to a class. This is calculated by taking the logarithm of A, the average number of words per class, divided by the frequency tf_t of term t across all classes. One is added inside the logarithm so that the output contains only positive values [29].
Since the goal of this class-based TF-IDF procedure is to evaluate the importance of words in clusters of
documents, rather than in individual documents, this approach enables the creation of topic-word distribu-
tions for each cluster. To reduce the number of topics to a user-specified level, we can iteratively combine
the c-TF-IDF representations of the least common topic with its most similar one [29].
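A compact numerical sketch of the c-TF-IDF formula above, computed with NumPy on an invented toy count matrix (three classes, three terms); the values are purely illustrative, assuming each row already holds the term counts of one concatenated cluster:

    import numpy as np

    # counts[c, t] = frequency of term t in class c (one concatenated
    # document per cluster); the numbers are invented for illustration.
    counts = np.array([[4.0, 0.0, 1.0],
                       [1.0, 3.0, 0.0],
                       [0.0, 1.0, 5.0]])

    A = counts.sum() / counts.shape[0]   # average number of words per class
    tf_t = counts.sum(axis=0)            # frequency of term t across all classes
    W = counts * np.log(1 + A / tf_t)    # W_{t,c} = tf_{t,c} * log(1 + A / tf_t)
    print(W)                             # higher values = more class-specific terms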
Figure 6: Algorithm overview for keyword extraction in topic n with KeyBERTInspired which is based on
KeyBERT. Figure from [26].
2.7.1 Coherence
Coherence measures are used in NLP to evaluate the topics constructed by a topic model, i.e., how well the model captures the underlying themes in a corpus of text [9]. Because they quantify the quality of the generated topics, they can be used to compare the outcomes of different topic models on the same corpus of texts.
Topic Coherence
Topic coherence has been proposed as an intrinsic evaluation method for topic models and is defined as the average or median of pairwise word similarities formed by the top words of a given topic [66]. There are several variations of topic coherence measures, see for example [65], but the most commonly used coherence measure in both LDA and BERTopic for estimating the optimal number of topics is c_v. The c_v coherence measure was proposed by Röder et al. [65] and is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized point-wise mutual information (NPMI) and the cosine similarity, described in Section 2.4.2. The formula for computing the c_v coherence measure is as follows:
c_v = (2 / (n(n-1))) · Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} log( (D(w_i, w_j) + e) / (D(w_i) + D(w_j) + e) )
where n is the number of top words in the topic, D(w_i, w_j) is the number of documents that contain both words w_i and w_j, D(w_i) is the number of documents that contain word w_i, and e is a smoothing parameter. The c_v coherence measure ranges between 0 and 1; a higher value indicates that the words in the topic are more coherent, and therefore better.
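In practice, c_v does not need to be computed by hand. A minimal sketch using Gensim's CoherenceModel on an invented toy corpus (the tokens and topics are placeholders) could look as follows:

    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel

    # Toy tokenized documents and top words per topic, purely illustrative.
    texts = [["invoice", "payment", "late"],
             ["card", "blocked", "payment"],
             ["invoice", "question", "payment"]]
    dictionary = Dictionary(texts)
    topics = [["invoice", "payment"], ["card", "blocked"]]

    cm = CoherenceModel(topics=topics, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    print(cm.get_coherence())  # higher (closer to 1) means more coherent topics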
2.7.2 Perplexity
Perplexity is an evaluation metric for topic models that accounts for the level of "uncertainty" in a model's prediction results. It measures how well a probability distribution or probability model predicts a sample, and is used in topic modeling to measure how well a model predicts previously unseen data [69].
Perplexity is calculated by splitting a dataset into two parts - a training set and a test set. The idea is to train
a topic model using the training set and then test the model on a test set that contains previously unseen
documents (i.e., held-out documents). The measure traditionally used for topic models is the perplexity of the held-out documents D_test, which can be defined as:
perplexity(D_test) = exp( - (Σ_{d=1}^{M} Σ_{i=1}^{N_d} log p(w_{di})) / (Σ_{d=1}^{M} N_d) )
where w_{di} is the i-th word in document d, p(w_{di}) is the probability of word w_{di} given the topic model, N_d is the number of words in document d, and M is the number of documents in the test set D_test. A lower perplexity score indicates that the model is better at predicting new data [69].
Perplexity Score in LDA
In LDA modelling, the standard perplexity measure is the output statistic from Gensim's log_perplexity function. The output statistic is a negative number, and the corresponding perplexity is calculated as 2^(-bound), as described in the documentation [75]. The mathematical formula for bound is however not explicitly stated in the documentation, only that the function calculates the per-word likelihood bound using a chunk of documents as evaluation corpus.
When reading about the output statistic from log_perplexity, there is some confusion about how it should be interpreted. For example, the user Rafs on StackExchange suggests that the log_perplexity function, counter-intuitively, does not output a perplexity after all, but a likelihood bound which is utilized in the perplexity's lower bound equation [70].
This interpretation of the log_perplexity output statistic suggests that smaller bound values imply deterioration, and therefore that bigger values mean the model is better. A post by the creator of Gensim, Radim Řehůřek, in the Google Gensim group seems to support this interpretation. In this post the question is whether the perplexity is improving when the output from log_perplexity is decreasing, and Řehůřek replies: "No, the opposite: a smaller bound value implies deterioration. For example, bound -6000 is "better" than -7000 (bigger is better)" [14]. For further discussion of the interpretation of the perplexity score output from the LDA model, see Section 5.4.
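Under this interpretation, the conversion from bound to conventional perplexity is a one-liner. In the sketch below, lda_model and test_corpus are hypothetical names for a trained Gensim LDA model and a held-out bag-of-words corpus:

    # log_perplexity returns a per-word likelihood bound, not a perplexity.
    bound = lda_model.log_perplexity(test_corpus)  # e.g. -7.1; bigger is better
    perplexity = 2 ** (-bound)                     # conventional perplexity; lower is better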
3 Method
In this section, the overall methodology and workflow of the thesis work are presented. To answer the research
questions stated in Section 1.3.1, the project work was divided into a number of different phases, as presented
in Figure 7.
Section 3.1 provides a visual overview of the modeling workflow for the LDA and BERTopic models. Section
3.2 summarizes the initial process of the project work, followed by Section 3.3 where the software and tools
used are presented. Section 3.4 provides an overall description of the data and the data extraction process,
followed by Section 3.5 with an explanation of how the dataset was created. Section 3.6 describes the
procedure for data cleaning and preprocessing, followed by Section 3.7 where the text representation steps are
explained. Section 3.8 gives a summary of the modeling phase, which is divided into two subsections, Section
3.8.1 which focuses on the LDA modeling and Section 3.8.2 on the BERTopic modeling. Finally, Section 3.9
provides a description of the methods used for model evaluation and model selection.
3.1.1 LDA
To summarize the steps of the LDA modeling, a schematic representation of the complete modeling workflow, from the initial dataset, through preprocessing, text representation and modeling, to the final visualization of topics and topic information extraction, is presented in Figure 8.
Figure 8: Schematic representation of the LDA modeling workflow, from the initial dataset, to preprocessing,
text representation, modeling and finally visualization of topics and extracting topic information.
3.1.2 BERTopic
A schematic representation of the BERTopic modeling workflow, from the original dataset to topic visualiza-
tion and information extraction, is presented in Figure 9. The steps involved in the topic modeling process
include initial text representation, clustering, topic representation, and finally visualization of the extracted topics and their information.
Figure 9: Schematic representation of the BERTopic modeling workflow, from the original dataset to topic
visualization and information extraction.
3.3 Software
For the entirety of this thesis work, the Python programming language was used. Python has for some time been the programming language of choice when implementing NLP models, due to its ease of use and its wide range of open-source libraries and packages. To ensure sufficient computational power for running large NLP models, all the programming, including transcription, translation and modeling, was done in a Linux environment on a Virtual Machine (VM) on the company's Google Cloud Platform (GCP). On the VM, Python version 3.7.13 was installed, since it is a stable version with support for all the dependencies and libraries used in the project. The additional computing power from the VM, with dedicated graphics processing units (GPUs), allowed a significant reduction in the time needed for running the models, compared to running them on local machines.
The call recordings were downloaded in five batches of 1000, one batch collected each week during weeks 10, 12, 13, 14 and 15 of 2023. The call recordings were downloaded in .wav audio format, and therefore no additional formatting was necessary before transcribing the files.
As suggested by the company, phone calls with a duration of less than 60 seconds were excluded to reduce noise in the dataset. Such calls could for example be interrupted calls or erroneous calls not intended for the company. The other filters suggested by the company were intended to limit the type of calls to incoming direct dial-ins to the customer service department.
Table 2: Illustration of how the concatenated transcriptions are saved as output in the .csv file.

Call Recording Index (n ∈ {1, ..., 5000})    Transcribed Call (one call per cell/row)
1                                            "Welcome to XXX, you are talking to..."
2                                            "Welcome to XXX, this is John..."
...                                          ...
The finished .csv file containing all 5000 transcribed and translated call recordings was then saved and ready for data preprocessing. In Table 3, we provide a summary of descriptive statistics for the content of the concatenated dataset. To get an intuition of the call duration variability in the dataset, a word count distribution was also computed, as seen in Figure 10, which shows the distribution of word count frequencies.
Table 3: Descriptive statistics for the concatenated dataset of transcribed call recordings.
Figure 10: Word count distribution for the complete dataset of 5000 transcribed call recordings.
3.7.1 LDA
In order to present the dataset in a machine-readable way for an LDA topic model, a bag-of-words representation, as described in Section 2.3.1, of the tokenized and lemmatized text was created. This was done using the doc2bow function from the Gensim library, which is a platform-independent library aimed at automatic
extraction of semantic topics from documents. The doc2bow function converts the documents of a corpus into
a bag-of-words format, i.e., a list of (token_id, token_count) tuples [76]. An example of a bag-of-words
representation is illustrated in Figure 11.
Figure 11: Example of a small corpus represented in bag-of-words format, i.e., a list of (token_id, token_count) tuples. In this example, the corpus consists of three documents and the bag-of-words representation contains three list objects consisting of four, three and six (token_id, token_count) tuples respectively. For example, this representation shows that token_id = 3 was counted one time in the first and second documents, and three times in the third document.
In order to know which word each token_id represents, a dictionary was also created where each token_id
is paired with its corresponding word. This dictionary was created using the corpora.Dictionary function
from the Gensim library [76].
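A minimal sketch of this procedure (the tokenized documents are invented placeholders):

    from gensim import corpora

    texts = [["invoice", "payment", "late"],
             ["card", "blocked", "payment"],
             ["invoice", "question"]]

    dictionary = corpora.Dictionary(texts)                 # maps token_id <-> word
    corpus = [dictionary.doc2bow(text) for text in texts]  # (token_id, token_count) tuples
    print(corpus[0])  # e.g. [(0, 1), (1, 1), (2, 1)]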
3.7.2 BERTopic
To enable the BERTopic model to interpret the dataset, the data had to be transformed into a machine-
readable format which BERTopic does in several steps during the modeling process. These steps were as
follows:
1. Initially, the text representation was created using BERT embeddings, described in Section 2.6.1, where the SentenceTransformer() class from the sentence-transformers Python library was utilized to implement context-based representation, using the pre-trained models all-MiniLM-L6-v2 [33] and all-mpnet-base-v2 [34], both described in Section 2.6.1.
2. After the numerical text representation of the documents was created, dimensionality reduction with the UMAP technique, presented in Section 2.6.2, was applied through the UMAP() function from the Python library of the same name [51].
3. With the reduced BERT embeddings, the data was clustered using HDBSCAN, described in Section 2.6.3, through the HDBSCAN() function from the hdbscan Python library [47].
4. Within each cluster, a bag-of-words representation, described in Section 2.3.1, was generated through the CountVectorizer() function from scikit-learn's Python library, described in Section 2.6.4. In addition, this function also performs several text pre-processing steps such as tokenizing the text, lowercasing and removing stopwords; see Sections 2.2.1, 2.2.3 and 2.2.4 respectively. See Appendix A for all stopwords used [58].
5. From the generated bag-of-words representation of each cluster, we want to distinguish the clusters from one another, which was solved by using c-TF-IDF, presented in Section 2.6.5. This was generated through the ClassTfidfTransformer() function from the bertopic Python library, which is a modified version of scikit-learn's TfidfTransformer() class [21, 58].
6. Lastly, on top of the generated topic representations, some fine-tuning was performed using KeyBERT, described in Section 2.6.6. This was done through the KeyBERTInspired() function in the bertopic library in Python [18].
The steps presented above are also illustrated and clarified in Figure 9, and a sketch of how they fit together in code is given below.
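The following minimal sketch shows how the six steps can be assembled into a single BERTopic model; docs stands for the list of transcribed calls, and the parameter values shown are illustrative rather than the exact settings listed in Appendix C:

    from sentence_transformers import SentenceTransformer
    from umap import UMAP
    from hdbscan import HDBSCAN
    from sklearn.feature_extraction.text import CountVectorizer
    from bertopic import BERTopic
    from bertopic.vectorizers import ClassTfidfTransformer
    from bertopic.representation import KeyBERTInspired

    topic_model = BERTopic(
        embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),           # step 1
        umap_model=UMAP(n_neighbors=15, n_components=5, metric="cosine"),  # step 2
        hdbscan_model=HDBSCAN(min_cluster_size=15, metric="euclidean"),    # step 3
        vectorizer_model=CountVectorizer(stop_words="english"),            # step 4
        ctfidf_model=ClassTfidfTransformer(),                              # step 5
        representation_model=KeyBERTInspired(),                            # step 6
    )
    topics, probs = topic_model.fit_transform(docs)  # docs: list of call transcripts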
3.8 Modeling
After data preprocessing and text representation, the dataset was ready for model implementation. This
subsection is divided into two parts describing the topic modeling phase, where Section 3.8.1 describes the
implementation of LDA and Section 3.8.2 the implementation of BERTopic.
3.8.1 LDA
For the entire implementation of LDA, the Gensim library was used. The Gensim library was developed for Python by Radim Řehůřek in 2008, and allows both LDA model estimation from a training corpus and inference of topic distributions on new, unseen documents [75]. Gensim requires NumPy, a fundamental package for scientific computing with Python [30], as well as Python version 3.6 or higher [77].
Model Building
With the corpus of text represented as a bag-of-words, the implementation of the LDA model was done by importing the models.ldamodel module from the Gensim library and utilizing the LdaModel function. This function has several parameters that can be used to customize the LDA model. For the two Dirichlet hyperparameters α and β, which control the sparsity of the topic-document and the word-topic distributions respectively, the standard 'auto' setting was used. When set to 'auto', the model learns an asymmetric prior directly from the data, meaning that it will learn the best values of alpha and beta for each document based on the data [75]. A full description of the model parameters and the settings used is shown in Appendix B.
The LDA model was then saved with the chosen parameter settings and can later be used to extract and
visualize topics.
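A minimal sketch of this model building step, where corpus and dictionary are the bag-of-words corpus and token dictionary from Section 3.7.1 and the number of topics shown is illustrative (note that Gensim names the word-topic prior β as eta):

    from gensim.models.ldamodel import LdaModel

    lda_model = LdaModel(
        corpus=corpus,         # bag-of-words corpus
        id2word=dictionary,    # token_id -> word mapping
        num_topics=10,         # requested number of latent topics
        passes=10,
        alpha="auto",          # learn asymmetric document-topic prior from data
        eta="auto",            # learn word-topic prior (beta) from data
        random_state=100,
        per_word_topics=True,
    )
    lda_model.save("lda_model")  # persist for later topic extraction/visualization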
Topic Extraction and Visualization
Once the model was trained, several different actions could be performed to extract topic information as well as visualize topics. For extracting information about the topics, the get_document_topics and show_topics functions from the Gensim library were used to retrieve the distribution of topics for each document, and the corresponding word distribution for each topic.
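For example, a brief sketch against the model trained above:

    doc_topics = lda_model.get_document_topics(corpus[0])        # [(topic_id, probability), ...]
    topics = lda_model.show_topics(num_topics=10, num_words=10)  # top words per topic
    print(doc_topics)
    print(topics[0])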
For visualizing the topics, the pyLDAvis library was used. pyLDAvis is a Python library for interactive topic model visualization and is a port of the R package by Carson Sievert and Kenny Shirley [68]. pyLDAvis produces a bubble chart, where each bubble on the left-hand side of the chart represents a topic. The larger the bubble, the more prevalent that topic is, since the size of each bubble is proportional to the percentage of unique tokens attributed to the topic. A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant. A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart. The visualizations created with pyLDAvis can be seen in Section 4.1.
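A minimal sketch of producing this visualization, assuming a recent pyLDAvis version where the Gensim adapter lives in pyLDAvis.gensim_models:

    import pyLDAvis
    import pyLDAvis.gensim_models as gensimvis

    vis = gensimvis.prepare(lda_model, corpus, dictionary)  # build the bubble chart
    pyLDAvis.save_html(vis, "lda_topics.html")              # save as interactive HTML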
3.8.2 BERTopic
For the implementation of BERTopic, the BERTopic module developed by Maarten Grootendorst, the creator of BERTopic, was used. The necessary dependencies for installing and importing the BERTopic library were also installed, as described by Grootendorst in [25].
Model Building
The implementation of the BERTopic modeling was done through a single call to the imported BERTopic function, which takes the desired sequence methods as inputs. The main steps for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN and c-TF-IDF, and in addition, a fine-tuning step can be added with KeyBERT. Within each of the different sequences of BERTopic there are several parameter settings that can be used, and the main settings for each step can be found in Appendix C. As mentioned, these are the main settings for the different sequences, but additional modification and tuning can also be done.
When a model setup was chosen, it needed to be trained on the specific data at hand, which was done through the .fit_transform function within the bertopic library. This and similar functions for model fitting in the bertopic library can be found in Appendix D.
Topic Extraction and Visualization
Once the model was trained, several different actions were performed to extract topic information and visualize it. For topic extraction, functions in the bertopic library such as .get_topic and .get_document_info were used, together with the other functions presented in Appendix D. From the results, an initial representation of the different topics was obtained and could be analyzed.
Through the different topic visualization tools presented in Appendix E, more insights were gained regarding each topic.
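For instance, a brief sketch of inspecting a fitted BERTopic model (continuing the pipeline sketch from Section 3.7.2):

    topic_info = topic_model.get_topic_info()       # one row per topic: id, size, name
    top_words = topic_model.get_topic(0)            # (word, c-TF-IDF score) pairs for topic 0
    doc_info = topic_model.get_document_info(docs)  # per-document topic assignment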
3.9.1 LDA
The best model was evaluated and selected as the model with the optimal number of topics for the data. The optimal number of topics was determined by comparing the topic coherence, c_v, and perplexity scores, as described in Section 2.7, for different numbers of topics in the model settings. In addition to evaluating the models using coherence and perplexity metrics, human interpretation of the topic quality was done by analyzing the visualizations generated with pyLDAvis. This was done to ensure that the final model would have topics that were easily interpretable and provided categorizations with the highest possible business value for the company.
To obtain the perplexity and coherence scores, a performance function was built to calculate these metrics using the CoherenceModel with the coherence parameter specified as coherence='c_v', and the log_perplexity function with the document corpus represented as bag-of-words as input. Both modules were imported from the Gensim library [75].
The coherence and perplexity scores for number of topics = {2, ..., 20} were saved and presented in a joint table as well as in separate graphs to determine the model with the optimal number of topics with regard to these metrics. See Section 4.1.1 for these results. A sketch of this selection loop is given below.
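A minimal sketch of such a performance function (evaluate_topic_counts is a hypothetical helper name; the range and settings mirror the description above):

    from gensim.models import CoherenceModel
    from gensim.models.ldamodel import LdaModel

    def evaluate_topic_counts(corpus, dictionary, texts, topic_range=range(2, 21)):
        """Return (num_topics, c_v coherence, log-perplexity bound) per candidate."""
        results = []
        for k in topic_range:
            model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                             passes=10, alpha="auto", eta="auto", random_state=100)
            c_v = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                                 coherence="c_v").get_coherence()
            bound = model.log_perplexity(corpus)
            results.append((k, c_v, bound))
        return results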
3.9.2 BERTopic
To obtain the best possible model for extracting topic insights, two different methods were used for model evaluation and comparison. The first method was comparing the topic coherence scores, c_v, as described in Section 2.7, which was done through Gensim's CoherenceModel with the parameter setting coherence='c_v'. The second method was human interpretation of the topic quality, which was analyzed through word cloud representations using the wordcloud library.
The perplexity measure is used in LDA as a measure of model fit, to find the optimal number of topics for the final model. In BERTopic, however, the number of topics does not need to be pre-determined by the user when using HDBSCAN. Therefore, perplexity was not needed as an evaluation metric in the BERTopic model selection phase.
The model selection procedure was initiated with the default model presented in Section 2.6, and more specifically Figure 4. From this baseline, the model selection procedure continued with several modifications and experiments regarding the different parameter settings presented in Appendix C. As presented in the appendix, different model settings can be employed during the selection process, and when we believed that the best possible model had been obtained, based on our two evaluation methods, a discussion with the stakeholders at the company was conducted. Since the results based on the second method, human interpretation of the topic quality, can be considered subjective, this dialogue was of utmost importance to ensure that a model that maximizes business value was created.
4 Results
In this section, the results of the project are presented. The section is divided into two subsections, Section 4.1 where the results from the LDA modeling are presented, and Section 4.2 where the results from the BERTopic modeling are presented.
4.1 LDA
This subsection presents the results of various LDA models applied to the dataset, which were evaluated based
on coherence score, perplexity score and relevance, i.e., human interpretation of topic quality. It includes
visual representations of topic clusters and aims to provide a concise overview of the LDA topic modeling
results as well as demonstrate its efficacy in extracting meaningful topic information.
Table 4: Coherence and perplexity scores for different numbers of topics. The rows highlighted in green represent the best models, selected by observing the coherence and perplexity score graphs in Figure 12, with regard to the "peaks" in Figure 12a and the largest (best) perplexity scores in Figure 12b.
(a) Coherence scores for number of topics T = {2, ..., 20} (b) Perplexity scores for number of topics T = {2, ..., 20}
Figure 12: Coherence and perplexity scores for the LDA model with different numbers of topics T. The coherence score graph has peaks for T = 5, 7, 9, 11 and 15. T = 10 is also interesting since it lies in between two peaks and has a high coherence score with a lower number of topics compared to the two peaks to its right (T = 10 < T = 11 < T = 15), and it therefore has a higher perplexity score.
(a) T = 5 (b) T = 7
(c) T = 9 (d) T = 10
(e) T = 11 (f) T = 15
Figure 13: Visualization of topics with pyLDAvis for the six candidate final models. Each subfigure represents one of the six candidate LDA models with a number of topics that produced the best coherence and perplexity scores, as presented in Table 4 and Figure 12. The size of each bubble is proportional to the percentage of unique tokens attributed to each topic.
Figure 14: Topic visualization with pyLDAvis for the final LDA model with 10 topics (T = 10).
Table 6: Relative size of each topic with regard to the percentage of tokens assigned to that topic.
n = 602     0.7255  0.1036  0.0430  0.0181  0.0260  0.0815  0.0008  0.0006  0.0007  0.0001
...
n = 974     0.3191  0.0296  0.5960  0.0155  0.0105  0.0190  0.0037  0.0029  0.0032  0.0006
...
n = 3501    0.3513  0.5322  0.0444  0.0096  0.0457  0.0103  0.0023  0.0018  0.0020  0.0003
...
n = 4982    0.3645  0.1087  0.1281  0.0038  0.3776  0.0147  0.0009  0.0007  0.0008  0.0001
Table 8: Distribution of the n = 10 words with the highest probability of belonging to each topic t. Due to confidentiality reasons, each specific word w_{φt,n} for topic t given the word-topic distribution φt cannot be explicitly presented. The probability of each word w_{φt,n} given the word-topic distribution φt is denoted as φ_{w,z}, where z is the word-topic assignment, as defined in Section 2.5.
w_{φ1,10}  0.0178   w_{φ2,10}  0.0160   w_{φ3,10}  0.0152   w_{φ4,10}  0.0164   w_{φ5,10}  0.0122

Topic 6    φ_{w,z}  Topic 7    φ_{w,z}  Topic 8    φ_{w,z}  Topic 9    φ_{w,z}  Topic 10   φ_{w,z}
w_{φ6,1}   0.0619   w_{φ7,1}   0.0881   w_{φ8,1}   0.1132   w_{φ9,1}   0.0526   w_{φ10,1}  0.0829
w_{φ6,2}   0.0603   w_{φ7,2}   0.0664   w_{φ8,2}   0.0897   w_{φ9,2}   0.0521   w_{φ10,2}  0.0504
...
w_{φ6,10}  0.0204   w_{φ7,10}  0.0135   w_{φ8,10}  0.0110   w_{φ9,10}  0.0205   w_{φ10,10} 0.0001
4.2 BERTopic
This subsection presents the outcome of two different BERTopic models applied to the given dataset, the
Default Model and the Final Model, presented in Section 4.2.1 and Section 4.2.2 respectively. These were
evaluated based on two different methods as described in Section 3.9.2, coherence score and relevance, i.e.,
human interpretation of topic quality. It includes visual representations of topic clusters and aims to provide
a concise overview of BERTopic’s topic modeling results as well as demonstrate its efficacy in discovering
meaningful insights. To maintain uniformity in the notation for topic affiliation, according to the theory presented in Section 2.6.5, a class c in BERTopic is equivalent to a topic, which is represented by t in the LDA notation.
4.2.1 Default Model
Table 9: The number of documents, i.e., customer calls, assigned to each topic obtained from the default model, together with the model's coherence score.
From Table 9 we could establish that almost 90% of the total dataset was assigned to one of the two topics, which is not preferable when trying to define topics from a human interpretation perspective. However, despite the disappointing topic sectioning, the default model produced a moderate coherence score, which indicated that further tuning might yield a useful result. Below, in Table 10, the term importance, i.e., the c-TF-IDF score, for the top 10 most important terms of each topic is presented. These 10 words are also illustrated in a word cloud for each topic in Figure 15, but due to confidentiality reasons, the explicit words cannot be presented in this report. However, this information is shared with the company to provide insights on how to categorize and name each topic based on the business context and the most probable words.
Table 10: The importance of the top n = 10 terms t to a class c, i.e., the c-TF-IDF score W_{t,c}, calculated as defined in Section 2.6.5. Due to confidentiality reasons, each specific term t for class c cannot be explicitly presented.
Figure 15: Visualization in a word cloud of the top n = 10 terms t for class c. Each subfigure represents one
of the topics presented in Table 10. Due to confidentiality reasons, each specific term t cannot be explicitly
presented.
4.2.2 Final Model
Table 11: The number of documents, i.e., customer calls, assigned to each topic obtained from the final model, together with its coherence score.
To our delight, Table 11 shows that the coherence score increased from 0.302 to 0.486 and the number of topics from 2 to 3, in addition to the outlier topic (Topic -1), compared to the results from the default model presented in Section 4.2.1, which enabled a fairer and more user-friendly topic representation. However, we could once again establish that almost 90% of the total dataset was assigned to Topic 0, which, as mentioned, is not preferable when trying to define topics from the perspective of human interpretation and business value. In addition, the remaining 10% were almost exclusively assigned to Topic 1, apart from a small portion shared by Topic 2 and Topic -1. With that said, the topic representation was almost the same in both cases.
Below, in Table 12, the term importance, i.e., the c-TF-IDF score, for the top 10 most important terms of each topic is presented. These 10 words are also illustrated in a word cloud for each topic in Figure 16 but, due to confidentiality reasons, the explicit words cannot be presented in this paper. However, this information is shared with the company to provide insights on how to categorize and name each topic based on the business context and the most probable words.
Table 12: The importance of the top n = 10 terms t to a class c, i.e., the c-TF-IDF score W_{t,c}, calculated as defined in Section 2.6.5. Due to confidentiality reasons, each specific term t for class c cannot be explicitly presented.
Figure 16: Visualization in a word cloud of the top n = 10 terms t for class c. Each subfigure represents one
of the topics presented in Table 12. Due to confidentiality reasons, each specific term t cannot be explicitly
presented.
From the results obtained in Table 11, Table 12 and Figure 16, we could establish that despite the improvement in coherence score and the increased number of topics, the human interpretability of topic quality was still subpar. Hence, in discussion with the company, no further model experimentation was performed with BERTopic, since the results were too poor with regard to business value.
5 Discussion
In this section, we discuss some of the challenges and opportunities associated with our process and provide
some thoughts on how to improve the use of topic modeling to extract insights from this type of customer
call data. This section is divided into five subsections, where Section 5.1 provides a discussion of the project
limitations, Section 5.2 a discussion of the data quality, Section 5.3 a comparison between LDA and BERTopic,
Section 5.4 a discussion of evaluation methods, and finally Section 5.5 some reflections on how our results
can be interpreted.
5.1 Limitations
The most significant limitation of this project has been the data privacy concerns regarding data accessibility and whether analysis of the data was possible from a GDPR point of view. As the call recordings contain sensitive personal information about the customers, it became a lengthy process to determine whether this type of analysis was permissible from a privacy perspective. This was not accounted for in our initial planning of the project timeline, and it therefore limited how much of the initial project scope we managed to complete.
5.2.4 Stopwords
As described in Section 2.2.4, stopword removal is commonly used in topic modeling to remove words that are considered unimportant or irrelevant for the analysis. However, adding stopwords to a topic model may actually introduce bias into the data by removing words that are important for understanding the underlying themes and patterns in the text. For example, if we remove a word like "see", we may miss important connections between topics or lose valuable context that could help us better understand the meaning of the call. This specific word can have different meanings in different contexts: "I see" as an affirmation does not necessarily provide any insight into what the call is about, but "I can't see this information on the web page" gives the word "see" an entirely different weight in the context of the call. It is therefore highly subjective and hard to determine which stopwords to remove to achieve the best possible topic model with regard to getting representative and coherent topics.
In the paper "Understanding Text Pre-Processing for Latent Dirichlet Allocation" by Schofield and Mimno,
the authors conclude that aside from extremely frequent stopwords, removal of stopwords does little to
impact the inference of topics on non-stopwords. The authors add that removing determiners, conjunctions,
and prepositions can improve model fit and quality, but that further removal has little effect on inference for
non-stopwords and thus can wait until after model training [67].
To address this problem, it is important to carefully consider which stopwords to include or exclude from the analysis and to test the impact of these decisions on the results. In some cases, it may be more appropriate to rely only on techniques such as lemmatization, which can reduce noise in the data while preserving more of the original meaning of the text and improving the accuracy of the model.
Table 13: Comparison between LDA and BERTopic in the context of practical application scenarios. Table
inspired by [13].
also confers one of its main advantages - that LDA can generalize the model that separates documents into topic distributions to documents outside the training corpus. However, there are limitations of LDA, several of which have been addressed in newer approaches such as BERTopic.
One limitation of LDA is that it is hard to know when LDA is working - topics are soft clusters, so there is no objective metric to say "this is the right number of topics". The number of topics must be specified by the user, and there is no explicit way to determine the optimal number. The coherence and perplexity scores used in this thesis are two common ways of evaluating how many topics should be used for an LDA model; however, there is some question as to how reliable they are as metrics. This is discussed in more detail in Section 5.4.
Perhaps the most significant limitation of LDA, one that BERTopic aims to improve on, is the core premise of LDA - the bag-of-words representation. In LDA, documents are considered a probabilistic mixture of latent topics, with each topic having a probability distribution over words, and each document is represented using a bag-of-words model. The bag-of-words representation makes LDA an adequate model for learning hidden themes, but it does not account for a document's deeper semantic structure. Not capturing the semantic representation of a word can be an essential element in acquiring accurate results from a topic model. For example, for the sentence "The girl became the queen of Sweden", the bag-of-words representation will not detect that the words "girl" and "queen" are semantically related. This could potentially result in LDA missing the "true" meaning of a sentence, if the semantic structure highly affects how the words in the sentence should be interpreted.
topic quality. It is therefore of importance to understand the strengths and weaknesses of these evaluation
methods to ensure that they are not relied on blindly to determine the optimal topic model.
In this subsection, we will explore some of the key considerations for evaluating topic models and discuss the
pros and cons of using coherence and perplexity as evaluation measures. We will also explore some of the
challenges associated with the trade-off between topic coherence and business value, i.e., human interpretation
of topic quality.
coherence proved to be a good tool for selecting the candidates for a final model, but the final pick mainly
came down to topic interpretability rather than hard metrics.
With this in mind, the model that created the best business value for the company was definitely the LDA model, since the interpretation of topic quality is crucial for business value. We believe that this was mainly due to the limitation of BERTopic in which it assumes that each document contains only a single topic. As mentioned, LDA assumes that each document is a mixture of topics, while BERTopic is a clustering-based model that uses contextualized embeddings to group similar documents together and assigns each document to one topic using c-TF-IDF. In this case, the document-topic distribution from LDA gives a probability representation of all extracted topics for each document, which probably suited the data better, since a conversation with the company's customer service easily could shift between topics.
Looking at the acquired topics from LDA, Topic 1 represented 52.7% of the tokens in the corpus, as seen in Table 6. This suggests that this was most likely a topic that more often than not occurred in calls, regardless of whether other topics were discussed at some point in the conversation. We believe that this made BERTopic prone to categorizing most of the calls into this "general" topic, while ignoring other topics that occurred in the conversations, since they probably accumulated lower probabilities.
6 Conclusion
This section aims to address our answers to the research questions, summarize the results and discussions,
and outline potential future work in the area.
• Can the acquired categorizations provide insights on how the company can improve and streamline their customer service?
Yes, this categorization enables the company to gain insights into the specific issues related to each business area. Further research and analysis of the categorization could facilitate targeted improvements, efficient resource allocation, and effective customer service. It could allow the company to better comprehend the needs and preferences of its customers, tailor its services accordingly, and proactively address any recurring or emerging concerns. This would enhance overall customer satisfaction, strengthen customer relationships, and support the organization's growth and success.
• Compare the performance of two different topic modeling methods, LDA and BERTopic. Based on evaluation measures and human interpretation of topic quality, which is the most suitable method using the company's customer call data?
When analyzing the results from the quantitative evaluation measures as well as human interpretation of topic quality, it could be concluded that LDA outperformed BERTopic in terms of topic quality. Although BERTopic demonstrated slightly higher coherence scores, LDA aligned much better with human interpretation, indicating a stronger ability to capture meaningful and coherent topics within the company's customer call data.
References
[1] Felipe Almeida and Geraldo Xexéo. Word embeddings: A survey, 2019.
[2] Thomas Bayes and Richard Price. LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F.R.S. Communicated by Mr. Price, in a letter to John Canton, A.M.F.R.S. Philosophical Transactions of the Royal Society of London, 53:370–418, 1763.
[3] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with python. In Analyzing
Text with the Natural Language Toolkit, chapter 1-3, pages 1–50, 67–110. O’Reilly Media, 1 edition, 2009.
[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, mar 2003.
[5] Justin Bois. Dirichlet distribution. https://distribution-explorer.github.io/multivariate_
continuous/dirichlet.html, Jan 2022. Accessed: April 25, 2023.
[6] Jonathan Chang, Sean Gerrish, Chong Wang, Jordan Boyd-graber, and David Blei. Reading tea leaves:
How humans interpret topic models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and
A. Culotta, editors, Advances in Neural Information Processing Systems, volume 22. Curran Associates,
Inc., 2009.
[7] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman.
Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–
407, 1990.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding, 2018.
[9] Igor Douven and Willem Meijs. Measuring coherence. Synthese, 156(3):405–425, 2007.
[10] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discov-
ering clusters in large spatial databases with noise. In Knowledge Discovery and Data Mining, 1996.
[11] Akhmedov Farkhod, Akmalbek Abdusalomov, Fazliddin Makhmudov, and Young Im Cho. Lda-based
topic modeling sentiment analysis using topic/document/sentence (tds) model. Applied Sciences, 11(23),
2021.
[12] Thomas S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics,
1(2):209 – 230, 1973.
[13] Shubham Garg. Topic modeling with LSA, pLSA, LDA, NMF, BERTopic and Top2Vec - a comparison. https://towardsdatascience.com/topic-modeling-with-lsa-plsa-lda-nmf-bertopic-top2vec-a-comparison-5e6ce4b1e4a5, 2021. Accessed: May 10, 2023.
[14] Gensim. Gensim google group. https://groups.google.com/g/gensim/c/iK692kdShi4, 2011. Ac-
cessed: May 5, 2023.
[15] Google Cloud. Pricing | cloud speech-to-text | google cloud. https://cloud.google.com/
speech-to-text/pricing, 2023. Accessed: April 23, 2023.
[16] Google Cloud. Speech-to-text: Automatic speech recognition. https://cloud.google.com/
speech-to-text/, 2023. Accessed: April 23, 2023.
[17] Maarten Graven. c-tf-idf. https://maartengr.github.io/BERTopic/getting_started/ctfidf/
ctfidf.html, 2021. Accessed: May 10, 2023.
[18] Maarten Grootendorst. Keybert: Minimal keyword extraction with bert., 2020.
[19] Maarten Grootendorst. Bertopic. https://maartengr.github.io/BERTopic/index.html, 2021. Ac-
cessed: April 27, 2023.
[20] Maarten Grootendorst. Bertopic algorithm. https://maartengr.github.io/BERTopic/algorithm/
algorithm.html#visual-overview, 2021. Accessed: April 21, 2023.
[21] Maarten Grootendorst. Bertopic c-tf-idf. https://maartengr.github.io/BERTopic/getting_
started/ctfidf/ctfidf.html#reduce_frequent_words, 2021. Accessed: April 26, 2023.
[44] Leland McInnes and John Healy. Accelerated hierarchical density based clustering. In 2017 IEEE
International Conference on Data Mining Workshops (ICDMW), pages 33–42. IEEE, 2017.
[45] Leland McInnes and John Healy. HDBSCAN: Choosing the right parameters. https://hdbscan.
readthedocs.io/en/latest/parameter_selection.html, 2021. Accessed: May 6, 2023.
[46] Leland McInnes and John Healy. Hdbscan documentation: How hdbscan works. https://hdbscan.
readthedocs.io/en/latest/how_hdbscan_works.html, 2021. Accessed: May 11, 2023.
[47] Leland McInnes, John Healy, and Steve Astels. hdbscan: Hierarchical density based clustering. The
Journal of Open Source Software, 2(11):205, 2017.
[48] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projec-
tion for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[49] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Pro-
jection. https://github.com/lmcinnes/umap, 2021. Accessed: May 2, 2023.
[50] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projec-
tion. https://umap-learn.readthedocs.io/en/latest/api.html#umap.UMAP, 2021. Accessed: May
2, 2023.
[51] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. Umap: Uniform manifold approx-
imation and projection. The Journal of Open Source Software, 3(29):861, 2018.
[52] T.P. Minka. Estimating a dirichlet distribution. Annals of Physics, 2000(8):1–13, 2003.
[53] Prakash M Nadkarni, Lucila Ohno-Machado, and Wendy W Chapman. Natural language processing: an
introduction. Journal of the American Medical Informatics Association, 18(5):544–551, 09 2011.
[54] OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, Nov 2022. Accessed: April 18,
2023.
[55] OpenAI. Speech to text - openai api. https://platform.openai.com/docs/guides/speech-to-text,
2022. Accessed: May 5, 2023.
[56] M Parimala, Daphne Lopez, and NC Senthilkumar. A survey on density based clustering algorithms for
mining large spatial databases. International Journal of Advanced Science and Technology, 31(1):59–66,
2011.
[57] Kyubyong Park, Joohong Lee, Seongbo Jang, and Dawoon Jung. An empirical study of tokenization
strategies for various korean nlp tasks, 2020.
[58] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-
esnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830,
2011.
[59] Shahzad Qaiser and Ramsha Ali. Text mining: Use of tf-idf to examine the relevance of words to
documents. International Journal of Computer Applications, 181(1):25–29, Jul 2018.
[60] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.
Robust speech recognition via large-scale weak supervision, 2022.
[61] Nils Reimers. Sentencetransformers: Multilingual sentence embeddings using bert, roberta, xlm-roberta
and co. with pytorch. https://www.sbert.net/, 2019. Accessed: May 7, 2023.
[62] Nils Reimers. SBERT: Pretrained models. https://www.sbert.net/docs/pretrained_models.html,
Accessed 2023. Accessed: May 8, 2023.
[63] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks,
2019.
[64] Nils Reimers and Iryna Gurevych. Sentence-transformers: Multilingual sentence embeddings using
bert, roberta, xlm-roberta and co. with pytorch. https://www.sbert.net/docs/package_reference/
SentenceTransformer.html, 2021. Accessed: May 5, 2023.
[65] Michael Röder, Andreas Both, and Alexander Hinneburg. Exploring the space of topic coherence mea-
sures. In Proceedings of the eighth ACM international conference on Web search and data mining, pages
399–408. ACM, 2015.
[66] Frank Rosner, Alexander Hinneburg, Michael Röder, Martin Nettling, and Andreas Both. Evaluating
topic coherence measures. arXiv preprint arXiv:1403.6397, 2014.
[67] Alexandra Schofield and David Mimno. Understanding text pre-processing for latent dirichlet allocation.
In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1:
Long Papers), pages 553–562, 2017.
[68] Carson Sievert and Kenny Shirley. pyldavis: Python library for interactive topic model visualization.
https://github.com/bmabey/pyLDAvis, 2014. Accessed: April 28, 2023.
[69] Nur Tresnasari, Teguh Adji, and Adhistya Permanasari. Social-child-case document clustering based
on topic modeling using latent dirichlet allocation. IJCCS (Indonesian Journal of Computing and
Cybernetics Systems), 14:179, 04 2020.
[70] StackExchange user Rafs. Inferring the number of topics for Gensim's LDA - perplexity, CM, AIC and BIC. https://stats.stackexchange.com/questions/322809/inferring-the-number-of-topics-for-gensims-lda-perplexity-cm-aic-and-bic, 2018. Accessed: May 5, 2023.
[71] Rens van de Schoot, David Kaplan, Jaap Denissen, Jens B. Asendorpf, Franz J. Neyer, and Marcel A.G.
van Aken. A gentle introduction to bayesian analysis: Applications to developmental research. Child
Development, 85(3):842–860, 2014.
[72] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[73] Wenhui Wang et al. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 5776–5788. Curran Associates, Inc., 2020.
[74] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson,
Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith
Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick,
Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation
system: Bridging the gap between human and machine translation, 2016.
[75] Radim Řehůřek. Gensim: Topic modelling for humans. https://radimrehurek.com/gensim/models/
ldamodel.html, Dec 2022. Accessed: April 23, 2023.
[76] Radim Řehůřek. Gensim: Topic modelling for humans. https://radimrehurek.com/gensim/corpora/
dictionary.html, Dec 2022. Accessed: April 25, 2023.
[77] Radim Řehůřek. Gensim: Topic modelling for humans. https://radimrehurek.com/gensim/, Dec
2022. Accessed: April 27, 2023.
Appendices
A Stopwords
Table 14: Stopwords used in both LDA and BERTopic. A total of 354 stopwords, using the standard NLTK
stopword library for English, as well as added stopwords that are project specific. The added stopwords are
mainly mistranslations and the most frequent names.
Stopwords
a abdul abdullah about above absolutely adam address
adnan after again against agnes agneta ahlqvist ahmad
ain albin alex alexander all alma also amelia
an and anders andreas anna anton any are
aren’t as ask at axel be because been
before being below between björn blah boris both
but by bye byebye call camilla can carola
cecilia christian claes clara couldn’t d daniel david
dennis did didn’t do doe does doesn’t doing
don douglas down during each elin elisabeth emil
emilia emma erik erika eriksson etc. eva exactly
few filip first for fred fredrik from further
gabriel gabriela gabriella good goodbye great gunn had
hadn’t hamlet hanna hannes has hasn’t have haven
haven’t having he hello henrik her here herman
hers herself him himself his how hugo hussein
håkan i ida if ill in into is
isn’t it’s itll its itself jacob jan jasmin
jenny jens jessica jeppe jim joakim joel johan
johanna johannes johansson john johnny jonas josef josefin
just jörgen karin karlsson katarina kim kristian lars
larsson lasse last leon let lina linda linnea
linus lisa lucas luke lundin länder länderplats ländervapen
lönnbro m ma madelaine magnus malin marcus maria
marianne marie matilda mattias me mia michael michelle
mightn mightn’t mikael mohamed mohammed monica more most
mustn mustn’t my myself name needn’t nice nicklas
niklas no nor not now number o of
off okay oliver olle on once only or
oskar other our ours ourselves out over own
patrik person personal persson peter please rasmus re
regina richard right robert roger roland s same
sandra say sebastian see shan shan’t she she’s
should should’ve shouldn’t so some sonja stefan stella
stig such surname susanna t than thank thanks
that that’ll thats the their theirs them themselves
then there therese these they theyll this thomas
those through to tom too tove try ulrik
under understand until up vas ve very victor
victoria viktoria viola vägen wait wallgren was wasn
wasn’t we welcome were weren weren’t what when
where which while who whom why will william
with won won’t wouldn wouldn’t y yeah yes
ylva you you’d you’ll you’re you’ve your yours
yourself yourselves
Figure 17: LDA model implementation in Python using the models.ldamodel module from the Gensim
library.
Each parameter in the model settings, as shown in Figure 17, can be described as follows:
• corpus: The corpus of documents that the model will be trained on. This is the created dataset after
the transcribed files have gone through the data preprocessing and text representation stages.
• num_topics: This controls the number of requested latent topics that the model will identify. This hyperparameter can be changed to produce new models with different numbers of topics fitted to the data. During the evaluation and model selection phase, described in Section 3.9.1, this parameter is changed to determine the model with the best coherence and perplexity scores, as described in Section 2.7.
• id2word: The dictionary that maps every token_id to its corresponding word.
• passes: The number of times the model will pass over the corpus during training. The standard setting
for this parameter is 10, which is kept unchanged in our model.
• alpha: This hyperparameter is the Dirichlet prior that controls the sparsity of the topic distribution, as described in Section 2.5.1. With a higher alpha, documents are assumed to be made up of more topics, resulting in a denser topic distribution per document. When set to 'auto', which is the case in our model implementation, the model will learn an asymmetric prior directly from the data, meaning that it will learn the best value of alpha for each document based on the data [75].
• chunksize: The number of documents to be used in each training chunk, i.e., considered at once while passing through the corpus during training. The standard setting for this parameter is 100, which is kept unchanged in our model.
• update_every: Number of documents to be iterated through for each update. The standard setting
for this parameter is 1, which is kept unchanged in our model.
• random_state: This serves as a random seed and can be utilized to get reproducible results during
the different runs. For consistency reasons, this parameter was kept at 100 during the entire modeling
phase.
• per_word_topics: This setting is set to True, which results in the model computing a sorted list of the most likely topics given a word.
Figure 18: BERTopic model implementation in Python using the BERTopic module from the bertopic library.
SentenceTransformer
The text representation was done using BERT embeddings, described in Section 2.6.1, where the SentenceTransformer() class from the sentence-transformers Python library, which contains several different pre-trained embedding models, was utilized. The main parameter settings for this function are [64]:
• model_name_or_path: This parameter was used to specify the pre-trained model to be used for generating embeddings. It can either be a string with the name of the model or the path to a local directory containing the pre-trained model files. When the language of the data is English, the default model is all-MiniLM-L6-v2, but other pre-trained models such as all-mpnet-base-v2, both described in Section 2.6.1, can be chosen.
• device: Specifies the device that should be used for computation, like 'cuda' or 'cpu'. If None, a Graphics Processing Unit (GPU) is used if available, otherwise the Central Processing Unit (CPU) is the default.
• batch_size: This parameter specifies the batch size for generating embeddings. (Default = 32)
UMAP
When reducing the dimensionality, the UMAP technique, described in Section 2.6.2, was used, as it allows the data's local and global structure to be kept. This was done through the UMAP() function from the Python library of the same name, and its main parameter settings are [49, 50]:
• n_neighbors: Controls the number of neighboring points used in the construction of the initial high-dimensional graph. Increasing the value of n_neighbors can lead to more global structure being preserved at the loss of local detailed structure. Values in the range 5 to 50 are recommended, with 10 to 15 as a sensible default. (Default = 15)
• n_components: Specifies the number of dimensions in the low-dimensional embedding. In general a
value should be between 2 and 100. (Default = 2)
• metric: Specifies the distance metric used to measure the distance between points, i.e. meaningful words, in the high-dimensional embedding space. Metrics such as Euclidean distance or cosine distance, described in Sections 2.4.1 and 2.4.2 respectively, can be used. (Default = Euclidean distance)
• min_dist: Controls the minimum distance between points in the embedding space. Increasing the value ensures a more even distribution but can also cause the loss of some small-scale structure. Values should in general be in the range 0.001 to 0.5, relative to the spread value. (Default = 0.1)
• spread: controls the scale of the low-dimensional embedding and determines, together with min_dist,
how clustered/clumped the embedded points should be. (Default = 1)
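A minimal sketch of this step is shown below; the random matrix merely stands in for the sentence embeddings from the previous step, and the parameter values are the quoted defaults rather than the exact thesis configuration.

import numpy as np
from umap import UMAP

embeddings = np.random.rand(100, 384)  # stand-in for the sentence embeddings

umap_model = UMAP(
    n_neighbors=15,   # trade-off between local and global structure
    n_components=2,   # dimensionality of the output embedding
    metric="cosine",  # cosine distance (Section 2.4.2)
    min_dist=0.1,     # minimum spacing between embedded points
    spread=1.0,       # overall scale of the embedding
)
reduced_embeddings = umap_model.fit_transform(embeddings)  # shape: (100, 2)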
HDBSCAN
After the dimensionality reduction, a clustering procedure took place using HDBSCAN through the HDBSCAN()
function from the hdbscan Python library. The most important parameters of this function are as follows
[45, 47]:
• min_cluster_size: This parameter sets the minimum number of points required to form a cluster,
i.e., it controls the granularity of the resulting clusters. Larger values will produce fewer, larger clusters,
while smaller values produce the opposite. (Default = 5)
• min_samples: The minimum number of samples in a neighborhood for a point to be considered a
core point. This parameter affects the density threshold for forming clusters, where smaller values will
produce more clusters, while larger values produce the opposite. (Default = None, in which case it falls
back to the value of min_cluster_size)
• metric: The distance metric to use when computing the mutual reachability distance between points.
Metrics such as Euclidean distance or cosine distance, described in Sections 2.4.1 and 2.4.2 respectively, can
be used. (Default = Euclidean distance)
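The clustering step can be sketched as follows; the input matrix again stands in for the reduced embeddings, and min_cluster_size=10 is an illustrative choice rather than the thesis setting.

import numpy as np
from hdbscan import HDBSCAN

reduced_embeddings = np.random.rand(100, 2)  # stand-in for the UMAP output

hdbscan_model = HDBSCAN(
    min_cluster_size=10,  # minimum points per cluster (granularity)
    min_samples=5,        # density threshold for core points
    metric="euclidean",   # distance for the mutual reachability graph
)
labels = hdbscan_model.fit_predict(reduced_embeddings)  # -1 marks noise points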
CountVectorizer
To obtain a bag-of-words representation within each cluster, the CountVectorizer method, presented in
Section 2.6.4, was used. This was done through the CountVectorizer() function from the sklearn Python library, and
its main parameter settings are [58]:
• stop_words: This parameter specifies a list of words to be removed from the input text data, which
can often improve the accuracy and efficiency of NLP models, as described in Section 2.2.4. (Default =
None, which means that no stop words are removed)
• max_df : Specifies the maximum document frequency of a token to be included in the output vocab-
ulary. A lower value can filter out tokens that appear too frequently in the input data and are not
informative. (Default = 1.0, which means that all tokens are included)
• min_df : Specifies the minimum document frequency of a token to be included in the output vo-
cabulary. A higher value can filter out tokens that occur too infrequently and may not be useful for
modeling. (Default = 1, which means that all tokens that occur at least once in the input data are
included)
• tokenizer: This parameter is used to transform each document into a list of tokens. A custom
tokenizer function can be passed to extract tokens in a more sophisticated way. (Default = None)
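For illustration, a configuration along these lines could look as follows; the built-in English stop-word list and the frequency cut-offs are assumptions for the example, not the settings used in the thesis.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(
    stop_words="english",  # built-in English stop-word list
    max_df=0.95,           # drop tokens in more than 95% of documents
    min_df=2,              # drop tokens occurring in fewer than 2 documents
    tokenizer=None,        # keep the default tokenization
)

# Standalone usage; within BERTopic the vectorizer is applied per cluster.
X = vectorizer_model.fit_transform([
    "the card is blocked",
    "blocked card and pin",
    "invoice payment due",
    "question about invoice",
])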
ClassTfidfTransformer
To distinguish the clusters from one another and create topics, the c-TF-IDF method was used, described in
Section 2.6.5, which is a modified version of TF-IDF. This was generated through the ClassTfidfTransformer()
function from the bertopic Python library, and its most important parameters are as follows [21, 58]:
• sublinear_tf : A boolean value indicating whether or not to apply sublinear scaling to the term
frequency. (Default = False)
• smooth_idf : A boolean value indicating whether or not to apply smoothing to the inverse document
frequency. (Default = True)
• use_idf : A boolean value indicating whether or not to enable inverse-document-frequency reweighting.
(Default = True)
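Note that these three flags are inherited from sklearn's TfidfTransformer, which ClassTfidfTransformer extends; instantiating the class with no arguments keeps them at the defaults listed above, as in this minimal sketch.

from bertopic.vectorizers import ClassTfidfTransformer

# Inherited defaults: sublinear_tf=False, smooth_idf=True, use_idf=True.
ctfidf_model = ClassTfidfTransformer()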
KeyBERTInspired
Finally, the fine-tuning of the topics was done with KeyBERT, described in Section 2.6.6, and in particular
the KeyBERTInspired() function in the bertopic Python library. Its main parameter settings are [22]:
• top_n_words: An integer value indicating how many top-n words to extract per topic. (Default =
10)
• nr_repr_docs: An integer value indicating the number of representative documents to extract per
cluster. (Default = 5)
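A minimal instantiation with the quoted defaults could look as follows.

from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired(
    top_n_words=10,  # top words extracted per topic
    nr_repr_docs=5,  # representative documents per cluster
)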
BERTopic
In addition, the BERTopic() function itself has some hyperparameters to tune besides the choice of embedding
model, and the most important ones are [24]:
• nr_topics: Controls the number of topics to extract from the corpus. If this parameter is not specified,
or is set to a number larger than the number of topics found, BERTopic determines the number of topics
automatically from the clustering step. (Default = None)
• min_topic_size: Controls the minimum number of documents required for a topic to be considered
valid. Topics with fewer documents are discarded. (Default = 10)
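Putting the pieces together, the components sketched in the previous subsections can be passed to BERTopic() roughly as follows. This mirrors the modular pipeline described above, but the argument values are illustrative rather than the exact thesis configuration.

from bertopic import BERTopic

# Assumes the embedding_model, umap_model, hdbscan_model, vectorizer_model,
# ctfidf_model and representation_model objects sketched above, plus the
# placeholder list of documents.
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
    nr_topics=None,      # no post-hoc reduction of the number of topics
    min_topic_size=10,   # minimum documents per valid topic
)
topics, probabilities = topic_model.fit_transform(documents)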
Figure 19: Functions for topic extraction within the bertopic library in Python. Figure from [19].
Figure 20: Functions for topic visualization within the bertopic library in Python. Figure from [19].