Topic Modelling: Unveiling Hidden Themes in Text

Topic Modelling to Discover Underlying Themes in Documents
Topic modelling is a powerful technique used in natural language
processing (NLP) to uncover hidden topics within a collection of
documents. This presentation explores two prominent topic
modelling approaches: Latent Dirichlet Allocation (LDA) and Non-
Negative Matrix Factorization (NMF), comparing their strengths and
weaknesses in generating coherent and interpretable topics.
The Problem of Unstructured Text Data

Exponential Growth
The volume of unstructured text data is rapidly increasing across various domains, making it challenging to extract meaningful insights.

Identifying Patterns
Organizations and researchers need to identify meaningful patterns and themes within large collections of documents.

Topic Modelling as a Solution


Topic modelling provides a solution by revealing hidden topics
within unstructured data, offering valuable insights into the
underlying themes.
LDA and NMF: Two Approaches to Topic
Modelling
Latent Dirichlet Allocation (LDA)
A probabilistic model that assumes documents are generated as mixtures of topics, each characterized by a specific distribution over words.

Non-Negative Matrix Factorization (NMF)
A linear-algebra-based approach that decomposes the document-term matrix into non-negative parts that can be reconstructed through latent topic representations.
Research Questions
1 Coherence and Interpretability
How do LDA and NMF compare in terms of the coherence and interpretability of the generated topics?

2 Computational Efficiency and Scalability
What are the differences in computational efficiency and scalability between LDA and NMF?

3 Topic Separation and Interpretability
Which algorithm better separates topics and is more interpretable on a given dataset of texts?
Methodology: Data Collection and Preprocessing
1 Data Collection
A sample of multiple short documents belonging to different thematic categories (sports, technology,
general knowledge) was created.

2 Data Cleaning
Documents were cleaned to ensure consistency and reduce noise, using tokenization, lowercase conversion, stop word removal, and lemmatization.

3 Vectorization
The pre-processed documents were vectorized using Count Vectorization for LDA and TF-IDF Vectorization for NMF (a sketch of these steps follows below).
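A minimal preprocessing and vectorization sketch, assuming NLTK and scikit-learn; the document list here is a hypothetical stand-in for the actual corpus, which is not reproduced in the slides:

# Preprocessing and vectorization sketch (hypothetical corpus; NLTK and scikit-learn assumed)
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

docs = [
    "The team won the football match in the final minute.",
    "New smartphones ship with faster processors and better cameras.",
    "The encyclopedia covers world history, geography, and science.",
]

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Tokenize, lowercase, drop stop words and non-alphabetic tokens, then lemmatize
    tokens = nltk.word_tokenize(text.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop_words)

cleaned = [preprocess(d) for d in docs]

# Count vectors feed LDA; TF-IDF vectors feed NMF
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(cleaned)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(cleaned)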
Topic Modelling Algorithms: LDA and NMF
Algorithm | Input | Parameters
LDA | Count-vectorized document-term matrix | Alpha (document-topic density), Beta (topic-word density)
NMF | TF-IDF-vectorized document-term matrix | Number of topics, initialization method, regularization settings
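A hedged sketch of fitting both models with scikit-learn, continuing from the vectorized matrices above; the parameter values shown are illustrative assumptions, not values reported in the slides:

# Fit LDA on count vectors and NMF on TF-IDF vectors (parameter values are illustrative)
from sklearn.decomposition import LatentDirichletAllocation, NMF

n_topics = 3  # assumed to match the three thematic categories

# doc_topic_prior corresponds to alpha, topic_word_prior to beta in the slide's terminology
lda = LatentDirichletAllocation(
    n_components=n_topics,
    doc_topic_prior=0.1,
    topic_word_prior=0.01,
    random_state=0,
)
lda_doc_topics = lda.fit_transform(count_matrix)

# init selects the initialization method; alpha_W / alpha_H control regularization strength
nmf = NMF(
    n_components=n_topics,
    init="nndsvd",
    alpha_W=0.0,
    alpha_H="same",
    random_state=0,
)
nmf_doc_topics = nmf.fit_transform(tfidf_matrix)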
Evaluation Metrics: Assessing Topic Quality

Topic Coherence
Measures the semantic similarity of the top words within each topic, indicating the interpretability of the topics.

Perplexity (for LDA)
Measures the model's ability to represent the data, with lower perplexity indicating a better fit.

Human Interpretability
Evaluates the clarity and relevance of the topics based on human judgment.

Computational Efficiency
Measures the time taken for the model to converge and memory usage during training.
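A rough sketch of how perplexity, coherence, and fit time could be measured, reusing the objects above; scikit-learn's perplexity method is used for LDA, while gensim's CoherenceModel is an assumed choice for coherence (the slides do not name a library):

# Perplexity, topic coherence, and fit-time sketch (gensim usage is an assumption)
import time
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Perplexity of the fitted LDA model on the count matrix: lower is better
print("LDA perplexity:", lda.perplexity(count_matrix))

# Extract the top words of each LDA topic and score them with c_v coherence
vocab = count_vectorizer.get_feature_names_out()
top_words = [[vocab[i] for i in comp.argsort()[-10:]] for comp in lda.components_]

tokenized = [doc.split() for doc in cleaned]
coherence = CoherenceModel(
    topics=top_words,
    texts=tokenized,
    dictionary=Dictionary(tokenized),
    coherence="c_v",
).get_coherence()
print("LDA c_v coherence:", coherence)

# Computational efficiency: wall-clock time to refit each model
start = time.perf_counter()
lda.fit(count_matrix)
print("LDA fit time (s):", time.perf_counter() - start)

start = time.perf_counter()
nmf.fit(tfidf_matrix)
print("NMF fit time (s):", time.perf_counter() - start)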
Results and Analysis:
Comparing LDA and NMF
LDA
Produces more coherent topics, capturing the
probabilistic nature of word distribution across topics.

NMF
Offers greater interpretability, with topics centered
around more specific keywords.

Computational Efficiency
NMF outperforms LDA in terms of computational speed
and scalability.
Conclusion: Choosing the Right Topic
Modelling Approach
The choice between LDA and NMF depends on the specific requirements of the application. LDA is well-suited for
complex datasets with overlapping themes, while NMF is computationally efficient and provides clearer topics for
simpler datasets.
Future Work: Exploring
Hybrid Models and Deep
Learning
Future research could explore hybrid models that combine the
strengths of LDA and NMF, or integrate deep learning models to
further enhance topic coherence and interpretability. This research
provides valuable insights into the practical considerations
involved in choosing a topic modelling technique, guiding data
scientists and researchers in selecting the most appropriate
approach for their specific data and goals.
