4 Steps of Using Latent Dirichlet Allocation (LDA) For Topic Modeling in NLP

Uploaded by

kehinde.ogunleye007

Available Formats

Download as PDF or read online on Scribd

0% found this document useful (0 votes)

38 views

4 Steps of Using Latent Dirichlet Allocation (LDA) For Topic Modeling in NLP

Uploaded by

kehinde.ogunleye007

Available Formats

Download as PDF or read online on Scribd

You are on page 1/ 21

1410212023, 17:10 Latent Dirichlet location (LDA) Tutorial Tople Modeling in NLP. © Omdena 4 Steps of Using Latent Dirichlet Allocation (LDA) for Topic Modeling in NLP Published Reading Time Jan 15, 2022 9 min © Rate this post Follow us (11 votes) eooo0e8 hitpsomdena.convogiatent-aichet-alocabon! et1410212023, 1710 Latent Dirichlet Allocation (LDA) Tutors Tope Modeling in NLP CHA Real World Tutorials by @mdena EV CHe Uiianica@. Wer-1e Cn (89) TOES CMM em Ta me | a. =a -_ oO RSS a wy CE ) Jump to section An introduction to the LDA algorithm Step 1: Data collection Step 2: Preprocessing ‘Step 3: Model implementation * 3.1. Training * 3.2. Improving preprocessing Step 4: Visualization Summary References tps omdena.combogiatent-arichet-alocaton! ar‘0272028, 17:10 Latent Dietet Allocation (LDA) Tuoi Topic Moeling in NLP This tutorial will guide you through how to implement its most popular algorithm, Latent Dirichlet Allocation (LDA) algorithm, step by step in the context of a complete pipeline. First, we will be learning about the inner works of LDA. Then, we will be using scikit-learn for data preprocessing and model implementation, and pyLDAvis tor visualization. As a little extra, we will also be doing our own data collection with newspaper3k. Topic Modeling is a technique that you probably have heard of many times if you are into Natural Language Processing (NLP). Topic Modeling in NLP commonly used for document clustering, not only for text analysis but also in search and recommendation engines. Sounds good? Let's start! Author: Jessica Becerra Formoso What is Latent Dirichlet Allocation? Latent Dirichlet Allocation (LDA) is an unsupervised algorithm that assigns each document a value for each defined topic (let's say, we decide to look for 5 different topics in our corpus) Latent is another word for hidden (ie, features that cannot be directly measured), while Dirichlet is a type of probability distribution LDA considers each document as a mix of topics and each topic as a mix of words. It iterates through the total number of topics and each word. It will randomly assign each word to a topic and evaluate how often the word occurs in that topic together with which other words. This approach follows a similar way of thought as we humans would. This makes LDA easier to interpret and one of the most popular methods out there. The trickiest part of it though is to figure out the optimal number of topics and iterations. Latent Dirichlet Allocation is not to be confused with Latent Discriminant Analysis (also referred to as LDA). Latent Discriminant Analysis is a supervised dimensionality reduction technique used for the classification or preprocessing of high-dimensional data. You might also like * Using Topic Modeling to Understand Climate Change Domains tps omdena.combogiatent-arichet-alocaton! sat1490212028, 17:10 Latent Dirichlet Allocation (LDA) Tutors: Tope Modeling in NLP” Step 1: Data collection To spice things up, let's use our own dataset! For this, we will use the newspaper3k library, a wonderful tool for easy article scraping, Ipip install newspaper3k import newspaper from newspaper import Article We will be using the build functionality to collect the URLs on our chosen news website's main page. # Save URLs from main page. /www.theguardian.com/us", memoize_articles-False) news = newspaper. build(“https: By passing the memoize_articles argument as False, we ensure that, if we call the function a second time, all the URLs will be collected again. Otherwise, only the new URLs would be returned. We can check news.size() to get the number of collected news URLs. In our case, 143. Next, we need to simply pass each URL through Article(), call download() and parse(), and finally, we can get the article's text. We also pass a length condition to avoid storing some previously spotted exceptions. That way, we ensure adding only long texts to our dataset texts = [] # For each URL, tps omdena.combogiatent-arichet-alocaton! an4i02/2023, 1710 Latent Dirichlet Aocation (LDA) Tutorial: Tope Modeling in NUP for article in news.articles: # Get the corresponding article. article = article(article.url) article.download() if articl article.parse() # Get text only if has more than 68 characters -- to avoid undesired exct if len(article.text) > 60: texts. append(article. text) After running these lines, the total number of news articles is 132. Step 2: Preprocessing The next step is to prepare the input data for the LDA model. LDA takes as input a document- ‘term matrix. We will be using Bag of Words, specifically the CountVectorizer implementation from scikit- learn. from sklearn.feature_extraction.text import CountVectorizer bow_vectorizer = CountVectorizer(stop_words=stopwords, lowercase=True, max_df=0.' bow_matrix = bow_vectorizer.fit_transform(texts) There are a couple of things to mention here. First, it is essential not to forget to remove stopwords. We call the lowercase method for increased normalization, and we set a series of parameters to avoid high-frequency words (common words not in the stopwords list that do not add any meaning overall) or too low-frequency terms. Our resulting Bag of Words has a shape of (132, 438). With that in place, it is time to use LDA algorithm, tps omdena.combogiatent-arichet-alocaton! iat1490212028, 17:10 Latent Dirichlet Allocation (LDA) Tutors: Tope Modeling in NLP” Step 3: Model implementation Using scikitlearn's implementation of this algorithm is really easy. However, this abstraction can make it really difficult to understand what is going on behind the scenes, It is important to have at least some intuition on how the algorithms we use actually work, so let's recap a bit on the explanations from the introduction. from sklearn.decomposition import LatentDirichletAllocation as LDA Ida_bow = LDA(n_components=5, random_state=42) Ida_bow. fit (bow_matrix) LDA needs three inputs: a document-term matrix, the number of topics we estimate the documents should have, and the number of iterations for the model to figure out the optimal words-per-topic combinations. n_components corresponds to the number of topics, here, 5 as a first guess The number of iterations is 10 by default, so we can omit that parameter. Having the configurations of our LDA model set up under the Ida_bow variable, we fit (train) on the BOW. 1da_bow. transform(bow_matrix[:2]) By calling transform, we get to see the results of the trained model. This gives us a good picture of how it actually works. We pass only the first two rows of our BOW matrix as an example. array({[0.76662544, 0.01858679, 0.0183296 , 0.17813906, 0.01831911], [0.00103261, 0.00102449, 0.001021 , 0.00102753, 0.99589436])) As you can see, we have 5 values in each of the two vectors. Each value represents a topic (remember we told the model to find 5 different topics). In specific, it illustrates how much of tps omdena.combogiatent-arichet-alocaton! 621‘0272028, 17:10 Latent Dietet Allocation (LDA) Tuoi Topic Moeling in NLP that topic is covered in that document (vector). This makes sense since a document is usually made up of several (sub)topics. let's naw print the mast camman words far each tanic for idx, topic in enumerate(1da_bow.components_): print(f"Top 5 words in Topic #{idx}:") print ([bow_vectorizer.get_feature_names()[i] for i in topic.argsort()[-5:]]) print(‘*) The output looks like this: Top 5 words in Topic #0: [time’, ‘years’, life, ‘says’, ‘ike] Top 5 words in Topic #1: public’, ‘york, ‘new’, ‘police’, ‘trump'] Top 5 words in Topic #2: [white,, decision, ‘international, black, uk] Top 5 words in Topic #3: [like ‘year’, ‘food, ‘police’, ‘city Top 5 words in Topic #4: [bill, ‘democrats’, ‘rights’, ‘voting’, ‘biden’] This type of visualization is actually an excellent indicator of how well our topic model is being trained, Having words such as “like” or “says” does not provide much meaning. One way around this is to do lemmatization and add these undesired words to our stopwords list. Let's improve our current model next. 3.2. Improving preprocessing tps omdena.combogiatent-arichet-alocaton! nen‘0272028, 17:10 Latent Dietet Allocation (LDA) Tuoi Topic Moeling in NLP Coming back to the preprocessing step is something very common and often necessary. After all, Machine Learning is an iterative process. In our case, we need to improve our Bag of Words to not take into account some very frequent words that could not be filtered out with the previous approach. Furthermore, it would be good to add a lemmatizer to avoid repeated words under different forms. For the first case, we just need to add our new list of stopwords to the already defined set of stopwords. For the second step though, CountVectorizer does not integrate a lemmatizer, so we have to create our own lemmatizer class and pass it to the tokenizer parameter. No need to worry much here, scikit-learn has you covered with their documentation on how to customize your vectorizer in this particular case. nitk.download(‘punkt") nltk.download( "wordnet" ) from nitk import word_tokenize from nitk.stem import WordNetLemmatizer class LemmaTokenizer: def init__(self): self.wnl = WordNetLenmatizer() def _call_(self, doc): return [self.wnl.lemmatize(t) for t in word tokenize(doc) if (t.isalpha( We download first some necessary packages and import the corresponding dependencies. The LemmatTokenizer class is the same as in the documentation except for two extra conditions we add to account only for tokens with alphabetic characters and with more than one letter. Otherwise, your topics will be flooded with punctuation and other undesired tokens. Now, we only have to pass our new parameter to the vectorizer. The rest remains as before. bow_vectorizer = CountVectorizer(stop_words=stopwords, tokenizer=LemmaTokenizer( bow_matrix = bow_vectorizer.fit_transform(texts) If we run all again, we see that indeed the most common words for our topics do change. Top 5 words in Topic #0: tps omdena.combogiatent-arichet-alocaton! at1490212028, 17:10 Latent Dirichlet Allocation (LDA) Tutors: Tope Modeling in NLP” experience’, ‘event, ‘life’ ‘year’, city] Top 5 words in Topic #1 republican’, ‘voting, ‘right, ‘trump’, biden] Top 5 words in Topic #2: film, ‘life’, ‘new, ‘time, ‘year'] Top 5 words in Topic #3: year’, ‘vaccine, ‘food’, ‘city’ police! Top 5 words in Topic #4: ['week’, ‘governor’, ‘new’, ‘state’, ‘woman’] That is looking good, well done! Step 4: Visualization One last step in our Topic Modeling analysis has to be visualization. One popular tool for interactive plotting of Latent Dirichlet Allocation results is pyLDAvis. !pip install pyldavis import pyLDAvis import pyLDAvis.sklearn pyLDAvis.enable_notebook() Make sure to import the corresponding module to the main library you are using for Topic Modeling (in our case, scikit-learn). Again, this step will help us determine how well our model is performing. Let's take a look at the visualizations as they were before improving our vectorizer with the lemmatizer. tps omdena.combogiatent-arichet-alocaton! erat1490212028, 17:10 Latent Dirichlet Allocation (LDA) Tutors Tope Modeling in NLP Sii0oln [Ome Ie [ole le Piel In ea la lelelaia le MISE. wyN ry 0D timccoemsre mi ara a EEE EE ee Ee NLP Topic modeling - Source: Omdena There are two main parts to pyLDAvis. On the left side, the intertopic Distance Map shows each topic as a bubble. The bigger the bubble, the higher the number of documents in our corpus belonging to that topic. The more distanced the bubbles are from each other, the more different their topics are. On the right side, the Top-30 Most Relevant Terms for Topic N consist of a barplot with two indicators. In blue, the total frequency of that word in the corpus, and in red, the frequency of that word in that topic. tps omdena.combogiatent-arichet-alocaton! sor1490212028, 17:10 Latent Dirichlet Allocation (LDA) Tutors: Tope Modeling in NLP” NLP Topic modeling - Source: Omdena It seems we did not have a bad result after all! Let's see how it shows after lemmatization. The sizes of the bubbles are more irregular among them, and Topic 1 has a very large bubble that overlaps in great part with Topic 5. One thing we could explore further is the number of topics. It possibly is that five topics are much for our limited dataset. After some tweaking, we conclude that 3 topics without lemmatizer gives the best results for our case. The topics may still not make entire sense, or may sound repetitive or weak to us. There is no wrong in that. On the other hand, gathering more data can help the variety of our results and solidify the output. Feel free to explore with a larger amount of news articles or with your previously scraped tweets from Part 1 Summary In this tutorial, we learned about Latent Dirichlet Allocation. We built some intuition of the whole process and are ready to improve our first outputs by observing the performance of several parameters in our LDA implementation with the help of pyLDAvis. Now it's time to put this into practice! Happy coding! References ‘+ newspaper3k documentation: https://github.com/codelucas/newspaper + scikit-learn LDA documentation: https://scikit- learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html + pyLDAvis documentation: https://github.com/bmabey/pyLDAvis + Video tutorial on LDA with Gensim: https://www.youtube.com/watch?v=NYkbqzTIW3w Want to learn more? Check out the tutorials below: * Building “Yarub" Library for Arabic NLP Purposes tps omdena.combogiatent-arichet-alocaton! swe1490212028, 17:10 Latent Dirichlet Allocation (LDA) Tutors: Tope Modeling in NLP” * Labeling Text Data for News Article Classification and NLP. Tagged: Latent Dirichlet Allocation + LDA + Topic Modeling Jessica Becerra Formoso o tps omdena.combogiatent-arichet-alocaton! sam1490212028, 17:10 Latent Dirichlet Allocation (LDA) Tutors: Tope Modeling in NLP” Al Projects to boost your career Develop vour skills. build a project portfolio. and make an impact. All at once! Most Read Posts Applying Machine Learning to Analyse Pipe-Borne Water Availability in Lagos, Nigeria 13 min read Crop Yield Prediction Using Deep Neural Networks 9 min read Measuring Soil Organic Carbon Changes Using RothC and Machine Learning 13 min read Data Analysis AlEngineering Object Detection Web Scraping Climate Change Machine Learning —Startups_—Deep Learning Data Science Career Geospatial Data Al'Startups Satellite Imagery Career Career Tips Data Visualization tps omdena.combogiatent-arichet-alocaton! 13211490212028, 17:10 Latent Dirichlet Allocation (LDA) Tutors: Tope Modeling in NLP” Join or host projects and build solutions through the power of collaboration. PROBLEM STATEMENT oy Monitoring the Water Quality in Bhopal Region using Satellite Imagery and GIS Techniques This Omdena Local Chapter Challenge runs for 5 weeks and is a unique experience to try. Challenge Start: Feb 16th tps lomdena.combogiatent-arichet-alocaton! 14211410212028, 17:10 Latent Dirichlet location (LDA) Tutorial Tople Modeling in NUP E-Commerce Customer Churn Prediction using Machine Learning This Omdena Local Chapter Challenge runs for 5 weeks and is a unique experience to try... Apply now Challenge Start: Feb 19th tps lomdena.combogiatent-arichet-alocaton! 19211410212023, 17:10 Latent Dirichlet Allocation (LDA) Tutors Tope Modeling in NLP Mitigating Air Pollution in Poland Through Machine Learning This Omdena Local Chapter Challenge runs for 4 weeks and is a unique experience to try... Apply now Challenge Start: Feb 19th See all projects > tps omdena.combogiatent-arichet-alocaton!1490212028, 17:10 Latent Diichet Allocation (LDA) Tutorial Tople Modeling in NUP 2 oe ete Ce eee i eens eee Tees Applying Machine Learning to Analyse Pipe-Borne Water Availability in Lagos, Nigeria (Nov 17,2022 | @ 13 min read itpsomdena.combogiatent-arichet-alocaton! sre1490212028, 17:10 Latent Diichet Allocation (LDA) Tutorial Tople Modeling in NUP Image cropped and frst band histogram =“ 6 eazsane Sepimighth—ryial t Crop Yield Prediction Using Deep Neural Networks 1H Sep 30, 2022 | @ 9min read Measuring Soil Organic Carbon Changes Using RothC and Machine Learning tps omdena.combogiatent-archet-alocatons 19211490212028, 17:10 Latent Dirichlet Allocation (LDA) Tutors: Tope Modeling in NLP” Browse categories. Agriculture Al Planning Career Development _—_ Career Growth Stories Child Safety Civil Society Computer Vision Conversational Al Data Engineering DataScience Disaster Management —_ Education Environment & Sustainability Equality & Inclusion Finance Healthcare Impact Tech Startups Infrastructure —_ Logistics & Transportation Machine Learning Media NLP — Omdena impACT leadership Omdena Local Chapters Partner Success Stories Policy PR & Impact Real-World Tutorials Remote Sensing Renewable Energy Robotics, Drones, and loT Security & Justice Software Development Technical Case Studies Web and Mobile Applications tps omdena.combogiatent-arichet-alocaton! v9sanz005, 40 Latent Ofte Alcaton (DA) Ta Tope MeeIng NL Leave a comment. 0 Comments Join Us About Us » Projects » Mission » Omdena School » Team » Omdena Local Chapters For Organizations » Al Startup Incubator » Civil Society y» Universities » Testimonials » Contact Us » FAQ Industry Applications » Media » Renewable Energy » Solar » Wildfire Prediction > More itpsomdena.combogiatent-arichet-alocaton! » Our partners » Careers Industry Applications » Agriculture » Crop Prediction » Finance » Healthcare » Cancer » Mental Health 201241490212028, 17:10 Latent Diichet Allocation (LDA) Tutorial Tople Modeling in NUP Privacy Policy Terms and Conditions © 2023 Omdena. All Rights Reserved. 5 « 6) Gl o tps omdena.combogiatent-arichet-alocaton! 21a

Topic Modeling With BERT. - Towards Data Science
No ratings yet
Topic Modeling With BERT. - Towards Data Science
9 pages
Topic Modelling Using NLP
No ratings yet
Topic Modelling Using NLP
18 pages
A Beginner's Guide To Latent Dirichlet Allocation (LDA)
No ratings yet
A Beginner's Guide To Latent Dirichlet Allocation (LDA)
9 pages
7.2 Latent
No ratings yet
7.2 Latent
27 pages
IIT-P ADS Week 22 Transcripts
No ratings yet
IIT-P ADS Week 22 Transcripts
4 pages
Ex6 SMA
No ratings yet
Ex6 SMA
11 pages
Apex Institute of Technology Natural Language Processing (CST-354)
No ratings yet
Apex Institute of Technology Natural Language Processing (CST-354)
22 pages
Latent Dirichlet Allocation
100% (2)
Latent Dirichlet Allocation
13 pages
Topic Modeling Uncovering Hidden Themes in Text
No ratings yet
Topic Modeling Uncovering Hidden Themes in Text
10 pages
Project Example
No ratings yet
Project Example
19 pages
3 Topic Models
No ratings yet
3 Topic Models
15 pages
Unit 2, Part 2:topic Modeling
No ratings yet
Unit 2, Part 2:topic Modeling
26 pages
Information Retrieval Using Effective Bigram Topic Modeling
No ratings yet
Information Retrieval Using Effective Bigram Topic Modeling
8 pages
An Integrated Clustering and BERT Framework For Improved Topic Modeling
No ratings yet
An Integrated Clustering and BERT Framework For Improved Topic Modeling
9 pages
Song 2009
No ratings yet
Song 2009
4 pages
Improving Topic Models With Latent Feature Word Representations
No ratings yet
Improving Topic Models With Latent Feature Word Representations
16 pages
Topic Model For LDA
No ratings yet
Topic Model For LDA
9 pages
Input To The LDA Algorithm:: Latent Dirichlet Allocation Using Gibbs Sampling Technique Is A Framework For Analyzing
No ratings yet
Input To The LDA Algorithm:: Latent Dirichlet Allocation Using Gibbs Sampling Technique Is A Framework For Analyzing
3 pages
dbm302Presentation
No ratings yet
dbm302Presentation
5 pages
LU - 35 Latent Dirichlet Algorithm
No ratings yet
LU - 35 Latent Dirichlet Algorithm
13 pages
2024.eacl-long.51
No ratings yet
2024.eacl-long.51
20 pages
Module III
No ratings yet
Module III
42 pages
Probabilistic Topic Modeling and Its Variants - A Survey: Padmaja CH V R S Lakshmi Narayana
No ratings yet
Probabilistic Topic Modeling and Its Variants - A Survey: Padmaja CH V R S Lakshmi Narayana
5 pages
Exploration of Thesis
No ratings yet
Exploration of Thesis
93 pages
Session 2
No ratings yet
Session 2
58 pages
Experiments With Non Parametric Topic Models
No ratings yet
Experiments With Non Parametric Topic Models
10 pages
ME314 Day11
No ratings yet
ME314 Day11
77 pages
Topic Models in Natural Language Processing
No ratings yet
Topic Models in Natural Language Processing
64 pages
Topic Modelling: A Survey of Topic Models: Abstract-In Recent Years We Have Significant Increase
No ratings yet
Topic Modelling: A Survey of Topic Models: Abstract-In Recent Years We Have Significant Increase
12 pages
Pip Install Guidedlda
No ratings yet
Pip Install Guidedlda
3 pages
Visualizing Topic Models
No ratings yet
Visualizing Topic Models
4 pages
Probabilistic Topic Models
No ratings yet
Probabilistic Topic Models
78 pages
A Comprehensive Guide To Understand and Implement Text Classification in Python
No ratings yet
A Comprehensive Guide To Understand and Implement Text Classification in Python
34 pages
Text Mining Package and Datacleaning: #Cleaning The Text or Text Transformation
No ratings yet
Text Mining Package and Datacleaning: #Cleaning The Text or Text Transformation
6 pages
Topic Modeling Text Clustering Based On Deep Learning Model
No ratings yet
Topic Modeling Text Clustering Based On Deep Learning Model
11 pages
Machine Learning for data science Unit-5
No ratings yet
Machine Learning for data science Unit-5
10 pages
SNLP Overview
No ratings yet
SNLP Overview
43 pages
NLP Asgn3
No ratings yet
NLP Asgn3
6 pages
CS772 Lec21
No ratings yet
CS772 Lec21
18 pages
LDA Topic Model With Soft Assignment of Descriptors To Words
No ratings yet
LDA Topic Model With Soft Assignment of Descriptors To Words
9 pages
Adison Wongkar, Christoph Wertz, What Are People Saying About Net Neutrality
No ratings yet
Adison Wongkar, Christoph Wertz, What Are People Saying About Net Neutrality
5 pages
Unit-4 NLP
No ratings yet
Unit-4 NLP
21 pages
A Gentle Introduction To Topic Modeling Using Pyth
No ratings yet
A Gentle Introduction To Topic Modeling Using Pyth
10 pages
Full Text 01
No ratings yet
Full Text 01
68 pages
Probabilistic Topic Models
No ratings yet
Probabilistic Topic Models
78 pages
Topic Modelling
No ratings yet
Topic Modelling
14 pages
Latent Dirichlet Allocation
No ratings yet
Latent Dirichlet Allocation
3 pages
A16_ppt
No ratings yet
A16_ppt
21 pages
Machine Learning: Suprit Shrestha (19BCE2584)
No ratings yet
Machine Learning: Suprit Shrestha (19BCE2584)
20 pages
chapter_2
No ratings yet
chapter_2
36 pages
Latent Dirichlet Allocation LDA and Topic Modeling PDF
No ratings yet
Latent Dirichlet Allocation LDA and Topic Modeling PDF
41 pages
Unit iv
No ratings yet
Unit iv
58 pages
Topic Modelling- Deep Learning Interview Questions
No ratings yet
Topic Modelling- Deep Learning Interview Questions
19 pages
Pre-Training Is A Hot Topic: Contextualized Document Embeddings Improve Topic Coherence
No ratings yet
Pre-Training Is A Hot Topic: Contextualized Document Embeddings Improve Topic Coherence
8 pages
Tensor Decomposition For Topic Models: An Overview and Implementation
No ratings yet
Tensor Decomposition For Topic Models: An Overview and Implementation
14 pages
Large Scale Topic Modeling
No ratings yet
Large Scale Topic Modeling
18 pages
Feature extraction techniques in NLP
No ratings yet
Feature extraction techniques in NLP
10 pages
UTOPIC 2023.eacl-main.132
No ratings yet
UTOPIC 2023.eacl-main.132
16 pages
5 Data Analytics Projects For Beginners - CourseraG
No ratings yet
5 Data Analytics Projects For Beginners - CourseraG
6 pages
1 An Introduction To Machine Learning With Scikit Learn
No ratings yet
1 An Introduction To Machine Learning With Scikit Learn
2 pages
Converting List To String
No ratings yet
Converting List To String
8 pages
Datetime - Basic Date and Time Types - Python 3.11.3 Documentation
No ratings yet
Datetime - Basic Date and Time Types - Python 3.11.3 Documentation
38 pages