NLP Worksheet: Text Processing, Bag of Words and TF-IDF
Corpus
Document 1: We can use health chatbots for treating stress.
Document 2: We can use NLP to create chatbots and we will be making health chatbots now!
Document 3: Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y
Step 1: Sentence segmentation
No. Sentence
1 We can use health chatbots for treating stress.
2 We can use NLP to create chatbots and we will be making health chatbots now!
3 Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y
Step 2: Tokenization
Separate your sentences into tokens. How many tokens do you have?
Tokens
Number of tokens: 37
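One way to arrive at this count is sketched below. It assumes a tokenization rule that the worksheet does not state: every run of letters is one token, and any other run of non-space characters (such as "><" or "@1nteLA!4Y") is kept together as a single token. The names corpus and tokenize are illustrative only.

```python
import re

corpus = [
    "We can use health chatbots for treating stress.",
    "We can use NLP to create chatbots and we will be making health chatbots now!",
    "Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y",
]

def tokenize(text):
    # Assumed rule: a run of letters is one token; otherwise any run of
    # non-space characters (punctuation, symbols, mixed junk) is one token.
    return re.findall(r"[A-Za-z]+|\S+", text)

tokens = [tok for doc in corpus for tok in tokenize(doc)]
print(len(tokens))  # 37
```

A different rule, for example splitting every punctuation mark into its own token, would give a different total, so state the rule you used alongside your count.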
Step 3: Remove stopwords, special characters, numbers
List out the stopwords, special characters, and numbers that you want to remove!
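A minimal sketch of this step follows. It assumes a small hand-picked stopword list (the worksheet leaves that choice to you) and treats any token that is not purely alphabetic as a special character or number. STOPWORDS and remove_noise are hypothetical names.

```python
# Hypothetical stopword list chosen for illustration; yours may differ.
STOPWORDS = {"we", "can", "for", "to", "and", "will", "be"}

def remove_noise(tokens):
    kept = []
    for tok in tokens:
        if not tok.isalpha():          # drops ".", "!!", "><", "@1nteLA!4Y", "42"
            continue
        if tok.lower() in STOPWORDS:   # case-insensitive stopword check
            continue
        kept.append(tok)
    return kept

sample = ["We", "can", "use", "health", "chatbots", "for", "treating", "stress", "."]
print(remove_noise(sample))  # ['use', 'health', 'chatbots', 'treating', 'stress']
```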
Step 4: Convert text to lowercase
Original form -> Modified form
We -> we
Health -> health
Chatbots -> chatbots
Yay -> yay
NLP -> nlp
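The conversions listed above can be reproduced with Python's built-in str.lower, as in this short sketch (the variable names are illustrative):

```python
words = ["We", "Health", "Chatbots", "Yay", "NLP"]
lowercased = [w.lower() for w in words]
print(lowercased)  # ['we', 'health', 'chatbots', 'yay', 'nlp']
```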
Step 5: Stemming
List out the stem words.
Stem words
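To get a feel for what stemming does, here is a toy suffix-stripping sketch. It is not the Porter stemmer or any library's algorithm, only an assumed rule that removes "ing", "ed", or "s" when at least three characters remain. Notice that a stem such as "mak" need not be a real word.

```python
def naive_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["treating", "making", "chatbots", "counsellors", "health"]
print([naive_stem(w) for w in words])
# ['treat', 'mak', 'chatbot', 'counsellor', 'health']
```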
Step 6: Lemmatization
List out the root words (lemmas).
Lemma
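Unlike stemming, lemmatization maps each word to a real dictionary form. The sketch below assumes a tiny hand-written lookup table (LEMMAS is hypothetical and covers only a few words from the corpus); real lemmatizers rely on much larger vocabularies and grammatical information.

```python
# Hypothetical hand-written lemma table covering a few words from the corpus.
LEMMAS = {
    "treating": "treat",
    "making": "make",
    "chatbots": "chatbot",
    "counsellors": "counsellor",
}

def lemmatize(word):
    # Fall back to the word itself when it is not in the table.
    return LEMMAS.get(word, word)

words = ["treating", "making", "chatbots", "counsellors", "health"]
print([lemmatize(w) for w in words])
# ['treat', 'make', 'chatbot', 'counsellor', 'health']
```

Compare "making": the lookup gives the real word "make", whereas crude suffix stripping would leave "mak".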
Final data
List out the final, processed data.
Processed data
Bag of words
Step 1: Collect data and process it
For this exercise, we can use the sentences without processing them so that they are easier to read.
No. Sentence
2 We can use NLP to create chatbots and we will be making health chatbots now
Dictionary
we can use health chatbots for treating stress nlp to create and will be making now cannot replace human counsellors yay
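The dictionary above lists every unique word in the order it first appears. A minimal sketch of how it could be built is shown below. The three sentences are an assumption: they are the corpus documents with punctuation and special characters already removed, and build_dictionary is an illustrative name.

```python
import re

# Assumed input: the corpus with punctuation and special characters removed.
sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health Chatbots cannot replace human counsellors now Yay",
]

def build_dictionary(sentences):
    dictionary = []
    for sentence in sentences:
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word not in dictionary:   # keep first-appearance order
                dictionary.append(word)
    return dictionary

dictionary = build_dictionary(sentences)
print(len(dictionary))  # 21
print(dictionary)
```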
Word Document 1 Document 2 Document 3
we 1 1 0
can 1 1 0
use 1 1 0
health 1 1 1
chatbots 1 1 1
for 1 0 0
treating 1 0 0
stress 1 0 0
nlp 0 1 0
to 0 1 0
create 0 1 0
and 0 1 0
will 0 1 0
be 0 1 0
making 0 1 0
now 0 1 1
cannot 0 0 1
replace 0 0 1
human 0 0 1
counsellors 0 0 1
yay 0 0 1
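A count-based version of the document vectors can be sketched as follows, assuming lowercased sentences with punctuation removed (docs and document_vector are illustrative names). Because it counts every occurrence, "we" and "chatbots" come out as 2 for Document 2, where the table above shows 1; all other entries agree with the table.

```python
# Assumed input: lowercased documents with punctuation removed.
docs = [
    "we can use health chatbots for treating stress",
    "we can use nlp to create chatbots and we will be making health chatbots now",
    "health chatbots cannot replace human counsellors now yay",
]
dictionary = ("we can use health chatbots for treating stress nlp to create and "
              "will be making now cannot replace human counsellors yay").split()

def document_vector(doc, dictionary):
    words = doc.split()
    # One entry per dictionary word: how often it occurs in this document.
    return [words.count(term) for term in dictionary]

vectors = [document_vector(doc, dictionary) for doc in docs]
for term, counts in zip(dictionary, zip(*vectors)):
    print(f"{term:<12}{counts[0]} {counts[1]} {counts[2]}")
```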
TF-IDF
You’ve obtained your bag of words. Now let’s continue with the TF-IDF!
Steps 1-3: Count the number of documents in which each word appears at least once, and write that number next to the word in your vocabulary to get its document frequency.
Draw your own table for this!
Word: aman and anil are stressed went to a therapist download health chatbot
Document frequency: 2 1 2 1 1 2 2 2 1 1 1 1
Inverse document frequency (total documents / document frequency): 3/2 3/1 3/2 3/1 3/1 3/2 3/2 3/2 3/1 3/1 3/1 3/1
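The same two quantities can be computed for this worksheet's own corpus. The sketch below assumes lowercased documents with punctuation removed and keeps the inverse document frequency as a plain fraction, with the logarithm not yet applied. The function names are illustrative.

```python
from fractions import Fraction

# Assumed input: lowercased documents with punctuation removed.
docs = [
    "we can use health chatbots for treating stress",
    "we can use nlp to create chatbots and we will be making health chatbots now",
    "health chatbots cannot replace human counsellors now yay",
]
doc_words = [set(doc.split()) for doc in docs]
total_docs = len(docs)

def document_frequency(term):
    # Number of documents that contain the term at least once.
    return sum(term in words for words in doc_words)

def inverse_document_frequency(term):
    # Total documents divided by document frequency (no logarithm yet).
    return Fraction(total_docs, document_frequency(term))

for term in ["we", "health", "stress", "now"]:
    print(term, document_frequency(term), inverse_document_frequency(term))
```

A word that occurs in every document, such as "health", gets the smallest possible ratio (3/3), while a word found in only one document gets the largest (3/1).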
Your TF-IDF:
Word Document 1 Document 2 Document 3
we 0.176 0.176 0
can 0.176 0.176 0
use 0.176 0.176 0
health 0 0 0
chatbots 0 0 0
for 0.477 0 0
treating 0.477 0 0
stress 0.477 0 0
nlp 0 0.477 0
to 0 0.477 0
create 0 0.477 0
and 0 0.477 0
will 0 0.477 0
be 0 0.477 0
making 0 0.477 0
now 0 0.176 0.176
cannot 0 0 0.477
replace 0 0 0.477
human 0 0 0.477
counsellors 0 0 0.477
yay 0 0 0.477
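The values 0.477 and 0.176 are log10(3/1) and log10(3/2), so the table is consistent with TF-IDF = TF x log10(N / DF), where N is the number of documents. The sketch below assumes that formula with TF taken as the raw count of the word in the document, and uses lowercased documents with punctuation removed. With raw counts, a word repeated within a document scores proportionally higher: "we" in Document 2 comes out as 0.352, where the table lists 0.176.

```python
import math

# Assumed input: lowercased documents with punctuation removed.
docs = [
    "we can use health chatbots for treating stress",
    "we can use nlp to create chatbots and we will be making health chatbots now",
    "health chatbots cannot replace human counsellors now yay",
]
tokenized = [doc.split() for doc in docs]
N = len(tokenized)

def tf_idf(term, doc_index):
    tf = tokenized[doc_index].count(term)            # raw count in this document
    df = sum(term in words for words in tokenized)   # documents containing the term
    return tf * math.log10(N / df)                   # term must occur in the corpus

print(round(tf_idf("stress", 0), 3))  # 0.477
print(round(tf_idf("we", 0), 3))      # 0.176
print(round(tf_idf("health", 0), 3))  # 0.0
```

Words that appear in every document, like "health" and "chatbots", score 0 because log10(3/3) = 0, which is why they carry no weight in distinguishing the documents.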
Thank You
Sampurna Rastogi