NLP
Sentiment Analysis
The goal of sentiment analysis is to identify sentiment across several posts, or even within the same post where emotion is not always explicitly expressed. Companies use Natural Language Processing applications such as sentiment analysis to identify opinions and sentiment online, to help them understand what customers think about their products and services ("I love the new iPhone", and a few lines later, "But sometimes it doesn't work well", where the person is still talking about the iPhone) and as an overall indicator of product reputation.
Text Classification
Text classification makes it possible to assign predefined categories to a document and organize it, to help you find the information you need or simplify some activities. For example, one application of text categorization is spam filtering in email.
Virtual Assistant
Nowadays Google Assistant, Cortana, Siri, Alexa, etc. have become an integral part of our lives. Not only can we talk to them, but they also have the ability to make our lives easier. By accessing our data, they can help us keep track of our tasks, make calls for us, send messages and a lot more.
AI Project Cycle in Natural Language Processing
Natural Language Processing is all about how machines try to understand and interpret human language and operate accordingly. But how can Natural Language Processing be used to solve the problems around us?
Tokenization
After sentence segmentation, each sentence is further divided into tokens. A token is the term used for any word, number or special character occurring in a sentence. Under tokenization, every word, number and special character is considered separately, and each of them becomes a separate token.
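As an illustrative sketch (my own addition, not part of the original text), a simple regex-based tokenizer in Python splits a sentence into word, number and special-character tokens:

    import re

    def tokenize(sentence):
        # \w+ captures words and numbers; [^\w\s] captures each special character separately.
        return re.findall(r"\w+|[^\w\s]", sentence)

    print(tokenize("Aman paid Rs. 150 for the chatbot!"))
    # ['Aman', 'paid', 'Rs', '.', '150', 'for', 'the', 'chatbot', '!']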
Stemming
In this step, the remaining words are reduced to their root words. In other words, stemming is the process in which the affixes of words are removed and the words are converted to their base form.
Word      Affix  Stem
Healed    -ed    Heal
Healing   -ing   Heal
Healer    -er    Heal
Studies   -es    Studi
Studying  -ing   Study
Note that in stemming, the stemmed words (which we get after removing the affixes) might not be meaningful. As you can see in this example, healed, healing and healer were all reduced to heal, but studies was reduced to studi after the affix removal, which is not a meaningful word.
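As a minimal sketch (my own illustration, not from the original text), a naive suffix-stripping stemmer that reproduces the table above could look like this; real stemmers such as the Porter stemmer apply far more rules and may give slightly different outputs:

    def naive_stem(word):
        # Strip one common suffix; purely illustrative, real stemmers use many more rules.
        for suffix in ("ing", "ed", "er", "es"):
            if word.lower().endswith(suffix):
                return word[: -len(suffix)]
        return word

    for w in ["Healed", "Healing", "Healer", "Studies", "Studying"]:
        print(w, "->", naive_stem(w))
    # Healed -> Heal, Healing -> Heal, Healer -> Heal, Studies -> Studi, Studying -> Study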
Lemmatization
Stemming and lemmatization are alternative processes to each other, as the role of both is the same: removal of affixes. The difference between them is that in lemmatization, the word we get after affix removal (also known as the lemma) is a meaningful one. Lemmatization makes sure that the lemma is a word with meaning, and hence it takes longer to execute than stemming.
Word      Affix  Lemma
Healed    -ed    Heal
Healing   -ing   Heal
Healer    -er    Heal
Studies   -es    Study
Studying  -ing   Study
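As an illustrative sketch (my own addition, not from the original text), lemmatization can be tried with NLTK's WordNet lemmatizer; note that it may need a part-of-speech hint, and its output can differ slightly from the hand-worked table above:

    # Assumes nltk is installed and the WordNet data has been downloaded once with
    # nltk.download('wordnet').
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize("studies"))            # study
    print(lemmatizer.lemmatize("studying", pos="v"))  # study
    print(lemmatizer.lemmatize("healed", pos="v"))    # heal
    print(lemmatizer.lemmatize("healing", pos="v"))   # heal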
Bag of Words
Bag of words is a Natural Language Processing model which helps in extracting features out of text, which can then be used in machine learning algorithms. In bag of words, we get the occurrences of each word and construct the vocabulary of the corpus.
This image gives us a brief overview of how bag of words works. Let us assume that the text on the left in this image is the normalized corpus which we have got after going through all the steps of text processing. Now, as we put this text into the bag of words algorithm, the algorithm returns to us the unique words of the corpus and their occurrences in it. As you can see on the right, it shows us a list of words appearing in the corpus, and the number corresponding to each word shows how many times that word has occurred in the text body. Thus, we can say that bag of words gives us two things:
A vocabulary of words for the corpus
The frequency of these words (the number of times each has occurred in the whole corpus)
Calling this algorithm a "bag" of words symbolizes that the sequence of sentences or tokens does not matter in this case, as all we need are the unique words and their frequencies.
Here is the step-by-step approach to implement the bag of words algorithm:
1- Text Normalization: Collect data and pre-process it
2- Create Dictionary: Make a list of all the unique words occurring in the corpus (the vocabulary)
3- Create Document Vector: For each document in the corpus, find out how many times each word from the unique list of words has occurred
4- Create document vectors for all the documents
Step 1: Collecting data and preprocessing it.
Document 1 Aman and anil are stressed
Document 2 Aman went to a therapist
Document 3 Anil went to download a health chatbot
Here are 3 documents having one sentence each. After text normalization, the text becomes:
Document 1 [Aman, and, anil, are, stressed]
Document 2 [Aman, went, to, a, therapist]
Document 3 [Anil, went, to, download, a, health, chatbot]
Note that no tokens have been removed in the stopwords removal step. This is because we are working with very little data, and since the frequency of all the words is almost the same, no word can be said to have lesser value than the others.
Step 2: Create Dictionary
Go through all the documents and create a dictionary: list down all the words which occur in the 3 documents.
Dictionary:
aman, and, anil, are, stressed, went, download, health, chatbot, therapist, a, to
Note that even though some words are repeated in different documents, they are all written just once, because while creating the dictionary we create a list of unique words.
Step 3: Create document vectors
In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. And if the word does not occur in the document, put a 0 under it.
aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
1     1    1     1    1         0     0    0    0          0         0       0
Since in the first document we have the words aman, and, anil, are, stressed, all these words get a value of 1 and the rest of the words get a value of 0.
Step 4: Repeat for all documents
The same exercise has to be done for all the documents. Hence, the table becomes:

aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
1     1    1     1    1         0     0    0    0          0         0       0
1     0    0     0    0         1     1    1    1          0         0       0
0     0    1     0    0         1     1    1    0          1         1       1

In this table, the header row contains the vocabulary of the corpus and the three rows correspond to the 3 different documents. Take a look at this table and analyse the positioning of 0s and 1s in it.
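As a minimal sketch (my own illustration in Python, not part of the original text), the whole bag of words procedure for these three documents can be reproduced as follows:

    documents = [
        "Aman and anil are stressed",
        "Aman went to a therapist",
        "Anil went to download a health chatbot",
    ]

    # Step 1: text normalization (here simply lowercasing and splitting into tokens)
    tokenized = [doc.lower().split() for doc in documents]

    # Step 2: create the dictionary of unique words (the vocabulary)
    vocabulary = []
    for tokens in tokenized:
        for word in tokens:
            if word not in vocabulary:
                vocabulary.append(word)

    # Steps 3 and 4: create a document vector (word counts) for every document
    document_vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

    print(vocabulary)
    for vector in document_vectors:
        print(vector)
    # ['aman', 'and', 'anil', 'are', 'stressed', 'went', 'to', 'a', 'therapist',
    #  'download', 'health', 'chatbot']
    # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    # [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
    # [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]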
Finally, this gives us the document vector table for our corpus. But these raw counts still do not tell us how valuable each token is for the corpus. This leads us to the final step of our algorithm: TFIDF.
TFIDF: Term Frequency & Inverse Document Frequency
The bag of words algorithm gives us the frequency of words in each document of our corpus. It gives us the idea that if a word occurs more often in a document, its value is higher for that document.
For example, if I have a document on air pollution, air and pollution would be the words which occur many times in it. These words are valuable to us as they give us some context about the document. But let us suppose we have 10 documents and all of them talk about different issues. One is on women empowerment, another is on unemployment, and so on. Do you think air and pollution would still be among the most frequently occurring words in the whole corpus? If not, then which words do you think would have the highest frequency in all of them?
And, this, is, the, etc. are the words which occur the most in almost all the documents. But these words do not tell us anything about the corpus. Though they are important for humans, as they make the statements understandable to us, for the machine they are a complete waste as they do not provide any information about the corpus. Hence these are termed stopwords and are mostly removed at the pre-processing stage.
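As an illustrative aside (my own addition, not from the original text), stopword removal is often done with a predefined stopword list, for example the one shipped with NLTK:

    # Assumes nltk is installed and the stopword list has been downloaded once with
    # nltk.download('stopwords').
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))

    tokens = ["aman", "and", "anil", "are", "stressed"]
    filtered = [w for w in tokens if w not in stop_words]
    print(filtered)  # ['aman', 'anil', 'stressed']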
Take a look at this graph. It is a plot of the occurrence of words versus their value. As you can see, the words which have the highest occurrence across all the documents of the corpus have negligible value; hence they are termed stopwords. These words are mostly removed at the pre-processing stage.
As we move past the stopwords, the occurrence level drops drastically, and the words which have adequate occurrence in the corpus are said to have some amount of value and are termed frequent words. These words mostly talk about the document's subject, and their occurrence in the corpus is adequate. Then, as the occurrence of words drops further, the value of such words rises. These words are termed rare or valuable words. They occur the least but add the most value to the corpus.
Let us now demystify TFIDF. TFIDF stands for Term Frequency and Inverse Document Frequency. TFIDF helps us identify the value of each word. Let us understand each term one by one.
Term Frequency
Term frequency is the frequency of a word in one document. Term frequency can easily be read off the document vector table, as that table records the frequency of each word of the vocabulary in each document.
aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
1     1    1     1    1         0     0    0    0          0         0       0
1     0    0     0    0         1     1    1    1          0         0       0
0     0    1     0    0         1     1    1    0          1         1       1
Here, you can see that the frequency of each word in each document has been recorded in the table. These numbers are nothing but the term frequencies.
Inverse Document Frequency
Now let us look at the other half of TFIDF, which is inverse document frequency. For this, let us first understand what document frequency means. Document frequency is the number of documents in which a word occurs, irrespective of how many times it has occurred in those documents. The document frequency for the example vocabulary would be:
aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
2     1    2     1    1         2     2    2    1          1         1       1
You can see that the document frequency of 'aman', 'anil', 'went', 'to' and 'a' is 2, as they occur in 2 documents each. The rest occur in just one document, hence their document frequency is 1.
For inverse document frequency, we put the document frequency in the denominator while the total number of documents goes in the numerator. Here, the total number of documents is 3, hence the inverse document frequency becomes:
aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
3/2   3/1  3/2   3/1  3/1       3/2   3/2  3/2  3/1        3/1       3/1     3/1
Finally, the formula of TFIDF for any word W becomes:
TFIDF(W) = TF(W) * log(IDF(W))
Here, log is to the base 10. You don't need to calculate the log values by yourself; simply use the log function on a calculator.
Now let's multiply the IDF values with the TF values. Note that the TF values are per document while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values with each row of the document vector table.
Here, you can see that the IDF value for aman is the same in each row, and a similar pattern is followed for all the words of the vocabulary. After calculating all the values, we get the final TFIDF table.
Finally, the words have been converted to numbers. These numbers are the values of each word for each document. Here you can see that, since we have a small amount of data, even words like 'are' and 'and' have a high value. But as a word occurs in more and more documents, its document frequency rises and its value drops. That is, for example:
Total number of documents: 10
Number of documents in which 'and' occurs: 10
Therefore, IDF(and) = 10/10 = 1
which means: log(1) = 0. Hence, the value of 'and' becomes 0.
On the other hand, the number of documents in which 'pollution' occurs: 3
IDF(pollution) = 10/3 = 3.3333
which means: log(3.3333) = 0.522
which shows that the word 'pollution' has considerable value in the corpus.
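As a minimal sketch (my own addition, assuming the formula TFIDF(W) = TF(W) * log10(IDF(W)) given above), the full TFIDF table for the three example documents can be computed like this:

    import math

    documents = [
        ["aman", "and", "anil", "are", "stressed"],
        ["aman", "went", "to", "a", "therapist"],
        ["anil", "went", "to", "download", "a", "health", "chatbot"],
    ]

    # Vocabulary in order of first appearance
    vocabulary = []
    for tokens in documents:
        for word in tokens:
            if word not in vocabulary:
                vocabulary.append(word)

    num_docs = len(documents)

    # Document frequency: number of documents containing each word
    df = {w: sum(1 for tokens in documents if w in tokens) for w in vocabulary}

    # TFIDF(W) = TF(W) * log10(N / DF(W)), as in the formula above
    for tokens in documents:
        row = [round(tokens.count(w) * math.log10(num_docs / df[w]), 3) for w in vocabulary]
        print(row)
    # Words such as 'went', 'to' and 'a' (document frequency 2) get a lower value than
    # words like 'stressed' or 'chatbot', which appear in only one document.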
Summarizing the concept
1- Words that occur in all the documents with high term frequencies have the least value and are considered to be the stopwords.
2- For a word to have a high TFIDF value, the word needs to have a high term frequency but a low document frequency, which shows that the word is important for one document but is not a common word across all documents.
3- These values help the computer understand which words are to be considered while processing the natural language. The higher the value, the more important the word is for a given corpus.
Applications of TFIDF
TFIDF is commonly used in the NLP domain. Some of its applications are:
Document Classification: Helps in classifying the type and genre of a document.
Topic Modelling: Helps in predicting the topic for a corpus.
Information Retrieval System: Helps to extract the important information out of a corpus.
Stop word filtering: Helps in removing the unnecessary words out of a text body.