
Natural Language Processing

Natural Language Processing, or NLP, is the sub-field of AI that is focused on enabling computers to
understand and process human languages. It sits at the intersection of Linguistics, Computer Science,
Information Engineering, and Artificial Intelligence, and is concerned with the interactions between
computers and human (natural) languages, in particular how to program computers to process and
analyse large amounts of natural language data.

But how do computers do that? How do they understand what we say in our language? This chapter
is all about demystifying the Natural Language Processing domain and understanding how it works.

Applications of Natural Language Processing


As Artificial Intelligence becomes an integral part of our lives, its applications are commonly used by
a majority of people every day. Here are some applications of Natural Language Processing that are
used in real-life scenarios:

Automatic Summarization: Automatic summarization is relevant not only for summarizing the meaning of
documents and information, but also for understanding the emotional meanings within the information,
such as when collecting data from social media. Automatic summarization is especially relevant when
used to provide an overview of a news item or blog post, while avoiding redundancy from multiple
sources and maximizing the diversity of content obtained.

Sentiment Analysis: Sentiment analysis identifies opinions and sentiment expressed online to help
organisations understand what customers think about their products and services (for example, “I love
the new iPhone” and, a few lines later, “But sometimes it doesn’t work well”, where the person is still
talking about the iPhone) and to track overall indicators of their reputation. Beyond determining simple
polarity, sentiment analysis understands sentiment in context to help better understand what’s behind
an expressed opinion, which can be extremely relevant in understanding and driving purchasing decisions.

Text classification: Text classification makes it possible to assign predefined categories to a
document and organize it to help you find the information you need or simplify some activities. For
example, an application of text categorization is spam filtering in email.

Virtual Assistants: Nowadays Google Assistant, Cortana, Siri, Alexa, etc. have become an integral part
of our lives. Not only can we talk to them, but they also have the ability to make our lives easier. By
accessing our data, they can help us keep notes of our tasks, make calls for us, send messages and a
lot more. With the help of speech recognition, these assistants can not only detect our speech but can
also make sense of it. According to recent research, many more advancements are expected in this field
in the near future.

Revisiting the AI Project Cycle

Let us try to understand how we can develop a project in Natural Language Processing with the help
of an example.

The Scenario

The world is competitive nowadays. People face competition in even the tiniest tasks and are expected
to give their best at every point in time. When people are unable to meet these expectations, they get
stressed and could even go into depression. So, to overcome this, cognitive behavioural therapy (CBT)
is considered to be one of the best methods to address stress, as it is easy to implement and gives
good results. This therapy involves understanding the behaviour and mindset of a person in their normal
life. With the help of CBT, therapists help people overcome their stress and live a happy life.

Problem Scoping
CBT is a technique used by most therapists to help patients deal with stress and depression. But it has
been observed that people do not willingly seek the help of a psychiatrist. They try to avoid such
interactions as much as possible. Thus, there is a need to bridge the gap between a person who needs
help and the psychiatrist. Let us look at the various factors around this problem through the 4Ws
problem canvas.

Who Canvas – Who has the problem?

Who are the stakeholders?
o People who suffer from stress and are at the onset of depression.

What do we know about them?
o People who are going through stress are reluctant to consult a psychiatrist.

What Canvas – What is the nature of the problem?

What is the problem?
o People who need help are reluctant to consult a psychiatrist and hence live miserably.

How do you know it is a problem?
o Studies around mental stress and depression are available from various authentic sources.

Where Canvas – Where does the problem arise?

What is the context/situation in which the stakeholders experience this problem?
o When they are going through a stressful period of time
o Due to some unpleasant experiences

Why Canvas – Why do you think it is a problem worth solving?

What would be of key value to the stakeholders?
o People get a platform where they can talk and vent out their feelings anonymously
o People get a medium that can interact with them, apply primitive CBT and suggest help whenever needed

How would it improve their situation?
o People would be able to vent out their stress
o They would consider going to a psychiatrist whenever required

Now that we have gone through all the factors around the problem, the problem statement template
goes as follows:

Our (Who?): People undergoing stress
Have a problem of (What?): Not being able to share their feelings
While (Where?): They need help in venting out their emotions
An ideal solution would (Why?): Provide them a platform to share their thoughts anonymously and suggest help whenever required

This leads us to the goal of our project which is:

“To create a chatbot which can interact with people, help them
to vent out their feelings and take them through primitive CBT.”

Data Acquisition
To understand the sentiments of people, we need to collect their conversational data so the machine
can interpret the words that they use and understand their meaning. Such data can be collected from
various means:

1. Surveys
2. Observing the therapist’s sessions
3. Databases available on the internet
4. Interviews, etc.
Data Exploration
Once the textual data has been collected, it needs to be processed and cleaned so that a simpler
version can be sent to the machine. Thus, the text is normalised through various steps and reduced to
a minimal vocabulary, since the machine does not require grammatically correct statements but only
the essence of the text.

Modelling
Once the text has been normalised, it is fed to an NLP-based AI model. Note that in NLP, data
pre-processing is required before the data is fed to the machine. Depending upon the type of chatbot
we are trying to make, there are many AI models available which help us build the foundation of our
project.

Evaluation
The model trained is then evaluated and the accuracy for the same is generated on the basis of the
relevance of the answers which the machine gives to the user’s responses. To understand the
efficiency of the model, the suggested answers by the chatbot are compared to the actual answers.

The evaluation can be visualised by plotting the model’s output (shown as a blue line in the figures)
against the actual output (the green line) along with the data samples.

Figure 1: The model’s output does not match the true function at all. Hence the model is said to be
underfitting and its accuracy is lower.

Figure 2: The model’s performance matches well with the true function, which means the model has
optimum accuracy; such a model is called a perfect fit.

Figure 3: The model tries to cover all the data samples, even those that are out of alignment with the
true function. This model is said to be overfitting, and it too has lower accuracy.

Once the model is evaluated thoroughly, it is then deployed in the form of an app which people can
use easily.

Chatbots
As we have seen earlier, one of the most common applications of Natural Language Processing is a
chatbot. There are many chatbots available, and many of them use the same approach that we used in
the scenario above. Let us try some of the chatbots and see how they work.

Script-bot | Smart-bot
Script bots are easy to make | Smart-bots are flexible and powerful
Script bots work around a script which is programmed in them | Smart-bots work on bigger databases and other resources directly
Mostly they are free and are easy to integrate into a messaging platform | Smart-bots learn with more data
Little or no language processing skill is involved | Coding is required to take this up on board
Limited functionality | Wide functionality

Human Language VS Computer Language


Humans communicate through language which we process all the time. Our brain keeps on processing
the sounds that it hears around itself and tries to make sense out of them all the time. Even in the
classroom, as the teacher delivers the session, our brain is continuously processing everything and
storing it in some place. Also, while this is happening, when your friend whispers something, the focus
of your brain automatically shifts from the teacher’s speech to your friend’s conversation. So now, the
brain is processing both the sounds but is prioritising the one on which our interest lies.

The sound reaches the brain through a long channel. As a person speaks, the sound travels from their
mouth to the listener’s eardrum. The sound striking the eardrum is converted into nerve impulses,
transported to the brain and then processed. After processing the signal, the brain gains an
understanding of its meaning. If it is clear, the signal gets stored; otherwise, the listener asks the
speaker for clarification. This is how human languages are processed by humans.

On the other hand, the computer understands the language of numbers. Everything that is sent to the
machine has to be converted to numbers. And while typing, if a single mistake is made, the computer
throws an error and does not process that part. The communications made by the machines are very
basic and simple.

Arrangement of the words and meaning


There are rules in human language. There are nouns, verbs, adverbs, adjectives. A word can be a noun
at one time and an adjective some other time. There are rules to provide structure to a language.

This is an issue related to the syntax of the language. Syntax refers to the grammatical structure of a
sentence. When the structure is present, we can start interpreting the message. Now we also want the
computer to do this. One way to do this is to use part-of-speech tagging, which allows the computer
to identify the different parts of speech.
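As a small illustration, here is a minimal sketch of part-of-speech tagging using the NLTK library (one
possible toolkit among several; the sentence is one of the examples used later in this chapter):

import nltk

# One-time downloads of the tokenizer and tagger data used below.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The red car zoomed past his nose"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Possible output (tag names may vary slightly between NLTK versions):
# [('The', 'DT'), ('red', 'JJ'), ('car', 'NN'), ('zoomed', 'VBD'),
#  ('past', 'IN'), ('his', 'PRP$'), ('nose', 'NN')]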

Analogy with programming language:

Different syntax, same semantics: 2+3 = 3+2

Here the way these statements are written is different, but their meanings are the same: both evaluate to 5.

Different semantics, same syntax: 3/2 (Python 2.7) ≠ 3/2 (Python 3)

Here the statements have the same syntax but their meanings are different. In Python 2.7, this
expression evaluates to 1 (integer division), while in Python 3 it gives an output of 1.5.
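You can check this behaviour yourself in Python 3, where the old integer-division result is still
available through the // operator:

# In Python 3, / always performs true division and // performs floor division.
print(3 / 2)   # 1.5  (true division, the Python 3 meaning of /)
print(3 // 2)  # 1    (floor division, the result Python 2.7 gave for 3/2)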
Multiple Meanings of a word
Let’s consider these three sentences:

His face turned red after he found out that he took the wrong bag
What does this mean? Is he feeling ashamed because he took another person’s bag instead of his? Is
he feeling angry because he did not manage to steal the bag that he has been targeting?

The red car zoomed past his nose
Probably talking about the color of the car.

His face turns red after consuming the medicine
Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?

Here we can see that context is important. We understand a sentence almost intuitively, depending
on our history of using the language, and the memories that have been built within.
Perfect Syntax, no Meaning
Sometimes, a statement can have a perfectly correct syntax but not mean anything. For example, take
a look at this statement:

Chickens feed extravagantly while the moon drinks tea.

This statement is grammatically correct, but does it make any sense? In human language, a perfect
balance of syntax and semantics is important for better understanding.
Data Processing
Humans interact with each other very easily. For us, the natural languages that we use are so
convenient that we speak them easily and understand them well too. But for computers, our
languages are very complex.

Since we all know that the language of computers is numerical, the very first step that comes to mind
is to convert our language to numbers. This conversion takes a few steps. The first step is Text
Normalisation.

Text Normalisation
In Text Normalisation, we undergo several steps to normalise the text to a lower level. Before we
begin, we need to understand that in this section we will be working on a collection of written text,
that is, text from multiple documents; the whole textual data from all the documents together is known
as the corpus. We will not only go through all the steps of Text Normalisation, but also work them out
on a corpus. Let us take a look at the steps:

Sentence Segmentation
Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is treated as
a separate piece of data, so the whole corpus is reduced to a list of sentences.
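As a small illustration (a minimal sketch using the NLTK library, though any NLP toolkit offers an
equivalent), sentence segmentation on the example corpus used later in this chapter looks like this:

import nltk

nltk.download("punkt")  # one-time download of the sentence tokenizer model

corpus = ("Aman and Anil are stressed. Aman went to a therapist. "
          "Anil went to download a health chatbot.")
sentences = nltk.sent_tokenize(corpus)
print(sentences)
# ['Aman and Anil are stressed.', 'Aman went to a therapist.',
#  'Anil went to download a health chatbot.']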
Tokenisation
After segmenting the sentences, each sentence is further divided into tokens. A token is any word,
number or special character occurring in a sentence. Under tokenisation, every word, number and
special character is considered separately, and each of them becomes a separate token.
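Continuing the NLTK-based sketch above, each sentence can then be split into tokens:

import nltk

sentence = "Aman went to a therapist."
tokens = nltk.word_tokenize(sentence)
print(tokens)
# ['Aman', 'went', 'to', 'a', 'therapist', '.']
# Note that the full stop also becomes a token (a special character).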

Removing Stopwords, Special Characters and Numbers

In this step, the tokens which are not necessary are removed from the token list. Which words might
we not require?

Stopwords are words which occur very frequently in the corpus but do not add any value to it. Humans
use grammar to make their sentences meaningful for the other person to understand, but these
grammatical words do not add any essence to the information which is to be transmitted through the
statement; hence they come under stopwords. Some examples of stopwords are: a, an, and, are, for, is,
of, the, to.

These words occur the most in any given corpus but talk very little or nothing about the context or the
meaning of it. Hence, to make it easier for the computer to focus on meaningful terms, these words
are removed.

Along with these words, a lot of times our corpus might have special characters and/or numbers. Now
it depends on the type of corpus that we are working on whether we should keep them in it or not.
For example, if you are working on a document containing email IDs, then you might not want to
remove the special characters and numbers whereas in some other textual data if these characters do
not make sense, then you can remove them along with the stopwords.
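A minimal sketch of this step, assuming NLTK's built-in English stopword list (one of many possible
lists; the exact list you use depends on your corpus):

import nltk
nltk.download("stopwords")  # one-time download of NLTK's stopword lists

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ['Aman', 'went', 'to', 'a', 'therapist', '.']
# Keep a token only if it is purely alphabetic and not a stopword.
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(filtered)  # ['Aman', 'went', 'therapist']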

Converting text to a common case

After stopword removal, we convert the whole text into a common case, preferably lower case. This
ensures that the machine’s case-sensitivity does not cause it to treat the same words as different just
because of different cases.

For example, differently cased forms such as Hello, HELLO, hellO and heLLo would all be converted to
lower case and hence treated as the same word by the machine.
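In Python this step is a one-liner:

tokens = ['Hello', 'HELLO', 'hello', 'hellO', 'heLLo', 'HeLLo']
lowered = [t.lower() for t in tokens]
print(lowered)  # ['hello', 'hello', 'hello', 'hello', 'hello', 'hello']
# All six differently cased forms now count as the same word.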

Stemming
In this step, the remaining words are reduced to their root words. In other words, stemming is the
process in which the affixes of words are removed and the words are converted to their base form.

Note that in stemming, the stemmed words (the words we get after removing the affixes) might not be
meaningful. In this example, healed, healing and healer are all reduced to heal, but studies is reduced
to studi after affix removal, which is not a meaningful word. Stemming does not take into account
whether the stemmed word is meaningful or not; it simply removes the affixes, and hence it is faster.
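A minimal sketch using NLTK's Porter stemmer (one of several stemmers; exact outputs can vary from
one stemmer to another):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['healed', 'healing', 'studies']:
    print(word, '->', stemmer.stem(word))
# healed  -> heal
# healing -> heal
# studies -> studi   (not a meaningful word; stemming does not check this)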

Lemmatization
Stemming and lemmatization are alternative processes to each other, as the role of both is the same:
removal of affixes. The difference between them is that in lemmatization, the word we get after affix
removal (known as the lemma) is always a meaningful word. Lemmatization makes sure that the lemma
is a word with meaning, and hence it takes longer to execute than stemming.

In the same example, the output for studies after affix removal becomes study instead of studi.
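A matching sketch using NLTK's WordNet lemmatizer (it needs the WordNet data, and it works best when
told the part of speech of the word):

import nltk
nltk.download("wordnet")  # one-time download of the WordNet data

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('studies'))           # study (a meaningful word)
print(lemmatizer.lemmatize('healing', pos='v'))  # heal  (pos='v' marks it as a verb)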

The difference between stemming and lemmatization can be summarised as follows: stemming is faster but
its output may not be a meaningful word, while lemmatization is slower but always returns a meaningful
lemma.

With this we have normalised our text to tokens, the simplest form of words present in the corpus. Now
it is time to convert the tokens into numbers. For this, we use the Bag of Words algorithm.

Bag of Words
Bag of Words is a Natural Language Processing model which helps in extracting features out of the
text which can be helpful in machine learning algorithms. In bag of words, we get the occurrences of
each word and construct the vocabulary for the corpus.

Let us assume that we have a normalised corpus obtained after going through all the steps of text
processing. When we put this text into the bag of words algorithm, the algorithm returns the unique
words of the corpus along with their occurrences: a list of the words appearing in the corpus, and for
each word a number showing how many times it has occurred in the text body. Thus, we can say that the
bag of words gives us two things:

1. A vocabulary of words for the corpus

2. The frequency of these words (number of times it has occurred in the whole corpus).

Calling this algorithm a “bag” of words symbolises that the sequence of sentences or tokens does not
matter; all we need are the unique words and their frequencies.

Here is the step-by-step approach to implement the bag of words algorithm:

1. Text Normalisation: Collect data and pre-process it.
2. Create Dictionary: Make a list of all the unique words occurring in the corpus (the vocabulary).
3. Create document vectors: For each document in the corpus, find out how many times each word from
   the unique list has occurred.
4. Create document vectors for all the documents.
Let us go through all the steps with an example:

Step 1: Collecting data and pre-processing it.

Document 1: Aman and Anil are stressed

Document 2: Aman went to a therapist

Document 3: Anil went to download a health chatbot

Here are three documents having one sentence each. After text normalisation, the text becomes:

Document 1: [aman, and, anil, are, stressed]

Document 2: [aman, went, to, a, therapist]

Document 3: [anil, went, to, download, a, health, chatbot]

Note that no tokens have been removed in the stopwords removal step. It is because we have very
little data and since the frequency of all the words is almost the same, no word can be said to have
lesser value than the other.

Step 2: Create Dictionary

Go through all the steps and create a dictionary i.e., list down all the words which occur in all three
documents:

Dictionary:

aman, and, anil, are, stressed, went, download, health, chatbot, therapist, a, to

Note that even though some words are repeated in different documents, they are all written just once,
as while creating the dictionary, we create the list of unique words.

Step 3: Create document vector

In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches
with the vocabulary, put a 1 under it. If the same word appears again, increment the previous value
by 1. And if the word does not occur in that document, put a 0 under it.

Since the first document contains the words aman, and, anil, are and stressed, all these words get a
value of 1 and the rest of the words get a value of 0.

Step 4: Repeat for all documents

The same exercise has to be done for all the documents. Hence, the document vector table becomes:

             aman  and  anil  are  stressed  went  download  health  chatbot  therapist  a   to
Document 1     1    1    1     1      1       0       0        0        0         0      0    0
Document 2     1    0    0     0      0       1       0        0        0         1      1    1
Document 3     0    0    1     0      0       1       1        1        1         0      1    1

In this table, the header row contains the vocabulary of the corpus and the three rows correspond to
the three different documents. Take a look at the table and analyse the positioning of 0s and 1s in it.
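The same table can be produced with a few lines of Python. This is only a bare-bones sketch of the
idea; in practice, libraries such as scikit-learn provide a CountVectorizer that does the same job:

# Bare-bones bag of words for the three example documents.
docs = [
    ['aman', 'and', 'anil', 'are', 'stressed'],
    ['aman', 'went', 'to', 'a', 'therapist'],
    ['anil', 'went', 'to', 'download', 'a', 'health', 'chatbot'],
]

# Vocabulary from Step 2 (the order of the columns does not matter).
vocab = ['aman', 'and', 'anil', 'are', 'stressed', 'went',
         'download', 'health', 'chatbot', 'therapist', 'a', 'to']

# One document vector per document: the count of each vocabulary word in it.
vectors = [[doc.count(word) for word in vocab] for doc in docs]
for v in vectors:
    print(v)
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1]
# [0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1]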

Finally, this gives us the document vector table for our corpus. But the tokens have still not been
converted to numbers. This leads us to the final step of our algorithm: TFIDF.

TFIDF: Term Frequency & Inverse Document Frequency


Suppose you have a book. Which characters or words do you think would occur the most in it?

Bag of words algorithm gives us the frequency of words in each document we have in our corpus. It
gives us an idea that if the word is occurring more in a document, its value is more for that document.
For example, if I have a document on air pollution, air and pollution would be the words which occur
many times in it. And these words are valuable too as they give us some context around the document.
But let us suppose we have 10 documents and all of them talk about different issues. One is on women
empowerment, the other is on unemployment and so on. Do you think air and pollution would still be
one of the most occurring words in the whole corpus? If not, then which words do you think would
have the highest frequency in all of them?

And, this, is, the, etc. are the words which occur the most in almost all the documents. But these words
do not talk about the corpus at all. Though they are important for humans as they make the
statements understandable to us, for the machine they are a complete waste as they do not provide
us with any information regarding the corpus. Hence, these are termed as stopwords and are mostly
removed at the pre-processing stage only.

Consider a plot of the occurrence of words versus their value. Words that have the highest occurrence
across all the documents of the corpus have negligible value; they are termed stopwords and are mostly
removed at the pre-processing stage. As we move past the stopwords, the occurrence level drops
drastically, and words with adequate occurrence in the corpus have some amount of value; these are
termed frequent words, and they mostly talk about the document’s subject. As the occurrence of words
drops further, their value rises. These are termed rare or valuable words: they occur the least but add
the most value to the corpus. Hence, when we look at the text, we take frequent and rare words into
consideration.

Let us now demystify TFIDF. TFIDF stands for Term Frequency and Inverse Document Frequency. TFIDF
helps us in identifying the value of each word. Let us understand each term one by one.

Term Frequency
Term frequency is the frequency of a word in one document. Term frequency can easily be found from
the document vector table as in that table we mention the frequency of each word of the vocabulary
in each document.

Here, you can see that the frequency of each word for each document has been recorded in the table.
These numbers are nothing but the Term Frequencies!

Inverse Document Frequency


Now, let us look at the other half of TFIDF: Inverse Document Frequency. For this, let us first
understand what document frequency means. Document frequency is the number of documents in which a
word occurs, irrespective of how many times it occurs in those documents. The document frequency for
our example vocabulary would be:

aman: 2, anil: 2, went: 2, to: 2, a: 2, and: 1, are: 1, stressed: 1, therapist: 1, download: 1, health: 1, chatbot: 1

Here, you can see that the document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2, as they
occur in two documents. The rest occur in just one document, hence their document frequency is 1.

Talking about inverse document frequency, we put the document frequency in the denominator and the
total number of documents in the numerator. Here, the total number of documents is 3, hence the inverse
document frequency becomes:

aman: 3/2, anil: 3/2, went: 3/2, to: 3/2, a: 3/2, and: 3/1, are: 3/1, stressed: 3/1, therapist: 3/1, download: 3/1, health: 3/1, chatbot: 3/1

Finally, the formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W) * log( IDF(W) ) = TF(W) * log( N / DF(W) )

where N is the total number of documents and DF(W) is the document frequency of W. Here, log is to the
base 10. Don’t worry, you don’t need to calculate the log values by yourself; simply use the log
function on a calculator.

Now, let’s multiply the TF values by the IDF values. Note that the TF values are for each document
while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values with each row
of the document vector table.

Here, you can see that the IDF value for ‘aman’ is the same in each row, and a similar pattern is
followed for all the words of the vocabulary. After calculating all the values, the words have finally
been converted to numbers. These numbers are the value of each word for each document. Since we have a
small amount of data, words like ‘are’ and ‘and’ also have a high value here. But the more documents a
word occurs in, the lower its IDF becomes and the lower the value of that word. For example:

Total Number of documents: 10

Number of documents in which ‘and’ occurs: 10

Therefore, IDF(and) = 10/10 = 1

Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.

On the other hand, number of documents in which ‘pollution’ occurs: 3

IDF(pollution) = 10/3 = 3.3333…

Which means: log(3.3333) = 0.522; which shows that the word ‘pollution’ has considerable value in
the corpus.
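Putting it all together, here is a bare-bones sketch that computes the TFIDF values for the three
example documents using the formula above (log to base 10). Note that library implementations such as
scikit-learn's TfidfVectorizer use slightly different variants of this formula:

import math

docs = [
    ['aman', 'and', 'anil', 'are', 'stressed'],
    ['aman', 'went', 'to', 'a', 'therapist'],
    ['anil', 'went', 'to', 'download', 'a', 'health', 'chatbot'],
]
vocab = ['aman', 'and', 'anil', 'are', 'stressed', 'went',
         'download', 'health', 'chatbot', 'therapist', 'a', 'to']
N = len(docs)  # total number of documents

for word in vocab:
    df = sum(1 for doc in docs if word in doc)   # document frequency
    idf = N / df                                 # inverse document frequency
    # TFIDF(W) = TF(W) * log10(N / DF(W)) for each document.
    tfidf = [round(doc.count(word) * math.log10(idf), 3) for doc in docs]
    print(word, tfidf)
# For example:
# aman [0.176, 0.176, 0.0]   occurs in 2 of 3 documents, log10(3/2) ≈ 0.176
# and  [0.477, 0.0, 0.0]     occurs in only 1 document,  log10(3/1) ≈ 0.477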

Summarising the concept, we can say that:

1. Words that occur in all the documents with high term frequencies have the least values and
are considered to be the stopwords.
2. For a word to have high TFIDF value, the word needs to have a high term frequency but less
document frequency which shows that the word is important for one document but is not a
common word for all documents.
3. These values help the computer understand which words are to be considered while
processing the natural language. The higher the value, the more important the word is for a
given corpus.

Applications of TFIDF
TFIDF is commonly used in the Natural Language Processing domain. Some of its applications are:

Document Classification: Helps in classifying the type and genre of a document.

Topic Modelling: Helps in predicting the topic for a corpus.

Information Retrieval System: To extract the important information out of a corpus.

Stop word filtering: Helps in removing the unnecessary words out of a text body.
