Introduction To Natural Language Processing

Natural language processing, or NLP, is the subfield of AI focused on enabling computers to understand and process human language. NLP sits at the intersection of linguistics, computer science, information engineering and artificial intelligence, and is concerned with the interactions between computers and human (natural) languages, in particular with how to program computers to process and analyse large amounts of natural language data.
Applications Of Natural Language Processing
Some of the applications of natural language processing used in real-life scenarios are described below.

Automatic Summarization
Information overload is a real problem when we need to access a specific, important piece of information from a huge knowledge base. Automatic summarization is relevant not only for summarizing the meaning of a document, but also for understanding the emotional meaning within the information, such as when collecting data from social media.
Sentiment Analysis
The goal of sentiment analysis is to identify sentiment among several posts, or even within the same post where emotion is not always explicitly expressed. Companies use Natural Language Processing applications such as sentiment analysis to identify opinions and sentiment online, helping them understand what customers think about their products and services ("I love the new iPhone" and, a few lines later, "But sometimes it doesn't work well", where the person is still talking about the iPhone) and get an overall indicator of their product's reputation.
Text Classification
Text classification makes it possible to assign predefined categories to a document and organize it so that you can find the information you need, or to simplify certain activities. For example, an application of text categorization is spam filtering in email.

Virtual Assistant
Nowadays Google Assistant, Cortana, Siri, Alexa, etc. have become an integral part of our lives. Not only can we talk to them, but they also have the ability to make our lives easier. By accessing our data, they can help us keep note of our tasks, make calls for us, send messages and a lot more.
AI Project Cycle in Natural Language Processing
Natural Language Processing is all about how machines try to understand and interpret human language and operate accordingly. But how can Natural Language Processing be used to solve the problems around us?

The Scenario: The world is competitive nowadays. People face competition in even the tiniest tasks and are expected to give their best at every point in time. When people are unable to meet these expectations, they get stressed and could even go into depression. We get to hear of many cases where people are depressed due to reasons like peer pressure, studies, family issues, relationships, etc., and they eventually get into something that is bad for them as well as for others. To overcome this, cognitive behavioral therapy (CBT) is considered one of the best methods to address stress, as it is easy to implement on people and also gives good results. This therapy involves understanding the behavior and mindset of a person in their normal life. With the help of CBT, therapists help people overcome their stress and live a happy life.
Problem Scoping: CBT is a technique used by most therapists to help patients come out of stress and depression. But it has been observed that people do not willingly seek the help of a psychiatrist. They try to avoid such interactions as much as possible. Thus, there is a need to bridge the gap between a person who needs help and the psychiatrist. Let us look at the various factors around this problem through the 4Ws problem canvas.
Who Canvas – Who has the problem?
Who are the stakeholders? – People who suffer from stress and are at the onset of depression.
What do we know about them? – People who are going through stress are reluctant to consult a psychiatrist.

What Canvas – What is the nature of the problem?
What is the problem? – People who need help are reluctant to consult a psychiatrist and hence live miserably.
How do you know it is a problem? – Studies around mental stress and depression are available from various authentic sources.

Where Canvas – Where does the problem arise?
What is the context/situation in which the stakeholders experience this problem? – When they are going through a stressful period of time, or due to some unpleasant experiences.

Why Canvas – Why do you think it is a problem worth solving?
What would be of key value to the stakeholders? – People get a platform where they can talk and vent out their feelings anonymously. People get a medium that can interact with them, apply primitive CBT on them and suggest help whenever needed.
How would it improve their situation? – People would be able to vent out their stress. They would consider going to a psychiatrist whenever required.
Now that we have gone through all the factors around the problem, the problem statement template goes as follows:
Our – People undergoing stress (Who?)
Have a problem of – Not being able to share their feelings (What?)
While – They need help in venting out their emotions (Where?)
An ideal solution would – Provide them a platform to share their thoughts anonymously and suggest help whenever required (Why?)

Data Acquisition: To understand the sentiment of people, we need to collect their conversational data so that the machine can interpret the words they use and understand their meaning. Such data can be collected through various means:
1- Surveys
2- Observing therapist sessions
3- Databases available on the Internet
4- Interviews, etc.
Data Exploration: Once the textual data has been collected, it needs to be processed and cleaned so that a simpler version can be sent to the machine. Thus, the text is normalized through various steps and is reduced to a minimal vocabulary, since the machine does not require grammatically correct statements but only the essence of the text.
Modelling: Once the textual data has been normalized, it is fed to an NLP-based AI model. Note that in NLP, modelling happens only after data pre-processing, after which the data is fed to the machine. Depending upon the type of chatbot we try to make, there are many AI models available which help us build the foundation of our project.
Evaluation
The trained model is then evaluated, and its accuracy is generated on the basis of the relevance of the answers which the machine gives to the user's responses. To understand the efficiency of the model, the answers suggested by the chatbot are compared to the actual answers.
CHATBOTS
Chatbots are software applications that use AI and Natural Language Processing to understand what a human wants, and guide them to their desired outcome with as little work for the end user as possible, like a virtual assistant for your customer experience touchpoints.
A well designed and built chatbot will:
Use existing conversation data to understand the type of questions people ask.
Analyze correct answers to those questions through a 'training' period.
Use machine learning and NLP to learn context, and continually get better at answering those questions in the future.
The adoption of chatbots accelerated in 2016 when Facebook opened up its developer platform and showed the world what is possible with chatbots through its Messenger app. Google also got in the game soon after with Google Assistant. Since then, a tremendous number of chatbot apps have been built on websites, in applications, on social media, for customer support, and in countless other places.

How Do Chatbots Work:

At the heart of chatbot technology lies natural language processing, or NLP, the same technology that forms the basis of the voice recognition systems used by virtual assistants such as Google Now, Apple's Siri, and Microsoft's Cortana.
Chatbots process the text presented to them by the user before responding, according to a complex series of algorithms that interpret and identify what the user said, infer what they mean and/or want, and determine a series of appropriate responses based on this information.
As we have seen earlier, one of the most common applications of NLP is a chatbot. There are many chatbots available, and many of them use the same approach as we used in the scenario above. Let us try some of these chatbots and see how they work.
Types of Chatbots
As you interact with more and more chatbots, you will realize that some of them are scripted, or in other words are traditional chatbots, while others are AI-powered and have more knowledge. With the help of this experience, we can understand that there are 2 types of chatbots around us: script bots and smart bots.

Script-bot | Smart-bot
Script bots are easy to make | Smart bots are flexible and powerful
Script bots work around a script which is programmed in them | Smart bots work on bigger databases and other resources directly
Mostly they are free and are easy to integrate into a messaging platform | Smart bots learn with more data
No or little language processing skills | Coding is required to take this up on board
Limited functionality | Wide functionality

Human Language VS Computer Language

Humans communicate through language, which we process all the time. Our brain keeps on processing the sounds that it hears around itself and tries to make sense out of them all the time.
The sound reaches the brain through a long channel. As a person speaks, the sound travels from their mouth to the listener's eardrum. The sound striking the eardrum is converted into neural impulses, which are transported to the brain and then processed. After processing the signal, the brain gains an understanding of its meaning. If it is clear, the signal gets stored; otherwise, the listener asks the speaker for clarity. This is how human languages are processed by humans.
On the other hand, the computer understands the language of numbers. Everything that is sent to the machine has to be converted to numbers. And while typing, if a single mistake is made, the computer throws an error and does not process that part. If we want the machine to understand our language, how should this happen? What are the possible difficulties a machine would face in processing natural language?
Arrangement Of the Words and Meaning
There are rules in human language. There are nouns, verbs, adverbs and adjectives, and there are rules that provide a structure to a language. This is the issue related to the syntax of the language: syntax refers to the grammatical structure of a sentence. Now we also want the computer to handle this. One way to do so is to use part-of-speech tagging, which allows the computer to identify the different parts of speech, as in the sketch below.
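As an illustrative sketch, NLTK (introduced later in this chapter) provides a ready-made part-of-speech tagger. This assumes NLTK is installed and its tokenizer and tagger data packages are downloaded; exact data package names can vary slightly between NLTK versions.

import nltk

nltk.download("punkt", quiet=True)                       # tokenizer models
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

sentence = "The new phone works well"
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
print(nltk.pos_tag(tokens))             # tags each token, e.g. ('new', 'JJ') marks an adjective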
Besides the matter of arrangement, there is also meaning behind the language we use. There are multiple characteristics of human language that might be easy for a human to understand but extremely difficult for a computer to understand.
Concept Of Natural Language Processing
Data Processing
Humans interact with each other very easily. For us, the natural languages that we use are so convenient that we speak them easily and understand them well too. But for computers, our language is very complex.
Since we know that the language of computers is numerical, the very first step that comes to mind is to convert our language to numbers. This conversion takes a few steps. The first step is text normalization. Text normalization helps in cleaning up the textual data in such a way that its complexity becomes lower than that of the actual data. Let us go through text normalization in detail.
Text Normalization
In text normalization, we undergo several steps to normalize the text to a lower level. Before we begin, we need to understand that in this section we will be working on a collection of written text. That is, we will be working on text from multiple documents; the term used for the whole textual data from all the documents taken together is corpus. Not only will we go through all the steps of text normalization, we will also work them out on a corpus. Let us take a look at the steps:
Sentence Segmentation
Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is taken as a different piece of data, so the whole corpus gets reduced to a list of sentences.
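As an illustrative sketch, NLTK's sent_tokenize can perform this segmentation, assuming NLTK and its 'punkt' tokenizer data are installed (data package names can vary slightly between NLTK versions):

import nltk
nltk.download("punkt", quiet=True)   # sentence tokenizer models

corpus = "Aman and Anil are stressed. Aman went to a therapist. Anil went to download a health chatbot."
print(nltk.sent_tokenize(corpus))    # the corpus becomes a list of three sentences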

Tokenization
After segmenting the sentences, each sentence is further divided into tokens. Token is the term used for any word, number or special character occurring in a sentence. Under tokenization, every word, number and special character is considered separately, and each of them becomes a separate token.
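Continuing the sketch, NLTK's word_tokenize splits each sentence into tokens (again assuming the 'punkt' data are available):

import nltk
nltk.download("punkt", quiet=True)

sentence = "Aman went to a therapist."
print(nltk.word_tokenize(sentence))   # ['Aman', 'went', 'to', 'a', 'therapist', '.']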

Removing Stop Words, Special Characters and Numbers

In this step, the tokens which are not necessary are removed from the token list. What could be the possible words which we might not require?
Stop words are words which occur very frequently in the corpus but do not add any value to it. Humans use grammar to make their sentences meaningful for the other person to understand, but grammatical words do not add any essence to the information which is to be transmitted through the statement; hence they come under stop words.
These words occur the most in any given corpus but say very little or nothing about its context or meaning. Hence, to make it easier for the computer to focus on meaningful terms, these words are removed.
Along with these words, our corpus might often have special characters and/or numbers. Whether we should keep them or not depends on the type of corpus that we are working on. For example, if you are working on a document containing email IDs, then you might not want to remove the special characters and numbers, whereas in some other textual data, if these characters do not make sense, then you can remove them along with the stopwords.
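A small illustrative sketch of this step, assuming NLTK's 'stopwords' corpus is available (the token list here is just a made-up example):

import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["Aman", "and", "Anil", "are", "stressed", "@", "2023"]

cleaned = [t for t in tokens
           if t.lower() not in stop_words   # drop stopwords such as 'and' and 'are'
           and t.isalpha()]                 # drop special characters and numbers
print(cleaned)                              # ['Aman', 'Anil', 'stressed']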

Converting Text to A Common Case

After stop word removal, we convert the whole text into a similar case, preferably lower case. This ensures that the case sensitivity of the machine does not make it treat the same word as different words just because of different cases. For example, all the different-cased forms of "hello" would be converted to lower case and hence would be treated as the same word by the machine.
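A one-line sketch of this step in Python:

tokens = ["Hello", "HELLO", "hello", "HeLLo"]
print([t.lower() for t in tokens])   # every form becomes 'hello', so the machine treats them as one word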

Stemming
In this step, the remaining words are reduced to their root words. In other words, stemming is the process in which the affixes of words are removed and the words are converted to their base form.

Word | Affix | Stem
Healed | -ed | Heal
Healing | -ing | Heal
Healer | -er | Heal
Studies | -es | Studi
Studying | -ing | Study

Note that in stemming, the stemmed words (which we get after removing the affixes) might not be meaningful. Here, in this example, healed, healing and healer were all reduced to heal, but studies was reduced to studi after affix removal, which is not a meaningful word.
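A minimal stemming sketch using NLTK's PorterStemmer; as noted above, stems such as 'studi' need not be meaningful words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["healed", "healing", "studies", "studying"]:
    print(word, "->", stemmer.stem(word))   # e.g. healed -> heal, studies -> studi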
Lemmatization
Stemming and lemmatization are alternative processes to each other, as the role of both is the same: removal of affixes. The difference between them is that in lemmatization, the word we get after affix removal (also known as the lemma) is a meaningful one. Lemmatization makes sure that the lemma is a word with meaning, and hence it takes longer to execute than stemming.

Word | Affix | Lemma
Healed | -ed | Heal
Healing | -ing | Heal
Healer | -er | Heal
Studies | -es | Study
Studying | -ing | Study
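A matching lemmatization sketch with NLTK's WordNetLemmatizer (assumes the 'wordnet' data are downloaded; pos="v" tells it to treat the words as verbs):

import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["healed", "healing", "studies", "studying"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))   # heal, heal, study, study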
Bag of Words
Bag of words is a Natural Language Processing model which helps in extracting features out of text, which can then be used in machine learning algorithms. In bag of words, we get the occurrences of each word and construct the vocabulary for the corpus.

Suppose we start from a normalized corpus, which we get after going through all the steps of text processing described above. As we put this text into the bag of words algorithm, the algorithm returns the unique words of the corpus and their occurrences in it: a list of the words appearing in the corpus, with a number against each word showing how many times it has occurred in the text body. Thus, we can say that the bag of words gives us two things:
A vocabulary of words for the corpus
The frequency of these words (the number of times each has occurred in the whole corpus)
Calling this algorithm a "bag" of words symbolizes that the sequence of sentences or tokens does not matter here, as all we need are the unique words and their frequencies.
Here is the step-by-step approach to implement the bag of words algorithm:
1- Text Normalization: Collect data and pre-process it
2- Create Dictionary: Make a list of all the unique words occurring in the corpus (the vocabulary)
3- Create Document Vector: For each document in the corpus, find out how many times each word from the unique list of words has occurred
4- Create document vectors for all the documents
Step 1: Collecting data and preprocessing it.
Document 1: Aman and Anil are stressed
Document 2: Aman went to a therapist
Document 3: Anil went to download a health chatbot
Here are 3 documents having one sentence each. After text normalization, the text becomes:
Document 1: [aman, and, anil, are, stressed]
Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a, health, chatbot]
Note that no tokens have been removed in the stopword removal step. This is because we have very little data, and since the frequency of all the words is almost the same, no word can be said to have lesser value than the others.
Step 2: Create Dictionary
Go through all the documents and create a dictionary: list down all the words which occur across the 3 documents.
Dictionary:
aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot
Note that even though some words are repeated in different documents, they are written just once: while creating the dictionary, we create a list of unique words.
Step 3: Create document vectors
In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. And if the word does not occur in the document, put a 0 under it.

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0

Since the first document has the words aman, and, anil, are, stressed, all these words get a value of 1 and the rest of the words get a value of 0.
Step 4: Repeat for all documents
The same exercise has to be done for all the documents. Hence, the table becomes the document vector table below, in which the header row contains the vocabulary of the corpus and the three rows correspond to the 3 different documents. Take a look at the table and analyse the positioning of 0s and 1s in it.

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0
0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1
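The four steps above can be worked out in a few lines of plain Python; the snippet below is just an illustrative sketch that rebuilds the dictionary and the document vectors for the same three (already normalized) documents.

docs = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

# Step 2: create the dictionary (list of unique words in order of appearance)
vocabulary = []
for doc in docs:
    for token in doc:
        if token not in vocabulary:
            vocabulary.append(token)

# Steps 3 and 4: create one document vector per document
vectors = [[doc.count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)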
Finally, this gives us the document vector table for our corpus. But the tokens have still not been converted to meaningful numbers. This leads us to the final step of our algorithm: TFIDF.
TFIDF: Term Frequency & Inverse Document Frequency
The bag of words algorithm gives us the frequency of words in each document we have in our corpus. It gives us the idea that if a word occurs more often in a document, its value is higher for that document.
For example, if I have a document on air pollution, air and pollution would be words which occur many times in it, and these words are valuable to us as they give us some context around the document. But suppose we have 10 documents and all of them talk about different issues: one is on women empowerment, another is on unemployment, and so on. Do you think air and pollution would still be among the most occurring words in the whole corpus? If not, then which words do you think would have the highest frequency across all of them?
And, this, is, the, etc. are the words which occur the most in almost all the documents. But these words do not tell us anything about the corpus. Though they are important for humans, as they make the statements understandable to us, for the machine they are a complete waste, since they do not provide any information about the corpus. Hence these are termed stopwords and are mostly removed at the pre-processing stage itself.
Consider a plot of word occurrence versus word value. Words that have the highest occurrence across all the documents of the corpus have negligible value; hence they are termed stopwords and are mostly removed at the pre-processing stage.
As we move ahead from the stopwords, the occurrence level drops drastically, and the words which have adequate occurrence in the corpus are said to have some amount of value and are termed frequent words. These words mostly talk about the document's subject, and their occurrence in the corpus is adequate. Then, as the occurrence of words drops further, the value of such words rises. These words are termed rare or valuable words: they occur the least but add the most value to the corpus.
Let us now demystify TFIDF. TFIDF stands for Term Frequency and Inverse Document Frequency. TFIDF helps us in identifying the value of each word. Let us understand each term one by one.
Term Frequency
Term frequency is the frequency of a word in one document. Term frequency can easily be read from the document vector table, as in that table we mention the frequency of each word of the vocabulary in each document.

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0
0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1

Here, you can see that the frequency of each word in each document has been recorded in the table. These numbers are nothing but the term frequencies.
Inverse Document Frequency
Now let us look at the other half of TFIDF, which is inverse document frequency. For this, let us first understand what document frequency means. Document frequency is the number of documents in which a word occurs, irrespective of how many times it has occurred in those documents. The document frequency for our example vocabulary would be:

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
2 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 1

You can see that the document frequency of 'aman', 'anil', 'went', 'to' and 'a' is 2, as they occur in 2 documents. The rest occur in just one document, hence their document frequency is 1.
For inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator. Here, the total number of documents is 3, hence the inverse document frequency becomes:

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
3/2 | 3/1 | 3/2 | 3/1 | 3/1 | 3/2 | 3/2 | 3/2 | 3/1 | 3/1 | 3/1 | 3/1
Finally, the formula of TFIDF for any word W becomes:
TFIDF(W) = TF(W) * log(IDF(W))
Here, log is to the base 10. You don't need to calculate the log value yourself; simply use the log function on a calculator.
Now let us multiply the IDF values with the TF values. Note that the TF values are per document while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values with each row of the document vector table.

Here, you can see that the IDF value for 'aman' is the same in each row, and a similar pattern is followed for all the words of the vocabulary. After calculating all the values, the words have finally been converted to numbers, and these numbers are the value of each word for each document. Note that since we have a small amount of data, words like 'are' and 'and' also have a high value. But as the document frequency increases, the value of a word decreases. For example:
Total number of documents: 10
Number of documents in which 'and' occurs: 10
Therefore, IDF(and) = 10/10 = 1
which means: log(1) = 0. Hence, the value of 'and' becomes 0.
On the other hand, the number of documents in which 'pollution' occurs is 3:
IDF(pollution) = 10/3 = 3.3333
which means: log(3.3333) = 0.522
This shows that the word 'pollution' has considerable value in the corpus.
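As a rough sketch, the whole TFIDF calculation for the three example documents can be written out in plain Python using the formula above (log to the base 10):

import math

docs = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]
vocabulary = sorted({token for doc in docs for token in doc})
N = len(docs)   # total number of documents

# document frequency: number of documents in which each word occurs
df = {w: sum(1 for doc in docs if w in doc) for w in vocabulary}

# TFIDF(W) = TF(W) * log10(N / DF(W)), computed row by row
for doc in docs:
    row = {w: doc.count(w) * math.log10(N / df[w]) for w in vocabulary}
    print({w: round(v, 3) for w, v in row.items() if v > 0})   # show only non-zero values per document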
Summarizing the concept:
1- Words that occur in all the documents with high term frequencies have the least value and are considered to be stopwords.
2- For a word to have a high TFIDF value, the word needs to have a high term frequency but a low document frequency, which shows that the word is important for one document but is not a common word across all documents.
3- These values help the computer understand which words are to be considered while processing the natural language. The higher the value, the more important the word is for a given corpus.
Applications of TFIDF
TFIDF is commonly used in the NLP domain. Some of its applications are:
Document Classification: Helps in classifying the type and genre of a document.
Topic Modelling: Helps in predicting the topic for a corpus.
Information Retrieval System: Helps extract the important information out of a corpus.
Stop Word Filtering: Helps in removing the unnecessary words out of a text body.

Natural Language Toolkit (NLTK)

NLTK is one of the most powerful NLP libraries, which helps make machines understand human language and respond to it accordingly. It is a leading platform for building Python programs to work with human language data.
NLTK provides easy-to-use interfaces such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning.
NLTK can be installed by opening the command prompt and typing:
pip install nltk
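After installing NLTK, the data packages used by the sketches in this chapter (tokenizer models, stopword list, WordNet and the POS tagger) can be downloaded once from Python. These are standard NLTK data package names, though exact names can vary slightly between NLTK versions.

import nltk

for package in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
    nltk.download(package)   # downloads the data into NLTK's data directory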
