
Natural Language Processing

Natural Language Processing, or NLP, is the sub-field of AI that is focused on enabling computers to
understand and process human languages. It sits at the intersection of Linguistics, Computer Science,
Information Engineering, and Artificial Intelligence, and is concerned with the interactions between
computers and human (natural) languages, in particular how to program computers to process and
analyse large amounts of natural language data.

But how do computers do that? How do they understand what we say in our language? This chapter
is all about demystifying the Natural Language Processing domain and understanding how it works.

Applications of Natural Language Processing


As Artificial Intelligence becomes an integral part of our lives, its applications are commonly used by
a majority of people every day. Here are some applications of Natural Language Processing that are
used in real-life scenarios:

Automatic Summarization: Automatic summarization is relevant not only for summarizing the meaning of
documents and information, but also for understanding the emotional meanings within the information,
such as when collecting data from social media. Automatic summarization is especially relevant when
used to provide an overview of a news item or blog post, while avoiding redundancy from multiple
sources and maximizing the diversity of content obtained.

Sentiment Analysis: Sentiment analysis identifies opinions and sentiment expressed online to help
organisations understand what customers think about their products and services (for example, “I love
the new iPhone” and, a few lines later, “But sometimes it doesn’t work well”, where the person is still
talking about the iPhone) and to track overall indicators of their reputation. Beyond determining simple
polarity, sentiment analysis understands sentiment in context to help better understand what’s behind
an expressed opinion, which can be extremely relevant in understanding and driving purchasing decisions.

Text classification: Text classification makes it possible to assign predefined categories to a
document and organize it to help you find the information you need or simplify some activities. For
example, an application of text categorization is spam filtering in email.

Virtual Assistants: Nowadays Google Assistant, Cortana, Siri, Alexa, etc. have become an integral part
of our lives. Not only can we talk to them, but they also have the ability to make our lives easier. By
accessing our data, they can help us keep notes of our tasks, make calls for us, send messages and a
lot more. With the help of speech recognition, these assistants can not only detect our speech but can
also make sense of it. According to recent research, many more advancements are expected in this field
in the near future.

Revisiting the AI Project Cycle

Let us try to understand how we can develop a project in Natural Language Processing with the help
of an example.

The Scenario

The world is competitive nowadays. People face competition in even the tiniest tasks and are expected
to give their best at every point in time. When people are unable to meet these expectations, they get
stressed and could even go into depression. So, to overcome this, cognitive behavioural therapy (CBT)
is considered to be one of the best methods to address stress, as it is easy to implement and gives
good results. This therapy involves understanding the behaviour and mindset of a person in their normal
life. With the help of CBT, therapists help people overcome their stress and live a happy life.

Problem Scoping
CBT is a technique used by most therapists to help patients deal with stress and depression. But it has
been observed that people do not willingly seek the help of a psychiatrist. They try to avoid such
interactions as much as possible. Thus, there is a need to bridge the gap between a person who needs
help and the psychiatrist. Let us look at the various factors around this problem through the 4Ws
problem canvas.

Who Canvas – Who has the problem?

Who are the stakeholders?
o People who suffer from stress and are at the onset of depression.

What do we know about them?
o People who are going through stress are reluctant to consult a psychiatrist.

What Canvas – What is the nature of the problem?

What is the problem?
o People who need help are reluctant to consult a psychiatrist and hence live miserably.

How do you know it is a problem?
o Studies around mental stress and depression are available from various authentic sources.

Where Canvas – Where does the problem arise?

What is the context/situation in which the stakeholders experience this problem?
o When they are going through a stressful period of time
o Due to some unpleasant experiences

Why Canvas – Why do you think it is a problem worth solving?

What would be of key value to the stakeholders?
o People get a platform where they can talk and vent out their feelings anonymously
o People get a medium that can interact with them, apply primitive CBT and suggest help whenever needed

How would it improve their situation?
o People would be able to vent out their stress
o They would consider going to a psychiatrist whenever required

Now that we have gone through all the factors around the problem, the problem statement template
goes as follows:

Our (Who?): People undergoing stress
Have a problem of (What?): Not being able to share their feelings
While (Where?): They need help in venting out their emotions
An ideal solution would (Why?): Provide them a platform to share their thoughts anonymously and suggest help whenever required

This leads us to the goal of our project which is:

“To create a chatbot which can interact with people, help them
to vent out their feelings and take them through primitive CBT.”

Data Acquisition
To understand the sentiments of people, we need to collect their conversational data so the machine
can interpret the words that they use and understand their meaning. Such data can be collected from
various means:

1. Surveys
2. Observing the therapist’s sessions
3. Databases available on the internet
4. Interviews, etc.
Data Exploration
Once the textual data has been collected, it needs to be processed and cleaned so that a simpler
version can be sent to the machine. Thus, the text is normalised through various steps and reduced to
a minimal vocabulary, since the machine does not require grammatically correct statements but only
the essence of the text.

Modelling
Once the text has been normalised, it is fed to an NLP-based AI model. Note that in NLP, data
pre-processing is required before the data is fed to the machine. Depending upon the type of chatbot
we are trying to make, there are many AI models available which help us build the foundation of our
project.

Evaluation
The model trained is then evaluated and the accuracy for the same is generated on the basis of the
relevance of the answers which the machine gives to the user’s responses. To understand the
efficiency of the model, the suggested answers by the chatbot are compared to the actual answers.

The evaluation can be visualised by plotting the model’s output (shown as a blue line in the figures)
against the actual output (the green line) along with the data samples.

Figure 1: The model’s output does not match the true function at all. Hence the model is said to be
underfitting and its accuracy is lower.

Figure 2: The model’s performance matches well with the true function, which means the model has
optimum accuracy; such a model is called a perfect fit.

Figure 3: The model tries to cover all the data samples, even those that are out of alignment with the
true function. This model is said to be overfitting, and it too has lower accuracy.

Once the model is evaluated thoroughly, it is then deployed in the form of an app which people can
use easily.

Chatbots
As we have seen earlier, one of the most common applications of Natural Language Processing is a
chatbot. There are many chatbots available, and many of them use the same approach that we used in
the scenario above. Let us try some of the chatbots and see how they work.

Script-bot | Smart-bot
Script bots are easy to make | Smart-bots are flexible and powerful
Script bots work around a script which is programmed in them | Smart-bots work on bigger databases and other resources directly
Mostly they are free and are easy to integrate into a messaging platform | Smart-bots learn with more data
Little or no language processing skill is involved | Coding is required to take this up on board
Limited functionality | Wide functionality

Human Language VS Computer Language


Humans communicate through language which we process all the time. Our brain keeps on processing
the sounds that it hears around itself and tries to make sense out of them all the time. Even in the
classroom, as the teacher delivers the session, our brain is continuously processing everything and
storing it in some place. Also, while this is happening, when your friend whispers something, the focus
of your brain automatically shifts from the teacher’s speech to your friend’s conversation. So now, the
brain is processing both the sounds but is prioritising the one on which our interest lies.

The sound reaches the brain through a long channel. As a person speaks, the sound travels from their
mouth to the listener’s eardrum. The sound striking the eardrum is converted into nerve impulses,
transported to the brain and then processed. After processing the signal, the brain gains an
understanding of its meaning. If it is clear, the signal gets stored; otherwise, the listener asks the
speaker for clarification. This is how human languages are processed by humans.

On the other hand, the computer understands the language of numbers. Everything that is sent to the
machine has to be converted to numbers. And while typing, if a single mistake is made, the computer
throws an error and does not process that part. The communications made by the machines are very
basic and simple.

Arrangement of the words and meaning


There are rules in human language. There are nouns, verbs, adverbs, adjectives. A word can be a noun
at one time and an adjective some other time. There are rules to provide structure to a language.

This is an issue related to the syntax of the language. Syntax refers to the grammatical structure of a
sentence. When the structure is present, we can start interpreting the message. Now we also want the
computer to do this. One way to do this is to use part-of-speech tagging, which allows the computer
to identify the different parts of speech.
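As a small illustration, here is a minimal sketch of part-of-speech tagging using the NLTK library (one
possible toolkit among several; the sentence is one of the examples used later in this chapter):

import nltk

# One-time downloads of the tokenizer and tagger data used below.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The red car zoomed past his nose"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Possible output (tag names may vary slightly between NLTK versions):
# [('The', 'DT'), ('red', 'JJ'), ('car', 'NN'), ('zoomed', 'VBD'),
#  ('past', 'IN'), ('his', 'PRP$'), ('nose', 'NN')]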

Analogy with programming language:

Different syntax, same semantics: 2+3 = 3+2

Here the way these statements are written is different, but their meanings are the same: both evaluate to 5.

Different semantics, same syntax: 3/2 (Python 2.7) ≠ 3/2 (Python 3)

Here the statements have the same syntax but their meanings are different. In Python 2.7, this
expression evaluates to 1 (integer division), while in Python 3 it gives an output of 1.5.
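You can check this behaviour yourself in Python 3, where the old integer-division result is still
available through the // operator:

# In Python 3, / always performs true division and // performs floor division.
print(3 / 2)   # 1.5  (true division, the Python 3 meaning of /)
print(3 // 2)  # 1    (floor division, the result Python 2.7 gave for 3/2)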
Multiple Meanings of a word
Let’s consider these three sentences:

His face turned red after he found out that he took the wrong bag
What does this mean? Is he feeling ashamed because he took another person’s bag instead of his? Is
he feeling angry because he did not manage to steal the bag that he has been targeting?

The red car zoomed past his nose
Probably talking about the color of the car.

His face turns red after consuming the medicine
Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?

Here we can see that context is important. We understand a sentence almost intuitively, depending
on our history of using the language, and the memories that have been built within.
Perfect Syntax, no Meaning
Sometimes, a statement can have a perfectly correct syntax but not mean anything. For example, take
a look at this statement:

Chickens feed extravagantly while the moon drinks tea.

This statement is grammatically correct, but does it make any sense? In human language, a perfect
balance of syntax and semantics is important for better understanding.
Data Processing
Humans interact with each other very easily. For us, the natural languages that we use are so
convenient that we speak them easily and understand them well too. But for computers, our
languages are very complex.

Since we all know that the language of computers is numerical, the very first step that comes to mind
is to convert our language to numbers. This conversion takes a few steps. The first step is Text
Normalisation.

Text Normalisation
In Text Normalisation, we undergo several steps to normalise the text to a lower level. Before we
begin, we need to understand that in this section we will be working on a collection of written text,
that is, text from multiple documents; the whole textual data from all the documents together is known
as the corpus. We will not only go through all the steps of Text Normalisation, but also work them out
on a corpus. Let us take a look at the steps:

Sentence Segmentation
Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is treated as
a separate piece of data, so the whole corpus is reduced to a list of sentences.
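As a small illustration (a minimal sketch using the NLTK library, though any NLP toolkit offers an
equivalent), sentence segmentation on the example corpus used later in this chapter looks like this:

import nltk

nltk.download("punkt")  # one-time download of the sentence tokenizer model

corpus = ("Aman and Anil are stressed. Aman went to a therapist. "
          "Anil went to download a health chatbot.")
sentences = nltk.sent_tokenize(corpus)
print(sentences)
# ['Aman and Anil are stressed.', 'Aman went to a therapist.',
#  'Anil went to download a health chatbot.']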
Tokenisation
After segmenting the sentences, each sentence is further divided into tokens. A token is any word,
number or special character occurring in a sentence. Under tokenisation, every word, number and
special character is considered separately, and each of them becomes a separate token.
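Continuing the NLTK-based sketch above, each sentence can then be split into tokens:

import nltk

sentence = "Aman went to a therapist."
tokens = nltk.word_tokenize(sentence)
print(tokens)
# ['Aman', 'went', 'to', 'a', 'therapist', '.']
# Note that the full stop also becomes a token (a special character).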

Removing Stopwords, Special Characters and Numbers

In this step, the tokens which are not necessary are removed from the token list. Which words might
we not require?

Stopwords are words which occur very frequently in the corpus but do not add any value to it. Humans
use grammar to make their sentences meaningful for the other person to understand, but these
grammatical words do not add any essence to the information which is to be transmitted through the
statement; hence they come under stopwords. Some examples of stopwords are: a, an, and, are, for, is,
of, the, to.

These words occur the most in any given corpus but talk very little or nothing about the context or the
meaning of it. Hence, to make it easier for the computer to focus on meaningful terms, these words
are removed.

Along with these words, a lot of times our corpus might have special characters and/or numbers. Now
it depends on the type of corpus that we are working on whether we should keep them in it or not.
For example, if you are working on a document containing email IDs, then you might not want to
remove the special characters and numbers whereas in some other textual data if these characters do
not make sense, then you can remove them along with the stopwords.
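A minimal sketch of this step, assuming NLTK's built-in English stopword list (one of many possible
lists; the exact list you use depends on your corpus):

import nltk
nltk.download("stopwords")  # one-time download of NLTK's stopword lists

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ['Aman', 'went', 'to', 'a', 'therapist', '.']
# Keep a token only if it is purely alphabetic and not a stopword.
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(filtered)  # ['Aman', 'went', 'therapist']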

Converting text to a common case

After stopword removal, we convert the whole text into a common case, preferably lower case. This
ensures that the machine’s case-sensitivity does not cause it to treat the same words as different just
because of different cases.

For example, differently cased forms such as Hello, HELLO, hellO and heLLo would all be converted to
lower case and hence treated as the same word by the machine.
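In Python this step is a one-liner:

tokens = ['Hello', 'HELLO', 'hello', 'hellO', 'heLLo', 'HeLLo']
lowered = [t.lower() for t in tokens]
print(lowered)  # ['hello', 'hello', 'hello', 'hello', 'hello', 'hello']
# All six differently cased forms now count as the same word.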

Stemming
In this step, the remaining words are reduced to their root words. In other words, stemming is the
process in which the affixes of words are removed and the words are converted to their base form.

Note that in stemming, the stemmed words (the words we get after removing the affixes) might not be
meaningful. In this example, healed, healing and healer are all reduced to heal, but studies is reduced
to studi after affix removal, which is not a meaningful word. Stemming does not take into account
whether the stemmed word is meaningful or not; it simply removes the affixes, and hence it is faster.
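A minimal sketch using NLTK's Porter stemmer (one of several stemmers; exact outputs can vary from
one stemmer to another):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['healed', 'healing', 'studies']:
    print(word, '->', stemmer.stem(word))
# healed  -> heal
# healing -> heal
# studies -> studi   (not a meaningful word; stemming does not check this)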

Lemmatization
Stemming and lemmatization are alternative processes to each other, as the role of both is the same:
removal of affixes. The difference between them is that in lemmatization, the word we get after affix
removal (known as the lemma) is always a meaningful word. Lemmatization makes sure that the lemma
is a word with meaning, and hence it takes longer to execute than stemming.

In the same example, the output for studies after affix removal becomes study instead of studi.
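A matching sketch using NLTK's WordNet lemmatizer (it needs the WordNet data, and it works best when
told the part of speech of the word):

import nltk
nltk.download("wordnet")  # one-time download of the WordNet data

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('studies'))           # study (a meaningful word)
print(lemmatizer.lemmatize('healing', pos='v'))  # heal  (pos='v' marks it as a verb)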

The difference between stemming and lemmatization can be summarised as follows: stemming is faster but
its output may not be a meaningful word, while lemmatization is slower but always returns a meaningful
lemma.

With this we have normalised our text to tokens, the simplest form of words present in the corpus. Now
it is time to convert the tokens into numbers. For this, we use the Bag of Words algorithm.

Bag of Words
Bag of Words is a Natural Language Processing model which helps in extracting features out of the
text which can be helpful in machine learning algorithms. In bag of words, we get the occurrences of
each word and construct the vocabulary for the corpus.

Let us assume that we have a normalised corpus obtained after going through all the steps of text
processing. When we put this text into the bag of words algorithm, the algorithm returns the unique
words of the corpus along with their occurrences: a list of the words appearing in the corpus, and for
each word a number showing how many times it has occurred in the text body. Thus, we can say that the
bag of words gives us two things:

1. A vocabulary of words for the corpus

2. The frequency of these words (number of times it has occurred in the whole corpus).

Calling this algorithm a “bag” of words symbolises that the sequence of sentences or tokens does not
matter; all we need are the unique words and their frequencies.

Here is the step-by-step approach to implement the bag of words algorithm:

1. Text Normalisation: Collect data and pre-process it.
2. Create Dictionary: Make a list of all the unique words occurring in the corpus (the vocabulary).
3. Create document vectors: For each document in the corpus, find out how many times each word from
   the unique list has occurred.
4. Create document vectors for all the documents.
Let us go through all the steps with an example:

Step 1: Collecting data and pre-processing it.

Document 1: Aman and Anil are stressed

Document 2: Aman went to a therapist

Document 3: Anil went to download a health chatbot

Here are three documents having one sentence each. After text normalisation, the text becomes:

Document 1: [aman, and, anil, are, stressed]

Document 2: [aman, went, to, a, therapist]

Document 3: [anil, went, to, download, a, health, chatbot]

Note that no tokens have been removed in the stopwords removal step. It is because we have very
little data and since the frequency of all the words is almost the same, no word can be said to have
lesser value than the other.

Step 2: Create Dictionary

Go through all the steps and create a dictionary i.e., list down all the words which occur in all three
documents:

Dictionary:

aman, and, anil, are, stressed, went, download, health, chatbot, therapist, a, to

Note that even though some words are repeated in different documents, they are all written just once,
as while creating the dictionary, we create the list of unique words.

Step 3: Create document vector

In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches
with the vocabulary, put a 1 under it. If the same word appears again, increment the previous value
by 1. And if the word does not occur in that document, put a 0 under it.

Since the first document contains the words aman, and, anil, are and stressed, all these words get a
value of 1 and the rest of the words get a value of 0.

Step 4: Repeat for all documents

The same exercise has to be done for all the documents. Hence, the document vector table becomes:

             aman  and  anil  are  stressed  went  download  health  chatbot  therapist  a   to
Document 1     1    1    1     1      1       0       0        0        0         0      0    0
Document 2     1    0    0     0      0       1       0        0        0         1      1    1
Document 3     0    0    1     0      0       1       1        1        1         0      1    1

In this table, the header row contains the vocabulary of the corpus and the three rows correspond to
the three different documents. Take a look at the table and analyse the positioning of 0s and 1s in it.
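The same table can be produced with a few lines of Python. This is only a bare-bones sketch of the
idea; in practice, libraries such as scikit-learn provide a CountVectorizer that does the same job:

# Bare-bones bag of words for the three example documents.
docs = [
    ['aman', 'and', 'anil', 'are', 'stressed'],
    ['aman', 'went', 'to', 'a', 'therapist'],
    ['anil', 'went', 'to', 'download', 'a', 'health', 'chatbot'],
]

# Vocabulary from Step 2 (the order of the columns does not matter).
vocab = ['aman', 'and', 'anil', 'are', 'stressed', 'went',
         'download', 'health', 'chatbot', 'therapist', 'a', 'to']

# One document vector per document: the count of each vocabulary word in it.
vectors = [[doc.count(word) for word in vocab] for doc in docs]
for v in vectors:
    print(v)
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1]
# [0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1]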

Finally, this gives us the document vector table for our corpus. But the tokens have still not been
converted to numbers. This leads us to the final step of our algorithm: TFIDF.

TFIDF: Term Frequency & Inverse Document Frequency


Suppose you have a book. Which characters or words do you think would occur the most in it?

Bag of words algorithm gives us the frequency of words in each document we have in our corpus. It
gives us an idea that if the word is occurring more in a document, its value is more for that document.
For example, if I have a document on air pollution, air and pollution would be the words which occur
many times in it. And these words are valuable too as they give us some context around the document.
But let us suppose we have 10 documents and all of them talk about different issues. One is on women
empowerment, the other is on unemployment and so on. Do you think air and pollution would still be
one of the most occurring words in the whole corpus? If not, then which words do you think would
have the highest frequency in all of them?

And, this, is, the, etc. are the words which occur the most in almost all the documents. But these words
do not talk about the corpus at all. Though they are important for humans as they make the
statements understandable to us, for the machine they are a complete waste as they do not provide
us with any information regarding the corpus. Hence, these are termed as stopwords and are mostly
removed at the pre-processing stage only.

Consider a plot of the occurrence of words versus their value. Words that have the highest occurrence
across all the documents of the corpus have negligible value; they are termed stopwords and are mostly
removed at the pre-processing stage. As we move past the stopwords, the occurrence level drops
drastically, and words with adequate occurrence in the corpus have some amount of value; these are
termed frequent words, and they mostly talk about the document’s subject. As the occurrence of words
drops further, their value rises. These are termed rare or valuable words: they occur the least but add
the most value to the corpus. Hence, when we look at the text, we take frequent and rare words into
consideration.

Let us now demystify TFIDF. TFIDF stands for Term Frequency and Inverse Document Frequency. TFIDF
helps us in identifying the value of each word. Let us understand each term one by one.

Term Frequency
Term frequency is the frequency of a word in one document. Term frequency can easily be found from
the document vector table as in that table we mention the frequency of each word of the vocabulary
in each document.

Here, you can see that the frequency of each word for each document has been recorded in the table.
These numbers are nothing but the Term Frequencies!

Inverse Document Frequency


Now, let us look at the other half of TFIDF: Inverse Document Frequency. For this, let us first
understand what document frequency means. Document frequency is the number of documents in which a
word occurs, irrespective of how many times it occurs in those documents. The document frequency for
our example vocabulary would be:

aman: 2, anil: 2, went: 2, to: 2, a: 2, and: 1, are: 1, stressed: 1, therapist: 1, download: 1, health: 1, chatbot: 1

Here, you can see that the document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2, as they
occur in two documents. The rest occur in just one document, hence their document frequency is 1.

Talking about inverse document frequency, we put the document frequency in the denominator and the
total number of documents in the numerator. Here, the total number of documents is 3, hence the inverse
document frequency becomes:

aman: 3/2, anil: 3/2, went: 3/2, to: 3/2, a: 3/2, and: 3/1, are: 3/1, stressed: 3/1, therapist: 3/1, download: 3/1, health: 3/1, chatbot: 3/1

Finally, the formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W) * log( IDF(W) ) = TF(W) * log( N / DF(W) )

where N is the total number of documents and DF(W) is the document frequency of W. Here, log is to the
base 10. Don’t worry, you don’t need to calculate the log values by yourself; simply use the log
function on a calculator.

Now, let’s multiply the TF values by the IDF values. Note that the TF values are for each document
while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values with each row
of the document vector table.

Here, you can see that the IDF value for ‘aman’ is the same in each row, and a similar pattern is
followed for all the words of the vocabulary. After calculating all the values, the words have finally
been converted to numbers. These numbers are the value of each word for each document. Since we have a
small amount of data, words like ‘are’ and ‘and’ also have a high value here. But the more documents a
word occurs in, the lower its IDF becomes and the lower the value of that word. For example:

Total Number of documents: 10

Number of documents in which ‘and’ occurs: 10

Therefore, IDF(and) = 10/10 = 1

Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.

On the other hand, number of documents in which ‘pollution’ occurs: 3

IDF(pollution) = 10/3 = 3.3333…

Which means: log(3.3333) = 0.522; which shows that the word ‘pollution’ has considerable value in
the corpus.
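Putting it all together, here is a bare-bones sketch that computes the TFIDF values for the three
example documents using the formula above (log to base 10). Note that library implementations such as
scikit-learn's TfidfVectorizer use slightly different variants of this formula:

import math

docs = [
    ['aman', 'and', 'anil', 'are', 'stressed'],
    ['aman', 'went', 'to', 'a', 'therapist'],
    ['anil', 'went', 'to', 'download', 'a', 'health', 'chatbot'],
]
vocab = ['aman', 'and', 'anil', 'are', 'stressed', 'went',
         'download', 'health', 'chatbot', 'therapist', 'a', 'to']
N = len(docs)  # total number of documents

for word in vocab:
    df = sum(1 for doc in docs if word in doc)   # document frequency
    idf = N / df                                 # inverse document frequency
    # TFIDF(W) = TF(W) * log10(N / DF(W)) for each document.
    tfidf = [round(doc.count(word) * math.log10(idf), 3) for doc in docs]
    print(word, tfidf)
# For example:
# aman [0.176, 0.176, 0.0]   occurs in 2 of 3 documents, log10(3/2) ≈ 0.176
# and  [0.477, 0.0, 0.0]     occurs in only 1 document,  log10(3/1) ≈ 0.477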

Summarising the concept, we can say that:

1. Words that occur in all the documents with high term frequencies have the least values and
are considered to be the stopwords.
2. For a word to have high TFIDF value, the word needs to have a high term frequency but less
document frequency which shows that the word is important for one document but is not a
common word for all documents.
3. These values help the computer understand which words are to be considered while
processing the natural language. The higher the value, the more important the word is for a
given corpus.

Applications of TFIDF
TFIDF is commonly used in the Natural Language Processing domain. Some of its applications are:

Document Classification: Helps in classifying the type and genre of a document.

Topic Modelling: Helps in predicting the topic for a corpus.

Information Retrieval System: To extract the important information out of a corpus.

Stop word filtering: Helps in removing the unnecessary words out of a text body.
