NLP Revision Notes

The document contains multiple-choice questions (MCQs) and subjective questions related to Natural Language Processing (NLP) concepts, tools, and techniques. Key topics include NLTK, TF-IDF, tokenization, stemming, lemmatization, and applications of chatbots in healthcare. Additionally, it discusses the differences between stemming and lemmatization, and how NLP interprets communication compared to humans.

Uploaded by

shikhadm9

Question Banks – MCQs :

1. What is the NLTK tool in Python?

(a) Natural Linguistics Tool (b) Natural Language Toolkit

(c) Neutral Language Kit (d) Neutral Language Toolkit

2. TF-IDF in NLP stands for:

(a) Term Frequency and Definite Frequency

(b) Term Frequency and Indefinite Frequency

(c) Term Frequency and Inverse Document Frequency

(d) Term Frequency and Integrated Document Frequency

3. What do we call the process of dividing a string into component words?

(a) Regression

(b) Word Tokenization

(c) Classification

(d) Clustering

4. What is the stem of the word “Making”?

(a) Mak (b) Make (c) Making (d) Maker

5. What is the lemma of the word “Making”?

(a) Mak (b) Make (c) Making (d) Maker

6. Which of these is not a stop word?

(a) This (b) Things (c) Is (d) Do

7. “The higher the value, the more important the word in the document” is true of which model?

(a) Bag of Words (b) TF-IDF (c) YOLO (d) SSD

8. Which of these is not an NLP library?

(a) NLTK (b) NLP Kit (c) Open NLP (d) NLP Suite

9. What do we call a chatbot that uses simple FAQs without any intelligence?

(a) Smart Chatbot (b) Script Chatbot

(c) AI Chatbot (d) ML Chatbot

10. What is the process of extracting emotions from text data using NLP called?

(a) Sentiment Analysis

(b) Emotional Data Science

(c) Emotional Processing

(d) Emotional Classification

Subjective Type Questions (2 Marks):

1. Explain the key steps of NLP-based text analysis.


i) Sentence Segmentation
ii) Tokenization
iii) Removing Stop words, Special Characters and Numbers
iv) Stemming
v) Converting Text to common Case
vi) Lemmatization
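As a minimal sketch, the steps above can be strung together in plain Python. The stop-word set and suffix rules below are toy stand-ins for real resources (e.g. NLTK's stop-word corpus and stemmers), and case conversion is done before stemming for simplicity:

```python
import re

# Toy stop-word list; a real pipeline would use e.g. NLTK's corpus.
STOP_WORDS = {"this", "is", "are", "do", "the", "a", "an", "and", "to"}

def segment_sentences(text):
    # i) Sentence Segmentation: split on terminal punctuation.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(sentence):
    # ii) Tokenization: split a sentence into word tokens.
    return re.findall(r"[A-Za-z]+", sentence)

def normalize(text):
    tokens = []
    for sentence in segment_sentences(text):
        for word in tokenize(sentence):
            word = word.lower()                  # v) convert to common case
            if word in STOP_WORDS:               # iii) remove stop words
                continue
            word = re.sub(r"(ing|ly|es|s)$", "", word)  # iv) crude stemming
            tokens.append(word)
    return tokens

print(normalize("This is a demo. Cats are playing."))  # → ['demo', 'cat', 'play']
```

Lemmatization is omitted here because it needs a vocabulary lookup rather than string rules; the next answers cover that distinction.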

2. Compare Bag of Words and TF-IDF and share your findings.

Bag of Words is a Natural Language Processing model that extracts features from text which can then be fed to machine learning algorithms. In Bag of Words, we count the occurrences of each word and construct the vocabulary for the corpus. Bag of Words simply creates a set of vectors containing the count of word occurrences in each document (review), and these vectors are easy to interpret.

TF-IDF, in contrast, weights each count by how rare the word is across the corpus, so words that occur in fewer documents score higher. TF-IDF is commonly used in the Natural Language Processing domain. Some of its applications are:
· Document Classification - helps in classifying the type and genre of a document.
· Topic Modelling - helps in predicting the topic of a corpus.
· Information Retrieval System - extracts the important information from a corpus.
· Stop word filtering - helps in removing unnecessary words from a text body.
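The Bag of Words side of the comparison can be sketched in a few lines (whitespace tokenization, no stop-word handling — an illustrative simplification, not a full implementation):

```python
from collections import Counter

def bag_of_words(corpus):
    # Tokenize each document and build a sorted vocabulary over the corpus.
    tokenized = [doc.lower().split() for doc in corpus]
    vocab = sorted({w for doc in tokenized for w in doc})
    # One count vector per document: occurrences of each vocabulary word.
    vectors = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat ate the fish"])
print(vocab)    # → ['ate', 'cat', 'fish', 'sat', 'the']
print(vectors)  # → [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Note how the counts alone say nothing about importance: “the” gets the highest count in document 2 despite carrying the least meaning, which is exactly the weakness TF-IDF addresses.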

3. What are some of the applications of chatbots in healthcare?

The most valuable features of using chatbots in healthcare include:

· Monitoring: Awareness and tracking of user’s behavior, anxiety, and weight changes
to encourage developing better habits.
· Anonymity: Especially in sensitive and mental health issues.
· Personalization: Level of personalization depends on the specific application. Some
applications make use of measurements of:
. Physical vitals (oxygenation, heart rhythm, body temperature) via mobile sensors.
. Patient behavior via facial recognition.
· Real time interaction: Immediate response, notifications, and reminders.
· Scalability: Ability to interact with numerous users at the same time.

4. Explain the difference between Stemming and Lemmatization.

Stemming: Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.

Stemming is a process of reducing words to their word stem, base or root form (for
example, books — book, looked — look).

Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (the dictionary meaning of words) and morphological analysis (word structure and grammar relations).
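The contrast can be sketched with a toy rule-based stemmer and a toy lookup lemmatizer. The LEMMAS dictionary here is an illustrative stand-in for a real vocabulary such as WordNet:

```python
import re

# Toy lemma dictionary; a real lemmatizer consults a full vocabulary
# plus morphological analysis.
LEMMAS = {"making": "make", "studies": "study", "better": "good"}

def stem(word):
    # Rule-based suffix stripping: no dictionary, so output may not be a word.
    return re.sub(r"(ing|ly|es|s)$", "", word.lower())

def lemmatize(word):
    # Dictionary lookup: always returns a valid root form when known.
    return LEMMAS.get(word.lower(), word.lower())

for w in ["Making", "Studies"]:
    print(w, "-> stem:", stem(w), "| lemma:", lemmatize(w))
# Making -> stem: mak | lemma: make
# Studies -> stem: studi | lemma: study
```

This mirrors MCQs 4 and 5 above: the stem of “Making” is “Mak”, while its lemma is “Make”.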

5. What is the difference between how humans interpret communication and how NLP interprets it?

Machine communication is very basic and simple, whereas human communication is complex. There are multiple characteristics of human language that may be easy for a human to understand but extremely difficult for a computer. Let us take a look at some of them here:
Arrangement of the words and meaning - There are rules in human language: nouns, verbs, adverbs, adjectives. A word can be a noun at one time and an adjective at another, which creates difficulty for computers during processing.

Analogy with programming languages - Different syntax, same semantics: 2+3 = 3+2. The way these statements are written is different, but their meaning is the same, that is, 5. Different semantics, same syntax: 3/2 (Python 2.7) ≠ 3/2 (Python 3). These statements have the same syntax but different meanings: in Python 2.7, 3/2 results in 1 (integer division), while in Python 3 it gives 1.5.

Multiple meanings of a word - In natural language, a word can have multiple meanings, and the intended meaning is fixed by the context of the statement.
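Python 2.7 is long end-of-life, but Python 3's two division operators reproduce the same "different semantics" contrast within one interpreter:

```python
# Same operands, slightly different syntax, different semantics:
# `/` is Python 3's true division; `//` is floor division, which matches
# what plain `/` did for integer operands in Python 2.7.
print(3 / 2)   # → 1.5 (true division)
print(3 // 2)  # → 1   (floor division)
```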

Subjective Type Questions (4 Marks)


1. Through a step-by-step process, calculate TFIDF for the given corpus and
mention the word(s) having the highest value.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.

Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.

Term Frequency

Term frequency is the frequency of a word in a single document. It can be read straight off the document vector table, which records the frequency of each vocabulary word in each document.

         We  Are  Going  to  Mumbai  is  a  famous  Place  I  am  in
Doc 1:    1    1      1   1       1   0  0       0      0  0   0   0
Doc 2:    0    0      0   0       1   1  1       1      1  0   0   0
Doc 3:    1    1      1   1       0   0  1       1      1  0   0   0
Doc 4:    0    0      0   0       1   0  0       1      0  1   1   1

Inverse Document Frequency

The other half of TFIDF is Inverse Document Frequency. For this, let us first understand what document frequency means. Document Frequency is the number of documents in which a word occurs, irrespective of how many times it occurs in those documents.
The document frequency for the exemplar vocabulary would be:

We  Are  Going  to  Mumbai  is  a  famous  Place  I  am  in
 2    2      2   2       3   1  2       3      2  1   1   1

For inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator. Here, the total number of documents is 4, hence the inverse document frequency becomes:

 We   Are  Going   to  Mumbai   is    a  famous  Place    I   am   in
4/2   4/2    4/2  4/2     4/3  4/1  4/2     4/3    4/2  4/1  4/1  4/1

The formula of TFIDF for any word W, given N documents in the corpus, is:

TFIDF(W) = TF(W) * log(N / DF(W))

where TF(W) is the term frequency of W in the document, DF(W) is its document frequency, N/DF(W) is the inverse document frequency tabulated above, and the logarithm is taken to base 10. Applying this to every cell:

         We          Are         Going       to          Mumbai      is          a           famous      Place       I           am          in
Doc 1:   1*log(4/2)  1*log(4/2)  1*log(4/2)  1*log(4/2)  1*log(4/3)  0*log(4/1)  0*log(4/2)  0*log(4/3)  0*log(4/2)  0*log(4/1)  0*log(4/1)  0*log(4/1)
Doc 2:   0*log(4/2)  0*log(4/2)  0*log(4/2)  0*log(4/2)  1*log(4/3)  1*log(4/1)  1*log(4/2)  1*log(4/3)  1*log(4/2)  0*log(4/1)  0*log(4/1)  0*log(4/1)
Doc 3:   1*log(4/2)  1*log(4/2)  1*log(4/2)  1*log(4/2)  0*log(4/3)  0*log(4/1)  1*log(4/2)  1*log(4/3)  1*log(4/2)  0*log(4/1)  0*log(4/1)  0*log(4/1)
Doc 4:   0*log(4/2)  0*log(4/2)  0*log(4/2)  0*log(4/2)  1*log(4/3)  0*log(4/1)  0*log(4/2)  1*log(4/3)  0*log(4/2)  1*log(4/1)  1*log(4/1)  1*log(4/1)

After calculating all the values, we get:

         We     Are    Going  to     Mumbai  is     a      famous  Place  I      am     in
Doc 1:   0.301  0.301  0.301  0.301  0.124   0      0      0       0      0      0      0
Doc 2:   0      0      0      0      0.124   0.602  0.301  0.124   0.301  0      0      0
Doc 3:   0.301  0.301  0.301  0.301  0       0      0.301  0.124   0.301  0      0      0
Doc 4:   0      0      0      0      0.124   0      0      0.124   0      0.602  0.602  0.602

Finally, the words have been converted to numbers: these are the TFIDF values of each word for each document. The words with the highest value (0.602) are ‘is’, ‘I’, ‘am’ and ‘in’. Because we have so little data, these words score high: each occurs in only one document, so its inverse document frequency is large. As a word’s document frequency increases, its TFIDF value decreases.
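The worked example above can be reproduced with a short script. The log is taken to base 10 to match the tables; note the tables show log(4/3) truncated to 0.124, while it rounds to 0.125:

```python
import math

# Recompute the TFIDF tables above: TFIDF(W) = TF(W) * log10(N / DF(W)).
corpus = [
    "we are going to mumbai",
    "mumbai is a famous place",
    "we are going to a famous place",
    "i am famous in mumbai",
]
docs = [doc.split() for doc in corpus]

# Vocabulary in order of first appearance, matching the table columns.
vocab = []
for doc in docs:
    for word in doc:
        if word not in vocab:
            vocab.append(word)

N = len(docs)
# Document frequency: how many documents contain each word.
df = {w: sum(w in doc for doc in docs) for w in vocab}
# One TFIDF vector per document.
tfidf = [[doc.count(w) * math.log10(N / df[w]) for w in vocab] for doc in docs]

# Words attaining the highest TFIDF value anywhere in the corpus.
best = max(v for row in tfidf for v in row)
print([w for w in vocab if any(abs(row[vocab.index(w)] - best) < 1e-9 for row in tfidf)])
# → ['is', 'i', 'am', 'in']
```

With so little data these rare-word scores dominate; on a larger corpus, genuinely informative words would stand out instead.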

2. Normalize the given text and comment on the vocabulary before and after the
normalization:
Raj and Vijay are best friends. They play together with other friends. Raj likes to play
football but Vijay prefers to play online games. Raj wants to be a footballer. Vijay wants
to become an online gamer.

Normalization of the given text:


Sentence Segmentation:
1. Raj and Vijay are best friends.
2. They play together with other friends.
3. Raj likes to play football but Vijay prefers to play online games.
4. Raj wants to be a footballer.
5. Vijay wants to become an online gamer.
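The remaining normalization steps (tokenization, converting to common case, stop-word removal) and the vocabulary comparison the question asks for can be sketched as follows; the stop-word list is a small illustrative stand-in:

```python
import re

TEXT = ("Raj and Vijay are best friends. They play together with other "
        "friends. Raj likes to play football but Vijay prefers to play "
        "online games. Raj wants to be a footballer. Vijay wants to "
        "become an online gamer.")

# Illustrative stop-word list; a real pipeline would use a standard one.
STOP_WORDS = {"and", "are", "they", "with", "to", "but", "be", "a", "an", "other"}

# Tokenize, then build the vocabulary before normalization (lowercased, unique).
raw_tokens = re.findall(r"[A-Za-z]+", TEXT)
vocab_before = sorted({t.lower() for t in raw_tokens})

# Normalize: common case plus stop-word removal, then take the vocabulary.
normalized = [t.lower() for t in raw_tokens if t.lower() not in STOP_WORDS]
vocab_after = sorted(set(normalized))

print(len(vocab_before), "distinct words before normalization;",
      len(vocab_after), "after")  # → 25 before; 15 after
```

Comment on the vocabulary: normalization shrinks it from 25 distinct words to 15, leaving content words like ‘raj’, ‘vijay’, ‘football’ and ‘gamer’ while discarding function words that carry little meaning.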

