NLP Notes

The document discusses Natural Language Processing (NLP), its features, significance, applications, and various stages including lexical, syntactic, semantic, discourse, and pragmatic analysis. It also covers concepts like text normalization, stemming vs. lemmatization, TFIDF, and the creation of document vector tables. Additionally, it highlights the role of AI in sentiment analysis and the importance of stop words in text processing.


UNIT 6

NATURAL LANGUAGE PROCESSING


1. Explain a few features of natural languages?

• They are governed by set rules that include syntax, lexicon, and semantics.
• All natural languages are redundant, i.e., information can be conveyed in multiple
ways.
• All natural languages change over time.
2. What is the significance of NLP?
Natural Language Processing, or NLP, is the sub-field of AI that is focused on enabling
computers to analyse, understand and process human languages to derive meaningful
information from human language.
3. Artificial Intelligence is becoming an integral part of our lives, and its applications are
commonly used by most people in their daily routines. Explain its applications.
i. Voice assistants: Voice assistants take our natural speech, process it, and give us an
output. These assistants leverage NLP to understand natural language and execute tasks
efficiently.
For example:
Hey Google, set an alarm for 3:30 pm.
Hey Alexa, play some music.
Hey Siri, what's the weather today?

ii. Autogenerated captions: Captions are generated by turning natural speech into text in real-
time. It is a valuable feature for enhancing the accessibility of video content.
For example:
Auto-generated captions on YouTube and Google Meet.

iii. Language Translation: This involves the conversion of text or speech from one language to
another, facilitating cross-linguistic communication and fostering global connectivity.
For example:
Google Translate

iv. Sentiment Analysis: Sentiment analysis is a tool that determines whether the underlying
sentiment of a piece of text is positive, negative, or neutral. Customer sentiment analysis helps
in the automatic detection of emotions when customers interact with products, services, or a
brand.

v. Text Classification: Text classification is a tool that assigns a sentence or document to a
category. For example, news articles containing information on various sectors, such as Food,
Sports, and Politics, can be sorted through text classification, which places raw texts into
predefined groups or categories.

vi. Keyword Extraction: Keyword extraction is a tool that automatically extracts the most
important and most frequently used words and expressions from a text. It can give valuable
insights into people's opinions about a business on social media, and customer service can be
improved by using a keyword extraction tool.

4. What are the different stages of Natural Language processing?


The different stages of Natural Language Processing (NLP) serve various purposes in the
overall task of understanding and processing human language. The stages of Natural Language
Processing (NLP) typically involve the following:
1. Lexical Analysis
2. Syntactic Analysis / Parsing
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis

5. What do you mean by lexical analysis in NLP?


NLP starts with identifying the structure of input words. It is the process of dividing a large
chunk of words into structural paragraphs, sentences, and words. Lexicon stands for a
collection of the various words and phrases used in a language. Lengthy text is broken down
into chunks.

6. Explain Syntactic Analysis / Parsing in NLP?


It is the process of checking the grammar of sentences and phrases. It forms a relationship
among words and eliminates logically incorrect sentences.

7. Explain Semantic Analysis in NLP?


In this stage, the input text is checked for meaning; every word and phrase is checked for
meaningfulness.
For example:
It will reject sentences such as "hot ice cream" or "The fox jumped into the dog", because they
are meaningless even though they may be syntactically valid.

8. Explain Discourse Integration in NLP?


It is the process of forming the story of the sentence. Every sentence should have a relationship
with its preceding and succeeding sentences.
9. Explain Pragmatic Analysis in NLP?
Pragmatic means practical or logical, i.e., this step requires knowledge of the intent in a
sentence. It also means to discard the actual word meaning taken after semantic analysis and
take the intended meaning.

10. What do you mean by a chatbot?


A chatbot is a computer program designed to simulate human conversation through voice
commands, text chats, or both. It can learn over time how best to interact with humans. It can
answer questions and troubleshoot customer problems, evaluate and qualify prospects,
generate sales leads and increase sales on an e-commerce site. There are a lot of chatbots
available, e.g. Elizabot, Mitsuku, Cleverbot and Singtel.

11. Explain the difference between a Script Bot and a Smart Bot?
Script bots work around a fixed, predefined script, are easy to make, can handle only a limited
range of questions, and need little or no language-processing skill (e.g. the bots on the
customer-support sections of websites). Smart bots are flexible and more powerful: they rely
on NLP and machine learning, learn from user interactions and improve with more data, and
their coding is comparatively complex (e.g. Google Assistant, Alexa, Siri).

12. Define Text Normalisation? Explain the steps of text normalisation?


Text Normalisation helps in cleaning up the textual data in such a way that it comes down to a
level where its complexity is lower than the actual data. In Text Normalisation, we undergo
several steps to normalise the text to a lower level.
1. Sentence Segmentation - Under sentence segmentation, the whole corpus is divided into
sentences. Each sentence is treated as a separate unit of data, so the whole corpus is reduced
to a set of sentences.
2. Tokenisation - After segmenting the sentences, each sentence is further divided into tokens.
A token is any word, number or special character occurring in a sentence. Under tokenisation,
every word, number and special character is considered separately, and each of them becomes
a separate token.
3. Removing Stop Words, Special Characters and Numbers - Stop words are words which occur
very frequently in the corpus but do not add any value to it. Along with these words, the corpus
might also contain special characters and/or numbers that add no meaning, and these are
removed as well.
4. Converting Text to a Common Case - After stop-word removal, we convert the whole text into
a common case, preferably lowercase. This ensures that the machine's case sensitivity does
not treat the same words as different merely because of different cases.
5. Stemming - The process in which the affixes of words are removed and the words are
converted to their base form. Stemming does not check whether the stemmed word is
meaningful or not; it simply removes the affixes, and hence it is faster.
6. Lemmatization - Stemming and lemmatization are alternative processes with the same role,
removal of affixes. Lemmatization makes sure that the resulting lemma is a meaningful word,
and hence it takes longer to execute than stemming. (A short NLTK sketch of these steps
follows this list.)
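Below is a minimal Python sketch of these normalisation steps using NLTK (the toolkit mentioned later in these notes); the sample corpus, variable names and downloaded resources are illustrative assumptions, not part of the notes.

```python
# A minimal sketch of the text-normalisation steps above, using NLTK.
# The sample corpus and variable names are illustrative, not from the notes.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

corpus = "We are going to Mumbai. Mumbai is a famous place."

# 1. Sentence segmentation: split the corpus into sentences.
sentences = sent_tokenize(corpus)

# 2. Tokenisation: split each sentence into word / punctuation tokens.
tokens = [word_tokenize(s) for s in sentences]

# 3. Remove stop words, special characters and numbers.
stop_words = set(stopwords.words("english"))
cleaned = [[t for t in sent if t.isalpha() and t.lower() not in stop_words]
           for sent in tokens]

# 4. Convert everything to a common (lower) case.
lowered = [[t.lower() for t in sent] for sent in cleaned]

# 5. Stemming (fast, may give non-words) and 6. Lemmatisation (slower, real words).
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([[stemmer.stem(t) for t in sent] for sent in lowered])
print([[lemmatizer.lemmatize(t) for t in sent] for sent in lowered])
```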

13. Define the term Corpus?


A corpus is the term used for the whole textual data from all the documents we are working on,
taken together. In text normalisation, several steps are applied to this corpus to reduce the text
to a simpler, lower level.

14. Explain the difference between Stemming and Lemmatization?


Stemming and lemmatization are alternative processes with the same role, removal of affixes.
The difference between them is that in lemmatization, the word we get after affix removal
(known as the lemma) is always a meaningful word. Because lemmatization checks for
meaning, it takes longer to execute than stemming, whereas stemming simply strips the affixes
and may produce words that are not meaningful.

15. What does the term "Bag of Words" refer to in Natural Language Processing (NLP)?
Bag of Words is a Natural Language Processing model which extracts features from text that
can then be used in machine learning algorithms.
The bag of words gives us two things:
• A vocabulary of words for the corpus
• The frequency of these words (the number of times each word has occurred in the whole corpus)
Here is the step-by-step approach to implementing the bag of words algorithm (a short sketch
follows the list):
a) Text Processing: Collect data and pre-process it.
b) Create a Dictionary: Make a list of all the unique words occurring in the corpus (the vocabulary).
c) Create document vectors: For each document in the corpus, find out how many times each
word from the unique list of words has occurred.
d) Create document vectors for all the documents.
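As an illustrative sketch (an assumption, since the notes describe the algorithm rather than any particular library), scikit-learn's CountVectorizer performs steps (b)-(d) in two calls. Its default tokeniser lowercases text and drops one-letter tokens such as "a" and "I", so the output differs slightly from a fully hand-worked table.

```python
# A short bag-of-words sketch using scikit-learn (assumed installed).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "We are going to Mumbai",
    "Mumbai is a famous place.",
    "We are going to a famous place.",
    "I am famous in Mumbai.",
]

vectorizer = CountVectorizer()              # step b: build the dictionary
vectors = vectorizer.fit_transform(corpus)  # steps c and d: document vectors

print(vectorizer.get_feature_names_out())   # the vocabulary of the corpus
print(vectors.toarray())                    # one row (document vector) per document
```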
16. While working with NLP, what is the meaning of the following?
a. Syntax
b. Semantics
Syntax: Syntax refers to the grammatical structure of a sentence.
Semantics: It refers to the meaning of the sentence.

17. What is the full form of TFIDF?


Term Frequency and Inverse Document Frequency

18. What is TFIDF? Write its formula.


Term frequency–inverse document frequency (TFIDF) is a numerical statistic that is intended to
reflect how important a word is to a document in a collection or corpus. Term frequency is the
number of times a word appears in a document (sometimes divided by the total number of
words in that document); every document has its own term frequency. For any word W:
TFIDF(W) = TF(W) * log( IDF(W) )
where IDF(W) is the total number of documents divided by the number of documents in which
W occurs.
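A tiny sketch of this formula in Python, assuming (as in the worked example later in these notes) that IDF(W) is the total number of documents divided by the document frequency of W and that the logarithm is base 10:

```python
import math

def tfidf(tf, total_docs, doc_freq):
    """TFIDF(W) = TF(W) * log10(total number of documents / document frequency of W)."""
    return tf * math.log10(total_docs / doc_freq)

# e.g. a word that occurs once in a document and appears in 3 of 4 documents:
print(round(tfidf(1, 4, 3), 3))   # 0.125
```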
19. What is meant by a dictionary in NLP?
A dictionary in NLP means a list of all the unique words occurring in the corpus. If some words
are repeated in different documents, they are written just once while creating the dictionary.
20. What is term frequency?
Term frequency is the frequency of a word in one document. Term frequency can easily be
found from the document vector table as in that table we mention the frequency of each word
of vocabulary in each document.

21. What is inverse document frequency?


To understand inverse document frequency, first we need to understand document frequency.
Document Frequency is the number of documents in which the word occurs irrespective of
how many times it has occurred in those documents.
In the case of inverse document frequency, we put the document frequency in the denominator
and the total number of documents in the numerator.
For example, if the document frequency of the word "AMAN" is 2 (it occurs in 2 documents) and
the corpus contains 3 documents, then its inverse document frequency is 3/2.

22. Which package is used for Natural Language Processing in Python programming?
Natural Language Toolkit (NLTK). NLTK is one of the leading platforms for building Python
programs that can work with human language data.
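A minimal usage example, assuming NLTK is installed and its tokenizer data can be downloaded:

```python
import nltk
nltk.download("punkt", quiet=True)

# Tokenise a sentence with NLTK.
print(nltk.word_tokenize("NLP helps computers understand human language."))
# ['NLP', 'helps', 'computers', 'understand', 'human', 'language', '.']
```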
23. What is a document vector table?
The document vector table is used while implementing the Bag of Words algorithm. In a
document vector table, the header row contains the vocabulary of the corpus and the other
rows correspond to the different documents. Each cell holds the number of times the
corresponding word occurs in that document, with 0 indicating that the word is absent.
24.Which words in a corpus have the highest values and which ones have the least?
Stop words like "and", "this", "is", "the", etc. occur the most frequently in a corpus, but they do
not tell us anything about it, so they carry the least value and are mostly removed at the
pre-processing stage. Rare or valuable words occur the least, but they add the most meaning
to the corpus and therefore have the highest value. Hence, when we look at the text, we give
more importance to these rare words.
25. Give an example of the following:
• Multiple meanings of a word
• Perfect syntax, no meaning

• Example of multiple meanings of a word: "His face turns red after consuming the medicine."
Meaning: is he having an allergic reaction, or is he simply unable to bear the taste of the medicine?
• Example of Perfect syntax, no meaning-
Chickens feed extravagantly while the moon drinks tea.
This statement is correct grammatically, but it does not make any sense. In Human language,
a perfect balance of syntax and semantics is important for better understanding.

26. Does the vocabulary of a corpus remain the same before and after text normalization?
Why?
No, the vocabulary of a corpus does not remain the same before and after text normalization.
The reasons are:
• In normalization the text passes through several steps and is reduced to a minimal
vocabulary, since the machine does not need grammatically complete statements, only their
essence.
• In normalization, stop words, special characters and numbers are removed.
• In stemming, the affixes of words are removed and the words are converted to their base
form. So, after normalization, we get a reduced vocabulary.

27. Explain the relation between the occurrence and the value of a word.

[Plot: occurrence of words versus their value]


As shown in the graph, occurrence and value of a word are inversely proportional. The words
which occur most (like stop words) have negligible value. As the occurrence of words drops,
the value of such words rises. These words are termed as rare or valuable words. These words
occur the least but add the most value to the corpus.
28. What is the significance of converting the text into a common case?
In Text Normalization, we undergo several steps to normalize the text to a lower level. After the
removal of stop words, we convert the whole text into a similar case, preferably lower case. This
ensures that the case-sensitivity of the machine does not consider same words as different
just because of different cases.

29. What are the applications of TFIDF?


TFIDF is commonly used in the Natural Language Processing domain. Some of its applications
are:
• Document Classification - Helps in classifying the type and genre of a document.
• Topic Modelling - It helps in predicting the topic for a corpus.
• Information Retrieval System - To extract the important information out of a corpus.
• Stop word filtering - Helps in removing the unnecessary words out of a text body.

30. What are stop words? Explain with the help of examples.
“Stop words” are the most common words in a language like “the”, “a”, “on”, “is”, “all”. These
words do not carry important meaning and are usually removed from texts. It is possible to
remove stop words using Natural Language Toolkit (NLTK), a suite of libraries and programs for
symbolic and statistical natural language processing.
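A short sketch of stop-word removal with NLTK's built-in English stop-word list (the sample sentence is illustrative):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is a famous place in Mumbai")
print([t for t in tokens if t.lower() not in stop_words])
# ['famous', 'place', 'Mumbai']
```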

31. Explain how AI can play a role in sentiment analysis of human beings?
The goal of sentiment analysis is to identify sentiment among several posts or even in the same
post where emotion is not always explicitly expressed.
Companies use Natural Language Processing applications, such as sentiment analysis, to
identify opinions and sentiment online, helping them understand what customers think about
their products and services (e.g., "I love the new iPhone" and, a few lines later, "But sometimes
it doesn't work well", where the person is still talking about the iPhone) and about their brand
overall.
Beyond determining simple polarity, sentiment analysis understands sentiment in context to
help better understand what's behind an expressed opinion, which can be extremely relevant
in understanding and driving purchasing decisions.

32. Create a Document vector table using bag of words Algorithm for the following corpus.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.
Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.
1. Text Normalisation: In Text Normalisation, we undergo several steps to normalise the text to
a lower level.
a. Sentence Segmentation: Under sentence segmentation, the whole corpus is divided into
sentences.
We are going to Mumbai
Mumbai is a famous place.
We are going to a famous place.
I am famous in Mumbai.
b. Tokenisation: Under tokenisation, every word, number and special character is considered
separately and each of them is now a separate token

We, are, going, to, Mumbai
Mumbai, is, a, famous, place, .
We, are, going, to, a, famous, place, .
I, am, famous, in, Mumbai, .
c. Removal of Stop Words: Stop words are the words which occur very frequently in the corpus
but do not add any value to it. After removing the stop words and the full stops, the words
remaining across the documents are:

We, going, Mumbai, famous, place, I, am


d. Converting into common case: After the stop-word removal, we convert the whole text into a
similar case, preferably lowercase.
we, going, mumbai, famous, place, i, am

e. Stemming/Lemmatisation: Stemming and lemmatization are alternative processes with the
same role, removal of affixes. After this step the words become:

we, go, mumbai, famous, place, i, am


2. Create a Dictionary: List all the unique words which occur across the four documents.
we go mumbai famous place I am

3. Create a document vector for one document: For each word in the document, if it matches
the vocabulary, put a 1 under it. If the same word appears again, increment the previous value
by 1. If the word does not occur in that document, put a 0 under it.
we go mumbai famous place I am
1 1 1 0 0 0 0
4. Create document vectors for all four documents (see the code sketch after the table).

we go mumbai famous place I am


1 1 1 0 0 0 0
0 0 1 1 1 0 0
1 1 0 1 1 0 0
0 0 1 1 0 1 1
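The counting step can be cross-checked with a few lines of Python; this sketch reproduces the table above from the normalised tokens of each document (using the stop-word choices made in these notes, with "i" in lowercase):

```python
# A sketch that reproduces the document-vector table above from the
# normalised tokens of each document.
docs = [
    ["we", "go", "mumbai"],
    ["mumbai", "famous", "place"],
    ["we", "go", "famous", "place"],
    ["mumbai", "famous", "i", "am"],
]
vocab = ["we", "go", "mumbai", "famous", "place", "i", "am"]

for doc in docs:
    print([doc.count(word) for word in vocab])
# [1, 1, 1, 0, 0, 0, 0]
# [0, 0, 1, 1, 1, 0, 0]
# [1, 1, 0, 1, 1, 0, 0]
# [0, 0, 1, 1, 0, 1, 1]
```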

33. Through a step-by-step process, calculate TFIDF for the given corpus and mention the
word(s) having highest value.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.
Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.
Answer:
1. Term Frequency: Term frequency is the frequency of a word in one document. It can easily be
found from the document vector table, as that table records the frequency of each word of the
vocabulary in each document.
we go mumbai famous place I am
1 1 1 0 0 0 0
0 0 1 1 1 0 0
1 1 0 1 1 0 0
0 0 1 1 0 1 1

2. Document Frequency: Document frequency is the number of documents in which the word
occurs, irrespective of how many times it has occurred in those documents.

we go mumbai famous place I am


2 2 3 3 2 1 1

3. Inverse Document Frequency: For inverse document frequency, we put the document
frequency in the denominator and the total number of documents in the numerator. Here, the
total number of documents is 4, hence the inverse document frequency becomes:
we go mumbai famous place I am
4/2 4/2 4/3 4/3 4/2 4/1 4/1

4. Term Frequency Inverse Document Frequency


The formula of TFIDF for any word W becomes:
TFIDF(W) = TF(W) * log (IDF(W))
we go mumbai famous place I am
1*log(2) 1*log(2) 1*log(4/3) 0*log(4/3) 0*log(2) 0*log(4) 0*log(4)
0*log(2) 0*log(2) 1*log(4/3) 1*log(4/3) 1*log(2) 0*log(4) 0*log(4)
1*log(2) 1*log(2) 0*log(4/3) 1*log(4/3) 1*log(2) 0*log(4) 0*log(4)
0*log(2) 0*log(2) 1*log(4/3) 1*log(4/3) 0*log(2) 1*log(4) 1*log(4)

we go mumbai famous place I am


1*0.301 1*0.301 1*0.124 0*0.124 0*0.301 0*0.602 0*0.602
0*0.301 0*0.301 1*0.124 1*0.124 1*0.301 0*0.602 0*0.602
1*0.301 1*0.301 0*0.124 1*0.124 1*0.301 0*0.602 0*0.602
0*0.301 0*0.301 1*0.124 1*0.124 0*0.301 1*0.602 1*0.602

we go mumbai famous place I am


0.301 0.301 0.124 0 0 0 0
0 0 0.124 0.124 0.301 0 0
0.301 0.301 0 0.124 0.301 0 0
0 0 0.124 0.124 0 0.602 0.602

The words having the highest TFIDF value are "I" and "am" (0.602 each), since they appear in
only one document; words like "mumbai" and "famous", which appear in most of the
documents, get the lowest non-zero values.
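The whole calculation can be cross-checked with a short script; values such as 0.125 differ from the 0.124 shown above only by rounding:

```python
import math

vocab = ["we", "go", "mumbai", "famous", "place", "i", "am"]
tf = [                       # the document vector table from above
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1, 1],
]
n_docs = len(tf)
# Document frequency: in how many documents each word occurs.
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]

# TFIDF(W) = TF(W) * log10(n_docs / DF(W)), per the notes' formula.
for row in tf:
    print([round(row[j] * math.log10(n_docs / df[j]), 3) for j in range(len(vocab))])
```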
