NLP Notes
1. What are the features of a natural language?
• They are governed by a set of rules that includes syntax, lexicon, and semantics.
• All natural languages are redundant, i.e., the same information can be conveyed in multiple ways.
• All natural languages change over time.
2. What is the significance of NLP?
Natural Language Processing, or NLP, is the sub-field of AI that is focused on enabling
computers to analyse, understand, and process human languages in order to derive
meaningful information from them.
3. Artificial Intelligence is nowadays becoming an integral part of our lives; its
applications are commonly used by most people in their daily routines.
Explain its applications.
i. Voice assistants: Voice assistants take our natural speech, process it, and give us an
output. These assistants leverage NLP to understand natural language and execute tasks
efficiently.
For example:
Hey Google, set an alarm for 3:30 pm.
Hey Alexa, play some music.
Hey Siri, what's the weather today?
ii. Autogenerated captions: Captions are generated by turning natural speech into text in
real time. It is a valuable feature for enhancing the accessibility of video content.
For example:
Auto-generated captions on YouTube and Google Meet.
iii. Language Translation: This involves the conversion of text or speech from one language
to another, facilitating cross-linguistic communication and fostering global connectivity.
For example:
Google Translate
iv. Sentiment Analysis: Sentiment Analysis is a tool that determines whether the underlying
sentiment of a piece of text is positive, negative, or neutral. Customer sentiment analysis
helps in the automatic detection of emotions when customers interact with a product,
service, or brand.
v. Keyword Extraction: Keyword extraction is a tool that automatically extracts the most
frequent and important words and expressions from a text. It can give valuable insights into
people's opinions about a business on social media, and customer service can be improved
by using a keyword extraction tool.
11. Explain the difference between Script Bot and Smart Bot?
A script bot works around a pre-written script: it is easy to build, has limited functionality,
does little or no language processing, and can respond only to the inputs it was programmed
for (e.g., the chat bots on customer-care websites). A smart bot is flexible and powerful: it
uses NLP and machine learning, works on large databases, and learns and improves as it
processes more data (e.g., Google Assistant, Alexa, Siri).
15. What does the term "Bag of Words" refer to in Natural Language Processing (NLP)?
Bag of Words is a Natural Language Processing model which helps in extracting features
from text that can be used by machine learning algorithms.
The bag of words gives us two things:
• A vocabulary of words for the corpus
• The frequency of these words (the number of times each word occurs in the whole corpus).
Here is the step-by-step approach to implementing the bag of words algorithm:
a) Text Processing: Collect data and pre-process it
b) Create a Dictionary: Make a list of all the unique words occurring in the corpus. (Vocabulary)
c) Create document vectors: For each document in the corpus, count how many times each
word from the unique list occurs.
d) Create document vectors for all the documents.
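These four steps can be sketched in a few lines of plain Python; the two-document corpus below is only illustrative:

    # (a) Text processing: a toy corpus, lower-cased and split into tokens.
    corpus = ["Aman and Anil are stressed", "Aman went to a therapist"]
    docs = [doc.lower().split() for doc in corpus]

    # (b) Create a dictionary: all unique words in the corpus (the vocabulary).
    vocabulary = sorted(set(word for doc in docs for word in doc))

    # (c) + (d) Create a document vector for every document: the count of
    # each vocabulary word in that document.
    vectors = [[doc.count(word) for word in vocabulary] for doc in docs]

    print(vocabulary)
    for vector in vectors:
        print(vector)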
16. While working with NLP, what is the meaning of the following?
a. Syntax
b. Semantics
Syntax: Syntax refers to the grammatical structure of a sentence.
Semantics: It refers to the meaning of the sentence.
22. Which package is used for Natural Language Processing in Python programming?
Natural Language Toolkit (NLTK). NLTK is one of the leading platforms for building Python
programs that can work with human language data.
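A small illustrative sketch of NLTK in use (the sample sentence is made up, and the download is a one-time step):

    import nltk

    nltk.download("punkt")  # tokenizer models (newer NLTK versions may need "punkt_tab")

    from nltk.tokenize import word_tokenize

    text = "NLTK is one of the leading platforms for NLP in Python."
    print(word_tokenize(text))
    # ['NLTK', 'is', 'one', 'of', 'the', 'leading', 'platforms', ...]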
23. What is a document vector table?
A document vector table is used while implementing the Bag of Words algorithm. In a
document vector table, the header row contains the vocabulary of the corpus and the other
rows correspond to the different documents. If a document contains a particular word, it is
represented by 1 (or by its frequency); the absence of a word is represented by 0.
24. Which words in a corpus have the highest values and which ones have the least?
Stop words like "and", "this", "is", "the", etc. have the highest frequency in a corpus.
However, these words say little about the content of the corpus; hence they are termed
stop words and are mostly removed at the pre-processing stage. Rare or valuable words
occur the least but add the most meaning to the corpus. Hence, when we look at the text,
we take both the frequent and the rare words into consideration.
25. Give an example of the following:
• Multiple meanings of a word
• Perfect syntax, no meaning
• Example of multiple meanings of a word: "His face turns red after consuming the medicine."
Meaning: Is he having an allergic reaction? Or is he unable to bear the taste of the medicine?
• Example of Perfect syntax, no meaning-
Chickens feed extravagantly while the moon drinks tea.
This statement is grammatically correct, but it does not make any sense. In human language,
a perfect balance of syntax and semantics is important for better understanding.
26. Does the vocabulary of a corpus remain the same before and after text normalization?
Why?
No, the vocabulary of a corpus does not remain the same before and after text normalization.
The reasons are:
• In normalization, the text is taken through several steps that reduce it to a minimal
vocabulary, since the machine does not require grammatically correct statements, only
their essence.
• During normalization, stop words, special characters, and numbers are removed.
• In stemming, the affixes of words are removed and the words are converted to their base form.
So, after normalization, we get a reduced vocabulary.
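A small sketch of this idea, assuming NLTK's English stop-word list and Porter stemmer (the sample sentence is illustrative; note that stemmed forms need not be dictionary words):

    import string

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords")  # one-time download of the stop-word lists

    text = "We are going to Mumbai, and Mumbai is a famous place!"
    tokens = [tok.strip(string.punctuation) for tok in text.lower().split()]

    print("vocabulary before:", sorted(set(tokens)))

    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    # Remove stop words and non-alphabetic tokens, then stem to base form.
    cleaned = [stemmer.stem(tok) for tok in tokens
               if tok.isalpha() and tok not in stop]

    print("vocabulary after:", sorted(set(cleaned)))  # a smaller vocabulary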
30. What are stop words? Explain with the help of examples.
“Stop words” are the most common words in a language like “the”, “a”, “on”, “is”, “all”. These
words do not carry important meaning and are usually removed from texts. It is possible to
remove stop words using Natural Language Toolkit (NLTK), a suite of libraries and programs for
symbolic and statistical natural language processing.
31. Explain how AI can play a role in sentiment analysis of human beings?
The goal of sentiment analysis is to identify sentiment among several posts or even in the same
post where emotion is not always explicitly expressed.
Companies use Natural Language Processing applications, such as sentiment analysis, to
identify opinions and sentiment online to help them understand what customers think about
their products and services (e.g., "I love the new iPhone" and, a few lines later, "But
sometimes it doesn't work well", where the person is still talking about the iPhone).
Beyond determining simple polarity, sentiment analysis understands sentiment in context to
help better understand what’s behind an expressed opinion, which can be extremely relevant
in understanding and driving purchasing decisions.
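As one possible illustration, NLTK ships a rule-based sentiment scorer (VADER) that assigns each text a polarity score; the two reviews below echo the iPhone example above:

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

    sia = SentimentIntensityAnalyzer()
    for review in ["I love the new iPhone",
                   "But sometimes it doesn't work well"]:
        scores = sia.polarity_scores(review)
        # compound ranges from -1 (most negative) to +1 (most positive)
        print(review, "->", scores["compound"])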
32. Create a document vector table using the Bag of Words algorithm for the following corpus.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.
Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.
1. Text Normalisation: In text normalisation, we go through several steps to normalise the
text to a lower level.
a. Sentence Segmentation: Under sentence segmentation, the whole corpus is divided into
sentences:
We are going to Mumbai
Mumbai is a famous place.
We are going to a famous place.
I am famous in Mumbai.
b. Tokenisation: Under tokenisation, every word, number, and special character is considered
separately, and each of them is now a separate token.
c. Removing stop words, special characters and numbers: Stop words such as "are", "to",
"is", "a", and "in", along with the full stops, are removed, and the remaining words are
reduced to a common base form (e.g., "We" becomes "we", "going" becomes "go").
2. Create a Dictionary: List the unique words from the normalised corpus to form the
vocabulary:
we, go, mumbai, famous, place, I, am
3. Create a Document Vector for 1 document: For each word in the document, if it matches
the vocabulary, put a 1 under it. If the same word appears again, increment the previous
value by 1. If the word does not occur in that document, put a 0 under it.
we    go    mumbai    famous    place    I    am
1     1     1         0         0        0    0
4. Create Document Vectors for all the documents:
              we    go    mumbai    famous    place    I    am
Document 1:   1     1     1         0         0        0    0
Document 2:   0     0     1         1         1        0    0
Document 3:   1     1     0         1         1        0    0
Document 4:   0     0     1         1         0        1    1
33. Through a step-by-step process, calculate TFIDF for the given corpus and mention the
word(s) having the highest value.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.
Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.
Answer:
1. Term Frequency: Term frequency is the frequency of a word in one document. Term
frequency can easily be read from the document vector table, as that table records the
frequency of each vocabulary word in each document:
              we    go    mumbai    famous    place    I    am
Document 1:   1     1     1         0         0        0    0
Document 2:   0     0     1         1         1        0    0
Document 3:   1     1     0         1         1        0    0
Document 4:   0     0     1         1         0        1    1
2. Document Frequency: Document frequency is the number of documents in which a word
occurs, irrespective of how many times it occurs within each document:
we    go    mumbai    famous    place    I    am
2     2     3         3         2       1    1
3. Inverse Document Frequency: For inverse document frequency, we put the document
frequency in the denominator and the total number of documents in the numerator. Here,
the total number of documents is 4, hence the inverse document frequency becomes:
we    go    mumbai    famous    place    I    am
4/2   4/2   4/3       4/3       4/2     4/1  4/1
4. TFIDF: The formula is TFIDF(W) = TF(W) × log(IDF(W)). Applying it, the words "I" and
"am" have the highest TFIDF values, since they each occur in only one document (IDF = 4/1,
the largest), while words such as "mumbai" and "famous", which appear in three documents,
score the lowest.
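The full calculation can be reproduced with a short Python sketch (log base 10 is used here; the base only scales the values, not the ranking):

    import math

    vocabulary = ["we", "go", "mumbai", "famous", "place", "I", "am"]

    # Term frequencies per document, copied from the table above.
    tf = [
        [1, 1, 1, 0, 0, 0, 0],  # Document 1
        [0, 0, 1, 1, 1, 0, 0],  # Document 2
        [1, 1, 0, 1, 1, 0, 0],  # Document 3
        [0, 0, 1, 1, 0, 1, 1],  # Document 4
    ]

    total_docs = len(tf)

    # Document frequency: in how many documents each word appears.
    df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocabulary))]

    # TFIDF(W) = TF(W) * log(total documents / document frequency of W)
    for i, row in enumerate(tf, start=1):
        scores = [round(row[j] * math.log10(total_docs / df[j]), 3)
                  for j in range(len(vocabulary))]
        print("Document", i, ":", scores)

Running this shows the largest values (about 0.602) under "I" and "am" in Document 4, matching the conclusion above.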