Chapter 6 - NLP Question Answer
a. Script Bot – An Internet bot, also known as a web robot, robot, or simply
bot, is a software program that runs automated tasks (scripts) over the
Internet, typically to imitate human online activity, such as communicating,
on a large scale.
b. Smart Bot – A smart bot is an artificial intelligence (AI) system that can
learn from its surroundings and past experiences and develop new skills based
on that knowledge. Smart bots that are intelligent enough can operate
alongside people and learn from their actions.
Q8. Name the two chatbots developed by British Programmer Rollo Carpenter.
Ans 8. The two chatbots developed by British Programmer Rollo Carpenter are:
1) Cleverbot
2) Jabberwacky
Ans 14- During text normalization, we reduce the randomness in the text and bring
it closer to a predefined standard. This reduces the amount of different
information that the computer has to deal with and therefore improves efficiency.
Corpus
• The whole textual data collected from all the documents taken together is
known as the corpus.
• To work on the corpus, the following steps are required:
a. Sentence Segmentation
• Sentence segmentation divides the corpus into sentences.
• Each sentence is treated as a separate piece of data, so the corpus is now
reduced to a list of sentences.
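Sentence segmentation can be sketched in a few lines of Python. This is only a minimal illustration: it assumes that a sentence ends at ".", "!" or "?" followed by a space, which is not always true (for example, in abbreviations). The function name and sample corpus are made up for this example.

```python
import re

def segment_sentences(corpus):
    # Split wherever ., ! or ? is followed by whitespace (a simplifying assumption).
    parts = re.split(r"(?<=[.!?])\s+", corpus.strip())
    return [part for part in parts if part]

corpus = "Raj likes to play football. Vijay prefers online games! Do you play too?"
sentences = segment_sentences(corpus)
print(sentences)
# ['Raj likes to play football.', 'Vijay prefers online games!', 'Do you play too?']
```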
b. Tokenisation
• After sentence segmentation, each sentence is further divided into tokens.
• A token is any word, number or special character occurring in a
sentence.
• Under tokenisation, every word, number and special character is considered
separately and each of them is now a separate token.
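As a rough sketch, tokenisation of one sentence could look like this in Python. The regular expression is an assumption chosen for illustration: it treats every run of letters or digits as one token and every other non-space character as its own token.

```python
import re

def tokenise(sentence):
    # \w+ matches a whole word or number; [^\w\s] matches one special character.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenise("Raj scored 2 goals today!")
print(tokens)
# ['Raj', 'scored', '2', 'goals', 'today', '!']
```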
c. Removing Stopwords, Special Characters and Numbers
Stopwords such as a, an, the and is have little or no meaning in the corpus, so
they are removed and the focus shifts to meaningful terms.
Along with these stopwords, the corpus may contain special characters and
numbers. Sometimes they are meaningful, sometimes not. For example, in email IDs,
the symbol @ and some numbers are very important. Special characters and numbers
that are not meaningful can be removed just like stopwords.
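The removal step might be sketched as follows. The stopword list here is a tiny assumed sample, not a standard list, and the keep_special flag is a hypothetical switch for cases (such as email IDs) where special characters and numbers must be kept.

```python
# A small assumed stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def remove_stopwords(tokens, keep_special=False):
    kept = []
    for token in tokens:
        if token.lower() in STOPWORDS:
            continue  # drop stopwords
        if not keep_special and not token.isalpha():
            continue  # drop numbers and special characters
        kept.append(token)
    return kept

tokens = ["The", "cat", "is", "in", "the", "box", "!", "42"]
print(remove_stopwords(tokens))                     # ['cat', 'box']
print(remove_stopwords(tokens, keep_special=True))  # ['cat', 'box', '!', '42']
```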
• After removing stopwords, the next step is to convert the whole text into a
common case.
• The preferred case is lower case.
• This ensures that the machine, which is case-sensitive, does not treat the
same word as different words just because of different cases.
For example, if the word “hello” is written in 6 different forms, all of them are
converted into lower case and hence treated as the same word by the
machine.
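A short sketch of case conversion in Python is given here. The six spellings of "hello" are assumed for illustration.

```python
variants = ["hello", "Hello", "HELLO", "hELLO", "HeLLo", "hellO"]

# Before conversion the machine sees six different strings.
print(len(set(variants)))  # 6

# After converting to lower case only one distinct word remains.
normalised = [word.lower() for word in variants]
distinct = set(normalised)
print(distinct)  # {'hello'}
```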
d. Stemming
• In this step, the words are reduced to their root words.
• Stemming is the process in which the affixes of words are removed and the
words are converted to their base form.
• Note that in stemming, the stemmed words (words which we get after
removing the affixes) might not be meaningful.
• For example, healed, healing and healer are all reduced to heal, but studies
is reduced to studi after the affix is removed, which is not a meaningful
word.
• Stemming does not take into account whether the stemmed word is meaningful
or not.
• It only removes the affixes, which is why it is faster.
Word      Affix   Stem
healed    ed      heal
healer    er      heal
studies   es      studi
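The behaviour shown in the table can be imitated with a very naive suffix-stripping function. This is only a sketch under assumptions: the suffix list and the minimum stem length are chosen for this example, and real stemmers (such as the Porter stemmer) use many more rules.

```python
# Assumed suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "es", "s")

def stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no suffix matched

for word in ["healed", "healing", "healer", "studies"]:
    print(word, "->", stem(word))
# healed -> heal
# healing -> heal
# healer -> heal
# studies -> studi
```

Note that the function never checks whether the result is a real word, which is why studies becomes studi.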
The next section of Unit 6, Natural Language Processing (AI Class 10), covers
lemmatization.
e. Lemmatization
• It is an alternative process to stemming.
• It also removes the affix from the words in the corpus.
• The only difference between lemmatization and stemming is that the output of
lemmatization is always a meaningful word.
• The final output is known as a lemma.
• It takes a longer time to execute than stemming.
• The following table shows the process.
Word      Affix   Lemma
healed    ed      heal
healer    er      heal
studies   es      study
Comparing the stemming and lemmatization tables, you will find that the word
studies is converted into studi by stemming, whereas its lemma is study.
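To show why lemmatization gives meaningful words but takes longer, here is a toy sketch. It assumes a tiny hand-made vocabulary and rule list (both hypothetical): each candidate is accepted only if it is found in the vocabulary, and this extra lookup is the added work compared with stemming. Real lemmatizers rely on a full dictionary and grammatical information.

```python
# Hypothetical mini-dictionary of valid base words.
VOCABULARY = {"heal", "study"}

# Assumed (suffix, replacement) rules, tried in order.
RULES = (("ies", "y"), ("ing", ""), ("ed", ""), ("er", ""), ("es", ""), ("s", ""))

def lemmatize(word):
    word = word.lower()
    if word in VOCABULARY:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Accept the candidate only if it is a meaningful word.
            if candidate in VOCABULARY:
                return candidate
    return word  # unknown word: leave it unchanged

for word in ["healed", "healer", "studies"]:
    print(word, "->", lemmatize(word))
# healed -> heal
# healer -> heal
# studies -> study
```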
After normalisation of the corpus, the tokens are converted into numbers. To do
so, the bag of words algorithm is used.
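As a preview, the idea of turning tokens into numbers can be sketched as follows. This is a minimal illustration, assuming each document is already a list of normalised tokens. The sample documents and the function name are made up for this example.

```python
from collections import Counter

def bag_of_words(documents):
    # The vocabulary is every distinct token in the corpus, in sorted order.
    vocabulary = sorted({token for doc in documents for token in doc})
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        # One number per vocabulary word: how often it occurs in this document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = [["heal", "study", "heal"], ["study", "play"]]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['heal', 'play', 'study']
print(vectors)  # [[2, 0, 1], [0, 1, 1]]
```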