NLP_Midterm_Spring2025
1.
a. Find the minimum edit distance between the words “Dive” - “Dye” and “Dive” –
“Thrive”. (2)
By replacing the letter “i” with “y” and deleting the “v”, you can change the word “Dive”
to “Dye”, requiring two operations. Changing “Dive” to “Thrive” requires replacing “D” with
“T” and then inserting “h” and “r” in the appropriate places, totaling three operations.
Therefore, the minimum edit distance between “Dive” and “Dye” is two, and between “Dive”
and “Thrive” it is three.
a. Using the same matrix, figure out whether “Dive” is closer to “Dye” or to
“Thrive” using minimum distance edit matrix D for each conversion. (4+4)
Changing "Dive" to "Dye" takes two operations, whereas changing "Dive" to "Thrive" takes
three, according to the minimal edit distance matrix. For a few seconds, "Dive" is more similar
to "Dye" than to "Thrive." This is because a smaller distance indicates a closer resemblance.
The change from "Dive" to "Dye" costs two, while the change from "Dive" to "Thrive" costs
three, according to the edit distance matrix. Because the terms are more similar when the cost
is smaller, "Dive" is more similar to "Dye."
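A minimal sketch of how the matrix D can be filled in, assuming unit costs for insertion, deletion, and substitution (the function name min_edit_distance is illustrative):
Code (sketch):
def min_edit_distance(src, tgt):
    # D[i][j] holds the edit distance between src[:i] and tgt[:j]
    n, m = len(src), len(tgt)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # i deletions to reach the empty string
    for j in range(1, m + 1):
        D[0][j] = j  # j insertions starting from the empty string
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution or match
    return D[n][m]

print(min_edit_distance("Dive", "Dye"))     # 2
print(min_edit_distance("Dive", "Thrive"))  # 3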
b. Augment the distance matrix to output an alignment; you will need to store
pointers to compute the backtrace. (5)
To augment the distance matrix, we store a pointer in each cell alongside its edit distance,
recording which operation (substitution, insertion, or deletion) produced the cell’s minimum
value. Following the pointers back from the bottom-right cell to the origin recovers the
optimal alignment. From “Dive” to “Dye”, the backtrace gives:
1. Align “D” with “D” (no cost)
2. Substitute “i” with “y”
3. Delete “v”
4. Align “e” with “e” (no cost)
D i v e
D y - e
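A minimal sketch of this augmentation, again assuming unit costs (the pointer labels and the helper name align are illustrative):
Code (sketch):
def align(src, tgt):
    n, m = len(src), len(tgt)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]  # backtrace pointers
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else 1
            choices = [(D[i - 1][j - 1] + sub, "sub"),
                       (D[i - 1][j] + 1, "del"),
                       (D[i][j - 1] + 1, "ins")]
            D[i][j], ptr[i][j] = min(choices)
    # follow the pointers back from (n, m) to recover the alignment
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        op = ptr[i][j]
        if op == "sub":
            top.append(src[i - 1]); bottom.append(tgt[j - 1]); i, j = i - 1, j - 1
        elif op == "del":
            top.append(src[i - 1]); bottom.append("-"); i -= 1
        else:  # "ins"
            top.append("-"); bottom.append(tgt[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

print(align("Dive", "Dye"))  # ('Dive', 'Dy-e')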
6. Use NLTK to remove all stopwords from a corpus using the nltk stopwords list. (3)
Code :
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')  # fetch the stopword list if not already present
nltk.download('punkt')      # tokenizer models

text = "This is an example sentence demonstrating how to remove stopwords using NLTK."
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w.lower() not in stop_words]
print("Original Tokens:", tokens)
print("Filtered Tokens (Without Stopwords):", filtered)

Output:
Original Tokens: ['This', 'is', 'an', 'example', 'sentence', 'demonstrating', 'how', 'to', 'remove', 'stopwords', 'using', 'NLTK', '.']
Filtered Tokens (Without Stopwords): ['example', 'sentence', 'demonstrating', 'remove', 'stopwords', 'using', 'NLTK', '.']
This method efficiently cleans a corpus by removing commonly used stopwords, which helps
improve text-processing tasks like sentiment analysis or topic modeling.
7.
a. What is the difference between lemmatization and stemming? Explain with an
example. (2)
b. Demonstrate examples of both techniques using NLTK library tools on a sentence
of your choice and show the results. (6)
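Briefly: stemming strips affixes with heuristic rules and can produce non-words, while lemmatization uses a vocabulary and morphological analysis to return the dictionary form (the lemma). A minimal NLTK sketch, with a sentence chosen for illustration:
Code (sketch):
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')

sentence = "The children were studying the leaves"
tokens = word_tokenize(sentence)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(t) for t in tokens])
# Porter stemming: 'studying' -> 'studi', 'leaves' -> 'leav' (non-words)
print([lemmatizer.lemmatize(t) for t in tokens])
# Lemmatization (default noun POS): 'children' -> 'child', 'leaves' -> 'leaf'
print(lemmatizer.lemmatize("studying", pos="v"))
# With a verb POS tag: 'studying' -> 'study'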
8. Given the following trigram log probabilities:
log P(Mary | <s> <s>) = –1
log P(likes | <s> Mary) = –8
log P(cats | Mary likes) = –58
log P(</s> | likes cats) = –2
Approximate the log probability of the following sentence: “<s> Mary likes cats </s>”,
with:
a) Bigrams
For the sentence “<s> Mary likes cats </s>” under a bigram model, the provided log
probabilities are:
log P(Mary | <s>) = –2
log P(likes | Mary) = –10
log P(cats | likes) = –74
log P(</s> | cats) = –1
The sum of these gives a log probability of –87. Since there are four bigram transitions, the
average log probability is –87/4 = –21.75. The perplexity is 2 raised to the power of the
negative average log probability:
Perplexity_bi = 2^21.75 ≈ 3.5 × 10^6
This extremely high number illustrates how poorly the bigram model predicts this text.
b) Trigrams
For the trigram model, the provided log probabilities are:
log P(Mary | <s> <s>) = –1
log P(likes | <s> Mary) = –8
log P(cats | Mary likes) = –58
log P(</s> | likes cats) = –2
These values add up to –69. With four transitions, the average log probability is –69/4 = –17.25,
and the perplexity is:
Perplexity_tri = 2^17.25 ≈ 1.6 × 10^5
The trigram model is better at predicting this text, since 2^17.25 is much lower than
2^21.75 (lower perplexity indicates better performance).
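The same arithmetic as a quick check, treating the given values as base-2 log probabilities:
Code (sketch):
def perplexity(log2_probs):
    # Perplexity = 2 to the power of the negative average log2 probability
    avg = sum(log2_probs) / len(log2_probs)
    return 2 ** (-avg)

bigram_logs = [-2, -10, -74, -1]   # one value per bigram transition
trigram_logs = [-1, -8, -58, -2]   # one value per trigram transition

print(perplexity(bigram_logs))   # 2^21.75, about 3.5 million
print(perplexity(trigram_logs))  # 2^17.25, about 156,000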
Find the perplexity of both n-gram models. Which one is the better model? What are the possible
difficulties of using the better model, and suggest some means to overcome them. (2+2+2+2+2+5)
Attempt either 6 or 7
9. Given the co-occurrence counts in Fig. 6.10 of your Jurafsky textbook (3rd ed.), calculate the
PPMI values of:
a. PPMI(digital,data)
b. PPMI(strawberry,pie)
(3+3)
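A minimal sketch of the PPMI computation; the counts passed in below are placeholders, not the actual Fig. 6.10 values, which should be read off the figure (count(w, c), the row total for w, the column total for c, and the grand total):
Code (sketch):
import math

def ppmi(joint, w_total, c_total, grand_total):
    # PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ); PPMI clips negatives to zero
    if joint == 0:
        return 0.0
    p_wc = joint / grand_total
    p_w = w_total / grand_total
    p_c = c_total / grand_total
    return max(0.0, math.log2(p_wc / (p_w * p_c)))

# Placeholder counts -- substitute the actual values from Fig. 6.10
print(ppmi(joint=100, w_total=500, c_total=800, grand_total=10000))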
10. Solve question 4.2 (page 80) from your Jurafsky textbook (3rd ed.). (6)
Working through the add-1 smoothed Naïve Bayes calculation (a sketch follows below), since
P(action | D) > P(comedy | D), the Naïve Bayes classifier predicts the document as:
Action
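A minimal sketch of the underlying computation; the five training reviews and the test document are reproduced here from exercise 4.2 as recalled and should be verified against the textbook:
Code (sketch):
from collections import Counter
import math

# Training data as recalled from exercise 4.2 -- verify against the textbook
train = [
    (["fun", "couple", "love", "love"], "comedy"),
    (["fast", "furious", "shoot"], "action"),
    (["couple", "fly", "fast", "fun", "fun"], "comedy"),
    (["furious", "shoot", "shoot", "fun"], "action"),
    (["fly", "fast", "shoot", "love"], "action"),
]
doc = ["fast", "couple", "shoot", "fly"]

vocab = {w for words, _ in train for w in words}

for c in ("comedy", "action"):
    class_docs = [words for words, label in train if label == c]
    counts = Counter(w for words in class_docs for w in words)
    total = sum(counts.values())
    # log prior plus add-1 smoothed log likelihood of each test word
    score = math.log(len(class_docs) / len(train))
    for w in doc:
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    print(c, score)
# action gets the higher score, matching the prediction above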
-------------END---------------