
Total Score: ____ / 55

Final Percent: ____

CSCI 4535 / 5535: Natural Language Processing


Mid-Term Exam
Feb 24, 2025

Name: __________Basit Hussain_________________________________


Student UNO ID: ________________

Statement of Academic Honesty


I attest to the following:
- This exam is closed-book, closed-note, closed-internet, etc., except for
  - 1 standard size cheat sheet (if required),
  - NLTK and spaCy websites.
- I have not used any other notes or web searches or AI apps during this exam.
- The answers on this exam are mine and mine alone.

Signature: ______Basit Hussain _____________________________________


Points (2)

1.
a. Find the minimum edit distance between the words “Dive” – “Dye” and “Dive” – “Thrive”. (2)

To change "Dive" to "Dye", we substitute "i" with "y" and delete "v", which takes two operations. To change "Dive" to "Thrive", we substitute "D" with "T" and insert "h" and "r", which takes three operations. Therefore, the minimum edit distance between "Dive" and "Dye" is two, and between "Dive" and "Thrive" it is three.

b. Using the same matrix, figure out whether “Dive” is closer to “Dye” or to “Thrive” using the minimum edit distance matrix D for each conversion. (4+4)

According to the minimum edit distance matrices, changing "Dive" to "Dye" costs two operations, whereas changing "Dive" to "Thrive" costs three. Since a smaller edit distance indicates a closer resemblance, "Dive" is closer to "Dye" than to "Thrive".
c. Augment the distance matrix to output an alignment; you will need to store pointers to compute the backtrace. (5)

To augment the distance matrix, we store a pointer in each cell alongside the edit cost, recording which operation (substitution/match, insertion, or deletion) produced that cell's minimum. Following the pointers back from the final cell gives the backtrace, which yields the optimal alignment. From "Dive" to "Dye", the backtrace gives: align "D" with "D", substitute "i" with "y", delete "v", and align "e" with "e":
D i v e
D y - e
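
As a sketch of how the distance matrix and backtrace pointers could be computed in Python (the function and variable names are illustrative, not prescribed by the exam):

def min_edit_distance(source, target, sub_cost=1):
    """Compute the minimum edit distance and a backtrace alignment.

    Insertions and deletions cost 1; substitutions cost sub_cost
    (sub_cost=1 matches the operation counts used above).
    """
    n, m = len(source), len(target)
    # D[i][j] = cost of converting source[:i] into target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]  # backtrace pointers

    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "ins"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            candidates = [
                (D[i - 1][j - 1] + cost, "sub"),  # substitution or match
                (D[i - 1][j] + 1, "del"),         # delete from source
                (D[i][j - 1] + 1, "ins"),         # insert into source
            ]
            D[i][j], ptr[i][j] = min(candidates)

    # Follow the pointers back from the bottom-right cell to build the alignment.
    i, j, src_line, tgt_line = n, m, [], []
    while i > 0 or j > 0:
        op = ptr[i][j]
        if op == "sub":
            src_line.append(source[i - 1]); tgt_line.append(target[j - 1])
            i, j = i - 1, j - 1
        elif op == "del":
            src_line.append(source[i - 1]); tgt_line.append("-")
            i -= 1
        else:  # "ins"
            src_line.append("-"); tgt_line.append(target[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(src_line)), "".join(reversed(tgt_line))

print(min_edit_distance("Dive", "Dye"))     # (2, 'Dive', 'Dy-e')
print(min_edit_distance("Dive", "Thrive"))  # distance 3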

2. Use regular expressions to find:

a. all strings that start at the beginning of the line with an integer and that end at the end of the line with a word,

To match strings that begin with an integer at the start of the line and end with a word at the end of the line, we can use the regular expression:
^[0-9]+.*\b[A-Za-z]+\b$
Here, ^ anchors the match at the beginning of the line, and [0-9]+ matches one or more digits, ensuring the string begins with an integer. .* matches any characters in between, \b[A-Za-z]+\b requires the final token to consist only of alphabetic letters, and $ marks the end of the line.
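
A quick check of this pattern using Python's re module; the sample lines are illustrative, not taken from the exam:

import re

lines = [
    "42 apples were sold to the customer",   # starts with an integer, ends with a word -> match
    "price: 42 dollars",                     # does not start with an integer -> no match
    "7 red balloons",                        # match
]

pattern = re.compile(r"^[0-9]+.*\b[A-Za-z]+\b$")
for line in lines:
    if pattern.search(line):
        print("match:", line)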

b. all urls in a text. (4+2)

A straightforward way to find URLs in text is the regex:
https?://[^\s]+
Here, https?:// matches "http://" or "https://", and [^\s]+ matches everything that follows the protocol up to the next whitespace, so the whole URL is captured.

Example matches:
https://example.com
http://www.test.org/path?query=1

Both examples have only legitimate characters in the domain and a legitimate top-level domain (TLD) such as .com or .org, and the pattern also captures paths and query arguments.
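
A similar check for the URL pattern; the sample text is illustrative:

import re

text = "Visit https://example.com for docs or http://www.test.org/path?query=1 for tests."

url_pattern = re.compile(r"https?://[^\s]+")
print(url_pattern.findall(text))
# ['https://example.com', 'http://www.test.org/path?query=1']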

Note: By “word”, we mean an alphabetic string separated from other words by whitespace, any relevant punctuation, line breaks, and so forth.

3. Use NLTK to remove all stopwords from a corpus using the NLTK stopwords list. (3)
Code:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Ensure all the necessary NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')

# Example corpus text
text = "This is an example sentence demonstrating how to remove stopwords using NLTK."

# Tokenize the text into words
tokens = word_tokenize(text)

# Get the English stopwords list from NLTK
stop_words = set(stopwords.words('english'))

# Filter out stopwords
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Print the results
print("Original Tokens:", tokens)
print("Filtered Tokens (Without Stopwords):", filtered_tokens)

Tokenization: The text is split into individual words using word_tokenize().
Stopword Removal: We filter out words found in the predefined English stopwords list from NLTK.
Case Insensitivity: Converting words to lowercase ensures proper matching.
Final Output: The result includes only non-stopword tokens.

The output will be:

Original Tokens: ['This', 'is', 'an', 'example', 'sentence', 'demonstrating', 'how', 'to', 'remove',
'stopwords', 'using', 'NLTK', '.']
Filtered Tokens (Without Stopwords): ['example', 'sentence', 'demonstrating', 'remove',
'stopwords', 'using', 'NLTK', '.']

This method efficiently cleans a corpus by removing commonly used stopwords, which helps improve text processing tasks such as sentiment analysis or topic modeling.

4.
a. What is the difference between lemmatization and stemming? Explain with an
example. (2)
b. Demonstrate examples of both techniques using NLTK library tools on a sentence
of your choice and show the results. (6)
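
A minimal sketch of how part (b) could be demonstrated with NLTK; the example sentence is arbitrary:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')

sentence = "The children were running faster than the mice in the studies."
tokens = word_tokenize(sentence)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops off affixes heuristically; lemmatization maps words to dictionary forms
# (WordNetLemmatizer treats words as nouns unless a part of speech is supplied).
print("Stems:  ", [stemmer.stem(t) for t in tokens])
print("Lemmas: ", [lemmatizer.lemmatize(t) for t in tokens])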

5. Given the log of the conditional probabilities:


log(P(Mary|<s>))=-2
log(P(likes|Mary))=-10
log(P(cats|likes))=-74
log(P(</s>|cats))=-1

log(P(Mary|<s><s>))=-1
log(P(likes|Mary<s>))=-8
log(P(cats|Mary likes))=-58
log(P(</s>|cats likes))=-2

Approximate the log probability of the following sentence: “<s> Mary likes cats </s>”,
with:

a) Bigrams
For the sentence “<s> Mary likes cats </s>” under a bigram model, we have the following
provided log probabilities:
log P(Mary | <s>) = –2
log P(likes | Mary) = –10
log P(cats | likes) = –74
log P(</s> | cats) = –1
Summing these gives a total log probability of -87. Since there are four bigram transitions, the average log probability is -87/4 = -21.75. Assuming the logs are base 2, perplexity is 2 raised to the negative average log probability:

Perplexity_bi = 2^(-(-21.75)) = 2^21.75

This extremely high number illustrates how poorly the bigram model predicts this sentence.

b) Trigrams
For the trigram model the provided probabilities are:
log P(Mary | <s><s>) = –1
log P(likes | Mary <s>) = –8
log P(cats | Mary likes) = –58
log P(</s> | cats likes) = –2

These values add up to -69. With four transitions, the average log probability is -69/4 = -17.25, and the perplexity is:

Perplexity_tri = 2^(-(-17.25)) = 2^17.25

The trigram model is the better model for predicting this text, since 2^17.25 is much lower than 2^21.75 (lower perplexity indicates better performance).
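
A small Python check of the arithmetic above, assuming the given log probabilities are base 2:

# Given log2 probabilities for "<s> Mary likes cats </s>"
bigram_logs = [-2, -10, -74, -1]
trigram_logs = [-1, -8, -58, -2]

def perplexity(log_probs):
    # Perplexity = 2 ** (negative average log2 probability per transition)
    avg = sum(log_probs) / len(log_probs)
    return 2 ** (-avg)

print(perplexity(bigram_logs))   # 2**21.75, roughly 3.5 million
print(perplexity(trigram_logs))  # 2**17.25, roughly 156 thousand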

Find the perplexity of both the n-grams. Which one is the better model? What are the possible
difficulties of using the better model and suggest some means to overcome it. (2+2+2+2+2+5)

Challenges and Trade-Offs :


Even though the trigram model performs better (as evidenced by its lower perplexity), it has major drawbacks:
1. Data Sparsity: There are far more possible trigram combinations than bigram combinations, so many trigrams appear rarely or never in the training data, which can result in inaccurate probability estimates or even zero counts.
2. Computational Overhead and Memory: Because a trigram model has many more parameters than a bigram model, it requires significantly more memory and computing capacity to store and process.
3. Overfitting: A trigram model trained on insufficient data may overfit the training corpus, capturing noise rather than general linguistic patterns.
Solutions:
Several strategies can be used to overcome these problems (a brief sketch of smoothing and interpolation appears after the conclusion below):
1. Smoothing Techniques: Techniques such as Laplace smoothing, Good-Turing discounting, or Kneser-Ney smoothing adjust the raw counts so that even unseen trigrams receive a small, nonzero probability.
2. Backoff and Interpolation: These techniques let the model combine probabilities from different n-gram orders, or "back off" to bigram or unigram probabilities when a trigram is unseen, making the model more resilient to sparse data.
3. Larger Training Corpora: Expanding the training data lessens the sparsity issue and improves coverage of trigram combinations.
In conclusion, data sparsity, computational expense, and possible overfitting may restrict the usefulness of the trigram model, even though it offers a lower perplexity and, thus, better prediction performance on the given text. These challenges can be mitigated by enlarging the training corpus or by using smoothing and backoff techniques.
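
A minimal sketch of Laplace smoothing combined with simple linear interpolation over a toy corpus; the counts and lambda weights below are illustrative, not tuned:

from collections import Counter

tokens = "<s> <s> Mary likes cats </s>".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
V = len(set(tokens))

def p_laplace_trigram(w1, w2, w3):
    # Add-one smoothed trigram probability P(w3 | w1 w2)
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + V)

def p_interpolated(w1, w2, w3, lambdas=(0.6, 0.3, 0.1)):
    # Linear interpolation of trigram, bigram, and unigram estimates
    l3, l2, l1 = lambdas
    p_tri = p_laplace_trigram(w1, w2, w3)
    p_bi = (bigrams[(w2, w3)] + 1) / (unigrams[w2] + V)
    p_uni = (unigrams[w3] + 1) / (len(tokens) + V)
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# Even an unseen trigram such as ("Mary", "likes", "dogs") receives a nonzero probability.
print(p_interpolated("Mary", "likes", "cats"))
print(p_interpolated("Mary", "likes", "dogs"))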

Attempt either 6 or 7

6. Given the co-occurrence counts in Fig 6.10 of your Jurafsky textbook (3rd ed.), calculate the PPMI values of:

a. PPMI(digital, data)
b. PPMI(strawberry, pie)
(3+3)
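
A minimal sketch of how PPMI could be computed from a co-occurrence matrix; the counts below are placeholders, not the actual values from Fig 6.10:

import math

# Placeholder co-occurrence counts (rows = target words, columns = context words).
# Replace these with the actual counts from Fig 6.10 before reporting the answers.
counts = {
    "digital":    {"computer": 4, "data": 6, "pie": 0, "sugar": 0},
    "strawberry": {"computer": 0, "data": 0, "pie": 3, "sugar": 2},
}

total = sum(c for row in counts.values() for c in row.values())

def ppmi(word, context):
    # PPMI(w, c) = max(log2(P(w, c) / (P(w) * P(c))), 0)
    p_wc = counts[word][context] / total
    if p_wc == 0:
        return 0.0
    p_w = sum(counts[word].values()) / total
    p_c = sum(row.get(context, 0) for row in counts.values()) / total
    return max(math.log2(p_wc / (p_w * p_c)), 0.0)

print(ppmi("digital", "data"))
print(ppmi("strawberry", "pie"))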

7. Solve question 4.2 (page 80) from your Jurafsky textbook (3rd ed.). (6)

Movie Reviews with Genres


Movie Review                  Genre
fun, couple, love, love       comedy
fast, furious, shoot          action
couple, fly, fast, fun, fun   comedy
furious, shoot, shoot, fun    action
fly, fast, shoot, love        action

Calculation on page (a sketch of the computation also follows below):
Since P(action | D) > P(comedy | D), the Naïve Bayes classifier predicts the document as:
Action
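
A minimal sketch of the add-one-smoothed Naïve Bayes computation; it assumes the test document in exercise 4.2 is D = "fast, couple, shoot, fly":

from collections import Counter
import math

# Training reviews from the table above
train = [
    ("comedy", "fun couple love love".split()),
    ("action", "fast furious shoot".split()),
    ("comedy", "couple fly fast fun fun".split()),
    ("action", "furious shoot shoot fun".split()),
    ("action", "fly fast shoot love".split()),
]
test_doc = "fast couple shoot fly".split()  # assumed test document D

vocab = {w for _, words in train for w in words}

def log_posterior(c):
    class_docs = [words for cls, words in train if cls == c]
    prior = len(class_docs) / len(train)
    counts = Counter(w for words in class_docs for w in words)
    total = sum(counts.values())
    # log prior plus add-one (Laplace) smoothed log likelihoods
    score = math.log(prior)
    for w in test_doc:
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

scores = {c: log_posterior(c) for c in ("comedy", "action")}
print(scores)                       # action scores higher than comedy
print(max(scores, key=scores.get))  # -> 'action'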

-------------END---------------
