
Total Score: ____ / 55

Final Percent: ____

CSCI 4535 / 5535: Natural Language Processing


Mid-Term Exam
Feb 24, 2025

Name: __________Basit Hussain_________________________________


Student UNO ID: ________________

Statement of Academic Honesty


I attest to the following:
- This exam is closed-book, closed-note, closed-internet, etc., except for
  - 1 standard size cheat sheet (if required),
  - NLTK and spaCy websites.
- I have not used any other notes or web searches or AI apps during this exam.
- The answers on this exam are mine and mine alone.

Signature: ______Basit Hussain _____________________________________


Points (2)

1.
a. Find the minimum edit distance between the words “Dive” – “Dye” and “Dive” – “Thrive”. (2)

To change "Dive" to "Dye", we substitute "i" with "y" and delete "v", which takes two operations. To change "Dive" to "Thrive", we substitute "D" with "T" and insert "h" and "r", which takes three operations. Therefore, the minimum edit distance between "Dive" and "Dye" is two, and between "Dive" and "Thrive" it is three.

b. Using the same matrix, figure out whether “Dive” is closer to “Dye” or to “Thrive” using the minimum edit distance matrix D for each conversion. (4+4)

According to the minimum edit distance matrices, changing "Dive" to "Dye" costs two operations, whereas changing "Dive" to "Thrive" costs three. Since a smaller edit distance indicates a closer resemblance, "Dive" is closer to "Dye" than to "Thrive".
c. Augment the distance matrix to output an alignment; you will need to store pointers to compute the backtrace. (5)

To augment the distance matrix, we store a pointer in each cell alongside the edit cost, recording which operation (substitution/match, insertion, or deletion) produced that cell's minimum. Following the pointers back from the final cell gives the backtrace, which yields the optimal alignment. From "Dive" to "Dye", the backtrace gives: align "D" with "D", substitute "i" with "y", delete "v", and align "e" with "e":
D i v e
D y - e
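
As a sketch of how the distance matrix and backtrace pointers could be computed in Python (the function and variable names are illustrative, not prescribed by the exam):

def min_edit_distance(source, target, sub_cost=1):
    """Compute the minimum edit distance and a backtrace alignment.

    Insertions and deletions cost 1; substitutions cost sub_cost
    (sub_cost=1 matches the operation counts used above).
    """
    n, m = len(source), len(target)
    # D[i][j] = cost of converting source[:i] into target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]  # backtrace pointers

    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "ins"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            candidates = [
                (D[i - 1][j - 1] + cost, "sub"),  # substitution or match
                (D[i - 1][j] + 1, "del"),         # delete from source
                (D[i][j - 1] + 1, "ins"),         # insert into source
            ]
            D[i][j], ptr[i][j] = min(candidates)

    # Follow the pointers back from the bottom-right cell to build the alignment.
    i, j, src_line, tgt_line = n, m, [], []
    while i > 0 or j > 0:
        op = ptr[i][j]
        if op == "sub":
            src_line.append(source[i - 1]); tgt_line.append(target[j - 1])
            i, j = i - 1, j - 1
        elif op == "del":
            src_line.append(source[i - 1]); tgt_line.append("-")
            i -= 1
        else:  # "ins"
            src_line.append("-"); tgt_line.append(target[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(src_line)), "".join(reversed(tgt_line))

print(min_edit_distance("Dive", "Dye"))     # (2, 'Dive', 'Dy-e')
print(min_edit_distance("Dive", "Thrive"))  # distance 3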

2. Use regular expressions to find:

a. all strings that start at the beginning of the line with an integer and that end at the end of the line with a word,

To match strings that begin with an integer at the start of the line and end with a word at the end of the line, we can use the regular expression:
^[0-9]+.*\b[A-Za-z]+\b$
Here, ^ anchors the match at the beginning of the line, and [0-9]+ matches one or more digits, ensuring the string begins with an integer. .* matches any characters in between, \b[A-Za-z]+\b requires the final token to consist only of alphabetic letters, and $ marks the end of the line.
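
A quick check of this pattern using Python's re module; the sample lines are illustrative, not taken from the exam:

import re

lines = [
    "42 apples were sold to the customer",   # starts with an integer, ends with a word -> match
    "price: 42 dollars",                     # does not start with an integer -> no match
    "7 red balloons",                        # match
]

pattern = re.compile(r"^[0-9]+.*\b[A-Za-z]+\b$")
for line in lines:
    if pattern.search(line):
        print("match:", line)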

b. all urls in a text. (4+2)

A straightforward way to find URLs in text is the regex:
https?://[^\s]+
Here, https?:// matches "http://" or "https://", and [^\s]+ matches everything that follows the protocol up to the next whitespace, so the whole URL is captured.

Example matches:
https://example.com
http://www.test.org/path?query=1

Both examples have only legitimate characters in the domain and a legitimate top-level domain (TLD) such as .com or .org, and the pattern also captures paths and query arguments.
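
A similar check for the URL pattern; the sample text is illustrative:

import re

text = "Visit https://example.com for docs or http://www.test.org/path?query=1 for tests."

url_pattern = re.compile(r"https?://[^\s]+")
print(url_pattern.findall(text))
# ['https://example.com', 'http://www.test.org/path?query=1']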

Note: By “word”, we mean an alphabetic string separated from other words by whitespace, any relevant punctuation, line breaks, and so forth.

3. Use NLTK to remove all stopwords from a corpus using the NLTK stopwords list. (3)
Code:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Ensure all the necessary NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')

# Example corpus text
text = "This is an example sentence demonstrating how to remove stopwords using NLTK."

# Tokenize the text into words
tokens = word_tokenize(text)

# Get the English stopwords list from NLTK
stop_words = set(stopwords.words('english'))

# Filter out stopwords
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Print the results
print("Original Tokens:", tokens)
print("Filtered Tokens (Without Stopwords):", filtered_tokens)

Tokenization: The text is split into individual words using word_tokenize().
Stopword Removal: We filter out words found in the predefined English stopwords list from NLTK.
Case Insensitivity: Converting words to lowercase ensures proper matching.
Final Output: The result includes only non-stopword tokens.

The output will be:

Original Tokens: ['This', 'is', 'an', 'example', 'sentence', 'demonstrating', 'how', 'to', 'remove',
'stopwords', 'using', 'NLTK', '.']
Filtered Tokens (Without Stopwords): ['example', 'sentence', 'demonstrating', 'remove',
'stopwords', 'using', 'NLTK', '.']

This method efficiently cleans a corpus by removing commonly used stopwords, which helps improve text processing tasks such as sentiment analysis or topic modeling.

4.
a. What is the difference between lemmatization and stemming? Explain with an
example. (2)
b. Demonstrate examples of both techniques using NLTK library tools on a sentence
of your choice and show the results. (6)
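
A minimal sketch of how part (b) could be demonstrated with NLTK; the example sentence is arbitrary:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')

sentence = "The children were running faster than the mice in the studies."
tokens = word_tokenize(sentence)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops off affixes heuristically; lemmatization maps words to dictionary forms
# (WordNetLemmatizer treats words as nouns unless a part of speech is supplied).
print("Stems:  ", [stemmer.stem(t) for t in tokens])
print("Lemmas: ", [lemmatizer.lemmatize(t) for t in tokens])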

5. Given the log of the conditional probabilities:


log(P(Mary|<s>))=-2
log(P(likes|Mary))=-10
log(P(cats|likes))=-74
log(P(</s>|cats))=-1

log(P(Mary|<s><s>))=-1
log(P(likes|Mary<s>))=-8
log(P(cats|Mary likes))=-58
log(P(</s>|cats likes))=-2

Approximate the log probability of the following sentence: “<s> Mary likes cats </s>”,
with:

a) Bigrams
For the sentence “<s> Mary likes cats </s>” under a bigram model, we have the following
provided log probabilities:
log P(Mary | <s>) = –2
log P(likes | Mary) = –10
log P(cats | likes) = –74
log P(</s> | cats) = –1
Summing these gives a total log probability of -87. Since there are four bigram transitions, the average log probability is -87/4 = -21.75. Assuming the logs are base 2, perplexity is 2 raised to the negative average log probability:

Perplexity_bi = 2^(-(-21.75)) = 2^21.75

This extremely high number illustrates how poorly the bigram model predicts this sentence.

b) Trigrams
For the trigram model the provided probabilities are:
log P(Mary | <s><s>) = –1
log P(likes | Mary <s>) = –8
log P(cats | Mary likes) = –58
log P(</s> | cats likes) = –2

These values add up to -69. With four transitions, the average log probability is -69/4 = -17.25, and the perplexity is:

Perplexity_tri = 2^(-(-17.25)) = 2^17.25

The trigram model is the better model for predicting this text, since 2^17.25 is much lower than 2^21.75 (lower perplexity indicates better performance).
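
A small Python check of the arithmetic above, assuming the given log probabilities are base 2:

# Given log2 probabilities for "<s> Mary likes cats </s>"
bigram_logs = [-2, -10, -74, -1]
trigram_logs = [-1, -8, -58, -2]

def perplexity(log_probs):
    # Perplexity = 2 ** (negative average log2 probability per transition)
    avg = sum(log_probs) / len(log_probs)
    return 2 ** (-avg)

print(perplexity(bigram_logs))   # 2**21.75, roughly 3.5 million
print(perplexity(trigram_logs))  # 2**17.25, roughly 156 thousand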

Find the perplexity of both the n-grams. Which one is the better model? What are the possible
difficulties of using the better model and suggest some means to overcome it. (2+2+2+2+2+5)

Challenges and Trade-Offs :


Even though the trigram model performs better (as evidenced by its lower perplexity), it has major drawbacks:
1. Data Sparsity: There are far more possible trigram combinations than bigram combinations, so many trigrams appear rarely or never in the training data, which can result in inaccurate probability estimates or even zero counts.
2. Computational Overhead and Memory: Because a trigram model has many more parameters than a bigram model, it requires significantly more memory and computing capacity to store and process.
3. Overfitting: A trigram model trained on insufficient data may overfit the training corpus, capturing noise rather than general linguistic patterns.
Solutions:
Several strategies can be used to overcome these problems (a brief sketch of smoothing and interpolation appears after the conclusion below):
1. Smoothing Techniques: Techniques such as Laplace smoothing, Good-Turing discounting, or Kneser-Ney smoothing adjust the raw counts so that even unseen trigrams receive a small, nonzero probability.
2. Backoff and Interpolation: These techniques let the model combine probabilities from different n-gram orders, or "back off" to bigram or unigram probabilities when a trigram is unseen, making the model more resilient to sparse data.
3. Larger Training Corpora: Expanding the training data lessens the sparsity issue and improves coverage of trigram combinations.
In conclusion, data sparsity, computational expense, and possible overfitting may restrict the usefulness of the trigram model, even though it offers a lower perplexity and, thus, better prediction performance on the given text. These challenges can be mitigated by enlarging the training corpus or by using smoothing and backoff techniques.
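
A minimal sketch of Laplace smoothing combined with simple linear interpolation over a toy corpus; the counts and lambda weights below are illustrative, not tuned:

from collections import Counter

tokens = "<s> <s> Mary likes cats </s>".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
V = len(set(tokens))

def p_laplace_trigram(w1, w2, w3):
    # Add-one smoothed trigram probability P(w3 | w1 w2)
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + V)

def p_interpolated(w1, w2, w3, lambdas=(0.6, 0.3, 0.1)):
    # Linear interpolation of trigram, bigram, and unigram estimates
    l3, l2, l1 = lambdas
    p_tri = p_laplace_trigram(w1, w2, w3)
    p_bi = (bigrams[(w2, w3)] + 1) / (unigrams[w2] + V)
    p_uni = (unigrams[w3] + 1) / (len(tokens) + V)
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# Even an unseen trigram such as ("Mary", "likes", "dogs") receives a nonzero probability.
print(p_interpolated("Mary", "likes", "cats"))
print(p_interpolated("Mary", "likes", "dogs"))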

Attempt either 6 or 7

6. Given the co-occurrence counts in Fig 6.10 of your Jurafsky textbook (3rd ed.), calculate the PPMI values of:

a. PPMI(digital, data)
b. PPMI(strawberry, pie)
(3+3)
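
A minimal sketch of how PPMI could be computed from a co-occurrence matrix; the counts below are placeholders, not the actual values from Fig 6.10:

import math

# Placeholder co-occurrence counts (rows = target words, columns = context words).
# Replace these with the actual counts from Fig 6.10 before reporting the answers.
counts = {
    "digital":    {"computer": 4, "data": 6, "pie": 0, "sugar": 0},
    "strawberry": {"computer": 0, "data": 0, "pie": 3, "sugar": 2},
}

total = sum(c for row in counts.values() for c in row.values())

def ppmi(word, context):
    # PPMI(w, c) = max(log2(P(w, c) / (P(w) * P(c))), 0)
    p_wc = counts[word][context] / total
    if p_wc == 0:
        return 0.0
    p_w = sum(counts[word].values()) / total
    p_c = sum(row.get(context, 0) for row in counts.values()) / total
    return max(math.log2(p_wc / (p_w * p_c)), 0.0)

print(ppmi("digital", "data"))
print(ppmi("strawberry", "pie"))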

7. Solve question 4.2 (page 80) from your Jurafsky textbook (3rd ed.). (6)

Movie Reviews with Genres


Movie Review                  Genre
fun, couple, love, love       comedy
fast, furious, shoot          action
couple, fly, fast, fun, fun   comedy
furious, shoot, shoot, fun    action
fly, fast, shoot, love        action

Calculation on page (a sketch of the computation also follows below):
Since P(action | D) > P(comedy | D), the Naïve Bayes classifier predicts the document as:
Action
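
A minimal sketch of the add-one-smoothed Naïve Bayes computation; it assumes the test document in exercise 4.2 is D = "fast, couple, shoot, fly":

from collections import Counter
import math

# Training reviews from the table above
train = [
    ("comedy", "fun couple love love".split()),
    ("action", "fast furious shoot".split()),
    ("comedy", "couple fly fast fun fun".split()),
    ("action", "furious shoot shoot fun".split()),
    ("action", "fly fast shoot love".split()),
]
test_doc = "fast couple shoot fly".split()  # assumed test document D

vocab = {w for _, words in train for w in words}

def log_posterior(c):
    class_docs = [words for cls, words in train if cls == c]
    prior = len(class_docs) / len(train)
    counts = Counter(w for words in class_docs for w in words)
    total = sum(counts.values())
    # log prior plus add-one (Laplace) smoothed log likelihoods
    score = math.log(prior)
    for w in test_doc:
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

scores = {c: log_posterior(c) for c in ("comedy", "action")}
print(scores)                       # action scores higher than comedy
print(max(scores, key=scores.get))  # -> 'action'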

-------------END---------------
