NLP lab Manual (3)

The document provides various Python programming tasks related to natural language processing and regular expressions. It includes examples of tokenizing sentences and words, generating sentences using a probabilistic context-free grammar (PCFG), building a trigram model for word prediction, extracting US phone numbers, USNs, and email addresses using regular expressions, and understanding cost parameters in an edit distance function. Each task is accompanied by code snippets and explanations.

1.) Write a program to tokenize the following text into sentences and words: "Hello everyone. Welcome to NITTE (Deemed to be University) NMAMIT. I AM studying the NLP Elective." Use at least 3 different methods to perform the same.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize

nltk.download('punkt_tab')

# Given text to tokenize
text = "Hello everyone. Welcome to NITTE (Deemed to be University) NMAMIT. I AM studying the NLP Elective."

# Method 1: Splitting the paragraph into sentences using sent_tokenize
print("\nMethod 1: Splitting sentences in the paragraph")
print(text)
print(sent_tokenize(text))

# Method 2: Splitting the text into words using word_tokenize
print("\nMethod 2: Splitting words in the sentence")
print(word_tokenize(text))

# Method 3: Tokenizing words using a regular expression with regexp_tokenize
print("\nMethod 3: Tokenizing words using regular expression")
print(regexp_tokenize(text, r"[\w]+"))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.

Method 1: Splitting sentences in the paragraph
Hello everyone. Welcome to NITTE (Deemed to be University) NMAMIT. I AM studying the NLP Elective.
['Hello everyone.', 'Welcome to NITTE (Deemed to be University) NMAMIT.', 'I AM studying the NLP Elective.']

Method 2: Splitting words in the sentence
['Hello', 'everyone', '.', 'Welcome', 'to', 'NITTE', '(', 'Deemed', 'to', 'be', 'University', ')', 'NMAMIT', '.', 'I', 'AM', 'studying', 'the', 'NLP', 'Elective', '.']

Method 3: Tokenizing words using regular expression
['Hello', 'everyone', 'Welcome', 'to', 'NITTE', 'Deemed', 'to', 'be', 'University', 'NMAMIT', 'I', 'AM', 'studying', 'the', 'NLP', 'Elective']
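
As an additional illustration (not part of the original lab listing), a fourth tokenization method is sketched below using NLTK's TreebankWordTokenizer, with plain str.split shown for comparison; the choice of tokenizer mainly affects how punctuation and parentheses are split off from words.

from nltk.tokenize import TreebankWordTokenizer

text = "Hello everyone. Welcome to NITTE (Deemed to be University) NMAMIT. I AM studying the NLP Elective."

# Extra sketch: Treebank-style word tokenization (splits punctuation much like word_tokenize)
print(TreebankWordTokenizer().tokenize(text))

# Naive whitespace splitting, for comparison (punctuation stays attached to words)
print(text.split())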

2.) How does the recursive generate function use a PCFG defined in a Python dictionary to select weighted production rules and expand the starting symbol 'S' into a complete sentence?
import random

# Define a simple PCFG grammar.
# Each key is a non-terminal symbol with a list of tuples.
# Each tuple contains a production rule (as a list of symbols) and its probability.
grammar = {
    "S": [(["NP", "VP"], 1.0)],   # Sentence -> Noun Phrase + Verb Phrase
    "NP": [
        (["Det", "N"], 0.8),      # Noun Phrase -> Determiner + Noun
        (["Name"], 0.2)           # Noun Phrase -> Proper Name
    ],
    "VP": [
        (["V", "NP"], 0.5),       # Verb Phrase -> Verb + Noun Phrase
        (["V"], 0.5)              # Verb Phrase -> Verb
    ],
    "Det": [
        (["the"], 0.5),
        (["a"], 0.5)
    ],
    "N": [
        (["cat"], 0.5),
        (["dog"], 0.5)
    ],
    "Name": [
        (["Alice"], 1.0)
    ],
    "V": [
        (["sees"], 1.0)
    ]
}

def generate(symbol):
    """
    Recursively generates a sentence fragment from the given symbol
    using the PCFG grammar.

    Parameters:
        symbol (str): The non-terminal or terminal symbol to expand.

    Returns:
        str: The generated string from the grammar.
    """
    # If the symbol is not in the grammar, it's assumed to be a terminal.
    if symbol not in grammar:
        return symbol

    productions = grammar[symbol]
    # Unzip the production rules and their corresponding weights.
    rules, weights = zip(*productions)

    # Choose one production rule based on the probabilities.
    chosen_rule = random.choices(rules, weights=weights, k=1)[0]

    # Debug log: show the chosen production rule for the current non-terminal.
    print(f"Expanding '{symbol}' using rule: {chosen_rule}")

    # Recursively generate the string for each symbol in the chosen rule.
    result = [generate(sym) for sym in chosen_rule]
    return " ".join(result)

# Generate a sentence starting from the initial symbol 'S'
sentence = generate("S")
print("\nGenerated Sentence:", sentence)

Expanding 'S' using rule: ['NP', 'VP']
Expanding 'NP' using rule: ['Name']
Expanding 'Name' using rule: ['Alice']
Expanding 'VP' using rule: ['V', 'NP']
Expanding 'V' using rule: ['sees']
Expanding 'NP' using rule: ['Det', 'N']
Expanding 'Det' using rule: ['a']
Expanding 'N' using rule: ['dog']

Generated Sentence: Alice sees a dog
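
The same recursion can also report how probable the sampled derivation is. The variant below is an illustrative extension of the listing above, not part of the original program; generate_with_prob is a hypothetical helper that multiplies the probabilities of the rules chosen during expansion, assuming the grammar dictionary defined earlier is in scope.

import random

def generate_with_prob(symbol):
    """Expand `symbol` like generate(), but also return the derivation probability."""
    # Terminals contribute probability 1.0.
    if symbol not in grammar:
        return symbol, 1.0

    rules, weights = zip(*grammar[symbol])
    chosen_rule = random.choices(rules, weights=weights, k=1)[0]
    # Recover the probability of the rule that was just chosen.
    prob = weights[rules.index(chosen_rule)]

    parts = []
    for sym in chosen_rule:
        text, p = generate_with_prob(sym)
        parts.append(text)
        prob *= p
    return " ".join(parts), prob

sentence, prob = generate_with_prob("S")
print(f"'{sentence}' was generated with derivation probability {prob:.3f}")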

3.) Build a trigram model using the Reuters corpus to predict the next word based on the two preceding words.
# Import necessary libraries
import nltk
from nltk import bigrams, trigrams
from nltk.corpus import reuters
from collections import defaultdict

# Download necessary NLTK resources
nltk.download('reuters')
nltk.download('punkt')
nltk.download('punkt_tab')

# Tokenize the text
words = nltk.word_tokenize(' '.join(reuters.words()))

# Create trigrams
tri_grams = list(trigrams(words))

# Build a trigram model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

# Transform the counts into probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

# Function to predict the next word
def predict_next_word(w1, w2):
    """
    Predicts the next word based on the previous two words using the
    trained trigram model.

    Args:
        w1 (str): The first word.
        w2 (str): The second word.

    Returns:
        str: The predicted next word.
    """
    next_word = model[w1, w2]
    if next_word:
        # Choose the most likely next word
        predicted_word = max(next_word, key=next_word.get)
        return predicted_word
    else:
        return "No prediction available"

# Example usage
print("Next Word:", predict_next_word('the', 'stock'))
[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!

Next Word: of
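
Because the inner dictionaries already hold probabilities, the model can also be sampled instead of always returning the most likely continuation. The sketch below is an illustrative addition, not part of the original program; sample_next_word is a hypothetical helper that assumes the trigram model built above is in scope.

import random

def sample_next_word(w1, w2):
    """Sample the next word from the trigram distribution P(w3 | w1, w2)."""
    dist = model[(w1, w2)]
    if not dist:
        return "No prediction available"
    candidates, probs = zip(*dist.items())
    # Weighted random choice over the candidate continuations.
    return random.choices(candidates, weights=probs, k=1)[0]

# Unlike predict_next_word, this can return different continuations on different runs.
print("Sampled next word:", sample_next_word('the', 'stock'))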

4.) Using Python's re module, extract US phone numbers, USNs (format LLLNNLLDDD), and email addresses from the given string:
text_to_search = "Reach us at 800-555-1212 or [email protected]. Student ID: NNM21EC099."

import re  # Import the regular expression module

# 1. The concise text we want to search within
text_to_search = "Reach us at 800-555-1212 or [email protected]. Student ID: NNM21EC099."

# 2. Define the regular expression patterns

# Phone Number Pattern
phone_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"

# USN Number Pattern (LLLNNLLDDD)
usn_pattern = r"[A-Z]{3}\d{2}[A-Z]{2}\d{3}"

# Email Address Pattern
email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# 3. Find all matches for each pattern
found_phones = re.findall(phone_pattern, text_to_search)
found_usns = re.findall(usn_pattern, text_to_search)
found_emails = re.findall(email_pattern, text_to_search)

# 4. Print the results
print("--- Original Text ---")
print(text_to_search)
print("-" * 20)  # Separator

print(f"\n--- Found Phone Numbers (Pattern: {phone_pattern}) ---")
if found_phones:
    for phone in found_phones:
        print(f"- {phone}")
else:
    print("No phone numbers found matching the pattern.")

print(f"\n--- Found USN Numbers (Pattern: {usn_pattern}) ---")
if found_usns:
    for usn in found_usns:
        print(f"- {usn}")
else:
    print("No USN numbers found matching the pattern.")

print(f"\n--- Found Email Addresses (Pattern: {email_pattern}) ---")
if found_emails:
    for email in found_emails:
        print(f"- {email}")
else:
    print("No email addresses found matching the pattern.")

--- Original Text ---
Reach us at 800-555-1212 or [email protected]. Student ID: NNM21EC099.
--------------------

--- Found Phone Numbers (Pattern: \(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}) ---
- 800-555-1212

--- Found USN Numbers (Pattern: [A-Z]{3}\d{2}[A-Z]{2}\d{3}) ---
- NNM21EC099

--- Found Email Addresses (Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}) ---
- [email protected]
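
When all three entity types should be extracted in a single pass, the patterns can be combined into one alternation with named groups. This is an illustrative variant of the program above, not part of the original listing; it simply reuses the three patterns already defined.

import re

combined_pattern = re.compile(
    r"(?P<phone>\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})"
    r"|(?P<usn>[A-Z]{3}\d{2}[A-Z]{2}\d{3})"
    r"|(?P<email>[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})"
)

text_to_search = "Reach us at 800-555-1212 or [email protected]. Student ID: NNM21EC099."

# finditer yields match objects; lastgroup names the alternative that matched.
for match in combined_pattern.finditer(text_to_search):
    print(f"{match.lastgroup}: {match.group()}")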

5.) What do the cost parameters (ins_cost, del_cost, sub_cost) control in this edit distance function?
def weighted_edit_distance_no_numpy(s1, s2, ins_cost=1, del_cost=1, sub_cost=1):

    m = len(s1)
    n = len(s2)

    # Initialize DP table with nested lists
    dp = [[0.0 for _ in range(n + 1)] for _ in range(m + 1)]

    # --- Initialization ---
    for j in range(n + 1):
        dp[0][j] = j * ins_cost
    for i in range(m + 1):
        dp[i][0] = i * del_cost

    # --- Fill DP table ---
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            current_sub_cost = 0 if s1[i - 1] == s2[j - 1] else sub_cost
            deletion = dp[i - 1][j] + del_cost
            insertion = dp[i][j - 1] + ins_cost
            substitution = dp[i - 1][j - 1] + current_sub_cost

            dp[i][j] = min(deletion, insertion, substitution)

    return dp[m][n]

# Example usage
string1 = "intention"
string2 = "execution"
distance1 = weighted_edit_distance_no_numpy(string1, string2, ins_cost=1, del_cost=1, sub_cost=1)
print(f"Weighted edit distance between '{string1}' and '{string2}': {distance1}")

Weighted edit distance between 'intention' and 'execution': 5
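
The three parameters weight the elementary edit operations: ins_cost is added for each inserted character, del_cost for each deleted character, and sub_cost whenever a character of s1 is replaced by a different character of s2 (matching characters cost 0). Raising one of the costs steers the dynamic program toward the cheaper operations. The call below is an illustrative variation on the example above: with the classic Levenshtein weighting of ins_cost=1, del_cost=1, sub_cost=2, the 'intention'/'execution' pair comes out at 8 rather than 5, because every substitution now costs as much as a deletion plus an insertion.

# Same strings as above, but substitutions cost twice as much as insertions or deletions.
distance2 = weighted_edit_distance_no_numpy("intention", "execution",
                                            ins_cost=1, del_cost=1, sub_cost=2)
print(f"With sub_cost=2 the distance becomes: {distance2}")  # expected: 8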
