NLP Manual

The document outlines various natural language processing techniques using the NLTK library, including tokenization and stop word removal. It also explains stemming, specifically the Porter Stemming algorithm, which reduces words to their root form, enhancing information retrieval efficiency. Additionally, it discusses word analysis and generation processes, illustrating how to identify simple and complex words.

Uploaded by

Latha Panjala


Week-1:-

A) Tokenization

import nltk
from nltk.tokenize import word_tokenize

# Tokenizer models; recent NLTK releases may need nltk.download('punkt_tab') instead
nltk.download('punkt')
text = "Ayush and Smrita are beautiful couple"
tokens = word_tokenize(text)
print(tokens)

Output:

`['Ayush', 'and', 'Smrita', 'are', 'beautiful', 'couple']`

from nltk.tokenize import word_tokenize

# Define the text to be tokenized
text = "This is an example sentence for tokenization."
# Tokenize the text into words
words = word_tokenize(text)
print(words)

B) Stop Word Removal

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
print(stopwords.words('english'))

Output:-
['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself',
'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
'below', 'to',
'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once',
'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most',
'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're',
've', 'y',
'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't",
'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn',
"mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn',
"shouldn't",
'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
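
The program above only prints the stop word list. Removing stop words from a text means filtering its tokens against that list. The following is a minimal sketch and is not part of the original manual: `remove_stop_words` and `sample_stop_words` are illustrative names, and the sample list is a small hand-picked subset of the list printed above so that the snippet runs without any download. In practice `stopwords.words('english')` would be passed instead, and the tokens would come from `word_tokenize`.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words (case-insensitive match)."""
    stop_set = {w.lower() for w in stop_words}  # a set gives fast membership tests
    return [t for t in tokens if t.lower() not in stop_set]


# Assumption: a small subset of the NLTK English list shown above.
sample_stop_words = ["this", "is", "an", "for", "and", "are"]

# Tokens of the example sentence from Week-1, written out by hand here.
tokens = ["This", "is", "an", "example", "sentence", "for", "tokenization", "."]

filtered = remove_stop_words(tokens, sample_stop_words)
print(filtered)  # ['example', 'sentence', 'tokenization', '.']
```

The comparison is done in lower case because the NLTK list contains only lower-case entries, while a sentence may begin with a capitalised stop word such as "This".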

Week-2:-

In linguistics (the study of language and its structure), a stem is the part of a word
that is common to all of its inflected variants.

CONNECT
CONNECTED
CONNECTION
CONNECTING
The words above are inflected variants of CONNECT, so CONNECT is a stem.
Different suffixes can be added to this stem to form different words.

The process of reducing such inflected (or sometimes derived) words to their word
stem is known as Stemming. For example, CONNECTED, CONNECTION and
CONNECTING can be reduced to the stem CONNECT.

The Porter Stemming algorithm (or Porter Stemmer) removes suffixes from an
English word to obtain its stem, which is very useful in the field of Information
Retrieval (IR). Stemming reduces the number of distinct terms kept by an IR
system, which saves both space and time.
The algorithm was developed by the British computer scientist Martin F. Porter.
The official home page of the Porter stemming algorithm has further information.

First, a few terms and notations are introduced to make the explanation easier.

Consonants and Vowels

A consonant is a letter other than A, E, I, O or U, and other than “Y” preceded by
a consonant. So in “TOY” the consonants are “T” and “Y”, and in “SYZYGY” they
are “S”, “Z” and “G”.

If a letter is not a consonant it is a vowel.

A consonant will be denoted by c and a vowel by v.

A list of one or more consecutive consonants (ccc…) will be denoted by C, and a
list of one or more consecutive vowels (vvv…) will be denoted by V. Any word, or
part of a word, therefore has one of the four forms given below.

CVCV … C → collection, management

CVCV … V → conclude, revise
VCVC … C → entertainment, illumination
VCVC … V → illustrate, abundance
All of these forms can be represented using a single form as,
[C]VCVC … [V]

Here the square brackets denote the optional presence of their contents (a leading C or a trailing V).

(VC)m denotes VC repeated m times. So the above expression can be written as,

[C](VC)m[V]

What is m?

The value m found in the above expression is called the measure of any word or
word part when represented in the form [C](VC)m[V]. Here are some examples for
different values of m:

m=0 → TREE, TR, EE, Y, BY

m=1 → TROUBLE, OATS, TREES, IVY
m=2 → TROUBLES, PRIVATE, OATEN, ROBBERY
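
The measure can be checked with a few lines of Python. This is a minimal sketch based only on the definitions above, not the code of any library: the names `is_consonant` and `measure` are illustrative. Each letter is mapped to c or v, runs are collapsed to C and V, and the VC pairs are counted.

```python
import re


def is_consonant(word, i):
    """True if word[i] is a consonant under the definition given above."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when it is preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Return m for a word or word part written as [C](VC)m[V]."""
    word = word.lower()
    pattern = "".join(
        "c" if is_consonant(word, i) else "v" for i in range(len(word))
    )
    # Collapse runs: ccc... -> C and vvv... -> V
    collapsed = re.sub(r"v+", "V", re.sub(r"c+", "C", pattern))
    return collapsed.count("VC")


print(measure("TREE"))      # 0
print(measure("TROUBLE"))   # 1
print(measure("ROBBERY"))   # 2
```
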
Stemmer

Rules
The rules for replacing (or removing) a suffix are written in the form shown
below.

(condition) S1 → S2

This means that if a word ends with the suffix S1, and the stem before S1 satisfies
the given condition, S1 is replaced by S2. The condition is usually given in terms of
m in regard to the stem before S1.

(m > 1) EMENT →

Here S1 is ‘EMENT’ and S2 is null. This would map REPLACEMENT to
REPLAC, since REPLAC is a word part for which m = 2.
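
A single rule of the form (condition) S1 → S2 can be sketched as follows. The helper names `apply_rule`, `measure` and `is_consonant` are illustrative and are not taken from the manual or from NLTK. Here m is computed by counting every change from a vowel to a consonant, which gives the same value as counting VC pairs.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (Y after a consonant counts as a vowel)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC pairs in the stem, i.e. the value of m."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1  # a vowel run has just been followed by a consonant
        prev_is_vowel = not cons
    return m


def apply_rule(word, s1, s2, condition):
    """Apply (condition) S1 -> S2; return the word unchanged if the rule does not fire."""
    if not word.endswith(s1):
        return word
    stem = word[: len(word) - len(s1)]
    if not condition(stem):
        return word
    return stem + s2


# (m > 1) EMENT -> (null)
print(apply_rule("replacement", "ement", "", lambda s: measure(s) > 1))  # replac
print(apply_rule("cement", "ement", "", lambda s: measure(s) > 1))       # cement
```

The second call leaves the word unchanged because the stem before EMENT is "c", for which m = 0.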

Conditions
The conditions may contain the following:

*S – the stem ends with S (and similarly for the other letters)
*v* – the stem contains a vowel
*d – the stem ends with a double consonant (e.g. -TT, -SS)
*o – the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)
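
Each of these conditions is a simple test on the stem. The functions below are a sketch written from the definitions above; their names (`ends_with`, `contains_vowel`, `ends_double_consonant`, `ends_cvc`) are assumptions made for illustration and do not come from any library. Stems are expected in lower case.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant (Y after a consonant counts as a vowel)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with(stem, letter):
    """*S : the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """*v* : the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d : the stem ends with a double consonant such as -tt or -ss."""
    return (
        len(stem) >= 2
        and stem[-1] == stem[-2]
        and is_consonant(stem, len(stem) - 1)
    )


def ends_cvc(stem):
    """*o : the stem ends cvc, where the second c is not w, x or y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


print(contains_vowel("tr"))           # False
print(ends_double_consonant("hiss"))  # True
print(ends_cvc("hop"))                # True
print(ends_cvc("snow"))               # False
```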

Example 1
In the first example, we input the word MULTIDIMENSIONAL to the Porter
Stemming algorithm. Let’s see what happens as the word goes through steps 1 to 5.

[Figure: ex.png]
The suffix will not match any of the cases found in steps 1, 2 and 3.
Then it comes to step 4.
The word ends with “AL”, and the stem before it (MULTIDIMENSION) has m > 1 (since m = 5).
Hence in step 4, “AL” is deleted (replaced with null).
Calling step 5 will not change the stem further.
Finally the output will be MULTIDIMENSION.
MULTIDIMENSIONAL → MULTIDIMENSION

import nltk
from nltk.stem import PorterStemmer

# Create a Porter Stemmer instance

porter_stemmer = PorterStemmer()

# Example words for stemming

words = ["running", "jumps", "happily", "programming"]

# Apply stemming to each word

stemmed_words = [porter_stemmer.stem(word) for word in words]

print("Original words:", words)

print("Stemmed words:", stemmed_words)

Output:-
Original words: ['running', 'jumps', 'happily', 'programming']

Stemmed words: ['run', 'jump', 'happili', 'program']

Week-3:-
Word Analysis
A word can be simple or complex. For example, the word 'cat' is simple because it
cannot be decomposed into smaller parts. On the other hand, the word 'cats' is
complex, because it is made up of two parts: the root 'cat' and the plural suffix
'-s'.
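
The idea of splitting a complex word into root and suffix can be illustrated with a toy analyser. This is only a sketch under strong assumptions: it knows a tiny hand-made set of roots and only the regular plural suffix '-s'. The name `decompose` and the root set are hypothetical and are not part of the lab tool described in the steps below.

```python
def decompose(word, roots):
    """Return (root, suffix) for a known word, or None if it cannot be analysed.

    A simple word is returned with suffix None; a regular plural is
    returned with the suffix '-s'.
    """
    if word in roots:
        return word, None            # simple word: no further decomposition
    if word.endswith("s") and word[:-1] in roots:
        return word[:-1], "-s"       # complex word: root + plural suffix
    return None


roots = {"cat", "dog", "book"}       # assumption: a tiny illustrative lexicon

print(decompose("cat", roots))   # ('cat', None)
print(decompose("cats", roots))  # ('cat', '-s')
```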

STEP 1: Select the language.

OUTPUT: Drop down for selecting words will appear.

STEP 2: Select the word.

OUTPUT: Drop down for selecting features will appear.

STEP 3: Select the features.

STEP 4: Click the "Check" button to check your answer.

OUTPUT: Correct features are marked with a tick and wrong features with a cross.

[Screenshots: Step-1 to Step-5]

Word Generation

A word can be simple or complex. For example, the word 'cat' is simple because it
cannot be decomposed into smaller parts. On the other hand, the word 'cats' is
complex, because it is made up of two parts: the root 'cat' and the plural suffix
'-s'.

STEP 1: Select the language.

OUTPUT: Drop downs for selecting root and other features will appear.
STEP 2: Select the root and other features.

STEP 3: After selecting all the features, select the word that corresponds to the
selected features.

STEP 4: Click the check button to see whether the right word is selected.

OUTPUT: The output tells whether the selected word is right or wrong.
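
Word generation goes in the opposite direction: a root and its features are combined into a surface word. The sketch below handles only one feature, number, and only the regular plural in '-s'. The function name `generate` and the feature values "singular" and "plural" are assumptions made for illustration.

```python
def generate(root, number):
    """Build a word from a root and a number feature (regular nouns only)."""
    if number == "singular":
        return root
    if number == "plural":
        return root + "s"   # assumption: regular plural, no spelling changes
    raise ValueError("unknown number feature: " + number)


print(generate("cat", "singular"))  # cat
print(generate("cat", "plural"))    # cats
```

Irregular plurals and spelling changes would need extra rules or a lookup table, which this sketch leaves out.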

[Screenshots: Step-1 and Step-2]

Week-4:
