NLP Manual

The document outlines various natural language processing techniques using the NLTK library, including tokenization and stop word removal. It also explains stemming, specifically the Porter Stemming algorithm, which reduces words to their root form, enhancing information retrieval efficiency. Additionally, it discusses word analysis and generation processes, illustrating how to identify simple and complex words.

Uploaded by

Latha Panjala

Week-1:-

A) Tokenization

import nltk
from nltk.tokenize import word_tokenize

# Newer NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')

text = "Ayush and Smrita are beautiful couple"
tokens = word_tokenize(text)
print(tokens)

Output:

`['Ayush', 'and', 'Smrita', 'are', 'beautiful', 'couple']`

from nltk.tokenize import word_tokenize

# Define the text to be tokenized
text = "This is an example sentence for tokenization."
# Tokenize the text into words
words = word_tokenize(text)
print(words)

B) Stop Word Removal

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
print(stopwords.words('english'))

Output:-
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves',
'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself',
'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those',
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while',
'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through',
'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once',
'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each',
'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y',
'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't",
'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't",
'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't",
'wouldn', "wouldn't"]
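
The listing above only prints the stop word list. A minimal sketch of the removal step itself might look like the following. It assumes a small hand-picked subset of the list (STOP_WORDS) and whitespace splitting so that it runs without any downloads; the function name remove_stop_words is hypothetical. In practice stopwords.words('english') and word_tokenize would take the place of both stand-ins.

```python
# Assumption: a small subset of the NLTK English stop word list shown above,
# hard-coded so the sketch runs without downloading any corpus.
STOP_WORDS = {"i", "me", "my", "is", "are", "a", "an", "the", "and", "for", "this"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not in the stop word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Whitespace splitting stands in for word_tokenize here.
tokens = "This is an example sentence for tokenization".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['example', 'sentence', 'tokenization']
```

Comparing the lowercase form keeps capitalised stop words such as "This" from slipping through, since the NLTK list is all lowercase.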

Week-2:-

In linguistics (the study of language and its structure), a stem is the part of a word
that is common to all of its inflected variants.

CONNECT
CONNECTED
CONNECTION
CONNECTING
The words above are inflected variants of CONNECT. Hence, CONNECT is a stem. To
this stem we can add different suffixes to form different words.

The process of reducing such inflected (or sometimes derived) words to their word
stem is known as Stemming. For example, CONNECTED, CONNECTION and
CONNECTING can be reduced to the stem CONNECT.

The Porter Stemming algorithm (or Porter Stemmer) removes the suffixes from an
English word to obtain its stem, which is very useful in the field of Information
Retrieval (IR). Stemming reduces the number of distinct terms an IR system has to
keep, which saves both space and time.
The algorithm was developed by the British computer scientist Martin F. Porter.
You can visit the official home page of the Porter stemming algorithm for
further information.

First, a few terms and expressions are introduced to make the explanation
easier.

Consonants and Vowels

A consonant is a letter other than A, E, I, O or U, and other than the letter “Y”
preceded by a consonant. So in “TOY” the consonants are “T” and “Y”, and in
“SYZYGY” they are “S”, “Z” and “G”.

If a letter is not a consonant it is a vowel.

A consonant will be denoted by c and a vowel by v.
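
As a small illustration of the definition above, the sketch below labels each letter of a word with c or v. The function name cv_pattern is hypothetical and not part of NLTK.

```python
def cv_pattern(word):
    """Label each letter as 'c' (consonant) or 'v' (vowel)."""
    pattern = ""
    for ch in word.upper():
        if ch in "AEIOU":
            pattern += "v"
        elif ch == "Y" and pattern.endswith("c"):
            # Y preceded by a consonant is treated as a vowel.
            pattern += "v"
        else:
            pattern += "c"
    return pattern


print(cv_pattern("TOY"))     # cvc
print(cv_pattern("SYZYGY"))  # cvcvcv
```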

A list of one or more consecutive consonants (ccc…) will be denoted by C, and a
list of one or more consecutive vowels (vvv…) will be denoted by V. Any word, or
part of a word, therefore has one of the four forms given below.

CVCV … C → collection, management
CVCV … V → conclude, revise
VCVC … C → entertainment, illumination
VCVC … V → illustrate, abundance

All of these forms can be represented using a single form as,
[C]VCVC … [V]

Here the square brackets denote the optional presence of the leading consonants or trailing vowels.

(VC)^m denotes VC repeated m times. So the above expression can be written as,

[C](VC)^m[V]

What is m?

The value m found in the above expression is called the measure of any word or
word part when represented in the form [C](VC)^m[V]. Here are some examples for
different values of m:

m=0 → TREE, TR, EE, Y, BY
m=1 → TROUBLE, OATS, TREES, IVY
m=2 → TROUBLES, PRIVATE, OATEN, ROBBERY
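
The measure can be computed by collapsing runs of consonants and vowels and counting the VC pairs. The sketch below is one way to do this under the definitions above. The helper names cv_pattern and measure are hypothetical, chosen for illustration.

```python
import re


def cv_pattern(word):
    """Label each letter as 'c' (consonant) or 'v' (vowel)."""
    pattern = ""
    for ch in word.upper():
        if ch in "AEIOU":
            pattern += "v"
        elif ch == "Y" and pattern.endswith("c"):
            pattern += "v"  # Y preceded by a consonant is a vowel
        else:
            pattern += "c"
    return pattern


def measure(word):
    """Return m, the number of VC pairs in the collapsed form of the word."""
    # Collapse runs, e.g. "ccvvccv" (TROUBLE) becomes "CVCV".
    collapsed = re.sub(r"c+", "C", re.sub(r"v+", "V", cv_pattern(word)))
    return collapsed.count("VC")


for w in ["TREE", "BY", "TROUBLE", "IVY", "OATEN", "ROBBERY"]:
    print(w, measure(w))
```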
Stemmer

Rules
The rules for replacing (or removing) a suffix are given in the form shown
below.

(condition) S1 → S2

This means that if a word ends with the suffix S1, and the stem before S1 satisfies
the given condition, S1 is replaced by S2. The condition is usually given in terms of
m in regard to the stem before S1.

(m > 1) EMENT →

Here S1 is ‘EMENT’ and S2 is null. This would map REPLACEMENT to REPLAC,
since REPLAC is a word part for which m = 2, so the condition m > 1 holds.
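
A single rule of this form can be sketched as a small function. This is only an illustration of the (m > 1) EMENT → rule, not the full stemmer. The names apply_rule, measure and cv_pattern are hypothetical, and words are assumed to be upper case.

```python
import re


def cv_pattern(word):
    """Label each letter as 'c' (consonant) or 'v' (vowel)."""
    pattern = ""
    for ch in word.upper():
        if ch in "AEIOU":
            pattern += "v"
        elif ch == "Y" and pattern.endswith("c"):
            pattern += "v"  # Y preceded by a consonant is a vowel
        else:
            pattern += "c"
    return pattern


def measure(word):
    """Return m, the number of VC pairs in the collapsed form of the word."""
    collapsed = re.sub(r"c+", "C", re.sub(r"v+", "V", cv_pattern(word)))
    return collapsed.count("VC")


def apply_rule(word, s1, s2, min_measure):
    """Apply (m > min_measure) S1 -> S2; return the word unchanged if the rule does not fire."""
    if not word.endswith(s1):
        return word
    stem = word[: len(word) - len(s1)]
    if measure(stem) > min_measure:
        return stem + s2
    return word


print(apply_rule("REPLACEMENT", "EMENT", "", 1))  # REPLAC
print(apply_rule("CEMENT", "EMENT", "", 1))       # CEMENT (the stem "C" has m = 0)
```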

Conditions
The conditions may contain the following:

*S – the stem ends with S (and similarly for the other letters)
*v* – the stem contains a vowel
*d – the stem ends with a double consonant (e.g. -TT, -SS)
*o – the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)
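
The four conditions can be sketched as simple predicates on the stem. The function names below are hypothetical and chosen for illustration. They follow the definitions in the list above rather than any particular library implementation.

```python
def cv_pattern(word):
    """Label each letter as 'c' (consonant) or 'v' (vowel)."""
    pattern = ""
    for ch in word.upper():
        if ch in "AEIOU":
            pattern += "v"
        elif ch == "Y" and pattern.endswith("c"):
            pattern += "v"  # Y preceded by a consonant is a vowel
        else:
            pattern += "c"
    return pattern


def ends_with(stem, letter):
    """*S: the stem ends with the given letter."""
    return stem.upper().endswith(letter.upper())


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return "v" in cv_pattern(stem)


def ends_double_consonant(stem):
    """*d: the stem ends with a double consonant such as -TT or -SS."""
    s = stem.upper()
    return len(s) >= 2 and s[-1] == s[-2] and cv_pattern(s)[-1] == "c"


def ends_cvc(stem):
    """*o: the stem ends cvc, where the second c is not W, X or Y."""
    s = stem.upper()
    return cv_pattern(s).endswith("cvc") and s[-1] not in "WXY"


print(contains_vowel("TR"), ends_double_consonant("HOPP"), ends_cvc("HOP"))
```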

Example 1
In the first example, we input the word MULTIDIMENSIONAL to the Porter
Stemming algorithm. Let’s see what happens as the word goes through steps 1 to 5.

The suffix will not match any of the cases found in steps 1, 2 and 3.
Then it comes to step 4.
The word ends with “AL”, and the stem before “AL” has m > 1 (since m = 5).
Hence in step 4, “AL” is deleted (replaced with null).
Calling step 5 will not change the stem further.
Finally the output will be MULTIDIMENSION.
MULTIDIMENSIONAL → MULTIDIMENSION

from nltk.stem import PorterStemmer

# Create a Porter Stemmer instance
porter_stemmer = PorterStemmer()

# Example words for stemming
words = ["running", "jumps", "happily", "programming"]

# Apply stemming to each word
stemmed_words = [porter_stemmer.stem(word) for word in words]

print("Original words:", words)
print("Stemmed words:", stemmed_words)

Output:-
Original words: ['running', 'jumps', 'happily', 'programming']
Stemmed words: ['run', 'jump', 'happili', 'program']

Week-3:-
Word Analysis
A word can be simple or complex. For example, the word 'cat' is simple because it
cannot be decomposed into smaller parts. On the other hand, the word 'cats' is
complex, because it is made up of two parts: the root 'cat' and the plural suffix
'-s'.
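
The idea of splitting a complex word into a root and its features can be sketched with a toy analyzer. This is only an illustration: the lexicon ROOTS and the function analyze are hypothetical, and only the plural suffix '-s' is handled.

```python
# Hypothetical toy lexicon of roots; a real analyzer would use a full dictionary.
ROOTS = {"cat", "dog", "book"}


def analyze(word):
    """Return (root, features) for a word, or None if it cannot be analyzed."""
    if word in ROOTS:
        # Simple word: the word is its own root.
        return word, {"number": "singular"}
    if word.endswith("s") and word[:-1] in ROOTS:
        # Complex word: root plus the plural suffix '-s'.
        return word[:-1], {"number": "plural"}
    return None


print(analyze("cat"))   # ('cat', {'number': 'singular'})
print(analyze("cats"))  # ('cat', {'number': 'plural'})
```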

STEP 1: Select the language.

OUTPUT: Drop down for selecting words will appear.

STEP 2: Select the word.

OUTPUT: Drop down for selecting features will appear.

STEP 3: Select the features.

STEP 4: Click "Check" button to check your answer.

OUTPUT: Correct features are marked with a tick and wrong features with a cross.


Word Generation

A word can be simple or complex. For example, the word 'cat' is simple because it
cannot be decomposed into smaller parts. On the other hand, the word 'cats' is
complex, because it is made up of two parts: the root 'cat' and the plural suffix
'-s'.
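
Word generation runs in the opposite direction: given a root and its features, produce the word form. The sketch below is a toy illustration only. The function generate is hypothetical and covers just the regular plural suffix '-s'.

```python
def generate(root, number):
    """Build a word form from a root and a number feature (toy rules only)."""
    if number == "singular":
        return root
    if number == "plural":
        # Regular plural: attach the suffix '-s' to the root.
        return root + "s"
    raise ValueError(f"unsupported number feature: {number}")


print(generate("cat", "singular"))  # cat
print(generate("cat", "plural"))    # cats
```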

STEP 1: Select the language.

OUTPUT: Drop downs for selecting root and other features will appear.
STEP 2: Select the root and other features.

STEP 3: After selecting all the features, select the word that corresponds to the
selected features.

STEP 4: Click the "Check" button to see whether the right word is selected.

OUTPUT: The output tells whether the selected word is right or wrong.

Week-4:
