NLP Manual
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
print(stopwords.words('english'))
Output:-
['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself',
'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
'below', 'to',
'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once',
'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most',
'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're',
've', 'y',
'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't",
'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn',
"mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn',
"shouldn't",
'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Week-2:-
In linguistics (the study of language and its structure), a stem is the part of a word that is
common to all of its inflected variants.
CONNECT
CONNECTED
CONNECTION
CONNECTING
The words above are inflected variants of CONNECT, so CONNECT is their stem.
Different suffixes can be added to this stem to form different words.
The process of reducing such inflected (or sometimes derived) words to their word
stem is known as Stemming. For example, CONNECTED, CONNECTION and
CONNECTING can be reduced to the stem CONNECT.
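As a rough illustration of the idea, the sketch below strips a few hard-coded suffixes. It is not the Porter algorithm: the function name naive_stem and the suffix list are assumptions made only for this example.

```python
# Naive suffix stripping: illustrative only, NOT the Porter algorithm.
SUFFIXES = ("ING", "ION", "ED")  # assumed list, chosen to cover the CONNECT examples


def naive_stem(word):
    """Remove the first matching suffix from an upper-case word."""
    for suffix in SUFFIXES:
        # Only strip when something is left in front of the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word


for w in ["CONNECT", "CONNECTED", "CONNECTION", "CONNECTING"]:
    print(w, "->", naive_stem(w))
```

All four words map to CONNECT here. A real stemmer needs conditions on the remaining stem, otherwise short words such as RED would also lose their ending.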
The Porter stemming algorithm (or Porter stemmer) removes suffixes from an English
word to obtain its stem, which is very useful in the field of Information Retrieval (IR).
Stemming reduces the number of distinct terms an IR system has to keep, which
saves both space and time.
The algorithm was developed by the British computer scientist Martin F. Porter.
The official home page of the Porter stemming algorithm has further information.
First, a few terms and expressions are introduced to make the explanation easier.
A consonant is a letter other than the vowels (A, E, I, O, U) and other than the letter “Y”
preceded by a consonant. Any letter that is not a consonant is a vowel. So in “TOY” the
consonants are “T” and “Y”, and in “SYZYGY” they are “S”, “Z” and “G”.
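The consonant definition can be sketched in a few lines of Python. The names is_consonant and consonants are assumptions made for this example, and words are assumed to be upper-case.

```python
VOWELS = "AEIOU"


def is_consonant(word, i):
    """Return True if the letter at index i of an upper-case word is a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "Y":
        # Y is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def consonants(word):
    """List the letters of the word that count as consonants."""
    return [ch for i, ch in enumerate(word) if is_consonant(word, i)]


print(consonants("TOY"))     # ['T', 'Y']
print(consonants("SYZYGY"))  # ['S', 'Z', 'G']
```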
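A consonant is denoted by c and a vowel by v. A sequence of one or more consonants
is denoted by C, and a sequence of one or more vowels by V. Any word or word part
can then be written in the form
[C]VCVC...[V]
where the square brackets mark parts that may or may not be present.
(VC)m denotes VC repeated m times, so the expression above can be written as
[C](VC)m[V]
What is m?
The value m in this expression is called the measure of a word or word part when it
is represented in the form [C](VC)m[V]. Here are some examples for different values of m:
m = 0: TR, EE, TREE, Y, BY
m = 1: TROUBLE, OATS, TREES, IVY
m = 2: TROUBLES, PRIVATE, OATEN, ORRERY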
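One way to compute the measure is to count how many times a vowel run is immediately followed by a consonant. The sketch below assumes upper-case input; the helper names is_consonant and measure are chosen for this example.

```python
VOWELS = "AEIOU"


def is_consonant(word, i):
    """Return True if the letter at index i of an upper-case word is a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "Y":
        # Y counts as a vowel only when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC pairs when the stem is written as [C](VC)m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_is_vowel = not cons
    return m


for w in ["TREE", "BY", "TROUBLE", "IVY", "TROUBLES", "PRIVATE"]:
    print(w, measure(w))
```

The loop prints 0 for TREE and BY, 1 for TROUBLE and IVY, and 2 for TROUBLES and PRIVATE.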
Rules
The rules for replacing (or removing) a suffix are given in the form shown below.
(condition) S1 → S2
This means that if a word ends with the suffix S1, and the stem before S1 satisfies
the given condition, S1 is replaced by S2. The condition is usually given in terms of
m in regard to the stem before S1.
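For example, in the rule
(m > 1) EMENT →
S1 is EMENT and S2 is null. This rule maps REPLACEMENT to REPLAC, because
REPLAC is a word part for which m = 2.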
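A rule of the form (condition) S1 → S2 can be sketched as a small function that takes the condition as a callable. The name apply_rule and its signature are assumptions made for this example; only the single EMENT rule is shown, not a full step of the algorithm.

```python
VOWELS = "AEIOU"


def is_consonant(word, i):
    """Return True if the letter at index i of an upper-case word is a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "Y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC pairs when the stem is written as [C](VC)m[V]."""
    m, prev_is_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m


def apply_rule(word, s1, s2, condition):
    """Apply '(condition) S1 -> S2'; return the word unchanged if the rule does not fire."""
    if not word.endswith(s1):
        return word
    stem = word[:len(word) - len(s1)]
    return stem + s2 if condition(stem) else word


# (m > 1) EMENT -> (null)
print(apply_rule("REPLACEMENT", "EMENT", "", lambda stem: measure(stem) > 1))  # REPLAC
print(apply_rule("CEMENT", "EMENT", "", lambda stem: measure(stem) > 1))       # CEMENT
```

CEMENT is left unchanged because the stem before EMENT is just C, which has m = 0.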
Conditions
The conditions may contain the following:
*S – the stem ends with S (and similarly for the other letters)
*v* – the stem contains a vowel
*d – the stem ends with a double consonant (e.g. -TT, -SS)
*o – the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)
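The four conditions can each be written as a small predicate on the stem. The function names below are assumptions made for this sketch, and stems are assumed to be upper-case.

```python
VOWELS = "AEIOU"


def is_consonant(word, i):
    """Return True if the letter at index i of an upper-case word is a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "Y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with(stem, letter):
    """*S : the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """*v* : the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d : the stem ends with a double consonant."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem):
    """*o : the stem ends cvc, where the second c is not W, X or Y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "WXY")


print(ends_cvc("HOP"), ends_cvc("WIL"), ends_cvc("SNOW"))          # True True False
print(ends_double_consonant("FALL"), ends_double_consonant("TREE"))  # True False
print(contains_vowel("TR"), contains_vowel("TREE"))                  # False True
```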
Example 1
In the first example, we input the word MULTIDIMENSIONAL to the Porter
Stemming algorithm. Let’s see what happens as the word goes through steps 1 to 5.
The suffix will not match any of the cases found in steps 1, 2 and 3.
Then it comes to step 4.
The stem of the word has m > 1 (since m = 5) and ends with “AL”.
Hence in step 4, “AL” is deleted (replaced with null).
Calling step 5 will not change the stem further.
Finally the output will be MULTIDIMENSION.
MULTIDIMENSIONAL → MULTIDIMENSION
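The step 4 decision in this example can be checked with a short sketch. It implements only the single rule (m > 1) AL → (null), not the whole algorithm, and the names measure and step4_al are assumptions made for this example.

```python
VOWELS = "AEIOU"


def is_consonant(word, i):
    """Return True if the letter at index i of an upper-case word is a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "Y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC pairs when the stem is written as [C](VC)m[V]."""
    m, prev_is_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m


def step4_al(word):
    """Apply only the step 4 rule '(m > 1) AL -> (null)'."""
    if word.endswith("AL"):
        stem = word[:-2]
        if measure(stem) > 1:
            return stem
    return word


word = "MULTIDIMENSIONAL"
print(measure(word[:-2]))  # 5
print(step4_al(word))      # MULTIDIMENSION
```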
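from nltk.stem import PorterStemmer
words = ['running', 'jumps', 'happily', 'programming']  # sample input
print("Original words:", words)
print("Stemmed words:", [PorterStemmer().stem(w) for w in words])
Output:-
Original words: ['running', 'jumps', 'happily', 'programming']
Stemmed words: ['run', 'jump', 'happili', 'program']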
Week-3:-
Word Analysis
A word can be simple or complex. For example, the word 'cat' is simple because it
cannot be decomposed into smaller parts. On the other hand, the word 'cats' is
complex, because it is made up of two parts: the root 'cat' and the plural suffix
'-s'.
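The decomposition of 'cats' can be sketched as follows. This is a toy analyser that handles only regular '-s' plurals; the function name analyse_noun and the feature names are assumptions made for this example, and words such as 'bus' or irregular plurals would be analysed incorrectly.

```python
def analyse_noun(word):
    """Split a noun into root and number feature (regular '-s' plurals only)."""
    if len(word) > 1 and word.endswith("s") and not word.endswith("ss"):
        return {"root": word[:-1], "suffix": "-s", "number": "plural"}
    return {"root": word, "suffix": None, "number": "singular"}


print(analyse_noun("cat"))
print(analyse_noun("cats"))
```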
OUTPUT: Right features are marked with a tick and wrong features are marked with a cross.
[Figures: Step-1 to Step-5]
Word Generation
A word can be simple or complex. For example, the word 'cat' is simple because it
cannot be decomposed into smaller parts. On the other hand, the word 'cats' is
complex, because it is made up of two parts: the root 'cat' and the plural suffix
'-s'.
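Word generation goes the other way: a root and a set of features are combined into a surface word. The sketch below covers only the regular English plural; the function name generate_noun and the feature values are assumptions made for this example.

```python
def generate_noun(root, number):
    """Build a noun from its root and number feature (regular plurals only)."""
    if number == "singular":
        return root
    if number == "plural":
        return root + "s"
    raise ValueError("number must be 'singular' or 'plural'")


print(generate_noun("cat", "singular"))  # cat
print(generate_noun("cat", "plural"))    # cats
```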
OUTPUT: Drop downs for selecting root and other features will appear.
STEP 2: Select the root and other features.
STEP 3: After selecting all the features, select the word that corresponds to the
features selected above.
STEP 4: Click the check button to see whether the right word is selected or not.
[Figures: Step-1 and Step-2]
Week-4: