Lab 04 - Text Normalization Tutorial
Learning Outcome:
Apply text mining and natural language processing methodologies to textual data.
QUICK REVIEW
Mapping from foxes to fox is called stemming. Morphological parsing or stemming applies to many
affixes other than plurals; for example, we might need to take any English verb form ending in -ing (going, talking,
congratulating) and parse it into its verbal stem plus the -ing morpheme.
The Porter algorithm is a simple and efficient way to do stemming, stripping off affixes. It is not as accurate as a
transducer model that includes a lexicon, but may be preferable for applications like information retrieval in which
exact morphological structure is not needed.
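As a quick illustration, here is a minimal sketch (the outputs in the comments are what NLTK's Porter implementation is expected to produce); note how stripping affixes without a lexicon can conflate distinct words:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["going", "talking", "congratulating"]:
    print(ps.stem(w))          # expected: 'go', 'talk', 'congratul'
print(ps.stem("university"))   # expected: 'univers'
print(ps.stem("universal"))    # expected: 'univers' (conflated with 'university')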
PRACTICES
4.1. Stemmers
Use the following simple code to run the Porter stemmer:
from nltk.stem import PorterStemmer
# Assumed sample word lists of inflected forms
example_words1 = ["python", "pythoner", "pythoning", "pythoned"]
example_words2 = ["going", "talking", "congratulating"]
ps = PorterStemmer()
for w in example_words1:
    print(ps.stem(w))
for w in example_words2:
    print(ps.stem(w))
You can combine this algorithm with tokenization code in order to stem the words in a sentence, as below:
ps = PorterStemmer()
words = word_tokenize(new_text)  # new_text is the sentence to stem (assumed defined earlier)
for w in words:
    print(ps.stem(w))
NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in
preference to crafting your own using regular expressions, since these handle a wide range of irregular cases. The Porter
and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles
the word lying (mapping it to lie), while the Lancaster stemmer does not.
tokens = word_tokenize(raw)  # raw is the input text (assumed defined)
porter = PorterStemmer()
lancaster = LancasterStemmer()
print([porter.stem(t) for t in tokens])
print([lancaster.stem(t) for t in tokens])
Compare and discuss the results of the two stemmers (Porter and Lancaster), noting any differences you observe.
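To check the claim about lying, a minimal sketch (expected outputs, following the NLTK book's example, are shown as comments):
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
print(porter.stem("lying"))     # expected: 'lie'
print(lancaster.stem("lying"))  # expected: 'lying'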
ps = PorterStemmer()
print("\nStem of the word produced :", ps.stem("produced"))
tokens = word_tokenize(raw)
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in tokens])
print()
for t in tokens:
    print("{0:20}{1:20}".format(t, wnl.lemmatize(t, pos="v")))
print()
The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid
lemmas (or lexicon headwords). However, the resulting lemmas are not always 100% accurate with respect to the
lemmas found in the dictionary. To obtain the exact lemma as per the dictionary, POS tagging should be included in
the code, as in the loop below (see also the sketch that follows it):
for t in tokens:
    print("{0:20}{1:20}".format(t, wnl.lemmatize(t, pos="v")))
Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not
be an actual word, whereas a lemma is an actual word of the language.
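For instance, a minimal sketch (note that 'studi' is not an English word, while 'study' is a dictionary headword):
from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()
print(ps.stem("studies"))        # expected: 'studi' -- a stem, not a word
print(wnl.lemmatize("studies"))  # expected: 'study' -- a valid lemma
The following script puts stemming and lemmatization together on the same text: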
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

# raw is not defined in the handout; any sample text works, e.g. (assumed):
raw = "DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government."
words = raw.lower()
print(words)
print()
tokens = word_tokenize(words)
print("Tokens")
print(tokens)
print()
print("Lemmas")
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t, pos="v") for t in tokens])
print()
print("Porter Stemming")
ps = PorterStemmer()
print([ps.stem(t) for t in tokens])
print()
print("Lancaster Stemming")
ls = LancasterStemmer()
print([ls.stem(t) for t in tokens])
print()
print("Snowball Stemming")
sn = nltk.SnowballStemmer("english")
print([sn.stem(t) for t in tokens])
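Typically, you should observe that the Lancaster stemmer is the most aggressive of the three, often producing very short, opaque stems, while Snowball (also known as Porter2) behaves much like Porter with a few improved rules.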
NOTE:
In lemmatization, the WordNet corpus (and, in some pipelines, a stop-word corpus as well) is consulted to produce
the lemma, which makes lemmatization slower than stemming.
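If you want to verify the speed difference yourself, here is a rough timing sketch (the word list and repetition counts are arbitrary choices, and absolute numbers vary by machine):
import timeit

setup = """
from nltk.stem import PorterStemmer, WordNetLemmatizer
ps = PorterStemmer()
wnl = WordNetLemmatizer()
words = ["running", "studies", "better", "lying", "ponds"] * 100
"""
# Time 100 runs of stemming vs. lemmatizing the same 500 words.
print("Porter  :", timeit.timeit("[ps.stem(w) for w in words]", setup=setup, number=100))
print("WordNet :", timeit.timeit("[wnl.lemmatize(w, pos='v') for w in words]", setup=setup, number=100))
Snowball stemmers are available for several languages: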
import nltk
print(nltk.SnowballStemmer.languages)
print(len(nltk.SnowballStemmer.languages))
print()
text = "This is achieved in practice during stemming, a text preprocessing
operation."
tokens = nltk.tokenize.word_tokenize(text)
print()
stemmer = nltk.SnowballStemmer('english')
print([stemmer.stem(t) for t in tokens])
print()
text2 = "Ceci est réalisé en pratique lors du stemming, une opération de
prétraitement de texte."
tokens2 = nltk.tokenize.word_tokenize(text2)
print()
stemmer = nltk.SnowballStemmer('french')
print([stemmer.stem(t) for t in tokens2])
The TextBlob library can also be used here: the sketch below builds a TextBlob from the English sentence above, translates it to French, and stems the result. Note that detect_language() and translate() call an online translation service and were removed from recent TextBlob releases, so this snippet assumes an older TextBlob version.
from textblob import TextBlob

en_blob = TextBlob(text)  # the English sentence defined above
print(en_blob.detect_language())  # expected: 'en' (requires internet access)
fr_blob = en_blob.translate(from_lang="en", to='fr')
print(fr_blob)
tokens = fr_blob.words
print(tokens)
print()
stemmer = nltk.SnowballStemmer('french')
print([stemmer.stem(t) for t in tokens])
References:
1. Regular Expressions: The Complete Tutorial, by Jan Goyvaerts, 2007.
2. Speech and Language Processing, by Dan Jurafsky and James H. Martin. Prentice Hall Series in Artificial
Intelligence, 2008.
3. Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, 2014.
4. Lemmatization approaches with examples in Python (https://www.machinelearningplus.com/nlp/lemmatization-examples-python/)
Revision Quiz:
https://quizlet.com/512734559/test?answerTermSides=2&promptTermSides=6&questionCount=7&questionTypes=14&showImages=true
Take-home task:
Perform lemmatization including the POS tags ADJECTIVE, NOUN, VERB, and ADVERB. Write suitable Python code to produce the proper output.