
CT107-3-3-TXSA - Text Analytics and Sentiment Analysis Text Normalisation

Lab 4: Text Normalization: Stemming and Lemmatization

This session covers:


- Different types of stemmers & stemming: Porter | Lancaster | Snowball
- Different types of lemmatizers & lemmatizing: WordNet lemmatizer | TextBlob lemmatizer
- Lemmatization with & without POS tags

Learning Outcome:
Apply text mining and natural language processing methodologies to textual data.

QUICK REVIEW

Mapping from foxes to fox is called stemming. Morphological parsing or stemming applies to many
affixes other than plurals; for example, we might need to take any English verb form ending in -ing (going, talking,
congratulating) and parse it into its verbal stem plus the -ing morpheme.
The Porter algorithm is a simple and efficient way to do stemming, stripping off affixes. It is not as accurate as a
transducer model that includes a lexicon, but it may be preferable for applications such as information retrieval, in
which exact morphological structure is not needed.
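As a quick sketch of this idea (assuming NLTK is installed; the stemmer itself is covered in detail in the practices below):

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Strip affixes such as the -ing morpheme, leaving an approximate stem
for w in ["going", "talking", "congratulating"]:
    print(w, "->", ps.stem(w))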

PRACTICES

4.1. Stemmers
Use the following simple code to implement the Porter stemmer.

from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words1 = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
example_words2 = ["List", "listed", "lists", "listing", "listings"]

# Print the Porter stem of each word
for w in example_words1:
    print(ps.stem(w))

for w in example_words2:
    print(ps.stem(w))
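The first loop reduces most of the python variants to the same stem, but note that a stem need not be a dictionary word: pythonly, for instance, stems to pythonli rather than python. The second loop maps all five list variants to the single stem list.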

You can combine this algorithm with tokenization code in order to stem the words in a sentence, as below:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

new_text = """It is very important to be pythonly while you are pythoning
with python. All pythoners have pythoned poorly at least once."""

# Tokenize the sentence, then stem each token
words = word_tokenize(new_text)
print([ps.stem(w) for w in words])


NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in
preference to crafting your own using regular expressions, since these handle a wide range of irregular cases. The Porter
and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles
the word lying (mapping it to lie), while the Lancaster stemmer does not.

from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.tokenize import word_tokenize

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

tokens = word_tokenize(raw)

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Compare the two stemmers on the same token list
print([porter.stem(t) for t in tokens])
print()
print([lancaster.stem(t) for t in tokens])

Compare and discuss the results of the two stemmers (Porter and Lancaster), noting any differences you observe.
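(Hint: the Lancaster stemmer is generally the more aggressive of the two, so it tends to produce shorter and less readable stems than the Porter stemmer.)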

4.2. Lemmatization using WordNet lemmatizer


The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary. This additional checking
process makes the lemmatizer slower than the above stemmers.

from nltk.stem import WordNetLemmatizer, PorterStemmer

# Requires the WordNet data: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print("rocks :", lemmatizer.lemmatize("rocks"))
print("\nproduced :", lemmatizer.lemmatize("produced", pos="v"))

ps = PorterStemmer()
print("\nStem of the word produced :", ps.stem("produced"))

print("\nbetter :", lemmatizer.lemmatize("better", pos="a"))
print("\nwomen :", lemmatizer.lemmatize("women", pos="n"))

Notice that it doesn't handle lying, but it converts women to woman.

import nltk
from nltk.tokenize import word_tokenize

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

tokens = word_tokenize(raw)

wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in tokens])
print()


for t in tokens:
    print("{0:20}{1:20}".format(t, wnl.lemmatize(t, pos="v")))
print()

example_words = ["List", "listed", "lists", "listing", "listings"]

print([wnl.lemmatize(w) for w in example_words])
print()
for w in example_words:
    print("{0:20}{1:20}".format(w, wnl.lemmatize(w, pos="v")))

The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid
lemmas (or lexicon headwords). Without a POS tag, however, the results are not 100% accurate: by default the
lemmatizer treats every word as a noun, so many lemmas will not match those found in the dictionary.

To obtain the exact lemma as per the dictionary, POS tags should be included in the code.

for t in tokens:
    print("{0:20}{1:20}".format(t, wnl.lemmatize(t, pos="v")))
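Hard-coding pos="v" is itself a simplification. A minimal sketch of one common approach, assuming the averaged_perceptron_tagger and wordnet NLTK resources have been downloaded: tag each token with nltk.pos_tag, then map the Penn Treebank tag onto the WordNet POS constant that lemmatize() expects. The helper get_wordnet_pos is a hypothetical name introduced here for illustration.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Hypothetical helper: map a Penn Treebank tag (from nltk.pos_tag)
# to the WordNet POS constant expected by lemmatize()
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    elif treebank_tag.startswith("V"):
        return wordnet.VERB
    elif treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # default, matching the lemmatizer's own default

wnl = WordNetLemmatizer()
tokens = nltk.word_tokenize("strange women lying in ponds distributing swords")
for token, tag in nltk.pos_tag(tokens):
    print("{0:20}{1:20}".format(token, wnl.lemmatize(token, pos=get_wordnet_pos(tag))))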

4.3. Lemmatization using TextBlob


Words can be lemmatized by calling the lemmatize method on the words of a TextBlob object:

from textblob import TextBlob
from nltk.stem import WordNetLemmatizer

sentence = TextBlob("""DENNIS: Listen, strange women lying in ponds distributing
swords is no basis for a system of government. Supreme executive power derives
from a mandate from the masses, not from some farcical aquatic ceremony.""")
tokens = sentence.words
print(tokens)
print()
print(tokens.lemmatize())

# Compare with the WordNet lemmatizer, treating each token as a verb
wnl = WordNetLemmatizer()
for t in tokens:
    print("{0:20}{1:20}".format(t, wnl.lemmatize(t, pos="v")))
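By default, lemmatizing a word list treats each word as a noun, just like the WordNet lemmatizer it wraps. Individual Word objects also accept a POS argument; a minimal sketch:

from textblob import Word

# Lemmatize as a verb: "lying" -> "lie"
print(Word("lying").lemmatize("v"))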

4.4. Stemming & Lemmatization

Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not
be an actual word, whereas a lemma is an actual word of the language.

import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

# file = open("D:/APU/TXSA-CT107-3-3/TUTORIAL/sample01.txt")
# raw = file.read()

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

words = raw.lower()
print(words)
print()
tokens = word_tokenize(words)
print("Tokens")
print(tokens)
print()
print("Lemmas")
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t, pos="v") for t in tokens])
print()
print("Porter Stemming")
ps = PorterStemmer()
print([ps.stem(t) for t in tokens])
print()
print("Lancaster Stemming")
ls = LancasterStemmer()
print([ls.stem(t) for t in tokens])
print()
print("Snowball Stemming")
sn = SnowballStemmer("english")
print([sn.stem(t) for t in tokens])

NOTE:
Stemming and lemmatization both generate the root form of inflected words; a stem might not be an actual word,
whereas a lemma is. Lemmatization looks each word up in the WordNet corpus to produce a valid lemma, and this
extra lookup makes it slower than stemming.
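As a quick illustration of the difference, compare both tools on a single word (assuming the WordNet data is downloaded):

from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()
# The stem is not a dictionary word; the lemma is
print(ps.stem("studies"))        # studi
print(wnl.lemmatize("studies"))  # study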

4.5. Stemmers --> Snowball Stemmer

import nltk

print(nltk.SnowballStemmer.languages)
print(len(nltk.SnowballStemmer.languages))
print()

text = ("This is achieved in practice during stemming, "
        "a text preprocessing operation.")
tokens = nltk.tokenize.word_tokenize(text)
stemmer = nltk.SnowballStemmer('english')
print([stemmer.stem(t) for t in tokens])
print()

# French input ("This is achieved in practice during stemming, a text
# preprocessing operation.") stemmed with the French Snowball stemmer
text2 = ("Ceci est réalisé en pratique lors du stemming, "
         "une opération de prétraitement de texte.")
tokens2 = nltk.tokenize.word_tokenize(text2, language='french')
stemmer = nltk.SnowballStemmer('french')
print([stemmer.stem(t) for t in tokens2])
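Snowball's English stemmer (sometimes called Porter2) is a later refinement of the original Porter algorithm by the same author, so its English output is similar to, but not identical with, the Porter stemmer's.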

4.6. Snowball Stemmer --> for other space-delimited languages

from textblob import TextBlob
import nltk

en_blob = TextBlob(u'This is achieved in practice during stemming, '
                   u'a text preprocessing operation.')
# Note: detect_language() and translate() call the Google Translate API
# and have been deprecated/removed in recent TextBlob releases
print(en_blob.detect_language())
fr_blob = en_blob.translate(from_lang="en", to='fr')
print(fr_blob)
tokens = fr_blob.words
print(tokens)
print()
stemmer = nltk.SnowballStemmer('french')
print([stemmer.stem(t) for t in tokens])

References:
1. Regular Expressions: The Complete Tutorial, by Jan Goyvaerts, 2007.
2. Speech and Language Processing, by Dan Jurafsky and James H. Martin. Prentice Hall Series in Artificial
Intelligence, 2008.
3. Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, 2014.
4. Lemmatization approaches with examples in Python (https://www.machinelearningplus.com/nlp/lemmatization-examples-python/)

Revision Quiz:
https://quizlet.com/512734559/test?answerTermSides=2&promptTermSides=6&questionCount=7&questionTypes=14&showImages=true

Take-home task:
Perform lemmatization including the POS tags ADJECTIVE, NOUN, VERB, and ADVERB. Write suitable Python
code to produce the proper output (the POS-mapping sketch in section 4.2 is a useful starting point).

