Lab 04 - Text Normalization Tutorial
Learning Outcome:
Apply text mining and natural language processing methodologies to textual data.
QUICK REVIEW
Mapping from foxes to fox is called stemming. Morphological parsing or stemming applies to many
affixes other than plurals; for example, we might need to take any English verb form ending in -ing (going, talking,
congratulating) and parse it into its verbal stem plus the -ing morpheme.
The Porter algorithm is a simple and efficient way to do stemming, stripping off affixes. It is not as accurate as a
transducer model that includes a lexicon, but may be preferable for applications like information retrieval in which
exact morphological structure is not needed.
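As a quick illustration, here is a minimal sketch (the outputs in the comments are what NLTK's Porter implementation is expected to produce); note how stripping affixes without a lexicon can conflate distinct words:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["going", "talking", "congratulating"]:
    print(ps.stem(w))          # expected: 'go', 'talk', 'congratul'
print(ps.stem("university"))   # expected: 'univers'
print(ps.stem("universal"))    # expected: 'univers' (conflated with 'university')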
PRACTICES
4.1. Stemmers
Use the following simple code to run the Porter stemmer:
from nltk.stem import PorterStemmer
# Assumed sample word lists of inflected forms
example_words1 = ["python", "pythoner", "pythoning", "pythoned"]
example_words2 = ["going", "talking", "congratulating"]
ps = PorterStemmer()
for w in example_words1:
    print(ps.stem(w))
for w in example_words2:
    print(ps.stem(w))
You can combine this algorithm with tokenization code in order to stem the words in a sentence, as below:
ps = PorterStemmer()
words = word_tokenize(new_text)  # new_text is the sentence to stem (assumed defined earlier)
for w in words:
    print(ps.stem(w))
NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in
preference to crafting your own using regular expressions, since these handle a wide range of irregular cases. The Porter
and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles
the word lying (mapping it to lie), while the Lancaster stemmer does not.
tokens = word_tokenize(raw)  # raw is the input text (assumed defined)
porter = PorterStemmer()
lancaster = LancasterStemmer()
print([porter.stem(t) for t in tokens])
print([lancaster.stem(t) for t in tokens])
Compare and discuss the results of the two stemmers (Porter and Lancaster), noting any differences you observe.
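To check the claim about lying, a minimal sketch (expected outputs, following the NLTK book's example, are shown as comments):
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
print(porter.stem("lying"))     # expected: 'lie'
print(lancaster.stem("lying"))  # expected: 'lying'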
ps = PorterStemmer()
print("\nStem of the word produced :", ps.stem("produced"))
tokens = word_tokenize(raw)
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in tokens])
print()
for t in tokens:
    print("{0:20}{1:20}".format(t, wnl.lemmatize(t, pos="v")))
print()
The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid
lemmas (or lexicon headwords). However, the resulting lemmas are not always 100% accurate with respect to the
lemmas found in the dictionary. To obtain the exact lemma as per the dictionary, POS tagging should be included in
the code, as in the loop below (see also the sketch that follows it):
for t in tokens:
    print("{0:20}{1:20}".format(t, wnl.lemmatize(t, pos="v")))
Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not
be an actual word, whereas a lemma is an actual word of the language.
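For instance, a minimal sketch (note that 'studi' is not an English word, while 'study' is a dictionary headword):
from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()
print(ps.stem("studies"))        # expected: 'studi' -- a stem, not a word
print(wnl.lemmatize("studies"))  # expected: 'study' -- a valid lemma
The following script puts stemming and lemmatization together on the same text: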
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

# raw is not defined in the handout; any sample text works, e.g. (assumed):
raw = "DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government."
words = raw.lower()
print(words)
print()
tokens = word_tokenize(words)
print("Tokens")
print(tokens)
print()
print("Lemmas")
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t, pos="v") for t in tokens])
print()
print("Porter Stemming")
ps = PorterStemmer()
print([ps.stem(t) for t in tokens])
print()
print("Lancaster Stemming")
ls = LancasterStemmer()
print([ls.stem(t) for t in tokens])
print()
print("Snowball Stemming")
sn = nltk.SnowballStemmer("english")
print([sn.stem(t) for t in tokens])
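Typically, you should observe that the Lancaster stemmer is the most aggressive of the three, often producing very short, opaque stems, while Snowball (also known as Porter2) behaves much like Porter with a few improved rules.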
NOTE:
In lemmatization, the WordNet corpus (and, in some pipelines, a stop-word corpus as well) is consulted to produce
the lemma, which makes lemmatization slower than stemming.
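If you want to verify the speed difference yourself, here is a rough timing sketch (the word list and repetition counts are arbitrary choices, and absolute numbers vary by machine):
import timeit

setup = """
from nltk.stem import PorterStemmer, WordNetLemmatizer
ps = PorterStemmer()
wnl = WordNetLemmatizer()
words = ["running", "studies", "better", "lying", "ponds"] * 100
"""
# Time 100 runs of stemming vs. lemmatizing the same 500 words.
print("Porter  :", timeit.timeit("[ps.stem(w) for w in words]", setup=setup, number=100))
print("WordNet :", timeit.timeit("[wnl.lemmatize(w, pos='v') for w in words]", setup=setup, number=100))
Snowball stemmers are available for several languages: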
import nltk
print(nltk.SnowballStemmer.languages)
print(len(nltk.SnowballStemmer.languages))
print()
text = "This is achieved in practice during stemming, a text preprocessing
operation."
tokens = nltk.tokenize.word_tokenize(text)
print()
stemmer = nltk.SnowballStemmer('english')
print([stemmer.stem(t) for t in tokens])
print()
text2 = "Ceci est réalisé en pratique lors du stemming, une opération de
prétraitement de texte."
tokens2 = nltk.tokenize.word_tokenize(text2)
print()
stemmer = nltk.SnowballStemmer('french')
print([stemmer.stem(t) for t in tokens2])
The TextBlob library can also be used here: the sketch below builds a TextBlob from the English sentence above, translates it to French, and stems the result. Note that detect_language() and translate() call an online translation service and were removed from recent TextBlob releases, so this snippet assumes an older TextBlob version.
from textblob import TextBlob

en_blob = TextBlob(text)  # the English sentence defined above
print(en_blob.detect_language())  # expected: 'en' (requires internet access)
fr_blob = en_blob.translate(from_lang="en", to='fr')
print(fr_blob)
tokens = fr_blob.words
print(tokens)
print()
stemmer = nltk.SnowballStemmer('french')
print([stemmer.stem(t) for t in tokens])
References:
1. Regular Expressions: The Complete Tutorial, by Jan Goyvaerts, 2007.
2. Speech and Language Processing, by Dan Jurafsky and James H. Martin. Prentice Hall Series in Artificial
Intelligence, 2008.
3. Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, 2014.
4. Lemmatization approaches with examples in Python (https://www.machinelearningplus.com/nlp/lemmatization-examples-python/)
Revision Quiz:
https://quizlet.com/512734559/test?answerTermSides=2&promptTermSides=6&questionCount=7&questionTypes=14&showImages=true
Take-home task:
Perform lemmatization including the POS tags ADJECTIVE, NOUN, VERB, and ADVERB. Write suitable Python code to produce the proper output.