Unit-2- NLP.pptx
Removing Stopwords
Stopwords
• Stopwords are the most common words in any natural language. For
the purpose of analyzing text data and building NLP models, these
stopwords might not add much value to the meaning of the
document.
• Generally, the most common words used in a text are “the”, “is”, “in”,
“for”, “where”, “when”, “to”, “at” etc.
• Consider this text string – “There is a pen on the table”. The words
“is”, “a”, “on”, and “the” add no meaning to the statement while
parsing it, whereas words like “there”, “pen”, and “table” are the
keywords and tell us what the statement is all about.
Removing Stopwords
Why do we Need to Remove Stopwords?
• Removing stopwords is not a hard and fast rule in NLP. It depends upon
the task that we are working on. For tasks like text classification, where the
text is to be classified into different categories, stopwords are removed or
excluded from the given text so that more focus can be given to those
words which define the meaning of the text.
• Just as we saw in the section above, words like there, pen, and table add
more meaning to the text than words like is and on.
• However, in tasks like machine translation and text summarization,
removing stopwords is not advisable.
Removing Stopwords
Example: Remove Stopwords
1. Stopword Removal using NLTK
• NLTK, or the Natural Language Toolkit, is a treasure trove of a library
for text preprocessing. It’s one of my favorite Python libraries. NLTK
has a list of stopwords stored in 16 different languages.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # Download the 'stopwords' dataset (one-time)
print(set(stopwords.words('english')))  # Inspect the English stopword list
Removing Stopwords
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the 'punkt_tab' tokenizer models and the stopword list (one-time)
nltk.download('punkt_tab')
nltk.download('stopwords')

# Sample text
text = """He determined to drop his litigation with the monastery, and
relinquish his claims to the wood-cutting and fishery rights at once. He
was the more ready to do this because the rights had become much
less valuable, and he had indeed the vaguest idea where the wood and
river in question were."""
Removing Stopwords
stop_words = set(stopwords.words('english'))  # set of stop words
word_tokens = word_tokenize(text)  # tokens of words

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print("\n\nOriginal Sentence \n\n")
print(" ".join(word_tokens))

print("\n\nFiltered Sentence \n\n")
print(" ".join(filtered_sentence))
Normalization: Word stemming
Stemming is a method in text processing that eliminates prefixes and
suffixes from words, transforming them into their fundamental or root
form. The main objective of stemming is to streamline and standardize
words, enhancing the effectiveness of natural language processing tasks.
Why we Need Stemming?
• In NLP use cases such as sentiment analysis, spam classification, and
restaurant-review mining, getting the base word is important for
knowing whether a word is positive or negative. Stemming is used to
get that base word.
Normalization: Word stemming
• Simplifying words to their most basic form is called stemming, and it is
made easier by stemmers or stemming algorithms. For example,
“chocolates” becomes “chocolate” and “retrieval” becomes “retrieve.”
• This is crucial for natural language processing pipelines, which use
tokenized words obtained from the first stage of breaking a document
down into its constituent words.
• Stemming in natural language processing reduces words to their base or
root form, aiding in text normalization for easier processing. This technique
is crucial in tasks like text classification, information retrieval, and text
summarization.
• While beneficial, stemming has drawbacks, including potential impacts on
text readability and occasional inaccuracies in determining the correct root
form of a word.
Normalization: Word stemming
• Python NLTK contains a variety of stemming algorithms; the Porter
stemmer below is the most widely used, and a sketch of two
alternatives follows its example.
Porter’s Stemmer
• It is one of the most popular stemming methods, proposed by Martin
Porter in 1980. It is based on the idea that suffixes in the English
language are made up of a combination of smaller and simpler suffixes.
This stemmer is known for its speed and simplicity.
• The main applications of the Porter Stemmer are data mining and
information retrieval. However, it is limited to English words. Also, a
group of different word forms may be mapped onto the same stem, and
the output stem is not necessarily a meaningful word. The algorithm is
fairly lengthy and is one of the oldest stemmers in use.
Normalization: Word stemming
from nltk.stem import PorterStemmer

# Create a Porter Stemmer instance
porter_stemmer = PorterStemmer()

# Example words for stemming
words = ["running", "jumps", "happily"]

# Apply stemming to each word
stemmed_words = [porter_stemmer.stem(word) for word in words]

# Print the results
print("Original words:", words)
print("Stemmed words:", stemmed_words)  # ['run', 'jump', 'happili']
Normalization: Lemmatization
• Lemmatization is a fundamental text pre-processing technique widely
applied in natural language processing (NLP) and machine learning. Serving
a purpose akin to stemming, lemmatization seeks to distill words to their
foundational forms. In this linguistic refinement, the resultant base word is
referred to as a “lemma.”
• Lemmatization is the process of grouping together the different inflected
forms of a word so they can be analyzed as a single item. Lemmatization is
similar to stemming but it brings context to the words. So, it links words
with similar meanings to one word.
• Text preprocessing includes both stemming and lemmatization.
Lemmatization is often preferred over stemming because lemmatization
does a morphological analysis of the words.
Normalization: Lemmatization
Examples of lemmatization:
• rocks : rock
• corpora : corpus
• better : good
Lemmatization Techniques
• Lemmatization techniques in natural language processing (NLP) involve
methods to identify and transform words into their base or root forms,
known as lemmas. These approaches contribute to text normalization,
facilitating more accurate language analysis and processing in various NLP
applications. Three types of lemmatization techniques are:
Normalization: Lemmatization
1. Rule Based Lemmatization
• Rule-based lemmatization involves the application of predefined rules to
derive the base or root form of a word. Unlike machine learning-based
approaches, which learn from data, rule-based lemmatization relies on
linguistic rules and patterns.
Here’s a simplified example of rule-based lemmatization for English verbs:
• Rule: For regular verbs ending in “-ed,” remove the “-ed” suffix.
• Word: “walked”
• Rule Application: Remove “-ed”
• Result: “walk”
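A minimal sketch of this rule in Python; the rule_lemmatize helper and its length checks are hypothetical, for illustration only, and a real rule-based lemmatizer would need many more rules plus exception lists:

def rule_lemmatize(word):
    # Hypothetical rule: regular verbs ending in "-ed" drop the suffix
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    # Hypothetical rule: regular plurals ending in "-s" drop the suffix
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word  # no rule matched; return the word unchanged

print(rule_lemmatize("walked"))  # walk
print(rule_lemmatize("rocks"))   # rock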
Normalization: Lemmatization
2. Dictionary-Based Lemmatization
• Dictionary-based lemmatization relies on predefined dictionaries or lookup tables
to map words to their corresponding base forms or lemmas. Each word is matched
against the dictionary entries to find its lemma. This method is effective for
languages with well-defined rules.
Suppose we have a dictionary with lemmatized forms for some words:
• ‘running’ -> ‘run’
• ‘better’ -> ‘good’
• ‘went’ -> ‘go’
• When we apply dictionary-based lemmatization to a text like “I was running to
become a better athlete, and then I went home,” the resulting lemmatized form
would be: “I was run to become a good athlete, and then I go home.”
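A minimal sketch of that lookup, using the three dictionary entries above; the fallback of returning unknown words unchanged is an illustrative assumption:

# Illustrative lookup table from the example above
lemma_dict = {'running': 'run', 'better': 'good', 'went': 'go'}

text = "I was running to become a better athlete, and then I went home"
# Look each word up; leave words that are not in the dictionary unchanged
lemmatized = " ".join(lemma_dict.get(w, w) for w in text.split())
print(lemmatized)  # I was run to become a good athlete, and then I go home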
Normalization: Lemmatization
3. Machine Learning-Based Lemmatization
• Machine learning-based lemmatization leverages computational models to
automatically learn the relationships between words and their base forms. Unlike
rule-based or dictionary-based approaches, machine learning models, such as
neural networks or statistical models, are trained on large text datasets to
generalize patterns in language.
• Example:
• Consider a machine learning-based lemmatizer trained on diverse texts. When
encountering the word ‘went,’ the model, having learned patterns, predicts the
base form as ‘go.’
• Similarly, for ‘happier,’ the model deduces ‘happy’ as the lemma. The advantage
lies in the model’s ability to adapt to varied linguistic nuances and handle
irregularities, making it robust for lemmatizing diverse vocabularies.
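As a hedged sketch of this idea in practice, spaCy’s trained English pipeline assigns lemmas with the help of a statistical model’s part-of-speech predictions (assumes spaCy and its en_core_web_sm model are installed):

import spacy

# Load a small trained English pipeline
# (install first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# The tagger's predictions guide the lemmatizer, so 'went' -> 'go'
for token in nlp("I went home feeling happier"):
    print(token.text, "->", token.lemma_)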
Normalization: Lemmatization
# import these modules
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data used by the lemmatizer (one-time)

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))  # rocks : rock
print("corpora :", lemmatizer.lemmatize("corpora"))  # corpora : corpus

# pos="a" tells the lemmatizer to treat the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))  # better : good
Part of Speech Tagging
• One of the core tasks in Natural Language Processing (NLP) is Parts of
Speech (PoS) tagging, which assigns each word in a text a
grammatical category, such as noun, verb, adjective, or adverb.
By improving comprehension of phrase structure and semantics,
this technique makes it possible for machines to analyze and
understand human language more accurately.
• In many NLP applications, including machine translation, sentiment
analysis, and information retrieval, PoS tagging is essential. PoS
tagging serves as a link between language and machine
understanding, enabling the creation of complex language processing
systems and serving as the foundation for advanced linguistic
analysis.
Part of Speech Tagging
• Parts of Speech tagging is a linguistic activity in Natural Language
Processing (NLP) wherein each word in a document is given a particular
part of speech (adverb, adjective, verb, etc.) or grammatical category.
• By adding a layer of syntactic and semantic information to the words,
this procedure makes it easier to comprehend a sentence’s structure
and meaning.
• In NLP applications, PoS tagging is useful for machine translation, named
entity recognition, and information extraction, among other things. It also
works well for resolving ambiguity in words with multiple meanings and
for revealing a sentence’s grammatical structure.
Part of Speech Tagging
Example of POS Tagging
• Consider the sentence: “The quick brown fox jumps over the lazy dog.”
• After performing POS Tagging:
• “The” is tagged as determiner (DT)
• “quick” is tagged as adjective (JJ)
• “brown” is tagged as adjective (JJ)
• “fox” is tagged as noun (NN)
• “jumps” is tagged as verb, third-person singular present (VBZ)
• “over” is tagged as preposition (IN)
• “the” is tagged as determiner (DT)
• “lazy” is tagged as adjective (JJ)
• “dog” is tagged as noun (NN)
Part of Speech Tagging
# Importing the NLTK library
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download the tokenizer and tagger models (one-time)
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')

# Sample text
text = "NLTK is a powerful library for natural language processing."

# Tokenizing the text into words
words = word_tokenize(text)

# Performing PoS tagging
pos_tags = pos_tag(words)

# Displaying the PoS tagged result in separate lines
print("Original Text:")
print(text)

print("\nPoS Tagging Result:")
for word, tag in pos_tags:  # 'tag' avoids shadowing the pos_tag function
    print(f"{word}: {tag}")