Unit-2- NLP.pptx

Parsing in Natural Language Processing (NLP) involves analyzing the grammatical structure of sentences to understand word relationships and meanings. There are two main types of parsing: syntactic parsing, which focuses on grammatical structure, and semantic parsing, which extracts meaning. Various parsing techniques, such as top-down and bottom-up parsing, along with different types of parsers, are employed to facilitate tasks like machine translation and sentiment analysis.

UNIT - 2

Natural Language Processing


Parsing and Tagging
Introduction - What is Parsing in NLP?
• Parsing is the process of examining the grammatical structure and
relationships within a given sentence or text in natural language
processing (NLP). It involves analyzing the text to determine the roles of
individual words, such as nouns, verbs, and adjectives, as well as their
interrelationships.
• This analysis produces a structured representation of the text, allowing NLP
systems to understand how the words in a phrase connect to one another.
Parsers expose the structure of a sentence by constructing parse trees or
dependency trees that illustrate the hierarchical and syntactic relationships
between words.
• This stage is crucial for a variety of language understanding
tasks: it allows machines to extract meaning, generate coherent answers,
and perform tasks such as machine translation, sentiment analysis, and
information extraction.
Types of Parsing in NLP:
Parsing is a core step in NLP, allowing machines to perceive the
structure and meaning of text, which is required for a variety of language
processing activities. There are two main types of parsing in NLP, which are as
follows:
Types of Parsing in NLP
Syntactic Parsing
• Syntactic parsing deals with a sentence’s grammatical structure. It involves
looking at the sentence to determine parts of speech, sentence boundaries,
and word relationships.
The two most common approaches are:
• Constituency Parsing: Constituency Parsing builds parse trees that break
down a sentence into its constituents, such as noun phrases and verb
phrases. It displays a sentence’s hierarchical structure, demonstrating how
words are arranged into bigger grammatical units.
• Dependency Parsing: Dependency parsing depicts grammatical links
between words by constructing a tree structure in which each word in the
sentence is dependent on another. It is frequently used in tasks such as
information extraction and machine translation because it focuses on word
relationships such as subject-verb-object relations.
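A dependency parse is commonly represented as a set of head–relation–dependent triples. As an illustrative sketch, the triples below are written by hand for the sentence "John plays football" (they are not produced by a real parser, and the relation names follow the common nsubj/obj convention):

```python
# A hand-built dependency parse for "John plays football",
# written as (head, relation, dependent) triples.
deps = [
    ("plays", "nsubj", "John"),    # "John" is the subject of "plays"
    ("plays", "obj", "football"),  # "football" is the object of "plays"
]

def dependents(deps, head, relation):
    """Return all dependents of `head` linked by `relation`."""
    return [d for h, r, d in deps if h == head and r == relation]

print(dependents(deps, "plays", "nsubj"))  # ['John']
print(dependents(deps, "plays", "obj"))    # ['football']
```

Queries like "find the subject of every verb" over such triples are exactly what information-extraction systems run on dependency parses.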
Types of Parsing in NLP
Semantic Parsing
• Semantic parsing goes beyond syntactic structure to extract a
sentence’s meaning or semantics.
• It attempts to understand the roles of words in the context of a
certain task and how they interact with one another.
• Semantic parsing is utilized in a variety of NLP applications, such as
question answering, knowledge base populating, and text
understanding.
• It is essential for activities requiring the extraction of actionable
information from text.
Parsing Techniques in NLP
• The fundamental link between a sentence and its grammar is captured by a parse
tree: a tree that records how the grammar was used to derive
the sentence.
• There are mainly two parsing techniques, commonly known as top-down and
bottom-up.
Top-Down Parsing
• Using the top-down approach, the parser attempts to build a parse
tree from the root node S down to the leaves.
• The top-down technique has the advantage of never wasting time investigating
trees that cannot result in S, which indicates it never examines subtrees that
cannot find a place in some rooted tree.
• The procedure begins with the assumption that the input can be derived from the
selected start symbol S.
Parsing Techniques in NLP
• The next step is to find the tops of all the trees that can begin with S by looking at
the grammatical rules with S on the left-hand side, which generates all the
possible trees.
• Top-down parsing is a search with a specific objective in mind.
• It attempts to replicate the initial creation process by rederiving the sentence from
the start symbol, and the production tree is recreated from the top down.
• Top-down, left-to-right, and backtracking are prominent search strategies that are
used in this method.
• The search begins with the root node labeled S, i.e., the starting symbol, expands
the internal nodes using the next productions with the left-hand side equal to the
internal node, and continues until leaves are part of speech (terminals).
• If the leaf nodes (parts of speech) do not match the input string, we must
backtrack to the most recently processed node and apply another production.
Parsing Techniques in NLP
Let’s consider the grammar rules:
• Sentence (S)
• Noun Phrase (NP)
• Verb Phrase (VP)
• Determiner (Det)
• Nominal (Nom)
• Prepositional Phrase (PP)
• Verb (V)
• Noun (N)
• Proper Noun (Pnoun)
Parsing Techniques in NLP
Let’s consider the example:
“John is playing a game”, and apply Top-down parsing.
• S 🡪 NP VP
• NP 🡪 Det Nom
Parsing Techniques in NLP
• NP🡪 Pnoun
• VP🡪 Verb NP
• Pnoun 🡪 John
If the part of speech does not match the input string, backtrack to the
node NP.
Parsing Techniques in NLP
• Pnoun is matched, but the Verb does not match the next input
word ("is"), so backtrack to the node S and try the rule with an auxiliary:
• S 🡪 NP Aux VP
Parsing Techniques in NLP
• S🡪 NP VP
• S🡪 NP Aux VP
• NP🡪 Det Nom
• NP🡪 Pnoun | Noun
• VP🡪 Verb NP
• Pnoun 🡪 John
• Aux 🡪 is
• Verb 🡪 playing
• Noun 🡪 game
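The grammar above can be exercised with a minimal top-down, backtracking recognizer. This is a pure-Python sketch: the rule and lexicon names follow the slides, and the grammar is assumed to contain no left recursion (which would make naive top-down parsing loop forever):

```python
# Phrase-structure rules: left-hand side -> list of alternative right-hand sides
RULES = {
    "S":   [["NP", "VP"], ["NP", "Aux", "VP"]],
    "NP":  [["Det", "Nom"], ["Pnoun"]],
    "Nom": [["Noun"]],
    "VP":  [["Verb", "NP"]],
}
# Lexical rules: part of speech -> words it covers
LEXICON = {
    "Pnoun": {"John"}, "Aux": {"is"}, "Verb": {"playing"},
    "Det": {"a"}, "Noun": {"game"},
}

def expand(symbols, pos, words):
    """Yield every input position reachable after matching `symbols`.
    Backtracking falls out of trying each alternative in turn."""
    if not symbols:
        yield pos
        return
    head, rest = symbols[0], symbols[1:]
    if head in LEXICON:                       # part of speech: match one word
        if pos < len(words) and words[pos] in LEXICON[head]:
            yield from expand(rest, pos + 1, words)
    else:                                     # non-terminal: try each production
        for production in RULES[head]:
            yield from expand(production + rest, pos, words)

def accepts(sentence):
    words = sentence.split()
    return any(p == len(words) for p in expand(["S"], 0, words))

print(accepts("John is playing a game"))  # True
print(accepts("John is a"))               # False
```

As in the slides, the first alternative S 🡪 NP VP fails at the auxiliary "is", and the parser backtracks into S 🡪 NP Aux VP, which succeeds.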
Parsing Techniques in NLP
Bottom-Up Parsing
• Bottom-up parsing begins with the words of input and attempts to
create trees from the words up, again by applying grammar rules one
at a time.
• The parse is successful if it builds a tree rooted in the start symbol S
that covers all of the input. Bottom-up parsing is a type of
data-driven search: it attempts to reverse the production process
and reduce the sentence back to the start symbol S.
• It reverses the productions to reduce the string of tokens to the
start symbol, and the string is recognized by generating the
rightmost derivation in reverse.
Parsing Techniques in NLP
Bottom-Up Parsing
• The goal of reaching the starting symbol S is accomplished through a
series of reductions; when the right-hand side of some rule matches
the substring of the input string, the substring is replaced with the
left-hand side of the matched production, and the process is repeated
until the starting symbol is reached.
• Bottom-up parsing can be thought of as a reduction process.
Bottom-up parsing is the construction of a parse tree in postorder.
Parsing Techniques in NLP
• Considering the grammatical rules stated above and the input sentence
“John is playing a game”. The bottom-up parsing operates as follows:
Parsing Techniques in NLP
Example 1 and Example 2 (worked reduction diagrams, shown as figures in the original slides).
Types of Parsers
1) Recursive Descent Parser
• It is a top-down parsing method that may or may not require
backtracking. We recursively scan the input to make the parse tree,
which is created from the top, while reading the input from left to
right.
• Example: Consider the following grammar:
• S->pTq
• T->xy|x
• and the input string: pxq.
Types of Parsers
The parser will parse the string in the following steps:
• First, the character 'p' is scanned and the rule S->pTq is applied;
'p' matches the first symbol of the rule.
• The next character 'x' is then scanned. The parser moves on to the
non-terminal T and tries the first alternative of the second
rule, T->xy. The 'x' matches, so the next character, 'q', is scanned.
• Now, 'y' does not match 'q', so the alternative T->xy is
discarded. The parser backtracks and tries T->x, which matches
the 'x' already scanned.
• The final character 'q' is then matched against the last symbol of
S->pTq, and the parse succeeds.
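This walk can be written directly as a recursive descent recognizer (a pure-Python sketch: each grammar symbol becomes a function, and backtracking over the alternatives of T is made explicit by yielding every candidate end position):

```python
def parse_T(s, i):
    # T -> xy | x : try each alternative in order; yielding both
    # candidate end positions is what enables backtracking
    if s[i:i + 2] == "xy":
        yield i + 2
    if s[i:i + 1] == "x":
        yield i + 1

def parse_S(s):
    # S -> pTq
    if s[0:1] == "p":
        for j in parse_T(s, 1):       # end positions after matching T
            if s[j:j + 1] == "q":
                yield j + 1

def accepts(s):
    return any(end == len(s) for end in parse_S(s))

print(accepts("pxq"))   # True  (T -> x, after T -> xy fails at 'q')
print(accepts("pxyq"))  # True  (T -> xy)
print(accepts("pxy"))   # False (no trailing 'q')
```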
Types of Parsers
2) Shift Reduce Parser
• This is a bottom-up parser that uses an input buffer for storing the input
string and a stack for accessing the production rules.
Shift Reduce parser uses the following operations:
• Shift:
Transfer a symbol from the input buffer to the stack.
• Reduce:
Replace a matched right-hand side on top of the stack with the
left-hand side of the production.
• Accept:
The string is accepted by the grammar.
• Error:
The parser can perform none of the above three.
Types of Parsers
Example:
• Consider the grammar:
• T->T+T
• T->T*T
• T->a
• The input string: a+a+a
Types of Parsers
Stack    Input     Action
$        a+a+a$    Shift
$a       +a+a$     Reduce T->a
$T       +a+a$     Shift
$T+      a+a$      Shift
$T+a     +a$       Reduce T->a
$T+T     +a$       Reduce T->T+T
$T       +a$       Shift
$T+      a$        Shift
$T+a     $         Reduce T->a
$T+T     $         Reduce T->T+T
$T       $         Accept
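The table above can be reproduced with a naive shift-reduce recognizer. This pure-Python sketch reduces greedily whenever a right-hand side matches the top of the stack; real shift-reduce parsers consult a parse table to resolve shift/reduce conflicts, which this toy grammar happens not to need:

```python
def shift_reduce(tokens, productions):
    """Return (accepted, trace) for a token string. The start symbol
    is assumed to be "T", matching the toy grammar above."""
    stack, buffer, trace = [], list(tokens), []
    while True:
        # Reduce greedily while some right-hand side matches the stack top
        reduced = True
        while reduced:
            reduced = False
            for lhs, rhs in productions:
                if len(stack) >= len(rhs) and stack[-len(rhs):] == rhs:
                    stack[-len(rhs):] = [lhs]
                    trace.append("Reduce " + lhs + "->" + "".join(rhs))
                    reduced = True
                    break
        if not buffer:
            break
        stack.append(buffer.pop(0))   # Shift
        trace.append("Shift")
    return stack == ["T"], trace

# Grammar: T -> T+T | T*T | a (longer right-hand sides listed first)
GRAMMAR = [("T", ["T", "+", "T"]), ("T", ["T", "*", "T"]), ("T", ["a"])]

ok, trace = shift_reduce("a+a+a", GRAMMAR)
print(ok)  # True, via the same Shift/Reduce sequence as the table
```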
Types of Parsers
3) Shallow Parser
• Shallow Parsing, also known as chunking, is a parsing technique that
identifies only phrases (e.g., noun phrases, verb phrases) without
constructing a full parse tree.
• Instead of a deep hierarchical structure, it focuses on chunks of
words.
• Applications: Used in Named Entity Recognition (NER), Part-of-Speech
(POS) tagging, and information retrieval.
Types of Parsers
• Example:
• Input: "John plays football in the park."
• Output (Chunks): [John] [plays] [football] [in the park]
• Here, it detects noun phrases (NP) and verb phrases (VP) but does not
analyze deeper syntactic structures.
Techniques Used:
• Regular Expressions
• Machine Learning models (e.g., CRF, BiLSTM)
• POS tagging
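A minimal noun-phrase chunker can be sketched in pure Python over already POS-tagged input (NLTK's RegexpParser does the same job with declarative tag patterns; the tags below are from the Penn Treebank tag set used later in these slides):

```python
# Tags that may belong to a noun phrase: determiner, adjective, nouns
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}

def chunk_noun_phrases(tagged):
    """Group maximal runs of NP-compatible tags into flat chunks."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in NP_TAGS:
            current.append(word)
        elif current:                      # run ended: close the chunk
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tagged = [("John", "NNP"), ("plays", "VBZ"), ("football", "NN"),
          ("in", "IN"), ("the", "DT"), ("park", "NN")]
print(chunk_noun_phrases(tagged))  # ['John', 'football', 'the park']
```

Note that the output is flat, matching the shallow-parsing example above: no hierarchy is built inside or between chunks.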
Types of Parsers
4) Deep Parser (Full Parsing)
• Deep Parsing involves constructing a complete hierarchical structure (parse
tree) of a sentence by analyzing its syntax and relationships between
words.
• It Provides detailed grammatical analysis, including dependencies and
syntactic structures.
• Application: Used in machine translation, sentiment analysis, question
answering, and speech recognition.
Techniques Used:
• Constituency Parsing: Uses grammar rules (e.g., Context-Free Grammar).
• Dependency Parsing: Focuses on relationships between words
(subject-verb-object).
Types of Parsers
• Example:
• Input: "John plays football in the park."
• Output (Full Parse Tree):
S
├── NP (John)
└── VP
    ├── V (plays)
    ├── NP (football)
    └── PP
        ├── P (in)
        └── NP (the park)
Types of Parsers
Comparison: Shallow vs. Deep Parsing

Feature      Shallow Parser                          Deep Parser
Depth        Surface-level (chunks)                  Hierarchical structure
Output       Phrases (NP, VP)                        Full parse tree
Speed        Faster                                  Slower
Complexity   Low                                     High
Use Case     POS tagging,                            Machine translation,
             Named Entity Recognition                Question Answering Systems
Tokenization
• Tokenization breaks text into smaller parts for easier machine analysis,
helping machines understand human language.
• Tokenization refers to the process of converting a sequence of text into
smaller parts, known as tokens. These tokens can be as small as characters
or as long as words. The primary reason this process matters is that it helps
machines understand human language by breaking it down into bite-sized
pieces, which are easier to analyze.
• The primary goal of tokenization is to represent text in a manner that's
meaningful for machines without losing its context. By converting text into
tokens, algorithms can more easily identify patterns. This pattern
recognition is crucial because it makes it possible for machines to
understand and respond to human input. For instance, when a machine
encounters the word "running", it doesn't see it as a singular entity but
rather as a combination of tokens that it can analyze and derive meaning
from.
Tokenization
Types of Tokenization
• Word tokenization. This method breaks text down into individual words. It's the
most common approach and is particularly effective for languages with clear word
boundaries like English.
• Character tokenization. Here, the text is segmented into individual characters.
This method is beneficial for languages that lack clear word boundaries or for
tasks that require a granular analysis, such as spelling correction.
• Subword tokenization. Striking a balance between word and character
tokenization, this method breaks text into units that might be larger than a single
character but smaller than a full word. For instance, "Chatbots" could be
tokenized into "Chat" and "bots". This approach is especially useful for languages
that form meaning by combining smaller units or when dealing with
out-of-vocabulary words in NLP tasks.
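The three granularities can be contrasted with a toy sketch. The subword vocabulary below is hand-picked purely for illustration; real subword tokenizers (e.g. Byte-Pair Encoding) learn their vocabulary from data:

```python
sentence = "Chatbots are helpful"

# Word tokenization: split on whitespace
print(sentence.split())        # ['Chatbots', 'are', 'helpful']
# Character tokenization: every character is a token
print(list("Chatbots"))        # ['C', 'h', 'a', 't', 'b', 'o', 't', 's']

# Subword tokenization against a toy, hand-picked vocabulary
VOCAB = {"Chat", "bots", "are", "help", "ful"}

def subword_tokenize(word, vocab):
    """Greedy longest-match segmentation; unknown spans fall back to characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # longest candidate first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # no vocabulary match: one character
            i += 1
    return pieces

print(subword_tokenize("Chatbots", VOCAB))  # ['Chat', 'bots']
print(subword_tokenize("helpful", VOCAB))   # ['help', 'ful']
```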
Tokenization
Tokenization using NLTK
• Natural Language ToolKit is a library written in Python for symbolic
and statistical Natural Language Processing.
• NLTK provides a tokenize module whose methods fall into two
sub-categories:
• Word tokenize: we use the word_tokenize() method to split a
sentence into tokens or words.
• Sentence tokenize: we use the sent_tokenize() method to split a
document or paragraph into sentences.
Tokenization- Word Tokenization
Input:
from nltk.tokenize import word_tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization
and a multi-planet species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1
became the first privately developed liquid-fuel launch vehicle to orbit the Earth."""
word_tokenize(text)
Output:
['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a',
'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city',
'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’', 's', 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed',
'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']
Tokenization - Sentence Tokenization
Input:
from nltk.tokenize import sent_tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization
and a multi-planet species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1
became the first privately developed liquid-fuel launch vehicle to orbit the Earth."""
sent_tokenize(text)
Output:
['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a
multi-planet \nspecies by building a self-sustaining city on Mars.',

'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit
the Earth.']
Removing Stopwords
Stopwords
• Stopwords are the most common words in any natural language. For
the purpose of analyzing text data and building NLP models, these
stopwords might not add much value to the meaning of the
document.
• Generally, the most common words used in a text are “the”, “is”, “in”,
“for”, “where”, “when”, “to”, “at” etc.
• Consider this text string – "There is a pen on the table". Now, the
words "is", "a", "on", and "the" add no meaning to the statement
while parsing it, whereas words like "there", "pen", and "table" are
the keywords and tell us what the statement is all about.
Removing Stopwords
Why do we Need to Remove Stopwords?
• Removing stopwords is not a hard and fast rule in NLP. It depends upon
the task that we are working on. For tasks like text classification, where the
text is to be classified into different categories, stopwords are removed or
excluded from the given text so that more focus can be given to those
words which define the meaning of the text.
• Just like we saw in the above section, words like there, pen, and table add
more meaning to the text as compared to the words is and on.
• However, in tasks like machine translation and text summarization,
removing stopwords is not advisable.
Removing Stopwords
Example: Remove Stopwords
1. Stopword Removal using NLTK
• NLTK, or the Natural Language Toolkit, is a treasure trove of a library
for text preprocessing. It’s one of my favorite Python libraries. NLTK
has a list of stopwords stored in 16 different languages.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords') # Download the 'stopwords' dataset
set(stopwords.words('english')) # Now you can access the stopwords
Removing Stopwords
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download the 'punkt_tab' dataset
nltk.download('punkt_tab')
# You can now use word_tokenize as before
set(stopwords.words('english'))
text = """He determined to drop his litigation with the monastry, and
relinguish his claims to the wood-cuting and fishery rihgts at once. He
was the more ready to do this becuase the rights had become much
less valuable, and he had indeed the vaguest idea where the wood and
river in question were."""  # sample sentence
Removing Stopwords
stop_words = set(stopwords.words('english')) # set of stop words
word_tokens = word_tokenize(text) # tokens of words
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print("\n\nOriginal Sentence \n\n")
print(" ".join(word_tokens))
print("\n\nFiltered Sentence \n\n")
print(" ".join(filtered_sentence))
Normalization: Word stemming
Stemming is a method in text processing that eliminates prefixes and
suffixes from words, transforming them into their fundamental or root
form. The main objective of stemming is to streamline and standardize
words, enhancing the effectiveness of natural language
processing tasks.
Why do we Need Stemming?
• In NLP use cases such as sentiment analysis, spam classification, and
restaurant reviews, getting the base word is important to know
whether a word is positive or negative. Stemming is used to obtain
that base word.
Normalization: Word stemming
• Simplifying words to their most basic form is called stemming, and it is
made easier by stemmers or stemming algorithms. For example,
“chocolates” becomes “chocolate” and “retrieval” becomes “retrieve.”
• This is crucial for pipelines for natural language processing, which use
tokenized words that are acquired from the first stage of dissecting a
document into its constituent words.
• Stemming in natural language processing reduces words to their base or
root form, aiding in text normalization for easier processing. This technique
is crucial in tasks like text classification, information retrieval, and text
summarization.
• While beneficial, stemming has drawbacks, including potential impacts on
text readability and occasional inaccuracies in determining the correct root
form of a word.
Normalization: Word stemming
• Python NLTK contains a variety of stemming algorithms, including several
types.
Porter’s Stemmer
• It is one of the most popular stemming methods proposed in 1980. It is
based on the idea that the suffixes in the English language are made up of a
combination of smaller and simpler suffixes. This stemmer is known for its
speed and simplicity.
• The main applications of the Porter Stemmer include data mining and
information retrieval. However, it is limited to English
words. Also, morphological variants are mapped onto the same stem, and the
output stem is not necessarily a meaningful word. The algorithm is fairly
lengthy and is known to be the oldest stemmer.
Normalization: Word stemming
from nltk.stem import PorterStemmer
# Create a Porter Stemmer instance
porter_stemmer = PorterStemmer()
# Example words for stemming
words = ["running", "jumps", "happily"]
# Apply stemming to each word
stemmed_words = [porter_stemmer.stem(word) for word in words]
# Print the results
print("Original words:", words)
print("Stemmed words:", stemmed_words)
Normalization: Lemmatization
• Lemmatization is a fundamental text pre-processing technique widely
applied in natural language processing (NLP) and machine learning. Serving
a purpose akin to stemming, lemmatization seeks to distill words to their
foundational forms. In this linguistic refinement, the resultant base word is
referred to as a “lemma.”
• Lemmatization is the process of grouping together the different inflected
forms of a word so they can be analyzed as a single item. Lemmatization is
similar to stemming but it brings context to the words. So, it links words
with similar meanings to one word.
• Text preprocessing includes both Stemming as well as lemmatization.
Lemmatization is preferred over Stemming because lemmatization does
morphological analysis of the words.
Normalization: Lemmatization
Examples of lemmatization:
• rocks : rock
• corpora : corpus
• better : good
Lemmatization Techniques
• Lemmatization techniques in natural language processing (NLP) involve
methods to identify and transform words into their base or root forms,
known as lemmas. These approaches contribute to text normalization,
facilitating more accurate language analysis and processing in various NLP
applications. Three types of lemmatization techniques are:
Normalization: Lemmatization
1. Rule Based Lemmatization
• Rule-based lemmatization involves the application of predefined rules to
derive the base or root form of a word. Unlike machine learning-based
approaches, which learn from data, rule-based lemmatization relies on
linguistic rules and patterns.
Here’s a simplified example of rule-based lemmatization for English verbs:
• Rule: For regular verbs ending in “-ed,” remove the “-ed” suffix.
• Word: “walked”
• Rule Application: Remove “-ed”
• Result: "walk"
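The rule above can be extended into a tiny rule-based lemmatizer. This is a sketch covering only a few regular English patterns; irregular forms like "went" are exactly what rules cannot handle, motivating the dictionary-based approach below:

```python
def rule_lemmatize(word):
    """Apply ordered suffix-stripping rules (regular inflections only).
    The length guards keep short words like "bed" or "bus" intact."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"    # studies -> study
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]          # playing -> play
    if word.endswith("ed") and len(word) > 3:
        return word[:-2]          # walked -> walk
    if word.endswith("s") and len(word) > 3:
        return word[:-1]          # cats -> cat
    return word

print(rule_lemmatize("walked"))   # walk
print(rule_lemmatize("studies"))  # study
print(rule_lemmatize("went"))     # went  (irregular: the rules cannot help)
```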
Normalization: Lemmatization
2. Dictionary-Based Lemmatization
• Dictionary-based lemmatization relies on predefined dictionaries or lookup tables
to map words to their corresponding base forms or lemmas. Each word is matched
against the dictionary entries to find its lemma. This method is effective for
languages with well-defined rules.
Suppose we have a dictionary with lemmatized forms for some words:
• ‘running’ -> ‘run’
• ‘better’ -> ‘good’
• ‘went’ -> ‘go’
• When we apply dictionary-based lemmatization to a text like “I was running to
become a better athlete, and then I went home,” the resulting lemmatized form
would be: “I was run to become a good athlete, and then I go home.”
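The lookup described above is only a few lines of Python. The dictionary is the toy one from the slide; a real system would consult a lexical resource such as WordNet instead:

```python
# Toy lemma dictionary from the example above
LEMMA_DICT = {"running": "run", "better": "good", "went": "go"}

def dict_lemmatize(text):
    """Replace each word with its dictionary lemma, if one exists."""
    return " ".join(LEMMA_DICT.get(word, word) for word in text.split())

text = "I was running to become a better athlete, and then I went home"
print(dict_lemmatize(text))
# I was run to become a good athlete, and then I go home
```

Note that words absent from the dictionary pass through unchanged, so coverage depends entirely on the size of the lookup table.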
Normalization: Lemmatization
3. Machine Learning-Based Lemmatization
• Machine learning-based lemmatization leverages computational models to
automatically learn the relationships between words and their base forms. Unlike
rule-based or dictionary-based approaches, machine learning models, such as
neural networks or statistical models, are trained on large text datasets to
generalize patterns in language.
• Example:
• Consider a machine learning-based lemmatizer trained on diverse texts. When
encountering the word ‘went,’ the model, having learned patterns, predicts the
base form as ‘go.’
• Similarly, for ‘happier,’ the model deduces ‘happy’ as the lemma. The advantage
lies in the model’s ability to adapt to varied linguistic nuances and handle
irregularities, making it robust for lemmatizing diverse vocabularies.
Normalization: Lemmatization
# import these modules
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))
Part of Speech Tagging
• One of the core tasks in Natural Language Processing (NLP) is Parts of
Speech (PoS) tagging, which assigns each word in a text a
grammatical category, such as noun, verb, adjective, or adverb.
Through improved comprehension of phrase structure and semantics,
this technique makes it possible for machines to analyze and
comprehend human language more accurately.
• In many NLP applications, including machine translation, sentiment
analysis, and information retrieval, PoS tagging is essential. PoS
tagging serves as a link between language and machine
understanding, enabling the creation of complex language processing
systems and serving as the foundation for advanced linguistic
analysis.
Part of Speech Tagging
• Parts of Speech tagging is a linguistic activity in Natural Language
Processing (NLP) wherein each word in a document is given a particular
part of speech (adverb, adjective, verb, etc.) or grammatical category.
• Through the addition of a layer of syntactic and semantic information to the
words, this procedure makes it easier to comprehend the sentence’s
structure and meaning.
• In NLP applications, POS tagging is useful for machine translation, named
entity recognition, and information extraction, among other things. It also
helps resolve ambiguity in words with multiple meanings and reveals a
sentence's grammatical structure.
Part of Speech Tagging
Example of POS Tagging
• Consider the sentence: “The quick brown fox jumps over the lazy dog.”
• After performing POS Tagging:
• “The” is tagged as determiner (DT)
• “quick” is tagged as adjective (JJ)
• “brown” is tagged as adjective (JJ)
• “fox” is tagged as noun (NN)
• “jumps” is tagged as verb (VBZ)
• “over” is tagged as preposition (IN)
• “the” is tagged as determiner (DT)
• “lazy” is tagged as adjective (JJ)
• “dog” is tagged as noun (NN)
Part of Speech Tagging
# Importing the NLTK library
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
# Sample text
text = "NLTK is a powerful library for natural language processing."
# Tokenizing the text into words
words = word_tokenize(text)
# Performing PoS tagging
pos_tags = pos_tag(words)
# Displaying the PoS tagged result in separate lines
print("Original Text:")
print(text)
print("\nPoS Tagging Result:")
for word, tag in pos_tags:
    print(f"{word}: {tag}")
