
NATURAL LANGUAGE PROCESSING

UNIT 2

Syntax: Formal Grammars of English - Word Level Analysis: Regular Expressions - Finite-State Automata - Syntactic Analysis / Parsing: Context-free Grammar - Types of Parsing: Morphological Parsing, Syntactic Parsing, Statistical Parsing, Probabilistic Parsing, Constituency Parsing - Spelling Error Detection and Correction - Words and Word Classes - Part-of-Speech Tagging.
2.3 PARSING

2.3.1 PARSING / SYNTACTIC ANALYSIS / SYNTAX ANALYSIS

The word ‘parsing’ comes from the Latin word ‘pars’ (meaning ‘part’) and refers to drawing the exact structure and meaning from text. Parsing is also called syntactic analysis or syntax analysis.
Syntax analysis determines the syntactic structure of a text and checks it for well-formedness against the rules of the formal grammar of the language.

Fig. 1 Syntactic Analysis

2.3.2 CONCEPT OF GRAMMAR

Grammar is essential for describing the syntactic structure of well-formed programs. A mathematical model of grammar was given by Noam Chomsky in 1956, and it has proved effective for describing computer languages.
Mathematically, a grammar G can be formally written as a 4-tuple (N, T, S, P) where −
 N or VN = set of non-terminal symbols, i.e., variables.
 T or ∑ = set of terminal symbols.
 S = start symbol, where S ∈ N.
 P denotes the production rules for terminals as well as non-terminals. Each rule has the form α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.


Context Free Grammar (CFG)

Context-free grammar, also called CFG, is a notation for describing languages and a superset of regular grammar.

A CFG consists of a finite set of grammar rules with the following four components −
Set of Non-terminals: Denoted by V. The non-terminals are syntactic variables that denote sets of strings, which help define the language generated by the grammar.
Set of Terminals: Also called tokens, and denoted by Σ. Strings are formed from these basic symbols.
Set of Productions: Denoted by P. The set defines how the terminals and non-terminals can be combined. Every production consists of a non-terminal (the left-hand side), an arrow, and a sequence of terminals and/or non-terminals (the right-hand side).
Start Symbol: Every derivation begins from the start symbol, denoted by S. A designated non-terminal always serves as the start symbol.
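To make these four components concrete, here is a minimal sketch using NLTK's CFG class (the library choice is my assumption; the notes do not name one). The grammar encoded is the one used in the derivation examples later in this section:

import nltk

# The grammar S -> aB | bA, A -> aS | bAA | a, B -> bS | aBB | b,
# written in NLTK's notation (quoted symbols are terminals).
grammar = nltk.CFG.fromstring("""
    S -> 'a' B | 'b' A
    A -> 'a' S | 'b' A A | 'a'
    B -> 'b' S | 'a' B B | 'b'
""")

print(grammar.start())          # the start symbol: S
for prod in grammar.productions():
    print(prod)                 # the production set P, one rule per line

Printing the productions confirms the terminal set {a, b}, the non-terminal set {S, A, B}, and the start symbol S.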
Regular Grammar

Regular grammars (RGs) are CFGs that generate regular languages. A regular grammar is a CFG whose productions are restricted to two forms, either A → a or A → aB, where A, B ∈ N and a ∈ Σ. Regular grammars are equivalent to regular expressions; they encode precisely those languages that can be recognized by a deterministic finite automaton (DFA).

A regular grammar is a four-tuple G = (N, Σ, P, S), where

1. N is an alphabet called the set of non-terminals.
2. Σ is an alphabet called the set of terminals, with Σ ∩ N = Ø.
3. P is a finite set of productions or rules of the form A → w, where A ∈ N and w ∈ Σ*N ∪ Σ*.
4. S is the start symbol, S ∈ N.

Notice that productions in a regular grammar have at most one nonterminal on the right-hand
side and that this nonterminal always occurs at the end of a production.
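As a small illustration (my own example, not from the notes), take the regular grammar S → aS | bS | a: every production has one of the restricted forms A → aB or A → a, and the grammar generates exactly the strings over {a, b} that end in a, i.e. the language of the regular expression (a|b)*a. A short exhaustive check in Python:

import re
from itertools import product

def in_language(s):
    # Membership test read off the grammar: S -> aS | bS consumes one symbol
    # and stays in S; the derivation can only stop with S -> a, so a string
    # is in the language iff it is non-empty, over {a, b}, and ends in 'a'.
    return len(s) >= 1 and set(s) <= {"a", "b"} and s[-1] == "a"

# Compare against the equivalent regular expression on all strings up to length 5.
for n in range(1, 6):
    for s in map("".join, product("ab", repeat=n)):
        assert in_language(s) == bool(re.fullmatch(r"(a|b)*a", s))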


Constituency Grammar
Phrase structure grammar, introduced by Noam Chomsky, is based on the constituency relation; that is why it is also called constituency grammar. It is the opposite of dependency grammar.

Dependency Grammar

Dependency grammar (DG) is based on the dependency relation. It was introduced by Lucien Tesnière. Dependency grammar is the opposite of constituency grammar because it lacks phrasal nodes.

A parse tree that uses constituency grammar is called a constituency-based parse tree, and a parse tree that uses dependency grammar is called a dependency-based parse tree.

2.3.3 TYPES OF PARSING

A. Top-Down Vs Bottom-up Parsing


Derivation divides parsing into the following two types −


 Top-down Parsing
 Bottom-up Parsing

Top-down Parsing: In this kind of parsing, the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol into the input. The most common form of top-down parsing uses recursive procedures to process the input. The main disadvantage of recursive descent parsing is backtracking.

Bottom-up Parsing: In this kind of parsing, the parser starts with the input symbols and tries to construct the parse tree up to the start symbol.

B. Deep Vs Shallow Parsing

Deep Parsing:
 The search strategy gives a complete syntactic structure to a sentence.
 It is suitable for complex NLP applications.
 Dialogue systems and summarization are examples of NLP applications where deep parsing is used.
 It is also called full parsing.

Shallow Parsing:
 It parses only a limited part of the syntactic information from the given text.
 It can be used for less complex NLP applications.
 Information extraction and text mining are examples of NLP applications where shallow parsing is used.
 It is also called chunking.

2.3.4 PARSE TREE / DERIVATION TREE

The process of deriving a string is called derivation. A parse tree / derivation tree may be defined as the graphical depiction of a derivation. The start symbol of the derivation serves as the root of the parse tree. In every parse tree, the leaf nodes are terminals and the interior nodes are non-terminals. A property of the parse tree is that reading its leaves from left to right produces the original input string.

1. Leftmost Derivation
The process of deriving a string by expanding the leftmost non-terminal at each step is called leftmost derivation, i.e., the sentential form is scanned and replaced from left to right.


The geometrical representation of a leftmost derivation is called a leftmost derivation tree.

Example: Consider the following grammar-


S → aB / bA
A → aS / bAA / a
B → bS / aBB / b
(Unambiguous Grammar)
Let us consider a string w = aaabbabbba. Derive the string w using leftmost derivation.
S → aB
→ aaBB (Using B → aBB)
→ aaaBBB (Using B → aBB)
→ aaabBB (Using B → b)
→ aaabbB (Using B → b)
→ aaabbaBB (Using B → aBB)
→ aaabbabB (Using B → b)
→ aaabbabbS (Using B → bS)
→ aaabbabbbA (Using S → bA)
→ aaabbabbba (Using A → a)
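The derivation above can be replayed mechanically. The sketch below is my own illustration (plain Python, no libraries): it applies each production to the leftmost occurrence of the named non-terminal (which at every step of this particular derivation is also the leftmost non-terminal overall) and checks that the final string equals w.

def leftmost_derive(start, steps):
    # steps: list of (non-terminal, replacement) productions, applied in order
    sentential = start
    for lhs, rhs in steps:
        i = sentential.find(lhs)                  # leftmost occurrence of lhs
        assert i != -1, f"{lhs} not in {sentential}"
        sentential = sentential[:i] + rhs + sentential[i + 1:]
        print(sentential)
    return sentential

steps = [("S", "aB"), ("B", "aBB"), ("B", "aBB"), ("B", "b"), ("B", "b"),
         ("B", "aBB"), ("B", "b"), ("B", "bS"), ("S", "bA"), ("A", "a")]
assert leftmost_derive("S", steps) == "aaabbabbba"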

2. Rightmost Derivation


The process of deriving a string by expanding the rightmost non-terminal at each step is called rightmost derivation, i.e., the sentential form of the input is scanned and replaced from right to left. The sentential form in this case is called the right-sentential form.

The geometrical representation of a rightmost derivation is called a rightmost derivation tree.
Example
Consider the following grammar-
S → aB / bA
A → aS / bAA / a
B → bS / aBB / b
(Unambiguous Grammar)
Let us consider a string w = aaabbabbba. Derive the string w using rightmost derivation.
S → aB
→ aaBB (Using B → aBB)
→ aaBaBB (Using B → aBB)
→ aaBaBbS (Using B → bS)
→ aaBaBbbA (Using S → bA)
→ aaBaBbba (Using A → a)
→ aaBabbba (Using B → b)
→ aaaBBabbba (Using B → aBB)
→ aaaBbabbba (Using B → b)
→ aaabbabbba (Using B → b)


Notes
 For unambiguous grammars, the leftmost derivation and the rightmost derivation represent the same parse tree.
 For ambiguous grammars, the leftmost derivation and the rightmost derivation may represent different parse trees.

In the example above, the given grammar was unambiguous. That is why the leftmost derivation and the rightmost derivation represent the same parse tree.
Leftmost Derivation Tree = Rightmost Derivation Tree
Properties Of Parse Tree
 Root node of a parse tree is the start symbol of the grammar.
 Each leaf node of a parse tree represents a terminal symbol.
 Each interior node of a parse tree represents a non-terminal symbol.
 Parse tree is independent of the order in which the productions are used during
derivations.

Yield Of Parse Tree


 Concatenating the leaves of a parse tree from the left produces a string of terminals.
 This string of terminals is called the yield of the parse tree.

Problem
Consider the grammar-
S → bB / aA
A → b / bS / aAA
B → a / aS / bBB


For the string w = bbaababa, find the following:


1. Leftmost derivation
2. Rightmost derivation
3. Parse Tree

Solution

1. Leftmost Derivation
S → bB
→ bbBB (Using B → bBB)
→ bbaB (Using B → a)
→ bbaaS (Using B → aS)
→ bbaabB (Using S → bB)
→ bbaabaS (Using B → aS)
→ bbaababB (Using S → bB)
→ bbaababa (Using B → a)

2. Rightmost Derivation

S → bB
→ bbBB (Using B → bBB)
→ bbBaS (Using B → aS)
→ bbBabB (Using S → bB)
→ bbBabaS (Using B → aS)
→ bbBababB (Using S → bB)
→ bbBababa (Using B → a)
→ bbaababa (Using B → a)

3. Parse Tree


 Whether we consider the leftmost derivation or the rightmost derivation, we get the same parse tree (the tree figure itself is not reproduced here).
 The reason is that the given grammar is unambiguous.

2.3.5 PARSER
The main roles of the parser include the following:
 To report any syntax errors.
 To recover from commonly occurring errors so that the processing of the remainder of the program can continue.
 To create the parse tree.
 To create the symbol table.
 To produce intermediate representations (IR).

2.3.6 TYPES OF PARSERS

A parser is basically a procedural interpretation of a grammar. It finds an optimal tree for the given sentence by searching through the space of possible trees.
i. Recursive descent Parser
ii. Shift-reduce Parser
iii. Chart Parser
iv. Regexp parser
v. Dependency Parser
vi. Morphological Parser
vii. Constituency Parser


viii. Probabilistic Parser


This section discusses some of the available parsers.

i. Recursive descent parser

Recursive descent parsing is one of the most straightforward forms of parsing.

 It follows a top-down process.
 It attempts to verify whether the syntax of the input stream is correct or not.
 It reads the input sentence from left to right.
 One necessary operation for a recursive descent parser is to read characters from the input stream and match them against the terminals of the grammar.

ii. Shift-reduce parser

Following are some important points about the shift-reduce parser.

 It follows a simple bottom-up process.
 It tries to find sequences of words and phrases that correspond to the right-hand side of a grammar production and replaces them with the left-hand side of the production.
 This attempt to find such sequences continues until the whole sentence is reduced.
 In simple words, a shift-reduce parser starts with the input symbols and tries to construct the parse tree up to the start symbol.
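NLTK ships implementations of both strategies. The sketch below uses a toy grammar of my own (not from the notes) to parse the same sentence top-down with the recursive descent parser and bottom-up with the shift-reduce parser. Note that the recursive descent parser cannot handle left-recursive grammars, and the shift-reduce parser does not backtrack:

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Pro | Det N
    VP -> V NP
    Pro -> 'I'
    Det -> 'a'
    N -> 'fox'
    V -> 'saw'
""")
tokens = "I saw a fox".split()

# Top-down: expand from S and try to match the expansion against the input.
for tree in nltk.RecursiveDescentParser(grammar).parse(tokens):
    print(tree)

# Bottom-up: shift words onto a stack and reduce right-hand sides to
# left-hand sides until the whole sentence is reduced to S.
for tree in nltk.ShiftReduceParser(grammar).parse(tokens):
    print(tree)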

iii. Chart parser

Following are some important points about the chart parser.


 It is mainly useful or suitable for ambiguous grammars, including grammars of
natural languages.
 It applies dynamic programming to the parsing problems.
 Because of dynamic programming, partial hypothesized results are stored in a
structure called a ‘chart’.
 The ‘chart’ can also be re-used.
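A chart parser handles ambiguity gracefully because the chart stores each partial analysis once and re-uses it. Here is a sketch using NLTK's ChartParser on the classic ambiguous sentence discussed at the end of this unit (grammar adapted from the NLTK book):

import nltk

groucho = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I'
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas'
    V -> 'shot'
    P -> 'in'
""")
parser = nltk.ChartParser(groucho)
# Two parse trees are printed: one attaches "in my pajamas" to the VP,
# the other attaches it to the NP "an elephant".
for tree in parser.parse("I shot an elephant in my pajamas".split()):
    print(tree)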

iv. Regexp parser

Regexp parsing is one of the most commonly used parsing techniques. Following are some important points about the Regexp parser.
 As the name implies, it uses a regular expression defined in the form of grammar on
top of a POS-tagged string.
 It basically uses these regular expressions to parse the input sentences and generate a
parse tree.
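A sketch with NLTK's RegexpParser (my own example): a chunk grammar written over POS tags pulls noun-phrase chunks out of a tagged sentence, which is exactly the shallow parsing / chunking described earlier.

import nltk

tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# NP chunk = optional determiner, any number of adjectives, then a noun.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
print(chunker.parse(tagged))
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))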


v. Dependency Parser

Dependency Parsing (DP) refers to examining the dependencies between the words of a
sentence to analyze its grammatical structure. Dependency parsing doesn’t make use of
phrasal constituents or sub-phrases. Instead, the syntax of the sentence is expressed in terms
of dependencies between words — that is, directed, typed edges between words in a graph.
More formally, a dependency parse tree is a graph G = (V, E) where the set of
vertices V contains the words in the sentence, and each edge in E connects two words. The
graph must satisfy three conditions:

1. There has to be a single root node with no incoming edges.
2. For each node v in V, there must be a path from the root to v.
3. Each node except the root must have exactly one incoming edge.

Additionally, each edge in E has a type, which defines the grammatical relation that occurs
between the two words.
Let's see what the example sentence "I saw a fox" looks like if we perform dependency parsing (the tree figure is not reproduced here; the same sentence is analyzed with constituency parsing below).

The result is completely different from a constituency tree. With this approach, the root of the tree is the verb of the sentence, and edges between words describe their relationships.
For example, the word “saw” has an outgoing edge of type nsubj to the word “I”, meaning
that “I” is the nominal subject of the verb “saw”. In this case, we say that “I” depends
on “saw”.
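A quick way to see these typed edges is spaCy's dependency parser (my own example; it assumes the en_core_web_sm model has been downloaded):

import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("I saw a fox"):
    # Each token has exactly one head (one incoming edge);
    # the root token is conventionally its own head.
    print(f"{token.text:<5} --{token.dep_}--> {token.head.text}")

# Expected output (labels may vary slightly between model versions):
# I     --nsubj--> saw
# saw   --ROOT--> saw
# a     --det--> fox
# fox   --dobj--> saw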

vi. Morphological Parser

Morphemes are the smallest meaning-bearing units. Example: we can break the word foxes into two parts, fox and -es. The word foxes is thus made up of two morphemes, one being fox and the other -es.
Morphemes can be divided into two types:
i. Stems
The stem is the core meaningful unit of a word; we can also say that it is the root of the word. Example: in the word foxes, the stem is fox.
ii. Affixes
As the name suggests, affixes add some additional meaning and grammatical function to words. For example, in the word foxes, the affix is -es.
Affixes can further be divided into the following four types −
o Prefixes − As the name suggests, prefixes precede the stem. For example, in the word unbuckle, un- is the prefix.
o Suffixes − As the name suggests, suffixes follow the stem. For example, in the word cats, -s is the suffix.
o Infixes − As the name suggests, infixes are inserted inside the stem. For example, the word cupful can be pluralized as cupsful by using -s- as an infix.
o Circumfixes − They precede and follow the stem. There are very few examples of circumfixes in the English language; a commonly cited pattern is a- ... -ing, where a- precedes and -ing follows the stem.
The order in which morphemes may combine within a word is also determined during morphological parsing (see morphotactics below).
Morphology
Morphology is the study of the following:
 The formation of words.
 The origin of the words.
 Grammatical forms of the words.
 Use of prefixes and suffixes in the formation of words.
 How parts-of-speech (PoS) of a language are formed.
Morphological parsing is the problem of recognizing that a word breaks down into smaller meaningful units called morphemes, and of producing some kind of linguistic structure for it.
Requirements for building a Morphological parser
Let us now see the requirements for building a morphological parser.

Lexicon: This includes the list of stems and affixes, along with basic information about them, for example whether a stem is a noun stem or a verb stem.

Morphotactics: This is basically the model of morpheme ordering; in other words, the model explaining which classes of morphemes can follow other classes of morphemes inside a word. For example, one morphotactic fact is that the English plural morpheme follows the noun rather than preceding it.

Orthographic rules: These spelling rules are used to model the changes occurring in a word. For example, the rule converting y to ie in words like city + s = cities, not citys.

The goal of morphological parsing is to find out what morphemes a given word is built from.
For example, a morphological parser should be able to tell us that the word cats is the plural
form of the noun stem cat, and that the word mice is the plural form of the noun stem mouse.
So, given the string cats as input, a morphological parser should produce an output that looks
similar to cat N PL. Here are some more examples:


mouse → mouse N SG
mice → mouse N PL
foxes → fox N PL

Morphological parsing yields information that is useful in many NLP applications.

 In parsing, it helps to know the agreement features of words.
 Grammar checkers need agreement information to detect mistakes, and morphological information also helps spell checkers decide whether something is a possible word or not.
 In information retrieval, it is used to search not only for cats, if that is the user's input, but also for cat.

To get from the surface form of a word to its morphological analysis, we are going to proceed
in two steps.

Step 1: Split the word up into its possible components.

Cats: cat + s, where + is used to indicate morpheme boundaries.

Foxes: foxe + s, which assumes that foxe is a stem and s is the suffix; or
fox + s, where fox is the stem and e has been introduced by a spelling rule.

Step 2: Use a lexicon of stems and affixes to look up the categories of the stems and the meaning of the affixes.
cat + s gets mapped to cat N PL
fox + s gets mapped to fox N PL
We will also find that foxe is not a legal stem. This tells us that splitting foxes into foxe + s was an incorrect way of splitting foxes, and it should be discarded.
Note: For the word houses, splitting it into house + s is correct.

A diagram illustrating the two steps of the morphological parser is omitted here; the sketch below shows the same idea in code.
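This toy sketch is my own, heavily simplified illustration of the two steps: split a surface form into stem + affix, then validate the stem against a lexicon (irregular forms and the full affix inventory are ignored).

LEXICON = {"cat": "N", "fox": "N", "house": "N"}     # stem -> word class

def analyse(word):
    # Returns the analyses licensed by the lexicon, e.g. [('fox', 'N PL')].
    analyses = []
    if word in LEXICON:                              # bare stem: singular
        analyses.append((word, LEXICON[word] + " SG"))
    if word.endswith("s") and word[:-1] in LEXICON:  # stem + s
        analyses.append((word[:-1], LEXICON[word[:-1]] + " PL"))
    if word.endswith("es") and word[:-2] in LEXICON: # orthographic rule: e inserted
        analyses.append((word[:-2], LEXICON[word[:-2]] + " PL"))
    return analyses

print(analyse("cats"))    # [('cat', 'N PL')]
print(analyse("foxes"))   # [('fox', 'N PL')] -- 'foxe' is rejected: not a stem
print(analyse("houses"))  # [('house', 'N PL')]
# Irregular forms such as mice -> mouse N PL would need their own lexicon entries.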


vii. Constituency Parser

The constituency parse tree is based on the formalism of context-free grammars. In this type of tree, the sentence is divided into constituents, that is, sub-phrases that belong to a specific category in the grammar.
In English, for example, the phrases “a dog”, “a computer on the table” and “the nice sunset” are all noun phrases, while “eat a pizza” and “go to the beach” are verb phrases.
The grammar provides a specification of how to build valid sentences, using a set of rules. As an example, the rule VP → V NP means that we can form a verb phrase (VP) using a verb (V) and then a noun phrase (NP).
While we can use these rules to generate valid sentences, we can also apply them the other way around, in order to extract the syntactic structure of a given sentence according to the grammar.
Example of a constituency parse tree for the simple sentence, “I saw a fox”:
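The original figure is not reproduced here; the bracketed tree below is a plausible rendering of that parse, drawn with NLTK (my own sketch; the category labels are illustrative):

import nltk

tree = nltk.Tree.fromstring("(S (NP (Pro I)) (VP (V saw) (NP (Det a) (N fox))))")
tree.pretty_print()     # draws the tree in ASCII
print(tree.leaves())    # terminal nodes are the words: ['I', 'saw', 'a', 'fox']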

A constituency parse tree always contains the words of the sentence as its terminal nodes. Usually, each word has a parent node containing its part-of-speech tag (noun, adjective, verb, etc.), although this may be omitted in other graphical representations.
All the other non-terminal nodes represent the constituents of the sentence and are usually one of verb phrase (VP), noun phrase (NP), or prepositional phrase (PP).
To sum things up, constituency parsing creates trees containing a syntactic representation of a sentence, according to a context-free grammar. This representation is highly hierarchical and divides the sentence into its phrasal constituents.


viii. Probabilistic Parser

Probabilistic parsing uses dynamic programming algorithms to compute the most likely
parse(s) of a given sentence, given a statistical model of the syntactic structure of a language.
CFG: A context-free grammar consists of:
1. a set of non-terminal symbols N
2. a set of terminal symbols Σ (disjoint from N)
3. a set of productions P, each of the form A → α, where A is a non-terminal and α is a string from the infinite set of strings (Σ ∪ N)*
4. a designated start symbol
Probabilistic CFGs / Stochastic Grammars (PCFGs)
A probabilistic CFG augments each rule in P with a conditional probability: A → β [p], where p is the probability that the non-terminal A will be expanded to the sequence β. This probability is often written P(A → β) or P(A → β | A).
Why are PCFGs useful?
• A PCFG assigns a probability to each parse tree T.
• Useful in disambiguation: choose the most likely parse. If we make independence assumptions, the probability of a parse is P(T) = ∏ n∈T p(r(n)), the product of the probabilities of the rules r(n) used at each node n of T.
• Useful in language modeling tasks.
Example:
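The worked example in the original is a figure that is not reproduced here. As a stand-in, here is a toy PCFG in NLTK (my own grammar and probabilities): rules sharing a left-hand side carry probabilities summing to 1, and NLTK's Viterbi parser returns the most probable parse.

import nltk

pcfg = nltk.PCFG.fromstring("""
    S -> NP VP    [1.0]
    NP -> Pro     [0.4]
    NP -> Det N   [0.6]
    VP -> V NP    [1.0]
    Pro -> 'I'    [1.0]
    Det -> 'a'    [1.0]
    N -> 'fox'    [1.0]
    V -> 'saw'    [1.0]
""")
parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("I saw a fox".split()):
    # P(T) is the product of the probabilities of the rules used:
    # 1.0 * 0.4 * 0.6 * 1.0 * 1.0 * ... = 0.24
    print(tree.prob(), tree)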


Where does the grammar come from?
1. developed manually
2. from a treebank
Where do the probabilities come from?
1. from a treebank: P(α → β | α) = Count(α → β) / Count(α)
2. use EM (the forward-backward algorithm, the inside-outside algorithm)
Parsing with PCFGs
Produce the most likely parse for a given sentence: T̂(S) = argmax T∈τ(S) P(T), where τ(S) is the set of possible parse trees for S.
 Augment the Earley algorithm to compute the probability of each of its parses. When adding an entry E of category C to the chart using rule i with n subconstituents E1, . . . , En:
P(E) = P(rule i | C) × P(E1) × . . . × P(En)
 Use the probabilistic CKY (Cocke-Kasami-Younger) algorithm.


Problems with PCFGs

• PCFGs assume that all rules are essentially independent.
– But, e.g., in English the rule NP → Pro is more likely when the NP is in subject position.
• It is difficult to incorporate lexical information.
– Pre-terminal rules can inherit important information from words which helps to make choices higher up the parse; e.g., lexical choice can help determine PP attachment.
• PCFGs do not adequately model lexical dependencies.

ix. Statistical Parser

Statistical parsing is the task of computing the most probable parse of a sentence given a
probabilistic (or weighted) context-free grammar (CFG). The weights of the probabilistic or
weighted CFG are typically learned on a corpus of texts.

2.3.7 RELEVANCE OF PARSING IN NLP


We can understand the relevance of parsing in NLP with the help of the following points −
 The parser is used to report any syntax errors.
 It helps to recover from commonly occurring errors so that the processing of the remainder of the program can continue.
 The parse tree is created with the help of a parser.
 The parser is used to create the symbol table, which plays an important role in NLP.
 The parser is also used to produce intermediate representations (IR).

2.3.8 CHALLENGES IN PARSING NATURAL LANGUAGE

Parsing natural language presents several challenges that don’t occur when parsing
programming languages. The reason for this is that natural language is often ambiguous,
meaning there can be multiple valid parse trees for the same sentence.
Let’s consider for a moment the sentence, “I shot an elephant in my pajamas”. It has two
possible interpretations: one where the man is wearing his pajamas while shooting the
elephant, and the other where the elephant is inside the man’s pajamas.
These are both valid from a syntactic perspective, but humans are able to resolve these ambiguities very quickly, and often unconsciously, since many of the possible interpretations are unreasonable given their semantics or the context in which the sentence occurs. However, it is not as easy for a parsing algorithm to select the most likely parse tree with great accuracy.


To do this, most modern parsers use supervised machine learning models that are trained on
manually annotated data. Since the data is annotated with the correct parse trees, the model
will learn a bias towards more likely interpretations.

2.3.9 CONCLUSION
• Basic parsing approaches without constraints are not practical in real applications.
• Whatever approach is taken, bear in mind that the lexicon is the real bottleneck.
• There is a real trade-off between coverage and efficiency, so it is a good idea to sacrifice broad coverage (e.g. domain-specific parsers, controlled language), or to use a scheme that minimizes the disadvantages (e.g. probabilistic parsing).
– From a computational perspective, a parser provides a formalism for writing linguistic rules and an implementation which can apply the rules to an input text.
– An interface to allow grammar development and testing (e.g. tracing rules, showing trees), and an interface to the application of which the parser is a part (which may be hidden from the end user), are both necessary.
• All of the above should be tailored to meet the needs of the application.
