Machine 22
A chart parser
Dynamic programming is an algorithm design technique that stores intermediate
results and reuses them when appropriate, which can yield significant efficiency
gains. Applied to syntactic parsing, it lets us store partial solutions to the parsing
task and look them up when needed in order to arrive at a complete solution
efficiently. This approach to parsing is known as chart parsing.
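As a rough illustration of the idea, the following is a minimal sketch of a CYK-style chart recognizer, which is one simple instance of chart parsing and not NLTK's own chart parser. The toy grammar, the lexicon, and the names cyk_chart and recognizes are assumptions made up for this example. The chart stores the categories found for every span of the sentence, and wider spans are built only by looking up smaller spans that are already stored.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICON = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "chased": {"V"},
}


def cyk_chart(words):
    """Return a chart where chart[(i, j)] holds the categories spanning words[i:j]."""
    n = len(words)
    chart = defaultdict(set)
    # Spans of width 1 come straight from the lexicon.
    for i, word in enumerate(words):
        chart[(i, i + 1)] |= LEXICON.get(word, set())
    # Wider spans reuse the smaller spans already stored in the chart.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two halves
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        chart[(i, j)] |= BINARY_RULES.get((left, right), set())
    return chart


def recognizes(words, start="S"):
    """True if the whole sentence can be analysed as the start category."""
    return start in cyk_chart(words)[(0, len(words))]
```

For example, recognizes("the dog chased the cat".split()) looks up the stored NP for "the cat" when building the VP, rather than re-deriving it for every candidate analysis.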
A regex parser
A regex parser uses regular expressions, written in the form of a grammar, over the
tags of a POS-tagged string. The parser uses these regular expressions to parse the
given sentences and generate a parse tree from them. A working example of the regex
parser is given here:
# Regex parser
>>>import nltk
>>>from nltk.chunk.regexp import *
>>>chunk_rules=ChunkRule("<.*>+","chunk everything")
>>>reg_parser = RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>} # Preposition
V: {<V.*>} # Verb
PP: {<P> <NP>} # PP -> P NP
VP: {<V> <NP|PP>*} # VP -> V (NP|PP)*
''')
>>>test_sent="Mr. Obama played a big role in the Health insurance bill"
>>>test_sent_pos=nltk.pos_tag(nltk.word_tokenize(test_sent))
>>>parsed_out=reg_parser.parse(test_sent_pos)
>>>print(repr(parsed_out))
Tree('S', [('Mr.', 'NNP'), ('Obama', 'NNP'), Tree('VP', [Tree('V',
[('played', 'VBD')]), Tree('NP', [('a', 'DT'), ('big', 'JJ'), ('role',
'NN')])]), Tree('P', [('in', 'IN')]), ('Health', 'NNP'), Tree('NP',
[('insurance', 'NN'), ('bill', 'NN')])])
www.it-ebooks.info
Parsing Structure in Text
The following is a graphical representation of the tree for the preceding code:
[Figure: parse tree for the preceding output; only the node labels Root, NP, and VP are recoverable.]
Dependency parsing
Dependency parsing (DP) is a modern parsing mechanism. The main concept of DP
is that each linguistic unit (word) is connected to other units by directed links.
These links are called dependencies in linguistics. Dependency parsing is an active
area of work in the current parsing community. While phrase structure parsing is
still widely used, dependency parsing has turned out to be more effective for free
word order languages (such as Czech and Turkish).
A very clear distinction can be made by looking at the parse trees generated by a
phrase structure grammar and a dependency grammar for a given example, such as the
sentence "The big dog chased the cat". The parse trees for this sentence are:
Chapter 4
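[Figure: phrase structure tree (left) and dependency tree (right) for "The big dog chased the cat". In the phrase structure tree, the top level splits into NP and VP; the NP covers Art Adj N (the big dog), and the VP covers V (chased) followed by an NP of Art N (the cat).]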
If we look at both parse trees, the phrase structure tree tries to capture the
relationships between words and phrases, and eventually between phrases. A
dependency tree, by contrast, only looks at dependencies between words; for
example, big depends on dog.
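To make the word-to-word view concrete, here is a minimal sketch that stores a dependency tree as a map from each word position to the position of its head. Only the link from big to dog comes from the text; the remaining head assignments, the artificial ROOT at position 0, and the names HEADS, word, head_word, and dependents are assumptions made for illustration.

```python
WORDS = ["The", "big", "dog", "chased", "the", "cat"]

# dependent position -> head position (1-based; 0 is an artificial ROOT).
# Only "big -> dog" is stated in the text; the other links are assumed.
HEADS = {1: 3, 2: 3, 3: 4, 4: 0, 5: 6, 6: 4}


def word(position):
    """Return the word at a 1-based position, or ROOT for position 0."""
    return "ROOT" if position == 0 else WORDS[position - 1]


def head_word(position):
    """Return the word that the word at this position depends on."""
    return word(HEADS[position])


def dependents(position):
    """Return the words that depend directly on the word at this position."""
    return [word(dep) for dep, head in sorted(HEADS.items()) if head == position]
```

Note that there are no phrase nodes at all in this representation: head_word(2) gives the head of big directly, and dependents(4) lists the words attached to chased.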
NLTK provides a couple of ways to do dependency parsing. One of them is to use
a probabilistic, projective dependency parser, but it is restricted by the limited
set of training data it is trained on. One of the state-of-the-art dependency parsers
is the Stanford parser. Fortunately, NLTK has a wrapper around it, and the following
example shows how to use the Stanford parser with NLTK:
# Stanford Parser [Very useful]
>>>from nltk.parse.stanford import StanfordParser
>>>english_parser = StanfordParser('stanford-parser.jar', 'stanford-parser-3.4-models.jar')
>>>english_parser.raw_parse_sents(("this is the english parser test",))
Parse
(ROOT
(S
(NP (DT this))
(VP (VBZ is)
(NP (DT the) (JJ english) (NN parser) (NN test)))))
Universal dependencies
nsubj(test-6, this-1)
cop(test-6, is-2)
det(test-6, the-3)
amod(test-6, english-4)
compound(test-6, parser-5)
root(ROOT-0, test-6)
The output looks quite complex but, in reality, it's not. It consists of three
major parts. The first is the POS tags and the parse tree of the given sentence;
the same tree is plotted more elegantly in the following figure. The second is the
dependencies and positions of the given words. The third is the enhanced version
of the dependencies:
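[Figure: parse tree for "this is the english parser test". Root branches into NP (DT this) and VP (VBZ is, followed by an NP with the tags DT JJ NN NN).]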
Chunking
Chunking is shallow parsing: instead of reaching for the deep structure of the
sentence, we try to group together parts of the sentence that each constitute a
unit of meaning.
A chunk can be defined as the minimal unit that can be processed. For example, the
sentence "the President speaks about the health care reforms" can be broken into two
chunks. The first is "the President", which is dominated by a noun and hence is called
a noun phrase (NP). The remaining part of the sentence is dominated by a verb, hence
it is called a verb phrase (VP). Note that there is one more sub-chunk in the part
"speaks about the health care reforms": it contains another NP, so it can be broken
down again into "speaks about" and "the health care reforms", as shown in the
following figure:
[Figure: chunk tree for the sentence; the surviving node labels are VP, NP, and NP.]
This is how we break the sentence into parts, and that is what we call chunking.
Formally, chunking can also be described as a processing interface to identify
non-overlapping groups in unrestricted text.
Now we understand the difference between shallow and deep parsing. In deep
parsing, we derive the full syntactic structure of the sentence with the help of a
CFG. In some cases, we need to go further, to semantic parsing, to understand the
meaning of the sentence. On the other hand, there are cases where we don't need
analysis this deep. Say that, from a large portion of unstructured text, we just
want to extract the key phrases, named entities, or specific patterns of entities.
For this, we go for shallow parsing instead of deep parsing, because deep parsing
involves processing the sentence against all the grammar rules and generating a
variety of syntactic trees until the parser arrives at the best tree through
backtracking and reiteration. This entire process is time consuming and cumbersome
and, even after all the processing, you might not get the right parse tree. Shallow
parsing guarantees a shallow parse structure in terms of chunks, and it is
relatively faster.
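The following is a minimal sketch of shallow parsing on the example sentence used earlier. It is a hand-rolled, single-pass noun phrase chunker, not NLTK's RegexpParser; the POS tags are assigned by hand, and the names TAGGED and np_chunks are assumptions for this example. A chunk here is an optional determiner, any number of adjectives, and one or more nouns.

```python
# Hand-tagged input (the tags are assumed, not produced by a tagger).
TAGGED = [
    ("the", "DT"), ("President", "NNP"), ("speaks", "VBZ"), ("about", "IN"),
    ("the", "DT"), ("health", "NN"), ("care", "NN"), ("reforms", "NNS"),
]


def np_chunks(tagged):
    """Return non-overlapping noun phrase chunks found in one left-to-right pass."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":  # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):  # any adjectives
            j += 1
        noun_start = j
        while j < n and tagged[j][1].startswith("NN"):  # one or more nouns
            j += 1
        if j > noun_start:
            # At least one noun was found, so tokens i..j-1 form a chunk.
            chunks.append(" ".join(token for token, _ in tagged[i:j]))
            i = j
        else:
            # No noun phrase starts here; move on without backtracking.
            i += 1
    return chunks
```

Calling np_chunks(TAGGED) picks out the two noun phrases without building a full tree, which is why this kind of pass is cheaper than matching the sentence against a complete grammar.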