Unit 3-2

The document discusses parsing using Context Free Grammar (CFG), explaining the process of generating parse trees for strings based on grammatical rules. It covers various parsing strategies, including top-down and bottom-up approaches, as well as the use of dynamic programming methods like the CKY algorithm. Additionally, it highlights the importance of treebanks, such as the Penn Treebank, in providing syntactically annotated corpora that serve as grammars for language processing.


Parsing with Context Free Grammar

Sudeshna Sarkar

16 AUG 2019
Parsing
• Parsing is the process of taking a string and a
grammar and returning parse tree(s) for that string

“The old dog the footsteps of the young.”

S → NP VP          VP → V
S → Aux NP VP      VP → V PP
S → VP             PP → Prep NP
NP → Det Nom       N → old | dog | footsteps | young
NP → PropN         V → dog | eat | sleep | bark | meow
Nom → Adj N        Aux → does | can
Nom → N            Prep → from | to | on | of
Nom → N Nom        PropN → Fido | Felix
Nom → Nom PP       Det → that | this | a | the
VP → V NP          Adj → old | happy | young
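
To make the later sketches concrete, here is one possible encoding of this toy grammar as plain Python data. The names GRAMMAR and LEXICON are ours, not the slides':

# One possible encoding of the toy grammar: phrase-structure rules
# as (LHS, RHS) pairs, plus the lexicon as word -> parts of speech.
GRAMMAR = [
    ("S", ["NP", "VP"]), ("S", ["Aux", "NP", "VP"]), ("S", ["VP"]),
    ("NP", ["Det", "Nom"]), ("NP", ["PropN"]),
    ("Nom", ["Adj", "N"]), ("Nom", ["N"]),
    ("Nom", ["N", "Nom"]), ("Nom", ["Nom", "PP"]),
    ("VP", ["V", "NP"]), ("VP", ["V"]), ("VP", ["V", "PP"]),
    ("PP", ["Prep", "NP"]),
]
LEXICON = {
    "old": {"N", "Adj"}, "young": {"N", "Adj"}, "dog": {"N", "V"},
    "footsteps": {"N"}, "happy": {"Adj"},
    "eat": {"V"}, "sleep": {"V"}, "bark": {"V"}, "meow": {"V"},
    "does": {"Aux"}, "can": {"Aux"},
    "from": {"Prep"}, "to": {"Prep"}, "on": {"Prep"}, "of": {"Prep"},
    "Fido": {"PropN"}, "Felix": {"PropN"},
    "that": {"Det"}, "this": {"Det"}, "a": {"Det"}, "the": {"Det"},
}

Note that the example sentence parses only because "old" can be a noun and "dog" a verb: "the old" is the subject NP and "dog" is the main verb.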
Parsing
• Parsing with CFGs refers to the task of assigning
proper trees to input strings
• Proper: a tree that covers all and only the elements
of the input and has an S at the top

Syntactic Analysis (Parsing)
• Automatic methods of finding the syntactic structure
for a sentence
– Symbolic methods: a phrase grammar or another
description of the structure of the language is required;
e.g., the chart parser.
– Statistical methods: a text corpus with syntactic structures
is needed (a treebank)

Search Framework
• Think about parsing as a form of search…
– A search through the space of possible trees given an
input sentence and grammar

How to parse
• Top-down: Start at the top of the tree with an S
node, and work your way down to the words.

• Bottom-up: Look for small pieces that you know how
to assemble, and work your way up to larger pieces.
Top-Down Search
• Builds from the root S node to the leaves
• Expectation-based
• Common top-down search strategy
– Top-down, left-to-right, with backtracking
– Try the first rule whose LHS is S
– Next expand all constituents on RHS
– Iterate until all leaves are POS
– Backtrack when candidate POS does not match POS of
current word in input string

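As a minimal sketch of this strategy (the helper names are ours, reusing GRAMMAR and LEXICON from above), a top-down, left-to-right, backtracking recognizer can be written with generators. The depth bound is a crude guard: without it, the left-recursive rule Nom → Nom PP sends a naive top-down parser into an infinite loop, a classic weakness of this approach.

# Top-down, left-to-right, backtracking recognizer (sketch).
# parse(cat, words, i) yields every j such that words[i:j]
# can be analysed as category cat.
def parse(cat, words, i, depth=0):
    if depth > 2 * len(words):   # guard against left recursion
        return
    # lexical expansion: cat covers exactly the next word
    if i < len(words) and cat in LEXICON.get(words[i], set()):
        yield i + 1
    # grammatical expansion: try each rule whose LHS is cat;
    # backtracking happens as the generators are exhausted
    for lhs, rhs in GRAMMAR:
        if lhs == cat:
            yield from expand(rhs, words, i, depth + 1)

def expand(cats, words, i, depth):
    if not cats:
        yield i                  # the whole RHS has been matched
        return
    for mid in parse(cats[0], words, i, depth):
        yield from expand(cats[1:], words, mid, depth)

words = "the old dog the footsteps of the young".split()
print(any(j == len(words) for j in parse("S", words, 0)))  # True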
Bottom-Up Parsing
• Of course, we also want trees that cover the input
words. So we might also start with trees that link up
with the words in the right way.
• Then work your way up from there to larger and
larger trees.
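
A naive version of this idea, again over GRAMMAR and LEXICON with helper names of our own, tags the words and then backtracks over every way to reduce a span that matches some rule's right-hand side:

from functools import lru_cache

# Naive bottom-up recognizer (sketch): succeed when the whole
# input has been reduced to a single S.
@lru_cache(maxsize=None)
def reduce_to_s(seq):
    if seq == ("S",):
        return True
    for lhs, rhs in GRAMMAR:
        rhs = tuple(rhs)
        n = len(rhs)
        for i in range(len(seq) - n + 1):
            if seq[i:i + n] == rhs:
                # replace the matched span by the rule's LHS
                if reduce_to_s(seq[:i] + (lhs,) + seq[i + n:]):
                    return True
    return False

def bottom_up(words, i=0, tags=()):
    if i == len(words):
        return reduce_to_s(tags)
    # try each part of speech the next word could have
    return any(bottom_up(words, i + 1, tags + (t,))
               for t in LEXICON.get(words[i], set()))

words = "the old dog the footsteps of the young".split()
print(bottom_up(words))  # True

The memoization (lru_cache) keeps repeated sub-problems from being re-solved, foreshadowing the dynamic-programming idea below.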

Bottom-Up Search

[A sequence of figures stepped through the bottom-up search.]
Issues
• Ambiguity
• Shared subproblems

Ambiguity

[Figure illustrating syntactic ambiguity.]
Dynamic Programming

• DP search methods fill tables with partial results and thereby
– Avoid doing avoidable repeated work
– Solve exponential problems in polynomial time (ok, not
really)
– Efficiently store ambiguous structures with shared sub-parts.
• We’ll cover one approach that corresponds to a
bottom-up strategy
– CKY

CKY Algorithm

The table is filled one column at a time, left to right, and each column bottom to top:

• Loop over the columns j.
• Fill the bottom cell [j-1, j] with the parts of speech of word j.
• Fill row i in column j, moving up the column.
• Loop over the possible split locations k between i and j.
• Check the grammar for rules that link the constituents in [i,k]
with those in [k,j]. For each rule found, store the LHS of the rule
in cell [i,j].
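Putting those loops together, here is a minimal CKY recognizer. CKY requires a grammar in Chomsky Normal Form, so this sketch uses a hand-converted CNF fragment of the toy grammar (unary chains such as Nom → N are folded into the lexicon; BINARY and CNF_LEXICON are our names):

from collections import defaultdict

# CNF fragment of the toy grammar: (B, C) -> set of LHS categories.
BINARY = {
    ("NP", "VP"): {"S"},
    ("Det", "Nom"): {"NP"},
    ("V", "NP"): {"VP"},
    ("Nom", "PP"): {"Nom"},
    ("Prep", "NP"): {"PP"},
}
CNF_LEXICON = {
    "the": {"Det"}, "old": {"Nom", "Adj"}, "dog": {"V", "Nom"},
    "footsteps": {"Nom"}, "of": {"Prep"}, "young": {"Nom"},
}

def cky(words):
    n = len(words)
    table = defaultdict(set)   # table[i, j]: categories spanning words[i:j]
    for j in range(1, n + 1):                # loop over the columns
        # fill the bottom cell with the word's parts of speech
        table[j - 1, j] |= CNF_LEXICON.get(words[j - 1], set())
        for i in range(j - 2, -1, -1):       # fill row i, moving up column j
            for k in range(i + 1, j):        # possible split locations
                for b in table[i, k]:
                    for c in table[k, j]:
                        table[i, j] |= BINARY.get((b, c), set())
    return "S" in table[0, n]

print(cky("the old dog the footsteps of the young".split()))  # True

Each cell stores every category for its span exactly once, so ambiguous analyses share sub-parts instead of being rebuilt, which is the payoff promised above. Adding backpointers to the stored categories turns the recognizer into a parser.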
Treebank
• A syntactically annotated corpus where every sentence is
paired with a corresponding tree.
• The Penn Treebank project
– treebanks from the Brown, Switchboard, ATIS, and Wall
Street Journal corpora of English
– treebanks in Arabic and Chinese.
• Others
– the Prague Dependency Treebank for Czech,
– the NEGRA treebank for German,
– the Susanne treebank for English, and
– the Universal Dependencies treebanks
Penn Treebank
• The Penn Treebank is a widely used treebank.
• Its most well-known part is the Wall Street Journal
section: 1 million words from the 1987-1989 Wall Street
Journal.

Treebanks as Grammars
• The sentences in a treebank implicitly constitute a
grammar of the language represented by the corpus
being annotated.
• Simply take the local rules that make up the subtrees
in all the trees in the collection and you have a
grammar (see the sketch below).
– The WSJ section gives us about 12k rules if you do this.
• Treebanks (and head-finding) are particularly critical
to the development of statistical parsers.
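As a sketch of this extraction, assuming NLTK and its bundled 10% sample of the WSJ portion of the Penn Treebank (fetched once with nltk.download("treebank")):

from collections import Counter
from nltk.corpus import treebank   # requires nltk.download("treebank")

rules = Counter()
for tree in treebank.parsed_sents():
    # every local subtree contributes one production (rule)
    rules.update(tree.productions())

print(len(rules), "distinct rules")
for rule, count in rules.most_common(5):
    print(count, rule)

On the full WSJ section this procedure yields the roughly 12k rules mentioned above; the 10% sample gives correspondingly fewer.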
