
Grammars and Parsing

• To compute the syntactic structure of a sentence, we need two things:


• The grammar: a formal specification of the structures allowable in the language.
• The parsing technique: the method of analyzing a sentence to determine its
structure according to the grammar.
• Grammars and Sentence Structure:
• The tree representation for the sentence John ate the cat:
The sentence (S) consists of an initial noun phrase (NP)
and a verb phrase (VP). The initial noun phrase is made
of the simple NAME John. The verb phrase is composed of
a verb (V) ate and an NP, which consists of an article
(ART) the and a common noun (N) cat.

In list notation this same structure could be represented as:

(S (NP (NAME John))
   (VP (V ate)
       (NP (ART the)
           (N cat))))

• Node at the top: the root
• Nodes at the bottom: the leaves
• Other terminology: link, parent node, child node, ancestor
• A node is dominated by its ancestor nodes
• The root node dominates all other nodes in the tree
• A set of rewrite rules, such as S -> NP VP, describes what tree structures are allowable.

• Grammars consisting entirely of rules with a single symbol on the left-hand side
(called the mother) are known as:
• Context-free grammars (CFGs).

• CFGs are a very important class of grammars for two reasons:


• 1. The formalism is powerful enough to describe most of the structure in natural
languages,
• 2. Yet it is restricted enough so that efficient parsers can be built to analyze sentences.
• Symbols that cannot be further decomposed in a grammar, namely the words in the example,
are called terminal symbols.
• Other symbols, such as NP, VP, and S, are called nonterminal symbols.
• The grammatical symbols such as N and V that describe word categories are called lexical
symbols.
• Words will be listed under multiple categories.
• For example, can would be listed under V and N.
• Start Symbol: S
• The process of sentence generation uses derivations to construct legal sentences.
• A simple generator could be implemented by randomly choosing rewrite rules, starting from the S symbol, until
you have a sequence of words.

• S
• => NP VP (rewriting S)
• => NAME VP (rewriting NP)
• => John VP (rewriting NAME)
• => John V NP (rewriting VP)
• => John ate NP (rewriting V)
• => John ate ART N (rewriting NP)
• => John ate the N (rewriting ART)
• => John ate the cat (rewriting N)
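The random generator described above can be sketched in a few lines of Python. The rule table here is an assumption: it bundles the rewrite rules and the one-word lexicon of the John ate the cat example into a single dictionary.

```python
import random

# Assumed grammar for the example: nonterminal rules plus lexical rules
# (NAME -> John, etc.) collapsed into one rule table.
GRAMMAR = {
    "S":    [["NP", "VP"]],
    "NP":   [["NAME"], ["ART", "N"]],
    "VP":   [["V", "NP"]],
    "NAME": [["John"]],
    "V":    [["ate"]],
    "ART":  [["the"]],
    "N":    [["cat"]],
}

def generate():
    """Rewrite symbols at random, starting from S, until only words remain."""
    symbols = ["S"]
    i = 0
    while i < len(symbols):
        if symbols[i] in GRAMMAR:                   # nonterminal: rewrite it
            symbols[i:i + 1] = random.choice(GRAMMAR[symbols[i]])
        else:                                       # terminal word: keep it
            i += 1
    return " ".join(symbols)

print(generate())   # e.g. "John ate the cat" or "the cat ate John"
```

Each pass through the loop performs one derivation step, so printing `symbols` inside the loop would reproduce a derivation like the one above.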
• A second process based on derivations is parsing.
• Parsing identifies the structure of sentences given a grammar.

• There are two basic methods of searching.


• A top-down strategy starts with the S symbol and then searches through different ways
to rewrite the symbols until the input sentence is generated, or until all possibilities have
been explored.
• The preceding example demonstrates that John ate the cat is a legal sentence by
showing the derivation that could be found by this process.
• In a bottom-up strategy, we start with the words in the sentence and use the rewrite rules backward to
reduce the sequence of symbols until it consists solely of S.
• A sequence of symbols matching the right-hand side of a rule is rewritten as the symbol on its left-hand side.
• A possible bottom-up parse of the sentence John ate the cat is—

• => NAME ate the cat (rewriting John)


• => NAME V the cat (rewriting ate)
• => NAME V ART cat (rewriting the)
• => NAME V ART N (rewriting cat)
• => NP V ART N (rewriting NAME)
• => NP V NP (rewriting ART N)
• => NP VP (rewriting V NP)
• => S (rewriting NP VP)
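The bottom-up strategy above can be sketched as a depth-first search over reductions: any subsequence matching a rule's right-hand side may be rewritten as its left-hand side, until only S remains. The rule list, including the lexical rules for the four words, is assumed from the example.

```python
# Grammar and lexical rules assumed from the John ate the cat example.
RULES = [
    ("S",    ["NP", "VP"]),
    ("NP",   ["NAME"]),
    ("NP",   ["ART", "N"]),
    ("VP",   ["V", "NP"]),
    ("NAME", ["John"]),
    ("V",    ["ate"]),
    ("ART",  ["the"]),
    ("N",    ["cat"]),
]

def reduce_to_s(symbols, trace=()):
    """Return the list of reduction steps leading to (S), or None."""
    if symbols == ["S"]:
        return list(trace)
    for lhs, rhs in RULES:
        for i in range(len(symbols) - len(rhs) + 1):
            if symbols[i:i + len(rhs)] == rhs:       # right-hand side found
                reduced = symbols[:i] + [lhs] + symbols[i + len(rhs):]
                result = reduce_to_s(reduced, trace + (reduced,))
                if result is not None:
                    return result
    return None                                      # no reduction sequence reaches S

for step in reduce_to_s("John ate the cat".split()):
    print("=>", " ".join(step))
```

The trace it prints is one valid reduction sequence; the order of reductions may differ from the hand-worked parse above, since several orders lead to S.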
What Makes a Good Grammar
• Generality--the range of sentences the grammar analyzes correctly;
• Selectivity--the range of non-sentences it identifies as problematic;
• Understandability—the simplicity of the grammar

• When a group of words forms a particular constituent


• try to construct a new sentence that involves that group of words in a
conjunction with another group of words classified as the same type of
constituent.
• This is a good test because for the most part only constituents of the same
type can be conjoined.
The acceptable sentences:

• NP: I ate a hamburger and a hot dog.


• VP: I will eat the hamburger and throw away the hot dog.
• S: I ate a hamburger and John ate a hot dog.
• PP: I saw a hot dog in the bag and on the stove.
• ADJP: I ate a cold and well burned hot dog.
• ADVP: I ate the hot dog slowly and very carefully.
• N: I ate a hamburger and hot dog.
• V: I will cook and burn a hamburger.
• AUX: I can and will eat the hot dog.
• ADJ: I ate the very cold and burned hot dog

The unacceptable sentences:

• *I ate a hamburger and on the stove.


• *I ate a cold hot dog and well burned.
• *I ate the hot dog slowly and a hamburger.
• Another test involves inserting the proposed constituent into other
sentences that take the same category of constituent.

• If John’s hitting of Mary is an NP in John’s hitting of Mary alarmed Sue,


• then it should be usable as an NP in other sentences as well.
• In fact this is true—the NP can be the object of a verb,
• I cannot explain John’s hitting of Mary
• as well as in the passive form of the initial sentence
• Sue was alarmed by John’s hitting of Mary.

• Given this evidence, we can conclude that the proposed constituent appears to
behave just like other NPs.
• I looked up John’s phone number and I looked up John’s chimney.
• Should these sentences have the identical structure?
• If so, you would presumably analyze both as subject-verb-complement
sentences with the complement in both cases being a PP.

• That is, up John’s phone number would be a PP.

• Try the conjunction test:


• Conjoining up John’s phone number with another PP, as in
• *I looked up John’s phone number and in his cupboards, is certainly not acceptable, whereas
• I looked up John’s chimney and in his cupboards is perfectly acceptable.

• Thus the analysis of up John’s phone number as a PP is apparently incorrect.


• Thus a different analysis is needed for each of the two sentences.
• If up John’s phone number is not a PP,
• then two remaining analyses may be possible.
• The VP could be the complex verb looked up followed by an NP,
• or it could consist of three components: the V looked, a particle up, and an NP.
• Either of these is a better solution.
Top-Down Parser
• Parsing algorithm: a procedure that searches through various ways of combining
grammatical rules to find a combination that generates a tree that could be the
structure of the input sentence.
• the algorithm will say whether a certain sentence is accepted by the grammar or not
• A top-down parser: starts with the S symbol and attempts to rewrite it into a
sequence of terminal symbols that matches the classes of the words in the input
sentence.
• The state of the parse at any given time can be represented as a list of symbols that
are the result of operations applied so far, called the symbol list.
• the parser starts in the state (S) and after applying the rule S -> NP VP the symbol list will be (NP
VP).
• if it then applies the rule NP ->ART N, the symbol list will be (ART N VP), and so on...
• The parser could continue in this fashion—
• until the state consisted entirely of terminal symbols,
• and then it could check the input sentence to see if it matched
• But this would be quite wasteful, for a mistake made early on (say, in choosing the
rule that rewrites S) is not discovered until much later
• A better algorithm checks the input as soon as it can.
• In addition, a structure called the lexicon is used to efficiently store the possible categories for each word.
• A very small lexicon for use in the examples is:
• cried: V
• dogs: N, V
• the: ART

Grammar:
1. S -> NP VP
2. NP -> ART N
3. NP -> ART ADJ N
4. VP -> V
5. VP -> V NP

• A state of the parse is now defined by a pair:
• a symbol list similar to before and a number indicating the current position in the sentence.
• 1 The 2 dogs 3 cried 4

• A typical parse state would be ((N VP) 2) indicating that the parser needs to find an N followed by a VP, starting at position two.
• New states are generated from old states depending on whether the first symbol is a lexical symbol or not.
• If it is a lexical symbol, like N in the preceding example, and if the next word can belong to that lexical category,
• then update the state by removing the first symbol and updating the position counter.
• since the word dogs is listed as an N in the lexicon, the next parser state would be ((VP) 3) which means it needs to find a VP starting at
position 3.
• If the first symbol is a nonterminal, like VP, then it is rewritten using a rule from the grammar.
• using rule 4 in Grammar, the new state would be ((V) 3) which means it needs to find a V starting at position 3.
• On the other hand, using rule 5, the new state would be ((V NP) 3)
• A parsing algorithm that is guaranteed to find a parse if there is one must systematically explore every possible new state.
• One simple technique for this is called backtracking.
• Using this approach, rather than generating a single new state from the state ((VP) 3), we generate all possible new states.
• One of these is picked to be the next state and the rest are saved as backup states.
• If we ever reach a situation where the current state cannot lead to a solution,
• simply pick a new current state from the list of backup states.
Simple Top-Down Parsing Algorithm

• The algorithm manipulates a list of possible states, called the possibilities list.
• The first element of this list is the current state,
• which consists of a symbol list and a word position in the sentence.
• The remaining elements are the backup states, each an alternate pair of a
symbol list and a word position.
• For example, the possibilities list
• (((N) 2) ((NAME) 1) ((ADJ N) 1))
• indicates that the current state consists of the symbol list (N) at position 2,
• and that there are two backup states:
• one consisting of the symbol list (NAME) at position 1,
• and the other consisting of the symbol list (ADJ N) at position 1.
• The algorithm starts with the initial state ((S) 1) and no backup states.

• 1. Select the current state: Take the first state off the possibilities list and call
it C.
• If the possibilities list is empty, then the algorithm fails (that is, no successful parse is
possible).
• 2. If C consists of an empty symbol list and the word position is at the end of
the sentence, then the algorithm succeeds.
• 3. Otherwise, generate the next possible states.
• 3.1. If the first symbol on the symbol list of C is a lexical symbol, and the next word in the
sentence can be in that class,
• then create a new state by removing the first symbol from the symbol list and updating the
word position, and add it to the possibilities list.
• 3.2. Otherwise, if the first symbol on the symbol list of C is a non-terminal, generate a
new state for each rule in the grammar that can rewrite that nonterminal symbol and
add them all to the possibilities list.
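The three steps above can be sketched directly in Python, assuming the five-rule grammar and the small lexicon from the example (the, dogs, cried). States are (symbol list, position) pairs, with positions indexing words from 0 rather than 1.

```python
# Assumed grammar (rules 1-5) and lexicon from the example.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["ART", "N"], ["ART", "ADJ", "N"]],
    "VP": [["V"], ["V", "NP"]],
}
LEXICON = {"the": {"ART"}, "dogs": {"N", "V"}, "cried": {"V"}}

def parse(words):
    possibilities = [(("S",), 0)]                    # initial state ((S) 1), no backups
    while possibilities:
        symbols, pos = possibilities.pop(0)          # 1. select the current state C
        if not symbols and pos == len(words):        # 2. empty list at end: success
            return True
        if not symbols:
            continue                                 # empty list too early: dead end
        first, rest = symbols[0], symbols[1:]
        if first in GRAMMAR:                         # 3.2 rewrite a nonterminal
            # New states go on the front of the list, making the search depth first.
            possibilities = [(tuple(r) + rest, pos) for r in GRAMMAR[first]] + possibilities
        elif pos < len(words) and first in LEXICON.get(words[pos], set()):
            possibilities = [(rest, pos + 1)] + possibilities   # 3.1 match a word
    return False                                     # possibilities list exhausted

print(parse("the dogs cried".split()))   # True
```

Backup states are simply the tail of the possibilities list, so backtracking amounts to popping the next state after a dead end.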
Top-down depth-first parse of: 1 The 2 dogs 3 cried 4

Grammar:
1. S -> NP VP
2. NP -> ART N
3. NP -> ART ADJ N
4. VP -> V
5. VP -> V NP

Step  Current State        Backup States           Comment
1.    ((S) 1)                                      initial position
2.    ((NP VP) 1)                                  rewriting S by rule 1
3.    ((ART N VP) 1)       ((ART ADJ N VP) 1)      rewriting NP by rules 2 & 3
4.    ((N VP) 2)           ((ART ADJ N VP) 1)      matching ART with the
5.    ((VP) 3)             ((ART ADJ N VP) 1)      matching N with dogs
6.    ((V) 3)              ((V NP) 3)              rewriting VP by rules 4 & 5
                           ((ART ADJ N VP) 1)
7.    The parse succeeds: V is matched to cried, leaving an empty
      symbol list at the end of the sentence.
Now, let us consider the same algorithm and grammar
operating on the sentence:
1 The 2 old 3 man 4 cried 5

lexicon is:
•the: ART
•old: ADJ, N (ambiguous)
•man: N, V (ambiguous)
•cried: V

Grammar:
1. S -> NP VP
2. NP -> ART N
3. NP -> ART ADJ N
4. VP -> V
5. VP -> V NP
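The same depth-first sketch, restated here so the example is self-contained, shows how the ambiguous lexicon forces backtracking: the NP -> ART N reading of the old is tried first and fails later in the sentence, so the parser backs up to the NP -> ART ADJ N reading, which succeeds.

```python
# Assumed grammar (rules 1-5) and the ambiguous lexicon from this example.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["ART", "N"], ["ART", "ADJ", "N"]],
    "VP": [["V"], ["V", "NP"]],
}
LEXICON = {"the": {"ART"}, "old": {"ADJ", "N"}, "man": {"N", "V"}, "cried": {"V"}}

def parse(words):
    """Return the number of states examined on success, or None on failure."""
    possibilities = [(("S",), 0)]
    examined = 0
    while possibilities:
        examined += 1
        symbols, pos = possibilities.pop(0)          # current state C
        if not symbols:
            if pos == len(words):                    # empty list at end: success
                return examined
            continue                                 # empty list too early: dead end
        first, rest = symbols[0], symbols[1:]
        if first in GRAMMAR:                         # nonterminal: rewrite it
            possibilities = [(tuple(r) + rest, pos) for r in GRAMMAR[first]] + possibilities
        elif pos < len(words) and first in LEXICON.get(words[pos], set()):
            possibilities = [(rest, pos + 1)] + possibilities   # lexical match
    return None

print(parse("the old man cried".split()))   # a step count; None would mean failure
```

The step count is larger than for the dogs cried because the old as ART N, with man taken as a V, consumes the wrong words and has to be abandoned.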
Bottom-Up Chart Parser
The large can can hold the water
