Compiler Design: Top-Down Parsing with a Recursive Descent Parser
Parsing
The lexical analyzer has translated the source program into a sequence of tokens. The parser must translate this sequence of tokens into an intermediate representation.
Assume that the interface is as follows: the parser can call getNextToken to get the next token from the lexical analyzer, and it can call a function named emit to output the intermediate representation, whose form is currently unspecified.
The parser outputs error messages if the syntax of the source program is wrong
Given a grammar such as E -> 0 | E + E and a string to parse such as "0 + 0", a parser can parse top-down, from the start symbol E:
E => E + E => 0 + E => 0 + 0
Bottom-up parsing generally requires table-driven parsing. If the grammar is not complicated, the simplest approach is to implement a recursive-descent parser. A recursive-descent parser of the kind described here does not need backtracking to try alternative paths along the parse (derivation).
Write one parse function, parseN, for each non-terminal N with a rule N -> E, where E consists of a sequence of non-terminal and terminal symbols. This requires that the grammar has no left recursion.
Parsing a rule
A sequence of non-terminal and terminal symbols, Y1 Y2 Y3 ... Yn, is recognized by parsing each symbol in turn.
For each non-terminal symbol, Y, call the corresponding parse function parseY.
For each terminal symbol, y, call a function expect(y), which checks that y is the next token in the input.
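The scheme above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the Parser class, the ParseError exception, and the rule <paren> -> '(' <expr> ')' are hypothetical names chosen for the example, and the token stream is assumed to be a list of strings terminated by an "EOT" marker.

```python
class ParseError(Exception):
    """Raised when the token stream does not match the grammar."""


class Parser:
    def __init__(self, tokens):
        # Assumption: tokens are plain strings; "EOT" marks the end of text.
        self.tokens = list(tokens) + ["EOT"]
        self.pos = 0

    @property
    def current(self):
        return self.tokens[self.pos]

    def expect(self, terminal):
        # Check that the current token is the given terminal, then advance.
        if self.current != terminal:
            raise ParseError(
                f"expected {terminal!r} but found {self.current!r}")
        self.pos += 1

    def parse_expr(self):
        # Stand-in for a real expression parser: accepts a single "0".
        self.expect("0")

    def parse_paren(self):
        # Hypothetical rule: <paren> -> '(' <expr> ')'
        # Terminals go through expect(); the non-terminal <expr> is handled
        # by calling its own parse function.
        self.expect("(")
        self.parse_expr()
        self.expect(")")
```

Calling Parser(["(", "0", ")"]).parse_paren() consumes all three tokens, while Parser(["(", "0"]).parse_paren() raises ParseError because ')' is missing.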
Look-Ahead
In general, one non-terminal may have more than one production, so it might seem that more than one function is needed to parse that non-terminal. Instead, we insist that we can decide which rule to parse just by looking ahead one symbol in the input.
<sentence> -> 'if' '(' <expr> ')' <block> | 'while' '(' <expr> ')' <block> ...
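First(E), where E is the right-hand side of a rule, is the set of terminal symbols that can appear first in any sentence derived from E.
Follow(<N>), where <N> is a non-terminal symbol in the grammar, is the set of terminal symbols that can follow immediately after any sentence derived from any rule of <N>.
In this grammar: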
E -> 0 | E + E
First(E) = {0}
Grammar Restriction 1
Grammar Restriction 1 (for top-down parsing): The First sets of the alternative rules for the same left-hand side must be disjoint, so that we know which path to take upon seeing the first terminal symbol (token). Notice that this is not true of the grammar above: upon seeing 0 we do not know whether to take the 0 path or the E + E path.
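One way to check Restriction 1 mechanically is to test the First sets of the alternatives pairwise. A minimal Python sketch follows; the function name first_sets_disjoint is hypothetical, and the First sets are written out by hand rather than computed.

```python
from itertools import combinations


def first_sets_disjoint(first_sets):
    """Return True if no terminal appears in two of the given First sets."""
    return all(a.isdisjoint(b) for a, b in combinations(first_sets, 2))


# E -> 0 | E + E: both alternatives can start with 0, so the check fails.
print(first_sets_disjoint([{"0"}, {"0"}]))        # False
# <sentence> -> 'if' ... | 'while' ...: the alternatives start differently.
print(first_sets_disjoint([{"if"}, {"while"}]))   # True
```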
Recognizing Sequences
In a context-free grammar, you often have rules that specify that any number of occurrences of a phrase can appear:
<arglist> -> <arg> <arglist> | ε
In extended BNF, we replace this with * to indicate 0 or more occurrences:
<arg>*
We can recognize these sequences by using iteration. If there is a rule of the form
<phrase1> -> <phrase2>*
we can recognize the <phrase2> occurrences by
while (currentsymbol is in First(<phrase2>)) do parsePhrase2()
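The while loop above might look as follows in Python. This is a sketch under stated assumptions: FIRST_ARG is an assumed First(<arg>), each <arg> is taken to be a single token, and parse_arglist is a hypothetical name.

```python
FIRST_ARG = {"id", "num"}  # assumed First(<arg>) for this illustration


def parse_arglist(tokens, pos):
    """Parse <arg>* starting at tokens[pos]; return (args, new_pos)."""
    args = []
    # Keep parsing <arg> while the current symbol is in First(<arg>).
    while pos < len(tokens) and tokens[pos] in FIRST_ARG:
        args.append(tokens[pos])  # parseArg: here an <arg> is one token
        pos += 1
    return args, pos


print(parse_arglist(["id", "num", ")"], 0))  # (['id', 'num'], 2)
print(parse_arglist([")"], 0))               # ([], 0)
```

The second call shows the zero-occurrence case: the loop body never runs and no input is consumed.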
Grammar Restriction 2
Whenever a grammar symbol may generate an empty sentence, as with 0 or more occurrences above, the grammar must be restricted. Suppose that <phrase2> is the symbol that can occur 0 times. We require that the sets First(<phrase2>) and Follow(<phrase2>) be disjoint.
Grammar Restriction 2: If a non-terminal may occur 0 times, its First and Follow sets must be disjoint (so we know whether to parse it or skip it on seeing a terminal symbol/token).
Multiple Rules
Suppose that there is a non-terminal symbol with multiple rules, where each rhs is non-empty:
<phrase1> -> E1 | E2 | E3 | ... | En
Then we can write ParsePhrase1 as follows:
if (currentsymbol is in First(E1)) then ParseE1
elsif (currentsymbol is in First(E2)) then ParseE2
...
elsif (currentsymbol is in First(En)) then ParseEn
else Syntax Error
If any rhs can be empty, then don't give the syntax error.
Remember the first grammar restriction: the sets First(E1), ..., First(En) must be disjoint.
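The if/elsif chain can be sketched in Python for the <sentence> rule with 'if' and 'while' alternatives. The helper bodies are placeholders and all names (ParseError, parse_if, parse_while, parse_sentence) are hypothetical; a real parser would go on to parse '(' <expr> ')' <block> in each helper.

```python
class ParseError(Exception):
    """Raised when no alternative matches the current symbol."""


FIRST_IF = {"if"}        # First of the 'if' alternative
FIRST_WHILE = {"while"}  # First of the 'while' alternative


def parse_if(tokens, pos):
    # Placeholder: consumes only the keyword.
    return "if-statement", pos + 1


def parse_while(tokens, pos):
    # Placeholder: consumes only the keyword.
    return "while-statement", pos + 1


def parse_sentence(tokens, pos):
    """Choose the alternative by looking at one symbol of look-ahead."""
    current = tokens[pos] if pos < len(tokens) else "EOT"
    if current in FIRST_IF:
        return parse_if(tokens, pos)
    elif current in FIRST_WHILE:
        return parse_while(tokens, pos)
    else:
        # No rhs of <sentence> can be empty, so this is a syntax error.
        raise ParseError(f"unexpected token {current!r}")


print(parse_sentence(["while", "("], 0))  # ('while-statement', 1)
```

Because the two First sets are disjoint, at most one branch can apply for any look-ahead symbol.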
First Sets
Here we give a more formal, and more detailed, definition of a First set, starting with any non-terminal.
If we have a set of rules for a non-terminal <phrase1>,
<phrase1> -> E1 | E2 | E3 | ... | En
then First(<phrase1>) = First(E1) + ... + First(En), where + is set union.
For any right-hand side Y1 Y2 Y3 ... Yn, we make cases on the form of the rule:
First(a Y2 Y3 ... Yn) = {a}, for any terminal symbol a
First(N Y2 Y3 ... Yn) = First(N), for any non-terminal N that does not generate the empty string
First([N] M) = First(N) + First(M) (0 or 1 occurrence of N)
First({N}* M) = First(N) + First(M) (0 or more occurrences of N)
First(ε) = {} (the empty set)
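These cases can be turned into a small fixed-point computation. The sketch below makes several assumptions that are not in the slides: the grammar is written in plain BNF (optional and repeated parts are expressed with ε-rules instead of [ ] and { }*), it is stored as a Python dict, the empty tuple stands for ε, and the set of non-terminals that can generate the empty string is tracked separately as nullable. The function name first_sets and the example grammar are hypothetical.

```python
def first_sets(grammar):
    """grammar: dict mapping non-terminal -> list of rhs tuples.

    A symbol is a non-terminal if it is a key of grammar, otherwise a
    terminal. Returns (first, nullable).
    """
    first = {n: set() for n in grammar}
    nullable = set()
    changed = True
    while changed:  # repeat until no set grows any more
        changed = False
        for n, rules in grammar.items():
            for rhs in rules:
                all_nullable = True
                for sym in rhs:
                    if sym in grammar:           # non-terminal
                        new = first[sym] - first[n]
                        if new:
                            first[n] |= new
                            changed = True
                        if sym not in nullable:
                            all_nullable = False
                            break
                    else:                        # terminal
                        if sym not in first[n]:
                            first[n].add(sym)
                            changed = True
                        all_nullable = False
                        break
                if all_nullable and n not in nullable:
                    nullable.add(n)
                    changed = True
    return first, nullable


grammar = {
    "call": [("id", "(", "arglist", ")")],
    "arglist": [("arg", "arglist"), ()],   # () is the empty rhs
    "arg": [("id",), ("num",)],
}
first, nullable = first_sets(grammar)
print(sorted(first["arglist"]))  # ['id', 'num']
print(nullable)                  # {'arglist'}
```

As in the definitions above, the empty string is never a member of a First set here; the nullable set records that information instead.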
Follow Sets
To define the set Follow(T), examine the cases of where the non-terminal T may appear on the rhs of a rule in the grammar.
N -> S T U or N -> S [T] U or N -> S {T}* U
If U never generates an empty string, then Follow(T) includes First(U).
If U can generate an empty string, then Follow(T) includes First(U) and Follow(N).
N -> S T or N -> S [T] or N -> S {T}*
Follow(T) includes Follow(N).
The Follow set of the start symbol should contain EOT, the end-of-text marker.
Follow(T) is the union of the sets contributed by every occurrence of T on the rhs of a rule.
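The Follow rules can be sketched as a fixed-point computation in the same style. The assumptions are labeled here because they are not in the slides: the grammar is in plain BNF stored as a dict, the First sets and the set of nullable non-terminals are supplied by the caller (written by hand in the example), and the names follow_sets, grammar, first and nullable are hypothetical.

```python
def follow_sets(grammar, start, first, nullable):
    """grammar: dict non-terminal -> list of rhs tuples; returns Follow sets."""
    follow = {n: set() for n in grammar}
    follow[start].add("EOT")  # the start symbol is followed by end of text
    changed = True
    while changed:
        changed = False
        for n, rules in grammar.items():
            for rhs in rules:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # only non-terminals have Follow sets
                    new = set()
                    rest_nullable = True
                    for nxt in rhs[i + 1:]:
                        if nxt in grammar:
                            new |= first[nxt]
                            if nxt not in nullable:
                                rest_nullable = False
                                break
                        else:
                            new.add(nxt)
                            rest_nullable = False
                            break
                    if rest_nullable:
                        # Everything after sym can be empty (or sym is last),
                        # so whatever follows N can follow sym.
                        new |= follow[n]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


grammar = {
    "call": [("id", "(", "arglist", ")")],
    "arglist": [("arg", "arglist"), ()],
    "arg": [("id",), ("num",)],
}
first = {"call": {"id"}, "arglist": {"id", "num"}, "arg": {"id", "num"}}
nullable = {"arglist"}
follow = follow_sets(grammar, "call", first, nullable)
print(sorted(follow["arg"]))  # [')', 'id', 'num']
```

In this example Follow(arglist) is {')'}, which is disjoint from First(arglist), so the second grammar restriction holds for arglist.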
Each recursive-descent procedure should also take StopSymbols as a parameter, and may modify these to pass to any procedure that it calls.
E.g., if there is a procedure to parse the parameter list of a method call, then it can have ')' as a stop symbol.
Stop Symbols
If the parser is trying to parse the rhs E of a rule N -> E, then the stop symbols are those symbols which the parser is prepared to recognize after a sentence generated by E.
Remove anything ambiguous from Follow(N)
The stop symbols should always also contain the end of text symbol, EOT, so that the syntax error routine never tries to skip over symbols past the end of the program.
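A rough Python sketch of how stop symbols might be used for error recovery follows. It is an illustration under assumptions, not the course's implementation: the token list is assumed to end with "EOT", a parameter is taken to be a single 'id' token, and the names skip_to and parse_params are hypothetical.

```python
def skip_to(tokens, pos, stop_symbols):
    """Advance pos until tokens[pos] is a stop symbol; return the new pos."""
    # "EOT" is always a stop symbol, so the loop cannot run past the end
    # as long as the token list ends with "EOT".
    stops = set(stop_symbols) | {"EOT"}
    while tokens[pos] not in stops:
        pos += 1
    return pos


def parse_params(tokens, pos, stop_symbols, errors):
    """Parse zero or more 'id' tokens; return the position after them."""
    # This procedure is prepared to see ')' after the parameters, so it
    # adds ')' to the stop symbols it was given.
    stops = set(stop_symbols) | {")", "EOT"}
    while tokens[pos] not in stops:
        if tokens[pos] == "id":
            pos += 1
        else:
            errors.append(f"unexpected {tokens[pos]!r} at position {pos}")
            # Skip the offending symbols, resuming at a stop symbol or at
            # the next token that can start a parameter.
            pos = skip_to(tokens, pos, stops | {"id"})
    return pos


errors = []
end = parse_params(["id", "+", "+", "id", ")", "EOT"], 0, {"EOT"}, errors)
print(end, len(errors))  # 4 1
```

The two stray '+' tokens produce a single error message, and parsing resumes at the second 'id' instead of aborting.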