Introduction_to_Parsing

The document discusses the fundamentals of parsing in the context of a computer science course (COMP 412) at Rice University. It covers the role of parsers, context-free grammars, derivations, and the importance of precedence in grammar design. The document also highlights the limitations of regular languages and the necessity of context-free grammars for certain language constructs.
COMP 412

FALL 2010

Introduction to Parsing

Comp 412

Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 412 at Rice University have explicit permission to make
copies of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit
educational purposes, provided this copyright notice is preserved.
The Front End

Source code → Scanner → tokens → Parser → IR
(Scanner and Parser both report Errors)

Parser
• Checks the stream of words and their parts of speech
(produced by the scanner) for grammatical correctness
• Determines if the input is syntactically well formed
• Guides checking at deeper levels than syntax
• Builds an IR representation of the code
Think of this chapter as the mathematics of diagramming
sentences

The Study of Parsing
The process of discovering a derivation for some sentence
• Need a mathematical model of syntax — a grammar G
• Need an algorithm for testing membership in L(G)
• Need to keep in mind that our goal is building parsers,
not studying the mathematics of arbitrary languages

Roadmap for our study of parsing


1 Context-free grammars and derivations Today
2 Top-down parsing
— Generated LL(1) parsers & hand-coded recursive descent parsers
3 Bottom-up parsing Lab 2
— Generated LR(1) parsers

We will define “context free” today. I am just deferring the definition for a couple of slides.
Specifying Syntax with a Grammar
Context-free syntax is specified with a context-free grammar

SheepNoise → SheepNoise baa
           | baa

This CFG defines the set of noises sheep normally make

It is written in a variant of Backus–Naur form

Formally, a grammar is a four tuple, G = (S,N,T,P)

• S is the start symbol (it stands for the set of strings in L(G))
• N is a set of nonterminal symbols (syntactic variables)
• T is a set of terminal symbols (words)
• P is a set of productions or rewrite rules (P : N → (N ∪ T)+ )
Example due to Dr. Scott K. Warren
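
As a rough illustration, the four tuple can be written down directly as data. The sketch below uses Python; the names `Grammar`, `is_well_formed`, and `SHEEP_NOISE` are chosen for this example and are not part of the course materials. It also assumes the usual convention that N and T are disjoint, which the slide does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    start: str               # S, the start symbol
    nonterminals: frozenset  # N
    terminals: frozenset     # T
    productions: tuple       # P, as (lhs, rhs) pairs; rhs is a tuple of symbols

    def is_well_formed(self):
        """Check S in N, N and T disjoint (assumed), and P : N -> (N u T)+."""
        symbols = self.nonterminals | self.terminals
        if self.start not in self.nonterminals:
            return False
        if self.nonterminals & self.terminals:
            return False
        return all(
            lhs in self.nonterminals
            and len(rhs) >= 1
            and all(sym in symbols for sym in rhs)
            for lhs, rhs in self.productions
        )


# The SheepNoise grammar from the slide, written as a four tuple.
SHEEP_NOISE = Grammar(
    start="SheepNoise",
    nonterminals=frozenset({"SheepNoise"}),
    terminals=frozenset({"baa"}),
    productions=(
        ("SheepNoise", ("SheepNoise", "baa")),
        ("SheepNoise", ("baa",)),
    ),
)
```

Every production here has exactly one nonterminal on its left-hand side, which is the property the slides later use to define "context free".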

From Lecture 4
Deriving Syntax
We can use the SheepNoise grammar to create sentences
— use the productions as rewriting rules

And so on ...
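
The rewriting process can be sketched in a few lines of Python. This is only an illustration: the function name `derive_sheep_noise` is made up for this example, and it always rewrites the single nonterminal in the sentential form, using the first alternative n - 1 times and then the second alternative once.

```python
def derive_sheep_noise(n):
    """Return the sentential forms in a derivation of n 'baa' words (n >= 1)."""
    if n < 1:
        raise ValueError("SheepNoise derives at least one baa")
    form = ["SheepNoise"]
    forms = [" ".join(form)]
    for _ in range(n - 1):
        # SheepNoise -> SheepNoise baa
        form[0:1] = ["SheepNoise", "baa"]
        forms.append(" ".join(form))
    # SheepNoise -> baa
    form[0:1] = ["baa"]
    forms.append(" ".join(form))
    return forms


for step in derive_sheep_noise(3):
    print(step)
```

Each printed line is one sentential form; the last line contains only terminals, so it is a sentence in L(SheepNoise).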

While this example is cute, it quickly runs out of intellectual steam ...
Why Not Use Regular Languages & DFAs?
Not all languages are regular (RL’s ⊂ CFL’s ⊂ CSL’s)
You cannot construct DFA’s to recognize these languages
• L = { p^k q^k } (parenthesis languages)
• L = { w c w^r | w ∈ Σ* }
Neither of these is a regular language (nor an RE)

To recognize these features requires an arbitrary amount of context (left or right …)
But, this issue is somewhat subtle. You can construct DFA’s for
• Strings with alternating 0’s and 1’s
( ε | 1 ) ( 01 )* ( ε | 0 )
• Strings with an even number of 0’s and 1’s
RE’s can count bounded sets and bounded differences
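
The contrast can be made concrete with a small Python sketch. The alternating-bits language is handled by an ordinary regular expression, while the recognizer for { p^k q^k } needs a count that can grow without bound, which a fixed set of DFA states cannot hold. The function names are illustrative, and the choice k ≥ 1 is an assumption (the slide does not say whether k = 0 is allowed).

```python
import re

# ( ε | 1 ) ( 01 )* ( ε | 0 ) written in Python's regex syntax.
ALTERNATING = re.compile(r"1?(01)*0?")


def alternates(s):
    """True if s is a string of alternating 0's and 1's (regular language)."""
    return ALTERNATING.fullmatch(s) is not None


def is_pk_qk(s):
    """True if s is p^k q^k for some k >= 1 (k >= 1 is an assumption).

    The count k is unbounded, which is what a finite automaton cannot track.
    """
    k = len(s) - len(s.lstrip("p"))  # number of leading p's
    return k >= 1 and s[k:] == "q" * k
```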
Limits of Regular Languages
Advantages of Regular Expressions
• Simple & powerful notation for specifying patterns
• Automatic construction of fast recognizers
• Many kinds of syntax can be specified with REs
Example — a regular expression for arithmetic expressions
Term → [a-zA-Z] ([a-zA-Z] | [0-9])*
Op → + | - | * | /
Expr → ( Term Op )* Term
([a-zA-Z] ([a-zA-Z] | [0-9])* (+ | - | * | /))* [a-zA-Z] ([a-zA-Z] | [0-9])*
Of course, this would generate a DFA …

If REs are so useful … Why not use them for everything?
⇒ Cannot add parentheses, brackets, begin-end pairs, …
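
A minimal sketch of that regular expression in Python's `re` syntax shows both what it accepts and where it stops. The names `TERM`, `OP`, `EXPR`, and `is_flat_expression` are chosen for this example; as on the slide, a Term must start with a letter and whitespace is not handled.

```python
import re

TERM = r"[a-zA-Z][a-zA-Z0-9]*"   # Term → [a-zA-Z] ([a-zA-Z] | [0-9])*
OP = r"[-+*/]"                    # Op   → + | - | * | /
EXPR = re.compile(rf"(?:{TERM}{OP})*{TERM}")  # Expr → ( Term Op )* Term


def is_flat_expression(text):
    """True if text is a parenthesis-free expression over Terms and Ops."""
    return EXPR.fullmatch(text) is not None
```

The pattern accepts `a+b2*c` but rejects `(a+b)*c`: matching nested parentheses would require counting how many are open, which is beyond a regular expression.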


Context-free Grammars
What makes a grammar “context free”?

The SheepNoise grammar has a specific form:

SheepNoise → SheepNoise baa
           | baa

Productions have a single nonterminal on the left hand side, which makes it impossible to encode left or right context.
⇒ The grammar is context free.
A context-sensitive grammar can have ≥ 1 nonterminal on lhs.

Notice that L(SheepNoise) is actually a regular language: baa+

Classic definition: any language that can be recognized by a push-down automaton is a context-free language.
A More Useful Grammar Than Sheep Noise
To explore the uses of CFGs, we need a more complex grammar

Rule  Production
0     Expr → Expr Op Expr
1          | number
2          | id
3     Op   → +
4          | -
5          | *
6          | /

Rule  Sentential Form
—     Expr
0     Expr Op Expr
2     <id,x> Op Expr
4     <id,x> - Expr
0     <id,x> - Expr Op Expr
1     <id,x> - <num,2> Op Expr
5     <id,x> - <num,2> * Expr
2     <id,x> - <num,2> * <id,y>

• Such a sequence of rewrites is called a derivation
• Process of discovering a derivation is called parsing
We denote this derivation: Expr ⇒* id – num * id
Derivations
The point of parsing is to construct a derivation

• At each step, we choose a nonterminal to replace


• Different choices can lead to different derivations
Two derivations are of interest
• Leftmost derivation — replace leftmost NT at each step
• Rightmost derivation — replace rightmost NT at each step
These are the two systematic derivations
(We don’t care about randomly-ordered derivations!)

The example on the preceding slide was a leftmost derivation
• Of course, there is also a rightmost derivation
• Interestingly, it turns out to be different
Derivations
The point of parsing is to construct a derivation

A derivation consists of a series of rewrite steps

S ⇒ γ_0 ⇒ γ_1 ⇒ γ_2 ⇒ … ⇒ γ_{n–1} ⇒ γ_n ⇒ sentence

• Each γ_i is a sentential form
— If γ contains only terminal symbols, γ is a sentence in L(G)
— If γ contains 1 or more non-terminals, γ is a sentential form
• To get γ_i from γ_{i–1}, expand some NT A ∈ γ_{i–1} by using A → β
— Replace the occurrence of A ∈ γ_{i–1} with β to get γ_i
— In a leftmost derivation, it would be the first NT A ∈ γ_{i–1}

A left-sentential form occurs in a leftmost derivation
A right-sentential form occurs in a rightmost derivation


The Two Derivations for x – 2 * y

Leftmost derivation
Rule  Sentential Form
—     Expr
0     Expr Op Expr
2     <id,x> Op Expr
4     <id,x> - Expr
0     <id,x> - Expr Op Expr
1     <id,x> - <num,2> Op Expr
5     <id,x> - <num,2> * Expr
2     <id,x> - <num,2> * <id,y>

Rightmost derivation
Rule  Sentential Form
—     Expr
0     Expr Op Expr
2     Expr Op <id,y>
5     Expr * <id,y>
0     Expr Op Expr * <id,y>
1     Expr Op <num,2> * <id,y>
4     Expr - <num,2> * <id,y>
2     <id,x> - <num,2> * <id,y>

In both cases, Expr ⇒* id – num * id
• The two derivations produce different parse trees
• The parse trees imply different evaluation orders!
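
The two rule sequences can be replayed mechanically. The Python sketch below numbers the productions as in the expression grammar from the earlier slide and applies each rule to the leftmost or rightmost nonterminal; the function name `replay` is made up for this example, and the sentential forms show only the token categories (`id`, `number`), not the lexemes.

```python
# Productions of the simple expression grammar, indexed by rule number.
RULES = [
    ("Expr", ["Expr", "Op", "Expr"]),  # 0
    ("Expr", ["number"]),              # 1
    ("Expr", ["id"]),                  # 2
    ("Op", ["+"]),                     # 3
    ("Op", ["-"]),                     # 4
    ("Op", ["*"]),                     # 5
    ("Op", ["/"]),                     # 6
]
NONTERMINALS = {"Expr", "Op"}


def replay(rule_numbers, leftmost=True):
    """Apply rules in order to the leftmost (or rightmost) nonterminal."""
    form = ["Expr"]
    forms = [" ".join(form)]
    for number in rule_numbers:
        lhs, rhs = RULES[number]
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        if not positions:
            raise ValueError("no nonterminal left to expand")
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"rule {number} does not apply to {form[i]}")
        form[i:i + 1] = rhs
        forms.append(" ".join(form))
    return forms


left = replay([0, 2, 4, 0, 1, 5, 2], leftmost=True)
right = replay([0, 2, 5, 0, 1, 4, 2], leftmost=False)
print(left[-1])   # final sentence of the leftmost derivation
print(right[-1])  # final sentence of the rightmost derivation
```

Both replays end in the same sentence even though their intermediate sentential forms differ.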
Derivations and Parse Trees
Leftmost derivation
Rule  Sentential Form
—     Expr
0     Expr Op Expr
2     <id,x> Op Expr
4     <id,x> - Expr
0     <id,x> - Expr Op Expr
1     <id,x> - <num,2> Op Expr
5     <id,x> - <num,2> * Expr
2     <id,x> - <num,2> * <id,y>

Parse tree (bracketed form): G( E( E(x) Op(–) E( E(2) Op(*) E(y) ) ) )

This evaluates as x – ( 2 * y )


Derivations and Parse Trees
Rightmost derivation
Rule  Sentential Form
—     Expr
0     Expr Op Expr
2     Expr Op <id,y>
5     Expr * <id,y>
0     Expr Op Expr * <id,y>
1     Expr Op <num,2> * <id,y>
4     Expr - <num,2> * <id,y>
2     <id,x> - <num,2> * <id,y>

Parse tree (bracketed form): G( E( E( E(x) Op(–) E(2) ) Op(*) E(y) ) )

This evaluates as ( x – 2 ) * y

This ambiguity is NOT good
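
To see why the two trees matter, the short Python sketch below evaluates both shapes for x – 2 * y. The tuple encoding `(left, op, right)`, the function `evaluate`, and the values chosen for x and y are all assumptions made for this illustration.

```python
import operator

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(tree, env):
    """Evaluate a tree of nested (left, op, right) tuples bottom-up."""
    if isinstance(tree, tuple):
        left, op, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):
        return env[tree]   # identifier: look up its value
    return tree            # number literal


# Tree shape from the leftmost derivation: x - (2 * y)
LEFTMOST_TREE = ("x", "-", (2, "*", "y"))
# Tree shape from the rightmost derivation: (x - 2) * y
RIGHTMOST_TREE = (("x", "-", 2), "*", "y")

env = {"x": 10, "y": 3}
print(evaluate(LEFTMOST_TREE, env))   # 10 - (2 * 3) = 4
print(evaluate(RIGHTMOST_TREE, env))  # (10 - 2) * 3 = 24
```

The same token sequence yields 4 under one tree and 24 under the other, so the grammar alone does not pin down the meaning.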
Derivations and Precedence

These two derivations point out a problem with the grammar:
It has no notion of precedence, or implied order of evaluation

To add precedence
• Create a nonterminal for each level of precedence
• Isolate the corresponding part of the grammar
• Force the parser to recognize high precedence subexpressions first

For algebraic expressions
• Parentheses first (level 1)
• Multiplication and division, next (level 2)
• Subtraction and addition, last (level 3)


Derivations and Precedence
Adding the standard algebraic precedence produces:

Rule  Production                 Level
0     Goal   → Expr
1     Expr   → Expr + Term       level 3
2            | Expr - Term
3            | Term
4     Term   → Term * Factor     level 2
5            | Term / Factor
6            | Factor
7     Factor → ( Expr )          level 1
8            | number
9            | id

One form of the “classic expression grammar”

This grammar is slightly larger
• Takes more rewriting to reach some of the terminal symbols
• Encodes expected precedence
• Produces same parse tree under leftmost & rightmost derivations
• Correctness trumps the speed of the parser

Cannot handle precedence in an RE for expressions
Introduced parentheses, too (beyond power of an RE)
Let’s see how it parses x - 2 * y
Derivations and Precedence
The rightmost derivation
Rule  Sentential Form
—     Goal
0     Expr
2     Expr - Term
4     Expr - Term * Factor
9     Expr - Term * <id,y>
6     Expr - Factor * <id,y>
8     Expr - <num,2> * <id,y>
3     Term - <num,2> * <id,y>
6     Factor - <num,2> * <id,y>
9     <id,x> - <num,2> * <id,y>

Its parse tree (bracketed form): G( E( E( T( F(<id,x>) ) ) – T( T( F(<num,2>) ) * F(<id,y>) ) ) )

It derives x – ( 2 * y ), along with an appropriate parse tree.

Both the leftmost and rightmost derivations give the same expression, because the grammar directly and explicitly encodes the desired precedence.
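
One way to see the precedence levels at work is a small hand-written parser with one function per nonterminal. This Python sketch is an assumption-laden illustration, not the course's parser: it replaces the left-recursive rules (Expr → Expr + Term, Term → Term * Factor) with loops, a standard hand-coding trick, and returns nested `(left, op, right)` tuples. The names `tokenize` and `parse` are made up for this example.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[A-Za-z]\w*|[-+*/()])")


def tokenize(text):
    """Split text into numbers, identifiers, operators, and parentheses."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Parse text with the classic expression grammar; return a tuple tree."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():                      # Factor → ( Expr ) | number | id
        if peek() is None:
            raise SyntaxError("unexpected end of input")
        token = advance()
        if token == "(":
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if token in {"+", "-", "*", "/", ")"}:
            raise SyntaxError(f"unexpected token {token!r}")
        return token

    def term():                        # Term → Term (* | /) Factor | Factor
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (node, op, factor())
        return node

    def expr():                        # Expr → Expr (+ | -) Term | Term
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (node, op, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse("x - 2 * y"))
print(parse("(x - 2) * y"))
```

Because `term` is called from inside `expr`, multiplication is grouped before subtraction unless parentheses say otherwise, which mirrors how the grammar's levels encode precedence.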
Ambiguous Grammars
Let’s leap back to our original expression grammar.
It had other problems.
Rule  Production
0     Expr → Expr Op Expr
1          | number
2          | id
3     Op   → +
4          | -
5          | *
6          | /

Rule  Sentential Form
—     Expr
0     Expr Op Expr
2     <id,x> Op Expr
4     <id,x> - Expr
0     <id,x> - Expr Op Expr
1     <id,x> - <num,2> Op Expr
5     <id,x> - <num,2> * Expr
2     <id,x> - <num,2> * <id,y>

Callout on the derivation: different choice than the first time

• This grammar allows multiple leftmost derivations for x - 2 * y
• Hard to automate derivation if > 1 choice
• The grammar is ambiguous
Two Leftmost Derivations for x – 2 * y
The Difference:
⇒ Different productions chosen on the second step

Original choice
Rule  Sentential Form
—     Expr
0     Expr Op Expr
2     <id,x> Op Expr
4     <id,x> - Expr
0     <id,x> - Expr Op Expr
1     <id,x> - <num,2> Op Expr
5     <id,x> - <num,2> * Expr
2     <id,x> - <num,2> * <id,y>

New choice
Rule  Sentential Form
—     Expr
0     Expr Op Expr
0     Expr Op Expr Op Expr
2     <id,x> Op Expr Op Expr
4     <id,x> - Expr Op Expr
1     <id,x> - <num,2> Op Expr
5     <id,x> - <num,2> * Expr
2     <id,x> - <num,2> * <id,y>

⇒ Both derivations succeed in producing x - 2 * y
Different choices in same situation, again
Remember nondeterminism?
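
The number of distinct parse trees can be enumerated directly. The Python sketch below assumes a well-formed token list that alternates operands and operators, and tries every operator as the root of the tree, which is exactly the freedom that Expr → Expr Op Expr allows. The function name `parses` and the `(left, op, right)` tuple encoding are choices made for this example.

```python
def parses(tokens):
    """Yield every parse tree for tokens under Expr → Expr Op Expr | number | id.

    Assumes tokens alternate operand, operator, operand, ...
    """
    if len(tokens) == 1:
        yield tokens[0]
        return
    # Operators sit at the odd positions; any of them may be the root.
    for i in range(1, len(tokens), 2):
        for left in parses(tokens[:i]):
            for right in parses(tokens[i + 1:]):
                yield (left, tokens[i], right)


trees = list(parses(["x", "-", "2", "*", "y"]))
for tree in trees:
    print(tree)
```

For x - 2 * y this produces two trees, one per leftmost derivation shown on the slide; each parse tree corresponds to exactly one leftmost derivation.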
Ambiguous Grammars
Definitions
• If a grammar has more than one leftmost derivation for a single sentential form, the grammar is ambiguous
• If a grammar has more than one rightmost derivation for a single sentential form, the grammar is ambiguous
• The leftmost and rightmost derivations for a sentential form may differ, even in an unambiguous grammar
— However, they must have the same parse tree!

Classic example — the if-then-else problem

Stmt → if Expr then Stmt
     | if Expr then Stmt else Stmt
     | … other stmts …

This ambiguity is inherent in the grammar
Ambiguity
This sentential form has two derivations
if Expr1 then if Expr2 then Stmt1 else Stmt2

Part of the problem is that the structure built by the parser will determine the interpretation of the code, and these two forms have different meanings!

Parse tree 1 (production 2, then production 1):
if E1 then ( if E2 then S1 ) else S2

Parse tree 2 (production 1, then production 2):
if E1 then ( if E2 then S1 else S2 )

Ambiguity
Removing the ambiguity
• Must rewrite the grammar to avoid generating the problem
• Match each else to innermost unmatched if (common sense rule)

Rule  Production
0     Stmt     → if Expr then Stmt
1              | if Expr then WithElse else Stmt
2              | Other Statements
3     WithElse → if Expr then WithElse else WithElse
4              | Other Statements

The grammar forces the structure to match the desired meaning.

Intuition: once into WithElse, we cannot generate an unmatched else
… a final if without an else can only come through rule 0 …

With this grammar, the example has only one rightmost derivation
Ambiguity
if Expr1 then if Expr2 then Stmt1 else Stmt2

Rule  Sentential Form
—     Stmt
0     if Expr then Stmt
1     if Expr then if Expr then WithElse else Stmt
2     if Expr then if Expr then WithElse else S2
4     if Expr then if Expr then S1 else S2
?     if Expr then if E2 then S1 else S2
?     if E1 then if E2 then S1 else S2

(? = other productions to derive the Exprs)

This grammar has only one rightmost derivation for the example
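
The claim can be checked by counting parse trees under both grammars. The Python sketch below is a small exhaustive counter, not a practical parsing algorithm; the function name `count_trees`, the dictionary encoding of the grammars, and the stand-in tokens `expr` and `other` (for any expression and any non-if statement) are all assumptions made for this illustration. It also assumes the grammar has no empty right-hand sides.

```python
from functools import lru_cache


def count_trees(grammar, start, tokens):
    """Count the parse trees that derive tokens from start."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def symbol(sym, i, j):
        # Number of trees in which sym derives tokens[i:j].
        if sym not in grammar:  # terminal symbol
            return 1 if j == i + 1 and tokens[i] == sym else 0
        return sum(sequence(rhs, i, j) for rhs in grammar[sym])

    @lru_cache(maxsize=None)
    def sequence(rhs, i, j):
        # Number of ways the symbols in rhs jointly derive tokens[i:j].
        if not rhs:
            return 1 if i == j else 0
        head, rest = rhs[0], rhs[1:]
        total = 0
        # head covers at least one token; each symbol in rest needs one too.
        for k in range(i + 1, j - len(rest) + 1):
            ways = symbol(head, i, k)
            if ways:
                total += ways * sequence(rest, k, j)
        return total

    return symbol(start, 0, len(tokens))


AMBIGUOUS = {
    "Stmt": [
        ("if", "expr", "then", "Stmt"),
        ("if", "expr", "then", "Stmt", "else", "Stmt"),
        ("other",),
    ],
}

REWRITTEN = {
    "Stmt": [
        ("if", "expr", "then", "Stmt"),
        ("if", "expr", "then", "WithElse", "else", "Stmt"),
        ("other",),
    ],
    "WithElse": [
        ("if", "expr", "then", "WithElse", "else", "WithElse"),
        ("other",),
    ],
}

example = "if expr then if expr then other else other".split()
print(count_trees(AMBIGUOUS, "Stmt", example))  # two trees
print(count_trees(REWRITTEN, "Stmt", example))  # one tree
```

Under the original grammar the example has two parse trees; under the rewritten grammar only the tree that binds the else to the inner if survives.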
Deeper Ambiguity
Ambiguity usually refers to confusion in the CFG
Overloading can create deeper ambiguity
a = f(17)
In many Algol-like languages, f could be either a function or a subscripted variable

Disambiguating this one requires context
• Need values of declarations
• Really an issue of type, not context-free syntax
• Requires an extra-grammatical solution (not in CFG)
• Must handle these with a different mechanism
— Step outside grammar rather than use a more complex grammar
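
A minimal sketch of "stepping outside the grammar" in Python: the parser emits one neutral node for f(17), and a later pass consults the declarations to decide what it means. The node shape `("apply", name, args)`, the function `resolve`, and the plain dictionary standing in for a symbol table are all hypothetical.

```python
def resolve(node, declarations):
    """Turn a neutral ("apply", name, args) node into a call or an index.

    declarations maps each name to "function" or "array"; it stands in for
    whatever symbol table a real compiler would consult.
    """
    kind, name, args = node
    if kind != "apply":
        raise ValueError(f"unexpected node kind {kind!r}")
    declared_as = declarations.get(name)
    if declared_as == "function":
        return ("call", name, args)
    if declared_as == "array":
        return ("index", name, args)
    raise NameError(f"{name!r} is not declared as a function or an array")


node = ("apply", "f", [17])  # what a context-free parser can say about f(17)
print(resolve(node, {"f": "function"}))
print(resolve(node, {"f": "array"}))
```

The grammar stays simple because it accepts both readings with one rule; the type information that separates them lives outside the CFG.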


Ambiguity - the Final Word
Ambiguity arises from two distinct sources
• Confusion in the context-free syntax (if-then-else)
• Confusion that requires context to resolve (overloading)

Resolving ambiguity
• To remove context-free ambiguity, rewrite the grammar
• To handle context-sensitive ambiguity takes cooperation
— Knowledge of declarations, types, …
— Accept a superset of L(G) & check it by other means†
— This is a language design problem

Sometimes, the compiler writer accepts an ambiguous grammar
— Parsing techniques that “do the right thing”
— i.e., always select the same derivation

† See Chapter 4