Chapter Three

This document discusses context-free grammars and parsing. It defines key terms like context-free grammar, derivation, parse tree, and language defined by a grammar. A context-free grammar specifies the syntactic structure of a programming language using rules and nonterminal/terminal symbols. A derivation uses the grammar rules to rewrite nonterminals until reaching a string of terminals. The set of all such strings is the language defined by the grammar. Parsing uses a grammar to analyze a program's syntax and structure.

Uploaded by

hammad

Compiler Principle and Technology
Prof. Dongming LU
Mar. 5th, 2014

3. Context-Free Grammars and Parsing (PART ONE)

Contents
PART ONE
3.1 The Parsing Process
3.2 Context-Free Grammars
3.3 Parse Trees and Abstract Syntax Trees
PART TWO
3.4 Ambiguity
3.5 Extended Notations: EBNF and Syntax Diagrams
3.6 Formal Properties of Context-Free Languages

Introduction

Parsing is the task of syntax analysis: determining the syntax, or structure, of a program.

Syntax: defined by a context-free grammar
Syntax analysis: by derivation, top-down (LL) or bottom-up (LR)
Tool: the parser
Data structure: parse tree and syntax tree

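3.1 The Parsing Process

Function of a Parser

Sequence of tokens → Parser → Trees

initial + rate * 60 is scanned as id + id * num

[Figure: parse tree for initial + rate * 60. The root expression has the children expression, +, expression. The left child derives identifier (initial); the right child is the product of identifier (rate) and number (60).]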

Issues of the Parsing

1. The sequence of tokens is not an explicit input parameter
The parser calls a scanner procedure getToken to fetch the next token.
The parsing step of the compiler reduces to a call to the parser as follows:
SyntaxTree = parse( )
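The pull model above can be sketched in C. This is a minimal illustration, not the TINY scanner: the token names, the file-scope cursor, initScanner, and countTokens are all assumptions made for the example. The point is that the consumer never receives a token array; it asks getToken for one token at a time.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token set, just enough for "initial + rate * 60". */
typedef enum { ID, NUM, PLUS, TIMES, ENDFILE, ERROR } TokenType;

/* Scanner state lives outside the parser (illustrative design). */
static const char *cursor = "";

void initScanner(const char *source)
{
    assert(source != NULL);
    cursor = source;
}

/* Return the next token, skipping blanks. An unknown character
   yields ERROR and is consumed. */
TokenType getToken(void)
{
    while (*cursor == ' ' || *cursor == '\t' || *cursor == '\n') cursor++;
    if (*cursor == '\0') return ENDFILE;
    if (isalpha((unsigned char)*cursor)) {
        while (isalnum((unsigned char)*cursor)) cursor++;
        return ID;
    }
    if (isdigit((unsigned char)*cursor)) {
        while (isdigit((unsigned char)*cursor)) cursor++;
        return NUM;
    }
    switch (*cursor++) {
    case '+': return PLUS;
    case '*': return TIMES;
    default:  return ERROR;
    }
}

/* Stand-in for parse(): it takes no token parameter and only counts
   the tokens it pulls. A real parser would build a tree instead. */
int countTokens(void)
{
    int n = 0;
    while (getToken() != ENDFILE) n++;
    return n;
}
```

After initScanner("initial + rate * 60"), successive calls to getToken yield ID, PLUS, ID, TIMES, NUM, and then ENDFILE, matching id + id * num.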

Issues of the Parsing

2. Single pass & multi-pass
In a single-pass compiler
  No explicit syntax tree needs to be constructed.
  The parser steps themselves represent the syntax tree implicitly, and the call reduces to parse( ).
In a multi-pass compiler
  The further passes use the syntax tree as their input.
  The structure of the syntax tree is heavily dependent on the particular syntactic structure of the language.
  This tree is usually defined as a dynamic data structure.
  Each node consists of a record whose fields include the attributes needed for the remainder of the compilation process (i.e., not just those computed by the parser).

Issues of the Parsing

3. Differences in treatment of errors
Error in the scanner
  Generate an error token and consume the offending character.
Error in the parser
  The parser must not only report an error message but must also recover from the error and continue parsing (to find as many errors as possible).
  A parser may perform error repair.
  Error recovery is the reporting of meaningful error messages and the resumption of parsing as close to the actual error as possible.


3.2 Context-Free Grammars

Limitation of Regular Expressions

Chomsky hierarchy: Type 0, Type 1, Type 2 (context-free grammars), Type 3 (regular expressions)

Regular expressions can express only simple languages; they cannot express recursion or nested structures.
E.g., bracket matching: ((()))

We need a more powerful language!

The Context-Free Grammar

A context-free grammar is a quadruple (VT, VN, S, P):
VT : set of terminal symbols (non-empty; the symbols of the alphabet)
VN : set of nonterminal symbols (non-empty)
S : start symbol (a nonterminal symbol)
P : derivation rules
E.g.: E → (E) | α, with E ∈ VN and α ∈ (VT ∪ VN)*
With α = a: a, (a), ((a)), ...

3.2.1 Comparison to Regular Expression Notation

Comparing an Example
A context-free grammar is a specification for the syntactic structure of a programming language.
It is similar to the specification of the lexical structure of a language using regular expressions, except that it involves recursive rules.

The context-free grammar:
exp → exp op exp | ( exp ) | number
op → + | - | *
The regular expression:
number = digit digit*
digit = 0|1|2|3|4|5|6|7|8|9

Basic Rules

Regular Expression
1. Three operations: choice, concatenation, repetition
2. = represents the definition of a name for a regular expression
3. Names are written in italics

Context-Free Grammar
1. | for choice; concatenation is used as a standard operation; no * for repetition
2. → represents the definitions of names
3. Names are written in italics (in a different font)

Grammar rules use regular expressions as components.
Grammar rules in this form are usually said to be in Backus-Naur form, or BNF.

Characteristics

Regular Expression
Advantages: the rules are simple and easy to describe and understand.
Disadvantages: its expressive power is limited.
Suitable for lexical analysis.

Context-Free Grammar
Advantages:
1. The rules are accurate and more powerful than regular expressions.
2. It can describe the structure of a given language.
3. It is easy to modify a language defined by a context-free grammar.
Disadvantages: grammar rules are more complicated than regular expressions, and a context-free grammar can't express all kinds of programming languages.
Suitable for syntax analysis.

3.2.2 Specification of Context-Free Grammar Rules

Construction of a CFG rule

Given an alphabet, a context-free grammar rule in BNF consists of a string of symbols.
The first symbol is a name for a structure.
The second symbol is the meta-symbol "→".
This symbol is followed by a string of symbols, each of which is either a symbol from the alphabet, a name for a structure, or the meta-symbol "|".

Construction of a CFG rule

A grammar rule in BNF is interpreted as follows:
The rule defines the structure whose name is to the left of the arrow.
The structure is defined to consist of one of the choices on the right-hand side, separated by the vertical bars.
The sequence of symbols and structure names within each choice defines the layout of the structure.
Example:
exp → exp op exp | ( exp ) | number
op → + | - | *

More about the Conventions

The meta-symbols and conventions used here are in wide use, but there is no universal standard for these conventions.

Common alternatives for the arrow meta-symbol "→" include "=" (the equal sign), ":" (the colon), and "::=" ("double-colon-equals").

In normal text files, the use of italics is replaced by surrounding structure names with angle brackets <...> and by writing italicized token names in uppercase.

Example:
<exp> ::= <exp> <op> <exp> | (<exp>) | NUMBER
<op> ::= + | - | *

3.2.3 Derivations & Language Defined by a Grammar

How a Grammar Determines a Language

Context-free grammar rules determine the set of syntactically legal strings of token symbols for the structures defined by the rules.
For example, with the expression grammar:
(34-3)*42 is a legal expression; it corresponds to the token string ( number - number ) * number
(34-3*42 is not a legal expression.

In the second expression there is a left parenthesis that is not matched by a right parenthesis. The grammar rule requires that parentheses be generated in pairs, so the second expression is not legal.

Derivations

Grammar rules determine the legal strings of token symbols by means of derivations.
A derivation is a sequence of replacements of structure names by choices on the right-hand sides of grammar rules.
A derivation begins with a single structure name and ends with a string of token symbols.
At each step in a derivation, a single replacement is made using one choice from a grammar rule.

Derivations

The example
exp → exp op exp | ( exp ) | number
op → + | - | *

A derivation
(1) exp => exp op exp                    [exp → exp op exp]
(2)     => exp op number                 [exp → number]
(3)     => exp * number                  [op → *]
(4)     => ( exp ) * number              [exp → ( exp )]
(5)     => ( exp op exp ) * number       [exp → exp op exp]
(6)     => ( exp op number ) * number    [exp → number]
(7)     => ( exp - number ) * number     [op → -]
(8)     => ( number - number ) * number  [exp → number]

Derivation steps use a different arrow (=>) from the arrow meta-symbol (→) in the grammar rules.

Language Defined by a Grammar

The set of all strings of token symbols obtained by derivations from the exp symbol is the language defined by the grammar of expressions.
L(G) = { s | exp =>* s }

G : the expression grammar
s : an arbitrary string of token symbols (sometimes called a sentence)
=>* : stands for a derivation consisting of a sequence of replacements as described earlier. (The asterisk is used to indicate a sequence of steps, much as it indicates repetition in regular expressions.)
Grammar rules are sometimes called productions, because they "produce" the strings in L(G) via derivations.

Grammar for a Programming Language

The grammar for a programming language often defines a structure called program.

The language of this structure is the set of all syntactically legal programs of the programming language.
For example, a BNF for Pascal:
program → program-heading ; program-block .
program-heading → ...
program-block → ...

The first rule says that a program consists of a program heading, followed by a semicolon, followed by a program block, followed by a period.

Symbols in TINY

The grammars here use the following alphabet of tokens for the TINY language:
{if, then, else, end, repeat, until, read, write, identifier, number, +, -, *, /, =, <, (, ), ;, :=}

instead of the set of token names (as defined in the TINY scanner):
{IF, THEN, ELSE, END, REPEAT, UNTIL, READ, WRITE, ID, NUM, PLUS, MINUS, TIMES, OVER, EQ, LT, LPAREN, RPAREN, SEMI, ASSIGN}

Examples
Example 3.1:
The grammar G with the single grammar rule
E → (E) | a
This grammar generates the language
L(G) = { a, (a), ((a)), (((a))), ... }
     = { (^n a )^n | n an integer >= 0 }
where (^n stands for n left parentheses and )^n for n right parentheses.
E.g., a derivation for ((a)):
E => (E) => ((E)) => ((a))
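The recursive rule of Example 3.1 maps directly onto a recursive function. The following is a minimal sketch of a recognizer for L(G), assuming the input is a plain character string with no blanks; the names pos, parseE, and inLanguage are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

static const char *pos;   /* next unread character (illustrative state) */

/* E -> ( E ) | a : returns 1 if an E was recognized starting at pos. */
static int parseE(void)
{
    if (*pos == 'a') { pos++; return 1; }   /* choice: a */
    if (*pos == '(') {                      /* choice: ( E ) */
        pos++;
        if (!parseE()) return 0;
        if (*pos != ')') return 0;
        pos++;
        return 1;
    }
    return 0;                               /* neither choice applies */
}

/* Accept s only if the whole string can be derived from E. */
int inLanguage(const char *s)
{
    assert(s != NULL);
    pos = s;
    return parseE() && *pos == '\0';
}
```

Each recursive call to parseE corresponds to one use of the choice E → (E) in a derivation, so inLanguage("((a))") makes the same three replacements as the derivation above, while "(a" and "()" are rejected.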

Examples
Example 3.2:
The grammar G with the single grammar rule
E → (E)
This grammar generates no strings at all: there is no way to derive a string consisting only of terminals.

Examples
Example 3.3:

Consider the grammar G with the single grammar rule
E → E + a | a

This grammar generates all strings consisting of a's separated by +'s:
L(G) = { a, a + a, a + a + a, a + a + a + a, ... }

Derivation:
E => E+a => E+a+a => E+a+a+a => ...
Finally, replace the E on the left using the base case E → a.

Example
Example 3.4
Consider the following extremely simplified grammar of statements:

statement → if-stmt | other
if-stmt → if ( exp ) statement
        | if ( exp ) statement else statement
exp → 0 | 1

Examples of strings in this language are
other
if (0) other
if (1) other
if (0) other else other
if (1) other else other
if (0) if (0) other
if (0) if (1) other else other
if (1) other else if (0) other else other

Recursion

The grammar rule:
A → A a | a    or    A → a A | a
generates the language { a^n | n an integer >= 1 }
(the set of all strings of one or more a's).
This is the same language as that generated by the regular expression a+.
The string aaaa can be generated by the first grammar rule with the derivation:
A => Aa => Aaa => Aaaa => aaaa

Recursion
i) Left recursive:
The non-terminal A appears as the first symbol on the right-hand side of the rule defining A:
A → A a | a
ii) Right recursive:
The non-terminal A appears as the last symbol on the right-hand side of the rule defining A:
A → a A | a

Examples of Recursion

Consider a rule of the form A → A α | β, where α and β represent arbitrary strings and β does not begin with A.

This rule generates all strings of the form
β, βα, βαα, βααα, ...
(all strings beginning with a β, followed by 0 or more α's).

This grammar rule is equivalent in its effect to the regular expression βα*.
Similarly, the right recursive grammar rule A → α A | β
(where β does not end in A) generates all strings
β, αβ, ααβ, αααβ, ...

Examples of Recursion

To generate the same language as the regular expression a*, we must have a notation for a grammar rule that generates the empty string. Use the epsilon meta-symbol ε for the empty string:
empty → ε, called an ε-production (an "epsilon production").

A grammar that generates a language containing the empty string must have at least one ε-production.

A grammar equivalent to the regular expression a*:
A → A a | ε    or    A → a A | ε
Both grammars generate the language:
{ a^n | n an integer >= 0 } = L(a*).

Examples

Example 3.5:
A → (A) A | ε
generates the strings of all "balanced parentheses."

For example, the string (( ) (( ))) ( ) is generated by the following derivation (the ε-production is used to make A disappear as needed):

A => (A) A => (A)(A)A => (A)(A) => (A)( ) => ((A)A)( )
  => (( )A)( ) => (( )(A)A)( ) => (( )(A))( ) => (( )((A)A))( )
  => (( )(( )A))( ) => (( )(( )))( )
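The role of the ε-production becomes concrete in a recognizer. This sketch follows the rule A → (A) A | ε, under the assumption that the input holds only parentheses and no blanks; the names p, parseA, and isBalanced are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

static const char *p;   /* next unread character (illustrative state) */

/* A -> ( A ) A | epsilon.
   Returns 0 only when a '(' has no matching ')'. */
static int parseA(void)
{
    if (*p != '(') return 1;     /* epsilon-production: consume nothing */
    p++;                         /* (  */
    if (!parseA()) return 0;     /* A  */
    if (*p != ')') return 0;
    p++;                         /* )  */
    return parseA();             /* A  */
}

/* Accept s only if the whole string is derived from A. */
int isBalanced(const char *s)
{
    assert(s != NULL);
    p = s;
    return parseA() && *p == '\0';
}
```

Choosing the ε-production whenever the next character is not a left parenthesis is what "makes A disappear": isBalanced("(()(()))()") succeeds, and the empty string is accepted as well.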

Examples

Example 3.6:
The statement grammar of Example 3.4 can be written in the following alternative way using an ε-production:
statement → if-stmt | other
if-stmt → if ( exp ) statement else-part
else-part → else statement | ε
exp → 0 | 1

The ε-production indicates that the structure else-part is optional.

Examples

Example 3.7: Consider the following grammar G for a sequence of statements:
stmt-sequence → stmt ; stmt-sequence | stmt
stmt → s
This grammar generates sequences of one or more statements separated by semicolons (statements have been abstracted into the single terminal s):
L(G) = { s, s; s, s; s; s, ... }

Examples

To allow statement sequences to be empty, write the following grammar G':
stmt-sequence → stmt ; stmt-sequence | ε
stmt → s
Here the semicolon is a terminator rather than a separator:
L(G') = { ε, s;, s;s;, s;s;s;, ... }

To allow statement sequences to be empty but retain the semicolon as a separator, write the grammar G'' as follows:
stmt-sequence → nonempty-stmt-sequence | ε
nonempty-stmt-sequence → stmt ; nonempty-stmt-sequence | stmt
stmt → s
L(G'') = { ε, s, s; s, s; s; s, ... }

3.3 Parse Trees and Abstract Syntax Trees

3.3.1 Parse Trees

Derivation vs. Structure

Derivations do not uniquely represent the structure of the strings they construct.
There are many derivations for the same string.
The string of tokens:
( number - number ) * number

Two different derivations of this string follow.

Derivation vs. Structure

( number - number ) * number
(1) exp => exp op exp                    [exp → exp op exp]
(2)     => exp op number                 [exp → number]
(3)     => exp * number                  [op → *]
(4)     => ( exp ) * number              [exp → ( exp )]
(5)     => ( exp op exp ) * number       [exp → exp op exp]
(6)     => ( exp op number ) * number    [exp → number]
(7)     => ( exp - number ) * number     [op → -]
(8)     => ( number - number ) * number  [exp → number]

Derivation vs. Structure

( number - number ) * number
(1) exp => exp op exp                    [exp → exp op exp]
(2)     => ( exp ) op exp                [exp → ( exp )]
(3)     => ( exp op exp ) op exp         [exp → exp op exp]
(4)     => ( number op exp ) op exp      [exp → number]
(5)     => ( number - exp ) op exp       [op → -]
(6)     => ( number - number ) op exp    [exp → number]
(7)     => ( number - number ) * exp     [op → *]
(8)     => ( number - number ) * number  [exp → number]

Derivation vs. Parse Tree

Both can express the structure of a given string.

Derivations of the same string can differ superficially, in the order of the replacements.

A parse tree filters out the order and makes the structure clear.

One parse tree may correspond to more than one derivation.

Parse Tree

A parse tree corresponding to a derivation is a labeled tree.
The interior nodes are labeled by non-terminals; the leaf nodes are labeled by terminals.

The children of each interior node represent the replacement of the associated non-terminal in one step of the derivation.

Parse Tree

The example:
exp => exp op exp => number op exp => number + exp
    => number + number
corresponds to the parse tree:
exp
    exp
        number
    op
        +
    exp
        number

The above parse tree corresponds to the following three derivations:

Parse Tree

Leftmost derivation
(1) exp => exp op exp
(2)     => number op exp
(3)     => number + exp
(4)     => number + number

Rightmost derivation
(1) exp => exp op exp
(2)     => exp op number
(3)     => exp + number
(4)     => number + number

Parse Tree

Neither leftmost nor rightmost derivation
(1) exp => exp op exp
(2)     => exp + exp
(3)     => number + exp
(4)     => number + number

Generally, a parse tree corresponds to many derivations, all of which represent the same basic structure for the parsed string of terminals.
It is possible to distinguish particular derivations that are uniquely associated with the parse tree.

Parse Tree

A leftmost derivation:
A derivation in which the leftmost non-terminal is replaced at each step in the derivation.
It corresponds to the preorder numbering of the internal nodes of its associated parse tree.

A rightmost derivation:
A derivation in which the rightmost non-terminal is replaced at each step in the derivation.
It corresponds to the reverse of the postorder numbering of the internal nodes of its associated parse tree.
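The correspondence between derivation order and node numbering can be checked with a small traversal. This is a sketch under stated assumptions: the PNode type, the fixed limit of three children, and both function names are invented here. Note that the rightmost order is the postorder numbering read in reverse, which is the same as visiting a node first and then its children from right to left.

```c
#include <assert.h>
#include <stddef.h>

#define MAXKIDS 3   /* enough for exp -> exp op exp (assumption) */

typedef struct pnode {
    const char *label;            /* non-terminal or terminal name */
    struct pnode *kid[MAXKIDS];   /* children left to right, NULL-padded */
    int order;                    /* derivation step; stays 0 for leaves */
} PNode;

static int isLeaf(const PNode *n) { return n->kid[0] == NULL; }

/* Preorder: number a node, then its children left to right.
   Matches the replacement order of a leftmost derivation. */
void numberPreorder(PNode *n, int *next)
{
    int i;
    assert(n != NULL && next != NULL);
    if (isLeaf(n)) return;
    n->order = (*next)++;
    for (i = 0; i < MAXKIDS && n->kid[i] != NULL; i++)
        numberPreorder(n->kid[i], next);
}

/* Reverse postorder: number a node, then its children right to left.
   Matches the replacement order of a rightmost derivation. */
void numberRightmost(PNode *n, int *next)
{
    int i;
    assert(n != NULL && next != NULL);
    if (isLeaf(n)) return;
    n->order = (*next)++;
    for (i = MAXKIDS - 1; i >= 0; i--)
        if (n->kid[i] != NULL) numberRightmost(n->kid[i], next);
}
```

For the parse tree of number + number, numberPreorder labels the root 1, the left exp 2, op 3, and the right exp 4; numberRightmost labels the root 1, the right exp 2, op 3, and the left exp 4.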

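Parse Tree

The parse tree for number + number, with its internal nodes numbered in the order of the leftmost derivation:
1 exp
    2 exp
        number
    3 op
        +
    4 exp
        number

Example: The expression (34-3)*42

The parse tree for the above arithmetic expression, with its internal nodes numbered in the order of the rightmost derivation:
1 exp
    4 exp
        (
        5 exp
            8 exp
                number
            7 op
                -
            6 exp
                number
        )
    3 op
        *
    2 exp
        number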

3.3.2 Abstract Syntax Trees

Why Abstract Syntax Trees

The parse tree contains more information than is absolutely necessary for a compiler.
For the example 3*4:
exp
    exp
        number (3)
    op
        *
    exp
        number (4)

Why Abstract Syntax Trees

The principle of syntax-directed translation:
The meaning, or semantics, of the string 3+4 should be directly related to its syntactic structure as represented by the parse tree.
In this case, the parse tree should imply that the value 3 and the value 4 are to be added.
A much simpler way to represent this same information is the tree with root + and the two children 3 and 4.

Tree for expression (34-3)*42

The parse tree of the expression (34-3)*42 can be represented more simply by the tree:
*
    -
        34
        3
    42
The parenthesis tokens have disappeared, yet the tree still represents precisely the semantic content of subtracting 3 from 34 and then multiplying by 42.

Abstract Syntax Trees or Syntax Trees

Syntax trees represent abstractions of the actual source code token sequences.

The token sequences cannot be recovered from them (unlike parse trees).

Nevertheless, they contain all the information needed for translation, in a more efficient form than parse trees.

Abstract Syntax Trees or Syntax Trees

A parse tree is a representation of the structure of ordinary syntax, which is called concrete syntax when comparing it to abstract syntax.

Abstract syntax can be given a formal definition using a BNF-like notation, just like concrete syntax.

The BNF-like rules for the abstract syntax of the simple arithmetic expression:
exp → OpExp(op, exp, exp) | ConstExp(integer)
op → Plus | Minus | Times

Abstract Syntax Trees or Syntax Trees

The C data type declarations for this abstract syntax:
typedef enum {Plus, Minus, Times} OpKind;
typedef enum {OpK, ConstK} ExpKind;
typedef struct streenode
{ ExpKind kind;
  OpKind op;
  struct streenode *lchild, *rchild;
  int val;
} STreeNode;
typedef STreeNode *SyntaxTree;
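To see that such a syntax tree holds everything needed for translation, here is a minimal sketch that builds the tree for (34-3)*42 and evaluates it. The declarations repeat the slide's types (with the enum separator written as a comma); the helper functions newConst, newOp, eval, and freeTree are hypothetical and not part of the original declarations.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum {Plus, Minus, Times} OpKind;
typedef enum {OpK, ConstK} ExpKind;
typedef struct streenode
{ ExpKind kind;
  OpKind op;
  struct streenode *lchild, *rchild;
  int val;
} STreeNode;
typedef STreeNode *SyntaxTree;

/* Hypothetical constructor for a ConstExp(integer) node. */
SyntaxTree newConst(int val)
{
    SyntaxTree t = malloc(sizeof *t);
    if (t == NULL) abort();
    t->kind = ConstK; t->op = Plus;     /* op is unused for constants */
    t->lchild = t->rchild = NULL; t->val = val;
    return t;
}

/* Hypothetical constructor for an OpExp(op, exp, exp) node. */
SyntaxTree newOp(OpKind op, SyntaxTree l, SyntaxTree r)
{
    SyntaxTree t = malloc(sizeof *t);
    if (t == NULL) abort();
    t->kind = OpK; t->op = op;
    t->lchild = l; t->rchild = r; t->val = 0;
    return t;
}

/* Evaluate the tree bottom-up; no parentheses are needed because
   the tree shape already encodes the grouping. */
int eval(SyntaxTree t)
{
    int l, r;
    assert(t != NULL);
    if (t->kind == ConstK) return t->val;
    l = eval(t->lchild);
    r = eval(t->rchild);
    switch (t->op) {
    case Plus:  return l + r;
    case Minus: return l - r;
    default:    return l * r;           /* Times */
    }
}

void freeTree(SyntaxTree t)
{
    if (t == NULL) return;
    freeTree(t->lchild);
    freeTree(t->rchild);
    free(t);
}
```

Calling eval on newOp(Times, newOp(Minus, newConst(34), newConst(3)), newConst(42)) subtracts 3 from 34 and multiplies the result by 42, giving 1302.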

Examples
Example 3.8:

The grammar for simplified if-statements:
statement → if-stmt | other
if-stmt → if ( exp ) statement
        | if ( exp ) statement else statement
exp → 0 | 1

Examples
The parse tree for the string:

if (0) other else other

statement
    if-stmt
        if
        (
        exp
            0
        )
        statement
            other
        else
        statement
            other

Examples
Using the grammar of Example 3.6:
statement → if-stmt | other
if-stmt → if ( exp ) statement else-part
else-part → else statement | ε
exp → 0 | 1

Examples
This same string has the following parse tree:

if (0) other else other

statement
    if-stmt
        if
        (
        exp
            0
        )
        statement
            other
        else-part
            else
            statement
                other

Examples

A syntax tree for the previous string (using either the grammar of Example 3.4 or 3.6) would be:

if (0) other else other

if
    0
    other
    other

Examples

A set of C declarations that would be appropriate for the structure of the statements and expressions in this example is as follows:
typedef enum {ExpK, StmtK} NodeKind;
typedef enum {Zero, One} ExpKind;
typedef enum {IfK, OtherK} StmtKind;
typedef struct streenode
{ NodeKind kind;
  ExpKind ekind;
  StmtKind skind;
  struct streenode *test, *thenpart, *elsepart;
} STreeNode;
typedef STreeNode *SyntaxTree;

Examples
Example 3.9:

The grammar of a sequence of statements separated by semicolons from Example 3.7:
stmt-sequence → stmt ; stmt-sequence | stmt
stmt → s

Examples

The string s; s; s has the following parse tree with respect to this grammar:
stmt-sequence
    stmt
        s
    ;
    stmt-sequence
        stmt
            s
        ;
        stmt-sequence
            stmt
                s

Problem & Solution

The solution: use the standard leftmost-child right-sibling representation for a tree (presented in most data structures texts) to deal with an arbitrary number of children.

The only physical link from the parent to its children is to the leftmost child.

The children are then linked together from left to right in a standard linked list; these links are called sibling links to distinguish them from parent-child links.

Problem & Solution

The previous tree now becomes, in the leftmost-child right-sibling arrangement:
seq
 |
 s --- s --- s
With this arrangement, we can also do away with the connecting seq node, and the syntax tree then becomes simply:
 s --- s --- s
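The leftmost-child right-sibling arrangement can be sketched in C as follows. The SeqNode type and both helper functions are assumptions made for this illustration; they are not declarations from the slides.

```c
#include <assert.h>
#include <stddef.h>

typedef struct seqnode {
    char stmt;                 /* the abstracted statement, e.g. 's' */
    struct seqnode *child;     /* link to the leftmost child, or NULL */
    struct seqnode *sibling;   /* link to the next sibling, or NULL */
} SeqNode;

/* Append n at the end of the sibling list that starts at first.
   Returns the head of the list (n itself if the list was empty). */
SeqNode *appendSibling(SeqNode *first, SeqNode *n)
{
    SeqNode *t;
    assert(n != NULL);
    if (first == NULL) return n;
    for (t = first; t->sibling != NULL; t = t->sibling)
        ;
    t->sibling = n;
    return first;
}

/* Count the nodes reachable through sibling links. */
int countSiblings(const SeqNode *first)
{
    int n = 0;
    for (; first != NULL; first = first->sibling) n++;
    return n;
}
```

A seq node reaches all of its statements through a single child pointer, however many there are. Dropping the seq node simply means keeping the pointer to the first s as the tree itself.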


End of Part One
THANKS
