CD Chapter III-1
Syntax Analysis
Syntax Analysis in a Compilation
Context-Free Grammars (CFG)
Derivation in Syntax Analysis
Syntax Trees and Ambiguity
Syntax Error Handling
Syntax Analysis in a Compilation
Where lexical analysis (LA) splits the input into tokens, the purpose of
syntax analysis (parsing) is to recombine these tokens:
not back into a list of characters, but into something that
reflects the structure of the text.
This “something” is typically a data structure called the
syntax tree of the text.
The parser obtains a string of tokens from the LA.
Syntax Analysis in a Compilation
The syntax tree is a tree structure.
The leaves of this tree are the tokens found by the LA,
and if the leaves are read from left to right, the sequence is
the same as in the input text.
Hence, what is important in the syntax tree is how these
leaves are combined to form the structure of the tree and
how the interior nodes of the tree are labelled.
Syntax Analysis in a Compilation
A derivation can be conveniently represented by a
derivation tree (parse tree).
The root is labeled by the start symbol.
Each leaf is labeled by a token or ε.
Each interior node is labeled by a nonterminal symbol.
The methods commonly used in compilers are:
» Top-down
» Bottom-up.
Syntax Analysis in a Compilation
Top-down parsers build parse trees from the top (root) to the bottom
(leaves).
Bottom-up parsers start from the leaves and work up to the root.
The output of the parser is some representation of the parse tree for
the stream of tokens produced by the LA.
Recursive descent parsing:
A common form of top-down parsing.
Uses recursive procedures to process the input.
Syntax Analysis in a Compilation
It may or may not suffer from backtracking; if the grammar is not
left-factored, backtracking cannot be avoided (see the sketch below).
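As a rough sketch (the helper names peek and eat are illustrative, not
from the slides), a recursive-descent parser in Python for the grammar
used later in these slides, T → R | aTc and R → ε | bR, looks like this:

    class ParseError(Exception):
        pass

    def parse(s):
        pos = 0

        def peek():
            # One-symbol lookahead; '$' marks end of input.
            return s[pos] if pos < len(s) else '$'

        def eat(c):
            nonlocal pos
            if peek() != c:
                raise ParseError(f"expected {c!r} at position {pos}")
            pos += 1

        def T():                   # one procedure per nonterminal
            if peek() == 'a':      # T -> aTc
                eat('a'); T(); eat('c')
            else:                  # T -> R
                R()

        def R():
            if peek() == 'b':      # R -> bR
                eat('b'); R()
            # else: R -> empty string, consume nothing

        T()
        if pos != len(s):
            raise ParseError("unconsumed input")

    parse("aabbbcc")   # succeeds; parse("aabb") raises ParseError

Because each choice above is decided by the next input symbol alone,
this particular sketch never needs to backtrack.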
Syntax Analysis in a Compilation
When a production A → X1 … Xn is applied,
nodes labeled X1 … Xn are added as children of the
node labeled A.
• Root: the start symbol
• Internal nodes: nonterminals
• Leaf nodes: terminals
Syntax Analysis in a Compilation
Syntax analysis must also reject invalid texts by
reporting syntax errors.
As syntax analysis is less local in nature than LA, more
advanced methods are required, but we use the same basic
strategy:
A notation suitable for human understanding is transformed
into a machine-like low-level notation suitable for efficient
execution.
This process is called parser generation.
Syntax Analysis in a Compilation
The notation we use for human manipulation is context-free
grammars (CFG);
the name refers to the fact that derivation is independent of
context.
A CFG is a recursive notation for describing sets of strings
and imposing a structure on each such string.
Can be translated almost directly into recursive programs,
but it is often more convenient to generate stack automata.
Context-Free Grammars (CFG)
Like REs, CFGs describe sets of strings, i.e., languages.
A CFG also defines a structure on the strings in the language it
defines.
A language is defined over some alphabet,
for example the set of tokens produced by a lexer or the set
of alphanumeric characters.
The symbols in the alphabet are called terminals.
CFG
A CFG recursively defines several sets of strings.
Each set is denoted by a name, which is called a
nonterminal.
The set of non-terminals is disjoint from the set of
terminals.
One of the non-terminals is chosen to denote the language
described by the grammar.
This is called the start symbol of the grammar.
CFG
The sets are described by a number of productions.
Each production describes some of the possible strings that
are contained in the set denoted by a nonterminal.
A production has the form:
N → α
where the nonterminal N names the set being defined and α is a
(possibly empty) sequence of terminals and nonterminals that
describes some of the strings in that set.
How to write CFGs
An RE can systematically be rewritten as a CFG by using a
nonterminal for every subexpression in the RE and using
one or two productions for each nonterminal.
So, if we can think of a way of expressing a language as an
RE, it is easy to make a grammar for it.
However, we will also want to use grammars to describe
non-regular languages.
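For instance (an illustrative example, not from the slides), the RE
a(b|c)* can be rewritten with a nonterminal per subexpression:

A → aB
B → CB | ε
C → b | c

where A is the start symbol: B describes (b|c)* and C describes (b|c).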
How to write CFGs
An example is the kind of arithmetic expressions that are
part of most programming languages (and also found on
electronic calculators).
Note that matching parentheses cannot be described by an
RE, as REs cannot “count” the number of unmatched
opening parentheses at a particular point in the string.
However, if we did not have parentheses in the language, it
could be described by an RE.
How to write CFGs
Even so, the regular description is not useful if you want
operators to have different precedence, as it treats the
expression as a flat string rather than as having structure.
Most constructions from programming languages are easily
expressed by CFG.
In fact, most modern languages are designed this way.
How to write CFGs
[Figures: a simple expression grammar and a simple statement grammar.]
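Such grammars typically look like the following illustrative versions
(not necessarily the exact figures from the slides):

Simple expression grammar:
Exp → Exp + Exp
Exp → Exp * Exp
Exp → num
Exp → (Exp)

Simple statement grammar:
Stat → id := Exp
Stat → Stat ; Stat
Stat → if Exp then Stat else Stat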
How to write CFGs
When writing a grammar for a programming language, one
normally starts by dividing the constructs of the language
into different syntactic categories.
A syntactic category is a sub-language that embodies a
particular concept.
Each syntactic category is denoted by a main nonterminal.
Examples of common syntactic categories in programming
languages are:
How to write CFGs
Expressions: express calculation of values.
Statements: express actions that occur in a particular
sequence.
Declarations: express properties of names used in other
parts of the program.
How to write CFGs
More non-terminals might be needed to describe a syntactic
category or provide structure to it, as we shall see, and
productions for one syntactic category can refer to non-
terminals for other syntactic categories.
For example, statements may contain expressions, so some
of the productions for statements use the main nonterminal
for expressions.
Derivation
The basic idea of derivation is to consider productions as
rewrite rules:
Whenever we have a nonterminal, we can replace this by
the right-hand side of any production in which the
nonterminal appears on the left-hand side.
We can do this anywhere in a sequence of symbols
(terminals and non-terminals) and repeat doing so until we
have only terminals left.
Derivation
The resulting sequence of terminals is a string in the
language defined by the grammar.
Formally, we define the derivation relation ⇒ by the three
rules:
1. αNβ ⇒ αγβ if there is a production N → γ
2. α ⇒ α
3. α ⇒ γ if there is a β such that α ⇒ β and β ⇒ γ
Example grammar (used in the derivations below):
T → R
T → aTc
R → ε
R → bR
Derivation
[Figures: a derivation of the string aabbbcc using the grammar above,
and a leftmost derivation of the same string.]
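For concreteness, the leftmost derivation of aabbbcc with the grammar
above rewrites the leftmost nonterminal at every step:

T ⇒ aTc ⇒ aaTcc ⇒ aaRcc ⇒ aabRcc ⇒ aabbRcc ⇒ aabbbRcc ⇒ aabbbcc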
Derivation
In this derivation, we have applied derivation steps
sometimes to the leftmost nonterminal, sometimes to the
rightmost and sometimes to a nonterminal that was neither.
However, since derivation steps are local, the order does not
matter.
So, we might as well decide to always rewrite the leftmost
nonterminal.
Derivation
A derivation that always rewrites the leftmost nonterminal
is called a leftmost derivation.
Similarly, a derivation that always rewrites the rightmost
nonterminal is called a rightmost derivation.
Syntax Trees & Ambiguity
We can draw a derivation as a tree:
The root of the tree is the start symbol of the grammar, and
whenever we rewrite a nonterminal we add as its children
the symbols on the right-hand side of the production that
was used.
The leaves of the tree are terminals which, when read from
left to right, form the derived string.
If a nonterminal is rewritten using an empty production, an
ε is shown as its child.
Syntax Trees & Ambiguity
This is also a leaf node, but is ignored when reading the
string from the leaves of the tree.
When we write such a syntax tree, the order of derivation is
irrelevant:
We get the same tree for a leftmost derivation, a rightmost
derivation or any other derivation order.
Only the choice of production for rewriting each
nonterminal matters.
Syntax Trees & Ambiguity
For compilation, we do the derivation backwards:
We start with a string and want to produce a syntax tree.
This process is called syntax analysis or parsing.
Even though the order of derivation does not matter when
constructing a syntax tree, the choice of production for each
nonterminal does.
Obviously, different choices can lead to different strings
being derived, but it may also happen that several different
syntax trees can be built for the same string.
Syntax Trees & Ambiguity
When a grammar permits several different syntax trees for
some strings we call the grammar ambiguous.
If our only use of grammar is to describe sets of strings,
ambiguity is not a problem.
However, when we want to use the grammar to impose
structure on strings, the structure had better be the same
every time.
Hence, it is a desirable feature for a grammar to be
unambiguous.
Syntax Trees & Ambiguity
[Figures: a syntax tree for the string aabbbcc using the grammar above,
and an alternative syntax tree for the same string.]
Operator Precedence
Ambiguity is not surprising, as we are used to the fact that
an expression like 2+3*4 can be read in two ways:
either as multiplying the sum of 2 and 3 by 4, or as adding 2
to the product of 3 and 4.
Calculators and programming languages avoid this by using a
hierarchy of operator precedence, which dictates that the
product must be calculated before the sum.
The hierarchy can be overridden by explicit
parenthesisation.
Operator Precedence
When evaluating an expression, the subexpressions
represented by subtrees of the syntax tree are evaluated
before the topmost operator is applied.
Most programming languages use the same convention as
scientific calculators.
A possible way of resolving the ambiguity is to use
precedence rules during syntax analysis to select among the
possible syntax trees.
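A standard way to build precedence into the grammar itself (an
illustrative sketch, not necessarily the slides' example) is to use one
nonterminal per precedence level:

Exp → Exp + Term | Term
Term → Term * Factor | Factor
Factor → num | (Exp)

With this grammar, 2+3*4 can only be parsed with the product in a
subtree of the sum, and parentheses override the hierarchy by forcing a
whole Exp back into a Factor.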
Syntax Error Handling
A good compiler should assist the programmer in
identifying and locating errors.
Errors can be
Lexical: misspelling an identifier, keyword, or operator
Syntactic: arithmetic expression with unbalanced parentheses
Semantic: an operator applied to an incompatible operand
Logical: infinitely recursive call
Syntax Error Handling
Often much of the error detection and recovery in a
compiler is centered around the syntax analysis phase.
Reasons:
Many errors are syntactic in nature
Stream of tokens disobeys the grammatical rules defining the
programming language.
Several parsing methods (LL, LR) detect that an error has
occurred as soon as they see a prefix of the input that is not
a prefix of any string in the language.
Syntax Analysis
Error-Recovery Strategies
There are many different general strategies that a parser can
employ to recover from a syntactic error.
The strategies:
Panic mode
Phrase level
Error productions
Global correction
Error-Recovery Strategies
Panic mode recovery:
On discovering an error, the parser discards input symbols
one at a time until one of a designated set of synchronizing
tokens (usually delimiters, such as a semicolon or end) is
found.
In situations where multiple errors in the same statement are rare, this
method may be quite adequate.
It is guaranteed not to go into an infinite loop (see the sketch below).
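A minimal sketch of panic-mode recovery in Python, assuming a token
list and an assumed set of synchronizing delimiters (all names here are
illustrative):

    # On a syntax error, discard input symbols one at a time until a
    # synchronizing token is found, then resume just past it.
    SYNC_TOKENS = {';', 'end'}

    def panic_recover(tokens, pos):
        while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
            pos += 1                      # discard one input symbol
        return min(pos + 1, len(tokens))  # skip the delimiter itself

Since the scan only ever moves forward through the input, it cannot
loop forever, which is the guarantee mentioned above.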
Error-Recovery Strategies
Phrase level
A parser may perform local correction on the remaining
input; that is, it may replace a prefix of the remaining input
by some string that allows the parser to continue.
A typical local correction would be to:
replace a comma by a semicolon,
delete an extraneous semicolon, or
insert a missing semicolon.
Error Productions
If we have a good idea of the common errors that might be
encountered, we can augment the grammar for the language
at hand with productions that generate the erroneous
constructs.
We then use the grammar augmented by these error
productions to construct a parser.
If an error production is used by the parser, we can generate
appropriate error diagnostics to indicate the erroneous
construct that has been recognized in the input.
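For example (an illustrative case, not from the slides), an expression
grammar could be augmented with the error production Exp → Exp Exp, so
that an input such as 2 3 with a missing operator is still parsed, and
a “missing operator” diagnostic is emitted whenever that production is
used.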
Global correction
Ideally, we would like a compiler to make as few changes as
possible in processing an incorrect input string.
- Choose a minimal sequence of changes to obtain a global
least-cost correction
Syntax Analysis
Syntax trees
For compilation, we do the derivation backwards:
We start with a string and want to produce a syntax tree.
This process is called syntax analysis or parsing.
It may also happen that several different syntax trees can be
built for the same string.
Syntax Analysis
Predictive parsing
Predictive parsers always build the syntax tree from the root down to the
leaves and are hence also called (deterministic) top-down parsers.
If we can always choose a unique production based on the next input symbol,
we are able to do predictive parsing without backtracking.
FIRST
A symbol c is in FIRST(α) if and only if α ⇒* cβ for some (possibly
empty) sequence β of grammar symbols.
- FIRST(α) is the set of symbols with which strings derived from α can
begin.
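A sketch of the standard fixed-point computation of FIRST sets in
Python; the grammar encoding (a dict from nonterminal to a list of
right-hand-side tuples, with 'eps' standing for ε) is an assumption for
illustration:

    EPS = 'eps'   # stands for the empty string

    def first_sets(grammar):
        # grammar: dict mapping nonterminal -> list of right-hand sides.
        first = {n: set() for n in grammar}

        def first_of_seq(seq):
            out = set()
            for sym in seq:
                if sym not in grammar:     # terminal symbol
                    out.add(sym)
                    return out
                out |= first[sym] - {EPS}
                if EPS not in first[sym]:  # sym is not (yet) nullable
                    return out
            out.add(EPS)                   # every symbol was nullable
            return out

        changed = True
        while changed:                     # iterate until nothing grows
            changed = False
            for n, rhss in grammar.items():
                for rhs in rhss:
                    new = first_of_seq(rhs)
                    if not new <= first[n]:
                        first[n] |= new
                        changed = True
        return first

    # The grammar used elsewhere in these slides:
    print(first_sets({'T': [('R',), ('a', 'T', 'c')],
                      'R': [(), ('b', 'R')]}))
    # -> {'T': {'a', 'b', 'eps'}, 'R': {'b', 'eps'}}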
Syntax Analysis
For the grammar G with productions:
A → a | BCA
B → be | cA
C → d
FIRST(a) = {a} and FIRST(BCA) = {b, c}
Notice that, to be able to write a predictive parser for a
grammar, for each nonterminal, the FIRST sets of its distinct
productions' right-hand sides must be pairwise disjoint.
Syntax Analysis
For the usual expression grammar (E → TE′, E′ → +TE′ | ε,
T → FT′, T′ → *FT′ | ε, F → (E) | id):
So FOLLOW(T′) = {+, $, )}
Syntax Analysis
Example: computing FOLLOW for the dangling-else grammar, with
augmented start symbol S″:
S″ → S$
S → iEtSS′ | a
S′ → eS | ε
E → b
Since {e} ⊆ FIRST(S′) and S is followed by S′ in S → iEtSS′,
{e} ⊆ FOLLOW(S).
Since S″ → S$, {$} ⊆ FOLLOW(S).
From S″ → S$ ⇒ iEtSS′$ we get {$} ⊆ FOLLOW(S′).
From S ⇒ iEtSS′ ⇒ iEtiEtSS′S′, where the first S′ is followed by S′
and S′ → eS, we get {e} ⊆ FOLLOW(S′).
So, FOLLOW(S′) = {$, e}
Back-tracking
Top-down parsers start from the root node (start symbol)
and match the input string against the production rules,
expanding nonterminals as productions are applied.
Example:
S → rXd | rZd
X → oa | ea
Z → ai
For the input string read, a top-down parser proceeds as follows:
It starts with S → rXd: r matches the first input symbol.
It then tries X → oa, but o does not match the next input symbol, e.
The parser backtracks and tries X → ea, which matches e and a.
Finally, d matches the last input symbol and the parse succeeds.
Predictive Parser
If we can always choose a unique production based on the
next input symbol, we are able to do predictive parsing
without backtracking.
A predictive parser is a recursive-descent parser that
has the capability to predict which production is to be used
to replace the input string.
Predictive Parser
It does not suffer from backtracking.
It uses a look-ahead pointer, which points to the next input symbol.
It accepts only a class of grammars known as LL(k) grammars.
It uses a stack and a parsing table to parse the input and generate a parse tree.
Predictive Parser
The grammar:
T′ → T$
T → R
T → aTc
R → ε
R → bR
LL(1) table for the grammar:
        a          b          c          $
T′      T′ → T$    T′ → T$               T′ → T$
T       T → aTc    T → R      T → R      T → R
R                  R → bR     R → ε      R → ε
Predictive Parser
Parsing the input aabbbcc with the grammar above (T′ → T$, T → R,
T → aTc, R → ε, R → bR):
Input        Stack
aabbbcc$     T′
aabbbcc$     T$
aabbbcc$     aTc$
abbbcc$      Tc$
abbbcc$      aTcc$
bbbcc$       Tcc$
bbbcc$       Rcc$
bbbcc$       bRcc$
bbcc$        Rcc$
bbcc$        bRcc$
bcc$         Rcc$
bcc$         bRcc$
cc$          Rcc$
cc$          cc$
c$           c$
$            $
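The same parse can be driven by a short Python stack machine; the table
encoding below is an assumption for illustration:

    # LL(1) table: (nonterminal, lookahead) -> right-hand side to push.
    TABLE = {
        ("T'", 'a'): ['T', '$'], ("T'", 'b'): ['T', '$'],
        ("T'", '$'): ['T', '$'],
        ('T', 'a'): ['a', 'T', 'c'], ('T', 'b'): ['R'],
        ('T', 'c'): ['R'], ('T', '$'): ['R'],
        ('R', 'b'): ['b', 'R'], ('R', 'c'): [], ('R', '$'): [],
    }
    NONTERMINALS = {"T'", 'T', 'R'}

    def ll1_parse(s):
        tokens = list(s) + ['$']
        stack = ["T'"]                        # start symbol on top
        pos = 0
        while stack:
            top = stack.pop()
            if top in NONTERMINALS:
                rhs = TABLE.get((top, tokens[pos]))
                if rhs is None:
                    return False              # blank table entry: error
                stack.extend(reversed(rhs))   # push right-hand side
            elif top == tokens[pos]:
                pos += 1                      # match a terminal
            else:
                return False
        return pos == len(tokens)

    print(ll1_parse('aabbbcc'))   # True
    print(ll1_parse('aabbbc'))    # False: missing a closing c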
Rewriting a grammar for LL(1) parsing
Methods for rewriting grammars:
- elimination of left recursion, and
- left factoring.
Eliminating left recursion
A grammar is left-recursive if it has a nonterminal N whose derivation contains N
itself as the leftmost symbol.
A left-recursive grammar is a problematic situation for top-down parsers:
- it is hard for the parser to judge when to stop expanding the left nonterminal, so it goes into an infinite loop.
Example
(1) A → Aα | β ; immediate left recursion
(2) S → Aα | β
A → Sd ; indirect left recursion
Rewriting a grammar for LL(1) parsing
When we have a nonterminal with some left-recursive
productions and some productions that are not:
N → N α1
…
N → N αm
N → β1
…
N → βn
where the βi do not start with N, an equivalent (right-recursive)
grammar is:
N → β1 N*
…
N → βn N*
N* → α1 N*
…
N* → αm N*
N* → ε
Example: Removing left-recursion from grammar:
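As an illustrative textbook instance of the rule above (the slides' own
figure may differ), the immediately left-recursive grammar

E → E + T
E → T

has α1 = +T and β1 = T, so it becomes the right-recursive grammar

E → T E*
E* → + T E*
E* → ε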
Left Factoring
When the top-down parser cannot make a choice as to which of the
productions it should take to parse the string at hand.
Example: a top-down parser encounters a production like
A → αβ | αγ | …
To remove this confusion, we use a technique called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers.
Technique: we make one production for each common prefix, and the rest of
the derivation is added by new productions.
Example: the above productions can be rewritten as
A → αA′
A′ → β | γ | …
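A small Python sketch of one left-factoring step; the function names
and grammar encoding are illustrative assumptions:

    from collections import defaultdict

    def common_prefix(seqs):
        # Longest prefix of symbols shared by all sequences in seqs.
        prefix = []
        for symbols in zip(*seqs):
            if len(set(symbols)) != 1:
                break
            prefix.append(symbols[0])
        return prefix

    def left_factor_step(alts, fresh):
        # alts: alternatives of one nonterminal, each a list of symbols.
        # fresh: name for the new nonterminal, e.g. "A'".
        groups = defaultdict(list)
        for alt in alts:
            if alt:
                groups[alt[0]].append(alt)
        for group in groups.values():
            if len(group) > 1:                 # shared prefix exists
                p = common_prefix(group)
                kept = [a for a in alts if a not in group]
                new_alts = kept + [p + [fresh]]
                fresh_alts = [a[len(p):] or ['ε'] for a in group]
                return new_alts, fresh_alts
        return None                            # nothing to factor

    # A -> αβ | αγ   becomes   A -> αA' ,  A' -> β | γ
    print(left_factor_step([['α', 'β'], ['α', 'γ']], "A'"))
    # -> ([['α', "A'"]], [['β'], ['γ']])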
Non-LL(1) Examples
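A classic non-LL(1) instance (an illustrative example) is the
dangling-else construct:

Stat → if Exp then Stat else Stat
Stat → if Exp then Stat

Both productions begin with the same symbols, so their FIRST sets
overlap. Left factoring gives

Stat → if Exp then Stat Stat′
Stat′ → else Stat | ε

but a conflict remains: else is in both FIRST(Stat′) and FOLLOW(Stat′),
so an LL(1) parser still cannot choose between the two Stat′
productions.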
Thank you!