
CHAPTER THREE

Syntax Analysis
 Syntax Analysis in a Compilation
 Context-Free Grammars (CFG)
 Derivation in Syntax Analysis
 Syntax Trees and Ambiguity
 Syntax Error Handling

Syntax Analysis in a Compilation
 Where lexical analysis (LA) splits the input into tokens, the purpose of
syntax analysis (parsing) is to recombine these tokens.
 Not back into a list of characters, but into something that
reflects the structure of the text.
 This “something” is typically a data structure called the
syntax tree of the text.
 The parser obtains a string of tokens from the LA.

 A syntax tree is a tree structure.
 The leaves of this tree are the tokens found by the LA,
 and if the leaves are read from left to right, the sequence is
the same as in the input text.
 Hence, what is important in the syntax tree is how these
leaves are combined to form the structure of the tree and
how the interior nodes of the tree are labelled.

 A derivation can be conveniently represented by a
derivation tree (parse tree).
 The root is labeled by the start symbol.
 Each leaf is labeled by a token or ε.
 Each interior node is labeled by a nonterminal symbol.
 The methods commonly used in compilers are:
» Top-down
» Bottom-up
 Top-down parsers build parse trees from the top (root) to the bottom
(leaves).
 Bottom-up parsers start from the leaves and work up to the root.
 The output of the parser is some representation of the parse tree built from
the stream of tokens produced by the LA.
 Recursive descent parsing:
 A common form of top-down parsing.
 Uses recursive procedures to process the input.

 May or may not suffer from backtracking.
 But if the associated grammar is not left-factored, backtracking
cannot be avoided.
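
As an illustration, a minimal recursive-descent sketch in Python (all names are
illustrative) for the grammar T → R | aTc, R → ε | bR, which is used as a running
example later in these slides. Each nonterminal becomes one procedure, and the next
input symbol decides which production to use, so no backtracking is needed:

    class Parser:
        def __init__(self, text):
            self.text = text
            self.pos = 0

        def next(self):
            # Peek at the next input symbol without consuming it ('$' marks the end).
            return self.text[self.pos] if self.pos < len(self.text) else '$'

        def match(self, c):
            # Consume one terminal, or report a syntax error.
            if self.next() != c:
                raise SyntaxError("expected %r, got %r" % (c, self.next()))
            self.pos += 1

        def parse_T(self):
            if self.next() == 'a':   # T -> a T c
                self.match('a')
                self.parse_T()
                self.match('c')
            else:                    # T -> R
                self.parse_R()

        def parse_R(self):
            if self.next() == 'b':   # R -> b R
                self.match('b')
                self.parse_R()
            # otherwise R -> (empty): consume nothing

    p = Parser("aabbbcc")
    p.parse_T()
    assert p.next() == '$'           # the whole input was consumed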

 When a production A → x1 … xn is applied,
 nodes labeled x1 … xn are added as children of the
node labeled A.
• Root: the start symbol
• Internal nodes: nonterminals
• Leaf nodes: terminals

 The syntax analyser must also reject invalid texts by
reporting syntax errors.
 Syntax analysis is less local in nature than LA, so more
advanced methods are required, but they use the same basic
strategy:
 a notation suitable for human understanding is transformed
into a machine-like low-level notation suitable for efficient
execution.
 This process is called parser generation.
 The notation we use for human manipulation is context-free
grammars (CFG),
 which is a recursive notation for describing sets of strings
and imposing a structure on each such string.
 “Context-free” refers to the fact that derivation is independent of
context.
 A CFG can be translated almost directly into recursive programs,
but it is often more convenient to generate stack automata.
Context-Free Grammars (CFG)
 Like REs, CFGs describe sets of strings, i.e., languages.
 A CFG also defines a structure on the strings in the language it
defines.
 A language is defined over some alphabet,
 for example the set of tokens produced by a lexer or the set
of alphanumeric characters.
 The symbols in the alphabet are called terminals.
 A CFG recursively defines several sets of strings.
 Each set is denoted by a name, which is called a
nonterminal.
 The set of non-terminals is disjoint from the set of
terminals.
 One of the non-terminals is chosen to denote the language
described by the grammar.
 This is called the start symbol of the grammar.
 The sets are described by a number of productions.
 Each production describes some of the possible strings that
are contained in the set denoted by a nonterminal.
 A production has the form:
N → X1 ... Xn
 where N is a nonterminal and X1 ... Xn are zero or more
symbols, each of which is either a terminal or a
nonterminal.
 The intended meaning of this notation is that the set
denoted by N contains strings that are obtained by
concatenating strings from the sets denoted by X1 ... Xn.
 In this setting, a terminal denotes a singleton set, just like
alphabet characters in REs.
 We will, when no confusion is likely, equate a nonterminal
with the set of strings it denotes. For example:
» A → a
 A → a says that the set denoted by the nonterminal A
contains the one-character string a.
 Together with a production A → aA, it indicates that A
contains all non-empty sequences of as and is hence (in the
absence of other productions) equivalent to the regular
expression a+.
» B →
» B → aB
 The 1st production indicates that the empty string is part of
the set B.
 Compare this grammar with the definition of a*.
 Productions with empty right-hand sides are called empty
productions.
 These are sometimes written with an ε on the right-hand
side instead of leaving it empty.
 So far, we have not described any set that could not just as
well have been described using REs.
 CFGs are, however, capable of expressing much more
complex languages.
 When several non-terminals are used, we must make it clear
which of these is the start symbol.
 By convention (if nothing else is stated), the nonterminal on
the left-hand side of the first production is the start symbol.
 For example, the grammar
T → R
T → aTa
R → b
R → bR
has T as start symbol and denotes the set of
strings that start with any number of as followed by a non-
zero number of bs and then the same number of as with
which it started.
 Sometimes, a shorthand notation is used where all the
productions for the same nonterminal are combined into a
single rule, using the alternative symbol (|) from REs to
separate the right-hand sides.
 The above grammar would read:
T → R | aTa
R → b | bR
 There are still four productions in the grammar, even
though the arrow symbol → is only used twice.
 A RE can be converted to a CFG as follows: each subexpression of the RE is
numbered, and subexpression si is assigned a nonterminal Ni.
 The productions for Ni depend on the shape of si.
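
As a sketch of the construction (the case split below follows the standard
textbook treatment and stands in for the table on the original slide):

» si = ε :        Ni →
» si = a :        Ni → a
» si = sj sk :    Ni → Nj Nk
» si = sj | sk :  Ni → Nj and Ni → Nk
» si = sj* :      Ni → Nj Ni and Ni →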
How to write CFGs
 A RE can systematically be rewritten as a CFG by using a
nonterminal for every subexpression in the RE and using
one or two productions for each nonterminal.
 So, if we can think of a way of expressing a language as a
RE, it is easy to make a grammar for it.
 However, we will also want to use grammars to describe
non-regular languages.
 An example is the kind of arithmetic expressions that are
part of most programming languages (and also found on
electronic calculators).
 Note that matching parentheses cannot be described by an
RE, as REs cannot “count” the number of unmatched
opening parentheses at a particular point in the string.
 However, if we did not have parentheses in the language, it
could be described by an RE.
 Even so, the regular description is not useful if you want
operators to have different precedence, as it treats the
expression as a flat string rather than as having structure.
 Most constructions from programming languages are easily
expressed by CFGs.
 In fact, most modern languages are designed this way.
 Simple expression grammar and simple statement grammar:
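
A typical pair of such grammars, given here as a sketch (the exact productions
are illustrative):

Exp → Exp + Exp
Exp → Exp − Exp
Exp → Exp * Exp
Exp → Exp / Exp
Exp → num
Exp → ( Exp )

Stat → id := Exp
Stat → Stat ; Stat
Stat → if Exp then Stat else Stat
Stat → if Exp then Stat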
 When writing a grammar for a programming language, one
normally starts by dividing the constructs of the language
into different syntactic categories.
 A syntactic category is a sub-language that embodies a
particular concept.
 Each syntactic category is denoted by a main nonterminal.
 Examples of common syntactic categories in programming
languages are:
 Expressions: used to express the calculation of values.
 Statements: express actions that occur in a particular
sequence.
 Declarations: express properties of names used in other
parts of the program.
 More non-terminals might be needed to describe a syntactic
category or provide structure to it, as we shall see, and
productions for one syntactic category can refer to non-
terminals for other syntactic categories.
 For example, statements may contain expressions, so some
of the productions for statements use the main nonterminal
for expressions.
Derivation
 The basic idea of derivation is to consider productions as
rewrite rules:
 Whenever we have a nonterminal, we can replace this by
the right-hand side of any production in which the
nonterminal appears on the left-hand side.
 We can do this anywhere in a sequence of symbols
(terminals and non-terminals) and repeat doing so until we
have only terminals left.
 The resulting sequence of terminals is a string in the
language defined by the grammar.
 Formally, we define the derivation relation ⇒ by the three
rules:
1. αNβ ⇒ αγβ   if there is a production N → γ
2. α ⇒ α
3. α ⇒ γ   if α ⇒ β and β ⇒ γ
 where α, β and γ are (possibly empty) sequences of
grammar symbols (terminals and non-terminals).
 The first rule states that using a production as a rewrite rule
(anywhere in a sequence of grammar symbols) is a
derivation step.
 The second states that the derivation relation is reflexive,
i.e., that a sequence derives itself.
 The third rule describes transitivity, i.e., that a sequence of
derivations is in itself a derivation.
 We can use derivation to formally define the language that a
CFG generates:
 Given a CFG G with start symbol S, terminal symbols T
and productions P, the language L(G) that G generates is
defined to be the set of strings of terminal symbols that can
be obtained by derivation from S using the productions P,
i.e., the set {w ∈ T* | S ⇒ w}.

Example grammar:
T → R
T → aTc
R →
R → bR
 Derivation of the string aabbbcc using the grammar above (see the
leftmost derivation below).
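
For reference, a leftmost derivation of aabbbcc with this grammar (each step
rewrites the leftmost nonterminal):

T ⇒ aTc ⇒ aaTcc ⇒ aaRcc ⇒ aabRcc ⇒ aabbRcc ⇒ aabbbRcc ⇒ aabbbcc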
 In this derivation, we have applied derivation steps
sometimes to the leftmost nonterminal, sometimes to the
rightmost and sometimes to a nonterminal that was neither.
 However, since derivation steps are local, the order does not
matter.
 So, we might as well decide to always rewrite the leftmost
nonterminal.

 A derivation that always rewrites the leftmost nonterminal
is called a leftmost derivation.
 Similarly, a derivation that always rewrites the rightmost
nonterminal is called a rightmost derivation.

Syntax Trees & Ambiguity
 We can draw a derivation as a tree:
 The root of the tree is the start symbol of the grammar, and
whenever we rewrite a nonterminal we add as its children
the symbols on the right-hand side of the production that
was used.
 The leaves of the tree are terminals which, when read from
left to right, form the derived string.
 If a nonterminal is rewritten using an empty production, an
ε is shown as its child.
 This is also a leaf node, but is ignored when reading the
string from the leaves of the tree.
 When we write such a syntax tree, the order of derivation is
irrelevant:
 We get the same tree for left derivation, right derivation or
any other derivation order.
 Only the choice of production for rewriting each
nonterminal matters.
 For compilation, we do the derivation backwards:
 We start with a string and want to produce a syntax tree.
 This process is called syntax analysis or parsing.
 Even though the order of derivation does not matter when
constructing a syntax tree, the choice of production for that
nonterminal does.
 Obviously, different choices can lead to different strings
being derived, but it may also happen that several different
syntax trees can be built for the same string.
 When a grammar permits several different syntax trees for
some strings we call the grammar ambiguous.
 If our only use of grammar is to describe sets of strings,
ambiguity is not a problem.
 However, when we want to use the grammar to impose
structure on strings, the structure had better be the same
every time.
 Hence, it is a desirable feature for a grammar to be
unambiguous.
 Syntax tree for the string aabbbcc using the grammar above, and an
alternative syntax tree for the same string.
Operator Precedence
 Ambiguity is not surprising, as we are used to the fact that
an expression like 2+3*4 can be read in two ways:
 either as multiplying the sum of 2 and 3 by 4, or as adding 2
to the product of 3 and 4.
 In practice the second reading is chosen, using a hierarchy of
operator precedence which dictates that the product must be
calculated before the sum.
 The hierarchy can be overridden by explicit
parenthesisation.
 When evaluating an expression, the subexpressions
represented by subtrees of the syntax tree are evaluated
before the topmost operator is applied.
 Most programming languages use the same convention as
scientific calculators.
 A possible way of resolving the ambiguity is to use
precedence rules during syntax analysis to select among the
possible syntax trees.
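
One common way of doing this is to encode the precedence levels directly in
the grammar, one nonterminal per level, as in the expression grammar used
later in these slides:

Exp → Exp + Exp2 | Exp − Exp2 | Exp2
Exp2 → Exp2 * Exp3 | Exp2 / Exp3 | Exp3
Exp3 → num | ( Exp )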
Syntax Error Handling
 A good compiler should assist the programmer in
identifying and locating errors.
 Errors can be:
 Lexical: misspelling an identifier, keyword, or operator
 Syntactic: an arithmetic expression with unbalanced parentheses
 Semantic: an operator applied to an incompatible operand
 Logical: an infinitely recursive call
 Often, much of the error detection and recovery in a
compiler is centered around the syntax analysis phase.
 Reasons:
 Many errors are syntactic in nature:
 the stream of tokens disobeys the grammatical rules defining the
programming language.
 Several parsing methods (LL, LR) detect that an error has
occurred as soon as they see a prefix of the input that is not
a prefix of any string in the language.
Error-Recovery Strategies
 There are many different general strategies that a parser can
employ to recover from a syntactic error.
 The strategies:
 Panic mode
 Phrase level
 Error productions
 Global correction
 Panic mode recovery:
 On discovering an error, the parser discards input symbols
one at a time until one of a designated set of synchronizing
tokens (usually delimiters, such as semicolon or end) is
found.
 In situations where multiple errors in the same statement are rare, this
method may be quite adequate.
 It is guaranteed not to go into an infinite loop.
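
A minimal Python sketch of the idea (token names and the synchronizing set are
assumptions for illustration):

    SYNC = {';', 'end'}   # designated synchronizing tokens

    def panic_mode_recover(tokens, pos):
        # Discard input symbols one at a time until a synchronizing token is found.
        while pos < len(tokens) and tokens[pos] not in SYNC:
            pos += 1
        # Skip the synchronizing token itself so parsing resumes after it.
        return min(pos + 1, len(tokens))

    # Example: on an error at position 2, resume after the semicolon.
    tokens = ['x', '=', '@', '3', ';', 'y', '=', '4', ';']
    print(panic_mode_recover(tokens, 2))   # prints 5, the position of 'y'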
 Phrase level recovery:
 A parser may perform local correction on the remaining
input; that is, it may replace a prefix of the remaining input
by some string that allows the parser to continue.
 Typical local corrections would be:
 Replace a comma by a semicolon
 Delete an extraneous semicolon
 Insert a missing semicolon
Error Productions
 If we have a good idea of the common errors that might be
encountered, we can augment the grammar for the language
at hand with productions that generate the erroneous
constructs.
 We then use the grammar augmented by these error
productions to construct a parser.
 If an error production is used by the parser, we can generate
appropriate error diagnostics to indicate the erroneous
construct that has been recognized in the input.
Global correction
 Ideally, we would like a compiler to make as few changes as
possible in processing an incorrect input string.
- Choose a minimal sequence of changes to obtain a global
least-cost correction.
Syntax Analysis
Syntax trees
 For compilation, we do the derivation backwards:
 We start with a string and want to produce a syntax tree.
 This process is called syntax analysis or parsing.
 It may also happen that several different syntax trees can be
built for the same string.

Syntax Analysis
Predictive parsing
 Predictive parsers always build the syntax tree from the root down to the
leaves and are hence also called (deterministic) top-down parsers.
 If we can always choose a unique production based on the next input symbol,
we are able to do predictive parsing without backtracking.
FIRST
A symbol c is in FIRST(α) if and only if α ⇒ cβ for some (possibly empty) sequence β of grammar
symbols.
- FIRST(α) returns the set of symbols with which strings derived from α can begin.
 For the grammar G with productions:
A → a | BCA
B → be | cA
C → d
First(a) = {a} and First(BCA) = {b, c}
Notice that, to be able to write a predictive parser for a
grammar, the First sets of the right-hand sides of distinct
productions for the same nonterminal must be disjoint.
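
A sketch of the usual fixed-point computation of the First sets for this grammar
(the helper below is illustrative; 'eps' stands for the empty string):

    from collections import defaultdict

    productions = {
        'A': [['a'], ['B', 'C', 'A']],
        'B': [['b', 'e'], ['c', 'A']],
        'C': [['d']],
    }
    nonterminals = set(productions)

    def first_sets():
        first = defaultdict(set)
        changed = True
        while changed:                      # iterate until no First set grows
            changed = False
            for n, alts in productions.items():
                for rhs in alts:
                    for sym in rhs:
                        if sym not in nonterminals:    # terminal: add it and stop
                            if sym not in first[n]:
                                first[n].add(sym)
                                changed = True
                            break
                        new = first[sym] - {'eps'}     # nonterminal: add its First
                        if not new <= first[n]:
                            first[n] |= new
                            changed = True
                        if 'eps' not in first[sym]:    # not nullable: stop here
                            break
                    else:                              # every symbol was nullable
                        if 'eps' not in first[n]:
                            first[n].add('eps')
                            changed = True
        return dict(first)

    print(first_sets())   # expected: A = {a, b, c}, B = {b, c}, C = {d}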
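The FOLLOW computations below refer to the standard left-factored expression
grammar, assumed here (the slide showing it is image-only) to be:

E → T E'      E' → + T E' | ε
T → F T'      T' → * F T' | ε
F → ( E ) | id

extended with a new start symbol and production E'' → E $.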
To get FOLLOW(T'):
E' → +TE' ⇒ +FT'E', so FIRST(E') − {ε} ⊆ FOLLOW(T')
E'' → E$ ⇒ TE'$ ⇒ FT'E'$ ⇒ FT'$, so {$} ⊆ FOLLOW(T')
F → (E) ⇒ (TE') ⇒ (T) ⇒ (FT'), so { ) } ⊆ FOLLOW(T')
So FOLLOW(T') = { +, $, ) }
To get FOLLOW(E'):
F → (E) ⇒ (TE'), so { ) } ⊆ FOLLOW(E')
And E'' → E$ ⇒ TE'$, so { $ } ⊆ FOLLOW(E')
So FOLLOW(E') = { ), $ }
To get FOLLOW(S), for the grammar S → iEtSS', S' → eS | ε (extended with S'' → S$):
S → iEtSS' and {e} = FIRST(S') − {ε}, so {e} ⊆ FOLLOW(S)
S'' → S$, so {$} ⊆ FOLLOW(S)
So FOLLOW(S) = { e, $ }
To get FOLLOW(S'):
S'' → S$ ⇒ iEtSS'$, so {$} ⊆ FOLLOW(S')
S → iEtSS' ⇒ iEt iEtSS' S' ⇒ iEt iEtSS' eS, so {e} ⊆ FOLLOW(S')
So FOLLOW(S') = { $, e }
Back-tracking
 Top-down parsers start from the root node (start symbol)
and match the input string against the production rules to
replace them (if matched).
Example:
S → rXd | rZd
X → oa | ea
Z → ai
For the input string read, a top-down parser proceeds as follows:
 The parser begins with S → rXd and matches r against the first input symbol.
 It then tries X → oa, which fails on the input symbol e, so it
backtracks and tries X → ea, which matches ea.
 Finally d is matched and the parse succeeds.
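
A small Python sketch of this behaviour (names are illustrative): each
alternative is tried in order, and a failed alternative simply resets the input
position, which is exactly the backtracking step:

    grammar = {
        'S': [['r', 'X', 'd'], ['r', 'Z', 'd']],
        'X': [['o', 'a'], ['e', 'a']],
        'Z': [['a', 'i']],
    }

    def try_parse(symbol, text, pos):
        # Returns the new input position on success, or None on failure.
        if symbol not in grammar:                  # terminal: must match the input
            if pos < len(text) and text[pos] == symbol:
                return pos + 1
            return None
        for rhs in grammar[symbol]:                # try each alternative in order
            p = pos
            for sym in rhs:
                p = try_parse(sym, text, p)
                if p is None:
                    break                          # this alternative failed
            else:
                return p                           # the whole alternative matched
        return None                                # all alternatives failed: backtrack

    print(try_parse('S', 'read', 0))   # 4: X -> oa fails on 'e', X -> ea succeeds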
Predictive Parser
 If we can always choose a unique production based on the
next input symbol, we are able to do predictive parsing
without backtracking.
 Is a recursive descent parser
 Has the capability to predict which production is to be used
to replace the input string.

 Does not suffer from backtracking.
 Uses a look-ahead pointer, which points to the next input symbols.
 Accepts only a class of grammar known as LL(k) grammar.
 Uses a stack and a parsing table to parse the input and generate a parse tree.
Predictive Parser
T  R T ' T $
T  aTc T  R
R  T  aTc
R  bR R 
R  bR
LL(1) table for the grammar:

a b c $
T ' T ' T $ T ' T $ T ' T $
T T  aTc T R T R T R
R R  bR R  R 

69
T ' T $
T  R
T  aTc Input stack
R   aabbbcc $ T '
T $
R  bR aabbbcc $
aabbbcc $ a T c$
abbbcc$ T c$
abbbcc$ a T cc$
bbbcc$ T cc $
bbbcc$
R cc $
bbbcc$ b R cc$
bbcc$
R cc $
bbcc$ b R cc$
R cc $
bcc $
b R cc$
bcc $
cc $ R cc $
cc $ cc $
c$
c$
$ $

70
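
A Python sketch of the table-driven parser that produces the trace above (the
table is transcribed from the previous slide; helper names are illustrative):

    table = {
        ("T'", 'a'): ['T', '$'], ("T'", 'b'): ['T', '$'], ("T'", '$'): ['T', '$'],
        ('T', 'a'): ['a', 'T', 'c'], ('T', 'b'): ['R'],
        ('T', 'c'): ['R'], ('T', '$'): ['R'],
        ('R', 'b'): ['b', 'R'], ('R', 'c'): [], ('R', '$'): [],
    }
    nonterminals = {"T'", 'T', 'R'}

    def ll1_parse(text):
        tokens = list(text) + ['$']
        stack = ["T'"]                        # start symbol of the extended grammar
        i = 0
        while stack:
            top = stack.pop()
            if top in nonterminals:
                rhs = table.get((top, tokens[i]))
                if rhs is None:
                    raise SyntaxError("no production for (%s, %s)" % (top, tokens[i]))
                stack.extend(reversed(rhs))   # push the RHS, leftmost symbol on top
            elif top == tokens[i]:
                i += 1                        # terminal on top of stack matches input
            else:
                raise SyntaxError("expected %s, got %s" % (top, tokens[i]))
        return i == len(tokens)

    print(ll1_parse('aabbbcc'))   # True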
Rewriting a grammar for LL(1) parsing
Methods for rewriting grammars:
- elimination of left recursion, and
- left factorization.
Eliminating left recursion
A grammar is left-recursive if it has any nonterminal N whose derivation contains N
itself as the left-most symbol.
A left-recursive grammar is considered to be a problematic situation for top-down parsers:
- it is hard for the parser to judge when to stop expanding the left nonterminal, and it goes into an infinite loop.
Example
(1) A => Aα | β ; immediate left recursion

(2) S => Aα | β
    A => Sd ; indirect left recursion
 When we have a nonterminal with some productions that are left-recursive and
some that are not:
N → N α1
...
N → N αm
N → β1
...
N → βn        where the βi do not start with N
an equivalent (right-recursive) grammar is:
N → β1 N*
...
N → βn N*
N* → α1 N*
...
N* → αm N*
N* →
Example: Removing left recursion from the grammar:

Exp → Exp + Exp2
Exp → Exp − Exp2
Exp → Exp2
Exp2 → Exp2 * Exp3
Exp2 → Exp2 / Exp3
Exp2 → Exp3
Exp3 → num
Exp3 → ( Exp )

The equivalent (right-recursive) grammar:

Exp → Exp2 Exp*
Exp* → + Exp2 Exp*
Exp* → − Exp2 Exp*
Exp* →
Exp2 → Exp3 Exp2*
Exp2* → * Exp3 Exp2*
Exp2* → / Exp3 Exp2*
Exp2* →
Exp3 → num
Exp3 → ( Exp )
Example: The production set
S => Aα | β
A => Sd
has indirect left recursion. Substituting S into A => Sd makes the left recursion immediate:
S => Aα | β
A => Aαd | βd
Eliminating the left recursion then gives:
S => Aα | β
A => βdA’
A’ => αdA’
A’ => ε
Now none of the productions has either direct or indirect left recursion.
Left Factoring
 Left factoring is used when the top-down parser cannot make a choice as to which of the
productions it should take to parse the string in hand.
 Example: a top-down parser has this problem when it encounters productions like
A ⟹ αβ | α𝜸 | …
 To remove this confusion, we use a technique called left factoring.
 Left factoring transforms the grammar to make it useful for top-down parsers.
 Technique: we make one production for each common prefix, and the rest of
the derivation is added by new productions.
 Example: the above productions can be written as
A => αA’
A’ => β | 𝜸 | …
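
For instance, the dangling-else productions S => iEtS | iEtSeS share the common
prefix iEtS; left factoring yields exactly the grammar used in the FOLLOW
example earlier:
S => iEtSS’
S’ => eS | ε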
Thank you!
