Syllabus
• Syntax Analysis - CFG, top-down and bottom-up
parsers, RDP, Predictive parser, SLR, LR(1),
LALR parsers, using ambiguous grammar, Error
detection and recovery, automatic
construction of parsers using YACC,
Introduction to Semantic Analysis - need of
semantic analysis, type checking and type
conversion.
1- By Jaydeep Patil AISSMS's IOIT Pune
UNIT 2
Syntax Analysis
2- By Jaydeep Patil AISSMS's IOIT Pune
Grammar
• A set of formal rules for generating syntactically correct
sentences.
• It is defined by the 4-tuple G = (V, T, P, S)
– V - Variables (non-terminals)
– T - Terminals
– P - Productions
– S - Start symbol (a variable)
• Terminals - {a, b, c, ..., z, 0-9}
• Non-terminals (variables) - {A-Z}
• Rule: the LHS of every production must contain at least one
non-terminal, that is, a variable.
3- By Jaydeep Patil AISSMS's IOIT Pune
Grammar
Example 1
E -> E+E
E -> E*E
E -> id
G = ({E}, {id, *, +}, P, E)
L(G) = {id, id+id, id*id, id+id*id, ...}
Example 2
S -> Xa
X -> aX | bX | a | b
G = ({S, X}, {a, b}, P, S)
L(G) = {aa, ba, aaa, baa, ...}
4- By Jaydeep Patil AISSMS's IOIT Pune
CFG
• RULE: 1. Every production is of the form
𝐴 → α
where A is a single variable (non-terminal) and α is any string of terminals and/or non-terminals.
2. A production of the form α → β, with an arbitrary string on the
LHS, is not allowed: on the LHS of a production there must be
only a single non-terminal (variable), not a string.
CFL (Context-Free Language) -> the set of
sentences derived from the start symbol of a CFG is
called a CFL.
6- By Jaydeep Patil AISSMS's IOIT Pune
Leftmost and Rightmost Derivations
E->E+E
E->E*E
E->id
• Derive id+id*id
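• For example, one leftmost and one rightmost derivation of id+id*id are:
Leftmost:  E => E+E => id+E => id+E*E => id+id*E => id+id*id
Rightmost: E => E+E => E+E*E => E+E*id => E+id*id => id+id*id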
7- By Jaydeep Patil AISSMS's IOIT Pune
8- By Jaydeep Patil AISSMS's IOIT Pune
Ambiguity
• A grammar that produces more than one
parse tree for some sentence is said to be
ambiguous. Put another way, an ambiguous
grammar is one that produces more than one
leftmost derivation or more than one
rightmost derivation for the same sentence.
9- By Jaydeep Patil AISSMS's IOIT Pune
10- By Jaydeep Patil AISSMS's IOIT Pune
Left Recursion
• A grammar is left recursive if it has a nonterminal A such that
there is a derivation
A => Aα
for some string α.
• Top-down parsing methods cannot handle left-recursive
grammars, so a transformation is needed to eliminate left
recursion.
A -> Aα | β
becomes
A -> βA'
A' -> αA' | ε
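• Applying this transformation to the left-recursive expression
grammar E -> E+T | T, for example, gives the equivalent grammar:
E -> TE'
E' -> +TE' | ε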
11- By Jaydeep Patil AISSMS's IOIT Pune
Left Factoring
• Left factoring is a grammar transformation that is useful for producing a
grammar suitable for predictive, or top-down, parsing. When the choice
between two alternative A-productions is not clear, we may be able to
rewrite the productions to defer the decision until enough of the input has
been seen that we can make the right choice.
• For example, given the two productions
A -> αβ1 | αβ2
we left-factor them as
A -> αA'
A' -> β1 | β2
12- By Jaydeep Patil AISSMS's IOIT Pune
Left Factoring
S -> cAd
A -> ab | a
After left factoring:
S -> cAd
A -> aA'
A' -> b | ε
13- By Jaydeep Patil AISSMS's IOIT Pune
Top Down Parsing
• Top-down parsing can be viewed as the problem of
constructing a parse tree for the input string, starting from the
root and creating the nodes of the parse tree in preorder.
Equivalently, top-down parsing can be viewed as finding a
leftmost derivation for an input string.
14- By Jaydeep Patil AISSMS's IOIT Pune
Recursive-Descent Parsing
• A recursive-descent parsing program consists
of a set of procedures, one for each
nonterminal. Execution begins with the
procedure for the start symbol, which halts
and announces success if its procedure body
scans the entire input string.
15- By Jaydeep Patil AISSMS's IOIT Pune
Recursive-Descent Parsing
16- By Jaydeep Patil AISSMS's IOIT Pune
Recursive-Descent Parsing
• General recursive-descent may require
backtracking; that is, it may require repeated
scans over the input. However, backtracking is
rarely needed to parse programming language
constructs, so backtracking parsers are not
seen frequently.
17- By Jaydeep Patil AISSMS's IOIT Pune
Recursive-Descent Parsing
Consider the grammar
S -> cAd
A -> ab | a
and the input string w = cad.
S => cAd
  => cad (the parser first tries A -> ab, fails to match d, backtracks, and then succeeds with A -> a)
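A minimal recursive-descent sketch in C for this grammar, with backtracking over the alternatives of A (the input buffer, cursor and function names are illustrative assumptions, not part of the original slides):
#include <stdio.h>
#include <string.h>

static const char *input;   /* assumed global input buffer, e.g. "cad" */
static int pos;             /* cursor into the input */

static int match(char c) { if (input[pos] == c) { pos++; return 1; } return 0; }

static int A(void) {        /* A -> ab | a */
    int save = pos;
    if (match('a') && match('b')) return 1;   /* try A -> ab first */
    pos = save;                               /* backtrack */
    return match('a');                        /* then try A -> a */
}

static int S(void) {        /* S -> cAd */
    return match('c') && A() && match('d');
}

int main(void) {
    input = "cad"; pos = 0;
    printf(S() && pos == (int)strlen(input) ? "accepted\n" : "rejected\n");
    return 0;
}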
18- By Jaydeep Patil AISSMS's IOIT Pune
Recursive-Descent Parsing
• A left-recursive grammar can cause a
recursive-descent parser, even one with
backtracking, to go into an infinite loop. That
is, when we try to expand a nonterminal A, we
may eventually find ourselves again trying to
expand A without having consumed any input.
19- By Jaydeep Patil AISSMS's IOIT Pune
FIRST and FOLLOW
• The construction of both top-down and
bottom-up parsers is aided by two functions,
FIRST and FOLLOW, associated with a
grammar G. During top-down parsing, FIRST
and FOLLOW allow us to choose which
production to apply, based on the next input
symbol. During panic-mode error recovery,
sets of tokens produced by FOLLOW can be
used as synchronizing tokens.
20- By Jaydeep Patil AISSMS's IOIT Pune
FIRST
• FIRST is a function which gives the set of terminals
that can begin the strings derived from a grammar
symbol.
• Rules:
1. If X is a terminal, then FIRST(X) = {X}.
2. If X -> ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal, then:
3.a) If X -> Yα, then FIRST(Y) - {ε} is included in FIRST(X).
3.b) If FIRST(Y) contains ε and X -> YZ, then
FIRST(X) = (FIRST(Y) - {ε}) ∪ FIRST(Z).
21- By Jaydeep Patil AISSMS's IOIT Pune
Follow
• FOLLOW is a function which gives the set of terminals that
can appear immediately to the right of the given symbol in some sentential form.
• Rules:
1. $ is in FOLLOW(S), where S is the start symbol.
2. If A -> αBβ is a production, then FIRST(β) - {ε} is included in FOLLOW(B).
3. If A -> αB, or A -> αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
22- By Jaydeep Patil AISSMS's IOIT Pune
FIRST
• E->TE’
• E’->+TE’/ε
• T->FT’
• T’->*FT’/ ε
• F->(E)/id
First(E) = First(T) = First(F) = { (, id }
First(E') = { +, ε }
First(T') = { *, ε }
23- By Jaydeep Patil AISSMS's IOIT Pune
S-> iEtSS’/a
S’-> eS/ ε
E->b
First(S)={i,a}
First(S')={e, ε}
First(E)={b}
24- By Jaydeep Patil AISSMS's IOIT Pune
S->A
A->aB/Ad
B->aBC/f
C->g
First(S)=First(A) ={a}
First(B)={a,f}
First(C)={g}
25- By Jaydeep Patil AISSMS's IOIT Pune
First(S)={1, ε}
First(A)={1,0}
First(B)={0}
First(C)={1}
S->1AB/ ε
A->1AC/0C
B->0S
C->1
26- By Jaydeep Patil AISSMS's IOIT Pune
S->AaAb/BbBa
A-> ε
B-> ε
First(A)={ε}
First(B)={ε}
First(S) = (First(A) - {ε}) ∪ First(a) ∪ (First(B) - {ε}) ∪ First(b) = {a, b}
27- By Jaydeep Patil AISSMS's IOIT Pune
S->aBbDh
B->cC
C->bc/ ε
D->EF
E->g/ ε
F->f/ ε
First(S)={a}
First(B)={c}
First(C) = {b, ε}
First(D) = (First(E) - {ε}) ∪ First(F) = {g, f, ε}
First(E) = {g, ε}
First(F) = {f, ε}
28- By Jaydeep Patil AISSMS's IOIT Pune
FIRST & FOLLOW
• E->TE’
• E’->+TE’/ε
• T->FT’
• T’->*FT’/ ε
• F->(E)/id
First(E) = First(T) = First(F) = { (, id }
First(E') = { +, ε }
First(T') = { *, ε }
Follow(E)= { $,) }
Follow(E’)={$,) }
Follow(T) = (First(E') - {ε}) ∪ Follow(E') = { +, $, ) }
Follow(T') = { +, $, ) }
Follow(F) = (First(T') - {ε}) ∪ Follow(T') = { *, +, $, ) }
29- By Jaydeep Patil AISSMS's IOIT Pune
S-> iEtSS’/a
S’-> eS/ ε
E->b
First(S)={i, a}
First(S')={e, ε}
First(E)={b}
Follow(S)={$, e}
Follow(S')={$, e}
Follow(E)={t}
30- By Jaydeep Patil AISSMS's IOIT Pune
S->A
A->aB/Ad
B->aBC/f
C->g
First(S)=First(A) ={a}
First(B)={a,f}
First(C)={g}
Follow(S)={$}
Follow(A)={$,d}
Follow(B)={$,d,g}
Follow(C)={$,d,g}
31- By Jaydeep Patil AISSMS's IOIT Pune
First(S)={1, ε}
First(A)={1,0}
First(B)={0}
First(C)={1}
Follow(S)={$}
Follow(A)={0,1}
Follow(B)={$}
Follow(C)={0,1}
S->1AB/ ε
A->1AC/0C
B->0S
C->1
32- By Jaydeep Patil AISSMS's IOIT Pune
S->AaAb/BbBa
A-> ε
B-> ε
First(A)={ε}
First(B)={ε}
First(S) = (First(A) - {ε}) ∪ First(a) ∪ (First(B) - {ε}) ∪ First(b) = {a, b}
Follow(S) = {$}
Follow(A) = {a, b}
Follow(B) = {b, a}
33- By Jaydeep Patil AISSMS's IOIT Pune
S->aBbDh
B->cC
C->bc/ ε
D->EF
E->g/ ε
F->f/ ε
First(S)={a}
First(B)={c}
First(C) = {b, ε}
First(D) = (First(E) - {ε}) ∪ First(F) = {g, f, ε}
First(E) = {g, ε}
First(F) = {f, ε}
Follow(S)={$}
Follow(B)={b}
Follow(C)={b}
Follow(D)={h}
Follow(E)={f,h}
Follow(F)={h}
34- By Jaydeep Patil AISSMS's IOIT Pune
S->aBDh
B->cC
C->bc/ ε
D->EF
E->g/ ε
F->f/ ε
First(S)={a}
First(B)={c}
First(C) = {b, ε}
First(D) = (First(E) - {ε}) ∪ First(F) = {g, f, ε}
First(E) = {g, ε}
First(F) = {f, ε}
Follow(S)={$}
Follow(B)={g,f,h}
Follow(C)={g,f,h}
Follow(D)={h}
Follow(E)={f,h}
Follow(F)={h}
35- By Jaydeep Patil AISSMS's IOIT Pune
• E->TA
• A->+TA/ε
• T->FB
• B->*FB/ ε
• F->(E)/id
First(E) = First(T) = First(F) = { (, id }
First(A) = { +, ε }
First(B) = { *, ε }
Follow(E) = { $, ) }
Follow(A) = { $, ) }
Follow(T) = (First(A) - {ε}) ∪ Follow(A) = { +, $, ) }
Follow(B) = { +, $, ) }
Follow(F) = (First(B) - {ε}) ∪ Follow(B) = { *, +, $, ) }
36- By Jaydeep Patil AISSMS's IOIT Pune
LL ( 1 ) Grammars
• Predictive parsers, that is, recursive-descent parsers
needing no backtracking, can be constructed for a class
of grammars called LL(1). The first "L" in LL(1) stands
for scanning the input from left to right, the second "L"
for producing a leftmost derivation, and the "1" for
using one input symbol of lookahead at each step to
make parsing action decisions.
• The class of LL(1) grammars is rich enough to cover
most programming constructs, although care is needed
in writing a suitable grammar for the source language.
For example, no left-recursive or ambiguous grammar
can be LL(1).
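• The LL(1) parsing table M is constructed from FIRST and FOLLOW: for each
production A -> α, add A -> α to M[A, a] for every terminal a in FIRST(α);
if ε is in FIRST(α), also add A -> α to M[A, b] for every b in FOLLOW(A)
(including $). The grammar is LL(1) exactly when no cell of M receives more
than one production.
• For instance, with First(E') = {+, ε} and Follow(E') = {$, )}, the
production E' -> +TE' is placed in M[E', +], and E' -> ε is placed in
M[E', $] and M[E', )].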
37- By Jaydeep Patil AISSMS's IOIT Pune
Predictive Parsing
38- By Jaydeep Patil AISSMS's IOIT Pune
Grammar is LL(1) (no multiple entries in the parsing table)
39- By Jaydeep Patil AISSMS's IOIT Pune
Grammar is not LL(1) (multiple entries in the parsing table)
40- By Jaydeep Patil AISSMS's IOIT Pune
Nonrecursive Predictive Parsing
• A non recursive predictive parser can be built
by maintaining a stack explicitly, rather than
implicitly via recursive calls. The parser mimics
a leftmost derivation. If w is the input that has
been matched so far , then the stack holds a
sequence of grammar symbols.
41- By Jaydeep Patil AISSMS's IOIT Pune
43- By Jaydeep Patil AISSMS's IOIT Pune
44- By Jaydeep Patil AISSMS's IOIT Pune
• If X = a = $, the parser halts and announces successful
completion of parsing.
• If X = a ≠ $, the parser pops X off the stack and advances the
input pointer to the next input symbol.
• If X is a non-terminal, the program consults entry M[X, a]
of the parsing table M. This entry will be either an X-
production of the grammar or an error entry. If, for
example, M[X, a] = {X -> UVW}, the parser replaces X on the
top of the stack by WVU (with U on top of the stack). As output,
we shall assume that the parser just prints the
production used.
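A compact C sketch of this table-driven loop for the grammar E->TE', E'->+TE'|ε, T->FT', T'->*FT'|ε, F->(E)|id follows; the symbol encoding, the hard-coded table M, and the single-character tokenizer are illustrative assumptions made only for this sketch.
#include <stdio.h>

enum { ID, PLUS, STAR, LP, RP, END,      /* terminals; END plays the role of $ */
       E, Ep, T, Tp, F };                /* nonterminals (Ep = E', Tp = T') */

/* Production bodies, stored right-to-left so they can be pushed directly;
   -1 terminates each body (an empty body encodes an epsilon-production). */
static const int prod[][4] = {
    /*0: E  -> T E'   */ { Ep, T, -1 },
    /*1: E' -> + T E' */ { Ep, T, PLUS, -1 },
    /*2: E' -> eps    */ { -1 },
    /*3: T  -> F T'   */ { Tp, F, -1 },
    /*4: T' -> * F T' */ { Tp, F, STAR, -1 },
    /*5: T' -> eps    */ { -1 },
    /*6: F  -> ( E )  */ { RP, E, LP, -1 },
    /*7: F  -> id     */ { ID, -1 },
};

/* Parsing table M[nonterminal][terminal]; -1 marks an error entry. */
static const int M[5][6] = {
    /* E  */ {  0, -1, -1,  0, -1, -1 },
    /* E' */ { -1,  1, -1, -1,  2,  2 },
    /* T  */ {  3, -1, -1,  3, -1, -1 },
    /* T' */ { -1,  5,  4, -1,  5,  5 },
    /* F  */ {  7, -1, -1,  6, -1, -1 },
};

static int lex(char c) {                 /* map one character to a terminal code */
    switch (c) {
    case 'i': return ID;  case '+': return PLUS; case '*': return STAR;
    case '(': return LP;  case ')': return RP;   default:  return END;
    }
}

int main(void) {
    const char *w = "i+i*i";             /* 'i' stands for the token id */
    int stack[100], top = 0, ip = 0;
    stack[top++] = END;                  /* $ marks the bottom of the stack */
    stack[top++] = E;                    /* start symbol on top */
    int a = lex(w[ip]);
    while (top > 0) {
        int X = stack[--top];
        if (X <= END) {                  /* X is a terminal or $ */
            if (X != a) { printf("error\n"); return 1; }
            if (X == END) break;         /* X = a = $ : accept */
            a = lex(w[++ip]);            /* matched: advance the input pointer */
        } else {                         /* X is a nonterminal: consult M[X, a] */
            int p = M[X - E][a];
            if (p < 0) { printf("error\n"); return 1; }
            printf("output production %d\n", p);
            for (int i = 0; prod[p][i] >= 0; i++)
                stack[top++] = prod[p][i];   /* push body, leftmost symbol on top */
        }
    }
    printf("input accepted\n");
    return 0;
}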
45- By Jaydeep Patil AISSMS's IOIT Pune
46- By Jaydeep Patil AISSMS's IOIT Pune
Bottom-Up Parsing
47- By Jaydeep Patil AISSMS's IOIT Pune
Bottom-Up Parsing
• A bottom-up parse corresponds to the
construction of a parse tree for an input string
beginning at the leaves (the bottom) and
working up towards the root (the top) .
48- By Jaydeep Patil AISSMS's IOIT Pune
49- By Jaydeep Patil AISSMS's IOIT Pune
Reductions
• We can think of bottom-up parsing as the
process of "reducing" a string w to the start
symbol of the grammar. At each reduction
step, a specific substring matching the body of
a production is replaced by the nonterminal at
the head of that production.
• The key decisions during bottom-up parsing
are about when to reduce and about what
production to apply, as the parse proceeds.
50- By Jaydeep Patil AISSMS's IOIT Pune
• The goal of bottom-up parsing is therefore to
construct a derivation in reverse. The
following derivation corresponds to the parse
in
• E => T => T * F => T * id => F * id => id * id
• This derivation is in fact a rightmost
derivation.
51- By Jaydeep Patil AISSMS's IOIT Pune
Handle
• Bottom-up parsing during a left-to-right scan
of the input constructs a rightmost derivation
in reverse. Informally, a "handle" is a substring
that matches the body of a production, and
whose reduction represents one step along
the reverse of a rightmost derivation.
52- By Jaydeep Patil AISSMS's IOIT Pune
Shift-Reduce Parsing
• Shift-reduce parsing is a form of bottom-up parsing in
which a stack holds grammar symbols and an input
buffer holds the rest of the string to be parsed.
• As we shall see, the handle always appears at the top of
the stack just before it is identified as the handle.
• We use $ to mark the bottom of the stack and also the
right end of the input. Conventionally, when discussing
bottom-up parsing, we show the top of the stack on the
right, rather than on the left as we did for top-down
parsing. Initially, the stack is empty, and the string w is
on the input, as follows:
55- By Jaydeep Patil AISSMS's IOIT Pune
• During a left-to-right scan of the input string,
the parser shifts zero or more input symbols
onto the stack, until it is ready to reduce a
string β of grammar symbols on top of the
stack. It then reduces β to the head of the
appropriate production. The parser repeats
this cycle until it has detected an error or until
the stack contains the start symbol and the
input is empty:
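• For the grammar E -> E+T | T, T -> T*F | F, F -> (E) | id, a shift-reduce
parse of the input id*id proceeds as follows:
Stack        Input      Action
$            id*id$     shift
$ id         *id$       reduce by F -> id
$ F          *id$       reduce by T -> F
$ T          *id$       shift
$ T *        id$        shift
$ T * id     $          reduce by F -> id
$ T * F      $          reduce by T -> T*F
$ T          $          reduce by E -> T
$ E          $          accept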
56- By Jaydeep Patil AISSMS's IOIT Pune
57- By Jaydeep Patil AISSMS's IOIT Pune
LR Parsing: Simple LR
• The most prevalent type of bottom-up parser
today is based on a concept called LR(k) parsing;
the "L" is for left-to-right scanning of the input,
the "R" for constructing a rightmost derivation in
reverse, and the k for the number of input
symbols of lookahead that are used in making
parsing decisions. The cases k = 0 and k = 1 are of
practical interest, and we shall only consider LR
parsers with k <= 1 here. When (k) is omitted, k is
assumed to be 1.
58- By Jaydeep Patil AISSMS's IOIT Pune
Why LR Parsers
• LR parsers are table-driven, much like the
nonrecursive LL parsers. A grammar for which
we can construct a parsing table using one of
the methods in this section and the next is
said to be an LR grammar. Intuitively, for a
grammar to be LR it is sufficient that a left-to-
right shift-reduce parser be able to recognize
handles of right- sentential forms when they
appear on top of the stack.
59- By Jaydeep Patil AISSMS's IOIT Pune
60- By Jaydeep Patil AISSMS's IOIT Pune
• The principal drawback of the LR method is that it is too
much work to construct an LR parser by hand for a typical
programming-language grammar. A specialized tool, an LR
parser generator, is needed.
• Fortunately, many such generators are available, and we
shall discuss one of the most commonly used ones, Yacc .
• Such a generator takes a context-free grammar and
automatically produces a parser for that grammar. If the
grammar contains ambiguities or other constructs that are
difficult to parse in a left-to-right scan of the input, then the
parser generator locates these constructs and provides
detailed diagnostic messages.
61- By Jaydeep Patil AISSMS's IOIT Pune
Items and the LR(0) Automaton
62- By Jaydeep Patil AISSMS's IOIT Pune
Augmented grammar
• If G is a grammar with start symbol S, then G',
the augmented grammar for G, is G with a new
start symbol S' and production S' -> S. The
purpose of this new starting production is to
indicate to the parser when it should stop
parsing and announce acceptance of the
input. That is, acceptance occurs when and
only when the parser is about to reduce by
S' -> S.
63- By Jaydeep Patil AISSMS's IOIT Pune
64- By Jaydeep Patil AISSMS's IOIT Pune
E’-> E
E-> E+T|T
T-> T*F|F
F->(E)|id
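• For this augmented grammar, the initial set of LR(0) items is
I0 = CLOSURE({E' → .E}):
E' → .E
E → .E+T
E → .T
T → .T*F
T → .F
F → .(E)
F → .id
• and, for example, GOTO(I0, E) = { E' → E. , E → E.+T }.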
65- By Jaydeep Patil AISSMS's IOIT Pune
66- By Jaydeep Patil AISSMS's IOIT Pune
67- By Jaydeep Patil AISSMS's IOIT Pune
68- By Jaydeep Patil AISSMS's IOIT Pune
69- By Jaydeep Patil AISSMS's IOIT Pune
70- By Jaydeep Patil AISSMS's IOIT Pune
Canonical LR (CLR) Parsing
71- By Jaydeep Patil AISSMS's IOIT Pune
72- By Jaydeep Patil AISSMS's IOIT Pune
73- By Jaydeep Patil AISSMS's IOIT Pune
74- By Jaydeep Patil AISSMS's IOIT Pune
75- By Jaydeep Patil AISSMS's IOIT Pune
Operator Precedence Parsing
• Operator Grammar: For a small but important
class of grammars, we can easily construct
efficient shift-reduce parsers by hand.
Operator Grammars have the property that no
production right side is empty or has two
adjacent non-terminals.
76- By Jaydeep Patil AISSMS's IOIT Pune
• Eg: E -> EAE | (E) | -E | id
• A -> + | - | * | /
• The above grammar is not an operator grammar,
but we can rewrite it into one.
77- By Jaydeep Patil AISSMS's IOIT Pune
• In operator precedence parsing we define
three disjoint precedence relations <· , ≐ , ·>
between certain pairs of terminals. These
precedence relations guide the selection of
handles and have the following meanings:
78- By Jaydeep Patil AISSMS's IOIT Pune
• Basic Principle
• Having precedence relations allows identifying handles as follows:
• 1. Scan the string from the left until the first ·> is seen and put a pointer there.
• 2. Scan backwards from that point (right to left) until a <· is seen.
• 3. Everything between the <· and the ·> forms the handle.
• 4. Replace the handle with the head of the corresponding production.
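• For example, assuming the usual relations in which id has the highest
precedence and * takes precedence over +, the input id + id * id between $
markers is annotated as
$ <· id ·> + <· id ·> * <· id ·> $
so each id is a handle and is reduced to E first; in the remaining terminal
string $ + * $ we have $ <· +, + <· * and * ·> $, so the handle lies around *
and E*E is reduced before E+E, as expected.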
79- By Jaydeep Patil AISSMS's IOIT Pune
80- By Jaydeep Patil AISSMS's IOIT Pune
Conflicts in LR Parsing
• Every SLR grammar is unambiguous, but not every unambiguous
grammar is an SLR grammar.
81- By Jaydeep Patil AISSMS's IOIT Pune
shift/reduce and reduce/reduce
conflicts
• If a state does not know whether it will make a shift
operation or reduction for a terminal, we say that there
is a shift/reduce conflict.
• If a state does not know whether it will make a
reduction operation using the production rule i or j
for a terminal, we say that there is a reduce/reduce
conflict.
• If the SLR parsing table of a grammar G has a conflict,
we say that the grammar is not an SLR grammar.
82- By Jaydeep Patil AISSMS's IOIT Pune
83- By Jaydeep Patil AISSMS's IOIT Pune
84- By Jaydeep Patil AISSMS's IOIT Pune
Using Ambiguous Grammars
• All grammars used in the construction of LR-parsing
tables must be unambiguous.
• Can we create LR-parsing tables for ambiguous
grammars ?
– Yes, but they will have conflicts.
– We can resolve these conflicts in favor of one of them to disambiguate the grammar.
– At the end, we will have again an unambiguous grammar.
• Why we want to use an ambiguous grammar?
– Some ambiguous grammars are much more natural, and a corresponding unambiguous
grammar can be very complex.
– Usage of an ambiguous grammar may eliminate unnecessary reductions.
• Ex. Ambiguous grammar:
E → E+E | E*E | (E) | id
Equivalent unambiguous grammar:
E → E+T | T
T → T*F | F
F → (E) | id
85- By Jaydeep Patil AISSMS's IOIT Pune
Sets of LR(0) Items for Ambiguous
Grammar
I0: E' → .E
 E → .E+E
 E → .E*E
 E → .(E)
 E → .id
I1: E' → E.
 E → E.+E
 E → E.*E
I2: E → (.E)
 E → .E+E
 E → .E*E
 E → .(E)
 E → .id
I3: E → id.
I4: E → E+.E
 E → .E+E
 E → .E*E
 E → .(E)
 E → .id
I5: E → E*.E
 E → .E+E
 E → .E*E
 E → .(E)
 E → .id
I6: E → (E.)
 E → E.+E
 E → E.*E
I7: E → E+E.
 E → E.+E
 E → E.*E
I8: E → E*E.
 E → E.+E
 E → E.*E
I9: E → (E).
(Figure: the GOTO graph for these item sets - transitions on ( lead to I2, on id to I3, on + to I4, on * to I5, on ) from I6 to I9, and transitions on E lead to I1, I6, I7 and I8 from I0, I2, I4 and I5 respectively.)
86- By Jaydeep Patil AISSMS's IOIT Pune
Using ambiguous grammars
87- By Jaydeep Patil AISSMS's IOIT Pune
88- By Jaydeep Patil AISSMS's IOIT Pune
89- By Jaydeep Patil AISSMS's IOIT Pune
90- By Jaydeep Patil AISSMS's IOIT Pune
91- By Jaydeep Patil AISSMS's IOIT Pune
92- By Jaydeep Patil AISSMS's IOIT Pune
Error Recovery in LR Parsing
• An LR parser will detect an error when it consults the
parsing action table and finds an error entry. All empty
entries in the action table are error entries.
• Errors are never detected by consulting the goto table.
• An LR parser will announce error as soon as there is no
valid continuation for the scanned portion of the input.
• A canonical LR parser (LR(1) parser) will never make
even a single reduction before announcing an error.
• The SLR and LALR parsers may make several reductions
before announcing an error.
• But, all LR parsers (LR(1), LALR and SLR parsers) will
never shift an erroneous input symbol onto the stack.
93- By Jaydeep Patil AISSMS's IOIT Pune
ERROR RECOVERY IN LR PARSING
• An LR parser will detect an error when it consults the
parsing table and finds an error entry. A canonical LR
parser will never make even a single reduction before
announcing an error. The SLR and LALR parsers may
make several reductions before announcing an error,
but they will never shift an erroneous input symbol onto the
stack.
• We can implement two modes of recovery:
94- By Jaydeep Patil AISSMS's IOIT Pune
Panic Mode
• We scan down the stack until a state s with a goto
on a particular non-terminal A is found. Zero or more
input symbols are then discarded until a symbol a is
found that can legitimately follow A. The parser then
pushes the state goto[s, A] and resumes normal
parsing. Normally there may be many choices for the
non-terminal A; typically these would be non-
terminals representing major program pieces, such
as an expression, statement, or block.
95- By Jaydeep Patil AISSMS's IOIT Pune
Phrase Level Recovery
• It is implemented by examining each error
entry in the LR parsing table and deciding, on
the basis of the language, the most likely programming
error that would give rise to that error entry. An
appropriate error-recovery procedure can then
be constructed; presumably the top
of the stack and/or the first input symbols would
be modified in a way deemed appropriate for
each error.
96- By Jaydeep Patil AISSMS's IOIT Pune
• As an example consider the grammar (1).
E  E + E | E * E | ( E ) | id ---- (1)
The parsing table contains error routines that
have the effect of detecting errors before any
shift move takes place.
97- By Jaydeep Patil AISSMS's IOIT Pune
State  id   +    *    (    )    $   | E
0      s3   e1   e1   s2   e2   e1  | 1
1      e3   s4   s5   e3   e2   acc |
2      s3   e1   e1   s2   e2   e1  | 6
3      r4   r4   r4   r4   r4   r4  |
4      s3   e1   e1   s2   e2   e1  | 7
5      s3   e1   e1   s2   e2   e1  | 8
6      e3   s4   s5   e3   s9   e4  |
7      r1   r1   s5   r1   r1   r1  |
8      r2   r2   r2   r2   r2   r2  |
9      r3   r3   r3   r3   r3   r3  |
The LR parsing table with error routines
98- By Jaydeep Patil AISSMS's IOIT Pune
Error routines : e1
• This routine is called from states 0, 2, 4 and 5,
all of which expect the beginning of an operand,
either an id or a left parenthesis. Instead, an
operator (+ or *) or the end of the input was
found.
• Action: Push an imaginary id onto the stack
and cover it with state 3 (the goto of
states 0, 2, 4 and 5 on id).
• Print: Issue diagnostic "missing operand"
99- By Jaydeep Patil AISSMS's IOIT Pune
Error routines : e2
• This routine is called from states 0, 1, 2, 4 and
5 on finding a right parenthesis.
• Action: Remove the right parenthesis from the
input
• Print: Issue diagnostic “Unbalanced right
parenthesis”
100- By Jaydeep Patil AISSMS's IOIT Pune
Error routines : e3
• This routine is called from states 1 or 6 when
expecting an operator, and an id or right
parenthesis is found.
• Action: Push + onto the stack and cover it with
state 4.
• Print: Issue diagnostic “Missing operator”
101- By Jaydeep Patil AISSMS's IOIT Pune
Error routines : e4
• This routine is called from state 6 when the
end of input is found while expecting operator
or a right parenthesis.
• Action: Push a right parenthesis onto the stack
and cover it with a state 9.
• Print: Issue diagnostic “Missing right
parenthesis”
102- By Jaydeep Patil AISSMS's IOIT Pune
Automatic construction of parsers
(YACC), YACC specifications.
103- By Jaydeep Patil AISSMS's IOIT Pune
104
Automatic construction of parsers
(YACC), YACC specifications.
• Two classical tools for compilers:
– Lex: A Lexical Analyzer Generator
– Yacc: “Yet Another Compiler Compiler” (Parser Generator)
• Lex creates programs that scan your tokens one by one.
• Yacc takes a grammar (sentence structure) and generates a
parser.
Lex: lexical rules -> yylex(), which scans the input token by token
Yacc: grammar rules -> yyparse(), which produces the parsed input
- By Jaydeep Patil AISSMS's IOIT Pune
105
Automatic construction of parsers
(YACC), YACC specifications.
• Lex and Yacc generate C code for your analyzer & parser.
(Figure: lexical rules go into Lex, which generates C code for yylex(); grammar rules go into Yacc, which generates C code for yyparse(). The char stream feeds the lexical analyzer (tokenizer), whose token stream feeds the parser, which produces the parsed input.)
- By Jaydeep Patil AISSMS's IOIT Pune
106
Automatic construction of parsers
(YACC), YACC specifications.
• Often, instead of the standard Lex and Yacc,
Flex and Bison are used:
– Flex: A fast lexical analyzer
– (GNU) Bison: A drop-in replacement for (backwards
compatible with) Yacc
• Byacc is the Berkeley implementation of Yacc (so it
is Yacc).
• Resources:
– http://en.wikipedia.org/wiki/Flex_lexical_analyser
– http://en.wikipedia.org/wiki/GNU_Bison
• The Lex & Yacc Page (manuals, links):
– http://dinosaur.compilertools.net/
- By Jaydeep Patil AISSMS's IOIT Pune
107
Automatic construction of parsers
(YACC), YACC specifications.
• Yacc is not a new tool, and yet, it is still used in many
projects.
• Yacc syntax is similar to Lex/Flex at the top level.
• Lex/Flex rules were regular expression – action pairs.
• Yacc rules are grammar rule – action pairs.
declarations
%%
rules
%%
programs
- By Jaydeep Patil AISSMS's IOIT Pune
108- By Jaydeep Patil AISSMS's IOIT Pune
• Declaration Section
• There are two sections in the declarations part of a Yacc program;
both are optional. In the first section, we put ordinary C
declarations, delimited by %{ and %}. Here we place declarations of
any temporaries used by the translation rules or procedures of the
second and third sections.
Ex. #include <ctype.h>
• The C preprocessor is used to include the standard header file <ctype.h>,
which contains the predicate isdigit.
• Also in the declarations part are declarations of grammar tokens.
• %token DIGIT
• Tokens declared in this section can then be used in the second and
third parts of the Yacc specification. If Lex is used to create the
lexical analyzer that passes tokens to the Yacc parser, then these
token declarations are also made available to the analyzer
generated by Lex.
109- By Jaydeep Patil AISSMS's IOIT Pune
• The Translation Rules Part
• In the part of the Yacc specification after the
first %% pair, we put the translation rules .
Each rule consists of a grammar production
and the associated semantic action. A set of
productions that we have been writing:
110- By Jaydeep Patil AISSMS's IOIT Pune
• In a Yacc production, unquoted strings of letters
and digits not declared to be tokens are taken to
be non-terminals. A quoted single character, e.g.
'c', is taken to be the terminal symbol c, as well
as the integer code for the token represented by
that character (i.e., Lex would return the
character code for 'c' to the parser, as an
integer). Alternative bodies can be separated by
a vertical bar, and a semicolon follows each head
with its alternatives and their semantic actions.
The first head is taken to be the start symbol.
111- By Jaydeep Patil AISSMS's IOIT Pune
• A Yacc semantic action is a sequence of C statements. In a
semantic action, the symbol $$ refers to the attribute value
associated with the nonterminal of the head, while $i refers to
the value associated with the ith grammar symbol (terminal or
nonterminal) of the body. The semantic action is performed
whenever we reduce by the associated production, so
normally the semantic action computes a value for $$ in
terms of the $i's. In the Yacc specification, we have written
the two E-productions
112- By Jaydeep Patil AISSMS's IOIT Pune
• Note that the nonterminal term in the first production is
the third grammar symbol of the body, while + is the
second. The semantic action associated with the first
production adds the value of the expr and the term of the
body and assigns the result as the value for the
nonterminal expr of the head. We have omitted the
semantic action for the second production altogether, since
copying the value is the default action for productions with
a single grammar symbol in the body. In general, { $$ = $ 1 ;
} is the default semantic action. Notice that we have added
a new starting production
• line : expr ' n ' { print f C " %dn" , $ 1 ) ; }
• to the Yacc specification. This production says that an input
to the desk calculator is to be an expression followed by a
newline character. The semantic action associated with this
production prints the decimal value of the expression
followed by a newline character.
113- By Jaydeep Patil AISSMS's IOIT Pune
• The Supporting C-Routines Part
• The third part of a Yacc specification consists of supporting C-
routines. A lexical analyzer by the name yylex() must be provided.
Using Lex to produce yylex() is a common choice; The lexical
analyzer yylex() produces tokens consisting of a token name and its
associated attribute value. If a token name such as DIGIT is
returned, the token name must be declared in the first section of
the Yacc specification.
• The attribute value associated with a token is communicated to the
parser through a Yacc-defined variable yylval. It reads input
characters one at a time using the C-function getchar() . If the
character is a digit, the value of the digit is stored in the variable
yylval, and the token name DIGIT is returned. Otherwise, the
character itself is returned as the token name.
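• Putting the three parts together, the desk-calculator specification described
above might look roughly like the following sketch (the yyerror and main
routines are added here only so that the example is self-contained):
%{
#include <ctype.h>
#include <stdio.h>
%}
%token DIGIT
%%
line   : expr '\n'        { printf("%d\n", $1); }
       ;
expr   : expr '+' term    { $$ = $1 + $3; }
       | term
       ;
term   : term '*' factor  { $$ = $1 * $3; }
       | factor
       ;
factor : '(' expr ')'     { $$ = $2; }
       | DIGIT
       ;
%%
int yylex(void) {                 /* supporting C-routine: the lexical analyzer */
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;                     /* any other character is its own token */
}
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
int main(void) { return yyparse(); }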
114- By Jaydeep Patil AISSMS's IOIT Pune
115- By Jaydeep Patil AISSMS's IOIT Pune
yacc –d bas.y # create y.tab.h, y.tab.c
lex bas.l # create lex.yy.c
cc lex.yy.c y.tab.c –o bas.exe # compile/link
116- By Jaydeep Patil AISSMS's IOIT Pune
• Yacc reads the grammar descriptions in bas.y and generates a
syntax analyzer (parser), that includes function yyparse, in file
y.tab.c. The –d option causes yacc to generate definitions for
tokens and place them in file y.tab.h. Lex reads the pattern
descriptions in bas.l, includes file y.tab.h, and generates a
lexical analyzer, that includes function yylex, in file lex.yy.c.
• Finally, the lexer and parser are compiled and linked together
to create executable bas.exe. From main we call yyparse to
run the compiler. Function yyparse automatically calls yylex
to obtain each token.
117- By Jaydeep Patil AISSMS's IOIT Pune
• %token INTEGER
• This definition declares an INTEGER token. Yacc generates a parser in file y.tab.c
and an include file, y.tab.h:
• #ifndef YYSTYPE
• #define YYSTYPE int
• #endif
• #define INTEGER 258
• extern YYSTYPE yylval;
• Lex includes this file and utilizes the definitions for token values. To obtain tokens
yacc calls yylex. Function yylex has a return type of int that returns a token. Values
associated with the token are returned by lex in variable yylval. For example,
• [0-9]+ { yylval = atoi(yytext); return INTEGER; }
• would store the value of the integer in yylval, and return token INTEGER to yacc.
The type of yylval is determined by YYSTYPE. Since the default type is integer this
works well in this case. Token values 0-255 are reserved for character values. For
example, if you had a rule such as
• [-+] return *yytext; /* return operator */
• the character value for minus or plus is returned. Note that we placed the minus
sign first so that it wouldn’t be mistaken for a range designator. Generated token
values typically start around 258 because lex reserves several values for end-of-file
and error processing.
118- By Jaydeep Patil AISSMS's IOIT Pune
• By default yylval is of type int, but you can override that from the
YACC file by re#defining YYSTYPE.
• The Lexer needs to be able to access yylval. In order to do so, it
must be declared in the scope of the lexer as an extern variable.
The original YACC neglects to do this for you, so you should add the
following to your lexer, just beneath
• #include <y.tab.h>:
• extern YYSTYPE yylval;
• Bison does this for you automatically.
• #ifndef checks whether the given token has been #defined earlier
in the file or in an included file; if not, it includes the code between
it and the closing #else or, if no #else is present, #endif statement.
119- By Jaydeep Patil AISSMS's IOIT Pune
120- By Jaydeep Patil AISSMS's IOIT Pune
• Internally yacc maintains two stacks in memory; a
parse stack and a value stack. The parse stack
contains terminals and nonterminals that
represent the current parsing state. The value
stack is an array of YYSTYPE elements and
associates a value with each element in the parse
stack. For example when lex returns an INTEGER
token yacc shifts this token to the parse stack. At
the same time the corresponding yylval is shifted
to the value stack. The parse and value stacks are
always synchronized so finding a value related to
a token on the stack is easily accomplished.
121- By Jaydeep Patil AISSMS's IOIT Pune
122- By Jaydeep Patil AISSMS's IOIT Pune
• The left-hand side of a production, or nonterminal, is entered left-justified
and followed by a colon. This is followed by the right-hand side of the
production. Actions associated with a rule are entered in braces.
• With left-recursion, we have specified that a program consists of zero or
more expressions. Each expression terminates with a newline. When a
newline is detected we print the value of the expression. When we apply
the rule
• expr: expr '+' expr { $$ = $1 + $3; }
• we replace the right-hand side of the production in the parse stack with
the left-hand side of the same production. In this case we pop “expr '+'
expr” and push “expr”. We have reduced the stack by popping three terms
off the stack and pushing back one term. We may reference positions in
the value stack in our C code by specifying “$1” for the first term on the
right-hand side of the production, “$2” for the second, and so on. “$$”
designates the top of the stack after reduction has taken place. The above
action adds the value associated with two expressions, pops three terms
off the value stack, and pushes back a single sum. As a consequence the
parse and value stacks remain synchronized.
123- By Jaydeep Patil AISSMS's IOIT Pune
• Numeric values are initially entered on the stack when we
reduce from INTEGER to expr. After INTEGER is shifted to
the stack we apply the rule
• expr: INTEGER { $$ = $1; }
• The INTEGER token is popped off the parse stack followed
by a push of expr. For the value stack we pop the integer
value off the stack and then push it back on again. In other
words we do nothing. In fact this is the default action and
need not be specified. Finally, when a newline is
encountered, the value associated with expr is printed.
• In the event of syntax errors yacc calls the user-supplied
function yyerror. If you need to modify the interface to
yyerror then alter the canned file that yacc includes to fit
your needs. The last function in our yacc specification is
main. This example still has an ambiguous grammar.
Although yacc will issue shift-reduce warnings it will still
process the grammar using shift as the default operation.
124- By Jaydeep Patil AISSMS's IOIT Pune
• The lexical analyzer returns VARIABLE and INTEGER tokens. For variables yylval
specifies an index to the symbol table sym. For this program sym merely holds the
value of the associated variable. When INTEGER tokens are returned, yylval
contains the number scanned.
125- By Jaydeep Patil AISSMS's IOIT Pune
• The input specification for yacc follows. The
tokens for INTEGER and VARIABLE are utilized by
yacc to create #defines in y.tab.h for use in lex.
This is followed by definitions for the arithmetic
operators. We may specify %left, for left-
associative or %right for right associative. The last
definition listed has the highest precedence.
Consequently multiplication and division have
higher precedence than addition and subtraction.
All four operators are left-associative. Using this
simple technique we are able to disambiguate
our grammar.
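• In the Yacc declarations part this might be written as the following sketch
(the token names are the ones mentioned above):
%token INTEGER
%token VARIABLE
%left '+' '-'
%left '*' '/'
• Because %left '*' '/' is listed last, * and / have higher precedence than
+ and -, and all four operators are left-associative.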
126- By Jaydeep Patil AISSMS's IOIT Pune
127- By Jaydeep Patil AISSMS's IOIT Pune
• extern void *malloc();
• malloc accepts an argument of type size_t,
and size_t may be defined as unsigned long. If
you are passing ints (or even unsigned ints),
malloc may be receiving garbage (or similarly
if you are passing a long but size_t is int).
128- By Jaydeep Patil AISSMS's IOIT Pune
Semantic Analysis
129- By Jaydeep Patil AISSMS's IOIT Pune
Beyond syntax analysis
•An identifier named x has been recognized.
–Is x a scalar, array or function?
–How big is x?
–If x is a function, how many and what type of arguments does it take?
–Is x declared before being used?
–Where can x be stored?
–Is the expression x+y type-consistent?
•Semantic analysis is the phase where we collect information about the types
of expressions and check for type related errors.
•The more information we can collect at compile time, the less overhead we
have at run time.
130- By Jaydeep Patil AISSMS's IOIT Pune
Semantic Analysis
•The syntax of a programming language
describes the proper form of its programs,
•while the semantics of the language defines
what its programs mean; that is, what each
program does when it executes.
131- By Jaydeep Patil AISSMS's IOIT Pune
Semantic analysis
•Collecting type information may involve "computations"
–What is the type of x+y given the types of x and y?
•Tool: attribute grammars
–Each grammar symbol has a number of associated attributes:
–The type of a variable or expression
–The value of a variable or expression
–The code for a statement
–Etc.
–The grammar is augmented with special equations (called semantic
actions) that specify how the values of attributes are computed from other
attributes.
–The process of using semantic actions to evaluate attributes is called
syntax-directed translation.
132- By Jaydeep Patil AISSMS's IOIT Pune
•TYPE Checking
133- By Jaydeep Patil AISSMS's IOIT Pune
•A compiler must check that the source program
follows both the syntactic and semantic conventions of the
source language.
•This checking, called static checking (to distinguish it
from dynamic checking during execution of the target
program), ensures that certain kinds of programming
errors will be detected and reported.
134- By Jaydeep Patil AISSMS's IOIT Pune
•Example of Static Checks:
–Type Checks: A compiler should report an error if an operator
is applied to an incompatible operand.
–Flow of Control Checks: Statements that cause flow of
control to leave a construct must have some place to which
to transfer the flow of control.
–Uniqueness Check: There are some situations in which an
object must be defined only once.
–Name-Related Check: Sometimes the same name must
appear two or more times. Ex. in Ada, a loop or block may
have a name that appears at the beginning and end of the
construct. The compiler must check that the same name is used at
both places.
135- By Jaydeep Patil AISSMS's IOIT Pune
Type Checking
•TYPE CHECKING is the main activity in semantic
analysis.
•Goal: calculate and ensure consistency of the type of
every expression in a program
•If there are type errors, we need to notify the user.
•Otherwise, we need the type information to generate
code that is correct.
136- By Jaydeep Patil AISSMS's IOIT Pune
137
Type Systems and Type
Expressions
137- By Jaydeep Patil AISSMS's IOIT Pune
Type systems
•Every language has a set of types and rules for
assigning types to language constructs.
•Example from the C specification:
–“The result of the unary & operator is a pointer to the
object referred to by the operand. If the type of the operand
is ‘…’ then the type of the result is ‘pointer to …’
•Usually, every expression has a type.
•Types have structure: the type 'pointer to int' is
•CONSTRUCTED from the type 'int'
138- By Jaydeep Patil AISSMS's IOIT Pune
Basic vs. constructed types
•Most programming languages have basic and
constructed types.
•BASIC TYPES are the atomic types provided by the
language.
–Pascal: boolean, character, integer, real
–C: char, int, float, double
•CONSTRUCTED TYPES are built up from basic types.
–Pascal: arrays, records, sets, pointers
–C: arrays, structs, pointers
139- By Jaydeep Patil AISSMS's IOIT Pune
Type expressions
•We denote the type of language constructs with TYPE
EXPRESSIONS.
•Type expressions are built up with TYPE
CONSTRUCTORS.
1.A basic type is a type expression. The basic types are
boolean, char, integer, and real. The special basic type
type_error signifies an error. The special type void
signifies “no type”
2.A type name is a type expression (type names are like
typedefs in C)
140- By Jaydeep Patil AISSMS's IOIT Pune
Type expressions
1.A type constructor applied to type expressions is a type expression.
a.Arrays: if T is a type expression and I is an index set, then array(I, T) is a type
expression denoting the type "array of elements of type T, indexed by I"
b.Products: if T1 and T2 are type expressions, then their Cartesian product T1 ×
T2 is also a type expression.
c.Records: a record is a special kind of product in which the fields have names
(examples below)
d.Pointers: if T is a type expression, then pointer(T) is a type expression denoting
the type "pointer to an object of type T"
e.Functions: functions map elements of a domain D to a range R, so we write D ->
R to denote "function mapping objects of type D to objects of type R" (examples
below)
2.Type expressions may contain variables, whose values are themselves type
expressions → polymorphism
141- By Jaydeep Patil AISSMS's IOIT Pune
Record type expressions
•The Pascal code
• type row = record
• address: integer;
• lexeme: array[1..15] of char
• end;
• var table: array[1..10] of row;
•associates the type expression
•record((address × integer) × (lexeme × array(1..15, char)))
•with the type name row, and the type expression
•array(1..10, record((address × integer) × (lexeme × array(1..15, char))))
•with the variable table
142- By Jaydeep Patil AISSMS's IOIT Pune
Function type expressions
•The C declaration
•int *foo( char a, char b );
•would associate type expression
•char × char -> pointer(integer)
•with foo. Some languages (like ML) allow all sorts of
crazy function types, e.g.
• (integer -> integer) -> (integer -> integer)
•denotes functions taking a function as input and
returning another function
143- By Jaydeep Patil AISSMS's IOIT Pune
Graph representation of type expressions
•The recursive structure of a type can be represented
with a tree, e.g. for char × char -> pointer(integer):
•Some compilers explicitly use graphs like these to
represent the types of expressions.
144- By Jaydeep Patil AISSMS's IOIT Pune
Type systems and checkers
•A TYPE SYSTEM is a set of rules for assigning
type expressions to the parts of a program.
•Every type checker implements some type
system.
•Syntax-directed type checking is a simple
method to implement a type checker.
145- By Jaydeep Patil AISSMS's IOIT Pune
Static vs. dynamic type checking
•STATIC type checking is done at compile time.
•DYNAMIC type checking is done at run time.
•Any kind of type checking CAN be done at run time.
•But this reduces run-time efficiency, so we want to do static
checking when possible.
•A SOUND type system is one in which ALL type errors can be
found statically.
•If the compiler guarantees that every program it accepts will run
without type errors, then the language is STRONGLY TYPED.
146- By Jaydeep Patil AISSMS's IOIT Pune
147
An Example Type Checker
147- By Jaydeep Patil AISSMS's IOIT Pune
Example type checker
•Let’s build a translation scheme to synthesize
the type of every expression from its
subexpressions.
•Here is a Pascal-like grammar for a sequence of
declarations (D) followed by an expression (E)
•Example program: key: integer;
• key mod 1999
P → D ; E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ↑ T
E → literal | num | id | E mod E | E [ E ] | E ↑
148- By Jaydeep Patil AISSMS's IOIT Pune
The type system
•The basic types are char and integer.
•type_error signals an error.
•All arrays start at 1, so
•array[256] of char
•leads to type expression: array(1..256,char)
•The symbol ↑ in a declaration specifies a pointer
type,
•so
• ↑ integer
•leads to type expression: pointer(integer)
149- By Jaydeep Patil AISSMS's IOIT Pune
Translation scheme for
declarations
•P → D ; E
•D → D ; D
•D → id : T { addtype(id.entry, T.type) }
•T → char { T.type := char }
•T → integer { T.type := integer }
•T → ↑T1 { T.type := pointer(T1.type) }
•T → array [ num ] of T1
• { T.type := array(1 .. num.val, T1.type) }
150- By Jaydeep Patil AISSMS's IOIT Pune
Type checking for expressions
•E → literal { E.type := char }
•E → num { E.type := integer }
•E → id { E.type := lookup(id.entry) }
•E → E1 mod E2 { if E1.type =integer and E2.type = integer
• then E.type := integer
• else E.type := type_error }
•E → E1 [ E2 ] { if E2.type = integer and E1.type = array(s,t)
• then E.type := t else E.type := type_error }
•E → E1↑ { if E1.type = pointer(t)
• then E.type := t else E.type := type-error }
Once the identifiers and their types have been inserted into the symbol table, we can
check the type of the elements of an expression:
151- By Jaydeep Patil AISSMS's IOIT Pune
How about boolean types?
•Try adding
• T -> boolean
• Relational operators: < <= = >= > <>
• Logical connectives: and or not
to the grammar, then add appropriate type-checking
semantic actions.
152- By Jaydeep Patil AISSMS's IOIT Pune
Type checking for statements
•Usually we assign the type VOID to statements.
•If a type error is found during type checking,
though, we should set the type to type_error
•Let’s change our grammar allow statements:
• P → D ; S
•i.e., a program is a sequence of declarations
followed by a sequence of statements.
153- By Jaydeep Patil AISSMS's IOIT Pune
Type checking for statements
•S → id := E { if id.type = E.type then S.type := void
• else S.type := type_error }
•S → if E then S1 { if E.type = boolean
• then S.type := S1.type
• else S.type := type_error }
•S → while E do S1 { if E.type = boolean
• then S.type := S1.type
• else S.type := type_error }
•S → S1 ; S2 { if S1.type = void and S2.type = void
• then S.type := void
• else S.type := type_error }
Now we need to add productions and semantic actions:
154- By Jaydeep Patil AISSMS's IOIT Pune
Type checking for function calls
•Suppose we add a production E → E ( E )
•Then we need productions for function declarations:
T → T1 '→' T2 { T.type := T1.type → T2.type }
•and function calls:
E → E1 ( E2 ) { if E2.type = s and E1.type = s → t
 then E.type := t
 else E.type := type_error }
155- By Jaydeep Patil AISSMS's IOIT Pune
Type checking for function calls
•Multiple-argument functions, however, can be
modeled as functions that take a single PRODUCT
argument.
• root : ( real → real ) × real → real
•This would model a function that takes a real function
•over the reals, and a real, and returns a real.
•In C: float root( float (*f)(float), float x );
156- By Jaydeep Patil AISSMS's IOIT Pune
Type expression equivalence
•Type checkers need to ask questions like:
• – “if E1.type == E2.type, then …”
•What does it mean for two type expressions to be
equal?
•STRUCTURAL EQUIVALENCE says two types are the
same if they are made up of the same basic types and
constructors.
•NAME EQUIVALENCE says two types are the same if
their constituents have the SAME NAMES.
157- By Jaydeep Patil AISSMS's IOIT Pune
Structural Equivalence
•boolean sequiv( s, t )
•{
• if s and t are the same basic type
• return TRUE;
• else if s == array( s1, s2 ) and t == array( t1, t2 )
• return sequiv( s1, t1 ) and sequiv( s2, t2 )
• else if s == s1 × s2 and t == t1 × t2
• return sequiv( s1, t1 ) and sequiv( s2, t2 )
• else if s == pointer( s1 ) and t == pointer( t1 )
• return sequiv( s1, t1 )
• else if s == s1 → s2 and t == t1 → t2
• return sequiv( s1, t1 ) and sequiv( s2, t2 )
• else return FALSE
•}
158- By Jaydeep Patil AISSMS's IOIT Pune
Relaxing structural equivalence
•We don’t always want strict structural equivalence.
•E.g. for arrays, we want to write functions that accept
arrays of any length.
•To accomplish this, we would modify sequiv() to
accept any bounds:
• …
• else if s == array( s1, s2 ) and t == array( t1, t2 )
• return sequiv( s2, t2 )
• …
159- By Jaydeep Patil AISSMS's IOIT Pune
Encoding types
•Recursive routines are very slow.
•Recursive type checking routines increase the
compiler’s run time.
•In the compilers of the 1970’s and 1980’s,
compilers took too long to run.
•So designers came up with ENCODINGS for
types that allowed for faster type checking.
160- By Jaydeep Patil AISSMS's IOIT Pune
Name equivalence
•Most languages allow association of names with type expressions. This
makes type equivalence trickier.
•Example from Pascal:
• type link = ↑cell;
• var next: link;
• last: link;
• p: ↑ cell;
• q,r: ↑ cell;
•Do next, last, p, q, and r have the same type?
•In Pascal, it depends on the implementation!
•In structural equivalence, the types would be the same.
•But NAME EQUIVALENCE requires identical NAMES.
161- By Jaydeep Patil AISSMS's IOIT Pune
Handling cyclic types
•Suppose we had the Pascal declaration
• type link = ↑cell;
• cell = record
• info: integer;
• next: link;
• end;
•The declaration of cell contains itself (via the next
pointer).
•The graph for this type therefore contains a cycle.
162- By Jaydeep Patil AISSMS's IOIT Pune
Cyclic types
•The situation in C is slightly different, since it is
impossible to refer to an undeclared name.
• typedef struct _cell {
• int info;
• struct _cell *next;
• } cell;
• typedef cell *link;
•But the name link is just shorthand for
• (struct _cell *).
•C uses name equivalence for structs to avoid recursion
•(after expanding typedef’s).
•But it uses structural equivalence elsewhere.
163- By Jaydeep Patil AISSMS's IOIT Pune
Type conversion
•Suppose we encounter an expression x+i where x has type float and i has
type int. CPU instructions for addition could take EITHER float OR int as
operands, but not a mix.
•This means the compiler must sometimes convert the operands of
arithmetic expressions to ensure that operands are consistent with operators.
•With postfix as an intermediate language for expressions, we could express
the conversion as follows:
x i inttoreal real+
•where real+ is the floating point addition operation.
164- By Jaydeep Patil AISSMS's IOIT Pune
Type coercion
•If type conversion is done by the compiler without the
programmer requesting it, it is called IMPLICIT
conversion or type COERCION.
•EXPLICIT conversions are those that the programmer
• specifies, e.g.
• x = (int)y * 2;
•Implicit conversion of CONSTANT expressions should
be done at compile time.
165- By Jaydeep Patil AISSMS's IOIT Pune
Type checking example with coercion
•Production Semantic Rule
•E -> num E.type := integer
•E -> num . num E.type := real
•E -> id E.type := lookup( id.entry )
•E -> E1 op E2 E.type := if E1.type == integer and E2.type == integer
• then integer
• else if E1.type == integer and E2.type == real
• then real
• else if E1.type == real and E2.type == integer
• then real
• else if E1.type == real and E2.type == real
• then real
• else type_error
166- By Jaydeep Patil AISSMS's IOIT Pune
END of Unit 2
167- By Jaydeep Patil AISSMS's IOIT Pune
Department of Environment (DOE) Mix Design with Fly Ash.Department of Environment (DOE) Mix Design with Fly Ash.
Department of Environment (DOE) Mix Design with Fly Ash.
MdManikurRahman
 
DIY Gesture Control ESP32 LiteWing Drone using Python
DIY Gesture Control ESP32 LiteWing Drone using  PythonDIY Gesture Control ESP32 LiteWing Drone using  Python
DIY Gesture Control ESP32 LiteWing Drone using Python
CircuitDigest
 
Application Security and Secure Software Development Lifecycle
Application  Security and Secure Software Development LifecycleApplication  Security and Secure Software Development Lifecycle
Application Security and Secure Software Development Lifecycle
DrKavithaP1
 
ISO 5011 Air Filter Catalogues .pdf
ISO 5011 Air Filter Catalogues      .pdfISO 5011 Air Filter Catalogues      .pdf
ISO 5011 Air Filter Catalogues .pdf
FILTRATION ENGINEERING & CUNSULTANT
 
BEC602-Module-3-1_Notes.pdf. Vlsi design and testing notes
BEC602-Module-3-1_Notes.pdf. Vlsi design and testing notesBEC602-Module-3-1_Notes.pdf. Vlsi design and testing notes
BEC602-Module-3-1_Notes.pdf. Vlsi design and testing notes
VarshithaP6
 
Unit 6 Message Digest Message Digest Message Digest
Unit 6  Message Digest  Message Digest  Message DigestUnit 6  Message Digest  Message Digest  Message Digest
Unit 6 Message Digest Message Digest Message Digest
ChatanBawankar
 
Tesia Dobrydnia - A Leader In Her Industry
Tesia Dobrydnia - A Leader In Her IndustryTesia Dobrydnia - A Leader In Her Industry
Tesia Dobrydnia - A Leader In Her Industry
Tesia Dobrydnia
 
Introduction of Structural Audit and Health Montoring.pptx
Introduction of Structural Audit and Health Montoring.pptxIntroduction of Structural Audit and Health Montoring.pptx
Introduction of Structural Audit and Health Montoring.pptx
gunjalsachin
 
Kevin Corke Spouse Revealed A Deep Dive Into His Private Life.pdf
Kevin Corke Spouse Revealed A Deep Dive Into His Private Life.pdfKevin Corke Spouse Revealed A Deep Dive Into His Private Life.pdf
Kevin Corke Spouse Revealed A Deep Dive Into His Private Life.pdf
Medicoz Clinic
 
Introduction to Machine Vision by Cognex
Introduction to Machine Vision by CognexIntroduction to Machine Vision by Cognex
Introduction to Machine Vision by Cognex
RicardoCunha203173
 
BEC602- Module 3-2-Notes.pdf.Vlsi design and testing notes
BEC602- Module 3-2-Notes.pdf.Vlsi design and testing notesBEC602- Module 3-2-Notes.pdf.Vlsi design and testing notes
BEC602- Module 3-2-Notes.pdf.Vlsi design and testing notes
VarshithaP6
 
"The Enigmas of the Riemann Hypothesis" by Julio Chai
"The Enigmas of the Riemann Hypothesis" by Julio Chai"The Enigmas of the Riemann Hypothesis" by Julio Chai
"The Enigmas of the Riemann Hypothesis" by Julio Chai
Julio Chai
 
Forensic Science – Digital Forensics – Digital Evidence – The Digital Forensi...
Forensic Science – Digital Forensics – Digital Evidence – The Digital Forensi...Forensic Science – Digital Forensics – Digital Evidence – The Digital Forensi...
Forensic Science – Digital Forensics – Digital Evidence – The Digital Forensi...
ManiMaran230751
 
UNIT-5-PPT Computer Control Power of Power System
UNIT-5-PPT Computer Control Power of Power SystemUNIT-5-PPT Computer Control Power of Power System
UNIT-5-PPT Computer Control Power of Power System
Sridhar191373
 
world subdivision.pdf...................
world subdivision.pdf...................world subdivision.pdf...................
world subdivision.pdf...................
bmmederos12
 
하이플럭스 락피팅 카달로그 2025 (Lok Fitting Catalog 2025)
하이플럭스 락피팅 카달로그 2025 (Lok Fitting Catalog 2025)하이플럭스 락피팅 카달로그 2025 (Lok Fitting Catalog 2025)
하이플럭스 락피팅 카달로그 2025 (Lok Fitting Catalog 2025)
하이플럭스 / HIFLUX Co., Ltd.
 
UNIT-4-PPT UNIT COMMITMENT AND ECONOMIC DISPATCH
UNIT-4-PPT UNIT COMMITMENT AND ECONOMIC DISPATCHUNIT-4-PPT UNIT COMMITMENT AND ECONOMIC DISPATCH
UNIT-4-PPT UNIT COMMITMENT AND ECONOMIC DISPATCH
Sridhar191373
 
Department of Environment (DOE) Mix Design with Fly Ash.
Department of Environment (DOE) Mix Design with Fly Ash.Department of Environment (DOE) Mix Design with Fly Ash.
Department of Environment (DOE) Mix Design with Fly Ash.
MdManikurRahman
 
DIY Gesture Control ESP32 LiteWing Drone using Python
DIY Gesture Control ESP32 LiteWing Drone using  PythonDIY Gesture Control ESP32 LiteWing Drone using  Python
DIY Gesture Control ESP32 LiteWing Drone using Python
CircuitDigest
 
Application Security and Secure Software Development Lifecycle
Application  Security and Secure Software Development LifecycleApplication  Security and Secure Software Development Lifecycle
Application Security and Secure Software Development Lifecycle
DrKavithaP1
 
BEC602-Module-3-1_Notes.pdf. Vlsi design and testing notes
BEC602-Module-3-1_Notes.pdf. Vlsi design and testing notesBEC602-Module-3-1_Notes.pdf. Vlsi design and testing notes
BEC602-Module-3-1_Notes.pdf. Vlsi design and testing notes
VarshithaP6
 

Compiler: Syntax Analysis

• 17. Recursive-Descent Parsing Consider the grammar S -> cAd, A -> ab | a and the input w = cad. The parser expands S -> cAd, tries A -> ab first, fails to match d, backtracks, and then succeeds with A -> a, giving S -> cAd -> cad (a small code sketch follows). 18- By Jaydeep Patil AISSMS's IOIT Pune
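To make the backtracking concrete, here is a minimal C sketch (my own illustration, not code from the slides) of a backtracking recursive-descent parser for S -> cAd, A -> ab | a; the names match, input and pos are assumptions of this sketch.

    #include <stdio.h>

    static const char *input;   /* string being parsed           */
    static int pos;             /* current position in the input */

    static int match(char c) {                    /* consume one terminal */
        if (input[pos] == c) { pos++; return 1; }
        return 0;
    }

    static int A(void) {                          /* A -> a b | a */
        int save = pos;
        if (match('a') && match('b')) return 1;   /* try A -> a b    */
        pos = save;                               /* backtrack       */
        return match('a');                        /* then try A -> a */
    }

    static int S(void) {                          /* S -> c A d */
        return match('c') && A() && match('d');
    }

    int main(void) {
        input = "cad";
        pos = 0;
        printf("%s\n", (S() && input[pos] == '\0') ? "accepted" : "rejected");
        return 0;
    }

On the input cad the first alternative A -> ab consumes a, fails on b, the position is restored, and A -> a succeeds, which is exactly the backtracking step described above.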
  • 18. Recursive-Descent Parsing • A left-recursive grammar can cause a recursive-descent parser, even one with backtracking, to go into an infinite loop. That is, when we try to expand a nonterminal A, we may eventually find ourselves again trying to expand A without having consumed any input. 19- By Jaydeep Patil AISSMS's IOIT Pune
• 19. FIRST and FOLLOW • The construction of both top-down and bottom-up parsers is aided by two functions, FIRST and FOLLOW, associated with a grammar G. During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the next input symbol. During panic-mode error recovery, sets of tokens produced by FOLLOW can be used as synchronizing tokens. 20- By Jaydeep Patil AISSMS's IOIT Pune
• 20. FIRST • FIRST is a function that gives the set of terminals that can begin a string derived from a grammar symbol. • Rules: 1. If x is a terminal then First(x) = {x}. 2. If X -> ε is a production, then add ε to First(X). 3. If X is a nonterminal: 3.a) If X -> Y…, then First(Y) is included in First(X). 3.b) If First(Y) contains ε and X -> YZ, then First(X) = (First(Y) - {ε}) ∪ First(Z). 21- By Jaydeep Patil AISSMS's IOIT Pune
• 21. Follow • FOLLOW is a function which gives the set of terminals that can appear immediately to the right of a given symbol in some sentential form. • Rules: 1. Place $ in Follow(S), where S is the start symbol. 2. If there is a production A -> αBβ, then everything in First(β) except ε is in Follow(B). 3. If there is a production A -> αB, or a production A -> αBβ where First(β) contains ε, then everything in Follow(A) is in Follow(B). 22- By Jaydeep Patil AISSMS's IOIT Pune
• 22. FIRST • E->TE’ • E’->+TE’ | ε • T->FT’ • T’->*FT’ | ε • F->(E) | id First(E) = First(T) = First(F) = { (, id } First(E’) = { +, ε } First(T’) = { *, ε } 23- By Jaydeep Patil AISSMS's IOIT Pune
• 23. S -> iEtSS’ | a S’ -> eS | ε E -> b First(S) = {i, a} First(S’) = {e, ε} First(E) = {b} 24- By Jaydeep Patil AISSMS's IOIT Pune
• 26. S -> AaAb | BbBa A -> ε B -> ε First(A) = {ε} First(B) = {ε} First(S) = (First(A) - {ε}) ∪ First(a) ∪ (First(B) - {ε}) ∪ First(b) = {a, b} 27- By Jaydeep Patil AISSMS's IOIT Pune
• 27. S -> aBbDh B -> cC C -> bc | ε D -> EF E -> g | ε F -> f | ε First(S) = {a} First(B) = {c} First(C) = {b, ε} First(D) = (First(E) - {ε}) ∪ First(F) = {g, f, ε} First(E) = {g, ε} First(F) = {f, ε} 28- By Jaydeep Patil AISSMS's IOIT Pune
• 28. FIRST & FOLLOW • E->TE’ • E’->+TE’ | ε • T->FT’ • T’->*FT’ | ε • F->(E) | id First(E) = First(T) = First(F) = { (, id } First(E’) = { +, ε } First(T’) = { *, ε } Follow(E) = { $, ) } Follow(E’) = { $, ) } Follow(T) = (First(E’) - {ε}) ∪ Follow(E’) = { +, $, ) } Follow(T’) = { +, $, ) } Follow(F) = (First(T’) - {ε}) ∪ Follow(T’) = { *, +, $, ) } 29- By Jaydeep Patil AISSMS's IOIT Pune
• 29. S -> iEtSS’ | a S’ -> eS | ε E -> b First(S) = {i, a} First(S’) = {e, ε} First(E) = {b} Follow(S) = {$, e} Follow(S’) = {$, e} Follow(E) = {t} 30- By Jaydeep Patil AISSMS's IOIT Pune
• 32. S -> AaAb | BbBa A -> ε B -> ε First(A) = {ε} First(B) = {ε} First(S) = (First(A) - {ε}) ∪ First(a) ∪ (First(B) - {ε}) ∪ First(b) = {a, b} Follow(S) = {$} Follow(A) = {a, b} Follow(B) = {b, a} 33- By Jaydeep Patil AISSMS's IOIT Pune
• 33. S -> aBbDh B -> cC C -> bc | ε D -> EF E -> g | ε F -> f | ε First(S) = {a} First(B) = {c} First(C) = {b, ε} First(D) = (First(E) - {ε}) ∪ First(F) = {g, f, ε} First(E) = {g, ε} First(F) = {f, ε} Follow(S) = {$} Follow(B) = {b} Follow(C) = {b} Follow(D) = {h} Follow(E) = {f, h} Follow(F) = {h} 34- By Jaydeep Patil AISSMS's IOIT Pune
• 34. S -> aBDh B -> cC C -> bc | ε D -> EF E -> g | ε F -> f | ε First(S) = {a} First(B) = {c} First(C) = {b, ε} First(D) = (First(E) - {ε}) ∪ First(F) = {g, f, ε} First(E) = {g, ε} First(F) = {f, ε} Follow(S) = {$} Follow(B) = {g, f, h} Follow(C) = {g, f, h} Follow(D) = {h} Follow(E) = {f, h} Follow(F) = {h} 35- By Jaydeep Patil AISSMS's IOIT Pune
• 35. • E->TA • A->+TA | ε • T->FB • B->*FB | ε • F->(E) | id First(E) = First(T) = First(F) = { (, id } First(A) = { +, ε } First(B) = { *, ε } Follow(E) = { $, ) } Follow(A) = { $, ) } Follow(T) = (First(A) - {ε}) ∪ Follow(A) = { +, $, ) } Follow(B) = { +, $, ) } Follow(F) = (First(B) - {ε}) ∪ Follow(B) = { *, +, $, ) } 36- By Jaydeep Patil AISSMS's IOIT Pune
• 36. LL(1) Grammars • Predictive parsers, that is, recursive-descent parsers needing no backtracking, can be constructed for a class of grammars called LL(1). The first "L" in LL(1) stands for scanning the input from left to right, the second "L" for producing a leftmost derivation, and the "1" for using one input symbol of lookahead at each step to make parsing action decisions. Formally, a grammar is LL(1) if, for each pair of alternatives A -> α | β, First(α) and First(β) are disjoint, and, when β derives ε, First(α) is also disjoint from Follow(A). • The class of LL(1) grammars is rich enough to cover most programming constructs, although care is needed in writing a suitable grammar for the source language. For example, no left-recursive or ambiguous grammar can be LL(1). 37- By Jaydeep Patil AISSMS's IOIT Pune
  • 37. Predictive Parsing 38- By Jaydeep Patil AISSMS's IOIT Pune
  • 38. Grammar is LL(1)(No Multiple Entries) 39- By Jaydeep Patil AISSMS's IOIT Pune
  • 39. Grammar is Not LL(1)(Multiple Entries) 40- By Jaydeep Patil AISSMS's IOIT Pune
• 40. Nonrecursive Predictive Parsing • A nonrecursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls. The parser mimics a leftmost derivation. If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that S =>* wα. 41- By Jaydeep Patil AISSMS's IOIT Pune
  • 41. 43- By Jaydeep Patil AISSMS's IOIT Pune
  • 42. 44- By Jaydeep Patil AISSMS's IOIT Pune
• 43. • If X = a = $, the parser halts and announces successful completion of parsing. • If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol. • If X is a nonterminal, the program consults entry M[X, a] of the parsing table M. This entry will be either an X-production of the grammar or an error entry. If, for example, M[X, a] = {X -> UVW}, the parser replaces X on the top of the stack by WVU (with U on top of the stack). As output, we shall assume that the parser just prints the production used. 45- By Jaydeep Patil AISSMS's IOIT Pune
  • 44. 46- By Jaydeep Patil AISSMS's IOIT Pune
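The parsing moves just described can be sketched in C for the grammar E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id. This is an illustrative sketch only, not the slides' code: 'i' stands for the token id, 'A' for E', 'B' for T', and table() hard-codes the standard LL(1) entries for this grammar.

    #include <stdio.h>
    #include <string.h>

    /* Nonterminals: E T F, A = E', B = T'.  Terminals: i + * ( ) $. */
    static char stk[100];
    static int top = -1;
    static void push(char c) { stk[++top] = c; }

    /* M[X, a]: production body to push, "" for an epsilon production,
       NULL for an error entry.                                         */
    static const char *table(char X, char a) {
        switch (X) {
        case 'E': if (a == 'i' || a == '(') return "TA"; break;
        case 'A': if (a == '+') return "+TA";
                  if (a == ')' || a == '$') return "";
                  break;
        case 'T': if (a == 'i' || a == '(') return "FB"; break;
        case 'B': if (a == '*') return "*FB";
                  if (a == '+' || a == ')' || a == '$') return "";
                  break;
        case 'F': if (a == 'i') return "i";
                  if (a == '(') return "(E)";
                  break;
        }
        return NULL;
    }

    int main(void) {
        const char *w = "i+i*i$";                 /* id + id * id */
        int ip = 0;
        push('$'); push('E');
        while (stk[top] != '$') {
            char X = stk[top], a = w[ip];
            if (X == a) { top--; ip++; }          /* match a terminal */
            else {
                const char *body = table(X, a);
                if (!body) { printf("error at '%c'\n", a); return 1; }
                printf("%c -> %s\n", X, *body ? body : "e");
                top--;                            /* pop X, push its body reversed */
                for (int k = (int)strlen(body) - 1; k >= 0; k--) push(body[k]);
            }
        }
        puts(w[ip] == '$' ? "accepted" : "rejected");
        return 0;
    }

The productions printed by the loop form exactly the leftmost derivation of id + id * id.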
  • 45. Bottom-Up Parsing 47- By Jaydeep Patil AISSMS's IOIT Pune
  • 46. Bottom-Up Parsing • A bottom-up parse corresponds to the construction of a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top) . 48- By Jaydeep Patil AISSMS's IOIT Pune
  • 47. 49- By Jaydeep Patil AISSMS's IOIT Pune
  • 48. Reductions • We can think of bottom-up parsing as the process of "reducing" a string w to the start symbol of the grammar. At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of that production. • The key decisions during bottom-up parsing are about when to reduce and about what production to apply, as the parse proceeds. 50- By Jaydeep Patil AISSMS's IOIT Pune
• 49. • The goal of bottom-up parsing is therefore to construct a derivation in reverse. The following derivation corresponds to the parse shown in the previous figure: • E => T => T * F => T * id => F * id => id * id • This derivation is in fact a rightmost derivation. 51- By Jaydeep Patil AISSMS's IOIT Pune
  • 50. Handle • Bottom-up parsing during a left-to-right scan of the input constructs a rightmost derivation in reverse. Informally, a "handle" is a substring that matches the body of a production, and whose reduction represents one step along the reverse of a rightmost derivation. 52- By Jaydeep Patil AISSMS's IOIT Pune
  • 51. Shift-Reduce Parsing • Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols and an input buffer holds the rest of the string to be parsed. • As we shall see, the handle always appears at the top of the stack just before it is identified as the handle. • We use $ to mark the bottom of the stack and also the right end of the input. Conventionally, when discussing bottom-up parsing, we show the top of the stack on the right, rather than on the left as we did for top-down parsing. Initially, the stack is empty, and the string w is on the input, as follows: 55- By Jaydeep Patil AISSMS's IOIT Pune
  • 52. • During a left-to-right scan of the input string, the parser shifts zero or more input symbols onto the stack, until it is ready to reduce a string β of grammar symbols on top of the stack. It then reduces β to the head of the appropriate production. The parser repeats this cycle until it has detected an error or until the stack contains the start symbol and the input is empty: 56- By Jaydeep Patil AISSMS's IOIT Pune
  • 53. 57- By Jaydeep Patil AISSMS's IOIT Pune
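As a worked illustration (not taken from the slides' figure), a shift-reduce parser for E -> E+T | T, T -> T*F | F, F -> (E) | id processes id * id as follows:

    Stack        Input        Action
    $            id * id $    shift
    $ id         * id $       reduce by F -> id
    $ F          * id $       reduce by T -> F
    $ T          * id $       shift
    $ T *        id $         shift
    $ T * id     $            reduce by F -> id
    $ T * F      $            reduce by T -> T * F
    $ T          $            reduce by E -> T
    $ E          $            accept

Reading the reductions from bottom to top gives exactly the rightmost derivation shown earlier, in reverse.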
• 54. LR Parsing: Simple LR • The most prevalent type of bottom-up parser today is based on a concept called LR(k) parsing; the "L" is for left-to-right scanning of the input, the "R" for constructing a rightmost derivation in reverse, and the k for the number of input symbols of lookahead that are used in making parsing decisions. The cases k = 0 and k = 1 are of practical interest, and we shall only consider LR parsers with k <= 1 here. When (k) is omitted, k is assumed to be 1. 58- By Jaydeep Patil AISSMS's IOIT Pune
• 55. Why LR Parsers • LR parsers are table-driven, much like the nonrecursive LL parsers. A grammar for which we can construct a parsing table using one of the methods in this section and the next is said to be an LR grammar. Intuitively, for a grammar to be LR it is sufficient that a left-to-right shift-reduce parser be able to recognize handles of right-sentential forms when they appear on top of the stack. 59- By Jaydeep Patil AISSMS's IOIT Pune
  • 56. 60- By Jaydeep Patil AISSMS's IOIT Pune
  • 57. • The principal drawback of the LR method is that it is too much work to construct an LR parser by hand for a typical programming-language grammar. A specialized tool, an LR parser generator, is needed. • Fortunately, many such generators are available, and we shall discuss one of the most commonly used ones, Yacc . • Such a generator takes a context-free grammar and automatically produces a parser for that grammar. If the grammar contains ambiguities or other constructs that are difficult to parse in a left-to-right scan of the input, then the parser generator locates these constructs and provides detailed diagnostic messages. 61- By Jaydeep Patil AISSMS's IOIT Pune
• 58. Items and the LR(0) Automaton 62- By Jaydeep Patil AISSMS's IOIT Pune
• 59. Augmented grammar • If G is a grammar with start symbol S, then G’, the augmented grammar for G, is G with a new start symbol S’ and production S’ -> S. The purpose of this new starting production is to indicate to the parser when it should stop parsing and announce acceptance of the input. That is, acceptance occurs when and only when the parser is about to reduce by • S’ -> S. 63- By Jaydeep Patil AISSMS's IOIT Pune
  • 60. 64- By Jaydeep Patil AISSMS's IOIT Pune
  • 61. E’-> E E-> E+T|T T-> T*F|F F->(E)|id 65- By Jaydeep Patil AISSMS's IOIT Pune
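As a hand computation (the standard construction, independent of the figures that follow), the initial state of the LR(0) automaton for this augmented grammar is obtained by taking the closure of E' -> .E:

    I0 = closure({ E' -> .E })
       = { E' -> .E,
           E  -> .E + T,   E -> .T,
           T  -> .T * F,   T -> .F,
           F  -> .( E ),   F -> .id }

    GOTO(I0, E) = { E' -> E.,  E -> E. + T }

The remaining states shown in the following slides are built the same way, by repeatedly applying GOTO and closure.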
  • 62. 66- By Jaydeep Patil AISSMS's IOIT Pune
  • 63. 67- By Jaydeep Patil AISSMS's IOIT Pune
  • 64. 68- By Jaydeep Patil AISSMS's IOIT Pune
  • 65. 69- By Jaydeep Patil AISSMS's IOIT Pune
  • 66. 70- By Jaydeep Patil AISSMS's IOIT Pune
• 67. Canonical LR (CLR) Parsing 71- By Jaydeep Patil AISSMS's IOIT Pune
  • 68. 72- By Jaydeep Patil AISSMS's IOIT Pune
  • 69. 73- By Jaydeep Patil AISSMS's IOIT Pune
  • 70. 74- By Jaydeep Patil AISSMS's IOIT Pune
  • 71. 75- By Jaydeep Patil AISSMS's IOIT Pune
• 72. Operator Precedence Parsing • Operator Grammar: For a small but important class of grammars, we can easily construct efficient shift-reduce parsers by hand. Operator grammars have the property that no production right side is empty or has two adjacent non-terminals. 76- By Jaydeep Patil AISSMS's IOIT Pune
• 73. • Eg: E -> EAE | (E) | -E | id • A -> + | - | * | / • The above grammar is not an operator grammar (EAE has two adjacent non-terminals), but we can readjust the grammar into an equivalent operator grammar. 77- By Jaydeep Patil AISSMS's IOIT Pune
• 74. • In operator precedence parsing we define three disjoint precedence relations <·, =·, and ·> between certain pairs of terminals. These precedence relations guide the selection of handles and have the following meaning: 78- By Jaydeep Patil AISSMS's IOIT Pune
  • 75. • Basic Principle • Having precedence relations allows identifying handles as follows: • 1. Scan the string from left until seeing ·> and put a pointer. • 2. Scan backwards the string from right to left until seeing <· • 3. Everything between the two relations <· and ·> forms the handle • 4. Replace handle with the head of the production. 79- By Jaydeep Patil AISSMS's IOIT Pune
  • 76. 80- By Jaydeep Patil AISSMS's IOIT Pune
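For the operators of E -> E+E | E*E | id, with * given higher precedence than + and both operators left-associative, the usual textbook precedence relations are sketched below (the slide's figure may present the same information with a different layout):

           +     *     id    $
      +    ·>    <·    <·    ·>
      *    ·>    ·>    <·    ·>
      id   ·>    ·>          ·>
      $    <·    <·    <·

For example, + <· * means a * seen after a + is shifted (its handle is not yet complete), while * ·> + means a + seen after a * forces a reduction; this is exactly how precedence and left-associativity are enforced.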
• 77. Conflicts in LR Parsing • Every SLR grammar is unambiguous, but not every unambiguous grammar is an SLR grammar. 81- By Jaydeep Patil AISSMS's IOIT Pune
• 78. shift/reduce and reduce/reduce conflicts • If a state does not know whether to perform a shift or a reduction for a terminal, we say that there is a shift/reduce conflict. • If a state does not know whether to reduce using production rule i or production rule j for a terminal, we say that there is a reduce/reduce conflict. • If the SLR parsing table of a grammar G has a conflict, we say that the grammar is not an SLR grammar. 82- By Jaydeep Patil AISSMS's IOIT Pune
  • 79. 83- By Jaydeep Patil AISSMS's IOIT Pune
  • 80. 84- By Jaydeep Patil AISSMS's IOIT Pune
• 81. Using Ambiguous Grammars • All grammars used in the construction of LR-parsing tables must be unambiguous. • Can we create LR-parsing tables for ambiguous grammars? – Yes, but they will have conflicts. – We can resolve these conflicts in favor of one of the alternatives to disambiguate the grammar. – At the end, we will again have an unambiguous grammar. • Why do we want to use an ambiguous grammar? – Some ambiguous grammars are more natural, and a corresponding unambiguous grammar can be very complex. – Usage of an ambiguous grammar may eliminate unnecessary reductions. • Ex. Unambiguous: E -> E+T | T, T -> T*F | F, F -> (E) | id versus ambiguous: E -> E+E | E*E | (E) | id 85- By Jaydeep Patil AISSMS's IOIT Pune
• 82. Sets of LR(0) Items for Ambiguous Grammar
I0: E’ -> .E, E -> .E+E, E -> .E*E, E -> .(E), E -> .id
I1: E’ -> E., E -> E.+E, E -> E.*E
I2: E -> (.E), E -> .E+E, E -> .E*E, E -> .(E), E -> .id
I3: E -> id.
I4: E -> E+.E, E -> .E+E, E -> .E*E, E -> .(E), E -> .id
I5: E -> E*.E, E -> .E+E, E -> .E*E, E -> .(E), E -> .id
I6: E -> (E.), E -> E.+E, E -> E.*E
I7: E -> E+E., E -> E.+E, E -> E.*E
I8: E -> E*E., E -> E.+E, E -> E.*E
I9: E -> (E).
(The slide's figure also shows the GOTO graph connecting these states on E, +, *, (, ) and id.)
86- By Jaydeep Patil AISSMS's IOIT Pune
  • 83. Using ambiguous grammars 87- By Jaydeep Patil AISSMS's IOIT Pune
  • 84. 88- By Jaydeep Patil AISSMS's IOIT Pune
  • 85. 89- By Jaydeep Patil AISSMS's IOIT Pune
  • 86. 90- By Jaydeep Patil AISSMS's IOIT Pune
  • 87. 91- By Jaydeep Patil AISSMS's IOIT Pune
  • 88. 92- By Jaydeep Patil AISSMS's IOIT Pune
  • 89. Error Recovery in LR Parsing • An LR parser will detect an error when it consults the parsing action table and finds an error entry. All empty entries in the action table are error entries. • Errors are never detected by consulting the goto table. • An LR parser will announce error as soon as there is no valid continuation for the scanned portion of the input. • A canonical LR parser (LR(1) parser) will never make even a single reduction before announcing an error. • The SLR and LALR parsers may make several reductions before announcing an error. • But, all LR parsers (LR(1), LALR and SLR parsers) will never shift an erroneous input symbol onto the stack. 93- By Jaydeep Patil AISSMS's IOIT Pune
• 90. ERROR RECOVERY IN LR PARSING • An LR parser will detect an error when it consults the parsing table and finds an error entry. A canonical parser will never make even a single reduction before announcing an error. The SLR and LALR parsers may make several reductions before announcing an error, but they will never shift an erroneous input symbol onto the stack. • We can implement two modes of recovery: 94- By Jaydeep Patil AISSMS's IOIT Pune
• 91. Panic Mode • We scan down the stack until a state s with a goto on a particular non-terminal A is found. Zero or more input symbols are then discarded until a symbol a is found that can legitimately follow A. The parser then pushes the state goto[s, A] and resumes normal parsing. There may be many choices for the non-terminal A; normally these would be non-terminals representing major program pieces, such as an expression, statement, or block. 95- By Jaydeep Patil AISSMS's IOIT Pune
• 92. Phrase Level Recovery • It is implemented by examining each error entry in the LR parsing table and deciding, on the basis of the language, the most likely programmer error that could give rise to that error entry. An appropriate error recovery procedure can then be implemented; presumably the top of the stack and/or the first input symbols would be modified in a way deemed appropriate for each error. 96- By Jaydeep Patil AISSMS's IOIT Pune
• 93. • As an example consider the grammar (1): E -> E + E | E * E | ( E ) | id ---- (1) The parsing table contains error routines that have the effect of detecting errors before any shift move takes place. 97- By Jaydeep Patil AISSMS's IOIT Pune
• 94. The LR parsing table with error routines (action columns for id, +, *, (, ), $; goto column for E):
State   id   +    *    (    )    $    | E
  0     s3   e1   e1   s2   e2   e1   | 1
  1     e3   s4   s5   e3   e2   acc  |
  2     s3   e1   e1   s2   e2   e1   | 6
  3     r4   r4   r4   r4   r4   r4   |
  4     s3   e1   e1   s2   e2   e1   | 7
  5     s3   e1   e1   s2   e2   e1   | 8
  6     e3   s4   s5   e3   s9   e4   |
  7     r1   r1   s5   r1   r1   r1   |
  8     r2   r2   r2   r2   r2   r2   |
  9     r3   r3   r3   r3   r3   r3   |
98- By Jaydeep Patil AISSMS's IOIT Pune
• 95. Error routines: e1 • This routine is called from states 0, 2, 4 and 5, all of which expect the beginning of an operand, either an id or a left parenthesis. Instead, an operator (+ or *) or the end of input was found. • Action: Push an imaginary id onto the stack and cover it with state 3 (the goto of states 0, 2, 4 and 5 on id). • Print: Issue diagnostic "missing operand". 99- By Jaydeep Patil AISSMS's IOIT Pune
• 96. Error routines: e2 • This routine is called from states 0, 1, 2, 4 and 5 on finding a right parenthesis. • Action: Remove the right parenthesis from the input. • Print: Issue diagnostic "unbalanced right parenthesis". 100- By Jaydeep Patil AISSMS's IOIT Pune
• 97. Error routines: e3 • This routine is called from states 1 or 6 when expecting an operator, and an id or a left parenthesis is found. • Action: Push + onto the stack and cover it with state 4. • Print: Issue diagnostic "missing operator". 101- By Jaydeep Patil AISSMS's IOIT Pune
• 98. Error routines: e4 • This routine is called from state 6 when the end of input is found while expecting an operator or a right parenthesis. • Action: Push a right parenthesis onto the stack and cover it with state 9. • Print: Issue diagnostic "missing right parenthesis". 102- By Jaydeep Patil AISSMS's IOIT Pune
  • 99. Automatic construction of parsers (YACC), YACC specifications. 103- By Jaydeep Patil AISSMS's IOIT Pune
• 100. 104 Automatic construction of parsers (YACC), YACC specifications. • Two classical tools for compilers: – Lex: A Lexical Analyzer Generator – Yacc: "Yet Another Compiler Compiler" (Parser Generator) • Lex creates programs that scan the input and hand over tokens one by one. • Yacc takes a grammar (sentence structure) and generates a parser. (Diagram: Lexical Rules -> Lex -> yylex(); Grammar Rules -> Yacc -> yyparse(); Input -> Parsed Input) - By Jaydeep Patil AISSMS's IOIT Pune
• 101. 105 Automatic construction of parsers (YACC), YACC specifications. • Lex and Yacc generate C code for your analyzer & parser. (Diagram: Lex emits C code compiled into the lexical analyzer/tokenizer yylex(), which turns a char stream into a token stream; Yacc emits C code compiled into the parser yyparse(), which consumes that token stream to produce the parsed input.) - By Jaydeep Patil AISSMS's IOIT Pune
• 102. 106 Automatic construction of parsers (YACC), YACC specifications. • Often, instead of the standard Lex and Yacc, Flex and Bison are used: – Flex: A fast lexical analyzer – (GNU) Bison: A drop-in replacement for (backwards compatible with) Yacc • Byacc is the Berkeley implementation of Yacc (so it is Yacc). • Resources: – http://en.wikipedia.org/wiki/Flex_lexical_analyser – http://en.wikipedia.org/wiki/GNU_Bison • The Lex & Yacc Page (manuals, links): – http://dinosaur.compilertools.net/ - By Jaydeep Patil AISSMS's IOIT Pune
  • 103. 107 Automatic construction of parsers (YACC), YACC specifications. • Yacc is not a new tool, and yet, it is still used in many projects. • Yacc syntax is similar to Lex/Flex at the top level. • Lex/Flex rules were regular expression – action pairs. • Yacc rules are grammar rule – action pairs. declarations %% rules %% programs - By Jaydeep Patil AISSMS's IOIT Pune
  • 104. 108- By Jaydeep Patil AISSMS's IOIT Pune
• 105. • Declaration Section • There are two sections in the declarations part of a Yacc program; both are optional. In the first section, we put ordinary C declarations, delimited by %{ and %}. Here we place declarations of any temporaries used by the translation rules or procedures of the second and third sections, e.g. #include <ctype.h> • The C preprocessor is told to include the standard header file <ctype.h>, which contains the predicate isdigit. • Also in the declarations part are declarations of grammar tokens: • %token DIGIT • Tokens declared in this section can then be used in the second and third parts of the Yacc specification. If Lex is used to create the lexical analyzer that passes tokens to the Yacc parser, then these token declarations are also made available to the analyzer generated by Lex. 109- By Jaydeep Patil AISSMS's IOIT Pune
  • 106. • The Translation Rules Part • In the part of the Yacc specification after the first %% pair, we put the translation rules . Each rule consists of a grammar production and the associated semantic action. A set of productions that we have been writing: 110- By Jaydeep Patil AISSMS's IOIT Pune
• 107. • In a Yacc production, unquoted strings of letters and digits not declared to be tokens are taken to be nonterminals. A quoted single character, e.g. 'c', is taken to be the terminal symbol c, as well as the integer code for the token represented by that character (i.e., Lex would return the character code for 'c' to the parser, as an integer). Alternative bodies can be separated by a vertical bar, and a semicolon follows each head with its alternatives and their semantic actions. The first head is taken to be the start symbol. 111- By Jaydeep Patil AISSMS's IOIT Pune
  • 108. • A Yacc semantic action is a sequence of C statements. In a semantic action, the symbol $$ refers to the attribute value associated with the nonterminal of the head, while $i refers to the value associated with the ith grammar symbol (terminal or nonterminal) of the body. The semantic action is performed whenever we reduce by the associated production, so normally the semantic action computes a value for $$ in terms of the $i's. In the Yacc specification, we have written the two E-productions 112- By Jaydeep Patil AISSMS's IOIT Pune
• 109. • Note that the nonterminal term in the first production is the third grammar symbol of the body, while + is the second. The semantic action associated with the first production adds the value of the expr and the term of the body and assigns the result as the value for the nonterminal expr of the head. We have omitted the semantic action for the second production altogether, since copying the value is the default action for productions with a single grammar symbol in the body. In general, { $$ = $1; } is the default semantic action. Notice that we have added a new starting production • line : expr '\n' { printf("%d\n", $1); } • to the Yacc specification. This production says that an input to the desk calculator is to be an expression followed by a newline character. The semantic action associated with this production prints the decimal value of the expression followed by a newline character. 113- By Jaydeep Patil AISSMS's IOIT Pune
  • 110. • The Supporting C-Routines Part • The third part of a Yacc specification consists of supporting C- routines. A lexical analyzer by the name yylex() must be provided. Using Lex to produce yylex() is a common choice; The lexical analyzer yylex() produces tokens consisting of a token name and its associated attribute value. If a token name such as DIGIT is returned, the token name must be declared in the first section of the Yacc specification. • The attribute value associated with a token is communicated to the parser through a Yacc-defined variable yylval. It reads input characters one at a time using the C-function getchar() . If the character is a digit, the value of the digit is stored in the variable yylval, and the token name DIGIT is returned. Otherwise, the character itself is returned as the token name. 114- By Jaydeep Patil AISSMS's IOIT Pune
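Putting the three parts together, the desk-calculator specification described on the last few slides looks roughly like this. This is a reconstruction for illustration (error handling is minimal and the nonterminal names term and factor are assumed), not a verbatim copy of the slides' figure:

    %{
    #include <ctype.h>
    #include <stdio.h>
    %}
    %token DIGIT
    %%
    line   : expr '\n'         { printf("%d\n", $1); }
           ;
    expr   : expr '+' term     { $$ = $1 + $3; }
           | term
           ;
    term   : term '*' factor   { $$ = $1 * $3; }
           | factor
           ;
    factor : '(' expr ')'      { $$ = $2; }
           | DIGIT
           ;
    %%
    int yylex(void) {                       /* hand-written lexical analyzer */
        int c = getchar();
        if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
        return c;                           /* operators, parens, newline    */
    }
    void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
    int main(void) { return yyparse(); }

Running this through yacc and cc and then typing 2+3*4 followed by a newline prints 14.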
  • 111. 115- By Jaydeep Patil AISSMS's IOIT Pune
• 112. yacc -d bas.y # create y.tab.h, y.tab.c
lex bas.l # create lex.yy.c
cc lex.yy.c y.tab.c -o bas.exe # compile/link
116- By Jaydeep Patil AISSMS's IOIT Pune
  • 113. • Yacc reads the grammar descriptions in bas.y and generates a syntax analyzer (parser), that includes function yyparse, in file y.tab.c. The –d option causes yacc to generate definitions for tokens and place them in file y.tab.h. Lex reads the pattern descriptions in bas.l, includes file y.tab.h, and generates a lexical analyzer, that includes function yylex, in file lex.yy.c. • Finally, the lexer and parser are compiled and linked together to create executable bas.exe. From main we call yyparse to run the compiler. Function yyparse automatically calls yylex to obtain each token. 117- By Jaydeep Patil AISSMS's IOIT Pune
  • 114. • %token INTEGER • This definition declares an INTEGER token. Yacc generates a parser in file y.tab.c and an include file, y.tab.h: • #ifndef YYSTYPE • #define YYSTYPE int • #endif • #define INTEGER 258 • extern YYSTYPE yylval; • Lex includes this file and utilizes the definitions for token values. To obtain tokens yacc calls yylex. Function yylex has a return type of int that returns a token. Values associated with the token are returned by lex in variable yylval. For example, • [0-9]+ { yylval = atoi(yytext); return INTEGER; } • would store the value of the integer in yylval, and return token INTEGER to yacc. The type of yylval is determined by YYSTYPE. Since the default type is integer this works well in this case. Token values 0-255 are reserved for character values. For example, if you had a rule such as • [-+] return *yytext; /* return operator */ • the character value for minus or plus is returned. Note that we placed the minus sign first so that it wouldn’t be mistaken for a range designator. Generated token values typically start around 258 because lex reserves several values for end-of-file and error processing. 118- By Jaydeep Patil AISSMS's IOIT Pune
• 115. • By default yylval is of type int, but you can override that from the Yacc file by redefining YYSTYPE with #define. • The lexer needs to be able to access yylval. In order to do so, it must be declared in the scope of the lexer as an extern variable. The original Yacc neglects to do this for you, so you should add the following to your lexer, just beneath • #include <y.tab.h>: • extern YYSTYPE yylval; • Bison does this for you automatically. • #ifndef checks whether the given name has been #defined earlier in the file or in an included file; if not, it includes the code between it and the closing #else or, if no #else is present, #endif statement. 119- By Jaydeep Patil AISSMS's IOIT Pune
  • 116. 120- By Jaydeep Patil AISSMS's IOIT Pune
  • 117. • Internally yacc maintains two stacks in memory; a parse stack and a value stack. The parse stack contains terminals and nonterminals that represent the current parsing state. The value stack is an array of YYSTYPE elements and associates a value with each element in the parse stack. For example when lex returns an INTEGER token yacc shifts this token to the parse stack. At the same time the corresponding yylval is shifted to the value stack. The parse and value stacks are always synchronized so finding a value related to a token on the stack is easily accomplished. 121- By Jaydeep Patil AISSMS's IOIT Pune
  • 118. 122- By Jaydeep Patil AISSMS's IOIT Pune
  • 119. • The left-hand side of a production, or nonterminal, is entered left-justified and followed by a colon. This is followed by the right-hand side of the production. Actions associated with a rule are entered in braces. • With left-recursion, we have specified that a program consists of zero or more expressions. Each expression terminates with a newline. When a newline is detected we print the value of the expression. When we apply the rule • expr: expr '+' expr { $$ = $1 + $3; } • we replace the right-hand side of the production in the parse stack with the left-hand side of the same production. In this case we pop “expr '+' expr” and push “expr”. We have reduced the stack by popping three terms off the stack and pushing back one term. We may reference positions in the value stack in our C code by specifying “$1” for the first term on the right-hand side of the production, “$2” for the second, and so on. “$$” designates the top of the stack after reduction has taken place. The above action adds the value associated with two expressions, pops three terms off the value stack, and pushes back a single sum. As a consequence the parse and value stacks remain synchronized. 123- By Jaydeep Patil AISSMS's IOIT Pune
  • 120. • Numeric values are initially entered on the stack when we reduce from INTEGER to expr. After INTEGER is shifted to the stack we apply the rule • expr: INTEGER { $$ = $1; } • The INTEGER token is popped off the parse stack followed by a push of expr. For the value stack we pop the integer value off the stack and then push it back on again. In other words we do nothing. In fact this is the default action and need not be specified. Finally, when a newline is encountered, the value associated with expr is printed. • In the event of syntax errors yacc calls the user-supplied function yyerror. If you need to modify the interface to yyerror then alter the canned file that yacc includes to fit your needs. The last function in our yacc specification is main. This example still has an ambiguous grammar. Although yacc will issue shift-reduce warnings it will still process the grammar using shift as the default operation. 124- By Jaydeep Patil AISSMS's IOIT Pune
  • 121. • The lexical analyzer returns VARIABLE and INTEGER tokens. For variables yylval specifies an index to the symbol table sym. For this program sym merely holds the value of the associated variable. When INTEGER tokens are returned, yylval contains the number scanned. 125- By Jaydeep Patil AISSMS's IOIT Pune
• 122. • The input specification for yacc follows. The tokens for INTEGER and VARIABLE are utilized by yacc to create #defines in y.tab.h for use in lex. This is followed by definitions for the arithmetic operators. We may specify %left for left-associative or %right for right-associative operators. The last definition listed has the highest precedence. Consequently multiplication and division have higher precedence than addition and subtraction. All four operators are left-associative. Using this simple technique we are able to disambiguate our grammar. 126- By Jaydeep Patil AISSMS's IOIT Pune
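In the .y file, the declarations the slide describes would look like this (token names as in the slide's calculator example):

    %token INTEGER VARIABLE
    %left '+' '-'        /* lower precedence, left-associative  */
    %left '*' '/'        /* higher precedence, left-associative */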
  • 123. 127- By Jaydeep Patil AISSMS's IOIT Pune
  • 124. • extern void *malloc(); • malloc accepts an argument of type size_t, and size_t may be defined as unsigned long. If you are passing ints (or even unsigned ints), malloc may be receiving garbage (or similarly if you are passing a long but size_t is int). 128- By Jaydeep Patil AISSMS's IOIT Pune
  • 125. Semantic Analysis 129- By Jaydeep Patil AISSMS's IOIT Pune
  • 126. Beyond syntax analysis •An identifier named x has been recognized. –Is x a scalar, array or function? –How big is x? –If x is a function, how many and what type of arguments does it take? –Is x declared before being used? –Where can x be stored? –Is the expression x+y type-consistent? •Semantic analysis is the phase where we collect information about the types of expressions and check for type related errors. •The more information we can collect at compile time, the less overhead we have at run time. 130- By Jaydeep Patil AISSMS's IOIT Pune
  • 127. Semantic Analysis •The syntax of a programming language describes the proper form of its programs, •while the semantics of the language defines what its programs mean; that is, what each program does when it executes. 131- By Jaydeep Patil AISSMS's IOIT Pune
  • 128. Semantic analysis •Collecting type information may involve "computations" –What is the type of x+y given the types of x and y? •Tool: attribute grammars –Each grammar symbol has a number of associated attributes: –The type of a variable or expression –The value of a variable or expression –The code for a statement –Etc. –The grammar is augmented with special equations (called semantic actions) that specify how the values of attributes are computed from other attributes. –The process of using semantic actions to evaluate attributes is called syntax-directed translation. 132- By Jaydeep Patil AISSMS's IOIT Pune
  • 129. •TYPE Checking 133- By Jaydeep Patil AISSMS's IOIT Pune
• 130. •A compiler must check that the source program follows both the syntactic and semantic conventions of the source language. •This checking, called static checking (to distinguish it from dynamic checking during execution of the target program), ensures that certain kinds of programming errors will be detected and reported. 134- By Jaydeep Patil AISSMS's IOIT Pune
• 131. •Examples of Static Checks: –Type Checks: A compiler should report an error if an operator is applied to an incompatible operand. –Flow-of-Control Checks: Statements that cause flow of control to leave a construct must have some place to which to transfer the flow of control. –Uniqueness Checks: There are some situations in which an object must be defined exactly once. –Name-Related Checks: Sometimes the same name must appear two or more times. Ex. In Ada, a loop or block may have a name that appears at the beginning and end of the construct. The compiler must check that the same name is used at both places. 135- By Jaydeep Patil AISSMS's IOIT Pune
  • 132. Type Checking •TYPE CHECKING is the main activity in semantic analysis. •Goal: calculate and ensure consistency of the type of every expression in a program •If there are type errors, we need to notify the user. •Otherwise, we need the type information to generate code that is correct. 136- By Jaydeep Patil AISSMS's IOIT Pune
  • 133. 137 Type Systems and Type Expressions 137- By Jaydeep Patil AISSMS's IOIT Pune
• 134. Type systems •Every language has a set of types and rules for assigning types to language constructs. •Example from the C specification: –"The result of the unary & operator is a pointer to the object referred to by the operand. If the type of the operand is '…' then the type of the result is 'pointer to …'" •Usually, every expression has a type. •Types have structure: the type 'pointer to int' is CONSTRUCTED from the type 'int'. 138- By Jaydeep Patil AISSMS's IOIT Pune
  • 135. Basic vs. constructed types •Most programming languages have basic and constructed types. •BASIC TYPES are the atomic types provided by the language. –Pascal: boolean, character, integer, real –C: char, int, float, double •CONSTRUCTED TYPES are built up from basic types. –Pascal: arrays, records, sets, pointers –C: arrays, structs, pointers 139- By Jaydeep Patil AISSMS's IOIT Pune
  • 136. Type expressions •We denote the type of language constructs with TYPE EXPRESSIONS. •Type expressions are built up with TYPE CONSTRUCTORS. 1.A basic type is a type expression. The basic types are boolean, char, integer, and real. The special basic type type_error signifies an error. The special type void signifies “no type” 2.A type name is a type expression (type names are like typedefs in C) 140- By Jaydeep Patil AISSMS's IOIT Pune
• 137. Type expressions 1.A type constructor applied to type expressions is a type expression. a.Arrays: if T is a type expression and I is an index set, then array(I, T) is a type expression denoting the type "array of elements of type T with index set I". b.Products: if T1 and T2 are type expressions, then their Cartesian product T1 × T2 is also a type expression. c.Records: a record is a special kind of product in which the fields have names (examples below). d.Pointers: if T is a type expression, then pointer(T) is a type expression denoting the type "pointer to an object of type T". e.Functions: functions map elements of a domain D to a range R, so we write D -> R to denote "function mapping objects of type D to objects of type R" (examples below). 2.Type expressions may contain variables, whose values are themselves type expressions (this gives polymorphism). 141- By Jaydeep Patil AISSMS's IOIT Pune
• 138. Record type expressions •The Pascal code • type row = record • address: integer; • lexeme: array[1..15] of char • end; • var table: array[1..10] of row; •associates the type expression •record((address × integer) × (lexeme × array(1..15, char))) •with the type name row, and the type expression •array(1..10, record((address × integer) × (lexeme × array(1..15, char)))) •with the variable table 142- By Jaydeep Patil AISSMS's IOIT Pune
  • 139. Function type expressions •The C declaration •int *foo( char a, char b ); •would associate type expression •char × char -> pointer(integer) •with foo. Some languages (like ML) allow all sorts of crazy function types, e.g. • (integer -> integer) -> (integer -> integer) •denotes functions taking a function as input and returning another function 143- By Jaydeep Patil AISSMS's IOIT Pune
  • 140. Graph representation of type expressions •The recursive structure of a type can be represented with a tree, e.g. for char × char -> pointer(integer): •Some compilers explicitly use graphs like these to represent the types of expressions. 144- By Jaydeep Patil AISSMS's IOIT Pune
  • 141. Type systems and checkers •A TYPE SYSTEM is a set of rules for assigning type expressions to the parts of a program. •Every type checker implements some type system. •Syntax-directed type checking is a simple method to implement a type checker. 145- By Jaydeep Patil AISSMS's IOIT Pune
  • 142. Static vs. dynamic type checking •STATIC type checking is done at compile time. •DYNAMIC type checking is done at run time. •Any kind of type checking CAN be done at run time. •But this reduces run-time efficiency, so we want to do static checking when possible. •A SOUND type system is one in which ALL type errors can be found statically. •If the compiler guarantees that every program it accepts will run without type errors, then the language is STRONGLY TYPED. 146- By Jaydeep Patil AISSMS's IOIT Pune
  • 143. 147 An Example Type Checker 147- By Jaydeep Patil AISSMS's IOIT Pune
• 144. Example type checker •Let’s build a translation scheme to synthesize the type of every expression from its subexpressions. •Here is a Pascal-like grammar for a sequence of declarations (D) followed by an expression (E):
P → D ; E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ↑ T
E → literal | num | id | E mod E | E [ E ] | E ↑
•Example program: key: integer; key mod 1999 148- By Jaydeep Patil AISSMS's IOIT Pune
• 145. The type system •The basic types are char and integer. •type_error signals an error. •All arrays start at 1, so •array[256] of char •leads to the type expression array(1..256, char). •The symbol ↑ in a declaration specifies a pointer type, so • ↑ integer •leads to the type expression pointer(integer). 149- By Jaydeep Patil AISSMS's IOIT Pune
  • 146. Translation scheme for declarations •P → D ; E •D → D ; D •D → id : T { addtype(id.entry, T.type) } •T → char { T.type := char } •T → integer { T.type := integer } •T → ↑T1 { T.type := pointer(T1.type) } •T → array [ num ] of T1 • { T.type := array(1 .. num.val, T1.type) } 150- By Jaydeep Patil AISSMS's IOIT Pune
• 147. Type checking for expressions Once the identifiers and their types have been inserted into the symbol table, we can check the type of the elements of an expression: •E → literal { E.type := char } •E → num { E.type := integer } •E → id { E.type := lookup(id.entry) } •E → E1 mod E2 { if E1.type = integer and E2.type = integer • then E.type := integer • else E.type := type_error } •E → E1 [ E2 ] { if E2.type = integer and E1.type = array(s,t) • then E.type := t else E.type := type_error } •E → E1↑ { if E1.type = pointer(t) • then E.type := t else E.type := type_error } 151- By Jaydeep Patil AISSMS's IOIT Pune
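A direct transcription of these synthesized-attribute rules into code might look like the C sketch below. The Type/Expr representation and the helper names are my own assumptions for illustration; the slides only give the semantic rules.

    #include <stdio.h>

    typedef enum { T_CHAR, T_INTEGER, T_ARRAY, T_POINTER, T_ERROR } Kind;
    typedef struct Type { Kind kind; struct Type *elem; } Type;  /* elem: array/pointer element type */

    typedef enum { E_LITERAL, E_NUM, E_ID, E_MOD, E_INDEX, E_DEREF } Op;
    typedef struct Expr {
        Op op;
        Type *declared;                 /* for E_ID: type found in the symbol table */
        struct Expr *left, *right;
    } Expr;

    static Type t_char = { T_CHAR, 0 }, t_int = { T_INTEGER, 0 }, t_err = { T_ERROR, 0 };

    /* Synthesize E.type, mirroring the semantic rules above. */
    static Type *check(Expr *e) {
        switch (e->op) {
        case E_LITERAL: return &t_char;
        case E_NUM:     return &t_int;
        case E_ID:      return e->declared;                      /* lookup(id.entry) */
        case E_MOD: {                                            /* E1 mod E2 */
            Type *a = check(e->left), *b = check(e->right);
            return (a->kind == T_INTEGER && b->kind == T_INTEGER) ? &t_int : &t_err;
        }
        case E_INDEX: {                                          /* E1 [ E2 ] */
            Type *a = check(e->left), *b = check(e->right);
            return (a->kind == T_ARRAY && b->kind == T_INTEGER) ? a->elem : &t_err;
        }
        case E_DEREF: {                                          /* E1 ^ */
            Type *a = check(e->left);
            return (a->kind == T_POINTER) ? a->elem : &t_err;
        }
        }
        return &t_err;
    }

    int main(void) {
        Expr key  = { E_ID, &t_int, 0, 0 };                      /* key: integer */
        Expr num  = { E_NUM, 0, 0, 0 };
        Expr expr = { E_MOD, 0, &key, &num };                    /* key mod 1999 */
        printf("integer? %d\n", check(&expr)->kind == T_INTEGER);  /* prints 1 */
        return 0;
    }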
• 148. How about boolean types? •Try adding • T -> boolean • relational operators < <= = >= > <> • and logical connectives and, or, not to the grammar, then add appropriate type checking semantic actions. 152- By Jaydeep Patil AISSMS's IOIT Pune
  • 149. Type checking for statements •Usually we assign the type VOID to statements. •If a type error is found during type checking, though, we should set the type to type_error •Let’s change our grammar allow statements: • P → D ; S •i.e., a program is a sequence of declarations followed by a sequence of statements. 153- By Jaydeep Patil AISSMS's IOIT Pune
• 150. Type checking for statements Now we need to add productions and semantic actions: •S → id := E { if id.type = E.type then S.type := void • else S.type := type_error } •S → if E then S1 { if E.type = boolean • then S.type := S1.type • else S.type := type_error } •S → while E do S1 { if E.type = boolean • then S.type := S1.type • else S.type := type_error } •S → S1 ; S2 { if S1.type = void and S2.type = void • then S.type := void • else S.type := type_error } 154- By Jaydeep Patil AISSMS's IOIT Pune
• 151. Type checking for function calls •Suppose we add a production E → E ( E ) •Then we need a production for function declarations: T → T1 → T2 { T.type := T1.type → T2.type } and a rule for function calls: E → E1 ( E2 ) { if E2.type = s and E1.type = s → t then E.type := t else E.type := type_error } 155- By Jaydeep Patil AISSMS's IOIT Pune
• 152. Type checking for function calls •Multiple-argument functions, however, can be modeled as functions that take a single PRODUCT argument. • root : ( real → real ) × real → real •This would model a function that takes a real function over the reals, and a real, and returns a real. •In C: float root( float (*f)(float), float x ); 156- By Jaydeep Patil AISSMS's IOIT Pune
  • 153. Type expression equivalence •Type checkers need to ask questions like: • – “if E1.type == E2.type, then …” •What does it mean for two type expressions to be equal? •STRUCTURAL EQUIVALENCE says two types are the same if they are made up of the same basic types and constructors. •NAME EQUIVALENCE says two types are the same if their constituents have the SAME NAMES. 157- By Jaydeep Patil AISSMS's IOIT Pune
• 154. Structural Equivalence
•boolean sequiv( s, t )
•{
• if s and t are the same basic type
• return TRUE;
• else if s == array( s1, s2 ) and t == array( t1, t2 ) then
• return sequiv( s1, t1 ) and sequiv( s2, t2 );
• else if s == s1 x s2 and t == t1 x t2 then
• return sequiv( s1, t1 ) and sequiv( s2, t2 );
• else if s == pointer( s1 ) and t == pointer( t1 ) then
• return sequiv( s1, t1 );
• else if s == s1 → s2 and t == t1 → t2 then
• return sequiv( s1, t1 ) and sequiv( s2, t2 );
• return FALSE;
•}
158- By Jaydeep Patil AISSMS's IOIT Pune
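The pseudocode above translates almost line-for-line into C over a small type-expression representation. This is a sketch under my own assumed representation (TypeExpr, Ctor), not code from the slides:

    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>

    typedef enum { BASIC, ARRAY, PRODUCT, POINTER, FUNCTION } Ctor;
    typedef struct TypeExpr {
        Ctor ctor;
        const char *name;                     /* for BASIC: "integer", "char", ...   */
        const struct TypeExpr *left, *right;  /* sub-expressions of the constructor  */
    } TypeExpr;

    bool sequiv(const TypeExpr *s, const TypeExpr *t) {
        if (s->ctor != t->ctor) return false;
        switch (s->ctor) {
        case BASIC:   return strcmp(s->name, t->name) == 0;   /* same basic type */
        case POINTER: return sequiv(s->left, t->left);        /* pointer(s1)     */
        case ARRAY:                                           /* array(s1, s2)   */
        case PRODUCT:                                         /* s1 x s2         */
        case FUNCTION:                                        /* s1 -> s2        */
            return sequiv(s->left, t->left) && sequiv(s->right, t->right);
        }
        return false;
    }

    int main(void) {
        TypeExpr integer = { BASIC, "integer", 0, 0 };
        TypeExpr p1 = { POINTER, 0, &integer, 0 };
        TypeExpr p2 = { POINTER, 0, &integer, 0 };
        printf("%d\n", sequiv(&p1, &p2));      /* 1: both are pointer(integer) */
        return 0;
    }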
  • 155. Relaxing structural equivalence •We don’t always want strict structural equivalence. •E.g. for arrays, we want to write functions that accept arrays of any length. •To accomplish this, we would modify sequiv() to accept any bounds: • … • else if s == array( s1, s2 ) and t == array( t1, t2 ) • return sequiv( s2, t2 ) • … 159- By Jaydeep Patil AISSMS's IOIT Pune
• 156. Encoding types •Recursive routines can be slow. •Recursive type checking routines increase the compiler’s run time. •In the compilers of the 1970’s and 1980’s, this made compilation take too long. •So designers came up with ENCODINGS for types that allowed for faster type checking. 160- By Jaydeep Patil AISSMS's IOIT Pune
  • 157. Name equivalence •Most languages allow association of names with type expressions. This makes type equivalence trickier. •Example from Pascal: • type link = ↑cell; • var next: link; • last: link; • p: ↑ cell; • q,r: ↑ cell; •Do next, last, p, q, and r have the same type? •In Pascal, it depends on the implementation! •In structural equivalence, the types would be the same. •But NAME EQUIVALENCE requires identical NAMES. 161- By Jaydeep Patil AISSMS's IOIT Pune
  • 158. Handling cyclic types •Suppose we had the Pascal declaration • type link = ↑cell; • cell = record • info: integer; • next: link; • end; •The declaration of cell contains itself (via the next pointer). •The graph for this type therefore contains a cycle. 162- By Jaydeep Patil AISSMS's IOIT Pune
• 159. Cyclic types •The situation in C is slightly different, since it is impossible to refer to an undeclared name. • typedef struct _cell { • int info; • struct _cell *next; • } cell; • typedef cell *link; •But the name link is just shorthand for • (struct _cell *). •C uses name equivalence for structs to avoid recursion •(after expanding typedef’s). •But it uses structural equivalence elsewhere. 163- By Jaydeep Patil AISSMS's IOIT Pune
• 160. Type conversion •Suppose we encounter an expression x+i where x has type real and i has type int. CPU instructions for addition could take EITHER real OR int as operands, but not a mix. •This means the compiler must sometimes convert the operands of arithmetic expressions to ensure that operands are consistent with operators. •With postfix as an intermediate language for expressions, we could express the conversion as follows: x i inttoreal real+ •where real+ is the floating point addition operation. 164- By Jaydeep Patil AISSMS's IOIT Pune
• 161. Type coercion •If type conversion is done by the compiler without the programmer requesting it, it is called IMPLICIT conversion or type COERCION. •EXPLICIT conversions are those that the programmer specifies, e.g. • x = (int)y * 2; •Implicit conversion of CONSTANT expressions should be done at compile time. 165- By Jaydeep Patil AISSMS's IOIT Pune
  • 162. Type checking example with coercion •Production Semantic Rule •E -> num E.type := integer •E -> num . num E.type := real •E -> id E.type := lookup( id.entry ) •E -> E1 op E2 E.type := if E1.type == integer and E2.type == integer • then integer • else if E1.type == integer and E2.type == real • then real • else if E1.type == real and E2.type == integer • then real • else if E1.type == real and E2.type == real • then real • else type_error 166- By Jaydeep Patil AISSMS's IOIT Pune
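The rule for E -> E1 op E2 boils down to a small helper function; a minimal C sketch (the enum and function name are assumptions, not from the slides):

    #include <stdio.h>

    typedef enum { TYPE_INTEGER, TYPE_REAL, TYPE_ERROR } Type;

    /* Result type of E1 op E2 with implicit int-to-real coercion. */
    Type op_type(Type t1, Type t2) {
        if (t1 == TYPE_INTEGER && t2 == TYPE_INTEGER) return TYPE_INTEGER;
        if ((t1 == TYPE_INTEGER || t1 == TYPE_REAL) &&
            (t2 == TYPE_INTEGER || t2 == TYPE_REAL)) return TYPE_REAL;
        return TYPE_ERROR;
    }

    int main(void) {
        printf("%d\n", op_type(TYPE_REAL, TYPE_INTEGER) == TYPE_REAL);   /* prints 1 */
        return 0;
    }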
  • 163. END of Unit 2 167- By Jaydeep Patil AISSMS's IOIT Pune