Compiler Design
Compiler:
A compiler is software that converts a program written in a high-level language (the source language) into a low-level language (the object/target/machine language).

[Diagram: SOURCE PROGRAM → Compiler → TARGET PROGRAM]

The source code is translated to object code successfully if it is free of errors.
When there are errors in the source code, the compiler reports them, with line numbers, at the end of compilation.
Compiler Design: Introduction
Interpreter:
An interpreter is a language processor that translates a single statement of the source program into machine code and executes it immediately, before moving on to the next line.
If there is an error in a statement, the interpreter terminates its translating process at that statement and displays an error message.
The interpreter moves on to the next line for execution only after the error is removed.
An interpreter directly executes instructions written in a programming or scripting language without previously converting them to object code or machine code.
Assembler:
The assembler is used to translate a program written in assembly language into machine code.
The source program (assembly code) is the input of the assembler, and machine code/object code is its output.
Language Processing System
We write programs in a high-level language, which is easier for us to understand and remember.
These programs are then fed into a series of tools and OS components to get the desired code that can be used by the machine. This is known as the Language Processing System.
Steps involved in language processing (a sketch of the pipeline follows the list):
1. The user writes a program in a high-level language (source code).
2. A preprocessor, generally considered a part of the compiler, is a tool that produces input for compilers. It deals with macro-processing, augmentation, file inclusion, language extension, etc.
3. The compiler compiles the program and translates it to an assembly program (low-level language).
4. An assembler then translates the assembly program into machine code (object code).
5. A linker tool is used to link all the parts of the program together for execution (executable machine code).
6. A loader loads all of them into memory and then the program is executed.
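The same pipeline can be traced with a typical C toolchain. A minimal sketch, assuming GCC on a Unix-like system (the file names are illustrative):

/* hello.c -- each stage of the language processing system, step by step:
 *   gcc -E hello.c -o hello.i    preprocessor (macro expansion, file inclusion)
 *   gcc -S hello.i -o hello.s    compiler (translation to assembly)
 *   gcc -c hello.s -o hello.o    assembler (assembly to object code)
 *   gcc hello.o -o hello         linker (object code to executable)
 *   ./hello                      loader brings it into memory and runs it
 */
#include <stdio.h>

int main(void)
{
    printf("Hello, language processing system!\n");
    return 0;
}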
Compiler Phases:
The compilation process contains the
sequence of various phases. Each
phase takes source program in one
representation and produces output
in another representation.
Each phase takes input from its previous
stage.
There are two parts of Compilation:
I. Analysis (Front End)
II. Synthesis (Back End)
The Analysis part breaks the source
program into constituent pieces and
creates an intermediate representation
of source program.
Phases of Compilation...
The Synthesis part constructs the desired target program from the intermediate representation.
Phases of Compilation...
Lexical Analysis:
Lexical analysis is the first phase of compiler which is also termed as scanning.
The source program is scanned as a stream of characters, and those characters are grouped into sequences called lexemes, for which tokens are produced as output.
Token: A token is a sequence of characters that represents a lexical unit matching a pattern, such as a keyword, operator, identifier, etc.
Lexeme: A lexeme is an instance of a token, i.e., a group of characters forming a token.
Example: in Pi = 3.14, the string Pi is a lexeme for the token "identifier".
Pattern: A pattern describes the rule that the lexemes of a token follow. It is the structure that must be matched by strings.
Once a token is generated the corresponding entry is made in the symbol table.
Example: c = a + b * 5

Lexemes    Tokens
c          identifier (id1)
=          assignment symbol
a          identifier (id2)
+          + (addition symbol)
b          identifier (id3)
*          * (multiplication symbol)
5          5 (number)

Output of LA: <id1> = <id2> + <id3> * 5
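A hand-written scanner for such assignment statements might look roughly as follows. This is only a sketch: the token names and the function next_token are invented for this illustration (it is not the generated LEX scanner discussed later), and a real scanner would also enter identifiers into the symbol table.

#include <ctype.h>
#include <stdio.h>

/* Hypothetical token codes for this sketch. */
enum token { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_STAR, TOK_END };

/* Group input characters into the next lexeme and return its token. */
enum token next_token(FILE *in)
{
    int c = fgetc(in);
    while (c == ' ' || c == '\t')              /* skip blanks between lexemes */
        c = fgetc(in);
    if (c == EOF || c == '\n')
        return TOK_END;
    if (isalpha(c)) {                          /* lexeme matches the identifier pattern */
        while (isalnum(c = fgetc(in)))
            ;
        ungetc(c, in);                         /* give back the lookahead character */
        return TOK_ID;
    }
    if (isdigit(c)) {                          /* lexeme matches the number pattern */
        while (isdigit(c = fgetc(in)))
            ;
        ungetc(c, in);
        return TOK_NUM;
    }
    /* single-character operators of the example c=a+b*5 */
    return c == '=' ? TOK_ASSIGN : c == '+' ? TOK_PLUS : TOK_STAR;
}

int main(void)                                 /* print the token stream of stdin */
{
    enum token t;
    while ((t = next_token(stdin)) != TOK_END)
        printf("token %d\n", (int)t);
    return 0;
}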
Phases of Compilation...
Syntax Analysis:
Syntax analysis is the second phase of the compiler; it is also called parsing.
Parser converts the tokens produced by lexical analyzer into a tree like representation called
parse tree.
A parse tree describes the syntactic structure of the input.
Syntax tree is a compressed representation of the parse tree in which the operators appear as
interior nodes and the operands of the operator are the children of the node for that
operator.
Input: Tokens
Output: Parse Tree
Phases of Compilation...
Semantic Analysis:
Semantic analysis is the third phase of compiler.
It checks for the semantic consistency.
Type information is gathered and stored in symbol table or in syntax tree.
Performs type checking.
Phases of Compilation...
Intermediate Code Generation:
Intermediate code generation produces intermediate representations for the source program which
are of the following forms:
Postfix notation
Three address code
Syntax tree
Most commonly used form is the three address code.
t1 = inttofloat (5)
t2 = id3* t1
t3 = id2 + t2
id1 = t3
Three address code is a type of intermediate code which is easy to generate and can be
easily converted to machine code.
It makes use of at most three addresses or operands and one operator to represent an
expression and the value computed at each instruction is stored in temporary variable
generated by compiler.
Phases of Compilation...
Code Optimization:
Code optimization phase gets the intermediate code as input and produces optimized
intermediate code as output.
It can be done by reducing the number of lines of code for a program.
During the code optimization, the result of the program is not affected.
t1 = id3* 5.0
id1 = id2 + t1
Code Generation:
Code generation is the final phase of a compiler.
It gets input from code optimization phase and produces the target code /object code as result.
Intermediate instructions are translated into a sequence of machine instructions or assembly
code that perform the same task.
LDF R2, id3
MULF R2, #5.0
LDF R1, id2
ADDF R1, R2
STF id1, R1
Phases of Compilation...
Symbol Table Management:
Symbol table is used to store all the information about identifiers used in the program.
It is a data structure containing a record for each identifier, with fields for the attributes of the
identifier.
It allows finding the record for each identifier quickly and to store or retrieve data from that
record.
Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
Error Handling:
Each phase can encounter errors. After detecting an error, a phase must handle the error so that
compilation can proceed.
In lexical analysis, errors occur in separation of tokens.
In syntax analysis, errors occur during construction of syntax tree.
In semantic analysis, errors may occur at the following cases:
(i) When the compiler detects constructs that have right syntactic structure but no meaning
(ii) During type conversion.
Phases of Compilation...
Example 1: Write the output for all the phases of the compiler. [worked example shown on slide]
Example 2: [worked example shown on slide]
Regular Expressions & Regular Grammars
The lexical analyzer needs to scan and identify only the finite set of valid strings/tokens/lexemes that belong to the language.
It searches for the patterns defined by the language rules.
Regular expressions have the capability to express these token languages by defining patterns for strings of symbols.
The grammar defined by regular expressions is known as a regular grammar, and the language defined by a regular grammar is known as a regular language.
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
x* means zero or more occurrences of x, i.e., it can generate { ε, x, xx, xxx, xxxx, … }
x+ means one or more occurrences of x, i.e., it can generate { x, xx, xxx, xxxx, … }; x+ = x.x*
x? means at most one occurrence of x, i.e., it can generate either {x} or {ε}.
[a-z] is all lower-case alphabets of the English language.
[A-Z] is all upper-case alphabets of the English language.
[0-9] is all the digits used in mathematics.
Representing occurrence of symbols using regular expressions
letter = [a – z] or [A – Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Regular Expressions & Regular Grammars...
Representing language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-accepted solution
is to use finite automata for verification.
Finite automata:
Finite Automata (FA) is the simplest machine to recognize patterns.
A finite automaton, or finite state machine, is an abstract machine described by five elements (a 5-tuple).
It has a set of states and rules for moving from one state to another, where each move depends on the applied input symbol. Basically, it is an abstract model of a digital computer.
A Finite Automaton is a 5-tuple machine M = (Q, Σ, q, F, δ) where:
Q : finite set of states.
Σ : set of input symbols.
q : initial state.
F : set of final states.
δ : transition function.
Finite Automata...
FA is characterized into two types:
1) Deterministic Finite Automata (DFA)
2) Nondeterministic Finite Automata (NFA)
Deterministic Finite Automata (DFA):
DFA refers to Deterministic Finite Automaton.
A Finite Automaton (FA) is said to be deterministic if, corresponding to an input symbol, there is a single resultant state, i.e., there is only one transition.
A deterministic finite automaton is a 5-tuple, represented as:
M = (Q, Σ, qo, F, δ)
Where,
Q – Non-empty finite set of states
Σ – Non-empty finite set of input symbols
qo – Start/Initial state
F – Set of final states
δ – Transition function
δ : Q x Σ → Q
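As an illustration, the identifier pattern letter (letter | digit)* from the regular-expression section can be checked with a three-state DFA. A minimal sketch in C (the state numbering is an assumption of this example):

#include <ctype.h>
#include <stdio.h>

/* DFA for Identifier = letter (letter | digit)*
   State 0 = start, state 1 = final, state 2 = dead (trap) state. */
static int next_state(int state, char c)
{
    if (state == 0) return isalpha((unsigned char)c) ? 1 : 2;
    if (state == 1) return isalnum((unsigned char)c) ? 1 : 2;
    return 2;                          /* the dead state traps every symbol */
}

static int is_identifier(const char *s)
{
    int state = 0;                     /* start in the initial state */
    for (; *s; s++)
        state = next_state(state, *s); /* exactly one transition per symbol */
    return state == 1;                 /* accept iff we end in a final state */
}

int main(void)
{
    printf("%d\n", is_identifier("sum1"));  /* 1: valid identifier */
    printf("%d\n", is_identifier("1sum"));  /* 0: starts with a digit */
    return 0;
}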
Finite Automata...
Non-Deterministic Finite Automata (NFA):
NFA refers to Nondeterministic Finite Automaton.
A Finite Automaton (FA) is said to be non-deterministic if there can be more than one possible transition from one state on the same input symbol.
A non-deterministic finite automaton is also a 5-tuple, represented as:
M = (Q, Σ, qo, F, δ)
Where,
Q – Non-empty finite set of states
Σ – Non-empty finite set of input symbols
qo – Start/Initial state
F – Set of final states
δ – Transition function
δ : Q x Σ → 2^Q
NFA to DFA Conversion...
Let M = (Q, ∑, δ, q0, F) be an NFA which accepts the language L(M). There exists an equivalent DFA, denoted M' = (Q', ∑', q0', δ', F'), such that L(M) = L(M').
Steps for converting NFA to DFA:
Step 1: Initially Q' = ϕ
Step 2: Add q0 of NFA to Q'. Then find the transitions from this start state.
Step 3: In Q', find the possible set of states for each input symbol. If this set of states is not in Q',
then add it to Q'.
Step 4: In DFA, the final state will be all the states which contain F(final states of NFA)
Example: Convert the given NFA to a DFA. [transition table shown on slide]
DFA:
Since q1 is a final state in the given NFA, every DFA state in which q1 appears becomes a final state. Hence, in the DFA the final states are [q1] and [q0, q1]; therefore the set of final states is F = {[q1], [q0, q1]}.
ε-NFA
ε-NFA: an NFA with ε-moves.
It is a five tuple Machine and represented as:
M = (Q, Σ, qo, F, δ)
Where,
Q – Non Empty finite set of states
Σ – Non Empty finite set of input symbols
qo – Start/Initial state
F – set of final states
δ – Transition function
δ : Q x (Σ ∪ {ε}) → 2^Q
Epsilon Closure:
The epsilon closure of a given state X is the set of states that can be reached from X using only ε (null) moves, including the state X itself.
Example:
ε-closure(A) : {A, B, C}
ε-closure(B) : {B, C}
ε-closure(C) : {C}
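A sketch of this computation for the example above (states A, B, C with ε-moves A → B and B → C), representing state sets as bit masks:

#include <stdio.h>

/* States A, B, C are bits 0, 1, 2; eps[i] is the set reached from state i
   by a single epsilon move (A -> B, B -> C, C -> nothing). */
enum { A, B, C, NSTATES };
static unsigned eps[NSTATES] = { 1u << B, 1u << C, 0 };

/* epsilon-closure of a set of states, computed as a fixed point */
static unsigned eps_closure(unsigned set)
{
    unsigned prev;
    do {
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s))
                set |= eps[s];         /* add states one eps move away */
    } while (set != prev);             /* stop when nothing new is added */
    return set;
}

int main(void)
{
    unsigned cl = eps_closure(1u << A);
    for (int s = 0; s < NSTATES; s++)
        if (cl & (1u << s))
            printf("%c ", 'A' + s);    /* prints: A B C */
    printf("\n");
    return 0;
}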
ε- NFA to NFA without ε moves
Steps for converting ε- NFA to NFA without ε moves:
Step-1: Find the ε-closure of the states qi where qi ∈Q
Step-2: Find the extended transition function as
δ'(q0, ε) = ε-closure(q0)
δ'(q0, a) = ε-closure(δ(δ'(q0, ε), a))
and repeat this for each input symbol.
Step-3: Draw the transition table and diagram using the resultant transitions.
Step-4: If the ε-closure of a state contains a final state of the ε-NFA, then make that state final.
Example: the ε-NFA has states q, r, s over inputs a, b, c: q loops to q on a, r loops to r on b, s loops to s on c, with ε-moves q → r and r → s. [transition diagram shown on slide]
Step 1: Find the ε-closures
ε-closure(q) = {q, r, s}
ε-closure(r) = {r, s}
ε-closure(s) = {s}
Step 2: Find δ' for all states
δ'(q,a) = ε-closure(δ(δ'(q, ε), a))
        = ε-closure(δ(ε-closure(q), a))
        = ε-closure(δ({q,r,s}, a))
        = ε-closure(δ(q,a) ∪ δ(r,a) ∪ δ(s,a))
        = ε-closure(q ∪ ∅ ∪ ∅)
        = ε-closure(q) = {q,r,s}
δ'(q,b) = ε-closure(δ(δ'(q, ε), b))
        = ε-closure(δ({q,r,s}, b))
        = ε-closure(δ(q,b) ∪ δ(r,b) ∪ δ(s,b))
        = ε-closure(∅ ∪ r ∪ ∅)
        = ε-closure(r) = {r,s}
δ'(q,c) = ε-closure(δ(δ'(q, ε), c))
        = ε-closure(δ({q,r,s}, c))
        = ε-closure(δ(q,c) ∪ δ(r,c) ∪ δ(s,c))
        = ε-closure(∅ ∪ ∅ ∪ s)
        = ε-closure(s) = {s}
δ'(r,a) = ε-closure(δ(ε-closure(r), a))
        = ε-closure(δ({r,s}, a))
        = ε-closure(δ(r,a) ∪ δ(s,a))
        = ε-closure(∅ ∪ ∅) = ∅
δ'(r,b) = ε-closure(δ({r,s}, b))
        = ε-closure(δ(r,b) ∪ δ(s,b))
        = ε-closure(r ∪ ∅) = ε-closure(r) = {r,s}
δ'(r,c) = ε-closure(δ({r,s}, c))
        = ε-closure(δ(r,c) ∪ δ(s,c))
        = ε-closure(∅ ∪ s)
        = ε-closure(s) = {s}
δ'(s,a) = ε-closure(δ(s,a)) = ε-closure(∅) = ∅
δ'(s,b) = ε-closure(δ(s,b)) = ε-closure(∅) = ∅
δ'(s,c) = ε-closure(δ(s,c)) = ε-closure(s) = {s}
ε-NFA to NFA without ε moves...
Step 3: Draw the transition table and diagram for all new states.
Let (q,r,s) = D, (r,s) = E, (s) = F.

            a    b    c
  ->*D      D    E    F
    *E      ∅    E    F
    *F      ∅    ∅    F

Step 4: Final states are D, E and F (each contains the final state s of the ε-NFA).
The subset construction marks each new DFA state T and computes its transitions (a runnable sketch follows):
begin
    mark T
    for each input symbol "a" do
    begin
        U = ε-closure(δ(T, a))
        Dtrans[T, a] = U
    end
end
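A runnable sketch of this construction in C for the ε-NFA of the worked example (δ(q,a) = {q}, δ(r,b) = {r}, δ(s,c) = {s}, ε-moves q → r and r → s); state sets are bit masks, and the discovered sets correspond to D = {q,r,s}, E = {r,s}, F = {s} above:

#include <stdio.h>

enum { Q, R, S, NST };                     /* states q, r, s as bits 0..2 */
enum { SA, SB, SC, NSYM };                 /* input symbols a, b, c */
static unsigned delta[NST][NSYM] = {
    { 1u << Q, 0, 0 },                     /* q: a -> q */
    { 0, 1u << R, 0 },                     /* r: b -> r */
    { 0, 0, 1u << S },                     /* s: c -> s */
};
static unsigned eps[NST] = { 1u << R, 1u << S, 0 };

static unsigned eps_closure(unsigned set)  /* fixed-point closure */
{
    unsigned prev;
    do {
        prev = set;
        for (int t = 0; t < NST; t++)
            if (set & (1u << t)) set |= eps[t];
    } while (set != prev);
    return set;
}

static unsigned move(unsigned set, int sym)   /* union of delta over the set */
{
    unsigned out = 0;
    for (int t = 0; t < NST; t++)
        if (set & (1u << t)) out |= delta[t][sym];
    return out;
}

int main(void)
{
    unsigned dstates[8];                   /* worklist of discovered DFA states */
    int n = 0;
    dstates[n++] = eps_closure(1u << Q);   /* start state: eps-closure(q) */
    for (int i = 0; i < n; i++)            /* "mark T" = process entry i */
        for (int a = 0; a < NSYM; a++) {
            unsigned u = eps_closure(move(dstates[i], a));   /* Dtrans[T,a] */
            int seen = 0;
            for (int j = 0; j < n; j++) if (dstates[j] == u) seen = 1;
            if (!seen && u) dstates[n++] = u;   /* add a new DFA state */
            printf("Dtrans[%d, %c] = %u\n", i, 'a' + a, u);
        }
    return 0;
}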
ε-NFA to DFA
Conversion of ε-NFA to DFA: [worked examples shown on slides]
Converting RE to NFA
Conversion of RE to NFA (Thompson Construction): [construction rules shown on slides]
Problem: Convert the RE (ab*c) / (a(b/c*)) to NFA. [solution shown on slides]
Pass and Phases of Translation
A compiler can have many phases and passes.
Pass : A pass refers to the traversal of a compiler through the entire program.
Phase : A phase of a compiler is a distinguishable stage, which takes input from the previous
stage, processes and yields output that can be used as input for the next stage.
Compiler passes are of two types:
1. Single Pass Compiler
2. Two Pass Compiler or Multi Pass Compiler
Single Pass Compiler (Narrow Compiler):
If we combine or group all the phases of compiler design in a single module, it is known as a single pass compiler.
A one-pass/single-pass compiler is a type of compiler that passes through each part of each compilation unit exactly once.
A single pass compiler is faster and smaller than a multi pass compiler.
A disadvantage of the single pass compiler is that it is less efficient in comparison with a multipass compiler.
Pass and Phases of Translation...
Multipass Compiler (Wide Compilers):
A two-pass/multi-pass compiler is a type of compiler that processes the source code of a program multiple times. In a multipass compiler the phases are divided into two passes:
In the first pass, the included phases are the lexical analyzer, syntax analyzer, semantic analyzer and intermediate code generator, which work as the front end.
The first pass is platform independent, because the output of the first pass is three address code, which is useful for every system.
In the second pass, the included phases are code optimization and code generation, which work as the back end. The synthesis part takes the three address code as input and converts it into low-level/assembly language.
The second pass is platform dependent, because the final stage of a typical compiler converts the intermediate representation of the program into an executable set of instructions, which depends on the system.
Bootstrapping
Bootstrapping is widely used in compiler development.
It is a process in which a simple language is used to translate a more complicated program, which in turn may handle an even more complicated program, and so on.
It is used to produce a self-hosting compiler.
A self-hosting compiler is a type of compiler that can compile its own source code, i.e., a compiler written in the source programming language that it intends to compile.
A compiler can be characterized by three languages:
1) Source Language
2) Target Language
3) Implementation Language
The T-diagram shows a compiler for source S and target T, implemented in language I.
A Cross Compiler is a compiler which runs on one machine and produces output for another machine.
[T-diagrams shown on slide]
LEX
LEX:
Lex is a program that generates lexical analyzer.
It is a Unix utility.
The lexical analyzer is a program that transforms an input stream into a sequence of
tokens.
Lex specifies tokens using Regular Expression.
The function of Lex is as follows:
1. Firstly, a lexical analyzer specification file, lex.l, is created in the Lex language. The Lex compiler then runs the lex.l program and produces a C program lex.yy.c.
2. Finally, the C compiler compiles the lex.yy.c program and produces an object program a.out.
3. a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
LEX...
The structure of LEX programs:
%{
Declarations
%}
%%
Rules
%%
Auxiliary Functions
Declaration Section:
The declarations section consists of two parts, auxiliary declarations and regular definitions.
The auxiliary declarations are copied as such by LEX to the output lex.yy.c file. This C code
consists of instructions to the C compiler and are not processed by the LEX tool.
The auxiliary declarations (which are optional) are written in C language and are enclosed
within ' %{ ' and ' %} ' .
It is generally used to declare functions, include header files, or define global variables and
constants.
LEX allows the use of short-hands and extensions to regular expressions for the regular definitions. A regular definition in LEX is of the form: D R, where D is the symbol representing the regular expression R.
LEX...
Rules:
Rules in a LEX program consists of two parts :
1. The pattern to be matched
2. The corresponding action to be executed
Patterns are defined using the regular expressions and actions can be specified using C Code.
The Rules can be given as
R1 {Action1}
R2 {Action2}
.
.
.
Rn {Action n}
Where Ri is RE and Action i is the action to be taken for corresponding RE.
Auxiliary Functions:
All the required procedures are defined in this section.
Note: Function yywrap is called by lex when input is exhausted. When the end of the file is
reached the return value of yywrap() is checked. If it is non-zero, scanning terminates and if it is
0 scanning continues with next input file.
LEX...
Lex program to count the tokens in a source program: [program shown on slide]
Note: yylex() matches the characters of the input with the regular expressions.
Compiler Design: Parsing
UNIT – II:
Top down Parsing: Context free grammars, Top down parsing – Backtracking, LL (1),
recursive descent parsing, Predictive parsing, Pre-processing steps required for
predictive parsing.
Bottom up parsing: Shift Reduce parsing, SLR, CLR and LALR parsing, Error
recovery in parsing, handling ambiguous grammar, YACC – automatic parser generator.
Role of a Parser:
Compiler Design: Parsing
A syntax analyzer is also known as parser.
A parser takes input in the form of a sequence of tokens from Lexical Analyzer and builds a
data structure in the form of a parse tree.
It verifies whether the string can be generated by the grammar for the source language.
It also reports any syntax errors in the source program.
A parser for a grammar G is a program that takes a string s as input and produces as output either
a parse tree for s, if s is a sentence of G, or
an error message indicating that s is not a sentence of G.
Types of Parsers:
1. Top down Parsers
2. Bottom up Parsers
A top down parser builds the parse tree from the root to the leaves.
A bottom up parser builds the parse tree from the leaves to the root.
In both cases the input is scanned from left to right, one symbol at a time.
Context Free Grammar (CFG)
A context-free grammar (CFG) consisting of a finite set of grammar rules is a quadruple
G=(V, T, P, S)
where
V is a set of non-terminals.
T is a set of terminals.
P is a set of Production rules,
P: V → (V ∪ T)*
S is the start symbol.
A context-free grammar is a set of recursive rules used to generate patterns of strings.
The language generated using Context Free Grammar is called as Context Free Language.
Example:
G = (V , T , P , S)
Where,
V={S}
T={a,b}
P = { S → aSbS , S → bSaS , S → ∈ }
S = S (the start symbol)
Parse Tree / Syntax Tree / Derivation Tree
Parse Tree:
The diagrammatical representation of a derivation is called as a parse tree or derivation tree.
Root node of a parse tree is the start symbol of the grammar.
Each leaf node of a parse tree represents a terminal symbol.
Each interior node of a parse tree represents a non-terminal symbol.
Concatenating the leaves of a parse tree from the left produces a string of terminals, called
as yield of a parse tree.
Example:
Construct the parse tree for the string w = a+a+a, given
G: E → E+E | E*E | E | a
[parse tree shown on slide]
Derivations
Derivation: Starting with the start symbol, non-terminals are rewritten using production rules until only terminals remain.
There are two types of derivations:
1. Left Most Derivation (LMD)
2. Right Most Derivation (RMD)
Left Most Derivation (LMD):
A leftmost derivation is obtained by applying a production rule to the leftmost variable/non-terminal in each step of the derivation.
Example:
Let the production rules in a CFG be
X → X+X | X*X | X | a
The leftmost derivation for the string "a+a*a" is
X → X+X
  → a+X        (using X → a)
  → a+X*X      (using X → X*X)
  → a+a*X      (using X → a)
  → a+a*a      (using X → a)
Derivations
Right Most Derivation (RMD):
A rightmost derivation is obtained by applying a production rule to the rightmost variable/non-terminal in each step of the derivation.
Example:
Let the production rules in a CFG be
X → X+X | X*X | X | a
The rightmost derivation for the string "a+a*a" is
X → X*X
  → X*a        (using X → a)
  → X+X*a      (using X → X+X)
  → X+a*a      (using X → a)
  → a+a*a      (using X → a)
Derivations...
Example:
Consider the following grammar
G:
S → aB / bA
A → aS / bAA / a
B → bS / aBB / b
Find the LMD and RMD for the string w = aaabbabbba.
LMD:
S → aB
→ aaBB (Using B → aBB)
→ aaaBBB (Using B → aBB)
→ aaabBB (Using B → b)
→ aaabbB (Using B → b)
→ aaabbaBB (Using B → aBB)
→ aaabbabB (Using B → b)
→ aaabbabbS (Using B → bS)
→ aaabbabbbA (Using S → bA)
→ aaabbabbba (Using A → a)
Derivations...
Example:
Consider the following grammar
G:
S → aB / bA
A → aS / bAA / a
B → bS / aBB / b
Find the LMD and RMD for the string w = aaabbabbba.
RMD:
S → aB
→ aaBB (Using B → aBB)
→ aaBaBB (Using B → aBB)
→ aaBaBbS (Using B → bS)
→ aaBaBbbA (Using S → bA)
→ aaBaBbba (Using A → a)
→ aaBabbba (Using B → b)
→ aaaBBabbba (Using B → aBB)
→ aaaBbabbba (Using B → b)
→ aaabbabbba (Using B → b)
Ambiguous Grammar
Ambiguous Grammar:
A grammar is said to be ambiguous if, for some string generated by it, it produces more than one parse tree, or leftmost derivation (LMD), or rightmost derivation (RMD).
Example: Consider the following grammar:
E → E + E / E x E / id
Let w = id + id x id be a string generated by G. [two distinct parse trees shown on slide]
Left Recursion
(Recall: a left-recursive pair A → Aα / β is replaced by A → βA', A' → αA' / ε.)
Example 1: Eliminate the left recursion in the grammar
G: A → ABd / Aa / a
   B → Be / b
   C → c
After eliminating left recursion, we get
A → aA'
A' → Bd A' / a A' / ε
B → bB'
B' → e B' / ε
C → c
Example 2: Eliminate left recursion in the grammar
E → E+T / T
T → T*F / F
F → id
The grammar after eliminating left recursion is
E → TE'
E' → +TE' / ∈
T → FT'
T' → *FT' / ∈
F → id
Left Factoring
2) Left Factoring:
It is a grammar transformation that is useful for producing a grammar suitable for predictive parsing.
If A → αβ1 / αβ2 are two A-productions and both productions start with the same string on the RHS, then such a grammar is said to have common prefixes.
Left factoring is the process by which a grammar with common prefixes is transformed to make it useful for top down parsers.
Example 1: Do left factoring in the following grammar:
S → iEtS / iEtSeS / a
E → b
The left factored grammar is:
S → iEtSS' / a
S' → eS / ∈
E → b
Top Down Parsing
Example 2: Do left factoring in the following grammar:
A → aAB / aBc / aAc
Solution:
Step 1:
A → aA'
A' → AB / Bc / Ac
Again, this is a grammar with common prefixes.
Step 2:
A → aA'
A' → AD / Bc
D → B / c
This is a left factored grammar.

Example 3: Do left factoring in the following grammar:
S → bSSaaS / bSSaSb / bSb / a
Solution:
Step 1:
S → bSS' / a
S' → SaaS / SaSb / b
Again, this is a grammar with common prefixes.
Step 2:
S → bSS' / a
S' → SaA / b
A → aS / Sb
This is a left factored grammar.
Top Down Parsing
Top Down Parsers:
Top-down parsers build parse trees from the top (root) to the bottom (leaves).
Top down parsers are classified as follows:
1. Backtracking parsers
2. Predictive parsers: the Recursive Descent Parser and the LL(1) Parser
Backtracking:
Top-down parsers start from the root node (start symbol) and match the input string against the production rules to replace them (if matched).
It means that if one derivation of a production fails, the syntax analyzer restarts the process using different rules of the same production.
This technique may process the input string more than once to determine the right production.
Recursive Descent Parsing
Example:
G:
S → rXd | rZd
X → oa | ea
Z → ai
and input w = "read".
It will start with S from the production rules and will match its yield to the left-most letter of the input, i.e. 'r'.
The very first production of S (S → rXd) matches with it. So the top-down parser advances to the next input letter (i.e. 'e').
The parser tries to expand non-terminal 'X' and checks its production from the left (X → oa). It does not match with the next input symbol. So the top-down parser backtracks to obtain the next production rule of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is accepted and parsing is successful.
Recursive Descent Parsing
Recursive Descent Parser:
It is a top-down parser that builds the parse tree from the top down, starting with the start non-terminal.
It is a Predictive Parser where no Backtracking is required.
In this parsing technique each non terminal is associated with a recursive procedure.
The RHS of the production rule is directly converted to code of the respective
procedure.
If the RHS of the production rule contains a non-terminal, the parser invokes the respective procedure.
If it is a terminal, then it is matched with the lookahead from the input string; the lookahead pointer is moved one position to the right if a match is found.
These procedures are responsible for matching the non terminal with next part of the input.
If the production rule have many alternatives then all the alternatives are combined into a single
body of the procedure.
Since, it is a top down parsing technique the parser is activated by calling the procedure of start
symbol.
Recursive Descent Parsing...
Example:
G:
E --> i E'
E' --> + i E' | ε

E()
{
    if (lookahead == 'i')
    {
        match('i');
        E'();
    }
    else if (lookahead == '$')
        printf("Parsing Successful");
    else
        return error;
}

E'()
{
    if (lookahead == '+')
    {
        match('+');
        match('i');
        E'();
    }
}

match(char t)
{
    if (lookahead == t)
        lookahead = getchar();
    else
        printf("Error");
}
Predictive LL(1)
Predictive LL(1):
It is a non-recursive top down parser.
In LL(1), the first L represents scanning of the input from Left to Right.
The second L shows that this parsing technique uses a Leftmost derivation.
The 1 represents the number of lookahead symbols, i.e., how many input symbols are examined when making a parsing decision.
The predictive parser has an input, a stack,
a parsing table, and an output.
The input contains the string to be parsed,
followed by $, the right end marker.
The stack contains a sequence of grammar symbols, preceded by $, the bottom-of-stack marker.
The Stack holds left most derivation.
The parsing table is a two dimensional array
M[A ,a], where A is a nonterminal, and
a is a terminal or the symbol $.
Predictive LL(1)
The parser is controlled by a program that behaves as follows:
The program determines X, the symbol on top of the stack, and 'a', the current input symbol.
These two symbols determine the action of the parser.
There are three possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
3. If X is a nonterminal, the program consults entry M[X, a] of the parsing table M. This entry will be either an X-production of the grammar or an error entry.
If M[X, a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with U on top).
If M[X, a] = error, the parser calls an error recovery routine.
A sketch of this driver loop follows.
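A minimal sketch of this driver in C, hard-coded with the LL(1) table that is constructed below for the grammar E → TE', E' → +TE' | ∈, T → FT', T' → *FT' | ∈, F → id | (E). Single characters stand in for symbols ('i' for id, 'e' for E', 't' for T'), an encoding chosen only for this illustration:

#include <stdio.h>
#include <string.h>

/* The table M[X,a] as a function: the body to push (reversed before pushing),
   "" for an epsilon production, NULL for an error entry. */
static const char *lookup(char X, char a)
{
    switch (X) {
    case 'E': return (a == 'i' || a == '(') ? "Te" : NULL;
    case 'e': return a == '+' ? "+Te" : (a == ')' || a == '$') ? "" : NULL;
    case 'T': return (a == 'i' || a == '(') ? "Ft" : NULL;
    case 't': return a == '*' ? "*Ft"
                   : (a == '+' || a == ')' || a == '$') ? "" : NULL;
    case 'F': return a == 'i' ? "i" : a == '(' ? "(E)" : NULL;
    }
    return NULL;
}

static int parse(const char *ip)               /* input must end with '$' */
{
    char stack[64] = "$E";                     /* $ marker, start symbol on top */
    int top = 1;
    for (;;) {
        char X = stack[top], a = *ip;
        if (X == '$' && a == '$') return 1;    /* possibility 1: success */
        if (!strchr("EeTtF", X)) {             /* X is a terminal */
            if (X != a) return 0;              /* mismatch: error */
            top--; ip++;                       /* possibility 2: pop and advance */
        } else {                               /* possibility 3: consult M[X,a] */
            const char *prod = lookup(X, a);
            if (prod == NULL) return 0;        /* error entry */
            top--;                             /* pop X ... */
            for (int k = (int)strlen(prod) - 1; k >= 0; k--)
                stack[++top] = prod[k];        /* ... push the body reversed */
        }
    }
}

int main(void)
{
    printf("%s\n", parse("i+i*i$") ? "Parsing Successful" : "Error");
    return 0;
}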
First & Follow
The construction of a predictive parser is aided by two functions associated with a grammar G.
These functions, FIRST and FOLLOW, allow us to fill in the entries of the predictive parsing table for grammar G.
FIRST:
Steps for finding FIRST:
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ∈ is a production, then add ∈ to FIRST(X).
3. If X is a non-terminal and X → Y1 Y2 ... Yk is a production, then place 'a' in FIRST(X) if, for some i, 'a' is in FIRST(Yi) and ∈ is in all of FIRST(Y1), ..., FIRST(Yi-1). If ∈ is in FIRST(Yj) for all j = 1, 2, ..., k, then add ∈ to FIRST(X).
FOLLOW:
Steps for finding FOLLOW:
1. FOLLOW(S) = { $ }, where S is the starting non-terminal and $ is the input right end marker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ∈ is placed in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ∈, then everything in FOLLOW(A) is in FOLLOW(B).
First & Follow...
Example 1 and Example 2: [worked on slides]

Example 3:
S → A
A → aBA'
A' → dA' / ∈
B → b
C → g

First(S) = First(A) = { a }
First(A') = { d , ∈ }
First(B) = { b }
First(C) = { g }

Follow(S) = { $ }
Follow(A) = { $ }
Follow(A') = { $ }
Follow(B) = { d , $ }
Follow(C) = NA

Example 4:
S → AaAb / BbBa
A → ∈
B → ∈

First(S) = { a , b }
First(A) = { ∈ }
First(B) = { ∈ }

Follow(S) = { $ }
Follow(A) = { a , b }
Follow(B) = { a , b }
LL(1) Parsing table Construction
Steps involved in predictive parsing table construction (note: eliminate left recursion and apply left factoring first):
Step 1: For each production A → α of the grammar, do steps 2 and 3.
Step 2: For each terminal 'a' in FIRST(α), add A → α to M[A, a].
Step 3: If ∈ is in FIRST(α), add A → α to M[A, b] for each terminal 'b' in FOLLOW(A).
Step 4: Make each undefined entry of M be error.

Construct the LL(1) parsing table for the grammar
G: E → E+T | T
   T → T*F | F
   F → id | (E)
and parse the string id+id*id.
After eliminating left recursion:
E → TE'
E' → +TE' | ∈
T → FT'
T' → *FT' | ∈
F → id | (E)
Find FIRST and FOLLOW: [computed on slide]
LL(1) Parsing table Construction
G: E → TE'
   E' → +TE' | ∈
   T → FT'
   T' → *FT' | ∈
   F → id | (E)

        id          +            *            (          )          $
E       E → TE'                               E → TE'
E'                  E' → +TE'                            E' → ∈     E' → ∈
T       T → FT'                               T → FT'
T'                  T' → ∈      T' → *FT'                T' → ∈     T' → ∈
F       F → id                                F → (E)

Note: All undefined entries are errors.
LL(1) Parsing table Construction
Parsing the input string "id+id*id" using the LL(1) parser: [STACK / INPUT / OUTPUT trace shown on slide]
(By contrast, for the dangling-else grammar seen under left factoring, the entry M[S', e] contains multiple productions, so that grammar is not LL(1).)
LL(1) Parsing Example
Construct the LL(1) parsing table for the grammar and parse the string w = int*int.
G: E → T X
   X → + E | ∈
   T → int Y | ( E )
   Y → * T | ∈

        int           *          +          (          )         $
E       E → TX                              E → TX
X                                X → +E                X → ∈     X → ∈
T       T → int Y                           T → (E)
Y                     Y → *T     Y → ∈                 Y → ∈     Y → ∈

Parsing the string "int*int" using the parsing table: [trace shown on slide]
Bottom Up Parsing
Handle:
A handle is a substring that matches the right side of a production; we can reduce such a substring to the left-hand-side non-terminal of that production rule.
Example:
G: E → E + E
E→E*E
E→(E)
E → id
Bottom Up Parsers: [classification diagram shown on slide]
Shift Reduce Parser...
Final Configuration of SR Parser:
Stack is left with only the start symbol and the input buffer
becomes empty.(Successful Parsing)
An Error is detected.(Unsuccessful Parsing).
Example: Consider the following grammar:
S → S + S
S → S * S
S → id
Parse the string "id + id + id" using the SR parser. [trace shown on slide]
Shift Reduce Parser...
Example: Consider the following grammar:
E → E - E
E → E * E
E → id
Parse the input string id-id*id using a shift-reduce parser.

STACK        INPUT        ACTION
$            id-id*id$    Shift
$id          -id*id$      Reduce E → id
$E           -id*id$      Shift
$E-          id*id$       Shift
$E-id        *id$         Reduce E → id
$E-E         *id$         Shift
$E-E*        id$          Shift
$E-E*id      $            Reduce E → id
$E-E*E       $            Reduce E → E*E
$E-E         $            Reduce E → E-E
$E           $            Accept

Note: If the incoming operator has higher priority than the operator inside the stack, then shift; otherwise reduce. A sketch of this decision follows.
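The shift/reduce decision above can be expressed as a tiny helper. A sketch (the precedence values are assumptions for this example grammar):

#include <stdio.h>

static int prec(char op)                 /* '*' binds tighter than '+'/'-' */
{
    return op == '*' ? 2 : (op == '+' || op == '-') ? 1 : 0;
}

/* Shift if the incoming operator has higher priority than the stack
   operator; otherwise reduce. */
static int should_shift(char stack_op, char incoming_op)
{
    return prec(incoming_op) > prec(stack_op);
}

int main(void)
{
    /* '-' inside the stack, '*' incoming: shift, as in the trace above */
    printf("%s\n", should_shift('-', '*') ? "Shift" : "Reduce");
    return 0;
}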
Shift Reduce Parser...
Example: Consider the following grammar:
S → ( L ) | a
L → L , S | S
Parse the input string ( a , ( a , a ) ) using a shift-reduce parser. [trace shown on slide]
LR Parser...
The LR parser is a type of bottom up parser which is used to parse a large class of grammars.
In LR(K) parsing:
"L" stands for left-to-right scanning of the input,
"R" stands for constructing a rightmost derivation in reverse, and
"K" is the number of lookahead input symbols used to make parsing decisions.
LR Parser Model:
It consists of an Input , an Output, a Stack ,
LR Parser program and a Parsing table
which has two parts (Action and Goto).
The input buffer holds the input; the parser program reads characters from it one at a time.
The stack holds a sequence of the form s0 X1 s1 X2 s2 … Xm sm, where sm is on top.
Each Xi is a grammar symbol and each si is a state.
LR Parser...
The parser program driving the LR parser behaves as follows:
It determines sm, the state currently on top of the stack, and ai, the current input symbol.
It then consults action[sm, ai] in the parse table, which can have one of four values:
1. Shift s, where s is a state.
2. Reduce by a grammar production A → β.
3. Accept (successful parsing).
4. Error.
A configuration of an LR parser is a pair whose first component is the stack contents and whose second component is the input:
(stack, input)
The next move of the parser is determined by reading sm, the state currently on top of the stack, and ai, the current input symbol.
The parser then behaves based on the entry in the parser table.
SLR Parser...
The LR parser behaves as follows (parsing process):
1. If action[sm, ai] = shift s, then push the current input symbol ai and the next state s onto the stack, and advance the input one position to the right:
(s0 X1 s1 X2 s2 … Xm sm ai s, ai+1 … an $)
2. If action[sm, ai] = reduce A → β, then pop 2|β| symbols off the stack, push A, and push goto[s, A], where s is the state then on top.
3. If action[sm, ai] = accept, parsing is completed successfully.
4. If action[sm, ai] = error, then attempt recovery.
Types of LR Parsers:
1. SLR(1) Parser
2. CLR(1) Parser
3. LALR(1) Parser
All the above parsers follow the same parsing process.
SLR Parser...
LR(0) Items:
An LR(0) item is a production with a dot at some position in the right side of the production.
LR(0) items are useful to indicate how much of the input has been scanned up to a given point in the process of parsing.
For example, the production T → T * F leads to four LR(0) items:
T → . T * F
T → T . * F
T → T * . F
T → T * F .
The production A → ε generates only one item, [A → .].
Augmented Grammar:
If G is a grammar with start symbol S, then G', the augmented grammar for G, is G with a new start symbol S' and a production S' → S.
The purpose of this new starting production is to indicate to the parser when it should stop parsing and announce acceptance of the input.
Example: G: S → AA
            A → aA | b
The augmented grammar for the above grammar will be
G': S' → S
    S → AA
    A → aA | b
SLR Parser...
Closure Operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I by the
two rules:
1. Initially, every item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B → γ is a production, then add the item B → .γ to I if it is not already there. We apply this rule until no more items can be added to closure(I).
Example:
G': S' → S
    S → AA
    A → aA / b
closure(S' → .S) = { S' → .S, S → .AA, A → .aA, A → .b }
closure(S → .AA) = { S → .AA, A → .aA, A → .b }
closure(A → .aA / .b) = { A → .aA, A → .b }
SLR Parser...
Goto Operation:
goto(I, X) is the closure of the set of all items [A → αX.β] such that [A → α.Xβ] is in I. [illustrated on slide]
Canonical LR(0) Items:
Construct the SLR parsing table and parse the string id*id+id for the grammar
G: E → E+T | T
   T → T*F | F
   F → (E) | id
Augmented Grammar G':
E' → E
E → E+T
E → T
T → T*F
T → F
F → (E)
F → id
SLR Parser...
Goto Graph: [shown on slide]
Construction of the SLR parsing table: [shown on slide]
SLR Parsing Table: [shown on slide]
SLR parsing uses the numbered productions:
1. E → E+T    2. E → T
3. T → T*F    4. T → F
5. F → (E)    6. F → id
[parse of the string id*id+id shown on slide]
CLR Parser
CLR stands for canonical LR parser.
The grammar used for constructing this parser is called a CLR grammar or LR(1) grammar.
This parser uses LR(1) items to represent the states of the parser.
The LR(1) items are of the form [A → α.Xβ, a], having two components:
LR(1) item = LR(0) item + lookahead.
The first component is an LR(0) item indicating up to what position in the grammar rule parsing has been completed.
The second component is a terminal or $, which represents the actual follow (lookahead).
Closure operation on LR(1) items:
1. Start with closure(I) = I.
2. If [A → α.Bβ, a] is in closure(I), then for each production B → γ in the grammar and each terminal b in FIRST(βa), add the item [B → .γ, b] to I if it is not already in I.
3. Repeat 2 until no new items can be added.
Goto operation on LR(1) items:
1. For each item [A → α.Xβ, a] in I, add the set of items closure({[A → αX.β, a]}) to goto(I, X) if not already there.
2. Repeat step 1 until no more items can be added to goto(I, X).
CLR Parser...
Construction of the canonical set of LR(1) items:
Example: Construct the CLR parsing table and parse the string "adad" for the grammar
G: S → CC
   C → aC
   C → d
(For an item [A → α.Bβ, a], the added B-items take their lookaheads from FIRST(βa).)
Augmented Grammar G':
S' → S
S → CC
C → aC
C → d
CLR Parser...
I0: S' → .S , $
    S → .CC , $
    C → .aC , a/d
    C → .d , a/d
I1 = goto(I0, S):
    S' → S. , $
I2 = goto(I0, C):
    S → C.C , $
    C → .aC , $
    C → .d , $
I3 = goto(I0, a):
    C → a.C , a/d
    C → .aC , a/d
    C → .d , a/d
I4 = goto(I0, d):
    C → d. , a/d
I5 = goto(I2, C):
    S → CC. , $
I6 = goto(I2, a):
    C → a.C , $
    C → .aC , $
    C → .d , $
I7 = goto(I2, d):
    C → d. , $
I8 = goto(I3, C):
    C → aC. , a/d
I9 = goto(I6, C):
    C → aC. , $
CLR Parser...
Goto Graph: [goto graph over I0–I9 shown on slide, e.g. I0 —S→ I1, I0 —d→ I4]
CLR Parser...
CLR Parsing Table and parsing of the string "adad": [table and trace shown on slide; the table is built from the item sets I0–I9 above]
A runnable sketch of a table-driven LR driver for this grammar follows.
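A sketch of such a table-driven LR driver in C for this grammar (productions 1: S → CC, 2: C → aC, 3: C → d). The action and goto tables below were derived by hand from the item sets I0–I9 above, so treat the encoding as illustrative rather than authoritative:

#include <stdio.h>

enum { SHIFT, REDUCE, ACCEPT, ERROR };
struct action { int kind, num; };          /* num: target state or production */

static struct action act(int s, char a)    /* action[s, a] */
{
    struct action sh3 = {SHIFT, 3}, sh4 = {SHIFT, 4},
                  sh6 = {SHIFT, 6}, sh7 = {SHIFT, 7}, err = {ERROR, 0};
    switch (s) {
    case 0: return a == 'a' ? sh3 : a == 'd' ? sh4 : err;
    case 1: return a == '$' ? (struct action){ACCEPT, 0} : err;
    case 2: return a == 'a' ? sh6 : a == 'd' ? sh7 : err;
    case 3: return a == 'a' ? sh3 : a == 'd' ? sh4 : err;
    case 4: return (a == 'a' || a == 'd') ? (struct action){REDUCE, 3} : err;
    case 5: return a == '$' ? (struct action){REDUCE, 1} : err;
    case 6: return a == 'a' ? sh6 : a == 'd' ? sh7 : err;
    case 7: return a == '$' ? (struct action){REDUCE, 3} : err;
    case 8: return (a == 'a' || a == 'd') ? (struct action){REDUCE, 2} : err;
    case 9: return a == '$' ? (struct action){REDUCE, 2} : err;
    }
    return err;
}

static int go_to(int s, char A)            /* goto[s, A] for non-terminals */
{
    if (A == 'S') return s == 0 ? 1 : -1;
    return s == 0 ? 2 : s == 2 ? 5 : s == 3 ? 8 : s == 6 ? 9 : -1;  /* A=='C' */
}

static const char lhs[]     = { 0, 'S', 'C', 'C' };  /* heads of productions */
static const int  rhs_len[] = { 0,  2,   2,   1  };  /* |body| per production */

int main(void)
{
    const char *ip = "adad$";
    int stack[64] = {0}, top = 0;          /* state stack; start state 0 */
    for (;;) {
        struct action m = act(stack[top], *ip);
        if (m.kind == SHIFT)       { stack[++top] = m.num; ip++; }
        else if (m.kind == REDUCE) {       /* pop |body| states, push goto */
            top -= rhs_len[m.num];
            stack[top + 1] = go_to(stack[top], lhs[m.num]);
            top++;
            printf("reduce by production %d\n", m.num);
        }
        else if (m.kind == ACCEPT) { printf("Accept\n"); return 0; }
        else                       { printf("Error\n");  return 1; }
    }
}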
LALR Parser
LALR stands for LookAhead LR parser.
The LALR parsing table construction is the same as the CLR parsing table construction, except that sets of LR(1) items having the same core components (i.e., identical first components) are detected and merged together as a single state in the parsing table.
With this parsing method, the parse table is considerably smaller than the CLR parsing table.
Example:
In the CLR example, the pairs (I3, I6), (I4, I7) and (I8, I9) have the same core components.
I3 and I6 are merged as I36:
    C → a.C , a/d/$
    C → .aC , a/d/$
    C → .d , a/d/$
I4 and I7 are merged as I47:
    C → d. , a/d/$
I8 and I9 are merged as I89:
    C → aC. , a/d/$
LALR Parser
LALR Parsing Table and parsing of the string "adad": [shown on slide]
Error recovery in parsing
What should happen when your parser finds an error in the user's input?
Stop immediately and signal an error .
Record the error but try to continue.
Error Recovery Strategies:
1. Panic Mode
2. Phrase Level
3. Error Productions
4. Global Correction
1. Panic Mode:
When a parser encounters an error anywhere in a statement, it ignores the rest of the statement, not processing the input from the erroneous point until a synchronizing token is reached.
Typical synchronizing tokens are delimiters, such as a semicolon or an opening or closing parenthesis.
It is the simplest method to implement.
When multiple errors in the same statement are rare, this method is quite adequate.
2. Phrase Level :
On discovering an error, a parser may perform local correction on the remaining input.
For example, it may replace a prefix of the remaining input by some string that allows the parser to continue.
Error recovery in parsing
A typical local correction would be to:
Replace a comma by a semicolon,
Delete an extraneous semicolon, or
Insert a missing semicolon.
Major drawback: Situations in which the actual error has occurred before the point of detection.
3. Error Productions :
If we have a good idea of the common errors then augment the grammar with error
productions that generate the erroneous constructs.
Use the grammar augmented by these error productions to construct a parser.
If an error production is used by the parser, generate an appropriate error diagnostic
message.
4. Global Correction:
The parser examines the whole program and tries to find out the closest match for it which
is error free.
When an erroneous input (statement) X is fed, it creates a parse tree for some closest error-
free statement Y.
This may allow the parser to make minimal changes in the source code, but due to the
complexity (time and space) of this strategy, it has not been implemented in practice yet.
YACC –Automatic Parser Generator
YACC stands for Yet Another Compiler Compiler.
It is used to produce the source code of a syntactic analyzer for a language defined by an LALR(1) grammar.
The input of YACC is the rules or grammar, and the output is a C program.
The Unix command yacc translate.y transforms the YACC specification file translate.y into a C program called y.tab.c, which is a representation of an LALR parser written in C.
By compiling y.tab.c along with the ly library, we get the desired object program a.out that performs the operation defined by the original YACC program.
A YACC source program contains three parts:
Declarations
%%
Translation rules
%%
Supporting C rules
Declarations Part:
This part of YACC has two sections; both are optional.
The first section has ordinary C declarations, delimited by %{ and %}.
This section typically contains include statements.
In the second section we can declare the grammar tokens, e.g. %token DIGIT.
Tokens declared in this section can be used in the second and third parts of the YACC specification.
YACC –Automatic Parser Generator...
Translation rules:
This part contains translation rules and associated semantic actions.
This part is enclosed between %% &%%.
A set of productions:
<head> -> <body1> | <body2> | ….. | <body n>
would be written in YACC as
<head> : <body1> {<semantic action>1}
| <body2> {<semantic action>2}
…..
| <body n> {<semantic action>n}
;
The semantic action of YACC is a set of C statements. In a semantic action, the symbol $$
denotes the attribute value associated with the head's non-terminal,
while $i denotes the value associated with the ith grammar symbol of the body.
Supporting C rules:
The third part of a YACC Specification consists of supporting C- routines.
A lexical analyzer by the name yylex() must be provided.
Example:
%{
#include <ctype.h>
%}
%token DIGIT
%%
line : expr '\n' { printf("%d\n", $1); }
     ;
expr : expr '+' term { $$ = $1 + $3; }
     | term
     ;
term : term '*' factor { $$ = $1 * $3; }
     | factor
     ;
factor : '(' expr ')' { $$ = $2; }
       | DIGIT
       ;
%%
yylex()
{
    int c;
    c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';
        return DIGIT;
    }
    return c;
}
Compiler Design
UNIT – III:
Semantic analysis: Intermediate forms of source Programs – abstract syntax tree, polish notation
and three address codes. Attributed grammars, Syntax directed translation,
Conversion of popular programming language constructs into intermediate code forms, Type checker.
Semantic analysis:
Semantic Analysis is the third phase of Compiler.
It makes sure that declarations and statements of program are semantically correct.
Both syntax tree of previous phase and symbol table are used to check the consistency of the
given code.
It gathers type information and stores it in either syntax tree or symbol table. This type
information is subsequently used by compiler during intermediate-code generation.
Type checking is an important part of semantic analysis .
Errors recognized by semantic analyzer are as follows:
Type mismatch
Undeclared variables
Reserved identifier misuse
Semantic Analysis
Functions of Semantic Analysis:
Type Checking:
Ensures that data types are used in a way consistent with their definition.
Label Checking:
Ensures that the labels referenced in a program are defined.
Flow Control Check:
Keeps a check that control structures are used in a proper manner (example: no break statement outside a loop).
Example:
float x = 10.1;
float y = x * 30;
In the above example, the integer 30 will be type-cast to the float 30.0 before the multiplication by the semantic analyzer.
Intermediate forms of source Program
An Intermediate source form is an internal form of a program created by compiler while
translating the source program from high level language to assembly level or machine level
code.
Intermediate representation of source program can be done using:
I. Abstract Syntax Tree
II. Postfix Notation
III. Three Address Code
Abstract Syntax Tree:
It is a tree structure representation of the abstract syntactic structure of source code written in
a programming language.
Each node of a tree denotes a construct
occurring in the source code.
This hierarchal structure consists
of operands in leaf nodes and operators
in the interior nodes.
The operator that will be evaluated first is placed near the bottom of the tree.
The operator that will be evaluated last is placed at the root of the tree.
[syntax tree for id+id*id shown on slide]
Intermediate forms of source Program
Postfix Notation:
It is a notation form for expressing arithmetic, logic and algebraic equations.
Its most basic distinguishing feature is that operators are placed on the right of their
operands.
It is a linearised form of the syntax tree.
Syntax tree can be converted into a postfix notation and vice versa.
Example – the postfix representation of an expression:
Infix notation: (a - b) * (c + d) + (a - b)
Postfix notation: ab- cd+ * ab- +
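Because the operands precede their operators, a postfix string can be evaluated with a single stack, which is one reason it is a convenient intermediate form. A small sketch for single-digit operands (the encoding is an assumption of this example):

#include <ctype.h>
#include <stdio.h>

/* Evaluate a postfix string of single-digit operands and + - * operators. */
static int eval_postfix(const char *p)
{
    int stack[64], top = -1;
    for (; *p; p++) {
        if (isdigit((unsigned char)*p)) {
            stack[++top] = *p - '0';                  /* operand: push */
        } else {
            int b = stack[top--], a = stack[top--];   /* pop two operands */
            stack[++top] = *p == '+' ? a + b
                         : *p == '-' ? a - b : a * b; /* apply the operator */
        }
    }
    return stack[top];
}

int main(void)
{
    /* infix (1-2)*(3+4)+(1-2)  ->  postfix 12-34+*12-+  ->  -8 */
    printf("%d\n", eval_postfix("12-34+*12-+"));
    return 0;
}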
Three Address Code:
Three-address code is used to represent an intermediate code.
Three address code is a sequence of statements of the general form:
x=y op z
Each instruction in three address code consist of
At most three addresses or operands
At most one operator to represent an expression excluding the assignment operator
Value computed at each instruction is stored in a temporary variable generated by the compiler.
Intermediate forms of source Program
Example:
a= (-c * b) + (-c * d)
Three address code is :
t1 = -c
t2 = b * t1
t3 = -c
t4 = d * t3
t5 = t2 + t4
a = t5
Implementation of Three Address Code:
There are 3 representations of three address code:
1. Quadruple
2. Triples
3. Indirect Triples
Quadruple:
It is a record structure consists of 4 fields namely op, arg1, arg2 and result.
op denotes the operator and arg1 and arg2 denotes the two operands and result is used to
store the result of the expression.
Intermediate forms of source Program
The contents of fields arg1, arg2 and result are pointers to the symbol table entries for the
names represented by these fields.
Temporary names must be entered into symbol tables as they are created.
Example: a = -c * b + -c * b [quadruple table shown on slide; a C sketch of the record follows]
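A sketch of the quadruple record as a C structure, filled in for the example above (the field and type names are assumptions of this illustration):

#include <stdio.h>

/* One three-address instruction in quadruple form. */
struct quad {
    const char *op;       /* operator: "uminus", "*", "+", "=" */
    const char *arg1;     /* first operand (symbol-table name or temporary) */
    const char *arg2;     /* second operand, or NULL for unary/copy */
    const char *result;   /* temporary or target name */
};

int main(void)
{
    /* quadruples for a = -c * b + -c * b */
    struct quad code[] = {
        { "uminus", "c",  NULL, "t1" },
        { "*",      "t1", "b",  "t2" },
        { "uminus", "c",  NULL, "t3" },
        { "*",      "t3", "b",  "t4" },
        { "+",      "t2", "t4", "t5" },
        { "=",      "t5", NULL, "a"  },
    };
    for (int i = 0; i < 6; i++)
        printf("(%s, %s, %s, %s)\n", code[i].op, code[i].arg1,
               code[i].arg2 ? code[i].arg2 : "-", code[i].result);
    return 0;
}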
Triples:
This representation doesn't make use of an extra temporary variable to represent a single operation; instead, when a reference to another triple's value is needed, a pointer to that triple is used.
It consists of only three fields, namely op, arg1 and arg2.
The fields arg1 and arg2 are either pointers to the symbol table or pointers into the triple structure.
Intermediate forms of source Program
Example: a = -c * b + -c * b [triple table shown on slide]
Indirect Triples:
This representation makes use of pointers to a separately made and stored listing of all references to computations.
It is similar in utility to the quadruple representation but requires less space. Temporaries are implicit and the code is easier to rearrange.
Example: a = -c * b + -c * b [indirect triple table shown on slide]
Attribute Grammar
Attribute grammar is a special form of context-free grammar where some additional
information (attributes) are appended to one or more of its non-terminals in order to provide
context-sensitive information.
A finite, possibly empty set of attributes is associated with each distinct symbol in the grammar.
Each attribute has well-defined domain of values, such as integer, float, character, string, etc.
It is a medium to provide semantics to the context-free grammar and it can
help specify the syntax and semantics of a programming language.
It can pass values or information among the nodes of a parse tree.
Example: [attribute grammar example shown on slide]
Attribute Grammar...
Synthesized attributes:
These attributes get values from the attribute values of their child nodes.
Ex: S → ABC
If S is taking values from its child nodes (A,B,C), then it is said to be a synthesized attribute, as
the values of ABC are synthesized to S.
As in our previous example (E → E + T), the parent node E gets its value from its child node.
Synthesized attributes never take values from their parent nodes or any sibling nodes.
Inherited attributes:
In contrast to synthesized attributes, inherited attributes can take values from parent and/or
siblings.
Ex: S → ABC
A can get values from S, B and C. B can take values from S, A, and C. Likewise, C can take
values from S, A, and B.
Syntax Directed Translation:
A syntax directed translation scheme embeds program fragments, called semantic actions, within the production bodies.
Ex: E → E + T { print '+' }
    F → id { print id.val }
Semantic actions are enclosed within curly braces.
Syntax Directed Definition:
In syntax directed definition, the grammar is associated with some notations called as
semantic rules.
Grammar + semantic rule = SDD
In SDD, grammar symbols are associated with attributes and productions are associated with semantic rules.
Example: [SDD table shown on slide; num.lexval is the attribute returned by the lexical analyzer]
S-Attributed and L-Attributed Definition
S-Attributed Definition:
If an SDT uses only synthesized attributes, it is called as S-attributed SDT.
S-attributed SDTs are evaluated in bottom-up parsing, as the values of the parent nodes depend
upon the values of the child nodes.
Semantic actions are placed at the rightmost place of the RHS.
L-attributed SDT:
If an SDT uses both synthesized attributes and inherited attributes with a restriction that
inherited attribute can inherit values from left siblings only, it is called as L-attributed SDT.
Attributes in L-attributed SDTs are evaluated by depth-first and left-to-right parsing manner.
Semantic actions are placed anywhere in RHS.
For example:
A → XYZ { Y.S = A.S, Y.S = X.S, Y.S = Z.S } is not an
L-attributed grammar, since Y.S = A.S and Y.S = X.S are
allowed, but Y.S = Z.S violates the L-attributed SDT
definition because the attribute inherits a value from its
right sibling.
Annotated Parse Tree... [example shown on slide]
Type Checking...
Type checking is the process of verifying that each operation executed in a program respects
the type system of the language.
There are two types of type checking:
1. Static Type Checking
2. Dynamic Type checking
Static type checking is performed during compile time; it means that the type of a variable is known at compile time.
For some languages, the programmer must specify what type each variable is (e.g., C, C++, Java).
In static typing, variables generally are not allowed to change type.
Dynamic type checking is performed at runtime.
For example, Python is a dynamically typed language. It means that the type of a variable is
allowed to change over its lifetime. Other dynamically typed languages are -Perl, Ruby, PHP,
JavaScript etc.
Type Checking...
Type checking of expressions: [rules shown on slide]
Type checking of statements: [rules shown on slide]
Type checking of functions: [rules shown on slide]
Compiler Design
UNIT – IV:
Symbol Tables: Symbol table format, organization for block structures languages, hashing, tree
structures representation of scope information. Block structures and non block structure storage
allocation: static, Runtime stack and heap storage allocation, storage allocation for arrays, strings
and records.
Code optimization: Consideration for Optimization, Scope of Optimization, local optimization,
loop optimization, frequency reduction, folding, DAG representation.
Symbol Table:
Symbol table is an important data structure used in a compiler.
Symbol table is used to store the information about the occurrence of various entities such as
objects, classes, variable name, interface, function name etc.
It is used by both the analysis and synthesis phases.
The symbol table is used by the various phases of the compiler as follows:
Lexical Analysis: Creates new entries in the table, for example entries for tokens.
Syntax Analysis: Adds information regarding attribute type, scope, dimension, line of reference, use, etc. in the table.
Symbol Table...
Semantic Analysis: Uses the available information in the table to check semantics, i.e., to verify that expressions and assignments are semantically correct (type checking), and updates it accordingly.
Intermediate Code Generation: Refers to the symbol table to know how much and what type of run-time storage is allocated; the table helps in adding temporary variable information.
Code Optimization: Uses information present in the symbol table for machine-dependent optimization.
Code Generation: Generates code by using the address information of identifiers present in the table.
Symbol Table Operations: [insert and lookup operations shown on slide]
Symbol Table...
Symbol Table Format:
A symbol table consists of names and their properties such as type, value, size, scope, etc.
[record layout: Name | Properties/Attributes]
There are two types of name representations:
1. Fixed Length Name
2. Variable Length Name
Symbol Table...
2. Variable Length Name Representation:
A fixed space is not allocated for the name in the symbol table.
The name is stored with the help of the starting index and the length of each name.
Example:
Instead of storing the names SUM, A, B and MAX in the symbol table directly, the names are stored in an array, separated by a delimiter ($): S U M $ A $ B $ M A X $
The starting index of each name in the array and its length (including the delimiter) are stored in the name field of the symbol table:

Starting Index    Length
0                 4        (SUM$)
4                 2        (A$)
6                 2        (B$)
8                 4        (MAX$)
Symbol Table...
Organization for Block Structured Languages:
A block structured language is a kind of language in which sections of source code are enclosed within matching pairs of delimiters such as "{" and "}" or begin and end.
Such a section gets executed as one unit, or as one procedure or function, or it may be controlled by some conditional statement (if, while, do-while).
Normally, block structured languages support the structured programming approach.
Examples: C, C++, JAVA, ALGOL, PASCAL, etc.
Non-block structured languages do not contain any blocks; examples are LISP, FORTRAN and SNOBOL.
Implementation of Symbol Table:
The following data structures are used for organization of block structured languages:
1. Linear List
2. Self-Organizing List
3. Hashing
4. Tree Structure
Symbol Table...
1. Linear List:
Linear list of records is the easiest way to implement the symbol table.
In this method, an array is used to store names and associated information.
The new names are added to the symbol table in the order they arrive.
The pointer "available" is maintained at the end of all stored records.
To retrieve information about some name, we start from the beginning of the array and search up to the available pointer. If we reach the available pointer without finding the name, we get an error: "use of undeclared name".
While inserting a new name we should ensure that it is not already present. If it is already present, another error occurs: "Multiple Defined Name".
A minimal sketch of this organization follows.
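The sketch below is an illustration in C; the record layout and capacity are assumptions of this example:

#include <stdio.h>
#include <string.h>

#define MAX 100

struct record { char name[32]; char type[16]; };

static struct record table[MAX];
static int available = 0;              /* index just past the last record */

/* Search from the beginning of the array up to the available pointer. */
static int lookup(const char *name)
{
    for (int i = 0; i < available; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;                         /* "use of undeclared name" */
}

/* Insert a new name only if it is not already present. */
static int insert(const char *name, const char *type)
{
    if (lookup(name) != -1)
        return -1;                     /* "Multiple Defined Name" */
    strcpy(table[available].name, name);
    strcpy(table[available].type, type);
    return available++;                /* advance the available pointer */
}

int main(void)
{
    insert("sum", "int");
    insert("max", "float");
    printf("%d %d\n", lookup("max"), lookup("avg"));   /* prints: 1 -1 */
    return 0;
}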
Symbol Table...
2. Self Organizing List:
In this method, the symbol table is implemented using a linked list.
A link field is added to each record.
We search the records in the order indicated by the link fields.
A pointer "First" is maintained to point to the first record of the symbol table.
When a name is referenced or created, its record is moved to the front of the list.
The most frequently referred names therefore tend to be at the front of the list, so
access time for the most frequently referred names is the least.
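A minimal C sketch of the move-to-front lookup (the node layout is an illustrative assumption):

#include <string.h>

struct node {
    char name[32];
    /* ... attribute fields ... */
    struct node *link;    /* link field added to each record */
};

struct node *First = 0;   /* points to the first record      */

/* search in link order; on a hit, move the record to the front */
struct node *lookup(const char *name)
{
    struct node *prev = 0, *cur = First;
    while (cur) {
        if (strcmp(cur->name, name) == 0) {
            if (prev) {               /* unlink and move to front */
                prev->link = cur->link;
                cur->link = First;
                First = cur;
            }
            return cur;
        }
        prev = cur;
        cur = cur->link;
    }
    return 0;                          /* name not found */
}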
Symbol Table...
4. Tree Structure:
When the scope information is kept in a hierarchical manner, it forms a tree structure
representation, which is an efficient approach to symbol table organization.
This organization uses a binary search tree for storing the names in the symbol table.
Two link fields, left and right, are added to each record of the search tree.
Whenever a name is to be added, the name is first searched for in the tree.
If it does not exist, a record for the new name is created and added at the proper position.
Each node of the tree holds the name, its attributes, and the left and right links.
(Figure: a binary search tree containing the names a, total, v and c.)
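A minimal C sketch of insertion into such a binary search tree (the field names are
illustrative assumptions):

#include <stdlib.h>
#include <string.h>

struct tnode {
    char name[32];
    /* ... attribute fields ... */
    struct tnode *left, *right;   /* the two added links */
};

/* search first; if the name does not exist, create a record
   and add it at the proper position */
struct tnode *insert(struct tnode *root, const char *name)
{
    if (root == 0) {                       /* proper position found */
        struct tnode *n = calloc(1, sizeof *n);
        strcpy(n->name, name);
        return n;
    }
    int cmp = strcmp(name, root->name);
    if (cmp < 0)
        root->left = insert(root->left, name);
    else if (cmp > 0)
        root->right = insert(root->right, name);
    /* cmp == 0: the name already exists, nothing is added */
    return root;
}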
Block Structured and Non-Block Structured Storage Allocation...
Storage allocation refers to the process of mapping data and code onto appropriate locations in
main memory.
The compiler must carry out the storage allocation and provide access to variables and data.
Storage allocation strategies are:
I. Static Storage Allocation
If memory for a program is created at compile time, it is created in the static area,
and it is created only once.
Static allocation does not support dynamic data structures: memory is created at compile time and
deallocated only after program completion.
One drawback of static storage allocation is that recursion is not supported.
Another drawback is that the size of the data must be known at compile time.
E.g., FORTRAN was designed to permit static storage allocation.
II. Stack Storage Allocation
Stack allocation is a procedure in which a stack is used to organize the storage.
The stack used in stack allocation is known as the control stack.
In this type of allocation, data objects are created dynamically.
Activation records are created for the allocation of memory.
Block Structured and Non-Block Structured Storage Allocation...
These activation records are pushed onto the stack in Last In First Out (LIFO) order.
Locals are stored in the activation records at run time, and memory addressing is done using
pointers and registers.
Recursion is supported in stack allocation.
An activation record contains 7 fields:
1. Return Value: Used by the called procedure to return
a value to the calling procedure.
2. Actual Parameters: Used by the calling procedure to
supply parameters to the called procedure.
3. Control Link: An optional field that points to the activation
record of the caller. It is also known as the dynamic link.
4. Access Link: An optional field used to refer to
non-local data held in other activation records.
It is also known as the static link.
5. Saved Machine Status: Holds information about the
status of the machine just before the procedure is called.
6. Local Data: Holds the data that is local to the execution of the procedure.
7. Temporaries: Stores values that arise in the evaluation of an expression.
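As a sketch, the layout can be pictured as a C struct (the field types and sizes are
illustrative assumptions; real activation records are laid out by the compiler, not
declared in source code):

struct activation_record {
    int   return_value;                      /* 1. value returned to the caller          */
    int   actual_params[4];                  /* 2. parameters supplied by the caller     */
    struct activation_record *control_link;  /* 3. dynamic link: caller's record         */
    struct activation_record *access_link;   /* 4. static link: for non-local data       */
    unsigned saved_machine_status[8];        /* 5. saved registers, return address, etc. */
    int   local_data[8];                     /* 6. locals of this activation             */
    int   temporaries[8];                    /* 7. intermediate values of expressions    */
};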
Block Structured and Non-Block Structured Storage Allocation...
• Copy Propagation
• Constant Folding
• Reduction in Strength

Copy Propagation
Copy propagation is a compiler optimization technique that, after a copy statement of the form
x = y, replaces later uses of x with y. This can make the copy statement itself redundant, so it
can be removed, saving the overhead of the extra assignment.
Example:
Before:            After:
x = y              x = y
z = 3 + x          z = 3 + y
(If x is not used elsewhere, the copy x = y then becomes dead and can also be removed.)
Dead Code Elimination
Code that is unreachable or that does not affect the program can be eliminated.
Example:
Function1()
{
    int a = 10, b = 20, c, d;
    c = a + b;
    d = b / a;
    print(c);
    return;
    print(d); // Dead Code: unreachable, the function has already returned
}
Here, the value of d will never be printed because the function returns first, so the
statement print(d) is dead code and can be eliminated.
Constant Folding
If an expression consists entirely of literals (constants), it can be folded into a single
value at compile time.
Example:
Before:
X = 10 + 20 * 3 / 2;
After:
X = 40;
Loop-Invariant Code Motion:
Code whose value does not change across the iterations of a loop can be moved outside the
loop.
Example: (see the sketch below)
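A small illustrative before/after sketch (assumed, not from the original notes):

Before:
for (i = 0; i < n; i++)
{
    a[i] = x * y + i;    /* x * y is recomputed on every iteration */
}

After:
t = x * y;               /* loop-invariant, moved outside the loop */
for (i = 0; i < n; i++)
{
    a[i] = t + i;
}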
Induction Variable Elimination:
A variable is said to be an induction variable if its value changes on every iteration of
the loop, i.e., it increases or decreases by a fixed amount. If the loop contains such
variables, we try to eliminate them or minimize their number inside the loop.
Example: (see the sketch below)
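A small illustrative sketch (assumed, not from the original notes): i and j are both induction
variables that move in lockstep, so j can be eliminated by rewriting the loop test in terms of i:

Before:
i = 0; j = n;
while (j > 0)
{
    j = j - 1;       /* induction variable: decreases by 1 */
    i = i + 4;       /* induction variable: increases by 4 */
    b[i] = 0;
}

After:
i = 0;
while (i < 4 * n)    /* test rewritten in terms of i; j eliminated */
{
    i = i + 4;
    b[i] = 0;
}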
Reduction in Strength:
It is a loop optimization technique in which expensive operations are replaced with
equivalent but less expensive operations (below, the multiplication is replaced by repeated addition).
Example:
Before:
c = 8;
for (i = 0; i <= 10; i++)
{
    a[i] = c * i;
}

After:
c = 8;
k = 0;
for (i = 0; i <= 10; i++)
{
    a[i] = k;
    k = k + c;
}
Directed Acyclic Graph (DAG):
A Directed Acyclic Graph (DAG) is a tool that depicts the structure of a basic block, helps
us to see the flow of values among its statements, and offers opportunities for optimization.
A DAG is used to represent a basic block of the flow graph.
A DAG consists of:
Leaf nodes, which represent identifiers, names or constants.
Interior nodes, which represent operators.
Interior nodes also represent the results of expressions, i.e., the identifiers/names where
the values are to be stored or assigned.
Example:
t0 = a + b
t1 = t0 + c
d = t0 + t1
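A sketch of the DAG for this block (the node numbering is illustrative):

n1: +  (leaves a, b)       attached name: t0
n2: +  (children n1, c)    attached name: t1
n3: +  (children n1, n2)   attached name: d

Because node n1 is shared by both n2 and n3, the common subexpression a + b is evaluated
only once.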
AST and DAG:
(Figure: the abstract syntax tree and the DAG for the example above; the AST duplicates the
subtree for a + b, while the DAG shares a single node for it.)
Compiler Design
UNIT – V:
Data flow analysis: Flow graph, data flow equations, global optimization, redundant
subexpression elimination, induction variable elimination, live variable analysis, copy
propagation.
Object code generation: Object code forms, machine dependent code optimization,
register allocation and assignment, generic code generation algorithms, DAG for register
allocation.
---------------------------------------------------------------------------------------------------------------------
Basic block: A basic block is a set of statements that always execute in a sequence, one after the
other.
The characteristics of basic blocks are:
There is no possibility of branching into or halting in the middle.
All the statements execute in the same order in which they appear, without losing the flow
control of the program.
Example: (see the sketch below)
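An illustrative pair of fragments (assumed, not from the original figure):

Basic block (no branch, no halt in the middle):
    t1 = a + b
    t2 = t1 * c
    t3 = t2 - d

Not a basic block (a branch occurs in the middle):
    t1 = a + b
    if t1 > 0 goto L1
    t2 = t1 * c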
A flow graph is a graph representation of three-address statements.
A flow graph consists of a set of basic blocks and edges.
Edges represent the flow of control between the basic blocks,
and the blocks represent computations.
It is used for data flow analysis, through which we can achieve global optimization.
We can construct a flow graph for a given three-address code.
Example: (figure: a flow graph for a sample three-address code.)
Dominators in a flow graph:
In a flow graph, a node d dominates a node n if every path from the initial node of the
flow graph to n goes through d. This is denoted d dom n.
The initial node dominates all the remaining nodes in the flow graph.
Every node dominates itself.
Example:
• D(1)={1}
• D(2)={1,2}
• D(3)={1,3}
• D(4)={1,3,4}
• D(5)={1,3,4,5}
• D(6)={1,3,4,6}
• D(7)={1,3,4,7}
• D(8)={1,3,4,7,8}
• D(9)={1,3,4,7,8,9}
• D(10)={1,3,4,7,8,10}
A loop must have a single entry point, called the header. The entry point dominates
all nodes in the loop.
There must be at least one way to iterate the loop, i.e., at least one path back to the
header.
One way to find all the loops in a flow graph is to search for edges whose heads dominate
their tails. If a → b is an edge, b is the head and a is the tail.
Such edges are called back edges.
Example:
Back edges:
i)   7 → 4   (4 dom 7)
ii)  10 → 7  (7 dom 10)
iii) 4 → 3   (3 dom 4)
iv)  8 → 3   (3 dom 8)
v)   9 → 1   (1 dom 9)
Natural loop:
For a back edge n → d, we define the natural loop of the edge to be d plus the
set of nodes that can reach n without going through d. Node d is the header of the
loop.
Example: if the back edge is 7 → 4, then the natural loop is {4,5,6,7}.
Constructing a Flow Graph for a given Three Address Code
Algorithm:
Step 1: Identify the leaders of the basic blocks:
The first statement is always a leader.
Any statement that is the target of a conditional or unconditional jump is a leader.
Any statement that immediately follows a conditional or unconditional jump is a
leader.
Step 2: For each leader, construct the basic block that consists of all the instructions up to, but
not including, the next leader or the end of the intermediate code.
Step 3: Draw the flow graph.
Example:
The three address code is:
1. i = 0
2. if (i > 10) goto 6
3. a[i] = 0
4. i = i + 1
5. goto 2
6. End

Step 1: Identifying the leaders
1. i = 0               — Leader (first statement)
2. if (i > 10) goto 6  — Leader (target of the goto in statement 5)
3. a[i] = 0            — Leader (immediately follows a conditional jump)
4. i = i + 1
5. goto 2
6. End                 — Leader (target of the jump in statement 2; also follows a goto)
Step 2: Constructing the basic blocks
B1:  1. i = 0
B2:  2. if (i > 10) goto 6
B3:  3. a[i] = 0
     4. i = i + 1
     5. goto 2
B4:  6. End
Step 3: Constructing the flow graph
B1:  i = 0
B2:  if (i > 10) goto B4
B3:  a[i] = 0
     i = i + 1
     goto B2
B4:  End

Edges: B1 → B2; B2 → B3 (condition false); B2 → B4 (condition true); B3 → B2.
GEN[B] = the set of all definitions inside B that are "visible" immediately after the block.
KILL[B] = the union of the definitions in all the basic blocks of the flow graph that are
killed by individual statements in B.
Algorithm to find In and Out of each block in a flow graph
Finding In and Out for the following flow graph:

B1:  1. i = n - 1
     2. j = n
     3. a = u1
B2:  4. i = i + 1
     5. j = j + 1
B3:  6. a = u2
B4:  7. i = a + j
Exit
(Edges, as implied by the predecessor table: B1 → B2; B2 → B3, B4; B3 → B4; B4 → B2, Exit.)

Step 1: Finding the predecessors of all blocks

Blocks   Predecessors
B1       Φ
B2       B1, B4
B3       B2
B4       B2, B3

Step 2: Finding the Gen and Kill of all blocks

Blocks   Gen        Kill
B1       {1,2,3}    {4,5,6,7}
B2       {4,5}      {1,2,7}
B3       {6}        {3}
B4       {7}        {1,4}
Step 3: Finding In and Out for all blocks

Iteration 1:
In[B] = Φ and Out[B] = Gen[B]

Blocks   In    Out
B1       Φ     {1,2,3}
B2       Φ     {4,5}
B3       Φ     {6}
B4       Φ     {7}

Iteration 2:
In the 2nd and subsequent iterations, the In and Out values are calculated from the previous
iteration using the following equations:
In[B]  = U Out[P], taken over all predecessors P of B
Out[B] = Gen[B] U (In[B] - Kill[B])
Working:
In[B1]  = U Out[Predecessor(B1)]
        = union over an empty set of predecessors
        = Φ
Out[B1] = Gen[B1] U (In[B1] - Kill[B1])
        = {1,2,3} U (Φ - {4,5,6,7})
        = {1,2,3} U Φ
        = {1,2,3}
Similarly, we find In and Out for all the other blocks:

Blocks   In          Out
B1       Φ           {1,2,3}
B2       {1,2,3,7}   {3,4,5}
B3       {4,5}       {4,5,6}
B4       {4,5,6}     {5,6,7}
Iteration 3:
Working:
In[B2]  = Out[B1] U Out[B4]
        = {1,2,3} U {5,6,7}
        = {1,2,3,5,6,7}
Out[B2] = Gen[B2] U (In[B2] - Kill[B2])
        = {4,5} U ({1,2,3,5,6,7} - {1,2,7})
        = {4,5} U {3,5,6}
        = {3,4,5,6}

Blocks   In              Out
B1       Φ               {1,2,3}
B2       {1,2,3,5,6,7}   {3,4,5,6}
B3       {3,4,5}         {4,5,6}
B4       {3,4,5,6}       {3,5,6,7}
Iteration 4:
Blocks   In              Out
B1       Φ               {1,2,3}
B2       {1,2,3,5,6,7}   {3,4,5,6}
B3       {3,4,5,6}       {4,5,6}
B4       {3,4,5,6}       {3,5,6,7}

Iteration 5:
Blocks   In              Out
B1       Φ               {1,2,3}
B2       {1,2,3,5,6,7}   {3,4,5,6}
B3       {3,4,5,6}       {4,5,6}
B4       {3,4,5,6}       {3,5,6,7}

Since iterations 4 and 5 are identical, we stop the process. Finally, we obtain the In and
Out of each block.
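The whole iteration can also be run mechanically. Below is a minimal C sketch (not part of the
original notes) that encodes the Gen, Kill and predecessor sets of this example as bit masks,
where bit i stands for definition i, and iterates the two equations to a fixed point:

#include <stdio.h>

#define N 4                 /* number of basic blocks           */
#define D(x) (1u << (x))    /* bit mask for definition number x */

int main(void)
{
    /* Gen and Kill sets of the example, encoded as bit masks.  */
    unsigned gen[N]  = { D(1)|D(2)|D(3), D(4)|D(5), D(6), D(7) };
    unsigned kill[N] = { D(4)|D(5)|D(6)|D(7), D(1)|D(2)|D(7), D(3), D(1)|D(4) };
    /* pred[b][p] is 1 when block p is a predecessor of block b. */
    int pred[N][N] = {
        {0,0,0,0},          /* B1: no predecessors */
        {1,0,0,1},          /* B2: B1 and B4       */
        {0,1,0,0},          /* B3: B2              */
        {0,1,1,0},          /* B4: B2 and B3       */
    };
    unsigned in[N] = {0}, out[N];
    int b, p, changed = 1;

    for (b = 0; b < N; b++)         /* iteration 1:          */
        out[b] = gen[b];            /* In = Φ, Out = Gen     */

    while (changed) {               /* iterate to a fixed point */
        changed = 0;
        for (b = 0; b < N; b++) {
            unsigned newin = 0, newout;
            for (p = 0; p < N; p++)
                if (pred[b][p])
                    newin |= out[p];                 /* In[B]  = U Out[P]                  */
            newout = gen[b] | (newin & ~kill[b]);    /* Out[B] = Gen[B] U (In[B] - Kill[B]) */
            if (newin != in[b] || newout != out[b])
                changed = 1;
            in[b] = newin;
            out[b] = newout;
        }
    }
    for (b = 0; b < N; b++)
        printf("B%d: In=%#04x Out=%#04x\n", b + 1, in[b], out[b]);
    return 0;
}

The printed masks correspond to the final In and Out sets in the tables above.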
Peephole optimization:
Peephole optimization is a type of code optimization performed on a small part of the code.
The small set of instructions, or small section of code, on which peephole optimization is
performed is known as the peephole or window.
It works on the principle of replacement, in which a part of the code is replaced by shorter
and faster code without changing the output, in order to improve performance.
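A small illustrative peephole (assumed, not from the original notes), using the MOV src, dst
convention of the code-generation example later in this unit; the second instruction reloads a
value that register R0 already holds, so it can be deleted:

Before:
MOV R0, a      ; store R0 into a
MOV a, R0      ; redundant load: R0 already contains the value of a

After:
MOV R0, a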
Code-generation algorithm:
getReg: The code generator uses the getReg function to determine the status of available
registers and the locations of name values. It works as follows:
If variable Y is already in register R, it uses that register.
Else, if some register R is available, it uses that register.
Else, if neither of the above is possible, it chooses a register that requires the minimal
number of load (memory to register) and store (register to memory) instructions.
The algorithm takes as input a sequence of three-address statements constituting a basic block.
For each three-address statement of the form x := y op z, it performs the following actions:
1. Invoke the function getReg to determine the location L where the result of the computation
y op z should be stored.
2. Consult the address descriptor of y to determine y', the current location of y. Prefer the
register for y' if the value of y is currently both in memory and a register. If the value of y
is not already in L, generate the instruction MOV y', L to place a copy of y in L.
3. Generate the instruction OP z', L, where z' is a current location of z. Prefer a register to a
memory location if z is in both. Update the address descriptor of x to indicate that x is in
location L. If x is in L, update its descriptor and remove x from all other descriptors.
4. If the current values of y or z have no next uses, are not live on exit from the block, and
are in registers, alter the register descriptor to indicate that, after execution of x := y op z,
those registers will no longer contain y or z.
Example:
Generate code for the following three-address code:
t := a - b
u := a - c
v := t + u
d := v + u
Statements   Code Generated   Register descriptor      Address descriptor
(initially)                   Registers empty
t := a - b   MOV a, R0
             SUB b, R0        R0 contains t            t in R0
u := a - c   MOV a, R1
             SUB c, R1        R0 contains t            t in R0
                              R1 contains u            u in R1
v := t + u   ADD R1, R0       R0 contains v            u in R1
                              R1 contains u            v in R0
d := v + u   ADD R1, R0       R0 contains d            d in R0
             MOV R0, d                                 d in R0 and memory
DAG for Register Allocation
Code generation from a DAG is much simpler than from a linear sequence of three-address code.
The DAG can be used to rearrange the sequence of instructions and generate efficient code.
The steps involved in generating code from a DAG are:
Rearranging the order – To optimize code generation, the instructions are rearranged;
this is referred to as heuristic reordering.
Labelling the tree for register information – To know the number of registers required
to generate code, the nodes are labelled with numbers that indicate how many
registers are required to evaluate each node.
Tree traversal to generate code – The reordered, labelled tree is traversed to generate
code based on the target language's instructions.
Rearranging the order – Heuristic reordering:
Rearranging the nodes involves changing the order of independent statements of the DAG,
which helps efficient utilization of the registers.
This rearranging of nodes also helps in reducing the final cost of the assembly-level code.
DAG for Register Allocation
Algorithm:
Node_listing()
{
    while unlisted interior nodes remain do
    begin
        select an unlisted node n, all of whose parents have been listed;
        list n;
        while the leftmost child m of n has no unlisted parents and is not a leaf do
        /* since n was just listed, m is not yet listed */
        begin
            list m;
            n := m;
        end
    end
}
For the example DAG, the listed nodes are "1234568". This string is reversed to yield "8654321",
which indicates that we evaluate node 8 first, followed by 6, 5, 4, 3, 2 and finally 1.
The following is the sequence of instructions after rearranging:
1. t8 := d + e
2. t6 := a + b
3. t5 := t6 - c
4. t4 := t5 * t8
5. t3 := t4 - e
6. t2 := t6 + t4
7. t1 := t2 + t3
DAG for Register Allocation
Labelling the tree for register information:
A node n is labelled using the following equation (this is the standard Sethi–Ullman labelling):
    label(n) = 1, if n is a leftmost leaf
    label(n) = 0, if n is a leaf that is not a leftmost child
    For an interior node n with children n1 and n2:
    label(n) = max(label(n1), label(n2)), if label(n1) ≠ label(n2)
    label(n) = label(n1) + 1,             if label(n1) = label(n2)
******THANK YOU******