7MCE1C4-Principles of Compiler Design
Unit II
The syntactic specification of Programming Languages: Context-free grammars –
Derivations and parse trees – Capabilities of context-free grammars.
Basic Parsing Techniques: Parsers – Shift-reduce parsing – Operator-precedence
parsing – Top-down parsing – Predictive parsers.
Automatic construction of efficient parsers: LR parsers – Constructing SLR parsing
tables – Constructing LALR parsing tables.
Unit III
Syntax-Directed Translation: Syntax-directed translation schemes – Implementation
of syntax-directed translators – Intermediate code – Postfix notation – Parse trees and syntax
trees – Three-address code, quadruples, and triples – Translation of assignment statements –
Boolean expressions – Statements that alter the flow of control – Postfix translations –
Translation with a top-down parser.
Unit IV
Symbol Tables: The contents of a symbol table – Data structures for symbol tables –
Representing scope information.
Run time storage administration: Implementation of a simple stack allocation scheme
– Implementation of block-structured languages – Storage allocation in block-structured
languages.
Error Detection and Recovery: Errors – Lexical-phase errors – Syntactic-phase errors
– Semantic errors.
Unit V
Introduction to code optimization: The principal sources of optimization – Loop
optimization – The DAG representation of basic blocks.
Code generation: object programs – Problems in code generation – A machine model
– A simple code generator – Register allocation and assignment – Code generation from DAG’s
–Peephole optimization.
Text Book:
1. “Principles of Compiler Design” by Alfred V. Aho and Jeffrey D. Ullman, Narosa
Publishing House, 1989, Reprint 2002.
Books for Reference:
1. “Compiler Construction Principles and Practice”, by Dhamdhere D. M, 1981, Macmillan
India.
2. “Compiler Design”, by Reinhard Wilhelm and Dieter Maurer, 1995, Addison-Wesley.
UNIT I
OUTLINE
Introduction to Compilers:
Compilers and Translators
Lexical analysis
Syntax analysis
Intermediate code generation
Optimization
Code generation
Bookkeeping
Error handling
Compiler writing tools.
COMPILER
LEXICAL ANALYSIS
• Scanning
• Tokenization
• constants,
• identifiers,
• numbers,
• operators and
• punctuation symbols
Pattern
Lexeme
Token
(e.g.) c = a + b * 5;
Intermediate code can be either language specific (e.g., Byte Code for Java) or
language independent (three-address code).
Three-Address Code
Intermediate code generator receives input from its predecessor phase,
semantic analyzer, in the form of an annotated syntax tree. That syntax tree
then can be converted into a linear representation.
For example:
a = b + c * d;
The intermediate code generator will try to divide this expression into sub-
expressions and then generate the corresponding code.
r1 = c * d;
r2 = b + r1;
a = r2
OPTIMIZATION
CODE GENERATION
Code generator is used to produce the target code for three-address
statements. It uses registers to store the operands of the three address
statement.
Example:
MOV x, R0
ADD y, R0
A code-generation algorithm for a statement x := y op z:
1. Invoke a function getreg to find out the location L where the result of the
computation y op z should be stored.
2. Consult the address descriptor for y to determine y', one of the current
locations of y (prefer a register). If y is not already in L, generate the
instruction MOV y', L to place a copy of y in L.
3. Generate the instruction OP z', L, where z' is a current location of z.
4. If the current values of y and/or z have no next uses, are not live on exit from
the block, and are in registers, then alter the register descriptor to indicate that,
after execution of x := y op z, those registers will no longer contain y or z.
The assignment statement d: = (a-b) + (a-c) + (a-c) can be translated into the
following sequence of three address code:
1. t:= a-b
2. u:= a-c
3. v:= t +u
4. d:= v+u
Code sequence for the example is as follows:
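A plausible code sequence for this block, in the two-address style of the machine model above (a sketch: SUB is assumed to work like the ADD instruction shown earlier, and R0 and R1 are assumed to be free registers):
MOV a, R0
SUB b, R0        (R0 = t = a - b)
MOV a, R1
SUB c, R1        (R1 = u = a - c)
ADD R1, R0       (R0 = v = t + u)
ADD R1, R0       (R0 = v + u)
MOV R0, d        (d = v + u)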
BOOKKEEPING
A compiler needs to collect information about all the data objects that
appear in the source program.
The information about data objects is collected by the early phases of
the compiler – the lexical and syntactic analyzers.
The data structure used to record this information is called the Symbol
Table.
ERROR HANDLING
One of the most important functions of a compiler is the detection and
reporting of errors in the source program.
The error message should allow the programmer to determine exactly
where the errors have occurred.
Errors may occur in any of the phases of a compiler.
Whenever a phase of the compiler discovers an error, it must report the
error to the error handler, which issues an appropriate diagnostic message.
Both the table-management and error-handling routines interact
with all phases of the compiler.
COMPILER WRITING TOOLS
An automaton has a set of states and rules for moving from one state to
another, depending on the applied input symbol. An automaton is characterized by:
1. Input
2. Output
3. States of automata
4. State relation
5. Output relation
Lexical Analysis
Lexical Analysis is the first phase of the compiler also known as a
scanner. It converts the High level input program into a sequence of Tokens.
Lexical analysis can be implemented with deterministic finite automata.
The output is a sequence of tokens that is sent to the parser for syntax
analysis
Upon receiving a ‘get next token’ command from the parser, the lexical
analyzer reads the input characters until it can identify the next token.
The LA returns to the parser a representation for the token it has found.
The representation will be an integer code if the token is a simple
construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface.
One such task is stripping out from the source program the comments
and white space in the form of blank, tab and newline characters.
Another is correlating error messages from the compiler with the source
program.
Input Buffering
The LA scans the characters of the source program one at a time to discover
tokens.
Because a large amount of time can be consumed scanning characters,
specialized buffering techniques have been developed to reduce the
amount of overhead required to process an input character.
REGULAR EXPRESSIONS
A regular expression is a formula that describes a possible set of strings.
Components of a regular expression:
x       the character x
.       any character, usually except a newline
[xyz]   any of the characters x, y, z, …
R?      an R or nothing (= optionally an R)
R*      zero or more occurrences of R
R+      one or more occurrences of R
R1R2    an R1 followed by an R2
R1|R2   either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain
type.
If we view the set of strings in each token class as a language, we can use
the regular expression notation to describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or
more letters or digits.
In regular expression notation we would write.
Identifier = letter (letter | digit)*
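The same definition can be checked directly with a small Python snippet (a sketch; 'letter' and 'digit' are assumed to mean ASCII letters and digits):

import re

# letter (letter | digit)*
identifier = re.compile(r'[A-Za-z][A-Za-z0-9]*')

print(bool(identifier.fullmatch('count1')))   # True
print(bool(identifier.fullmatch('1count')))   # False: starts with a digit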
Here are the rules that define the regular expressions over an alphabet Σ:
ε is a regular expression denoting { ε }, that is, the language containing
only the empty string.
For each ‘a’ in Σ, ‘a’ is a regular expression denoting { a }, the language with
only one string, consisting of the single symbol ‘a’.
If R and S are regular expressions, then
(R) | (S) denotes L(R) ∪ L(S)
R.S denotes L(R).L(S)
R* denotes L(R)*
FINITE AUTOMATA
An automaton is defined as a system where information is transmitted and used
for performing some functions without direct participation of man.
1. An automaton in which the output depends only on the input is called an
automaton without memory.
2. An automaton in which the output depends on the input and the state
is called an automaton with memory.
3. An automaton in which the output depends only on the state of the
machine is called a Moore machine.
4. An automaton in which the output depends on the state and the input at
any instant of time is called a Mealy machine.
Description of Automata
1. An automaton has a mechanism to read input from an input tape.
2. Any language is recognized by some automaton; hence these
automata are basically language ‘acceptors’ or ‘language recognizers’.
Types of Finite Automata
• Deterministic Automata
• Non-Deterministic Automata
Even number of a’s : The regular expression for even number of a’s
is (b|ab*ab*)*. We can construct a finite automata as shown in Figure 1.
The above automata will accept all strings which have even number of a’s.
For zero a’s, it will be in q0 which is final state. For one ‘a’, it will go from
q0 to q1 and the string will not be accepted. For two a’s at any positions, it
will go from q0 to q1 for 1st ‘a’ and q1 to q0 for second ‘a’. So, it will accept
all strings with even number of a’s.
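The automaton can be simulated with a few lines of Python (a sketch; the state names q0/q1 follow the discussion above and the input alphabet is assumed to be {a, b}):

def accepts_even_as(s):
    # q0: an even number of a's seen so far (the final state); q1: odd
    state = 'q0'
    for ch in s:
        if ch == 'a':
            state = 'q1' if state == 'q0' else 'q0'
        # on 'b' the state does not change
    return state == 'q0'

print(accepts_even_as('babab'))   # True: two a's
print(accepts_even_as('bab'))     # False: one a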
String with ‘ab’ as substring: The regular expression for strings with
‘ab’ as substring is (a|b)*ab(a|b)*. We can construct finite automata as
shown in Figure 2.
The above automaton will accept all strings which have ‘ab’ as a substring. The
automaton will remain in the initial state q0 for b’s. It will move to q1 after
reading ‘a’ and remain in the same state for all ‘a’s afterwards. Then it will
move to q2 if ‘b’ is read. That means the string has ‘ab’ as a substring if
it reaches q2.
Strings of the form a^(3n): The above automaton will accept all strings of the form a^(3n). The
automaton will remain in the initial state q0 for ε, and ε is accepted. For the string ‘aaa’, it will
move from q0 to q1, then q1 to q2, and then q2 to q0. For every group of three a’s, it
will come back to q0, hence the string is accepted. Otherwise, it will end in q1 or q2, and the string is
rejected.
Note: If we want to design a finite automaton for strings with the number of a’s equal to 3n+1, the same
automaton can be used with q1 as the final state instead of q0.
MINIMIZING THE NUMBER OF STATES OF A DFA
Minimization of a DFA means reducing the number of states of the given FA.
Thus, we get an FSM (finite state machine) without redundant states after
minimizing.
We have to follow various steps to minimize the DFA. These are as follows:
Step 1: Remove all the states that are unreachable from the initial state via
any set of transitions of the DFA.
Step 2: Draw the transition table for the remaining states.
Step 3: Now split the transition table into two tables T1 and T2. T1 contains
all final states, and T2 contains the non-final states.
Step 4: Find similar rows in T1, i.e. rows for states q and r such that
1. δ (q, a) = p
2. δ (r, a) = p
That means, find two states which make the same transitions on each input symbol, and
remove one of them.
Step 5: Repeat step 4 until no similar rows are left in the transition
table T1.
Step 6: Repeat steps 4 and 5 for table T2 as well.
Step 7: Now combine the reduced T1 and T2 tables. The combined transition
table is the transition table of the minimized DFA.
Example:
Solution:
Step 1: In the given DFA, q2 and q4 are the unreachable states so remove
them.
Step 2: Draw the transition table for the rest of the states.
Step 3: Now divide rows of transition table into two sets as:
1. One set contains those rows, which start from non-final states:
2. Another set contains those rows, which start from final states.
Step 5: In set 2, row 1 and row 2 are similar since q3 and q5 transit to the
same state on 0 and 1. So skip q5 and then replace q5 by q3 in the rest.
Notation
• Union: A+B≡A|B
• Option: A + ε ≡ A?
• Integer = digit+
• OpenPar = ‘(’
•…
Parsers
Shift reduce parsing
Operator precedence parsing
Top down parsing
Predictive parsers
LR Parsers
Constructing SLR parsing tables
Constructing LALR parsing tables
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce terminologies used
in parsing technology.
A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The
non-terminals define sets of strings that help define the language generated by the grammar.
A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which
strings are formed.
A set of productions (P). The productions of a grammar specify the manner in which the
terminals and non-terminals can be combined to form strings. Each production consists of a non-
terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-
terminals, called the right side of the production.
One of the non-terminals is designated as the start symbol (S); from where the production begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start
symbol) by the right side of a production, for that non-terminal.
Example
We take the problem of palindrome language, which cannot be described by means of Regular
Expression. That is, L = { w | w = wR } is not a regular language. But it can be described by
means of CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → 0 | Q → 1 | Q → ε | Z → 0Q0 | N → 1Q1 }
S = Q
This grammar describes the palindrome language, generating strings such as 1001, 11100111, 00100, 1010101, 11111,
etc.
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams.
The parser analyzes the source code (token stream) against the production rules to detect any
errors in the code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and
generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use
error recovery strategies.
Derivation
A derivation is basically a sequence of production rules, in order to get the input string. During
parsing, we take two decisions for some sentential form of input:
1. Deciding the non-terminal which is to be replaced.
2. Deciding the production rule by which the non-terminal will be replaced.
To decide which non-terminal is to be replaced with a production rule, we can have two options.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most derivation is called the left-sentential
form.
Example
Production rules:
E→E+E
E→E*E
E → id
Input string: id + id * id
E→E*E
E→E+E*E
E → id + E * E
E → id + id * E
E → id + id * id
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as
right-most derivation. The sentential form derived from the right-most derivation is called
the right-sentential form.
E → E + E
E → E + E * E
E → E + E * id
E → E + id * id
E → id + id * id
Parse Tree
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id
Capabilities of CFG
CHAPTER 5
Parser
A parser is the phase of the compiler that breaks the data coming from the lexical
analysis phase into smaller elements.
A parser takes input in the form of sequence of tokens and produces output in the form of parse
tree.
Parsing is of two types: top-down parsing and bottom-up parsing.
Top-down parsing
Bottom-up parsing
Example
Production
1. E → T
2. T → T * F
3. T → id
4. F → T
5. F → id
1. Shift-Reduce Parsing
2. Operator Precedence Parsing
3. Table Driven LR Parsing
a) LR( 0 )
b) SLR( 1 )
c) CLR ( 1 )
d) LALR( 1 )
o Shift-reduce parsing performs two actions: shift and reduce. That's why it is known as shift-
reduce parsing.
o At a shift action, the current symbol in the input string is pushed onto a stack.
o At each reduction, the symbols on top of the stack are replaced by a non-terminal. The symbols are the right side of
a production and the non-terminal is the left side of the production.
Example:
Grammar:
S → S+S
S → S-S
S → (S)
S→a
Input string:
a1-(a2+a3)
Parsing table:
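The parsing table itself is not reproduced here, but the sequence of moves can be traced with a naive shift-reduce loop (a sketch in Python: it greedily reduces whenever the top of the stack matches a right-hand side, which happens to work for this grammar and input; a1, a2, a3 are all tokenized as 'a'):

# Grammar: S -> S+S | S-S | (S) | a
RULES = [('S+S', 'S'), ('S-S', 'S'), ('(S)', 'S'), ('a', 'S')]

def shift_reduce(tokens):
    stack = []
    while True:
        # reduce while some right-hand side matches the top of the stack
        reduced = True
        while reduced:
            reduced = False
            for rhs, lhs in RULES:
                n = len(rhs)
                if len(stack) >= n and ''.join(stack[-n:]) == rhs:
                    stack[-n:] = [lhs]        # replace the handle by the non-terminal
                    print(''.join(stack).ljust(10), 'reduce by', lhs, '->', rhs)
                    reduced = True
                    break
        if not tokens:
            break
        stack.append(tokens.pop(0))           # shift the next input symbol
        print(''.join(stack).ljust(10), 'shift')
    return stack == ['S']

print(shift_reduce(list('a-(a+a)')))          # True: the string is accepted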
1. Operator-Precedence Parsing
2. LR-Parser
Operator-precedence parsing is a kind of shift-reduce parsing method. It is applied to a small class
of operator grammars.
A grammar is said to be an operator precedence grammar if it has two properties:
o No right-hand side of any production contains ∈ (the empty string).
o No two non-terminals are adjacent in any right-hand side.
Operator precedence can only be established between the terminals of the grammar. It ignores the
non-terminals.
There are the three operator precedence relations:
a⋗ b means that terminal "a" has the higher precedence than terminal "b".
a⋖ b means that terminal "a" has the lower precedence than terminal "b".
a≐ b means that the terminal "a" and "b" both have same precedence.
Precedence table:
Parsing Action
o Now scan the input string from left to right until the ⋗ is encountered.
o Scan towards the left over all the equal precedences until the first left-most ⋖ is encountered.
Example
Grammar:
E → E+T/T
T → T*F/F
F → id
Given string:
w = id + id * id
On the basis of the above grammar, we can design the following operator precedence table:
Now let us process the string with the help of the above precedence table:
Top-Down Parser
We have learnt in the last chapter that the top-down parsing technique parses the input, and starts
constructing a parse tree from the root node gradually moving down to the leaf nodes. The types
of top-down parsing are depicted below:
Recursive Descent Parsing
Recursive descent is a top-down parsing technique that constructs the parse tree from the top and
the input is read from left to right. It uses procedures for every terminal and non-terminal entity.
This parsing technique recursively parses the input to make a parse tree, which may or may not
require back-tracking. But the grammar associated with it (if not left factored) cannot avoid back-
tracking. A form of recursive-descent parsing that does not require any back-tracking is known
as predictive parsing.
This parsing technique is regarded recursive as it uses context-free grammar which is recursive in
nature.
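A minimal recursive-descent recognizer in Python for the expression grammar used later in this chapter (a sketch: the left recursion of E -> E + T is rewritten as iteration, since a recursive-descent parser cannot handle left-recursive productions directly):

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f'expected {tok!r}, got {self.peek()!r}')
        self.pos += 1

    def E(self):                 # E -> T { + T }
        self.T()
        while self.peek() == '+':
            self.eat('+')
            self.T()

    def T(self):                 # T -> F { * F }
        self.F()
        while self.peek() == '*':
            self.eat('*')
            self.F()

    def F(self):                 # F -> id | ( E )
        if self.peek() == 'id':
            self.eat('id')
        else:
            self.eat('(')
            self.E()
            self.eat(')')

p = Parser(['id', '+', 'id', '*', 'id'])
p.E()
print('accepted' if p.peek() is None else 'trailing input')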
Back-tracking
Top- down parsers start from the root node (start symbol) and match the input string against the
production rules to replace them (if matched). To understand this, take the following example of
CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For an input string: read, a top-down parser, will behave like this:
It will start with S from the production rules and will match its yield to the left-most letter of the
input, i.e. ‘r’. The very first production of S (S → rXd) matches with it. So the top-down parser
advances to the next input letter (i.e. ‘e’). The parser tries to expand non-terminal ‘X’ and checks
its production from the left (X → oa). It does not match with the next input symbol. So the top-
down parser backtracks to obtain the next production rule of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is accepted.
Predictive Parser
Predictive parser is a recursive descent parser, which has the capability to predict which
production is to be used to replace the input string. The predictive parser does not suffer from
backtracking.
To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next
input symbols. To make the parser back-tracking free, the predictive parser puts some constraints
on the grammar and accepts only a class of grammar known as LL(k) grammar.
Predictive parsing uses a stack and a parsing table to parse the input and generate a parse tree.
Both the stack and the input contain an end symbol $ to denote that the stack is empty and the
input is consumed.
stack element combination.
In recursive descent parsing, the parser may have more than one production to choose from for a
single instance of input, whereas in predictive parser, each step has at most one production to
choose. There might be instances where there is no production matching the input string, making
the parsing procedure to fail.
LL Parser
LL Parsing Algorithm
We may stick to deterministic LL(1) for parser explanation, as the size of table grows
exponentially with the value of k. Secondly, if a given grammar is not LL(1), then usually, it is
not LL(k), for any given k.
Given below is an algorithm for LL(1) Parsing:
Input:
string ω
parsing table M for grammar G
Output:
If ω is in L(G), then a left-most derivation of ω;
error otherwise.
repeat
let X be the top stack symbol and a the symbol pointed to by ip.
if X ∈ Vt or X = $
if X = a
POP X and advance ip.
else
error()
endif
else /* X is non-terminal */
if M[X, a] = X → Y1 Y2 ... Yk
POP X
PUSH Yk, Yk-1, ..., Y1 /* Y1 on top */
Output the production X → Y1 Y2 ... Yk
else
error()
endif
endif
until X = $ /* empty stack */
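The algorithm can be sketched in Python with a hand-written table for the small grammar E → T E′, E′ → + T E′ | ε, T → id (the table entries here are assumed for illustration, not computed from FIRST/FOLLOW):

def ll1_parse(tokens, table, start):
    nonterminals = {X for (X, _) in table}
    tokens = tokens + ['$']
    stack = ['$', start]
    ip = 0
    while stack:
        X = stack.pop()
        a = tokens[ip]
        if X not in nonterminals:          # terminal or the $ end marker
            if X == a:
                ip += 1                    # match: advance the input pointer
            else:
                raise SyntaxError(f'expected {X!r}, got {a!r}')
        else:                              # non-terminal: consult the table
            if (X, a) not in table:
                raise SyntaxError(f'no entry for ({X}, {a})')
            rhs = table[(X, a)]
            print(X, '->', ' '.join(rhs) or 'ε')
            stack.extend(reversed(rhs))    # push Y1 ... Yk with Y1 on top
    return 'accepted'

table = {
    ('E', 'id'): ['T', "E'"],
    ("E'", '+'): ['+', 'T', "E'"],
    ("E'", '$'): [],                       # E' -> ε
    ('T', 'id'): ['id'],
}
print(ll1_parse(['id', '+', 'id'], table, 'E'))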
A grammar G is LL(1) if, whenever A → α | β are two distinct productions of G:
1. For no terminal a do both α and β derive strings beginning with a.
2. At most one of α and β can derive the empty string.
3. If β derives the empty string, then α does not derive any string beginning with a terminal in FOLLOW(A).
Predictive Parsing
The goal of predictive parsing is to construct a top-down parser that never backtracks. To do so,
we must transform a grammar in two ways:
1. eliminate left recursion, and
2. perform left factoring.
These rules eliminate most common causes for backtracking, although they do not guarantee a
completely backtrack-free parsing (called LL(1) as we will see later).
LR Parser
LR parsing is one type of bottom-up parsing. It is used to parse a large class of grammars.
In LR parsing, "L" stands for left-to-right scanning of the input,
"R" stands for constructing a right-most derivation in reverse, and
"k" is the number of lookahead input symbols used to make parsing decisions.
LR parsing is divided into four parts:
o LR (0) parsing
o SLR parsing,
o CLR parsing
o LALR parsing.
LR algorithm:
The LR algorithm requires a stack, input, output and parsing table. In all types of LR parsing, the input,
output and stack are the same, but the parsing table is different.
Input buffer is used to indicate end of input and it contains the string to be parsed followed by a $
Symbol.
A stack is used to contain a sequence of grammar symbols with a $ at the bottom of the stack.
Parsing table is a two dimensional array. It contains two parts: Action part and Go To part.
LR (1) Parsing
The augmented grammar G` is generated if we add one more production to the given grammar
G. It helps the parser to identify when to stop parsing and announce the acceptance of the
input.
Example
Given grammar
1. S → AA
2. A → aA | b
The augmented grammar will be:
1. S` → S
2. S → AA
3. A → aA | b
Bottom-up parsing starts from the leaf nodes of a tree and works in upward direction till it
reaches the root node. Here, we start from a sentence and then apply production rules in reverse
manner in order to reach the start symbol. The image given below depicts the bottom-up parsers
available.
Shift-Reduce Parsing
Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as shift-
step and reduce-step.
Shift step: The shift step refers to the advancement of the input pointer to the next input symbol,
which is called the shifted symbol. This symbol is pushed onto the stack. The shifted symbol is
treated as a single node of the parse tree.
Reduce step: When the parser finds a complete grammar rule (RHS) and replaces it with its (LHS), it
is known as a reduce-step. This occurs when the top of the stack contains a handle. To reduce, a
POP function is performed on the stack, which pops off the handle and replaces it with the LHS non-
terminal symbol.
LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It uses a wide class of context-
free grammar which makes it the most efficient syntax analysis technique.
LR parsers are also known as LR(k) parsers, where L stands for left-to-right scanning of the input
stream; R stands for the construction of right-most derivation in reverse, and k denotes the
number of lookahead symbols to make decisions.
There are three widely used algorithms available for constructing an LR parser:
LR Parsing Algorithm
token = next_token()
repeat forever
s = top of stack
if action[s, token] = "shift si" then
PUSH token
PUSH si
token = next_token()
else if action[s, token] = "reduce A → β" then
POP 2 * |β| symbols
s = top of stack
PUSH A
PUSH goto[s, A]
else if action[s, token] = "accept" then
return
else
error()
LL vs. LR
LL: Starts with the root nonterminal on the stack.
LR: Ends with the root nonterminal on the stack.
LL: Uses the stack for designating what is still to be expected.
LR: Uses the stack for designating what is already seen.
LL: Builds the parse tree top-down.
LR: Builds the parse tree bottom-up.
LL: Continuously pops a nonterminal off the stack, and pushes the corresponding right hand side.
LR: Tries to recognize a right hand side on the stack, pops it, and pushes the corresponding nonterminal.
LL: Reads the terminals when it pops one off the stack.
LR: Reads the terminals while it pushes them on the stack.
LL: Pre-order traversal of the parse tree.
LR: Post-order traversal of the parse tree.
Add Augment Production and insert '•' symbol at the first position for every production in G
1. S` → •S
2. S → •AA
3. A → •aA
4. A → •b
I0 State:
Drawing DFA:
The DFA contains the 7 states I0 to I6.
LR(0) Table
o If a state goes to some other state on a terminal, then it corresponds to a shift move.
o If a state goes to some other state on a variable, then it corresponds to a goto move.
o If a state contains a final item, then write the reduce entry in that state's entire row.
Explanation:
o I0 on S is going to I1 so write it as 1.
o I0 on A is going to I2 so write it as 2.
o I2 on A is going to I5 so write it as 5.
o I3 on A is going to I6 so write it as 6.
o I0, I2 and I3 on ‘a’ are going to I3, so write it as S3, which means shift 3.
o I0, I2 and I3 on ‘b’ are going to I4, so write it as S4, which means shift 4.
o I4, I5 and I6 all contain final items, because they contain • at the right-most end. So number
the productions and write the reduce entries with those production numbers.
S → AA ... (1)
A → aA ... (2)
A → b ... (3)
o I1 contains the final item (S` → S•), so action {I1, $} = Accept.
o I4 contains the final item A → b•, and that production corresponds to production
number 3, so write it as r3 in the entire row.
o I5 contains the final item S → AA•, and that production corresponds to
production number 1, so write it as r1 in the entire row.
o I6 contains the final item A → aA•, and that production corresponds to production
number 2, so write it as r2 in the entire row.
SLR (1) refers to simple LR parsing. It is the same as LR (0) parsing; the only difference is in the
parsing table. To construct an SLR (1) parsing table, we use the canonical collection of LR (0) items.
In SLR (1) parsing, we place the reduce move only in the FOLLOW of the left hand side.
Various steps involved in the SLR (1) Parsing:
o For the given input string, write a context free grammar.
o Check the ambiguity of the grammar.
o Add the Augment production to the given grammar.
o Create the canonical collection of LR (0) items.
o Draw the DFA.
o Construct the SLR (1) parsing table.
If a state (Ii) is going to some other state (Ij) on a variable, then it corresponds to a goto move in the
Go to part.
If a state (Ii) contains a final item like A → ab•, which has no transition to a next state, then the
production is known as a reduce production. For all terminals X in FOLLOW (A), write the reduce
entry along with the production number.
Example
1. S -> •Aa
2. A->αβ•
1. Follow(S) = {$}
2. Follow (A) = {a}
SLR ( 1 ) Grammar
S→E
E→ E + T | T
T→ T * F | F
F→ id
Add Augment Production and insert '•' symbol at the first position for every production in
G
S` →•E
E→•E + T
E →•T
T→•T * F
T →•F
F→•id
I0 State:
Add Augment production to the I0 State and Compute the Closure
Add all productions starting with E in to I0 State because "." is followed by the non-
terminal. So, the I0 State becomes
I0 = S` →•E
E→•E + T
E →•T
Add all productions starting with T and F in modified I0 State because "." is followed by
the non-terminal. So, the I0 State becomes.
I0= S` →•E
E→•E + T
E →•T
T→•T * F
T →•F
F→•id
Add all productions starting with T and F in I5 State because "." is followed by the non-terminal.
So, the I5 State becomes
I5 = E → E +•T
T→•T * F
T →•F
F→•id
Add all productions starting with F in I6 State because "." is followed by the non-terminal.
So, the I6 State becomes
I6 = T → T * •F
F→•id
Drawing DFA:
Explanation:
LALR PARSING
MOTIVATION
The LALR ( Look Ahead-LR ) parsing method is between SLR and Canonical LR
both in terms of power of parsing grammars and ease of implementation.
This method is often used in practice because the tables obtained by it are considerably
smaller than the Canonical LR tables, yet most common syntactic constructs of
programming languages can be expressed conveniently by an LALR grammar.
The same is almost true for SLR grammars, but there are a few constructs that can not be
handled by SLR techniques.
CONSTRUCTING LALR PARSING TABLES
CORE: A core is a set of LR (0) (SLR) items for the grammar, and an LR (1) (Canonical
LR) grammar may produce more than one set of items with the same core.
The core does not contain any look ahead information.
Example: Let s1 and s2 be two states in a Canonical LR grammar.
These two states have the same core consisting of only the production rules without any look
ahead information.
CONSTRUCTION IDEA:
Input: An augmented grammar G’.
Output: The LALR parsing table actions and goto for G’.
Method:
1. Construct C= {I0, I1, I2,… , In}, the collection of sets of LR(1) items.
2. For each core present among these sets, find all sets having that core, and replace
these sets by their union.
3. Parsing action table is constructed as for Canonical LR.
4. The goto table is constructed by taking the union of all sets of items having the
same core. If J is the union of one or more sets of LR (1) items, that is, J=I1 U I2 U
… U Ik, then the cores of goto(I1,X), goto(I2,X),…, goto(Ik, X) are the same as all
of them have same core. Let K be the union of all sets of items having same core as
goto(I1, X). Then goto(J,X)=K.
EXAMPLE
GRAMMAR:
1. S’ -> S
2. S -> CC
3. C -> cC
4. C -> d
STATES:
I0: S’ -> .S, $
S -> .CC, $
C -> .cC, c/d
C -> .d, c/d
I1: S’ -> S., $
I2: S -> C.C, $
C -> .cC, $
C -> .d, $
I3: C -> c.C, c/d
C -> .cC, c/d
C -> .d, c/d
I4: C -> d., c/d
I5: S -> CC., $
I6: C -> c.C, $
C -> .cC, $
C -> .d, $
I7: C -> d., $
I8: C -> cC., c/d
I9: C -> cC., $
CANONICAL LR(1) PARSING TABLE:
State     c     d     $      S     C
0         S3    S4           1     2
1                     acc
2         S6    S7                 5
3         S3    S4                 8
4         R3    R3
5                     R1
6         S6    S7                 9
7                     R3
8         R2    R2
9                     R2
NOTE: For goto graph see the construction used in Canonical LR.
LALR PARSING TABLE:
goto
START Actions
C D $ S C
0 S36 S47 1 2
1 Acc
2 S36 S47 5
36 S36 S47 89
47 R3 R3 R3
5 R1
89 R2 R2 R2
In the goto graph (not reproduced here), states with the same core, which get merged in the conversion from LR(1) to
LALR, are shown with the same colour.
States merged together: 3 and 6
4 and 7
8 and 9
UNIT III
CHAPTER 7
In syntax directed translation, along with the grammar we associate some informal
notations and these notations are called as semantic rules.
So we can say that:
Grammar + semantic rules = SDT (syntax directed translation)
In syntax directed translation, every non-terminal can get one or more than one
attribute, or sometimes 0 attributes, depending on the type of the attribute. The value
of these attributes is evaluated by the semantic rules associated with the production
rule.
In the semantic rule, the attribute is VAL, and an attribute may hold anything like a
string, a number, a memory location or a complex record.
o The syntax directed translation scheme is used to evaluate the order of semantic
rules.
o In translation scheme, the semantic rules are embedded within the right side of the
productions.
Example
S → E $   { print E.VAL }
Background :
The parser uses a CFG (Context-Free Grammar) to validate the input string and produce
output for the next phase of the compiler. The output could be either a parse tree or an abstract
syntax tree.
Now to interleave semantic analysis with syntax analysis phase of the compiler, we
use Syntax Directed Translation.
Definition
Syntax Directed Translation consists of augmented rules added to the grammar that facilitate
semantic analysis. SDT involves passing information bottom-up and/or top-down
the parse tree in the form of attributes attached to the nodes.
Syntax directed translation rules use 1) lexical values of nodes, 2) constants & 3)
attributes associated to the non-terminals in their definitions.
The general approach to Syntax-Directed Translation is to construct a parse tree or
syntax tree and compute the values of attributes at the nodes of the tree by visiting
them in some order.
In many cases, translation can be done during parsing without building an explicit
tree.
Example
E -> E+T | T
T -> T*F | F
F -> INTLIT
This is a grammar to syntactically validate an expression having additions and
multiplications in it. Now, to carry out semantic analysis we will augment SDT rules
to this grammar, in order to pass some information up the parse tree and check for
semantic errors, if any. In this example we will focus on evaluation of the given
expression, as we don’t have any semantic assertions to check in this very basic
example.
For understanding translation rules further, we take the first SDT augmented to the [ E -
> E+T ] production rule: E -> E+T { E.val = E.val + T.val }. The translation rule in consideration has val as an attribute for
both the non-terminals E and T. The right hand side of the translation rule corresponds to attribute values of the right-side
nodes of the production rule and vice-versa. Generalizing, SDTs are augmented rules
to a CFG that associate 1) set of attributes to every node of the grammar and 2) set
of translation rules to every production rule using attributes, constants and lexical
values.
Let’s take a string to see how semantic analysis happens – S = 2+3*4. Parse tree
corresponding to S would be
To evaluate translation rules, we can employ one depth first search traversal on the
parse tree. This is possible only because SDT rules don't impose any specific order
on evaluation, as long as children attributes are computed before parents, for a grammar
having all synthesized attributes.
Otherwise, we would have to figure out the best suited plan to traverse through the
parse tree and evaluate all the attributes in one or more traversals. For better
understanding, we will move bottom up in left to right fashion for computing
translation rules of our example.
Above diagram shows how semantic analysis could happen. The flow of information
happens bottom-up and all the children attributes are computed before parents, as
discussed above.
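The bottom-up evaluation of these synthesized attributes can be sketched in Python; each method returns the val attribute of its subtree (the grammar is the left-recursion-free equivalent of the one above, and INTLIT is assumed to be a single integer token):

class Evaluator:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def take(self):
        tok = self.toks[self.i]
        self.i += 1
        return tok

    def F(self):                 # F -> INTLIT   { F.val = INTLIT.lexval }
        return int(self.take())

    def T(self):                 # T -> T * F    { T.val = T.val * F.val }
        val = self.F()
        while self.peek() == '*':
            self.take()
            val = val * self.F()
        return val

    def E(self):                 # E -> E + T    { E.val = E.val + T.val }
        val = self.T()
        while self.peek() == '+':
            self.take()
            val = val + self.T()
        return val

print(Evaluator(['2', '+', '3', '*', '4']).E())   # 14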
Right hand side nodes are sometimes annotated with subscript 1 to distinguish them from the corresponding left hand side nodes.
Additional Information
Synthesized Attributes are such attributes that depend only on the attribute values
of children nodes.
Thus [ E -> E+T { E.val = E.val + T.val } ] has a synthesized attribute val
corresponding to node E. If all the attributes in a grammar are
synthesized, one depth first search traversal in any order is sufficient for the semantic
analysis phase.
Inherited Attributes are such attributes that depend on parent and/or sibling
attributes.
Thus [ Ep -> E+T { Ep.val = E.val + T.val, T.val = Ep.val } ], where E & Ep are the
same production symbols annotated to differentiate between parent and child, has an
inherited attribute val corresponding to node T.
Intermediate code
Intermediate code is used to translate the source code into the machine code.
Intermediate code lies between the high-level language and the machine language.
o If the compiler directly translates source code into the machine code without
generating intermediate code then a full native compiler is required for each new
machine.
o The intermediate code keeps the analysis portion same for all the compilers that's
why it doesn't need a full compiler for every unique machine.
o The intermediate code generator receives input from its predecessor phase, the
semantic analyzer, in the form of an annotated syntax tree.
o Using the intermediate code, the second phase of the compiler, the synthesis phase, is
changed according to the target machine.
Intermediate representation
High level intermediate code is close to the source language itself; it is easy to
generate from the source code, and source-level transformations can be applied to it.
Low level intermediate code is close to the target machine, which makes it suitable
for register and memory allocation etc.; it is used for machine-dependent
optimizations.
Postfix Notation
o Postfix notation is a useful form of intermediate code if the given language is
expressions.
o Postfix notation is also called ‘suffix notation’ or ‘reverse Polish’ notation. In postfix notation, any expression can be written unambiguously without
parentheses.
The ordinary (infix) way of writing the product of x and y is with the operator in the
middle: x * y. But in postfix notation, we place the operator at the right end, as
xy*.
In postfix notation, the operator follows the operand.
Example
Production        Semantic Rule                        Program fragment
E → E1 op E2      E.code = E1.code || E2.code || op    print op
E → (E1)          E.code = E1.code
E → id            E.code = id                          print id
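One reason postfix code is convenient: it can be evaluated (or translated) in a single left-to-right scan with a stack. A sketch in Python for the binary operators +, - and *:

def eval_postfix(tokens):
    stack = []
    ops = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b}
    for tok in tokens:
        if tok in ops:
            b = stack.pop()      # the operator follows its operands,
            a = stack.pop()      # so both are already on the stack
            stack.append(ops[tok](a, b))
        else:
            stack.append(int(tok))
    return stack.pop()

print(eval_postfix(['6', '7', '*']))            # xy* with x=6, y=7 -> 42
print(eval_postfix(['2', '3', '4', '*', '+']))  # 2 3 4 * + -> 14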
When you create a parse tree, it contains more details than are actually needed,
so it is very inefficient for the compiler to process it. Take the following parse
tree as an example:
In the parse tree, most of the leaf nodes are single child to their parent nodes.
In the syntax tree, we can eliminate this extra information.
Syntax tree is a variant of parse tree. In the syntax tree, interior nodes are
operators and leaves are operands.
A syntax tree is usually used when representing a program in a tree structure.
Abstract syntax trees are important data structures in a compiler. It contains the
least unnecessary information.
Abstract syntax trees are more compact than a parse tree and can be easily used by
a compiler.
In three-address code, the given expression is broken down into several separate
instructions, which can easily be translated into assembly language.
Given Expression:
a := (-c * b) + (-c * d)
t1 := -c
t2 := b*t1
t3 := -c
t4 := d * t3
t5 := t2 + t4
a := t5
Quadruples
The quadruples have four fields to implement the three address code. The field of
quadruples contains the name of the operator, the first source operand, the second
source operand and the result respectively.
Example
o a := -b * c + d
o Three-address code is as follows:
o t1 := -b
o t2 := t1 * c
o t3 := t2 + d
o a := t3
These statements are represented by quadruples as follows:
        Operator   Source 1   Source 2   Destination
(0)     uminus     b          -          t1
(1)     *          t1         c          t2
(2)     +          t2         d          t3
(3)     :=         t3         -          a
Triples
The triples have three fields to implement the three address code. The field of
triples contains the name of the operator, the first source operand and the second
source operand.
In triples, the results of the respective sub-expressions are denoted by the position of
the expression. Triples are equivalent to a DAG while representing expressions.
Example:
a := -b * c + d
        Operator   Source 1   Source 2
(0)     uminus     b          -
(1)     *          (0)        c
(2)     +          (1)        d
(3)     :=         a          (2)
Translation of Assignment Statements
S → id := E
E → E1 + E2
E → E1 * E2
E → (E1)
E → id
S → id := E    { p = look_up(id.name);
                 if p ≠ nil then
                    emit(p ':=' E.place)
                 else
                    error; }
E → E1 + E2    { E.place = newtemp();
                 emit(E.place ':=' E1.place '+' E2.place) }
E → E1 * E2    { E.place = newtemp();
                 emit(E.place ':=' E1.place '*' E2.place) }
E → (E1)       { E.place = E1.place }
E → id         { p = look_up(id.name);
                 if p ≠ nil then
                    E.place = p
                 else
                    error; }
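A sketch of this scheme in Python: newtemp and emit are modeled as shown above, and the expression comes in as a nested tuple instead of being parsed (all names here are illustrative):

temps = 0
code = []
symtab = {'a': 'a', 'b': 'b', 'c': 'c', 'x': 'x'}   # assume these are declared

def newtemp():
    global temps
    temps += 1
    return f't{temps}'

def emit(line):
    code.append(line)

def gen(e):
    # E -> id
    if isinstance(e, str):
        if e not in symtab:
            raise NameError(f'undeclared identifier {e}')
        return symtab[e]                 # E.place = p
    # E -> E1 op E2
    op, e1, e2 = e
    p1, p2 = gen(e1), gen(e2)
    place = newtemp()                    # E.place = newtemp()
    emit(f'{place} := {p1} {op} {p2}')
    return place

emit(f"x := {gen(('+', 'a', ('*', 'b', 'c')))}")    # x := a + b * c
print('\n'.join(code))
# t1 := b * c
# t2 := a + t1
# x := t2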
Boolean Expressions
Boolean expressions have two primary purposes. They are used for computing
logical values. They are also used as conditional expressions in if-then-else or
while-do statements.
E → E OR E
E → E AND E
E → NOT E
E → (E)
E → id relop id
E → TRUE
E → FALSE
E → E1 OR E2 {E.place = newtemp();
Emit (E.place ':=' E1.place 'OR' E2.place)
}
E → E1 AND E2 {E.place = newtemp();
Emit (E.place ':=' E1.place 'AND' E2.place)
}
The EMIT function is used to generate the three address code and the newtemp( )
function is used to generate the temporary variables.
The production E → id1 relop id2 uses nextstat, which gives the index of the next three-
address statement in the output sequence.
Here is the example which generates the three address code using the above
translation scheme:
p>q AND r<s OR u>v
100: if p>q goto 103
101: t1:=0
102: goto 104
103: t1:=1
104: if r<s goto 107
105: t2:=0
106: goto 108
107: t2:=1
108: if u>v goto 111
109: t3:=0
110: goto 112
111: t3:= 1
112: t4:= t1 AND t2
113: t5:= t4 OR t3
Statements that alter the flow of control
The goto statement alters the flow of control. If we implement goto statements
then we need to define a LABEL for a statement. A production can be added for
this purpose:
S→ LABEL : S
LABEL → id
In this production system, semantic action is attached to record the LABEL and its
value in the symbol table.
The following grammar is used to incorporate structured flow-of-control constructs:
S → if E then S
S → if E then S else S
S→ while E do S
S→ begin L end
S→ A
L→ L;S
L→ S
S → if E then M S
S→ if E then M S else M S
S→ while M E do M S
S→ begin L end
S→ A
L→ L;MS
L→ S
M→ ∈
N→ ∈
M → ∈   { M.QUAD = NEXTQUAD }
The production
S → while M1 E do M2 S1
can be factored as
S → C S1
C → W E do
W → while
The production
S → for L = E1 step E2 to E3 do S1
can be factored as
F → for L
T → F = E1 by E2 to E3 do
S → T S1
o The Emit function is used for appending the three address code to the output file.
o The look_up function returns the symbol table entry for id.name if it exists; otherwise an error is reported.
o The newtemp() function is used to generate new temporary variables.
o E.place holds the value of E.
Any translation done by a top-down parser can be done in a bottom-up parser also.
But in certain situations, translation with a top-down parser is advantageous, as tricks
such as placing a marker non-terminal can be avoided.
Semantic routines can be called in the middle of productions in a top-down parser. So
the location of a[i] can be computed at run time by evaluating the formula
i*width + c, where c is (baseA - low*width), which is evaluated at compile time.
The intermediate code generator should produce the code to evaluate this formula
i*width + c (one multiplication and one addition operation).
A two-dimensional array can be stored in either row-major (row by row) or column-
major (column by column) order.
Most of the programming languages use the row-major method.
The location of A[i1, i2] is baseA + ((i1 - low1)*n2 + (i2 - low2))*width.
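The address computation can be sketched as a small Python function (names are illustrative; base is the address of A's first element, and n2 is the number of columns):

def row_major_address(base, i1, i2, low1, low2, n2, width):
    # location of A[i1, i2] = base + ((i1 - low1) * n2 + (i2 - low2)) * width
    return base + ((i1 - low1) * n2 + (i2 - low2)) * width

# a 10 x 20 array of 4-byte elements, both dimensions indexed from 1
print(row_major_address(base=1000, i1=2, i2=3, low1=1, low2=1, n2=20, width=4))
# 1000 + ((2-1)*20 + (3-1)) * 4 = 1088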
UNIT IV
CHAPTER :8 - Symbol Table
The content of the symbol table
Data structures for symbol tables
Representing scope information
CHAPTER :9 -Run Time Storage Administration
Implementation of simple stack allocation scheme
Implementation of block structured languages
Storage allocation in block structured language
CHAPTER :10 -Error Detection And Recovery
Errors
Lexical Phase errors
Syntactic phase errors
Semantic errors.
CHAPTER 8
Symbol Table
Symbol Table entries – Each entry in symbol table is associated with attributes that support
compiler in different phases.
Items stored in Symbol table:
Variable names and constants
Procedure and function names
Literal constants and strings
Compiler generated temporaries
Labels in source languages
Information used by compiler from Symbol table:
Data type and name
Declaring procedures
Offset in storage
If structure or record then, pointer to structure table.
For parameters, whether parameter passing by value or by reference
Number and type of arguments passed to function
Base Address
Operations of Symbol table – The basic operations defined on a symbol table include:
o It is used to store the name of all entities in a structured form at one place.
o It is used to verify if a variable has been declared.
o It is used to determine the scope of a name.
o It is used to implement type checking by verifying assignments and expressions in the
source code are semantically correct.
o A symbol table can either be linear or a hash table. Using the following format, it
maintains an entry for each name:
<symbol name, type, attribute>
For example, suppose a variable store the information about the following variable
declaration:
static int salary
then, it stores an entry in the following format:
<salary, int, static>
The clause attribute contains the entries related to the name.
Implementation
The symbol table can be implemented as an unordered list if the compiler only has to
handle a small amount of data.
A symbol table can be implemented in one of the following techniques:
Linear (sorted or unsorted) list
Hash table
Binary search tree
Symbol tables are mostly implemented as hash tables.
Operations
Insert ()
o Insert () operation is more frequently used in the analysis phase when the tokens are
identified and names are stored in the table.
o The insert() operation is used to insert the information in the symbol table like the unique
name occurring in the source code.
o In the source code, the attribute for a symbol is the information associated with that
symbol. The information contains the state, value, type and scope about the symbol.
o The insert() function takes the symbol and its value as arguments.
For example:
int x;
would be processed by the compiler as:
insert(x, int)
lookup()
In the symbol table, the lookup() operation is used to search for a name. It is used to determine
the existence of the symbol in the table, whether it is declared before it is used, and whether
the name is used in the correct scope. The basic format is:
lookup (symbol)
This format varies according to the programming language.
Activation Record
o Control stack is a run time stack which is used to keep track of the live procedure
activations i.e. it is used to find out the procedures whose execution have not been
completed.
o When a procedure is called (activation begins), the procedure name is pushed onto the stack,
and when it returns (activation ends), it is popped.
o An activation record is pushed into the stack when a procedure is called and it is popped
when the control returns to the caller function.
Access Link: It is used to refer to non-local data held in other activation records.
Saved Machine Status: It holds the information about status of machine before the procedure is
called.
Local Data: It holds the data that is local to the execution of the procedure.
Following are commonly used data structure for implementing symbol table :-
1. List –
In this method, an array is used to store names and associated information.
A pointer “available” is maintained at end of all stored records and new names are
added in the order as they arrive
To search for a name we start from beginning of list till available pointer and if not
found we get an error “use of undeclared name”
While inserting a new name we must ensure that it is not already present otherwise
error occurs i.e. “Multiple defined name”
Insertion is fast O(1), but lookup is slow for large tables – O(n) on average
Advantage is that it takes minimum amount of space.
2. Linked List –
This implementation uses a linked list. A link field is added to each record.
Searching of names is done in the order pointed to by the link field.
A pointer “First” is maintained to point to first record of symbol table.
Insertion is fast O(1), but lookup is slow for large tables – O(n) on average
3. Hash Table –
In the hashing scheme, two tables are maintained – a hash table and a symbol table – and this is
the most commonly used method to implement symbol tables.
A hash table is an array with index range 0 to tablesize – 1. Its entries are pointers
pointing to the names of the symbol table.
To search for a name we use hash function that will result in any integer between 0 to
tablesize – 1.
Insertion and lookup can be made very fast – O(1).
Advantage is quick search is possible and disadvantage is that hashing is complicated
to implement.
4. Binary Search Tree –
Another approach to implementing a symbol table is to use a binary search tree, i.e. we add
two link fields, left and right child, to each record.
All names are created as descendants of the root node and always follow the property of a binary
search tree.
Insertion and lookup are O(log2 n) on average.
In the source program, every name possesses a region of validity, called the scope of that name.
For example:
int x;
void f(int m) {
float x, y;
{
int i, j;
int u, v;
}
}
int g (int n)
{
bool t;
}
Fig: Symbol table organization that complies with static scope information rules
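A minimal sketch of such an organization in Python, using a stack of hash tables (one table per scope) so that lookup always finds the innermost declaration first:

class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                    # the outermost (global) scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def insert(self, name, info):
        self.scopes[-1][name] = info          # declare in the current scope

    def lookup(self, name):
        for scope in reversed(self.scopes):   # innermost scope first
            if name in scope:
                return scope[name]
        return None                           # use of an undeclared name

st = SymbolTable()
st.insert('x', 'int, global')
st.enter_scope()                              # entering f
st.insert('x', 'float, local to f')
print(st.lookup('x'))                         # float, local to f
st.exit_scope()
print(st.lookup('x'))                         # int, global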
CHAPTER 9:
Static allocation
Stack allocation
Call
Return
Halt
Action, a placeholder for other statements
A basic block contains a sequence of statements. The flow of control enters at the beginning
of the block and leaves at the end without any halt or branching (except possibly at the last instruction
of the block).
The following sequence of three address statements forms a basic block:
t1:= x * x
t2:= x * y
t3:= 2 * t2
t4:= t1 + t3
t5:= y * y
t6:= t4 + t5
Output: it contains a list of basic blocks with each three address statement in exactly one block
Method: First identify the leaders in the code. The rules for finding leaders are as follows:
1. The first statement is a leader.
2. Any statement that is the target of a conditional or unconditional goto is a leader.
3. Any statement that immediately follows a conditional or unconditional goto is a leader.
For each leader, its basic block consists of the leader and all statements up to, but not
including, the next leader or the end of the program.
Consider the following source code for dot product of two vectors a and b of length 10:
B1
(1) prod := 0
(2) i := 1
B2
(3) t1 := 4*i
(4) t2 := a[t1]
(5) t3 := 4*i
(6) t4 := b[t3]
(7) t5 := t2*t4
(8) t6 := prod+t5
(9) prod := t6
(10) t7 := i+1
(11) i := t7
(12) if i<=10 goto (3)
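The leader rules can be applied mechanically. A sketch in Python over the example above (jump targets are given as 0-based instruction indices; statement (12), index 11, jumps back to statement (3), index 2):

def basic_blocks(instrs, jumps):
    # instrs: list of statements; jumps: {index of a goto: index of its target}
    leaders = {0}                             # rule 1: the first statement
    for i, target in jumps.items():
        leaders.add(target)                   # rule 2: the target of a goto
        if i + 1 < len(instrs):
            leaders.add(i + 1)                # rule 3: the statement after a goto
    order = sorted(leaders)
    return [instrs[start:end]
            for start, end in zip(order, order[1:] + [len(instrs)])]

instrs = ['prod := 0', 'i := 1',
          't1 := 4*i', 't2 := a[t1]', 't3 := 4*i', 't4 := b[t3]',
          't5 := t2*t4', 't6 := prod+t5', 'prod := t6',
          't7 := i+1', 'i := t7', 'if i<=10 goto (3)']
for block in basic_blocks(instrs, {11: 2}):
    print(block)                              # B1 = (1)-(2), B2 = (3)-(12)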
The optimization process can be applied on a basic block. During optimization, we must not
change the set of expressions computed by the block.
There are two type of basic block optimization. These are as follows:
1. Structure-Preserving Transformations
2. Algebraic Transformations
Structure preserving transformations:
A common sub-expression doesn't need to be computed over and over again.
Instead, you can compute it once and keep it in store, from where it is referenced when
it is encountered again.
a:=b+c
b:=a-d
c:=b+c
d:=a-d
In the above code, the second and fourth statements compute the same expression,
so the block can be transformed as follows:
a:=b+c
b:=a-d
c:=b+c
d:=b
2. Algebraic transformations:
In the algebraic transformation, we can change the set of expression into an algebraically
equivalent set. Thus the expression x:= x + 0 or x:= x *1 can be eliminated from a basic
block without changing the set of expression.
Constant folding is a class of related optimizations. Here, at compile time, we evaluate
constant expressions and replace the constant expressions by their values. Thus the
expression 5*2.7 would be replaced by 13.5.
Sometimes unexpected common sub-expressions are generated by the relational
operators like <=, >=, <, >, ==, etc.
Sometimes the associative law is applied to expose common sub-expressions without
changing the basic block value. For example, if the source code has the assignments
a := b + c
e := c + d + b
the following intermediate code may be generated:
a := b + c
t := c + d
e := t + b
CHAPTER 10
Errors :
In this phase of compilation, all possible errors made by the user are detected and reported
to the user in the form of error messages. This process of locating errors and reporting them to the user
is called the error handling process.
Functions of Error handler
Detection
Reporting
Recovery
Classification of Errors
Lexical Phase Errors
This type of error is detected during the lexical analysis phase.
A lexical error is a sequence of characters that does not match the pattern of any token.
Lexical phase errors are found while the program is being scanned.
A lexical phase error can be:
o A spelling error.
o Exceeding the length of an identifier or numeric constant.
o The appearance of illegal characters.
o Removal of a character that should be present.
o Replacement of a character with an incorrect character.
o Transposition of two characters.
Example:
void main()
{
int x=10, y=20;
char * a;
a= &x;
x= 1xab;
}
In this code, 1xab is neither a number nor an identifier. So this code will show the lexical
error.
Error recovery:
Panic Mode Recovery
In this method, successive characters from the input are removed one at a time until a
designated set of synchronizing tokens is found. Synchronizing tokens are delimiters such
as ; or }.
The advantage is that it is easy to implement and is guaranteed not to go into an infinite loop.
The disadvantage is that a considerable amount of input is skipped without checking it for
additional errors.
Syntax Phase Errors
This type of error appears during the syntax analysis phase, i.e. syntax errors are found
while the program is being parsed.
Some syntax error can be:
o Error in structure
o Missing operators
o Unbalanced parenthesis
When an invalid calculation is entered into a calculator, a syntax error can also occur.
This can be caused by entering several decimal points in one number or by opening
brackets without closing them.
if (number=200)
cout << "number is equal to 200";
else
cout << "number is not equal to 200";
In this code (firstprog.cpp), the if expression uses the equals sign, which is actually the assignment operator.
Due to the assignment operator, number is set to 200, and the expression number=200 is
always true, because the expression's value is actually 200. For this example, the correct
statement would be:
if (number==200)
Semantic Errors
Example: the following code produces a compiler message, because the identifier t is used without being declared:
int i;
void f (int m)
{
m=t;
}
int a = "hello"; // the types String and int are not compatible
String s = "...";
int a = 5 - s; // the - operator does not support arguments of type String
Error recovery
If the error “Undeclared Identifier” is encountered, then to recover from it a symbol table
entry for the corresponding identifier is made.
If data types of two operands are incompatible then, automatic type conversion is done by
the compiler.
UNIT V
Object programs
Problems in code generation
A machine model
A simple code generator
Register allocation and assignment
Code generation from DAG’s
Peephole optimization
CHAPTER 11
PRINCIPAL SOURCES OF OPTIMIZATION
There are a number of ways in which a compiler can improve a program without
changing the function it computes.
Function preserving transformations examples:
Common sub expression elimination
Copy propagation,
Dead-code elimination
Constant folding
The other transformations come up primarily when global optimizations are performed.
Frequently, a program will include several calculations of the offset in an array. Some of
the duplicate calculations cannot be avoided by the programmer because they lie below
the level of detail accessible within the source language.
For example
t1: = 4*i
t2: = a [t1]
t3: = 4*j
t4: = 4*i
t5: = n
t6: = b [t4] +t5
Here, t4 := 4*i recomputes the value already available in t1, so the common sub-expression
4*i can be eliminated by using t1 in place of t4.
Copy Propagation:
Assignments of the form f := g are called copy statements. The idea behind copy propagation is to
use g for f wherever possible after the copy statement.
For example:
x = Pi;
A = x * r * r;
With copy propagation this becomes A = Pi * r * r, and the copy statement may become dead code.
Dead-Code Eliminations:
A variable is live at a point in a program if its value can be used subsequently; otherwise,
it is dead at that point. A related idea is dead or useless code, statements that compute
values that never get used. While the programmer is unlikely to introduce any dead code
intentionally, it may appear as the result of previous transformations.
Example:
i=0;
if(i==1)
{
a=b+5;
}
Here, ‘if’ statement is dead code because this condition will never get
satisfied.
Constant folding:
Deducing at compile time that the value of an expression is a constant and using the
constant instead is known as constant folding. One advantage of copy propagation is that
it often turns the copy statement into dead code.
For example,
o a=3.14157/2 can be replaced by
o a=1.570, thereby eliminating a division operation.
Loop Optimizations:
In loops, especially in the inner loops, programs tend to spend the bulk of their time. The
running time of a program may be improved if the number of instructions in an inner loop
is decreased, even if we increase the amount of code outside that loop.
An important modification that decreases the amount of code in a loop is code motion.
This transformation takes an expression that yields the same result independent of the
number of times a loop is executed (a loop-invariant computation) and places the
expression before the loop. Note that the notion “before the loop” assumes the existence
of an entry for the loop. For example, evaluation of limit-2 is a loop-invariant
computation in the following while-statement:
o while (i <= limit-2) /* statement does not change limit */
Code motion will result in the equivalent code:
o t = limit-2;
o while (i <= t) /* statement does not change limit or t */
Induction Variables :
Loops are usually processed inside out. For example consider the loop around B3. Note
that the values of j and t4 remain in lock-step; every time the value of j decreases by 1,
that of t4 decreases by 4 because 4*j is assigned to t4. Such identifiers are called
induction variables.
When there are two or more induction variables in a loop, it may be possible to get rid of
all but one, by the process of induction-variable elimination. For the inner loop around
B3 in Fig.5.3 we cannot get rid of either j or t4 completely; t4 is used in B3 and j in B4.
However, we can illustrate reduction in strength and illustrate a part of the process of
induction-variable elimination. Eventually j will be eliminated when the outer loop of
B2- B5 is considered.
Example:
As the relationship t4:=4*j surely holds after such an assignment to t4 in Fig. and t4 is
not changed elsewhere in the inner loop around B3, it follows that just after the statement
j:=j-1 the relationship t4:= 4*j-4 must hold. We may therefore replace the assignment
t4:= 4*j by t4:= t4-4. The only problem is that t4 does not have a value when we enter
block B3 for the first time. Since we must maintain the relationship t4=4*j on entry to the
block B3, we place an initialization of t4 at the end of the block where j itself is
initialized, shown by the dashed addition to block B1 in Fig.5.3.
Loop optimization techniques:
1. Code motion
2. Induction-variable elimination
3. Strength reduction
1. Code Motion:
Code motion is used to decrease the amount of code in a loop. This transformation takes a
statement or expression which can be moved outside the loop body without affecting the
semantics of the program, and places it before the loop, as in the limit-2 example above.
2. Induction-Variable Elimination:
Induction-variable elimination is used to replace two or more induction variables of a loop by a single one, as discussed above.
3. Reduction in Strength:
Strength reduction is used to replace an expensive operation by a cheaper one on the
target machine.
Addition of a constant is cheaper than a multiplication. So we can replace multiplication
with an addition within the loop.
Multiplication is cheaper than exponentiation. So we can replace exponentiation with
multiplication within the loop.
Example:
while (i<10)
{
j = 3*i + 1;
a[j] = a[j] - 2;
i = i + 2;
}
After strength reduction the code will be:
s = 3*i + 1;
while (i<10)
{
j = s;
a[j] = a[j] - 2;
i = i + 2;
s = s + 6;
}
A DAG for basic block is a directed acyclic graph with the following labels on nodes:
1. The leaves of graph are labeled by unique identifier and that identifier can be variable
names or constants.
2. Interior nodes of the graph are labeled by an operator symbol.
3. Nodes are also given a sequence of identifiers for labels to store the computed value.
Method:
The statements of the basic block have one of the following three forms:
(i) x := y OP z
(ii) x := OP y
(iii) x := y
Step 1: If the y operand is undefined, then create node(y). Similarly, if the z operand is
undefined, create node(z).
Step 2:
For case (i), create node(OP) whose right child is node(z) and left child is node(y), unless such a node already exists.
For case (ii), check whether there is a node(OP) with one child node(y); if not, create it.
For case (iii), node n will be node(y).
Output:
For node(x), delete x from the list of identifiers of the node it was previously attached to. Append x to the list of attached identifiers for
the node n found in step 2. Finally, set node(x) to n.
Example:
1. S1 := 4 * i
2. S2 := a[S1]
3. S3 := 4 * i
4. S4 := b[S3]
5. S5 := S2 * S4
6. S6 := prod + S5
7. prod := S6
8. S7 := i + 1
9. i := S7
10. if i <= 20 goto (1)
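The construction can be sketched in C. The node layout, the find_or_create helper, and
the fixed-size table below are illustrative assumptions, not the textbook's pseudocode;
the point is that looking up an existing node with the same label and children is what
detects common subexpressions such as the repeated 4 * i above:

#include <stdio.h>
#include <string.h>

#define MAXNODES 64

struct DagNode {
    char label[8];     /* operator symbol, or identifier/constant for a leaf */
    int  left, right;  /* child node indices; -1 when absent */
};

static struct DagNode nodes[MAXNODES];
static int nnodes = 0;

/* Return an existing node with this label and children, or create one. */
static int find_or_create(const char *label, int left, int right)
{
    for (int i = 0; i < nnodes; i++)
        if (strcmp(nodes[i].label, label) == 0 &&
            nodes[i].left == left && nodes[i].right == right)
            return i;                          /* common subexpression: reuse */
    strcpy(nodes[nnodes].label, label);
    nodes[nnodes].left  = left;
    nodes[nnodes].right = right;
    return nnodes++;
}

int main(void)
{
    /* Statements 1 and 3 above both compute 4 * i: one node suffices. */
    int four = find_or_create("4", -1, -1);    /* leaves */
    int i    = find_or_create("i", -1, -1);
    int s1   = find_or_create("*", four, i);   /* S1 := 4 * i */
    int s3   = find_or_create("*", four, i);   /* S3 := 4 * i (reused) */
    printf("S1 and S3 share node %d: %s\n", s1, s1 == s3 ? "yes" : "no");
    return 0;
}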
To optimize the code efficiently, the compiler collects information about the whole
program and distributes this information to each block of the flow graph. This process is
known as data-flow analysis.
Certain optimizations can be achieved only by examining the entire program; they cannot
be achieved by examining just a portion of it.
Use-definition chaining is one particular problem of this kind: using the value of a
variable, we try to find out which definition of that variable is applicable at a statement.
Based on local information alone, a compiler can perform some optimizations. For example,
consider the following code:
x = a + b;
x = 6 * 3;
o In this code, the first assignment to x is useless: the value computed for x is never
used in the program.
o At compile time the expression 6 * 3 will be evaluated, simplifying the second
assignment statement to x = 18;
o Some optimizations need more global information. For example, consider the
following code:
a = 1;
b = 2;
c = 3;
if (....) x = a + 5;
else x = b + 4;
c = x + 1;
In this code, the assignment c = 3 on the third line is useless, since c is reassigned before
that value is ever used, and the expression x + 1 can be simplified to 7, since x is 6 on
both branches of the if-statement.
But it is less obvious how a compiler can discover these facts by looking only at one
or two consecutive statements. A more global analysis is required, so that the compiler
knows, at each point in the program, which definitions of each variable may reach that
point and whether a variable's value will be used again.
Data-flow analysis is used to discover this kind of information. It can be performed on
the program's control flow graph (CFG).
The control flow graph of a program is used to determine those parts of a program to
which a particular value assigned to a variable might propagate.
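As a small illustration (the condition c is arbitrary), the following code has two
definitions of x, and both can reach the final use, so neither assignment may be removed:

x = 1;        /* definition d1 */
if (c)
    x = 2;    /* definition d2 */
y = x;        /* both d1 and d2 may reach this use of x */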
CHAPTER 12
Object programs
Assume that you have a C program. You give the C program to the compiler, and the
compiler produces output in assembly code. That assembly language code is then given to
the assembler, and the assembler produces some code; that code is known as object code.
However, when you compile a program, you do not invoke the compiler and assembler
separately. You just give the program to the compiler, and the compiler gives you directly
executable code. The assembler, linker, and loader are actually bundled together with the
compiler, so all these modules are kept together in the compiler software itself. So when
you call gcc, you are not just calling the compiler; you are calling the compiler, then the
assembler, then the linker and loader.
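The individual stages can be observed with gcc (these are standard flags; the file names
are illustrative):

gcc -S prog.c          # compile only: produces assembly in prog.s
gcc -c prog.c          # compile and assemble: produces object code in prog.o
gcc prog.o -o prog     # link the object code into an executable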
Once you call the compiler, your object code is placed on the hard disk. This object file
contains various parts: typically a header, the machine-code (text) section, a data section,
relocation information, and a symbol table.
Problems In Code Generation
1. Input to the code generator:
The input to the code generator consists of the intermediate representation of the source
program produced by the front end, together with information in the symbol table needed
to determine the run-time addresses of the data objects denoted by the names in the
intermediate representation.
The intermediate representation can be a linear representation such as postfix notation, a
three-address representation such as quadruples, a virtual machine representation such as
stack machine code, or a graphical representation such as a syntax tree or a DAG.
Prior to code generation, the source program must have been scanned, parsed and
translated into the intermediate representation, along with the necessary type checking.
Therefore, the input to code generation is assumed to be error-free.
2. Target program:
The output of the code generator is the target program. The output may be absolute
machine language, relocatable machine language, or assembly language.
3. Memory management:
Names in the source program are mapped to addresses of data objects in run-time
memory by the front end and code generator.
This process makes use of the symbol table: a name in a three-address statement refers to
the symbol-table entry for that name.
Labels in three-address statements have to be converted to addresses of instructions.
For example, suppose quadruple number j is "goto i":
if i < j, the jump is backward, and a jump instruction can be generated immediately, with
target address equal to the location of the first machine instruction for quadruple i.
if i > j, the jump is forward. We must store on a list for quadruple i the location of the
first machine instruction generated for quadruple j. When quadruple i is processed, the
machine locations of all instructions that jump forward to i are filled in.
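A minimal backpatching sketch in C (the table sizes, the code array, and all names are
illustrative assumptions, not a prescribed interface):

#include <stdio.h>

#define MAXQ 64
#define MAXCODE 256

static int code[MAXCODE];    /* emitted "machine" words; jump targets live here */
static int loc[MAXQ];        /* machine location of each quadruple; -1 if unknown */
static int pending[MAXQ][8]; /* forward-jump sites waiting on quadruple q */
static int npending[MAXQ];

/* Record that the jump at machine location 'site' targets quadruple q. */
static void emit_jump_to(int q, int site)
{
    if (loc[q] != -1)
        code[site] = loc[q];                  /* backward jump: target known now */
    else
        pending[q][npending[q]++] = site;     /* forward jump: backpatch later */
}

/* Called when code generation for quadruple q begins at location 'here'. */
static void begin_quad(int q, int here)
{
    loc[q] = here;
    for (int k = 0; k < npending[q]; k++)
        code[pending[q][k]] = here;           /* fill in the waiting forward jumps */
    npending[q] = 0;
}

int main(void)
{
    for (int q = 0; q < MAXQ; q++) loc[q] = -1;
    begin_quad(1, 0);      /* quadruple 1 starts at location 0 */
    emit_jump_to(1, 5);    /* backward jump at location 5: filled immediately */
    emit_jump_to(3, 7);    /* forward jump at location 7: left pending */
    begin_quad(3, 10);     /* quadruple 3 starts: the pending jump is patched */
    printf("targets: %d %d\n", code[5], code[7]);   /* prints: targets: 0 10 */
    return 0;
}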
4. Instruction selection:
The nature of the target machine's instruction set determines the difficulty of instruction
selection, and the quality of the generated code is determined by its speed and size. For
example, the three-address statements
a := b + c
d := a + e (a)
would be translated, statement by statement, into
MOV b, R0
ADD c, R0
MOV R0, a (b)
MOV a, R0
ADD e, R0
MOV R0, d
Here the fourth instruction, MOV a, R0, is redundant, since a is already in R0.
5. Register allocation:
o Instructions involving register operands are shorter and faster than those involving
operands in memory. The use of registers is subdivided into two subproblems:
o Register allocation - the set of variables that will reside in registers at each point in
the program is selected.
o Register assignment - the specific register that each variable will reside in is picked.
o Certain machines require even-odd register pairs for some operands and results.
For example, consider a division instruction of the form D x, y, where the dividend
occupies the even register of the even/odd register pair x and y represents the divisor.
6. Evaluation order:
The order in which the computations are performed can affect the efficiency of the
target code. Some computation orders require fewer registers to hold intermediate results
than others.
Machine model
op source, destination
where op is an op-code, and source and destination are data fields.
Example:
3. Literal Mode:
MOV #1, R0
cost = 1 + 1 = 2 (one word for the instruction and one for the constant 1)
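On this cost model, where an instruction costs one word plus one word for each operand
that is not a register, some sample costs are:

MOV R0, R1     cost = 1 (both operands are registers)
MOV b, R0      cost = 2 (one extra word for the memory address of b)
MOV b, a       cost = 3 (one word each for the addresses of b and a)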
Example:
Consider the three-address statement x := y + z. It can be implemented by the following
sequence of code:
MOV y, R0
ADD z, R0
MOV R0, x
Register and Address Descriptors:
o A register descriptor keeps track of what is currently in each register. It is consulted
whenever a new register is needed.
o An address descriptor keeps track of the location (register, memory address, or stack
location) where the current value of each name can be found at run time.
A code-generation algorithm:
The algorithm takes a sequence of three-address statements as input. For each three-
address statement of the form x := y op z, it performs the following actions:
1. Invoke a function getreg to find out the location L where the result of the computation
y op z should be stored.
2. Consult the address descriptor for y to determine y', the current location of y. If the
value of y is currently in both memory and a register, prefer the register for y'. If the
value of y is not already in L, generate the instruction MOV y', L to place a copy of y
in L.
3. Generate the instruction OP z', L, where z' is the current location of z. If z is in both
memory and a register, prefer the register. Update the address descriptor of x to
indicate that x is in location L, and remove x from all other descriptors.
4. If the current values of y or z have no next uses, are not live on exit from the block,
and are in registers, alter the register descriptors to indicate that, after execution of
x := y op z, those registers will no longer contain y or z.
The assignment statement d:= (a-b) + (a-c) + (a-c) can be translated into the following
sequence of three address code:
t:= a-b
u:= a-c
v:= t +u
d:= v+u
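Applying the algorithm with two available registers might produce the following sequence
(a sketch; the register choices shown are ones getreg could plausibly make):

MOV a, R0
SUB b, R0      (t in R0)
MOV a, R1
SUB c, R1      (u in R1)
ADD R1, R0     (v in R0)
ADD R1, R0     (d in R0)
MOV R0, d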
Count uses:
A simple register-allocation heuristic is to count the uses of each value and keep the most
heavily used values in registers.
Advantage: heavily used values reside in registers.
Disadvantage: it does not consider the non-uniform distribution of uses.
Need for global register allocation:
Local allocation does not take into account that some instructions (e.g., those in loops)
execute more frequently. It forces us to store and load at basic-block boundaries, since
each block has no knowledge of the context of the others.
Global allocation is needed to find the live range(s) of each variable and the area(s)
where the variable is used or defined. The cost of spilling depends on the frequencies and
locations of uses.
Basic idea:
Registers can be accessed faster than memory. Instructions involving register operands
are shorter and faster than those involving memory operands.
Register allocation: we select the set of variables that will reside in registers.
Register assignment: we pick the specific register in which each variable will reside.
A Directed Acyclic Graph (DAG) is a tool that depicts the structure of basic blocks, helps
to see the flow of values among the basic blocks, and supports optimization. A DAG
provides easy transformations on basic blocks.
Leaf nodes represent identifiers, names or constants; interior nodes represent operators.
Interior nodes also represent the results of expressions, or the identifiers/names where the
values are to be stored or assigned.
Example:
t0 = a + b
t1 = t0 + c
d = t0 + t1
(Figure: the corresponding DAG, with leaves a, b and c, an interior node t0 for a + b, a
node t1 for t0 + c, and a node d for t0 + t1 that reuses the node for t0.)
Peephole Optimization
This optimization technique works locally, on a small portion of the code at hand (a
"peephole"), to transform it into optimized code. These methods can be applied to
intermediate code as well as to target code. A window of statements is analyzed and
checked for the following possible optimizations:
Redundant instruction elimination
Consider the following target-code sequence:
MOV x, R0
MOV R0, R1
We can delete the first instruction and rewrite the sequence as:
MOV x, R1
Unreachable code
Unreachable code is a part of the program that is never executed because of the
surrounding programming constructs. Programmers may have accidentally written a piece
of code that can never be reached.
Example:
#include <stdio.h>

int add_ten(int x)
{
    return x + 10;
    printf("value of x is %d\n", x);   /* never executed */
}
In this code segment, the printf statement will never be executed, as program control
returns before it can execute; hence the printf can be removed.
Flow-of-control optimization
There are instances in code where program control jumps back and forth without
performing any significant task. These jumps can be removed. Consider the following
chunk of code:
...
MOV R1, R2
GOTO L1
...
L1 : GOTO L2
L2 : INC R1
...
This code can be changed to:
MOV R1, R2
GOTO L2
...
L2 : INC R1
Algebraic expression simplification
There are occasions where algebraic expressions can be made simple. For example, the
expression a = a + 0 can be replaced by a itself, and the expression a = a + 1 can simply
be replaced by INC a.
Strength reduction
There are operations that consume more time and space. Their ‘strength’ can be reduced
by replacing them with other operations that consume less time and space, but produce
the same result.
For example, x * 2 can be replaced by x << 1, which involves only one left shift.
Though a * a and a^2 produce the same output, a * a is much more efficient to
implement than a call to an exponentiation routine.
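For example (a minimal sketch; the variables are illustrative, and pow is the standard
exponentiation function from <math.h>):

y = x * 2;        /* cheaper: y = x << 1; a single left shift */
s = pow(a, 2.0);  /* cheaper: s = a * a; a single multiplication */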