CD Digital Notes Cse-Aiml
DIGITAL NOTES
UNIT I
Finite Automata:
A finite automaton is a mathematical model of a digital computer. Finite automata are used as string
or language acceptors. They are mainly used in pattern-matching tools like LEX and in text editors.
A finite state system represents a mathematical model of a system with certain input.
The model finally gives a certain output. The input given to the machine is processed by various
states. These states are called intermediate states.
A good example of a finite state system is the control mechanism of an elevator. This mechanism only
remembers the current floor number pressed; it does not remember all the previously pressed numbers.
Finite state systems are useful in the design of text editors, lexical analyzers and natural language
processing. The word “automaton” is singular and “automata” is plural.
An automaton in which the output depends only on the input is called an automaton without memory.
An automaton in which the output depends on both the input and the state is called an automaton with memory.
Finite Automaton Model:
Informally, a finite automaton (FA) is a simple machine that reads an input string, one symbol at a
time, and then, after the input has been completely read, decides whether to accept or reject the input.
As the symbols are read from the tape, the automaton can change its state to reflect how it reacts to
what it has seen so far.
The Finite Automata can be represented as,
i) Input Tape: Input tape is a linear tape having some cells which can hold an input symbol from ∑.
ii) Finite Control: It indicates the current state and decides the next state on receiving a particular input from
the input tape. The tape reader reads the cells one by one from left to right, and at any instant only one input
symbol is read. The reading head examines the symbol read, and the head moves to the right with or without
changing the state. When the entire string has been read, the string is accepted if the finite control is in a final
state; otherwise it is rejected. The finite automaton can be represented by a transition diagram in which the vertices
represent the states and the edges represent transitions.
A Finite Automaton (FA) consists of a finite set of states and a set of transitions among states in response to
inputs.
• Always associated with an FA is a transition diagram, which is nothing but a 'directed graph'.
• The vertices of the graph correspond to the states of the FA.
• The FA accepts a string x of symbols from ∑ if the sequence of transitions corresponding to the
symbols in x leads from the start state to an accepting state.
Finite Automata can be classified into two types:
1. FA without output or Language Recognizers ( e.g. DFA and NFA)
2. FA with output or Transducers ( e.g. Moore and Mealy machines)
A Finite Automaton (FA) is a collection of states in which we make transitions based upon input symbols.
For any element q of Q and any symbol σ ∈ Σ, we interpret δ(q, σ) as the state to which the FA moves if it is in
state q and receives the input σ.
The input is itself a pair, because δ was defined as a function of the form Q × Σ → Q, so the input
has the form Q × Σ, the set of all pairs in which the first element is taken from set Q and the
second element from set Σ.
o Of course, there may be easier ways to visualize δ. In particular, we could do it via a table with
the input state on one axis and the input character on the other:

            Starting state
Input       q0      q1      q2
0           q1      q2      q2
1           q1      q2      q2

o The table representation is particularly useful because it suggests an efficient implementation. If
we numbered our states instead of using arbitrary labels:

            Starting state
Input       0       1       2
0           1       2       2
1           1       2       2
[Diagram: FA accepting all strings starting with a and ending with b.]
[Diagram: FA accepting all strings starting with b and ending with a.]
L = L1 ∩ L2
[Diagram: FA accepting all strings with an even number of 1's.]
State Transition Diagram for L1 ∩ L2:
The intersection of L1 and L2 can be illustrated by the language of strings over {0, 1} that end
with 01 and have an even number of 1's.
L = L1 ∩ L2
= {1001, 0101, 01001, 10001, ....}
Thus L1 and L2 have been combined through the intersection process, and the final FA accepts every
string that has an even number of 1's and ends with 01.
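This "combining" can be made concrete with a small C sketch of the product construction (a minimal sketch; the two transition tables below are assumed encodings of "ends with 01" and "has an even number of 1's"). Running both DFAs in lockstep on the same input is exactly the product automaton for L1 ∩ L2:

#include <stdio.h>

/* DFA 1: strings over {0,1} ending in "01".
   States: 0 (no useful suffix), 1 (suffix ends in 0), 2 (suffix ends in 01).
   Accepting state: 2. */
int d1[3][2] = {{1,0},{1,2},{1,0}};

/* DFA 2: strings with an even number of 1's.
   States: 0 (even), 1 (odd). Accepting state: 0. */
int d2[2][2] = {{0,1},{1,0}};

int in_intersection(const char *w) {
    int s1 = 0, s2 = 0;
    for (; *w; w++) {
        int c = *w - '0';
        s1 = d1[s1][c];              /* advance both machines on the same symbol */
        s2 = d2[s2][c];
    }
    return s1 == 2 && s2 == 0;       /* the pair (s1,s2) accepts iff both accept */
}

int main(void) {
    /* prints 1 1 0: "1001" and "0101" are in L1 ∩ L2, "001" is not */
    printf("%d %d %d\n", in_intersection("1001"),
                         in_intersection("0101"),
                         in_intersection("001"));
    return 0;
}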
Regular Language:
The set of regular languages over an alphabet Σ is defined recursively as below. Any language belonging to
this set is a regular language over Σ.
Definition of the set of Regular Languages:
Basis Clause: ∅, {ε} and {a} for any symbol a ∈ Σ are regular languages.
Inductive Clause: If Lr and Ls are regular languages, then Lr ∪ Ls, LrLs and Lr* are regular languages.
Nothing is a regular language unless it is obtained from the above two clauses.
For example, let Σ = {a, b}. Then since {a} and {b} are regular languages, {a, b} ( = {a} ∪ {b} ) and {ab} (
= {a}{b} ) are regular languages. Also, since {a} is regular, {a}* is a regular language, which is the set of
strings consisting of a's, such as ε, a, aa, aaa, aaaa etc. Note also that Σ*, which is the set of strings
consisting of a's and b's, is a regular language because {a, b} is regular.
Regular Expression:
Regular expressions are used to denote regular languages. They can represent regular languages and
operations on them succinctly.
The set of regular expressions over an alphabet Σ is defined recursively as below. Any element of that set is
a regular expression.
Basis Clause: ∅, ε and a are regular expressions corresponding to the languages ∅, {ε} and {a}, respectively,
where a is an element of Σ.
Inductive Clause: If r and s are regular expressions corresponding to the languages Lr and Ls, then ( r + s
), (rs) and ( r* ) are regular expressions corresponding to the languages Lr ∪ Ls, LrLs and Lr*, respectively.
Nothing is a regular expression unless it is obtained from the above two clauses.
In a DFA, for a particular input character, the machine goes to one state only. A transition function is
defined on every state for every input symbol. Also in DFA null (or ε) move is not allowed, i.e., DFA
cannot change state without any input character.
For example, the DFA below with Σ = {0, 1} accepts all strings ending with 0. [Diagram missing; see the sketch that follows.]
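Since the diagram itself is not reproduced in these notes, here is a minimal C sketch of such a DFA, encoded as a transition table (the state numbering is an assumption for illustration):

#include <stdio.h>

/* 2-state DFA over {0,1} accepting exactly the strings ending with 0.
   State 0: last symbol was not 0 (or no input yet); state 1: last symbol was 0. */
int delta[2][2] = {
    {1, 0},   /* from state 0: on '0' -> 1, on '1' -> 0 */
    {1, 0},   /* from state 1: on '0' -> 1, on '1' -> 0 */
};

int ends_with_zero(const char *w) {
    int q = 0;                    /* start state */
    for (; *w; w++)
        q = delta[q][*w - '0'];
    return q == 1;                /* accept iff we stop in state 1 */
}

int main(void) {
    printf("%d %d\n", ends_with_zero("1010"), ends_with_zero("01"));  /* 1 0 */
    return 0;
}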
Due to its additional features, an NFA has a different transition function; the rest is the same as a DFA.
δ: Transition Function
δ: Q × (Σ ∪ {ε}) → 2^Q
As you can see, the transition function is defined for any input including the null symbol (ε), so an NFA
can go to any number of states on a given input. For example, an NFA for the above problem follows.
[Diagram: an NFA accepting all strings ending with 0.]
One important thing to note is that, in an NFA, if any path for an input string leads to a final state, then the input
string is accepted. For example, in the above NFA, there are multiple paths for the input string “00”. Since
one of the paths leads to a final state, “00” is accepted by the above NFA.
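A small C sketch of this "follow every path at once" idea: the set of currently active NFA states is kept as a bitmask, so the string is accepted if some active state at the end is final. The two-state NFA encoded below (q0 loops on 0 and 1, with an extra nondeterministic move q0 --0--> q1) is an assumed minimal NFA for "strings ending with 0":

#include <stdio.h>

#define Q0 1u    /* bit 0: state q0 */
#define Q1 2u    /* bit 1: state q1, the accepting state */

unsigned step(unsigned states, char c) {
    unsigned next = 0;
    if (states & Q0) {
        next |= Q0;                 /* q0 loops on both '0' and '1'       */
        if (c == '0') next |= Q1;   /* nondeterministic move q0 --0--> q1 */
    }
    return next;                    /* q1 has no outgoing transitions     */
}

int nfa_accepts(const char *w) {
    unsigned states = Q0;           /* start in the set {q0} */
    for (; *w; w++)
        states = step(states, *w);
    return (states & Q1) != 0;      /* accepted iff some path reached q1  */
}

int main(void) {
    printf("%d %d\n", nfa_accepts("00"), nfa_accepts("01"));   /* 1 0 */
    return 0;
}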
A regular expression is a representation of tokens. But to recognize a token, we need a token recognizer,
which is nothing but a finite automaton (an NFA). So we convert each regular expression into an NFA.
[Diagram: Thompson-construction NFAs. For a single symbol a, the NFA is a start state with an a-edge to an accepting state. For a + b (i.e. a | b), the NFA branches with ε-edges to the a-path and the b-path and rejoins them with ε-edges. The combined NFA for (a|b)*abb has states 0–10, with start state 0 and accepting state 10.]
Start the Conversion
1. Begin with the start state 0 and calculate ε-closure(0).
a. The set of states reachable by ε-transitions, which includes 0 itself, is {0,1,2,4,7}. This defines a new state A in the DFA: A = {0,1,2,4,7}.
2. We must now find the states that A connects to. There are two symbols in the language (a, b) so in the
DFA we expect only two edges: from A on a and from A on b. Call these states B and C:
We find B and C in the following way:
Find the state B that has an edge on a from A
a. start with A{0,1,2,4,7}. Find which states in A have states reachable by a transitions. This set is called
move(A,a) The set is {3,8}: move(A,a) = {3,8}
b. now do an ε-closure on move(A,a). Find all the states in move(A,a) which are reachable with ε-transitions. We
have 3 and 8 to consider. Starting with 3 we can get to 3 and 6 and from 6 to 1 and 7, and from 1 to 2 and 4.
Starting with 8 we can get to 8 only. So the complete set is {1,2,3,4,6,7,8}. So ε-closure(move(A,a)) = B =
{1,2,3,4,6,7,8}
This defines the new state B that has an edge on a from A
Find the state C that has an edge on b from A
c. start with A{0,1,2,4,7}. Find which states in A have states reachable by b transitions. This set is called
move(A,b) The set is {5}: move(A,b) = {5}
d. now do an ε-closure on move(A,b). Find all the states in move(A,b) which are reachable with ε-transitions. We
have only state 5 to consider. From 5 we can get to 5, 6, 7, 1, 2, 4. So the complete set is {1,2,4,5,6,7}. So
ε-closure(move(A,b)) = C = {1,2,4,5,6,7}
This defines the new state C that has an edge on b from A
A={0,1,2,4,7} B={1,2,3,4,6,7,8} C={1,2,4,5,6,7}
Now that we have B and C we can move on to find the states that have a and b transitions from B and C.
Find the state that has an edge on a from B
e. start with B{1,2,3,4,6,7,8}. Find which states in B have states reachable by a transitions. This set is called
move(B,a) The set is {3,8}: move(B,a) = {3,8}
f. now do an ε-closure on move(B,a). Find all the states in move(B,a) which are reachable with ε-transitions. We
have 3 and 8 to consider. Starting with 3 we can get to 3 and 6 and from 6 to 1 and 7, and from 1 to 2 and 4.
Starting with 8 we can get to 8 only. So the complete set is {1,2,3,4,6,7,8}. So ε-closure(move(B,a)) =
{1,2,3,4,6,7,8}
which is the same as the state B itself. In other words, we have a repeating edge to B:
A={0,1,2,4,7} B={1,2,3,4,6,7,8} C={1,2,4,5,6,7}
Find the state D that has an edge on b from B
g. start with B{1,2,3,4,6,7,8}. Find which states in B have states reachable by b transitions. This set is called
move(B,b) The set is {5,9}: move(B,b) = {5,9}
h. now do an ε-closure on move(B,b). Find all the states in move(B,b) which are reachable with ε-transitions.
From 5 we can get to 5, 6, 7, 1, 2, 4. From 9 we get to 9 itself. So the complete set is {1,2,4,5,6,7,9}. So
ε-closure(move(B,b)) = D = {1,2,4,5,6,7,9}. This defines the new state D that has an edge on b from B.
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}
Find the state that has an edge on a from D
i. start with D{1,2,4,5,6,7,9}. Find which states in D have states reachable by a transitions. This set is called
move(D,a) The set is {3,8}: move(D,a) = {3,8}
j. now do an ε-closure on move(D,a). Find all the states in move(D,a) which are reachable with ε-transitions. We
have 3 and 8 to consider. Starting with 3 we can get to 3 and 6 and from 6 to 1 and 7, and from 1 to 2 and 4.
Starting with 8 we can get to 8 only. So the complete set is {1,2,3,4,6,7,8}. So ε-closure(move(D,a)) =
{1,2,3,4,6,7,8} =B
This is a return edge to B:
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}
Find the state E that has an edge on b from D
k. start with D{1,2,4,5,6,7,9}. Find which states in D have states reachable by b transitions. This set is called
move(D,b). The set is {5,10}: move(D,b) = {5,10}
l. now do an ε-closure on move(D,b). Find all the states in move(D,b) which are reachable with ε-transitions. From
5 we can get to 5, 6, 7, 1, 2, 4. From 10 we get to 10 itself. So the complete set is {1,2,4,5,6,7,10}. So
ε-closure(move(D,b)) = E = {1,2,4,5,6,7,10}
This defines the new state E that has an edge on b from D. Since it contains an accepting state, it is also an accepting
state.
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}, E={1,2,4,5,6,7,10}
We should now examine state C
Find the state that has an edge on a from C
m. start with C{1,2,4,5,6,7}. Find which states in C have states reachable by a transitions. This set is called move(C,a)
The set is {3,8}: move(C,a) = {3,8}, and ε-closure(move(C,a)) = {1,2,3,4,6,7,8}.
We have seen this before: it's the state B.
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}, E={1,2,4,5,6,7,10}
Find the state that has an edge on b from C
n. start with C{1,2,4,5,6,7}. Find which states in C have states reachable by b transitions. This set is called move(C,b)
The set is {5}:
o. move(C,b) = {5}
p. now do an ε-closure on move(C,b). Find all the states in move(C,b) which are reachable with ε-transitions. From 5
we can get to 5, 6, 7, 1, 2, 4, which is C itself. So
ε-closure(move(C,b)) = C
This defines a loop on C
Finally we need to look at E. Although this is an accepting state, the regular expression allows us to repeat adding in
more a’s and b’s as long as we return to the accepting E state finally. So
Find the state that has an edge on a from E
q. start with E{1,2,4,5,6,7,10}. Find which states in E have states reachable by a transitions. This set is called
move(E,a) The set is {3,8}:
move(E,a) = {3,8}. We saw this before: its ε-closure is the state B.
[Diagram: the resulting DFA. Transitions: A --a--> B, A --b--> C, B --a--> B, B --b--> D, C --a--> B, C --b--> C, D --a--> B, D --b--> E, E --a--> B, E --b--> C. A is the start state and E is the accepting state.]
That's it! There is only one edge from each state for a given input character, so it is a DFA. Disregard the fact that
each of these states is actually a group of NFA states; we can regard them as single states in the DFA. Note that the
DFA is not yet optimized (it can be built with fewer states).
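The ε-closure/move steps carried out above can be mechanized. Below is a C sketch of the subset construction for this NFA (the edge arrays are an encoding of the (a|b)*abb NFA with states 0–10 described earlier; sets of NFA states are bitmasks):

#include <stdio.h>

int eps[11][2] = {{1,7},{2,4},{-1,-1},{6,-1},{-1,-1},{6,-1},{1,7},
                  {-1,-1},{-1,-1},{-1,-1},{-1,-1}};   /* ε-edges (-1 = none) */
int on_a[11]   = {-1,-1,3,-1,-1,-1,-1,8,-1,-1,-1};    /* a-edges */
int on_b[11]   = {-1,-1,-1,-1,5,-1,-1,-1,9,10,-1};    /* b-edges */

unsigned closure(unsigned s) {          /* ε-closure of a set of states */
    int changed = 1;
    while (changed) {                   /* simple fixed-point iteration */
        changed = 0;
        for (int q = 0; q < 11; q++)
            if (s & (1u << q))
                for (int k = 0; k < 2; k++)
                    if (eps[q][k] >= 0 && !(s & (1u << eps[q][k]))) {
                        s |= 1u << eps[q][k];
                        changed = 1;
                    }
    }
    return s;
}

unsigned move(unsigned s, char c) {     /* states reachable on symbol c */
    int *edge = (c == 'a') ? on_a : on_b;
    unsigned t = 0;
    for (int q = 0; q < 11; q++)
        if ((s & (1u << q)) && edge[q] >= 0)
            t |= 1u << edge[q];
    return t;
}

int main(void) {
    unsigned A = closure(1u << 0);        /* A = {0,1,2,4,7}      */
    unsigned B = closure(move(A, 'a'));   /* B = {1,2,3,4,6,7,8}  */
    unsigned C = closure(move(A, 'b'));   /* C = {1,2,4,5,6,7}    */
    unsigned D = closure(move(B, 'b'));   /* D = {1,2,4,5,6,7,9}  */
    unsigned E = closure(move(D, 'b'));   /* E = {1,2,4,5,6,7,10} */
    printf("%#x %#x %#x %#x %#x\n", A, B, C, D, E);   /* bitmask form */
    return 0;
}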
Problem: Check whether the grammar G with production rules X → X+X | X*X | a is ambiguous or not.
Solution:
Let's find out the derivation trees for the string "a+a*a". It has two leftmost derivations.
Derivation 1 − X → X+X → a+X → a+X*X → a+a*X → a+a*a
[Parse tree 1 omitted.]
Derivation 2 − X → X*X → X+X*X → a+X*X → a+a*X → a+a*a
[Parse tree 2 omitted.]
Since there are two parse trees for the single string "a+a*a", the grammar G is ambiguous.
Simplified Forms:
As we have seen, various languages can be represented efficiently by a context-free grammar. Not all
grammars are optimized: a grammar may contain extra symbols (non-terminals), and having extra symbols
unnecessarily increases the length of the grammar. Simplification of a grammar means reducing the
grammar by removing useless symbols.
The properties of reduced grammar are given below:
1. Each variable (i.e. non-terminal) and each terminal of G appears in the derivation of some word in L.
2. There should not be any production of the form X → Y where X and Y are non-terminals.
3. If ε is not in the language L then there need not be any production X → ε.
A symbol is useless if it cannot be reached from the start symbol (it appears on no reachable right-hand side) or
does not take part in the derivation of any string. Such a symbol is known as a useless symbol. Similarly, a variable
is useless if it does not take part in the derivation of any string; it is known as a useless variable.
For Example:
1. T → aaB | abA | aaT
2. A → aA
3. B → ab | b
4. C → ad
In the above example, the variable 'C' will never occur in the derivation of any string, so the production C → ad
is useless. So we will eliminate it, and the other productions are written in such a way that variable C can never
reach from the starting variable 'T'.
Production A → aA is also useless because there is no way to terminate it. If it never terminates, then it can
never produce a string. Hence this production can never take part in any derivation.
To remove the useless production A → aA, we first find all the variables which will never lead to a
terminal string, such as variable 'A'. Then we remove all the productions in which the variable 'A' occurs.
Elimination of ε Production
The productions of type S → ε are called ε productions. These productions can be removed only from
grammars whose language does not contain ε.
Step 1: First find out all nullable non-terminal variables which derive ε.
Step 2: For each production A → a, construct all productions A → x, where x is obtained from a by removing
one or more of the non-terminals found in step 1.
Step 3: Now combine the result of step 2 with the original productions and remove the ε productions.
Example:
Remove the ε productions from the following CFG while preserving its meaning.
1. S → XYX
2. X → 0X | ε
3. Y → 1Y | ε
Solution:
While removing the ε productions, we delete the rules X → ε and Y → ε. To preserve the meaning of the CFG,
we add the new productions obtained by substituting ε for X and Y wherever they appear on a right-hand side.
Let us take S → XYX.
If X = ε then S → YX and S → XY.
If Y = ε then S → XX.
If Y and one X are ε then S → X, and if both X's are ε then S → Y.
Now,
S → XY | YX | XX | X | Y
For X → 0X, if X = ε then X → 0, so X → 0X | 0. Similarly Y → 1Y | 1.
The final grammar is:
S → XY | YX | XX | X | Y
X → 0X | 0
Y → 1Y | 1
Elimination of Unit Productions
The unit productions are the productions in which one non-terminal gives another non-terminal. Use the
following steps to remove unit productions:
Step 1: To remove X → Y, add production X → a to the grammar rule whenever Y → a occurs in the grammar.
Step 2: Now delete X → Y from the grammar.
Step 3: Repeat step 1 and step 2 until all unit productions are removed.
For example:
S → 0A | 1B | C
A → 0S | 00
B→1|A
C → 01
Solution:
S → C is a unit production. While removing S → C we have to consider what C gives, so we add the rule
S → 01 and delete S → C:
S → 0A | 1B | 01
Similarly, B → A is a unit production; since A → 0S | 00, we add B → 0S | 00 and delete B → A:
B → 1 | 0S | 00
Thus the grammar becomes:
S → 0A | 1B | 01
A → 0S | 00
B → 1 | 0S | 00
C → 01
(C → 01 is now unreachable and can also be removed as a useless production.)
Normal Forms
Chomsky's Normal Form (CNF):
CNF stands for Chomsky normal form. A CFG(context free grammar) is in CNF(Chomsky normal form) if all
production rules satisfy one of the following conditions:
o The start symbol generating ε. For example, S → ε.
o A non-terminal generating two non-terminals. For example, S → AB.
o A non-terminal generating a terminal. For example, S → a.
For example:
G1 = {S → AB, S → c, A → a, B → b}
G2 = {S → aA, A → a, B → c}
The production rules of Grammar G1 satisfy the rules specified for CNF, so the grammar G1 is in CNF.
However, the production rules of Grammar G2 do not satisfy the rules specified for CNF, as S → aA contains a
terminal followed by a non-terminal. So the grammar G2 is not in CNF.
Steps for converting CFG into CNF
Step 1: Eliminate the start symbol from the RHS. If the start symbol S is on the right-hand side of any production,
create a new production:
S0 → S
Step 2: In the grammar, remove the null, unit and useless productions. You can refer to the Simplification of
CFG.
Step 3: Eliminate terminals from the RHS of the production if they exist with other non-terminals or terminals.
For example, production S → aA can be decomposed as:
S → RA
R→a
Step 4: Eliminate RHS with more than two non-terminals. For example, S → ASB can be decomposed as:
S → RB
R → AS
Example:
Convert the given CFG to CNF. Consider the given grammar G1:
S → a | aA | B
A → aBB | ε
B → Aa | b
Solution:
Step 1: We will create a new production S0 → S, as the start symbol S appears on the RHS. The grammar will
be:
S0 → S
S → a | aA | B
A → aBB | ε
B → Aa | b
Step 2: As grammar G1 contains A → ε null production, its removal from the grammar yields:
S0 → S
S → a | aA | B
A → aBB
B → Aa | b | a
Now, as grammar G1 contains Unit production S → B, its removal yield:
S0 → S
S → a | aA | Aa | b
A → aBB
B → Aa | b | a
Also remove the unit production S0 → S; its removal from the grammar yields:
S0 → a | aA | Aa | b
S → a | aA | Aa | b
A → aBB
B → Aa | b | a
Step 3: In the production rule S0 → aA | Aa, S → aA | Aa, A → aBB and B → Aa, terminal a exists on RHS
with non-terminals. So we will replace terminal a with X:
S0 → a | XA | AX | b
S → a | XA | AX | b
A → XBB
B → AX | b | a
X→a
Step 4: In the production rule A → XBB, the RHS has more than two symbols; replacing it in the grammar yields:
S0 → a | XA | AX | b
S → a | XA | AX | b
A → RB
B → AX | b | a
X→a
R → XB
Hence, for the given grammar, this is the required CNF.
Greibach Normal Form (GNF):
GNF stands for Greibach normal form. A CFG (context free grammar) is in GNF (Greibach normal form) if all
the production rules satisfy one of the following conditions:
o The start symbol generating ε. For example, S → ε.
o A non-terminal generating a terminal. For example, A → a.
o A non-terminal generating a terminal followed by any number of non-terminals. For example, S → aASB.
For example:
G1 = {S → aAB | aB, A → aA | a, B → bB | b}
G2 = {S → aAB | aB, A → aA | ε, B → bB | ε}
The production rules of Grammar G1 satisfy the rules specified for GNF, so the grammar G1 is in GNF.
However, the production rules of Grammar G2 do not satisfy the rules specified for GNF, as A → ε and B → ε
contain ε (only the start symbol can generate ε). So the grammar G2 is not in GNF.
Steps for converting CFG into GNF
Step 1: If the given grammar is not in CNF, convert it into CNF. You can refer to the topic: Chomsky normal form.
Step 2: If the context free grammar contains left recursion, eliminate it. You can refer to the topic: Left Recursion.
Step 3: In the grammar, convert the given production rules into GNF form. If any production rule in the grammar
is not in GNF form, convert it.
Example:
S → XB | AA
A → a | SA
B→b
X→a
Solution:
As the given grammar G is already in CNF and there is no left recursion, so we can skip step 1 and step 2 and
directly go to step 3.
The production rule A → SA is not in GNF, so we substitute S → XB | AA in the production rule A → SA as:
S → XB | AA
A → a | XBA | AAA
B→b
X→a
The production rules S → XB and A → XBA are not in GNF, so we substitute X → a in the production rules S →
XB and A → XBA as:
S → aB | AA
A → a | aBA | AAA
B→b
X→a
The production rule A → AAA is left recursive. Eliminating the left recursion with a new non-terminal C gives:
S → aB | AA
A → aC | aBAC
C → AAC | ε
B→b
X→a
Now we will remove null production C → ε, we get:
S → aB | AA
A → aC | aBAC | a | aBA
C → AAC | AA
B→b
X→a
The production rule S → AA is not in GNF, so we substitute A → aC | aBAC | a | aBA in the production rule S →
AA:
S → aB | aCA | aBACA | aA | aBAA
The production rules C → AAC and C → AA are not in GNF, so we substitute A → aC | aBAC | a | aBA for the
leading A:
C → aCAC | aBACAC | aAC | aBAAC | aCA | aBACA | aA | aBAA
The grammar is now in GNF:
S → aB | aCA | aBACA | aA | aBAA
A → aC | aBAC | a | aBA
C → aCAC | aBACAC | aAC | aBAAC | aCA | aBACA | aA | aBAA
B → b
X → a
Introduction to Compiler
As computers became an inevitable and indispensable part of human life, and several languages
with different and more advanced features evolved to help the user communicate with the
machine, the development of translator or mediator software became essential to fill the huge
gap between human and machine understanding.
This process is called Language Processing, to reflect the goal and intent of the process. To
understand this process better, we have to be familiar with some key terms and concepts
explained in the following lines.
LANGUAGE TRANSLATORS
A language translator is a computer program which translates a program written in one (source) language to its
equivalent program in another (target) language. The source program is in a high-level language,
whereas the target language can be anything from the machine language of a target machine
(from microprocessor to supercomputer) to another high-level language program.
Based on the input the translator takes and the output it produces, a language translator
can be called any one of the following.
Preprocessor: A preprocessor takes the skeletal source program as input and produces an
extended version of it, which is the result of expanding the macros and manifest constants, if any,
and including header files in the source file. For example, the C preprocessor is a macro
processor that is used automatically by the C compiler to transform the source before actual
compilation. In addition, a preprocessor collects all the modules and files in case the source
program is divided into different modules stored in different files.
Compiler: A compiler is a translator that takes as input a source program written in a high-level
language and converts it into its equivalent target program in machine language. In
addition, the compiler also facilitates the user in rectifying errors and executing the code.
Loader / Linker: This is a program that takes as input a relocatable code and collects
the library functions, relocatable object files, and produces its equivalent absolute machine
code.
Specifically,
Loading consists of taking the relocatable machine code, altering the relocatable
addresses, and placing the altered instructions and data in memory at the proper
locations.
Linking allows us to make a single program from several files of relocatable machine
code. These files may have been the result of several different compilations, and one or
more may be library routines provided by the system and available to any program
that needs them.
In addition to these translators, programs like interpreters, text formatters etc., may be used
in language processing system. To translate a program in a high level language program to an
executable one, the Compiler performs by default the compile and linking functions.
Normally the steps in a language processing system includes Preprocessing the skeletal
Source program which produces an extended or expanded source program or a ready to compile
unit of the source program, followed by compiling the resultant, then linking / loading , and
finally its equivalent executable code is produced. As noted earlier, not all these steps are
mandatory; in some cases the compiler performs the linking and loading functions
implicitly.
TYPES OF COMPILERS:
Based on the specific input it takes and the output it produces, compilers can be
classified into the following types:
Traditional Compilers(C, C++, Pascal): These Compilers convert a source program in a HLLinto
its equivalent in native machine code or object code.
Interpreters(LISP, SNOBOL, Java1.0): These Compilers first convert Source code into
intermediate code, and then interprets (emulates) it to its equivalent machine code.
Cross-Compilers: These are the compilers that run on one machine and produce
code for another machine.
Incremental Compilers: These compilers separate the source into user-defined
steps, compiling/recompiling step by step and interpreting the steps in a given order.
Converters (e.g. COBOL to C++): These Programs will be compiling from one high level language
to another.
Just-In-Time (JIT) Compilers (Java, Microsoft .NET): These are the runtime compilers
from intermediate language (byte code, MSIL) to executable code or native machine
code. These perform type-based verification, which makes the executable code more
trustworthy.
Ahead-of-Time (AOT) Compilers (e.g., .NET ngen): These are the pre-compilers to the native
code for Java and .NET
Binary Compilation: These compilers will be compiling object code of one platform into
object code of another platform.
BOOTSTRAPPING
Bootstrapping describes writing a compiler in the language it is intended to compile. Three languages are involved:
1. Source Language
2. Target Language
3. Implementation Language
1. Create a compiler (SCAA) for a subset S of the desired language L, using an existing language A; that compiler runs
on machine A.
PHASES OF A COMPILER:
Due to the complexity of compilation task, a Compiler typically proceeds in a Sequence of
compilation phases. The phases communicate with each other via clearly defined
interfaces. Generally an interface contains a Data structure (e.g., tree), Set of exported
functions. Each phase works on an abstract intermediate representation of the source
program, not the source program text itself (except the first phase)
Compiler Phases are the individual modules which are chronologically executed to perform
their respective Sub-activities, and finally integrate the solutions to give target code.
It is desirable to have relatively few phases, since it takes time to read and write intermediate
files. The following diagram (Figure 1.4) depicts the phases of a compiler through which it goes
during compilation. A typical compiler has the following phases:
The phases of a compiler are divided into two parts: the first three phases are called
the Analysis part, and the remaining phases are called the Synthesis part.
The back-end of the compiler consists of the phases that depend on the target machine;
those portions do not depend on the source language, just on the intermediate language.
In this we have different aspects of Code Optimization phase, code generation along with
the necessary Error handling, and Symbol table operations.
LEXICAL ANALYZER (SCANNER): The Scanner is the first phase that works as
interface between the compiler and the Source language program and performs the
following functions:
Reads the characters in the source program and groups them into a stream of tokens, in
which each token specifies a logically cohesive sequence of characters, such as an
identifier, a keyword, a punctuation mark, or a multi-character operator like :=.
The character sequence forming a token is called a lexeme of the token.
The scanner generates a token-id, and also enters the identifier's name in the
symbol table if it doesn't already exist.
It also removes comments and unnecessary white space.
SYNTAX ANALYZER (PARSER): The Parser interacts with the Scanner, and its
subsequent phase Semantic Analyzer and performs the following functions:
Groups the received and recorded token stream into syntactic structures,
usually into a structure called a parse tree whose leaves are tokens.
The interior nodes of this tree represent streams of tokens that logically
belong together.
It means it checks the syntax of program elements.
SEMANTIC ANALYZER: This phase receives the syntax tree as input and checks the
semantic correctness of the program. Though the tokens are valid and syntactically
correct, it may happen that they are not correct semantically. Therefore the semantic
analyzer checks the semantics (meaning) of the statements formed.
o The syntactically and semantically correct structures are produced here in the form of
a syntax tree or DAG or some other sequential representation, like a matrix.
CODE OPTIMIZER: This phase is optional in some Compilers, but so useful and
beneficial in terms of saving development time, effort, and cost. This phase performs the
following specific functions:
Attempts to improve the IC so as to have a faster machine code. Typical functions
include –Loop Optimization, Removal of redundant computations, Strength reduction,
Frequency reductions etc.
Sometimes the data structures used in representing the intermediate forms may also
be changed.
CODE GENERATOR: This is the final phase of the compiler and generates the target
code, normally consisting of the relocatable machine code or Assembly code or
absolute machine code.
Memory locations are selected for each variable used, and assignment of variables
to registers is done.
Intermediate instructions are translated into a sequence of machine instructions.
The compiler also performs symbol table management and error handling throughout
the compilation process. The symbol table is nothing but a data structure that stores the
different source language constructs and tokens generated during compilation. These two
interact with all phases of the compiler.
For example, if the source program is an assignment statement, the following figure shows
how the phases of the compiler process the program.
As the first phase of a compiler, the main task of the lexical analyzer is to read
the input characters of the source program, group them into lexemes, and produce as
output tokens for each lexeme in the source program. This stream of tokens is sent to
the parser for syntax analysis. It is common for the lexical analyzer to interact with the
symbol table as well.
When the lexical analyzer identifies the first token, it sends it to the parser; the
parser receives the token and calls the lexical analyzer to send the next token by issuing
the getNextToken() command. This process continues until the lexical analyzer
identifies all the tokens. During this process the lexical analyzer neglects or discards
the white space and comment lines.
A pattern is a description of the form that the lexemes of a token may take (or match).
In the case of a keyword as a token, the pattern is just the sequence of characters that form the
keyword. For identifiers and some other tokens, the pattern is a more complex structure that
is matched by many strings.
A lexeme is a sequence of characters in the source program that matches
the pattern for a token and is identified by the lexical analyzer as an instance of that
token.
There are a number of reasons why the analysis portion of a compiler is normally
separated into lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration. The separation of Lexical and
Syntactic analysis often allows us to simplify at least one of these tasks. For example, a
parser that had to deal with comments and whitespace as syntactic units would be
considerably more complex than one that can assume comments and whitespace have
already been removed by the lexical analyzer.
Buffer Pairs
Because of the amount of time taken to process characters and the large number of
characters that must be processed during the compilation of a large source program,
specialized buffering techniques have been developed to reduce the amount of
overhead required to process a single input character. An important scheme involves
two buffers that are alternately reloaded.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g.,
4096 bytes. Using one system read command we can read N characters in to a buffer,
rather than using one system call per character. If fewer than N characters remain in the
input file, then a special character, represented by eof, marks the end of the source file
and is different from any possible character of the source program. Two pointers to the
input are maintained:
1. The pointer lexemeBegin marks the beginning of the current lexeme, whose
extent we are attempting to determine.
2. The pointer forward scans ahead until a pattern match is found; the exact
strategy whereby this determination is made will be covered in the balance of
this chapter.
Once the next lexeme is determined, forward is set to the character at its right
end. Then, after the lexeme is recorded as an attribute value of a token returned to the
parser, lexemeBegin is set to the character immediately after the lexeme just found. In
the figure, we see forward has passed the end of the next lexeme, ** (the FORTRAN
exponentiation operator), and must be retracted one position to its left.
Advancing forward requires that we first test whether we have reached the end
of one of the buffers, and if so, we must reload the other buffer from the input, and
move forward to the beginning of the newly loaded buffer.
As long as we never need to look so far ahead of the actual lexeme that the sum
of the lexeme's length plus the distance we look ahead is greater than N, we shall never
overwrite the lexeme in its buffer before determining it.
We can combine the buffer-end test with the test for the current character if we
extend each buffer to hold a sentinel character at the end. The sentinel is a special
character that cannot be part of the source program, and a natural choice is the
character eof. Figure shows the same arrangement as Figure above but with the
sentinels added. Note that eof retains its use as a marker for the end of the entire
input.
Any eof that appears other than at the end of a buffer means that the input is at an end.
The figure below summarizes the algorithm for advancing forward. Notice how the first test,
which can be part of a multiway branch based on the character pointed to by forward, is
the only test we make, except in the case where we actually are at the end of a buffer or at
the end of the input.
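That figure is not reproduced in these notes; the algorithm it summarized can be reconstructed in C-like pseudocode as follows (the buffer names and the reload operations are descriptive placeholders, not a specific API):

switch (*forward++) {
case eof:
    if (forward is at the end of the first buffer) {
        reload the second buffer;
        forward = beginning of the second buffer;
    }
    else if (forward is at the end of the second buffer) {
        reload the first buffer;
        forward = beginning of the first buffer;
    }
    else /* eof within a buffer marks the end of the input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}

Note how, for any ordinary character, the single switch on *forward++ is the only test performed.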
1. Alphabets
2. Strings
3. Special symbols
4. Language
5. Longest match rule
6. Operations
7. Notations
8. Representing valid tokens of a language in regular expression
9. Finite automata
1. Alphabets: Any finite set of symbols
o {0,1} is a set of binary alphabets,
o {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets,
o {a-z, A-Z} is a set of English language alphabets.
2. Strings: Any finite sequence of alphabets is called a string.
3. Special symbols: A typical high-level language contains the following symbols:
Arithmetic symbols: Addition (+), Subtraction (-), Multiplication (*), Division (/)
Assignment: =
Special assignment: +=, -=, *=, /=
Preprocessor: #
4. Language: A language is considered as a finite set of strings over some finite set of
alphabets.
5. Longest match rule: When the lexical analyzer reads the source code, it scans the code
letter by letter, and when it encounters a whitespace, operator symbol, or special symbol,
it decides that a word is completed.
6. Operations: The various operations on languages are:
Union of two languages L and M is written as, L U M = {s | s is in L or s is in M}
Concatenation of two languages L and M is written as, LM = {st | s is in L and t
is in M}
The Kleene Closure of a language L is written as, L* = Zero or more occurrence
of language L.
7. Notations: If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : L(r)UL(s)
Concatenation : L(r)L(s)
Kleene closure : (L(r))*
8. Representing valid tokens of a language in regular expression: If x is a regular expression, then:
o x* means zero or more occurrences of x.
o x+ means one or more occurrences of x.
9. Finite automata: A finite automaton is a state machine that takes a string of symbols as input
and changes its state accordingly. If the input string is successfully processed and the
automaton reaches its final state, the string is accepted. The mathematical model of finite automata
consists of:
o Finite set of states (Q)
o Finite set of input symbols (Σ)
o One start state (q0)
o Set of final states (qf)
o Transition function (δ)
The transition function (δ) maps state–symbol pairs to states: δ: Q × Σ → Q.
RECOGNITION OF TOKENS
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
...
A Lex program consists of a list of patterns and actions:
p1 {action1}
p2 {action2}
...
pn {actionn}
where each pi is a regular expression and actioni describes what action the lexical analyzer
should take when pattern pi matches a lexeme. Actions are written in C code.
Those functions that are passed directly through Lex to the output
The actions from the input program, which appear as fragments of code to
be invoked at the appropriate time by the automaton simulator.
Example: We shall illustrate the ideas of this section with the following simple, abstract
example:
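The Lex program itself is missing from these notes; it can be reconstructed (with the action code left abstract) as the three pattern-action pairs:

a     {action A1 for pattern p1}
abb   {action A2 for pattern p2}
a*b+  {action A3 for pattern p3}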
Note that these three patterns present some conflicts of the type discussed
earlier. In particular, the string abb matches both the second and third patterns, but
we shall consider it a lexeme for pattern p2, since that pattern is listed first in the
above Lex program. Then, input strings such as aabbb··· have many prefixes
that match the third pattern. The Lex rule is to take the longest, so we continue
reading b's until another a is met, whereupon we report the lexeme to be the
initial a's followed by as many b's as there are.
Three NFA's recognize the three patterns; the third is a simplification
of what would come out of the construction algorithm. Fig. 3.52 shows these three NFA's
combined into a single NFA by the addition of start state 0 and three ε-transitions.
If the lexical analyzer simulates an NFA such as that of Fig. 3.52, then it
must read input beginning at the point on its input which we have
referred to as lexemeBegin. As it moves the pointer called forward ahead in the
input, it calculates the set of states it is in at each point, following Algorithm.
Eventually, the NFA simulation reaches a point on the input where there
are no next states. At that point, there is no hope that any longer prefix of the
input would ever get the NFA to an accepting state; rather, the set of states
will always be empty. Thus, we are ready to decide on the longest prefix that is a
lexeme matching some pattern.
We look backwards in the sequence of sets of states, until we find a set that
includes one or more accepting states. If there are several accepting states in
that set, pick the one associated with the earliest pattern pi in the list from
the Lex program. Move the forward pointer back to the end of the lexeme, and
perform the action Ai associated with pattern pi.
Example: Suppose we have the patterns above and the input begins with aaba. Figure 3.53
shows the sets of states of the NFA of Fig. 3.52 that we enter, starting with the ε-
closure of the initial state 0, which is {0,1,3,7}, and proceeding from there. After
reading the fourth input symbol, we are in an empty set of states, since in Fig.
3.52, there are no transitions out of state 8 on input a.
Thus, we need to back up, looking for a set of states that includes an
accepting state. Notice that, as indicated in Fig. 3.53, after reading a we are in a
set that includes state 2 and therefore indicates that the pattern a has been
matched. However, after reading aab, we are in state 8, which indicates that
a*b+ has been matched; prefix aab is the longest prefix that gets us to an
accepting state. We therefore select aab as the lexeme, and execute action A3,
which should include a return to the parser indicating that the token whose
pattern is p3 = a*b+ has been found.
Example : Figure 3.54 shows a transition diagram based on the DFA that is
constructed by the subset construction from the NFA in Fig. 3.52. The accepting states
are labeled by the pattern that is identified by that state. For instance, the state {6,8}
has two accepting states, corresponding to patterns abb and a*b+. Since the former
is listed first, that is the pattern associated with state {6,8}.
We use the DFA in a lexical analyzer much as we did the NFA. We simulate the
DFA until at some point there is no next state (or strictly speaking, the next state
is ∅, the dead state corresponding to the empty set of NFA states). At that point, we
back up through the sequence of states we entered and, as soon as we meet an
accepting DFA state, we perform the action associated with the pattern for that state.
Example: Suppose the DFA of Fig. 3.54 is given input abba. The sequence of
states entered is 0137, 247, 58, 68, and at the final a there is no transition out of state
68. Thus, we consider the sequence from the end, and in this case 68 itself is an
accepting state that reports pattern p2 = abb.
2. There is a path from the start state of the NFA to state s that spells out x.
3. There is a path from state s to the accepting state that spells out y.
If there is only one ε-transition state on the imaginary / in the NFA, then
the end of the lexeme occurs when this state is entered for the last time, as the
following example illustrates. If the NFA has more than one ε-transition state on
the imaginary /, then the general problem of finding the correct state s is much
more difficult.
Dead States in DFA's
Technically, the automaton in Fig. 3.54 is not quite a DFA. The reason
is that a DFA has a transition from every state on every input symbol in its
input alphabet. Here, we have omitted transitions to the dead state ∅, and we
have therefore omitted the transitions from the dead state to itself on every
input. Previous NFA-to-DFA examples did not have a way to get from the start
state to ∅, but the NFA of Fig. 3.52 does.
TOP-DOWN PARSING: Top-down parsing constructs the parse tree for the input string,
starting from the root and creating the nodes of the parse tree in preorder, tracing out a
leftmost derivation of the input string.
It is classified into two different variants: one which uses backtracking, and the other,
non-backtracking (predictive) parsing:
i. LL(1) Parsing
ii. Backtracking
LL(1) stands for left-to-right scan of the input, using a Leftmost derivation, where the parser
takes 1 symbol as the lookahead from the input when making parsing-action decisions.
A non recursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via
recursive calls. The parser mimics a leftmost derivation.
If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that S ⇒* wα (a left-sentential form).
The table-driven parser in the figure has
An input buffer that contains the string to be parsed followed by a $ Symbol, used to indicate end of input.
A stack, containing a sequence of grammar symbols with a $ at the bottom of the stack, which initially
contains the start symbol of the grammar on top of $.
A parsing table containing the production rules to be applied. This is a two dimensional array M [Non
terminal, Terminal].
A parsing Algorithm that takes input String and determines if it is conformant to Grammar and it uses the
parsing table and stack to take such decision.
4. Check for left factoring; perform left factoring if the grammar contains common prefixes.
Programming language constructs are described by a context-free grammar (CFG). The CFG is denoted as G,
and is defined using the four-tuple notation G = (V, T, P, S), where
Where
V is a finite set of Non terminal; Non terminals are syntactic variables that denote sets of
strings. The sets of strings denoted by non terminals help define the language generated
by the grammar. Non-terminals impose a hierarchical structure on the language that is key to syntax analysis and translation.
T is a Finite set of Terminal; Terminals are the basic symbols from which strings are
formed. The term "token name" is a synonym for '"terminal" and frequently we will use
the word "token" for terminal when it is clear that we are talking about just the token
name. We assume that the terminals are the first components of the tokens output by the
lexical analyzer.
S is the Starting Symbol of the grammar, one non terminal is distinguished as the start
symbol, and the set of strings it denotes is the language generated by the grammar. P
is a finite set of productions; the productions of a grammar specify the manner in which the
terminals and non-terminals can be combined to form strings. Each production consists of:
(a) A non-terminal called the head or left side of the production; this production defines some of the strings denoted by the head.
(b) The symbol ->. (Sometimes ::= has been used in place of the arrow.)
(c) A body or right side consisting of zero or more terminals and non-terminals. The components of the body
describe one way in which strings of the non-terminal at the head can be constructed.
Conventionally, the productions for the start symbol are listed first.
Notational Conventions Used In Writing CFGs:
To avoid always having to state that "these are the terminals," "these are the non-terminals,"
and so on, the following notational conventions for grammars will be used:
(a) Lowercase letters early in the alphabet (a, b, c), operator symbols, punctuation symbols and digits denote
terminal symbols.
(b) The letter S, which, when it appears, is usually the start symbol; uppercase letters early in the alphabet
(A, B, C) denote non-terminals.
(c) Lowercase Greek letters such as α, β, γ represent (possibly empty) strings of grammar symbols.
(d) When discussing programming constructs, uppercase letters may be used to represent
non-terminals for the constructs; for example, the non-terminals for expressions, terms
and factors are often E, T and F.
Using these conventions, the grammar for arithmetic expressions can be written as:
E → E + T | T
T → T * F | F
F → (E) | id
DERIVATIONS:
The construction of a parse tree can be made precise by taking a derivational view, in
which productions are treated as rewriting rules. Beginning with the start symbol, each rewriting
step replaces a Non terminal by the body of one of its productions. This derivational view
corresponds to the top-down construction of a parse tree as well as the bottom construction of the
parse tree.
Derivations are classified into Leftmost Derivation and Rightmost Derivation.
Leftmost Derivation: It is the process of constructing the parse tree or accepting the given input string in
which, every time we need to rewrite a production rule, it is done for the leftmost non-terminal
only.
Ex: - If the Grammar is E-> E+E | E*E | -E| (E) | id and the input string is id + id* id
The production E → -E signifies that if E denotes an expression, then -E must also denote an
expression. We can view a derivation step abstractly on a sequence of grammar symbols, as in αAβ, where α and β
are arbitrary strings of grammar symbols. Suppose A → γ is a production. Then we write αAβ ⇒ αγβ. The symbol
⇒ means "derives in one step". Often we wish to say "derives in zero or more steps"; for this purpose
we can use the symbol ⇒*. If we wish to say "derives in one or more steps", we can use the symbol ⇒+.
If S ⇒* α, where S is the start symbol of grammar G, we say that α is a sentential form of G.
The leftmost derivation of the input id + id * id is:
E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
All the interior nodes are non-terminals and all the leaf nodes are terminals.
The leaf nodes read from left to right give the yield (output) of the parse tree.
Example1:- Parse tree for the input string - (id + id) using the above Context free Grammar is
The Following figure shows step by step construction of parse tree using CFG for the parse tree
Definition: A grammar that produces more than one parse tree for some sentence (input string)
is said to be ambiguous.
In other words, an ambiguous grammar is one that produces more than one leftmost
derivation or more than one rightmost derivation for the same sentence.
Or, if the right-hand side of a production contains two occurrences of the non-terminal that appears on
the left-hand side (as in E → E + E), the grammar is prone to ambiguity.
Example : If the Grammar is E-> E+E | E*E | -E| (E) | id and the Input String is id + id* id
is an ambiguous Grammar
Note: An LL(1) parser will not accept ambiguous grammars; we cannot construct an
LL(1) parser for an ambiguous grammar, because such grammars may cause the top-down
parser to go into an infinite loop or make it consume more time for parsing. If necessary,
the ambiguity must be eliminated first.
ELIMINATING AMBIGUITY: Since ambiguous grammars may cause the top-down parser
to misbehave, an ambiguous grammar can sometimes be rewritten to eliminate the ambiguity.
ELIMINATING LEFT RECURSION: A grammar is left recursive if it has a non-terminal A such that there is a derivation
A ⇒+ Aα for some string α in (T ∪ V)*. LL(1) or top-down parsers cannot handle left-recursive
grammars, so we need to remove the left recursion from a grammar before it is used for top-down parsing.
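In general, immediate left recursion of the form A → Aα | β is removed by rewriting it as
A → βA′
A′ → αA′ | ε
which generates the same language without the left recursion. For example, E → E + T | T becomes E → TE′ with E′ → +TE′ | ε.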
LEFT FACTORING:
Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive or top-down parsing. A grammar in which more than one production for the same non-terminal
has a common prefix must be left factored. For example, if n A-productions have the common prefix α,
that prefix should be factored out, without changing the language defined for A, as shown below.
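The transformation replaces
A → αβ1 | αβ2 | ... | αβn | γ
by
A → αA′ | γ
A′ → β1 | β2 | ... | βn
where A′ is a new non-terminal. For example, S → iEtS | iEtSeS | a becomes S → iEtSS′ | a with S′ → eS | ε.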
FIRST and FOLLOW:
The construction of both top-down and bottom-up parsers is aided by two functions,
FIRST and FOLLOW, associated with a grammar G. During top-down parsing, FIRST and
FOLLOW allow us to choose which production to apply, based on the next input
(lookahead) symbol.
LL (1) Parsing Algorithm:
The parser acts on the basis of two symbols: A, the symbol on top of the stack, and a, the current input symbol.
There are three conditions for A and a that are used by the parsing program:
1. If A = a = $, the parser halts and announces successful completion of parsing.
2. If A = a ≠ $, the parser pops A off the stack and advances the current input pointer to the
next symbol.
3. If A is a non-terminal, the parser consults the entry M[A, a] in the parsing table. If
M[A, a] is a production A → X1X2..Xn, then the program replaces the A on the top of
the stack by Xn...X2X1 (with X1 on top). If M[A, a] = error, the parser calls the error recovery routine.
If the input string for the parser is id + id * id, the table below shows how the parser accepts the string with the
help of the stack.
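That table is not reproduced in these notes; instead, here is a compact C sketch of the same table-driven LL(1) algorithm, for the expression grammar used in the next subsection (E → TE′, E′ → +TE′ | ε, T → FT′, T′ → *FT′ | ε, F → (E) | id). The single-character encoding (i for id, e for E′, t for T′) and the hard-coded table M are illustrative choices, not part of the standard algorithm:

#include <stdio.h>
#include <string.h>

const char *M(char A, char a) {                 /* parsing table M[A, a] */
    switch (A) {
    case 'E': if (a == 'i' || a == '(') return "Te";            break;
    case 'e': if (a == '+') return "+Te";
              if (a == ')' || a == '$') return "";              break;
    case 'T': if (a == 'i' || a == '(') return "Ft";            break;
    case 't': if (a == '*') return "*Ft";
              if (a == '+' || a == ')' || a == '$') return "";  break;
    case 'F': if (a == 'i') return "i";
              if (a == '(') return "(E)";                       break;
    }
    return NULL;                                 /* error entry */
}

int parse(const char *ip) {
    char stack[100] = "$E";                      /* $ at bottom, E on top */
    int top = 1;
    while (top >= 0) {
        char A = stack[top], a = *ip;
        if (A == a) {                            /* top matches lookahead */
            top--; ip++;
            if (A == '$') return 1;              /* matched $: success    */
        } else {
            const char *rhs = M(A, a);           /* consult the table     */
            if (!rhs) return 0;                  /* error entry: reject   */
            top--;                               /* pop A ...             */
            for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                stack[++top] = rhs[k];           /* ... push RHS reversed */
        }
    }
    return 0;
}

int main(void) {
    printf("%d %d\n", parse("i+i*i$"), parse("i+*i$"));   /* 1 0 */
    return 0;
}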
RECURSIVE DESCENT PARSING :
Consider the grammar:
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
Recursive procedures for the recursive descent parser for the given grammar are given below.
procedure E( )
{
    T( );
    E′( );
}
procedure T( )
{
    F( );
    T′( );
}
procedure E′( )
{
    if input = '+'
    {
        advance( );
        T( );
        E′( );
        return true;
    }
    else return true;    /* E′ → ε: nothing to match */
}
procedure T′( )
{
    if input = '*'
    {
        advance( );
        F( );
        T′( );
        return true;
    }
    else return true;    /* T′ → ε: nothing to match */
}
procedure F( )
{
    if input = '('
    {
        advance( );
        E( );
        if input = ')'
        {
            advance( );
            return true;
        }
        else return error;
    }
    else if input = 'id'
    {
        advance( );
        return true;
    }
    else return error;
}
advance( )
{
    input = next token;
}
(Note that when E′ or T′ sees any symbol other than + or *, it simply returns, matching the ε-production, rather than reporting an error.)
BACK TRACKING: This parsing method uses the technique called the brute force method
during the parse tree construction process. It allows the process to go back (backtrack) and
redo the steps by undoing the work done so far at the current point of processing.
Brute force method: It is a Top down Parsing technique, occurs when there
is more than one alternative in the productions to be tried while parsing the input string.
It selects alternatives in the order they appear and when it realizes that something
gone wrong it tries with next alternative.
For example, consider the grammar below.
S → cAd
A → ab | a
To generate the input string "cad", initially the first parse tree given below is generated,
using the alternative A → ab. As the string generated ("cabd") is not "cad", the input pointer
is backtracked to the position of "A", to examine the next alternative of "A". Now a match to
the input string occurs, as shown in the second parse tree.
[Parse trees (1) and (2) omitted.]
BOTTOM-UP PARSING
Consider the grammar:
E → E + T | T
T → T * F | F
F → (E) | id
Bottom-up parsing of the input string "id * id" is as follows:
Figure 3.1: A bottom-up parse tree for the input string "id*id" [figure omitted]
Bottom-up parsing is classified into 1. Shift-Reduce Parsing, 2. Operator Precedence
Parsing, and 3. [Table-Driven] LR Parsing, the latter comprising:
i. SLR(1)
ii. CLR(1) (canonical LR)
iii. LALR(1)
SHIFT-REDUCE PARSING:
Using the same grammar (E → E + T | T, T → T * F | F, F → (E) | id), the actions of the
shift-reduce parser on the input id * id, using a stack implementation, are shown below.
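The table itself is missing here; it can be reconstructed as follows (a sketch, assuming the grammar above):

Stack        Input      Action
$            id*id$     shift
$id          *id$       reduce by F → id
$F           *id$       reduce by T → F
$T           *id$       shift
$T*          id$        shift
$T*id        $          reduce by F → id
$T*F         $          reduce by T → T*F
$T           $          reduce by E → T
$E           $          accept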
LR Parsing:
The most prevalent type of bottom-up parsing is LR(k) parsing, where L is a left-to-right scan
of the given input string, R is a rightmost derivation in reverse, and k is the number of input
symbols used as lookahead.
It is the most general non back tracking shift reduce parsing method
The class of grammars that can be parsed using the LR methods is a proper
superset of the class of grammars that can be parsed with predictive parsers.
LR Parser Consists of
An input buffer that contains the string to be parsed followed by a $ Symbol,
used toindicate end of input.
A stack containing a sequence of grammar symbols with a $ at the bottom of
the stack, which initially contains the Initial state of the parsing table on top of
$.
A parsing table (M), it is a two dimensional array M[ state, terminal or Non
terminal] and it contains two parts
1. ACTION Part
The ACTION part of the table is a two dimensional array indexed by state
and the input symbol, i.e. ACTION[state][input], An action table entry can
have one of following four kinds of values in it. They are:
1. Shift X, where X is a State number.
2. Reduce X, where X is a Production number.
3. Accept, signifying the completion of a successful parse.
4. Error entry.
2. GO TO Part
The GO TO part of the table is a two dimensional array indexed by state
and a Non terminal, i.e. GOTO[state][NonTerminal]. A GO TO entry has a
state number in the table.
A parsing algorithm uses the current state X and the next input symbol 'a' to
consult the entry action[X][a]. It makes one of the four following moves:
1. If action[X][a] = shift Y, the parser executes a shift of Y onto the top of the stack
and advances the input pointer.
2. If action[X][a] = reduce Y (Y is the number of the production being reduced in state X): if
the production is A → β, then the parser pops 2*|β| symbols from the stack and pushes A,
together with the state given by the GOTO entry, onto the stack.
3. If the action[X][a]= accept, then the parsing is successful and the input string is
accepted.
4. If action[X][a] = error, then the parser has discovered an error and calls the error
routine.
LR parsing is classified into:
1. LR ( 0 )
2. Simple LR ( 1 )
3. Canonical LR ( 1 )
4. Look ahead LR ( 1 )
Closure operation
If I is a set of items for a grammar, then Closure(I) is constructed as follows:
1. Initially, add the augment production to the state and check for the • symbol in the
right-hand side of each production: if the • is followed by a non-terminal, add the
productions starting with that non-terminal to the state I.
The 1st and 2nd productions satisfy the 2nd rule, so we add the productions
starting with E and T to I0.
Note: once a production has been added to a state, the same production should not be
added a second time to the same state. So the state becomes:
GO TO Operation
Goto(I0, X), where I0 is a set of items and X is the grammar symbol on
which we are moving the '•' symbol, is like finding the next state of the NFA for a
given state I0 and input symbol X. For example, if the production is E → •E + T, then
Goto on E moves the dot: E → E• + T.
1. If there is a state Ii containing a completed production of the form A → αβ• (the • at the
right end, with no transition to a next state), then that production is said to be a reduced
production. For all terminals x in FOLLOW(A), write the reduce entry (with the
production number) in ACTION[Ii, x]. If the augment production is being reduced, write
accept.
For example, take productions 1: S → aAb and 2: A → αβ, and a state Ii = {A → αβ•}.
With FOLLOW(S) = {$} and FOLLOW(A) = {b}, the row for Ii gets the entry r2 in the
ACTION column for b.
Example: consider the grammar
1. S → aB
2. B → bB
3. B → b
The SLR(1) parsing table is:

States      ACTION                  GOTO
            a       b       $       S       B
I0          S2                      1
I1                          ACCEPT
I2                  S4                      3
I3                          R1
I4                  S4      R3              5
I5                          R2
Note: when multiple entries occur in the same cell of the SLR table, the grammar is not
accepted by the SLR(1) parser.
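A minimal C sketch of the LR driver, hard-coding the SLR(1) table above, is shown below. The numeric encoding of ACTION entries is an illustrative choice; note also that this sketch pushes only states, so a reduction pops |β| stack entries rather than the 2*|β| symbol/state pairs of the description above:

#include <stdio.h>

/* Productions: 1: S -> aB   2: B -> bB   3: B -> b */
enum { ERR = 0, S2 = 102, S4 = 104, R1 = 201, R2 = 202, R3 = 203, ACC = 300 };

int action(int state, char a) {               /* the ACTION table */
    switch (state) {
    case 0: return a == 'a' ? S2 : ERR;
    case 1: return a == '$' ? ACC : ERR;
    case 2: return a == 'b' ? S4 : ERR;
    case 3: return a == '$' ? R1 : ERR;
    case 4: return a == 'b' ? S4 : a == '$' ? R3 : ERR;
    case 5: return a == '$' ? R2 : ERR;
    }
    return ERR;
}

int goto_table(int state, char A) {           /* the GOTO table */
    if (state == 0 && A == 'S') return 1;
    if (state == 2 && A == 'B') return 3;
    if (state == 4 && A == 'B') return 5;
    return ERR;                               /* unreachable on valid input */
}

int parse(const char *ip) {
    int stack[100] = {0}, top = 0;            /* state stack; start state 0 */
    static const int  len[] = {0, 2, 2, 1};   /* |RHS| of productions 1..3  */
    static const char lhs[] = {0, 'S', 'B', 'B'};
    for (;;) {
        int act = action(stack[top], *ip);
        if (act == ERR) return 0;             /* error entry            */
        if (act == ACC) return 1;             /* accept                 */
        if (act < 200) {                      /* shift to state act-100 */
            stack[++top] = act - 100; ip++;
        } else {                              /* reduce by production p */
            int p = act - 200;
            top -= len[p];                    /* pop |RHS| states       */
            stack[top + 1] = goto_table(stack[top], lhs[p]);
            top++;
        }
    }
}

int main(void) {
    printf("%d %d\n", parse("abb$"), parse("aa$"));   /* 1 0 */
    return 0;
}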
Canonical LR (1) Parsing: Various steps involved in the CLR (1) Parsing:
1. Write the context free grammar for the given input string.
2. Check for ambiguity.
3. Add the augment production.
4. Create the canonical collection of LR(1) items.
5. Draw the DFA.
6. Construct the CLR(1) parsing table.
7. Based on the information from the table, with the help of the stack and the
parsing algorithm, generate the output.
LR (1) items :
The LR(1) item is defined by a production, the position of the dot, and a terminal symbol. The
terminal is called the lookahead symbol.
I0 State: Add the augment production and compute the closure; the lookahead symbol for the augment
production is $.
I0 = Closure(S′ → •S, $)
The dot symbol is followed by the non-terminal S, so add the productions starting with S to the I0
state:
S → •CC, $
The dot symbol is followed by the non-terminal C, so add the productions starting with C to the I0
state:
C → •cC, FIRST(C$) and C → •d, FIRST(C$); since FIRST(C$) = {c, d}, these become:
C → •cC, c/d
C → •d, c/d
The dot symbol is followed by a terminal value. So, close the I0 State. So, the productions in the
I0 are
S′->•S , $
S->•CC , $
C->•cC, c/d
C->•d , c/d
I1 = Goto(I0, S) = { S′ → S•, $ }
I2 = Goto(I0, C):
S → C•C, $
C → •cC, $
C → •d, $
I3 = Goto(I0, c):
C → c•C, c/d
C → •cC, c/d
C → •d, c/d
I4 = Goto(I0, d) = { C → d•, c/d }
I5 = Goto(I2, C) = { S → CC•, $ }
I6 = Goto(I2, c):
C → c•C, $
C → •cC, $
C → •d, $
I7 = Goto(I2, d) = { C → d•, $ }
I8 = Goto(I3, C) = { C → cC•, c/d }
I9 = Goto(I6, C) = { C → cC•, $ }
Drawing the finite state machine (DFA) for the above LR(1) items:
[DFA diagram. Transitions: I0 --S--> I1, I0 --C--> I2, I0 --c--> I3, I0 --d--> I4, I2 --C--> I5, I2 --c--> I6, I2 --d--> I7, I3 --C--> I8, I3 --c--> I3, I3 --d--> I4, I6 --C--> I9, I6 --c--> I6, I6 --d--> I7.]
Department of Computer Science & Engineering Course File : Compiler Design
States     ACTION                GOTO
           c      d      $      S     C
I0         S3     S4            1     2
I1                       ACCEPT
I2         S6     S7                  5
I3         S3     S4                  8
I4         R3     R3
I5                       R1
I6         S6     S7                  9
I7                       R3
I8         R2     R2
I9                       R2
LALR (1) Parsing: in the DFA above, some states differ only in their look-aheads; they
have the same productions. Such states are combined. The states I4 and I7 differ only in
their look-aheads, hence they are combined to form a single state called I47. Similarly the
states I3 and I6 differ only in their look-aheads, as given below:
I3 = Goto(I0, c):
C->c•C, c/d
C->•cC, c/d
C->•d, c/d
I6 = Goto(I2, c):
C->c•C, $
C->•cC, $
C->•d, $
These states differ only in the look-aheads and have the same productions, hence they are
combined to form a single state called I36. Similarly the states I8 and I9 differ only in
look-aheads, hence they are combined to form the state I89.
States     ACTION                  GOTO
           c       d       $      S     C
I0         S36     S47            1     2
I1                         ACCEPT
I2         S36     S47                  5
I36        S36     S47                  89
I47        R3      R3      R3
I5                         R1
I89        R2      R2      R2
Shift-Reduce Conflict in CLR(1) parsing occurs when a state has both
1. a reduced item of the form A -> α•, a, and
2. an incomplete item of the form A -> β•aα, as shown below:
1  A -> β•aα, $
2  B -> b•, a
States     ACTION          GOTO
           a      $        A     B
Ii         Sj/r2
On the look-ahead a, state Ii calls for both a shift to Ij and a reduce by production 2, so
the table entry Sj/r2 is multiply defined.
Reduce-Reduce Conflict in CLR(1) parsing occurs when a state has two or more reduced
items of the form
1. A -> α•, a
2. B -> β•, a
that is, two productions in a state reducing on the same look-ahead symbol, as shown below:
States     ACTION          GOTO
           a      $        A     B
Ii         r1/r2
String Acceptance using LR Parsing: parsing the input cdd with the CLR(1) table above.
Stack        Input    Action
$0           cdd$     Shift S3
$0c3         dd$      Shift S4
$0c3d4       d$       Reduce with R3, C->d: pop 2*|β| = 2 symbols; Goto(I3, C) = 8
$0c3C8       d$       Reduce with R2, C->cC: pop 2*|β| = 4 symbols; Goto(I0, C) = 2
$0C2         d$       Shift S7
$0C2d7       $        Reduce with R3, C->d: pop 2 symbols; Goto(I2, C) = 5
$0C2C5       $        Reduce with R1, S->CC: pop 4 symbols; Goto(I0, S) = 1
$0S1         $        Accept
UNIT-IV
Syntax Directed Translation and Intermediate code Generator
Intermediate code is the output of the parser and the input to the code generator. It is
relatively machine-independent, which allows the compiler to be retargeted, and it is
relatively easy to manipulate (optimize). In particular, retargeting is facilitated: a compiler
for a new machine can be built by attaching a new code generator to an existing front-end.
The common intermediate representations are:
1. Syntax trees
2. Postfix notation
3. Three-address code
Graphical Representations
For example, the syntax tree for the assignment a := b * -c + b * -c has assign at the root,
with the identifier a as its left child and + as its right child; each operand of + is a * node
whose children are b and a unary-minus node over c.
The edges in a syntax tree do not appear explicitly in postfix notation. They can be
recovered from the order in which the nodes appear and the number of operands that the
operator at each node expects. The recovery of edges is similar to the evaluation, using a
stack, of an expression in postfix notation. For example, the expression x + y * z translates
into
t1 := y * z
t2 := x + t1
where t1 and t2 are compiler-generated temporary names. This unraveling of
complicated arithmetic expressions and of nested flow-of-control statements makes
three-address code desirable for target code generation and optimization. The use of
names for the intermediate values computed by a program allows three-address code to
be easily rearranged, unlike postfix notation. Three-address code is a linearized
representation of a syntax tree or a DAG in which explicit names correspond to the
interior nodes of the graph.
For example, the assignment a := b * -c + b * -c translates into the three-address code
t1 := -c
t2 := b * t1
t3 := -c
t4 := b * t3
t5 := t2 + t4
a := t5
The reason for the term "three-address code" is that each statement usually contains
three addresses, two for the operands and one for the result. In the implementations of
three-address code given later in this section, a programmer-defined name is replaced by
a pointer to a symbol-table entry for that name.
The common three-address statement forms are:
1. Assignment statements of the form x := y op z, where op is a binary arithmetic or logical
operation.
2. Assignment instructions of the form x := op y, where op is a unary operation. Essential unary
operations include unary minus, logical negation, shift operators, and conversion operators that,
for example, convert a fixed-point number to a floating-point number.
3. Copy statements of the form x := y, where the value of y is assigned to x.
4. The unconditional jump goto L. The three-address statement with label L is the next to be
executed.
5. Conditional jumps such as if x relop y goto L. This instruction applies a relational operator (<,
=, >=, etc.) to x and y, and executes the statement with label L next if x stands in relation relop
to y. If not, the three-address statement following if x relop y goto L is executed next, as in the
usual sequence.
6. param x and call p, n for procedure calls, and return y, where y representing a returned value
is optional. Their typical use is the sequence of three-address statements
param x1
param x2
...
param xn
call p, n
generated as part of a call of the procedure p(x1, x2, ..., xn). The integer n indicating the number
of actual parameters in "call p, n" is not redundant because calls can be nested. The
implementation of procedure calls is outlined in Section 8.7.
7. Indexed assignments of the form x := y[i] and x[i] := y. The first of these sets x to the value in
the location i memory units beyond location y. The statement x[i] := y sets the contents of the
location i units beyond x to the value of y. In both these instructions, x, y, and i refer to data
objects.
8. Address and pointer assignments of the form x := &y, x := *y and *x := y. The first of these
sets the value of x to be the location of y. Presumably y is a name, perhaps a temporary, that
denotes an expression with an l-value such as A[i, j], and x is a pointer name or temporary; that
is, the r-value of x is the l-value (location) of some object. In the statement x := *y, presumably
y is a pointer or a temporary whose r-value is a location; the r-value of x is made equal to the
contents of that location. Finally, *x := y sets the r-value of the object pointed to by x to the
r-value of y.
The value of an expression is computed into a new temporary t. In general, the
three-address code for id := E consists of code to evaluate E into some temporary t,
followed by the assignment id.place := t. If an expression is a single identifier, say y, then
y itself holds the value of the expression. For the moment, we create a new name every
time a temporary is needed; techniques for reusing temporaries are given in Section 8.3.
The S-attributed definition in Fig. 8.6 generates three-address code for assignment
statements. Given input a := b * -c + b * -c, it produces the code in Fig. 8.5(a). The
synthesized attribute S.code represents the three-address code for the assignment S. The
non-terminal E has two attributes: E.place, the name that holds the value of E, and
E.code, the sequence of three-address statements evaluating E.
The function newtemp returns a sequence of distinct names t1, t2, ... in response to
successive calls. For convenience, we use the notation gen(x ':=' y '+' z) in Fig. 8.6 to
represent the three-address statement x := y + z. Expressions appearing in place of
variables like x, y, and z are evaluated when passed to gen, and quoted operators or
operands, like '+', are taken literally. In practice, three-address statements might be sent
to an output file rather than built up into the code attributes. Flow-of-control statements
can be added to the language of assignments in Fig. 8.6 by productions and semantic
rules like the ones for while statements in Fig. 8.7. In the figure, the code for
S -> while E do S1 is generated using two new attributes, S.begin and S.after, to mark the
first statement in the code for E and the statement following the code for S, respectively.
These attributes represent labels created by a function newlabel that returns a new label
every time it is called.
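The following is a minimal sketch of newtemp and gen at work, translating assignments such as a := b * -c + b * -c; the expression encoding (nested tuples) and the Gen class are illustrative assumptions, not the figures (Fig. 8.5/8.6) referenced above:

class Gen:
    def __init__(self):
        self.count = 0
        self.code = []                        # emitted three-address statements

    def newtemp(self):                        # returns t1, t2, ... on successive calls
        self.count += 1
        return 't%d' % self.count

    def gen(self, stmt):
        self.code.append(stmt)

    def expr(self, node):
        # returns the place holding the node's value, emitting code as needed
        if node[0] == 'id':                   # a single identifier is its own place
            return node[1]
        if node[0] == 'uminus':
            place = self.expr(node[1])
            t = self.newtemp()
            self.gen('%s := - %s' % (t, place))
            return t
        op, left, right = node                # ('+', l, r) or ('*', l, r)
        lp, rp = self.expr(left), self.expr(right)
        t = self.newtemp()
        self.gen('%s := %s %s %s' % (t, lp, op, rp))
        return t

g = Gen()
e = ('+', ('*', ('id', 'b'), ('uminus', ('id', 'c'))),
          ('*', ('id', 'b'), ('uminus', ('id', 'c'))))
g.gen('a := %s' % g.expr(e))   # g.code now matches the six statements shown earlier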
QUADRUPLES:
A quadruple is a record structure with four fields, which we call op, arg1, arg2, and
result. The op field contains an internal code for the operator. The three-address
statement x := y op z is represented by placing y in arg1, z in arg2, and x in result.
Statements with unary operators like x := -y or x := y do not use arg2. Operators like
param use neither arg2 nor result. Conditional and unconditional jumps put the target
label in result. The quadruples for the assignment a := b * -c + b * -c are obtained
directly from its three-address code above. The contents of fields arg1, arg2, and result
are normally pointers to the symbol-table entries for the names represented by these
fields. If so, temporary names must be entered into the symbol table as they are created.
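Laid out in this record structure, the three-address code above for a := b * -c + b * -c works out to the following quadruples:

      op       arg1    arg2    result
(0)   uminus   c               t1
(1)   *        b       t1      t2
(2)   uminus   c               t3
(3)   *        b       t3      t4
(4)   +        t2      t4      t5
(5)   :=       t5              a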
TRIPLES:
To avoid entering temporary names into the symbol table, we might refer to a temporary
value by the position of the statement that computes it. If we do so, three-address
statements can be represented by records with only three fields: op, arg1 and arg2, as
shown below. The fields arg1 and arg2, for the arguments of op, are either pointers to the
symbol table (for programmer-defined names or constants) or pointers into the triple
structure (for temporary values). Since three fields are used, this intermediate code
format is known as triples. Except for the treatment of programmer-defined names,
triples correspond to the representation of a syntax tree or DAG by an array of nodes.
The copy statement a := t5 is encoded in the triple representation by placing a in the arg1
field and using the operator assign. A ternary operation like x[i] := y requires two entries
in the triple structure, as shown in Fig. 8.9(a), while x := y[i] is naturally represented as
two operations in Fig. 8.9(b).
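For the same assignment a := b * -c + b * -c, the triples work out as follows; the parenthesized numbers are pointers into the triple structure rather than symbol-table entries:

      op       arg1    arg2
(0)   uminus   c
(1)   *        b       (0)
(2)   uminus   c
(3)   *        b       (2)
(4)   +        (1)     (3)
(5)   assign   a       (4)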
Indirect Triples
Another implementation, called indirect triples, lists pointers to triples rather than the
triples themselves; this makes it easy to reorder statements during optimization without
rewriting the triples.
Semantic analysis covers the following aspects of translation:
- Type checking
- Control-flow checking
- Uniqueness checking
- Name checking
Assume that the program has been verified to be syntactically correct and converted into
some kind of intermediate representation (a parse tree). The next phase is semantic
analysis of the generated parse tree. Semantic analysis also includes error reporting in
case any semantic error is found.
TYPE CHECKING: The process of verifying and enforcing the constraints of types is
called type checking. This may occur either at compile-time (a static check) or run-time
(a dynamic check). Static type checking is a primary task of the semantic analysis
carried out by a compiler. If type rules are enforced strongly (that is, generally allowing
only those automatic type conversions which do not lose information), the process is
called strongly typed, if not, weakly typed.
UNIQUENESS CHECKING: whether a variable name is unique within its scope.
TYPE COERCION: allowed mixing of types, done in languages which are not strongly
typed. This can be performed dynamically as well as statically.
NAME CHECKS: check whether any variable has a name which is not allowed, e.g., a
name that is the same as a keyword (such as int in Java).
- Whether an identifier has been declared before use: this problem amounts to identifying a
language of the form {wαw | w ∈ Σ*}, which is not context-free.
A parser has its own limitations in catching program errors related to semantics,
something that is deeper than syntax analysis. Typical features of semantic analysis
cannot be modeled using context free grammar formalism. If one tries to incorporate
those features in the definition of a language then that language doesn't remain context
free anymore.
Example:
string x;
int y;
y = x + 3;      // the use of x is a type error
int a, b;
a = b + c;      // c is not declared
ABSTRACT SYNTAX TREE: is nothing but the condensed form of a parse tree. In the
next few slides we will see how abstract syntax trees can be constructed from syntax
directed definitions. Normally operators and keywords appear as leaves of a parse tree,
but in an abstract syntax tree they are associated with the interior nodes that would be
the parents of those leaves in the parse tree. This is clearly indicated by the examples in
these slides.
Chains of single productions may be collapsed, and operators move to the parent nodes:
a chain of single productions is collapsed into one node, with the operator moving up to
become the node's label.
. Each node of the tree can be represented as a record consisting of at least two fields, to
store operators and operands:
. operator: one field for the operator, with the remaining fields holding pointers to the
operands: mknode(op, left, right)
. identifier: one field with label id and another pointing to the symbol table:
mkleaf(id, id.entry)
. number: one field with label num and another keeping the value of the number:
mkleaf(num, val)
Each node in an abstract syntax tree can be implemented as a record with several
fields. In the node for an operator one field identifies the operator (called the label of the
node) and the remaining contain pointers to the nodes for operands. Nodes of an
abstract syntax tree may have additional fields to hold values (or pointers to values) of
attributes attached to the node. The functions given in the slide are used to create the
nodes of abstract syntax trees for expressions. Each function returns a pointer to a
newly created node.
For example, the following sequence of function calls creates a parse tree for
w = a - 4 + c:
P1 = mkleaf(id, entry.a)
P2 = mkleaf(num, 4)
P3 = mknode(-, P1, P2)
P4 = mkleaf(id, entry.c)
P5 = mknode(+, P3, P4)
An example showing the formation of an abstract syntax tree by the given function calls
for the expression a-4+c. The call sequence can be derived from the expression's postfix
form, which is explained below.
A. Write the postfix equivalent of the expression for which we want to construct a syntax
tree: here, a 4 - c +.
B. Call the functions in the sequence defined by the postfix expression, which results in
the desired tree. In the case above, call mkleaf() for a, mkleaf() for 4, mknode() for -,
mkleaf() for c, and mknode() for + at last.
1. P1 = mkleaf(id, a.entry): a leaf node is made for the identifier a, and an entry for a is made
in the symbol table.
2. P2 = mkleaf(num, 4): a leaf node is made for the number 4, holding its value.
3. P3 = mknode(-, P1, P2): an internal node for -, which takes the pointers to the previously
made nodes P1 and P2 as arguments and represents the expression a-4.
4. P4 = mkleaf(id, c.entry): a leaf node is made for the identifier c, and an entry for c is made
in the symbol table.
5. P5 = mknode(+, P3, P4): an internal node for +, which takes P3 and P4 as arguments and
represents the expression a-4+c.
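A minimal sketch of mkleaf/mknode in Python, where each node is a small record (here a dict) and the symbol-table helper is an illustrative assumption:

symtab = {}

def entry(name):                      # fetch-or-create a symbol-table entry
    return symtab.setdefault(name, {'name': name})

def mkleaf(label, value):             # mkleaf(id, entry) or mkleaf(num, val)
    return {'label': label, 'value': value}

def mknode(op, left, right):          # interior node: operator plus operand pointers
    return {'label': op, 'left': left, 'right': right}

# the call sequence above, for a - 4 + c:
p1 = mkleaf('id', entry('a'))
p2 = mkleaf('num', 4)
p3 = mknode('-', p1, p2)              # represents a - 4
p4 = mkleaf('id', entry('c'))
p5 = mknode('+', p3, p4)              # represents (a - 4) + c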
Following is the syntax directed definition for constructing the syntax tree above:
E -> E1 + T    E.ptr := mknode(+, E1.ptr, T.ptr)
E -> T         E.ptr := T.ptr
T -> F         T.ptr := F.ptr
With these syntax directed definitions we can construct the syntax tree for a given
grammar; all the rules mentioned earlier are taken care of, and an abstract syntax tree is
formed.
ATTRIBUTE GRAMMARS: A CFG G = (V, T, P, S) is called an attribute grammar iff each
grammar symbol X ∈ V ∪ T has an associated set of attributes, and each production
p ∈ P is associated with a set of attribute evaluation rules called semantic actions.
In an AG, the values of attributes at a parse tree node are computed by semantic
rules. There are two different specifications of AGs used by the Semantic Analyzer in
evaluating the semantics of the program constructs. They are,
1) Syntax Directed Definition (SDD): a high-level specification that does not give any
implementation details; it just states the attribute equations. Details like at what point of
time an equation is evaluated, and in what manner, are hidden from the programmer.
For example:
E -> E1 + T   { E.val = E1.val + T.val }
E -> T        { E.val = T.val }
T -> T1 * F   { T.val = T1.val * F.val }
T -> F        { T.val = F.val }
F -> (E)      { F.val = E.val }
F -> id       { F.val = id.lexval }
F -> num      { F.val = num.lexval }
2) Syntax directed Translation(SDT) scheme: Sometimes we want to control the way the
attributes are evaluated, the order and place where they are evaluated. This is of a slightly
lower level.
An SDT is an SDD in which semantic actions can be placed at any position in the
body of the production.
For example, the following SDT prints the prefix equivalent of an arithmetic expression
consisting of + and * operators:
L -> E n                     { printf(E.val) }
E -> { printf('+') } E1 + T
E -> T
T -> { printf('*') } T1 * F
T -> F
F -> (E)
F -> { printf(id.lexval) } id
F -> { printf(num.lexval) } num
Each action in an SDT is executed as soon as its node in the parse tree is visited in a
preorder traversal of the tree.
To avoid repeated traversal of the parse tree, actions are taken simultaneously when
a token is found. So calculation of attributes goes along with the construction of the
parse tree.
Along with the evaluation of the semantic rules the compiler may simultaneously
generate code, save the information in the symbol table, and/or issue error messages
etc. at the same time while building the parse tree.
tree.
Example: consider the grammar
Number -> sign list
sign -> + | -
list -> list bit | bit
bit -> 0 | 1
Build an attribute grammar that annotates Number with the value it represents.
symbol      attributes
Number      value
sign        negative
list        position, value
bit         position, value
The semantic rules:
Number -> sign list    list.position = 0
                       if sign.negative then Number.value = -list.value
                       else Number.value = list.value
sign -> +              sign.negative = false
sign -> -              sign.negative = true
list -> bit            bit.position = list.position
                       list.value = bit.value
list0 -> list1 bit     list1.position = list0.position + 1
                       bit.position = list0.position
                       list0.value = list1.value + bit.value
bit -> 0               bit.value = 0
bit -> 1               bit.value = 2^bit.position
list -> bit            /* the bit's position is the same as the list's position, because this
                          bit is the rightmost; the value of the list is the same as the bit's */
list0 -> list1 bit     /* position and value calculations */
bit -> 0 | 1           /* set the corresponding value */
Attributes of RHS can be computed from attributes of LHS and vice versa.
Attributes: attributes fall into two classes, synthesized attributes and inherited attributes.
The value of a synthesized attribute is computed from the values of attributes at the
children nodes (possibly also using inherited attributes at the node itself). The value of an
inherited attribute is computed from the parent and sibling nodes.
Consider a semantic rule b = f(c1, ..., ck) associated with a production A -> X1 X2 ... Xn.
The value of the attribute b depends on the values of the attributes c1 to ck. If c1 to ck
belong to the children nodes and b to A, then b is called a synthesized attribute of A. If b
belongs to one of the children, then it is an inherited attribute of one of the grammar
symbols on the right.
Synthesized Attributes: a syntax directed definition that uses only synthesized attributes
is said to be an S-attributed definition, e.g.:
E -> T    E.val = T.val
T -> F    T.val = F.val
Terminals are assumed to have only synthesized attributes, the values of which are
supplied by the lexical analyzer. This grammar uses only synthesized attributes; the start
symbol has no parent, hence no inherited attributes.
Using the previous attribute grammar, the calculations can be worked out for the input
3 * 4 + 5 n with bottom-up parsing.
Inherited attributes help to find the context (type, scope, etc.) of a token, e.g., the type or
scope when the same variable name is used multiple times in a program in different
functions. An inherited-attribute system may be replaced by an S-attributed system, but
it is more natural to use inherited attributes in some cases, like the example given above.
Here the function addtype(a, b) adds a symbol-table entry for the id a and attaches to it
the type b.
. The dependencies among the nodes can be depicted by a directed graph called a
dependency graph, constructed as follows:
for each node n in the parse tree do
    for each attribute a of the grammar symbol at n do
        construct a node in the dependency graph for a
for each node n in the parse tree do
    for each semantic rule b = f(c1, ..., ck) associated with the production used at n do
        for i = 1 to k do
            construct an edge from the node for ci to the node for b
That is, after making one node for every attribute of all the nodes of the parse tree, make
an edge to each attribute from each of the other attributes on which it depends.
For example, the semantic rule A.a = f(X.x, Y.y) for the production A -> XY defines the
synthesized attribute a of A to be dependent on the attribute x of X and the attribute y of
Y. Thus the dependency graph will contain an edge from X.x to A.a and an edge from Y.y
to A.a, accounting for the two dependencies. Similarly, for the semantic rule
X.x = g(A.a, Y.y) for the same production there will be an edge from A.a to X.x and an
edge from Y.y to X.x.
Example: for the production E -> E1 + E2 with the rule E.val = E1.val + E2.val, the
synthesized attribute E.val depends on E1.val and E2.val; hence there are two edges, one
each from E1.val and E2.val. As a larger example, consider the dependency graph for the
string real id1, id2, id3.
. Put a dummy synthesized attribute b for a semantic rule that consists of a procedure call
The figure shows the dependency graph for the statement real id1, id2, id3 along
with the parse tree. Procedure calls can be thought of as rules defining the values of
dummy synthesized attributes of the nonterminal on the left side of the associated
production. Blue arrows constitute the dependency graph and black lines, the parse
tree. Each of the semantic rules addtype (id.entry, L.in) associated with the L
productions leads to the creation of the dummy attribute.
Evaluation Order :
Any topological sort of dependency graph gives a valid order in which semantic rules
must be evaluated
a4 = real
a5 = a4
addtype(id3.entry, a5)
a7 = a5
addtype(id2.entry, a7)
a9 = a7
addtype(id1.entry, a9)
A topological sort of a directed acyclic graph is any ordering m1, m2, m3 .......mk
of the nodes of the graph such that edges go from nodes earlier in the ordering to later
nodes. Thus if mi -> mj is an edge from mi to mj then mi appears before mj in the
ordering. The order of the statements shown in the slide is obtained from the topological
sort of the dependency graph in the previous slide. 'an' stands for the attribute
associated with the node numbered n in the dependency graph. The numbering is as
shown in the previous slide.
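A sketch of computing such an evaluation order in Python; the node names follow the a4/a5/a7/a9 numbering above, and each edge (c, b) means that attribute b depends on attribute c:

from collections import defaultdict, deque

def topo_order(nodes, edges):
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for c, b in edges:                 # b depends on c, so c must come first
        succ[c].append(b)
        indeg[b] += 1
    ready = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):       # a cycle: circular attribute dependencies
        raise ValueError('attributes cannot be evaluated')
    return order

print(topo_order(['a4', 'a5', 'a7', 'a9'],
                 [('a4', 'a5'), ('a5', 'a7'), ('a7', 'a9')]))
# -> ['a4', 'a5', 'a7', 'a9']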
Translation Schemes: a CFG where semantic actions occur within the right-hand sides of
productions. A translation scheme to map infix to postfix:
E -> T R
R -> addop T { print(addop) } R | ε
T -> num { print(num) }
We assume that the actions are terminal symbols; performing a depth-first traversal on
the input 9-5+2 prints 9 5 - 2 +.
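Here is a sketch of this translation scheme as a recursive-descent translator in Python; the single-character token stream is a simplifying assumption:

def translate(tokens):
    out, pos = [], 0

    def T():                           # T -> num { print(num) }
        nonlocal pos
        out.append(tokens[pos]); pos += 1

    def R():                           # R -> addop T { print(addop) } R | epsilon
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in '+-':
            op = tokens[pos]; pos += 1
            T()
            out.append(op)             # the action fires after T: postfix order
            R()

    T(); R()                           # E -> T R
    return ' '.join(out)

print(translate(list('9-5+2')))        # prints: 9 5 - 2 +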
When designing translation scheme, ensure attribute value is available when referred to
In case of synthesized attribute it is trivial (why?)
In a translation scheme, as we are dealing with implementation, we have to explicitly
worry about the order of traversal. We can now put actions in between the symbols on
the RHS; we place these actions so as to control the order of traversal. In the given
example, we have two terminals (num and addop). The input can generally be seen as a
number followed by R (which necessarily has to begin with an addop). The given
grammar is in infix notation and we need to convert it into postfix notation. If we ignore
all the actions, we get the plain parse tree; including the actions gives a parse tree with
actions, the actions so far being treated as terminals. Now, if we do a depth-first traversal
and execute each action as we encounter it, we get the postfix notation. In a translation
scheme we have to take care of the evaluation order; otherwise some of the parts may be
left undefined, and different placements of actions will produce different results. Actions
are something we write and we have to control. Note that a translation scheme is
different from a syntax directed definition: in the latter we do not specify any evaluation
order, whereas here we have an explicit evaluation order. Because the order is explicit,
we have to set the correct action at the correct place in order to get the desired output;
the place of each action is very important, and finding the appropriate places is what
designing a translation scheme is all about. If we use only synthesized attributes, the
translation scheme is trivial: when we reach a node we know that all its children have
already been evaluated and all their attributes dealt with, so finding the place for the
action is simple, namely the rightmost place, at the end of the RHS.
A -> a { print(A.in) }
. A synthesized attribute for the non-terminal on the LHS can be computed only after all
the attributes it references have been computed; the action normally should be placed at
the end of the RHS.
We have a problem when we have both synthesized and inherited attributes. For the
given example, if we place the action as shown, we cannot evaluate it: when doing a
depth-first traversal, we cannot print anything for A1, because A1's inherited attribute
has not yet been initialized. We therefore have to find the correct places for the actions.
The rule is that the inherited attribute of A must be calculated on its left. This follows
logically from the definition of an L-attributed definition, which says that when we reach
a node, everything on its left must already have been computed. If we do this, we will
always have the attribute evaluated at the correct place. For such specific cases (like the
given example) calculating anywhere on the left will work, but in general it must be
calculated immediately to the left.
S -> B            B.pts = 10
                  S.ht = B.ht
B -> B1 B2        B1.pts = B.pts
                  B2.pts = B.pts
                  B.ht = max(B1.ht, B2.ht)
B -> B1 sub B2    B1.pts = B.pts
                  B2.pts = shrink(B.pts)
                  B.ht = disp(B1.ht, B2.ht)
B -> text         B.ht = text.h * B.pts
We now look at another example: a grammar describing how text is composed. EQN was
an equation-typesetting system used as an early typesetting tool for UNIX, a LaTeX-like
system for equations. We say that the start symbol is a block, S -> B. We can also have
subscripts and superscripts; here we look at subscripts. A block is composed of several
blocks, B -> B1 B2, and in B -> B1 sub B2 the block B2 is a subscript of B1. We have to
determine the point size (inherited) and the height (synthesized) of each block; the
relevant functions for height and point size are given alongside the rules. After putting
all the actions at the correct places as per the rules stated (read them from left to right,
and top to bottom), note that all inherited attributes are calculated to the left of the B
symbols and synthesized attributes to their right.
. If both the operands of the arithmetic operators +, -, * are integers, then the result is of
type integer.
. The result of the unary & operator is a pointer to the object referred to by the operand:
if the type of the operand is X, then the type of the result is pointer(X).
1. Basic types: These are atomic types with no internal structure. They include the types
boolean, character, integer and real.
2. Sub-range types: A sub-range type defines a range of values within the range of another
type. For example: type A = 1..10; B = 100..1000; U = 'A'..'Z';
3. Enumerated types: An enumerated type is defined by listing all of the possible values for the
type. For example: type Colour = (Red, Yellow, Green); Country = (NZ, Aus, SL, WI, Pak, Ind, SA,
Ken, Zim, Eng); Both the sub-range and enumerated types can be treated as basic types.
4. Constructed types: A constructed type is constructed from basic types and other
constructed types. Examples of constructed types are arrays, records and sets.
Additionally, pointers and functions can also be treated as constructed types.
TYPE EXPRESSION:
1. A basic type is a type expression. Among the basic types are boolean, char, integer, and
real. A special basic type, type_error, is used to signal an error during type checking.
Another special basic type is void, which denotes "the absence of a value" and is used to
check statements.
2. Since type expressions may be named, a type name is a type expression.
3. The result of applying a type constructor to a type expression is a type expression.
4. Type expressions may contain variables whose values are type expressions themselves.
TYPE CONSTRUCTORS: are used to define or construct the type of user defined
types based on their dependent types.
Arrays : If T is a type expression and I is a range of integers, then array ( I , T ) is
the type expression denoting the type of array with elements of type T and index
set I.
For example, the Pascal declaration var A: array[1 .. 10] of integer; associates the type
expression array(1..10, integer) with A.
Products : If T1 and T2 are type expressions, then their Cartesian product T1 X T2 is also a type
expression.
Records : A record type constructor is applied to a tuple formed from field names
and field types. For example, the declaration
Consider the declaration
type row = record
    addr: integer;
    lexeme: array [1 .. 15] of char
end;
var table: array [1 .. 10] of row;
The type row has type expression: record((addr x integer) x (lexeme x array(1..15, char)))
Note: Including the field names in the type expression allows us to define another
record type with the same fields but with different names without being forced to
equate the two.
Pointers: If T is a type expression, then pointer ( T ) is a type expression
denoting the type "pointer to an object of type T".
For example, in Pascal, the declaration
var p: row declares variable p to have type pointer( row ).
Functions: analogous to mathematical functions, functions in programming languages
may be defined as mapping a domain type D to a range type R. The type of such a
function is denoted by the type expression D -> R. For example, the built-in function mod
of Pascal has domain type int x int and range type int; thus we say mod has the type
int x int -> int.
As another example, according to the Pascal declaration
function f(a, b: char): ^integer;
the type of f is denoted by the type expression char x char -> pointer(integer).
P -> D ; E
D -> D ; D | id : T
T -> char | integer | array [ num ] of T | ^ T
E -> literal | num | id | E mod E | E [ E ] | E ^
A type checker is a translation scheme that synthesizes the type of each expression from
the types of its sub-expressions. Consider the above grammar, which generates programs
consisting of a sequence of declarations D followed by a single expression E.
Specification of a type checker for the language of the above grammar: a program
generated by this grammar is
key : integer;
key mod 1999
Assumptions:
1. The language has three basic types: char, integer and type_error.
2. For simplicity, all arrays start at 1. For example, the declaration array[256] of char leads
to the type expression array(1..256, char).
The rules for the symbol-table entries follow the declaration productions: each
declaration id : T adds an entry binding id to the type denoted by T, in essence
addtype(id.entry, T.type).
E -> E1 ( E2 ) { E.type := if E2.type == s and E1.type == s -> t then t else type_error }
This rule says that in an expression formed by applying E1 to E2, the type of E1 must be a
function s -> t from the type s of E2 to some range type t; the type of E1 ( E2 ) is t.
The above rule can be generalized to functions with more than one argument by
constructing a product type consisting of the arguments: n arguments of types T1, T2,
..., Tn are treated as a single argument of type T1 x T2 x ... x Tn. For example, the
declaration below declares a function root that takes a function from reals to reals and a
real as arguments and returns a real. The Pascal-like syntax for this declaration is
function root(function f(real): real; x: real): real
TYPE CHECKING FOR EXPRESSIONS: consider the following SDD for expressions:
E -> id           E.type = lookup(id.entry)
E -> E1 mod E2    E.type = if E1.type == integer and E2.type == integer
                           then integer else type_error
E -> E1 [ E2 ]    E.type = if E2.type == integer and E1.type == array(s, t)
                           then t else type_error
E -> E1 ^         E.type = if E1.type == pointer(t)
                           then t else type_error
To perform type checking of expressions, the following rules are used, where the
synthesized attribute type for E gives the type expression assigned by the type system to
the expression generated by E.
. The following semantic rules say that constants represented by the tokens literal and
num have type char and integer, respectively:
E -> literal   { E.type := char }
E -> num       { E.type := integer }
. The function lookup(e) is used to fetch the type saved in the symbol-table entry pointed
to by e. When an identifier appears in an expression, its declared type is fetched and
assigned to the attribute type:
E -> id        { E.type := lookup(id.entry) }
. According to the following rule, the expression formed by applying the mod
operator to two sub-expressions of type integer has type integer ; otherwise, its type
is type_error .
E -> E1 mod E2 { E.type := if E1.type == integer and E2.type == integer then integer else
type_error }
Within expressions, the postfix ^ operator yields the object pointed to by its operand. The
type of E is the type t of the object pointed to by the pointer E1:
E -> E1 ^ { E.type := if E1.type == pointer(t) then t else type_error }
TYPE CHECKING OF STATEMENTS: statements typically do not have values, so the
special basic type void can be assigned to them. Consider the SDD for the grammar
below, which generates assignment, conditional, and looping statements:
S -> id := E        S.type = if id.type == E.type then void else type_error
S -> if E then S1   S.type = if E.type == boolean then S1.type else type_error
S -> while E do S1  S.type = if E.type == boolean then S1.type else type_error
S -> S1 ; S2        S.type = if S1.type == void and S2.type == void
                             then void else type_error
Since statements do not have values, the special basic type void is assigned to them; if an
error is detected within a statement, the type assigned to the statement is type_error.
The statements considered are assignment, conditional, and while statements. Sequences
of statements are separated by semicolons. The productions given here can be combined
with those given before if we change the production for a complete program to P -> D ; S,
so that a program consists of declarations followed by statements.
1. The assignment rule checks that the left and right sides of an assignment statement
have the same type.
2. The if rule specifies that the expression in an if-then statement must have the type
boolean.
3. The while rule specifies that the expression in a while statement must have the type
boolean.
4. S -> S1 ; S2 { S.type := if S1.type == void and S2.type == void then void else type_error }
propagates void through statement sequences.
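A compact sketch of these checking rules in Python; the expression/statement encodings and the env symbol table are illustrative assumptions:

INT, CHAR, BOOL, VOID, ERR = 'integer', 'char', 'boolean', 'void', 'type_error'

def expr_type(e, env):
    kind = e[0]
    if kind == 'num':   return INT
    if kind == 'lit':   return CHAR
    if kind == 'id':    return env.get(e[1], ERR)          # lookup(id.entry)
    if kind == 'mod':                                      # E -> E1 mod E2
        ok = expr_type(e[1], env) == INT and expr_type(e[2], env) == INT
        return INT if ok else ERR
    if kind == 'index':                                    # E -> E1 [ E2 ]
        t1, t2 = expr_type(e[1], env), expr_type(e[2], env)
        if t2 == INT and isinstance(t1, tuple) and t1[0] == 'array':
            return t1[1]                                   # element type t
        return ERR
    if kind == 'deref':                                    # E -> E1 ^
        t1 = expr_type(e[1], env)
        return t1[1] if isinstance(t1, tuple) and t1[0] == 'pointer' else ERR
    return ERR

def stmt_type(s, env):
    kind = s[0]
    if kind == 'assign':                                   # S -> id := E
        return VOID if env.get(s[1]) == expr_type(s[2], env) else ERR
    if kind in ('if', 'while'):                            # boolean condition required
        return stmt_type(s[2], env) if expr_type(s[1], env) == BOOL else ERR
    if kind == 'seq':                                      # S -> S1 ; S2
        ok = stmt_type(s[1], env) == VOID and stmt_type(s[2], env) == VOID
        return VOID if ok else ERR
    return ERR

env = {'key': INT}
print(stmt_type(('assign', 'key', ('mod', ('id', 'key'), ('num', 1999))), env))  # void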
UNIT – V
Code optimization and Code generation
CODE OPTIMIZATION
Considerations for optimization: the code produced by straightforward compiling
algorithms can often be made to run faster or take less space, or both. This improvement
is achieved by program transformations that are traditionally called optimizations.
Machine-independent optimizations are program transformations that improve the
target code without taking into consideration any properties of the target machine.
Machine-dependent optimizations are based on register allocation and utilization of
special machine-instruction sequences.
- Simply stated, the best program transformations are those that yield the most benefit
for the least effort.
- First, the transformation must preserve the meaning of programs. That is, the
optimization must not change the output produced by a program for a given input, or
cause an error.
Some transformations can only be applied after detailed, often time-consuming analysis
of the source program, so there is little point in applying them to programs that will be
run only a few times.
During code transformation in the process of optimization, the basic guidelines are as
follows:
1. Exploit the fast path in case of multiple paths for a given situation.
2. Trade off between the size of the code and the speed with which it gets executed.
3. Place code and data together whenever required, to avoid unnecessary searching of
data/code.
Consider all that has happened up to this point in the compiling process: lexical analysis,
syntactic analysis, semantic analysis and finally intermediate-code generation. The
compiler has done an enormous amount of analysis, but it still doesn't really know how
the program does what it does. In control-flow analysis, the compiler figures out even
more information about how the program does its work, only now it can assume that
there are no syntactic or semantic errors in the code.
Now we can construct the control-flow graph between the blocks. Each basic
block is a node in the graph, and the possible different routes a program might take are
the connections, i.e. if a block ends with a branch, there will be a path leading from that
block to the branch target. The blocks that can follow a block are called its successors.
There may be multiple successors or just one. Similarly the block may have many, one,
or no predecessors. Connect up the flow graph for Fibonacci basic blocks given above.
What does an if then-else look like in a flow graph? What about a loop? You probably
have all seen the gcc warning or javac error about: "Unreachable code at line XXX."
How can the compiler tell when code is unreachable?
LOCAL OPTIMIZATIONS
1. Common Subexpression Elimination: an expression is alive if the operands used to
compute it have not been changed since its last evaluation. An expression that is no
longer alive is dead.
Example:
a = b * c;
d = b * c + x - y;
We can eliminate the second evaluation of b*c from this code if none of the intervening
statements has changed its value. We can thus rewrite the code as:
t1 = b * c;
a = t1;
d = t1 + x - y;
2. Variable Propagation:
Consider the following code:
c = a * b;
x = a;
d = x * b + 4;
If we replace x by a in the last statement, we can identify a*b and x*b as common
subexpressions. This technique is called variable propagation: the use of one variable is
replaced by another variable if it has been assigned the value of the same expression.
3. Compile-Time Evaluation:
The execution efficiency of the program can be improved by shifting execution-time
actions to compile time, so that they are not performed repeatedly during program
execution. We can evaluate an expression with constant operands at compile time and
replace that expression by a single value. This is called folding. Consider the following
statement:
a = 2 * (22.0 / 7.0) * r;
Here, we can perform the computation 2 * (22.0 / 7.0) at compile time itself.
4. Code Movement:
The motivation for performing code movement in a program is to improve the execution
time of the program by reducing the evaluation frequency of expressions. This can be
done by moving the evaluation of an expression to other parts of the program. Consider
the code below:
if (a < 10)
{
    b = x^2 - y^2;
}
else
{
    b = 5;
    a = (x^2 - y^2) * 10;
}
The expression x^2 - y^2 appears in both branches of the condition a < 10, so we can
optimize the code by moving it outside the blocks as follows:
t = x^2 - y^2;
if (a < 10)
{
    b = t;
}
else
{
    b = 5;
    a = t * 10;
}
5. Strength Reduction:
In the frequency-reduction transformation we tried to reduce the execution frequency of
expressions by moving code. There is another class of transformations which perform
equivalent actions indicated in the source program by reducing the strength of operators.
By strength reduction, we mean replacing a high-strength operator with a low-strength
operator without affecting the program's meaning. Consider the example below:
i = 1;
while (i < 10)
{
    y = i * 4;
    i = i + 1;
}
can be transformed into:
i = 1;
t = 4;
while (i < 10)
{
    y = t;
    t = t + 4;
    i = i + 1;
}
Here the high-strength operator * is replaced with +.
To eliminate common subexpressions across basic blocks:
1) First, compute defined and killed sets for each basic block (this does not involve any of
its predecessors or successors).
2) Iteratively compute the avail and exit sets for each block by running the following
algorithm until you hit a stable fixed point:
a) Identify each statement s of the form a = b op c in some block B such that b op c is
available at the entry to B and neither b nor c is redefined in B prior to s.
b) Follow the flow of control backward in the graph, passing back to but not through
each block that defines b op c. The last computation of b op c in such a block
reaches s.
c) After each computation d = b op c identified in step 2a, add the statement t = d to
that block, where t is a new temp.
d) Replace s by a = t.
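A sketch of the fixed-point iteration above; the blocks, successor lists, and the per-block gen (expressions computed) and kill (expressions whose operands are redefined) sets are assumed to be precomputed:

def available_expressions(blocks, succ, gen, kill, all_exprs):
    pred = {b: set() for b in blocks}
    for b in blocks:
        for s in succ[b]:
            pred[s].add(b)
    avail = {b: set(all_exprs) for b in blocks}    # avail-in, optimistic start
    avail[blocks[0]] = set()                       # nothing available at entry
    changed = True
    while changed:                                 # iterate to a stable fixed point
        changed = False
        for b in blocks:
            if pred[b]:
                # an expression must survive along every predecessor's exit set
                new = set.intersection(
                    *(((avail[p] - kill[p]) | gen[p]) for p in pred[b]))
            else:
                new = set()
            if new != avail[b]:
                avail[b] = new
                changed = True
    return avail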
Try an example to make things clearer:
main:
    BeginFunc 28;
    b = a + 2;
    c = 4 * b;
    tmp1 = b < c;
    ifNZ tmp1 goto L1;
    b = 1;
L1:
    d = a + 2;
    EndFunc;
First, divide the code above into basic blocks. Now calculate the available expressions
for each block. Then find an expression available in a block and perform step 2c above.
What common sub-expression can you share between the two blocks? What if the
above code were:
main:
    BeginFunc 28;
    b = a + 2;
    c = 4 * b;
    tmp1 = b < c;
    IfNZ tmp1 Goto L1;
    b = 1;
    z = a + 2;    <========= an additional line here
L1:
    d = a + 2;
    EndFunc;
MACHINE OPTIMIZATIONS
In final code generation, there is a lot of opportunity for cleverness in generating
efficient target code. In this pass, specific machine features (specialized instructions,
hardware pipeline abilities, register details) are taken into account to produce code
optimized for the particular architecture.
REGISTER ALLOCATION:
One machine optimization of particular importance is register allocation, which is
perhaps the single most effective optimization for all architectures. Registers are the
fastest kind of memory available, but as a resource, they can be scarce.
The problem is how to minimize traffic between the registers and what lies beyond them
in the memory hierarchy, to eliminate time wasted sending data back and forth across
the bus and the different levels of caches. Your Decaf back-end uses a very naïve and
inefficient means of assigning registers: it just fills them before performing an operation
and spills them right afterwards.
A much more effective strategy would be to consider which variables are more
heavily in demand and keep those in registers and spill those that are no longer
needed or won't be needed until much later.
One common register allocation technique is called "register coloring", after the central
idea of viewing register allocation as a graph-coloring problem. If we have 8 registers,
then we try to color the graph with eight different colors. The graph's nodes are made of
"webs" and the arcs are determined by calculating interference between the webs. A web
represents a variable's definitions, the places where it is assigned a value (as in x = ...),
and the possible different uses of those definitions (as in y = x + 2). This problem, in fact,
can be approached as another graph: the definitions and uses of a variable are nodes, and
if a definition reaches a use, there is an arc between the two nodes. If two portions of a
variable's definition-use graph are unconnected, then we have two separate webs for the
variable. In the interference graph for the routine, each node is a web. We seek to
determine which webs don't interfere with one another, so we know we can use the same
register for those two variables. For example, consider the following code:
i = 10;
j = 20;
x = i + j;
y = j + k;
We say that i interferes with j because at least one pair of i's definitions and uses is
separated by a definition or use of j; thus, i and j are "alive" at the same time. A variable
is alive between the time it has been defined and that definition's last use, after which the
variable is dead. If two variables interfere, then we cannot use the same register for both.
But two variables that don't interfere can occupy the same register, since there is no
overlap in their liveness. Once we have the interference graph constructed, we r-color it
so that no two adjacent nodes share the same color (r is the number of registers we have;
each color represents a different register).
We may recall that graph-coloring is NP-complete, so we employ a heuristic
rather than an optimal algorithm. Here is a simplified version of something that might be
used:
1. Find the node with the fewest neighbors. (Break ties arbitrarily.)
2. Remove it from the interference graph and push it onto a stack.
3. Repeat steps 1 and 2 until the graph is empty.
4. Now, rebuild the graph as follows:
a. Take the top node off the stack and reinsert it into the graph.
b. Choose a color for it different from those of its neighbors presently in the graph,
rotating colors in case there is more than one choice.
c. Repeat (a) and (b) until the graph is either completely rebuilt, or there is no color
available to color the node.
If we get stuck, then the graph may not be r-colorable; we could try again with a different
heuristic, say reusing colors as often as possible. If there is no other choice, we have to
spill a variable to memory.
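A sketch of the simplify/select heuristic just described; 'graph' maps each web to the set of webs it interferes with, and r is the number of registers (colors):

def color(graph, r):
    work = {n: set(adj) for n, adj in graph.items()}
    stack = []
    while work:                          # steps 1-3: remove a fewest-neighbor node
        n = min(work, key=lambda x: len(work[x]))
        stack.append(n)
        for m in work[n]:
            work[m].discard(n)
        del work[n]
    colors = {}
    while stack:                         # step 4: rebuild, choosing a free color
        n = stack.pop()
        used = {colors[m] for m in graph[n] if m in colors}
        free = [c for c in range(r) if c not in used]
        if not free:
            return None                  # stuck: a variable may need to be spilled
        colors[n] = free[0]
    return colors

# i interferes with j, j with x, x with y (as in the liveness example above):
print(color({'i': {'j'}, 'j': {'i', 'x'}, 'x': {'j', 'y'}, 'y': {'x'}}, 2))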
INSTRUCTION SCHEDULING:
PEEPHOLE OPTIMIZATIONS:
Peephole optimization is a pass that operates on the target assembly and only considers a
few instructions at a time (through a "peephole"), attempting simple, machine-dependent
code improvements, for example eliminating redundant loads and stores or replacing an
instruction sequence by a shorter equivalent.
Abstract Syntax Tree / DAG: a DAG, like the abstract syntax tree, is a condensed form of
the parse tree. A DAG is more compact than an abstract syntax tree because common
subexpressions are eliminated. A syntax tree depicts the natural hierarchical structure of
a source program; its structure has already been discussed in earlier lectures. DAGs are
generated as a combination of trees: operands that are being reused are linked together,
and nodes may be annotated with variable names (to denote assignments). This way,
DAGs are highly compact, since they eliminate local common subexpressions. On the
other hand, they are not so easy to optimize, since they are more specific tree forms.
However, proper building of the DAG for a given basic block avoids redundant
calculation. An example of a syntax tree and DAG is given for the assignment
a := b * -c + b * -c
You can see that the node " * " comes only once in the DAG as well as the leaf " b
", but the meaning conveyed by both the representations (AST as well as the DAG)
remains the same.
So far we were only considering making changes within one basic block. With
some additional analysis, we can apply similar optimizations across basic blocks,
making them global optimizations. It's worth pointing out that global in this case does
not mean across the entire program. We usually only optimize one function at a time.
Interprocedural analysis is an even larger task, one not even attempted by some
compilers. The additional analysis the optimizer must do to perform optimizations across
basic blocks is called data-flow analysis. Data-flow analysis is much more complicated
than control-flow analysis.
Let's consider a global common sub-expression elimination optimization as our
example. Careful analysis across blocks can determine whether an expression is alive
on entry to a block. Such an expression is said to be available at that point.
Once the set of available expressions is known, common sub-expressions can
be eliminated on a global basis. Each block is a node in the flow graph of a program.
The successor set (succ(x)) for a node x is the set of all nodes that x directly flows into.
The predecessor set (pred(x)) for a node x is the set of all nodes that flow directly into x.
An expression is defined at the point where it is assigned a value and killed when one of
its operands is subsequently assigned a new value. An expression is available at some
point p in a flow graph if every path leading to p contains a prior definition of that
expression which is not
subsequently killed.
Consider another example:
main()
{
int x, y, z;
x = (1+20)* -x;
y = x*x+(x/y);
y = z = (x/y)/(x*x);
}
A straight translation:
tmp1 = 1 + 20;
tmp2 = -x;
x = tmp1 * tmp2;
tmp3 = x * x;
tmp4 = x / y;
y = tmp3 + tmp4;
tmp5 = x / y;
tmp6 = x * x;
z = tmp5 / tmp6;
y = z;
What sub-expressions can be eliminated? How can valid common sub-expressions (live
ones) be determined? Here is an optimized version, after constant folding and
propagation and elimination of common sub-expressions:
tmp2 = -x;
x = 21 * tmp2;
tmp3 = x * x;
tmp4 = x / y;
y = tmp3 + tmp4;
tmp5 = x / y;
z = tmp5 / tmp3;
y = z;
int arr[20 * 4 + 3];
switch (i) {
case 10 * 5: ...
}
In both snippets shown above, the expression can be resolved to an integer constant at
compile time and thus, we have the information needed to generate code. If either
expression involved a variable, though, there would be an error. How could you rewrite
the grammar to allow constant folding in case statements? This situation is a classic
example of the gray area between syntactic and semantic analysis.
def[B] is the set of variables assigned values in B prior to any use of that variable in B.
use[B] is the set of variables whose values may be used in B prior to any definition of the
variable.
A variable comes live into a block (is in in[B]) if it is either used before redefinition in
the block, or is live coming out of the block and is not redefined in the block:
in[B] = use[B] ∪ (out[B] − def[B])
A variable comes live out of a block (is in out[B]) if and only if it is live coming into one
of its successors:
out[B] = ∪ in[S], taken over all S in succ[B]
Note the relation to the reaching-definitions equations: the roles of in and out are
interchanged.
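A sketch of iterating these equations to a fixed point; the succ, use and def sets per block are assumed precomputed ('defs' avoids Python's def keyword):

def liveness(blocks, succ, use, defs):
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            # out[B] = union of in[S] over successors S
            out = set().union(*(live_in[s] for s in succ[b])) if succ[b] else set()
            # in[B] = use[B] U (out[B] - def[B])
            inn = use[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out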
Copy Propagation
We can also drive this optimization "backwards", recognizing that the original
assignment made to a temporary can be eliminated in favor of direct assignment to the
final goal:
tmp1 = LCall _Binky;
a = tmp1;
tmp2 = LCall _Winky;
b = tmp2;
tmp3 = a * b;
c = tmp3;
becomes:
a = LCall _Binky;
b = LCall _Winky;
c = a * b;
CODE GENERATION:
The code generator keeps track of values with two descriptors:
1. A register descriptor keeps track of what is currently held in each register; it is
consulted whenever a new register is needed.
2. An address descriptor keeps track of the location (or locations) where the current value of
a name can be found at run time. The location might be a register, a stack location, a memory
address, or some set of these, since when a value is copied it also stays where it was. This
information can be stored in the symbol table and is used to determine the accessing method
for a name.
CODE GENERATION ALGORITHM:
for each statement X = Y op Z do:
- invoke getreg to determine a location L for the result;
- consult the address descriptor of Y to find Y' (preferring a register) and, if Y is not
already in L, generate MOV Y', L;
- generate OP Z', L, again preferring a register for Z'. Update the address descriptor of X
to indicate that X is in L. If L is a register, update its descriptor to indicate that it contains
X, and remove X from all other register descriptors;
- if the current values of Y and/or Z have no next use, are dead on exit from the block,
and are in registers, change the register descriptors to indicate that they no longer
contain Y and/or Z.
In more detail:
1. Invoke a function getreg to determine the location L where the result of the
computation y op z should be stored. L will usually be a register, but it could also be a
memory location. We shall describe getreg shortly.
2. Consult the address descriptor for y to determine y', (one of) the current location(s) of
y. Prefer the register for y' if the value of y is currently both in memory and a register. If
the value of y is not already in L, generate the instruction MOV y', L to place a copy of y
in L.
3. Generate the instruction OP z', L, where z' is a current location of z. Again, prefer a
register to a memory location if z is in both. Update the address descriptor to indicate
that x is in location L. If L is a register, update its descriptor to indicate that it contains
the value of x, and remove x from all other register descriptors.
4. If the current values of y and/or z have no next uses, are not live on exit from the
block, and are in registers, alter the register descriptors to indicate that, after execution
of x := y op z, those registers will no longer contain y and/or z, respectively.
FUNCTION getreg:
1. If Y is in a register that holds no other values, and Y is not live and has no next use after
X = Y op Z, then return the register of Y for L.
2. Failing (1), return an empty register.
3. Failing (2), if X has a next use in the block, or op requires a register, then get an
occupied register R, store its content into memory location M (by MOV R, M) and use it.
4. Else select the memory location of X as L.
The function getreg returns the location L to hold the value of x for the assignment x := y op z.
1. If the name y is in a register that holds the value of no other names (recall that copy
instructions such as x := y could cause a register to hold the value of two or more
variables
simultaneously), and y is not live and has no next use after execution of x := y op z,
then return the register of y for L. Update the address descriptor of y to indicate that y is
no longer in L.
2. Failing (1), if there is an empty register, return it for L.
3. Failing (2), if x has a next use in the block, or op is an operator, such as indexing, that requires
a register, find an occupied register R. Store the value of R into memory location (by MOV R,
M) if it is not already in the proper memory location M, update the address descriptor M, and
return R. If R holds the value of several variables, a MOV instruction must be generated for each
variable that needs to be stored. A suitable occupied register might be one whose datum is
referenced furthest in the future, or one whose value is also in memory.
4. If x is not used in the block, or no suitable occupied register can be found, select the memory
location of x as L.
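The four cases can be sketched as follows; the descriptor representation (regs maps a register to the set of names it holds, addr maps a name to its known locations) and the liveness inputs are assumptions:

def getreg(x, y, regs, addr, live, next_use, emit):
    # 1. reuse y's register if it holds only y and y is dead after x := y op z
    for r, names in regs.items():
        if names == {y} and y not in live and y not in next_use:
            return r
    # 2. failing that, return an empty register
    for r, names in regs.items():
        if not names:
            return r
    # 3. failing that, if x is used later, spill some occupied register R
    if x in next_use:
        r = next(iter(regs))                 # a smarter pick: value used furthest ahead
        for name in regs[r]:
            emit('MOV %s, %s' % (r, name))   # store what R holds back to memory
            addr.setdefault(name, set()).add('memory')
        regs[r] = set()
        return r
    # 4. otherwise select the memory location of x itself as L
    return x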
Example: consider the statements
t1 = a - b
t2 = a - c
t3 = t1 + t2
d = t3 + t2
The code generation algorithm that we discussed would produce a code sequence for
these statements; alongside it, the register and address descriptors are updated as code
generation progresses.
DAG for Register allocation:
DAG (Directed Acyclic Graphs) are useful data structures for implementing
transformations on basic blocks. A DAG gives a picture of how the value computed by a
statement in a basic block is used in subsequent statements of the block. Constructing a
DAG from three-address statements is a good way of determining common sub-
expressions (expressions computed more than once) within a block, determining which
names are used inside the block but evaluated outside the block, and determining which
statements of the block could have their computed value used outside the block.
A DAG for a basic block is a directed acyclic graph with the following labels on nodes:
1. Leaves are labeled by unique identifiers, either variable names or constants. From the
operator applied to a name we determine whether the l-value or r-value of a name is needed;
most leaves represent r-values. The leaves represent initial values of names, and we subscript
them with 0 to avoid confusion with labels denoting "current" values of names as in (3) below.
2. Interior nodes are labeled by an operator symbol.
3. Nodes are also optionally given a sequence of identifiers for labels. The intention is
that interior nodes represent computed values, and the identifiers labeling a node are deemed
to have that value.
For example, the slide shows a three-address code. The corresponding DAG is shown.
We observe that each node of the DAG represents a formula in terms of the leaves, that
is, the values possessed by variables and constants upon entering the block. For
example, the node labeled t4 represents the formula
b[4 * i]
that is, the value of the word whose address is 4 * i bytes offset from address b, which is
the intended value of t4 (with 4-byte words and i = 3, for instance, this is the word at
address b + 12).
Three-address code:           Code reconstructed from the DAG:
S1 = 4 * i                    S1 = 4 * i
S2 = addr(A) - 4              S2 = addr(A) - 4
S3 = S2[S1]                   S3 = S2[S1]
S4 = 4 * i
S5 = addr(B) - 4              S5 = addr(B) - 4
S6 = S5[S4]                   S6 = S5[S1]
S7 = S3 * S6                  S7 = S3 * S6
S8 = prod + S7
prod = S8                     prod = prod + S7
S9 = I + 1
I = S9                        I = I + 1
In the right-hand column the common subexpression 4 * i is computed only once (the
node for S4 is identified with the node for S1), and the copies through S8 and S9 are
folded into direct assignments to prod and I.
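As an illustration of how such a DAG is built, here is a minimal Python sketch of DAG
construction with node sharing. The statement encoding and class interface are invented
for this sketch, and the optional identifier lists on nodes are omitted; it detects the
repeated 4 * i the same way the figure does.

class DAG:
    def __init__(self):
        self.nodes = {}    # (op, left id, right id) -> interior node id
        self.leaves = {}   # name -> node id of its initial-value leaf
        self.current = {}  # name -> node id of its current value
        self.defs = []     # node id -> defining key, for inspection

    def _new(self, key):
        self.defs.append(key)
        return len(self.defs) - 1

    def _leaf(self, name):
        if name not in self.leaves:
            self.leaves[name] = self._new(("leaf", name))
        return self.leaves[name]

    def node_of(self, name):
        # Current value of a name: a subscripted leaf until first assignment.
        return self.current.get(name, self._leaf(name))

    def assign(self, x, op, y, z):
        # Process x = y op z; reuse an existing node for a common subexpression.
        key = (op, self.node_of(y), self.node_of(z))
        if key not in self.nodes:
            self.nodes[key] = self._new(key)
        self.current[x] = self.nodes[key]

d = DAG()
d.assign("S1", "*", "4", "i")
d.assign("S4", "*", "4", "i")
print(d.current["S1"] == d.current["S4"])   # True: the node for 4 * i is shared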
We see how to generate code for a basic block from its DAG representation. The
advantage of doing so is that from a DAG we can more easily see how to rearrange the
order of the final computation sequence than we can starting from a linear sequence of
three-address statements or quadruples. If the DAG is a tree, we can generate code
that we can prove is optimal under such criteria as program length or the number
of temporaries used. The algorithm for optimal code generation from a tree is also
useful when the intermediate code is a parse tree.
Consider, for example, the basic block
t1 = a + b
t2 = c + d
t3 = e - t2
X = t1 - t3
and its DAG, given here.
Here, we briefly consider how the order in which computations are done can
affect the cost of the resulting object code. Consider the basic block above and its
corresponding DAG representation as shown in the slide.
Code from the original order:        Code after rearranging:
MOV a, R0                            MOV c, R0
ADD b, R0                            ADD d, R0
MOV c, R1                            MOV e, R1
ADD d, R1                            SUB R0, R1
MOV R0, t1                           MOV a, R0
MOV e, R0                            ADD b, R0
SUB R1, R0                           SUB R1, R0
MOV t1, R1                           MOV R0, X
SUB R0, R1
MOV R1, X
If we generate code for the three-address statements using the code generation
algorithm described before, we get the code sequence as shown (assuming two
registers R0 and R1 are available, and only X is live on exit). On the other hand,
suppose we rearrange the order of the statements so that the computation of t1
occurs immediately before that of X, as:
t2 = c + d
t3 = e - t2
t1 = a + b
X = t1 - t3
Then, using the code generation algorithm, we get the new code sequence as shown
(again only R0 and R1 are available). By performing the computation in this order,
we have been able to save two instructions: MOV R0, t1 (which stores the value of R0
in memory location t1) and MOV t1, R1 (which reloads the value of t1 into register
R1).
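The better order can be found mechanically from the DAG. Below is a minimal Python
sketch of the heuristic node-listing idea: list a node once all its uses are listed, then
immediately chase its leftmost operand, and evaluate in the reverse of the listing. The
DAG encoding is invented for the sketch, which illustrates the idea rather than
reproducing the textbook algorithm verbatim.

def node_listing(children, interior):
    # children: node -> tuple of operand nodes (left first); interior: op nodes.
    parents = {}
    for n, kids in children.items():
        for k in kids:
            parents.setdefault(k, set()).add(n)
    listed, order = set(), []

    def ready(n):
        return n not in listed and parents.get(n, set()) <= listed

    while len(listed) < len(interior):
        n = next(m for m in interior if ready(m))   # all uses already listed
        while True:
            listed.add(n)
            order.append(n)
            left = children[n][0]                   # chase the leftmost operand
            if left in interior and ready(left):
                n = left
            else:
                break
    return order[::-1]    # evaluation order is the reverse of the listing

children = {"X": ("t1", "t3"), "t1": ("a", "b"),
            "t3": ("e", "t2"), "t2": ("c", "d"),
            "a": (), "b": (), "c": (), "d": (), "e": ()}
interior = {"X", "t1", "t3", "t2"}
print(node_listing(children, interior))   # ['t2', 't3', 't1', 'X']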