CTCD Unit 4
1. Input to the code generator:
Prior to code generation, the source program must be scanned, parsed and translated by the front end into
an intermediate representation, along with the necessary type checking. Therefore, the input to code
generation is assumed to be error-free.
2. Target program:
The output of the code generator is the target program. The output may be:
a. Absolute machine language
- It can be placed in a fixed memory location and can be executed immediately.
b. Relocatable machine language
- It allows subprograms to be compiled separately.
c. Assembly language
- Code generation is made easier.
3. Memory management:
Names in the source program are mapped to addresses of data objects in run-time
memory by the front end and the code generator working together.
This mapping makes use of the symbol table: a name in a three-address statement refers to the
symbol-table entry for that name.
4. Instruction selection:
The instructions of target machine should be complete and uniform.
Instruction speeds and machine idioms are important factors when efficiency of target
program is considered.
The quality of the generated code is determined by its speed and size.
The former statement can be translated into the latter statement as shown below:
5. Register allocation
Instructions involving register operands are shorter and faster than those involving
operands in memory.
Certain machines require even-odd register pairs for some operands and results.
For example, consider the division instruction of the form:
D x, y
6. Evaluation order
The order in which the computations are performed can affect the efficiency of the
target code. Some computation orders require fewer registers to hold intermediate
results than others.
TARGET MACHINE
Familiarity with the target machine and its instruction set is a prerequisite for designing a
good code generator.
The target computer is a byte-addressable machine with 4 bytes to a word.
It has n general-purpose registers, R0, R1, . . . , Rn-1 .
It has two-address instructions of the form:
op source, destination
where op is an op-code, and source and destination are data fields.
Mode        Form    Address    Added cost
Absolute    M       M          1
Register    R       R          0
Literal     #c      c          1
For example: MOV R0, M stores the contents of register R0 into memory location M;
MOV 4(R0), M stores the value contents(4 + contents(R0)) into M.
Instruction costs :
Instruction cost = 1+cost for source and destination address modes. This cost corresponds
to the length of the instruction.
Address modes involving registers have cost zero.
Address modes involving memory location or literal have cost one.
Instruction length should be minimized if space is important. Doing so also minimizes the
time taken to fetch and perform the instruction.
In order to generate good code for target machine, we must utilize its addressing
capabilities efficiently.
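The cost rule above can be checked mechanically. The following Python sketch is only an illustration: the function names and the operand syntax are assumptions, and it covers just the three modes listed in the table (register, absolute memory, literal). An operand written as R followed by digits is treated as a register; anything else is charged an added cost of one.

```python
import re

# Assumed operand syntax: "R0", "R1", ... is a register; any other operand
# (a memory name such as M, or a literal such as #1) adds one to the cost.
REGISTER = re.compile(r"^R\d+$")

def operand_cost(operand):
    """Added cost of one operand: 0 for a register, 1 otherwise."""
    return 0 if REGISTER.match(operand.strip()) else 1

def instruction_cost(instruction):
    """Cost = 1 + added cost of the source + added cost of the destination."""
    _, operands = instruction.split(None, 1)      # drop the op-code
    source, destination = operands.split(",")
    return 1 + operand_cost(source) + operand_cost(destination)

for text in ["MOV R0, R1", "MOV R0, M", "MOV #1, R0", "MOV a, b"]:
    print(text, "->", instruction_cost(text))
```

Under these assumptions the register-to-register move costs 1, the moves with one memory or literal operand cost 2, and the memory-to-memory move costs 3, which is why keeping operands in registers shortens the generated code.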
Problems
1. Generate code for the following three-address statements assuming all variables are stored in memory
locations.
1. x = 1
2. x = a
3. x = a + 1
4. x = a + b
5. The two statements
o x = b * c
o y = a + x
answer
1. LD R1, #1
ST x, R1
2. LD R1, a
ST x, R1
3. LD R1, a
ADD R1, R1, #1
ST x, R1
4. LD R1, a
LD R2, b
ADD R1, R1, R2
ST x, R1
5. LD R1, b
LD R2, c
MUL R1, R1, R2
LD R3, a
ADD R3, R3, R1
ST y, R3
2. Generate code for the following three-address statements assuming a and b are arrays whose elements are 4-byte
values.
1. The four-statement sequence
i. x = a[i]
ii. y = b[j]
iii. a[i] = y
iv. b[j] = x
2. The three-statement sequence
i. x = a[i]
ii. y = b[i]
iii. z=x*y
3. The three-statement sequence
i. x = a[i]
ii. y = b[x]
iii. a[i] = y
answer
1. LD R1, i
MUL R1, R1, #4
LD R2, a(R1)
LD R3, j
MUL R3, R3, #4
LD R4, b(R3)
ST a(R1), R4
ST b(R3), R2
2. LD R1, i
MUL R1, R1, #4
LD R2, a(R1)
LD R1, b(R1)
MUL R1, R2, R1
ST z, R1
3. LD R1, i
MUL R1, R1, #4
LD R2, a(R1)
MUL R2, R2, #4
LD R2, b(R2)
ST a(R1), R2
3. Generate code for the following three-address sequence assuming that p and q are in memory locations:
y = *q
q=q+4
*p = y
p=p+4
answer
LD R1, q
LD R2, 0(R1)
ADD R1, R1, #4
ST q, R1
LD R1, p
ST 0(R1), R2
ADD R1, R1, #4
ST p, R1
4. Generate code for the following sequence assuming that x, y, and z are in memory locations:
if x < y goto L1
z=0
goto L2
L1: z = 1
answer
LD R1, x
LD R2, y
SUB R1, R1, R2
BLTZ R1, L1
LD R1, #0
ST z, R1
BR L2
L1: LD R1, #1
ST z, R1
5. Generate code for the following sequence assuming that n is in a memory location:
s=0
i=0
L1: if i > n goto L2
s=s+i
i=i+1
goto L1
L2:
answer
Long version:
LD R1, #0
ST s, R1
ST i, R1
L1: LD R1, i
LD R2, n
SUB R2, R1, R2
BGTZ R2, L2
LD R2, s
ADD R2, R2, R1
ST s, R2
ADD R1, R1, #1
ST i, R1
BR L1
L2:
Short version:
LD R2, #0
LD R1, R2
LD R3, n
L1: SUB R4, R1, R3
BGTZ R4, L2
ADD R2, R2, R1
ADD R1, R1, #1
BR L1
L2:
2. LD R0, i
MUL R0, R0, 8
LD R1, a(R0)
ST b, R1
3. LD R0, c
LD R1, i
MUL R1, R1, 8
ST a(R1),R0
4. LD R0, p
LD R1, 0(R0)
ST x, R1
5. LD R0, p
LD R1, x
ST 0(R0), R1
6. LD R0, x
LD R1, y
SUB R0, R0, R1
BLTZ *R3, R0
answer
1. 2 + 2 + 1 + 2 = 7
2. 2 + 2 + 2 + 2 = 8
3. 2 + 2 + 2 + 2 = 8
4. 2 + 2 + 2 = 6
5. 2 + 2 + 2 = 6
6. 2 + 2 + 1 + 1 = 6
Basic Blocks
Output: A list of basic blocks with each three-address statement in exactly one block
Method:
1. We first determine the set of leaders, the first statements of basic blocks. The
rules we use are the following:
a. The first statement is a leader.
b. Any statement that is the target of a conditional or unconditional goto is a
leader.
c. Any statement that immediately follows a goto or conditional goto statement
is a leader.
2. For each leader, its basic block consists of the leader and all statements up to but not
including the next leader or the end of the program.
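The two steps of this method can be sketched in Python. This is a minimal illustration rather than a definitive implementation: it assumes statements are plain strings numbered from 1, and that a jump target is written as "goto (n)" with n the statement number. The function names are hypothetical.

```python
import re

# Assumed encoding of jumps: "goto (n)" or "if ... goto (n)".
GOTO = re.compile(r"goto \((\d+)\)")

def find_leaders(statements):
    """Return the sorted 1-based numbers of the leader statements."""
    leaders = {1} if statements else set()            # rule (a)
    for number, text in enumerate(statements, start=1):
        match = GOTO.search(text)
        if match:
            leaders.add(int(match.group(1)))          # rule (b): jump target
            if number < len(statements):
                leaders.add(number + 1)               # rule (c): follows a jump
    return sorted(leaders)

def basic_blocks(statements):
    """Each block runs from a leader up to, but not including, the next leader."""
    bounds = find_leaders(statements) + [len(statements) + 1]
    return [statements[start - 1:end - 1]
            for start, end in zip(bounds, bounds[1:])]

program = [
    "prod := 0", "i := 1", "t1 := 4 * i", "t2 := a[t1]", "t3 := 4 * i",
    "t4 := b[t3]", "t5 := t2 * t4", "t6 := prod + t5", "prod := t6",
    "t7 := i + 1", "i := t7", "if i <= 20 goto (3)",
]
print(find_leaders(program))                          # [1, 3]
print([len(block) for block in basic_blocks(program)])  # [2, 10]
```

For the twelve-statement dot product program used here, statements 1 and 3 are the leaders, so the first block holds two statements and the second holds the remaining ten.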
Consider the following source code for the dot product of two vectors a and b of length 20:
begin
prod := 0;
i := 1;
do begin
prod := prod + a[i] * b[i];
i := i + 1;
end
while i <= 20
end
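The corresponding three-address code is:
(1) prod := 0
(2) i := 1
(3) t1 := 4 * i
(4) t2 := a[t1]
(5) t3 := 4 * i
(6) t4 := b[t3]
(7) t5 := t2 * t4
(8) t6 := prod + t5
(9) prod := t6
(10) t7 := i + 1
(11) i := t7
(12) if i <= 20 goto (3)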
Flow graph is a directed graph containing the flow-of-control information for the set of
basic blocks making up a program.
The nodes of the flow graph are basic blocks. It has a distinguished initial node.
E.g.: Flow graph for the vector dot product is given as follows:
B1:  prod := 0
     i := 1

B2:  t1 := 4 * i
     t2 := a[t1]
     t3 := 4 * i
     t4 := b[t3]
     t5 := t2 * t4
     t6 := prod + t5
     prod := t6
     t7 := i + 1
     i := t7
     if i <= 20 goto B2
B1 is the initial node. B2 immediately follows B1, so there is an edge from B1 to B2. The target
of the jump in the last statement of B2 is the first statement of B2, so there is also an edge
from B2 back to itself.
B1 is the predecessor of B2, and B2 is a successor of B1.
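The edges of a flow graph can be derived from the last statement of each block. The sketch below is an assumption-laden illustration: blocks are held in a dictionary in program order, jumps name their target block as "goto Bk", and the function name flow_edges is hypothetical.

```python
import re

# Assumed encoding: the last statement of a block may end in "goto <block>".
GOTO = re.compile(r"goto (\w+)$")

def flow_edges(blocks):
    """blocks: dict mapping block name -> statements, in program order.
    Returns the sorted list of (from, to) edges."""
    names = list(blocks)
    edges = set()
    for position, name in enumerate(names):
        last = blocks[name][-1]
        match = GOTO.search(last)
        if match:
            edges.add((name, match.group(1)))        # edge to the jump target
        unconditional = last.startswith("goto ")
        if not unconditional and position + 1 < len(names):
            edges.add((name, names[position + 1]))   # fall-through edge
    return sorted(edges)

blocks = {
    "B1": ["prod := 0", "i := 1"],
    "B2": ["t1 := 4 * i", "t2 := a[t1]", "t3 := 4 * i", "t4 := b[t3]",
           "t5 := t2 * t4", "t6 := prod + t5", "prod := t6",
           "t7 := i + 1", "i := t7", "if i <= 20 goto B2"],
}
print(flow_edges(blocks))   # [('B1', 'B2'), ('B2', 'B2')]
```

The result has the fall-through edge from B1 to B2 and the loop edge from B2 to itself, matching the dot product flow graph.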
Loops
NEXT-USE INFORMATION
If the name in a register is no longer needed, then we remove the name from the register
and the register can be used to store some other names.
Input: Basic block B of three-address statements
Symbol Table:
Name    Liveness    Next use
y       Live        i
z       Live        i
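Next-use information is usually computed by scanning the basic block backwards and updating a table entry per name. The Python sketch below illustrates that backward scan under stated assumptions: each statement x := y op z is given as a triple (x, y, z), statements are numbered from 1, and a name is live on exit only if it is listed in live_on_exit. The function name and data layout are hypothetical.

```python
def next_use_info(block, live_on_exit=()):
    """block: list of (x, y, z) triples for statements x := y op z.
    Returns one dict per statement mapping each of its names to
    (live, next_use), describing the name after that statement."""
    table = {}
    for x, y, z in block:
        for name in (x, y, z):
            table[name] = (name in live_on_exit, None)
    result = [None] * len(block)
    for index in range(len(block) - 1, -1, -1):      # scan backwards
        x, y, z = block[index]
        # Attach the information currently in the table to this statement.
        result[index] = {name: table[name] for name in (x, y, z)}
        table[x] = (False, None)          # x is overwritten here
        table[y] = (True, index + 1)      # y and z are used by this statement
        table[z] = (True, index + 1)
    return result

# t := a - b; u := a - c; v := t + u; d := v + u, with d live at the end.
block = [("t", "a", "b"), ("u", "a", "c"), ("v", "t", "u"), ("d", "v", "u")]
info = next_use_info(block, live_on_exit={"d"})
print(info[0])   # {'t': (True, 3), 'a': (True, 2), 'b': (False, None)}
```

After statement 3, for instance, t has no next use and is not live, so a register holding t can be reused from that point on.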
A code generator generates target code for a sequence of three- address statements and
effectively uses registers to store operands of the statements.
(or)
(or)
ADD Rj, Ri
A register descriptor is used to keep track of what is currently in each register. The
register descriptors show that initially all the registers are empty.
An address descriptor stores the location where the current value of the name can be
found at run time.
A code-generation algorithm:
The algorithm takes as input a sequence of three-address statements constituting a basic block.
For each three-address statement of the form x := y op z, perform the following actions:
1. Invoke a function getreg to determine the location L where the result of the computation y op
z should be stored.
2. Consult the address descriptor for y to determine y’, the current location of y. Prefer the
register for y’ if the value of y is currently both in memory and a register. If the value of y is
not already in L, generate the instruction MOV y’, L to place a copy of y in L.
4. If the current values of y or z have no next uses, are not live on exit from the block, and are in
registers, alter the register descriptor to indicate that, after execution of x : = y op z , those
registers will no longer contain y or z.
The assignment d : = (a-b) + (a-c) + (a-c) might be translated into the following three-
address code sequence:
t := a - b
u := a - c
v := t + u
d := v + u
with d live at the end.
Register empty
The table shows the code sequences generated for the indexed assignment statements
a := b[i] and a[i] := b
The table shows the code sequences generated for the pointer assignments
a := *p and *p := a
Statement    Code          Cost
a := *p      MOV *Rp, a    2
*p := a      MOV a, *Rp    2
Statement            Code
x := y + z           MOV y, R0
if x < 0 goto z      ADD z, R0
                     MOV R0, x
                     CJ< z
RUN-TIME ENVIRONMENTS
Parameter Passing
The communication medium among procedures is known as parameter passing. The values of the
variables from a calling procedure are transferred to the called procedure by some mechanism. Before
moving ahead, first go through some basic terminologies pertaining to the values in a program.
r-value
The value of an expression is called its r-value. The value contained in a single variable also becomes
an r-value if it appears on the right-hand side of the assignment operator. r-values can always be
assigned to some other variable.
l-value
The location of memory (address) where an expression is stored is known as the l-value of that
expression. It always appears at the left hand side of an assignment operator.
For example:
day = 1;
week = day * 7;
month = 1;
year = month * 12;
From this example, we understand that constant values like 1, 7, 12, and variables like day, week, month
and year, all have r-values. Only variables have l-values as they also represent the memory location
assigned to them.
For example:
7 = x + y;
is an l-value error, as the constant 7 does not represent any memory location.
Formal Parameters
Variables that take the information passed by the caller procedure are called formal parameters. These
variables are declared in the definition of the called function.
Actual Parameters
Variables whose values or addresses are being passed to the called procedure are called actual
parameters. These variables are specified in the function call as arguments.
Example:
fun_one()
{
int actual_parameter = 10;
call fun_two(actual_parameter);
}
fun_two(int formal_parameter)
{
print formal_parameter;
}
Formal parameters hold the information of the actual parameter, depending upon the parameter passing
technique used. It may be a value or an address.
Pass by Value
In pass by value mechanism, the calling procedure passes the r-value of actual parameters and the
compiler puts that into the called procedure’s activation record. Formal parameters then hold the values
passed by the calling procedure. If the values held by the formal parameters are changed, it should have
no impact on the actual parameters.
Pass by Reference
In pass by reference mechanism, the l-value of the actual parameter is copied to the activation record
of the called procedure. This way, the called procedure now has the address (memory location) of the
actual parameter and the formal parameter refers to the same memory location. Therefore, if the value
referred to by the formal parameter is changed, the change is also visible through the actual parameter,
because both name the same memory location.
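The difference between the two mechanisms can be illustrated with a toy model of memory. In this Python sketch, the dictionary memory and both function names are assumptions made for illustration: passing memory["a"] hands over the r-value, while passing the key "a" stands in for handing over the l-value (address).

```python
memory = {}          # toy run-time memory: address (name) -> value

def increment_by_value(value):
    """The callee gets a copy of the r-value; the caller's cell is untouched."""
    value = value + 1
    return value

def increment_by_reference(address):
    """The callee gets the l-value (address) and writes through it."""
    memory[address] = memory[address] + 1

memory["a"] = 10
increment_by_value(memory["a"])     # works on a copy of 10
print(memory["a"])                  # 10: the actual parameter is unchanged
increment_by_reference("a")         # works on the cell itself
print(memory["a"])                  # 11: the change is visible to the caller
```

Only the second call changes the caller's variable, because only there does the callee hold the location rather than a copy of its contents.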
Pass by Copy-restore
This parameter passing mechanism works similarly to pass by reference, except that the changes to
actual parameters are made only when the called procedure ends. Upon function call, the values of actual
parameters are copied into the activation record of the called procedure. Manipulating the formal
parameters has no immediate effect on the actual parameters (the callee works on its own copies), but when
the called procedure ends, the final values of the formal parameters are copied back into the l-values of
the actual parameters.
Example:
int y;
calling_procedure()
{
y = 10;
copy_restore(y); //l-value of y is passed
printf y; //prints 99
}
copy_restore(int x)
{
x = 99; // y still has value 10 (unaffected)
y = 0; // y is now 0
}
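When this function ends, the final value of the formal parameter x (99) is copied back into the l-value of
the actual parameter y. Even though y is set to 0 before the procedure ends, the copy-back overwrites it,
so 99 is printed. Under true pass by reference, x and y would name the same location and the program would
print 0, so copy-restore resembles pass by reference but can differ when the actual parameter is also
accessed directly.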
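The copy_restore example can be simulated to compare it with pass by reference. This Python sketch is an illustration under stated assumptions: memory is modelled as a dictionary, the formal parameter x is located by a (store, key) pair, and all function names are hypothetical.

```python
def body(x_cell, memory):
    """Body of copy_restore; x_cell is a (store, key) pair locating x."""
    store, key = x_cell
    store[key] = 99          # x = 99
    memory["y"] = 0          # y = 0

def call_copy_restore():
    memory = {"y": 10}
    record = {"x": memory["y"]}      # copy in: the value of y
    body((record, "x"), memory)      # the callee works on its own copy
    memory["y"] = record["x"]        # copy out: final x goes to y's location
    return memory["y"]

def call_by_reference():
    memory = {"y": 10}
    body((memory, "y"), memory)      # x is an alias for y
    return memory["y"]

print(call_copy_restore())   # 99
print(call_by_reference())   # 0
```

With copy-restore the final value of x overwrites the assignment y = 0, whereas with pass by reference the later assignment to y wins because x and y are the same cell.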
Pass by Name
Languages like Algol provide a parameter passing mechanism that works like macro expansion by the
preprocessor in C. In pass by name mechanism, the call to the procedure is replaced by the procedure's
body. Pass-by-name textually substitutes the argument expressions in a procedure call for the
corresponding parameters in the body of the procedure, so that the body works directly on the actual
parameters, much like pass-by-reference.
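Pass by name is commonly implemented with thunks, small parameterless routines that re-evaluate the argument expression each time the formal parameter is used. The Python sketch below imitates this with lambdas; the names sum_by_name, env and set_i are hypothetical and chosen only for this illustration.

```python
def sum_by_name(set_i, low, high, term):
    """Sum term() for i = low()..high(); every argument is a thunk that is
    re-evaluated on each use, as textual substitution would require."""
    total = 0
    k = low()
    while k <= high():
        set_i(k)            # assign to the by-name parameter i
        total += term()     # the argument expression now sees the new i
        k += 1
    return total

env = {"i": 0}              # the caller's variable i
a = [1, 2, 3, 4]

def set_i(value):
    env["i"] = value

# Behaves as if the expression a[i] * a[i] were substituted into the body.
result = sum_by_name(set_i,
                     lambda: 0,
                     lambda: len(a) - 1,
                     lambda: a[env["i"]] * a[env["i"]])
print(result)   # 30
```

Because term is evaluated afresh on every iteration, one call computes the sum of squares 1 + 4 + 9 + 16; with pass by value the expression would have been evaluated only once, before the call.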
Symbol Table
Symbol table is an important data structure created and maintained by compilers in order to store
information about the occurrence of various entities such as variable names, function names, objects,
classes, interfaces, etc. Symbol table is used by both the analysis and the synthesis parts of a compiler.
A symbol table may serve the following purposes depending upon the language in hand:
A symbol table is simply a table which can be either linear or a hash table. It maintains an entry for
each name in the following format:
Implementation
If a compiler is to handle a small amount of data, then the symbol table can be implemented as an
unordered list, which is easy to code but suitable only for small tables. A symbol table can be
implemented in one of the following ways:
Among all, symbol tables are mostly implemented as hash tables, where the source code symbol itself
is treated as a key for the hash function and the return value is the information about the symbol.
Operations
A symbol table, either linear or hash, should provide the following operations.
insert()
This operation is more frequently used by analysis phase, i.e., the first half of the compiler where tokens
are identified and names are stored in the table. This operation is used to add information in the symbol
table about unique names occurring in the source code. The format or structure in which the names are
stored depends upon the compiler in hand.
An attribute for a symbol in the source code is the information associated with that symbol. This
information contains the value, state, scope, and type of the symbol. The insert() function takes the
symbol and its attributes as arguments and stores the information in the symbol table.
For example:
int a;
insert(a, int);
lookup()
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table. If the symbol exists in the
symbol table, it returns its attributes stored in the table.
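The two operations can be sketched with a hash table. In this Python illustration the class name SymbolTable and the keyword-argument form of the attributes are assumptions; a Python dict serves as the hash table, and lookup() returns 0 for a missing symbol as described above.

```python
class SymbolTable:
    """A minimal hash-based symbol table keyed by the symbol's name."""

    def __init__(self):
        self._entries = {}          # dict: symbol name -> attributes

    def insert(self, symbol, **attributes):
        """Store the symbol together with its attributes."""
        self._entries[symbol] = dict(attributes)

    def lookup(self, symbol):
        """Return the stored attributes, or 0 if the symbol is absent."""
        return self._entries.get(symbol, 0)

table = SymbolTable()
table.insert("a", type="int")       # corresponds to insert(a, int)
print(table.lookup("a"))            # {'type': 'int'}
print(table.lookup("b"))            # 0
```

Using the name itself as the key gives constant expected time for both operations, which is the usual reason for preferring a hash table over an unordered list.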
Scope Management
A compiler maintains two types of symbol tables: a global symbol table which can be accessed by all
the procedures and scope symbol tables that are created for each scope in the program.
To determine the scope of a name, symbol tables are arranged in hierarchical structure as shown in the
example below:
...
int value=10;
void pro_one()
{
int one_1;
int one_2;
{ \
int one_3; |_ inner scope 1
int one_4; |
} /
int one_5;
{ \
int one_6; |_ inner scope 2
int one_7; |
} /
}
void pro_two()
{
int two_1;
int two_2;
{ \
int two_3; |_ inner scope 3
int two_4; |
} /
int two_5;
}
...
The above program can be represented in a hierarchical structure of symbol tables:
The global symbol table contains names for one global variable (int value) and two procedure names,
which should be available to all the child nodes shown above. The names mentioned in the pro_one
symbol table (and all its child tables) are not available for pro_two symbols and its child tables.
This symbol table data structure hierarchy is stored in the semantic analyzer and whenever a name needs
to be searched in a symbol table, it is searched using the following algorithm:
- First, the symbol is searched for in the current scope, i.e. the current symbol table.
- If the name is found, the search is complete; otherwise, the search continues in the parent symbol table.
- This repeats until either the name is found or the global symbol table has been searched.
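The search through the hierarchy can be sketched as a chain of tables linked to their parents. This Python illustration assumes a hypothetical Scope class in which each table keeps a reference to its parent; only a few of the names from the example program are inserted.

```python
class Scope:
    """One symbol table in the hierarchy, linked to its parent table."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.symbols = {}

    def insert(self, symbol, type_name):
        self.symbols[symbol] = type_name

    def lookup(self, symbol):
        """Search this table, then each parent up to the global table."""
        scope = self
        while scope is not None:
            if symbol in scope.symbols:
                return scope.name, scope.symbols[symbol]
            scope = scope.parent
        return None                  # not found even in the global table

global_scope = Scope("global")
global_scope.insert("value", "int")
pro_one = Scope("pro_one", global_scope)
pro_one.insert("one_1", "int")
inner_1 = Scope("inner scope 1", pro_one)
inner_1.insert("one_3", "int")
pro_two = Scope("pro_two", global_scope)
pro_two.insert("two_1", "int")

print(inner_1.lookup("one_1"))   # ('pro_one', 'int')
print(inner_1.lookup("value"))   # ('global', 'int')
print(pro_two.lookup("one_1"))   # None
```

The last lookup fails because pro_two never searches its sibling pro_one, only its own table and the global one.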