Intermediate Representation and Symbol Table

The document discusses intermediate representations (IRs) used in compilers. It notes that compilers typically use 2-3 IRs including a high-level IR (HIR) that preserves structure, a mid-level IR (MIR) for optimizations and code generation, and a low-level IR (LIR) similar to machine code. While there is no standard IR, they allow expressing programs in a form machines can understand while enabling analysis and transformations. IR design involves balancing various issues around languages, optimizations, and code generation.

Uploaded by

SANDJITH
Copyright
© Attribution Non-Commercial (BY-NC)

Intermediate Representation

Design
• More wizardry than science

• each compiler uses 2-3 IRs

• HIR (high level IR) preserves loop structure and array bounds

• MIR (medium level IR) reflects the range of features in a set of source languages
– language independent
– good for code generation for one or more architectures
– appropriate for most optimizations

• LIR (low level IR) is low level and close to the target machine

• Compiler writers have tried to define universal IRs and have failed (UNCOL in 1958)

• There is no standard intermediate representation. An IR is a step in expressing a source program in a form the machine understands

• As the translation takes place, the IR is repeatedly analyzed and transformed

• Compiler users want analysis and translation to be fast and correct

• Compiler writers want optimizations to be simple to write, easy to understand and easy to extend

• An IR should be simple and lightweight while allowing easy expression of optimizations and transformations

Issues in IR Design
• source language and target language

• porting cost or reuse of existing design

• whether appropriate for optimizations

• U-code is the IR used on PA-RISC and MIPS. It is suitable for expression evaluation on stacks but less suited to load-store architectures

• both compilers translate U-code to another form
– HP translates it to a very low level representation
– MIPS translates it to MIR and then back to U-code for the code generator

Issues in new IR Design
• how machine dependent the IR should be

• expressiveness: how many languages are covered

• appropriateness for code optimization

• appropriateness for code generation

• Use more than one IR (as in the PA-RISC compilers)

Front end → ucode → SLLIC → Optimizer

– ucode: used by the HP3000 compilers, as these were stack machines
– SLLIC: Spectrum Low Level Intermediate Code

Issues in new IR Design …
• Use more than one IR for more than one optimization

• represent subscripts by a list of subscripts: suitable for dependence analysis

• make addresses explicit in linearized form:
– suitable for constant folding, strength reduction, loop invariant code motion and other basic optimizations

float a[20][10];
use a[i][j+2]

HIR                MIR              LIR
t1 ← a[i,j+2]      t1 ← j+2         r1 ← [fp-4]
                   t2 ← i*20        r2 ← r1+2
                   t3 ← t1+t2       r3 ← [fp-8]
                   t4 ← 4*t3        r4 ← r3*20
                   t5 ← addr a      r5 ← r4+r2
                   t6 ← t4+t5       r6 ← 4*r5
                   t7 ← *t6         r7 ← fp-216
                                    f1 ← [r7+r6]

High level IR
int f(int a, int b) {
int c;
c = a + 2;
print(b, c);
}

• Abstract syntax tree
– keeps enough information to reconstruct the source form
– keeps information about the symbol table

function
  ident f
  paramlist
    ident a
    paramlist
      ident b
      end
  body
    declist
      ident c
      end
    stmtlist
      =
        ident c
        +
          ident a
          const 2
      stmtlist
        call
          ident print
          arglist
            ident b
            arglist
              ident c
              end
        end

Identifiers are actually pointers to the symbol table entries.

• Medium level IR
– reflects range of features in a set of source languages
– language independent
– good for code generation for a number of architectures
– appropriate for most of the optimizations
– normally three address code

• Low level IR
– corresponds one to one to target machine instructions
– architecture dependent

• Multi-level IR
– has features of MIR and LIR
– may also have some features of HIR

Abstract Syntax Tree/DAG
• Condensed form of a parse tree

• useful for representing language constructs

• Depicts the natural hierarchical structure of the source program
– each internal node represents an operator
– the children of a node represent its operands
– leaf nodes represent operands (names and constants)

• A DAG is more compact than an abstract syntax tree because each common subexpression is represented only once

a := b * -c + b * -c

Abstract syntax tree:

assign
  a
  +
    *
      b
      uminus
        c
    *
      b
      uminus
        c

Directed Acyclic Graph (both operands of + are the same * node):

assign
  a
  +
    *   (shared by both operands)
      b
      uminus
        c

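One way to obtain the DAG from the tree is to look each node up in a table before creating it, so that identical subexpressions map to the same node. The sketch below illustrates this under some assumptions: the tuple encoding of the tree and the names `DagBuilder`, `build` and `tree_size` are hypothetical and not part of the slides.

```python
class DagBuilder:
    """Creates each distinct (operator, operands) combination only once."""

    def __init__(self):
        self.nodes = []   # node id -> (op, operands)
        self.index = {}   # (op, operands) -> node id

    def node(self, op, *operands):
        key = (op, operands)
        if key not in self.index:          # first time this subexpression is seen
            self.index[key] = len(self.nodes)
            self.nodes.append(key)
        return self.index[key]             # otherwise reuse the existing node


def build(builder, tree):
    """Walk a tuple-encoded syntax tree bottom-up and return the DAG node id."""
    op, *args = tree
    if op == "id":                         # leaf: ("id", name)
        return builder.node("id", args[0])
    return builder.node(op, *(build(builder, a) for a in args))


def tree_size(tree):
    """Number of nodes in the plain syntax tree, for comparison."""
    op, *args = tree
    if op == "id":
        return 1
    return 1 + sum(tree_size(a) for a in args)


# a := b * -c + b * -c
product = ("*", ("id", "b"), ("uminus", ("id", "c")))
stmt = ("assign", ("id", "a"), ("+", product, product))

dag = DagBuilder()
root = build(dag, stmt)
print(tree_size(stmt), "tree nodes,", len(dag.nodes), "DAG nodes")
```

For this statement the tree has 11 nodes while the DAG has 7, because b, c, uminus and * are each created only once.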
Postfix notation
• Linearized representation of a syntax tree

• List of the nodes of the tree

• Each node appears immediately after its children

• The postfix notation for an expression E is defined as follows:
– If E is a variable or constant then the postfix notation is E itself
– If E is an expression of the form E1 op E2, where op is a binary operator, then the postfix notation for E is E1' E2' op, where E1' and E2' are the postfix notations for E1 and E2 respectively
– If E is an expression of the form (E1) then the postfix notation for E1 is also the postfix notation for E

Postfix notation …
• No parentheses are needed in postfix notation because
– the position and arity of the operators permit only one decoding of a postfix expression

• Postfix notation for
a = b * -c + b * -c
is
a b c - * b c - * + =
(here - denotes unary minus)

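The recursive definition of postfix notation translates almost directly into code. The following is a minimal sketch, assuming a tuple encoding of the tree in which strings are leaves; the names `postfix`, `evaluate` and `OPS` are made up for illustration, and unary minus is spelled `uminus` to keep it distinct from binary minus.

```python
import operator


def postfix(tree):
    """Return the postfix token list: operands first, then the operator."""
    if isinstance(tree, str):              # variable or constant
        return [tree]
    op, *operands = tree
    tokens = []
    for operand in operands:
        tokens.extend(postfix(operand))
    tokens.append(op)
    return tokens


# Arity and meaning of each operator (assumed set, enough for the example).
OPS = {"uminus": (1, operator.neg), "*": (2, operator.mul), "+": (2, operator.add)}


def evaluate(tokens, env):
    """Decode a postfix expression with a stack; arity fixes the grouping."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            arity, fn = OPS[tok]
            args = stack[-arity:]          # the operands are the top `arity` entries
            del stack[-arity:]
            stack.append(fn(*args))
        else:
            stack.append(env[tok])
    (result,) = stack
    return result


neg_c = ("uminus", "c")
rhs = ("+", ("*", "b", neg_c), ("*", "b", neg_c))
stmt = ("=", "a", rhs)

print(" ".join(postfix(stmt)))             # a b c uminus * b c uminus * + =
print(evaluate(postfix(rhs), {"b": 3, "c": 2}))   # -12
```

The evaluator never needs parentheses: knowing how many operands each operator takes is enough to group the tokens.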
Three address code
• It is a sequence of statements of the general form X := Y op Z where
– X, Y and Z are names, constants or compiler generated temporaries
– op stands for any operator, such as a fixed- or floating-point arithmetic operator or a logical operator

Three address code …
• Only one operator on the right hand side is allowed

• A source expression like x + y * z might be translated into

t1 := y * z
t2 := x + t1

where t1 and t2 are compiler generated temporary names

• Unraveling of complicated arithmetic expressions and of control flow makes 3-address code desirable for code generation and optimization

• The use of names for intermediate values allows 3-address code to be easily rearranged

• Three address code is a linearized representation of a syntax tree where explicit names correspond to the interior nodes of the tree

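The correspondence between interior nodes and temporaries can be sketched with a post-order walk that emits one statement per interior node. This is an illustrative sketch only; the function name `gen_tac` and the tuple encoding of the tree are assumptions, not something the slides define.

```python
import itertools


def gen_tac(tree):
    """Return (statements, result_name) for a tuple-encoded expression tree."""
    code = []
    counter = itertools.count(1)           # numbering for temporaries t1, t2, ...

    def walk(node):
        if isinstance(node, str):          # leaf: a name or constant needs no code
            return node
        op, *operands = node
        names = [walk(o) for o in operands]    # children first
        temp = f"t{next(counter)}"             # one temporary per interior node
        if len(names) == 1:
            code.append(f"{temp} := {op} {names[0]}")
        else:
            code.append(f"{temp} := {names[0]} {op} {names[1]}")
        return temp

    result = walk(tree)
    return code, result


code, result = gen_tac(("+", "x", ("*", "y", "z")))
print("\n".join(code))
```

For x + y * z this prints the two statements shown on the slide, `t1 := y * z` followed by `t2 := x + t1`.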
Three address instructions
• Assignment
– x = y op z
– x = op y
– x = y

• Jump
– goto L
– if x relop y goto L

• Indexed assignment
– x = y[i]
– x[i] = y

• Function
– param x
– call p,n
– return y

• Pointer
– x = &y
– x = *y
– *x = y

Other representations
• SSA: Static Single Assignment
• RTL: Register transfer language
• Stack machines: P-code
• CFG: Control Flow Graph
• Dominator Trees
• DJ-graph: dominator tree augmented with join edges
• PDG: Program Dependence Graph
• VDG: Value Dependence Graph
• GURRR: Global unified resource requirement
representation. Combines PDG with resource
requirements
• Java intermediate bytecodes
• The list goes on ......

Symbol Table
• The compiler uses a symbol table to keep track of scope and binding information about names

• the symbol table is changed every time a name is encountered in the source; changes to the table occur
– if a new name is discovered
– if new information about an existing name is discovered

• The symbol table must have mechanisms to:
– add new entries
– find existing information efficiently

• Two common mechanisms:
– linear lists: simple to implement, poor performance
– hash tables: greater programming/space overhead, good performance

• The compiler should be able to grow the symbol table dynamically

• if the size is fixed, it must be large enough for the largest program

Symbol Table Entries
• each entry corresponds to a declaration of a name

• the format need not be uniform because the information depends upon the usage of the name

• each entry is a record consisting of consecutive words

• to keep records uniform some information may be kept outside the symbol table

• information is entered into the symbol table at various times
– keywords are entered initially
– identifier lexemes are entered by the lexical analyzer

• a symbol table entry may be set up when the role of a name becomes clear

• attribute values are filled in as information becomes available

• a name may denote several objects in the same block
– int x;
  struct x {float y, z; };
– the lexical analyzer returns the name itself and not a pointer to a symbol table entry
– the record in the symbol table is created when the role of the name becomes clear
– in this case two symbol table entries will be created

• attributes of a name are entered in response to declarations

• labels are often identified by a colon

• the syntax of a procedure/function specifies that certain identifiers are formals

• characters in a name
– there is a distinction between the token id, the lexeme and the attributes of the name
– it is difficult to work with lexemes
– if there is a modest upper bound on length then lexemes can be stored in the symbol table
– if the limit is large, store lexemes separately

Storage Allocation Information
• information about storage locations is kept in the symbol table

• if the target is assembly code then the assembler can take care of storage for the various names

• the compiler needs to generate data definitions to be appended to the assembly code

• if the target is machine code then the compiler does the allocation

• for names whose storage is allocated at run time no storage allocation is done

• the compiler plans out activation records

Data Structures
• List data structure
– simplest to implement
– uses a single array to store names and information
– search for a name is linear
– entry and lookup are independent operations
– the cost of entry and search operations is very high and a lot of time goes into bookkeeping

• Hash table
– the advantages are obvious: entry and lookup take close to constant time on average

Representing Scope Information
• entries are declarations of names

• when a lookup is done, the entry for the appropriate declaration must be returned

• scope rules determine which entry is appropriate

• maintain a separate table for each scope

• the symbol table for a procedure or scope is the compile time equivalent of an activation record

• information about non-local names is found by scanning the symbol tables of the enclosing procedures

• the symbol table can be attached to the abstract syntax of the procedure (integrated into the intermediate representation)

• the most closely nested scope rule can be implemented with the data structures discussed so far

• give each procedure a unique number

• blocks must also be numbered

• the procedure number is part of all local declarations

• a name is represented as a pair of number and name

• names are entered in the symbol table in the order they occur

• the most closely nested rule can be implemented in terms of the following operations:
– lookup: find the most recently created entry
– insert: make a new entry
– delete: remove the most recently created entry

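The three operations can be sketched on top of a plain list that keeps entries in the order they occur. This is only an illustration under assumptions: the class name `NestedSymbolList` and the helper `leave_scope` are hypothetical, and the sketch relies on inner scopes being removed before lookups resume in the outer scope.

```python
class NestedSymbolList:
    """Entries are (scope number, name, info), kept in order of occurrence."""

    def __init__(self):
        self.entries = []

    def insert(self, scope, name, info):
        """Make a new entry; the name is stored as a (number, name) pair."""
        self.entries.append((scope, name, info))

    def lookup(self, name):
        """Find the most recently created entry for this name."""
        for scope, entry_name, info in reversed(self.entries):
            if entry_name == name:
                return scope, info
        return None

    def delete(self):
        """Remove the most recently created entry."""
        return self.entries.pop()

    def leave_scope(self, scope):
        """Hypothetical helper: delete every entry of the scope being closed."""
        while self.entries and self.entries[-1][0] == scope:
            self.delete()


table = NestedSymbolList()
table.insert(1, "x", "int")        # procedure number 1 declares x
table.insert(1, "y", "int")
table.insert(2, "x", "float")      # nested block number 2 redeclares x
inner = table.lookup("x")          # sees the declaration from scope 2
table.leave_scope(2)
outer = table.lookup("x")          # scope 1's declaration is visible again
print(inner, outer)
```

Searching from the end of the list is what makes the most recently created entry, and therefore the most closely nested declaration, win.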
Symbol table structure
• Assign variables to storage classes that prescribe scope, visibility, and
lifetime
– scope rules prescribe the symbol table structure
– scope: unit of static program structure with one or more variable
declarations
– scope may be nested
• Pascal: procedures are scoping units
• C: blocks, functions, files are scoping units

• Visibility, lifetimes, global variables

• Common (in Fortran)

• Automatic or stack storage

• Static variables

Symbol attributes and symbol table entries
• Symbols have associated attributes

• typical attributes are name, type, scope, size, addressing mode etc.

• a symbol table entry collects attributes together so that they can be easily set and retrieved

• example of typical attributes in a symbol table entry

Name     Type
name     character string
class    enumeration
size     integer
type     enumeration

Local Symbol Table Management
NewSymTab: SymTab → SymTab

DestSymTab: SymTab → SymTab

InsertSym: SymTab × Symbol → boolean

LocateSym: SymTab × Symbol → boolean

GetSymAttr: SymTab × Symbol × Attr → value

SetSymAttr: SymTab × Symbol × Attr × value → boolean

NextSym: SymTab × Symbol → Symbol

MoreSyms: SymTab × Symbol → boolean

• A major consideration in designing a symbol table is that insertion and retrieval should be as fast as possible

• One dimensional table: search is very slow

• Balanced binary tree: quick insertion, searching and retrieval; extra work required to keep the tree balanced

• Hash tables: quick insertion, searching and retrieval; extra work to compute hash keys

• Hashing with a chain of entries is generally a good approach

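Hashing with a chain of entries can be sketched as an array of buckets, each holding the list of entries whose names hash to that bucket. The sketch below is an assumption-laden illustration: the class name `HashedSymTab`, the bucket count and the hash function are invented here, and the method names only loosely mirror the InsertSym, LocateSym, GetSymAttr and SetSymAttr operations listed earlier.

```python
class HashedSymTab:
    """Local symbol table using hashing with chained entries."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _chain(self, name):
        """Pick the chain for a name with a simple polynomial hash (assumed)."""
        h = 0
        for ch in name:
            h = (h * 31 + ord(ch)) % len(self.buckets)
        return self.buckets[h]

    def _find(self, name):
        for entry in self._chain(name):    # only one chain is searched
            if entry["name"] == name:
                return entry
        return None

    def locate_sym(self, name):
        return self._find(name) is not None

    def insert_sym(self, name):
        """Add a new entry; return False if the name is already present."""
        if self.locate_sym(name):
            return False
        self._chain(name).append({"name": name})
        return True

    def set_sym_attr(self, name, attr, value):
        entry = self._find(name)
        if entry is None:
            return False
        entry[attr] = value
        return True

    def get_sym_attr(self, name, attr):
        entry = self._find(name)
        return None if entry is None else entry.get(attr)


symtab = HashedSymTab()
symtab.insert_sym("count")
symtab.set_sym_attr("count", "type", "integer")
symtab.set_sym_attr("count", "size", 4)
print(symtab.get_sym_attr("count", "type"), symtab.locate_sym("total"))
```

Collisions only lengthen one chain, so the table keeps working when many names share a bucket; it just degrades towards the linear list in the worst case.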
Hashed local symbol table

Nesting structure of an example Pascal program

program e;
  var a, b, c: integer;

  procedure f;
    var a, b, c: integer;
  begin
    a := b+c
  end;

  procedure g;
    var a, b: integer;

    procedure h;
      var c, d: integer;
    begin
      c := a+d
    end;

    procedure i;
      var b, d: integer;
    begin
      b := a+c
    end;

  procedure j;
    var b, d: integer;
  begin
    b := a+d
  end;

begin
  a := b+c
end.

Global Symbol table structure
• scope and visibility rules determine the structure of the global symbol table

• for the Algol class of languages, scoping rules structure the symbol table as a tree of local tables
– global scope as root
– tables for nested scopes as children of the table for the scope they are nested in

e( )'s symtab: integer a, integer b, integer c
  f( )'s symtab: integer a, integer b, integer c
  g( )'s symtab: integer a, integer b
    h( )'s symtab: integer c, integer d
    i( )'s symtab: integer b, integer d
  j( )'s symtab: integer b, integer d

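The tree of local tables for the example Pascal program can be sketched with one table per scope and a link to the enclosing scope's table. This is a minimal illustration; the class name `SymTab` and its methods `declare` and `resolve` are hypothetical names chosen for the sketch.

```python
class SymTab:
    """One local table per scope, linked to the table of the enclosing scope."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.symbols = {}

    def declare(self, *names, type_="integer"):
        for n in names:
            self.symbols[n] = type_

    def resolve(self, name):
        """Return the table holding the visible declaration, or None."""
        table = self
        while table is not None:           # walk outwards towards the root
            if name in table.symbols:
                return table
            table = table.parent
        return None


# Tree for program e: f, g and j are nested in e; h and i are nested in g.
e = SymTab("e");    e.declare("a", "b", "c")
f = SymTab("f", e); f.declare("a", "b", "c")
g = SymTab("g", e); g.declare("a", "b")
h = SymTab("h", g); h.declare("c", "d")
i = SymTab("i", g); i.declare("b", "d")
j = SymTab("j", e); j.declare("b", "d")

# c := a+d inside h: c and d are local, a comes from g.
print([h.resolve(v).name for v in ("c", "a", "d")])
# b := a+c inside i: b is local, a comes from g, c comes from e.
print([i.resolve(v).name for v in ("b", "a", "c")])
```

A lookup that fails in the local table simply continues in the parent, which is how non-local names are found in the enclosing procedures.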
Storage binding and symbolic registers
• Translates variable names into addresses

• This process must occur before or during code generation

• each variable is assigned an address or addressing method

• each variable is assigned an offset with respect to a base which changes with every invocation

• variables fall into four classes: global, global static, stack, stack static

• global/static: fixed relocatable address, or offset with respect to a base such as the global pointer

• stack variable: offset from the stack/frame pointer

• allocate stack/global variables in registers

• registers are not indexable, therefore arrays cannot be in registers

• assign symbolic registers to scalar variables

• used for graph coloring for global register allocation

a: global    b: local    c[0..9]: local
gp: global pointer    fp: frame pointer

MIR            LIR (names bound      LIR (names bound to
               to locations)         symbolic registers)
a ← a*2        r1 ← [gp+8]           s0 ← s0*2
               r2 ← r1*2
               [gp+8] ← r2
b ← a+c[1]     r3 ← [gp+8]           s1 ← [fp-28]
               r4 ← [fp-28]          s2 ← s0+s1
               r5 ← r3+r4
               [fp-20] ← r5

Local Variables in Frame
• assign to consecutive locations; allow enough space for each
– may put word size objects on half word boundaries
– requires two half word loads
– requires shift, or, and

• align on double word boundaries
– wastes space
– machine may allow only small offsets

• sort variables by the alignment they need

• store largest variables first
– automatically aligns all the variables
– does not require padding

• store smallest variables first
– requires more space (padding)
– for a large stack frame makes more variables accessible with small offsets

How to store large local data structures
• Requires large space in local frames and therefore large offsets

• If a large object is put near the boundary, other objects require a large offset either from fp (if put near the beginning) or sp (if put near the end)

• Allocate another base register to access large objects

• Allocate space in the middle or elsewhere; store a pointer to these locations at a small offset from fp

• Requires extra loads

int i;
double float x;
short int j;
float y;

Unsorted aligned

object:   i        (pad)     x         j          (pad)      y
offset:   0..-4    -4..-8    -8..-16   -16..-18   -18..-20   -20..-24

Sorted frames

object:   x        i         y          j
offset:   0..-8    -8..-12   -12..-16   -16..-18

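The two frame layouts can be reproduced with a short calculation. The sketch below makes several assumptions that the slides do not state explicitly: sizes of 4, 8, 2 and 4 bytes for i, x, j and y, alignment equal to size, a frame that grows downward from fp, and a hypothetical function name `layout`.

```python
def layout(variables):
    """Assign fp-relative offsets to (name, size) pairs in the given order.

    Assumes each variable must be aligned to its own size and that the
    frame grows towards lower addresses from fp.
    """
    offsets = {}
    used = 0                               # bytes of frame consumed so far
    for name, size in variables:
        end = used + size
        used = -(-end // size) * size      # round up to a multiple of size (padding)
        offsets[name] = -used              # variable occupies [fp-used, fp-used+size)
    return offsets, used


decls = [("i", 4), ("x", 8), ("j", 2), ("y", 4)]   # declaration order

unsorted_offsets, unsorted_size = layout(decls)
# Largest first; Python's sort is stable, so i stays ahead of y.
sorted_offsets, sorted_size = layout(sorted(decls, key=lambda v: -v[1]))

print(unsorted_offsets, unsorted_size)
print(sorted_offsets, sorted_size)
```

Under these assumptions the declaration-order layout needs 24 bytes including 6 bytes of padding, while the largest-first layout needs 18 bytes and no padding, matching the two diagrams.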
