Lecture Notes on

Semantic Analysis and Specifications

15-411: Compiler Design


André Platzer

Lecture 13

1 Introduction
We have now seen how parsing works in the front end of a compiler and
how instruction selection and register allocation work in the back end.
We have also seen how intermediate representations can be used in the
middle end. One important question concerns the last phase of the front
end: semantic analysis, which determines whether the input program is
actually well-formed in those respects that parsing alone cannot check.
Another important question arises in the first phase of the middle end:
translation of the dynamic aspects of advanced data structures. Even
though the two questions belong to different phases of the compiler, we
answer them together in this lecture, because the static and dynamic
semantic aspects need to fit together anyhow.
A smaller subset of what is covered in this lecture can be found in
the textbook [App98, Ch 7.2], which covers data structures.

2 Semantic Analysis and Static Semantics


Essentially, semantic analysis makes up for syntactical aspects of the
language that are important for understanding whether the program makes
sense but cannot be represented (easily) in the context-free domain of
deterministic parsing. That is, it covers all consistency checks that
need information from the context of the current program location.
Typical parts of semantic analysis include name analysis, which
identifies which particular variable an identifier x refers to and,
especially, where it has been declared. Is it a local variable? Is it a
formal parameter of a function? Is it a global variable (for
programming languages that allow this)? Is it an identifier in a struct?
Of course, correct name analysis is important to make sure that the
right memory locations or registers are accessed when looking up or
changing the value of x. Name analysis is usually solved by reading off
a symbol table with all definitions and their type information from the
abstract syntax tree.
Another part of semantic analysis is type analysis, which looks up
the types of all identifiers based on the results of name analysis and
makes sure the types fit. It is also responsible for simple type
inference. If we find an expression

e[t.f + x]

in the source code, then what exactly is the type of the result? And is it a
well-typed expression at all? The answer depends on the type of e, which
had better be an array type (otherwise the array access would be ill-typed).
The answer also depends on the type of t, which had better be a struct type
s and will then be used to look up the type of t.f according to the type of
the field f declared in s. Finally, the answer depends on x. And if the
addition t.f + x does not produce an integer, the whole expression
still does not type-check. It is crucial to find out whether a program with
such an expression is well-typed at all. Otherwise, we would compile it
to something with a strange and arbitrary effect without knowing that the
source program made no sense at all.
All these answers depend on information from the context of the
program. One interesting indicator for a language is how many passes of
analysis through the abstract syntax tree are necessary to perform
semantic analysis successfully.
A simple typing rule is that for plus expressions:

\[
\frac{e_1 : \mathtt{int} \qquad e_2 : \mathtt{int}}{e_1 + e_2 : \mathtt{int}}
\]

It specifies that if e1 and e2 both have type int then e1 + e2 also has type int.
In the following, we will give typing rules that define the static seman-
tics of source program expressions.
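
As a rough illustration of how a compiler might implement the plus rule above, the following C sketch type-checks a tiny expression language. The AST layout and the names ty, expr, and typecheck are assumptions made for this example and are not part of the lecture.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types and AST for a tiny expression language. */
typedef enum { TY_INT, TY_BOOL, TY_ERROR } ty;
typedef enum { EXPR_INT_CONST, EXPR_BOOL_CONST, EXPR_PLUS } expr_kind;

typedef struct expr {
    expr_kind kind;
    const struct expr *left;   /* used by EXPR_PLUS only */
    const struct expr *right;  /* used by EXPR_PLUS only */
} expr;

/* Returns the type of e, or TY_ERROR if no typing rule applies. */
ty typecheck(const expr *e) {
    assert(e != NULL);
    switch (e->kind) {
    case EXPR_INT_CONST:
        return TY_INT;
    case EXPR_BOOL_CONST:
        return TY_BOOL;
    case EXPR_PLUS:
        /* Premises of the rule: e1 : int and e2 : int. */
        if (typecheck(e->left) == TY_INT && typecheck(e->right) == TY_INT)
            return TY_INT;     /* Conclusion: e1 + e2 : int. */
        return TY_ERROR;
    }
    return TY_ERROR;
}
```

In this sketch, an expression is rejected as soon as one premise of the rule fails, which mirrors reading the rule from the premises down to the conclusion.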

3 Dynamic Semantics
The static semantics is necessary to make sense of a source code expression,
but it specifies the expression only incompletely. We will also explain the
dynamic semantics of expressions, i.e., what their effect is when evaluated.
This information is required for the translation phase in order to make sure
that the intermediate language generated for a particular source code snippet
actually complies with the semantics of the programming language, which
hopefully matches the intention that the programmer had in mind when
writing the program.
As a side note, the job of compiler verification is to make sure that the
source program will be compiled to something that has exactly the same
effect as prescribed by the language semantics, regardless of whether the
source program is doing the right thing. The compiler's job is to adhere
to this exactly. Contrast this with program verification, where the job is to
make sure that the program matches the intentions that the programmer has
in mind, as expressed by some formal specification of what it is meant to
achieve, e.g., in the form of a set of pre/postconditions.
For describing the dynamic semantics of C0, we define how we evaluate
expressions and statements of the programming language. We need to
describe how an expression e will be evaluated to determine the result. For
this purpose, we want to define a relation e ⇒ v that specifies that e, when
evaluated, results in the value v. We want to define the relation e ⇒ v by
rules specifying the effect of each expression like

\[
\frac{e_1 \Rightarrow n_1 \qquad e_2 \Rightarrow n_2 \qquad n = \mathrm{add}(n_1, n_2)}{e_1 + e_2 \Rightarrow n} \quad (+?)
\]

This rule is intended to specify that, when e1 evaluates to value n1 and
e2 evaluates to n2, and value n is the sum of values n1 and n2, then the
expression e1 + e2 evaluates to n. Unfortunately, it does not quite do the
trick yet. To see why, we first consider two other rules. One simple rule
states that constants just evaluate to themselves (similarly for 5, 7, ...):

\[
\frac{}{0 \Rightarrow 0} \quad (0)
\]

And one rule evaluates a variable identifier x. But what should a
variable evaluate to? Well, that depends on what its value is. The value of a
variable identifier is stored at some address in memory (or in a register,
which we talk about in a moment). Let's denote the memory address where x is
stored by addr(x). This memory address could be for a local variable on
the stack, for a spilled function argument on the stack (near the beginning
of the frame), or somewhere in a global data segment for global variables.
Either way, it is in memory. Thus when we evaluate a variable identifier, the
result is going to be its value from the memory:

\[
\frac{x \text{ variable identifier}}{x \Rightarrow M(\mathrm{addr}(x))} \quad (\mathit{id}?)
\]

Yet we need to know the memory contents for this to make sense. So let
us reflect this in the notation and change our judgment to e@M ⇒ v to say
that expression e, when evaluated in memory state M evaluates to value v.

\[
\frac{x \text{ variable identifier}}{x@M \Rightarrow M(\mathrm{addr}(x))} \quad (\mathit{id})
\]

Yet now what about local variables x that are stored in registers or function
arguments that have been passed in registers? The precise option would be
to also include the register state R in the judgment e@M@R ⇒ v. Then
a variable x that is stored in the register addr(x) = %eax would
evaluate to R(%eax) instead of to M(addr(x)).

\[
\frac{x \text{ variable identifier} \qquad x \text{ stored in register}}{x@M@R \Rightarrow R(\mathrm{addr}(x))} \quad (\mathit{id}_1)
\]

\[
\frac{x \text{ variable identifier} \qquad x \text{ stored in memory}}{x@M@R \Rightarrow M(\mathrm{addr}(x))} \quad (\mathit{id}_2)
\]

That is the formally precise way to do it. The only downside is that the
notation is a bit clumsy. So instead, we will simply pretend that
the registers are a special part of the memory state M, stored at the
special addresses M(%eax), M(%rbx), M(%rdi), .... This really doesn't change
anything except making the notation easier to read. Formally, this notation
corresponds to considering the cross product M × R of the real memory
state and the register state and just calling the result M again.
Unfortunately, however, the above approach is still only sufficient for
describing pure programming languages where no expression can have an
effect other than computing its result. The C0 programming language already
avoids unnecessary side effects during expression evaluation, such as
preincrement and postincrement. Yet it still allows function calls in
expressions, and function calls can have arbitrary side effects. In order to
make sure we do not miss those effects in the semantics, we carry an explicit
system state M around through the evaluation. We thus look at the judgment
e@M ⇒ v@M', capturing that an expression e in system state M evaluates
to value v and results in the new system state M'. Here, we primarily
consider the memory state M, but other state can be tracked with this
principle too. Thus the above rule turns into the more precise

\[
\frac{e_1@M \Rightarrow n_1@M' \qquad e_2@M' \Rightarrow n_2@M'' \qquad n = \mathrm{add}(n_1, n_2)}{e_1 + e_2@M \Rightarrow n@M''} \quad (+)
\]

Unlike the first rule (+?), the new rule now captures the semantics of
left-to-right evaluation order. In the first rule (+?), we could still supply
the premises in an arbitrary order and were not restricted to evaluating the
subexpressions e1 and e2 in any particular order. The new rule (+)
explicitly requires left-to-right evaluation, because the state M' resulting
from evaluating e1 is the starting state for evaluating e2, whose resulting
state M'' will be the resulting state of evaluating the whole expression
e1 + e2.
The static and dynamic semantics together give meaning to all elements
of the programming language. We treat the static and dynamic semantics
for various elements of the C0 programming language at the same time in
the following.
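
To make the state threading of rule (+) concrete, here is a minimal C sketch of an evaluator in which one expression form has a side effect. The state layout (a single counter cell) and the forms EXPR_READ and EXPR_TICK are assumptions invented for this example; they merely stand in for function calls with side effects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical system state M: just one counter cell. */
typedef struct { int counter; } state;

typedef enum { EXPR_CONST, EXPR_READ, EXPR_TICK, EXPR_PLUS } expr_kind;

typedef struct expr {
    expr_kind kind;
    int value;                  /* EXPR_CONST only */
    const struct expr *left;    /* EXPR_PLUS only */
    const struct expr *right;   /* EXPR_PLUS only */
} expr;

/* Evaluates e in state *m and updates *m in place to the resulting state. */
int eval(const expr *e, state *m) {
    assert(e != NULL && m != NULL);
    switch (e->kind) {
    case EXPR_CONST:
        return e->value;              /* constants leave M unchanged */
    case EXPR_READ:
        return m->counter;            /* reads M without changing it */
    case EXPR_TICK:
        m->counter += 1;              /* side effect: M becomes M' */
        return m->counter;
    case EXPR_PLUS: {
        int n1 = eval(e->left, m);    /* e1@M  => n1@M'  */
        int n2 = eval(e->right, m);   /* e2@M' => n2@M'' */
        return n1 + n2;               /* n = add(n1, n2) */
    }
    }
    return 0;
}
```

Starting from a counter of 0, evaluating read + tick and tick + read with this sketch gives different results, because the right operand always sees the state left behind by the left operand. That is exactly the ordering that rule (+?) fails to pin down.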

4 Small Types
So far, we have only used a programming language with minimal typing.
Basically, the only two types so far were int and bool, which are easily
distinguished by their respective syntactical occurrence in the language.
Only int had been allowed as a type for declared variables, and bool only
occurred in the test expressions for if, while and for.
Real programming languages, including C0, have more serious types.

Types τ ::= int | bool | struct s | τ∗ | τ[] | a

where a is a name of a type abbreviation for some type τ and has been
introduced in the form
typedef τ a

We mostly ignore the other C0 types char and string in this course.
For discussing the layout of the various types, we distinguish between
small types that can fit into a register and large types that have to be stored
in memory. First we discuss all small types. For the purpose of memory
layout and register handling we define the size |τ| of small types τ as
follows:

|int| = 4
|bool| = 4
|τ∗| = 8
|τ[]| = 8

That is, int and bool are 32-bit, and pointers τ∗ are represented by 64-bit
addresses on 64-bit machines. Arrays themselves are large values, and
array constants would be large, because we cannot pass a whole array in a
register. But C0 allocates arrays on the heap, just like the targets of
pointers, and they are only represented by their starting address. Hence,
variables of array type have a small type, because we can fit the array
address into a register.
In particular, we have small data of two different sizes. Pointers are
allocated from the heap memory by the runtime system using the alloc(τ)
library function, which returns a fresh chunk of memory at a location
divisible by 8, ready to hold a value of type τ. In C-like programming
languages, the null address 0 (denoted by the constant NULL) is special in
that it will never be returned by alloc(τ), except to indicate that the
system ran out of memory altogether. All memory access to the null pointer
is thus considered bad memory access.

5 Large Types
Array contents and structures are large types, because they do not (usually)
fit into a register. We define their size as follows

|s| = pad(|τ1|, ..., |τn|)

when the structure s has been defined to be

struct s {
  τ1 f1;
  τ2 f2;
  ...
  τn fn;
}


The function pad adds the sizes of its arguments, adding padding as
necessary in between and at the end. That is, elements of type int and bool
are aligned at memory addresses that are divisible by 4. Elements of type τ∗
and τ[] are aligned at memory addresses divisible by 8. A compiler
remembers the byte offset of field fi in the memory layout of structure s in
order to find it later. We denote it by off(s, fi).
Similarly to distinguishing between small and large types, we
distinguish between small values (values of a small type) that fit into a
register and large values (values of a large type) that have to be stored in
memory.
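
One way to compute pad and the offsets off(s, fi) is sketched below in C. This is only an illustration under stated assumptions: it handles small fields only (sizes 4 and 8), aligns each field to its own size, and pads the end of the struct to its largest field alignment. The names round_up and layout_struct are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds n up to the next multiple of align (align must be positive). */
static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) / align * align;
}

/*
 * Stores the byte offset of field i in off[i] and returns the padded size
 * of the whole struct. Assumption: every field is small, so sizes[i] is
 * either 4 (int, bool) or 8 (pointers, array references), and each field
 * is aligned to its own size.
 */
size_t layout_struct(const size_t *sizes, size_t n, size_t *off) {
    size_t pos = 0;
    size_t max_align = 1;
    for (size_t i = 0; i < n; i++) {
        assert(sizes[i] == 4 || sizes[i] == 8);
        pos = round_up(pos, sizes[i]);   /* padding in between */
        off[i] = pos;
        pos += sizes[i];
        if (sizes[i] > max_align)
            max_align = sizes[i];
    }
    return round_up(pos, max_align);     /* padding at the end */
}
```

For a struct with fields of type int, int∗, and bool (sizes 4, 8, 4), this sketch places the fields at offsets 0, 8, and 16 and reports a total size of 24: four bytes of padding follow the first field and four more follow the last.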

6 Structs
The typing rule for the static semantics of structs is simple and just says
that an access to a field f of a struct value e of type s results in a value of
type τ , where τ is the declared type of field f in s:

\[
\frac{e : s \qquad \mathtt{struct}\ s\ \{\ldots\ \tau\ f;\ \ldots\}}{e.f : \tau}
\]

To give a dynamic semantics to structs, we define the operational
semantics of what happens when we evaluate an expression involving structs.
When evaluating e.f, we just evaluate e to an address a and then look up
the memory contents at a with the offset off(s, f) belonging to the field f of
struct s in memory, i.e., M(a + off(s, f)):

\[
\frac{e : s \qquad \mathtt{struct}\ s\ \{\ldots\ \tau\ f;\ \ldots\} \qquad e@M \Rightarrow a@M' \qquad \tau \text{ small}}{e.f@M \Rightarrow M'(a + \mathrm{off}(s, f))@M'}
\]

Unfortunately, this only works well for small types whose values can be
returned in registers right away. For large types, this cannot really work
well, because the memory M(a) at location a does not even contain all
information, and we cannot store the whole object in a single register anyhow.
For large types, evaluation produces an address instead, relative to which
the content will be addressed further.

\[
\frac{e : s \qquad \mathtt{struct}\ s\ \{\ldots\ \tau\ f;\ \ldots\} \qquad e@M \Rightarrow a@M' \qquad \tau \text{ large}}{e.f@M \Rightarrow a + \mathrm{off}(s, f)@M'}
\]


7 Pointers
To explain the static semantics of pointers and pointer access, there are
simple rules:

\[
\frac{e : \tau{*}}{{*}e : \tau}
\qquad
\frac{}{\mathrm{alloc}(\tau) : \tau{*}}
\qquad
\frac{}{\mathtt{NULL} : \tau{*}}
\]

If pointer e has the type τ∗ of a pointer to an element of type τ, then the
pointer dereference ∗e has the type τ of the element. Allocation of a piece
of heap memory for data of type τ gives a pointer of type τ∗, i.e., pointing
to τ. The last typing rule is a little tricky, because it gives NULL all pointer
types at once. This is necessary, because the same NULL pointer is used
to represent not-allocated regions of arbitrary types. In particular, the type
of NULL depends on its context, that is, on the expected type that the
context wants NULL to have in order to make sense of it. In order to avoid
ambiguity of the typing, we disallow ∗NULL. The expression ∗NULL is
tricky to type-check because it can lead to ambiguous situations. For
instance, (∗NULL).f could have a lot of types: essentially all types of field f
in arbitrary structs declared in the program.
The operational semantics of pointer evaluation can be described using
the notation M(a) to denote the content of memory address a. The
operational semantics for a pointer access ∗e evaluates e to an address and
then returns the memory contents at that address. Dereferencing the NULL
pointer must raise the SIGSEGV exception. In an implementation this can
be accomplished without any checks, because the operating system will
prevent read access to address 0 and raise the appropriate exception just by
having page 0 unmapped in the virtual-memory page table. When
dereferencing pointers that store an address a other than the null pointer,
we obtain their memory contents M(a) (for small types):

\[
\frac{e : \tau{*} \qquad e@M \Rightarrow a \qquad a \neq 0 \qquad \tau \text{ small}}{{*}e@M \Rightarrow M(a)} \ (?)
\qquad
\frac{e : \tau{*} \qquad e@M \Rightarrow a \qquad a \neq 0 \qquad \tau \text{ large}}{{*}e@M \Rightarrow a} \ (?)
\]

(Footnote to the condition a ≠ 0: machines implement this check by having
page 0 unmapped in the virtual-memory page table.)

For large types, the memory M(a) at location a does not even contain all
information, and we cannot store the whole object in a register anyhow.
So instead, ∗e evaluates to a itself, relative to which the content will be
addressed further. When we dereference a pointer that is null, the program
terminates with a segmentation fault:

\[
\frac{e : \tau{*} \qquad e@M \Rightarrow a \qquad a = 0}{{*}e@M \Rightarrow \mathtt{SIGSEGV}} \ (?)
\]
For memory allocation, however, we run into some issues when we
want to specify what it is doing. After all, memory allocation modifies the
memory by finding a free chunk of memory and by clearing the memory
contents to 0. Thus memory changes from the old memory M to the new
memory M'. We model this by changing our judgment from e@M ⇒ e'
into e@M ⇒ e'@M', in which we also specify the new memory state M'.

\[
\frac{M' \text{ like } M \text{ but } M'(a) = \ldots = M'(a + |\tau| - 1) = 0 \text{ for fresh locations}}{\mathrm{alloc}(\tau)@M \Rightarrow a@M'}
\]

The values stored in freshly allocated locations must all be 0. This can be
achieved with calloc() and means that values of type int are simply 0,
values of type bool are false, values of type τ'∗ are NULL pointers, all fields
of structs are recursively set to 0, and values of array type have address 0,
which is akin to a NULL array reference.
This change of judgment to e@M ⇒ e'@M' is reflected in the subsequent
modifications of the above rules, which now carry the memory state
through. When doing that, we also notice that expression evaluation
during pointer dereference (just like all other expression evaluation) can
actually modify the memory contents. We fix this deficiency in our
previous specification right away:

\[
\frac{e : \tau{*} \qquad e@M \Rightarrow a@M' \qquad a \neq 0 \qquad \tau \text{ small}}{{*}e@M \Rightarrow M'(a)@M'}
\qquad
\frac{e : \tau{*} \qquad e@M \Rightarrow a@M' \qquad a \neq 0 \qquad \tau \text{ large}}{{*}e@M \Rightarrow a@M'}
\]

\[
\frac{e : \tau{*} \qquad e@M \Rightarrow a@M' \qquad a = 0}{{*}e@M \Rightarrow \mathtt{SIGSEGV}@M'}
\qquad
\frac{}{\mathtt{NULL}@M \Rightarrow 0@M}
\]

When no functions with side effects (like memory allocation) occur in the
expressions, we need not distinguish between M and M' during expression
evaluation.
Note, however, that when combining pointers and structs, we cannot
necessarily rely on the operating system to trap null pointer dereferencing.
For a very large struct s and a pointer p : s∗, when accessing a field p−>f
(which desugars into (∗p).f), the target address may already lie beyond the
unmapped virtual memory page 0 if f has a large offset.
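
The problem can be seen with a small C sketch. It is written in plain C rather than C0 (C0 structs cannot embed fixed-size arrays, so the filler field only stands in for many ordinary fields), and the 4096-byte page size is an assumption about the target machine. The names big, null_field_address, and needs_explicit_null_check are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the unmapped page 0 spans addresses 0..4095. */
#define ASSUMED_PAGE_SIZE 4096

/* A hypothetical struct whose last field lies past the first page. */
struct big {
    int filler[2048];   /* stands in for many fields before f */
    int f;
};

/* Address that p->f would touch if p were NULL: 0 + off(big, f). */
size_t null_field_address(void) {
    return offsetof(struct big, f);
}

/* Returns 1 if a NULL access at this offset could escape page 0. */
int needs_explicit_null_check(size_t field_offset) {
    return field_offset >= ASSUMED_PAGE_SIZE;
}
```

Under these assumptions, a compiler could emit an explicit comparison against 0 only for field accesses whose offset makes needs_explicit_null_check return 1, and keep relying on the hardware trap for all others.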


8 Arrays
Arrays are almost like pointers. Both are allocated. The difference is that
C0 pointers disallow pointer arithmetic, whereas the contents of an array can
be accessed at arbitrary integer positions computed by arithmetic. In
particular, for arrays the question arises what to do with accesses out of
bounds, i.e., outside the array size. Does such an access just touch memory
unsafely at wild places, or will it be detected safely and raise a runtime
exception? In early labs, we will follow the unsafe C tradition and allow
more arbitrary behavior. In later labs, we will switch to safe compilation
more like in Java. First, we give simple typing rules explaining the static
semantics:

\[
\frac{e : \tau[] \qquad t : \mathtt{int}}{e[t] : \tau}
\qquad
\frac{e : \mathtt{int}}{\mathrm{alloc\_array}(\tau, e) : \tau[]}
\]

Now we consider the operational semantics. Note that the evaluation order
in array access (like everywhere else) is strictly left-to-right. So expression
e[t] will be evaluated by evaluating e first and t second, and then accessing
the result of e at the result of t:

\[
\frac{e : \tau[] \qquad e@M \Rightarrow a@M' \qquad t@M' \Rightarrow n@M'' \qquad a \neq 0 \qquad M''(a + n|\tau|) \text{ allocated} \qquad \tau \text{ small}}{e[t]@M \Rightarrow M''(a + n|\tau|)@M''}
\]

\[
\frac{e : \tau[] \qquad e@M \Rightarrow a@M' \qquad t@M' \Rightarrow n@M'' \qquad a \neq 0 \qquad M''(a + n|\tau|) \text{ allocated} \qquad \tau \text{ large}}{e[t]@M \Rightarrow a + n|\tau|@M''}
\]

For safe array access with array bounds checks, we add checks to the above
rules ensuring that 0 ≤ n < N, where N is the size of the array, which has
to be stored at the time of allocation. For all other cases, we have two
choices. One is the liberal but unsafe choice, as in C, where we leave the
evaluation of array access undefined in all cases that do not match either
rule. The other is an unambiguously defined semantics, where we say
precisely how array access fails:

\[
\frac{e : \tau[] \qquad e@M \Rightarrow a@M' \qquad t@M' \Rightarrow n@M'' \qquad a \neq 0 \qquad M''(a + n|\tau|) \text{ not allocated}}{e[t]@M \Rightarrow \mathtt{SIGSEGV}@M''}
\]

For the case where the address computation of the array itself yields NULL,
we can raise a SIGSEGV either before or after evaluating t. Both choices are
reasonable. The early choice saves operations in case of a SIGSEGV. The
late choice, however, reduces the number of times that violations have to
be checked, which is why we prefer it:

\[
\frac{e : \tau[] \qquad e@M \Rightarrow a@M' \qquad a = 0}{e[t]@M \Rightarrow \mathtt{SIGSEGV}@M'}
\quad \text{or} \quad
\frac{e : \tau[] \qquad e@M \Rightarrow a@M' \qquad t@M' \Rightarrow n@M'' \qquad a = 0}{e[t]@M \Rightarrow \mathtt{SIGSEGV}@M''}
\]
The difference between those two choices is not gigantic, because it only
affects the memory state after an abnormal termination of the program.
Getting this part of the semantics exact is more important in programming
languages like Java, where throwing and catching exceptions is used
routinely and, in fact, some programs may rely on exceptions being raised
reliably and in the right order in order to function properly.
For the safe array access semantics with array bounds checks, failed
checks for array bounds result in SIGABRT, where N is the size of the array:

\[
\frac{e : \tau[] \qquad e@M \Rightarrow a@M' \qquad t@M' \Rightarrow n@M'' \qquad a \neq 0 \qquad (n < 0 \lor n \geq N)}{e[t]@M \Rightarrow \mathtt{SIGABRT}@M''}
\]

Allocation of arrays is very similar to allocation of pointers, except that
we also check if the size makes sense:

\[
\frac{e@M \Rightarrow n@M' \qquad n \geq 0 \qquad M'' \text{ like } M' \text{ but } M''(a) = \ldots = M''(a + (n - 1)|\tau|) = 0 \text{ for fresh locations}}{\mathrm{alloc\_array}(\tau, e)@M \Rightarrow a@M''}
\]

Values in a freshly allocated array are all initialized to 0. This can again be
achieved using calloc.
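
The safe array rules could be realized roughly as in the following C sketch for arrays of int. It is only one possible design and rests on assumptions: the length N is stored in a small header record next to the cells, failures are reported as return codes instead of actually raising signals (which keeps the sketch easy to test), and a negative size simply makes allocation fail, a case the rules above leave open. All names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { ACCESS_OK, ACCESS_SIGSEGV, ACCESS_SIGABRT } access_result;

/* Hypothetical array representation: length stored next to the cells. */
typedef struct {
    int length;   /* N, recorded at allocation time */
    int *cells;   /* zero-initialized contents */
} int_array;

/* Allocates n zeroed cells; returns NULL if n < 0 or memory runs out. */
int_array *alloc_int_array(int n) {
    if (n < 0)
        return NULL;
    int_array *a = malloc(sizeof *a);
    if (a == NULL)
        return NULL;
    a->length = n;
    /* calloc clears the memory; request at least one cell so n == 0 works. */
    a->cells = calloc(n > 0 ? (size_t)n : 1, sizeof(int));
    if (a->cells == NULL) {
        free(a);
        return NULL;
    }
    return a;
}

/* Reads a[n] into *out and reports which rule of the semantics applied. */
access_result int_array_get(const int_array *a, int n, int *out) {
    assert(out != NULL);
    if (a == NULL)
        return ACCESS_SIGSEGV;               /* a = 0 */
    if (n < 0 || n >= a->length)
        return ACCESS_SIGABRT;               /* n < 0 or n >= N */
    *out = a->cells[n];
    return ACCESS_OK;
}

/* Releases an array obtained from alloc_int_array. */
void free_int_array(int_array *a) {
    if (a != NULL) {
        free(a->cells);
        free(a);
    }
}
```

In a real compiler the two failure codes would correspond to raising SIGSEGV and SIGABRT. Note that the null check happens only inside the access, after both the array and the index have been evaluated by the caller, which matches the late choice preferred above.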
Note again that, when combining pointers and arrays, we cannot
necessarily rely on the operating system to trap null pointer dereferencing.
For a pointer p : τ[]∗ to a very large array, accessing (∗p)[70000] may
already lead to a target address beyond the unmapped virtual memory page 0.
C and C0 do not have special support for multidimensional arrays,
but just consider int[][] as (int[])[], i.e., an array of arrays of integers. In
ragged representation, this two-dimensional array is represented as a
one-dimensional array of pointers to arrays. This results in row-major
ordering, in which the cells in each row are stored one after the other in
memory. In ragged representation, there is no guarantee in general that the
rows are stored contiguously without gaps (or reorderings). There isn't even
a guarantee that all rows are of the same length.
In contrast, statically declared arrays (which are not allowed in C0) are
usually stored contiguously in row-major order, because the dimensions
are known statically. For instance,


int matrix[2][3] = {{1, 2, 3}, {4, 5, 6}};

corresponds to the matrix

\[
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}
\]

which is stored in memory as 1 2 3 4 5 6.
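
The contiguous row-major layout can be observed directly in C, as in the following sketch. The helper names flatten_matrix and row_major_index are invented for this example.

```c
#include <assert.h>
#include <string.h>

/* Copies the 2x3 matrix from the text into flat[0..5] in memory order. */
void flatten_matrix(int flat[6]) {
    int matrix[2][3] = {{1, 2, 3}, {4, 5, 6}};
    /* A statically declared array has no gaps between its rows. */
    assert(sizeof matrix == 6 * sizeof(int));
    memcpy(flat, matrix, sizeof matrix);
}

/* Row-major position of cell (i, j) in a matrix with cols columns. */
int row_major_index(int i, int j, int cols) {
    return i * cols + j;
}
```

With this layout, cell (1, 0) of the matrix sits at flat position 1 * 3 + 0 = 3, directly after the last cell of row 0.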

Side-note: an odd thing in C is that x[i] and i[x] are both valid array
accesses and equivalent, because both are just defined as ∗(x + i). C even
allows 2[x] instead of x[2].

9 Assignments to Lvalues
Assignments to primitive int variables are simple and ultimately just
implemented by a MOV instruction to the respective temp (see lectures 2 and
3). In more complicated languages with structured data, we can assign to
other expressions such as a[10 − i] or ∗p or x.f or even ∗x.f or (∗x).f alias
x−>f. Not all expressions qualify as proper expressions to which we can
assign. It makes no sense to try to assign a value to x + y or to f(∗x − 1),
which may only appear on the right-hand side of an assignment (rvalues). The
expressions that make sense on the left-hand side of an assignment, because
they identify a proper location (say in memory), are called lvalues.
Lvalues are well-typed expressions of the form

x | ∗e | e.f | e[t] | e−>f

for (well-typed) expressions e, t, primitive variable x and struct field f. The
only syntactically valid assignments in C0 are of the form l=e or l+=e and
so on for an lvalue l of type τ and an arbitrary (rvalue) expression e of
type τ. No implicit type casts, conversions, or coercions happen in C0. While
some programming languages allow assignments to large types and give them
a memcopy semantics, C0 does not do so, because it is not clear for pointer
types if a shallow or deep copy would make more sense. Thus, in C0, only
small types can be assigned to directly.
An lvalue represents a destination location for the assignment, which is
either a variable x or an address a in memory. Essentially, for determining
the target of an lvalue, we use the rules of the structural operational
semantics that we have discussed so far, except that we stop at location a
before actually doing the memory access M(a). More precisely, we define the
relation v@M ⇒l d@M' to say that lvalue v, when evaluated in memory state
M, represents location d and that this evaluation changed the memory state
to M'. It is defined as:

\[
\frac{}{x@M \Rightarrow_l x@M}
\qquad
\frac{e@M \Rightarrow a@M'}{{*}e@M \Rightarrow_l a@M'}
\qquad
\frac{e : s \qquad e@M \Rightarrow a@M'}{e.f@M \Rightarrow_l a + \mathrm{off}(s, f)@M'}
\]

\[
\frac{e_1 : \tau[] \qquad e_1@M \Rightarrow a@M' \qquad e_2@M' \Rightarrow n@M''}{e_1[e_2]@M \Rightarrow_l a + n|\tau|@M''}
\]

The side conditions and failure modes for the address computation when
evaluating lvalue e1[e2]@M ⇒l ... of an array access are just like those for
the value evaluation e1[e2]@M ⇒ ....
Using this lvalue relation ⇒l, we can define the effect of an assignment
v = e. The semantics of a statement does not produce a value, it just has
an effect on memory. Thus we just write e@M ⇒ @M' to describe the
transition. As a shorthand notation, we write M''{M''(a) ← w} for the
memory state M''' that is obtained from a memory state M'' by changing
the contents at address a to w.

\[
\frac{v@M \Rightarrow_l x@M \qquad e@M \Rightarrow w@M'}{v = e\ @M \Rightarrow @M'\{V(x) \leftarrow w\}}
\]

\[
\frac{v@M \Rightarrow_l a@M' \qquad e@M' \Rightarrow w@M'' \qquad M''(a) \text{ allocated}}{v = e\ @M \Rightarrow @M''\{M''(a) \leftarrow w\}}
\]

\[
\frac{v@M \Rightarrow_l a@M' \qquad e@M' \Rightarrow w@M'' \qquad a = 0}{v = e\ @M \Rightarrow \mathtt{SIGSEGV}@M''}
\]
The effect of an assignment is undefined otherwise. In particular, whether
the assignment segfaults during a bad access or not may (at present)
depend on whether the compiler implements out-of-bounds checks. In later
labs, you will implement a safe compiler for C0 where out-of-bounds
problems have to be checked.
Note especially that, for an assignment v = e, the lvalue v will be
evaluated to a destination location before the right-hand side expression e
is evaluated. Only when both v and e have been evaluated will the assignment
to v actually be performed, and the destination address a will only be
accessed then. In particular:
1. *e = 1/0 will raise SIGFPE when e evaluates without any other
   exception, because e evaluates to an address (without complications)
   and then, before this memory location is even accessed, the
   expression 1/0 is computed, which throws an exception.

2. e[-1] = 1/0 should raise a SIGABRT in safe mode, assuming e
   evaluates without any other exception, because the target address
   computation for the lvalue itself failed.

3. e->f = 1/0 will raise SIGSEGV when e evaluates to NULL without
   any other exception during evaluation of e.

In principle, compound assignment operators ⊕= for an operator ⊕ ∈
{+, −, ∗, /, ...} work like assignments, but with the operation ⊕. Yet, the
meaning of compound assignment operators changes in subtle ways
compared to what it meant for just primitive variables. Compound
assignments are no longer just a syntactic expansion, because expressions can
now have side effects and it matters how often an expression is evaluated.
For a compound assignment e[t] += e', the lvalue of e[t] is only
computed once, quite unlike for the assignment e[t] = e[t] + e', where
e[t] is evaluated to an address twice. A compound assignment

v ⊕= e

with an operator ⊕ executes as

\[
\frac{v@M \Rightarrow_l x@M \qquad e@M \Rightarrow w@M'}{v \oplus{=}\ e\ @M \Rightarrow @M'\{V(x) \leftarrow V(x) \oplus w\}}
\]

\[
\frac{v@M \Rightarrow_l a@M' \qquad e@M' \Rightarrow w@M'' \qquad M''(a) \text{ allocated}}{v \oplus{=}\ e\ @M \Rightarrow @M''\{M''(a) \leftarrow M''(a) \oplus w\}}
\]
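
The difference between a compound assignment and its naive expansion can be made visible by counting how often a side-effecting index expression runs. The following C sketch uses a hypothetical helper next_index that always returns 0 but records each call. It only demonstrates the number of evaluations: unlike C0, C does not fix the order in which the two calls in the expanded form happen.

```c
#include <assert.h>
#include <stddef.h>

static int calls = 0;   /* counts evaluations of the index expression */

/* A side-effecting index expression: always yields 0, but is observable. */
static int next_index(void) {
    calls += 1;
    return 0;
}

/* a[next_index()] += 1: the lvalue is computed once. Returns call count. */
int compound_calls(int *a) {
    assert(a != NULL);
    calls = 0;
    a[next_index()] += 1;
    return calls;
}

/* a[next_index()] = a[next_index()] + 1: the index is evaluated twice. */
int expanded_calls(int *a) {
    assert(a != NULL);
    calls = 0;
    a[next_index()] = a[next_index()] + 1;
    return calls;
}
```

Both functions increment a[0] by one, but the first evaluates the index expression once and the second evaluates it twice. If next_index returned a different value on each call, the two forms would even update different cells, which is why a compiler must not desugar e[t] += e' into e[t] = e[t] + e'.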

10 Function Calls
Suppose we have a function call f(e1, ..., en) to a function f that has been
defined as τ f(τ1 x1, ..., τn xn) {b}. We consider a simplified situation here
and just assume there is a return variable called %eax in the function body
b.

\[
\frac{e_1@M \Rightarrow v_1@M_1 \qquad e_2@M_1 \Rightarrow v_2@M_2 \qquad \ldots \qquad e_n@M_{n-1} \Rightarrow v_n@M_n \qquad b@M_n' \Rightarrow @M' \qquad \tau \text{ small}}{f(e_1, \ldots, e_n)@M \Rightarrow M'(\%\mathtt{eax})@M'}
\]

where Mn' is like Mn, except that the values vi of the arguments ei have been
bound to the formal parameters xi, i.e., Mn'(x1) = v1, ..., Mn'(xn) = vn.
And now we remember that allocation is actually a function call in C0.
Consequently, in the intermediate representation of our C0 compiler, side
effects due to allocation can only occur at the statement level, not nested
within expressions. Hence, specifying the semantics for the intermediate
representation is actually easier (it doesn't need the complicated M'). But,
unlike its intermediate representation, C0 itself still needs to respect the
order of memory state passing carefully.

11 Type Safety
An important property of programming languages is whether they are type-
safe. In a type-safe language, the static and dynamic semantics of a pro-
gramming language should fit together. If we have an expression e in a
program that has the type int, then we would be rather surprised to find at
runtime a result of evaluating e that is a float. If this could happen, then it is
rather hard to make sure that the program will always execute reasonably
even if the compiler accepted it as a well-typed program.
What we expect from the static and dynamic semantics of a type-safe
language is that types are preserved in the following sense. If we have a
program that is well-typed (the static semantics says it’s okay) and we fol-
low an evaluation step of the dynamic semantics, then the resulting pro-
gram is still well-typed (type preservation). Otherwise what can happen
is that we run a well-typed program and suddenly break the well-typing
leading to values out of the type ranges. That is, the property that we want
(and need to prove for our static and dynamic semantics) is that

If e : τ and e ⇒ v then v : τ

For C0 (and other impure programming languages), the statement is a bit
more involved, because the dynamic semantics refers to the memory state
$M$. The program reads values from memory and stores values back in
memory. If the program stored an int into $M(a)$ and later expected
to read a pointer from $M(a)$, then type safety would be broken. Consequently,
type preservation is a property of the form

If $e : \tau$ and $e@M \Rightarrow v@M'$ and $M$ is okay then $v : \tau$ and $M'$ is okay

for a suitable definition of when a memory state $M$ is “okay”, i.e., the types
of the values that it stores are compatible with what the program expects.
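
One possible way to spell out “okay” is to keep, next to the memory, a table that records which type the program expects at each address. The following Python sketch shows that idea. The table sigma, the tagged values, the type encoding ("int" or ("ptr", tau)), and the function names are assumptions of this example; the notes do not fix a particular definition.

```python
def value_ok(v, tau, sigma):
    """Does the stored value v have the expected type tau?"""
    if tau == "int":
        return v[0] == "int"
    if isinstance(tau, tuple) and tau[0] == "ptr":
        # A pointer is fine if its target address is expected to hold
        # a value of the pointee type.
        return v[0] == "ptr" and sigma.get(v[1]) == tau[1]
    return False


def memory_ok(M, sigma):
    """M is okay if every address holds a value of the type sigma expects."""
    return all(a in M and value_ok(M[a], tau, sigma)
               for a, tau in sigma.items())


# Address 0 is expected to hold an int, address 1 a pointer to an int.
sigma = {0: "int", 1: ("ptr", "int")}

M_good = {0: ("int", 7), 1: ("ptr", 0)}
M_bad = {0: ("int", 7), 1: ("int", 42)}  # an int where a pointer is expected

print(memory_ok(M_good, sigma), memory_ok(M_bad, sigma))
```

In this sketch, M_bad is exactly the situation described above: an int was stored at an address from which the program later expects to read a pointer.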
The other property that one would expect from type-safe languages is
that the dynamic semantics always knows what to do (with well-typed
programs). We do not want to be stuck in the middle of a run or an
interpretation of the program by the dynamic semantics rules, not knowing
where to go and not having a rule that allows a transition. For instance,
if the program contains the well-typed expression e + f and the dynamic
semantics does not know how to evaluate the odd expression “test”+0.5,
then we had better make sure that the evaluation of e can never lead to a string
“test” while, at the same time, the evaluation of f leads to the float 0.5.

If $e : \tau$ and $e$ is not a final value then $e \to e'$ for some $e'$
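
The progress property can be illustrated with a small-step evaluator in Python. This is a sketch under assumptions: the toy language has int, string and float constants, its only operator is an addition that is defined on ints alone, and the names Stuck, step, run and is_stuck are invented for this example.

```python
class Stuck(Exception):
    """Raised when no evaluation rule applies to an expression."""


def is_value(e):
    return e[0] in ("int", "str", "float")


def type_of(e):
    """Static semantics: in this sketch, add is only typed for two ints."""
    if is_value(e):
        return e[0]
    if e[0] == "add" and type_of(e[1]) == "int" and type_of(e[2]) == "int":
        return "int"
    raise TypeError("ill-typed expression: %r" % (e,))


def step(e):
    """One step e -> e' of the dynamic semantics, or Stuck if no rule applies."""
    if e[0] == "add":
        left, right = e[1], e[2]
        if not is_value(left):
            return ("add", step(left), right)
        if not is_value(right):
            return ("add", left, step(right))
        if left[0] == "int" and right[0] == "int":
            return ("int", left[1] + right[1])
    raise Stuck("no rule applies to %r" % (e,))


def run(e):
    """Keep stepping until a final value is reached."""
    while not is_value(e):
        e = step(e)
    return e


def is_stuck(e):
    """True if e is not a final value and still cannot take a step."""
    if is_value(e):
        return False
    try:
        step(e)
        return False
    except Stuck:
        return True


good = ("add", ("add", ("int", 1), ("int", 2)), ("int", 3))
bad = ("add", ("str", "test"), ("float", 0.5))

print(type_of(good), run(good), is_stuck(bad))
```

Here the static semantics rejects the expression bad, so the stuck case never arises for well-typed expressions. That is the content of progress for this toy language, although the sketch only demonstrates it on examples and does not prove it.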

Again, the real definition of progress is complicated by the fact that we
need to consider memory $M$.
The conjunction of type preservation and progress properties is called
type safety [WF94]. Without the progress property, every language could
be given a trivially type-preserving dynamic semantics that just stops eval-
uating whenever it hits an expression that would not preserve types. But
that doesn’t help write safer programs.

Quiz
1. Which of the rules convey important secret information about how
to implement a compiler correctly that is easy to miss?
2. How many ways are there to implement accesses like (*a)[i]?
3. Why is 2[i] not allowed in the C0 language when it is allowed in C?
4. Is it important how exactly the compiler implements things like e[-1]
= 1/0 or not?
5. How can you make sure that you always generate the most effective
code for the subtleties in the rules? What information do you need for
that? Define a dataflow analysis that solves (some) of these issues.
6. In the rules discussed here, what would happen if you moved
the primes of memory M around? Which permutations still give a
good language semantics? And which permutations are still good
for implementation purposes? And which permutations spoil
everything?
7. Under which assumptions can you implement a compiler correctly
using the rules that do not track @M ?
8. Can you write a compiler that does not distinguish between Lvalues
and Rvalues? Can you write a parser that does not?

9. Should programming languages have multidimensional arrays or should
they have an understanding of nested arrays of arrays of arrays
instead?

10. Some old C libraries use one-dimensional arrays. These libraries were
often translated from Fortran. They probably just didn’t know how
to write proper C, did they?

11. Suppose you hired a high-school student to translate a Fortran library
for numerical computation to C. Suppose it doesn’t work or occasionally
produces unexpected results. What is your first question?

12. Why is there a difference between e=e+a and e+=a? Should there
be a difference? Doesn’t this difference only confuse the user?

13. List all advantages and disadvantages that type preservation has when
writing a compiler.

14. List all advantages and disadvantages that type preservation has when
using a compiler.

15. List all advantages and disadvantages that type progress has when
writing a compiler.

16. List all advantages and disadvantages that type progress has when
using a compiler.

17. Is your job as a compiler designer easier if you can change the static
semantics of the programming language? How?

18. Is your job as a compiler designer easier if you can change the dy-
namic semantics of the programming language? How?

19. Is your job as a compiler designer easier if you can change the type
preservation aspects of the programming language? How?

20. Is your job as a compiler designer easier if you can change the type
progress aspects of the programming language? How?

21. In the last questions: what are the downsides for the user?

22. You want to add threads to C0. Which rules do you need to change
for that and how? Where are the difficulties?

References
[App98] Andrew W. Appel. Modern Compiler Implementation in ML. Cam-
bridge University Press, Cambridge, England, 1998.

[WF94] Andrew K. Wright and Matthias Felleisen. A syntactic approach
to type soundness. Inf. Comput., 115(1):38–94, 1994.
