Lecture Notes on Semantic Analysis and Specifications
15-411: Compiler Design
André Platzer

Lecture 13
1 Introduction
Now we have seen how parsing works in the front-end of a compiler and how instruction selection and register allocation work in the back-end. We have also seen how intermediate representations can be used in the middle-end. One important question concerns the last phase of the front-end: semantic analysis, which is used to determine whether the input program is actually semantically well-formed. Another important question arises in the first phase of the middle-end: translation of the dynamic aspects of advanced data structures. Even though both questions belong to different phases of the compiler, we answer them together in this lecture. The static and dynamic semantic aspects need to fit together anyhow.
A smaller subset of what is covered in this lecture can be found in the textbook [App98, Ch 7.2], which covers data structures.
2 Static Semantics

If we see an expression like e[t.f + x] in the source code, then what exactly is the type of the result? And is it a
well-typed expression at all? The answer depends on the type of e, which had better be an array type (otherwise the array access would be ill-typed). The answer also depends on the type of t, which had better be a struct type s; the type of t.f is then looked up according to the type of the field f declared in s. Finally, the answer depends on x. And if the addition t.f + x does not produce an integer, the whole expression still does not type-check. It is crucial to find out whether a program with such an expression is well-typed at all. Otherwise, we would compile it to something with a strange and arbitrary effect even though the source program made no sense at all.
All these answers depend on information from the context of the pro-
gram. One interesting indicator for a language is how many passes of analysis through the abstract-syntax tree are necessary to perform semantic analysis successfully.
A simple typing rule is that for plus expressions:
e1 : int    e2 : int
--------------------
e1 + e2 : int
It specifies that if e1 and e2 both have type int then e1 + e2 also has type int.
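As an illustration of how such a rule turns into code, here is a minimal type-checking sketch in OCaml (the AST constructors are made up for this illustration; they are not prescribed by the course):

(* A minimal sketch of a type checker implementing the rule above. *)
type typ = Int | Bool

type exp =
  | ConstInt of int32
  | ConstBool of bool
  | Plus of exp * exp

(* typecheck e = Some t  corresponds to the judgment  e : t *)
let rec typecheck (e : exp) : typ option =
  match e with
  | ConstInt _ -> Some Int
  | ConstBool _ -> Some Bool
  | Plus (e1, e2) ->
      (* both premises must derive int; then the conclusion is int *)
      (match (typecheck e1, typecheck e2) with
       | (Some Int, Some Int) -> Some Int
       | _ -> None)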
In the following, we will give typing rules that define the static seman-
tics of source program expressions.
3 Dynamic Semantics
The static semantics is necessary to make sense of a source code expression.
It only specifies it incompletely, though. We will also explain the dynamic
semantics of expressions, i.e., what their effect is when evaluated. This in-
formation is required for the translation phase in order to make sure that
the intermediate language generated for a particular source code snippet
actually complies with the semantics of the programming language, which
hopefully fits to the intention that the programmer had in mind when writ-
ing the program.
As a side-note, the job of compiler verification is to make sure that the
source program will be compiled to something that has exactly the same
effect as prescribed by the language semantics, regardless of whether the
source program is doing the right thing. The compiler’s job is to adhere
to this exactly. Contrast this to program verification, where the job is to make sure that the program fits the intentions that the programmer has in mind, as expressed by some formal specification of what it is meant to achieve, e.g., in the form of a set of pre/postconditions.
For describing the dynamic semantics of C0, we define how we evalu-
ate expressions and statements of the programming language. We need to
describe how an expression e will be evaluated to determine the result. For
this purpose, we want to define a relation e ⇒ v that specifies that e, when
evaluated, results in the value v. We want to define the relation e ⇒ v by
rules specifying the effect of each expression like
e1 ⇒ n1    e2 ⇒ n2    n = add(n1, n2)
-------------------------------------- (+?)
e1 + e2 ⇒ n

----- (0)
0 ⇒ 0
We also need a rule that evaluates a variable identifier x. But what should a variable evaluate to? Well, that depends on what its value is. The value of a variable identifier is stored at some address in memory (or a register, which we will talk about in a moment). Let's denote the memory address where x is
stored by addr(x). This memory address could be for a local variable on
the stack, for a spilled function argument on the stack (near beginning of
frame), or somewhere in a global data segment for global variables. Either
way, it is in memory. Thus, when we evaluate a variable identifier, the result is the contents of memory at that address:

x variable identifier
--------------------- (id?)
x ⇒ M(addr(x))
Yet we need to know the memory contents for this to make sense. So let
us reflect this in the notation and change our judgment to e@M ⇒ v to say
that expression e, when evaluated in memory state M evaluates to value v.
x variable identifier
--------------------- (id)
x@M ⇒ M(addr(x))
Yet now, what about local variables x that are stored in registers, or function arguments that have been passed in registers? The precise option would be to also include the register state R in the judgment e@M@R ⇒ v. Then a variable x that is stored in the register addr(x) = %eax would evaluate to R(%eax) instead of to M(addr(x)).
In the following, we only consider the memory state M, but other state can be tracked too with this principle. Thus the above rule turns into the more precise
e1@M ⇒ n1@M'    e2@M' ⇒ n2@M''    n = add(n1, n2)
--------------------------------------------------- (+)
e1 + e2@M ⇒ n@M''
Unlike the first rule (+?), the new rule now captures the semantics of left-to-right evaluation order. In the first rule (+?), we could still supply the premises in an arbitrary order and were not restricted to evaluating the subexpressions e1 and e2 in any particular order. The new rule (+) explicitly requires left-to-right evaluation, because the state M' resulting from evaluating e1 is the starting state for evaluating e2, whose resulting state M'' will be the resulting state of evaluating the whole expression e1 + e2.
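The threading of memory states in rule (+) translates directly into an evaluator where the state produced by one premise feeds the next. Here is a minimal sketch in OCaml; the AST and the representation of memory as a map from variable names to values are simplifying assumptions:

(* Sketch: the judgment  e@M ⇒ v@M'  as a function  eval e m = (v, m'). *)
module Mem = Map.Make (String)
type mem = int32 Mem.t

type exp =
  | Const of int32
  | Var of string
  | Plus of exp * exp

let rec eval (e : exp) (m : mem) : int32 * mem =
  match e with
  | Const n -> (n, m)                  (* constants leave M unchanged *)
  | Var x -> (Mem.find x m, m)         (* rule (id): look up M(addr(x)) *)
  | Plus (e1, e2) ->
      let n1, m' = eval e1 m in        (* e1@M  ⇒ n1@M'  *)
      let n2, m'' = eval e2 m' in      (* e2@M' ⇒ n2@M'' *)
      (Int32.add n1 n2, m'')           (* modular add, result in state M'' *)

In this small fragment no expression actually changes the memory yet, but as soon as expressions may contain function calls or allocation, the left-to-right threading order becomes observable.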
The static and dynamic semantics together give meaning to all elements
of the programming language. We treat the static and dynamic semantics
for various elements of the C0 programming language at the same time in
the following.
4 Small Types
So far, we have only used a programming language with minimal typing. Basically, the only two types so far were int and bool, which are easily distinguished by their respective syntactic occurrences in the language. Only int had been allowed as the type of declared variables, and bool only occurred in the test expressions of if, while, and for.
Real programming languages, including C0, have more serious types:

τ ::= int | bool | τ* | τ[] | struct s | a

where a is the name of a type abbreviation for some type τ that has been introduced in the form

typedef τ a
We mostly ignore the other C0 types char and string in this course.
For discussing the layout of the various types, we distinguish between small types that can fit into a register and large types that have to be stored in memory. First we discuss the small types. For the purpose of memory layout and register handling, we define the size |τ| of small types τ as follows:
|int|  = 4
|bool| = 4
|τ*|   = 8
|τ[]|  = 8
That is, int and bool are 32-bit, and pointers τ* are represented by 64-bit addresses on 64-bit machines. Arrays themselves are large values and array constants would be large, because we cannot pass a whole array in a register. But C0 allocates arrays on the heap like pointers, and they are only represented by their starting address. Hence, variables of array type have a small type, because we can fit the array address into a register.
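These size equations are straightforward to compute; a sketch in OCaml, with an assumed representation of C0 types:

(* Sketch: the size |tau| of small types, per the equations above. *)
type typ = Int | Bool | Ptr of typ | Array of typ

let size (t : typ) : int =
  match t with
  | Int | Bool -> 4        (* 32-bit values *)
  | Ptr _ | Array _ -> 8   (* 64-bit addresses on 64-bit machines *)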
In particular, we have data of two different sizes. Pointers are allocated from heap memory by the runtime system using the alloc(τ) library function, which returns a fresh chunk of memory at a location divisible by 8, ready to hold a value of type τ. In C-like programming languages, the null
address 0 (denoted by the constant NULL) is special in that it will never be
returned by alloc(τ ), except to indicate that the system ran out of memory
altogether. All memory access to the null pointer is thus considered bad
memory access.
5 Large Types
Array contents and structures are large types, because they do not (usually) fit into a register. We define their size as follows. For a struct declaration

struct s {
  τ1 f1;
  τ2 f2;
  ...
  τn fn;
}

the size is

|struct s| = pad(|τ1|, |τ2|, . . . , |τn|)
The function pad adds the sizes of its arguments, adding padding as neces-
sary in between and at the end. That is, elements of type int and bool are
aligned at memory addresses that are divisible by 4. Elements of type τ ∗
and τ [] are aligned at memory addresses divisible by 8. A compiler remem-
bers the byte offset of field fi in the memory layout of structure s in order
to find it later. We denote it by off(s, fi ).
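Here is a sketch of how a compiler might compute the field offsets off(s, fi) and the padded total size, assuming each small type is aligned to its own size (4 or 8) as described above. Field descriptions are triples (name, size, alignment):

(* Sketch of a struct layout computation with padding. *)
let align (off : int) (a : int) : int = (off + a - 1) / a * a

let layout (fields : (string * int * int) list) : (string * int) list * int =
  let offs, last, maxa =
    List.fold_left
      (fun (offs, off, maxa) (f, sz, a) ->
         let off = align off a in                 (* padding before the field *)
         ((f, off) :: offs, off + sz, max maxa a))
      ([], 0, 1) fields
  in
  (List.rev offs, align last maxa)                (* padding at the end *)

For struct s { int x; int* p; bool b; }, the call layout [("x", 4, 4); ("p", 8, 8); ("b", 4, 4)] yields off(s, x) = 0, off(s, p) = 8, off(s, b) = 16, and a total size of 24.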
Similarly to distinguishing between small and large types, we distin-
guish between small values (values of a small type) that fit into a register
and large values (values of a large type) that have to be stored in memory.
6 Structs
The typing rule for the static semantics of structs is simple and just says
that an access to a field f of a struct value e of type s results in a value of
type τ , where τ is the declared type of field f in s:
e : s    struct s { ... τ f; ... }
-----------------------------------
e.f : τ
Unfortunately, this only works well for small types, whose values can be returned in registers right away. For large types, this cannot really work, because the memory contents M(a) at a single location a do not even contain all the information, and we cannot store the whole object in a single register anyhow. For large types, evaluation produces an address instead, relative to which the content will be addressed further.
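As a sketch, the resulting evaluation strategy for field access might look as follows, where load stands for a hypothetical memory read and off for the precomputed off(s, f):

(* Sketch: evaluating e.f once e has evaluated to the address a of the struct. *)
type result = Value of int32 | Address of int

let eval_field (load : int -> int32) (a : int) (off : int) (small : bool)
  : result =
  if small then Value (load (a + off))   (* small field: read M(a + off(s,f)) *)
  else Address (a + off)                 (* large field: yield the address only *)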
7 Pointers
To explain the static semantics of pointers and pointer access, there are sim-
ple rules:
e : τ*
-------
∗e : τ

--------------
alloc(τ) : τ*

-----------
NULL : τ*
For large types, the memory M (a) at location a does not even contain all
information, and we cannot store the whole object in a register anyhow.
So instead, ∗e evaluates to a itself, relative to which the content will be
addressed further. When we dereference a pointer that is null, the program aborts with a memory violation (SIGSEGV).¹

¹ Machines implement this check by having page 0 unmapped in the virtual-memory page table.
8 Arrays
Arrays are almost like pointers. Both are allocated. The difference is that C0 pointers disallow pointer arithmetic, whereas arrays allow random access to their contents at computed integer positions. In particular, for arrays, the question arises what to do with an access out of bounds, i.e., outside the array size. Does it just access memory unsafely at wild places, or will it be detected safely and raise a runtime exception? In early labs, we will follow the unsafe C tradition and allow arbitrary behavior. In later labs, we will switch to safe compilation, more like in Java. First, we give simple typing rules explaining the static semantics:
e : τ[]    t : int
------------------
e[t] : τ

e : int
------------------------
alloc_array(τ, e) : τ[]
Now we consider the operational semantics. Note that the evaluation order
in array access (like everywhere else) is strictly left-to-right. So expression
e[t] will be evaluated by evaluating e first and t second, and then accessing
the result of e at the result of t:
e : τ[]    e@M ⇒ a@M'    t@M' ⇒ n@M''    a ≠ 0    M''(a + n|τ|) allocated    τ small
-------------------------------------------------------------------------------------
e[t]@M ⇒ M''(a + n|τ|)@M''
For the case where the address computation of the array itself yields NULL,
we can either raise a SIGSEGV before evaluating t or after. Both choices are
reasonable. The early choice saves operations in case of a SIGSEGV. The late choice, however, reduces the number of times that violations have to be checked for.
Side-note: an odd thing in C is that x[i] and i[x] are both valid array
accesses and equivalent, because both are just defined as ∗(x + i). C even
allows 2[x] instead of x[2].
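To make the side conditions of these rules concrete, here is a sketch of the checks that safe compilation would perform on every access. How the array length is obtained (say, from a hidden header word written by alloc_array) is an implementation choice assumed here, and the exception names are illustrative:

(* Sketch of the checks behind the array rule: a null test on the array
   address and a bounds test on the index, then the address a + n*|tau|. *)
exception Null_pointer    (* would surface as SIGSEGV *)
exception Out_of_bounds   (* the runtime error of safe compilation *)

let element_address (a : int) (length : int) (n : int) (elt_size : int) : int =
  if a = 0 then raise Null_pointer;                (* side condition a ≠ 0 *)
  if n < 0 || n >= length then raise Out_of_bounds;
  a + n * elt_size                                 (* the address a + n*|tau| *)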
9 Assignments to Lvalues
Assignments to primitive int variables are simple and are ultimately just implemented by a MOV instruction to the respective temp (see lectures 2 and 3). In more complicated languages with structured data, we can assign to other expressions such as a[10 − i] or ∗p or x.f or even ∗x.f or (∗x).f, alias x−>f. Not all expressions qualify as expressions to which we can assign. It makes no sense to try to assign a value to x + y or to f(∗x − 1); such expressions may only appear on the right-hand side of an assignment (rvalues). The expressions that make sense on the left-hand side of an assignment, because they identify a proper location (say in memory), are called lvalues. Lvalues are well-typed expressions of the form x, lv.f, ∗lv, or lv[e], i.e., they are built from variables by field access, pointer dereference, and array access.
e1 : τ[]    e1@M ⇒ a@M'    e2@M' ⇒ n@M''
------------------------------------------
e1[e2]@M ⇒l (a + n|τ|)@M''
The side conditions and failure modes for the address computation when evaluating the lvalue e1[e2]@M ⇒l ... of an array access are just like those for the value evaluation e1[e2]@M ⇒ ....
Using this lvalue relation ⇒l, we can define the effect of an assignment v = e. The semantics of a statement does not produce a value; it just has an effect on memory. Thus we just write s@M ⇒ @M' to describe the transition for a statement s. As a shorthand notation, we write M''{M''(a) ← w} for the memory state M''' that is obtained from a memory state M'' by changing the contents at address a to the value w.
A similar rule defines the effect of a compound assignment v ⊕= e for an operation ⊕.
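Here is a sketch of how the lvalue judgment ⇒l and assignment could fit together in an interpreter, assuming the lvalue is evaluated before the right-hand side as the left-to-right discipline suggests. Addresses and values are collapsed to plain integers, memory is a persistent map, and the variable and field cases of lvalues are omitted:

(* Sketch: lvalues evaluate to addresses, and v = e writes through them. *)
module Mem = Map.Make (Int)
type mem = int Mem.t

type exp = Const of int                   (* stand-in expression AST *)
type lval =
  | Deref of exp                          (* *e : e evaluates to an address *)
  | Index of exp * exp                    (* e1[e2] *)

exception Null_pointer

let eval (e : exp) (m : mem) : int * mem =
  match e with Const n -> (n, m)

let eval_lval (elt_size : int) (lv : lval) (m : mem) : int * mem =
  match lv with
  | Deref e ->
      let a, m' = eval e m in
      if a = 0 then raise Null_pointer else (a, m')
  | Index (e1, e2) ->
      let a, m' = eval e1 m in            (* e1@M  ⇒ a@M'  *)
      let n, m'' = eval e2 m' in          (* e2@M' ⇒ n@M'' *)
      if a = 0 then raise Null_pointer else (a + n * elt_size, m'')

(* v = e : first the address of the lvalue, then the right-hand side,
   then the update M''{M''(a) <- w}. *)
let assign (elt_size : int) (lv : lval) (e : exp) (m : mem) : mem =
  let a, m' = eval_lval elt_size lv m in
  let w, m'' = eval e m' in
  Mem.add a w m''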
10 Function Calls
Suppose we have a function call f(e1, . . . , en) to a function f that has been defined as τ f(τ1 x1, . . . , τn xn) { b }. We consider a simplified situation here and just assume there is a return variable called %eax in the function body b.
e1@M ⇒ v1@M1    e2@M1 ⇒ v2@M2    . . .    en@Mn−1 ⇒ vn@Mn    b@Mn' ⇒ @M'    τ small
-------------------------------------------------------------------------------------
f(e1, . . . , en)@M ⇒ M'(%eax)@M'
where Mn' is like Mn, except that the values vi of the arguments ei have been bound to the formal parameters xi, i.e., Mn'(x1) = v1, . . . , Mn'(xn) = vn.
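The argument evaluation in this rule is the same left-to-right threading as for +. A sketch of the corresponding interpreter fragment, where eval, run_body, bind, and read_result are parameters standing in for the rest of the interpreter:

(* Sketch of the call rule: arguments left to right, threading the
   memory state, then bind the formals and run the body. *)
let eval_call eval run_body bind read_result formals args m =
  (* e1@M ⇒ v1@M1, ..., en@M(n-1) ⇒ vn@Mn *)
  let vs_rev, mn =
    List.fold_left
      (fun (vs, m) e -> let v, m' = eval e m in (v :: vs, m'))
      ([], m) args
  in
  (* Mn' binds the formal parameters xi to the argument values vi *)
  let mn' = List.fold_left2 bind mn formals (List.rev vs_rev) in
  let m' = run_body mn' in                (* b@Mn' ⇒ @M' *)
  (read_result m', m')                    (* the result is M'(%eax) *)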
And now we remember that allocation is actually a function call in C0. Consequently, in the intermediate representation of our C0 compiler, side effects due to allocation can only occur at the statement level, not nested within expressions. Hence, specifying the semantics for the intermediate representation is actually easier (it does not need the complicated threading of memory states M'). But, unlike its intermediate representation, C0 itself still needs to respect the order of memory-state passing carefully.
11 Type Safety
An important property of programming languages is whether they are type-
safe. In a type-safe language, the static and dynamic semantics of a pro-
gramming language should fit together. If we have an expression e in a
program that has the type int, then we would be rather surprised to find at
runtime a result of evaluating e that is a float. If this could happen, then it is
rather hard to make sure that the program will always execute reasonably
even if the compiler accepted it as a well-typed program.
What we expect from the static and dynamic semantics of a type-safe
language is that types are preserved in the following sense. If we have a
program that is well-typed (the static semantics says it’s okay) and we fol-
low an evaluation step of the dynamic semantics, then the resulting pro-
gram is still well-typed (type preservation). Otherwise what can happen
is that we run a well-typed program and suddenly break the well-typing
leading to values out of the type ranges. That is, the property that we want (and need to prove for our static and dynamic semantics) is that

If e : τ and e@M ⇒ v@M' for an okay memory state M, then v : τ and M' is okay again

for a suitable definition of when a memory state M is “okay”, i.e., the types of the values that it stores are compatible with what the program expects.
The other property that one would expect from type-safe languages is that the dynamic semantics always knows what to do (with well-typed programs). We do not want to be stuck in the middle of a run or an interpretation of the program by the dynamic semantics rules, not knowing
where to go and not having a rule that allows a transition. For instance,
if the program contains the well-typed expression e + f and the dynamic
semantics does not know how to evaluate the odd expression “test”+0.5,
then we had better make sure that the evaluation of e can never lead to a string
“test” while, at the same time, the evaluation of f leads to the float 0.5.
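Stated side by side as slogans (a sketch; pinning down “M okay” precisely is part of the actual proof obligation):

If e : τ and e@M ⇒ v@M' with M okay, then v : τ and M' is okay. (preservation)

If e : τ and M is okay, then e@M ⇒ v@M' for some v and M', or a sanctioned runtime error such as SIGSEGV is signaled. (progress)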
Quiz
1. Which of the rules convey important secret information about how to implement a compiler correctly that is easy to miss?
2. How many ways are there to implement accesses like (∗a)[i]?
3. Why is 2[i] not allowed in the C0 language when it is allowed in C?
4. Is it important how exactly the compiler implements things like e[-1]
= 1/0 or not?
5. How can you make sure that you always generate the most efficient code for the subtleties in the rules? What information do you need for that? Define a dataflow analysis that solves (some of) these issues.
6. In the rules discussed here, what would happen if you moved the primes on the memory states M around? Which permutations still give a good language semantics? And which permutations are still good for implementation purposes? And which permutations spoil everything?
7. Under which assumptions can you implement a compiler correctly
using the rules that do not track @M ?
8. Can you write a compiler that does not distinguish between lvalues and rvalues? Can you write a parser that does not?
10. Some old C libraries use one-dimensional arrays. These libraries were
often translated from Fortran. They probably just didn’t know how
to write proper C, did they?
12. Why is there a difference between e = e + a and e += a? Should there be a difference? Doesn't this difference only confuse the user?
13. List all advantages and disadvantages that type preservation has when
writing a compiler.
14. List all advantages and disadvantages that type preservation has when
using a compiler.
15. List all advantages and disadvantages that type progress has when
writing a compiler.
16. List all advantages and disadvantages that type progress has when
using a compiler.
17. Is your job as a compiler designer easier if you can change the static
semantics of the programming language? How?
18. Is your job as a compiler designer easier if you can change the dy-
namic semantics of the programming language? How?
19. Is your job as a compiler designer easier if you can change the type
preservation aspects of the programming language? How?
20. Is your job as a compiler designer easier if you can change the type
progress aspects of the programming language? How?
21. In the last questions: what are the downsides for the user?
22. You want to add threads to C0. Which rules do you need to change
for that and how? Where are the difficulties?
References
[App98] Andrew W. Appel. Modern Compiler Implementation in ML. Cambridge University Press, Cambridge, England, 1998.