Continuation-Passing, Closure-Passing Style
Andrew W. Appel*
Trevor Jim†
CS-TR-183-88
July 1988
Revised September 1988
ABSTRACT
We implemented a continuation-passing style (CPS) code generator for ML. Our CPS language is represented as an ML datatype in which all functions are named and most kinds of ill-formed expressions are impossible. We separate the code generation into phases that rewrite this representation into ever-simpler forms. Closures are represented explicitly as records, so that closure strategies can be communicated from one phase to another. No stack is used. Our benchmark data shows that the new method is an improvement over our previous, abstract-machine based code generator.
* Supported in part by NSF Grant DCR-8603543 and by a Digital Equipment Corp. Faculty Incentive Grant.
† AT&T Bell Laboratories, Murray Hill, NJ. Current address: Laboratory for Computer Science, MIT, Cambridge, Mass.
1. Overview

Standard ML of New Jersey[1] is a compiler for ML written in ML. Its first code generator, based on an abstract stack machine, produced code with acceptable but not stunning performance. Examination of the code revealed that the greatest source of inefficiency seemed to be that each value went on and off the stack too many times.

Rather than hack a register allocator into the abstract stack machine, we decided to try a continuation-passing style (CPS)[2] code generator. Kranz’s ORBIT compiler[3] [4] shows how CPS provides a natural context for register allocation and representation decisions.

The beauty of continuation-passing style is that control flow and data flow can be represented in a clean intermediate language with a known semantics, rather than being hidden inside a “black box” code generator. The ORBIT compiler translates CPS into efficient machine code, making representation decisions for each function and each variable. ORBIT does an impressive set of analyses in its back end, but they’re all tangled together into a single phase. We have a series of phases, each of which re-writes and simplifies the representation of the program, culminating in a final instruction emission phase that’s never presented with complications.

The phases are:

    definitions.
    8. “Register spilling,” producing a CPS expression in which no sub-expression has more than n free variables, where n is related to the number of registers on the target machine.
    9. Generation of target-machine instructions.
    10. Backpatching and jump-size optimization.

Where the ORBIT compiler has one black box covering phases 6 through 9, we have four smaller black boxes. The interfaces between the phases are semantically well-defined, making it easier to isolate individual parts of the analysis to one phase.

This paper describes phases 4 through 9, and then presents an analysis based on profiling and benchmarks. Because of space limitations, we must assume that the reader is familiar with continuation-passing style.

2. Continuation-passing style

Our back-end representation language is a continuation-passing style (CPS) representation similar in spirit to Steele’s, but with a few important differences: we use the ML datatype feature to prohibit ill-formed expressions; we want every function to have a name; and we have n-tuple operators which make modelling closures convenient.

The italicized var’s are binding occurrences, and the others are uses of the variables.

All of Steele’s “atoms”[2] are represented in our cexp’s by variables. Constants are represented by globally-free variables entered in an auxiliary table. This means that a function application (APP) can be represented by the name of the function (var) and a series of arguments (var list).

The constraint that an APP can’t be a child of an APP is enforced by the fact that the arguments of the APP constructor are variables, not continuation-expressions.

One of the useful properties of CPS is that every intermediate value of a computation is given a name. In Steele’s representation, however, functions can still be anonymous, making it difficult for a code generator to keep track of them. Therefore, we eliminate LAMBDA from our CPS datatype, in favor of FIX, a general-purpose mutually recursive function definition in which names are explicitly bound to functions:

SELECT(i,r,v,cont) means “let v be the i-th field of r in cont”.

We have a constructor for indexed jumps (SWITCH), and a constructor (PRIMOP) for miscellaneous in-line primitive operations like integer and floating arithmetic, array subscript, etc. The expression

    PRIMOP(i,[a,b,c],[d,e],[F,G,H])

means to apply operator i to the arguments (a,b,c) yielding the results d and e, and then branch to one of the continuations F, G, or H. Each primitive operator has its own “signature”; for example

    PRIMOP(plus,[a,b],[d],[F])

takes two arguments, returns one result, and continues in only one way, whereas

    PRIMOP(lessthan,[a,b],[],[F,G])

takes two arguments, produces no result, and branches to F or G.
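
As a rough illustration of the constructors discussed above, the following Python sketch models continuation expressions with dataclasses. It is not the paper's ML datatype: the field names, the use of strings for variables, the assumed shapes of RECORD and SWITCH, and the helper free_vars are all assumptions made for this example. The point it tries to show is that APP takes only variables as operands, so an application nested inside another application cannot even be written down (Python itself does not enforce the annotations at run time; a static type checker would).

```python
from dataclasses import dataclass
from typing import Tuple, Union

Var = str  # assumption: variables are plain strings in this sketch


@dataclass(frozen=True)
class Record:  # assumed shape: bind `var` to a new record of `fields`, continue in `cont`
    fields: Tuple[Var, ...]
    var: Var
    cont: "Cexp"


@dataclass(frozen=True)
class Select:  # SELECT(i, r, v, cont): let v be the i-th field of r in cont
    index: int
    record: Var
    var: Var
    cont: "Cexp"


@dataclass(frozen=True)
class App:  # APP(f, args): operands are variables, never expressions
    func: Var
    args: Tuple[Var, ...]


@dataclass(frozen=True)
class Fix:  # FIX(defs, body): each definition is (name, formals, body)
    defs: Tuple[Tuple[Var, Tuple[Var, ...], "Cexp"], ...]
    body: "Cexp"


@dataclass(frozen=True)
class Switch:  # assumed shape: indexed jump on `var` to one of `branches`
    var: Var
    branches: Tuple["Cexp", ...]


@dataclass(frozen=True)
class Primop:  # PRIMOP(op, args, results, conts)
    op: str
    args: Tuple[Var, ...]
    results: Tuple[Var, ...]
    conts: Tuple["Cexp", ...]


Cexp = Union[Record, Select, App, Fix, Switch, Primop]


def free_vars(e: Cexp) -> frozenset:
    """Hypothetical helper: the variables used in e but not bound inside e."""
    if isinstance(e, App):
        return frozenset((e.func, *e.args))
    if isinstance(e, Select):
        return frozenset({e.record}) | (free_vars(e.cont) - {e.var})
    if isinstance(e, Record):
        return frozenset(e.fields) | (free_vars(e.cont) - {e.var})
    if isinstance(e, Switch):
        return frozenset({e.var}).union(*(free_vars(b) for b in e.branches))
    if isinstance(e, Primop):
        inner = frozenset().union(*(free_vars(c) for c in e.conts))
        return frozenset(e.args) | (inner - frozenset(e.results))
    if isinstance(e, Fix):
        names = frozenset(name for name, _, _ in e.defs)
        inner = free_vars(e.body)
        for _, formals, body in e.defs:
            inner |= free_vars(body) - frozenset(formals)
        return inner - names
    raise TypeError(f"not a continuation expression: {e!r}")


# PRIMOP(plus,[a,b],[d],[F]) with F = APP(k,[d]): d is bound by the primop.
plus_example = Primop("plus", ("a", "b"), ("d",), (App("k", ("d",)),))

# PRIMOP(lessthan,[a,b],[],[F,G]): no result, two possible continuations.
less_example = Primop(
    "lessthan", ("a", "b"), (), (App("k1", ("a",)), App("k2", ("b",)))
)
```

Here free_vars(plus_example) contains a, b and k but not d, which mirrors the distinction between binding occurrences and uses drawn in the text.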

zations; it’s simpler to do that in a separate phase. The converter has its hands full just with the semantics of the two languages (lambda-calculus and continuation-passing style) that it is translating between. It does make these representation decisions:

    • Makes control flow explicit by the use of continuations.
    • “Lowers” typed constructs like ML’s disjoint union constructors into untyped constructs like RECORDs with integer tags.
    • Optimizes the representation of case statements (arising from ML pattern-matching) into jump-tables or binary trees of comparisons.[5]
    • The pattern (λx. M)(N), which has the effect of let x = N in M, is treated specially. This is an optimization that could be left for the next phase, but it is convenient and cost-effective to recognize it here.

4. Reduction of the CPS

The next phase is a CPS “reducer” that performs a variety of optimizations. They are listed here, each with an indication of how often it is applicable for each 1000 operands* of CPS graph:

    205: Replace SELECT(i,r,...) with the i-th field of r, when r is a statically determinable record.
    181: Perform beta-reduction (inline expansion) on any function that is called only once, or whose body is not too large.
    72: Merge sets of mutually recursive function definitions (FIXes) in the hope that they will later share the same closure. Merging can be done if one FIX is the immediate child of another, and each has the same set of free variables.
    66: Perform eta-reduction (where f(x,y)=g(x,y), replace all uses of f with g).
    47: Perform constant-folding on SWITCHes and PRIMOPs.
    26: Hoist (un-nest, or enlarge the scope of visibility of) function definitions to enable the merging of FIXes.
    4: Remove unused arguments of functions.
    2: Flatten the arguments of (nominally single-argument) ML functions that are always called with a tuple of actual parameters.
    0.1: Remove the definitions of variables that aren’t used.

* This is a larger and more useful quantity than the number of nodes in the graph.

Our optimizer makes several passes (typically half a dozen) before no (or few) redexes remain. The test that produced the frequencies above counted all passes on a 16,000-line ML program that had a graph of size 118514. If the optimizer were to stop at module boundaries, the numbers would be somewhat different. Our compiler, for historical reasons, also has an optimizer at the (non-CPS) lambda-calculus level, which has some overlap with the CPS reducer and also will affect the counts given here.

5. Closure conversion

When one function is nested inside another, the inner function may refer to variables bound in the outer function. A compiler for a language where function nesting is permitted must have a mechanism for access to these variables. The problem is more complicated in languages (like ML) with higher-order functions, where the inner function can be called after the outer function has returned.

The usual implementation technique uses a “closure” data structure: a record containing the free variables of the inner function as well as a pointer to its machine code. A pointer to this record is made available to the machine code while it executes so that the free variables are accessible. By putting the code-pointer at a fixed offset (e.g. 0) of the record, users of the function need not know the format of the record or even its size in order to call the function.

In fact, several functions can be represented by a single closure record containing the union of their free variables and code pointers. A closure record is necessary only for a function that “escapes”: some of its call sites are unknown because it is passed as an argument, stored into a data-structure, or returned as a result of a function-call. A call of an escaping function is implemented by extracting the code pointer from the closure record and jumping to the function with the closure record as one of its arguments.
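
The calling convention for escaping functions described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the compiler's generated code: closure records are modelled as tuples, the code pointer is assumed to sit at offset 0, and the names adder_code, make_adder_closure and call_escaping are hypothetical. In real CPS code the function would never return; here the continuation's value is returned only so that the result can be inspected.

```python
CODE = 0  # assumption: the code pointer lives at offset 0 of every closure record


def adder_code(closure, y, k):
    """Closed code for `fn y => x + y`; the free variable x is read from the closure."""
    x = closure[1]
    return k(x + y)


def make_adder_closure(x):
    """Build the closure record: code pointer first, then the free variables."""
    return (adder_code, x)


def call_escaping(closure, *args):
    """Call a function whose code is not statically known.

    The caller needs only the fixed offset of the code pointer, not the
    size or layout of the rest of the record.
    """
    code = closure[CODE]
    return code(closure, *args)


add3 = make_adder_closure(3)
result = call_escaping(add3, 4, lambda v: v)  # pass 4 and an identity continuation
```

Because the caller looks only at offset 0, closure records of different sizes and layouts can be called through the same sequence, which is the property the text relies on.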

The following code fragment shows a sample ML function and the transformations applied to it. In rewriting a function f, our convention is to call the new, closed function f′, its closure record (if any) f, and the formal parameter corresponding to the closure record f′′. The first element of a closure record f will be f′, so that if f escapes to some context where f′ is not known, the code-pointer f′ can be SELECTed from the closure record. All other references to escaping functions become references to closure records.

* Mutually-recursive functions complicate the free variable analysis, but this turns out to be a classical dataflow problem (live-variable analysis) that can be solved by classical techniques.[6]
† This technique has been used in the Categorical Abstract Machine.[7]

The function g is known, so its free variable x has been added to its argument list; function f escapes, and requires a closure record. See the appendix for a larger example written in a more readable notation.

In more complicated examples, involving many variables from differing scopes, there can be a number of possible closure representations for a function.[8] One simple strategy is to use a flat closure containing all the free variables. At the other extreme, a number of closure records already exist when a new closure must be created, and a pointer or link to one can provide access to several of the necessary variables. Combinations of the two allow us to trade off time of closure creation, size of closures, and ease of access to variables from closures. The tradeoffs can be subtle: for example, linked closures can take up more space than flat closures because they hold

ables in both spill records and registers. Although this might seem expensive, profiling shows that all spill records take up only one or two percent of the total heap allocation in our VAX implementation, where n is 8.

7. Generation of target-machine instructions

Since modern garbage collectors are so cheap[9] [10] we have dispensed with the stack. This simplifies the code generator, which doesn’t need to do the analysis[2] [4] [11] necessary to decide which closure records can be allocated on the stack; it also simplifies the runtime system, making it easier to add multiple threads or state-saving operators to the programming environment.

Eliminating the stack is advantageous not only because it makes the compiler simpler. Operations like call-with-current-continuation (which, though not in ML, is compatible with the ML type system) are more efficient if there is no stack; a generational garbage collector can traverse just the newest call-frames, whereas it would have to traverse all the call frames on a stack; and in a multi-thread environment with stacks, a large stack space must be allocated for each thread even if it won’t all be used.

The expression handed to the target-machine instruction generator has a very simple form indeed. Procedures never return (as a result of CPS conversion), procedures don’t have non-constant free variables (as a result of closure analysis), scopes aren’t nested, and there are never more live variables than registers to hold them.

Since all representation decisions have been made in previous phases, the decisions made by the instruction generator have mostly to do with register allocation. As an example, consider the fragment SELECT(3,v,w,cexp), which requires that the third field of the record v be fetched into the (newly-defined) variable w, and execution to continue with the expression cexp. Both v and w can be allocated to machine registers, as a result of the (previous) spilling analysis. The variable v will have already been allocated at the time it was bound in the enclosing expression. The variable w must be allocated at this time to a register. Several heuristics are used.

Two-address instructions:
    Some instructions on some machines prefer to have their result argument in the same register as a source argument (this preference doesn’t typically apply to fetches, so it probably wouldn’t be used for this SELECT).

Targeting:
    Sometimes there’s an opportunity to avoid a move instruction later on. If the variable w is used (in cexp) as the n-th argument to a function f whose calling sequence requires the n-th argument in register r, and if r is not bound to any other live variable, then r should be used for w to save the cost of a move instruction when f is called.

Anti-targeting:
    If there is a call in cexp to a function f, one of whose arguments (which is not w) is to be passed in register r, then r is to be given less preference than another register, to avoid the cost of moving w out of the way when f is called.

Default:
    Otherwise, any register not already allocated to a live variable may be used. The work done by the spiller ensures that there will always be a register available.

As in the ORBIT compiler, the instruction generator treats “known” functions (those whose call sites are all known statically) specially. The parameters of a known function can be allocated to registers in a way that optimizes at least one of the calls to the function. Specifically, the code for a known function is not generated until a call site is found and generated. Then, the formal parameters of the function can be allocated to the same registers where the call’s actual parameters are already sitting; and the transfer of control can be by “falling through” without a jump. Thus, at least one call to each known function can be at no cost (except for actual parameters that are constants, and must be fetched into registers).

8. Benchmarks

We ran five different compilers on five different benchmark programs, all on a VAX 8650. The compilers were:

    Pascal  Berkeley Pascal with -O option
    ORBIT   Version 3.0 of the T system from Yale, with the ORBIT code generator.
    Old     Our old code generator for ML, based on an abstract stack machine.
    CPS     Our new code generator, described in this paper.
    CPS’    Our new code generator with aggressive cross-module optimization enabled.

Of course, comparisons between compilers for different programming languages may tell us more about the languages than about the compilers.

The programs were:

    Hanoi   The towers of Hanoi benchmark from Kranz’s thesis[4].
    Puzz    A compute-bound program from Forest Baskett[4].
    LenL    A tail-recursive function (or, in Pascal, a while loop) to compute the length of a list.
    LenR    A recursive function (not tail-recursive) to compute the length of a list.
    Comp    A 16,000-line compilation job in Standard ML. This is intended to measure the performance of real systems, not just artificial benchmarks.

              Hanoi    Puzz    LenL    LenR    Comp
    Pascal     .42     2.02    1.43    7.52
    ORBIT     ~.4     ~2.1      .9     3.6
    Old       1.28     8.81    5.62    5.71    1613
    CPS        .72     2.63    1.18    3.89    1432
    CPS’       .21     2.87    1.09    4.53    1224

The table above gives execution times in seconds, not including garbage-collection overhead (which can be arbitrarily large or small depending on memory size[10]).

9. Results

By separating the code generation into easily-understood phases with clean interfaces, we make it easier to produce robust optimizing compilers. Our method is not difficult to implement, and works well in practice.

Our CPS code generator produces code that runs up to four times faster than our old, stack-based code generator on small benchmarks, and seems comparable to Pascal and ORBIT for the examples we tested. But on the large, “real world” benchmark, our CPS code generator did only 25% better than the old one. The reason seems to be that though the new code generator produces very efficient code for tight, tail-recursive loops, big programs tend to have more function calls requiring saving of state.

It might be argued that since we save state by making continuation closure records, it is our stackless strategy that slows performance. However, we estimate that even if every closure record were stack allocated we would save only 6% to 10%. Furthermore, the stackless strategy tends to use less memory (typically on the order of 20%, but sometimes a much greater savings) than the old, stack-based code, because objects tend to be retained on the stack after their last use. And the stackless strategy has other advantages, like a simpler runtime system and garbage collector.

Acknowledgements

David MacQueen was co-designer and co-implementor of the front end of the ML compiler. David Kranz made many useful suggestions about benchmarks.

Appendix: An example in detail

To illustrate the several phases of the code generation, we show the transformations made to a fragment of an ML program. The program has a function, count, that takes a predicate (a function from α to boolean, for some type α) as an argument, and returns a function that counts how many elements of a list (of α) satisfy that predicate. Then a function countZeros is made by applying count to a predicate that returns true on 0 and false on other integers:

    fun count(pred) =
      let fun f(x::rest) =
                if pred(x)
                then 1+f(rest)
                else f rest
            | f nil = 0
      in f
      end

    val countZeros =
      count (fn 0 => true
              | _ => false)

This function will be translated by the compiler into lambda-calculus and then into CPS, but for illustrative purposes we will show all the transfor-