
Journal of Computer Languages 72 (2022) 101151


A surprisingly simple Lua compiler—Extended version


Hugo Musso Gualandi ∗, Roberto Ierusalimschy
PUC-Rio, Rio de Janeiro, Brazil

ARTICLE INFO

CCS Concepts: Software and its engineering → Interpreters; Just-in-time compilers

Keywords: Dynamic languages; Interpreters; Partial evaluation; Compilers; Just-in-time compilers

ABSTRACT

Dynamically-typed programming languages are often implemented using interpreters, which offer several advantages in terms of portability and flexibility of the implementation. However, as a language matures and its programs get bigger, programmers may seek compilers, to avoid the interpretation overhead.

In this study, we present LuaAOT, a simple ahead-of-time compiler for Lua derived from the reference Lua interpreter. We describe two alternative compilation strategies. The first one exemplifies an old idea of using partial evaluation to produce a compiler based on an existing interpreter. Its contribution is to apply this idea to a well-established programming language. We show that with a quite modest effort it is possible to implement an efficient compiler that covers the entirety of Lua, including coroutines and tail calls. The whole implementation required less than 500 lines of new code. For this effort, we reduced the running time of our benchmarks by 20% to 60%.

The second compilation strategy is based on function "outlining", where each bytecode is implemented by a small subroutine. This strategy reduces executable sizes and compilation times, at the cost of not speeding up the running times as much as the first compilation method.

1. Introduction

Dynamic programming languages are popular for many applications, including scripting. They are often implemented using an interpreter, which makes it easier to load code fragments at run-time and enables a fast test-change-recompile development loop. However, interpreted programs are often slower than compiled programs, due to the interpretation overhead.

The conventional wisdom is that the most efficient implementations for dynamic languages are just-in-time (JIT) compilers, which can take advantage of run-time information to perform speculative optimizations. Ahead-of-time (AOT) compilers have a steeper hill to climb, because they must rely on clever static analysis or type inference to inform their optimizations. Nevertheless, in this paper we show that a very simple compiler, focused exclusively on reducing the interpretation overhead, can deliver respectable improvements for a modest implementation effort. We also argue that such a simple compiler can provide useful insight about the performance of the interpreter that it is based on.

Our motivation for this paper was our previous work on Lua compilers, in particular the Pallene language [1]. Pallene is superficially similar to a typed dialect of Lua, where the type information allows the compiler to perform significant optimizations to the code. Because the type information is not speculative, Pallene's compiler can work ahead of time and be simpler than a JIT compiler. However, a relevant question is how much of this improvement is due to the types and how much of it is due to just using a compiler instead of an interpreter. To help answer this question, we developed LuaAOT, a simple ahead-of-time compiler for Lua which does not perform any type-based optimizations.

The inspiration for the architecture of LuaAOT is an old idea of producing a compiler from an existing interpreter by unrolling and specializing the core interpreter loop [2,3]. Our contribution is to show that this idea can be successfully applied to an established language. Using less than 500 lines of new code, our compiler implements the entirety of Lua, including coroutines and tail calls.

One thing that contributes to the simplicity of LuaAOT is that we can delegate a significant part of the work to a C compiler. We get several optimizations "for free", including constant propagation and dead code elimination. This allows us to remove much of the interpreter overhead while still emitting straightforward code that is mostly copied from the existing interpreter.

Before going on, we should emphasize that LuaAOT catches the low-hanging fruit of compiler optimizations for dynamic languages. Its performance is not competitive with a reasonable JIT compiler. Its selling point is that it achieves a decent performance boost for a surprisingly low cost.

The next section has a brief discussion about partial evaluation of virtual machines. Section 3 describes LuaAOT, our take on that idea. Next, we evaluate our artifact in Section 4. Finally, we discuss related work in Section 5 and draw some conclusions in Sections 6 and 7.

∗ Corresponding author.
E-mail addresses: [email protected] (H.M. Gualandi), [email protected] (R. Ierusalimschy).

https://doi.org/10.1016/j.cola.2022.101151
Received 18 February 2022; Received in revised form 12 July 2022; Accepted 16 August 2022
Available online 24 August 2022
2590-1184/© 2022 Elsevier Ltd. All rights reserved.

Fig. 1. A Lua function and its bytecode.

This paper is an extended and expanded version of a previous paper [4] published at the Brazilian Symposium on Programming Languages. The main changes in this expanded version are the addition of more benchmarks and a compilation method that uses function "outlining" to reduce the size and compilation time of the generated code.

2. Virtual machines and partial evaluation

One of the most popular ways to implement an interpreter for a dynamically typed programming language is via a virtual machine. The virtual machine defines an intermediate language of portable instructions, also called bytecodes. This approach is illustrated in Fig. 1, which shows a small Lua function and the corresponding portable instructions for the Lua virtual machine. The first instruction, loadi 3 17, loads the integer constant 17 into the register R[3] (the address of the local variable 'd'). Then, the next instruction, add 0 1 2, adds the contents of the R[1] and R[2] registers (variables 'b' and 'c'), storing the result in R[0] (variable 'a'). The next add is similar. Finally, the jump instruction jumps three instructions back in the code, repeating the loop. Note that Lua optimized away the test for the while condition, given that it was a constant. To ease the presentation, we represented these instructions using records. The actual Lua interpreter encodes the instruction components as bit-fields of a 32-bit integer [5].

Fig. 2 shows a typical inner loop of a virtual machine. It executes the portable instructions, one by one, like a conventional CPU. The interpreter maintains a stack, which is where the local variables are stored. The program counter points to the current instruction and guides the control flow. Data operations, such as loadi and add, manipulate the values in the stack. Control-flow operations, such as jump, modify the program counter. In this example, DoAdd is a macro that does the actual work, including checking the types of the arguments.

Fig. 2. A virtual machine/interpreter.

The virtual machine in our example is register-based, similarly to the Lua virtual machine [5]. The defining characteristic of a register-based virtual machine is that the data-manipulation instructions can read from and write to any position in the stack. The other common way to design a virtual machine is in a stack-based discipline, where the data-manipulation instructions always push values to the top of the stack and pop results from its top. We believe that the technique we describe in this paper should also apply to stack-based virtual machines. Although stack and register VMs differ in how they encode the opcodes, the basic dispatch mechanism (the switch/case) is similar.

In this paper we are interested in the interpretation overhead that is associated with decoding and dispatching virtual-machine instructions. The decoding overhead comes from fetching the next virtual instruction from memory and computing the values of its parameters; in the example, those would be the tag and arg fields. The dispatch overhead happens as the interpreter transfers the control to the appropriate instruction handler. The most basic form is a while-switch loop, but some interpreters might use more advanced dispatch techniques such as "threaded code" [6]. The Lua 5.4 interpreter can be configured to use either a portable while-switch loop or a dispatch table that takes advantage of the computed-goto GCC extension.

To minimize the decoding and dispatching overheads, LuaAOT produces a modified version of the inner interpreter loop that is specialized to run a given function. Fig. 3 provides an example of this idea. The instructions become compile-time constants and the jumps become goto statements. The execute_foo function can be seen as a partial evaluation of the execute function, where the prog argument is fixed to be the foo array from Fig. 1. This method of producing a compiler from an interpreter is sometimes called a Futamura projection [2].

Fig. 3. Specializing the interpreter to a particular function.

This simple compilation strategy does not optimize all the things that an advanced Lua compiler can try to optimize. For example, there is no attempt to store Lua variables in CPU registers. Similarly to the Lua interpreter, LuaAOT stores all local variables in the Lua stack. However, the simple compilation strategy does provide an idea of what can be achieved by optimizing the low-hanging fruit. In particular, it can tell us about the interpretation overhead of the original interpreter. Since the bytecode instructions are compile-time constants, the C compiler can use constant propagation to remove most of the operations for decoding the instructions. Similarly, because the Lua jumps are converted to C gotos, we avoid indirect jumps and dispatch tables. The control-flow graph is also exposed to the C compiler, possibly allowing further optimizations.
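Figs. 2 and 3 themselves are not reproduced in this version, so the sketch below illustrates both halves of the idea. It follows the simplified record-style presentation above; the Value type, the DoAdd macro, and the halt opcode are illustrative stand-ins rather than the real interpreter's definitions.

    /* A simplified while-switch interpreter in the style of Fig. 2. The
       record-style instructions follow the presentation above; the real
       interpreter packs these fields into a 32-bit integer. */
    typedef enum { OP_LOADI, OP_ADD, OP_JUMP, OP_HALT } Tag;  /* OP_HALT added here */
    typedef struct { Tag tag; int a, b, c; } Instr;
    typedef long Value;                                  /* stand-in for a Lua value */
    #define DoAdd(s, a, b, c) ((s)[a] = (s)[b] + (s)[c]) /* real one also checks types */

    void execute(const Instr *prog, Value *stack) {
        int pc = 0;
        for (;;) {
            Instr i = prog[pc++];       /* fetch and decode: the decoding overhead */
            switch (i.tag) {            /* transfer to a handler: the dispatch overhead */
                case OP_LOADI: stack[i.a] = i.b;            break;
                case OP_ADD:   DoAdd(stack, i.a, i.b, i.c); break;
                case OP_JUMP:  pc += i.a;                   break;
                case OP_HALT:  return;
            }
        }
    }

    /* Partially evaluating execute for the foo bytecode of Fig. 1, in the
       style of Fig. 3: the operands are now compile-time constants and the
       backward jump became a C goto. */
    void execute_foo(Value *stack) {
        stack[3] = 17;                  /* loadi 3 17 */
    label_1:
        DoAdd(stack, 0, 1, 2);          /* add 0 1 2 */
        DoAdd(stack, 0, 1, 2);          /* the second add (operands illustrative) */
        goto label_1;                   /* jump: back to the first add */
    }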
3. The LuaAOT compiler

In this section we describe how we built the LuaAOT compiler and which challenges we encountered in the process. The compiler is free software and the source code is publicly available [7].


Lua, like many scripting languages, can be extended by binary modules written in other languages (such as C). This was a natural mechanism for integrating LuaAOT with Lua. The core of the LuaAOT compiler is the luaot executable, which takes Lua source code as input and outputs such a binary extension module. For the most part these compiled modules can be loaded by the Lua interpreter and runtime just like any other binary module. The only difference is that we had to patch the interpreter to tell it how to call our compiled Lua functions.

A possible alternative implementation would be for the LuaAOT compiler to generate a standalone executable. The standard procedure for this would be to bundle the interpreter inside the executable, because the compiled code still needs the Lua runtime. However, creating standalone programs for Lua in this manner is not typical. It is more common to install and distribute Lua applications separately from the interpreter.

3.1. The interpreter

The Lua interpreter plays a central role in our system, which comprises both a compiler and a slightly modified interpreter. There are multiple reasons for this. The first is that programs can contain both compiled and non-compiled sections and the interpreter is necessary to run the non-compiled parts. Moreover, our compiled code still requires the interpreter: because we partially evaluate the inner interpreter loop, the compiled code calls several subroutines from the interpreter. Furthermore, the interpreter codebase also houses the Lua runtime and garbage collector, which are used by both the compiled and the non-compiled code.

The custom interpreter has few changes compared to the original Lua interpreter. The first modification was to add an additional field to the data structure that represents Lua functions. This field refers to the compiled code for that function, if it exists. The C extension modules that we generate include initialization code that associates the Lua functions with their compiled C implementation.

After this, we told the interpreter how to use these compiled functions. At the start of the execute subroutine (the one that Fig. 2 talks about), we add a check just before the inner loop. If the function has a compiled version we transfer the control to that, instead of continuing to the usual interpreter loop.

The next change we made is related to the public interface that is exposed to C extension modules. Our partial evaluation generates code that calls many internal functions from the Lua interpreter, which in normal circumstances are not exposed to extension modules. To allow our generated code to use these internal functions, we modified the interpreter to make all those internal names public.

Finally, we proceeded to implement the code generator, examining the bytecodes one by one. Most had their code directly pasted into the compiler; some required modifications to the generated code, but there was one case that also required modifications to the interpreter: the bytecodes for function calls (call, tailcall, and return). The Lua 5.4 interpreter has an optimization where Lua-to-Lua calls reuse the same execution frame for the execute function. Like a conventional CPU, the Lua interpreter implements these function calls by updating the prog argument and the program counter, so that the same interpreter loop naturally runs the called function. Unfortunately, this implementation is incompatible with our compiled code, where each execute function is specialized to a particular Lua function. Our solution was to disable this optimization. We believe that with additional work it might have been possible to keep it. However, disabling it was certainly simpler.

We should stress that this change did not harm the tail-recursive functions, which are guaranteed to use O(1) stack space. In the generated C code for the tailcall instruction, the crucial function call appears in a tail position, so that a good C compiler can perform the required tail-call optimization.

As important as the changes we made are the many things we did not need to modify. Other than adding a single field to the function objects, we made no other changes to the internal Lua data types. We also did not modify the Lua runtime system, the garbage collector, or the Lua standard library.
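A minimal sketch of these two interpreter-side changes, assuming the internal names of Lua 5.4 (Proto, LClosure, luaV_execute, clLvalue); the aot_implementation field and its function-pointer type are hypothetical names introduced here for illustration:

    /* In the Proto struct (the data structure that represents a Lua
       function), one new field points to the compiled code, if any. */
    typedef void (*AotCompiledFunction)(lua_State *L, CallInfo *ci);  /* hypothetical */

    struct Proto {
        /* ... all the existing fields stay unchanged ... */
        AotCompiledFunction aot_implementation;  /* NULL when there is no compiled code */
    };

    /* At the start of the interpreter's execute subroutine, a check just
       before the inner loop transfers control to the compiled version. */
    void luaV_execute(lua_State *L, CallInfo *ci) {
        LClosure *cl = clLvalue(s2v(ci->func));   /* the closure being run */
        if (cl->p->aot_implementation != NULL) {
            cl->p->aot_implementation(L, ci);     /* run the compiled code */
            return;
        }
        /* ... the usual interpreter loop follows ... */
    }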
3.2. The code generator

The code generator receives a Lua module and converts it to C. To do this, it calls the Lua bytecode compiler and then it converts the bytecode to C. For each function in the module, the code generator produces an appropriate function header and then it iterates over the function's bytecodes, outputting a block of C code for each instruction. For the most part, these blocks of C code are copied verbatim from the original Lua interpreter. However, we had to adapt a few bytecodes. The main categories were bytecodes that modify the program counter, bytecodes using C gotos, and bytecodes related to function calls.

In the original interpreter, jumps are implemented by assigning to the interpreter variable representing the program counter. In our compiler, we replaced each of these jumps by a C goto. This required making the appropriate changes to the generated code for the jump instruction, as well as to the instructions that implement for-loops. We also had to change the instructions for binary operations, because of how Lua implements operator overloading. In Lua, every binary operation is followed by a special instruction (mmbin), which handles overloading. When the operands have the expected types (e.g., numbers for the add operation), the binary operation increments the program counter to skip this next instruction. This means that all binary operations contain an implicit jump, which our compiler must also replace by a goto.

The next category of instructions we had to adapt were the ones that use goto statements in their original implementation. One example is the forcall instruction, which is always followed by a forloop. As an optimization, the Lua interpreter uses a goto to jump straight to the handler for the forloop, bypassing the usual dispatch logic. Since our compiler is already removing the run-time instruction dispatching, we simply removed this optimization from our generated code.

The instructions for function calls also had the issue of gotos in the original implementation, as we discussed in the previous section. However, in this case we had to apply our changes both to the interpreter and to the generated code, so that uncompiled Lua functions could properly call the compiled ones.
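As a concrete illustration, the generated code for one add instruction and its trailing mmbin might look roughly like the sketch below; the helper names and instruction numbers are illustrative, not the real interpreter macros (the appendix shows actual LuaAOT output):

    /* Inside the generated execute function, for "add 0 1 2" at
       instruction 5, followed by its mmbin at instruction 6: */
    label_5: {
        /* block copied from the interpreter's OP_ADD handler; the
           register numbers are now compile-time constants */
        if (operands_are_numbers(stack, 1, 2)) {    /* fast path */
            raw_add(stack, 0, 1, 2);
            goto label_7;   /* replaces pc++: skip the mmbin instruction */
        }
        /* otherwise fall through into the mmbin handler */
    }
    label_6: {
        do_mmbin(L, stack, 0, 1, 2);    /* operator-overloading metamethod */
    }
    label_7: ;  /* code for the next instruction follows */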

3.3. Coroutines

Lua's coroutines [8,9] are a powerful control-flow mechanism, useful for asynchronous programming. They operate in a similar space to generators and delimited continuations. Lua implements coroutines by maintaining a separate call stack for each coroutine. When a coroutine yields, the interpreter saves the current program counter and exits from the interpreter loop (via a longjmp). When the coroutine is resumed, the interpreter must continue the execution loop from where it left off.

To make our compiler compatible with coroutines, we had to teach it how to restart the execution from the point where the coroutine was interrupted. To do this, we need to jump to the appropriate location in the code, according to the saved value of the program counter. Fig. 4 illustrates how we do this. At the start of the function, we insert a switch-case that jumps to the location indicated by the saved program counter. The rest of the compiled function, including the jump labels, is the same as the version without coroutine support, shown in Fig. 3. The switch-case is used only once, at the start of the function. After this, all the other jumps happen as previously described, with gotos.

Fig. 4. Support for Lua coroutines.
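Continuing the sketch from Section 2, a hedged illustration of this prologue; the Coro struct and its saved_pc field are stand-ins for Lua's real coroutine bookkeeping:

    /* Extending the Section 2 sketch with coroutine support. */
    typedef struct { int saved_pc; } Coro;   /* illustrative */

    void execute_foo(Coro *co, Value *stack) {
        /* Restart dispatch, used only once at the start: jump to the
           saved program counter (0 on a fresh call). */
        switch (co->saved_pc) {
            case 0: goto label_0;
            case 1: goto label_1;
            case 2: goto label_2;
        }
    label_0:
        stack[3] = 17;            /* loadi 3 17 */
    label_1:
        DoAdd(stack, 0, 1, 2);    /* add 0 1 2 */
    label_2:
        DoAdd(stack, 0, 1, 2);    /* second add */
        goto label_1;             /* all later jumps remain plain gotos */
    }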
3.4. Alternative compilation without gotos

A fragile aspect of LuaAOT is the optimization of replacing Lua jumps by C gotos. While the implementation of the Lua interpreter gave us the opportunity to use gotos without too much trouble, doing this to a different interpreter might have been more difficult. For example, the Lua interpreter is written in C, which is a language with gotos. Had it been written in a different language, it might have been harder to use gotos in the partial evaluation process. Another important aspect is that Lua's core interpreter loop is all in a single function. Had it been broken into smaller subroutines, it might have been more difficult to use gotos, because one subroutine would not be able to use goto to jump into another.

In response to this, we developed an alternative design for jumps, this time using a trampoline-like pattern instead of gotos. As we can see in Fig. 5, there is still a switch-case. However, it dispatches based on the program counter instead of on the instruction tag. For instructions that do not jump, the handler falls through to the handler for the next instruction. For the instructions that do modify the program counter, the handler jumps back to the start of the switch-case, which then re-dispatches to the desired jump target. The trampoline analogy comes from this two-step jumping scheme: the break statement drops back to the start of the loop and then the switch-case bounces us back to the destination label.

Fig. 5. Compilation without gotos, using a trampoline.

In terms of implementation effort the trampoline approach is even simpler than the goto-based one. Most of the manual modifications that we described in Section 3.2 involved replacing Lua jumps with C gotos. In the trampoline implementation, the vast majority of those sections can be copy-pasted without any changes at all. This design also simplifies the support for coroutines, because the trampoline already includes the switch-case that the coroutines need. The only difference is the initial value of the program counter: instead of always starting from zero (the first instruction, as is done in Fig. 5), it should start from the program counter value that the coroutine saved (as is done in Fig. 4).

The obvious downside of trampolines is that they preserve some of the dispatch-related overhead from the interpreter. We will evaluate this overhead in Section 4.
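A sketch of the same foo function compiled in this style, in the spirit of Fig. 5 (reusing the illustrative definitions from Section 2); note that the dispatch is on the program counter, not on the instruction tag:

    void execute_foo_trampoline(Value *stack) {
        int pc = 0;                    /* a coroutine resume would instead
                                          start from the saved pc */
        for (;;) {
            switch (pc) {
            case 0:
                stack[3] = 17;         /* loadi 3 17 */
                /* no jump: fall through to the next instruction */
            case 1:
                DoAdd(stack, 0, 1, 2); /* add 0 1 2 */
            case 2:
                DoAdd(stack, 0, 1, 2); /* second add */
            case 3:
                pc = 1;                /* jump: set the target pc... */
                break;                 /* ...and bounce off the trampoline */
            }
        }
    }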

3.5. Alternative compilation with opcode subroutines

As we will discuss in more detail in Section 4, we observed that LuaAOT can generate fairly large binaries, due to how it effectively "inlines" and copy-pastes the implementation of every VM instruction. With this problem in mind, we designed an alternative implementation in which each opcode is split off into a separate subroutine.

This technique is illustrated in Figs. 6 and 7. The first thing we did was refactor the local variables that represent the interpreter state (stack, program counter, etc.), placing that data in a struct that can be passed by reference. Then, for each instruction we created a helper function that receives the instruction parameters as arguments. Keep in mind that while the helper functions in Fig. 6 only contain a single line of code, in the real interpreter they are typically longer than that.

Fig. 6. Extracting regular opcodes into functions.

The body of the compiled Lua function consists of a series of function calls to the helper functions, one per instruction. Our compilation still gets rid of the run-time instruction decoding and dispatching, as the instruction parameters are encoded as constants passed as function arguments.

Similarly to before, control-flow instructions that modify the program counter are treated specially. For unconditional jump instructions, we put the goto statement right after the call to the opcode subroutine. For conditional jump instructions, the subroutine returns a boolean saying whether to make the jump. Finally, for the jmp opcode, which is just a single goto, we did not create a helper function and just inlined its definition as before.

Fig. 7. Extracting control-flow opcodes into functions.

Another change we made was related to common opcode pairs. There are a couple of places where Lua encodes an operation as a sequence of two opcodes, because they would not fit in a single instruction. The most common case where this happens is the mmbin instructions that follow every binary arithmetic instruction, which we discussed in Section 3.2. LuaAOT does not share the same limitation of limited space to encode instructions, so what we did was merge the mmbin instructions into the preceding instruction.

We hypothesized that this function-based compilation would produce a smaller amount of C code, which could lead to better compilation times and smaller executables. For running time, it was harder to guess. In one direction, having smaller executables could lead to better performance due to improved instruction cache usage. However, refactoring the local interpreter variables into a struct and introducing function-call overhead could potentially harm the performance.
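In the spirit of Figs. 6 and 7, a sketch of the function-based output, again with illustrative names and reusing the simplified definitions from Section 2:

    /* The interpreter state moves into a struct that is passed by reference. */
    typedef struct { Value *stack; int pc; } VMState;   /* illustrative */

    /* One helper per opcode; the decoded operands arrive as plain C
       arguments, so they are still compile-time constants at each call site. */
    static void op_loadi(VMState *st, int a, int b) { st->stack[a] = b; }
    static void op_add(VMState *st, int a, int b, int c) { DoAdd(st->stack, a, b, c); }
    /* A conditional jump would instead return whether to take the jump:
       static int op_test(VMState *st, int a) { return st->stack[a] != 0; } */

    void execute_foo_outlined(VMState *st) {
        op_loadi(st, 3, 17);      /* loadi 3 17 */
    label_1:
        op_add(st, 0, 1, 2);      /* add 0 1 2 */
        op_add(st, 0, 1, 2);      /* second add */
        goto label_1;             /* jmp stays inlined as a plain goto, no helper */
    }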
4. Evaluation

To evaluate LuaAOT, we studied its performance and we measured the complexity of its implementation. We also made a qualitative analysis of the requirements that an interpreter must fulfill to allow a compiler in the style of LuaAOT. The source code for the benchmarks and the related scripts is available online, in the same repository as the compiler [7].

4.1. Correctness

Our main tool for assessing the correctness of LuaAOT was the test suite from the reference Lua interpreter [10]. Although tests cannot prove the absence of bugs, they certainly help. We ensured that LuaAOT could run the full Lua test suite without crashing and that it produced the same results as the reference interpreter. During this process we found and fixed several bugs, including some tricky ones involving recursive functions and tail calls.

4.2. Running time

To study the performance of LuaAOT, we compared the running time of the compiled programs with the running time of the interpreter. To provide a baseline, we also compared the results with LuaJIT [11], an advanced just-in-time compiler for Lua.

The benchmarks we used come from the Computer Language Benchmarks Game [12] and Are We Fast Yet [13]. From the former we excluded three benchmarks: pydigits, regex-redux, and reverse-complement. The first two require external libraries that are not part of the Lua standard library. The latter is bottlenecked by the string library, which is implemented in C, therefore making it unsuitable for evaluating the performance of the interpreter.

We carried out the measurements on a laptop with an Intel i7-10510U CPU, running Fedora Linux 35. We used Lua version 5.4.3 and LuaJIT version 2.1.0-beta3. For the C compiler we used GCC version 11.2.1, with the -O2 optimization level. For each benchmark we picked an input size large enough to ensure that the fastest implementation took at least one second to run. We ran each benchmark 20 times.

The complete performance data is listed in Table 1, which displays the average running time as well as the observed variation. The error intervals refer to the difference between the average and the maximum or the minimum time, whichever was greater. In this table, the LuaAOT column is the default LuaAOT compiler; Trampoline refers to the compilation strategy without gotos; Function is the compilation strategy with one function call per instruction; Struct is a variant of the Function strategy without the separate functions, only refactoring the interpreter state into a struct (in order to measure the cost of that refactoring).

Before we get into the alternative compilation strategies, let us start by looking at the default LuaAOT compiler. In Fig. 8 we compare the performance of LuaAOT against the reference Lua interpreter and LuaJIT, with the times normalized by the average time of the reference interpreter.

Table 1
Running times, in seconds.
Benchmark Lua Trampoline LuaAOT Function Struct LuaJIT
Binary Trees 3.49 ± 0.06 3.08 ± 0.14 2.87 ± 0.07 3.15 ± 0.69 2.92 ± 0.11 1.24 ± 0.50
Fannkuch 43.97 ± 2.43 25.87 ± 2.76 21.79 ± 0.88 33.94 ± 2.40 32.90 ± 0.62 7.38 ± 0.25
Fasta 4.84 ± 0.11 4.05 ± 0.33 3.74 ± 0.17 4.46 ± 0.37 4.43 ± 0.09 1.24 ± 0.07
K-Nucleotide 3.91 ± 0.14 3.35 ± 0.10 3.21 ± 0.17 3.58 ± 0.14 3.64 ± 0.74 0.95 ± 0.07
Mandelbrot 14.27 ± 1.63 10.49 ± 0.71 6.80 ± 0.11 11.36 ± 1.14 10.94 ± 0.27 1.63 ± 0.05
N-Body 17.85 ± 1.45 12.26 ± 1.21 10.49 ± 0.36 14.15 ± 0.69 14.59 ± 0.80 1.10 ± 0.04
Spectral Norm 34.64 ± 1.12 28.00 ± 12.64 18.76 ± 0.56 28.28 ± 0.71 28.38 ± 2.27 1.21 ± 0.02
CD 1.89 ± 0.04 1.87 ± 0.10 1.68 ± 0.10 1.80 ± 0.06 1.79 ± 0.07 0.90 ± 0.15
Deltablue 2.08 ± 0.20 2.00 ± 0.20 1.89 ± 0.12 2.06 ± 0.09 2.09 ± 0.15 1.05 ± 0.17
Havlak 7.63 ± 0.32 7.41 ± 0.18 7.20 ± 0.12 7.59 ± 0.19 7.59 ± 0.15 4.36 ± 0.17
JSON 5.06 ± 0.11 4.89 ± 0.19 4.52 ± 0.42 4.88 ± 0.11 4.87 ± 0.15 1.03 ± 0.07
List 2.58 ± 0.11 2.01 ± 0.15 1.76 ± 0.54 2.23 ± 0.11 2.25 ± 0.12 1.02 ± 0.05
Permute 2.64 ± 0.08 2.00 ± 0.17 1.93 ± 0.15 2.58 ± 0.55 2.52 ± 0.19 0.09 ± 0.04
Richards 3.61 ± 0.13 3.09 ± 0.09 2.86 ± 0.12 3.34 ± 0.16 3.32 ± 0.15 0.89 ± 0.08
Hanoi Towers 4.16 ± 0.12 3.40 ± 0.15 3.22 ± 0.47 4.13 ± 0.38 4.11 ± 0.13 0.33 ± 0.09

Fig. 8. LuaAOT and LuaJIT running times, normalized by reference interpreter.

In all benchmarks, LuaAOT was faster than the Lua interpreter, but slower than LuaJIT. In the microbenchmarks, the reduction in running time compared to interpreted Lua ranged from approximately 20%, in the K-Nucleotide benchmark, to approximately 60%, in the Mandelbrot benchmark. Among the bigger benchmarks (CD, Deltablue, Havlak, JSON), there was a smaller time improvement, in the order of 5 to 10%.

The speed of the trampoline version of LuaAOT fell between the speed of Lua and the speed of the default version of LuaAOT (using gotos). This is what we expected, because the main difference in the trampoline design is that it has a bit more dispatch overhead. Other than the dispatching, the generated code is the same as default LuaAOT.

In all benchmarks, the version that generates function calls for each instruction was slower than default LuaAOT, the worst case being the Mandelbrot benchmark, with a 66% difference. Most of this can be attributed to the refactoring which moved the interpreter state from local variables to a struct object, which was necessary to be able to split the interpreter loop into multiple subroutines. We tested this by running a version of the compiler that only did that refactoring. As we can see in Fig. 9, the performance was about the same as the version that also split into multiple subroutines. This suggests that the biggest culprit was having less efficient access to the interpreter state. One question that might be interesting for future research is whether we can avoid this problem by using a custom calling convention that preserves the important interpreter state in machine registers.

Fig. 9. Running time for function call backend, normalized by reference interpreter.

4.3. Pipelining

We were curious whether the better performance of LuaAOT compared to Lua was because it ran fewer CPU instructions or because it could run more instructions per second. To answer this question, we reran the benchmarks using Linux's perf tool, which can measure the number of CPU instructions and CPU cycles used by each program. The results are listed in Table 2. They suggest that, at least for this CPU model, the largest factor behind the improved speeds is a reduction in the number of CPU instructions. It appears that the instructions-per-cycle statistic is actually slightly worse for LuaAOT. For most benchmarks the reduction in instruction count is larger than the reduction in time (CPU cycles). We hypothesize that the biggest speedup comes from the compiler removing some of the instructions responsible for bytecode decoding and dispatching. As most of these instructions are cheap (e.g., shifts and masks for decoding), the reduction in the number of instructions is larger than the reduction in cycles, therefore reducing the instructions per cycle.
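For reference, the counts in Table 2 correspond to the kind of measurement produced by perf's stat subcommand; the program names below are illustrative:

    $ perf stat -e instructions,cycles ./lua benchmark.lua   # reference interpreter
    $ perf stat -e instructions,cycles ./benchmark-aot       # LuaAOT-compiled version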

In theory, removing the bytecode dispatching (and the associated branches) has the potential to improve performance by avoiding costly branch mispredictions. However, in our benchmarks this was not a big factor, because the CPU already did a good job of predicting the branches in the interpreted version. In almost all benchmarks, the branch miss rates for both Lua and LuaAOT were under 1%; the sole exception is the N-Body benchmark for Lua, with a branch miss rate of 1.2%. With these low rates of branch misses, we do not think it is meaningful to compare the absolute numbers. It is also hard to tell whether the compilation is helping the branch-miss rates.

Table 2
CPU instruction count for LuaAOT, relative to Lua.
Benchmark Instrs (%) Time (%)
Binary Trees 82.3 81.1
Fannkuch 44.3 50.4
Fasta 70.5 75.9
K-Nucleotide 75.8 81.9
Mandelbrot 32.7 46.5
N-Body 57.1 59.9
Spectral Norm 51.4 56.3
CD 81.5 89.1
Deltablue 83.6 91.1
Havlak 89.1 92.7
JSON 82.4 87.5
List 60.1 63.7
Permute 69.3 73.1
Richards 76.1 80.3
Hanoi Towers 72.1 80.8

4.4. Code size

One trade-off of LuaAOT compared to the interpreter is that the generated binaries are larger than the corresponding bytecode. To evaluate this aspect of the compiler we measured the sizes of the Lua bytecode and of the AOT-compiled executables (stripped, without debug information). The results are listed in Table 3. In this table, the Bytecode column refers to interpreted Lua, the AOT column to the default compilation strategy with gotos, and FUN refers to the alternative compilation strategy with subroutines. We can see that the compiled executables are significantly larger than the corresponding Lua bytecode, albeit not so large that it becomes prohibitive.

Table 3
Size of compiled modules, in KB.
Benchmark Bytecode AOT FUN
Empty File 0.07 15 15
Binary Trees 1.4 35 39
Fannkuch 1.1 39 39
Fasta 3.0 71 71
K-Nucleotide 2.2 51 51
Mandelbrot 0.9 35 35
N-Body 3.2 83 67
Spectral Norm 1.5 39 39
CD 26.0 495 447
Deltablue 26.0 451 415
Havlak 24.0 411 379
JSON 49.0 435 407
List 13.0 243 215
Permute 12.0 231 207
Richards 22.0 371 343
Hanoi Towers 13.0 243 215

For the larger benchmarks from Are We Fast Yet, the function-based compilation strategy did generate smaller executables as intended. There was a size reduction of approximately 10% when compared to default LuaAOT. However, we did not observe this in the smaller benchmarks from the Benchmarks Game; in those cases the executable size is essentially the same. This might be because those programs have a small bytecode size, meaning that the executable size is dominated by fixed overheads instead of by bytecode compilation results.

Due to the large size of the compiled executables, programmers may want to consider compiling only some of their modules. These decisions are commonplace for scripting languages, where we often write parts of the program in a compiled system language to achieve better performance. At the end of the day, compiling a Lua module with LuaAOT is a speed vs. size tradeoff. On one axis we have the number of bytecode instructions that are executed; modules with hot loops stand to benefit the most from compilation. On the other axis we have the size of the files; larger Lua programs lead to larger executables.

4.5. Compilation times

In addition to reducing executable sizes, the alternative function-based compilation strategy can also reduce compilation times, when compared to the default goto-based compilation strategy. These numbers can be found in Table 4.

Table 4
Compilation time of AOT executables, in seconds.
Benchmark AOT FUN 𝛥
Binary Trees 1.92 1.98 +3%
Fannkuch 2.24 1.97 −12%
Fasta 4.26 3.90 −8%
K-Nucleotide 2.53 2.52 0%
Mandelbrot 1.89 1.44 −24%
N-Body 5.84 3.22 −45%
Spectral Norm 1.61 2.00 +24%
CD 38.84 29.53 −24%
Deltablue 32.32 25.93 −20%
Havlak 31.05 24.24 −22%
JSON 27.64 24.93 −10%
List 15.20 12.01 −21%
Permute 14.08 11.88 −16%
Richards 20.15 20.34 +1%
Hanoi Towers 14.93 12.12 −19%

Most benchmarks had a reduction of compilation times of over 15%, and for the larger benchmarks the reduction was around 20% to 25%.

Although compilation times are not as important for ahead-of-time compilers as they are for just-in-time compilers, long compilation times can add friction to the development process. And since LuaAOT generates fairly large executables, the compilation times are quite noticeable in practice.

Our compilation time experiments suggest that the function-based compilation strategy has a bigger impact for larger programs. Out of curiosity, we tested what happens with the largest Lua file we could get our hands on: the Teal compiler [14]. It consists of a single file with over 9500 lines of Lua code. Compiling it with LuaAOT at the -O2 optimization level took a leisurely 430 s. The function-based compilation strategy brought that down to 160 s, which is less than half.

4.6. Complexity of the implementation

A selling point of the partial-evaluation strategy that we used is its extreme simplicity. To measure this, we counted the lines of code of our code generator, as a proxy for implementation complexity.

We built the code generator by hand, using generous amounts of copy-pasting of code from the Lua interpreter loop and from the Lua bytecode compiler. Because we copied some subroutines from the Lua bytecode compiler, we chose to write the code generator in C. Out of the total of 1600 lines of code in the generator, 450 lines can be attributed to those subroutines from the bytecode compiler, which are responsible for traversing and printing bytecodes. Code templates derived from the core interpreter loop account for over half of the code generator, about 850 lines. The rest of the code, which we wrote from scratch, fits in less than 500 lines. It consists of miscellaneous things such as comments, command-line option handling, and the initialization routines for the generated extension modules.

For comparison, the reference Lua interpreter contains 28 thousand lines of C code [15] and the LuaJIT just-in-time compiler has 80 thousand lines of C and 35 thousand lines of platform-specific assembly language [11]. Another thing that we can point out is that while LuaAOT required us to be familiar with the internals of the Lua interpreter, the final product did not require complex analysis algorithms or optimization passes.

We also looked at the implementation complexity of the alternative compilation strategies that we described: the trampoline strategy and the function-based strategy. The implementation for the trampoline version is about the same size as default LuaAOT, the main difference being that the code templates required less manual tweaking. The function-based implementation, on the other hand, required more manual tweaking. Namely, we had to refactor local variables into struct fields and create the helper functions (which required manually choosing the proper argument types and return values for each function).

One limitation of our manual process for creating LuaAOT is that we must repeat it if we want to update LuaAOT to a new version of the upstream Lua. In theory it ought to be possible to automate at least some of this work: create text manipulation scripts that go through the Lua codebase, copy over the opcode handlers from the interpreter loop, put them inside printf calls, substitute goto statements in place of the program counter assignments, etc. The main catch is that without collaboration from the upstream interpreter, it is hard to guarantee that these text manipulation scripts will still work in a future version of the interpreter. It is likely that at least some degree of manual intervention would still be necessary. Exploring this sort of automation might be an interesting avenue for future work.

4.7. Applicability of the technique

While the work we have presented is specific to the reference Lua interpreter, we think that the technique is simple enough to be applicable to other dynamic language interpreters. In this section, we discuss the aspects of Lua that our interpreter relied on, and what conditions are necessary to apply this to another interpreter. Of course, the performance improvements will depend on the specifics of the interpreter, in particular what percentage of time is attributable to the core interpreter loop.

The first important thing is that our technique would not work as easily for AST-based interpreters. While it is also possible to use partial evaluation for an AST-based interpreter, it is more complicated than for a bytecode-based one, because of the frequent presence of recursion in the main interpreter loop.

Since our technique is based on partial evaluation, the language used to implement the original interpreter is important. C, which is a popular language for writing interpreters, worked well for several reasons: the presence of a goto statement, the availability of optimizing compilers, and the existence of preprocessor macros.

When we compile jump instructions in the bytecode, we want a similar jump operation in our target language. In C we can use goto statements for this purpose, provided that the control flow in the original interpreter is all inside a single interpreter function. If the target language does not have goto statements, it is harder to compile the unstructured jumps in the bytecode.

Using C as the target language allowed us to take advantage of several optimizations from the C compiler, including constant propagation for the bytecode instructions. The partial evaluator has the luxury of emitting code that is almost identical to the code used by the original interpreter. This would be harder to do if, for example, the original interpreter were implemented in hand-written assembly language. In that case we would likely have to implement the constant propagation ourselves.

Albeit not a fundamental requirement, the C preprocessor was a convenient feature. The LuaAOT code generator is essentially a text-based code transformer and in that context it helps to have a text-based macro system built into the target language. Unlike inline functions, macros can jump to other parts of the program and assign to surrounding local variables.

To summarize, we believe that our technique is a good fit for languages that are implemented by a bytecode interpreter written in C or C++. Some popular scripting languages that fit this criterion include Python, Ruby, JavaScript, Perl, and PHP.


5. Related work

Although our work is inspired by partial evaluation, it is not a partial evaluation system. There is a rich literature on partial evaluation systems and their application to interpreters [16,17]. However, one difference between our work and these partial evaluation systems is that they usually require that the input interpreter be in some specific format that the partial evaluator can work with. LuaAOT is a case study in doing this partial evaluation in an ad-hoc manner, on an existing interpreter.

A relevant example of partial evaluation for interpreters is the Truffle framework [18,19]. Truffle allows a language implementer to create an efficient just-in-time compiler based on an AST interpreter. The implementer can write the AST interpreter and provide hints that tell Truffle which run-time type information should be collected and where to use the partial evaluation. As we just mentioned, one important difference compared to our work is that Truffle requires that the interpreter be written in Java, using the Truffle framework.

Another area we touch is the study of interpretation overhead. One way that this has been studied is by profiling the interpreter while it is running, to measure how much of the execution time can be assigned to bytecode decoding and dispatching [20]. However, if the motivation for the question is to compare the performance of an interpreter versus a compiler, it is useful to measure the result of an actual compiler. In addition to removing the decoding and dispatching, the compiler can also perform optimizations that are natural to implement in a compiler. One such compiler is Barany's pylibjit [21]. Barany implemented a just-in-time compiler for Python using the GNU LibJIT [22] library. Similarly to LuaAOT, his compiler also works at the bytecode level, converting each Python bytecode instruction into a machine code sequence. Barany measured the effect of enabling and disabling various optimization passes of his compiler, to estimate how much of an impact these aspects have on the performance of Python programs [23]. Some of the optimizations that he implemented were removal of redundant reference counting, static dispatch of arithmetic operations, unboxing of number and container objects, and call-stack frame removal. These Python-specific optimizations allowed Barany to investigate the performance impact of more features of the interpreter than just the bytecode handling. However, his compiler is more complex than ours; he had to reimplement most of the bytecodes using the LibJIT framework.

In addition to the basic form of LuaAOT, this paper also presented two alternative compilation strategies. The trampoline-based strategy is based on a novel idea. The function-based strategy is a trick that has been rediscovered by many compiler writers (for instance, PHC [24] and likely several others). In the context of LuaAOT, one of the most interesting aspects of the function-based strategy is that it shines a light on code size and compilation times.

Finally, we want to mention that our work does not consider optimizations of the interpreter itself, such as superinstructions or threaded dispatch [6]. We also did not study the effect of type inference. Our focus was on reducing the direct interpretation overhead.

6. Threats to validity

The benchmarks we used for this paper are microbenchmarks, each designed to take at least one second to run when executed by the Lua interpreter. It is conceivable that the performance results could be different for larger programs or for programs that are not as computationally intensive as our benchmarks.

Another thing to consider is that this work evaluates a single compiler, in the context of Lua. While we believe these techniques should also be applicable to other languages, that has not been evaluated yet.

7. Conclusion

Dynamic programming languages are often implemented using interpreters, which spend some portion of the running time on interpretation overhead. Compilers can avoid this, but they may be complex to implement.

In this paper we have presented LuaAOT, a simple ahead-of-time compiler for Lua. Using less than 500 lines of new code in a total of 1600 lines, we were able to compile the entirety of Lua, including features such as coroutines and tail calls.

In return, we achieved a reduction in running times between 20% and 60%. While these numbers cannot compete head-to-head with a good JIT compiler, they demonstrate a noticeable performance boost, for a tiny fraction of the implementation cost. These numbers also offer a contribution to studies about interpretation overheads.

One novelty presented in this paper was the trampoline-based compilation strategy, which does not require modifications to the jump instructions. While it is not as fast as the goto-based strategy, it is even simpler to implement. It also supports coroutines with very little work. That is something that is tricky to do when compiling to C, because C itself does not support coroutines.


We believe that the technique we used to implement LuaAOT may be of interest to other programming languages. In particular, the basic ideas of our work may also be applicable to other interpreters that use a bytecode-based virtual machine written in C.

CRediT authorship contribution statement

Hugo Musso Gualandi: Implemented LuaAOT, carried out the experiments, participated in writing the paper. Roberto Ierusalimschy: Supervised the work, participated in writing the paper.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work has been funded in part by CNPQ, Brazil, grants 153918/2015-2 and 305001/2017-5; and by CAPES, Brazil, grant number 001.

Appendix. Example of generated code

In Section 2, our examples featured a simplified interpreter. In this appendix, we show some real code produced by LuaAOT. It is the result of compiling the foo function from Fig. 1. To make it fit, we made minor stylistic edits and replaced some sections by /*...*/ comments. The code starts with some boilerplate initialization code, followed by the coroutine dispatch table, and finally the handlers for each bytecode instruction. In this code, we can see some of the preprocessor macro tricks. The vmfetch macro initializes the i variable to a compile-time constant, allowing the C compiler to constant fold the GETARG_sBx macro. The LUAOT_SKIP1 macro allows instructions like op_arith to skip over the following mmbin instruction; in the original interpreter, they increment the program counter (pc++), while in LuaAOT we replace that by goto LUAOT_SKIP1 (see Fig. 10).

Fig. 10. The actual code generated by the compiler.
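Since the figure is not reproduced in this version, the abridged sketch below conveys the shape of that output, based on the description above; the function name, the encoded instruction constants, and the saved_pc helper are illustrative:

    /* An abridged impression of the generated execute function for foo. */
    void execute_foo_compiled(lua_State *L, CallInfo *ci) {
        /* ... boilerplate initialization, copied from the interpreter ... */

        /* coroutine dispatch table: resume from the saved program counter */
        switch (saved_pc(ci)) {         /* illustrative helper */
            case 0: goto label_0;
            case 1: goto label_1;
            /* ... one case per instruction ... */
        }

    label_0: {
        vmfetch(/* encoded "loadi 3 17", a compile-time constant */);
        /* handler pasted from the interpreter's loadi case */
    }
    label_1: {
        vmfetch(/* encoded "add 0 1 2" */);
        /* handler pasted from the add case; where the interpreter did
           pc++ to skip the following mmbin, the generated code instead
           does goto LUAOT_SKIP1 */
    }
        /* ... remaining handlers ... */
    }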
References

[1] Hugo Gualandi, Roberto Ierusalimschy, Pallene: A companion language for Lua, Sci. Comput. Program. 189 (2020) 102393, http://dx.doi.org/10.1016/j.scico.2020.102393, URL http://www.inf.puc-rio.br/~hgualandi/papers/Gualandi-2020-SCP.pdf.
[2] Yoshihiko Futamura, Partial evaluation of computation process–An approach to a compiler-compiler, High.-Order Symb. Comput. 12 (4) (1999) 381–391, http://dx.doi.org/10.1023/A:1010095604496.
[3] Neil D. Jones, Carsten K. Gomard, Peter Sestoft, Partial Evaluation and Automatic Program Generation, in: Prentice-Hall International Series in Computer Science, Prentice Hall, 1993.
[4] Hugo Musso Gualandi, Roberto Ierusalimschy, A surprisingly simple Lua compiler, in: Proceedings of the 25th Brazilian Symposium on Programming Languages, SBLP 2021, Joinville, Brazil, September 27–October 1, 2021, 2021, http://dx.doi.org/10.1145/3475061.3475077.
[5] Roberto Ierusalimschy, Luiz Henrique de Figueiredo, Waldemar Celes, The implementation of Lua 5.0, J. Univ. Comput. Sci. 11 (7) (2005) 1159–1176, http://dx.doi.org/10.3217/jucs-011-07-1159.
[6] Anton M. Ertl, David Gregg, The structure and performance of efficient interpreters, J. Instr.-Level Parall. (5) (2003), URL https://www.jilp.org/vol5/index.html.
[7] Hugo Gualandi, LuaAOT 5.4 source code repository, 2021, URL https://github.com/hugomg/lua-aot-5.4.
[8] Ana Lúcia de Moura, Revisitando co-rotinas (Ph.D. thesis), PUC-Rio, 2004.
[9] Ana Lúcia de Moura, Noemi Rodriguez, Roberto Ierusalimschy, Coroutines in Lua, J. Univ. Comput. Sci. 10 (7) (2004) 910–925, http://dx.doi.org/10.3217/jucs-010-07-0910.
[10] Roberto Ierusalimschy, Luiz Henrique de Figueiredo, Waldemar Celes, Lua test suite, 2022, URL https://www.lua.org/tests/.
[11] Mike Pall, LuaJIT, a just-in-time compiler for Lua, 2005, URL http://luajit.org/luajit.html.
[12] Isaac Gouy, The computer language benchmarks game, 2013, URL https://benchmarksgame-team.pages.debian.net/benchmarksgame.
[13] Stefan Marr, Benoit Daloze, Hanspeter Mössenböck, Cross-language compiler benchmarking: Are we fast yet? in: Proceedings of the 12th Symposium on Dynamic Languages, DLS 2016, 2016, pp. 120–131, http://dx.doi.org/10.1145/2989225.2989232.
[14] Hisham Muhammad, The Teal compiler, 2019, Teal source code repository, URL https://github.com/teal-language/tl/.
[15] Roberto Ierusalimschy, Luiz Henrique de Figueiredo, Waldemar Celes, Lua 5.4 interpreter source code, 2021, URL https://www.lua.org/versions.html.
[16] Lars Ole Andersen, Partial evaluation of C and automatic compiler generation, in: Uwe Kastens, Peter Pfahler (Eds.), Compiler Construction, Springer Berlin Heidelberg, Berlin, Heidelberg, 1992, pp. 251–257.
[17] Neil D. Jones, Transformation by interpreter specialisation, Sci. Comput. Program. 52 (1) (2004) 307–339, http://dx.doi.org/10.1016/j.scico.2004.03.010. Special issue on program transformation.
[18] Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, Mario Wolczko, One VM to rule them all, in: Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, Onward! 2013, 2013, pp. 187–204, http://dx.doi.org/10.1145/2509578.2509581.
[19] Thomas Würthinger, Christian Wimmer, Christian Humer, Andreas Wöß, Lukas Stadler, Chris Seaton, Gilles Duboscq, Doug Simon, Matthias Grimmer, Practical partial evaluation for high-performance dynamic language runtimes, in: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, ACM, 2017, pp. 662–676, http://dx.doi.org/10.1145/3062341.3062381.
[20] Nagy Mostafa, Chandra Krintz, Calin Cascaval, David Edelson, Priya Nagpurkar, Peng Wu, Understanding the Potential of Interpreter-Based Optimizations for Python, Technical Report, University of California, Santa Barbara, 2010, URL https://www.cs.ucsb.edu/sites/cs.ucsb.edu/files/docs/reports/2010-14.pdf.
[21] Gergö Barany, pylibjit: A JIT compiler library for Python, in: Software Engineering (Workshops), 2014, pp. 213–224.
[22] Rhys Weatherley, GNU LibJIT, 2004. The GNU LibJIT library.
[23] Gergö Barany, Python interpreter performance deconstructed, in: Proceedings of the Workshop on Dynamic Languages and Applications, Dyla '14, 2014, pp. 5:1–5:9, http://dx.doi.org/10.1145/2617548.2617552, URL https://publik.tuwien.ac.at/files/PubDat_233742.pdf.
[24] Paul Biggar, Design and Implementation of an Ahead-of-Time Compiler for PHP (Ph.D. thesis), Trinity College Dublin, 2010, URL https://paulbiggar.com/research/#phd-dissertation.
