Lua, like many scripting languages, can be extended by binary modules written in other languages (such as C). This was a natural mechanism for integrating LuaAOT with Lua. The core of the LuaAOT compiler is the luaot executable, which takes Lua source code as input and outputs such a binary extension module. For the most part, these compiled modules can be loaded by the Lua interpreter and runtime just like any other binary module. The only difference is that we had to patch the interpreter to tell it how to call our compiled Lua functions.

A possible alternative implementation would be for the LuaAOT compiler to generate a standalone executable. The standard procedure for this would be to bundle the interpreter inside the executable, because the compiled code still needs the Lua runtime. However, creating standalone programs for Lua in this manner is not typical. It is more common to install and distribute Lua applications separately from the interpreter.

3.1. The interpreter

The Lua interpreter plays a central role in our system, which comprises both a compiler and a slightly modified interpreter. There are multiple reasons for this. The first is that programs can contain both compiled and non-compiled sections, and the interpreter is necessary to run the non-compiled parts. Moreover, our compiled code still requires the interpreter: because we partially evaluate the inner interpreter loop, the compiled code calls several subroutines from the interpreter. Furthermore, the interpreter codebase also houses the Lua runtime and garbage collector, which are used by both the compiled and the non-compiled code.

The custom interpreter has few changes compared to the original Lua interpreter. The first modification was to add an additional field to the data structure that represents Lua functions. This field refers to the compiled code for that function, if it exists. The C extension modules that we generate include initialization code that associates the Lua functions with their compiled C implementation.

After this, we told the interpreter how to use these compiled functions. At the start of the execute subroutine (the one that Fig. 2 talks about), we added a check just before the inner loop: if the function has a compiled version, we transfer control to it instead of continuing to the usual interpreter loop.
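To make the mechanism concrete, the sketch below shows one way such a check could look at the top of the interpreter's execute routine. It is only a sketch: the field name compiled, the AotFunction typedef, and the exact placement inside luaV_execute are assumptions for illustration, not the actual LuaAOT code.

    /* Hypothetical sketch of the dispatch check described above.  The field
     * "compiled" added to Proto and the AotFunction typedef are illustrative
     * names; the real LuaAOT field and calling convention may differ. */
    #include "lstate.h"   /* lua_State, CallInfo (Lua internal headers) */
    #include "lobject.h"  /* Proto, LClosure, clLvalue, s2v */

    typedef void (*AotFunction) (lua_State *L, CallInfo *ci);

    void luaV_execute (lua_State *L, CallInfo *ci) {
      LClosure *cl = clLvalue(s2v(ci->func));
      if (cl->p->compiled != NULL) {   /* does this function have AOT code? */
        cl->p->compiled(L, ci);        /* run the specialized C version     */
        return;                        /* never enter the bytecode loop     */
      }
      /* ... the original interpreter loop follows unchanged ... */
    }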
The next change we made is related to the public interface that is
exposed to C extension modules. Our partial evaluator generates code
that calls many internal functions from the Lua interpreter, which in
normal circumstances are not exposed to extension modules. To allow
our generated code to use these internal functions, we modified the
interpreter to make all those internal names public.
Finally, we proceeded to implement the code generator, examining
the bytecodes one by one. Most had their code directly pasted into
the compiler; some required modifications to the generated code, but
there was one case that also required modifications to the interpreter:
the bytecodes for function calls (call, tailcall, and return). The Lua 5.4
interpreter has an optimization where Lua-to-Lua calls reuse the same
execution frame for the execute function. Like a conventional CPU,
the Lua interpreter implements these function calls by updating the
prog argument and the program counter, so that the same interpreter
loop naturally runs the called function. Unfortunately, this implemen-
tation is incompatible with our compiled code, where each execute
function is specialized to a particular Lua function. Our solution was
to disable this optimization. We believe that with additional work it
might have been possible to keep it. However, disabling it was certainly
simpler.
We should stress that this change did not harm the tail-recursive
functions, which are guaranteed to use O(1) stack space. In the gener-
ated C code for the tailcall instruction, the crucial function call appears
in a tail position, so that a good C compiler can perform the required tail-call optimization.
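As a rough illustration only (the function names below are placeholders, not LuaAOT's actual generated identifiers), the emitted code has approximately this shape:

    /* Rough shape of the code generated for a function ending in a tail call.
     * Names are placeholders.  Because the call is the very last action, an
     * optimizing C compiler can turn it into a jump, so a chain of Lua tail
     * calls does not grow the C stack. */
    static int aot_callee (lua_State *L, CallInfo *ci);

    static int aot_caller (lua_State *L, CallInfo *ci) {
      /* ... blocks for the preceding bytecodes ... */
      /* OP_TAILCALL: the callee reuses the caller's Lua stack frame */
      return aot_callee(L, ci);   /* call in tail position */
    }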
Fig. 3. Specializing the interpreter to a particular function.

As important as the changes we made are the many things we did not need to modify. Other than adding a single field to the function objects, we made no other changes to the internal Lua data types. We also did not modify the Lua runtime system, the garbage collector, or the Lua standard library.

3.2. The code generator

The code generator receives a Lua module and converts it to C. To do this, it calls the Lua bytecode compiler and then converts the bytecode to C. For each function in the module, the code generator produces an appropriate function header and then iterates over the function's bytecodes, outputting a block of C code for each instruction. For the most part, these blocks of C code are copied verbatim from the original Lua interpreter. However, we had to adapt a few bytecodes. The main categories were bytecodes that modify the program counter, bytecodes using C gotos, and bytecodes related to function calls.

In the original interpreter, jumps are implemented by assigning to the interpreter variable representing the program counter. In our compiler, we replaced each of these jumps by a C goto. This required making the appropriate changes to the generated code for the jump instruction, as well as to the instructions that implement for-loops. We also had to change the instructions for binary operations, because of how Lua implements operator overloading. In Lua, every binary operation is followed by a special instruction (mmbin), which handles overloading. When the operands have the expected types (e.g., numbers for the add operation), the binary operation increments the program counter to skip this next instruction. This means that all binary operations contain an implicit jump, which our compiler must also replace by a goto.
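A sketch of what this looks like in the generated code (the labels and helper names are made up for the example; the real blocks are the handlers copied from lvm.c):

    /* Illustrative only: generated code for "R[A] = R[B] + R[C]" and its MMBIN.
     * Labels and helpers are invented for this sketch; the real LuaAOT output
     * pastes the corresponding handlers from the Lua interpreter. */
    label_07: { /* OP_ADD */
      if (both_operands_are_numbers(rb, rc)) {
        set_number(ra, add_numbers(rb, rc));
        goto label_09;   /* the interpreter's "pc++" that skips the MMBIN below */
      }
      /* slow path: fall through and let the MMBIN block call the metamethod */
    }
    label_08: { /* OP_MMBIN */
      call_arith_metamethod(L, ra, rb, rc, TM_ADD);
    }
    label_09: { /* next instruction */
      /* ... */
    }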
The next category of instructions we had to adapt were the ones that use goto statements in their original implementation. One example is the forcall instruction, which is always followed by a forloop. As an optimization, the Lua interpreter uses a goto to jump straight to the handler for the forloop, bypassing the usual dispatch logic. Since our
3.3. Coroutines
4. Evaluation
4.1. Correctness
Our main tool for assessing the correctness of LuaAOT was the test suite from the reference Lua interpreter [10]. Although tests cannot prove the absence of bugs, they certainly help. We ensured that LuaAOT could run the full Lua test suite without crashing and that it produced the same result as the reference interpreter. During this process we found and fixed several bugs, including some tricky ones involving recursive functions and tail calls.

Fig. 6. Extracting regular opcodes into functions.
Table 1
Running times, in seconds.
Benchmark Lua Trampoline LuaAOT Function Struct LuaJIT
Binary Trees 3.49 ± 0.06 3.08 ± 0.14 2.87 ± 0.07 3.15 ± 0.69 2.92 ± 0.11 1.24 ± 0.50
Fannkuch 43.97 ± 2.43 25.87 ± 2.76 21.79 ± 0.88 33.94 ± 2.40 32.90 ± 0.62 7.38 ± 0.25
Fasta 4.84 ± 0.11 4.05 ± 0.33 3.74 ± 0.17 4.46 ± 0.37 4.43 ± 0.09 1.24 ± 0.07
K-Nucleotide 3.91 ± 0.14 3.35 ± 0.10 3.21 ± 0.17 3.58 ± 0.14 3.64 ± 0.74 0.95 ± 0.07
Mandelbrot 14.27 ± 1.63 10.49 ± 0.71 6.80 ± 0.11 11.36 ± 1.14 10.94 ± 0.27 1.63 ± 0.05
N-Body 17.85 ± 1.45 12.26 ± 1.21 10.49 ± 0.36 14.15 ± 0.69 14.59 ± 0.80 1.10 ± 0.04
Spectral Norm 34.64 ± 1.12 28.00 ± 12.64 18.76 ± 0.56 28.28 ± 0.71 28.38 ± 2.27 1.21 ± 0.02
CD 1.89 ± 0.04 1.87 ± 0.10 1.68 ± 0.10 1.80 ± 0.06 1.79 ± 0.07 0.90 ± 0.15
Deltablue 2.08 ± 0.20 2.00 ± 0.20 1.89 ± 0.12 2.06 ± 0.09 2.09 ± 0.15 1.05 ± 0.17
Havlak 7.63 ± 0.32 7.41 ± 0.18 7.20 ± 0.12 7.59 ± 0.19 7.59 ± 0.15 4.36 ± 0.17
JSON 5.06 ± 0.11 4.89 ± 0.19 4.52 ± 0.42 4.88 ± 0.11 4.87 ± 0.15 1.03 ± 0.07
List 2.58 ± 0.11 2.01 ± 0.15 1.76 ± 0.54 2.23 ± 0.11 2.25 ± 0.12 1.02 ± 0.05
Permute 2.64 ± 0.08 2.00 ± 0.17 1.93 ± 0.15 2.58 ± 0.55 2.52 ± 0.19 0.09 ± 0.04
Richards 3.61 ± 0.13 3.09 ± 0.09 2.86 ± 0.12 3.34 ± 0.16 3.32 ± 0.15 0.89 ± 0.08
Hanoi Towers 4.16 ± 0.12 3.40 ± 0.15 3.22 ± 0.47 4.13 ± 0.38 4.11 ± 0.13 0.33 ± 0.09
than the Lua interpreter, but slower than LuaJIT. In the microbenchmarks, the reduction in running time compared to interpreted Lua ranged from approximately 20%, in the K-Nucleotide benchmark, to approximately 60%, in the Mandelbrot benchmark. Among the bigger benchmarks (CD, Deltablue, Havlak, JSON), there was a smaller time improvement, in the order of 5 to 10%.

The speed of the trampoline version of LuaAOT fell between the speed of Lua and the speed of the default version of LuaAOT (using gotos). This is what we expected, because the main difference in the trampoline design is that it has a bit more dispatch overhead. Other than the dispatching, the generated code is the same as default LuaAOT.

In all benchmarks, the version that generates function calls for each instruction was slower than default LuaAOT, the worst case being the Mandelbrot benchmark, with a 66% difference. Most of this can be attributed to the refactoring that moved the interpreter state from local variables to a struct object, which was necessary to be able to split the interpreter loop into multiple subroutines. We tested this by running a version of the compiler that only did that refactoring. As we can see in Fig. 9, its performance was about the same as that of the version that also split the loop into multiple subroutines. This suggests that the biggest culprit was the less efficient access to the interpreter state. One question that might be interesting for future research is whether we can avoid this problem by using a custom calling convention that preserves the important interpreter state in machine registers.

Fig. 9. Running time for function call backend, normalized by reference interpreter.

4.3. Pipelining

We were curious whether the better performance of LuaAOT compared to Lua was because it ran fewer CPU instructions or because it could run more instructions per second. To answer this question, we reran the benchmarks using Linux's perf tool, which can measure the number of CPU instructions and CPU cycles used by each program. The results are listed in Table 2. They suggest that, at least for this CPU model, the largest factor behind the improved speeds is a reduction in the number of CPU instructions. It appears that the instructions-per-cycle statistic is actually slightly worse for LuaAOT. For most benchmarks the reduction in instruction count is larger than the reduction in time (CPU cycles). We hypothesize that the biggest speedup comes from the compiler removing some of the instructions responsible for bytecode decoding and dispatching. As most of these instructions are cheap (e.g., shifts and masks for decoding), the reduction in the number of instructions is larger than the reduction in cycles, therefore reducing the instructions per cycle.
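As an illustration of the kind of work that disappears, decoding a bytecode operand in an interpreter is a shift plus a mask, along the lines below. The field layout shown is patterned after Lua's lopcodes.h but should be read as an illustration, not a specification.

    /* Per-instruction decoding work in a bytecode interpreter.  In the
     * compiled code the instruction word is a compile-time constant, so the
     * C compiler folds these shifts and masks away; the interpreter must
     * execute them for every dispatched instruction. */
    typedef unsigned int Instruction;

    #define GET_OPCODE(i)  ((int)(((i) >> 0) & 0x7F))    /* 7-bit opcode    */
    #define GETARG_A(i)    ((int)(((i) >> 7) & 0xFF))    /* 8-bit A operand */
    #define GETARG_B(i)    ((int)(((i) >> 16) & 0xFF))   /* 8-bit B operand */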
In theory, removing the bytecode dispatching (and the associated branches) has the potential to improve performance by avoiding costly branch mispredictions. However, in our benchmarks this was not a big factor, because the CPU already did a good job of predicting the branches in the interpreted version. In almost all benchmarks, the
branch miss rates for both Lua and LuaAOT were under 1%; the sole exception is the N-Body benchmark for Lua, with a branch miss rate of 1.2%. With such low branch miss rates, we do not think it is meaningful to compare the absolute numbers. It is also hard to tell whether the compilation is helping the branch-miss rates.

Table 2
CPU instruction count for LuaAOT, relative to Lua.

Benchmark Instrs (%) Time (%)
Binary Trees 82.3 81.1
Fannkuch 44.3 50.4
Fasta 70.5 75.9
K-Nucleotide 75.8 81.9
Mandelbrot 32.7 46.5
N-Body 57.1 59.9
Spectral Norm 51.4 56.3
CD 81.5 89.1
Deltablue 83.6 91.1
Havlak 89.1 92.7
JSON 82.4 87.5
List 60.1 63.7
Permute 69.3 73.1
Richards 76.1 80.3
Hanoi Towers 72.1 80.8
4.4. Code size

One trade-off of LuaAOT compared to the interpreter is that the generated binaries are larger than the corresponding bytecode. To evaluate this aspect of the compiler, we measured the sizes of the Lua bytecode and of the AOT-compiled executables (stripped, without debug information). The results are listed in Table 3. In this table, the Bytecode column refers to interpreted Lua, the AOT column to the default compilation strategy with gotos, and FUN refers to the alternative compilation strategy with subroutines. We can see that the compiled executables are significantly larger than the corresponding Lua bytecode, albeit not so large as to become prohibitive.

Table 3
Size of compiled modules, in KB.

Benchmark Bytecode AOT FUN
Empty File 0.07 15 15
Binary Trees 1.4 35 39
Fannkuch 1.1 39 39
Fasta 3.0 71 71
K-Nucleotide 2.2 51 51
Mandelbrot 0.9 35 35
N-Body 3.2 83 67
Spectral Norm 1.5 39 39
CD 26.0 495 447
Deltablue 26.0 451 415
Havlak 24.0 411 379
JSON 49.0 435 407
List 13.0 243 215
Permute 12.0 231 207
Richards 22.0 371 343
Hanoi Towers 13.0 243 215

For the larger benchmarks from Are We Fast Yet, the function-based compilation strategy did generate smaller executables, as intended. There was a size reduction of approximately 10% when compared to default LuaAOT. However, we did not observe this in the smaller benchmarks from the Benchmarks Game; in those cases the executable size is essentially the same. This might be because those programs have a small bytecode size, meaning that the executable size is dominated by fixed overheads instead of by the results of bytecode compilation.

Due to the large size of the compiled executables, programmers may want to consider compiling only some of their modules. Such decisions are commonplace for scripting languages, where we often write parts of the program in a compiled systems language to achieve better performance. At the end of the day, compiling a Lua module with LuaAOT is a speed vs. size tradeoff. On one axis we have the number of bytecode instructions that are executed; modules with hot loops stand to benefit the most from compilation. On the other axis we have the size of the files; larger Lua programs lead to larger executables.

4.5. Compilation times

In addition to reducing executable sizes, the alternative function-based compilation strategy can also reduce compilation times when compared to the default goto-based compilation strategy. These numbers can be found in Table 4. Most benchmarks had a reduction of
compilation times of over 15%, and for the larger benchmarks the reduction was around 20% to 25%.

Although compilation times are not as important for ahead-of-time compilers as they are for just-in-time compilers, long compilation times can add friction to the development process. And since LuaAOT generates fairly large executables, the compilation times are quite noticeable in practice.

Our compilation time experiments suggest that the function-based compilation strategy has a bigger impact for larger programs. Out of curiosity, we tested what happens to the largest Lua file we could get our hands on: the Teal compiler [14]. It consists of a single file with over 9500 lines of Lua code. Compiling it with LuaAOT at the -O2 optimization level took a leisurely 430 s. The function-based compilation strategy brought that down to 160 s, which is less than half.

Table 4
Compilation time of AOT executables, in seconds.

Benchmark AOT FUN 𝛥
Binary Trees 1.92 1.98 +3%
Fannkuch 2.24 1.97 −12%
Fasta 4.26 3.90 −8%
K-Nucleotide 2.53 2.52 0%
Mandelbrot 1.89 1.44 −24%
N-Body 5.84 3.22 −45%
Spectral Norm 1.61 2.00 +24%
CD 38.84 29.53 −24%
Deltablue 32.32 25.93 −20%
Havlak 31.05 24.24 −22%
JSON 27.64 24.93 −10%
List 15.20 12.01 −21%
Permute 14.08 11.88 −16%
Richards 20.15 20.34 +1%
Hanoi Towers 14.93 12.12 −19%

4.6. Complexity of the implementation

A selling point of the partial-evaluation strategy that we used is its extreme simplicity. To measure this, we counted the lines of code of our code generator, as a proxy for implementation complexity.

We built the code generator by hand, using generous amounts of copy-pasting of code from the Lua interpreter loop and from the Lua bytecode compiler. Because we copied some subroutines from the Lua bytecode compiler, we chose to write the code generator in C. Out of the total of 1600 lines of code in the generator, 450 lines can be attributed to those subroutines from the bytecode compiler, which are responsible for traversing and printing bytecodes. Code templates derived from the core interpreter loop account for over half of the code generator, about 850 lines. The rest of the code, which we wrote from scratch, fits in less than 500 lines. It consists of miscellaneous things such as comments, command-line option handling, and the initialization routines for the generated extension modules.
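To give a feel for what those code templates are, here is a condensed sketch of template-based emission. The opcode names and template strings below are invented and far shorter than the handlers LuaAOT copies from lvm.c.

    #include <stdio.h>

    /* Condensed, illustrative sketch of how a template-based generator can
     * emit the C block for one bytecode.  Everything here is made up for the
     * example; the real templates are the handlers copied from the
     * interpreter loop. */
    enum { OP_JUMP, OP_OTHER };

    static void emit_instruction (FILE *out, int pc, int op, int jump_target) {
      fprintf(out, "label_%02d: {\n", pc);
      if (op == OP_JUMP) {
        /* where the interpreter does "pc += offset", the output gets a goto */
        fprintf(out, "  goto label_%02d;\n", jump_target);
      } else {
        /* most opcodes: paste the handler text copied from the interpreter */
        fprintf(out, "  /* ...handler copied verbatim from lvm.c... */\n");
      }
      fprintf(out, "}\n");
    }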
For comparison, the reference Lua interpreter contains 28 thousand lines of C code [15] and the LuaJIT just-in-time compiler has 80 thousand lines of C and 35 thousand lines of platform-specific assembly language [11]. Another thing that we can point out is that while LuaAOT required us to be familiar with the internals of the Lua interpreter, the final product did not require complex analysis algorithms or optimization passes.

We also looked at the implementation complexity of the alternative compilation strategies that we described: the trampoline strategy and the function-based strategy. The implementation for the trampoline version is about the same size as default LuaAOT, the main difference being that the code templates required less manual tweaking. The function-based implementation, on the other hand, required more manual tweaking. Namely, we had to refactor local variables into struct fields and create the helper functions (which required manually choosing the proper argument types and return values for each function).

One limitation of our manual process for creating LuaAOT is that we must repeat it if we want to update LuaAOT to a new version of the upstream Lua. In theory it ought to be possible to automate at least some of this work: create text manipulation scripts that go through the Lua codebase, copy over the opcode handlers from the interpreter loop, put them inside printf calls, substitute goto statements in place of the program counter assignments, and so on. The main catch is that, without collaboration from the upstream interpreter, it is hard to guarantee that these text manipulation scripts will still work in a future version of the interpreter. It is likely that at least some degree of manual intervention would still be necessary. Exploring this sort of automation might be an interesting avenue for future work.

4.7. Applicability of the technique

While the work we have presented is specific to the reference Lua interpreter, we think that the technique is simple enough to be applicable to other dynamic language interpreters. In this section, we discuss the aspects of Lua that our interpreter relied on, and what conditions are necessary to apply this to another interpreter. Of course, the performance improvements will depend on the specifics of the interpreter, in particular what percentage of time is attributable to the core interpreter loop.

The first important thing is that our technique would not work as easily for AST-based interpreters. While it is also possible to use partial evaluation for an AST-based interpreter, it is more complicated than for a bytecode-based one because of the frequent presence of recursion in the main interpreter loop.

Since our technique is based on partial evaluation, the language used to implement the original interpreter is important. C, which is a popular language for writing interpreters, worked well for several reasons: the presence of a goto statement, the availability of optimizing compilers, and the existence of preprocessor macros.

When we compile jump instructions in the bytecode, we want a similar jump operation in our target language. In C we can use goto statements for this purpose, provided that the control flow in the original interpreter is all inside a single interpreter function. If the target language does not have goto statements, it is harder to compile the unstructured jumps in the bytecode.

Using C as the target language allowed us to take advantage of several optimizations from the C compiler, including constant propagation for the bytecode instructions. The partial evaluator has the luxury of emitting code that is almost identical to the code used by the original interpreter. This would be harder to do if, for example, the original interpreter were implemented in hand-written assembly language. In that case we would likely have to implement the constant propagation ourselves.

Albeit not a fundamental requirement, the C preprocessor was a convenient feature. The LuaAOT code generator is essentially a text-based code transformer, and in that context it helps to have a text-based macro system built into the target language. Unlike inline functions, macros can jump to other parts of the program and assign to surrounding local variables.
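A toy example of that property (not a LuaAOT macro) is shown below: the macro body assigns to a local of the enclosing function and jumps to one of its labels, which an inline function could not do.

    #include <stdio.h>

    /* Toy illustration of why textual macros suit this style of code
     * generation: unlike an inline function, the macro expansion can write
     * to the caller's locals and jump to the caller's labels. */
    #define CHECK_OR_BAIL(cond, err) \
      do { if (!(cond)) { status = (err); goto fail; } } while (0)

    static int example (int x) {
      int status = 0;
      CHECK_OR_BAIL(x > 0, -1);   /* may set "status" and jump to "fail" */
      return x * 2;
    fail:
      return status;
    }

    int main (void) {
      printf("%d %d\n", example(21), example(-1));  /* prints "42 -1" */
      return 0;
    }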
To summarize, we believe that our technique is a good fit for languages that are implemented by a bytecode interpreter written in C or C++. Some popular scripting languages that fit this criterion include Python, Ruby, JavaScript, Perl, and PHP.

5. Related work

Although our work is inspired by partial evaluation, it is not a partial evaluation system. There is a rich literature on partial evaluation systems and their application to interpreters [16,17]. However, one difference between our work and these partial evaluation systems is that they usually require that the input interpreter be in some specific format that the partial evaluator can work with. LuaAOT is a
6. Threats to validity
7. Conclusion
We believe that the technique we used to implement LuaAOT may be of interest to other programming languages. In particular, the basic ideas of our work may also be applicable to other interpreters that use a bytecode-based virtual machine written in C.

CRediT authorship contribution statement

Hugo Musso Gualandi: Implemented LuaAOT, Carried out the experiments, Participated in writing the paper. Roberto Ierusalimschy: Supervised the work, Participated in writing the paper.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work has been funded in part by CNPQ, Brazil, grants 153918/2015-2 and 305001/2017-5; and by CAPES, Brazil, grant number 001.

Appendix. Example of generated code

In Section 2, our examples featured a simplified interpreter. In this appendix, we show some real code produced by LuaAOT. It is the result of compiling the foo function from Fig. 1. To make it fit, we made minor stylistic edits and replaced some sections by /*...*/ comments. The code starts with some boilerplate initialization code, followed by the coroutine dispatch table, and finally the handlers for each bytecode instruction. In this code, we can see some of the preprocessor macro tricks. The vmfetch macro initializes the i variable to a compile-time constant, allowing the C compiler to constant fold the GETARG_sBx macro. The LUAOT_SKIP1 macro allows instructions like op_arith to skip over the following mmbin instruction; in the original interpreter, they increment the program counter (pc++), while in LuaAOT we replace that by goto LUAOT_SKIP1 (see Fig. 10).

References

[1] Hugo Gualandi, Roberto Ierusalimschy, Pallene: A companion language for Lua, Sci. Comput. Program. 189 (2020) 102393, http://dx.doi.org/10.1016/j.scico.2020.102393, URL http://www.inf.puc-rio.br/~hgualandi/papers/Gualandi-2020-SCP.pdf.
[2] Yoshihiko Futamura, Partial evaluation of computation process–An approach to a compiler-compiler, High.-Order Symb. Comput. 12 (4) (1999) 381–391, http://dx.doi.org/10.1023/A:1010095604496.
[3] Peter Sestoft, Neil D. Jones, Partial evaluation and automatic program generation, in: Prentice-Hall International Series in Computer Science, Prentice Hall, 1993.
[4] Hugo Musso Gualandi, Roberto Ierusalimschy, A surprisingly simple Lua compiler, in: Proceedings of the 25th Brazilian Symposium on Programming Languages, SBLP 2021, Joinville, Brazil, September 27–October 1, 2021, 2021, http://dx.doi.org/10.1145/3475061.3475077.
[5] Roberto Ierusalimschy, Luiz Henrique de Figueiredo, Waldemar Celes, The implementation of Lua 5.0, J. Univ. Comput. Sci. 11 (7) (2005) 1159–1176, http://dx.doi.org/10.3217/jucs-011-07-1159.
[6] Anton M. Ertl, David Gregg, The structure and performance of efficient interpreters, J. Instr.-Level Parall. (5) (2003), URL https://www.jilp.org/vol5/index.html.
[7] Hugo Gualandi, LuaAOT 5.4 source code repository, 2021, URL https://github.com/hugomg/lua-aot-5.4.
[8] Ana Lúcia de Moura, Revisitando co-rotinas (Ph.D. thesis), PUC-Rio, 2004.
[9] Ana Lúcia de Moura, Noemi Rodriguez, Roberto Ierusalimschy, Coroutines in Lua, J. Univ. Comput. Sci. 10 (7) (2004) 910–925, http://dx.doi.org/10.3217/jucs-010-07-0910.
[10] Roberto Ierusalimschy, Luiz Henrique de Figueiredo, Waldemar Celes, Lua test suite, 2022, URL https://www.lua.org/tests/.
[11] Mike Pall, LuaJIT, A just-in-time compiler for Lua, 2005, URL http://luajit.org/luajit.html.
[12] Isaac Gouy, The computer language benchmarks game, 2013, URL https://benchmarksgame-team.pages.debian.net/benchmarksgame.
[13] Stefan Marr, Benoit Daloze, Hanspeter Mössenböck, Cross-language compiler benchmarking: Are we fast yet? in: Proceedings of the 12th Symposium on Dynamic Languages, DLS 2016, 2016, pp. 120–131, http://dx.doi.org/10.1145/2989225.2989232, URL http://stefan-marr.de/papers/dls-marr-et-al-cross-language-compiler-benchmarking-are-we-fast-yet/.
[14] Hisham Muhammad, The Teal compiler, 2019, Teal source code repository, URL https://github.com/teal-language/tl/.
[15] Roberto Ierusalimschy, Luiz Henrique de Figueiredo, Waldemar Celes, Lua 5.4 interpreter source code, 2021, URL https://www.lua.org/versions.html.
[16] Lars Ole Andersen, Partial evaluation of C and automatic compiler generation, in: Uwe Kastens, Peter Pfahler (Eds.), Compiler Construction, Springer Berlin Heidelberg, Berlin, Heidelberg, 1992, pp. 251–257.
[17] Neil D. Jones, Transformation by interpreter specialisation, Sci. Comput. Program. 52 (1) (2004) 307–339, http://dx.doi.org/10.1016/j.scico.2004.03.010, Special Issue on Program Transformation.
[18] Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, Mario Wolczko, One VM to rule them all, in: Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, Onward! 2013, 2013, pp. 187–204, http://dx.doi.org/10.1145/2509578.2509581, URL https://wiki.openjdk.java.net/display/Graal/Publications+and+Presentations.
[19] Thomas Würthinger, Christian Wimmer, Christian Humer, Andreas Wöß, Lukas Stadler, Chris Seaton, Gilles Duboscq, Doug Simon, Matthias Grimmer, Practical partial evaluation for high-performance dynamic language runtimes, in: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, ACM, 2017, pp. 662–676, http://dx.doi.org/10.1145/3062341.3062381.
[20] Nagy Mostafa, Chandra Krintz, Calin Cascaval, David Edelson, Priya Nagpurkar, Peng Wu, Understanding the Potential of Interpreter-Based Optimizations For Python, Technical Report, University of California, Santa Barbara, 2010, URL https://www.cs.ucsb.edu/sites/cs.ucsb.edu/files/docs/reports/2010-14.pdf.
[21] Gergö Barany, pylibjit: A JIT compiler library for Python, in: Software Engineering (Workshops), 2014, pp. 213–224.
[22] Rhys Weatherley, GNU libjit, 2004, The GNU LibJIT library.
[23] Gergö Barany, Python interpreter performance deconstructed, in: Proceedings of the Workshop on Dynamic Languages and Applications, Dyla '14, 2014, pp. 5:1–5:9, http://dx.doi.org/10.1145/2617548.2617552, URL https://publik.tuwien.ac.at/files/PubDat_233742.pdf.
[24] Paul Biggar, Design and Implementation of an Ahead-of-Time Compiler for PHP (Ph.D. thesis), Trinity College Dublin, 2010, URL https://paulbiggar.com/research/#phd-dissertation.