LuaJIT
Authors
Dario d’Andrea
Laurent Deniau
This document describes the details of the LuaJIT project with a particular
focus on the concept of trace-based just-in-time compilation. It aims to be a
general guide, especially for newcomers to this topic. This document originates
from a recent Master's thesis project [1] carried out at CERN,
the European Organisation for Nuclear Research.
Before starting, we want to express our gratitude to the author of LuaJIT,
Mike Pall, and to the LuaJIT community for all the useful insights contained in the
LuaJIT mailing list.
Contents
1 LuaJIT overview 5
1.1 Lua . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 LuaJIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Files organisation . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 11
2.1 Just-in-time compilation . . . . . . . . . . . . . . . . . . . . . 11
2.2 Compilation units . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Trace-based Just-in-time Compilation . . . . . . . . . . . . . . 13
2.3.1 Identifying trace headers . . . . . . . . . . . . . . . . . 16
2.3.2 Hotpath detection . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Trace recording . . . . . . . . . . . . . . . . . . . . . . 18
2.3.4 Abort and blacklisting . . . . . . . . . . . . . . . . . . 20
2.3.5 Compiling traces . . . . . . . . . . . . . . . . . . . . . 21
2.3.6 Trace exit . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.7 Sidetraces . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.1 Early Tracing JITs . . . . . . . . . . . . . . . . . . . . 25
2.4.2 Recent Tracing JITs . . . . . . . . . . . . . . . . . . . 26
3 Virtual Machine 28
3.1 Frontend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.1 Lexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.2 Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.3 Bytecode frontend . . . . . . . . . . . . . . . . . . . . 29
3.2 Internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Tagged value . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 String Internalization . . . . . . . . . . . . . . . . . . . 30
3.2.3 Lua table . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.4 Garbage collector . . . . . . . . . . . . . . . . . . . . . 31
3.2.5 Allocator . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.6 Function . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.7 Fast Function . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.8 GC64 mode . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Bytecode interpreter . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.1 Standard library . . . . . . . . . . . . . . . . . . . . . 34
3.4.2 LuaJIT extensions . . . . . . . . . . . . . . . . . . . . 34
3.4.3 The C API . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.4 Build Library . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Foreign function interface . . . . . . . . . . . . . . . . . . . . 37
4 JIT compiler 44
4.1 Hotpaths detection . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.1 Architecture specific implementation . . . . . . . . . . 47
4.1.2 Hotcount collisions . . . . . . . . . . . . . . . . . . . . 48
4.1.3 Memory address randomisation . . . . . . . . . . . . . 48
4.2 Recording . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.1 Trace compiler state machine . . . . . . . . . . . . . . 49
4.2.2 Start recording . . . . . . . . . . . . . . . . . . . . . . 51
4.2.3 Recording . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.4 Ending recording . . . . . . . . . . . . . . . . . . . . . 53
4.3 Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.1 Dead code elimination . . . . . . . . . . . . . . . . . . 54
4.3.2 Loop optimisations . . . . . . . . . . . . . . . . . . . . 54
4.3.3 Split optimisations . . . . . . . . . . . . . . . . . . . . 54
4.3.4 Sinking optimisations . . . . . . . . . . . . . . . . . . . 54
4.3.5 Narrowing optimisations . . . . . . . . . . . . . . . . . 55
4.3.6 Fold engine . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Assemble trace . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5 Trace Abort . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.1 Abort . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.2 Blacklisting . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Snapshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.6.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6.2 Sparse snapshots . . . . . . . . . . . . . . . . . . . . . 69
4.7 Variables allocation . . . . . . . . . . . . . . . . . . . . . . . . 70
4.7.1 Local variables . . . . . . . . . . . . . . . . . . . . . . 70
4.7.2 Global variables . . . . . . . . . . . . . . . . . . . . . . 72
4.7.3 Upvalues . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 Side traces and Stitch traces 75
5.1 Canonical transformations . . . . . . . . . . . . . . . . . . . . 75
5.1.1 Logical transformations . . . . . . . . . . . . . . . . . . 76
5.1.2 Loops equivalence . . . . . . . . . . . . . . . . . . . . . 82
5.1.3 Assert . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Essential cases . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.1 Empty loop . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.2 Loop with assignment . . . . . . . . . . . . . . . . . . 87
5.2.3 Loop with if-statements . . . . . . . . . . . . . . . . . 89
5.2.4 Nested loop . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Loop with two if-statements . . . . . . . . . . . . . . . . . . . 95
5.3.1 Case 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.2 Case 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3.3 Case 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.4 Case 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4 Nested loop with more inner loops . . . . . . . . . . . . . . . . 106
5.5 Recursive functions . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5.1 Non-tail recursive function . . . . . . . . . . . . . . . . 111
5.5.2 Tail recursive function . . . . . . . . . . . . . . . . . . 112
5.6 Stitch trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7 Strategy 124
7.1 Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.2 Dump mode and post-execution analysis . . . . . . . . . . . . 127
7.3 Tuning LuaJIT parameters . . . . . . . . . . . . . . . . . . . . 129
Chapter 1
LuaJIT overview
1.1 Lua
Lua is described on its official website [2] as a powerful, efficient, lightweight,
embeddable scripting language. It supports procedural programming, object-
oriented programming, functional programming, data-driven programming,
and data description. It is designed, implemented, and maintained by a team
at PUC-Rio, the Pontifical Catholic University of Rio de Janeiro in Brazil.
Lua combines simple procedural syntax with powerful data description
constructs based on associative arrays and extensible semantics. Lua runs by
interpreting bytecode with a register-based virtual machine. It is dynamically
typed and has automatic memory management with incremental garbage
collection.
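As a minimal illustration of these data description constructs (this snippet is our own hypothetical example, not taken from the Lua documentation), a single Lua table can act as an array, a record and an associative map at the same time:

-- A Lua table used simultaneously as array, record and associative map.
local particle = {
  "proton",             -- array part, index 1
  mass   = 938.272,     -- record-like fields (string keys)
  charge = 1,
  [42]   = "any value except nil can be a key",
}
print(particle[1], particle.mass, particle[42])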
Lua is specifically known for its performance: experiments on several
benchmarks show Lua to be one of the fastest interpreted scripting languages
available.
1.2 LuaJIT
LuaJIT [3] is a trace-based just-in-time compiler for the Lua programming
language. It is widely considered to be one of the fastest dynamic language
implementations, as it outperforms other dynamic languages on many cross-
language benchmarks. In this section, we go through a description of
its internal architecture, which is shown in Fig. 1.1.
[Figure 1.1: LuaJIT internal architecture (diagram): *.lua sources pass through the Lexer and Parser, and precompiled *.o files through the BC Frontend, producing bytecode (BC) executed by the Interpreter (written in *.dasc); the VM side also comprises the Standard Library, the LuaJIT Extensions, the FFI and calls to C libraries; the JIT side consists of the Optimizer working on the IR, the assembler and BC patching.]
When a hotpath is detected, the executed bytecode instructions are recorded into a trace by the
Recorder, which also emits an intermediate representation (IR) in static single
assignment (SSA) form. Then, the Optimiser applies some optimisations to
the IR, followed by the Assembler, which compiles the trace to platform-
specific machine code. Finally, the bytecode that was detected as a hotpath is
patched in order to replace its execution with a call to the compiled trace.
The details of just-in-time compilation are described in Chapter 4.
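As a concrete sketch (the loop below is our own hypothetical example, not taken from the LuaJIT sources), a simple numeric loop is enough to exercise the whole pipeline; running it with the bundled dump module enabled, e.g. luajit -jdump sum.lua, prints the recorded bytecode, the SSA IR and the machine code of the resulting trace.

-- sum.lua: a loop that quickly becomes hot and is compiled into a trace.
local sum = 0
for i = 1, 1e7 do
  sum = sum + i
end
print(sum)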
File                Description
luaconf.h           Lua configuration header
lua.[h,hpp]         Lua
lauxlib.h           Auxiliary functions for building Lua libraries
lib_aux.c           Auxiliary library for the Lua/C API
lualib.h            Standard library header
lib_base.c          Base and coroutine library
lib_math.c          Math library
lib_string.c        String library
lib_table.c         Table library
lib_io.c            I/O library
lib_os.c            OS library
lib_package.c       Package library
lib_debug.c         Debug library
lib_bit.c           Bit manipulation library
lib_jit.c           JIT library
lib_ffi.c           FFI library
lib_init.c          Library initialization
Makefile            LuaJIT Makefile
ljamalg.c           LuaJIT core and libraries amalgamation
luajit.[c,h]        LuaJIT frontend
lj_alloc.[c,h]      Bundled memory allocator
lj_api.c            Public Lua/C API
lj_arch.h           Target architecture selection
lj_buf.[c,h]        Buffer handling
lj_def.h            LuaJIT common internal definitions
lj_ff.h             Fast function IDs
lj_ffrecord.[c,h]   Fast function call recorder
lj_frame.h          Stack frames
lj_func.[c,h]       Function handling (prototypes, functions and upvalues)
lj_gc.[c,h]         Garbage collector
lj_gdbjit.[c,h]     Client for the GDB JIT API
lj_obj.[c,h]        LuaJIT VM tags, values and objects
lj_lib.[c,h]        Library function support
lj_load.c           Load and dump code
lj_udata.[c,h]      Userdata handling
lj_carith.[c,h]     C data arithmetic
lj_ccall.[c,h]      FFI C call handling
lj_ccallback.[c,h]  FFI C callback handling
lj_cconv.[c,h]      C type conversions
lj_cdata.[c,h]      C data management
lj_char.[c,h]       Character types
lj_clib.[c,h]       FFI C library loader
lj_cparse.[c,h]     C declaration parser
lj_crecord.[c,h]    Trace recorder for C data operations
lj_ctype.[c,h]      C type management
lj_str.[c,h]        String handling
lj_strfmt.[c,h]     String formatting
lj_strfmt_num.c     String formatting for floating-point numbers
lj_strscan.[c,h]    String scanning
lj_tab.[c,h]        Table handling
lj_bcdef.h          Generated file
lj_bc.[c,h]         Bytecode instruction format and modes
lj_bcdump.h         Bytecode dump definitions
lj_bcread.c         Bytecode reader
lj_bcwrite.c        Bytecode writer
lj_lex.[c,h]        Lexical analyzer
lj_parse.[c,h]      Lua parser (source code -> bytecode)
lj_vm.h             Assembler VM interface definitions
lj_vmevent.[c,h]    VM event handling
lj_vmmath.c         Math helper functions for assembler VM
vm_arm.dasc         Low-level VM code for ARM CPUs
vm_arm64.dasc       Low-level VM code for ARM64 CPUs
vm_mips.dasc        Low-level VM code for MIPS CPUs
vm_mips64.dasc      Low-level VM code for MIPS64 CPUs
vm_ppc.dasc         Low-level VM code for PowerPC, 32 bit or 32on64 bit mode
vm_x64.dasc         Low-level VM code for x64 CPUs in LJ_GC64 mode
vm_x86.dasc         Low-level VM code for x86 CPUs
lj_jit.h            Common definitions for the JIT compiler
lj_trace.[c,h]      Trace management
lj_traceerr.h       Trace compiler error messages
lj_dispatch.[c,h]   Instruction dispatch handling
lj_ir.[c,h]         SSA IR (Intermediate Representation) format and emitter
lj_ircall.h         IR CALL* instruction definitions
lj_record.[c,h]     Trace recorder (bytecode -> SSA IR)
lj_snap.[c,h]       Snapshot handling
lj_state.[c,h]      State and stack handling
lj_iropt.h          Common header for IR emitter and optimizations
lj_opt_dce.c        Dead Code Elimination. Pre-loop only (ASM already performs DCE)
lj_opt_fold.c       Constant Folding, Algebraic Simplifications and Reassociation. Array Bounds Check Elimination. Common-Subexpression Elimination.
lj_opt_loop.c       Loop Optimizations
lj_opt_mem.c        Memory access optimizations. Alias Analysis using high-level semantic disambiguation. Load Forwarding (L2L) + Store Forwarding (S2L). Dead-Store Elimination.
lj_opt_narrow.c     Narrowing of numbers to integers (double to int32_t). Stripping of overflow checks.
lj_opt_sink.c       Allocation Sinking and Store Sinking
lj_opt_split.c      Split 64 bit IR instructions into 32 bit IR instructions
lj_mcode.[c,h]      Machine code management
lj_meta.[c,h]       Metamethod handling
lj_emit_arm.h       ARM instruction emitter
lj_emit_arm64.h     ARM64 instruction emitter
lj_emit_mips.h      MIPS instruction emitter
lj_emit_ppc.h       PPC instruction emitter
lj_emit_x86.h       x86/x64 instruction emitter
lj_target.h         Definitions for target CPU
lj_target_arm.h     Definitions for ARM CPUs
lj_target_arm64.h   Definitions for ARM64 CPUs
lj_target_mips.h    Definitions for MIPS CPUs
lj_target_ppc.h     Definitions for PPC CPUs
lj_target_x86.h     Definitions for x86 and x64 CPUs
lj_asm.[c,h]        IR assembler (SSA IR -> machine code)
lj_asm_arm.h        ARM IR assembler (SSA IR -> machine code)
lj_asm_arm64.h      ARM64 IR assembler (SSA IR -> machine code)
lj_asm_mips.h       MIPS IR assembler (SSA IR -> machine code)
lj_asm_ppc.h        PPC IR assembler (SSA IR -> machine code)
lj_asm_x86.h        x86/x64 IR assembler (SSA IR -> machine code)
lj_debug.[c,h]      Debugging and introspection
lj_err.[c,h]        Error handling
lj_errmsg.h         VM error messages
lj_profile.[c,h]    Low-overhead profiling
Table 1.1: LuaJIT files
Chapter 2
Background
Cuni [7] illustrates two general rules to consider when tackling the
problem of compiler optimisation: (i) the Pareto principle (or 80/20 rule)
[8] states that 80% of the execution time of a program is spent in only
20% of the code; thus, small parts of the code can make the difference in the
performance of the whole program; (ii) the Fast Path principle [9] states
that the most frequently executed operations should be handled by fast paths in
order to speed up the execution, while the remaining cases are not required
to be particularly efficient.
(i). Dynamic Basic Block. As defined by Smith and Nair [11], a dynamic
basic block is determined by the actual flow of a program when it is
executed. It always begins at the instruction executed immediately
after a branch and continues until the next conditional branch
is encountered. Dynamic basic blocks are usually larger than static
basic blocks, and the same static instruction may belong to more than
one dynamic basic block. This approach is typically used in binary
translators.
(ii). Function (Method ). It is the most intuitive compilation unit for a JIT
compiler. In this case, the whole function with all the possible branches
and control flow paths is compiled. A function is generally marked as
hot and compiled when it is frequently called at run-time. Then, any
subsequent calls of the same function will lead to the already compiled
machine code, instead of using the interpreter. Afterwards, the system
generally reverts to interpretation when the compiled function ends.
Static compilers also usually compile a function all at once, hence the
same optimisation techniques can be reused by function-based just-in-
time compilers.
(iii). Loop. The analogous approach used for functions can be applied for
loops. In this context the entire loop body is compiled, including all
possible control-flow paths. Loops are generally good candidates to
be considered as hotspots since the same set of instructions will be
executed repeatedly many times.
(iv). Region. First introduced in [12], this approach uses regions as more
general compilation units. A region is the result of collecting code from
several functions, while excluding all rarely executed portions of these
functions. To create a region, the process begins from the most frequently
executed block not yet in any region, the so-called seed block. Then, the
scope of the region is expanded by selecting a path of successors based
solely on the execution frequency. This process continues until no more
desirable successors are found.
(v). Trace. A trace is a linear sequence of instructions that does not contain
any control-flow join points. The execution either continues on the
trace, which consists of a unique path of instructions (hotpath), or it
exits the trace. A trace has a single entry point and one or more
exit points. According to the logic used in designing the JIT, traces
can be generated from loops or functions. The last instruction of the
trace may jump to the beginning of the trace (e.g. for loops), to another
trace, or to the interpreter. Trace exits can either lead to another trace
(sidetrace) or back to the interpreter. If there are multiple frequently
executed control flow paths through the same set of instructions,
the JIT will generate multiple traces (including sidetraces). This can
lead to duplication, because a block of instructions can be repeated in
different traces, but this replication can provide more opportunities for
specialisation and aggressive optimisation.
(either loops or functions) are good candidates to produce hotpaths that will
be compiled into traces.
This family of just-in-time compilers is built on the assumptions that: (i)
programs spend most of their execution time in loops; (ii) several iterations
of the same loop are likely to take similar code paths.
[Figure 2.1: Stages of a system with a tracing JIT (diagram): execution starts in Interpretation with Hotpath Monitoring; when a hotpath is detected the system moves to Recording; at the end of recording it moves to Compilation and Optimisation (which may abort); the compiled trace then runs as Executing Machine Code until a guard failure.]
A system made of a virtual machine (VM) equipped with a tracing JIT can
go through various stages when executing a program. These are summarised
in Fig. 2.1:
On the other hand, if the interpreter hits a fragment of code that has
already been compiled into an existing trace, the execution goes to
the already compiled machine code. In this case the VM switches to
executing machine code.
(ii). Recording. When a hotpath is found, the interpreter continues to run
bytecode instructions, but all the executed bytecode instructions are
also recorded. These recorded instructions are stored into the linear list
that we previously called a trace. Generally, an intermediate represen-
tation (IR) is emitted from the bytecode instructions and will be used
for optimisation and compilation.
Recording continues until the interpreter finishes executing all the
instructions of the detected hotpath (e.g. one iteration of a loop or
an entire function). The decision to stop recording is crucial for the
efficacy and performance of a tracing JIT. It will be further discussed
in Section 2.3.3.
At any moment of the recording phase, an abort can occur. It means
that recording failed because the execution flow took a path in the
code that cannot produce a suitable trace. This can be caused by
an exception or any kind of error while recording. If this happens, the
partial trace that was generated is discarded and the VM switches back
to the previous stage, which consists of pure interpretation.
(iii). Optimisation and Compilation. Once recording is successfully com-
pleted the system switches to compilation. In this case the IR pro-
duced is aggressively optimised and the trace is compiled to machine
code. The JIT compiler produces very efficient machine code that is
immediately executable, e.g. it can be used for the next iteration of a
loop or for the next time a function will be called.
During this phase an abort can also occur. If so, the partial trace is
discarded and the system switches back to interpretation.
(iv). Executing Machine Code. In this phase the machine code previously
generated by the tracing JIT is executed. This machine code is cached,
so that if the interpreter encounters a code fragment that previously
produced a trace, it will switch to executing the already compiled ma-
chine code. Generally, there is a limited cache memory where compiled
traces are stored; if this memory is full, the oldest trace is discarded to
make room for the new one.
The end of a trace can be connected: (i) to itself (i.e. a loop or
recursive function), so that the machine code of the trace runs repeatedly
until some exit condition is triggered; (ii) to another trace; (iii) to
the interpreter. This link is created according to the specific hotpath
previously recorded and executed by the interpreter.
Since a trace is a linear sequence of instructions, it contains guards
that ensure the correctness of the machine code executed. Guards
check that the assumptions made in the trace are fulfilled (e.g. the execution
flow follows a specific path of a branch, the hypotheses on variable
types hold, etc.). If one of the assumptions is not respected, the
associated guard fails and the trace exits. When the trace exits because
of a guard failure the system generally switches to interpretation, but
if particular conditions are met (see Section 2.3.7) a trace exit can lead
to another trace, a so-called sidetrace.
In the next sections, we describe the phases just mentioned in more
detail.
but it keeps track of the last n branch targets in a history cache. Only
branch targets in this cache are considered as potential trace headers.
Even if this method implies an overhead caused by the cache, it needs
fewer counters because there will be fewer branch targets. Hiniker,
Hazelwood, and Smith proved that using LEI (instead of NET) there
is an improvement in locality of execution while reducing the size of
the code cache.
(iii). Natural loop first (NLF). This approach consists in considering some
bytecode instructions as "special" because they are the only ones that
can be potential trace headers (e.g. bytecode instructions at the be-
ginning of a loop, or function calls). A special treatment should also
be performed for recursive functions and gotos, which can with high
probability give rise to frequently executed paths in the code. To use
this technique we must be able to access information on the higher-level
structure of the program. The advantage of this method is that fewer
points of the program are considered as potential trace headers and
fewer counters are needed. It is also more predictable where
traces can start.
LuaJIT [3] by Pall in fact uses this heuristic to identify hotpaths, e.g.
a for loop is translated to a special bytecode instruction that is considered
a potential trace header.
It should be noted that side exits of a trace can also be considered as potential
trace headers, because a trace, which we previously called a sidetrace, can start
from that point (Section 2.3.7 describes this technique in detail).
of the code. Finding a suitable trade-off depends on many aspects including
the specific application, programming language, architecture, etc.
In many tracing JITs, the hotness threshold is a parameter that the user
can set according to their needs. In this way it is possible to tune it based
on the performance obtained.
[Figure: hotpath detection flowchart — the interpreter executes bytecode instructions; whenever a potential trace header is executed, its counter is incremented; when the counter exceeds the threshold, a hotpath is detected.]
take when recording instructions. Ideally, we should record the path that has
the highest probability to be taken, but this is not ensured in any way.
Analysing the example of the loop in Fig. 2.3 clarifies this concept. The
two possible paths taken by the execution flow are either A-B-D (path 1) or
A-C-D (path 2), since there is a branch after block A. Let us suppose that,
in a random iteration of the loop, the probability of executing path 1 is 80%
and the remaining 20% is for path 2. In this situation the best choice would be to
record the trace along path 1, but there is no guarantee of that. In
fact, the behaviour of a tracing JIT is the following. As usual, the program is
run by the interpreter at first, then the VM starts recording when the counter
exceeds the hotness threshold (assuming that the loop iterates enough times
to become hot). The path that will be recorded is the path taken by the
execution flow in the next iteration of the loop, when the system switches
to the "interpretation and recording" mode (it can be either path 1 or path 2).
If we are not unlucky, path 1 will be recorded, but there is no
guarantee of that.
[Figure 2.3: Loop with a branch after block A leading to either block B or block C, which join again at block D.]
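A hypothetical Lua rendering of the loop in Fig. 2.3 (our own example) makes the situation concrete: the condition below is true in roughly 80% of the iterations, yet the path that ends up in the root trace is simply the one executed in the iteration during which recording happens to take place.

-- Block A is the loop header; B and C are the branch targets; D is the join point.
local hits = 0
for i = 1, 1000 do            -- A
  if i % 5 ~= 0 then          -- true for ~80% of the iterations
    hits = hits + 1           -- B (path 1: A-B-D)
  else
    hits = hits - 1           -- C (path 2: A-C-D)
  end
end                           -- D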
The phase of interpretation and recording continues until either there is
an abort or an end-of-trace condition is met, which means that the recording
has been successfully completed (aborts are discussed in the next paragraph).
In a successful scenario, a tracing JIT stops recording because one of the
following end conditions has been encountered [10]: (i) Loop back to entry. It
is the simplest case, when a loop goes back to the point where recording
started. It means that a cycle has been found, thus recording can stop
because a loop has been correctly recorded. (ii) Loop back to parent. It
is the case when a sidetrace loops back to its parent trace. Thus, a cycle
has been detected and the sidetrace was successfully recorded. (iii) Start of
existing trace. This happens when an already existing trace is encountered
while recording. In this situation the behaviour of a tracing JIT can vary
between implementations: either it stops recording and the new trace jumps
to the existing trace, or recording continues independently. In the latter
situation there will be longer traces and duplication increases, but there are
more opportunities for specialisation and aggressive optimisations.
[Figure: blacklisting flowchart — when the n-th fragment becomes hot and its trace aborts, its backoff counter is incremented and a new attempt to create a trace is made if the same fragment becomes hot again; once the counter exceeds the blacklisting threshold, the fragment is blacklisted.]
After optimisations, a trace is compiled to very efficient and specialised
machine code where every guard is turned into a quick check to verify whether
the assumption still holds. At this point, the trace consists of a linear se-
quence of optimised instructions in SSA form, hence the translation to ma-
chine code is also facilitated.
to interpretation should be relatively rare events.
2.3.7 Sidetraces
As previously mentioned, a trace can be created from an exit of a root trace.
The trace generated will be called sidetrace because it starts from the exit of
another trace. The trace to which the sidetrace is attached is called parent
trace. Sidetraces are needed because a single trace only covers one path of
the entire control flow graph. If multiple paths become hot (in the sense
of being frequently executed), it is appropriate to compile them into multiple
traces.
[Figure 2.5: Sidetrace creation flowchart — while executing the machine code of a trace, when the n-th guard fails the sidetrace machine code is executed if the sidetrace already exists; otherwise the counter of that guard is incremented and execution goes back to interpretation; when the counter exceeds the sidetrace threshold the guard becomes hot and a sidetrace is recorded and compiled.]
A sidetrace is created when the same guard fails repeatedly (the guard be-
comes hot). At that point, it becomes too expensive to keep restoring the VM
state and resuming interpretation; thus, it is more profitable to attach a sidetrace to
the hot exit. The diagram in Fig. 2.5 describes this mechanism.
In a situation where two paths are frequently executed, the first path
that becomes hot will be handled by the parent trace; the second one
will then be handled in part by the parent trace and finally by the sidetrace. The
example of the loop in Fig. 2.6 describes this situation. If the path A-B-D
becomes hot first, a root trace (a trace with no parent) will be created. Then
a sidetrace that executes C-D is created when the corresponding guard also becomes hot.
[Figure 2.6: Loop with blocks A, B, C and D, its parent trace covering A-B-D, and the sidetrace covering C-D.]
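In Lua terms, a hypothetical loop of this shape (our own example) behaves as follows: the path recorded first becomes the root trace, the guard protecting its branch eventually becomes hot, and the other path is compiled as a sidetrace attached to that exit.

local evens, odds = 0, 0
for i = 1, 1000 do        -- A
  if i % 2 == 0 then
    evens = evens + 1     -- B: if recorded first, part of the parent trace
  else
    odds = odds + 1       -- C: later compiled as a sidetrace off the guard exit
  end
end                       -- D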
the contents of the stack slot in the parent trace.
2.4.2 Recent Tracing JITs
In more recent years, several tracing just-in-time compilers have been pro-
posed as an efficient solution for dynamic languages. HotpathVM by Gal et
al. [16] is a tracing JIT for the Java VM released in 2006. It is small enough to
fit on resource-constrained embedded devices. It dynamically builds traces
from the bytecode and it limits its effort to frequently executed loops that are
identified through backward branches. The key to its success is an
innovative use of the SSA transformation [21], which Gal et al. called TSSA
(Trace Static Single Assignment). In classical SSA a control-flow graph is
entirely transformed into SSA form, and φ nodes are placed at control-flow
merge points. TSSA consists in transforming into SSA form only the variables
that are actually used in a recorded trace. In this way it is possible to perform
more aggressive optimisations on the trace, including LICM (loop-invariant
code motion) and moving operations on SSA values across side exit points.
Later on, in 2009, Gal et al. applied their trace-based approach to the JIT
compilation of dynamically-typed languages. They developed a tracing JIT
compiler for JavaScript, called TraceMonkey [19]. It was implemented for
an existing JavaScript interpreter called SpiderMonkey [30] and it was used in
Mozilla's Firefox browser up to version 11 of Firefox. TraceMonkey proposed
a novel approach including trace trees. It considers side exits as potential
locations for trace headers when the execution of a trace is repeatedly aborted
due to a guard failure. In this case the VM starts recording a new trace from
the point where the trace is aborted. Moreover, it generates special nested
trace trees for nested loops. Along the same lines, Chang et al. [31] released a
tracing JIT called Tamarin-Tracing in 2009. Tamarin is Adobe's VM that
implements ActionScript 3 [32], a flavour of ECMAScript [33]. JavaScript
is the best-known flavour of ECMAScript, and most JavaScript can be
executed without modification on Tamarin. Tamarin-Tracing is a branch of
Tamarin with a trace-based just-in-time compiler that uses run-time profiling
to identify frequently executed code paths. Both TraceMonkey and Tamarin-
Tracing were developed with the support of a joint collaboration between Mozilla
and Adobe. Other relevant works related to the ones just mentioned are:
Gal's PhD thesis [34] in 2006; Gal et al. [22, 35] in 2006 and 2007 respectively;
and Chang et al. [36, 37] in 2007 and 2011 respectively.
A further project has been realised in the context of meta-tracing, where
the JIT compiler does not trace the user program being run, but traces
the execution of the interpreter while it runs this program. In 2009, Bolz et
al. [17] applied this technique in PyPy's tracing JIT compiler to programs
that are interpreted for dynamic languages, including Python. Many studies
have been conducted in the same direction, including Bolz et al. [38, 39, 40]
in 2010, 2013 and 2014 respectively; Bolz's PhD thesis [41] in 2012; Cuni's PhD
thesis [7] in 2010; Ardö et al. [42] in 2012; and Vandercammen's MSc thesis [43]
in 2015. Along the same lines, Bauman et al. [44] created Pycket, a tracing JIT
for Racket, a dynamically typed functional programming language
descended from Scheme. Pycket is implemented using the RPython meta-
tracing framework, which automatically generates a tracing JIT compiler
from an interpreter written in RPython (a subset of Python, "Restricted
Python").
Another important contribution to trace-based just-in-time compilation
was made by Bebenita et al. [45] in 2010. They designed and imple-
mented SPUR, a tracing JIT for Microsoft's CIL (the target language of C#,
VisualBasic, F#, and many other languages).
A very successful tracing just-in-time compiler for the Lua programming
language is LuaJIT by Mike Pall [3]. Its first version, LuaJIT 1, was released
in 2005 with a JIT implemented using DynASM
[5, 46]. In LuaJIT 2, published in 2012, the whole VM was rewritten
from scratch in DynASM, realising a fast interpreter, and the JIT was reimple-
mented in C. There is no documentation of the LuaJIT internals, but a short
summary of the techniques used is given by Pall in a public statement about the
intellectual property contained in LuaJIT [47]. In the following years many
JITs have been derived from LuaJIT because of its outstanding per-
formance as a just-in-time compiler. Schilling developed a trace compiler for
Haskell based on LuaJIT, called Lambdamachine [10], for his PhD thesis in
2013. Another just-in-time compiler that was born from LuaJIT is RaptorJIT
[20] by Gorrie. It is a fork of LuaJIT suitable for high-performance low-level
system programming. It aims at ubiquitous tracing and profiling to make ap-
plication performance and compiler behaviour transparent to programmers.
It is provided with an interactive tool for inspecting and cross-referencing
trace and profiler data called Studio [48]. Finally, another relevant piece of software
that should be mentioned in this context is OpenResty [49], a full-fledged
web platform that integrates a modified version of LuaJIT.
Chapter 3
Virtual Machine
3.1 Frontend
The compiler frontend is composed of the Lexer, the Parser and the BC Frontend.
3.1.1 Lexer
The lexer (implemented in lj_lex.[c,h]) converts a stream of Lua source
text into a sequence of tokens. Its input is a Lua program (*.lua). It uses
LexState as its principal data structure. The user-provided rfunc function is
used to read a chunk of data to process. It is accessed through the p and
pe pointers. The main function is lex_scan, which dispatches the work to other
functions depending on the type of data to be processed (comment, string
literal, long string, numbers, etc.). TValues (tokval, lookaheadval) are used
to store the token values, while LexToken (tok, lookahead) determines the type.
The string buffer (sb) is used to accumulate the characters of a future string
before internalizing it. All Lua keywords are internalized as strings at the
very beginning; GCstr has the reserved field for marking them.
3.1.2 Parser
The parser (implemented in lj_parse.[c,h]) takes as input the sequence
of tokens produced by the lexer. LuaJIT does not build an abstract
syntax tree representation of the parsed code (as a "standard" compiler would
do), but directly generates the bytecode on the fly using helpers from
lj_bc.h. It also uses LexState as the principal data structure. The lj_parse
function is the entry point and parses the main chunk as a vararg function.
The unit of emission is the function prototype GCproto, and the structure used for its
construction is FuncState. Parsing is a succession of chunks (parse_chunk),
where parse_stmt is the principal function called for each statement, dispatching
the work depending on the current token type. FuncScope is a linked list of
structures used for scope management.
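Since the parser emits bytecode directly, its output can be inspected from Lua itself; a small sketch, assuming a standard LuaJIT installation where the bundled jit.bc module is available:

-- Print the bytecode that the parser generated for a small function.
local bc = require("jit.bc")

local function add(a, b)
  return a + b
end

bc.dump(add)   -- prints the bytecode listing (e.g. ADDVV, RET1) to stdout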
3.2 Internals
3.2.1 Tagged value
LuaJIT represents all internal values as 64-bit TValues (tagged values). It
uses the NaN-tagging technique to differentiate between numbers and other
types of values. In fact, Lua numbers are 64-bit floating-point numbers
following the IEEE 754 standard. Numeric NaNs are canonicalised by the
CPU (0xfff8 in the most significant bits and zero otherwise), leaving the
possibility of using the lower bits to represent arbitrary data. Internally,
LuaJIT has two different representations, one for 32-bit mode and another
for 64-bit (LJ_GC64) mode (see Tables 3.1 and 3.2). In those tables, itypes
are numbers identifying the type of the object. GC objects (garbage-collected
objects) represent all allocated objects that are managed by the garbage
collector. GCRefs are references to such objects.
Table 3.1: Internal object tagging for 32-bit mode

                         MSW (32 bits)   LSW (32 bits)
primitive types          itype           -
lightuserdata (32-bit)   itype           void *
lightuserdata (64-bit)   0xffff          void *
GC objects               itype           GCRef
int                      itype           int
number                   ------- double -------
3.2.3 Lua table
Tables are garbage-collected objects represented by the structure GCtab in
lj_obj.h. The functions to manipulate them are defined in lj_tab.c and
lj_tab.h. GCtab is composed of an array part and a hash part. If the array
part is small, it is allocated directly after the structure in memory (colocation);
otherwise it is allocated separately. The hash part is a hash table used
to store all non-integer keys (or integer keys too big to fit in the array part). It is
implemented as an array using a singly-linked list for collisions, where the nodes of
the linked list are within the array (not separately allocated) and a variation of Brent's
hashing method is used. New integer keys that are bigger than the array
part are always inserted in the hash part until it is full; this then
triggers the resizing of the table. The new asize and hmask are both powers
of 2. The new asize value corresponds to the biggest power of 2 such that at
least 50% of the integers below it are used as keys. The new hmask is picked
such that all non-integer keys plus the integer keys that are bigger than the
new asize fit in it. When resizing occurs, the hash part is rehashed and
integer keys that now fit in the array part are moved there.
The nomm field of GCtab is a negative cache for fast metamethod checks.
It is a bitmap marking absent fields of the metatable.
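The array/hash split is invisible at the language level, but a small hypothetical example (ours) shows which part each key ends up in, following the rules above:

local t = {}

-- Consecutive integer keys go to the array part.
for i = 1, 8 do t[i] = i * i end

-- Non-integer keys always go to the hash part.
t.name = "squares"
t[0.5] = "hash part"

-- An integer key far beyond the array part is stored in the hash part;
-- it only moves to the array part if a later resize finds that at least
-- 50% of the integers below the new asize are used as keys.
t[1000] = "hash part, too"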
10 MRef sweep ; /* Sweep position in root list . */
11 GCRef gray ; /* List of gray objects . */
12 GCRef grayagain ; /* List of objects for atomic traversal . */
13 GCRef weak ; /* List of weak tables ( to be cleared ). */
14 GCRef mmudata ; /* List of userdata ( to be finalized ). */
15 GCSize debt ; /* Debt ( how much GC is behind schedule ). */
16 GCSize estimate ; /* Estimate of memory actually in use . */
17 MSize stepmul ; /* Incremental GC step granularity . */
18 MSize pause ; /* Pause between successive GC cycles . */
19 } GCState ;
3.2.5 Allocator
LuaJIT has its own embedded allocator, which is a customized version of
dlmalloc (Doug Lea's malloc). Information on the original implementation
can be found in the web article [51] or in the comments of the code [52]. The
allocator is implemented in lj_alloc.c. Its main structure is malloc_state.
Memory on the heap is allocated in chunks. Free chunks are managed as
a doubly linked list with the size of the chunk at the beginning and end of
it. Unallocated memory is grouped in bins of the corresponding size. There
are two types of bins. The smaller ones contain chunks of the same size
and their top is anchored in smallbins. The bigger ones are stored as bitwise
digital trees (aka tries) keyed by size, where the top of a tree is anchored in
treebins. The allocator distinguishes between two types of memory allocation: if
the request is larger than 128 KB, it asks the operating system for a new memory
segment using mmap; if it is smaller, it uses chunks from the current
segment. All allocated segments are kept in a linked list anchored in seg.
For such smaller allocations, the allocator first tries to find an exact fit among
the available chunks to limit internal fragmentation. If it cannot find
one and the requested size is smaller than dvsize, then it uses the dv
chunk (designated victim), which is the last chunk that was split. This
is done to optimise locality. Otherwise, it goes for a best-fit match. If no
chunk big enough is available, it asks the system to extend the segment and
uses the boundary chunk top (always kept free). When memory is freed, it
performs chunk coalescing to avoid memory fragmentation. If topsize is bigger
than trim_check, then the current segment is shrunk and the memory is given
back to the OS. release_checks is a decreasing counter that, when it reaches zero,
triggers a check of all segments to release empty ones back to the OS.
3.2.6 Function
There are two different representations of a function: the function's prototype
and the function's closure. Lua function prototypes are represented by
GCproto (lj_obj.h) and are followed by the function's bytecode in memory.
The closures are represented by GCfuncL for Lua functions and GCfuncC
for C functions (using the Lua API). They contain the necessary information for
upvalues. Upvalues are represented by GCupval, which contains the cor-
responding value or a reference to the stack slot with the appropriate value.
Closures can be managed with the functions in lj_func.c, which allow
creating them, destroying them and closing their upvalues.
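A small hypothetical example (ours) of these objects: the prototype (GCproto) of the inner function is shared, every call of make_counter creates a fresh closure (GCfuncL), and count is the upvalue (GCupval) each closure captures.

-- Each call to make_counter builds a new closure over its own 'count' upvalue.
local function make_counter()
  local count = 0             -- captured as an upvalue
  return function()           -- a new closure sharing the same prototype
    count = count + 1
    return count
  end
end

local c1, c2 = make_counter(), make_counter()
print(c1())   --> 1
print(c1())   --> 2
print(c2())   --> 1  (its own, independent upvalue)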
3.4 Library
3.4.1 Standard library
LuaJIT supports full compatibility with Lua 5.1, hence it implements the
standard library. The code is copied and adapted from the PUC-RIO Lua
interpreter. A list of the corresponding files and descriptions is shown below.
library without the need for manual maintenance. The different steps and
their purposes are described here.
Firstly, if no Lua interpreter (either PUC-Lua or LuaJIT) is available on
the machine, a simplified and reduced version of the Lua interpreter is built from
minilua.c. Then, this interpreter is used to run genlibbc.lua, which is
responsible for parsing all of LuaJIT's source files searching for the LJLIB_LUA
macro that surrounds the names of library functions written in Lua. It then
generates the buildvm_libbc.h file, which contains the Lua bytecode for all
those functions in the libbc_code array and a mapping from function name
to bytecode offset in libbc_map.
This newly generated file is built along with all buildvm_* files to create
the buildvm program, which is used to parse all other LJLIB_* macros from
the library source code and generates several files (lj_bcdef.h, lj_libdef.h,
lj_ffdef.h, lj_recdef.h and vmdef.lua) that are added to the LuaJIT
compilation. Table 3.4 describes the macros and Table 3.5 the corresponding
generated files.
Macro               Description
LJLIB_MODULE_*      register a new module.
LJLIB_CF(name)      register a C function.
LJLIB_ASM(name)     register a fast function fallback handler.
LJLIB_ASM_(name)    register a fast function that uses the previous
                    LJLIB_ASM fallback handler.
LJLIB_LUA(name)     register a Lua function.
LJLIB_SET(name)     register the previous Lua stack value into the module
                    table with name as key.
                    • '!' : the last stack value becomes the next function's
                    env
LJLIB_REC(name)     register a handler to record a function:
                    – name of the recorder
                    – auxiliary data to put in recff_idmap
File           Description
lj_bcdef.h     for each fast function, lj_bc_ofs contains the off-
               set from lj_vm_asm_begin (in lj_vm.h) to the machine code
               of the function, and lj_bc_mode contains the byte-
               code operand mode (all set to BCMODE_FF)
               (see lj_bc.h and the Introduction section of the wiki
               [55]).
lj_ffdef.h     list of all library function names.
lj_libdef.h    the lj_lib_cf_* arrays contain the lists of function point-
               ers for the * libraries. lj_lib_init_* are arrays of
               packed data describing how the corresponding li-
               brary should be loaded (see lj_lib_register in lj_lib.c
               for the function that parses those data).
lj_recdef.h    for each library function, recff_idmap contains an
               optional auxiliary datum (opcode, literal) allowing
               similar functionalities to be handled in a common han-
               dler. recff_func contains the list of record handlers.
vmdef.lua      contains all the VM definitions for use in Lua,
               e.g. bcnames (bytecode names).

3.5 Foreign function interface
There is official documentation for FFI users on the LuaJIT
website, where you can find the motivation for the FFI module [56], a small
tutorial [57], the API documentation [58] and the FFI semantics [59]. There
is also a reflection library for FFI ctypes [60] and its documentation [61] for
anyone interested in exploring the ctype of a given cdata.
This section presents the internal implementation of the FFI, not
its use. The organisation of the information follows the actual implemen-
tation files.
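Even though the focus here is on internals rather than usage, a minimal hypothetical sketch (ours) helps fix the vocabulary: declarations passed to ffi.cdef are parsed into CTypes, and ffi.new creates cdata objects described by them.

local ffi = require("ffi")

-- The C declarations are parsed by the C parser and internalized as CTypes.
ffi.cdef[[
typedef struct { double x, y; } point_t;
int printf(const char *fmt, ...);
]]

-- ffi.new allocates a cdata object (GCcdata) whose ctypeid refers to point_t.
local p = ffi.new("point_t", {3, 4})
ffi.C.printf("length = %f\n", math.sqrt(p.x * p.x + p.y * p.y))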
lib_ffi.c
This is the top-level file of the FFI library. It contains the implementation of the
FFI API, i.e. the functions that make the connection between Lua and C using
the standard Lua C API. It is also responsible for loading the FFI module
(luaopen_ffi). This file mainly uses and connects together functionalities
implemented in other files. For instance, it is responsible for the allocation
and initialisation of the main state (CTState) explained below.
lj_obj.h
From this file, in this section we are only interested in the GCcdata structure,
which is the garbage-collected object representing any C data used through the FFI.
Its structure is shown below. The key point to highlight is
the ctypeid, which is the index of the ctype describing the attached data (the
payload follows the structure in memory).
1 typedef struct GCcdata {
2 GCHeader ;
3 uint16_t ctypeid ; /* C type ID . */
4 } GCcdata ;
lj_ctype.h
The CType data structure is responsible for describing to the FFI what kind
of data the cdata represents (e.g. variable, struct, function, etc.). A detailed
schema in this regard is shown in Table 3.6; the abbreviations used are
explained in Table 3.7.
1 typedef struct CType {
2 CTInfo info ; /* Type info . */
3 CTSize size ; /* Type size or other info . */
4 CTypeID1 sib ; /* Sibling element . */
5 CTypeID1 next ; /* Next element in hash chain . */
6 GCRef name ; /* Element name ( GCstr ). */
7 } CType ;
The most important struct of the FFI is CTState. It contains all the in-
ternalized ctypes in the tab table. finalizer is a weak-keyed Lua table (values
can be garbage collected if the key is not referenced elsewhere) containing all
the finalizers registered with the ffi.gc method. miscmap is a Lua table mapping
all metatables of ctypes registered using the ffi.metatype method (in the neg-
ative CTypeID range) and all callback functions (in the positive callback-slot
range); a usage sketch is given after the CTState listing below. Any metatable
added to miscmap is definitive and never collected.
hash is an array used as a hash table for quick CTypeID checks. It maps
both the hashed name of named elements and the hashed type (info and
size) of unnamed elements to the corresponding CTypeID.
Table 3.6: Summary of CType information
Columns: info, subdivided into type (4 bits), flags (8 bits), A (4 bits) and cid (16 bits); size (32 bits); sid (16 bits); next (16 bits); name (GCRef).
NUM 0000 BFcvUL.. A size type
STRUCT 0001 ..cvu..V A size field name name
PTR 0010 ..cvR... A cid size type
ARRAY 0011 V 2 Ccv...V A cid size type
VOID 0100 ..cv.... A size type
ENUM 0101 ........ A cid size const name name
FUNC 0110 ....V 3 S.. ..cc cid nargs field name name
TYPEDEF 0111 ........ .... cid name name
ATTRIB 1000 .... attrnum cid attr sib type
FIELD 1001 ........ .... cid offset field name
BITFIELD 1010 B.cvU csz .bsz .pos offset field name
CONSTVAL 1011 ..c..... .... cid value const name name
EXTERN 1100 ........ .... cid sib name name
KW 1101 ........ .... tok size name name
Collisions are handled in a linked list using the next field of the CType struct.
1 typedef struct CTState {
2 CType * tab ; /* C type table . */
3 CTypeID top ; /* Current top of C type table . */
4 MSize sizetab ; /* Size of C type table . */
5 lua_State * L ; /* Lua state ( for errors and allocations ). */
6 global_State * g ; /* Global state . */
7 GCtab * finalizer ; /* Map of cdata to finalizer . */
8 GCtab * miscmap ; /* Map - CTypeID - > metatable and cb slot - > func . */
9 CCallback cb ; /* Temporary callback state . */
10 CTypeID1 hash [...]; /* Hash anchors for C type table . */
11 } CTState ;
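As a hypothetical usage sketch (ours) of the two entry points mentioned above: ffi.metatype stores the metatable in miscmap (negative CTypeID range) and ffi.gc records the finalizer in the weak-keyed finalizer table.

local ffi = require("ffi")

-- vec2_t is a hypothetical example type.
ffi.cdef[[
typedef struct { double x, y; } vec2_t;
void *malloc(size_t size);
void free(void *ptr);
]]

-- The metatable is registered in cts->miscmap and is never collected.
local vec2 = ffi.metatype("vec2_t", {
  __add = function(a, b) return ffi.new("vec2_t", a.x + b.x, a.y + b.y) end,
})

-- The finalizer is recorded in the weak-keyed cts->finalizer table;
-- free() runs when the cdata object is garbage collected.
local buf = ffi.gc(ffi.C.malloc(64), ffi.C.free)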
lj_ctype.c
This file contains the functions to manage CTypes. It is divided into three parts:
the first one handles the allocation, creation and internalization of CTypes; the
second one provides getters to retrieve C type information; the last one is
type representation, providing the necessary functions to convert a CType to
a human-readable string representation. lj_ctype_repr is the entry function
that returns the internalized string representation. It uses the struct CTRepr
to create the representation by appending/prepending characters through
the pb/pe pointers into the buffer. The main function is ctype_repr, which
contains a switch on the CType info.
1 typedef struct CTRepr {
2 char * pb , * pe ; /* Points to beginning / end inside the buffer */
3 CTState * cts ; /* C type state . */
4 lua_State * L ;
5 int needsp ; /* Next append needs an extra space character */
6 int ok ; /* Indicate if buf is currently a valid type */
7 char buf [...]; /* String buffer of the ctype being constructed */
8 } CTRepr ;
lj_cparse.h
The C parser is responsible for parsing the strings of C declarations identi-
fying types or external symbols. Its code structure is quite close to the Lua
lexer/parser. Its principal struct is CPState, which is similar to LexState.
In this struct, tmask is a mask constraining the possible ctype of the next
identifier. The mode defines the behaviour of the parser with respect to the
input: accepting multiple declarations, skipping errors, accepting/rejecting
abstract declarators, accepting/rejecting implicit declarators, etc. (see
CPARSE_MODE_* for the full definition).
1 typedef struct CPState {
2 CPChar c ; /* Current character . */
3 CPToken tok ; /* Current token . */
4 CPValue val ; /* Token value . */
5 GCstr * str ; /* Interned string of identifier / keyword . */
6 CType * ct ; /* C type table entry . */
7 const char * p ; /* Current position in input buffer . */
8 SBuf sb ; /* String buffer for tokens . */
9 lua_State * L ; /* Lua state . */
10 CTState * cts ; /* C type state . */
11 TValue * param ; /* C type parameters . ( $xyz ) */
12 const char * srcname ; /* Current source name . */
13 BCLine linenumber ; /* Input line counter . */
14 int depth ; /* Recursive declaration depth . */
15 uint32_t tmask ; /* Type mask for next identifier . */
16 uint32_t mode ; /* C parser mode . */
17 uint8_t packstack [...]; /* Stack for pack pragmas . */
18 uint8_t curpack ; /* Current position in pack pragma stack . */
19 } CPState ;
lj_cparse.c
This file contains the code of a simple lexer and a simplified (non-validating) C
parser. It uses CPState for the parsing of the input string and the
CPDecl structure for the construction of the corresponding CType. During
parsing, chains of typedefs are unrolled (typedefs are still internalized for future
reference but are not chained to the created ctype).
1 typedef struct CPDecl {
2 CPDeclIdx top ; /* Top of declaration stack . */
3 CPDeclIdx pos ; /* Insertion position in declaration chain . */
4 CPDeclIdx specpos ; /* Saved position for declaration specifier . */
5 uint32_t mode ; /* Declarator mode ( same as CPState ) */
6 CPState * cp ; /* C parser state . */
7 GCstr * name ; /* Name of declared identifier ( if direct ). */
8 GCstr * redir ; /* Redirected symbol name . */
9 CTypeID nameid ; /* Existing typedef for declared identifier . */
10 CTInfo attr ; /* Attributes . */
11 CTInfo fattr ; /* Function attributes . */
12 CTInfo specattr ; /* Saved attributes . */
13 CTInfo specfattr ; /* Saved function attributes . */
14 CTSize bits ; /* Field size in bits ( see Ctype bsz ). */
15 CType stack [...]; /* Type declaration stack . */
16 } CPDecl ;
lj_cdata.c
This file contains the functions responsible for cdata management, such as
allocation, freeing, finalizers, getters, setters and indexing.
lj_cconv.c
This file is responsible for ctype conversions. It is divided into five parts: (i) C
type compatibility checks, (ii) C type to C type conversion, (iii) C type to
TValue conversion (from C to Lua, i.e. returned values), (iv) TValue to C
type conversion (from Lua to C, i.e. passed arguments), and (v) initializing C
types with TValues (initialization of struct/union/array with Lua objects).
lj_carith.c
This file contains the implementation of all built-in cdata arithmetic, such
as pointer arithmetic and integer arithmetic. It mainly manipulates the
CDArith structure shown below.
1 typedef struct CDArith {
2 uint8_t * p [2]; /* data of the two operands */
3 CType * ct [2]; /* ctype of the two operands */
4 } CDArith ;
lj_ccall.c
This file contains the code handling calls to C functions. It performs struct/
array register classification (see CCALL_RCL_*), computing how they can be
passed as arguments/return values (in GP registers, SSE registers or memory).
It then handles the decomposition/packing depending on the calling conven-
tion picked.
lj_ccall.h
This file contains the main structure used for C function calls.
1 typedef struct CCallState {
2 void (* func )( void ); /* Pointer to called function . */
3 uint32_t spadj ; /* Stack pointer adjustment . */
4 uint8_t nsp ; /* Number of stack slots . */
5 uint8_t retref ; /* Return value by reference . */
6 uint8_t ngpr ; /* Number of arguments in GPRs . */
7 uint8_t nfpr ; /* Number of arguments in FPRs . */
8 [...]
9 FPRArg fpr [...]; /* Arguments / results in FPRs . ( SSE ) */
10 GPRArg gpr [...]; /* Arguments / results in GPRs . */
11 GPRArg stack [...]; /* Stack slots . */
12 } CCallState ;
lj_ccallback.c
This file provides the FFI C callback handling. The principal structure is
CCallback (see lj_ctype.h), mainly used through the cb field of the
CTState structure. Each callback is associated with a unique callback slot, and
cts->miscmap contains the mapping between callback slots and function pointers.
cts->cb.cbid is a table mapping callback slots to the corresponding CTypeID.
cts->cb.mcode is a mapped executable page that contains a push of the slot id
and a jump to lj_vm_ffi_callback (the entry in this page corresponds to the callback
address provided to the C code). The sequence of calls when a callback occurs is
the following (a usage sketch is given after the list):
• Execution arrives in the callback mcode page (which pushes the appropriate slot id and
calls lj_vm_ffi_callback).
• It arrives in lj_ccallback_enter, which prepares the Lua state and converts
the arguments from C to Lua types.
• It arrives in lj_vm_ffi_callback, which executes the callback with this Lua state.
• The Lua callback runs.
• Execution returns to the C code.
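A hypothetical usage sketch (ours), assuming the C library's qsort is reachable through the default ffi.C namespace: casting a Lua function to a C function-pointer type allocates a callback slot and an entry in the mcode page described above.

local ffi = require("ffi")

ffi.cdef[[
typedef int (*cmp_t)(const void *, const void *);
void qsort(void *base, size_t nmemb, size_t size, cmp_t cmp);
]]

local arr = ffi.new("int[5]", {5, 2, 4, 1, 3})

-- ffi.cast allocates a callback slot for the Lua function; cb:free() releases it.
local cb = ffi.cast("cmp_t", function(a, b)
  return ffi.cast("const int *", a)[0] - ffi.cast("const int *", b)[0]
end)

ffi.C.qsort(arr, 5, ffi.sizeof("int"), cb)
cb:free()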
lj_clib.[c,h]
These files contain the necessary code to load/unload FFI C libraries. They also handle
the indexing of external libraries using named symbols, using platform-
specific tools to explore the exposed symbols. Every symbol is resolved only
once and cached in the CLibrary cache table.
1 typedef struct CLibrary {
2 void * handle ; /* Opaque handle for dynamic library loader . */
3 GCtab * cache ; /* Cache resolved symbols . Anchored in ud - > env . */
4 } CLibrary ;
Chapter 4
JIT compiler
• When the same hot code generates a trace abort too many times, it is
blacklisted, and LuaJIT will never try to record it again. The bytecode of
that hotloop/hotfunction is patched with a special bytecode instruc-
tion (I...) that stops hotspot detection and forces execution in the
interpreter.
Operation Description
FORL Numeric ’for’ loop
LOOP Generic loop
ITERL Iterator ’for’ loop
ITERN Specialized iterator function next() (NYI)
FUNCF Fixed-arg Lua function
FUNCV Vararg Lua function
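As a rough illustration (our own sketch; which opcode a given construct actually compiles to can be verified with the bytecode listing, e.g. luajit -bl or require("jit.bc").dump), the Lua constructs below correspond to the hot bytecode operations listed above.

local t = {1, 2, 3}

for i = 1, 100 do end                 -- numeric 'for' loop  -> FORL

local n = 0
while n < 100 do n = n + 1 end        -- generic loop        -> LOOP

for i, v in ipairs(t) do end          -- iterator 'for' loop -> ITERL

for k, v in pairs(t) do end           -- specialized next()  -> ITERN (marked NYI above)

local function f(x) return x end      -- fixed-arg function  -> FUNCF

local function g(...) return ... end  -- vararg function     -> FUNCV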
Index value
0 111
1 108
2 111
3 51
... ...
63 111
Figure 4.1: Hotcount decrement for loops
Indeed, to set and get hotcount values the following macros are defined.
1 # define hotcount_get ( gg , pc ) \
2 ( gg ) - > hotcount [( u32ptr ( pc ) > >2) & ( HOTCOUNT_SIZE -1)] /* hotcount [( PC /4)%64] */
3 # define hotcount_set ( gg , pc , val ) \
4 ( hotcount_get (( gg ) , ( pc )) = ( HotCount )( val ))
| sub word [ DISPATCH + reg + GG_DISP2HOT ] , HOTCOUNT_LOOP
| jb -> vm_hotloop
|. endmacro
against buffer-overflow attacks by randomising the location where executa-
bles are loaded into memory by the operating system. Two consecutive runs
of the exact same code will generate different memory addresses for the same
bytecode instruction. Code fragments may therefore be attached to different counters
of the Hotcount table, and the behaviour of the JIT can differ from run
to run.
Most operating systems nowadays support ASLR, but this should not be
seen as a problem for LuaJIT. Even if code fragments are attached to different
counters in the Hotcount table from run to run, LuaJIT tries to guarantee, on
average, relatively similar performance for the whole application. However, in
some cases there are noticeable peaks of slowness in execution.
ASLR brought significant difficulties in the context of this research while
studying LuaJIT. The approach adopted to overcome this problem
was to disable ASLR when examining LuaJIT internals. In this way the JIT
behaviour is deterministic and bytecode instructions are stored at
the same memory addresses from run to run.
4.2 Recording
Once a hotpath has been detected, LuaJIT starts recording. From the hot-
path header, bytecode instructions are recorded as they are executed.
Recording continues until either an end-of-trace condition is encountered or
the trace is aborted. The control flow is flattened, therefore only taken
branches are recorded and functions are generally inlined.
10 } TraceState ;
[Figure: trace compiler state machine (diagram); abort transitions, and an ERR state entered when an error is detected.]
      break;
    }
    lj_opt_split(J);
    lj_opt_sink(J);
    J->state = LJ_TRACE_ASM;
    break;

  case LJ_TRACE_ASM:
    setvmstate(J2G(J), ASM);
    lj_asm_trace(J, &J->cur);
    trace_stop(J);
    setvmstate(J2G(J), INTERP);
    J->state = LJ_TRACE_IDLE;
    lj_dispatch_update(J2G(J));
    return NULL;

  default:  /* Trace aborted asynchronously */
    setintV(L->top++, LJ_TRERR_RECERR);
    /* fallthrough */

  case LJ_TRACE_ERR:
    if (trace_abort(J)) {
      goto retry;
    }
    setvmstate(J2G(J), INTERP);
    J->state = LJ_TRACE_IDLE;
    lj_dispatch_update(J2G(J));
    return NULL;
  }
} while (J->state > LJ_TRACE_RECORD);
return NULL;
}

Listing 4.6: lj_trace.c (excerpt)
51
Figure 4.3: Start recording
When the state is LJ_TRACE_START the following actions are performed: (i) the trace
compiler state is changed to LJ_TRACE_RECORD; (ii) the function trace_start performs
the initial setup to start a new trace; (iii) the function lj_dispatch_update prepares the
dispatcher, so that each bytecode instruction executed by the VM will henceforward be
recorded.
4.2.3 Recording
Recording is performed by executing lj_dispatch_ins and lj_trace_ins in a loop until
recording stops or an error occurs. The trace compiler state is LJ_TRACE_RECORD and,
for each instruction, the execution flow goes through lj_record_ins (defined in
lj_record.c). This function is responsible for recording a bytecode instruction before
it is executed. It contains a huge switch case over all possible bytecodes. From each
bytecode instruction, LuaJIT emits the corresponding IR instructions, so the IR is
generated incrementally.
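The overall shape of this dispatch can be sketched as follows (a schematic illustration
only, not the actual lj_record.c source):

/* Schematic sketch of the bytecode-to-IR recording dispatch. */
void lj_record_ins(jit_State *J)
{
  BCIns ins = *J->pc;       /* Bytecode instruction about to be executed. */
  BCOp op = bc_op(ins);     /* Extract its opcode. */
  switch (op) {
  case BC_ADDVV:
    /* ... emit an IR ADD for the two operand slots ... */
    break;
  case BC_GGET:
    /* ... emit the IR chain for a global table lookup ... */
    break;
  /* ... one case (or case group) per bytecode instruction ... */
  default:
    /* NYI bytecodes abort the trace. */
    break;
  }
}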
Figure 4.4: Recording a bytecode instruction
4.3 Optimisation
It is possible to distinguish between three types of optimisation in the code. First of all,
there are the optimisations of the optimisation engine. They are implemented in the
lj_opt_*.c files, hence they can be easily identified. These optimisations can be either:
(i) global optimisations, which are run on the entire IR once at the end of the recording
phase, during the LJ_TRACE_END state (see 4.3.1, 4.3.2, 4.3.3, 4.3.4), or (ii) local
optimisations, which are applied while recording a trace (see 4.3.5, 4.3.6). Finally, there
is a plethora of optimisations and heuristics applied in various parts of the code (see the
LuaJIT wiki on optimisation [63]).
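The optimisation engine can be controlled from the command line with LuaJIT's
standard -O option, e.g. to disable individual passes while experimenting:

luajit -O3 myapp.lua                     # default: highest optimisation level
luajit -O-fold myapp.lua                 # disable the FOLD pass
luajit -O+cse,-dce,hotloop=10 myapp.lua  # combine flags and parameters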
4.3.5 Narrowing optimisations
LuaJIT narrows Lua numbers (doubles) into integers when this appears to be useful. It
uses demand-driven narrowing (driven by the backend) for index expressions, integer
arguments (FFI) and bit operations, and predictive narrowing for induction variables. It
emits overflow-check instructions where necessary. Most arithmetic operations are never
narrowed. More details are given in the comment section of lj_opt_narrow.c.
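As a small illustration (an assumed example, not taken from the thesis), narrowing
typically applies to index expressions and induction variables, while ordinary arithmetic
stays in doubles:

-- Narrowing illustration
local t = {}
for i = 1, 100 do                -- induction variable: predictive narrowing to int
  t[i] = (t[i - 1] or 0) + 0.5   -- i and i-1 in the index expressions are narrowed;
                                 -- the addition itself stays a double
end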
Fold optimisations
The fold optimisations performed by the FOLD engine are implemented in the
lj_opt_fold.c file. They can be classified into five well-known techniques: (i) constant
folding, (ii) algebraic simplifications, (iii) re-association, (iv) common sub-expression
elimination, and (v) array bounds check elimination.
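As a hedged illustration (the code and variable names are ours, not from the thesis),
the FOLD engine can simplify expressions like the following while the trace is being
recorded:

-- Fold optimisation examples
local t = {}
for i = 1, 100 do
  local a = i * (2 + 3)   -- constant folding: 2+3 becomes 5 at recording time
  local b = i * 5         -- same computation: a candidate for CSE
  t[i] = a + b
end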
4.4 Assemble trace
When the compiler state machine reaches the state LJ_TRACE_ASM, the trace is
assembled. The main function responsible for this task is lj_asm_trace (defined in
lj_asm.c). Each IR instruction previously generated is assembled through the function
asm_ir, which contains a huge switch case over all possible IR codes.
The implementation of the assembler is divided into three different files: (i) lj_asm.c
contains the platform-independent code, (ii) lj_asm_ARCH.h contains the
architecture-dependent code (e.g. x86), and (iii) lj_emit_ARCH.h contains the helper
functions to generate instructions for a specific instruction set. An IR instruction can be
translated into M ≥ 1 machine code instructions.
At the end of the assembly phase the hotpath header bytecode is patched with the
adequate J... operation. This forces the VM to execute the JIT-compiled trace instead
of interpreting the corresponding bytecode instructions. When the execution of the trace
completes, the control flow goes back to the VM, which restarts interpreting the bytecode
instructions after the hotcode.
The function trace_stop is responsible for stopping tracing and patching the bytecode
(see Tab. 4.5).
Figure 4.6: Stop tracing
2 TRACEOV trace too long
3 STACKOV trace too deep
4 SNAPOV too many snapshots
5 BLACKL blacklisted
6 RETRY retry recording
7 NYIBC NYI: bytecode %d
/* Recording loop ops */
8 LLEAVE leaving loop in root trace
9 LINNER inner loop in root trace
10 LUNROLL loop unroll limit reached
/* Recording calls/returns */
11 BADTYPE bad argument type
12 CJITOFF JIT compilation disabled for function
13 CUNROLL call unroll limit reached
14 DOWNREC down-recursion, restarting
15 NYIFFU NYI: unsupported variant of FastFunc %s
16 NYIRETL NYI: return to lower frame
/* Recording indexed load/store */
17 STORENN store with nil or NaN key
18 NOMM missing metamethod
19 IDXLOOP looping index lookup
20 NYITMIX NYI: mixed sparse/dense table
/* Recording C data operations */
21 NOCACHE symbol not in cache
22 NYICONV NYI: unsupported C type conversion
23 NYICALL NYI: unsupported C function type
/* Optimisations */
24 GFAIL guard would always fail
25 PHIOV too many PHIs
26 TYPEINS persistent type instability
/* Assembler */
27 MCODEAL failed to allocate mcode memory
28 MCODEOV machine code too long
29 MCODELM hit mcode limit (retrying)
30 SPILLOV too many spill slots
31 BADRA inconsistent register allocation
32 NYIIR NYI: cannot assemble IR instruction %d
33 NYIPHI NYI: PHI shuffling too complex
34 NYICOAL NYI: register coalescing too complex
Table 4.6: Trace compiler error messages
Figure 4.7: Throw error
Many aborts are influenced by the values of some parameters that users can set for the
JIT compiler. The most important ones are shown in Tab. 4.7.
Parameter Default Description
maxtrace 1000 Max. number of traces in the cache
maxrecord 4000 Max. number of recorded IR instructions
maxirconst 500 Max. number of IR constants of a trace
maxside 100 Max. number of side traces of a root trace
maxsnap 500 Max. number of snapshots for a trace
hotloop 56 Number of iterations to detect a hotloop or hot call
hotexit 10 Number of taken exits to start a side trace
tryside 4 Number of attempts to compile a side trace
instunroll 4 Max. unroll factor for instable loops
loopunroll 15 Max. unroll factor for loop ops in side traces
callunroll 3 Max. unroll factor for pseudo-recursive calls
recunroll 2 Min. unroll factor for true recursion
sizemcode 32 Size of each machine code area in KBytes (Windows: 64K)
maxmcode 512 Max. total size of all machine code areas in KBytes
Table 4.7: Parameters of the JIT compiler
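Most of these parameters can be changed at run time through the standard jit.opt
module, or with the equivalent -O command line option (the values below simply restate
the defaults from Tab. 4.7):

-- Adjust JIT parameters from Lua code:
jit.opt.start("hotloop=56", "hotexit=10", "maxmcode=512")

-- Equivalent command line form:
--   luajit -Ohotloop=56,hotexit=10,maxmcode=512 myapp.lua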
4.5.2 Blacklisting
LuaJIT implements blacklisting (see Sec. 2.3.4 for details) in order to avoid recording
traces that it has already tried, and failed, to generate many times.
When hotpath recording fails, the trace compiler increments a counter, called backoff
counter by Gal et al. [19], linked to the hotcode that it tried to record unsuccessfully.
Once this counter exceeds a certain threshold, the VM will never try to compile that
hotcode again.
To avoid retrying compilation, the bytecode of the hotloop/hotfunction is patched with
an operation that stops hotspot detection and forces execution in the interpreter.
Operations that force interpretation have the same syntax as the 'standard' operations,
with 'I' as prefix (see table 4.8).
As mentioned, each time a trace aborts, the detected hotcode is penalised by
incrementing its backoff counter. The penalty mechanism uses a 64-entry table defined
in the JIT state. Each row contains: (i) the starting bytecode PC of the hotpath; (ii) the
penalty value (previously called backoff counter); (iii) the abort reason number (details
in Tab. 4.6).
/* Round-robin penalty cache for bytecodes leading to aborted traces. */
typedef struct HotPenalty {
  MRef pc;          /* Starting bytecode PC. */
  uint16_t val;     /* Penalty value, i.e. hotcount start. */
  uint16_t reason;  /* Abort reason (really TraceError). */
} HotPenalty;

#define PENALTY_SLOTS   64      /* Penalty cache slot. Must be a power of 2. */
#define PENALTY_MIN     (36*2)  /* Minimum penalty value. */
#define PENALTY_MAX     60000   /* Maximum penalty value. */
#define PENALTY_RNDBITS 4       /* # of random bits to add to penalty value. */

/* JIT compiler state. */
typedef struct jit_State {
  ...
  HotPenalty penalty[PENALTY_SLOTS];  /* Penalty slots. */
  uint32_t penaltyslot;               /* Round-robin index into penalty slots. */
  uint32_t prngstate;                 /* PRNG state. */
  ...
} jit_State;
The penalisation mechanism works as follows. When a trace aborts, the function
trace_abort is called. If the trace is a root trace, the PC of the starting bytecode
instruction is penalised (thus the hotcode is penalised). The function penalty_pc is
responsible for the penalisation. If the penalty value exceeds the threshold
(PENALTY_MAX), the trace is blacklisted: the bytecode is patched with the adequate
I... operation and the previously detected hotcode can never become hot again. In
LuaJIT, blacklisting is permanent. Once a bytecode is blacklisted, it can never be
whitelisted.
A description of penalisation and blacklisting is shown in the flow chart in Fig. 4.9.
/* Penalize a bytecode instruction. */
static void penalty_pc(jit_State *J, GCproto *pt, BCIns *pc, TraceError e)
{
  uint32_t i, val = PENALTY_MIN;
  for (i = 0; i < PENALTY_SLOTS; i++)
    if (mref(J->penalty[i].pc, const BCIns) == pc) {  /* Cache slot found? */
      /* First try to bump its hotcount several times. */
      val = ((uint32_t)J->penalty[i].val << 1) + LJ_PRNG_BITS(J, PENALTY_RNDBITS);
      if (val > PENALTY_MAX) {
        blacklist_pc(pt, pc);  /* Blacklist it, if that didn't help. */
        return;
      }
      goto setpenalty;
    }
  /* Assign a new penalty cache slot. */
  i = J->penaltyslot;
  J->penaltyslot = (J->penaltyslot + 1) & (PENALTY_SLOTS-1);
  setmref(J->penalty[i].pc, pc);
setpenalty:
  J->penalty[i].val = (uint16_t)val;
  J->penalty[i].reason = e;
  hotcount_set(J2GG(J), pc+1, val);
}

/* Blacklist a bytecode instruction. */
static void blacklist_pc(GCproto *pt, BCIns *pc)
{
  setbc_op(pc, (int)bc_op(*pc)+(int)BC_ILOOP-(int)BC_LOOP);
  pt->flags |= PROTO_ILOOP;
}
(Fig. 4.9, flow chart: after a trace abort, if the PC is not already penalised it is assigned
a penalty slot and recording is retried if the same hotspot is detected again; if the
penalty value exceeds the threshold, the hotcode is blacklisted and its bytecode is
patched with the I... operation.)
Unlike the hotcount mechanism, collisions can never occur in the hot penalty table. The
PC is used to check whether a hotcode has already been penalised and, since the PC
(which contains the memory address of the current bytecode instruction) is unique,
collisions are impossible.
However, another small drawback can occur because of the round-robin index. When
there are enough trace aborts one after the other that the hot penalty table is full (all 64
slots are taken by bytecodes currently being penalised), a slot can be reused before its
counter gets high enough to cause blacklisting. The next abort of the hotpath linked to
the overwritten counter is then handled as if its previous count had been discarded: the
trace compiler will link that hotpath to a different slot, as if it had never aborted before.
The probability of ending up in such an inconvenient situation is relatively low. In fact,
Mike Pall commented on this issue on the LuaJIT mailing list, saying: "While
theoretically possible, I've never seen this happen. In practice, something does get
blacklisted eventually, which frees up a penalty tracking slot. There are only ever a
handful of these slots in use. 64 slots is really generous"1.
Blacklisting cancer
Blacklisting is a very powerful tool, used in LuaJIT to save time by not retrying the
compilation of hotpaths that aborted repeatedly. However, a really unpleasant situation
can arise as a consequence of the fact that in LuaJIT blacklisting is permanent. In other
words, blacklisted fragments of code can never be whitelisted again. On one hand, this is
reasonable because there is no sense in recording a trace that contains a code fragment
which failed to compile many times (hence it was blacklisted). On the other hand, the
same code fragment might compile successfully in the different context of a new trace.
1 Conversation on the LuaJIT mailing list: https://www.freelists.org/post/luajit/When-to-turn-off-JIT,4
Moreover, LuaJIT blacklists hotpaths that hit an already blacklisted code fragment. To
be more precise, if the interpreter hits an already blacklisted bytecode instruction
(I...) while recording a new trace, trace creation is aborted and the new trace is also
blacklisted. This can lead to a cascade of blacklisting from one code fragment to
another. It is particularly dangerous when LuaJIT blacklists a key function of an
application (i.e. a function called from many points of the source code), because the
cascade of blacklisting spreads rapidly all over the code, hence the name blacklisting
cancer.
To avoid falling into such an inconvenient situation, we should avoid programming
patterns which emphasise this problem. An exemplifying situation on this subject was
discussed in the LuaJIT community2. This example is presented again below to better
illustrate the problem.
2 Conversation on the LuaJIT mailing list: https://www.freelists.org/post/luajit/ANN-dumpanalyze-tool-for-working-with-LuaJIT-dumps,12
-- Blacklisting Cancer

-- Call a function with an argument.
local function apply(f, x)
  return f(x)
end

-- Return the value x wrapped inside a closure.
local function fwrap(x)
  return function() return x end  -- Create a closure: Not-Yet-Implemented (NYI)
end

-- Attempt to compile that naturally fails.
for i = 1, 1e5 do
  apply(fwrap, i)  -- Abort message: NYI: create a closure
end

-- apply() is now permanently blacklisted.
-- Every call to apply() in the future will have to be run by the interpreter.

local function nop(x)
end

-- Attempt to compile that should not fail.
for i = 1, 1e5 do
  apply(nop, i)  -- Abort message: calling blacklisted function
end
The JIT permanently blacklists apply(). It will never allow that function to be called
in a trace. In fact, the second loop could have been compiled successfully if it had been
run before the first loop.
It should be mentioned that other implementations of tracing JITs do not suffer from
this problem. For instance, RaptorJIT [20] gives blacklisted code fragments another
chance of being compiled in a different context. In particular, when a blacklisted
function is called from JIT code, it ignores the blacklisting and simply inlines the
content of the function, trying to compile the function in the context of its caller. This
can be worthwhile because code that fails to compile in isolation may still compile in the
context of its caller. Of course, this implies some overhead, since RaptorJIT puts more
effort into generating machine code at the expense of making more compilation
attempts.
4.6 Snapshots
Snapshots are a key feature used in LuaJIT to leave the VM in a consistent state when a
trace exits. When the execution leaves a trace because of a guard failure, the system
switches back to interpretation. The VM must be left in a consistent state for the
interpreter to continue. This means that all updates (stores) to the state (stack or
objects) must track the original language semantics. In particular, the values held in
registers throughout the trace must be written back to their respective stack locations.
Once the stack is in a suitable state, the interpreter can continue.
Among the possible techniques used in trace-based just-in-time compilation (see Sec.
2.3.6), LuaJIT solves the problem of keeping the VM in a consistent state by maintaining
a mapping from stack locations to SSA IR instructions. Such mappings are called
snapshots in the LuaJIT codebase. Using a snapshot, LuaJIT can then reconstruct the
operand stack as it would have been if the instructions in the trace had been interpreted.
The snapshot mechanism is implemented in lj_snap.[c,h], and the SnapShot data
structure is defined in lj_jit.h (the key functions are lj_snap_restore and
snap_restoreval).
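For reference, an abbreviated sketch of the snapshot header from lj_jit.h is shown
below (field names and exact types vary slightly between LuaJIT versions, so treat this
as an illustration rather than the authoritative definition):

/* Snapshot header (abridged). */
typedef struct SnapShot {
  uint32_t mapofs;   /* Offset into the snapshot map. */
  IRRef1 ref;        /* First IR reference for this snapshot. */
  uint8_t nslots;    /* Number of valid stack slots. */
  uint8_t topslot;   /* Maximum frame extent. */
  uint8_t nent;      /* Number of compressed map entries. */
  uint8_t count;     /* Count of taken exits for this snapshot. */
} SnapShot;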
Snapshots can be seen at the IR level. From the definition on the official LuaJIT website
[65]: a snapshot captures the IR references corresponding to modified slots and frames in
the bytecode execution stack. Every snapshot saves a specific bytecode execution state,
which can later be restored on trace exits. Snapshots are sparsely emitted and
compressed. Snapshots provide the link between the IR and the bytecode domain (and
transitively the source code domain, via the bytecode debug info).
Code sinking via snapshots allows sinking of arbitrary code without the
overhead of the other approaches. A snapshot stores a consistent view of all
updates to the state before an exit. If an exit is taken the on-trace machine
state (registers and spill slots) and the snapshot can be used to restore the
VM state.
The snapshot mechanism is slow compared to other approaches such as exit stubs (see
Sec. 2.3.6), but this should not be seen as a problem in LuaJIT. Quoting Mike Pall [66]:
"State restoration using this data-driven approach is slow of course. But repeatedly taken
side exits quickly trigger the generation of side traces. The snapshot is used to initialise
the IR of the side trace with the necessary state using pseudo-loads. These can be
optimised together with the remainder of the side trace. The pseudo-loads are unified with
the machine state of the parent trace by the back-end to enable zero-cost linking to side
traces".
4.6.1 Example
Here we propose a concrete example which describes snapshots.
-- Snapshot

local x = 0

-- A for loop implicitly defines tacit variables:
-- index i (intern), limit N, step s, index i (extern)
for i = 1, 200 do
  if i < 100 then x = x + 11
  else x = x + 22 end
end
LuaJIT stores the local variables in stack slots according to the order in
which they appear in the code. When a local variable is declared, a single
slot is reserved to store it. On the other hand, a for loop reserves 4 slots
that contain respectively: (i) the internal copy of the index, (ii) the limit
of the loop, (iii) the step of each iteration and (iv) the external copy of the
index. See Tab. 4.10 for details.
When printing the IR through jdump you can visualise snapshots by selecting the flag
's'. A typical use of this option is shown below.
./luajit -jdump=is Ex.lua
When reading snapshots in the IR you should be aware that each snapshot (SNAP) lists
the modified stack slots and their values. The i-th value in the snapshot list represents
the index of the IR instruction that writes a value into slot number #i; "----" indicates
that the slot is not written. Frames are separated by "|".
0008 + int ADD 0004 +1
.... SNAP #3 [ ---- 0007 ]
0009 > int LE 0008 +100
0010 int PHI 0004 0008
0011 num PHI 0003 0007
---- TRACE 1 stop -> loop
In this example slot #0 is not written since the frame slot does not change. In SNAP #1
the value of x changes because we compute x=x+11, hence the IR at index 0003 writes a
value into slot #1. SNAP #2 stores the change of x as before, and it also stores the value
of i because it was incremented (i=i+1) and the check i<100 was performed; in this case
the IR at index 0004 writes a value into slots #2 and #5 (the internal and external copies
of the index). SNAP #3 is the same as SNAP #1, but it refers to the IR at index 0007.
4.7 Variables allocation
The aim of this section is to describe how variables are allocated in the bytecode and in
the SSA IR. In particular, (i) the first paragraph investigates the allocation of local
variables, (ii) the second paragraph considers the case of global variables and (iii) the
last paragraph examines upvalues. Each of them includes an example to clarify the
explanation.
The example above contains, in order: variable declarations (line 4), a for loop (line 6),
a variable declaration (line 7) and a for loop (line 8). Thus, variables are allocated in
order in the stack slots as shown in table 4.11.
Structure      Type                    Variable   Stack slot number
declaration    local var               iN         #1
               local var               iS         #2
               local var               jN         #3
               local var               jS         #4
outer loop     index (internal copy)   i          #5
               limit                   iN         #6
               step                    iS         #7
               index (external copy)   i          #8
declaration    local var               x          #9
inner loop     index (internal copy)   j          #10
               limit                   jN         #11
               step                    jS         #12
               index (external copy)   j          #13
The IR dump is shown below. Variables are loaded with the SLOAD (stack slot load)
operation, where the left operand #n refers to the variable slot number and the right
operand contains flags (see ?? for details about flags). Note that #0 indicates the
closure/frame slot and #1 the first variable slot (corresponding to slot 0 of the
bytecode). Moreover, there are no store operations for stack slots. When the execution
flow exits from a trace, the values of the stack slots are restored: all stores to stack slots
are effectively sunk into exits or side traces [65].
4.7.2 Global variables
Dealing with global variables is more subtle because their value can be modified from
different parts of the code. In fact, the concept of a global variable does not exist
explicitly in Lua. To preserve the illusion of global variables, Lua keeps them internally
in a regular table that is used as the global environment (this is a simplification of what
really occurs, see [67] for more details).
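The example listing is not reproduced in this copy; a minimal sketch consistent with the
following description (our reconstruction, not the original code) would be:

-- Global variables
x, y = 0, 0      -- two globals kept in the global environment table

for i = 1, 100 do
  x = x + 11     -- x is read and written at every iteration (GGET/GSET)
  y = 22         -- y is loop-invariant and can be hoisted by LICM
end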
The example contains two global variables x and y. The value of x is updated at each
iteration of the loop, while y can be moved outside the body of the loop by LICM
without affecting the semantics of the program. As expected, this optimisation is
performed only at the IR level, not in the bytecode.
As shown in the bytecode above, x and y are retrieved and modified with the
instructions GGET (global get) and GSET (global set) because they are global
variables.
In the IR it is possible to see more details of what really occurs.
SLOAD at line 0001 refers to the variable i and SLOAD at line 0002 is rela-
tive to the starting frame of the trace, where #0 indicates the frame slot.
Then, FLOAD (object field load) at line 0003 accesses the global environment
(func.env). At line 0004 the size of the hash part (tab.hmask) is loaded
and at line 0005 it is checked that its value is not changed (hmask=+63). Fi-
nally, the table hash part (tab.node) is loaded at line 0006 and with HREFK
(constant hash reference) "x" is retrieved from the global environment table
(line 0007). Once the reference has been retrieved the value of x is loaded
with HLOAD (line 0008) and stored with HSTORE (line 0010 and 0020).
Since the global environment has already been loaded, there is no need to
do it again for y. Thus, "y" can be retrieved with HREFK (line 0011). Then,
the global environment metatable is loaded (line 0012) and it is checked that
it is equal to NULL (line 0013). After storing the value of y with HSTORE (line
0014) the compiler creates a table write barrier (line 0015). TBAR (table
write barrier) is a write barrier needed for the incremental garbage collector.
4.7.3 Upvalues
Closures and upvalues are key concepts in Lua. Thus, LuaJIT has specific instructions
to handle them, both at the bytecode and at the IR level.
The example below shows a loop with a closure that contains an upvalue: the closure is
inlined in the trace and the upvalue is modified inside it.
-- Recursive function with upvalue

local function f(x)
  return function(y)  -- return a closure
    x = x + 1
    return x + y
  end
end

local a = 0
local b = f(0)

for i = 1, 1e3 do
  a = b(i) + 2
end
In the bytecode below it is possible to see that the closure is inlined (lines 0000-0006).
The dot in the second column of the bytecode represents the call depth. What is
interesting is that the upvalue x is retrieved and modified with the instructions UGET
(get upvalue) and USETV (set upvalue to variable).
---- TRACE 1 start Ex.lua:13
0010 MOV   7 2
0011 MOV   8 6
0012 CALL  7 2 2
0000 . FUNCF 2          ; Ex.lua:4
0001 . UGET  1 0        ; x
0002 . ADDVN 1 1 0      ; 1
0003 . USETV 0 1        ; x
0004 . UGET  1 0        ; x
0005 . ADDVV 1 1 0
0006 . RET1  1 2
0013 ADDVN 1 7 0        ; 2
0014 FORL  3 => 0010
---- TRACE 1 stop -> loop

---- TRACE 1 start Ex.lua:13
---- TRACE 1 IR
0001 rbp      int SLOAD  #4    CI
0002       >  fun SLOAD  #3    T
0003       >  fun EQ     0002  Ex.lua:4
0004       >  p32 UREFC  Ex.lua:4  #0
0005 xmm6  >  num ULOAD  0004
0006 xmm6  +  num ADD    0005  +1
0007          num USTORE 0004  0006
0008 xmm7     num CONV   0001  num.int
0009 xmm7     num ADD    0008  0006
0010 xmm7  +  num ADD    0009  +2
0011 rbp   +  int ADD    0001  +1
0012       >  int LE     0011  +1000
0013 ------------ LOOP ------------
0014 xmm6  +  num ADD    0006  +1
0015          num USTORE 0004  0014
0016 xmm7     num CONV   0011  num.int
0017 xmm7     num ADD    0016  0014
0018 xmm7  +  num ADD    0017  +2
0019 rbp   +  int ADD    0011  +1
0020       >  int LE     0019  +1000
0021 rbp      int PHI    0011  0019
0022 xmm6     num PHI    0006  0014
0023 xmm7     num PHI    0010  0018
---- TRACE 1 stop -> loop
SLOAD at line 0001 refers to the variable i, and SLOAD at line 0002 loads the closure
assigned to the variable b. Then it is checked that the loaded closure is equal to the one
returned at Ex.lua:4 (line 0003). UREFC (closed upvalue reference) gets the upvalue
reference from the closure (line 0004). Finally, the upvalue can be loaded and stored with
ULOAD (line 0005) and USTORE (lines 0007 and 0015).
Chapter 5
The aim of this chapter is to show some concrete experimental cases in order to
understand how multiple traces are generated and organised by LuaJIT. Multiple traces
appear when there is a branch in the control flow and more than one path is frequently
executed. In this case, LuaJIT creates a root trace with side traces attached to it.
Another important aspect that will be explored is trace stitching, a mechanism that
avoids trace aborts due to Not-Yet-Implemented (NYI) functions when a C function or a
non-compiled built-in is encountered.
Before investigating these aspects, we present canonical transformations of logical
expressions, loop equivalences and asserts. These structures create branches in the code
that can be represented by simple if-statements and for loops. This is the reason why, in
the second part of this chapter, we will only consider if-statements and for loops to
investigate how LuaJIT creates side traces and stitch traces.
Concerning side traces, we will first clarify how the just-in-time compiler (JIT) generates
traces from simple loops. Then we will analyse how traces are connected with each other
in more complex structures. Finally, we will investigate recursive functions. For each
case it is described how the compiler behaves, and we show: the Lua code of the example;
the corresponding bytecode and intermediate representation (IR) generated; and a flow
diagram that refers to the IR.
are mapped to traces by the JIT compiler. Finally, it will explore the JIT behaviour
when dealing with asserts.
Once the JIT behaviour for hotpaths which contain control structures (e.g.
if-statements) is understood, the same rationale can be applied to logical operators,
because every expression with logical operators can be transformed into the
corresponding if-statement expression.
In this section we present the canonical transformation of logical operators into the
corresponding if-statement expressions. We illustrate concrete examples of the logical
operators and and or, considering all the possible combinations of the operand values
(true, false, nil).
And
-- And

local a, b, c = v1, v2, nil  -- v1, v2 in (true, false, nil)

for i = 1, 100 do
  c = a and b
end
c a b
2 1 2
false 1 false
nil 1 nil
false false 2
false false false
false false nil
nil nil 2
nil nil false
nil nil nil
Or
-- Or

local a, b, c = v1, v2, nil  -- v1, v2 in (true, false, nil)

for i = 1, 100 do
  c = a or b
end

-- Or with if-statement

local a, b, c = v1, v2, nil  -- v1, v2 in (true, false, nil)

for i = 1, 100 do
  if a then c = a else c = b end
end
c a b
1 1 2
1 1 false
1 1 nil
2 false 2
false false false
nil false nil
2 nil 2
false nil false
nil nil nil
And-Or
-- And-Or

local a, b, c, d = v1, v2, v3, nil  -- v1, v2, v3 in (true, false, nil)

for i = 1, 100 do
  d = a and b or c
end
d a b c
2 1 2 3
2 1 2 false
2 1 2 nil
3 1 false 3
false 1 false false
nil 1 false nil
3 1 nil 3
false 1 nil false
nil 1 nil nil
3 false 2 3
false false 2 false
nil false 2 nil
3 false false 3
false false false false
nil false false nil
3 false nil 3
false false nil false
nil false nil nil
3 nil 2 3
false nil 2 false
nil nil 2 nil
3 nil false 3
false nil false false
nil nil false nil
3 nil nil 3
false nil nil false
nil nil nil nil
Or-And
-- Or-And

local a, b, c, d = v1, v2, v3, nil  -- v1, v2, v3 in (true, false, nil)

for i = 1, 100 do
  d = a or b and c
end
-- Or-And with if statements

local a, b, c, d = v1, v2, v3, nil  -- v1, v2, v3 in (true, false, nil)

for i = 1, 100 do
  if a then d = a
  elseif b then d = c
  else d = b end
end
d a b c
1 1 2 3
1 1 2 false
1 1 2 nil
1 1 false 3
1 1 false false
1 1 false nil
1 1 nil 3
1 1 nil false
1 1 nil nil
3 false 2 3
false false 2 false
nil false 2 nil
false false false 3
false false false false
false false false nil
nil false nil 3
nil false nil false
nil false nil nil
3 nil 2 3
false nil 2 false
nil nil 2 nil
false nil false 3
false nil false false
false nil false nil
nil nil nil 3
nil nil nil false
nil nil nil nil
-- And with function

local a, b = v1, nil  -- v1 in (1, 2, 3, false, nil)

local function f(a)
  if a == 1 then return true
  elseif a == 2 then return false
  else return nil end
end

for i = 1, 100 do
  b = a and f(a) or nil
end
b a f(a) nil
true 1 true nil
nil 2 false nil
nil 3 nil nil
nil false - nil
nil nil - nil
5.1.2 Loops equivalence
In this section we present the equivalence between different loop structures in Lua and
how they are handled by LuaJIT when compiling a trace. For each example we show the
recorded bytecode instructions and the IR (the details of the syntax are explained in the
next section). When investigating the equivalence between code structures we must
especially focus on the IR, because it is what gets directly mapped to the compiled
machine code.
Looking at the IR you can notice that the index variable i is considered by LuaJIT as an
integer value (line 0001).
While loop
The same loop previously presented can be realised by a while structure.

-- While loop

local x, i = 0, 1

while i <= 100 do
  x = x + 11
  i = i + 1
end
The IR is identical to that of the while structure, with the only difference that the
end-condition of the loop is i>100 (rather than i≤100). The corresponding instruction is
ISGE ("is greater than or equal") at line 0007. This is because the loop repeats until the
given condition becomes true.
---- TRACE 1 start Ex.lua:5
0001 >  num SLOAD  #1    T
0002 +  num ADD    0001  +11
0003 >  num SLOAD  #2    T
0004 +  num ADD    0003  +1
0005 >  num ULE    0004  +100
0006 ------------ LOOP ------------
0007 +  num ADD    0002  +11
0008 +  num ADD    0004  +1
0009 >  num ULE    0008  +100
0010    num PHI    0002  0007
0011    num PHI    0004  0008
---- TRACE 1 stop -> loop
A generic loop cannot be compiled if the iterator uses the built-in function next() (e.g.
pairs()), because LuaJIT does not support the compilation of next(). Recording
aborts if next() is executed, because it belongs to the Not-Yet-Implemented features of
the JIT compiler.
On the other hand, a generic loop not based on next() (e.g. ipairs()) can be
compiled. Here we propose an example.
-- Ipairs loop

local x, array = 0, {}
for i = 1, 100 do array[i] = i end

for i, v in ipairs(array) do
  x = x + v + 11
end
The recorded bytecode instructions are the following. Note that TRACE 1 is
not shown, but it simply refers to the numeric loop at line 4.
0007 >  num SLOAD  #5    T
0008 >  fun EQ     0005  ipairs_aux
0009    int CONV   0007  int.num
0010 +  int ADD    0009  +1
0011    int FLOAD  0006  tab.asize
0012 >  int ABC    0011  0010
0013    p32 FLOAD  0006  tab.array
0014    p32 AREF   0013  0010
0015 >+ num ALOAD  0014
0016 ------------ LOOP ------------
0017    num ADD    0015  0004
0018 +  num ADD    0017  +11
0019    num CONV   0010  num.int
0020 +  int ADD    0010  +1
0021 >  int ABC    0011  0020
0022    p32 AREF   0013  0020
0023 >+ num ALOAD  0022
0024    num PHI    0004  0018
0025    num PHI    0015  0023
0026    int PHI    0010  0020
---- TRACE 2 stop -> loop
5.1.3 Assert
An assert in Lua is treated by LuaJIT as a "standard" if-statement condition. The
example below shows a case where the assert becomes false after 100 iterations of the
loop.

-- Assert

local assert = assert

for i = 1, 200 do
  assert(i < 100, "Case failed: expected i<100 but got instead.")
end

In this case the JIT will create a trace with a guard condition on 'i<100' (lines 0003 and
0008 in the IR), which verifies that its value is less than 100. The details of the IR
syntax are explained from the next paragraph onwards.
operands during execution. Thus, it generates the equivalent IR.
-- Empty loop
for i = 1, 100 do

end

In this case the bytecode produced contains just the FORL loop instruction.

---- TRACE 1 start Ex.lua:3
0005 FORL 0 => 0005
---- TRACE 1 stop -> loop
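The IR dump that the following paragraph refers to is not reproduced in this copy. A
sketch reconstructed purely from the description below (not an actual LuaJIT dump, so
details such as flags may differ) would look roughly like this:

0001    int SLOAD  #1    CI
0002 +  int ADD    0001  +1
0003 >  int LE     0002  +100
0004 ------------ LOOP ------------
0005 +  int ADD    0002  +1
0006 >  int LE     0005  +100
0007    int PHI    0002  0005
---- TRACE 1 stop -> loop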
The instruction on the first line, SLOAD (stack slot load), is used to initialise the variable
i used by the loop; its left operand #1 refers to the first variable slot and the right
operand contains two flags: coalesce (C) and inherited (I).
The next lines are supposed to contain the loop, but the same instructions are repeated
twice. This is due to the fact that the first iteration of the loop is unrolled (lines
0002-0003), and the actual loop (lines 0005-0006) is shown after the -- LOOP -- label
(line 0004). The first iteration ensures that the pre-conditions for all subsequent
instructions are met. ADD increments the loop counter i and LE (left operand ≤ right
operand) checks that its value is lower than 100. If this condition is not satisfied
(i > 100), the execution takes the trace exit at line 0003 or at line 0006. Possible exits
from the trace are indicated by the symbol > in the second column of the instruction. If
the condition is true (i ≤ 100), the execution flow makes a backward jump to the
-- LOOP -- label (line 0004) and continues with the next iteration of the loop. It is
important to highlight that only the instructions at lines 0005-0006 are executed
repeatedly.
Finally, the PHI instruction positioned at the end of the looping trace (line 0007) allows
selecting values from different incoming paths at control-flow merge points [69]. The left
operand 0002 holds a reference to the initial value of i, the right operand 0005 holds a
reference to the value after each loop iteration. Operands of the PHI function are
indicated by the symbol + in the second column of the IR (in this case, lines 0002 and
0005).
The diagram below explains the execution flow of the IR in a cleaner way. Especially for
the more complex examples that follow, it will be easier to look at the diagram to
understand the IR.
TRACE 1
[0] 0001
i ← #1
F
i ← i + 1 X1
? i≤100
LOOP T 0004
[1] 0005
i ← i + 1 F
X2
? i≤100
T
In the diagrams, each trace is divided into blocks containing instructions, with a unique
identifier enclosed in square brackets (e.g. [0]). At the top right of each block, the line in
the IR of the first instruction of the block is indicated (e.g. 0001). At the end of each
block there may be a conditional expression that represents a guard. If the guard is
violated (the condition is false) the trace is exited, otherwise execution continues with
the next block of the trace. Possible exits are represented by the letter 'X' followed by
their number (e.g. X1). By default an exit leads back to the virtual machine. Note that
when the execution flow exits a trace, the values of the stack slots are restored.
-- Loop with assignment

local x = 0
local y = 0
local N = 100

for i = 1, N do
  y = 11
  x = x + 22
end
As shown in the bytecode below, the instruction KSHORT (line 0008) sets y
to 11 and ADDVN (line 0009) computes the operation x = x + 22. Here it is
clear that at bytecode level the LICM optimisation is not applied because the
execution flow makes a backward jump to line 0008 and y = 11 is repeated at
each iteration of the loop. In fact, no optimisation is performed by LuaJIT
on bytecode.
---- TRACE 1 start Ex.lua:7
0008 KSHORT 1 11
0009 ADDVN  0 0 0  ; 22
0010 FORL   3 => 0008
---- TRACE 1 stop -> loop
The first two lines refer to the maximum loop counter N : SLOAD (line 0001)
with flag read-only (R) is used to init N and in line 0002 it is checked if
its value falls into the signed 32-bit integer range (N ≤ +2147483646). In
this way, the compiler can discriminate if the loop will be done over integer
or floating point values. The SLOADs at lines 0003-0004 are used to init the
variables i and x respectively. In particular x has a flag of type check (T).
At IR level it is possible to see compiler optimisations. The value of x
changes at each iteration of the loop. Thus, the ADD instruction x = x + 22 is
contained both in the pre-loop (line 0005) and in the actual loop (line 0009).
On the other hand, the expression y = 11 can be moved outside the body
of the loop by LICM without affecting the semantics of the program. This
88
instruction will be executed only once outside the trace (in fact there is no
line in the IR referring to it).
At the end of the dump there are two PHI functions. The first (line 0012)
refers to the variable i as explained in the previous example. The second
(line 0013) refers to the variable x and it is necessary for the same reason.
The graph below shows what was just explained.
TRACE 1
[0] 0001
N ← #5 F
X1
? N:const
T
[1] 0002
F
? N:int X2
T
[2] 0003
i ← #4
F
x ← #1 X3
? x:num
T
[3] 0005
x ← x + 22
F
i ← i + 1 X4
? i≤N
LOOP T 0008
[4] 0009
x ← x + 22
F
i ← i + 1 X5
? i≤N
T
local x = 0

for i = 1, 1e4 do
  x = x + 11
  if i % 10 == 0 then  -- if-statement
    x = x + 22
  end
  x = x + 33
end
In the loop shown above, the execution flow most of the time skips the instruction
contained in the if-statement, because the expression i%10 == 0 is true only every 10
iterations of the loop. From i = 0 onwards, the instructions that are executed repeatedly
the most are x = x + 11 and x = x + 33, thus the compiler creates a trace containing
these instructions (TRACE 1). By increasing i, the condition of the if-statement becomes
true more and more often, thus the compiler will generate a side trace (TRACE 2) that
contains the instruction within the if-statement and what follows down to the loop
"end".
The bytecode below shows this method in more detail (these are the bytecode
instructions recorded by the JIT).
the exit number 4, which is the 4th line having as second column the symbol
> in TRACE 1.
The diagram in Fig. 5.3 shows the IR flow diagram for this example. TRACE 1 is
organised as follows: blocks [0], [1], [2] contain the first pass of the unrolled loop;
blocks [3], [4] contain the n-1 iterations of the actual loop; in block [3] there is a
possible exit that leads to the side trace. At the end, when TRACE 2 is finished, the
execution flow joins TRACE 1 at block [0], while the other exits lead back to the VM.
TRACE 1
[0] 0001
i ← #2
F
x ← #1 X1
? x:num
T
[1] 0003
x ← x + 11 F
X2
? i%10 6= 0
T
TRACE 2
-- Nested loop

local x = 0

for i = 1, 1e4, 2 do    -- outer loop
  x = x + 11
  for j = 2, 1e3, 4 do  -- inner loop
    x = x + 22
  end
  x = x + 33
end
The instructions of the inner loop are executed repeatedly first. Thus, the inner loop
becomes hot first and the compiler creates a trace (TRACE 1). At some point the outer
loop also becomes hot and the compiler generates another trace (TRACE 2), which is a
side trace of the previous one.
Traces are organised in reverse order compared with the usual way of thinking about the
execution flow of nested loops. As shown in the bytecode below: TRACE 1 (inner loop)
contains the instruction x = x + 22; TRACE 2 (outer loop) contains first the instruction
x = x + 33 and then x = x + 11. Moreover, TRACE 2 is a side trace that starts at exit
number 3 of TRACE 1 (when the inner loop has finished).
The way traces are organised for nested loops becomes clearer when looking at the IR
and the diagram.
TRACE 1
[0] 0001
i ← #6
F
x ← #1 X1
? x:num TRACE 2
T
[3] 0001
[1] 0003 x ← #1
x ← x + 22 x ← x + 33
F F
j ← j + 4 X2 i ← #2 X1
? j≤1e3 i ← i + 2
LOOP T 0006 ? i≤1e4
T
[2] 0007
x ← x + 22 [4] 0006
F
j ← j + 4 X3 x ← x + 11
? j≤1e3
T
[0]
The outer loop (TRACE 2) goes around the inner loop (TRACE 1) and joins it
in block [0].
On the other hand, if the inner loop had a low iteration count, it would be unrolled and
inlined2.
2 Conversation on the LuaJIT mailing list: https://www.freelists.org/post/luajit/How-does-LuaJITs-trace-compiler-work,1
5.3 Loop with two if-statements
This section explores how the JIT compiler organises traces when there are two different
if-statements within the same loop. The cases investigated are as follows: (i) the first
if-statement condition becomes true (hot) before the second; (ii) the second becomes
true (hot) before the first; (iii) the if-statement conditions become true (hot) at different
times; (iv) the if-statement conditions become true (hot) at the same time.
5.3.1 Case 1
This example shows how traces are organised in a loop with two if-statements when the
first condition becomes true (hot) before the second. In this case the truthfulness of the
second if-condition implies the first.
As in the example of Sec. 5.2.3, the execution flow most of the time skips the
instructions contained in the if-statements, and the compiler creates a trace (TRACE 1)
with the instructions x = x + 11, x = x + 33, x = x + 55. By increasing i, the condition
of the first if-statement becomes true more and more often, thus the compiler will
generate a side trace (TRACE 2) that contains the instruction within the first
if-statement, x = x + 22, and what follows down to the loop "end". At some point, the
condition of the second if-statement also becomes true repeatedly, thus the compiler
creates a trace (TRACE 3) with the instruction within the second if-statement,
x = x + 44, and what follows down to the loop "end".
The bytecode below shows what was just explained.
0014 JMP   5 => 0016
0016 ADDVN 0 0 8  ; 55
0017 FORL  1 => 0006
---- TRACE 1 stop -> loop

---- TRACE 2 start 1/5 Ex.lua:8
0010 ADDVN 0 0 4  ; 22
0011 ADDVN 0 0 5  ; 33
0012 MODVN 5 4 6  ; 20
0013 ISNEN 5 3    ; 0
0014 JMP   5 => 0016
0016 ADDVN 0 0 8  ; 55
0017 JFORL 1 1
---- TRACE 2 stop -> 1

---- TRACE 3 start 2/1 Ex.lua:12
0015 ADDVN 0 0 7  ; 44
0016 ADDVN 0 0 8  ; 55
0017 JFORL 1 1
---- TRACE 3 stop -> 1
Links between traces are more explicit in the IR and in the diagram.
Generally side traces are created when an exit is taken repeatedly. In this
case TRACE 2 starts at the exit number 5 of TRACE 1 (line 0015) and it joins
TRACE 1 at the end. TRACE 3 starts at the exit number 1 of TRACE 2 (line
0006) and it joins TRACE 1 at the end. Thus, TRACE 2 is a side trace of
TRACE 1 and TRACE 3 is a side trace of TRACE 2, both joining TRACE 1 at
their ends.
TRACE 1
[0] 0001
i ← #2
F
x ← #1 X1
? x:num
T
[1] 0003
x ← x + 11 F
X2
? i%10 6= 0
T
[2] 0006
x ← x + 33 F
X3
? i%20 6= 0
T
TRACE 2
x ← x + 55
T F
i ← i + 1 X7 [0]
[0]
? i≤1e6
5.3.2 Case 2
This example is the same as the previous one, but the order of the two if-statements is
reversed. In particular, it investigates how traces are organised in a loop with two
if-statements when the second condition becomes true (hot) before the first. In this case
the truthfulness of the first if-condition implies the second.
Even if the change from the previous example is small, it causes a big difference in the
trace organisation.
The compiler creates the first trace (TRACE 1) with the same logic as in the previous
example. It contains the instructions x = x + 11, x = x + 33, x = x + 55. By increasing
i, the condition of the second if-statement becomes true more and more often, thus the
compiler will generate a side trace (TRACE 3) that contains the instruction within the
second if-statement, x = x + 44, and what follows. At some point, the condition of the
first if-statement also becomes true repeatedly, thus the compiler creates a trace
(TRACE 2) that contains the instructions within both if-statements and what follows.
The bytecode below shows what was just explained.
The IR displayed below shows that TRACE 2 starts at the exit number 5 of
TRACE 1 (line 0015) and it joins TRACE 1 at the end. TRACE 3 starts at the
exit number 6 of TRACE 1 (line 0018) and it joins TRACE 1 at the end. Thus,
both TRACE 2 and TRACE 3 are side traces of TRACE 1.
TRACE 1 TRACE 2
LOOP T 0012
[4] 0013 [0]
x ← x + 11 F
X5
? i%20 6= 0
TRACE 3
T
[11] 0001
[5] 0016
x ← #1
x ← x + 33 F
X6 i ← #2
? i%10 6= 0
x ← x + 44
T
[6] 0019
[12] 0004
x ← x + 55
T F x ← x + 55
i ← i + 1 X7 F
i ← i + 1 X1
? i≤1e6
? i≤1e6
T
[0]
5.3.3 Case 3
This example shows how traces are organised in a loop with two if-statements when
both conditions become true (hot) at the same time. The compiler does not generate a
side trace for each if-statement, but creates only a single side trace. In this case the
truthfulness of the first if-condition implies the second and vice versa.
-- Loop with 2 if-statements (Example 3)

local x = 0

for i = 1, 1e6 do
  x = x + 11
  if i % 10 == 0 then  -- 1st if-statement
    x = x + 22
  end
  x = x + 33
  if i % 10 == 0 then  -- 2nd if-statement
    x = x + 44
  end
  x = x + 55
end
The compiler produces the first trace (TRACE 1) with the same logic as in the previous
examples. It contains the instructions x = x + 11, x = x + 33, x = x + 55. By increasing
i, the conditions of both if-statements become true, at the same time, more and more
often. Thus, the compiler will generate a side trace (TRACE 2) that contains the
instructions within both if-statements and what follows.
The bytecode below shows what was just explained.
---- TRACE 1 start Ex.lua:5
---- TRACE 1 IR
0001    int SLOAD  #2    CI
0002 >  num SLOAD  #1    T
0003    num ADD    0002  +11
0004    int MOD    0001  +10
0005 >  int NE     0004  +0
0006    num ADD    0003  +33
0007 +  num ADD    0006  +55
0008 +  int ADD    0001  +1
0009 >  int LE     0008  +1000000
0010 ------------ LOOP ------------
0011    num ADD    0007  +11
0012    int MOD    0008  +10
0013 >  int NE     0012  +0
0014    num ADD    0011  +33
0015 +  num ADD    0014  +55
0016 +  int ADD    0008  +1
0017 >  int LE     0016  +1000000
0018    int PHI    0008  0016
0019    num PHI    0007  0015
---- TRACE 1 stop -> loop

---- TRACE 2 start 1/4 Ex.lua:8
---- TRACE 2 IR
0001    num SLOAD  #1    PI
0002    int SLOAD  #2    PI
0003    num ADD    0001  +22
0004    num ADD    0003  +33
0005    int MOD    0002  +10
0006 >  int EQ     0005  +0
0007    num ADD    0004  +44
0008    num ADD    0007  +55
0009    int ADD    0002  +1
0010 >  int LE     0009  +1000000
0011    num CONV   0009  num.int
---- TRACE 2 stop -> 1
TRACE 1
[0] 0001
i ← #2
F
x ← #1 X1
? x:num
T
[1] 0003
x ← x + 11 F
X2
? i%10 6= 0 TRACE 2
T [7] 0001
[2] 0006 x ← #1
x ← x + 33 i ← #2
x ← x + 22
[3] 0007
x ← x + 55 [8] 0004
F
i ← i + 1 X4 x ← x + 33 F
X1
? i≤1e6 ? i%10 = 0
LOOP T 0010 T
T [10] 0008
[5] 0014 x ← x + 55
F
x ← x + 33 i ← i + 1 X2
? i≤1e6
T
[6] 0015
x ← x + 55
T F
i ← i + 1 X7 [0]
? i≤1e6
5.3.4 Case 4
This example shows how traces are organised in a loop with two if-statements when the
first condition becomes hot before the second, but the conditions mostly become true at
different times. In this case the truthfulness of the second implies the first only for some
iterations of the loop (e.g. i = 60, 120, ...).
The compiler creates the first trace (TRACE 1) with the same logic as in the previous
examples. It contains the instructions x = x + 11, x = x + 33, x = x + 55. By increasing
i, the condition of the first if-statement becomes true more and more often, thus the
compiler will generate a side trace (TRACE 2) that contains the instruction within the
first if-statement, x = x + 22, and what follows. At some point, the condition of the
second if-statement also becomes true repeatedly, thus the compiler creates a trace
(TRACE 3) with the instruction within the second if-statement, x = x + 44, and what
follows. Both TRACE 2 and TRACE 3 are side traces of TRACE 1.
Later on, when both conditions become true at the same time repeatedly, the compiler
generates another trace (TRACE 4) that contains the instruction within the second
if-statement, x = x + 44, and what follows. TRACE 4 starts as a side trace of TRACE 2.
The bytecode below shows what was just explained.
0015 ADDVN 0 0 7  ; 44
0016 ADDVN 0 0 8  ; 55
0017 JFORL 1 1
---- TRACE 3 stop -> 1

---- TRACE 4 start 2/1 Ex.lua:12
0015 ADDVN 0 0 7  ; 44
0016 ADDVN 0 0 8  ; 55
0017 JFORL 1 1
---- TRACE 4 stop -> 1
The IR displayed below shows that TRACE 2 starts at the exit number 5 of
TRACE 1 (line 0015) and it joins TRACE 1 at the end. TRACE 3 starts at the
exit number 6 of TRACE 1 (line 0018) and it joins TRACE 1 at the end. TRACE
4 starts at the exit number 1 of TRACE 2 (line 0006) and it joins TRACE 1 at
the end.
Possible trace paths covered by the execution flow are: (i) TRACE 1: if both
conditions are false; (ii) TRACE 1, TRACE 2: if the first condition is true and
the second is false; (iii) TRACE 1, TRACE 3: if the first condition is false and
the second is true; (iv) TRACE 1, TRACE 2, TRACE 4: if both conditions are
true.
TRACE 1
[0] 0001
i ← #2
F
x ← #1 X1
? x:num TRACE 2
T
[7] 0001
[1] 0003
x ← #1
x ← x + 11 F
X2 i ← #2
? i%20 6= 0
x ← x + 22 TRACE 4
T
[6] 0019
[11] 0004
x ← x + 55
T F x ← x + 55
i ← i + 1 X7 F
i ← i + 1 X1
? i≤1e6
? i≤1e6
T
[0]
a trace for each inner loop, a trace (or more) to connect them and a trace
for the outer loop that goes around the inner loops.
In this case the instructions of the inner loops will be executed repeatedly
at first. Thus, the inner loops become hot first and the compiler creates a
trace for each of them (TRACE 1, TRACE 2). At some point, also the outer loop
becomes hot and the compiler generates a trace (TRACE 3) to connect the two
inner loops (with the instruction x = x+33) and a trace (TRACE 4) that goes
around the inner loops (with the instructions x = x + 55, x = x + 11). Note
that TRACE 1, TRACE 2 are root traces and TRACE 3, TRACE 4 are sidetraces.
The bytecode below shows what was just explained.
The IR shows the details of the traces organisation. TRACE 1 and TRACE 2
are independent traces. TRACE 3 starts at the exit number 3 of TRACE 1 (line
0009) and it joins TRACE 2 at the end (this is the connection of the two inner
loops). TRACE 4 starts at the exit number 3 of TRACE 2 (line 0009) and it
joins TRACE 1 at the end. This is the part of the outer loop that goes around
the inner loops.
---- TRACE 1 start *.lua:7
---- TRACE 1 IR
0001    int SLOAD  #6    CI
0002 >  num SLOAD  #1    T
0003 +  num ADD    0002  +22
0004 +  int ADD    0001  +4
0005 >  int LE     0004  +20000
0006 ------------ LOOP ------------
0007 +  num ADD    0003  +22
0008 +  int ADD    0004  +4
0009 >  int LE     0008  +20000
0010    int PHI    0004  0008
0011    num PHI    0003  0007
---- TRACE 1 stop -> loop

---- TRACE 2 start *.lua:11
---- TRACE 2 IR
0001    int SLOAD  #6    CI
0002 >  num SLOAD  #1    T
0003 +  num ADD    0002  +44
0004 +  int ADD    0001  +6
0005 >  int LE     0004  +30000
0006 ------------ LOOP ------------
0007 +  num ADD    0003  +44
0008 +  int ADD    0004  +6
0009 >  int LE     0008  +30000
0010    int PHI    0004  0008
0011    num PHI    0003  0007
---- TRACE 2 stop -> loop

---- TRACE 3 start 1/3 *.lua:10
---- TRACE 3 IR
0001    num SLOAD  #1    PI
0002    num ADD    0001  +33
---- TRACE 3 stop -> 2

---- TRACE 4 start 2/3 *.lua:14
---- TRACE 4 IR
0001    num SLOAD  #1    PI
0002    num ADD    0001  +55
0003    num SLOAD  #2    I
0004    num ADD    0003  +2
0005 >  num LE     0004  +10000
0006    num ADD    0002  +11
---- TRACE 4 stop -> 1
TRACE 1 TRACE 2
TRACE 3 TRACE 4
[8] 0006
x ← x + 11
[0]
The same structure is maintained when the number of inner loops increases.
If n is the number of inner loops, the compiler generates: n traces for the n
inner loops, n − 1 traces to connect the inner loops and a final trace for the
outer loop that goes around the inner loops.
The code below describes the case of n = 3.
-- Nested loop with 3 inner loops

local x = 0

for i = 1, 1e4 do        -- outer loop
  x = x + 11
  for j = 1, 2e4, 2 do   -- inner loop (LOOP 1)
    x = x + 22
  end
  x = x + 33
  for k = 1, 3e4, 3 do   -- inner loop (LOOP 2)
    x = x + 44
  end
  x = x + 55
  for t = 1, 4e4, 4 do   -- inner loop (LOOP 3)
    x = x + 66
  end
  x = x + 77
end
[12] *
x ← x + 11
[0]
5.5.1 Non-tail recursive function
This example consists of a non-tail recursive factorial function. The compiler does not
create a loop structure because the function is not tail recursive.

-- Non-tail recursive factorial

local function factorial(n)
  if n > 0 then
    return n * factorial(n - 1)
  end
  return 1
end
5.5.2 Tail recursive function
This example consists of a tail recursive factorial function. In this case the compiler
creates a loop structure because the function is tail recursive.
Since the function is tail recursive, as soon as the recursive call returns, the execution
flow goes immediately to a return as well: it skips the entire chain of returns from the
recursive calls and returns straight to the original caller. There is no need for a call
stack for the recursive calls.
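The Lua source of this example is not reproduced in this copy; a sketch consistent with
the IR shown below (our reconstruction, using an explicit accumulator so that the
recursive call is a tail call) would be:

-- Tail recursive factorial (hypothetical reconstruction)
local function factorial(n, acc)
  if n > 0 then
    return factorial(n - 1, n * acc)  -- tail call: nothing left to do after it
  end
  return acc
end

local r = 0
for i = 1, 100 do
  r = factorial(10, 1)
end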
As a matter of fact, there are no dots in the bytecode because there is no recursion
depth in the calls. Also in this case the compiler unrolls three recursive calls. Thus, the
trace created contains three function calls with the instruction FUNCF (line 0000) and it
ends with a tail recursion: -- TRACE 1 stop -> tail-recursion.
The fact that the compiler transforms tail-recursive functions into loops, a standard
compiler transformation, is more explicit in the IR (see the -- LOOP -- label at line
0014).
---- TRACE 1 start Ex.lua:3
---- TRACE 1 IR
0001 >  num SLOAD  #2    T
0002 >  num SLOAD  #1    T
0003 >  num GT     0002  +0
0004    fun SLOAD  #0    R
0005 >  fun EQ     0004  Ex.lua:3
0006    num SUB    0002  +1
0007    num MUL    0002  0001
0008 >  num GT     0006  +0
0009    num SUB    0006  +1
0010    num MUL    0007  0006
0011 >  num GT     0009  +0
0012 +  num SUB    0009  +1
0013 +  num MUL    0010  0009
0014 ------------ LOOP ------------
0015 >  num GT     0012  +0
0016    num SUB    0012  +1
0017    num MUL    0013  0012
0018 >  num GT     0016  +0
0019    num SUB    0016  +1
0020    num MUL    0017  0016
0021 >  num GT     0019  +0
0022 +  num SUB    0019  +1
0023 +  num MUL    0020  0019
0024    num PHI    0013  0023
0025    num PHI    0012  0022
---- TRACE 1 stop -> tail-recursion
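The last example of this chapter concerns trace stitching. The Lua source for this
example is not reproduced in this copy; a minimal sketch consistent with the recorded
bytecode shown below (our reconstruction, not taken from the thesis) would be:

-- Trace stitching (hypothetical reconstruction)
local x = 0

for i = 1, 100 do
  x = x + 11
  os.clock()   -- C function: not compiled, so the trace is stitched here
  x = x + 22
end

TRACE 1 records the loop up to the call to the C function os.clock() and stops with a
stitch link; TRACE 2 then continues after the call and links back to TRACE 1.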
---- TRACE 1 start Ex.lua:5
0006 ADDVN 0 0 0  ; 11
0007 GGET  5 0    ; "os"
0008 TGETS 5 5 1  ; "clock"
0009 CALL  5 1 1
0000 . FUNCC      ; os.clock
---- TRACE 1 stop -> stitch

---- TRACE 2 start 1/stitch Ex.lua:7
0010 ADDVN 0 0 1  ; 22
0011 JFORL 1 1
---- TRACE 2 stop -> 1
Chapter 6
Analysis tools
This chapter presents some diagnostic tools for the analysis of LuaJIT. The compiler
framework provides a set of tools that help to investigate what happens under the hood
while executing a program with LuaJIT. In particular we will illustrate: (i) the verbose
mode, (ii) the profiler, and (iii) the dump mode.
Over the years, the LuaJIT community realised that a key aspect to understand LuaJIT
behaviour in more detail, and to facilitate its usage, would be to improve the information
reported by the "standard" LuaJIT tools. In this regard, we implemented a small
extension of the dump mode in order to get more specific information on traces. The
goal is to provide the user with some insights for writing JIT-friendly code, since in some
specific circumstances certain code patterns are preferable to others.
Another interesting tool for interactive software diagnostics, called Studio [48], was born
in the context of RaptorJIT [20], a fork of LuaJIT. It is an interactive tool for inspecting
and cross-referencing trace and profiler data.
[TRACE   1 myapp.lua:1 loop]
[TRACE   2 (1/3) myapp.lua:1 -> 1]
The first number in each line indicates the internal trace number. Then the file name
"myapp.lua" and the line number ":1" where the trace started are printed. Side traces
also show in parentheses "(1/3)" the parent trace number and the exit number from
which they are attached. An arrow at the end shows where the trace links to ("-> 1"),
unless it loops to itself.
When a trace aborts, the output is the following:
[TRACE --- foo.lua:44 -- leaving loop in root trace at foo.lua:50]
Trace aborts are quite common, even in programs which can be fully compiled. The
compiler may retry several times until it finds a suitable trace.
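Verbose mode is enabled with LuaJIT's standard -jv option, either on the command line
or programmatically through the jit.v module:

luajit -jv myapp.lua             # print trace events to the console
luajit -jv=myapp.out myapp.lua   # redirect the verbose output to a file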
6.2 Profiler
This module is a command-line interface to the built-in low-overhead profiler
of LuaJIT. The lower-level API of the profiler is accessible via the "jit.profile"
module or the luaJIT_profile_* C API.
Some examples of using this mode are shown below.
luajit -jp myapp.lua
luajit -jp=s myapp.lua
luajit -jp=-s myapp.lua
luajit -jp=vl myapp.lua
luajit -jp=G,profile.txt myapp.lua
The following profiler options are available. Many of these options can be
activated at once.
Option       Description
f            shows function name (default mode)
F            shows function name with the module name prepended
l            shows line granularity ('module':'line')
<number>     stack dump depth (callee < caller; default: 1)
-<number>    inverse stack dump depth (caller > callee)
s            split stack dump after first stack level; implies |depth| >= 2
p            shows full path for module names
v            shows VM states (Compiled, Interpreted, C code, Garbage Collector, JIT)
z            shows zones (statistics can be grouped in user-defined zones)
r            shows raw sample counts (default: percentages)
a            annotates excerpts from source code files
A            annotates complete source code files
G            gives raw output for graphics (time spent per function/VM state)
m<number>    minimum sample percentage to be shown (default: 3)
i<number>    sampling interval in milliseconds (default: 10)
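As an illustration of the z option, zones are defined from Lua code with the
standard jit.zone module (a minimal sketch following the LuaJIT profiler
documentation; the zone name and the function are illustrative):

local zone = require("jit.zone")

local function update_physics()
  zone("physics")   -- push the "physics" zone: samples taken here are grouped under it
  -- ... computations attributed to the "physics" zone ...
  zone()            -- pop the zone
end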
This module can also be used while programming. The start function can
take two arguments: the list of options (described above) and the output file.
local prof = require("jit.p")
prof.start("vf", "file.txt")
-- code to analyse here
prof.stop()
6.3 Dump mode
This module shows information about the traces generated by the JIT compiler
at different levels of granularity. It is activated from the command line with
luajit -jdump=mode[,outputfile] myapp.lua.
The first argument specifies the dump mode. The second argument gives
the output file name (default output is to stdout). The file is overwritten
every time the module is started. Different features can be turned on or off
with the dump mode. If the mode starts with a ’+’, the following features
are added to the default set of features; a ’-’ removes them. Otherwise, the
features are replaced. The following dump features are available (* marks
the default):
Option Description
t* print a line for each started, ended or aborted trace (see also -jv)
b* dump the traced bytecode
i* dump the IR (intermediate representation)
r augment the IR with register/stack slots
s dump the snapshot map
m* dump the generated machine code
x print each taken trace exit
X print each taken trace exit and the contents of all registers
a print the IR of aborted traces, too
T output format: plain text output
A output format: ANSI-colored text output
H output format: colorized HTML + CSS output
This module can also be used while programming. The on function can take
two arguments: the list of options (described above) and the output file.
local dump = require("jit.dump")
dump.on("tbimT", "outfile.txt")
-- code to analyse here
dump.off()
This function plays an important role, since it gives the opportunity to
restrict the dump size and to focus the analysis on the interesting parts of the code only.
Bytecode dump
An example of a bytecode instruction is shown in the dump below. The
official LuaJIT website contains an extensive description of the possible bytecode
instructions [55].
0007  . . CALL     0   0   0  ; comment
Column Description
1st column bytecode index, numbered by function
2nd column dots represent the depth (call hierarchy)
3rd-5th columns bytecode arguments
last column comment to tie the instruction to the Lua code
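A bytecode listing of this form can also be printed directly from Lua with the
standard jit.bc module (a minimal sketch; the function is illustrative):

local bc = require("jit.bc")
local function add(a, b) return a + b end
bc.dump(add)   -- prints the bytecode of 'add' to stdout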
IR dump
Some examples of IR instructions are shown in the dump below. The
official LuaJIT website contains an extensive description of the possible IR
instructions [65].
.... SNAP #0 [ ---- ]
0001 rbp int SLOAD #2 CI
0002 xmm7 > num SLOAD #1 T
.... SNAP #2 [ ---- 0003 0004 ---- ---- 0004 ]
0003 xmm7 + num MUL 0002 -1
0010 > fun EQ 0158 app.lua:298
Column           Description
1st column       IR instruction index (numbered per trace)
2nd column       shows where the value is written to when converted to machine
                 code, if the 'r' flag is included (e.g. a CPU register or stack slot)
3rd column       instruction flags (">" marks guards, i.e. locations of possible
                 side exits from the trace; "+" indicates that the instruction is a
                 left or right PHI operand)
4th column       IR type
5th column       IR opcode
6th/7th columns  IR operands (a plain number is a reference to an IR SSA
                 instruction; a "#" prefix refers to a slot number, as used in SLOAD)
Snapshot
A snapshot (SNAP) stores a consistent view of all updates to the state before
an exit. If an exit is taken, the snapshot is used to restore the VM to a
consistent state. Each snapshot lists the modified stack slots
and the corresponding values. The n-th value in the snapshot list is
the index of the IR instruction that wrote slot number n ("----" indicates
that the slot has not been modified). Function frames are separated by '|'.
.... SNAP   #1   [ ---- ---- ---- ---- 0009 0010 0011 ---- ]
.... SNAP   #1   [ app.mad:120|---- ---- false ]
Mcode dump
For the mcode dump, each line is composed of two parts: the address of the
mcode instruction and the corresponding assembler instruction. Some examples
of mcode instructions are shown in the dump below.
(i). The first line is emitted when recording starts. It shows the potential
trace number, the file name and the line in the source file from where the trace
starts. When the recorded trace is a sidetrace, it also prints information
on the parent trace and the side exit: for instance, in the third example
above, 2 is the sidetrace number, 1 is the parent trace from which the
sidetrace starts and 4 is the side-exit number. When the trace is
a stitch trace, it displays the parent number and the keyword stitch (i.e.
TRACE N start N parent/stitch app.lua:lineNo).
(ii). The second line (our extension) indicates one of the following situations:
(i) trace compilation success; (ii) simple trace abortion; (iii) trace
abortion with blacklisting. It also prints the memory address of the
code fragment associated to the trace (PC=0x...), the index hit in the
Hotcount table (in square brackets [idx]) and, in case of abort, also the
error number and the penalty value.
(iii). The third line is emitted when a trace creation is concluded. If a trace
was successfully generated, it shows what the trace is linked to, e.g.
loop, N (the number of another trace), tail-recursion, interpreter,
stitch, etc. Otherwise, in case of abort, it shows the file and the
source line responsible for the abort and it prints a message indicating
the reason.
The tool was also patched for the diagnosis of the current status of critical
data structures used by the JIT for hotpath detection and blacklisting. In
particular, if a special variable is set, it is possible to print the Hotcount table.
This is the table in which the counters associated to the hotspots in the code
are stored. With this information the user can see whether collisions occur
and what their impact on trace creation is. An example of the output that
includes the Hotcount table is shown below.
---- HOTCOUNT TABLE
[ 0]=111 [ 1]=111 [ 2]=111 [ 3]=111 [ 4]=111 [ 5]=111 [ 6]=111 [ 7]=111
[ 8]=111 [ 9]=111 [10]=111 [11]=111 [12]=111 [13]=111 [14]=111 [15]=111
[16]=111 [17]=111 [18]=111 [19]=111 [20]=111 [21]=111 [22]=111 [23]=111
[24]=111 [25]=40 [26]=111 [27]=111 [28]=111 [29]=111 [30]=52 [31]=111
[32]=111 [33]=111 [34]=111 [35]=111 [36]=111 [37]=111 [38]=111 [39]=111
[40]=111 [41]=111 [42]=111 [43]=111 [44]=111 [45]=111 [46]=111 [47]=111
[48]=111 [49]=111 [50]=111 [51]=111 [52]=111 [53]=111 [54]=111 [55]=111
[56]=111 [57]=111 [58]=111 [59]=111 [60]=111 [61]=0 [62]=111 [63]=111
---- TRACE 1 start app.lua:5
---- TRACE 1 info success trace compilation -- PC=0x7fafd0f79bf0 [61]
---- TRACE 1 stop -> loop
On the other hand, when analysing aborts and blacklisting it is useful
to print the Hotpenalty table. This table keeps information on aborted
traces, in order to blacklist fragments when trace creation aborts repeatedly.
For each entry it prints the memory address of the code fragment from where a
trace aborted (PC=address), the penalty value (val=penalty value) and the
error number (reason=err number). The table has 64 possible entries, which
are filled through a round-robin index.
---- TRACE 2 start app.lua:25
**** HOTPENALTY TABLE penaltyslot=4 (round-robin index)
[0]: PC = 7fd7493e56dc  val = 39129  reason = 7
[1]: PC = 7fd7493e4dc8  val = 42135  reason = 7
[2]: PC = 7fd7493e5198  val = 41161  reason = 7
[3]: PC = 7fd7493e5704  val = 18943  reason = 5
[4]: PC = 0             val = 0      reason = 0
[5]: ...                ...          ...
---- TRACE 2 info abort penalty pc errno=5 valpenalty=37891 -- PC=0x7fd7493e5704 [2]
---- TRACE 2 abort app.lua:4 -- NYI: bytecode 51
In the example above, the first three entries of the table have been blacklisted
because their value (valpenalty = val*2 + a small random number) exceeds the
maximum penalty value of 60000 (the LuaJIT default). See Sec. 4.5.2 for more
details on blacklisting in LuaJIT.
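As a rough illustration of how quickly a fragment reaches the blacklisting
threshold: since the penalty value is approximately doubled on each abort, only
about a dozen aborts of the same fragment are needed. The sketch below assumes
an illustrative starting value of 36 and ignores the small random component;
the exact constants live in the LuaJIT sources.

local penalty, aborts = 36, 0
while penalty <= 60000 do
  penalty = penalty * 2   -- doubling on every abort, random component ignored
  aborts = aborts + 1
end
print(aborts)             -- 11 with this illustrative starting value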
0x7f46175b2abc  69  0  0  {4,22,28,...}  app3.lua  55  1
Once the table is completed we can see which code fragments are most critical.
A substantial number of aborts and blacklistings related to a single code
fragment can be seen as a red flag, since it can deteriorate the overall
performance. An excessively large trace tree (many sidetraces attached to a
root trace) also indicates that the code fragment is critical. Once the traces
that refer to critical code fragments have been identified, they should be
analysed in detail by printing the related bytecode, IR and machine code
through the Dump mode. For instance, in the example above the second and
third fragments of the table need further investigation. The second fragment
aborted many times and was eventually blacklisted. The third fragment
has a high number of successes, hence a long chain of sidetraces attached
to the root trace.
Chapter 7
Strategy
7.1 Profiling
The profiler is a very powerful tool provided by LuaJIT. It is a statistical
profiler with very low overhead which samples the currently executing
stack and other parameters at regular intervals. Through the profiler it is
possible to get some insights on program execution: it shows where the
program spends most of the time. In Sec. 6.2 we presented how to use the profiler
with its options. Here we show how we used these options
in order to get useful information.
With the combination of options -jp=vl, the output of the profiler
displays the percentage of time spent in each of the VM states (if one
of the states does not appear, the time spent in that state is negligible).
The profiler also shows, for each state, the percentage of time spent in each
module:line.
The example below illustrates a practical case.
Input:
./luajit -jp=vl myapp.lua
Output:
40% Compiled
-- 50% fun1.lua:111
-- 25% fun2.lua:222
...
25% Garbage Collector
-- 33% fun1.lua:111
-- 20% fun3.lua:333
...
15% Interpreted
-- 35% fun1.lua:111
-- 10% fun2.lua:222
...
15% JIT Compiler
-- 25% fun1.lua:111
...
5% C code
-- 20% fun4.lua:444
...
Here we can see that LuaJIT spent 40% of the time executing machine code
and only 15% of the time in the interpreter. This can be viewed as a good
indicator, because running compiled code is supposed to be faster than
interpreting bytecode instructions. 5% of the time in C code can be reasonable
if the program does not call C functions frequently. 25% of the time spent in
the garbage collector might seem a big percentage, but it can be necessary
in some situations. 15% of the time spent in the JIT compiler is also a
considerable time slice. Ideally, the time spent in the JIT compiler (that is,
the time spent recording, generating mcode, etc.) should be kept to the bare
minimum, e.g. under 5% of the total execution time.
Moreover, thanks to the option 'l' we can dig deeper into the specific
percentage of time spent in each module:line of the VM states. For instance,
in the example above the fragment of code at fun1.lua:111 appears in many
VM states with considerable percentages. This means that it is a critical part
of the program which should be analysed in detail and possibly optimised.
Thanks to the profiler we can see how changes in our program
affect the percentage of time spent in each VM state. Ideally, we want to
maximise the time spent in the Compiled state and minimise the time spent
in the Interpreted state. The time spent in C code is related to the specific
application. Finally, we must keep the time spent in the JIT compiler and in
the Garbage Collector to the bare minimum. It should be noted that this is a
general and reasonable guideline to follow, but the overall execution time
might be faster even with a slightly disadvantageous configuration (e.g. 70%
Compiled, 25% Interpreted, 5% JIT Compiler is not necessarily better than
60% Compiled, 30% Interpreted, 10% JIT Compiler).
Another useful combination of options is -jp=Fl. It gives interesting
insights about the specific functions and module:lines where the program
spends most of the execution time. In particular, the profiler shows the
percentage of time spent in each function and, within it, in each module:line.
The example below illustrates a practical case.
Input:
./luajit -jp=Fl myapp.lua
Output:
13% file1.lua:func1
-- 67% file1.lua:866
-- 16% file1.lua:886
-- 5% file1.lua:869
-- 4% file1.lua:868
10% file2.lua:func2
-- 69% file2.lua:194
-- 13% file2.lua:192
-- 10% file2.lua:195
-- 7% file2.lua:191
9% file3.lua:func3
-- 68% file3.lua:714
-- 11% file3.lua:736
-- 5% file3.lua:718
-- 3% file3.lua:733
-- 3% file3.lua:734
-- 3% file3.lua:735
...
To conclude our discussion about the profiler, we want to highlight the fact
that when running the profiler LuaJIT calls a function to flush all traces (see
lj_profile.c for more details).
7.2 Dump mode and post-execution analysis
The profiler is a very powerful tool that can help us detect the critical parts
of the code where programs spend most of the time. It also shows how
the changes we make to the code impact the time spent in each state of
the VM. However, what is missing in the profiler is information about
the traces generated by the JIT compiler. Having information about traces
can guide us to change our code in a "good" way, so that the compiler will
generate traces more efficiently.
The Dump mode provides useful information about traces at different
levels of granularity. In this context, we used the Dump tool with our small
patch, as explained in Sec. 6.4.
The example below illustrates a practical case.
Input:
./LuaJITpatched/src/luajit -jdump=t myapp.lua -o nil > output.txt 2>&1
Output:
---- TRACE 1 start file1.lua:328
---- TRACE 1 info abort penalty pc errno=7 valpenalty=144 -- PC=0x7f10791ab78c [36]
---- TRACE 1 abort file1.lua:335 -- NYI: bytecode 50
The output of the dump mode can be too large to be analysed by hand (it can be
gigabytes of data). Thus, we need to discriminate which parts of this huge
file should be inspected in more detail.
Once the output.txt file is created, we used a post-execution script
TraceAnalyzer.lua in order to collect summary information about traces,
as described in Sec. 6.5. Data are filtered and reorganised from the file
output.txt into a big table containing a row for each hotspot from where the
JIT tried to create a trace. Here we can decide which columns we want to print.
In the example below the output shows: (i) the memory address of the first
instruction of the code fragment that was detected as a hotspot; (ii) the nature
of the trace, i.e. root, side or stitch; (iii) the link at the end of the trace, e.g.
the trace number of its parent (nil means no connection because the trace
aborted), etc.; (iv) the number of aborts and whether the fragment was blacklisted;
(v) the file name and the first line of the code fragment in the source code;
(vi) the error file name, the first line of the code fragment in its source code
and the error number; (vii) the number of flushes that occurred before the
creation of the trace.
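A post-processing script of this kind only needs to scan the dump output line
by line. The following is a hypothetical, much simplified sketch in the spirit
of TraceAnalyzer.lua: it only counts trace starts and aborts per source
location, whereas the real script extracts many more columns (file names and
patterns are illustrative):

local starts, aborts = {}, {}
for line in io.lines("output.txt") do
  -- e.g. "---- TRACE 12 start app.lua:34" (possibly with parent/exit info)
  local loc = line:match("^%-%-%-%- TRACE %d+ start .-(%S+:%d+)$")
  if loc then starts[loc] = (starts[loc] or 0) + 1 end
  -- e.g. "---- TRACE 12 abort app.lua:36 -- NYI: bytecode 51"
  local aloc = line:match("^%-%-%-%- TRACE %d+ abort (%S+:%d+) %-%-")
  if aloc then aborts[aloc] = (aborts[aloc] or 0) + 1 end
end
for loc, n in pairs(starts) do
  print(string.format("%-30s started=%d aborted=%d", loc, n, aborts[loc] or 0))
end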
Input:
./luajit TracesAnalyser.lua > Result.txt
Output:
**** TRACE ANALYSIS
PC              What  Link  Aborts  Blacklist  FileName  Line  Errfile   Errline  Errno  Flush
0x7f46176dbb5c  side  4     2       0          app1.lua  34    app9.lua  23       10     0
0x7f4617660b54  root  nil   11      1          app2.lua  70    app5.lua  90       8      0
0x7f46175b2abc  side  12    0       0          app3.lua  55    app5.lua  57       10     1
Once the table is completed we can see which code fragments are most critical.
A substantial number of aborts and blacklistings related to a single code
fragment can be seen as a red flag, since it can deteriorate the overall
performance. This is a general indicator: especially in large and
complex applications, some aborts and/or blacklistings will always occur. It should
be noted that identifying blacklisted code fragments is possible
only thanks to our patch of the Dump mode, since the standard version gives
information only about aborts.
Once the traces that refer to critical code fragments have been identified, they
should be analysed in detail by printing the related bytecode, IR and machine
code through the Dump mode (options bim). In order to do that, we
must activate the dump mode only around the fragments of code that we classified
as "critical".
The example below illustrates a practical case.
local dump = require("jit.dump")
dump.on("bim", "outfile.txt")
-- code to analyse here
dump.off()
Also in our complex scientific application, which involves a large amount of
computation, we observed a faster and more stable behaviour when changing some
of the default parameters. In particular, our final configuration includes:
maxtrace=8000, maxrecord=16000, loopunroll=7, maxmcode=65536.
It is clear that in both these applications extending the limits imposed
by the default memory constraints leads to an improvement in performance.
In order to change the default parameters, you must run LuaJIT with the
option -Oparam=value. Once you find the most appropriate configuration of
parameters for your specific application, you can then recompile LuaJIT with
the new parameter values.
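The same parameters can also be set at runtime, without recompiling, through
the standard jit.opt module (a minimal sketch; the values are the ones from
the configuration above, not general recommendations):

require("jit.opt").start("maxtrace=8000", "maxrecord=16000",
                         "loopunroll=7", "maxmcode=65536")
-- the rest of the application runs with the new JIT parameters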
It should be noted that this is a very delicate task. The default values
were defined based on heuristics by Mike Pall, who commented on this
subject in a post on the LuaJIT GitHub project: "The trace heuristics
are a very, very delicate matter. They interact with each other in many ways
and it will only show in very specific code. This is hard to tune unless one
is willing to do lots of benchmarking and stare at generated code for hours.
That's how the current heuristics have been derived. A major project for
someone with enough time at their hands" 1.
Here we describe how we tuned the loopunroll parameter for
our specific application, so as to provide a concrete example for the reader.
Thanks to the script TraceAnalyser.lua, which processes the output of the
patched version of the Dump mode, we realised that an enormous number
of trace aborts were caused by loop unroll limit reached (errno=10).
This abort means that the JIT compiler has applied the loop unrolling
optimisation too many times in a given trace. A first solution would be to
restructure the code into a few tight loops instead of one big loop (or
vice-versa); another option is to change the loopunroll limit.
We realised that 15 (the default value of loopunroll) was not appropriate
for our specific application. Increasing this value (e.g. loopunroll=30)
made the performance of the overall application worse. Thus, we decreased
the default value, trying different alternatives. Finally, we settled on
loopunroll=7 as our optimal value, because we observed a faster and more
stable behaviour of the overall application.
Citing again Mike Pall from a post on the LuaJIT GitHub project: "The
loop unrolling heuristics are not ideal for all situations. Some dependent
projects change the loop unrolling limits" 1.
1 Conversation on the LuaJIT GitHub project: https://github.com/LuaJIT/LuaJIT/issues/11
Appendix A
Values in dumps
Table A.1: Possible values for [link] (see jit_trlinkname in lib_jit.c and
TraceLink in lj_jit.h)
Value            Meaning
none             Incomplete trace. No link, yet.
root             Link to another root trace.
loop             Loop to same trace.
tail-recursion   Tail-recursion.
up-recursion     Up-recursion.
down-recursion   Down-recursion.
interpreter      Fallback to the interpreter (stops recording a side trace when
                 the maximum is reached; see sidecheck: in lj_record.c).
return           Return to interpreter.
stitch           Trace stitching.
Table A.3: Possible values for the XLOAD argument (see lj_ir.h)
R   IRXLOAD_READONLY    Load from read-only data.
V   IRXLOAD_VOLATILE    Load from volatile data.
U   IRXLOAD_UNALIGNED   Unaligned load.
Table A.4: Possible values for the FLOAD or FREF argument (see lj_ir.h)
str.len          IRFL_STR_LEN          String length
func.env         IRFL_FUNC_ENV         Function's environment for upvalues
func.pc          IRFL_FUNC_PC          PC of the function's prototype
func.ffid        IRFL_FUNC_FFID        Function id
thread.env       IRFL_THREAD_ENV       Thread environment for upvalues
tab.meta         IRFL_TAB_META         Metatable
tab.array        IRFL_TAB_ARRAY        Table array part
tab.node         IRFL_TAB_NODE         Table hash part
tab.asize        IRFL_TAB_ASIZE        Size of array part
tab.hmask        IRFL_TAB_HMASK        Size of hash part - 1
tab.nomm         IRFL_TAB_NOMM         Negative cache bitmap for fast
                                       metamethods, marking absent fields
                                       of the metatable
udata.meta       IRFL_UDATA_META       udata metatable
udata.udtype     IRFL_UDATA_UDTYPE     see UDTYPE table
udata.file       IRFL_UDATA_FILE       udata payload
cdata.ctypeid    IRFL_CDATA_CTYPEID    cdata's ctypeid
cdata.ptr        IRFL_CDATA_PTR        cdata payload
cdata.int        IRFL_CDATA_INT        cdata payload
cdata.int64      IRFL_CDATA_INT64      cdata payload
cdata.int64_4    IRFL_CDATA_INT64_4    cdata payload
Table A.6: Possible values for the FPMATH argument (see lj_ir.h)
floor   FPM_FLOOR
ceil    FPM_CEIL
trunc   FPM_TRUNC
sqrt    FPM_SQRT
exp     FPM_EXP
exp2    FPM_EXP2
log     FPM_LOG
log2    FPM_LOG2
log10   FPM_LOG10
sin     FPM_SIN
cos     FPM_COS
tan     FPM_TAN
Appendix B
DynASM: Assembler
Main directives
Directive                                  Description
.arch                                      specifies the architecture of the assembly code
.type name, ctype [, default reg]          makes it easier to manipulate registers of type
                                           ctype*; the provided syntactic sugar is depicted
                                           in Table B.2
.macro [...] .endmacro                     creates a multi-line macro instruction that can
                                           be invoked as a normal instruction and where
                                           arguments will be substituted
.define                                    defines a preprocessor substitution
.if [...] .elif [...] .else [...] .endif   preprocessor conditional construct similar to
                                           the C preprocessor
Sugar Expansion
#name sizeof(ctype)
name:reg->field [reg + offsetof(ctype,field)]
name:reg[imm32] [reg + sizeof(ctype)*imm32]
name:reg[imm32].field [reg + sizeof(ctype)*imm32 + offsetof(ctype,field)]
name:reg... [reg + (int)(ptrdiff_t)&(((ctype*)0)...)]
Line markers
Typical DynASM lines that emit assembler instructions must start with a
vertical bar ("|"). If you want to emit some lines of C code, but you still want
DynASM's preprocessor to do substitutions, they must start with a double
vertical bar ("||"). Finally, lines with no starting marker are left completely
untouched by DynASM's preprocessor. It should be noted that lines of C
code that have to be inlined within a macro must start with a double vertical
bar.
Labels
There are different kinds of labels. The first category is the "global labels",
which come in two forms: "static labels" (|->name:) and "dynamic labels"
(|=>imm32:). These labels are unique within a DynASM file. The second
category is the "local labels", which use a single digit from 1 to 9 (|i:). They
can be defined multiple times in the same DynASM file. They are used by
jump instructions with the syntax <i or >i, which point respectively to
the most recent and to the next definition of i as the jump target.
Bibliography
[13] Derek Bruening and Evelyn Duesterwald. “Exploring optimal compila-
tion unit shapes for an embedded just-in-time compiler”. In: In Proceed-
ings of the 2000 ACM Workshop on Feedback-Directed and Dynamic
Optimization FDDO-3. Citeseer. 2000.
[14] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. Transpar-
ent dynamic optimization: The design and implementation of Dynamo.
Tech. rep. HP Laboratories Cambridge, 1999.
[15] Evelyn Duesterwald and Vasanth Bala. “Software profiling for hot path
prediction: Less is more”. In: ACM SIGOPS Operating Systems Review.
Vol. 34. 5. ACM. 2000, pp. 202–211.
[16] Andreas Gal, Christian W Probst, and Michael Franz. “HotpathVM: an
effective JIT compiler for resource-constrained devices”. In: Proceedings
of the 2nd international conference on Virtual execution environments.
ACM. 2006, pp. 144–153.
[17] Carl Friedrich Bolz et al. “Tracing the meta-level: PyPy’s tracing JIT
compiler”. In: Proceedings of the 4th workshop on the Implementation,
Compilation, Optimization of Object-Oriented Languages and Program-
ming Systems. ACM. 2009, pp. 18–25.
[18] David Hiniker, Kim Hazelwood, and Michael D Smith. “Improving re-
gion selection in dynamic optimization systems”. In: Proceedings of the
38th annual IEEE/ACM International Symposium on Microarchitec-
ture. IEEE Computer Society. 2005, pp. 141–154.
[19] Andreas Gal et al. “Trace-based just-in-time type specialization for
dynamic languages”. In: ACM Sigplan Notices 44.6 (2009), pp. 465–
478.
[20] Luke Gorrie. RaptorJIT. 2017. url: https://github.com/raptorjit/
raptorjit.
[21] Ron Cytron et al. “Efficiently computing static single assignment form
and the control dependence graph”. In: ACM Transactions on Pro-
gramming Languages and Systems (TOPLAS) 13.4 (1991), pp. 451–
490.
[22] Andreas Gal and Michael Franz. Incremental dynamic code generation
with trace trees. Tech. rep. Citeseer, 2006.
[23] James G Mitchell. “The design and construction of flexible and effi-
cient interactive programming systems.” PhD thesis. Carnegie-Mellon
University, Pittsburgh, PA, 1970.
[24] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. “Dynamo: a
transparent dynamic optimization system”. In: (2000).
[25] John Aycock. “A brief history of just-in-time”. In: ACM Computing
Surveys (CSUR) 35.2 (2003), pp. 97–113.
[26] Dean Deaver, Rick Gorton, and Norm Rubin. “Wiggins/Redstone: An
on-line program specializer”. In: Proceedings of the IEEE Hot Chips XI
Conference. 1999.
[27] Michael Gschwind et al. “Dynamic and transparent binary transla-
tion”. In: Computer 33.3 (2000), pp. 54–59.
[28] Cindy Zheng and Carol Thompson. “PA-RISC to IA-64: Transparent
execution, no recompilation”. In: Computer 33.3 (2000), pp. 47–52.
[29] Gregory T Sullivan et al. “Dynamic native optimization of interpreters”.
In: Proceedings of the 2003 workshop on Interpreters, virtual machines
and emulators. ACM. 2003, pp. 50–57.
[30] Mozilla Foundation. SpiderMonkey. url: https://developer.mozilla.
org/en-US/docs/Mozilla/Projects/SpiderMonkey.
[31] Mason Chang et al. “Tracing for web 3.0: trace compilation for the
next generation web applications”. In: Proceedings of the 2009 ACM
SIGPLAN/SIGOPS international conference on Virtual execution en-
vironments. ACM. 2009, pp. 71–80.
[32] Adobe System Inc. ActionScript 3.0 overview. url: https://www.adobe.com/
devnet/actionscript/articles/actionscript3_overview.html.
[33] Ecma International. Standard ECMA-262. 1999. url: https://www.ecma-
international.org/publications/files/ECMA-ST/Ecma-262.pdf.
[34] Andreas Gal. “Efficient Bytecode Verification and Compilation in a
Virtual Machine Dissertation”. PhD thesis. PhD thesis, University Of
California, Irvine, 2006.
[35] Andreas Gal et al. Making the compilation “pipeline” explicit: Dynamic
compilation using trace tree serialization. Tech. rep. Technical Report
07-12, University of California, Irvine, 2007.
[36] Mason Chang et al. Efficient just-in-time execution of dynamically
typed languages via code specialization using precise runtime type in-
ference. Tech. rep. Citeseer, 2007.
[37] Mason Chang et al. “The impact of optional type information on jit
compilation of dynamically typed languages”. In: ACM SIGPLAN No-
tices 47.2 (2012), pp. 13–24.
[38] Carl Friedrich Bolz et al. “Allocation removal by partial evaluation in a
tracing JIT”. In: Proceedings of the 20th ACM SIGPLAN workshop on
Partial evaluation and program manipulation. ACM. 2010, pp. 43–52.
[39] Carl Friedrich Bolz and Laurence Tratt. “The impact of meta-tracing
on VM design and implementation”. In: Science of Computer Program-
ming (2013).
[40] Carl Friedrich Bolz et al. “Meta-tracing makes a fast Racket”. In: Work-
shop on Dynamic Languages and Applications. 2014.
[41] Carl Friedrich Bolz. “Meta-tracing just-in-time compilation for RPython”.
PhD thesis. Heinrich Heine University Düsseldorf, 2012.
[42] Håkan Ardö, Carl Friedrich Bolz, and Maciej FijaBkowski. “Loop-
aware optimizations in PyPy’s tracing JIT”. In: ACM SIGPLAN No-
tices. Vol. 48. 2. ACM. 2012, pp. 63–72.
[43] Maarten Vandercammen. “The Essence of Meta-Tracing JIT Compil-
ers”. In: (2015).
[44] Spenser Bauman et al. “Pycket: a tracing JIT for a functional lan-
guage”. In: ACM SIGPLAN Notices. Vol. 50. 9. ACM. 2015, pp. 22–
34.
[45] Michael Bebenita et al. “SPUR: a trace-based JIT compiler for CIL”.
In: ACM Sigplan Notices 45.10 (2010), pp. 708–725.
[46] Peter Cawley. The unofficial DynASM documentation. url: https://
corsix.github.io/dynasm-doc/.
[47] Mike Pall. LuaJIT 2.0 intellectual property disclosure and research op-
portunities. 2009. url: http://lua-users.org/lists/lua-l/2009-
11/msg00089.html.
[48] Luke Gorrie. Studio. 2017. url: https://github.com/studio/studio/.
[49] OpenResty Inc. OpenResty. url: https://openresty.org/.
[50] Mike Pall. LuaJIT New Garbage Collector wiki page. 2017. url: http:
//wiki.luajit.org/New-Garbage-Collector.
[51] Doug Lea. Doug Lea Memory Allocator article. 2018. url: http://g.
oswego.edu/dl/html/malloc.html.
[52] Doug Lea. Doug Lea Memory Allocator implementation. 2018. url:
ftp://g.oswego.edu/pub/misc/malloc.c.
[53] Mike Pall. LuaJIT Extensions. 2018. url: http://luajit.org/
extensions.html.
[54] Mike Pall. LuaJIT Bit Operations Module. 2018. url: http://bitop.
luajit.org/.
[55] Mike Pall. LuaJIT 2.0 Bytecode Instructions. url: http://wiki.
luajit.org/Bytecode-2.0.
[56] Mike Pall. FFI motivation and use. 2018. url: http://luajit.org/
ext_ffi.html.
[57] Mike Pall. FFI tutorial. 2018. url: http://luajit.org/ext_ffi_
tutorial.html.
[58] Mike Pall. FFI API documentation. 2018. url: http://luajit.org/
ext_ffi_api.html.
[59] Mike Pall. FFI Semantics. 2018. url: http://luajit.org/ext_ffi_
semantics.html.
[60] Peter Cawley. Reflection library for ctypes. 2018. url: https://github.
com/corsix/ffi-reflect.
[61] Peter Cawley. FFI reflect documentation. 2018. url: http://corsix.
github.io/ffi-reflect.
[62] Mike Pall. LuaJIT Wiki. url: http://wiki.luajit.org/Home.
[63] Mike Pall. LuaJIT optimization wiki page. 2018. url: http://wiki.
luajit.org/Optimizations.
[64] Mike Pall. LuaJIT Allocation Sinking Optimization wiki page. 2017.
url: http://wiki.luajit.org/Allocation-Sinking-Optimization#
implementation_assembler-backend_snapshot-allocations.
[65] Mike Pall. LuaJIT 2.0 SSA IR. url: http://wiki.luajit.org/SSA-
IR-2.0.
[66] Mike Pall. Mail: LuaJIT 2.0 intellectual property disclosure and re-
search opportunities. 2017. url: http://lua-users.org/lists/lua-
l/2009-11/msg00089.html.
[67] Roberto Ierusalimschy. Programming in Lua, Fourth Edition. Lua.org,
2016. Chap. 22.
[68] Lua Reference Manual. url: https://www.lua.org/manual/.
[69] Static Single Assignment Book. url: http://ssabook.gforge.inria.
fr/latest/book.pdf.
[70] Mike Pall. LuaJIT source code. 2017. url: https : / / github . com /
LuaJIT/LuaJIT.