
Android JIT

Introduction
Just-In-Time (JIT)/Dynamic Compilation
JIT Design
Dalvik JIT
JIT Compiler
Intermediate Representation
Optimization Techniques
Data-/Control-Flow Analysis

Introduction:
The Java language is made to be interpreted to achieve the critical
goal of application portability.
[Figure: the Java compilation pipeline. A Java source file (HW.java, containing class HW with a hello() method) is compiled by javac, together with other classes, into a class file (HW.class) of bytecode beginning ca fe ba be ..., which the Java Virtual Machine executes via the java launcher.]
Just as microprocessors have instruction sets that define the operations they can perform, so does the VM: Java source is compiled into the VM's instruction format, known as bytecode.
It is through the VM that executable bytecode Java classes are executed and ultimately routed to the appropriate native system calls.
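To make the class-file picture concrete, here is a minimal, self-contained Java sketch (the file name HW.class is just the example from the figure) that reads the first four bytes of a compiled class; every valid class file begins with the magic number 0xCAFEBABE, the "ca fe ba be" shown above:

import java.io.DataInputStream;
import java.io.FileInputStream;

// Reads the first four bytes of a compiled class file. Every valid
// class file starts with the magic number 0xCAFEBABE.
public class MagicCheck {
    public static void main(String[] args) throws Exception {
        try (DataInputStream in = new DataInputStream(new FileInputStream("HW.class"))) {
            System.out.printf("magic = 0x%08X%n", in.readInt()); // prints 0xCAFEBABE
        }
    }
}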
Problem:
The conventional interpretive approach resulted in significantly lower performance compared to compiled languages like C/C++, owing to the additional processor and memory usage during interpretation.
As a result, slow and space-constrained computing devices have tended not to include virtual computing technology (i.e., a JVM).
Initiatives:

JSR-30: J2ME CLDC (Connected Limited Device Configuration) Specification
Reference implementation of the J2ME CLDC, begun in April 1999; approved in August 1999
Final public release of CLDC 1.0 in May 2003

The HotSpot engine was developed to address the perception that Java virtual machine performance was insufficient for many mainstream applications. By implementing a host of performance-enhancing techniques that went beyond innovations like just-in-time (JIT) compilers, it increased the performance of the Java virtual machine by an order of magnitude.

Just-In-Time (JIT)/Dynamic Compilation:

The Just-In-Time (JIT) compiler is a component of the Java Runtime Environment. It improves the performance of Java applications by compiling bytecodes to native machine code at run time.

[Figure: components of the Just-In-Time (JIT) compiler. Bytecodes enter the JVM's JIT compiler, which consists of an Intermediate Representation Generator, an Optimizer, and a Code Generator, supported by a Profiler and runtime services such as GC.]

Just-In-Time (JIT)/Dynamic Compilation (Contd.):

JIT Compilation Strategies:

With a JIT compiler, Java programs are compiled one block of code at a time as they execute into the native processor's instructions to achieve higher performance.

The process involves generating an internal representation of a method that is different from bytecodes but at a higher level than the target processor's native instructions.

The compiler then performs optimizations to improve quality and efficiency, and finally a code-generation step translates the optimized internal representation into the target processor's native instructions.

To avoid the overhead of compiling and optimizing all of an application's classes at once, a number of incremental compilation strategies have evolved.

The general strategy of compiling only the hot parts of an application will often result in only a small percentage of the application being compiled, thus keeping compilation time and memory usage low.
Just-In-Time (JIT)/Dynamic Compilation (Contd.):

A Java class that has been loaded into memory by the VM contains a V-table (virtual table), which is a list of the addresses of all the methods in the class.

[Figure: a V-table with entries Method 1 through Method 4, each entry pointing to that method's bytecode.]

Each address in the V-table points to the executable bytecode for that particular method.

Just-In-Time (JIT)/Dynamic Compilation (Contd.):

When the JIT is loaded, each bytecode address in the V-table is replaced with the address of the JIT compiler itself.

[Figure: the same V-table, with the entries for Methods 1 through 5 now pointing to the Just-In-Time Compiler.]

When the VM calls a method through the address in the V-table, the JIT compiler is executed instead.

Just-In-Time (JIT)/Dynamic Compilation (Contd.):

The JIT compiler steps in, compiles the Java bytecode into native code, and then patches the native-code address back into the V-table.

[Figure: the V-table after compilation of Method 5 — its entry now points to Method 5's native code rather than to the JIT compiler.]

From now on, each call to the method results in a call to the native version.
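The patching mechanism can be modeled with a toy sketch in plain Java; this is not Dalvik or HotSpot internals, and all names are illustrative. Each V-table slot starts out pointing at a compile stub; the first call "compiles" the method and patches the slot so later calls go straight to the compiled version:

import java.util.function.IntUnaryOperator;

// Toy model of V-table patching: every slot starts out pointing at a
// "compile stub"; the first call compiles the method and patches the
// slot so that later calls hit the compiled version directly.
public class VTableDemo {
    private final IntUnaryOperator[] vtable = new IntUnaryOperator[1];

    VTableDemo() {
        // Slot 0 initially routes to the JIT stub.
        vtable[0] = x -> compileAndPatch(0, x);
    }

    private int compileAndPatch(int slot, int arg) {
        System.out.println("JIT: compiling method in slot " + slot);
        IntUnaryOperator compiled = x -> x * x;   // the "native" version
        vtable[slot] = compiled;                  // patch the V-table
        return compiled.applyAsInt(arg);          // run it this time too
    }

    int call(int slot, int arg) {
        return vtable[slot].applyAsInt(arg);
    }

    public static void main(String[] args) {
        VTableDemo vm = new VTableDemo();
        System.out.println(vm.call(0, 5));  // triggers compilation, prints 25
        System.out.println(vm.call(0, 6));  // direct "native" call, prints 36
    }
}

The first call prints the compilation message and 25; the second prints 36 with no compilation, mirroring the patched V-table.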

JIT Design:

Challenges (the price of platform neutrality):

The time it takes to compile the code is added to the program's running time. A JIT typically causes a slight delay in the initial execution of an application, due to the time taken to load and compile the bytecode.

Optimizations:

Modern JIT compilers take one of two approaches:
1. Compile all the code, but without performing any expensive analyses or transformations, so that the code is generated quickly.
2. Devote compilation resources to only a small number of methods that execute frequently.
In either case, interpretation and JIT compilation are combined: the application code is initially interpreted, but the JVM monitors which sequences of bytecode are frequently executed and translates them to machine code for direct execution on the hardware.

JIT Design (Contd.):

There are four reasons why a JIT for the complete bytecode set was not implemented, making the combined use of interpreter and JIT unavoidable:

1. If thread context switching had to be performed while executing generated native code, this would have added complexity to code generation, runtime support, and the base VM code. By performing context switching only in the interpreter, no changes were needed to the way thread scheduling was done in the VM.

2. The generated machine code would have needed to be more rigorous in the way it dealt with error conditions and other exceptional conditions. As it is, the machine code only needs to detect error conditions; when one occurs, the error-handling bytecodes are executed by the interpreter, which deals with the details of how the error should be processed.

3. A complete JIT would have required more complicated interactions between the generated machine code and the virtual machine as a whole. For example, the generated machine code could cause the compiler, class loader, garbage collector, or native code to run. In retrospect some of these restrictions were not strictly necessary, but the system probably has fewer undiscovered bugs as a result, and it does not seem to have limited the performance of the type of compute-intensive applications for which the JIT was intended.

JIT Design (Contd.):

4. A debugging technique (discussed below) was used which could not have been employed so easily with a complete JIT.

Therefore the system was designed to allow execution to pass from the compiled code to the interpreter at any time, and for the interpreter to be able to return to generated code in a timely fashion. Additionally, to keep the interpreter from getting trapped in a long loop of bytecodes, it was necessary to be able to return to compiled code in the middle of a method as well as at the start.

The JIT lets the interpreter deal with complex tasks such as class loading, exception handling, synchronization, and garbage collection.

The basic interpreter loop is as follows (see the sketch below):

Start:
    Try to enter compiled code.
    Interpret the next bytecode.
    goto Start.
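A toy Java rendering of this dispatch loop (the stubs and names are illustrative, not real VM code):

// Toy dispatch loop matching the description above: at the top of each
// iteration the interpreter first tries to enter compiled code for the
// current position, and otherwise interprets one bytecode.
public class InterpLoop {
    interface Compiled { int run(); }

    static Compiled lookupCompiled(int pc) { return null; } // stub: nothing compiled yet
    static int interpretOne(int pc) { return pc + 1; }      // stub: execute one bytecode

    static int execute(int pc, int end) {
        while (pc < end) {                  // Start:
            Compiled code = lookupCompiled(pc);
            if (code != null) {
                return code.run();          // try to enter compiled code
            }
            pc = interpretOne(pc);          // interpret the next bytecode
        }                                   // goto Start.
        return pc;
    }

    public static void main(String[] args) {
        System.out.println(execute(0, 5));  // interprets 5 bytecodes, prints 5
    }
}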
If the current method has not been compiled, checks are performed to determine whether it can be.

JIT Design (Contd.):

Compilation may not be possible for one of the following reasons:
1. A native function was called.
2. The method has more than a certain number of parameters or local variables, or is unusually large.
3. There is no memory available for more compiled code.
4. An object could not be created without running the garbage collector.
5. An operation was attempted that required a class to be initialized.
6. The start of an exception handler was reached.
7. An exception or error occurred. The interpreter always processes these.
8. A part of a method was reached for which no corresponding machine code could be generated.
9. A function was called for which there was no compiled code.
10. A method return was executed, but there was no compiled code to return to because the code buffer had been flushed.

JIT Design (Contd.):

1. The JVM interprets a method until its call count exceeds a JIT threshold.
2. After a method is compiled, its call count is reset to zero; subsequent calls to the method continue to increment its count.
3. When the call count of a method reaches a JIT recompilation threshold, the JIT compiles it a second time, this time applying a larger selection of optimizations than on the previous compilation (because the method has proven to be a significant part of the whole program). A sketch of this tiered scheme follows.
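A minimal sketch of the tiered counting scheme, assuming illustrative threshold values and method names (real VMs use tuned, implementation-specific thresholds):

import java.util.HashMap;
import java.util.Map;

// Toy tracker (not real JVM code): interpret until a method's call count
// crosses JIT_THRESHOLD, compile it and reset the count, then recompile
// with heavier optimization when the count crosses RECOMPILE_THRESHOLD.
public class HotnessTracker {
    static final int JIT_THRESHOLD = 10;        // illustrative values,
    static final int RECOMPILE_THRESHOLD = 10;  // not a real VM's tuning

    private final Map<String, Integer> counts = new HashMap<>();
    private final Map<String, Integer> tier = new HashMap<>(); // 0=interp, 1=jit, 2=opt

    String onCall(String method) {
        int t = tier.getOrDefault(method, 0);
        int c = counts.merge(method, 1, Integer::sum);
        if (t == 0 && c >= JIT_THRESHOLD) {
            tier.put(method, 1);
            counts.put(method, 0);               // count is reset after compiling
            return "compile " + method;
        }
        if (t == 1 && c >= RECOMPILE_THRESHOLD) {
            tier.put(method, 2);
            return "recompile " + method + " with more optimizations";
        }
        return (t == 0 ? "interpret " : "run compiled ") + method;
    }

    public static void main(String[] args) {
        HotnessTracker tracker = new HotnessTracker();
        for (int i = 0; i < 22; i++) {
            System.out.println(tracker.onCall("getValue"));
        }
    }
}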

[Figure: the V-table again — each method's entry points either to that method's bytecode or to the Just-In-Time Compiler.]

JIT Design (Contd.):

[Figure: interpreter vs. JIT execution. With JIT=OFF, .class files are simply interpreted by the JVM. With JIT=ON and Threshold=10, a method called fewer than 10 times is interpreted, while a method called 10 or more times is compiled to native code that runs directly on the operating system.]

Dalvik JIT:

Dalvik Execution Environment:
1. Register-based architecture (register machine)
   Stack-based machines (JVMs) must use instructions to load data onto the stack and manipulate that data, and thus require more instructions than register machines (see the sketch after this list).
2. Very compact representation
   Java bytecode is converted into an alternate instruction set used by the Dalvik VM.
   dx is a tool used to convert some (but not all) Java .class files into the .dex format.
3. Emphasis on code/data sharing to reduce memory usage
   Multiple classes are included in a single .dex file.
4. Highly-tuned, very fast Dalvik interpreter (roughly 2x a comparable interpreter), good enough for most applications.
   For compute-intensive applications, the Native Development Kit was released to allow Dalvik applications to call out to statically-compiled (native) methods.
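To make point 1 concrete: the statement int c = a + b; takes four JVM stack instructions (the same opcodes that appear in the bytecode listing later in this document) but a single Dalvik register instruction; the register numbers here are illustrative:

    JVM (stack machine)        Dalvik (register machine)
    -------------------        -------------------------
    iload_1                    add-int v3, v1, v2
    iload_2
    iadd
    istore_3

Four stack operations collapse into one three-register instruction, at the cost of wider individual instructions.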

Dalvik JIT (Contd.):

The other part of the solution is the Dalvik JIT, which translates bytecode to optimized native code at run time. Two models were considered:
1. Method Compiler
2. Trace Compiler

1. Method Compiler
- Most common model for server JITs
- Interprets with profiling to detect hot methods
- Compiles & optimizes method-sized chunks
- Strengths:
  Larger optimization window
  Machine state synchronized with the interpreter only at method-call boundaries
- Weaknesses:
  Cold code within hot methods gets compiled
  Much higher memory usage during compilation & optimization
  Longer delay between the point at which a method goes hot and the point at which the compiled and optimized method delivers benefits

Dalvik JIT (Contd.):

2. Trace Compiler
- Most common model for low-level code-migration systems
- Interprets with profiling to identify hot execution paths
- Compiled fragments are chained together in a translation cache
- Strengths:
  Only the hottest of hot code is compiled, minimizing memory usage
  Tight integration with the interpreter allows focus on common cases
  Very rapid return of performance boost once hotness is detected
- Weaknesses:
  Smaller optimization window limits peak gain
  More frequent state synchronization with the interpreter
  Difficult to share the translation cache across processes

Dalvik JIT (Contd.):

Method vs. Trace:

Method JIT: best optimization window
Trace JIT: best speed/space tradeoff

Full program: 4,695,780 bytes
Hot methods: 396,230 bytes (8% of the program)
Hot traces: 26% of the hot methods, 2% of the program

Dalvik JIT (Contd.):

The provisional decision was to start with traces, for the following reasons:
- Minimizing memory usage is critical for mobile devices
- It is important to deliver the performance boost quickly: the user might give up on a new app if we wait too long to JIT
- It leaves open the possibility of supplementing with a method-based JIT; the two styles can co-exist. A mobile device looks more like a server when it's plugged in, giving the best of both worlds: trace JIT when running on battery, method JIT in the background while charging.

The Dalvik JIT can be considered an extension of the interpreter, because it is the interpreter that profiles and triggers trace-selection mode when a potential trace head goes hot.

Dalvik JIT (Contd.):

Dalvik Trace JIT Flow:

[Flowchart: starting from Start, the interpreter runs until the next potential trace head. If a translation for that location already exists in the translation cache, the chained translation (with exits 0 and 1) is executed. Otherwise, the profile count for this location is updated; if the count has not reached the threshold, interpretation continues. Once the threshold is reached, the interpreter builds a trace request and submits a compilation request to the compiler thread, which compiles it and installs the new translation into the translation cache.]

Dalvik JIT (Contd.):

Features:
- The trace request is built during interpretation
  - Allows access to actual run-time values
  - Ensures that a trace only includes bytecodes that have successfully executed at least once (useful for some optimizations)
- Trace requests are handed off to a compiler thread, which compiles and optimizes them into native code
- Compiled traces are chained together in the translation cache
- Per-process translation caches (sharing only within security sandboxes)
- Simple traces, generally 1 to 2 basic blocks long
- Local optimizations:
  - Register promotion
  - Load/store elimination
  - Redundant null-check elimination
  - Heuristic scheduling
- Loop optimizations (see the invariant-code-motion sketch after this list):
  - Simple loop detection
  - Invariant code motion
  - Induction variable optimization
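As an illustration of invariant code motion, one of the loop optimizations listed above, here is a before/after sketch in Java (the method and variable names are made up); the JIT performs this transformation on traces, but the effect is the same as this source-level rewrite:

// Illustrative before/after of loop-invariant code motion.
public class InvariantMotion {
    static int sumScaledNaive(int[] a, int base, int offset) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * (base + offset);   // invariant recomputed every iteration
        }
        return sum;
    }

    static int sumScaledHoisted(int[] a, int base, int offset) {
        int k = base + offset;               // invariant hoisted out of the loop
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * k;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        System.out.println(sumScaledNaive(a, 2, 3));    // 30
        System.out.println(sumScaledHoisted(a, 2, 3));  // 30
    }
}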

JIT Compiler:

JIT Compiler Work Flow:
In order to execute bytecode, the JIT compiler goes through three stages:
1. Baseline: generates code that is "obviously correct".
   The process involves generating an internal representation of the Java code that is different from bytecodes but at a higher level than the target processor's native instructions: the Intermediate Representation (IR).
   The IR allows more effective machine-specific optimizations.
2. Optimizing: applies a set of optimizations to a class when it is loaded at run time.
3. Adaptive: methods are compiled with a non-optimizing compiler first, and hot methods are then selected for recompilation based on run-time profiling information.

A key part of the JIT design was to split the compilation process into two passes. The first pass transforms the standard, stack-based bytecodes into a simple 3-address intermediate representation in which all temporary statement results are placed into new local variables instead of entries on an evaluation stack. The second pass converts this three-address form into native machine code.

Intermediate Representation:

An IR instruction is an N-tuple consisting of an operator and some number of operands.

The Intermediate Representation is a machine- and language-independent version of the original source code.

An operator is the instruction to perform.

Operands are used to represent symbolic registers, physical registers, memory locations, constants, branch targets, method signatures, types, etc.

An IR must be convenient to translate into real assembly code for all desired target machines.

Intermediate Representation (contd.):

Three-Address Code (TAC or 3AC):
1. Three-address code is a form of intermediate representation (IR) used by compilers to aid in the implementation of code-improving transformations.
2. Each instruction in three-address code can be described as a 4-tuple (operator, operand1, operand2, result), written as

   result := operand1 operator operand2

   such as

   x := y + z

3. Expressions containing more than one fundamental operation, such as

   p = x + y * z

   are not representable in three-address code as a single instruction. Instead, they are decomposed into an equivalent series of instructions, such as

   t1 := y * z
   p := x + t1

The key features of three-address code are that every instruction implements exactly one fundamental operation, and that the source and destination may refer to any program variable or compiler-generated temporary.
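A small Java sketch of the 4-tuple representation itself, decomposing p = x + y * z as above (the Tac record and its fields are illustrative, not a real compiler's data structure):

// Represents a three-address instruction as the 4-tuple
// (operator, operand1, operand2, result) described above.
public class TacDemo {
    record Tac(String op, String a, String b, String result) {
        @Override public String toString() {
            return result + " := " + a + " " + op + " " + b;
        }
    }

    public static void main(String[] args) {
        // p = x + y * z decomposed into two TAC instructions:
        Tac t1 = new Tac("*", "y", "z", "t1");
        Tac t2 = new Tac("+", "x", "t1", "p");
        System.out.println(t1); // t1 := y * z
        System.out.println(t2); // p := x + t1
    }
}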

Intermediate Representation (contd.):

Static Single Assignment form (SSA):
1. A refinement of three-address code and a property of an intermediate representation (IR) which says that each variable is assigned exactly once.
2. Existing variables in the original IR are split into versions; the new variables are typically indicated by the original name with a subscript in textbooks, so that every definition gets its own version.

Benefits (by example). Consider this TAC:

y := 1
y := 2
x := y

1. Humans can see that the first assignment is not necessary.
2. The value of y being used in the third line comes from the second assignment of y.

A program would have to perform reaching-definition analysis to determine these facts before making such optimizations. With SSA, both are immediate, since y1 is defined but never used, and omitting it won't affect any other part of the code:

y1 := 1
y2 := 2
x := y2
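A minimal sketch of SSA-style renaming for straight-line code (real SSA construction also inserts phi functions at control-flow joins, which this toy omits; all names are illustrative):

import java.util.HashMap;
import java.util.Map;

// Each assignment to a variable creates a fresh version (y -> y1, y2, ...);
// uses always refer to the latest version.
public class SsaRename {
    private final Map<String, Integer> version = new HashMap<>();

    String define(String var) {                  // new version on assignment
        int v = version.merge(var, 1, Integer::sum);
        return var + v;
    }

    String use(String var) {                     // uses see the latest version
        return var + version.getOrDefault(var, 0);
    }

    public static void main(String[] args) {
        SsaRename ssa = new SsaRename();
        System.out.println(ssa.define("y") + " := 1");                // y1 := 1
        System.out.println(ssa.define("y") + " := 2");                // y2 := 2
        System.out.println(ssa.define("x") + " := " + ssa.use("y"));  // x1 := y2
    }
}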

Intermediate Representation (contd.):

Three levels of IR:

[Figure: a spectrum of IR levels, ranging from bytecode at the high end down to machine code at the low end.]

1. IRs that are close to a high-level language are called high-level IRs, and IRs that are close to assembly are called low-level IRs.
2. A high-level IR might preserve things like array subscripts or field accesses, whereas a low-level IR converts those into explicit addresses and offsets.

Original:
    float a[10][20];
    ... a[i][j+2] ...

HIR:
    t1 = a[i, j+2]

MIR:
    t1 = j + 2
    t2 = i * 20
    t3 = t1 + t2
    t4 = 4 * t3
    t5 = addr a
    t6 = t5 + t4
    t7 = *t6

LIR:
    r1 = [fp-4]
    r2 = r1 + 2
    r3 = [fp-8]
    r4 = r3 * 20
    r5 = r4 + r2
    r6 = 4 * r5
    r7 = fp - 216
    f1 = [r7+r6]

Intermediate Representation (contd.):

1. HIR (High-Level IR)
a) IR that is closer to the high-level language (operators similar to Java bytecode)
b) Usually preserves information such as loop structure and if-then-else statements
c) Operates on symbolic registers instead of an implicit stack

HIR Generation:

class AdditionMethodTest {
    public static void main(String args[]) {
        int a = 3;
        int b = 4;
        int c = a + b;
        int d = getValue(c);
        return;
    } // End method main

    public static int getValue(int var) {
        return var * var;
    } // End method getValue
}

Java Code

Method void main(java.lang.String[])
 0 iconst_3
 1 istore_1
 2 iconst_4
 3 istore_2
 4 iload_1
 5 iload_2
 6 iadd
 7 istore_3
 8 iload_3
 9 invokestatic #2 <Method int getValue(int)>
12 istore 4
14 return

Method int getValue(int)
 0 iload_0
 1 iload_0
 2 imul
 3 ireturn

Bytecode

Intermediate Representation (contd.):

Conversion from Java bytecode to HIR:
The compiler that performs this conversion contains two parts:
1. The BC2IR algorithm, which translates bytecode to HIR and performs on-the-fly optimizations during translation.
2. Additional optimizations performed on the HIR after translation.

BC2IR translation:
1. Discovers extended basic blocks
2. Constructs an exception table for the method
3. Creates HIR instructions for bytecodes
4. Performs on-the-fly optimizations:
   a) Copy propagation
   b) Constant propagation
   c) Register renaming for local variables
   d) Dead-code elimination
   e) In-lining of short final or static methods

Note: even though these optimizations are performed in later phases as well, doing them here reduces the size of the HIR generated and thus the compile time.

Intermediate Representation (contd.):

Example of on-the-fly optimization:

Java:
    y = x + 5

Bytecode:
    iload x
    iconst 5
    iadd
    istore y

Generated IR (optimization off):
    INT_ADD tint, xint, 5
    INT_MOVE yint, tint

Generated IR (optimization on):
    INT_ADD yint, xint, 5

The copy propagation algorithm can be seen at work here: the temporary tint is eliminated. (See the sketch below.)
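A sketch of this copy propagation over a tiny 4-tuple IR (the Insn record and opcode strings mimic the dump above but are illustrative): when an INT_MOVE merely copies the temporary defined by the immediately preceding instruction, the definition is retargeted and the move dropped. This is only safe when the temporary has no other uses, which the sketch assumes:

import java.util.ArrayList;
import java.util.List;

public class CopyProp {
    record Insn(String op, String a, String b, String result) {
        @Override public String toString() {
            StringBuilder s = new StringBuilder(op + " " + result);
            if (a != null) s.append(", ").append(a);
            if (b != null) s.append(", ").append(b);
            return s.toString();
        }
    }

    static List<Insn> propagate(List<Insn> in) {
        List<Insn> out = new ArrayList<>();
        for (Insn insn : in) {
            Insn prev = out.isEmpty() ? null : out.get(out.size() - 1);
            if (insn.op().equals("INT_MOVE") && prev != null
                    && prev.result().equals(insn.a())) {
                // Retarget the defining instruction; drop the move.
                // (Assumes the temporary has no other uses.)
                out.set(out.size() - 1,
                        new Insn(prev.op(), prev.a(), prev.b(), insn.result()));
            } else {
                out.add(insn);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Insn> ir = List.of(
                new Insn("INT_ADD", "xint", "5", "tint"),
                new Insn("INT_MOVE", "tint", null, "yint"));
        propagate(ir).forEach(System.out::println); // INT_ADD yint, xint, 5
    }
}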

Intermediate Representation (contd.):

The HIR generated for AdditionMethodTest.java (note that 3 + 4 has already been folded to 7 by the on-the-fly optimizations):

***** START OF IR DUMP Initial HIR FOR AdditionMethodTest.main ([Ljava/lang/String;)V *****
LABEL0          Frequency: 0.0
EG ir_prologue  l0i([Ljava/lang/String;,d) =
int_move        l1i(B) = 3
int_move        l2i(B) = 4
int_move        l3i(B) = 7
EG call         l5i(I) AF CF OF PF SF ZF = 66668, static "AdditionMethodTest.getValue (I)I", <unused>
return          <unused>
bbend           BB0 (ENTRY)
***** END OF IR DUMP Initial HIR FOR AdditionMethodTest.main ([Ljava/lang/String;)V *****

***** START OF IR DUMP Initial HIR FOR AdditionMethodTest.getValue (I)I *****
-13  LABEL0          Frequency: 0.0
-2   EG ir_prologue  l0i(I,d) =
2    int_mul         t2i(I) = l0i(I,d), l0i(I,d)
3    int_move        t1i(I) = t2i(I)
-3   return          t1i(I)
-1   bbend           BB0 (ENTRY)
***** END OF IR DUMP Initial HIR FOR AdditionMethodTest.getValue (I)I *****

Intermediate Representation (contd.):

Optimizations for HIR:
The following optimizers are provided for basic optimization (a constant-folding sketch follows):
1. CF   // Constant Folding
2. CPF  // Constant Propagation and Folding (triggered by the propagation)
3. CSE  // Common Sub-expression Elimination (within basic blocks)
4. DCE  // Dead Code Elimination
5. GT   // Global Variable Temporalization (within basic blocks)

The optimizers CF and GT do not require data-flow analysis; however, CPF, CSE, and DCE require some results of data-flow analysis.
A complete description is available at
http://www.coins-project.org/international/COINSdoc.en/hiropt/hiropt.html
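As a sketch of the simplest of these, constant folding (CF), over the textual TAC form used earlier; this is a toy, not the COINS implementation:

// If both operands are integer literals, the instruction is replaced by
// a move of the computed constant; otherwise it is left unchanged.
public class ConstantFold {
    static String fold(String op, String a, String b, String result) {
        try {
            int x = Integer.parseInt(a), y = Integer.parseInt(b);
            int v = switch (op) {
                case "+" -> x + y;
                case "-" -> x - y;
                case "*" -> x * y;
                default  -> throw new NumberFormatException("unknown op");
            };
            return result + " := " + v;                       // folded
        } catch (NumberFormatException e) {
            return result + " := " + a + " " + op + " " + b;  // unchanged
        }
    }

    public static void main(String[] args) {
        System.out.println(fold("+", "3", "4", "c"));  // c := 7
        System.out.println(fold("+", "a", "b", "c"));  // c := a + b
    }
}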

Intermediate Representation (contd.):

2. Medium-Level IRs (MIR)
a) Support a range of features in a set of source languages, but in a language-independent way.
b) A good basis for generation of efficient machine code for one or more architectures. Example: register-transfer languages.
3. Low-Level IRs (LIR)
a) Almost one-to-one correspondence to target-machine instructions: quite architecture-dependent.
<MIR & LIR to be added>

Optimization Techniques:

Why optimization?
1. Programmers do not always write optimal code.
   a) For example, ways to improve code are not always recognized (e.g., moving loop-invariant code out of loops, avoiding recomputation of the same expression).
2. A high-level language may not allow a programmer to avoid redundant computation (or may make it inconvenient), e.g.:
   a[i][j] = a[i][j] + 1
3. The programmer should not be bothered with the target machine architecture. Moreover, modern machine architectures assume optimization; it has become hard to optimize by hand.

Goal:
Let programmers write clean, high-level source code, and produce programs that approach assembly-code performance.

Optimization: the transformation of a program P into a program P' that has the same input/output behavior but is somehow better. "Better" might mean:
faster, or
smaller, or
uses less power, or ...

Optimization Techniques:
1. In-lining (also at lower levels)
2. Specialization
3. Constant folding
4. Constant propagation
5. Value numbering
6. Dead code elimination
7. Loop-invariant code motion
8. Common sub-expression elimination
9. Strength reduction (see the sketch below)
10. Branch prediction/optimization
11. Register allocation
12. Loop unrolling
13. Cache optimization
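To close with one worked example from this list, here is a before/after sketch of strength reduction in Java (names illustrative): the multiplication i * 4 inside the loop is replaced by repeated addition on an induction variable:

import java.util.Arrays;

public class StrengthReduction {
    static void before(int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = i * 4;              // multiply on every iteration
        }
    }

    static void after(int[] out) {
        int t = 0;                       // induction variable
        for (int i = 0; i < out.length; i++) {
            out[i] = t;                  // multiplication reduced to addition
            t += 4;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[4], b = new int[4];
        before(a);
        after(b);
        System.out.println(Arrays.toString(a)); // [0, 4, 8, 12]
        System.out.println(Arrays.toString(b)); // [0, 4, 8, 12]
    }
}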
