
A Compiler and Runtime Infrastructure for Automatic Program Distribution∗

Roxana E. Diaconescu                          Lei Wang, Zachary Mouri, Matt Chu
California Institute of Technology            University of California, Irvine
Center for Advanced Computing Research        Department of Computer Science
[email protected]                      {leiw, chum, zmouri}@uci.edu

Abstract

This paper presents the design and the implementation of a compiler and runtime infrastructure for automatic program distribution. We are building a research infrastructure that enables experimentation with various program partitioning and mapping strategies and the study of automatic distribution's effect on resource consumption (e.g., CPU, memory, communication). Since many optimization techniques are faced with conflicting optimization targets (e.g., memory and communication), we believe that it is important to be able to study their interaction.

We present a set of techniques that enable flexible resource modeling and program distribution. These are: dependence analysis, weighted graph partitioning, code and communication generation, and profiling. We have developed these ideas in the context of the Java language. We present in detail the design and implementation of each of the techniques as part of our compiler and runtime infrastructure. Then, we evaluate our design and present preliminary experimental data for each component, as well as for the entire system.

1. Introduction

There are important potential benefits of automatic over manual program distribution, such as correctness, increased productivity, adaptive execution, and concurrency exploitation. This paper describes a new approach to automatic program distribution. In contrast with previous work, instead of considering a particular class of programs and optimization targets, we consider general-purpose programs and study multiple optimization targets. Our system accepts a monolithic program and transforms it into multiple communicating parts in networked systems.

∗ Parts of this research were funded under ONR award N00014-01-1-0854 and NSF award CNS-0205712.

1.1. Possible Uses

Our approach places high emphasis on the generality of the distribution strategy and the ability to build an abstract model of the execution environment. Then, the distribution strategy can be specialized to concrete environments. We recognize that this approach may not be suitable for all computations. Many programs may not need distribution at all.

In some cases, however, automatic distribution is crucial. New technologies such as pervasive computing require that applications connect from any device, over any network, using any style of interface. Mobile computing requires that mobile code is deployed over heterogeneous networks of sometimes resource-constrained devices. If there are not enough resources available to accommodate a given program on a single computing node, the promises of these technologies cannot be delivered. In this context, automatic distribution can help with increased accessibility, resource sharing, and load balancing.

Another broad class of data-intensive applications relies on networked systems to process their data concurrently. Such applications range from inherently concurrent applications like image processing, universe exploration, and computer-supported cooperative work, to loosely concurrent applications such as fluid mechanics in avionics and marine structures. In this context, automatic distribution can help with exploiting concurrency, reducing the execution time, and increasing scalability.

Our specific technical contributions relative to previous systems with similar goals are:

• A set of techniques for a novel approach to automatic program distribution. These techniques are: object dependence graph construction, general graph partitioning, automatic communication generation, and automatically distributed program execution.

• An original compiler and runtime infrastructure that implements all the above techniques to allow flexible program distribution based on program access pattern, resource requirements, and resource availability.
[Figure 1. The distributed compiler and runtime infrastructure: the Joeq front-end (loader, bytecode decoder to quads) turns Java bytecode into bytecode and quad IRs; the compiler's analyses build the object dependence graph (ODG) and partition it into object partitions P1...Pn with dependence information; code and communication generation produces each partition plus communication calls; the back-end (linker, interpreter, runtime library, communication library and runtime support) yields executable code sections; execution monitoring and program profiling feed scheduling and adaptive repartitioning during distributed execution.]

1.2. Basic Approach

Our compiler and runtime infrastructure is depicted in Figure 1. This system transforms sequential Java programs into distributed programs. Moreover, the system attempts to model the resources needed by the sequential program and distribute the program based on the resource availability in a networked system. To this effect, the system performs the following transformations:

1. The front-end transforms Java bytecode into the intermediate representation using the Joeq front-end [22]. Joeq provides us with two intermediate representations: bytecode and quad. The latter is a quadruple-style IR which resembles register-based representations.

2. The system uses our static analysis framework to approximate the object dependence graph for a program and model its resource requirements.

3. The system partitions the object dependence graph using a Java wrapper of the Metis graph partitioning tool [14].

4. The system uses bytecode rewriting to insert communication calls for remote dependences in the partitioned program. Also, the system uses a bottom-up rewrite system to generate target code for the various platforms making up the networked configuration. For better resource utilization, in the future we plan to use native execution rather than Java Virtual Machine (JVM) hosted execution on (possibly resource-constrained) devices.

5. The system monitors the program execution and collects a set of statistics about resource usage. We use this information to gain insight into static partitioning. In the future we plan to use this information to perform adaptive repartitioning.

The rest of the paper is organized as follows. Section 2 describes the object dependence graph construction. Section 3 explains how we use graph partitioning to model resources and split the program into multiple pieces. Section 4 describes in detail our approach to code and communication generation. Section 5 presents the implementation of a runtime system that allows automatic distributed execution. Section 6 presents the design and implementation of a mixed instrumentation and sampling profiler that monitors programs during execution. Section 7 discusses an initial evaluation of the techniques we introduce in this paper. Section 8 reviews related research and contrasts our effort with previous approaches with similar goals. Section 9 concludes the paper and outlines future research directions.

2. Dependence Graph Construction

The first transformation our system performs is to create the dependence graph of the program. This graph depicts the dependences between program objects and serves as the input for the resource modeling and graph partitioning phase. We use a concrete example to illustrate the dependence graph construction.

2.1. An Example

Figure 2 shows an example of a Java program that we use throughout the paper. In our example, there are two classes. The Account class describes a bank account with a unique identifier, holder name, checking, savings, and loan. The Bank class describes a banking institution with a unique identifier, name, number of customers, and a list (java.lang.Vector) of their actual accounts.
public class Account {
}

public class Bank {

    protected Bank(String name, int numCustomers, int initialBalance) {
        ...;
        initializeAccounts(initialBalance);
    }

    private void initializeAccounts(int initialBalance) {
        while (numCustomers > 0) {
            ...;
            Account a = new Account(i, n, s, c);
            accounts.add(a);
            numCustomers--;
        }
    }

    public void openAccount(Account a) {
        accounts.add(a);
    }

    public boolean withdraw(int customerID, int amount) {
        if (...) {
            this.getCustomer(customerID).setBalance(
                this.getCustomer(customerID).getBalance() - amount);
            return true;
        } else return false;
    }

    public static void main(String[] args) {
        ...;
        Bank merchants = new Bank("Merchants", 100, 10000);
        Account a4 = new Account(1, "ABC Market", 1000000, 100000, 20000000);
        Account a5 = new Account(2, "CDE Outlet", 5000000, 300000, 150000000);
        merchants.openAccount(a4);
        merchants.openAccount(a5);
        ...;
        Account a = merchants.getCustomer(2);
        merchants.withdraw(a.getId(), 900);
    }
}

Figure 2. An example of a Java program.

The Bank class initializes a number of Account structures for its clients. On an openAccount event, an Account reference is passed to the Bank object and added to the existing accounts list. The Bank.withdraw(...) method reduces the balance by the amount withdrawn. The main method creates instances of a bank and various types of accounts that are opened and operated on. Our analysis is targeted toward finding these instances and their dependences.

We have implemented an improved version of Spiegel's algorithm [20] (for detailed contrast see [2]). We use rapid type analysis (RTA) to compute the call graph and the program types. Then, for each method in the graph, we compute the class relations by looking at field access and method call statements. A usage relation between two classes occurs when one class calls methods or accesses fields of another class. Export or import relations occur when new types may propagate from one class to another through field accesses or method calls.

Figure 3 shows the class relation graph for our example. We use the aiSee¹ tool for the visualization of the graph in the Visualising Compiler Graphs (VCG) format. The types are annotated with the ST or DT prefix to indicate static or instance (dynamic) parts of a class. The use relations tell that some classes occur in the context of other classes; their occurrence is noted by looking at the method calls, field accesses, and allocation statements. The export edge occurs due to the invocation of the openAccount method on the dynamic Bank class with an Account class as parameter. The import edge occurs due to the getCustomer invocation that returns a result of Account type.

[Figure 3. The class relation graph visualized with the aiSee tool for the VCG format.]

¹ A graph visualization tool from AbsInt. Available from http://www.absint.com/aisee.html.

Given the class relation graph and the object set, we compute the relation between the corresponding objects (class instances). For each allocation statement, we add reference relations between the instance of the class where the allocation takes place and the newly created instance. We then create new references by matching the initial object references against the export and import relations between the corresponding classes. We iterate through all object triples and propagate references, matching against the type relations, until the algorithm reaches a fix point.
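In outline, the propagation can be pictured as the sketch below. The ObjectNode and ClassNode types and the relation test are hypothetical helpers introduced only for illustration; the precise algorithm and data structures are described in [2].

import java.util.List;

// A minimal sketch of the reference-propagation fix point over object
// triples (a, b, c). ObjectNode, ClassNode, and exportsOrImports(...)
// are hypothetical helpers; see [2] for the actual algorithm.
final class ReferencePropagation {
    static void run(List<ObjectNode> objects) {
        boolean changed = true;
        while (changed) {                 // repeat until a fix point is reached
            changed = false;
            for (ObjectNode a : objects)              // all triples (a, b, c)
                for (ObjectNode b : a.referenced())
                    for (ObjectNode c : b.referenced())
                        // if the class relation graph says b's class may
                        // export c's type to a's class (or a's class imports
                        // it), then a may also reference c
                        if (exportsOrImports(b.type(), a.type(), c.type()))
                            changed |= a.addReference(c);
        }
    }

    static boolean exportsOrImports(ClassNode from, ClassNode to, ClassNode t) {
        // consult the export/import relations of the class relation graph;
        // left as a placeholder in this sketch
        return false;
    }
}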
Figure 4 shows the object graph for our example. The edges are labeled by create, use, and reference. The objects are prefixed by 1, indicating single instances (a * prefix indicates summary instances of zero or more, i.e., created inside a control structure). The reference relation is redundant and only used for intermediate processing; we can safely abandon it. The create relation means that an object creates another object. The creation relation between object pairs is propagated to discover new usage relations from the class relation graph. Therefore, after the propagation, only the usage relation should matter for the partitioning: if an object a on abstract processor Pa uses an object b on abstract processor Pb, then communication may be generated. We also show the partition number within square parentheses for a two-way partitioning transformation. For details on the actual algorithm and implementation, please refer to our technical report [2].

[Figure 4. The object dependence graph visualized with the aiSee tool for the VCG format.]

Java:
public class Example {
    int ex(int b) {
        b = 4;        // 1
        if (b > 2) {  // 2
            b++;      // 3
        }
        return b;     // 4
    }
}

Quad:
BB0 (ENTRY) (in: <none>, out: BB2)
BB2 (in: BB0 (ENTRY), out: BB3, BB4)
1 MOVE_I R1 int, IConst: 4
2 IFCMP_I IConst: 4, IConst: 2, LE, BB4
BB3 (in: BB2, out: BB4)
3 ADD_I R1 int, IConst: 4, IConst: 1
BB4 (in: BB2, BB3, out: BB1 (EXIT))
4 RETURN_I R1 int
BB1 (EXIT) (in: BB4, out: <none>)

Figure 5. Turning a Java class into quads.

3. Graph Partitioning

The next transformation our system performs is the graph partitioning. As a result, this phase assigns a virtual processor number to each object.

A multi-constraint graph partitioning gives an optimal partitioning of the object dependence graph so as to minimize the cut, and thus communication, and to account for the resource constraints of each partition.

Finding an optimal multi-way partition for large graphs is an NP-complete problem (thus, no polynomial-time algorithm for it is known). However, many heuristic-based approaches exist [3, 12]. To our knowledge, the most advanced multilevel partitioning scheme is Hendrickson et al.'s [7].

We use Metis' multi-objective, multi-constraint graph partitioning algorithms to partition the dependence graph. We model the resources for the object dependence graph as follows. Each object in the graph encapsulates data and computation. The amount of data it encapsulates characterizes the memory usage, while the amount of computation characterizes the CPU usage. The weight of a node is a vector that contains memory, CPU, and battery usage for the creation and usage of an object. An edge between two objects indicates a potential communication, if the objects were to reside in two different address spaces. The data that needs to be transferred between address spaces is the dependence data (i.e., field, method arguments, or result). The weight of an edge is the amount of data that needs to be transferred due to a dependence.

We use static approximations of resource consumption to guide the static partitioning. The static approximations can be imprecise; currently we operate under the assumption that all objects have equal weights. In the future we plan to use simple heuristics; for example, objects created inside loops can be considered "heavier" than single-instance objects.

In our current implementation we have written a Java wrapper [10] for the Metis graph partitioning tool [14]. The wrapper implementation (including visualization capabilities) is about 10,000 lines of code.
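As an illustration, the weight bookkeeping and the partitioning call can be sketched as follows. The NodeWeight fields mirror the weight vector described above; the CSR arrays and the partGraphKway entry point are assumptions modeled on Metis' C interface (METIS_PartGraphKWay), not necessarily the exact API of our wrapper [10], and the ObjectNode accessors are hypothetical.

// A sketch of feeding ODG weights to a Metis-style partitioner. The
// JMetis.partGraphKway call and the ObjectNode accessors are illustrative
// assumptions; our jMetis wrapper [10] may expose a different interface.
final class OdgResourceModel {
    // one weight vector per object: memory, CPU, and battery usage for
    // the creation and usage of the object
    static int[] nodeWeights(ObjectNode[] odg) {
        int[] vwgt = new int[odg.length * 3];
        for (int i = 0; i < odg.length; i++) {
            vwgt[3 * i]     = odg[i].memoryBytes();
            vwgt[3 * i + 1] = odg[i].cpuEstimate();
            vwgt[3 * i + 2] = odg[i].batteryEstimate();
        }
        return vwgt;
    }

    // xadj/adjncy: the ODG in compressed adjacency (CSR) form; adjwgt:
    // bytes of dependence data per edge; nparts: number of virtual
    // processors. Returns one virtual processor number per object.
    static int[] partition(int[] xadj, int[] adjncy, int[] vwgt,
                           int[] adjwgt, int nparts) {
        // multi-constraint k-way partitioning: minimize the weighted edge
        // cut (communication) while balancing each weight component
        return JMetis.partGraphKway(xadj, adjncy, vwgt, adjwgt, nparts);
    }
}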
4. Code and Communication Generation

Once each object has been assigned to a virtual processor, the program can be distributed by mapping virtual processors to actual processing units at runtime. There are two issues related to the distributed execution. First, native execution in heterogeneous environments requires retargetable code generation. Second, correct execution requires communication to satisfy the remote dependences.

To address retargetable code generation, we use the quad high-level intermediate representation to generate Abstract Syntax Trees (ASTs) and then use a bottom-up rewrite system (BURS) [18] to emit code for a range of architectures (currently x86 and StrongARM).

To address communication generation, we use the dependence and partitioning information to classify objects as local and dependent. Local objects have no dependences on objects in different address spaces. Thus, they are treated as normal objects and no communication is generated for them. Dependent objects have dependences across address spaces, and thus messages are inserted to resolve these dependences.

4.1. Retargetable Code Generation

The input for this phase is the quad intermediate representation. The result is a generated set of compilers for various target machines. An example of the quad format is listed in Figure 5, along with the Java class that was used to generate the code.
[Figure 6. A tree representation of the quads: each instruction (MOVE_I, IFCMP_I, ADD_I, RETURN_I) forms a root node whose leaves are its operands (registers, constants, types, and branch targets).]

x86:
mov eax, 4   ; 1
cmp 4, 2     ; 2a
jle BB4      ; 2b
mov eax, 4   ; 3a
add eax, 4   ; 3b
BB4:
ret eax      ; 4

StrongARM:
mov R1, #4   ; 1
cmp #4, #2   ; 2a
ble BB4      ; 2b
add R1, 4, 4 ; 3
.BB4
mov PC, R14  ; 4

Figure 7. Machine code for two separate architectures.

Abstract Syntax Tree. Once the quad source is established, the program is then turned into an Abstract Syntax Tree to act as the code generator front-end. The AST is structured such that each instruction acts as a root node, with instruction parameters represented as child leaves. The tree generator used is called ANTLR [16], a grammar parser similar to Yacc. A visual representation of this tree can be seen in Figure 6.
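For example, the first quad of Figure 5, MOVE_I R1 int, IConst: 4, becomes a tree rooted at the instruction with its parameters as leaves. The classes below are an illustrative simplification, not the ANTLR-generated ones.

import java.util.Arrays;
import java.util.List;

// An illustrative simplification of the AST shape: the instruction is the
// root node and its parameters are child leaves. The real trees are built
// by the ANTLR-based tree generator [16].
class AstNode {
    final String label;            // e.g. "MOVE_I", "R1 int", "IConst: 4"
    final List<AstNode> children;  // empty for parameter leaves

    AstNode(String label, AstNode... children) {
        this.label = label;
        this.children = Arrays.asList(children);
    }

    // MOVE_I R1 int, IConst: 4  from Figure 5
    static AstNode example() {
        return new AstNode("MOVE_I",
                new AstNode("R1 int"),
                new AstNode("IConst: 4"));
    }
}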
Because of the inherent simplicity of the quad format, it is feasible that a simple, linear parser be written from scratch and a code generator built on top of it. Though that approach may perform faster and can be more specialized to this task, using the tree allows extensibility. This allows the code generator to be used with any intermediate representation or source language, as creating a tree lets us completely abstract the source.

Bottom-Up Rewrite Generation. After obtaining the tree representation of the source, the remaining work is done through the back-end and is handled through a method called Bottom-Up Rewrite Machine Generation, or BURM. This does two passes over the incoming AST: an initial pass to find a minimum-cost traversal, followed by a second pass that emits code based on the instructions represented in each node. The specific machine generator is called JBurg, a Java-based BURG (Bottom-Up Rewrite Generator) [5] that differs from other BURM implementations in that it traverses the tree employing dynamic programming pattern matching to satisfy goals. Two examples of machine code emitted by the BURG are shown in Figure 7.

4.2. Communication Generation

To generate communication, we generate partitions off-line for 1, 2, ... nodes. This is a form of off-line rather than runtime specialization.

Each node in the object graph has a unique identifier that contains a virtual processor number. Communication is inserted only for dependent objects. That is, for each dependence relation to a remote object, two calls are generated: a send call that packs the access type and associated data, and a receive call that fetches the response. For each dependence relation from a remote object, two calls are generated: a receive call that processes the access type and associated data, and a send call that sends the results of the access back.

The dependences handled by our current implementation are object accesses, including field accesses, and method invocations. For each dependent object that is referred to from a remote node, there is a corresponding DependentObject that performs Message Passing Interface (MPI) communication with the home node of the referring object. Distributed dependences are therefore transformed into accesses to DependentObject instances.

Original byte-code:
13: aload //load Account object
14: invokevirtual Account.getSavings:()

Transformed byte-code:
13: aload //load DependentObject object
14: ldc INVOKE_METHOD_HASRETURN (int) //access type
16: ldc "getSavings" //load method name
18: aconst_null //no method argument for getSavings()
19: invokevirtual DependentObject.access
22: checkcast Integer //cast to return type
25: invokevirtual Integer.intValue //get primitive value

Figure 8. The transformation for method invocation account.getSavings();.

Figure 8 illustrates the original and transformed bytecode snippets for the method invocation account.getSavings(). The transformation for method invocations performs three tasks: prepare the arguments for the DependentObject access, prepare the arguments (in a LinkedList) for the original method call, and cast the return value (Object type) to the appropriate class type or primitive value. The transformation for field accesses is similar.
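At the source level, the rewritten bytecode of Figure 8 corresponds roughly to the sketch below; dep stands for the DependentObject that replaces the remote Account, and the signature of access(...) is inferred from the bytecode rather than quoted from the implementation.

// Rough source-level equivalent of the transformed bytecode in Figure 8.
// dep is the DependentObject standing in for the remote Account; the
// exact signature of access(...) is an assumption inferred from Figure 8.
int savings = ((Integer) dep.access(
        INVOKE_METHOD_HASRETURN,  // access type constant (int)
        "getSavings",             // name of the invoked method
        null                      // no arguments for getSavings()
    )).intValue();                // cast to return type, get primitive value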
The remote instantiation of a dependent class is translated to an instantiation of a DependentObject, which in turn will communicate via MPI messages to the home node of the dependent object. The home node will then create the object. Figure 9 demonstrates the transformation of new Account(i, n, s, c). The information passed in the MPI message for distributed instantiation comprises the class name and the arguments to the class constructor.

Original byte-code:
35: new Account
38: dup
39: iload_2 //i
40: aload_3 //n
41: iload 4 //s
43: iload 5 //c
45: invokespecial Account."<init>"

Transformed byte-code:
35: new DependentObject
38: dup
39: iload_2 //i
40: aload_3 //n
41: iload 4 //s
43: iload 5 //c
//....
// prepare the constructor arguments
//....
105: ldc 0 (int) //location of Account, Node0
107: ldc "Account" (String)
109: aload 6 //constructor arguments in a list
111: invokespecial DependentObject."<init>"

Figure 9. The transformation for new Account(i, n, s, c);

The quality of communication generation is directly influenced by the quality of dependence analysis. Our analysis is type-based and thus not very precise. More precise dependence information makes use of points-to information [19] in the context of speculative multithreading. In addition, there are several communication optimization techniques that can be applied to optimize communication generation: message aggregation, hoisting communication out of the loop, asynchronous communication, overlapping communication and computation, data replication, and early prefetch. Many of these techniques cannot be used with a request/response communication style like RPC or RMI. In contrast, we use message exchange communication to reveal more optimization opportunities.

5. Distributed Execution

The distributed target code partitions are executed within the MPI-enhanced runtime environment. Currently we use JVM-hosted execution rather than native execution. Even though the retargetable code generation component is fully implemented, it was easier to use a normal JVM since our current experiments are conducted on resource-rich x86 platforms. Also, the use of the JVM does not affect our current distributed execution evaluation (speed-up measurements).

In our current implementation, on each node there are three supporting services: the MPI service, the Execution Starter service, and the Message Exchange service. Figure 10 depicts this organization of the runtime services for distributed execution. The MPI service sets up the necessary MPI working environment, such as groups, communicators, and the communication context.

[Figure 10. The organization of runtime services for distributed execution: each code partition runs on top of a Message Exchange service and an MPI service, with an Execution Starter on the node where the user initiates the application.]

The Execution Starter service starts the application by invoking the main() method of the application class. Only one copy of Execution Starter needs to be active on the processor node in the distributed execution environment where the user initiates the application.

The core of this MPI-aware runtime support is the Message Exchange service. This service processes all the send and receive MPI communication generated from the object dependence information. The Message Exchange service uses two supporting data structures: one is the DependentObject and the other is the exchanged Message. The runtime uses the DependentObject (implemented by a Java class) to indicate an object that has dependence relations to another partition.

Each dependent object contains the following information: its class type, the identifier of the partition (node) that hosts the object, and its unique identifier in that partition (node). A message (packed in a Message structure) exchanged between two dependent objects across two nodes contains the object identifier of the receiver of the communication and the relevant dependence data. The Message Exchange service passes objects between nodes using a streamed format.

We currently identify two types of messages: NEW and DEPENDENCE, for object instantiation and data dependence. We are in the process of defining more precise dependence relations (e.g., read after write), and discriminating further between messages.
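Put together, the two supporting structures can be sketched as follows. Field and constant names are illustrative; only the contents enumerated above (class type, host partition, object identifier, and the NEW and DEPENDENCE message kinds) are taken from the design.

import java.io.Serializable;

// A sketch of the two supporting data structures of the Message Exchange
// service; names other than DependentObject and Message are assumptions.
class DependentObject {
    String className;    // class type of the remote object
    int hostPartition;   // identifier of the partition (node) hosting it
    int objectId;        // unique identifier within that partition
}

class Message implements Serializable {
    static final int NEW = 0;         // remote object instantiation
    static final int DEPENDENCE = 1;  // field access or method invocation

    int kind;        // NEW or DEPENDENCE
    int receiverId;  // object identifier of the communication's receiver
    Object payload;  // the relevant dependence data, sent in streamed form
}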
6. Profiler

We have built a profiler that collects statistics indicating the resource consumption of a program during runtime.
The profiler is built on top of the Joeq compiler and virtual machine. The profiler works either through instrumentation or sampling. Some of the metrics can be implemented using either technique; in these cases, the instrumentation is useful as a baseline for comparison of the accuracy of the sampling. There are four basic categories of runtime application behavior we are interested in: CPU, memory, battery, and communication (i.e., network) usage. To measure these four basic categories, we have currently implemented six metrics: method duration, method frequency, hot methods, hot paths, memory allocation, and dynamic call graph.

The method duration metric measures the amount of time each method took to execute. The metric was originally implemented by overloading the method invocation process of the built-in native² interpreter. The time of entry and exit of each method (both system-level and user-level) are recorded in a profiling class. Unfortunately, due to problems within Joeq itself, this metric had to be measured on our test benchmarks with Java source-level instrumentation. See Section 7.3 for details.

² "Native" in the context of Joeq means that it bootstraps itself into a fully functional JVM without the need for a host JVM to support it.

The method frequency metric measures how often each method is invoked. This metric can also be used as a less expensive substitute for the method duration metric. A counter is associated with each method that keeps track of the number of invocations. However, as with the method duration metric, source-level instrumentation had to be performed instead.
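Conceptually, the inserted instrumentation amounts to bumping a per-method counter on every entry, as in the following sketch; the Profile class and its counter map are hypothetical.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// A sketch of the per-method invocation counter added by source-level
// instrumentation; the Profile class and its counter map are hypothetical.
final class Profile {
    private static final ConcurrentHashMap<String, AtomicLong> COUNTS =
            new ConcurrentHashMap<>();

    // a call such as Profile.enter("Bank.withdraw") is inserted at the
    // entry of every instrumented method
    static void enter(String method) {
        COUNTS.computeIfAbsent(method, k -> new AtomicLong()).incrementAndGet();
    }
}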
The hot methods metric minimizes the overhead of the previous metric by using sampling. For each native thread Joeq spawns, it also attaches a separate native interrupter thread. The interrupter thread's main task is to signal the thread queue when to switch threads. This provides a convenient approach to sampling: simply pass control from the interrupter thread to the profiler at each scheduling time quantum. The profiler then obtains the currently executing method by reading the call stack of the thread and recording the top stack frame.
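In outline, the sampling step performed at each scheduling quantum looks like the sketch below. The real implementation runs inside the native interrupter thread that Joeq attaches to each spawned thread and reads Joeq's own call stacks; the use of java.lang.Thread here is only a stand-in.

import java.util.HashMap;
import java.util.Map;

// A sketch of the hot-methods sampling step; the actual code hooks Joeq's
// native interrupter thread rather than java.lang.Thread.
final class HotMethodSampler {
    private final Map<String, Long> samples = new HashMap<>();

    // invoked at each scheduling time quantum for the running thread
    void sample(Thread target) {
        StackTraceElement[] stack = target.getStackTrace();
        if (stack.length == 0) return;
        // record only the top stack frame: the currently executing method
        String top = stack[0].getClassName() + "." + stack[0].getMethodName();
        samples.merge(top, 1L, Long::sum);
    }
}

The hot paths metric described next differs only in recording the entire stack array instead of the top frame.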
The hot paths metric goes a level above the hot methods metric in its scope and measures the hottest execution paths through the application. We extend the hot methods technique, and we sample the entire call stack instead of sampling only the top stack frame.

The memory allocation metric is implemented by directly modifying the internal Java virtual machine system code of Joeq. By overloading some of the methods that implement memory allocation, we can estimate the memory profile of the application without performing instrumentation. Unfortunately, this metric is currently only a very rough approximation, but we are confident that much better accuracy will be achieved in the near future.

The dynamic call graph metric shows the methods that actually got called in a specific application instance. It was measured using sampling. It also makes use of similar data as the hot paths metric, but processes the data in a different manner to actually construct the dynamic call graph.

benchmark     size               CRG               ODG
              #C    #M    KB     #N   #E   EC      #N    #E   EC
create*       14    28    13     17    6    2     210   632   82
method*        6    35    10     12   10    2       9    32    2
crypt*         6    45    12     13   13    3      11    33    1
heapsort*      6    42    10     13   13    3      11    33    2
moldyn*        8    48    17     12   15    2       9    32    2
search*        9    57    17     14   23    3       6    20    3
cmprss**      38   295   160     36   42    1      32   107    2
db**          32   299   155     32   26    2      49   164    8

* Java Grande benchmarks: JGFCreateBench and JGFMethodBench (section 1), JGFCryptBench and JGFHeapSortBench (section 2), JGFMolDynBench and JGFSearchBench (section 3).
** SPEC JVM98 benchmarks: 201 compress and 209 db.

Table 1. The size of the benchmarks (number of classes, methods, and KB) and the sizes of the resulting graphs (the number of nodes, edges, and the edge cut for both the CRG and the ODG).

7. Evaluation

We have implemented a functional infrastructure prototype that realizes the components presented in the above sections. We evaluate the functionality and the performance of our prototype with a set of benchmarks from the Java Grande benchmark suite and SPEC JVM98 (see Table 1). In our experiments the networked configuration includes a service node, a 1.7GHz Pentium III machine (512MB RAM, SuSE 9.1), and a computation node, an 800MHz Pentium III (384MB RAM, RedHat 9.0). Both nodes run JDK 1.4. The two nodes are connected via 100M Ethernet. At the time of this publication we did not have access to other networked configurations and only experimented with the few computers we had access to. However, in the future, we plan to set up a network consisting of multiple nodes with significant differences in resources and configurations.

7.1. Dependence Graph Construction

Table 1 shows the sizes of the original benchmarks as well as the resulting class relation graph (CRG) and object dependence graph (ODG) for each benchmark. The edge cut is the number of edges that straddle partitions. Currently we use the class relation graph partitioning to distribute the program.
benchmark    construct          partition        rewrite
             CRG       ODG      CRG     ODG
create       2043      3056       7      12        271
method       1704        53       7       6        202
crypt        1715        40       7       7        209
heapsort     1615        54       6       7        193
moldyn       1903       114       6       6        215
search       1868        49       7       7        204
compress     2305       100       6       7        285
db           2434        99      10       7        280

Table 2. The execution time breakdown in code distribution (in milliseconds). The columns indicate the construction time, the partitioning time, and the bytecode rewriting time.

The execution times for graph construction and distribution transformation are shown in Table 2, in milliseconds. We can see that the static analysis of the class relations is on the order of seconds. This is because the process of extracting high-level dependence information from the low-level bytecode format is computation and time consuming. However, since this process only happens once, at compile time, it is not as crucial as the other phases in the dynamic repartitioning process: ODG construction, partitioning, and code rewriting. Of these latter phases, only partitioning has to be completely re-executed in each adaptive iteration; ODG construction and code rewriting can both be adjusted incrementally. Since the partition time is only about 10ms, we believe that the results are promising for our future plans of incorporating adaptive repartitioning. Also, the Create benchmark has an unusually long ODG construction time. This is because it creates a large number of objects, which substantially complicates the object graph.

7.2. Distributed Execution

To evaluate the performance of the distributed execution runtime, we compare the distributed execution time of the transformed benchmarks with the execution time of the original sequential benchmarks on the 800MHz Pentium III machine. The execution speedup is depicted in Figure 11. The distributed execution shows comparable or improved performance (79.2% to 175.2%) relative to the original sequential execution. The results are promising, since without any further optimization the distributed execution results in very little overhead (in Method and Compress) or in speed-up. Since we currently use a suboptimal, naive partitioning, it is expected that further performance gains will be achieved if optimization is introduced into the distribution infrastructure in our future work.

[Figure 11. Performance comparison of centralized and distributed executions.]

7.3. Profiling

We evaluate the profiler for a subset of the Java Grande Forum benchmarks. For the baseline measurements, Joeq runs each of the benchmarks with all the profiling code compiled in, but not enabled. Then each of the profilers is enabled in turn. The tests were conducted on an AMD Athlon XP 2000+ (1.67 GHz) with 512 MB RAM running Windows XP. In each of the tests, Joeq was allocated a maximum heap size of 1024 MB.

Table 3 shows the total execution times for each of the benchmarks and profilers. The average overhead for all the profilers is 21.94%. A general trend is that metrics which were measured with instrumentation incurred notably higher overhead overall than did the others, which used either sampling or modification of the JVM system code. The hot paths, dynamic call graph, and memory usage metrics all incurred about equal levels of overhead, approximately 14-20%. The most impressive result came from the hot methods metric, which at approximately 4% overhead is a very good result.
Test/Metric              Baseline   Hot       Dynamic      Hot       Method     Method      Memory
                                    Paths     Call Graph   Methods   Duration   Frequency   Usage
CreateBench (int[])        4.406      5.125      5.375       5.468      4.734      5.937       9.718
CreateBench (long[])      18.250     28.046     28.640      19.281     25.140     31.062      35.000
CreateBench (float[])      4.468      6.437      5.906       4.265      5.015      4.659       6.015
CreateBench (Object[])     2.156      2.421      2.468       2.328      2.296      2.203       2.281
CreateBench (Custom[])    10.718     12.687     12.500      11.484     11.875     11.234      51.406
MethodBench              196.187    212.140    222.359     202.281    323.437    248.156     198.937
FFTA                      32.187     37.609     40.765      33.812     35.781     36.546      34.312
HeapSortA                  3.906      4.296      4.968       4.281     17.297     14.328       3.968
MolDynA                   48.234     53.062     57.390      50.234     51.375     51.750      50.125
MonteCarloA               48.734     59.859     58.890      51.015     75.194     60.234      49.671
Total:                   369.734    421.682    439.261     384.449    552.144    466.109     441.433
Overhead:                  0.00%     14.05%     18.80%       3.98%     49.34%     26.07%      19.39%

Table 3. The profiler evaluation. Each row is an individual benchmark, while each column is the name of the profiler enabled. The last two rows give the total time it took to execute all the benchmarks and the overhead relative to the baseline. The times are given in seconds. The baseline column gives the execution times with all the profiling code compiled in but not enabled.

8. Related Work

There are two types of automatic distribution compilers or virtual machines available: automatic distribution to exploit data parallelism in scientific programs, and automatic partitioning of Java programs to relieve resources on constrained devices.

Automatic Distribution of Data Parallel Programs. Automatic parallelization is one research area that has investigated the partitioning problem, mainly for scientific programs, typically targeting a significant reduction in CPU or memory consumption [8, 13, 1, 6, 11]. There are two main differences between partitioning for scientific applications and our work. First, most of the previous work focuses on array partitioning or loop iteration partitioning for scientific programs.

We address general program distribution, where all the objects in a program are of interest. Second, the main objective for partitioning in scientific programs is to speed up execution, either on distributed or on shared memory machines. Our design choices are motivated by the ability to model multiple resources and study their interaction. Then, the general distribution can be specialized at runtime depending on resource priorities and the actual environment.

Automatic Distribution of Java Programs. JavaParty [17] extends Java with remote objects. The objective is to provide location transparency in a distributed memory environment. In contrast, we achieve the transparency effect without extending Java syntax. However, we do not give the user any control over distribution.

Messer et al.'s approach, though entirely dynamic, has an objective that more closely matches our own [15]. The goal is to transparently off-load services to relieve memory and processing constraints on resource-constrained devices. The main difference is the handling of object references. In this approach each JVM maps all other JVMs' references, and thus it results in a replicate-all strategy. Our approach is partly static, and it considers just some of the interactions between objects (those that cross processors).

Another approach, similar to the distributed shared memory paradigm, is to implement a distributed JVM as a global object space [4]. We achieve the same transparency effect at hopefully lower cost, since we distinguish between local and remote accesses.

J-Orchestra [21] transforms Java bytecode into distributed Java applications. This is also an abstract shared memory implementation. The communication is synchronous only, i.e., RMI. To exploit asynchronous communication, we use automatically generated point-to-point messages.

Pangaea [20] is a system that can distribute Java programs using arbitrary middleware (Java RMI, CORBA) to invoke objects remotely. The system is based on the original algorithm by Spiegel, which was also the basis for our own extended algorithm [2]. Pangaea's input is a centralized Java source-code program. The result is a distributed program based on the synchronous remote method invocation communication paradigm. Our approach starts from Java bytecode and targets a flexible distribution model (i.e., one that allows the exploitation of concurrency and asynchronous communication) in a program.

Coign [9] is also a system that strives to automatically partition binary programs (built from COM components) for optimal execution. Coign is designed to handle 2-way partitioning only (between two nodes) for client-server distributions. Also, the distribution is fully dynamic, based on profiling history. We combine static analysis with off-line distributions in a general, multi-way partitioning.

9. Conclusion

This paper presented the design and implementation of a research compiler and runtime infrastructure for automatic program distribution. While not all programs can benefit from automatic distribution, we believe that it is important to be able to model the resources of a program and study the effect of distribution on program behavior with respect to resource consumption. The motivating factor in our design was flexibility and modularity. Thus, we expect each of the techniques we presented to evolve as more experiments are conducted.

Our design is based on two key ideas: find the dependences between the objects in a program, and use this information to automatically generate communication. We have shown how we cast the resource modeling and program distribution problem into an optimal graph partitioning problem. We model the resources as weights on the dependence graph and then experiment with multiple resource priorities and constraints. We have presented the code generation phase as two separate parts: platform-independent code generation and communication generation.

We have also described a profiler system that allows us to collect information about the program behavior and, eventually, to be able to redistribute the program according to the actual access patterns and resource requirements. Our present infrastructure only handles static partitioning. While dynamic repartitioning is the goal of our next design iteration, it does not influence the design of the infrastructure presented in this paper.

Finally, we have presented results on each of the techniques that we have introduced. The results indicate that partitioning takes little time and the computed dependence graphs are of manageable sizes. We have also shown that without any further tuning, the distributed execution results in either a very small overhead or a speed-up. Finally, we have evaluated our profiler system in terms of the incurred overhead as well as the collected data.

References

[1] C. Ancourt and F. Irigoin. Automatic code distribution. In Proceedings of the Third Workshop on Compilers for Parallel Computers, Vienna, Austria, July 1992.

[2] R. E. Diaconescu, L. Wang, and M. Franz. Automatic distribution of Java byte-code based on dependence analysis. Technical Report No. 03-18, School of Information and Computer Science, University of California, Irvine, October 2003.

[3] S. Dutt. New faster Kernighan-Lin-type graph-partitioning algorithms. In Proceedings of the 1993 IEEE/ACM International Conference on Computer-Aided Design, pages 370–377. IEEE Computer Society Press, 1993.

[4] W. Fang, C.-L. Wang, and F. Lau. Efficient global object space support for distributed JVM on cluster. In International Conference on Parallel Processing, Vancouver, Canada, August 2002.

[5] C. W. Fraser, D. R. Hanson, and T. A. Proebsting. Engineering a simple, efficient code-generator generator. ACM Letters on Programming Languages and Systems, 1(3):213–226, 1992.

[6] M. Gupta and P. Banerjee. PARADIGM: a compiler for automatic data distribution on multicomputers. In Proceedings of the 7th International Conference on Supercomputing, pages 87–96. ACM Press, 1993.

[7] B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. In Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), page 28. ACM Press, 1995.

[8] S. Hiranandani, K. Kennedy, and C.-W. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66–80, 1992.

[9] G. C. Hunt and M. L. Scott. The Coign automatic distributed partitioning system. In Operating Systems Design and Implementation, pages 187–200, 1999.

[10] Java interface to the Metis graph partitioning and visualization tool. Available at: http://www.cacr.caltech.edu/roxana/code/jmetis.tar.gz.

[11] K. Kennedy and U. Kremer. Automatic data layout for High Performance Fortran. In Proceedings of the 1995 Conference on Supercomputing (CD-ROM), page 76. ACM Press, 1995.

[12] B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, pages 291–307, February 1970.

[13] C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4):440–451, October 1991.

[14] Metis family of multilevel partitioning algorithms. Available at: http://www-users.cs.umn.edu/~karypis/metis/.

[15] A. Messer, I. Greenberg, P. Bernadat, D. Milojicic, T. Giuli, and X. Gu. Towards a distributed platform for resource-constrained devices. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02). IEEE, July 2002.

[16] T. J. Parr and R. W. Quong. ANTLR: A predicated-LL(k) parser generator. Software Practice and Experience, 25(7):789–810, 1995.

[17] M. Philippsen and M. Zenger. JavaParty — transparent remote objects in Java. Concurrency: Practice and Experience, 9(11):1225–1242, 1997.

[18] T. A. Proebsting. BURS automata generation. ACM Transactions on Programming Languages and Systems, 17(3):461–486, 1995.

[19] R. Rugina and M. Rinard. Pointer analysis for multithreaded programs. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 77–90. ACM Press, 1999.

[20] A. Spiegel. Automatic Distribution of Object-Oriented Programs. PhD thesis, Fachbereich Mathematik u. Informatik, Freie Universität Berlin, 2002.

[21] E. Tilevich and Y. Smaragdakis. J-Orchestra: Automatic Java application partitioning. In ECOOP 2002 — Object-Oriented Programming: 16th European Conference, Malaga, Spain, volume 2374 of Lecture Notes in Computer Science, pages 178–204. Springer-Verlag, June 2002.

[22] J. Whaley. Joeq: A virtual machine and compiler infrastructure. In Proceedings of the Workshop on Interpreters, Virtual Machines, and Emulators, San Diego, CA, pages 58–66, 2003.
