A Compiler and Runtime Infrastructure For Automatic Program Distribution
Abstract

This paper presents the design and the implementation of a compiler and runtime infrastructure for automatic program distribution. We are building a research infrastructure that enables experimentation with various program partitioning and mapping strategies, and the study of automatic distribution's effect on resource consumption (e.g., CPU, memory, communication). Since many optimization techniques are faced with conflicting optimization targets (e.g., memory and communication), we believe that it is important to be able to study their interaction.

We present a set of techniques that enable flexible resource modeling and program distribution: dependence analysis, weighted graph partitioning, code and communication generation, and profiling. We have developed these ideas in the context of the Java language. We present in detail the design and implementation of each of these techniques as part of our compiler and runtime infrastructure. We then evaluate our design and present preliminary experimental data for each component, as well as for the entire system.

1. Introduction

Automatic program distribution has important potential benefits over manual distribution, such as correctness, increased productivity, adaptive execution, and concurrency exploitation. This paper describes a new approach to automatic program distribution. In contrast with previous work, instead of considering a particular class of programs and optimization targets, we consider general-purpose programs and study multiple optimization targets. Our system accepts a monolithic program and transforms it into multiple communicating parts in networked systems.*

Our approach places high emphasis on the generality of the distribution strategy and on the ability to build an abstract model of the execution environment; the distribution strategy can then be specialized to concrete environments. We recognize that this approach may not be suitable for all computations. Many programs may not need distribution at all.

In some cases, however, automatic distribution is crucial. New technologies such as pervasive computing require that applications connect from any device, over any network, using any style of interface. Mobile computing requires that mobile code be deployed over heterogeneous networks of sometimes resource-constrained devices. If there are not enough resources available to accommodate a given program on a single computing node, the promises of these technologies cannot be delivered. In this context, automatic distribution can help with increased accessibility, resource sharing, and load balancing.

Another broad class of data-intensive applications relies on networked systems to process data concurrently. Such applications range from inherently concurrent applications like image processing, universe exploration, and computer-supported cooperative work, to loosely concurrent applications such as fluid mechanics in avionics and marine structures. In this context, automatic distribution can help with exploiting concurrency, reducing execution time, and increasing scalability.

Our specific technical contributions relative to previous systems with similar goals are:

• A set of techniques for a novel approach to automatic program distribution: object dependence graph construction, general graph partitioning, automatic communication generation, and automatically distributed program execution.

• An original compiler and runtime infrastructure that implements all the above techniques to allow flexible program distribution based on program access patterns, resource requirements, and resource availability.

* Parts of this research were funded under ONR award N00014-01-1-0854 and NSF award CNS-0205712.
[Figure 1. The distributed compiler and runtime infrastructure. The pipeline runs from Java bytecode through the Joeq front-end (loader, bytecode decoding to Quad IR), bytecode and Quad analyses, object dependence graph (ODG) construction and partitioning, code and communication generation (producing object partitions P1 ... Pn plus communication calls), the back-end (linker, interpreter, communication library and runtime support, producing executable code sections), to distributed execution with a scheduling module, execution monitoring, program profiling, and adaptive repartitioning.]

...approximate the object dependence graph for a program and model its resource requirements.

3. The system partitions the object dependence graph using a Java wrapper of the Metis graph partitioning tool [14].

4. The system uses bytecode rewriting to insert communication calls for remote dependences in the partitioned program. Also, the system uses a bottom-up rewrite system to generate target code for the various platforms making up the networked configuration. For better resource utilization, in the future we plan to use native execution rather than Java Virtual Machine (JVM) hosted execution on (possibly resource-constrained) devices.

5. The system monitors the program execution and collects a set of statistics about resource usage. We use this information to gain insight into static partitioning. In the future we plan to use this information to perform adaptive repartitioning.

The rest of the paper is organized as follows. Section 2 describes the object dependence graph construction. Section 3 explains how we use graph partitioning to model resources and split the program into multiple pieces. Section 4 describes in detail our approach to code and communication generation. Section 5 presents the implementation of a runtime system that allows automatic distributed execution. Section 6 presents the design and implementation of a mixed instrumentation and sampling profiler that monitors programs during execution. Section 7 discusses an initial evaluation of the techniques we introduce in this paper. Section 8 reviews related research and contrasts our effort with previous approaches similar in goals. Section 9 concludes the paper and outlines future research directions.
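Step 3 above hands the object dependence graph to Metis through a Java wrapper [14]. As a hedged illustration of what such a wrapper must do (this is our own sketch, not the authors' code), the fragment below flattens an adjacency-list graph into the compressed (CSR) arrays that Metis-style partitioners consume, and computes the edgecut, the number of edges straddling partitions, which is the quality measure reported later in Table 1:

```java
import java.util.*;

// Hypothetical sketch: prepare a dependence graph for a Metis-style
// partitioner and evaluate a partition's edgecut.
public class OdgPartitionSketch {
    // adj.get(u) lists the neighbors of node u (undirected graph,
    // so each edge appears in both endpoint lists).
    static int[][] toCsr(List<List<Integer>> adj) {
        int n = adj.size();
        int[] xadj = new int[n + 1];          // start of each node's slice
        for (int u = 0; u < n; u++) xadj[u + 1] = xadj[u] + adj.get(u).size();
        int[] adjncy = new int[xadj[n]];      // concatenated neighbor lists
        int pos = 0;
        for (List<Integer> nbrs : adj)
            for (int v : nbrs) adjncy[pos++] = v;
        return new int[][] { xadj, adjncy };
    }

    // Edgecut: edges whose endpoints land in different partitions.
    static int edgeCut(List<List<Integer>> adj, int[] part) {
        int cut = 0;
        for (int u = 0; u < adj.size(); u++)
            for (int v : adj.get(u))
                if (part[u] != part[v]) cut++;
        return cut / 2;                       // each cut edge counted twice
    }

    public static void main(String[] args) {
        // Tiny example: a square 0-1-2-3-0 with one diagonal 0-2.
        List<List<Integer>> adj = List.of(
            List.of(1, 3, 2), List.of(0, 2), List.of(1, 3, 0), List.of(2, 0));
        System.out.println(Arrays.toString(toCsr(adj)[0]));
        // Put {0,1} on partition 0 and {2,3} on partition 1.
        System.out.println(edgeCut(adj, new int[] { 0, 0, 1, 1 }));
    }
}
```

A real wrapper would additionally pass node and edge weights, the resource model of Section 3, alongside these arrays.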
    }
    public void openAccount(Account a) {
        accounts.add(a);
    }
    public boolean withdraw(int customerID, int amount) {
        if (...) {
            this.getCustomer(customerID).setBalance(
                this.getCustomer(customerID).getBalance() - amount);
            return true;
        } else return false;
    }
    public static void main(String[] args) {
        ...;
        Bank merchants = new Bank("Merchants", 100, 10000);
        Account a4 = new Account(1, "ABC Market", 1000000, 100000, 20000000);
        Account a5 = new Account(2, "CDE Outlet", 5000000, 300000, 150000000);
        merchants.openAccount(a4);
        merchants.openAccount(a5);
        ...;
        Account a = merchants.getCustomer(2);
        merchants.withdraw(a.getId(), 900);
    }
}

[Figure 3. The class relation graph visualized with the aiSee tool (VCG format).]

[Figure 9. The transformation for new Account(i, n, s, c);]

...comprises the class name and the arguments to the class constructor.

The quality of communication generation is directly influenced by the quality of dependence analysis. Our analysis is type-based and thus not very precise. More precise dependence information makes use of points-to information [19] in the context of speculative multithreading. In addition, several communication optimization techniques can be applied to optimize communication generation: message aggregation, hoisting communication out of loops, asynchronous communication, overlapping communication and computation, data replication, and early prefetch. Many of these techniques cannot be used with a request/response communication style like RPC or RMI. In contrast, we use message exchange communication to reveal more optimization opportunities.

5. Distributed Execution

The distributed target code partitions are executed within the MPI-enhanced runtime environment. Currently we use JVM-hosted execution rather than native execution. Even though the retargetable code generation component is fully implemented, it was easier to use a normal JVM since our current experiments are conducted on resource-rich x86 platforms. Also, the use of the JVM does not affect our current distributed execution evaluation (speed-up measurements).

In our current implementation, on each node there are three supporting services: the MPI service, the Execution Starter service, and the Message Exchange service. Figure 10 depicts this organization of the runtime services for distributed execution. The MPI service sets up the necessary MPI working environment — such as groups, communicators, and the communication context.

The Execution Starter service starts the application by invoking the main() method of the application class. Only one copy of Execution Starter needs to be active on the processor node in the distributed execution environment where the user initiates the application.

The core of this MPI-aware runtime support is the Message Exchange service. This service processes all the send and receive MPI communication generated from the object dependence information. The Message Exchange service uses two supporting data structures: the DependentObject and the exchanged Message. The runtime uses the DependentObject (implemented by a Java class) to indicate an object that has dependence relations to another partition.

Each dependent object contains the following information: its class type, the identifier of the partition (node) that hosts the object, and its unique identifier in that partition (node). A message (packed in a Message structure) exchanged between two dependent objects across two nodes contains the object identifier of the receiver of the communication and the relevant dependence data. The Message Exchange service passes objects between nodes using a streamed format.

We currently identify two types of messages: NEW and DEPENDENCE, for object instantiation and data dependence. We are in the process of defining more precise dependence relations (e.g., read after write), and of discriminating further between messages.

6. Profiler

We have built a profiler that collects statistics indicating the resource consumption of a program at runtime.
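The DependentObject and Message structures used by the Message Exchange service in Section 5 can be sketched in plain Java. This is an illustrative reconstruction based only on the fields the paper names; the class layout, field names, and the use of Java serialization as a stand-in for the streamed format are our assumptions:

```java
import java.io.*;

// Illustrative reconstruction of the two runtime structures from
// Section 5; names and layout are our own guesses.
public class MessageSketch {
    enum Kind { NEW, DEPENDENCE }          // the two message types

    // Identifies an object with dependences into another partition.
    static class DependentObject implements Serializable {
        String className;                  // the object's class type
        int partitionId;                   // partition (node) hosting it
        int objectId;                      // unique id within that partition
        DependentObject(String c, int p, int o) {
            className = c; partitionId = p; objectId = o;
        }
    }

    // What crosses the wire between two dependent objects.
    static class Message implements Serializable {
        Kind kind;                         // NEW or DEPENDENCE
        int receiverObjectId;              // object id of the receiver
        Serializable payload;              // the relevant dependence data
        Message(Kind k, int r, Serializable p) {
            kind = k; receiverObjectId = r; payload = p;
        }
    }

    // Streamed-format round trip, standing in for an MPI send/receive.
    static Message roundTrip(Message m) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(m);
        ObjectInputStream in = new ObjectInputStream(
            new ByteArrayInputStream(bos.toByteArray()));
        return (Message) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        DependentObject target = new DependentObject("Account", 1, 7);
        Message m = new Message(Kind.DEPENDENCE, target.objectId, "balance=900");
        Message received = roundTrip(m);
        System.out.println(received.kind + " -> object " + received.receiverObjectId);
    }
}
```

In the actual system the round trip would be an MPI send/receive pair between two nodes rather than an in-memory stream.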
The profiler is built on top of the Joeq compiler and virtual machine. The profiler works either through instrumentation or through sampling. Some of the metrics can be implemented using either technique; in these cases, the instrumentation is useful as a baseline for assessing the accuracy of the sampling. There are four basic categories of runtime application behavior we are interested in: CPU, memory, battery, and communication (i.e., network) usage. To measure these four basic categories, we have currently implemented six metrics: method duration, method frequency, hot methods, hot paths, memory allocation, and dynamic call graph.

The method duration metric measures the amount of time each method takes to execute. The metric was originally implemented by overloading the method invocation process of the built-in native² interpreter. The times of entry and exit of each method (both system-level and user-level) are recorded in a profiling class. Unfortunately, due to problems within Joeq itself, this metric had to be measured on our test benchmarks with Java source-level instrumentation. See Section 7.3 for details.

benchmark    size             CRG            ODG
             #C   #M   KB     #N  #E  EC     #N   #E   EC
create*      14   28   13     17   6   2    210  632   82
method*       6   35   10     12  10   2      9   32    2
crypt*        6   45   12     13  13   3     11   33    1
heapsort*     6   42   10     13  13   3     11   33    2
moldyn*       8   48   17     12  15   2      9   32    2
search*       9   57   17     14  23   3      6   20    3
cmprss**     38  295  160     36  42   1     32  107    2
db**         32  299  155     32  26   2     49  164    8

* Java Grande benchmarks: JGFCreateBench and JGFMethodBench (section 1), JGFCryptBench and JGFHeapSortBench (section 2), JGFMolDynBench and JGFSearchBench (section 3).
** SPEC JVM98 benchmarks: 201 compress and 209 db.

Table 1. The size of the benchmarks (number of classes, methods, and KB) and the sizes of the resulting graphs (the number of nodes, edges, and the edgecut for both CRG and ODG).
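Conceptually, the source-level instrumentation used for the duration and frequency metrics wraps each method's entry and exit with calls into a shared profiling class. A minimal sketch, assuming a hypothetical profiling class of our own design rather than the paper's actual one:

```java
import java.util.*;

// Minimal sketch of source-level instrumentation for the method
// duration and method frequency metrics: each instrumented method
// reports its entry and exit to a shared profiling class.
public class InstrumentationSketch {
    static final Map<String, Long> invocations = new HashMap<>();
    static final Map<String, Long> totalNanos = new HashMap<>();

    static long enter(String method) {
        invocations.merge(method, 1L, Long::sum);   // frequency counter
        return System.nanoTime();                   // entry timestamp
    }

    static void exit(String method, long entryTime) {
        totalNanos.merge(method, System.nanoTime() - entryTime, Long::sum);
    }

    // An instrumented method: the enter/exit calls are what a
    // source-level rewrite would insert around the original body.
    static int fib(int n) {
        long t = enter("fib");
        try {
            return n < 2 ? n : fib(n - 1) + fib(n - 2);
        } finally {
            exit("fib", t);
        }
    }

    public static void main(String[] args) {
        fib(10);
        System.out.println("fib invoked " + invocations.get("fib") + " times");
    }
}
```

The frequency counter alone is much cheaper than the timing, which is why the paper treats method frequency as a low-cost substitute for method duration.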
The method frequency metric measures how often each method is invoked. This metric can also be used as a less expensive substitute for the method duration metric. A counter associated with each method keeps track of the number of invocations. However, as with the method duration metric, source-level instrumentation had to be used instead.

The hot methods metric minimizes the overhead of the previous metric by using sampling. For each native thread Joeq spawns, it also attaches a separate native interrupter thread. The interrupter thread's main task is to signal the thread queue when to switch threads. This provides a convenient approach to sampling: simply pass control from the interrupter thread to the profiler at each scheduling time quantum. The profiler then obtains the currently executing method by reading the call stack of the thread and recording the top stack frame.

The hot paths metric goes a level above the hot methods metric in its scope and measures the hottest execution paths through the application. We extend the hot methods technique and sample the entire call stack instead of only the top stack frame.

The memory allocation metric is implemented by directly modifying the internal Java virtual machine system code of Joeq. By overloading some of the methods that implement memory allocation, we can estimate the memory profile of the application without performing instrumentation. Unfortunately, this metric is currently only a very rough approximation, but we are confident that much better accuracy can be achieved in the near future.

The dynamic call graph metric shows the methods that actually got called in a specific application instance. It was measured using sampling. It makes use of data similar to the hot paths metric, but processes the data differently in order to construct the dynamic call graph.

² "Native" in the context of Joeq means it bootstraps itself into a fully functional JVM without the need for a host JVM to support it.

7. Evaluation

We have implemented a functional infrastructure prototype that realizes the components presented in the above sections. We evaluate the functionality and the performance of our prototype with a set of benchmarks from the Java Grande benchmark suite and SPEC JVM98 (see Table 1). In our experiments the networked configuration includes a service node, a 1.7GHz Pentium III machine (512MB RAM, SuSE 9.1), and a computation node, an 800MHz Pentium III (384MB RAM, RedHat 9.0). Both nodes run JDK 1.4, and the two nodes are connected via 100Mb Ethernet. At the time of this publication we did not have access to other networked configurations and only experimented with the few computers we had access to. In the future, however, we plan to set up a network consisting of multiple nodes with significant differences in resources and configurations.

7.1. Dependence Graph Construction

Table 1 shows the sizes of the original benchmarks as well as the resulting class relation graph (CRG) and object dependence graph (ODG) for each benchmark. The edgecut is the number of edges that straddle partitions. Currently we use the class relation graph partitioning to distribute the program.
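The hot-methods sampler described in Section 6 reads a thread's call stack at each scheduling quantum and records the top frame. A rough stand-in outside Joeq, using a plain scheduled task instead of the native interrupter thread (all names here are our own):

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative approximation of the hot-methods metric: periodically
// read a thread's call stack and count the method on top. Joeq uses a
// native interrupter thread; a scheduled task stands in for it here.
public class HotMethodSampler {
    final Map<String, Integer> samples = new ConcurrentHashMap<>();

    // Record one sample: credit the method on top of the stack.
    void record(StackTraceElement[] stack) {
        if (stack.length == 0) return;
        String top = stack[0].getClassName() + "." + stack[0].getMethodName();
        samples.merge(top, 1, Integer::sum);
    }

    // Sample the target thread once per quantum until cancelled.
    ScheduledFuture<?> start(Thread target, ScheduledExecutorService ses,
                             long quantumMillis) {
        return ses.scheduleAtFixedRate(() -> record(target.getStackTrace()),
                quantumMillis, quantumMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        HotMethodSampler sampler = new HotMethodSampler();
        // Deterministic demonstration: feed two synthetic stacks.
        StackTraceElement hot = new StackTraceElement("Bank", "withdraw", "Bank.java", 12);
        StackTraceElement caller = new StackTraceElement("Bank", "main", "Bank.java", 40);
        sampler.record(new StackTraceElement[] { hot, caller });
        sampler.record(new StackTraceElement[] { hot, caller });
        System.out.println(sampler.samples);
    }
}
```

Sampling the entire array instead of only stack[0] is essentially the hot-paths extension described above.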
benchmark   construct         partition       rewrite
            CRG     ODG       TRG    ODG
create      2043    3056        7     12        271
method      1704      53        7      6        202
crypt       1715      40        7      7        209
heapsort    1615      54        6      7        193
moldyn      1903     114        6      6        215
search      1868      49        7      7        204
compress    2305     100        6      7        285
db          2434      99       10      7        280

Table 2. Execution times, in milliseconds, for graph construction, partitioning, and code rewriting.
The execution times for graph construction and the distribution transformation are shown in Table 2, in milliseconds. We can see that the static analysis of the class relations is on the order of seconds. This is because the process of extracting high-level dependence information from the low-level bytecode format is computation- and time-consuming. However, since this process happens only once, at compile time, it is not as crucial as the other phases in the dynamic repartitioning process — ODG construction, partitioning, and code rewriting. Of these latter phases, only partitioning has to be completely re-executed in each adaptive iteration; ODG construction and code rewriting can both be adjusted incrementally. Since the partition time is only about 10ms, we believe that the results are promising for our future plans of incorporating adaptive repartitioning. Also, the Create benchmark has an unusually long ODG construction time. This is because it creates a large number of objects, which substantially complicates the object graph.

7.2. Distributed Execution

To evaluate the performance of the distributed execution runtime, we compare the distributed execution time of the transformed benchmarks with the execution time of the original sequential benchmarks on the 800MHz Pentium III machine. The execution speedup is depicted in Figure 11. The distributed execution shows comparable or improved performance (79.2% to 175.2%) relative to the original sequential execution. The results are promising, since without any further optimization the distributed execution results in very little overhead (in Method and Compress) or in speed-up. Since we currently use a suboptimal, naive partitioning, we expect further performance gains once optimization is introduced into the distribution infrastructure in our future work.

7.3. Profiling

We evaluate the profiler on a subset of the Java Grande Forum benchmarks. For the baseline measurements, Joeq runs each of the benchmarks with all the profiling code compiled in, but not enabled. Then each of the profilers is enabled in turn. The tests were conducted on an AMD Athlon XP 2000+ (1.67 GHz) with 512 MB RAM running Windows XP. In each of the tests, Joeq was allocated a maximum heap size of 1024 MB.

Table 3 shows the total execution times for each of the benchmarks and profilers. The average overhead over all the profilers is 21.94%. A general trend is that metrics measured with instrumentation incurred notably higher overhead than the others, which used either sampling or modification of the JVM system code. The hot paths, dynamic call graph, and memory usage metrics all incurred roughly equal overhead, approximately 14-20%. The best result came from the hot methods metric, at approximately 4%.

8. Related Work

There are two types of automatic distribution compilers or virtual machines available: automatic distribution to exploit data parallelism in scientific programs, and automatic partitioning of Java programs to relieve resources on constrained devices.

Automatic Distribution of Data Parallel Programs. Automatic parallelization is one research area that has investigated the partitioning problem, mainly for scientific programs, typically targeting a significant reduction in CPU or memory consumption [8, 13, 1, 6, 11]. There are two main differences between partitioning for scientific applications and our work. First, most of the previous work focuses on array partitioning or loop iteration partitioning for
Test/Metric             Baseline  Hot Paths  Dynamic     Hot      Method    Method     Memory
                                             Call Graph  Methods  Duration  Frequency  Usage
CreateBench (int[])        4.406      5.125       5.375    5.468     4.734      5.937    9.718
CreateBench (long[])      18.250     28.046      28.640   19.281    25.140     31.062   35.000
CreateBench (float[])      4.468      6.437       5.906    4.265     5.015      4.659    6.015
CreateBench (Object[])     2.156      2.421       2.468    2.328     2.296      2.203    2.281
CreateBench (Custom[])    10.718     12.687      12.500   11.484    11.875     11.234   51.406
MethodBench              196.187    212.140     222.359  202.281   323.437    248.156  198.937
FFTA                      32.187     37.609      40.765   33.812    35.781     36.546   34.312
HeapSortA                  3.906      4.296       4.968    4.281    17.297     14.328    3.968
MolDynA                   48.234     53.062      57.390   50.234    51.375     51.750   50.125
MonteCarloA               48.734     59.859      58.890   51.015    75.194     60.234   49.671
Total:                   369.734    421.682     439.261  384.449   552.144    466.109  441.433
Overhead:                  0.00%     14.05%      18.80%    3.98%    49.34%     26.07%   19.39%

Table 3. The profiler evaluation. Each row is an individual benchmark; each column is the name of the profiler enabled. The Total row is the total time it took to execute all the benchmarks; times are given in seconds. The baseline column gives the execution times with all the profiling code compiled in but not enabled.
scientific programs. We address general program distribution, where all the objects in a program are of interest. Second, the main objective of partitioning in scientific programs is to speed up execution, either on distributed or on shared memory machines. Our design choices are motivated by the ability to model multiple resources and study their interaction. The general distribution can then be specialized at runtime depending on resource priorities and the actual environment.

Automatic Distribution of Java Programs. JavaParty [17] extends Java with remote objects. The objective is to provide location transparency in a distributed memory environment. In contrast, we achieve the transparency effect without extending Java syntax. However, we do not give the user any control over distribution.

Messer et al.'s approach, though entirely dynamic, has an objective that more closely matches our own [15]. The goal is to transparently off-load services to relieve memory and processing constraints on resource-constrained devices. The main difference is the handling of object references. In this approach each JVM maps all other JVMs' references, resulting in a replicate-all strategy. Our approach is partly static, and it considers only some of the interactions between objects (those crossing processors).

Another approach, similar to the distributed shared memory paradigm, is to implement a distributed JVM as a global object space [4]. We achieve the same transparency effect at hopefully lower cost, since we distinguish between local and remote accesses.

J-Orchestra [21] transforms Java bytecode into distributed Java applications. This is also an abstract shared memory implementation. The communication is synchronous only — i.e., RMI. To exploit asynchronous communication, we use automatically generated point-to-point messages.

Pangaea [20] is a system that can distribute Java programs using arbitrary middleware (Java RMI, CORBA) to invoke objects remotely. The system is based on the original algorithm by Spiegel, which was the basis for our own extended algorithm [2]. Pangaea's input is a centralized Java source-code program; the result is a distributed program built on the synchronous remote method invocation communication paradigm. Our approach starts from Java bytecode and targets a flexible distribution model (i.e., one that allows the exploitation of concurrency and asynchronous communication) in a program.

Coign [9] is also a system that strives to automatically partition binary programs (built from COM components) for optimal execution. Coign is designed to handle 2-way partitioning only (between two nodes) for client-server distributions. Also, the distribution is fully dynamic, based on profiling history. We combine static analysis with off-line distribution in a general, multi-way partitioning.

9. Conclusion

This paper presented the design and implementation of a research compiler and runtime infrastructure for automatic program distribution. While not all programs can benefit from automatic distribution, we believe that it is important to be able to model the resources of a program and study the effect of distribution on program behavior with respect
to resource consumption. The motivating factor in our design was flexibility and modularity; thus, we expect each of the techniques we presented to evolve as more experiments are conducted.

Our design is based on two key ideas: find the dependences between the objects in a program, and use this information to automatically generate communication. We have shown how we cast the resource modeling and program distribution problem as an optimal graph partitioning problem. We model the resources as weights on the dependence graph and then experiment with multiple resource priorities and constraints. We have presented the code generation phase as two separate parts: platform-independent code generation and communication generation.

We have also described a profiler system that allows us to collect information about program behavior and, eventually, to redistribute the program according to the actual access patterns and resource requirements. Our present infrastructure only handles static partitioning. While dynamic repartitioning is the goal of our next design iteration, it does not influence the design of the infrastructure presented in this paper.

Finally, we have presented results on each of the techniques that we have introduced. The results indicate that partitioning takes little time and that the computed dependence graphs are of manageable size. We have also shown that without any further tuning, the distributed execution results in either a very small overhead or a speed-up. Finally, we have evaluated our profiler system in terms of the incurred overhead as well as the collected data.

References

[1] C. Ancourt and F. Irigoin. Automatic code distribution. In Proceedings of the Third Workshop on Compilers for Parallel Computers, Vienna, Austria, July 1992.
[2] R. E. Diaconescu, L. Wang, and M. Franz. Automatic distribution of Java byte-code based on dependence analysis. Technical Report 03-18, School of Information and Computer Science, University of California, Irvine, October 2003.
[3] S. Dutt. New faster Kernighan-Lin-type graph-partitioning algorithms. In Proceedings of the 1993 IEEE/ACM International Conference on Computer-Aided Design, pages 370-377. IEEE Computer Society Press, 1993.
[4] W. Fang, C.-L. Wang, and F. Lau. Efficient global object space support for distributed JVM on cluster. In International Conference on Parallel Processing, Vancouver, Canada, August 2002.
[5] C. W. Fraser, D. R. Hanson, and T. A. Proebsting. Engineering a simple, efficient code-generator generator. ACM Letters on Programming Languages and Systems, 1(3):213-226, 1992.
[6] M. Gupta and P. Banerjee. PARADIGM: A compiler for automatic data distribution on multicomputers. In Proceedings of the 7th International Conference on Supercomputing, pages 87-96. ACM Press, 1993.
[7] B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. In Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), page 28. ACM Press, 1995.
[8] S. Hiranandani, K. Kennedy, and C.-W. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, 1992.
[9] G. C. Hunt and M. L. Scott. The Coign automatic distributed partitioning system. In Operating Systems Design and Implementation, pages 187-200, 1999.
[10] Java interface to the Metis graph partitioning and visualization tool. Available at: http://www.cacr.caltech.edu/roxana/code/jmetis.tar.gz.
[11] K. Kennedy and U. Kremer. Automatic data layout for High Performance Fortran. In Proceedings of the 1995 Conference on Supercomputing (CD-ROM), page 76. ACM Press, 1995.
[12] B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, pages 291-307, February 1970.
[13] C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4):440-451, October 1991.
[14] Metis family of multilevel partitioning algorithms. Available at: http://www-users.cs.umn.edu/~karypis/metis/.
[15] A. Messer, I. Greenberg, P. Bernadat, D. Milojicic, T. Giuli, and X. Gu. Towards a distributed platform for resource-constrained devices. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02). IEEE, July 2002.
[16] T. J. Parr and R. W. Quong. ANTLR: A predicated-LL(k) parser generator. Software: Practice and Experience, 25(7):789-810, 1995.
[17] M. Philippsen and M. Zenger. JavaParty — transparent remote objects in Java. Concurrency: Practice and Experience, 9(11):1225-1242, 1997.
[18] T. A. Proebsting. BURS automata generation. ACM Transactions on Programming Languages and Systems, 17(3):461-486, 1995.
[19] R. Rugina and M. Rinard. Pointer analysis for multithreaded programs. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 77-90. ACM Press, 1999.
[20] A. Spiegel. Automatic Distribution of Object-Oriented Programs. PhD thesis, Fachbereich Mathematik u. Informatik, Freie Universitat Berlin, 2002.
[21] E. Tilevich and Y. Smaragdakis. J-Orchestra: Automatic Java application partitioning. In ECOOP 2002 - Object-Oriented Programming: 16th European Conference, Malaga, Spain, volume 2374 of Lecture Notes in Computer Science, pages 178-204. Springer-Verlag, June 2002.
[22] J. Whaley. Joeq: A virtual machine and compiler infrastructure. In Proceedings of the Workshop on Interpreters, Virtual Machines, and Emulators, San Diego, CA, pages 58-66, 2003.