Flux: A Language for Programming High-Performance Servers

Brendan Burns, Kevin Grimaldi, Alexander Kostadinov, Emery D. Berger, Mark D. Corner
Department of Computer Science
University of Massachusetts Amherst
Amherst, MA 01003
{bburns,kgrimald,akostadi,emery,mcorner}@cs.umass.edu
Abstract
Programming high-performance server applications is challenging: it is both complicated and error-prone to write the concurrent code required to deliver high performance and scalability. Server performance bottlenecks are difficult to identify and correct. Finally, it is difficult to predict server performance prior to deployment.
This paper presents Flux, a language that dramatically simplifies the construction of scalable high-performance server applications. Flux lets programmers compose off-the-shelf, sequential C or C++ functions into concurrent servers. Flux programs are type-checked and guaranteed to be deadlock-free. We have built a number of servers in Flux, including a web server with PHP support, an image-rendering server, a BitTorrent peer, and a game server. These Flux servers match or exceed the performance of their counterparts written entirely in C. By tracking hot paths through a running server, Flux simplifies the identification of performance bottlenecks. The Flux compiler also automatically generates discrete event simulators that accurately predict actual server performance under load and with different hardware resources.
1 Introduction
Server applications need to provide high performance while handling large numbers of simultaneous requests. However, programming servers remains a daunting task. Concurrency is required for high performance but introduces errors like race conditions and deadlock that are difficult to debug. The mingling of server logic with low-level systems programming complicates development and makes it difficult to understand and debug server applications. Consequently, the resulting implementations are often either lacking in performance, buggy, or both. At the same time, the interleaving of multiple threads of server logic makes it difficult to identify performance bottlenecks or predict server performance prior to deployment.
This paper introduces Flux, a declarative, flow-oriented domain-specific language that addresses these problems (Flux stems from the Latin word for flow). A Flux program describes two things: (1) the flow of data from client requests through nodes, typically off-the-shelf C or C++ functions, and (2) mutual exclusion requirements for these nodes, expressed as high-level atomicity constraints. Flux requires no other typical programming language constructs like variables or loops: a Flux program executes inside an implicit infinite loop. The Flux compiler combines the C/C++ components into a high-performance server using just the flow connectivity and atomicity constraints.
Flux captures a programming pattern common to
server applications: concurrent executions, each based
on a client request from the network and a subsequent
response. This focus enables numerous advantages over
conventional server programming:
Ease of use. Flux is a declarative, implicitly-
parallel coordination language that eliminates
the error-prone management of concurrency via
threads or locks. A typical Flux server requires just
tens of lines of code to combine off-the-shelf com-
ponents written in sequential C or C++ into a server
application.
Reuse. By design, Flux directly supports the in-
corporation of unmodified existing code. There is
no Flux API that a component must adhere to; as
long as components follow the standard UNIX con-
ventions, they can be incorporated unchanged. For
example, we were able to add PHP support to our
web server just by implementing a required PHP
interface layer.
Runtime independence. Because Flux is not tied
to any particular runtime model, it is possible to
deploy Flux programs on a wide variety of run-
time systems. Section 3 describes three runtimes
we have implemented: thread-based, thread pool,
and event-driven.
Correctness. Flux programs are type-checked to
ensure their compositions make sense. The atom-
icity constraints eliminate deadlock by enforcing a
canonical ordering for lock acquisitions.
Performance prediction. The Flux compiler op-
tionally outputs a discrete event simulator. As we
source Listen => Image;

Image = ReadRequest -> CheckCache
  -> Handler -> Write -> Complete;

Handler:[_, _, hit] = ;
Handler:[_, _, _] =
  ReadInFromDisk -> Compress
  -> StoreInCache;

Figure 1: An example Flux program and a dynamic view of its execution. In the dynamic view, each of clients 1 through n drives its own concurrent flow from Listen through ReadRequest, CheckCache, the hit/miss handler path (ReadInFromDisk, Compress, StoreInCache on a miss), Write, and Complete.
show in Section 5, this simulator accurately pre-
dicts actual server performance.
Bottleneck analysis. Flux servers include light-
weight instrumentation that identifies the most-
frequently executed or most expensive paths in a
running Flux application.
Our experience with Flux has been positive. We have
implemented a wide range of server applications in Flux:
a web server with PHP support, a BitTorrent peer, an im-
age server, and a multi-player online game server. The
longest of these consists of fewer than 100 lines of code,
with the majority of the code devoted to type signatures.
In every case, the performance of these Flux servers
matches or exceeds that of their hand-written counter-
parts.
The remainder of this paper is organized as follows.
Section 2 presents the semantics and syntax of the Flux
language. Section 3 describes the Flux compiler and
runtime systems. Section 4 presents our experimen-
tal methodology and compares the performance of Flux
servers to their hand-written counterparts. Section 5
demonstrates the use of path profiling and discrete-event
simulation. Section 6 reports our experience using Flux
to build several servers. Section 7 presents related work,
and Section 8 concludes with a discussion of planned
future work.
2 Language Description
To introduce Flux, we develop a sample application that
exercises most of Flux's features. This sample appli-
cation is an image server that receives HTTP requests
for images that are stored in the PPM format and com-
presses them into JPEGs, using calls to an off-the-shelf
JPEG library. Recently-compressed images are stored
in a cache managed with a least-frequently used (LFU)
replacement policy.
Figure 1 presents an abbreviated listing of the image
server code with a schematic view of its dynamic exe-
cution (see Figure 2 for a more detailed listing). The
Listen node (a source node) executes in an infinite loop, handling client requests and transferring data and control to the server running the Flux program. A single Flux program represents an unbounded number of separate concurrent flows: each request executes along a separate flow through the Flux program, and eventually
outputs results back to the client.
Notice that Flux programs are acyclic. The only loops
exposed in Flux are the implicit infinite loops in which
source nodes execute, and the round-trips between the
Flux server and its clients. The lack of cycles in Flux
allows it to enforce deadlock-free concurrency control.
While theoretically limiting expressiveness, we have
found cycles to be unnecessary for implementing in Flux
the wide range of servers described in Section 4.
The Flux language consists of a minimal set of fea-
tures, including concrete nodes that correspond to the
C or C++ code implementing the server logic, abstract
nodes that represent a flow through multiple nodes,
predicate types that implement conditional data flow,
error handlers that deal with exceptional conditions,
and atomicity constraints that control simultaneous ac-
cess to shared state.
2.1 Concrete Nodes
The first step in designing a Flux program is describing the concrete nodes that correspond to C and C++ functions. Flux requires type signatures for each node. The name of each node is followed by the input arguments in parentheses, followed by an arrow and the output arguments. Functions implementing concrete nodes require either one or two arguments. If the node is a source node, then it requires only one argument: a pointer to a struct that the function fills in with its outputs. Similarly, if the node is a sink node (without output), then its argument is a pointer to a struct that holds the function's inputs. Most concrete nodes have two arguments: first input, then output.
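As a concrete illustration (not taken from the paper), a node implementation is an ordinary C function that takes a pointer to its input struct and a pointer to its output struct and returns an error code. The struct layouts and the parse_request helper below are hypothetical:

#include <unistd.h>

typedef struct image_tag image_tag;           /* defined by the server's components */
image_tag *parse_request(const char *text);   /* hypothetical helper */

/* Hypothetical structs derived from the node's type signature. */
struct readrequest_in  { int socket; };
struct readrequest_out { int socket; int close; image_tag *request; };

/* First input, then output; a non-zero return signals an error. */
int ReadRequest(struct readrequest_in *in, struct readrequest_out *out)
{
    char buf[4096];
    ssize_t n = read(in->socket, buf, sizeof(buf) - 1);
    if (n <= 0)
        return 1;                       /* error: the flow is aborted or handled */
    buf[n] = '\0';
    out->socket  = in->socket;          /* pass the connection downstream */
    out->close   = 0;
    out->request = parse_request(buf);  /* parse_request is hypothetical */
    return 0;
}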
Figure 2 starts with the signatures for three of the
concrete nodes in the image server: ReadRequest
parses client input, Compress compresses images, and
Write outputs the compressed image to the client.
// concrete node signatures
Listen ()
  => (int socket);
ReadRequest (int socket)
  => (int socket, bool close, image_tag *request);
CheckCache (int socket, bool close, image_tag *request)
  => (int socket, bool close, image_tag *request);
// omitted for space:
// ReadInFromDisk, StoreInCache
Compress (int socket, bool close, image_tag *request, __u8 *rgb_data)
  => (int socket, bool close, image_tag *request);
Write (int socket, bool close, image_tag *request)
  => (int socket, bool close, image_tag *request);
Complete (int socket, bool close, image_tag *request) => ();

// source node
source Listen => Image;

// abstract node
Image = ReadRequest -> CheckCache
  -> Handler -> Write -> Complete;

// predicate type & dispatch
typedef hit TestInCache;
Handler:[_, _, hit] = ;
Handler:[_, _, _] =
  ReadInFromDisk -> Compress
  -> StoreInCache;

// error handler
handle error ReadInFromDisk => FourOhFour;

// atomicity constraints
atomic CheckCache:{cache};
atomic StoreInCache:{cache};
atomic Complete:{cache};

Figure 2: An image compression server, written in Flux.
While most concrete nodes both receive input data and produce output, source nodes only produce output to initiate a data flow. The statement below indicates that Listen is a source node, which Flux executes inside an infinite loop. Whenever Listen receives a connection, it transfers control to the Image node.
// source node
source Listen => Image;
2.2 Abstract Nodes
In Flux, concrete nodes can be composed to form abstract nodes. These abstract nodes represent a flow of data from concrete nodes to concrete nodes or other abstract nodes. Arrows connect nodes, and Flux checks to ensure that these connections make sense: the output type of the node on the left side of the arrow must match the input type of the node on the right side. For example, the abstract node Image in the image server corresponds to a flow from client input that checks the cache for the requested image, handles the result, writes the output, and completes.
// abstract node
Image = ReadRequest -> CheckCache
-> Handler -> Write -> Complete;
2.3 Predicate Types
A client request for an image may result in either a cache hit or a cache miss. These need to be handled differently. Instead of exposing control flow directly, Flux lets programmers use the predicate type of a node's output to direct the flow of data to the appropriate subsequent node. A predicate type is an arbitrary Boolean function supplied by the Flux programmer that is applied to the node's output.
Using predicate types, a Flux programmer can express multiple possible paths for data through the server. Predicate type dispatch is processed in the order of the tests in the Flux program. The typedef statement binds the type hit to the Boolean function TestInCache. The node Handler below checks to see whether its third argument is of type hit; in other words, it applies the function TestInCache to the third argument. The underscores are wildcards that match any type. Handler does nothing for a hit, but if there is a miss in the cache, the image server fetches the PPM file, compresses it, and stores it in the cache.
// predicate type & dispatch
typedef hit TestInCache;
Handler:[_, _, hit] = ;
Handler:[_, _, _] =
ReadInFromDisk -> Compress
-> StoreInCache;
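For illustration, a predicate function is ordinary C code. A minimal sketch of what TestInCache might look like, assuming a hypothetical cache_lookup helper (neither the body nor the helper comes from the paper):

typedef struct image_tag image_tag;       /* defined by the server's components */
void *cache_lookup(image_tag *request);   /* hypothetical helper */

/* Bound to the type 'hit' by the typedef above. It is applied to the
   output position being matched (here, the image request) and returns
   non-zero on a cache hit. */
int TestInCache(image_tag *request)
{
    return cache_lookup(request) != NULL;
}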
2.4 Error Handling
Any server must handle errors. Flux expects nodes to follow the standard UNIX convention of returning error codes. Whenever a node returns a non-zero value, Flux checks if an error handler has been declared for the node. If none exists, the current data flow is simply terminated.
In the image server, if the function that reads an image from disk discovers that the image does not exist, it returns an error. We handle this error by directing the flow to a node FourOhFour that outputs a 404 page:
// error handler
handle error ReadInFromDisk => FourOhFour;
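As a sketch of how this plays out in the component code (the struct layouts and the image_path helper are illustrative, not from the paper), the disk-reading node reports failure with a non-zero return, and FourOhFour is an ordinary node that emits the 404 response:

#include <fcntl.h>
#include <unistd.h>

typedef struct image_tag image_tag;
const char *image_path(image_tag *request);   /* hypothetical helper */

struct disk_in  { int socket; image_tag *request; };
struct disk_out { int socket; image_tag *request; unsigned char *rgb_data; };

int ReadInFromDisk(struct disk_in *in, struct disk_out *out)
{
    int fd = open(image_path(in->request), O_RDONLY);
    if (fd < 0)
        return 1;               /* non-zero return invokes the error handler */
    /* ... read the PPM data into out ... */
    close(fd);
    return 0;
}

int FourOhFour(struct disk_in *in)
{
    static const char msg[] =
        "HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n";
    write(in->socket, msg, sizeof(msg) - 1);
    return 0;
}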
2.5 Atomicity Constraints
All flows through the image server access a single shared image cache. Access to this shared resource must be controlled to ensure that two different data flows do not interfere with each other's operation.
The Flux programmer specifies such atomicity constraints in Flux rather than inside the component implementation. The programmer specifies atomicity constraints using arbitrary symbolic names. These constraints can be thought of as locks, although this is not necessarily how they are implemented. A node only runs when it has acquired all of its constraints. This acquisition follows a two-phase locking protocol: the node acquires (locks) all of its constraints in order, executes, and then releases them in reverse order.
Atomicity constraints can be specified as either readers or writers. Using these constraints allows multiple readers to execute a node at the same time, supporting greater efficiency when most nodes read shared data rather than update it. Reader constraints have a question mark appended to them (?). Although constraints are considered writers by default, a programmer can append an exclamation point (!) for added documentation.
In the image server, the image compression cache can be updated by three nodes: CheckCache, which increments a reference count on the cached item; StoreInCache, which writes a new item into the cache, evicting the least-frequently used item with a zero reference count; and Complete, which decrements the cached image's reference count. Only one instance of each node may safely execute at a time; since all of them modify the cache, we label them with the same writer constraint (cache).
// atomicity constraints
atomic CheckCache:{cache};
atomic StoreInCache:{cache};
atomic Complete:{cache};
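To make the mapping concrete, here is a rough sketch of what a generated guard around one of these nodes could look like under the thread-based runtimes, assuming POSIX read-write locks; the paper does not show the generated code, so the names and structure here are illustrative:

#include <pthread.h>

struct cache_in;
struct cache_out;
int CheckCache(struct cache_in *in, struct cache_out *out);  /* the programmer's node */

/* One lock per atomicity constraint; 'cache' is a writer constraint. */
static pthread_rwlock_t cache_constraint = PTHREAD_RWLOCK_INITIALIZER;

/* Two-phase protocol: acquire all constraints in canonical order,
   run the node, then release them in reverse order. */
int CheckCache_guarded(struct cache_in *in, struct cache_out *out)
{
    pthread_rwlock_wrlock(&cache_constraint);   /* writer acquisition */
    int rc = CheckCache(in, out);
    pthread_rwlock_unlock(&cache_constraint);
    return rc;
}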
Note that programmers can apply atomicity constraints not only to concrete nodes but also to abstract nodes. In this way, programmers can specify that multiple nodes must be executed atomically. For example, the node Handler could also be annotated with an atomicity constraint, which would span the execution of the path ReadInFromDisk -> Compress -> StoreInCache. This freedom to apply atomicity constraints presents some complications for deadlock-free lock assignment, which we discuss in Section 3.1.1.
2.5.1 Scoped Constraints
While flows generally represent independent clients, in some server applications, multiple flows may constitute a single session. For example, a file transfer to one client may take the form of multiple simultaneous flows. In this case, the state of the session (such as the status of transferred chunks) only needs to be protected from concurrent access in that session.
In addition to program-wide constraints that apply across the entire server (the default), Flux supports per-session constraints that apply only to particular sessions. Using session-scoped atomicity constraints increases concurrency by eliminating contention across sessions. Sessions are implemented as hash functions on the output of each source node. The Flux programmer implements a session id function that takes the source node's output as its parameter and returns a unique session identifier, and then adds (session) to a constraint name to indicate that it applies only per-session.
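A minimal sketch of such a session id function, assuming the source node's output carries the accepted connection's address and that all flows from the same client belong to one session (the struct and field names are illustrative):

#include <netinet/in.h>

/* Hypothetical output struct of the source node. */
struct accept_out { int socket; struct sockaddr_in client_addr; };

/* Takes the source node's output and returns the session key; any
   deterministic function of the output works. */
int session_id(struct accept_out *out)
{
    return (int)out->client_addr.sin_addr.s_addr;
}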
2.5.2 Discussion
Specifying atomicity constraints in Flux rather than
placing locking operations inside implementation code
has a number of advantages, beyond the fact that it al-
lows the use of libraries whose source code is unavail-
able.
Safety. The Flux compiler imposes a canonical ordering on atomicity constraints (see Section 3.1.1). Combined with the fact that Flux flows are acyclic, this ordering prevents cycles from appearing in its lock graph. Programs that use Flux-level atomicity constraints exclusively (i.e., that do not themselves contain locking operations) are thus guaranteed not to deadlock.
Efficiency. Exposing atomicity constraints also enables the Flux compiler to generate more efficient code for particular environments. For example, while a multi-threaded runtime requires locks, a single-threaded event-driven runtime does not. The Flux compiler generates locks or other mutual exclusion operations only when needed.
Granularity selection. Finally, atomicity constraints let programmers easily find the appropriate granularity of locking: they can apply fine-grained constraints to individual concrete nodes or coarse-grained constraints to abstract nodes that comprise many concrete nodes. However, even when deadlock-freedom is guaranteed, grain selection can be difficult: too coarse a grain results in contention, while too fine a grain can impose excessive locking overhead. As we describe in Section 5.1, Flux can generate a discrete event simulator for the Flux program. This simulator lets a developer measure the effect of different granularity decisions and identify the appropriate locking granularity before actual server deployment.
3 Compiler and Runtime Systems
A Flux program is transformed into a working server by a multi-stage process. The compiler first reads in the Flux source and constructs a representation of the program graph. It then processes the internal representation to type-check the program. Once the code has been verified, the runtime code generator processes the graph and outputs C code that implements the server's data flow for a specific runtime. Finally, this code is linked with the implementation of the server logic into an operational server. We first describe the compilation process in detail. We then describe the three runtime systems that Flux currently supports.
3.1 The Flux Compiler
The Flux compiler is a three-pass compiler implemented
in Java, and uses the JLex lexer [5] in conjunction with
the CUP LALR parser generator [3].
The first pass parses the Flux program text and builds a graph-based internal representation. During this pass, the compiler links nodes referenced in the program's data flows. All of the conditional flows are merged, with an edge corresponding to each conditional flow.
The second pass decorates edges with types, connects error handlers to their respective nodes, and verifies that the program is correct. First, each node mentioned in a data flow is labelled with its input and output types. Each predicate type used by a conditional node is associated with its user-supplied predicate function. Finally, the error handlers and atomicity constraints are attached to each node. If any of the referenced nodes or predicate types are undefined, the compiler signals an error and exits. Otherwise, the program graph is completely instantiated. The final step of program graph construction checks that the output types of each node match the inputs of the nodes that they are connected to. If all type tests pass, then the compiler has a valid program graph.
The third pass generates the intermediate code that implements the data flow of the server. Flux supports generating code for arbitrary runtime systems. The compiler defines an object-oriented interface for code generation. New runtimes can easily be plugged into the Flux compiler by implementing this code generator interface. The current Flux compiler supports several different runtimes, described below. In addition to the runtime-specific intermediate code, the Flux compiler generates a Makefile and stubs for all of the functions that provide the server logic. These stubs ensure that the programmer uses the appropriate signatures for these methods. When appropriate, the code generator outputs locks corresponding to the atomicity constraints.
3.1.1 Avoiding Deadlock
The Flux compiler generates locks in a canonical order. Our current implementation sorts them alphabetically by name. In other words, a node that has {y, x} as its atomicity constraints actually first acquires x, then y.
It is easy to see that when applied only to concrete nodes, this approach combines with Flux's acyclic graphs to eliminate deadlock. However, when abstract nodes also require constraints, the constraints may become nested, and preventing deadlock is more complicated: nesting could itself cause deadlock by acquiring constraints in non-canonical order. Consider the following Flux program fragment:
A = B;
C = D;
atomic A: {x};
atomic B: {y};
atomic C: {y};
atomic D: {x};
In this example, a flow passing through A (which then invokes B) locks x and then y. However, a flow through C locks y and then x, which is a cycle in the locking graph.
To prevent deadlock, the Flux compiler detects such situations and moves the atomicity constraints up in the program, forcing earlier lock acquisition. For each abstract node with atomicity constraints, the Flux compiler computes a constraint list comprising the atomicity constraints the node transitively requires, in execution order. This list can easily be computed via a depth-first traversal of the relevant part of the program graph. If a constraint list is out of order, then the first constraint acquired in a non-canonical order is added to the parent of the node that requires the constraint. This process repeats until no out-of-order constraint lists remain.
For the above example, Flux will discover that node
C has an out-of-order sequence (y, x). It then adds con-
straint x to node C. The algorithm then terminates with
the following set of constraints:
atomic A: {x};
atomic B: {y};
atomic C: {x,y};
atomic D: {x};
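The out-of-order check itself is just a lexicographic scan over each constraint list. A sketch of that check follows; the real compiler is written in Java, so this C fragment is only illustrative:

#include <string.h>

/* Returns the index of the first constraint acquired in non-canonical
   (non-alphabetical) order, or -1 if the list is already canonical. */
int first_out_of_order(const char *constraints[], int n)
{
    for (int i = 1; i < n; i++)
        if (strcmp(constraints[i - 1], constraints[i]) > 0)
            return i;
    return -1;
}

For the fragment above, the constraint list for C is {"y", "x"}, the check flags index 1, and the compiler responds by adding x to C.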
Flux locks are reentrant, so multiple lock acquisitions do not present any problems. However, reader and writer locks require special treatment. After computing all constraint lists, the compiler performs a second pass to find any instances where a lock is acquired at least once as a reader and as a writer. If it finds such a case, Flux changes the first acquisition of the lock to a writer if it is not one already. Reacquiring a constraint as a reader while possessing it as a writer is allowed because it does not cause the flow to give up the writer lock.
Because early lock acquisition can reduce concurrency, whenever the Flux compiler discovers and resolves potential deadlocks as described above, it generates a warning message.
3.2 Runtime Systems
The current Flux compiler supports three different run-
time systems: one thread per connection, a thread-pool
system, and an event-driven runtime.
3.2.1 Thread-based Runtimes
In the thread-based runtimes, each request handled by the server is dispatched to a thread function that handles all possible paths through the server's data flows. In the one-to-one thread server, a thread is created for every different data flow. In the thread-pool runtime, a fixed number of threads are allocated to service data flows. If all threads are occupied when a new data flow is created, the data flow is queued and handled in first-in, first-out order.
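A minimal sketch of that FIFO dispatch, assuming POSIX threads; the queue layout and the run_flow helper are illustrative, not the actual Flux runtime:

#include <pthread.h>
#include <stdlib.h>

void run_flow(int socket);   /* hypothetical: executes the server's data flow */

struct flow { int socket; struct flow *next; };

static struct flow *head, *tail;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

/* Called when a new data flow is created and no thread is free. */
void enqueue_flow(int socket)
{
    struct flow *f = malloc(sizeof *f);
    if (!f) return;
    f->socket = socket;
    f->next = NULL;
    pthread_mutex_lock(&qlock);
    if (tail) tail->next = f; else head = f;
    tail = f;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

/* Each of the fixed number of pool threads runs this loop. */
void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (head == NULL)
            pthread_cond_wait(&qcond, &qlock);
        struct flow *f = head;
        head = f->next;
        if (head == NULL) tail = NULL;
        pthread_mutex_unlock(&qlock);
        run_flow(f->socket);   /* handled in first-in, first-out order */
        free(f);
    }
    return NULL;
}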
3.2.2 Event-driven Runtime
The event-driven runtime operates differently. In this
runtime, every input to a functional node is seen as an
event. Each event is placed into a queue and handled in
turn by a single thread. Additionally, each source node
(a node with no input) is repeatedly placed on the queue
to originate each new data flow. The transformation of
input to output by a node generates a new event corre-
sponding to the output data being propagated to the sub-
sequent node.
The implementation of the event-based runtime is
complicated by the fact that node implementations may
perform blocking function calls. If blocking function
calls like read and write were allowed to run unmodified, the operation of the entire server would block until
the function returned.
Instead, the event-based runtime intercepts all calls to
blocking functions using a handler that is pre-loaded via
the LD_PRELOAD environment variable. This handler
captures the state of the node at the blocking call and
moves to the next event in the queue. The formerly-
blocking call is then executed asynchronously. When
the event-based runtime receives a signal that the call
has completed, the event is reactivated and re-queued
for completion. Because the mainstream Linux kernel
does not currently support callback-driven asynchronous
I/O, the current Flux event-based runtime uses a separate
thread to simulate callbacks for asynchronous I/O using
the select function. A programmer is thus free to use
synchronous I/O primitives without interfering with the
operation of the event-based runtime.
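A rough sketch of the interception mechanism, using dlsym(RTLD_NEXT) to reach the real libc symbol; the actual Flux handler additionally captures the node's state and re-queues the event, which the hypothetical yield_to_next_event call only gestures at:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

void yield_to_next_event(int fd);   /* hypothetical runtime hook */

/* Compiled into a shared library and preloaded via LD_PRELOAD, this
   definition shadows libc's read. */
ssize_t read(int fd, void *buf, size_t count)
{
    static ssize_t (*real_read)(int, void *, size_t);
    if (!real_read)
        real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");
    yield_to_next_event(fd);        /* let other queued events run first */
    return real_read(fd, buf, count);
}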
3.2.3 Other Languages and Runtimes
Each of these runtimes was implemented in C using POSIX threads and locks. Flux can also generate
code for different programming languages. We have
also implemented a prototype that targets Java, using
both SEDA [24] and a custom runtime implementation,
though we do not evaluate the Java systems here.
In addition to these runtimes, we have implemented
a code generator that transforms a Flux program graph
into code for the discrete event simulator CSIM [17].
This simulator can predict the performance of the server
under varying conditions, even prior to the implementa-
tion of the core server logic. Section 5.1 describes this
process in greater detail.
4 Experimental Evaluation
To demonstrate its effectiveness for building high-
performance server applications, we implemented a
number of servers in Flux. We summarize these in Ta-
ble 1. We chose these servers specifically to span the space of possible server applications. Most server applications can be broadly classified into one of the following categories, based on how they interact with clients:
request-response client/server, heartbeat client/server
and peer-to-peer.
We implemented a server in Flux for each of these cat-
egories and compared its performance under load with
existing hand-tuned server applications written in con-
ventional programming languages. The Flux servers
rely on single-threaded C and C++ code that we either
borrowed from existing implementations or wrote our-
selves. The most signicant inclusions of existing code
were in the web server, which uses the PHP interpreter,
and in the image server, which relies on calls to the
libjpeg library to compress JPEG images.
4.1 Methodology
We evaluate all server applications by measuring their
throughput and latency in response to realistic work-
loads.
All testing was performed with a server and client ma-
chine, both running Linux version 2.4.20. The server
machine was a Pentium 4 (2.4 GHz, 1 GB RAM), con-
nected via gigabit Ethernet on a dedicated switched
network to the client machine, a Xeon-based machine
Server        Style                     Description                        Lines of Flux code   Lines of C/C++ code
Web server    request-response          a basic HTTP/1.1 server with PHP   36                   386 (+ PHP)
Image server  request-response          image compression server           23                   551 (+ libjpeg)
BitTorrent    peer-to-peer              a file-sharing server              84                   878
Game server   heartbeat client-server   multiplayer game of Tag            54                   257

Table 1: Servers implemented using Flux, described in Section 4.
(2.4 GHz, 1 GB RAM). All server and client applications
were compiled using GCC version 3.2.2. During testing,
both machines were running in multi-user mode with
only standard services running. All results are for a run
of two minutes, ignoring the first twenty seconds to al-
low the cache to warm up.
4.2 Request-Response: Web Server
Request-response based client/server applications are
among the most common examples of network servers.
This style of server includes most major Internet proto-
cols including FTP, SMTP, POP, IMAP and HTTP. As
an example of this application class, we implemented a
web server in Flux. The Flux web server implements
the HTTP/1.1 protocol and can serve both static and dy-
namic PHP web pages.
We implemented a benchmark to load test the Flux webserver that is similar to SPECweb99 [20]. The benchmark simulates a number of clients requesting files from the server. Each simulated client sends five requests over a single HTTP/1.1 TCP connection using keep-alives. When one file is retrieved, the next file is immediately requested. After the five files are retrieved, the client disconnects and reconnects over a new TCP connection. The files requested by each simulated client follow the static portion of the SPECweb benchmark and each file is selected using the Zipf distribution. The working set for this benchmark is approximately 32MB, which fits into RAM, so this benchmark primarily stresses CPU performance.
We compare the performance of the Flux webserver against the latest versions of the knot webserver distributed with Capriccio [23] and the Haboob webserver distributed with the SEDA runtime system [24]. Figure 3 presents the throughput and latency for a range of simultaneous clients. These graphs represent the average of five different runs for each number of clients.
The results show that the Flux web server provides comparable performance to the fastest webserver (knot), regardless of whether the event-based or thread-based runtime is used. All three of these servers (knot, flux-threadpool and flux-event-based) significantly outperform Haboob, the event-based server distributed with SEDA. As expected, the naive one-thread, one-client server generated by Flux has significantly worse performance due to the overhead of creating and destroying threads.
The results for the event-based server highlight one drawback of running on a system without true asynchronous I/O. With small numbers of clients, the event-based server suffers from increased latency that initially decreases and then follows the behavior of the other servers. This hiccup is an artifact of the interaction between the webserver's implementation and the event-driven runtime, which must simulate asynchronous I/O. The first node in the webserver uses the select function with a timeout to wait for network activity. In the absence of other network activity, this node will block for a relatively long period of time. Because the event-based runtime only reactivates nodes that make blocking I/O calls after the completion of the currently-operating node, in the absence of other network activity, the call to select imposes a minimum latency on all blocking I/O. As the number of clients increases, there is sufficient network activity that select never reaches its timeout and frozen nodes are reactivated at the appropriate time. In the absence of true asynchronous I/O, the only solution to this problem would be to decrease the timeout passed to select, which would increase the CPU usage of an otherwise idle server.
4.3 Peer-to-Peer: BitTorrent
Peer-to-peer applications act as both a server and a
client. Unlike a request-response server, they both re-
ceive and initiate requests.
We implemented a BitTorrent server in Flux as a rep-
resentative peer-to-peer application. BitTorrent uses a
scatter-gather protocol for file sharing. BitTorrent peers exchange pieces of a shared file until all participants have a complete copy. Network load is balanced by randomly requesting different pieces of the file from different peers.
To facilitate benchmarking, we changed the behavior
of both of the BitTorrent peers we test here (the Flux ver-
sion and CTorrent). First, all client peers are unchoked
by default. Choking is an internal BitTorrent state that
blocks certain clients from downloading data. This pro-
tocol restriction prevents real-world servers from being
overwhelmed by too many client requests. We also allow
an unlimited number of unchoked client peers to operate
simultaneously, while the real BitTorrent server only unchokes clients who upload content.

Figure 3: Comparison of Flux web servers with other high-performance implementations (see Section 4.2). The three plots show throughput (Mb/s), latency (ms), and completions/s against the number of simultaneous clients for SEDA, Capriccio, and the Flux event-based, pure threaded, and thread pool runtimes.

Figure 4: Comparison of Flux BitTorrent servers with CTorrent (see Section 4.3). The three plots show throughput (Mb/s), latency (ms), and completions/s against the number of simultaneous clients for CTorrent and the Flux pure threaded, thread pool, and event-based runtimes.
We are unaware of any existing BitTorrent benchmarks, so we developed our own. Our BitTorrent benchmark mimics the traffic encountered by a busy BitTorrent peer and stresses server performance. It simulates a series of clients continuously sending requests for randomly distributed pieces of a 54MB test file to a BitTorrent peer with a complete copy of the file. When a peer finishes downloading a piece of the file, it immediately requests another random piece of the file from those still missing. Once a client has obtained the entire file, it disconnects. This benchmark does not simulate the scatter-gather nature of the BitTorrent protocol; instead, all requests go to a single peer. Using a single peer has the effect of maximizing load, since obtaining data from a different source would lessen the load on the peer being tested.
Figure 4 compares the latency, throughput in comple-
tions per second and network throughput to CTorrent,
an implementation of the BitTorrent protocol written in
C. The goal of any BitTorrent system is to maximize
network utilization (thus saturating the network), and
both the CTorrent and Flux implementations achieve this
goal. However, prior to saturating the network, all of the
Flux servers perform slightly worse than the CTorrent
server. We are investigating the cause of this small per-
formance gap.
4.4 Heartbeat Client-Server: Game
Server
Unlike request-response client/server applications and
most peer-to-peer applications, certain server applica-
tions are subject to deadlines. An example of such a
server is an online multi-player game. In these applica-
tions, the server maintains the shared state of the game
and distributes this state to all of the players at heart-
beat intervals. There are two important conditions that
must be met by this communication: the state possessed
by all clients must be the same at each instant in time,
and the inter-arrival time between states can not be too
great. If either of these conditions is violated, the game
will be unplayable or susceptible to cheating. These re-
quirements place an important delay-sensitive constraint
on the servers performance.
We have implemented an online multi-player game of
Tag in Flux. The Flux game server enforces the rules
of Tag. Players can not move beyond the boundaries of
the game world. When a player is tagged by the player
who is it, that player becomes the new it and is tele-
ported to a new random location on the board. All com-
munication between clients and server occurs over UDP
at 10Hz, a rate comparable to other real-world online
games. While simple, this game has all of the impor-
Figure 5: Predicted performance of the image server (derived from a single-processor run) versus observed performance for varying numbers of processors and load. The plot shows completions/s against the number of simultaneous clients, with predicted and actual curves for 1, 2, 4, 8, and 16 CPUs.
Benchmarking the game server is significantly different from load-testing either the webserver or BitTorrent peer. Throughput is not a consideration since only small pieces of data are transmitted. The primary concern is the latency of the server as the number of clients increases. The server must receive every player's move, compute the new game state, and broadcast it within a fixed window of time.
To load-test the game server, we measured the effect of increasing the number of players. The performance of the game server is largely based upon the length of time it takes the server to update the game state given the moves received from all of the players, and this computation time is identical across the servers. The latency of the game server is largely a product of the rate of game turns, which stays constant at 10Hz. We found no appreciable differences between a traditional implementation of the game server and the various Flux versions. These results show that Flux is capable of producing a server with sufficient performance for multi-player online gaming.
5 Performance
In addition to its programming language support for
writing server applications, Flux provides support for
predicting and measuring the performance of server ap-
plications. The Flux system can generate discrete-event
simulators that predict server performance for synthetic
workloads and on different hardware. It can also perform path profiling to identify server performance bottlenecks on a deployed system.
5.1 Performance Prediction
Predicting the performance of a server prior to deployment is important but often difficult. For example, performance bottlenecks due to contention may not appear during testing because the load placed on the system is insufficient. In addition, system testing on a small-scale system may not reveal problems that arise when the system is deployed on an enterprise-scale multiprocessor.
In addition to generating executable server code, the
Flux code generator can automatically transform a Flux
program directly into a discrete-event simulator that
models the performance of the server. We use CSIM
as the implementation language for the simulator [17].
In the simulator, each node acquires a shared CPU
resource for the period of time observed in the real
world. Increasing the number of nodes that can simulta-
neously acquire the CPU resource simulates the addition
of processors to the system. Each atomicity constraint
becomes a shared resource. Every node using a partic-
ular atomicity constraint acquires that resource for the duration of the node's execution. The simulator conservatively treats session-level constraints as globals.
It is important to note that this simulation does not
model disk or network resources. While this is a realis-
tic assumption for CPU-bound servers (such as dynamic
web-servers), other servers may require more complete
modeling.
The simulator can either use observed parameters
from a running system on a uniprocessor (per-node ex-
ecution times, source node inter-arrival times, and ob-
served branching probabilities), or the Flux programmer
can supply estimates for these parameters. The latter ap-
proach allows server performance to be estimated prior
to implementing the server logic.
To demonstrate that the generated simulations accu-
rately predict actual performance, we tested the image
server described in Section 2. To simulate load on the
machine, we made requests at increasingly small inter-
arrival times. The image server had 5 images, and our
load tester randomly requests one of eight sizes (be-
tween 1/8th scale and full-size) of a randomly-chosen
image. When configured to run with n clients, the
load tester issues requests at a rate of one every 1/n sec-
onds. The image server is CPU-bound, with each image
taking on average 0.5 seconds to compress.
We first measured the performance of this server on
a 16-processor SunFire 6800, but with only a single
CPU enabled. We then used the observed node runtime
and branching probabilities to parameterize the gener-
ated CSIM simulator. We compare the predicted and ac-
tual performance of the server by making more proces-
sors available to the system. As Figure 5 shows, the
predicted results (dotted lines) and actual results (solid
lines) match closely, demonstrating the effectiveness of
the simulator at predicting performance.
5.2 Path Proling
The Flux compiler optionally instruments generated servers to simplify the identification of performance bottlenecks. This profiling information takes the form of hot paths, the most frequent or most time-consuming paths in the server. Flux identifies these hot paths using the Ball-Larus path profiling algorithm [4]. Because Flux graphs are acyclic, the Ball-Larus algorithm identifies each unique path through the server's data-flow graph.
Hot paths not only aid understanding of server per-
formance characteristics but also identify places where
optimization would be most effective. Because profiling
information can be obtained from an operating server
and is linked directly to paths in the program graph, a
performance analyst can easily understand the perfor-
mance characteristics of deployed servers.
The overhead of path profiling is low enough that hot path information can be maintained even in a production server. Profiling adds just one arithmetic operation and two high-resolution timer calls to each node. A performance analyst can obtain path profiles from a running Flux server by connecting to a dedicated socket.
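A sketch of what that per-node instrumentation could look like: one addition accumulates the Ball-Larus path number, and two timer reads bound the node body. The struct, node names, and edge weight below are illustrative, not the paper's generated code:

#include <time.h>

struct compress_in;
struct compress_out;
int Compress(struct compress_in *in, struct compress_out *out);

enum { COMPRESS_EDGE_WEIGHT = 2 };          /* hypothetical Ball-Larus edge value */

struct flow_profile {
    unsigned           path_id;             /* path number accumulated so far */
    unsigned long long nanos;               /* time charged to this flow */
};

int Compress_profiled(struct compress_in *in, struct compress_out *out,
                      struct flow_profile *fp)
{
    struct timespec t0, t1;
    fp->path_id += COMPRESS_EDGE_WEIGHT;    /* the one arithmetic operation */
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0);
    int rc = Compress(in, out);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1);
    long long dt = (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                 + (t1.tv_nsec - t0.tv_nsec);
    fp->nanos += (unsigned long long)dt;
    /* when the flow ends, path_id names the unique path and nanos is
       added to that path's running cost and count */
    return rc;
}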
To demonstrate the use of path profiling, we compiled a version of the BitTorrent peer with profiling enabled. For the experiments, we used a patched version of Linux that supports per-thread time gathering. The BitTorrent peer was load-tested with the same tester as in the performance experiments. For profiling, we used loads of 25, 50, and 100 clients. All profiling information was automatically generated from a running Flux server.
In BitTorrent, the most time-consuming path identified by Flux was, unsurprisingly, the file transfer path (Listen -> GetClients -> SelectSockets -> CheckSockets -> Message -> ReadMessage -> HandleMessage -> Request -> MessageDone, 0.295 ms). However, the second most expensive path was the path that finds no outstanding chunk requests (Listen -> GetClients -> SelectSockets -> CheckSockets -> ERROR, 0.016 ms). While this path is relatively cheap compared to the file transfer path, it also turns out to be the most frequently executed path (780,510 times, compared to 313,994 for the file transfer path). Since this path accounts for 13% of BitTorrent's execution time, it is a reasonable candidate for optimization efforts.
6 Developer Experience
In this section, we examine the experience of program-
mers implementing Flux applications. In particular, we
focus on the implementation of the Flux BitTorrent peer.
The Flux BitTorrent peer was implemented by two un-
dergraduate students in less than one week. The students
began with no knowledge of the technical details of the
BitTorrent protocol or the Flux language. The design of
the Flux program for the BitTorrent peer was entirely
their original work. The implementation of the func-
tional nodes in BitTorrent is loosely derived from the
CTorrent source code. The program graph for the Bit-
Torrent server is shown in Figure 6 at the end of this
document.
The students had a generally positive reaction to programming in Flux. Primarily, they felt that organizing the application into a Flux program graph prior to implementation helped modularize their application design and debug server data flow prior to programming. They also found that the exposure of atomicity constraints at the Flux language level allowed for easy identification of the appropriate locations for mutual exclusion. Flux's immunity to deadlock and the simplicity of the atomicity constraints increased their confidence in the correctness of the resulting server.
Though this is only anecdotal evidence, this experience suggests that programmers can quickly gain enough expertise in Flux to build reasonably complex server applications.
7 Related Work
In this section, we discuss the most closely related work
to Flux.
Coordination and data flow languages. Flux is an example of a coordination language [10] that combines existing code into a larger program in a data flow setting. There have been numerous data flow languages proposed in the literature; see Johnston et al. for a recent survey [12]. Data flow languages generally operate at the level of fundamental operations rather than at a functional granularity, although some medium-grained dataflow languages exist (e.g., CODE 2 [7]). Most data flow languages also prohibit global state. Languages that support streaming applications such as StreamIt [21] also share this property, where all data dependencies are expressed in the data flow graph. Flux departs from all of these languages by explicitly supporting safe access to global state via atomicity constraints. Perhaps most significantly, data flow languages focus on extracting parallelism from individual programs, while Flux describes parallelism across multiple clients or event streams.
Programming language constructs. Flux shares certain linguistic concepts with previous and current work in other programming languages. Flux's predicate matching syntax is deliberately based on the pattern-matching syntax used by functional languages like ML, Miranda, and Haskell [11, 18, 22]. The PADS data description language also allows programmers to specify predicate types, although these must be written in PADS itself rather than in an external language like C [8]. Flanagan and Freund present a type inference system that computes atomicity constraints for Java programs that correspond to Lipton's theory of reduction [9, 15]; Flux's atomicity constraints operate at a higher level of abstraction. The Autolocker tool [16], developed independently and concurrently with this work, automatically assigns locks in a deadlock-free manner to manually-annotated C programs. It shares Flux's enforcement of an acyclic locking order and its use of two-phase lock acquisition and release.
Related domain-specific languages. Several previous domain-specific languages allow the integration of off-the-shelf code into data flow graphs, though for different domains. The Click modular router is a domain-specific language for building network routers out of existing C components [13]. Knit is a domain-specific language for building operating systems, with rich support for integrating code implementing COM interfaces [19]. In addition to its linguistic and tool support for programming server applications, Flux ensures deadlock-freedom by enforcing a canonical lock ordering; this is not possible in Click and Knit because they permit cyclic program graphs.
Runtime systems. Researchers have proposed a wide
variety of runtime systems for high-concurrency appli-
cations, including SEDA [24], Hood [6, 1], Capric-
cio [23], libasync/mp [25], Fibers [2], and cohort
scheduling [14]. Users of these runtimes are forced to
implement a server using a particular API. Once imple-
mented, the server logic is generally inextricably linked
to the runtime. By contrast, Flux programs are independent of any particular choice of runtime system, so advanced runtime systems can be integrated directly into Flux's code generation pass.
8 Future Work
We plan to build on this work in several directions. First,
we are actively porting Flux to other architectures, espe-
cially multicore systems. We are also planning to extend
Flux to operate on clusters. Because concurrency con-
straints identify nodes that share state, we plan to use
these constraints to guide the placement of nodes across
a cluster to minimize communication.
To gain more experience with Flux, we are adding
further functionality to the web server. In particular,
we plan to build an Apache compatibility layer so we
can easily incorporate Apache modules. We also plan to
enhance the simulator framework to support per-session
constraints.
The entire Flux system is available for download at flux.cs.umass.edu, along with the Flux-based BitTorrent and web servers described in this paper.
9 Acknowledgments
The authors thank Gene Novark for helping to design the
discrete event simulation generator, and Vitaliy Lvin for
assisting in experimental setup and data gathering.
This material is based upon work supported by the National Science Foundation under CAREER Awards CNS-0347339 and CNS-0447877. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References
[1] U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. In SPAA '00: Proceedings of the Twelfth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1-12, New York, NY, USA, 2000. ACM Press.

[2] A. Adya, J. Howell, M. Theimer, W. J. Bolosky, and J. R. Douceur. Cooperative task management without manual stack management. In Proceedings of the General Track: 2002 USENIX Annual Technical Conference, pages 289-302, Berkeley, CA, USA, 2002. USENIX Association.

[3] A. W. Appel, F. Flannery, and S. E. Hudson. CUP parser generator for Java. http://www.cs.princeton.edu/~appel/modern/java/CUP/.

[4] T. Ball and J. R. Larus. Optimally profiling and tracing programs. ACM Transactions on Programming Languages and Systems, 16(4):1319-1360, July 1994.

[5] E. Berk and C. S. Ananian. JLex: A lexical analyzer generator for Java. http://www.cs.princeton.edu/~appel/modern/java/JLex/.

[6] R. D. Blumofe and D. Papadopoulos. The performance of work stealing in multiprogrammed environments (extended abstract). In SIGMETRICS '98/PERFORMANCE '98: Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pages 266-267, New York, NY, USA, 1998. ACM Press.

[7] J. C. Browne, E. D. Berger, and A. Dube. Compositional development of performance models in POEMS. The International Journal of High Performance Computing Applications, 14(4):283-291, Winter 2000.

[8] K. Fisher and R. Gruber. PADS: a domain-specific language for processing ad hoc data. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 295-304, New York, NY, USA, 2005. ACM Press.

[9] C. Flanagan, S. N. Freund, and M. Lifshin. Type inference for atomicity. In TLDI '05: Proceedings of the 2005 ACM SIGPLAN International Workshop on Types in Languages Design and Implementation, pages 47-58, New York, NY, USA, 2005. ACM Press.

[10] D. Gelernter and N. Carriero. Coordination languages and their significance. Commun. ACM, 35(2):96, 1992.

[11] P. Hudak. Conception, evolution, and application of functional programming languages. ACM Comput. Surv., 21(3):359-411, 1989.

[12] W. M. Johnston, J. R. P. Hanna, and R. J. Millar. Advances in dataflow programming languages. ACM Comput. Surv., 36(1):1-34, 2004.

[13] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263-297, August 2000.

[14] J. R. Larus and M. Parkes. Using cohort scheduling to enhance server performance. In Proceedings of the General Track: 2002 USENIX Annual Technical Conference, pages 103-114, Berkeley, CA, USA, 2002. USENIX Association.

[15] R. J. Lipton. Reduction: a method of proving properties of parallel programs. Commun. ACM, 18(12):717-721, 1975.

[16] B. McCloskey, F. Zhou, D. Gay, and E. Brewer. Autolocker: synchronization inference for atomic sections. In J. G. Morrisett and S. L. P. Jones, editors, POPL, pages 346-358. ACM, Jan. 2006.

[17] Mesquite Software. The CSIM Simulator. http://www.mesquite.com.

[18] R. Milner. A proposal for Standard ML. In LFP '84: Proceedings of the 1984 ACM Symposium on LISP and Functional Programming, pages 184-197, New York, NY, USA, 1984. ACM Press.

[19] A. Reid, M. Flatt, L. Stoller, J. Lepreau, and E. Eide. Knit: Component composition for systems software. In Proceedings of the 4th ACM Symposium on Operating Systems Design and Implementation (OSDI), pages 347-360, Oct. 2000.

[20] Standard Performance Evaluation Corporation. SPECweb99. http://www.spec.org/osg/web99/.

[21] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In International Conference on Compiler Construction, Grenoble, France, Apr. 2002.

[22] D. A. Turner. Miranda: a non-strict functional language with polymorphic types. In Proc. of a Conference on Functional Programming Languages and Computer Architecture, pages 1-16, New York, NY, USA, 1985. Springer-Verlag New York, Inc.

[23] R. von Behren, J. Condit, F. Zhou, G. C. Necula, and E. Brewer. Capriccio: scalable threads for internet services. In SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 268-281, New York, NY, USA, 2003. ACM Press.

[24] M. Welsh, D. Culler, and E. Brewer. SEDA: an architecture for well-conditioned, scalable internet services. In SOSP '01: Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, pages 230-243, New York, NY, USA, 2001. ACM Press.

[25] N. Zeldovich, A. Yip, F. Dabek, R. Morris, D. Mazières, and F. Kaashoek. Multiprocessor support for event-driven programs. In Proceedings of the 2003 USENIX Annual Technical Conference (USENIX '03), San Antonio, Texas, June 2003.
Figure 6: The Flux program graph for the example BitTorrent server. Its source nodes (Listen, TrackerTimer, ChokeTimer, KeepAliveTimer) feed flows through nodes such as CheckinWithTracker, SendRequestToTracker, GetTrackerResponse, Connect, SetupConnection, Handshake, Message, ReadMessage, and HandleMessage, which dispatches on predicate-typed message kinds (bitfield, have, piece, choke, unchoke, interested, uninterested, request, cancel) to the corresponding handler nodes.
