Corey: An Operating System For Many Cores
ABSTRACT

Multiprocessor application performance can be limited by the operating system when the application uses the operating system frequently and the operating system services use data structures shared and modified by multiple processing cores. If the application does not need the sharing, then the operating system will become an unnecessary bottleneck to the application's performance.

This paper argues that applications should control sharing: the kernel should arrange each data structure so that only a single processor need update it, unless directed otherwise by the application. Guided by this design principle, this paper proposes three operating system abstractions (address ranges, kernel cores, and shares) that allow applications to control inter-core sharing and to take advantage of the likely abundance of cores by dedicating cores to specific operating system functions.

Measurements of microbenchmarks on the Corey prototype operating system, which embodies the new abstractions, show how control over sharing can improve performance. Application benchmarks, using MapReduce and a Web server, show that the improvements can be significant for overall performance: MapReduce on Corey performs 25% faster than on Linux when using 16 cores. Hardware event counters confirm that these improvements are due to avoiding operations that are expensive on multicore machines.

1 INTRODUCTION

Cache-coherent shared-memory multiprocessor hardware has become the default in modern PCs as chip manufacturers have adopted multicore architectures. Chips with four cores are common, and trends suggest that chips with tens to hundreds of cores will appear within five years [2]. This paper explores new operating system abstractions that allow applications to avoid bottlenecks in the operating system as the number of cores increases.

Operating system services whose performance scales poorly with the number of cores can dominate application performance. Gough et al. show that contention for Linux's scheduling queues can contribute significantly to total application run time on two cores [12]. Veal and Foong show that as a Linux Web server uses more cores, directory lookups spend increasing amounts of time contending for spin locks [29]. Section 8.5.1 shows that contention for Linux address space data structures causes the percentage of total time spent in the reduce phase of a MapReduce application to increase from 5% at seven cores to almost 30% at 16 cores.

One source of poorly scaling operating system services is use of data structures modified by multiple cores. Figure 1 illustrates such a scalability problem with a simple microbenchmark. The benchmark creates a number of threads within a process, each thread creates a file descriptor, and then each thread repeatedly duplicates (with dup) its file descriptor and closes the result. The graph shows results on a machine with four quad-core AMD Opteron chips running Linux 2.6.25. Figure 1 shows that, as the number of cores increases, the total number of dup and close operations per unit time decreases. The cause is contention over shared data: the table describing the process's open files. With one core there are no cache misses, and the benchmark is fast; with two cores, the cache coherence protocol forces a few cache misses per iteration to exchange the lock and table data. More generally, only one thread at a time can update the shared file descriptor table (which prevents any increase in performance), and the increasing number of threads spinning for the lock gradually increases locking costs. This problem is not specific to Linux, but is due to POSIX semantics, which require that a new file descriptor be visible to all of a process's threads even if only one thread uses it.
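For concreteness, the dup microbenchmark of Figure 1 can be sketched with POSIX threads as follows; the thread count, iteration count, and use of dup(0) are illustrative choices, not the exact parameters behind Figure 1:

    /* Sketch of the dup/close microbenchmark: each thread repeatedly
     * duplicates its own file descriptor and closes the duplicate.
     * All threads share one process-wide file descriptor table, so the
     * kernel must serialize these updates. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 16          /* illustrative: one thread per core */
    #define ITERS    1000000     /* illustrative iteration count */

    static void *worker(void *arg)
    {
        (void)arg;
        int fd = dup(0);                 /* each thread creates its own descriptor */
        if (fd < 0) { perror("dup"); return NULL; }
        for (long i = 0; i < ITERS; i++) {
            int d = dup(fd);             /* allocate a new slot in the shared fd table */
            if (d >= 0)
                close(d);                /* free the slot again */
        }
        close(fd);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Dividing the total number of dup and close operations by the elapsed time gives the throughput plotted in Figure 1.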
Common approaches to increasing scalability include avoiding shared data structures altogether, or designing them to allow concurrent access via fine-grained locking or wait-free primitives. For example, the Linux community has made tremendous progress with these approaches [18].

A different approach exploits the fact that some instances of a given resource type need to be shared, while others do not. If the operating system were aware of an application's sharing requirements, it could choose resource implementations suited to those requirements. A
[Figure 1: Throughput of the dup/close microbenchmark, in 1000s of dup + close operations per second, for 1 to 16 cores.]

[Figure 2: Memory system of the 16-core AMD test machine (four quad-core chips; cores C0-C3 with L2 caches, a shared L3 cache per chip, and DRAM attached to each chip). The figure's numeric labels (e.g., 14/4.44, 50/3.27, 201/0.86) give measured memory access costs.]

mentable without inter-core sharing by default, but allow sharing among cores as directed by applications.

Figure 3: Time required to acquire and release a lock on a 16-core AMD machine when a varying number of cores contend for the lock. The two lines show Linux kernel spin locks and MCS locks (on Corey). A spin lock with one core takes about 11 nanoseconds; an MCS lock about 26 nanoseconds.
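Figure 3's second line is an MCS queue lock; for reference, the core of such a lock looks roughly like the following (a textbook-style sketch after Mellor-Crummey and Scott [21] using C11 atomics, not Corey's implementation):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* One qnode per acquiring core; each core spins only on its own
     * qnode, so waiting cores do not bounce a shared cache line. */
    struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool locked;
    };

    struct mcs_lock {
        _Atomic(struct qnode *) tail;   /* NULL when the lock is free */
    };

    static void mcs_acquire(struct mcs_lock *lk, struct qnode *me)
    {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        /* Append ourselves; the previous tail (if any) is our predecessor. */
        struct qnode *prev = atomic_exchange(&lk->tail, me);
        if (prev == NULL)
            return;                      /* lock was free; we own it */
        atomic_store(&prev->next, me);
        while (atomic_load(&me->locked)) /* spin on our own flag */
            ;
    }

    static void mcs_release(struct mcs_lock *lk, struct qnode *me)
    {
        struct qnode *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* No known successor: try to swing tail back to NULL. */
            struct qnode *expected = me;
            if (atomic_compare_exchange_strong(&lk->tail, &expected, NULL))
                return;
            /* A successor is in the middle of enqueueing; wait for it. */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&succ->locked, false);  /* hand the lock to the successor */
    }

Each waiter spins on a flag in its own queue node rather than on a single shared word, which is why an MCS lock costs somewhat more at one core (about 26 ns versus 11 ns in Figure 3) but behaves better when many cores contend.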
3 DESIGN

Existing operating system abstractions are often difficult to implement without sharing kernel data among cores, regardless of whether the application needs the shared semantics. The resulting unnecessary sharing and contention can limit application scalability. This section gives three examples of unnecessary sharing and for each example introduces a new abstraction that allows the application to decide if and how to share. The intent is that these abstractions will help applications to scale to large numbers of cores.

3.1 Address ranges

Parallel applications typically use memory in a mixture of two sharing patterns: memory that is used on just one core (private), and memory that multiple cores use (shared). Most operating systems give the application a choice between two overall memory configurations: a single address space shared by all cores or a separate address space per core. The term address space here refers to the description of how virtual addresses map to memory, typically defined by kernel data structures and instantiated lazily (in response to soft page faults) in hardware-defined page tables or TLBs. If an application chooses to harness concurrency using threads, the result is typically a single address space shared by all threads. If an application obtains concurrency by forking multiple processes, the result is typically a private address space per process; the processes can then map shared segments into all address spaces. The problem is that each of these two configurations works well for only one of the sharing patterns, placing applications with a mixture of patterns in a bind.

As an example, consider a MapReduce application [8]. During the map phase, each core reads part of the application's input and produces intermediate results; map on each core writes its intermediate results to a different area of memory. Each map instance adds pages to its address space as it generates intermediate results. During the reduce phase each core reads intermediate results produced by multiple map instances to produce the output.

For MapReduce, a single address space (see Figure 4(a)) and separate per-core address spaces (see Figure 4(b)) incur different costs. With a single address space, the map phase causes contention as all cores add mappings to the kernel's address space data structures. On the other hand, a single address space is efficient for reduce because once any core inserts a mapping into the underlying hardware page table, all cores can use the mapping without soft page faults. With separate address spaces, the map phase sees no contention while adding mappings to the per-core address spaces. However, the reduce phase will incur a soft page fault per core per page of accessed intermediate results. Neither memory configuration works well for the entire application.

Figure 4: Example address space configurations for MapReduce executing on two cores. Lines represent mappings. In this example a stack is one page and results are three pages. (a) A single address space. (b) Separate address spaces. (c) Two address spaces with shared result mappings.

We propose address ranges to give applications high performance for both private and shared memory (see Figure 4(c)). An address range is a kernel-provided abstraction that corresponds to a range of virtual-to-physical mappings. An application can allocate address ranges, insert mappings into them, and place an address range at a desired spot in the address space. If multiple cores' address spaces incorporate the same address range, then they will share the corresponding pieces of hardware page tables, and see mappings inserted by each other's soft page faults. A core can update the mappings in a non-shared address range without contention. Even when shared, the address range is the unit of locking; if only one core manipulates the mappings in a shared address range, there will be no contention. Finally, deletion of mappings from a non-shared address range does not require TLB shootdowns.

Address ranges allow applications to selectively share parts of address spaces, instead of being forced to make an all-or-nothing decision. For example, the MapReduce runtime can set up address spaces as shown in Figure 4(c). Each core has a private root address range that maps all private memory segments used for stacks and temporary objects
and several shared address ranges mapped by all other cores. A core can manipulate mappings in its private address ranges without contention or TLB shootdowns. If each core uses a different shared address range to store the intermediate output of its map phase (as shown in Figure 4(c)), the map phases do not contend when adding mappings. During the reduce phase there are no soft page faults when accessing shared segments, since all cores share the corresponding parts of the hardware page tables.
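As an illustration, a runtime could wire up the Figure 4(c) configuration roughly as follows, using the address range and segment calls summarized later in Figure 5. The C types, prototypes, helper names, and sizes here are illustrative guesses; only the call names and argument order follow Figure 5:

    #include <stdint.h>

    typedef uint64_t shareid_t, segid_t, arid_t;   /* 64-bit object IDs (Figure 5) */

    /* Prototypes inferred from Figure 5; exact C types are guesses. */
    arid_t  ar_alloc(shareid_t share, const char *name, uint64_t memid);
    segid_t segment_alloc(shareid_t share, const char *name, uint64_t memid);
    void    ar_set_seg(arid_t ar, uint64_t voff, segid_t seg, uint64_t soff, uint64_t len);
    void    ar_set_ar(arid_t ar, uint64_t voff, arid_t src, uint64_t aoff, uint64_t len);

    #define RESULT_BYTES (16u << 20)               /* illustrative size */

    /* Called on core A; root_b is core B's private root address range and
     * share is a share both cores can reach. */
    void share_map_output(shareid_t share, uint64_t memid_a,
                          arid_t root_a, arid_t root_b, uint64_t result_va)
    {
        /* Core A's map output: a segment mapped through a dedicated,
         * shareable address range. */
        arid_t  results_ar  = ar_alloc(share, "results-a", memid_a);
        segid_t results_seg = segment_alloc(share, "results-a", memid_a);
        ar_set_seg(results_ar, 0, results_seg, 0, RESULT_BYTES);

        /* Both root address ranges point at the same address range object,
         * so they share the corresponding piece of hardware page table and
         * see each other's soft page faults for these pages. */
        ar_set_ar(root_a, result_va, results_ar, 0, RESULT_BYTES);
        ar_set_ar(root_b, result_va, results_ar, 0, RESULT_BYTES);
    }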
3.2 Kernel cores

In most operating systems, when application code on a core invokes a system call, the same core executes the kernel code for the call. If the system call uses shared kernel data structures, it acquires locks and fetches relevant cache lines from the last core to use the data. The cache line fetches and lock acquisitions are costly if many cores use the same shared kernel data. If the shared data is large, it may end up replicated in many caches, potentially reducing total effective cache space and increasing DRAM references.

We propose a kernel core abstraction that allows applications to dedicate cores to kernel functions and data. A kernel core can manage hardware devices and execute system calls sent from other cores. For example, a Web service application may dedicate a core to interacting with the network device instead of having all cores manipulate and contend for driver and device data structures (e.g., transmit and receive DMA descriptors). Multiple application cores then communicate with the kernel core via shared-memory IPC; the application cores exchange packet buffers with the kernel core, and the kernel core manipulates the network hardware to transmit and receive the packets.

This plan reduces the number of cores available to the Web application, but it may improve overall performance by reducing contention for driver data structures and associated locks. Whether a net performance improvement would result is something that the operating system cannot easily predict. Corey provides the kernel core abstraction so that applications can make the decision.
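The shared-memory IPC between application cores and a kernel core can be as simple as single-producer/single-consumer rings in a shared segment; the sketch below shows one direction of such a ring (an illustrative design, not Corey's actual IPC format):

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SLOTS 256                /* power of two, illustrative */

    /* One direction of a request ring living in a segment mapped by both
     * the application core (producer) and the kernel core (consumer).
     * Each side writes only its own index, so the only cross-core traffic
     * is the slots themselves and the two index cache lines. */
    struct ring {
        _Atomic uint32_t head;            /* next slot the producer will fill */
        _Atomic uint32_t tail;            /* next slot the consumer will read */
        struct {
            uint64_t opcode;              /* e.g., "transmit this packet buffer" */
            uint64_t buf_offset;          /* offset of the buffer in a shared segment */
            uint64_t len;
        } slot[RING_SLOTS];
    };

    static int ring_push(struct ring *r, uint64_t op, uint64_t off, uint64_t len)
    {
        uint32_t h = atomic_load(&r->head);
        if (h - atomic_load(&r->tail) == RING_SLOTS)
            return -1;                    /* ring full */
        r->slot[h % RING_SLOTS].opcode = op;
        r->slot[h % RING_SLOTS].buf_offset = off;
        r->slot[h % RING_SLOTS].len = len;
        atomic_store(&r->head, h + 1);    /* publish after the slot is filled */
        return 0;
    }

    static int ring_pop(struct ring *r, uint64_t *op, uint64_t *off, uint64_t *len)
    {
        uint32_t t = atomic_load(&r->tail);
        if (t == atomic_load(&r->head))
            return -1;                    /* ring empty: nothing to do this poll */
        *op  = r->slot[t % RING_SLOTS].opcode;
        *off = r->slot[t % RING_SLOTS].buf_offset;
        *len = r->slot[t % RING_SLOTS].len;
        atomic_store(&r->tail, t + 1);
        return 0;
    }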
3.3 Shares

Many kernel operations involve looking up identifiers in tables to yield a pointer to the relevant kernel data structure; file descriptors and process IDs are examples of such identifiers. Use of these tables can be costly when multiple cores contend for locks on the tables and on the table entries themselves.

For each kind of lookup, the operating system interface designer or implementer typically decides the scope of sharing of the corresponding identifiers and tables. For example, Unix file descriptors are shared among the threads of a process. Process identifiers, on the other hand, typically have global significance. Common implementations use per-process and global lookup tables, respectively. If a particular identifier used by an application needs only limited scope, but the operating system implementation uses a more global lookup scheme, the result may be needless contention over the lookup data structures.

We propose a share abstraction that allows applications to dynamically create lookup tables and determine how these tables are shared. Each of an application's cores starts with one share (its root share), which is private to that core. If two cores want to share a share, they create a share and add the share's ID to their private root share (or to a share reachable from their root share). A root share doesn't use a lock because it is private, but a shared share does. An application can decide for each new kernel object (including a new share) which share will hold the identifier.

Inside the kernel, a share maps application-visible identifiers to kernel data pointers. The shares reachable from a core's root share define which identifiers the core can use. Contention may arise when two cores manipulate the same share, but applications can avoid such contention by placing identifiers with limited sharing scope in shares that are only reachable on a subset of the cores.

For example, shares could be used to implement file-descriptor-like references to kernel objects. If only one thread uses a descriptor, it can place the descriptor in its core's private root share. If the descriptor is shared between two threads, these two threads can create a share that holds the descriptor. If the descriptor is shared among all threads of a process, the file descriptor can be put in a per-process share. The advantage of shares is that the application can limit sharing of lookup tables and avoid unnecessary contention if a kernel object is not shared. The downside is that an application must often keep track of the share in which it has placed each identifier.
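Conceptually, a share is a small table from object IDs to kernel pointers plus a lock that is taken only when the share is reachable from more than one core; a kernel-side sketch might look like this (illustrative layout, not Corey's data structures, with a pthread mutex standing in for a kernel lock):

    #include <pthread.h>   /* stand-in for a kernel spin lock */
    #include <stdbool.h>
    #include <stdint.h>

    /* A share maps application-visible 64-bit object IDs to kernel object
     * pointers. A core's root share is private and needs no lock; a share
     * reachable from several cores' root shares takes its lock on every
     * lookup or update. */
    struct share_entry {
        uint64_t objid;
        void    *obj;                 /* kernel object metadata */
    };

    struct share {
        bool shared;                  /* reachable from more than one core? */
        pthread_mutex_t lock;         /* used only when shared */
        int nentries;
        struct share_entry entry[128];
    };

    void *share_lookup(struct share *s, uint64_t objid)
    {
        void *obj = NULL;
        if (s->shared)
            pthread_mutex_lock(&s->lock);
        for (int i = 0; i < s->nentries; i++) {
            if (s->entry[i].objid == objid) {
                obj = s->entry[i].obj;
                break;
            }
        }
        if (s->shared)
            pthread_mutex_unlock(&s->lock);
        return obj;
    }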
4 COREY KERNEL

Corey provides a kernel interface organized around five types of low-level objects: shares, segments, address ranges, pcores, and devices. Library operating systems provide higher-level system services that are implemented on top of these five types. Applications implement domain-specific optimizations using the low-level interface; for example, using pcores and devices they can implement kernel cores. Figure 5 provides an overview of the Corey low-level interface. The following sections describe the design of the five types of Corey kernel objects and how they allow applications to control sharing. Section 5 describes Corey's higher-level system services.
System call                                      Description
name obj_get_name(obj)                           return the name of an object
shareid share_alloc(shareid, name, memid)        allocate a share object
void share_addobj(shareid, obj)                  add a reference to a shared object to the specified share
void share_delobj(obj)                           remove an object from a share, decrementing its reference count
void self_drop(shareid)                          drop current core's reference to a share
segid segment_alloc(shareid, name, memid)        allocate physical memory and return a segment object for it
segid segment_copy(shareid, seg, name, mode)     copy a segment, optionally with copy-on-write or -read
nbytes segment_get_nbytes(seg)                   get the size of a segment
void segment_set_nbytes(seg, nbytes)             set the size of a segment
arid ar_alloc(shareid, name, memid)              allocate an address range object
void ar_set_seg(ar, voff, segid, soff, len)      map addresses at voff in ar to a segment's physical pages
void ar_set_ar(ar, voff, ar1, aoff, len)         map addresses at voff in ar to address range ar1
ar_mappings ar_get(ar)                           return the address mappings for a given address range
pcoreid pcore_alloc(shareid, name, memid)        allocate a physical core object
pcore pcore_current(void)                        return a reference to the object for current pcore
void pcore_run(pcore, context)                   run the specified user context
void pcore_add_device(pcore, dev)                specify device list to a kernel core
void pcore_set_interval(pcore, hz)               set the time slice period
void pcore_halt(pcore)                           halt the pcore
devid device_alloc(shareid, hwid, memid)         allocate the specified device and return a device object
dev_list device_list(void)                       return the list of devices
dev_stat device_stat(dev)                        return information about the device
void device_conf(dev, dev_conf)                  configure a device
void device_buf(dev, seg, offset, buf_type)      feed a segment to the device object
locality_matrix locality_get(void)               get hardware locality information

Figure 5: Corey system calls. shareid, segid, arid, pcoreid, and devid represent 64-bit object IDs. share, seg, ar, pcore, obj, and dev represent ⟨share ID, object ID⟩ pairs. hwid represents a unique ID for hardware devices and memid represents a unique ID for per-core free page lists.
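To give a sense of how this naming scheme is used, the hypothetical fragment below allocates a segment whose ID initially lives in the calling core's root share and then publishes it through a shared share. The C prototypes and the root_share()/my_memid() helpers are guesses; only the call names and argument order follow Figure 5:

    #include <stdint.h>

    typedef uint64_t shareid_t, segid_t;           /* 64-bit object IDs (Figure 5) */

    shareid_t root_share(void);                    /* hypothetical helpers */
    uint64_t  my_memid(void);
    segid_t   segment_alloc(shareid_t share, const char *name, uint64_t memid);
    void      share_addobj(shareid_t share, segid_t obj);

    /* Publish a segment to other cores through a shared share. 'pub' is a
     * share that several cores have added to their share sets. */
    void publish_buffer(shareid_t pub)
    {
        /* The new segment's ID initially lives only in this core's private
         * root share, so only this core can name it. */
        segid_t seg = segment_alloc(root_share(), "shared-buffer", my_memid());

        /* Adding it to 'pub' lets any core that can reach 'pub' pass the
         * (share ID, object ID) pair to calls such as ar_set_seg. */
        share_addobj(pub, seg);
    }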
4.1 Object metadata

The kernel maintains metadata describing each object. To reduce the cost of allocating memory for object metadata, each core keeps a local free page list. If the architecture is NUMA, a core's free page list holds pages from its local memory node. The system call interface allows the caller to indicate which core's free list a new object's memory should be taken from. Kernels on all cores can address all object metadata since each kernel maps all of physical memory into its address space.

The Corey kernel generally locks an object's metadata before each use. If the application has arranged things so that the object is only used on one core, the lock and use of the metadata will be fast (see Figure 3), assuming they have not been evicted from the core's cache. Corey uses spin lock and read-write lock implementations borrowed from Linux, its own MCS lock implementation, and a scalable read-write lock implementation inspired by the MCS lock to synchronize access to object metadata.

4.2 Object naming

An application allocates a Corey object by calling the corresponding alloc system call, which returns a unique 64-bit object ID. In order for a kernel to use an object, it must know the object ID (usually from a system call argument), and it must map the ID to the address of the object's metadata. Corey uses shares for this purpose. Applications specify which shares are available on which cores by passing a core's share set to pcore_run (see below).

When allocating an object, an application selects a share to hold the object ID. The application uses ⟨share ID, object ID⟩ pairs to specify objects to system calls. Applications can add a reference to an object in a share with share_addobj and remove an object from a share with share_delobj. The kernel counts references to each object, freeing the object's memory when the count is zero. By convention, applications maintain a per-core private share and one or more shared shares.

4.3 Memory management

The kernel represents physical memory using the segment abstraction. Applications use segment_alloc to allocate segments and segment_copy to copy a segment or mark the segment as copy-on-reference or copy-on-write. By default, only the core that allocated the segment can reference it; an application can arrange to share a segment between cores by adding it to a share, as described above.

An application uses address ranges to define its address space. Each running core has an associated root address range object containing a table of address mappings. By convention, most applications allocate a root address range for each core to hold core private
mappings, such as thread stacks, and use one or more address ranges that are shared by all cores to hold shared segment mappings, such as dynamically allocated buffers. An application uses ar_set_seg to cause an address range to map addresses to the physical memory in a segment, and ar_set_ar to set up a tree of address range mappings.
4.4 Execution

Corey represents physical cores with pcore objects. Once allocated, an application can start execution on a physical core by invoking pcore_run and specifying a pcore object, instruction and stack pointer, a set of shares, and an address range. A pcore executes until pcore_halt is called. This interface allows Corey to space-multiplex applications over cores, dedicating a set of cores to a given application for a long period of time, and letting each application manage its own cores.

An application configures a kernel core by allocating a pcore object, specifying a list of devices with pcore_add_device, and invoking pcore_run with the kernel core option set in the context argument. A kernel core continuously polls the specified devices by invoking a device-specific function. A kernel core polls both real devices and special “syscall” pseudo-devices. A syscall device allows an application to invoke system calls on a kernel core. The application communicates with the syscall device via a ring buffer in a shared memory segment.
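Putting these calls together, an application might configure a kernel core roughly as follows; the context layout, the kernel-core flag name, and the helper functions are hypothetical, since the paper does not give the exact C types:

    #include <stdint.h>

    typedef uint64_t shareid_t, pcoreid_t, devid_t; /* 64-bit object IDs (Figure 5) */

    uint64_t my_memid(void);                        /* hypothetical helper */

    /* Hypothetical layout of pcore_run's context argument; the text says it
     * names an instruction and stack pointer, a share set, an address range,
     * and (for kernel cores) a kernel-core option. */
    struct context {
        void    *entry;
        void    *stack;
        unsigned flags;
    };
    #define CONTEXT_KERNEL_CORE 0x1                 /* hypothetical flag */

    pcoreid_t pcore_alloc(shareid_t share, const char *name, uint64_t memid);
    void      pcore_add_device(pcoreid_t pcore, devid_t dev);
    void      pcore_run(pcoreid_t pcore, struct context *ctx);

    /* Dedicate a core to kernel work: it will poll the NIC and a syscall
     * pseudo-device instead of running application code. */
    void start_kernel_core(shareid_t share, devid_t nic, devid_t syscall_dev,
                           void *stack_top)
    {
        pcoreid_t kc = pcore_alloc(share, "kernel-core", my_memid());
        pcore_add_device(kc, nic);
        pcore_add_device(kc, syscall_dev);

        struct context ctx = { .entry = 0, .stack = stack_top,
                               .flags = CONTEXT_KERNEL_CORE };
        pcore_run(kc, &ctx);
    }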
5 SYSTEM SERVICES

This section describes three system services exported by Corey: execution forking, network access, and a buffer cache. These services, together with a C standard library that includes support for file descriptors, dynamic memory allocation, threading, and other common Unix-like features, create a scalable and convenient application environment that is used by the applications discussed in Section 6.

5.2 Network

Applications can choose to run several network stacks (possibly one for each core) or a single shared network stack. Corey uses the lwIP [19] networking library. Applications specify a network device for each lwIP stack. If multiple network stacks share a single physical device, Corey virtualizes the network card. Network stacks on different cores that share a physical network device also share the device driver data, such as the transmit descriptor table and receive descriptor table.

All configurations we have experimented with run a separate network stack for each core that requires network access. This design provides good scalability but requires multiple IP addresses per server and must balance requests using an external mechanism. A potential solution is to extend the Corey virtual network driver to use ARP negotiation to balance traffic between virtual network devices and network stacks (similar to the Linux Ethernet bonding driver).

5.3 Buffer cache

An inter-core shared buffer cache is important to system performance and often necessary for correctness when multiple cores access shared files. Since cores share the buffer cache, they might contend on the data structures used to organize cached disk blocks. Furthermore, under write-heavy workloads it is possible that cores will contend for the cached disk blocks.

The Corey buffer cache resembles a traditional Unix buffer cache; however, we found three techniques that substantially improve multicore performance. The first is a lock-free tree that allows multiple cores to locate cached blocks without contention. The second is a write scheme that tries to minimize contention on shared data using per-core block allocators and by copying application data into blocks likely to be held in local hardware caches. The third uses a scalable read-write lock to ensure blocks are not freed or reused during reads.
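As a rough illustration of the second technique, a write path built on per-core block allocators might look like the sketch below; this is an illustration of the idea, not Corey's buffer cache code, and the shared index step is elided:

    #include <string.h>

    #define BLOCK_SIZE 4096
    #define MAX_CORES  16

    /* Per-core block allocation for the buffer cache write path (sketch).
     * Each core allocates cache blocks from its own free list and copies
     * application data into a block it just touched, so the block is likely
     * to still be in that core's hardware caches when the write completes. */
    struct block {
        struct block *next;
        unsigned long blockno;          /* disk block this buffer caches */
        char data[BLOCK_SIZE];
    };

    static struct block *free_blocks[MAX_CORES];   /* one free list per core */

    static struct block *alloc_block(int core)
    {
        struct block *b = free_blocks[core];
        if (b != NULL)
            free_blocks[core] = b->next;            /* no cross-core locking here */
        return b;                                   /* NULL: fall back to a shared pool */
    }

    /* Copy application data into a locally allocated block; inserting the
     * block into the shared index of cached disk blocks is the (contended,
     * locked) step that follows and is omitted here. */
    struct block *cache_write(int core, unsigned long blockno,
                              const void *src, size_t len)
    {
        struct block *b = alloc_block(core);
        if (b == NULL)
            return NULL;
        b->blockno = blockno;
        memcpy(b->data, src, len < BLOCK_SIZE ? len : BLOCK_SIZE);
        return b;
    }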
achieve good parallel performance. Data-parallel applications fit well with the architecture of multicore machines, because each core has its own private cache and can efficiently process data in that cache. If the runtime does a good job of putting data in caches close to the cores that manipulate that data, performance should increase with the number of cores. The difficult part is the global communication between the map and reduce phases.

We started with the Phoenix MapReduce implementation [22], which is optimized for shared-memory multiprocessors. We reimplemented Phoenix to simplify its implementation, to use better algorithms for manipulating the intermediate data, and to optimize its performance. We call this reimplementation Metis.

Metis on Corey exploits address ranges as described in Section 3. Metis uses a separate address space on each core, with private mappings for most data (e.g., local variables and the input to map), so that each core can update its own page tables without contention. Metis uses address ranges to share the output of the map on each core with the reduce phase on other cores. This arrangement avoids contention as each map instance adds pages to hold its intermediate output, and ensures that the reduce phase incurs no soft page faults while processing intermediate data from the map phase.

6.2 Web server applications

The main processing in a Web server includes low-level network device handling, TCP processing, HTTP protocol parsing and formatting, application processing (for dynamically generated content), and access to application data. Much of this processing involves operating system services. Different parts of the processing require different parallelization strategies. For example, HTTP parsing is easy to parallelize by connection, while application processing is often best parallelized by partitioning application data across cores to avoid multiple cores contending for the same data. Even for read-only application data, such partitioning may help maximize the amount of distinct data cached by avoiding duplicating the same data in many cores' local caches.

The Corey Web server is built from three components: Web daemons (webd), kernel cores, and applications (see Figure 6). The components communicate via shared-memory IPC. Webd is responsible for processing HTTP requests. Every core running a webd front-end uses a private TCP/IP stack. Webd can manipulate the network device directly or use a kernel core to do so. If using a kernel core, the TCP/IP stack of each core passes packets to transmit and buffers to receive incoming packets to the kernel core using a syscall device. Webd parses HTTP requests and hands them off to a core running application code. The application core performs the required computation and returns the results to webd. Webd packages the results in an HTTP response and transmits it, possibly using a kernel core. Applications may run on dedicated cores (as shown in Figure 6) or run on the same core as a webd front-end.

Figure 6: A Corey Web server configuration with two kernel cores, two webd cores, and three application cores. Rectangles represent segments, rounded rectangles represent pcores, and circles represent devices.

All kernel objects necessary for a webd core to complete a request, such as packet buffer segments, network devices, and shared segments used for IPC, are referenced by a private per-webd-core share. Furthermore, most kernel objects, with the exception of the IPC segments, are used by only one core. With this design, once IPC segments have been mapped into the webd and application address ranges, cores process requests without contending over any global kernel data or locks, except as needed by the application.

The application running with webd can run in two modes: random mode and locality mode. In random mode, webd forwards each request to a randomly chosen application core. In locality mode, webd forwards each request to a core chosen from a hash of the name of the data that will be needed in the request. Locality mode increases the probability that the application core will have the needed data in its cache, and decreases redundant caching of the same data.
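Locality mode only needs a deterministic mapping from the name of the requested data to an application core; a minimal sketch follows (the hash function and modulo mapping are illustrative choices):

    #include <stdint.h>

    /* Pick the application core for a request in locality mode: hash the
     * name of the data the request will touch, so requests for the same
     * data always land on the same core and stay warm in its cache. */
    static unsigned locality_core(const char *dataname, unsigned napp_cores)
    {
        uint64_t h = 1469598103934665603ull;        /* FNV-1a, an arbitrary choice */
        for (const char *p = dataname; *p; p++) {
            h ^= (uint8_t)*p;
            h *= 1099511628211ull;
        }
        return (unsigned)(h % napp_cores);
    }

    /* Random mode would instead pick any core, e.g. rand() % napp_cores. */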
7 IMPLEMENTATION

Corey runs on AMD Opteron and Intel Xeon processors. Our implementation is simplified by using a 64-bit virtual address space, but nothing in the Corey design relies on a large virtual address space. The implementation of address ranges is geared towards architectures with hardware page tables. Address ranges could be ported to TLB-only architectures, such as MIPS, but would provide less performance benefit, because every core must fill its own TLB.
The Corey implementation is divided into two parts: the low-level code that implements Corey objects (described in Section 4) and the high-level Unix-like environment (described in Section 5). The low-level code

8 EVALUATION

This section demonstrates the performance improvements that can be obtained by allowing applications to control sharing. We make these points using several microbenchmarks evaluating address ranges, kernel cores, and shares independently, as well as with the two applications described in Section 6.

8.1 Experimental setup

We ran all experiments on an AMD 16-core system (see Figure 2 in Section 2) with 64 Gbytes of memory. We counted the number of cache misses and computed the average latency of cache misses using the AMD hardware event counters. All Linux experiments use Debian Linux with kernel version 2.6.25 and pin one thread to each core. The kernel is patched with perfctr 2.6.35 to allow application access to hardware event counters. For Linux MapReduce experiments we used our version of Streamflow because it provided better performance than other allocators we tried, such as TCMalloc [11] and glibc 2.7 malloc. All network experiments were performed on a gigabit switched network, using the server's Intel Pro/1000 Ethernet device.

We also have run many of the experiments with Corey and Linux on a 16-core Intel machine and with Windows on 16-core AMD and Intel machines, and we draw conclusions similar to the ones reported in this section.

8.2 Address ranges

To evaluate the benefits of address ranges in Corey, we need to investigate two costs in multicore applications where some memory is private and some is shared. First, the contention costs of manipulating mappings for private memory. Second, the soft page-fault costs for memory that is used on multiple cores. We expect Corey to have low costs for both situations. We expect other systems (represented by Linux in these experiments) to have low cost for only one type of sharing, but not both, depending on whether the application uses a single address space shared by all cores or a separate address space per core.

The memclone [3] benchmark explores the costs of private memory. Memclone has each core allocate a 100 Mbyte array and modify each page of its array. The memory is demand-zero-fill: the kernel initially allocates no memory, and allocates pages only in response to page faults. The kernel allocates the memory from the DRAM system connected to the core's chip. The benchmark measures the average time to modify each page. Memclone allocates cores from chips in a round-robin fashion. For example, when using five cores, memclone allocates two cores from one chip and one core from each of the other three chips.
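In outline, each memclone worker does little more than the following (an illustrative sketch: the 100 Mbyte array matches the description above, but the timing code and the lack of core pinning are simplifications):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <time.h>

    #define ARRAY_BYTES (100 * 1024 * 1024)   /* 100 Mbyte per core */
    #define PAGE_BYTES  4096

    /* Per-core memclone worker: allocate a demand-zero-fill array and touch
     * every page, forcing one soft page fault (and one page allocation and
     * page-table update) per page. Returns the average time per page. */
    static double memclone_worker(void)
    {
        char *a = mmap(NULL, ARRAY_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (a == MAP_FAILED)
            return -1.0;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t off = 0; off < ARRAY_BYTES; off += PAGE_BYTES)
            a[off] = 1;                        /* first write faults the page in */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        munmap(a, ARRAY_BYTES);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        return secs / (ARRAY_BYTES / PAGE_BYTES);
    }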
[Figure 7: Address ranges microbenchmark results: (a) memclone and (b) mempass. Time in milliseconds versus the number of cores (1 to 16); the lines compare Linux with a single address space, Linux with separate address spaces, and Corey with address ranges.]

Figure 7(a) presents the results for three situations: Linux with a single address space shared by per-core threads, Linux with a separate address space (process) per core but with the 100 Mbyte arrays mmaped into each process, and Corey with separate address spaces but with the arrays mapped via shared address ranges.

Memclone scales well on Corey and on Linux with separate address spaces, but poorly on Linux with a single address space. On a page fault both Corey and Linux
verify that the faulting address is valid, allocate and clear a 4 Kbyte physical page, and insert the physical page into the hardware page table. Clearing a 4 Kbyte page incurs

until every core touches every page. Mempass measures the total time for every core to touch every page.

Figure 7(b) presents the results for the same configurations used in memclone. This time Linux with a single address space performs well, while Linux with separate address spaces performs poorly. Corey performs well here too. Separate address spaces are costly with this workload because each core separately takes a soft page fault for the shared page.

To summarize, a Corey application can use address ranges to get good performance for both shared and private memory. In contrast, a Linux application can get good performance for only one of these types of memory.

8.3 Kernel cores

This section explores an example in which use of the Corey kernel core abstraction improves scalability. The benchmark application is a simple TCP service, which accepts incoming connections, writing 128 bytes to each connection before closing it. Up to 15 separate client machines (as many as there are active server cores) generate the connection requests.
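The service itself is trivial; per core it is essentially the loop below, shown here against the BSD socket API for clarity (on Corey each core runs the equivalent against its private lwIP stack, and the port and backlog values are arbitrary):

    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Minimal TCP service used by the kernel-core benchmark: accept a
     * connection, write 128 bytes, close it, repeat. */
    static void tcp_service(uint16_t port)
    {
        char payload[128];
        memset(payload, 'x', sizeof(payload));

        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port   = htons(port),
            .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
        };
        bind(s, (struct sockaddr *)&addr, sizeof(addr));
        listen(s, 128);

        for (;;) {
            int c = accept(s, NULL, NULL);
            if (c < 0)
                continue;
            write(c, payload, sizeof(payload));
            close(c);
        }
    }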
We compare two server configurations. One of them (called “Dedicated”) uses a kernel core to handle all network device processing: placing packet buffer pointers in device DMA descriptors, polling for received packets and transmit completions, triggering device transmissions, and manipulating corresponding driver data structures. The second configuration, called “Polling”, uses a kernel core only to poll for received packet notifications and transmit completions. In both cases, each other core runs a private TCP/IP stack and an instance of the TCP service. For Dedicated, each service core uses shared-memory IPC to send and receive packet buffers with the kernel core. For Polling, each service core transmits packets and registers receive packet buffers by directly manipulating the device DMA descriptors (with locking), and is notified of received packets via IPC from the Polling kernel core. The purpose of the comparison is to show the effect on performance of moving all device processing to the Dedicated kernel core, thereby eliminating contention over device driver data structures. Both configurations poll for received packets, since otherwise interrupt costs would dominate performance.

[Figure 8: TCP microbenchmark results; panel (b) shows L3 cache misses.]

Figure 8(a) presents the results of the TCP benchmark. The network device appears to be capable of handling only about 900,000 packets per second in total, which limits the throughput to about 110,000 connections per second in all configurations (each connection involves 4 input and 4 output packets). The dedicated configuration reaches 110,000 with only five cores, while Polling
requires 11 cores. That is, each core is able to handle more connections per second in the Dedicated configuration

[Figure: shares microbenchmark, comparing a single global share with per-core shares.]
[Figure: (a) Corey and Linux performance, time in seconds versus the number of cores (1 to 16).]

Figure 11: A webd configuration with two front-end cores and two filesum cores. Rectangles represent segments, rounded rectangles represent pcores, and the circle represents a network device.
NUMA machine. Like Disco, Corey aims to avoid kernel bottlenecks with a small kernel that minimizes shared data. However, Corey has a kernel interface like an ex-

[Figure: webd throughput in connections per second, comparing Random and Locality modes.]
11 CONCLUSIONS

This paper argues that, in order for applications to scale on multicore architectures, applications must control sharing. Corey is a new kernel that follows this principle. Its address range, kernel core, and share abstractions ensure that each kernel data structure is used by only one core by default, while giving applications the ability to specify when sharing of kernel data is necessary. Experiments with a MapReduce application and a synthetic Web application demonstrate that Corey's design allows these applications to avoid scalability bottlenecks in the operating system and outperform Linux on 16-core machines. We hope that the Corey ideas will help applications to scale to the larger number of cores on future processors. All Corey source code is publicly available.

ACKNOWLEDGMENTS

We thank Russ Cox, Austin Clements, Evan Jones, Emil Sit, Xi Wang, and our shepherd, Wolfgang Schröder-Preikschat, for their feedback. The NSF partially funded this work through award number 0834415.

REFERENCES

[1] J. Aas. Understanding the Linux 2.6.8.1 CPU scheduler, February 2005. http://josh.trancesoftware.com/linux/.

[2] A. Agarwal and M. Levy. Thousand-core chips: the kill rule for multicore. In Proceedings of the 44th Annual Conference on Design Automation, pages 750-753, 2007.

[3] J. Appavoo, D. D. Silva, O. Krieger, M. Auslander, M. Ostrowski, B. Rosenburg, A. Waterland, R. W. Wisniewski, J. Xenidis, M. Stumm, and L. Soares. Experience distributing objects in an SMMP OS. ACM Trans. Comput. Syst., 25(3):6, 2007.

[4] R. Bryant, J. Hawkes, J. Steiner, J. Barnes, and J. Higdon. Scaling Linux to the extreme. In Proceedings of the Linux Symposium 2004, pages 133-148, Ottawa, Ontario, June 2004.

[5] E. Bugnion, S. Devine, and M. Rosenblum. DISCO: running commodity operating systems on scalable multiprocessors. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 143-156, Saint-Malo, France, October 1997. ACM.

[6] M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995.

[7] S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, and C. Wilkerson. Scheduling Threads for Constructive Cache Sharing on CMPs. In Proceedings of the 19th ACM Symposium on Parallel Algorithms and Architectures, pages 105-115. ACM, 2007.

[8] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.

[9] D. R. Engler, M. F. Kaashoek, and J. W. O'Toole. Exokernel: An operating system architecture for application-level resource management. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, pages 251-266, Copper Mountain, CO, December 1995. ACM.

[10] B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: maximizing locality and concurrency in a shared memory multiprocessor operating system. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation, pages 87-100, 1999.

[11] Google Performance Tools. http://goog-perftools.sourceforge.net/.

[12] C. Gough, S. Siddha, and K. Chen. Kernel scalability: expanding the horizon beyond fine grain locks. In Proceedings of the Linux Symposium 2007, pages 153-165, Ottawa, Ontario, June 2007.

[13] K. Govil, D. Teodosiu, Y. Huang, and M. Rosenblum. Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors. In Proceedings of the 17th ACM Symposium on Operating Systems Principles, pages 154-169, Kiawah Island, SC, October 1999. ACM.

[14] W. C. Hsieh, M. F. Kaashoek, and W. E. Weihl. Dynamic computation migration in DSM systems. In Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (CDROM), Washington, DC, USA, 1996.

[15] Intel. Supra-linear Packet Processing Performance with Intel Multi-core Processors. ftp://download.intel.com/technology/advanced_comm/31156601.pdf.

[16] A. Klein. An NUMA API for Linux, August 2004. http://www.firstfloor.org/~andi/numa.html.
[17] Linux kernel mailing list. http://kerneltrap.org/node/8059.

[18] Linux Symposium. http://www.linuxsymposium.org/.

[19] lwIP. http://savannah.nongnu.org/projects/lwip/.

[20] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy update. In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa, Ontario, June 2002.

[21] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9(1):21-65, 1991.

[22] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of the 13th International Symposium on High Performance Computer Architecture, pages 13-24. IEEE Computer Society, 2007.

[23] B. Saha, A.-R. Adl-Tabatabai, A. Ghuloum, M. Rajagopalan, R. L. Hudson, L. Petersen, V. Menon, B. Murphy, T. Shpeisman, E. Sprangle, A. Rohillah, D. Carmean, and J. Fang. Enabling scalability and performance in a large-scale CMP environment. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, pages 73-86, New York, NY, USA, 2007. ACM.

[24] S. Schneider, C. D. Antonopoulos, and D. S. Nikolopoulos. Scalable Locality-Conscious Multithreaded Memory Allocation. In Proceedings of the 2006 ACM SIGPLAN International Symposium on Memory Management, pages 84-94, 2006.

[25] A. Schüpbach, S. Peter, A. Baumann, T. Roscoe, P. Barham, T. Harris, and R. Isaacs. Embracing diversity in the Barrelfish manycore operating system. In Proceedings of the Workshop on Managed Many-Core Systems (MMCS), Boston, MA, USA, June 2008.

[26] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Computer Architecture News, 20(1):5-44.

[27] D. Tam, R. Azimi, and M. Stumm. Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, pages 47-58, New York, NY, USA, 2007. ACM.

[28] uClibc. http://www.uclibc.org/.

[29] B. Veal and A. Foong. Performance scalability of a multi-core web server. In Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems, pages 57-66, New York, NY, USA, 2007. ACM.

[30] B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279-289, New York, NY, USA, 1996. ACM.

[31] K. Yotov, K. Pingali, and P. Stodghill. Automatic measurement of memory hierarchy parameters. SIGMETRICS Perform. Eval. Rev., 33(1):181-192, 2005.