A Case for Scaling Applications to Many-core with OS Clustering

Xiang Song Haibo Chen Rong Chen Yuanxuan Wang Binyu Zang
Parallel Processing Institute, Fudan University
{xiangsong, hbchen, chenrong, yxwang1987, byzang}@fudan.edu.cn

Abstract

This paper proposes an approach to scaling UNIX-like operating systems for many cores in a backward-compatible way, which still enjoys common wisdom in new operating system designs. The proposed system, called Cerberus, mitigates contention on many shared data structures within OS kernels by clustering multiple commodity operating systems atop a VMM, and providing applications with the traditional shared memory interface. Cerberus extends a traditional VMM with efficient support for resource sharing and communication among the clustered operating systems. It also routes system calls of an application among operating systems, to provide applications with the illusion of running on a single operating system.

We have implemented a prototype system based on Xen/Linux, which runs on an Intel machine with 16 cores and an AMD machine with 48 cores. Experiments with an unmodified MapReduce application, dbench, the Apache Web Server and Memcached show that, given the nontrivial performance overhead incurred by the virtualization layer, Cerberus achieves up to 1.74X and 4.95X performance speedup compared to native Linux. It also scales better than a single Linux configuration. Profiling results further show that Cerberus wins due to mitigated contention and more efficient use of resources.

Categories and Subject Descriptors D.4.7 [Operating Systems]: Organization and Design

General Terms Design, Experimentation, Performance

Keywords Multicore, Scalability, OS Clustering

Figure 1. Architecture overview of OS Clustering.

1. Introduction

Scaling UNIX-like operating systems on shared memory multicore or multiprocessor machines has been a goal of system researchers for a long time. Currently, there is a debate on the approach to scaling operating systems: designing new operating systems from scratch (e.g., Corey [Boyd-Wickizer 2008], Barrelfish [Baumann 2009] and fos [Wentzlaff 2008]); or continuing the traditional path of refining commodity kernels by iteratively eliminating bottlenecks using both traditional parallel programming skills and new data structures (e.g., RCU [McKenney 2002], Sloppy Counter [Boyd-Wickizer 2010]). However, with the continual growth of the number of cores in a single machine and the still speculative structure of future many-core machines, there is currently no conclusion on the best long-term direction.

In this paper, we seek to add a point to the debate by evaluating a middle ground between these two trends, motivated by the observation that commodity operating systems can scale well with a small number of CPU cores, and one virtual machine monitor (VMM) can effectively consolidate multiple operating systems. The proposed approach, called OS clustering (shown in Figure 1), is an operating system structuring strategy that attempts to provide a near- or middle-term solution to mitigate the scalability problem of commodity operating systems, yet without the non-trivial testing efforts and possible backward compatibility issues of new operating system designs. The basic idea is clustering multiple commodity operating systems atop a VMM to serve one application, while providing the familiar POSIX programming interface to shared-memory applications. The resulting system, called Cerberus, supports existing many-core applications with little or no porting effort.
The goal of Cerberus is to make a bridge between two different directions (i.e., designing new OSes and refining commodity OSes) of scaling operating systems. On one hand, Cerberus incorporates some common wisdom in new operating system designs, such as state replication and message passing. On the other hand, Cerberus is designed based on reusing commodity operating systems, which means Cerberus may still share the benefits of improvements to commodity operating systems. It should be noted that Cerberus also comes at the cost of increased resource consumption due to the increased number of OS instances. However, for future many-core platforms with likely abundant resources, we believe it is worthwhile to trade resources for scalability.

In general, Cerberus could mitigate or avoid many instances of resource contention within a single operating system as well as in the VMM, due to the reduced number of CPU cores managed by a single operating system kernel. It is also easier for the inter-OS communication protocol to scale with the number of OS instances, rather than with the number of cores. Thus, contention within many subsystems could be mitigated for shared-memory multi-threaded and multiprocessing applications.

As well as state replication and distribution in operating systems and the VMM, Cerberus also retrofits some techniques in new OS designs back to commodity operating systems. Cerberus extends traditional system virtualization techniques with support for efficient resource sharing among the clustered operating systems. Specifically, the VMM is built with the address range support from Corey [Boyd-Wickizer 2008] to minimize the page fault costs for cross-OS memory sharing, which is critical for some memory-intensive applications. Further, to reduce contention for file accesses, Cerberus incorporates an efficient distributed file system among clustered OSes, which optimizes local accesses while maintaining good performance for remote accesses.

Moreover, Cerberus incorporates a system call virtualization layer that allows processes/threads of an application to be executed in multiple operating systems, yet provides users with the illusion of running in a single operating system. This layer relies on both message passing and shared memory mechanisms to route system calls to specific operating systems and marshal the results, thus providing applications with a unified TCP/IP stack and file system. This layer uses the notion of "SuperProcess", which groups processes/threads in multiple operating systems, to manage the spawned processes/threads.

We have implemented a Cerberus prototype based on Xen-3.3.0 [Barham 2003] and Linux-2.6.18, which runs on an Intel machine with 16 cores and an AMD machine with 48 cores. The prototype adds 1,800 lines of code to the Xen VMM and requires no code change to the Linux core kernel. A loadable kernel module and a user-level module are implemented to support the system call virtualization, which has 8,800 lines of code in total.

To measure the effectiveness of Cerberus, we have conducted several performance measurements and compared the performance of a shared memory MapReduce application, dbench [Tridgell 2010], Memcached and the Apache web server running on a single Linux (native Linux and virtualized Linux) and on Cerberus with different numbers of VMs. Performance results show that though Cerberus incurs overhead for some primitives, it does provide better performance scalability. The performance speedup ranges from 1.74X to 4.95X over native Linux and from 1.37X to 11.62X compared to virtualized Linux on 48 cores. The profiling results using OProfile [Levon 2004] and Xenoprof [Menon 2005] indicate that Cerberus mitigates or avoids many instances of contention within both Xen and Linux.

In summary, the contributions of this paper are:

• A technique called OS clustering, which provides a backward-compatible way to scale existing shared memory applications on multicore machines;

• A set of mechanisms to enable efficient sharing of resources among clustered operating systems;

• The design and implementation of our prototype system Cerberus, as well as the evaluation of Cerberus using realistic application benchmarks, which demonstrate both the performance and scalability of our approach.

The rest of the paper is organized as follows. The next section relates Cerberus with previous work on OS scalability. Section 3 provides an overview of the challenges and approaches of Cerberus. Sections 4 and 5 present the design of the two major enabling parts of Cerberus, namely SuperProcess and resource sharing. Then, section 6 describes the implementation details on Xen and Linux. The experimental results are shown in section 7. We present the discussion of the limitations and future work in section 8. Finally, we end this paper with a concluding remark in section 9.

2. Related Work

Improving the scalability of UNIX-like operating systems has been a longstanding goal of system researchers. This section relates Cerberus to other work in operating system scalability.

2.1 OS Structuring Strategies

Cerberus is influenced by much existing work on system virtualization, building new scalable OSes and refining existing OSes. Cerberus differs from existing work mainly in that it aims at improving performance scalability of existing applications by using a backward-compatible technique called OS clustering.
The idea of running multiple operating systems in a single machine is not new, but rather an inherent goal of system virtualization [Goldberg 1974]. For example, Disco [Bugnion 1997] (and its relative Cellular Disco [Govil 1999]) ran multiple virtual machines in the form of a virtual cluster to support distributed applications. Denali [Whitaker 2002] also safely multiplexes a large number of Internet services atop a lightweight virtual machine monitor. Cerberus puts these ideas into the context of multicore architecture, and more importantly supports efficiently running a contemporary shared-memory application with POSIX APIs on multiple clustered operating systems with little or no modification.

One viable way to scale operating systems is partitioning a hardware platform as a distributed system and distributing replicated kernels among the partitioned hardware. Hive [Chapin 1995] uses a strategy called multicellular, which organizes an operating system as multiple independent kernels (i.e., cells) that communicate with each other for resource management to provide better reliability and scalability. Barrelfish [Baumann 2009] tries to scale applications on multicore systems by using a multikernel model, which distributes replicated kernels on multiple cores and uses message-passing instead of shared-memory to maintain their consistency. The factored operating system [Wentzlaff 2008] argues that with the likely abundant cores, it would be more appropriate to space-multiplex cores instead of time-slicing them. Helios [Nightingale 2009] is an operating system that aims at bridging the heterogeneity of different processing units in a platform by using a satellite kernel, which provides the same abstractions across different processor architectures. Cerberus is also influenced by these systems in the use of replicated kernels and state, but retrofits the ideas to commodity operating systems to scale existing shared-memory applications.

Other work has focused on improving OS scalability by controlling or reducing sharing and improving data locality. Corey [Boyd-Wickizer 2008] is an exokernel [Engler 1995] style operating system that provides three new abstractions (share, address range and kernel core) for applications to explicitly control sharing of resources. K42 [Appavoo 2007] and its relatives (Tornado [Gamsa 1999] and Hurricane [Unrau 1995]) are designed to reduce contention and to improve locality for NUMA systems. Cerberus shares some similarities with the clustered objects of K42, but applies the idea at a much higher level (complete operating systems).

2.2 Efforts in Commodity OSes

There are extensive studies on the scalability issues of commercial kernels and many approaches have been proposed to fix them. RCU [McKenney 2002], MCS locks [Mellor-Crummey 1991] and local runqueues [Aas 2005] are strategies that aim at reducing the contention on shared data structures. Recently, Boyd-Wickizer et al. [Boyd-Wickizer 2010] analyzed and fixed the scalability of many-core applications on Linux by refining the kernel and improving applications' user-level design and use of kernel services. Cerberus also aims at improving scalability of applications on Linux, but runs multiple commodity OS instances to host one application instead of refining the core kernel.

In summary, the effort of Cerberus is complementary to the efforts of improving the scalability of existing commercial operating systems. With more scalable operating systems, Cerberus would require fewer operating systems to be clustered together to provide a scalable runtime environment.

3. Overview and Approaches

This section first discusses the challenges and illustrates the solutions to running a shared-memory application on multiple operating systems in a virtualized system. Then, we give an overview of the Cerberus architecture.

3.1 The Case for OS Clustering

To meet its design goals, Cerberus uses pervasive system virtualization. Rather than designing a new OS from scratch or fixing the internal mechanisms of commodity OSes, Cerberus clusters multiple commodity operating systems atop a VMM, and allows an application to run on multiple clustered operating systems with a shared-memory interface. Using multiple federated operating systems hosted by a VMM to run one application means that processes/threads belonging to one shared-memory application now run on multiple OSes. Hence, Cerberus uses a set of mechanisms to ensure system consistency.

Single Shared-Memory Interface: To avoid requiring a port of existing applications, it is critical to provide the existing shared-memory interface to applications. Application programmers using traditional shared-memory APIs (e.g., POSIX) usually assume that their programs run within a single operating system. Thus, a shared-memory application running in an operating system in the form of multiple processes and threads often expects to share a consistent view of system resources. These processes/threads also rely on the operating system interfaces and services to communicate with each other. For example, threads belonging to one process are expected to see the same address space, and processes in one application have parent-child relations and use IPCs to notify each other.

To address these issues, Cerberus incorporates a system-call virtualization layer, which coordinates system calls in multiple clustered operating systems and marshals the results. Cerberus uses the notion of SuperProcess (section 4), which denotes a group of processes/threads executing in multiple operating systems. Each process/thread in a SuperProcess is called a SubProcess. The SuperProcess coordinates the control delivery and data communication of its SubProcesses, to maintain the system's consistency, and provide the application with the illusion of running on one operating system.
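To make the grouping concrete, the following is a minimal C sketch of how a SuperProcess descriptor might record its SubProcesses across clustered VMs. The structure and field names are our own illustration and are not taken from the Cerberus implementation.

    #include <stdio.h>

    #define MAX_SUBPROCS 64

    /* One SubProcess: a process/thread hosted by a particular clustered VM. */
    struct subproc {
        int vm_id;        /* which clustered OS instance hosts it           */
        int real_pid;     /* PID as seen by that OS                         */
        int virtual_pid;  /* PID exposed to the application (section 4.2)   */
    };

    /* A SuperProcess groups all SubProcesses of one application. */
    struct superproc {
        int n;
        int master_vm;                      /* VM running the master daemon */
        struct subproc subs[MAX_SUBPROCS];
    };

    /* Record a newly spawned SubProcess in the group. */
    static void superproc_add(struct superproc *sp, int vm, int rpid, int vpid)
    {
        if (sp->n < MAX_SUBPROCS)
            sp->subs[sp->n++] = (struct subproc){ vm, rpid, vpid };
    }

    int main(void)
    {
        struct superproc sp = { .n = 0, .master_vm = 0 };
        superproc_add(&sp, 0, 1201, 1);   /* initial process on the master VM */
        superproc_add(&sp, 1, 987, 2);    /* remotely spawned SubProcess      */
        for (int i = 0; i < sp.n; i++)
            printf("vpid %d -> VM %d, pid %d\n", sp.subs[i].virtual_pid,
                   sp.subs[i].vm_id, sp.subs[i].real_pid);
        return 0;
    }

Such a descriptor is only a bookkeeping view; the coordination itself is done by the daemons and kernel modules described in the following sections.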
Efficient Resource Sharing: Another challenge is efficient sharing of resources among processes/threads crossing the operating system boundary, to still provide applications with a consistent view of system resources. Unfortunately, traditional VMMs are not built with support for sharing many resources between operating systems, but rather enforce strong isolation among guest operating systems for security reasons.

Hence, Cerberus implements a resource-sharing layer in both the VMM and the OSes, which supports efficient sharing of resources such as address spaces, networking and file systems. The resource-sharing layer exploits the fact that the clustered OSes share the hardware to coordinate accesses to shared resources. Cerberus uses both shared-memory and message-passing mechanisms to coordinate accesses and events among clustered operating systems. To serialize accesses to resources shared by multiple OS instances, it uses message-passing and lock-free mechanisms when necessary. For events among instances of different operating systems, Cerberus uses a two-level message queue to deliver these events. As the system state of an application is replicated and by default private, false sharing and unnecessary serialization can be significantly avoided.

Figure 2. The Cerberus System Architecture. Cerberus is composed of an extension in the VMM, which handles the resource sharing, and a kernel module in the guest OS, which manages the SuperProcess and inter-VM system calls.

3.2 System Architecture

The system architecture of Cerberus is shown in Figure 2. There are several virtual machines running atop a VMM. The VMM manages the underlying hardware resources and partitions the resources among the VMs. Currently, Cerberus requires the VMs to run the same operating system kernel for simplicity. Roughly speaking, Cerberus organizes the SuperProcess in the form of a coordinated distributed system, using both messages and shared memory. Multiple processes of one application run on multiple OSes in the form of a SuperProcess, which consists of one master daemon and multiple slave daemons. There is exactly one slave daemon for an application in each VM not running the master daemon. The master daemon is responsible for loading the initial parts of an application and creating the slave daemons. Afterwards, the master daemon works similarly to the slave daemons, according to the semantics of the application.

The daemons communicate with each other to decide which VMs should serve a process/thread creation request, to balance load among the clustered OSes. To run a process in a VM other than the requesting VM, the SuperProcess daemon issues a remote spawn, which replicates the current running state to the target VM.

Cerberus also routes system calls using the SuperProcess module in each operating system, which is a loadable kernel module. The module intercepts system calls made by an application. To retain the semantics of system calls and a consistent view of the execution context, the module routes the system calls, as well as marshalling and translating the results.

Cerberus uses cross-VM message-passing mechanisms to handle communication between daemons in multiple VMs. A daemon uses the message-passing mechanism to send process/thread creation requests and to signal remote processes/threads. There is also a shared-memory area for data communication among multiple VMs.

Currently, Cerberus decides the number of operating systems to run based on a user-specified heuristic for simplicity (the scalability limit of an application with a certain number of cores). By default, Cerberus allocates a fixed, equal portion of resources to each operating system and lets the application decide the assignment of processes/threads to operating systems. Each operating system is pinned on a fixed number of cores.

In the following sections, we will describe the mechanisms in Cerberus to support SuperProcess (section 4) and efficient sharing of address space, file system and networking among clustered operating systems (section 5).

4. Supporting SuperProcess

This section describes the underlying design to support the SuperProcess abstraction, which provides applications with the illusion of running on a single operating system.

4.1 Remote Process/Thread Spawning

Cerberus uses techniques from traditional process checkpoint/restart mechanisms to support remote process spawning. As shown in Figure 3, Cerberus first checkpoints the state of the current running process, including the register state, memory mappings and opened files, among others.
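As a rough illustration of what such a checkpoint might carry, the C sketch below defines a hypothetical checkpoint record and copies it into a shared buffer. The field layout is our own assumption and is not taken from the Cerberus or Crak code.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define MAX_REGIONS 16
    #define MAX_FILES   16

    /* Hypothetical register snapshot (x86-64 subset, for illustration only). */
    struct regs_snapshot { uint64_t rip, rsp, rbp, rax, rbx, rcx, rdx; };

    struct mem_region { uint64_t start, len; int prot; };          /* one mapping */
    struct open_file  { int fd; char path[64]; uint64_t offset; }; /* one open fd */

    /* A minimal checkpoint record: registers, mappings and open files. */
    struct checkpoint {
        struct regs_snapshot regs;
        int n_regions, n_files;
        struct mem_region regions[MAX_REGIONS];
        struct open_file  files[MAX_FILES];
    };

    /* "Publish" the checkpoint by copying it into a shared memory area that
     * the daemon on the target VM can read (here just a static buffer). */
    static unsigned char shared_area[sizeof(struct checkpoint)];

    static void publish_checkpoint(const struct checkpoint *cp)
    {
        memcpy(shared_area, cp, sizeof(*cp));
    }

    int main(void)
    {
        struct checkpoint cp = { .n_regions = 1, .n_files = 1 };
        cp.regions[0] = (struct mem_region){ 0x400000, 0x1000, 5 /* r-x */ };
        cp.files[0]   = (struct open_file){ 3, "/tmp/input.dat", 0 };
        publish_checkpoint(&cp);
        printf("checkpoint of %d region(s), %d file(s) published\n",
               cp.n_regions, cp.n_files);
        return 0;
    }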
The checkpointed process state is then put into a shared memory area. To spawn a process on behalf of the requesting daemon in a different operating system, the daemon in the target OS first spawns a child process itself. Then, it retrieves the checkpointed state from the shared memory area and restores it to the child process.

Figure 3. The sequence of doing a remote fork/clone.

Currently, many applications use the threaded programming model. As all threads of a threaded application share the same binary image, Cerberus proactively creates a resident process (similar to the dispatcher in K42 [Krieger 2006]) for each clustered operating system, and maintains the consistency of each resident process by propagating changes to the application's global resources. For example, Cerberus automatically propagates memory mapping and unmapping requests in the issuing operating system to other clustered operating systems. Thus, for applications that create a large number of threads, the cost of remote spawning of threads is reduced, as the thread creation requests can be done locally by each resident process.

It should be noted that although creating a remote process or thread in Cerberus is more heavyweight than within a single OS, Cerberus supports parallel fork/clone that allows simultaneously creating processes/threads in multiple VMs, which amortizes the cost of a single operation.

4.2 Process Management

Cerberus relies mainly on the system call interception and redirection mechanisms to group processes distributed across multiple operating systems and to provide correct semantics.

Cerberus virtualizes the process identity (such as the process ID), the parent-child relationship and the group information. To achieve this, Cerberus intercepts the system calls manipulating such information, translates the arguments before dispatching the operations, and marshals the results before returning to applications.

For process IDs, Cerberus maintains a global mapping table between the virtual process ID (seen by applications) and the physical ID (seen by the operating system). Cerberus then relies on the virtual ID to maintain the process relationship. For example, the PID passed by the kill shell command will be translated by the Cerberus system call interception layer. If the virtual PID belongs to the current operating system, the signal will be delivered to the process associated with the real PID. Otherwise, Cerberus will redirect the signal to the corresponding operating system. Cerberus also maintains a logical-to-physical CPU mapping table and provides the correct cores and operating systems to run threads and processes. For example, the pthread library provides interfaces to get and set the affinity (pthread_get/set_affinity) that obtain the set of cores on which a thread can run and assign specific threads to run on some cores; Cerberus translates these calls.

4.3 Coordination of State Accesses

As the state of an application is shared or replicated among multiple clustered operating systems, Cerberus uses lock-free mechanisms and message passing to coordinate changes to the state from each OS. For some shared state among OSes, Cerberus uses compare-and-swap to allow each OS to eagerly access some replicated state such as page table pages. Upon a conflict, Cerberus rolls back the changes to state from one OS. For some shared data structures such as the virtual file descriptor table and inode table, Cerberus partitions these data structures to individual OSes, to avoid access serialization and cache ping-ponging, and uses message passing to coordinate the state.

Cerberus also implements an inter-VM notification mechanism that uses a hierarchical message-passing scheme: when a process notifies processes in other VMs of the occurrence of certain events (e.g., signals, unmap requests), it first sends a message to the SubProcess in that VM. The SubProcess will queue the message marked with the type of the message and deliver the message to the appropriate VM. Then the receiver VM will send the corresponding event to the appropriate threads/processes. For example, for a futex [Franke 2002]1 call on the address of a remote thread, Cerberus will translate the address into the real address to monitor. On being notified by the local operating system about the change of the address, Cerberus will send a message to the receiving thread.

1 A futex allows two entities to synchronize with each other using a shared memory location. The pthread mutex is implemented based on this mechanism.

5. Supporting Resource Sharing

Cerberus supports the efficient sharing of address spaces, file systems and networks across the clustered operating systems, to provide a consistent view for applications.

5.1 Sharing Address Spaces

Cerberus identifies the range of shared address space by interpreting the application's semantics. An application running in multi-threading mode should normally have its address space shared with all threads in a process. A forked process usually shares little with its parent. For a multi-threaded application, Cerberus maintains a global list of the shared address ranges. It intercepts the memory mapping requests (e.g., mmap) from each thread and updates the list accordingly.
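A minimal user-space sketch of this bookkeeping is shown below: shared ranges are recorded when an mmap request is intercepted and queried later (e.g., by a page fault handler). The data structure and function names are hypothetical and only illustrate the idea, not the Cerberus code.

    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    /* One shared address range, kept in a singly linked global list. */
    struct addr_range {
        uint64_t start, len;
        struct addr_range *next;
    };

    static struct addr_range *shared_ranges;   /* global list of shared ranges */

    /* Called when an mmap request for a shared mapping is intercepted. */
    static void range_add(uint64_t start, uint64_t len)
    {
        struct addr_range *r = malloc(sizeof(*r));
        r->start = start;
        r->len = len;
        r->next = shared_ranges;
        shared_ranges = r;
    }

    /* Called on a page fault: is this address inside a shared range? */
    static const struct addr_range *range_lookup(uint64_t addr)
    {
        for (const struct addr_range *r = shared_ranges; r; r = r->next)
            if (addr >= r->start && addr < r->start + r->len)
                return r;
        return NULL;
    }

    int main(void)
    {
        range_add(0x7f0000000000ULL, 1 << 20);   /* a 1 MB shared mapping */
        uint64_t fault = 0x7f0000000800ULL;
        printf("fault at %#llx is %s a shared range\n",
               (unsigned long long)fault,
               range_lookup(fault) ? "inside" : "outside");
        return 0;
    }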
Cerberus creates a virtual memory mapping for each shared address range to let the page fault handler be aware of that address range. When handling a page fault, the Cerberus module first checks for a pending list entry and updates the virtual memory mapping before resolving the faulting address.

To efficiently share an address space across operating systems, Cerberus incorporates the address range abstraction from Corey [Boyd-Wickizer 2008]. This supports sharing a subset of the root page table by multiple guest VMs, according to the address range. The level of page table sharing might be changed according to the virtual memory mapping of an operating system. Cerberus also dynamically coalesces and splits the sharing of page tables according to the application's memory mapping requests. According to the list of shared address ranges, the page fault handler in the VMM will connect the page table of a shared address range in one VM's page table to that in other VMs, when there is a first access to that shared address range in those domains. Cerberus determines the level of sharing based on the size of the address range.

5.2 Sharing File Systems

Running a single application on multiple operating systems raises the problem of sharing files among processes in each clustered operating system. This is because each operating system will have its own file system and device driver, preventing a process from accessing files managed by another operating system. One intuitive approach would be the use of a networked file system managed by one operating system, with other operating systems as NFS clients to access files in the operating system running the NFS server. However, this creates two performance problems. First, all file accesses are now centralized to one operating system, which can easily make those accesses the new performance bottleneck. Second, there are some inherent performance overheads, as a networked file system usually has inferior performance compared to a local one. For example, recent measurements [Nightingale 2005, Zhao 2006] showed that NFS could be several times slower than a native file system such as ext3.

Fortunately, most files in many multiprocessing applications are usually accessed exclusively, with few opportunities to be accessed by multiple processes (except some non-performance-critical ones such as log files)2. Hence, Cerberus uses a hybrid approach of both networked and local file systems, which seeks to give accesses to private files little contention and high performance, while maintaining acceptable performance for shared files.

2 For multi-threaded applications, applications usually map the files into memory using mmap and then threads can modify the memory-mapped file directly, which will be discussed in the following sections.

Figure 4. Architecture of the Cerberus file system (CFS), which is organized as a mesh of networked file systems: each OS manages its local partition and exposes it to other OSes through the CFS client. Cerberus dispatches accesses to files and marshals the results according to the managed metadata.

Figure 4 shows the architecture of our approach, which forms a mesh of networked file systems: each operating system manages a local partition and exposes it to other operating systems through an NFS-like interface; processes in such an operating system access private files directly in the local partition and access files in other partitions through the CFS client. To identify a file as shared or private, Cerberus maintains a mapping from each inode describing a file to the owner ID (e.g., virtual machine ID). As the metadata of the files in each partition is maintained only by one operating system, Cerberus offers a metadata consistency and crash-recovery model similar to native systems. It should be noted that the CFS implemented by Cerberus does not rely on the network but rather on the virtual machine communication interfaces and shared memory for communication. This avoids redundant file data copies and the associated data exchange, and thus is more efficient than NFS [Zhao 2006]. Again, the sharing of a file between the CFS client and CFS server is done using the address range abstraction to minimize soft page faults.

To provide applications with a consistent view of the clustered file system, Cerberus intercepts accesses to the attributes or state of each file and directory, distributes accesses to each partition when necessary, and marshals the results before returning to user applications. Such operations (e.g., listing a directory) are relatively costly compared to those in a single operating system. However, they are rare and usually occur in non-performance-critical paths of applications.
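The sketch below illustrates, in user-space C, how such an inode-to-owner mapping might drive the local/remote dispatch decision. The types and the send_cfs_request stub are hypothetical and stand in for Cerberus's inter-VM messaging.

    #include <stdio.h>

    #define LOCAL_VM_ID 0          /* id of the VM this code runs in (assumed) */
    #define MAX_INODES  128

    /* Per-inode metadata kept by CFS: which VM owns the backing partition. */
    struct cfs_inode {
        unsigned long ino;
        int owner_vm;
    };

    static struct cfs_inode inode_table[MAX_INODES] = {
        { 1001, 0 },   /* a file on the local partition */
        { 2002, 1 },   /* a file owned by VM 1          */
    };

    /* Stub: in the real system this would go through inter-VM messaging. */
    static int send_cfs_request(int owner_vm, unsigned long ino)
    {
        printf("forwarding access to inode %lu to VM %d\n", ino, owner_vm);
        return 0;
    }

    /* Dispatch a file access: serve locally if we own it, forward otherwise. */
    static int cfs_access(unsigned long ino)
    {
        for (int i = 0; i < MAX_INODES; i++) {
            if (inode_table[i].ino != ino)
                continue;
            if (inode_table[i].owner_vm == LOCAL_VM_ID) {
                printf("inode %lu served by the local file system\n", ino);
                return 0;
            }
            return send_cfs_request(inode_table[i].owner_vm, ino);
        }
        return -1;   /* unknown inode */
    }

    int main(void)
    {
        cfs_access(1001);
        cfs_access(2002);
        return 0;
    }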
5.3 Sharing File Contents

For multithreaded applications, it is common that the content of a file is shared by multiple threads. Thus, Cerberus supports the sharing of a file based on the address space sharing in Cerberus to maintain consistency for a file accessed by multiple operating systems. Cerberus uses memory mapped I/O (MMIO) to map a file into a shared address range, which is visible to all threads in multiple operating systems. Cerberus only allows the SuperProcess to access shared files using MMIO. To provide backward compatibility for applications using the traditional read/write APIs, Cerberus handles file I/O to shared files using a similar idea to that in Overshadow [Chen 2008], by translating file-related I/Os to MMIOs. On the first read/write operation to the file, Cerberus maps the file into a shared address space using the mmap system call. Cerberus ensures that the buffer is mapped using the MAP_SHARED flag. Cerberus also ensures that the address range of the memory buffer is shared among clustered operating systems using the address range abstraction. Thus, changes from one operating system will be directly visible to other operating systems. Then, Cerberus emulates the read/write system calls by operating on the mmapped area.

To provide file-I/O semantics, Cerberus maintains a virtual file metadata structure that reflects the logical view of the files seen by a process. Cerberus also virtualizes the system calls that operate on the metadata of files. For example, the fseek system call will advance the file position maintained in the virtualized metadata, and fstat-like system calls return the state kept in the virtualized metadata.

Note that this scheme is transparent to the in-kernel file systems and buffer cache management, as each buffer cache will have a consistent view of the file. The same piece of a file might be replicated among multiple buffer caches, causing wasted memory. However, multiple replicas also increase the concurrency of file access and avoid unnecessary contention.
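As a rough user-space illustration of this read-to-MMIO translation (not the Cerberus code itself), the sketch below backs a read-like call with a MAP_SHARED mapping and a virtual file offset kept in per-file metadata; the example path is arbitrary.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Virtual file metadata: the mapping that backs the file plus a logical
     * file position, as seen by the calling process. */
    struct vfile {
        char  *map;     /* MAP_SHARED mapping of the whole file    */
        size_t size;    /* file size at mmap time                  */
        size_t pos;     /* virtual file offset (advanced by reads) */
    };

    /* Map the file once; later reads are served from the shared mapping. */
    static int vfile_open(struct vfile *vf, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }
        vf->size = (size_t)st.st_size;
        vf->pos  = 0;
        vf->map  = mmap(NULL, vf->size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                       /* the mapping stays valid */
        return vf->map == MAP_FAILED ? -1 : 0;
    }

    /* Emulate read(2): copy from the mapping and advance the virtual offset. */
    static ssize_t vfile_read(struct vfile *vf, void *buf, size_t count)
    {
        if (vf->pos >= vf->size) return 0;          /* end of file */
        if (count > vf->size - vf->pos) count = vf->size - vf->pos;
        memcpy(buf, vf->map + vf->pos, count);
        vf->pos += count;
        return (ssize_t)count;
    }

    int main(void)
    {
        struct vfile vf;
        if (vfile_open(&vf, "/etc/hostname") < 0) { perror("vfile_open"); return 1; }
        char buf[64];
        ssize_t n = vfile_read(&vf, buf, sizeof(buf) - 1);
        if (n > 0) { buf[n] = '\0'; printf("read %zd bytes: %s", n, buf); }
        munmap(vf.map, vf.size);
        return 0;
    }

In the real system the mapping would additionally be placed in a shared address range so that the same emulation works across the clustered OSes, and writes would use a writable MAP_SHARED mapping.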
5.4 Shared Networking Interfaces

To provide applications with a consistent view of networking interfaces, Cerberus exploits the fact that typical servers are usually equipped with multiple NICs, and each NIC is now hardware virtualizable (e.g., through SR-IOV [PCI-SIG 2010]). Hence, Cerberus directly assigns either virtualized NICs or physical NICs to each operating system for high performance. This could avoid contention on TCP/IP stacks if the operations are done on the local (virtual) NICs. To hide such geographic distribution from applications, Cerberus virtualizes the socket interface by intercepting related system calls and relies on the file descriptor virtualization described previously to manage socket descriptors. Cerberus maintains the (virtual) NIC information, and redirects calls that bind to a NIC if necessary. Cerberus then dispatches related operations (e.g., send, receive) to the VM that manages the NIC. The associated data will be exchanged using the shared memory area managed by Cerberus to avoid possible data copies.

6. Prototype Implementation

We have implemented Cerberus based on Xen to run multiple Linux instances with a single shared memory interface, using the shadow mode of page table management in Xen. The system call layer in Cerberus currently supports only a subset of the POSIX interface, but is sufficient to run many applications including shared-memory MapReduce applications, Apache, Memcached and file system benchmarks. For simplicity, Cerberus currently requires applications to be statically linked3, and to link with a small piece of user-level code containing a few Cerberus-specific signal handlers that handle remote requests such as futex and socket operations.

3 This will not increase memory usage much, as application code is shared by default.

6.1 Inter-VM Message Passing

The inter-VM message passing mechanism is implemented by leveraging the cross-VM event channel mechanism in Xen. Cerberus creates a point-to-point event channel between each pair of clustered operating systems. The SuperProcess module inside each operating system has a handler to receive such cross-VM events and distribute them to the receivers. In the case of concurrent cross-VM events, each operating system maintains a cross-event queue to buffer the incoming events, and handles them in order. All cross-VM communication of Cerberus, such as futex and signal operations, uses this mechanism.

6.2 Memory Management

In Cerberus, the sharing of page tables is implemented in the shadow page tables and by manipulating the P2M (physical-to-machine) table, and thus is transparent to guest operating systems. We have also investigated an implementation of page sharing for Xen's direct mode (with writable page tables), with the aim of supporting para-virtualization. However, our preliminary results show that supporting writable page tables could result in significant changes to guest operating systems, as well as incurring non-trivial performance overhead.

On x86-64, Xen uses 4 levels of page tables and Cerberus supports sharing at the lower three levels (i.e., L1-L3). Cerberus records the root page table page for an address range when the guest kernel connects an allocated page table page to the upper-level page table. When sharing a page table page among multiple OSes, one machine page might be accessed by multiple OSes, and thus might correspond to more than one guest-physical page in Xen. Hence, Cerberus creates a per-VM representation of each shared page table, but in an on-demand way. When a VM tries to write a page table page for the first time, Cerberus will create a representation of the page table page in that VM and map it to a single machine page by manipulating the P2M table, which maps guest physical memory to the host machine memory. Cerberus uses compare-and-swap to serialize updates to shared page table pages among multiple VMs: when a VM tries to update the shared page table, it uses a compare-and-swap to see if the entry has already been filled by other VMs, and frees the duplicated page table page if so.
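A minimal sketch of this conflict-detection idea, using the GCC/Clang compare-and-swap builtin on a simulated page-table array, is shown below. This is a user-space analogy, not Xen's actual shadow page table code.

    #include <stdint.h>
    #include <stdio.h>

    #define PT_ENTRIES 512

    /* A simulated page table shared by several "VMs" (threads/instances). */
    static volatile uint64_t shared_pt[PT_ENTRIES];

    /* Try to install a new entry. Returns 1 if we won the race, 0 if another
     * VM already filled the slot (in which case the caller should drop its
     * duplicate page rather than overwrite the shared entry). */
    static int pt_install(int idx, uint64_t new_entry)
    {
        return __sync_bool_compare_and_swap(&shared_pt[idx], 0, new_entry);
    }

    int main(void)
    {
        uint64_t entry_a = 0x1000 | 0x3;   /* page frame + present/writable bits */
        uint64_t entry_b = 0x2000 | 0x3;

        if (pt_install(7, entry_a))
            printf("VM A installed entry %#llx\n", (unsigned long long)entry_a);

        /* VM B loses the race: the slot is already filled, so it would free
         * its duplicated page table page instead of overwriting. */
        if (!pt_install(7, entry_b))
            printf("VM B found slot filled with %#llx, freeing its duplicate\n",
                   (unsigned long long)shared_pt[7]);
        return 0;
    }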
Other than sharing page tables, Cerberus also needs to synchronize the virtual memory area (VMA) mappings across clustered VMs. As threads on different VMs have separate address spaces, they maintain their VMAs individually. Memory management system calls (e.g., mmap) on a single VM only change the VMA mappings of the threads in that VM. Thus, Cerberus intercepts most memory management system calls (e.g., mmap, mremap, mprotect, munmap and brk). Before handling a memory management system call, Cerberus will first force the VM to handle the virtual memory synchronization requests from other VMs. After finishing the call, Cerberus will allow the VM to propagate the system call to all other VMs in the system. This is done by adding a virtual memory synchronization request with appropriate parameters to the request queue of each receiver VM.

6.3 Cerberus File System

Inodes in a Cerberus file system (CFS) are divided into two kinds, namely local inodes and remote inodes. Local inodes describe files on a domain-local file system, and may be accessed directly. Remote inodes correspond to files stored on remote domains. A remote inode can be uniquely identified by its owner domain and its inode number in that domain. When a remote inode is created, CFS will keep track of this unique identifier. Each time a remote inode access is required, CFS will pack the inode identifier and other information into a message, and send it to the remote domain via the inter-VM message passing mechanism.

Another data structure we track is the dentry. A dentry is an object describing relationships between inodes, and storing names of inodes. Unlike inodes, dentries in the original Linux file system do not have identifiers. To simplify remote dentry access, we assign a global identifier to each remote dentry. The dentry id is assigned in a lazy way, that is, only when a dentry is visited from a remote domain for the first time will we assign a global identifier to it.

6.4 Virtualizing Networking

Cerberus virtualizes the socket interface by intercepting the related system calls. The socket operations are divided into two kinds, namely local and remote socket operations. We use virtual file descriptor numbers to distinguish the operations. Each virtual file descriptor number is associated with a virtual file descriptor. The virtual file descriptor describes the owner VM, the responder (a user-level daemon on the owner VM) and the real file descriptor corresponding to it. When a process accesses a virtual file descriptor, Cerberus will first check the corresponding owner VM. If it is a local access, Cerberus just handles the request as in native Xen-Linux using the real file descriptor. Otherwise, Cerberus will send a remote socket operation request to the target VM, and let the responder handle the socket request. With this simple mechanism, Cerberus can currently support several socket-related operations (such as bind, listen, accept, read, write, select, sendto and recvmsg).

6.5 System Call Virtualization

We classify system calls into two types according to which system state they access. The first type includes system calls that only access local state or are stateless (e.g., getting the system time). For such system calls, replicating calls among multiple OSes will not cause state consistency problems, and thus Cerberus does not need to handle them specially. The second type includes system calls that access and modify global state in the operating system (e.g., mmap). Cerberus needs to intercept this kind of system call, coordinate state changes, and marshal the results to support cross-VM interactions. To do the interceptions, the Cerberus module modifies the system call table during loading, changing the function pointers of certain system call handlers to Cerberus-specific handlers. When a system call is invoked, the Cerberus handler checks whether it should be handled by Cerberus, and if so, invokes specific handlers provided by Cerberus.

We have currently virtualized 35 POSIX system calls (belonging to the second type) at either the system call level or the virtual file system level. They are divided into five categories: process/thread creation and exit (e.g., fork, clone, exec, exit, exit_group, getpid and getppid); thread communication (e.g., futex and signal); memory management (e.g., brk, mmap, munmap, mprotect and mremap); network operations (e.g., socket, connect, bind, accept, listen, select, sendto, recvfrom, shutdown and close); and file operations (e.g., open, read, write, mkdir, rmdir, close and readdir). We currently leave system calls related to security, realtime signals, debugging and kernel modules unhandled. In our experience, virtualizing a system call is usually not very difficult, as it mostly involves partitioning/marshaling the associated cross-process state. Table 1 gives some typical examples of how they are implemented.

6.6 Implementation Efforts

In total, the implementation adds 1,800 lines of code to Xen to support the management of Cerberus and efficient sharing of data among SubProcesses in multiple Linux instances. The support for system call interception, SuperProcess and the Cerberus file system is implemented as a loadable kernel module, which is comprised of 8,800 lines of code. It takes 1,250 lines of code to enable SuperProcess management. About 800 lines of code are used to support network virtualization and 750 lines of code to support the Cerberus file system. The Cerberus system call virtualization layer takes about 3,000 lines of code, including marshaling multiple system calls (e.g., clone). The Cerberus system support code consists of 3,000 lines, including the management of the shared memory pool, cross-VM messages and process checkpointing and restoring (including 700 lines of code from Crak [Zhong 2001]).
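Table 1 (below) notes that the virtual PID returned by getpid encodes the domain id and the SubProcess number (section 4.2). One possible encoding is sketched here in C; the particular bit split is our own assumption and is not the one used by Cerberus.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical split: high bits carry the domain id, low bits the
     * SubProcess number within that domain. */
    #define SUBPROC_BITS 16
    #define SUBPROC_MASK ((1u << SUBPROC_BITS) - 1)

    static uint32_t vpid_encode(uint32_t domain_id, uint32_t subproc_no)
    {
        return (domain_id << SUBPROC_BITS) | (subproc_no & SUBPROC_MASK);
    }

    static void vpid_decode(uint32_t vpid, uint32_t *domain_id, uint32_t *subproc_no)
    {
        *domain_id  = vpid >> SUBPROC_BITS;
        *subproc_no = vpid & SUBPROC_MASK;
    }

    int main(void)
    {
        uint32_t vpid = vpid_encode(3, 42);   /* SubProcess 42 on domain 3 */
        uint32_t dom, sub;
        vpid_decode(vpid, &dom, &sub);
        printf("virtual pid %u -> domain %u, SubProcess %u\n", vpid, dom, sub);
        return 0;
    }

With such an encoding, an intercepted kill can decode the domain id from the virtual PID to decide whether to deliver the signal locally or forward it to the owning VM, as described in section 4.2.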
clone: Cerberus first makes sure each VM has the resident process. Then, it queries the SuperProcess daemons for the target domain. Native clone is invoked for a local clone. Otherwise a remote clone request with the marshalled parameters (e.g., stack address) is sent. The resident process on the target domain then creates a new thread.

getpid: Cerberus returns a virtual pid to the caller. The virtual pid contains the domain id and the SubProcess number.

signal: Cerberus scans the mapping between virtual pid and process to find the target domain and process. A remote signal request is sent when necessary. The native signal call is then invoked on the receiver domain.

mmap: Cerberus first handles the VMA synchronization request, and then makes the native mmap call. Finally, it broadcasts the mmap result to other VMs.

accept: Cerberus checks the virtual fd table to get the owner domain of the fd. A remote accept request is sent when necessary. The accept operation is done by the corresponding responder with the real fd, and the resulting virtual fd of the created connection is sent back.

sendto: If the fd does not refer to a remote connection (either the socket is not established or it is a local connection), Cerberus will invoke the native sendto. Otherwise, Cerberus will query the virtual fd table to get the owner domain. A remote sendto request is sent and handled by the corresponding responder with the real fd.

mkdir: Cerberus first gets the global identifiers of the inode and dentry of the parent directory. If it is a local request, Cerberus passes it to the native file system. Otherwise, Cerberus gets the owner domain id and sends a remote CFS request with the global identifiers, the type of the new node (directory in this case) and the directory name. The owner domain creates the new child directory and sends the corresponding global identifiers of the newly created inode and dentry back to the requesting domain.

Table 1. System call implementation examples

7. Experimental Results

This section evaluates the potential costs and benefits in performance and scalability of Cerberus's approach to mitigating contention in operating system kernels.

7.1 Experimental Setup

The benchmarks used include histogram from the Phoenix test suite [Ranger 2007]4, dbench 3.0.4 [Tridgell 2010], Apache web server 2.2.15 [Fielding 2002] and Memcached 1.4.5 [Fitzpatrick 2004]. Moreover, we present the costs of basic operations in Cerberus. We use OProfile to study the time distribution of histogram, dbench, Apache and Memcached on Cerberus, Xen-Linux and Linux.

4 The reason we chose histogram is that it has severe performance scalability problems on our testing machine, which the other programs do not exhibit.

Most experiments were conducted on an AMD 48-core machine with eight 6-core 2.4 GHz AMD Opteron chips. Each core has a separate 128 KByte L1 cache and a separate 512 KByte L2 cache. Each chip has a shared 8-way 6 MByte L3 cache. The size of physical memory is 128 GByte. We use Debian GNU/Linux 5.0, which is installed on a 147 GByte SCSI hard disk with the ext3 file system. There are a total of four network interface cards and each is configured with a different IP in a subnet. The input files and executables for testing are stored on a separate 300 GByte SCSI hard disk with an ext3 file system. The Apache and Memcached benchmarks were conducted on an Intel machine with four quad-core chips and 8 NICs (as it has more NICs than the AMD machine), to reduce the bottlenecks from the NIC itself. Due to resource limitations, we can run up to 24 virtual machines on the AMD machine. All performance measurements were run at least three times and we report the mean.

Cerberus is based on Xen 3.3.0, which by default runs with the Linux kernel version 2.6.18. We thus use kernel version 2.6.18 for the three measured systems. Xen-Linux uses the privileged domain (Dom0) in direct paging mode, for good performance. As Xen-3.3.0 can support at most 32 VCPUs for one VM, we only evaluate Xen-Linux with up to 32 cores.

We compare the performance and scalability of Cerberus with Linux. We also present the performance results of Xen-Linux to show the performance overhead incurred by the virtualization layer, as well as the performance benefit of Cerberus over typical virtualized systems. To investigate the performance gain of Cerberus, we used OProfile and Xenoprof to collect the distribution of time of histogram, dbench, Apache and Memcached on Xen-Linux and Linux, and that on Cerberus using the two cores per-VM configuration. All profiling tests use the CPU_CYCLE_UNHALTED event as the performance counter event.

7.2 Cerberus Primitives

We also wrote a set of microbenchmarks to evaluate the cost of many primitives of Cerberus, to understand the basic costs underlying Cerberus.

Sending Signals: To evaluate the performance of the Cerberus signal mechanism, we use a micro-benchmark to test the time it takes to send a signal using a ping-pong scheme (e.g., sending a signal to a process and that process sending a signal back to the originator) on both the Intel and AMD machines. Table 3 depicts the evaluation results.
                    Intel      AMD
  Native Linux      7.9 ms     4.0 ms
  Xen-Linux         38.7 ms    74.1 ms
  Cerberus local    43.1 ms    72.3 ms
  Cerberus remote   25.8 ms    45.0 ms

Table 3. Cost of ping-ponging 1000 signals

It can be seen that the virtualization layer introduces some overhead to the signal mechanism. However, sending a cross-VM signal takes less time than sending a local signal. There are two reasons: 1) the inter-VM message passing mechanism is efficient; 2) sending a signal to a remote process only needs to forward the request to the target VM, so signaling the target process and executing the sender process can be done in parallel.

Remote Fork and Clone: Table 4 shows the cost of spawning one process/thread on a remote VM on the AMD machine with 2 VMs, and of concurrently spawning 24 processes/threads on remote VMs on the AMD machine with 24 VMs. Cerberus suffers from some overhead due to checkpointing, transferring and restoring process/thread state from the issuing VM to the receiving VMs. However, with increasing numbers of VMs, the steps of creating remote threads can be processed in parallel. This helps to reduce some of the overhead of creating threads, as shown in the table.

  Primitive      Config          Time
  remote fork    1 process       5.40 ms
                 24 processes    31.77 ms
  remote clone   1 thread        3.21 ms
                 24 threads      30.79 ms

Table 4. The costs of fork and clone in Cerberus

Inter-VM Message Passing: To evaluate the costs of inter-VM message passing, we pass a message between VMs in order using a ping-pong scheme, e.g., sending a message to a VM and the VM responding by sending a message back to the sender. The time for one round trip is around 10.24 us within the same chip and 11.34 us between chips, which we believe is modest and acceptable.

Reading a File with CFS: To evaluate the performance of CFS, we wrote a micro-benchmark to test the time it costs to read the beginning portion of a simple file on the AMD machine. We generate one hundred files with random content, clear the buffer cache, read the first ten bytes of each file, and then calculate the average execution time. The result shows that one read operation on a native Xen-Linux file takes 6.47 us, and one read operation on a local CFS file takes 6.52 us, while on a remote CFS file it costs 17.81 us. The performance of local read operations on CFS is close to that of the native Xen-Linux file system. However, remote read operations introduce some performance overhead.

Sending and Receiving Packets: To evaluate the performance of the Cerberus network system, we use a micro-benchmark to test the time for sending and receiving network packets, using a ping-pong scheme on the Intel machine. The micro-server establishes a network connection with the client and creates a child to handle the following requests. The client sends an 8-byte string to the server through a socket connection (localhost/remote host) to trigger the test. Table 2 depicts the evaluation results. It shows the execution time of ping-ponging one message 1000 times under different configurations. It can be seen that the virtualization layer introduces some overhead for sending and receiving packets, while forwarding a packet in Cerberus introduces more overhead. However, if the connection is from a remote host, the overhead of the packet forwarding is below 25% compared to native Linux.

                    localhost    remote host
  Native Linux      12.5 ms      125.5 ms
  Xen-Linux         42.9 ms      132.6 ms
  Cerberus local    43.1 ms      131.8 ms
  Cerberus remote   87.1 ms      154.7 ms

Table 2. Cost of ping-ponging one packet 1000 times

7.3 Performance Results

For histogram and dbench, we ran each workload on the AMD 48-core machine under Xen-Linux and Linux with the number of cores increasing from 2 to 48. For Cerberus, we evaluate two configurations, which run one and two cores for each VM (Cerberus-1core and Cerberus-2cores), running on a different number of cores, increasing from 2 to 48. When running one virtual machine on two cores, we configured each VM with cores that have minimal communication costs (e.g., sharing the L3 cache). For the Apache web server and Memcached service benchmarks, we run each workload on the Intel 16-core machine under Cerberus, Xen-Linux and Linux with different numbers of cores, increasing from 2 to 16. As both applications require a relatively large number of NICs, we did not test them on the 48-core AMD machine. During the Apache and Memcached tests, we set up one instance of the web server on each core, which accepts service requests from clients running on a pool of 16 dual-core machines (32 clients for Apache and 64 clients for Memcached).

Histogram: Figure 5 shows the performance and scalability of histogram processing 4 GByte of data on Cerberus, native Linux and Xen-Linux. All input data is held in an in-memory tmpfs to avoid applications being bottlenecked by disk I/O. Cerberus performs significantly worse than Linux for a small number of cores, due to the performance overhead in shadow page management and the inherent virtualization overhead. However, as the number of cores increases, the execution time of histogram eventually decreases and outperforms native Linux.
Figure 5. The execution time and speedup of histogram on Cerberus compared to those on Linux and Xen-Linux under two configurations, which use 1 and 2 cores/domain respectively.

The speedup of Cerberus over Xen-Linux is around 51% on 24 cores for the one core per-VM configuration, and 30% on 30 cores for the two cores per-VM configuration. The speedup over Linux is around 43% on 24 cores for one core per-VM, and 37% on 48 cores for two cores per-VM. The performance of two cores per-VM is worse than that of one core per-VM, due to the increased contention on the shadow page table inside Xen. The speedup of Cerberus degrades a little (57% vs. 37%) from 42 cores to 48 cores, probably because the costs of creating threads and communication increase, and thus the benefit degrades.

Table 5 shows the top 3 hottest functions in the profiling report of the histogram benchmark. Linux suffers from contention in up_read and down_read_trylock due to memory management. Xen-Linux spends most of its time at address 0x0 (/vmlinux-unknown) when the number of cores exceeds eight5, which might be used for a para-virtualized kernel to interact with the hypervisor. However, Cerberus does not encounter the contention seen in Linux and Xen-Linux; the time spent in the lock-free implementation (cmpxchg) increases a little with the increasing number of cores.

5 The profiling results are obtained through Xenoprof using the CPU_CYCLE_UNHALTED event.

Figure 6. The throughput and speedup of dbench on Cerberus compared to those on Linux and Xen-Linux under two configurations, which use 1 and 2 cores/domain respectively.

dbench: Figure 6 depicts the throughput and speedup of dbench on Cerberus over Xen-Linux and Linux. The throughput of dbench on Xen-Linux and Linux degrades dramatically when the number of cores increases from 6 to 12 and degrades slightly afterwards. By contrast, though the throughput of Cerberus is worse than that on Linux for a small number of cores (1-6), its throughput scales well to 18 cores and 12 cores for the one and two cores per-VM configurations respectively. It appears that dbench has reached its maximum throughput here and has no further space for improvement. Starting from 12 cores or 18 cores, the throughput degrades slightly due to the increased process creation and inter-VM communication costs. Again, the one core per-VM configuration is slightly better than the two cores per-VM configuration, due to the per-VM lock on the shadow page table. In total, the speedup is 4.89X for the one core per-VM configuration on 24 cores, 4.95X for 42 cores, and 4.61X for 48 cores.

Table 6 shows the top 3 hottest functions in the profiling report of the dbench benchmark. We ignore the portion of samples related to mwait_idle, as it means the CPU has nothing to do. From the table we can see that Linux and Xen-Linux both spend substantial time in ext3 file system operations, which may be the reason for poor scalability. On the other hand, Cerberus does not encounter such scalability problems, but is slightly affected by the shadow paging mode.

The evaluation on histogram and dbench also shows that these applications poorly utilize multicore resources when the number of cores reaches a certain level. This indicates that horizontally allocating more cores to such applications may not be a good idea. Instead, allocating a suitable number of cores to such applications could result in better utilization and a better performance tradeoff.
            Threads   Top 3 Functions                    Percent
  Linux     48        up_read                            38.6%
                      down_read_trylock                  35.9%
                      calc_hist                          8.3%
            1         calc_hist                          81.2%
                      find_busiest_group                 0.06%
                      page_fault                         0.03%
  Xen-Linux 32        /vmlinux-unknown                   70.9%
                      calc_hist                          11.6%
                      handle_mm_fault                    3.2%
            1         calc_hist                          60.3%
                      handle_mm_fault                    3.6%
                      sh_gva_to_gfn_guest_4              2.7%
  Cerberus  2/VM      calc_hist                          22.5%
                      sh_x86_emulate_cmpxchg_guest_2     8.9%
                      /xen-unknown                       8.3%

Table 5. The summary of the top 3 hottest functions in histogram benchmark profiling.

            Threads   Top 3 Functions                    Percent
  Linux     48        ext3_test_allocatable              66.6%
                      bitmap_search_next_usable_block    18.2%
                      journal_dirty_metadata             0.02%
            1         /lib/libc-2.7.so                   20.7%
                      copy_user_generic                  14.1%
                      d_lookup                           0.03%
  Xen-Linux 32        ext3_test_allocatable              59.7%
                      bitmap_search_next_usable_block    17.7%
                      /vmlinux-unknown                   5.99%
            1         /lib/libc-2.7.so                   13.7%
                      copy_user_generic                  9.9%
                      d_lookup                           4.1%
  Cerberus  2/VM      sh_x86_emulate_cmpxchg_guest_2     11.2%
                      /xen-unknown                       8.67%
                      sh_x86_emulate_write_guest_2       5.2%

Table 6. The summary of the top 3 hottest functions in dbench benchmark profiling.

Figure 7. The per-core throughput of Apache on Cerberus compared to those on Linux and Xen-Linux.

Apache Web Server: Figure 7 shows the per-core throughput of Apache on the Intel 16-core machine under Cerberus, Xen-Linux and Linux. There are a total of eight NICs, and each is configured with a different IP in a subnet. We run one web server instance on each core and share one NIC between two web servers. The throughput of Apache on Linux significantly degrades with the growing number of cores. When evaluating Cerberus, we directly assign the 8 NICs to 8 different VMs (using PCI passthrough). The per-core throughput on 16 cores is only 1085 requests/sec for Linux, which is 12.1% of that on 1 core. By contrast, the throughput of Cerberus is quite stable. Although Cerberus performs worse than Linux for a small number (1-2) of cores (4603 vs. 8118 requests/sec on 2 cores), it outperforms Linux when the number of cores exceeds 4 and scales nearly linearly. Cerberus achieves a speedup of 3.49X and 3.53X over Linux and Xen-Linux (3833 vs. 1099 and 1085 requests/sec).

The profiling of Apache shows that more CPU time is spent idle as the number of cores used to host web servers increases, and that there is some load imbalance. Oprofile shows that the same server instance takes 2.57X more CPU cycles under the 1-core configuration than under the 16-core configuration. The same scenario also appears in Xen-Linux (2.39X). This may be caused by contention in the network layer in Linux and Xen-Linux. Cerberus, however, does not encounter such a problem and can fully utilize its CPU resources. This evaluation shows that Cerberus can also avoid some of the imbalance caused by Linux, and achieve more efficient use of resources.

Figure 8. The per-core throughput of Memcached on Cerberus compared to that on Linux and Xen-Linux.

Memcached: Figure 8 shows the average throughput of the Memcached server on the Intel 16-core machine under Cerberus, Xen-Linux and Linux. The configuration is similar to that of Apache: we run one Memcached server instance on each core and share one NIC between two servers listening on different UDP ports. The throughput of the Memcached server on Linux significantly degrades when the number of cores exceeds 4. By contrast, the throughput of Cerberus does not degrade until Memcached instances start to provide service on the same VM, as two instances on one VM affect each other heavily; even then, Cerberus still outperforms Xen-Linux and Linux.
The profiling of Memcached shows that many CPU cycles are spent polling network events. Further per-CPU profiling shows that a few Memcached instances spend much time in the ep_poll_callback and task_rq_lock functions, and seem to block other instances.
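For reference, the sketch below (plain C, illustrative only; it is not Memcached's actual code) shows the epoll-driven event loop pattern that such servers use. Each wake-up from epoll_wait passes through ep_poll_callback and the scheduler's task_rq_lock inside the kernel, which is where the samples above concentrate.

#include <sys/epoll.h>
#include <stdio.h>
#include <stdlib.h>

static void handle_request(int fd)
{
    (void)fd;   /* placeholder for per-connection protocol handling */
}

static void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) { perror("epoll_create1"); exit(1); }

    struct epoll_event ev;
    ev.events  = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event ready[64];
    for (;;) {
        /* Sleep until the kernel reports ready sockets; the wake-up path
         * (ep_poll_callback, then try_to_wake_up taking the runqueue lock)
         * is what the per-CPU profile exposes. */
        int n = epoll_wait(epfd, ready, 64, -1);
        for (int i = 0; i < n; i++)
            handle_request(ready[i].data.fd);
    }
}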
We also evaluated histogram, dbench, Apache and Memcached on other Linux versions (Linux 2.6.26, the standard kernel for Debian GNU/Linux, and Linux 2.6.35, the newest stable kernel). Only the scalability of histogram improves in Linux 2.6.35; the others still suffer from heavy contention and have similar performance and scalability.

Performance of Different Configurations: We also measured the performance of different numbers of cores per-VM using 48 cores. As shown in Table 7, the performance actually degrades when the number of cores per-VM increases. The degradation is especially significant due to heavy contention on shadow page table management, with the execution time increasing more than 12X (10.624s vs. 0.860s) when the number of cores per-VM increases from 2 to 8. The evaluation shows that Cerberus does not rely on the scalability of the VMM and can also mitigate performance scalability problems within the VMM when configured properly.

  #Cores/VM   Histogram (sec)   Dbench (MB/sec)
  2           0.860             2123.6
  4           1.130             1805.0
  8           10.624            1273.8

Table 7. Performance of histogram and dbench with different numbers of cores per-VM.

Performance Comparison with Xen-Linux Shadow Mode: As Cerberus is based on the shadow mode of Xen-Linux, we also give a performance comparison with Xen-Linux for reference. We used Domain0 in shadow mode and in direct mode, running 32 virtual cores, to run histogram and dbench. Due to heavy contention in shadow mode, Xen-Linux experiences extremely bad performance, spending about 246.54s on histogram and achieving only 3.4 MB/s throughput for dbench. In direct mode, by contrast, the execution time for histogram is 1.80s and the throughput for dbench is 246.54 MB/s. Hence, when running parallel workloads on multiple cores, it is better to use direct mode rather than shadow mode. The performance evaluation also shows that Cerberus not only mitigates the contention within operating systems, but also reduces the contention from multiple cores accessing the shared state owned by a single virtual machine (i.e., shadow page management).
8. Discussion and Future Work

Though Cerberus has demonstrated the applicability of scaling applications with OS clustering, there are still ample optimization and research opportunities remaining. We describe our current limitations as well as possible extensions.

Viability of Our Approach: Our approach is not a panacea for the scalability of applications on multicore, but is only effective in specific scenarios where applications themselves have good parallelism and do not communicate intensively. Specifically, Cerberus might not show performance advantages in the following scenarios. First, applications that clone a number of short-lived, intensively-communicating threads/processes will probably not benefit from our approach, due to the relatively expensive cost of message passing and thread creation. Second, as remote network and remote file access introduce overhead in Cerberus, applications with frequent remote resource access might experience degraded performance. Finally, applications with frequent small-size memory mapping operations (e.g., mmap, mremap) will stress the current synchronization mechanism for virtual memory in Cerberus and might suffer some performance degradation.
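As an illustration of the mapping-intensive pattern meant here, the following hypothetical C microbenchmark (not one of our evaluated workloads) maps and unmaps a small anonymous region in a tight loop; under Cerberus, every iteration would exercise the cross-VM virtual-memory synchronization path.

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    const size_t len = 4096;                /* one small page per mapping */
    for (int i = 0; i < 1000000; i++) {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        ((char *)p)[0] = 1;                 /* touch the page */
        munmap(p, len);                     /* immediately tear it down */
    }
    return 0;
}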
Application Cooperation: To retain application transparency, Cerberus relies on some relatively expensive operations (such as inter-VM fork/clone) to support cross-OS execution of an application. In our future work, we would like to investigate ways of adding appropriate application programming interfaces and libraries to let applications cooperate with Cerberus, thus further reducing the performance overhead. For example, it would be interesting to let user applications explicitly specify which address space range should share the page table, to avoid unnecessary serialization and contention. Moreover, in a fork-intensive application, it would be beneficial for applications to direct Cerberus on which parts need to be checkpointed.
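To make the proposal concrete, the declarations below sketch one possible shape for such an interface; the function names (cerberus_share_range, cerberus_mark_checkpoint) and their semantics are hypothetical and do not exist in the current prototype.

/* A sketch of what an application-cooperation interface might look like.
 * These declarations are hypothetical; no such API exists in the current
 * Cerberus prototype. They only illustrate the kind of hints discussed above. */
#include <stddef.h>

/* Hint that the range [addr, addr + len) should share a single page table
 * across the clustered VMs, so per-mapping synchronization can be skipped. */
int cerberus_share_range(void *addr, size_t len);

/* Hint that only [addr, addr + len) needs to be captured when a process is
 * checkpointed ahead of an inter-VM fork/clone. */
int cerberus_mark_checkpoint(void *addr, size_t len);

/* Possible use in a fork-intensive application (again, hypothetical):
 *
 *   cerberus_share_range(heap_base, heap_size);
 *   cerberus_mark_checkpoint(state_buf, state_len);
 *   pid_t child = fork();   // Cerberus may route the child to another VM
 */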
Hardware-assisted Virtualization: Currently, Cerberus is implemented on a hardware platform without hardware-assisted virtualization, and thus comes with the associated (usually non-trivial) overhead of virtualization. However, hardware-assisted virtualization techniques such as Intel VT-x and AMD SVM with extended page tables or nested page tables are commercially available. Our future work includes incorporating hardware-assisted virtualization to reduce the virtualization overhead, thus further enlarging the performance benefits of Cerberus.

Fault Tolerance: Currently, Cerberus does not provide fault tolerance to applications. When running an application across multiple VMs, it would be desirable that, if one process fails, processes in other VMs could take over its tasks and proceed as if the failure had never happened. In Cerberus today, however, if one process of a SuperProcess fails in one VM, it is uncertain what happens to the other processes on the other VMs.
9. Conclusions

Scaling operating systems on many-core systems is a critical issue for researchers and developers to fully harness the likely abundant future processing resources. This paper has presented Cerberus, a system that runs a single many-core application on multiple commodity operating systems, yet provides applications with the illusion of running on a single operating system. Cerberus has the potential to mitigate the pressure of applications on the efficiency of operating systems managing resources on many cores. Cerberus is enabled by retrofitting a number of new design techniques back to commodity operating systems to mitigate contention and to support efficient resource sharing. A system call virtualization layer coordinates accesses from process instances in clustered operating systems to ensure state consistency. Experiments with four applications on a 48-core AMD machine and a 16-core Intel machine show that Cerberus outperforms native Linux for a relatively large number of cores, and also scales better than Linux.

10. Acknowledgments

We thank our shepherd Andrew Baumann and the anonymous reviewers for their detailed and insightful comments. This work was funded by the China National Natural Science Foundation under grant 61003002, a grant from the Science and Technology Commission of Shanghai Municipality numbered 10511500100, the China National 863 program numbered 2008AA01Z138, a research grant from Intel as well as a joint program between the China Ministry of Education and Intel numbered MOE-INTEL-09-04, the Fundamental Research Funds for the Central Universities in China, and the Shanghai Leading Academic Discipline Project (Project Number: B114).
References

[Aas 2005] Josh Aas. Understanding the Linux 2.6.8.1 CPU scheduler. http://joshaas.net/linux/linux_cpu_scheduler.pdf, February 2005.

[Appavoo 2007] Jonathan Appavoo, Dilma Da Silva, Orran Krieger, Marc Auslander, Michal Ostrowski, Bryan Rosenburg, Amos Waterland, Robert W. Wisniewski, Jimi Xenidis, Michael Stumm, and Livio Soares. Experience distributing objects in an SMMP OS. TOCS, 25(3):6, 2007.

[Barham 2003] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proc. SOSP, pages 164-177, 2003.

[Baumann 2009] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schuepbach, and Akhilesh Singhania. The multikernel: A new OS architecture for scalable multicore systems. In Proc. SOSP, 2009.

[Boyd-Wickizer 2008] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, and Zheng Zhang. Corey: An operating system for many cores. In Proc. OSDI, 2008.

[Boyd-Wickizer 2010] Silas Boyd-Wickizer, Austin Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An analysis of Linux scalability to many cores. In Proc. OSDI, 2010.

[Bugnion 1997] E. Bugnion, S. Devine, and M. Rosenblum. DISCO: running commodity operating systems on scalable multiprocessors. In Proc. SOSP, pages 143-156, 1997.

[Chapin 1995] John Chapin, Mendel Rosenblum, Scott Devine, Tirthankar Lahiri, Dan Teodosiu, and Anoop Gupta. Hive: Fault containment for shared-memory multiprocessors. In Proc. SOSP, 1995.

[Chen 2008] X. Chen, T. Garfinkel, E.C. Lewis, P. Subrahmanyam, C.A. Waldspurger, D. Boneh, J. Dwoskin, and D.R.K. Ports. Overshadow: a virtualization-based approach to retrofitting protection in commodity operating systems. In Proc. ASPLOS, pages 2-13, 2008.

[Engler 1995] Dawson R. Engler, M. Frans Kaashoek, and James W. O'Toole. Exokernel: An operating system architecture for application-level resource management. In Proc. SOSP, pages 251-266, 1995.

[Fielding 2002] R.T. Fielding and G. Kaiser. The Apache HTTP server project. IEEE Internet Computing, 1(4):88-90, 2002.

[Fitzpatrick 2004] B. Fitzpatrick. Distributed caching with memcached. Linux Journal, 2004.

[Franke 2002] H. Franke, R. Russell, and M. Kirkwood. Fuss, futexes and furwocks: Fast userlevel locking in Linux. In Proceedings of the Ottawa Linux Symposium, 2002.

[Gamsa 1999] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm. Tornado: maximizing locality and concurrency in a shared memory multiprocessor operating system. In Proc. OSDI, 1999.

[Goldberg 1974] R.P. Goldberg. Survey of virtual machine research. IEEE Computer, 7(6):34-45, 1974.

[Govil 1999] Kinshuk Govil, Dan Teodosiu, Yongqiang Huang, and Mendel Rosenblum. Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors. In Proc. SOSP, pages 154-169, 1999.

[Krieger 2006] O. Krieger, M. Auslander, B. Rosenburg, R.W. Wisniewski, J. Xenidis, D. Da Silva, M. Ostrowski, J. Appavoo, M. Butrico, M. Mergen, et al. K42: building a complete operating system. ACM SIGOPS Operating Systems Review, 40(4):145, 2006.

[Levon 2004] John Levon. OProfile Manual. Victoria University of Manchester, 2004. http://oprofile.sourceforge.net/doc/.

[McKenney 2002] Paul E. McKenney, Dipankar Sarma, Andrea Arcangeli, Andi Kleen, Orran Krieger, and Rusty Russell. Read-copy update. In Proceedings of the Linux Symposium, pages 338-367, 2002.

[Mellor-Crummey 1991] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, 1991.

[Menon 2005] A. Menon, J.R. Santos, Y. Turner, G.J. Janakiraman, and W. Zwaenepoel. Diagnosing performance overheads in the Xen virtual machine environment. In Proc. VEE, 2005.

[Nightingale 2005] E.B. Nightingale, P.M. Chen, and J. Flinn. Speculative execution in a distributed file system. In Proc. SOSP, pages 191-205, 2005.

[Nightingale 2009] E.B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt. Helios: Heterogeneous multiprocessing with satellite kernels. In Proc. SOSP, 2009.

[PCI-SIG 2010] PCI-SIG. Single-root I/O virtualization specifications. http://www.pcisig.com/specifications/iov/single_root/, 2010.

[Ranger 2007] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. HPCA, 2007.

[Tridgell 2010] A. Tridgell. Dbench filesystem benchmark. http://samba.org/ftp/tridge/dbench/, 2010.

[Unrau 1995] R.C. Unrau, O. Krieger, B. Gamsa, and M. Stumm. Hierarchical clustering: A structure for scalable multiprocessor operating system design. The Journal of Supercomputing, 9(1):105-134, 1995.

[Wentzlaff 2008] D. Wentzlaff and A. Agarwal. Factored Operating Systems (fos): The case for a scalable operating system for multicores. Operating Systems Review, 2008.

[Whitaker 2002] A. Whitaker, M. Shaw, and S.D. Gribble. Scale and performance in the Denali isolation kernel. In Proc. OSDI, 2002.

[Zhao 2006] X. Zhao, A. Prakash, B. Noble, and K. Borders. Improving distributed file system performance in virtual machine environments. Technical Report CSE-TR-526-06, University of Michigan, 2006.

[Zhong 2001] H. Zhong and J. Nieh. CRAK: Linux checkpoint/restart as a kernel module. Technical Report CUCS-014-01, Department of Computer Science, Columbia University, 2001.
