A Case for Scaling Applications to Many-core with OS Clustering

Xiang Song  Haibo Chen  Rong Chen  Yuanxuan Wang  Binyu Zang
Parallel Processing Institute, Fudan University
{xiangsong, hbchen, chenrong, yxwang1987, byzang}@fudan.edu.cn

Abstract

This paper proposes an approach to scaling UNIX-like operating systems for many cores in a backward-compatible way, which still enjoys common wisdom in new operating system designs. The proposed system, called Cerberus, mitigates contention on many shared data structures within OS kernels by clustering multiple commodity operating systems atop a virtual machine monitor (VMM).
5.2 Sharing File Systems

Running a single application on multiple operating systems raises the problem of sharing files among processes in each clustered operating system. This is because each operating system will have its own file system and device driver, preventing a process from accessing files managed by another operating system. One intuitive approach would be the use of a networked file system managed by one operating system, with other operating systems as NFS clients to access files in the operating system running the NFS server. However, this creates two performance problems. First, all file accesses are now centralized to one operating system, which can easily make the accesses the new performance bottleneck. Second, there are some inherent performance overheads, as a networked file system usually has inferior performance compared to a local one. For example, recent measurements [Nightingale 2005, Zhao 2006] showed that NFS could be several times slower than a native file system such as ext3.
Fortunately, most files in many multiprocessing applica- clustered file system, Cerberus intercepts accesses to the at-
tions are usually accessed exclusively, with few opportuni- tributes or state of each file and directory, distributes ac-
ties to be accessed by multiple processes (except some non- cesses to each partition when necessary, and marshals the
performance-critical ones such as log files)2 . Hence, Cer- results before returning to user applications. Such operations
berus uses a hybrid approach of both networked and local (e.g., list a directory) are relatively costly compared to those
file system, which seeks to give accesses to private files little in a single operating system. However, they are rare and usu-
contention and high performance, while maintaining accept- ally occur in non-performance critical paths of applications.
able performance for shared files.
Figure 4 shows the architecture of our approach, which forms a mesh of networked file systems: each operating system manages a local partition and exposes it to other operating systems through an NFS-like interface; processes in such an operating system access private files directly in the local partition and access files in other partitions through the CFS client. To identify a file as shared or private, Cerberus maintains a mapping from each inode describing a file to the owner ID (e.g., the virtual machine ID). As the metadata of files in each partition is maintained by only one operating system, Cerberus offers a metadata consistency and crash-recovery model similar to that of native systems. It should be noted that the CFS implemented by Cerberus does not rely on the network but rather on the virtual machine communication interfaces and shared memory for communication. This avoids redundant file data copies and the associated data exchange, and thus is more efficient than NFS [Zhao 2006]. Again, the sharing of a file between the CFS client and CFS server is done using the address range abstraction to minimize soft page faults.

To provide applications with a consistent view of the clustered file system, Cerberus intercepts accesses to the attributes or state of each file and directory, distributes accesses to each partition when necessary, and marshals the results before returning to user applications. Such operations (e.g., listing a directory) are relatively costly compared to those in a single operating system. However, they are rare and usually occur in non-performance-critical paths of applications.
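The inode-to-owner mapping above essentially routes every file access either to the local partition or to the CFS client. The following user-space sketch illustrates that routing decision under our own assumptions; the struct layout and the local_vm_id, open_local and cfs_forward_open names are hypothetical and not taken from Cerberus.

```c
#include <stdio.h>

/* Hypothetical descriptor of a file in the clustered file system:
 * each inode is tagged with the ID of the VM (OS instance) that
 * owns the partition it lives in. */
struct cfs_inode {
    unsigned long ino;      /* inode number inside the owner's partition */
    int           owner_vm; /* virtual machine ID that manages this inode */
};

static int local_vm_id = 1;   /* assumed ID of the local OS instance */

static int open_local(struct cfs_inode *i)
{
    printf("local open   ino=%lu\n", i->ino);
    return 0;
}

static int cfs_forward_open(struct cfs_inode *i)
{
    /* would marshal the request and send it to i->owner_vm over the
     * inter-VM communication interface */
    printf("forward open ino=%lu to VM %d\n", i->ino, i->owner_vm);
    return 0;
}

/* Route an open(): private files go straight to the local file system,
 * files owned by another OS instance go through the CFS client. */
static int cfs_open(struct cfs_inode *i)
{
    return (i->owner_vm == local_vm_id) ? open_local(i) : cfs_forward_open(i);
}

int main(void)
{
    struct cfs_inode priv = { 42, 1 }, remote = { 7, 3 };
    cfs_open(&priv);
    cfs_open(&remote);
    return 0;
}
```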
5.3 Sharing File Contents

For multithreaded applications, it is common that the content of a file is shared by multiple threads. Thus, Cerberus supports the sharing of a file based on the address space sharing in Cerberus to maintain consistency for a file accessed by multiple operating systems. Cerberus uses memory mapped I/O (MMIO) to map a file into a shared address range, which is visible to all threads in multiple operating systems. Cerberus only allows the SuperProcess to access shared files using MMIO. To provide backward compatibility for applications using the traditional read/write APIs, Cerberus handles file I/O to shared files using a similar idea to that in Overshadow [Chen 2008], by translating file-related I/Os to MMIOs. On the first read/write operation to the file, Cerberus maps the file in a shared address space using the mmap system call. Cerberus ensures that the buffer is mapped using the MAP_SHARED flag. Cerberus also ensures that the address range of the memory buffer is shared among clustered operating systems using the address range abstraction. Thus, changes from one operating system will be directly visible to other operating systems. Then, Cerberus emulates the read/write system calls by operating on the mmapped area.
To provide file-I/O semantics, Cerberus maintains a virtual file metadata structure that reflects the logical view of the files seen by a process. Cerberus also virtualizes the system calls that operate on the metadata of files. For example, the fseek system call will advance the file position maintained in the virtualized metadata, and fstat-like system calls return the state recorded in the virtualized metadata.

Note that this scheme is transparent to the in-kernel file systems and buffer cache management, as each buffer cache will have a consistent view of the file. The same piece of a file might be replicated among multiple buffer caches, causing wasted memory. However, multiple replicas also increase the concurrency of file access and avoid unnecessary contention.
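To make the read/write translation concrete, here is a minimal user-space sketch of emulating read() as a copy from a MAP_SHARED mapping while advancing a virtualized file position. It only illustrates the idea from this section: the vfile structure, the function names and the example path are our own, and the real system performs this translation inside the interposition layer rather than in the application.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical virtual file metadata kept by the interposition layer:
 * the logical view (offset, size) seen by the process. */
struct vfile {
    void   *base;   /* MAP_SHARED mapping visible to all clustered OSes */
    size_t  size;
    off_t   pos;    /* virtualized file position advanced by reads/seeks */
};

static int vfile_open(struct vfile *vf, const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    vf->size = (size_t)st.st_size;
    vf->pos  = 0;
    /* MAP_SHARED so that stores are visible through the shared buffer cache;
     * in Cerberus the backing address range would also be shared across VMs. */
    vf->base = mmap(NULL, vf->size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return vf->base == MAP_FAILED ? -1 : 0;
}

/* read() emulated as a memcpy from the mapping plus an offset update. */
static ssize_t vfile_read(struct vfile *vf, void *buf, size_t len)
{
    if ((size_t)vf->pos >= vf->size) return 0;
    if (len > vf->size - (size_t)vf->pos) len = vf->size - (size_t)vf->pos;
    memcpy(buf, (char *)vf->base + vf->pos, len);
    vf->pos += (off_t)len;
    return (ssize_t)len;
}

int main(void)
{
    struct vfile vf;
    char buf[16];
    if (vfile_open(&vf, "/tmp/shared_file") == 0)   /* example path only */
        vfile_read(&vf, buf, sizeof buf);
    return 0;
}
```

An fstat-like call would similarly be answered from the vfile metadata rather than from the mapping itself.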
5.4 Shared Networking Interfaces

To provide applications with a consistent view of networking interfaces, Cerberus exploits the fact that typical servers are usually equipped with multiple NICs, and each NIC is now hardware virtualizable (e.g., through SR-IOV [PCI-SIG 2010]). Hence, Cerberus directly assigns either virtualized NICs or physical NICs to each operating system for high performance. This avoids contention on TCP/IP stacks if the operations are done on the local (virtual) NICs. To hide such geographic distribution from applications, Cerberus virtualizes the socket interface by intercepting related system calls and relies on the file descriptor virtualization described previously to manage socket descriptors. Cerberus maintains the (virtual) NIC information, and redirects calls that bind to a NIC if necessary. Cerberus then dispatches related operations (e.g., send, receive) to the VM that manages the NIC. The associated data will be exchanged using the shared memory area managed by Cerberus to avoid possible data copies.
6. Prototype Implementation

We have implemented Cerberus based on Xen to run multiple Linux instances with a single shared memory interface, using the shadow mode of page table management in Xen. The system call layer in Cerberus currently supports only a subset of the POSIX interface, but is sufficient to run many applications including shared-memory MapReduce applications, Apache, Memcached and file system benchmarks. For simplicity, Cerberus currently requires applications to be statically linked3, and to link with a small piece of user-level code containing a few Cerberus-specific signal handlers that handle remote requests such as futex and socket operations.

3 This will not increase much memory usage, as application code is shared by default.
6.1 Inter-VM Message Passing

The inter-VM message passing mechanism is implemented by leveraging the cross-VM event channel mechanism in Xen. Cerberus creates a point-to-point event channel between each pair of clustered operating systems. The SuperP module inside each operating system has a handler to receive such cross-VM events and distribute them to the receivers. In the case of concurrent cross-VM events, each operating system maintains a cross-event queue to buffer the incoming events, and handles them in order. All cross-VM communication of Cerberus, such as futex and signal operations, uses this mechanism.
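The sketch below mimics the buffering behaviour just described in plain user-space C: a per-receiver queue holds incoming cross-VM requests and a notification wakes the handler, which drains them in arrival order. It is only an analogy under our own assumptions; the real mechanism uses Xen event channels and shared memory, and the xvm_* names and message layout are invented here.

```c
#include <pthread.h>

/* Illustrative request types carried over the channel. */
enum req_type { REQ_FUTEX, REQ_SIGNAL, REQ_SOCKET, REQ_VMA_SYNC };

struct xvm_msg {
    enum req_type type;
    int           src_vm;
    char          payload[64];   /* marshalled arguments */
};

#define QLEN 128
struct xvm_queue {
    struct xvm_msg  ring[QLEN];
    int             head, tail;
    pthread_mutex_t lock;
    pthread_cond_t  notify;      /* stands in for the cross-VM event "kick" */
};

static void xvm_send(struct xvm_queue *q, const struct xvm_msg *m)
{
    pthread_mutex_lock(&q->lock);
    q->ring[q->tail % QLEN] = *m;
    q->tail++;
    pthread_cond_signal(&q->notify);      /* notify the receiver VM */
    pthread_mutex_unlock(&q->lock);
}

/* Receiver handler: drain buffered events in order, as the text describes. */
static struct xvm_msg xvm_recv(struct xvm_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->head == q->tail)
        pthread_cond_wait(&q->notify, &q->lock);
    struct xvm_msg m = q->ring[q->head % QLEN];
    q->head++;
    pthread_mutex_unlock(&q->lock);
    return m;
}

int main(void)
{
    static struct xvm_queue q;
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.notify, NULL);
    struct xvm_msg m = { REQ_SIGNAL, 2, "signal request" };
    xvm_send(&q, &m);
    struct xvm_msg r = xvm_recv(&q);
    (void)r;
    return 0;
}
```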
6.2 Memory Management

In Cerberus, the sharing of page tables is implemented in the shadow page tables and by manipulating the P2M (physical-to-machine) table, and thus is transparent to guest operating systems. We have also investigated an implementation of page sharing for Xen's direct mode (with writable page tables), with the aim of supporting para-virtualization. However, our preliminary results show that supporting writable page tables could result in significant changes to guest operating systems, as well as incurring non-trivial performance overhead.

On x86-64, Xen uses 4 levels of page tables and Cerberus supports sharing at the lower three levels (i.e., L1-L3). Cerberus records the root page table page for an address range when the guest kernel connects an allocated page table page to the upper-level page table. When sharing a page table page among multiple OSes, one machine page might be accessed by multiple OSes, and thus might correspond to more than one guest-physical page in Xen. Hence, Cerberus creates a per-VM representation of each shared page table, but in an on-demand way. When a VM tries to write a page table page for the first time, Cerberus will create a representation of the page table page in that VM and map it to a single machine page by manipulating the P2M table, which maps guest physical memory to the host machine memory. Cerberus uses compare-and-swap to serialize updates to shared
page table pages among multiple VMs: when a VM tries to update the shared page table, it uses a compare-and-swap to see if the entry has already been filled by other VMs, and frees the duplicated page table page if so.
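As a rough illustration of that compare-and-swap publication step, the user-space sketch below races two "VMs" to fill the same slot of a shared page table page using a GCC atomic builtin. The pte_t type, the empty-entry convention and the flag bits are assumptions made for the example, not details from the paper.

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t pte_t;            /* stand-in for a page table entry */

/* Publish a new entry into a page table page shared by several VMs.
 * If another VM won the race, the caller frees its duplicate page and
 * reuses the winner's entry, mirroring the behaviour described above. */
static pte_t install_shared_pte(pte_t *slot, pte_t mine)
{
    pte_t old = __sync_val_compare_and_swap(slot, (pte_t)0, mine);
    if (old != 0 && old != mine) {
        /* lost the race: another VM already filled this entry */
        printf("duplicate page table page would be freed here\n");
        return old;
    }
    return mine;
}

int main(void)
{
    pte_t table[512] = { 0 };              /* one (empty) page table page */
    pte_t winner = install_shared_pte(&table[3], 0xabc000 | 0x7);
    pte_t loser  = install_shared_pte(&table[3], 0xdef000 | 0x7);
    printf("entry 3 -> %#llx (second caller saw %#llx)\n",
           (unsigned long long)winner, (unsigned long long)loser);
    return 0;
}
```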
Other than sharing page tables, Cerberus also needs to synchronize the virtual memory area (VMA) mappings across clustered VMs. As threads on different VMs have separate address spaces, they maintain their VMAs individually. Memory management system calls (e.g., mmap) on a single VM only change the VMA mappings of the threads in that VM. Thus, Cerberus intercepts most memory management system calls (e.g., mmap, mremap, mprotect, munmap and brk). Before handling the memory management system call, Cerberus will first force the VM to handle the virtual memory synchronization requests from other VMs. After finishing the call, Cerberus will allow the VM to propagate the system call to all other VMs in the system. This is done by adding a virtual memory synchronization request with appropriate parameters to the request queue of each receiver VM.
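The drain-then-broadcast ordering described in this paragraph can be sketched as follows. This is a self-contained user-space mock-up under our own assumptions (stubbed helpers, a fixed VM count), not the kernel-module code.

```c
#include <stdio.h>

/* Placeholder request structure; in Cerberus this lives kernel-side. */
struct vma_sync_req { unsigned long addr, len; int prot, flags; };

#define NUM_VMS  4
#define LOCAL_VM 0

static void drain_pending_vma_syncs(void)
{
    /* apply VMA-synchronization requests queued by other VMs first */
    printf("draining pending VMA sync requests\n");
}

static long native_mmap(const struct vma_sync_req *r)
{
    /* stands in for the unmodified mmap handler */
    printf("native mmap at %#lx len %lu\n", r->addr, r->len);
    return (long)r->addr;
}

static void enqueue_vma_sync(int vm, const struct vma_sync_req *r)
{
    /* add the request (with its parameters) to the receiver VM's queue */
    printf("propagate mmap(%#lx, %lu) to VM %d\n", r->addr, r->len, vm);
}

/* Interposed mmap: catch up with remote changes, run the call locally,
 * then broadcast it to all other VMs, as the text describes. */
static long cerberus_mmap(const struct vma_sync_req *r)
{
    drain_pending_vma_syncs();
    long ret = native_mmap(r);
    if (ret >= 0)
        for (int vm = 0; vm < NUM_VMS; vm++)
            if (vm != LOCAL_VM)
                enqueue_vma_sync(vm, r);
    return ret;
}

int main(void)
{
    struct vma_sync_req r = { 0x700000000000UL, 4096, 3, 1 };
    cerberus_mmap(&r);
    return 0;
}
```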
6.3 Cerberus File System

Inodes in a Cerberus file system (CFS) are divided into two kinds, namely local inodes and remote inodes. Local inodes describe files on a domain-local file system, and may be accessed directly. Remote inodes correspond to files stored on remote domains. A remote inode can be uniquely identified by its owner domain and its inode number in that domain. When a remote inode is created, CFS will keep track of this unique identifier. Each time a remote inode access is required, CFS will pack the inode identifier and other information into a message, and send it to the remote domain via the inter-VM message passing mechanism.

Another data structure we track is the dentry. A dentry is an object describing relationships between inodes, and storing names of inodes. Unlike inodes, dentries in the original Linux file system do not have identifiers. To simplify remote dentry access, we assign a global identifier to each remote dentry. The dentry id is assigned in a lazy way; that is, only when a dentry is visited from a remote domain for the first time will we assign a global identifier to it.
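A small sketch of how such identifiers and request messages might look is given below. The layouts, field names and the flat packing are our own illustration; the paper does not specify the real wire format.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative (non-Cerberus) encoding of the identifiers the text
 * describes: a remote inode is named by (owner domain, inode number),
 * and remote dentries get a lazily assigned global id. */
struct cfs_inode_id  { uint32_t owner_dom; uint64_t ino; };
struct cfs_dentry_id { uint32_t owner_dom; uint64_t id;  };

enum cfs_op { CFS_OPEN, CFS_READ, CFS_MKDIR, CFS_READDIR };

struct cfs_request {
    enum cfs_op          op;
    struct cfs_inode_id  parent;        /* e.g., parent directory for mkdir */
    struct cfs_dentry_id parent_dentry;
    char                 name[64];      /* new entry name, if any */
};

/* Pack a request into a flat buffer suitable for the inter-VM message
 * channel; a real implementation would use a stable wire format. */
static size_t cfs_pack(const struct cfs_request *req, void *buf, size_t len)
{
    if (len < sizeof *req) return 0;
    memcpy(buf, req, sizeof *req);
    return sizeof *req;
}

int main(void)
{
    struct cfs_request req = {
        .op            = CFS_MKDIR,
        .parent        = { .owner_dom = 3, .ino = 128 },
        .parent_dentry = { .owner_dom = 3, .id  = 17  },
    };
    strncpy(req.name, "logs", sizeof req.name - 1);
    char wire[256];
    size_t n = cfs_pack(&req, wire, sizeof wire);
    printf("packed %zu bytes for domain %u\n", n, req.parent.owner_dom);
    return 0;
}
```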
6.4 Virtualizing Networking

Cerberus virtualizes the socket interface by intercepting the related system calls. The socket operations are divided into two kinds, namely local and remote socket operations. We use virtual file descriptor numbers to distinguish the operations. Each virtual file descriptor number is associated with a virtual file descriptor. The virtual file descriptor describes the owner VM, the responder (a user-level daemon on the owner VM) and the real file descriptor corresponding to it. When a process accesses a virtual file descriptor, Cerberus will first check the corresponding owner VM. If it is a local access, Cerberus just handles the request as in native Xen-Linux using the real file descriptor. Otherwise, Cerberus will send a remote socket operation request to the target VM, and let the responder handle the socket request. With this simple mechanism, Cerberus can currently support several socket-related operations (such as bind, listen, accept, read, write, select, sendto and recvmsg).
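The local/remote split can be pictured with the toy dispatcher below; the vfd layout and helper names are assumptions for illustration, and the real responder is a user-level daemon reached over the inter-VM channel rather than a function call.

```c
#include <stdio.h>

/* Illustrative virtual file descriptor table: each virtual fd records the
 * owner VM, the responder daemon, and the real fd on that VM. */
struct vfd {
    int owner_vm;
    int responder_pid;   /* user-level daemon on the owner VM */
    int real_fd;         /* fd valid only inside the owner VM */
};

#define LOCAL_VM 1
#define MAX_VFD  64
static struct vfd vfd_table[MAX_VFD];

static long native_socket_op(int real_fd)
{
    printf("local op on fd %d\n", real_fd);
    return 0;
}

static long remote_socket_op(const struct vfd *v)
{
    /* marshal the call and send it to the responder on the owner VM */
    printf("forward op on real fd %d to VM %d\n", v->real_fd, v->owner_vm);
    return 0;
}

/* Dispatch a socket operation on a virtual fd, as section 6.4 describes. */
static long dispatch_socket_op(int virt_fd)
{
    struct vfd *v = &vfd_table[virt_fd];
    return (v->owner_vm == LOCAL_VM) ? native_socket_op(v->real_fd)
                                     : remote_socket_op(v);
}

int main(void)
{
    vfd_table[3] = (struct vfd){ .owner_vm = 1, .responder_pid = 1001, .real_fd = 7 };
    vfd_table[4] = (struct vfd){ .owner_vm = 2, .responder_pid = 1002, .real_fd = 9 };
    dispatch_socket_op(3);
    dispatch_socket_op(4);
    return 0;
}
```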
6.5 System Call Virtualization

We classify system calls into two types according to which system state they access. The first type includes system calls that only access local state or are stateless (e.g., get_systime). For such system calls, replicating calls among multiple OSes will not cause state consistency problems, and thus Cerberus does not need to handle them specially. The second type includes system calls that access and modify global state in the operating system (e.g., mmap). Cerberus needs to intercept this kind of system call, coordinate state changes, and marshal the results to support cross-VM interactions. To do interceptions, the Cerberus module modifies the system call table to change the function pointers of certain system call handlers to Cerberus-specific handlers during loading. When a system call is invoked, the Cerberus handler checks if it should be handled by Cerberus, and if so, invokes specific handlers provided by Cerberus.
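As a rough picture of the interception mechanism (clearly not the kernel module itself), the following user-space sketch swaps one entry of a function-pointer table for a wrapper at "load" time and falls through to the saved native handler; all names here are invented for the example.

```c
#include <stdio.h>

/* A tiny dispatch table standing in for the kernel's syscall table. */
typedef long (*syscall_fn)(long arg);

static long native_getpid(long unused) { (void)unused; return 1234; }
static long native_mmap(long addr)     { printf("native mmap(%#lx)\n", addr); return addr; }

#define NR_GETPID 0
#define NR_MMAP   1
static syscall_fn table[2] = { native_getpid, native_mmap };
static syscall_fn saved_mmap;

/* Cerberus-style wrapper: decide whether the call needs coordination,
 * then fall through to the saved native handler. */
static long cerberus_mmap(long addr)
{
    printf("coordinate cross-VM state for mmap(%#lx)\n", addr);
    return saved_mmap(addr);
}

static void install_hooks(void)
{
    saved_mmap     = table[NR_MMAP];
    table[NR_MMAP] = cerberus_mmap;   /* only "global state" calls are hooked */
}

int main(void)
{
    install_hooks();
    table[NR_GETPID](0);
    table[NR_MMAP](0x1000);
    return 0;
}
```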
We have currently virtualized 35 POSIX system calls (belonging to the second type) at either the system call level or the virtual file system level. They are divided into five categories: process/thread creation and exit (e.g., fork, clone, exec, exit, exit_group, getpid and getppid); thread communication (e.g., futex and signal); memory management (e.g., brk, mmap, munmap, mprotect and mremap); network operations (e.g., socket, connect, bind, accept, listen, select, sendto, recvfrom, shutdown and close); and file operations (e.g., open, read, write, mkdir, rmdir, close and readdir). We currently leave system calls related to security, realtime signals, debugging and kernel modules unhandled. In our experience, virtualizing a system call is usually not very difficult, as it mostly involves partitioning/marshaling the associated cross-process state. Table 1 gives some typical examples of how they are implemented.
6.6 Implementation Efforts

In total, the implementation adds 1,800 lines of code to Xen to support management of Cerberus and efficient sharing of data among SubProcesses in multiple Linux instances. The support for system call interception, super-process and the Cerberus file system is implemented as a loadable kernel module, which is comprised of 8,800 lines of code. It takes 1,250 lines of code to enable SuperProcess management. About 800 lines of code are used to support network virtualization and 750 lines of code to support the Cerberus file system. The Cerberus system call virtualization layer takes about 3,000 lines of code, including marshaling multiple system calls (e.g., clone). The Cerberus system support code consists of 3,000 lines, including the management of the shared memory pool, cross-VM messages and process checkpointing and restoring (including 700 lines of code from Crak [Zhong 2001]).
Table 1. Typical examples of how virtualized system calls are implemented.

Syscall   Approaches
clone     Cerberus first makes sure each VM has the resident process. Then, it queries the SuperProcess daemons for the target domain. Native clone is invoked for a local clone. Otherwise a remote clone request with the marshalled parameters (e.g., stack address) is sent. The resident process on the target domain then creates a new thread.
getpid    Cerberus returns a virtual pid to the caller. The virtual pid contains the domain id and the SubProcess number.
signal    Cerberus scans the mapping between virtual pid and process to find the target domain and process. A remote signal request is sent when necessary. The native signal call is then invoked on the receiver domain.
mmap      Cerberus first handles the VMA synchronization request, and then makes the native mmap call. Finally, it broadcasts the mmap result to other VMs.
accept    Cerberus checks the virtual fd table to get the owner domain of the fd. A remote accept request is sent when necessary. The accept operation is done by the corresponding responder with the real fd, and the resulting virtual fd of the created connection is sent back.
sendto    If the fd does not refer to a remote connection (either the socket is not established or it is a local connection), Cerberus will invoke the native sendto. Otherwise, Cerberus will query the virtual fd table to get the owner domain. A remote sendto request is sent and handled by the corresponding responder with the real fd.
mkdir     Cerberus first gets the global identifiers of the inode and dentry of the parent directory. If it is a local request, Cerberus passes it to the native file system. Otherwise, Cerberus gets the owner domain id and sends a remote CFS request with the global identifiers, the type of the new node (a directory in this case) and the directory name. The owner domain creates the new child directory and sends the corresponding global identifiers of the newly created inode and dentry back to the requesting domain.
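For instance, the virtual pid in the getpid row could be packed as a (domain id, SubProcess number) pair. The bit layout below is purely an assumption for illustration; the paper does not specify the encoding.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical virtual pid encoding: upper bits carry the domain id,
 * lower bits the SubProcess number. The 16/16 split is our assumption. */
#define DOM_SHIFT 16

static inline uint32_t make_vpid(uint16_t dom, uint16_t sub)
{
    return ((uint32_t)dom << DOM_SHIFT) | sub;
}

static inline uint16_t vpid_dom(uint32_t vpid) { return (uint16_t)(vpid >> DOM_SHIFT); }
static inline uint16_t vpid_sub(uint32_t vpid) { return (uint16_t)(vpid & 0xffff); }

int main(void)
{
    uint32_t vpid = make_vpid(3, 42);   /* SubProcess 42 running in domain 3 */
    printf("vpid=%#x dom=%u sub=%u\n", vpid, vpid_dom(vpid), vpid_sub(vpid));
    return 0;
}
```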
7. Experimental Results

This section evaluates the potential costs and benefits in performance and scalability of Cerberus's approach to mitigating contention in operating system kernels.
7.1 Experimental Setup

The benchmarks used include histogram from the Phoenix testsuite [Ranger 2007]4, dbench 3.0.4 [Tridgell 2010], Apache web server 2.2.15 [Fielding 2002] and Memcached 1.4.5 [Fitzpatrick 2004].

4 We chose histogram because it has severe performance scalability problems on our testing machine, which other programs don't exhibit.

Moreover, we present the costs of basic operations in Cerberus. We use OProfile to study the time distribution of histogram, dbench, Apache and Memcached on Cerberus, Xen-Linux and Linux.
Most experiments were conducted on an AMD 48-core machine with 8 six-core AMD 2.4 GHz Opteron chips. Each core has a separate 128 KByte L1 cache and a separate 512 KByte L2 cache. Each chip has a shared 8-way 6 MByte L3 cache. The size of physical memory is 128 GByte. We use Debian GNU/Linux 5.0, which is installed on a 147 GByte SCSI hard disk with the ext3 file system. There are a total of four network interface cards and each is configured with a different IP in a subnet. The input files and executables for testing are stored in a separate 300 GByte SCSI hard disk with an ext3 file system. The Apache and Memcached benchmarks were conducted on an Intel machine with four quad-core chips and 8 NICs (as it has more NICs than the AMD machine), to reduce the bottlenecks from the NIC itself. Due to resource limitations, we can run up to 24 virtual machines on the AMD machine. All performance measurements were run at least three times and we report the mean.

Cerberus is based on Xen 3.3.0, which by default runs with the Linux kernel version 2.6.18. We thus use kernel version 2.6.18 for the three measured systems. Xen-Linux uses the privileged domain (Dom0) in direct paging mode for good performance. As Xen 3.3.0 can support at most 32 VCPUs for one VM, we only evaluate Xen-Linux with up to 32 cores.

We compare the performance and scalability of Cerberus with Linux. We also present the performance results of Xen-Linux to show the performance overhead incurred by the virtualization layer, as well as the performance benefit of Cerberus over typical virtualized systems. To investigate the performance gain of Cerberus, we used OProfile and Xenoprof to collect the distribution of time of histogram, dbench, Apache and Memcached on Xen-Linux and Linux, and that on Cerberus using the two cores per-VM configuration. All profiling tests use the CPU_CYCLE_UNHALTED event as the performance counter event.

7.2 Cerberus Primitives

We also wrote a set of microbenchmarks to evaluate the cost of many primitives of Cerberus, to understand the basic costs underlying Cerberus.

Sending Signals: To evaluate the performance of the Cerberus signal mechanism, we use a micro-benchmark to test the time it takes to send a signal using a ping-pong scheme (e.g., sending a signal to a process and that process sending a signal back to the originator) on both the Intel and AMD machines. Table 3 depicts the evaluation results. It can be seen that the virtualization layer introduces some overhead
to the signal mechanism. However, sending a cross-VM signal takes less time than sending a local signal. There are two reasons: 1) the inter-VM message passing mechanism is efficient; 2) sending a signal to a remote process only needs to forward the request to the target VM, so signaling the target process and executing the sender process can be done in parallel.

                  Intel      AMD
Native Linux      7.9 ms     4.0 ms
Xen-Linux        38.7 ms    74.1 ms
Cerberus local   43.1 ms    72.3 ms
Cerberus remote  25.8 ms    45.0 ms

Table 3. Cost of ping-ponging 1000 signals
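The paper does not list its measurement harness; a minimal user-space approximation of such a signal ping-pong measurement (1000 round trips between a parent and a forked child, timed with gettimeofday) could look like the sketch below. Blocking SIGUSR1 and collecting it with sigwait keeps the lockstep protocol free of handler races.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define ROUNDS 1000

/* Bounce SIGUSR1 back and forth with the peer; the initiator pings first. */
static void bounce(pid_t peer, int start_first)
{
    sigset_t set;
    int sig;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    for (int i = 0; i < ROUNDS; i++) {
        if (start_first) kill(peer, SIGUSR1);
        sigwait(&set, &sig);                 /* wait for the peer's signal */
        if (!start_first) kill(peer, SIGUSR1);
    }
}

int main(void)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_BLOCK, &set, NULL);      /* block so sigwait can collect it */

    pid_t parent = getpid();
    pid_t child = fork();
    if (child == 0) {                        /* child: reply to each ping */
        bounce(parent, 0);
        _exit(0);
    }
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    bounce(child, 1);                        /* parent: initiate each round */
    gettimeofday(&t1, NULL);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) / 1e3;
    printf("%d round trips took %.1f ms\n", ROUNDS, ms);
    return 0;
}
```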
Remote Fork and Clone: Table 4 shows the cost of spawning one process/thread on a remote VM in the AMD machine with 2 VMs, and of concurrently spawning 24 processes/threads on remote VMs in the AMD machine with 24 VMs. Cerberus suffers from some overhead due to checkpointing, transferring and restoring process/thread state from the issuing VM to the receiving VMs. However, with increasing numbers of VMs, the steps of creating remote threads can be processed in parallel. This helps to reduce some overhead of creating threads, as shown in the table.

Primitive      Config         Time
remote fork    1 process       5.40 ms
               24 processes   31.77 ms
remote clone   1 thread        3.21 ms
               24 threads     30.79 ms

Table 4. The costs of fork and clone in Cerberus
Inter-VM Message Passing: To evaluate the costs of inter-VM message passing, we pass a message between VMs in order using a ping-pong scheme, e.g., sending a message to a VM and having that VM respond by sending a message back to the sender. The time for one round trip is around 10.24 µs within the same chip and 11.34 µs between chips, which we believe is modest and acceptable.

Reading a File with CFS: To evaluate the performance of the Cerberus file system (CFS), we write a micro-benchmark to test the time it costs to read the beginning portion of a simple file on the AMD machine. We generate one hundred files with random content, clear the buffer cache, read the first ten bytes of each file, and then calculate the average execution time. The result shows that one read operation on a native Xen-Linux file takes 6.47 µs, and one read operation on a local CFS file takes 6.52 µs, while on a remote CFS file it costs 17.81 µs. The performance of local read operations on the CFS is close to that of the native Xen-Linux file system. However, remote read operations introduce some performance overhead.
Sending and Receiving Packets: To evaluate the performance of the Cerberus network system, we use a micro-benchmark to test the time for sending and receiving network packets, using a ping-pong scheme on the Intel machine. The micro-server establishes a network connection with the client and creates a child to handle the following requests. The client sends an 8-byte string to the server through a socket connection (localhost/remote host) to trigger the test. Table 2 depicts the evaluation results. It shows the execution time of ping-ponging one message 1000 times under different configurations. It can be seen that the virtualization layer introduces some overhead for sending and receiving packets, while forwarding a packet in Cerberus introduces more overhead. However, if the connection is from a remote host, the overhead of the packet forwarding is below 25% compared to native Linux.

                  localhost   remote host
Native Linux       12.5 ms     125.5 ms
Xen-Linux          42.9 ms     132.6 ms
Cerberus local     43.1 ms     131.8 ms
Cerberus remote    87.1 ms     154.7 ms

Table 2. Cost of ping-ponging one packet 1000 times

7.3 Performance Results

For histogram and dbench, we ran each workload on the AMD 48-core machine under Xen-Linux and Linux with the number of cores increasing from 2 to 48. For Cerberus, we evaluate two configurations, which run one and two cores for each VM (Cerberus-1core and Cerberus-2cores), running on a different number of cores, increasing from 2 to 48. When running one virtual machine on two cores, we configured each VM with cores that have minimal communication costs (e.g., sharing the L3 cache). For the Apache web server and Memcached service benchmarks, we run each workload on the Intel 16-core machine under Cerberus, Xen-Linux and Linux with different numbers of cores, increasing from 2 to 16. As both applications require a relatively large number of NICs, we did not test them on the 48-core AMD machine. During the Apache and Memcached tests, we set up one instance of the web server on each core, which accepts service requests from clients running on a pool of 16 dual-core machines (32 clients for Apache and 64 clients for Memcached).
Figure 5. The execution time and speedup of histogram on Cerberus compared to those on Linux and Xen-Linux under two configurations, which use 1 and 2 cores/domain respectively. [Plots: execution time (sec) and speedup versus number of cores for Linux, Xen-Linux, Cerberus-1Core and Cerberus-2Cores.]

Figure 6. The throughput and speedup of dbench on Cerberus compared to those on Linux and Xen-Linux under two configurations, which use 1 and 2 cores/domain respectively. [Plots: throughput (MB/sec) and speedup versus number of cores.]
Histogram: Figure 5 shows the performance and scalability of histogram processing 4 GByte of data on Cerberus, native Linux and Xen-Linux. All input data is held in an in-memory tmpfs to avoid applications being bottlenecked by disk I/O. Cerberus performs significantly worse than Linux for a small number of cores, due to the performance overhead in shadow page management and the inherent virtualization overhead. However, as the number of cores increases, the execution time of histogram eventually decreases and outperforms native Linux. The speedup of Cerberus over Xen-Linux is around 51% on 24 cores for the one core per-VM configuration, and 30% on 30 cores for the two cores per-VM configuration. The speedup over Linux is around 43% on 24 cores for one core per-VM, and 37% on 48 cores for two cores per-VM. The performance of two cores per-VM is worse than that of one core per-VM, due to the increased contention on the shadow page table inside Xen. The speedup of Cerberus degrades a little (57% vs. 37%) from 42 cores to 48 cores, probably because the costs of creating threads and communication increase, and thus the benefit degrades.

Table 5 shows the top 3 hottest functions in the profiling report of the histogram benchmark. Linux suffers from contention in up_read and down_read_trylock due to memory management. Xen-Linux spends most of its time in address 0x0 (/vmlinux-unknown) when the number of cores exceeds eight5, which might be used for a para-virtualized kernel to interact with the hypervisor. Cerberus, however, does not encounter the contention seen in Linux and Xen-Linux; the time spent in the lock-free implementation (cmpxchg) increases only a little with the increasing number of cores.

5 The profiling results are obtained through Xenoprof using the CPU_CYCLE_UNHALTED event.

Threads   Top 3 Functions                    Percent
Linux
48        up_read                            38.6%
          down_read_trylock                  35.9%
          calc_hist                           8.3%
1         calc_hist                          81.2%
          find_busiest_group                  0.06%
          page_fault                          0.03%
Xen-Linux
32        /vmlinux-unknown                   70.9%
          calc_hist                          11.6%
          handle_mm_fault                     3.2%
1         calc_hist                          60.3%
          handle_mm_fault                     3.6%
          sh_gva_to_gfn_guest_4               2.7%
Cerberus
2/VM      calc_hist                          22.5%
          sh_x86_emulate_cmpxchg_guest_2      8.9%
          /xen-unknown                        8.3%

Table 5. The summary of the top 3 hottest functions in histogram benchmark profiling
dbench: Figure 6 depicts the throughput and speedup of dbench on Cerberus over Xen-Linux and Linux. The throughput of dbench on Xen-Linux and Linux degrades dramatically when the number of cores increases from 6 to 12, and degrades slightly afterwards. By contrast, though the throughput of Cerberus is worse than that on Linux for a small number of cores (1-6), its throughput scales well to 18 cores and 12 cores for the one and two cores per-VM configurations. It appears that dbench has reached its peak throughput here and has no further space for improvement. Starting from 12 cores or 18 cores, the throughput degrades slightly due to the increased process creation and inter-VM communication costs. Again, the one core per-VM configuration is slightly better than the two cores per-VM configuration, due to the per-VM lock on the shadow page table. In total, the speedup is 4.89X for the one core per-VM configuration on 24 cores, 4.95X for 42 cores, and 4.61X for 48 cores.

Table 6 shows the top 3 hottest functions in the profiling report of the dbench benchmark. We ignore the portion of samples related to mwait_idle, as it means the CPU has nothing to do. From the table we can see that Linux and Xen-Linux both spend substantial time in ext3 file system operations, which may be the reason for poor scalability. On the other hand, Cerberus does not encounter such scalability problems, but is slightly affected by the shadow paging mode.

Threads   Top 3 Functions                      Percent
Linux
48        ext3_test_allocatable                66.6%
          bitmap_search_next_usable_block      18.2%
          journal_dirty_metadata                0.02%
1         /lib/libc-2.7.so                     20.7%
          copy_user_generic                    14.1%
          d_lookup                              0.03%
Xen-Linux
32        ext3_test_allocatable                59.7%
          bitmap_search_next_usable_block      17.7%
          /vmlinux-unknown                      5.99%
1         /lib/libc-2.7.so                     13.7%
          copy_user_generic                     9.9%
          d_lookup                              4.1%
Cerberus
2/VM      sh_x86_emulate_cmpxchg_guest_2       11.2%
          /xen-unknown                          8.67%
          sh_x86_emulate_write_guest_2          5.2%

Table 6. The summary of the top 3 hottest functions in dbench benchmark profiling

The evaluation on histogram and dbench also shows that these applications poorly utilize multicore resources when the number of cores reaches a certain level. This indicates that horizontally allocating more cores to such applications may not be a good idea. Instead, allocating a suitable amount of cores to such applications could result in a better tradeoff between utilization and performance.
Figure 7. The per-core throughput of Apache on Cerberus compared to those on Linux and Xen-Linux. [Plot: per-core throughput (requests/sec) versus number of cores (1-16) for Linux, Xen-Linux and Cerberus.]
to 8 different VMs (using PCI passthrough). The per-core throughput of 16 cores is only 1085 requests/sec for Linux, which is 12.1% of that on 1 core. By contrast, the throughput of Cerberus is quite stable. Although Cerberus performs worse than Linux for a small number (1-2) of cores (4603 vs. 8118 on 2 cores), it outperforms Linux when the number of cores exceeds 4 and scales nearly linearly. Cerberus achieves a speedup of 3.49X and 3.53X over Linux and Xen-Linux (3833 vs. 1099 and 1085).

The profiling of Apache shows that more CPU time is spent idle with the increasing number of cores used to host web servers, and there is some load imbalance. OProfile shows that the same server instance takes 2.57X more CPU cycles under the 1-core configuration than under the 16-core configuration. The same scenario also appears in Xen-Linux (2.39X). This may be caused by contention in the network layer in Linux and Xen-Linux. However, Cerberus does not encounter such a problem and can fully utilize its CPU resources. This evaluation shows that Cerberus could also avoid some imbalance caused by Linux, and achieve more efficient use of resources.

[Plot: throughput (10,000 requests/sec) versus number of cores (1-16) for Linux, Xen-Linux and Cerberus.]