Emulating an Octeon MIPS64 in QEMU
All content following this page was uploaded by Qurrat ul Ain on 14 March 2018.
Abstract: Embedded systems are proliferating as their hardware capabilities grow. Their application areas include the Internet of Things, cellular devices, network devices, etc. Developing and testing applications natively on such embedded hardware is expensive, time consuming, and challenging. In this case, system emulation is a cost-effective alternative. We have extended Quick Emulator (QEMU) to support embedded systems based on the Cavium Octeon MIPS64 processor. This paper summarizes the modifications made to QEMU v1.0.1 to support the Octeon processor. We compare the performance of the guest Octeon MIPS64 system against the native system using synthetic and application benchmarks. This comparison shows that, depending on the use case, the performance of the emulated system is comparable to that of the real system.

Keywords: Emulation; Linux kernel; MIPS64; Cavium Octeon; Porting; QEMU.

I. INTRODUCTION

The ubiquity of embedded devices today has considerably changed the embedded system computing landscape. Recently, with the availability of multiple cores and increased processing power, there has been an exponential growth of embedded devices in smart watches, mobile phones, the Internet of Things, telecom appliances, networking, and navigational systems [1-3]. On the other hand, the development of such systems becomes difficult and time consuming as the required embedded hardware is expensive. In addition, it is infeasible to provide such hardware to every developer in large development environments. Developers of such systems depend heavily on emulators for initial development and testing, without requiring real, expensive hardware.

In this paper, we focus on Quick Emulator (QEMU), an open source behavioral emulator which uses efficient means to mimic the target machine’s instructions [4]. For example, different operating systems can be launched simultaneously to utilize faster X86 hardware resources for emulation. Currently QEMU supports emulation for a wide variety of processors including X86, PowerMac, ARM, SPARC, MicroBlaze, and MIPS32. In addition, QEMU has the ability to run either as an emulator or a virtual machine.

Cavium Octeon MIPS64 based systems are typically used in network switches, routers, set-top boxes, and high-end devices [5]. However, QEMU does not support emulation of Octeon MIPS64 processors. Therefore, developers of network switches and routers cannot use QEMU for such application development. Having an emulation of the Octeon processor in a commonly used emulator such as QEMU facilitates the development, testing, and debugging of embedded applications.

Our goal is to provide an emulation solution for Cavium MIPS64 Octeon systems using QEMU. We enhanced QEMU and added support for Octeon MIPS based systems. We implement Octeon-specific instruction translation using the Tiny Code Generator (TCG) present in QEMU. With our modifications, anyone who has an Octeon Linux binary can emulate an Octeon based system on X86 without buying expensive MIPS hardware. We also evaluated our enhanced guest system using different benchmarks.

The remainder of the paper is structured as follows: Section II briefly discusses the related work and Section III presents a brief description of the internal structure and working mechanism of QEMU. Section IV highlights the modifications in different components of QEMU for adding Octeon processor support. In Section V, we present a performance comparison of the Octeon guest system with its native system using various benchmarks including LMbench, MiBench, CommBench, and Netperf. Section VI concludes the overall findings.

II. RELATED WORK

An emulator is software that provides the behavior of another environment or architecture while running on a different environment or architecture. A single application as well as a complete system can be emulated with the help of emulators. System emulators are quite useful as a client-side application that includes debugging and development tools. There is a vast variety of emulators that cover different architectures and applications, for example Bochs [6], ARMware [7], PearPC [8], and GXemul [9]. Despite the number of options available, QEMU is a commonly used open source emulator.

QEMU has been extended for many processors. For example, in [10] the authors present their work on building a hypervisor using the "Choices" operating system on the ARM architecture. They modified QEMU to trap the sensitive ARM instructions when executed from low-privileged user mode. A hypercall handler is added to “Choices” to validate and emulate the sensitive instruction. In [11] the authors added SPARC to QEMU, mainly due to the need for a software development tool and testbed. ARMvisor [12] is based on QEMU and KVM, which support full virtualization on the Linux kernel. A modified QEMU is used
Fig. 4. Latency results for context switching and different system calls along with the overhead. [y-axis = log scale]
We ensure that the guest's E1000 related accesses are properly connected to the E1000's emulated code in QEMU and that networking is functional.

V. PERFORMANCE EVALUATION

In this section, we evaluate the performance of QEMU with our Octeon implementation. We ran various standard benchmarks to evaluate the performance of the emulated system and compare it with the native board (i.e., Cavium Octeon CN5700). During the experiments, the host system was an HP ProBook 4540 with a 2.2 GHz Core i7 processor and 8 GB of RAM.

A. Performance Metrics

In general, we measure the performance of our emulated system using two key metrics: 1) throughput and 2) latency. For comparison purposes, we define throughput overhead as the ratio of the throughput of the native system to that of QEMU. Latency overhead is defined as the ratio of the latency of QEMU to that of the native system. In both cases, an overhead of 1 means that both systems have similar performance, whereas an overhead greater than 1 implies a performance penalty due to emulation. Conversely, an overhead below 1 means that the emulated system outperforms the native board.

B. Benchmarks

First, we present the results of LMbench, which is a synthetic benchmark used to measure the performance of different subsystems. Additionally, we ran different application benchmarks such as MiBench, CommBench, and Netperf to ascertain the emulation overhead.

1) LMbench: LMbench [15] is a well-known synthetic microbenchmark that evaluates the performance of different subsystems. Figure 3 compares the memory bandwidth results of the guest and native systems. “bw_mem” measures the memory bandwidth while performing different memory operations, including “rd” (read), “wr” (write), “rdwr” (read and write), and “cp” (copy). These operations were performed on a 10 MB memory region. “bw_mmap_rd” measures the memory bandwidth while reading a 10 MB memory-mapped file on the system. “bw_file_rd” reads a 10 MB file and measures its bandwidth. “bw_pipe” creates a Unix pipe between two processes and moves 10 MB of data through it in 64 KB messages. This test provides the pipe bandwidth. “bw_unix” measures the bandwidth of data sockets. Our results show that the guest outperforms the native system for the “rd”, “cp”, “rdwr”, and “bw_mmap_rd” operations, with overheads ranging from 0.41 to 0.75. For the rest of the tests, the overhead due to emulation is less than 3.

Figure 4 presents the results of tests related to process execution on a Linux system. “lat_ctx” measures the latency of context switching between two processes. “lat_proc” measures the time consumed during the creation of a process through exec or fork. “lat_syscall” measures the latencies of different system calls. Note that we observe an overhead of 5.42-7.85 during these tests, which is higher than for the memory bandwidth operations, mainly due to the complexity involved in process management.
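The two overhead ratios defined in Section V-A can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the sample measurements are hypothetical and are not taken from the benchmark runs reported here. It also shows how an overhead below 1 can be read as a speedup of the guest, which is how the LMbench bandwidth figures are interpreted.

```python
def throughput_overhead(native: float, qemu: float) -> float:
    """Throughput overhead: native throughput divided by QEMU throughput."""
    if native <= 0 or qemu <= 0:
        raise ValueError("throughput values must be positive")
    return native / qemu


def latency_overhead(native: float, qemu: float) -> float:
    """Latency overhead: QEMU latency divided by native latency."""
    if native <= 0 or qemu <= 0:
        raise ValueError("latency values must be positive")
    return qemu / native


def guest_speedup(overhead: float) -> float:
    """Reciprocal of an overhead; values above 1 mean the guest is faster."""
    if overhead <= 0:
        raise ValueError("overhead must be positive")
    return 1.0 / overhead


# Hypothetical sample values, for illustration only (not measured results).
bw_overhead = throughput_overhead(native=750.0, qemu=1000.0)  # MB/s
lat_overhead = latency_overhead(native=2.0, qemu=12.0)        # microseconds

print(f"throughput overhead: {bw_overhead:.2f}")
print(f"latency overhead:    {lat_overhead:.2f}")
print(f"guest speedup:       {guest_speedup(bw_overhead):.2f}x")
```

Both ratios are oriented so that 1 means parity and larger values mean a larger emulation penalty, which lets bandwidth and latency tests share one scale in the figures.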
Fig. 5. Latency results for basic operation for different data types in LMbench along with overhead. [y-axis = log scale]
Fig. 6. Execution time for different MiBench application tests along with the overhead. [y-axis = log scale]
Figure 5 shows the results of “lat_ops”, which measures the latency of basic CPU operations such as bit manipulation, add, mul, and div for different data types. Multiplication and division operations for the integer data type show a speedup on the Octeon guest, with overheads ranging from 0.26 to 0.65. The rest of the operations show emulation overheads ranging from 1.48 to 3.14.

2) MiBench: Next, we ran various application benchmarks to assess how our system behaves from an application’s perspective. MiBench [16] is a collection of benchmarks targeting various areas of embedded systems, e.g., automotive, security, and networking. The execution time of different tests is shown in Figure 6. “basicmath” performs simple mathematical operations, e.g., cubic function solving and integer square root. “qsort” sorts a large array of strings into ascending order. Blowfish, CRC, and SHA belong to the network and data security algorithm category. Our results show that our emulator performs better than the native system for basicmath and sorting applications; however, we observe significantly higher overhead for complex applications like CRC, Blowfish, and SHA.

3) CommBench: CommBench [17] is another application benchmark designed for testing the performance of processors on communication-related tasks. Tests in this benchmark focus on small, computationally intensive kernels that are typical of the network processor environment. These benchmarks are oriented towards packet header processing or data stream processing.

Figure 7 shows the execution of different applications which are widely used during packet header processing. We observe minor overhead in the case of REED encoding and decoding, whereas ZIP shows variable behavior in encoding and decoding. The maximum overhead in this case is observed for CAST.

4) Netperf: The Netperf benchmark [18] measures bulk data transfer or request-response performance. “TCP_RR” and “UDP_RR” measure the transactions per second during a 120-second test run, and the results are shown in Figure 8. Here, a transaction is one round trip of sending a request and receiving the response over the network. This unit highlights one-way and round-trip average latency in the network system.

Networking results show up to 80 times overhead, which is the highest of any benchmark. This is expected due to the multiple layers of functionality that need to be emulated, including the PCI bus and network interface, for each packet transmission and reception. Our results are consistent with previous studies [19] [20] that also report poor networking performance of QEMU with a fully emulated E1000 and tap interface.
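The relation between the transaction rate reported by a request-response test and the average network latency can be sketched as follows. The sketch assumes that exactly one transaction is outstanding at any time and that the path is symmetric, so that the one-way latency is half the round trip. The function names and the sample transaction count are hypothetical and are not measured values from this evaluation.

```python
def transactions_per_second(transactions: int, duration_s: float) -> float:
    """Transaction rate over a fixed-length request-response run."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return transactions / duration_s


def round_trip_latency_us(tps: float) -> float:
    """Average round-trip latency in microseconds.

    Assumes one outstanding transaction at a time, so each transaction
    takes 1 / tps seconds on average.
    """
    if tps <= 0:
        raise ValueError("transaction rate must be positive")
    return 1_000_000.0 / tps


def one_way_latency_us(tps: float) -> float:
    """Average one-way latency, assuming a symmetric path (half the round trip)."""
    return round_trip_latency_us(tps) / 2.0


# Hypothetical run: 600,000 transactions completed in a 120-second test.
tps = transactions_per_second(transactions=600_000, duration_s=120.0)
print(f"transactions/s:     {tps:.0f}")
print(f"round-trip latency: {round_trip_latency_us(tps):.1f} us")
print(f"one-way latency:    {one_way_latency_us(tps):.1f} us")
```

Under these assumptions, a lower transaction rate on the guest translates directly into a proportionally higher average round-trip latency.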
Fig. 7. Execution time for encoding/decoding algorithms in CommBench along with their overhead. [y-axis = log scale]
Fig. 8. Request/Response test results of Netperf. [y-axis = log scale]

VI. CONCLUSION

This paper enhances the emulation capabilities of QEMU, an open source emulator. We highlight the modifications and additions that were required to provide support for the Octeon MIPS64 processor in QEMU. We evaluated the performance of the guest system against a native Octeon board. Our benchmarking results show that in LMbench a simple test, e.g., copy, performs 1.37 times better than the native system. However, with more complicated operations that involve frequent context switches, we observe more overhead. Furthermore, MiBench shows that simpler applications, including basic mathematical functions and sorting, perform better than native, while complex algorithms like SHA face overhead due to the emulation layer. While CommBench shows varying overhead for different encoding/decoding algorithms, networking seems to be the only component with huge overhead.

ACKNOWLEDGEMENT

This research work was funded by the Higher Education Commission (HEC) Pakistan, under the project titled “MIPS64 - System Mode Emulation in QEMU”.

REFERENCES

[1] Intel Embedded Processors, "Rise of the Embedded Internet." White Paper, 2009.
[2] T. Tuttle. The Internet of Things: the next wave of our connected world. [Online] Available: http://www.embedded.com/design/connectivity/4430102/The-Internet-of-Things--the-next-wave-of-our-connected-world
[3] G. M. Insights. Embedded system market size likely to exceed USD 258 billion by 2023. [Online] Available: https://www.gminsights.com/pressrelease/embedded-system-market-report
[4] F. Bellard, "QEMU, a Fast and Portable Dynamic Translator," in USENIX Annual Technical Conference, FREENIX Track, 2005, pp. 41-46.
[5] Cavium. Corporate Profile. [Online] Available: http://investor.caviumnetworks.com/phoenix.zhtml?c=209126&p=irol-homeProfile&t=&id=&
[6] K. P. Lawton, "Bochs: A portable pc emulator for unix/x," Linux Journal, vol. 1996, p. 7, 1996.
[7] H. Wei. (2012). Armware. [Online] Available: http://code.google.com/p/armware/
[8] S. Biallas. (2004). PearPC-PowerPC architecture emulator. [Online] Available: http://pearpc.sourceforge.net/doc.html
[9] A. Gavare. GXemul. [Online] Available: http://gxemul.sourceforge.net/
[10] R. Bhardwaj, P. Reames, R. Greenspan, V. S. Nori, and E. Ucan, "A Choices Hypervisor on the ARM architecture," Department of Computer Science, University of Illinois at Urbana-Champaign, vol. 11, 2006.
[11] J.-W. Choi and B.-G. Nam, "Development of high performance space processor emulator based on QEMU - Open source dynamic translator," in 12th International Conference on Control, Automation and Systems (ICCAS) 2012, pp. 300-304.
[12] J.-H. Ding, C.-J. Lin, P.-H. Chang, C.-H. Tsang, W.-C. Hsu, and Y.-C. Chung, "Armvisor: System virtualization for arm," in Proceedings of the Ottawa Linux Symposium (OLS), 2012, pp. 93-107.
[13] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, "kvm: the Linux virtual machine monitor," in Proceedings of the Linux symposium, 2007, pp. 225-230.
[14] K. Butt, A. Qadeer, and A. Waheed, "MIPS64 user mode emulation: A case study in open source software engineering," in 7th International Conference on Emerging Technologies (ICET), 2011, pp. 1-6.
[15] L. W. McVoy and C. Staelin, "lmbench: Portable Tools for Performance Analysis," in USENIX annual technical conference, 1996, pp. 279-294.
[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in International Workshop on Workload Characterization, 2001, pp. 3-14.
[17] T. Wolf and M. Franklin, "CommBench-a telecommunications benchmark for network processors," in IEEE International Symposium on Performance Analysis of Systems and Software, 2000, pp. 154-162.
[18] R. Jones, "NetPerf: a network performance benchmark," Information Networks Division, Hewlett-Packard Company, 1996.
[19] L. Rizzo and G. Lettieri, "Vale, a switched ethernet for virtual machines," in Proceedings of the 8th international conference on Emerging networking experiments and technologies, 2012, pp. 61-72.
[20] S. Vrijders, V. Maffione, D. Staessens, F. Salvestrini, M. Biancani, E. Grasa, et al., "Reducing the complexity of virtual machine networking," IEEE Communications Magazine, vol. 54, pp. 152-158, 2016.