Emulating an Octeon MIPS64 based embedded system on X86 in QEMU

Conference Paper · December 2016


DOI: 10.1109/INMIC.2016.7840110

Emulating an Octeon MIPS64 based Embedded
System on X86 in QEMU
Muhammad Amir Mehmood, Qurrat Ul Ain, Ayaz Akram, Abdul Qadeer and Abdul Waheed
Al-Khawarzmi Institute of Computer Science
University of Engineering and Technology
Lahore, Pakistan
{amir.mehmood, qurratulain, ayyaz.akram, qadeer, awaheed}@kics.edu.pk

978-1-5090-4300-2/16/$31.00 ©2016 IEEE

Abstract—Embedded systems are proliferating as their hardware capabilities grow. Their application areas include the Internet of Things, cellular devices, and network devices. Developing and testing applications natively on such embedded hardware is expensive, time consuming, and challenging, which makes system emulation a cost-effective alternative. We have extended Quick Emulator (QEMU) to support Cavium Octeon MIPS64 processor based embedded systems. This paper summarizes the modifications made to QEMU v1.0.1 to support the Octeon processor. We compare the performance of the guest Octeon MIPS64 system against the native system using synthetic and application benchmarks. The comparison shows that, depending on the use case, the performance of the emulated system is comparable to that of the real system.

Keywords—Emulation; Linux kernel; MIPS64; Cavium Octeon; Porting; QEMU.

I. INTRODUCTION

The ubiquity of embedded devices today has considerably changed the embedded system computing landscape. Recently, with the availability of multiple cores and increased processing power, there has been exponential growth of embedded devices in smart watches, mobile phones, the Internet of Things, telecom appliances, networking, and navigational systems [1-3]. On the other hand, developing such systems is difficult and time consuming because the required embedded hardware is expensive. It is also infeasible to provide such hardware to every developer in a large development environment. Developers of such systems therefore depend heavily on emulators for initial development and testing, without requiring the real, expensive hardware.

In this paper, we focus on Quick Emulator (QEMU), an open source behavioral emulator that uses efficient means to mimic the target machine’s instructions [4]. For example, different operating systems can be launched simultaneously to utilize faster X86 hardware resources for emulation. QEMU currently supports emulation of a wide variety of processors including X86, PowerMac, ARM, SPARC, MicroBlaze, and MIPS32. In addition, QEMU can run either as an emulator or as a virtual machine.

Cavium Octeon MIPS64 based systems are typically used in network switches, routers, set-top boxes, and high-end devices [5]. However, QEMU does not support emulation of Octeon MIPS64 processors. Therefore, developers of network switches and routers cannot use QEMU for such application development. Emulating the Octeon processor in a commonly used emulator such as QEMU facilitates development, testing, and debugging of embedded applications.

Our goal is to provide an emulation solution for Cavium MIPS64 Octeon systems using QEMU. We enhanced QEMU and added support for Octeon MIPS based systems. We implement Octeon specific instruction translation using the Tiny Code Generator (TCG) present in QEMU. With our modifications, anyone who has an Octeon Linux binary can emulate an Octeon based system on X86 without buying expensive MIPS hardware. We also evaluated our enhanced guest system using different benchmarks.

The remainder of the paper is structured as follows: Section II briefly discusses the related work, and Section III presents a brief description of the internal structure and working mechanism of QEMU. Section IV highlights the modifications made to different components of QEMU to add Octeon processor support. In Section V, we compare the performance of the Octeon guest system with its native system using various benchmarks including LMbench, MiBench, CommBench, and Netperf. Section VI concludes the overall findings.

II. RELATED WORK

An emulator is software that provides the behavior of one environment or architecture while running on a different environment or architecture. A single application as well as a complete system can be emulated with the help of emulators. System emulators are quite useful as client side applications that include debugging and development tools. There is a vast variety of emulators covering different architectures and applications, for example Bochs [6], ARMware [7], PearPC [8], and GXemul [9]. Among the many options available, QEMU is a commonly used open source emulator. QEMU has been extended for many processors. For example, in [10] the authors present their work on building a hypervisor using the "Choices" operating system on the ARM architecture. They modified QEMU to trap sensitive ARM instructions when they are executed from low privileged user mode. A hypercall handler is added to “Choices” to validate and emulate the sensitive instruction. In [11] the authors added SPARC to QEMU, mainly due to the need for a software development tool and testbed. ARMvisor [12] is based on QEMU and KVM, which support full virtualization on the Linux kernel. A modified QEMU is used
to create the guest virtual machine and provide I/O emulation. It introduces a lightweight memory virtualization model that provides an appropriate emulation solution for embedded systems.

III. QEMU ARCHITECTURE

QEMU v1.0.1 is the open source emulation software that we used to provide emulation of the MIPS64 Octeon system. QEMU has also been integrated with other virtualization solutions to provide system emulation, e.g., KVM [13] and VirtualBox. Along with KVM, QEMU is able to exploit hardware virtualization support for better performance. VirtualBox has a QEMU based built-in dynamic compiler and also benefits from QEMU's device emulation. Note that in our case a MIPS64 Linux kernel is emulated on an X86 host system; therefore, we refer to the MIPS64 Linux kernel as the "guest" and the X86 Linux system as the "host" in the rest of the paper.

In general, QEMU operates in one of two modes: (1) System Mode Emulation (SME) and (2) User Mode Emulation (UME). In SME, emulation focuses solely on what is visible at the Instruction Set Architecture (ISA) level. This includes registers and hardware structures (e.g., the Translation Look-aside Buffer (TLB)) that can be used with the ISA. Once all the structures visible at the ISA level are emulated, along with the appropriate instruction behavior, QEMU does not need to know the nature of the guest code. All guest code is simply a permutation of architecture-specific instructions. In UME, by contrast, QEMU needs to intercept Linux system calls and provide kernel level behavioral emulation. Although UME works for simple user mode guest applications, it cannot be extended to complex software such as the Linux kernel.

In order to emulate a full Cavium MIPS64 Linux system, we rely on SME in QEMU. For SME, all major virtual devices are contained in the virtual board, and QEMU initializes these devices to begin system emulation. During guest system initialization the target board is initialized, which subsequently initializes the configured devices such as RAM, clock, and network card. In addition, memory is initialized according to the target guest system. The major task of QEMU is to provide correct instruction emulation and an interrupt dispatching mechanism for the guest system. After the basic initialization phase of SME, TCG fetches instructions from the U-Boot binary of the guest. Afterwards, QEMU goes through the fetch, translate, and execute cycle. The U-Boot binary initializes the guest system’s memory and loads the guest kernel binary, which initializes the guest’s configured devices and creates and initializes the whole Linux system with its external devices.

QEMU runs as a user-space process on X86 systems and executes a MIPS64 Octeon Linux guest by providing an emulated environment. QEMU implements a two-thread model for the guest system. One thread emulates the normal flow of guest instructions while the other manages any pending exceptions or external interrupts for the guest system. When an external interrupt occurs, the second thread interrupts the main thread to service the pending interrupt. This dedicated IO-thread runs an event loop to process I/O (including network and disk). In general, the first thread performs the main task of instruction emulation. While a thread is running emulated guest code it cannot concurrently be in the event loop (I/O, translation) because the guest has control of the CPU. QEMU incorporates caches and guest code block chaining for better performance.

IV. EMULATION SUPPORT FOR OCTEON PROCESSOR

Here we discuss the modifications made to different components of QEMU in order to provide emulation support for the Octeon processor. Our first examination revealed that a Malta MIPS64 implementation is available; however, this implementation can only emulate Malta and does not support Octeon. The modified components are briefly described below.

A. Virtual Board

A virtual board is the software counterpart of the main board of a system and is of pivotal importance in SME. A virtual board requires a specific implementation for system mode emulation. Our emulated virtual board contains the software representations of the necessary hardware components such as processor(s), RAM, and I/O devices. We can configure QEMU for either Malta or Octeon emulation for MIPS64 by using the -M command line switch.

B. Instruction Emulation and Memory Management

QEMU already supports the emulation of standard MIPS instructions. However, it does not provide Octeon processor specific instructions. We added emulation of Cavium Octeon specific instructions for SME in TCG, similar to the work for UME in [14]. In general, Malta and Octeon processors are quite different with respect to their memory mapping. For instance, there is a special memory region called ‘CVMSEG’ which is used as scratch pad memory by guest processes. On real hardware this region is mapped onto the L2 cache to enhance the overall performance of the system. In QEMU, all memory accesses to CVMSEG are handled through the normal memory mapping. We note that the guest system is completely oblivious to the fact that CVMSEG is implemented using ordinary RAM. During the initialization of the virtual board, the memory segments (e.g., RAM, CVMSEG) are initialized specifically. The standard address translation mechanism for MIPS64 was compatible with the Octeon processor, so nothing was added to the TLB implementation.

Figure 1 shows the QEMU components that are modified for the Octeon processor. The blocks with a dotted outline indicate the modified components. The Octeon board initializes the memory map and devices. The Soft-MMU contains the MMU, the TLB, and the address interception mechanism. The devices in the Octeon processor are memory mapped; therefore, the “address interception block” performs the address mapping for devices such as the Universal Asynchronous Receiver/Transmitter (UART) and the Central Interrupt Unit (CIU). CVMSEG is a Cavium specific memory region whose implementation was unavoidable. Guest code instructions are intercepted by TCG, to which support for Octeon specific instructions is added. Upon initialization, the Octeon board assigns read/write handlers for devices, which interact with QEMU’s generic layer to provide successful device emulation.

C. System Time

QEMU supports the MIPS based timing system. However, the Octeon specific timer register that requires
increment after every clock cycle was not supported. Emulating this register was unavoidable for the timing infrastructure of the Octeon processor. For this purpose, CVMCount is emulated and incremented at every clock cycle using the pertinent QEMU functions, which are also used for the time stamp counter register of X86 systems. Figure 1 shows the addition of the CVMCount register to the MIPS timing system.

Fig. 1. Modified system architecture of QEMU with Octeon support.

D. Device Emulation

The MIPS64 based Octeon board has a number of devices (e.g., interrupt controller, timers, and UART) that exhibit quite complicated functionality. Moreover, the Octeon system implements memory-mapped I/O devices, i.e., a particular address region is assigned to a particular device, and communication with the device is done by writing to or reading from that address range.

Although QEMU does not support Octeon specific devices, it emulates almost all the basic devices that an embedded board may require. Because complete emulation of the Octeon specific devices would be a cumbersome and time consuming task, we decided on an improvised approach: we used devices that QEMU already emulates and made sure that the corresponding devices were configured in the kernel. The supported devices are the UART, interrupt controller, timer, and networking device. The implementation details follow.

1) Interrupt Controller: Octeon systems use a specific interrupt controller known as the CIU, which is quite a complicated device. The CIU is involved in UART, inter-core, and other device communication. We provide equivalent CIU functionality using an interrupt controller provided by QEMU (i.e., the i8259) instead of emulating a full-fledged interrupt controller for Octeon systems. Since the CIU is also a memory mapped device, we modified QEMU’s memory management unit to provide appropriate values to the interrupt controller depending on the type of interrupt. Interrupts from devices are routed through the i8259 controller to the guest. Figure 1 also depicts the interrupt flow in the modified QEMU.

2) UART: QEMU already had complete UART emulation; the only difference is in the addresses of the UART registers. Because Octeon has a memory mapped UART device, we intercepted QEMU's load and store instruction methods for reading and writing the UART. We call the appropriate UART emulation functions if the load or store address lies within the memory range specified for the device; otherwise we let the access pass through to the memory subsystem.

3) Networking Support: Networking is one of the most significant and complex subsystems. Octeon has its own network card, which is very complex to emulate. Instead of emulating Octeon's network card in QEMU, we rely on networking devices that QEMU already supports (i.e., E1000 and the PCI bus). Figure 2 shows the components modified in QEMU and in the guest to enable networking.

Fig. 2. Modifications in guest driver and MMU for enabling networking support.

Using the emulated PCI bus of QEMU was another challenging task. The actual Octeon board uses PCIe, which is not supported by QEMU. The available emulation of the PCI bus is based on the X86 system. The PCI specifications provide completely software driven initialization and configuration of each device on the PCI bus via a separate configuration address space. However, due to some inconsistencies between the Octeon Linux kernel and QEMU's emulated PCI system, it was cumbersome to run the PCI related code of the Octeon kernel. For this reason, the kernel driver was changed to initialize PCI instead of PCIe. We modified the Octeon specific functions of the guest that read and write the PCI configuration space. In addition, Octeon does not use the specific data and address configuration registers used by QEMU's X86 based emulated PCI system. After the modifications related to the PCI bus were done, the E1000 network card was visible as a PCI device in the guest. In order to provide proper network emulation, we had to use a proper base address for E1000's device registers.
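The address interception described for the memory mapped UART (call the device handler when a load or store address falls inside the device range, otherwise pass the access to memory) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' QEMU code: the names `MmioRegion`, `AddressInterceptor`, and `ToyUart`, the dictionary standing in for guest RAM, and the `UART_BASE` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MmioRegion:
    """A device window in the guest physical address space (hypothetical type)."""
    name: str
    base: int
    size: int
    read: Callable[[int], int]          # read(offset) -> value
    write: Callable[[int, int], None]   # write(offset, value)

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


class AddressInterceptor:
    """Routes guest loads and stores to a device handler or to plain RAM."""

    def __init__(self) -> None:
        self.regions: List[MmioRegion] = []
        self.ram: Dict[int, int] = {}   # sparse stand-in for guest RAM

    def register(self, region: MmioRegion) -> None:
        self.regions.append(region)

    def _find(self, addr: int):
        for region in self.regions:
            if region.contains(addr):
                return region
        return None

    def load(self, addr: int) -> int:
        region = self._find(addr)
        if region is not None:
            return region.read(addr - region.base)   # intercepted access
        return self.ram.get(addr, 0)                 # ordinary memory access

    def store(self, addr: int, value: int) -> None:
        region = self._find(addr)
        if region is not None:
            region.write(addr - region.base, value)  # intercepted access
        else:
            self.ram[addr] = value                   # ordinary memory access


class ToyUart:
    """Toy device: offset 0 is a transmit register, every read returns 0."""

    def __init__(self) -> None:
        self.transmitted: List[str] = []

    def read(self, offset: int) -> int:
        return 0

    def write(self, offset: int, value: int) -> None:
        if offset == 0:
            self.transmitted.append(chr(value & 0xFF))


UART_BASE = 0x1000_0000  # placeholder value, not the real Octeon UART address

uart = ToyUart()
bus = AddressInterceptor()
bus.register(MmioRegion("uart", UART_BASE, 0x100, uart.read, uart.write))

for ch in "ok":
    bus.store(UART_BASE, ord(ch))   # falls in the UART range, reaches the device
bus.store(0x2000, 0x2A)             # outside every device range, lands in RAM
```

The same lookup would serve any other memory mapped device, such as the CIU, by registering a further region with its own handlers.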
We ensure that the guest's E1000 related accesses are properly connected to E1000's emulated code in QEMU and that networking is functional.

V. PERFORMANCE EVALUATION

In this section, we evaluate the performance of QEMU with our Octeon implementation. We ran various standard benchmarks to evaluate the performance of the emulated system and compare it with the native board (i.e., Cavium Octeon CN5700). The host system used in the experiments is an HP ProBook 4540 with a 2.2 GHz Core i7 processor and 8 GB of RAM.

A. Performance Metrics

In general, we measure the performance of our emulated system using two key metrics: 1) throughput and 2) latency. For comparison purposes, we define throughput overhead as the ratio of the throughput of the native system to that of QEMU. Latency overhead is defined as the ratio of the latency of QEMU to that of the native system. In both cases, an overhead of 1 means that both systems have similar performance, whereas an overhead greater than 1 implies a performance penalty due to emulation.

B. Benchmarks

First, we present the results of LMbench, which is a synthetic benchmark used to measure the performance of different subsystems. Additionally, we ran different application benchmarks such as MiBench, CommBench, and Netperf to ascertain the emulation overhead.

1) LMbench: LMbench [15] is a well-known synthetic micro benchmark that evaluates the performance of different subsystems. Figure 3 compares the memory bandwidth results of the guest and native systems. “bw_mem” measures the memory bandwidth while performing different memory operations, including “rd” (read), “wr” (write), “rdwr” (read and write), and “cp” (copy). These operations were performed on a 10MB memory size. “bw_mmap_rd” measures the memory bandwidth while reading a 10MB memory mapped file on the system. “bw_file_rd” reads a 10MB file and measures its bandwidth. “bw_pipe” creates a Unix pipe between two processes and moves 10MB of data through it in 64k size messages. This test provides the pipe bandwidth. “bw_unix” measures the bandwidth of data sockets. For the “rd”, “cp”, “rdwr”, and “bw_mmap_rd” operations the guest performs better than the native system, with overhead ranging from 0.41 to 0.75. The rest of the tests show an emulation overhead of less than 3.

Fig. 3. Memory bandwidth results for different LMbench tests along with their overhead. [y-axis = log scale]

Figure 4 presents the results of tests related to process execution on a Linux system. “lat_ctx” measures the latency of context switching between two processes. “lat_proc” measures the time consumed during the creation of a process through exec or fork. “lat_syscall” measures latencies for different system calls. Note that we observe an overhead of 5.42 to 7.85 in these tests, which is higher than for the memory bandwidth operations, mainly due to the complexity involved in process management.

Fig. 4. Latency results for context switching and different system calls along with the overhead. [y-axis = log scale]
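The two overhead ratios defined under Performance Metrics can be written out as a short sketch. The function names and the sample measurements below are hypothetical and chosen only to show the direction of each ratio; they are not results from the paper.

```python
def throughput_overhead(native: float, emulated: float) -> float:
    """Native throughput divided by emulated throughput (higher throughput is better)."""
    if native <= 0 or emulated <= 0:
        raise ValueError("throughput values must be positive")
    return native / emulated


def latency_overhead(native: float, emulated: float) -> float:
    """Emulated latency divided by native latency (lower latency is better)."""
    if native <= 0 or emulated <= 0:
        raise ValueError("latency values must be positive")
    return emulated / native


def speedup(overhead: float) -> float:
    """How many times faster the guest is; above 1 only when overhead is below 1."""
    return 1.0 / overhead


# Hypothetical measurements, for illustration only.
bw = throughput_overhead(native=1200.0, emulated=400.0)   # MB/s, penalty of 3.0
lat = latency_overhead(native=2.0, emulated=11.0)         # microseconds, penalty of 5.5
```

Both ratios are oriented so that values above 1 mean the emulated guest is slower. By the same arithmetic, an overhead of about 0.73 corresponds to the guest running about 1.37 times faster than the native board.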
Figure 5 shows the results of “lat_ops”, which measures the latency of basic CPU operations such as bit manipulation, add, mul, and div for different data types. Integer multiplication and division are faster on the Octeon guest, with overhead values ranging from 0.26 to 0.65. The rest of the operations show an emulation overhead ranging from 1.48 to 3.14.

Fig. 5. Latency results for basic operation for different data types in LMbench along with overhead. [y-axis = log scale]

2) MiBench: Next, we ran various application benchmarks to assess how our system behaves from an application’s perspective. MiBench [16] is a collection of benchmarks targeting various areas of embedded systems, e.g., automotive, security, and networking. The execution time of the different tests is shown in Figure 6. “basicmath” performs simple mathematical operations, e.g., cubic function solving and integer square root. “qsort” sorts a large array of strings into ascending order. Blowfish, CRC, and SHA belong to the network and data security algorithm category. Our results show that the emulator performs better for the basicmath and sorting applications; however, we observe significantly higher overhead for complex applications such as CRC, Blowfish, and SHA.

Fig. 6. Execution time for different MiBench application tests along with the overhead. [y-axis = log scale]

3) CommBench: CommBench [17] is another application benchmark, designed for testing the performance of processors on communication related tasks. Tests in this benchmark focus on small, computationally intensive kernels that are typical of the network processor environment. These benchmarks are oriented towards packet header processing or data stream processing.

Figure 7 shows the execution time of different applications that are widely used during packet header processing. We observe minor overhead for REED encoding and decoding, whereas ZIP shows variable behavior in encoding and decoding. The maximum overhead in this case is observed for CAST.

4) Netperf: The Netperf benchmark [18] measures bulk data transfer or request-response performance. “TCP_RR” and “UDP_RR” measure the transactions per second during a 120 second test run; the results are shown in Figure 8. Here a transaction is one round trip of sending a request and receiving the response over the network. This unit highlights the one-way and round-trip average latency in the network system.

Networking results show up to 80 times overhead, which is the highest of any benchmark. This is expected, because multiple layers of functionality, including the PCI bus and the network interface, need to be emulated for each packet transmission and reception. Our results are consistent with previous studies [19] [20] that also report poor networking performance of QEMU with a fully emulated E1000 and tap interface.
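The relation between the request/response transaction rate and average latency can be made concrete with a small sketch. It assumes one outstanding transaction at a time and a symmetric network path, so that the round-trip time is the reciprocal of the rate and the one-way time is half of that. The function names and the rates are hypothetical, not measurements from the paper.

```python
def round_trip_latency_us(transactions_per_sec: float) -> float:
    """Average round-trip time in microseconds for a request/response test,
    assuming a single transaction is in flight at any moment."""
    if transactions_per_sec <= 0:
        raise ValueError("transaction rate must be positive")
    return 1_000_000.0 / transactions_per_sec


def one_way_latency_us(transactions_per_sec: float) -> float:
    """Rough one-way estimate, assuming both directions take equal time."""
    return round_trip_latency_us(transactions_per_sec) / 2.0


def rate_overhead(native_tps: float, emulated_tps: float) -> float:
    """Throughput-style overhead: native rate divided by emulated rate."""
    if native_tps <= 0 or emulated_tps <= 0:
        raise ValueError("transaction rates must be positive")
    return native_tps / emulated_tps


# Hypothetical transaction rates, for illustration only.
native_tps = 10_000.0
emulated_tps = 200.0

native_rtt = round_trip_latency_us(native_tps)       # 100.0 microseconds
emulated_rtt = round_trip_latency_us(emulated_tps)   # 5000.0 microseconds
overhead = rate_overhead(native_tps, emulated_tps)   # 50.0
```

Under these assumptions the rate overhead equals the ratio of the two round-trip times, so a large drop in transactions per second maps directly onto a proportional rise in per-packet latency.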
Fig. 7. Execution time for encoding/decoding algorithms in CommBench along with their overhead. [y-axis = log scale]

Fig. 8. Request/Response test results of Netperf. [y-axis = log scale]

VI. CONCLUSION

This paper enhances the emulation capabilities of QEMU, an open source emulator. We highlight the modifications and additions that were required to support the Octeon MIPS64 processor in QEMU. We evaluated the performance of the guest system against the native Octeon board. Our benchmarking results show that in LMbench a simple test such as copy performs 1.37 times better than on the native system. However, for more complicated operations that involve frequent context switches we observe more overhead. Furthermore, MiBench shows that simpler applications, including basic mathematical functions and sorting, perform better than native, while complex algorithms such as SHA incur overhead due to the emulation layer. While CommBench shows varying overhead for different encoding/decoding algorithms, networking seems to be the only component with huge overhead.

ACKNOWLEDGEMENT

This research work was funded by the Higher Education Commission (HEC) Pakistan, under the project titled “MIPS64 - System Mode Emulation in QEMU”.

REFERENCES

[1] Intel Embedded Processors, "Rise of the Embedded Internet." White Paper, 2009.
[2] T. Tuttle. The Internet of Things: the next wave of our connected world. [Online] Available: http://www.embedded.com/design/connectivity/4430102/The-Internet-of-Things--the-next-wave-of-our-connected-world
[3] G. M. Insights. Embedded system market size likely to exceed USD 258 billion by 2023. [Online] Available: https://www.gminsights.com/pressrelease/embedded-system-market-report
[4] F. Bellard, "QEMU, a Fast and Portable Dynamic Translator," in USENIX Annual Technical Conference, FREENIX Track, 2005, pp. 41-46.
[5] Cavium. Corporate Profile. [Online] Available: http://investor.caviumnetworks.com/phoenix.zhtml?c=209126&p=irol-homeProfile&t=&id=&
[6] K. P. Lawton, "Bochs: A portable pc emulator for unix/x," Linux Journal, vol. 1996, p. 7, 1996.
[7] H. Wei. (2012). Armware. [Online] Available: http://code.google.com/p/armware/
[8] S. Biallas. (2004). PearPC-PowerPC architecture emulator. [Online] Available: http://pearpc.sourceforge.net/doc.html
[9] A. Gavare. GXemul. [Online] Available: http://gxemul.sourceforge.net/
[10] R. Bhardwaj, P. Reames, R. Greenspan, V. S. Nori, and E. Ucan, "A Choices Hypervisor on the ARM architecture," Department of Computer Science, University of Illinois at Urbana-Champaign, vol. 11, 2006.
[11] J.-W. Choi and B.-G. Nam, "Development of high performance space processor emulator based on QEMU—Open source dynamic translator," in 12th International Conference on Control, Automation and Systems (ICCAS) 2012, pp. 300-304.
[12] J.-H. Ding, C.-J. Lin, P.-H. Chang, C.-H. Tsang, W.-C. Hsu, and Y.-C. Chung, "Armvisor: System virtualization for arm," in Proceedings of the Ottawa Linux Symposium (OLS), 2012, pp. 93-107.
[13] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, "kvm: the Linux virtual machine monitor," in Proceedings of the Linux symposium, 2007, pp. 225-230.
[14] K. Butt, A. Qadeer, and A. Waheed, "MIPS64 user mode emulation: A case study in open source software engineering," in 7th International Conference on Emerging Technologies (ICET), 2011, pp. 1-6.
[15] L. W. McVoy and C. Staelin, "lmbench: Portable Tools for Performance Analysis," in USENIX annual technical conference, 1996, pp. 279-294.
[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in International Workshop on Workload Characterization, 2001, pp. 3-14.
[17] T. Wolf and M. Franklin, "CommBench-a telecommunications benchmark for network processors," in IEEE International Symposium on Performance Analysis of Systems and Software, 2000, pp. 154-162.
[18] R. Jones, "NetPerf: a network performance benchmark," Information Networks Division, Hewlett-Packard Company, 1996.
[19] L. Rizzo and G. Lettieri, "Vale, a switched ethernet for virtual machines," in Proceedings of the 8th international conference on Emerging networking experiments and technologies, 2012, pp. 61-72.
[20] S. Vrijders, V. Maffione, D. Staessens, F. Salvestrini, M. Biancani, E. Grasa, et al., "Reducing the complexity of virtual machine networking," IEEE Communications Magazine, vol. 54, pp. 152-158, 2016.
