A cloud gaming system based on user level virtualization and its resource scheduling.
This document describes a distributed compilation system called DistCom that utilizes idle computing resources across a network to speed up software build processes. It presents a distributed server/client model using object files as the basic unit. It also discusses CPU scheduling techniques for remote PC resources and cross-compiling for heterogeneous architectures. An evaluation shows DistCom can reduce a mobile platform build time by 65% compared to a local build, achieving performance similar to an 8-core PC using 10 distributed machines.
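The dispatcher logic behind such a system can be illustrated with a minimal least-loaded scheduler. This is a hypothetical sketch, not DistCom's actual implementation: the `Worker` class and `schedule` function are illustrative names, and real systems would also track in-flight compile times and network cost.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    cores: int
    assigned: list = field(default_factory=list)

    def load(self) -> float:
        # Queue length normalized by core count, as a proxy for expected wait.
        return len(self.assigned) / self.cores

def schedule(units, workers):
    """Assign each compilation unit (e.g. one object file) to the
    currently least-loaded worker, DistCom-dispatcher style."""
    for unit in units:
        target = min(workers, key=Worker.load)
        target.assigned.append(unit)
    return {w.name: w.assigned for w in workers}
```

With two workers, `schedule([f"u{i}.o" for i in range(12)], [Worker("local", 4), Worker("pc1", 8)])` ends up assigning units roughly in proportion to core counts, which is why adding idle remote PCs shortens the build.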
This project deals with the warehouse-scale computers (WSCs) that power the Internet services we use today. It covers the hardware blocks used in a Google WSC and the architecture of hardware accelerators such as the graphics processing unit (GPU) and the tensor processing unit (TPU), which help warehouse-scale machines run heavy workloads and support application-specific machine learning and deep learning tasks. The project also examines the energy efficiency of the processors used in a Google WSC to achieve high performance, and the performance-enhancement mechanisms the WSC employs.
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath... (Indrajit Poddar)
This document provides an overview of enabling cognitive workloads on the cloud using GPUs with Mesos, Docker, and Marathon on IBM's POWER systems. It discusses requirements for GPUs in the cloud like exposing GPUs to containers and supporting multiple GPUs per node. It also summarizes Mesos and Kubernetes support for GPUs, and demonstrates running a deep learning workload on OpenPOWER hardware to identify dog breeds using Docker containers and GPUs.
A gossip protocol for dynamic resource management in large cloud environments (JPINFOTECH JAYAPRAKASH)
The document proposes a gossip protocol for dynamic resource management in large cloud environments. It aims to (1) ensure fair resource allocation among sites/applications, (2) dynamically adapt the allocation to load changes, and (3) scale with the number of physical machines and sites/applications. Existing gossip protocols have drawbacks like assuming static input, requiring restarts and global synchronization when input changes. The proposed protocol continuously executes on dynamic local input without global synchronization. It formally defines the resource allocation problem and provides an optimal solution without memory constraints and a heuristic solution that considers memory constraints and adaptation costs.
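The core gossip primitive such protocols build on can be sketched in a few lines: nodes repeatedly pair up and average their state, converging toward the global mean with no global synchronization. This is a minimal illustration of the pairwise-averaging idea only, not the paper's full protocol (which handles dynamic input and memory constraints).

```python
import random

def gossip_round(loads, rng):
    """One gossip round: each node picks a random peer and the pair
    averages their loads. Repeated rounds drive every node toward the
    global mean without any central coordinator."""
    nodes = list(loads)
    rng.shuffle(nodes)
    for i in nodes:
        j = rng.choice([n for n in loads if n != i])
        avg = (loads[i] + loads[j]) / 2
        loads[i] = loads[j] = avg
    return loads

# Four machines with unbalanced load; total load is preserved each round.
loads = {"m1": 90.0, "m2": 10.0, "m3": 50.0, "m4": 30.0}
rng = random.Random(7)
for _ in range(20):
    gossip_round(loads, rng)
```

After a handful of rounds every node's estimate sits close to the true mean (45.0 here), which is the property the paper's protocol maintains continuously as the input changes.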
A Survey of Performance Comparison between Virtual Machines and Containers (prashant desai)
Since the onset of cloud computing and its inroads into infrastructure as a service, virtualization has become critically important in the field of abstraction and resource management. However, the additional layers of abstraction provided by virtualization come at a trade-off between performance and cost in a cloud environment where everything is on a pay-per-use basis. Containers, which are perceived to be the future of virtualization, were developed to address these issues. This paper scrutinizes the performance of a conventional virtual machine and contrasts it with containers. We cover a critical assessment of each parameter and its behavior when it is subjected to various stress tests. We discuss the implementations and their performance metrics to help draw conclusions on which is ideal for a given need. After assessing the results and discussing the limitations, we conclude with prospects for future research.
Architecture exploration of recent GPUs to analyze the efficiency of hardware... (journalBEEI)
This document analyzes the efficiency of hardware resources in recent GPU architectures like Pascal compared to older architectures like Fermi. It simulates 9 benchmarks on a Fermi and Pascal-based GPU configuration using a cycle-accurate simulator. The results show that Pascal improves performance by 273% on average over Fermi. It also analyzes the impact of computing resources versus memory resources, varying the number of warp schedulers, and measuring barrier synchronization overhead. The goal is to understand how hardware upgrades in newer architectures translate to performance gains and guide future GPU development.
The document provides instructions for installing Xen Cloud Platform host software on a physical server. It describes selecting installation options such as keyboard layout, driver installation, clean vs upgrade install. It also covers configuring storage, networking and other setup steps. The goal is to install a Xen hypervisor and management tools to create a platform for hosting virtual machines.
Gamebryo LightSpeed provides improved runtime performance, a modular game framework, entity modeling tools, Lua scripting and debugging, and rapid iteration capabilities. New features include deferred lighting for improved rendering, a decoration system for terrain customization, terrain streaming for unlimited map sizes, and an enhanced water editor. It offers an integrated development environment with tools in Visual Studio, 3DS Max, and the Toolbench plugin suite.
Resumption of virtual machines after adaptive deduplication of virtual machin... (IJECEIAES)
In cloud computing, load balancing and energy utilization are critical problems addressed by virtual machine (VM) migration. Live migration is the movement of running VMs from an overloaded or underloaded physical machine to a more suitable one. During this process, transferring large disk image files takes more time, hence longer migration and down time. In the proposed adaptive deduplication, the image file undergoes both fixed- and variable-length deduplication, depending on its size. The contribution of this paper is the resumption of VMs from reunited deduplicated disk image files. Performance is measured by the percentage reduction in VM image size after deduplication, the time taken to migrate the deduplicated file, and the time taken for each VM to resume after migration. The results show an 83% reduction in overall image size and an 89.76% reduction in migration time. For a deduplication ratio of 92%, the overall time is 3.52 minutes, a 7% reduction in resumption time compared with the original QCOW2 files. For VMDK files, resumption time is reduced by up to 17% (7.63 minutes) compared with the original files.
Presentation I gave at the SORT Conference in 2011, generalized from some work I had done using GPUs to accelerate image processing at FamilySearch.
This document discusses using OpenCL to accelerate numerical modeling of gravitational wave sources on hardware accelerators like GPUs and the Cell BE. It summarizes the EMRI Teukolsky Code, which models gravitational waves generated by a compact object orbiting a supermassive black hole by solving the Teukolsky equation. The authors parallelized this code using OpenCL to run on GPUs and the Cell BE, achieving performance comparable to using each vendor's native SDK while only writing code once for both architectures.
Fast Scalable Easy Machine Learning with OpenPOWER, GPUs and Docker (Indrajit Poddar)
Transparently accelerated Deep Learning workloads on OpenPOWER systems and GPUs using easy to use open source frameworks such as Caffe, Torch, Tensorflow, Theano.
SIMULATION AND PERFORMANCE ANALYSIS OF A LARGE SCALED INTERNET APPLICATION ... (ankit_saluja)
This document describes simulations of a large-scale Internet application (Facebook) on cloud computing environments using a tool called CloudAnalyst. Several scenarios are simulated with different configurations for data center location, service broker algorithm, and VM load balancing algorithm. Key results include overall response time, data center processing time, and cost. The best performance was found with multiple data centers located in each region using proximity-based routing and throttled load balancing, with an average response time of 205ms and total cost of $1,128.94.
This document discusses integrating Data Protection Manager (DPM) 2007 with a SAN (storage area network) to allow for quick file and application recovery using hardware-based snapshot and cloning technology. It provides details on using SAN clone technology for initial replication in DPM and recovery using SAN hardware snapshots. Performance testing showed cloning two 400GB Exchange storage groups took around 4 hours with a transfer rate of 56.8 MB/Sec.
Shader Model 5.0 introduces several new features for vertex, hull, domain, geometry, and pixel shaders, including uniform indexing of resources, SV_Coverage system value, and double precision support. Compute shaders also gain features like raw and structured buffer views, atomic operations, and thread local storage. Compute shaders are well-suited for general purpose GPU tasks like post-processing and can perform Gaussian blur more efficiently than pixel shaders by reducing memory bandwidth usage through thread local storage.
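The arithmetic saving behind the compute-shader blur can be illustrated outside HLSL with a separable Gaussian filter: two 1D passes touch O(2r) samples per pixel instead of O(r²) for a full 2D kernel, analogous to how a compute shader stages a tile in thread-local (groupshared) memory to avoid redundant texture fetches. This NumPy sketch only models the separable-filter part of that optimization.

```python
import numpy as np

def gaussian_kernel_1d(radius, sigma):
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x * x) / (2 * sigma * sigma))
    return k / k.sum()  # normalize so a constant image stays constant

def separable_blur(img, radius=3, sigma=1.5):
    """Gaussian blur as two 1D passes (horizontal, then vertical),
    the factorization compute shaders typically exploit."""
    k = gaussian_kernel_1d(radius, sigma)
    out = img.astype(float)
    # Horizontal pass: pad columns, then weighted sum of shifted slices.
    padded = np.pad(out, ((0, 0), (radius, radius)), mode="edge")
    out = sum(w * padded[:, i:i + out.shape[1]] for i, w in enumerate(k))
    # Vertical pass: same, along rows.
    padded = np.pad(out, ((radius, radius), (0, 0)), mode="edge")
    out = sum(w * padded[i:i + out.shape[0], :] for i, w in enumerate(k))
    return out
```

A quick sanity check: blurring a constant image returns the same constant, since each pass's weights sum to 1.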
Power through your high school courseload with a responsive Chromebook (Principled Technologies)
Two Chromebooks with Intel Core i3-1125G4 and Intel Pentium Silver N6000 processors required less time to complete tasks in educational apps than two Chromebooks with MediaTek MT8183 and Qualcomm Snapdragon 7c processors.
LIQUID-A Scalable Deduplication File System For Virtual Machine Images (fabna benz)
LIQUID-A Scalable Deduplication File System For Virtual Machine Images.
INTRODUCTION: Cloud computing means storing and accessing data and programs over the Internet instead of your computer's hard drive.
A virtual machine is software that creates a virtualized environment between the computer platform and the end user, in which the end user can operate software.
Data deduplication is a data compression technique that eliminates duplicate copies of repeating data.
A redundant data block is replaced with a reference instead of being stored multiple times, improving storage utilization.
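The block-replacement idea can be sketched with fixed-size chunking and content-addressed storage. This is a minimal illustration, not LIQUID's actual design (which also covers P2P transfer and caching); the function names are hypothetical.

```python
import hashlib

def deduplicate(data, block_size=4096):
    """Split data into fixed-size blocks and store each unique block
    once, keyed by its SHA-256 fingerprint. The image itself becomes a
    'recipe': an ordered list of fingerprints."""
    store, recipe = {}, []
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)  # redundant blocks stored only once
        recipe.append(fp)
    return store, recipe

def reassemble(store, recipe):
    """Rebuild the original byte stream from the recipe."""
    return b"".join(store[fp] for fp in recipe)
```

For a 16 KiB image made of three identical "A" blocks and one "B" block, the store holds only two unique blocks while the recipe still reconstructs the image exactly.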
ADVANTAGES OF LIQUID:
*Fast virtual machine deployment with peer to peer data transfer.
*Low storage consumption by means of deduplication.
*Instant cloning for virtual machine images.
*On-demand fetching over the network, with caching on local disks.
*No specific limit on LIQUID file sizes.
CONCLUSION:
LIQUID is a deduplication file system with good I/O performance.
This is achieved by caching frequently accessed data blocks in memory, avoiding additional disk operations.
Deduplication of VM images proved to be effective.
Proper resource allocation is critical to achieving top application performance in a virtualized environment. Resource contention degrades performance and underutilization can lead to costly server sprawl.
We found that adding VMTurbo to a VMware vSphere 5.5 cluster and following its reallocation recommendations gave our application performance a big boost. After reducing vCPU count, increasing memory allocation to active databases, and moving VMs to more responsive storage as VMTurbo directed, online transactions increased by 23.7 percent while latency dropped significantly. Avoid the pitfalls of poorly allocated VM resources and give your virtualized application every advantage by gaining control of your environment at every level.
This document describes using in-place computing on PostgreSQL to perform statistical analysis directly on data stored in a PostgreSQL database. Key points include:
- An F-test is used to compare the variances of accelerometer data from different phone models (Nexus 4 and S3 Mini) and activities (walking and biking).
- Performing the F-test directly in PostgreSQL via SQL queries is faster than exporting the data to an R script, as it avoids the overhead of data transfer.
- PG-Strom, an extension for PostgreSQL, is used to generate CUDA code on-the-fly to parallelize the variance calculations on a GPU, further speeding up the F-test.
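The statistic being pushed into the database is simple: the F statistic for comparing two variances is just their ratio. A minimal Python sketch with made-up accelerometer readings (the sample values below are illustrative, not the paper's data); in SQL the same quantity is a ratio of `var_samp` aggregates, which is what PG-Strom can offload to the GPU.

```python
from statistics import variance

def f_statistic(sample_a, sample_b):
    """F statistic for a variance-comparison test: the ratio of the two
    sample variances. Values far from 1 suggest the activities differ."""
    return variance(sample_a) / variance(sample_b)

# Hypothetical accelerometer magnitudes: biking shakes the phone more.
walking = [0.20, 0.10, 0.30, 0.20, 0.25, 0.15]
biking  = [0.90, 0.10, 1.20, 0.30, 1.50, 0.20]
```

Here `f_statistic(biking, walking)` comes out well above 1, consistent with biking producing more variable readings; a full test would compare it against the F distribution's critical value for the two sample sizes.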
Classification of Virtualization Environment for Cloud Computing (Souvik Pal)
Cloud computing is a relatively new field gaining popularity day by day for wide-ranging applications among Internet users. Virtualization plays a significant role in managing and coordinating access from the resource pool to the multiple virtual machines on which heterogeneous applications run. The various virtualization methodologies matter because they help cope with complex workloads, frequent application patching and updating, and multiple software architectures. Although a great deal of research has been conducted on virtualization, the issues involved have mostly been presented in isolation from each other. We therefore present a comprehensive survey of the different aspects of virtualization, together with our classification of virtualization methodologies and brief explanations of each, based on their working principles and underlying features.
Cloud Gaming Architectures: From Social to Mobile to MMO (AWS Germany)
October 21st 2015, Cloud Gaming Architectures: From Social to Mobile to MMO, Mark Bate
The AWS Pop-up Loft in Berlin is open for a limited time only. From 15.10. to 13.11.2015 you have the unique opportunity to be part of something special. Become a Loft member now for free and get exclusive access to the attractive Loft offers. http://aws.amazon.com/de/start-ups/loft/de-loft/
Ensuring High-performance of Mission-critical Java Applications in Multi-tena... (Zhenyun Zhuang)
The document discusses problems with ensuring high performance of mission-critical Java applications in multi-tenant cloud environments. It identifies issues caused by resource sharing between applications on the same platform, such as memory pressure triggering page swapping and direct reclaiming, which can severely degrade Java application performance through increased garbage collection pauses and reduced throughput. The authors investigate two scenarios in a production environment and determine that transparent huge pages, memory pressure from other applications, and interactions between the JVM and Linux memory management are key factors impacting Java application performance in multi-tenant cloud setups.
This white paper describes the design of a 50,000-seat virtual desktop deployment using VMware View and vSphere virtualization technologies. Key partners in the project include NetApp, VMware, Cisco, Fujitsu, and Wyse Technology. The design uses a "pool of desktops" approach with standardized server and storage configurations that can be replicated modularly to scale the deployment from 5,000 to 50,000 seats. Detailed specifications are provided for the servers, storage, networking, and other infrastructure components used in the deployment.
This document discusses quality of service (QoS) parameters that are important for online multiplayer gaming. It outlines key QoS metrics like throughput, transit delay, delay jitter, and error rate. These metrics affect both game developers and players. For developers, QoS impacts issues like latency, appropriate transport protocols, and compression techniques. For players, perceived latency is especially important as it can impact whether targets are hit in first-person shooter or racing games. The growing demand for fast and reliable online gaming means QoS will remain an important consideration.
Introducing GeForce NOW, a new game streaming service that is like Netflix for games. Learn about the benefits, technology and roadmap that will transform how video games are played.
- Virtual GPU (vGPU) technology from Nvidia allows multiple virtual desktop users to share a single physical GPU, increasing user density compared to previous solutions.
- Nvidia's GRID vGPU uses the Tesla M6, M10, and M60 GPUs with various profile options (Q, B, A) that allocate different amounts of GPU memory.
- Setting up vGPU requires installing Nvidia drivers in the VM and configuring the hypervisor and VM; monitoring tools are available to check GPU usage.
- Use cases for vGPU include 3D applications, high-resolution displays, video acceleration, and GPU pass-through for applications like CAD and content creation.
An introduction to what multiplayer games are, what makes them different from normal games, how to approach building them and specifically how to begin building them with the Unity game engine.
Talk given at the GameIS & Dragonplay mobile multiplayer hackathon, 30/7/2015
Cloud Gaming Onward: Research Opportunities and Outlook (Academia Sinica)
Cloud gaming has become increasingly popular in academia and industry, as evidenced by the large number of related research papers and startup companies. Some public cloud gaming services have attracted hundreds of thousands of subscribers, demonstrating the initial success of cloud gaming services. Pushing cloud gaming services forward, however, faces various challenges, which open up many research opportunities. In this paper, we share our views on future cloud gaming research and point out several research problems spanning a wide spectrum of directions, including distributed systems, video codecs, virtualization, human-computer interaction, quality of experience, resource allocation, and dynamic adaptation. Solving these research problems will allow service providers to offer high-quality cloud gaming services while remaining profitable, which in turn results in an even more successful cloud gaming ecosystem. In addition, we believe there will be many more novel ideas to capitalize on the abundant and elastic cloud resources for a better gaming experience, and we will see these ideas and their associated challenges in the years to come.
Gamelets - Multiplayer Mobile Games with Distributed Micro-Clouds [Full Text] (Anand Bhojan)
This document proposes a system called Gamelets that uses distributed micro-clouds to improve cloud gaming for mobile devices. Gamelets are minimal hardware devices like WiFi access points that are placed close to mobile clients. They run parts of games to reduce latency and bandwidth usage compared to rendering entirely in centralized cloud servers. Key challenges include distributing game data across Gamelets, distributed rendering to share loads, and security issues from the distributed nature. The document outlines a prototype implementation that divides games into zones distributed across Gamelets and uses multiple cameras for distributed rendering loads. Gamelets aim to enable more types of cloud games on mobile by addressing latency and scalability issues of traditional cloud approaches.
With the arrival of cloud technology, game accessibility and ubiquity have a bright future; games can be hosted on a centralized server and accessed through the Internet by a thin client on a wide variety of devices with modest capabilities: cloud gaming. However, current cloud gaming systems have very strong requirements in terms of network resources, thus reducing the accessibility and ubiquity of cloud games, because devices with little bandwidth, and people located in areas with limited and unstable network connectivity, cannot take advantage of these cloud services.
In this paper we present an adaptation technique inspired by the level of detail (LoD) approach in 3D graphics. It delivers multiple-platform accessibility and network adaptability, while improving the user's quality of experience (QoE) by reducing the impact of poor and unstable network parameters (delay, packet loss, jitter) on game interactivity. We validate our approach using a prototype game in a controlled environment and characterize the user QoE in a pilot experiment. The results show that the proposed framework provides a significant QoE enhancement.
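The adaptation decision such an LoD-inspired technique makes can be sketched as a feasibility check over measured network conditions. The level names and thresholds below are purely illustrative, not taken from the paper.

```python
def pick_detail_level(bandwidth_kbps, rtt_ms, levels):
    """Pick the richest detail level whose bandwidth and latency
    requirements the measured network satisfies; fall back to the
    lightest level when nothing fits."""
    feasible = [lvl for lvl in levels
                if bandwidth_kbps >= lvl["min_kbps"] and rtt_ms <= lvl["max_rtt_ms"]]
    if feasible:
        return max(feasible, key=lambda l: l["min_kbps"])["name"]
    return min(levels, key=lambda l: l["min_kbps"])["name"]

# Hypothetical levels, from full server-side streaming down to a mode
# where the thin client renders simplified geometry itself.
LEVELS = [
    {"name": "full-stream",      "min_kbps": 5000, "max_rtt_ms": 80},
    {"name": "reduced-geometry", "min_kbps": 1500, "max_rtt_ms": 150},
    {"name": "client-rendered",  "min_kbps": 300,  "max_rtt_ms": 400},
]
```

A client on a fast, low-latency link gets the full stream; a constrained link degrades gracefully to lighter levels instead of becoming unplayable, which is the QoE argument the paper makes.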
The document discusses the evolution of GPU architecture and capabilities over time. It describes how GPUs have become massively parallel processors with programmable capabilities beyond just graphics. The document outlines the core components of a GPU including the graphics pipeline and programming model. It also discusses how GPUs are well suited for parallel, data-intensive applications and how their capabilities have expanded into general purpose computing through technologies like CUDA.
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS (cseij)
This document summarizes a survey on GPU systems and their performance on different applications. It discusses how GPUs can be used for general-purpose computing due to their highly parallel processing capabilities. Several computationally intensive applications that achieve speedups when implemented on GPUs are described, including video decoding, matrix multiplication, parallel AES encryption, and password recovery for MS Office documents. The GPU architecture and Nvidia's CUDA programming model are also summarized. While GPUs provide significant performance benefits, some limitations for non-graphics applications are noted. The conclusion is that GPUs are a good alternative for computationally intensive tasks, reducing CPU load and improving performance compared to CPU-only implementations.
Running Dicom Visualization On The Cell (Ps3) Rsna Poster Presentationbroekemaa
The document summarizes a project to use the computational capabilities of the PlayStation 3 (PS3) to accelerate medical image visualization. Researchers developed a framework to distribute DICOM image processing algorithms across client workstations and a PS3. Initial benchmarks showed the PS3 could process batches of 1000 DICOM images in under 10 minutes using just its main processor core. Future work involves optimizing algorithms for the PS3's specialized cores and comparing its performance to other hardware.
The International Journal of Engineering and Science (The IJES)theijes
The document summarizes research on heterogeneous computing using CPU-GPU integration. It proposes a unified graphics computing architecture (UGCA) that utilizes both CPU and GPU resources efficiently. The UGCA design translates PTX code to LLVM for execution on CPU and GPU. It also introduces a workload distribution module that splits tasks between CPU and GPU kernels based on granularity. Performance comparisons show CUDA providing better speedups than OpenCL due to its coarse-grained warp-level parallelism. The architecture aims to improve resource utilization for heterogeneous multi-core processors.
GPU computing uses a GPU as a co-processor to accelerate CPUs for general computing tasks by offloading computationally intensive parts of code. WebGL allows 3D configuration in real-time, while WebCL enables web applications to leverage parallel processing on GPUs and multi-core CPUs directly from a browser. Nokia and Samsung have implemented open source WebCL prototypes for Firefox and WebKit, allowing GPU acceleration of tasks like image processing.
This document discusses GPU computing and provides comparisons between CPU and GPU architectures and performance. It begins by introducing hybrid clusters that use accelerators like GPUs and FPGAs to provide high-performance computation. GPUs are discussed as being highly parallel and suitable for general-purpose computations. The document then summarizes GPU architecture and programming models like CUDA and OpenCL that are used to program GPUs. It provides an example GPU hardware architecture and explains how programming models map applications to GPU resources. Benchmark results are mentioned as showing GPUs can provide significantly faster computation times than CPUs for parallel problems.
This document discusses GPU computing and provides comparisons between CPU and GPU architectures and performance. It begins by introducing hybrid clusters that use accelerators like GPUs and FPGAs to provide high-performance computation. GPUs are discussed as being highly parallel and suitable for general-purpose computations. The document then summarizes GPU architecture and programming models like CUDA and OpenCL that are used to program GPUs. It provides an example GPU hardware architecture and explains how programming models map applications to GPU resources. Benchmark results are mentioned as showing GPUs can provide significantly faster computation times than CPUs for parallel problems.
The document discusses visualization systems and proposes concepts for their future development. It summarizes:
1) The "Visual Realityware" visualization software development environment, which uses an abstraction layer to allow developers to freely select mainstream graphics technologies and expand applications across multiple platforms with minimal bugs.
2) An application called "Virtual Anatomia" developed using Visual Realityware to visualize 3D biological data in real-time.
3) The concept of "Visionize" which is defined as a risk management methodology using visual communication to allow sharing of goals and visions in order to identify and prevent risks before issues arise.
Supporting bioinformatics applications with hybrid multi-cloud servicesAhmed Abdullah
ElasticHPC Supports the creation and management of cloud computing resources over multiple public cloud Providers Including Amazon, Azure, Google and Clouds supporting OpenStack.
Towards Fog-Assisted Virtual Reality MMOG with Ultra-Low LatencyIJCNCJournal
In this paper, we propose a method to realize a virtual reality MMOG (Massively Multiplayer Online Video Game) with ultra-low latency. The basic idea of the proposed method is to introduce a layer consisting of several fog nodes between clients and cloud server to offload a part of the rendering task which is conducted by the cloud server in conventional cloud games. We examine three techniques to reduce the latency in such a fog-assisted cloud game: 1) To maintain the consistency of the virtual game space, collision detection of virtual objects is conducted by the cloud server in a centralized manner; 2) To reflect subtle changes of the line of sight to the 3D game view, each client is assigned to a fog node and the head motion of the player acquired through HMD (Head-Mounted Display) is directly sent to the corresponding fog node; and 3) To offload a part of the rendering task, we separate the rendering of the background view from that of the foreground view, and migrate the former to other nodes including the cloud server. The performance of the proposed method is evaluated by experiments with an AWS-based prototype system. It is confirmed that the proposed techniques achieve the latency of 32.3 ms, which is 66 % faster than the conventional systems.
Exploring the Pros & Cons of GPU Cloud Servers for AI and ML.pdfGPU SERVER
In the modern era of AI and ML, the requirement of powerful computing assets is now more meaningful as compared to previous years. Cutting-edge GPU servers, proficient at managing challenging tasks, that are necessary for training AI models, etc. Standard CPUs usually fail to fulfill the demands when it comes to performing parallel processing. This is the case where GPU cloud servers play a significant role, offering robust solutions for artificial intelligence and ML-based tasks. Let’s check out the pros and cons of utilizing GPU cloud hosting, with a complete focus on how NVIDIA GPU cloud services can boost your tasks.
Image Processing Application on Graphics processorsCSCJournals
In this work, we introduce real time image processing techniques using modern programmable Graphic Processing Units GPU. GPU are SIMD (Single Instruction, Multiple Data) device that is inherently data-parallel. By utilizing NVIDIA new GPU programming framework, “Compute Unified Device Architecture” CUDA as a computational resource, we realize significant acceleration in image processing algorithm computations. We show that a range of computer vision algorithms map readily to CUDA with significant performance gains. Specifically, we demonstrate the efficiency of our approach by a parallelization and optimization of image processing, Morphology applications and image integral.
An exposition of performance comparison of graphic processing unit virtualiza...Asif Farooq
As the demand for computing power is increasing the number of new and improved methodologies in computer architectures are expanding. With the introduction of accelerated heterogeneous computing model, compute times for complex algorithms and tasks are reduced significantly as a result of high degree data parallelism. GPU based heterogeneous computing can not only benefit Cloud infrastructures but also large-scale distributed computing models to work more cost-effective by improving resource efficiencies and decreasing energy consumptions. Thus to implement such paradigm on cloud and largescale infrastructure would require effective GPU virtualization techniques. In this survey, an overview of GPGPU virtualization techniques using CUDA programming model is reviewed with a detailed performance comparison.
This document provides an overview and comparison of different GPU virtualization techniques using the CUDA programming model. It first reviews several techniques for GPU virtualization, including GViM, vCUDA, gVirtuS, rCUDA, DS-CUDA, LoGV, and Grid CUDA. It then compares these techniques based on factors like the CUDA version compatibility, hypervisor used, and whether they support remote GPU acceleration. Finally, the document provides a performance comparison based on overhead percentages and execution times reported in various studies, with rCUDA having the lowest overhead and fastest execution time on average.
VDI performance and price comparison: AMD-based Open Compute 3.0 server vs. H...Principled Technologies
Any organization using virtual desktop infrastructure can benefit by investing in servers that deliver high performance at a reasonable price. In our test, the AMD-based Open Compute 3.0 server hosted a few more virtual desktop sessions that the HP ProLiant DL360p Gen8 server did, while costing less than half as much.
Toradex's latest blog post written by Leonardo Graboski Veiga, FAE, Toradex Brasil, shows you how to provision an Ubuntu Server 16.04 LTS virtual machine in Microsoft Azure, and use Yocto/OpenEmbedded to generate an embedded Linux image. Read on here: https://www.toradex.com/blog/cloud-aided-yocto-build-speedup
An efficient tree based self-organizing protocol for internet of thingsredpel dot com
An efficient tree based self-organizing protocol for internet of things.
for more ieee paper / full abstract / implementation , just visit www.redpel.com
Validation of pervasive cloud task migration with colored petri netredpel dot com
The document describes a study that used Colored Petri Nets (CPN) to model and simulate task migration in pervasive cloud computing environments. The study made the following contributions:
1) It expanded the semantics of CPN to include context information, creating a new CPN model called CCPN.
2) Using CCPN, it constructed two task migration models - one that considered context and one that did not - to simulate task migration in a pervasive cloud based on the OSGi framework.
3) It simulated the two models in CPN Tools and evaluated them based on metrics like task migration accessibility, integrity of the migration process, and system reliability and stability after migration. It also
Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...redpel dot com
The document proposes a novel approach for predicting quality of service (QoS) metrics for cloud services. The approach combines fuzzy neural networks and adaptive dynamic programming (ADP) for improved prediction accuracy. Specifically, it uses an adaptive-network-based fuzzy inference system (ANFIS) to extract fuzzy rules from QoS data and employ ADP for online parameter learning of the fuzzy rules. Experimental results on a large QoS dataset demonstrate the prediction accuracy of this approach. The approach also provides a convergence proof to guarantee stability of the neural network weights during training.
Towards a virtual domain based authentication on mapreduceredpel dot com
This document proposes a novel authentication solution for MapReduce (MR) models deployed in public clouds. It begins by describing the MR model and job execution workflow. It then discusses security issues with deploying MR in open environments like clouds. Next, it specifies requirements for an MR authentication service, including entity identification, credential revocation, and authentication of clients, MR components, and data. It analyzes existing MR authentication methods and finds they do not fully address the needs of cloud-based MR deployments. The paper then proposes a new "layered authentication solution" with a "virtual domain based authentication framework" to better satisfy the requirements.
Privacy preserving and delegated access control for cloud applicationsredpel dot com
Privacy preserving and delegated access control for cloud applications
for more ieee paper / full abstract / implementation , just visit www.redpel.com
Performance evaluation and estimation model using regression method for hadoo...redpel dot com
Performance evaluation and estimation model using regression method for hadoop word count.
for more ieee paper / full abstract / implementation , just visit www.redpel.com
Frequency and similarity aware partitioning for cloud storage based on space ...redpel dot com
Frequency and similarity aware partitioning for cloud storage based on space time utility maximization model.
for more ieee paper / full abstract / implementation , just visit www.redpel.com
Multiagent multiobjective interaction game system for service provisoning veh...redpel dot com
Multiagent multiobjective interaction game system for service provisoning vehicular cloud
for more ieee paper / full abstract / implementation , just visit www.redpel.com
Efficient multicast delivery for data redundancy minimization over wireless d...redpel dot com
Efficient multicast delivery for data redundancy minimization over wireless data centers
for more ieee paper / full abstract / implementation , just visit www.redpel.com
Cloud assisted io t-based scada systems security- a review of the state of th...redpel dot com
Cloud assisted io t-based scada systems security- a review of the state of the art and future challenges.
for more ieee paper / full abstract / implementation , just visit www.redpel.com
I-Sieve: An inline High Performance Deduplication System Used in cloud storageredpel dot com
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
for more ieee paper / full abstract / implementation , just visit www.redpel.com
Architecture harmonization between cloud radio access network and fog networkredpel dot com
Architecture harmonization between cloud radio access network and fog network
for more ieee paper / full abstract / implementation , just visit www.redpel.com
A tutorial on secure outsourcing of large scalecomputation for big dataredpel dot com
A tutorial on secure outsourcing of large scalecomputation for big data
for more ieee paper / full abstract / implementation , just visit www.redpel.com
A parallel patient treatment time prediction algorithm and its applications i...redpel dot com
A parallel patient treatment time prediction algorithm and its applications in hospital.
for more ieee paper / full abstract / implementation , just visit www.redpel.com
Understanding P–N Junction Semiconductors: A Beginner’s GuideGS Virdi
Dive into the fundamentals of P–N junctions, the heart of every diode and semiconductor device. In this concise presentation, Dr. G.S. Virdi (Former Chief Scientist, CSIR-CEERI Pilani) covers:
What Is a P–N Junction? Learn how P-type and N-type materials join to create a diode.
Depletion Region & Biasing: See how forward and reverse bias shape the voltage–current behavior.
V–I Characteristics: Understand the curve that defines diode operation.
Real-World Uses: Discover common applications in rectifiers, signal clipping, and more.
Ideal for electronics students, hobbyists, and engineers seeking a clear, practical introduction to P–N junction semiconductors.
Geography Sem II Unit 1C Correlation of Geography with other school subjectsProfDrShaikhImran
The correlation of school subjects refers to the interconnectedness and mutual reinforcement between different academic disciplines. This concept highlights how knowledge and skills in one subject can support, enhance, or overlap with learning in another. Recognizing these correlations helps in creating a more holistic and meaningful educational experience.
A measles outbreak originating in West Texas has been linked to confirmed cases in New Mexico, with additional cases reported in Oklahoma and Kansas. The current case count is 795 from Texas, New Mexico, Oklahoma, and Kansas. 95 individuals have required hospitalization, and 3 deaths, 2 children in Texas and one adult in New Mexico. These fatalities mark the first measles-related deaths in the United States since 2015 and the first pediatric measles death since 2003.
The YSPH Virtual Medical Operations Center Briefs (VMOC) were created as a service-learning project by faculty and graduate students at the Yale School of Public Health in response to the 2010 Haiti Earthquake. Each year, the VMOC Briefs are produced by students enrolled in Environmental Health Science Course 581 - Public Health Emergencies: Disaster Planning and Response. These briefs compile diverse information sources – including status reports, maps, news articles, and web content– into a single, easily digestible document that can be widely shared and used interactively. Key features of this report include:
- Comprehensive Overview: Provides situation updates, maps, relevant news, and web resources.
- Accessibility: Designed for easy reading, wide distribution, and interactive use.
- Collaboration: The “unlocked" format enables other responders to share, copy, and adapt seamlessly. The students learn by doing, quickly discovering how and where to find critical information and presenting it in an easily understood manner.
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - WorksheetSritoma Majumder
Introduction
All the materials around us are made up of elements. These elements can be broadly divided into two major groups:
Metals
Non-Metals
Each group has its own unique physical and chemical properties. Let's understand them one by one.
Physical Properties
1. Appearance
Metals: Shiny (lustrous). Example: gold, silver, copper.
Non-metals: Dull appearance (except iodine, which is shiny).
2. Hardness
Metals: Generally hard. Example: iron.
Non-metals: Usually soft (except diamond, a form of carbon, which is very hard).
3. State
Metals: Mostly solids at room temperature (except mercury, which is a liquid).
Non-metals: Can be solids, liquids, or gases. Example: oxygen (gas), bromine (liquid), sulphur (solid).
4. Malleability
Metals: Can be hammered into thin sheets (malleable).
Non-metals: Not malleable. They break when hammered (brittle).
5. Ductility
Metals: Can be drawn into wires (ductile).
Non-metals: Not ductile.
6. Conductivity
Metals: Good conductors of heat and electricity.
Non-metals: Poor conductors (except graphite, which is a good conductor).
7. Sonorous Nature
Metals: Produce a ringing sound when struck.
Non-metals: Do not produce sound.
Chemical Properties
1. Reaction with Oxygen
Metals react with oxygen to form metal oxides.
These metal oxides are usually basic.
Non-metals react with oxygen to form non-metallic oxides.
These oxides are usually acidic.
2. Reaction with Water
Metals:
Some react vigorously (e.g., sodium).
Some react slowly (e.g., iron).
Some do not react at all (e.g., gold, silver).
Non-metals: Generally do not react with water.
3. Reaction with Acids
Metals react with acids to produce salt and hydrogen gas.
Non-metals: Do not react with acids.
4. Reaction with Bases
Some non-metals react with bases to form salts, but this is rare.
Metals generally do not react with bases directly (except amphoteric metals like aluminum and zinc).
Displacement Reaction
More reactive metals can displace less reactive metals from their salt solutions.
Uses of Metals
Iron: Making machines, tools, and buildings.
Aluminum: Used in aircraft, utensils.
Copper: Electrical wires.
Gold and Silver: Jewelry.
Zinc: Coating iron to prevent rusting (galvanization).
Uses of Non-Metals
Oxygen: Breathing.
Nitrogen: Fertilizers.
Chlorine: Water purification.
Carbon: Fuel (coal), steel-making (coke).
Iodine: Medicines.
Alloys
An alloy is a mixture of metals or a metal with a non-metal.
Alloys have improved properties like strength, resistance to rusting.
A measles outbreak originating in West Texas has been linked to confirmed cases in New Mexico, with additional cases reported in Oklahoma and Kansas. The current case count is 817 from Texas, New Mexico, Oklahoma, and Kansas. 97 individuals have required hospitalization, and 3 deaths, 2 children in Texas and one adult in New Mexico. These fatalities mark the first measles-related deaths in the United States since 2015 and the first pediatric measles death since 2003.
The YSPH Virtual Medical Operations Center Briefs (VMOC) were created as a service-learning project by faculty and graduate students at the Yale School of Public Health in response to the 2010 Haiti Earthquake. Each year, the VMOC Briefs are produced by students enrolled in Environmental Health Science Course 581 - Public Health Emergencies: Disaster Planning and Response. These briefs compile diverse information sources – including status reports, maps, news articles, and web content– into a single, easily digestible document that can be widely shared and used interactively. Key features of this report include:
- Comprehensive Overview: Provides situation updates, maps, relevant news, and web resources.
- Accessibility: Designed for easy reading, wide distribution, and interactive use.
- Collaboration: The “unlocked" format enables other responders to share, copy, and adapt seamlessly. The students learn by doing, quickly discovering how and where to find critical information and presenting it in an easily understood manner.
CURRENT CASE COUNT: 817 (As of 05/3/2025)
• Texas: 688 (+20)(62% of these cases are in Gaines County).
• New Mexico: 67 (+1 )(92.4% of the cases are from Eddy County)
• Oklahoma: 16 (+1)
• Kansas: 46 (32% of the cases are from Gray County)
HOSPITALIZATIONS: 97 (+2)
• Texas: 89 (+2) - This is 13.02% of all TX cases.
• New Mexico: 7 - This is 10.6% of all NM cases.
• Kansas: 1 - This is 2.7% of all KS cases.
DEATHS: 3
• Texas: 2 – This is 0.31% of all cases
• New Mexico: 1 – This is 1.54% of all cases
US NATIONAL CASE COUNT: 967 (Confirmed and suspected):
INTERNATIONAL SPREAD (As of 4/2/2025)
• Mexico – 865 (+58)
‒Chihuahua, Mexico: 844 (+58) cases, 3 hospitalizations, 1 fatality
• Canada: 1531 (+270) (This reflects Ontario's Outbreak, which began 11/24)
‒Ontario, Canada – 1243 (+223) cases, 84 hospitalizations.
• Europe: 6,814
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulsesushreesangita003
what is pulse ?
Purpose
physiology and Regulation of pulse
Characteristics of pulse
factors affecting pulse
Sites of pulse
Alteration of pulse
for BSC Nursing 1st semester
for Gnm Nursing 1st year
Students .
vitalsign
World war-1(Causes & impacts at a glance) PPT by Simanchala Sarab(BABed,sem-4...larencebapu132
This is short and accurate description of World war-1 (1914-18)
It can give you the perfect factual conceptual clarity on the great war
Regards Simanchala Sarab
Student of BABed(ITEP, Secondary stage)in History at Guru Nanak Dev University Amritsar Punjab 🙏🙏
Exploring Substances:
Acidic, Basic, and
Neutral
Welcome to the fascinating world of acids and bases! Join siblings Ashwin and
Keerthi as they explore the colorful world of substances at their school's
National Science Day fair. Their adventure begins with a mysterious white paper
that reveals hidden messages when sprayed with a special liquid.
In this presentation, we'll discover how different substances can be classified as
acidic, basic, or neutral. We'll explore natural indicators like litmus, red rose
extract, and turmeric that help us identify these substances through color
changes. We'll also learn about neutralization reactions and their applications in
our daily lives.
by sandeep swamy
How to manage Multiple Warehouses for multiple floors in odoo point of saleCeline George
The need for multiple warehouses and effective inventory management is crucial for companies aiming to optimize their operations, enhance customer satisfaction, and maintain a competitive edge.
How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingCeline George
The Accounting module in Odoo 17 is a complete tool designed to manage all financial aspects of a business. Odoo offers a comprehensive set of tools for generating financial and tax reports, which are crucial for managing a company's finances and ensuring compliance with tax regulations.
How to Subscribe Newsletter From Odoo 18 WebsiteCeline George
Newsletter is a powerful tool that effectively manage the email marketing . It allows us to send professional looking HTML formatted emails. Under the Mailing Lists in Email Marketing we can find all the Newsletter.
The ever evoilving world of science /7th class science curiosity /samyans aca...Sandeep Swamy
The Ever-Evolving World of
Science
Welcome to Grade 7 Science4not just a textbook with facts, but an invitation to
question, experiment, and explore the beautiful world we live in. From tiny cells
inside a leaf to the movement of celestial bodies, from household materials to
underground water flows, this journey will challenge your thinking and expand
your knowledge.
Notice something special about this book? The page numbers follow the playful
flight of a butterfly and a soaring paper plane! Just as these objects take flight,
learning soars when curiosity leads the way. Simple observations, like paper
planes, have inspired scientific explorations throughout history.
The ever evoilving world of science /7th class science curiosity /samyans aca...Sandeep Swamy
A Cloud Gaming System Based on User-Level Virtualization and Its Resource Scheduling
Youhui Zhang, Member, IEEE, Peng Qu, Jiang Cihang, and Weimin Zheng, Member, IEEE
Abstract—Many believe the future of gaming lies in the cloud, namely Cloud Gaming, which renders an interactive gaming application in the cloud and streams the scenes as a video sequence to the player over the Internet. This paper proposes GCloud, a GPU/CPU hybrid cluster for cloud gaming based on user-level virtualization technology. Specifically, we present a performance model to analyze server capacity and games' resource consumption, which categorizes games into two types: CPU-critical and memory-io-critical. Consequently, several scheduling strategies have been proposed to improve resource utilization and compared with others. Simulation tests show that both the First-Fit-like and the Best-Fit-like strategies outperform the others; in particular, they are near-optimal in the batch processing mode. Other test results indicate that GCloud is efficient: an off-the-shelf PC can support five high-end video games running at the same time. In addition, the average per-frame processing delay is 8–19 ms under different image resolutions, which outperforms other similar solutions.

Index Terms—Cloud computing, cloud gaming, resource scheduling, user-level virtualization
1 INTRODUCTION

Cloud gaming provides game-on-demand services over the Internet. This model has several advantages [1]: it allows easy access to games without owning a game console or high-end graphics processing units (GPUs), and game distribution and maintenance become much easier.

For cloud gaming, the response latency is the most essential factor in the quality of gamers' experience "on the cloud". The number of games that can run on one machine simultaneously is another important issue, which makes this mode economical and hence really practical. Thus, to optimize cloud gaming experiences, CPU/GPU hybrid systems are usually employed, because CPU-only solutions are not efficient for graphics rendering.
One of the industrial pioneers of cloud gaming, Onlive,1 emphasized the former: it allocated one GPU per instance for high-end video games. To improve utilization, some other service providers use virtual machine (VM) technology to share the GPU among games running on top of VMs. For example, GaiKai2 and G-cluster3 stream games from cloud servers located around the world to Internet-connected devices. Since the end of 2013, Amazon EC2 has also provided a service for streaming games based on VMs.4
More technical details can be acquired from non-commercial projects. GamePipe [2] is a VM-based cloud cluster of CPU/GPU servers. Its characteristic lies in that not only cloud resources but also the local resources of clients can be employed to improve the gaming quality. Another system, GamingAnywhere [3], has used user-level virtualization technology; compared with some solutions, its processing delay is lower.

Besides, task scheduling is regarded as another key issue for improving the utilization of resources, which has been verified in the high-performance GPU-computing field [4], [5], [6], [7]. However, to the best of our knowledge, scheduling research for cloud gaming has not received much attention yet. One example based on VMs is VGRIS [8] (including its successor VGASA [9]): it is a GPU-resource management framework in the host OS that schedules the virtualized resources of guest OSes.
This paper proposes the design of a GPU/CPU hybrid system for cloud gaming and its prototype, GCloud. GCloud uses user-level virtualization technology to implement a sandbox for different types of games, which can isolate multiple game instances from each other on a game server, transparently capture each game's video/audio outputs for streaming, and handle the remote client device's inputs.
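The transparent capture mentioned above can be illustrated with a minimal sketch (not GCloud's actual implementation, and all names are hypothetical): the sandbox interposes on the game's frame-presentation call, so every frame is copied into a streaming pipeline before the original graphics API proceeds.

```python
# Hypothetical illustration of user-level virtualization by API interception:
# the sandbox wraps the game's frame-presentation call so each frame is
# transparently captured for streaming before being displayed as usual.

def make_sandboxed_present(real_present, stream_queue):
    """Wrap the graphics API's present() so the sandbox sees each frame."""
    def present(frame):
        stream_queue.append(frame)   # capture for video encoding/streaming
        return real_present(frame)   # then let the original call proceed
    return present
```

In a real system the interposition would happen at the native API layer (e.g., the Direct3D present call), but the wrapping pattern is the same.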
Moreover, a performance model has been presented; with it we have analyzed the resource consumption of games and the performance bottleneck(s) of a server, through extensive experiments using a variety of hardware performance counters. Accordingly, several task-scheduling strategies have been designed to improve server utilization and have been evaluated respectively.
Different from related research, we focus on the guideline for task assignment: on the reception of a game-launch request, we should judge whether a server is suitable to undertake the new instance, under the condition of satisfying the performance requirements.
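As a rough sketch of this admission decision, assume each server exposes its remaining CPU and memory-I/O headroom and each game has an estimated consumption (the data structures below are made up for illustration; the paper's actual model is derived from hardware performance counters):

```python
# Hypothetical sketch of First-Fit-like vs. Best-Fit-like task assignment.
# A server can host a new game instance only if both its CPU headroom and
# its memory-I/O headroom cover the game's estimated consumption.

def fits(server, game):
    """True if the server can absorb the game's estimated load."""
    return (server["cpu_free"] >= game["cpu"] and
            server["memio_free"] >= game["memio"])

def first_fit(servers, game):
    """Assign to the first server with enough headroom."""
    for s in servers:
        if fits(s, game):
            return s
    return None  # no server can satisfy the performance requirements

def best_fit(servers, game):
    """Assign to the feasible server with the least leftover capacity."""
    candidates = [s for s in servers if fits(s, game)]
    if not candidates:
        return None
    return min(candidates,
               key=lambda s: (s["cpu_free"] - game["cpu"]) +
                             (s["memio_free"] - game["memio"]))
```

Best-Fit-like packing fills nearly-full servers first and so keeps larger contiguous headroom elsewhere, which is one intuition for why such strategies pack instances densely in batch mode.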
In addition, from the aspect of user-level virtualization (there are some existing user-level solutions, like GamingAnywhere [3]), GCloud has its own characteristics:
1. http://www.onlive.com/
2. https://www.gaikai.com/
3. http://www.g-cluster.com/eng/
4. https://aws.amazon.com/game-hosting/
The authors are with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. E-mail: {zyh02, zwm-dcs}@tsinghua.edu.cn, [email protected], [email protected].
Manuscript received 13 Nov. 2014; revised 11 May 2015; accepted 11 May 2015. Date of publication 14 May 2015; date of current version 13 Apr. 2016.
Recommended for acceptance by Y. Wang.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPDS.2015.2433916
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 5, MAY 2016 1239
1045-9219 © 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
First, it implements a virtual input layer for each concurrently-running instance, rather than a system-wide one, which can support more than one Direct3D game at the same time. Second, it designs a virtual storage layer to transparently store each client's configurations across all servers, which has not been addressed by related projects.
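A minimal sketch of what such a virtual storage layer might do, assuming it intercepts the game's file paths and redirects accesses under the game's configuration directory to a per-client store (all directory names here are made up):

```python
import os.path

# Hypothetical sketch of a virtual storage layer: file accesses that a game
# makes under its configuration directory are transparently redirected to a
# per-client path, so each player's settings follow them across servers.

def redirect_path(path, game_config_dir, client_store):
    """Rewrite paths under the game's config dir into the client's store."""
    norm = os.path.normpath(path)
    base = os.path.normpath(game_config_dir)
    if norm == base or norm.startswith(base + os.sep):
        return os.path.join(client_store, os.path.relpath(norm, base))
    return path  # non-config accesses pass through untouched
```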
In summary, the following contributions have been
accomplished:
1) Enabling technologies based on light-weight virtualization are introduced, especially those behind GCloud's characteristics. (Section 3)
2) To balance gaming responsiveness and costs, we adopt a "just good enough" principle to fix the FPS (frames per second) of games at an acceptable level. Under this principle, a performance model is constructed to analyze the resource consumption of games, which categorizes games into two types: CPU-critical and memory-io-critical; thus several scheduling mechanisms have been presented to improve utilization and compared. In addition, different from previous work focused on the GPU resource, our work has found that the host CPU or the memory bus is the system bottleneck when several games are running simultaneously. (Section 4)
3) Such a cloud-gaming cluster has been constructed, which supports the mainstream game types. Test results show that GCloud is highly efficient: an off-the-shelf PC can support up to five concurrently-running video games (each game's image resolution is 1024 × 768 and the frame rate is 30 FPS). The average per-frame processing delay is 8–19 ms under different image resolutions, which can satisfy the stringent delay requirements of highly interactive games. Tests have also verified the effects of our performance model. (Section 5)
The remainder of this paper is organized as follows.
Section 2 presents the background knowledge of cloud gam-
ing as well as related work. Sections 3 and 4 are the main
part: the former introduces the user-level virtualization
framework and enabling technologies; the performance
model and its analysis method are given in the latter, as well
as the scheduling strategies. Section 5 presents the prototype
cluster and evaluates its performance. Section 6 concludes.
2 RELATED WORK
2.1 Cloud Gaming
Cloud gaming is a type of online gaming that allows direct and on-demand streaming of game scenes to networked devices, in which the actual game runs on the server end (the main steps are described in Fig. 1). Moreover, to ensure interactivity, all of these serial operations must happen on the order of milliseconds, which critically challenges the system design.
The sum of these latencies is defined as the interaction delay. Existing research [10] has shown that different types of games impose different requirements.
One type of cloud-gaming solution is VM-based. For solutions based on VMs, Step 1 is completed in the guest OS while the other server-end steps are accomplished by the host. Barboza et al. [11] present such a solution, which provides cloud gaming services and uses three levels of managers for the cloud, hosts and clients. Some existing work, like GaiKai, G-cluster, Amazon EC2 for streaming games and GamePipe [2], also belongs to this category.
In contrast to VM-based solutions, the user-level solution
inserts the virtualization layer between applications and the
run-time environment. This mode simplifies the processing
stack; thus it can reduce the extra overhead. GamingAny-
where [3] is such a user-level implementation, which sup-
ports Direct3D/SDL games on Windows and SDL games on
Linux.
Some solutions have enhanced the thin-client protocol to support interactive gaming applications. Depending on the concrete implementation, they can be classified into these two types. For example, Winter et al. [12] have enhanced the
thin-client server driver to integrate a real-time desktop
streamer to stream the graphical output of applications after
GPU processing, which can be regarded as a light-weight
virtualization-based solution. In contrast, Muse [13] uses
VMs to isolate and share GPU resources on the cloud-end,
which has enhanced the remote frame buffer (RFB) protocol
to compress the frame-buffer contents of server-side VMs.
However, this research has focused on optimizing the interaction delay, i.e., the performance of a single game on the cloud, rather than the interference between concurrently-running instances. Moreover, none of these systems presents a specific scheduling strategy.
2.2 Resource Scheduling
For high performance computing (HPC), GPU virtualization has been widely researched [14], [15], [16] for general-purpose computing. From the scheduling viewpoint, there are also several studies, including Phull et al. [4], Ravi et al. [5], Elliott and Anderson [6], L. Chen et al. [7] and Bautin et al. [17].
Fig. 1. The whole workflow of cloud-gaming.
However, none of these studies has considered cloud-gaming characteristics, including the critical demands on processing latency, highly-coupled sequential operations and so on.
The work on the scheduling for cloud gaming is limited:
VGRIS [8] and its successor VGASA [9] are resource man-
agement frameworks for VM-based GPU resources, which
have implemented several scheduling algorithms for differ-
ent objectives. However, they are focused on scheduling
rendering tasks on a GPU, without considering other tasks
like image-capture / -encoding, etc. iCloudAccess [18] has
proposed an online control algorithm to perform gaming-
request dispatching and VM-server provisioning to reduce
latencies of the cloud gaming platform. A recent work is
[19], which has studied the optimized placement of cloud-
gaming-enabled VMs. The proposed heuristic algorithms
are efficient and nearly optimal. Our work can be regarded as complementary to these studies, because they focus on VM-granularity dispatching / provisioning while we pay attention to issues inside an OS.
One related work on GPU-scheduling (but not cloud-gam-
ing-specific) is TimeGraph [20]: it is a real-time GPU scheduler
that has modified the device-driver for protecting important
GPU workloads from performance interference. Similarly, it
has not considered the cloud gaming characteristics.
Another category of related research [21], [22] concerns streaming-media applications. For example, Cherkasova and Staley [21] developed a workload-aware performance model for video-on-demand (VOD) applications, which is helpful for measuring the capacity of a streaming server as well as the resource requirements. We referred to their design principles when constructing our performance model.
2.3 Others
To improve processing efficiency and adaptation, Wang and Dey [23] propose a rendering adaptation technique to adapt the game rendering parameters to satisfy Cloud Mobile Gaming's constraints. Klionsky [24] has presented an architecture which amortizes the cost of rendering across users. However, these two technologies are not transparent to games.
In addition, Jurgelionis et al. [25] explored the impact of
networking on gaming; Ojala and Tyrvainen [26] developed
a business model behind a cloud gaming company.
As a summary, compared with the above-mentioned work, GCloud has the following features:
1) It is based on user-level virtualization. Compared with existing user-level solutions, GCloud proposes more thorough solutions for virtual input / storage.
2) From the aspect of performance modeling and scheduling, more real jobs (including image capture, encoding, etc.) have been considered (compared with VGRIS / VGASA [8], [9]). In addition, we use hardware-assisted video encoding to mitigate the interference between games and to improve performance.
3) Last but not least, our work focuses on issues inside a node, while [18], [19] work at the VM granularity.
4) Furthermore, quite a few studies have been carried out to measure the performance of cloud gaming systems, like [27], [28], [29] and [30]. We also referred to them to complete our measurements.
3 SYSTEM ARCHITECTURE AND ENABLING
TECHNOLOGIES
3.1 The Framework
The system (in Fig. 2) is built with a cluster of CPU / GPU-
hybrid computing servers; a dedicated storage server is
used as the shared storage. Each computing server can host
the execution of several games simultaneously. One of these
servers is employed as the manager-node, which collects
real-time running information of all servers and completes
management tasks, including the task-assignment, user
authentication, etc.
It is necessary to note that the framework in Fig. 2 is for small / medium system scales. For a large-scale system with many users, a hierarchical architecture is needed to avoid an information-exchange bottleneck. In fact, because the quality of gamers' experience highly depends on the response latency and the latter is sensitive to the physical distance between clients and servers, the architecture may be geographically distributed, which is out of the scope of this paper. It also means that in one site the scale will not be very large.5
Initially, gaming-agents on available computing servers
register to the manager, indicating that they are ready and
Fig. 2. System architecture.
5. According to OnLive, the theoretical upper bound of the distance
between a user and a cloud gaming server is approximately 1,000 miles.
In China, some gaming systems provide services for just one city or sev-
eral cities.
ZHANG ET AL.: A CLOUD GAMING SYSTEM BASED ON USER-LEVEL VIRTUALIZATION AND ITS RESOURCE SCHEDULING 1241
which games they can execute. When a client wants to play
some game, the manager will search for candidates among
the registered information. After such a server has been cho-
sen, a start-up command will be sent to the corresponding
agent to boot up the game within a light-weight virtualiza-
tion environment. Then, its address will be sent to the client.
Future communication will be done directly between the
two ends.
During the run time, each agent collects local runtime
information and sends it to the manager periodically; the
latter can get the latest status of resource-consumptions.
The storage server plays an important role in providing the personalized game configuration for each user. For instance, suppose User A had played Game B on Server C. Now A wants to play the game again while the manager finds that Server C's resources have been depleted. Then the task has to be assigned to another server, D. Consequently, it is necessary to restore A's configurations of B on D, including the game's progress and other customized information. The storage server is used as the shared storage for all computing nodes.
3.2 The User-Level Virtualization Environment
For each game, API interception is employed to implement a lightweight virtualization environment. API interception means intercepting calls from the application to the underlying running system; typical applications include software streaming [31], [32], etc. Here it is used to catch the corresponding resource-access APIs of the game. In addition, our main target platform is MS Windows, as Windows dominates the PC video-game world.
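The interception idea can be illustrated loosely with the following sketch. The real system hooks native Win32/Direct3D entry points (e.g., via injected code); Python attribute patching is used here purely as an analogy for the wrap-forward-observe pattern, and all names are illustrative, not the paper's code.

```python
# Analogy for API interception: wrap a Present-like call so that every
# invocation is forwarded to the real API, then extra work (here, a
# frame-capture callback) runs transparently to the "game".

class Game:
    """Stand-in for a game that calls a Present-like API every frame."""
    def present(self):
        return "frame-on-screen"

def install_hook(game, on_present):
    original = game.present          # keep a reference to the real API
    def hooked():
        result = original()          # forward to the real call
        on_present()                 # extra work, e.g. capture the frame
        return result
    game.present = hooked            # the game now calls the hook instead

captured = []
g = Game()
install_hook(g, lambda: captured.append("grabbed"))
g.present()                          # the game loop is unchanged
```

The game's own loop is untouched; the hook is what gives the virtualization layer its chance to capture images, audio and input without modifying the game.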
3.2.1 Image Capture
Usually, gaming applications employ the mainstream 3D
computer-graphics-rendering libraries, like Direct3D or
OpenGL, to complete the hardware (GPU) acceleration;
GCloud supports both of them.
In the case of Direct3D, the typical workflow of a game is
usually an endless loop: First, some CPU computation pre-
pares the data for the GPU, e.g., calculating objects in the
upcoming frame. Then, the data is uploaded to the GPU
buffer and the GPU performs the computation, e.g., render-
ing, using its buffer contents and fills the front buffer. To
fetch contents of the image into the system memory for the
consequent processing, we intercept the Direct3D’s Present
API.
For OpenGL, we have intercepted the Present-like API in
OpenGL, glutSwapBuffers, to capture images.
For other games based on a common GUI window, we just set a timer for the application's main window and intercept the corresponding message handler to capture the image of the target window periodically.
3.2.2 Audio Capture
Capturing of audio data is a platform-dependent task.
Because our main target platform is MS Windows, we inter-
cept Windows Audio Session APIs to capture the sound.
Core Audio serves as the foundation of quite a few higher-
level APIs; thus this method can bring about the best
adaptability.
3.2.3 Virtual Input Layer
Flash-based or OpenGL-based applications usually use the window's default message loop to handle inputs.
Thus, the solution is straightforward: We inject a dedicated
input-thread into the intercepted game-process. On recep-
tion of any control command from the client, this thread
will convert it into a local input message and send it to the
target window.
For Direct3D-based games, the situation is more complicated. Existing work [3] replays input events using the SendInput API on Windows. However, SendInput inserts events into a system-wide queue, rather than the queue of a specific process, so it is difficult to support more than one instance in a non-VM solution. To overcome this problem, we intercepted quite a few DirectInput APIs to simulate input queues for any virtualized application; thus the user's input can be pushed into these queues and made accessible to applications.
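The idea behind the simulated DirectInput queues can be sketched as follows: each virtualized instance gets a private event queue, so input is delivered per instance rather than through one system-wide queue. All names here are hypothetical illustrations, not the paper's implementation.

```python
# Per-instance input queues: the fix for SendInput's system-wide queue.
from collections import deque

input_queues = {}   # instance id -> private event queue

def register_instance(instance_id):
    input_queues[instance_id] = deque()

def push_client_event(instance_id, event):
    # Deliver the client's converted control command only to the
    # target instance's queue, not to a system-wide queue.
    input_queues[instance_id].append(event)

def poll_events(instance_id):
    # What the intercepted DirectInput read would drain each frame.
    q = input_queues[instance_id]
    events = list(q)
    q.clear()
    return events

register_instance("game-a")
register_instance("game-b")
push_client_event("game-a", ("KEY_DOWN", "W"))
push_client_event("game-b", ("MOUSE_MOVE", (10, 4)))
```

Because each instance drains only its own queue, two games on the same host no longer contend for one event stream.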
3.2.4 Virtual Storage Layer
From the storage aspect, a program can be divided into three parts [31]: Parts 1 and 2 include all resources provided by the OS and those created/modified by the installation process; Part 3 is the data created/modified/deleted during run time, which contains each user's game configurations. For the immutable parts, it is relatively easy to distribute them to servers through some system-clone method. The focus is how to migrate Part 3 resources across servers to provide personalized game configurations for users.
We construct a virtual storage layer by intercepting the file-system and registry-access APIs of all games. During the run time, resources modified by the game instance are moved into Part 3. When the case described in Section 3.1 occurs, the virtual storage layer of Game B on the current server can redirect resource accesses to the shared storage to fetch the latest configurations of User A, which were stored by the last run on Server C.
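A minimal sketch of the redirection rule, under stated assumptions: accesses to per-user mutable state (Part 3) are rewritten to a per-user, per-game directory on the shared storage server, while the immutable Parts 1 and 2 stay local. The mount point and path layout are assumptions for illustration.

```python
# Virtual-storage redirection: intercepted file-system calls pass their
# path through this function before opening the real file.
import posixpath

SHARED_ROOT = "/mnt/shared"      # assumed mount point of the storage server

def redirect(path, user, game, part3_prefixes):
    """Return the real path the intercepted file-system API should open."""
    for prefix in part3_prefixes:
        if path.startswith(prefix):
            rel = path[len(prefix):].lstrip("/")
            # Per-user, per-game directory on the shared storage server,
            # so the latest configurations follow the user across servers.
            return posixpath.join(SHARED_ROOT, user, game, rel)
    return path                   # immutable Parts 1-2: untouched

p = redirect("/games/B/saves/slot1.dat", "userA", "gameB", ["/games/B/saves"])
```

With this rule, User A's save written on Server C lands on the shared storage, and the same redirect on Server D resolves to the same file.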
4 PERFORMANCE MODEL AND TASK SCHEDULING
As mentioned in Section 1, the response latency and the
number of games that one machine can execute simulta-
neously are both essential to a cloud gaming system. To a
large extent, they are in conflict, and existing systems (like [3], [11], [12]) usually focus on the first issue. However, that is not always economical. For example, if the FPS of a given game is too high, it will consume more resources. Moreover, lossy compression will counteract the high video quality to a certain extent.
Some scheduling work, like VGRIS / VGASA [8], [9], has presented multi-task scheduling strategies. There are several essential differences between our work and VGRIS / VGASA: First, they focus on how to schedule existing games on a server, including allocating enough GPU resources for a game, etc. In contrast, GCloud focuses on the assignment of a new task. Second, they focus on the GPU resource and no other operations (like image capture, encoding, etc.) are considered, while our tests (presented in Section 4.4) show the host CPU or the memory bus is the bottleneck. Third, VGRIS and VGASA are VM-specific.
In this paper, we adopt a "just good enough" strategy, which means that we keep the game quality at some acceptable level and then try to satisfy the interactivity requirements of as many games as possible. Hence, there are two main issues:
Issue 1: For a given server and its running game instances,
how to make sure the game quality is acceptable?
Issue 2: On an incoming request, which server is suitable to
launch the new game instance?
For Issue 1, we first give a brief pipeline model for cloud gaming, which can be used to judge whether the game quality is acceptable or not. Second, a method to fix the FPS is presented to provide the "just good enough" quality; a hardware-assisted video-encoding technique is also used to further mitigate the interference between games. For Issue 2, several resource metrics are given. Then we carry out tests to measure the server capacity and to categorize games into different types. Accordingly, we design a server capacity model and corresponding task-assignment strategies. These strategies have been compared with others.
4.1 Game Quality
A cloud gaming system's interaction delay contains three parts [27]: (1) Network delay, the time required for a round of data exchange between the server and client; (2) Play-out delay, the time required for the client to handle the received frames for playback; (3) Processing delay, the time required for the server to process a player's command, and to encode and send the corresponding frame back.
This paper is mainly about the server side and the network is assumed to provide sufficient bandwidth; thus we focus on the processing delay, which should be confined within a limited range. The work [25] on measuring the latency of cloud gaming has disclosed that, for some existing service providers (like OnLive), the processing delay is about 100–200 ms. Thus, we use 100 ms as our scheduling target, denoted MAX_PD. Another key metric is the FPS; the required FPS is denoted FIXED_FPS. In this work, FIXED_FPS is set to 30 by default.
As presented by Fig. 1, the gaming workflow can be
regarded as a pipeline including four steps: operations of
gaming logic, graphic rendering (including the image cap-
ture), encoding (including the color-space conversion) and
transmission. In addition, our tests show that, given sufficient bandwidth, the transmission delay is much less than that of the other steps. Thus, the fourth step can be skipped and we focus on the remaining three.
Furthermore, the first two steps are completed by the intercepted process, whose internals we cannot observe separately; thus we combine them together and denote the sum of their latencies by Tpresent. The average processing time of the encoding step is denoted by Tencoding (the pipeline is presented in Fig. 3). Hence, if the following conditions (referred to as the Responsiveness Conditions) are satisfied, the requirements on the FPS and processing delay will undoubtedly be met. To be more precise, satisfaction of the first two conditions implies the last one in the default case.
Tpresent ≤ 1 / FIXED_FPS and (1)
Tencoding ≤ 1 / FIXED_FPS and (2)
Tencoding + Tpresent ≤ MAX_PD (3)
4.2 Fixed FPS
To provide the “just good enough” gaming quality, the FPS
value should be fixed to some acceptable level (Issue 1).
Because the interface of GPU drivers is not open, our solu-
tion is in the user-space, too.
Taking a Direct3D game as an example, we intercept the Present API to insert a Sleep call for adjusting the loop latency: The rendering complexity is mostly affected by the complexity of gaming scenes, and the latter changes gradually. Thus, it is reasonable to predict Tpresent based on its own historical information. In the implementation, the average time (denoted Tavg_present) of the past 100 loops is used as the prediction for the upcoming one (a similar method is adopted by [8], [9]) and the sleep time (Tsleep) is calculated as:

Tsleep = 1 / FIXED_FPS − Tavg_present
The true problem lies in how to judge whether a busy
server is suitable to undertake a new game instance or not.
Thus, we should solve Issue 2 anyway.
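The fixed-FPS mechanism above can be sketched as follows, assuming a 100-loop moving average and a sleep time clamped at zero (so a scene that already exceeds the frame budget is never delayed further); class and variable names are illustrative.

```python
# Fixed-FPS limiter: predict Tpresent from the last 100 loops and sleep
# for the remainder of the frame budget, Tsleep = 1/FIXED_FPS - Tavg_present.
from collections import deque

FIXED_FPS = 30
FRAME_BUDGET = 1.0 / FIXED_FPS          # seconds per frame

class FpsLimiter:
    def __init__(self, window=100):
        self.history = deque(maxlen=window)   # last `window` loop latencies

    def sleep_time(self, t_present):
        self.history.append(t_present)
        t_avg = sum(self.history) / len(self.history)
        return max(0.0, FRAME_BUDGET - t_avg)  # never sleep a negative time

limiter = FpsLimiter()
# A light scene: rendering takes ~10 ms, so we sleep the rest of the budget.
s = limiter.sleep_time(0.010)
```

In the real system this sleep is inserted inside the intercepted Present call, so the game itself never sees the pacing logic.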
4.3 Hardware-Assisted Video Encoding
The fixed FPS can mitigate the interference between games because it allocates just enough resources for rendering. Further, we use the hardware-assisted video-encoding capability of commodity CPUs for less interference.
The hardware technology of Intel CPUs, Quick Sync, has been employed. It provides a full-hardware pipeline to compress raw images in the RGB or YUV format into H.264 video. Quick Sync has now become one of the mainstream hardware encoding technologies.6
On the test server, a Quick-Sync-enabled CPU can simultaneously support up to twenty 30-FPS encoding tasks (at a 1024 × 768 image resolution); the latency for one frame is as low as 4.9 ms.
Fig. 3. Gaming pipeline.
6. Quick Sync was introduced with the Sandy Bridge CPU micro-architecture. It is part of the integrated graphics processor on the same die as the CPU. Thus, to enable it to work with a discrete graphics card (used for gaming), some special configuration should be set up, as described by http://mirillis.com/en/products/tutorials/action-tutorial-intel-quick-sync-setup_for_desktops.html. For AMD, its Accelerated Processing Unit (APU) has a similar function.
Moreover, the CPU utilization of one such task is almost negligible, less than 0.2 percent. (Details are presented in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2015.2433916.) This result means it causes little interference to other tasks. Thus, we use it as the reference implementation in all following tests, as well as in the system prototype.
4.4 Resource-Metrics
We focus on five types of system resources: the CPU, GPU, system RAM, video RAM and the system bandwidth. The first two are denoted by utilization ratios; the next two are represented by memory consumption; and the last refers to the miss number of the LLC (Last Level Cache). Correspondingly, the server capacity and the average resource requirements of a game (under the condition satisfying the Responsiveness Conditions) can be denoted by a tuple of five items, ⟨U_CPU, U_GPU, M_HOST, M_GPU, B⟩.
Based on the above metrics, we should judge whether the remaining resource capacities of a server can meet the demands of a new game or not. The key lies in how to measure the capacity of a server, as well as the game requirements. We present the following method to accomplish these tasks, namely, to solve Issue 2.
4.4.1 Test Methods
Commercial GPUs usually implement driver / hardware counters to provide runtime performance information. For example, NVIDIA's PerfKit APIs7 can collect resource-consumption information of each GPU in real time. Hence, we can get results accumulated since the previous time the GPU was sampled, including the percentage of time the GPU is idle/busy, the consumption of graphics memory, etc.
For commodity CPUs, a similar method is used. For instance, Intel already provides the capability to monitor performance events inside processors. Through its performance counter monitor (PCM), many performance-related events per CPU core, including the number of LLC misses, instructions per CPU cycle, etc., can be obtained periodically.
The sample periods for the CPU and GPU are both set to 3 s.
In addition, we embed monitoring code into the intercepted gaming APIs to record the processing delay of each frame, which will be used to judge whether the Responsiveness Conditions have been met or not.
Moreover, it is necessary to note that the integrated graphics processor (which contains the Quick Sync encoding engine) shares the LLC with the CPU cores and there is no on-chip graphics memory.8 Thus the hardware encoding process needs to access the system memory (if the required data misses in the LLC), which means the corresponding miss number is still suitable for indicating the memory throughput with hardware encoding.
In addition, we select four representative games, includ-
ing three Direct3D video games (Direct3D is the most popu-
lar development library for PC video games) and one
OpenGL game. They are:
1) Need for Speed-Most Wanted (abbreviated to NFS).
It is a classic racing video game.
2) Modern Combat 2-Black Pegasus (abbreviated to
Combat), a first-person shooter video game.
3) Elder Scrolls: Skyrim-Dragonborn (abbreviated to
Scrolls), an action role-playing video game.
4) Angry Birds Classic (abbreviated to Birds), the well-known mobile-phone game's PC version.
Several volunteers were invited to play games on the cloud gaming system and encouraged to play quite a few game scenes; the duration was more than 15 minutes for each game. After several loops, runtime information was collected for further analysis.
4.4.2 Test Cases
A Windows 7 (64-bit) PC is used as the server, which is equipped with an NVIDIA GTX780 GPU adapter (3 GB video memory), a common Core i7 CPU (four cores, 3.4 GHz) and 8 GB RAM. By default, games are streamed at a resolution of 1024 × 768 and the game picture quality is set to medium in all cases; the FPS is fixed at 30. Video encoding is completed by Quick Sync.
Single instance (resource-requirement tests). Each game was played in our virtualization environment alone and resource consumption was recorded in real time. As expected, the Responsiveness Conditions can be met for each game on this powerful machine; the corresponding resource requirements are presented in Table 1. Considering resource consolidation, the average value of each item of the tuple has been used.
Multiple instances running simultaneously. Quite a few game groups were executed and sampled simultaneously. For example, we played 2–6 NFS instances at the same time. Based on the runtime information, we can see that this server can support up to five acceptable instances simultaneously (we consider a game's running quality acceptable if its average FPS value is not less than 90 percent of FIXED_FPS). While six instances are running, the FPS value is less than 27, which is regarded as unacceptable.
Furthermore, we should identify the bottleneck that is pivotal for task assignment. Considering the following facts (in Fig. 4a), NFS is memory-io-critical:
When no more than five games are running simultaneously, the average FPS is stable (about 30.3) and the value of million-misses-per-second increases almost linearly. When six instances are running, the FPS is about 24.7 and the throughput remains nearly unchanged (from 37.6 to 37.9). At the same time, both U_GPU and U_CPU are far from exhausted, at 47 and 71 percent respectively. This phenomenon indicates that memory accesses have impeded tasks from utilizing the CPU/GPU resources efficiently. Moreover, memory consumption is not the bottleneck; thus no swap operations will happen (for clarity, the information on memory consumption is skipped in these figures).
For Combat and Scrolls (in Figs. 4b and 4c), the same conclusion holds: under the condition satisfying
7. http://www.nvidia.com/object/nvperfkit_home.html
8. http://www.hardwaresecrets.com/printpage/Inside-the-Intel-Sandy-Bridge-Microarchitecture/1161
the performance requirements, there can be at most three concurrent instances of Scrolls. For Combat, the maximum number of instances is five. At the same time, both U_GPU and U_CPU are limited, too. On the other hand, Birds (in Fig. 4d) is CPU-critical because it can exhaust the CPU (97 percent with 10 instances running and an average FPS of 27.1), while the value of million-misses-per-second increases almost linearly.
4.4.3 Modeling
Based on the previous results, we have normalized the resource requirements and the server capacity; the principle is critical-resource-first: (1) For a memory-io-critical game of which the game server can host Ni instances, the fifth item (Bandwidth) of its tuple is set to MAX_SYSTEM_THROUGHPUT9 / Ni, regardless of the absolute value. (2) For any CPU-critical game of which the game server can host Nj instances, the value of its U_CPU is set to 1 / Nj. (3) The other tuple items are kept unchanged.
For example, the tuple of NFS is ⟨9.15 percent, 2.01 percent, 526, 220, MAX_SYSTEM_THROUGHPUT / 5⟩, and the Birds tuple is ⟨100 percent / 10, 1.1 percent, 181, 142, 6.54⟩. Tuples of these four games are listed in Table 2.
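The critical-resource-first normalization can be sketched as below. The concrete MAX_SYSTEM_THROUGHPUT value is an assumption for illustration (the saturation throughput of roughly 37.6 million misses per second observed in the NFS tests); the game tuples are the measured values from Table 1.

```python
# Critical-resource-first normalization: rewrite only the critical item of
# a game's tuple from the maximum instance count N the server sustained.
MAX_SYSTEM_THROUGHPUT = 37.6   # assumed million LLC misses/s the system sustains

def normalize(tup, kind, max_instances):
    u_cpu, u_gpu, m_host, m_gpu, b = tup
    if kind == "memory-io-critical":
        # The bandwidth item becomes an equal share of the sustainable total.
        b = MAX_SYSTEM_THROUGHPUT / max_instances
    elif kind == "cpu-critical":
        # The CPU item becomes an equal share of 100 percent.
        u_cpu = 100.0 / max_instances
    return (u_cpu, u_gpu, m_host, m_gpu, b)

nfs = normalize((9.15, 2.01, 526, 220, 8.10), "memory-io-critical", 5)
birds = normalize((9.36, 1.1, 181, 142, 6.54), "cpu-critical", 10)
```

The rewrite deliberately discards the absolute measured value of the critical item, because what matters for admission is the share of the bottleneck resource one instance occupies.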
Then, for a set of M games (each denoted as Gamei, 0 ≤ i < M), if the sum of each kind of resource consumption is less than the corresponding system capacity, we consider that these games can run simultaneously and smoothly. Formally, we use the following notations:
⟨U_CPUgame_i, U_GPUgame_i, M_HOSTgame_i, M_GPUgame_i, Bgame_i⟩: the tuple of resource requirements of Gamei;
⟨100%, 100%, SERVER_RAM_CAPACITY, SERVER_VIDEO_RAM_CAPACITY, MAX_SYSTEM_THROUGHPUT⟩server: the capacity of a given server.
If the following conditions are met, this server can host all games of the set running simultaneously.
Σ0≤i<M U_CPUgame_i ≤ 100%
Σ0≤i<M U_GPUgame_i ≤ 100%
Σ0≤i<M M_HOSTgame_i ≤ SERVER_RAM_CAPACITY
Σ0≤i<M M_GPUgame_i ≤ SERVER_VIDEO_RAM_CAPACITY
Σ0≤i<M Bgame_i ≤ MAX_SYSTEM_THROUGHPUT
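The admission test above is a per-resource sum against capacity. A minimal sketch, using the paper's ⟨U_CPU, U_GPU, M_HOST, M_GPU, B⟩ tuple order; the capacity numbers and the 37.6 throughput value are illustrative assumptions. The example reproduces the paper's case of one Scrolls, one Combat and two NFS fitting, with an extra NFS overflowing on B:

```python
# Admission check: a server can host a set of games iff, for every one of
# the five resources, the summed requirements stay within capacity.
def can_host(game_tuples, capacity):
    totals = [sum(t[k] for t in game_tuples) for k in range(5)]
    return all(total <= cap for total, cap in zip(totals, capacity))

CAPACITY = (100.0, 100.0, 8192, 3072, 37.6)   # assumed server capacity
scrolls = (14.55, 7.02, 795, 560, 37.6 / 3)   # normalized B: max 3 instances
combat  = (8.47, 3.27, 800, 296, 37.6 / 5)    # normalized B: max 5 instances
nfs     = (9.15, 2.01, 526, 220, 37.6 / 5)

ok  = can_host([scrolls, combat, nfs, nfs], CAPACITY)       # fits
bad = can_host([scrolls, combat, nfs, nfs, nfs], CAPACITY)  # B overflows
```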
Fig. 4. FPS and resource-consumptions of games.
TABLE 1
Resource-Requirements of Each Game

Game      U_CPU (%)   U_GPU (%)   M_HOST (MB)   M_GPU (MB)   B (million misses/s)
NFS        9.15        2.01        526           220           8.10
Scrolls   14.55        7.02        795           560          13.52
Combat     8.47        3.27        800           296           7.97
Birds      9.36        1.1         181           142           6.54
9. MAX_SYSTEM_THROUGHPUT refers to the maximal number of LLC misses per second that the system can sustain. It can be evaluated by a specially-designed program that accesses the memory space randomly and intensively.
For example, one Scrolls, one Combat and two NFS instances can run at the same time; if an extra NFS joins, this condition will not be met and the bottleneck is B. Quite a few tests with real games are given in Section 5.1 to verify this design.
4.5 The Scheduling Strategy
In summary, the following procedure for task assignment is used, which contains two stages.
Ready stage: when a game is brought online, it is tested to obtain its resource requirements. Then, for any game (denoted as Game_i), a tuple ⟨U_CPU, U_GPU, M_HOST, M_GPU, B⟩game_i can be given to represent its requirements.
In addition, for any Server_j, its capacity is denoted as ⟨U_CPU, U_GPU, M_HOST, M_GPU, B⟩server_j. The corresponding test process has been described in the previous paragraphs and each element is labeled with the corresponding maximum capacity.
Runtime stage: During the run time, the current resource consumption of each server (denoted as ⟨U_CPU, U_GPU, M_HOST, M_GPU, B⟩server_j_cur; in our prototype, the average value over the latest minute is used) is sampled periodically.
Moreover, the main goal of our scheduling strategy is to minimize the number of servers used, which can be regarded as a bin-packing problem. Several theoretical studies [33], [34] have claimed that the First-fit and Best-fit algorithms behave well for this problem, especially for the online version with requests inserted in random order [34]. Thus, we have designed two heuristic task-scheduling algorithms based on the well-known First-fit and Best-fit strategies, namely first-fit-like (FFL) and best-fit-like (BFL). The principle is straightforward; thus we only give their outlines here:
In FFL, for a given request for game_i, all servers are checked in order; if one server (for example, server_j) can host the new game, which means that each kind of resource consumption over all games on server_j (including game_i) does not exceed the capacity, the algorithm ends successfully.
In BFL, the procedure is similar. The difference lies in that, if there is more than one suitable server, the one that would leave the least amount of the critical resource is chosen as the best.
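The two outlines can be sketched as follows, reusing the five-item tuple convention. The `critical` index (which resource defines "least remaining critical resource") and the sample numbers are assumptions for illustration, not the paper's implementation.

```python
# FFL and BFL task assignment over servers described by their current
# load tuples; returns the index of the chosen server, or None.
def fits(server_load, game, capacity):
    return all(server_load[k] + game[k] <= capacity[k] for k in range(5))

def ffl(servers, game, capacity):
    # First-fit-like: take the first server that can host the game.
    for j, load in enumerate(servers):
        if fits(load, game, capacity):
            return j
    return None   # no server can host it; a new one must be started

def bfl(servers, game, capacity, critical):
    # Best-fit-like: among feasible servers, pick the one that would be
    # left with the least amount of the critical resource.
    best, best_left = None, None
    for j, load in enumerate(servers):
        if fits(load, game, capacity):
            left = capacity[critical] - (load[critical] + game[critical])
            if best_left is None or left < best_left:
                best, best_left = j, left
    return best

CAP = (100.0, 100.0, 8192, 3072, 37.6)
servers = [(90.0, 5.0, 3000, 500, 30.0), (20.0, 5.0, 3000, 500, 10.0)]
game = (9.15, 2.01, 526, 220, 7.52)
```

When `ffl` returns None, the manager starts a new server, which is how both heuristics drive the number of used servers.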
4.5.1 Tests with Artificial Traces
We have simulated our algorithms in two situations:
1) Several requests of the four games come simulta-
neously and must be dispatched instantly, namely,
in the batch processing mode.
2) Requests come one by one. The request-sequence fol-
lows a Poisson process with a mean time interval of
5 seconds; the duration of each game also follows a
Poisson process and the mean time is 40 minutes.
In both situations, we assume that there are enough servers and each has an initial resource usage of ⟨10, 5, 3096, 512, 0⟩ (gathered from our real servers); thus, we can start a new server whenever needed. Moreover, from the aspect of resource usage, we mainly focus on the number of servers used by each algorithm.
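A request trace for Situation 2 can be generated by drawing exponentially distributed inter-arrival times, which is the usual way to simulate a Poisson arrival process (a sketch; the seed and function names are our assumptions):

```python
import random

def generate_trace(n_requests, mean_interval_s=5.0, mean_duration_s=40 * 60, seed=1):
    """Return (arrival_time, duration) pairs in seconds.

    Inter-arrival times of a Poisson process are exponentially distributed;
    durations are drawn the same way, matching the description above.
    """
    rng = random.Random(seed)
    t, trace = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(1.0 / mean_interval_s)   # next arrival
        trace.append((t, rng.expovariate(1.0 / mean_duration_s)))
    return trace
```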
For the first situation, we have compared our algorithms
with three others:
Size-based task assignment (STA) [35]: This algorithm is
widely used in distributed systems, in which all tasks
within a given size range of resource requirements are
assigned to a particular server. Specific to our case, two
types of servers (for CPU-critical and for memory-IO-critical
respectively) are designated.
Packing algorithm (PA): It is a greedy algorithm. Each server is assigned as many games as possible until all games have been dispatched.
Dominant resource fairness (DRF) [36]: A fair sharing
model that generalizes max-min fairness to multiple
resource types. In our implementation, the collection of all
currently-used servers (called small servers) is regarded as a
big server. Whether the big server can satisfy an incoming request depends on whether such a small server exists; if not, a new small server is added to enlarge the big one. The scheduling strategy inside the big one is First-fit, and all gaming requests are considered to be issued by different users.
We also estimate the ideal server number for reference. For each kind of resource (denoted by s), the minimum number of servers is $\lceil \sum_{i=1}^{n} R_i^s / R^s \rceil$, where n is the total number of game requests, $R_i^s$ denotes the utilization of resource s by the i-th game, and $R^s$ is the corresponding resource capacity of a server. The maximum of these minimums over all resources is the ideal number.
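This lower bound can be computed directly (a sketch using per-resource demand dictionaries; all names are illustrative):

```python
import math

def ideal_server_number(requests, capacity):
    """Lower bound on the server count: for each resource s, take the ceiling
    of total demand divided by per-server capacity, then take the maximum of
    these per-resource minimums over all resources."""
    return max(
        math.ceil(sum(req[s] for req in requests) / capacity[s])
        for s in capacity
    )
```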
In the second situation, our algorithms have been compared with the STA algorithm only, because the others require information about the whole request sequence (which is unavailable in this case) and would degenerate into the FFL.
TABLE 2
Resource-Requirements of Games

Game      Tuple ⟨U_CPU, U_GPU, M_HOST, M_GPU, B⟩                Game type
NFS       ⟨9.15%, 2.01%, 526, 220, MAX_SYSTEM_BANDWIDTH / 5⟩    memory-io-critical
Scrolls   ⟨14.55%, 7.02%, 795, 560, MAX_SYSTEM_BANDWIDTH / 3⟩   memory-io-critical
Combat    ⟨8.47%, 3.27%, 800, 296, MAX_SYSTEM_BANDWIDTH / 5⟩    memory-io-critical
Birds     ⟨10%, 1.1%, 181, 142, 6.54⟩                           CPU-critical

Fig. 5. Server-numbers in Situation 1.

1246 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 5, MAY 2016

Simulation results of Situation 1 are given in Fig. 5. The y-axis stands for the number of needed servers (for clarity, values have been normalized) as several requests have
arrived simultaneously (the request number is illustrated
by the x-axis). We can see that, compared with the others, the heuristic algorithms are quite good. Even against the ideal number, our algorithms are really close to the optimum (the maximal value is 101.23 percent of it). Moreover, the two algorithms perform almost equally in all cases.
Fig. 6 shows the number of requested servers when requests arrive in sequence (Situation 2). We can see that our heuristic algorithms are more efficient than STA. The two algorithms also perform similarly in all cases: compared with the BFL algorithm, FFL consumes no more than 3.6 percent additional resources (57 vs. 55 servers). Finally, results show that FFL runs about 20 percent faster than BFL, while both are fast enough (in the batch processing mode, both can complete the task assignment within several milliseconds for 1,000 requests).
4.5.2 Tests with Real Game-Traces
To further evaluate the proposed task-scheduling strategies,
we conduct a trace-driven simulation of a large-scale cluster (a similar simulation method was used in [37]);
each server is the same as the one presented in Section 4.4.
The dataset we used is the World of Warcraft avatar history dataset provided by Lee et al. [38]. Although this dataset is based on the MMORPG "World of Warcraft", we
think it is useful in our case because cloud gaming and
MMORPG share many similarities, such as wide variations
in the gaming time, a huge bandwidth-demand and a large
number of concurrent users. Of course, necessary pre-proc-
essing is introduced to make the dataset more suitable,
namely, we have mapped the first four races in the dataset
(Blood Elf, Orc, Tauren and Troll) to the four kinds of games
in our system and the remaining one (the undead) is mapped
to one of these four games randomly.
In detail, we have used traces of three months that con-
sist of 396,631 game-requests (details are shown in Table 3).
Accordingly, a cluster of 200 servers has been simulated, in
which the master node collects the resource utilization of all
servers every one minute. Because previous tests have
shown that BFL and FFL policies perform similarly, we
have only tested the BFL scheduling policy here.
Fig. 7 shows the numbers of running game-instances, activated servers and used servers (once used, a server is regarded as a used server regardless of whether it is currently activated); there is an obvious linear relationship between the number of game-instances and the number of activated servers. What's more, the average number of activated servers is 64, which is significantly less than the maximum number of used servers (152). This means the scheduling efficiency is good; it also means server consolidation [37] can be used to further reduce the number of servers.
Fig. 8 shows the average resource utilization of the activated servers for each day. Although the utilization rates of the other resources are relatively low, the bandwidth utilization is high. This confirms that most games are memory-io-critical, which accords with our performance model.
We have completed another simulation, in which the number of servers is infinite, to illustrate the relationship between the total number of used servers and the update interval for resource utilization.
Fig. 9 shows the relationship: when the update interval is less than 20 minutes, the number of used servers varies slightly; for larger intervals, the number increases significantly. This means we could use a longer update interval with very limited impact on system efficiency. It also helps in managing a large-scale cloud gaming system, because message exchanges between the server agents and the manager are reduced noticeably.
Fig. 6. Server-numbers in Situation 2.
TABLE 3
Details of the Dataset

Parameter                                        Value
Simulated period                                 3 months
Server number                                    200
Total game requests                              396,631
Maximum game requests arriving simultaneously    227
Maximum game instances running simultaneously    757
Average lifetime of game instances               85 minutes
Average interval between game requests           3 minutes
Fig. 7. Running games and servers of each day.
ZHANG ET AL.: A CLOUD GAMING SYSTEM BASED ON USER-LEVEL VIRTUALIZATION AND ITS RESOURCE SCHEDULING 1247
4.6 Discussions
4.6.1 Different Game Configurations and/or
Heterogeneous Servers
The above work targets specific hardware and games, and we believe the method is practical: it is reasonable to assume that any game will be tested fully before going on-line; thus the resource requirements of each game can be measured on a given server whose hardware configuration will remain unchanged for a long period.
If heterogeneous servers are used, since we have found that the host CPU or the memory bus is the system bottleneck, new servers' capacities can also be derived by comparing the CPU performance and system bandwidth of the new servers with those of reference servers (these metrics may be labeled by the producer or can be tested), which avoids the exponentially-growing complexity of testing. Appendix B, available in the online supplemental material, gives an example showing that the capability of a new server for known games is predictable, and then summarizes the prediction method.
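A minimal sketch of this idea, assuming capacity scales linearly with the labeled CPU performance and system bandwidth (an illustrative simplification; the actual prediction method is the one summarized in Appendix B):

```python
def max_instances(demand, capacity):
    """Bottleneck analysis: the number of instances a server can host is
    limited by its scarcest resource relative to the per-game demand."""
    return min(int(capacity[r] // demand[r]) for r in demand)

def scaled_capacity(ref_capacity, cpu_ratio, bw_ratio):
    """Derive a new server's capacity from a reference server by scaling the
    CPU-bound and bandwidth-bound capacities by the labeled performance
    ratios (illustrative assumption)."""
    return {
        "cpu": ref_capacity["cpu"] * cpu_ratio,
        "bandwidth": ref_capacity["bandwidth"] * bw_ratio,
    }
```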
For different game configurations, the situation is more complicated. Even if only the resolution differs, tests show that there is no obvious relationship between the resolution and resource consumption, although the consumption of our framework itself (such as encoding and image capture) is proportional to the resolution.
Therefore, our solution is that such configurations can be evaluated on-line first during the real service period. For example, we can schedule the same game with the same configuration to some dedicated server(s) if a user has demanded it. With the accumulation of game runs, the metrics will become more accurate.
4.6.2 Time-Dependent Factors
We use average values to denote the resource requirements of a given game. In reality, requirements are time-dependent and may vary across gaming stages. However, we believe average values are sufficient owing to the following facts:
1) The degree of variation depends heavily on the time granularity. Our tests show that the variation becomes smaller as the time interval increases; when the time interval is 30 s (Appendix C, available in the online supplemental material), the variation of requirements is relatively small.
2) Considering the resource consolidation of multiple concurrently-running games, the use of average values is reasonable.
Moreover, it is necessary to note that some games take a very long time to finish; thus, in our experimental environment, it is difficult to explore plenty of scenes. However, such a game can be evaluated on-line first for data accumulation (as mentioned above).
5 IMPLEMENTATION AND EVALUATION
5.1 Implementation
We have implemented the cloud gaming system based on
the user-level virtualization technology. Eight PC servers
are connected by a Gigabit Ethernet; their configurations
are the same as those in Section 4.4. Detours [39] has been used to implement the required interception functions. In detail, we have implemented a DLL (called gamedll) that can be injected into any gaming process to wrap all interesting APIs and to spawn two threads, for input reception and data encoding/streaming respectively.
Now our virtualization layer can stream Direct3D games,
OpenGL games and flash games to Windows, iOS and
Android clients, and receive remote operations. The UDT
(UDP-based Data Transfer) protocol [40] is used to deliver
the video / audio / operation data between the server
and client.
We use the periodic video capture as the timing reference on the server side; any audio data between two consecutive video-capture timestamps is delivered with the current video data.
To be specific, the Windows Audio Session APIs provide interfaces to create and manage audio streams to and from audio devices. Our interception replicates such stream buffers. After the current image has been captured, the audio data between the buffer's current read and write positions (the read position is just the current playback position) is copied out immediately and sent with the current image. This method achieves video/audio synchronization and limits the timing discrepancy to roughly the reciprocal of the FPS value.
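The mechanism can be sketched as follows; the buffer and packet types are simplified stand-ins for the replicated Windows Audio Session stream buffers, not the real API:

```python
# Sketch of video-timed audio delivery: the periodic frame capture is the
# timing reference, and whatever audio accumulated since the previous
# capture ships together with that frame.

class AudioBuffer:
    """Stand-in for the replicated audio stream buffer."""

    def __init__(self):
        self._pending = bytearray()

    def write(self, pcm: bytes):
        # Replicated stream data arriving from the intercepted audio API.
        self._pending += pcm

    def drain(self) -> bytes:
        # Everything accumulated since the last capture, then reset.
        out, self._pending = bytes(self._pending), bytearray()
        return out

def capture_and_pack(frame: bytes, audio: AudioBuffer) -> dict:
    """Pair a freshly captured frame with all audio accumulated since the
    previous frame, keeping the A/V discrepancy near one frame interval."""
    return {"video": frame, "audio": audio.drain()}
```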
Fig. 8. Resource-utilizations of activated servers.

Fig. 9. Used servers of different update-intervals.

As mentioned in Section 4.1, an exception is that games may decrease the FPS deliberately in some scenes, which causes larger timing discrepancies. To remedy this, a dedicated timer has been introduced to trigger audio transmission whenever the interval between successive frames exceeds a threshold.
Moreover, on the client side, to smooth the playback of received audio, one extra audio buffer is managed by the cloud-gaming client software. Any received audio is first stored in this buffer, appended to the existing data; once the whole buffer has been filled, its contents are copied to the playback device. Combined with the default buffer of the playback device, this constructs a double-buffering mechanism, which parallelizes reception and playback and thereby smooths the playback. As a consequence, any audio data is delayed for some time: in our system, the buffer length is set to hold 200 ms of audio data, which makes the playback smooth. Results are given in the next section.
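A minimal sketch of this client-side staging buffer (the 200 ms target is expressed here as a byte count; the types and names are illustrative assumptions):

```python
class PlaybackBuffer:
    """Client-side staging buffer: received audio is appended until the
    buffer holds a fixed amount (200 ms worth in the paper's system), then
    handed to the playback device in one piece, forming the second half of
    a double-buffering scheme."""

    def __init__(self, target_bytes: int):
        self.target = target_bytes
        self._staged = bytearray()
        self.played = []  # chunks handed to the playback device

    def on_receive(self, pcm: bytes):
        # Append to existing data; flush once the buffer is full.
        self._staged += pcm
        if len(self._staged) >= self.target:
            self.played.append(bytes(self._staged))
            self._staged = bytearray()
```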
5.2 Evaluation
The test environment and configurations are the same as
those in Section 4.4, as well as the testing method.
5.2.1 Overheads of the User-Level Virtualization
Technology Itself
We execute a game directly on a physical machine and record the game speed (in terms of the average FPS) and the average memory consumption. Then, the same game is run in the user-level virtualization environment (all related APIs are intercepted but no real work, such as image capture or encoding, is enabled) and in a virtual machine respectively; the same runtime information is recorded again.
The latest VMware Player 6 is employed, and both the host and guest OSes are Windows 7. The comparison is shown in Fig. 10 (for clarity, values have been normalized).
Considering GPU utilization, the user-level technology itself introduces almost no performance loss, while the VM-based solution's efficiency is a little lower, about 90 percent of the native. On the other hand, the memory consumption of the VM-based solution is 2.4 times that of the native run, because the memory occupied by the guest OS is considerable. For the user-level solution, this consumption is almost the same as the native, too.
5.2.2 Processing Performance of the Server
The processing procedure of a cloud-gaming instance can be divided into four parts: (1) image capture, which copies a rendered frame into system memory, (2) video encoding, (3) transferring, which sends each compressed frame into the network, and (4) the processing of the game logic and rendering. The last part depends mainly on the concrete game, while GCloud handles the others. Thus the first three are the objects of this test, and the sum of their delays is denoted as SD (Server Delay).
Moreover, we intend to measure the performance limit. Hence only one instance runs on a server and the "try the best" strategy is used: no Sleep call is inserted, so the games run as fast as possible. Existing work [3] has performed a similar test for GamingAnywhere and OnLive, so we can compare our results with theirs. Although the games tested in [3] are different, we believe the comparison is meaningful because the server delay is largely independent of the specific games.
Fig. 10. Comparison of resource consumption.

Fig. 11. Processing performance and the decomposition (three resolutions).

Fig. 11 reports the average SD of three video games under different resolutions. The corresponding FPS is in Fig. 12. The average value at 720P is given in Fig. 13, as well as the corresponding values of GamingAnywhere and OnLive (values have been normalized).
Results show that, compared with similar solutions, GCloud achieves smaller SDs (ranging from 8 ms to 19 ms), which are positively correlated with resolution. We attribute this mainly to the high encoding performance of Quick Sync. In contrast, the encoding delay of GamingAnywhere is about 14 to 16 ms per frame.
The transferring latency is smaller than the other parts by two orders of magnitude. Even in the following cases of multiple games, this still holds. Thus, the transferring latency can be ignored, as we proposed in Section 4.
5.2.3 Multiple Games
The "just good enough" strategy is used; a Sleep call fixes the FPS. First, an OpenGL game and three Direct3D games are played one by one and the processing delay (including the sleep time) is sampled periodically; the sample period is one frame. Second, quite a few game combinations, each including more than one game, are executed and sampled. Without loss of generality, the FPS values of some game combinations played simultaneously are presented in Table 4, along with their average absolute deviations (AADs). These combinations are:
Case 1: Two NFS instances;
Case 2: One NFS, one Combat and one Scrolls;
Case 3: Two NFS, one Combat and one Scrolls;
Case 4: One NFS, one Combat, one Scrolls and two Birds.
On the whole, the average FPS ranges from 30.5 to 31.5 when one game runs alone. The average absolute deviations are 0.10 (Birds), 0.11 (NFS), 0.15 (Combat) and 1.47 (Scrolls) respectively, which means the FPS value is fairly stable. Of course, there are quite a few delay fluctuations; these usually mean that the corresponding game scenes are changing rapidly, which is common for highly-interactive games, especially Scrolls.
As the number of concurrently-running games increases (which means more interference between games), the FPS values decrease correspondingly while the average absolute deviations increase: for Scrolls, with three games running at the same time (Case 2), the average FPS is 28.3 and the AAD is 2.13; with four instances (Case 3), the values are 27.8 and 2.98 respectively. For Combat, with three games running simultaneously, the average FPS is 29.2 and the AAD is 0.89; with four, the values are 28.8 and 1.59 respectively.
For the uncertainties of FPS values, we believe the main reasons lie in two aspects:
1) There is interference among the running instances, including resource contention, which makes resource consumption not totally linear in the number of instances (as illustrated in Fig. 4). For example, Scrolls consumes the most resources, thus its uncertainty is the biggest.
2) As mentioned in Section 4.6, the resource requirements of games are time-dependent and may vary across stages. This also causes some uncertainty.
Overall, this means the system can achieve a satisfactory gaming effect and the FPS can be kept relatively stable while multiple games run simultaneously.
5.2.4 Verification of the Performance Model
According to the results of the performance model and the scheduling strategy, we test several typical server loads for verification. Without loss of generality, the following cases are presented.
1) One Scrolls, one Combat and two NFS. As presented in Table 5 (first row), the FPS value of each game is more than 27 and the lowest is Scrolls's, about 27.1. All are no less than 90 percent of the FIXED_FPS (30), thus acceptable. Because the system-RAM bandwidth has been nearly exhausted (about 93 percent of the MAX_SYSTEM_BANDWIDTH), when another game joins (regardless of whether it is NFS or Birds), the FPS of Scrolls drops below the acceptable level.
Fig. 12. FPS of games.
Fig. 13. Comparison of the processing delay (1280 × 720; the lower the better).
TABLE 4
FPS Values and Average Absolute Deviations of
Different Numbers of the Running Games

Game / Case        1      2      3      4
NFS      FPS     30.2   30.3   30.2   30.2
         AAD     0.18   0.24   0.44   0.70
Combat   FPS      N/A   29.2   28.8   28.6
         AAD      N/A   0.89   1.59   1.89
Scrolls  FPS      N/A   28.3   27.8   27.3
         AAD      N/A   2.13   2.98   3.30
Birds    FPS      N/A    N/A    N/A   29.8
         AAD      N/A    N/A    N/A   0.56
2) One Scrolls, one Combat, one NFS and three Birds. In this case, the sum of each kind of resource consumption is less than the corresponding system capacity; the relative maximum is the sum of memory throughputs, about 95 percent of the MAX_SYSTEM_THROUGHPUT. In Table 5 (second row), the FPS value of each game is more than 27.
3) One NFS, two Combat and five Birds.
4) Three NFS and five Birds.
In Cases 3 and 4, the sum of memory throughputs is about 96 percent of the MAX_SYSTEM_THROUGHPUT. As the sum of each kind of resource consumption is less than the corresponding system capacity, the FPS value of each game is still more than 27.
5.2.5 Discrepancy between Video and Audio
We have designed a method to calculate this discrep-
ancy: on the server, some sequences of full-black images
are inserted into the video-stream to replace original
scenes; at the same time, mute data will replace the cor-
responding audio-data, too. On the client, a screen
recording software is running with the gaming client.
Thus, through the analysis of audio / video streams of
recorded data, we can get time-stamps of the beginnings
of inserted video / audio sequences respectively. Then,
the discrepancies can be calculated. Results show that
these values are in the range of 180 ms$410 ms (Table 6).
Besides the preset delays mentioned above, we think the reasons lie in the following:
1) The delay fluctuations of games. The corresponding FPS values will be less than 30, which increases the timing discrepancy because the accumulation of audio data is slowed.
2) The network's delay fluctuations. They increase the timing discrepancy, too. Our tests were carried out on a campus network; we believe that, over the Internet, this factor will cause larger delays.
3) The measurement error. The recording software records the screen periodically at 30 FPS, while the audio recording is continuous. Thus, the beginnings of some full-black image sequences may be missed, which decreases the measured gap.
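The timestamp extraction behind this measurement can be sketched as follows, assuming per-sample marker predicates for full-black frames and mute audio (the names and the minimum run length are our assumptions):

```python
def first_marker_start(samples, is_marker, min_len=3):
    """samples: list of (timestamp, value). Return the timestamp at which
    the first run of at least min_len marker values begins, or None; the
    run-length guard filters out stray single markers."""
    start, run = None, 0
    for t, v in samples:
        if is_marker(v):
            if run == 0:
                start = t
            run += 1
            if run >= min_len:
                return start
        else:
            run = 0
    return None

def av_discrepancy(video_samples, audio_samples, is_black, is_mute, min_len=3):
    """Gap between the start of the inserted black-video run and the start
    of the corresponding mute-audio run in the recorded streams."""
    v = first_marker_start(video_samples, is_black, min_len)
    a = first_marker_start(audio_samples, is_mute, min_len)
    return None if v is None or a is None else abs(v - a)
```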
6 CONCLUSIONS AND FUTURE WORK
This paper proposes GCloud, a GPU/CPU hybrid cluster for
cloud gaming based on the user-level virtualization
technology. We focus on the guideline of task scheduling:
To balance gaming responsiveness and costs, we fix the game's FPS to allocate just enough resources, which also mitigates the interference between games. Accordingly, a performance model has been analyzed to explore the server capacity and the games' resource demands; it can locate the performance bottleneck and guide task scheduling based on the games' critical resource demands. Comparisons show that both the First-Fit-like and Best-Fit-like scheduling strategies outperform the alternatives; moreover, they are near-optimal in the batch processing mode.
In the future, we plan to enhance performance models to
support heterogeneous servers.
ACKNOWLEDGMENTS
The work is supported by the High Tech. R&D Program of China under Grant No. 2013AA01A215.
REFERENCES
[1] R. Shea, L. Jiangchuan, E.C.-H. Ngai, and C. Yong, “Cloud gam-
ing: Architecture and performance,” IEEE Netw., vol. 27, no. 4,
pp. 16–21, Jul./Aug. 2013.
[2] Z. Zhao, K. Hwang, and J. Villeta, “GamePipe: A virtualized cloud
platform design and performance evaluation,” in Proc. ACM 3rd
Workshop Sci. Cloud Comput., 2012, pp. 1–8.
[3] C.-Y. Huang, C.-H. Hsu, Y.-C. Chang, and K.-T. Chen,
“GamingAnywhere: An open cloud gaming system,” in Proc.
ACM Multimedia Syst., Feb. 2013, pp. 36–47.
[4] R. Phull, C.-H. Li, K. Rao, S. Cadambi, and S. T. Chakradhar,
“Interference-driven resource management for GPU-based het-
erogeneous clusters,” in Proc. 21st ACM Int. Symp. High Perform.
Distrib. Comput., 2012, pp. 109–120.
[5] V. T. Ravi, M. Becchi, G. Agrawal, and S. T. Chakradhar,
“Supporting GPU sharing in cloud environments with a transpar-
ent runtime consolidation framework,” in Proc. 20th ACM Int.
Symp. High Perform. Distrib. Comput., 2011, pp. 217–228.
[6] G. A. Elliott and J. H. Anderson, “Globally scheduled real-time
multiprocessor systems with GPUs,” Real-Time Syst., vol. 48, no. 1.
pp. 34–74, 2012.
[7] L. Chen, O. Villa, S. Krishnamoorthy, and G. R. Gao, “Dynamic
load balancing on single- and multi-gpu systems,” in Proc. IEEE
Int. Symp. Parallel Distrib. Process., 2010, pp. 1–12.
[8] M. Yu, C. Zhang, Z. Qi, J. Yao, Y. Wang, and H. Guan, “GRIS:
Virtualized GPU resource isolation and scheduling in cloud
gaming,” in Proc. 22nd Int. Symp. High-Perform. Parallel Distrib.
Comput., 2012, pp. 203–214.
[9] C. Zhang, J. Yao, Z. Qi, M. Yu, and H. Guan, “vGASA: Adaptive
scheduling algorithm of virtualized GPU resource in cloud
gaming,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 11,
pp. 3036–3045, 2014.
[10] M. Claypool and K. Claypool, “Latency and player actions in
online games,” Commun. ACM, vol. 49, no. 11, pp. 40–45, 2006.
[11] D. C. Barboza, V. E. F. Rebello, E. W. G. Clua, and H. Lima, “A
simple architecture for digital games on demand using low per-
formance resources under a cloud computing paradigm,” in Proc.
Brazilian Symp., Games Digital Entertainment, 2010, pp. 33–39.
[12] D. De Winter, P. Simoens, and L. Deboosere, “A hybrid thin-client
protocol for multimedia streaming and interactive gaming
applications,” in Proc. Int. Workshop Netw. Oper. Syst. Support Digi-
tal Audio Video, 2006, p. 15.
TABLE 5
FPS of Concurrently-Running Games

TABLE 6
Discrepancy Values on the Client Side

Game      Minimum   Maximum   Average
NFS       205 ms    395 ms    287 ms
Scrolls   213 ms    410 ms    323 ms
Combat    196 ms    336 ms    278 ms
Birds     180 ms    275 ms    242 ms
[13] W. Yu, J. Li, C. Hu, and L. Zhong, “Muse: A multimedia streaming
enabled remote interactivity system for mobile devices,” in Proc.
10th Int. Conf. Mobile Ubiquitous Multimedia, 2011, pp. 216–225.
[14] L. Shi, H. Chen, and J. Sun, “vCUDA: GPU accelerated high per-
formance computing in virtual machines,” in Proc. IEEE Int. Symp.
Parallel Distrib. Process., 2009, pp. 1–11.
[15] J. Duato, A. J. Peña, F. Silla, R. Mayo, and E. S. Quintana-Ortí,
“rCUDA: Reducing the number of GPU-based accelerators in
high performance clusters,” in Proc. Int. Conf. High Perform. Com-
put. Simul., 2010, pp. 224–231.
[16] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V.
Talwar, and P. Ranganathan, “GViM: Gpu-accelerated virtual
machines,” in Proc. ACM Workshop Syst.-Level Virtualization High
Perform. Comput., 2009, pp. 17–24.
[17] M. Bautin, A. Dwarakinath, and T.-c. Chiueh, “Graphic engine
resource management,” in Proc. 15th Multimedia Comput. Netw.,
2008, pp. 15–21.
[18] D. Wu, Z. Xue, and J. He, “iCloudAccess: Cost-effective streaming
of video games from the cloud with low latency,” IEEE Trans.
Circuits Syst. Video Technol., vol. 24, no. 8, pp. 1405–1416, Jan. 2014.
[19] H.-J. Hong, D.-Y. Chen, C.-Y. Huang, K.-T. Chen, and C.-H. Hsu,
“Placing virtual machines to optimize cloud gaming experience,”
IEEE Trans. Cloud Comput. , vol. 3, no. 1, pp. 42–53, Jan.–Mar. 2015.
[20] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa,
“TimeGraph: GPU scheduling for real-time multi-tasking environ-
ments,” in Proc. USENIX Conf. USENIX Annu. Tech. Conf., 2011, p. 2.
[21] L. Cherkasova and L. Staley, “Building a performance model of
streaming media application in utility data center environment,” in
Proc. 3rd IEEE/ACM Int. Symp. Cluster Comput. Grid, 2003, pp. 52–59.
[22] V. Ishakian and A. Bestavros, “MORPHOSYS: Efficient colocation
of QoS-constrained workloads in the cloud,” in Proc. 12th IEEE/
ACM Int. Symp. Cluster, Cloud Grid Comput., 2012, pp. 90–97.
[23] S. Wang and S. Dey, “Rendering adaptation to address communi-
cation and computation constraints in cloud mobile gaming,” in
Proc. Global Telecommun. Conf., Dec. 6–10, 2010, pp. 1–6.
[24] D. Klionsky. A new architecture for cloud rendering and amor-
tized graphics. M.S. Thesis, School Comput. Sci., Carnegie Mellon
Univ., CMU-CS-11–122. [Online]. Available: http://reports-
archive.adm.cs.cmu.edu/anon/2011/abstracts/11–122.html.
[25] A. Jurgelionis, P. Fechteler, P. Eisert, F. Bellotti, and H. David,
“Platform for distributed 3D gaming,” Int. J. Comput. Games Tech-
nol. , vol. 2009, p. 1, 2009.
[26] A. Ojala and P. Tyrvainen, “Developing cloud business models:
A case study on cloud gaming,” IEEE Softw., vol. 28, no. 4,
pp. 42–47, Jul. 2011.
[27] S.-W. Chen, Y.-C. Chang, and P.-H. Tseng, C.-Y. Huang, and C.-L.
Lei, “Measuring the latency of cloud gaming systems,” in Proc.
19th ACM Int. Conf. Multimedia, 2011, pp. 1269–1272.
[28] S. Choy, B. Wong, G. Simon, and C. Rosenberg “The brewing
storm in cloud gaming: A measurement study on cloud to end-
user latency,” in Proc. 11th Annu. Workshop Netw. Syst. Support
Games, 2012, p. 2.
[29] Y.-T. Lee, K.-T. Chen, H.-I. Su, and C.-L. Lei, “Are all games equally
cloud-gaming-friendly? An electromyographic approach,” in Proc.
IEEE/ACM NetGames, 2012, pp. 109–120.
[30] K.-T. Chen, Y.-C. Chang, H.-J. Hsu, D.-Y. Chen, C.-Y. Huang, and
C.-H. Hsu, “On the quality of service of cloud gaming systems,”
IEEE Trans. Multimedia, vol. 16, no. 2, pp. 480–495, Feb. 2014.
[31] Y. Zhang, X. Wang, and L. Hong, “Portable desktop applications
based on P2P transportation and virtualization,” in Proc. 22nd
Large Installation Syst. Administration Conf., 2008, pp. 133–144.
[32] P. Guo, “CDE: Run any linux application on-demand without
installation,” in Proc. 25th USENIX Large Installation Syst. Adminis-
tration Conf., 2011, p. 2.
[33] B. Xia and T. Zhiyi, “Tighter bounds of the first fit algorithm for
the bin-packing problem,” Discrete Appl. Math., vol. 158, no. 15,
pp. 1668–1675, 2010.
[34] C. Kenyon, “Best-fit bin-packing with random order,” in Proc. 7th
Annu. ACM-SIAM Symp. Discrete Algorithm, 1996, vol. 96,
pp. 359–364.
[35] M. Harchol-Balter, M. E. Crovella, and C. Duarte Murta, “On
Choosing a task assignment policy for a distributed server sys-
tem,” J. Parallel Distrib. Comput., vol. 59, no. 2, pp. 204–228, 1999.
[36] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker,
and I. Stoica, “Dominant resource fairness: Fair allocation of mul-
tiple resource types,” in Proc. 8th USENIX Symp. Netw. Syst. Des.
Implementation, 2011, pp. 323–336.
[37] Y.-T. Lee and K.-T. Chen, “Is server consolidation beneficial to
MMORPG? A case study of world of warcraft,” in Proc. IEEE 3rd
Int. Conf. Cloud Comput., 2013, pp. 435–442.
[38] Y.-T. Lee, K.-T. Chen, Y.-M. Cheng, and C.-L. Lei, “World of war-
craft avatar history dataset,” in Proc. 2nd Annu. ACM Multimedia
Syst., Feb. 2011, pp. 123–128.
[39] G. Hunt and D. Brubacher, “Detours: Binary interception of
Win32 functions,” in Proc. 3rd USENIX Windows NT Symp., Jul.
1999, p. 14.
[40] Y. Gu and R. L. Grossman, “UDT: UDP-based data transfer for
high-speed wide area networks,” Comput. Netw., vol. 51, no. 7,
pp. 109–120, May 2007.
Youhui Zhang received the BSc and PhD
degrees in computer science from Tsinghua Uni-
versity, China, in 1998 and 2002. He is currently
a professor in the Department of Computer Sci-
ence, Tsinghua University. His research interests
include computer architecture, cloud computing,
and high-performance computing. He is a mem-
ber of the IEEE and the IEEE Computer Society.
Peng Qu received the BSc degree in computer
science from Tsinghua University, China, in
2013. He is currently working toward the PhD
degree in the Department of Computer Science,
University of Tsinghua, China. His interests
include cloud computing and micro-architecture.
Cihang Jiang received the BSc degree in com-
puter science from Tsinghua University, China, in
2013. He is currently a master student in the
Department of Computer Science, University of
Tsinghua, China. His research interest is cloud
computing.
Weimin Zheng received the BSc and MSc
degrees in computer science from Tsinghua Uni-
versity, China, in 1970 and 1982, respectively.
He is currently a professor in the Department of
Computer Science, University of Tsinghua,
China. His research interests include high perfor-
mance computing, network storage and distrib-
uted computing. He is a member of the IEEE and
the IEEE Computer Society.
For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.