
Journal of Software Engineering and Applications, 2012, 5, 912-922

http://dx.doi.org/10.4236/jsea.2012.531106 Published Online November 2012 (http://www.SciRP.org/journal/jsea)

Performance Analysis and Improvement of Storage Virtualization in an OS-Online System
Yuan Gao, Yaoxue Zhang, Yuezhi Zhou
Department of Computer Science and Technology, Tsinghua University, Beijing, China.
Email: [email protected]

Received October 15th, 2012; revised November 16th, 2012; accepted November 25th, 2012

ABSTRACT
TransCom is an OS-online system based on a virtual storage system that delivers heterogeneous operating system and application services online. In TransCom, the OS and software that run on a client are stored on centralized servers, while the computing tasks are carried out by the clients, so the server is the bottleneck of system performance. This paper first analyzes the characteristics of a real usage workload and builds a queuing model to locate the system bottlenecks. According to this study, the disk is the primary bottleneck, so an optimized two-level cache arrangement is developed spanning both the server and the client, which aims to avoid most of the server disk accesses. The LRU algorithm is used in the client-side cache, and a cache management algorithm called Frequency-based Multi-Priority Queues (FMPQ), proposed in this paper, is used in the server-side cache. Experimental results show that an appropriate cache arrangement can significantly improve the capability of the TransCom server.

Keywords: Storage Virtualization; Performance Analysis; Cache Strategy

1. Introduction

During the last decade, with the rapid advances in embedded and mobile devices, traditional general-purpose desktop computing has been shifting toward highly heterogeneous and scalable cloud computing [1,2], which aims to offer novel pervasive services to users in the right place, at the right time, and by the right means, with some kind or level of smart or intelligent behaviour. From the scalable-service perspective, the expectation for these pervasive services is that a smart ubiquitous computing platform should enable users to get different services via a single light-weight device, and the same service via different types of devices. Unfortunately, none of the current technologies achieves such flexible service delivery; in other words, users are often unable to select their desired service freely via the devices or platforms available to them. A new computing paradigm, namely transparent computing [3,4], has been proposed to solve the problems above. The core idea of this paradigm is to realize the "stored program concept" [5] model in a networking environment, in which the execution and the storage of programs are separated on different computers. All the OSes, applications, and data of the clients are centered on the servers, and are scheduled on demand and run on different clients in a "block-streaming" way. All the OS, application, and data streaming can be intercepted, monitored, or audited independently of the clients. Due to the central storage of OSes and applications, installation, maintenance, and management are also centralized, leaving the clients light-weight. A typical transparent computing system is illustrated in Figure 1.

We implemented a prototype of transparent computing, namely TransCom [6], which is a distributed system based on the C/S model. In TransCom, a client is nearly bare hardware, responsible for the execution of programs and the interaction with users. Most programs, including the OSes and applications executed on the clients, are centralized on the server, which is responsible for storage and management. In order to fetch the remote programs and data transparently, the virtual disk system (Vdisk) in TransCom extends the local external memory to the disks and the memory on the server.

Unlike traditional distributed storage systems, the Vdisk in TransCom is designed for remote program access rather than only remote data access, which gives Vdisk some unique features. Firstly, Vdisk supports remote program loading and paging. Secondly, all the virtual disks are transparent to the native file systems and applications. Thirdly, one program segment can be shared among different clients. Lastly, each client has a separate disk view.

Since Vdisk is designed for this special purpose, its behaviour is not the same as that of traditional distributed storage systems.


[Figure 1. The computing environment of transparent computing: polymorphous clients obtain computing services (OSes, applications, and user data held as resources on servers in cloud networks) via block-streaming over the network, so OS/APPs can be utilized as disposable resources.]

Understanding the workload characteristics of Vdisk is a necessary prelude to improving the performance of TransCom. In this paper, a trace-driven analysis method is used to observe the I/O characteristics of Vdisk, and the effect of caching on both the server and the client side is discussed. An analytical model is also built to evaluate the effect of several optimizations on the cache system.

The remaining sections are organized as follows. The overall architecture of the TransCom system is presented in Section 2. In Section 3, we build a queuing network model to analyse the utilization of resources on the server. In Section 4, we identify the bottleneck of the TransCom server and discuss, by simulation, the factors that affect the cache hit ratio and the overall performance. In Section 5, we propose a two-level cache strategy optimization method, and we provide the experimental results in Section 6. Conclusions and future work are discussed in Section 7.

2. System Overview

The TransCom system is based on the C/S model, where a single server can support up to tens of clients connected in a network. Figure 2 shows the overall architecture of a TransCom system with a server and a single client. Having no local hard disk, each client accesses the OS, software, and data from remote virtual disks which simulate physical block-level storage devices. A Vdisk, in essence, is one or more disk image files located on the server and accessed by the client remotely via the Network Service Access Protocol (NSAP) [7]. The TransCom server, running as an application daemon, maintains a client management process, a disk management process, and all the Vdisk image files belonging to all clients in the system.

As seen from its structure in Figure 2, the Vdisk driver is composed of two parts, running on the TransCom client and the TransCom server, respectively. The OS-Specific Driver (OSD) mainly provides the interaction interface with a specific Client OS, so that the Client OS can perform various operations on the virtual devices as usual. The Independent Driver (ID), which runs in TransOS, fulfills the Vdisk functions that are independent of any specific Client OS. The interface between the OSD and the ID is an ordinary hardware-level interface based on the I/O controller and registers. The Service Initiator locates the TransCom server for the ID and transports the requests for Vdisk operations to the relevant handling programs on the TransCom server via NSAP; after waiting for the response from the server, the Service Initiator passes the results back to the ID for further handling. The Service Target receives I/O requests from the TransCom client, searches the relevant database, checks the access authority, performs the operations on the corresponding Vdisk image files and physical devices, and finally returns the results to the TransCom client. NSAP is the communication protocol used to locate the TransCom server, verify the relevant authorization, and transport the requests and responses of the various I/O operations.

As mentioned above, the Virtual I/O (VIO) path needs to go through the TransCom delivery network (a round-trip transportation) and the physical I/O operations of the TransCom server. Therefore, a complete VIO operation takes more time than a conventional I/O operation, and more often than not this makes VIO the bottleneck of system performance.


[Figure 2. Overall architecture of a TransCom system with a server and a single client. On the client, the Client OS with the OS-Specific Driver (OSD) and, running in memory, TransOS with the Independent Driver (ID), Client Cache (CC), and Service Initiator; on the server, the Server OS with the user manager, authentication server, Server Cache (SC), Service Targets, and Image Manager over the Vdisk image repositories; client and server communicate via NSAP, and the virtual disks are backed by the Vdisk image files.]

In order to enhance the access performance of VIO in the TransCom system, it is necessary to add cache modules along the VIO path through an "add-in" mechanism, so as to further improve the read and write performance.

The client cache is used to cache the request and response data exchanged with the Client OS and the remote TransCom server, and so to reduce the I/O response time. The server cache is added on top of the Service Targets on the TransCom server. After the caching modules are added, when handling a VIO request sent from the TransCom client, the Service Target will first search the server cache for the I/O data requested by the user. If the requested VIO data is in the cache, it directly returns the I/O data to the TransCom client. Otherwise, the Service Target operates directly on the Vdisk image file and its corresponding physical device, acquires the VIO data requested by the user, updates the content of the server cache, and then sends the result to the sending queue. The server cache also determines, according to the specific VIO request sent by the user, whether some VIO data should be pre-read into the cache buffer; if so, it invokes the Service Target to operate directly on the Vdisk image file and its corresponding physical device, so as to read the VIO data beforehand.
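To make the read path just described concrete, the following is a minimal sketch of the Service Target's cache-first handling of a VIO read. All names (ServerCache, VdiskImage-style read_block, handle_vio_read) are hypothetical illustrations, not TransCom's actual interfaces, and eviction is left as a placeholder since the replacement policy is the subject of Section 5.

```python
# Hypothetical sketch of the Service Target read path; names and structure
# are illustrative, not the actual TransCom implementation.
class ServerCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = {}                          # (client_id, block_id) -> data

    def get(self, key):
        return self.blocks.get(key)

    def put(self, key, data):
        if len(self.blocks) >= self.capacity:
            # Placeholder eviction; the real policy is discussed in Section 5.
            self.blocks.pop(next(iter(self.blocks)))
        self.blocks[key] = data

def handle_vio_read(cache, image, client_id, block_id):
    """Serve one VIO read request from a TransCom client."""
    key = (client_id, block_id)
    data = cache.get(key)
    if data is None:                              # miss: go to the Vdisk image
        data = image.read_block(block_id)         # file / physical device
        cache.put(key, data)                      # update the server cache
    return data                                   # result goes to the send queue
```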
Two features distinguish TransCom from previous diskless distributed systems [8-10]. Firstly, TransCom can boot and run heterogeneous OSes and applications, because the Vdisk driver is transparent to both OSes and applications. Secondly, the Vdisks perceived by users can be flexibly mapped to the Vdisk image files on the TransCom server. Such flexibility allows TransCom to share OSes and applications among different clients to reduce the storage and management overhead, while still isolating the personal files for the privacy of users.

We study a real usage case deployed in the network and system group at Tsinghua University; this system is set up as the baseline case. The server is a Dell PowerEdge 1900 machine, equipped with an Intel Xeon Quad Core 1.6 GHz CPU, 4 GB of dual DDR2 667 MHz RAM, one 160 GB Hitachi 15,000 rpm SATA hard disk, and a 1 Gbps on-board network card. Each client is an Intel Dual Core E6300 1.86 GHz machine, with 512 MB of DDR 667 RAM and a 100 Mbps on-board network card. All the clients and the server are connected by an Ethernet switch with 98 100-Mbps interfaces and two 1-Gbps interfaces. All clients run Windows XP Professional SP3. The server runs Windows 2003 Standard SP2, with the software providing the TransCom services.

In this paper, we study and optimize the above system. In the following sections, we discuss what the bottleneck of this system is and how to improve the system.

3. Model Analysis

In this section, the most critical resources on the server are identified, and the measurement data is analysed to build queuing network performance models. We describe our models, their inputs, and the experiments conducted to obtain these inputs.

3.1. Models of TransCom System

Since the input requirements of our models dictate the quantities that must be measured, a description of these models is introduced at the beginning. We chose queuing network performance models because they achieve an attractive combination of efficiency and accuracy. There are three components in the specification of a queuing network model: the service centre description, the customer description, and the service demands.


The service centre description identifies the resources of the system that will be represented in the model, such as disks, CPUs, and communication networks. The customer description indicates the workload intensity and the offered load, such as the average number of requests in the system, the average rate at which requests arrive, the number of users, and the average waiting time. The service demands indicate the average amount of service that each request requires at each service centre.

Once these inputs have been specified, the model can be evaluated using efficient numerical algorithms to obtain performance measures such as utilization, residence time, queue length, and throughput. In essence, the evaluation algorithm calculates the effects of the interference and queuing that result when customers with given service demands share the system at a particular workload intensity. Once created, the model can be used to project the performance of the system under various modifications, since system modifications often have straightforward representations as modifications to the model inputs.

[Figure 3. Models of TransCom system. (a) Queuing model of the TransCom system: clients cycle requests through the server's CPU, storage system, and NIU. (b) Model of the storage system in the TransCom system: a read hit or a write is served by the cache, while a read miss or a full write buffer is served by disks 1 to n.]
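As an illustration of how such a model is evaluated, the sketch below implements exact mean-value analysis (MVA) for a closed queuing network with one "customer" per client: queuing centres for the server's CPU, storage system, and NIU, plus the client modeled as a delay (think-time) centre. This is the textbook algorithm, not the authors' own tool; the example demand values are taken from Table 1 for 4 KB requests.

```python
def mva(demands_ms, client_delay_ms, n_clients):
    """Exact MVA for a closed network: queueing centres plus one delay centre.
    demands_ms maps each server centre to its service demand per request (ms)."""
    queue_len = {c: 0.0 for c in demands_ms}
    throughput, residence = 0.0, {}
    for n in range(1, n_clients + 1):
        # Arrival theorem: residence time = demand * (1 + queue seen on arrival)
        residence = {c: d * (1.0 + queue_len[c]) for c, d in demands_ms.items()}
        throughput = n / (client_delay_ms + sum(residence.values()))   # req/ms
        queue_len = {c: throughput * r for c, r in residence.items()}  # Little's law
    return throughput, residence

# Example with the 4 KB service demands of Table 1 (in ms); the disk demand
# could be scaled by the cache miss ratio to study the effect of the cache.
x, r = mva({"cpu": 0.10, "disk": 6.76, "niu": 0.17},
           client_delay_ms=0.32, n_clients=15)
print("throughput (req/s):", round(x * 1000), "response time (ms):",
      round(sum(r.values()), 2))
```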
In the TransCom system, a number of TransCom clients share a server over a local area network. Figure 3 illustrates the models used in our study. The server is represented by three service centres, corresponding to the CPU, the storage subsystem, and the network subsystem. The execution of requests at the CPU depends on the type of operation requested by the client, which may be either a control or an access operation. For the storage service, the execution of a request at the server is simpler: a user request for a control operation is translated by the client into one or more access requests to the server, and the access and control requests at the server are handled in a way similar to an access request to a file service. The storage system is represented by a flow-equivalent service centre, which is composed of a memory cache and some disks. Thus, the efficiency of the storage system depends on the effect of the cache system, which is usually represented by the hit rate. Each client workstation is represented by a delay centre, in which the delay time is the sum of the latencies of the network transaction and the network stack processing on each client. The model includes one "token" or "customer" corresponding to each client. Each customer cycles between its client and the server via the network, accumulating service and encountering the queuing delays caused by competition from the other customers.

3.2. Customer Characteristics

The I/O requests issued by the clients in the baseline system were traced at Tsinghua University for 4 weeks. There are 15 users, a professor and several graduate students, working on the clients with Windows XP from 8 am to 6 pm. The applications used most frequently are the internet browser (IE 7.0), the text editor (Microsoft Office 2007), and the text viewer (Adobe Acrobat Reader 8.0). Besides, New Era English, a multimedia application for English self-learning, is often used by the students.

Wireshark 1.6 is used to set up a network monitor on the server to capture the packets related to I/O requests and to extract the required information, such as the disk id, user id, requested initial block number, block length, operation command, and the time each packet was issued/received. Note that, because of the limitation on network packet size, TransCom clients need to split a large I/O request into several small ones; fields are added in each split packet to record the initial block number and the length of the original request.

The results of our trace analysis are summarized as follows. Note that a request referenced here is an original request before it is split by the TransCom system.

1) The minimal request size is 0.5 KB, and the maximal request size is 64 KB. The average request size is 8 KB.

2) Most of the requests are short in length (70% are less than or equal to 4 KB), and 4 KB is the most frequent request size (60%).

3) The ratio of read to write traffic is 1:3, while the ratio of their working sets (the number of blocks accessed at least once) is 4:5.


4) On average, half of the requests are sequential.

According to the above observations, a 4 KB request is defined as the "typical request". The service demands at the client are composed of the user-mode processing and the overhead of transferring the 4 KB blocks. Since NSAP is a one-step protocol, the service demand of a client should remain constant as the number of clients increases.

3.3. Measuring Service Demands

The parameters whose values are required for transferring 4 KB of data are the service demands at the client CPU, the server CPU, disk and NIC, and the network. These service demands are measured in a series of experiments that transfer large numbers of blocks with a 4 KB block size. The experiments are repeated to ensure reliability.

The CPU service demands at the clients and the server are measured by a performance monitor, a background process provided by Windows, in all experiments. The server CPU consumption can be further divided into three parts: storage-related consumption, network-related consumption, and I/O server consumption. The storage-related consumption is the CPU service time spent on managing the cache system and controlling the disks. The network-related consumption is mainly associated with the overhead of the UDP/IP network stack. The I/O server consumption is the time spent locating the requested image files and the position of the file access pointer.
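These measurements combine through the service demand law, D_k = U_k / X: the utilization of a device divided by the system throughput over the same interval gives the service demand per request. A minimal sketch follows; the numbers are illustrative only, chosen to match the disk demand later reported in Table 1.

```python
def service_demand(busy_time_s, interval_s, completed_requests):
    """Service demand law: D_k = U_k / X, in seconds per request."""
    utilization = busy_time_s / interval_s          # U_k from the monitor
    throughput = completed_requests / interval_s    # X over the same interval
    return utilization / throughput

# e.g. a disk busy for 67.6 s during a 100 s run that completed 10,000
# requests has a demand of 0.00676 s = 6.76 ms per request (cf. Table 1).
print(service_demand(67.6, 100.0, 10_000))
```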
Since it is complicated and expensive to deploy a monitor on the NIU (Network Interface Unit) to measure its service time directly, the service time is estimated from the throughput and the network-related consumption: large numbers of 4 KB UDP packets are transferred continuously via the 1 Gbps Ethernet NIC of the server, and we measure the throughput and the network stack consumption on the server CPU, from which the service time of the NIU can be calculated. The disk service times of both random and sequential accesses are measured with IOMeter. In the experiments, we found that the CPU service demands at the client and the server do not depend on the access mode. According to the results of our trace study mentioned above, we assume in our model that a seek is required once per two disk accesses. The typical parameter describing the cache effect is the hit ratio, which is not easy to measure in real usage; the hit ratio is one of the factors that affect the request response time and the utilization of each service centre.

3.4. Modeling Verification

To make the model simple and effective, several assumptions are made, some of which have already been mentioned in the previous sections.

The assumptions about the service centres are as follows: 1) the service centres in the model are independent of each other; 2) the buffer of each service centre is unlimited, so no request is dropped.

The assumptions about the workload are as follows: 1) the size of each request is 4 KB; 2) a seek happens once per two disk accesses.

To examine whether these assumptions affect the accuracy of our model, the response times of the Vdisk requests calculated by the model are compared with the response times measured in the real system, as shown in Figure 4. The calculated and the measured values are very close to each other in both plots, which shows that the assumptions are reasonable in our scenario. In the next section, this model is adopted to conduct the bottleneck analysis of the TransCom system and, especially, to evaluate the effect of the cache at both the server and the clients.

4. Performance Analysis of the Baseline System

The service demands measured in the real system are shown in Table 1. Since the disk service demands at the server dominate among the shared resources, our research emphasizes the investigation of the effect of the memory cache, which can be represented by the hit rate.

4.1. Effect of Cache Hit Ratio at Server

The relationship between the server throughput at heavy load and the cache hit ratio of the storage subsystem is plotted in Figure 5. The throughput is not sensitive to the hit ratio while the hit ratio is low, but it improves dramatically once the hit ratio exceeds 80%. Besides, a large block size achieves a higher throughput than a small one.

4.2. Congestion Analysis

Figure 6(a) illustrates the throughput of the server at various loads. It can be observed that even when the hit ratio is 100%, the server saturates at a rather small scale (about 15 clients). Another metric for evaluating the performance of the system is the latency observed by the clients. It can be seen from Figure 6(b) that the access latency can be smaller than that of a local disk when the hit ratio is higher than a certain threshold. This indicates that remote disk accesses in TransCom may achieve better performance at a light load. According to Figures 6(a) and 6(b), a design that reduces the light-load remote access latency at the expense of increased service demands would appear to be inappropriate. Conversely, a design that reduces the service demands at the expense of some increase in light-load file access latency would appear to be desirable.


[Figure 4. Performance comparison between model estimation and real system: (a) throughput (KIOPS) at various scales; (b) average response time (ms) of each 4 KB request. The model estimates track the measured values closely in both plots.]

[Figure 6. Congestion analysis at different server hit ratios: (a) throughput (KIOPS) versus number of clients for the local disk and for 50%, 95%, and 100% hit ratios; (b) latency (ms) versus number of clients for the local disk and for 80%, 90%, 95%, and 100% hit ratios.]
[Figure 5. Relationship between the throughput (MBps) and the server cache hit ratio, for 4 KB and 32 KB blocks.]

Table 1. Service demands to transfer a 4 KB or 32 KB request.

Service demand                      Request block size
                                    4 KB        32 KB
CPU                                 0.1 ms      0.23 ms
Disk                                6.76 ms     7.20 ms
NIU                                 0.17 ms     0.5 ms
Client-Delay (Client CPU + NIU)     0.32 ms     2.56 ms
4.3. Bottleneck Analysis

The bottleneck is the shared resource with the highest utilization at a heavy load. Because of the effect of the cache, the primary bottleneck may vary with the hit ratio. Besides, the block size of a request, which is a factor that affects the ratio of sequential to random accesses to the server disk, is also a potential factor affecting the bottleneck identification. Therefore, Figure 7 plots the utilization of the devices as a function of the cache hit ratio: Figure 7(a) presents a typical small block size (4 KB) and Figure 7(b) a typical large block size (32 KB). According to the two figures, the disk, or the storage subsystem, is the primary bottleneck when the hit ratio is lower than 90%, and the network becomes the primary bottleneck when the hit ratio is higher than 90%. The utilization of the disk at a large block size drops more sharply than at a small block size as the hit ratio increases.

[Figure 7. CPU, network, and disk utilization at different server hit ratios: (a) request block size 4 KB; (b) request block size 32 KB.]
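The idea behind this identification can be written as a small calculation: the centre with the highest effective service demand bounds the throughput, and only cache misses reach the disk. The sketch below uses the 4 KB demands of Table 1; scaling the disk demand linearly by the miss ratio is a simplification of the full model (which also accounts for seeks), so the exact crossover point differs from Figure 7.

```python
def primary_bottleneck(hit_ratio, d_cpu=0.10, d_disk=6.76, d_niu=0.17):
    """Return the centre with the highest effective demand (ms per request)."""
    demands = {
        "cpu": d_cpu,
        "disk": (1.0 - hit_ratio) * d_disk,   # only misses reach the disk
        "network": d_niu,
    }
    return max(demands, key=demands.get), demands

centre, demands = primary_bottleneck(0.95)
print(centre, demands)   # as the hit ratio grows, the disk demand shrinks and
                         # the bottleneck eventually shifts to the network
```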


5. Cache Strategy

In a caching scheme, requested blocks are saved in main memory so that subsequent requests for these blocks can be satisfied without disk operations. The performance of the cache depends on the behaviour of the workload and on the deployment location (client or server). In this section, we discuss how these factors affect the cache hit ratio and the overall performance by simulation. We first study the client and server cache access patterns in TransCom. The LRU [11] algorithm is used in the client-side cache, and the cache management algorithm called Frequency-based Multi-Priority Queues (FMPQ) proposed in this paper is used in the server-side cache.

5.1. Cache Simulator

A program is written to simulate the behaviour of various kinds of caches, using the trace data to drive the simulations. The trace data was collected in the experiment described in Section 3. The design of the simulator is similar to the classical stack algorithm, by which the hit ratios for all cache sizes can be calculated in a single pass over the reference trace.

The simulator represents the cache as a stack, with the most recently referenced block on the top. Each element of the stack represents a fixed-size block composed of several Vdisk sectors, and the upper k elements of the stack are the blocks held by a cache of size k. To simulate the image sharing mechanism, a block in the simulator is identified by a tuple <block_id, client_id>, where block_id is the linear address of the initial sector and client_id is the MAC address of the client who "owns" the block. The "owner" of a block is the client who created it. A block with a specific owner, namely a private block, is created when the owner first attempts to modify the content of a shared block with the same block_id. From then on, the client accesses the new block instead of the shared one.

When the trace indicates that a range of sectors in a Vdisk is read or written, the range is first divided into one or more block accesses. For each block access, the simulator checks whether a corresponding private block owned by the requesting client exists. If so, it finds the private block in the stack and moves it to the top. If not, the simulator checks whether the access is a read or a write. For a read, the simulator searches for the shared block: if it finds the requested block, it moves it to the top of the stack; otherwise, it creates a shared block and pushes it onto the stack. For a write, the simulator creates a private block belonging to the client and pushes it onto the stack.

If a block is referenced at level k of the stack, it is a "hit" for all cache sizes of k blocks and larger. A counter is deployed at each level of the stack, which records the number of hits occurring at that level. To work out the hit ratio for a cache of size k, we only need to sum the counter values of the upper k levels.
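A condensed sketch of this single-pass scheme follows. It is a simplification of the program described above (single access granularity, linear stack search) with hypothetical names, but it shows how one pass over the trace yields hit ratios for every cache size at once.

```python
from collections import defaultdict

class StackSimulator:
    def __init__(self):
        self.stack = []                       # index 0 = most recently used
        self.hit_at_level = defaultdict(int)  # hits observed at each depth
        self.accesses = 0

    def access(self, block_id, client_id, is_write):
        self.accesses += 1
        private = (block_id, client_id)       # client's own copy, if any
        shared = (block_id, None)             # block shared among clients
        if private in self.stack:             # private copy exists: hit on it
            self._hit_and_raise(private)
        elif is_write:                        # write: create a private block
            self.stack.insert(0, private)
        elif shared in self.stack:            # read hit on the shared block
            self._hit_and_raise(shared)
        else:                                 # read miss: push a shared block
            self.stack.insert(0, shared)

    def _hit_and_raise(self, key):
        level = self.stack.index(key)         # depth of the reference
        self.hit_at_level[level] += 1         # a hit for all sizes > level
        self.stack.insert(0, self.stack.pop(level))

    def hit_ratio(self, cache_blocks):
        """Hit ratio of a cache holding cache_blocks blocks: sum the counters
        of the upper cache_blocks stack levels."""
        hits = sum(self.hit_at_level[k] for k in range(cache_blocks))
        return hits / self.accesses if self.accesses else 0.0
```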


5.2. Two-Level Cache Characteristics

Figure 8 illustrates the relationship among the hit ratio, the cache size, and the block size at the server. As mentioned in Section 4, if the server cache hit ratio is raised to 90% or higher, the disks will no longer be the primary bottleneck of the system. Figure 8 shows that it is possible to achieve a hit ratio over 90% if the cache size is larger than 256 MB. Since it is common today for a PC or a server to be configured with several gigabytes of memory, it is reasonable to keep most of the working set in memory to achieve a high hit ratio in this scenario.

[Figure 8. Cache hit ratio at the server, for block sizes from 1 KB to 64 KB and cache sizes from 1 MB to 1024 MB.]

The simulation results for the client cache are similar to those for the server cache. However, since client resources are usually limited, we focus on the hit ratio produced by a small cache size. As illustrated in Figure 9, the block size is the critical factor affecting the cache hit ratio, while the effect of increasing the cache size from 1 MB to 16 MB is negligible. This can be explained by the fact that the workload of the Vdisk, filtered by the file system cache of the operating system, lacks temporal locality.

[Figure 9. Cache hit ratio at the client, for block sizes from 1 KB to 64 KB and cache sizes from 1 MB to 1024 MB.]

Although a large block size is able to increase the hit ratio of the client cache, it also increases the cost of each request transfer and produces more cache pollution. Therefore, the model should be enhanced by adding the storage subsystem component to the delay centre. The new model is parameterized by the service demands and by the simulation results for the client cache hit ratio; in particular, the server cache hit rate is assumed to be 95%. The workload assumptions are the same as in the previous section.

A large cache block size achieves a higher hit ratio, as shown in Figure 9; however, it also makes the average response time longer at a large scale. This feature is distinct from the conclusion observed in local disk analysis [12], namely that the average access time decreases as the block size increases, because here the 95% server cache hit ratio dominates the access latency. In this scenario, a block size of 8 KB yields the minimal access latency at a scale of 50 clients. Compared to a large server cache, a small client cache reduces the latency by only 27% at the optimal block size. Therefore, some benefit can be obtained from the client cache, but it is quite limited.

The effect of the cache at both the server and the client is analysed, and several conclusions can be drawn from our observations. We first study whether data accesses in the client and server caches have temporal locality. Previous studies have shown that client cache accesses exhibit a high degree of temporal locality [13]; the LRU algorithm, which takes full advantage of temporal locality, is therefore mainly used in the client cache. Blocks with temporal locality tend to remain in the client cache, while the block requests that miss in the client cache go on to access the server cache. Therefore, the server cache is a critical mechanism for improving the overall performance of TransCom. The capability of the server increases slowly with the hit ratio while the hit ratio is at a low level, but it increases dramatically once the hit ratio is over 90%. Since the Vdisk image is widely shared, the working set is small enough that a reasonably large cache achieves a high hit ratio on the server. The workload of the server cache has less temporal locality, so an appropriate cache algorithm that achieves a higher hit ratio can reduce the access latency.

5.3. Frequency-Based Multi-Priority Queues (FMPQ)

In the TransCom client, the cache accesses have temporal locality, so a cache replacement algorithm based on temporal locality, such as LRU, can be used; we use LRU as the client cache management algorithm. In the TransCom server, the I/O request accesses at the server cache show the characteristic that a small set of frequently accessed blocks satisfies a high proportion of the accesses, so we designed a cache replacement algorithm based on access-frequency priority, namely Frequency-based Multi-Priority Queues (FMPQ).
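For reference, the client-side policy is plain LRU; a minimal sketch using Python's OrderedDict is shown below. A miss here falls through to the server cache managed by FMPQ, which the following paragraphs describe.

```python
from collections import OrderedDict

class ClientLRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()          # least recently used entry first

    def get(self, block_id):
        if block_id not in self.data:
            return None                    # miss: caller fetches from the server
        self.data.move_to_end(block_id)    # refresh recency on a hit
        return self.data[block_id]

    def put(self, block_id, value):
        if block_id in self.data:
            self.data.move_to_end(block_id)
        elif len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # evict the least recently used block
        self.data[block_id] = value
```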


FMPQ gives the highest priority to the blocks that are accessed most frequently. It assigns different priorities to blocks depending on their access frequencies, and retains them in the server cache for different periods according to their priorities.

FMPQ uses multiple LRU queues (Q0, Q1, ..., Qw-1) to store blocks of different priorities, where w is an adjustable parameter. If the priority of Qi is lower than that of Qj (i < j), the life cycle of the blocks in Qi is shorter than that of the blocks in Qj. FMPQ also sets up a queue Qoff, which records the access frequencies of the blocks that have recently been replaced. Qoff is a FIFO queue of limited size, which stores only the identities and access frequencies of blocks.

FMPQ uses a function QCount(g) = log2 g to place a block with access frequency g on the proper LRU queue. When the frequency of a block changes, this function raises the position of the block. For example, when a block P is hit in the server cache, P is first removed from its LRU queue; then, from the current access frequency of P, the function QCount(g) = log2 g computes a queue index d, and the block is placed at the end of the queue Qd. For instance, when the block P is accessed for the eighth time, it is upgraded from the Q2 queue to the Q3 queue. When a requested block P misses, FMPQ selects a block to evict from the server cache to make room for P. To choose the victim, FMPQ first examines the head of the Q0 queue; if Q0 is empty, FMPQ queries the queues from Q1 upward until it finds the lowest-level non-empty queue Qi, and then replaces the head of that queue. When a block R is replaced, its identity and current access frequency are inserted at the end of the history queue Qoff; if Qoff is full, the identity that has stayed in Qoff the longest is deleted. If the requested block P is found in the Qoff records, P is loaded from the hard disk into the server cache and its frequency g is set to the access frequency recorded in Qoff plus 1; if P is not in Qoff, it is loaded into the server cache and its frequency g is set to 1. Finally, according to QCount(g), P is placed into the relevant LRU queue.

In the server cache, FMPQ also sets an expiration parameter, OverTime, for each block, which is used to demote inactive blocks from a high-priority queue to a lower-priority one once they exceed their allotted access count. "Time" here refers to logical time, that is, the access count. When a block enters an LRU queue, its OverTime is set to NowTime + DurationTime, where DurationTime is an adjustable parameter that sets the survival time of a block in an LRU queue. When an access happens, FMPQ compares the OverTime of the head block of each queue with NowTime; if the OverTime is less than NowTime, the block is moved to the end of the next lower-level queue and its OverTime is reset.

Similar to the 2Q [14] algorithm, FMPQ has O(1) time complexity. Because all of the queues use LRU lists and w is usually very small, at most w-1 head blocks are checked for possible demotion when an access happens. Compared with the FBR [15] and LRU algorithms, FMPQ is highly efficient and very easy to implement.
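Putting the pieces together, the following is a compact, illustrative sketch of FMPQ as described above: w LRU queues, the history queue Qoff, QCount(g) = log2 g, and OverTime-driven demotion. The parameter values and minor details (such as scanning Qoff linearly) are our own simplifications, not the authors' implementation.

```python
import math
from collections import OrderedDict, deque

class FMPQ:
    def __init__(self, capacity, w=4, duration=256, off_size=1024):
        self.capacity, self.w, self.duration = capacity, w, duration
        self.queues = [OrderedDict() for _ in range(w)]  # block -> (freq, overtime)
        self.qoff = deque(maxlen=off_size)   # FIFO history of (block, freq)
        self.now = 0                         # logical time = access count

    def _qcount(self, g):
        return min(int(math.log2(g)), self.w - 1)        # QCount(g) = log2 g

    def access(self, block):
        self.now += 1
        self._demote_expired()
        for q in self.queues:                # hit: remove, then re-place by freq
            if block in q:
                freq, _ = q.pop(block)
                self._insert(block, freq + 1)
                return True
        # Miss: evict if full, then load with remembered or initial frequency.
        if sum(len(q) for q in self.queues) >= self.capacity:
            self._evict()
        freq = next((f for b, f in self.qoff if b == block), 0) + 1
        self._insert(block, freq)
        return False

    def _insert(self, block, freq):
        q = self.queues[self._qcount(freq)]              # end of queue Qd
        q[block] = (freq, self.now + self.duration)      # OverTime = Now + Duration

    def _evict(self):
        for q in self.queues:                # head of the lowest non-empty queue
            if q:
                block, (freq, _) = q.popitem(last=False)
                self.qoff.append((block, freq))          # oldest entry drops out
                return

    def _demote_expired(self):
        for i in range(self.w - 1, 0, -1):   # check the head block of each queue
            q = self.queues[i]
            if q:
                block, (freq, overtime) = next(iter(q.items()))
                if overtime < self.now:      # expired: demote one level, reset
                    q.pop(block)
                    self.queues[i - 1][block] = (freq, self.now + self.duration)
```

With this structure, the eighth access to a block yields QCount(8) = 3, moving it to Q3, exactly as in the worked example above.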
6. Optimization and Experiment

6.1. Evaluation of FMPQ

We evaluated the local algorithms for the two-level buffer caches using trace-driven simulations, using the analysis of the I/O request access patterns from Section 3 to simulate the FMPQ algorithm. The LRU cache replacement algorithm is used in the client's Vdisk driver; on the server side, FMPQ and three existing replacement algorithms, LRU, FBR, and 2Q, are implemented. The block size is set to 4 KB. Since the requests show significant temporal locality in the client cache, this section does not evaluate the performance of the client cache algorithm in the TransCom client, and focuses on the performance of the FMPQ algorithm in the server cache.

In the trace load tests, the performance of the LRU algorithm is not very good, even though it performs well in the client cache. No algorithm performs worse than LRU, because the long minimal temporal distance of accesses in the server cache makes recency an inaccurate predictor. The performance of the FBR algorithm is better than LRU, but it is always worse than FMPQ, and in several cases the difference is very large: although FBR considers the access frequency in order to overcome the defects of the LRU algorithm, it is difficult to adjust its parameters to combine frequency and recency properly. The performance of the 2Q algorithm is better than the other algorithms except FMPQ. By setting up a separate queue for the blocks accessed only once, 2Q keeps the frequently accessed blocks in its main queue for a long time. When the server cache size is small, however, the 2Q hit ratio is lower than that of FMPQ, because the life cycle of a block in the server cache is not long enough to retain it until its next access. To learn more about the test results, we use the temporal distance as a measure to analyse the performance of the algorithms. The analysis in Section 5 shows that accesses to the server cache mostly maintain a long temporal distance, so the performance of a server cache replacement algorithm depends on the extent to which it accommodates the survival time attribute of the blocks.


If the temporal distance of the majority of accesses is longer than S, a replacement algorithm that cannot keep most blocks for a period longer than S is unlikely to perform well.

We analyse the trace load in detail. Table 2 shows the hits and misses of the different algorithms for the two types of access when the size of the server cache is 256 MB. FMPQ achieves a significant reduction of the misses in the right column: as shown in Table 2, FMPQ has 2573 k misses in the right column, which is 33% less than LRU. Similar to the FBR algorithm, the FMPQ algorithm has some misses in the left column, but their number is very small, only about 13.2% of the whole number of misses. Overall, the performance of FMPQ is significantly better than that of the other algorithms.

Table 2. Hits and misses distribution with a 256 MB buffer cache.

Algorithm    Distance < 32 k        Distance >= 32 k
             Hits       Misses      Hits       Misses
FMPQ         1483 k     338 k       1834 k     2573 k
2Q           1763 k     0           1138 k     3256 k
FBR          1492 k     321 k       1046 k     3312 k
LRU          1793 k     0           394 k      3892 k

6.2. System Optimization

According to the discussion above, remote memory accesses are faster than the local disk, because the remote fetching, paging, and swapping of programs in TransCom are done more efficiently than on a local disk, even when tens of clients work together. The key to achieving this is to avoid server disk accesses as much as possible. Therefore, several strategies are proposed to enhance the system performance effectively, especially at a heavy load.

1) Strategy A: A large memory cache (e.g. 1 - 2 GB) should be configured at the server to ensure a high hit ratio. A small memory cache (e.g. 1 MB) is enough at the client, where the block size matters more than the cache size.

2) Strategy B: The FMPQ algorithm is used in the server cache and, together with the client cache, forms the two-level cache strategy.

3) Strategy C: A disk cache at the client is deployed to localize the accesses to the "shadow image". Given the capacities of cheap disks today, the disk cache can be considered large enough to contain all of a user's modified blocks. The shadow image localization absorbs nearly all the write accesses and part of the read accesses at the client, which greatly reduces the overhead of both the server and the network.

4) Strategy D: As shown in Section 4, the network is the primary bottleneck if the cache works effectively. We investigated the efficiency of UDP transmission on different OSes, and the results show that Linux is much more efficient than Windows. Therefore, using Linux as the platform for the TransCom server application is much better in terms of performance.

OS boot is a typical I/O-intensive procedure in TransCom, and the concurrent boot time of multiple clients is a metric frequently used to evaluate the performance of the system. In this experiment, we emulate a scale of 50 clients and compare the boot times under the four optimizations mentioned above.

In the experiment, the hardware configurations of the server and the clients are the same as in the baseline system described in Section 3; 50 clients and a server are connected by a switch in a LAN. We developed an I/O emulator, called IOEmu, which is deployed on each client. IOEmu is software that emulates the behaviour of multiple clients: an emulated client is a thread running on a workstation, which sends requests continuously as instructed by a trace file. The trace files are logs of Vdisk requests, and each log entry records the information of one request, such as the initial block number, the requested block length, and the request issue time; a trace file can therefore be considered a script of the workload. In addition, IOEmu is able to emulate more than one client on a workstation by creating multiple emulation threads.

We enable the four optimizations incrementally to observe the benefit of each one. As shown in Figure 10, each optimization reduces the boot time, and the combined effect of the four optimizations decreases the boot time by 63%. The results not only show that the proposed optimizations are effective in TransCom, but also confirm the correctness of our analysis method for this kind of system.
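A sketch of this kind of threaded trace replay is shown below. The trace record layout and the send_request callback are placeholders in the spirit of IOEmu, not its real interfaces.

```python
import threading
import time

def replay_trace(trace, send_request):
    """trace: list of (issue_time_s, block_number, block_length, is_write),
    with issue times relative to the start of the run."""
    start = time.monotonic()
    for issue_time, block, length, is_write in trace:
        delay = issue_time - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)                 # preserve the traced timing
        send_request(block, length, is_write) # placeholder for the NSAP request

def emulate_clients(traces, send_request):
    """One thread per emulated client, as IOEmu runs several per workstation."""
    threads = [threading.Thread(target=replay_trace, args=(t, send_request))
               for t in traces]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```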
[Figure 10. Comparison of boot time (s) using different optimization strategies: no optimization, A, A+B, A+B+C, and A+B+C+D.]


7. Conclusions and Future Work

TransCom is a novel pervasive computing system which allows users to download and execute heterogeneous commodity OSes and their applications on demand. This paper analysed the characteristics of its real usage workload, built a queuing model to locate the bottlenecks of the system, and studied the client and server cache access patterns in the TransCom system. Finally, we evaluated several design alternatives that are able to improve the capability of the TransCom server.

The research in this paper aims to increase the throughput and the capability of the server, so as to achieve high scalability by reducing the server demands on the bottleneck resources. Another solution to this problem is to combine the clients' caches into a cooperative cache system; a p2p protocol would have to be introduced to locate and download the required blocks from other clients' memory. Our research also shows that the workload of the Vdisk system lacks temporal locality. Multi-level buffer caches have been well studied in data centres, and several replacement algorithms for weak locality have already been proposed; all these replacement algorithms could be carefully evaluated in the transparent computing environment.

REFERENCES

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica and M. Zaharia, "A View of Cloud Computing," Communications of the ACM, Vol. 53, No. 4, 2010, pp. 50-58. doi:10.1145/1721654.1721672

[2] R. Buyya, "Market-Oriented Cloud Computing: Vision, Hype, and Reality for Delivering IT Services as Computing Utilities," 10th IEEE International Conference on High Performance Computing and Communications, Dalian, 25-27 September 2008, pp. 5-13.

[3] Y. X. Zhang, "Transparent Computing: Concept, Architecture and Example," Chinese Journal of Electronics, Vol. 32, No. 12A, 2004, pp. 169-174.

[4] Y. X. Zhang and Y.-Z. Zhou, "4VP: A Novel Meta OS Approach for Streaming Programs in Ubiquitous Computing," 21st International Conference on Advanced Information Networking and Applications (AINA 2007), Niagara Falls, 21-23 May 2007, pp. 394-403.

[5] W. Aspray, "The Stored Program Concept," IEEE Spectrum, Vol. 27, No. 9, 1990, pp. 51-57. doi:10.1109/6.58457

[6] L. Wei, Y. X. Zhang and Y.-Z. Zhou, "TransCom: A Virtual Disk Based Self-Management System," 4th International Conference on Autonomic and Trusted Computing (ATC 2007), Hong Kong, 11-13 July 2007, pp. 509-518.

[7] W. Y. Kuang, Y. X. Zhang, Y.-Z. Zhou, et al., "NSAP—A Network Storage Access Protocol for Transparent Computing," Journal of Tsinghua University (Science & Technology), Vol. 49, No. 1, 2009, pp. 106-109.

[8] B. Pfaff, T. Garfinkel and M. Rosenblum, "Virtualization Aware File Systems: Getting beyond the Limitations of Virtual Disks," Proceedings of the 3rd Symposium on Networked Systems Design and Implementation (NSDI '06), San Jose, 8-10 May 2006.

[9] S. Tang, Y. Chen and Z. Zhang, "Machine Bank: Own Your Virtual Personal Computer," Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS '07), Long Beach, 26-30 March 2007, pp. 1-10.

[10] D. T. Meyer, G. Aggarwal, B. Cully, G. Lefebvre, et al., "Parallax: Virtual Disks for Virtual Machines," EuroSys '08, Glasgow, 31 March-4 April 2008.

[11] C. I. Aven, E. G. I. Coffmann and I. A. Kogan, "Stochastic Analysis of Computer Storage," Reidel, Amsterdam, 1987.

[12] W. H. Windsor and A. J. Smith, "The Real Effect of I/O Optimizations and Disk Improvements," Technical Report UCB/CSD-03-1263, University of California, Berkeley, 2003.

[13] R. Karedla, J. S. Love and B. G. Wherry, "Caching Strategies to Improve Disk System Performance," Computer, Vol. 27, No. 3, 1994, pp. 38-46.

[14] T. Johnson and D. Shasha, "2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm," Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), Santiago, 12-15 September 1994, pp. 439-450.

[15] J. Robinson and M. Devarakonda, "Data Cache Management Using Frequency-Based Replacement," Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, Boulder, 22-25 May 1990, pp. 134-142. doi:10.1145/98457.98523
