Performance evaluation of software RAID vs. hardware RAID for Parallel Virtual File System
Abstract

Linux clusters of commodity computer systems and interconnects have become the fastest growing choice for building cost-effective high-performance parallel computing systems. The Parallel Virtual File System (PVFS) could potentially fulfill the requirements of large I/O-intensive parallel applications. It provides a high-performance parallel file system by striping file data across multiple cluster nodes, called I/O nodes. Therefore, the choice of storage devices on I/O nodes is crucial to PVFS.

In this paper, we study the impact of software RAIDs and hardware RAIDs on the performance of PVFS when they are used on I/O nodes. We first establish a baseline performance of both RAIDs in a stand-alone configuration. We then present the performance of PVFS for a workload comprising concurrent reads and writes using ROMIO MPI-IO, and for the BTIO benchmark with a noncontiguous access pattern. We found that software RAIDs have performance comparable to hardware RAIDs, except for write operations that require file synchronization.

Keywords: Performance evaluation, software RAID, hardware RAID, cluster computing, parallel I/O, Parallel Virtual File System, benchmarking

I. Introduction

The Parallel Virtual File System (PVFS) [3] from Clemson University is the most popular open source parallel file system for Linux clusters. It is a user-level, client-server implementation that utilizes TCP/IP-based socket communications and existing local file systems on cluster nodes. In order to provide high-performance access to data stored on the file system by many compute nodes, PVFS stripes file data across multiple cluster nodes, designated as I/O nodes. PVFS has been widely used as a high-performance, large parallel file system for temporary storage and as an infrastructure for parallel I/O research. In October of 2000, PVFS demonstrated an aggregate I/O throughput of 1.05 GBytes/sec with 48 I/O nodes and 112 compute nodes [11]. More impressively, the throughput was achieved by aggregating a single SCSI disk drive from each I/O node.

Two design considerations need to be addressed when deploying PVFS in a Linux cluster with a commodity interconnect (Fast Ethernet or Gigabit Ethernet). First, messages between PVFS clients and I/O nodes are passed over TCP/IP, and all PVFS file system data is stored on file systems local to the I/O nodes. The effective throughput that can be contributed by each I/O node is therefore min(Bnetwork, Bstorage), where Bnetwork is the sustained TCP/IP bandwidth of an I/O node and Bstorage is its sustained storage bandwidth. Second, although PVFS is a stable parallel file system, it is not fault-tolerant. Failure of a disk drive on an I/O node, or of an I/O node itself, will cause any PVFS access that needs that I/O node to fail as well.

Until PVFS adopts more efficient communication protocols, such as GM [9] or VIA [4], a RAID built from a small number of disk drives is a reasonable storage configuration for an I/O node. A small number of disk drives will not oversaturate the network interface, and the RAID provides disk-level fault tolerance. With a small number of drives, the RAID can be implemented either with a dedicated RAID controller, called hardware RAID, or with the host processors on the I/O node performing the parity calculation and data striping, called software RAID. Before deciding to use either hardware or software RAID for the I/O node, we need to study the performance trend of both types of RAIDs. Tables 1 and 2 show the performance of software RAID 5 and hardware RAID 5 on three generations of Dell PowerEdge servers using the Bonnie benchmark¹.

Table 1: Performance of software RAID 5.

Platform   CPU Speed   Write (MB/s)   CPU Load   Read (MB/s)   CPU Load
PE2450     866MHz      45.1           43.7%      101.5         60.2%
PE2550     1.4GHz      45.3           41.7%      103.9         39.4%
PE2650     2.0GHz      57.7           32.0%      104.0         28.0%

Table 2: Performance of hardware RAID 5.

Platform   CPU Speed   Write (MB/s)   CPU Load   Read (MB/s)   CPU Load
PE2450     866MHz      23.5           12.5%      46.4          19.9%
PE2550     1.4GHz      34.0           15.3%      54.9          14.1%
PE2650     2.0GHz      42.7           12.5%      60.7          12.2%

1. Bonnie benchmark [2] with three Seagate ST318406LC drives. Performance numbers are for sequential accesses on a 2GB ext2 file partition. To limit the impact of the file system cache, each server is equipped with only 512 MB of memory.
Both tables show performance improvement from the 4th generation server (PowerEdge 2450) to the 6th generation server (PowerEdge 2650), with the set of drives remaining fixed. Overall, software RAIDs have better sequential throughput than hardware RAIDs. The sequential read throughput of software RAIDs is reaching the plateau level of three drives in a RAID 5 configuration. However, considering CPU loads, hardware RAIDs still enjoy a better ratio of throughput/CPU load. This indicates that hardware RAIDs off-load the host CPUs for data transfer and RAID functionality.
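As an illustration, the throughput/CPU-load ratio can be computed directly from the table entries. The following minimal C sketch evaluates it for the PE2650 rows of Tables 1 and 2; the helper function and output format are illustrative only.

    #include <stdio.h>

    /* Throughput (MB/s) divided by CPU load (%): the ratio used when
       comparing software and hardware RAID efficiency in this paper. */
    static double ratio(double mbps, double cpu_load_percent)
    {
        return mbps / cpu_load_percent;
    }

    int main(void)
    {
        /* PE2650 entries taken from Tables 1 and 2. */
        printf("software RAID 5 write: %.2f\n", ratio(57.7, 32.0));  /* ~1.80 */
        printf("hardware RAID 5 write: %.2f\n", ratio(42.7, 12.5));  /* ~3.42 */
        printf("software RAID 5 read:  %.2f\n", ratio(104.0, 28.0)); /* ~3.71 */
        printf("hardware RAID 5 read:  %.2f\n", ratio(60.7, 12.2));  /* ~4.98 */
        return 0;
    }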
In this paper, we study the impact of software and hardware RAIDs on the performance of PVFS when they are used as the storage configuration on I/O nodes. We present performance results of PVFS on one of the Linux clusters at the Dell Computer Corporation's Scalable Systems Laboratory. We first establish the baseline performance of software RAID and hardware RAID on a single I/O node with the ext2 filesystem and the ext3 journaling filesystem [13]. We then present the performance of PVFS for a workload comprising concurrent reads and writes using ROMIO MPI-IO [12]. We also utilize the BTIO benchmark [10] to study PVFS performance with a noncontiguous access pattern. We found that for most of the test cases software RAIDs provide a similar level of performance as hardware RAIDs, although the former compete for the host CPU with network communications. The only exception is write operations with file synchronization; in this case, hardware RAIDs have better data transfer rates due to the cache memory incorporated on the RAID controllers.

The rest of this paper is organized as follows. The next section provides an overview of PVFS. In Section 3, we describe our experimental environment. Baseline performance of a single I/O node is presented in Section 4. In Sections 5 and 6, we present PVFS's performance and discuss the results. Section 7 concludes the paper and outlines our future studies.

Figure 1: Logical view of PVFS. [Compute nodes run applications over libpvfs, the PVFS kernel support, or ROMIO; they reach, over the network, the metadata server (mgr) on the management node and the I/O servers (iod) on the I/O nodes, each backed by local filesystems.]

II. Overview of PVFS

Like many other network file systems or parallel file systems, PVFS is implemented using a client-server architecture. It utilizes a group of collaborating user-space processes (daemons) to provide a cluster-wide consistent name space and to store data in a striped fashion across multiple nodes in the cluster. Messages between PVFS clients and servers are exchanged over TCP/IP for reliable communication. All PVFS file system data is stored on the cluster nodes' local filesystems, which can be one of the partitions on a disk drive, the entire disk drive, or a logical volume of many disk drives. Figure 1 is a logical view of PVFS. It shows how cluster nodes might be assigned for use with PVFS. They are divided into three types of nodes: compute nodes, on which applications are run; a management node, which handles metadata operations; and I/O nodes, which store file data for PVFS file systems.

A. Components of PVFS

The shaded areas in Figure 1 highlight four major components of PVFS. A single metadata server (mgr) stores metadata and controls file operations, including open, close, and remove commands. I/O servers (iod) handle all data transfers, storing and retrieving file data on each I/O node's local filesystems. These first two components are daemons running on the management and I/O nodes.

On compute nodes, the PVFS native API (a user library of I/O calls, called libpvfs) provides user-space, low-level access to the PVFS servers. This library handles the scatter/gather operations necessary to move data between user buffers and PVFS servers, keeping these operations transparent to the user. For metadata operations, applications communicate through the library with the metadata server. For data access, the metadata server is eliminated from the access path and the I/O servers are contacted directly. PVFS also provides Linux kernel support that allows PVFS file systems to be mounted in the same manner as an NFS or local filesystem on compute nodes. This allows existing programs to access PVFS files without any modification.

B. Interfaces of PVFS

As shown in Figure 1, there are three interfaces through which PVFS may be accessed.

• PVFS native API (via libpvfs): a UNIX-like interface for accessing PVFS files. It allows applications to specify how files will be striped across the I/O nodes in the PVFS system.
• Linux kernel interface: it allows a PVFS file system to be merged into a compute node's local directory hierarchy. Through this interface, existing applications or common utilities may manipulate data on PVFS file systems.
• ROMIO MPI-IO interface: ROMIO implements the MPI-2 [8] I/O calls in a portable library. This allows parallel programs using MPI to access PVFS files through the MPI-IO interface.

Both benchmark programs used in this paper are based on the ROMIO MPI-IO interface.
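Since both benchmark programs go through ROMIO, the following minimal sketch shows how an MPI program might open a file on a PVFS volume through the MPI-IO interface. The "pvfs:" file-system prefix is a ROMIO convention, and "striping_factor"/"striping_unit" are MPI-2 reserved info hints; the path, the hint values, and whether a given PVFS/ROMIO build honors the hints are assumptions made here for illustration.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        /* Optional MPI-2 reserved hints that ROMIO can pass down to PVFS;
           the values here are illustrative only. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "8");    /* stripe across 8 I/O nodes */
        MPI_Info_set(info, "striping_unit", "65536");  /* 64 KB stripe size */

        /* The "pvfs:" prefix tells ROMIO which file system driver to use;
           /mnt/pvfs/testfile is an assumed mount point and file name. */
        MPI_File_open(MPI_COMM_WORLD, "pvfs:/mnt/pvfs/testfile",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }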
III. Experimental Environment

The testing environment consists of a total of 24 rack-optimized Dell PowerEdge 1650 servers, 16 of which are used as compute nodes and eight as I/O nodes. One of the I/O nodes is also designated as the metadata server. All of the cluster nodes contain two Intel Pentium III processors running at 1.4 GHz with 512KB of L2 cache, 2GB of main memory, and two integrated Gigabit Ethernet interfaces. Each compute node has an 18 GB SCSI system disk. Each I/O node has three 18 GB Hitachi DK32DJ-18MC SCSI disks in either a hardware RAID 5 configuration (via a PowerEdge Expandable RAID Controller, PERC 3/Di) or a software RAID 5 configuration (via an embedded Adaptec AIC 7899 SCSI controller). Cluster nodes are interconnected with a Foundry Networks FastIron II Gigabit Ethernet switch.

On the software side, the benchmark programs use the ROMIO implementation of MPI-IO in MPICH [5] version 1.2.4. PVFS version 1.5.4 is built on Red Hat Linux 7.2, which is the operating system running on all cluster nodes.

IV. Baseline Performance of Hardware and Software RAID 5

We first used Bonnie [2] to establish the baseline performance of an I/O node. Bonnie is a widely used benchmark which measures the performance of Unix file system operations. It provides a good indicator of the performance characteristics of sequential operations. We tested Bonnie on one I/O node with both software RAID 5 and hardware RAID 5. To limit the impact of the file system cache, we reduced the system memory to 512MB and ran Bonnie against a 2GB file. To ensure consistent results, we also used another benchmark, the IOzone filesystem benchmark [6], for cross reference. Bonnie's measurements are summarized in Table 3.

Table 3: Baseline performance of an I/O node.

RAID 5     File System   Write (MB/s)   CPU Load   Read (MB/s)   CPU Load
Software   Ext2          50.5           45.7%      89.6          30.8%
Software   Ext3          27.7           52.9%      88.2          34.8%
Hardware   Ext2          51.6           40.5%      36.5          12.3%
Hardware   Ext3          35.3           31.5%      39.3          13.2%

A. Software RAID 5 vs. Hardware RAID 5

Table 3 shows different performance characteristics than Tables 1 and 2. Previously, software RAID 5 had better throughput than hardware RAID 5 for both write and read operations. In this case, hardware RAID has better throughput for write operations. For both the ext2 and ext3 filesystems, hardware RAID has higher write throughput than software RAID, while enjoying a lower CPU load. The difference is due to the performance characteristics of the Hitachi DK32DJ-18MC disks; the previous study of the performance trend of RAIDs used Seagate ST318406LC disks.

Similar to the previous study, hardware RAIDs also have a better ratio of throughput/CPU load. With the ext2 filesystem, the ratios for hardware RAID on write and read operations are 1.27 and 2.97, while the ratios for software RAID are 1.11 and 2.91.

B. Ext2 vs. Ext3 Filesystem

The ext3 filesystem [13] is a journaling extension to the standard ext2 filesystem on Linux. Journaling results in massively reduced time spent recovering a filesystem after a crash, and is therefore in high demand in environments where high availability is important. Essentially, ext3 is an ext2 filesystem with a journal file. The journaling capability means the user does not have to wait for a long consistency check or worry about metadata corruption after a crash.

In the baseline performance evaluation, we also study the penalty of using the ext3 filesystem. For sequential read operations, there is no significant difference between ext2 and ext3; the hardware RAID even demonstrates slightly better read throughput on ext3. On sequential write operations, there is a significant performance penalty with ext3. The software RAID's write throughput decreases by 45% from ext2 to ext3, and the hardware RAID experiences a performance reduction of 32%. The penalty comes from the journaling mechanism: every modification made to the filesystem is written to the journal log first, and only once it is committed to the log is the modification allowed to update the main copy on disk.

V. Performance Results of ROMIO perf

The ROMIO source code includes an example MPI-IO test program called perf. It performs concurrent read and write operations to the same file. In this program, each MPI process has a fixed-size data array, 4 MB by default, which is written using MPI_File_write() and read using MPI_File_read() to and from disjoint regions of the file. All MPI processes synchronize (using MPI_Barrier()) before each I/O operation.

The perf program measures four types of concurrent I/O operations (a minimal sketch of this access pattern follows the list):

• Write operations without file synchronization,
• Read operations without file synchronization,
• Write operations with file synchronization (using MPI_File_sync()), and
• Read operations after file synchronization.
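The sketch below illustrates this access pattern; it is not the actual perf source. For brevity it uses the explicit-offset calls MPI_File_write_at()/MPI_File_read_at() rather than perf's seek-based calls, and the file name, buffer handling, and error checking are simplified assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    #define ARRAY_SIZE (4 * 1024 * 1024)   /* 4 MB per process, the perf default */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf;
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = malloc(ARRAY_SIZE);
        offset = (MPI_Offset)rank * ARRAY_SIZE;   /* each process owns a disjoint region */

        MPI_File_open(MPI_COMM_WORLD, "pvfs:/mnt/pvfs/perf.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* Write phase: all processes synchronize, then write their own region. */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_write_at(fh, offset, buf, ARRAY_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);

        /* The "with file synchronization" case also times this call, which forces
           the updates to reach the storage devices before it returns. */
        MPI_File_sync(fh);

        /* Read phase: synchronize again, then read the same region back. */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_read_at(fh, offset, buf, ARRAY_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }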
These tests provide an upper bound on the MPI-IO performance that can be expected from a given set of I/O nodes and file system. In this section, we run the perf program with up to 16 compute nodes against a PVFS file system constructed from eight I/O nodes.

Figure 2 has three charts that show the aggregate bandwidth of write operations without file synchronization reported by perf, with each MPI process using array sizes (access sizes) of 4MB, 16MB, and 64MB. The charts present the aggregate bandwidth of four configurations (hardware RAID 5 with ext2 and ext3 filesystems, software RAID 5 with ext2 and ext3 filesystems) with increasing numbers of processes. They show that the peak performance increases with larger access sizes.

There is no difference in performance with the 4MB access size. As the access size increases, the charts start to show performance differences among the four configurations. Hardware RAID 5 with the ext2 filesystem has the best performance, with a peak of 766 MB/sec at the 64MB access size. Software RAID 5 with the ext3 filesystem has the worst performance, especially at the 64MB access size.
Figure 2: ROMIO perf measurement on write operations without file synchronization. [Three charts, one per access size (4MB, 16MB, 64MB), plot aggregate bandwidth in MB/sec against the number of MPI processes (1, 2, 4, 8, 16).]

Figure 3: ROMIO perf measurement on read operations without file synchronization. [Three charts, one per access size (4MB, 16MB, 64MB), plot aggregate bandwidth in MB/sec against the number of MPI processes (1, 2, 4, 8, 16).]
Figure 4: ROMIO perf measurement on write operations including file synchronization. [Three charts, one per access size (4MB, 16MB, 64MB), plot aggregate bandwidth in MB/sec against the number of MPI processes (1, 2, 4, 8, 16).]

Figure 5: ROMIO perf measurement on read operations after file synchronization. [Same chart layout as Figure 4.]
Figure 3 shows three charts of the aggregate bandwidth of read operations without file synchronization. All four configurations reach similar peak performance with 16 MPI processes; the best case has a peak of 842 MB/sec. Comparing Figure 3 to Figure 2, we also found that read operations achieve better bandwidth than write operations. Overall, we are pleased with the performance of PVFS. The aggregate bandwidth of both read and write operations indicates a high degree of utilization of the Gigabit Ethernet network infrastructure.

Figure 4 shows three charts of the aggregate bandwidth of write operations with file synchronization. The measurement includes the time required for a call to the MPI_File_sync() routine after the write operation. MPI_File_sync() forces the updates to a file to be propagated to the storage device before it returns. We found that hardware RAIDs have significantly better performance than software RAIDs. The former

The three writing methods for BTIO are Fortran direct unformatted I/O, MPI-IO using MPI_File_write_at() (the "simple" MPI-IO version), and MPI-IO using MPI_File_write_at_all() collective I/O (the "full" MPI-IO version). The access pattern in BTIO is noncontiguous in memory and in the file. We used the "full" MPI-IO version of BTIO, which utilizes collective I/O and MPI derived data types to describe the noncontiguity in memory and in the file; the ROMIO implementation also has optimizations for this type of request.

[Chart: NAS Parallel I/O Benchmarks -- BTIO Benchmark (Class A).]
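To make the "full" access pattern concrete, the following sketch shows the standard MPI-2 idiom behind such a collective write: a subarray datatype describes each process's piece of a global array, a file view maps it into the shared file, and MPI_File_write_at_all() performs the collective write. This is an illustration of the idiom only, not the BTIO code; the dimensions, decomposition, and file name are assumptions, and here the noncontiguity is only on the file side (BTIO also uses a derived datatype for the memory buffer).

    #include <mpi.h>
    #include <stdlib.h>

    #define N 64   /* illustrative global array dimension */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_File fh;
        MPI_Datatype filetype;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Global N x N x N array of doubles, split evenly along the first
           dimension; assumes nprocs divides N. */
        int gsizes[3] = { N, N, N };
        int lsizes[3] = { N / nprocs, N, N };
        int starts[3] = { rank * (N / nprocs), 0, 0 };

        /* Datatype describing where this process's block lies in the file. */
        MPI_Type_create_subarray(3, gsizes, lsizes, starts,
                                 MPI_ORDER_C, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        int count = lsizes[0] * lsizes[1] * lsizes[2];
        double *local = malloc(count * sizeof(double));  /* contents left uninitialized */

        MPI_File_open(MPI_COMM_WORLD, "pvfs:/mnt/pvfs/btio.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

        /* Collective write: ROMIO can merge the noncontiguous requests from
           all processes into fewer, larger I/O operations. */
        MPI_File_write_at_all(fh, 0, local, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        free(local);
        MPI_Finalize();
        return 0;
    }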
VIII. Acknowledgement
The authors would like to express their appreciation to colleagues in the Scalable Systems Group at Dell Computer Corporation for their support on this project. Among them, Monica Kashyap started the initial prototyping effort on PVFS, Frank E. Elizondo assisted on project coordination, and Dr. Victor Mashayekhi and Dr. Reza Rooholamini provided tremendous support from the management side. They are best friends to experimental computer scientists. We also like to