Octopus: An RDMA-enabled Distributed Persistent Memory File System
[Figure 1: Software Overhead. (a) Latency breakdown of diskGlusterFS and memGlusterFS; (b) normalized bandwidth (%) of diskGlusterFS and memGlusterFS.]

[Figure 2: Octopus Architecture. Clients (e.g., CLIENT1 issuing create("/home/a"), CLIENT2 issuing read("/home/b")) access RDMA-connected data servers. The shared NVM data areas of all servers form a Shared Persistent Memory Pool, while metadata stays in each server's private NVM. For a read, the server looks up and returns the file address, and the client then fetches the data with an RDMA READ; for a create, the server starts a transaction and collects/dispatches updates before returning the result.]

management layer on another local file system (a.k.a. stacked file system layers), they face more serious software overhead than local storage systems.

Bandwidth. We also measure the maximum bandwidth of GlusterFS to understand the software overhead in terms of bandwidth. In the evaluation, we repeatedly perform 1MB write requests to a single GlusterFS server to get the average write bandwidth of GlusterFS. Figure 1(b) shows the GlusterFS write bandwidth against the storage and network bandwidths. In diskGluster, GlusterFS achieves 93.6% of the raw disk bandwidth and 70.3% of the raw Gigabit Ethernet bandwidth. In memGluster, GlusterFS achieves only 14.7% of the raw memory bandwidth and 15.1% of the raw InfiniBand bandwidth. Existing file systems are inefficient at exploiting the high bandwidth of the new hardware.

We find that four mechanisms contribute to this inefficiency in existing distributed file systems. First, data is copied multiple times in multiple places in memory, including the user buffer, the file system page cache, and the network buffer. While this design is feasible for file systems built for slow disks and networks, it has a significant impact on system performance with high-speed hardware. Second, as networking gets faster, the server-side CPU can easily become the bottleneck when processing requests from many clients. Third, traditional RPC, which is based on the event-driven model, has relatively high notification latency when the hardware provides low-latency communication. Fourth, distributed file systems incur large consistency overhead in distributed transactions, owing to multiple network round-trips and complex processing logic.

As such, we propose to design an efficient distributed memory file system for high-speed network and memory hardware, by revisiting the internal mechanisms in both data and metadata management.

3 Octopus Design

To effectively exploit the benefits of raw hardware performance, Octopus closely couples RDMA with its file system mechanism designs. Both data and metadata mechanisms are reconsidered:

• High-Throughput Data I/O, to achieve high I/O bandwidth by reducing memory copies with a Shared Persistent Memory Pool, and to improve the throughput of small I/Os using Client-Active I/Os.

• Low-Latency Metadata Access, to provide a low-latency and scalable metadata RPC with Self-Identified RPC, and to decrease consistency overhead using the Collect-Dispatch Transaction.

3.1 Overview

Octopus is built for a cluster of servers that are equipped with non-volatile memory and RDMA-enabled networks. Octopus consists of two parts: clients and data servers. Octopus has no centralized metadata server; the metadata service is distributed across the data servers. In Octopus, files are distributed to data servers in a hash-based way, as shown in Figure 2. A file has its metadata and data blocks in the same data server, but its parent directory and its siblings may be distributed to other servers. Note that the hash-based distribution of files or data blocks is not a design focus of this paper. Hash-based distribution may complicate wear leveling in non-volatile memory, and we leave this problem for future work. Instead, in this paper we aim to discuss novel metadata and data mechanism designs that are enabled by RDMA.
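As a concrete, deliberately simplified picture of this hash-based placement, a client could choose the owning data server as sketched below. The hash function and the modulo scheme are illustrative assumptions only, since the distribution policy itself is not the paper's focus.

```c
#include <stdint.h>

/* Toy 64-bit FNV-1a hash of the file name; Octopus's actual hash is not specified here. */
static uint64_t name_hash(const char *name)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    while (*name) {
        h ^= (unsigned char)*name++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* A file's metadata and data blocks live on the same server, chosen by hashing its name. */
static int owner_server(const char *path, int num_servers)
{
    return (int)(name_hash(path) % (uint64_t)num_servers);
}
```

Under such a scheme, a file and its parent directory generally hash to different servers, which is exactly why the cross-server transactions discussed in Section 3.3.2 are needed.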
In each server, the data area is exported and shared in the whole cluster for remote direct data accesses, while the metadata area is kept private for consistency reasons. Figure 3 shows the data layout of each server, which is organized into six zones: (1) Super Block, to keep the metadata of the file system. (2) Message Pool for the metadata RPC, for temporary message storage when exchanging messages. (3) Metadata Index Zone, which uses a chained hash table to index the file or directory metadata nodes in the metadata zone. Each entry in the chained hash table contains name, i_addr, and list_ptr fields, which respectively hold the name of the file, the physical address of the file's inode, and the pointer that links the index entries of files sharing the same hash value. A file hashes its name and locates its metadata index entry to fetch its inode address. (4) Metadata Zone, to keep the file or directory metadata nodes (i.e., inodes), each of which consumes 256 bytes.
With the inode, Octopus locates the data blocks in the data zone. (5) Data Zone, to keep data blocks, including directory entry blocks and file data blocks. (6) Log Zone, for transaction log blocks to ensure file system consistency.

[Figure 3: Data layout of a data server: super block, message pool, metadata index zone (name/i_addr/list_ptr entries), metadata zone, data zone, and log zone.]

[Figure 4: Data Copies in a Remote I/O Request.]
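To make the metadata index concrete, here is a minimal C sketch of an index entry and of the name-to-inode lookup described above. The field widths, the name-length limit, and the externally supplied hash function are illustrative assumptions rather than Octopus's actual on-NVM layout.

```c
#include <stdint.h>
#include <string.h>

#define OCT_NAME_MAX 255            /* assumed maximum name length */

/* One entry of the Metadata Index Zone's chained hash table. */
struct metadata_index_entry {
    char                         name[OCT_NAME_MAX + 1];
    uint64_t                     i_addr;    /* physical address of the file's inode */
    struct metadata_index_entry *list_ptr;  /* next entry with the same hash value  */
};

struct metadata_index_zone {
    struct metadata_index_entry **buckets;  /* chained hash table */
    uint64_t                      nbuckets;
};

/* Hash the file name, walk the bucket's chain, and return the inode address (0 if absent). */
static uint64_t lookup_inode_addr(const struct metadata_index_zone *zone,
                                  const char *name,
                                  uint64_t (*hash)(const char *))
{
    struct metadata_index_entry *e = zone->buckets[hash(name) % zone->nbuckets];
    for (; e != NULL; e = e->list_ptr)
        if (strcmp(e->name, name) == 0)
            return e->i_addr;
    return 0;
}
```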
While a data server keeps metadata and data respectively in the private and the shared areas, Octopus accesses the two areas remotely in different ways. For private metadata accesses, Octopus uses optimized remote procedure calls (RPC), as in existing distributed file systems. For shared data accesses, Octopus directly reads or writes data objects remotely using RDMA primitives.

With the use of RDMA, Octopus removes duplicated memory copies between file system images and memory buffers by introducing the Shared Persistent Memory Pool (shared pool for brevity). This shared pool is formed with the exported data areas from each data server in the whole cluster (Section 3.2.1). In the current implementation, the memory pool is initialized using a static XML configuration file, which stores the pool size and the cluster information. Octopus also redesigns the read/write flows, sacrificing network round-trips to amortize server loads using Client-Active I/Os (Section 3.2.2).

For metadata mechanisms, Octopus leverages RDMA write primitives to design a low-latency and scalable RPC for metadata operations (Section 3.3.1). It also redesigns the distributed transaction to reduce consistency overhead, by collecting data from remote servers for local logging and then dispatching the updates back to the remote sides (Section 3.3.2).

3.2 High-Throughput Data I/O

Octopus introduces a shared persistent memory pool to reduce data copies for higher bandwidth, and actively performs I/Os in clients to rebalance server and network overheads for higher throughput.

3.2.1 Shared Persistent Memory Pool

In a system with extremely fast NVM and RDMA, memory copies account for a large portion of the overhead of an I/O request. In existing deployments, a distributed file system is commonly layered on top of local file systems. For a read or write request, a data object is duplicated to multiple locations in memory, such as the kernel buffer (mbuf in the TCP/IP stack), the user buffer (for storing distributed data objects as local files), the kernel page cache (for local file system caching), and the file system image in persistent memory (for file storage in a local NVM file system). As the GlusterFS example in Figure 4 shows, a remote I/O request requires the fetched data to be copied seven times, counting the copies in memory and in the NIC (network interface controller), before it is finally accessed.

Recent local persistent memory file systems (like PMFS [14] and EXT4-DAX [4]) directly access persistent memory storage without going through the kernel page cache, but this does not solve the problem for the distributed case. With the direct access of these persistent memory file systems, only the page cache is bypassed, and a distributed file system still requires data to be copied six times.

Octopus introduces the shared persistent memory pool by exporting the data area of the file system image in each server for sharing. The shared pool design not only removes the stacked file system design, but also enables direct remote access to file system images without any caching. Octopus directly manages the data distribution and layout of each server, and does not rely on a local file system. Direct data management without stacking file systems is also adopted in Crail [9], a recent RDMA-aware distributed file system built from scratch. Compared to stacked file system designs like GlusterFS, data copies in Octopus and Crail do not need to go through the user-space buffer on the server side, as shown in Figure 4.
Octopus also provides a global view of the data layout with the shared pool enabled by RDMA. In each data server in Octopus, the data area in non-volatile memory is registered with ibv_reg_mr when the server joins, which allows remote direct access to the file system image. Hence, Octopus removes the use of a message pool or an mbuf on the server side, which would otherwise be needed to prepare file system data for network transfers. As such, Octopus requires data to be copied only four times for a remote I/O request, as shown in Figure 4. By reducing memory copies in non-volatile memory, data I/O performance is significantly improved, especially for large I/Os, which incur fewer metadata operations.
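As an illustration of this registration step, a data server might export its NVM data area roughly as follows. The helper name, the chosen access flags (including remote atomics, assumed here for the RDMA-based locks discussed later), and the error handling are our assumptions, not Octopus's actual code.

```c
#include <stddef.h>
#include <stdio.h>
#include <infiniband/verbs.h>

/* Register the NVM data area so that clients can access the file system
 * image directly with one-sided RDMA verbs (a sketch, not Octopus code). */
struct ibv_mr *export_data_area(struct ibv_pd *pd, void *data_area, size_t len)
{
    int access = IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_READ |
                 IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_REMOTE_ATOMIC;   /* assumed: for RDMA compare-and-swap locks */

    struct ibv_mr *mr = ibv_reg_mr(pd, data_area, len, access);
    if (mr == NULL) {
        perror("ibv_reg_mr");
        return NULL;
    }
    /* Remote peers need the region's base address and mr->rkey to address it. */
    printf("exported %zu bytes of shared NVM, rkey=0x%x\n", len, mr->rkey);
    return mr;
}
```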
3.2.2 Client-Active Data I/O

For data I/O, it is common to complete a request within one network round-trip. Figure 5(a) shows a read example: the client issues a read request to the server, and the server prepares the data and sends it back to the client.
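As a rough sketch of the client-active idea named above (the client obtains the file's location through the metadata RPC, then moves the data itself with a one-sided RDMA READ, keeping the server CPU off the data path, as in the read flow of Figure 2), consider the following. Every type and helper name here is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helpers: placeholders, not Octopus APIs. */
struct file_loc {
    uint32_t server_id;     /* owning data server                          */
    uint64_t remote_addr;   /* address of the data in its shared NVM area  */
    uint32_t rkey;          /* RDMA key of the exported region             */
    size_t   size;          /* file (or extent) size                       */
};
int metadata_rpc_lookup(const char *path, struct file_loc *loc);
int rdma_read(uint32_t server_id, void *local_buf,
              uint64_t remote_addr, uint32_t rkey, size_t len);

/* Client-active read: the server only resolves the file address; the client
 * then pulls the data itself with a one-sided RDMA READ. */
ssize_t client_active_read(const char *path, void *buf, size_t len)
{
    struct file_loc loc;
    if (metadata_rpc_lookup(path, &loc) != 0)
        return -1;
    size_t n = len < loc.size ? len : loc.size;
    if (rdma_read(loc.server_id, buf, loc.remote_addr, loc.rkey, n) != 0)
        return -1;
    return (ssize_t)n;
}
```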
[Figure 6: Distributed Transaction. (a) Traditional 2PC Approach; (b) Collect-Dispatch Approach.]

processing, the server uses an RDMA write to return the result directly to the address given by offset in the client identified by node_id. Compared to buffer scanning, this immediate notification dramatically lowers the CPU overhead when there are many client requests. As such, the self-identified metadata RPC provides lower latency and better scalability than the send/recv and read/write approaches.
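To illustrate, a self-identified request might carry the sender's node_id and the offset of its reply slot, and the server might push the reply back with an RDMA write, roughly as sketched below. The message layout is an assumption, and using RDMA write-with-immediate for the notification is one plausible realization rather than the confirmed Octopus mechanism.

```c
#include <stdint.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>

/* Assumed request header: the client identifies itself and names the offset
 * of its reply slot, so the server can push the answer straight back. */
struct rpc_request {
    uint16_t node_id;    /* which client sent the request                    */
    uint64_t offset;     /* offset of the reply slot in that client's buffer */
    uint8_t  opcode;     /* e.g., mknod, mkdir, getattr, ...                 */
    char     payload[];  /* operation arguments (path name, etc.)            */
};

/* Server side: RDMA-write the reply into the requesting client's buffer. */
static int send_reply(struct ibv_qp *qp, uint64_t client_buf_base, uint32_t client_rkey,
                      const struct rpc_request *req,
                      void *reply, uint32_t reply_len, uint32_t reply_lkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)reply,
        .length = reply_len,
        .lkey   = reply_lkey,
    };
    struct ibv_send_wr wr = {0}, *bad;
    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;     /* immediate-style notification    */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.imm_data            = htonl(req->node_id);
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = client_buf_base + req->offset;  /* reply lands at the named offset */
    wr.wr.rdma.rkey        = client_rkey;
    return ibv_post_send(qp, &wr, &bad);
}
```

Because the reply is written straight into the requesting client's slot, neither side has to scan shared buffers to pair requests with responses.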
3.3.2 Collect-Dispatch Transaction

A single file system operation, like mkdir, mknod, rmnod, or rmdir in Octopus, performs updates on multiple servers. Distributed transactions are needed to provide concurrency control for simultaneous requests and crash consistency for the atomicity of updates across servers. The two-phase commit (2PC) protocol is usually used to ensure consistency. However, 2PC incurs high overhead due to its distributed logging and its coordination of both locks and log persistence. As shown in Figure 6(a), both locking and logging are required in the coordinator and the participants, and complex network round-trips are needed to negotiate the log persistence ordering.

Octopus designs a new distributed transaction protocol named Collect-Dispatch Transaction, leveraging RDMA primitives. The key idea lies in two aspects, respectively in crash consistency and concurrency control. One is local logging with remote in-place update for crash consistency. As shown in Figure 6(b), in the collect phase, Octopus collects the read and write sets from the participants, and performs local transaction execution and local logging in the coordinator. Since the participants do not need to keep logs, there is no need for complex negotiation over log persistence between the coordinator and the participants, thereby reducing protocol overheads. For the dispatch phase, the coordinator spreads the updated write sets to the participants with RDMA writes, which update the participants' data in place without requiring any logging on their side.

The other is a combination of GCC and RDMA locking for concurrency control, which is the same as the lock design for the data I/Os in Section 3.2.2. In collect-dispatch transactions, locks are added locally using the GCC compare-and-swap command in both the coordinator and the participants. For the unlock operations, the coordinator releases the local lock using the GCC compare-and-swap command, but releases the remote lock in each participant using the RDMA compare-and-swap command. The RDMA unlock operations do not involve the CPU processing of the participants, and thus simplify the unlock phase.
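Putting the two ideas together, the coordinator side of a collect-dispatch transaction could be sketched as below. The helper names, the single-participant simplification, and the lock encoding (0 = free, 1 = held) are assumptions made for illustration; this is not the actual Octopus implementation.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical internals: placeholders, not Octopus APIs. */
struct write_set;
int  rpc_collect(int participant, struct write_set *ws);        /* the single RPC (collect)     */
void execute_locally(struct write_set *ws);                     /* run the transaction          */
void log_locally_and_flush(const struct write_set *ws);         /* clflush-persisted local log  */
void write_data_locally(const struct write_set *ws);            /* in-place update, coordinator */
int  rdma_write(int participant, const struct write_set *ws);   /* one-sided WRITE (dispatch)   */
int  rdma_cas(int participant, uint64_t lock_addr,
              uint64_t expect, uint64_t newval);                 /* one-sided atomic (unlock)    */

static volatile uint64_t local_lock;   /* simplified to a single lock word */

bool collect_dispatch(int participant, uint64_t remote_lock_addr, struct write_set *ws)
{
    /* Local locking with the GCC compare-and-swap primitive; the participant
     * takes its own local lock while serving the collect RPC. */
    while (!__sync_bool_compare_and_swap(&local_lock, 0, 1))
        ;

    /* Collect phase: one RPC gathers the participant's read/write set;
     * execution and the only log record happen in the coordinator. */
    if (rpc_collect(participant, ws) != 0) {
        __sync_bool_compare_and_swap(&local_lock, 1, 0);
        return false;                         /* abort: nothing has been dispatched yet */
    }
    execute_locally(ws);
    log_locally_and_flush(ws);

    /* Dispatch phase: one RDMA write pushes the updated write set in place,
     * and one RDMA compare-and-swap releases the participant's lock without
     * involving its CPU; finally the local lock is released. */
    write_data_locally(ws);
    rdma_write(participant, ws);
    rdma_cas(participant, remote_lock_addr, 1, 0);
    __sync_bool_compare_and_swap(&local_lock, 1, 0);
    return true;
}
```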
As a whole, collect-dispatch requires one RPC, one RDMA write, and one RDMA atomic operation, while 2PC requires two RPCs. Collect-dispatch still has lower overhead, because (1) an RPC has higher latency than an RDMA write/atomic primitive, and (2) an RDMA write/atomic primitive does not involve CPU processing on the remote side. Thus, we conclude that collect-dispatch is efficient, as it not only removes complex negotiations over log persistence ordering across servers, but also reduces costly RPC and CPU processing overheads.

Consistency Discussions. In persistent memory systems, data in the CPU cache needs to be flushed to memory in a timely and ordered manner to provide crash consistency [11, 26, 33, 25, 14, 32]. In Octopus, metadata consistency is guaranteed by the collect-dispatch transaction, which uses clflush to flush data from the CPU cache to memory to force persistence of the log. While the collect-dispatch transaction could also be used to provide data consistency, data I/Os are not wrapped in transactions in the current Octopus implementation, for efficiency. We expect that RDMA will gain more efficient remote flush operations that could benefit data consistency, such as novel I/O flows like RDMA read for remote durability [12], newly proposed commands like RDMA commit [39], or new designs that leverage availability for crash consistency [45]. We leave efficient data consistency for future work.
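For reference, a clflush-based persistence helper of the kind alluded to above typically looks like the minimal sketch below; the cache-line size and the fence choice are assumptions, and the ordering of log header and commit record updates is omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

#define CACHELINE 64     /* assumed cache-line size */

/* Flush [addr, addr+len) from the CPU cache to (non-volatile) memory and
 * fence, so the flushed log record is durable before the transaction proceeds. */
static void persist(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHELINE)
        _mm_clflush((const void *)p);
    _mm_sfence();
}
```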
4 Evaluation

In this section, we evaluate Octopus's overall data and metadata performance, then the benefits of each mechanism design, and finally its performance for big data applications.

4.1 Experimental Setup

Evaluation Platform. In the evaluation, we run Octopus on servers with large memory. Each server is equipped with Intel Xeon E5-2680 v3 processors, and each processor has 24 cores.

[Figure 8: Metadata Throughput (ops/s x1000) versus number of clients for GlusterFS, NVFS, Crail, Crail-Poll, and Octopus: (a) Mknod; (b) Mkdir; (c) Readdir; (d) Getattr; (e) Rmnod; (f) Rmdir.]
(2) Octopus achieves much better scalability than the other evaluated file systems. NVFS and Crail are designed with a single metadata server, and achieve nearly constant throughput as the number of clients increases.

[Figure 9: Data I/O Throughput (Multiple Clients).]

[Figure: Server-Active versus Client-Active bandwidth (MB/s) for request sizes from 1KB to 1MB; (a) Write, (b) Read.]

4.3.1 Effects of Reducing Data Copies

Octopus improves data transfer bandwidth by reducing memory copies. To verify the effect of reducing data copies, we implement a version of Octopus that adds an extra copy at the client side, and we refer to it as Oc-

[Figure 13: Raw RPC Performance. (a) Latency (us) versus message size (16B to 1KB) for Send/Recv, Crail, and Self-Identified; (b) comparison across Write-20Cli, Write-100Cli, Send/Recv, Crail, and Self-Identified.]

[Figure 15: Collect-Dispatch Transaction Performance. (a) Latency (us) and (b) throughput (Kops/s) of collect-dispatch and 2PC-based transactions for Mkdir, Mknod, Rmnod, and Rmdir.]
and readdir), which are close to the InfiniBand network latency in most cases. With the self-identified metadata RPC, Octopus can support low-latency metadata operations even without a client cache. Crail uses DaRPC for inter-server communication; however, Crail's metadata latencies (e.g., for mkdir and mknod) are much higher than raw DaRPC's latency. This is possibly because Crail is implemented on the inefficient HDFS framework, or because it registers memory temporarily for message communication, which is time-consuming. NVFS and memGluster suffer from a similar problem of heavyweight file system designs as Crail, and thus have relatively higher latencies.

[Figure 14: Metadata Latency (us) of GlusterFS, NVFS, Crail, Crail-Poll, and Octopus for Mkdir, Mknod, Readdir, Getattr, Rmnod, and Rmdir.]

4.4.2 Effects of Collect-Dispatch Transaction

To evaluate the effects of the collect-dispatch transaction in Octopus, we also implement a transaction system based on 2PC for comparison. Figure 15(a) exhibits the latencies of the two transaction mechanisms. Collect-dispatch reduces latency by up to 37%. This is because 2PC involves two RPCs to exchange messages between the coordinator and the participants, while collect-dispatch needs only one RPC and two one-sided RDMA commands to finish the transaction. Although the number of messages increases, the total latency drops: the RPC protocol needs the involvement of both the local and the remote nodes, and a lot of side work (e.g., hash computing and message discovery) has to be processed at that time, so the RPC latency (around 5us) is much higher than that of one-sided RDMA primitives (less than 1us). From Figure 15(b) we can see that transactions based on collect-dispatch improve throughput by up to 79%. On one hand, collect-dispatch only writes logs locally, significantly reducing logging overhead. On the other hand, collect-dispatch decreases the total number of RPCs when processing transactions, which reduces the involvement of remote CPUs and thereby improves performance.

4.5 Evaluation using Big Data Applications

In addition, we compare Octopus with the distributed file systems that are used in big data frameworks. We configure Hadoop with different distributed file systems: memHDFS, Alluxio, NVFS, Crail, and Octopus. In this section, we compare both read/write bandwidth and application performance.

Read/Write Bandwidth. Figure 16(a) compares the read/write bandwidths of the above-mentioned file systems using TestDFSIO, with the read/write size set to 256KB. Octopus and Crail show much higher bandwidth than the traditional file systems. Octopus achieves 2689MB/s and 2499MB/s for write and read operations respectively, and Crail achieves 2424MB/s and 2215MB/s respectively. Note that they have lower bandwidths than the results in fio; the reason is that we connect Octopus/Crail to the Hadoop plugin using JNI (Java Native Interface), which restricts the bandwidth. In contrast, memHDFS, Alluxio, and NVFS show lower bandwidth than Octopus and Crail. memHDFS has the lowest bandwidth, owing to the heavy HDFS software design targeted at hard disks and traditional Ethernet. Alluxio and NVFS are optimized to run on DRAM, and thus provide higher bandwidth than memHDFS, but they are still slower than Octopus. Thus, we conclude that the general-purpose Octopus can also be integrated into existing big data frameworks and provide better performance than existing file systems.

[Figure 16: Big Data Evaluation. (a) Write/Read bandwidth; (b) Teragen and Wordcount execution time (s) for memHDFS, Alluxio, NVFS, Crail, and Octopus.]

Big Data Application Performance. Figure 16(b) shows the application performance for the different file systems. Octopus consumes the least time to finish all evaluated applications. Among all the evaluated file systems, memHDFS generally has the highest run time, i.e., 11.7s for Teragen and 82s for Wordcount. For the Teragen workload, the run times of Alluxio, NVFS, Crail, and Octopus are 11.0s, 10.0s, 11.4s, and 8.8s, respectively.