
Octopus: An RDMA-enabled Distributed Persistent Memory File System

Youyou Lu, Jiwu Shu, and Youmin Chen, Tsinghua University; Tao Li, University of Florida
https://www.usenix.org/conference/atc17/technical-sessions/presentation/lu

This paper is included in the Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17), July 12–14, 2017, Santa Clara, CA, USA. ISBN 978-1-931971-38-6.

Open access to the Proceedings of the 2017 USENIX Annual Technical Conference is sponsored by USENIX.
Octopus: an RDMA-enabled Distributed Persistent Memory File System

Youyou Lu, Jiwu Shu*, and Youmin Chen, Tsinghua University
Tao Li, University of Florida

* Jiwu Shu is the corresponding author.

Abstract

Non-volatile memory (NVM) and remote direct memory access (RDMA) provide extremely high performance in storage and network hardware. However, existing distributed file systems strictly isolate the file system and network layers, and the heavy layered software designs leave high-speed hardware under-exploited. In this paper, we propose an RDMA-enabled distributed persistent memory file system, Octopus, that redesigns the file system's internal mechanisms by closely coupling NVM and RDMA features. For data operations, Octopus directly accesses a shared persistent memory pool to reduce memory-copying overhead, and actively fetches and pushes data in clients to rebalance the load between the server and the network. For metadata operations, Octopus introduces a self-identified RPC for immediate notification between file systems and networking, and an efficient distributed transaction mechanism for consistency. Evaluations show that Octopus achieves nearly the raw bandwidth for large I/Os and orders of magnitude better performance than existing distributed file systems.

1 Introduction

The in-memory storage and computing paradigm emerges as both the HPC and big data communities are demanding extremely high performance in data storage and processing. Recent in-memory storage systems, including both database systems (e.g., SAP HANA [8]) and file systems (e.g., Alluxio [23]), have been used to achieve high data processing performance. With the emerging non-volatile memory (NVM) technologies, such as phase change memory (PCM) [34, 21, 46], resistive RAM (ReRAM), and 3D XPoint [7], data can be stored persistently at the main memory level, i.e., in persistent memory. New local file systems, including BPFS [11], SCMFS [42], PMFS [14], and HiNFS [32], have been built recently to exploit the byte-addressability or persistence advantages of non-volatile memories. Their promising results have shown the potential of NVMs for high performance in both data storage and processing.

Meanwhile, the remote direct memory access (RDMA) technology brings extremely low latency and high bandwidth to networking. We have measured an average latency and bandwidth of 0.9us and 6.35GB/s with a 56 Gbps InfiniBand switch, compared to 75us and 118MB/s with Gigabit Ethernet (GigaE). RDMA has greatly improved data center communications and RPCs in recent studies [13, 37, 19, 20].

Distributed file systems are trying to support RDMA networks for high performance, but mostly by substituting the communication module with an RDMA library. CephFS supports RDMA by using Accelio [2], an RDMA-based asynchronous RPC middleware. GlusterFS implements its own RDMA library for data communication [1]. NVFS [16] is an HDFS variant that is optimized with NVM and RDMA. And Crail [9], a recent distributed file system from IBM, is built on the RDMA-optimized RPC library DaRPC [37]. However, these file systems strictly isolate the file system and network layers, only replacing their data management and communication modules without refactoring the internal file system mechanisms. This layered and heavy software design prevents file systems from exploiting the hardware benefits. As we observed, GlusterFS has a software latency that accounts for nearly 100% of the total on NVM and RDMA, while it is only 2% on disk. Similarly, it achieves only 15% of the raw InfiniBand bandwidth, compared to 70% of the GigaE bandwidth. In conclusion, the strict isolation between the file system and network layers makes distributed file systems too heavy to exploit the benefits of emerging high-speed hardware.



In this paper, we revisit both the data and metadata mechanism designs of the distributed file system by taking NVM and RDMA features into consideration. We propose an efficient distributed persistent memory file system, Octopus (so named because the file system performs remote direct memory access just as an octopus uses its eight legs), to effectively exploit the benefits of high-speed hardware. Octopus avoids the strict isolation of the file system and network layers, and redesigns the file system's internal mechanisms by closely coupling them with NVM and RDMA features. For data management, Octopus directly accesses a shared persistent memory pool by exporting NVM to a global space, avoiding stacking a distributed file system layer on local file systems, to eliminate redundant memory copies. It also rebalances the server and network loads, and revises the data I/O flows to offload load from servers to clients in a client-active way for higher throughput. For metadata management, Octopus introduces a self-identified RPC which carries the sender's identifier with the RDMA write primitive for low-latency notification. In addition, it proposes a new distributed transaction mechanism by incorporating RDMA write and atomic primitives. As such, Octopus efficiently incorporates RDMA into file system designs that effectively exploit the hardware benefits.

Our major contributions are summarized as follows.

• We propose novel I/O flows based on RDMA for Octopus, which directly accesses a shared persistent memory pool without stacked file system layers, and actively fetches or pushes data in clients to rebalance server and network loads.
• We redesign metadata mechanisms leveraging RDMA primitives, including a self-identified metadata RPC for low-latency notification, and a collect-dispatch distributed transaction for low-overhead consistency.
• We implement and evaluate Octopus. Experimental results show that Octopus effectively exploits the raw hardware performance, and significantly outperforms existing RDMA-optimized distributed file systems.

2 Background and Motivation

2.1 Non-volatile Memory and RDMA

Non-Volatile Memory. Byte-addressable non-volatile memory (NVM) technologies, including PCM [34, 21, 46], ReRAM, and Memristor [36], have been intensively studied in recent years. Intel and Micron have announced the 3D XPoint technology, which is expected to be in production in the near future [7]. These NVMs have access latencies close to that of DRAM, while providing data persistence as hard disks do. In addition, NVMs are expected to have better scalability than DRAM [34, 21]. Therefore, NVMs are promising candidates for storing data persistently at the main memory level.

Remote Direct Memory Access. Remote Direct Memory Access (RDMA) enables low-latency network access by directly accessing memory in remote servers. It bypasses the operating system and supports zero-copy networking, and thus achieves high bandwidth and low latency in network accesses. There are two kinds of commands in RDMA for remote memory access:

(1) Message semantics, with the typical RDMA send and recv verbs for message passing, are similar to socket programming. Before an RDMA send request is issued at the client side, an RDMA recv needs to be posted at the server side with an attached address indicating where to store the incoming message.

(2) Memory semantics, with the typical RDMA read and write verbs, use a new (i.e., one-sided) data communication model in RDMA. In memory semantics, the memory address in the remote server where the message will be stored is assigned at the sender side. This removes the CPU involvement of remote servers. The memory semantics provide relatively higher bandwidth and lower latency than the message semantics.

In addition, RDMA provides other verbs, including atomic verbs like compare_and_swap and fetch_and_add that enable atomic memory accesses to remote servers.
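To make the two verb classes concrete, the C sketch below shows how each is posted with the libibverbs API. It is only an illustrative sketch: it assumes a reliable-connection queue pair qp that has already been created and connected, memory regions that have already been registered, and remote_addr/rkey values learned out of band; none of this is Octopus code.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Message semantics: the responder must post a receive before the
     * requester's send arrives. */
    static int post_recv(struct ibv_qp *qp, void *buf, uint32_t len, uint32_t lkey)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
        struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad;
        return ibv_post_recv(qp, &wr, &bad);
    }

    static int post_send(struct ibv_qp *qp, void *buf, uint32_t len, uint32_t lkey)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
        struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                                  .opcode = IBV_WR_SEND,
                                  .send_flags = IBV_SEND_SIGNALED };
        struct ibv_send_wr *bad;
        return ibv_post_send(qp, &wr, &bad);
    }

    /* Memory semantics: a one-sided READ names the remote address and rkey
     * directly, so the remote CPU is not involved at all. */
    static int post_rdma_read(struct ibv_qp *qp, void *local, uint32_t len,
                              uint32_t lkey, uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)local, .length = len, .lkey = lkey };
        struct ibv_send_wr wr = { .wr_id = 3, .sg_list = &sge, .num_sge = 1,
                                  .opcode = IBV_WR_RDMA_READ,
                                  .send_flags = IBV_SEND_SIGNALED };
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey = rkey;
        struct ibv_send_wr *bad;
        return ibv_post_send(qp, &wr, &bad);
    }

The asymmetry visible here (a READ carries the remote address itself, while a SEND relies on a pre-posted receive) is exactly what the rest of the paper exploits: one-sided verbs for bulk data, and notification-carrying verbs for metadata RPCs.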
2.2 Software Challenges on Emerging High-Speed Hardware

In a storage system equipped with NVMs and an RDMA-enabled network, the hardware provides dramatically higher performance than traditional media like hard disks and Gigabit Ethernet. Comparatively, the overheads of the software layer, which are negligible compared to slow disks and Ethernet, now account for a significant part of the whole system.

Latency. To understand the latency overhead of existing distributed file systems, we perform synchronous 1KB write operations on GlusterFS, and collect latencies respectively in the storage, network, and software parts. The latencies are averaged over 100 synchronous writes. Figure 1(a) shows the latency breakdown of GlusterFS on disk (denoted as diskGluster) and on memory (denoted as memGluster). To improve the efficiency of GlusterFS on memory, we run memGluster on EXT4-DAX [4], which is optimized for NVM by bypassing the page cache and reducing memory copies. In diskGluster, the storage latency consumes the largest part, nearly 98% of the total latency. In memGluster, the storage latency percentage drops dramatically to nearly zero. In comparison, the file system software latency becomes the dominant part, almost 100%. Similar trends have also been observed in previous studies of local storage systems [38].



[Figure 1: Software Overhead. (a) Latency breakdown (storage, network, software) and (b) normalized bandwidth for diskGluster and memGluster.]

While most distributed file systems stack the distributed data management layer on another local file system (a.k.a. stacked file system layers), they face more serious software overhead than local storage systems.

Bandwidth. We also measure the maximum bandwidth of GlusterFS to understand the software overhead in terms of bandwidth. In the evaluation, we perform 1MB write requests to a single GlusterFS server repeatedly to get the average write bandwidth of GlusterFS. Figure 1(b) shows the GlusterFS write bandwidth against the storage and network bandwidths. In diskGluster, GlusterFS achieves a bandwidth that is 93.6% of the raw disk bandwidth and 70.3% of the raw Gigabit Ethernet bandwidth. In memGluster, GlusterFS's bandwidth is only 14.7% of the raw memory bandwidth and 15.1% of the raw InfiniBand bandwidth. Existing file systems are inefficient in exploiting the high bandwidth of the new hardware.

We find that four mechanisms contribute to this inefficiency in existing distributed file systems. First, data are copied multiple times in multiple places in memory, including the user buffer, the file system page cache, and the network buffer. While this design is feasible for file systems that are built for slow disks and networks, it has a significant impact on system performance with high-speed hardware. Second, as networking gets faster, the CPU at the server side can easily become the bottleneck when processing requests from a lot of clients. Third, traditional RPC, which is based on the event-driven model, has relatively high notification latency when the hardware provides low-latency communication. Fourth, distributed file systems have huge consistency overhead in distributed transactions, owing to multiple network round-trips and complex processing logic.

As such, we propose to design an efficient distributed memory file system for high-speed network and memory hardware, by revisiting the internal mechanisms in both data and metadata management.

3 Octopus Design

To effectively explore the benefits of raw hardware performance, Octopus closely couples RDMA with its file system mechanism designs. Both data and metadata mechanisms are reconsidered:

• High-Throughput Data I/O, to achieve high I/O bandwidth by reducing memory copies with a Shared Persistent Memory Pool, and to improve the throughput of small I/Os using Client-Active I/Os.
• Low-Latency Metadata Access, to provide a low-latency and scalable metadata RPC with the Self-Identified RPC, and to decrease consistency overhead using the Collect-Dispatch Transaction.

[Figure 2: Octopus Architecture. Clients and data servers; each server keeps private NVM for metadata and contributes shared NVM to a cluster-wide shared persistent memory pool accessed over RDMA.]

3.1 Overview

Octopus is built for a cluster of servers that are equipped with non-volatile memory and RDMA-enabled networks. Octopus consists of two parts: clients and data servers. Octopus has no centralized metadata server, and the metadata service is distributed across the data servers. In Octopus, files are distributed to data servers in a hash-based way, as shown in Figure 2. A file has its metadata and data blocks in the same data server, but its parent directory and its siblings may be distributed to other servers. Note that the hash-based distribution of files or data blocks is not a design focus of this paper. Hash-based distribution may lead to difficulties with wear leveling in non-volatile memory, and we leave this problem for future work. Instead, we aim to discuss novel metadata and data mechanism designs that are enabled by RDMA.

In each server, the data area is exported and shared across the whole cluster for remote direct data accesses, while the metadata area is kept private for consistency reasons. Figure 3 shows the data layout of each server, which is organized into six zones: (1) the Super Block, to keep the metadata of the file system; (2) the Message Pool, used by the metadata RPC for temporary message storage when exchanging messages; (3) the Metadata Index Zone, which uses a chained hash table to index the file or directory metadata nodes in the metadata zone. Each entry in the chained hash table contains name, i_addr, and list_ptr fields, which respectively represent the name of the file, the physical address of the file's inode, and the pointer linking the metadata index entries of files that have the same hash value. A file hashes its name and locates its metadata index entry to fetch its inode address. (4) The Metadata Zone, to keep the file or directory metadata nodes (i.e., inodes), each of which consumes 256 bytes. With the inode, Octopus locates the data blocks in the data zone. (5) The Data Zone, to keep data blocks, including directory entry blocks and file data blocks. (6) The Log Zone, for transaction log blocks to ensure file system consistency.
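The sketch below illustrates the two hashing steps just described: a path hashed to pick the owning data server, and a name hashed into the chained metadata index to fetch the inode address. The field widths, bucket count, and hash function are assumptions for illustration, not the actual Octopus on-NVM format.

    #include <stdint.h>
    #include <string.h>

    #define OCT_NAME_MAX  48        /* assumed name length, for illustration */
    #define OCT_BUCKETS   (1 << 20) /* assumed bucket count */

    /* One entry of the chained hash table in the metadata index zone. */
    struct oct_index_entry {
        char     name[OCT_NAME_MAX];
        uint64_t i_addr;    /* physical address of the inode in the metadata zone */
        uint64_t list_ptr;  /* next entry with the same hash value (0 = end) */
    };

    /* Simple FNV-1a hash; the real system may use a different function. */
    static uint64_t oct_hash(const char *s)
    {
        uint64_t h = 1469598103934665603ULL;
        while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
        return h;
    }

    /* Step 1: the full path selects the data server that owns the file. */
    static int oct_server_of(const char *path, int num_servers)
    {
        return (int)(oct_hash(path) % (uint64_t)num_servers);
    }

    /* Step 2: inside that server, hash the name into the index zone and walk
     * the chain to fetch the inode address. nvm_base is the start of the
     * metadata area; buckets holds per-bucket entry offsets. */
    static uint64_t oct_lookup_inode(uint8_t *nvm_base, const uint64_t *buckets,
                                     const char *name)
    {
        uint64_t off = buckets[oct_hash(name) % OCT_BUCKETS];
        while (off != 0) {
            struct oct_index_entry *e = (struct oct_index_entry *)(nvm_base + off);
            if (strcmp(e->name, name) == 0)
                return e->i_addr;      /* inode address in the metadata zone */
            off = e->list_ptr;         /* chain of same-hash entries */
        }
        return 0;                      /* not found */
    }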



[Figure 3: Data Layout in an Octopus Node. Super block, message pool, metadata index zone (chained hash buckets with name, i_addr, list_ptr), metadata zone, data zone, and log zone; the metadata areas reside in private NVM, the data zone in shared NVM.]

[Figure 4: Data Copies in a Remote I/O Request. User-space buffers, message pool/mbuf, page cache, and file system image copies in GlusterFS, Crail, and Octopus.]

While a data server keeps metadata and data respectively in the private and shared areas, Octopus accesses the two areas remotely in different ways. For the private metadata accesses, Octopus uses optimized remote procedure calls (RPC) as in existing distributed file systems. For the shared data accesses, Octopus directly reads or writes data objects remotely using RDMA primitives. With the use of RDMA, Octopus removes duplicated memory copies between file system images and memory buffers by introducing the Shared Persistent Memory Pool (shared pool for brevity). This shared pool is formed from the exported data areas of each data server in the whole cluster (Section 3.2.1). In the current implementation, the memory pool is initialized using a static XML configuration file, which stores the pool size and the cluster information. Octopus also redesigns the read/write flows, sacrificing network round-trips to amortize server loads using Client-Active I/Os (Section 3.2.2).

For metadata mechanisms, Octopus leverages RDMA write primitives to design a low-latency and scalable RPC for metadata operations (Section 3.3.1). It also redesigns the distributed transaction to reduce the consistency overhead, by collecting data from remote servers for local logging and then dispatching it to the remote sides (Section 3.3.2).

3.2 High-Throughput Data I/O

Octopus introduces a shared persistent memory pool to reduce data copies for higher bandwidth, and actively performs I/Os in clients to rebalance server and network overheads for higher throughput.

3.2.1 Shared Persistent Memory Pool

In a system with extremely fast NVM and RDMA, memory copies account for a large portion of the overhead in an I/O request. In existing distributed file systems, the distributed file system is commonly layered on top of local file systems. For a read or write request, a data object is duplicated to multiple locations in memory, such as the kernel buffer (mbuf in the TCP/IP stack), the user buffer (for storing distributed data objects as local files), the kernel page cache (for local file system caching), and the file system image in persistent memory (for file storage in a local file system in NVM). As the GlusterFS example in Figure 4 shows, a remote I/O request requires the fetched data to be copied seven times, including in memory and the NIC (network interface controller), for final access.

Recent local persistent memory file systems (like PMFS [14] and EXT4-DAX [4]) directly access persistent memory storage without going through the kernel page cache, but this does not solve the problem in the distributed file system case. With the direct access of these persistent memory file systems, only the page cache is bypassed, and a distributed file system still requires data to be copied six times.

Octopus introduces the shared persistent memory pool by exporting the data area of the file system image in each server for sharing. The shared pool design not only removes the stacked file system design, but also enables direct remote access to file system images without any caching. Octopus directly manages the data distribution and layout of each server, and does not rely on a local file system. Direct data management without stacked file systems is also taken in Crail [9], a recent RDMA-aware distributed file system built from scratch. Compared to stacked file system designs like GlusterFS, data copies in Octopus and Crail do not need to go through the user-space buffer on the server side, as shown in Figure 4.

Octopus also provides a global view of the data layout with the shared pool enabled by RDMA. In a data server in Octopus, the data area in the non-volatile memory is registered with ibv_reg_mr when the data server joins, which allows remote direct access to file system images. Hence, Octopus removes the use of a message pool or an mbuf on the server side, which are otherwise used for preparing file system data for network transfers. As such, Octopus requires data to be copied only four times for a remote I/O request, as shown in Figure 4. By reducing memory copies in non-volatile memories, data I/O performance is significantly improved, especially for large I/Os, which incur fewer metadata operations.
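A minimal sketch of how a data server could export its data area into the shared pool follows: map the persistent region and register it with ibv_reg_mr so that remote clients can issue one-sided reads and writes against it. The mount path, pool size, and the way the base address and rkey are advertised to the cluster (here, simply returned to the caller) are assumptions, not the exact Octopus bootstrap code.

    #include <infiniband/verbs.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stddef.h>

    /* Map the server's data area (e.g., a DAX file or a Linux SHM segment used
     * to emulate NVM) and register it for remote access. The MR's rkey plus the
     * base address is what clients need to address this server's share of the
     * pool. */
    struct ibv_mr *export_data_area(struct ibv_pd *pd, const char *path,
                                    size_t pool_size, void **base_out)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return NULL;

        void *base = mmap(NULL, pool_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        close(fd);
        if (base == MAP_FAILED)
            return NULL;

        /* REMOTE_READ/WRITE make the region usable by one-sided verbs. */
        struct ibv_mr *mr = ibv_reg_mr(pd, base, pool_size,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            munmap(base, pool_size);
            return NULL;
        }
        *base_out = base;
        return mr;   /* mr->rkey is advertised to the other nodes in the cluster */
    }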
3.2.2 Client-Active Data I/O

For data I/O, it is common to complete a request within one network round-trip. Figure 5(a) shows a read example: the client issues a read request to the server, and the server prepares the data and sends it back to the client.



[Figure 5: Comparison of Server-Active and Client-Active Modes. (a) Server-active data I/O: the server looks up the file data and sends the data. (b) Client-active data I/O: the server returns the data address and the client fetches the data with RDMA READ.]

Similarly, a write request can also complete within one round-trip. This is called the Server-Active Mode. While this mode works well for slow Ethernet, we find that the server is always highly utilized and becomes a bottleneck when the new hardware is equipped.

In remote I/Os, the throughput is bounded by the lower of the network and server throughput. In our cluster, we achieve 5 million network IOPS for 1KB writes, but have to spend around 2us (i.e., 0.5 million operations per second) on data locating even without data processing. The server processing capacity becomes the bottleneck for small I/Os when RDMA is equipped.

In Octopus, we propose the client-active mode to improve server throughput by sacrificing network performance when performing small-size I/Os. As shown in Figure 5(b), in the first step, a client in Octopus sends a read or write request to the server. In the second step, the server sends back the metadata information to the client. Both of these steps are executed for metadata exchange using the self-identified metadata RPC, which will be discussed next. In the third step, the client reads or writes the file data with the returned metadata information, directly accessing data using RDMA read and write commands. Since RDMA read and write are one-sided operations, which access remote data without the participation of CPUs in remote servers, the server in Octopus has higher processing capacity. By doing so, a rebalance is made between the server and network overheads. With a limited number of additional round-trips, server load is offloaded to clients, resulting in higher throughput for concurrent requests.

Besides, Octopus uses a per-file read-write lock to serialize concurrent RDMA-based data accesses. The lock service is based on a combination of GCC (GNU Compiler Collection) and RDMA atomic primitives. To read or write file data, the locking operation is executed by the server locally using GCC atomic instructions. The unlock operation is executed remotely by the client with RDMA atomic verbs after the data I/Os. Note that serializability between GCC and RDMA atomic primitives is not guaranteed, due to the lack of atomicity between the CPU and the NIC [10, 41, 19]. In Octopus, GCC and RDMA atomic instructions are respectively used in the locking and unlocking phases. This isolation prevents competition between the CPU and the NIC, and thus ensures the correctness of parallel accesses.
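The client-side sketch below walks through the three client-active steps for a read (metadata RPC, one-sided RDMA READ of the data, remote unlock with an RDMA compare-and-swap), matching the GCC-lock/RDMA-unlock split described above. The metadata_rpc_read() helper, the file_extent layout, and the lock encoding (0 = free, 1 = locked) are hypothetical placeholders; queue pair setup and completion polling are assumed to happen elsewhere.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Metadata returned by the server in step 2 (illustrative layout). */
    struct file_extent {
        uint64_t data_addr;   /* remote address of the data in the shared pool */
        uint32_t data_rkey;
        uint64_t lock_addr;   /* remote address of the per-file lock word */
        uint32_t lock_rkey;
        uint32_t length;
    };

    /* Hypothetical metadata RPC (self-identified RPC underneath). */
    int metadata_rpc_read(const char *path, struct file_extent *out);

    static int post_one(struct ibv_qp *qp, struct ibv_send_wr *wr)
    {
        struct ibv_send_wr *bad;
        return ibv_post_send(qp, wr, &bad);
    }

    int client_active_read(struct ibv_qp *qp, const char *path,
                           void *local_buf, uint32_t local_lkey)
    {
        struct file_extent ext;
        /* Steps 1-2: ask the server for the file's address; the server takes
         * the per-file lock locally with a GCC atomic before replying. */
        if (metadata_rpc_read(path, &ext) != 0)
            return -1;

        /* Step 3: fetch the data directly, without the server CPU. */
        struct ibv_sge sge = { .addr = (uintptr_t)local_buf,
                               .length = ext.length, .lkey = local_lkey };
        struct ibv_send_wr rd = { .sg_list = &sge, .num_sge = 1,
                                  .opcode = IBV_WR_RDMA_READ,
                                  .send_flags = IBV_SEND_SIGNALED };
        rd.wr.rdma.remote_addr = ext.data_addr;
        rd.wr.rdma.rkey = ext.data_rkey;
        if (post_one(qp, &rd) != 0)
            return -1;

        /* Release the per-file lock remotely with an RDMA atomic; the fence
         * keeps the unlock from overtaking the outstanding READ. */
        uint64_t locked = 1, unlocked = 0;
        struct ibv_sge lsge = { .addr = (uintptr_t)&locked, .length = 8,
                                .lkey = local_lkey };
        struct ibv_send_wr cas = { .sg_list = &lsge, .num_sge = 1,
                                   .opcode = IBV_WR_ATOMIC_CMP_AND_SWP,
                                   .send_flags = IBV_SEND_SIGNALED | IBV_SEND_FENCE };
        cas.wr.atomic.remote_addr = ext.lock_addr;
        cas.wr.atomic.rkey = ext.lock_rkey;
        cas.wr.atomic.compare_add = locked;   /* expect "locked" */
        cas.wr.atomic.swap = unlocked;        /* set back to free */
        return post_one(qp, &cas);
    }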
3.3 Low-Latency Metadata Access

RDMA provides microsecond-level access latencies for remote data access. To explore this benefit at the file system level, Octopus refactors the metadata RPC and the distributed transaction by incorporating RDMA write and atomic primitives.

3.3.1 Self-Identified Metadata RPC

RPCs are used in Octopus for metadata operations. Both message and memory semantic commands can be utilized to implement RPCs.

(1) Message-based RPC. In the message-based RPC, a recv request is first assigned a memory address and initialized at the remote side before the send request is issued. Each time an RDMA send arrives, an RDMA recv is consumed. Message-based RPC has relatively high latency and low throughput. send/recv in UD (Unreliable Datagram) mode provides higher throughput [20], but is not suitable for distributed file systems due to its unreliable connections.

(2) Memory-based RPC. RDMA read/write have lower latency than send/recv. Unfortunately, these commands are one-sided, and the remote server is uninvolved. To process such requests in a timely manner, the server side needs to scan the message buffers repeatedly to discover new requests. This causes high CPU overhead. Even worse, when the number of clients increases, the server side needs to scan more message buffers, and this in turn increases the processing latency.

To gain the benefits of both sides, we propose the self-identified metadata RPC. The self-identified metadata RPC attaches the sender's identifier to the RDMA write request using the RDMA write_with_imm command. write_with_imm differs from RDMA write in two aspects: (1) it is able to carry an immediate field in the message, and (2) it notifies the remote side immediately, while RDMA write does not. Exploiting the first difference, we attach the client's identifier in the immediate data field, including both a node_id and an offset into the client's receive buffer. Due to the second difference, RDMA write_with_imm consumes one receive request from the remote queue pair (QP), and thus gets processed immediately after the request arrives. The identifier attached in the immediate field helps the server directly locate the new message without scanning the whole buffer. After processing, the server uses RDMA write to return data to the specified offset in the client identified by node_id. Compared to buffer scanning, this immediate notification dramatically lowers the CPU overhead when there are a lot of client requests. As such, the self-identified metadata RPC provides lower-latency and more scalable RPCs than the send/recv and read/write approaches.
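A sketch of the two ends of such a self-identified RPC is shown below. The client piggybacks its identity (a node id plus the offset of its reply buffer) in the 32-bit immediate field of an RDMA write-with-imm; the server decodes the immediate value from the completion and can locate and answer the request without scanning buffers. The 16/16-bit split of the immediate field and the surrounding setup are assumptions for illustration only.

    #include <infiniband/verbs.h>
    #include <arpa/inet.h>
    #include <stdint.h>

    /* Client side: write the request into the server's message pool and carry
     * "who am I / where to reply" in the immediate data. */
    int send_self_identified_rpc(struct ibv_qp *qp,
                                 void *req, uint32_t req_len, uint32_t lkey,
                                 uint64_t srv_msg_addr, uint32_t srv_msg_rkey,
                                 uint16_t node_id, uint16_t recv_buf_off)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)req, .length = req_len, .lkey = lkey };
        struct ibv_send_wr wr = {
            .sg_list = &sge, .num_sge = 1,
            .opcode = IBV_WR_RDMA_WRITE_WITH_IMM,
            .send_flags = IBV_SEND_SIGNALED,
            .imm_data = htonl(((uint32_t)node_id << 16) | recv_buf_off),
        };
        wr.wr.rdma.remote_addr = srv_msg_addr;
        wr.wr.rdma.rkey = srv_msg_rkey;
        struct ibv_send_wr *bad;
        return ibv_post_send(qp, &wr, &bad);
    }

    /* Server side: a write-with-imm consumes one pre-posted receive, so the
     * request shows up as a completion immediately, with no buffer scanning. */
    void handle_incoming(struct ibv_cq *cq)
    {
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            if (wc.status != IBV_WC_SUCCESS || !(wc.wc_flags & IBV_WC_WITH_IMM))
                continue;
            uint32_t id   = ntohl(wc.imm_data);
            uint16_t node = id >> 16;       /* which client sent this */
            uint16_t off  = id & 0xffff;    /* where its reply buffer lives */
            /* ... locate the request in the message pool for (node, off),
             * process it, and RDMA-write the reply back to that offset ... */
            (void)node; (void)off;
        }
    }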



[Figure 6: Distributed Transaction. (a) The traditional 2PC approach with distributed logging and two RPC round-trips; (b) the collect-dispatch approach with local logging, one collect RPC, and one-sided dispatch and unlock.]

3.3.2 Collect-Dispatch Transaction

A single file system operation, like mkdir, mknod, rmnod and rmdir in Octopus, performs updates to multiple servers. Distributed transactions are needed to provide concurrency control for simultaneous requests and crash consistency for the atomicity of updates across servers. The two-phase commit (2PC) protocol is usually used to ensure consistency. However, 2PC incurs high overhead due to its distributed logging and its coordination for both locks and log persistence. As shown in Figure 6(a), both locking and logging are required in the coordinator and the participants, and complex network round-trips are needed to negotiate the log persistence ordering.

Octopus designs a new distributed transaction protocol named the Collect-Dispatch Transaction, leveraging RDMA primitives. The key idea lies in two aspects, respectively in crash consistency and concurrency control. One is local logging with remote in-place update for crash consistency. As shown in Figure 6(b), in the collect phase, Octopus collects the read and write sets from the participants, and performs local transaction execution and local logging in the coordinator. Since participants do not need to keep logs, there is no need for complex negotiation for log persistence between the coordinator and participants, thereby reducing protocol overheads. In the dispatch phase, the coordinator spreads the updated write set to the participants using RDMA write and releases the corresponding locks with RDMA atomic primitives, without the involvement of the participants.

The other is a combination of GCC and RDMA locking for concurrency control, which is the same as the lock design for data I/Os in Section 3.2.2. In collect-dispatch transactions, locks are added locally using the GCC compare_and_swap command in both the coordinator and the participants. For the unlock operations, the coordinator releases its local lock using the GCC compare_and_swap command, but releases the remote lock in each participant using the RDMA compare_and_swap command. The RDMA unlock operations do not involve CPU processing on the participants, and thus simplify the unlock phase.

As a whole, collect-dispatch requires one RPC, one RDMA write, and one RDMA atomic operation, while 2PC requires two RPCs. Collect-dispatch still has lower overhead, because (1) an RPC has higher latency than an RDMA write/atomic primitive, and (2) an RDMA write/atomic primitive does not involve CPU processing on the remote side. Thus, we conclude that collect-dispatch is efficient, as it not only removes complex negotiations for log persistence ordering across servers, but also reduces costly RPC and CPU processing overheads.
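The pseudocode-style C sketch below outlines the coordinator's side of a collect-dispatch transaction over a single participant, following the phases in Figure 6(b). The collect_rpc, log_*, rdma_write_back, and rdma_cas_unlock helpers are placeholders standing in for the real machinery; the point of the sketch is only where local logging, the single collect RPC, and the one-sided dispatch and unlock sit relative to each other.

    #include <stdint.h>
    #include <stdbool.h>

    /* Placeholder types and helpers standing in for Octopus internals. */
    struct write_set { uint64_t inode_addr; /* updated metadata records ... */ };
    bool collect_rpc(int participant, struct write_set *out);   /* the single RPC */
    void execute_locally(struct write_set *ws);
    void log_local(const struct write_set *ws);   /* coordinator-only logging */
    void log_commit(void);
    void rdma_write_back(int participant, const struct write_set *ws); /* one-sided */
    void rdma_cas_unlock(int participant);        /* RDMA compare-and-swap unlock */

    static bool gcc_try_lock(volatile int *lock)
    { return __sync_bool_compare_and_swap(lock, 0, 1); }
    static void gcc_unlock(volatile int *lock)
    { __sync_lock_release(lock); }

    /* Coordinator of a collect-dispatch transaction with one participant. */
    bool collect_dispatch(volatile int *local_lock, int participant)
    {
        struct write_set ws;

        /* Collect phase: lock locally with a GCC atomic, then pull the
         * participant's read/write set with one RPC (the participant also
         * locks locally, but writes no log). */
        if (!gcc_try_lock(local_lock))
            return false;
        if (!collect_rpc(participant, &ws)) {
            gcc_unlock(local_lock);
            return false;
        }

        /* All logging happens at the coordinator only. */
        execute_locally(&ws);
        log_local(&ws);
        log_commit();

        /* Dispatch phase: push the updated write set with RDMA WRITE and
         * release the participant's lock with an RDMA atomic, without
         * involving the participant's CPU. */
        rdma_write_back(participant, &ws);
        rdma_cas_unlock(participant);
        gcc_unlock(local_lock);
        return true;
    }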
Consistency Discussions. In persistent memory systems, data in the CPU cache needs to be flushed to memory in a timely and ordered way to provide crash consistency [11, 26, 33, 25, 14, 32]. In Octopus, metadata consistency is guaranteed by the collect-dispatch transaction, which uses clflush to flush data from the CPU cache to memory and force the persistence of the log. While the collect-dispatch transaction could also be used to provide data consistency, data I/Os are not wrapped in a transaction in the current Octopus implementation, for efficiency. We expect that RDMA will gain more efficient remote flush operations that could benefit data consistency, such as novel I/O flows like RDMA read for remote durability [12], newly proposed commands like RDMA commit [39], or new designs that leverage availability for crash consistency [45]. We leave efficient data consistency for future work.
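For reference, a minimal sketch of the kind of flush-and-fence sequence the text alludes to when it says the log is forced to persistence with clflush. The 64-byte cache-line size and the sfence for ordering follow common x86 persistent-memory practice and are assumptions here; clflushopt or clwb would be drop-in alternatives on newer CPUs.

    #include <stdint.h>
    #include <stddef.h>
    #include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

    #define CACHELINE 64

    /* Flush a log record from the CPU cache to (persistent) memory and order
     * the flushes before subsequent stores, e.g., before writing the commit
     * record. */
    static void persist_range(const void *addr, size_t len)
    {
        uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
        uintptr_t end = (uintptr_t)addr + len;
        for (; p < end; p += CACHELINE)
            _mm_clflush((const void *)p);
        _mm_sfence();
    }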
4 Evaluation

In this section, we evaluate Octopus's overall data and metadata performance, then the benefits from each mechanism design, and finally its performance for big data applications.

4.1 Experimental Setup



Evaluation Platform. In the evaluation, we run Octopus on servers with large memory. Each server is equipped with 384GB DRAM and two 2.5GHz Intel Xeon E5-2680 v3 processors, and each processor has 24 cores. Clients run on different servers. Each client server has 16GB DRAM and one Intel Xeon E2620 processor. All these servers are connected with a Mellanox SX1012 switch using CX353A ConnectX-3 FDR HCAs (which support 56 Gbps over InfiniBand and 40GigE). All of them are installed with Fedora 23.

Evaluated File Systems. Table 1 lists the distributed file systems (DFSs) used for comparison. All these file systems are deployed in memory on the same cluster. For existing DFSs that require local file systems, we build local file systems on DRAM with the pmem driver and DAX [5] support in ext4. EXT4-DAX [4] is optimized for NVM: it bypasses the page cache and reduces memory copies. Octopus manages its storage space on the emulated persistent memory using Linux shared memory (SHM) in each server. These file systems are allocated 20GB of file system storage at each server. For the network part, all distributed file systems run on RDMA directly. Specifically, memGluster uses the RDMA protocol for communication between glusterfs clients and glusterfs bricks. NVFS is an optimized version of HDFS which exploits the byte-addressability of NVM and RDMA. Crail is a recent open-source DFS from IBM; it relies on DaRPC [37] for RDMA optimization and reserves huge pages as a transfer cache for bandwidth improvement.

Table 1: Evaluated File Systems
- memGluster: GlusterFS running on memory; GlusterFS is a widely-used DFS that has no centralized metadata service and is now part of Red Hat.
- NVFS [16]: a version of HDFS that is optimized with both RDMA and NVM.
- Crail [9]: an in-memory RDMA-optimized DFS built with DaRPC [37].
- memHDFS [35]: HDFS running on memory; HDFS is a widely-used DFS for big data processing.
- Alluxio [23]: an in-memory file system for big data processing.

Workloads. In our evaluation, we compare Octopus with memGluster, NVFS and Crail for metadata and read-write performance, and compare it with NVFS and Alluxio for big data benchmarks. We use mdtest for metadata evaluation, fio for read/write evaluation, and an in-house read/write tool based on OpenMPI for aggregated I/O performance. For the big data evaluation, we replace HDFS by adding an Octopus plugin under Hadoop. We use three packaged MapReduce benchmarks in Hadoop, i.e., TestDFSIO, Teragen, and Wordcount, for evaluation.

4.2 Overall Performance

To evaluate Octopus, we first compare its overall performance with memGluster, NVFS and Crail. All these file systems run at the memory level with an RDMA-enabled InfiniBand network. In this evaluation, we first compare Octopus's latency and bandwidth to the raw network's and storage's latency and bandwidth, and then compare Octopus's metadata and data performance to the other file systems.

[Figure 7: Latency Breakdown and Bandwidth Utilization. (a) Getattr/readdir latency breakdown (storage, network, software); (b) write/read bandwidth normalized to the raw network bandwidth.]

4.2.1 Latency and Bandwidth Breakdown

Figure 7 shows both the single round-trip latency and the bandwidth breakdown for Octopus. From the figures, we have two observations.

(1) The software latency is dramatically reduced to 6us (around 85% of the total latency) in Octopus, from 323us (over 99%) in memGluster, as shown in Figure 7(a). For memGluster on the emerging non-volatile memory and RDMA hardware, the file system layer has a latency that is several orders of magnitude larger than that of the storage or the network. The software consumes the overwhelming part, and becomes a new bottleneck of the whole storage system. In contrast, Octopus is effective in reducing the software latency by redesigning the data and metadata mechanisms with RDMA. The software latency in Octopus is of the same order as the hardware.

(2) Octopus achieves read/write bandwidth that approaches the raw network bandwidth, as shown in Figure 7(b). The raw storage and network bandwidths are respectively 6509MB/s (with single-thread memcpy) and 6350MB/s. Octopus achieves a read/write bandwidth (6088/5629MB/s) that is 95.9%/88.6% of the network bandwidth. In conclusion, Octopus effectively exploits the hardware bandwidth.

4.2.2 Metadata Performance

Figure 8 shows the file systems' performance in terms of metadata IOPS with different metadata operations, varying the number of data servers. From the figure, we make two observations.

(1) Octopus has the highest metadata IOPS among all evaluated file systems in general. memGluster and NVFS provide metadata IOPS on the order of 10^4. Crail provides metadata IOPS on the order of 10^5, owing to DaRPC, a high-performance RDMA-based RPC. Comparatively, Octopus provides metadata IOPS on the order of 10^6, which is two orders of magnitude higher than memGluster and NVFS.



[Figure 8: Metadata Throughput. (a) Mknod, (b) Mkdir, (c) Readdir, (d) Getattr, (e) Rmnod, (f) Rmdir, varying the number of clients.]

Octopus achieves the highest throughput except for rmdir and rmnod when there is only one data server. Crail is slightly better in that case, because it is deployed in RdmaDataNode mode without transaction guarantees. Generally, Octopus achieves high throughput in processing metadata requests, which mainly owes to the self-identified RPC and the collect-dispatch transaction that promise extremely low latency and high throughput.

(2) Octopus achieves much better scalability than the other evaluated file systems. NVFS and Crail are designed with a single metadata server, and achieve constant metadata throughput. Even with one metadata server, Octopus achieves better throughput than these two file systems in most cases. memGluster achieves the worst throughput, for GlusterFS is designed to run on hard disks and its software layer is inefficient in exploiting the high performance of NVM and RDMA, as illustrated in Section 2.2. Besides, memGluster stacks its data management layer on top of the local file system in each server to process metadata requests, and this also limits its throughput. Comparatively, Octopus has the best scalability. For all evaluated metadata operations, Octopus's IOPS improves by 3.6 to 5.4 times when the number of servers is increased from 1 to 5.

4.2.3 Read/Write Performance

[Figure 9: Data I/O Throughput (Multiple Clients). (a) Write, (b) Read, for I/O sizes from 1KB to 1MB.]

[Figure 10: Data I/O Bandwidth (Single Client). (a) Write, (b) Read, for I/O sizes from 1KB to 1MB.]

Figure 9 shows the file systems' performance in terms of concurrent read/write throughput with multiple clients, varying the read/write sizes. From Figure 9, we can see that, with small read/write sizes, Octopus achieves much higher throughput than the other file systems (750 Kops/s and 1 Mops/s for writes and reads, respectively). This benefit mainly comes from the client-active data I/O and self-identified RPC mechanisms. NVFS achieves relatively high throughput when the read/write size is set to 1KB, for its buffer manager prefetches data to boost performance. But its throughput drops rapidly when the I/O size grows, which is mainly restricted by the efficiency of its RPC. Crail has lower throughput than NVFS when the I/O size is small, but it achieves throughput close to Octopus when the I/O size grows. memGluster has the worst throughput and only achieves 100 Kops/s.

Figure 10 shows the read/write bandwidth achieved by a single client with different read/write sizes. As shown in the figure, Octopus significantly outperforms the existing DFSs in terms of read and write bandwidth. When the I/O size is set to 1MB, the read/write bandwidths in NVFS and memGluster are only around 1000MB/s and 1500MB/s, respectively. Crail reaches a bandwidth of 4000MB/s, which occupies only 63% of the raw network bandwidth. In contrast, Octopus can achieve bandwidth close to that of the raw InfiniBand network (6088MB/s and 5629MB/s with a 1MB I/O size for read and write, respectively), which is mainly because of the reduced memory copies achieved by using a shared persistent memory pool.



4.3 Evaluation of Data Mechanisms

4.3.1 Effects of Reducing Data Copies

Octopus improves data transfer bandwidth by reducing memory copies. To verify the effect of reducing data copies, we implement a version of Octopus which adds an extra copy at the client side, and we refer to it as Octopus+copy. As shown in Figure 11, when the I/O size is set to 1MB, Octopus+copy achieves nearly the same bandwidth as Crail (around 4000MB/s). However, when the extra data copy is removed, Octopus can provide 6000MB/s of bandwidth written or read by a single client, a 23% gain in bandwidth. When the I/O size is small, Octopus+copy still surpasses Crail with higher bandwidth, owing to the closely coupled RDMA and file system mechanism designs to be evaluated next.

[Figure 11: Effects of Reducing Data Copies. (a) Write and (b) Read bandwidth of Crail, Octopus+copy, and Octopus.]

4.3.2 Effects of Client-Active Data I/O

We then compare the IOPS of data I/O in the client-active and server-active modes mentioned in Section 3. Figure 12 shows the read/write throughput of both the client-active and server-active modes of Octopus, varying the read/write sizes. Crail's performance is also given for reference. We observe that the client-active mode has higher data throughput than the server-active mode for small read/write sizes. Both modes have close throughput for read/write sizes larger than 16KB. When the read/write sizes are smaller than 16KB, the client-active mode has higher data throughput by 193% for writes and 27.2% for reads on average. Even though the client-active mode consists of more network round-trips, it is more efficient to offload workloads from servers to clients when the read/write size is small, in order to improve the data throughput. The client-active mode improves write throughput more obviously than read throughput, because the server side has higher overhead for writes than for reads in server-active mode. In server-active mode, after the server reads data from the client using RDMA read when processing a client's write operation, it has to check the completion of this operation, which is time-consuming. But for a client's read operations, the server never checks the completion message, and thus provides relatively higher throughput. In all, we conclude that the client-active mode has higher bandwidth than the commonly-used server-active mode.

[Figure 12: Client-Active Data I/O Performance. (a) Write and (b) Read throughput of Crail, server-active, and client-active modes.]

4.4 Evaluation of Metadata Mechanisms

4.4.1 Effects of Self-Identified Metadata RPC

We first compare raw RPC performance with different usages of RDMA primitives to evaluate the effects of the self-identified metadata RPC. We then compare Octopus with existing file systems on metadata latencies.

Figure 13(a) shows the raw RPC throughput using three RPC implementations (i.e., message-based, memory-based, and self-identified, without message batching) along with DaRPC, varying the I/O sizes. DaRPC, used in Crail, is designed based on RDMA send/recv, and it achieves the lowest throughput, 2.4Mops/s with an I/O size of 16 bytes. Its performance may be limited by the Java implementation of its jVerbs interface. We also implement a message-based RPC that uses RDMA send/recv verbs, and it achieves a throughput of 3.87Mops/s at most. This throughput is limited by the raw performance of RDMA send/recv. For the memory-based RPCs that use RDMA write verbs, as taken in FaRM [13], we compare the performance by setting the maximum number of client threads to 20 and 100. As observed, the throughput is the highest (i.e., 5.4Mops/s) when the maximum number of client threads is 20. However, it decreases quickly to 3.46Mops/s when the maximum number of client threads is 100. This shows the inefficiency in processing and notification in the memory-based RPCs when there are a large number of client threads. Our proposed self-identified RPC, which carries client identifiers with the RDMA write_with_imm verbs, keeps a constant high throughput of 5.4Mops/s on average, without being affected by the number of client threads. Similarly, we also measure the latency of each RPC (in Figure 13(b)), among which the self-identified RPC keeps relatively low latency. As such, self-identified RPCs provide scalable and low-latency accesses, which is suitable for distributed storage systems that support a large number of client requests.



[Figure 13: Raw RPC Performance. (a) Throughput and (b) latency of memory-based RPC with 20/100 clients, send/recv RPC, DaRPC (Crail), and the self-identified RPC.]

[Figure 14: Metadata Latency. Normalized mkdir, mknod, readdir, getattr, rmnod, and rmdir latencies.]

Figure 14 shows the metadata latencies of Octopus along with the other file systems. As shown in the figure, Octopus achieves the lowest metadata latencies among all the evaluated file systems for all evaluated metadata operations (i.e., 7.3us and 6.7us respectively for getattr and readdir), which are close to the InfiniBand network latency in most cases. With the self-identified metadata RPC, Octopus can support low-latency metadata operations even without a client cache. Crail uses DaRPC for inter-server communication. However, Crail's metadata latencies (e.g., for mkdir and mknod) are much higher than raw DaRPC's latency. This is possibly because Crail is implemented on the inefficient HDFS framework, or because it registers memory temporarily for message communication, which is time-consuming. NVFS and memGluster suffer a similar problem of heavy file system designs as Crail, and thus have relatively higher latencies.

4.4.2 Effects of Collect-Dispatch Transaction

To evaluate the effects of the collect-dispatch transaction in Octopus, we also implement a transaction system based on 2PC for comparison. Figure 15(a) exhibits the latencies of these two transaction mechanisms. Collect-dispatch reduces latency by up to 37%. This is because 2PC involves two RPCs to exchange messages with remote servers, while collect-dispatch only needs one RPC and two one-sided RDMA commands to finish the transaction. Although the number of messages is increased, the total latency drops. The RPC protocol needs the involvement of both the local and remote nodes, and a lot of side information (e.g., hash computing and message discovery) needs to be processed at that time. Thus, the RPC latency (around 5us) is much higher than that of one-sided RDMA primitives (less than 1us). From Figure 15(b) we can see that transactions based on collect-dispatch improve throughput by up to 79%. On one hand, collect-dispatch only writes logs locally, significantly reducing logging overhead. On the other hand, collect-dispatch decreases the total number of RPCs when processing transactions, which reduces the involvement of remote CPUs and thereby improves performance.

[Figure 15: Collect-Dispatch Transaction Performance. (a) Latency and (b) throughput of 2PC vs. collect-dispatch for mkdir, mknod, rmnod, and rmdir.]

4.5 Evaluation using Big Data Applications

In addition, we compare Octopus with distributed file systems that are used in big data frameworks. We configure Hadoop with different distributed file systems: memHDFS, Alluxio, NVFS, Crail and Octopus. In this section, we compare both read/write bandwidth and application performance.

Read/Write Bandwidth. Figure 16(a) compares the read/write bandwidths of the above-mentioned file systems using TestDFSIO, with the read/write size set to 256KB. Octopus and Crail show much higher bandwidth than the traditional file systems. Octopus achieves 2689MB/s and 2499MB/s for write and read operations respectively, and Crail achieves 2424MB/s and 2215MB/s respectively. Note that they have lower bandwidths than the results in fio. The reason is that we connect Octopus/Crail to the Hadoop plugin using JNI (Java Native Interface), which restricts the bandwidth. In contrast, memHDFS, Alluxio and NVFS show lower bandwidth than Octopus and Crail. memHDFS has the lowest bandwidth, due to the heavy HDFS software design that targets hard disks and traditional Ethernet. Alluxio and NVFS are optimized to run on DRAM, and thus provide higher bandwidth than memHDFS, but they are still slower than Octopus. Thus, we conclude that the general-purpose Octopus can also be integrated into an existing big data framework and provide better performance than existing file systems.

[Figure 16: Big Data Evaluation. (a) TestDFSIO write/read bandwidth; (b) Teragen and Wordcount execution time for memHDFS, Alluxio, NVFS, Crail, and Octopus.]

Big Data Application Performance. Figure 16(b) shows the application performance for the different file systems. Octopus consumes the least time to finish all evaluated applications. Among all the evaluated file systems, memHDFS generally has the highest run time, i.e., 11.7s for Teragen and 82s for Wordcount. For the Teragen workload, the run time in Alluxio, NVFS, Crail and Octopus is 11.0s, 10.0s, 11.4s and 8.8s, respectively.
transactions, which reduces the involvements of remote Teragen workload, the run time in Alluxio, NVFS, Crail
CPUs and thereby improves performance. and Octopus is 11.0s, 10.0s, 11.4s and 8.8s, respectively.



For the Wordcount workload, the run time in Alluxio, NVFS, Crail and Octopus is 69.5s, 65.9s, 62.5s and 57.1s, respectively. We conclude that our proposed general-purpose Octopus can even provide better performance for big data applications than existing dedicated file systems.

5 Related Work

Persistent Memory File Systems: In addition to file systems that are built for flash memory [17, 28, 27, 22, 44], a number of local file systems have been built from scratch to exploit both the byte-addressability and persistence benefits of non-volatile memory [11, 14, 42, 32, 43]. BPFS [11] is a file system for persistent memory that directly manages non-volatile memory in a tree structure, and provides atomic data persistence using short-circuit shadow paging. PMFS [14], proposed by Intel, also enables direct persistent memory access from applications by removing the file system page cache with memory-mapped I/O. Similar to BPFS and PMFS, SCMFS [42] is a file system for persistent memory which leverages the virtual memory management of the operating system. Fine-grained management is further studied in the recent NOVA [43] and HiNFS [32] to make the software more efficient. The Linux kernel community has also started to support persistent memory by introducing DAX (Direct Access) to existing file systems, e.g., EXT4-DAX [4]. The efficient software design concepts in these local file systems, including removing duplicated memory copies, are further studied in the Octopus distributed file system to make remote accesses more efficient.

General RDMA Optimizations: RDMA provides high performance but requires careful tuning. A recent study [19] offers guidelines on how to use RDMA verbs efficiently from a low-level perspective, such as in the PCIe bus and NIC. Cell [30] dynamically balances CPU consumption and network overhead using RDMA primitives in a distributed B-tree store. PASTE [15] proposes direct NIC DMA to persistent memory to avoid data copies, for a joint optimization between the network and data stores. FaSST [20] proposes to use UD (Unreliable Datagram) for RPC implementation when using send/recv, in order to improve scalability. RDMA has also been used to optimize distributed protocols, like shared memory access [13], replication [45], in-memory transactions [41], and lock mechanisms [31]. RDMA optimizations have brought benefits to computer systems, and this motivates us to start rethinking the file system design with RDMA.

RDMA Optimizations in Key-Value Stores: RDMA features have been adopted in several key-value stores to improve performance [29, 18, 13, 40]. MICA [24] bypasses the kernel and uses a lightweight networking stack to improve data access performance in key-value stores. Pilaf [29] optimizes the get operation using multiple RDMA read commands at the client side, which offloads the hash calculation burden from remote servers to clients, improving system performance. HERD [18] implements both get and put operations using a combination of RDMA write and UD send, in order to achieve high throughput. HydraDB [40] is a versatile key-value middleware that achieves data replication to guarantee fault tolerance, is aware of the NUMA architecture, and adds a client-side cache to accelerate the get operation. While RDMA techniques lead to evolutions in the designs of key-value stores, their impact on file system designs is still under-exploited.

RDMA Optimizations in Distributed File Systems: Existing distributed file systems have tried to support RDMA networks by substituting their communication modules [1, 3, 6]. Ceph over Accelio [3] is a project under development to support RDMA in Ceph. Accelio [2] is an RDMA-based asynchronous messaging and RPC middleware designed to improve message performance and CPU parallelism. Alluxio [23] in Spark (formerly named Tachyon) has been transplanted to run on top of RDMA by Mellanox [6]. It faces the same problem as Ceph on RDMA. NVFS [16] is an optimized version of HDFS that combines both NVM and RDMA technologies. Due to the heavy software design in HDFS, NVFS hardly exploits the high performance of NVM and RDMA. Crail [9] is a recently developed distributed file system built on DaRPC [37]. DaRPC is an RDMA-based RPC that tightly integrates RPC message processing and network processing, which provides both high throughput and low latency. However, the internal file system mechanisms of these systems remain the same. In comparison, our proposed Octopus revisits the file system mechanisms with RDMA features, instead of introducing RDMA only to the communication module.

6 Conclusion

The efficiency of the file system design becomes an important issue for storage systems that are equipped with high-speed NVM and RDMA hardware. These two emerging hardware technologies not only improve hardware performance, but also push the software to evolve. In this paper, we propose a distributed memory file system, Octopus, which has its internal file system mechanisms closely coupled with RDMA features. Octopus simplifies the data management layer by reducing memory copies, and rebalances network and server loads with active I/Os in clients. It also redesigns the metadata RPC and the distributed transaction using RDMA primitives. Evaluations show that Octopus effectively exploits the hardware benefits, and significantly outperforms existing distributed file systems.



Acknowledgments

We thank our shepherd Michio Honda and the anonymous reviewers for their feedback and suggestions. We also thank Weijian Xu for his contribution to the early prototype of Octopus. This work is supported by the National Natural Science Foundation of China (Grant No. 61502266, 61433008, 61232003), the Beijing Municipal Science and Technology Commission of China (Grant No. D151100000815003), and the China Postdoctoral Science Foundation (Grant No. 2016T90094, 2015M580098). Youyou Lu is also supported by the Young Elite Scientists Sponsorship Program of the China Association for Science and Technology (CAST).

References

[1] GlusterFS on RDMA. https://gluster.readthedocs.io/en/latest/AdministratorGuide/RDMATransport/.
[2] Accelio. http://www.accelio.org, 2013.
[3] Ceph over Accelio. https://www.cohortfs.com/ceph-over-accelio, 2014.
[4] Support ext4 on NV-DIMMs. https://lwn.net/Articles/588218, 2014.
[5] Supporting filesystems in persistent memory. https://lwn.net/Articles/610174, 2014.
[6] Alluxio on RDMA. https://community.mellanox.com/docs/DOC-2128, 2015.
[7] Introducing Intel Optane technology - bringing 3D XPoint memory to storage and memory products. https://newsroom.intel.com/press-kits/introducing-intel-optane-technology-bringing-3d-xpoint-memory-to-storage-and-memory-products/, 2016.
[8] SAP HANA, in-memory computing and real time analytics. http://go.sap.com/product/technology-platform/hana.html, 2016.
[9] Crail: A fast multi-tiered distributed direct access file system. https://github.com/zrlio/crail, 2017.
[10] Association, I. T., et al. InfiniBand Architecture Specification: Release 1.3. InfiniBand Trade Association, 2009.
[11] Condit, J., Nightingale, E. B., Frost, C., Ipek, E., Lee, B., Burger, D., and Coetzee, D. Better I/O through byte-addressable, persistent memory. In Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP) (New York, NY, USA, 2009), ACM, pp. 133-146.
[12] Douglas, C. RDMA with PMEM: Software mechanisms for enabling access to remote persistent memory. http://www.snia.org/sites/default/files/SDC15_presentations/persistant_mem/ChetDouglas_RDMA_with_PM.pdf, 2015.
[13] Dragojević, A., Narayanan, D., Castro, M., and Hodson, O. FaRM: Fast remote memory. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14) (2014), pp. 401-414.
[14] Dulloor, S. R., Kumar, S., Keshavamurthy, A., Lantz, P., Reddy, D., Sankaran, R., and Jackson, J. System software for persistent memory. In Proceedings of the Ninth European Conference on Computer Systems (EuroSys) (New York, NY, USA, 2014), ACM, pp. 15:1-15:15.
[15] Honda, M., Eggert, L., and Santry, D. PASTE: Network stacks must integrate with NVMM abstractions. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks (2016), ACM, pp. 183-189.
[16] Islam, N. S., Wasi-ur-Rahman, M., Lu, X., and Panda, D. K. High performance design for HDFS with byte-addressability of NVM and RDMA. In Proceedings of the 2016 International Conference on Supercomputing (2016), ACM, p. 8.
[17] Josephson, W. K., Bongo, L. A., Flynn, D., and Li, K. DFS: A file system for virtualized flash storage. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST) (Berkeley, CA, 2010), USENIX.
[18] Kalia, A., Kaminsky, M., and Andersen, D. G. Using RDMA efficiently for key-value services. In SIGCOMM (2014).
[19] Kalia, A., Kaminsky, M., and Andersen, D. G. Design guidelines for high performance RDMA systems. In 2016 USENIX Annual Technical Conference (USENIX ATC 16) (2016).
[20] Kalia, A., Kaminsky, M., and Andersen, D. G. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016), USENIX Association, pp. 185-201.
[21] Lee, B. C., Ipek, E., Mutlu, O., and Burger, D. Architecting phase change memory as a scalable DRAM alternative. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA) (New York, NY, USA, 2009), ACM, pp. 2-13.
[22] Lee, C., Sim, D., Hwang, J., and Cho, S. F2FS: A new file system for flash storage. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST) (Santa Clara, CA, Feb. 2015), USENIX.
[23] Li, H., Ghodsi, A., Zaharia, M., Shenker, S., and Stoica, I. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In Proceedings of the ACM Symposium on Cloud Computing (2014).
[24] Lim, H., Han, D., Andersen, D. G., and Kaminsky, M. MICA: A holistic approach to fast in-memory key-value storage. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14) (2014).
[25] Lu, Y., Shu, J., and Sun, L. Blurred persistence in transactional persistent memory. In Proceedings of the 31st Conference on Massive Storage Systems and Technologies (MSST) (2015), IEEE, pp. 1-13.
[26] Lu, Y., Shu, J., Sun, L., and Mutlu, O. Loose-ordering consistency for persistent memory. In Proceedings of the IEEE 32nd International Conference on Computer Design (ICCD) (2014), IEEE.
[27] Lu, Y., Shu, J., and Wang, W. ReconFS: A reconstructable file system on flash storage. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST) (Berkeley, CA, 2014), USENIX, pp. 75-88.
[28] Lu, Y., Shu, J., and Zheng, W. Extending the lifetime of flash-based storage through reducing write amplification from file systems. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST) (Berkeley, CA, 2013), USENIX.
[29] Mitchell, C., Geng, Y., and Li, J. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In 2013 USENIX Annual Technical Conference (USENIX ATC 13) (2013), pp. 103-114.
[30] Mitchell, C., Montgomery, K., Nelson, L., Sen, S., and Li, J. Balancing CPU and network in the Cell distributed B-tree store. In 2016 USENIX Annual Technical Conference (USENIX ATC 16) (2016).
[31] Narravula, S., Mamidala, A., Vishnu, A., Vaidyanathan, K., and Panda, D. K. High performance distributed lock management services using network-based remote atomic operations. In Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07) (2007), IEEE, pp. 583-590.
[32] Ou, J., Shu, J., and Lu, Y. A high performance file system for non-volatile main memory. In Proceedings of the Eleventh European Conference on Computer Systems (EuroSys) (2016), ACM, p. 12.
[33] Pelley, S., Chen, P. M., and Wenisch, T. F. Memory persistency. In Proceedings of the 41st ACM/IEEE International Symposium on Computer Architecture (ISCA) (2014), pp. 265-276.
[34] Qureshi, M. K., Srinivasan, V., and Rivers, J. A. Scalable high performance main memory system using phase-change memory technology. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA) (New York, NY, USA, 2009), ACM, pp. 24-33.
[35] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (2010), IEEE, pp. 1-10.
[36] Strukov, D. B., Snider, G. S., Stewart, D. R., and Williams, R. S. The missing memristor found. Nature 453, 7191 (2008), 80-83.
[37] Stuedi, P., Trivedi, A., Metzler, B., and Pfefferle, J. DaRPC: Data center RPC. In Proceedings of the ACM Symposium on Cloud Computing (SoCC) (2014), ACM, pp. 1-13.
[38] Swanson, S., and Caulfield, A. M. Refactor, reduce, recycle: Restructuring the I/O stack for the future of storage. Computer 46, 8 (2013), 52-59.
[39] Talpey, T. Remote access to ultra-low-latency storage. http://www.snia.org/sites/default/files/SDC15_presentations/persistant_mem/Talpey-Remote_Access_Storage.pdf, 2015.
[40] Wang, Y., Zhang, L., Tan, J., Li, M., Gao, Y., Guerin, X., Meng, X., and Meng, S. HydraDB: A resilient RDMA-driven key-value middleware for in-memory cluster computing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2015), ACM, p. 22.
[41] Wei, X., Shi, J., Chen, Y., Chen, R., and Chen, H. Fast in-memory transaction processing using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP) (2015), ACM, pp. 87-104.
[42] Wu, X., and Reddy, A. L. N. SCMFS: A file system for storage class memory. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (New York, NY, USA, 2011), ACM, pp. 39:1-39:11.
[43] Xu, J., and Swanson, S. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In 14th USENIX Conference on File and Storage Technologies (FAST 16) (2016), pp. 323-338.
[44] Zhang, J., Shu, J., and Lu, Y. ParaFS: A log-structured file system to exploit the internal parallelism of flash devices. In 2016 USENIX Annual Technical Conference (USENIX ATC 16) (2016).
[45] Zhang, Y., Yang, J., Memaripour, A., and Swanson, S. Mojim: A reliable and highly-available non-volatile memory system. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15) (New York, NY, USA, 2015), ACM, pp. 3-18.
[46] Zhou, P., Zhao, B., Yang, J., and Zhang, Y. A durable and energy efficient main memory using phase change memory technology. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA) (New York, NY, USA, 2009), ACM, pp. 14-23.