Transactions On Red-Black and AVL Trees in NVRAM
Thorsten Schütt Florian Schintke Jan Skrzypczak
Zuse Institute Berlin
Abstract
Byte-addressable non-volatile memory (NVRAM) supports persistent
storage with low latency and high bandwidth. Complex data structures in
it ought to be updated transactionally, so that they remain recoverable at
all times. Traditional database technologies such as keeping a separate log,
a journal, or shadow data work on a coarse-grained level, where the whole
transaction is made visible using a final atomic update operation. These
methods typically incur significant additional space overhead and induce
non-trivial costs for log pruning, state maintenance, and resource (de-)allocation.
Thus, they are not necessarily the best choice for NVRAM,
which supports fine-grained, byte-addressable access.
We present a generic transaction mechanism to update dynamic com-
plex data structures ‘in-place’ with a constant memory overhead. It is
independent of the size of the data structure. We demonstrate and eval-
uate our approach on Red-Black Trees and AVL Trees with a redo log of
constant size (4 resp. 2 cache lines). The redo log guarantees that each ac-
cepted (started) transaction is eventually executed despite arbitrarily many
system crashes and recoveries in the meantime. We update complex data
structures in local and remote NVRAM providing exactly once semantics
and durable linearizability for multi-reader single-writer access. To persist
data, we use the available processor instructions for NVRAM in the local
case and remote direct memory access (RDMA) combined with a software
agent in the remote case.
1 Introduction
The introduction of NVRAM enables a new range of applications, but it also
causes new challenges for their effective use. NVRAM is the first non-volatile
storage providing byte-granular access with low-latency and high bandwidth.
In addition, it will replace SSDs as the fastest persistent storage in the storage
hierarchy. While SSDs only provide block-oriented APIs, NVRAM comes as
a standard DIMM. Plain loads and stores suffice to get direct access (DAX)
while bypassing the operating system. For recoverability and consistency of
data structures, it becomes relevant when, in which order, and which part of
them will be written to NVRAM from the processor’s caches—either explicitly
by instructions or implicitly by cache evictions. This is influenced by aspects
such as data alignment, weak memory models, and cache properties such as
associativity, size, and its replacement and eviction policy. To cope with all these
aspects, complex data structures stored in NVRAM must be recoverable at all
times, which requires new and sound transactional update mechanisms.
The literature on NVRAM has mostly focused on B+trees [16] with
a high radix [12, 27] to minimize the costs of insert and remove. With a high
radix, expensive operations like balancing happen seldom. Insert and remove
are operations on leaves. Thus, they do not have to be transactional and only
require a few persist calls. If the sequence of calls is interrupted by a crash, the
tree remains valid. These trees are tuned for absolute performance.
In contrast, balancing is the common case for Red-Black Trees and AVL
Trees [58]. Insert and remove always need an unpredictable and variable se-
quence of operations, i.e., balancing, recoloring, and updating the balance fac-
tors. The sequence depends on the size and the shape of the tree, as these
operations work across several levels of it. For NVRAM, the steps of the sequence have to happen
atomically despite an arbitrary number of crashes and restarts. Otherwise, the
tree might become invalid, unrecoverable, and may lose sub-trees. Thus, trans-
actions are needed [50].
The literature so far has focused on a copy-on-write style for updating data
structures. It comes with additional costs for allocating, de-allocating, and
garbage collection. For B+ trees, a constant amount of memory is needed, i.e.,
the size of a node, which simplifies the process. As Wang et al. [63] noted, for
Red-Black Trees copy-on-write operations touch almost the complete tree. One
needs to allocate and de-allocate a variable amount of memory for operations.
We split updates into a sequence of micro-transactions, so that a redo log
of constant size suffices for all operations. We neither allocate nor de-allocate
memory for operations as all updates are in-place [54]. Our approach shows its strengths for complex
data structures where updates are global operations and touch large parts of
the data structure. There is no doubt that performance-wise B+-trees beat
binary trees. It is by design. However, binary trees represent a wider class of
dynamic data structures using pointers. For binary trees, we needed to invent
new methodologies for storing data structures in NVRAM that widely differ
from those for B+-trees. These methodologies could also be applied to data structures such as (doubly)
linked lists, priority queues, or graphs. Trees are often used as proxies for index
data structures [49, 48].
As NVRAM behaves like memory rather than a spinning hard-disk, we can
use remote direct memory access (RDMA) of modern interconnects to directly
access NVRAM on remote nodes. Local and remote access can rely on a com-
mon set of operations. For NVRAM, access can be expressed in terms of read,
write, atomic compare&swap (CAS), and persist operations. For remote access,
get, put, remote atomic CAS, and remote persist of the passive target communi-
cation model defined in the MPI standard [38] can be used. For passive target
communication, the origin process can access the target’s memory without in-
volvement of the target process. It is similar to a shared memory model and
allows the design of a single transaction system based on common primitives for
both local and remote NVRAM.
We support exactly once operations [21] on dynamic complex data structures
in local and remote NVRAM and make the following main contributions:
• We designed a new transaction system for NVRAM, which splits large
multi-step transactions into a sequence of micro-transactions. A state
machine describes the transaction and the sequence of micro-transactions.
Each micro-transaction resp. state transition is idempotent, allowing atom-
icity and recovery in the failure case for multi-step transactions. All up-
dates happen directly on the data structure ‘in-place’ without shadow
copying. It incrementally transforms the old data structure into the new
one. All accepted operations will eventually succeed (Sect. 4).
• A redo log of constant size (four resp. two cache lines for Red-Black Trees
and AVL Trees) is used to guarantee recoverability and atomicity at all
times. Note that the size of the redo log is independent of the size of the
data structure (Sect. 4.1).
• Our approach supports exactly once semantics and guarantees durable
linearizability for all operations, both local and remote. Failed clients
cannot corrupt any data (Sect. 4.4).
• We implemented balanced Red-Black Trees and AVL Trees in NVRAM
using our approach—local and with passive target communication for re-
mote access (Sect. 6).
• Intel guarantees 8-byte fail-safe atomicity for NVRAM. For our approach,
7 bytes suffice as we do not rely on atomic pointer updates (Sect. 4).
• We designed a multi-reader single-writer lock with f -fairness to coordinate
concurrent RDMA writes. Failed lock-holders can be safely expelled by
other processes, because their process ids are stored in the lock, which
allows other clients to use failure detectors (Sect. 10).
• We simulated more than 2,000,000 power failures by killing processes to
validate the robustness of our approach (Sect. 11).
• Our evaluation shows more than 2,300 key-value pair inserts per second into
Red-Black Trees using passive target communication with NVRAM. For AVL
Trees, we reached more than 1,800 inserts per second (Sect. 12.4). For local access,
we reached almost 400,000 inserts per second (Sect. 12.1).
2 System Model
We assume a full-system failure model [30]. On a crash, all transient state (of
all processes) is lost. Only operations on fundamental, naturally aligned data
types of up to 8 bytes are fail-safe atomic in NVRAM, but 7 bytes are enough for
our approach.
While RDMA operations can fail non-atomically, we assume 64 bit RDMA
CAS operations to be atomic. To detect failed nodes, we use the weak failure
detector ♦W [11]. We consider a system with a single server storing data without
replication for simplicity. An arbitrary number of read/write clients may try to
access the data concurrently. We do not consider Byzantine failures.
3 Preliminaries
As discussed above, hardware only supports atomic updates of 8 bytes. In the
following, we describe the basic concepts needed for larger updates and describe
in detail the methods for persisting data with NVRAM.
3.1 Logging and shadow copying
As long as updates are atomic, i.e., 8 bytes for NVRAM or a block for SSDs, they
can be done in-place. For non-atomic updates, transaction systems [39, 63, 6]
use a combination of different techniques to preserve consistency in the face
of crashes. Logging uses undo and redo logs to store enough data to roll back
an interrupted transaction (undo) or to retry it (redo). Undo logging tends to
be more costly, as it has to log every store before executing it. Thus, redo
logging is the preferred technique. Some databases [6] use a
combination of both. Shadow copying, also known as copy-on-write, creates a
copy of the data to be updated, updates the copy, and atomically replaces the
old data with the new data. For example, to update a tree node, a copy of the
node is created and updated, and then the parent's pointer to the node is
swapped. Often, it is sufficient to replace one 8-byte pointer in this last step,
which can be done atomically.
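As an illustration, the following sketch shows shadow copying for a single tree
node in C++. The persist helper and all names are ours; they merely stand in
for the platform's flush-and-fence primitives and a real allocator:

#include <atomic>
#include <cstddef>

struct Node { long key; long value; Node* left; Node* right; };

// Stand-in for the persist primitive (flush affected lines + fence).
void persist(const void*, std::size_t) {}

// Shadow copying: build a durable copy off to the side, then publish
// it with one atomic 8-byte pointer store. The old node must later be
// reclaimed, which is where the (de-)allocation overhead comes from.
void shadowUpdate(std::atomic<Node*>& parentSlot, long newValue) {
    Node* old  = parentSlot.load();
    Node* copy = new Node(*old);      // 1. create the shadow copy
    copy->value = newValue;           // 2. update the copy
    persist(copy, sizeof(*copy));     //    make the copy durable
    parentSlot.store(copy);           // 3. atomic pointer replacement
    persist(&parentSlot, sizeof(parentSlot));
}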
4 Exactly once operations with Micro-Transactions
(µ-Tx) and State Machines
Performing updates on complex data structures often requires a sequence of
smaller operations (recoloring, balancing, node splitting, etc.). NVRAM makes
it challenging to perform them correctly in the face of power losses as only
aligned stores up to 8 bytes are fail-safe atomic with current hardware. All
larger operations require transactions. Otherwise, it is unknown which updates
are persistent in the failure case, i.e., reached the persistence domain. This can
fatally corrupt the data. Traditional techniques to address this problem are
shadowing, copy-on-write, and logging (see Sect. 3.1).
Aims. Inserting or removing elements from trees, for example, often re-
quires a sequence of operations, such as tree rotations (see Sect. 5.1). We want
to support such complex data structure updates atomically ‘in-place’. We di-
rectly update the data structure without shadow copying but with a redo log of
constant size independent of the size of the overall data structure. The structure
and actual size of such a constant-size redo log depends on the particular data
structure and operations to be supported. In Sect. 6, we show some examples
for Red-Black Trees and AVL Trees.
Approach. In general, we split an operation to be performed on a complex
data structure into a sequence of smaller operations, which we execute in micro-
transactions (µ-Tx) until the whole operation is finished. We want to be able
to identify the ongoing operation (insert or remove), detect the progress in
that operation, perform updates atomically, minimize the size of the redo log,
and guarantee that all accepted operations will eventually complete. We need
the following components (see Figure 6): (1) the primary data structure D of
potentially dynamic size that we want to update, (2) a redo log L of constant
size, and (3) a state machine M with S states describing the sequence of updates
on D and L. D and L can be seen as disjoint sets of byte ranges. D, L, and the
current state of M are stored in NVRAM.
For non-trivial updates, the idea is to establish a two-step mode of operation
repeatedly: First, persist all information that is required to perform the oper-
ation on D in the redo log L. Afterwards, perform the operation and persist it.
Each step is idempotent until the next micro-transaction begins. To separate
them from each other, a state variable is updated atomically between steps to
indicate which step finished last.
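The following simplified sketch illustrates this two-step pattern for a single
micro-transaction. All types and names (RedoLog, microTx, persist) are
illustrative stand-ins, not the paper's actual implementation, which uses the
DSL from Sect. 9:

#include <atomic>
#include <cstdint>
#include <cstddef>

struct RedoLog { uint64_t src; uint64_t dst; };
struct Data    { uint64_t cells[16]; };

// Stand-in for the persist primitive (e.g., clwb of the affected
// cache lines followed by sfence).
void persist(const void*, std::size_t) {}

enum State : uint64_t { CLEAN = 0, LOGGED = 1 };

// One micro-transaction moving cells[src] into cells[dst]: epoch 1
// persists the redo information, epoch 2 applies it. Both epochs are
// idempotent, so either can be blindly re-executed after a crash.
void microTx(Data& D, RedoLog& L, std::atomic<uint64_t>& s,
             uint64_t src, uint64_t dst) {
    L.src = src; L.dst = dst;           // epoch 1: fill the redo log
    persist(&L, sizeof(L));
    s.store(LOGGED); persist(&s, sizeof(s));

    D.cells[L.dst] = D.cells[L.src];    // epoch 2: update D in-place,
    persist(&D.cells[L.dst], sizeof(uint64_t));  // driven by the log
    s.store(CLEAN); persist(&s, sizeof(s));
}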
The current state of the state machine indicates whether the data structure
is clean so that the next read or write can be performed, or whether an operation
is currently ongoing. In the latter case, any upcoming read, insert, or
remove request has to wait. We prevent concurrent operations with locks (see
Sect. 10) as our transaction approach cannot handle concurrent accesses by
itself. Otherwise, they may corrupt the data structure [20].
Each state transition of the state machine (see Figure 1) performs a sequence
of writes followed by a mechanism to make the writes persistent—an epoch [17].
A state transition either updates D or L and then updates the current state
atomically. Thus, it consists of two epochs. We require all state transitions to
be idempotent. If a transition updates D, there must be enough information
in L to be able to redo the operation. If it updates L, there must be enough
information in D and L to redo the operation, i.e., D can be the redo log for L.
Figure 1: µ-Tx in a state machine with clean and dirty states working on a
constant size redo log L, a complex data structure D of arbitrary size, and the
state variable s. Each idempotent µ-Tx writes to D or L, persists the writes,
and then atomically sets the next state.
In contrast to shadowing and logging, the state machine approach is less
affected by failures. Once it has reached the first dirty state, it can guarantee the
client that the operation will eventually succeed. The first state transition ac-
cepts the operation and the following ones perform the operation. Shadowing
can only guarantee success after completion. We can identify the kind of oper-
ation uniquely by the current state. Different operations will use disjoint sets
of dirty states. If the machine is in a clean state, there is no ongoing operation.
The recovery cost is negligible. We might lose one epoch, which we have
to repeat. The next process can continue where the last process crashed. For
comparison, NV-Trees [65] only store the leaves in NVRAM. On recovery, they
have to recompute the inner nodes.
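A recovery routine then only has to read the persisted state and drive the
machine forward, as the following sketch illustrates; the executeTransition
dispatcher is a hypothetical stand-in for our per-state epoch code:

#include <atomic>
#include <cstdint>

constexpr uint64_t CLEAN = 0;

// Hypothetical dispatcher: re-executes the idempotent epoch belonging
// to `state` (possibly repeating the interrupted one) and returns the
// persisted successor state.
uint64_t executeTransition(uint64_t state) { /* ... */ return CLEAN; }

// Recovery: read the state variable from NVRAM and drive the state
// machine until it reaches the clean state again. No log scan needed.
void recover(std::atomic<uint64_t>& stateInNvram) {
    uint64_t s = stateInNvram.load();
    while (s != CLEAN)
        s = executeTransition(s);
}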
4.1 Overhead
There is a trade-off between the number of states and the size of the redo log.
Larger state machines tend to have smaller logs, because less information is
needed to make each state transition idempotent. However, they require more
state transitions and flushes. Smaller state machines need larger logs but
require fewer state transitions and flushes.
The smallest state machine with one clean and one dirty state akin to tradi-
tional transactions in databases has no constant size redo log. Remove for AVL
Trees has in the worst case O(log n) balancing operations, which cannot be ex-
ecuted in-place. All balancing operations require logging, see Sect. 5.4. Thus,
the minimal state machine with two states has no constant size redo log. For
remove with AVL Trees, the log’s size is a function of the depth of the tree. In
our approach, it suffices to provide enough space in the redo log to make state
transitions idempotent. The complexity of state transitions is independent of
the size of the data structure.
There are operations that imperative code implements in-place, such as tree
rotations, updating pointers, and updating loop counters. These operations are
not idempotent and have to be split into two state transitions. The smallest
state machine with a constant size redo log depends on the complexity of the
algorithm. Note that both state machines for AVL Trees are larger than the
two for Red-Black Trees, because AVL Trees are more strictly balanced. Furthermore, the
remove state machine for AVL Trees is larger than the one for insert (35 vs. 24
states).
Despite all state transitions being idempotent, there are two strategies for
re-executing an interrupted state transition: (a) blindly re-executing it and (b)
analyzing the progress made before the crash and only executing missing parts.
In our implementation, we used the former strategy. The number of stores per
state transition is small. A single rotation needs two state transitions with
three stores each, so the gain of the latter strategy would be negligible. For
store-intensive state transitions, e.g., writing a GiB of data, the latter would be
more efficient than the former.
We use a global state variable of 7 bytes that stores the current state and
is updated atomically with CAS operations. This means that the number of
states is limited to 2^56. When 2^56 states are insufficient, the state can be
kept in D as a variable with more than 56 bits. However, the number of
executable state transitions is not bounded by the 7 byte limit. Unfortunately,
this approach does not allow reliably detecting whether the state machine made
progress. Even if the state variable did not change between two reads, the state
machine may have transitioned through several states in the meantime ending in
the original state again. Supporting detection of progress would require an ap-
proach based on arbitrary precision counters, which is unfortunately challenging
with NVRAM.
In contrast, buffered durable linearizability [30] only requires operations to
be persistently ordered before they return. After a crash, the data would still
be persistent but not necessarily up to date.
In our approach, every operation starts at the clean state in which the last
operation finished, traverses the state machine, and returns to the clean state
before it returns to the caller. The current state of D represents the
execution resp. history of all previously completed µ-Tx. If the state machine is
in a dirty state after a crash, an operation started but did not return. Given
enough progress, the state machine might be in the clean state after the crash
even though the operation did not return before the crash. Thus, our approach
supports durable linearizability.
Figure 2: Right rotation with red (white) and black (gray) nodes. Nodes/roles
(A, B, C, pivot P, Q, and GP) and keys (1, 2, 3, ..., 6). P is promoted and Q
is demoted. Nodes drop (D) and take (T) children in three steps: 1. Q(L)
drops P and takes B (S1); 2. P(R) drops B and takes Q (S2); 3. GP(L) drops
Q and takes P (S3). The rotation also resolves the red violation between P
and B.
Knuth [31] describes a top-down algorithm for insert with O(1) single or
double rotations on average [36]. It walks down the tree and inserts the new
node at the bottom. On the way down, it records the positions, which have to
potentially be rebalanced. Additionally, it adapts the respective balance factors.
Remove deviates from the algorithms described so far. It also requires O(1)
balance operations on average, but it can need up to O(log n) balance opera-
tions [59]. It must walk back up to balance the tree. B and B+ trees [8, 16, 9]
use a similar concept. If a node is full, they split the node and walk up the tree.
The imperative C code from Julienne Walker2 uses a stack of size O(log n)
for remove. This violates our aim of a constant-size redo log. On the
way down, we store pointers to parents in the nodes, which can be used to
walk back up the tree as needed. In general, this shows a limit of our approach:
algorithms that need auxiliary space larger than O(1) cannot be
directly supported when the log must remain of constant size. Here, we used
the common technique of storing pointers to parents in the nodes. We discuss
this problem in more detail in Sect. 15.
5.4 Tree Rotations
A common balancing technique in binary trees is the tree rotation [55]. Tree
rotations preserve the order of nodes, but change the shape of the tree for
rebalancing. Figure 2 shows an example of a right rotation with Q as the root
of the rotation, P as pivot (the left child of Q), and GP as grandparent of P. The
rotation increases the height of the tree under the pivot by one and decreases the
height of the tree under the root by one. Additionally, it resolves a red violation
between the pivot (2) and B (3). A left rotation works vice versa. The order
of keys (1, 2, 3, ..., 6) and thus the order of nodes remains unchanged.
For the right rotation, Q replaces its left child (the pivot) with B. The pivot
replaces its right child (B) with Q. The GP replaces its left child (Q) with the
pivot. The three nodes pass the ownership, the parent relation, around. This
leads to the tree rotating around the GP.
Atomic Tree Rotations for NVRAM For NVRAM, the ownership changes
are implemented with stores (S1-S3). S1 for updating Q’s left child, S2 for
updating the pivot’s right child, and S3 for updating GP’s left child. If the
ownership changes are performed partially, i.e., the pivot drops the link to B
and Q does not take ownership of B, we lose sub-trees and can create cycles, as
the following analysis shows:
S1 and S3 fail → No link to pivot P.
S1 and S2 fail → No link to B. Cycle: Q and pivot P.
S2 and S3 fail → No link to Q.
S1 fails → No link to pivot P.
S3 fails → No link to Q.
S2 fails → No link to B. Cycle: Q and pivot P.
GP, Q, and the pivot have to be updated atomically. Otherwise, the tree
will lose sub-trees and become invalid. Logging the insert resp. delete request
would not be sufficient. Shadowing would copy Q and the pivot node, update
them, and atomically update the pointer of the grandparent pointing to the new
pivot. In our approach, we copy pointers to Q, the pivot, and B to the redo log
in a first step and then update Q, the grandparent, and the pivot in a second
transaction.
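The following sketch shows how such a logged right rotation could look as two
micro-transactions; persist, RotLog, and the state constants are illustrative
stand-ins under our naming, not the verbatim implementation:

#include <atomic>
#include <cstdint>
#include <cstddef>

struct Node { Node* left; Node* right; };
struct RotLog { Node* q; Node* pivot; Node* b; };

void persist(const void*, std::size_t) {}  // flush + fence stand-in

enum : uint64_t { ROT_LOGGED = 1, ROT_DONE = 2 };

// Right rotation as two micro-transactions. The first epoch snapshots
// the read set (Q, pivot P, B) into the redo log; the second performs
// S1-S3 reading only from the log, so it can be blindly re-executed
// after a crash without losing sub-trees or creating cycles.
void rightRotate(Node*& gpSlot, RotLog& L, std::atomic<uint64_t>& s) {
    // Epoch 1: log pointers to Q, the pivot, and B.
    L.q = gpSlot; L.pivot = L.q->left; L.b = L.pivot->right;
    persist(&L, sizeof(L));
    s.store(ROT_LOGGED); persist(&s, sizeof(s));

    // Epoch 2: idempotent ownership changes, driven by the log.
    L.q->left = L.b;        // S1: Q drops P and takes B
    L.pivot->right = L.q;   // S2: P drops B and takes Q
    gpSlot = L.pivot;       // S3: GP drops Q and takes P
    persist(L.q, sizeof(Node)); persist(L.pivot, sizeof(Node));
    persist(&gpSlot, sizeof(Node*));
    s.store(ROT_DONE); persist(&s, sizeof(s));
}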
1 void insert(root, key, value) {
2   if (root == nullptr) {  // the first node
3     root = new Node(key, value);  // A1
4   } else {  // initialize pointers and iterators
5     Node head;  // A2
6     Node *it, *parent, *grand, *grandgrand = nullptr;
7     parent = &head;
8     it = parent->right = root;
9     Direction dir, last; dir = Left;
10     while (true) {
11       if (it == nullptr) {  // insert the new node here
12         parent->dir = it = new Node(key, value);  // A3,A4
13       } else if (isRed(it->left) and isRed(it->right)) {
14         // recolor
15         it->color = Red;  // A5
16         it->left->color = it->right->color = Black;
17       }
18       if (isRed(it) and isRed(parent))  // need rotation?
19         rebalance(grandgrand, grand, last);  // A6,A7,A8,A9,A10,A11
20       if (it->key == key) break;  // key exists already
21       // traverse one level down the tree
22       last = dir; dir = (it->key < key) ? Right : Left;  // A12
23       grandgrand = grand;
24       grand = parent; parent = it;
25       it = it->dir;
26     }
27     // update root
28     root = head.right;  // A13
29   }
30   // ensure the root is black
31   root->color = Black;  // A14
32 }
Figure 3: Top down insert based on Guibas and Sedgewick [24] and J. Walker.2
The labels show the state of the state machine in Figure 6.
6.2 Insert
Guibas and Sedgewick [24] introduced top-down approaches for inserting and re-
moving key-value pairs for dichromatic trees, see Figure 3. As they are top-down
algorithms, they do not require keeping a stack, but only keep a few pointers
up the tree—the iterators. They are mainly used for playing the roles/anchors
in tree rotations. They also use a fake head node to simplify corner-cases. We
derived insert and remove (Figure 3 and 4) from Julienne Walker,2 who uses the
same concepts. To avoid black violations, we insert the new node as a red node
at the bottom. However, inserting a red node can yield red violations, which
can be resolved by promotions [58], i.e., color flip, single, and double rotation.
Inserting a node into an empty tree from the clean state C is trivial (C
→ A1 → C ). Otherwise, we initialize helpers with the init transition (lines 6–
9; outgoing edges from A2 ). If a leaf node is reached, we insert the new node
(line 12; A3 → A4 ). Otherwise, we might need to flip colors (lines 15–16), which
can be done in one state transition, because setting a new color is idempotent
(outgoing edges of A5 ).
The rebalance function (line 19 in Figure 3) performs single or double rota-
tions between the grand and grand grandparent. Single rotations are converted
into two state transitions: one for logging and one for executing the rotation
(Figure 6, A10 → A11 → either A12 (to continue) or A13 (key was found)).
Double rotations need four state transitions accordingly (A6 → A7 → A8 →
A9 ).
Lines 23–25 descend the iterators one level down the tree (outgoing edges
of A8 ) to close the loop. In Sect. 8, we show how to update the iterators with
fewer state transitions and persist operations in some cases. Finally, the root is
set and colored black (lines 28–31; A13 → A14 → C ).
The log's size (4 cache lines of 64 bytes each) is independent of the size of the
tree (see Figure 6). Its main components are the key, the value, the iterator, the
parent, and the grandparent. In addition, it keeps some space for redo logging:
the direction on the way down the tree (Dir), a temporary node for tree
rotations (TmpNode), and an anchor node on the way down the tree for remove
in RBTs (Sp). The majority of this space would also be needed for a
non-persistent insert operation. Note that the log does not keep
a stack of size O(log n). Instead, it suffices to keep a few pointers up the tree.
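A possible layout of such a log, assembled from the field names in Figure 6,
could look as follows. The exact field types and grouping are our guess, not
the verbatim implementation:

#include <cstdint>

struct Node;  // tree node in NVRAM

// Approximate layout of the constant-size insert redo log for RBTs:
// four 64-byte cache lines, independent of the tree size.
struct alignas(64) InsertLog {
    struct alignas(64) Iterators {         // one cache line per set
        uint64_t lastDir, dir;
        Node *tp, *grandp, *parentp, *qp, *q2p, *fp;
    } even, odd;                           // even-odd scheme (Sect. 8)
    struct alignas(64) Redo {              // scratch space for redo steps
        Node *tmpNode, *sp;
        uint64_t dir1, dir2;
        Node *savep, *saveChldp;
    } redo;
    struct alignas(64) Request {           // the accepted operation
        uint64_t key, value;
        Node* root;
    } request;
};
static_assert(sizeof(InsertLog) == 256, "four cache lines");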
6.3 Remove
While the algorithm for remove (Figure 4) looks more complex than for insert
(Figure 3), its state machine in Figure 6 is simpler than the one for inserting.
The control flow graph for remove is simpler and consists of a single nested
if statement. In each iteration, the state machine can jump into exactly one
place. In contrast, the inner part of the insert algorithm contains two nested if
statements. It allows jumping into two cases resulting in a more complex state
machine.
After the basic initialization, lines 11–14 move the iterators one level further
down the tree. This uses the even-odd optimization described in Sect. 8, so that
all outgoing edges of R1 update the iterators. The single rotation in line 19
uses R2 → R3. The color flip in lines 27–29 is idempotent. Thus, a single
state transition from R5 suffices. The same applies to the color correction in
lines 35–37 represented by R12.
The balance operation in line 33 either executes a single rotation (R6 →
R7 ) or a double rotation (R8 → R9 → R10 → R11 ).
1 void remove(root, key) {
2   if (root == nullptr) return;  // empty tree
3   // initialize pointers and iterators
4   Node head;  // C0
5   Node *it, *parent, *grand, *found = nullptr;
6   Direction dir = Right;
7   it = &head;
8   it->right = root;
9   while (it->dir != nullptr) {
10     // traverse one level down the tree
11     Direction last = dir;  // R1
12     grand = parent;
13     parent = it;
14     it = it->dir;
15     dir = (it->key < key) ? Right : Left;  // direction?
16     found = (it->key == key) ? it : found;  // found?
17     if (not isRed(it) and not isRed(it->dir)) {
18       if (isRed(it->(!dir))) {  // single rotation
19         parent = parent->last = single(it, dir);  // R2,R3
20       } else if (not isRed(it->(!dir))) {
21         Node *s = parent->(!last);  // R4
22         if (s != nullptr) {
23           Direction dir2 =
24             (grand->right == parent) ? Right : Left;
25           if (not isRed(s->left) and not isRed(s->right)) {
26             // recolor
27             parent->color = Black;  // R5
28             s->color = Red;
29             it->color = Red;
30           } else if ((grand != nullptr) and
31                      not ((grand == &head) and (dir2 == Left))) {
32             // rotate?
33             rebalance(grand, parent, s, last);  // R6,R7,R8,R9,R10,R11
34             // recolor
35             it->color = grand->dir2->color = Red;  // R12
36             grand->dir2->left->color = Black;
37             grand->dir2->right->color = Black;
38   } } } } }
39   if (found != nullptr) {  // unlink and delete
40     found->key = it->key;  // R13,R14
41     Direction dirL = (parent->right == it) ? Left : Right;
42     Direction dirR = (it->left == nullptr) ? Right : Left;
43     parent->dirL = it->dirR;
44     delete it;
45   }
46   // update root
47   root = head.right;  // R15
48   root->color = Black;  // ensure the root is black
49 }
Figure 4: Top down remove based on Guibas and Sedgewick [24] and J. Walker.2
The labels show the state of the state machine in Figure 7.
Figure 5: Data layout of a tree node in NVRAM (Key, Value, Left, Right,
Up, Col, Dir, Bal) packed into cache lines; the tree is a fixed-length array of
nodes (Head, n0, n1, ...) together with the Next Node index.
Growing the tree's size further would require adding additional fixed-length
arrays. The index type would need to be updated accordingly (from IdxType
to (ArrayIdxType, IdxType)).
To support allocating new tree nodes after a crash, we use the Next Node
structure—a uint56_t. It stores the index of the next free array element. Note,
though, that we did not implement garbage collection. Even if a client releases
a node, it cannot be reused. We only allocate from the head and do not
maintain free lists.
NVRAM-aware garbage collection and dynamic node allocation are provided
by Makalu [10], nvm_malloc [51], and PAllocator [43]. They could be used for
our tree data structure. The state and next node variables have to be updated
with atomic CAS operations. All other updates can use non-atomic stores,
because we can do redo logging.
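The following sketch illustrates the resulting bump allocation. In the real
system the index advance is part of an idempotent state transition; here it is
shown in isolation, with all names ours:

#include <atomic>
#include <cstdint>
#include <cstddef>

struct Node { unsigned char bytes[64]; };

void persist(const void*, std::size_t) {}  // flush + fence stand-in

// Crash-safe bump allocation from a fixed node array: `nextNode`
// lives in NVRAM and is advanced with an atomic 8-byte operation, so
// it cannot tear. There are no free lists; released nodes are never
// reused (no garbage collection).
Node* allocNode(Node* pool, std::atomic<uint64_t>& nextNode) {
    uint64_t idx = nextNode.load();
    // Single writer: the CAS always succeeds; it is used only for
    // its no-tear property.
    nextNode.compare_exchange_strong(idx, idx + 1);
    persist(&nextNode, sizeof(nextNode));
    return &pool[idx];
}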
Figure 6: NVRAM data structures and state machines for insert and remove
for Red-Black Trees with 15 (insert) and 16 (remove) states (rot=rotation,
init=initialize variables, add=insert new node, remove=remove a node, de-
scend=one step down the tree, root=update root, color=recolor nodes,
black=color the root black). The redo log comprises even and odd iterator
sets (LastDir, Dir, Tp, Grandp, Parentp, Qp, Q2p, Fp), scratch fields
(TmpNode, Sp, Dir', Dir'', Savep, Savechldp), the request (Key, Value), the
Root, and the State. Dashed edges flush to the log. The State fills
7 bytes aligned to 8 bytes.
Figure 7: NVRAM data structures and state machines for insert and remove
for AVL Trees with 24 and 35 states. Dashed edges flush to the log. The State
fills 7 bytes aligned to 8 bytes.
The four cache lines of the Red-Black Tree log are only partially filled, while
the two cache lines for AVL Trees are completely filled. Spreading the data
over more cache lines might improve performance further by reducing correlated
cache misses, but such optimizations are beyond the scope of this paper.
8 Optimizations
Figure 8 shows the epochs by category for inserting resp. removing 10^7 keys
into Red-Black Trees. There are O(1) single and double rotations per operation
on average and one color flip, as can be expected [58]. Some categories reflect
necessary steps for starting resp. completing operations, e.g., removing the found
node, updating the root, initializing the variables for the search, and flushing the
current command. While the cost of each operation is in O(log n), the number
of state transitions is dominated by shuffling the iterators for tree traversal.
As we cannot update the iterators in place, we need to use the redo log.
The canonical approach requires two state transitions (4 epochs) per loop
iteration. In the first, it flushes the old iterators to the log. In the second,
it updates the new iterators to go one level deeper into the tree. Instead, we
implemented an even-odd scheme. All iterators are stored twice to obtain disjoint
read and write sets. In even rounds through the loop, the first set is written; in
odd rounds, the second set. Thus, we need only one state transition to go one
level down the tree, and the previous iterators form the redo log. A
set of iterators fits into a single cache line, which supports the even-odd scheme
and reduces flush costs. We use one bit of the state variable to store whether
we are in an even or odd round:
struct SV { uint64_t Even : 1; uint64_t State : 55; } EvenState{1, 0};
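A descend transition could then look like the following sketch, where the
Iterators layout, nextState, and persist are illustrative:

#include <atomic>
#include <cstdint>
#include <cstring>
#include <cstddef>

struct SV { uint64_t Even : 1; uint64_t State : 55; };  // as above
struct Iterators { uint64_t lastDir, dir; void* ptrs[6]; };
void persist(const void*, std::size_t) {}  // flush + fence stand-in

// One descend transition under the even-odd scheme: read the iterator
// set of the current parity, write the other set (disjoint read/write
// sets), then flip Even and advance State in one atomic 8-byte update.
void descend(Iterators sets[2], std::atomic<uint64_t>& packed,
             uint64_t nextState) {
    uint64_t raw = packed.load();
    SV sv; std::memcpy(&sv, &raw, sizeof sv);
    sets[!sv.Even] = sets[sv.Even];   // copy, then move one level down
    persist(&sets[!sv.Even], sizeof(Iterators));
    sv.Even = !sv.Even; sv.State = nextState;
    std::memcpy(&raw, &sv, sizeof raw);
    packed.store(raw); persist(&packed, sizeof packed);
}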
Figure 8: Average number of epochs for inserting resp. removing 10^7 keys in
random order in Red-Black Trees (smaller is better; workloads Add, Add w LA,
and Remove). Categories: InsertNewNode (A1), Allocator, RemoveNode
(R13-14), Misc (A3,R4), UpdateRoot (A13-14,R15), TreeTraversal (A12,R1),
FlushCommand (C0), InitializeVariables (C0,A2), NewNode (A1,A4), ColorFlip
(A5,R5), Recolor (R12), DoubleRotation (A6-9,R8-11), and SingleRotation
(A10-11,R2-3,R6-7).
Figure 9: Average number of epochs for inserting resp. removing 10^7 keys in
random order in AVL Trees (smaller is better; workloads Add, Add w LA, and
Remove). Categories: PopStack (R20), PushStack (R2-3), FixParent
(A22-24,R5-6,R14-16), Balance (A13,A16-17,R19,R21,R31-33), BalanceFactors
(A7-A9,A11-12,R14,R17), BalancePoints (A5,A10), Loop (A3,A6,A25-26,R9-10),
InsertNewNode (A1), Allocator, StateChange, RemoveNode (R12-13), Misc
(R4,R7-8,R11,R34), FlushCommand (C0), InitializeVariables (A2,R1,R18,R30),
NewNode (A4), DoubleRotation (A18-21), and SingleRotation (A14-15).
If the next loop iteration does not perform any rotations or recoloring, it
can be skipped, as it does not change the tree. For inserts, we added unlimited
look-ahead (LA): instead of going exactly one level down the tree in each
iteration, we advance the iterators directly to the next level that needs balancing
or recoloring.
Figure 8 shows the average number of epochs, i.e., flushes, for inserting resp.
removing 10^7 keys in random order [35] into RBTs. Insert and remove are dom-
inated by the tree traversal, but unlimited look-ahead (LA) almost eliminates
these costs. Figure 9 shows the transitions for AVL Trees. Add is dominated by
the loop walking down the tree; look-ahead does not iterate through the loop.
Remove is dominated by push stack, which stores in the nodes the pointers to their
parents. Again, AVL Trees are more expensive than RBTs. In both experiments,
keys are inserted at depth 21 on average. The final Red-Black Tree has a depth
of 29 and the AVL Tree has a depth of 28, as it is balanced more strictly.
The AVL Tree insert algorithm [31] is split into two phases. First, it walks
down the tree and searches the parent of the new node. On the way down, it
saves two balancing points. In the second phase, it inserts the new node and
uses the balancing points to rebalance the tree. The first loop can be replaced
by two state transitions, because it is almost side-effect free—except for storing
the balancing points. The approach is similar to the look-ahead for RBTs.
The AVL Tree remove algorithm is more challenging, because the loops walk-
ing down the tree store parent pointers in the nodes. Skipping loop iterations
is challenging. The last loop walks up the tree and balances it. There are fewer
opportunities for skipping iterations.
9 Implementation Details
The programming model for NVRAM [57] maps files into memory using mmap,
which provides direct access (DAX) to the NVDIMM. Mapping a file again after
a power failure may yield a different base address. So, all memory accesses have
to be explicitly adjusted to the corresponding address range. This is necessary to
consistently access the same data after a power loss. Makalu [10] is a persistent
heap manager, which hides this problem from users.
For simplicity, we placed the log structure and the tree, an array of nodes,
into different files. There are pointers between nodes and between the log and
the nodes, which have to be adjusted to the base addresses. The following
assignment is completely handled by the compiler, which cannot adjust the
embedded pointers to the changed base addresses:
log->root->left->color = log->save->right->color;
The only feasible solution is to manage memory accesses in user code instead
of leaving them to the compiler. Thus, for memory accesses and persist operations, we wrote
our own embedded domain specific language (DSL) based on expression tem-
plates [60].
The DSL provides a declarative language for describing memory addresses
including all intermediate steps—the path. Instead of using expressions such as
log->root->left->color, we assign types to each memory address, e.g.,
ColorInNode<LeftInNode<RootInLog>> address = {log};. Each memory access can be seen
as a path of intermediate memory accesses. The DSL allows programmers to
describe memory accesses and persist operations declaratively while ignoring the
peculiarities of the underlying programming model. For each memory access,
the runtime maps the request to the corresponding object (log, nodes, next node,
and state variable) and its associated mapped file. The access is applied to the
address space with the offset adjusted accordingly. This allows development on
machines with and without NVRAM and facilitates remote access. For local
development, we simply use malloc to simulate mmapped files. For machines
with NVRAM, we rely on the PMDK for mmapping. For remote access, we use
PMDK to mmap files and UCX 3 for communication with RDMA. Additionally,
our DSL allows us to transparently experiment with different cache flushing
and caching strategies. The runtime can map persist operations to the different
persist operations described in Sect. 3.2. For remote access, we can cache the
results of gets.
The following updates the key and value in the log:
WriteOp<typename LogAddress::Key, KeyValue> W1 =
    {LogAddress::Key(log), KeyValue(key)};
WriteOp<typename LogAddress::Value, ValueValue> W2 =
    {LogAddress::Value(log), ValueValue(value)};
flushOp(W1, W2);
WriteOp and flushOp are the customization points. Each WriteOp has a source and a
destination memory location. It performs a read and a write. flushOp executes
all write operations. As memory locations for read and write are described by
paths, they have to be evaluated first. They may require a sequence of plain
loads for local access or gets for remote access. Each WriteOp could execute the
assignment followed by a clwb, and the flushOp invokes an sfence to finish the
epoch. For our tree code, the largest epoch (in AVLT remove) invokes WriteOp
nine times.
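To illustrate the idea behind such path types (not the actual DSL), a
stripped-down expression-template version could look as follows; the offsets
and the pointer representation are placeholders:

#include <cstdint>

// Each type describes one access step; evaluate() resolves the final
// address at runtime against the current mmap base, making every
// intermediate load explicit.
struct Ctx { char* logBase; };            // remapped base after restart

struct RootInLog {
    static char* evaluate(const Ctx& c) {
        return *reinterpret_cast<char**>(c.logBase + 0 /* root offset */);
    }
};
template <class Prev> struct LeftInNode {
    static char* evaluate(const Ctx& c) {
        return *reinterpret_cast<char**>(Prev::evaluate(c) + 16 /* left */);
    }
};
template <class Prev> struct ColorInNode {
    static char* evaluate(const Ctx& c) {
        return Prev::evaluate(c) + 32 /* color offset */;
    }
};

// log->root->left->color expressed as a type:
using Addr = ColorInNode<LeftInNode<RootInLog>>;
// char* a = Addr::evaluate(ctx);  // the runtime can adjust each step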
holding the lock. In shared mode, the shrd field holds the number of clients shar-
ing the lock. The holders of the shared lock are anonymous, which hinders
identifying failed lock holders and getting the lock back into exclusive mode. Thus,
these designs are not well prepared for client failures. They also cannot guar-
antee fairness, because readers can always starve any writer trying to acquire
the write-lock. Gerstenberger et al. [22] designed a similar lock for one-sided
communication in MPI, but here even the holder of the write lock is anonymous.
However, one could argue that as of today the behavior of failed nodes in MPI
is deliberately unspecified. The same concepts are used for shared-memory as
well. Bit-vectors (32-bit or 64-bit) are split into reader and writer parts and
atomic operations are used to update the value [37].
To tolerate client failures, all lock holders have to store their pid. A stored
pid is the proof that the lock is taken. Our data structure’s design is shown in
the bottom part of Figure 10. For a k-reader single-writer lock with f -fairness,
we use an array of k 64 bit slots to hold the readers r0 –rk−1 , one slot for the
writer w, f slots for the waiting queue sl 0 –sl f −1 , and one slot for the outer lock
(lock owner ), which has to be acquired to modify the lock data structure itself.
Each entry is either 0 or the pid of the respective client. Using the lock-holder’s
pid to indicate whether a lock is taken allows clients to use failure detectors on
lock-holders. If a client wants to acquire a taken lock, it can either be taken
because of contention or because the holder failed. The client starts a failure
detector on the lock-holder and retries to acquire the lock to cover both cases.
Fairness (equal share and no starvation) would require a waiting queue with
sufficient capacity to hold all waiting clients, which is a theoretical but not a
practical solution. A compromise is f -fairness with a waiting queue of length f .
Clients in the queue are subject to fairness. They cannot overtake each other.
The first node is always the next to acquire its desired lock. However, we cannot
guarantee fairness for clients waiting to enter the queue.
When the desired lock becomes available, the process in the first slot of the
queue takes the outer lock, takes the desired lock, copies the other members of
the queue one step forward, sets the last element to zero, and releases the outer
lock. All operations require atomic CAS operations, as a crash of the writing
process during a non-atomic RDMA write may result in a slot with a valid pid
of a process not intending to hold a lock.
Due to the 64 bit size limitation of remote atomics, it is not possible to
shift the complete queue in a single CAS operation. Therefore, the outer lock
is needed to prevent other processes from interfering. During the shift, each
process in the queue is always stored at least once in it—either in the old and/or
new slot. If the shifting process fails, the fairness in the queue is preserved. On
success, the last slot becomes zero and is available for the next client. Entering
the queue does not require holding the outer lock. It is a CAS with a zero entry,
which may fail. A process holding a read or write lock can release it at any time
by zeroing its slot. It does not have to acquire the outer lock beforehand.
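The following sketch illustrates the lock layout and the acquisition of a reader
slot; the constants and names are ours and merely mirror the description
above:

#include <atomic>
#include <cstdint>

// k-reader single-writer lock with f-fairness. Every slot is 0 (free)
// or the holder's pid, so failure detectors can be pointed at holders.
constexpr int K = 8, F = 4;
struct RWLock {
    std::atomic<uint64_t> readers[K];   // r0 .. r(k-1)
    std::atomic<uint64_t> writer;       // w
    std::atomic<uint64_t> queue[F];     // sl0 .. sl(f-1), FIFO
    std::atomic<uint64_t> owner;        // outer lock for queue shifts
};

// Take a reader slot with a single 64-bit CAS from 0 to our pid. On
// failure the caller retries and may start a failure detector on the
// pids it observed.
bool tryReadLock(RWLock& l, uint64_t pid) {
    if (l.writer.load() != 0) return false;   // writer active
    for (auto& slot : l.readers) {
        uint64_t expect = 0;
        if (slot.compare_exchange_strong(expect, pid)) return true;
    }
    return false;                             // all slots taken
}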
11 Simulating Power-Failures
To validate our implementation, we simulate power failures by killing processes
with SIGKILL following the approach used by NV-Heaps [14]. We start a process
doing insert resp. remove operations. At a later time, we kill the process. Similar
to a power failure, we lose all transient data. As the process and thus the
mmapped address range does not exist anymore, dirty cache lines cannot be
written back. The process could have been killed at any instruction of any state
transition. Afterwards, we start a new process that recovers the current state
and progresses to the clean state. The two processes actually use the same code.
While the former assumes that it is in the clean state, the latter actually reads
the current state from the file.
The testing revealed an issue with idempotence. We killed a process during
a tree rotation. The recovery process obviously executed the tree rotation again,
but tree rotations were not idempotent at that time. As discussed in Sect. 5.1,
failures during tree rotations can lose sub-trees. The challenge with tree rota-
tions is that they read and write the same memory locations. In the example,
we could lose access to Q, the pivot, or B. To make them idempotent, we have
to store all read values in the redo log. Thus, we have to keep pointers to all
three of them in the redo log to separate the read and write sets. Since then,
we have run more than 2,000,000 tests without revealing any further issues.
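A test harness along these lines could look as follows; runWorkload and
recoverAndVerify are placeholders for our insert/remove driver and the
recovery check, and the kill delay is an arbitrary choice:

#include <chrono>
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

void runWorkload()      { /* inserts/removes against the NVRAM file */ }
void recoverAndVerify() { /* read state var, drive machine to clean,
                             then check balance, colors, node count */ }

int main() {
    for (int i = 0; i < 1000; ++i) {
        pid_t child = fork();
        if (child == 0) { runWorkload(); _exit(0); }
        std::this_thread::sleep_for(
            std::chrono::microseconds(std::rand() % 100000));
        kill(child, SIGKILL);          // simulated power failure
        waitpid(child, nullptr, 0);    // dirty cache lines are lost
        recoverAndVerify();            // fresh process recovers state
    }
}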
12 Evaluation
For all experiments with NVDIMM-N and Infiniband, we used one server with
two Intel Xeon Gold 6138 and one server with two Intel Xeon Silver 4116 CPUs.
Each server has 192 GiB main memory and two 16 GiB NVDIMM-N. They are
connected with an InfiniBand FDR network (ConnectX-3). We used CentOS
7.5, Clang 6.0.0, PMDK 1.5.1, and UCX 1.5.
For each measurement, we report the median of 1,000 samples. The 99
percent confidence interval (CI) is always within 1.5 percent of the reported
medians; extremely short runs show slightly larger percentages.
Red-Black Trees and AVL Trees are self-balancing binary search trees. Each
node has at most two children. There are no high-radix variants, so they do
not lend themselves to the optimizations commonly used for B and B+-trees.
In terms of raw throughput, Red-Black Trees and AVL Trees are simply not
competitive. Our contributions are not in the area of optimizations for speed;
instead, we designed a new transaction system with O(1) log-space, in-place
updates, and RDMA. Thus, we did not compare the performance of our
implementation with highly optimized B+-trees. We want to analyze our
optimizations and the scalability of our approach itself. We used the data
structures as shown in Figure 6 and Figure 7. The constant-size redo logs were
used for trees from 0 to 10^7 nodes.
Figure 11: Insert throughput for Red-Black Trees and AVL Trees both with
look ahead with NVDIMM-N (inserts per second over the number of inserted
keys, 10 to 10^6, for clwb; larger is better).
For small trees, the performance is worse than for larger trees (cf. Fig-
ure 11). For trees larger than 100 keys, the performance stabilizes. This could be
due to the fact that for small trees insert operations touch a larger share of the
tree, i.e., they flush out a large part of it. For larger trees, large parts
of the tree remain untouched and can be accessed without cache misses in the
next operation. Inserting a key on the left side of the tree evicts the nodes on
the path down from the caches. If the next insert is on the right side, there
will be only a low number of cache misses. If the tree is small, the two paths
will overlap and cause more cache misses.
According to Tarjan [58], Red-Black Trees and AVL Trees only need O(1)
balance operations per insert on average. With the optimizations described in
Sect. 8, we almost eliminate the logarithmic part of the insert operation. As
expected, the insert costs are independent of the size of the tree. It also shows
Figure 12: Remove throughput for Red-Black Trees and AVL Trees both without
look ahead with NVDIMM-N (removes per second over the number of removed
keys, 10 to 10^6, for clwb and clflushopt; larger is better).
again that Red-Black Trees are faster than AVL Trees. Red-Black Trees might
reduce the costs for insert by leaving the trees less balanced than AVL Trees, see
Sect. 5. Although Cormen et al. [18] and Adel'son-Vel'skii and Landis [1] suggest
that the throughput should decrease with the depth of the tree, we can keep it
constant.
The costs for remove are much higher than for insert (cf. Figure 12). Note
that we only tried to optimize insert operations. The expected costs per remove
are O(log N). Red-Black Trees and AVL Trees perform O(1) balance operations
per remove on average. As expected, AVL Trees are slower than Red-Black Trees,
but the gap is much smaller than for insert.
Figure 13: Insert throughput for Red-Black Trees and AVL Trees both with
look ahead with Intel Optane (inserts per second over the number of inserted
keys, 10 to 10^6, for clwb, clflushopt, and clflush; larger is better).
For the evaluation of their RBT code, Wang et al. [63] simulated NVDIMM,
STT-RAM, and PCM. For insert with 10^6 resp. 2^10 keys, they achieved 666,666,
166,666, and 52,600 inserts per second, respectively. This shows that the workload
is latency sensitive. For simulating NVDIMMs, they used plain DRAM. Our
hardware setup differs from theirs, and our NVDIMM-Ns run at a lower clock
than DRAM. With close to 400,000 inserts per second for RBTs, we come close
despite a completely different approach. Our AVL Trees are slightly slower
because they keep the tree more balanced and thus require more epochs.
Figure 14: Remove throughput for Red-Black Trees and AVL Trees both without
look ahead with Intel Optane (removes per second over the number of removed
keys, 10 to 10^6, for clwb and clflushopt; larger is better).
Figure 15: Insert throughput for Red-Black Trees and AVL Trees with and
without look-ahead (LA) using clwb for 10^6 inserted keys (larger is better;
measured values: 357.3 k, 153.7 k, 134.1 k, and 54.8 k inserts per second).
Table 1: Remote inserts via RDMA and software agent based flushing.

               AVLT                    RBT
           No Cache   Cache        No Cache   Cache
No Flush   2,145/s    2,150/s      2,091/s    2,542/s
Flush      1,867/s    1,866/s      1,922/s    2,312/s
The limited performance gain of the cache is due to the fact that accesses to the
tree are not cached. Furthermore, caching does not affect communication with
the software agent.
13 Related Work
Trees on NVRAM. A number of systems were already proposed to manage
tree data structures with persistent memory. A popular tree variant in this area
is the B+tree [16], which stores all values in the leaves. CDDS B-Tree [61], for
example, relies on 8-byte writes and a version system for inserts as long as free
slots are available in leaf nodes. Otherwise, it uses shadow copying to split the
node and to update the inner nodes. On recovery, it uses its version system to
discard all interrupted operations. Similarly, NV-Tree [65] stores all values in
leaf nodes. While the leaf nodes are stored in NVRAM, here, the inner nodes are
stored in DRAM and can be restored after a power failure. To further minimize
the cost of flushing, entries in leaf nodes are appended to the corresponding
leaf node and remain unsorted. Full leaf nodes are split using shadow copying.
Rebalancing is not done on-the-fly but as a separate operation that recreates
the inner nodes and makes them the current ones atomically. The wB+Tree [12]
also mitigates the costs of flushing by keeping node entries unsorted to avoid
entry movements on insert. This allows inserts with only a few 8-byte writes
and shadowing. Two B+tree algorithms exploiting weak memory models and
allowing temporal inconsistencies are FAST and FAIR [27]. The Bztree [7] is
a multi-threaded B+Tree. It relies on Persistent Multi-Word CAS (PMwCAS)
and an epoch-based garbage collection scheme.
For radix trees, WORT [32] maintains a tree shape independent of the inser-
tion order. It neither needs nor supports balancing. Inserting items on leaves
or leaf paths and pointer updates can be done with atomic 8-byte writes. For
more sophisticated adaptive radix trees, shadow copying is used.
Most research focuses on variants of B-trees with often more than two chil-
dren and optimizations for external memory. In most cases, new keys can be
added in leaf nodes with a few flushes without any re-balancing.
Red-Black Trees in NVRAM were first discussed by Wang et al. [63]. While
they implemented a more complex kind of tree than earlier work, they still
relied on shadow copying and a versioning system to distinguish between the
copy and the real tree. They maintain something close to a shadow tree with
some effort to minimize overhead. Operations are performed on the shadow
tree; an atomic pointer update switches between the shadow and the current
tree. Wang et al. [63] also show that update operations on RBTs are not local
operations: it does not suffice to do shadow copying on individual nodes, which
instead requires shadow copying larger fractions of the tree.
All these systems handle requests in a blocking way and operations often
become visible with the last 8-byte atomic pointer update. After a crash, it
is challenging for them to identify aborted resp. the last successful operation.
Highly desirable properties such as exactly once semantics, see Sect. 4.3, are
hard to achieve. In contrast, with our state machine approach, we can guaran-
tee eventual success after accepting the command into the redo log and support
exactly once semantics.
Transactions on NVRAM. Systems supporting generic transaction pro-
cessing on NVRAM based on redo logging are, for example, SoftWrAP [23] and
DudeTx [33]. They use a mix of shadow-memory and redo logging where all
memory accesses during a transaction are aliased into a volatile memory region
and writes are stored in a persistent redo log immediately or when all work
of the transaction is done. For DudeTx, the redo log is then applied to the
actual data stored in persistent memory in a final step. With language exten-
sions, Mnemosyne [62] provides primitives for working with persistent memory.
Variables can be marked as persistent. Code regions marked as atomic will
be executed with durable transactions. It hooks into a lightweight software
transaction system to implement write-ahead redo logging.
Other systems base their transaction system on write-ahead and undo log-
ging [52]. First, all store operations are written to the undo log before the real
transaction is executed. In case of a power failure, uncompleted transactions
are rolled back. The NV-Heaps system [14] provides its own heap manager, spe-
cialized pointers, and atomic sections for persistent memory. For transactions,
it keeps a volatile read log and a non-volatile write log. In case of an abort or
power failure, it rolls back all changes.
Here, the literature seems to be undecided between undo and redo logging.
However, undo logging requires more flush operations than redo logging.
Shadow memory is a neat way to exploit the fact that there are two
types of memory available—volatile and non-volatile—with different perfor-
mance characteristics. It provides isolation and can leverage the benefits of
caches. Our approach of in-place updates is seldom found in the literature.
Logging in Databases. The quasi-standard algorithm for write-ahead
logging (WAL) with no-force and steal policies, ARIES [39], has influenced the
design of many commercial databases. It is optimized for spinning disks and
maintains an append-only log. The log contains undo and redo records. For
recovery, it goes through 3 phases: (a) analyze the log for uncommitted and
aborted transactions, (b) redo finishable transactions, and (c) undo the remain-
ing transactions. In our approach, we only use a fixed-size redo log. While ARIES
uses write-only WAL, we read the log during epochs to facilitate idempotence.
Our analysis phase simply identifies the current state and continues from there.
While ARIES optimizes for sequential writes, MARS [15] exploits the fact
that SSDs support high random-access performance. It introduces the concept
of editable atomic writes (EAW), which are essentially redo logs. The full trans-
action is executed in a redo record and on commit the system applies the trans-
action atomically. On failure, it can simply reapply the redo log. In contrast,
we write the data directly into the data structure. The transaction becomes
re-doable because of the redo log created in the previous epoch. We always split
large transactions into a sequence of micro-transactions. For NVRAM, redo
logs provide lower costs. They reduce the number of flushes in contrast to undo
logs. We can also avoid complex log pruning mechanisms, because the log has
a fixed size and every operation re-uses the log of the previous operation.
14 Discussion of Correctness
General approach. It is a standard technique in compiler construction to convert
code into control flow graphs with basic blocks. We execute this representation
with a state machine: it tracks which basic block is currently executed and
which blocks are legal successors. State transitions correspond to the execution
of basic blocks. Large basic blocks can be split into a sequence of smaller ones
without changing the algorithm. Additionally, splitting a basic block into two
parts, (a) reading from the data structure and writing to the log, and
(b) reading the log and updating the data structure, makes both parts
idempotent. This is a common technique in databases with appropriate logging.
The size of a basic block, i.e., its number of stores, determines the size
of the log.
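A minimal sketch of one micro-transaction under this split might look as follows, assuming a persistent state field and a single buffered store; all names, including the compute_update function, are illustrative.

```cpp
#include <cstdint>

// Illustrative states, one per basic block of the micro-transaction.
enum class State : uint64_t { ReadToLog, LogToData, Done };

struct MicroTx {
    State state;            // persisted after every transition
    uint64_t logged_value;  // the redo record for the pending store
    uint64_t* target;       // field in the data structure to update
};

// A hypothetical pure update function; any deterministic computation works.
static uint64_t compute_update(uint64_t old_value) { return old_value + 1; }

// Basic block (a): read from the data structure, write to the log. Re-running
// it after a crash recomputes the same record, because the data structure is
// untouched until the state transition below is durable. Hence it is idempotent.
void read_to_log(MicroTx& tx, const uint64_t* src) {
    tx.logged_value = compute_update(*src);
    tx.state = State::LogToData;   // persist the record and the state here
}

// Basic block (b): read the log, update the data structure. Re-running it
// re-applies the same value, so it is idempotent as well.
void log_to_data(MicroTx& tx) {
    *tx.target = tx.logged_value;
    tx.state = State::Done;        // persist the state here
}
```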
Red-Black Trees and AVL Trees. In Sect. 11, we discussed our testing approach.
After each injected crash, we verified that we can recover and return to the
clean state. Furthermore, after recovery we checked that the tree is correct,
i.e., correctly balanced and colored. For all experiments in Sect. 12, we
inserted k keys into an empty tree and removed the same k keys in a different
order. An incorrect state machine would yield corrupt trees; since all trees
remained valid, this indicates that the state machines are correct. During
development, we tested the state machines extensively: after each insert or
remove operation, we verified that the tree is correct and that the number of
nodes matches the expected count.
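Such a post-recovery check can be sketched as follows for the Red-Black case; the node layout is illustrative, and the key-ordering and root-color checks are omitted for brevity.

```cpp
#include <cstddef>
#include <stdexcept>

struct Node { Node* left; Node* right; bool red; };

// Returns the black height of the subtree rooted at n and counts its nodes;
// throws if a red node has a red child or if black heights diverge.
int check(const Node* n, size_t& count) {
    if (n == nullptr) return 1;   // nil leaves count as black
    ++count;
    if (n->red && ((n->left && n->left->red) || (n->right && n->right->red)))
        throw std::runtime_error("red node with red child");
    int lh = check(n->left, count);
    int rh = check(n->right, count);
    if (lh != rh) throw std::runtime_error("black heights differ");
    return lh + (n->red ? 0 : 1);
}
```

The node count accumulated in `count` can then be compared against the expected number of keys after each insert or remove.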
15 Discussion of Limitations
For remote access, message passing might in some cases provide higher
performance than RDMA, but it is orthogonal to the shared-memory approach used
for local access. Supporting both would require two completely different
transaction systems, whereas our goal was to design a single transaction system
for local and remote usage. As shown in Sect. 9, we abstract from local and
remote access and use one common implementation for both.
For B and B+ trees, insert, remove, and balancing are operations of limited
scope. They do not need transactions. The algorithms for these trees are of
low complexity, and the corresponding state machines would be tiny. They are
tuned for absolute performance. For performance reasons, the literature has
favored high-radix trees with low balancing costs. However, for Red-Black Trees
and AVL Trees, balancing is the common case [58]. In raw performance, Red-Black
Trees and AVL Trees are simply not competitive; they serve other demands.
As discussed before, data structures that need auxiliary space beyond O(1)
cannot be supported with a constant-size log. Allocating additional memory
would violate our assumptions. The only remaining option is to store the
auxiliary data in the data structure itself. AVL remove needs a stack of size
O(log n). We instead use the common technique of maintaining pointers to
parents. In general, this would induce space overhead in each tree node. Our
initial data layout for the Red-Black Trees was 32 bytes and had sufficient
unused space to add the up-pointers (Up and Dir) without changing its size,
see Figure 5. Insert and remove in trees can often be implemented with
top-down algorithms, which only need constant-sized auxiliary space. Linked
lists and hash tables also need only constant-sized auxiliary space.
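A hedged sketch of such a packing is shown below; the exact field widths and the use of 32-bit node indices are our guesses for illustration, the authoritative layout is the one in Figure 5.

```cpp
#include <cstdint>

// Key, value, and the two children fill most of the 32 bytes; Up and Dir fit
// into the remaining bits, so the node size does not grow.
struct RBNode {
    uint64_t key;          // 8 bytes
    uint64_t value;        // 8 bytes
    uint32_t left;         // child indices into the node pool
    uint32_t right;
    uint32_t up;           // Up: parent index, replacing the O(log n) stack
    uint32_t color : 1;    // red or black
    uint32_t dir   : 1;    // Dir: which child of its parent this node is
    uint32_t unused: 30;
};
static_assert(sizeof(RBNode) == 32, "layout must stay within 32 bytes");
```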
Atomic CAS for NVRAM [34, 44] brings its own challenges. Atomic operations
are commonly used because (a) they do not tear and (b) they provide protection
against concurrent access. In our approach, we are only interested in the
former property, because we use locks for thread safety. In this paper, we
therefore treat CAS more like a read, modify, and atomic 8-byte store operation
(cf. Sect. 2), as there is no contention.
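Read this way, the operation reduces to the following sketch; persist() and atomic_update() are illustrative helpers, assuming x86 CLWB/SFENCE as in the earlier sketch.

```cpp
#include <immintrin.h>
#include <atomic>
#include <cstdint>

// Make one 8-byte word durable: write back its cache line, then fence.
static void persist(void* addr) {
    _mm_clwb(addr);
    _mm_sfence();
}

// Under the single-writer assumption, the CAS cannot lose a race. We use it
// only because the aligned 8-byte store is guaranteed not to tear; durability
// is added by persisting the word afterwards.
bool atomic_update(std::atomic<uint64_t>& word,
                   uint64_t expected, uint64_t desired) {
    bool ok = word.compare_exchange_strong(expected, desired);
    if (ok) persist(&word);
    return ok;
}
```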
16 Conclusion
We presented a new transaction system for complex data structures in NVRAM
that provides exactly once semantics and durable linearizability with a redo
log of constant size. It splits large transactions into smaller
micro-transactions and uses a state machine approach to perform them step-wise.
Every accepted transaction will eventually succeed and is never aborted. For
local and remote access, we use the same primitives: load, store, atomic
update, and persist. This allowed us to design one transaction system that runs
both locally and over InfiniBand for remote access. As our approach does not
handle concurrent access by itself, we use locks to control concurrency. For
remote access, we designed a fault-tolerant lock with f-fairness.
Wang et al. [63] presented the first Red-Black Tree implementation for NVRAM,
but their approach is based on shadowing the whole tree. We presented, to the
best of our knowledge, the first AVL Tree implementation for NVRAM and the
first Red-Black Tree implementation for NVRAM without shadowing and with
updates ‘in-place’. These trees are algorithmically far more challenging than
the trees covered in the literature so far. Insert and remove are global
operations instead of a sequence of operations with limited scope. Thus, they
need transactions.
Shadowing can guarantee that the data structure appears consistent at all
times: it atomically replaces parts of the data structure with new data, so
intermediate steps are never visible. Wang et al. [63] atomically replace the
old tree with the new one. There is no need for recovery, but this approach
fails to provide exactly once semantics. In our approach, the data structure
might be inconsistent after a crash, but it is recoverable at all times. We
see recovery as finishing the interrupted operation, i.e., moving forward to
the clean state, and we thereby support exactly once semantics. By using a
constant-sized log, we avoid any overhead for dynamic log allocation, log
pruning, and keeping a shadow copy of the whole tree.
17 Availability
Our code is available on GitHub under the Apache License 2.0:
https://github.com/tschuett/transactions-on-nvram
Acknowledgments
The authors thank ZIB’s Supercomputing department and ZIB’s core facilities
unit for providing the machines and infrastructure for the evaluation. This
work received funding from the German Research Foundation (DFG) under
grant RE 1389 as part of the DFG priority program SPP 2037 (Scalable data
management for future hardware). This work is partially supported by In-
tel Corporation within the Research Center for Many-core High-Performance
Computing (Intel PCC) at ZIB.
References
[1] Georgy Adel’son-Vel’skii and Evgenii Landis. An algorithm for the organi-
zation of information. Dokl. Akad. Nauk SSSR, 146:263–266, 1962.
[2] Marcos K. Aguilera and Douglas B. Terry. The many faces of consistency.
IEEE Data Eng. Bull., 39(1):3–13, 2016. URL http://sites.computer.
org/debull/A16mar/p3.pdf.
[3] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles,
Techniques, and Tools. Addison-Wesley, 1986.
[4] Frances E. Allen. Control flow analysis. SIGPLAN Not., 5(7):1–19, July
1970. ISSN 0362-1340. doi: 10.1145/390013.808479. URL http://doi.
acm.org/10.1145/390013.808479.
[5] Joe Armstrong. Programming Erlang: Software for a Concurrent World.
Pragmatic Bookshelf, 2013. ISBN 193778553X, 9781937785536.
[6] Joy Arulraj, Andrew Pavlo, and Subramanya R. Dulloor. Let’s talk about
storage & recovery methods for non-volatile memory database systems.
In Proceedings of the 2015 ACM SIGMOD International Conference on
Management of Data, pages 707–722, New York, NY, USA, 2015. ACM.
[7] Joy Arulraj, Justin Levandoski, Umar Farooq Minhas, and Per-Åke Larson.
BzTree: A high-performance latch-free range index for non-volatile
memory. Proc. VLDB Endow., 11(5):553–565, January 2018. ISSN 2150-
8097. doi: 10.1145/3187009.3164147. URL https://doi.org/10.1145/
3187009.3164147.
[8] Rudolf Bayer. Symmetric binary B-trees: Data structure and maintenance
algorithms. Acta Informatica, 1(4):290–306, 1972.
[9] Rudolf Bayer and Edward McCreight. Organization and maintenance of
large ordered indexes. Acta Informatica, 1(3):173–189, Sep 1972. ISSN
1432-0525. doi: 10.1007/BF00288683. URL https://doi.org/10.1007/
BF00288683.
[10] Kumud Bhandari, Dhruva R. Chakrabarti, and Hans-J. Boehm. Makalu:
Fast recoverable allocation of non-volatile memory. In Proceedings of the
2016 ACM SIGPLAN International Conference on Object-Oriented Pro-
gramming, Systems, Languages, and Applications, OOPSLA 2016, pages
677–694, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4444-9.
[11] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest
failure detector for solving consensus. Journal of the ACM (JACM), 43(4):
685–722, 1996.
[12] Shimin Chen and Qin Jin. Persistent b+-trees in non-volatile main mem-
ory. Proc. VLDB Endow., 8(7):786–797, February 2015. ISSN 2150-
8097. doi: 10.14778/2752939.2752947. URL http://dx.doi.org/10.
14778/2752939.2752947.
[13] Yeounoh Chung and Erfan Zamanian. Using RDMA for lock management.
CoRR, abs/1507.03274, 2015. URL http://arxiv.org/abs/1507.03274.
[14] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Ra-
jesh K. Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making
persistent objects fast and safe with next-generation, non-volatile memo-
ries. SIGPLAN Not., 46(3):105–118, March 2011. ISSN 0362-1340.
[15] Joel Coburn, Trevor Bunker, Meir Schwarz, Rajesh Gupta, and Steven
Swanson. From ARIES to MARS: Transaction support for next-generation,
solid-state drives. In Proceedings of the Twenty-Fourth ACM Symposium
on Operating Systems Principles, SOSP ’13, pages 197–212, New York, NY,
USA, 2013. ACM. ISBN 978-1-4503-2388-8.
[16] Douglas Comer. The ubiquitous B-tree. ACM Comput. Surv., 11(2):121–
137, 1979. doi: 10.1145/356770.356776. URL http://doi.acm.org/10.
1145/356770.356776.
[17] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek,
Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O through
byte-addressable, persistent memory. In Proceedings of the ACM SIGOPS
22nd Symposium on Operating Systems Principles, SOSP ’09, pages 133–
146, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-752-3.
[18] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford
Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd
edition, 2009. ISBN 0262033844, 9780262033848.
[19] Pierre-Jacques Courtois, Frans Heymans, and David Lorge Parnas. Con-
current control with readers and writers. Commun. ACM, 14(10):667–668,
October 1971. ISSN 0001-0782.
[20] Damian Dechev, Peter Pirkelbauer, and Bjarne Stroustrup. Under-
standing and effectively preventing the ABA problem in descriptor-
based lock-free designs. In 2010 13th IEEE International Symposium
on Object/Component/Service-Oriented Real-Time Distributed Computing,
pages 185–192. IEEE, 2010.
[21] Michal Friedman, Maurice Herlihy, Virendra J. Marathe, and Erez Petrank.
A persistent lock-free queue for non-volatile memory. In Proceedings of the
23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, PPoPP 2018, Vienna, Austria, February 24-28, 2018, pages
28–40, New York, NY, USA, 2018. ACM.
[22] Robert Gerstenberger, Maciej Besta, and Torsten Hoefler. Enabling Highly-
Scalable Remote Memory Access Programming with MPI-3 One Sided. In
Proceedings of the International Conference on High Performance Comput-
ing, Networking, Storage and Analysis, pages 53:1–53:12, New York, NY,
USA, Nov. 2013. ACM. ISBN 978-1-4503-2378-9.
[23] Ellis R. Giles, Kshitij Doshi, and Peter Varman. SoftWrAP: A lightweight
framework for transactional support of storage class memory. In Mass
Storage Systems and Technologies (MSST), 2015 31st Symposium on, pages
1–14, New York, NY, USA, 2015. IEEE Computer Society.
[24] Leo J. Guibas and Robert Sedgewick. A dichromatic framework for bal-
anced trees. In 19th Annual Symposium on Foundations of Computer Science,
pages 8–21, New York, NY, USA, 1978. IEEE Computer Society.
[29] Intel Corp. Intel 64 and IA-32 architectures optimization reference manual,
September 2019.
[30] Joseph Izraelevitz, Hammurabi Mendes, and Michael L. Scott. Linearizabil-
ity of persistent memory objects under a full-system-crash failure model.
In Cyril Gavoille and David Ilcinkas, editors, Distributed Computing, pages
313–327, Berlin, Heidelberg, 2016. Springer Berlin Heidelberg. ISBN 978-
3-662-53426-7.
[41] Jörg Nievergelt and Edward Reingold. Binary search trees of bounded
balance. SIAM Journal on Computing, 2(1):33–43, 1973. doi: 10.1137/
0202005.
[42] Henk J. Olivié. A new class of balanced search trees: half-balanced binary
search trees. RAIRO. Informatique théorique, 16(1):51–71, 1982.
[43] Ismail Oukid, Daniel Booss, Adrien Lespinasse, Wolfgang Lehner, Thomas
Willhalm, and Grégoire Gomes. Memory management techniques for
large-scale persistent-main-memory systems. PVLDB, 10(11):1166–1177,
8 2017. doi: 10.14778/3137628.3137629. URL http://www.vldb.org/
pvldb/vol10/p1166-oukid.pdf.
[44] Matej Pavlovic, Alex Kogan, Virendra J. Marathe, and Tim Harris. Brief
announcement: Persistent multi-word compare-and-swap. In Proceedings of
the 2018 ACM Symposium on Principles of Distributed Computing, PODC
’18, pages 37–39, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-
5795-1. doi: 10.1145/3212734.3212783. URL http://doi.acm.org/10.
1145/3212734.3212783.
[45] Fernando Magno Quintao Pereira and Jens Palsberg. Register allocation
after classical ssa elimination is np-complete. In International Conference
on Foundations of Software Science and Computation Structures, pages
79–93. Springer, 2006.
[46] Marius Poke and Torsten Hoefler. DARE: High-performance state machine
replication on RDMA networks. In Proceedings of the 24th International Sym-
posium on High-Performance Parallel and Distributed Computing, pages
107–118. ACM, 2015.
[47] William N. Scherer III and Michael L. Scott. Advanced contention man-
agement for dynamic software transactional memory. In Proceedings of the
twenty-fourth annual ACM symposium on Principles of distributed comput-
ing, pages 240–248, New York, NY, USA, 2005. ACM.
[51] David Schwalb, Tim Berning, Martin Faust, Markus Dreseler, and Hasso
Plattner. nvm malloc: Memory allocation for NVRAM. In Rajesh Bor-
dawekar, Tirthankar Lahiri, Bugra Gedik, and Christian A. Lang, editors,
ADMS@VLDB, pages 61–72, 2015. URL http://dblp.uni-trier.de/db/
conf/vldb/adms2015.html#SchwalbBFDP15.
[52] Seunghee Shin, James Tuck, and Yan Solihin. Hiding the long latency of
persist barriers using speculative execution. In Proceedings of the 44th An-
nual International Symposium on Computer Architecture, ISCA ’17, pages
175–186, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4892-8. doi:
10.1145/3079856.3080240.
[53] Jan Skrzypczak, Florian Schintke, and Thorsten Schütt. Linearizable state
machine replication of state-based CRDTs without logs. In Proceedings of
the 2019 ACM Symposium on Principles of Distributed Computing, pages
455–457, 2019.
[54] Jan Skrzypczak, Florian Schintke, and Thorsten Schütt. RMWPaxos:
Fault-tolerant in-place consensus sequences. IEEE Transactions on Parallel
and Distributed Systems, 31(10):2392–2405, 2020.
[55] Daniel Sleator, Robert Tarjan, and William Thurston. Rotation distance,
triangulations, and hyperbolic geometry. In Proceedings of the Eighteenth
Annual ACM Symposium on Theory of Computing, STOC ’86, pages 122–
135, New York, NY, USA, 1986. ACM. ISBN 0-89791-193-8. doi: 10.1145/
12130.12143. URL http://doi.acm.org/10.1145/12130.12143.
[56] Storage Networking Industry Association. NVM PM Remote Access for
High Availability, February 2016.
[57] Storage Networking Industry Association. NVM Programming Model v1.2,
June 2017.
[58] Robert Endre Tarjan. Updating a balanced search tree in O(1) rotations.
Information Processing Letters, 16(5):253–257, 1983. ISSN 0020-0190.
[59] Athanasios K. Tsakalidis. Rebalancing operations for deletions in AVL-trees.
RAIRO-Theoretical Informatics and Applications-Informatique Théorique
et Applications, 19(4):323–329, 1985.
[60] David Vandevoorde and Nicolai M. Josuttis. C++ Templates: The Com-
plete Guide. Addison-Wesley Professional, 1 edition, November 2002. ISBN
9780201734843.
[61] Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and
Roy H. Campbell. Consistent and durable data structures for non-volatile
byte-addressable memory. In Proceedings of the 9th USENIX Conference
on File and Storage Technologies, FAST’11, pages 5–5, Berkeley, CA, USA,
2011. USENIX Association. ISBN 978-1-931971-82-9.
[62] Haris Volos, Andres Jaan Tack, and Michael M. Swift. Mnemosyne:
lightweight persistent memory. In Rajiv Gupta and Todd C. Mowry, ed-
itors, Proceedings of the 16th International Conference on Architectural
Support for Programming Languages and Operating Systems, ASPLOS
2011, Newport Beach, CA, USA, March 5-11, 2011, pages 91–104. ACM,
2011. ISBN 978-1-4503-0266-1. doi: 10.1145/1950365.1950379. URL
https://doi.org/10.1145/1950365.1950379.
[63] Chundong Wang, Qingsong Wei, Lingkun Wu, Sibo Wang, Cheng Chen,
Xiaokui Xiao, Jun Yang, Mingdi Xue, and Yechao Yang. Persisting RB-Tree
into NVM in a consistency perspective. ACM Trans. Storage, 14(1):
6:1–6:27, February 2018. ISSN 1553-3077. doi: 10.1145/3177915. URL
http://doi.acm.org/10.1145/3177915.
[64] Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steve
Swanson. An empirical guide to the behavior and use of scalable persistent
memory. In 18th USENIX Conference on File and Storage Technologies
(FAST 20), pages 169–182, 2020.
[65] Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong,
and Bingsheng He. NV-Tree: Reducing consistency cost for NVM-based
single level systems. In 13th USENIX Conference on File and Storage
Technologies (FAST 15), pages 167–181, Santa Clara, CA, 2015. USENIX
Association. ISBN 978-1-931971-201.