Transactions On Red-Black and AVL Trees in NVRAM
Thorsten Schütt Florian Schintke Jan Skrzypczak
Zuse Institute Berlin
Abstract
Byte-addressable non-volatile memory (NVRAM) supports persistent
storage with low latency and high bandwidth. Complex data structures in
it ought to be updated transactionally, so that they remain recoverable at
all times. Traditional database technologies such as keeping a separate log,
a journal, or shadow data work on a coarse-grained level, where the whole
transaction is made visible using a final atomic update operation. These
methods typically incur significant additional space overhead and induce
non-trivial costs for log pruning, state maintenance, and resource (de-)allocation.
Thus, they are not necessarily the best choice for NVRAM,
which supports fine-grained, byte-addressable access.
We present a generic transaction mechanism to update dynamic com-
plex data structures ‘in-place’ with a constant memory overhead. It is
independent of the size of the data structure. We demonstrate and eval-
uate our approach on Red-Black Trees and AVL Trees with a redo log of
constant size (4 resp. 2 cache lines). The redo log guarantees that each ac-
cepted (started) transaction is eventually executed despite arbitrarily many
system crashes and recoveries in the meantime. We update complex data
structures in local and remote NVRAM providing exactly once semantics
and durable linearizability for multi-reader single-writer access. To persist
data, we use the available processor instructions for NVRAM in the local
case and remote direct memory access (RDMA) combined with a software
agent in the remote case.
1 Introduction
The introduction of NVRAM enables a new range of applications, but it also
causes new challenges for their effective use. NVRAM is the first non-volatile
storage providing byte-granular access with low-latency and high bandwidth.
In addition, it will replace SSDs as the fastest persistent storage in the storage
hierarchy. While SSDs only provide block-oriented APIs, NVRAM comes as
a standard DIMM. Plain loads and stores suffice to get direct access (DAX)
while bypassing the operating system. For recoverability and consistency of
data structures, it becomes relevant when, in which order, and which part of
them will be written to NVRAM from the processor’s caches—either explicitly
by instructions or implicitly by cache evictions. This is influenced by aspects
such as data alignment, weak memory models, and cache properties such as
associativity, size, and its replacement and eviction policy. To cope with all these
aspects, complex data structures stored in NVRAM must be recoverable at all
times, which requires new and sound transactional update mechanisms.
The literature on NVRAM has mostly focused on B+trees [16] with
a high radix [12, 27] to minimize the costs of insert and remove. With a high
radix, expensive operations like balancing happen seldom. Insert and remove
are operations on leaves. Thus, they do not have to be transactional and only
require a few persist calls. If the sequence of calls is interrupted by a crash, the
tree remains valid. These trees are tuned for absolute performance.
In contrast, balancing is the common case for Red-Black Trees and AVL
Trees [58]. Insert and remove always need an unpredictable and variable se-
quence of operations, i.e., balancing, recoloring, and updating the balance fac-
tors. The sequence depends on the size and the shape of the tree, as these
operations work across several levels of it. For NVRAM, the steps of the sequence have to happen
atomically despite an arbitrary number of crashes and restarts. Otherwise, the
tree might become invalid, unrecoverable, and may lose sub-trees. Thus, trans-
actions are needed [50].
The literature so far has focused on a copy-on-write style for updating data
structures. It comes with additional costs for allocating, de-allocating, and
garbage collection. For B+ trees, a constant amount of memory is needed, i.e.,
the size of a node, which simplifies the process. As Wang et al. [63] noted, for
Red-Black Trees copy-on-write operations touch almost the complete tree. One
needs to allocate and de-allocate a variable amount of memory for operations.
We split updates into a sequence of micro-transactions, so that a redo log
of constant size suffices for all operations. We neither allocate nor de-allocate
memory for operations as all updates are in-place [54]. Our approach shows its strengths for complex
data structures where updates are global operations and touch large parts of
the data structure. There is no doubt that performance-wise B+-trees beat
binary trees. It is by design. However, binary trees represent a wider class of
dynamic data structures using pointers. For binary trees, we needed to invent
new methodologies for storing data structures in NVRAM that widely differ
from those for B+-trees. These methodologies could also be applied to data structures such as (doubly)
linked lists, priority queues, or graphs. Trees are often used as proxies for index
data structures [49, 48].
As NVRAM behaves like memory rather than a spinning hard-disk, we can
use remote direct memory access (RDMA) of modern interconnects to directly
access NVRAM on remote nodes. Local and remote access can rely on a com-
mon set of operations. For NVRAM, access can be expressed in terms of read,
write, atomic compare&swap (CAS), and persist operations. For remote access,
get, put, remote atomic CAS, and remote persist of the passive target communi-
cation model defined in the MPI standard [38] can be used. For passive target
communication, the origin process can access the target’s memory without in-
volvement of the target process. It is similar to a shared memory model and
allows the design of a single transaction system based on common primitives for
both local and remote NVRAM.
We support exactly once operations [21] on dynamic complex data structures
in local and remote NVRAM and make the following main contributions:
• We designed a new transaction system for NVRAM, which splits large
multi-step transactions into a sequence of micro-transactions. A state
machine describes the transaction and the sequence of micro-transactions.
Each micro-transaction resp. state transition is idempotent, allowing atom-
icity and recovery in the failure case for multi-step transactions. All up-
dates happen directly on the data structure ‘in-place’ without shadow
copying. It incrementally transforms the old data structure into the new
one. All accepted operations will eventually succeed (Sect. 4).
• A redo log of constant size (four resp. two cache lines for Red-Black Trees
and AVL Trees) is used to guarantee recoverability and atomicity at all
times. Note that the size of the redo log is independent of the size of the
data structure (Sect. 4.1).
• Our approach supports exactly once semantics and guarantees durable
linearizability for all operations, both local and remote. Failed clients
cannot corrupt any data (Sect. 4.4).
• We implemented balanced Red-Black Trees and AVL Trees in NVRAM
using our approach—local and with passive target communication for re-
mote access (Sect. 6).
• Intel guarantees 8-byte fail-safe atomicity for NVRAM. For our approach,
7 bytes suffice as we do not rely on atomic pointer updates (Sect. 4).
• We designed a multi-reader single-writer lock with f -fairness to coordinate
concurrent RDMA writes. Failed lock-holders can be safely expelled by
other processes, because their process ids are stored in the lock, which
allows other clients to use failure detectors (Sect. 10).
• We simulated more than 2,000,000 power failures by killing processes to
validate the robustness of our approach (Sect. 11).
• Our evaluation shows more than 2,300 key-value pair inserts per second into
Red-Black Trees using passive target communication with NVRAM. For AVL
Trees, we reached more than 1,800 inserts per second (Sect. 12.4). For local access,
we reached almost 400,000 inserts per second (Sect. 12.1).
2 System Model
We assume a full-system failure model [30]. On a crash, all transient state (of
all processes) is lost. Only operations on fundamental, naturally aligned data
types of up to 8 bytes are fail-safe atomic in NVRAM, but 7 bytes are enough for
our approach.
While RDMA operations can fail non-atomically, we assume 64 bit RDMA
CAS operations to be atomic. To detect failed nodes, we use the weak failure
detector ♦W [11]. We consider a system with a single server storing data without
replication for simplicity. An arbitrary number of read/write clients may try to
access the data concurrently. We do not consider Byzantine failures.
3 Preliminaries
As discussed above, hardware only supports atomic updates of 8 bytes. In the
following, we describe the basic concepts needed for larger updates and describe
in detail the methods for persisting data with NVRAM.
3.1 Logging and shadow copying
As long as updates are atomic, i.e., 8 bytes for NVRAM or a block for SSDs, they
can be done in-place. For non-atomic updates, transaction systems [39, 63, 6]
use a combination of different techniques to preserve consistency in the face
of crashes. Logging uses undo and redo logs to store enough data to roll back
an interrupted transaction (undo) or to retry it (redo). Undo logging tends to
be more costly, as it has to log every store before executing it. Thus, redo
logging is the preferred technique. Some databases [6] use a
combination of both. Shadow copying, also known as copy-on-write, creates a
copy of the data to be updated, updates the copy, and atomically replaces the
old data with the new data. For example, to update a tree node, a copy of the
node is created and updated, and then the parent's pointer to the node is
swapped. Often, it is sufficient to replace one 8-byte pointer in this last step,
which can be done atomically.
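As an illustration, the following sketch shows shadow copying for a single tree
node in C++. The persist helper and all names are ours; they merely stand in
for the platform's flush-and-fence primitives and a real allocator:

#include <atomic>
#include <cstddef>

struct Node { long key; long value; Node* left; Node* right; };

// Stand-in for the persist primitive (flush affected lines + fence).
void persist(const void*, std::size_t) {}

// Shadow copying: build a durable copy off to the side, then publish
// it with one atomic 8-byte pointer store. The old node must later be
// reclaimed, which is where the (de-)allocation overhead comes from.
void shadowUpdate(std::atomic<Node*>& parentSlot, long newValue) {
    Node* old  = parentSlot.load();
    Node* copy = new Node(*old);      // 1. create the shadow copy
    copy->value = newValue;           // 2. update the copy
    persist(copy, sizeof(*copy));     //    make the copy durable
    parentSlot.store(copy);           // 3. atomic pointer replacement
    persist(&parentSlot, sizeof(parentSlot));
}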
4 Exactly once operations with Micro-Transactions
(µ-Tx) and State Machines
Performing updates on complex data structures often requires a sequence of
smaller operations (recoloring, balancing, node splitting, etc.). NVRAM makes
it challenging to perform them correctly in the face of power losses as only
aligned stores up to 8 bytes are fail-safe atomic with current hardware. All
larger operations require transactions. Otherwise, it is unknown which updates
are persistent in the failure case, i.e., reached the persistence domain. This can
fatally corrupt the data. Traditional techniques to address this problem are
shadowing, copy-on-write, and logging (see Sect. 3.1).
Aims. Inserting or removing elements from trees, for example, often re-
quires a sequence of operations, such as tree rotations (see Sect. 5.1). We want
to support such complex data structure updates atomically ‘in-place’. We di-
rectly update the data structure without shadow copying but with a redo log of
constant size independent of the size of the overall data structure. The structure
and actual size of such a constant-size redo log depends on the particular data
structure and operations to be supported. In Sect. 6, we show some examples
for Red-Black Trees and AVL Trees.
Approach. In general, we split an operation to be performed on a complex
data structure into a sequence of smaller operations, which we execute in micro-
transactions (µ-Tx) until the whole operation is finished. We want to be able
to identify the ongoing operation (insert or remove), detect the progress in
that operation, perform updates atomically, minimize the size of the redo log,
and guarantee that all accepted operations will eventually complete. We need
the following components (see Figure 6): (1) the primary data structure D of
potentially dynamic size that we want to update, (2) a redo log L of constant
size, and (3) a state machine M with S states describing the sequence of updates
on D and L. D and L can be seen as disjoint sets of byte ranges. D, L, and the
current state of M are stored in NVRAM.
For non-trivial updates, the idea is to establish a two-step mode of operation
repeatedly: First, persist all information that is required to perform the oper-
ation on D in the redo log L. Afterwards, perform the operation and persist it.
Each step is idempotent until the next micro-transaction begins. To separate
them from each other, a state variable is updated atomically between steps to
indicate which step finished last.
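The following simplified sketch illustrates this two-step pattern for a single
micro-transaction. All types and names (RedoLog, microTx, persist) are
illustrative stand-ins, not the paper's actual implementation, which uses the
DSL from Sect. 9:

#include <atomic>
#include <cstdint>
#include <cstddef>

struct RedoLog { uint64_t src; uint64_t dst; };
struct Data    { uint64_t cells[16]; };

// Stand-in for the persist primitive (e.g., clwb of the affected
// cache lines followed by sfence).
void persist(const void*, std::size_t) {}

enum State : uint64_t { CLEAN = 0, LOGGED = 1 };

// One micro-transaction moving cells[src] into cells[dst]: epoch 1
// persists the redo information, epoch 2 applies it. Both epochs are
// idempotent, so either can be blindly re-executed after a crash.
void microTx(Data& D, RedoLog& L, std::atomic<uint64_t>& s,
             uint64_t src, uint64_t dst) {
    L.src = src; L.dst = dst;           // epoch 1: fill the redo log
    persist(&L, sizeof(L));
    s.store(LOGGED); persist(&s, sizeof(s));

    D.cells[L.dst] = D.cells[L.src];    // epoch 2: update D in-place,
    persist(&D.cells[L.dst], sizeof(uint64_t));  // driven by the log
    s.store(CLEAN); persist(&s, sizeof(s));
}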
The current state of the state machine indicates whether the data structure
is clean so that the next read or write can be performed, or whether an operation
is currently ongoing. In the latter case, any upcoming read, insert, or
remove request has to wait. We prevent concurrent operations with locks (see
Sect. 10) as our transaction approach cannot handle concurrent accesses by
itself. Otherwise, they may corrupt the data structure [20].
Each state transition of the state machine (see Figure 1) performs a sequence
of writes followed by a mechanism to make the writes persistent—an epoch [17].
A state transition either updates D or L and then updates the current state
atomically. Thus, it consists of two epochs. We require all state transitions to
be idempotent. If a transition updates D, there must be enough information
in L to be able to redo the operation. If it updates L, there must be enough
information in D and L to redo the operation, i.e., D can be the redo log for L.
Figure 1: µ-Tx in a state machine with clean and dirty states working on a
constant size redo log L, a complex data structure D of arbitrary size, and the
state variable s. Each idempotent µ-Tx writes to D or L, persists the writes,
and then atomically sets the next state.
In contrast to shadowing and logging, the state machine approach is less
affected by failures. Once it has reached the first dirty state, it can guarantee the
client that the operation will eventually succeed. The first state transition ac-
cepts the operation and the following ones perform the operation. Shadowing
can only guarantee success after completion. We can identify the kind of oper-
ation uniquely by the current state. Different operations will use disjoint sets
of dirty states. If the machine is in a clean state, there is no ongoing operation.
The recovery cost is negligible. We might lose one epoch, which we have
to repeat. The next process can continue where the last process crashed. For
comparison, NV-Trees [65] only store the leaves in NVRAM. On recovery, they
have to recompute the inner nodes.
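A recovery routine then only has to read the persisted state and drive the
machine forward, as the following sketch illustrates; the executeTransition
dispatcher is a hypothetical stand-in for our per-state epoch code:

#include <atomic>
#include <cstdint>

constexpr uint64_t CLEAN = 0;

// Hypothetical dispatcher: re-executes the idempotent epoch belonging
// to `state` (possibly repeating the interrupted one) and returns the
// persisted successor state.
uint64_t executeTransition(uint64_t state) { /* ... */ return CLEAN; }

// Recovery: read the state variable from NVRAM and drive the state
// machine until it reaches the clean state again. No log scan needed.
void recover(std::atomic<uint64_t>& stateInNvram) {
    uint64_t s = stateInNvram.load();
    while (s != CLEAN)
        s = executeTransition(s);
}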
4.1 Overhead
There is a trade-off between the number of states and the size of the redo log.
Larger state machines tend to have smaller logs, because less information is
needed to make each state transition idempotent. However, they require more
state transitions and flushes. Smaller state machines need larger logs but
require fewer state transitions and flushes.
The smallest state machine with one clean and one dirty state akin to tradi-
tional transactions in databases has no constant size redo log. Remove for AVL
Trees has in the worst case O(log n) balancing operations, which cannot be ex-
ecuted in-place. All balancing operations require logging, see Sect. 5.4. Thus,
the minimal state machine with two states has no constant size redo log. For
remove with AVL Trees, the log’s size is a function of the depth of the tree. In
our approach, it suffices to provide enough space in the redo log to make state
transitions idempotent. The complexity of state transitions is independent of
the size of the data structure.
There are operations that imperative code implements in-place, such as tree
rotations, updating pointers, and updating loop counters. These operations are
not idempotent and have to be split into two state transitions. The smallest
state machine with a constant size redo log depends on the complexity of the
algorithm. Note that both state machines for AVL Trees are larger than the
two for Red-Black Trees, because AVL Trees are more strictly balanced. Furthermore, the
remove state machine for AVL Trees is larger than the one for insert (35 vs. 24
states).
Despite all state transitions being idempotent, there are two strategies for
re-executing an interrupted state transition: (a) blindly re-executing it and (b)
analyzing the progress made before the crash and only executing missing parts.
In our implementation, we used the former strategy. The number of stores per
state transition is small. A single rotation needs two state transitions with
three stores each, so the gain of the latter strategy would be negligible. For
store-intensive state transitions, e.g., writing a GiB of data, the latter would be
more efficient than the former.
We use a global state variable of 7 bytes that stores the current state and
is updated atomically with CAS operations. This means that the number of
states is limited to 2^56. When 2^56 states are insufficient, the state can be
kept in D as a variable with more than 56 bits. However, the number of
executable state transitions is not bounded by the 7 byte limit. Unfortunately,
this approach does not allow reliably detecting whether the state machine made
progress. Even if the state variable did not change between two reads, the state
machine may have transitioned through several states in the meantime ending in
the original state again. Supporting detection of progress would require an ap-
proach based on arbitrary precision counters, which is unfortunately challenging
with NVRAM.
In contrast, buffered durable linearizability [30] only requires operations to
be persistently ordered before they return. After a crash, the data would still
be persistent but not necessarily up to date.
In our approach, every operation starts at the clean state in which the last
operation finished, traverses the state machine, and returns to the clean state
before it returns to the caller. The current state of D represents the
execution resp. history of all previously completed µ-Tx. If the state machine is
in a dirty state after a crash, an operation started but did not return. Given
enough progress, the state machine might be in the clean state after the crash
even though the operation did not return before the crash. Thus, our approach
supports durable linearizability.
Figure 2: Right rotation with red (white) and black (gray) nodes. Nodes/roles
(A, B, C, pivot P, Q, and GP) and keys (1, 2, 3, ..., 6). P is promoted and Q
is demoted. Nodes drop (D) and take (T) children in three steps: 1. Q(L)
drops P and takes B (S1); 2. P(R) drops B and takes Q (S2); 3. GP(L) drops
Q and takes P (S3). The rotation also resolves the red violation between P
and B.
Knuth [31] describes a top-down algorithm for insert with O(1) single or
double rotations on average [36]. It walks down the tree and inserts the new
node at the bottom. On the way down, it records the positions, which have to
potentially be rebalanced. Additionally, it adapts the respective balance factors.
Remove deviates from the algorithms described so far. It also requires O(1)
balance operations on average, but it can need up to O(log n) balance opera-
tions [59]. It must walk back up to balance the tree. B and B+ trees [8, 16, 9]
use a similar concept. If a node is full, they split the node and walk up the tree.
The imperative C code from Julienne Walker2 uses a stack of size O(log n)
for remove. This violates our aim of a constant-size redo log. On the
way down, we store pointers to parents in the nodes, which can be used to
walk back up the tree as needed. In general, this shows a limit of our approach:
algorithms that need auxiliary space larger than O(1) cannot be
directly supported when the log must remain of constant size. Here, we used
the common technique of storing pointers to parents in the nodes. We discuss
this problem in more detail in Sect. 15.
5.4 Tree Rotations
A common balancing technique in binary trees is the tree rotation [55]. Tree
rotations preserve the order of nodes, but change the shape of the tree for
rebalancing. Figure 2 shows an example of a right rotation with Q as the root
of the rotation, P as pivot (the left child of Q), and GP as grandparent of P. The
rotation increases the height of the tree under the pivot by one and decreases the
height of the tree under the root by one. Additionally, it resolves a red violation
between the pivot (2) and B (3). A left rotation works vice versa. The order
of keys (1, 2, 3, ..., 6) and thus the order of nodes remains unchanged.
For the right rotation, Q replaces its left child (the pivot) with B. The pivot
replaces its right child (B) with Q. The GP replaces its left child (Q) with the
pivot. The three nodes pass the ownership, the parent relation, around. This
leads to the tree rotating around the GP.
Atomic Tree Rotations for NVRAM For NVRAM, the ownership changes
are implemented with stores (S1-S3). S1 for updating Q’s left child, S2 for
updating the pivot’s right child, and S3 for updating GP’s left child. If the
ownership changes are performed partially, i.e., the pivot drops the link to B
and Q does not take ownership of B, we lose sub-trees and can create cycles, as
the following analysis shows:
S1 and S3 fail → No link to pivot P.
S1 and S2 fail → No link to B. Cycle: Q and pivot P.
S2 and S3 fail → No link to Q.
S1 fails → No link to pivot P.
S3 fails → No link to Q.
S2 fails → No link to B. Cycle: Q and pivot P.
GP, Q, and the pivot have to be updated atomically. Otherwise, the tree
will lose sub-trees and become invalid. Logging the insert resp. delete request
would not be sufficient. Shadowing would copy Q and the pivot node, update
them, and atomically update the pointer of the grandparent pointing to the new
pivot. In our approach, we copy pointers to Q, the pivot, and B to the redo log
in a first step and then update Q, the grandparent, and the pivot in a second
transaction.
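The following sketch shows how such a logged right rotation could look as two
micro-transactions; persist, RotLog, and the state constants are illustrative
stand-ins under our naming, not the verbatim implementation:

#include <atomic>
#include <cstdint>
#include <cstddef>

struct Node { Node* left; Node* right; };
struct RotLog { Node* q; Node* pivot; Node* b; };

void persist(const void*, std::size_t) {}  // flush + fence stand-in

enum : uint64_t { ROT_LOGGED = 1, ROT_DONE = 2 };

// Right rotation as two micro-transactions. The first epoch snapshots
// the read set (Q, pivot P, B) into the redo log; the second performs
// S1-S3 reading only from the log, so it can be blindly re-executed
// after a crash without losing sub-trees or creating cycles.
void rightRotate(Node*& gpSlot, RotLog& L, std::atomic<uint64_t>& s) {
    // Epoch 1: log pointers to Q, the pivot, and B.
    L.q = gpSlot; L.pivot = L.q->left; L.b = L.pivot->right;
    persist(&L, sizeof(L));
    s.store(ROT_LOGGED); persist(&s, sizeof(s));

    // Epoch 2: idempotent ownership changes, driven by the log.
    L.q->left = L.b;        // S1: Q drops P and takes B
    L.pivot->right = L.q;   // S2: P drops B and takes Q
    gpSlot = L.pivot;       // S3: GP drops Q and takes P
    persist(L.q, sizeof(Node)); persist(L.pivot, sizeof(Node));
    persist(&gpSlot, sizeof(Node*));
    s.store(ROT_DONE); persist(&s, sizeof(s));
}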
1 void insert(root, key, value) {
2   if (root == nullptr) {  // the first node
3     root = new Node(key, value);  // A1
4   } else {  // initialize pointers and iterators
5     Node head;  // A2
6     Node *it, *parent, *grand, *grandgrand = nullptr;
7     parent = &head;
8     it = parent->right = root;
9     Direction dir, last; dir = Left;
10     while (true) {
11       if (it == nullptr) {  // insert the new node here
12         parent->dir = it = new Node(key, value);  // A3,A4
13       } else if (isRed(it->left) and isRed(it->right)) {
14         // recolor
15         it->color = Red;  // A5
16         it->left->color = it->right->color = Black;
17       }
18       if (isRed(it) and isRed(parent))  // need rotation?
19         rebalance(grandgrand, grand, last);  // A6,A7,A8,A9,A10,A11
20       if (it->key == key) break;  // key exists already
21       // traverse one level down the tree
22       last = dir; dir = (it->key < key) ? Right : Left;  // A12
23       grandgrand = grand;
24       grand = parent; parent = it;
25       it = it->dir;
26     }
27     // update root
28     root = head.right;  // A13
29   }
30   // ensure the root is black
31   root->color = Black;  // A14
32 }
Figure 3: Top down insert based on Guibas and Sedgewick [24] and J. Walker.2
The labels show the state of the state machine in Figure 6.
6.2 Insert
Guibas and Sedgewick [24] introduced top-down approaches for inserting and re-
moving key-value pairs for dichromatic trees, see Figure 3. As they are top-down
algorithms, they do not require keeping a stack, but only keep a few pointers
up the tree—the iterators. They are mainly used for playing the roles/anchors
in tree rotations. They also use a fake head node to simplify corner-cases. We
derived insert and remove (Figure 3 and 4) from Julienne Walker,2 who uses the
same concepts. To avoid black violations, we insert the new node as a red node
at the bottom. However, inserting a red node can yield red violations, which
can be resolved by promotions [58], i.e., color flip, single, and double rotation.
Inserting a node into an empty tree from the clean state C is trivial (C
→ A1 → C ). Otherwise, we initialize helpers with the init transition (lines 6–
9; outgoing edges from A2 ). If a leaf node is reached, we insert the new node
(line 12; A3 → A4 ). Otherwise, we might need to flip colors (lines 15–16), which
can be done in one state transition, because setting a new color is idempotent
(outgoing edges of A5 ).
The rebalance function (line 19 in Figure 3) performs single or double rota-
tions between the grand and grand grandparent. Single rotations are converted
into two state transitions: one for logging and one for executing the rotation
(Figure 6, A10 → A11 → either A12 (to continue) or A13 (key was found)).
Double rotations need four state transitions accordingly (A6 → A7 → A8 →
A9 ).
Lines 23–25 descend the iterators one level down the tree (outgoing edges
of A8 ) to close the loop. In Sect. 8, we show how to update the iterators with
fewer state transitions and persist operations in some cases. Finally, the root is
set and colored black (lines 28–31; A13 → A14 → C ).
The log's size (4 cache lines of 64 bytes each) is independent of the size of the
tree (see Figure 6). Its main components are the key, the value, the iterator, the
parent, and the grandparent. In addition, it keeps some space for redo logging:
the direction on the way down the tree (Dir), a temporary node for tree
rotations (TmpNode), and an anchor node on the way down the tree for remove
in RBTs (Sp). The majority of this space would also be needed for a
non-persistent insert operation. Note that the log does not keep
a stack of size O(log n). Instead, it suffices to keep a few pointers up the tree.
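A possible layout of such a log, assembled from the field names in Figure 6,
could look as follows. The exact field types and grouping are our guess, not
the verbatim implementation:

#include <cstdint>

struct Node;  // tree node in NVRAM

// Approximate layout of the constant-size insert redo log for RBTs:
// four 64-byte cache lines, independent of the tree size.
struct alignas(64) InsertLog {
    struct alignas(64) Iterators {         // one cache line per set
        uint64_t lastDir, dir;
        Node *tp, *grandp, *parentp, *qp, *q2p, *fp;
    } even, odd;                           // even-odd scheme (Sect. 8)
    struct alignas(64) Redo {              // scratch space for redo steps
        Node *tmpNode, *sp;
        uint64_t dir1, dir2;
        Node *savep, *saveChldp;
    } redo;
    struct alignas(64) Request {           // the accepted operation
        uint64_t key, value;
        Node* root;
    } request;
};
static_assert(sizeof(InsertLog) == 256, "four cache lines");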
6.3 Remove
While the algorithm for remove (Figure 4) looks more complex than for insert
(Figure 3), its state machine in Figure 6 is simpler than the one for inserting.
The control flow graph for remove is simpler and consists of a single nested
if statement. In each iteration, the state machine can jump into exactly one
place. In contrast, the inner part of the insert algorithm contains two nested if
statements. It allows jumping into two cases resulting in a more complex state
machine.
After the basic initialization, lines 11–14 move the iterators one level further
down the tree. This uses the even-odd optimization described in Sect. 8, so that
all outgoing edges of R1 update the iterators. The single rotation in line 19
uses R2 → R3. The color flip in lines 27–29 is idempotent. Thus, a single
state transition from R5 suffices. The same applies to the color correction in
lines 35–37 represented by R12.
The balance operation in line 33 either executes a single rotation (R6 →
R7 ) or a double rotation (R8 → R9 → R10 → R11 ).
1 void remove(root, key) {
2   if (root == nullptr) return;  // empty tree
3   // initialize pointers and iterators
4   Node head;  // C0
5   Node *it, *parent, *grand, *found = nullptr;
6   Direction dir = Right;
7   it = &head;
8   it->right = root;
9   while (it->dir != nullptr) {
10     // traverse one level down the tree
11     Direction last = dir;  // R1
12     grand = parent;
13     parent = it;
14     it = it->dir;
15     dir = (it->key < key) ? Right : Left;  // direction?
16     found = (it->key == key) ? it : found;  // found?
17     if (not isRed(it) and not isRed(it->dir)) {
18       if (isRed(it->(!dir))) {  // single rotation
19         parent = parent->last = single(it, dir);  // R2,R3
20       } else if (not isRed(it->(!dir))) {
21         Node *s = parent->(!last);  // R4
22         if (s != nullptr) {
23           Direction dir2 =
24             (grand->right == parent) ? Right : Left;
25           if (not isRed(s->left) and not isRed(s->right)) {
26             // recolor
27             parent->color = Black;  // R5
28             s->color = Red;
29             it->color = Red;
30           } else if ((grand != nullptr) and
31                      not ((grand == &head) and (dir2 == Left))) {
32             // rotate?
33             rebalance(grand, parent, s, last);  // R6,R7,R8,R9,R10,R11
34             // recolor
35             it->color = grand->dir2->color = Red;  // R12
36             grand->dir2->left->color = Black;
37             grand->dir2->right->color = Black;
38   } } } } }
39   if (found != nullptr) {  // unlink and delete
40     found->key = it->key;  // R13,R14
41     Direction dirL = (parent->right == it) ? Left : Right;
42     Direction dirR = (it->left == nullptr) ? Right : Left;
43     parent->dirL = it->dirR;
44     delete it;
45   }
46   // update root
47   root = head.right;  // R15
48   root->color = Black;  // ensure the root is black
49 }
Figure 4: Top down remove based on Guibas and Sedgewick [24] and J. Walker.2
The labels show the state of the state machine in Figure 7.
Figure 5: Data layout of a tree node in NVRAM (Key, Value, Left, Right,
Up, Col, Dir, Bal) packed into cache lines; the tree is a fixed-length array of
nodes (Head, n0, n1, ...) together with the Next Node index.
Growing the tree's size further would require adding additional fixed-length
arrays. The index type would need to be updated accordingly (from IdxType
to (ArrayIdxType, IdxType)).
To support allocating new tree nodes after a crash, we use the Next Node
structure—a uint56_t. It stores the index of the next free array element. Note,
though, that we did not implement garbage collection. Even if a client releases
a node, it cannot be reused. We only allocate from the head and do not
maintain free lists.
NVRAM-aware garbage collection and dynamic node allocation are provided
by Makalu [10], nvm_malloc [51], and PAllocator [43]. They could be used for
our tree data structure. The state and next node variables have to be updated
with atomic CAS operations. All other updates can use non-atomic stores,
because we can do redo logging.
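The following sketch illustrates the resulting bump allocation. In the real
system the index advance is part of an idempotent state transition; here it is
shown in isolation, with all names ours:

#include <atomic>
#include <cstdint>
#include <cstddef>

struct Node { unsigned char bytes[64]; };

void persist(const void*, std::size_t) {}  // flush + fence stand-in

// Crash-safe bump allocation from a fixed node array: `nextNode`
// lives in NVRAM and is advanced with an atomic 8-byte operation, so
// it cannot tear. There are no free lists; released nodes are never
// reused (no garbage collection).
Node* allocNode(Node* pool, std::atomic<uint64_t>& nextNode) {
    uint64_t idx = nextNode.load();
    // Single writer: the CAS always succeeds; it is used only for
    // its no-tear property.
    nextNode.compare_exchange_strong(idx, idx + 1);
    persist(&nextNode, sizeof(nextNode));
    return &pool[idx];
}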
Figure 6: NVRAM data structures and state machines for insert and remove
for Red-Black Trees with 15 (insert) and 16 (remove) states (rot=rotation,
init=initialize variables, add=insert new node, remove=remove a node, de-
scend=one step down the tree, root=update root, color=recolor nodes,
black=color the root black). The redo log comprises even and odd iterator
sets (LastDir, Dir, Tp, Grandp, Parentp, Qp, Q2p, Fp), scratch fields
(TmpNode, Sp, Dir', Dir'', Savep, Savechldp), the request (Key, Value), the
Root, and the State. Dashed edges flush to the log. The State fills
7 bytes aligned to 8 bytes.
Figure 7: NVRAM data structures and state machines for insert and remove
for AVL Trees with 24 and 35 states. Dashed edges flush to the log. The State
fills 7 bytes aligned to 8 bytes.
The four cache lines of the Red-Black Tree log are only partially filled, while
the two cache lines for AVL Trees are completely filled. Spreading the data
over more cache lines might improve performance further by reducing correlated
cache misses, but such optimizations are beyond the scope of this paper.
8 Optimizations
Figure 8 shows the epochs by category for inserting resp. removing 10^7 keys
into Red-Black Trees. There are O(1) single and double rotations per operation
on average and one color flip, as can be expected [58]. Some categories reflect
necessary steps for starting resp. completing operations, e.g., removing the found
node, updating the root, initializing the variables for the search, and flushing the
current command. While the cost of each operation is in O(log n), the number
of state transitions is dominated by shuffling the iterators for tree traversal.
As we cannot update the iterators in place, we need to use the redo log.
The canonical approach requires two state transitions (4 epochs) per loop
iteration. In the first, it flushes the old iterators to the log. In the second,
it updates the new iterators to go one level deeper into the tree. Instead, we
implemented an even-odd scheme. All iterators are stored twice to obtain disjoint
read and write sets. In even rounds through the loop, the first set is written; in
odd rounds, the second set. Thus, we need only one state transition to go one
level down the tree, and the previous iterators form the redo log. A
set of iterators fits into a single cache line, which supports the even-odd scheme
and reduces flush costs. We use one bit of the state variable to store whether
we are in an even or odd round:
struct SV { uint64_t Even : 1; uint64_t State : 55; } EvenState{1, 0};
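A descend transition could then look like the following sketch, where the
Iterators layout, nextState, and persist are illustrative:

#include <atomic>
#include <cstdint>
#include <cstring>
#include <cstddef>

struct SV { uint64_t Even : 1; uint64_t State : 55; };  // as above
struct Iterators { uint64_t lastDir, dir; void* ptrs[6]; };
void persist(const void*, std::size_t) {}  // flush + fence stand-in

// One descend transition under the even-odd scheme: read the iterator
// set of the current parity, write the other set (disjoint read/write
// sets), then flip Even and advance State in one atomic 8-byte update.
void descend(Iterators sets[2], std::atomic<uint64_t>& packed,
             uint64_t nextState) {
    uint64_t raw = packed.load();
    SV sv; std::memcpy(&sv, &raw, sizeof sv);
    sets[!sv.Even] = sets[sv.Even];   // copy, then move one level down
    persist(&sets[!sv.Even], sizeof(Iterators));
    sv.Even = !sv.Even; sv.State = nextState;
    std::memcpy(&raw, &sv, sizeof raw);
    packed.store(raw); persist(&packed, sizeof packed);
}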
Figure 8: Average number of epochs for inserting resp. removing 10^7 keys in
random order in Red-Black Trees (smaller is better; workloads Add, Add w LA,
and Remove). Categories: InsertNewNode (A1), Allocator, RemoveNode
(R13-14), Misc (A3,R4), UpdateRoot (A13-14,R15), TreeTraversal (A12,R1),
FlushCommand (C0), InitializeVariables (C0,A2), NewNode (A1,A4), ColorFlip
(A5,R5), Recolor (R12), DoubleRotation (A6-9,R8-11), and SingleRotation
(A10-11,R2-3,R6-7).
Figure 9: Average number of epochs for inserting resp. removing 10^7 keys in
random order in AVL Trees (smaller is better; workloads Add, Add w LA, and
Remove). Categories: PopStack (R20), PushStack (R2-3), FixParent
(A22-24,R5-6,R14-16), Balance (A13,A16-17,R19,R21,R31-33), BalanceFactors
(A7-A9,A11-12,R14,R17), BalancePoints (A5,A10), Loop (A3,A6,A25-26,R9-10),
InsertNewNode (A1), Allocator, StateChange, RemoveNode (R12-13), Misc
(R4,R7-8,R11,R34), FlushCommand (C0), InitializeVariables (A2,R1,R18,R30),
NewNode (A4), DoubleRotation (A18-21), and SingleRotation (A14-15).
If the next loop iteration does not perform any rotations or recoloring, it
can be skipped, as it does not change the tree. For inserts, we added unlimited
look-ahead (LA): instead of going exactly one level down the tree in each
iteration, we advance the iterators directly to the next level that needs balancing
or recoloring.
Figure 8 shows the average number of epochs, i.e., flushes, for inserting resp.
removing 10^7 keys in random order [35] into RBTs. Insert and remove are dom-
inated by the tree traversal, but unlimited look-ahead (LA) almost eliminates
these costs. Figure 9 shows the transitions for AVL Trees. Add is dominated by
the loop walking down the tree; look-ahead does not iterate through the loop.
Remove is dominated by push stack, which stores in the nodes the pointers to their
parents. Again, AVL Trees are more expensive than RBTs. In both experiments,
keys are inserted at depth 21 on average. The final Red-Black Tree has a depth
of 29 and the AVL Tree has a depth of 28, as it is balanced more strictly.
The AVL Tree insert algorithm [31] is split into two phases. First, it walks
down the tree and searches the parent of the new node. On the way down, it
saves two balancing points. In the second phase, it inserts the new node and
uses the balancing points to rebalance the tree. The first loop can be replaced
by two state transitions, because it is almost side-effect free—except for storing
the balancing points. The approach is similar to the look-ahead for RBTs.
The AVL Tree remove algorithm is more challenging, because the loops walk-
ing down the tree store parent pointers in the nodes. Skipping loop iterations
is challenging. The last loop walks up the tree and balances it. There are fewer
opportunities for skipping iterations.
9 Implementation Details
The programming model for NVRAM [57] maps files into memory using mmap,
which provides direct access (DAX) to the NVDIMM. Mapping a file again after
a power failure may yield a different base address. So, all memory accesses have
to be explicitly adjusted to the corresponding address range. This is necessary to
consistently access the same data after a power loss. Makalu [10] is a persistent
heap manager, which hides this problem from users.
For simplicity, we placed the log structure and the tree, an array of nodes,
into different files. There are pointers between nodes and between the log and
the nodes, which have to be adjusted to the base addresses. The following
assignment is completely handled by the compiler, which cannot adjust the
embedded pointers to the changed base addresses:
log->root->left->color = log->save->right->color;
The only feasible solution is to manage memory accesses in user code instead
of leaving them to the compiler. Thus, for memory accesses and persist operations, we wrote
our own embedded domain specific language (DSL) based on expression tem-
plates [60].
The DSL provides a declarative language for describing memory addresses
including all intermediate steps—the path. Instead of using expressions such as
log->root->left->color, we assign types to each memory address, e.g.,
ColorInNode<LeftInNode<RootInLog>> address = {log};. Each memory access can be seen
as a path of intermediate memory accesses. The DSL allows programmers to
describe memory accesses and persist operations declaratively while ignoring the
peculiarities of the underlying programming model. For each memory access,
the runtime maps the request to the corresponding object (log, nodes, next node,
and state variable) and its associated mapped file. The access is applied to the
address space with the offset adjusted accordingly. This allows development on
machines with and without NVRAM and facilitates remote access. For local
development, we simply use malloc to simulate mmapped files. For machines
with NVRAM, we rely on the PMDK for mmapping. For remote access, we use
PMDK to mmap files and UCX 3 for communication with RDMA. Additionally,
our DSL allows us to transparently experiment with different cache flushing
and caching strategies. The runtime can map persist operations to the different
persist operations described in Sect. 3.2. For remote access, we can cache the
results of gets.
The following updates the key and value in the log:
WriteOp<typename LogAddress::Key, KeyValue> W1 =
    {LogAddress::Key(log), KeyValue(key)};
WriteOp<typename LogAddress::Value, ValueValue> W2 =
    {LogAddress::Value(log), ValueValue(value)};
flushOp(W1, W2);
WriteOp and flushOp are the customization points. Each WriteOp has a source and a
destination memory location. It performs a read and a write. flushOp executes
all write operations. As memory locations for read and write are described by
paths, they have to be evaluated first. They may require a sequence of plain
loads for local access or gets for remote access. Each WriteOp could execute the
assignment followed by a clwb, and the flushOp invokes an sfence to finish the
epoch. For our tree code, the largest epoch (in AVLT remove) invokes WriteOp
nine times.
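To illustrate the idea behind such path types (not the actual DSL), a
stripped-down expression-template version could look as follows; the offsets
and the pointer representation are placeholders:

#include <cstdint>

// Each type describes one access step; evaluate() resolves the final
// address at runtime against the current mmap base, making every
// intermediate load explicit.
struct Ctx { char* logBase; };            // remapped base after restart

struct RootInLog {
    static char* evaluate(const Ctx& c) {
        return *reinterpret_cast<char**>(c.logBase + 0 /* root offset */);
    }
};
template <class Prev> struct LeftInNode {
    static char* evaluate(const Ctx& c) {
        return *reinterpret_cast<char**>(Prev::evaluate(c) + 16 /* left */);
    }
};
template <class Prev> struct ColorInNode {
    static char* evaluate(const Ctx& c) {
        return Prev::evaluate(c) + 32 /* color offset */;
    }
};

// log->root->left->color expressed as a type:
using Addr = ColorInNode<LeftInNode<RootInLog>>;
// char* a = Addr::evaluate(ctx);  // the runtime can adjust each step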
holding the lock. In shared mode, the shrd field holds the number of clients shar-
ing the lock. The holders of the shared lock are anonymous, which hinders
identifying failed lock holders and getting the lock back into exclusive mode. Thus,
these designs are not well prepared for client failures. They also cannot guar-
antee fairness, because readers can always starve any writer trying to acquire
the write-lock. Gerstenberger et al. [22] designed a similar lock for one-sided
communication in MPI, but here even the holder of the write lock is anonymous.
However, one could argue that as of today the behavior of failed nodes in MPI
is deliberately unspecified. The same concepts are used for shared-memory as
well. Bit-vectors (32-bit or 64-bit) are split into reader and writer parts and
atomic operations are used to update the value [37].
To tolerate client failures, all lock holders have to store their pid. A stored
pid is the proof that the lock is taken. Our data structure’s design is shown in
the bottom part of Figure 10. For a k-reader single-writer lock with f -fairness,
we use an array of k 64 bit slots to hold the readers r0 –rk−1 , one slot for the
writer w, f slots for the waiting queue sl 0 –sl f −1 , and one slot for the outer lock
(lock owner ), which has to be acquired to modify the lock data structure itself.
Each entry is either 0 or the pid of the respective client. Using the lock-holder’s
pid to indicate whether a lock is taken allows clients to use failure detectors on
lock-holders. If a client wants to acquire a taken lock, it can either be taken
because of contention or because the holder failed. The client starts a failure
detector on the lock-holder and retries to acquire the lock to cover both cases.
Fairness (equal share and no starvation) would require a waiting queue with
sufficient capacity to hold all waiting clients, which is a theoretical but not a
practical solution. A compromise is f -fairness with a waiting queue of length f .
Clients in the queue are subject to fairness. They cannot overtake each other.
The first node is always the next to acquire its desired lock. However, we cannot
guarantee fairness for clients waiting to enter the queue.
When the desired lock becomes available, the process in the first slot of the
queue takes the outer lock, takes the desired lock, copies the other members of
the queue one step forward, sets the last element to zero, and releases the outer
lock. All operations require atomic CAS operations, as a crash of the writing
process during a non-atomic RDMA write may result in a slot with a valid pid
of a process not intending to hold a lock.
Due to the 64 bit size limitation of remote atomics, it is not possible to
shift the complete queue in a single CAS operation. Therefore, the outer lock
is needed to prevent other processes from interfering. During the shift, each
process in the queue is always stored at least once in it—either in the old and/or
new slot. If the shifting process fails, the fairness in the queue is preserved. On
success, the last slot becomes zero and is available for the next client. Entering
the queue does not require holding the outer lock. It is a CAS with a zero entry,
which may fail. A process holding a read or write lock can release it at any time
by zeroing its slot. It does not have to acquire the outer lock beforehand.
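The following sketch illustrates the lock layout and the acquisition of a reader
slot; the constants and names are ours and merely mirror the description
above:

#include <atomic>
#include <cstdint>

// k-reader single-writer lock with f-fairness. Every slot is 0 (free)
// or the holder's pid, so failure detectors can be pointed at holders.
constexpr int K = 8, F = 4;
struct RWLock {
    std::atomic<uint64_t> readers[K];   // r0 .. r(k-1)
    std::atomic<uint64_t> writer;       // w
    std::atomic<uint64_t> queue[F];     // sl0 .. sl(f-1), FIFO
    std::atomic<uint64_t> owner;        // outer lock for queue shifts
};

// Take a reader slot with a single 64-bit CAS from 0 to our pid. On
// failure the caller retries and may start a failure detector on the
// pids it observed.
bool tryReadLock(RWLock& l, uint64_t pid) {
    if (l.writer.load() != 0) return false;   // writer active
    for (auto& slot : l.readers) {
        uint64_t expect = 0;
        if (slot.compare_exchange_strong(expect, pid)) return true;
    }
    return false;                             // all slots taken
}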
11 Simulating Power-Failures
To validate our implementation, we simulate power failures by killing processes
with SIGKILL following the approach used by NV-Heaps [14]. We start a process
doing insert resp. remove operations. At a later time, we kill the process. Similar
to a power failure, we lose all transient data. As the process and thus the
mmapped address range does not exist anymore, dirty cache lines cannot be
written back. The process could have been killed at any instruction of any state
transition. Afterwards, we start a new process that recovers the current state
and progresses to the clean state. The two processes actually use the same code.
While the former assumes that it is in the clean state, the latter actually reads
the current state from the file.
The testing revealed an issue with idempotence. We killed a process during
a tree rotation. The recovery process obviously executed the tree rotation again,
but tree rotations were not idempotent at that time. As discussed in Sect. 5.1,
failures during tree rotations can lose sub-trees. The challenge with tree rota-
tions is that they read and write the same memory locations. In the example,
we could lose access to Q, the pivot, or B. To make them idempotent, we have
to store all read values in the redo log. Thus, we have to keep pointers to all
three of them in the redo log to separate the read and write sets. Since then,
we have run more than 2,000,000 tests without revealing any further issues.
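A test harness along these lines could look as follows; runWorkload and
recoverAndVerify are placeholders for our insert/remove driver and the
recovery check, and the kill delay is an arbitrary choice:

#include <chrono>
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

void runWorkload()      { /* inserts/removes against the NVRAM file */ }
void recoverAndVerify() { /* read state var, drive machine to clean,
                             then check balance, colors, node count */ }

int main() {
    for (int i = 0; i < 1000; ++i) {
        pid_t child = fork();
        if (child == 0) { runWorkload(); _exit(0); }
        std::this_thread::sleep_for(
            std::chrono::microseconds(std::rand() % 100000));
        kill(child, SIGKILL);          // simulated power failure
        waitpid(child, nullptr, 0);    // dirty cache lines are lost
        recoverAndVerify();            // fresh process recovers state
    }
}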
12 Evaluation
For all experiments with NVDIMM-N and Infiniband, we used one server with
two Intel Xeon Gold 6138 and one server with two Intel Xeon Silver 4116 CPUs.
Each server has 192 GiB main memory and two 16 GiB NVDIMM-N. They are
connected with an InfiniBand FDR network (ConnectX-3). We used CentOS
7.5, Clang 6.0.0, PMDK 1.5.1, and UCX 1.5.
For each measurement, we report the median of 1,000 samples. The 99
percent confidence interval (CI) is always within 1.5 percent of the reported
medians; extremely short runs show slightly larger percentages.
Red-Black Trees and AVL Trees are self-balancing binary search trees. Each
node has at most two children. There are no high-radix variants, so they do
not lend themselves to the optimizations commonly used for B and B+-trees.
In terms of raw throughput, Red-Black Trees and AVL Trees are simply not
competitive. Our contributions are not in the area of optimizations for speed;
instead, we designed a new transaction system with O(1) log-space, in-place
updates, and RDMA. Thus, we did not compare the performance of our
implementation with highly optimized B+-trees. We want to analyze our
optimizations and the scalability of our approach itself. We used the data
structures as shown in Figure 6 and Figure 7. The constant-size redo logs were
used for trees from 0 to 10^7 nodes.
Figure 11: Insert throughput for Red-Black Trees and AVL Trees both with
look ahead with NVDIMM-N (inserts per second over the number of inserted
keys, 10 to 10^6, for clwb; larger is better).
For small trees, the performance is worse than for larger trees (cf. Fig-
ure 11). For trees larger than 100 keys, the performance stabilizes. This could be
due to the fact that for small trees insert operations touch a larger share of the
tree, i.e., they flush out a large part of it. For larger trees, large parts
of the tree remain untouched and can be accessed without cache misses in the
next operation. Inserting a key on the left side of the tree evicts the nodes on
the path down from the caches. If the next insert is on the right side, there
will be only a low number of cache misses. If the tree is small, the two paths
will overlap and cause more cache misses.
According to Tarjan [58], Red-Black Trees and AVL Trees only need O(1)
balance operations per insert on average. With the optimizations described in
Sect. 8, we almost eliminate the logarithmic part of the insert operation. As
expected, the insert costs are independent of the size of the tree. It also shows
Figure 12: Remove throughput for Red-Black Trees and AVL Trees both without
look ahead with NVDIMM-N (removes per second over the number of removed
keys, 10 to 10^6, for clwb and clflushopt; larger is better).
again that Red-Black Trees are faster than AVL Trees. Red-Black Trees might
reduce the costs for insert by leaving the trees less balanced than AVL Trees, see
Sect. 5. Although Cormen et al. [18] and Adel'son-Vel'skii and Landis [1] suggest
that the throughput should decrease with the depth of the tree, we can keep it
constant.
The costs for remove are much higher than for insert (cf. Figure 12). Note
that we only tried to optimize insert operations. The expected costs per remove
are O(log N). Red-Black Trees and AVL Trees perform O(1) balance operations
per remove on average. As expected, AVL Trees are slower than Red-Black Trees,
but the gap is much smaller than for insert.
Figure 13: Insert throughput for Red-Black Trees and AVL Trees both with
look ahead with Intel Optane (inserts per second over the number of inserted
keys, 10 to 10^6, for clwb, clflushopt, and clflush; larger is better).
For the evaluation of their RBT code, Wang et al. [63] simulated NVDIMM,
STT-RAM, and PCM. For insert with 10^6 resp. 2^10 keys, they achieved 666,666,
166,666, and 52,600 inserts per second, respectively. This shows that the workload
is latency sensitive. For simulating NVDIMMs, they used plain DRAM. Our
hardware setup differs from theirs, and our NVDIMM-Ns run at a lower clock
than DRAM. With close to 400,000 inserts per second for RBTs, we come close
despite a completely different approach. Our AVL Trees are slightly slower
because they keep the tree more balanced and thus require more epochs.
Figure 14: Remove throughput for Red-Black Trees and AVL Trees both without
look ahead with Intel Optane (removes per second over the number of removed
keys, 10 to 10^6, for clwb and clflushopt; larger is better).
Figure 15: Insert throughput for Red-Black Trees and AVL Trees with and
without look-ahead (LA) using clwb for 10^6 inserted keys (larger is better;
measured values: 357.3 k, 153.7 k, 134.1 k, and 54.8 k inserts per second).
Table 1: Remote inserts via RDMA and software agent based flushing.

               AVLT                    RBT
           No Cache   Cache        No Cache   Cache
No Flush   2,145/s    2,150/s      2,091/s    2,542/s
Flush      1,867/s    1,866/s      1,922/s    2,312/s
The limited performance gain of the cache is due to the fact that accesses to the
tree are not cached. Furthermore, caching does not affect communication with
the software agent.
13 Related Work
Trees on NVRAM. A number of systems were already proposed to manage
tree data structures with persistent memory. A popular tree variant in this area
is the B+tree [16], which stores all values in the leaves. CDDS B-Tree [61], for
example, relies on 8-byte writes and a version system for inserts as long as free
slots are available in leaf nodes. Otherwise, it uses shadow copying to split the
node and to update the inner nodes. On recovery, it uses its version system to
discard all interrupted operations. Similarly, NV-Tree [65] stores all values in
leaf nodes. While the leaf nodes are stored in NVRAM, here, the inner nodes are
stored in DRAM and can be restored after a power failure. To further minimize
the cost of flushing, entries in leaf nodes are appended to the corresponding
leaf node and remain unsorted. Full leaf nodes are split using shadow copying.
Rebalancing is not done on-the-fly but as a separate operation that recreates
the inner nodes and makes them the current ones atomically. The wB+Tree [12]
also mitigates the costs of flushing by keeping node entries unsorted to avoid
entry movements on insert. This allows inserts with only a few 8-byte writes
and shadowing. Two B+tree algorithms exploiting weak memory models and
allowing temporal inconsistencies are FAST and FAIR [27]. The Bztree [7] is
a multi-threaded B+Tree. It relies on Persistent Multi-Word CAS (PMwCAS)
and an epoch-based garbage collection scheme.
For radix trees, WORT [32] maintains a tree shape independent of the inser-
tion order. It neither needs nor supports balancing. Inserting items on leaves
or leaf paths and pointer updates can be done with atomic 8-byte writes. For
more sophisticated adaptive radix trees, shadow copying is used.
Most research focuses on variants of B-trees with often more than two chil-
dren and optimizations for external memory. In most cases, new keys can be
added in leaf nodes with a few flushes without any re-balancing.
Red-Black Trees in NVRAM were first discussed by Wang et al. [63]. While
they implemented a more complex kind of tree than earlier work, they still
relied on shadow copying and a versioning system to distinguish between the
copy and the real tree. They maintain something close to a shadow tree with
some effort to minimize overhead. Operations are performed on the shadow
tree; an atomic pointer update switches between the shadow and the current
tree. Wang et al. [63] also show that update operations on RBTs are not local
operations: it does not suffice to do shadow copying on individual nodes, which
instead requires shadow copying larger fractions of the tree.
All these systems handle requests in a blocking way and operations often
become visible with the last 8-byte atomic pointer update. After a crash, it
is challenging for them to identify aborted resp. the last successful operation.
Highly desirable properties such as exactly once semantics, see Sect. 4.3, are
hard to achieve. In contrast, with our state machine approach, we can guaran-
tee eventual success after accepting the command into the redo log and support
exactly once semantics.
Transactions on NVRAM. Systems supporting generic transaction pro-
cessing on NVRAM based on redo logging are, for example, SoftWrAP [23] and
DudeTx [33]. They use a mix of shadow-memory and redo logging where all
memory accesses during a transaction are aliased into a volatile memory region
and writes are stored in a persistent redo log immediately or when all work
of the transaction is done. For DudeTx, the redo log is then applied to the
actual data stored in persistent memory in a final step. With language exten-
sions, Mnemosyne [62] provides primitives for working with persistent memory.
Variables can be marked as persistent. Code regions marked as atomic will
be executed with durable transactions. It hooks into a lightweight software
transaction system to implement write-ahead redo logging.
Other systems base their transaction system on write-ahead and undo log-
ging [52]. First, all store operations are written to the undo log before the real
transaction is executed. In case of a power failure, uncompleted transactions
are rolled back. The NV-Heaps system [14] provides its own heap manager, spe-
cialized pointers, and atomic sections for persistent memory. For transactions,
it keeps a volatile read log and a non-volatile write log. In case of an abort or
power failure, it rolls back all changes.
Here, the literature seems to be undecided between undo and redo logging.
However, undo logging requires more flush operations than redo logging.
Shadow memory is a neat way to exploit the fact that there are two
types of memory available—volatile and non-volatile—with different perfor-
mance characteristics. It provides isolation and can leverage the benefits of
caches. Our approach of in-place updates is seldom found in the literature.
Logging in Databases. The quasi-standard algorithm for write-ahead
logging (WAL) with no-force and steal policies, ARIES [39], has influenced the
design of many commercial databases. It is optimized for spinning disks and
maintains an append-only log. The log contains undo and redo records. For
recovery, it goes through 3 phases: (a) analyze the log for uncommitted and
aborted transactions, (b) redo finishable transactions, and (c) undo the remain-
ing transactions. In our approach, we only use a fixed-size redo log. While ARIES
uses write-only WAL, we read the log during epochs to facilitate idempotence.
Our analysis phase simply identifies the current state and continues from there.
While ARIES optimizes for sequential writes, MARS [15] exploits the fact
that SSDs support high random-access performance. It introduces the concept
of editable atomic writes (EAW), which are essentially redo logs. The full trans-
action is executed in a redo record and on commit the system applies the trans-
action atomically. On failure, it can simply reapply the redo log. In contrast,
we write the data directly into the data structure. The transaction becomes
re-doable because of the redo log created in the previous epoch. We always split
large transactions into a sequence of micro-transactions. For NVRAM, redo
logs provide lower costs. They reduce the number of flushes in contrast to undo
logs. We can also avoid complex log pruning mechanisms, because the log has
a fixed size and every operation re-uses the log of the previous operation.
14 Discussion of Correctness
General approach. It is a standard technique in compiler construction to convert
code into control flow graphs with basic blocks. We execute this representation
with a state machine: it tracks which basic block is currently executed and
which blocks are legal successors. State transitions correspond to the execution
of basic blocks. Large basic blocks can be split into a sequence of smaller ones
without changing the algorithm. Additionally, splitting a basic block into two
parts, (a) reading from the data structure and writing to the log, and
(b) reading the log and updating the data structure, makes both parts
idempotent. This is a common technique in databases with appropriate logging.
The size of a basic block, i.e., its number of stores, determines the size
of the log.
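A minimal sketch of one micro-transaction under this split might look as follows, assuming a persistent state field and a single buffered store; all names, including the compute_update function, are illustrative.

```cpp
#include <cstdint>

// Illustrative states, one per basic block of the micro-transaction.
enum class State : uint64_t { ReadToLog, LogToData, Done };

struct MicroTx {
    State state;            // persisted after every transition
    uint64_t logged_value;  // the redo record for the pending store
    uint64_t* target;       // field in the data structure to update
};

// A hypothetical pure update function; any deterministic computation works.
static uint64_t compute_update(uint64_t old_value) { return old_value + 1; }

// Basic block (a): read from the data structure, write to the log. Re-running
// it after a crash recomputes the same record, because the data structure is
// untouched until the state transition below is durable. Hence it is idempotent.
void read_to_log(MicroTx& tx, const uint64_t* src) {
    tx.logged_value = compute_update(*src);
    tx.state = State::LogToData;   // persist the record and the state here
}

// Basic block (b): read the log, update the data structure. Re-running it
// re-applies the same value, so it is idempotent as well.
void log_to_data(MicroTx& tx) {
    *tx.target = tx.logged_value;
    tx.state = State::Done;        // persist the state here
}
```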
Red-Black Trees and AVL Trees. In Sect. 11, we discussed our testing approach.
After each injected crash, we verified that we can recover and return to the
clean state. Furthermore, after recovery we checked that the tree is correct,
i.e., correctly balanced and colored. For all experiments in Sect. 12, we
inserted k keys into an empty tree and removed the same k keys in a different
order. An incorrect state machine would yield corrupt trees; since all trees
remained valid, this indicates that the state machines are correct. During
development, we tested the state machines extensively: after each insert or
remove operation, we verified that the tree is correct and that the number of
nodes matches the expected count.
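Such a post-recovery check can be sketched as follows for the Red-Black case; the node layout is illustrative, and the key-ordering and root-color checks are omitted for brevity.

```cpp
#include <cstddef>
#include <stdexcept>

struct Node { Node* left; Node* right; bool red; };

// Returns the black height of the subtree rooted at n and counts its nodes;
// throws if a red node has a red child or if black heights diverge.
int check(const Node* n, size_t& count) {
    if (n == nullptr) return 1;   // nil leaves count as black
    ++count;
    if (n->red && ((n->left && n->left->red) || (n->right && n->right->red)))
        throw std::runtime_error("red node with red child");
    int lh = check(n->left, count);
    int rh = check(n->right, count);
    if (lh != rh) throw std::runtime_error("black heights differ");
    return lh + (n->red ? 0 : 1);
}
```

The node count accumulated in `count` can then be compared against the expected number of keys after each insert or remove.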
15 Discussion of Limitations
For remote access, message passing might in some cases provide higher
performance than RDMA, but it is orthogonal to the shared-memory approach used
for local access. Supporting both would require two completely different
transaction systems, whereas our goal was to design a single transaction system
for local and remote usage. As shown in Sect. 9, we abstract from local and
remote access and use one common implementation for both.
For B and B+ trees, insert, remove, and balancing are operations of limited
scope. They do not need transactions. The algorithms for these trees are of
low complexity, and the corresponding state machines would be tiny. They are
tuned for absolute performance. For performance reasons, the literature has
favored high-radix trees with low balancing costs. However, for Red-Black Trees
and AVL Trees, balancing is the common case [58]. In raw performance, Red-Black
Trees and AVL Trees are simply not competitive; they serve other demands.
As discussed before, data structures that need auxiliary space beyond O(1)
cannot be supported with a constant-size log. Allocating additional memory
would violate our assumptions. The only remaining option is to store the
auxiliary data in the data structure itself. AVL remove needs a stack of size
O(log n). We instead use the common technique of maintaining pointers to
parents. In general, this would induce space overhead in each tree node. Our
initial data layout for the Red-Black Trees was 32 bytes and had sufficient
unused space to add the up-pointers (Up and Dir) without changing its size,
see Figure 5. Insert and remove in trees can often be implemented with
top-down algorithms, which only need constant-sized auxiliary space. Linked
lists and hash tables also need only constant-sized auxiliary space.
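A hedged sketch of such a packing is shown below; the exact field widths and the use of 32-bit node indices are our guesses for illustration, the authoritative layout is the one in Figure 5.

```cpp
#include <cstdint>

// Key, value, and the two children fill most of the 32 bytes; Up and Dir fit
// into the remaining bits, so the node size does not grow.
struct RBNode {
    uint64_t key;          // 8 bytes
    uint64_t value;        // 8 bytes
    uint32_t left;         // child indices into the node pool
    uint32_t right;
    uint32_t up;           // Up: parent index, replacing the O(log n) stack
    uint32_t color : 1;    // red or black
    uint32_t dir   : 1;    // Dir: which child of its parent this node is
    uint32_t unused: 30;
};
static_assert(sizeof(RBNode) == 32, "layout must stay within 32 bytes");
```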
Atomic CAS for NVRAM [34, 44] brings its own challenges. Atomic operations
are commonly used because (a) they do not tear and (b) they provide protection
against concurrent access. In our approach, we are only interested in the
former property, because we use locks for thread safety. In this paper, we
therefore treat CAS more like a read, modify, and atomic 8-byte store operation
(cf. Sect. 2), as there is no contention.
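Read this way, the operation reduces to the following sketch; persist() and atomic_update() are illustrative helpers, assuming x86 CLWB/SFENCE as in the earlier sketch.

```cpp
#include <immintrin.h>
#include <atomic>
#include <cstdint>

// Make one 8-byte word durable: write back its cache line, then fence.
static void persist(void* addr) {
    _mm_clwb(addr);
    _mm_sfence();
}

// Under the single-writer assumption, the CAS cannot lose a race. We use it
// only because the aligned 8-byte store is guaranteed not to tear; durability
// is added by persisting the word afterwards.
bool atomic_update(std::atomic<uint64_t>& word,
                   uint64_t expected, uint64_t desired) {
    bool ok = word.compare_exchange_strong(expected, desired);
    if (ok) persist(&word);
    return ok;
}
```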
16 Conclusion
We presented a new transaction system for complex data structures in NVRAM
that provides exactly once semantics and durable linearizability with a redo
log of constant size. It splits large transactions into smaller
micro-transactions and uses a state machine approach to perform them step-wise.
Every accepted transaction will eventually succeed and is never aborted. For
local and remote access, we use the same primitives: load, store, atomic
update, and persist. This allowed us to design one transaction system that runs
both locally and over InfiniBand for remote access. As our approach does not
handle concurrent access by itself, we use locks to control concurrency. For
remote access, we designed a fault-tolerant lock with f-fairness.
Wang et al. [63] presented the first Red-Black Tree implementation for NVRAM,
but their approach is based on shadowing the whole tree. We presented, to the
best of our knowledge, the first AVL Tree implementation for NVRAM and the
first Red-Black Tree implementation for NVRAM without shadowing and with
updates ‘in-place’. These trees are algorithmically far more challenging than
the trees covered in the literature so far. Insert and remove are global
operations instead of a sequence of operations with limited scope. Thus, they
need transactions.
Shadowing can guarantee that the data structure appears consistent at all
times: it atomically replaces parts of the data structure with new data, so
intermediate steps are never visible. Wang et al. [63] atomically replace the
old tree with the new one. There is no need for recovery, but this approach
fails to provide exactly once semantics. In our approach, the data structure
might be inconsistent after a crash, but it is recoverable at all times. We
see recovery as finishing the interrupted operation, i.e., moving forward to
the clean state, and we thereby support exactly once semantics. By using a
constant-sized log, we avoid any overhead for dynamic log allocation, log
pruning, and keeping a shadow copy of the whole tree.
17 Availability
Our code is available on GitHub under the Apache License 2.0:
https://github.com/tschuett/transactions-on-nvram
Acknowledgments
The authors thank ZIB’s Supercomputing department and ZIB’s core facilities
unit for providing the machines and infrastructure for the evaluation. This
work received funding from the German Research Foundation (DFG) under
grant RE 1389 as part of the DFG priority program SPP 2037 (Scalable data
management for future hardware). This work is partially supported by In-
tel Corporation within the Research Center for Many-core High-Performance
Computing (Intel PCC) at ZIB.
References
[1] Georgy Adel’son-Vel’skii and Evgenii Landis. An algorithm for the organi-
zation of information. Dokl. Akad. Nauk SSSR, 146:263–266, 1962.
[2] Marcos K. Aguilera and Douglas B. Terry. The many faces of consistency.
IEEE Data Eng. Bull., 39(1):3–13, 2016. URL http://sites.computer.
org/debull/A16mar/p3.pdf.
[3] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles,
Techniques, and Tools. Addison-Wesley, 1986.
[4] Frances E. Allen. Control flow analysis. SIGPLAN Not., 5(7):1–19, July
1970. ISSN 0362-1340. doi: 10.1145/390013.808479. URL http://doi.
acm.org/10.1145/390013.808479.
[5] Joe Armstrong. Programming Erlang: Software for a Concurrent World.
Pragmatic Bookshelf, 2013. ISBN 193778553X, 9781937785536.
[6] Joy Arulraj, Andrew Pavlo, and Subramanya R. Dulloor. Let’s talk about
storage & recovery methods for non-volatile memory database systems.
In Proceedings of the 2015 ACM SIGMOD International Conference on
Management of Data, pages 707–722, New York, NY, USA, 2015. ACM.
[7] Joy Arulraj, Justin Levandoski, Umar Farooq Minhas, and Per-Åke Larson.
BzTree: A high-performance latch-free range index for non-volatile
memory. Proc. VLDB Endow., 11(5):553–565, January 2018. ISSN 2150-
8097. doi: 10.1145/3187009.3164147. URL https://doi.org/10.1145/
3187009.3164147.
[8] Rudolf Bayer. Symmetric binary B-trees: Data structure and maintenance
algorithms. Acta Informatica, 1(4):290–306, 1972.
[9] Rudolf Bayer and Edward McCreight. Organization and maintenance of
large ordered indexes. Acta Informatica, 1(3):173–189, Sep 1972. ISSN
1432-0525. doi: 10.1007/BF00288683. URL https://doi.org/10.1007/
BF00288683.
[10] Kumud Bhandari, Dhruva R. Chakrabarti, and Hans-J. Boehm. Makalu:
Fast recoverable allocation of non-volatile memory. In Proceedings of the
2016 ACM SIGPLAN International Conference on Object-Oriented Pro-
gramming, Systems, Languages, and Applications, OOPSLA 2016, pages
677–694, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4444-9.
[11] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest
failure detector for solving consensus. Journal of the ACM (JACM), 43(4):
685–722, 1996.
[12] Shimin Chen and Qin Jin. Persistent b+-trees in non-volatile main mem-
ory. Proc. VLDB Endow., 8(7):786–797, February 2015. ISSN 2150-
8097. doi: 10.14778/2752939.2752947. URL http://dx.doi.org/10.
14778/2752939.2752947.
[13] Yeounoh Chung and Erfan Zamanian. Using RDMA for lock management.
CoRR, abs/1507.03274, 2015. URL http://arxiv.org/abs/1507.03274.
[14] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Ra-
jesh K. Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making
persistent objects fast and safe with next-generation, non-volatile memo-
ries. SIGPLAN Not., 46(3):105–118, March 2011. ISSN 0362-1340.
[15] Joel Coburn, Trevor Bunker, Meir Schwarz, Rajesh Gupta, and Steven
Swanson. From ARIES to MARS: Transaction support for next-generation,
solid-state drives. In Proceedings of the Twenty-Fourth ACM Symposium
on Operating Systems Principles, SOSP ’13, pages 197–212, New York, NY,
USA, 2013. ACM. ISBN 978-1-4503-2388-8.
[16] Douglas Comer. The ubiquitous B-tree. ACM Comput. Surv., 11(2):121–
137, 1979. doi: 10.1145/356770.356776. URL http://doi.acm.org/10.
1145/356770.356776.
[17] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek,
Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O through
byte-addressable, persistent memory. In Proceedings of the ACM SIGOPS
22nd Symposium on Operating Systems Principles, SOSP ’09, pages 133–
146, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-752-3.
[18] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford
Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd
edition, 2009. ISBN 0262033844, 9780262033848.
[19] Pierre-Jacques Courtois, Frans Heymans, and David Lorge Parnas. Con-
current control with readers and writers. Commun. ACM, 14(10):667–668,
October 1971. ISSN 0001-0782.
[20] Damian Dechev, Peter Pirkelbauer, and Bjarne Stroustrup. Under-
standing and effectively preventing the ABA problem in descriptor-
based lock-free designs. In 2010 13th IEEE International Symposium
on Object/Component/Service-Oriented Real-Time Distributed Computing,
pages 185–192. IEEE, 2010.
[21] Michal Friedman, Maurice Herlihy, Virendra J. Marathe, and Erez Petrank.
A persistent lock-free queue for non-volatile memory. In Proceedings of the
23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, PPoPP 2018, Vienna, Austria, February 24-28, 2018, pages
28–40, New York, NY, USA, 2018. ACM.
[22] Robert Gerstenberger, Maciej Besta, and Torsten Hoefler. Enabling Highly-
Scalable Remote Memory Access Programming with MPI-3 One Sided. In
Proceedings of the International Conference on High Performance Comput-
ing, Networking, Storage and Analysis, pages 53:1–53:12, New York, NY,
USA, Nov. 2013. ACM. ISBN 978-1-4503-2378-9.
[23] Ellis R. Giles, Kshitij Doshi, and Peter Varman. SoftWrAP: A lightweight
framework for transactional support of storage class memory. In Mass
Storage Systems and Technologies (MSST), 2015 31st Symposium on, pages
1–14, New York, NY, USA, 2015. IEEE Computer Society.
[24] Leo J. Guibas and Robert Sedgewick. A dichromatic framework for bal-
anced trees. In 19th Annual Symposium on Foundations of Computer Science,
pages 8–21, New York, NY, USA, 1978. IEEE Computer Society.
[29] Intel Corp. Intel 64 and IA-32 architectures optimization reference manual,
September 2019.
[30] Joseph Izraelevitz, Hammurabi Mendes, and Michael L. Scott. Linearizabil-
ity of persistent memory objects under a full-system-crash failure model.
In Cyril Gavoille and David Ilcinkas, editors, Distributed Computing, pages
313–327, Berlin, Heidelberg, 2016. Springer Berlin Heidelberg. ISBN 978-
3-662-53426-7.
[41] Jörg Nievergelt and Edward Reingold. Binary search trees of bounded
balance. SIAM Journal on Computing, 2(1):33–43, 1973. doi: 10.1137/
0202005.
[42] Henk J. Olivié. A new class of balanced search trees: half-balanced binary
search trees. RAIRO. Informatique théorique, 16(1):51–71, 1982.
[43] Ismail Oukid, Daniel Booss, Adrien Lespinasse, Wolfgang Lehner, Thomas
Willhalm, and Grégoire Gomes. Memory management techniques for
large-scale persistent-main-memory systems. PVLDB, 10(11):1166–1177,
8 2017. doi: 10.14778/3137628.3137629. URL http://www.vldb.org/
pvldb/vol10/p1166-oukid.pdf.
[44] Matej Pavlovic, Alex Kogan, Virendra J. Marathe, and Tim Harris. Brief
announcement: Persistent multi-word compare-and-swap. In Proceedings of
the 2018 ACM Symposium on Principles of Distributed Computing, PODC
’18, pages 37–39, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-
5795-1. doi: 10.1145/3212734.3212783. URL http://doi.acm.org/10.
1145/3212734.3212783.
[45] Fernando Magno Quintao Pereira and Jens Palsberg. Register allocation
after classical ssa elimination is np-complete. In International Conference
on Foundations of Software Science and Computation Structures, pages
79–93. Springer, 2006.
[46] Marius Poke and Torsten Hoefler. DARE: High-performance state machine
replication on RDMA networks. In Proceedings of the 24th International Sym-
posium on High-Performance Parallel and Distributed Computing, pages
107–118. ACM, 2015.
[47] William N. Scherer III and Michael L. Scott. Advanced contention man-
agement for dynamic software transactional memory. In Proceedings of the
twenty-fourth annual ACM symposium on Principles of distributed comput-
ing, pages 240–248, New York, NY, USA, 2005. ACM.
[51] David Schwalb, Tim Berning, Martin Faust, Markus Dreseler, and Hasso
Plattner. nvm malloc: Memory allocation for NVRAM. In Rajesh Bor-
dawekar, Tirthankar Lahiri, Bugra Gedik, and Christian A. Lang, editors,
ADMS@VLDB, pages 61–72, 2015. URL http://dblp.uni-trier.de/db/
conf/vldb/adms2015.html#SchwalbBFDP15.
[52] Seunghee Shin, James Tuck, and Yan Solihin. Hiding the long latency of
persist barriers using speculative execution. In Proceedings of the 44th An-
nual International Symposium on Computer Architecture, ISCA ’17, pages
175–186, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4892-8. doi:
10.1145/3079856.3080240.
[53] Jan Skrzypczak, Florian Schintke, and Thorsten Schütt. Linearizable state
machine replication of state-based CRDTs without logs. In Proceedings of
the 2019 ACM Symposium on Principles of Distributed Computing, pages
455–457, 2019.
[54] Jan Skrzypczak, Florian Schintke, and Thorsten Schütt. RMWPaxos:
Fault-tolerant in-place consensus sequences. IEEE Transactions on Parallel
and Distributed Systems, 31(10):2392–2405, 2020.
[55] Daniel Sleator, Robert Tarjan, and William Thurston. Rotation distance,
triangulations, and hyperbolic geometry. In Proceedings of the Eighteenth
Annual ACM Symposium on Theory of Computing, STOC ’86, pages 122–
135, New York, NY, USA, 1986. ACM. ISBN 0-89791-193-8. doi: 10.1145/
12130.12143. URL http://doi.acm.org/10.1145/12130.12143.
[56] Storage Networking Industry Association. NVM PM Remote Access for
High Availability, February 2016.
[57] Storage Networking Industry Association. NVM Programming Model v1.2,
June 2017.
[58] Robert Endre Tarjan. Updating a balanced search tree in O(1) rotations.
Information Processing Letters, 16(5):253–257, 1983. ISSN 0020-0190.
[59] Athanasios K. Tsakalidis. Rebalancing operations for deletions in AVL-trees.
RAIRO-Theoretical Informatics and Applications-Informatique Théorique
et Applications, 19(4):323–329, 1985.
[60] David Vandevoorde and Nicolai M. Josuttis. C++ Templates: The Com-
plete Guide. Addison-Wesley Professional, 1 edition, November 2002. ISBN
9780201734843.
[61] Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and
Roy H. Campbell. Consistent and durable data structures for non-volatile
byte-addressable memory. In Proceedings of the 9th USENIX Conference
on File and Storage Technologies, FAST’11, pages 5–5, Berkeley, CA, USA,
2011. USENIX Association. ISBN 978-1-931971-82-9.
[62] Haris Volos, Andres Jaan Tack, and Michael M. Swift. Mnemosyne:
lightweight persistent memory. In Rajiv Gupta and Todd C. Mowry, ed-
itors, Proceedings of the 16th International Conference on Architectural
Support for Programming Languages and Operating Systems, ASPLOS
2011, Newport Beach, CA, USA, March 5-11, 2011, pages 91–104. ACM,
2011. ISBN 978-1-4503-0266-1. doi: 10.1145/1950365.1950379. URL
https://doi.org/10.1145/1950365.1950379.
[63] Chundong Wang, Qingsong Wei, Lingkun Wu, Sibo Wang, Cheng Chen,
Xiaokui Xiao, Jun Yang, Mingdi Xue, and Yechao Yang. Persisting RB-Tree
into NVM in a consistency perspective. ACM Trans. Storage, 14(1):
6:1–6:27, February 2018. ISSN 1553-3077. doi: 10.1145/3177915. URL
http://doi.acm.org/10.1145/3177915.
[64] Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steve
Swanson. An empirical guide to the behavior and use of scalable persistent
memory. In 18th USENIX Conference on File and Storage Technologies
(FAST 20), pages 169–182, 2020.
[65] Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong,
and Bingsheng He. NV-Tree: Reducing consistency cost for NVM-based
single level systems. In 13th USENIX Conference on File and Storage
Technologies (FAST 15), pages 167–181, Santa Clara, CA, 2015. USENIX
Association. ISBN 978-1-931971-201.