BBB: Simplifying Persistent Programming Using Battery-Backed Buffers

Abstract—Non-volatile memory (NVM) is poised to augment or replace DRAM as main memory. With the right abstraction and support, non-volatile main memory (NVMM) can provide an alternative to the storage system for hosting long-lasting persistent data. However, keeping persistent data in memory requires programs to be written such that data is crash consistent (i.e. it can be recovered after failure). Critical to supporting crash recovery is a guarantee on the order in which stores become durable with respect to program order. Strict persistency, which requires persist order to coincide with the program order of stores, is simple and intuitive but generally thought to be too slow. More relaxed persistency models are available but demand higher programming complexity, e.g. they require the programmer to insert persist barriers correctly in their program.

We identify the source of strict persistency inefficiency as the gap between the point of visibility (PoV), which is the cache, and the point of persistency (PoP), which is the memory. In this paper, we propose a new approach to close the PoV/PoP gap, which we refer to as Battery-Backed Buffer (BBB). The key idea of BBB is to provide a battery-backed persist buffer (bbPB) in each core next to the L1 data cache (L1D). A store value is allocated in the bbPB as it is written to the cache, becoming part of the persistence domain. If a crash occurs, the battery ensures the bbPB can be fully drained to NVMM. BBB simplifies persistent programming, as the programmer does not need to insert persist barriers or flushes. Furthermore, our BBB design achieves nearly identical results to eADR in terms of performance and number of NVMM writes, while requiring two orders of magnitude less energy and time to drain.

I. INTRODUCTION

Non-volatile main memory (NVMM) is poised to augment or replace DRAM as main memory. Due to its non-volatility, byte addressability, and much higher speed than SSD and HDD, NVM can host persistent data in main memory [4], [10], [11], [39], [46], [51], [52], [74], [95], [96].

In order to utilize this non-volatility feature, it is critical to guarantee the ordering of persists, i.e. the order in which stores reach persistent memory and become durable. This ordering is specified through a persistency model [5], [23], [38], [43], [68]. Without an explicit guarantee, the persist order will follow the cache replacement policy instead of the program order in updating the persistent memory state, and this may lead to inexplicable results. Programmers rely on the persistency model to write both normal-operation code and post-crash recovery code, and to reason about how such code can keep persistent data in a consistent state [23], [27], [43], [54], [68], [88], [89]. In designing persistency models, it is generally accepted that there is a tradeoff between performance and programmability. For example, strict persistency requires persist order to coincide with the program order of stores, while epoch persistency orders persists across epochs but not within an epoch. While a more relaxed persistency model can offer higher performance, adopting one burdens the programmer with additional tasks, e.g. defining epochs.

Another persistency programmability challenge is caused by the gap between the point of visibility (PoV) and the point of persistency (PoP), which affects parallel programs (Figure 1). A store may become visible to other threads when its value is written to the cache, but it does not persist until it reaches the memory controller (MC). Furthermore, before persistency is ensured, a store value may be observed by another thread, which may then persist another value, resulting in an inconsistent persistent memory state.

Fig. 1: Illustration of the gap between the Point of Visibility (PoV) at the L1D cache and the Point of Persistency (PoP) at the MC or NVMM (the figure shows the path Core → L1D → L2 → LLC → MC and NVMM).
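To make this hazard concrete, consider the following hypothetical two-thread interleaving (our illustration, not an example from the paper; it assumes x and y reside in persistent memory):

#include <atomic>
#include <thread>

// Hypothetical illustration (ours): visibility (PoV, at the cache) can run
// far ahead of persistence (PoP, at the MC/NVMM). Assume x and y live in
// persistent memory.
std::atomic<int> x{0}, y{0};

void producer() {
    x.store(1);           // visible to other threads once written to the cache,
}                         // yet possibly not persistent for a long time

void consumer() {
    if (x.load() == 1)    // observes the store through the cache (PoV)
        y.store(2);       // y's block may reach NVMM before x's block does; a
}                         // crash then leaves y == 2 but x == 0 after recovery

int main() {
    std::thread t0(producer), t1(consumer);
    t0.join(); t1.join();
}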
1 void AppendNode(int new_val){
2   // create and initialize new node
3   node_t* new_node = new node_t(new_val);
4   // update new node's next pointer
5   new_node->next = head;
6   // update the linkedList's head pointer
7   head = new_node;
8 }

Fig. 2: Example code to add a node to the beginning of a linked list.

Persistency ordering is supported by two types of special instructions. (1) The first type sends a dirty block to the NVMM, which is done by flushing or writing back the corresponding cache block from the caches to NVMM (writeBack). Examples of such instructions are clwb and clflushopt from the x86 ISA, and DC CVAP and DC CVADP from the Arm ISA. (2) After sending blocks to the NVMM, the second instruction type needed to guarantee persistency ordering is a barrier/fence instruction that makes subsequent instructions wait until the flushing is completed (persistBarrier). Examples of such barrier instructions are sfence and mfence for the x86 ISA, and DSB and DMB for the Arm ISA.

The main task of a programmer is to decide when and where to add these two types of special instructions. Figure 2 shows example code that adds a node to the head of a linked list. The code creates and initializes a new node (line 3), makes it point to the current head node (line 5), and updates the head pointer to the new node (line 7). The code example is correct and works well if crash recoverability is not a concern. However, with NVMM, the code risks losing the entire linked list if stores persist out of program order. For instance, the update to the head pointer may be persisted before the new node itself is persisted. If a crash occurs between the two persists, the new node will be lost (since it is still in the volatile caches), while the head pointer will still point to the new node, which becomes invalid after the crash.
To make this code NVMM-friendly, the programmer may impose persist ordering by modifying the code as shown in Figure 3. Mainly, special instructions are added after storing the new node (lines 7-8) and after storing the head pointer (lines 12-13). With that, it is now guaranteed that the update to the head pointer will never be persisted until the update to the new node itself is persisted.

1  void AppendNode(int new_val){
2    // create and initialize new node
3    node_t* new_node = new node_t(new_val);
4    // update new node's next pointer
5    new_node->next = head;
6    // NEW: Persist new node
7    writeBack(new_node);
8    persistBarrier;
9    // update the linkedList's head pointer
10   head = new_node;
11   // NEW: Persist head pointer
12   writeBack(head);
13   persistBarrier;
14 }

Fig. 3: Updated code to add a node to the beginning of a linked list.
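For concreteness, the abstract writeBack and persistBarrier primitives in Figure 3 map naturally onto the x86 instructions named above. The following sketch is ours, not the paper's: it assumes a node fits in one 64-byte cache line, a compiler and CPU with CLWB support, and the same node_t/head definitions as Figure 2:

#include <immintrin.h>

struct node_t {                  // minimal node type, assumed for illustration
    int val;
    node_t* next;
    node_t(int v) : val(v), next(nullptr) {}
};
node_t* head = nullptr;          // assumed to reside in persistent memory

void AppendNode(int new_val) {
    node_t* new_node = new node_t(new_val);
    new_node->next = head;
    _mm_clwb(new_node);          // writeBack(new_node): push the line to NVMM
    _mm_sfence();                // persistBarrier: order persist before next store
    head = new_node;
    _mm_clwb(&head);             // writeBack(head)
    _mm_sfence();                // persistBarrier
}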
With BBB, this persist ordering problem no longer exists, and the code shown in Figure 2 can be safely used without any risk of persist ordering issues. This is because the new node initialization (line 3) becomes persistent immediately after its store commits. Since the two stores are guaranteed to commit in program order, the store updating the head pointer (line 7) commits after that, and hence the two updates also automatically and instantly persist in that order. Note that this discussion focuses on the persist ordering problem; other programming problems (e.g. transaction semantics, permanent leaks) are outside the scope of this paper.

B. Non-Volatile Caches and eADR

As mentioned earlier, non-volatile caches (NVCaches) have been proposed and evaluated in the literature [40], [58], [69], [77]. NVCaches rely on various NVM technologies, such as PCM, STT-RAM, and ReRAM [4], [28], [48], [52], [71], which differ in their access latency and density. However, they suffer from challenges similar to NVMM's, including limited write endurance. These problems are more pronounced than in NVMM because caches are written at a much higher rate than memory, and the closer the cache is to the core, the higher the rate. Spin-Transfer Torque RAM (STT-RAM) has a relatively high write endurance of 4 × 10¹² writes, higher than alternatives such as Phase Change Memory (PCM) with 10⁸ writes [52], [61], [71], [87] and Resistive RAM (ReRAM) with 10¹¹ writes [4], [48], [61], [87]. However, their write endurance is still orders of magnitude lower than that of SRAM memory cells (about 10¹⁵) [61], [71], [87]. Furthermore, they also suffer from high write energy and higher access latency than SRAM caches. Finally, unless they are used for the entire cache hierarchy, they still leave a PoV/PoP gap that complicates persistency programming.

Nevertheless, SRAM-based caches can be made effectively non-volatile by providing an additional energy source to create battery-backed caches. This energy source should be sufficient to implement a flush-on-fail policy, where the entire cache hierarchy is drained to memory when a crash happens, thus making the SRAM-based caches appear non-volatile. This approach extends the ADR [37] guarantee from covering only the memory controller to covering the entire cache hierarchy; hence, it is named enhanced ADR (eADR) [16], [77], [78]. Compared to NVCaches, eADR does not affect access latency, write energy, or write endurance. However, flush-on-fail requires a substantial amount of energy and time, resulting in considerable space and cost for the energy source (e.g. battery), and it delays crash recovery until the draining is complete.
Fig. 4: Overview of the persistence domain. Labels in the original figure: per-core Store Buffer and bbPB (PoP: New), Caches (PoV), private cache(s), shared LLC, WPQ, DRAM controller, NVMM controller, and NVMM (PoP: Old); a legend marks components that are battery-backed in eADR only, in BBB only, or in both.

Fig. 5: bbPB organizations: (a) processor side and (b) memory side, including the forced drain message sent to the bbPBs when a dirty LLC block is evicted.
As shown in Figure 5, the bbPB can be organized on the processor side, next to the L1D, or on the memory side, logically next to the MC; the processor-side organization is one choice. In such an organization, each bbPB entry corresponds to an (address, value) pair for each store instruction that needs to persist. Store granularity could be used (e.g. byte, word, doubleword). The stores need to be ordered in the bbPB because they have not yet reached the persistence domain. Coalescing of values between stores is not permitted except in some special cases (e.g. when two stores are consecutive and involve the same block)². In contrast, in the memory-side organization, each bbPB entry corresponds to a data block whose value is changed by a store. Because bbPB entries are already in the persistence domain, stores to the same block can be coalesced regardless of the ordering of such stores. Furthermore, ordering is not necessary, as store values have already reached the persistence domain in bbPB entries. Entries in the bbPB can also drain out of order to NVMM, making various optimizations possible, for example, one that minimizes NVMM writes. By allowing store reordering and coalescing, the memory-side organization conveys substantial advantages: it requires fewer bbPB entries to perform well, and it reduces writes to NVMM. Furthermore, the memory-side organization also simplifies cache coherence: since the bbPB is at the memory side, it is not directly involved in cache coherence the way the L1D or L2 caches are.
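To make the contrast concrete, the following minimal sketch (ours, not the paper's hardware) captures the memory-side organization: entries are keyed by block address, so a persisting store to a resident block coalesces in place, and no ordering among entries needs to be tracked:

#include <array>
#include <cstdint>
#include <unordered_map>

struct BBPBEntry {
    uint64_t block_addr;              // physical address of the 64B block
    std::array<uint8_t, 64> data;     // latest (coalesced) value of the block
};

class MemorySideBBPB {
    std::unordered_map<uint64_t, BBPBEntry> entries;  // ~32 entries in hardware
public:
    // Called as a committed store writes its value to the L1D. A store to a
    // resident block coalesces in place; no ordering among entries is kept,
    // because every entry is already inside the persistence domain.
    void allocate_or_coalesce(uint64_t paddr, const uint8_t* bytes, unsigned size) {
        uint64_t block = paddr & ~uint64_t(63);       // 64B-align the address
        BBPBEntry& e = entries[block];                // hit: coalesce; miss: allocate
        e.block_addr = block;
        for (unsigned i = 0; i < size; i++)           // assumes the store does not
            e.data[(paddr & 63) + i] = bytes[i];      // cross a block boundary
        // Capacity handling (threshold-triggered draining) is described
        // later in Section III-F.
    }
};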
The two approaches also differ in handling a load from the core. In the processor-side approach, a load must check both the cache hierarchy and the bbPB to find the data block. In most cases, the block will be found in the caches rather than the bbPB, but in rare cases, the block may have been evicted from the caches while still residing in the bbPB. In such cases, the bbPB supplies the block to the core. Handling a load in the memory-side approach is more complicated. A load first accesses the cache hierarchy. If it misses in the hierarchy (i.e. a last level cache (LLC) miss), the block may reside in the memory (NVMM or WPQ) or in a bbPB, so both need to be checked. The MC may need to inquire of the bbPB of each core to find the potentially latest/valid value of the block. Alternatively, to avoid a broadcast, a bbPB directory may be kept at the MC to track which bbPB may hold a valid copy of the block. The need to broadcast or keep a directory is a substantial drawback of the memory-side approach.

Thus, for our BBB design, we choose the memory-side approach; however, it has the drawback just noted. When the LLC misses, the missed block may still be in a bbPB, pending to be drained to persistent memory. The memory has a stale copy of the block, so the missed block must be located in the right bbPB. To locate the block, a broadcast to all bbPBs in all cores may be needed; or, if a directory for bbPBs is kept, only select bbPBs need to be inquired. However, a broadcast is not scalable, while keeping directory information updated requires a complex protocol mechanism, as various protocol races could occur. To avoid this problem, we require that the LLC be dirty-inclusive of bbPBs, i.e. any bbPB block must have a corresponding dirty block in the LLC. Being dirty-inclusive, an LLC miss is guaranteed not to find a block in a bbPB, hence eliminating the need to check bbPBs on LLC misses. Enforcing inclusion is simple: when a dirty LLC block is evicted, a forced drain message is sent to all bbPBs (Figure 5(b)), akin to a back invalidation being sent to the L2 and L1 caches. If a bbPB has such a block, it drains the block before responding with an acknowledgment.

A dirty block may be drained from the bbPB as well as written back from the LLC. While it is correct to let both occur, for write endurance reasons we should avoid the redundant write back from the LLC. To quickly identify dirty blocks that should not be written back, we add a bit to each cache block to annotate a block that is holding persistent data, similar to the one used in [50]. When such a block is evicted from the LLC, it is not written back to NVMM. Since a dirty persistent block in the LLC has or had a corresponding bbPB block, the value can be considered to have been written back to memory.

C. Handling Relaxed Memory Consistency Models

In an earlier discussion, we described the PoV/PoP gap. There is a subtle issue here related to relaxed memory consistency models. For example, with release consistency, PoV is defined only for and with regard to release synchronization memory instructions, but is undefined/unordered for regular stores. This creates ambiguity as to whether PoP for stores should be ordered, or left unspecified like the PoV. To guide our choice, we note that PoV is only applicable to multi-threaded applications, as it governs when a store from one thread is seen by others, whereas memory persistency applies even to single-threaded (sequential) applications. We believe the latter requires persist ordering to be defined even for relaxed consistency models. Hence, we propose that PoP follow program-order semantics for persisting stores.

A challenge in achieving program-order persistency for relaxed consistency models is that while stores are committed in program order, they do not reach the L1D in program order. For example, if an older store misses in the L1D, a younger store that hits the cache is permitted to write its value into the L1D. If updates to the bbPB and the L1D coincide, we cannot guarantee program order at the PoP. To solve this, for relaxed consistency models, we also battery-back the store buffer (SB) (Figure 4). In this design, PoP is achieved when a committed store is allocated in the SB, earlier than PoV, which is at the L1D. This requirement for sequential programs is equally needed when using NVCaches or eADR, as stores may also write to the cache out of program order. This design adds a small cost to the battery, but it allows BBB to guarantee program-order persistency without requiring the programmer to use persist barriers and without incurring persistency stalls. When a crash happens, the content of the SB is drained directly to the WPQ (similar to non-temporal stores [3]) after the content of the corresponding bbPB is completely drained. This guarantees that per-core program order is maintained.

² An additional coalescing opportunity is possible if epoch persistency is considered: stores within an epoch may be coalesced. However, with BBB we are targeting strict persistency.
D. BBB Design Invariants

BBB design requires the following invariants to be kept to guarantee correct execution and crash recovery:
1) A store is allocated an entry in the bbPB as it writes its value to the L1D cache.
2) Any value allocated in the bbPB eventually reaches the NVMM, even if a crash occurs.
3) A store may become visible to other threads only after it becomes persistent.³
4) LLC or L2 caches are inclusive of bbPBs, and a block only resides in at most one bbPB.

³ This invariant is equally needed for eADR or any NVCache solution.

To meet Invariant 1, a store is allocated a bbPB entry after all older stores have been allocated and written to the cache. If the bbPB is full, some entries are drained to free them up. To avoid performance degradation from a full bbPB, the bbPB needs to be sized sufficiently. If the bbPB already has the block, the new store value is coalesced with it.

Invariant 2 is ensured by having a battery with sufficient energy to drain the bbPB to memory, thus guaranteeing that any allocated bbPB entry will eventually reach the NVMM. This includes in-flight inter-core packets between bbPBs.

Invariant 3 is common in persistency studies. Violating it may result in a first store from a first thread that has not persisted becoming visible to a second thread, which then persists a second store that depends on the first store. If a crash occurs, the threads disagree on the persistent state of the first store. To meet Invariant 3, the L1D cache ensures that it has obtained the block in the coherence state that allows the store (i.e. the M state) before the store writes to the L1D cache and is allocated in the bbPB.

Invariant 4 was partly discussed in Section III-B; the rest is discussed below in Section III-E.
E. Cache Coherence Interaction

bbPBs have two unique characteristics. First, despite being logically located at the memory side, each core has its own bbPB; hence, if not carefully designed, a block may potentially exist in multiple bbPBs and suffer from coherence issues. To avoid that, Invariant 4 requires that a block reside in at most one bbPB. The invariant ensures that a block is drained only once from a bbPB to NVMM with the latest value, and it avoids dealing with coherence between copies in multiple bbPBs. The second unique characteristic is that the bbPB is located close to the core, and a persisting store needs to allocate an entry as it writes its value to the L1D. To enforce Invariant 4, a writing core cannot simply allocate a new entry for a block if the block resides in another bbPB. The block must be removed from the other bbPB and retrieved to the writing core's bbPB.

There are two issues that we need to deal with to support Invariant 4. The first is that a bbPB must be notified of any relevant external invalidation or intervention request made by another core. This is not as simple as it sounds, because the LLC does not keep a bbPB directory; it only keeps a directory for the per-core L2 caches. Hence, when a core wants to write to a block, it does not know which bbPB to send the invalidation to. To simplify this, we enforce bbPB-L2 inclusion, meaning that for each block in a bbPB, the same block must also exist in the L2 cache. L2 inclusion provides substantial benefits: because the LLC keeps the L2 directory, by sending an invalidation to the sharer L2 caches (which then send back invalidations to their respective bbPBs), it is guaranteed that the bbPB containing the block is notified as well. No new directory information is needed in the LLC.

The second issue is whether to drain the block from the bbPB when an invalidation/intervention is received by a bbPB. If the block is drained, Invariant 4 is enforced, as the block is removed from the current bbPB so that the new bbPB can allocate it. However, draining delays the acknowledgement or reply to the invalidation/intervention until draining is complete, and it incurs an additional write to NVMM, which reduces write endurance. Thus, we choose not to drain the block. Instead, when an external request is received, the block is moved to the requesting core. The requesting core is now responsible for draining this block to the NVMM. Note that the energy source is sized to provide sufficient energy to complete any in-flight packets in the event of a crash. Therefore, it is guaranteed that no updates will be lost due to the inter-core movement of cache blocks. This requirement is equally needed for eADR.

Fig. 6: Illustrating how BBB handles the main cache coherence cases with data in bbPBs: (a) invalidation of an M block, (b) invalidation of a shared block, (c) intervention to an M block. Terms follow from [83].

Figure 6 illustrates the main coherence scenarios. Two cores are illustrated, with the L2 cache and the bbPB shown for each core. A block X and its initial state (assuming the MESI protocol) in the L2 cache of Core1 are shown; Core1 receives an external request from Core2.
TABLE II: The bbPB actions corresponding to different coherence operations, originating from other cores (remote invalidation/intervention) or from the same core (local read/write). An operation is marked unmodified (UM) if the base MESI protocol applies.

State | In bbPB? | RemoteInv  | RemoteInt | LocalRd | LocalWr
M     | N        | UM         | UM        | UM      | Allocate
M     | Y        | Fig. 6(a)  | Fig. 6(c) | UM      | Coalesce
E     | N        | UM         | UM        | UM      | Allocate
E     | Y        | Invalidate | UM        | UM      | Coalesce
S     | N        | UM         | UM        | UM      | Allocate
S     | Y        | Fig. 6(b)  | UM        | UM      | Coalesce
I     | N        | UM         | UM        | UM      | Allocate
I     | Y        | Invalidate | UM        | UM      | Coalesce
In example (a), the block is in Core1's L2 cache (in the M state) and in its bbPB. The L2 cache at Core1 receives a Read Exclusive request from Core2 and notifies the bbPB. The L2 cache invalidates the block, and the bbPB removes the block (without draining it). The block is then sent to Core2, which installs it in its L2 cache (in the M state), allowing it to write to the block and install it in its bbPB. This example illustrates that if a block is written by multiple cores, the block may move between bbPBs but will drain to memory only once.

In example (b), block X is initially shared by both cores. An Upgrade request is received at Core1's L2, which notifies the bbPB. As before, the block is invalidated from the L2 cache and removed from the bbPB. An acknowledgment is sent to Core2. At this time, Core2 has sufficient state to allow it to write to the block and simultaneously install it in its bbPB. No draining occurs here, either.

Finally, in example (c), the block is initially in the M state and Core1's L2 cache receives a read request from Core2. In response, Core1 downgrades its block from M to S and replies to the request with data. However, the block remains in the original bbPB. With the traditional MESI protocol, the block would be written back to memory, because the resulting state S indicates that the block must be clean in the cache. However, our memory-side approach allows an optimization. Since the bbPB is in the persistence domain and can be considered an extension of the main memory, it is as if the M block had already been written back to memory. Hence, the write back to memory is skipped, saving bandwidth at the LLC.
In conclusion, the modifications to the cache coherence protocol are minor. No additional delay is added to the critical path of cache coherence transactions. Furthermore, our BBB approach allows the bbPB to minimize the number of writes to memory, both for bbPB draining and for writebacks from the L2 cache. Table II summarizes the full set of coherence cases and the corresponding bbPB operations.
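For the M-state rows of Table II, the added bbPB behavior can be summarized in a few lines of pseudo-logic (our sketch, not the paper's RTL; the E and I rows replace the move with a plain invalidate, and all UM cells fall through to the base MESI protocol):

enum class Op { RemoteInv, RemoteInt, LocalRd, LocalWr };

void allocate_entry() {}             // stubs standing in for real bbPB logic
void coalesce_into_entry() {}
void move_entry_to_requester() {}    // ships the entry to the requesting core

// Added bbPB actions for an M-state block, per the first two rows of
// Table II. Note the entry is moved, never drained, on a coherence event.
void bbpb_action_M(bool in_bbpb, Op op) {
    if (!in_bbpb) {
        if (op == Op::LocalWr) allocate_entry();   // "Allocate" cell
        return;                                    // all other N cells: UM
    }
    switch (op) {
        case Op::RemoteInv: move_entry_to_requester(); break; // Fig. 6(a)
        case Op::RemoteInt: move_entry_to_requester(); break; // Fig. 6(c)
        case Op::LocalRd:   break;                            // UM
        case Op::LocalWr:   coalesce_into_entry();     break; // "Coalesce" cell
    }
}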
each type being 8GB and having a separate MC. The NVMM
F. Other Issues MC is in persistence domain and is battery-backed (ADR).
a) bbPB draining policy: Another important design issue The NVMM read and writes latencies are 150ns and 500ns,
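Putting the when and the how together, the policy amounts to the following sketch (ours; the 32-entry size and 75% threshold are the defaults used in this paper):

#include <cstdint>
#include <deque>

// Threshold-triggered FCFS draining, as described in a) above.
struct DrainPolicy {
    static constexpr size_t kEntries   = 32;
    static constexpr size_t kThreshold = (kEntries * 3) / 4;  // 75% occupancy

    std::deque<uint64_t> fifo;           // block addresses, oldest at the front

    void drain_to_nvmm(uint64_t) { /* issue the write toward the NVMM MC */ }

    void on_allocate(uint64_t block_addr) {
        fifo.push_back(block_addr);
        // Drain nothing until occupancy reaches the threshold, then drain
        // oldest-first (FCFS) until occupancy falls back below it.
        while (fifo.size() >= kThreshold) {
            drain_to_nvmm(fifo.front());
            fifo.pop_front();
        }
    }
};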
b) Hardware cost of BBB: Assuming a bbPB of 16-32 entries (more on this choice in Section V), the total size of the bbPB will be about 1-2KB per core. Each bbPB entry contains a 64-byte data block plus up to 8 bytes of metadata holding the physical block address and a few status bits. The physical address is used to avoid accessing the TLB when the bbPB is drained, or when the L2 cache sends a back invalidation to the bbPB.

c) Context switch: Because the bbPB holds the block's physical address, there is no cache block address aliasing problem between multiple processes. No draining or state saving is needed on a context switch.

IV. METHODOLOGY

TABLE III: The simulated system configuration.

Component   | Configuration
Processor   | 8 cores, OoO, 2GHz, 8-wide issue/retire; ROB: 192, fetchQ/issueQ/LSQ: 32/32/32
L1I and L1D | private, 128kB, 8-way, 64B, 2 cycles
L2          | shared, 1MB, 8-way, 64B, 11 cycles
DRAM        | 8GB, 55ns read/write
NVMM        | 8GB, 150ns read, 500ns write (ADR)
bbPB        | 32 entries per core, drain threshold 75%

A. Simulation configuration

We evaluate BBB using a multicore processor model built on the gem5 simulator [15], with parameters shown in Table III. The machine consists of a hybrid DRAM/NVM main memory, each type being 8GB and having a separate MC. The NVMM MC is in the persistence domain and is battery-backed (ADR). The NVMM read and write latencies are 150ns and 500ns, respectively, which are higher than DRAM latencies, in line with prior studies [14], [20], [47], [55], [86]. We use the Arm 64-bit instruction set architecture (ISA). Our simulation models an Arm-based mobile phone with an 8-core processor; each core has an 8-wide out-of-order pipeline. L1 caches are private per core, while the L2 is shared. Coherence between L1 caches relies on a directory-based MESI protocol.
B. Workload Description

To evaluate the battery-backed persist buffer (bbPB) size requirements of BBB, we designed the workloads listed in Table IV. These workloads are chosen to generate significant persist traffic.

TABLE IV: Summary of the evaluated workloads along with their descriptions and the percentage of persistent stores (%P-Stores) out of the total stores in the workload.

Workload     | Description                       | %P-Stores
rtree        | 1 million-node rtree insertion    | 15.5%
ctree        | 1 million-node ctree insertion    | 18.9%
hashmap      | 1 million-node hashmap insertion  | 6.0%
mutate[NC/C] | modify in 1 million-element array | 23.8%
swap[NC/C]   | swap in 1 million-element array   | 23.8%

Among these workloads, rtree, ctree, and hashmap maintain a 1 million-node data structure that is allocated in the persistent space, and the workload performs random insertions into the data structure. This generates persistent writes that need to be allocated in the bbPB. Similarly, array-mutate and array-swap perform random mutate and swap operations, respectively, on a 1 million-element array. NC or C after the array operation's name (e.g. mutateNC vs mutateC) stands for "Non-Conflicting" or "Conflicting", respectively. This indicates whether each thread performs updates on a separate region of the array (hence non-conflicting), or conflicts are allowed between threads. Each workload runs with 8 threads on 8 cores.

We designed the workloads to exert maximum pressure on the bbPB. They perform back-to-back persistent writes with little other computation. In contrast, real-world workloads typically perform additional computation to generate the data to be persisted. Thus, our analysis of the bbPB size required for good performance represents the worst case for the workloads we studied.

For all of these workloads, we evaluate BBB normalized to eADR, which serves as the base case. eADR represents the optimal case for performance overheads and number of writes to NVMM; hence, it performs as well as a system designed without any persistency in mind. On average, the simulated window reports the timing of 250 million instructions, after 200 million instructions of warm-up.

C. Methodology for Evaluating Draining Cost

eADR draining cost depends on the cache hierarchy and the number of cores. We evaluate the cost based on two types of systems with differing core counts and cache hierarchies: a server class and a mobile class system, as shown in Table V. The server class system is based on the specifications of the Intel Xeon Platinum 9222 [25], [94], while the mobile class system is based on the Arm-based iPhone 11 specifications [26], [30], [36]. Most notably, the total cache size is 107MB and 8.75MB for the server and the mobile class system, respectively.

TABLE V: Systems used to evaluate the draining costs.

Component       | Mobile Class | Server Class
Number of cores | 6            | 32
L1 cache size   | 6 x 128kB    | 32 x 32kB
L2 cache size   | 1 x 8MB      | 32 x 1MB
L3 cache size   | N/A          | 2 x 35.75MB
Memory channels | 2            | 12

To compare the draining cost of BBB and eADR, we focus on (1) the energy needed at the time of the crash, which determines the size, lifetime, and system footprint of the battery, and (2) the time needed to perform the draining, which is affected by the amount of data to drain and the non-volatile main memory (NVMM) write bandwidth. This time impacts the turnaround time after a crash, and thus the responsiveness of the system. It can also result in further energy overheads if other parts of the system (e.g. the core) need to remain alive during draining.

Estimating draining energy. On a crash, data in the bbPB (or in the caches for eADR) is accessed and then moved to NVMM. We assume that the caches in eADR and the bbPB in BBB are SRAM. The energy needed to access data in such SRAM cells is estimated to be about 1pJ/Byte [63]. However, this is very small compared to the energy needed for data movement, which is much harder to calculate. Our estimates of the energy cost of data movement are based on the results of Pandiyan and Wu [65], who looked into the energy cost of data movement across several levels of the memory hierarchy. The energy consumption per memory operation was measured using an external power meter while executing carefully designed micro-benchmarks. These micro-benchmarks were used to isolate and observe the energy needed solely for data movement and to minimize the effect of out-of-order execution and other architectural optimizations. More specifically (a sketch of such a micro-benchmark follows the list):
1) To calculate the cost of data movement between the processor and a targeted level of the memory hierarchy (e.g. the L2 cache), the micro-benchmarks operate on allocated data whose memory footprint is chosen so that it does not fit in any of the cache levels above the targeted level.
2) The average memory latency and the cache miss rates were continuously monitored to validate that the micro-benchmarks access the targeted level of the memory hierarchy.
3) The micro-benchmarks were designed to minimize the impact of other operations not related to memory accesses.
4) To isolate the impact of compiler optimizations on the micro-benchmarks, all the assembly code was manually validated to guarantee the expected behavior.
tem are based on the Arm-based iPhone 11 specifications [26], These experiments provided the energy needed to move the
[30], [36]. Most notably, the total cache size for the system data between the processor’s registers and any level in the
Finally, the difference between these results can be used to calculate the energy cost of moving data between different levels of the memory hierarchy.

Table VI shows the estimated energy needed for draining data from different cache levels to NVMM. The numbers are derived from [65], with some adaptations: (1) As the analysis in [65] only reports data movement in the direction from memory to caches, we estimate that the energy needed to bring data from a cache to memory (as needed at the crash) is similar to that needed to bring data from memory to the cache. (2) Since the results reported in [65] are only for a DRAM-based system, we use those results for our analysis of the energy needed to drain to NVMM. This assumption equally affects eADR and BBB, and thus will not have a notable impact on our comparison between the two schemes. (3) The energy for draining a block from the bbPB is estimated from the energy to drain a block from the L1D cache to NVMM. (4) The energy numbers in [65] do not cover a 3-level memory hierarchy. Therefore, we assume the draining cost numbers do not increase when adding another cache level, as in the server class system in Table V. This assumption produces an optimistic energy figure for eADR, so in reality, eADR energy cost may be higher than our estimate. Moreover, when reporting eADR draining energy and time costs, we calculate only the energy and time needed to drain dirty blocks, to estimate the average energy and time.

TABLE VI: Estimated energy costs of different operations for draining eADR or BBB at the moment of crash.

Operation                     | Energy Cost
Accessing data from SRAM      | 1 pJ/Byte
Moving data from L1D to NVMM  | 11.839 nJ/Byte
Moving data from bbPB to NVMM | 11.839 nJ/Byte
Moving data from L2 to NVMM   | 11.228 nJ/Byte
Moving data from L3 to NVMM   | 11.228 nJ/Byte

The final part of our energy analysis is estimating the battery size. In this analysis, the battery needs to be provisioned with sufficient energy to drain the entire caches, in case all blocks in the caches are dirty. This is important because failing to drain even one dirty cache block may result in inconsistent persistent data that cannot be recovered. We chose the smallest battery size capable of storing the required energy. Different battery technologies have different energy densities (i.e. the amount of energy stored per unit volume). We looked into two main battery technologies, SuperCap [98] and Li-thin [67], which have energy densities of 10⁻⁴ and 10⁻² Wh/cm³, respectively [93].

Estimating draining time. For this part, we rely on the reported NVMM bandwidth and latencies [41]. As the draining happens at a crash with no other traffic present, we assume that the entire NVMM bandwidth is dedicated to draining. NVMM bandwidth also depends on the number of memory channels in each system, as described in Table V.

V. EVALUATION

We first discuss the most important aspect of BBB: its draining cost, in comparison to eADR (Section V-A). Then we discuss BBB performance and write overheads (Sections V-B and V-C). Finally, we present a sensitivity study of the BBB design (Section V-D).

A. Draining Cost Comparison

Table VII presents the average energy needed to drain data from the caches (for eADR) and from the bbPB (our BBB approach, 32 entries), based on the cost model we discussed in Section IV-C. We give eADR optimistic estimates with several assumptions. First, we assume eADR only drains dirty cache blocks to memory. For the workloads evaluated, on average 44.9% of blocks in the cache hierarchy are dirty, similar to the figure obtained by García et al. [31]. Second, we assume dirty blocks are identified using a hardware finite state machine that is power efficient and consumes zero energy overhead. Moreover, we do not include the static energy cost for eADR. In contrast, we note that BBB does not require cache accesses for dirty block identification. Furthermore, caches do not need to be powered during the draining process (hence no static energy consumption). Finally, we assume that at the time of failure, the battery-backed persist buffers (bbPBs) are full and all entries need to be drained, representing the worst case for BBB.

TABLE VII: Estimated draining energy cost for BBB vs. eADR (dirty blocks only).

System       | eADR    | BBB    | Normalized to BBB: eADR | BBB
Mobile Class | 46.5 mJ | 145 µJ | 320×                    | 1
Server Class | 550 mJ  | 775 µJ | 709×                    | 1

As shown in Table VII, despite the optimistic estimates, eADR costs 46.5 mJ and 550 mJ to drain for the mobile and server class systems, respectively. Not surprisingly, the mobile class system has smaller caches, hence its draining energy is smaller than the server class system's. Despite its more conservative (worst-case) estimates, BBB costs only 145 µJ and 775 µJ, respectively, which is 320× and 709× more efficient than eADR. BBB's energy cost is between two and three orders of magnitude smaller than eADR's.
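As a sanity check on these numbers (our own arithmetic; the paper does not show this calculation), the BBB entries of Table VII can be reproduced directly from Tables V and VI:

#include <cstdio>
#include <initializer_list>

// Back-of-the-envelope check: the BBB rows of Table VII follow from
// Table V's core counts, Table VI's 11.839 nJ/Byte bbPB-to-NVMM cost,
// and a full 32-entry bbPB of 64B blocks per core.
int main() {
    const double nj_per_byte = 11.839;              // Table VI, bbPB -> NVMM
    for (int cores : {6, 32}) {                     // mobile, server (Table V)
        double bytes = cores * 32.0 * 64.0;         // 12,288 B and 65,536 B
        double uj = bytes * nj_per_byte / 1000.0;   // ~145 uJ and ~776 uJ
        std::printf("%2d cores: %6.0f B -> %4.0f uJ\n", cores, bytes, uj);
    }
    // Likewise, draining 65,536 B in the reported 2.4 us implies ~27 GB/s of
    // aggregate NVMM write bandwidth, i.e. roughly 2.3 GB/s per channel over
    // the server's 12 channels, in line with the measurements cited in [41].
    return 0;
}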
Table VIII presents the average time needed to drain data from the caches (for eADR) and from the bbPB (our BBB approach). eADR takes 0.8 ms and 1.8 ms to drain for the mobile class and the server class system, respectively. In contrast, BBB takes only 2.6 µs (307× faster) and 2.4 µs (750× faster), respectively, which represents two to three orders of magnitude of improvement.

Both eADR and our BBB need an energy source for draining. We estimate two energy source types, super capacitors (SuperCap) [98] and lithium thin-film batteries (Li-thin) [67], by applying the analysis from [93] (as discussed in Section IV-C) while using the energy values from Table VII. Table IX shows the estimates for only the active material needed for the battery, excluding packaging and other aspects. As shown in Table IX, the energy source needed for eADR is large.
TABLE VIII: Estimated draining time for BBB vs. eADR (dirty blocks only).

System       | eADR   | BBB    | Normalized to BBB: eADR | BBB
Mobile Class | 0.8 ms | 2.6 µs | 307×                    | 1
Server Class | 1.8 ms | 2.4 µs | 750×                    | 1

TABLE IX: Estimates of the size of the energy source needed to implement BBB and eADR (a), and the footprint occupied by the energy source as a ratio to the area of the mobile class system's core (b).

                 (a) Size/Volume (mm³)      (b) Ratio to core area (%)
        Battery  SuperCap    Li-thin        SuperCap          Li-thin
Mobile  eADR     2.9 × 10³   30             7,746% (∼77×)     359.5% (∼3.6×)
        BBB      4.1         0.04           97.2%             4.5%
Server  eADR     34 × 10³    300            40,363% (∼404×)   1,873% (∼18.7×)
        BBB      21.6        0.21           296% (∼3×)        13.7%

Even with Li-thin, which is more space efficient, the area is still large: 3.6× and 18.7× the size of the core for the mobile class and server class systems, respectively. In contrast, BBB requires a much smaller size. Even with SuperCap, the area needed is 97.2% and 296% of the size of a core for the mobile and the server class, respectively. It becomes even smaller with Li-thin: 4.5% and 13.7% of the size of a core, respectively. Overall, the battery volume for BBB is between 707−1,574× smaller, while the area for BBB is between 79−137× smaller than for eADR. Moreover, Table X looks into the battery size when varying the number of bbPB entries and shows that even with a bbPB size of 1024 entries, BBB is 22−49× cheaper than eADR.

TABLE X: Battery size (in mm³) when varying the number of bbPB entries for mobile (M) and server (S) platforms.

bbPB Size    |     1 |     4 |   16 |   32 |   64 |   256 |  1024
SuperCap (M) |  0.12 |  0.50 | 2.02 |  4.1 |  8.1 |  32.3 | 129.3
SuperCap (S) |  0.7  |  2.7  | 10.8 | 21.6 | 43.1 | 172.4 | 689.7
Li-thin (M)  | 0.001 | 0.005 | 0.02 | 0.04 | 0.08 |   0.3 |   1.3
Li-thin (S)  | 0.006 | 0.026 | 0.10 | 0.21 | 0.43 |   1.7 |   6.8

B. Performance Evaluation

The much smaller amount of data that needs draining is the primary reason for BBB's much smaller battery cost than eADR's. Figure 7(a) shows that this comes at almost no performance cost: BBB with a 32-entry bbPB performs within about 1% of eADR.

Fig. 7: (a) Execution time and (b) number of writes to NVMM, for BBB with 32 entries (first bar), BBB with 1024 entries (second bar), and eADR (third bar), normalized to eADR.

As discussed in Section IV, our workloads were designed to generate back-to-back writes to the persistent domain, which stresses the persist buffers. However, the nature of the data structures in the workloads still creates a difference in the time needed to perform each operation, and thus a difference in the frequency at which the core generates persists. This is why some workloads (e.g. swapNC) might incur relatively higher delays due to the very short time between two subsequent persists.

C. NVMM Writes Evaluation

Due to the limited write endurance of NVMM, the number of writes to NVMM is also an important metric. If we had used the processor-side approach, almost every persisting store would go to the bbPB and drain to the NVMM. With the memory-side approach, stores are coalesced in the bbPB, and the number of NVMM writes depends on the number of block drains needed to free up entries in the bbPB.
Hence, we expect the number of bbPB entries to determine the number of NVMM writes. Also, as discussed in Section III, dirty writebacks from the LLC to the NVMM in our BBB approach are silently dropped to avoid redundant writes to the NVMM.

Figure 7(b) compares the number of NVMM writes for BBB with 32-entry and 1024-entry bbPBs against eADR, normalized to eADR. eADR represents the optimal case, because eADR does not introduce any new writes to the NVMM due to persistency ordering. The figure shows that even a 32-entry bbPB in BBB captures the majority of the coalescing that happens in eADR; it only adds an average of 4.9% writes to NVMM (ranging from 1−7.9%) over eADR. This overhead decreases to less than 1% if BBB uses a 1024-entry bbPB, since the larger buffer provides more room to hold blocks and captures most coalescing opportunities. This result illustrates the effectiveness of the memory-side approach in coalescing stores in the bbPB, because the bbPB is in the persistence domain. In contrast, traditional persist buffers prevent most coalescing, because coalescing would result in a persistency ordering violation, so the number of writes would be much higher.

We also measured the number of writes to NVMM using the processor-side approach, and found that on average there are 2.8× more writes to NVMM than with eADR. This is because there are not many coalescing opportunities, while the memory-side approach is effective in performing coalescing.

D. Sensitivity Studies

To obtain deeper insights into BBB, we vary the bbPB size from 1 entry up to 1024 entries. Figure 8 shows the number of times persisting stores are rejected at the bbPB because it is full (a), the execution time overhead (b), and the number of bbPB drains to memory (c). All figures are normalized to the 1-entry bbPB case. Note that the y-axes start at -0.2, so that near-zero values are visible in the figures.

Fig. 8: Rejected persisting stores (a), execution time overhead (b), and bbPB drains to memory (c), as the bbPB size varies from 1 to 1024 entries, normalized to the 1-entry bbPB (y-axis: workloads' average overhead).

Increasing the bbPB size reduces these overheads, including the impact on execution time, which stops decreasing with 32 entries. The bbPB drain overhead reaches near zero with 64 entries, but 32 entries are not far behind. This represents the amount of coalescing achieved at the bbPB, which translates into a reduction in the number of writes to the NVMM. Thus, a 32-entry bbPB (our default configuration) is the smallest size that shows very close results compared to eADR, beyond which we start to have diminishing returns. This buffer size might be higher than what was used in prior works (e.g. 8 entries in [50]). However, we decided to conservatively choose a relatively larger buffer for the following reasons: (1) We chose this size to conduct the energy comparison between BBB and eADR; therefore, we chose the smallest size that shows almost no performance degradation (about 1%). Prior works showed acceptable, yet higher, degradation when using smaller sizes, which is consistent with the result we report in Figure 8. (2) Our evaluated workloads were chosen to represent the worst case and stress the design by generating back-to-back persists; therefore, we expect to need larger buffers. In general, the choice of bbPB size is a design decision based on the tradeoff between the energy budget and the desired performance.

VI. RELATED WORK

NVM has received significant research activity in recent years. Past work has examined various aspects of NVM, including memory organization (e.g., [9], [79]), abstraction (e.g., [84]), checkpointing (e.g., [7], [27]), memory safety (e.g., [8], [11], [95], [96]), secure execution environments (e.g., [12], [29]), extending lifetime (e.g., [6], [21], [22], [70], [72]), and persistency acceleration (e.g., [81], [82]). The above list is a small subset of examples of work in NVM research. From here on, we expand on the papers that are most immediately related to our work.

Persistency models. Persist barriers enable epochs to be defined in BPFS [23]. Pelley et al. defined and formalized memory persistency models including strict, epoch, and strand persistency [68], in order of increasing performance and programming complexity.
persistency [68], in the order of increasing performance and
1
Workloads’ Avg. Overhead (X)
1024
1024
1
2
4
8
1
2
4
8
1
2
4
8
32
16
16
64
32
64
16
32
64
512
256
128
256
512
128
256
128
512
In Intel's initial persistency support, a persisting store requires a flush, a fence, and another instruction, pcommit, which flushes the MC write pending queue (WPQ) to NVMM. With ADR, Intel adds the WPQ to the persistency domain, thus deprecating pcommit [37]. A capacitor or battery is needed to provide the WPQ's flush-on-fail capability. More recently, Intel hinted that eADR will make it into production [78]. eADR [77] adds the entire cache hierarchy to the persistence domain. Because the PoV/PoP gap is closed, flush and fence instructions are generally no longer necessary with eADR. However, eADR requires a battery to provide flush-on-fail for the entire cache hierarchy instead of only the bbPBs in BBB. Table XI summarizes the comparison between eADR and BBB.

TABLE XI: Summary of the comparison between eADR and BBB regarding the hardware/integration costs.

Aspect                              | eADR      | BBB
Processor modifications             | None      | (1) Adding the bbPBs; (2) Minor coherence changes
Draining energy cost                | Very High | Low
Time needed to drain                | Very High | Low
Drive energy to targeted components | Needed    | Needed

Failure atomicity. Persistency programming assumes a certain persistency model and, on top of that, relies on support for failure atomic code regions. For sequential programs, libraries and language extensions have been implemented to provide transaction-based failure atomicity [2], [17], [85]. Automatic transformation of data structures to persistent memory has been proposed too [53], [60]. For concurrent programs, compiler-based automatic instrumentation has been proposed to transform lock-based concurrent programs for persistent memory [19], [33]. Other works propose hardware and software primitives to aid in porting lock-free data structures to persistent memory [66], [91]. Many studies add durability to transaction-based concurrent programs without changing the application source code [32], [44], [92], while others add durability to software transactions [24], [75], [86]. Finally, checkpointing and system-level solutions have also been used to achieve failure atomicity [45], [59], [76]. Overall, the aforementioned works provide mechanisms for achieving failure atomic regions (i.e. durable transactions), which is orthogonal to the goal of our paper. BBB addresses persist ordering and simplifies ordering-related programming complexity, providing a property that higher-level primitives such as failure atomic regions can rely on.

VII. CONCLUSION

We have proposed Battery-Backed Buffers (BBB), a microarchitectural approach that simplifies the persist ordering aspect of persistency programming by aligning the point of persistency (PoP) with the point of visibility (PoV). We evaluated BBB over several workloads and found that adding a 32-entry bbPB per core is sufficient to provide performance comparable to eADR (only 1% slowdown and 4.9% extra writes) while requiring 320−709× lower draining energy than eADR.

REFERENCES

[1] Intel, "An introduction to pmemcheck." [Online]. Available: https://pmem.io/2015/07/17/pmemcheck-basic.html
[2] "Persistent memory development kit (pmdk)." [Online]. Available: https://pmem.io/pmdk/
[3] "Non-temporal store instructions," 2017. [Online]. Available: https://www.felixcloutier.com/x86/movntdq
[4] H. Akinaga and H. Shima, "Resistive Random Access Memory (ReRAM) Based on Metal Oxides," IEEE Journal, 2010.
[5] M. Alshboul, J. Tuck, and Y. Solihin, "Lazy persistency: A high-performing and write-efficient software persistency technique," in ISCA, 2018.
[6] M. Alshboul, J. Tuck, and Y. Solihin, "WET: Write efficient loop tiling for non-volatile main memory," in DAC, 2020.
[7] M. Alshboul, H. Elnawawy, R. Elkhouly, K. Kimura, J. Tuck, and Y. Solihin, "Efficient checkpointing with recompute scheme for non-volatile main memory," ACM Trans. Archit. Code Optim., 2019.
[8] A. Awad, P. Manadhata, S. Haber, Y. Solihin, and W. Horne, "Silent Shredder: Zero-Cost Shredding for Secure Non-Volatile Main Memory Controllers," in ASPLOS, 2016.
[9] A. Awad, S. Blagodurov, and Y. Solihin, "Write-aware management of NVM-based memory extensions," in ICS, 2016.
[10] A. Awad, B. Kettering, and Y. Solihin, "Non-volatile Memory Host Controller Interface Performance Analysis in High-performance I/O Systems," in ISPASS, 2015.
[11] A. Awad, Y. Wang, D. Shands, and Y. Solihin, "ObfusMem: A Low-Overhead Access Obfuscation for Trusted Memories," in ISCA, 2017.
[12] A. Awad, M. Ye, Y. Solihin, L. Njilla, and K. A. Zubair, "Triad-NVM: Persistency for Integrity-Protected and Encrypted Non-Volatile Memories," in ISCA, 2019.
[13] A. W. Baskara Yudha, K. Kimura, H. Zhou, and Y. Solihin, "Scalable and fast lazy persistency on GPUs," in IISWC, 2020.
[14] F. Bedeschi, C. Resta, O. Khouri, E. Buda, L. Costa, M. Ferraro, F. Pellizzer, F. Ottogalli, A. Pirovano, M. Tosi, R. Bez, R. Gastaldi, and G. Casagrande, "An 8Mb Demonstrator for High-density 1.8V Phase-Change Memories," in VLSIIC, 2004.
[15] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The GEM5 simulator," ACM SIGARCH Computer Architecture News (CAN), 2011.
[16] S. Blanas, "The new bottlenecks of scientific computing," 2020. [Online]. Available: https://www.sigarch.org/from-flops-to-iops-the-new-bottlenecks-of-scientific-computing/
[17] B. Bridge, "Nvm-direct." [Online]. Available: https://github.com/oracle/nvm-direct
[18] M. Carlson, "Persistent memory: What developers need to know." [Online]. Available: https://www.snia.org/educational-library/persistent-memory-what-developers-need-know-2018
[19] D. R. Chakrabarti, H.-J. Boehm, and K. Bhandari, "Atlas: Leveraging locks for non-volatile memory consistency," ACM SIGPLAN Notices, 2014.
[20] A. Chatzistergiou, M. Cintra, and S. D. Viglas, "REWIND: Recovery write-ahead system for in-memory non-volatile data-structures," VLDB Endow., Jan. 2015.
[21] J. Chen, G. Venkataramani, and H. H. Huang, "RePRAM: Re-cycling PRAM faulty blocks for extended lifetime," in DSN, 2012.
[22] J. Chen, Z. Winter, G. Venkataramani, and H. H. Huang, "rPRAM: Exploring redundancy techniques to improve lifetime of PCM-based main memory," in PACT, 2011.
[23] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee, "Better I/O through byte-addressable, persistent memory," in SOSP, 2009.
[24] A. Correia, P. Felber, and P. Ramalhete, "Romulus: Efficient algorithms for persistent transactional memory," in SPAA, 2018.
[25] CPU-World, "Intel Xeon 9222 specifications." [Online]. Available: http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%209222.html
[26] J. Cross, "Inside Apple's A13 Bionic system-on-chip." [Online]. Available: https://www.macworld.com/article/3442716/inside-apples-a13-bionic-system-on-chip.html
[27] H. Elnawawy, M. Alshboul, J. Tuck, and Y. Solihin, "Efficient checkpointing of loop-based codes for non-volatile main memory," in PACT, 2017.
[28] X. Fong, Y. Kim, R. Venkatesan, S. H. Choday, A. Raghunathan, and K. Roy, "Spin-transfer torque memories: Devices, circuits, and systems," Proceedings of the IEEE, 2016.
[29] A. Freij, S. Yuan, H. Zhou, and Y. Solihin, "Persist level parallelism: Streamlining integrity tree updates for secure persistent memory," in MICRO, 2020.
[30] A. Frumusanu, "The Apple iPhone 11, 11 Pro and 11 Pro Max review." [Online]. Available: https://www.anandtech.com/show/14892/the-apple-iphone-11-pro-and-max-review/2
[31] A. A. García, R. de Jong, W. Wang, and S. Diestelhorst, "Composing lifetime enhancing techniques for non-volatile main memories," in MEMSYS, 2017.
[32] E. Giles, K. Doshi, and P. Varman, "Hardware transactional persistent memory," in MEMSYS, 2018.
[33] V. Gogte, S. Diestelhorst, W. Wang, S. Narayanasamy, P. M. Chen, and T. F. Wenisch, "Persistency for synchronization-free regions," in PLDI, 2018.
[34] A. Hansson, N. Agarwal, A. Kolli, T. Wenisch, and A. N. Udipi, "Simulating DRAM controllers for future system architecture exploration," in ISPASS, 2014.
[35] C. Huang, V. Nagarajan, and A. Joshi, "DCA: A DRAM-cache-aware DRAM controller," in SC, 2016.
[36] Apple Inc., "Apple: iPhone 11 specifications." [Online]. Available: https://www.apple.com/iphone-11/specs/
[37] Intel, "Deprecating the pcommit instruction," 2016. [Online]. Available: https://software.intel.com/blogs/2016/09/12/deprecate-pcommit-instruction
[38] Intel, "Persistent memory programming," 2016, http://pmem.io.
[39] Intel and Micron, "Intel and Micron produce breakthrough memory technology," 2015.
[40] J. Izraelevitz, T. Kelly, and A. Kolli, "Failure-atomic persistent memory updates via justdo logging," in ASPLOS, 2016.
[41] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson, "Basic performance measurements of the Intel Optane DC persistent memory module," arXiv preprint arXiv:1903.05714, 2019.
[42] Y. Joo, D. Niu, X. Dong, G. Sun, N. Chang, and Y. Xie, "Energy- and endurance-aware design of phase change memory caches," in DATE, 2010.
[43] A. Joshi, V. Nagarajan, M. Cintra, and S. Viglas, "Efficient Persist Barriers for Multicores," in MICRO, 2015.
[44] A. Joshi, V. Nagarajan, M. Cintra, and S. Viglas, "DHTM: Durable hardware transactional memory," in ISCA, 2018.
[45] S. Kannan, A. Gavrilovska, and K. Schwan, "pVM: Persistent virtual memory for efficient capacity scaling and object storage," in EuroSys, 2016.
[46] T. Kawahara, R. Takemura, K. Miura, J. Hayakawa, S. Ikeda, Y. Lee, R. Sasaki, Y. Goto, K. Ito, T. Meguro, F. Matsukura, H. Takahashi, H. Matsuoka, and H. Ohno, "2Mb Spin-Transfer Torque RAM (SPRAM) with Bit-by-Bit Bidirectional Current Write and Parallelizing-Direction Current Read," in ISSCC, 2007.
[47] W.-H. Kim, J. Kim, W. Baek, B. Nam, and Y. Won, "NVWAL: Exploiting NVRAM in write-ahead logging," in ASPLOS, 2016.
[48] Y. Kim, S. R. Lee, D. Lee, C. B. Lee, M. Chang, J. H. Hur, M. Lee, G. Park, C. J. Kim, U. Chung, I. Yoo, and K. Kim, "Bi-layered RRAM with unlimited endurance and extremely uniform switching," in 2011 Symposium on VLSI Technology - Digest of Technical Papers, 2011.
[49] A. Kolli, V. Gogte, A. Saidi, S. Diestelhorst, P. M. Chen, S. Narayanasamy, and T. F. Wenisch, "Language-level persistency," in ISCA, 2017.
[50] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley, S. Liu, P. M. Chen, and T. F. Wenisch, "Delegated Persist Ordering," in MICRO, 2016.
[51] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, "Evaluating STT-RAM as an Energy-efficient Main Memory Alternative," in ISPASS, 2013.
[52] B. C. Lee, "Phase Change Technology and The Future of Main Memory," IEEE Micro, 2010.
[53] S. K. Lee, J. Mohan, S. Kashyap, T. Kim, and V. Chidambaram, "Recipe: Converting concurrent DRAM indexes to persistent-memory indexes," in SOSP, 2019.
[54] Z. Lin, M. Alshboul, Y. Solihin, and H. Zhou, "Exploring memory persistency models for GPUs," in PACT, 2019.
[55] M. Liu, M. Zhang, K. Chen, X. Qian, Y. Wu, W. Zheng, and J. Ren, "DudeTM: Building durable transactions with decoupling for persistent memory," in ASPLOS, 2017.
[56] S. Liu, K. Seemakhupt, Y. Wei, T. Wenisch, A. Kolli, and S. Khan, "Cross-failure bug detection in persistent memory programs," in ASPLOS, 2020.
[57] S. Liu, Y. Wei, J. Zhao, A. Kolli, and S. Khan, "PMTest: A fast and flexible testing framework for persistent memory programs," in ASPLOS, 2019.
[58] Y. Lu, J. Shu, L. Sun, and O. Mutlu, "Loose-Ordering Consistency for Persistent Memory," in ICCD, 2014.
[59] A. Memaripour, A. Badam, A. Phanishayee, Y. Zhou, R. Alagappan, K. Strauss, and S. Swanson, "Atomic in-place updates for non-volatile main memories with Kamino-Tx," in EuroSys, 2017.
[60] A. Memaripour, J. Izraelevitz, and S. Swanson, "Pronto: Easy and fast persistence for volatile data structures," in ASPLOS, 2020.
[61] S. Mittal, J. S. Vetter, and D. Li, "LastingNVCache: A technique for improving the lifetime of non-volatile caches," in 2014 IEEE Computer Society Annual Symposium on VLSI, 2014.
[62] S. Nalli, S. Haria, M. D. Hill, M. M. Swift, H. Volos, and K. Keeton, "An analysis of persistent memory use with WHISPER," SIGPLAN Not., 2017.
[63] D. Nayak, D. P. Acharya, and K. Mahapatra, "An improved energy efficient SRAM cell for access over a wide frequency range," Solid-State Electronics, vol. 126, 2016.
[64] K. Oleary, "How to detect persistent memory programming errors using Intel Inspector." [Online]. Available: https://software.intel.com/en-us/articles/detect-persistent-memory-programming-errors-with-intel-inspector-persistence-inspector
[65] D. Pandiyan and C.-J. Wu, "Quantifying the energy cost of data movement for emerging smart phone workloads on mobile platforms," IISWC, 2014.
[66] M. Pavlovic, A. Kogan, V. J. Marathe, and T. Harris, "Brief announcement: Persistent multi-word compare-and-swap," in PODC, 2018.
[67] D. Pech, M. Brunet, H. Durou, P. Huang, V. Mochalin, Y. Gogotsi, P.-L. Taberna, and P. Simon, "Ultrahigh-power micrometre-sized supercapacitors based on onion-like carbon," Nature Nanotechnology, 2010.
[68] S. Pelley, P. M. Chen, and T. F. Wenisch, "Memory Persistency," in ISCA, 2014.
[69] PMEM.io, "Persistent memory programming," 2019. [Online]. Available: https://pmem.io/2019/12/19/performance.html
[70] M. K. Qureshi, "Pay-as-you-go: Low-overhead hard-error correction for phase change memories," in MICRO, 2011.
[71] M. K. Qureshi, S. Gurumurthi, and B. Rajendran, Phase Change Memory: From Devices to Systems, 2011.
[72] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali, "Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling," in MICRO, 2009.
[73] A. Raad, J. Wickerson, G. Neiger, and V. Vafeiadis, "Persistency semantics of the Intel-x86 architecture," Proc. ACM Program. Lang., 2019.
[74] R. Rajachandrasekar, S. Potluri, A. Venkatesh, K. Hamidouche, M. W. ur Rahman, and D. K. D. Panda, "MIC-Check: A Distributed Checkpointing Framework for the Intel Many Integrated Cores Architecture," in HPDC, 2014.
[75] P. Ramalhete, A. Correia, P. Felber, and N. Cohen, "OneFile: A wait-free persistent transactional memory," in DSN, 2019.
[76] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, "ThyNVM: Enabling software-transparent crash consistency in persistent memory systems," in MICRO, 2015.
[77] A. Rudoff, "Persistent memory programming."
[78] A. Rudoff, "Persistent memory programming without all that cache flushing," in SDC, 2020.
[79] M. Saxena and M. M. Swift, "FlashVM: Virtual memory management on flash," in USENIX ATC, 2010.
[80] S. Scargall, "Persistent memory architecture," in Programming Persistent Memory, 2020.
[81] S. Shin, S. K. Tirukkovalluri, J. Tuck, and Y. Solihin, "Proteus: A flexible and fast software supported hardware logging approach for NVM," in MICRO, 2017.
[82] S. Shin, J. Tuck, and Y. Solihin, "Hiding the long latency of persist barriers using speculative execution," in ISCA, 2017.
[83] Y. Solihin, Fundamentals of Parallel Multicore Architecture. Chapman & Hall/CRC Computational Science, 2015.
[84] Y. Solihin, "Persistent memory: Abstractions, abstractions, and abstractions," IEEE Micro, 39(1), 2019.
[85] P. Subrahmanyam, "pmem-go." [Online]. Available: https://github.com/jerrinsg/go-pmem
[86] H. Volos, A. J. Tack, and M. M. Swift, “Mnemosyne: Lightweight
Persistent Memory,” in ASPLOS, 2011.
[87] J. Wang, X. Dong, Y. Xie, and N. P. Jouppi, “i2wap: Improving non-
volatile cache lifetime by reducing inter- and intra-set write variations,”
in HPCA, 2013.
[88] T. Wang, S. Sambasivam, Y. Solihin, and J. Tuck, “Hardware supported
persistent object address translation,” in MICRO, 2017.
[89] T. Wang, S. Sambasivam, and J. Tuck, “Hardware supported permission
checks on persistent objects for performance and programmability,” in
ISCA, 2018.
[90] W. Wang and S. Diestelhorst, “Quantify the performance overheads of
pmdk,” in MEMSYS, 2018.
[91] W. Wang and S. Diestelhorst, “Brief announcement: Persistent atomics
for implementing durable lock-free data structures for non-volatile
memory,” in SPAA, 2019.
[92] Z. Wang, H. Yi, R. Liu, M. Dong, and H. Chen, “Persistent transactional
memory,” IEEE Computer Architecture Letters, 2014.
[93] Z.-S. Wu, K. Parvez, X. Feng, and K. Müllen, “Graphene-based in-plane
micro-supercapacitors with high power and energy densities,” Nature
communications, 2013.
[94] Intel, "Intel Xeon Platinum 9222 processor." [Online]. Available: https://ark.intel.com/content/www/us/en/ark/products/195437/intel-xeon-platinum-9222-processor-71-5m-cache-2-30-ghz.html
[95] Y. Xu, Y. Solihin, and X. Shen, “Hardware-Based Domain Virtualization
for Intra-Process Isolation of Persistent Memory Objects,” in ISCA,
2020.
[96] Y. Xu, Y. Solihin, and X. Shen, “MERR: Improving Security of
Persistent Memory Objects via Efficient Memory Exposure Reduction
and Randomization,” in ASPLOS, 2020.
[97] L. Zhang and S. Swanson, “Pangolin: A fault-tolerant persistent memory
programming library,” in USENIXATC, 2019.
[98] Y. Zhu, S. Murali, M. D. Stoller, K. J. Ganesh, W. Cai, P. J. Ferreira, A. Pirkle, R. M. Wallace, K. A. Cychosz, M. Thommes, D. Su, E. A. Stach, and R. S. Ruoff, "Carbon-based supercapacitors produced by activation of graphene," Science, 2011.