0% found this document useful (0 votes)
11 views14 pages

BBB: Simplifying Persistent Programming Using Battery-Backed Buffers

Uploaded by

a18257157319
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views14 pages

BBB: Simplifying Persistent Programming Using Battery-Backed Buffers

Uploaded by

a18257157319
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)

BBB: Simplifying Persistent Programming using


Battery-Backed Buffers
Mohammad Alshboul1 , Prakash Ramrakhyani2 , William Wang2 , James Tuck1 , and Yan Solihin3
1
ECE, North Carolina State University: {maalshbo, jtuck}@ncsu.edu
2
Arm Research: {prakash.ramrakhyani, william.wang}@arm.com
3
Computer Science, University of Central Florida: [email protected]

Abstract—Non-volatile memory (NVM) is poised to augment the persistent memory state, and this may lead to inexplicable
or replace DRAM as main memory. With the right abstraction results. Programmers rely on the persistency model to write
and support, non-volatile main memory (NVMM) can provide an both normal-operation code and post-crash recovery code, and
alternative to the storage system to host long-lasting persistent
data. However, keeping persistent data in memory requires to reason about how such code can keep persistent data in
programs to be written such that data is crash consistent (i.e. a consistent state [23], [27], [43], [54], [68], [88], [89]. In
it can be recovered after failure). Critical to supporting crash designing persistency models, it is generally accepted that
recovery is the guarantee of ordering of when stores become there is a tradeoff between performance and programmability.
durable with respect to program order. Strict persistency, which For example, strict persistency requires persist order to co-
requires persist order to coincide with program order of stores,
is simple and intuitive but generally thought to be too slow. incide with program order of stores, while epoch persistency
More relaxed persistency models are available but demand higher orders persists across epochs but not within an epoch. While a
programming complexity, e.g. they require the programmer to more relaxed persistency model can offer higher performance,
insert persist barriers correctly in their program. adopting a more relaxed persistency model burdens the pro-
We identify the source of strict persistency inefficiency as the grammer with additional tasks, e.g. defining epochs.
gap between the point of visibility (PoV) which is the cache, and
the point of persistency (PoP) which is the memory. In this paper, Another persistency programmability challenge is caused by
we propose a new approach to close the PoV/PoP gap which we the gap between the point of visibility (PoV) and the point of
refer to as Battery-Backed Buffer (BBB). The key idea of BBB persistency (PoP), which affects parallel programs (Figure 1).
is to provide a battery-backed persist buffer (bbPB) in each core A store may become visible to other threads when the value
next to the L1 data cache (L1D). A store value is allocated in the is written to the cache, but does not persist until reaching
bbPB as it is written to cache, becoming part of the persistence
domain. If a crash occurs, battery ensures bbPB can be fully the memory controller (MC)1 . Furthermore, before persistency
drained to NVMM. BBB simplifies persistent programming as the is ensured, a store value may be observed by another thread
programmer does not need to insert persist barriers or flushes. which may persist another value, resulting a non-consistent
Furthermore, our BBB design achieves nearly identical results to persistent memory state.
eADR in terms of performance and number of NVMM writes,
while requiring two orders of magnitude smaller energy and time
to drain.
PoV PoP

I. I NTRODUCTION
Non-volatile main memory (NVMM) is poised to augment
or replace DRAM as main memory. Due to its non-volatility,
MC and
byte addressability, and being much faster in speed than SSD Core L1D L2 LLC NVMM
and HDD, NVM can host persistent data in main memory [4],
[10], [11], [39], [46], [51], [52], [74], [95], [96].
In order to utilize this non-volatility feature, it is critical to
guarantee ordering of persists, i.e. the ordering of stores reach-
ing persistent memory to become durable. This is specified
through a persistency model [5], [23], [38], [43], [68]. Without Fig. 1: Illustration for the gap between the Point of Visibilty (PoV)
at the L1D cache, and the Point of Persistency (PoP) at the NVMM
an explicit guarantee, the persist order will follow the cache or MC.
replacement policy instead of the program order in updating

This research is supported in part through the following grants: Solihin


Managing persist order and the PoV/PoP gap currently
was supported by NSF grant 1900724 and UCF. Alshboul and Tuck were incurs substantial performance penalty and requires oner-
supported in part by NSF grant CNS-1717486 and by NC State. Alshboul’s
PhD primary and co-advisor are Solihin and Tuck, respectively. Wang received 1 We assume a base system with ADR [37], where an update becomes
funding from the European Union’s Horizon 2020 research and innovation persistent when it reaches the write pending queue (WPQ) of the memory
programme under project Sage 2, grant agreement 800999. controller (MC).

2378-203X/21/$31.00 ©2021 IEEE 111


DOI 10.1109/HPCA51647.2021.00019
ous effort from the programmer. For example, consider In- TABLE I: Comparison between several schemes for providing strict
tel PMEM [73], which provides programmers flush (clwb/- memory persistency ordering: Intel PMEM, Bulk Strict Persistency
(BSP), eADR, and Battery-Backed Buffers (BBB).
clflushopt) and fence (sfence) instructions. To guarantee persist
ordering of two stores, clflushopt followed by sfence need to Aspect PMEM BSP eADR BBB
be inserted between them, delaying the issue of the second SW Complexity High Low Low Low
store (for a long time) until the first store reaches the MC. Persist Inst. clwb & fence None None None
HW Complexity Low High Low Low
A missing flush or fence may cause bugs that are difficult to Strict pers. penalty High Medium None Low
identify or reproduce due to their intermittency. Furthermore, Battery Needed None None Large Small
it is not easy to debug a persistent program as a crash must PoP location WPQ/mem Mem L1D bbPB/L1D
be induced at different points of the program to check its
persistent state correctness.
Thus, in this work, we ask the question: Can the gap Table I contrasts different approaches for implementing
between PoV and PoP be closed inexpensively? Closing this strict persistency. PMEM’s programming complexity is the
gap simplifies persistent programming and improves perfor- highest due to the need to correctly and completely insert
mance [90], because any committed store value persists and flushes and fences. BSP requires highly complex architectural
becomes visible simultaneously. One approach described in support to give the illusion of strict persistency even though
prior literature is to allow the PoV/PoP gap to exist, but hides the underlying hardware allows non-ordered persists, and still
the side effect if the gap may be exposed. In Bulk Strict incurs substantial performance penalty for strict persistency.
Persistency (BSP) [43], if a store value has not persisted but is eADR requires a large battery to support draining of the entire
requested by another thread/core, it (and older stores) are per- cache hierarchy. PMEM and BSP do not close the PoP/PoV
sisted first before responding to the request. This complicates gap as the memory or MC still serves as PoP. In contrast,
cache coherence and delays responses to external requests. BBB incurs low programming complexity (no flushes/fences
Another approach being considered by industry is to close the are needed), low hardware complexity, negligible performance
PoV/PoP gap by using non-volatile caches (NVCache), where penalty for strict persistency, and requires only small battery to
the entire cache hierarchy is added to the persistence domain. drain a few bbPB entries on a crash. BBB also closes PoP/PoV
NVM technologies (e.g. STT-RAM, ReRAM, and PCM) [28], completely as bbPB is added next to the L1D cache.
[48], [71] could be considered but they suffer from limited Overall, the contributions of this paper are the following:
write endurance, high access latency, high write energy, and • A novel approach called Battery-Backed Buffer (BBB) to
low write bandwidth [42], [61], [87]. These problems make close the PoV/PoP gap, including bbPB design choices
NVCache more suitable for last level (or near last level) cache. and coherence mechanisms.
However, the reduced PoV/PoP gap may improve performance • Estimation of energy and area needed for both BBB
but does not simplify programming. An alternative would and eADR, showing BBB with two orders of magnitude
be to use battery-backed SRAM caches through eADR [80]. improvement over eADR with regards to the energy and
However, eADR is expected to be costly as it requires a large time costs for draining.
battery to back the entire cache hierarchy [16]. • Evaluation of BBB’s effectiveness with different bbPB
In this paper, we propose a new approach to close the sizes.
PoV/PoP gap. We call this the Battery-Backed Buffer (BBB). The rest of the paper is organized as follows: Section II
The key idea of BBB is to provide a persist buffer in each provides a relevant background. Section III introduces BBB
core next to the L1 data cache (L1D) that is non-volatile and its design. The evaluation methodology is shown in
(backed by battery). A store is allocated an entry in the Section IV, while the experiments results are reported in
battery-backed persist buffers (bbPB) as it is written to cache, Section V. Section VI discusses relevant prior works. The
hence it becomes visible and persistent simultaneously. The paper concludes in Section VII.
stores in bbPB are then lazily drained from the bbPB to
memory. If a crash occurs, the battery ensures bbPB can II. BACKGROUND
be fully drained to NVMM. The bbPB is sized to balance A. The Difficulty of Persistent Programming
performance and battery cost. We find that a small number of In this section, we highlight how challenging it is for the
entries (e.g. 32) is sufficient. BBB provides strict persistency programmer to write code that guarantees crash recoverability.
semantics without the performance penalties associated with This has already been discussed in several prior works that
it. We explore a major design choice of whether to view bbPB have emphasized how writing NVMM-friendly code can be
as processor-side or memory-side structures, and discuss their very tedious and error-prone [1], [18], [56], [57], [64], [97].
tradeoffs. Due to the complete elimination of PoV/PoP gap, As mentioned in Section I, the volatility of the cache hierar-
persistent programming is simplified as flushes and fences are chy requires using persistency models to guarantee correct per-
no longer necessary. We show that with careful design, BBB’s sistency ordering at the NVMM. These models usually require
performance is nearly identical to eADR but with much lower the programmer to add special instructions to the code to guar-
hardware overhead because the bbPB size is much smaller antee crash recoverability. These instructions often have two
than caches. types of functionalities: (1) Sending updates to the NVMM,

112
1 void AppendNode(int new val){ Figure 3. Mainly, special instructions are added after storing
2 // create and initialize new node
3 node t* new node = new node t(new val);
the new node (line 7-8), and after storing the head pointer
4 // update new node’s next pointer (line 12-13). With that, it is now guaranteed that the update
5 new node −> next = head; to the head pointer will never be persisted until the update to
6 // update the linkedList ’ s head pointer
7 head = new node; the new node itself is persisted.
8 } With BBB, this persist ordering problem will no longer
exist, and the code shown in Figure 2 can still be safely used
without any risk of persist ordering issues. This is because
Fig. 2: Example code to add a node to the beginning of a the new node initialization (line 3) will become persistent
linked list. immediately after its store is committed. Since the two stores
are guaranteed to commit in program order, the store updating
the head pointer (line 7) will commit after that, and hence
1 void AppendNode(int new val){ the two updates are also going to automatically and instantly
2 // create and initialize new node
3 node t* new node = new node t(new val); persist in that order. Note that the discussion about this code
4 // update new node’s next pointer focuses on the persist ordering problem. Other programming
5 new node −> next = head; problems (e.g. transaction semantics, permanent leaks) are out
6 // NEW: Persist new node
7 writeBack(new node); of the scope of this paper.
8 persistBarrier ;
9 // update the linkedList ’ s head pointer
10 head = new node; B. Non-Volatile Caches and eADR
11 // NEW: Persist head pointer
12 writeBack(head); As mentioned earlier, non-volatile caches (NVCaches) have
13 persistBarrier ; been proposed and evaluated in literature [40], [58], [69],
14 }
[77]. NVCaches rely on various NVM technologies, such as
PCM, STT-RAM, and ReRAM [4], [28], [48], [52], [71],
Fig. 3: Updated code to add a node to the beginning of a which differ in their access latency and density. However, they
linked List. suffer from similar challenges as in NVMM, including limited
write endurance. These problems will be more pronounced
than NVMM because caches will be written at a much higher
rate than memory, and the closer it is to the core, the higher
which is done by flushing or writing back the corresponding the rate. Spin-Transfer Torque Random Access Memory (STT-
cache block from the caches to NVMM (writeBack). Some RAM) has a relatively high write endurance level of 4 × 1012
examples of such instructions are (clwb) and (clflushopt) from writes, higher than alternatives such as Phase Change Memory
x86 ISA, and (DCCVAP) and (DCCVADP) from Arm ISA. (PCM) with 108 writes [52], [61], [71], [87], and Resistive
(2) After sending blocks to the NVMM, the second instruction Random Access Memory (ReRAM) with 1011 writes [4],
type needed to guarantee persistency ordering is a barrier/fence [48], [61], [87]. However, their write endurance is still or-
instruction that ensures the subsequent instructions wait until ders of magnitude lower than SRAM memory cells (about
the flushing is completed (persistBarrier). Examples of such 1015 ) [61], [71], [87]. Furthermore, they also suffer from high
barrier instructions are (sfence) and (mfence) for x86 ISA, and write energy, and higher access latency than SRAM caches.
(DSB) and (DMB) for Arm ISA. Finally, unless they are used for the entire cache hierarchy,
The main task of a programmer is to decide when and where they will still have PoV/PoP gap that complicates persistency
to add these two types of special instructions. Figure 2 shows programming.
an example code to add a node to the head of a linked list. Nevertheless, SRAM-based caches can become non-volatile
The code creates and initializes a new node (line 3), makes by providing an additional energy source to create battery-
it to point to the current head node (line 5), and updates the backed caches. This energy source should be enough to im-
head pointer to the new node (line 7). The code example is plement flush-on-fail policy, where the entire cache hierarchy
correct and works well if crash recoverability is not a concern. is drained to memory when a crash happens, and thus making
However, with NVMM, the code risks losing the entire linked the SRAM-based caches appear non-volatile. This approach
list if stores persist out of the program order. For instance, enhances the ADR [37] guarantee from only covering the
the update to the head pointer may get persisted before the memory controller, to covering the entire cache hierarchy.
new node itself is persisted. If a crash occurs between the Hence, it is named enhanced-ADR (eADR) [16], [77], [78].
two persists, the new node will be lost (since it is still in the Compared to NVCaches, eADR does not affect access latency,
volatile caches), while the head pointer will still point to new write energy, and write endurance. However, flush-on-fail
node, which becomes invalid after the crash. requires a substantial amount of energy and time, resulting
To make this code NVMM-friendly, the programmer may in considerable space and cost for the energy source (e.g.
impose persist ordering by modifying the code as shown in battery), and delaying crash recovery until after the draining
is complete.

113
Core Core Core Core Core Core
PoP: New
Store Buffer
bbPB
PoV Caches
bbPB Private Private bbPB Processor
Cache(s) Cache(s)
Side Forced
Battery-backed in: Memory bbPB Drain
WPQ
eADR only Shared LLC Side
NVMM
BBB only WPQ
Both PoP: Old NVMM
DRAM NVMM
Controller Controller (a) Processor side (b) Memory side

Fig. 5: bbPB logical view as a processor-side structure (a) vs.


DRAM NVMM
memory-side structure (b).
Fig. 4: BBB system overview.

BBB also requires much smaller battery than eADR which


III. S YSTEM D ESIGN requires the entire cache hierarchy to be battery backed. The
size of the battery depends on the worst number of entries that
In this section, we discuss our proposed approach Battery- need to be drained at the time of crash, which is determined
Backed Buffers (BBB), its mechanism, the design space, and by the bbPB size (i.e. number of bbPB entries). Hence, the
the trade-offs associated. choice of bbPB size is an important parameter: it should be
large enough to avoid stalling the core if bbPB is full, but
A. High-Level Overview should be small enough to keep the battery small and cheap.
A crucial component that BBB adds is the battery-backed To keep the battery small, bbPB is only used for stores
persist buffers (bbPB), shown in Figure 4. The figure shows that need to persist (i.e. persisting store). Non-persisting stores
a multicore processor with flat physical address space that is consist of stores that go to volatile memory (DRAM), or
divided into DRAM and NVMM. A portion of the NVMM ones that go to NVMM but do not deal with persistent data.
address range is allocated for persistent data. Battery-backed They can go directly to the cache hierarchy without involving
components in eADR only are shown in light grey (all caches), bbPB. To distinguish store types, some prior work requires
battery-backed components in BBB only are shown in medium special instructions (e.g. NVload, NVStore) [58] to be used.
grey (bbPB), while battery-backed components in both eADR Instead, we assume that regular store instructions are used.
and BBB are shown in dark grey (store buffers and the write Persisting stores are distinguished from the pages that they
pending queue (WPQ) in the NVMM controller). Dataflow access. We assume persistent data is allocated only in the
path for persisting stores follow the thick lines. heap using persistent memory allocation (e.g. palloc), which
bbPB is located alongside the L1 data cache (L1D) of each allocates such data in pages that map to page frames within
core. Its first role is to serve as the Point of Persistence (PoP); the persistent portion of the physical address space. Persistent
any store allocated in the bbPB can be considered durable pages are allocated physically in the NVMM, while non-
as bbPB ensures that the store will eventually reach NVMM persistent pages can be allocated in either DRAM or NVMM.
through flush-on-fail draining; traditional persist buffers [50],
[62] are volatile as they lose content if power is lost. With B. Processor vs. Memory Side
BBB, a battery provides sufficient energy to drain bbPB to So far we have not discussed how bbPB should interact with
the NVMM in the event of a crash. the rest of the memory hierarchy. One obvious choice is for the
The bbPB also plays a role in matching the Point of Visibil- bbPB to be logically viewed as a processor-side structure, as it
ity (PoV) and PoP, by moving PoP up from the MC to L1D. To keeps track of persisting stores that need to drain to memory.
achieve that, a store is allocated in bbPB at the same time the This view is similar to traditional persist buffers [50], [62], but
store goes to the L1D, after any cache miss or coherence state with added PoP guarantee. Figure 5(a) illustrates this view. On
upgrade has been satisfied. Note that in some cases, the PoP the other hand, bbPB can also be viewed as a structure on the
needs to move further up to the store buffers (Section III-C). memory side. The latter view makes sense as well because in
By closing the PoV and PoP gap, persistency programming is the current architecture, only memory side structures such as
simplified as strict persistency can be achieved without explicit the NVMM and write pending queue (WPQ) in the memory
flushes and fences, and without the performance penalty of controller are in the persistence domain. As bbPB is added to
long-latency persist; a store instantly persists when allocated the persistence domain, it can be thought of as a persistence
a bbPB entry. In contrast, traditional persist buffers (used with domain extension of the write pending queues (WPQs) that
Buffered Epoch Persistency or BEP) [50], [62] have this gap are distributed in different cores.
and hence still require explicit persistency instructions. Stalls The choice of the procesor- vs. memory-side organizations
may still occur at epoch boundaries in BEP due to needing to carry important consequences. The close physical proximity to
wait the completion of persist buffer draining to NVMM. the core makes the processor-side organization a more intuitive

114
choice. In such an organization, each bbPB entry corresponds dirty block in the LLC. Being dirty-inclusive, an LLC miss is
to an (address, value) pair for each store instruction that needs guaranteed not to find a block in a bbPB, hence eliminating
to persist. Store granularity could be used (e.g. byte, word, the need to check bbPB on LLC misses. Enforcing inclusion
doubleword). The stores need to be ordered in the bbPB is simple. When a dirty LLC block is evicted, a forced drain
because they have not yet reached the persistence domain. message is sent to all bbPBs (Figure 5(b)), akin to back
Coalescing of values between stores is not permitted except invalidation being sent to the L2 and L1 caches. If a bbPB
in some special cases (e.g. when two stores are subsequent has such a block, it drains the block before responding with
and involve the same block)2 . In contrast, in the memory- acknowledgment.
side organization, each bbPB entry corresponds to a data A dirty block may be drained from bbPB as well as be
block whose value is changed by a store. Because bbPB written back from the LLC. While it is correct to let both
entries are already in the persistence domain, stores to the occur, for write endurance reason we should avoid redundant
same block can be coalesced regardless of the ordering of write back from the LLC. To quickly identify dirty blocks that
such stores. Furthermore, ordering is not necessary as store should not be written back, we add a bit to each cache block to
values have reached the persistence domain in bbPB entries. annotate a block that is holding persistent data, similar to the
Entries in bbPB can also drain out of order to NVMM, one used in [50]. When such a block is evicted from the LLC,
making various optimizations possible, for example, one that it is not written back to NVMM. Since a dirty persistent block
minimizes NVMM writes. By allowing store reordering and in LLC has or had a corresponding bbPB block, the value can
coalescing, the memory-side organization conveys substantial be considered to have been written back to memory.
advantages in requiring fewer bbPB entries to perform well,
and in reducing writes to NVMM. Furthermore, the memory-
side organization also simplifies cache coherence: since bbPB C. Handling Relaxed Memory Consistency Models
is at the memory side, it is not directly involved with cache
coherence the way L1D or L2 caches are. In an earlier discussion, we described the PoV/PoP gap.
The two approaches also differ in handling a load from the There is a subtle issue here related to relaxed memory con-
core. In the processor-side approach, a load must check the sistency models. For example, with release consistency, PoV
cache hierarchy and bbPB to find the data block. In most cases, is defined only for and with regard to release synchroniza-
the block will be found in the caches instead of the bbPB, but tion memory instructions but undefined/unordered for regular
in rare cases, the block may have been evicted from the caches stores. This creates an ambiguous aspect as to whether PoP
while still residing in the bbPB. In such cases, bbPB supplies for stores should be ordered or left unspecified as the PoV.
the block to the core. Handling a load in the memory-side To guide our choice, we note that PoV is only applicable to
approach is more complicated. A load first accesses the cache multi-threaded applications as it governs when a store from
hierarchy. If it misses in the hierarchy (i.e. last level cache one thread is seen by others, whereas memory persistency
(LLC) miss), the block may reside in the memory (NVMM applies even to single-threaded (sequential) applications. We
or WPQ) or the bbPB, so both need to be checked. The believe that the latter requires persist ordering to be defined
(MC) may need to inquire the bbPB of each core to find even for relaxed consistency models. Hence, we propose PoP
potentially the latest/valid value of the block. Alternatively, to follow the semantics of program-order for persisting stores.
to avoid broadcast, a bbPB directory may be kept at the MC A challenge to achieving program-order persistency for
to track which bbPB may hold a valid copy of the block. The relaxed consistency models is that while stores are committed
need to broadcast or keep a directory is a substantial drawback in program order, they do not go to the L1D in program order.
of the memory-side approach. For example, if an older store misses in L1D, a younger store
Thus, for our BBB design, we choose the memory-side that hits the cache is permitted to write its value in the L1D.
approach. However, there is a drawback of the memory-side If an update to bbPB and L1D coincides, we cannot guarantee
approach. We note that when the LLC misses, the missed program order to PoP. To solve this, for relaxed consistency
block may still be in bbPB pending to be drained to persistent models, we also battery-back the store buffer (SB) (Figure 4).
memory. The memory has a stale copy of the block so the In this design, PoP is achieved when a committed store is
missed block must be located from the right bbPB. To locate allocated in the SB, earlier than PoV which is at the L1D. This
the block, a broadcast to all bbPBs in all cores may be needed; requirement for sequential programs is equally needed when
or if a directory for bbPBs is kept, only select bbPBs need to using NVCache or eADR, as stores may also write to the cache
be inquired. However, a broadcast is not scalable, but keeping out of program order. This design adds a small cost to the
directory information updated requires a complex protocol battery but allows BBB to guarantee program order persistency
mechanism as various protocol races could occur. To avoid without requiring the programmer to use persist barriers and
such a problem, we require that the LLC be dirty-inclusive without incurring persistency stalls. When a crash happens,
of bbPBs, i.e. any bbPB block must have a corresponding the content of the SB will be drained directly to the WPQ
2 An additional coalescing opportunity is possible if epoch persistency is (similar to non-temporal stores [3]) after completely draining
considered: stores within an epoch may be coalesced. However, with BBB the content of the corresponding bbPB. This guarantees the
we are targeting strict persistency. per-core program order to be maintained.

115
D. BBB Design Invariants Core 1 2 1 2
+ L1D
BBB design requires the following invariants to be kept to
L2 2: Inv/
guarantee correct execution and crash recovery: X, M X Remove - X, I X, M X

1) Persisting stores enter the persistency domain in program 1: RdX


3: ReplyD (a) Invalidation to M block
order. The persistency domain includes bbPB (sequential
consistency and total store ordering), or additionally in-
Core 1 2
cludes store buffer (more relaxed consistency models).3 + L1D
1 2

2) Battery consists of sufficient energy to drain bbPB and L2 2: Inv/


WPQ (plus store buffers in some cases) to the NVMM X, S X X, S X, I X, M
Remove X
in the event of power loss. 1: Upgr
3: Ack
3) A store is not made visible to other cores/threads until (b) Invalidation to S block

it becomes persistent.
Core 1 2 1 2
4) LLC or L2 caches are inclusive of bbPBs, and a block + L1D
L2
only resides in at most one bbPB.
2: Down
X, M X - X, S X X, S
To meet Invariant 1, a store is allocated a bbPB entry after grade
1: Rd
all older stores have been allocated and written to the cache.
3: ReplyD (no flush to Mem)
If bbPB is full, some entries are drained to free them up. To
(c) Intervention to M block
avoid performance degradation from full bbPB, bbPB needs
to be sized sufficiently. If the bbPB already has the block, the Fig. 6: Illustrating how BBB handles main cache coherence
new store value will be coalesced with it. cases with data in bbPBs. Terms follow from [83].
Invariant 2 is ensured by having a battery with sufficient
energy to drain bbPB to memory, and thus guranteeing that
any allocated bbPB entry will eventually reach the NVMM. external invalidation or intervention request made by another
This includes in-flight inter-cores packets between bbPBs. core. This is not as simple as it sounds because the LLC does
Invariant 3 is common in persistency studies. Violating it not keep a bbPB directory; it only keeps directory for per-core
may result in a first store from a first thread that has not L2 caches. Hence, when a core wants to write to a block, it
persisted to become visible to a second thread which then does not know which bbPB to send the invalidation to. To
persists a second store that depends on the first store. If a crash simplify this, we enforce bbPB L2 inclusion, meaning that for
occurs, the threads disagree on the persistent state of the first each block in bbPB, the same block must also exist in the
store. To meet Invariant 3, the L1D cache ensures that it has L2 cache. L2 inclusion provides substantial benefits because
obtained the block in the coherence state that allows the store the LLC keeps L2 directory, hence by sending invalidation
(i.e. M state), before the store writes to the L1D cache and to the sharer L2 caches (which then send back invalidation
allocated in the bbPB. to their respective bbPBs), it is guaranteed that the bbPB
Invariant 4 was partly discussed in Section III-B and the containing the block will be notified as well. No new directory
rest will be discussed more below in Section III-E. information is needed in the LLC.
E. Cache Coherence Interaction Another issue is whether to drain the block from bbPB when
an invalidation/intervention is received by a bbPB. If the block
bbPBs have two unique characteristics. Despite being logi-
is drained, Invariant 4 is enforced as the block is removed
cally located at the memory side, each core has its own bbPB,
from the current bbPB so that the new bbPB can allocate it.
and hence if not carefully designed, a block may potentially
However, draining delays the acknowledgement or reply to
exist in multiple bbPB and suffer from coherence issues. In
the invalidation/intervention until draining is complete, and
order to avoid that, Invariant 4 requires that a block resides
it incurs an additional write to NVMM which reduces write
in at most one bbPB. The invariant ensures that a block
endurance. Thus, we choose not to drain the block. Instead,
is drained only once from bbPB to NVMM with the latest
when an external request is received, the block is moved to
value, and avoids dealing with coherence between copies at
the requesting core. The requesting core is now responsible for
multiple bbPBs. Another unique characteristic of bbPB is that
draining this block to the NVMM. Note that the that energy
it is located close to the core, and a persisting store needs to
source is sized to provide sufficient energy to complete any in-
allocate an entry as it writes its value to the L1D. To enforce
flight packets in the event of crash. Therefore, it is guaranteed
Invariant 4, a writing core cannot just allocate a new entry for
that no updates will be lost due to the inter-core movements of
a block if the block resides in another bbPB. The block must
cache blocks. This requirement is equally needed for eADR.
be removed from the other bbPB and retrieved to the writing
core’s bbPB. Figure 6 illustrates the main coherence scenarios. Two cores
There are two issues that we need to deal with to support are illustrated with the L2 cache and the bbPB is shown for
Invariant 4. First, a bbPB must be notified of any relevant each core. A block X and its initial state (assuming MESI
protocol) in the L2 cache of Core1 are shown. It receives
3 This invariant is equally needed for eADR or any NVCache solution. an external request from Core2. For example (a), the block

116
TABLE II: Illustrating the bbPB actions corresponding to different
coherence operations, originating from other cores (remote invali-
coalesce multiple stores, which decreases both performance
dation/intervention) or from the same core (local read/write). An and write endurance. On the other hand, draining too late
operation is marked unmodified (UM) if the base MESI protocol increases the chance of the bbPB being full when a burst
applies. of persisting stores needs new entries allocated, resulting in
State In bbPB? RemoteInv RemoteInt LocalRd LocalWr performance degradation. Hence, an important principle of
N UM UM UM Allocate optimization is to keep bbPB as full as possible while keeping
M
Y Fig. 6(a) Fig. 6(c) UM Coalesce the probability of full bbPB low. To achieve this balance, we
N UM UM UM Allocate define a draining occupancy threshold. bbPB does not drain
E
Y Invalidate UM UM Coalesce
N UM UM UM Allocate blocks except when its occupancy reaches the threshold, at
S
Y Fig. 6(b) UM UM Coalesce which time draining is initiated until the occupancy decreases
N UM UM UM Allocate below the threshold. For example, we found 75% threshold to
I
Y Invalidate UM UM Coalesce
work well for 32-entry bbPB. A similar optimization is applied
to memory controller WPQ [34], [35].
is in Core1’s L2 cache (in M state) and in bbPB. The L2 Regarding the how question, we apply a first come first
cache at Core1 receives a Read Exclusive request by Core2 served (FCFS) draining policy; the oldest block allocated in
and notifies the bbPB. The L2 cache invalidates the block and the bbPB is chosen to drain first. While other policies are
bbPB removes the block (without draining it). The block is possible, e.g. draining blocks based on the prediction for future
then sent back to Core2, which then installs it in the L2 cache writes, we leave them for future work.
(in M state), allowing it to write to the block and install it in b) Hardware cost of BBB: Assuming bbPB having 16-
its bbPB. This example illustrates that if a block is written by 32 entries (more on the choice in Section V), the total size of
multiple cores, the block may move between bbPBs but will bbPB will be about 1-2KB per core. Each bbPB entry contains
drain to memory only once. a 64-byte data block plus up to 8-byte meta-data that contains
In example (b), block X is initially shared by both cores. the physical block address, and a few bits for status. The
An Upgrade request is received at Core1’s L2 which notifies physical address is used to avoid accessing TLB when bbPB
bbPB. As before, the block is invalidated from the L2 cache is drained, or when the L2 cache sends back invalidation to
and removed from bbPB. An acknowledgment is sent to Core2. the bbPB.
At this time, Core2 has sufficient state to allow it to write to the c) Context switch: Because bbPB holds the block’s phys-
block and simultaneously install it in the bbPB. No draining ical address, there is no cache block address aliasing problem
occurs here, either. between multiple processes. No draining or state saving is
Finally, in example (c), the block is in the M state initially needed on context switch.
and Core1’s L2 cache receives a read request from Core2.
In response, Core1 downgrades its block from M to S and IV. M ETHODOLOGY
replies to the request with data. However, the block remains
TABLE III: The simulated system configuration.
in the original bbPB. With traditional MESI protocol, the block
will be written back to memory because the resulting state S Component Configuration
indicates that the block must be clean in the cache. However, Processor 8 cores, OoO, 2GHz, 8-wide issue/retire
our memory-side approach here allows an optimization. Since ROB: 192, fetchQ/issueQ/LSQ: 32/32/32
bbPB is in the persistence domain and can be considered as L1I and L1D private, 128kB, 8-way, 64B, 2 cycles
L2 shared, 1MB, 8-way, 64B, 11 cycles
an extension of the main memory, it is as if the M block had DRAM 8GB, 55ns read/write
already been written back to memory. Hence, write back to NVMM 8GB, 150ns read, 500ns write (ADR)
memory is skipped, resulting in bandwidth saving to the LLC. bbPB 32 entries per core, drain threshold 75%
In conclusion, the modifications to the cache coherence
protocol are minor. No additional delay is added to the critical
path of cache coherence transactions. Furthermore, our BBB
A. Simulation configuration
approach allows bbPB to minimize the number of writes to
memory, both for bbPB draining, as well as for writeback We evaluate BBB using a multicore processor model built
from the L2 cache. Table II illustrates full coherence cases on gem5 simulator [15], with parameters shown in Table III.
and the corresponding bbPB operations. The machine consists of a hybrid DRAM/NVM main memory,
each type being 8GB and having a separate MC. The NVMM
F. Other Issues MC is in persistence domain and is battery-backed (ADR).
a) bbPB draining policy: Another important design issue The NVMM read and writes latencies are 150ns and 500ns,
is regarding when and how to drain bbPB to NVMM. Regard- respectively, which are higher than DRAM latencies, in line
ing the when question, since blocks in bbPB are already in with prior studies [14], [20], [47], [55], [86]. We use Arm 64-
the persistence domain and there is sufficient energy to drain bit instruction set architecture (ISA). Our simulation models an
them to memory, in theory, they can stay in the bbPB for Arm-based mobile phone with an 8-core processor, each core
coalescing. Draining bbPB too early reduces opportunities to has an 8-wide out-of-order pipeline. L1 caches are private per

117
core, while the L2 is shared. Coherence between L1 caches is 107MB and 8.75MB for the server and the mobile class
rely on directory-based MESI protocol. system, respectively.
TABLE IV: Summary of the evaluated workloads along with their TABLE V: Systems used to evaluate the draining costs
descriptions and the percentage of the persistent stores (%P-Stores)
Component Mobile Class Server Class
to the total stores in the workload.
Number of cores 6 32
Workload Description %P-Stores L1 cache size 6 x 128kB 32 x 32 kB
rtree 1 million-node rtree insertion 15.5% L2 cache size 1 x 8MB 32 x 1 MB
ctree 1 million-node ctree insertion 18.9% L3 cache size N/A 2 x 35.75 MB
hashmap 1 million-node hashmap insertion 6.0% Memory channels 2 12
mutate[NC/C] modify in 1 million-element array 23.8%
swap[NC/C] swap in 1 million-element array 23.8%
To compare the draining cost of BBB and eADR, we focus
on (1) the energy needed at the time of the crash, which
determines the size, lifetime, and system footprint of the
B. Workload Description battery, and (2) the time needed to perform the draining, which
To evaluate battery-backed persist buffers (bbPB) size re- is affected by the amount of data to drain and non-volatile
quirements in BBB, we designed the workloads listed in main memory (NVMM) write bandwidth. Such time impacts
Table IV. These workloads are chosen to generate significant the turn-round time after a crash, and thus the responsiveness
persist traffic. of the system. It can also result in further energy overheads if
Among these workloads rtree, btree, and hashmap work- other parts of the system (e.g. the core) need to remain alive
loads maintain a 1 million-node data structure that is allocated during draining.
in the persistent space, and the workload performs random Estimating draining energy. On a crash, data in bbPB (or
insertions to the data structure. This generates persistent writes in caches for eADR) is accessed and then moved to NVMM.
that need to be allocated at the bbPB. Similarly, array-mutate We assume that caches in eADR and bbPB in BBB are SRAM.
and array-swap perform random mutate and swap operations The energy needed to access data in such SRAM cells is
respectively, on a 1 million-element array. NC or C after the estimated to be about 1pJ/Byte [63]. However, this is very
array operation’s name (e.g. mutateNC vs mutateC) stands small compared to the energy needed for data movement. The
for ”Non-Conflicting” or ”Conflicting”, respectively. This in- energy required for data movement is a lot harder to calculate.
dicates whether each thread performs updates on a separate Our estimations of the energy cost for data movement are
region of the array (hence non-conflicting), or conflicts are based on the results from the work done by Dhinakaran et
allowed between threads. Each workload runs with 8 threads al [65], which looked into the energy cost of data movement
on 8 cores. across several levels in the memory hierarchy. The energy
We designed the workloads to exert maximum pressure consumption per memory operation was measured using an ex-
on the bbPB. They perform back-to-back persistent writes ternal power meter while executing carefully designed micro-
with little other computation. In contrast, real-world workloads benchmarks. These micro-benchmarks were used to isolate
typically perform additional computations to generate the data and observe the energy needed solely for data movement
to be persisted. Thus, our analyses on size of bbPB required and minimize the effect of out-of-order execution and other
for good performance represents the worst-case end point for architectural optimizations. More specifically:
the workloads we studied. 1) To calculate the cost of data movement between the
For all of these workloads, we evaluate BBB normalized processor and a targeted level in the memory hierarchy
to eADR which serves as the base case. eADR represents the (e.g. L2 cache), these micro-benchmarks operate on an
optimal case for performance overheads and number of writes allocated data that is chosen such that it has a memory
to NVMM, hence it performs as good as a system without footprint to not fit in any of the cache levels above the
any persistency in mind. On average, the simulated window targeted level.
reports the timing of 250 million instructions, after 200 million 2) The average memory latency and the cache miss rates
instructions for warm-up. were continuously monitored to validate that the micro-
benchmarks are accessing the targeted level of the mem-
C. Methodology for Evaluating Draining Cost ory hierarchy.
eADR draining cost depends on the cache hierarchy and 3) The micro-benchmarks were designed to minimize the
number of cores. We evaluate the cost based on two types of impact of other operations, not related to memory ac-
systems with differing number of cores and cache hierarchy: a cesses.
server class and a mobile class system, as shown in Table V. 4) To isolate the compiler optimizations’ impact on the
The server class system is based on the specifications of Intel micro-benchmarks, all the assembly codes were man-
Xeon Platinum 9222 [25], [94], while the the mobile class sys- ually validated to guarantee the expected behavior.
tem are based on the Arm-based iPhone 11 specifications [26], These experiments provided the energy needed to move the
[30], [36]. Most notably, the total cache size for the system data between the processor’s registers and any level in the

118
memory hierarchy. Finally, the difference between these results V. E VALUATION
can be used to calculate the energy cost of moving data We first discuss the most important aspect of BBB: its
between different levels in the memory hierarchy. draining cost, in comparison to eADR (Section V-A). Then we
Table VI shows the estimated energy needed for draining discuss BBB performance and write overheads (Section V-B
data from different cache levels to NVMM. The numbers are and V-C). Finally, we present the sensitivity study of BBB
derived from [65], with some adaptation: (1) As the analysis design (Section V-D).
in [65] only reports data movement in the direction from
A. Draining Cost Comparison
memory to caches, we estimate that the energy needed to
bring data from a cache to memory (as needed at the crash) Table VII presents the average energy needed to drain
is similar to that to bring data from memory to the cache. data from caches (for eADR) and from bbPB (our BBB
(2) Since the results reported in [65] are only for a DRAM- approach, 32 entries), based on the cost model we discussed in
based system, we assume those results for our analysis on the Section IV-C. We give eADR optimistic estimates with several
energy needed to drain to NVMM. This assumption equally assumptions. First, we assume eADR only drains dirty cache
affects eADR and BBB, and thus will not have a notable blocks to memory. For the workloads evaluated, on average
impact on our comparison between the two schemes. (3) The 44.9% of blocks are dirty in the cache hierarchy, similar to
energy for draining a block from bbPB is estimated from the figure obtained by Garcia et al [31]. Second, we assume dirty
energy to drain a block from the L1D cache to NVMM. (4) blocks are identified using a hardware finite state machine
The energy numbers in [65] do not include a 3-level memory that is power efficient, and consumes zero energy overheads.
hierarchy. Therefore, we assume the draining cost numbers Moreover, we don’t include the static energy cost for eADR.
do not increase when adding another cache level, as in the In contrast, we note that BBB does not require cache accesses
server class system in Table V. This assumption produces for dirty block identification. Furthermore, caches do not need
an optimistic energy figure for eADR, so in reality, eADR to be powered during the draining process (hence no static
energy cost may be higher than our estimate. Moreover, when energy consumption). Finally, we assume that at the time of
reporting eADR draining energy and time costs, we calculate failure, the battery-backed persist buffers (bbPB) are full and
only the energy and time needed to drain dirty blocks, to all entries need to be drained, representing the worst case for
estimate the average energy and time. BBB.
As shown in Table VII, despite optimistic estimates, using
The final part of our energy analysis is estimating the battery eADR costs 46.5 mJ and 550 mJ to drain for the mobile and
size. In this analysis, the battery needs to be provisioned with server class systems, respectively. Not surprisingly, the mobile
sufficient energy to drain the entire caches, in case all blocks class system has smaller caches hence draining energy is
in the caches are dirty. This is important because missing to smaller than in the server class system. Despite more realistic
drain even one dirty cache block may result in inconsistent estimates, BBB costs only 145 µJ and 775 µJ, respectively,
persistent data that cannot be recovered. We chose the smallest which are 320× and 709× more efficient than eADR, respec-
battery size that is capable of storing the required energy. tively. BBB’s energy cost is between two to three orders of
Different battery technologies have different energy densities magnitude smaller than eADR.
(i.e. the amount of energy stored per volume unit). We looked
into two main battery technologies (SuperCap [98] and Li- TABLE VII: Estimated draining energy cost for BBB vs. eADR (dirty
thin [67]), which have energy density of (10−4 and 10−2 ) Wh blocks only).
cm-3 , respectively [93]. System
Draining Energy Normalized to BBB (X)
eADR BBB eADR BBB
Mobile Class 46.5 mJ 145 µJ 320× 1
TABLE VI: Estimated energy costs of different operations for drain- Server Class 550 mJ 775 µJ 709× 1
ing eADR or BBB at the moment of crash.
Operation Energy Cost Table VIII presents the average time needed to drain data
Accessing Data from SRAM 1pJ/Byte from caches (for eADR) and from bbPB (our BBB approach).
Moving data from L1D to NVMM 11.839nJ/Byte eADR takes 0.8 ms and 1.8 ms to drain for mobile class
Moving data from bbPB to NVMM 11.839nJ/Byte and the server class system, respectively. In contrast, BBB
Moving data from L2 to NVMM 11.228nJ/Byte takes only 2.6 µs (307× faster) and 2.4 µs (750× faster),
Moving data from L3 to NVMM 11.228nJ/Byte respectively, which represent two to three orders of magnitude
improvement.
eADR and our BBB need energy source to drain. We
Estimating draining time. For this part, we rely on the estimate two energy source types: super capacitors (Super-
reported NVMM bandwidth and latencies [41]. As the draining Cap) [98], and lithium thin-film batteries (Li-thin) [67], by
happens at crash with no other traffic present, we assume that applying the analysis from [93] (as discussed in Section IV-C),
the entire NVMM bandwidth will be dedicated for draining. while using the energy values from Table VII. Table IX shows
NVMM bandwidth also depends on the number of memory the estimates for only the active material needed for the
channels for each system as described in Table V. battery, excluding packaging and other aspects. As shown in

119
TABLE VIII: Estimated draining time for BBB vs. eADR (dirty TABLE X: Battery size (in mm3 ) when varying the number of bbPB
blocks only). entries for mobile (M) and server (S) platforms.
Draining Time Normalized to BBB (X) bbPB Size 1 4 16 32 64 256 1024
System
eADR BBB eADR BBB M 0.12 0.50 2.02 4.1 8.1 32.3 129.3
Mobile Class 0.8 ms 2.6 µs 307× 1 SprCap
S 0.7 2.7 10.8 21.6 43.1 172.4 689.7
Server Class 1.8 ms 2.4 µs 750× 1 M 0.001 0.005 0.02 0.04 0.08 0.3 1.3
Li-thin
S 0.006 0.026 0.10 0.21 0.43 1.7 6.8

column group (a), eADR in the mobile class system would


eADR. However, this may also incur performance degradation
require an energy source of 2.9 × 103 mm3 and 30 mm3 for
in BBB when persisting stores come at a faster pace than can
SuperCap and Li-thin technologies, respectively. In contrast,
be drained by bbPB, causing bbPB to be full and stalls the
BBB requires only 4.1 mm3 and 0.04 mm3 , respectively.
core. In contrast, eADR does not cause stalls anymore than
The server class system shows a similar trend with energy
regular caches would, hence it represents the ideal case.
source of 34 × 103 mm3 and 300 mm3 for SuperCap and Li-
Figure 7(a) illustrates the execution time of two versions of
thin technologies, respectively, for eADR. In contrast, BBB
BBB with differing bbPB entries (32 and 1024) and eADR,
requires only 21.6 mm3 and 0.21 mm3 volume, respectively.
normalized to eADR, for all workloads tested. As shown in
To put it in perspective, the last two columns convert
the figure, 32-entry BBB performs worse than eADR by only
size/volume to the footprint area of a typical core used in the
about 1% on average, and 2.8% in the worst case. The main
mobile class system (i.e. 2.61 mm2 ) [30]. Although we don’t
source of performance overhead is when a core’s bbPB is full
necessarily envision that this energy is going to be provided
when the core attempts to persist a store, and coalescing cannot
through introducing a new battery, we use the comparison
be done because the block being written is not in bbPB. In
to a mobile core’s size to help visualize the energy source
this case, the store stalls the core until some blocks in bbPB
comparison between BBB and eADR. To simplify converting
have been drained and free up some entries. With 1024 entries,
battery volume to area, we assume cubic battery shape and
BBB achieves nearly identical performance with eADR, at the
infer the footprint area from the volume. The areas needed
cost of a larger battery compared to 32-entry BBB.
for eADR batteries are substantial: 77× and 404× the size
of a core for the mobile class and the server class systems, 1.2 BBB BBB (1024) Optimal (eADR) 1.2

Number of Writes to NVMM (X)


respectively, when using SuperCap. Even when using Li-thin,
1.0 1.0
Execution Time (X)

which is more space efficient, the area is still large: 3.6× and
0.8 0.8
18.7× the size of the core, for the mobile class and server
class systems, respectively. In contrast, BBB requires a much 0.6 0.6

smaller size. Even with SuperCap, the area needed is 97.2% 0.4 0.4
and 296% the size of a core for the mobile and the server class, 0.2 0.2
respectively. It becomes even smaller with Li-thin: 4.5% and
0.0 0.0
13.7% the size of a core, respectively. Overall, the battery
volume for BBB is between 707 − 1, 574× smaller, while
the area for BBB is between 79 − 137× smaller than eADR. (a) Execution Time (b) Number Of Writes
Moreover, Table X looks into the battery size when varying
the number of entries in bbPB and shows that even with bbPB Fig. 7: Execution time and number of writes to NVMM for BBB
with 32 entries (first bar), with 1024 entries (second bar), and eADR
size of 1024 entries, BBB is 22 − 49× cheaper than eADR. (third bar), normalized to eADR.
TABLE IX: Estimates of the size of the energy source needed to
implement BBB and eADR (a). In addition, to showing the needed As discussed in Section IV, our workloads were designed
footprint to be occupied by the energy source as a ratio to the area to generate back-to-back writes to the persistent domain,
of the mobile class system’s core (b). which stresses persist buffers. However, the nature of the data
(a) Size/Volume (mm3 ) (b) Ratio to core area (%) structures in the workloads still creates a difference in the time
Battery needed to perform the operation and thus creates a difference
SuperCap Li-thin SuperCap Li-thin
7,746% 359.5% in the frequency of generating persists by the core. This is why
2.9 × 103
Server Mobile

eADR 30
(∼ 77×) (∼ 3.6×) some workloads (e.g. swapNC) might incur relatively higher
BBB 4.1 0.04 97.2% 4.5%
40,363% 1,873%
delays due to the very short time between two subsequent
eADR 34 × 103 300 persists.
(∼ 404×) (∼ 18.7×)
296%
BBB 21.6 0.21 13.7% C. NVMM Writes Evaluation
(∼ 3×)
Due to the limited write endurance of NVMM, the number
of writes to NVMM is also an important metric. If we had used
B. Performance Evaluation the processor-side approach, almost every persisting store must
The much smaller amount of data that needs draining is go to the bbPB and drain to the NVMM. With the memory-
the primary reason for BBB’s much smaller battery cost than side approach, stores are coalesced in bbPB, and the number

120
of NVMM writes depends on the number of block draining impact on execution time, which stops decreasing with 32
needed to free up entries in bbPB. Hence, we expect the entries. The bbPB drains overhead reaches near zero with 64
number of bbPB entries to determine the number of NVMM entries, but 32 entries are not far behind. This represents the
writes. Also, as discussed in Section III, dirty writebacks from amount of coalescing that was achieved at the bbPB and will
the LLC to the NVMM in our BBB approach are silently be translated into a reduction of the number of writes to the
dropped to avoid redundant writes to the NVMM. NVMM. Thus, 32-entry bbPB (our default configuration) is the
Figure 7(b) compares the number of NVMM writes for smallest size that shows very close results compared to eADR,
BBB with 32 and 1024 entries, with eADR, normalized to beyond which we start to have diminishing returns. This buffer
eADR. eADR represents the optimal case because eADR size might be higher than what was used in prior works (e.g. 8
does not introduce any new writes to the NVMM due to entries in [50]). However, we decided to conservatively choose
persistency ordering. The figure shows that even 32-entry a relatively larger buffer for the following reasons: (1) We
bbPB in BBB captures the majority of the coalescing that chose the size to conduct the energy comparison between BBB
happen in eADR; it only adds an average of 4.9% writes to and eADR. Therefore, we chose the smallest size that shows
NVMM (ranging from 1 − 7.9%) to eADR. This overhead almost no performance degradation (about 1%). Prior works
decreases to less than 1% if BBB uses 1024-entry bbPB, showed acceptable, yet higher, degradation when using smaller
since the larger buffer provides more room to hold blocks and sizes, which is consistent with the result we report in Figure 8.
captures most coalescing opportunities. This result illustrates (2) Our evaluated workloads were chosen to represent the
the effectiveness of the memory-side approach in coalescing worst-case and stress the design by generating back-to-back
stores in the bbPB, because bbPB is in the persistence domain. persists. Therefore we expect to have larger buffers. In general,
In contrast, traditional persist buffers prevent most coalescing the choice of bbPB size is a design decision based on the trade-
because it would result in persistency ordering violation, so off between the energy budget and the desired performance.
the number of writes would be much higher.
We also measured the number of writes to NVMM using the VI. R ELATED W ORK
processor-side approach, and found that on average, there are NVM has received significant research activities in recent
2.8× more writes to NVMM than eADR. This is because there years. Past work has examined various aspects of NVM,
are not many coalescing opportunities, while the memory-side including memory organization (e.g., [9], [79]), abstraction
approach is effective in performing coalescing. (e.g., [84]), checkpointing (e.g., [7], [27]), memory safety
(e.g., [8], [11], [95], [96]), secure execution environment (e.g.,
D. Sensitivity Studies
[12], [29]), extending life time (e.g., [6], [21], [22], [70], [72]),
To obtain deeper insights into BBB, we vary the bbPB size and persistency acceleration (e.g., [81], [82]). The above list
from 1 entry up to 1024 entries. Figure 8 shows the number is a small subset of examples of work in NVM research. From
of times persisting stores are rejected at the bbPB because it here on, we will expand on papers that are most immediately
is full (a), execution time overhead (b), and number of bbPB related to our work.
drains to memory (c). All figures are normalized to the 1-entry Persistency models. Persist barriers enable epochs to be
bbPB case. Note that the y-axes starts at -0.2, so that near zero defined in BPFS [23]. Pelley et al. defined and formalized
values are visible in the figures. memory persistency models including strict, epoch, and strand
persistency [68], in the order of increasing performance and
1
Workloads’ Avg. Overhead (X)

decreasing ordering constraints. Persist ordering can be com-


0.8
pletely relaxed using lazy persistency, as long as persistent
0.6 state integrity can be checked using checksums [5], [13].
0.4 Persist barrier variants that only ensure ordering but not
0.2
synchronous to instruction execution were presented in [62].
Persistency model is also moving up, being considered at the
0
programming language level [49].
-0.2
Persist buffers. Volatile persist buffers were introduced
1024

1024

1024
1
2
4
8

1
2
4
8

1
2
4
8
32

16
16
64

32
64

16
32
64
512

256
128
256
512

128
256

128
512

in DPO [50] and HOPS [62] to enable buffered persistency


(a) bbPB Rejections (b) Execution Time (c) bbPB Drains models, where stores are temporarily held until drained to
memory. In contrast to DPO and HOPS, BBB’s persist buffers
Fig. 8: Sensitivity study showing the average (i.e. geomean) impact (bbPB) differ in the following aspects: (1) battery-backed, (2)
of varying the bbPB size on: The number of persist request being part of the persistence domain, and (3) logically memory-
rejected due to full bbPB (a), The execution time comparison (b),
and the number bbPB drains to the NVMM (c). All are normalized side. These key differences lead to unique characteristics: (1)
(X) to the case with bbPB size of 1 (leftmost column) in each group. stores in bbPB can be freely coalesced and reordered, (2)
strict persistency is automatically achieved without flushes and
As shown in the figure, bbPB rejection count decreases fences.
quickly when increasing the bbPB size, reaching nearly zero eADR. Persistency domain initially consisted of only the
with 16-32 entries. These findings are consistent with the NVM, and has expanded over time. Initially, persisting a

121
store requires flush, fence, and another instruction pcommit R EFERENCES
which flushes MC write pending queue (WPQ) to NVMM.
With ADR, Intel adds WPQ to the persistency domain, thus [1] Intel. an introduction to pmemcheck. [Online]. Available: https:
//pmem.io/2015/07/17/pmemcheck-basic.html
deprecating pcommit. Capacitor or battery is needed to provide [2] “Persistent memory development kit (pmdk).” [Online]. Available:
WPQ’s flush-on-fail capability. More recently, Intel hinted that https://pmem.io/pmdk/
eADR will make it into production [78]. eADR [77] adds [3] “Non-temporal store instructions,” 2017. [Online]. Available: https:
the entire cache hierarchy to the persistence domain. Because //www.felixcloutier.com/x86/movntdq
[4] H. Akinaga and H. Shima, “Resistive Random Access Memory
PoV/PoP gap is closed, flush and fence instructions are gener- (ReRAM) Based on Metal Oxides,” IEEE Journal, 2010.
ally no longer necessary with eADR. However, eADR requires [5] M. Alshboul, J. Tuck, and Y. Solihin, “Lazy persistency: A high-
battery to provide flush-on-fail for the entire cache hierarchy performing and write-efficient software persistency technique,” in ISCA,
2018.
instead of only the bbPB in BBB. Table XI summarizes this [6] M. Alshboul, J. Tuck, and Y. Solihin, “Wet: Write efficient loop tiling
comparison between eADR and BBB. for non-volatile main memory,” in DAC, 2020.
[7] M. Alshboul, H. Elnawawy, R. Elkhouly, K. Kimura, J. Tuck, and
TABLE XI: Summary of the comparison between eADR and BBB Y. Solihin, “Efficient checkpointing with recompute scheme for non-
regarding the hardware/integration costs. volatile main memory,” ACM Trans. Archit. Code Optim., 2019.
[8] A. Awad, P. Manadhata, S. Haber, Y. Solihin, and W. Horne, “Silent
Aspect eADR BBB Shredder: Zero-Cost Shredding for Secure Non-Volatile Main Memory
(1) Adding the bbPBs Controllers,” in ASPLOS, 2016.
Processor modifications None [9] A. Awad, S. Blagodurov, and Y. Solihin, “Write-aware management of
(2) Minor coherence changes
Draining energy cost Very High Low nvm-based memory extensions,” in ICS, 2016.
Time needed to drain Very High Low [10] A. Awad, B. Kettering, and Y. Solihin, “Non-volatile Memory Host
Drive energy to Controller Interface Performance Analysis in High-performance I/O
Needed Needed Systems,” in ISPASS, 2015.
targeted components
[11] A. Awad, Y. Wang, D. Shands, and Y. Solihin, “ObfusMem: A Low-
Overhead Access Obfuscation for Trusted Memories,” in ISCA, 2017.
Failure atomicity. Persistency programming assumes a [12] A. Awad, M. Ye, Y. Solihin, L. Njilla, and K. A. Zubair, “Triad-
certain persistency model and on top of that, relies on support NVM: Persistency for Integrity-Protected and Encrypted Non-Volatile
for failure atomic code regions. For sequential programs, Memories,” in ISCA, 2019.
[13] A. W. Baskara Yudha, K. Kimura, H. Zhou, and Y. Solihin, “Scalable
libraries and language extensions have been implemented to and fast lazy persistency on gpus,” in IISWC, 2020.
provide transaction-based failure atomicity [2], [17], [85]. Au- [14] F. Bedeschi, C. Resta, O. Khouri, E. Buda, L. Costa, M. Ferraro,
tomatic transformation of data structures to persistent memory F. Pellizzer, F. Ottogalli, A. Pirovano, M. Tosi, R. Bez, R. Gastaldi, and
G. Casagrande, “An 8Mb Demonstrator for High-density 1.8V Phase-
has been proposed too [53], [60]. For concurrent programs, Change Memories,” in VLSIIC, 2004.
compiler-based automatic instrumentation has been proposed [15] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu,
to transform lock-based concurrent programs for persistent J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell,
memory [19], [33]. Other works propose hardware and soft- M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The GEM5
simulator,” ACM SIGARCH Computer Architecture News (CAN), 2011.
ware primitives to aid porting lock-free data structures to [16] S. Blanas, “The new bottlenecks of scientific computing,”
persistent memory [66], [91]. Many studies add durability 2020. [Online]: https://www.sigarch.org/from-flops-to-iops-the-new-
to transaction-based concurrent programs without changing bottlenecks-of-scientific-computing/
[17] B. Bridge, “Nvm-direct.” [Online]: https://github.com/oracle/nvm-direct
application source code [32], [44], [92], while others add [18] M. Carlson. Persistent memory: What developers need to know.
durability to software transactions [24], [75], [86]. Finally, [Online]: https://www.snia.org/educational-library/persistent-memory-
checkpointing and system-level solutions were also used to what-developers-need-know-2018
[19] D. R. Chakrabarti, H.-J. Boehm, and K. Bhandari, “Atlas: Leveraging
achieve failure atomicity [45], [59], [76]. Overall, the afore- locks for non-volatile memory consistency,” ACM SIGPLAN Notices,
mentioned works provide mechanisms for achieving failure 2014.
atomic regions (i.e. durable transactions), which is orthogonal [20] A. Chatzistergiou, M. Cintra, and S. D. Viglas, “Rewind: Recovery
the goal of our paper. BBB addresses persist ordering and write-ahead system for in-memory non-volatile data-structures,” VLDB
Endow., Jan. 2015.
simplifies ordering-related programming complexity, which [21] J. Chen, G. Venkataramani, and H. H. Huang, “Repram: Re-cycling pram
provides a property that can be relied on by higher level faulty blocks for extended lifetime,” in DSN 2012, 2012.
primitives such as failure atomic regions. [22] J. Chen, Z. Winter, G. Venkataramani, and H. H. Huang, “rpram:
Exploring redundancy techniques to improve lifetime of pcm-based main
memory,” in PACT, 2011.
VII. C ONCLUSION [23] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and
D. Coetzee, “Better I/O through byte-addressable, persistent memory,”
We have proposed Battery-Backed Buffers (BBB), a mi- in SOSP, 2009.
croarchitectural approach that simplifies the persist ordering [24] A. Correia, P. Felber, and P. Ramalhete, “Romulus: Efficient algorithms
aspect of persistency programming by aligning the point of for persistent transactional memory,” in SPAA, 2018.
[25] CPU-WORLD, “Intel xeon 9222 specifications.” [Online]: http:
persistency (PoP) with the point of visibility (PoV). We //www.cpu-world.com/CPUs/Xeon/Intel-Xeon\%209222.html
evaluated BBB over several workloads and found that adding [26] J. Cross, “Inside apple’s a13 bionic system-on-chip.” [On-
32-entry bbPB per core is sufficient to provide performance line]: https://www.macworld.com/article/3442716/inside-apples-a13-
comparable to eADR (only 1% slow down and 4.9% extra bionic-system-on-chip.html
[27] H. Elnawawy, M. Alshboul, J. Tuck, and Y. Solihin, “Efficient check-
writes) while requiring 320 − 709× lower draining energy pointing of loop-based codes for non-volatile main memory,” in PACT,
compared to eADR. 2017.

122
[28] X. Fong, Y. Kim, R. Venkatesan, S. H. Choday, A. Raghunathan, and [56] S. Liu, K. Seemakhupt, Y. Wei, T. Wenisch, A. Kolli, and S. Khan,
K. Roy, “Spin-transfer torque memories: Devices, circuits, and systems,” “Cross-failure bug detection in persistent memory programs,” in ASP-
Proceedings of the IEEE, 2016. LOS, 2020.
[29] A. Freij, S. Yuan, H. Zhou, and Y. Solihin, “Persist level parallelism: [57] S. Liu, Y. Wei, J. Zhao, A. Kolli, and S. Khan, “Pmtest: A fast and
Streamlining integrity tree updates for secure persistent memory,” in flexible testing framework for persistent memory programs,” in ASPLOS,
MICRO, 2020. 2019.
[30] A. Frumusanu, “The apple iphone 11, 11 pro and 11 pro max review.” [58] Y. Lu, J. Shu, L. Sun, and O. Mutlu, “Loose-Ordering Consistency for
[Online]: https://www.anandtech.com/show/14892/the-apple-iphone-11- Persistent Memory,” in ICCD, 2014.
pro-and-max-review/2 [59] A. Memaripour, A. Badam, A. Phanishayee, Y. Zhou, R. Alagappan,
[31] A. A. Garcı́a, R. de Jong, W. Wang, and S. Diestelhorst, “Composing K. Strauss, and S. Swanson, “Atomic in-place updates for non-volatile
lifetime enhancing techniques for non-volatile main memories,” in main memories with kamino-tx,” in EuroSys, 2017.
MEMSYS, 2017. [60] A. Memaripour, J. Izraelevitz, and S. Swanson, “Pronto: Easy and fast
[32] E. Giles, K. Doshi, and P. Varman, “Hardware transactional persistent persistence for volatile data structures,” in ASPLOS, 2020.
memory,” in MEMSYS, 2018. [61] S. Mittal, J. S. Vetter, and D. Li, “Lastingnvcache: A technique for
[33] V. Gogte, S. Diestelhorst, W. Wang, S. Narayanasamy, P. M. Chen, and improving the lifetime of non-volatile caches,” in 2014 IEEE Computer
T. F. Wenisch, “Persistency for synchronization-free regions,” in PLDI, Society Annual Symposium on VLSI, 2014.
2018. [62] S. Nalli, S. Haria, M. D. Hill, M. M. Swift, H. Volos, and K. Keeton,
[34] A. Hansson, N. Agarwal, A. Kolli, T. Wenisch, and A. N. Udipi, “An analysis of persistent memory use with whisper,” SIGPLAN Not.,
“Simulating dram controllers for future system architecture exploration,” 2017.
in ISPASS, 2014. [63] D. Nayak, D. P. Acharya, and K. Mahapatra, “An improved energy
[35] C. Huang, V. Nagarajan, and A. Joshi, “Dca: A dram-cache-aware dram efficient sram cell for access over a wide frequency range,” Solid-State
controller,” in SC, 2016. Electronics, vol. 126, 2016.
[36] A. Inc, “Apple: iphone11 specifications.” [Online]: https://www.apple. [64] K. Oleary. How to detect persistent mem-
com/iphone-11/specs/ ory programming errors using intel inspec-
[37] Intel, “Deprecating the pcommit instruction,” 2016. [Online]: https: tor. [Online]: https://software.intel.com/en-us/articles/detect-persistent-
//software.intel.com/blogs/2016/09/12/deprecate-pcommit-instruction memory-programming-errors-with-intel-inspector-persistence-inspector
[38] Intel, “Persistent memory programming,” 2016, http://pmem.io. [65] D. Pandiyan and C.-J. Wu, “Quantifying the energy cost of data
[39] Intel and Micron, “Intel and micron produce breakthrough memory movement for emerging smart phone workloads on mobile platforms,”
technology,” 2015. IISWC, 2014.
[40] J. Izraelevitz, T. Kelly, and A. Kolli, “Failure-atomic persistent memory [66] M. Pavlovic, A. Kogan, V. J. Marathe, and T. Harris, “Brief announce-
updates via justdo logging,” in ASPLOS, 2016. ment: Persistent multi-word compare-and-swap,” in PODC, 2018.
[41] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. [67] D. Pech, M. Brunet, H. Durou, P. Huang, V. Mochalin, Y. Gogotsi, P.-L.
Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson, “Basic Taberna, and P. Simon, “Ultrahigh-power micrometre-sized supercapac-
performance measurements of the intel optane dc persistent memory itors based on onion-like carbon,” Nature nanotechnology, 2010.
module,” arXiv preprint arXiv:1903.05714, 2019. [68] S. Pelley, P. M. Chen, and T. F. Wenisch, “Memory Persistency,” in
[42] Y. Joo, D. Niu, X. Dong, G. Sun, N. Chang, and Y. Xie, “Energy- and ISCA, 2014.
endurance-aware design of phase change memory caches,” in DATE,
[69] PMEM.io, “Persistent memory programming,” 2019. [Online]: https:
2010.
//pmem.io/2019/12/19/performance.html
[43] A. Joshi, V. Nagarajan, M. Cintra, and S. Viglas, “Efficient Persist
[70] M. K. Qureshi, “Pay-as-you-go: Low-overhead hard-error correction for
Barriers for Multicores,” in Micro, 2015.
phase change memories,” in MICRO, 2011.
[44] A. Joshi, V. Nagarajan, M. Cintra, and S. Viglas, “Dhtm: Durable
[71] M. K. Qureshi, S. Gurumurthi, and B. Rajendran, Phase Change
hardware transactional memory,” in ISCA, 2018.
Memory: From Devices to Systems, 2011.
[45] S. Kannan, A. Gavrilovska, and K. Schwan, “Pvm: Persistent virtual
memory for efficient capacity scaling and object storage,” in EuroSys, [72] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and
2016. B. Abali, “Enhancing lifetime and security of pcm-based main memory
[46] T. Kawahara, R. Takemura, K. Miura, J. Hayakawa, S. Ikeda, Y. Lee, with start-gap wear leveling,” in MICRO, 2009.
R. Sasaki, Y. Goto, K. Ito, T. Meguro, F. Matsukura, H. Takahashi, [73] A. Raad, J. Wickerson, G. Neiger, and V. Vafeiadis, “Persistency
H. Matsuoka, and H. Ohno, “2Mb Spin-Transfer Torque RAM (SPRAM) semantics of the intel-x86 architecture,” Proc. ACM Program. Lang.,
with Bit-by-Bit Bidirectional Current Write and Parallelizing-Direction 2019.
Current Read,” in ISSCC, 2007. [74] R. Rajachandrasekar, S. Potluri, A. Venkatesh, K. Hamidouche, M. W.
[47] W.-H. Kim, J. Kim, W. Baek, B. Nam, and Y. Won, “Nvwal: Exploiting ur Rahman, and D. K. D. Panda, “MIC-Check: A Distributed Check-
nvram in write-ahead logging,” in ASPLOS, 2016. pointing Framework for the Intel Many Integrated Cores Architecture,”
[48] Y. Kim, S. R. Lee, D. Lee, C. B. Lee, M. Chang, J. H. Hur, M. Lee, in HPDC, 2014.
G. Park, C. J. Kim, U. Chung, I. Yoo, and K. Kim, “Bi-layered rram [75] P. Ramalhete, A. Correia, P. Felber, and N. Cohen, “Onefile: A wait-free
with unlimited endurance and extremely uniform switching,” in 2011 persistent transactional memory,” in DSN, 2019.
Symposium on VLSI Technology - Digest of Technical Papers, 2011. [76] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutiu, “Thynvm:
[49] A. Kolli, V. Gogte, A. Saidi, S. Diestelhorst, P. M. Chen, Enabling software-transparent crash consistency in persistent memory
S. Narayanasamy, and T. F. Wenisch, “Language-level persistency,” in systems,” in MICRO, 2015.
ISCA, 2017. [77] A. Rudoff, “Persistent memory programming.”
[50] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley, S. Liu, P. M. Chen, [78] A. Rudoff, “Persistent memory programming without all that cache
and T. F. Wenisch, “Delegated Persist Ordering,” in MICRO, 2016. flushing,” in SDC, 2020.
[51] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Eval- [79] M. Saxena and M. M. Swift, “Flashvm: Virtual memory management
uating STT-RAM as an Energy-effcient Main Memory Alternative,” in on flash,” in USENIXATC, 2010.
ISPASS, 2013. [80] S. Scargall, “Persistent memory architecture,” in Programming Persistent
[52] B. C. Lee, “Phase Change Technology and The Future of Main Memory,” Memory, 2020.
IEEE Micro, 2010. [81] S. Shin, S. K. Tirukkovalluri, J. Tuck, and Y. Solihin, “Proteus: A flexible
[53] S. K. Lee, J. Mohan, S. Kashyap, T. Kim, and V. Chidambaram, “Recipe: and fast software supported hardware logging approach for nvm,” in
converting concurrent dram indexes to persistent-memory indexes,” in MICRO, 2017.
SOSP, 2019. [82] S. Shin, J. Tuck, and Y. Solihin, “Hiding the long latency of persist
[54] Z. Lin, M. Alshboul, Y. Solihin, and H. Zhou, “Exploring memory barriers using speculative execution,” in ISCA, 2017.
persistency models for gpus,” in PACT, 2019. [83] Y. Solihin, Fundamentals of Parallel Multicore Architecture. Chapman
[55] M. Liu, M. Zhang, K. Chen, X. Qian, Y. Wu, W. Zheng, and J. Ren, & Hall/CRC Computational Science, 2015.
“Dudetm: Building durable transactions with decoupling for persistent [84] Y. Solihin, “Persistent memory: Abstractions, abstractions, and abstrac-
memory,” in ASPLOS, 2017. tions,” IEEE Micro, 39(1), 2019.

123
[85] P. Subrahmanyam, “pmem-go.” [Online]: https://github.com/jerrinsg/go-
pmem
[86] H. Volos, A. J. Tack, and M. M. Swift, “Mnemosyne: Lightweight
Persistent Memory,” in ASPLOS, 2011.
[87] J. Wang, X. Dong, Y. Xie, and N. P. Jouppi, “i2wap: Improving non-
volatile cache lifetime by reducing inter- and intra-set write variations,”
in HPCA, 2013.
[88] T. Wang, S. Sambasivam, Y. Solihin, and J. Tuck, “Hardware supported
persistent object address translation,” in MICRO, 2017.
[89] T. Wang, S. Sambasivam, and J. Tuck, “Hardware supported permission
checks on persistent objects for performance and programmability,” in
ISCA, 2018.
[90] W. Wang and S. Diestelhorst, “Quantify the performance overheads of
pmdk,” in MEMSYS, 2018.
[91] W. Wang and S. Diestelhorst, “Brief announcement: Persistent atomics
for implementing durable lock-free data structures for non-volatile
memory,” in SPAA, 2019.
[92] Z. Wang, H. Yi, R. Liu, M. Dong, and H. Chen, “Persistent transactional
memory,” IEEE Computer Architecture Letters, 2014.
[93] Z.-S. Wu, K. Parvez, X. Feng, and K. Müllen, “Graphene-based in-plane
micro-supercapacitors with high power and energy densities,” Nature
communications, 2013.
[94] I. Xeon, “Intel® xeon® platinum 9222 processor.”
[Online]: https://ark.intel.com/content/www/us/en/ark/products/195437/
intel-xeon-platinum-9222-processor-71-5m-cache-2-30-ghz.html
[95] Y. Xu, Y. Solihin, and X. Shen, “Hardware-Based Domain Virtualization
for Intra-Process Isolation of Persistent Memory Objects,” in ISCA,
2020.
[96] Y. Xu, Y. Solihin, and X. Shen, “MERR: Improving Security of
Persistent Memory Objects via Efficient Memory Exposure Reduction
and Randomization,” in ASPLOS, 2020.
[97] L. Zhang and S. Swanson, “Pangolin: A fault-tolerant persistent memory
programming library,” in USENIXATC, 2019.
[98] Y. Zhu, S. Murali, M. D. Stoller, K. J. Ganesh, W. Cai, P. J. Ferreira,
A. Pirkle, R. M. Wallace, K. A. Cychosz, M. Thommes, D. Su, E. A.
Stach, and R. S. Ruoff, “Carbon-based supercapacitors produced by
activation of graphene,” science, 2011.

124

You might also like