Formal Definitions and Performance Comparison of Consistency Models for Parallel File Systems

Chen Wang, Member, IEEE, Kathryn Mohror, Member, IEEE, and Marc Snir, Fellow, IEEE

• Chen Wang and Kathryn Mohror are with Lawrence Livermore National Laboratory. E-mail: {wang116, mohror1}@llnl.gov.
• Marc Snir is with the Department of Computer Science, University of Illinois Urbana-Champaign. E-mail: [email protected].

arXiv:2402.14105v2 [cs.DC] 26 Feb 2024

Abstract—The semantics of HPC storage systems are defined by the consistency models by which they abide. Storage consistency models have been less studied than their counterparts in memory systems, with the exception of the POSIX standard and its strict consistency model. The use of POSIX consistency imposes a performance penalty that becomes more significant as the scale of parallel file systems increases and the access time to storage devices, such as node-local solid-state devices, decreases. While some efforts have been made to adopt relaxed storage consistency models, these models are often defined informally and ambiguously as by-products of a particular implementation. In this work, we establish a connection between memory consistency models and storage consistency models and revisit the key design choices of storage consistency models from a high-level perspective. Further, we propose a formal and unified framework for defining storage consistency models and a layered implementation that can be used to easily evaluate their relative performance for different I/O workloads. Finally, we conduct a comprehensive performance comparison of two relaxed consistency models on a range of commonly seen parallel I/O workloads, such as checkpoint/restart of scientific applications and random reads of deep learning applications. We demonstrate that for certain I/O scenarios, a weaker consistency model can significantly improve the I/O performance. For instance, for the small random reads typically found in deep learning applications, session consistency achieved a 5x improvement in I/O bandwidth compared to commit consistency, even at small scales.

Index Terms—Consistency model, storage consistency, parallel file system, parallel I/O

1 INTRODUCTION

HIGH performance computing (HPC) systems host parallel applications composed of hundreds to tens of thousands of tightly-coupled processes that typically run for hours or days. These large-scale applications that run on supercomputers often read and write large amounts of data, spending a significant fraction of their execution time performing I/O [1], [2]. However, the I/O subsystem, a core component in HPC systems, has not evolved as fast as other components such as compute and interconnect. I/O is emerging as a major bottleneck for many HPC applications. For example, it is shown that I/O can take as much as 85% of the training time of a large-scale deep learning application [3], the majority of which is due to the random read requests to a large number of training samples. MuMMI [4], as another example, is a multi-scale simulation that models the dynamics of RAS proteins. When recording snapshots at a 0.5 ns interval, MuMMI generates over 400 million files, occupying over 1 PB of disk space for a single run, which poses a significant challenge for I/O latency and bandwidth. In order to reduce the I/O demand, compromises such as reducing snapshot frequencies have to be made. As we move beyond the exascale era, the I/O bottleneck will only be exacerbated.

A major constraint on the performance of parallel file systems (PFSs) is their strict adherence to the POSIX consistency model. A consistency model specifies a contract between a programmer and a system, wherein the system guarantees that if the programmer follows the rules, the shared data will be consistent and the results of reading, writing, or updating will be predictable. The POSIX standard [5] specifies a strong and straightforward consistency model, which requires all writes be immediately visible to all subsequent reads. While the POSIX consistency model is easy to maintain in a single-node environment, it is expensive to maintain at scale. Nevertheless, most widely deployed PFSs, including Lustre [6], GPFS [7], and BeeGFS [8], support POSIX consistency. The cost of supporting POSIX consistency is becoming increasingly unacceptable due to two key reasons: (1) the rapid growth in the scale of HPC systems, which directly increases the software overhead of maintaining POSIX consistency; (2) the emergence of new storage devices such as solid-state devices (SSDs), which greatly improve I/O latency and bandwidth and make software overhead more significant. In recent years, many efforts have been made to develop burst buffer (BB) PFSs [9], [10], [11], [12], [13] (especially user-level systems) with relaxed consistency models, but these models were typically defined ambiguously and informally as by-products of their PFS implementations. This leads to three major issues: (1) Performance: It is challenging for system developers to evaluate and compare the effectiveness of different consistency models; (2) Correctness: It is difficult for programmers to reason about their program or check the correctness of their code; (3) Portability: A program that runs correctly under a given relaxed consistency model is not guaranteed to run correctly on a different model.
When compared to consistency models of shared memory systems (often referred to as memory models), storage consistency models (or storage models for short) have received far less attention and have not been systematically studied from a higher-level perspective. Similar terminologies and concepts are repeatedly reinvented, and lessons learned from memory models are often overlooked.

To summarize and motivate this work, here we list fundamental questions that have not been clearly answered. The first two focus on comparisons between storage and memory models. The last three focus on storage models and their performance implications.
1) What are the reasons for the lack of attention to storage models compared to memory models?
2) What are the design choices for storage models, and how do they relate to similar choices for memory models?
3) How do existing storage models compare and what commonalities exist among them? Can they be defined in a unified and formal manner?
4) What are the performance implications of a storage model?
5) What are effective methods to evaluate and compare the performance of different storage models?

This work seeks to answer these fundamental questions and develop a better understanding of storage consistency models by conducting a systematic study. Our work makes the following contributions:
• We investigate the contributing factors of the limited attention paid to storage models. We show that recent advances in storage techniques are rapidly changing some of these factors (Section 3).
• We revisit the design choices of storage models and relate them to memory models. We highlight the different considerations between memory systems and storage systems for each design choice (Section 3).
• We propose a formal and unified framework for specifying the most widely-used family of storage models (Section 4).
• We study the performance implications of storage models. More importantly, we present a "layered" implementation that allows for effective performance comparisons between different storage models (Section 5).
• Finally, we conduct a detailed performance comparison between two storage models using a range of common HPC I/O workloads. The results highlight the significant impact storage models can have on I/O performance (Section 6).

In this work, we focus our study on storage models in the context of parallel file systems for HPC I/O, but the concepts we develop should be generally applicable to other large-scale storage systems.

2 BACKGROUND

This section describes example consistency models from both memory and storage domains, with the aim of introducing their similarities and differences. To prevent confusion, we use the terms store and load when describing memory models and write and read when describing storage models.

2.1 Consistency Model: Strong or Relaxed

Sequential consistency [14] is one of the most intuitive consistency models. It says that the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. Sequential consistency is considered a strong consistency model because it guarantees that operations of a processor are seen to occur in the same order by all processors. The major drawback is that it hinders optimizations that may result in reordering, e.g., store buffers and out-of-order cores.

Relaxed consistency models (weaker than sequential consistency) allow more optimizations but can be counter-intuitive. Consider the well-known example shown in Table 1, where each process loads the value of the variable (x and y) stored by the other process. Intuitively, there are three possible outcomes: (r1, r2) = (0, 100), (100, 0) or (100, 100). Sequential consistency guarantees that any execution of this program will produce one of these three results. In reality, most real hardware also allows (r1, r2) = (0, 0). For example, x86 systems from Intel use a relaxed consistency model (often referred to as total store order [15]) that allows reordering non-conflicting store-load pairs, which violates sequential consistency. With this relaxation, store buffers can be used to buffer the expensive stores, so that loads (L12 and L22) can bypass the previous stores (L11 and L21).

TABLE 1: A load-after-store example. All variables are initially zero.

  Process 1:         Process 2:
  L11: x = 100;      L21: y = 100;
  L12: r1 = y;       L22: r2 = x;

The core idea behind the relaxed models is that some constraints imposed by stronger models are not necessary for the targeted program, while relaxing such constraints provides significant performance gains. The drawback, however, is that relaxing consistency semantics will likely reduce portability or programmability.

2.2 Relaxed Memory Models

2.2.1 Weak Ordering

Weak ordering was defined by Dubois et al. [16] as follows: In a multiprocessor system, memory accesses are weakly ordered if (1) accesses to global synchronizing variables are strongly ordered, (2) no access to a synchronizing variable is issued by a processor before all previous global data accesses have been globally performed, and (3) no access to global data is issued by a processor before previous accesses to a synchronizing variable have been globally performed.

In essence, a system that follows weak ordering needs to be able to recognize synchronization operations. Concurrent accesses to shared memory can violate sequential consistency. But if all conflicting memory accesses are properly synchronized, then a weakly ordered system will deliver the same result as a system with sequential consistency. Many high-level languages require that programs be race-free, i.e., that conflicting accesses be synchronized.
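As a concrete rendering of the Table 1 program, the C11 sketch below (an illustration constructed for this discussion; the thread and variable names are arbitrary) declares x and y as sequentially consistent atomics, which synchronizes the conflicting accesses and rules out the (r1, r2) = (0, 0) outcome that plain, unsynchronized variables would permit on TSO hardware.

#include <stdatomic.h>
#include <threads.h>

/* Table 1 with C11 atomics: seq_cst stores and loads forbid (0, 0). */
atomic_int x = 0, y = 0;
int r1, r2;

int process1(void *arg) {
    (void)arg;
    atomic_store(&x, 100);   /* L11 */
    r1 = atomic_load(&y);    /* L12 */
    return 0;
}

int process2(void *arg) {
    (void)arg;
    atomic_store(&y, 100);   /* L21 */
    r2 = atomic_load(&x);    /* L22 */
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, process1, NULL);
    thrd_create(&t2, process2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    /* With seq_cst atomics, (r1, r2) is one of (0,100), (100,0), (100,100). */
    return 0;
}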
Consider the example in Table 2: when all operations are identified as data operations, y will not be guaranteed to return 100 because processors are free to reorder operations. However, if L12 and L21 are identified by programmers as synchronizations, then L22 is guaranteed to return the latest value of x due to the ordering imposed by the synchronizations.

TABLE 2: A weak ordering example. All variables are initially zero.

  Process 1:         Process 2:
  L11: x = 100;      L21: while(!flag){};
  L12: flag = 1;     L22: y = x;

2.2.2 Release Consistency

Many synchronization operations occur in pairs. Release consistency [17] utilizes this information by explicitly distinguishing them as release and acquire operations, with help from programmers. The release operation instructs the processor to make all previous memory accesses globally visible before the release completes, and the acquire operation instructs the processor not to start subsequent memory accesses before the acquire completes. In the example of Table 2, L12 is a release operation and L21 is an acquire operation. Release consistency is a further relaxation of weak ordering. It allows systems to have different implementations for release and acquire, which leads to better performance at the cost of an increased burden on programmers.

2.2.3 Entry Consistency

A major issue of weak ordering and release consistency is that their synchronization operations impose order on memory operations even if they do not conflict, which may add unnecessary overhead. Consider the example in Table 3: to make sure y in L22 returns the store to x in L12, under weak ordering or release consistency, L13 and L21 need to be identified as synchronizations. However, this also prohibits reordering L11 and L13, i.e., L11 must complete before L13, which is unnecessary if no other process will ever access w. Entry consistency addresses this issue by requiring each ordinary shared data item to be associated with a synchronization variable. When an acquire is done on a synchronization variable, only those data guarded by that synchronization variable are made consistent. For instance, in the case of the example in Table 3, we can associate w and x with two different synchronization variables, thus allowing L11 to bypass L12 and L13.

TABLE 3: An entry consistency example. All variables are initially zero.

  Process 1:         Process 2:
  L11: w = 100;      L21: while(!flag){};
  L12: x = 100;      L22: y = x;
  L13: flag = 1;

2.3 Relaxed Storage Models

The requirements of POSIX consistency essentially impose sequential consistency. The fundamental problem behind the performance issues stemming from POSIX consistency is that PFSs are ignorant of the application's synchronization logic and the order of I/O operations of different processes. PFSs must make worst-case assumptions and serialize all potentially conflicting I/O operations to guarantee POSIX consistency. Alternatively, programmers can provide information on program synchronizations of conflicting I/O operations to the PFS. With this extra information, PFSs can adopt a weaker consistency model, while guaranteeing the same outcome as POSIX consistency. Wang et al. [18] have studied many such PFSs and their consistency models. Here, we briefly discuss the most commonly used models.

2.3.1 Commit Consistency

Commit consistency is a relaxed consistency model commonly used by recent BB PFSs such as BSCFS [19], UnifyFS [10], and SymphonyFS [13]. In commit consistency, "commit" operations are explicitly executed by processes. The commit operation conveys synchronization information. I/O updates performed by a process to a file before a commit become globally visible upon return of the commit operation. To maintain portability, PFSs adopting commit consistency may use an existing POSIX call to indicate a commit. For example, in UnifyFS [10], a commit operation is triggered by an fsync call, which applies to all updates performed by a process on a file since the previous commit. Note that finer commit granularity (e.g., committing byte ranges) is also possible, but may add additional overhead if used in a superfluous way.

2.3.2 Session Consistency

Commit consistency guarantees that all local writes that precede the commit operation become globally visible after the commit operation. However, in many cases, data written is rarely read back by the same application, and even when this happens, usually only a subset of processes perform the reads. Thus, global visibility is not necessary. Session consistency (also known as close-to-open consistency) addresses this issue by defining a pair of synchronization operations, namely, session_close and session_open. Session consistency guarantees that writes by a process become visible to another process (not all processes) when the modified file is closed by the writing process and subsequently opened by the reading process, with the session_close happening before the session_open. The idea of session consistency is very similar to that of release consistency for memory models.

Note that we name the two operations session_open and session_close, but most existing systems adopting session consistency, such as NFS [20], do not provide separate session_open and session_close APIs. Rather, they are implied by POSIX open/close (or fopen/fclose) calls, calls that have additional effects: they apply all updates to a file.
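To make the contrast concrete, the following sketch (a usage example constructed here, with an illustrative path and size) shows a producer/consumer exchange written against plain POSIX calls, as a commit-consistent PFS such as UnifyFS would interpret them (fsync acting as the commit) and as a close-to-open system such as NFS would interpret them (visibility established by the writer's close and the reader's subsequent open). The out-of-band signal that orders the two processes is assumed, not shown.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Writer process: produce data, then make it visible to others.       */
/* Under commit consistency (e.g., UnifyFS), fsync() is the commit;    */
/* under session consistency (e.g., NFS), close() ends the session.    */
void writer(void)
{
    char buf[4096];
    memset(buf, 'A', sizeof(buf));

    int fd = open("/pfs/shared.dat", O_CREAT | O_WRONLY, 0644);
    write(fd, buf, sizeof(buf));  /* visible only to the writing process   */
    fsync(fd);                    /* commit: updates become globally        */
                                  /* visible under commit consistency       */
    close(fd);                    /* session_close under session consistency */
    /* ... signal the reader out of band (MPI message, RPC, etc.) ... */
}

/* Reader process: runs after the writer's signal. */
void reader(void)
{
    char buf[4096];
    int fd = open("/pfs/shared.dat", O_RDONLY);  /* session_open */
    read(fd, buf, sizeof(buf));   /* sees the committed/closed data */
    close(fd);
}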
2.3.3 MPI-IO Consistency

MPI-IO [21] is a part of the MPI standard that defines both communications (message passing) and I/O operations. As the latest standard [22] states, MPI-IO provides three levels of consistency: sequential consistency among all accesses using a single file handle, sequential consistency among all accesses using file handles created from a single collective open with atomic mode enabled, and user-imposed consistency among accesses other than the above.

The first two cases are the most common, and sequential consistency is guaranteed without extra synchronizations. In the last case, sequential consistency can be achieved by using a sync-barrier-sync construct that imposes order on the conflicting I/O accesses. Here, sync is one of MPI_File_open, MPI_File_close or MPI_File_sync, which takes in the file handle and flushes the data (writer) or retrieves the latest data (reader). And barrier provides a mechanism that imposes order between the two syncs. In most cases, this is achieved using MPI calls. For example, barrier can be one of the collective communication calls such as MPI_Barrier and MPI_Allgather, or a pair of point-to-point calls such as MPI_Send plus MPI_Recv. However, barrier is not limited to MPI calls; it can use any mechanism that properly orders the two sync calls.

Even though MPI-IO has information on both I/O and MPI communication, it cannot assume the synchronization information (i.e., barrier) is always available to the system, as it may not use MPI calls. MPI-IO consistency is similar to session consistency, but additional optimizations are possible if ordering is imposed through MPI calls, as those are visible to the MPI library (which includes MPI-IO).

3 MEMORY MODELS VS. STORAGE MODELS

In this section, we first investigate why relaxed storage consistency models have not gained enough attention and widespread adoption. Next, we analyze the key design choices for consistency models and compare the different considerations between memory systems and storage systems. Finally, we discuss the primary commonality of existing relaxed storage models and introduce the key concepts that will serve as the foundation for our formal definition of these models.

3.1 Why Relaxed Storage Models are not Widely Adopted

3.1.1 Programming Hierarchy

The presence of compilers in the memory programming hierarchy (Figure 1(a)) has been an important factor in the adoption of relaxed memory models. Compilers can hide complexity and provide portability, which allows programmers to target a single memory model specified by the high-level programming language (e.g., C++ and Java) without knowledge of the underlying consistency models provided by the CPUs. This way, a suitable consistency model can be selected for the given hardware, without worrying about programmability. For example, C++ allows specifying a different consistency model for each atomic operation, but the semantics of these models is specified by the C++ standard and is unrelated to the underlying hardware.

In contrast, there is no corresponding "compiler" layer in the storage programming hierarchy and no low-level hardware-supported consistency model to map onto: consistency protocols are implemented in software. To achieve programmability and portability, most storage systems choose to implement the same standard, POSIX. Most local file systems (e.g., ext3, ext4, and xfs) and parallel file systems (e.g., Lustre [6] and GPFS [7]) are POSIX-compliant.

Fig. 1: Memory vs. storage programming. The lack of automated support (e.g., a compiler layer) in the storage programming hierarchy makes it harder to adopt different consistency models for different hardware.

3.1.2 Software Overhead

The POSIX consistency semantics prohibit many optimizations. This disadvantage is not so apparent in a single-node system, in which I/O operations are serialized. But maintaining POSIX consistency in an HPC environment can be more costly, as PFSs require distributed locking mechanisms running reliably at large scale to enforce it. Nevertheless, in most scenarios, the software overhead incurred is minor compared to the slow I/O performance of HDDs. Consequently, less attention has been given to alternative consistency models. However, this is starting to change due to two reasons: the rapid increase in the scale of HPC systems, and the emergence of new, faster storage technologies, such as SSDs. The former directly increases the software overhead required to enforce the same consistency requirements. The latter makes the overhead more significant since I/O operations complete much faster.

3.2 Design Considerations

Here we describe several important design considerations of a consistency model and compare memory models (from the perspective of high-level programming languages) with storage models (from the perspective of parallel file systems) for each consideration.

3.2.1 Synchronization

Synchronization is critical to a consistency model. It is used to enforce order between potentially conflicting accesses. Synchronizations can be performed by explicitly invoking synchronization operations, which is common in both memory models and storage models. Alternatively, synchronizations can be specified in a declarative manner. For example, high-level languages often provide keywords (e.g., atomic in C++ and volatile in Java) that can modify ordinary objects to impose extra ordering restrictions on relevant accesses. Such features, however, are less common in storage models.
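As a concrete example of explicit synchronization in the storage setting, the minimal sketch below spells out the sync-barrier-sync construct from Section 2.3.3 using standard MPI-IO calls, assuming two MPI ranks and an illustrative file name and buffer size: rank 0 writes and syncs, the barrier orders the two syncs, and rank 1 syncs and reads.

#include <mpi.h>

/* sync-barrier-sync: rank 0 writes, rank 1 reads a conflicting range. */
/* MPI_File_sync is the "sync" on both sides; MPI_Barrier is the       */
/* "barrier" that orders the two syncs.                                 */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    char buf[1024] = {0};
    if (rank == 0) {
        MPI_File_write_at(fh, 0, buf, sizeof(buf), MPI_CHAR,
                          MPI_STATUS_IGNORE);
        MPI_File_sync(fh);            /* first sync: flush the writes */
    }

    MPI_Barrier(MPI_COMM_WORLD);      /* barrier: orders the two syncs */

    if (rank == 1) {
        MPI_File_sync(fh);            /* second sync: observe the writes */
        MPI_File_read_at(fh, 0, buf, sizeof(buf), MPI_CHAR,
                         MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}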
3.2.2 Scope of synchronization control

Both high-level programming languages and parallel file systems provide some limited control over the synchronization requirements of executing code. In a high-level language such as C++ this is done at the level of variable declarations (for atomic variables) or atomic access operations (specifying the applicable memory order). The latter is seldom used. For POSIX file systems, this is done when a file is opened, e.g., with O_SYNC, O_RSYNC or O_DSYNC flags. Other scopes for such controls are feasible, in both cases. For a programming language, the scope is likely to be a static text scope; for a file system, it is likely to be a file, file range, or an I/O call.

3.2.3 Atomicity

Atomicity is often a required property for both memory systems and storage systems. It is key to ensuring correctness for applications with conflicting accesses.

3.2.4 Granularity

Atomicity is supported in high-level languages with arbitrary granularity. One can specify a primitive object (e.g., int) or a large data structure to be atomic. This granularity need not match the granularity of memory operations in hardware; the compiler will implement them using native atomic operations or locks, depending on the granularity. Support for atomic access to larger memory objects will entail additional software overheads. Similarly, consistency is supported by memory hardware and made visible in high-level languages at the granularity of the smallest accessible datum, namely a byte. But coherence protocols act at the granularity of a cache line (typically, 64 bytes). Finer-grain coherence units would require more hardware; coarser-grain coherence units increase the amount of coherence memory traffic and the frequency of false sharing.

File systems also support storage accesses having arbitrary lengths. POSIX does not guarantee atomicity of reads and writes; the outcome of such operations is well-defined only if conflicting operations are ordered by some means. On the other hand, consistency is maintained at the byte level by POSIX. PFSs' units of coherence are necessarily much coarser, so that fine-grain interleaved accesses by distinct processes can generate a significant amount of coherence traffic and suffer from false sharing.

3.2.5 Program text

In memory systems, compilers see the program text and thus have some information on possible executions of the program. Parallel file systems, on the other hand, have no access to the program text. A PFS is an online system that sees one storage operation at a time.

3.2.6 Reordering

The compiler can perform static passes to reorder memory instructions, whereas PFSs are online systems that do not have the ability to make static reorderings. PFSs can perform some limited reorderings by buffering/delaying certain storage operations.

3.2.7 External information

When programming on a PFS, as discussed in Section 2.3.3, programmers sometimes use non-storage operations, e.g., through RPC and message passing, to express their synchronization logic. However, PFSs are generally unaware of such synchronization operations. In contrast, memory models are simpler as they assume that all synchronization is done using memory operations.

3.3 Approach to Formally Define Consistency Models

The primary commonality of existing relaxed storage models is that they can guarantee sequential consistency for programs that follow certain rules. Those programs share enough information (e.g., the commit calls in commit consistency) with the system so the system can guarantee sequentially consistent execution results even with relaxed storage models. Such models are said to be in Sequential Consistency Normal Form (SCNF) [23]. SCNF was a term initially defined for memory model formalization, but it applies to storage models as well.

Sequential Consistency Normal Form: A consistency model is in sequential consistency normal form iff it guarantees sequential consistency to a set of formally-characterized programs.

The idea of providing sequential consistency semantics to a set of formally-characterized programs was formalized by the data-race-free (DRF) memory models [23], [24]. The DRF models exploit the observation that good programming practice dictates that programs be data-race-free; a data race often suggests that there are bugs in the code. The DRF models guarantee sequential consistency for the "correct" programs (i.e., without data races) and leave the behavior of the "incorrect" programs undefined.

Unfortunately, unlike the DRF memory models, existing SCNF storage models are typically defined ambiguously and informally as by-products of their PFS implementations. In the next section, we will present a unified and formal framework to specify storage models that are in SCNF.

4 A UNIFIED AND FORMAL FRAMEWORK

The SCNF storage models we consider rely on synchronization information to achieve sequential consistency. We call programs that contain adequate synchronization to enforce the necessary ordering properly-synchronized programs, and the storage models that guarantee sequential consistency to those programs properly-synchronized SCNF models. All models we discussed in Section 2.3 are properly-synchronized SCNF models.

The formalization of our framework is similar to that of the Java memory model [25] (which adopts the DRF approach but with a much more complex model). Our framework does not make any assumptions about particular synchronization methods; it allows the specific storage model to define its own set of synchronization operations. The key is to define which programs are considered properly-synchronized.
4.1 Specifying Properly-Synchronized SCNF Models

We first define two types of storage operations: a storage operation is either a data storage operation or a synchronization storage operation, defined as follows.

Data Storage Operations: These are I/O operations that read or write storage, such as fread or fwrite. Data operations include the specification of the storage location (possibly as a range) to be read or written. Each data operation specifies an object called a synchronization object that is associated with the requested location, such as a file handle.

Synchronization Storage Operations: These are I/O operations that may be used to impose an order on data storage operations, such as fsync, fopen, or fclose. Synchronization operations are model-specific, where each synchronization operation includes the specification of a synchronization object.

Further, we consider here the execution of a multiprocess program, in an environment that provides well-defined mechanisms to synchronize concurrent processes, such as MPI message-passing. These mechanisms define a program order and a synchronization order on the executed operations of the program:

Program Order ($\xrightarrow{po}$): The program order of a process is a total order on the execution of the process' operations as specified by the program text. To keep the discussion simple, we ignore the extensions needed to deal with multithreaded processes.

Synchronization Order ($\xrightarrow{so}$): A synchronization order is a partial order specified between operations executed by distinct processes. This partial order is consistent with the program order, and $\xrightarrow{po} \cup \xrightarrow{so}$ is acyclic.

A properly-synchronized SCNF model is then defined as follows.

Happens-Before Order ($\xrightarrow{hb}$): The happens-before order of an execution is the transitive closure of $\xrightarrow{po} \cup \xrightarrow{so}$. The outcome of a parallel execution should be as if all instructions were executed in the order specified by $\xrightarrow{hb}$. Thus, if $ow$ and $or$ are, respectively, a write and a read to the same location, and $ow \xrightarrow{hb} or$, then $or$ will return the value written by $ow$, unless there is another write $ow'$ to the same location such that $ow \xrightarrow{hb} ow' \xrightarrow{hb} or$. The happens-before order is defined by the semantics of the programming system used. It orders I/O operations executed by the program. It is not necessarily visible to the storage system.

Conflict: Two data storage operations conflict iff their access ranges overlap, and at least one of them is a write.

Minimum Synchronization Construct (MSC): An MSC specifies a minimum sequence of synchronization storage operations required to synchronize two conflicting data operations. An MSC consists of $k$ synchronization storage operations and $k+1$ edges, where $k \geq 0$:

$MSC = \xrightarrow{r_0} S_1 \xrightarrow{r_1} S_2 \xrightarrow{r_2} \ldots \xrightarrow{r_{k-1}} S_k \xrightarrow{r_k}$

For each $i$, $1 \leq i \leq k$, $S_i \in S$, where $S$ is the set of synchronization storage operations to be defined by the specific consistency model. For each $j$, $0 \leq j \leq k$, $\xrightarrow{r_j} \in \{\xrightarrow{po}, \xrightarrow{hb}\}$. Note here that the choice of $r_j$ cannot be trimmed down to just $\xrightarrow{hb}$, as some consistency models may require a synchronization operation of the MSC to be called by one of the conflicting processes, where $r_j = po$.

Properly-Synchronized Relation ($\xrightarrow{ps}$): Two conflicting data storage operations $X$ and $Y$ are properly synchronized, i.e., $X \xrightarrow{ps} Y$, iff one of the following holds:
1) $X$ is a read operation and $X \xrightarrow{hb} Y$.
2) $X$ is a write operation, and there exists an MSC between $X$ and $Y$ in the happens-before order.

Storage Race: Two data storage operations $X$ and $Y$ in an execution form a storage race iff they conflict and they are not properly synchronized.

Properly-Synchronized Program: A program is properly synchronized iff for every sequentially consistent execution of the program, all storage operations can be distinguished by the system as either data or synchronization, and there are no storage races in the execution.

Properly-Synchronized SCNF System: A system is said to be a properly-synchronized SCNF system iff the result of every run of a properly-synchronized program on the system is the result of a sequentially consistent execution of the program.

Intuitively speaking, the key to achieving sequential consistency is to make sure the program is storage-race free (i.e., there are no conflicts, or conflicts are properly synchronized). Storage-race freedom may require the use of storage synchronization operations, in addition to the synchronization constructs of the parallel programming system. A properly-synchronized SCNF model specifies a set $S$ of storage synchronization operations and minimum synchronization constructs (MSCs) to properly synchronize conflicting I/O operations.

4.2 Describing Existing Models

Our framework provides a formal, but simple, way to capture the specification of properly-synchronized SCNF models, where only $S$ and the MSC need to be specified for a complete definition. Table 4 demonstrates how to describe the storage models discussed earlier (Section 2.3) using our framework.

4.2.1 POSIX Consistency

POSIX consistency can be considered as a special properly-synchronized SCNF model. With POSIX consistency, every write is immediately visible to all subsequent reads without synchronization operations. Here and in the rest of this section, "subsequent" is defined according to the happens-before order. Therefore, POSIX consistency has an empty set $S$ and an MSC of $\xrightarrow{hb}$.
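As a small worked example of how these definitions compose (the processes, operations, and ranges here are illustrative), let $W$ be a write by process $P_1$ to bytes $[0, n)$ of a file and $R$ a conflicting read of the same range by process $P_2$, with an MPI message from $P_1$ to $P_2$ establishing $W \xrightarrow{so} R$ and hence $W \xrightarrow{hb} R$. Under POSIX consistency ($S = \{\}$, MSC $= \xrightarrow{hb}$), $W \xrightarrow{hb} R$ matches the MSC, so $W \xrightarrow{ps} R$ and there is no storage race. Under commit consistency ($S = \{commit\}$), $W \xrightarrow{hb} R$ alone contains no commit operation, so no MSC exists between $W$ and $R$ and they form a storage race; if $P_1$ issues a commit after the write, then $W \xrightarrow{po} commit \xrightarrow{hb} R$ matches the MSC and $W \xrightarrow{ps} R$ holds.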
4.2.2 Commit Consistency

For commit consistency, there is one synchronization operation, commit. A write to file $f$ becomes visible to all subsequent reads from $f$ upon the return of a subsequent commit call. Most commit-based systems require that the commit is called by the process that performs the writes, by having an MSC of $\xrightarrow{po}$ commit $\xrightarrow{hb}$. A relaxed version may allow a process to commit the updates of other processes, resulting in an MSC of $\xrightarrow{hb}$ commit $\xrightarrow{hb}$.

4.2.3 Session Consistency

Session consistency specifies two special synchronization operations, $S$ = {session_close, session_open}. For a write to become visible to a subsequent read, a close-to-open pair has to be performed in between; thus, $MSC = \xrightarrow{po}$ session_close $\xrightarrow{hb}$ session_open $\xrightarrow{po}$. The $\xrightarrow{po}$ at the beginning indicates that the session_close operation has to be performed by the writing process. Similarly, the $\xrightarrow{po}$ at the end indicates that the session_open operation must be performed by the reading process. Finally, the $\xrightarrow{hb}$ enforces that the session_close happens before the session_open.

4.2.4 MPI-IO Consistency

As discussed in Section 2.3.3, MPI-IO provides three levels of consistency. For the first two cases, MPI-IO guarantees sequential consistency without requiring extra synchronizations (just like POSIX consistency). In Table 4, we show how to specify the MPI-IO consistency model for the third case. In this case, MPI_File_close synchronizes with all subsequent MPI_File_open and MPI_File_sync. MPI_File_sync synchronizes with all subsequent MPI_File_sync and MPI_File_open. Therefore, there are four possible MSCs that can be used to properly synchronize the conflicting accesses:
• $\xrightarrow{po}$ MPI_File_close $\xrightarrow{hb}$ MPI_File_open $\xrightarrow{po}$
• $\xrightarrow{po}$ MPI_File_close $\xrightarrow{hb}$ MPI_File_sync $\xrightarrow{po}$
• $\xrightarrow{po}$ MPI_File_sync $\xrightarrow{hb}$ MPI_File_sync $\xrightarrow{po}$
• $\xrightarrow{po}$ MPI_File_sync $\xrightarrow{hb}$ MPI_File_open $\xrightarrow{po}$
In each MSC, the $\xrightarrow{hb}$ imposes the order between the two synchronization operations, and the $\xrightarrow{po}$ enforces that the synchronization operations must be called by the conflicting processes.

5 AN IMPLEMENTATION FOR PROPERLY-SYNCHRONIZED SCNF SYSTEMS

Now that we have formally defined properly-synchronized SCNF models, the next question is: when should we use a particular consistency model? Another question that immediately follows is: what is the performance difference? Alternatively and more simply, how much performance can we gain from using a weaker consistency model? The answers to these questions are important for both storage system developers and application programmers because they provide information to aid in understanding the trade-off between extra programming effort and extra performance. This information helps system developers choose which consistency models to support and helps application programmers decide whether to port their codes to a storage system with weaker consistency.

To answer these questions, we need to conduct a comprehensive performance comparison between different properly-synchronized SCNF models, which requires evaluating PFSs that use those models. However, existing PFSs that adopt different consistency models also differ greatly in their implementations and optimizations. It is difficult to isolate the effect of a consistency model and even harder to conduct a fair comparison between different consistency models. To address this, we present a "layered" implementation that allows for an easy performance comparison of different consistency models by keeping, as much as possible, everything other than the consistency model the same. An overview of our approach is depicted in Figure 2. We design and implement a "base-layer" PFS, called BaseFS, which runs on top of a system-level PFS such as GPFS or Lustre. BaseFS supports the basic functionalities of a PFS with essentially zero optimization. BaseFS buffers reads and writes using burst buffer devices, and flushes data to the underlying PFS only when explicitly instructed. BaseFS provides a very minimal consistency guarantee, but it exposes a set of flexible primitives that can be used to implement custom consistency models. On top of BaseFS, we can implement PFSs providing different consistency models using these primitives. Since these PFSs use the same set of primitives and thus the same underlying implementation, we can limit the impact of other components of the PFS to a very low level. Comparing the performance of these PFSs thus can give us a good understanding of the impact of different consistency models.

In this section we describe BaseFS and two example PFSs, CommitFS and SessionFS, each adopting a different consistency model as suggested by its name.

Fig. 2: Overview of a layered approach for implementing PFSs with different consistency models.

5.1 BaseFS

BaseFS is not designed to be a full-fledged file system. Our focus is to evaluate the performance implications of different consistency models. As a result, we consider detailed implementation choices, e.g., how to resolve a path and map it to the inode server and how to retrieve file locations given an inode, as control variables in our experiments, and we need to make sure that they do not compromise the comparison.
TABLE 4: Specifying properly-synchronized SCNF models using our framework.

  POSIX Consistency:    S = {};                                        MSC = $\xrightarrow{hb}$
  Commit Consistency:   S = {commit};                                  MSC = $\xrightarrow{hb}$ commit $\xrightarrow{hb}$
  Session Consistency:  S = {session_close, session_open};             MSC = $\xrightarrow{po}$ session_close $\xrightarrow{hb}$ session_open $\xrightarrow{po}$
  MPI-IO Consistency:   S = {MPI_File_sync, MPI_File_close, MPI_File_open};
                        MSC = $\xrightarrow{po}$ s1 $\xrightarrow{hb}$ s2 $\xrightarrow{po}$, where s1 ∈ {MPI_File_close, MPI_File_sync} and s2 ∈ {MPI_File_sync, MPI_File_open}

5.1.1 Primitives

Modern PFSs [6], [7], [8] normally use some kind of locking mechanism to provide sequential consistency. But the lock-based design does not take advantage of the extra information available to the weaker models, like commit consistency and session consistency. Thus, instead of using locking for our BaseFS implementation, we developed a set of flexible primitives (Table 5) which are more suitable for implementing properly-synchronized SCNF models.

The BaseFS file system does not provide any implicit guarantee of consistency. Consistency must be enforced by explicit synchronization calls. The system may store multiple, possibly inconsistent, copies of parts of a file on client nodes, in addition to a (partial) copy on a storage server. In BaseFS, the write (bfs_write) writes to the local copy of the file at the calling client. The read (bfs_read) is implemented as a "read from": the owner argument specifies the client process that will source the data read. The owner argument can be retrieved using the bfs_query call. The read will return the values most recently written by the owner client.

The two key synchronization primitives are bfs_attach and bfs_query. The attach call specifies a file range, and the issuing client becomes the exclusive owner of those addresses in this range. One can attach only locations that were written by the local process and not flushed. Essentially, the attach call makes the local writes visible to other processes. It does not guarantee the global visibility of future writes to the same range. Whenever an update needs to be made visible to other processes, an attach call is required. An attach is not needed if the written data will not be read by other processes.

The query call specifies a file range and returns the current owners of the range. The result is returned in a list of intervals. Each interval contains a disjoint subrange and the last attached owner process of that subrange. A query is required to retrieve the latest attached writes from other processes. In most HPC I/O workloads, this is rare. Typically a process reads from its own writes or from a preexisting file. As a result, the fewer conflicting storage accesses occur, the fewer attach and query calls are needed and thus the lower the overhead.

5.1.2 Implementation

Again, the top priority of BaseFS is not to achieve the best performance, but to enable effective comparisons between different consistency models. Therefore, our implementation is fairly straightforward, without complicated optimizations such as distributed servers and namespace partitioning. These optimizations will be equally beneficial to the PFSs built on top of BaseFS (e.g., CommitFS and SessionFS), and would not add additional value to the comparison.

As shown in Figure 2, BaseFS is implemented as a user-level BB file system with a focus on data operations. Reads and writes are directly fulfilled by the BB devices without any memory caching. A limited number of metadata operations (e.g., stat) and attributes (e.g., EOF) are supported. In BaseFS, each client process buffers its writes (bfs_write) using node-local BB devices. We assume that the BB devices are large enough to accommodate the entire storage required for a job execution (no system-initiated flushes). At a read call (bfs_read), the client reads from the buffer of the specified owner (which can be itself). If the requested range is not owned by any client, the client reads from the underlying PFS to obtain the latest flushed data.

We use a single global server to handle messages from clients. These messages are generated only by the synchronization primitives; the write and read primitives do not involve the global server. The global server is multithreaded: the master thread handles all communications and the remaining threads run an identical worker routine. Each worker maintains a FIFO queue that holds client requests. When a new client request (e.g., a query request) is received, the master thread creates a new task and appends it to one worker's task queue. The worker is selected in a round-robin manner. Once the task is completed by the specified worker, the server sends the result back to the requesting client. Next, we go through the tasks triggered by the synchronization primitives:
• Attaching: When a client process invokes a bfs_attach* primitive, it notifies the server that it will be responsible for reads from the specified file range. In other words, the client declares itself as the owner of the most recent update to the specified range. The ownership is exclusive; the caller of bfs_attach* will take over the ownership in the case when the same range has been previously attached by another process. Subsequent queries (bfs_query) to the same range will return an exclusive owner. Other clients can later use bfs_read to directly fetch the data from the owner's buffer without going through the underlying PFS.
• Detaching: A client detaches from a previously attached file range to relinquish ownership. After detaching, the owner does not own the range anymore and it will not be responsible for future bfs_read calls to the detached range. If the data needs to be preserved for future reads, then a bfs_flush call is required before detaching.
TABLE 5: The most relevant primitives of BaseFS


• bfs_file_t* bfs_open(const char* pathname)
Description: Opens the file whose pathname is the string pointed to by pathname, and associates a BaseFS file handle (bfs_file_t) with it. This
file handle is an opaque object and can be used by subsequent I/O functions to refer to that file. The file is always opened in read-write mode.
Append mode is not supported. The file offset used to mark the current position within the file is set to the beginning of the file.
Return Value: Upon successful completion, the function returns a pointer to the BaseFS file handle; otherwise, a NULL pointer is returned.
• int bfs_close(bfs_file_t* file)
Description: Causes the file handle pointed to by file to be released and the associated file to be closed. Any buffered data is discarded (not flushed
as in POSIX). Whether or not the call succeeds, the file handle is disassociated from the file.
Return Value: Upon successful completion, the function returns 0; otherwise, it returns -1.
• ssize_t bfs_write(bfs_file_t* file, const void* buf, size_t size)
Description: Writes size bytes of data from the buffer pointed by buf to the specified file. The file-position indicator of the calling process is
advanced by the number of bytes successfully written. The write becomes immediately visible to the writing process, but it is not guaranteed to
be visible to other processes after the call.
Return Value: Upon successful completion, the function returns the number of bytes written; otherwise, it returns -1.
• ssize_t bfs_read(bfs_file_t* tf, void* buf, size_t size, bfs_addr_t* owner)
Description: Reads size bytes of data from the specified file to the buffer pointed to by buf. The file-position indicator of the calling process is
advanced by the number of bytes successfully read. This function returns the most up-to-date buffered write of the specified owner process. The
function will fail if the owner process does not own the specified range. If owner is NULL, the function will directly read from the underlying PFS.
Return Value: Upon successful completion, the function shall return the number of bytes successfully read; otherwise, it returns -1.
• int bfs_attach(bfs_file_t* file, size_t offset, size_t size)
Description: Attaches the range from offset to offset+size-1 in file to the calling process. This function makes the most recent buffered writes of
the calling process to the specified range visible and available to all processes. Overlapping ranges that were attached by other processes shall
be overwritten. The data covered by the specified range must have been written locally. It is allowed to attach part of a previous write, but
attaching unwritten bytes is erroneous.
Return Value: Upon successful completion, 0 is returned. Otherwise, -1 is returned.
• int bfs_attach_file(bfs_file_t* file)
Description: Attaches all locally buffered writes by the calling process to file. Overlapping ranges that were attached by other processes shall be
overwritten. The function is a no-op if no buffered writes exist.
Return Value: Upon successful completion, 0 is returned. Otherwise, -1 is returned.
• int bfs_query(bfs_file_t* file, size_t offset, size_t size, bfs_interval_t** intervals,
int* num_intervals)
Description: Returns the attached subranges of file included in the range of [offset, offset+size-1]. The result is written to intervals and num_intervals,
where intervals contains a list of file ranges and the owner process of each range.
Return Value: Upon successful completion, 0 is returned. Otherwise, -1 is returned.
• int bfs_query_file(bfs_file_t* file, bfs_interval_t** intervals, int* num_intervals)
Description: Returns all attached ranges of file. The result is written to intervals and num_intervals, where intervals contains a list of file ranges and
the attached process of each range.
Return Value: Upon successful completion, 0 is returned. Otherwise, -1 is returned.
• int bfs_detach(bfs_file_t* file, size_t offset, size_t size)
Description: Detaches currently attached ranges in file that overlap with range of [offset, offset+size-1] of the file. The function removes the specified
range from the local buffer, and makes the buffered writes covered by the range no longer visible to all processes. If the data is needed for later
reads, then a bfs_flush call should be made before detaching. The function fails if the specified range was not attached before.
Return Value: Upon successful completion, 0 is returned. Otherwise, -1 is returned.
• int bfs_detach_file(bfs_file_t* file)
Description: Detaches all ranges of file that are currently attached to the calling process. The function is a no-op if no attached ranges exist.
Return Value: Upon successful completion, 0 is returned. Otherwise, -1 is returned.
• int bfs_flush(bfs_file_t* file, size_t offset, size_t size)
Description: Flushes the locally buffered data in the range from offset to offset+size-1 of file to the underlying PFS. Previously attached updates of
the same range will remain available to all processes until the detach call.
Return Value: Upon successful completion, the function returns 0; otherwise, it returns -1.
• int bfs_flush_file(bfs_file_t* file)
Description: Flushes all the locally buffered data (if any) of file. The function is a no-op if no locally buffered data exists.
Return Value: Upon successful completion, the function returns 0; otherwise, it returns -1.
• ssize_t bfs_seek(bfs_file_t* tf, size_t offset, int whence);
Description: Sets the file-position indicator for file. The new position, measured in bytes from the beginning of the file, is obtained by adding offset
to the position specified by whence. The specified point is the beginning of the file for SEEK_SET, the current value of the file-position indicator for
SEEK_CUR, or end-of-file (EOF) for SEEK_END. Reads from never-written locations before the EOF are filled with zeros. Reads from locations
beyond the EOF return undefined values. The function by itself does not change the end-of-file location.
Return Value: Upon successful completion, the function returns the current file-position indicator; otherwise, it returns -1.
• ssize_t bfs_tell(bfs_file_t* file);
Description: This function obtains the current value of the file-position indicator for file.
Return Value: Upon successful completion, the function returns the current value of the file-position indicator for the file handle measured in
bytes from the beginning of the file. Otherwise, it returns -1.
• int bfs_stat(bfs_file_t* file, struct stat* buf)
Description: This function obtains information about file, and writes it to the area pointed to by buf. Currently, BaseFS only maintains the file size
attribute (i.e., st_size of struct stat), all other attributes are ignored.
Return Value: Upon successful completion, 0 is returned. Otherwise, -1 is returned.
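To make the semantics of these primitives concrete, the following minimal usage sketch (constructed from the signatures in Table 5; error handling, the out-of-band notification, and the owner field of bfs_interval_t are assumptions, not part of the documented API) shows one process producing data and attaching it, and another process querying the owner and reading directly from that owner's buffer.

#include <stddef.h>
#include "basefs.h"   /* hypothetical header declaring the Table 5 primitives */

/* Producer: write a range, then attach it so other processes can read
 * it directly from this process' burst buffer. The caller keeps the
 * handle open: bfs_close would discard the buffered data, so close only
 * after the consumer is done, or bfs_flush first. */
void producer(bfs_file_t *f, const void *data, size_t size)
{
    bfs_write(f, data, size);   /* buffered locally; visible to this process only */
    bfs_attach(f, 0, size);     /* publish exclusive ownership of bytes [0, size)  */
    /* ... notify the consumer out of band (e.g., an MPI message) ... */
}

/* Consumer: runs after the producer's notification. */
void consumer(bfs_file_t *f, void *buf, size_t size)
{
    bfs_interval_t *intervals = NULL;
    int n = 0;
    bfs_query(f, 0, size, &intervals, &n);   /* who owns [0, size)? */

    if (n > 0) {
        /* Assumed field: each interval records its owner's address. */
        bfs_read(f, buf, size, &intervals[0].owner);
    } else {
        bfs_read(f, buf, size, NULL);        /* fall back to the underlying PFS */
    }
}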
future reads, then a bfs_flush call is required before detaching.
• Querying: A client issues a bfs_query call to ask the server who owns the most up-to-date data of the given range, i.e., who performed the last attach to the same range. The server responds with a list of sub-ranges (since the queried range may cover multiple attach operations) along with their owners' information. An empty list is returned if no one has attached locations in the range yet.

The global server maintains a per-file interval tree (denoted the global interval tree) to keep track of the attached file ranges. Internally, BaseFS uses an augmented self-balancing binary search tree to implement this interval tree. Each interval (i.e., each node of the tree) has the form ⟨Os, Oe, Owner⟩, where Os and Oe are the start and end offsets of a file range, and Owner stores the information of the most recent client who attached the range. Note that the interval tree keeps only the most recent attach and does not store any history. A new interval is inserted upon each attach request. At insertion time, the server checks the existing intervals to decide whether they need to be split or deleted. An existing interval is split if it partially overlaps with the new interval and has a different owner; it is deleted if it is fully contained in the new interval. The server also merges intervals belonging to the same client with contiguous ranges. This reduces the number of intervals and accelerates future queries. When the server receives a detach request, it consults the interval tree and checks whether the same client still owns the entire range. It is possible that other clients have overwritten the same range and become the new owners; in that case, the detach is simply a no-op. Otherwise, the detach request succeeds (with possible splits), and the interval is removed from the tree.

Each client process also maintains a similar interval tree (denoted the local interval tree) for each file. It is used to keep track of locally written ranges and their mappings to the local burst buffer files. Specifically, each interval of the local interval tree has the form ⟨Os, Oe, Bs, Be, attached⟩, where Os and Oe indicate the range of a write to the targeted PFS file, Bs and Be indicate where the range is buffered in the local burst buffer file, and attached indicates whether the write has been attached or not. At each write (bfs_write), a new interval is inserted into the local interval tree. There is no split because all writes come from the same client. Contiguous intervals are merged as in the global interval tree. The bfs_attach primitive is used to attach the writes to one contiguous file range, while the bfs_attach_file primitive attaches all local writes to the file. Both calls pack and send all supplied information in a single RPC request. Moreover, both calls check the local interval tree to make sure that the same range is not attached twice and that the attached ranges were previously written by the local process.

As mentioned above, a client can respond to read requests from other clients after an attach call. This client-to-client data transfer can be performed efficiently using RDMA. For this to work, each client process needs to spawn a separate thread to listen for incoming bfs_read requests. This increases CPU usage but can significantly improve read performance, assuming RDMA is faster than disk I/O (i.e., reading directly from the underlying PFS).

5.2 CommitFS and SessionFS
With BaseFS, we can easily implement PosixFS, CommitFS, and SessionFS on top of it. Table 6 shows the APIs exposed by each along with their internal implementations using the BaseFS primitives. The primary difference in their implementations is the placement of the attach and query primitives. The stronger the model, the more frequently the attach and query primitives are needed. For example, to achieve POSIX consistency, an attach call has to be invoked by each write, and a query call has to be invoked by each read. In comparison, CommitFS only performs attach at commit time, though query is still needed ahead of every read operation.

As for SessionFS, a query is performed at session open time, and an attach is performed at session close time. Within a session, multiple write and read calls can be executed without any query or attach.

6 THE IMPACT OF CONSISTENCY MODELS ON I/O PERFORMANCE
This section studies the impact of consistency models on I/O performance. First, we evaluate the performance of commit consistency and session consistency using benchmarks that represent common HPC I/O patterns. Then we perform two case studies to further understand the performance disparity caused by different consistency models. The first case study is of the I/O behavior of the Scalable Checkpoint/Restart (SCR) library [26], while the second case study is of the I/O behavior of the training phase of distributed deep learning applications. We note that in all cases, we only consider I/O operations and do not perform any computation or communication.

We performed all experiments on the Catalyst system located at Lawrence Livermore National Laboratory. Catalyst is a Cray CS300 system, where each compute node consists of an Intel Xeon E5-2695 with two sockets and 24 cores in total, with 128GB of memory. The nodes are connected via IB QDR. The operating system is TOSS 3. Slurm is used to manage user jobs. The underlying PFS is an LLNL-customized version of Lustre, 2.10.6_2.chaos. Each compute node is equipped with an 800GB Intel 910 Series SSD, which serves as the burst buffer device. The peak sequential write bandwidth of the node-local SSD is 1GB/s, and its peak sequential read bandwidth is 2GB/s. We repeated all runs at least 10 times, and the average results are reported.

6.1 I/O of Scientific Applications
From a PFS perspective, within each file, there are three common parallel I/O access patterns: (1) Contiguous, where multiple processes access the file in a contiguous manner (normally without gaps); (2) Strided, where multiple processes access the file in an interleaved manner (often with a fixed stride); and (3) Random, where multiple processes access the file in a random manner. The random access pattern is commonly observed in deep learning applications, where multiple processes randomly load samples to feed the neural network. On the other hand, contiguous and strided
TABLE 6: CommitFS and SessionFS: the exposed APIs and their implementations. POSIX consistency is included for discussion.

File System   Storage Model         Key API         Implementation
PosixFS       POSIX consistency     open            bfs_open
                                    close           bfs_close
                                    write           bfs_write; bfs_attach
                                    read            bfs_query; bfs_read
CommitFS      Commit consistency    open            bfs_open
                                    close           bfs_close
                                    write           bfs_write
                                    read            bfs_query; bfs_read
                                    commit          bfs_attach_file
SessionFS     Session consistency   open            bfs_open
                                    close           bfs_close
                                    write           bfs_write
                                    read            bfs_read
                                    session_open    bfs_query_file
                                    session_close   bfs_attach_file
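To make the mapping in Table 6 concrete, the sketch below expresses CommitFS and SessionFS as thin wrappers over the BaseFS primitives; PosixFS would differ only by adding bfs_attach to every write and bfs_query to every read. Only the placement of the query and attach calls is taken from the table; the primitive parameter lists and the stub bodies are assumptions made for illustration, and the open/close wrappers are omitted for brevity.

/* Sketch: CommitFS and SessionFS as thin wrappers over BaseFS primitives,
 * following the mapping in Table 6. The prototypes and stub bodies are
 * illustrative assumptions; only the query/attach placement follows the table. */
#include <stdio.h>
#include <stddef.h>
#include <sys/types.h>

typedef struct { const char *path; } bfs_file_t;   /* assumed handle */

/* Assumed BaseFS primitives (stubs that just trace the call sequence). */
static void bfs_write(bfs_file_t *f, const void *buf, size_t n, off_t off)
{ (void)buf; printf("  bfs_write(%s, %zu bytes @ %lld)\n", f->path, n, (long long)off); }
static void bfs_read(bfs_file_t *f, void *buf, size_t n, off_t off)
{ (void)buf; printf("  bfs_read(%s, %zu bytes @ %lld)\n", f->path, n, (long long)off); }
static void bfs_query(bfs_file_t *f, size_t n, off_t off)
{ printf("  bfs_query(%s, [%lld, %lld))\n", f->path, (long long)off, (long long)(off + (off_t)n)); }
static void bfs_query_file(bfs_file_t *f)  { printf("  bfs_query_file(%s)\n", f->path); }
static void bfs_attach_file(bfs_file_t *f) { printf("  bfs_attach_file(%s)\n", f->path); }

/* CommitFS (commit consistency): attach only at commit, query before each read. */
void commitfs_write(bfs_file_t *f, const void *b, size_t n, off_t o) { bfs_write(f, b, n, o); }
void commitfs_read (bfs_file_t *f, void *b, size_t n, off_t o) { bfs_query(f, n, o); bfs_read(f, b, n, o); }
void commitfs_commit(bfs_file_t *f) { bfs_attach_file(f); }

/* SessionFS (session consistency): query once at session open, attach once at close. */
void sessionfs_session_open (bfs_file_t *f) { bfs_query_file(f); }
void sessionfs_session_close(bfs_file_t *f) { bfs_attach_file(f); }
void sessionfs_write(bfs_file_t *f, const void *b, size_t n, off_t o) { bfs_write(f, b, n, o); }
void sessionfs_read (bfs_file_t *f, void *b, size_t n, off_t o) { bfs_read(f, b, n, o); }

int main(void) {
    bfs_file_t f = { "checkpoint.dat" };
    char buf[8];

    printf("CommitFS: every read pays a query\n");
    commitfs_write(&f, buf, sizeof buf, 0);
    commitfs_commit(&f);
    commitfs_read(&f, buf, sizeof buf, 0);
    commitfs_read(&f, buf, sizeof buf, 8);

    printf("SessionFS: one query per session\n");
    sessionfs_session_open(&f);
    sessionfs_read(&f, buf, sizeof buf, 0);
    sessionfs_read(&f, buf, sizeof buf, 8);
    sessionfs_session_close(&f);
    return 0;
}

Running the sketch prints the two call sequences side by side, which makes the difference in query frequency between the two models easy to see.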
access patterns are commonly used in parallel scientific applications for performing logging, checkpointing, and outputting snapshots.

We constructed synthetic workloads to simulate common HPC I/O scenarios. Each workload consists of a write phase and/or a read phase, and the read phase begins only after the write phase is complete. Additionally, all processes operate on a single shared file, resulting in an N-to-1 access pattern, where N is the total number of processes. The access pattern within the shared file for each phase (contiguous, strided, or random) can be determined at runtime. The workload can be run under either commit consistency or session consistency using the corresponding APIs provided by CommitFS or SessionFS. The other aspects of the I/O behavior are controlled by the set of parameters summarized in Table 7.

TABLE 7: Parameters of the synthetic I/O workloads.

nw   Number of writing nodes. All processes of a writing node perform only writes.
nr   Number of reading nodes. All processes of a reading node perform only reads.
n    Total number of nodes; n = nr + nw.
p    Number of processes per node. Each node runs an equal number of processes.
mw   Number of writes performed by each process. Each writing process performs the same number of writes.
mr   Number of reads performed by each process. Each reading process performs the same number of reads.
s    Access size of each I/O operation. All I/O operations have the same access size.

We used the four configurations shown in Table 8 to conduct the experiments. Each was run on up to 16 nodes with 12 processes per node. In our experiments, write nodes and read nodes did not overlap, so nw and nr always added up to n. In all runs, we set mw = mr = 10. Additionally, to understand the impact of a consistency model on scenarios with different access sizes, all experiments were run with two different access sizes: 8KB for small accesses and 8MB for large accesses. The file system was purged before the start of each run.

TABLE 8: Configurations for evaluating the impact of consistency models on common HPC I/O scenarios.

Code name   Write phase   Read phase   nw    nr
CN-W        Contiguous    N/A          n     0
SN-W        Strided       N/A          n     0
CC-R        Contiguous    Contiguous   n/2   n/2
CS-R        Contiguous    Strided      n/2   n/2

6.1.1 Write-only workloads
The first two configurations, CN-W and SN-W, are write-only and differ only in how writes are performed by the collaborating processes. Figure 3 shows their write bandwidths. With the use of node-local SSDs as burst buffers, all writes are buffered in process-private cache files, which essentially converts the N-to-1 writes (contiguous or strided) into N-to-N contiguous writes. Therefore, for both consistency models, the performance of CN-W and SN-W was about the same.

Since the file system is empty when the writes start, session_open becomes a no-op and session_close performs the same task as commit; thus session consistency and commit consistency achieved similar bandwidths. Finally, small writes yielded worse performance because the small access sizes cannot saturate the bandwidth. When performing large writes, both access patterns were able to achieve the peak write bandwidth, regardless of the consistency model. This is because the overhead required by the consistency model is insignificant compared to the time needed to write to the SSD.

Fig. 3: Write bandwidth of CN-W and SN-W with 8MB and 8KB access sizes.

Fig. 4: Read bandwidth of CC-R and CS-R with 8MB and 8KB access sizes.

6.1.2 Read-after-write workloads
The last two configurations, CC-R and CS-R, demonstrate the impact of consistency models on the read bandwidth of workloads with different access patterns. In these configurations, half of the nodes are used for writing and the other half for reading the data back. In CC-R, writes and reads are
done contiguously, so each read node reads from only one write node. In contrast, in CS-R, reads are strided, which requires each read node to receive data from multiple write nodes and may cause contention.

The results in Figure 4 demonstrate that CC-R outperforms CS-R under both consistency models and access sizes. For large reads (Figure 4a), the impact of consistency models on the bandwidth is negligible. However, for small reads (Figure 4b), session consistency achieved better performance and scalability than commit consistency. This is because commit consistency issues an RPC query every time it performs a read, and when the I/O of a read completes quickly, the software overhead becomes the I/O bottleneck, especially when many read requests are performed concurrently. In contrast, session consistency queries only once at session open time, and the overhead is amortized over a number of reads. Lastly, we observed a high variance in the bandwidth of session consistency. To verify whether this was caused by network or system congestion, we repeated the same experiments multiple times at different times of the day and found consistent results. A further investigation (where we used a single node and excluded the communication time) showed that the SSD itself has high variance in small read performance, which we believe is due to normal wear and tear, as the SSDs on Catalyst are rather old. We confirmed this hypothesis by conducting the same experiments on a newer machine (Expanse at the San Diego Supercomputer Center), which showed very little variance.

6.2 Case Study: I/O of Scalable Checkpoint/Restart
In this subsection, we study the I/O behavior of SCR [26] for checkpointing and restarting HACC-IO [27] using an emulator. SCR is a scalable multi-level checkpointing system that supports multiple types of checkpoints with varying costs and levels of resiliency. The slowest but most resilient level writes to the system-wide PFS, which can withstand an entire system failure. Faster checkpointing for the most common failure modes involves using node-local storage, such as RAM and SSD, and implementing cross-node redundancy schemes.

In our emulation, we consider the most common case where SCR uses node-local storage only. We use the “Partner” redundancy scheme, where SCR writes checkpoints to local storage and also copies each checkpoint to storage local to a partner process from another failure group. This scheme requires twice the storage space, but it can withstand failures of multiple devices, as long as a device and the corresponding partner device that holds the copy do not fail simultaneously. To be specific, in our experiment, during the checkpoint phase, SCR buffers the checkpoint data in memory (local and partner) and then flushes it to the SSDs (local and partner) using a file-per-process access pattern. At restart time, SCR reads directly from the memory buffer, assuming the checkpoint data is still accessible.

The client of SCR is HACC-IO, which produces the actual checkpoint data. At each checkpoint step, HACC-IO writes out 9 arrays of the same length, each containing all particle values of a different physical variable. The total data size is determined by the number of particles, which we set to 10 million in our experiment. Furthermore, the experiment was run with one spare node, and we assumed a single-node failure. When running with n nodes, during the checkpoint phase, n − 1 nodes wrote to the node-local SSD, with a copy buffered in local memory. During the restart phase, n − 2 nodes read directly from the local memory buffer, and the spare node received the checkpoint through MPI from the partner of the failed node.

We show the read and write bandwidths of the checkpoint and restart phases in Figure 5. To better understand the read bandwidth, the results do not include the communication time for the spare node to get the checkpoint. Similar to
the large-write experiments discussed earlier, SCR scaled well for checkpointing and achieved the peak bandwidth at all scales under both consistency models. In other words, the consistency model does not have a big impact on SCR's checkpointing bandwidth. However, for restarting, session consistency scaled better than commit consistency, mainly due to the low query frequency. In the restart phase, the reads were satisfied through memory buffers, and the overall read bandwidth scaled linearly with the number of nodes, which made the read time per node constant. However, under commit consistency, when more nodes were used, more query requests (one per read) were sent simultaneously to the global server, which then became the bottleneck and reduced scalability.

Fig. 5: HACC-IO with SCR.

6.3 Case Study: I/O of Distributed Deep Learning
Deep learning has thrived in recent years. However, as data sizes and system scales increase, traditional methods of feeding neural networks during training struggle to keep up with the required computation. To accelerate data ingestion rates, various methods [28], [29], [30], [31] have been proposed, such as data sharding, prestaging, and in-memory caching.

Here, we simulate the I/O of the “Preloaded” strategy that was proposed in [30] and implemented in LBANN [32]. Our simulation assigns to each process a non-overlapping subset of the training data. Before the training begins, each process loads its portion of the data into its node-local SSD (hence the term Preloaded). Next, at the beginning of each epoch, each process is assigned a random subset of samples. The samples are evenly distributed to all processes so that each process performs an equal amount of work. During each epoch, each process reads the assigned samples, either locally or from other processes using MPI. It is worth noting that our benchmark is a simplified version of the Preloaded strategy that differs in two major ways: (1) we store data in node-local SSDs instead of memory, which is in any case necessary for large datasets that do not fit in memory; and (2) we do not perform aggregations when sending samples to the same process, which places additional stress on the file system.

The average per-epoch read bandwidth is presented in Figure 6. We conducted both strong scaling and weak scaling experiments, with a mini-batch size of 1024 for strong scaling, and each process working on 32 samples per iteration for weak scaling. The sample size was set to 116KB, which is the same as the average image size of ImageNet-1K [33]. The number of processes per node was set to 4 (in real DL training, this number is usually set to match the number of GPUs per node). The results are very similar to those of the small-read experiments shown in Figure 4b; only the bandwidth is higher here thanks to the slightly larger reads (116KB vs. 8KB). In both strong scaling and weak scaling, session consistency outperformed commit consistency in terms of scalability and bandwidth, due to less time spent on queries. Additionally, the increasing gap in bandwidth between the two consistency models as the number of nodes grows further emphasizes the significance of choosing an appropriate consistency model to achieve optimal performance and scalability.

Fig. 6: Random read bandwidth of DL application.

6.4 Key Takeaways
Here, we present the key findings derived from our experiments.
• When performing large writes and reads (e.g., over one megabyte per I/O operation), consistency models do not have a big impact on the I/O bandwidth. This is because the overhead of maintaining the consistency model (weaker or stronger) is insignificant compared to the time needed to access the I/O device (a back-of-envelope sketch after this list makes this ratio concrete).
• When performing small writes and reads (e.g., ranging from a few bytes to a few kilobytes), the adoption of a stronger consistency model can noticeably hinder performance. This is attributed to the faster completion of each I/O operation, which makes the overhead of maintaining strong consistency a bottleneck. Moreover, the traffic required to maintain the consistency model can lead to contention, particularly when there is a high volume of small I/O operations.
• When I/O operations are directly fulfilled by memory or fast devices like persistent memory, the choice of consistency model can significantly impact performance. This is due to the same reason mentioned earlier: the faster completion of I/O operations magnifies the overhead associated with maintaining strong consistency models.
• For small random accesses (e.g., random reads of deep learning applications), weaker consistency models demonstrate higher I/O bandwidth and improved scalability compared to stronger models. Notably, this improvement is significant even at smaller scales, indicating a promising direction for optimizing the I/O performance of deep learning applications.
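The back-of-envelope sketch referenced in the first takeaway is below. The write bandwidth is the Catalyst node-local SSD figure reported earlier in this section; the per-operation RPC cost is an assumed value chosen purely for illustration.

/* Back-of-envelope model of when consistency-maintenance overhead matters.
 * The device bandwidth is the Catalyst SSD figure from Section 6; the
 * per-operation RPC overhead is an ASSUMED value used only for illustration. */
#include <stdio.h>

int main(void) {
    const double write_bw = 1.0e9;       /* node-local SSD peak write: 1 GB/s (from the text) */
    const double rpc_overhead = 50e-6;   /* assumed 50 microseconds per query/attach round trip */
    const double sizes[] = { 8e3, 8e6 }; /* 8KB and 8MB accesses, as in the experiments */

    for (int i = 0; i < 2; i++) {
        double io_time = sizes[i] / write_bw;  /* time spent on the device itself */
        double overhead_share = rpc_overhead / (io_time + rpc_overhead);
        printf("access %8.0f bytes: device time %.3f ms, RPC share of total %.1f%%\n",
               sizes[i], io_time * 1e3, overhead_share * 100.0);
    }
    return 0;
}

With these inputs, the query/attach round trip is well under one percent of the total time for an 8MB write but dominates an 8KB write, which matches the measured trends summarized above.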
7 CONCLUSION AND FUTURE WORK
This work explored consistency models from the perspective of parallel file systems. We provided a high-level discussion of important aspects of storage consistency models, including their design choices and their comparison with memory models. Based on the commonalities of existing storage models, we proposed a unified and formal framework for specifying properly-synchronized SCNF models, which guarantee sequential consistency (or POSIX consistency) for programs that are properly synchronized. Additionally, we proposed a flexible design for implementing properly-synchronized SCNF models that isolates the consistency model from other file system components, making it easy to understand the impact of different consistency models on I/O performance.

We also presented a detailed performance comparison between commit consistency and session consistency. Our results indicate that session consistency is better suited for most HPC I/O workloads in terms of performance and scalability. Although this comes at the cost of slightly reduced programmability, the performance gain is potentially huge, especially for small reads such as those in deep learning applications. Overall, this work contributes to a better understanding of consistency models in parallel file systems and their impact on I/O performance.

In our future work, we will implement different relaxed storage models in existing PFSs to evaluate their performance impacts in a real-world setting. Additionally, we plan to study the consistency requirements of metadata operations for HPC applications and evaluate their performance implications.

ACKNOWLEDGMENTS
This work was supported by NSF SHF Collaborative grant 1763540 and was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-JRNL-849174-DRAFT. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under the DOE Early Career Research Program.

REFERENCES
[1] T. Patel, S. Byna, G. K. Lockwood, and D. Tiwari, “Revisiting I/O Behavior in Large-Scale Storage Systems: the Expected and the Unexpected,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–13.
[2] A. K. Paul, O. Faaland, A. Moody, E. Gonsiorowski, K. Mohror, and A. R. Butt, “Understanding HPC Application I/O Behavior Using System Level Statistics,” in 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC). IEEE, 2020, pp. 202–211.
[3] N. Dryden, R. Böhringer, T. Ben-Nun, and T. Hoefler, “Clairvoyant Prefetching for Distributed Machine Learning I/O,” arXiv preprint arXiv:2101.08734, 2021.
[4] F. Di Natale, H. Bhatia, T. S. Carpenter, C. Neale, S. Kokkila-Schumacher, T. Oppelstrup, L. Stanton, X. Zhang, S. Sundram, T. R. Scogland et al., “A Massively Parallel Infrastructure for Adaptive Multiscale Simulations: Modeling RAS Initiation Pathway for Cancer,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–16.
[5] “IEEE Standard for Information Technology–Portable Operating System Interface (POSIX(TM)) Base Specifications, Issue 7,” IEEE Std 1003.1-2017 (Revision of IEEE Std 1003.1-2008), pp. 1–3951, 2018.
[6] P. Braam, “The Lustre Storage Architecture,” arXiv preprint arXiv:1903.01955, 2019.
[7] F. B. Schmuck and R. L. Haskin, “GPFS: A Shared-Disk File System for Large Computing Clusters,” in FAST, vol. 2, no. 19, 2002.
[8] F. Herold, S. Breuner, and J. Heichler, “An Introduction to BeeGFS,” 2014. [Online]. Available: https://www.beegfs.io/docs/whitepapers/Introduction to BeeGFS by ThinkParQ.pdf
[9] T. Wang, K. Mohror, A. Moody, W. Yu, and K. Sato, “BurstFS: A Distributed Burst Buffer File System for Scientific Applications,” in The International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015.
[10] Lawrence Livermore National Laboratory, “UnifyFS: A File System for Burst Buffers,” https://github.com/LLNL/UnifyFS, Mar. 2021.
[11] A. Miranda, R. Nou, and T. Cortes, “echofs: A Scheduler-Guided Temporary Filesystem to Leverage Node-local NVMs,” in 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 2018, pp. 225–228.
[12] O. Tatebe, S. Moriwake, and Y. Oyama, “Gfarm/BB—Gfarm File System for Node-Local Burst Buffer,” Journal of Computer Science and Technology, vol. 35, no. 1, pp. 61–71, 2020.
[13] S. Oral, S. S. Vazhkudai, F. Wang, C. Zimmer, C. Brumgard, J. Hanley, G. Markomanolis, R. Miller, D. Leverman, S. Atchley et al., “End-to-end I/O Portfolio for the Summit Supercomputing Ecosystem,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–14.
[14] L. Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs,” IEEE Transactions on Computers, vol. 28, no. 9, pp. 690–691, 1979.
[15] P. Sewell, S. Sarkar, S. Owens, F. Z. Nardelli, and M. O. Myreen, “x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors,” Communications of the ACM, vol. 53, no. 7, pp. 89–97, 2010.
[16] M. Dubois, C. Scheurich, and F. Briggs, “Memory Access Buffering in Multiprocessors,” ACM SIGARCH Computer Architecture News, vol. 14, no. 2, pp. 434–442, 1986.
[17] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, “Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors,” ACM SIGARCH Computer Architecture News, vol. 18, no. 2SI, pp. 15–26, 1990.
[18] C. Wang, K. Mohror, and M. Snir, “File System Semantics Requirements of HPC Applications,” in Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2020, pp. 19–30.
[19] IBM, “Burst Buffer Shared Checkpoint File System,” Apr. 2020. [Online]. Available: https://github.com/IBM/CAST/tree/master/bscfs
[20] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck, “RFC3530: Network File System (NFS) Version 4 Protocol,” 2003.
[21] P. Corbett, D. Feitelson, S. Fineberg, Y. Hsu, B. Nitzberg, J.-P. Prost, M. Snir, B. Traversat, and P. Wong, “Overview of the MPI-IO Parallel I/O Interface,” in IPPS’95 Workshop on Input/Output in Parallel and Distributed Systems, 1995, pp. 1–15.
[22] “MPI: A Message-Passing Interface Standard Version 4.0,” https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf, 2021.
[23] S. V. Adve, “Designing Memory Consistency Models for Shared-Memory Multiprocessors,” Ph.D. dissertation, University of Wisconsin, Madison, 1993.
[24] S. V. Adve and M. D. Hill, “Weak Ordering - a New Definition,” ACM SIGARCH Computer Architecture News, vol. 18, no. 2SI, pp. 2–14, 1990.
[25] J. Manson, W. Pugh, and S. V. Adve, “The Java Memory Model,” ACM SIGPLAN Notices, vol. 40, no. 1, pp. 378–391, 2005.
[26] A. Moody, G. Bronevetsky, K. Mohror, and B. R. De Supinski, “Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System,” in SC’10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2010, pp. 1–11.
[27] “HACC IO Kernel from the CORAL Benchmark Codes,” https://asc.llnl.gov/coral-benchmarks#hacc, Jan. 2018.
[28] Y. Oyama, N. Maruyama, N. Dryden, E. McCarthy, P. Harrington, J. Balewski, S. Matsuoka, P. Nugent, and B. Van Essen, “The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs With Hybrid Parallelism,” IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 7, pp. 1641–1652, 2020.
[29] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,” arXiv preprint arXiv:1706.02677, 2017.
[30] S. A. Jacobs, B. Van Essen, D. Hysom, J.-S. Yeom, T. Moon, R. Anirudh, J. J. Thiagaranjan, S. Liu, P.-T. Bremer, J. Gaffney et al., “Parallelizing Training of Deep Generative Models on Massive Scientific Datasets,” in 2019 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2019, pp. 1–10.
[31] S. A. Jacobs, N. Dryden, R. Pearce, and B. Van Essen, “Towards Scalable Parallel Training of Deep Neural Networks,” in Proceedings of the Machine Learning on HPC Environments, 2017, pp. 1–9.
[32] B. Van Essen, H. Kim, R. Pearce, K. Boakye, and B. Chen, “LBANN: Livermore Big Artificial Neural Network HPC Toolkit,” in Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, ser. MLHPC ’15. New York, NY, USA: ACM, 2015, pp. 5:1–5:6. [Online]. Available: http://doi.acm.org/10.1145/2834892.2834897
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.

Chen Wang is Fernbach Postdoctoral Fellow at Lawrence Livermore National Laboratory. He is currently working at the Center for Applied Scientific Computing at LLNL. He received his Ph.D. in computer science from the University of Illinois Urbana-Champaign. His research interests include parallel computing, I/O and communication tracing, and parallel storage systems.

Kathryn Mohror is a computer scientist in the Parallel Systems Group in the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL). Kathryn serves as the Deputy Director for the Laboratory Directed Research & Development (LDRD) program at LLNL, Lead for the NNSA Software Technologies Portfolio for the U.S. Exascale Computing Project (ECP), and as the ASCR Point of Contact for Computer Science at LLNL. Kathryn's research on high-end computing systems is currently focused on I/O for extreme scale systems. Her other research interests include scalable performance analysis and tuning, fault tolerance, and parallel programming paradigms.

Marc Snir is Michael Faiman Emeritus Professor in the Department of Computer Science at the University of Illinois Urbana-Champaign. He was Director of the Mathematics and Computer Science Division at the Argonne National Laboratory from 2011 to 2016 and head of the Computer Science Department at Illinois from 2001 to 2007. Until 2001 he was a senior manager at the IBM T. J. Watson Research Center, where he led the Scalable Parallel Systems research group that was responsible for major contributions to the IBM scalable parallel system and to the IBM Blue Gene system.
