Formal Definitions and Performance Comparison of Consistency Models For Parallel File Systems
Abstract—The semantics of HPC storage systems are defined by the consistency models by which they abide. Storage consistency
models have been less studied than their counterparts in memory systems, with the exception of the POSIX standard and its strict
consistency model. The use of POSIX consistency imposes a performance penalty that becomes more significant as the scale of
parallel file systems increases and the access time to storage devices, such as node-local solid-state storage devices, decreases. While
some efforts have been made to adopt relaxed storage consistency models, these models are often defined informally and
ambiguously as by-products of a particular implementation. In this work, we establish a connection between memory consistency
models and storage consistency models and revisit the key design choices of storage consistency models from a high-level
perspective. Further, we propose a formal and unified framework for defining storage consistency models and a layered
implementation that can be used to easily evaluate their relative performance for different I/O workloads. Finally, we conduct a
comprehensive performance comparison of two relaxed consistency models on a range of commonly-seen parallel I/O workloads, such
as checkpoint/restart of scientific applications and random reads of deep learning applications. We demonstrate that for certain I/O
scenarios, a weaker consistency model can significantly improve the I/O performance. For instance, for the small random reads
typically found in deep learning applications, session consistency achieved a 5x improvement in I/O bandwidth compared to commit
consistency, even at small scales.
Index Terms—Consistency model, storage consistency, parallel file system, parallel I/O
1 INTRODUCTION
operations, y will not be guaranteed to return 100 because processors are free to reorder operations. However, if L12 and L21 are identified by programmers as synchronizations, then L22 is guaranteed to return the latest value of x due to the ordering imposed by the synchronizations.

TABLE 2: A weak ordering example. All variables are initially zero.

Process 1:                    Process 2:
L11: x = 100;                 L21: while(!flag) {};
L12: flag = 1;                L22: y = x;

2.2.2 Release Consistency
Many synchronization operations occur in pairs. Release consistency [17] utilizes this information by explicitly distinguishing them as release and acquire operations, with the help from programmers. The release operation instructs the processor to make all previous memory accesses globally visible before the release completes, and the acquire operation instructs the processor not to start subsequent memory accesses before the acquire completes. In the example of Table 2, L12 is a release operation and L21 is an acquire operation. Release consistency is a further relaxation of weak ordering. It allows systems to have different implementations for release and acquire, which leads to better performance at the cost of the increased burden on programmers.
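The release/acquire pairing in Table 2 maps directly onto C++ atomics. A minimal sketch is shown below; the variable names follow the table, everything else (thread setup, assertion) is illustrative only.

#include <atomic>
#include <cassert>
#include <thread>

int x = 0;                       // ordinary shared data
std::atomic<int> flag{0};        // synchronization variable

void process1() {
    x = 100;                                       // L11: ordinary store
    flag.store(1, std::memory_order_release);      // L12: release - prior writes become visible
}

void process2() {
    while (flag.load(std::memory_order_acquire) == 0) {}   // L21: acquire - spin until released
    int y = x;                                     // L22: guaranteed to observe x == 100
    assert(y == 100);
}

int main() {
    std::thread t1(process1), t2(process2);
    t1.join();
    t2.join();
}

The acquire load that reads the released store establishes a happens-before edge, which is exactly the ordering the table relies on.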
2.2.3 Entry Consistency
A major issue of weak ordering and release consistency is that their synchronization operations impose order on memory operations even if they do not conflict, which may add unnecessary overhead. Consider the example in Table 3: to make sure y in L22 returns the store to x in L12, under weak ordering or release consistency, L13 and L21 need to be identified as synchronizations. However, this also prohibits reordering L11 and L13, i.e., L11 must complete before L13, which is unnecessary if no other process will ever access w. Entry consistency addresses this issue by requiring each ordinary shared data item to be associated with a synchronization variable. When an acquire is done on a synchronization variable, only those data guarded by that synchronization variable are made consistent. For instance, in the case of the example in Table 3, we can associate w and x with two different synchronization variables, thus allowing L11 to bypass L12 and L13.

TABLE 3: An entry consistency example. All variables are initially zero.

Process 1:                    Process 2:
L11: w = 100;                 L21: while(!flag) {};
L12: x = 100;                 L22: y = x;
L13: flag = 1;
2.3 Relaxed Storage Models
The requirements of POSIX consistency essentially impose sequential consistency. The fundamental problem behind the performance issues stemming from POSIX consistency is that PFSs are ignorant of the application's synchronization logic and the order of I/O operations of different processes. PFSs must make worst-case assumptions and serialize all potentially conflicting I/O operations to guarantee POSIX consistency. Alternatively, programmers can provide information on program synchronizations of conflicting I/O operations to the PFS. With this extra information, PFSs can adopt a weaker consistency model while guaranteeing the same outcome as POSIX consistency. Wang et al. [18] have studied many such PFSs and their consistency models. Here, we briefly discuss the most commonly used models.

2.3.1 Commit Consistency
Commit consistency is a relaxed consistency model commonly used by recent BB PFSs such as BSCFS [19], UnifyFS [10], and SymphonyFS [13]. In commit consistency, "commit" operations are explicitly executed by processes. The commit operation conveys synchronization information. I/O updates performed by a process to a file before a commit become globally visible upon return of the commit operation. To maintain portability, PFSs adopting commit consistency may use an existing POSIX call to indicate a commit. For example, in UnifyFS [10], a commit operation is triggered by an fsync call, which applies to all updates performed by a process on a file since the previous commit. Note that finer commit granularity (e.g., committing byte ranges) is also possible, but may add additional overhead if used in a superfluous way.
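The producer/consumer pattern under commit consistency can be sketched as follows. The fsync-as-commit convention follows the UnifyFS description above; the MPI barrier that orders the writer's commit before the reader's read and the file path are assumptions for illustration, not part of any particular system's API.

#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const char* path = "/unifyfs/shared.dat";   // hypothetical path on the BB file system
    char buf[4096] = {0};

    if (rank == 0) {                             // writer
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        write(fd, buf, sizeof(buf));             // buffered locally; not yet globally visible
        fsync(fd);                               // commit: updates become globally visible on return
        close(fd);
    }
    MPI_Barrier(MPI_COMM_WORLD);                 // application-level ordering: commit before read
    if (rank == 1) {                             // reader
        int fd = open(path, O_RDONLY);
        read(fd, buf, sizeof(buf));              // observes the committed data
        close(fd);
    }
    MPI_Finalize();
}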
2.3.2 Session Consistency
Commit consistency guarantees that all local writes that precede the commit operation become globally visible after the commit operation. However, in many cases, data written is rarely read back by the same application, and even when this happens, usually only a subset of processes perform the reads. Thus, global visibility is not necessary. Session consistency (also known as close-to-open consistency) addresses this issue by defining a pair of synchronization operations, namely, session_close and session_open. Session consistency guarantees that writes by a process become visible to another process (not all processes) when the modified file is closed by the writing process and subsequently opened by the reading process, with the session_close happening before the session_open. The idea of session consistency is very similar to that of release consistency for memory models.

Note that we name the two operations session_open and session_close, but most existing systems adopting session consistency, such as NFS [20], do not provide separate session_open and session_close APIs. Rather, they are implied by POSIX open/close (or fopen/fclose) calls, which have the additional effect of applying all updates to a file.
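For contrast with the commit sketch above, the close-to-open guarantee can be sketched as follows. Here session_close/session_open are assumed to be implied by POSIX close/open, the inter-process ordering again comes from an application-level barrier, and the path is hypothetical; within the session, any number of writes or reads need no per-operation synchronization.

#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const char* path = "/bb/session.dat";        // hypothetical burst-buffer path
    char buf[4096] = {0};

    if (rank == 0) {                             // writing process
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        write(fd, buf, sizeof(buf));             // no fsync needed under session consistency
        close(fd);                               // implies session_close: writes are published
    }
    MPI_Barrier(MPI_COMM_WORLD);                 // orders the close before the open
    if (rank == 1) {                             // reading process
        int fd = open(path, O_RDONLY);           // implies session_open: fetches the writer's updates
        read(fd, buf, sizeof(buf));              // sees rank 0's data
        close(fd);
    }
    MPI_Finalize();
}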
2.3.3 MPI-IO Consistency
MPI-IO [21] is a part of the MPI standard that defines both communications (message passing) and I/O operations. As the latest standard [22] states, MPI-IO provides three levels of consistency: sequential consistency among all accesses using a single file handle, sequential consistency among all accesses using file handles created from a single collective open with atomic mode enabled, and user-imposed consistency among accesses other than the above.

The first two cases are the most common, and sequential consistency is guaranteed without extra synchronizations. In the last case, sequential consistency can be achieved by using a sync-barrier-sync construct that imposes order on the conflicting I/O accesses. Here, sync is one of MPI_File_open, MPI_File_close, or MPI_File_sync, which takes in the file handle and flushes the data (writer) or retrieves the latest data (reader). The barrier provides a mechanism that imposes order between the two syncs. In most cases, this is achieved using MPI calls. For example, the barrier can be one of the collective communication calls, such as MPI_Barrier and MPI_Allgather, or a pair of point-to-point calls, such as MPI_Send plus MPI_Recv. However, the barrier is not limited to MPI calls; it can use any mechanism that properly orders the two sync calls.

Even though MPI-IO has information on both I/O and MPI communication, it cannot assume the synchronization information (i.e., the barrier) is always available to the system, as the application may not use MPI calls. MPI-IO consistency is similar to session consistency, but additional optimizations are possible if ordering is imposed through MPI calls, as those are visible to the MPI library (which includes MPI-IO).
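A minimal sketch of the sync-barrier-sync construct described above is given below, using MPI_Barrier as the ordering mechanism; the file name and data layout are illustrative. Note that MPI_File_sync is collective over the communicator used to open the file, so every rank participates in both syncs.

#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.out",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    int value = 42;
    if (rank == 0) {
        MPI_File_write_at(fh, 0, &value, 1, MPI_INT, MPI_STATUS_IGNORE);  // conflicting write
    }

    // sync-barrier-sync: orders the writer's flush before the reader's refresh.
    MPI_File_sync(fh);                   // first sync: flush written data
    MPI_Barrier(MPI_COMM_WORLD);         // barrier: imposes order between the two syncs
    MPI_File_sync(fh);                   // second sync: retrieve the latest data

    if (rank == 1) {
        MPI_File_read_at(fh, 0, &value, 1, MPI_INT, MPI_STATUS_IGNORE);   // now sees 42
    }

    MPI_File_close(&fh);
    MPI_Finalize();
}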
3 MEMORY MODELS VS. STORAGE MODELS
In this section, we first investigate why relaxed storage consistency models have not gained enough attention and widespread adoption. Next, we analyze the key design choices for consistency models and compare the different considerations between memory systems and storage systems. Finally, we discuss the primary commonality of existing relaxed storage models and introduce the key concepts that will serve as the foundation for our formal definition of these models.

3.1 Why Relaxed Storage Models are not Widely Adopted
3.1.1 Programming Hierarchy
The presence of compilers in the memory programming hierarchy (Figure 1(a)) has been an important factor in the adoption of relaxed memory models. Compilers can hide complexity and provide portability, which allows programmers to target a single memory model specified by the high-level programming language (e.g., C++ and Java) without knowledge of the underlying consistency models provided by the CPUs. This way, a suitable consistency model can be selected for the given hardware, without worrying about programmability. For example, C++ allows specifying a different consistency model for each atomic operation, but the semantics of these models are specified by the C++ standard and are unrelated to the underlying hardware.
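As a small illustration of this per-operation choice (the example itself is ours, not from the standard), the snippet below mixes orderings on the same atomic; the compiler maps each to whatever fences the target CPU (x86-TSO, Power, etc.) requires.

#include <atomic>

std::atomic<long> hits{0};

void record() {
    // Statistics counter: only atomicity is needed, so the relaxed model suffices.
    hits.fetch_add(1, std::memory_order_relaxed);
}

long snapshot() {
    // Reporting path: request the default, strongest (sequentially consistent) ordering.
    return hits.load(std::memory_order_seq_cst);
}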
In contrast, there is no corresponding "compiler" layer in the storage programming hierarchy and no low-level hardware-supported consistency model to map onto: consistency protocols are implemented in software. To achieve programmability and portability, most storage systems choose to implement the same standard, POSIX. Most local file systems (e.g., ext3, ext4, and xfs) and parallel file systems (e.g., Lustre [6] and GPFS [7]) are POSIX-compliant.

Fig. 1: Memory vs. storage programming. The lack of automated support (e.g., a compiler layer) in the storage programming hierarchy makes it harder to adopt different consistency models for different hardware.

3.1.2 Software Overhead
The POSIX consistency semantics prohibit many optimizations. This disadvantage is not so apparent in a single-node system, in which I/O operations are serialized. But maintaining POSIX consistency in an HPC environment can be more costly, as PFSs require distributed locking mechanisms running reliably at large scale to enforce it. Nevertheless, in most scenarios, the software overhead incurred is minor compared to the slow I/O performance of HDDs. Consequently, less attention has been given to alternative consistency models. However, this is starting to change due to two reasons: the rapid increase in the scale of HPC systems, and the emergence of new, faster storage technologies, such as SSDs. The former directly increases the software overhead required to enforce the same consistency requirements. The latter makes the overhead more significant since I/O operations complete much faster.

3.2 Design Considerations
Here we describe several important design considerations of a consistency model and compare memory models (from the perspective of high-level programming languages) with storage models (from the perspective of parallel file systems) for each consideration.

3.2.1 Synchronization
Synchronization is critical to a consistency model. It is used to enforce order between potentially conflicting accesses. Synchronizations can be performed by explicitly invoking synchronization operations, which is common in both memory models and storage models. Alternatively, synchronizations can be specified in a declarative manner. For example, high-level languages often provide keywords (e.g., atomic in C++ and volatile in Java) that can modify ordinary objects to impose extra ordering restrictions on relevant accesses. Such features, however, are less common in storage models.
future reads, then a bfs_flush call is required before detaching.
• Querying: A client issues a bfs_query call to ask the server who owns the most up-to-date data of the given range, i.e., who performed the last attach to the same range. The server will respond with a list of sub-ranges (since the queried range may cover multiple attach operations) along with their owners' information. An empty list will be returned if no one has attached locations in the range yet.

The global server maintains a per-file interval tree (noted as the global interval tree) to keep track of the attached file ranges. Internally, BaseFS uses an augmented self-balancing binary search tree to implement this interval tree. Each interval (or each node of the tree) has the form of ⟨Os, Oe, Owner⟩, where Os and Oe are the start and end offset of a file range, and Owner stores the information of the most recent client who attached the range. Note that the interval tree keeps only the most recent attach and does not store any histories. A new interval is inserted upon each attach request. At insertion time, the server checks the existing intervals to decide if they need to be split or deleted. An existing interval is split if it partially overlaps with the new interval and has a different owner; it is deleted if it is fully contained in the new interval. The server also merges intervals belonging to the same client with contiguous ranges. This reduces the number of intervals and accelerates future queries. When the server receives a detach request, it consults the interval tree and checks whether the same client still owns the entire range. It is possible that other clients have overwritten the same range and become the new owners. In that case, the detach will simply be a no-op. Otherwise, the detach request succeeds (with possible splits), and the interval is removed from the tree.
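The attach-time bookkeeping just described can be sketched as follows. This is a simplified illustration that uses an ordered map instead of the augmented balanced tree and invented type names; it shows the split/delete/merge decisions but omits persistence, RPC handling, and detach.

#include <cstdint>
#include <iterator>
#include <map>
#include <string>

struct Interval { uint64_t end; std::string owner; };   // end offset (exclusive) and owner id
using IntervalTree = std::map<uint64_t, Interval>;      // keyed by start offset

// Record a newly attached range [start, end) owned by `owner`.
void attach(IntervalTree& tree, uint64_t start, uint64_t end, const std::string& owner) {
    auto it = tree.lower_bound(start);
    if (it != tree.begin() && std::prev(it)->second.end > start) --it;   // predecessor may overlap

    while (it != tree.end() && it->first < end) {        // every interval overlapping [start, end)
        uint64_t s = it->first, e = it->second.end;
        std::string o = it->second.owner;
        it = tree.erase(it);                             // drop the old interval...
        if (s < start) tree[s] = {start, o};             // ...keep its non-overlapping left part
        if (e > end)   it = tree.emplace(end, Interval{e, o}).first;   // ...and its right part
        // the overlapped middle is superseded by the new attach
    }
    tree[start] = {end, owner};                          // insert the new interval

    // Merge with contiguous neighbors owned by the same client.
    auto cur = tree.find(start);
    if (cur != tree.begin()) {
        auto prev = std::prev(cur);
        if (prev->second.end == start && prev->second.owner == owner) {
            prev->second.end = cur->second.end;
            tree.erase(cur);
            cur = prev;
        }
    }
    auto next = std::next(cur);
    if (next != tree.end() && cur->second.end == next->first && next->second.owner == owner) {
        cur->second.end = next->second.end;
        tree.erase(next);
    }
}

The real implementation additionally keeps per-owner metadata needed to route client-to-client reads, which is omitted here.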
Each client process also maintains a similar interval tree (noted as the local interval tree) for each file. It is used to keep track of locally written ranges and their mappings to the local burst buffer files. Specifically, each interval of the local interval tree has the form of ⟨Os, Oe, Bs, Be, attached⟩, where Os and Oe indicate the range of a write to the targeted PFS file, Bs and Be indicate where the range is buffered on the local burst buffer file, and attached indicates whether the write has been attached or not. At each write (bfs_write), a new interval will be inserted into the local interval tree. There will be no split because all writes are from the same client. Contiguous intervals are merged as in the global interval tree. The bfs_attach primitive is used to attach the writes to one contiguous file range, while the bfs_attach_file primitive attaches all local writes to the file. Both calls will pack and send all supplied information using a single RPC request. Moreover, both calls will check the local interval tree to make sure the same range is not attached twice, and the attached ranges were previously written by the local process.
As mentioned above, a client can respond to read requests from other clients after an attach call. This client-to-client data transfer can be performed efficiently using RDMA. For this to work, each client process needs to spawn a separate thread to listen to the incoming bfs_read requests. This increases CPU usage but can significantly improve read performance, assuming RDMA is faster than disk I/O (i.e., reading directly from the underlying PFS).

5.2 CommitFS and SessionFS
With BaseFS, we can easily implement a PosixFS, CommitFS, and SessionFS on top. Table 6 shows the APIs exposed by each along with their internal implementations using the BaseFS primitives. The primary difference in their implementations is the placement of the attach and query primitives. The stronger the model, the more frequently the attach and query primitives are needed. For example, to achieve POSIX consistency, an attach call has to be invoked by each write, and a query call has to be invoked by each read. In comparison, CommitFS only performs an attach at commit time, though a query is still needed ahead of every read operation. As for SessionFS, a query is performed at the session open time, and an attach is performed at the session close time. Within a session, multiple write and read calls can be executed without any query or attach.

TABLE 6: CommitFS and SessionFS: the exposed APIs and their implementations. POSIX consistency is included for discussion.
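Table 6 itself is not reproduced in this excerpt; in its spirit, the sketch below illustrates where CommitFS and SessionFS place the attach and query primitives. The wrapper names and BaseFS signatures are assumptions for illustration, not the actual API.

#include <cstddef>
#include <sys/types.h>

// Assumed BaseFS client handle and primitive signatures (illustrative only).
struct BfsFile;
void    bfs_write(BfsFile*, const void* buf, size_t count, off_t offset);
ssize_t bfs_read(BfsFile*, void* buf, size_t count, off_t offset);
void    bfs_query(BfsFile*, off_t offset, size_t count);   // ask the server who owns the range
void    bfs_attach_file(BfsFile*);                          // publish all un-attached local writes

// CommitFS: attach only at commit time; query before every read.
void commitfs_write(BfsFile* f, const void* buf, size_t n, off_t off) { bfs_write(f, buf, n, off); }
void commitfs_commit(BfsFile* f) { bfs_attach_file(f); }
ssize_t commitfs_read(BfsFile* f, void* buf, size_t n, off_t off) {
    bfs_query(f, off, n);                    // still one query RPC per read
    return bfs_read(f, buf, n, off);
}

// SessionFS: query once at session open, attach once at session close.
void sessionfs_open(BfsFile* f, off_t file_len) { bfs_query(f, 0, (size_t)file_len); }
void sessionfs_close(BfsFile* f)                { bfs_attach_file(f); }
// Reads and writes within the session call bfs_read/bfs_write directly, with no extra RPCs.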
6 THE IMPACT OF CONSISTENCY MODELS ON I/O PERFORMANCE
This section studies the impact of consistency models on I/O performance. First, we evaluate the performance of commit consistency and session consistency using benchmarks that represent common HPC I/O patterns. Then we perform two case studies to further understand the performance disparity caused by different consistency models. The first case study is of the I/O behavior of the Scalable Checkpoint/Restart (SCR) library [26], while the second case study is of the I/O behavior of the training phase of distributed deep learning applications. We note that in all cases, we only consider I/O operations and do not perform any computation or communication.

We performed all experiments on the Catalyst system located at Lawrence Livermore National Laboratory. Catalyst is a Cray CS300 system, where each compute node consists of an Intel Xeon E5-2695 with two sockets and 24 cores in total, with 128GB memory. The nodes are connected via IB QDR. The operating system is TOSS 3. Slurm is used to manage user jobs. The underlying PFS is an LLNL-customized version of Lustre, 2.10.6_2.chaos. Each compute node is equipped with an 800GB Intel 910 Series SSD, which serves as the burst buffer device. The peak sequential write bandwidth of the node-local SSD is 1GB/s, and its peak sequential read bandwidth is 2GB/s. We repeated all runs at least 10 times, and the average results are reported.

6.1 I/O of Scientific Applications
From a PFS perspective, within each file, there are three common parallel I/O access patterns: (1) Contiguous, where multiple processes access the file in a contiguous manner (normally without gaps); (2) Strided, where multiple processes access the file in an interleaved manner (often with a fixed stride); and (3) Random, where multiple processes access the file in a random manner. The random access pattern is commonly observed in deep learning applications, where multiple processes randomly load samples to feed the neural network. On the other hand, contiguous and strided access patterns are commonly used in parallel scientific applications for performing logging, checkpointing, and outputting snapshots.

We constructed synthetic workloads to simulate common HPC I/O scenarios. Each workload consists of a write phase and/or a read phase, and the read phase begins only after the write phase is complete. Additionally, all processes operate on a single shared file, resulting in an N-to-1 access pattern, where N is the total number of processes. The access pattern within the shared file for each phase (contiguous, strided, or random) can be determined at runtime. The workload can be run on either commit consistency or session consistency using the corresponding APIs provided by CommitFS or SessionFS. The other aspects of the I/O behavior are controlled by the set of parameters summarized in Table 7.

TABLE 7: Parameters of the synthetic I/O workloads.

nw  Number of writing nodes. All processes of a writing node perform only writes.
nr  Number of reading nodes. All processes of a reading node perform only reads.
n   Total number of nodes; n = nr + nw.
p   Number of processes per node. Each node runs an equal number of processes.
mw  Number of writes performed by each process. Each writing process performs the same number of writes.
mr  Number of reads performed by each process. Each reading process performs the same number of reads.
s   Access size of each I/O operation. All I/O operations have the same access size.

We used the four configurations shown in Table 8 to conduct the experiments. Each was run on up to 16 nodes with 12 processes per node. In our experiments, write nodes and read nodes did not overlap, so nw and nr always added up to n. In all runs, we set mw = mr = 10. Additionally, to understand the impact of a consistency model on scenarios with different access sizes, all experiments were run with two different access sizes: 8KB for small accesses and 8MB for large accesses. The file system was purged before the start of each run.

TABLE 8: Configurations for evaluating the impact of consistency models on common HPC I/O scenarios.

Code name   Write phase   Read phase    nw    nr
CN-W        Contiguous    N/A           n     0
SN-W        Strided       N/A           n     0
CC-R        Contiguous    Contiguous    n/2   n/2
CS-R        Contiguous    Strided       n/2   n/2
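For reference, the sketch below shows one plausible way such a workload can compute its file offsets for the contiguous and strided phases of an N-to-1 shared file; the exact rank/iteration layout is an assumption, as the paper does not give the formula.

#include <cstdint>

// Offset of the i-th access of process `rank`, with access size `s`.
uint64_t contiguous_offset(uint64_t rank, uint64_t i, uint64_t m, uint64_t s) {
    // Each process owns one contiguous block of m accesses.
    return (rank * m + i) * s;
}

uint64_t strided_offset(uint64_t rank, uint64_t i, uint64_t nprocs, uint64_t s) {
    // Accesses of all processes are interleaved with a fixed stride of nprocs * s.
    return (i * nprocs + rank) * s;
}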
6.1.1 Write-only workloads
The first two configurations, CN-W and SN-W, are write-only and differ only in how writes are performed by the collaborating processes. Figure 3 shows their write bandwidths. With the use of node-local SSDs as burst buffers, all writes are buffered by process-private cache files, which essentially converts the N-to-1 writes (contiguous or strided) to N-to-N contiguous writes. Therefore, for both consistency models, the performance of CN-W and SN-W was about the same.

Since the file system is empty when the writes start, session_open became a no-op, and session_close performed the same task as commit; thus session consistency and commit consistency achieved similar bandwidths.

Finally, small writes yielded worse performance as the small access sizes cannot saturate the bandwidth. When performing large writes, both access patterns were able to achieve the peak write bandwidth, regardless of the consistency model. This is because the overhead required by the consistency model is insignificant compared to the time needed to write to the SSD.
Fig. 3: Write bandwidth of CN-W and SN-W with 8MB and 8KB access sizes.

Fig. 4: Read bandwidth of CC-R and CS-R with 8MB and 8KB access sizes.

6.1.2 Read-after-write workloads
The last two configurations, CC-R and CS-R, demonstrate the impact of consistency models on the read bandwidth of workloads with different access patterns. In these configurations, half of the nodes are used for writing and the other half for reading the data back. In CC-R, writes and reads are done contiguously, so each read node reads from only one write node. In contrast, in CS-R, reads are strided, which requires each read node to receive data from multiple write nodes and may cause contention.

The results in Figure 4 demonstrate that CC-R outperforms CS-R under both consistency models and access sizes. For large reads (Figure 4a), the impact of consistency models on the bandwidth is negligible. However, for small reads (Figure 4b), session consistency achieved better performance and scalability than commit consistency. This is because commit consistency issues an RPC query every time it performs a read, and when the I/O of a read completes quickly, the software overhead becomes the I/O bottleneck, especially when many read requests are performed concurrently. In contrast, session consistency only queries once at the session open time, and the overhead is amortized over a number of reads. Lastly, we observed a high variance in the bandwidth of session consistency. To verify whether this was caused by network or system congestion, we repeated the same experiments multiple times at different times of the day and found consistent results. A further investigation (where we used a single node and excluded the communication time) showed that the SSD itself had high variance in small read performance, which we believe is due to normal wear and tear, as SSDs on Catalyst are rather old. We confirmed this hypothesis by conducting the same experiments on a newer machine (Expanse at San Diego Supercomputer Center), which showed very little variance.
6.2 Case Study: I/O of Scalable Checkpoint/Restart
In this subsection, we study the I/O behavior of SCR [26] for checkpointing and restarting HACC-IO [27] using an emulator. SCR is a scalable multi-level checkpointing system that supports multiple types of checkpoints with varying costs and levels of resiliency. The slowest but most resilient level writes to the system-wide PFS, which can withstand an entire system failure. Faster checkpointing for the most common failure modes involves using node-local storage, such as RAM and SSD, and implementing cross-node redundancy schemes.

In our emulation, we consider the most common case where SCR uses node-local storage only. We use the "Partner" redundancy scheme, where SCR writes checkpoints to local storage and also copies each checkpoint to storage local to a partner process from another failure group. This scheme requires twice the storage space, but it can withstand failures of multiple devices, so long as a device and the corresponding partner device that holds the copy do not fail simultaneously. To be specific, in our experiment, at the checkpoint phase, SCR buffers the checkpoint data in memory (local and partner) and then flushes it to the SSDs (local and partner) using a file-per-process access pattern. At restart time, SCR reads directly from the memory buffer, assuming the checkpoint data is still accessible.

The client of SCR is HACC-IO, which produces the actual checkpoint data. At each checkpoint step, HACC-IO writes out 9 arrays of the same length, each containing all particle values of a different physical variable. The total data size is determined by the number of particles, which we set to 10 million in our experiment. Furthermore, the experiment was run with one spare node, and we assumed a single-node failure. When running with n nodes, during the checkpoint phase, n − 1 nodes wrote to the node-local SSD, with a copy buffered in local memory. During the restart phase, n − 2 nodes read directly from the local memory buffer, and the spare node received the checkpoint through MPI from the partner of the failed node.

We show the read and write bandwidths of the checkpoint and restart phases in Figure 5. To better understand the read bandwidth, the result did not include the communication time for the spare node to get the checkpoint. Similar to the large-write experiments discussed earlier, SCR scaled well for checkpointing and achieved the peak bandwidth at all scales under both consistency models. In other words, the consistency model does not have a big impact on SCR's checkpointing bandwidth. However, for restarting, session consistency scaled better than commit consistency, mainly due to the low query frequency. At the restart phase, the reads were satisfied through memory buffers, and the overall read bandwidth scaled linearly with the number of nodes, which made the read time per node constant. However, under commit consistency, when more nodes were used, more query requests (one per read) were sent simultaneously to the global server, which then became the bottleneck and reduced scalability.

[Fig. 5: Checkpoint and Restart bandwidth (GB/s), Commit vs. Session; plot data omitted in this excerpt.]
6.3 Case Study: I/O of Distributed Deep Learning
Deep learning has thrived in recent years. However, as data sizes and system scales increase, traditional methods of feeding neural networks during training struggle to keep up with the required computation. To accelerate data ingestion rates, various methods [28], [29], [30], [31] have been proposed, such as data sharding, prestaging, and in-memory caching.

Here, we simulate the I/O of the "Preloaded" strategy that was proposed in [30] and implemented in LBANN [32]. Our simulation assigns to each process a non-overlapping subset of the training data. Before the training begins, each process loads its portion of data into its node-local SSD (hence the term Preloaded). Next, at the beginning of each epoch, each process is assigned a random subset of samples. The samples are evenly distributed to all processes so that each process performs an equal amount of work. During each epoch, each process reads the assigned samples, either locally or from other processes using MPI. It is worth noting that our benchmark is a simplified version of the Preloaded strategy that differs in two major ways: (1) we store data in node-local SSDs instead of memory, which is anyhow necessary for large datasets that do not fit in memory; and (2) we do not perform aggregations when sending samples to the same process, which places additional stress on the file system.
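A sketch of the per-epoch sample assignment just described is shown below; the seeded-shuffle scheme and helper comments are assumptions, and the actual benchmark may assign samples differently.

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Returns the sample indices assigned to `rank` for the given epoch.
std::vector<int> assign_samples(int epoch, int num_samples, int rank, int nprocs) {
    std::vector<int> idx(num_samples);
    std::iota(idx.begin(), idx.end(), 0);
    std::mt19937 rng(epoch);                 // same seed on every rank, so all ranks agree
    std::shuffle(idx.begin(), idx.end(), rng);

    std::vector<int> mine;
    for (int i = rank; i < num_samples; i += nprocs)   // even split across processes
        mine.push_back(idx[i]);
    return mine;
}

// During the epoch, each assigned sample is then read either from the node-local SSD
// (if this process preloaded it) or fetched from its owner via MPI.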
The average per-epoch read bandwidth is presented in Figure 6. We conducted both strong scaling and weak scaling experiments, with a mini-batch size of 1024 for strong scaling, and each process working on 32 samples per iteration for weak scaling. The sample size was set to 116KB, which is the same as the average image size of ImageNet-1K [33]. The number of processes per node was set to 4 (in real DL training, this number is usually set to match the number of GPUs per node). The results are very similar to those of the small-read experiments shown in Figure 4b, only the bandwidth is higher here thanks to the slightly larger reads (116KB vs. 8KB). In both strong scaling and weak scaling, session consistency outperformed commit consistency in terms of scalability and bandwidth, due to less time spent on queries. Additionally, the increasing gap in bandwidth between the two consistency models with the number of nodes further emphasizes the significance of choosing an appropriate consistency model to achieve optimal performance and scalability.

[Fig. 6: Per-epoch read bandwidth (GB/s) for strong scaling and weak scaling, Commit vs. Session; plot data omitted in this excerpt.]

6.4 Key Takeaways
Here, we present the key findings derived from our experiments.
• When performing large writes and reads (e.g., over one megabyte per I/O operation), consistency models do not have a big impact on the I/O bandwidth. This is because the overhead of maintaining the consistency model (weaker or stronger) is insignificant compared to the time needed to access the I/O device.
• When performing small writes and reads (e.g., ranging from a few bytes to a few kilobytes), the adoption of a stronger consistency model can noticeably hinder performance. This is attributed to the faster completion of each I/O operation, making the overhead of maintaining strong consistency a bottleneck. Moreover, the traffic required to maintain the consistency model can lead to contention, particularly when there is a high volume of small I/O operations.
• When I/O operations are directly fulfilled by memory or fast devices like persistent memory, the choice of consistency models can significantly impact performance. This is due to a similar reason as mentioned earlier, where the faster completion of I/O operations magnifies the overhead associated with maintaining strong consistency models.
• For small random accesses (e.g., random reads of deep learning applications), weaker consistency models demonstrate higher I/O bandwidth and improved scalability compared to stronger models. Notably, this improvement is significant even at smaller scales, indicating a promising direction for optimizing the I/O performance of deep learning applications.
7 CONCLUSION AND FUTURE WORK
This work explored consistency models from the perspective of parallel file systems. We provided a high-level discussion of important aspects of storage consistency models, including their design choices and their comparison with memory models. Based on the commonalities of existing storage models, we proposed a unified and formal framework for specifying properly-synchronized SCNF models, which guarantee sequential consistency (or POSIX consistency) for programs that are properly synchronized. Additionally, we proposed a flexible design for implementing properly-synchronized SCNF models that isolates the consistency model from other file system components, making it easy to understand the impact of different consistency models on I/O performance.

We also presented a detailed performance comparison between commit consistency and session consistency. Our results indicate that session consistency is better suited for most HPC I/O workloads in terms of performance and scalability. Although this comes at the cost of slightly reduced programmability, the performance gain is potentially huge, especially for small reads such as those in deep learning applications. Overall, this work contributes to a better understanding of consistency models in parallel file systems and their impact on I/O performance.

In our future work, we will implement different relaxed storage models in existing PFSs to evaluate their performance impacts in a real-world setting. Additionally, we plan to study the consistency requirements of metadata operations for HPC applications and evaluate their performance implications.

ACKNOWLEDGMENTS
This work was supported by NSF SHF Collaborative grant 1763540 and was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-JRNL-849174-DRAFT. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under the DOE Early Career Research Program.

REFERENCES
[1] T. Patel, S. Byna, G. K. Lockwood, and D. Tiwari, "Revisiting I/O Behavior in Large-Scale Storage Systems: the Expected and the Unexpected," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–13.
[2] A. K. Paul, O. Faaland, A. Moody, E. Gonsiorowski, K. Mohror, and A. R. Butt, "Understanding HPC Application I/O Behavior Using System Level Statistics," in 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC). IEEE, 2020, pp. 202–211.
[3] N. Dryden, R. Böhringer, T. Ben-Nun, and T. Hoefler, "Clairvoyant Prefetching for Distributed Machine Learning I/O," arXiv preprint arXiv:2101.08734, 2021.
[4] F. Di Natale, H. Bhatia, T. S. Carpenter, C. Neale, S. Kokkila-Schumacher, T. Oppelstrup, L. Stanton, X. Zhang, S. Sundram, T. R. Scogland et al., "A Massively Parallel Infrastructure for Adaptive Multiscale Simulations: Modeling RAS Initiation Pathway for Cancer," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–16.
[5] "IEEE Standard for Information Technology–Portable Operating System Interface (POSIX(TM)) Base Specifications, Issue 7," IEEE Std 1003.1-2017 (Revision of IEEE Std 1003.1-2008), pp. 1–3951, 2018.
[6] P. Braam, "The Lustre Storage Architecture," arXiv preprint arXiv:1903.01955, 2019.
[7] F. B. Schmuck and R. L. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," in FAST, vol. 2, no. 19, 2002.
[8] F. Herold, S. Breuner, and J. Heichler, "An Introduction to BeeGFS," 2014. [Online]. Available: https://www.beegfs.io/docs/whitepapers/Introduction to BeeGFS by ThinkParQ.pdf
[9] T. Wang, K. Mohror, A. Moody, W. Yu, and K. Sato, "BurstFS: A Distributed Burst Buffer File System for Scientific Applications," in The International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015.
[10] Lawrence Livermore National Laboratory, "UnifyFS: A File System for Burst Buffers," https://github.com/LLNL/UnifyFS, Mar. 2021.
[11] A. Miranda, R. Nou, and T. Cortes, "echofs: A Scheduler-Guided Temporary Filesystem to Leverage Node-local NVMs," in 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 2018, pp. 225–228.
[12] O. Tatebe, S. Moriwake, and Y. Oyama, "Gfarm/BB—Gfarm File System for Node-Local Burst Buffer," Journal of Computer Science and Technology, vol. 35, no. 1, pp. 61–71, 2020.
[13] S. Oral, S. S. Vazhkudai, F. Wang, C. Zimmer, C. Brumgard, J. Hanley, G. Markomanolis, R. Miller, D. Leverman, S. Atchley et al., "End-to-end I/O Portfolio for the Summit Supercomputing Ecosystem," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–14.
[14] L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Transactions on Computers, vol. 28, no. 9, pp. 690–691, 1979.
[15] P. Sewell, S. Sarkar, S. Owens, F. Z. Nardelli, and M. O. Myreen, "x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors," Communications of the ACM, vol. 53, no. 7, pp. 89–97, 2010.
[16] M. Dubois, C. Scheurich, and F. Briggs, "Memory Access Buffering in Multiprocessors," ACM SIGARCH Computer Architecture News, vol. 14, no. 2, pp. 434–442, 1986.
[17] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," ACM SIGARCH Computer Architecture News, vol. 18, no. 2SI, pp. 15–26, 1990.
[18] C. Wang, K. Mohror, and M. Snir, "File System Semantics Requirements of HPC Applications," in Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2020, pp. 19–30.
[19] IBM, "Burst Buffer Shared Checkpoint File System," Apr. 2020. [Online]. Available: https://github.com/IBM/CAST/tree/master/bscfs
[20] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck, "RFC3530: Network File System (NFS) Version 4 Protocol," 2003.
[21] P. Corbett, D. Feitelson, S. Fineberg, Y. Hsu, B. Nitzberg, J.-P. Prost, M. Snir, B. Traversat, and P. Wong, "Overview of the MPI-IO Parallel I/O Interface," in IPPS'95 Workshop on Input/Output in Parallel and Distributed Systems, 1995, pp. 1–15.
[22] "MPI: A Message-Passing Interface Standard Version 4.0," https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf, 2021.
[23] S. V. Adve, "Designing Memory Consistency Models for Shared-Memory Multiprocessors," Ph.D. dissertation, University of Wisconsin, Madison, 1993.
[24] S. V. Adve and M. D. Hill, "Weak Ordering - a New Definition," ACM SIGARCH Computer Architecture News, vol. 18, no. 2SI, pp. 2–14, 1990.
[25] J. Manson, W. Pugh, and S. V. Adve, "The Java Memory Model," ACM SIGPLAN Notices, vol. 40, no. 1, pp. 378–391, 2005.
[26] A. Moody, G. Bronevetsky, K. Mohror, and B. R. De Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System," in SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2010, pp. 1–11.
[27] "HACC IO Kernel from the CORAL Benchmark Codes," https://asc.llnl.gov/coral-benchmarks#hacc, Jan. 2018.
[28] Y. Oyama, N. Maruyama, N. Dryden, E. McCarthy, P. Harrington, J. Balewski, S. Matsuoka, P. Nugent, and B. Van Essen, "The Case for Strong Scaling in Deep Learning: Training Large 3D