A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
DOI 10.1007/s11227-013-0884-0
Abstract In recent years, High Performance Computing (HPC) systems have been
shifting from expensive massively parallel architectures to clusters of commodity PCs
to take advantage of cost and performance benefits. Fault tolerance in such systems is
a growing concern for long-running applications. In this paper, we briefly review the
failure rates of HPC systems and also survey the fault tolerance approaches for HPC
systems and issues with these approaches. Rollback-recovery techniques, which are most widely used for long-running applications on HPC clusters, are then discussed in detail. Specifically, the
feature requirements of rollback-recovery are discussed and a taxonomy is developed
for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid
researchers in the domain as well as to facilitate development of new checkpointing
solutions.
S. Chen
Information Engineering Laboratory, CSIRO ICT Centre, Sydney, Australia
e-mail: [email protected]
1 Introduction
HPC systems continue to grow exponentially in scale, currently moving from petascale computing (10^15 floating-point operations per second) toward exascale computing (10^18 floating-point operations per second), as well as in complexity, owing to the growing need to handle long-running computational problems with effective techniques. However,
HPC systems come with their own technical challenges [67]. The total number of
hardware components, the software complexity and overall system reliability, avail-
ability and serviceability (RAS) are factors to contend with in HPC systems, because
hardware or software failure may occur while long-running parallel applications are
being executed. The need for reliable, fault-tolerant HPC systems has intensified be-
cause failure may result in a possible increase in execution time and cost of running
the applications. Consequently, fault tolerance solutions are being incorporated into
the HPC systems. Fault tolerant systems have the ability to contain failures when they
occur, thereby minimizing the impact of failure. Hence, there is a need for further in-
vestigation of fault tolerance of HPC systems.
From an analysis of the Top500 [74] HPC systems, it is clear that the number of processors and nodes is steadily increasing. Top500 is a statistical list that ranks and details the world's 500 most powerful supercomputers. The list is compiled by
Hans Meuer (of the University of Mannheim) et al. and published twice a year.
It shows that performance has almost doubled each year. But, at the same time,
the overall system Mean Time Between Failure (MTBF) is reduced to just a few
hours [9]. This suggests that it is useful to review the current state of the art of
the application of fault tolerance techniques in HPC systems. For example, the IBM
Blue Gene/L was built with 131,000 processors. If the MTBF of each processor is
876,000 hours (100 years), a cluster of 131,000 processors has an MTBF of 876,000/
131,000 = 6.68 hours.
MTBF is a primary measure of system reliability, which is defined as the probability that the system performs without deviations from agreed-upon behavior for a specific period of time [29]. The reliability of a component is given as
\text{Reliability function } R(t) = \frac{n(t)}{N} = \frac{\text{failure-free elements at time } t}{\text{number of elements at time } 0} \quad (1)
The reliability of m elements connected in series is

R_s = \prod_{i=1}^{m} e^{-\lambda_i t} = e^{-t \sum_{i=1}^{m} \lambda_i} \quad (2)
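To make the scaling effect concrete, the short Python sketch below (our illustration, not part of the original paper; the function names are ours) evaluates the system MTBF and the series reliability of Eq. (2) for N identical components, using the Blue Gene/L-style figures quoted above.

```python
import math

# Hedged illustration (not from the paper): system-level MTBF and series
# reliability for N identical components with exponential failure rates.

def system_mtbf(component_mtbf_hours, n_components):
    """MTBF of n identical components in series: component MTBF divided by n."""
    return component_mtbf_hours / n_components

def series_reliability(failure_rates, t_hours):
    """Eq. (2): R_s(t) = prod_i exp(-lambda_i * t) = exp(-t * sum_i lambda_i)."""
    return math.exp(-t_hours * sum(failure_rates))

# Blue Gene/L-style example from the text: 131,000 processors,
# each with an MTBF of 876,000 hours (100 years).
n = 131_000
mtbf_component = 876_000.0
print(f"System MTBF: {system_mtbf(mtbf_component, n):.2f} hours")   # ~6.7 hours

# Probability that the whole machine survives a 6-hour job without any failure.
lam = 1.0 / mtbf_component          # per-component failure rate (failures/hour)
print(f"P(no failure in 6 h): {series_reliability([lam] * n, 6.0):.3f}")
```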
Most of the long-running applications are Message Passing Interface (MPI) applications. MPI is the common parallel programming standard with which most parallel applications are written [48]; it provides two modes of operation: running or failed. An example of an MPI application is the Portable Ex-
tensible Toolkit for Scientific Computation (PETSc) [53], which is used for modeling
in scientific applications such as acoustics, brain surgery, medical imaging, ocean dy-
namics, and oil recovery.
Software or hardware failure prompts the running MPI application to abort or stop,
and it may have to restart from the beginning. This can be a waste of resources (com-
puter resources, human resources, and electrical power) because all the computations
that have already been completed may be lost. Therefore, rollback-recovery tech-
niques are commonly used to provide fault tolerance to parallel applications so that
they can restart from a previously saved state. A good number of rollback-recovery
techniques have been developed so far, such as DMTCP [1] and BLCR [21]. In this
paper, we provide a survey of such rollback-recovery facilities to facilitate develop-
ment of more robust ones for MPI applications.
Recently, there has also been a trend to connect large clusters using high-performance
networks, such as InfiniBand (IB) [33]. IB is a switched-fabric communications link
used in HPC because it provides high throughput, low latency, high quality of ser-
vice, and failover. The InfiniBand Architecture (IBA) may be the communication
technology of the next generation HPC systems; as of November 2011, InfiniBand
connected systems represented more than 42 % of the systems in the Top500 list
[33]. It is important for such large scale systems with IB interconnection networks
to have efficient fault tolerance mechanisms that meet their requirements. Currently, only a small number of checkpointing facilities support the IB architecture. We will state whether the
checkpoint/restart facilities we reviewed provide support for IB sockets.
In order to survey the fault tolerance approaches, we first need to have an overview
of the failure rates of HPC systems. Generally, failures occur as a result of hard-
ware or software faults, human factors, malicious attacks, network congestion, server
overload, and other, possibly unknown causes [30, 44, 49, 50]. These failures may
cause computational errors, which may be transient or intermittent, but can still lead
to permanent failures [37]. A transient failure causes a component to malfunction for
a certain period of time, but then disappears and the functionality of that component
is fully restored. An intermittent failure appears and disappears; it never goes away
completely, unless it is resolved. A permanent failure causes the component to mal-
function until it is replaced. A lot of work has been done on understanding the causes
of failure, and we briefly review the major contributors to failure in this section. We also add our findings to this review.
Gray [30] analyzed outage/failure reports of Tandem computer systems between 1985
and 1990, and found that software failure was a major source of outages at about
55 %. Tandem systems were designed to be single fault-tolerant systems, that is,
systems capable of overcoming the failure of a single element (but not simultane-
ous multiple failures). Each Tandem system consisted of 4 to 16 processors, 6 to
100 discs, 100 to 1,000 terminals and their communication gear. Systems with more
than 16 processors were partitioned to form multiple systems and each of the multiple
systems had 10 processors linked together to form an application system.
Lu [44] studied the failure log of three different architectures at the National Cen-
ter for Supercomputing Applications (NCSA). The systems were: (1) a cluster of 12
SGI Origin 2000 NUMA (Non-Uniform Memory Architecture) distributed shared
memory supercomputers with a total of 1,520 CPUs, (2) Platinum, a PC cluster with
1,040 CPUs and 520 nodes, and (3) Titan, a cluster of 162 two-way SMP 800 MHz
Itanium-1 nodes (324 CPUs). In the study, five types of outages/failures were de-
fined: software halt, hardware halt, scheduled maintenance, network outages, and air
conditioning or power halts. Lu found that software failure was the main contributor to outages (59–83 %), suggesting that software failure rates are higher than hardware
failure rates.
A large set of failure data was also released by CFDR [10], comprising the failure
statistics of 22 HPC systems, including a total of 4,750 nodes and 24,101 processors
collected over a period of 9 years at Los Alamos National Laboratory (LANL). The
workloads consisted of large-scale long-running 3D scientific simulations which take
months to complete. We have filtered the data in order to reveal the system failure rates. Figure 2 shows systems (2 to 24) with different configurations
and architectures, with the number of nodes varying from 1 to 1,024, and the number
of processors varying from 4 to 6,152. System 2 with 6,152 processors recorded the
highest number of hardware failures. Figure 2 also shows the number of failures
recorded over the period, represented by a bar chart. From the bar chart, it can be
clearly seen that the failure rates of HPC systems increase as the number of nodes
and processors increases.
Schroeder and Gibson [64, 65] analyzed failure data collected at two large HPC
sites: the data set from LANL RAS [10] and the data set collected over the period
of one year at a large supercomputing system with 20 nodes and more than 10,000
processors. Their analysis suggests that (1) the mean repair time across all failures
(irrespective of their failure types) is about 6 hours, (2) that there is a relationship
between the failure rate of a system and the applications running on it, (3) that as
many as three failures may occur on some systems within 24 hours, and (4) that the
failure rate is almost proportional to the number of processors in a system.
Oliner and Stearley [49] studied system logs from five supercomputers installed at
Sandia National Labs (SNL) as well as Blue Gene/L, which is installed at Lawrence
Livermore National Labs (LLNL). The five systems were ranked in the Top500 super-
computers. The systems were structured as follows: (1) Blue Gene/L with 131,072
CPUs and a custom interconnect, (2) Thunderbird with 9,024 CPUs and an Infini-
Band interconnect, (3) Red Storm with 10,880 CPUs and a custom interconnect,
(4) Spirit (ICC2) with 1,028 CPUs and a GigEthernet (Gigabit Ethernet) intercon-
nect, and (5) Liberty with 512 CPUs and a Myrinet interconnect. A summary of the systems is provided in Table 1 for easy reference. Although the raw data collected
implied that 98 % of the failures were due to hardware, after they filtered the data,
their analysis revealed that 64 % of the failures were due to software.
3.2 Redundancy
Failure masking techniques provide fault tolerance by ensuring that services are avail-
able to clients despite failure of a worker, by means of a group of redundant and
physically independent workers; in the event of failure of one or more members of
the group, the services are still provided to clients by the surviving members of the
group, often without the clients noticing any disruption. There are two masking tech-
niques used to achieve failure masking: hierarchical group masking and flat group
masking [18]. Figure 4 illustrates the flat group and the hierarchical group masking
methods.
Flat group masking is symmetrical and does not have a single point of failure;
the individual workers are hidden from the clients, appearing as a single worker.
A voting process is used to select a worker in the event of failure. The voting process
may introduce some delays and overhead because a decision is only reached when
inputs from various workers have been received and compared.
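As a minimal sketch of the voting step in flat group masking (our illustration; the worker functions and names are hypothetical), the following Python fragment sends a request to every redundant worker and returns the majority result, masking a crashed or faulty member of the group.

```python
from collections import Counter

# Minimal sketch of flat-group failure masking (our illustration): the client
# sees one logical service, each request is sent to several redundant workers,
# and the majority result is returned, masking a faulty or crashed worker.

def flat_group_request(workers, request):
    """Collect results from all reachable workers and return the majority value."""
    results = []
    for worker in workers:
        try:
            results.append(worker(request))   # a worker may crash or return garbage
        except Exception:
            pass                              # crashed worker: simply ignored
    if not results:
        raise RuntimeError("all workers failed")
    value, votes = Counter(results).most_common(1)[0]
    return value

# Hypothetical workers: two correct replicas and one faulty replica.
ok = lambda x: x * x
faulty = lambda x: x * x + 1
print(flat_group_request([ok, faulty, ok], 7))   # the majority masks the fault -> 49
```

Note that the call only returns after every reachable worker has answered, which is exactly the comparison delay and overhead mentioned above.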
In hierarchical group failure masking, a coordinator of the group's activities decides which worker may replace a failed worker in the event of failure.
This approach has a single point of failure; the ability to effectively mask failures
depends on the semantic specifications implemented [57].
Fault masking may create new errors, hazards and critical operational failures
when operational staff fails to replace already failed components [34]. When fail-
ure masking is used, the system should be regularly inspected. However, there are
costs associated with regular inspections.
Failure semantics refers to the different ways that a system designer anticipates the
system can fail, along with failure handling strategies for each failure mode. This
list is then used to decide what kind of fault tolerance mechanisms to provide in the
system. In other words, with failure semantics [18], the anticipated types of system
failure are built within the fault tolerance system and the recovery actions are invoked
upon detection of failures. Some of the different failure semantics are omission failure
semantics, performance semantics, and crash failure semantics.
Crash failure semantics apply if the only failure that the designers anticipate from
a component is for it to stop processing instructions, while behaving correctly prior
to that. Omission failure semantics are used if the designers expect a communication
service to lose messages, with negligible chances that messages are delayed or cor-
rupted. Omission/performance failure semantics apply when the designers expect a
service to lose or delay messages, but with lesser probability that messages can be
corrupted.
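The idea of building anticipated failure modes and their recovery actions into the system can be sketched as an explicit mapping, as in the illustrative Python fragment below (ours, not from the paper; the handler functions are hypothetical placeholders).

```python
from enum import Enum, auto

# Illustrative sketch only: failure semantics expressed as an explicit mapping
# from anticipated failure modes to recovery actions. The mode names follow
# the text; the handler functions are hypothetical placeholders.

class FailureMode(Enum):
    CRASH = auto()        # component stops; behaved correctly before stopping
    OMISSION = auto()     # messages may be lost
    PERFORMANCE = auto()  # messages may be lost or delayed

def restart_on_spare(event):      print(f"restarting {event} on a spare node")
def retransmit(event):            print(f"retransmitting lost message {event}")
def retransmit_or_extend(event):  print(f"retransmitting {event} with a longer timeout")

RECOVERY_ACTIONS = {
    FailureMode.CRASH: restart_on_spare,
    FailureMode.OMISSION: retransmit,
    FailureMode.PERFORMANCE: retransmit_or_extend,
}

def on_failure_detected(mode, event):
    """Invoke the recovery action built in for the anticipated failure mode."""
    handler = RECOVERY_ACTIONS.get(mode)
    if handler is None:
        # a failure mode the designers did not anticipate is not covered
        raise RuntimeError(f"unanticipated failure mode: {mode}")
    handler(event)

on_failure_detected(FailureMode.CRASH, "node 17")
```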
The fault tolerant system is built based on foreknowledge of the anticipated fail-
ure patterns and it reacts to them when these patterns are detected; hence, the level
of fault tolerance depends on the likely failure behaviors of the model implemented.
Broad classes of failure modes with associated failure semantics may also be defined
(rather than specific individual failure types). This technique relies on the ability of
the designer to predict failure modes accurately and to specify the appropriate action
to be taken when a failure scenario is detected. It is not feasible, however, in any sys-
tem of any complexity, such as an HPC system, to predict all possible failure modes. A processor can, for example, achieve crash failure semantics by using duplicate processors.
Failure semantics may also require hardware modifications [32]. Similarly, some of
the node and application failures that occur in HPC systems may be unknown to the fault tolerance mechanisms in place. For example, a new virus may exhibit a new behavior
pattern which would go undetected even though it could crash the system [15].
3.5 Recovery
Generally, fault tolerance implies recovering from an error that may otherwise lead to computational errors or system failure. The main idea is to replace the erroneous
state with a correct and stable state. There are two forms of error recovery mecha-
nisms: forward and backward error recovery.
Forward Error Recovery: With Forward Error Recovery (FER) [68] mechanisms,
an effort is made to bring the system to a new correct state from where it can con-
tinue to execute, without the need to repeat any previous computations. FER, in other
words, implies detailed understanding of the impact of the error on the system, and
a good strategy for later recovery. FER is commonly implemented where continued
service is more important than immediate recovery, and high levels of accuracy in
values may be sacrificed; that is, where it is required to act urgently (in, e.g., mission-
critical environment) to keep the system operational.
FER is commonly used in flight control operations, where forward recovery may be preferable to rollback-recovery. A good example of forward correction is fault masking, such as the voting process employed in triple modular redundancy and in N-version
programming.
As the number of redundant components increases, the overhead cost of FER and
of the CPU increases because recovery is expected to be completed in the degraded
operating states, and the possibility of reconstruction of data may be small in such
states [27]. Software systems typically have large numbers of states and multiple con-
current operations [17], which implies that there may be low probability of recovery
to a valid state. It may be possible in certain scenarios to predict the fault; however, it
may be difficult to design an appropriate solution in the event of unanticipated faults.
FER cannot guarantee that state variables required for the future computation are
correctly re-established following an error; therefore, the result of the computations
following an error occurrence may be erroneous. FER is also more difficult to im-
plement compared to rollback-recovery techniques, because of the number of states
and concurrent operations. In some applications, a combination of both forward and
rollback-recovery may be desirable.
Rollback-recovery: Rollback-recovery consists of checkpoint, failure detection,
and recovery/restart. A checkpoint [37] is a snapshot of the state of the entire pro-
cess at a particular point such that the process could be restarted from that point in
the event that a subsequent failure is detected. Rollback-recovery is one of the most
widely used fault tolerance mechanisms for HPC systems, probably because (1) fail-
ures in HPC systems often lead to fail-stop of the MPI application execution, (2)
– Application coverage: The checkpointing solution must cover a wide range of applications, to reduce the likelihood of implementing and using multiple different checkpoint/restart solutions, which may lead to software conflicts and greater performance overhead.
– Platform portability: It must not be tightly coupled to one version of an operating
system or application framework, so that it can be ported to other platforms with
minimal effort.
– Intelligence/Automatic: It should use failure prediction and failure detection mech-
anisms to determine when checkpointing/restart should occur without the user's intervention. Whenever this feature is lacking, users are involved in initiating the checkpoint/restart process. Although system users may be trained to carry out
the checkpoint/restart activities, human error can still be introduced if system users
are allowed to initiate checkpoint or recovery processes [6].
– Low overhead: The time to save checkpoint data should be significantly shorter than the 40 to 60 minutes that have been recorded on some of the Top500 HPC systems [8]. The size of the checkpoint should be small.
advantages that it makes recovery from failed states simpler and is not prone to the
domino effect. Storage overhead is also reduced compared to uncoordinated check-
pointing, because each process maintains only one checkpoint on stable permanent
storage. However, it adds overhead because a global checkpoint needs internal syn-
chronization to occur prior to checkpointing. A number of checkpoint protocols have
been proposed to ensure global coordination: a nonblocking checkpointing coordi-
nation protocol was proposed [11] to ensure that applications that would make coor-
dinated checkpointing inconsistent are prevented from running. Checkpointing with
synchronized clocks [19] has also been proposed. The DMTCP [1] checkpointing
facility is an example that implements a coordinated checkpointing mechanism.
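The basic structure of coordinated checkpointing can be illustrated with the simple blocking sketch below (ours; it is not the nonblocking protocol of [11] nor the implementation of any cited facility): all processes synchronize so that no application messages are in flight, each writes its own checkpoint, and they synchronize again before resuming.

```python
import multiprocessing as mp
import pickle

# Minimal sketch of coordinated (blocking) checkpointing, in the spirit of the
# protocols above but not taken from any of the cited facilities: all processes
# meet at a barrier, each one writes its local state, and they meet again
# before resuming, so the set of per-process checkpoints forms a consistent
# global checkpoint.

def worker(rank, n_steps, barrier):
    state = {"rank": rank, "step": 0, "partial_sum": 0}
    for step in range(1, n_steps + 1):
        state["step"] = step
        state["partial_sum"] += rank * step        # the "computation"
        if step % 5 == 0:                          # checkpoint interval
            barrier.wait()                         # global synchronization
            with open(f"ckpt_rank{rank}.pkl", "wb") as f:
                pickle.dump(state, f)              # one checkpoint per process
            barrier.wait()                         # everyone finished writing

if __name__ == "__main__":
    n_procs = 4
    barrier = mp.Barrier(n_procs)
    procs = [mp.Process(target=worker, args=(r, 10, barrier)) for r in range(n_procs)]
    for p in procs: p.start()
    for p in procs: p.join()
```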
Communication-induced checkpointing (CIC) (also called message induced check-
pointing) protocols do not require that all checkpoints be consistent, and still avoid
the domino effect. With this technique, processes perform two types of checkpoints:
local and forced checkpoints. A local checkpoint is a snapshot of the local state of
a process, saved on persistent storage. Local checkpoints are taken independently of
the global state. Forced checkpoints are taken when the protocol forces the processes
to make an additional checkpoint. The main advantage of CIC protocols is that they
allow processes to decide independently when to checkpoint. The overhead in saving is re-
duced because a process can take local checkpoints when the process state is small.
CIC, however, has two major disadvantages: (1) it generates large numbers of forced
checkpoints with resulting storage overhead and (2) the data piggybacked on the mes-
sages generates considerable communications overhead.
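The interplay of local and forced checkpoints can be sketched with an index-based scheme (our illustration of the general CIC idea, not a protocol from the paper): each message piggybacks the sender's checkpoint index, and a receiver whose index lags behind takes a forced checkpoint before delivering the message.

```python
# Illustrative sketch (ours) of an index-based communication-induced
# checkpointing idea: each message piggybacks the sender's checkpoint index,
# and a receiver whose index lags behind takes a *forced* checkpoint before
# delivering the message; *local* checkpoints are taken whenever convenient.

class CicProcess:
    def __init__(self, name):
        self.name = name
        self.ckpt_index = 0          # number of checkpoints taken so far
        self.state = {}

    def local_checkpoint(self):
        self.ckpt_index += 1
        print(f"{self.name}: local checkpoint #{self.ckpt_index}")

    def send(self, dest, payload):
        dest.receive(payload, piggybacked_index=self.ckpt_index)

    def receive(self, payload, piggybacked_index):
        if piggybacked_index > self.ckpt_index:     # delivery would create an inconsistency
            self.ckpt_index = piggybacked_index
            print(f"{self.name}: forced checkpoint #{self.ckpt_index}")
        self.state["last_msg"] = payload            # deliver after checkpointing

p, q = CicProcess("P"), CicProcess("Q")
p.local_checkpoint()       # P checkpoints independently (index 1)
p.send(q, "data")          # Q lags behind -> forced checkpoint before delivery
```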
Causal message logging protocols combine the advantages of both pessimistic and optimistic message logging protocols. Here, the message logs are written to stable storage when it is most convenient for the process to do so. In causal message logging protocols, processes piggyback the determinants of nondeterministic events on messages and keep them in local storage. Therefore, only the most recent message log is required for restarting, and multiple copies are kept, making the logs available in the event of multiple machine failures. Readers interested in the piggybacking concept in causal message logging protocols are referred to [23, 38]. The main disadvantage of the causal message logging protocol is that it requires a more complex recovery protocol.
There are techniques that are designed to reduce the overhead cost in saving the
checkpoint data when writing the state of a process to persistent storage. This is,
of course, one of the major sources of increased performance overhead. We briefly
discuss here some of these techniques.
Concurrent checkpointing implementations [41] rely on the memory protection ar-
chitecture. Disk writing is done concurrently with execution of the targeted program;
that is, the process continues executing while its state is being saved to a separate buffer. The data is later transferred to a stable storage system.
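A simplified sketch of this idea is shown below (ours; a real implementation such as [41] relies on memory protection to copy pages lazily rather than copying the whole state up front): the state is copied into a buffer and a background thread writes it to disk while the application keeps computing.

```python
import copy
import pickle
import threading

# Simplified sketch (ours) of concurrent checkpointing: the process state is
# copied into a separate buffer, a background thread writes the buffer to disk,
# and the application keeps computing while the write is in progress.

def write_checkpoint(buffer, path):
    with open(path, "wb") as f:
        pickle.dump(buffer, f)                             # slow I/O off the critical path

state = {"iteration": 0, "data": list(range(10_000))}

for it in range(1, 4):
    state["iteration"] = it
    state["data"] = [x + 1 for x in state["data"]]         # the "computation"
    snapshot = copy.deepcopy(state)                        # copy into a separate buffer
    writer = threading.Thread(target=write_checkpoint,
                              args=(snapshot, f"ckpt_{it}.pkl"))
    writer.start()                                         # write concurrently
    # ... application continues executing here while the write proceeds ...
    writer.join()                                          # in this sketch, wait before the next round
```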
In incremental checkpointing, only the portion of the process state that has changed since the last checkpoint [56] is saved. The unchanged portion can be restored from previous checkpoints. This reduces the overhead of checkpointing. However, recovery can be complex because multiple incremental checkpoint files are kept and their number grows as the application runs. This can be limited to at
most n increments, after which a full checkpoint is saved.
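A rough sketch of the mechanism is given below (our illustration; real systems typically track dirty memory pages through the MMU rather than hashing user-visible state): the state is split into blocks, a digest per block is remembered, and only blocks whose digest changed since the last checkpoint are written.

```python
import hashlib
import pickle

# Rough sketch (ours) of incremental checkpointing via per-block change
# detection; production systems usually rely on page protection instead.

_last_hashes = {}

def incremental_checkpoint(blocks, path):
    """Save only the blocks that changed since the previous call."""
    changed = {}
    for key, block in blocks.items():
        digest = hashlib.sha256(pickle.dumps(block)).hexdigest()
        if _last_hashes.get(key) != digest:
            changed[key] = block
            _last_hashes[key] = digest
    with open(path, "wb") as f:
        pickle.dump(changed, f)
    return list(changed)

blocks = {"grid": [0.0] * 1000, "step": 0}
print(incremental_checkpoint(blocks, "ckpt_0.pkl"))   # first call saves everything
blocks["step"] = 1                                    # only one block changes
print(incremental_checkpoint(blocks, "ckpt_1.pkl"))   # second call saves just 'step'
```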
Flash-based Solid State Disk (SSD) memory may also be used as a persistent store
for the checkpoint data. SSD is based on semiconductor chips rather than magnetic
media technology such as hard drives to store persistent data. SSDs have lower access times and latency compared to hard disks; however, they support a limited number of write cycles (about 100,000) and the data cannot be used after the cells wear out [13]. Wear leveling is
used to minimize this problem [43].
The Fusion-io ioDrive card may also be used to reduce write times. This is a
memory tier of NAND flash-based solid state technology, which increases bandwidth.
It is expected that such technology will scale up to the performance levels expected
of HPC systems [26]. Research on the scalability of Fusion-io in HPC may be highly worthwhile.
Copy-on-write [56] techniques reduce the checkpoint time by allowing the parent
process to fork a child process at each checkpoint. The parent process continues ex-
ecution while the child process carries out checkpointing activities. The technique is
useful in reducing checkpoint time when the checkpoint data is small. However, there
is a performance degradation if the size of the checkpoint data is large because the
child and parent processes compete for computer resources (e.g., memory and network bandwidth).
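On a Unix-like system, the copy-on-write idea can be sketched with a plain fork (our minimal illustration, not the mechanism of any cited facility): the child inherits a copy-on-write image of the address space, writes it to disk and exits, while the parent keeps computing immediately.

```python
import os
import pickle

# Minimal sketch (ours) of copy-on-write checkpointing on a Unix-like system:
# the parent forks at each checkpoint; the child writes its frozen copy of the
# state to disk and exits, while the parent continues computing right away.

def cow_checkpoint(state, path):
    pid = os.fork()
    if pid == 0:                         # child: sees a frozen copy of the state
        with open(path, "wb") as f:
            pickle.dump(state, f)
        os._exit(0)                      # child exits after writing the checkpoint
    return pid                           # parent: continues without waiting

state = {"iteration": 0, "values": list(range(1000))}
child = cow_checkpoint(state, "ckpt.pkl")
state["iteration"] += 1                  # parent continues; the child's copy is unaffected
os.waitpid(child, 0)                     # reap the child (here we simply wait at the end)
```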
Data compression reduces the size of the checkpoint data to be saved on storage; it
also reduces the time to save the checkpoint data. However, it takes time and computer
resources to carry out the compression. Plank [55] showed that checkpointing can
benefit from data compression techniques. However, data compression depends on
the compression ratio and application state. If the amount of data to compress is
large, it consumes more memory, which will result in performance degradation of the
executing application. When data is compressed, it will require more time to restart
the application due to decompression time.
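The trade-off described above can be measured directly, as in the small Python sketch below (ours; the state dictionary and compression level are arbitrary illustrative choices): smaller files and shorter writes are bought with extra CPU time for compression, plus decompression time at restart.

```python
import pickle
import time
import zlib

# Small sketch (ours) of compressing checkpoint data before writing it,
# showing the size reduction and the CPU time spent on compression.

state = {"grid": [float(i % 97) for i in range(500_000)], "step": 42}

raw = pickle.dumps(state)
t0 = time.perf_counter()
compressed = zlib.compress(raw, 6)
t1 = time.perf_counter()

print(f"raw: {len(raw)/1e6:.1f} MB, compressed: {len(compressed)/1e6:.1f} MB "
      f"(ratio {len(raw)/len(compressed):.1f}x), compression took {t1 - t0:.3f} s")

restored = pickle.loads(zlib.decompress(compressed))   # restart pays the decompression cost
assert restored["step"] == 42
```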
9 Summary
We presented the reliability and MTBF of HPC systems. Based on our analysis and published papers, we showed that the reliability and MTBF of HPC systems with large
Table 2 Checkpoint/restart facilities. For each facility, the columns of the original table are given in order: author; checkpoint name and URL; brief description; transparency; OS/application coverage; checkpoint initiation; socket support.

– CRAK [79] (Zhong and Nieh, 2001), http://systems.cs.columbia.edu/archive/pub/2001/11/. Description: requires no modification of the OS or application code; target processes are stopped before they are checkpointed. Transparency: transparent; kernel module utilities. Coverage: supports migration of networked processes, but does not support virtualization or multi-threaded processes; works on Linux 2.2 and 2.4 kernels. Initiation: user initiated. Sockets: supports TCP/UDP sockets.

– Epckpt [54] (Pinheiro, 2001), http://www.research.rutgers.edu/~edpin/epckpt/. Description: supports symmetric multiprocessors and does not require modification of the OS or application code in order to use the facility. Transparency: transparent; kernel-level implementation. Coverage: supports System V IPC (semaphores and shared memory), forked parallel applications, and dynamically loaded libraries; Linux 2.0, 2.2 and 2.4 kernels. Initiation: user initiated, non-periodic. Sockets: cannot checkpoint sockets, timers (sleeping processes will be awakened) or System V IPC message queues.

– Condor [72] (Condor Team, 2010), http://www.cs.wisc.edu/condor/. Description: the Condor checkpoint/restart facility is enabled by the user through linking the program source code with the Condor system call library. Transparency: not transparent; library implementation. Coverage: supports single-process jobs, but multi-process jobs and system calls are not supported; multiple kernel-level threads and memory-mapped programs are not allowed; works on kernel 2.4 and later. Initiation: periodic and user initiated. Sockets: interprocess communication is not allowed (e.g., pipes, semaphores, and shared memory).

– Libckpt [56] (Plank et al., 1995), http://web.eecs.utk.edu/~plank/plank/www/libckpt.html. Description: implemented in user space; uses copy-on-write and incremental checkpointing mechanisms but requires recompiling of the source code. Transparency: not completely transparent; library implementation. Coverage: supports files and multiprocessors; does not support multithreading, pipes, Sys V IPC or distributed applications. Initiation: periodic. Sockets: does not support sockets.

– CoCheck [69] (Stellner, 1996), http://www.lrr.in.tum.de/Par/tools/Projects.Old/CoCheck.html. Description: user-level MPI implementation; CoCheck uses a special process to coordinate checkpoints. Transparency: transparent; library implementation. Coverage: supports parallel processes running on multicomputers and can be ported to different machine platforms; CoCheck cannot process a checkpoint request when a send operation is in progress [63]. Initiation: periodic. Sockets: supports TCP sockets.

– DMTCP [1] (Ansel et al., 2009), http://dmtcp.sourceforge.net/. Description: coordinated, transparent, user-level checkpointing for distributed applications. Transparency: transparent; library implementation. Coverage: supports distributed and multithreaded applications; Linux 2.4.x and later. Initiation: periodic and manually initiated. Sockets: provides support for sockets but does not support multicast or RDMA (remote direct memory access).

– BLCR [21] (Duell et al., 2002), https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml. Description: system-level and MPI implementation for clusters. Transparency: transparent. Coverage: supports serial and parallel jobs, on a single machine or running across multiple machines on cluster nodes; partially supports multithreaded applications; the BLCR kernel modules are portable across different CPU architectures; works on kernel 2.4.x and later. Initiation: user initiated. Sockets: does not checkpoint or restore open sockets or files such as TCP/UDP.

– Ckpt [78] (Zandy, 2002), http://pages.cs.wisc.edu/~zandy/ckpt/. Description: implemented at user level; supports asynchronous checkpoints and does not require re-linking of programs. Transparency: transparent; library implementation. Coverage: provides checkpointing functionality to an ordinary program; Linux 2.4 and later. Initiation: periodic checkpoint or manual initiation. Sockets: does not support TCP/UDP sockets.

– Dynamite [52] (Overeinder et al., 1996), http://www.science.uva.nl/research/scs/Software/ckpt/#references. Description: user-level implementation. Transparency: not transparent; requires re-linking of libraries. Coverage: supports open files, dynamically loaded libraries and parallel processes (PVM/MPI), but does not support multithreaded applications; Linux 2.0, 2.2 and later. Initiation: periodic. Sockets: supports TCP/UDP sockets.

– Zap [51] (Osman, 2002), http://www.ncl.cs.columbia.edu/research/migrate/. Description: uses partial OS virtualization to allow the migration of process domains; uses the checkpoint/restart mechanism of CRAK with a modified Linux kernel. Transparency: transparent; kernel module and library. Coverage: supports single-threaded and multithreaded processes as well as SYS V IPC; Linux 2.4 and later. Initiation: user initiated. Sockets: supports TCP/UDP sockets and device files.

– CHPOX [70] (Sudakov et al., 2007), http://freshmeat.net/projects/chpox/. Description: system-level implementation that does not require modification of the OS or user programs. Transparency: transparent; uses a kernel module. Coverage: supports files and pipes; multithreaded programs are not supported; Linux 2.4 and later. Initiation: user initiated. Sockets: network sockets are not supported.

– Porch [58] (Ramkumar and Strumpen, 1997), http://supertech.csail.mit.edu/porch/. Description: implemented in user space; uses source-to-source compilation to provide a checkpointing solution in heterogeneous environments. Transparency: not transparent; requires recompiling. Coverage: multithreaded and distributed applications are not supported. Initiation: periodic checkpoint. Sockets: file I/O and socket I/O are not supported.

– Esky [28] (Gibson), http://esky.sourceforge.net/. Description: user-level checkpointing that uses job freezing (checkpoint/resume) techniques for Unix processes. Transparency: transparent; library. Coverage: Esky has limited application coverage; it can cope with programs that open or mmap() files; Linux 2.2 and later, and Solaris 2.6. Initiation: user initiated. Sockets: currently has only limited support for shared libraries opened with dlopen().

– OpenMPI (LAM/MPI, LA-MPI) [63] (Sankaran et al., 2004), http://www.lam-mpi.org/. Description: a user-level facility that uses a coordinated protocol and the BLCR library to checkpoint MPI applications. Transparency: not transparent. Coverage: uses the BLCR facility to checkpoint parallel MPI applications; works on recent kernels. Initiation: periodic. Sockets: supports Ethernet, InfiniBand, Myrinet.

– CryoPID [4] (Blackham, 2005), http://cryopid.berlios.de/. Description: uses freeze techniques in checkpointing; it copies the state of a running process and writes it into a file. Transparency: transparent; utilizes a dynamically linked library. Coverage: supports single-threaded processes; does not support multithreaded processes; Linux 2.4 and later. Initiation: user initiated. Sockets: partial support for file descriptors, sockets and X applications.

– Libtckpt [77] (William and James, 2001), http://mtckpt.sourceforge.net/. Description: implemented at user level and requires recompiling. Transparency: not transparent; library. Coverage: supports multithreaded applications, on Linux and Solaris. Initiation: periodic. Sockets: UDP sockets not supported.

– Score [71] (Takahashi et al., 2000). Description: no modifications to the application source are required. Transparency: transparent; library. Coverage: supports parallel applications. Initiation: periodic. Sockets: has support for Myrinet and Ethernet.

– FT-MPI [24] (Fagg and Dongarra, 2000), http://icl.cs.utk.edu/ftmpi/index.html. Description: a coordinated checkpointing facility that uses a message logging protocol to checkpoint applications. Transparency: not transparent. Coverage: supports parallel applications. Initiation: semi-automatic. Sockets: Ethernet, InfiniBand, Myrinet.

– DejaVu [61] (Ruscio et al., 2007). Description: coordinated checkpointing facility implemented in user space; DejaVu virtualizes at the OS interface. Transparency: transparent; library implementation. Coverage: supports parallel and distributed applications; has support for forked processes; permits completely asynchronous checkpoints; also supports anonymous mmap() and incremental checkpointing. Initiation: periodic. Sockets: supports communication sockets; supports InfiniBand through a custom MVAPICH.

– C3 (Cornell Checkpoint (pre-)Compiler) [66] (Schulz et al., 2004). Description: application-level checkpointing and does require program … Transparency: not transparent. Coverage: supports single-threaded and distributed applications; the C3 system is easily ported among … Initiation: program initiated. Sockets: does not support InfiniBand or Myrinet.
References
1. Ansel J, Arya K, Cooperman G (2009) DMTCP: transparent checkpointing for cluster computations
and the desktop. In: 23rd IEEE international parallel and distributed processing symposium, Rome,
Italy, pp 1–12
2. Bartlett W, Spainhower L (2004) Commercial fault tolerance: a tale of two systems. IEEE Trans
Dependable Secure Comput 1(1):87–96
3. Bartlett J, Gray J, Horst B (1986) Fault tolerance in tandem computer systems. Tandem Technical
Report
4. Blackham B (2005) [Online]. Available: http://cryopid.berlios.de/
5. Bosilca G, Bouteiller A, Cappello et al (2002) MPICH-V: toward a scalable fault tolerant MPI for
volatile nodes. In: IEEE/ACM SIGARCH
6. Brown A, Patterson DA (2001) To err is human. In: Proceedings of the first workshop on evaluating
and architecting system dependability (EASY’01), Göteborg, Sweden, July 2001
7. Byoung-Jip K (2005) Comparison of the existing checkpoint systems. Technical report, IBM Watson
8. Cappello F (2009) Fault tolerance in petascale/exascale systems: current knowledge, challenges and
research opportunities. Int J High Perform Comput Appl 23:212–226
9. Cappello F, Geist A, Gropp B, Kale L, Kramer B, Snir M (2009) Toward exascale resilience. Int J
High Perform Comput Appl 23(4):378–388
10. CFDR (2012) [Online]. Available: http://cfdr.usenix.org/
11. Chandy KM, Lamport L (1985) Distributed snapshots: determining global states of distributed sys-
tems. ACM Trans Comput Syst 3(1):63–75
12. Checkpointing.org (2012) Checkpointing [Online]. Available: http://checkpointing.org
13. Chen F (2010) On performance optimization and system design of flash memory based solid state
drives in the storage hierarchy. Ph.D. dissertation, Ohio State University, Computer Science and En-
gineering, Ohio State University
14. Chen L, Avizienis A (1978) N-version programming: a fault-tolerance approach to reliability of soft-
ware operation, June, Toulouse, France, pp 3–9
15. Christodorescu M, Jha S (2003) Static analysis of executables to detect malicious patterns. In: Pro-
ceedings of the 12th USENIX security symposium, pp 169–186
16. Clark C, Fraser K, Hand S et al (2005) Live migration of virtual machines. In: Proceedings of the 2nd
conference on symposium on networked systems design and implementation, vol 2, May 2005, pp
273–286
17. Courtright WV II, Gibson GA (1994) Backward error recovery in redundant disk arrays. In: Proc 1994 computer measurement group conf
18. Cristian F (1991) Understanding fault-tolerant distributed systems. Commun ACM 34(2):56–88
19. Cristian F, Jahanian F (1991) A timestamp-based checkpointing protocol for long-lived distributed
computations. In: Proceedings, tenth symposium on reliable distributed systems
20. Czarnecki K, Østerbye K, Völter M (2002) Generative programming. In: Object-oriented technology
ECOOP 2002 workshop reader. Springer, Berlin/Heidelberg, pp 83–115
21. Duell J, Hargrove P, Roman E (2002) The design and implementation of Berkeley lab’s Linux check-
point/restart. Berkeley Lab Technical Report (publication LBNL-54941), December 2002
22. Duell J, Hargrove P, Roman E (2002) Requirements for Linux checkpoint/restart. Lawrence Berkeley
National Laboratory Technical Report LBNL-49659
23. Elnozahy ENM, Alvisi L, Wang YM, Johnson DB (2002) A survey of rollback-recovery protocols in
message-passing systems. ACM Comput Surv 34(3):375–408
24. Fagg GE, Dongarra J (2000) FT-MPI: fault tolerant MPI, supporting dynamic applications in a dy-
namic world. In: Recent advances in parallel virtual machine and message passing interface, pp 346–
353
25. Fault tolerance, wikipedia (2012) [Online]. Available: http://en.wikipedia.org/wiki/Fault-tolerant_
system
26. Fusion-IO (2012) [Online]. Available: http://www.rpmgmbh.com/download/Whitepaper_Green.pdf
27. Ghaeba JA, Smadia MA, Chebil J (2010) A high performance data integrity assurance based on the
determinant technique. Elsevier, Amsterdam
28. Gibson D (2012) esky [Online]. Available: http://esky.sourceforge.net
29. Grant-Ireson W, Coombs CF (1988) Handbook of reliability engineering and management. McGraw-
Hill, New York
30. Gray J (1990) A census of tandem system availability between 1985 and 1990. IEEE Trans Reliab
39(4):409–418
31. Gwertzman J, Seltzer M (1996) World-wide web cache consistency. In: Proc 1996 USENIX tech conf,
San Diego, CA, Jan 1996, pp 141–152
32. Hobbs C, Becha H, Amyot D (2008) Failure semantics in a SOA environment. In: 3rd int MCeTech
conference on etechnologies, Montréal
33. InfiniBand (2012) [Online]. Available: http://www.infinibandta.org/
34. Johnson C, Holloway C (2007) The dangers of failure masking in fault tolerant software: aspects of a
recent in-flight upset event. In: 2nd institution of engineering and technology international conference
on system safety, pp 60–65
35. Kalaiselvi S, Rajaraman V (2000) A survey of checkpointing algorithms for parallel and distributed
computers, pp 489–510
36. Koch D, Haubelt C, Teich J (2007) Efficient hardware checkpointing concepts, overhead analysis, and
implementation. In: Proceedings of int symp on field programmable gate arrays (FPGA)
37. Koren I, Krishna C (2007) Fault-tolerant systems. Elsevier/Morgan Kaufmann, San Diego, San Mateo
38. Lamport L (1978) Time, clocks, and the ordering of events in a distributed system. Commun ACM
21:558–565
39. Laprie JC, Arlat J, Beounes C, Kanoun K (1990) Definition and analysis of hardware- and software-
fault-tolerant architectures. Computer 23(7):39–51
40. Large software state (2012) [Online]. Available: http://www.safeware-eng.com/White_Papers/
Software%20Safety.htm
41. Li K, Naughton JF, Plank JS (1994) Low-latency, concurrent checkpointing for parallel programs.
IEEE Trans Parallel Distrib Syst 5(8):874–879
42. Liang Y, Zhang Y, Jette et al (2006) BlueGene/L failure analysis and prediction models. In: Inter-
national conference on dependable systems and networks, DSN 2006. IEEE Press, New York, pp
425–434
43. Lofgren KMJ et al (2001) Wear leveling techniques for flash EEPROM systems. US Patent No
6,230,233, 8 May 2001
44. Lu CD (2005) Scalable diskless checkpointing for large parallel systems. Ph.D. dissertation, Univer-
sity of Illinois at Urbana-Champaign
45. Lyons RE, Vanderkulk W (1962) The use of triple-modular redundancy to improve computer reliabil-
ity. IBM J Res Dev 6(2):200–209
46. Maloney A, Goscinski A (2009) A survey and review of the current state of rollback-recovery for
cluster systems. Concurr Comput., 1632–1666
47. Milojicic DS, Douglis F, Paindaveine Y, Wheeler R, Zhou S (2000) Process migration. ACM Comput
Surv 32(3):241–299
48. MPI Forum (1994) MPI: a message-passing interface standard. Int J Supercomput Appl High Perform
Comput
49. Oliner A, Stearley J (2007) What supercomputers say: a study of five system logs. Washington, DC,
pp 575–584
50. Oppenheimer D, Patterson D (2002) Architecture and dependability of large-scale Internet services.
IEEE Internet Comput 6(5):41–49
51. Osman S, Subhraveti D, Su G, Nieh J (2002) The design and implementation of Zap: a system for migrating computing environments. Oper Syst Rev 36(SI):361–376
52. Overeinder BJ, Sloot RN, Heederik RN, Hertzberger LO (1996) A dynamic load balancing system
for parallel cluster computing. Future Gener Comput Syst 12:101–115
53. PETSc (2012) [Online]. Available: http://www.mcs.anl.gov/petsc/petsc-as/
54. Pinheiro E (2001) http://www.research.rutgers.edu/~edpin/epckpt/
55. Plank JS, Li K (1994) ickp: a consistent checkpointer for multicomputers. In: IEEE parallel and
distributed technologies, vol 2, pp 62–67
56. Plank JS, Beck M, Kingsley G, Li K (1995) Libckpt: transparent checkpointing under UNIX. In:
Conference proceedings. Usenix, Berkeley
57. Poledna S (1996) The problem of replica determinism. Kluwer Academic, Boston, pp 29–30
58. Ramkumar B, Strumpen V (1997) Portable checkpointing for heterogeneous architectures. In: Proceedings of the 27th international symposium on fault-tolerant computing (FTCS’97), pp 58–67
59. Randell B (1975) System structure for software fault tolerance. IEEE Trans Softw Eng SE-1(2):220–
232
60. Roman E (2002) A survey of checkpoint/restart implementations. Berkeley Lab Technical Report
(publication LBNL-54942)
61. Ruscio J, Heffner M, Varadarajan S (2007) DejaVu: transparent user-level checkpointing, migration,
and recovery for distributed systems. In: IEEE international parallel and distributed processing sym-
posium, pp 1–10
62. Sancho JC, Petrini F, Davis K, Gioiosa R, Jiang S (2005) Current practice and a direction forward in
checkpoint/restart implementations for fault tolerance. In: Proceedings of the 19th IEEE international
parallel and distributed processing symposium (IPDPS’05)—workshop 18
63. Sankaran S, Squyres JM, Barrett B et al (2005) The LAM/MPI checkpoint/restart framework: system-
initiated checkpointing. Int J High Perform Comput Appl 19(4):479–493
64. Schroeder B, Gibson G (2007) Understanding failures in petascale computers. J Phys Conf Ser
78(1):012022
65. Schroeder B, Gibson GA (2010) A large-scale study of failures in high performance computing sys-
tems. IEEE Trans Dependable Secure Comput 7(4):337–350
66. Schulz M, Bronevetsky G, Fernandes R, Marques D, Pingali K, Stodghill P (2004) Implementation
and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs. In:
Supercomputing, Pittsburgh, PA
67. Shalf J, Dosanjh S, Morrison J (2011) Exascale computing technology challenges. In: VECPAR 2010,
LNCS, vol 6449. Springer, Berlin, Heidelberg, pp 1–25
68. Slivinski T, Broglio C, Wild C et al. (1984) Study of fault-tolerant software technology. NASA CR
172385, Langley Research, Center, VA
69. Stellner G (1996) Cocheck: checkpointing and process migration for MPI. In: Proc IPPS
70. Sudakov OO, Meshcheriakov IS, Boyko YV (2007) CHPOX: transparent checkpointing system for
Linux clusters. In: IEEE international workshop on intelligent data acquisition and advanced comput-
ing systems: technology and applications, pp 159–164
71. Takahashi T, Sumimoto S, Hori A, Harada H, Ishikawa Y (2000) PM2: high performance communi-
cation middleware for heterogeneous network environments. In: Supercomputing, ACM/IEEE 2000
conference. IEEE Press, New York, p 16
72. Team Condor (2010) Condor version 7.5.3 manual. University of Wisconsin–Madison
73. Teodorescu R, Nakano J, Torrellas J (2006) SWICH: a prototype for efficient cache-level checkpoint-
ing and rollback. IEEE Micro
74. Top500 (2012) [Online]. Available: http://www.top500.org
75. Walters J, Chaudhary V (2006) Application-level checkpointing techniques for parallel programs. In:
Proc of the 3rd ICDCIT conf, pp 221–234
76. Wang Y-M, Chung P-Y, Lin I-J, Fuchs WK (1995) Checkpoint space reclamation for uncoordinated
checkpointing in message-passing systems. IEEE Trans Parallel Distrib Syst 6(5):546–554
77. William RD, James EL Jr (2001) User-level checkpointing for LinuxThreads programs. In: FREENIX
track: USENIX annual technical conference
78. Zandy V (2002) ckpt [Online]. Available: http://pages.cs.wisc.edu/~zandy/ckpt/
79. Zhong H, Nieh J (2001) CRAK: Linux checkpoint/restart as a kernel module. Technical Report
CUCS-014-01, Department of Computer Science, Columbia University