Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing
Abstract—Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create reliability problems. The sources of these problems are node failures and the need for dynamic configuration over extensive runtime. This paper presents two fault-tolerance mechanisms, called Theft-Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multithreaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications with a need for adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocols is very small, and the maximum work lost by a crashed process is small and bounded.
operation-induced disruption of parts or the entire execution of the application. We introduce flexible rollback recovery mechanisms that impose no artificial restrictions on the execution. They do not depend on the prefailure configuration and consider 1) node and cluster failures as well as operation-induced unavailability of resources and 2) dynamic topology reconfiguration in heterogeneous systems.

The remainder of this paper is organized as follows: In Section 2, we present the necessary background information and related work. Next, in Section 3, we describe the execution model considered. Two rollback-recovery protocols are introduced in Sections 4 and 5. A theoretical performance and cost analysis of these protocols is presented in Section 6, followed by an experimental validation of the theoretical results in Section 7. Finally, we conclude this paper in Section 8.

2 BACKGROUND

Several fault-tolerance mechanisms exist to overcome the problems described in Section 1. Each fault in a system, be it centralized or largely distributed, has the potential for loss of information, which then has to be reestablished. Recovery is, thus, based on redundancy. Several redundancy principles exist, i.e., time, spatial, and information redundancy. Time redundancy relies on multiple executions skewed in time on the same node. Spatial redundancy, on the other hand, uses physically redundant nodes for the same computations. The final result is derived by voting on the results of the redundant computations. However, there are two disadvantages associated with redundancy:

1. Only a fixed number of faults can be tolerated, depending on the type of fault. The number of redundant computations depends on the fault model, which defines the degree of replication needed to tolerate the faults assumed [18], [29]. The exact types of faults considered, e.g., crash fault or omission fault, and their behavior will be described later in Section 3.4.
2. The necessary degree of redundancy may introduce unacceptable cost associated with the redundant parallel computations and their impact on the infrastructure [24]. This is especially true for intensive Grid computations [2].

As a result, solutions based on replication, i.e., time and spatial redundancy, are, in general, not suitable for Grid computing, where resources are preferably used for the application itself.

In information redundancy, on the other hand, redundant information is added that can be used during recovery to reconstruct the original data or computation. This method is based on the concept of stable storage [10]. One has to note that stable storage is only an abstraction whose implementation depends on the fault model assumed. Implementations of stable storage range from simple local disks, e.g., to deal with the loss of information due to transient faults, to complicated hybrid-redundancy management schemes, e.g., configurations based on RAID technology [21] or survivable storage [32]. We consider two methods based on stable storage, i.e., logging and checkpointing.

2.1 Logging-Based Approaches

Logging [1] can be classified as pessimistic, optimistic, or causal. It is based on the fact that the execution of a process can be modeled as a sequence of state intervals. The execution during a state interval is deterministic. However, each state interval is initiated by a nondeterministic event [27]. Now, assume that the system can capture and log sufficient information about the nondeterministic events that initiated the state interval. This is called the piecewise deterministic (PWD) assumption [27]. Then, a crashed process can be recovered by 1) restoring it to the initial state and 2) replaying the logged events to it in the same order they appeared in the execution before the crash. To avoid a rollback to the initial state of a process and to limit the number of nondeterministic events that need to be replayed, each process periodically saves its local state. Log-based mechanisms in which the only nondeterministic events in a system are the receptions of messages are usually referred to as message logging.

Examples of systems based on message logging include MPICH-V2 [7] and FTL-Charm++ [8]. A disadvantage of log-based protocols for applications with extensive interprocess communication is the potential for large overhead with respect to space and time, due to the logging of messages.

2.2 Checkpointing-Based Approaches

Rather than logging events, checkpointing relies on periodically saving the state of the computation to stable storage [9]. If a fault occurs, the computation is restarted from one of the previously saved states. Since the computation is distributed, one has to consider the tradeoff space of local and global checkpointing strategies and their resulting recovery cost. Thus, checkpointing-based methods differ in the way processes are coordinated and in the derivation of a consistent global state. The consistent global state can be achieved either at the time of checkpointing or at the time of rollback recovery. The two approaches are called coordinated and uncoordinated checkpointing, respectively.

Coordinated checkpointing requires that all processes coordinate the construction of a consistent global state before they write the individual checkpoints to stable storage. The disadvantage is the large latency and overhead associated with coordination. Its advantage is the simplified recovery without rollback propagation and minimal storage overhead, since each process only needs to keep the last checkpoint of the global "recovery line." This kind of protocol is used, e.g., in [26] and [33].

Uncoordinated checkpointing, on the other hand, assumes that each process independently saves its state and a consistent global state is achieved in the recovery phase [10]. The advantage of this method is that each process can make a checkpoint when its state is small. However, there are two main disadvantages. First, there is a possibility of rollback propagation which can result in the domino effect [23], i.e., a cascading rollback to the beginning of the computation. Second, due to the cascading effect, the amount of storage required grows, since each process may have to retain multiple checkpoints.
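For illustration, the following minimal Python sketch contrasts the two stable-storage methods just described: recovery by replaying logged events under the PWD assumption versus restarting from a periodic checkpoint. All names are illustrative assumptions; this is not the interface of any of the systems cited above.

```python
import copy

class Process:
    """Toy deterministic process: state evolves only via events (PWD)."""
    def __init__(self, initial_state):
        self.initial_state = copy.deepcopy(initial_state)
        self.state = copy.deepcopy(initial_state)
        self.event_log = []      # stands in for stable storage (logging)
        self.checkpoint = None   # stands in for stable storage (checkpointing)

    def apply(self, event):
        # Deterministic state-interval transition initiated by 'event'.
        self.state = event(self.state)

    def handle(self, event):
        self.event_log.append(event)   # pessimistic logging: log before use
        self.apply(event)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.event_log.clear()         # events preceding the checkpoint are obsolete

    def recover(self):
        # The two recovery steps of Section 2.1: restore the most recent
        # saved state, then replay logged events in their original order.
        base = self.checkpoint if self.checkpoint is not None else self.initial_state
        self.state = copy.deepcopy(base)
        for event in self.event_log:
            self.apply(event)

# Usage: run, checkpoint, run further, crash, and recover the pre-crash state.
p = Process({"x": 0})
p.handle(lambda s: {"x": s["x"] + 1})
p.take_checkpoint()
p.handle(lambda s: {"x": s["x"] * 10})
p.recover()                  # restores checkpoint, replays the logged event
assert p.state == {"x": 10}
```

The sketch also shows why the periodic local state save mentioned above matters: without the checkpoint, recovery would start from the initial state and replay the entire log.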
Fig. 2. KAAPI processor model.

subgraph $G_i$ of $G$. Thus, the state of the entire application is defined by $G = \bigcup_i G_i$ over all processes $P_i$. Note that this also includes the information associated with dependencies between $G_i$ and $G_j$, $i \neq j$. This is due to the fact that $G_i$, by the definition of the principle of dataflow, contains all information necessary to identify exactly which data is missing.

3.2 Work-Stealing

The runtime environment and primary mechanism for load distribution is based on a scheduling algorithm called work-stealing [11], [12]. The principle is simple: when a process becomes idle, it tries to steal work from another process, called the victim. The initiating process is called the thief.

Work-stealing is the only mechanism for distributing the workload constituting the application, i.e., an idle process seeks to steal work from another process. From a practical point of view, the application starts with the process executing main(), which creates tasks. Typically, some of these tasks are then stolen by idle processes, which are either local or on other processors. Thus, the principal mechanism for dispatching tasks in the distributed environment is task stealing. The communication due to the theft is the only communication between processes. Realizing that task theft creates the only dependencies between processes is crucial to understanding the checkpointing protocol to be introduced later.

With respect to Fig. 1, work-stealing will be the scheduling algorithm of preference at Level 1.

3.3 The KAAPI Environment

The target environment for multithreaded computations with dataflow synchronization between threads is the Kernel for Adaptive, Asynchronous Parallel Interface (KAAPI), implemented as a C++ library. The library is able to schedule programs at fine or medium granularity in a distributed environment.

Fig. 2 shows the general relationship between processors and processes in KAAPI. A processor contains one or more processes. Each process maintains its own stack.

The lifecycle of a task in the KAAPI execution model is depicted in Fig. 3 and will be described first from the local process's and then from a thief's point of view in the context of task stealing.

At task creation, the task enters state created. At this time, it is pushed onto the stack. When all input data is available, the task enters state ready. A ready task which is on the top of the stack can be executed, i.e., it can be popped off the stack, thereby entering state executing. A task in the ready state can also be stolen, in which case it enters the stolen state on the local process, which now becomes a victim. When the task is finished, either on the local process or a thief, it enters state finished and proceeds to state deleted.

If a task has been stolen, the newly created thief process utilizes the same model. In Fig. 2, the theft of task $T_s$ on Process 2 by Process $i$ is shown, as indicated by the arrow. Whereas this example shows task stealing on the same processor, the concept applies also to stealing across processors. On the victim, the stolen task is in state stolen. Upon theft, the stolen task enters state created on the thief. At this instant of time, the stolen task $T_s$ and a task $T_r$ charged with returning the result are the only tasks in the thief's stack, as shown in the figure. Since a stolen task is, by the definition of work-stealing, ready, it immediately enters state ready. It is popped from the stack, thereby entering state executing, and upon finishing, it enters state finished. It should be noted that the task enters this state on both the thief and the victim. For the latter, this is after receiving a corresponding message from the thief. On both processes, the task proceeds to state deleted.

3.4 Fault Model

We will now describe the fault model that the execution model is subjected to. The hybrid fault model described in [29], which defines benign, symmetric, and asymmetric faults, will serve as a basis. Whereas benign faults are globally diagnosable and, thus, self-evident, symmetric and asymmetric faults represent malicious faults which are either consistent or possibly nonconsistent. In general, any fault that can be detected with certainty can be dealt with by our mechanisms. On the one hand, this includes any benign fault such as a crash fault. On the other hand, this includes node volatility [5], e.g., transient and intermittent faults of nodes. It should be noted that results of computations on volatile nodes, which rejoin the system, will be ignored.

In order to deal with symmetric or asymmetric faults, it is necessary that detection mechanisms are available. Such approaches have been shown in [17] and [16] and can theoretically be incorporated in this work.

4 THEFT-INDUCED CHECKPOINTING

As seen in the previous section, the dataflow graph constitutes a global state of the system. In order to use its abstraction for recovery, it is necessary that this global state also represents a consistent global state.

With respect to Fig. 1, we can capture the abstraction of the execution state at two extremes. At Level 0, one assumes the representation derived from the construction of the dataflow graph, whereas at Level 1, the interpretation is derived as the result of its evaluation, which occurs at the time of scheduling.
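As an illustration of this state abstraction, the following Python sketch models the application state as per-process subgraphs $G_i$ whose union forms the global dataflow graph $G$, with task theft as the only source of inter-process dependencies (Sections 3.1-3.3). All names and data structures are illustrative assumptions, not KAAPI internals:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    tid: str
    state: str = "created"  # created -> ready -> executing -> finished -> deleted;
                            # a ready task may instead enter "stolen" on the victim

@dataclass
class Subgraph:
    """G_i: the portion of the global dataflow graph owned by process P_i."""
    owner: int
    tasks: dict = field(default_factory=dict)        # tid -> Task
    theft_edges: list = field(default_factory=list)  # (tid, victim, thief)

def global_state(subgraphs):
    """G is the union of all G_i plus the inter-process (theft) dependencies."""
    tasks = {(g.owner, t.tid): t for g in subgraphs for t in g.tasks.values()}
    edges = [e for g in subgraphs for e in g.theft_edges]
    return tasks, edges

def steal(victim: Subgraph, thief: Subgraph, tid: str):
    """Task theft: the only operation that creates a dependency between two
    subgraphs (Section 3.2), following the lifecycle of Section 3.3."""
    task = victim.tasks[tid]
    assert task.state == "ready", "only ready tasks can be stolen"
    task.state = "stolen"                            # marked stolen on the victim
    thief.tasks[tid] = Task(tid, "ready")            # a stolen task is ready on the thief
    thief.tasks["Tr_" + tid] = Task("Tr_" + tid, "ready")  # task returning the result
    edge = (tid, victim.owner, thief.owner)
    victim.theft_edges.append(edge)
    thief.theft_edges.append(edge)

# A victim P0 with one ready task; an idle thief P1 steals it.
p0, p1 = Subgraph(owner=0), Subgraph(owner=1)
p0.tasks["Ts"] = Task("Ts", "ready")
steal(p0, p1, "Ts")
tasks, edges = global_state([p0, p1])
assert edges == [("Ts", 0, 1)]
assert tasks[(0, "Ts")].state == "stolen" and tasks[(1, "Ts")].state == "ready"
```

Because each $G_i$ carries its theft edges, a saved copy of $G_i$ alone identifies exactly which data is missing and from which process it is expected, which is what makes per-process rollback possible.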
Communication F to G—the return of the result to the victim:

1. If the thief crashes after event F, then condition IC2 arises. Upon reinitiating event E, the victim will simply ignore the duplication. Note that this can only occur in the tiny interval after F and before $P_1$'s termination.
2. A crash of the victim after it has received the result (event G) but before it can checkpoint will result in condition IC3. This would stall the victim after rollback to a state where the task is still flagged as stolen, i.e., $P_0'$ would never receive the result in event G. Therefore, as part of the rollback procedure, the victim inspects the last checkpoint for tasks that have been flagged stolen. If the victim's checkpoint contains references to a thief $P_1$ that is already terminated, it rolls back $P_0$ on $P_0'$ using the checkpoint of $P_0$ together with the thief's final checkpoint containing the result. Thus, the rollback uses $G_0$ and $G_1$ (which contains only $T_r$). On the other hand, if the last checkpoint contains references to thieves that are still executing, no action is required, since the thief, upon attempting to send the results to the old process $P_0$, will experience an error from the transport layer and will inquire about $P_0'$.
3. If the thief is rolled back to $CP_1^3$ and finds out during event F that the victim has crashed as well, it inquires about $P_0'$. $P_0'$ will have either been initiated with $CP_0^2$ or with a checkpoint taken after event D, say $CP_0^3$. In the first case, as the result of the error during event D, $P_0'$ inquires about the replacement victim and updates $CP_0^2$. In the second case, it will be waiting for event G, which is coming from the replacement thief. The thief found out about $P_0'$ as a result of the communication error at event F during the attempt to reach the old victim.

Part 3: So far, we have proven that by using TIC, inconsistencies are avoided. However, it remains to be established why the three forced checkpoints shown (shaded) in Fig. 5 are necessary. Let $CP_1^0$ and $CP_1^f$ denote the first and final checkpoint of a thief $P_1$, respectively. The initial checkpoint $CP_1^0$ guarantees that there exists at least one record of a theft request for a thief that crashes. Thus, upon a crash, the thief is rolled back on the new process $P_1'$. Without $CP_1^0$, any crash before a checkpoint on the thief would simply erase any reference of the theft (event E) and would stall the victim. The final checkpoint of the thief, $CP_1^f$, is needed in case the victim $P_0$ crashes after it has received the results from the thief, but before it could checkpoint its state reflecting the result. Thus, if the victim crashes between event G and its first checkpoint after G, the actions describing Communication F to G will ensure the victim can receive the result of the stolen task.

It should be noted that the final checkpoint of the thief cannot be deleted until the victim has taken a checkpoint after event G, thereby checkpointing the result of the stolen task. Lastly, the forced checkpoint of the victim (between events C and D) ensures that a crash after this checkpoint does not result in the loss of the thief's computation, i.e., there will be a record that allows the victim's replacement process to find the thief. □

The actions described in the proof above constitute a new generation of the protocol, i.e., the concept of a proactive manager, as described in [14] and [15], has been eliminated. It has been replaced with a passive name server implemented on the same reliable storage system that facilitates the checkpoint server.

5 SYSTEMATIC EVENT LOGGING

Whereas the TIC protocol was defined with respect to Level 1 in Fig. 1, we will now introduce a Level 0 protocol called Systematic Event Logging (SEL), which was derived from a log-based method [1]. The motivation for SEL is to reduce the amount of computation that can be lost, which is bound by the execution time of a single failed task.¹ We will later elaborate on the differences between TIC and SEL in their analysis presented in Section 6.

In SEL, only the events relevant for the construction of the dataflow graph are logged. Logging events for tasks are their additions and deletions. Logging events of shared data objects are their additions, modifications, and deletions. A recovery consists of simply loading and rebuilding the subgraph $G_i$ associated with the failed process $P_i$ from the respective log.

The SEL protocol implies the validity of the PWD hypothesis, which was introduced in Section 2.1. For the hypothesis to be valid, the following two conditions must hold:

• $C_1$: Once a task starts executing, it will continue, without being affected by external events, until its execution ends.
• $C_2$: The execution of a task is deterministic with respect to the tasks and shared data objects that are created. Note that this implies that the execution will always create the same (isomorphic) dataflow graph.

At first sight, condition $C_1$ may appear rather restrictive. However, this is not the case for our application domain, i.e., large parallel executions (see (1) below).

If all tasks of a dataflow graph obey conditions $C_1$ and $C_2$, then all processes executing the graph will comply with the PWD hypothesis. The idea behind the proof of this theorem is simple. In the execution model, the execution of tasks is deterministic, whereas the starting time of their execution is nondeterministic. This implies, in turn, that during its execution a task will itself create the same sequence of tasks and data objects.

In case of a fault, task duplication needs to be avoided during rollback. Specifically, in the implementation, one has to guarantee that only one instance of any given task can exist. In the absence of such a guarantee, it could happen that during rollback a task recreates other tasks or data objects that already exist from earlier failed executions. Note that, depending on the timing of the fault, this could result in a significant number of duplicated nodes, since each duplicated task itself may be the initiator of a significant portion of computation. In our implementation of SEL, duplication avoidance is achieved using a unique and reproducible identification method of all vertices in the graph.

6.1.1 Analysis of TIC

In TIC, a checkpoint is performed 1) periodically for each process, as dictated by period $\tau$, and 2) as the result of work-stealing. Let $T_p^{TIC}$ denote the execution time of a parallel program on $p$ processors under TIC. Then

$T_p^{TIC} \leq T_p + \max_{i=1,\ldots,p} Overhead_i^{TIC}$,   (5)

¹ Recall that the task is the smallest unit of execution in the execution model.
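The duplication avoidance described in Section 5 depends on vertex identifiers that are both unique and reproducible across re-executions. One possible realization is sketched below; the hash-based scheme is our assumption for illustration, as the paper does not specify the encoding:

```python
import hashlib

def child_id(parent_id: str, creation_index: int) -> str:
    """Deterministic vertex identifier: by condition C2, the i-th child of a
    task is the same across re-executions, so hashing (parent, index)
    reproduces the same ID after a rollback."""
    digest = hashlib.sha256(f"{parent_id}/{creation_index}".encode()).hexdigest()
    return digest[:16]

class Rebuilder:
    """Rebuilds G_i from a log, skipping vertices that already exist."""
    def __init__(self):
        self.vertices = {}

    def add(self, parent_id: str, index: int, payload) -> bool:
        vid = child_id(parent_id, index)
        if vid in self.vertices:       # duplicate from an earlier failed run
            return False
        self.vertices[vid] = payload
        return True

# Replaying the same creation event twice yields a single vertex.
r = Rebuilder()
assert r.add("root", 0, "task A") is True
assert r.add("root", 0, "task A") is False   # duplicate suppressed
```

Any scheme with these two properties (uniqueness within $G$ and reproducibility under re-execution) would serve; the hash is merely a compact way to obtain both.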
The function $f_{overhead}^{SEL}(\cdot)$ indicates the overhead associated with a single log.

6.2 Analysis of Executions Containing Faults

The overhead associated with fault-free execution is the penalty one pays for having a recovery mechanism. It remains to be shown how much overhead is associated with recovery as the result of a fault and how much execution time can be lost under different strategies.

The overhead associated with recovery is due to loading and rebuilding the affected portions of $G$. This can be effectively achieved by regenerating $G_i$ of the affected processes. Thus, the time of recovery of a single process $P_i$, denoted by $t_i^{recovery}$, depends only on the size of its associated subgraph $G_i$, i.e., $t_i^{recovery} = O(|G_i|)$. Note that for a global recovery, as the result of the failure of the entire application, this translates to $\max_i(t_i^{recovery})$ and not to $\sum_i t_i^{recovery}$.

The way $G_i$ is rebuilt for a failed process differs for the two protocols. Under TIC, rebuilding $G_i$ implies simply reading the structure from the checkpoint. For SEL, this is somewhat more involved, since now $G_i$ has to be reconstructed from the individual logs.

Next, we address the amount of work that a process can lose due to a single fault. In TIC, this is the maximal difference in time between two consecutive checkpoints. This time is defined by the checkpointing period $\tau$ and the execution time of a task, since a checkpoint of a process that is executing a task cannot be made until the task finishes execution. In the worst case, the process receives a checkpointing signal after $\tau$ and has to wait for the end of the execution of its current task before checkpointing. Thus, the time between checkpoints is bound by $\tau + \max(c_i)$, where $c_i$ is the computation time of task $T_i$. But how bad can the impact of $c_i$ be? In a parallel application, it is reasonable to assume $T_1 \gg T_\infty$. Since $T_\infty$ is the critical path of the application, any $c_i \leq T_\infty$. As a result, one can assume $c_i$ to be relatively small.

In SEL, due to its fine granularity of logging, the maximum amount of execution time lost is simply that of a single task. However, this comes at the cost of higher logging overhead, as was addressed in (8).

6.3 Discussion

The overhead of the TIC protocol depends on the number of theft operations and period $\tau$. To reduce the overhead, one needs to increase $\tau$. However, this also increases the maximum amount of computation that can be lost.

For SEL, the overhead depends only on the size of graph $G$, i.e., its vertices $v_i$, which have to be saved. If one wants to reduce the overhead, one has to reduce the size of $G$. This, however, reduces the parallelism of the application.

Comparing the TIC and SEL protocols makes sense only under consideration of the application, e.g., the number of tasks, task size, or parallelism. If $T_1 \gg T_\infty$, given a reasonable value³ for $\tau$, then the overhead of TIC is likely to be much lower than that of SEL, i.e., given (6) and (8), $\lceil T_p^{TIC}/\tau \rceil + O(N_{theft})$ is most likely much smaller than $|G_i|$, thus more than compensating for $f_{overhead}^{SEL}(|v_j|, t_s) < f_{overhead}^{TIC}(N_1, t_s)$, as will be confirmed by the results in Section 7. The reduced overhead has huge implications on the avoidance of bottlenecks in the checkpointing server(s). For applications with large data manipulations, TIC, with an appropriate choice of $\tau$, may be the only choice capable of eliminating storage bottlenecks.

On the other hand, SEL addresses the needs of applications with low tolerance for lost execution time. However, one has to analyze the bandwidth requirements of logging in order to determine feasibility.

It should be emphasized that an advantage of the TIC and SEL protocols is that they do not require replacement resources for failed processes, e.g., the failed process can be rolled back on an existing resource. This is due to the fact that the state of the execution is platform and configuration independent.

Lastly, we want to indicate that, even though the TIC protocol has been motivated by CIC [3], TIC has multiple advantages over CIC. First, unlike CIC, in TIC, only the last checkpoint needs to be kept in stable storage. This has potentially large implications on the amount of data that needs to be stored. Thus, the advantage of TIC is the reduction of checkpointing data as well as the time it takes to recover this data during rollback. The second significant advantage is that in TIC only the failed process needs to be rolled back. Note that, in CIC, all processes must be rolled back after a fault.

7 EXPERIMENTAL RESULTS

7.1 Application and Platform Description

The performance and overhead of the TIC and SEL protocols were experimentally determined for the Quadratic Assignment Problem (instance⁴ NUGENT 22), which was parallelized in KAAPI. The local experiments were conducted on the iCluster2,⁵ which consists of 104 nodes interconnected by a 100-Mbps Ethernet network, each node featuring two Itanium-2 processors (900 MHz) and 3 Gbytes of local memory. The intercluster experiments were conducted on Grid5000 [13], which consists of clusters located at nine French institutions.

In order to take advantage of the distributed fashion of the checkpoint, i.e., $G_i$, each processor has a dedicated checkpoint server. This configuration has two advantages. First, it reflects the theoretical assumptions in Section 6, and second, the actual overhead of the checkpointing mechanism is measured, rather than the overhead associated with a centralized checkpoint server.

7.2 Fault-Free Executions

We will now investigate the overhead of the protocols in fault-free executions, followed by executions containing faults. Then, we show the results of a real-world example executing on heterogeneous and dynamic cluster configurations. We conclude with a comparison of both protocols with the closest counterpart, i.e., Satin [31].

³ Note that unreasonably small values of $\tau$ would result in excessive local checkpointing.
⁴ See http://www.opt.math.tu-graz.ac.at/qaplib/.
⁵ See http://www.inrialpes.fr/sed/i-cluster2.
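As a concrete reading of the bounds derived in Section 6.2, the following sketch computes the worst-case computation lost by a single crashed process under both protocols; the numeric values are invented purely for illustration:

```python
def max_lost_work_tic(tau, task_times):
    # TIC: a checkpoint may be delayed by a full period tau plus the longest
    # task, since a running task cannot be checkpointed (Section 6.2).
    return tau + max(task_times)

def max_lost_work_sel(task_times):
    # SEL: at most the execution time of the single failed task is lost.
    return max(task_times)

task_times = [0.4, 1.2, 0.7]   # c_i in seconds, assumed small relative to T_1
print(max_lost_work_tic(tau=10.0, task_times=task_times))  # 11.2
print(max_lost_work_sel(task_times))                       # 1.2
```

The asymmetry mirrors the discussion above: increasing $\tau$ lowers TIC's checkpointing overhead but raises its worst-case loss, while SEL's loss is fixed at one task regardless of any period.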
Fig. 7. Tasks and application granularity.

Fig. 9. Total volume of data stored.

The impact of the degree of parallelism can be seen in Fig. 7, where the number of parallel tasks generated during execution grows as the size of tasks is reduced. Recall that the number of tasks directly relates to the size of graph $G$, which in turn has implications with respect to the overhead of the protocols. The degree of parallelism increases drastically for threshold 5 and approaches its maximum at threshold 10.

Fig. 8 shows the execution times of the application for different protocols in the absence of faults. Two observations can be made. First, the application scales with the number of processors for all protocols. Second, there is very little difference between the execution times of the protocols for the same number of processors. In fact, the largest difference among the executions was observed in the case of 120 processors and was measured at 7.6 percent. It is easy to falsely conclude that, based on the small differences shown in the scenarios in Fig. 8, all protocols perform approximately the same. The important measure of overhead of the mechanism is the total amount of data associated with the protocol that is sent to stable storage. This overhead is affected by the total size and the number of messages. Due to the efficient distributed configuration of the experiment, which may not be realistic for real-world applications, this overhead was hidden and, thus, does not show in the figure. Fig. 9 addresses this cost, i.e., the cost of the fault-tolerance mechanism that the infrastructure has to absorb, and shows the total volume of checkpointing and logging data stored. The advantages of TIC can be seen in the significant reduction of data, which is most visible for larger periods $\tau$. Furthermore, the data volume stays relatively constant for different numbers of processors. This is due to the fact that the number of thefts, and thus the theft-induced overhead, is actually very small, as was explained in Section 6.

7.3 Executions with Faults

To show the overhead of the mechanisms in the presence of faults, we consider executions containing faults. First, we want to measure the cost induced by the computation lost due to the fault(s) and the overhead of the protocols. Specifically, for each protocol, we show

$\frac{T_p^{withfault} - T_p^0}{T_p^0}$,   (9)

where $T_p^{withfault}$ is the time of execution in the presence of faults and rollback, and $T_p^0$ is the time of a fault-free execution.

Fig. 10 shows the measured cost using (9) for different numbers of faults. The interpretation of $T_p^0$ is the execution time of the application including the overhead of the checkpointing or logging mechanism. One can observe that, as the number of faults increases, the execution time grows linearly. Note that, since the overhead of the protocols is included in $T_p^0$, the values displayed are the computation time lost due to the faults as well as the overhead of rollback, but do not contain the overhead of checkpointing or logging. As expected, and discussed in Section 6.3, the computation lost using SEL is lower than that under TIC, since in SEL, only the execution of a single failed task is lost.
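For reference, the cost metric of (9) is simply the relative slowdown of a faulty execution over a fault-free execution that already includes the protocol overhead; a minimal helper illustrating the computation (the values are invented):

```python
def relative_cost(t_with_fault, t_fault_free):
    """Implements (9): (T_p^withfault - T_p^0) / T_p^0, where T_p^0 already
    includes the checkpointing or logging overhead."""
    return (t_with_fault - t_fault_free) / t_fault_free

print(relative_cost(130.0, 100.0))   # 0.3, i.e., 30 percent added time
```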
environments. As can be seen, the cost in Satin is significantly higher than that in KAAPI/TIC, which used $\tau = 1$ second. The reason is that, in Satin, all computations affected by the fault are lost. In fact, the loss is higher the later the fault occurs during the execution. This is not the case in TIC, where the maximum loss is small, i.e., $\tau + \max(c_i)$, as was shown in Section 6.2. Thus, TIC overcomes this performance deficiency of Satin.

On the other hand, the TIC protocol is pessimistic in the sense that processes are always checkpointed to anticipate a future failure. The result is that for fault-free executions the Satin approach has lower overhead than TIC. However, as was shown in Section 7.2, the overhead of TIC is very small.

For applications with small computation times (linear or quasilinear), Satin also tends to perform better than TIC. The reason is that the time to recompute solutions under Satin may be less than the overhead associated with writing checkpoints to stable storage. However, such applications are difficult to parallelize due to the low computation/communication ratio.

8 CONCLUSIONS

To overcome the problem of applications executing in large systems where the MTTF approaches or sinks below the execution time of the application, two fault-tolerant protocols, TIC and SEL, were introduced. The two protocols take into consideration the heterogeneous and dynamic characteristics of Grid or cluster applications that pose limitations on the effective exploitation of the underlying infrastructure. The flexibility of dataflow graphs has been exploited to allow for a platform-independent description of the execution state. This description resulted in flexible and portable rollback-recovery strategies.

SEL allowed for rollback at the lowest level of granularity, with a maximal computational loss of one task. However, its overhead was sensitive to the size of the associated dataflow graph. TIC experienced lower overhead, related to work-stealing, which was shown bounded by the critical path of the graph. By selecting an appropriate application granularity for SEL and period $\tau$ for TIC, the protocols can be tuned to the specific requirements or needs of the application. A cost model was derived, quantifying the induced overhead of both protocols. The experimental results confirmed the theoretical analysis and demonstrated the low overhead of both approaches.

ACKNOWLEDGMENTS

The authors wish to thank Jean-Louis Roch, ID-IMAG, France, for all the discussions and valuable insight that led to the success of this research.

REFERENCES

[1] L. Alvisi and K. Marzullo, "Message Logging: Pessimistic, Optimistic, Causal and Optimal," IEEE Trans. Software Eng., vol. 24, no. 2, pp. 149-159, Feb. 1998.
[2] K. Anstreicher, N. Brixius, J.-P. Goux, and J. Linderoth, "Solving Large Quadratic Assignment Problems on Computational Grids," Math. Programming, vol. 91, no. 3, 2002.
[3] R. Baldoni, "A Communication-Induced Checkpointing Protocol That Ensures Rollback-Dependency Trackability," Proc. 27th Int'l Symp. Fault-Tolerant Computing (FTCS '97), p. 68, 1997.
[4] F. Baude, D. Caromel, C. Delbé, and L. Henrio, "A Hybrid Message Logging-CIC Protocol for Constrained Checkpointability," Proc. European Conf. Parallel Processing (EuroPar '05), pp. 644-653, 2005.
[5] G. Bosilca et al., "MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes," Proc. ACM/IEEE Conf. Supercomputing (SC '02), Nov. 2002.
[6] A. Bouteiller et al., "MPICH-V2: A Fault Tolerant MPI for Volatile Nodes Based on the Pessimistic Sender Based Message Logging," Proc. ACM/IEEE Conf. Supercomputing (SC '03), pp. 1-17, 2003.
[7] A. Bouteiller, P. Lemarinier, G. Krawezik, and F. Cappello, "Coordinated Checkpoint versus Message Log for Fault Tolerant MPI," Proc. Fifth IEEE Int'l Conf. Cluster Computing (Cluster '03), p. 242, 2003.
[8] S. Chakravorty and L.V. Kale, "A Fault Tolerant Protocol for Massively Parallel Machines," Proc. 18th IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS '04), p. 212a, 2004.
[9] K.M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems," ACM Trans. Computer Systems, vol. 3, no. 1, pp. 63-75, 1985.
[10] E.N. Elnozahy, L. Alvisi, Y.-M. Wang, and D.B. Johnson, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, Sept. 2002.
[11] M. Frigo, C.E. Leiserson, and K.H. Randall, "The Implementation of the Cilk-5 Multithreaded Language," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '98), pp. 212-223, 1998.
[12] F. Galilée, J.-L. Roch, G. Cavalheiro, and M. Doreille, "Athapascan-1: On-Line Building Data Flow Graph in a Parallel Language," Proc. Seventh Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '98), pp. 88-95, 1998.
[13] Grid5000: A Large Scale Nation-Wide Infrastructure for Grid Research, https://www.grid5000.fr, 2006.
[14] S. Jafar, A. Krings, T. Gautier, and J.-L. Roch, "Theft-Induced Checkpointing for Reconfigurable Dataflow Applications," Proc. IEEE Electro/Information Technology Conf. (EIT '05), May 2005.
[15] S. Jafar, T. Gautier, A. Krings, and J.-L. Roch, "A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing," Proc. European Conf. Parallel Processing (EuroPar '05), pp. 675-684, Aug.-Sept. 2005.
[16] A.W. Krings, J.-L. Roch, S. Jafar, and S. Varrette, "A Probabilistic Approach for Task and Result Certification of Large-Scale Distributed Applications in Hostile Environments," Proc. European Grid Conf. (EGC '05), P. Sloot et al., eds., Feb. 2005.
[17] A.W. Krings, J.-L. Roch, and S. Jafar, "Certification of Large Distributed Computations with Task Dependencies in Hostile Environments," Proc. IEEE Electro/Information Technology Conf. (EIT '05), May 2005.
[18] L. Lamport, M. Pease, and R. Shostak, "The Byzantine Generals Problem," ACM Trans. Programming Languages and Systems, vol. 4, no. 3, pp. 382-401, July 1982.
[19] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System," Technical Report CS-TR-97-1346, Univ. of Wisconsin, Madison, 1997.
[20] A. Nguyen-Tuong, A. Grimshaw, and M. Hyett, "Exploiting Data-Flow for Fault-Tolerance in a Wide-Area Parallel System," Proc. 15th Symp. Reliable Distributed Systems (SRDS '96), pp. 2-11, 1996.
[21] D.A. Patterson, G. Gibson, and R.H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proc. ACM SIGMOD '88, pp. 109-116, 1988.
[22] D.K. Pradhan, Fault-Tolerant Computer System Design. Prentice Hall, 1996.
[23] B. Randell, "System Structure for Software Fault Tolerance," Proc. Int'l Conf. Reliable Software, pp. 437-449, 1975.
[24] L. Sarmenta, "Sabotage-Tolerance Mechanisms for Volunteer Computing Systems," Future Generation Computer Systems, vol. 18, no. 4, 2002.
[25] J. Silc, B. Robic, and T. Ungerer, "Asynchrony in Parallel Computing: From Dataflow to Multithreading," Progress in Computer Research, pp. 1-33, 2001.
[26] G. Stellner, "CoCheck: Checkpointing and Process Migration for MPI," Proc. 10th Int'l Parallel Processing Symp. (IPPS '96), pp. 526-531, Apr. 1996.
[27] R. Strom and S. Yemini, "Optimistic Recovery in Distributed Systems," ACM Trans. Computer Systems, vol. 3, no. 3, pp. 204-226, 1985.
[28] V. Strumpen, "Portable and Fault-Tolerant Software Systems," IEEE Micro, vol. 18, no. 5, pp. 22-32, Sept./Oct. 1998.
[29] P. Thambidurai and Y.-K. Park, "Interactive Consistency with Multiple Failure Modes," Proc. Seventh Symp. Reliable Distributed Systems (SRDS '88), pp. 93-100, Oct. 1988.
[30] K.S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications. John Wiley & Sons, 2001.
[31] G. Wrzesinska, R. van Nieuwpoort, J. Maassen, and H.E. Bal, "Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid," Proc. 19th Int'l Parallel and Distributed Processing Symp. (IPDPS '05), p. 13a, Apr. 2005.
[32] J.J. Wylie et al., "Selecting the Right Data Distribution Scheme for a Survivable Storage System," Technical Report CMU-CS-01-120, Carnegie Mellon Univ., May 2001.
[33] G. Zheng, L. Shi, and L.V. Kalé, "FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI," Proc. Sixth IEEE Int'l Conf. Cluster Computing (Cluster '04), pp. 93-103, Sept. 2004.

Samir Jafar received the PhD degree in computer science from the Institut National Polytechnique de Grenoble, France, in 2006, the MS degree from Joseph Fourier University, France, in 2002, and the MS degree in applied mathematics and computer science from Damascus University, in 1998. He is a professor of computer science in the Department of Mathematics, Faculty of Sciences, University of Damascus, Damascus, Syria.

Axel Krings received the PhD and MS degrees in computer science from the University of Nebraska-Lincoln, in 1993 and 1991, respectively, and the MS degree in electrical engineering from the FH-Aachen, Germany, in 1982. He is a professor of computer science at the University of Idaho. Dr. Krings has published extensively in the areas of computer and network survivability, security, fault tolerance, and real-time scheduling. He has organized and chaired conferences and tracks in the area of system survivability and has served on numerous conference program committees. From 2004 to 2005, he was a visiting professor at the Institut d'Informatique et Mathématiques Appliquées de Grenoble, at the Institut National Polytechnique de Grenoble, France. His work has been funded by DoE/INL, DoT/NIATT, DoD/OST, NIST, and CNRS. He is a senior member of the IEEE.

Thierry Gautier received the Dipl.-Ing., MS, and PhD degrees in computer science from the INPG, in 1996. He is a full-time researcher at INRIA (the French National Institute for Computer Science and Control), with Project MOAIS, Laboratoire d'Informatique de Grenoble, France, and has held a postdoctoral position at ETH Zürich (1997). Dr. Gautier conducts research in high-performance computing and has been involved with the design of fault-tolerant protocols. He has led the development of the Kernel for Adaptive, Asynchronous Parallel Interface (KAAPI), which is fundamental to this research.