
Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing

Samir Jafar, Axel Krings, Senior Member, IEEE, and Thierry Gautier

IEEE Transactions on Dependable and Secure Computing, Vol. 6, No. 1, January-March 2009

Abstract—Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of the problems are node failures and the need for dynamic configuration over extensive runtime. This paper presents two fault-tolerance mechanisms called Theft-Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multithreaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications with a need for adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocols is very small, and the maximum work lost by a crashed process is small and bounded.

Index Terms—Grid computing, rollback recovery, checkpointing, event logging.

1 INTRODUCTION AND MOTIVATION

Grid and cluster architectures have gained popularity for computationally intensive parallel applications. However, the complexity of the infrastructure, consisting of computational nodes, mass storage, and interconnection networks, poses great challenges with respect to overall system reliability. Simple tools of reliability analysis show that as the complexity of the system increases, its reliability, and thus, Mean Time to Failure (MTTF), decreases. If one models the system as a series reliability block diagram [30], the reliability of the entire system is computed as the product of the reliabilities of all system components. For applications executing on large clusters or a Grid, e.g., Grid5000 [13], the long execution times may exceed the MTTF of the infrastructure and, thus, render the execution infeasible. As an example, let us consider an execution lasting 10 days in a system that does not consider fault tolerance. Under the optimistic assumption that the MTTF of a single node is 2,000 days, the probability of failure of this long execution using 100, 200, or 500 nodes is 0.39, 0.63, or 0.91, respectively, quickly approaching certain failure. The high failure probabilities are due to the fact that, in the absence of fault-tolerance mechanisms, the failure of a single node will cause the entire execution to fail. Note that this simple example does not even consider network failures, which are typically more likely than computer failure. Fault tolerance is, thus, a necessity to avoid failure in large applications, such as found in scientific computing, executing on a Grid or large cluster.
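As a quick check of these figures, assume node failures are independent and exponentially distributed (an assumption made here only for illustration; the paper quotes the resulting probabilities without stating the model). The probability that at least one of $n$ nodes fails during an execution of length $t$ is then

$$P_{\mathrm{fail}}(n) = 1 - e^{-n\,t/\mathrm{MTTF}}, \qquad P_{\mathrm{fail}}(100) \approx 0.39, \quad P_{\mathrm{fail}}(200) \approx 0.63, \quad P_{\mathrm{fail}}(500) \approx 0.92,$$

with $t = 10$ days and $\mathrm{MTTF} = 2{,}000$ days, matching the quoted values up to rounding.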
The fault-tolerance mechanisms also have to be capable of dealing with the specific characteristics of a heterogeneous and dynamic environment. Even if individual clusters are homogeneous, heterogeneity in a Grid is mostly unavoidable, since different participating clusters often use diverse hardware or software architectures [13]. One possible solution to address heterogeneity is to use platform-independent abstractions such as the Java Virtual Machine. However, this does not solve the problem in general. There is a large base of existing applications that have been developed in other languages. Reengineering may not be feasible due to performance or cost reasons. Environments like Microsoft .Net address portability, but only a few scientific applications on Grids or clusters exist. Whereas Grids and clusters are dominated by unix operating systems, e.g., Linux or Solaris, Microsoft .Net is Windows-centric with only recent or partial unix support.

Besides heterogeneity, one has to address the dynamic nature of the Grid. Volatility is not only an intracluster issue, i.e., configuration changes within a cluster, but also an intercluster reality. Intracluster volatility may be the result of node failures, whereas intercluster volatility is caused by network disruptions between clusters. From an administrative viewpoint, the reality of Grid operation, such as cluster/node reservations or maintenance, may restrict long executions on fixed topologies due to the fact that operation at different sites may be hard to coordinate. It is usually difficult to reserve a large cluster for long executions, let alone to schedule extensive uninterrupted time on multiple, perhaps geographically dispersed, sites. Lastly, configuration changes may be induced by the application as the result of changes of runtime-observable quality-of-service (QoS) parameters.

S. Jafar is with the Department of Mathematics, Faculty of Sciences, University of Damascus, Damascus, Syria. E-mail: [email protected].
A. Krings is with the Department of Computer Science, University of Idaho, Moscow, ID 83844-1010. E-mail: [email protected].
T. Gautier is with INRIA (Projet MOAIS), LIG, 51 Avenue Jean Kuntzmann, F-38330 Montbonnot, St. Martin, France. E-mail: [email protected].
Manuscript received 13 Sept. 2006; revised 31 May 2007; accepted 12 Nov. 2007; published online 12 Mar. 2008. Digital Object Identifier no. 10.1109/TDSC.2008.17.

To overcome the aforementioned problems and challenges, we present mechanisms that tolerate faults and operation-induced disruption of parts or the entire execution of the application. We introduce flexible rollback recovery mechanisms that impose no artificial restrictions on the execution. They do not depend on the prefailure configuration and consider 1) node and cluster failures as well as operation-induced unavailability of resources and 2) dynamic topology reconfiguration in heterogeneous systems.

The remainder of this paper is organized as follows: In Section 2, we present the necessary background information and related work. Next, in Section 3, we describe the execution model considered. Two rollback-recovery protocols are introduced in Sections 4 and 5. A theoretical performance and cost analysis of these protocols is presented in Section 6, followed by an experimental validation of the theoretical results in Section 7. Finally, we conclude this paper in Section 8.

2 BACKGROUND

Several fault-tolerance mechanisms exist to overcome the problems described in Section 1. Each fault in a system, may it be centralized or largely distributed, has the potential for loss of information, which then has to be reestablished. Recovery is, thus, based on redundancy. Several redundancy principles exist, i.e., time, spatial, and information redundancy. Time redundancy relies on multiple executions skewed in time on the same node. Spatial redundancy, on the other hand, uses physically redundant nodes for the same computations. The final result is derived by voting on the results of the redundant computations. However, there are two disadvantages associated with redundancy:

1. Only a fixed number of faults can be tolerated, depending on the type of fault. This number of redundant computations depends on the fault model, which defines the degree of replication needed to tolerate the faults assumed [18], [29]. The exact types of faults considered, e.g., crash fault or omission fault, and their behavior will be described later in Section 3.4.
2. The necessary degree of redundancy may introduce unacceptable cost associated with the redundant parallel computations and its impact on the infrastructure [24]. This is especially true for intensive Grid computations [2].

As a result, solutions based on replication, i.e., time and spatial redundancy, are, in general, not suitable for Grid computing, where resources are preferably used for the application itself.

In information redundancy, on the other hand, redundant information is added that can be used during recovery to reconstruct the original data or computation. This method is based on the existence of the concept of stable storage [10]. One has to note that stable storage is only an abstraction whose implementation depends on the fault model assumed. Implementations of stable storage range from simple local disks, e.g., to deal with the loss of information due to transient faults, to complicated hybrid-redundancy management schemes, e.g., configurations based on RAID technology [21] or survivable storage [32]. We consider two methods based on stable storage, i.e., logging and checkpointing.

2.1 Logging-Based Approaches

Logging [1] can be classified as pessimistic, optimistic, or causal. It is based on the fact that the execution of a process can be modeled as a sequence of state intervals. The execution during a state interval is deterministic. However, each state interval is initiated by a nondeterministic event [27]. Now, assume that the system can capture and log sufficient information about the nondeterministic events that initiated the state interval. This is called the piecewise deterministic (PWD) assumption [27]. Then, a crashed process can be recovered by 1) restoring it to the initial state and 2) replaying the logged events to it in the same order they appeared in the execution before the crash. To avoid a rollback to the initial state of a process and to limit the amount of nondeterministic events that need to be replayed, each process periodically saves its local state. Log-based mechanisms in which the only nondeterministic events in a system are the receptions of messages are usually referred to as message logging.

Examples of systems based on message logging include MPICH-V2 [7] and FTL-Charm++ [8]. A disadvantage of log-based protocols for applications with extensive interprocess communication is the potential for large overhead with respect to space and time, due to the logging of messages.

2.2 Checkpointing-Based Approaches

Rather than logging events, checkpointing relies on periodically saving the state of the computation to stable storage [9]. If a fault occurs, the computation is restarted from one of the previously saved states. Since the computation is distributed, one has to consider the tradeoff space of local and global checkpointing strategies and their resulting recovery cost. Thus, checkpointing-based methods differ in the way processes are coordinated and in the derivation of a consistent global state. The consistent global state can be achieved either at the time of checkpointing or at the time of rollback recovery. The two approaches are called coordinated and uncoordinated checkpointing, respectively.

Coordinated checkpointing requires that all processes coordinate the construction of a consistent global state before they write the individual checkpoints to stable storage. The disadvantage is the large latency and overhead associated with coordination. Its advantage is the simplified recovery without rollback propagation and minimal storage overhead, since each process only needs to keep the last checkpoint of the global "recovery line." This kind of protocol is used, e.g., in [26] and [33].

Uncoordinated checkpointing, on the other hand, assumes that each process independently saves its state and a consistent global state is achieved in the recovery phase [10]. The advantage of this method is that each process can make a checkpoint when its state is small. However, there are two main disadvantages. First, there is a possibility of rollback propagation, which can result in the domino effect [23], i.e., a cascading rollback to the beginning of the computation. Second, due to the cascading effect, the storage requirement is much higher, i.e., each process needs to store multiple checkpoints.
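To make the log-based recovery idea of Section 2.1 concrete, the following is a minimal C++ sketch of recovery under the PWD assumption; the types and function names are illustrative and are not part of any system discussed in this paper.

```cpp
#include <vector>

// One nondeterministic event (e.g., a received message) that starts a new
// state interval; everything between two events is deterministic.
struct Event {
    std::vector<char> payload;       // logged data needed to replay the event
};

struct Process {
    std::vector<char> state;         // placeholder for the process state
    void restoreLastCheckpoint() {   // or the initial state if none exists
        state.clear();
    }
    void apply(const Event& e) {     // deterministic state transition
        state.insert(state.end(), e.payload.begin(), e.payload.end());
    }
};

// Recovery under the piecewise deterministic (PWD) assumption: restore a
// saved state, then replay the logged events in their original order.
void recover(Process& p, const std::vector<Event>& log) {
    p.restoreLastCheckpoint();
    for (const Event& e : log)
        p.apply(e);
}
```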


A compromise between coordinated and uncoordinated checkpointing is communication-induced checkpointing (CIC). To avoid the domino effect that can result from independent checkpoints of different processes, a consistent global state is achieved by forcing each process to take additional checkpoints based on some information piggybacked on the application messages [3]. There are two main disadvantages with this approach. First, it requires global rollback. Second, it can result in the creation, and thus storage, of a large number of unused checkpoints, i.e., checkpoints that will never be used in the construction of a consistent global state. An example of a system using this approach is ProActive [4].

The essential issue in checkpointing and logging methods is to determine what information should be stored in the checkpoint or log. This information will determine the properties and suitable environment of the rollback, e.g., homogeneous versus heterogeneous system architecture or static versus dynamic system configuration. A popular checkpointing library used in systems like CoCheck [26], MPICH-V2 [7], and MPICH-CL [7] is the Condor checkpoint library [19]. In Condor, the information constituting the checkpoint is the execution state of the process and, thus, depends on the specific architecture of the platform which executes the process. As a consequence, rollback is feasible only on an identical platform, and it requires the creation of a replacement process. We will present below an approach that overcomes both of these limitations, using an abstract state of the execution represented by a dataflow graph. This generalizes the approach used in the Satin parallel programming environment [31], which will be further discussed in Section 7.5.

3 EXECUTION MODEL

The general execution model of large Grid applications can be viewed as having two levels, as shown in Fig. 1. Level 0 only creates the "abstraction of the execution state" of the application. This abstraction is then used by Level 1 to actually schedule and, thus, execute the workload.

Fig. 1. Execution model.

In Level 0, the program to be executed is viewed as an abstraction that represents the state symbolizing the future of an execution. By "future," we mean the execution that has not unfolded yet. Specifically, the input to the virtual machine VM0 is the sequential input program supplemented by instructions for the runtime system that describe the parallelism of the application. This is accomplished by two primitives called Task_Creation and Data_Creation. Whereas the first creates (but does not execute) an executable task, the latter creates a shared data object. The sequential program language, together with these primitives, constitutes language L0. Note that L0 is now a language supporting parallelism.

Level 1 takes the abstraction of Level 0 and schedules tasks using the primitives Task_Export, Task_Import, and Task_Execution. The sequential program language, together with these primitives, constitutes the language L1. This language encompasses the scheduling algorithm. Consequently, Level 1 implements the dispatcher, whose decisions (which will affect the future of the execution) will be executed at Level 0. In the figure, this is indicated with the arrow from the virtual machine VM1 to VM0. Note that both levels represent the runtime system; however, whereas the state of the execution is derived at Level 0, the decisions about the future are made at Level 1.

The justification of the general execution model in Fig. 1 is that it is independent of the operating system and the hardware architecture. Furthermore, it does not depend on the number of resources, e.g., processors. As such, the execution model is suitable for heterogeneous and dynamic target systems, e.g., large clusters, Grid, or peer-to-peer systems. We will now explain the aforementioned abstraction of the execution state.

3.1 Dataflow Representation

The representation of the state of an execution is based on the principle of dataflow [25]. Dataflow allows for a natural representation of a parallel execution and can be exploited for fault tolerance [20]. In a dataflow model, tasks, which are the smallest units of execution, become ready for execution upon availability of all their input data. The dependencies among tasks are modeled in a dataflow graph, which is defined as a directed graph G = (V, E), where V is a finite set of vertices and E is a set of edges representing precedence relations between vertices. A vertex v_i ∈ V is either a computational task or a shared data object. An edge e_ij ∈ E represents the dependencies between v_i and v_j. Within the context of this research, G is a dynamic graph, i.e., it changes during runtime as the result of task creations/terminations as well as shared data object creations/deletions.

The dynamic dataflow graph should not be confused with the static precedence graphs often used in scheduling theory. Here, as tasks, data objects, and their dependencies are created/deleted, the graph changes. Within the context of the general execution model, graph G is the representation of the global system state, i.e., the "abstraction of the execution state" shown in Fig. 1.


Fig. 2. KAAPI processor model.
Fig. 3. Lifecycle of a task in KAAPI.

Whereas graph G is viewed as a single virtual dataflow graph, its implementation is in fact distributed. Specifically, each process P_i contains and executes a subgraph G_i of G. Thus, the state of the entire application is defined by G = ∪ G_i over all processes P_i. Note that this also includes the information associated with dependencies between G_i and G_j, i ≠ j. This is due to the fact that G_i, by the definition of the principle of dataflow, contains all information necessary to identify exactly which data is missing.

3.2 Work-Stealing

The runtime environment and primary mechanism for load distribution is based on a scheduling algorithm called work-stealing [11], [12]. The principle is simple: when a process becomes idle, it tries to steal work from another process, called the victim. The initiating process is called the thief.

Work-stealing is the only mechanism for distributing the workload constituting the application, i.e., an idle process seeks to steal work from another process. From a practical point of view, the application starts with the process executing main(), which creates tasks. Typically, some of these tasks are then stolen by idle processes, which are either local or on other processors. Thus, the principal mechanism for dispatching tasks in the distributed environment is task stealing. The communication due to the theft is the only communication between processes. Realizing that task theft creates the only dependencies between processes is crucial to understanding the checkpointing protocol to be introduced later.

With respect to Fig. 1, work-stealing will be the scheduling algorithm of preference at Level 1.

3.3 The KAAPI Environment

The target environment for multithreaded computations with dataflow synchronization between threads is the Kernel for Adaptive, Asynchronous Parallel Interface (KAAPI), implemented as a C++ library. The library is able to schedule programs at fine or medium granularity in a distributed environment.

Fig. 2 shows the general relationship between processors and processes in KAAPI. A processor contains one or more processes. Each process maintains its own stack.

The lifecycle of a task in the KAAPI execution model is depicted in Fig. 3 and will be described first from a local process' and then from a thief's point of view in the context of task stealing.

At task creation, the task enters state created. At this time, it is pushed onto the stack. When all input data is available, the task enters state ready. A ready task which is on the top of the stack can be executed, i.e., it can be popped off the stack, thereby entering state executing. A task in the ready state can also be stolen, in which case it enters the stolen state on the local process, which now becomes a victim. When the task is finished, either on the local process or a thief, it enters state finished and proceeds to state deleted.

If a task has been stolen, the newly created thief process utilizes the same model. In Fig. 2, the theft of task T_s on Process 2 by Process i is shown, as indicated by the arrow. Whereas this example shows task stealing on the same processor, the concept applies also to stealing across processors. On the victim, the stolen task is in state stolen. Upon theft, the stolen task enters state created on the thief. At this instant of time, the stolen task T_s and a task T_r charged with returning the result are the only tasks in the thief's stack, as shown in the figure. Since a stolen task is, by the definition of work-stealing, ready, it immediately enters state ready. It is popped from the stack, thereby entering state executing, and upon finishing, it enters state finished. It should be noted that the task enters this state on the thief and the victim. For the latter, this is after receiving a corresponding message from the thief. On both processes, the task proceeds to state deleted.
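The lifecycle just described can be summarized by the following C++ sketch of the task states and the transitions a local process applies; it mirrors Fig. 3 conceptually and is not the actual KAAPI implementation.

```cpp
// Task states in the lifecycle of Fig. 3 (conceptual sketch).
enum class TaskState { Created, Ready, Executing, Stolen, Finished, Deleted };

struct Task {
    TaskState state = TaskState::Created;   // pushed onto the stack at creation
    bool inputsAvailable = false;
    bool onTopOfStack = false;
};

// Transitions as seen by the local process (the victim, if a theft occurs).
void step(Task& t, bool theftRequested) {
    switch (t.state) {
    case TaskState::Created:
        if (t.inputsAvailable) t.state = TaskState::Ready;
        break;
    case TaskState::Ready:
        if (theftRequested)      t.state = TaskState::Stolen;    // victim side
        else if (t.onTopOfStack) t.state = TaskState::Executing; // popped
        break;
    case TaskState::Executing:
        t.state = TaskState::Finished;
        break;
    case TaskState::Stolen:
        // The victim marks the task finished after the thief reports the result.
        t.state = TaskState::Finished;
        break;
    case TaskState::Finished:
        t.state = TaskState::Deleted;
        break;
    case TaskState::Deleted:
        break;
    }
}
```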


3.4 Fault Model

We will now describe the fault model that the execution model is subjected to. The hybrid fault model described in [29], which defines benign, symmetric, and asymmetric faults, will serve as a basis. Whereas benign faults are globally diagnosable and, thus, self-evident, symmetric and asymmetric faults represent malicious faults which are either consistent or possibly nonconsistent. In general, any fault that can be detected with certainty can be dealt with by our mechanisms. On one side, this includes any benign fault such as a crash fault. On the other hand, this considers node volatility [5], e.g., transient and intermittent faults of nodes. It should be noted that results of computation of volatile nodes, which rejoin the system, will be ignored.

In order to deal with symmetric or asymmetric faults, it is necessary that detection mechanisms are available. Such approaches have been shown in [17] and [16] and can be theoretically incorporated in this work.

4 THEFT-INDUCED CHECKPOINTING

As seen in the previous section, the dataflow graph constitutes a global state of the system. In order to use its abstraction for recovery, it is necessary that this global state also represents a consistent global state.

With respect to Fig. 1, we can capture the abstraction of the execution state at two extremes. At Level 0, one assumes the representation derived from the construction of the dataflow graph, whereas at Level 1, the interpretation is derived as the result of its evaluation, which occurs at the time of scheduling.

In this section, we will introduce a Level 1 protocol capable of deriving a fault-tolerant coherent system state from the interpretation of the execution state. Specifically, we will define a checkpointing protocol called Theft-Induced Checkpointing (TIC).

4.1 Definition of a Checkpoint

As indicated before, a copy of the dataflow graph G represents a global checkpoint of the application. In this research, checkpoints are with respect to a process and consist of a copy of its local G_i, representing the process' stack. The checkpointing protocol must ensure that checkpoints are created in such a way that G is always a consistent global application state, even if only a single process is rolled back. The latter indicates the powerful feature of individual rollbacks.

The checkpoint of G_i itself consists of the entries of the process' state, e.g., its stack. As such, it constitutes its tasks and their associated inputs, and not the task execution state on the processor itself. Understanding this difference between the two concepts is crucial. Checkpointing the tasks and their inputs simply requires storing the tasks and their input data as a dataflow graph. On the other hand, checkpointing the execution of a task usually consists of storing the execution state of the processor as defined by the processor context, i.e., the processor registers such as program counters and stack pointers, as well as data. In the first case, it is possible to move a task and its inputs, assuming that both are represented in a platform-independent fashion. In the latter case, the fact that the process context is platform dependent requires a homogeneous system in order to perform a restore operation, or a virtualization of this state [28].

The jth checkpoint of process P_i will be denoted by CP_i^j. Thus, the subscript denotes the process and the superscript the instance of the checkpoint.

4.2 Checkpoint Protocol Definition

The creation of checkpoints can be initiated 1) by work-stealing or 2) at specific checkpointing periods. We will first describe the protocol with respect to work-stealing, since it is the cause of the only communication (and thus, dependencies) between processes. Checkpoints resulting from work-stealing are called forced checkpoints. Then, we will consider the periodic checkpoints, called local checkpoints, which are stored periodically, after expiration of predefined periods τ.

Fig. 4. TIC protocol: forced checkpoints.
Fig. 5. TIC protocol: local and forced checkpoints.

4.2.1 Forced Checkpoints

The TIC protocol with respect to forced checkpoints is defined in Fig. 4, showing events A through G for two processes P_0 and P_1. Initially, P_0 is executing a task from its stack. The following sequence of events takes place:

1. A process P_1 is created on an idle resource. If it finds a process P_0 that has a potential task to be stolen, it creates a "theft" task T_t charged with stealing a task from process P_0. Before executing T_t, process P_1 checkpoints its state in CP_1^0. Event A is the execution of T_t, which sends a theft request to P_0.
2. Event B is the receipt of the theft request by P_0. Between events B and C, it identifies a task T_s and flags it as "stolen by P_1". Between events B and C, victim P_0 is in a critical section with respect to theft operations.
3. Between events C and D, it forces a checkpoint to reflect the theft. At this time, P_0 becomes a victim. Event D constitutes sending T_s to P_1.
4. Event E is the receipt of the stolen task T_s from P_0. Thief P_1 creates entries for two tasks, T_s and T_r, in its stack, as shown in Fig. 2. Task T_r is charged with returning the results of the execution of T_s to P_0 and becomes ready when T_s finishes.
5. When P_1 finishes the execution of T_s, it takes a checkpoint and executes T_r, which returns the result of T_s to P_0 in event F.
6. Event G is the receipt of the result by P_0.

4.2.2 Local Checkpoints

Local checkpoints of each process P_i are stored periodically, after the expiration of the predefined period τ. Specifically, after the expiration of τ, a process receives a signal to checkpoint. The process can now take a checkpoint. However, there are two exceptions. First, if the process has a task in state executing, it must wait until execution is finished. Second, if a process is in the critical section between events B and C, checkpointing must be delayed until exiting the critical section. A checkpointing scenario comprising local and forced checkpoints is shown in Fig. 5, where local and forced checkpoints are shown unshaded and shaded, respectively. Note that the temporal spacing of the two local (unshaded) checkpoints on process P_0 is at least τ.
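The decision a process makes when the period τ expires (Section 4.2.2) can be sketched as follows; the flag names are illustrative and not part of the KAAPI API.

```cpp
// Per-process flags consulted by the TIC local-checkpoint rule (illustrative).
struct ProcessState {
    bool executingTask = false;           // a task is currently in state "executing"
    bool inTheftCriticalSection = false;  // between events B and C of Fig. 4
};

// Called when the periodic signal fires after period tau expires. Returns true
// if a local checkpoint may be taken now; otherwise it is deferred until the
// current task finishes or the critical section is left.
bool mayCheckpointNow(const ProcessState& p) {
    if (p.executingTask) return false;           // wait for the task to finish
    if (p.inTheftCriticalSection) return false;  // wait until the B-C section is exited
    return true;
}
```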


4.2.3 TIC Rollback

The objective of TIC is to allow rollback of only crashed processes. A process can be rolled back to its last checkpoint. In fact, for each process, only the last checkpoint is kept. We now present a theorem that proves that under TIC a global consistent state of the execution is maintained.

Fig. 6. Sources of inconsistency.

Theorem 1. Under the TIC protocol, the faulty processes can be rolled back, while guaranteeing a consistent global state of the execution.

Proof. In general, to show that a set of checkpoints forms a consistent system state, three conditions must be satisfied [22]: IC1: There is exactly one recovery point for each process. IC2: There is no event for sending a message in a process P after its recovery point, whose corresponding receive event in another process Q is before the recovery point of Q. IC3: There is no event of sending a message in a process P before its recovery point, whose corresponding receive event in another process Q is after the recovery point of Q. The scenarios representing conditions IC2 and IC3 are depicted in Fig. 6.

Proving that condition IC1 is met is trivial since TIC stores only the last checkpoint in stable storage. In the remainder of the proof of TIC, we will consider all actions possible with respect to the events and checkpoints shown in Fig. 5. This enumeration of events and checkpoints is exhaustive.

Part 1: Let us assume that processes do not communicate. It is well known that, under this assumption, a global consistent state of an execution is guaranteed implicitly by using local checkpoints. Thus, in the absence of communication, only the local process is affected by the rollback. In the context of TIC, this means that a process that has not participated in any communication since its last checkpoint, neither as a sender nor receiver, can be rolled back unconditionally to that checkpoint. In Fig. 5, this scenario covers, for each checkpoint, the time interval which starts at the time the checkpoint is established until the next event or checkpoint. If $t(CP_i^j)$ denotes the time at which checkpoint $CP_i^j$ is established and $t(X)$ denotes the time of event $X$, then rollback during the following intervals will maintain a consistent execution state: $[t(CP_0^1), t(B))$, $[t(CP_0^2), t(D))$, $[t(CP_0^3), t(G))$, $[t(CP_0^4), \cdot)$ for process $P_0$ and $[t(CP_1^1), t(A))$, $[t(CP_1^2), t(CP_1^3))$, and $[t(CP_1^3), t(F))$ for process $P_1$. Note that the intervals are open to the right, i.e., the right side of an interval is the time before the event. Furthermore, the open right endpoint in $[t(CP_0^4), \cdot)$ indicates the time of the next event or checkpoint.

Part 2: Now, we prove that TIC can deal with rollback that affects or is affected by communication, i.e., we need to show how TIC effectively avoids inconsistency with respect to conditions IC2 and IC3. Recall that the only communication in the system is that due to task stealing, i.e., three communications per theft, as shown in Fig. 5. An attempt to communicate with a crashed process will result in failure, indicated by an error code generated by the transport layer, e.g., the Transmission Control Protocol (TCP). This error code is used to initiate actions with respect to IC2 and IC3.

We now present systematically, for each of the three communications of TIC, the three possible fault cases as they relate to the treatment of IC2, IC3, and a double fault. The discussion is based on Fig. 5.

Communication A to B—the theft request:

1. If thief P_1 crashes such that it rolls back past event A, condition IC2 arises. This presents no problem for the new process P_1' (replacing the crashed P_1). P_1' simply requests a theft from another process. P_0, on the other hand, will detect the rollback upon unsuccessfully attempting to communicate with the crashed P_1 (in event D), where it receives an error code. P_0, thus, voids the theft, i.e., it unlabels task T_s and takes another checkpoint reflecting its new state. Note that this checkpoint is a new version of the checkpoint between C and D.
2. If victim P_0 crashes after event B but before CP_0^2, then condition IC3 is introduced. However, this presents no problem for P_1, who simply times out while waiting for event E. P_1 makes another request.
3. A double fault implies that upon rollback of P_1 as P_1', the reinitiation of event A returns an error. P_1' will inquire about replacement P_0' for the nonresponding process P_0. If P_0' has not passed event B, then this constitutes a new theft request. If P_0' has been restarted from CP_0^2, then P_0' will detect that the thief has also been rolled back upon an unsuccessful event D and will void the theft. This is exactly the action the victim took in case 1.

Communication D to E—the actual theft:

1. If P_0 fails after event D but before it could checkpoint, then condition IC2 arises. The (rolled back) victim will initiate another event D to the same thief for the same request (indicated by CP_0^2). This is recognized by P_1 as a duplicate and is ignored.
2. If the thief crashes after the actual theft (event E) but before it was able to checkpoint, then condition IC3 arises. The thief is simply rolled back as P_1' to the initial checkpoint CP_1^0, where it will rerequest a task from P_0 (event A). Victim P_0, recognizing the redundant request, changes the state of T_s from stolen to ready, thus nullifying the old theft, and treats the theft request as a new request.
3. The victim is rolled back past event D and finds out the thief does not respond; a double fault. Thus, victim P_0' inquires about the replacement process P_1'. If P_1' was initialized with CP_1^1, it will find out about the new P_0' as the result of a communication error at event A. If P_1' was rolled back with a checkpoint taken after event E, then it takes a new CP_0^2 to reflect that P_1' is the rolled back thief.


Communication F to G—the return of the result to the victim:

1. If the thief crashes after event F, then condition IC2 arises. Upon reinitiating event E, the victim will simply ignore the duplication. Note that this can only occur in the tiny interval after F and before P_1's termination.
2. A crash of the victim after it has received the result (event G) but before it can checkpoint will result in condition IC3. This would stall the victim after rollback to a state where the task is still flagged as stolen, i.e., P_0' would never receive the result in event G. Therefore, as part of the rollback procedure, the victim inspects the last checkpoint for tasks that have been flagged stolen. If the victim's checkpoint contains references to a thief P_1 that is already terminated, it rolls back P_0 on P_0' using the checkpoint of P_0 together with the thief's final checkpoint containing the result. Thus, the rollback uses G_0 and G_1 (which contains only T_r). On the other hand, if the last checkpoint contains references to thieves that are still executing, no action is required, since the thief, upon attempting to send the results to the old process P_0, will experience an error from the transport layer and will inquire about P_0'.
3. If the thief is rolled back to CP_1^3 and finds out during event F that the victim has crashed as well, it inquires about P_0'. P_0' will have either been initiated with CP_0^2 or a checkpoint taken after event D, say CP_0^3. In the first case, as the result of the error during event D, P_0' inquires about the replacement victim and updates CP_0^2. In the second case, it will be waiting for event G, which is coming from the replacement thief. The thief found out about P_0' as a result of the communication error at event F during the attempt to reach the old victim.

Part 3: So far, we have proven that by using TIC, inconsistencies are avoided. However, it remains to be established why the three forced checkpoints shown (shaded) in Fig. 5 are necessary. Let CP_1^0 and CP_1^f denote the first and final checkpoint of a thief P_1, respectively. The initial checkpoint CP_1^0 guarantees that there exists at least one record of a theft request for a thief that crashes. Thus, upon a crash, the thief is rolled back on the new process P_1'. Without CP_1^0, any crash before a checkpoint on the thief would simply erase any reference of the theft (event E) and would stall the victim. The final checkpoint of the thief, CP_1^f, is needed in case the victim P_0 crashes after it has received the results from the thief, but before it could checkpoint its state reflecting the result. Thus, if the victim crashes between event G and its first checkpoint after G, then the actions describing Communication F to G will ensure the victim can receive the result of the stolen task.

It should be noted that the final checkpoint of the thief cannot be deleted until the victim has taken a checkpoint after event G, thereby checkpointing the result of the stolen task. Lastly, the forced checkpoint of the victim (between events C and D) ensures that a crash after this checkpoint does not result in the loss of the thief's computation, i.e., there will be a record that allows the victim's replacement process to find the thief. □

The actions described in the proof above constitute a new generation of the protocol, i.e., the concept of a proactive manager, as described in [14] and [15], has been eliminated. It has been replaced with a passive name server implemented on the same reliable storage system that facilitates the checkpoint server.

5 SYSTEMATIC EVENT LOGGING

Whereas the TIC protocol was defined with respect to Level 1 in Fig. 1, we will now introduce a Level 0 protocol called Systematic Event Logging (SEL), which was derived from a log-based method [1]. The motivation for SEL is to reduce the amount of computation that can be lost, which is bound by the execution time of a single failed task. (Recall that the task is the smallest unit of execution in the execution model.) We will later elaborate on the differences between TIC and SEL in their analysis presented in Section 6.

In SEL, only the events relevant for the construction of the dataflow graph are logged. Logging events for tasks are their additions and deletions. Logging events of shared data objects are their additions, modifications, and deletions. A recovery consists of simply loading and rebuilding subgraph G_i associated with the failed process P_i from the respective log.

The SEL protocol implies the validity of the PWD hypothesis, which was introduced in Section 2.1. For the hypothesis to be valid, the following two conditions must hold:

- C1: Once a task starts executing, it will continue, without being affected by external events, until its execution ends.
- C2: The execution of a task is deterministic with respect to the tasks and shared data objects that are created. Note that this implies that the execution will always create the same (isomorphic) dataflow graph.

At first sight, condition C1 may appear rather restrictive. However, this is not the case for our application domain, i.e., large parallel executions (see (1) below).

If all tasks of a dataflow graph obey conditions C1 and C2, then all processes executing the graph will comply with the PWD hypothesis. The idea behind the proof of this theorem is simple. In the execution model, the execution of tasks is deterministic, whereas the starting time of their execution is nondeterministic. However, this implies, in turn, that during the execution of a task in the execution model, it itself will create the same sequence of tasks and data objects.
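A minimal C++ sketch of the SEL idea follows: only the graph-construction events (task addition/deletion, shared-data addition/modification/deletion) are appended to stable storage, and recovery replays the log to rebuild the subgraph G_i of the failed process. Type and function names are illustrative only.

```cpp
#include <cstdint>
#include <vector>

// Events relevant for the construction of the dataflow graph G_i.
enum class SelEvent : std::uint8_t {
    TaskAdded, TaskDeleted,
    DataAdded, DataModified, DataDeleted
};

struct LogRecord {
    SelEvent event;
    std::uint64_t vertexId;        // assumed unique and reproducible per vertex
    std::vector<char> payload;     // task descriptor or shared-data contents
};

// Append-only log on stable storage (modeled here as an in-memory vector).
using EventLog = std::vector<LogRecord>;

struct Subgraph {                  // placeholder for G_i of process P_i
    void apply(const LogRecord& r) {
        // Add, modify, or delete the vertex identified by r.vertexId.
        (void)r;
    }
};

// Recovery of a failed process: rebuild G_i by replaying its log.
Subgraph recover(const EventLog& log) {
    Subgraph gi;
    for (const LogRecord& r : log)
        gi.apply(r);
    return gi;
}
```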


In case of a fault, task duplication needs to be avoided during rollback. Specifically, in the implementation, one has to guarantee that only one instance of any given task can exist. In the absence of such a guarantee, it could happen that during rollback a task recreates other tasks or data objects that already exist from earlier failed executions. Note that, depending on the timing of the fault, this could result in a significant number of duplicated nodes, since each duplicated task itself may be the initiator of a significant portion of computation. In our implementation of SEL, duplication avoidance is achieved using a unique and reproducible identification method of all vertices in the graph.

6 COMPLEXITY ANALYSIS

In this section, we present a cost model for the TIC and SEL protocols. However, first we want to introduce the necessary notation and analyze the general work-stealing model.

Let $T_{sec}$ be the time of execution of a sequential program on a single processor. Furthermore, let $T_1$ denote the time of the execution of the corresponding parallel program on a single processor, and let $T_\infty$ be the theoretical execution time of the application as executed on an unbounded number of processors. Thus, $T_\infty$ represents the execution time associated with the critical path. It should be noted that in large executions suitable for parallel environments, we always have

$$T_1 \gg T_\infty. \qquad (1)$$

Next, let $T_p$ be the execution time of a program on $p$ identical physical processors. Then, the execution of a parallel program using work-stealing is bound by [11]

$$T_p \leq \frac{T_1}{p} + c_\infty T_\infty, \qquad (2)$$

where constant $c_\infty$ defines a bound on the overhead associated with the critical path, including the scheduling overhead. Furthermore, we have

$$T_1 \leq c_1 T_{sec}, \qquad (3)$$

where $c_1$ corresponds to the maximum overhead induced by parallelism, excluding the cost of scheduling. The constants $c_1$ and $c_\infty$ depend on the specific implementation of the execution model and are a measure of the implementation's efficiency.

To show how little impact the term $c_\infty T_\infty$ of (2) has, one should note that the number of thefts performed by any process (we assume that at any given time, at most one process is active on a processor), denoted by $N_{theft}$, which introduces the scheduling overhead hidden in $c_\infty$, is small [11], [12], since

$$N_{theft} \leq O(T_\infty). \qquad (4)$$

Specifically, with $T_1 \gg T_\infty$, we can approximate (2) by $T_p \approx T_1/p$.

6.1 Analysis of Fault-Free Execution

If we add a checkpointing mechanism, it is of special interest to analyze its overhead associated with fault-free execution, since the occurrence of faults is considered to be the rare exception rather than the norm.

6.1.1 Analysis of TIC

In TIC, a checkpoint is performed 1) periodically for each process, as dictated by period τ, and 2) as the result of work-stealing. Let $T_p^{TIC}$ denote the execution time of a parallel program on $p$ processors under TIC. Then

$$T_p^{TIC} \leq T_p + \max_{i=1,\ldots,p}\left(Overhead_i^{TIC}\right), \qquad (5)$$

where $Overhead_i^{TIC}$ denotes the total TIC checkpointing overhead on processor $P_i$. This overhead depends on the total number of checkpoints taken on processor $P_i$ and the overhead of a single checkpoint. The maximal number of checkpoints performed by a processor is $\lceil T_p^{TIC}/\tau \rceil + O(N_{theft})$, where $T_p^{TIC}/\tau$ indicates the number of checkpoints due to period τ and $N_{theft}$ is the maximal number of thefts performed by any processor. Note that we use $O(N_{theft})$ since, with respect to Fig. 4, the numbers of checkpoints of the thief and the victim are not equal.

The overhead of a single checkpoint in TIC is associated with storing the collection of vertices in $G_i$ and depends on two parameters. First, it depends on the size of $G$. Specifically, it depends on the number of tasks and shared data objects, as well as the size of the latter. Second, it depends on the time of an elementary access to stable storage, denoted by $t_s$.

The number of vertices in $G_i$ has an upper bound of $N_\infty$, which denotes the maximum number of vertices in a path of $G$ [11]. The checkpoint overhead for processor $P_i$ is, thus, bound by

$$Overhead_i^{TIC} = \left(\lceil T_p^{TIC}/\tau \rceil + O(N_{theft})\right) f_{overhead}^{TIC}(N_\infty, t_s). \qquad (6)$$

The function $f_{overhead}^{TIC}(\cdot)$ indicates the overhead associated with a single checkpoint and depends only on $G$, or more precisely $N_\infty$, as well as $t_s$.

6.1.2 Analysis of SEL

As defined in Section 5, in SEL, a log is performed for each of the described events relevant for the construction of $G$, i.e., 1) vertex creation, 2) shared data modification, and 3) vertex deletion. Recall that, in $G = (V, E)$, a vertex $v_i \in V$ is either a task or a shared data object.

Let $T_p^{SEL}$ denote the execution time of a parallel program on $p$ processors under SEL. Then, $T_p^{SEL}$ can be expressed as

$$T_p^{SEL} \leq T_p + \max_{i=1,\ldots,p}\left(Overhead_i^{SEL}\right). \qquad (7)$$

This overhead depends on the total number of vertices in $G_i$ and the overhead of a single event log. The maximal number of logs performed by a processor is $|G_i|$, i.e., the number of vertices in $G_i$.

The overhead of a single event log in SEL is associated with storing a single vertex $v_j$ of $G_i$ and depends on two parameters. Specifically, it depends on the size of $v_j$ and the access time to stable storage $t_s$. Note that if $v_j$ is a task, then the log is potentially very small and of constant size, whereas if it is a data object, then the log size is equal to that of the object. The logging overhead for processor $P_i$ is thus bound by

$$Overhead_i^{SEL} = |G_i| \, f_{overhead}^{SEL}(|v_j|, t_s). \qquad (8)$$

The function $f_{overhead}^{SEL}(\cdot)$ indicates the overhead associated with a single log.
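As a hedged illustration of how (6) and (8) compare, the following C++ sketch evaluates both bounds for user-supplied parameters; the per-record cost arguments stand in for $f_{overhead}^{TIC}$ and $f_{overhead}^{SEL}$, and all inputs are hypothetical.

```cpp
#include <cmath>

// Fault-free overhead bound of TIC, Eq. (6): number of checkpoints
// (periodic plus theft-induced) times the cost of one checkpoint.
double ticOverhead(double execTime, double tau, double nTheft,
                   double costPerCheckpoint /* ~ f_TIC(N_inf, t_s) */) {
    double checkpoints = std::ceil(execTime / tau) + nTheft;
    return checkpoints * costPerCheckpoint;
}

// Fault-free overhead bound of SEL, Eq. (8): one log per vertex of G_i.
double selOverhead(double verticesInGi,
                   double costPerLog /* ~ f_SEL(|v_j|, t_s) */) {
    return verticesInGi * costPerLog;
}

// With purely illustrative numbers, a 1,000 s execution with tau = 1 s and few
// thefts takes on the order of 1,000 checkpoints, whereas a graph with millions
// of vertices forces millions of log records under SEL; this is the comparison
// discussed in Section 6.3.
```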


6.2 Analysis of Executions Containing Faults

The overhead associated with fault-free execution is the penalty one pays for having a recovery mechanism. It remains to be shown how much overhead is associated with recovery as the result of a fault and how much execution time can be lost under different strategies.

The overhead associated with recovery is due to loading and rebuilding the affected portions of $G$. This can be effectively achieved by regenerating $G_i$ of the affected processes. Thus, the time of recovery of a single process $P_i$, denoted by $t_i^{recovery}$, depends only on the size of its associated subgraph $G_i$, i.e., $t_i^{recovery} = O(|G_i|)$. Note that for a global recovery, as the result of the failure of the entire application, this translates to $\max(t_i^{recovery})$ and not to $\sum t_i^{recovery}$.

The way $G_i$ is rebuilt for a failed process differs for the two protocols. Under TIC, rebuilding $G_i$ implies simply reading the structure from the checkpoint. For SEL, this is somewhat more involved, since now $G_i$ has to be reconstructed from the individual logs.

Next, we address the amount of work that a process can lose due to a single fault. In TIC, this is the maximal difference in time between two consecutive checkpoints. This time is defined by the checkpointing period τ and the execution time of a task, since a checkpoint of a process that is executing a task cannot be made until the task finishes execution. In the worst case, the process receives a checkpointing signal after τ and has to wait for the end of the execution of its current task before checkpointing. Thus, the time between checkpoints is bound by $\tau + \max(c_i)$, where $c_i$ is the computation time of task $T_i$. But how bad can the impact of $c_i$ be? In a parallel application, it is reasonable to assume $T_1 \gg T_\infty$. Since $T_\infty$ is the critical path of the application, any $c_i \leq T_\infty$. As a result, one can assume $c_i$ to be relatively small.

In SEL, due to its fine granularity of logging, the maximum amount of execution time lost is simply that of a single task. However, this comes at the cost of higher logging overhead, as was addressed in (8).

6.3 Discussion

The overhead of the TIC protocol depends on the number of theft operations and period τ. To reduce the overhead, one needs to increase τ. However, this also increases the maximum amount of computation that can be lost.

For SEL, the overhead depends only on the size of graph $G$, i.e., its vertices $v_i$, which have to be saved. If one wants to reduce the overhead, one has to reduce the size of $G$. This, however, reduces the parallelism of the application.

Comparing the TIC and SEL protocols makes sense only under consideration of the application, e.g., the number of tasks, task size, or parallelism. If $T_1 \gg T_\infty$, given a reasonable value for τ (unreasonably small values of τ would result in excessive local checkpointing), then the overhead of TIC is likely to be much lower than that of SEL, i.e., given (6) and (8), $\lceil T_p^{TIC}/\tau \rceil + O(N_{theft})$ is most likely much smaller than $|G_i|$, thus more than compensating for $f_{overhead}^{SEL}(|v_j|, t_s) < f_{overhead}^{TIC}(N_\infty, t_s)$, as will be confirmed by the results in Section 7. The reduced overhead has huge implications on the avoidance of bottlenecks in the checkpointing server(s). For applications with large data manipulations, TIC, with an appropriate choice of τ, may be the only choice capable of eliminating storage bottlenecks.

On the other hand, SEL addresses the needs of applications with low tolerance for lost execution time. However, one has to analyze the bandwidth requirements of logging in order to determine feasibility.

It should be emphasized that the advantage of the TIC and SEL protocols is that they do not require replacement resources for failed processes, e.g., the failed process can be rolled back on an existing resource. This is due to the fact that the state of the execution is platform and configuration independent.

Lastly, we want to indicate that, even though the TIC protocol has been motivated by CIC [3], TIC has multiple advantages over CIC. First, unlike CIC, in TIC, only the last checkpoint needs to be kept in the stable storage. This has potentially large implications on the amount of data that needs to be stored. Thus, the advantage of TIC is the reduction of checkpointing data as well as the time it takes to recover this data during rollback. The second significant advantage is that in TIC only the failed process needs to be rolled back. Note that, in CIC, all processes must be rolled back after a fault.


Fig. 7. Tasks and application granularity. Fig. 9. Total volume of data stored.

The impact of the degree of parallelism can be seen in Fig. 7, where the number of parallel tasks generated during execution grows as the size of tasks is reduced. Recall that the number of tasks directly relates to the size of the graph G, which in turn has implications for the overhead of the protocols. The degree of parallelism increases drastically for threshold 5 and approaches its maximum at threshold 10.

Fig. 8 shows the execution times of the application for the different protocols in the absence of faults. Two observations can be made. First, the application scales with the number of processors for all protocols. Second, there is very little difference between the execution times of the protocols for the same number of processors. In fact, the largest difference among the executions was observed in the case of 120 processors and was measured at 7.6 percent. It is easy to conclude falsely, based on the small differences shown in the scenarios in Fig. 8, that all protocols perform approximately the same. The important measure of the overhead of the mechanism is the total amount of data associated with the protocol that is sent to stable storage. This overhead is affected by the total size and the number of messages. Due to the efficient distributed configuration of the experiment, which may not be realistic for real-world applications, this overhead was hidden and, thus, does not show in the figure. Fig. 9 addresses this cost, i.e., the cost of the fault-tolerance mechanism that the infrastructure has to absorb, and shows the total volume of checkpointing and logging data stored. The advantages of TIC can be seen in the significant reduction of data, which is most visible for larger periods $\tau$. Furthermore, the data volume stays relatively constant for different numbers of processors. This is due to the fact that the number of thefts, and thus the theft-induced overhead, is actually very small, as was explained in Section 6.

Fig. 8. Execution times of protocols.

7.3 Executions with Faults
To show the overhead of the mechanisms in the presence of faults, we consider executions containing faults. First, we want to measure the cost induced by the computation lost due to the fault(s) and the overhead of the protocols. Specifically, for each protocol, we show

\[
\frac{T_p^{\mathit{withfault}} - T_p^{0}}{T_p^{0}}\,, \qquad (9)
\]

where $T_p^{\mathit{withfault}}$ is the time of execution in the presence of faults and rollback, and $T_p^{0}$ is the time of a fault-free execution.
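The measurements discussed below (Figs. 10 and 11) differ only in the choice of the baseline $T_p^0$ in (9). The small C++ helper below makes the distinction explicit; the variable names and the timing values are hypothetical and are not taken from the experiments.

    #include <iostream>

    // Relative cost (9): (T_withfault - T_baseline) / T_baseline.
    double relative_cost(double t_with_fault, double t_baseline) {
        return (t_with_fault - t_baseline) / t_baseline;
    }

    int main() {
        // Hypothetical measurements (in seconds) for one fault scenario.
        double t_no_ft         = 100.0; // no fault-tolerance protocol, no fault
        double t_ft_fault_free = 103.0; // protocol enabled (TIC or SEL), no fault
        double t_ft_with_fault = 110.0; // protocol enabled, faults injected

        // Fig. 10 variant: the baseline already contains the protocol overhead,
        // so the ratio isolates the work lost to faults plus the rollback cost.
        std::cout << "lost work + rollback: "
                  << relative_cost(t_ft_with_fault, t_ft_fault_free) << '\n';

        // Fig. 11 variant: the baseline is the unprotected execution, so the
        // ratio additionally contains the checkpointing/logging overhead itself.
        std::cout << "total overhead:       "
                  << relative_cost(t_ft_with_fault, t_no_ft) << '\n';
        return 0;
    }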

Fig. 10 shows the measured cost using (9) for different numbers of faults. The interpretation of $T_p^{0}$ here is the execution time of the application including the overhead of the checkpointing or logging mechanism. One can observe that, as the number of faults increases, the execution time grows linearly. Note that, since the overhead of the protocols is included in $T_p^{0}$, the values displayed reflect the computation time lost due to the faults as well as the overhead of rollback, but they do not include the overhead of checkpointing or logging. As expected, and as discussed in Section 6.3, the computation lost using SEL is lower than that under TIC, since in SEL only the computation of failed tasks is lost. For the experiment, the period in TIC was set at $\tau = 1$ second, and the mean task execution time was 0.23 second.

Fig. 10. Overhead of rollback.

However, Fig. 10, with its interpretation of $T_p^{0}$, does not account for the overhead of checkpointing or logging. This overhead was included in the measurement shown in Fig. 11. Now, $T_p^{0}$ in (9) is the execution time of the application without any fault-tolerance protocol, i.e., neither SEL nor TIC. The measurements reveal that the actual overhead of SEL overshadows its advantages shown in Fig. 10. Specifically, once the overhead of checkpointing (for TIC) and logging (for SEL) is accounted for, the real advantage of the lower checkpointing overhead of TIC surfaces.

Fig. 11. Total overhead considering faults.

7.4 Application Executing on Heterogeneous and Dynamic Grid
Next, we show an application of TIC in a heterogeneous Grid. Four clusters of Grid5000 (geographically dispersed in France) were used, utilizing different hardware architectures. The execution clusters used AMD Opteron, Intel Xeon, and PowerPC architectures, respectively, whereas the stable-storage cluster used Xeons. Fig. 12 summarizes several experiments. First, the entire application was executed on each of the three execution clusters using 30 computational nodes. The respective execution times are shown in the three bars to the left.

Next, the application was executed on all three execution clusters, using 10 nodes on each cluster. Thus, the total number of processors available to the application was again 30. The fourth bar in Fig. 12 shows the time of the fault-free execution (175 seconds) using no fault-tolerance protocol at all. Next, the same experiment was repeated using the TIC protocol with $\tau = 5$ seconds. The result is shown in the fifth bar, which peaks at 185 seconds. The difference in execution times between this and the previous scenario is entirely due to the overhead of TIC and its remote checkpointing. Finally, an execution with a fault in the PowerPC cluster was considered. Specifically, after 50 percent of the application had executed, a fault was injected that affected all 10 nodes of the PowerPC cluster, i.e., the cluster was lost. The affected part of the execution rolled back and finished execution on the remaining 20 processors. One can see (in the bar to the right, indicating 216 seconds) that the execution tolerated the cluster fault exceptionally well, resulting in an overall execution time only 17 percent larger than that of the fault-free case, even though one entire cluster was permanently lost. Furthermore, the rollback was across platforms, i.e., the computations of the failed cluster were dynamically absorbed by the two remaining clusters using different hardware architectures.

Fig. 12. QAP application on Grid5000.

7.5 Comparison with Satin
A fault-tolerant parallel programming environment similar to the approach presented above is Satin [31]. In fact, the Satin environment follows the general execution model presented in Fig. 1. However, its abstraction of the execution state is a series-parallel graph rather than a dataflow graph. As such, Satin only addresses recursive series-parallel programming applications. In Satin, fault tolerance is based on redoing the work lost by the crashed processor(s). To avoid redundant computations, partial results, which are stored in a global replicated table, can later be reused during recovery after a crash.
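The fragment below is a schematic illustration of this recover-by-recomputation idea, written in C++ for consistency with the rest of this section; it does not reproduce Satin's actual Java interfaces. Partial results are memoized in a (conceptually replicated) table keyed by the subproblem, so a subtree that is re-executed after a crash is pruned as soon as a previously stored result is found.

    #include <map>
    #include <utility>

    // Globally replicated table of partial results (schematic; in Satin this
    // table is replicated across the surviving processors).
    using Range = std::pair<int, int>;             // [lo, hi) of a subproblem
    static std::map<Range, long> g_partial_results;

    long solve(int lo, int hi) {
        const Range key{lo, hi};
        auto hit = g_partial_results.find(key);
        if (hit != g_partial_results.end())
            return hit->second;                    // reuse a pre-crash result

        long result;
        if (hi - lo <= 1) {
            result = lo;                           // leaf work (placeholder)
        } else {
            int mid = lo + (hi - lo) / 2;              // in the parallel version,
            result = solve(lo, mid) + solve(mid, hi);  // each half is its own task
        }
        g_partial_results[key] = result;           // publish the partial result
        return result;
    }

In TIC, by contrast, a crashed process is restarted from its last checkpoint of the dataflow graph, so only the work performed since that checkpoint has to be redone rather than an entire subtree.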
To compare the performance of TIC with Satin, a different application was used, i.e., a recursive application resembling a generalization of a Fibonacci computation.
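As a rough sketch of the shape of this benchmark, the sequential kernel below generalizes the Fibonacci recursion to $k$ recursive calls per level. The function name, the parameter $k$, and the cutoff are illustrative assumptions, and the KAAPI and Satin task-spawning code of the real benchmark is not reproduced here.

    #include <cstdint>

    // Generalized Fibonacci: each call depends on its k predecessors, producing
    // a recursive series-parallel computation tree.
    std::int64_t gfib(int n, int k) {
        if (n < k) return 1;                  // small cases computed directly
        std::int64_t sum = 0;
        for (int i = 1; i <= k; ++i)          // in the parallel versions, every
            sum += gfib(n - i, k);            // recursive call is an independent task
        return sum;
    }

In the parallel versions, each recursive call becomes an independent task, yielding exactly the kind of recursive series-parallel task graph that both KAAPI/TIC and Satin execute.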


Fig. 13 shows the result of executions of both approaches for different fault scenarios. Specifically, for each approach, first, an execution without fault is shown. Next, a single fault was injected after 25 percent, 50 percent, and 75 percent of the execution had completed. To eliminate the impact of the different implementation languages and execution environments on the execution times, i.e., C++/KAAPI and Java/Satin, the measurements presented in the figure are relative to the execution times in their respective environments. As can be seen, the cost in Satin is significantly higher than that in KAAPI/TIC, which used $\tau = 1$ second. The reason is that, in Satin, all computations affected by the fault are lost. In fact, the loss is higher the later the fault occurs during the execution. This is not the case in TIC, where the maximum loss is small, i.e., $\tau + \max(c_i)$, as was shown in Section 6.2. Thus, TIC overcomes this performance deficiency of Satin.

Fig. 13. Comparison of Satin with KAAPI/TIC using 32 processors.

On the other hand, the TIC protocol is pessimistic in the sense that processes are always checkpointed in anticipation of a future failure. The result is that, for fault-free executions, the Satin approach has lower overhead than TIC. However, as was shown in Section 7.2, the overhead of TIC is very small.

For applications with small computation times (linear or quasilinear), Satin also tends to perform better than TIC. The reason is that the time to recompute solutions under Satin may be less than the overhead associated with writing checkpoints to stable storage. However, such applications are difficult to parallelize due to their low computation/communication ratio.

8 CONCLUSIONS

To overcome the problem of applications executing in large systems where the MTTF approaches or falls below the execution time of the application, two fault-tolerant protocols, TIC and SEL, were introduced. The two protocols take into consideration the heterogeneous and dynamic characteristics of Grid or cluster applications that pose limitations on the effective exploitation of the underlying infrastructure. The flexibility of dataflow graphs has been exploited to allow for a platform-independent description of the execution state. This description resulted in flexible and portable rollback-recovery strategies.

SEL allowed for rollback at the lowest level of granularity, with a maximal computational loss of one task. However, its overhead was sensitive to the size of the associated dataflow graph. TIC experienced lower overhead, related to work-stealing, which was shown to be bounded by the critical path of the graph. By selecting an appropriate application granularity for SEL and period $\tau$ for TIC, the protocols can be tuned to the specific requirements or needs of the application. A cost model was derived, quantifying the induced overhead of both protocols. The experimental results confirmed the theoretical analysis and demonstrated the low overhead of both approaches.

ACKNOWLEDGMENTS

The authors wish to thank Jean-Louis Roch, ID-IMAG, France, for all the discussions and valuable insight that led to the success of this research.

REFERENCES

[1] L. Alvisi and K. Marzullo, "Message Logging: Pessimistic, Optimistic, Causal and Optimal," IEEE Trans. Software Eng., vol. 24, no. 2, pp. 149-159, Feb. 1998.
[2] K. Anstreicher, N. Brixius, J.-P. Goux, and J. Linderoth, "Solving Large Quadratic Assignment Problems on Computational Grids," Math. Programming, vol. 91, no. 3, 2002.
[3] R. Baldoni, "A Communication-Induced Checkpointing Protocol That Ensures Rollback-Dependency Trackability," Proc. 27th Int'l Symp. Fault-Tolerant Computing (FTCS '97), p. 68, 1997.
[4] F. Baude, D. Caromel, C. Delbé, and L. Henrio, "A Hybrid Message Logging-CIC Protocol for Constrained Checkpointability," Proc. European Conf. Parallel Processing (EuroPar '05), pp. 644-653, 2005.
[5] G. Bosilca et al., "MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes," Proc. ACM/IEEE Conf. Supercomputing (SC '02), Nov. 2002.
[6] A. Bouteiller et al., "MPICH-V2: A Fault Tolerant MPI for Volatile Nodes Based on the Pessimistic Sender Based Message Logging," Proc. ACM/IEEE Conf. Supercomputing (SC '03), pp. 1-17, 2003.
[7] A. Bouteiller, P. Lemarinier, G. Krawezik, and F. Cappello, "Coordinated Checkpoint versus Message Log for Fault Tolerant MPI," Proc. Fifth IEEE Int'l Conf. Cluster Computing (Cluster '03), p. 242, 2003.
[8] S. Chakravorty and L.V. Kale, "A Fault Tolerant Protocol for Massively Parallel Machines," Proc. 18th IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS '04), p. 212a, 2004.
[9] K.M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems," ACM Trans. Computer Systems, vol. 3, no. 1, pp. 63-75, 1985.
[10] E.N. Elnozahy, L. Alvisi, Y.-M. Wang, and D.B. Johnson, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, Sept. 2002.
[11] M. Frigo, C.E. Leiserson, and K.H. Randall, "The Implementation of the Cilk-5 Multithreaded Language," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '98), pp. 212-223, 1998.
[12] F. Galilée, J.-L. Roch, G. Cavalheiro, and M. Doreille, "Athapascan-1: On-Line Building Data Flow Graph in a Parallel Language," Proc. Seventh Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '98), pp. 88-95, 1998.
[13] A Large Scale Nation-Wide Infrastructure for Grid Research, Grid5000, https://www.grid5000.fr, 2006.
[14] S. Jafar, A. Krings, T. Gautier, and J.-L. Roch, "Theft-Induced Checkpointing for Reconfigurable Dataflow Applications," Proc. IEEE Electro/Information Technology Conf. (EIT '05), May 2005.
[15] S. Jafar, T. Gautier, A. Krings, and J.-L. Roch, "A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing," Proc. European Conf. Parallel Processing (EuroPar '05), pp. 675-684, Aug.-Sept. 2005.
[16] A.W. Krings, J.-L. Roch, S. Jafar, and S. Varrette, "A Probabilistic Approach for Task and Result Certification of Large-Scale Distributed Applications in Hostile Environments," Proc. European Grid Conf. (EGC '05), P. Sloot et al., eds., Feb. 2005.
[17] A.W. Krings, J.-L. Roch, and S. Jafar, "Certification of Large Distributed Computations with Task Dependencies in Hostile Environments," Proc. IEEE Electro/Information Technology Conf. (EIT '05), May 2005.
[18] L. Lamport, M. Pease, and R. Shostak, "The Byzantine Generals Problem," ACM Trans. Programming Languages and Systems, vol. 4, no. 3, pp. 382-401, July 1982.
[19] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System," Technical Report CS-TR-97-1346, Univ. of Wisconsin, Madison, 1997.
[20] A. Nguyen-Tuong, A. Grimshaw, and M. Hyett, "Exploiting Data-Flow for Fault-Tolerance in a Wide-Area Parallel System," Proc. 15th Symp. Reliable Distributed Systems (SRDS '96), pp. 2-11, 1996.
[21] D.A. Patterson, G. Gibson, and R.H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proc. ACM SIGMOD '88, pp. 109-116, 1988.
[22] D.K. Pradhan, Fault-Tolerant Computer System Design. Prentice Hall, 1996.
[23] B. Randell, "System Structure for Software Fault Tolerance," Proc. Int'l Conf. Reliable Software, pp. 437-449, 1975.
[24] L. Sarmenta, "Sabotage-Tolerance Mechanisms for Volunteer Computing Systems," Future Generation Computer Systems, vol. 18, no. 4, 2002.
[25] J. Silc, B. Robic, and T. Ungerer, "Asynchrony in Parallel Computing: From Dataflow to Multithreading," Progress in Computer Research, pp. 1-33, 2001.


[26] G. Stellner, "CoCheck: Checkpointing and Process Migration for MPI," Proc. 10th Int'l Parallel Processing Symp. (IPPS '96), pp. 526-531, Apr. 1996.
[27] R. Strom and S. Yemini, "Optimistic Recovery in Distributed Systems," ACM Trans. Computer Systems, vol. 3, no. 3, pp. 204-226, 1985.
[28] V. Strumpen, "Portable and Fault-Tolerant Software Systems," IEEE Micro, vol. 18, no. 5, pp. 22-32, Sept./Oct. 1998.
[29] P. Thambidurai and Y.-K. Park, "Interactive Consistency with Multiple Failure Modes," Proc. Seventh Symp. Reliable Distributed Systems (SRDS '88), pp. 93-100, Oct. 1988.
[30] K.S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications. John Wiley & Sons, 2001.
[31] G. Wrzesinska, R. van Nieuwpoort, J. Maassen, and H.E. Bal, "Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid," Proc. 19th Int'l Parallel and Distributed Processing Symp. (IPDPS '05), p. 13a, Apr. 2005.
[32] J.J. Wylie et al., "Selecting the Right Data Distribution Scheme for a Survivable Storage System," Technical Report CMU-CS-01-120, Carnegie Mellon Univ., May 2001.
[33] G. Zheng, L. Shi, and L.V. Kalé, "FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI," Proc. Sixth IEEE Int'l Conf. Cluster Computing (Cluster '04), pp. 93-103, Sept. 2004.

Samir Jafar received the PhD degree in computer science from the Institut National Polytechnique de Grenoble, France, in 2006, the MS degree from Joseph Fourier University, France, in 2002, and the MS degree in applied mathematics and computer science from Damascus University, in 1998. He is a professor of computer science in the Department of Mathematics, Faculty of Sciences, University of Damascus, Damascus, Syria.

Axel Krings received the PhD and MS degrees in computer science from the University of Nebraska-Lincoln, in 1993 and 1991, respectively, and the MS degree in electrical engineering from the FH-Aachen, Germany, in 1982. He is a professor of Computer Science at the University of Idaho. Dr. Krings has published extensively in the areas of Computer & Network Survivability, Security, Fault-Tolerance, and Real-time Scheduling. He has organized and chaired conferences and tracks in the area of system survivability and has served on numerous conference program committees. From 2004 to 2005, he was a visiting professor at the Institut d'Informatique et Mathématiques Appliquées de Grenoble, at the Institut National Polytechnique de Grenoble, France. His work has been funded by DoE/INL, DoT/NIATT, DoD/OST, NIST, and CNRS. He is a senior member of the IEEE.

Thierry Gautier received the Dipl. Ing., MS, and PhD degrees in computer science from the INPG, in 1996. He is a full-time researcher at INRIA (the French National Institute for Computer Science and Control), with Project MOAIS, Laboratoire d'Informatique de Grenoble, France, and has held a post-doctoral position at ETH Zürich (1997). Dr. Gautier conducts research in high-performance computing and has been involved with the design of fault-tolerant protocols. He has led the development of the Kernel for Asynchronous and Adaptive Interface KAAPI, which is fundamental to this research.

