0% found this document useful (0 votes)
12 views

Unit 4 Part 3

Bda
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views

Unit 4 Part 3

Bda
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 33

CS 3551 DISTRIBUTED

COMPUTING
Checkpoint Based Recovery

1. Uncoordinated checkpointing
2. Coordinated checkpointing
a. Blocking coordinated checkpointing
b. Non-blocking checkpoint coordination
3. Impossibility of min-process non-blocking checkpointing
4. Communication-induced checkpointing
a. Model-based checkpointing
b. Index-based checkpointing
Checkpoint Based Recovery
● In the checkpoint-based recovery approach, the state of each process and the
communication channel is checkpointed frequently so that, upon a failure, the
system can be restored to a globally consistent set of checkpoints.
● It does does not need to detect, log, or replay non-deterministic events.
Checkpoint-based protocols are therefore less restrictive and simpler to
implement than log-based rollback recovery.
● However, checkpoint-based rollback recovery does not guarantee that
prefailure execution can be deterministically regenerated after a rollback.
● It may not be suitable for applications that require frequent interactions with
the outside world.
1. Uncoordinated Checkpointing
● Each process has autonomy in deciding when to take checkpoints.
● Synchronization overhead is minimal as there is no need for coordination
between processes. ( Lower runtime overhead).
● Autonomy in taking checkpoints also allows each process to select
appropriate checkpoints positions

Drawbacks:

1. Domino effect may occur during a recovery.


2. Recovery is slow because processes need to iterate to find a consistent set of checkpoints.
3. Useless Checkpoint:
a. Since no coordination is done at the time the checkpoint is taken, checkpoints taken by a process may be useless
checkpoints.
b. Useless checkpoints are undesirable because they incur overhead and do not contribute to advancing the
recovery line.
4. forces each process to maintain multiple checkpoints, and to periodically invoke a garbage
collection algorithm to reclaim the checkpoints that are no longer required
5. it is not suitable for applications with frequent output commits because these require global
coordination to compute the recovery line
How consistent global checkpoint is determined?
Steps:

1. When a failure occurs, the recovering process


initiates rollback by broadcasting a
dependency request message to collect all
the dependency information maintained by
each process.
2. When a process receives this message, it
stops its execution and replies with the
dependency information saved on the stable
storage as well as with the dependency
information, if any, which is associated with its
current state.
3. The initiator then calculates the recovery line
based on the global dependency information
and broadcasts a rollback request message
containing the recovery line.
4. Upon receiving this message, a process
whose current state belongs to the recovery
line simply resumes execution; otherwise, it
rolls back to an earlier checkpoint as
indicated by the recovery line.
Blocking Coordinated Checkpoint
Non-Blocking Coordinated Checkpoint
● In this approach the processes need not stop their execution while taking
checkpoints.
● A fundamental problem in coordinated checkpointing is to prevent a process
from receiving application messages that could make the checkpoint
inconsistent.
Solution 1

If channels are FIFO, this


problem can be avoided by
preceding the first
post-checkpoint message on
each channel by a checkpoint
request, forcing each process
to take a checkpoint before
receiving the first
post-checkpoint message,
Solution 2
● If the channels are non-FIFO, the following two approaches can be used: first,
the marker can be piggybacked on every post-checkpoint message.
● When a process receives an application message with a marker, it treats it as
if it has received a marker message, followed by the application message.
● Alternatively, checkpoint indices can serve the same role as markers, where a
checkpoint is triggered when the receiver’s local checkpoint index is lower
than the piggybacked checkpoint index.
Impossibility of min-process non-blocking checkpointing
● A min-process, non-blocking checkpointing algorithm is one that forces only a minimum
number of processes to take a new checkpoint, and at the same time it does not force any
process to suspend its computation.
● Clearly, such checkpointing algorithms will be very attractive. Cao and Singhal [7] showed
that it is impossible to design a min-process, non-blocking checkpointing algorithm.
● Possible Algorithm:
● Phase 1:
○ checkpoint initiator identifies all processes with which it has communicated since the last checkpoint and sends
them a request. Upon receiving the request, each process in turn identifies all processes it has communicated
with since the last checkpoint and sends them a request, and so on, until no more processes can be identified.
● Phase 2:
○ all processes identified in the first phase take a checkpoint.
○ The result is a consistent checkpoint that involves only the participating processes.
○ In this protocol, after a process takes a checkpoint, it cannot send any message until the second phase
terminates successfully, although receiving a message after the checkpoint has been taken is allowable.
● Based on a concept called “Z-dependency,” Cao and Singhal proved that there does not
exist a non-blocking algorithm that will allow a minimum number of processes to take their
checkpoints.
Communication
Induced
Checkpoint
Model Based
Checkpoint
Index-based checkpointing
● Index-based communication-induced checkpointing assigns monotonically
increasing indexes to checkpoints, such that the checkpoints having the same
index at different processes form a consistent state.
● Inconsistency between checkpoints of the same index can be avoided in a
lazy fashion if indexes are piggybacked on application messages to help
receivers decide when they should take a forced a checkpoint.
CS 3551 DISTRIBUTED
COMPUTING
Koo–Toueg coordinated checkpointing algorithm
Koo–Toueg coordinated checkpointing algorithm
Koo–Toueg coordinated checkpointing algorithm
Objective:

● Takes a consistent set of checkpoints and avoids the domino effect and livelock problems during the recovery.
● Processes coordinate their local checkpointing actions such that the set of all checkpoints in the system is consistent.

Assumptions of Checkpointing Algorithm:

● Processes communicate by exchanging messages through communication channels. Communication channels are
FIFO.
● It is assumed that end-to-end protocols (such as the sliding window protocol) exist to cope with message loss due to
rollback recovery and communication failure.
● Communication failures do not partition the network.

Permanent vs Tentative:

1. A permanent checkpoint is a local checkpoint at a process and is a part of a consistent global checkpoint.
2. A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint on the successful termination
of the checkpoint algorithm.
3. In case of a failure, processes roll back only to their permanent checkpoints for recovery.
Checkpoint - Phase 1 Checkpoint - Phase 2
1. An initiating process Pi takes a
tentative checkpoint and requests all 1. Pi informs all the processes of the
other processes to take tentative decision it reached at the end of
checkpoints. the first phase.
2. Each process informs Pi whether it
succeeded in taking a tentative
2. A process, on receiving the
checkpoint. message from Pi, will act
3. A process says “no” to a request if it accordingly.
fails to take a tentative checkpoint,
which could be due to several
3. Therefore, either all or none of the
reasons, depending upon the processes advance the checkpoint
underlying application. by taking permanent checkpoints.
4. If Pi learns that all the processes
have successfully taken tentative
4. The algorithm requires that after a
checkpoints, Pi decides that all process has taken a tentative
tentative checkpoints should be checkpoint, it cannot send
made permanent; otherwise, Pi messages related to the underlying
decides that all the tentative
checkpoints should be discarded. computation until it is informed of
Pi’s decision.
Optimization
Correctness
● A set of permanent checkpoints
taken by this algorithm is
consistent because of the
following two reasons:
○ Either all or none of the processes
take permanent checkpoints;
○ no process sends a message after
taking a tentative checkpoint until
the receipt of the initiating
process’s decision, as by then all
processes would have taken
checkpoints.
● Thus, a situation will not arise
where there is a record of a
message being received but
there is no record of sending it
Assumption

● a single process invokes the algorithm.


Rollback Recovery Algorithm ● It also assumes that the checkpoint and the rollback
recovery algorithms are not invoked concurrently

Phase 1
Phase 2
● An initiating process Pi sends a message
to all other processes to check if they all ● Pi propagates its decision to
are willing to restart from their previous all the processes.
checkpoints.
● A process may reply “no” to a restart
● On receiving Pi’s decision, a
request due to any reason (e.g., it is process acts accordingly.
already participating in a checkpoint or ● During the execution of the
recovery process initiated by some other
process).
recovery algorithm, a
● If Pi learns that all processes are willing to process cannot send
restart from their previous checkpoints, Pi messages related to the
decides that all processes should roll back
to their previous checkpoints. Otherwise, underlying computation while
Pi aborts the rollback attempt and it may it is waiting for Pi’s decision.
attempt a recovery at a later time.
Optimization
Correctness
● All processes restart from an
appropriate state because, if
they decide to restart, they
resume execution from a
consistent state (the
checkpointing algorithm
takes a consistent set of
checkpoints).
CS 3551 DISTRIBUTED
COMPUTING
Juang–Venkatesan algorithm for asynchronous
checkpointing and recovery
Juang–Venkatesan algorithm for asynchronous
checkpointing and recovery
Juang–Venkatesan algorithm for asynchronous
checkpointing and recovery
System Model and Assumptions:

● Communication channels are reliable, deliver the messages in FIFO order, and have infinite buffers.
● The message transmission delay is arbitrary, but finite.
● The processors directly connected to a processor via communication channels are called its
neighbors.
● The underlying computation or application is assumed to be event-driven: a processor P waits until a
message m is received, it processes the message m, changes its state from s to s , and sends zero
or more messages to some of its neighbors.
● Then the processor remains idle until the receipt of the next message.
● The new state s and the contents of messages sent to its neighbors depend on state s and the
contents of message m. The events at a processor are identified by unique monotonically increasing
numbers, ex0, ex1, ex2,
● Storage can be:
○ Volatile Log - Less time but data lost when power is lost
○ Stable Storage - More time but data not lost.
Asynchronous Checkpointing

● After executing an event, a processor records a triplet (s, m, msgs_sent) in its


volatile storage,
○ s is the state of the processor before the event
○ m is the message (including the identity of the sender of m, denoted as m.sender) whose
arrival caused the event
○ msgs_sent is the set of messages that were sent by the processor during the event.
● Therefore, a local checkpoint at a processor consists of the record of an event
occurring at the processor and it is taken without any synchronization with
other processors.
● Periodically, a processor independently saves the contents of the volatile log
in the stable storage and clears the volatile log. (Equivalent to taking a local
checkpoint)
Recovery Algorithm

You might also like