Unit 4 Part 3
CS 3551 DISTRIBUTED COMPUTING
Checkpoint Based Recovery
1. Uncoordinated checkpointing
2. Coordinated checkpointing
a. Blocking coordinated checkpointing
b. Non-blocking checkpoint coordination
3. Impossibility of min-process non-blocking checkpointing
4. Communication-induced checkpointing
a. Model-based checkpointing
b. Index-based checkpointing
Checkpoint Based Recovery
● In the checkpoint-based recovery approach, the state of each process and the
communication channel is checkpointed frequently so that, upon a failure, the
system can be restored to a globally consistent set of checkpoints.
● It does not need to detect, log, or replay non-deterministic events.
Checkpoint-based protocols are therefore less restrictive and simpler to
implement than log-based rollback recovery.
● However, checkpoint-based rollback recovery does not guarantee that
prefailure execution can be deterministically regenerated after a rollback.
● It may not be suitable for applications that require frequent interactions with
the outside world.
1. Uncoordinated Checkpointing
● Each process has autonomy in deciding when to take checkpoints.
● Synchronization overhead is minimal as there is no need for coordination
between processes (lower runtime overhead).
● Autonomy in taking checkpoints also allows each process to select
appropriate checkpoint positions.
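Drawbacks:
● Because checkpoints are taken independently, recovery can suffer from the domino effect, which the coordinated approach below is designed to avoid.
2. Coordinated Checkpointing
● The coordinated checkpointing and recovery algorithm takes a consistent set of checkpoints and avoids the domino effect and livelock problems during recovery.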
● Processes coordinate their local checkpointing actions such that the set of all checkpoints in the system is consistent.
● Processes communicate by exchanging messages through communication channels. Communication channels are
FIFO.
● It is assumed that end-to-end protocols (such as the sliding window protocol) exist to cope with message loss due to
rollback recovery and communication failure.
● Communication failures do not partition the network.
Permanent vs Tentative:
1. A permanent checkpoint is a local checkpoint at a process and is a part of a consistent global checkpoint.
2. A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint on the successful termination
of the checkpoint algorithm.
3. In case of a failure, processes roll back only to their permanent checkpoints for recovery.
Checkpoint - Phase 1
1. An initiating process Pi takes a tentative checkpoint and requests all other processes to take tentative checkpoints.
2. Each process informs Pi whether it succeeded in taking a tentative checkpoint.
3. A process says “no” to a request if it fails to take a tentative checkpoint, which could be due to several reasons, depending upon the underlying application.
4. If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides that all tentative checkpoints should be made permanent; otherwise, Pi decides that all the tentative checkpoints should be discarded.
Checkpoint - Phase 2
1. Pi informs all the processes of the decision it reached at the end of the first phase.
2. A process, on receiving the message from Pi, will act accordingly.
3. Therefore, either all or none of the processes advance the checkpoint by taking permanent checkpoints.
4. The algorithm requires that after a process has taken a tentative checkpoint, it cannot send messages related to the underlying computation until it is informed of Pi’s decision.
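The two phases above can be sketched as a small single-machine simulation. This is a minimal illustration, not the algorithm's reference implementation: the `Process` class, the `can_checkpoint` flag (standing in for application-specific reasons to say “no”), and `run_checkpoint` are all hypothetical names, and message passing is replaced by direct method calls.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    """One participant. `can_checkpoint` is a hypothetical stand-in for
    application-specific reasons to refuse a tentative checkpoint."""
    name: str
    state: int = 0
    can_checkpoint: bool = True
    tentative: Optional[int] = None
    permanent: Optional[int] = None
    blocked: bool = False  # True while waiting for the initiator's decision

    def take_tentative(self) -> bool:
        # Phase 1: save the state tentatively and stop sending
        # application messages until the decision arrives.
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.blocked = True
        return True

    def decide(self, commit: bool) -> None:
        # Phase 2: make the tentative checkpoint permanent, or discard it.
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.blocked = False


def run_checkpoint(initiator: Process, others: List[Process]) -> bool:
    """Return True if every process advanced to a new permanent checkpoint."""
    everyone = [initiator, *others]
    # Phase 1: the initiator checkpoints first, then collects yes/no replies.
    replies = [p.take_tentative() for p in everyone]
    commit = all(replies)
    # Phase 2: propagate the single decision to all processes.
    for p in everyone:
        p.decide(commit)
    return commit
```

If any one process refuses, `run_checkpoint` returns False and every process keeps its previous permanent checkpoint, which is the all-or-none property stated in Phase 2.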
Optimization
Correctness
● A set of permanent checkpoints taken by this algorithm is consistent because of the following two reasons:
○ Either all or none of the processes take permanent checkpoints;
○ No process sends a message after taking a tentative checkpoint until the receipt of the initiating process’s decision, as by then all processes would have taken checkpoints.
● Thus, a situation will not arise where there is a record of a message being received but there is no record of sending it.
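The condition in the last bullet (no message recorded as received without being recorded as sent) can be checked mechanically. The sketch below assumes a simplified model that is not part of the slides: each process numbers its own events, a checkpoint set is a map from process name to the last event it includes, and a message is a `(sender, send_event, receiver, receive_event)` tuple. The function names are illustrative.

```python
from typing import Dict, List, Tuple

# (sender, send_event, receiver, receive_event); event indices are
# per-process counters. This representation is an assumption of the sketch.
Message = Tuple[str, int, str, int]


def orphan_messages(cut: Dict[str, int], messages: List[Message]) -> List[Message]:
    """Return messages whose receipt is recorded but whose send is not."""
    orphans = []
    for sender, send_ev, receiver, recv_ev in messages:
        received_in_cut = recv_ev <= cut[receiver]
        sent_in_cut = send_ev <= cut[sender]
        if received_in_cut and not sent_in_cut:
            orphans.append((sender, send_ev, receiver, recv_ev))
    return orphans


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """A set of checkpoints is consistent if it contains no orphan message."""
    return not orphan_messages(cut, messages)
```

For example, with one message sent at P1's event 3 and received at P2's event 2, the checkpoint set `{"P1": 2, "P2": 2}` is inconsistent, because P2 has recorded the receipt while P1 has not yet recorded the send.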
Assumption
Rollback Recovery - Phase 1
● An initiating process Pi sends a message to all other processes to check if they all are willing to restart from their previous checkpoints.
● A process may reply “no” to a restart request due to any reason (e.g., it is already participating in a checkpoint or recovery process initiated by some other process).
● If Pi learns that all processes are willing to restart from their previous checkpoints, Pi decides that all processes should roll back to their previous checkpoints. Otherwise, Pi aborts the rollback attempt and it may attempt a recovery at a later time.
Rollback Recovery - Phase 2
● Pi propagates its decision to all the processes.
● On receiving Pi’s decision, a process acts accordingly.
● During the execution of the recovery algorithm, a process cannot send messages related to the underlying computation while it is waiting for Pi’s decision.
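The recovery phases follow the same request/decision pattern and can be sketched in the same style. This is an illustrative simulation only: the `Process` fields, the `busy` flag (standing in for "already participating in another checkpoint or recovery"), and `run_recovery` are hypothetical, and messages are modelled as direct calls.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Process:
    name: str
    state: int
    permanent: int         # state saved at the last permanent checkpoint
    busy: bool = False     # hypothetical: engaged in another checkpoint/recovery
    blocked: bool = False  # True while waiting for the initiator's decision

    def willing_to_restart(self) -> bool:
        # Phase 1 reply. A willing process stops sending application
        # messages until it learns the decision.
        if self.busy:
            return False
        self.blocked = True
        return True

    def decide(self, rollback: bool) -> None:
        # Phase 2: roll back to the permanent checkpoint, or carry on.
        if rollback:
            self.state = self.permanent
        self.blocked = False


def run_recovery(initiator: Process, others: List[Process]) -> bool:
    """Return True if all processes rolled back, False if the attempt aborted."""
    everyone = [initiator, *others]
    # Phase 1: ask every process whether it is willing to restart.
    replies = [p.willing_to_restart() for p in everyone]
    rollback = all(replies)
    # Phase 2: propagate the decision; on abort, nobody rolls back.
    for p in everyone:
        p.decide(rollback)
    return rollback
```

An aborted attempt leaves every state untouched, so the initiator can simply call `run_recovery` again later.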
Optimization
Correctness
● All processes restart from an appropriate state because, if they decide to restart, they resume execution from a consistent state (the checkpointing algorithm takes a consistent set of checkpoints).
CS 3551 DISTRIBUTED
COMPUTING
Juang–Venkatesan algorithm for asynchronous
checkpointing and recovery
System Model and Assumptions:
● Communication channels are reliable, deliver the messages in FIFO order, and have infinite buffers.
● The message transmission delay is arbitrary, but finite.
● The processors directly connected to a processor via communication channels are called its
neighbors.
● The underlying computation or application is assumed to be event-driven: a processor P waits until a message m is received, processes m, changes its state from s to s′, and sends zero or more messages to some of its neighbors.
● The processor then remains idle until the receipt of the next message.
● The new state s′ and the contents of the messages sent to its neighbors depend on the state s and the contents of message m.
● The events at a processor are identified by unique, monotonically increasing numbers: ex0, ex1, ex2, ...
● Storage can be:
○ Volatile log: faster to access, but its contents are lost when power is lost.
○ Stable storage: slower to access, but its contents are not lost.
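The event-driven model and the two kinds of storage can be pictured with a short sketch. Everything named here is an assumption for illustration: the `Processor` class, the integer state and messages, the user-supplied `transition` function, and the choice to log one record per event in the volatile log and copy it to stable storage on `flush`.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# transition(state, message) -> (new_state, [(neighbor, outgoing_message), ...])
Transition = Callable[[int, int], Tuple[int, List[Tuple[str, int]]]]


@dataclass
class Processor:
    name: str
    transition: Transition
    state: int = 0
    event_no: int = 0  # number of the next event: ex0, ex1, ex2, ...
    volatile_log: List[tuple] = field(default_factory=list)
    stable_storage: List[tuple] = field(default_factory=list)

    def on_message(self, m: int) -> List[Tuple[str, int]]:
        """Handle one received message: move from s to s' and return the
        messages to send to neighbors."""
        new_state, outgoing = self.transition(self.state, m)
        # Record (event number, state before, message, messages sent)
        # in the fast volatile log.
        self.volatile_log.append((self.event_no, self.state, m, outgoing))
        self.state = new_state
        self.event_no += 1
        return outgoing

    def flush(self) -> None:
        # Copy the volatile log to stable storage (slower, but durable).
        self.stable_storage.extend(self.volatile_log)
        self.volatile_log.clear()

    def crash(self) -> None:
        # A failure wipes the volatile log; stable storage is untouched.
        self.volatile_log.clear()
```

Records flushed before a crash remain available afterwards, while anything still in the volatile log is gone, which is the time versus durability trade-off described in the bullets above.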
Asynchronous Checkpointing