16_issues in Failure Recovery

Uploaded by

alainemario
Copyright
© © All Rights Reserved
EnggTree.com
CS3551 DISTRIBUTED COMPUTING

Issues in failure recovery


In failure recovery, we must not only restore the system to a consistent state, but also
appropriately handle messages that are left in an abnormal state due to the failure and recovery.

The computation comprises three processes Pi, Pj, and Pk, connected through a communication
network. The processes communicate solely by exchanging messages over fault-free, FIFO
communication channels.

Processes Pi, Pj, and Pk have taken checkpoints.

• The rollback of process 𝑃𝑖 to checkpoint 𝐶𝑖,1 created an orphan message H.
• Orphan message I is created due to the rollback of process 𝑃𝑗 to checkpoint 𝐶𝑗,1.
• Messages C, D, E, and F are potentially problematic:
– Message C: a delayed message.
– Message D: a lost message, since the send event for D is recorded in the
restored state for 𝑃𝑗, but the receive event has been undone at process 𝑃𝑖.
Lost messages can be handled by having processes keep a message log of all
the sent messages.
– Messages E, F: delayed orphan messages. After resuming execution from their
checkpoints, processes will generate both of these messages again, so the
delayed copies must be discarded when they arrive.
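
The categories above can be told apart by asking which of a message's two events (send and receive) survive in the restored states. The following is a minimal sketch of that test, not part of the original notes; the `Message` type and its two flags are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record: which events survive in the restored global state.
    name: str
    send_recorded: bool      # send event is in the sender's restored state
    receive_recorded: bool   # receive event is in the receiver's restored state


def classify(msg: Message) -> str:
    """Classify a message after rollback by the events that survive."""
    if msg.receive_recorded and not msg.send_recorded:
        return "orphan"                # received, but the send was undone
    if msg.send_recorded and not msg.receive_recorded:
        return "lost or in transit"    # sent, but the receive is missing or undone
    if msg.send_recorded and msg.receive_recorded:
        return "consistent"
    return "undone"                    # both events rolled back; re-execution regenerates it


# Message H: its receive survives while its send was undone by the rollback.
h = Message("H", send_recorded=False, receive_recorded=True)
# Message D: its send survives at Pj while its receive was undone at Pi.
d = Message("D", send_recorded=True, receive_recorded=False)
```

A message log kept by the sender is what allows the second category to be replayed after recovery.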

M.A.M COLLEGE OF ENGINEERING

Downloaded from EnggTree.com



Checkpoint-based recovery
Checkpoint-based rollback-recovery techniques can be classified into three categories:
1. Uncoordinated checkpointing
2. Coordinated checkpointing
3. Communication-induced checkpointing

1. Uncoordinated Checkpointing
• Each process has autonomy in deciding when to take checkpoints.
• Advantages
Lower runtime overhead during normal execution
• Disadvantages
1. Domino effect during a recovery
2. Recovery from a failure is slow because processes need to iterate to find a
consistent set of checkpoints
3. Each process maintains multiple checkpoints and must periodically invoke a
garbage collection algorithm
4. Not suitable for applications with frequent output commits
• The processes record the dependencies among their checkpoints caused by message
exchange during failure-free operation.
• The following direct dependency tracking technique is commonly used in uncoordinated
checkpointing.
Direct dependency tracking technique
• Assume each process 𝑃𝑖 starts its execution with an initial checkpoint 𝐶𝑖,0.
• 𝐼𝑖,𝑥: the checkpoint interval between checkpoints 𝐶𝑖,𝑥−1 and 𝐶𝑖,𝑥.
• When 𝑃𝑗 receives a message m during 𝐼𝑗,𝑦 that 𝑃𝑖 sent during 𝐼𝑖,𝑥, it records the
dependency from 𝐼𝑖,𝑥 to 𝐼𝑗,𝑦, which is later saved onto stable storage when 𝑃𝑗 takes 𝐶𝑗,𝑦.
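
The bookkeeping described above can be sketched as follows. This is an illustrative sketch only: it assumes the sender piggybacks the pair (i, x) on each message so the receiver can learn the sending interval, and the `Process` class, its fields, and the in-memory `stable` dictionary (standing in for stable storage) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    # Index x of the current interval I_{pid,x}. C_{pid,0} is assumed taken,
    # so execution starts in interval 1.
    interval: int = 1
    pending: set = field(default_factory=set)    # dependencies seen in the current interval
    stable: dict = field(default_factory=dict)   # checkpoint index -> saved dependencies

    def send(self, payload):
        # Assumption: the sender piggybacks (pid, interval) on the message.
        return {"payload": payload, "dep": (self.pid, self.interval)}

    def receive(self, msg):
        # Record the dependency from the sender's interval to the current one.
        self.pending.add(msg["dep"])
        return msg["payload"]

    def checkpoint(self):
        # Taking C_{pid,interval} closes I_{pid,interval}; its dependencies
        # are written out together with the checkpoint.
        self.stable[self.interval] = frozenset(self.pending)
        self.pending = set()
        self.interval += 1


p_i, p_j = Process(pid=0), Process(pid=1)
p_j.receive(p_i.send("hello"))   # received during I_{1,1}, sent during I_{0,1}
p_j.checkpoint()                 # C_{1,1} now stores the dependency (0, 1)
```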


• When a failure occurs, the recovering process initiates rollback by broadcasting a
dependency request message to collect all the dependency information maintained by each
process.
• When a process receives this message, it stops its execution and replies with the
dependency information saved on the stable storage as well as with the dependency
information, if any, which is associated with its current state.
• The initiator then calculates the recovery line based on the global dependency information
and broadcasts a rollback request message containing the recovery line.
• Upon receiving this message, a process whose current state belongs to the recovery line
simply resumes execution; otherwise, it rolls back to an earlier checkpoint as indicated by
the recovery line.
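
One way the initiator might compute the recovery line from the collected dependencies is by rollback propagation: whenever a retained receive depends on an undone send, the receiver is rolled back past that receive, and the check repeats until nothing changes. The sketch below is an assumption-laden illustration, not the notes' own algorithm. The function name, the dictionary encoding, and the convention that a value equal to a process's current interval means "keep the current state" are all hypothetical.

```python
def recovery_line(current, failed, deps):
    """Rollback propagation over interval dependencies.

    current: {process: index of its current interval}
    failed:  set of processes that lost their volatile state
    deps:    iterable of ((i, x), (j, y)) meaning I_{i,x} -> I_{j,y}
    Returns {process: c}. Intervals with index <= c are retained; c equal to
    the current interval means the process simply resumes, otherwise it
    restarts from checkpoint C_{process,c}.
    """
    # A failed process falls back to its last checkpoint; others start at "now".
    line = {p: (k - 1 if p in failed else k) for p, k in current.items()}
    changed = True
    while changed:
        changed = False
        for (i, x), (j, y) in deps:
            send_undone = x > line[i]
            receive_kept = y <= line[j]
            if send_undone and receive_kept:
                line[j] = y - 1          # roll j back to before the orphan receive
                changed = True
    return line


# Hypothetical scenario: Pi fails; its undone interval fed Pj, which fed Pk.
current = {"Pi": 2, "Pj": 2, "Pk": 2}
deps = [(("Pi", 2), ("Pj", 2)), (("Pj", 2), ("Pk", 1))]
line = recovery_line(current, failed={"Pi"}, deps=deps)
```

In this scenario the rollback of Pi cascades to Pj and then to Pk, which ends up at its initial checkpoint, a small instance of the domino effect listed among the disadvantages.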
2. Coordinated Checkpointing
In coordinated checkpointing, processes orchestrate their checkpointing activities so that all
local checkpoints form a consistent global state.
Types
1. Blocking Checkpointing: After a process takes a local checkpoint, to prevent orphan
messages, it remains blocked until the entire checkpointing activity is complete.
Disadvantage: The computation is blocked during the checkpointing.
2. Non-blocking Checkpointing: The processes need not stop their execution while taking
checkpoints. A fundamental problem in coordinated checkpointing is to prevent a process
from receiving application messages that could make the checkpoint inconsistent.


Example (a): Checkpoint inconsistency
• Message m is sent by 𝑃0 after receiving a checkpoint request from the checkpoint
coordinator.
• Assume m reaches 𝑃1 before the checkpoint request.
• This situation results in an inconsistent checkpoint, since checkpoint 𝐶1,𝑥 shows the receipt
of message m from 𝑃0, while checkpoint 𝐶0,𝑥 does not show m being sent from 𝑃0.
Example (b): A solution with FIFO channels
• If channels are FIFO, this problem can be avoided by preceding the first post-checkpoint
message on each channel by a checkpoint request, forcing each process to take a checkpoint
before receiving the first post-checkpoint message.
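
The FIFO fix in Example (b) can be illustrated with a toy simulation. It is a sketch under simplifying assumptions: two processes, a single FIFO channel modelled as a `deque`, and hypothetical names (`Proc`, `CKPT_REQUEST`, `drain`) that do not come from the notes.

```python
from collections import deque


class Proc:
    def __init__(self, name):
        self.name = name
        self.received = []        # application messages delivered so far
        self.checkpoint = None    # snapshot of `received` when the checkpoint is taken
        self.sent_request = False # assumes a single outgoing channel

    def take_checkpoint(self):
        # Idempotent: a later request for the same round changes nothing.
        if self.checkpoint is None:
            self.checkpoint = list(self.received)

    def send(self, channel, payload):
        # The first post-checkpoint message is preceded by a checkpoint request.
        if self.checkpoint is not None and not self.sent_request:
            channel.append(("CKPT_REQUEST", None))
            self.sent_request = True
        channel.append(("APP", payload))

    def drain(self, channel):
        # FIFO delivery: items are handled in the order they were sent.
        while channel:
            kind, payload = channel.popleft()
            if kind == "CKPT_REQUEST":
                self.take_checkpoint()
            else:
                self.received.append(payload)


p0, p1 = Proc("P0"), Proc("P1")
channel = deque()
p0.take_checkpoint()     # P0 has received the coordinator's request
p0.send(channel, "m")    # m is a post-checkpoint message
p1.drain(channel)        # P1 sees the request first, checkpoints, then receives m
```

Because the request travels ahead of m on the same channel, P1's checkpoint cannot record the receipt of m, so the inconsistency of Example (a) does not arise.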


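Impossibility of min-process non-blocking checkpointing

• A min-process, non-blocking checkpointing algorithm is one that forces only a minimum
number of processes to take a new checkpoint, and at the same time does not force any
process to suspend its computation. No such algorithm is possible; min-process algorithms
that restrict sending while checkpointing is in progress, such as the one below, are possible.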


Algorithm
• The algorithm consists of two phases. During the first phase, the checkpoint initiator
identifies all processes with which it has communicated since the last checkpoint and sends
them a request.
• Upon receiving the request, each process in turn identifies all processes it has
communicated with since the last checkpoint and sends them a request, and so on, until
no more processes can be identified.
• During the second phase, all processes identified in the first phase take a checkpoint. The
result is a consistent checkpoint that involves only the participating processes.
• In this protocol, after a process takes a checkpoint, it cannot send any message until the
second phase terminates successfully, although receiving a message after the checkpoint
has been taken is allowable.
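
The first phase amounts to a transitive closure over the "has communicated with since the last checkpoint" relation. A minimal sketch, assuming that relation is available as a dictionary (the `talked_to` mapping and the function name are hypothetical, and message passing is replaced by a local breadth-first search):

```python
from collections import deque


def participants(initiator, talked_to):
    """Return the set of processes that must take a checkpoint.

    talked_to[p] is the set of processes p has communicated with since
    its last checkpoint.
    """
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        # Each newly identified process forwards the request to its own peers.
        for q in talked_to.get(p, ()):
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return seen


talked_to = {
    "P0": {"P1"},
    "P1": {"P2"},
    "P2": set(),
    "P3": {"P4"},   # unrelated to P0's computation, so not involved
}
group = participants("P0", talked_to)
```

Only the processes in `group` take a checkpoint in the second phase; P3 and P4 are left undisturbed, which is the min-process property.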
3. Communication-induced Checkpointing
Communication-induced checkpointing is another way to avoid the domino effect, while allowing
processes to take some of their checkpoints independently. Processes may be forced to take
additional checkpoints.
Two types of checkpoints
1. Autonomous checkpoints
2. Forced checkpoints
The checkpoints that a process takes independently are called autonomous (local) checkpoints,
while those that a process is forced to take are called forced checkpoints.
• Communication-induced checkpointing piggybacks protocol-related information on
each application message.
• The receiver of each application message uses the piggybacked information to determine
if it has to take a forced checkpoint to advance the global recovery line.
• The forced checkpoint must be taken before the application may process the contents of
the message.
• In contrast with coordinated checkpointing, no special coordination messages are
exchanged.
Two types of communication-induced checkpointing
1. Model-based checkpointing
2. Index-based checkpointing.
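
As an illustration of the index-based flavour, the sketch below uses one simple rule: each process numbers its checkpoints, piggybacks its current index on every message, and takes a forced checkpoint when it receives a larger index than its own. This is a simplified example chosen for clarity, not necessarily the protocol the course has in mind; the class and field names are hypothetical.

```python
class CicProcess:
    def __init__(self, name):
        self.name = name
        self.index = 0                       # index of the latest checkpoint
        self.checkpoints = [("initial", 0)]
        self.delivered = []

    def autonomous_checkpoint(self):
        # Taken whenever the process itself decides to.
        self.index += 1
        self.checkpoints.append(("autonomous", self.index))

    def send(self, payload):
        # Protocol information rides on the application message.
        return {"payload": payload, "index": self.index}

    def receive(self, msg):
        # The forced checkpoint is taken BEFORE the message is processed.
        if msg["index"] > self.index:
            self.index = msg["index"]
            self.checkpoints.append(("forced", self.index))
        self.delivered.append(msg["payload"])


p, q = CicProcess("P0"), CicProcess("P1")
p.autonomous_checkpoint()        # P0 moves to index 1 on its own
q.receive(p.send("m"))           # P1 is forced to checkpoint before delivering m
q.receive(p.send("m2"))          # same index, so no further forced checkpoint
```

No coordination messages are exchanged here; the only extra cost is the integer carried on each application message.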
