
16_issues in Failure Recovery

Uploaded by

alainemario


EnggTree.com
CS3551 DISTRIBUTED COMPUTING

Issues in failure recovery

In failure recovery, we must not only restore the system to a consistent state but also
appropriately handle messages that are left in an abnormal state by the failure and recovery.

The computation comprises three processes Pi, Pj, and Pk, connected through a communication
network. The processes communicate solely by exchanging messages over fault-free, FIFO
communication channels.

Processes Pi, Pj, and Pk have taken checkpoints.

• The rollback of process 𝑃𝑖 to checkpoint 𝐶𝑖,1 created an orphan message H.
• Orphan message I is created due to the rollback of process 𝑃𝑗 to checkpoint 𝐶𝑗,1.
• Messages C, D, E, and F are potentially problematic:
– Message C: a delayed message.
– Message D: a lost message, since the send event for D is recorded in the
restored state for 𝑃𝑗, but the receive event has been undone at process 𝑃𝑖.
– Lost messages can be handled by having processes keep a message log of all
the sent messages.
– Messages E, F: delayed orphan messages. After resuming execution from their
checkpoints, processes will generate both of these messages again, so the delayed
originals must be discarded when they arrive.
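The message categories above can be told apart by two facts: whether the send event survives in the restored state, and what happened to the receive event. The following is a minimal sketch under that assumption; the names `Receive` and `classify`, and the send/receive status assigned to messages H, D, C, and E, are illustrative choices and not part of the original notes.

```python
from enum import Enum


class Receive(Enum):
    RECORDED = "recorded"      # receive event survives in the restored state
    UNDONE = "undone"          # receive event was rolled back
    IN_TRANSIT = "in transit"  # message had not arrived when the failure occurred


def classify(send_recorded: bool, receive: Receive) -> str:
    """Label a message relative to the restored global state."""
    if send_recorded:
        return {
            Receive.RECORDED: "normal",
            Receive.UNDONE: "lost",
            Receive.IN_TRANSIT: "delayed",
        }[receive]
    # The send event was undone by a rollback.
    return {
        Receive.RECORDED: "orphan",
        Receive.UNDONE: "undone",
        Receive.IN_TRANSIT: "delayed orphan",
    }[receive]


# Assumed status of the messages named in the list above.
examples = {
    "H": classify(False, Receive.RECORDED),
    "D": classify(True, Receive.UNDONE),
    "C": classify(True, Receive.IN_TRANSIT),
    "E": classify(False, Receive.IN_TRANSIT),
}
```

A recovery protocol needs this distinction because a delayed message such as C should be delivered, while a delayed orphan such as E should be dropped.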

M.A.M COLLEGE OF ENGINEERING

Downloaded from EnggTree.com


Checkpoint-based recovery
Checkpoint-based rollback-recovery techniques can be classified into three categories:
1. Uncoordinated checkpointing
2. Coordinated checkpointing
3. Communication-induced checkpointing

1. Uncoordinated Checkpointing
• Each process has autonomy in deciding when to take checkpoints.
• Advantages
Lower runtime overhead during normal execution

• Disadvantages
1. Domino effect during a recovery
2. Recovery from a failure is slow because processes need to iterate to find a
consistent set of checkpoints
3. Each process maintains multiple checkpoints and must periodically invoke a
garbage collection algorithm
4. Not suitable for applications with frequent output commits
• The processes record the dependencies among their checkpoints caused by message
exchange during failure-free operation.
• The following direct dependency tracking technique is commonly used in uncoordinated
checkpointing.
Direct dependency tracking technique
• Assume each process 𝑃𝑖 starts its execution with an initial checkpoint 𝐶𝑖,0.
• 𝐼𝑖,𝑥: the checkpoint interval between checkpoints 𝐶𝑖,𝑥−1 and 𝐶𝑖,𝑥.
• When 𝑃𝑖 sends a message m to 𝑃𝑗 during 𝐼𝑖,𝑥, it piggybacks the pair (i, x) on m.
• When 𝑃𝑗 receives m during 𝐼𝑗,𝑦, it records the dependency from 𝐼𝑖,𝑥 to 𝐼𝑗,𝑦,
which is later saved onto stable storage when 𝑃𝑗 takes 𝐶𝑗,𝑦.
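Direct dependency tracking can be sketched as follows. This is an illustrative model only: the `Process` class, its field names, and the dictionary standing in for stable storage are assumptions made for the example, and the sender is assumed to piggyback its (process id, interval index) pair on each message.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    interval: int = 1                            # currently executing I_{pid,interval}
    pending: set = field(default_factory=set)    # dependencies recorded this interval
    stable: dict = field(default_factory=dict)   # checkpoint index -> saved dependencies

    def __post_init__(self):
        self.stable[0] = set()                   # initial checkpoint C_{pid,0}

    def send(self, payload):
        # Piggyback the sender's (process id, interval index) on the message.
        return {"payload": payload, "dep": (self.pid, self.interval)}

    def receive(self, message):
        # Record the dependency from the sender's interval to the current interval.
        self.pending.add(message["dep"])
        return message["payload"]

    def checkpoint(self):
        # Taking C_{pid,interval} saves the interval's dependencies and starts the next one.
        self.stable[self.interval] = set(self.pending)
        self.pending.clear()
        self.interval += 1


p_i, p_j = Process(pid=0), Process(pid=1)
p_j.receive(p_i.send("m"))   # m is sent in I_{0,1} and received in I_{1,1}
p_j.checkpoint()             # C_{1,1} now stores the dependency on I_{0,1}
```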


• When a failure occurs, the recovering process initiates rollback by broadcasting a
dependency request message to collect all the dependency information maintained by each
process.
• When a process receives this message, it stops its execution and replies with the
dependency information saved on the stable storage as well as with the dependency
information, if any, which is associated with its current state.
• The initiator then calculates the recovery line based on the global dependency information
and broadcasts a rollback request message containing the recovery line.
• Upon receiving this message, a process whose current state belongs to the recovery line
simply resumes execution; otherwise, it rolls back to an earlier checkpoint as indicated by
the recovery line.
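One way the initiator might compute the recovery line from the collected dependencies is a simple fixed-point iteration, sketched below. The function name `recovery_line` and the data layout are assumptions for illustration. A state index c for process i means intervals up to c are kept; a process that has not failed starts at the index of its current interval, which stands for its volatile state.

```python
def recovery_line(start, deps):
    """Compute a consistent set of states by rollback propagation.

    start: pid -> index of the latest state the process could keep.
    deps:  iterable of ((i, x), (j, y)), meaning a message sent in
           interval I_{i,x} was received in interval I_{j,y}.
    """
    line = dict(start)
    changed = True
    while changed:
        changed = False
        for (i, x), (j, y) in deps:
            # Orphan: the send has been undone but the receive is still kept,
            # so the receiver must roll back to before the receiving interval.
            if x > line[i] and y <= line[j]:
                line[j] = y - 1
                changed = True
    return line


# P0 failed and restarts from C_{0,1}; P1 and P2 are both in their second interval.
start = {0: 1, 1: 2, 2: 2}
deps = [((0, 2), (1, 2)),   # sent in I_{0,2}, received in I_{1,2}
        ((1, 2), (2, 1))]   # sent in I_{1,2}, received in I_{2,1}
line = recovery_line(start, deps)
```

In this example the rollback of P0 forces P1 back to its first checkpoint, which in turn forces P2 back to its initial checkpoint, a small instance of the domino effect.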
2. Coordinated Checkpointing
In coordinated checkpointing, processes orchestrate their checkpointing activities so that all
local checkpoints form a consistent global state.
Types
1. Blocking Checkpointing: After a process takes a local checkpoint, it remains blocked until
the entire checkpointing activity is complete, which prevents orphan messages.
Disadvantage: The computation is blocked during checkpointing.
2. Non-blocking Checkpointing: The processes need not stop their execution while taking
checkpoints. A fundamental problem in coordinated checkpointing is to prevent a process
from receiving application messages that could make the checkpoint inconsistent.


Example (a): Checkpoint inconsistency

• Message m is sent by 𝑃0 after receiving a checkpoint request from the checkpoint
coordinator.
• Assume m reaches 𝑃1 before the checkpoint request.
• This situation results in an inconsistent checkpoint, since checkpoint 𝐶1,𝑥 shows the receipt
of message m from 𝑃0, while checkpoint 𝐶0,𝑥 does not show m being sent from 𝑃0.
Example (b): A solution with FIFO channels
• If channels are FIFO, this problem can be avoided by preceding the first post-checkpoint
message on each channel by a checkpoint request, forcing each process to take a checkpoint
before receiving the first post-checkpoint message.
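The FIFO solution in Example (b) can be illustrated with a small simulation. This is a sketch under stated assumptions: the `Process` class, the `MARKER` constant, and the use of a `deque` as a FIFO channel are illustrative, and the coordinator's request to P0 is modelled as a direct call to `take_checkpoint`.

```python
from collections import deque

MARKER = "CHECKPOINT_REQUEST"


class Process:
    def __init__(self, name):
        self.name = name
        self.received = []       # application messages processed so far
        self.checkpoint = None   # copy of `received` at checkpoint time
        self.outgoing = []       # FIFO channels to other processes

    def take_checkpoint(self):
        if self.checkpoint is None:
            self.checkpoint = list(self.received)
            # The request goes out ahead of any post-checkpoint message.
            for channel in self.outgoing:
                channel.append(MARKER)

    def send(self, channel, payload):
        channel.append(payload)

    def deliver(self, channel):
        item = channel.popleft()
        if item == MARKER:
            self.take_checkpoint()   # forced before later messages are seen
        else:
            self.received.append(item)


p0, p1 = Process("P0"), Process("P1")
c01 = deque()            # FIFO channel from P0 to P1
p0.outgoing.append(c01)
p0.take_checkpoint()     # P0 receives the coordinator's request
p0.send(c01, "m")        # m is a post-checkpoint message
p1.deliver(c01)          # the request arrives first and forces P1's checkpoint
p1.deliver(c01)          # m is received only after the checkpoint
```

Because the channel preserves order, P1's checkpoint cannot record the receipt of m, which matches the fact that P0's checkpoint does not record sending it.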


Impossibility of min-process non-blocking checkpointing

• A min-process, non-blocking checkpointing algorithm is one that forces only a minimum
number of processes to take a new checkpoint, and at the same time it does not force any
process to suspend its computation.
• Such an algorithm has been shown to be impossible to design.


Algorithm
• A min-process algorithm is still possible if sending is restricted during checkpointing.
The algorithm consists of two phases. During the first phase, the checkpoint initiator
identifies all processes with which it has communicated since the last checkpoint and sends
them a request.
• Upon receiving the request, each process in turn identifies all processes it has
communicated with since the last checkpoint and sends them a request, and so on, until
no more processes can be identified.
• During the second phase, all processes identified in the first phase take a checkpoint. The
result is a consistent checkpoint that involves only the participating processes.
• In this protocol, after a process takes a checkpoint, it cannot send any message until the
second phase terminates successfully, although receiving a message after the checkpoint
has been taken is allowable.
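The first phase amounts to computing the set of processes reachable from the initiator through communication since the last checkpoint. The sketch below assumes a hypothetical `talked_to` map from each process to the processes it has communicated with; the function names are illustrative, and the rule that a checkpointed process may not send until phase two ends is not modelled.

```python
from collections import deque


def identify_participants(initiator, talked_to):
    """Phase 1: follow 'communicated with since the last checkpoint' transitively."""
    identified = {initiator}
    queue = deque([initiator])
    while queue:
        process = queue.popleft()
        for peer in talked_to.get(process, ()):
            if peer not in identified:   # each process receives the request once
                identified.add(peer)
                queue.append(peer)
    return identified


def take_checkpoints(initiator, talked_to):
    """Phase 2: only the identified processes take a checkpoint."""
    return sorted(identify_participants(initiator, talked_to))


talked_to = {
    "P0": ["P1"],
    "P1": ["P0", "P2"],
    "P2": ["P1"],
    "P3": ["P4"],   # P3 and P4 only talked to each other
    "P4": ["P3"],
}
participants = take_checkpoints("P0", talked_to)
```

Here P3 and P4 are left alone because neither has communicated, directly or indirectly, with the initiator since the last checkpoint.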
3. Communication-induced Checkpointing
Communication-induced checkpointing is another way to avoid the domino effect, while allowing
processes to take some of their checkpoints independently. Processes may be forced to take
additional checkpoints.
Two types of checkpoints
1. Autonomous checkpoints
2. Forced checkpoints
The checkpoints that a process takes independently are called local (autonomous) checkpoints,
while those that a process is forced to take are called forced checkpoints.
• Communication-induced checkpointing piggybacks protocol-related information on
each application message.
• The receiver of each application message uses the piggybacked information to determine
if it has to take a forced checkpoint to advance the global recovery line.
• The forced checkpoint must be taken before the application may process the contents of
the message.
• In contrast with coordinated checkpointing, no special coordination messages are
exchanged.
Two types of communication-induced checkpointing
1. Model-based checkpointing
2. Index-based checkpointing
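As an illustration of piggybacking and forced checkpoints, the sketch below follows one simple index-based rule: each message carries the sender's checkpoint index, and the receiver takes a forced checkpoint when that index is larger than its own. The notes do not specify this rule, so it is an assumption made for the example, as are the `Process` class and its method names.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.index = 0                       # index of the latest checkpoint
        self.checkpoints = [("initial", 0)]
        self.delivered = []

    def autonomous_checkpoint(self):
        # Taken independently, with no coordination messages.
        self.index += 1
        self.checkpoints.append(("autonomous", self.index))

    def send(self, payload):
        # Protocol information rides on the application message.
        return {"payload": payload, "index": self.index}

    def receive(self, message):
        # Assumed rule: a higher piggybacked index forces a checkpoint,
        # taken before the application sees the message contents.
        if message["index"] > self.index:
            self.index = message["index"]
            self.checkpoints.append(("forced", self.index))
        self.delivered.append(message["payload"])


p, q = Process("P"), Process("Q")
p.autonomous_checkpoint()        # P moves to index 1
q.receive(p.send("m1"))          # index 1 > 0, so Q takes a forced checkpoint
q.receive(p.send("m2"))          # indices are equal, so no checkpoint is forced
```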
