Unit 4 Part 3

Bda

Uploaded by

Sobana Senthil kumar

CS 3551 DISTRIBUTED COMPUTING
Checkpoint Based Recovery

1. Uncoordinated checkpointing
2. Coordinated checkpointing
a. Blocking coordinated checkpointing
b. Non-blocking checkpoint coordination
3. Impossibility of min-process non-blocking checkpointing
4. Communication-induced checkpointing
a. Model-based checkpointing
b. Index-based checkpointing
Checkpoint Based Recovery
● In the checkpoint-based recovery approach, the state of each process and the
communication channel is checkpointed frequently so that, upon a failure, the
system can be restored to a globally consistent set of checkpoints.
● It does not need to detect, log, or replay non-deterministic events.
Checkpoint-based protocols are therefore less restrictive and simpler to
implement than log-based rollback recovery.
● However, checkpoint-based rollback recovery does not guarantee that
prefailure execution can be deterministically regenerated after a rollback.
● It may not be suitable for applications that require frequent interactions with
the outside world.
1. Uncoordinated Checkpointing
● Each process has autonomy in deciding when to take checkpoints.
● Synchronization overhead is minimal because processes do not need to
coordinate with each other (lower runtime overhead).
● Autonomy in taking checkpoints also allows each process to select
appropriate checkpoint positions.

Drawbacks:

1. Domino effect may occur during a recovery.

2. Recovery is slow because processes need to iterate to find a consistent set of checkpoints.
3. Useless Checkpoint:
a. Since no coordination is done at the time the checkpoint is taken, checkpoints taken by a process may be useless
checkpoints.
b. Useless checkpoints are undesirable because they incur overhead and do not contribute to advancing the
recovery line.
4. It forces each process to maintain multiple checkpoints and to periodically invoke a garbage
collection algorithm to reclaim the checkpoints that are no longer required.
5. It is not suitable for applications with frequent output commits, because these require global
coordination to compute the recovery line.
How is a consistent global checkpoint determined?
Steps:

1. When a failure occurs, the recovering process initiates rollback by broadcasting a dependency
request message to collect all the dependency information maintained by each process.
2. When a process receives this message, it stops its execution and replies with the dependency
information saved on stable storage as well as the dependency information, if any, that is
associated with its current state.
3. The initiator then calculates the recovery line based on the global dependency information and
broadcasts a rollback request message containing the recovery line.
4. Upon receiving this message, a process whose current state belongs to the recovery line simply
resumes execution; otherwise, it rolls back to an earlier checkpoint as indicated by the recovery line.
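Step 3 can be illustrated with a small sketch. It is not taken from the slides: it assumes the collected dependency information is a list of messages, each tagged with the checkpoint interval in which it was sent and received (interval k holds the events between checkpoint k and checkpoint k+1). The function name `recovery_line` and the tuple layout are hypothetical. The initiator repeatedly rolls back any receiver that remembers a message whose send has been undone (an orphan message) until no orphan remains.

```python
from typing import Dict, List, Set, Tuple

# (sender, send_interval, receiver, recv_interval); layout assumed for illustration
Message = Tuple[str, int, str, int]

def recovery_line(last_ckpt: Dict[str, int],
                  messages: List[Message],
                  failed: Set[str]) -> Dict[str, int]:
    """Return, per process, the checkpoint index it must restart from.

    A value of last_ckpt[p] + 1 stands for the current (volatile) state,
    meaning the process simply resumes execution.
    """
    # A failed process has lost its volatile state; the others start from it.
    line = {p: (c if p in failed else c + 1) for p, c in last_ckpt.items()}
    changed = True
    while changed:
        changed = False
        for sender, s_int, receiver, r_int in messages:
            send_undone = line[sender] <= s_int    # sender restarts before the send
            recv_kept = line[receiver] > r_int     # receiver still remembers the receive
            if send_undone and recv_kept:          # orphan message
                line[receiver] = r_int             # roll back to just before the receive
                changed = True
    return line

last = {"P0": 1, "P1": 1, "P2": 0}
msgs = [("P1", 1, "P0", 0), ("P0", 1, "P1", 1)]
print(recovery_line(last, msgs, failed={"P0"}))
```

In this example the failure of P0 forces P1 back to its checkpoint 1, which in turn forces P0 back to checkpoint 0 (a small domino effect), while P2 exchanged no messages and keeps its current state.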
Blocking Coordinated Checkpoint
Non-Blocking Coordinated Checkpoint
● In this approach the processes need not stop their execution while taking
checkpoints.
● A fundamental problem in coordinated checkpointing is to prevent a process
from receiving application messages that could make the checkpoint
inconsistent.
Solution 1

If channels are FIFO, this problem can be avoided by preceding the first post-checkpoint message on
each channel by a checkpoint request, forcing each process to take a checkpoint before receiving
the first post-checkpoint message.
Solution 2
● If the channels are non-FIFO, the following two approaches can be used: first,
the marker can be piggybacked on every post-checkpoint message.
● When a process receives an application message with a marker, it treats it as
if it has received a marker message, followed by the application message.
● Alternatively, checkpoint indices can serve the same role as markers, where a
checkpoint is triggered when the receiver’s local checkpoint index is lower
than the piggybacked checkpoint index.
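The index-based variant can be sketched as follows. This is a minimal illustration under assumptions, not the protocol from any particular paper: the class `Process`, its integer `state`, and the `(index, payload)` message layout are all hypothetical. Every outgoing message carries the sender's checkpoint index, and a receiver whose own index is lower takes a forced checkpoint before delivering the message.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Process:
    name: str
    state: int = 0                     # stand-in for application state
    ckpt_index: int = 0                # index of the latest local checkpoint
    checkpoints: List[Tuple[int, int]] = field(default_factory=list)  # (index, saved state)

    def take_checkpoint(self, index: Optional[int] = None) -> None:
        # A voluntary checkpoint bumps the index; a forced one adopts the sender's index.
        self.ckpt_index = self.ckpt_index + 1 if index is None else index
        self.checkpoints.append((self.ckpt_index, self.state))

    def send(self, payload: int) -> Tuple[int, int]:
        # Piggyback the current checkpoint index on every application message.
        return (self.ckpt_index, payload)

    def receive(self, message: Tuple[int, int]) -> None:
        sender_index, payload = message
        if sender_index > self.ckpt_index:
            # Post-checkpoint message: checkpoint first, then deliver.
            self.take_checkpoint(sender_index)
        self.state += payload

p, q = Process("P"), Process("Q")
p.take_checkpoint()           # P moves to index 1
q.receive(p.send(5))          # Q is forced to checkpoint before delivery
q.receive((0, 2))             # a late pre-checkpoint message triggers nothing
print(q.checkpoints, q.state)
```

Because the decision depends only on the piggybacked index, the order in which messages arrive does not matter, which is why this approach works on non-FIFO channels.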
Impossibility of min-process non-blocking checkpointing
● A min-process, non-blocking checkpointing algorithm is one that forces only a minimum
number of processes to take a new checkpoint, and at the same time it does not force any
process to suspend its computation.
● Clearly, such checkpointing algorithms will be very attractive. Cao and Singhal [7] showed
that it is impossible to design a min-process, non-blocking checkpointing algorithm.
● Possible Algorithm:
● Phase 1:
○ checkpoint initiator identifies all processes with which it has communicated since the last checkpoint and sends
them a request. Upon receiving the request, each process in turn identifies all processes it has communicated
with since the last checkpoint and sends them a request, and so on, until no more processes can be identified.
● Phase 2:
○ all processes identified in the first phase take a checkpoint.
○ The result is a consistent checkpoint that involves only the participating processes.
○ In this protocol, after a process takes a checkpoint, it cannot send any message until the second phase
terminates successfully, although receiving a message after the checkpoint has been taken is allowable.
● Based on a concept called “Z-dependency,” Cao and Singhal proved that there does not
exist a non-blocking algorithm that will allow a minimum number of processes to take their
checkpoints.
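Phase 1 of the possible algorithm above amounts to computing a transitive closure over the "has communicated with since the last checkpoint" relation. The sketch below assumes that relation is available as a dictionary; the function name `participants` and the variable `talked_to` are hypothetical. It only identifies who must checkpoint and does not model the blocking rule of Phase 2.

```python
from collections import deque
from typing import Dict, Iterable, Set

def participants(initiator: str, talked_to: Dict[str, Iterable[str]]) -> Set[str]:
    """Processes reached by propagating checkpoint requests from the initiator."""
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        # p forwards the request to everyone it communicated with.
        for q in talked_to.get(p, ()):
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return seen

talked_to = {"P1": ["P2"], "P2": ["P3"], "P4": ["P5"]}
print(sorted(participants("P1", talked_to)))   # P4 and P5 are not involved
```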
Communication-induced checkpointing
Model-based checkpointing
Index-based checkpointing
● Index-based communication-induced checkpointing assigns monotonically
increasing indexes to checkpoints, such that the checkpoints having the same
index at different processes form a consistent state.
● Inconsistency between checkpoints of the same index can be avoided in a
lazy fashion if indexes are piggybacked on application messages to help
receivers decide when they should take a forced checkpoint.
Koo–Toueg coordinated checkpointing algorithm
Objective:

● Takes a consistent set of checkpoints and avoids the domino effect and livelock problems during the recovery.
● Processes coordinate their local checkpointing actions such that the set of all checkpoints in the system is consistent.

Assumptions of Checkpointing Algorithm:

● Processes communicate by exchanging messages through communication channels. Communication channels are
FIFO.
● It is assumed that end-to-end protocols (such as the sliding window protocol) exist to cope with message loss due to
rollback recovery and communication failure.
● Communication failures do not partition the network.

Permanent vs Tentative:

1. A permanent checkpoint is a local checkpoint at a process and is a part of a consistent global checkpoint.
2. A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint on the successful termination
of the checkpoint algorithm.
3. In case of a failure, processes roll back only to their permanent checkpoints for recovery.
Checkpoint - Phase 1
1. An initiating process Pi takes a tentative checkpoint and requests all other processes to take
tentative checkpoints.
2. Each process informs Pi whether it succeeded in taking a tentative checkpoint.
3. A process says “no” to a request if it fails to take a tentative checkpoint, which could be due to
several reasons, depending upon the underlying application.
4. If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides that
all tentative checkpoints should be made permanent; otherwise, Pi decides that all the tentative
checkpoints should be discarded.

Checkpoint - Phase 2
1. Pi informs all the processes of the decision it reached at the end of the first phase.
2. A process, on receiving the message from Pi, will act accordingly.
3. Therefore, either all or none of the processes advance the checkpoint by taking permanent
checkpoints.
4. The algorithm requires that after a process has taken a tentative checkpoint, it cannot send
messages related to the underlying computation until it is informed of Pi’s decision.
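The two checkpoint phases can be sketched in a single-threaded simulation. This is only an illustration of the all-or-none decision, assuming no failures during the protocol; the class `Proc`, its fields, and the function `coordinated_checkpoint` are hypothetical names, and real message passing is replaced by direct method calls.

```python
from typing import List, Optional

class Proc:
    def __init__(self, name: str, state: int = 0, can_checkpoint: bool = True):
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint   # False models a "no" reply
        self.permanent = 0                     # last permanent checkpoint
        self.tentative: Optional[int] = None
        self.blocked = False                   # True: may not send application messages

    def take_tentative(self) -> bool:
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.blocked = True                    # wait for the initiator's decision
        return True

    def decide(self, commit: bool) -> None:
        if commit and self.tentative is not None:
            self.permanent = self.tentative    # tentative becomes permanent
        self.tentative = None                  # otherwise it is discarded
        self.blocked = False

def coordinated_checkpoint(initiator: Proc, others: List[Proc]) -> bool:
    procs = [initiator] + others
    # Phase 1: everyone tries to take a tentative checkpoint and votes.
    votes = [p.take_tentative() for p in procs]
    commit = all(votes)
    # Phase 2: the initiator's decision is propagated to all processes.
    for p in procs:
        p.decide(commit)
    return commit

a, b, c = Proc("Pi", 3), Proc("Pj", 4), Proc("Pk", 5)
print(coordinated_checkpoint(a, [b, c]), [p.permanent for p in (a, b, c)])
```

If any process answers "no", every tentative checkpoint is dropped and all processes keep their previous permanent checkpoints.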
Optimization
Correctness
● A set of permanent checkpoints taken by this algorithm is consistent because of the following
two reasons:
○ Either all or none of the processes take permanent checkpoints;
○ No process sends a message after taking a tentative checkpoint until the receipt of the
initiating process’s decision, as by then all processes would have taken checkpoints.
● Thus, a situation will not arise where there is a record of a message being received but there is
no record of sending it.
Rollback Recovery Algorithm

Assumption

● A single process invokes the algorithm.
● It also assumes that the checkpoint and the rollback recovery algorithms are not invoked
concurrently.

Phase 1
● An initiating process Pi sends a message to all other processes to check if they all are willing to
restart from their previous checkpoints.
● A process may reply “no” to a restart request due to any reason (e.g., it is already participating
in a checkpoint or recovery process initiated by some other process).
● If Pi learns that all processes are willing to restart from their previous checkpoints, Pi decides
that all processes should roll back to their previous checkpoints. Otherwise, Pi aborts the rollback
attempt and it may attempt a recovery at a later time.

Phase 2
● Pi propagates its decision to all the processes.
● On receiving Pi’s decision, a process acts accordingly.
● During the execution of the recovery algorithm, a process cannot send messages related to the
underlying computation while it is waiting for Pi’s decision.
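A compact sketch of the rollback decision, under the same simplifications as before (no real messaging, no failures during the protocol). The function `coordinated_rollback` and its arguments are hypothetical: `states` holds each process's current state, `permanent` its last permanent checkpoint, and `busy` the set of processes that would reply "no".

```python
from typing import Dict, Set, Tuple

def coordinated_rollback(states: Dict[str, int],
                         permanent: Dict[str, int],
                         busy: Set[str]) -> Tuple[Dict[str, int], bool]:
    """Return the resulting states and whether the rollback was carried out."""
    # Phase 1: the initiator asks every process whether it is willing to restart.
    if any(p in busy for p in states):
        # At least one "no": abort; the initiator may retry later.
        return dict(states), False
    # Phase 2: all agreed, so every process restarts from its permanent checkpoint.
    return {p: permanent[p] for p in states}, True

states = {"P1": 10, "P2": 20}
permanent = {"P1": 7, "P2": 15}
print(coordinated_rollback(states, permanent, busy=set()))
print(coordinated_rollback(states, permanent, busy={"P2"}))
```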
Optimization
Correctness
● All processes restart from an appropriate state because, if they decide to restart, they resume
execution from a consistent state (the checkpointing algorithm takes a consistent set of
checkpoints).
Juang–Venkatesan algorithm for asynchronous
checkpointing and recovery
System Model and Assumptions:

● Communication channels are reliable, deliver the messages in FIFO order, and have infinite buffers.
● The message transmission delay is arbitrary, but finite.
● The processors directly connected to a processor via communication channels are called its
neighbors.
● The underlying computation or application is assumed to be event-driven: a processor P waits until a
message m is received, processes the message m, changes its state from s to s′, and sends zero
or more messages to some of its neighbors.
● Then the processor remains idle until the receipt of the next message.
● The new state s′ and the contents of messages sent to its neighbors depend on state s and the
contents of message m. The events at a processor are identified by unique monotonically increasing
numbers, ex0, ex1, ex2, ...
● Storage can be:
○ Volatile log: faster to access, but its contents are lost when the processor loses power or fails.
○ Stable storage: slower to access, but its contents are not lost.
Asynchronous Checkpointing

● After executing an event, a processor records a triplet (s, m, msgs_sent) in its
volatile storage, where:
○ s is the state of the processor before the event
○ m is the message (including the identity of the sender of m, denoted as m.sender) whose
arrival caused the event
○ msgs_sent is the set of messages that were sent by the processor during the event.
● Therefore, a local checkpoint at a processor consists of the record of an event
occurring at the processor and it is taken without any synchronization with
other processors.
● Periodically, a processor independently saves the contents of the volatile log
in the stable storage and clears the volatile log. (Equivalent to taking a local
checkpoint)
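The logging side of the algorithm can be sketched as below. The triplet layout follows the description above; everything else is an assumption made for illustration: the class `Processor`, the toy state transition (adding the payload to an integer state), and the `flush` and `crash` helpers. The recovery algorithm itself is not modelled.

```python
from typing import List, Tuple

Triplet = Tuple[int, Tuple[str, int], List[Tuple[str, int]]]  # (s, m, msgs_sent)

class Processor:
    def __init__(self, name: str, state: int = 0):
        self.name = name
        self.state = state
        self.volatile_log: List[Triplet] = []
        self.stable_log: List[Triplet] = []

    def handle(self, sender: str, payload: int, neighbors: List[str]):
        before = self.state                    # s: state before the event
        self.state = before + payload          # toy transition s -> s'
        msgs_sent = [(n, self.state) for n in neighbors]
        # m keeps the sender identity (m.sender) together with the contents.
        self.volatile_log.append((before, (sender, payload), msgs_sent))
        return msgs_sent

    def flush(self) -> None:
        # Periodic, uncoordinated save: equivalent to taking a local checkpoint.
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

    def crash(self) -> None:
        # A failure wipes the volatile log; the stable log survives.
        self.volatile_log.clear()

p = Processor("P")
p.handle("Q", 2, ["R"])
p.flush()
p.handle("Q", 3, [])
p.crash()
print(p.stable_log)   # only the flushed event remains
```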
Recovery Algorithm
