Presentation On Consistent Checkpoints & Recovery in Distributed System

The document summarizes consistent checkpointing and recovery in distributed systems. It discusses problems with naive checkpointing like orphan messages and domino effects that can cause inconsistencies. It then presents synchronous and asynchronous checkpointing algorithms. The synchronous approach uses global synchronization to create a consistent checkpoint set. The asynchronous approach allows independent checkpointing at each process and requires a recovery algorithm to find the latest consistent checkpoint set. Recovery in distributed databases is also briefly covered, discussing using message spools or copier transactions to recover failed sites.

Uploaded by

Swati Mishra

PRESENTATION ON CONSISTENT CHECKPOINTS & RECOVERY IN DISTRIBUTED SYSTEM

Submitted By: Swati Mishra I.T(4th year) 0833413058

Checkpointing

Problems of naive checkpointing

- Orphan messages and the domino effect
  - Orphan message: a message whose receipt is recorded but whose send is not, which makes the state inconsistent.
  - Domino effect: a single rollback forces other processes to roll back in turn.
- Lost messages
- Livelocks

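Orphan message and Domino Effect

[Figure: timelines of processes X, Y, and Z with checkpoints x1, x2, x3, y1, y2, z1, z2. After a rollback, Y has not yet sent a message that X has already received: an orphan message. The rollbacks this forces on other processes are the domino effect.]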

Lost messages

[Figure: timelines of processes X, Y, and Z with checkpoints x1, x2, x3, y1, y2, z1, z2. After a rollback, X has sent a message that Y will never receive: a lost message.]

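Livelocks

[Figure: processes exchanging messages m1, m2, n1, and n2 around checkpoint x1; the rest of the diagram is not recoverable from the extracted text.]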

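Consistency of Checkpoint

- Strongly consistent set of checkpoints: no message crosses the set in either direction.
- Consistent set of checkpoints: no message crosses the set backward. Messages may still cross it forward, so recovery needs to deal with lost messages.

[Figure: timelines of processes X, Y, and Z with checkpoints x1, x2, y1, y2, z1, marking one strongly consistent set and one consistent set.]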
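The two definitions can be checked mechanically. The following is a minimal sketch, assuming each checkpoint is identified by a local event index and each message records the sender's and receiver's local event indices; the `Message` class and `classify` function are illustrative names, not part of the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: int   # local event index at the sender when the message was sent
    receiver: str
    recv_time: int   # local event index at the receiver when it was received


def classify(cut, messages):
    """Classify a set of checkpoints, one per process.

    cut maps each process name to the local event index of its checkpoint.
    """
    crosses_forward = False
    for m in messages:
        sent_before = m.send_time <= cut[m.sender]
        received_before = m.recv_time <= cut[m.receiver]
        if received_before and not sent_before:
            # Receipt is recorded but the send is not: an orphan message.
            return "inconsistent"
        if sent_before and not received_before:
            # Send is recorded but the receipt is not: lost on rollback.
            crosses_forward = True
    return "consistent" if crosses_forward else "strongly consistent"
```

A set classified as "consistent" is safe to roll back to, but any message counted as crossing forward has to be resent or replayed by some other mechanism.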

Checkpoint/Recovery Algorithm

- Synchronous: with global synchronization at checkpointing
- Asynchronous: without global synchronization at checkpointing

Preliminary (Assumptions)
Synchronous Checkpoint

Goal
- To make a consistent global checkpoint

Assumptions
- Communication channels are FIFO.
- The network does not partition.
- End-to-end protocols cope with message loss due to rollback recovery and communication failure.
- No failure occurs during the execution of the algorithm.

Preliminary (Two types of checkpoint)
Synchronous Checkpoint

tentative checkpoint:
- a temporary checkpoint
- a candidate for permanent checkpoint

permanent checkpoint:
- a local checkpoint at a process
- a part of a consistent global checkpoint

Checkpoint Algorithm
Synchronous Checkpoint

Algorithm
1. An initiating process (a single process that invokes this algorithm) takes a tentative checkpoint.
2. It requests all the processes to take tentative checkpoints.
3. It waits to hear from all the processes whether taking a tentative checkpoint has succeeded.
4. If it learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, they should be discarded.
5. It informs all the processes of the decision.
6. The processes that receive the decision act accordingly.

Supplement
Once a process has taken a tentative checkpoint, it shouldn't send messages until it is informed of the initiator's decision.
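The six steps and the supplement can be sketched as a small single-threaded simulation. This is only an illustration under simplifying assumptions: message passing is replaced by direct method calls, and the `Process` class, its `can_checkpoint` switch, and `run_checkpoint` are hypothetical names introduced here.

```python
class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # hypothetical switch to simulate a refusal
        self.state = 0                        # stand-in for the application state
        self.tentative = None
        self.permanent = None
        self.sending_blocked = False

    def take_tentative(self):
        """Try to take a tentative checkpoint; report success or failure."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.sending_blocked = True  # supplement: no sends until the decision arrives
        return True

    def on_decision(self, commit):
        """Make the tentative checkpoint permanent, or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.sending_blocked = False


def run_checkpoint(initiator, others):
    # Steps 1-2: the initiator checkpoints, then asks everyone else to do so.
    replies = [initiator.take_tentative()]
    replies += [p.take_tentative() for p in others]
    # Steps 3-4: commit only if every process succeeded.
    commit = all(replies)
    # Steps 5-6: broadcast the decision; each process acts on it.
    for p in [initiator, *others]:
        p.on_decision(commit)
    return commit
```

If any process fails to take its tentative checkpoint, every process keeps its previous permanent checkpoint, which is what keeps the set all-or-nothing.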

Diagram of Checkpoint Algorithm
Synchronous Checkpoint

[Figure: the initiator takes a tentative checkpoint and sends requests to take a tentative checkpoint; after it decides to commit, the tentative checkpoints become permanent checkpoints that form a consistent global checkpoint. The diagram also marks an unnecessary checkpoint.]

Correctness
Synchronous Checkpoint

A set of permanent checkpoints taken by this algorithm is consistent:
- No process sends messages after taking a tentative checkpoint until it receives the decision.
- New checkpoints include no message from the processes that don't take a checkpoint.
- The set of tentative checkpoints is either entirely made permanent or entirely discarded.

Drawbacks of Synchronous Approach

- Additional messages are exchanged.
- Synchronization delay.
- An unnecessary extra load on the system if failures rarely occur.

Asynchronous Checkpoint
Characteristic

- Each process takes checkpoints independently.
- No guarantee that a set of local checkpoints is consistent.
- A recovery algorithm has to search for a consistent set of checkpoints.
- No additional messages.
- No synchronization delay.
- Lighter load during normal execution.

Preliminary (Assumptions)
Asynchronous Checkpoint / Recovery

Goal
- To find the latest consistent set of checkpoints

Assumptions
- Communication channels are FIFO.
- Communication channels are reliable.
- The underlying computation is event-driven.

Preliminary (Two types of log)
Asynchronous Checkpoint / Recovery

- On receipt of a message, the event is saved in memory (volatile log).
- The volatile log is periodically flushed to disk (stable log); this flush is the checkpoint.

volatile log:
- quick access
- lost if the corresponding processor fails

stable log:
- slow access
- not lost even if processors fail
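The difference between the two logs can be illustrated with a short sketch. It assumes events are JSON-serializable dictionaries and that the stable log is an append-only file; the `EventLog` class and its method names are made up for this example.

```python
import json
import os
import tempfile


class EventLog:
    def __init__(self, path):
        self.path = path      # file that plays the role of the stable log
        self.volatile = []    # in-memory log, lost on a crash

    def record(self, event):
        """Save an event in memory when a message is received."""
        self.volatile.append(event)

    def flush(self):
        """Checkpoint: append the volatile log to the stable log on disk."""
        with open(self.path, "a", encoding="utf-8") as f:
            for event in self.volatile:
                f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before dropping it from memory
        self.volatile.clear()

    def crash(self):
        """Simulate a processor failure: only the volatile log is lost."""
        self.volatile.clear()

    def stable_events(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

Events recorded after the last flush disappear in a crash, which is why recovery can only restart from what the stable log contains.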

Preliminary (Definition)
Asynchronous Checkpoint / Recovery

Definition
- CkPt_i: the checkpoint (stable log) that processor i rolls back to when a failure occurs.
- RCVD_ij(CkPt_i / e): the number of messages received by processor i from processor j, per the information stored in the checkpoint CkPt_i or event e.
- SENT_ij(CkPt_i / e): the number of messages sent by processor i to processor j, per the information stored in the checkpoint CkPt_i or event e.

Recovery Algorithm
Asynchronous Checkpoint / Recovery

Algorithm
1. When one process crashes, it recovers to the latest checkpoint CkPt.
2. It broadcasts a message announcing that it has failed. The other processes receive this message and roll back to their latest event.
3. Each process sends SENT(CkPt) to its neighboring processes.
4. Each process waits for SENT(CkPt) messages from every neighbor.
5. On receiving SENT_ji(CkPt_j) from j, if i finds that RCVD_ij(CkPt_i) > SENT_ji(CkPt_j), it rolls back to the latest event e such that RCVD_ij(e) = SENT_ji(CkPt_j).

Repeat steps 3, 4, and 5 N times (N is the number of processes).
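The rollback rounds can be sketched as a simulation over logged counters. This is a simplified illustration, not the presentation's own code: it assumes every process is a neighbor of every other, that rounds run in lockstep, and that each process's history starts with an initial event in which no messages have been sent or received. The names `recover`, `event`, `histories`, and `start` are hypothetical.

```python
def recover(histories, start):
    """Return the event index each process ends up rolled back to.

    histories[p] lists the logged events of process p, oldest first. Each
    event holds cumulative counters: {"sent": {q: n}, "rcvd": {q: n}}.
    start[p] is where p begins: its latest stable checkpoint if it crashed,
    otherwise its latest event.
    """
    point = dict(start)
    for _ in range(len(histories)):  # repeat N times
        # Step 3: every process announces its SENT counters at its current point.
        announced = {p: histories[p][point[p]]["sent"] for p in histories}
        # Steps 4-5: compare RCVD against what each neighbor says it sent.
        for i in histories:
            for j in histories:
                if i == j:
                    continue
                sent_ji = announced[j].get(i, 0)
                while histories[i][point[i]]["rcvd"].get(j, 0) > sent_ji:
                    point[i] -= 1  # undo events until no orphan message remains
    return point


def event(sent=None, rcvd=None):
    """Build one logged event with cumulative per-peer counters."""
    return {"sent": sent or {}, "rcvd": rcvd or {}}


histories = {
    "X": [event(), event(sent={"Y": 1}), event(sent={"Y": 2})],
    "Y": [event(), event(rcvd={"X": 1}), event(rcvd={"X": 2}),
          event(sent={"Z": 1}, rcvd={"X": 2})],
    "Z": [event(), event(rcvd={"Y": 1})],
}
# X crashed and restarts from its checkpoint at index 1;
# Y and Z start from their latest events.
result = recover(histories, {"X": 1, "Y": 3, "Z": 1})
print(result)
```

In this example Y must undo the receipt of X's second message, which in turn undoes Y's send to Z, so Z rolls back in the following round. That cascade is why the comparison is repeated N times.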

Recovery In Distributed Database System

Keeping replicas of data objects at different sites of a distributed system increases availability and reliability. Such a system is also known as a Replicated Distributed Database System. Two methods are used to recover failed sites.

First Method Of Recovery

- Update messages addressed to the failed site are saved in message spoolers.
- When the failed site recovers, it processes all the spooled updates and then returns to normal operation.
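A message spooler can be sketched as a per-site queue that holds updates while the site is down. This is a minimal illustration; the `Site` and `Spooler` classes and the key/value update format are assumptions made for the example.

```python
from collections import deque


class Site:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Spooler:
    """Holds updates addressed to sites that are currently down."""

    def __init__(self):
        self.spool = {}  # site name -> queue of pending (key, value) updates

    def deliver(self, site, key, value):
        if site.up:
            site.apply(key, value)
        else:
            self.spool.setdefault(site.name, deque()).append((key, value))

    def recover(self, site):
        """Replay spooled updates in arrival order, then mark the site up."""
        pending = self.spool.pop(site.name, deque())
        while pending:
            site.apply(*pending.popleft())
        site.up = True
```

Replaying in arrival order matters: a later update to the same key has to overwrite an earlier one, as it would have if the site had never failed.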

Second Method Of Recovery

Copier transactions are used. Two things are necessary for this method:
- Replicas that have missed updates are not used in user transactions.
- Before such replicas are used in user transactions, they are brought up to date by copier transactions.
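The two requirements can be sketched with a staleness flag on each replica. This is an assumed model for illustration only: the `Replica` class, the `stale` flag, and the `read` helper are hypothetical, and a real system would run the copier as a proper transaction under concurrency control.

```python
class Replica:
    def __init__(self, value, stale=False):
        self.value = value
        self.stale = stale  # True if this copy missed updates while its site was down


def copier_transaction(target, source):
    """Refresh a stale replica from an up-to-date one."""
    if source.stale:
        raise ValueError("source replica is itself out of date")
    target.value = source.value
    target.stale = False


def read(replica, up_to_date_peer):
    """User-transaction read: never returns a value from a stale replica."""
    if replica.stale:
        copier_transaction(replica, up_to_date_peer)
    return replica.value
```

Unlike the spooler approach, nothing has to be buffered while the site is down; the cost is paid when a stale replica is first needed.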

In Distributed Database System Network

Any site may be in one of four states:
- Operational/Up
- Recovering
- Down
- Non-operational

Any Queries?

Thank you
