Presentation On Consistent Checkpoints & Recovery in Distributed System

The document summarizes consistent checkpointing and recovery in distributed systems. It discusses problems with naive checkpointing like orphan messages and domino effects that can cause inconsistencies. It then presents synchronous and asynchronous checkpointing algorithms. The synchronous approach uses global synchronization to create a consistent checkpoint set. The asynchronous approach allows independent checkpointing at each process and requires a recovery algorithm to find the latest consistent checkpoint set. Recovery in distributed databases is also briefly covered, discussing using message spools or copier transactions to recover failed sites.

Uploaded by

Swati Mishra
Copyright
© Attribution Non-Commercial (BY-NC)

PRESENTATION ON CONSISTENT CHECKPOINTS & RECOVERY IN DISTRIBUTED SYSTEM

Submitted By: Swati Mishra I.T(4th year) 0833413058

Checkpointing

Problems of naive checkpointing

Orphan Messages and the Domino Effect

Orphan message: a message that has been received but, after a rollback, has not yet been sent, which leaves the system in an inconsistent state
Domino effect: a single rollback forces other processes to roll back as well

Lost Messages
Livelocks

Orphan message and Domino Effect


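[Figure: processes X, Y and Z with checkpoints x1, x2, x3; y1, y2; and z1, z2. After a rollback, Y has not yet sent a message that X has already received (orphan message); the rollback then forces further rollbacks (domino effect).]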

Lost messages
[Figure: processes X, Y and Z with checkpoints x1, x2, x3; y1, y2; and z1, z2. After a rollback, X has sent a message that Y will never receive (lost message).]

Livelocks
[Figure: livelock example with processes X and Y, checkpoints x1 and y1, and messages m1, m2, n1 and n2.]

Consistency of Checkpoint

Strongly consistent set of checkpoints
no messages cross the set in either direction

Consistent set of checkpoints
no messages cross the set backward, that is, no message is recorded as received without also being recorded as sent
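[Figure: processes X, Y and Z with checkpoints x1, x2; y1, y2; and z1, z2, showing one strongly consistent set and one consistent set of checkpoints. With a set that is only consistent, lost messages still need to be dealt with.]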
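The two definitions can be checked mechanically. The sketch below is only an illustration: it assumes each process numbers its own events, treats a checkpoint as a local event index, and reads "crossing backward" as an orphan message. The names Message and classify are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: int   # local event index at the sender when the message left
    receiver: str
    recv_time: int   # local event index at the receiver when it arrived


def classify(checkpoints, messages):
    """Classify a checkpoint set; checkpoints maps process -> local event index."""
    orphan = lost = False
    for m in messages:
        sent_before = m.send_time <= checkpoints[m.sender]
        received_before = m.recv_time <= checkpoints[m.receiver]
        if received_before and not sent_before:
            orphan = True   # recorded as received, not recorded as sent
        if sent_before and not received_before:
            lost = True     # recorded as sent, not recorded as received
    if orphan:
        return "inconsistent"
    return "consistent" if lost else "strongly consistent"
```

For example, with checkpoints {"X": 4, "Y": 4}, a message sent by X at event 5 and received by Y at event 3 makes the set inconsistent, because Y's checkpoint records a receipt that X's checkpoint does not record as sent.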

Checkpoint/Recovery Algorithm

Synchronous
with global synchronization at checkpointing

Asynchronous
without global synchronization at checkpointing

Preliminary (Assumption)
Synchronous Checkpoint

Goal
To make a consistent global checkpoint

Assumptions

Communication channels are FIFO
No partition of the network
End-to-end protocols cope with message loss due to rollback recovery and communication failure
No failure during the execution of the algorithm

Preliminary (Two types of checkpoint)


Synchronous Checkpoint

tentative checkpoint:
a temporary checkpoint
a candidate for permanent checkpoint

permanent checkpoint:
a local checkpoint at a process
a part of a consistent global checkpoint

Checkpoint Algorithm
Synchronous Checkpoint
Algorithm
1. An initiating process (a single process that invokes this algorithm) takes a tentative checkpoint.
2. It requests all the processes to take tentative checkpoints.
3. It waits to hear from every process whether taking a tentative checkpoint has succeeded.
4. If it learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, they should be discarded.
5. It informs all the processes of the decision.
6. The processes that receive the decision act accordingly.

Supplement
Once a process has taken a tentative checkpoint, it shouldn't send messages until it is informed of the initiator's decision.
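The six steps and the supplement can be sketched as a small single-threaded simulation. This is only an illustration under simplifying assumptions: message passing is replaced by direct method calls, a process state is a plain integer, and the names Process, can_checkpoint and synchronous_checkpoint are hypothetical.

```python
class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # hypothetical switch to simulate a refusal
        self.state = 0                        # stand-in for the real process state
        self.tentative = None
        self.permanent = None
        self.sending_blocked = False

    def take_tentative(self):
        """Try to take a tentative checkpoint and report success or failure."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.sending_blocked = True   # supplement: no sends until the decision arrives
        return True

    def on_decision(self, commit):
        """Make the tentative checkpoint permanent, or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.sending_blocked = False


def synchronous_checkpoint(initiator, others):
    # Steps 1-3: the initiator checkpoints, asks everyone else, and collects replies.
    replies = [initiator.take_tentative()]
    replies += [p.take_tentative() for p in others]
    # Step 4: commit only if every process succeeded.
    commit = all(replies)
    # Steps 5-6: announce the decision; every process acts on it.
    for p in [initiator, *others]:
        p.on_decision(commit)
    return commit
```

If any process reports failure, every tentative checkpoint is discarded and the previous permanent checkpoints stay in place, which is what keeps the set all-or-nothing.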

Diagram of Checkpoint Algorithm


Synchronous Checkpoint
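[Figure: the initiator takes a tentative checkpoint and sends requests to take a tentative checkpoint; the other processes reply OK; the initiator decides to commit and the tentative checkpoints become permanent checkpoints, forming a consistent global checkpoint. One checkpoint in the diagram is marked as unnecessary.]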

Correctness
Synchronous Checkpoint

A set of permanent checkpoints taken by this algorithm is consistent

No process sends messages after taking a tentative checkpoint until the receipt of the decision.
New checkpoints include no message from the processes that don't take a checkpoint.
The set of tentative checkpoints is either made permanent in full or discarded in full.

Drawbacks of Synchronous Approach


Additional messages are exchanged
Synchronization delay
An unnecessary extra load on the system if failure rarely occurs

Asynchronous Checkpoint
Characteristic

Each process takes checkpoints independently
No guarantee that a set of local checkpoints is consistent
A recovery algorithm has to search for a consistent set of checkpoints
No additional messages
No synchronization delay
Lighter load during normal execution

Preliminary (Assumptions)
Asynchronous Checkpoint / Recovery

Goal
To find the latest consistent set of checkpoints

Assumptions

Communication channels are FIFO
Communication channels are reliable
The underlying computation is event-driven

Preliminary (Two types of log)


Asynchronous Checkpoint / Recovery

On receipt of a message, the event is saved in memory (volatile log).
The volatile log is periodically flushed to the disk (stable log), which serves as the checkpoint.

volatile log:
quick access
lost if the corresponding processor fails

stable log:
slow access
not lost even if processors fail
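The difference between the two logs can be sketched as follows. This is a minimal illustration, assuming events are JSON-serializable dictionaries and that the stable log is an append-only file; the class name EventLog and its methods are hypothetical.

```python
import json
import os


class EventLog:
    def __init__(self, path):
        self.path = path     # file that plays the role of the stable log
        self.volatile = []   # in-memory log: fast, but lost on a crash

    def record(self, event):
        """Save an event in memory when a message is received."""
        self.volatile.append(event)

    def flush(self):
        """Append the volatile log to the stable log and force it to disk."""
        with open(self.path, "a", encoding="utf-8") as f:
            for event in self.volatile:
                f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.volatile.clear()

    def crash(self):
        """Simulate a processor failure: everything not yet flushed is gone."""
        self.volatile.clear()

    def stable_events(self):
        """Read back whatever survived in the stable log."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

Flushing more often shrinks the amount of work lost in a crash but pays the slow disk access more frequently, which is the trade-off behind flushing only periodically.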

Preliminary (Definition)
Asynchronous Checkpoint / Recovery

Definition
CkPt_i: the checkpoint (stable log) that processor i rolls back to when a failure occurs

RCVD_ij(CkPt_i / e):
the number of messages received by processor i from processor j, per the information stored in the checkpoint CkPt_i or event e

SENT_ij(CkPt_i / e):
the number of messages sent by processor i to processor j, per the information stored in the checkpoint CkPt_i or event e

Recovery Algorithm
Asynchronous Checkpoint / Recovery

Algorithm
1. When one process crashes, it recovers to the latest checkpoint CkPt and broadcasts a message saying that it has failed.
2. The other processes receive this message and roll back to their latest event.
3. Each process sends SENT(CkPt) to its neighboring processes.
4. Each process waits for SENT(CkPt) messages from every neighbor.
5. On receiving SENT_ji(CkPt_j) from j, if i notices that RCVD_ij(CkPt_i) > SENT_ji(CkPt_j), it rolls back to the event e such that RCVD_ij(e) = SENT_ji(CkPt_j).
6. Repeat steps 3, 4 and 5 N times (N is the number of processes).
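The recovery rounds can be sketched as a small simulation. This is only an illustration under stated assumptions: every process is treated as a neighbor of every other, each logged event stores cumulative per-neighbor send and receive counts, and message exchange is replaced by a shared dictionary. The names Event, Proc, stable_upto and recover are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    sent: dict   # neighbor -> number of messages sent to it so far
    rcvd: dict   # neighbor -> number of messages received from it so far


@dataclass
class Proc:
    name: str
    events: list       # events[0] is the initial state with no messages
    stable_upto: int   # index of the last event flushed to the stable log
    ckpt: int = 0      # index of the current recovery point


def recover(procs, failed):
    """Return the recovery point of every process after `failed` crashes."""
    for p in procs.values():
        # The failed process restarts from its stable log, the others from
        # their latest event.
        p.ckpt = p.stable_upto if p.name == failed else len(p.events) - 1
    for _ in range(len(procs)):   # N rounds, N = number of processes
        # Every process announces SENT(CkPt) to every neighbor.
        announced = {
            (p.name, q): p.events[p.ckpt].sent.get(q, 0)
            for p in procs.values() for q in procs if q != p.name
        }
        for p in procs.values():
            for q in procs:
                if q == p.name:
                    continue
                c = announced[(q, p.name)]
                # Roll back while p has received more from q than q sent.
                while p.events[p.ckpt].rcvd.get(q, 0) > c:
                    p.ckpt -= 1
    return {name: p.ckpt for name, p in procs.items()}
```

Several rounds are needed because one rollback lowers that process's own send counts, which can turn messages at a third process into orphans in the next round.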

Recovery In Distributed Database System

In a distributed system, keeping replicas of data objects at different sites increases availability and reliability. Such a system is also known as a Replicated Distributed Database System. Two methods are used in the recovery algorithm to recover failed sites.

First Method Of Recovery

Update messages addressed to the failed site are saved in message spoolers. When the failed site recovers, it processes all the updates saved there and then returns to normal operation.
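The spooling idea can be sketched as follows. This is a minimal illustration, assuming an update is a (key, value) pair and that one spooler holds a separate queue per failed site; the names Site and Spooler are hypothetical.

```python
from collections import defaultdict, deque


class Site:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}   # local replica of the data objects

    def apply(self, update):
        key, value = update
        self.data[key] = value


class Spooler:
    def __init__(self):
        self.spool = defaultdict(deque)   # site name -> updates it missed, in order

    def deliver(self, site, update):
        """Apply the update directly, or save it if the site is down."""
        if site.up:
            site.apply(update)
        else:
            self.spool[site.name].append(update)

    def recover(self, site):
        """Replay the missed updates in arrival order, then resume normal operation."""
        pending = self.spool.pop(site.name, deque())
        while pending:
            site.apply(pending.popleft())
        site.up = True
```

Replaying in arrival order matters: a later update to the same key must overwrite an earlier one, exactly as it would have if the site had stayed up.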

Second Method Of Recovery


Copier transactions are used. Two conditions are necessary for this method:
Replicas that have missed updates are not used in user transactions.
Before such replicas are used in user transactions, they are brought up to date by copier transactions.

In Distributed Database System Network


Any site may be in one of four states:
Operational/Up
Recovering
Down
Non-operational

Any Queries?

Thank you
