
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Sub Code: CS3551

Subject Name: DISTRIBUTED COMPUTING

Semester/Year: III/V

Unit IV-Study Material

Prepared by: Dr.N.Indumathi          Approved by: HoD
UNIT – IV : Syllabus
SYLLABUS:
UNIT IV CONSENSUS AND RECOVERY 10
Consensus and Agreement Algorithms: Problem Definition – Overview of Results – Agreement in a
Failure-Free System(Synchronous and Asynchronous) – Agreement in Synchronous Systems with
Failures; Checkpointing and Rollback Recovery: Introduction – Background and Definitions – Issues
in Failure Recovery – Checkpoint-based Recovery – Coordinated Checkpointing Algorithm --
Algorithm for Asynchronous Checkpointing and Recovery

1. CONSENSUS AND AGREEMENT ALGORITHMS : PROBLEM DEFINITION


1.1 Introduction to Consensus and Agreement Algorithms
Consensus in distributed systems refers to a group of participants reaching agreement on some subject.
To appreciate the importance of consensus algorithms in distributed systems, we need to understand the implications of choosing a particular algorithm.
For example, a group of friends deciding which café to visit next is an agreement.
On a different scale, citizens of a nation electing a government also constitutes an agreement.
Definition : Consensus
A fundamental problem in distributed computing and multi-agent systems is to achieve overall system
reliability in the presence of a number of faulty processes.
This often requires coordinating processes to reach consensus, or agree on some data value that is
needed during computation.

Consensus is a fundamental paradigm for fault-tolerant asynchronous distributed systems.


 Each process proposes a value to the others.
 All correct processes have to agree (Termination) on the same value (Agreement) which must
be one of the initially proposed values (Validity).
 Solving consensus in asynchronous distributed systems where processes can crash is a well-
known difficult task.

Distributed Consensus
A procedure for reaching a common agreement in a distributed or decentralized multi-agent platform is called distributed consensus.
It is especially important in message-passing systems.

Features:
It ensures reliability and fault tolerance in distributed systems.
It ensures correct operation even in the presence of faulty processes.

Example:
Committing a transaction in a database, state machine replication, and clock synchronization.

How to achieve distributed consensus?


There are some conditions that need to be satisfied in order to achieve distributed consensus.

Termination – Every non-faulty process must eventually decide.
Agreement – The final decision of every non-faulty process must be identical.
Validity – If every non-faulty process begins with the same initial value, then the agreed value must be that value.
Integrity – Every correct process decides at most one value, and the decided value must have been proposed by some process.
Why consensus?
 A complex system can consist of multiple processes that work independently and often in a
distributed setup.
 To achieve a common objective, they have to agree on several decisions.
 The problem is compounded by the fact that some of these processes can fail randomly or
become unreliable in many other ways.
 However, the system as a whole still needs to continue functioning.
 This requires the consensus mechanism we choose to be fault-tolerant or resilient.
 Hence, the choice of protocol to solve the consensus problem is guided by the faults it needs to
tolerate.

2. OVERVIEW OF RESULTS
2.1 Outline of an Overview of Results
We expect a consensus algorithm to exhibit certain properties:
i) The first is Termination, implying that every non-faulty process must decide on a value
ii) The second is Agreement, that is, every correct process must agree on the same value
iii) The last is Integrity, which means if all the correct processes proposed the same value, then
any correct process must decide the same value
i) Agreement is a safety property.
– Every possible state of the system has this property in all possible executions.
– I.e., either they have not agreed yet, or they all agreed on the same value.

ii) Termination is a liveness property.
– Some state of the system has this property in all possible executions.
– The property is stable: once some state of an execution has the property, all subsequent states also have it.
There are many ways in which processes in a distributed system can reach a consensus.
However, there is usually a constant struggle between security and performance.
The more we want our algorithm to be secure against the ways in which failures can happen, the less performant it tends to become.
Then, there are considerations like resource consumption and the size of messages.

Table: Overview of results on agreement.

Failure mode      | Synchronous system (message-passing and shared memory)                      | Asynchronous system (message-passing and shared memory)
No failure        | agreement attainable; common knowledge attainable                            | agreement attainable; concurrent common knowledge attainable
Crash failure     | agreement attainable; f < n crash-prone processes; Ω(f + 1) rounds           | agreement not attainable
Byzantine failure | agreement attainable; f ≤ ⌊(n − 1)/3⌋ Byzantine processes; Ω(f + 1) rounds   | agreement not attainable

f denotes the number of failure-prone processes; n is the total number of processes.
In a failure-free system, consensus can be attained in a straightforward manner.

2.2 Computation Model for Consensus Algorithms


 Consensus algorithms for distributed systems have been an active area of research for
several decades.
 Possibly, it started in the 1970s, when Leslie Lamport began reasoning about the chaotic
world of distributed systems.
 It led to the development of several key algorithms, which remain a milestone in the family
of classical consensus algorithms.
 However, the turn of the Millennium brought new technologies like blockchain.

 This propelled the world of distributed computing to an unimaginable scale.
 It led to the development of numerous tailor-made consensus algorithms.
 To make sense of this, it’s quite useful to understand the computation model in which the
consensus algorithms are expected to operate.

2.3 Failures in Distributed Computing environment


Distributed systems are prone to faults that a consensus algorithm needs to be resilient against. Now,
it’s important to understand the types of faults a process may undergo.
There are several types of failures,
i) crash failures and
ii) byzantine failures.

i) A crash fault occurs when a process abruptly stops and does not resume.
ii) On the contrary, a Byzantine fault is much more arbitrary and disruptive. It may occur for several
reasons, like malicious activity on the part of an adversary.
For instance, a Byzantine-faulty process may send conflicting views of reality to different members of the system.

 The Byzantine failure derives its name from the “Byzantine generals problem”.
 It’s a game theory problem that describes how actors in a decentralized system arrive at a
consensus without a trusted central party.
 Some of the actors here can be unreliable.
 Interestingly, in case of a Byzantine failure, a process can inconsistently appear both failed
and functioning to the rest of the system.
 Hence, it becomes much more challenging for a consensus algorithm to be resilient to
Byzantine failures than crash failures.

2.4 Failure Detector
A failure detector is composed of failure detection modules, one per process; each module provides its process with a list of the processes it currently suspects to have crashed.
A failure detection module can make mistakes, either by not suspecting a crashed process or by erroneously suspecting a correct one.
Formally, a failure detector is defined by two properties:
i) completeness (a property on the actual detection of process crashes), and
ii) accuracy (a property that restricts the mistakes on erroneous suspicions).
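
To make the idea concrete, the following is a minimal Python sketch (not from the original material) of a timeout-based failure detection module; the class and method names are illustrative assumptions. Such a detector eventually achieves completeness but can violate accuracy by suspecting a slow-but-correct process.

import time

# Minimal, illustrative heartbeat failure detector (names are hypothetical):
# a process is suspected if no heartbeat has been seen within `timeout` seconds.
class HeartbeatFailureDetector:
    def __init__(self, processes, timeout=2.0):
        self.timeout = timeout
        # remember the last time each monitored process was heard from
        self.last_heard = {p: time.time() for p in processes}

    def record_heartbeat(self, process):
        # called whenever a heartbeat message arrives from `process`
        self.last_heard[process] = time.time()

    def suspects(self):
        # list of processes currently suspected to have crashed
        now = time.time()
        return [p for p, t in self.last_heard.items() if now - t > self.timeout]

# usage: detector = HeartbeatFailureDetector(["P1", "P2", "P3"])
#        detector.record_heartbeat("P1"); print(detector.suspects())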

2.5 Synchronous vs. Asynchronous Communication


Another important aspect of the computation model is how processes communicate with each other.
 A synchronous system is where each process runs using the same clock or perfectly
synchronized clocks.
 This naturally puts an upper bound on each processing step and message transmission.
 However, in a distributed system, communication between processes is inherently
asynchronous. There is no global clock nor consistent clock rate.
 Hence, each process handles communication at different rates, thus making the
communication asynchronous:

Thus, in a fully asynchronous distributed system where even a single process may have a crash
failure, it’s impossible to have a deterministic algorithm for achieving consensus.
2.6 Consensus algorithms

What is a consensus algorithm in distributed systems?

 Consensus algorithms are vital in large-scale, fault-tolerant systems because they enable a set
of distributed/replicated machines or servers to work as a coherent group and agree on system
state, even in the presence of failures or outages.

Use of consensus algorithm

The consensus algorithm or mechanism is designed to achieve reliability in a network that consists of multiple nodes or users.
Consensus algorithms thus allow a blockchain to achieve reliability and trust among nodes while ensuring security in the network environment.

2.7 Model of Processor Failures


 In agreement problems, we consider a very general model of processor failures.
 A processor can fail in three modes:
i) crash fault
ii) omission fault and
iii) malicious fault(Byzantine faults)

i) crash fault
 In a crash fault, a processor stops functioning and never resumes operation.

ii) omission fault


 In an omission fault, a processor "omits" to send messages to some processors. (These are the
messages that the processor should have sent according to the protocol or algorithm it is
executing.)
 For example, a processor is supposed to broadcast a message to all other processors, but it sends
the message to only a few processors.

iii) malicious fault(Byzantine faults)


 In a malicious fault, a processor behaves randomly and arbitrarily.
 For example, a processor may send fictitious messages to other processors to confuse them.
 Malicious faults are very broad in nature and thus most other conceivable faults can be treated as
malicious faults.
Processor Failures
 Since a faulty processor can refuse to send a message, a nonfaulty processor may never receive
an expected message from a faulty processor.
 In such a situation, we assume that the nonfaulty processor simply chooses an arbitrary value and
acts as if the expected message has been received.
 Of course, we assume that such situations, where a processor refuses to send a message, can be
detected by the respective receiver processors.

In synchronous systems, if the duration of each round is known, then this detection is simple: all expected messages not received by the end of a round were not sent.

2.8 Advantages and disadvantages of consensus mechanism

Advantages of consensus mechanism

Although consensus can be a slow and precarious way to make decisions, it offers clear benefits.
The advantages of consensus decision-making include it being a group decision, giving employees a sense of involvement, and presenting a united front.
Group agrees to support the decision: reaching a conclusion that everyone on the team supports is a positive, often effective, team strategy.
i) Involved employees see benefits
ii) You present a unified front
iii) Collaborative spirit of the team
Disadvantages of consensus mechanism
 The disadvantages include groupthink, those with power leverage their position, and agreeing
to make bad decisions.
 Agreeing to Bad Decisions: Consensus decision-making does not always lead to good
decisions, especially if the group is relatively homogenous.
 Groupthink : The desire to reach a consensus can cause people to ignore indications that what
is proposed is a bad idea. The team pushes aside any data that may derail the consensus
decision.

3. AGREEMENT IN SYNCHRONOUS SYSTEMS WITH FAILURES

3.1 Agreement in (message-passing) Synchronous systems with Failures

What is an agreement in a distributed system?


 In distributed systems, where sites (or processors) often compete as well as cooperate to
achieve a common goal, it is often required that sites reach mutual agreement.
 For example, in distributed database systems, data managers at sites must agree on whether
to commit or to abort a transaction.

3.2 Agreement in a Failure-free System

Agreement in a failure-free system can be reached using token circulation on a logical ring, a three-phase tree-based broadcast–convergecast–broadcast, or direct communication with all nodes.
Trivially, agreement problems such as consensus that cannot be solved in non-anonymous asynchronous systems prone to process failures cannot be solved either if the system is anonymous.

3.3 Consensus algorithm for crash failures (synchronous system)

This is a consensus algorithm for a message-passing synchronous system subject to crash failures.
The algorithm solves consensus among n processes, of which up to f (f < n) may fail under the fail-stop failure model.
The consensus variable x is an integer; each process Pi has an initial value xi.
If up to f failures are to be tolerated, the algorithm runs for f + 1 rounds. In each round, a process broadcasts the current value of its variable xi to all other processes, provided that value has not been sent before.
At the end of a round, each process takes the minimum of its own value and all values received in that round and uses it to update xi. After f + 1 rounds, the local value xi is guaranteed to be the consensus value.
For example, if one of three processes is faulty, then f = 1 and agreement requires f + 1 = 2 rounds.
Suppose the faulty process sends 0 to one process and 1 to another in the first round. In the second round, each non-faulty process broadcasts the value it currently holds, so by the end of that (failure-free) round all non-faulty processes have seen the same set of values and compute the same minimum.

(global constants)
integer: f;                       // maximum number of crash failures tolerated

(local variables)
integer: x ← local value;

(1) Process Pi (1 ≤ i ≤ n) executes the consensus algorithm for up to f crash failures:
(1a) for round from 1 to f + 1 do
(1b)     if the current value of x has not been broadcast then
(1c)         broadcast(x);
(1d)     yj ← value (if any) received from process j in this round;
(1e)     x ← min over all j of (x, yj);
(1f) output x as the consensus value.

Fig : Consensus with up to f fail-stop processes in a system of n processes, n > f
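
As an illustration, here is a small Python sketch (not from the source) that simulates the f + 1 round minimum-based consensus with fail-stop processes. The crash model is simplified (a process crashes cleanly at the end of a round), and the function name and crash schedule are assumptions made for this example.

# Illustrative simulation of min-based consensus tolerating f crash failures.
# A crashed process stops participating from its crash round onward.
def crash_consensus(initial_values, f, crash_round=None):
    n = len(initial_values)
    x = list(initial_values)            # local value x at each process
    alive = [True] * n
    sent = [set() for _ in range(n)]    # values each process has already broadcast

    for rnd in range(1, f + 2):         # f + 1 rounds
        inbox = [[] for _ in range(n)]
        for i in range(n):
            if alive[i] and x[i] not in sent[i]:
                sent[i].add(x[i])
                for j in range(n):      # broadcast x[i] to every other process
                    if j != i:
                        inbox[j].append(x[i])
        if crash_round is not None and rnd == crash_round:
            alive[0] = False            # process 0 crashes after this round (assumed)
        for i in range(n):
            if alive[i]:
                x[i] = min([x[i]] + inbox[i])
    return [x[i] for i in range(n) if alive[i]]

# All surviving processes decide the same (minimum) value:
print(crash_consensus([3, 1, 2], f=1, crash_round=1))   # [1, 1]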

The agreement condition is satisfied because in the f + 1 rounds, there must be at least one round in which no process failed.

In this round, say round r, all the processes that have not failed so far succeed in broadcasting their values, and all these processes take the minimum of the values broadcast and received in that round.

Thus, the local values at the end of round r are the same, say xi, for all non-failed processes.

In further rounds, only this value may be sent by each process at most once, and no process will update its value xi.

The validity condition is satisfied because processes do not send fictitious values in this failure model.

For all i, if the initial values are identical, then the only value sent by any process is that value, which is the value agreed upon as per the agreement condition.

The termination condition is clearly satisfied.

Complexity

The algorithm requires f + 1 rounds, where f < n. In each round O(n²) messages are sent, and each message carries a single integer; hence the total number of messages is O((f + 1) · n²).

Lower bound on the number of rounds

At least f + 1 rounds are required, where f < n.

In the worst-case scenario, one process may fail in each round; with f + 1 rounds, there is at least one round in which no process fails. In that guaranteed failure-free round, all messages broadcast can be delivered reliably, and all processes that have not failed can compute the common function of the received values to reach an agreement value.
3.4 Voting-based Consensus Algorithms

As consensus in distributed systems has widened significantly in scope, it is important to draw some broad categories to understand the algorithms better.
Some of the earliest implementations of consensus algorithms used different voting-based mechanisms.
These provide reasonable fault tolerance and have strong mathematical proofs to ensure security and stability.
However, the very democratic nature of these algorithms makes them incredibly slow and inefficient, especially as the network grows larger.
3.5 Consensus algorithms for Byzantine failures (synchronous system)
What is Byzantine problem in distributed system?

The Byzantine generals problem is a well-known concept in distributed computing and computer
science that describes the difficulty of coordinating the actions of several independent parties in a
distributed system.
Byzantine agreement
 In the Byzantine agreement problem, a single value, which is to be agreed on, is initialized by an
arbitrary processor and all nonfaulty processors have to agree on that value.
 In the consensus problem, every processor has its own initial value and all nonfaulty processors
must agree on a single common value.
Upper bound on Byzantine processes

In a system of n processes, the Byzantine agreement problem can be solved in a synchronous system only if the number of Byzantine processes f satisfies

    f ≤ ⌊(n − 1)/3⌋

Fig: Impossibility of achieving Byzantine agreement with n = 3 processes and f = 1 malicious process

With n = 3 and f = 1, the condition f ≤ ⌊(n − 1)/3⌋ is violated, since ⌊(3 − 1)/3⌋ = 0 < 1; hence, by the bound above, Byzantine agreement is not possible.

In one scenario the source process P0 is faulty; in the other scenario P0 (the source) is non-faulty but another process is faulty.

If the source is faulty, it can send value 0 to one process and value 1 to the other; when the two non-faulty processes exchange what they received, each sees conflicting reports and cannot tell whether the source or the other process is lying.

If the source is non-faulty but, say, P2 is faulty, the source sends the same value to P1 and P2, but P2 relays a different value to P1; P1 again sees the same conflicting pattern. Since the two scenarios are indistinguishable to a non-faulty process, agreement with n = 3 and f = 1 is impossible.

Agreement is possible when f = 1 and the total number of processes is 4. Consider the commander Pc as the source.

Suppose the commander is faulty and sends 0 to Pd, 0 to Pb, but 1 to Pa in the first round. In the second round each lieutenant relays the value it received: Pa sends 1 to both of its neighbors, while Pb and Pd, being non-faulty, each send 0 to both of theirs.

Each lieutenant now holds three values (one received directly from the commander and two relayed by the other lieutenants): Pa holds {1, 0, 0}, Pb holds {0, 1, 0}, and Pd holds {0, 1, 0}. In every case the majority is 0.

Thus, even though the source is faulty, all non-faulty processes reach agreement, and the agreed value is 0.

Fig : Achieving Byzantine agreement when n = 4 processes and f = 1
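
The following is a minimal Python sketch (not from the source) of the oral-messages style exchange for n = 4 and f = 1 described above: the commander sends a value to each lieutenant, each lieutenant relays what it received, and each lieutenant decides by majority. The process names and the way a faulty lieutenant is modeled are assumptions for illustration.

from collections import Counter

# Sketch of Byzantine agreement with n = 4, f = 1 using one relay round and
# majority voting (the textbook example above; names are illustrative).
def byzantine_round(commander_sends, faulty_lieutenant=None, lie_value=None):
    lieutenants = list(commander_sends.keys())   # e.g. ["Pa", "Pb", "Pd"]
    decisions = {}
    for l in lieutenants:
        # value received directly from the commander
        values = [commander_sends[l]]
        # values relayed by the other two lieutenants
        for other in lieutenants:
            if other == l:
                continue
            relayed = commander_sends[other]
            if other == faulty_lieutenant:       # a faulty lieutenant may lie
                relayed = lie_value
            values.append(relayed)
        # decide by majority of the three values held
        decisions[l] = Counter(values).most_common(1)[0][0]
    return decisions

# Faulty commander: sends 1 to Pa but 0 to Pb and Pd; all lieutenants decide 0.
print(byzantine_round({"Pa": 1, "Pb": 0, "Pd": 0}))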

4. AGREEMENT IN A FAILURE-FREE SYSTEM (SYNCHRONOUS AND ASYNCHRONOUS)

4.1 Failure-Free Systems

In a failure-free system, consensus can be reached by collecting information from the different processes, arriving at a decision, and distributing this decision in the system.
A distributed mechanism would have each process broadcast its values to the others, and each process computes the same function on the values received.
The decision can be reached by using an application-specific function.
Algorithms to collect the initial values and then distribute the decision may be based on token circulation on a logical ring, the three-phase tree-based broadcast–convergecast–broadcast, or direct communication with all nodes.
In a synchronous system, this can be done simply in a constant number of rounds.
Further, common knowledge of the decision value can be obtained using an additional round.
In an asynchronous system, consensus can similarly be reached in a constant number of message hops.
Further, concurrent common knowledge of the consensus value can also be attained.
For the "no failure" case, consensus is attainable.
Further, in a synchronous system, common knowledge of the consensus value is also attainable, whereas in the asynchronous case, concurrent common knowledge of the consensus value is attainable.
5. CHECKPOINTING AND ROLLBACK RECOVERY: INTRODUCTION

5.1 Introduction

5.1.1 Checkpointing
Checkpointing is a mechanism to store the state of a computation so that it can be retrieved at a later point in time and continued.
The saved state is called a checkpoint, and the procedure of restarting from a previously checkpointed state is called rollback recovery.
A checkpoint can be saved on either stable storage or volatile storage, depending on the failure scenarios to be tolerated.
The process of writing the computation's state is referred to as checkpointing, the data written is the checkpoint, and the continuation of the application is called restart or recovery.

5.1.1.1 Local Checkpoint

In distributed systems, all processes save their local states at certain instants of time. This saved state is known as a local checkpoint.
A local checkpoint is a snapshot of the state of the process at a given instant. The event of recording the state of a process is called local checkpointing.
The contents of a checkpoint depend upon the application context and the checkpointing method being used.
Depending upon the checkpointing method used, a process may keep several local checkpoints or just a single checkpoint at any time.
We assume that a process stores all local checkpoints on stable storage so that they are available even if the process crashes.
We assume that a process is able to roll back to any of its existing local checkpoints and thus restore to and restart from the corresponding state.
5.2 Rollback recovery
What is rollback recovery?
 The roll back recovery algorithm is based on a pattern similar to the two-phase commit
protocol.
 When a failure occurs in a process, the process recovers or rolls back to a previously consistent
state and sends a request to all other processes to restart.
Why do we need efficient recovery in message passing systems?
i) Costly messaging can be avoided
ii) All data is not computed repeatedly
iii) The algorithms are suited for different situations
Factors we need to consider
a. Stable Storage
b. Input/Output protocols
c. Garbage collection
d. Consistency
6. BACKGROUND AND DEFINITIONS

6.1 An Introduction to Checkpoint and Roll back recovery


 Checkpoint and Roll back recovery are the two important techniques used by a recovery
manager to recover the state of a process in the event of failure and let the process proceed
normally, in spite of failure.

A checkpoint is an entry made in the recovery file at specific intervals of time that forces all the currently committed values to stable storage.
Recovery from an error is essential to fault tolerance; an error is the part of the system state that is liable to lead to failure.

6.2 Error recovery


 The whole idea of error recovery is to replace an erroneous state with an error-free state.
Error recovery can be broadly divided into two categories:
i) Backward Recovery
ii) Forward Recovery

i) Backward Recovery:
 Moving the system from its current state back into a formerly accurate condition from an
incorrect one is the main challenge in backward recovery.
 It will be required to accomplish this by periodically recording the system’s state and restoring
it when something goes wrong.
 A checkpoint is deemed to have been reached each time (part of) the system’s current state is
noted.
ii) Forward Recovery:
 Instead of returning the system to a previous, checkpointed state in this instance when it has
entered an incorrect state, an effort is made to place the system in a correct new state from
which it can continue to operate.
 The fundamental issue with forward error recovery techniques is that potential errors must be
anticipated in advance.
 Only then is it feasible to change those mistakes and transfer to a new state.
 Whenever a process fails it will roll back to the most recent checkpoint made and restart the
system from a previously consistent state.
 In a distributed system, the recovery managers need to make sure that these checkpoints lead the
system to a globally consistent state when a server recovers from a failure and restarts.

6.3 Rollback recovery


 Rollback recovery treats a distributed system application as a collection of processes that
communicate over a network.
 It achieves fault tolerance by periodically saving the state of a process during the failure-free
execution, enabling it to restart from a saved state upon a failure to reduce the amount of lost
work.

 In distributed systems, rollback recovery is complicated because messages induce inter-
process dependencies during failure-free operation.
 Upon a failure of one or more processes in a system, these dependencies may force some of
the processes that did not fail to roll back, creating what is commonly called a rollback
propagation.
 Distributed systems are not fault-tolerant and the vast computing potential of these systems
is often hampered by their susceptibility to failures.
 Many techniques have been developed to add reliability and high availability to distributed
systems. These techniques include transactions, group communication, and rollback
recovery.
 These techniques have different tradeoffs and focus.

Drawback of a checkpoint based rollback recovery approach (Disadvantages)


a) Multiple processes may force independent checkpoints for the same Z-formation.
b) Checkpoints may be forced for Z-formations that never emerge.

6.4 Rollback recovery protocols

Rollback recovery protocols restore the system back to a consistent state after a failure.
They achieve fault tolerance by periodically saving the state of a process during failure-free execution.
They treat a distributed system application as a collection of processes that communicate over a network.
Checkpoints are the saved states of a process.
The dependencies among messages may force some of the processes that did not fail to roll back. This phenomenon is called the "domino effect".

6.5 Techniques that avoid domino effect

i) Coordinated checkpointing rollback recovery


 Processes coordinate their checkpoints to form a system-wide consistent state.
ii) Communication-induced checkpointing rollback recovery
Forces each process to take checkpoints based on information piggybacked on application messages.
iii) Log-based rollback recovery
 Combines checkpointing with logging of non-deterministic events.
 Relies on PieceWise Deterministic (PWD) assumption.

7. ISSUES IN FAILURE RECOVERY

7.1 Introduction
In failure recovery, we must not only restore the system to a consistent state, but also handle the messages that are left in an abnormal state due to the failure and recovery.
The example computation comprises three processes Pi, Pj and Pk, connected through a communication network.
The processes communicate by exchanging messages over fault-free, FIFO communication channels.
Processes Pi, Pj and Pk have taken checkpoints {Ci,0, Ci,1}, {Cj,0, Cj,1, Cj,2}, and {Ck,0, Ck,1}, respectively, and these processes have exchanged messages A to J as shown in Figure 2.

Figure 2: Illustration of Issues in Failure Recovery


Checkpoints : {Ci,0, Ci,1}, {Cj,0, Cj,1, Cj,2}, and {Ck,0, Ck,1}
Messages : A - J
The restored globally consistent state : {Ci,1, Cj,1, Ck,1}
The rollback of process Pi to checkpoint Ci,1 created an orphan message H
Orphan message I is created due to the rollback of process Pj to checkpoint Cj,1
Messages C, D, E, and F are potentially problematic
i) Message C: a delayed message
ii) Message D: a lost message, since the send event for D is recorded in the restored state for Pj, but the receive event has been undone at process Pi.
Lost messages can be handled by having processes keep a message log of all sent messages
iii) Messages E, F: delayed orphan messages. After resuming execution from their checkpoints, processes will regenerate both of these messages
Suppose process Pi fails at the instance indicated in the Figure.

All the contents of the volatile memory of Pi are lost and, after Pi recovers from the failure, the system needs to be restored to a consistent global state from which the processes can resume their execution.
Process Pi's state is restored to a valid state by rolling it back to its latest checkpoint Ci,1.
To restore the system to a consistent state, process Pj rolls back to checkpoint Cj,1, because the rollback of process Pi to checkpoint Ci,1 created an orphan message H (the receive event of H is recorded at process Pj while the send event of H has been undone at process Pi).
Process Pj rolls back to checkpoint Cj,1 rather than Cj,2, because rolling back to checkpoint Cj,2 does not eliminate the orphan message H.
Even this resulting state is not a consistent global state, as an orphan message I is created due to the rollback of process Pj to checkpoint Cj,1.

To eliminate this orphan message, process Pk rolls back to checkpoint Ck,1.

The restored global state {Ci,1, Cj,1, Ck,1} is a consistent state as it is free from orphan messages.
Although the system state has been restored to a consistent state, several messages are left in an incorrect state and must be handled correctly.
Messages A, B, D, G, H, I, and J had been received at the points indicated in the figure, and messages C, E and F were in transit when the failure occurred.
Restoration of the system state to checkpoints {Ci,1, Cj,1, Ck,1} automatically handles messages A, B, and J, because the send and receive events of messages A, B and J have both been recorded, and both events for G, H, and I have been completely undone.
These messages cause no problem; we call messages A, B and J normal messages and messages G, H and I vanished messages.
Messages C, D, E, and F are potentially problematic. Message C is in transit during the failure and is a delayed message.
The delayed message C has several possibilities:
C might arrive at process Pi before it recovers, it might arrive while Pi is recovering, or it might arrive after Pi has completed recovery.
Each of these cases must be dealt with correctly.
Message D is a lost message, since the send event for D is recorded in the restored state of process Pj but the receive event has been undone at process Pi. Process Pj will not resend D, since the send of D at Pj occurred before the checkpoint and the communication system had successfully delivered D.

Messages E and F are delayed orphan messages and pose perhaps the most serious problem of all the messages.
When messages E and F arrive at their respective destinations, they must be discarded since their send events have been undone.
Processes, after resuming execution from their checkpoints, will generate both of these messages, and recovery techniques must be able to distinguish between messages like C and those like E and F.
Lost messages like D can be handled by having processes keep a message log of all the sent messages.
So when a process restores to a checkpoint, it replays the messages from its log to handle the lost message problem.
Message logging and message replaying during recovery can result in duplicate messages.
In the example shown in the figure, when process Pj replays messages from its log, it will regenerate message J.
Process Pk, which has already received message J, will receive it again, thereby causing inconsistency in the system state. These duplicate messages must be handled properly.
Overlapping failures further complicate the recovery process. A process Pj that begins rollback/recovery in response to the failure of a process Pi can itself fail and forget process Pi's failure.
If overlapping failures are to be tolerated, a mechanism must be introduced to deal with the resulting inconsistencies.



7.2 Checkpoint-based rollback-recovery

Checkpoint-based rollback-recovery techniques can be classified into three categories:


i) Uncoordinated checkpointing
ii) Coordinated checkpointing
iii) Communication-induced checkpointing

i) Uncoordinated Checkpointing

In uncoordinated checkpointing, each process has the liberty to decide when to take checkpoints.
This eliminates synchronization overhead, as there is no need for coordination between processes, and it allows processes to take checkpoints when it is most convenient or efficient.
The advantage is the lower runtime overhead during normal execution, because no coordination among processes is necessary.
The liberty in taking checkpoints allows each process to select appropriate checkpoint positions.

Uncoordinated checkpointing has shortcomings:

a) Domino effect during a recovery
b) Recovery from a failure is slow, because processes need to iterate to find a consistent set of checkpoints
c) Each process maintains multiple checkpoints and must periodically invoke a garbage collection algorithm
d) Not suitable for applications with frequent output commits

As each process takes checkpoints independently, we need to determine a consistent global checkpoint to roll back to when a failure occurs.
In order to determine a consistent global checkpoint during recovery, the processes record the dependencies among their checkpoints caused by message exchanges during failure-free operation.
The following direct dependency tracking technique is commonly used in uncoordinated checkpointing.

Direct dependency tracking technique

Assume each process Pi starts its execution with an initial checkpoint Ci,0.
Ii,x denotes the checkpoint interval between checkpoints Ci,x−1 and Ci,x.
When Pj receives a message m during interval Ij,y that was sent by Pi during its interval Ii,x, it records the dependency from Ii,x to Ij,y, which is later saved onto stable storage when Pj takes checkpoint Cj,y.
When a failure occurs, the recovering process initiates rollback by broadcasting a dependency request message to collect all the dependency information maintained by each process.
When a process receives this message, it stops its execution and replies with the dependency information saved on stable storage as well as with the dependency information, if any, associated with its current state.
The initiator then calculates the recovery line based on the global dependency information and broadcasts a rollback request message containing the recovery line.
Upon receiving this message, a process whose current state belongs to the recovery line simply resumes execution; otherwise, it rolls back to an earlier checkpoint as indicated by the recovery line.
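
A rough Python sketch (an assumption-laden illustration, not the protocol from the text) of how a process might record direct dependencies by piggybacking its current checkpoint-interval index on each message:

# Each process piggybacks (sender_id, interval_index) on outgoing messages;
# the receiver records a dependency from the sender's interval to its own
# current interval. Names and structures are illustrative assumptions.
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 1            # current checkpoint interval index (I_{pid,interval})
        self.volatile_deps = set()   # dependencies recorded since last checkpoint
        self.stable_deps = set()     # dependencies saved with checkpoints

    def send(self, payload):
        # attach the sender's identity and current interval to the message
        return {"from": self.pid, "interval": self.interval, "payload": payload}

    def receive(self, msg):
        # record: interval I_{msg.from, msg.interval} -> I_{self.pid, self.interval}
        self.volatile_deps.add(((msg["from"], msg["interval"]),
                                (self.pid, self.interval)))

    def take_checkpoint(self):
        # saving a checkpoint flushes the recorded dependencies to stable storage
        self.stable_deps |= self.volatile_deps
        self.volatile_deps.clear()
        self.interval += 1           # start the next interval

pi, pj = Process("Pi"), Process("Pj")
pj.receive(pi.send("m"))
pj.take_checkpoint()
print(pj.stable_deps)   # {(('Pi', 1), ('Pj', 1))}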

ii) Coordinated Checkpointing

In coordinated checkpointing, processes orchestrate their checkpointing activities so that all local checkpoints form a consistent global state.

Types of Coordinated Checkpointing

a) Blocking Checkpointing: After a process takes a local checkpoint, to prevent orphan messages, it remains blocked until the entire checkpointing activity is complete.
Disadvantage: The computation is blocked during the checkpointing.
b) Non-blocking Checkpointing:
 The processes need not stop their execution while taking checkpoints.
 A fundamental problem in coordinated checkpointing is to prevent a process
from receiving application messages that could make the checkpoint
inconsistent.

Example (a) : Checkpoint inconsistency

Message m is sent by P0 after receiving a checkpoint request from the checkpoint coordinator.
Assume m reaches P1 before the checkpoint request.
This situation results in an inconsistent checkpoint, since checkpoint C1,x shows the receipt of message m from P0, while checkpoint C0,x does not show m being sent from P0.

Example (b) : A solution with FIFO channels

If channels are FIFO, this problem can be avoided by preceding the first post-checkpoint message on each channel with a checkpoint request, forcing each process to take a checkpoint before receiving the first post-checkpoint message.

Impossibility of min-process non-blocking checkpointing

A min-process, non-blocking checkpointing algorithm is one that forces only a minimum number of processes to take a new checkpoint and, at the same time, does not force any process to suspend its computation.
Algorithm

 The algorithm consists of two phases. During the first phase, the checkpoint initiator
identifies all processes with which it has communicated since the last checkpoint and
sends them a request.
 Upon receiving the request, each process in turn identifies all processes it has
communicated with since the last checkpoint and sends them a request, and so on,
until no more processes can be identified.
 During the second phase, all processes identified in the first phase take a checkpoint.
The result is a consistent checkpoint that involves only the participating processes.

 In this protocol, after a process takes a checkpoint, it cannot send any message until
the second phase terminates successfully, although receiving a message after the
checkpoint has been taken is allowable.
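
A short Python sketch (illustrative only, not the published algorithm) of the first phase described above: starting from the initiator, transitively identify every process that has communicated, directly or indirectly, since its last checkpoint. The input map `comm_since_last_ckpt` is an assumed data structure.

# Phase one of a min-process checkpointing protocol: compute the set of
# processes that must take part in the new consistent checkpoint.
def processes_to_checkpoint(initiator, comm_since_last_ckpt):
    to_visit = [initiator]
    identified = {initiator}
    while to_visit:
        p = to_visit.pop()
        for q in comm_since_last_ckpt.get(p, set()):
            if q not in identified:       # q also has to take a checkpoint
                identified.add(q)
                to_visit.append(q)
    return identified                      # phase two: these processes checkpoint

print(processes_to_checkpoint("P1", {"P1": {"P2"}, "P2": {"P3"}, "P4": set()}))
# {'P1', 'P2', 'P3'}  -- P4 is not forced to take a checkpoint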

iii) Communication-induced Checkpointing

Communication-induced checkpointing is another way to avoid the domino effect, while allowing processes to take some of their checkpoints independently. Processes may be forced to take additional checkpoints.

Two types of checkpoints

1. Autonomous checkpoints

2. Forced checkpoints

The checkpoints that a process takes independently are called local checkpoints, while those that a process is forced to take are called forced checkpoints.
Communication-induced checkpointing piggybacks protocol-related information on each application message.
The receiver of each application message uses the piggybacked information to determine if it has to take a forced checkpoint to advance the global recovery line.
The forced checkpoint must be taken before the application may process the contents of the message.
In contrast with coordinated checkpointing, no special coordination messages are exchanged.

Two types of communication-induced checkpointing

a) Model-based checkpointing

b) Index-based checkpointing.

a) Model-based checkpointing

Model-based checkpointing prevents patterns of communication and checkpoints that could result in inconsistent states among the existing checkpoints.
No control messages are exchanged among the processes during normal operation. All information necessary to execute the protocol is piggybacked on application messages.
There are several domino-effect-free checkpoint and communication models.
The MRS (mark, send, and receive) model of Russell avoids the domino effect by ensuring that within every checkpoint interval all message-receiving events precede all message-sending events.
b) Index-based checkpointing
Index-based communication-induced checkpointing assigns monotonically increasing indexes to
checkpoints, such that the checkpoints having the same index at different processes form a consistent
state.
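
A minimal Python sketch (an illustration under assumed names, not a specific published protocol) of the index-based idea: each process piggybacks its current checkpoint index on every message, and a receiver whose index lags behind takes forced checkpoints before processing the message.

# Index-based communication-induced checkpointing, sketched: checkpoints with
# the same index at different processes are intended to form a consistent state.
class IndexedProcess:
    def __init__(self, pid):
        self.pid = pid
        self.index = 0          # index of the latest local checkpoint

    def take_checkpoint(self, forced=False):
        self.index += 1
        kind = "forced" if forced else "local"
        print(f"{self.pid}: {kind} checkpoint with index {self.index}")

    def send(self, payload):
        return {"index": self.index, "payload": payload}

    def receive(self, msg):
        # advance to the sender's index before processing the message
        while self.index < msg["index"]:
            self.take_checkpoint(forced=True)
        # ... process msg["payload"] here ...

p, q = IndexedProcess("P"), IndexedProcess("Q")
p.take_checkpoint()          # P's index becomes 1
q.receive(p.send("m"))       # Q is forced to checkpoint with index 1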

8. COORDINATED CHECKPOINTING ALGORITHM

8.1 Basic Strategy


1. Create checkpoints on each process.
2. During a failure, roll back to the last checkpoint.
3. Check the consistency with other processes.
4. Roll back further if inconsistent (i.e., there is a dependency on a variable).
8.2 Coordinated Checkpointing Algorithm

8.2.1 Non-Blocking Checkpoint Coordination

 In this approach the processes don’t need to stop their execution while taking
checkpoints.
 A fundamental problem in coordinated checkpointing is to prevent a process from
receiving application messages that could make the checkpoint inconsistent.

Advantages
i) Simplifies recovery.
ii) Avoids the domino effect.
iii) Only one permanent set of states in stable storage.

Disadvantages
However, there exists high latency for interaction with the outside world (global checkpoint
necessary)

8.3 Asynchronous checkpointing and recovery

This subsection presents the algorithm of Juang and Venkatesan for recovery in a system that uses asynchronous checkpointing.

A. System Model and Assumptions

The algorithm makes the following assumptions about the underlying system:
The communication channels are reliable, deliver messages in FIFO order, and have infinite buffers.
The message transmission delay is arbitrary, but finite. The processors directly connected to a processor via communication channels are called its neighbors.
The underlying computation or application is assumed to be event-driven:
a processor P waits until a message m is received, processes the message m, changes its state from s to s′, and sends zero or more messages to some of its neighbors.
Then the processor remains idle until the receipt of the next message.
The new state s′ and the contents of the messages sent to its neighbors depend on the state s and the contents of message m.
The events at a processor are identified by unique, monotonically increasing numbers ex0, ex1, ex2, ... (see Fig. 1).

Figure 1: An event driven computation


To help recovery after a process failure and restore the system to a consistent state, two types of log storage are maintained: a volatile log and a stable log. Accessing the volatile log takes less time than accessing the stable log, but the contents of the volatile log are lost if the corresponding processor fails. The contents of the volatile log are periodically flushed to stable storage.

B. Asynchronous Checkpointing
After executing an event, a processor records a triplet {s, m, msgs_sent} in its volatile storage, where s is the state of the processor before the event, m is the message (including the identity of the sender of m, denoted m.sender) whose arrival caused the event, and msgs_sent is the set of messages that were sent by the processor during the event.
A local checkpoint at a processor consists of the record of an event occurring at the processor, and it is taken without any synchronization with other processors.
Periodically, a processor independently saves the contents of the volatile log in stable storage and clears the volatile log. This operation amounts to taking a local checkpoint.
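
The following Python sketch (class and method names are assumptions, not the authors' code) illustrates the event logging just described: after each event the triplet (s, m, msgs_sent) goes into a volatile log, and flushing that log to stable storage amounts to a local checkpoint.

# Asynchronous checkpointing sketch: volatile log of (state, message, msgs_sent)
# triplets, periodically flushed to a stable log.
class Processor:
    def __init__(self):
        self.state = "s0"
        self.volatile_log = []
        self.stable_log = []

    def handle_event(self, message, new_state, msgs_sent):
        # record the state before the event, the message that caused it,
        # and the messages sent while processing it
        self.volatile_log.append((self.state, message, tuple(msgs_sent)))
        self.state = new_state

    def flush(self):
        # taking a local checkpoint: move volatile log contents to stable storage
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

p = Processor()
p.handle_event(message="m1", new_state="s1", msgs_sent=["a", "b"])
p.flush()                      # the record of the event is now checkpointed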
C. The Recovery Algorithm

Notations and data structures
The following notations and data structures are used by the algorithm:
• RCVDi←j(CkPti) represents the number of messages received by processor pi from processor pj, from the beginning of the computation till the checkpoint CkPti.
• SENTi→j(CkPti) represents the number of messages sent by processor pi to processor pj, from the beginning of the computation till the checkpoint CkPti.
Basic idea
Since the algorithm is based on asynchronous checkpointing, the main issue in recovery is to find a consistent set of checkpoints to which the system can be restored.
The recovery algorithm achieves this by making each processor keep track of both the number of messages it has sent to other processors and the number of messages it has received from other processors. Recovery may involve several iterations of rollbacks by processors.
Whenever a processor rolls back, it is necessary for all other processors to find out whether any message sent by the rolled-back processor has become an orphan message.
Orphan messages are discovered by comparing the number of messages sent to and received from neighboring processors.
For example, if RCVDi←j(CkPti) > SENTj→i(CkPtj) (that is, the number of messages received by processor pi from processor pj is greater than the number of messages sent by processor pj to processor pi, according to the current states of the processors), then one or more messages at processor pj are orphan messages.
In this case, processor pj must roll back to a state where the number of messages received agrees with the number of messages sent.
Consider the example shown in Fig. 1. Suppose processor Y crashes at the point indicated and rolls back to a state corresponding to checkpoint ey1.
According to this state, Y has sent only one message to X; however, according to X's current state (ex2), X has received two messages from Y.
X must roll back to a state preceding ex2 to be consistent with Y's state.
If X rolls back to checkpoint ex1, then it will be consistent with Y's state ey1. Similarly, processor Z must roll back to checkpoint ez2 to be consistent with Y's state ey1.
Processors X and Z will have to resolve any such mutual inconsistencies.

The Algorithm
When a processor restarts after a failure, it broadcasts a ROLLBACK message announcing that it has failed.
The recovery algorithm at a processor is initiated when it restarts after a failure or when it learns of a failure at another processor.
Because of the broadcast of ROLLBACK messages, the recovery algorithm is initiated at all processors.

Procedure RollBack_Recovery:
processor pi executes the following:

STEP (a)
if processor pi is recovering after a failure then
    CkPti := latest event logged in the stable storage
else
    CkPti := latest event that took place in pi {The latest event at pi can be either in stable or in volatile storage.}
end if

STEP (b)
for k = 1 to N {N is the number of processors in the system} do
    for each neighboring processor pj do
        compute SENTi→j(CkPti)
        send a ROLLBACK(i, SENTi→j(CkPti)) message to pj
    end for
    for every ROLLBACK(j, c) message received from a neighbor j do
        if RCVDi←j(CkPti) > c {Implies the presence of orphan messages} then
            find the latest event e such that RCVDi←j(e) = c {Such an event e may be in the volatile storage or stable storage.}
            CkPti := e
        end if
    end for
end for {for k}

The rollback starts at the failed processor and slowly diffuses into the entire system through ROLLBACK messages.
The procedure has |N| iterations. During the kth iteration (k ≠ 1), a processor pi does the following:
(i) based on the state CkPti to which it was rolled back in the (k − 1)th iteration, it computes SENTi→j(CkPti) for each neighbor pj and sends this value in a ROLLBACK message to that neighbor, and
(ii) pi waits for and processes the ROLLBACK messages that it receives from its neighbors in the kth iteration and determines a new recovery point CkPti for pi based on the information in these messages.
At the end of each iteration, at least one processor will roll back to its final recovery point, unless the current recovery points are already consistent.
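
The sketch below (a Python illustration, not the authors' code) shows the key rollback step for one processor: the recovery point is a prefix of its event log, SENT/RCVD counts are derived from that prefix, and the prefix is shortened until the number of messages received from a neighbor no longer exceeds the count reported in that neighbor's ROLLBACK message.

# Handling a ROLLBACK(j, c) message: if RCVD_i<-j(CkPt_i) > c, roll back to the
# latest event e with RCVD_i<-j(e) = c. Event records are assumed inputs.
class Node:
    def __init__(self, pid, events):
        # events: list of ("send", peer) / ("recv", peer) records, oldest first
        self.pid = pid
        self.events = events
        self.ckpt = len(events)          # current recovery point = whole log

    def rcvd_from(self, peer, upto=None):
        upto = self.ckpt if upto is None else upto
        return sum(1 for e in self.events[:upto] if e == ("recv", peer))

    def sent_to(self, peer, upto=None):
        upto = self.ckpt if upto is None else upto
        return sum(1 for e in self.events[:upto] if e == ("send", peer))

    def handle_rollback(self, peer, c):
        # ROLLBACK(peer, c): peer's restored state shows c messages sent to us
        if self.rcvd_from(peer) > c:     # orphan messages are present
            e = self.ckpt
            while self.rcvd_from(peer, e) > c:
                e -= 1                   # find the latest event with RCVD = c
            self.ckpt = e

x = Node("X", [("recv", "Y"), ("recv", "Y"), ("send", "Y")])
x.handle_rollback("Y", 1)    # Y's restored state shows only 1 message sent to X
print(x.ckpt)                # 1 -> X rolls back past its second receive from Y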

D. An Example
Consider the example shown in Figure 2, consisting of three processors. Suppose processor Y fails and restarts. If event ey2 is the latest checkpointed event at Y, then Y will restart from the state corresponding to ey2.

Figure 2: An example of the Juang-Venkatesan algorithm.

Because of the broadcast nature of ROLLBACK messages, the recovery algorithm is also initiated at processors X and Z. Initially, X, Y, and Z set CkPtX ← ex3, CkPtY ← ey2 and CkPtZ ← ez2, respectively, and X, Y, and Z send the following messages during the first iteration:
Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y,1) to Z;
X sends ROLLBACK(X,2) to Y and ROLLBACK(X,0) to Z;
and Z sends ROLLBACK(Z,0) to X and ROLLBACK(Z,1) to Y.
Since RCVDX←Y(CkPtX) = 3 > 2 (2 is the value received in the ROLLBACK(Y,2) message from Y), X will set CkPtX to ex2, satisfying RCVDX←Y(ex2) = 1 ≤ 2.
Since RCVDZ←Y(CkPtZ) = 2 > 1, Z will set CkPtZ to ez1, satisfying RCVDZ←Y(ez1) = 1 ≤ 1.
At Y, RCVDY←X(CkPtY) = 1 < 2 and RCVDY←Z(CkPtY) = 1 = SENTZ→Y(CkPtZ).
Y need not roll back further.
In the second iteration, Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y,1) to Z;
Z sends ROLLBACK(Z,1) to Y and ROLLBACK(Z,0) to X;
and X sends ROLLBACK(X,0) to Z and ROLLBACK(X,1) to Y.
If Y rolls back beyond ey3 and loses the message from X that caused ey3, X can resend this message to Y, because ex2 is logged at X and this message is available in the log. The second and third iterations progress in the same manner. The set of recovery points chosen at the end of the first iteration, {ex2, ey2, ez1}, is consistent, and no further rollback occurs.

9. COORDINATED CHECKPOINTING ALGORITHM (KOO-TOUEG)

9.1 Introduction to Coordinated Checkpointing

 Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case
of failures by preserving a consistent global checkpoint on stable storage.
 However, the approach suffers from high overhead associated with the checkpointing process.
 Two approaches are used to reduce the overhead:
i) First is to minimize the number of synchronization messages and the number of
checkpoints,
ii) the other is to make the checkpointing process nonblocking.

9.2 Coordinated Checkpointing Algorithm

The Koo-Toueg coordinated checkpointing and recovery technique takes a consistent set of checkpoints and avoids the domino effect and livelock problems during recovery.
It includes two parts: the checkpointing algorithm and the recovery algorithm.

A. The Checkpointing Algorithm

The checkpointing algorithm makes the following assumptions about the distributed system:

Processes communicate by exchanging messages through communication channels.

Communication channels are FIFO.

End-to-end protocols (such as the sliding window protocol) are assumed to exist to handle message loss due to rollback recovery and communication failures.

Communication failures do not partition the network.

The checkpointing algorithm takes two kinds of checkpoints on stable storage: permanent and tentative.
A permanent checkpoint is a local checkpoint at a process that is part of a consistent global checkpoint.
A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint on the successful termination of the checkpoint algorithm.

The algorithm consists of two phases:
i) First Phase
ii) Second Phase

i) First Phase
1. An initiating process Pi takes a tentative checkpoint and requests all other processes to take
tentative checkpoints. Each process informs Pi whether it succeeded in taking a tentative
checkpoint.
2. A process says “no” to a request if it fails to take a tentative checkpoint

3. If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides
that all tentative checkpoints should be made permanent; otherwise, Pi decides that all the
tentative checkpoints should be thrown-away.

ii) Second Phase

1. Pi informs all the processes of the decision it reached at the end of the first phase.

2. A process, on receiving the message from Pi will act accordingly.

3. Either all or none of the processes advance the checkpoint by taking permanent checkpoints.
4. The algorithm requires that after a process has taken a tentative checkpoint, it cannot
send messages related to the basic computation until it is informed of Pi’s decision.
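
The two-phase structure above can be summarized in a short Python sketch (an assumption-based illustration, not the published algorithm): the initiator collects "yes/no" replies to tentative-checkpoint requests and then tells everyone to commit or discard, so either all or none of the checkpoints become permanent.

# Two-phase coordinated checkpointing, sketched; Proc and its methods are
# hypothetical names used only for this example.
def coordinated_checkpoint(initiator, processes):
    # Phase 1: initiator takes a tentative checkpoint and requests the others
    ok = initiator.take_tentative()
    replies = [p.take_tentative() for p in processes]

    # Phase 2: decide and inform everyone; all or none become permanent
    decision = ok and all(replies)
    for p in [initiator] + processes:
        if decision:
            p.make_permanent()
        else:
            p.discard_tentative()
    return decision

class Proc:
    def __init__(self, name, willing=True):
        self.name, self.willing = name, willing
        self.tentative, self.permanent = False, False
    def take_tentative(self):
        self.tentative = self.willing     # a process may answer "no"
        return self.willing
    def make_permanent(self):
        if self.tentative:
            self.permanent, self.tentative = True, False
    def discard_tentative(self):
        self.tentative = False

ps = [Proc("P2"), Proc("P3")]
print(coordinated_checkpoint(Proc("P1"), ps))   # True -> checkpoints made permanent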

Correctness: for two reasons

i. Either all or none of the processes take permanent checkpoint

ii. No process sends message after taking permanent checkpoint

Optimization

The above protocol may cause a process to take a checkpoint even when it is not necessary for consistency.
Since taking a checkpoint is an expensive operation, such unnecessary checkpoints should be avoided.

B. The Rollback Recovery Algorithm

 The rollback recovery algorithm restores the system state to a consistent state after a
failure.

 The rollback recovery algorithm assumes that a single process invokes the algorithm.

 It assumes that the checkpoint and the rollback recovery algorithms are not invoked
concurrently.

The rollback recovery algorithm has two phases:

i) First Phase

ii) Second Phase

i) First Phase

1. An initiating process Pi sends a message to all other processes to check if they are all willing to restart from their previous checkpoints.
2. A process may reply "no" to a restart request for any reason (e.g., it is already participating in a checkpointing or a recovery process initiated by some other process).
3. If Pi learns that all processes are willing to restart from their previous checkpoints, Pi decides that all processes should roll back to their previous checkpoints.
4. Otherwise, Pi aborts the rollback attempt and may attempt a recovery at a later time.

ii) Second Phase

1. Pi propagates its decision to all the processes.

2. On receiving Pi’s decision, a process acts accordingly.

3. During the execution of the recovery algorithm, a process cannot send messages related to the underlying computation while it is waiting for Pi's decision.
Correctness: The system resumes from a consistent state.

Optimization: It may not be necessary to roll back all processes, since some of them may not have changed anything.
In the above protocol, in the event of failure of process X, processes X, Y, and Z would be required to restart from checkpoints x2, y2, and z2, respectively.
However, process Z need not roll back, because there has been no interaction between process Z and the other two processes since the last checkpoint at Z.

10. ALGORITHM FOR ASYNCHRONOUS CHECKPOINTING AND RECOVERY (JUANG-VENKATESAN)

This algorithm helps in recovery under asynchronous checkpointing.

The following assumptions are made:

communication channels are reliable

messages are delivered in FIFO order

buffers are infinite

message transmission delay is arbitrary but finite

The underlying computation or application is event-driven: when process P is at state s and receives message m, it processes the message, moves to state s′, and sends messages out. So the triplet (s, m, msgs_sent) represents the state of P.

To facilitate recovery after a process failure and restore the system to a consistent state, two types of log storage are maintained:

Volatile log: takes a short time to access, but its contents are lost if the processor crashes. The contents of the volatile log are moved to the stable log periodically.

Stable log: takes longer to access, but its contents survive a crash.
Asynchronous checkpointing

After executing an event, a processor records a triplet (s, m, msgs_sent) in its volatile storage, where:

s: state of the processor before the event

m: the message whose arrival caused the event

msgs_sent: set of messages that were sent by the processor during the event.

A local checkpoint at a processor consists of the record of an event occurring at the processor, and it is taken without any synchronization with other processors.

Periodically, a processor independently saves the contents of the volatile log in stable storage and clears the volatile log.

This operation is equivalent to taking a local checkpoint.

Recovery Algorithm

The data structures used in the algorithm are:

RCVDi←j(CkPti): the number of messages received by processor pi from processor pj, from the beginning of the computation until the checkpoint CkPti.

SENTi→j(CkPti): the number of messages sent by processor pi to processor pj, from the beginning of the computation until the checkpoint CkPti.

The main idea of the algorithm is to find a set of consistent checkpoints from the set of existing checkpoints.

This is done based on the number of messages sent and received.

Recovery may involve multiple iterations of rollbacks by processors.

Whenever a processor rolls back, it is necessary for all other processors to find out if any message sent by the rolled-back processor has become an orphan message.

The orphan messages are identified by comparing the number of messages sent to and received from neighboring processors.

When a processor restarts after a failure, it broadcasts a ROLLBACK message announcing that it has failed.

The recovery algorithm at a processor is initiated when it restarts after a failure or when it learns of a failure at another processor.

Because of the broadcast of ROLLBACK messages, the recovery algorithm is initiated at all processors.

Procedure RollBack_Recovery:

processor pi executes the following:

STEP (a)
if processor pi is recovering after a failure then
    CkPti := latest event logged in the stable storage
else
    CkPti := latest event that took place in pi {The latest event at pi can be either in stable or in volatile storage}
end if

STEP (b)
for k = 1 to N {N is the number of processors in the system} do
    for each neighboring processor pj do
        compute SENTi→j(CkPti)
        send a ROLLBACK(i, SENTi→j(CkPti)) message to pj
    end for
    for every ROLLBACK(j, c) message received from a neighbor j do
        if RCVDi←j(CkPti) > c {Implies the presence of orphan messages} then
            find the latest event e such that RCVDi←j(e) = c {Such an event e may be in the volatile storage or stable storage}
            CkPti := e
        end if
    end for
end for {for k}

Fig : Algorithm for Asynchronous Checkpointing and Recovery (Juang-Venkatesan)

The rollback starts at the failed processor and slowly diffuses into the entire system through ROLLBACK messages.

During the kth iteration (k ≠ 1), a processor pi does the following:

(i) based on the state CkPti to which it was rolled back in the (k − 1)th iteration, it computes SENTi→j(CkPti) for each neighbor pj and sends this value in a ROLLBACK message to that neighbor;

(ii) pi waits for and processes the ROLLBACK messages that it receives from its neighbors in the kth iteration and determines a new recovery point CkPti for pi based on the information in these messages.

Fig : Asynchronous Checkpointing and Recovery

At the end of each iteration, at least one processor will roll back to its final recovery point, unless the current recovery points are already consistent.

********************************************
