DC - Unit IV
ENGINEERING
Semester/Year: III/V
Prepared by: Dr. N. Indumathi    Approved by: HoD
UNIT – IV : Syllabus
UNIT IV CONSENSUS AND RECOVERY 10
Consensus and Agreement Algorithms: Problem Definition – Overview of Results – Agreement in a
Failure-Free System(Synchronous and Asynchronous) – Agreement in Synchronous Systems with
Failures; Checkpointing and Rollback Recovery: Introduction – Background and Definitions – Issues
in Failure Recovery – Checkpoint-based Recovery – Coordinated Checkpointing Algorithm --
Algorithm for Asynchronous Checkpointing and Recovery
Distributed Consensus
A procedure to reach a common agreement in a distributed or decentralized multi-agent
platform is called distributed consensus.
It is fundamental to message-passing systems.
Features:
It ensures reliability and fault tolerance in distributed systems.
It ensures correct operation even in the presence of faulty processes.
Examples:
Committing a transaction in a database, state machine replication, clock synchronization.
2. OVERVIEW OF RESULTS
2.1 Properties Expected of Consensus Algorithms
We expect a consensus algorithm to exhibit certain properties:
i) The first is Termination, implying that every non-faulty process must decide on a value
ii) The second is Agreement, that is, every correct process must agree on the same value
iii) The last is Integrity, which means if all the correct processes proposed the same value, then
any correct process must decide the same value
i) Agreement is a safety property.
– Every state of the system has this property in all possible executions.
– That is, either the processes have not agreed yet, or they have all agreed on the same value.
ii) Termination is a liveness property.
– Some state of the system has this property in all possible executions.
– The property is stable: once some state of an execution has the property, all subsequent states also
have it.
There are many ways in which processes in a distributed system can reach a consensus.
However, there is usually a constant tension between fault tolerance and performance: the
more failure modes we want our algorithm to tolerate, the less performant it tends to
become.
Then there are considerations like resource consumption and message size.
Distributed computing has grown to an enormous scale, and this growth led to the
development of numerous tailor-made consensus algorithms.
To make sense of them, it is useful to understand the computation and failure model in
which each consensus algorithm is expected to operate.
i) A crash fault occurs when a process abruptly stops and does not resume.
ii) On the contrary, a Byzantine fault is much more arbitrary and disruptive. It may occur for several
reasons, such as malicious activity by an adversary.
For instance, Byzantine members may send conflicting views of reality to different members.
The Byzantine failure derives its name from the “Byzantine generals problem”.
It’s a game theory problem that describes how actors in a decentralized system arrive at a
consensus without a trusted central party.
Some of the actors here can be unreliable.
Interestingly, in case of a Byzantine failure, a process can inconsistently appear both failed
and functioning to the rest of the system.
Hence, it becomes much more challenging for a consensus algorithm to be resilient to
Byzantine failures than crash failures.
2.4 Failure Detector
A failure detector is composed of failure detection modules, one per process, each of which
provides its process with a list of the processes currently suspected to have crashed.
A failure detection module can make mistakes, either by not suspecting a crashed process or by
erroneously suspecting a correct one.
Formally, a failure detector is defined by two properties:
i) completeness (a property on the actual detection of process crashes), and
ii) accuracy (a property that restricts the mistakes on erroneous suspicions).
Thus, in a fully asynchronous distributed system in which even a single process may have a
crash failure, it is impossible to have a deterministic algorithm that achieves consensus (the
FLP impossibility result).
2.6 Consensus algorithms
Consensus algorithms are vital in large-scale, fault-tolerant systems because they enable a set
of distributed/replicated machines or servers to work as a coherent group and agree on system
state, even in the presence of failures or outages.
Use of consensus algorithm
i) crash fault
In a crash fault, a processor stops functioning and never resumes operation.
In synchronous systems, if the duration of each round is known, then this detection is simple: all
the expected messages not received by the end of a round were not sent.
(global constants)
integer: f;                        // maximum number of crash failures tolerated
(local variables)
integer: x ← local value;
(1) Process Pi (1 ≤ i ≤ n) executes the consensus algorithm for up to f crash failures:
(1a) for round from 1 to f + 1 do
(1b)     if the current value of x has not been broadcast then
(1c)         broadcast(x);
(1d)     yj ← value (if any) received from process j in this round;
(1e)     x ← min over all j of (x, yj);
(2) output x as the consensus value.
The agreement condition is satisfied because in the f+ 1 rounds, there must be at least
one round in which no process failed.
In this round, say round r, all the processes that have not failed so far succeed in
broadcasting their values, and all these processes take the minimum of the values
broadcast and received in that round.
Thus, the local values at the end of the round are the same, say x, for all non-failed
processes.
In further rounds, only this value may be sent by each process at most once, and no
process i will update its value xi.
The validity condition is satisfied because processes do not send fictitious values in this
failure model.
For all i, if the initial value is identical, then the only value sent by any process is the
value that has been agreed upon as per the agreement condition.
Complexity
The algorithm requires f + 1 rounds, where f < n. In each round O(n^2) messages are sent,
and each message carries one integer; hence the total number of messages is
O((f + 1) · n^2).
In the worst-case scenario, one process may fail in each round; with f + 1 rounds, there
is at least one round in which no process fails. In that guaranteed failure-free round, all
messages broadcast can be delivered reliably, and all processes that have not failed can
compute the common function of the received values to reach an agreement value.
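The round structure argued above can be illustrated with a small simulation of the minimum-value algorithm (a hypothetical sketch, not the textbook code): a crashing process may deliver its broadcast to only a prefix of the other processes, yet after f + 1 rounds all surviving processes hold the same value.

```python
import random

def crash_consensus(values, f, seed=0):
    """Simulate the f+1-round min-value consensus algorithm under
    up to f crash faults (illustrative sketch only)."""
    rng = random.Random(seed)
    n = len(values)
    x = list(values)                 # current estimate at each process
    alive = [True] * n
    crashes_left = f
    for _ in range(f + 1):           # f+1 rounds guarantee one failure-free round
        received = [[] for _ in range(n)]
        for i in range(n):
            if not alive[i]:
                continue
            # a process may crash mid-broadcast, reaching only a prefix of peers
            targets = list(range(n))
            if crashes_left > 0 and rng.random() < 0.3:
                alive[i] = False
                crashes_left -= 1
                targets = targets[: rng.randrange(n)]
            for j in targets:
                received[j].append(x[i])
        for j in range(n):
            if alive[j]:
                x[j] = min([x[j]] + received[j])
    return {i: x[i] for i in range(n) if alive[i]}

decided = crash_consensus([5, 3, 8, 1, 7], f=2)
assert len(set(decided.values())) == 1   # agreement among all surviving processes
```

With at most f crashes spread over f + 1 rounds, the pigeonhole principle guarantees one crash-free round; in that round every surviving process receives every surviving value, so the local minima become identical and stay identical.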
3.4 Voting-based Consensus Algorithms
As the field of consensus in distributed systems has widened significantly, it is useful to
draw some broad categories to understand the algorithms better.
Some of the earliest implementations of consensus algorithms used various
voting-based mechanisms.
These provide reasonable fault tolerance and have strong mathematical proofs to
ensure security and stability.
However, the very democratic nature of these algorithms makes them incredibly
slow and inefficient, especially as the network grows larger.
3.5 Consensus algorithms for Byzantine failures (synchronous system)
What is the Byzantine problem in a distributed system?
The Byzantine generals problem is a well-known concept in distributed computing and computer
science that describes the difficulty of coordinating the actions of several independent parties in a
distributed system.
Byzantine agreement
In the Byzantine agreement problem, a single value, which is to be agreed on, is initialized by an
arbitrary processor and all nonfaulty processors have to agree on that value.
In the consensus problem, every processor has its own initial value and all nonfaulty processors
must agree on a single common value.
Upper bound on Byzantine processes
Byzantine agreement is possible only when the number of faulty processes f satisfies
n ≥ 3f + 1, i.e., f ≤ ⌊(n − 1)/3⌋.
Fig: Impossibility of achieving Byzantine agreement with n = 3 processes and f
= 1 malicious process
Here the condition f ≤ ⌊(n − 1)/3⌋ is violated: with n = 3 and f = 1, no faulty process can
be tolerated, so Byzantine agreement is not possible.
In one scenario the source P0 is faulty; in the other the source P0 is non-faulty but some
other process, say P2, is faulty.
A non-faulty source sends the same value to P1 and P2, whereas a faulty process may send
different values to different processes; in either case the non-faulty processes cannot tell
which scenario they are in.
Agreement is possible when f = 1 and the total number of processes is 4. To see how,
consider the commander Pc as the source.
Being faulty, Pc sends 0 to Pb and Pd but 1 to Pa in the first round. In the second round
each lieutenant relays the value it received to the others: Pa relays the 1 it received, while
Pb and Pd, being non-faulty, relay the 0 they received.
Each process then takes the majority of the values it holds: at Pa the values are 1, 0, 0; at
Pb and Pd they are 0, 1, 0. The majority at every non-faulty process is 0.
Thus, even though the source is faulty, the non-faulty processes reach agreement, and the
agreed-upon value (the agreement variable) is 0.
Fig : Achieving Byzantine agreement when n = 4 processes and f = 1
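The figure's scenario can be checked with a small script (a hypothetical sketch of the single relay round for n = 4 and f = 1, not the full oral-messages protocol; the process names follow the figure):

```python
from collections import Counter

def byzantine_round(commander_sends, faulty=None):
    """One relay round for n = 4, f = 1: each lieutenant relays the value it
    received from the commander and decides by majority. 'commander_sends'
    maps each lieutenant to the value the (possibly faulty) commander sent it;
    'faulty' optionally names a lying lieutenant."""
    lieutenants = list(commander_sends)
    decisions = {}
    for li in lieutenants:
        votes = [commander_sends[li]]            # value received directly
        for other in lieutenants:
            if other == li:
                continue
            relayed = commander_sends[other]
            if other == faulty:                  # a faulty lieutenant may lie
                relayed = 1 - relayed
            votes.append(relayed)
        decisions[li] = Counter(votes).most_common(1)[0][0]
    return decisions

# Faulty commander (as in the figure): sends 1 to Pa but 0 to Pb and Pd.
d = byzantine_round({"Pa": 1, "Pb": 0, "Pd": 0})
assert set(d.values()) == {0}                    # all lieutenants agree on 0

# Non-faulty commander sending 1, but lieutenant Pb is faulty:
d2 = byzantine_round({"Pa": 1, "Pb": 1, "Pd": 1}, faulty="Pb")
assert d2["Pa"] == 1 and d2["Pd"] == 1           # non-faulty lieutenants agree on 1
```

In both cases the majority vote at every non-faulty process yields the same value, matching the figure's conclusion that agreement is achievable with n = 4 and f = 1.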
Algorithms to collect the initial values and then distribute the decision may be based on
token circulation on a logical ring, the three-phase tree-based
broadcast–convergecast–broadcast, or direct communication with all nodes.
In a synchronous system, this can be done simply in a constant number of rounds.
Further, common knowledge of the decision value can be obtained using an additional round.
In an asynchronous system, consensus can similarly be reached in a constant number of
message hops.
Further, concurrent common knowledge of the consensus value can also be attained.
For the “no failure” case, consensus is attainable.
Further, in a synchronous system, common knowledge of the consensus value is also
attainable, whereas in the asynchronous case, concurrent common knowledge of the consensus
value is attainable.
5. CHECKPOINTING AND ROLLBACK RECOVERY: INTRODUCTION
5.1 Introduction
5.1.1 Checkpointing
Checkpointing is a mechanism to store the state of a computation so that it can be retrieved
at a later point in time and continued.
The saved state is called a checkpoint, and the procedure of restarting from a previously
checkpointed state is called rollback recovery.
A checkpoint can be saved on either stable storage or volatile storage, depending on
the failure scenarios to be tolerated.
The process of writing the computation's state is referred to as checkpointing, the data
written as the checkpoint, and the continuation of the application as restart or recovery.
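A minimal checkpoint/restart cycle can be sketched as follows (hypothetical code; the file path stands in for stable storage, and the state dictionary is an invented example):

```python
import os, pickle, tempfile

CKPT = os.path.join(tempfile.gettempdir(), "demo.ckpt")  # stands in for stable storage

def checkpoint(state):
    """Write the computation's state atomically to 'stable storage'."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)            # atomic rename: checkpoint is all-or-nothing

def restart():
    """Rollback recovery: resume from the most recent checkpoint."""
    with open(CKPT, "rb") as f:
        return pickle.load(f)

state = {"step": 0, "total": 0}
for state["step"] in range(1, 6):
    state["total"] += state["step"]
    checkpoint(state)                # checkpoint at specific intervals

recovered = restart()                # after a crash, restart from the checkpoint
assert recovered == {"step": 5, "total": 15}
```

The atomic rename ensures a crash during checkpointing leaves the previous checkpoint intact, so recovery never sees a half-written state.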
A checkpoint is an entry made in the recovery file at specific intervals of time that forces all
the currently committed values to stable storage.
Recovery from an error is essential to fault tolerance; an error is the part of the system state
that may lead to failure.
i) Backward Recovery:
Moving the system from its current state back into a formerly accurate condition from an
incorrect one is the main challenge in backward recovery.
It will be required to accomplish this by periodically recording the system’s state and restoring
it when something goes wrong.
A checkpoint is deemed to have been reached each time (part of) the system’s current state is
noted.
ii) Forward Recovery:
When the system has entered an incorrect state, instead of returning it to a previous,
checkpointed state, an effort is made to bring the system into a correct new state from
which it can continue to operate.
The fundamental issue with forward error recovery techniques is that potential errors must be
anticipated in advance.
Only then is it feasible to correct those errors and move to a new state.
Whenever a process fails it will roll back to the most recent checkpoint made and restart the
system from a previously consistent state.
In a distributed system, the recovery managers need to make sure that these checkpoints lead the
system to a globally consistent state when a server recovers from a failure and restarts.
In distributed systems, rollback recovery is complicated because messages induce inter-
process dependencies during failure-free operation.
Upon a failure of one or more processes in a system, these dependencies may force some of
the processes that did not fail to roll back, creating what is commonly called a rollback
propagation.
Distributed systems are not inherently fault-tolerant, and the vast computing potential of
these systems is often hampered by their susceptibility to failures.
Many techniques have been developed to add reliability and high availability to distributed
systems. These techniques include transactions, group communication, and rollback
recovery.
These techniques have different tradeoffs and focus.
7. ISSUES IN FAILURE RECOVERY
7.1 Introduction
In failure recovery, we must not only restore the system to a consistent state, but also
handle messages that are left in an abnormal state due to the failure and recovery.
The computation comprises three processes Pi, Pj, and Pk, connected through a
communication network.
The processes communicate by exchanging messages over fault-free, FIFO
communication channels.
Processes Pi, Pj and Pk have taken checkpoints {Ci,0, Ci,1}, {Cj,0, Cj,1, Cj,2}, and
{Ck,0, Ck,1}, respectively, and these processes have exchanged messages A to J as
shown in the Figure 2.
All the contents of the volatile memory of Pi are lost, and after Pi recovers from the
failure, the system needs to be restored to a consistent global state from where the
processes can resume their execution.
The process Pi’s state is restored to a valid state by rolling it back to its latest
checkpoint Ci,1.
To restore the system to a consistent state process Pj rolls back to checkpoint Cj,1
because roll back of process Pi to checkpoint Ci,1 created an orphan message H
(the receive event of H is recorded at process Pj while the send event of H has
been undone at process Pi).
Process Pj does not roll back to checkpoint Cj,2 but to checkpoint Cj,1 because
rolling back to checkpoint Cj,2 does not eliminate the orphan message H.
Even this resulting state is not a consistent global state, as an orphan message I is
created due to the rollback of process Pj to checkpoint Cj,1.
Messages E and F are delayed orphan messages and pose perhaps the most serious
problem of all the messages.
When messages E and F arrive at their respective destinations, they must be
discarded since their send events have been undone.
Processes, after resuming execution from their checkpoints, will generate both of
these messages and recovery techniques must be able to distinguish between
messages like C and those like E and F.
Lost messages like D can be handled by having processes keep a message log of all
the sent messages.
So when a process restores to a checkpoint, it replays the messages from its log to
handle the lost message problem.
Message logging and message replaying during recovery can result in duplicate
messages.
In the example shown in the Figure, when process Pj replays messages from its log,
it will regenerate message J.
Process Pk which has already received message J will receive it again, thereby,
causing inconsistency in the system state. These duplicate messages must be
handled properly.
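Sender-side logging with sequence numbers handles both problems at once: lost messages can be replayed from the log, and already-delivered messages are recognized and discarded as duplicates. A hypothetical sketch (class and method names are invented for illustration):

```python
class Process:
    """Sender-side message log plus receiver-side duplicate filtering,
    illustrating replay after rollback. Sketch only."""
    def __init__(self, name):
        self.name = name
        self.log = []                # (dest, msg) for every sent message
        self.seq = 0
        self.delivered = set()       # (src, seq) pairs already processed

    def send(self, dest, payload):
        self.seq += 1
        msg = (self.name, self.seq, payload)
        self.log.append((dest, msg))
        dest.receive(msg)

    def receive(self, msg):
        src, seq, payload = msg
        if (src, seq) in self.delivered:
            return "duplicate"       # replayed copy of an already-handled message
        self.delivered.add((src, seq))
        return "delivered"

    def replay(self):
        """On rollback, re-send every logged message; receivers filter dups."""
        return [dest.receive(msg) for dest, msg in self.log]

pj, pk = Process("Pj"), Process("Pk")
pj.send(pk, "J")                     # message J reaches Pk before the failure
assert pj.replay() == ["duplicate"]  # regenerated J is recognized and discarded
```

This mirrors the example above: when Pj replays message J after rollback, Pk detects the duplicate instead of processing it a second time.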
Overlapping failures further complicate the recovery process. A process Pj that
begins rollback/recovery in response to the failure of a process Pi can itself fail and
forget process Pi's failure.
If overlapping failures are to be tolerated, a mechanism must be introduced to deal
with the resulting inconsistencies.
7.2 Checkpoint-based rollback-recovery
Uncoordinated checkpointing eliminates coordination overhead
between processes and allows each process to take checkpoints when it is most
convenient or efficient.
The advantage is the lower runtime overhead during normal execution, because no
coordination among processes is necessary.
Liberty in taking checkpoints allows each process to select appropriate checkpoint
positions.
The processes record the dependencies among their checkpoints caused by message
exchange during failure-free operation.
The following direct dependency tracking technique is commonly used in
uncoordinated checkpointing.
Assume each process Pi starts its execution with an initial checkpoint Ci,0; the interval
between checkpoints Ci,x−1 and Ci,x is called the checkpoint interval Ii,x.
When Pj receives a message m (sent during Ii,x) during its interval Ij,y, it records the
dependency from Ii,x to Ij,y, which is later saved onto stable storage when Pj takes
checkpoint Cj,y.
Upon receiving this message, a process whose current state belongs to the recovery
line simply resumes execution; otherwise, it rolls back to an earlier checkpoint as
indicated by the recovery line.
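The recovery-line idea can be illustrated with a small consistency check (hypothetical helper; the dependency encoding as interval-index pairs is an assumption made for illustration):

```python
def consistent(recovery_line, deps):
    """A recovery line {process: checkpoint_index} is consistent if no message
    sent after the line is received at or before it (no orphan dependency).
    Each dependency is ((i, x), (j, y)): a message sent in interval I_{i,x}
    and received in interval I_{j,y}."""
    for (i, x), (j, y) in deps:
        if x > recovery_line[i] and y <= recovery_line[j]:
            return False             # orphan: send would be undone, receive kept
    return True

# P0 sent a message in interval I_{0,2}; P1 received it in interval I_{1,1}.
deps = {((0, 2), (1, 1))}
assert not consistent({0: 1, 1: 1}, deps)   # rolling P0 back to C_{0,1} orphans it
assert consistent({0: 2, 1: 1}, deps)       # keeping C_{0,2} is fine
```

A recovery algorithm would search the recorded dependencies for the most recent line that passes this check and roll every process back to its checkpoint on that line.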
In coordinated checkpointing, processes orchestrate their checkpointing activities so that all
local checkpoints form a consistent global state.
b) Non-blocking Checkpointing:
The processes need not stop their execution while taking checkpoints.
A fundamental problem in coordinated checkpointing is to prevent a process
from receiving application messages that could make the checkpoint
inconsistent.
If channels are FIFO, this problem can be avoided by preceding the first post-checkpoint
message on each channel with a checkpoint request, forcing each process to take a
checkpoint before receiving the first post-checkpoint message.
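The FIFO marker trick can be sketched in a few lines (hypothetical code; a deque stands in for one FIFO channel, and the marker string is an invented placeholder):

```python
from collections import deque

class Node:
    """Minimal process: counts its checkpoints and records deliveries."""
    def __init__(self):
        self.ckpts, self.seen = 0, []

    def take_checkpoint(self):
        self.ckpts += 1

channel = deque()                    # FIFO channel from P to Q
p, q = Node(), Node()

def send(msg, first_after_ckpt=False):
    if first_after_ckpt:
        channel.append("CHECKPOINT") # marker precedes post-checkpoint traffic
    channel.append(msg)

p.take_checkpoint()
send("m1", first_after_ckpt=True)    # first message after P's checkpoint

while channel:
    m = channel.popleft()
    if m == "CHECKPOINT":
        q.take_checkpoint()          # forced before delivering m1
    else:
        q.seen.append((m, q.ckpts))

assert q.seen == [("m1", 1)]         # m1 delivered only after Q checkpointed
```

Because the channel is FIFO, the marker is guaranteed to arrive before any post-checkpoint message, so the receiver's checkpoint can never miss a message the sender's checkpoint has already recorded as sent.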
Algorithm
The algorithm consists of two phases. During the first phase, the checkpoint initiator
identifies all processes with which it has communicated since the last checkpoint and
sends them a request.
Upon receiving the request, each process in turn identifies all processes it has
communicated with since the last checkpoint and sends them a request, and so on,
until no more processes can be identified.
During the second phase, all processes identified in the first phase take a checkpoint.
The result is a consistent checkpoint that involves only the participating processes.
In this protocol, after a process takes a checkpoint, it cannot send any message until
the second phase terminates successfully, although receiving a message after the
checkpoint has been taken is allowable.
In communication-induced checkpointing, processes take two types of checkpoints:
1. Autonomous checkpoints
2. Forced checkpoints
The checkpoints that a process takes independently are called local checkpoints, while those
that a process is forced to take are called forced checkpoints.
Communication-induced checkpointing piggybacks protocol-related information on
each application message.
The receiver of each application message uses the piggybacked information to
determine if it has to take a forced checkpoint to advance the global recovery line.
The forced checkpoint must be taken before the application may process the
contents of the message.
In contrast with coordinated checkpointing, no special coordination messages are
exchanged.
Two types of communication-induced checkpointing
a) Model-based checkpointing
b) Index-based checkpointing.
a) Model-based checkpointing
Model-based checkpointing prevents patterns of communication and checkpointing that
could result in inconsistent states among the existing checkpoints.
A process detects the potential for such dangerous patterns and independently takes a
forced checkpoint to prevent them.
Advantages
i) Simplifies recovery.
ii) Avoids the domino effect.
iii) Only one permanent set of states in stable storage.
Disadvantages
However, there exists high latency for interaction with the outside world (global checkpoint
necessary)
Basic idea
Since the algorithm is based on asynchronous checkpointing, the main issue in
recovery is to find a consistent set of checkpoints to which the system can be restored.
The recovery algorithm achieves this by making each processor keep track of both the
number of messages it has sent to other processors and the number of messages it has
received from other processors. Recovery may involve several iterations of rollbacks
by processors.
Whenever a processor rolls back, it is necessary for all other processors to find out if any
message sent by the rolled-back processor has become an orphan message.
Orphan messages are discovered by comparing the number of messages sent to and
received from neighboring processors.
For example, if RCVDi←j(CkPti) > SENTj→i(CkPtj) (that is, the number of
messages received by processor pi from processor pj is greater than the number of
messages sent by processor pj to processor pi, according to the current states of the
processors), then one or more messages received by pi have become orphan messages.
In this case, processor pi must roll back to a state where the number of messages
received agrees with the number of messages sent.
Consider an example shown in Fig. 1. Suppose processor Y crashes at the point
indicated and rolls back to a state corresponding to checkpoint ey1.
According to this state, Y has sent one message to X; according to X’s current state
(ex2), X has received two messages from Y.
X must roll back to a state preceding ex2 to be consistent with Y’s state.
If X rolls back to checkpoint ex1, then it will be consistent with Y’s state, ey1.
Processor Z must roll back to checkpoint ez2 to be consistent with Y’s state, ey1.
Processors X and Z will have to resolve any such mutual inconsistencies.
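The rollback-point computation in this example can be sketched as a small helper (hypothetical code; the cumulative per-event receive counts are an invented encoding for illustration):

```python
def rollback_point(rcvd_counts, c):
    """Given pi's cumulative per-event counts of messages received from pj,
    and the count c of messages that pj (in its restored state) claims to
    have sent, return the latest local event consistent with c."""
    if rcvd_counts[-1] <= c:
        return len(rcvd_counts) - 1        # no orphans: keep the current state
    # roll back to the latest event whose receive count does not exceed c
    return max(i for i, r in enumerate(rcvd_counts) if r <= c)

# X's cumulative receives from Y at events ex0, ex1, ex2 are 0, 1, 2.
# Y rolled back to ey1, where it had sent only c = 1 message to X,
# so X must roll back to ex1 to be consistent with Y's state.
assert rollback_point([0, 1, 2], c=1) == 1

# When the counts already agree, no rollback is needed.
assert rollback_point([0, 1], c=1) == 1
```

Because the cumulative counts are monotone, the latest event with a receive count not exceeding c is exactly the state in which no received message is an orphan.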
The Algorithm
When a processor restarts after a failure, it broadcasts a ROLLBACK message
announcing that it has failed.
The recovery algorithm at a processor is initiated when it restarts after a failure or
when it learns of a failure at another processor.
Because of the broadcast of ROLLBACK messages, the recovery algorithm is initiated
at all processors.
Procedure RollBack_Recovery:
processor pi executes the following:

STEP (a)
if processor pi is recovering after a failure then
    CkPti := latest event logged in the stable storage
else
    CkPti := latest event that took place in pi
    {The latest event at pi can be either in stable or in volatile storage.}
end if

STEP (b)
for k = 1 to N {N is the number of processors in the system} do
    for each neighboring processor pj do
        compute SENTi→j(CkPti)
        send a ROLLBACK(i, SENTi→j(CkPti)) message to pj
    end for
    for every ROLLBACK(j, c) message received from a neighbor j do
        if RCVDi←j(CkPti) > c then {Implies the presence of orphan messages}
            find the latest event e such that RCVDi←j(e) = c
            {Such an event e may be in the volatile storage or stable storage.}
            CkPti := e
        end if
    end for
end for {for k}
The rollback starts at the failed processor and slowly diffuses into the entire system through
ROLLBACK messages.
The procedure has |N| iterations. During the kth iteration (k ≠ 1), a processor pi does the
following:
(i) based on the state CkPti to which it was rolled back in the (k − 1)th iteration, it computes
SENTi→j(CkPti) for each neighbor pj and sends this value in a ROLLBACK message to
that neighbor, and
(ii) pi waits for and processes the ROLLBACK messages that it receives from its neighbors in
the kth iteration and determines a new recovery point CkPti for pi based on the information
in these messages. At the end of each iteration, at least one processor will roll back to its final
recovery point, unless the current recovery points are already consistent.
D. An Example
Consider an example shown in Figure 2 consisting of three processors. Suppose processor Y
fails and restarts. If event ey2 is the latest checkpointed event at Y, then Y will restart from
the state corresponding to ey2.
9. COORDINATED CHECKPOINTING ALGORITHM (KOO-TOUEG)
Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case
of failures by preserving a consistent global checkpoint on stable storage.
However, the approach suffers from high overhead associated with the checkpointing process.
Two approaches are used to reduce the overhead:
i) First is to minimize the number of synchronization messages and the number of
checkpoints,
ii) the other is to make the checkpointing process nonblocking.
Koo and Toueg's coordinated checkpointing and recovery technique takes a consistent
set of checkpoints and avoids the domino effect and livelock problems during
recovery.
• It includes two parts: the checkpointing algorithm and the recovery algorithm.
The checkpoint algorithm makes the following assumptions about the distributed system:
Assume that end-to-end protocols (such as the sliding window protocol) exist to cope
with message loss due to rollback recovery and communication failure.
Communication failures do not partition the network.
The checkpoint algorithm takes two kinds of checkpoints on the stable storage:
Permanent and Tentative.
A permanent checkpoint is a local checkpoint at a process and is a part of a
consistent global checkpoint.
A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint
on the successful termination of the checkpoint algorithm.
The algorithm consists of two phases:
i) First Phase
ii) Second Phase
i) First Phase
1. An initiating process Pi takes a tentative checkpoint and requests all other processes to take
tentative checkpoints. Each process informs Pi whether it succeeded in taking a tentative
checkpoint.
2. A process says “no” to a request if it fails to take a tentative checkpoint
3. If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides
that all tentative checkpoints should be made permanent; otherwise, Pi decides that all the
tentative checkpoints should be discarded.
ii) Second Phase
1. Pi informs all the processes of the decision it reached at the end of the first phase.
2. Either all or none of the processes advance the checkpoint by taking permanent
checkpoints.
3. The algorithm requires that after a process has taken a tentative checkpoint, it cannot
send messages related to the basic computation until it is informed of Pi's decision.
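The two-phase, all-or-nothing structure can be sketched as follows (hypothetical code; the per-process dictionaries and the `can_checkpoint` flag are invented stand-ins for the real tentative-checkpoint step):

```python
def koo_toueg_checkpoint(processes):
    """Two-phase sketch of Koo-Toueg coordinated checkpointing: tentative
    checkpoints become permanent only if every process succeeded."""
    # Phase 1: the initiator asks everyone to take a tentative checkpoint
    oks = []
    for p in processes:
        ok = p["can_checkpoint"]
        if ok:
            p["tentative"] = True
        oks.append(ok)
    # Phase 2: commit only if all succeeded, otherwise discard all tentatives
    decision = all(oks)
    for p in processes:
        if p.pop("tentative", False):
            p["permanent"] = decision
    return "commit" if decision else "abort"

ps = [{"can_checkpoint": True}, {"can_checkpoint": True}]
assert koo_toueg_checkpoint(ps) == "commit" and all(p["permanent"] for p in ps)

ps = [{"can_checkpoint": True}, {"can_checkpoint": False}]
assert koo_toueg_checkpoint(ps) == "abort"        # a single "no" aborts everyone
```

The all-or-nothing decision is what guarantees that the permanent checkpoints, taken together, form a consistent global checkpoint.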
Optimization
The above protocol may cause a process to take a checkpoint even when it is not
necessary for consistency.
The rollback recovery algorithm restores the system state to a consistent state after a
failure.
The rollback recovery algorithm assumes that a single process invokes the algorithm.
It assumes that the checkpoint and the rollback recovery algorithms are not invoked
concurrently.
i) First Phase
1. An initiating process Pi sends a message to all other processes to check if they are all
willing to restart from their previous checkpoints.
2. A process may reply "no" to a restart request for any reason (e.g., it is already
participating in a checkpointing or recovery process initiated by some other process).
3. If Pi learns that all processes are willing to restart from their previous checkpoints,
Pi decides that all processes should roll back to their previous checkpoints. Otherwise,
Pi aborts the rollback attempt, and it may attempt a recovery at a later time.
ii) Second Phase
1. Pi propagates its decision to all the processes, and each process acts accordingly.
2. During the execution of the recovery algorithm, a process cannot send messages
related to the underlying computation while it is waiting for Pi's decision.
Correctness: the processes resume from a consistent state.
Optimization: not all processes need to roll back, since some processes did not change
anything since their last checkpoint.
In the event of a failure of process X, the above protocol will require processes X, Y,
and Z to restart from checkpoints x2, y2, and z2, respectively.
However, process Z need not roll back, because there has been no interaction between
process Z and the other two processes since the last checkpoint at Z.
The algorithm assumes that the communication channels are reliable, deliver messages
in FIFO order, and have infinite buffers.
To facilitate recovery after a process failure and restore the system to a consistent state, two
types of log storage are maintained:
Volatile log: it takes a short time to access, but is lost if the processor crashes.
The contents of the volatile log are moved to the stable log periodically.
Stable log: it may take a longer time to access, but survives processor crashes.
Asynchronous checkpointing
After executing an event, a processor records a triplet (s, m, msgs_sent) in its volatile storage:
s: the state of the processor before the event;
m: the message (if any) whose receipt initiated the event;
msgs_sent: the set of messages that were sent by the processor during the event.
Periodically, a processor independently saves the contents of the volatile log to stable
storage and clears the volatile log.
Recovery Algorithm
RCVDi←j(CkPti): the number of messages received by processor pi from
processor pj, from the beginning of the computation until the checkpoint CkPti.
SENTi→j(CkPti): the number of messages sent by processor pi to processor pj, from the
beginning of the computation until the checkpoint CkPti.
The main idea of the algorithm is to find a consistent set of checkpoints and roll the
system back to them.
The orphan messages are identified by comparing the number of messages sent to and
received from neighboring processors.
The recovery algorithm at a processor is initiated when it restarts after a failure or when
it learns of a failure at another processor.
Procedure RollBack_Recovery:
processor pi executes the following:

STEP (a)
if processor pi is recovering after a failure then
    CkPti := latest event logged in the stable storage
else
    CkPti := latest event that took place in pi
    {The latest event at pi can be either in stable or in volatile storage.}
end if

STEP (b)
for k = 1 to N {N is the number of processors in the system} do
    for each neighboring processor pj do
        compute SENTi→j(CkPti)
        send a ROLLBACK(i, SENTi→j(CkPti)) message to pj
    end for
    for every ROLLBACK(j, c) message received from a neighbor j do
        if RCVDi←j(CkPti) > c then {Implies the presence of orphan messages}
            find the latest event e such that RCVDi←j(e) = c
            {Such an event e may be in the volatile storage or stable storage.}
            CkPti := e
        end if
    end for
end for {for k}
Fig: Algorithm for Asynchronous Checkpointing and Recovery (Juang-Venkatesan)
The rollback starts at the failed processor and slowly diffuses into the entire
system through ROLLBACK messages.
(i) Based on the state CkPti to which it was rolled back in the (k − 1)th iteration, pi
computes SENTi→j(CkPti) for each neighbor pj and sends this value in a ROLLBACK
message to that neighbor, and
(ii) pi waits for and processes the ROLLBACK messages that it receives from its
neighbors in the kth iteration and determines a new recovery point CkPti for pi
based on the information in these messages.
********************************************