
CS8603 - DISTRIBUTED SYSTEMS

UNIT IV RECOVERY & CONSENSUS

Checkpointing and rollback recovery: Introduction – Background and definitions – Issues in


failure recovery – Checkpoint-based recovery – Log-based rollback recovery – Coordinated
checkpointing algorithm – Algorithm for asynchronous checkpointing and recovery.
Consensus and agreement algorithms: Problem definition – Overview of results – Agreement
in a failure –free system – Agreement in synchronous systems with failures.

CHECKPOINTING AND ROLLBACK RECOVERY INTRODUCTION

• Rollback recovery protocols
– restore the system back to a consistent state after a failure
– achieve fault tolerance by periodically saving the state of a process during
failure-free execution
– treat a distributed system application as a collection of processes that
communicate over a network
• Checkpoints
– the saved states of a process
• Why is rollback recovery of distributed systems complicated?
– messages induce inter-process dependencies during failure-free operation
• Rollback propagation
– the dependencies may force some of the processes that did not fail to roll back
– This phenomenon is called “domino effect”
• If each process takes its checkpoints independently, then the system cannot avoid the
domino effect
– this scheme is called independent or uncoordinated checkpointing
• Techniques that avoid domino effect
– Coordinated checkpointing rollback recovery
 processes coordinate their checkpoints to form a system-wide consistent
state
– Communication-induced checkpointing rollback recovery
 forces each process to take checkpoints based on information piggybacked
on the application
– Log-based rollback recovery
 combines checkpointing with logging of non-deterministic events
 relies on the piecewise deterministic (PWD) assumption

Background and definitions

Prepared By: Mr. P. J. Merbin Jose, AP/CSE Page 1


System model

• A distributed system consists of a fixed number of processes, P1, P2, …, PN, which
communicate only through messages.
• Processes cooperate to execute a distributed application and interact with the outside
world by receiving and sending input and output messages, respectively. Figure 4.1
shows a system consisting of three processes and interactions with the outside world.

Figure 4.1 An example of a distributed system with three processes.


A local checkpoint

• All processes save their local states at certain instants of time
• A local checkpoint is a snapshot of the state of the process at a given instant
• Assumptions
o A process stores all local checkpoints on stable storage
o A process is able to roll back to any of its existing local checkpoints
• 𝐶𝑖,𝑘
o The kth local checkpoint at process 𝑃𝑖
• 𝐶𝑖,0
o A process 𝑃𝑖 takes a checkpoint 𝐶𝑖,0 before it starts execution

Consistent states

• A global state of a distributed system


– a collection of the individual states of all participating processes and the states
of the communication channels
• Consistent global state
– a global state that may occur during a failure-free execution of the distributed
computation

– if a process’s state reflects a message receipt, then the state of the corresponding
sender must reflect the sending of the message
• A global checkpoint
– a set of local checkpoints, one from each process
• A consistent global checkpoint
– a global checkpoint such that no message is sent by a process after taking its
local checkpoint that is received by another process before taking its checkpoint
Consistent state-Examples

Figure 4.2 Examples of consistent and inconsistent states.

Figure 4.2(a) shows message m1 to have been sent but not yet received, but that is alright. The
state in Figure 4.2(a) is consistent because, for every message that has been received, there is a
corresponding message send event. The state in Figure 4.2(b) is inconsistent because process P2
is shown to have received m2 but the state of process P1 does not reflect having sent it. Such a
state is impossible in any failure-free, correct computation. Inconsistent states occur because of
failures. For instance, the situation shown in Figure 4.2(b) may occur if process P1 fails after
sending message m2 to process P2 and then restarts at the state shown in Figure 4.2(b).
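The consistency condition above can be phrased as a small executable check: a global state is consistent exactly when every recorded receive has a matching recorded send (in-transit messages are allowed). The dictionary-based state model below is an illustrative assumption, not part of the text.

```python
# Illustrative sketch: checking whether a global state is consistent.
# Each process state is modeled as the sets of message ids it has recorded
# as sent and as received.

def is_consistent(states):
    """states: list of dicts with 'sent' and 'received' sets of message ids."""
    all_sent = set()
    for s in states:
        all_sent |= s["sent"]
    # Consistent: every recorded receipt has a corresponding recorded send.
    # (In-transit messages -- sent but not yet received -- are allowed.)
    for s in states:
        if not s["received"] <= all_sent:
            return False
    return True

# Figure 4.2(a): m1 sent by P1, not yet received -> consistent.
a = [{"sent": {"m1"}, "received": set()},
     {"sent": set(), "received": set()}]
# Figure 4.2(b): P2 received m2, but no process recorded sending it -> inconsistent.
b = [{"sent": set(), "received": set()},
     {"sent": set(), "received": {"m2"}}]
print(is_consistent(a))  # True
print(is_consistent(b))  # False
```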

Interactions with outside world

• A distributed system often interacts with the outside world to


• receive input data or deliver the outcome of a computation
• Outside World Process (OWP)
– a special process that interacts with the rest of the system through message
passing
• A common approach

– save each input message on the stable storage before allowing the application
program to process it
• The symbol “||” denotes an interaction with the outside world to deliver the outcome of
a computation

Different types of messages

• In-transit messages
o messages that have been sent but not yet received
• Lost messages
o messages whose send is done but whose receive is undone due to rollback
• Delayed messages
o messages whose receive is not recorded because the receiving process was
either down or the message arrived after rollback
• Orphan messages
o messages with receive recorded but message send not recorded
o do not arise if processes roll back to a consistent global state
• Duplicate messages
o arise due to message logging and replaying during process recovery

Messages - Example

Figure 4.3 Different types of messages

In-transit messages

• In Figure 4.3, the global state shows that message m1 has been sent but not yet
received. We call such a message an in-transit message. Message m2 is also an
in-transit message.

Lost messages

• Messages whose send is not undone but whose receive is undone due to rollback
are called lost messages. In Figure 4.3, message m1 is a lost message.

Delayed messages

• Messages whose receive is not recorded because the receiving process was either
down or the message arrived after the rollback of the receiving process are called
delayed messages. For example, messages m2 and m5 in Figure 4.3 are delayed
messages.

Orphan messages

• Messages with receive recorded but message send not recorded are called orphan
messages.

Duplicate messages

• Duplicate messages arise due to message logging and replaying during process
recovery. For example, in Figure 4.3, message m4 was sent and received before
the rollback.
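As a rough illustration, the taxonomy above can be expressed as a classification over what the restored global state records for each message. The flags used below (send/receive recorded, delivered before failure) are a hypothetical model for this sketch, not notation from the text.

```python
# Illustrative classifier for the message taxonomy above. Each message is
# described by whether its send and receive are recorded in the restored
# (rolled-back) global state, and whether it was delivered before the failure.

def classify(msg):
    if msg["receive_recorded"] and not msg["send_recorded"]:
        return "orphan"            # receive survives rollback, send undone
    if msg["send_recorded"] and not msg["receive_recorded"]:
        if msg["delivered_before_failure"]:
            return "lost"          # receive was undone by the rollback
        return "in-transit/delayed"
    return "normal"

m_orphan = {"send_recorded": False, "receive_recorded": True,
            "delivered_before_failure": True}
m_lost = {"send_recorded": True, "receive_recorded": False,
          "delivered_before_failure": True}
m_transit = {"send_recorded": True, "receive_recorded": False,
             "delivered_before_failure": False}
print(classify(m_orphan), classify(m_lost), classify(m_transit))
# orphan lost in-transit/delayed
```

Duplicate messages are not distinguishable from these flags alone; detecting them requires the message logs used during replay.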

ISSUES IN FAILURE RECOVERY

Figure 4.4 Illustration of issues in failure recovery


Checkpoints : {𝐶𝑖,0, 𝐶𝑖,1}, {𝐶𝑗,0, 𝐶𝑗,1, 𝐶𝑗,2}, and {𝐶𝑘,0, 𝐶𝑘,1, 𝐶𝑘,2}

• Messages : A - J
• The restored global Consistent State : {𝐶𝑖,1, 𝐶𝑗,1, 𝐶𝑘,1}

Issues in Failure Recovery

• The rollback of process 𝑃𝑖 to checkpoint 𝐶𝑖,1 created an orphan message H
• Orphan message I is created due to the rollback of process 𝑃𝑗 to checkpoint 𝐶𝑗,1
• Messages C, D, E, and F are potentially problematic
o Message C: a delayed message
o Message D: a lost message, since the send event for D is recorded in the restored
state for 𝑃𝑗, but the receive event has been undone at process 𝑃𝑖
o Lost messages can be handled by having processes keep a message log of all the
sent messages
• Messages E and F: delayed orphan messages. After resuming execution from their
checkpoints, the processes will generate both of these messages

CHECKPOINT-BASED RECOVERY

• In the checkpoint-based recovery approach, the state of each process and the
communication channel is checkpointed frequently so that, upon a failure, the system
can be restored to a globally consistent set of checkpoints.
• Checkpoint-based rollback-recovery techniques can be classified into three categories:
uncoordinated checkpointing, coordinated checkpointing, and
communication-induced checkpointing.

Uncoordinated Checkpointing

• Each process has autonomy in deciding when to take checkpoints
• Advantages
• Lower runtime overhead during normal execution
• Disadvantages
• Domino effect during a recovery
• Recovery from a failure is slow because processes need to iterate to find a
consistent set of checkpoints
• Each process maintains multiple checkpoints and must periodically invoke a
garbage collection algorithm
• Not suitable for applications with frequent output commits
• The processes record the dependencies among their checkpoints caused by
message exchange during failure-free operation
Direct dependency tracking technique

• Assume each process 𝑃𝑖 starts its execution with an initial checkpoint 𝐶𝑖,0
• 𝐼𝑖,𝑥 : the xth checkpoint interval, between 𝐶𝑖,𝑥−1 and 𝐶𝑖,𝑥
• When 𝑃𝑗 receives a message m (sent by 𝑃𝑖 during 𝐼𝑖,𝑥) during 𝐼𝑗,𝑦, it records the
dependency from 𝐼𝑖,𝑥 to 𝐼𝑗,𝑦, which is later saved onto stable storage when 𝑃𝑗 takes 𝐶𝑗,𝑦


Figure 4.5 Checkpoint index and checkpoint interval

As shown in Figure 4.5, when process Pi at interval Ii,x sends a message m to Pj, it piggybacks
the pair (i, x) on m. When Pj receives m during interval Ij,y, it records the dependency from Ii,x
to Ij,y, which is later saved onto stable storage when Pj takes checkpoint Cj,y.
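A minimal sketch of this piggybacking scheme follows; the class and field names are illustrative assumptions for this example, not from the text.

```python
# Illustrative sketch of direct dependency tracking: the sender piggybacks its
# (process id, checkpoint interval index) on each message; the receiver records
# the inter-interval dependency, to be flushed to stable storage at its next
# checkpoint.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 1          # current checkpoint interval index x (I_{i,x})
        self.deps = set()          # dependencies recorded in the current interval

    def send(self, payload):
        # piggyback (i, x) on the outgoing message m
        return (self.pid, self.interval, payload)

    def receive(self, msg):
        sender, sender_interval, payload = msg
        # record dependency I_{sender,x} -> I_{self,y}
        self.deps.add(((sender, sender_interval), (self.pid, self.interval)))
        return payload

    def take_checkpoint(self):
        saved = self.deps          # saved to stable storage with C_{i,x}
        self.interval += 1
        self.deps = set()
        return saved

pi, pj = Process("i"), Process("j")
m = pi.send("m")
pj.receive(m)
deps = pj.take_checkpoint()
print(deps)  # {(('i', 1), ('j', 1))}
```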

Coordinated Checkpointing

Coordinated checkpointing simplifies recovery and is not susceptible to the domino effect, since
every process always restarts from its most recent checkpoint. The main disadvantage of this
method is that a large latency is involved in committing output, as a global checkpoint is needed
before a message is sent to the OWP.

Blocking Checkpointing

After a process takes a local checkpoint, to prevent orphan messages, it remains blocked
until the entire checkpointing activity is complete.

Disadvantage

• the computation is blocked during the checkpointing

Non-blocking Checkpointing
• The processes need not stop their execution while taking checkpoints
• A fundamental problem in coordinated checkpointing is to prevent a process from
receiving application messages that could make the checkpoint inconsistent.

Example (a) : checkpoint inconsistency

• Message m is sent by 𝑃0 after receiving a checkpoint request from the checkpoint
coordinator
• Assume m reaches 𝑃1 before the checkpoint request
• This situation results in an inconsistent checkpoint since checkpoint 𝐶1,𝑥 shows the
receipt of message m from 𝑃0, while checkpoint 𝐶0,𝑥 does not show m being sent from
𝑃0

Example (b) : a solution with FIFO channels

• If channels are FIFO, this problem can be avoided by preceding the first post-checkpoint
message on each channel with a checkpoint request, forcing each process to take a
checkpoint before receiving the first post-checkpoint message

Figure 4.6 Non-blocking coordinated checkpointing: (a) checkpoint inconsistency; (b) a solution
with FIFO channels
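The FIFO fix can be sketched with a marker message that precedes any post-checkpoint traffic on the channel; the queue model and event strings below are purely illustrative assumptions for this example.

```python
# Illustrative sketch of the FIFO fix: the checkpoint request travels on the
# channel ahead of any post-checkpoint message, so the receiver checkpoints
# before consuming one.

from collections import deque

channel = deque()          # FIFO channel P0 -> P1
events = []

def p0_checkpoint_and_send(msg):
    channel.append(("CKPT_REQUEST", None))   # marker precedes post-checkpoint traffic
    channel.append(("APP", msg))             # message m sent after P0's checkpoint

def p1_drain():
    taken = False
    while channel:
        kind, msg = channel.popleft()
        if kind == "CKPT_REQUEST" and not taken:
            events.append("P1 takes checkpoint")
            taken = True
        elif kind == "APP":
            events.append(f"P1 processes {msg}")

p0_checkpoint_and_send("m")
p1_drain()
print(events)  # ['P1 takes checkpoint', 'P1 processes m']
```

Because the channel is FIFO, P1 can never process m before seeing the request, which is exactly what rules out the inconsistency of Figure 4.6(a).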

Communication-induced Checkpointing

• Two types of checkpoints: autonomous and forced checkpoints
• Communication-induced checkpointing piggybacks protocol-related information on
each application message
• The receiver of each application message uses the piggybacked information to
determine if it has to take a forced checkpoint to advance the global recovery line
• The forced checkpoint must be taken before the application may process the contents
of the message
• In contrast with coordinated checkpointing, no special coordination messages are
exchanged
• Two types of communication-induced checkpointing: model-based checkpointing and
index-based checkpointing

Impossibility of min-process non-blocking checkpointing

A min-process, non-blocking checkpointing algorithm is one that forces only a minimum number
of processes to take a new checkpoint, and at the same time it does not force any process to suspend
its computation.

Communication-induced checkpointing

Communication-induced checkpointing is another way to avoid the domino effect, while allowing
processes to take some of their checkpoints independently.

There are two types of communication-induced checkpointing: model-based checkpointing and
index-based checkpointing. In model-based checkpointing, the system maintains checkpoints
and communication structures that prevent the domino effect or achieve some even stronger
properties. In index-based checkpointing, the system uses an indexing scheme for the local and
forced checkpoints, such that the checkpoints of the same index at all processes form a consistent
state.

LOG-BASED ROLLBACK RECOVERY

• Log-based rollback recovery makes use of deterministic and non-deterministic
events in a computation.
• Deterministic and non-deterministic events
• Non-deterministic events can be the receipt of a message from another process or
an event internal to the process
• A message send event is not a non-deterministic event
• Log-based rollback recovery assumes that all non-deterministic events can be
identified and their corresponding determinants can be logged into stable storage
• During failure-free operation, each process logs the determinants of all
non-deterministic events that it observes onto stable storage.

For example, in Figure 4.7, the execution of process P0 is a sequence of four deterministic intervals.
The first one starts with the creation of the process, while the remaining three start with the receipt
of messages m0, m3, and m7, respectively. Send event of message m2 is uniquely determined by
the initial state of P0 and by the receipt of message m0, and is therefore not a nondeterministic
event.

Figure 4.7 Deterministic and non-deterministic events.

No-orphans consistency condition

• Let e be a non-deterministic event that occurs at process p

• Depend(e)
– the set of processes that are affected by a non-deterministic event e. This set consists
of p, and any process whose state depends on the event e according to Lamport’s
happened before relation
• Log(e)
– the set of processes that have logged a copy of e’s determinant in their volatile memory
• Stable(e)
– a predicate that is true if e’s determinant is logged on the stable storage

• always-no-orphans condition: ∀e: ¬Stable(e) ⇒ Depend(e) ⊆ Log(e)


Pessimistic logging

• Pessimistic logging protocols assume that a failure can occur after any non-deterministic event
in the computation
• However, in reality failures are rare
• synchronous logging
– ∀e: ¬Stable(e) ⇒ |Depend(e)| = 0
– if an event has not been logged on the stable storage, then no process can depend on it
– stronger than the always-no-orphans condition

Figure 4.8 Pessimistic logging

• Suppose processes 𝑃1 and 𝑃2 fail as shown, restart from checkpoints B and C, and roll
forward using their determinant logs to deliver again the same sequence of messages as in
the pre-failure execution

• Consider the example in Figure 4.8. During failure-free operation, the logs of processes
P0, P1, and P2 contain the determinants needed to replay messages {m0, m4, m7},
{m1, m3, m6}, and {m2, m5}, respectively.
• Once the recovery is complete, both processes will be consistent with the state of 𝑃0 that
includes the receipt of message 𝑚7 from 𝑃1.
• Implementations of pessimistic logging must use special techniques to reduce the effects of
synchronous logging on performance. This overhead can be lowered using special
hardware.
• Some pessimistic logging systems reduce the overhead of synchronous logging without
relying on hardware. For example, the sender-based message logging (SBML) protocol
keeps the determinants corresponding to the delivery of each message m in the volatile
memory of its sender.

Optimistic Logging

Processes log determinants asynchronously to the stable storage

• Optimistically assume that logging will be complete before a failure occurs


• Thus, optimistic logging does not require the application to block waiting for the determinants
to be written to the stable storage, and therefore incurs much less overhead during failure-free
execution.
• Do not implement the always-no-orphans condition
• To perform rollbacks correctly, optimistic logging protocols track causal dependencies during
failure-free execution
• Optimistic logging protocols require a non-trivial garbage collection scheme
• Pessimistic protocols need only keep the most recent checkpoint of each process, whereas
optimistic protocols may need to keep multiple checkpoints for each process.
• In Figure 4.9, suppose process P2 fails before the determinant for m5 is logged to stable
storage. Process P1 then becomes an orphan process and must roll back to undo the effects
of receiving the orphan message m6. The rollback of P1 further forces P0 to roll back to
undo the effects of receiving message m7.

Figure 4.9 Optimistic logging

Causal logging

• Combines the advantages of both pessimistic and optimistic logging at the expense of a more
complex recovery protocol
• Like optimistic logging, it does not require synchronous access to the stable storage except
during output commit
• Like pessimistic logging, it allows each process to commit output independently and never
creates orphans, thus isolating processes from the effects of failures at other processes
• Make sure that the always-no-orphans property holds
• Each process maintains information about all the events that have causally affected its state.
• Consider the example in Figure 4.10. Messages m5 and m6 are likely to be lost on the failures
of P1 and P2 at the indicated instants. Process P0 at state X will have logged the determinants
of the non-deterministic events that causally precede its state according to Lamport’s
happened-before relation. These events consist of the delivery of messages m0, m1, m2, m3,
and m4.

Figure 4.10 Causal logging

COORDINATED CHECKPOINTING ALGORITHM

Koo–Toueg coordinated checkpointing algorithm

• A coordinated checkpointing and recovery technique that takes a consistent set of
checkpoints and avoids the domino effect and livelock problems during recovery
• Includes two parts: the checkpointing algorithm and the recovery algorithm
Checkpointing algorithm

• Assumptions: FIFO channels, end-to-end protocols, communication failures do not
partition the network, single process initiation, no process fails during the execution of
the algorithm
• Two kinds of checkpoints: permanent and tentative
– Permanent checkpoint: local checkpoint, part of a consistent global checkpoint

– Tentative checkpoint: temporary checkpoint, become permanent checkpoint when
the algorithm terminates successfully
• The algorithm consists of two phases
• First phase
– The initiating process takes a tentative checkpoint and requests all other processes
to take tentative checkpoints. A process cannot send messages after taking a
tentative checkpoint. All processes eventually reach the same single decision:
commit or discard
– All processes receive the final decision from the initiating process and act
accordingly

• Second phase
• Pi informs all the processes of the decision it reached at the end of the first phase. A
process, on receiving the message from Pi , will act accordingly. Therefore, either all
or none of the processes advance the checkpoint by taking permanent checkpoints.
• Correctness: for 2 reasons
– Either all or none of the processes take permanent checkpoint
– No process sends message after taking permanent checkpoint

• Optimization: not all of the processes may need to take checkpoints (a process that has
not changed since its last checkpoint need not take a new one)
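The two-phase structure above can be sketched as follows; the class names and the direct method calls (standing in for real network messages) are purely illustrative assumptions of this example.

```python
# Hedged sketch of the two-phase Koo-Toueg checkpointing structure: phase 1
# collects tentative checkpoints and votes; phase 2 broadcasts a single
# commit/discard decision that every process applies.

class Proc:
    def __init__(self, name):
        self.name = name
        self.tentative = None
        self.permanent = []

    def take_tentative(self, state):
        self.tentative = state       # could also refuse (vote "no") on failure
        return True                  # vote: willing to commit

    def decide(self, commit):
        if commit and self.tentative is not None:
            self.permanent.append(self.tentative)   # tentative -> permanent
        self.tentative = None

def coordinated_checkpoint(initiator, others, states):
    procs = [initiator] + others
    # Phase 1: everyone takes a tentative checkpoint and votes
    votes = [p.take_tentative(states[p.name]) for p in procs]
    commit = all(votes)              # single, common decision: do or discard
    # Phase 2: initiator broadcasts the decision; all act accordingly
    for p in procs:
        p.decide(commit)
    return commit

p0, p1, p2 = Proc("P0"), Proc("P1"), Proc("P2")
ok = coordinated_checkpoint(p0, [p1, p2], {"P0": "s0", "P1": "s1", "P2": "s2"})
print(ok, p1.permanent)  # True ['s1']
```

Because `commit = all(votes)`, either every process promotes its tentative checkpoint to permanent or none does, which is the correctness property stated above.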

The rollback recovery algorithm

– Restores the system state to a consistent state after a failure. Assumptions: single
initiator; the checkpointing and rollback recovery algorithms are not invoked concurrently
– Two phases

• The initiating process sends a message to all other processes asking for their preferences
– restarting from the previous checkpoints. All processes must agree on whether or not to roll back.

• The initiating process sends the final decision to all processes; all processes act
accordingly after receiving it.


Figure 4.11 Example of checkpoints taken unnecessarily.

• Correctness: resume from a consistent state

• Optimization: not all processes may need to be rolled back, since some of the processes
did not change anything

ALGORITHM FOR ASYNCHRONOUS CHECKPOINTING AND RECOVERY

Juang–Venkatesan algorithm for asynchronous checkpointing and recovery

• Assumptions: communication channels are reliable, deliver messages in FIFO order, and
have infinite buffers; message transmission delay is arbitrary but finite
• The underlying computation/application is event-driven: a process P is at state s, receives
message m, processes the message, moves to state s’ and sends messages out. So the triplet
(s, m, msgs_sent) represents the state of P
• Two types of log storage are maintained:
• Volatile log: short access time, but lost if the processor crashes; moved to the stable
log periodically
• Stable log: longer access time, but survives a crash

Asynchronous checkpointing:

• After executing an event, the triplet is recorded without any synchronization with other
processes.
• A local checkpoint consists of a set of records; they are first stored in the volatile log,
then moved to the stable log.

Recovery algorithm

Notations:
• 𝑅𝐶𝑉𝐷𝑖←𝑗(𝐶𝑘𝑃𝑡𝑖 ): number of messages received by 𝑝𝑖 from 𝑝𝑗, from the beginning
of computation to checkpoint 𝐶𝑘𝑃𝑡𝑖

• 𝑆𝐸𝑁𝑇𝑖→𝑗(𝐶𝑘𝑃𝑡𝑖 ): number of messages sent by 𝑝𝑖 to 𝑝𝑗, from the beginning of
computation to checkpoint 𝐶𝑘𝑃𝑡𝑖

Idea:
• From the set of checkpoints, find a set of consistent checkpoints, based
on the number of messages sent and received
Example

Consider the example shown in Figure 4.12. In the event of failure of process X, the above protocol
will require processes X, Y , and Z to restart from checkpoints x2, y2, and z2, respectively.
However, note that process Z need not roll back because there has been no interaction between
process Z and the other two processes since the last checkpoint at Z.


Figure 4.12 Example of an unnecessary rollback.

Figure 4.13 An event-driven computation.

CONSENSUS AND AGREEMENT ALGORITHMS

Problem definition

Assumptions

• System assumptions
• Failure models
• Synchronous/asynchronous communication
• Network connectivity
• Sender identification
• Channel reliability
• Authenticated vs. non-authenticated messages
• Agreement variable


Figure 4.14 Byzantine generals sending confusing messages.

A simple example of Byzantine behavior is shown in Figure 4.14. Four generals are shown, and a
consensus decision is to be reached about a boolean value. The various generals are conveying
potentially misleading values of the decision variable to the other generals, which results in
confusion.

Problem Specifications

Byzantine agreement (single source has an initial value)

• Agreement: All non-faulty processes must agree on the same value.
• Validity: If the source process is non-faulty, then the agreed upon value by all the
non-faulty processes must be the same as the initial value of the source.
• Termination: Each non-faulty process must eventually decide on a value.

Consensus problem (all processes have an initial value)

• Agreement: All non-faulty processes must agree on the same (single) value.
• Validity: If all the non-faulty processes have the same initial value, then the agreed upon
value by all the non-faulty processes must be that same value.
• Termination: Each non-faulty process must eventually decide on a value.

Interactive consistency (all processes have an initial value)

• Agreement: All non-faulty processes must agree on the same array of values A[v1, …, vn].
• Validity: If process i is non-faulty and its initial value is vi, then all non-faulty processes
agree on vi as the ith element of the array A. If process j is faulty, then the non-faulty
processes can agree on any value for A[j].
• Termination: Each non-faulty process must eventually decide on the array A.

These problems are equivalent to one another, which can be shown using reductions.

Overview of results


Table 4.1: Overview of results on agreement. f denotes number of failure-prone processes. n is the
total number of processes.

In a failure-free system, consensus can be attained in a straightforward manner

Some Solvable Variants of the Consensus Problem in Async Systems

Table 4.2: Some solvable variants of the agreement problem in asynchronous system. The overhead
bounds are for the given algorithms, and not necessarily tight bounds for the problem.

Solvable Variants of the Consensus Problem in Async Systems

Figure 4.15 Circumventing the impossibility result for consensus in asynchronous systems.

AGREEMENT IN A FAILURE-FREE SYSTEM

In a failure-free system, consensus can be reached by collecting information from the different
processes, arriving at a “decision,” and distributing this decision in the system. A distributed
mechanism would have each process broadcast its values to others, and each process computes the
same function on the values received. The decision can be reached by using an application specific
function – some simple examples being the majority, max, and min functions. Algorithms to
collect the initial values and then distribute the decision may be based on the token circulation on
a logical ring, or the three-phase tree-based broadcast–converge cast–broadcast, or direct
communication with all nodes.

• In a synchronous system, this can be done simply in a constant number of rounds
(depending on the specific logical topology and algorithm used). Further, common knowledge of
the decision value can be obtained using an additional round.

• In an asynchronous system, consensus can similarly be reached in a constant number of
message hops. Further, concurrent common knowledge of the consensus value can also be
attained, using any of the algorithms.

AGREEMENT IN SYNCHRONOUS SYSTEMS WITH FAILURES.

Consensus Algorithm for Crash Failures (MP, synchronous)

• Up to f (< n) crash failures possible.
• In f + 1 rounds, at least one round has no failures.
• Justify: the agreement, validity, and termination conditions are satisfied.
• Complexity: O((f + 1)·n^2) messages
• f + 1 is a lower bound on the number of rounds

Algorithm 4.1 Consensus with up to f fail-stop processes in a system of n processes, n > f [8]. Code
shown is for process Pi , 1 ≤ i ≤ n.

The agreement condition is satisfied because in the f + 1 rounds, there must be at least one round
in which no process failed. In this round, say round r, all the processes that have not failed so far
succeed in broadcasting their values, and all these processes take the minimum of the values
broadcast and received in that round. Thus, the local values at the end of round r are the same,
say x_i, for all non-failed processes. In further rounds, only this value may be sent by each process
at most once, and no process i will update its value x_i.

• The validity condition is satisfied because processes do not send fictitious values in this failure
model. (Thus, a process that crashes has sent only correct values until the crash.) If the initial
values of all processes are identical, then the only value sent by any process is that value, which
is therefore the value agreed upon as per the agreement condition.

• The termination condition is seen to be satisfied.

Complexity

There are f + 1 rounds, where f < n. The number of messages is at most O(n^2) in each round, and
each message has one integer. Hence the total number of messages is O((f + 1)·n^2). The worst-case
scenario is as follows. Assume that the minimum value is with a single process initially. In the
first round, the process manages to send its value to just one other process before failing. In
subsequent rounds, the single process having this minimum value also manages to send that value
to just one other process before failing.

A lower bound on the number of rounds

At least f +1 rounds are required, where f < n. The idea behind this lower bound is that in the
worst-case scenario, one process may fail in each round; with f +1 rounds, there is at least one
round in which no process fails.

Consensus algorithms for Byzantine failures (synchronous system)


Upper bound on Byzantine processes

Agreement is impossible when f = 1, n = 3.

Figure 4.16 Impossibility of achieving Byzantine agreement with n = 3 processes and f = 1
malicious process.

Taking a simple majority decision does not help because the loyal lieutenant Pa cannot distinguish
between the possible scenarios (a) and (b), and hence does not know which action to take.

With n = 3 processes, the Byzantine agreement problem cannot be solved if the number of
Byzantine processes f = 1. The argument uses the illustration in Figure 4.16, which shows a
commander Pc and two lieutenant processes Pa and Pb. The malicious process is the lieutenant Pb
in the first scenario (Figure 4.16(a)) and hence Pa should agree on the value of the loyal
commander Pc, which is 0. But note the second scenario (Figure 4.16(b)) in which Pa receives
identical values from Pb and Pc, but now Pc is the disloyal commander whereas Pb is a loyal
lieutenant. In this case, Pa needs to agree with Pb. However, Pa cannot distinguish between the
two scenarios and any further message exchange does not help because each process has already
conveyed what it knows from the third process.

Consensus Solvable when f = 1, n = 4


Figure 4.17 Achieving Byzantine agreement when n = 4 processes and f = 1 malicious process.

There is no ambiguity at any loyal lieutenant when taking the majority decision. The majority
decision is over the 2nd round messages, and the 1st round message received directly from the
commander-in-chief process.

Byzantine Generals (recursive formulation), (sync, msg-passing)

Relationship between # Messages and Rounds

Table 4.3: Relationships between messages and rounds in the Oral Messages algorithm for Byzantine
agreement.

Complexity: f + 1 rounds, an exponential amount of space, and
(n − 1) + (n − 1)(n − 2) + · · · + (n − 1)(n − 2) · · · (n − f − 1) messages
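As a quick arithmetic check of this message count (assuming the sum runs over rounds r = 1, …, f + 1, with the rth term being the product (n − 1)(n − 2) · · · (n − r)):

```python
# Illustrative evaluation of the Oral Messages message-count formula:
# sum over r = 1..f+1 of the running product (n-1)(n-2)...(n-r).

def oral_msg_count(n, f):
    total, term = 0, 1
    for r in range(1, f + 2):
        term *= (n - r)    # extend the product by one factor per round
        total += term
    return total

print(oral_msg_count(4, 1))  # (4-1) + (4-1)(4-2) = 3 + 6 = 9
```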

Byzantine Generals (iterative formulation), Sync, Msg-passing

Algorithm 4.3 Byzantine generals algorithm – exponential number of unsigned messages, n > 3f .
Iterative formulation. Code for process Pi.

Tree Data Structure for Agreement Problem (Byzantine Generals)

Figure 4.18 Local tree at P3 for solving the Byzantine agreement, for n = 10 and f = 3. Only one
branch of the tree is shown for simplicity.

Some branches of the tree at P3. In this example, n = 10, f = 3, commander is P0.

• (round 1) P0 sends its value to all other processes using Oral_Msg (3), including to P3.
• (round 2) P3 sends 8 messages to the others (excluding P0 and P3) using Oral_Msg (2). P3 also
receives 8 messages.
• (round 3) P3 sends 8 × 7 = 56 messages to all others using Oral_Msg (1); P3 also receives 56
messages.
• (round 4) P3 sends 56 × 6 = 336 messages to all others using Oral_Msg (0); P3 also receives 336
messages. The received values are used as estimates of the majority function at this level of
recursion.

The Phase King Algorithm Operation

• Each phase has a unique “phase king” derived, say, from the PID. Each phase has
two rounds:
o In the 1st round, each process sends its estimate to all other processes.
o In the 2nd round, the “phase king” process arrives at an estimate based on the
values it received in the 1st round, and broadcasts its new estimate to all others.
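A failure-free run of this two-round-per-phase structure can be sketched as below. No malicious behavior is simulated, so a single shared tally stands in for the all-to-all broadcasts; that simplification is an assumption of this example, not part of the algorithm.

```python
# Hedged sketch of the phase-king control flow (honest run only): in each of
# the f+1 phases, round 1 tallies everyone's estimates, and round 2 lets the
# phase king's value break ties when no strong majority exists.

from collections import Counter

def phase_king(estimates, f):
    n = len(estimates)                     # requires n > 4f
    values = list(estimates)
    for phase in range(f + 1):
        # Round 1: every process broadcasts its estimate; each tallies.
        tally = Counter(values)
        majority, mult = tally.most_common(1)[0]
        # Round 2: the phase king broadcasts its majority value as tie-breaker.
        king_value = majority              # an honest king uses the same tally
        for i in range(n):
            if mult > n // 2 + f:
                values[i] = majority       # strong majority: keep it
            else:
                values[i] = king_value     # otherwise adopt the king's value
    return values

print(phase_king([1, 0, 1, 1, 0], f=1))  # [1, 1, 1, 1, 1]
```

Here n = 5, f = 1, so the threshold is n/2 + f = 3.5-ish (integer form n//2 + f = 3); the first phase's multiplicity of 3 does not exceed it, so every process adopts the king's tie-breaker and agreement is reached.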


Figure 4.19 Message pattern for the phase-king algorithm

Algorithm 4.4 Phase-king algorithm [4] – polynomial number of unsigned messages, n > 4f .
Code is for process Pi, 1 ≤ i ≤ n.

• (f + 1) phases, (f + 1)[(n − 1)(n + 1)] messages, and can tolerate up to
f < ⌈n/4⌉ malicious processes

Correctness Argument

• Among the f + 1 phases, there is at least one phase k in which the phase king is non-malicious.

• In phase k, all non-malicious processes Pi and Pj will have the same estimate of the consensus
value as Pk does. Three cases:
o Pi and Pj both use their own majority values. (Hint: Pi’s mult > n/2 + f)
o Pi uses its majority value; Pj uses the phase king’s tie-breaker value. (Hint: Pi’s mult >
n/2 + f, Pj’s mult > n/2 for the same value)
o Pi and Pj both use the phase king’s tie-breaker value. (Hint: in the phase in which Pk
is non-malicious, it sends the same value to Pi and Pj)

In all 3 cases, Pi and Pj end up with the same value as their estimate.

• If all non-malicious processes have the value x at the start of a phase, they will continue to have
x as the consensus value at the end of the phase.
