Unit IV Recovery
• A distributed system consists of a fixed number of processes, P1, P2, …, PN, which
communicate only through messages.
• Processes cooperate to execute a distributed application and interact with the outside
world by receiving and sending input and output messages, respectively. Figure 4.1
shows a system consisting of three processes and interactions with the outside world.
Consistent states
Figure 13.2(a) shows message m1 as sent but not yet received; this is acceptable. The
state in Figure 13.2(a) is consistent because it represents a situation in which, for every message
that has been received, there is a corresponding send event. The state in Figure 13.2(b) is
inconsistent because process P2 is shown to have received m2 but the state of process P1 does not
reflect having sent it. Such a state is impossible in any failure-free, correct computation.
Inconsistent states occur because of failures. For instance, the situation shown in Figure 13.2(b)
may occur if process P1 fails after sending message m2 to process P2 and then restarts at the state
shown in Figure 13.2(b).
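The consistency rule above can be sketched in a few lines. This is an illustrative encoding of my own (the message-set representation is an assumption, not from the text): a recorded global state is consistent exactly when every recorded receive has a matching recorded send.

```python
# Sketch: a global state is consistent iff every message recorded as
# received also appears as sent in some process's recorded state.

def is_consistent(sent, received):
    """sent/received: sets of message ids recorded in the global state."""
    return received <= sent  # every receive has a corresponding send

# Figure 13.2(a): m1 sent but not yet received -> in transit, still consistent.
print(is_consistent(sent={"m1", "m2"}, received={"m2"}))    # True
# Figure 13.2(b): m2 received but its send is not recorded -> inconsistent.
print(is_consistent(sent={"m1"}, received={"m1", "m2"}))    # False
```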
• In-transit messages: messages that have been sent but not yet received.
• Lost messages: messages whose send is done but whose receive is undone due to rollback.
• Delayed messages: messages whose receive is not recorded because the receiving process
was either down or the message arrived after rollback.
• Orphan messages: messages with receive recorded but send not recorded; these do not
arise if processes roll back to a consistent global state.
• Duplicate messages: messages that arise due to message logging and replaying during
process recovery.
Messages - Example
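The taxonomy above can be encoded as a small classifier. This is a simplified sketch of my own (the boolean encoding and the `receive_happened` flag are assumptions, not from the text); it ignores duplicates, which additionally depend on replay history.

```python
# Simplified sketch: classify a message relative to a restored,
# rolled-back global state from what the restored checkpoints record.

def classify(send_recorded, receive_recorded, receive_happened):
    """receive_happened: did the receive physically occur before rollback?"""
    if receive_recorded and not send_recorded:
        return "orphan"                    # receive recorded without its send
    if send_recorded and not receive_recorded:
        # receive undone by rollback -> lost; never delivered yet -> in transit
        return "lost" if receive_happened else "in-transit/delayed"
    return "normal"

print(classify(True, False, False))   # in-transit/delayed
print(classify(True, False, True))    # lost
print(classify(False, True, True))    # orphan
```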
In-transit messages
• Messages that have been sent but not yet received are called in-transit messages.
Lost messages
• Messages whose send is not undone but whose receive is undone due to rollback are called
lost messages. In Figure 4.3, message m1 is a lost message.
Delayed messages
• Messages whose receive is not recorded, because the receiving process was either
down or the message arrived after the rollback of the receiving process, are called
delayed messages. For example, messages m2 and m5 in Figure 4.3 are delayed
messages.
Orphan messages
• Messages with receive recorded but send not recorded are called orphan messages.
Duplicate messages
• Duplicate messages arise due to message logging and replaying during process
recovery. For example, in Figure 4.3, message m4 was sent and received before
the rollback
• Messages: A - J
• The restored globally consistent state: {𝐶𝑖,1, 𝐶𝑗,1, 𝐶𝑘,1}
CHECKPOINT-BASED RECOVERY
• In the checkpoint-based recovery approach, the state of each process and the
communication channel is checkpointed frequently so that, upon a failure, the system
can be restored to a globally consistent set of checkpoints.
• Checkpoint-based rollback-recovery techniques can be classified into three categories:
uncoordinated checkpointing, coordinated checkpointing, and
communication-induced checkpointing.
Uncoordinated Checkpointing
• Assume each process 𝑃𝑖 starts its execution with an initial checkpoint 𝐶𝑖,0
• 𝐼𝑖,𝑥 : checkpoint interval, interval between 𝐶𝑖,𝑥−1 and 𝐶𝑖,𝑥
• When 𝑃𝑗 receives a message m (sent by 𝑃𝑖 during 𝐼𝑖,𝑥) during its interval 𝐼𝑗,𝑦, it records
the dependency from 𝐼𝑖,𝑥 to 𝐼𝑗,𝑦, which is later saved onto stable storage when 𝑃𝑗 takes 𝐶𝑗,𝑦
Figure 4.5. When process Pi at interval Ii,x sends a message m to Pj, it piggybacks the pair (i, x)
on m. When Pj receives m during interval Ij,y, it records the dependency from Ii,x to Ij,y, which
is later saved onto stable storage when Pj takes checkpoint Cj,y.
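The piggybacking scheme described above can be sketched as follows. This is a minimal simulation of my own (the `Process` class, method names, and the list-based "stable storage" are assumptions for illustration): the sender attaches its pair (i, x), and the receiver records the interval dependency, flushing it to stable storage at its next checkpoint.

```python
# Sketch of uncoordinated-checkpointing dependency tracking (Figure 4.5 idea).

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 0    # current checkpoint-interval index x
        self.deps = set()    # dependencies recorded during this interval
        self.stable = []     # simulated stable storage

    def send(self, payload):
        return (self.pid, self.interval, payload)   # piggyback (i, x)

    def receive(self, msg):
        i, x, payload = msg
        # record dependency I_{i,x} -> I_{j,y}
        self.deps.add(((i, x), (self.pid, self.interval)))
        return payload

    def checkpoint(self):
        # taking C_{j,y} saves the interval's dependencies to stable storage
        self.stable.append(frozenset(self.deps))
        self.deps = set()
        self.interval += 1

pi, pj = Process("i"), Process("j")
pj.receive(pi.send("m"))
pj.checkpoint()
print(pj.stable[0])   # contains the dependency (('i', 0), ('j', 0))
```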
Coordinated Checkpointing
Coordinated checkpointing simplifies recovery and is not susceptible to the domino effect, since
every process always restarts from its most recent checkpoint. The main disadvantage of this
method is the large latency involved in committing output, as a global checkpoint is needed
before a message is sent to the outside world process (OWP).
Blocking Checkpointing
After a process takes a local checkpoint, to prevent orphan messages, it remains blocked
until the entire checkpointing activity is complete.
Disadvantage
• The computation is blocked during the entire checkpointing activity.
Non-blocking Checkpointing
• The processes need not stop their execution while taking checkpoints
• A fundamental problem in coordinated checkpointing is to prevent a process from
receiving application messages that could make the checkpoint inconsistent.
• If channels are FIFO, this problem can be avoided by preceding the first post-checkpoint
message on each channel with a checkpoint request, forcing each process to take a
checkpoint before receiving the first post-checkpoint message
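The FIFO-channel trick above can be sketched with a queue. This is an illustrative simulation of my own (the `CKPT_REQ` marker name and the deque-as-channel model are assumptions): because the channel is FIFO, the receiver is guaranteed to see the checkpoint request before any post-checkpoint message, and so checkpoints first.

```python
from collections import deque

# Sketch: a checkpoint request precedes the first post-checkpoint message
# on a FIFO channel, so the receiver checkpoints before delivering it.
CKPT_REQ = "CKPT_REQ"

def receiver_loop(channel):
    checkpoints, delivered = 0, []
    while channel:
        msg = channel.popleft()
        if msg == CKPT_REQ:
            checkpoints += 1    # take a checkpoint before the next delivery
        else:
            delivered.append(msg)
    return checkpoints, delivered

# m1 is pre-checkpoint; m2 is the sender's first post-checkpoint message.
ch = deque(["m1", CKPT_REQ, "m2"])
print(receiver_loop(ch))   # (1, ['m1', 'm2'])
```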
A min-process, non-blocking checkpointing algorithm is one that forces only a minimum number
of processes to take a new checkpoint, and at the same time does not force any process to suspend
its computation.
Communication-induced checkpointing
Communication-induced checkpointing is another way to avoid the domino effect, while allowing
processes to take some of their checkpoints inde-pendently.
There are two types of communication-induced checkpointing: model-based checkpointing and
index-based checkpointing. In model-based checkpointing, the system maintains checkpoint
and communication structures that prevent the domino effect or achieve some even stronger
properties.
LOG-BASED ROLLBACK RECOVERY
Log-based rollback recovery views a process execution as a sequence of deterministic state
intervals, each starting with the execution of a non-deterministic event.
For example, in Figure 4.7, the execution of process P0 is a sequence of four deterministic intervals.
The first one starts with the creation of the process, while the remaining three start with the receipt
of messages m0, m3, and m7, respectively. The send event of message m2 is uniquely determined by
the initial state of P0 and by the receipt of message m0, and is therefore not a non-deterministic
event.
Pessimistic Logging
• Pessimistic logging protocols assume that a failure can occur after any non-deterministic event
in the computation
• However, in reality failures are rare
• Synchronous logging
– ∀e: ¬Stable(e) ⇒ |Depend(e)| = 0
– If an event has not been logged on stable storage, then no process can depend on it.
– This is stronger than the always-no-orphans condition
• Suppose processes 𝑃1 and 𝑃2 fail as shown, restart from checkpoints B and C, and roll
forward using their determinant logs to deliver again the same sequence of messages as in
the pre-failure execution.
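The roll-forward replay above relies on one property: the determinant of every delivery is on stable storage before the delivery takes effect, and the state transition is deterministic. A minimal sketch of my own (the `run`/`deliver` functions and list-based state are illustrative assumptions):

```python
# Sketch of pessimistic logging: log each receive determinant synchronously
# BEFORE delivery, so a restarted process can replay to its pre-failure state.

def run(state, msg):            # deterministic state transition (assumed)
    return state + [msg]

stable_log = []                 # simulated stable storage

def deliver(state, msg):
    stable_log.append(msg)      # synchronous log first...
    return run(state, msg)      # ...then deliver

s = []
for m in ["m1", "m2", "m3"]:
    s = deliver(s, m)

# Crash; recover from the initial state by replaying the determinant log.
recovered = []
for m in stable_log:
    recovered = run(recovered, m)
print(recovered == s)           # True
```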
Causal Logging
• Combines the advantages of both pessimistic and optimistic logging at the expense of a more
complex recovery protocol
• Like optimistic logging, it does not require synchronous access to the stable storage except
during output commit
• Like pessimistic logging, it allows each process to commit output independently and never
creates orphans, thus isolating processes from the effects of failures at other processes
• Make sure that the always-no-orphans property holds
• Each process maintains information about all the events that have causally affected its state.
• Consider the example in Figure 4.10. Messages m5 and m6 are likely to be lost on the failures
of P1 and P2 at the indicated instants. Process P0 at state X will have logged the determinants
of the non-deterministic events that causally precede its state according to Lamport's
happened-before relation. These events consist of the deliveries of messages m0, m1, m2, m3, and m4.
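The causal-logging property above (a state carries the determinants of every delivery that happened-before it) can be sketched by piggybacking determinant sets on messages. This is an illustrative encoding of my own (the `Proc` class and method names are assumptions; real protocols piggyback far more compactly):

```python
# Sketch: each message carries the determinants of events that causally
# precede it, so a receiver's state accumulates all happened-before determinants.

class Proc:
    def __init__(self):
        self.dets = set()              # determinants known to this process

    def deliver(self, msg_id, piggyback):
        self.dets |= piggyback         # inherit causally preceding determinants
        self.dets.add(msg_id)          # record this delivery's own determinant

    def send_piggyback(self):
        return set(self.dets)          # attach everything known so far

p1, p0 = Proc(), Proc()
p1.deliver("m1", set())
p0.deliver("m2", p1.send_piggyback())  # m1 happened-before m2's delivery
print(sorted(p0.dets))                 # ['m1', 'm2']
```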
• Second phase
• Pi informs all the processes of the decision it reached at the end of the first phase. A
process, on receiving the message from Pi , will act accordingly. Therefore, either all
or none of the processes advance the checkpoint by taking permanent checkpoints.
• Correctness holds for 2 reasons:
– Either all or none of the processes take a permanent checkpoint
– No process sends a message after taking a permanent checkpoint
• Optimization: not all of the processes may need to take checkpoints (a process that has
not changed since its last checkpoint can skip it)
– The rollback recovery algorithm restores the system state to a consistent state after a
failure, under the assumptions of a single initiator and that the checkpoint and rollback
recovery algorithms are not invoked concurrently
– It proceeds in 2 phases:
• The initiating process sends a message to all other processes asking for their preferences
about restarting from the previous checkpoints. All processes need to agree to either
roll back or not.
• The initiating process sends the final decision to all processes, and all processes act
accordingly after receiving it.
• Optimization: not all processes may need to roll back, since some processes did not
change anything since their last checkpoint
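The two-phase pattern described above (collect preferences, then broadcast an all-or-nothing decision) can be sketched in a few lines. This is a schematic of my own (the function and vote names are assumptions); it shows only why either all or none of the processes act.

```python
# Sketch of the two-phase coordination pattern: commit the rollback only
# if every process agrees, so either all or none of them roll back.

def two_phase(initiator_ok, votes):
    # Phase 1: the initiator collects yes/no preferences from all processes.
    if initiator_ok and all(votes.values()):
        decision = "rollback"
    else:
        decision = "abort"
    # Phase 2: the same decision is broadcast to every process.
    return {p: decision for p in votes}

print(two_phase(True, {"P1": True, "P2": True}))    # all roll back
print(two_phase(True, {"P1": True, "P2": False}))   # nobody rolls back
```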
• After executing an event, the triplet is recorded without any synchronization with other
processes.
• A local checkpoint consists of a set of such records; they are first stored in a volatile log
and then moved to a stable log.
Recovery algorithm
Notations:
• 𝑅𝐶𝑉𝐷𝑖←𝑗(𝐶𝑘𝑃𝑡𝑖): the number of messages received by 𝑝𝑖 from 𝑝𝑗, from the beginning
of the computation to checkpoint 𝐶𝑘𝑃𝑡𝑖
Example
Consider the example shown in Figure 4.12. In the event of failure of process X, the above protocol
will require processes X, Y , and Z to restart from checkpoints x2, y2, and z2, respectively.
However, note that process Z need not roll back because there has been no interaction between
process Z and the other two processes since the last checkpoint at Z.
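The recovery rule behind this example can be sketched with the RCVD/SENT counts. This is a simplified encoding of my own (the dictionaries and the single-step rollback are assumptions; it presumes an earlier checkpoint always exists): a process keeps rolling back while it records more receives from some peer than that peer's restored checkpoint records as sends, i.e. while it holds orphan receives.

```python
# Sketch of orphan elimination via send/receive counts.

def rollback(checkpoints, rcvd, sent):
    """checkpoints: pid -> ordered checkpoint indices (latest last).
    rcvd[(i, j, c)]: messages i received from j up to i's checkpoint c.
    sent[(j, i, c)]: messages j sent to i up to j's checkpoint c."""
    cur = {p: cps[-1] for p, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for i in cur:
            for j in cur:
                if i != j and rcvd.get((i, j, cur[i]), 0) > sent.get((j, i, cur[j]), 0):
                    # i holds orphan receives: roll i back one checkpoint
                    # (assumes an earlier checkpoint exists)
                    cur[i] = checkpoints[i][checkpoints[i].index(cur[i]) - 1]
                    changed = True
    return cur

cps = {"X": [0, 1], "Y": [0, 1]}
rcvd = {("X", "Y", 1): 2}    # X's latest checkpoint records 2 receives from Y
sent = {("Y", "X", 1): 1}    # but Y's latest checkpoint records only 1 send
print(rollback(cps, rcvd, sent))   # X rolls back to 0; Y stays at 1
```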
• System assumptions
• Failure models
• Synchronous/asynchronous communication
• Network connectivity
• Sender identification
• Channel reliability
• Authenticated vs. non-authenticated messages
• Agreement variable
A simple example of Byzantine behavior is shown in Figure 4.14. Four generals are shown, and a
consensus decision is to be reached about a boolean value. The various generals are conveying
potentially misleading values of the decision variable to the other generals, which results in
confusion.
Problem Specifications
These problems are equivalent to one another, which can be shown using reductions.
Overview of results
Table 4.1: Overview of results on agreement. f denotes number of failure-prone processes. n is the
total number of processes.
Table 4.2: Some solvable variants of the agreement problem in asynchronous system. The overhead
bounds are for the given algorithms, and not necessarily tight bounds for the problem.
Figure 4.15 Circumventing the impossibility result for consensus in asynchronous systems.
In a failure-free system, consensus can be reached by collecting information from the different
processes, arriving at a “decision,” and distributing this decision in the system. A distributed
mechanism would have each process broadcast its values to others, and each process computes the
same function on the values received. The decision can be reached by using an application specific
function – some simple examples being the majority, max, and min functions. Algorithms to
collect the initial values and then distribute the decision may be based on the token circulation on
a logical ring, or the three-phase tree-based broadcast–convergecast–broadcast, or direct
communication with all nodes.
Algorithm 4.1 Consensus with up to f fail-stop processes in a system of n processes, n > f [8]. Code
shown is for process Pi , 1 ≤ i ≤ n.
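The body of Algorithm 4.1 is not reproduced here, so the following is only a simulation sketch of my own of the underlying flooding-min idea (the `drops` model of a mid-broadcast crash is an assumption, not the book's code): in each of f + 1 rounds every process broadcasts its current value and adopts the minimum it has seen.

```python
# Sketch: f+1 rounds of broadcast-and-take-minimum tolerate up to f
# fail-stop crashes; a crash mid-broadcast is modeled as dropped sends.

def min_consensus(init, f, drops=frozenset()):
    """init: pid -> initial value. drops: (round, sender, receiver) triples
    for messages never sent because the sender crashed mid-broadcast."""
    vals = dict(init)
    for r in range(1, f + 2):                      # f + 1 rounds
        received = {p: [vals[p]] for p in vals}    # each keeps its own value
        for s in vals:
            for d in vals:
                if s != d and (r, s, d) not in drops:
                    received[d].append(vals[s])
        vals = {p: min(received[p]) for p in vals}
    return vals

# P1 holds the minimum, sends it only to P2 before crashing in round 1,
# and sends nothing in round 2; survivors P2 and P3 still agree on 1.
v = min_consensus({1: 1, 2: 5, 3: 7}, f=1,
                  drops={(1, 1, 3), (2, 1, 2), (2, 1, 3)})
print(v[2], v[3])   # 1 1
```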
The agreement condition is satisfied because in the f +1 rounds, there must be at least one round
in which no process failed. In this round, say round r, all the processes that have not failed so far
succeed in broadcasting their values, and all these processes take the minimum of the values
broadcast and received in that round. Thus, the local values at the end of the round are the same,
say x_i^r, for all non-failed processes. In further rounds, only this value may be sent by each process
at most once, and no process i will update its value x_i^r.
• The validity condition is satisfied because processes do not send fictitious values in this failure
model. (Thus, a process that crashes has sent only correct values until the crash.) For all i, if the
initial value is identical, then the only value sent by any process is the value that has been agreed
upon as per the agreement condition.
Complexity
There are f + 1 rounds, where f < n. The number of messages is at most O(n²) in each round, and
each message has one integer. Hence the total number of messages is O((f + 1) · n²). The worst-case
scenario is as follows. Assume that the minimum value is with a single process initially. In the
first round, the process manages to send its value to just one other process before failing. In
subsequent rounds, the single process having this minimum value also manages to send that value
to just one other process before failing.
At least f +1 rounds are required, where f < n. The idea behind this lower bound is that in the
worst-case scenario, one process may fail in each round; with f +1 rounds, there is at least one
round in which no process fails.
Taking a simple majority decision does not help because the loyal lieutenant Pa cannot distinguish between
the possible scenarios (a) and (b), and hence does not know which action to take.
With n = 3 processes, the Byzantine agreement problem cannot be solved if the number of
Byzantine processes f = 1. The argument uses the illustration in Figure 4.16, which shows a
commander Pc and two lieutenant processes Pa and Pb. The malicious process is the lieutenant Pb
in the first scenario (Figure 4.16(a)) and hence Pa should agree on the value of the loyal
commander Pc, which is 0. But note the second scenario (Figure 4.16(b)) in which Pa receives
identical values from Pb and Pc, but now Pc is the disloyal commander whereas Pb is a loyal
lieutenant. In this case, Pa needs to agree with Pb. However, Pa cannot distinguish between the
two scenarios and any further message exchange does not help because each process has already
conveyed what it knows from the third process.
Figure 4.17 Achieving Byzantine agreement when n = 4 processes and f = 1 malicious process.
There is no ambiguity at any loyal lieutenant when taking the majority decision. The majority decision
is over the 2nd round messages, and the 1st round message received directly from the commander-in-chief
process.
Table 4.3: Relationships between messages and rounds in the Oral Messages algorithm for Byzantine
agreement.
Figure 4.18 Local tree at P3 for solving the Byzantine agreement, for n = 10 and f = 3. Only one
branch of the tree is shown for simplicity.
Some branches of the tree at P3. In this example, n = 10, f = 3, commander is P0.
• (round 1) P0 sends its value to all other processes using Oral_Msg (3), including to P3.
• (round 2) P3 sends 8 messages to others (excl. P0 and P3) using Oral_Msg (2). P3 also receives
8 messages.
• (round 3) P3 sends 8 × 7 = 56 messages to all others using Oral_Msg (1); P3 also receives 56
messages.
• (round 4) P3 sends 56 × 6 = 336 messages to all others using Oral_Msg (0); P3 also receives 336
messages. The received values are used as estimates of the majority function at this level of
recursion.
• Each phase has a unique "phase king" derived, say, from the PID. Each phase has
two rounds:
o In the 1st round, each process sends its estimate to all other processes.
o In the 2nd round, the "phase king" process arrives at an estimate based on the values
it received in the 1st round, and broadcasts its new estimate to all others.
Algorithm 4.4 Phase-king algorithm [4] – polynomial number of unsigned messages, n > 4f.
Code is for process Pi, 1 ≤ i ≤ n.
• In all 3 cases, one can argue that Pi and Pj end up with the same estimate value.
• If all non-malicious processes have the value x at the start of a phase, they will continue to have
x as the consensus value at the end of the phase.
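The two-round structure above can be sketched as an honest-execution simulation. This is my own simplified encoding, not the book's Algorithm 4.4: Byzantine behavior is not simulated, and each phase simply has every process keep its majority value when its multiplicity exceeds n/2 + f, and otherwise adopt the phase king's value.

```python
# Honest-only sketch of the phase-king structure (n > 4f, f + 1 phases):
# round 1 exchanges estimates; round 2 has the king break ties.

def phase_king(estimates, f, kings):
    n = len(estimates)
    est = list(estimates)
    for k in kings:                      # f + 1 phases with distinct kings
        counts = [est.count(0), est.count(1)]
        majority = 0 if counts[0] >= counts[1] else 1
        king_val = est[k]                # king broadcasts in round 2
        for i in range(n):
            if counts[majority] > n // 2 + f:
                est[i] = majority        # overwhelming majority: keep it
            else:
                est[i] = king_val        # otherwise adopt the king's value
    return est

print(phase_king([0, 1, 1, 1, 1], f=1, kings=[0, 1]))   # [1, 1, 1, 1, 1]
```

Note how the persistence property from the last bullet shows up here: if all processes start a phase with the same value x, its count is n > n/2 + f, so every process keeps x.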