Module 4 - Distributed Shared Memory and Failure Recovery - Sreerag Sanilkumar
Distributed shared memory – abstraction and advantages
Distributed shared memory (DSM) is an abstraction provided to the programmer of a
distributed system.
Programmers access the data across the network using read and write primitives.
A part of each computer’s memory is marked for shared space, and the remainder is
private memory.
To provide programmers with the illusion of a single shared address space, a memory
mapping management layer is required to manage the shared virtual memory space.
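To make the abstraction concrete, here is a minimal sketch, in Python, of what such a memory-mapping layer might look like. All names (DSMNode, PAGE_SIZE, the remote helpers) are hypothetical, and the network calls are left as stubs; real DSM systems implement them in the kernel or a runtime library.

```python
# Hypothetical sketch of a DSM memory-mapping layer: shared pages are striped
# across nodes, and read/write hide whether an address is local or remote.

PAGE_SIZE = 4096

class DSMNode:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.num_nodes = num_nodes
        self.local_pages = {}                      # page number -> bytearray

    def _owner(self, address):
        # Static mapping: page p lives on node p mod num_nodes.
        return (address // PAGE_SIZE) % self.num_nodes

    def read(self, address):
        if self._owner(address) == self.node_id:   # address is in the local share
            page = self.local_pages.setdefault(address // PAGE_SIZE,
                                               bytearray(PAGE_SIZE))
            return page[address % PAGE_SIZE]
        return self._remote_read(self._owner(address), address)

    def write(self, address, value):
        if self._owner(address) == self.node_id:
            page = self.local_pages.setdefault(address // PAGE_SIZE,
                                               bytearray(PAGE_SIZE))
            page[address % PAGE_SIZE] = value
        else:
            self._remote_write(self._owner(address), address, value)

    def _remote_read(self, owner, address):
        raise NotImplementedError("send a read-request message to the owner")

    def _remote_write(self, owner, address, value):
        raise NotImplementedError("send a write-request message to the owner")
```

The programmer only ever calls read() and write(); whether those touch local memory or generate network messages is hidden by the mapping layer.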
Advantages
1. Communication across the network is achieved by the read/write abstraction that
simplifies the task of programmers.
2. A single address space is provided, thereby providing the possibility of avoiding data
movement across multiple address spaces, and simplifying passing-by-reference and
passing complex data structures containing pointers.
3. If a block of data needs to be moved, the system can exploit locality of reference to
reduce the communication overhead.
4. DSM is often cheaper than using dedicated multiprocessor systems, because it uses
simpler software interfaces.
5. There is no bottleneck presented by a single memory access bus.
6. DSM effectively provides a large (virtual) main memory.
Disadvantages
1. Programmers are not shielded from having to know about various replica consistency
models and from coding their distributed applications according to the semantics of
these models.
2. DSM implementations cannot be more efficient than asynchronous message-passing
implementations. The generality of the DSM software may make it less efficient.
3. The standard implementations of DSM have a higher overhead than a programmer-
written implementation for a specific application and system.
Issues in implementing DSM
The following issues must be addressed when implementing DSM:
1. Determining what semantics to allow for concurrent access to shared objects. The semantics needs to be clearly specified so that the programmer can code the program using an appropriate logic.
2. Determining the best way to implement the semantics of concurrent access to shared data. One possibility is to use replication.
3. Selecting the locations for replication (if full replication is not used), to optimize efficiency from the system’s viewpoint.
4. Determining the location of remote data that the application needs to access, if full replication is not used.
5. Reducing communication delays and the number of messages that are involved under the covers while implementing the semantics of concurrent access to shared data.
DSM systems can be classified and implemented along four broad dimensions.

Failure recovery
System model
A distributed system consists of a fixed number of processes, P1, P2, …, PN, which communicate only through messages.
Processes cooperate to execute a distributed application and interact with the outside
world by receiving and sending input and output messages, respectively.
A local checkpoint
In distributed systems, all processes save their local states at certain instants of time.
This saved state is known as a local checkpoint.
A local checkpoint is a snapshot of the state of a process at a given instant, and the event of recording the state of a process is called local checkpointing.
The contents of a checkpoint depend upon the application context and the
checkpointing method being used.
Depending upon the checkpointing method used, a process may keep several local
checkpoints or just a single checkpoint at any time.
We assume that a process stores all local checkpoints on the stable storage so that
they are available even if the process crashes.
We also assume that a process is able to roll back to any of its existing local
checkpoints and thus restore to and restart from the corresponding state.
A local checkpoint is shown in the process-line by the symbol “ | ”.
Note that the consistent state in Figure 13.2(a) shows message m1 to have been sent
but not yet received, but that is alright.
The state in Figure 13.2(a) is consistent because it represents a situation in which, for every message that has been received, there is a corresponding send event.
The state in Figure 13.2(b) is inconsistent because process P2 is shown to have
received m2 but the state of process P1 does not reflect having sent it. Such a state is
impossible in any failure-free, correct computation.
Inconsistent states occur because of failures. For instance, the situation shown in
Figure 13.2(b) may occur if process P1 fails after sending message m2 to process P2
and then restarts at the state shown in Figure 13.2(b).
Thus, a local checkpoint is a snapshot of a local state of a process and a global
checkpoint is a set of local checkpoints, one from each process.
A consistent global checkpoint is a global checkpoint such that no message is sent by a
process after taking its local checkpoint that is received by another process before
taking its local checkpoint.
The consistency of global checkpoints strongly depends on the flow of messages exchanged by processes; an arbitrary set of local checkpoints at processes may not form a consistent global checkpoint.
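This definition can be checked mechanically. The following sketch (with a hypothetical data layout: each local checkpoint is an event index on its process line) flags a global checkpoint as inconsistent exactly when some message is received before the receiver’s checkpoint but sent after the sender’s:

```python
# A minimal consistency check for a global checkpoint (names hypothetical).

def is_consistent(cut, messages):
    """cut: {process: event index of its local checkpoint}
    messages: list of (sender, send_index, receiver, recv_index)"""
    for sender, send_idx, receiver, recv_idx in messages:
        received_before_cut = recv_idx <= cut[receiver]
        sent_before_cut = send_idx <= cut[sender]
        if received_before_cut and not sent_before_cut:
            return False   # orphan message -> inconsistent global checkpoint
    return True

# Figure 13.2 style examples: a message sent at event 5 of P1 and
# received at event 3 of P2.
msgs = [("P1", 5, "P2", 3)]
print(is_consistent({"P1": 6, "P2": 4}, msgs))  # True: both events recorded
print(is_consistent({"P1": 4, "P2": 4}, msgs))  # False: orphan, like m2
```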
The fundamental goal of any rollback-recovery protocol is to bring the system to a
consistent state after a failure.
The reconstructed consistent state is not necessarily one that occurred before the
failure.
It is sufficient that the reconstructed state be one that could have occurred before the
failure in a failure-free execution, provided that it is consistent with the interactions that
the system had with the outside world.
A common approach is to save each input message on the stable storage before allowing
the application program to process it. An interaction with the outside world to deliver the
outcome of a computation is shown on the process-line by the symbol “||”.
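A minimal sketch of this input-saving rule, assuming a JSON-serializable message and using an appended, fsync’ed file to stand in for stable storage (all names hypothetical):

```python
import json
import os

def handle_input(message, process_fn, log_path="input.log"):
    # Force the input message to stable storage *before* processing it, so
    # the interaction with the outside world survives a later rollback.
    with open(log_path, "a") as log:
        log.write(json.dumps(message) + "\n")
        log.flush()
        os.fsync(log.fileno())   # block until the record is physically stable
    process_fn(message)          # application-level handling happens only now
```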
Different types of messages
A process failure and subsequent recovery may leave in abnormal states messages that were correctly received (and processed) before the failure.
This is because a rollback of processes for recovery may have to undo the send and receive operations of several messages.
1. In-transit messages
1. In Figure 13.2(a), the global state shows that message m1 has been sent but not yet received. We call such a message an in-transit message.
2. When in-transit messages are part of a global system state, these messages do not
cause any inconsistency.
3. However, depending on whether the system model assumes reliable
communication channels, rollback-recovery protocols may have to guarantee the
delivery of in-transit messages when failures occur.
4. For reliable communication channels, a consistent state must include in-transit
messages because they will always be delivered to their destinations in any legal
execution of the system.
5. On the other hand, if the system model assumes lossy communication channels, then in-transit messages can be omitted from the system state.
2. Lost messages
1. Messages whose send is not undone but receive is undone due to rollback are
called lost messages.
2. This type of message occurs when the receiving process rolls back to a checkpoint taken before the reception of the message, while the sender does not roll back beyond the send operation of the message.
3. In Figure 13.3, message m1 is a lost message.
3. Delayed messages
1. Messages whose receive is not recorded because the receiving process was either
down or the message arrived after the rollback of the receiving process, are called
delayed messages.
2. For example, messages m2 and m5 in Figure 13.3 are delayed messages.
4. Orphan messages
1. Messages with receive recorded but message send not recorded are called orphan
messages.
2. For example, a rollback might have undone the send of such a message, leaving the receive event intact at the receiving process.
3. Orphan messages do not arise if processes roll back to a consistent global state.
5. Duplicate messages
1. Duplicate messages arise due to message logging and replaying during process
recovery.
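This taxonomy boils down to which of a message’s two events survive in the restored global state. A small hypothetical sketch:

```python
# Classify a message relative to a restored global state (names hypothetical).

def classify(send_recorded, recv_recorded, delivered_before_restore):
    """send_recorded / recv_recorded: does the restored state include the event?
    delivered_before_restore: had the message already reached its destination?"""
    if send_recorded and recv_recorded:
        return "normal"
    if send_recorded and not recv_recorded:
        if not delivered_before_restore:
            return "in-transit or delayed"   # still on its way, or arrives later
        return "lost"                        # was delivered, but receive undone
    if not send_recorded and recv_recorded:
        return "orphan"                      # receive survives an undone send
    return "vanished"                        # both events undone

# Duplicates are not visible in a single global state: they arise when a logged
# message is replayed during recovery although the original was already delivered.
```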
Consider now an example computation comprising three processes, Pi, Pj, and Pk, connected through a communication network.
The processes communicate solely by exchanging messages over fault-free, FIFO
communication channels.
Processes Pi, Pj, and Pk have taken checkpoints {Ci0, Ci1}, {Cj0, Cj1, Cj2}, and {Ck0, Ck1}, respectively, and these processes have exchanged messages A to J.
Suppose process Pi fails at the instant indicated in the figure.
All the contents of the volatile memory of Pi are lost and, after Pi has recovered from the
failure, the system needs to be restored to a consistent global state from where the
processes can resume their execution.
Process Pi’s state is restored to a valid state by rolling it back to its most recent
checkpoint Ci1.
To restore the system to a consistent state, the process Pj rolls back to checkpoint Cj1
because the rollback of process Pi to checkpoint Ci1 created an orphan message H.
Note that process Pj does not roll back to checkpoint Cj2 but to checkpoint Cj1,
because rolling back to checkpoint Cj2 does not eliminate the orphan message H.
Even this resulting state is not a consistent global state, as an orphan message I is created due to the rollback of process Pj to checkpoint Cj1.
To eliminate this orphan message, process Pk rolls back to checkpoint Ck1.
The restored global state {Ci1, Cj1, Ck1} is a consistent state as it is free from orphan messages.
Although the system state has been restored to a consistent state, several messages are left in erroneous states that must be handled correctly.
Messages A, B, D, G, H, I, and J had been received at the points indicated in the figure
and messages C, E, and F were in transit when the failure occurred.
Restoration of the system state to checkpoints {Ci1, Cj1, Ck1} automatically handles messages A, B, and J because the send and receive events of messages A, B, and J have been recorded, and both the events for G, H, and I have been completely undone.
These messages cause no problem and we call messages A, B, and J normal
messages and messages G, H, and I vanished messages.
Messages C, D, E, and F are potentially problematic.
Message C is in transit during the failure and it is a delayed message.
The delayed message C has several possibilities: C might arrive at process Pi before it
recovers, it might arrive while Pi is recovering, or it might arrive after Pi has completed
recovery.
Each of these cases must be dealt with correctly.
Message D is a lost message since the send event for D is recorded in the restored
state for process Pj, but the receive event has been undone at process Pi.
Process Pj will not resend D without an additional mechanism, since the send of D at Pj occurred before the checkpoint and the communication system successfully delivered D.
Messages E and F are delayed orphan messages and pose perhaps the most serious
problem of all the messages.
When messages E and F arrive at their respective destinations, they must be discarded
since their send events have been undone.
Processes, after resuming execution from their checkpoints, will generate both of these
messages, and recovery techniques must be able to distinguish between messages like
C and those like E and F.
Lost messages like D can be handled by having processes keep a message log of all
the sent messages.
So when a process restores to a checkpoint, it replays the messages from its log to
handle the lost message problem.
However, message logging and message replaying during recovery can result in
duplicate messages.
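The rollbacks in this example follow a simple fixed point: restore the failed process to its latest checkpoint, then keep rolling back any process whose recorded receive has become an orphan. A sketch under hypothetical assumptions (checkpoints and message events are identified by event indices on each process line):

```python
# Compute a consistent recovery line by rollback propagation (names hypothetical).

def recovery_line(checkpoints, current, messages, failed):
    """checkpoints: {process: list of checkpoint event indices}
    current:     {process: current event index on its process line}
    messages:    list of (sender, send_idx, receiver, recv_idx)"""
    def latest_checkpoint_before(p, event):
        cands = [c for c in checkpoints[p] if c < event]
        return max(cands, default=0)          # 0 stands for the initial state

    line = dict(current)                      # live processes keep their state
    line[failed] = max(checkpoints[failed])   # failed process -> latest checkpoint

    changed = True
    while changed:                            # iterate until no orphan remains
        changed = False
        for sender, s_idx, receiver, r_idx in messages:
            if r_idx <= line[receiver] and s_idx > line[sender]:   # orphan
                line[receiver] = latest_checkpoint_before(receiver, r_idx)
                changed = True
    return line
```

On the example above, rolling Pi back to Ci1 makes H an orphan, which forces Pj back to Cj1; that in turn makes I an orphan and forces Pk back to Ck1, yielding exactly the recovery line {Ci1, Cj1, Ck1}.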
Log-based rollback recovery: deterministic and non-deterministic events
Log-based rollback recovery exploits the fact that a process execution can be modeled
as a sequence of deterministic state intervals, each starting with the execution of a non-
deterministic event.
A non-deterministic event can be the receipt of a message from another process or an
event internal to the process.
For example, in the figure, the execution of process P0 is a sequence of four deterministic intervals.
The first one starts with the creation of the process, while the remaining three start with
the receipt of messages m0, m3, and m7, respectively.
The send event of message m2 is uniquely determined by the initial state of P0 and by the receipt of message m0, and is therefore not a non-deterministic event.
Log-based rollback recovery assumes that all non-deterministic events can be identified
and their corresponding determinants can be logged into the stable storage.
During failure-free operation, each process logs the determinants of all non-
deterministic events that it observes onto the stable storage.
Additionally, each process also takes checkpoints to reduce the extent of rollback during
recovery.
After a failure occurs, the failed processes recover by using the checkpoints and logged
determinants to replay the corresponding non-deterministic events precisely as they
occurred during the pre-failure execution.
Because execution within each deterministic interval depends only on the sequence of non-deterministic events that preceded the interval’s beginning, replaying the logged determinants in their original order reproduces the pre-failure execution.
Log-based rollback-recovery protocols guarantee that upon recovery of all failed processes,
the system does not contain any orphan process. Log-based rollback-recovery protocols are
of three types:
1. pessimistic logging,
2. optimistic logging,
3. causal logging
Pessimistic Logging
Pessimistic logging protocols assume that a failure can occur after any non-
deterministic event in the computation.
This assumption is “pessimistic” since, in reality, failures are rare.
In their most straightforward form, pessimistic protocols log to the stable storage the
determinant of each non-deterministic event before the event affects the computation.
Pessimistic protocols implement the following property, often referred to as synchronous logging, which is stronger than the always-no-orphans condition:
∀e : ¬Stable(e) ⟹ |Depend(e)| = 0
That is, if an event has not been logged on the stable storage, then no process can
depend on it.
In addition to logging determinants, processes also take periodic checkpoints to
minimize the amount of work that has to be repeated during recovery.
When a process fails, the process is restarted from the most recent checkpoint and the
logged determinants are used to recreate the pre-failure execution.
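A sketch of the synchronous-logging rule, assuming a hypothetical write_stable() helper that returns only once its record is physically on stable storage:

```python
import json
import os

def write_stable(record, path="determinants.log"):
    # Append the record and block until it is physically on stable storage.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())

def deliver(message, delivery_order, apply_fn):
    determinant = {"msg_id": message["id"], "order": delivery_order}
    write_stable(determinant)   # log the determinant first ...
    apply_fn(message)           # ... only then let the event affect the state
```

The fsync on the delivery path is exactly the performance penalty of pessimistic logging discussed below.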
In the figure, during failure-free operation, the logs of processes P0, P1, and P2 contain the determinants needed to replay messages {m0, m4, m7}, {m1, m3, m6}, and {m2, m5}, respectively.
Suppose processes P1 and P2 fail as shown, restart from checkpoints B and C, and roll
forward using their determinant logs to deliver again the same sequence of messages
as in the pre-failure execution.
This guarantees that P1 and P2 will repeat exactly their pre-failure execution and re-
send the same messages.
Hence, once the recovery is complete, both processes will be consistent with the state
of P0 that includes the receipt of message m7 from P1.
In a pessimistic logging system, the observable state of each process is always
recoverable.
The performance penalty of synchronous logging can be reduced with special hardware; for example, fast non-volatile semiconductor memory can be used to implement the stable storage.
Another approach is to limit the number of failures that can be tolerated.
The overhead of pessimistic logging is reduced by delivering a message or executing
an event and deferring its logging until the process communicates with another process
or with the outside world.
Some pessimistic logging systems reduce the overhead of synchronous logging without
relying on hardware.
For example, the sender-based message logging (SBML) protocol keeps the
determinants corresponding to the delivery of each message m in the volatile memory
of its sender.
Optimistic Logging
In optimistic logging protocols, processes log determinants asynchronously to the stable storage.
These protocols optimistically assume that logging will be complete before a failure
occurs.
Determinants are kept in a volatile log, and are periodically flushed to the stable
storage.
Optimistic logging protocols do not implement the always-no-orphans condition.
The protocols allow the temporary creation of orphan processes which are eventually
eliminated.
To perform rollbacks correctly, optimistic logging protocols track causal dependencies
during failure free execution.
Upon a failure, the dependency information is used to calculate and recover the latest global state of the pre-failure execution in which no process is an orphan.
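By contrast with the synchronous sketch above, an optimistic logger keeps determinants in volatile memory and makes them stable lazily; a hypothetical sketch:

```python
# Sketch of optimistic logging: no stable write on the delivery path
# (names hypothetical).

class OptimisticLogger:
    def __init__(self):
        self.volatile_log = []   # determinants not yet on stable storage

    def deliver(self, message, delivery_order, apply_fn):
        # Record the determinant in RAM only; delivery is never blocked.
        self.volatile_log.append({"msg_id": message["id"],
                                  "order": delivery_order})
        apply_fn(message)

    def periodic_flush(self, write_stable):
        # Called from time to time. Anything delivered since the last flush
        # is lost if the process crashes first: this is what creates orphans.
        while self.volatile_log:
            write_stable(self.volatile_log.pop(0))
```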
Causal Logging
Causal logging combines the advantages of both pessimistic and optimistic logging at
the expense of a more complex recovery protocol.
Like optimistic logging, it does not require synchronous access to the stable storage
except during output commit.
Like pessimistic logging, it allows each process to commit output independently and
never creates orphans, thus isolating processes from the effects of failures at other
processes.
Moreover, causal logging limits the rollback of any failed process to the most recent
checkpoint on the stable storage, thus minimizing the storage overhead and the amount
of lost work.
Causal logging protocols make the always-no-orphans property hold by ensuring that the determinant of each non-deterministic event that causally precedes the state of a process is either stable or available locally to that process.
In the figure, process P0 at state X will have logged the determinants of the non-deterministic events that causally precede its state according to Lamport’s happened-before relation.
These events consist of the delivery of messages m0, m1, m2, m3, and m4.
The determinant of each of these non-deterministic events is either logged on the stable
storage or is available in the volatile log of process P0.
The determinant of each of these events contains the order in which its original receiver
delivered the corresponding message.
The message sender, as in sender-based message logging, logs the message content.
Thus, process P0 will be able to “guide” the recovery of P1 and P2.
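One way to realize this is to piggyback not-yet-stable determinants on outgoing messages; the sketch below is a drastically simplified, hypothetical illustration of that idea:

```python
# Drastically simplified sketch of causal logging (all names hypothetical):
# each message carries the determinants its sender knows about that are not
# yet stable, so receivers accumulate what they need to guide others' recovery.

class CausalLogger:
    def __init__(self):
        self.volatile = []   # determinants not yet known to be stable

    def deliver(self, message, delivery_order, apply_fn):
        # Adopt whatever the sender piggybacked, then record our own delivery.
        self.volatile.extend(message.get("piggyback", []))
        self.volatile.append({"msg_id": message["id"],
                              "order": delivery_order})
        apply_fn(message)

    def send(self, msg_id, payload):
        # Attach all locally known, not-yet-stable determinants to the message.
        return {"id": msg_id, "payload": payload,
                "piggyback": list(self.volatile)}
```

A surviving process like P0 thus holds, in its volatile log, the determinants of every delivery that causally precedes its state, which is exactly what it needs to replay the recovery of P1 and P2.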