Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms
4.1 Introduction
A large class of problems in distributed computing can be cast as executing some notification or reaction when the state of the system satisfies a particular condition. Examples of such problems include monitoring and debugging, detection of particular states such as deadlock and termination, and dynamic adaptation of a program's configuration such as for load balancing. Thus, the ability to construct a global state and evaluate a predicate over such a state constitutes the core of solutions to many problems in distributed computing.
The global state of a distributed system is the union of the states of the
individual processes. Given that the processes of a distributed system do
not share memory but instead communicate solely through the exchange
of messages, a process that wishes to construct a global state must infer
the remote components of that state through message exchanges. Thus, a
fundamental problem in distributed computing is to ensure that a global
state constructed in this manner is meaningful.
In asynchronous distributed systems, a global state obtained through
remote observations could be obsolete, incomplete, or inconsistent. Infor-
mally, a global state is inconsistent if it could never have been constructed
by an idealized observer that is external to the system. It should be clear
that uncertainties in message delays and in relative speeds at which local
computations proceed prevent a process from drawing conclusions about
the instantaneous global state of the system to which it belongs. While sim-
ply increasing the frequency of communication may be effective in making
local views of a global state more current and more complete, it is not
sufficient for guaranteeing that the global state is consistent. Ensuring
the consistency of a constructed global state requires us to reason about
both the order in which messages are observed by a process as well as
the information contained in the messages. For a large class of problems,
consistency turns out to be an appropriate formalization of the notion that
global reasoning with local information is “meaningful”.
Another source of difficulty in distributed systems arises when separate
processes independently construct global states. The variability in message
delays could lead to these separate processes constructing different global
states for the same computation. Even though each such global state may
be consistent and the processes may be evaluating the same predicate,
the different processes may execute conflicting reactions. This “relativistic
effect” is inherent to all distributed computations and limits the class of
system properties that can be effectively detected.
In this chapter, we formalize and expand the above concepts in the context of an abstract problem called Global Predicate Evaluation (GPE). The goal of GPE is to determine whether the global state of the system satisfies some predicate $\Phi$. Global predicates are constructed so as to encode system properties of interest in terms of state variables. Examples of distributed system problems where the relevant properties can be encoded as global predicates include deadlock detection, termination detection, token loss detection, unreachable storage (garbage) collection, checkpointing and restarting, debugging, and in general, monitoring and reconfiguration. In this sense, a solution to GPE can be seen as the core of a generic solution for all these problems; what remains to be done is the formulation of the appropriate predicate $\Phi$ and the construction of reactions or notifications to be executed when the predicate is satisfied.
We begin by defining a formal model for asynchronous distributed sys-
tems and distributed computations. We then examine two different strate-
gies for solving GPE. The first strategy, introduced in Section 4.5, and
refined in Section 4.13, is based on a monitor process that actively inter-
rogates the rest of the system in order to construct the global state. In
Section 4.6 we give a formal definition for consistency of global states. The
alternative strategy, discussed in Section 4.7, has the monitor passively ob-
serve the system in order to construct its global states. Sections 4.8 – 4.13
introduce a series of concepts and mechanisms necessary for making the
two strategies work efficiently. In Section 4.14 we identify properties that
global predicates must satisfy in order to solve practical problems using
GPE. In Section 4.15 we address the issue of multiple monitors observing
the same computation. We illustrate the utility of the underlying concepts
and mechanisms by applying them to deadlock detection and to debugging
in distributed systems.
1. For finite computations, this can be easily accomplished by adding the process index
and a sequence number to the data value to construct the message identifier.
the process is not ready) or the process is delayed (because the message has not arrived).
Note that this "message passing" view of communication at the event level may be quite different from those of higher system layers. Remote communication at the programming language level may be accomplished through any number of paradigms including remote procedure calls [?], broadcasts [?], distributed transactions [?], distributed objects [17] or distributed shared memory [18]. At the level we observe distributed computations, however, all such high-level communication boils down to generating matching send and receive events at pairs of processes.
The local history of process $p_i$ during the computation is a (possibly infinite) sequence of events $h_i = e_i^1 e_i^2 \ldots$. This labeling of the events of process $p_i$, where $e_i^1$ is the first event executed, $e_i^2$ is the second event executed, etc., is called the canonical enumeration and corresponds to the total order imposed by the sequential execution on the local events. Let $h_i^k = e_i^1 e_i^2 \ldots e_i^k$ denote an initial prefix of local history $h_i$ containing the first $k$ events. We define $h_i^0$ to be the empty sequence. The global history of the computation is a set $H = h_1 \cup h_2 \cup \ldots \cup h_n$ containing all of its events.
2. Sometimes we are interested in local histories as sets rather than sequences of events. Since all events of a computation have unique labels in the canonical enumeration, $h_i$ as a set contains exactly the same events as $h_i$ as a sequence. We use the same symbol to denote both when the appropriate interpretation is clear from context.
[Figure 4.1. Space-time diagram of a distributed computation with three processes $p_1$, $p_2$ and $p_3$.]
3. While "$e$ may causally affect $e'$", or, "$e'$ occurs in the causal context of $e$" [25] are equivalent interpretations of this relation, we prefer not to interpret it as "$e$ happens before $e'$" [15] because of the real-time connotation.
with time progressing from left to right. An arrow from one process to another represents a message being sent, with the send event at the base of the arrow and the corresponding receive event at the head of the arrow. Internal events have no arrows associated with them. Given this graphical representation, it is easy to verify if two events are causally related: if a path can be traced from one event to the other proceeding left-to-right along the horizontal lines and in the sense of the arrows, then they are related; otherwise they are concurrent. For example, in the figure, $e_1^2 \rightarrow e_3^6$ but $e_2^2 \parallel e_3^6$.
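The graphical test can also be phrased operationally: causal precedence is the transitive closure of the program order within each process together with the pairing of send and receive events of messages. The following sketch makes this concrete; the event names and the single message edge are illustrative inputs, not the computation of Figure 4.1.

local_histories = {
    1: [(1, 1), (1, 2), (1, 3)],   # events of p1, named (process, index)
    2: [(2, 1), (2, 2)],           # events of p2
}
messages = [((1, 1), (2, 2))]      # (send event, corresponding receive event)

# Direct precedence: program order within a process plus send -> receive edges.
edges = set(messages)
for history in local_histories.values():
    edges.update(zip(history, history[1:]))

events = [e for h in local_histories.values() for e in h]
precedes = set(edges)
changed = True
while changed:                     # transitive closure
    changed = False
    for a, b in list(precedes):
        for c in events:
            if (b, c) in precedes and (a, c) not in precedes:
                precedes.add((a, c))
                changed = True

def concurrent(e1, e2):
    # Two events are concurrent if neither causally precedes the other.
    return (e1, e2) not in precedes and (e2, e1) not in precedes

print(((1, 1), (2, 2)) in precedes)   # True: a send precedes its receive
print(concurrent((1, 2), (2, 1)))     # True: no path connects them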
4. We can define global states without referring to channel states since they can always
be encoded as part of the process local states. We discuss explicit representation of channel
states in Section 4.13.
5. If two events actually do occur at the same real-time, we can arbitrarily say that the event
of the process with the smaller index occurs before the event of the larger-index process.
[Figure 4.2. The example distributed computation with two cuts $C$ and $C'$.]
In other words, for each process $p_i$, the events of $p_i$ appear in $R$ in the same order that they appear in $h_i$. Note that a run need not correspond to any possible execution and a single distributed computation may have many runs, each corresponding to a different execution.

In the first strategy we pursue for constructing global states, the monitor $p_0$ takes on an active role and sends each process a "state enquiry" message. Upon the receipt of such a message, $p_i$ replies with its current local state $\sigma_i$. When all processes have replied, $p_0$ can construct the global state $(\sigma_1, \ldots, \sigma_n)$. Note that the positions in the process local histories at which the state enquiry messages are received effectively define a cut. The global state constructed by $p_0$ is the one corresponding to this cut.
Given that the monitor process is part of the distributed system and is
subject to the same uncertainties as any other process, the simple-minded
approach sketched above may lead to predicate values that are not mean-
ingful. To illustrate the problems that can arise, consider a distributed
system composed of servers providing remote services and clients that in-
voke them. In order to satisfy a request, a server may invoke other services
(and thus act as a client). Clients and servers interact through remote pro-
cedure calls—after issuing a request for service, the client remains blocked
until it receives the response from the server. The computation depicted in
Figure 4.1 could correspond to this interaction if we interpret messages la-
beled req as requests for service and those labeled resp as responses. Clearly,
such a system can deadlock. Thus, it is important to be able to detect when
the state of this system includes deadlocked processes.
One possibility for detecting deadlocks in the above system is as follows. Server processes maintain local states containing the names of clients from which they received requests but to which they have not yet responded. The relevant aspects of the global state of this system can be summarized through a waits-for graph (WFG) where the nodes correspond to processes and the edges model blocking. In this graph, an edge is drawn from node $p_i$ to node $p_j$ if $p_j$ has received a request from $p_i$ to which it has not yet responded. Note that WFG can be constructed solely on the basis of local states. It is well known that a cycle in WFG is a sufficient condition to characterize deadlock in this system [10]. The nodes of the cycle are exactly those processes involved in the deadlock. Thus, the predicate $\Phi$ defined as "WFG contains a cycle" is one possibility for deadlock detection.6
Let us see what might happen if process $p_0$ monitors the computation of Figure 4.1 as outlined above. Suppose that the state enquiry messages of $p_0$ are received by the three application processes at the points corresponding to cut $C'$ of Figure 4.2. In other words, processes $p_1$, $p_2$ and $p_3$ report local states $\sigma_1^3$, $\sigma_2^2$ and $\sigma_3^6$, respectively. The WFG constructed by $p_0$ for this global state will have the edges $(p_1, p_3)$, $(p_2, p_1)$ and $(p_3, p_2)$, forming a cycle. Thus, $p_0$ will report a deadlock involving all three processes.

An omniscient external observer of the computation in Figure 4.1, on the other hand, would conclude that at no time is the system in a deadlock state. The condition detected by $p_0$ above is called a ghost deadlock in that it is fictitious. While every cut of a distributed computation corresponds to a global state, only certain cuts correspond to global states that could have taken place during a run. Cut $C$ of Figure 4.2 represents such a global state. On the other hand, cut $C'$ constructed by $p_0$ corresponds to a global state that could never occur since process $p_3$ is in a state reflecting the receipt of a request from process $p_1$ that $p_1$ has no record of having sent. Predicates applied to cuts such as $C'$ can lead to incorrect conclusions about the system state.

6. Note that $\Phi$ defined as a cycle in WFG characterizes a stronger condition than deadlock in the sense that $\Phi$ implies deadlock but not vice versa. If, however, processes can receive and record requests while being blocked, then a deadlocked system will eventually satisfy $\Phi$.
We return to solving the GPE problem through active monitoring of
distributed computations in Section 4.13 after understanding why the above
approach failed.
4.6 Consistency
Causal precedence happens to be the appropriate formalism for distinguishing the two classes of cuts exemplified by $C$ and $C'$. A cut $C$ is consistent if for all events $e$ and $e'$,

$$(e \in C) \wedge (e' \rightarrow e) \Rightarrow e' \in C.$$

In other words, a consistent cut is left closed under the causal precedence relation. In its graphical representation, verifying the consistency of a cut becomes easy: if all arrows that intersect the cut have their bases to the left and heads to the right of it, then the cut is consistent; otherwise it is inconsistent. According to this definition, cut $C$ of Figure 4.2 is consistent while cut $C'$ is inconsistent. A consistent global state is one corresponding to a consistent cut. These definitions correspond exactly to the intuition that consistent global states are those that could occur during a run in the sense that they could be constructed by an idealized observer external to the system. We can now explain the ghost deadlock detected by $p_0$ in the previous section as resulting from the evaluation of $\Phi$ in an inconsistent global state.
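Expressed as a check over an explicit precedence relation, the definition is a one-liner; the sketch below is illustrative, with the relation supplied as a set of ordered event pairs.

def is_consistent(cut, precedes):
    # Left closure under causal precedence:
    # if e is in the cut and e' -> e, then e' must be in the cut as well.
    return all(e_prime in cut for (e_prime, e) in precedes if e in cut)

# A single message: s is its send event, r the corresponding receive.
precedes = {("s", "r")}
print(is_consistent({"s", "r"}, precedes))   # True: both send and receive included
print(is_consistent({"r"}, precedes))        # False: receive without its send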
Consistent cuts (and consistent global states) are fundamental towards understanding asynchronous distributed computing. Just as a scalar time value denotes a particular instant during a sequential computation, the frontier of a consistent cut establishes an "instant" during a distributed computation. Similarly, notions such as "before" and "after" that are defined with respect to a given time in sequential systems have to be interpreted with respect to consistent cuts in distributed systems: an event $e$ is before (after) a cut $C$ if $e$ is to the left (right) of the frontier of $C$.
Predicate values are meaningful only when evaluated in consistent global states since these characterize exactly the states that could have taken place during an execution. A run $R$ is said to be consistent if for all events, $e \rightarrow e'$ implies that $e$ appears before $e'$ in $R$. In other words, the total order imposed by $R$ on the events is an extension of the partial order defined by causal precedence. It is easy to see that a run $R = e^1 e^2 \ldots$ results in a sequence of global states $\Sigma^0 \Sigma^1 \Sigma^2 \ldots$ where $\Sigma^0$ denotes the initial global state $(\sigma_1^0, \ldots, \sigma_n^0)$. If the run $R$ is consistent, then the global states in the sequence will all be consistent as well. We will use the term "run" to refer to both the sequence of events and the sequence of resulting global states.

Each (consistent) global state $\Sigma^k$ of the run is obtained from the previous state $\Sigma^{k-1}$ by some process executing the single event $e^k$. For two such (consistent) global states of run $R$, we say that $\Sigma^{k-1}$ leads to $\Sigma^k$ in $R$. Let $\leadsto_R$ denote the transitive closure of the leads-to relation in a given run $R$. We say that $\Sigma'$ is reachable from $\Sigma$ in run $R$ if and only if $\Sigma \leadsto_R \Sigma'$. We drop the run subscript if there exists some run in which $\Sigma'$ is reachable from $\Sigma$.
The set of all consistent global states of a computation along with the leads-to relation defines a lattice. The lattice consists of $n$ orthogonal axes, with one axis for each process. Let $\Sigma^{k_1 \ldots k_n}$ be a shorthand for the global state $(\sigma_1^{k_1}, \ldots, \sigma_n^{k_n})$ and let $k_1 + \cdots + k_n$ be its level. Figure 4.3 illustrates a distributed computation of two processes and the corresponding global state lattice. Note that every global state is reachable from the initial global state $\Sigma^{00}$. A path in the lattice is a sequence of global states of increasing level (in the figure, downwards) where the level between any two successive elements differs by one. Each such path corresponds to a consistent run of the computation. The run is said to "pass through" the global states included in the path. For the example illustrated in Figure 4.3, one possible run may pass through the sequence of global states

$$\Sigma^{00}\ \Sigma^{01}\ \Sigma^{11}\ \Sigma^{21}\ \Sigma^{31}\ \Sigma^{32}\ \Sigma^{42}\ \Sigma^{43}\ \Sigma^{44}\ \Sigma^{54}\ \Sigma^{64}\ \Sigma^{65}.$$
[Figure 4.3. A Distributed Computation and the Lattice of its Global States]
[Three observations $O_1$, $O_2$ and $O_3$ of the computation of Figure 4.1, as collected by the monitor $p_0$.]
7. In general, the application processes need to inform $p_0$ only when they execute an event that is relevant to $\Phi$. A local event $e_i^k$ is said to be relevant to predicate $\Phi$ if the value of $\Phi$ evaluated in a global state $(\ldots, \sigma_i^k, \ldots)$ could be different from that evaluated in $(\ldots, \sigma_i^{k-1}, \ldots)$. For example, in the client-server computation of Figure 4.1, the only events relevant to deadlock detection are the sending/receiving of request and response messages since only these can change the state of the WFG.
observation $O_2$ would be $(\sigma_1^3, \sigma_2^2, \sigma_3^6)$, which is exactly the global state defined by cut $C'$ of Figure 4.2, resulting in the detection of a ghost deadlock. Finally, $O_3$ is a consistent observation and leads to the same global state as that of cut $C$ in Figure 4.2.

It is the possibility of messages being reordered by channels that leads to undesirable observations such as $O_1$. We can restore order to messages between pairs of processes by defining a delivery rule for deciding when received messages are to be presented to the application process. We call the primitive invoked by the application deliver to distinguish it from receive, which remains hidden within the delivery rule and does not appear in the local history of the process.
Communication from process $p_i$ to $p_j$ is said to satisfy First-In-First-Out (FIFO) delivery if for all messages $m$ and $m'$:

FIFO Delivery: $send_i(m) \rightarrow send_i(m') \Rightarrow deliver_j(m) \rightarrow deliver_j(m')$.
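FIFO delivery between a pair of processes is commonly enforced by tagging messages with per-sender sequence numbers and buffering out-of-order arrivals. The sketch below illustrates such a delivery rule; the class and field names are ours and are not part of the protocols developed in this chapter.

from collections import defaultdict

class FifoDelivery:
    # Per-sender FIFO: a received message is delivered only after every message
    # the same sender sent before it has been delivered.
    def __init__(self):
        self.next_seq = defaultdict(lambda: 1)   # sender -> next expected sequence number
        self.buffer = defaultdict(dict)          # sender -> {sequence number: message}

    def receive(self, sender, seq, message):
        # Called for each arrival; returns the messages now deliverable, in order.
        self.buffer[sender][seq] = message
        deliverable = []
        while self.next_seq[sender] in self.buffer[sender]:
            deliverable.append(self.buffer[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1
        return deliverable

fifo = FifoDelivery()
print(fifo.receive("p1", 2, "m2"))   # []: m1 has not been delivered yet
print(fifo.receive("p1", 1, "m1"))   # ['m1', 'm2']: order restored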
increasing timestamp order and delivering only messages with timestamps up to time $t$ ensures that no future message can arrive with a timestamp smaller than any of the messages already delivered. Since the observation coincides with the delivery order, it is consistent if and only if

Clock Condition: $e \rightarrow e' \Rightarrow RC(e) < RC(e')$.

This condition is certainly satisfied when timestamps are generated using the global real-time clock. As it turns out, the clock condition can be satisfied even in an asynchronous system.
Note that the above construction produces logical clock values that are increasing with respect to causal precedence. It is easy to verify that for any two events where $e \rightarrow e'$, the logical clocks associated with them are such that $LC(e) < LC(e')$. Thus, logical clocks satisfy the clock condition of the previous section.10
Now let us return to the goal at hand, which is constructing consistent observations in asynchronous systems. In the previous section, we argued that delivery rule DR1 led to consistent observations as long as timestamps satisfied the clock condition. We have just shown that logical clocks indeed satisfy the clock condition and are realizable in asynchronous systems. Thus, we should be able to use logical clocks to construct consistent observations in asynchronous systems. Uses of logical clocks in many other contexts are discussed in [26].
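For reference, the standard logical clock construction referred to above can be sketched as follows; the class and method names are ours.

class LogicalClock:
    # Lamport-style logical clock: incremented on internal and send events,
    # advanced past the message timestamp on receive events.
    def __init__(self):
        self.value = 0

    def tick(self):
        # Internal or send event; the returned value timestamps the event
        # (and any message sent by it).
        self.value += 1
        return self.value

    def receive(self, ts):
        # Receive event carrying timestamp ts.
        self.value = max(self.value, ts) + 1
        return self.value

p1, p2 = LogicalClock(), LogicalClock()
ts = p1.tick()            # p1 sends a message with timestamp 1
print(p2.receive(ts))     # 2: the receive is ordered after the send
print(p2.tick())          # 3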
Consider a delivery rule where those messages that are delivered are delivered in increasing (logical clock) timestamp order, with ties being broken as usual based on process index. Applying this rule to the example of Figure 4.4, $p_0$ would construct the observation
$$e_1^1\ e_2^1\ e_3^1\ e_1^2\ e_3^2\ e_3^3\ e_1^3\ e_3^4\ e_1^4\ e_2^2\ e_3^5\ e_1^5\ e_2^3\ e_1^6\ e_3^6$$

10. Note that logical clocks would continue to satisfy the clock condition with any arbitrary positive integer (rather than one) as the increment value of the update rules.
11. Even this delivery rule may lack liveness if some processes do not communicate with $p_0$ after a certain point. Liveness can be obtained by the monitor $p_0$ requesting an acknowledgement from all processes to a periodic empty message [15]. These acknowledgements serve to "flush out" messages that may have been in the channels.
[Figure 4.6. Causal History of an Event (space-time diagram of the example computation).]
an event as its "clock" value [30]. We define the causal history of event $e$ in a distributed computation as the set

$$\theta(e) = \{e' \in H \mid e' \rightarrow e\} \cup \{e\}.$$

Causal precedence can then be verified through set inclusion: $e \rightarrow e'$ if and only if $\theta(e) \subset \theta(e')$. In case $e \neq e'$, the set inclusion above can be replaced by the simple set membership test $e \in \theta(e')$. The unfortunate property of causal histories that renders them impractical is that they grow rapidly.
$$k \leq VC(e)[j] \ \text{ if and only if } \ e_j^k \in \theta(e)$$
The resulting mechanism is known as vector clocks and has been discovered independently by many researchers in many different contexts (see [30] for a survey). In this scheme, each process $p_i$ maintains a local vector $VC$ of $n$ natural numbers where $VC(e_i)$ denotes the vector clock value of $p_i$ when it executes event $e_i$. As with logical clocks, we use $VC$ to refer to the current vector clock of a process that is implicit from context. Each process $p_i$ initializes $VC$ to contain all zeros. Each message $m$ contains a timestamp $TS(m)$ which is the vector clock value of its send event. The following update rules define how the vector clock is modified by $p_i$ with the occurrence of each new event $e$:

$VC[i] := VC[i] + 1$ \quad if $e$ is an internal or send event,
$VC := \max(VC, TS(m))$; $VC[i] := VC[i] + 1$ \quad if $e = receive(m)$.
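The update rules translate directly into code; the sketch below (class and function names are ours) uses 0-based process indices and also includes the component-wise comparison that can be used to test causal precedence from vector clock values.

class VectorClock:
    # Vector clock of process i in a system of n processes, following the
    # update rules above: the own component is incremented on every event,
    # and a receive first merges the message timestamp component-wise.
    def __init__(self, i, n):
        self.i = i
        self.vc = [0] * n

    def internal_or_send(self):
        self.vc[self.i] += 1
        return list(self.vc)       # a send attaches this value as TS(m)

    def receive(self, ts):
        self.vc = [max(a, b) for a, b in zip(self.vc, ts)]
        self.vc[self.i] += 1
        return list(self.vc)

def vc_less(u, v):
    # VC(e) < VC(e'): component-wise <= with at least one strict inequality.
    return all(a <= b for a, b in zip(u, v)) and u != v

p1, p2 = VectorClock(0, 2), VectorClock(1, 2)
ts = p1.internal_or_send()         # event e at p1 with VC(e) = [1, 0]
vc = p2.receive(ts)                # its receipt at p2 has VC = [1, 1]
print(vc_less(ts, vc))             # True: the send causally precedes the receive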
Note that for the above test, it is not necessary to know on which pro-
cesses the two events were executed. If this information is available, causal
precedence between two events can be verified through a single scalar
comparison.
The two disjuncts characterize exactly the two possibilities for the cut to include at least one receive event without including its corresponding send event (thus making it inconsistent). While this property might appear to be equivalent to causal precedence at first sight, this is not the case; it is obviously possible for two events to be causally related and yet be pairwise consistent. We can then characterize a cut as being consistent if its frontier contains no pairwise inconsistent events. Given the definition of a cut, it suffices to check pairwise inconsistency only for those events that are in the frontier of the cut. In terms of vector clocks, the cut with frontier $(e_1^{c_1}, \ldots, e_n^{c_n})$ is consistent if and only if

$$\forall i, j:\ VC(e_i^{c_i})[i] \geq VC(e_j^{c_j})[i].$$
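Given the vector clocks of the candidate frontier events, the test above is immediate to implement; the sketch below assumes the frontier is supplied as a list whose i-th entry is the vector clock of the frontier event of process i (0-indexed).

def frontier_is_consistent(frontier_vcs):
    # frontier_vcs[i] is the vector clock of the frontier event of process i.
    # The cut is consistent iff, for all i and j, process i has executed at
    # least as many of its own events as process j has observed of them.
    n = len(frontier_vcs)
    return all(frontier_vcs[i][i] >= frontier_vcs[j][i]
               for i in range(n) for j in range(n))

# Two processes: the frontier event of p2 has already seen two events of p1,
# but the frontier event of p1 is only its first event.
print(frontier_is_consistent([[1, 0], [2, 1]]))   # False: inconsistent cut
print(frontier_is_consistent([[2, 0], [2, 1]]))   # True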
Property 6 (Counting) Given event $e$ of process $p_i$ and its vector clock value $VC(e)$, the number of events $e'$ such that $e' \rightarrow e$ (equivalently, $VC(e') < VC(e)$) is given by $\#(e) = \bigl(\sum_{j=1}^{n} VC(e)[j]\bigr) - 1$.

The property is "weak" in the sense that, for arbitrary processes $p_i$ and $p_j$, we cannot conclude whether the three events form a causal chain $e_i \rightarrow e_k \rightarrow e_j$.
For the special case $k = i$, however, the property indeed identifies the sufficient condition to make such a conclusion.
12. Equivalently, processes send a notification message to the monitor for all of their events.
$TS(m)[j] = D[j] + 1$ and $TS(m)[k] \leq D[k]$ for all $k \neq j$
to any one of the global states in the run since each is guaranteed to be consistent. An application of this solution to deadlock detection is given in Section 4.14.1.
Causal delivery can be implemented at any process rather than just at the monitor. If processes communicate exclusively through broadcasts (rather than point-to-point sends), then delivery rule DR3 remains the appropriate mechanism for achieving causal delivery at all destinations [3]. The resulting primitive, known as causal broadcast (cf. Section 4.15, Chapter XXXbcast), has been recognized as an important abstraction for building distributed applications [3,13,25]. If, on the other hand, communication can take place through point-to-point sends, a delivery rule can be derived based on an extension of vector clocks where each message carries a timestamp composed of $n$ vector clocks (i.e., an $n \times n$ matrix) [29,27].
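At a single destination such as the monitor, a commonly used causal delivery condition can be stated directly in terms of the vector timestamps: buffer a message m from p_j until TS(m)[j] = D[j] + 1 and TS(m)[k] <= D[k] for every k different from j, where D[k] counts the messages already delivered from p_k. The sketch below illustrates this rule under the assumption of footnote 12, namely that every event generates a notification message timestamped with the sender's vector clock; it is an illustration of the idea, not a transcription of the chapter's own rule.

class CausalDeliveryBuffer:
    # Causal delivery of timestamped notification messages at one destination.
    # delivered[k] is the number of messages already delivered from process k.
    def __init__(self, n):
        self.n = n
        self.delivered = [0] * n
        self.pending = []          # buffered (sender, timestamp, payload) triples

    def _deliverable(self, sender, ts):
        return (ts[sender] == self.delivered[sender] + 1 and
                all(ts[k] <= self.delivered[k]
                    for k in range(self.n) if k != sender))

    def receive(self, sender, ts, payload):
        # Buffer the arrival, then deliver everything that has become deliverable.
        self.pending.append((sender, ts, payload))
        out, progress = [], True
        while progress:
            progress = False
            for msg in list(self.pending):
                s, t, p = msg
                if self._deliverable(s, t):
                    self.pending.remove(msg)
                    self.delivered[s] = t[s]
                    out.append(p)
                    progress = True
        return out

buf = CausalDeliveryBuffer(2)
# p1's first event is the receipt of a message sent by p0's first event;
# its notification overtakes p0's notification in the network.
print(buf.receive(1, [1, 1], "event of p1"))   # []: must wait for p0's notification
print(buf.receive(0, [1, 0], "event of p0"))   # ['event of p0', 'event of p1']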
the monitor may make an incorrect deduction about the system property encoded in the global predicate.

We will now develop a snapshot protocol that constructs only consistent global states. The protocol is due to Chandy and Lamport [4], and the development described here is due to Morgan [23]. For simplicity, we will assume that the channels implement FIFO delivery, and we omit details of how individual processes return their local states to $p_0$.
For this protocol, we will introduce the notion of a channel state. For each channel from $p_i$ to $p_j$, its state $\chi_{i,j}$ consists of those messages that $p_i$ has sent to $p_j$ but that $p_j$ has not yet received. Channel states are only a convenience in that each can be inferred from just the local states $\sigma_i$ and $\sigma_j$ as the set difference between the messages sent by $p_i$ to $p_j$ (encoded in $\sigma_i$) and the messages received by $p_j$ from $p_i$ (encoded in $\sigma_j$). In many cases, however, explicit maintenance of channel state information can lead to more compact process local states and simpler encoding for the global predicate of interest. For example, when constructing the waits-for graph in deadlock detection, an edge is drawn from $p_i$ to $p_j$ if $p_i$ is blocked due to $p_j$. This relation can be easily obtained from the local process states and channel states: process $p_i$ is blocked on $p_j$ if $\sigma_i$ records the fact that there is an outstanding request to $p_j$, and $\chi_{j,i}$ contains no response messages.
Let $IN_i$ be the set of processes that have channels connecting them directly to $p_i$ and $OUT_i$ be the set of processes to which $p_i$ has a channel. Channels from processes in $IN_i$ to $p_i$ are called incoming while channels from $p_i$ to processes in $OUT_i$ are called outgoing with respect to $p_i$. For each execution of the snapshot protocol, a process $p_i$ will record its local state $\sigma_i$ and the states of its incoming channels ($\chi_{j,i}$, for all $p_j \in IN_i$).
13. Recall that there need not be a channel between all pairs of processes, and so must
Snapshot Protocol 1

1. Process $p_0$ sends the message "take snapshot at $t$" to all processes.

2. When its clock RC reads $t$, each process $p_i$ records its local state $\sigma_i$, sends an empty message over all of its outgoing channels, and starts recording messages received over each of its incoming channels. Recording the local state and sending empty messages are performed before any intervening events are executed on behalf of the underlying computation.

3. The first time $p_i$ receives a message from $p_j$ with timestamp greater than or equal to $t$, it stops recording messages for that channel and declares $\chi_{j,i}$ as those messages that have been recorded.

For each $p_j \in IN_i$, the channel state $\chi_{j,i}$ constructed by process $p_i$ contains the set of messages sent by $p_j$ before $t$ and received by $p_i$ after $t$. The empty messages in Step 2 are sent in order to guarantee liveness: process $p_i$ is guaranteed to eventually receive a message $m$ from every incoming channel such that $TS(m) \geq t$.
Being based on real-time, it is easy to see that this protocol constructs a consistent global state—it constructs a global state that did in fact occur and thus could have been observed by our idealized external observer. However, it is worth arguing this point a little more formally. Note that an event $e$ is in the consistent cut $C$ associated with the constructed global state if and only if $RC(e) < t$. Hence, if $RC(e') < RC(e)$ and $e$ is in the cut, then so is $e'$. Since the real-time clock RC satisfies the clock condition, this implies that $C$ is a consistent cut. In fact, the clock condition is the only property of RC that is necessary for $C$ to be a consistent cut. Since logical clocks also satisfy the clock condition, we should be able to substitute logical clocks for real-time clocks in the above protocol.
Snapshot Protocol 2
1. Process $p_0$ sends "take snapshot at $\omega$" to all processes and then sets its logical clock to $\omega$.
16. Note that this problem happens to be one aspect of the more general problem of simu-
lating a synchronous system in an asynchronous one [1].
2. When its logical clock reads $\omega$, process $p_i$ records its local state $\sigma_i$, sends an empty message along each outgoing channel, and starts recording messages received over each of its incoming channels. Recording the local state and sending empty messages are performed before any intervening events are executed on behalf of the underlying computation.

3. The first time $p_i$ receives a message from $p_j$ with timestamp greater than or equal to $\omega$, it stops recording messages for that channel and declares $\chi_{j,i}$ as those messages that have been recorded.
Channel states are constructed just as in Protocol 1 with $\omega$ playing the role of $t$. As soon as $p_0$ sets its logical clock to $\omega$, it will immediately execute Step 2, and the empty messages sent by it will force the clocks of the processes in $OUT_0$ to attain $\omega$. Since the network is strongly connected, all of the clocks will eventually attain $\omega$, and so the protocol is live.

We now remove the need for $\omega$. Note that, with respect to the above protocol, a process does nothing between receiving the "take snapshot at $\omega$" message and receiving the first empty message that causes its clock to pass through $\omega$. Thus, we can eliminate the message "take snapshot at $\omega$" and instead have a process record its state when it receives the first empty message.
and instead have a process record its state when it receives the first empty
message. Since processes may send empty messages for other purposes,
we will change the message from being empty to one containing a unique
value, for example, “take snapshot”. Furthermore, by making this mes-
sage contain a unique value, we no longer need to include timestamps in
messages—the message “take snapshot” is the first message that any pro-
cess sends after the snapshot time. Doing so removes the last reference to
logical clocks, and so we can eliminate them from our protocol completely.
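What remains is essentially the snapshot protocol of Chandy and Lamport: the first "take snapshot" message a process receives makes it record its local state and propagate the message, and every incoming channel is recorded until that channel's own "take snapshot" message arrives. The sketch below gives one participant's rules; the representation of the local state, the send callback, and the initiation and collection by p_0 are left abstract and are our assumptions.

class SnapshotParticipant:
    # One process in the marker-based snapshot protocol derived above.
    # FIFO channels are assumed; "take_snapshot" plays the role of the empty message.
    def __init__(self, pid, incoming, outgoing, send):
        self.pid = pid
        self.incoming = set(incoming)    # processes with a channel into this one
        self.outgoing = set(outgoing)    # processes this one has a channel to
        self.send = send                 # send(destination, message), supplied by the system
        self.local_state = None          # recorded sigma_i, None until the snapshot reaches us
        self.channel_state = {}          # sender -> messages recorded for that incoming channel
        self.closed = set()              # incoming channels whose recording has stopped

    def on_message(self, sender, msg):
        if msg == "take_snapshot":
            if self.local_state is None:                 # first marker: record and propagate
                self.local_state = self.record_local_state()
                for q in self.outgoing:
                    self.send(q, "take_snapshot")
                self.channel_state = {q: [] for q in self.incoming}
            self.closed.add(sender)                      # channel from 'sender' is now complete
        else:
            if self.local_state is not None and sender not in self.closed:
                self.channel_state[sender].append(msg)   # in transit when the snapshot started
            self.process(sender, msg)                    # normal handling by the computation

    def record_local_state(self):
        return "sigma_" + str(self.pid)                  # placeholder for the real local state

    def process(self, sender, msg):
        pass                                             # placeholder for the underlying computation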
$$e_2^1\ e_1^1\ e_1^2\ e_1^3\ e_2^2\ e_1^4\ e_2^3\ e_2^4\ e_1^5\ e_2^5\ e_1^6$$

resulting in the sequence of global states

$$\Sigma^{00}\ \Sigma^{01}\ \Sigma^{11}\ \Sigma^{21}\ \Sigma^{31}\ \Sigma^{32}\ \Sigma^{42}\ \Sigma^{43}\ \Sigma^{44}\ \Sigma^{54}\ \Sigma^{55}\ \Sigma^{65}$$

Let the global state of this run in which the protocol is initiated be $\Sigma^{21}$ and the global state in which it terminates be $\Sigma^{55}$. Note that run $R$ does not pass through the constructed global state $\Sigma^{23}$. As can be verified by the lattice of Figure 4.3, however, $\Sigma^{21} \leadsto \Sigma^{23} \leadsto \Sigma^{55}$ in this example. We now
[Figure: the two-process computation with the snapshot protocol superimposed; its events are divided into prerecording and postrecording events.]
17. Two adjacent event pairs of run $R$ are examples of this.
process p(i): 1 ≤ i ≤ n
var pending: queue of [message, integer] init empty;  % pending requests to p(i)
    working: boolean init false;                      % processing a request
    m: message; j: integer;
while true do
    while working or (size(pending) = 0) do
        receive m from p(j);                          % m set to message, j to its source
        case m.type of
            request:
                pending := pending + [m, j];
            response:
                [m, j] := NextState(m, j);
                working := (m.type = request);
                send m to p(j);
        esac
    od;
    while not working and (size(pending) > 0) do
        [m, j] := first(pending);
        pending := tail(pending);
        [m, j] := NextState(m, j);
        working := (m.type = request);
        send m to p(j)
    od
od
end p(i);
process p(i): 1 ≤ i ≤ n
var pending: queue of [message, integer] init empty;  % pending requests to p(i)
    working: boolean init false;                      % processing a request
    blocking: array [1..n] of boolean init false;     % blocking[j] = "p(j) is blocked on p(i)"
    m: message; j: integer; s: integer init 0;
while true do
    while working or (size(pending) = 0) do
        receive m from p(j);                          % m set to message, j to its source
        case m.type of
            request:
                blocking[j] := true;
                pending := pending + [m, j];
            response:
                [m, j] := NextState(m, j);
                working := (m.type = request);
                send m to p(j);
                if (m.type = response) then blocking[j] := false;
            snapshot:
                if s = 0 then
                    % this is the first snapshot message
                    send [type: snapshot, data: blocking] to p(0);
                    send [type: snapshot] to p(1), ..., p(i-1), p(i+1), ..., p(n);
                s := (s + 1) mod n;
        esac
    od;
    while not working and (size(pending) > 0) do
        [m, j] := head(pending);
        pending := tail(pending);
        [m, j] := NextState(m, j);
        working := (m.type = request);
        send m to p(j);
        if (m.type = response) then blocking[j] := false;
    od
od
end p(i);
process p(0):
var wfg: array [1..n] of array [1..n] of boolean;     % wfg[i, j] = "p(j) waits-for p(i)"
    j, k: integer; m: message;
while true do
    wait until deadlock is suspected;
    send [type: snapshot] to p(1), ..., p(n);
    for k := 1 to n do
        receive m from p(j);
        wfg[j] := m.data
    od;
    if (cycle in wfg) then system is deadlocked
od
end p(0);
18. By the assumption that the network is completely connected, each invocation of the snapshot protocol will result in exactly $n$ snapshot messages to be received by each of the processes.
The conventional definition of "$p_i$ waits-for $p_j$" is that $p_i$ has sent a request to $p_j$ and $p_j$ has not yet responded. As in Section 4.5, we will instead use the weaker definition "$p_i$ waits-for$'$ $p_j$" which holds when $p_j$ has received a request from $p_i$ to which it has not yet responded [10]. By structuring the server as a state machine, even requests sent to a deadlocked server will eventually be received and recorded in blocking. Hence, a system that contains a cycle in the conventional WFG will eventually contain a cycle in the WFG$'$, and so a deadlock will be detected eventually. Furthermore, using the WFG$'$ instead of the WFG has the advantage of referencing only the local process states, and so the embedded snapshot protocol need not record channel states.
Figure 4.13 shows the code run by the monitor $p_0$ acting as the deadlock detector. This process periodically starts a snapshot by sending a snapshot message to all other processes. Then, $p_0$ receives the arrays blocking from each of the processes and uses this data to test for a cycle in the WFG$'$. This approach has the advantage of generating an additional message load only when deadlock is suspected. However, the approach also introduces a latency between the occurrence of the deadlock and its detection that depends on how often the monitor starts a snapshot.
Figures 4.14 and 4.15 show the server and monitor code, respectively, of a reactive-architecture deadlock detector. This solution is much simpler than the snapshot-based version. In this case, $p_i$ sends a message to $p_0$ whenever it receives a request or sends a response to a client. Monitor $p_0$ uses these notifications to update a WFG$'$ in which it tests for a cycle. The simplicity of this solution is somewhat superficial, however, because the protocol requires all messages to $p_0$ to be sent using causal delivery order instead of FIFO order. The only latency between a deadlock's occurrence and its detection is due to the delay associated with a notification message, and thus is typically shorter than that of the snapshot-based detector.
process p(i): 1 ≤ i ≤ n
var pending: queue of [message, integer] init empty;  % pending requests to p(i)
    working: boolean init false;                      % processing a request
    m: message; j: integer;
while true do
    while working or (size(pending) = 0) do
        receive m from p(j);                          % m set to message, j to its source
        case m.type of
            request:
                send [type: requested, of: i, by: j] to p(0);
                pending := pending + [m, j];
            response:
                [m, j] := NextState(m, j);
                working := (m.type = request);
                send m to p(j);
                if (m.type = response) then
                    send [type: responded, to: j, by: i] to p(0);
        esac
    od;
    while not working and (size(pending) > 0) do
        [m, j] := first(pending);
        pending := tail(pending);
        [m, j] := NextState(m, j);
        working := (m.type = request);
        send m to p(j);
        if (m.type = response) then
            send [type: responded, to: j, by: i] to p(0)
    od
od
end p(i);
process p(0):
var wfg: array [1..n, 1..n] of boolean init false;    % wfg[i, j] = "p(j) waits-for p(i)"
    m: message; j: integer;
while true do
    receive m from p(j);                              % m set to message, j to its source
    if (m.type = responded) then
        wfg[m.by, m.to] := false
    else
        wfg[m.of, m.by] := true;
    if (cycle in wfg) then system is deadlocked
od
end p(0);
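The test "cycle in wfg" used by both monitors can be implemented by a depth-first search over the boolean adjacency matrix; a minimal sketch (names are ours) follows.

def has_cycle(wfg):
    # wfg is an n-by-n boolean matrix read as the adjacency matrix of the
    # waits-for graph; a deadlock is reported iff the graph contains a cycle.
    n = len(wfg)
    WHITE, GREY, BLACK = 0, 1, 2
    color = [WHITE] * n

    def visit(u):
        color[u] = GREY
        for v in range(n):
            if wfg[u][v]:
                if color[v] == GREY:          # back edge closes a cycle
                    return True
                if color[v] == WHITE and visit(v):
                    return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in range(n))

# Three processes, each waiting for the next one in a ring: a deadlock.
wfg = [[False, True, False],
       [False, False, True],
       [True, False, False]]
print(has_cycle(wfg))   # True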
$$\Sigma^{00}\ \Sigma^{01}\ \Sigma^{11}\ \Sigma^{12}\ \Sigma^{22}\ \Sigma^{32}\ \Sigma^{42}\ \Sigma^{43}\ \Sigma^{44}\ \Sigma^{45}\ \Sigma^{55}\ \Sigma^{65}$$
From the result of Section 4.13.2, we know that the snapshot protocol could construct either global state $\Sigma^{31}$ or $\Sigma^{41}$ since both are reachable from $\Sigma^{11}$. Thus, the monitor could "detect" the predicate even though the actual run never goes through a state satisfying the condition.

It appears that there is very little value in using a snapshot protocol to detect a nonstable predicate—the predicate may have held even if it is not detected, and even if it is detected it may have never held. The same problems exist when nonstable predicates are evaluated over global states
[Figure 4.16. A distributed computation and the lattice of its global states, used to illustrate Possibly($\Phi$) and Definitely($\Phi$).]
procedure Possibly(Φ);
var current: set of global states;
    lvl: integer;
begin
    % Synchronize processes and distribute Φ
    send Φ to all processes;
    current := {global state Σ^{0...0}};
    release processes;
    lvl := 0;
    % Invariant: current contains all states of level lvl that are reachable from Σ^{0...0}
    while (no state in current satisfies Φ) do
        if current = {final global state} then return false;
        lvl := lvl + 1;
        current := states of level lvl
    od;
    return true
end
Using the vector clocks of the events in its frontier, a global state $(\sigma_1^{c_1}, \sigma_2^{c_2}, \ldots, \sigma_n^{c_n})$ can be checked for consistency by verifying, for every pair of frontier events $e_i^{c_i}$ and $e_j^{c_j}$, that

$$VC(e_i^{c_i})[i] \geq VC(e_j^{c_j})[i] \quad \text{and} \quad VC(e_j^{c_j})[j] \geq VC(e_i^{c_i})[j].$$
procedure Definitely(Φ);
var current, last: set of global states;
    lvl: integer;
begin
    % Synchronize processes and distribute Φ
    send Φ to all processes;
    last := {global state Σ^{0...0}};
    release processes;
    remove all states in last that satisfy Φ;
    lvl := 1;
    % Invariant: last contains all states of level lvl - 1 that are reachable
    % from Σ^{0...0} without passing through a state satisfying Φ
    while (last ≠ ∅) do
        current := states of level lvl reachable from a state in last;
        remove all states in current that satisfy Φ;
        if current = {final global state} then return false;
        lvl := lvl + 1;
        last := current
    od;
    return true
end;
global states are consistent. One can be careful and avoid redundantly constructing the same global state of level lvl + 1 from different global states of level lvl, but the computation can still be costly.
Figure 4.18 gives the high-level algorithm used by the monitoring process $p_0$ to detect Definitely($\Phi$). This algorithm iteratively constructs the set of global states that have a given level and are reachable from the initial global state without passing through a global state that satisfies $\Phi$. If this set of states is empty, then Definitely($\Phi$) holds, and if this set contains only the final global state then $\neg$Definitely($\Phi$) holds. Note that, unlike detecting Possibly($\Phi$), not all global states need be examined. For example, in Figure 4.16, suppose that when the global states of level 2 were constructed, it was determined that $\Sigma^{02}$ satisfied $\Phi$. When constructing the states of level 3, global state $\Sigma^{03}$ need not be included since it is reachable only through $\Sigma^{02}$.
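As an illustration of this level-by-level construction, the sketch below enumerates the global states of each level for a small computation, representing a global state simply by its tuple of per-process event counts; the consistency test and the predicate are supplied by the caller (in practice, the vector clock test described earlier). It is a naive illustration of Possibly(Φ), not the monitor's algorithm given above.

from itertools import product

def possibly(phi, event_counts, is_consistent):
    # Level-by-level search of the global state lattice. A global state is a
    # tuple (k1, ..., kn) of per-process event counts; event_counts[i] is the
    # total number of events of process i.
    final = tuple(event_counts)
    for level in range(sum(event_counts) + 1):
        current = [state
                   for state in product(*(range(k + 1) for k in event_counts))
                   if sum(state) == level and is_consistent(state)]
        if any(phi(state) for state in current):
            return True
        if current == [final]:
            return False
    return False

# Toy example: two processes with two events each, every cut assumed consistent,
# and a (hypothetical) predicate that holds only in global state (2, 1).
print(possibly(lambda s: s == (2, 1), [2, 2],
               lambda s: True))               # True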
The two detection algorithms are linear in the number of global states, but unfortunately the number of global states is $O(k^n)$, where $k$ is the maximum number of events a monitored process has executed. There are techniques that can be used to limit the number of constructed global states. For example,
19. In the limit, each process could act as a monitor such that the predicate could be
evaluated locally.
20. Note that in Chapter XXXbcast, this primitive is called causal broadcast without the reliable qualifier since all broadcast primitives are specified in the presence of failures.
4.16 Conclusions
We have used the GPE problem as a motivation for studying consistent
global states of distributed systems. Since many distributed systems prob-
lems require recognizing certain global conditions, constructing consistent
global states and evaluating predicates over them constitute fundamental
primitives. We derived two classes of solutions to the GPE problem: one
based on distributed snapshots and one based on passive observations. In
doing so, we have developed a set of concepts and mechanisms for rep-
resenting and reasoning about computations in asynchronous distributed
systems. These concepts generalize the notion of time in order to cope with
the uncertainty that is inherent to such systems.
We illustrated the practicality of our mechanisms by applying them
to distributed deadlock detection and distributed debugging. Reactive-
architecture solutions based on passive observations were shown to be
more flexible. In particular, these solutions can be easily adapted to deal
with nonstable predicates, multiple observations and failures. Each ex-
tension can be easily accommodated by using an appropriate communi-
cation primitive for notifications, leaving the overall reactive architecture
unchanged.