
Chapter 4

Consistent Global States of Distributed Systems:
Fundamental Concepts and Mechanisms

Özalp Babaoğlu and Keith Marzullo

Many important problems in distributed computing admit solutions that contain a phase where some global property needs to be detected. This
subproblem can be seen as an instance of the Global Predicate Evaluation
(GPE) problem where the objective is to establish the truth of a Boolean
expression whose variables may refer to the global system state. Given the
uncertainties in asynchronous distributed systems that arise from commu-
nication delays and relative speeds of computations, the formulation and
solution of GPE reveal most of the subtleties in global reasoning with im-
perfect information. In this chapter, we use GPE as a canonical problem in
order to survey concepts and mechanisms that are useful in understanding
global states of distributed computations. We illustrate the utility of the
developed techniques by examining distributed deadlock detection and
distributed debugging as two instances of GPE.

4.1 Introduction
A large class of problems in distributed computing can be cast as execut-
ing some notification or reaction when the state of the system satisfies a
particular condition. Examples of such problems include monitoring and
debugging, detection of particular states such as deadlock and termination,

and dynamic adaptation of a program's configuration such as for load balancing. Thus, the ability to construct a global state and evaluate a predicate over such a state constitutes the core of solutions to many problems in distributed computing.
The global state of a distributed system is the union of the states of the
individual processes. Given that the processes of a distributed system do
not share memory but instead communicate solely through the exchange
of messages, a process that wishes to construct a global state must infer
the remote components of that state through message exchanges. Thus, a
fundamental problem in distributed computing is to ensure that a global
state constructed in this manner is meaningful.
In asynchronous distributed systems, a global state obtained through
remote observations could be obsolete, incomplete, or inconsistent. Infor-
mally, a global state is inconsistent if it could never have been constructed
by an idealized observer that is external to the system. It should be clear
that uncertainties in message delays and in relative speeds at which local
computations proceed prevent a process from drawing conclusions about
the instantaneous global state of the system to which it belongs. While sim-
ply increasing the frequency of communication may be effective in making
local views of a global state more current and more complete, it is not
sufficient for guaranteeing that the global state is consistent. Ensuring
the consistency of a constructed global state requires us to reason about
both the order in which messages are observed by a process as well as
the information contained in the messages. For a large class of problems,
consistency turns out to be an appropriate formalization of the notion that
global reasoning with local information is “meaningful”.
Another source of difficulty in distributed systems arises when separate
processes independently construct global states. The variability in message
delays could lead to these separate processes constructing different global
states for the same computation. Even though each such global state may
be consistent and the processes may be evaluating the same predicate,
the different processes may execute conflicting reactions. This “relativistic
effect” is inherent to all distributed computations and limits the class of
system properties that can be effectively detected.
In this chapter, we formalize and expand the above concepts in the con-
text of an abstract problem called Global Predicate Evaluation (GPE). The
goal of GPE is to determine whether the global state of the system satisfies some predicate Φ. Global predicates are constructed so as to encode system properties of interest in terms of state variables. Examples of distributed system problems where the relevant properties can be encoded as global predicates include deadlock detection, termination detection, token loss detection, unreachable storage (garbage) collection, checkpointing and restarting, debugging, and in general, monitoring and reconfiguration. In this sense, a solution to GPE can be seen as the core of a generic solution for all these problems; what remains to be done is the formulation of the appropriate predicate Φ and the construction of reactions or notifications to be executed when the predicate is satisfied.
We begin by defining a formal model for asynchronous distributed sys-
tems and distributed computations. We then examine two different strate-
gies for solving GPE. The first strategy, introduced in Section 4.5, and
refined in Section 4.13, is based on a monitor process that actively inter-
rogates the rest of the system in order to construct the global state. In
Section 4.6 we give a formal definition for consistency of global states. The
alternative strategy, discussed in Section 4.7, has the monitor passively ob-
serve the system in order to construct its global states. Sections 4.8 – 4.13
introduce a series of concepts and mechanisms necessary for making the
two strategies work efficiently. In Section 4.14 we identify properties that
global predicates must satisfy in order to solve practical problems using
GPE. In Section 4.15 we address the issue of multiple monitors observing
the same computation. We illustrate the utility of the underlying concepts
and mechanisms by applying them to deadlock detection and to debugging
in distributed systems.

4.2 Asynchronous Distributed Systems


A distributed system is a collection of sequential processes p_1, p_2, …, p_n and a
network capable of implementing unidirectional communication channels
between pairs of processes for message exchange. Channels are reliable
but may deliver messages out of order. We assume that every process
can communicate with every other process, perhaps through intermediary
processes. In other words, the communication network is assumed to be
strongly connected (but not necessarily completely connected).
In defining the properties of a distributed system, we would like to make
the weakest set of assumptions possible. Doing so will enable us to establish
upper bounds on the costs of solving problems in distributed systems. More
specifically, if there exists a solution to a problem in this weakest model with some cost c, then there is a solution to the same problem with a cost no greater than c in any distributed system.

The weakest possible model for a distributed system is called an asynchronous system and is characterized by the following properties: there exist no bounds on the relative speeds of processes and there exist no bounds on message delays. Asynchronous systems rule out the possibility of processes maintaining synchronized local clocks [16,7] or reasoning based on global real-time. Communication remains the only possible mechanism for
synchronization in such systems.
In addition to their theoretical interest as noted above, asynchronous dis-
tributed systems may also be realistic models for actual systems. It is often
the case that physical components from which we construct distributed
systems are synchronous. In other words, the relative speeds of processors
and message delays over network links making up a distributed system
can be bounded. When, however, layers of software are introduced to
multiplex these physical resources to create abstractions such as processes
and (reliable) communication channels, the resulting system may be better
characterized as asynchronous.

4.3 Distributed Computations


Informally, a distributed computation describes the execution of a dis-
tributed program by a collection of processes. The activity of each sequen-
tial process is modeled as executing a sequence of events. An event may
be either internal to a process and cause only a local state change, or it
may involve communication with another process. Without loss of gener-
ality, we assume that communication is accomplished through the events send(m) and receive(m) that match based on the message identifier m. In other words, even if several processes send the same data value to the same process, the messages themselves will be unique.¹ Informally, the event send(m) enqueues message m on an outgoing channel for transmission to the destination process. The event receive(m), on the other hand, corresponds to the act of dequeuing message m from an incoming channel at the destination process. Clearly, for event receive(m) to occur at a process, message m must have arrived at that process and the process must have declared its willingness to receive a message. Otherwise, either the message is delayed (because

1. For finite computations, this can be easily accomplished by adding the process index and a sequence number to the data value to construct the message identifier.
the process is not ready) or the process is delayed (because the message has not arrived).

Note that this "message passing" view of communication at the event level may be quite different from those of higher system layers. Remote communication at the programming language level may be accomplished through any number of paradigms including remote procedure calls [?], broadcasts [?], distributed transactions [?], distributed objects [17] or distributed shared memory [18]. At the level we observe distributed computations, however, all such high-level communication boils down to generating matching send and receive events at pairs of processes.
The local history of process p_i during the computation is a (possibly infinite) sequence of events h_i = e_i^1 e_i^2 …, where e_i^1 is the first event executed, e_i^2 is the second event executed, and so on. This labeling of the events of process p_i is called the canonical enumeration and corresponds to the total order imposed by the sequential execution on the local events. Let h_i^k = e_i^1 e_i^2 … e_i^k denote an initial prefix of local history h_i containing the first k events. We define h_i^0 to be the empty sequence. The global history of the computation is a set H = h_1 ∪ ⋯ ∪ h_n containing all of its events.²


Note that a global history does not specify any relative timing between
events. In an asynchronous distributed system where no global time frame
exists, events of a computation can be ordered only based on the notion of
“cause-and-effect”. In other words, two events are constrained to occur in
a certain order only if the occurrence of the first may affect the outcome of
the second. This in turn implies that information flows from the first event
to the second. In an asynchronous system, information may flow from one
event to another either because the two events are of the same process,
and thus may access the same local state, or because the two events are
of different processes and they correspond to the exchange of a message.
We can formalize these ideas by defining a binary relation defined over
events such that [15]:
1. If and , then ,
2. If and , then ,
3. If and , then .
As defined, this relation effectively captures our intuitive notion of "cause-and-effect" in that e → e′ if and only if e causally precedes e′.³ Note that only in the case of matching send-receive events is the cause-and-effect relationship certain. In general, the only conclusion that can be drawn from e → e′ is that the mere occurrence of e′ and its outcome may have been influenced by event e.

2. Sometimes we are interested in local histories as sets rather than sequences of events. Since all events of a computation have unique labels in the canonical enumeration, h_i as a set contains exactly the same events as h_i as a sequence. We use the same symbol to denote both when the appropriate interpretation is clear from context.

[Figure 4.1. Space-Time Diagram Representation of a Distributed Computation]
Certain events of the global history may be causally unrelated. In other words, it is possible that for some e and e′, neither e → e′ nor e′ → e. We call such events concurrent and write e ∥ e′.
Formally, a distributed computation is a partially ordered set (poset) defined by the pair (H, →). Note that all events are labeled with their canonical enumeration, and in the case of communication events, they also contain the unique message identifier. Thus, the total ordering of events for each process as well as the send-receive matchings are implicit in H.
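The relation → defined by the three rules above can be computed mechanically from the pair (H, →) data. The following Python sketch is our own illustration (the event names and dictionary layout are assumptions, not the chapter's notation): it derives causal precedence from the local histories and the send-receive matches by closing program order and message edges under transitivity.

```python
def causally_precedes(histories, matches):
    """Compute the causal-precedence relation '->' over all events.

    histories: dict mapping a process id to its local history, a list of
               event names in canonical (program) order.
    matches:   set of (send_event, receive_event) pairs, one per message.
    Returns the set of ordered pairs (e, e') such that e -> e'.
    """
    edges = set(matches)
    # Rule 1: program order within each local history.
    for h in histories.values():
        edges |= {(h[k], h[l])
                  for k in range(len(h)) for l in range(k + 1, len(h))}
    # Rule 3: transitive closure via a simple fixpoint iteration.
    closure = set(edges)
    changed = True
    while changed:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        changed = not new <= closure
        closure |= new
    return closure
```

Two events e and e′ are then concurrent (e ∥ e′) exactly when neither (e, e′) nor (e′, e) appears in the result.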
It is common to depict distributed computations using an equivalent
graphical representation called a space-time diagram. Figure 4.1 illustrates
such a diagram where the horizontal lines represent execution of processes,

with time progressing from left to right. An arrow from one process to another represents a message being sent, with the send event at the base of the arrow and the corresponding receive event at the head of the arrow. Internal events have no arrows associated with them. Given this graphical representation, it is easy to verify if two events are causally related: if a path can be traced from one event to the other proceeding left-to-right along the horizontal lines and in the sense of the arrows, then they are related; otherwise they are concurrent. For example, in the figure, e_1^2 → e_3^6 but e_2^2 ∥ e_3^6.

3. While "e may causally affect e′", or, "e′ occurs in the causal context of e" [25] are equivalent interpretations of this relation, we prefer not to interpret it as "e happens before e′" [15] because of the real-time connotation.

4.4 Global States, Cuts and Runs


Let σ_i^k denote the local state of process p_i immediately after having executed event e_i^k and let σ_i^0 be its initial state before any events are executed. In general, the local state of a process may include information such as the values of local variables and the sequences of messages sent and received over the various channels incident to the process. The global state of a distributed computation is an n-tuple of local states Σ = (σ_1, …, σ_n), one for each process.⁴ A cut of a distributed computation is a subset C of its global history H and contains an initial prefix of each of the local histories. We can specify such a cut C = h_1^c1 ∪ ⋯ ∪ h_n^cn through the tuple of natural numbers (c_1, …, c_n) corresponding to the index of the last event included for each process. The set of last events (e_1^c1, …, e_n^cn) included in cut (c_1, …, c_n) is called the frontier of the cut. Clearly, each cut defined by (c_1, …, c_n) has a corresponding global state, which is (σ_1^c1, …, σ_n^cn).

As shown in Figure 4.2, a cut has a natural graphical interpretation as a partitioning of the space-time diagram along the time axis. The figure illustrates two cuts C′ and C corresponding to the tuples (5, 2, 4) and (3, 2, 6), respectively.
Even though a distributed computation is a partially ordered set of events, in an actual execution, all events, including those at different processes, occur in some total order.⁵ To be able to reason about executions in distributed systems, we introduce the notion of a run. A run R of a distributed computation is a total ordering that includes all of the events in the global history and that is consistent with each local history. In other words, for each process p_i, the events of h_i appear in R in the same order that they appear in h_i. Note that a run need not correspond to any possible execution and a single distributed computation may have many runs, each corresponding to a different execution.

4. We can define global states without referring to channel states since they can always be encoded as part of the process local states. We discuss explicit representation of channel states in Section 4.13.
5. If two events actually do occur at the same real-time, we can arbitrarily say that the event of the process with the smaller index occurs before the event of the larger-index process.

[Figure 4.2. Cuts of a Distributed Computation]

4.5 Monitoring Distributed Computations


Given the above notation and terminology, GPE can be stated as evaluating a predicate Φ that is a function of the global state of a distributed system. For the time being, we will assume that a single process called the monitor is responsible for evaluating Φ. Let p_0 be this process, which may be one of p_1, …, p_n or may be external to the computation (but not the system). In this special case, where there is a single monitor, solving GPE reduces to p_0 constructing a global state of the computation (to which Φ is applied). For simplicity of exposition, we assume that events executed on behalf of monitoring are external to the underlying computation and do not alter the canonical enumeration of its events.
In the first strategy we pursue for constructing global states, the monitor p_0 takes on an active role and sends each process a "state enquiry" message. Upon the receipt of such a message, p_i replies with its current local state σ_i. When all processes have replied, p_0 can construct the global state (σ_1, …, σ_n). Note that the positions in the process local histories at which the state enquiry messages are received effectively define a cut. The global state constructed by p_0 is the one corresponding to this cut.
Given that the monitor process is part of the distributed system and is
subject to the same uncertainties as any other process, the simple-minded
approach sketched above may lead to predicate values that are not mean-
ingful. To illustrate the problems that can arise, consider a distributed
system composed of servers providing remote services and clients that in-
voke them. In order to satisfy a request, a server may invoke other services
(and thus act as a client). Clients and servers interact through remote pro-
cedure calls—after issuing a request for service, the client remains blocked
until it receives the response from the server. The computation depicted in
Figure 4.1 could correspond to this interaction if we interpret messages la-
beled req as requests for service and those labeled resp as responses. Clearly,
such a system can deadlock. Thus, it is important to be able to detect when
the state of this system includes deadlocked processes.
One possibility for detecting deadlocks in the above system is as follows.
Server processes maintain local states containing the names of clients from
which they received requests but to which they have not yet responded.
The relevant aspects of the global state of this system can be summarized
through a waits-for graph (WFG) where the nodes correspond to processes and the edges model blocking. In this graph, an edge is drawn from node i to node j if p_j has received a request from p_i to which it has not yet responded. Note that WFG can be constructed solely on the basis of local states. It is well known that a cycle in WFG is a sufficient condition to characterize deadlock in this system [10]. The nodes of the cycle are exactly those processes involved in the deadlock. Thus, the predicate Φ ≡ "WFG contains a cycle" is one possibility for deadlock detection.⁶
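Once the local states have been collected, the predicate "WFG contains a cycle" can be evaluated by a standard depth-first search. The Python sketch below is our own illustration (representing each server's local state as the set of clients with unanswered requests is an assumption, not the chapter's formulation):

```python
def wfg_has_cycle(local_states):
    """Evaluate the predicate 'WFG contains a cycle'.

    local_states: dict mapping each process name to the set of clients
    from which it has received a request it has not yet answered.
    An edge i -> j is drawn when p_j reports an unanswered request
    from p_i; a cycle is detected by depth-first search.
    """
    edges = {proc: set() for proc in local_states}
    for server, pending in local_states.items():
        for client in pending:
            edges.setdefault(client, set()).add(server)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in edges}

    def dfs(node):
        color[node] = GRAY
        for nxt in edges.get(node, ()):
            if color.get(nxt, WHITE) == GRAY:
                return True          # back edge: a waits-for cycle exists
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in list(edges))
```

With the ghost-deadlock states of the next section (p_1 holding a request from p_2, p_2 from p_3, and p_3 from p_1), this check would report a cycle involving all three processes.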
Let us see what might happen if process p_0 monitors the computation of Figure 4.1 as outlined above. Suppose that the state enquiry messages of p_0 are received by the three application processes at the points corresponding to cut C of Figure 4.2. In other words, processes p_1, p_2 and p_3 report local states σ_1^3, σ_2^2 and σ_3^6, respectively. The WFG constructed by p_0 for this global state will have edges (1,3), (2,1) and (3,2), forming a cycle. Thus, p_0 will report a deadlock involving all three processes.

An omniscient external observer of the computation in Figure 4.1, on the other hand, would conclude that at no time is the system in a deadlock state. The condition detected by p_0 above is called a ghost deadlock in that it is fictitious. While every cut of a distributed computation corresponds to a global state, only certain cuts correspond to global states that could have taken place during a run. Cut C′ of Figure 4.2 represents such a global state. On the other hand, cut C constructed by p_0 corresponds to a global state that could never occur since process p_3 is in a state reflecting the receipt of a request from process p_1 that p_1 has no record of having sent. Predicates applied to cuts such as C can lead to incorrect conclusions about the system state.

6. Note that Φ defined as a cycle in WFG characterizes a stronger condition than deadlock in the sense that Φ implies deadlock but not vice versa. If, however, processes can receive and record requests while being blocked, then a deadlocked system will eventually satisfy Φ.
We return to solving the GPE problem through active monitoring of
distributed computations in Section 4.13 after understanding why the above
approach failed.

4.6 Consistency
Causal precedence happens to be the appropriate formalism for distinguishing the two classes of cuts exemplified by C and C′. A cut C is consistent if for all events e and e′:

(e ∈ C) ∧ (e′ → e) ⇒ e′ ∈ C

In other words, a consistent cut is left closed under the causal precedence relation. In its graphical representation, verifying the consistency of a cut becomes easy: if all arrows that intersect the cut have their bases to the left and heads to the right of it, then the cut is consistent; otherwise it is inconsistent. According to this definition, cut C′ of Figure 4.2 is consistent while cut C is inconsistent. A consistent global state is one corresponding to a consistent cut. These definitions correspond exactly to the intuition that consistent global states are those that could occur during a run in the sense that they could be constructed by an idealized observer external to the system. We can now explain the ghost deadlock detected by p_0 in the previous section as resulting from the evaluation of Φ in an inconsistent global state.
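Left closure under → is easy to test mechanically. In the sketch below (an illustration with a data layout of our own choosing), a cut is given by its tuple (c_1, …, c_n); since local prefixes are closed under program order by construction, only message edges need checking: the cut is consistent exactly when every message received inside it was also sent inside it.

```python
def is_consistent_cut(cut, matches, index):
    """Check that a cut is left closed under causal precedence.

    cut:     dict mapping a process to the number of its events included
             (i.e. the tuple (c_1, ..., c_n)).
    matches: set of (send_event, receive_event) pairs, one per message.
    index:   dict mapping each event to (process, 1-based position in
             its local history).
    """
    def inside(event):
        proc, pos = index[event]
        return pos <= cut[proc]

    # Consistent iff no message is received inside the cut but sent
    # outside it (the situation of cut C in Figure 4.2).
    return all(inside(send) for (send, recv) in matches if inside(recv))
```

For example, a cut that includes the receipt of a message but not its sending is rejected, mirroring how cut C includes p_3's receipt of a request that p_1 has no record of having sent.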
Consistent cuts (and consistent global states) are fundamental towards understanding asynchronous distributed computing. Just as a scalar time value denotes a particular instant during a sequential computation, the frontier of a consistent cut establishes an "instant" during a distributed computation. Similarly, notions such as "before" and "after" that are defined with respect to a given time in sequential systems have to be interpreted with respect to consistent cuts in distributed systems: an event e is before (after) a cut C if e is to the left (right) of the frontier of C.

Predicate values are meaningful only when evaluated in consistent global states since these characterize exactly the states that could have taken place during an execution. A run R is said to be consistent if for all events, e → e′ implies that e appears before e′ in R. In other words, the total order imposed by R on the events is an extension of the partial order defined by causal precedence. It is easy to see that a run R = e^1 e^2 … results in a sequence of global states Σ^0 Σ^1 Σ^2 … where Σ^0 denotes the initial global state (σ_1^0, …, σ_n^0). If the run R is consistent, then the global states in the sequence will all be consistent as well. We will use the term "run" to refer to both the sequence of events and the sequence of resulting global states. Each (consistent) global state Σ^k of the run is obtained from the previous state Σ^(k−1) by some process executing the single event e^k. For two such (consistent) global states of run R, we say that Σ^(k−1) leads to Σ^k in R. Let →_R denote the transitive closure of the leads-to relation in a given run R. We say that Σ′ is reachable from Σ in run R if and only if Σ →_R Σ′. We drop the run subscript if there exists some run in which Σ′ is reachable from Σ.
The set of all consistent global states of a computation along with the leads-to relation defines a lattice. The lattice consists of n orthogonal axes, with one axis for each process. Let Σ^(k_1…k_n) be a shorthand for the global state (σ_1^k1, …, σ_n^kn) and let k_1 + ⋯ + k_n be its level. Figure 4.3 illustrates a distributed computation of two processes and the corresponding global state lattice. Note that every global state is reachable from the initial global state Σ^00. A path in the lattice is a sequence of global states of increasing level (in the figure, downwards) where the level between any two successive elements differs by one. Each such path corresponds to a consistent run of the computation. The run is said to "pass through" the global states included in the path. For the example illustrated in Figure 4.3, one possible run may pass through the sequence of global states

Σ^00 Σ^01 Σ^11 Σ^21 Σ^31 Σ^32 Σ^42 Σ^43 Σ^44 Σ^54 Σ^64 Σ^65
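For small computations, the lattice can be enumerated by brute force. The following sketch is our own illustration (cuts represented as tuples (c_1, …, c_n) and messages as send-receive pairs are assumed layouts, not the chapter's notation); it groups every consistent global state by its level, using the fact that a cut is consistent exactly when every message received within it was also sent within it.

```python
from itertools import product

def global_state_lattice(lengths, matches, index):
    """Enumerate the lattice of consistent global states by level.

    lengths: dict mapping each process to the number of events in its
             local history.
    matches: set of (send_event, receive_event) pairs, one per message.
    index:   dict mapping each event to (process, 1-based position).
    Returns a dict: level -> sorted list of consistent tuples (c_1..c_n).
    """
    procs = sorted(lengths)

    def consistent(cut):
        bound = dict(zip(procs, cut))
        def inside(e):
            p, pos = index[e]
            return pos <= bound[p]
        return all(inside(s) for (s, r) in matches if inside(r))

    lattice = {}
    # Try every tuple (c_1, ..., c_n) with 0 <= c_i <= |h_i|.
    for cut in product(*(range(lengths[p] + 1) for p in procs)):
        if consistent(cut):
            lattice.setdefault(sum(cut), []).append(cut)
    return {lvl: sorted(cuts) for lvl, cuts in lattice.items()}
```

A consistent run then corresponds to any path that moves from level k to level k+1 by incrementing a single component, starting from the all-zero tuple.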
[Figure 4.3. A Distributed Computation and the Lattice of its Global States]

Note that one might be tempted to identify the run corresponding to the actual execution of the computation. As we argued earlier, in an asynchronous distributed system, this is impossible to achieve from within the system. Only an omniscient external observer will be able to identify the sequence of global states that the execution passed through.

4.7 Observing Distributed Computations


Let us consider an alternative strategy for the monitor process p_0 in constructing global states to be used in predicate evaluation, based on a reactive architecture [11]. In this approach, p_0 will assume a passive role in that it will not send any messages of its own. The application processes, however, will be modified slightly so that whenever they execute an event, they
notify p_0 by sending it a message describing the event.⁷ As before, we assume that monitoring does not generate any new events in that the send to p_0 for notification coincides with the event it is notifying. In this manner, the monitor process constructs an observation of the underlying distributed computation as the sequence of events corresponding to the order in which the notification messages arrive [12].
We note certain properties of observations as constructed above. First,
due to the variability of the notification message delays, a single run of
a distributed computation may have different observations at different
monitors. This is the so-called “relativistic effect” of distributed computing
to which we return in Section 4.15. Second, an observation can correspond
to a consistent run, an inconsistent run or no run at all since events from
the same process may be observed in an order different from their local
history. A consistent observation is one that corresponds to a consistent run.
To illustrate these points, consider the following (consistent) run of the computation in Figure 4.1:

R = e_3^1 e_1^1 e_3^2 e_2^1 e_3^3 e_3^4 e_2^2 e_1^2 e_3^5 e_1^3 e_1^4 e_1^5 e_3^6 e_2^3 e_1^6

All of the following are possible observations of R:

O_1 = e_2^1 e_1^1 e_3^1 e_3^2 e_3^4 e_1^2 e_2^2 e_3^3 e_1^3 e_1^4 e_3^5 …
O_2 = e_1^1 e_3^1 e_2^1 e_3^2 e_1^2 e_3^3 e_3^4 e_1^3 e_2^2 e_3^5 e_3^6 …
O_3 = e_3^1 e_2^1 e_1^1 e_1^2 e_3^2 e_3^3 e_1^3 e_3^4 e_1^4 e_2^2 e_1^5 …

Given our asynchronous distributed system model where communication channels need not preserve message order, any permutation of run R is a possible observation of it. Not all observations, however, need be meaningful with respect to the run that produced them. For example, among those indicated above, observation O_1 does not even correspond to a run since events of process p_3 do not represent an initial prefix of its local history (e_3^4 appears before event e_3^3). Observation O_2, on the other hand, corresponds to an inconsistent run. In fact, the global state constructed by p_0 at the end of

7. In general, the application processes need to inform p_0 only when they execute an event that is relevant to Φ. A local event is said to be relevant to predicate Φ if the value of Φ evaluated in a global state (σ_1, …, σ_i^k, …, σ_n) could be different from that evaluated in (σ_1, …, σ_i^(k−1), …, σ_n). For example, in the client-server computation of Figure 4.1, the only events relevant to deadlock detection are the sending/receiving of request and response messages since only these can change the state of the WFG.
observation O_2 would be (σ_1^3, σ_2^2, σ_3^6), which is exactly the global state defined by cut C of Figure 4.2, resulting in the detection of a ghost deadlock. Finally, O_3 is a consistent observation and leads to the same global state as that of cut C′ in Figure 4.2.
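The distinctions among the observations above can be checked mechanically. The sketch below is our own illustration (the data structures are assumptions): it classifies an observation by first testing whether each local history appears in order (otherwise it is not a run at all) and then whether the total order extends causal precedence (otherwise it is an inconsistent run).

```python
def classify_observation(obs, histories, precedes):
    """Classify an observation, i.e. a sequence of observed events.

    obs:       list of events in the order they arrive at the monitor.
    histories: dict mapping a process to its local history (event list).
    precedes:  set of pairs (e, e') with e -> e' (causal precedence).
    Returns "no run", "inconsistent run", or "consistent run".
    """
    position = {e: k for k, e in enumerate(obs)}
    # A run must preserve each local history as an in-order subsequence.
    for h in histories.values():
        observed = [position[e] for e in h if e in position]
        if observed != sorted(observed):
            return "no run"
    # A consistent run must extend the causal precedence relation.
    for (e, e2) in precedes:
        if e in position and e2 in position and position[e] > position[e2]:
            return "inconsistent run"
    return "consistent run"
```

Applied to the examples above, O_1 would be classified as "no run", O_2 as "inconsistent run", and O_3 as "consistent run".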
It is the possibility of messages being reordered by channels that leads
to undesirable observations such as 1 . We can restore order to messages
between pairs of processes by defining a delivery rule for deciding when
received messages are to be presented to the application process. We call
the primitive invoked by the application deliver to distinguish it from receive,
which remains hidden within the delivery rule and does not appear in the
local history of the process.
Communication from process p_i to p_j is said to satisfy First-In-First-Out (FIFO) delivery if, for all messages m and m′:

FIFO Delivery: send_i(m) → send_i(m′) ⇒ deliver_j(m) → deliver_j(m′) 8

In other words, FIFO delivery prevents one message from overtaking an earlier message sent by the same process. For each source-destination pair, FIFO delivery can be implemented over non-FIFO channels simply by having the source process add a sequence number to its messages and by using a delivery rule at the destination that presents messages in an order corresponding to the sequence numbers. While FIFO delivery is sufficient to guarantee that observations correspond to runs, it is not sufficient to guarantee consistent observations. To pursue this approach for solving the GPE problem, where Φ is evaluated in global states constructed from observations, we need to devise a mechanism that ensures their consistency.
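The sequence-number scheme just described can be sketched as follows (a minimal Python sketch; the `FifoChannel` class and its method names are illustrative, not from the text). The source numbers its messages per destination, and the receiver buffers out-of-order arrivals until the next expected number arrives:

```python
from collections import defaultdict

class FifoChannel:
    """Per source-destination FIFO delivery over a non-FIFO channel:
    the source numbers its messages; the destination buffers out-of-order
    arrivals and presents them in sequence-number order."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # next sequence number per source (sender side)
        self.expected = defaultdict(int)   # next expected number per source (receiver side)
        self.buffer = defaultdict(dict)    # out-of-order arrivals, keyed by sequence number

    def send(self, src, payload):
        seq = self.next_seq[src]
        self.next_seq[src] += 1
        return (src, seq, payload)         # the message as it travels on the wire

    def receive(self, msg):
        """Called on arrival; returns the (possibly empty) list of payloads
        that become deliverable, in FIFO order."""
        src, seq, payload = msg
        self.buffer[src][seq] = payload
        delivered = []
        while self.expected[src] in self.buffer[src]:
            delivered.append(self.buffer[src].pop(self.expected[src]))
            self.expected[src] += 1
        return delivered
```

For example, if the second message sent overtakes the first in the network, it is held back and both are presented in send order once the first arrives.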
We proceed by devising a simple mechanism and refining it as we relax assumptions. Initially, assume that all processes have access to a global real-time clock and that all message delays are bounded by δ. This is clearly not an asynchronous system but will serve as a starting point. Let RC(e) denote the value of the global clock when event e is executed. When a process notifies p_0 of some local event e, it includes RC(e) in the notification message as a timestamp. The delivery rule employed by p_0 is the following:

DR1: At time t, deliver all received messages with timestamps up to t − δ in increasing timestamp order.
To see why an observation constructed by p_0 using DR1 is guaranteed to be consistent, first note that an event e is observed before event e′ if and only if RC(e) < RC(e′).9 This is true because messages are delivered in
8. Subscripts identify the process executing the event.


9. Again, we can break ties due to simultaneous events based on process indexes.

increasing timestamp order, and delivering only messages with timestamps up to t − δ ensures that no future message can arrive with a timestamp smaller than any of the messages already delivered. Since the observation coincides with the delivery order, it is consistent if and only if

Clock Condition: e → e′ ⇒ RC(e) < RC(e′).

This condition is certainly satisfied when timestamps are generated using the global real-time clock. As it turns out, the clock condition can be satisfied without any such assumptions, even in an asynchronous system.

4.8 Logical Clocks


In an asynchronous system where no global real-time clock can exist, we can devise a simple clock mechanism for "timing" events such that event orderings based on increasing clock values are guaranteed to be consistent with causal precedence. In other words, the clock condition can be satisfied in an asynchronous system. For many applications, including the one above, any mechanism satisfying the clock condition can be shown to be sufficient for using the values produced by it as if they were produced by a global real-time clock [24].
The mechanism works as follows. Each process maintains a local variable LC called its logical clock that maps events to the positive natural numbers [15]. The value of the logical clock when event e is executed by process p_i is denoted LC(e). We use LC to refer to the current logical clock value of a process that is implicit from context. Each message m that is sent contains a timestamp TS(m), which is the logical clock value associated with the sending event. Before any events are executed, all processes initialize their logical clocks to zero. The following update rules define how the logical clock is modified by p_i with the occurrence of each new event e:

    LC := LC + 1                  if e is an internal or send event
    LC := max{LC, TS(m)} + 1      if e = receive(m)
In other words, when a receive event is executed, the logical clock is updated to be greater than both the previous local value and the timestamp of the incoming message. Otherwise (i.e., when an internal or send event is executed), the logical clock is simply incremented. Figure 4.4 illustrates the logical clock values that result when these rules are applied to the computation of Figure 4.1.
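The two update rules translate directly into code (a Python sketch; the class name is ours, not from the text):

```python
class LogicalClock:
    """Lamport-style logical clock: internal and send events increment LC;
    a receive sets LC to max(LC, TS(m)) + 1."""

    def __init__(self):
        self.lc = 0                 # clocks are initialized to zero

    def internal_or_send(self):
        self.lc += 1
        return self.lc              # this value timestamps the event (and any message sent)

    def receive(self, ts):
        # exceed both the previous local value and the incoming timestamp
        self.lc = max(self.lc, ts) + 1
        return self.lc
```

As a usage example, a clock at 5 that receives a message with timestamp 6 jumps directly to 7, skipping 6; this is exactly the behavior of p_3 in Figure 4.4 noted later in the snapshot discussion.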

Figure 4.4. Logical Clocks

Note that the above construction produces logical clock values that are increasing with respect to causal precedence. It is easy to verify that for any two events where e → e′, the logical clocks associated with them are such that LC(e) < LC(e′). Thus, logical clocks satisfy the clock condition of the previous section.10
Now let us return to the goal at hand, which is constructing consistent observations in asynchronous systems. In the previous section, we argued that delivery rule DR1 leads to consistent observations as long as timestamps satisfy the clock condition. We have just shown that logical clocks indeed satisfy the clock condition and are realizable in asynchronous systems. Thus, we should be able to use logical clocks to construct consistent observations in asynchronous systems. Uses of logical clocks in many other contexts are discussed in [26].
Consider a delivery rule where those messages that are delivered are delivered in increasing (logical clock) timestamp order, with ties being broken as usual based on process index. Applying this rule to the example of Figure 4.4, p_0 would construct the observation

10. Note that logical clocks would continue to satisfy the clock condition with any arbitrary
positive integer (rather than one) as the increment value of the update rules.
e_1^1 e_2^1 e_3^1 e_1^2 e_3^2 e_3^3 e_1^3 e_3^4 e_1^4 e_2^2 e_3^5 e_1^5 e_2^3 e_1^6 e_3^6
which is indeed consistent. Unfortunately, the delivery rule as stated lacks liveness since, without a bound on message delays (and a real-time clock to measure it), no message will ever be delivered for fear of receiving a later message with a smaller timestamp. This is because logical clocks, when used as a timing mechanism, lack what we call the gap-detection property:

Gap-Detection: Given two events e and e′ along with their clock values LC(e) and LC(e′) where LC(e) < LC(e′), determine whether some other event e″ exists such that LC(e) < LC(e″) < LC(e′).
It is this property that is needed to guarantee liveness for the delivery rule, and it can be achieved with logical clocks in an asynchronous system only if we exploit information in addition to the clock values. One possibility is based on using FIFO communication between all processes and p_0. As usual, all messages (including those sent to p_0) carry the logical clock value of the send event as a timestamp. Since each logical clock is monotone increasing and FIFO delivery preserves order among messages sent by a single process, when p_0 receives a message m from process p_i with timestamp TS(m), it is certain that no other message m′ can arrive from p_i such that TS(m′) < TS(m). A message m received by process p is called stable if no future messages with timestamps smaller than TS(m) can be received by p. Given FIFO communication between all processes and p_0, stability of message m at p_0 can be guaranteed when p_0 has received at least one message from all other processes with a timestamp greater than TS(m). This idea leads to the following delivery rule for constructing consistent observations when logical clocks are used for timestamps:

DR2: Deliver all received messages that are stable at p_0 in increasing timestamp order.11
Note that real-time clocks also lack the gap-detection property. The assumption, however, that message delays are bounded by δ was sufficient to devise a simple stability check in delivery rule DR1: at time t, all received messages with timestamps smaller than t − δ are guaranteed to be stable.
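DR2's stability check can be sketched as follows (a Python sketch, with hypothetical names; the monitor tracks the largest timestamp seen from each process and releases a buffered message once every other process has sent something with a strictly larger timestamp):

```python
import heapq

class Dr2Monitor:
    """Sketch of delivery rule DR2: buffer notification messages and deliver
    those that are stable, i.e. a message with a larger timestamp has been
    received from every other process, in increasing timestamp order.
    Ties are broken by process identifier."""

    def __init__(self, processes):
        self.latest = {p: 0 for p in processes}   # largest TS seen per process
        self.pending = []                         # min-heap of (ts, pid, payload)

    def receive(self, pid, ts, payload):
        self.latest[pid] = max(self.latest[pid], ts)
        heapq.heappush(self.pending, (ts, pid, payload))
        return self._deliver_stable()

    def _deliver_stable(self):
        delivered = []
        while self.pending:
            ts, pid, payload = self.pending[0]
            # stable iff every other process has already sent something later
            if all(seen > ts for p, seen in self.latest.items() if p != pid):
                heapq.heappop(self.pending)
                delivered.append((ts, pid, payload))
            else:
                break
        return delivered
```

Note the liveness caveat of footnote 11 applies here too: a silent process blocks delivery until it sends (or acknowledges) something.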

11. Even this delivery rule may lack liveness if some processes do not communicate with p_0 after a certain point. Liveness can be obtained by the monitor p_0 requesting an acknowledgement from all processes to a periodic empty message [15]. These acknowledgements serve to "flush out" messages that may have been in the channels.
Figure 4.5. Message Delivery that is FIFO but not Causal

4.9 Causal Delivery


Recall that FIFO delivery guarantees order to be preserved among messages sent by the same process. A more general abstraction extends this ordering to all messages that are causally related, even if they are sent by different processes. The resulting property is called causal delivery and can be stated as:

Causal Delivery (CD): send_i(m) → send_j(m′) ⇒ deliver_k(m) → deliver_k(m′)

for all messages m, m′, sending processes p_i, p_j and destination process p_k. In other words, in a system respecting causal delivery, a process cannot know about the existence of a message (through intermediate messages) any earlier than the event corresponding to the delivery of that message [28]. Note that having FIFO delivery between all pairs of processes is not sufficient to guarantee causal delivery. Figure 4.5 illustrates a computation where all deliveries (trivially) satisfy FIFO but those of p_3 violate CD.
The relevance of causal delivery to the construction of consistent observations is obvious: if p_0 uses a delivery rule satisfying CD, then all of its observations will be consistent. The correctness of this result is an immediate consequence of the definition of CD, which coincides with that of a consistent observation. In retrospect, the two delivery rules DR1 and DR2 we developed in the previous sections are instances of CD that work under certain assumptions. What we seek is an implementation for CD that makes no assumptions beyond those of asynchronous systems.

4.10 Constructing the Causal Precedence Relation


Note that we have stated the gap-detection property in terms of clock values. For implementing causal delivery efficiently, what is really needed is an effective procedure for deciding the following: given events e, e′ that are causally related and their clock values, does there exist some other event e″ such that e → e″ → e′ (i.e., e″ falls in the causal "gap" between e and e′)?

By delivering event notification messages in strictly increasing timestamp order, rules DR1 and DR2 assume that RC(e) < RC(e′) (equivalently, LC(e) < LC(e′)) implies e → e′. This is a conservative assumption since timestamps generated using real-time or logical clocks only guarantee the clock condition, which is this implication in the opposite sense. Given RC(e) < RC(e′) (or LC(e) < LC(e′)), it may be that e causally precedes e′ or that they are concurrent. What is known for certain is that e′ → e is impossible. Having just received the notification of event e′, DR1 and DR2 could unnecessarily delay its delivery even if they could predict the timestamps of all notifications yet to be received. The delay would be unnecessary if there existed future notifications with smaller timestamps, but they all happened to be for events concurrent with e′.
The observations of the preceding two paragraphs suggest a timing mechanism TC whereby causal precedence relations between events can be deduced from their timestamps. We strengthen the clock condition by adding an implication in the other sense to obtain:

Strong Clock Condition: e → e′ ⇔ TC(e) < TC(e′).

While real-time and logical clocks are merely consistent with causal precedence, a timing mechanism TC satisfying the above is said to characterize causal precedence, since the entire computation can be reconstructed from a single observation containing TC as timestamps [8,30]. This is essential not only for efficient implementation of CD, but also for many other applications (e.g., distributed debugging discussed in Section 4.14.2) that require the entire global state lattice rather than a single path through it.

4.10.1 Causal Histories


A brute-force approach to satisfying the strong clock condition is to devise
a timing mechanism that produces the set of all events that causally precede
Figure 4.6. Causal History of Event e_1^4

an event as its "clock" value [30]. We define the causal history of event e in distributed computation (H, →) as the set

θ(e) = {e′ ∈ H : e′ → e} ∪ {e}

In other words, the causal history of event e is the smallest consistent cut that includes e. The projection of θ(e) on process p_i is the set θ_i(e) = θ(e) ∩ h_i. Figure 4.6 graphically illustrates the causal history of event e_1^4 as the darkened segments of process local histories leading towards the event. From the figure, it is easy to see that θ(e_1^4) = {e_1^1, e_1^2, e_1^3, e_1^4, e_2^1, e_3^1, e_3^2, e_3^3}.
In principle, maintaining causal histories is simple. Each process p_i initializes a local variable θ to be the empty set. If e is the receive of message m by process p_i from p_j, then θ(e) is constructed as the union of {e}, the causal history of the previous local event of p_i, and the causal history of the corresponding send event at p_j (included in message m as its timestamp). Otherwise (e is an internal or send event), θ(e) is the union of {e} and the causal history of the previous local event.
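This bookkeeping can be sketched as follows (a Python sketch; the event naming and the helper `causally_precedes` are illustrative, not from the text):

```python
class CausalHistoryProcess:
    """Maintains the causal history theta as a set of event names.
    A send ships a copy of the history as the message timestamp."""

    def __init__(self, pid):
        self.pid = pid
        self.count = 0
        self.theta = set()      # history of the previous local event (initially empty)

    def _new_event(self):
        self.count += 1
        return (self.pid, self.count)    # canonical event name: k-th event of pid

    def local_or_send(self):
        e = self._new_event()
        self.theta = self.theta | {e}
        return e, set(self.theta)        # the copy doubles as a message timestamp

    def receive(self, ts):
        e = self._new_event()
        # union of {e}, the previous local history, and the sender's history
        self.theta = self.theta | ts | {e}
        return e

def causally_precedes(e, theta_e2):
    """e -> e' iff e is a member of theta(e'), for e != e'."""
    return e in theta_e2
```

Even this tiny example shows the practical drawback discussed next: every message carries the entire history.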
When causal histories are used as clock values, the strong clock condition can be satisfied if we interpret clock comparison as set inclusion. From the definition of causal histories, it follows that

e → e′ ⇔ θ(e) ⊂ θ(e′)

In case e ≠ e′, the set inclusion above can be replaced by the simple set membership test e ∈ θ(e′). The unfortunate property of causal histories that renders them impractical is that they grow rapidly.

4.10.2 Vector Clocks


The causal history mechanism proposed in the previous section can be
made practical by periodically pruning segments of history that are known
to be common to all events [25]. Alternatively, the causal history can be
represented as a fixed-dimensional vector rather than a set. The resulting
growth rate will be logarithmic in the number of events rather than linear.
In what follows, we pursue this approach.
First, note that the projection of causal history θ(e) on process p_i corresponds to an initial prefix of the local history of p_i. In other words, θ_i(e) = {e_i^1, e_i^2, ..., e_i^{k_i}} for some unique k_i and, by the canonical enumeration of events, e_i^x ∈ θ_i(e) for all x ≤ k_i. Thus, a single natural number k_i is sufficient to represent the set θ_i(e). Since θ(e) = θ_1(e) ∪ ... ∪ θ_n(e), the entire causal history can be represented by an n-dimensional vector VC(e) where, for all 1 ≤ i ≤ n, the i-th component is defined as

VC(e)[i] = k_i if and only if θ_i(e) = {e_i^1, e_i^2, ..., e_i^{k_i}}
The resulting mechanism is known as vector clocks and has been discovered independently by many researchers in many different contexts (see [30] for a survey). In this scheme, each process p_i maintains a local vector VC of natural numbers where VC(e) denotes the vector clock value of p_i when it executes event e. As with logical clocks, we use VC to refer to the current vector clock of a process that is implicit from context. Each process p_i initializes VC to contain all zeros. Each message m contains a timestamp TS(m), which is the vector clock value of its send event. The following update rules define how the vector clock is modified by p_i with the occurrence of each new event e:

    VC[i] := VC[i] + 1               if e is an internal or send event

    VC := max{VC, TS(m)}             if e = receive(m)
    VC[i] := VC[i] + 1

Figure 4.7. Vector Clocks

In other words, an internal or send event simply increments the local component of the vector clock. A receive event, on the other hand, first updates the vector clock to be greater than (on a component-by-component basis) both the previous value and the timestamp of the incoming message, and then increments the local component. Figure 4.7 illustrates the vector clocks associated with the events of the distributed computation displayed in Figure 4.1.
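The update rules translate directly into code (a Python sketch; the class shape is ours, only the rules come from the text):

```python
class VectorClock:
    """Vector clock for process i of n. Internal and send events increment
    the local component; a receive first takes the component-wise max with
    the message timestamp, then increments the local component."""

    def __init__(self, i, n):
        self.i = i
        self.vc = [0] * n           # initialized to all zeros

    def internal_or_send(self):
        self.vc[self.i] += 1
        return list(self.vc)        # the copy doubles as a message timestamp TS(m)

    def receive(self, ts):
        # component-wise max with the incoming timestamp ...
        self.vc = [max(a, b) for a, b in zip(self.vc, ts)]
        # ... then count the receive event itself
        self.vc[self.i] += 1
        return list(self.vc)
```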
Given the above implementation, the j-th component of the vector clock of process p_i has the following operational interpretation for all j ≠ i:

VC(e_i)[j] = number of events of p_j that causally precede event e_i of p_i

On the other hand, VC(e_i)[i] counts the number of events p_i has executed up to and including e_i. Equivalently, VC(e_i)[i] is the ordinal position of event e_i in the canonical enumeration of p_i's events.
From the definition of vector clocks, we can easily derive a collection of useful properties. Given two n-dimensional vectors V and V′ of natural numbers, we define the "less than" relation (written V < V′) between them as follows:

V < V′ ≡ (V ≠ V′) ∧ (∀k : 1 ≤ k ≤ n : V[k] ≤ V′[k])

This allows us to express the strong clock condition in terms of vector clocks as

Property 1 (Strong Clock Condition)  e → e′ ⇔ VC(e) < VC(e′)

Note that for the above test, it is not necessary to know on which processes the two events were executed. If this information is available, causal precedence between two events can be verified through a single scalar comparison.

Property 2 (Simple Strong Clock Condition)  Given event e_i of process p_i and event e_j of process p_j, where i ≠ j:

e_i → e_j ⇔ VC(e_i)[i] ≤ VC(e_j)[i]

Note that the condition VC(e_i)[i] = VC(e_j)[i] is possible and represents the situation where e_i is the latest event of p_i that causally precedes e_j of p_j (thus e_i must be a send event).
Given this version of the strong clock condition, we obtain a simple test for concurrency between events that follows directly from its definition:

Property 3 (Concurrent)  Given event e_i of process p_i and event e_j of process p_j:

e_i || e_j ⇔ (VC(e_i)[i] > VC(e_j)[i]) ∧ (VC(e_j)[j] > VC(e_i)[j])
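These comparisons reduce to a few lines of code (a Python sketch; the function names are ours, the tests are Properties 1 through 3):

```python
def vc_less(v, w):
    """The 'less than' relation on vectors: component-wise <= with v != w.
    By the strong clock condition, vc_less(VC(e), VC(e')) iff e -> e'."""
    return v != w and all(a <= b for a, b in zip(v, w))

def precedes(vc_ei, i, vc_ej):
    """Simple strong clock condition: for events on distinct processes,
    e_i -> e_j iff VC(e_i)[i] <= VC(e_j)[i], a single scalar comparison."""
    return vc_ei[i] <= vc_ej[i]

def concurrent(vc_ei, i, vc_ej, j):
    """Concurrency test: e_i || e_j iff neither precedes the other."""
    return vc_ei[i] > vc_ej[i] and vc_ej[j] > vc_ei[j]
```

For example, a send with clock [2, 0] at the first process precedes the matching receive with clock [2, 1] at the second, while the two processes' independent first events, [1, 0] and [0, 1], are concurrent.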

Consistency of cuts of a distributed computation can also be easily verified in terms of vector clocks. Events e_i and e_j are said to be pairwise inconsistent if they cannot belong to the frontier of the same consistent cut. In terms of vector clocks, this can be expressed as

Property 4 (Pairwise Inconsistent)  Event e_i of process p_i is pairwise inconsistent with event e_j of process p_j, where i ≠ j, if and only if

(VC(e_i)[i] < VC(e_j)[i]) ∨ (VC(e_j)[j] < VC(e_i)[j])
The two disjuncts characterize exactly the two possibilities for the cut to include at least one receive event without including its corresponding send event (thus making it inconsistent). While this property might appear to be equivalent to (e_i → e_j) ∨ (e_j → e_i) at first sight, this is not the case; it is obviously possible for two events to be causally related and yet be pairwise consistent. We can then characterize a cut as being consistent if its frontier contains no pairwise inconsistent events. Given the definition of a cut, it suffices to check pairwise inconsistency only for those events that are in the frontier of the cut. In terms of vector clocks, the property becomes

Property 5 (Consistent Cut)  A cut defined by (c_1, ..., c_n) is consistent if and only if

∀i, j : 1 ≤ i ≤ n, 1 ≤ j ≤ n : VC(e_i^{c_i})[i] ≥ VC(e_j^{c_j})[i]
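The consistency test of Property 5 is a direct double loop over the frontier's vector clocks (a Python sketch; representing the frontier as a list of vectors is our assumption):

```python
def consistent_cut(frontier):
    """Property 5: a cut is consistent iff, for all i and j, the frontier
    event of process i knows at least as much about p_i as any other
    frontier event does: VC(e_i)[i] >= VC(e_j)[i].
    `frontier[i]` is the vector clock of e_i^{c_i}."""
    n = len(frontier)
    return all(frontier[i][i] >= frontier[j][i]
               for i in range(n) for j in range(n))
```

With two processes, the frontier [[1, 0], [1, 1]] (a send and its receive) is consistent, while [[0, 0], [1, 1]] includes the receive without its send and is not.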

Recall that, for all j ≠ i, the vector clock component VC(e_i)[j] can be interpreted as the number of events of p_j that causally precede event e_i of p_i. The component corresponding to the process itself, on the other hand, counts the total number of events executed by p_i up to and including e_i. Let #(e_i) = (Σ_{j=1}^{n} VC(e_i)[j]) − 1. Thus, #(e_i) denotes exactly the number of events that causally precede e_i in the entire computation.

Property 6 (Counting)  Given event e_i of process p_i and its vector clock value VC(e_i), the number of events e such that e → e_i (equivalently, VC(e) < VC(e_i)) is given by #(e_i).

Finally, vector clocks supply a weak form of the gap-detection property that logical and real-time clocks do not. The following property follows directly from the vector clock update rules and the second form of the strong clock condition. It can be used to determine if the causal "gap" between two events admits a third event.

Property 7 (Weak Gap-Detection)  Given event e_i of process p_i and event e_j of process p_j, if VC(e_i)[k] < VC(e_j)[k] for some k ≠ j, then there exists an event e_k such that

¬(e_k → e_i) ∧ (e_k → e_j)

The property is "weak" in the sense that, for arbitrary processes p_i and p_j, we cannot conclude whether the three events form a causal chain e_i → e_k → e_j.
For the special case k = i, however, the property indeed identifies a condition sufficient to make such a conclusion.

4.11 Implementing Causal Delivery with Vector Clocks


The weak gap-detection property of the previous section can be exploited to efficiently implement causal delivery using vector clocks. Assume that processes increment the local component of their vector clocks only for events that are notified to the monitor.12 As usual, each message m carries a timestamp TS(m), which is the vector clock value of the event being notified. All messages that have been received but not yet delivered by the monitor process p_0 are maintained in a set M, initially empty.

A message m′ from process p_j is deliverable as soon as p_0 can verify that there are no other messages (neither in M nor in the network) whose sending causally precedes that of m′. Let m be the last message delivered from process p_k, where k ≠ j. Before message m′ of process p_j can be delivered, p_0 must verify two conditions:

1. there is no earlier message from p_j that is undelivered, and
2. there is no undelivered message m″ from p_k, k ≠ j, such that send(m″) → send(m′).

The first condition holds if exactly TS(m′)[j] − 1 messages have already been delivered from p_j. To verify the second condition, we can use the special case of weak gap-detection where k = i, setting e_i = send(m) and e_j = send(m′). Since the event send(m) and any event falling in the gap both occur at process p_k, Property 7 can be written as

(Weak Gap-Detection) If TS(m)[k] < TS(m′)[k] for some k ≠ j, then there exists an event e″ such that send(m) → e″ → send(m′).

Thus, no undelivered message m″ exists if TS(m′)[k] ≤ TS(m)[k] for all k ≠ j. These tests can be efficiently implemented if p_0 maintains an array D[1...n] of counters, initially all zeros, such that counter D[k] contains TS(m)[k] where m is the last message that has been delivered from process p_k. The delivery rule then becomes:

12. Equivalently, processes send a notification message to the monitor for all of their events.

Figure 4.8. Causal Delivery Using Vector Clocks

DR3: (Causal Delivery) Deliver message m from process p_j as soon as both of the following conditions are satisfied:

    D[j] = TS(m)[j] − 1
    D[k] ≥ TS(m)[k], for all k ≠ j

When p_0 delivers m, array D is updated by setting D[j] to TS(m)[j].


Figure 4.8 illustrates the application of this delivery rule by p_0 in a sample computation. The events of p_1 and p_2 are annotated with their vector clock values while those of p_0 indicate the values of array D. Note that the delivery of one of the messages is delayed until a causally preceding message has been received and delivered. The other message, on the other hand, can be delivered as soon as it is received, since p_0 can verify that all causally preceding messages have already been delivered.
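Rule DR3 can be sketched as follows (a Python sketch; only the two delivery conditions come from the text, the class shape and the flush loop are ours):

```python
class Dr3Monitor:
    """Causal delivery rule DR3: deliver message m from p_j once
    D[j] == TS(m)[j] - 1 and D[k] >= TS(m)[k] for all k != j."""

    def __init__(self, n):
        self.D = [0] * n        # TS component of the last message delivered per process
        self.M = []             # received but undelivered: (j, ts, payload)

    def receive(self, j, ts, payload):
        self.M.append((j, ts, payload))
        return self._flush()

    def _deliverable(self, j, ts):
        return (self.D[j] == ts[j] - 1 and
                all(self.D[k] >= ts[k] for k in range(len(ts)) if k != j))

    def _flush(self):
        """Repeatedly deliver: one delivery may unblock others in M."""
        delivered, progress = [], True
        while progress:
            progress = False
            for entry in list(self.M):
                j, ts, payload = entry
                if self._deliverable(j, ts):
                    self.M.remove(entry)
                    self.D[j] = ts[j]
                    delivered.append(payload)
                    progress = True
        return delivered
```

In the usage below (illustrative, not Figure 4.8 itself), a notification timestamped [1, 1] from the second process is held back until the causally preceding notification [1, 0] from the first process arrives, at which point both are delivered.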
At this point, we have a complete reactive-architecture solution to the GPE problem in asynchronous distributed systems based on passive observations. The steps are as follows. Processes notify the monitor p_0 of relevant events by sending it messages. The monitor uses a causal delivery rule for the notification messages to construct an observation that corresponds to a consistent run. The global predicate Φ can be applied to any one of the global states in the run since each is guaranteed to be consistent. An application of this solution to deadlock detection is given in Section 4.14.1.
Causal delivery can be implemented at any process rather than just at the monitor. If processes communicate exclusively through broadcasts (rather than point-to-point sends), then delivery rule DR3 remains the appropriate mechanism for achieving causal delivery at all destinations [3]. The resulting primitive, known as causal broadcast (cf. Section 4.15, Chapter XXXbcast), has been recognized as an important abstraction for building distributed applications [3,13,25]. If, on the other hand, communication can take place through point-to-point sends, a delivery rule can be derived based on an extension of vector clocks where each message carries a timestamp composed of n vector clocks (i.e., an n × n matrix) [29,27].

4.12 Causal Delivery and Hidden Channels


In general, causal delivery allows processes to reason globally about the system using only local information. For such conclusions to be meaningful, however, we need to restrict our attention to closed systems: those that constrain all communication to take place within the boundaries of the computing system. If global reasoning based on causal analysis is applied to systems that contain so-called hidden channels, incorrect conclusions may be drawn [15].

To illustrate the problem, consider the example taken from [14] and shown in Figure 4.9. A physical process is being monitored and controlled by a distributed system consisting of p_1, p_2 and p_3. Process p_1 is monitoring the state of a steam pipe and detects its rupture. The event is notified to the controller process p_3 in a message. Process p_2 is monitoring the pressure of the same pipe, several meters downstream from p_1. A few seconds after the rupture of the pipe, p_2 detects a drop in the pressure and notifies p_3 of the event in another message. Note that from the point of view of explicit communication, the two messages are concurrent. The pressure-drop message arrives at p_3 and is delivered without delay since there are no undelivered messages that causally precede it. As part of its control action, p_3 reacts to the pressure drop by applying more heat to increase the temperature. Some time later, the rupture message arrives. The causal sequence observed by p_3 is "pressure drop, apply heat, pipe rupture," leading it to conclude that the pipe ruptured due to the increased temperature. In

Figure 4.9. External Environment as a Hidden Channel

fact, the opposite is true.

The apparent anomaly is due to the steam pipe, which acts as a communication channel external to the system. The rupture and pressure-drop events are indeed causally related even though the relationship is not captured by the → relation. When the pipe is included as a communication channel, the order in which messages are seen by p_3 violates causal delivery. In systems that are not closed, global reasoning has to be based on totally-ordered observations derived from global real time. Since this order is consistent with causal precedence, anomalous conclusions such as the one above will be avoided.

4.13 Distributed Snapshots


In Section 4.5 we presented a strategy for solving the GPE problem through active monitoring. In this strategy, p_0 requested the states of the other processes and then combined them into a global state. Such a strategy is often called a "snapshot" protocol, since p_0 "takes pictures" of the individual process states. As we noted, this global state may not be consistent, and so the monitor may make an incorrect deduction about the system property encoded in the global predicate.

We will now develop a snapshot protocol that constructs only consistent global states. The protocol is due to Chandy and Lamport [4], and the development described here is due to Morgan [23]. For simplicity, we will assume that the channels implement FIFO delivery, and we omit details of how individual processes return their local states to p_0.
For this protocol, we will introduce the notion of a channel state. For each channel from p_i to p_j, its state consists of those messages that p_i has sent to p_j but that p_j has not yet received. Channel states are only a convenience in that each can be inferred from just the local states σ_i and σ_j as the set difference between messages sent by p_i to p_j (encoded in σ_i) and messages received by p_j from p_i (encoded in σ_j). In many cases, however, explicit maintenance of channel state information can lead to more compact process local states and simpler encoding for the global predicate of interest. For example, when constructing the waits-for graph in deadlock detection, an edge is drawn from p_i to p_j if p_i is blocked due to p_j. This relation can be easily obtained from the local process states and channel states: process p_i is blocked on p_j if σ_i records the fact that there is an outstanding request to p_j, and the state of the channel from p_j to p_i contains no response messages.
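The set-difference characterization of a channel state can be stated in one line (a Python sketch; the message representation is illustrative):

```python
def channel_state(sent_by_i_to_j, received_by_j_from_i):
    """State of the channel from p_i to p_j: messages sent by p_i to p_j
    (recoverable from sigma_i) minus those already received by p_j
    (recoverable from sigma_j), preserving send order."""
    return [m for m in sent_by_i_to_j if m not in received_by_j_from_i]
```

For instance, if p_i has sent two requests and p_j has received only the first, the channel state is the still-in-transit second request.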
Let IN_i be the set of processes that have channels connecting them directly to p_i and OUT_i be the set of processes to which p_i has a channel. Channels from processes in IN_i to p_i are called incoming while channels from p_i to processes in OUT_i are called outgoing with respect to p_i. For each execution of the snapshot protocol, a process p_i will record its local state σ_i and the states of all of its incoming channels.

4.13.1 Snapshot Protocols


We proceed as before by devising a simple protocol based on a strong set of assumptions and refining the protocol as we relax them. Initially, assume that all processes have access to a global real-time clock RC, that all message delays are bounded by some known value, and that relative process speeds are bounded.

The first snapshot protocol is based on all processes recording their states at the same real time t. Process p_0 chooses a time t far enough in the future to guarantee that a message sent now will be received by all other processes before t.13 To facilitate the recording of channel states, processes

13. Recall that there need not be a channel between all pairs of processes, and so t must

include a timestamp in each message indicating when the message's send event was executed.

Snapshot Protocol 1
1. Process p_0 sends the message "take snapshot at t" to all processes.14
2. When clock RC reads t, each process p_i records its local state σ_i, sends an empty message over all of its outgoing channels, and starts recording messages received over each of its incoming channels. Recording the local state and sending empty messages are performed before any intervening events are executed on behalf of the underlying computation.
3. The first time p_i receives a message from p_j with timestamp greater than or equal to t, p_i stops recording messages for that channel and declares the state of that channel as those messages that have been recorded.

For each incoming channel, the channel state constructed by process p_i contains the set of messages sent before t and received after t. The empty messages in Step 2 are sent in order to guarantee liveness:15 process p_i is guaranteed to eventually receive a message m with TS(m) ≥ t over every incoming channel.
Being based on real time, it is easy to see that this protocol constructs a consistent global state: it constructs a global state that did in fact occur, and thus could have been observed by our idealized external observer. However, it is worth arguing this point a little more formally. Note that an event e is in the consistent cut C associated with the constructed global state if and only if RC(e) < t. Hence,

(e → e′) ∧ (e′ ∈ C) ⇒ RC(e) < RC(e′) < t

Since real-time clock RC satisfies the clock condition, the above equation implies that C is a consistent cut. In fact, the clock condition is the only property of RC that is necessary for C to be a consistent cut. Since logical clocks also satisfy the clock condition, we should be able to substitute logical clocks for real-time clocks in the above protocol.

account for the possibility of messages being forwarded.


14. For simplicity, we describe the protocols for a single initiation by process 0 . In fact, they
can be initiated by any process, and as long as concurrent initiations can be distinguished,
multiple initiations are possible.
15. Our use of empty messages here is not unlike their use in distributed simulation for the
purposes of advancing global virtual time [22].

There are, however, two other properties of synchronous systems used by Snapshot Protocol 1 that need to be supplied:

The programming construct "when LC = Ω do S" doesn't make sense in the context of logical clocks since the given value Ω need not be attained by a logical clock.16 For example, in Figure 4.4 the logical clock of p_3 never attains the value 6, because the receipt of the message from p_1 forces it to jump from 5 to 7. Even if LC does attain a value of Ω, the programming construct is still problematic. Our rules for updating logical clocks are based on the occurrence of new events. Thus, at the point LC = Ω, the event that caused the clock update has been executed rather than the first event of S. We overcome this problem with the following rules. Suppose p_i contains the statement "when LC = Ω do S", where S generates only internal events or send events. Before executing an event e, process p_i makes the following test:
– If e is an internal or send event and LC = Ω − 2, then p_i executes e and then starts executing S.
– If e = receive(m) where TS(m) ≥ Ω − 1 and LC < Ω − 1, then p_i puts the message m back onto the channel, re-enables e for execution, sets LC to Ω − 1 and starts executing S.

In Protocol 1, the monitor p_0 chooses t such that the message "take snapshot at t" is received by all other processes before time t. In an asynchronous system, p_0 cannot compute such a logical clock value. Instead, we assume that there is an integer value Ω large enough that no logical clock can reach it by using the update rules of Section 4.8.
Assuming the existence of such an requires us to bound both relative
process speeds and message delays, and so we will have to relax it as well.
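The two rules above can be sketched in code. This is an illustrative model only: the value OMEGA, the event methods, and the simplified timestamp update on receipt (LC := max(LC, TS(m)) + 1) are assumptions, and S is modeled as a single internal event executed at LC = Ω.

```python
OMEGA = 10  # assumed value of the "when LC = OMEGA do S" statement

class Proc:
    def __init__(self):
        self.lc = 0
        self.fired = False      # whether S has started executing

    def _start_S(self):
        # S generates only internal/send events; its first event runs at LC = OMEGA
        self.lc += 1
        self.fired = True

    def internal(self):
        # Rule 1: at LC = OMEGA - 2, execute the event e, then start S
        if not self.fired and self.lc == OMEGA - 2:
            self.lc += 1        # execute e, leaving LC = OMEGA - 1
            self._start_S()
        else:
            self.lc += 1

    def receive(self, ts):
        # Rule 2: a message stamped >= OMEGA is put back until S has run
        if not self.fired and ts >= OMEGA and self.lc < OMEGA - 1:
            self.lc = OMEGA - 1 # jump the clock
            self._start_S()     # then re-enable the deferred receive below
        self.lc = max(self.lc, ts) + 1
```

For instance, nine internal events carry a fresh process to LC = 10 with S having run at the right point, and a receive stamped 12 triggers S before the message itself is consumed.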
Given the above considerations, we obtain Snapshot Protocol 2, which
differs from Protocol 1 only in its use of logical clocks in place of the real-
time clock.

Snapshot Protocol 2
1. Process p_0 sends "take snapshot at Ω" to all processes and then sets
its logical clock to Ω.

16. Note that this problem happens to be one aspect of the more general problem of simulating a synchronous system in an asynchronous one [1].

2. When its logical clock reads Ω, process p_i records its local state σ_i,
sends an empty message along each of its outgoing channels, and starts
recording messages received over each of its incoming channels.
Recording the local state and sending the empty messages are per-
formed before any intervening events are executed on behalf of the
underlying computation.
3. The first time p_i receives a message from p_j with a timestamp greater
than or equal to Ω, p_i stops recording messages for that channel and
declares χ_{j,i} as those messages that have been recorded.
Channel states are constructed just as in Protocol 1, with Ω playing the
role of t_ss. As soon as p_0 sets its logical clock to Ω, it will immediately
execute Step 2, and the empty messages it sends will force the clocks of the
processes in OUT_0 to attain Ω. Since the network is strongly connected, all
of the clocks will eventually attain Ω, and so the protocol is live.
We now remove the need for Ω. Note that, with respect to the above
protocol, a process does nothing between receiving the "take snapshot at
Ω" message and receiving the first empty message that causes its clock to
pass through Ω. Thus, we can eliminate the message "take snapshot at Ω"
and instead have a process record its state when it receives the first empty
message. Since processes may send empty messages for other purposes,
we change the message from being empty to one containing a unique
value, for example, "take snapshot". Furthermore, by making this mes-
sage contain a unique value, we no longer need to include timestamps in
messages: the message "take snapshot" is the first message that any pro-
cess sends after the snapshot time. Doing so removes the last reference to
logical clocks, and so we can eliminate them from our protocol completely.

Snapshot Protocol 3 (Chandy-Lamport [4])


1. Process p_0 starts the protocol by sending itself a "take snapshot"
message.
2. Let p_f be the process from which p_i receives the "take snapshot"
message for the first time. Upon receiving this message, p_i records
its local state σ_i and relays the "take snapshot" message along all
of its outgoing channels. No intervening events on behalf of the
underlying computation are executed between these steps. Channel
state χ_{f,i} is set to empty, and p_i starts recording messages received
over each of its other incoming channels.
3. Let p_s be a process from which p_i receives the "take snapshot"
message beyond the first time. Process p_i stops recording messages
along the channel from p_s and declares channel state χ_{s,i} as those
messages that have been recorded.
Since a "take snapshot" message is relayed only upon the first receipt
and since the network is strongly connected, a "take snapshot" message
traverses each channel exactly once. When process p_i has received a "take
snapshot" message from all of its incoming channels, its contribution to the
global state is complete and its participation in the snapshot protocol ends.
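As a concrete illustration of the three steps, the following sketch simulates the protocol over FIFO channels. The Net and Process classes and the message-draining loop are illustrative scaffolding (an assumption, not part of the protocol as stated); channels are (src, dst) pairs backed by FIFO queues.

```python
from collections import deque

class Net:
    """FIFO channels between processes; channels is a set of (src, dst) pairs."""
    def __init__(self, channels):
        self.queues = {c: deque() for c in channels}

    def send(self, src, dst, msg):
        self.queues[(src, dst)].append(msg)

class Process:
    def __init__(self, pid, net, state):
        self.pid, self.net, self.state = pid, net, state
        self.recorded = None          # local state recorded in Step 2
        self.recording = {}           # channel -> messages recorded so far
        self.chan_state = {}          # channel states declared in Step 3
        self.incoming = [c for c in net.queues if c[1] == pid]
        self.outgoing = [c for c in net.queues if c[0] == pid]

    def start_snapshot(self):
        # Steps 1-2 for the initiator: record, relay, start recording.
        self.recorded = self.state
        for (_, dst) in self.outgoing:
            self.net.send(self.pid, dst, "take snapshot")
        for c in self.incoming:
            self.recording[c] = []

    def receive(self, src, msg):
        chan = (src, self.pid)
        if msg == "take snapshot":
            if self.recorded is None:
                # Step 2: first receipt; this channel's state is empty.
                self.recorded = self.state
                for (_, dst) in self.outgoing:
                    self.net.send(self.pid, dst, "take snapshot")
                self.chan_state[chan] = []
                for c in self.incoming:
                    if c != chan:
                        self.recording[c] = []
            else:
                # Step 3: stop recording this channel and declare its state.
                self.chan_state[chan] = self.recording.pop(chan, [])
        elif chan in self.recording:
            self.recording[chan].append(msg)

def deliver_all(net, procs):
    # Drain the network, delivering messages in FIFO order per channel.
    while any(net.queues.values()):
        for (src, dst), q in list(net.queues.items()):
            if q:
                procs[dst].receive(src, q.popleft())
```

With two processes and a message "m" in flight from p_1 to p_0 when the snapshot starts, the recorded state of channel (1, 0) is ["m"] and that of (0, 1) is empty.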
Note that the above protocols can be extended and improved in many
ways including relaxation of the FIFO assumption [20,31] and reduction of
the message complexity [21,28].

4.13.2 Properties of Snapshots


Let Σ^s be a global state constructed by the Chandy-Lamport distributed
snapshot protocol. In the previous section, we argued that Σ^s is guaranteed
to be consistent. Beyond that, however, the actual run that the system
followed while executing the protocol may not even pass through Σ^s. In
this section, we show that Σ^s is not an arbitrary consistent global state, but
one that has useful properties with respect to the run that generated it.
Consider the application of the Chandy-Lamport snapshot protocol to the
distributed computation of Figure 4.3. The composite computation is
shown in Figure 4.10, where solid arrows indicate messages sent by the
underlying computation while dashed arrows indicate "take snapshot"
messages sent by the protocol. From the protocol description, the con-
structed global state is Σ^{23}, with χ_{1,2} empty and χ_{2,1} containing message m.
Let the run followed by processes p_1 and p_2 while executing the protocol be

R = e_2^1 e_1^1 e_1^2 e_1^3 e_2^2 e_1^4 e_2^3 e_2^4 e_1^5 e_2^5 e_1^6

or, in terms of global states,

Σ^{00} Σ^{01} Σ^{11} Σ^{21} Σ^{31} Σ^{32} Σ^{42} Σ^{43} Σ^{44} Σ^{54} Σ^{55} Σ^{65}

The global state of this run in which the protocol is initiated is Σ^{21},
and the global state in which it terminates is Σ^{55}. Note that run R does
not pass through the constructed global state Σ^{23}. As can be verified from
the lattice of Figure 4.3, however, Σ^{23} is reachable from Σ^{21}, and Σ^{55} is
reachable from Σ^{23} in this example. We now
show that this relationship holds in general.



[Figure: space-time diagram of processes p_1 and p_2, with each process's prerecording and postrecording events marked; solid arrows indicate computation messages and dashed arrows indicate "take snapshot" messages.]
Figure 4.10. Application of the Chandy-Lamport Snapshot Protocol

Let Σ^a be the global state in which the snapshot protocol is initiated,
Σ^b be the global state in which the protocol terminates, and Σ^s be the
global state constructed. We will show that there exists a run R' in which
Σ^s is reachable from Σ^a and Σ^b is reachable from Σ^s. Let R be the actual
run the system followed while executing the snapshot protocol, and let
e_i^s denote the event when p_i receives "take snapshot" for the first time,
causing p_i to record its state. An event e of p_i is a prerecording event if it
occurs no later than e_i^s; otherwise, it is a postrecording event.
Consider any two adjacent events ⟨e, e′⟩ of R such that e is a postrecording
event and e′ is a prerecording event.17 We will show that e cannot causally
precede e′, and so the order of these two events can be swapped, thereby
resulting in another consistent run. If we continue this process of swapping
postrecording, prerecording event pairs, then we will eventually construct
a consistent run in which no prerecording event follows a postrecording
event. The global state associated with the last prerecording event is
therefore reachable from Σ^a, and the state Σ^b is reachable from it. Finally,
we will show that this state is Σ^s, the state that the snapshot protocol
constructs.
Consider the adjacent events ⟨e, e′⟩ of run R where e is a postrecording
event and e′ a prerecording event. If e → e′, then the two events cannot
be swapped without resulting in an inconsistent run. For contradiction,
assume that e → e′. There are two cases to consider:
1. Both events e and e′ are from the same process. If this were the case,
however, then by definition e′ would be a postrecording event.
2. Event e is a send event of p_i and e′ is the corresponding receive event
of p_j. If this were the case, however, then from the protocol, p_i will
have sent a "take snapshot" message to p_j by the time e is executed,
and since the channel is FIFO, e′ will also be a postrecording event.
Hence, a postrecording event cannot causally precede a prerecording
event, and thus any postrecording, prerecording event pair can be swapped.

17. Adjacent event pairs ⟨e_1^3, e_2^2⟩ and ⟨e_1^4, e_2^3⟩ of run R are two such examples.
Let R' be the run derived from R by swapping such pairs until all postrecord-
ing events follow all prerecording events. We now argue that the global
state after the execution of the last prerecording event in R' is Σ^s. By
the protocol description and the definition of prerecording and postrecording
events, each process records its local state at the boundary between its
prerecording and postrecording events. Furthermore, by the protocol, the
channel states that are recorded are those messages that were sent by
prerecording events and received by postrecording events. By construction,
these are exactly the messages in the channels after the execution of the last
prerecording event of R', and so Σ^s is the state recorded by the snapshot protocol.
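The swapping construction can be sketched directly. In this illustrative model (an assumption, not the chapter's formalism), a run is a list of (name, is_postrecording) pairs, and the causally_precedes argument stands in for the happened-before relation; the argument above guarantees the asserted condition never fires for a real run.

```python
def normalize(run, causally_precedes):
    # Repeatedly swap adjacent (postrecording, prerecording) pairs.
    # The text shows such a pair is never causally related, so each swap
    # yields another consistent run; the loop ends when every prerecording
    # event precedes every postrecording event.
    run = list(run)
    swapped = True
    while swapped:
        swapped = False
        for k in range(len(run) - 1):
            e, f = run[k], run[k + 1]
            if e[1] and not f[1]:          # post followed by pre
                assert not causally_precedes(e, f), "cannot swap causal pair"
                run[k], run[k + 1] = f, e
                swapped = True
    return run
```

Because only post/pre pairs are exchanged, the relative order within the prerecording events and within the postrecording events is preserved, which is what makes the boundary state well defined.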

4.14 Properties of Global Predicates


We have derived two methods for global predicate evaluation: one based
on a monitor actively constructing snapshots and another one based on a
monitor passively observing runs. The utility of either approach for solving
practical distributed systems problems, however, depends in part on the
properties of the predicate that is being evaluated. In this section, some of
these properties are examined.

4.14.1 Stable Predicates


Let Σ be a consistent global state of a computation constructed through
any feasible mechanism. Given that communication in a distributed system
incurs delays, Σ can only reflect some past state of the system: by the
time they are obtained, conclusions drawn about the system by evaluating
predicate Φ in Σ may have no bearing on the present.
Many system properties one wishes to detect, however, have the charac-
teristic that once they become true, they remain true. Such properties (and
process p(i): 1 i n
var pending: queue of [message, integer] init empty; % pending requests to p(i)
working: boolean init false; % processing a request
m: message; j: integer;
while true do
while working or (size(pending) = 0) do
receive m from p(j); % m set to message, j to its source
case m.type of
request:
pending := pending + [m, j];
response:
[m, j] := NextState(m, j);
working := (m.type = request);
send m to p(j);
esac
od ;
while not working and (size(pending) > 0) do
[m, j] := first(pending);
pending := tail(pending);
[m, j] := NextState(m, j);
working := (m.type = request);
send m to p(j)
od
od
end p(i);

Figure 4.11. Server Process

their predicates) are called stable. Examples of stable properties include


deadlock, termination, the loss of all tokens, and unreachability of storage
(garbage collection). If Φ is stable, then the monitor process can strengthen
its knowledge about when Φ is satisfied.
As before, let Σ^a be the global state in which the global state construction
protocol is initiated, Σ^b be the global state in which the protocol terminates,
and Σ^s be the global state it constructs. Since Σ^s is reachable from Σ^a and
Σ^b is reachable from Σ^s, if Φ is stable, then the following conclusions are possible:

(Φ is true in Σ^s) ⟹ (Φ is true in Σ^b)
and
(Φ is false in Σ^s) ⟹ (Φ is false in Σ^a)
As an example of detecting a stable property, we return to deadlock in
the client-server system described in Section 4.5. We assume that there is a

process p(i): 1 i n
var pending: queue of [message, integer] init empty; % pending requests to p(i)
working: boolean init false; % processing a request
blocking: array [1..n] of boolean init false; % blocking[j] = “p(j) is blocked on p(i)”
m: message; j: integer; s: integer init 0;
while true do
while working or (size(pending) = 0) do
receive m from p(j); % m set to message, j to its source
case m.type of
request:
blocking[j] := true;
pending := pending + [m, j];
response:
[m, j] := NextState(m, j);
working := (m.type = request);
send m to p(j);
if (m.type = response) then blocking[j] := false;
snapshot:
if s = 0 then
% this is the first snapshot message
send [type: snapshot, data: blocking] to p(0);
send [type: snapshot] to p(1), ..., p(i−1), p(i+1), ..., p(n);
s := (s + 1) mod n;
esac
od ;
while not working and (size(pending) > 0) do
[m, j] := head(pending);
pending := tail(pending);
[m, j] := NextState(m, j);
working := (m.type = request);
send m to p(j);
if (m.type = response) then blocking[j] := false;
od
od
end p(i);

Figure 4.12. Deadlock Detection through Snapshots: Server Side



process p(0):
var wfg: array [1..n] of array [1..n] of boolean; % wfg[i, j] = p(j) waits-for p(i)
j, k: integer; m: message;
while true do
wait until deadlock is suspected;
send [type: snapshot] to p(1), ..., p(n);
for k := 1 to n do
receive m from p(j);
wfg[j] := m.data;
if (cycle in wfg) then system is deadlocked
od
end p(0);

Figure 4.13. Deadlock Detection through Snapshots: Monitor Side

bidirectional communication channel between each pair of processes and


each process when acting as a server runs the same program, shown in
Figure 4.11. The behavior of a process as a client is much simpler: after
sending a request to a server it blocks until the response is received. The
server is modeled as a state machine [?]: it repeatedly receives a message,
changes its state, and optionally sends one or more messages. The function
NextState computes the action a server next takes: given a message m
from process p(j), the invocation NextState(m, j) changes the state of the
server and returns the next message to send along with its destination. The
resulting message may be a response to the client’s request or it may be a
further request whose response is needed to service the client’s request. All
requests received by a server, including those received while it is servicing
an earlier request, are queued on the FIFO queue pending, and the server
removes and begins servicing the first entry from pending after it finishes
an earlier request by sending the response.
Figure 4.12 shows the server of Figure 4.11 with a snapshot protocol em-
bedded in it. Each server maintains a boolean array blocking that indicates
which processes have sent it requests to which it has not yet responded
(this information is also stored in pending, but we duplicate it in blocking
for clarity). When a server first receives a snapshot message, it sends
the contents of blocking to p_0 and relays the snapshot message to all other
processes. Subsequent snapshot messages are ignored until p_i has received
n such messages, one from each other process.18

18. By the assumption that the network is completely connected, each invocation of the snapshot protocol will result in exactly n snapshot messages being received by each of the processes.

The conventional definition of "p_j waits-for p_i" is that p_j has sent a request
to p_i to which p_i has not yet responded. As in Section 4.5, we will instead use
the weaker definition "p_j waits-for′ p_i", which holds when p_i has received
a request from p_j to which it has not yet responded [10]. By structuring
the server as a state machine, even requests sent to a deadlocked server
will eventually be received and noted in blocking. Hence, a system that
contains a cycle in the conventional WFG will eventually contain a cycle
in the WFG′, and so a deadlock will be detected eventually. Furthermore,
using the WFG′ instead of the WFG has the advantage of referencing only
local process states, and so the embedded snapshot protocol need not
record channel states.
Figure 4.13 shows the code run by the monitor p_0 acting as the deadlock
detector. This process periodically starts a snapshot by sending a snapshot
message to all other processes. Then, p_0 receives the arrays blocking from
each of the processes and uses this data to test for a cycle in the WFG′.
This approach has the advantage of generating an additional message load
only when deadlock is suspected. However, the approach also introduces a
latency between the occurrence of the deadlock and its detection that depends
on how often the monitor starts a snapshot.
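The monitor's test "cycle in wfg" can be sketched with a standard depth-first search. The matrix convention follows Figure 4.13 (wfg[i][j] true meaning p(j) waits-for p(i)); the function name is illustrative.

```python
def has_cycle(wfg):
    # wfg[i][j] == True means p(j) waits-for p(i). Follow waits-for edges
    # j -> i with a depth-first search; reaching a vertex that is still on
    # the current search stack means the waits-for graph has a cycle,
    # i.e., the system is deadlocked.
    n = len(wfg)
    state = [0] * n                      # 0 = unvisited, 1 = on stack, 2 = done

    def dfs(j):
        state[j] = 1
        for i in range(n):
            if wfg[i][j]:                # p(j) waits-for p(i)
                if state[i] == 1 or (state[i] == 0 and dfs(i)):
                    return True
        state[j] = 2
        return False

    return any(state[j] == 0 and dfs(j) for j in range(n))
```

For two processes each waiting on the other, the test reports a cycle; a one-way wait does not.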
Figures 4.14 and 4.15 show the server and monitor code, respectively, of
a reactive-architecture deadlock detector. This solution is much simpler
than the snapshot-based version. In this case, p_i sends a notification to p_0
whenever it receives a request or sends a response to a client. Monitor p_0
uses these notifications to update a WFG′ in which it tests for a cycle. The
simplicity of this solution is somewhat superficial, however, because the
protocol requires all messages to p_0 to be sent using causal delivery order
instead of FIFO order. The only latency between a deadlock's occurrence
and its detection is the delay associated with a notification message, and
thus it is typically shorter than that of the snapshot-based detector.

4.14.2 Nonstable Predicates


Unfortunately, not all predicates one wishes to detect are stable. For ex-
ample, when debugging a system one may wish to monitor the lengths
of two queues, and notify the user if the sum of the lengths is larger than
some threshold. If both queues are dynamically changing, then the pred-
icate corresponding to the desired condition is not stable. Detecting such


process p(i): 1 i n
var pending: queue of [message, integer] init empty; % pending requests to p(i)
working: boolean init false; % processing a request
m: message; j: integer;
while true do
while working or (size(pending) = 0) do
receive m from p(j); % m set to message, j to its source
case m.type of
request:
send [type: requested, of: i, by: j] to p(0);
pending := pending + [m, j];
response:
[m, j] := NextState(m, j);
working := (m.type = request);
send m to p(j);
if (m.type = response) then
send [type: responded, to: j, by: i] to p(0);
esac
od ;
while not working and (size(pending) > 0) do
[m, j] := first(pending);
pending := tail(pending);
[m, j] := NextState(m, j);
working := (m.type = request);
send m to p(j);
if (m.type = response) then
send [type: responded, to: j, by: i] to p(0)
od
od
end p(i);

Figure 4.14. Deadlock Detection using Reactive Protocol: Server Side



process p(0):
var wfg: array [1..n, 1..n] of boolean init false; % wfg[i, j] = “p(j) waits-for p(i)”
m: message; j: integer;
while true do
receive m from p(j); % m set to message, j to its source
if (m.type = responded) then
wfg[m.by, m.to] := false
else
wfg[m.of, m.by] := true;
if (cycle in wfg) then system is deadlocked
od
end p(0);

Figure 4.15. Deadlock Detection using Reactive Protocol: Monitor Side

predicates poses two serious problems.


The first problem is that the condition encoded by the predicate may not
persist long enough for it to be true when the predicate is evaluated. For
example, consider the computation of Figure 4.16, in which variables x and
y are being maintained by two processes p_1 and p_2, respectively. Suppose
we are interested in monitoring the condition x = y. There are seven
consistent global states in which the condition x = y holds, yet if the
monitor evaluates the condition after state Σ^{54}, then it will be found
not to hold.
The second problem is more disturbing: if a predicate Φ is found to be
true by the monitor, we do not know whether Φ ever held during the actual
run. For example, suppose that in the same computation the condition being
monitored is y − x = 2. The only two global states of Figure 4.16 satisfying
this condition are Σ^{31} and Σ^{41}. Let a snapshot protocol be initiated in state
Σ^{11} of the run

Σ^{00} Σ^{01} Σ^{11} Σ^{12} Σ^{22} Σ^{32} Σ^{42} Σ^{43} Σ^{44} Σ^{45} Σ^{55} Σ^{65}

From the result of Section 4.13.2, we know that the snapshot protocol
could construct either global state Σ^{31} or Σ^{41}, since both are reachable from
Σ^{11}. Thus, the monitor could "detect" y − x = 2 even though the actual
run never goes through a state satisfying the condition.
It appears that there is very little value in using a snapshot protocol
to detect a nonstable predicate: the predicate may have held even if it is
not detected, and even if it is detected, it may never have held. The same
problems exist when nonstable predicates are evaluated over global states

[Figure: the global state lattice of the computation, with the states satisfying the two monitored conditions marked.]

Figure 4.16. Global States Satisfying Predicates x = y and y − x = 2

constructed from observations of runs: if a nonstable predicate holds at


some state during a consistent observation, then the condition may or may
not have held during the actual run that produced it.
With observations, however, we can extend nonstable global predicates
so that they are meaningful in the face of our uncertainty about the run actually
followed by the computation. To do this, the extended predicates must
apply to the entire distributed computation rather than to individual runs
or global states of it. There are two choices for defining predicates over
computations [5,19]:
1. Possibly(Φ): There exists a consistent observation O of the compu-
tation such that Φ holds in a global state of O.
2. Definitely(Φ): For every consistent observation O of the computa-
tion, there exists a global state of O in which Φ holds.


The distributed computation of Figure 4.16 satisfies both predicates
Possibly(y − x = 2) and Definitely(x = y). As with stable predicates, by
the time they are detected, both of these predicates refer to some past state
or states of a run. The predicate "Φ currently holds" can also be detected,
but doing so will introduce blocking of the underlying computation.
An application of these extended predicates is when a run is observed
for the purposes of debugging [5]. For instance, if Φ identifies some erroneous
state of the computation, then Possibly(Φ) holding indicates a bug, even if
it is not observed during an actual run. For example, if y − x = 2 denotes
an erroneous state, then the computation of Figure 4.16 is incorrect, since
there is no guarantee that the erroneous state will not occur in some run.
The intuitive meanings of Possibly and Definitely could lead one to
believe that they are duals of each other: Definitely(Φ) being equivalent
to ¬Possibly(¬Φ), and Possibly(Φ) being equivalent to ¬Definitely(¬Φ).
This is not the case. For example, while it is true that ¬Definitely(Φ)
holding for a computation does imply Possibly(¬Φ) (there must be an
observation in which ¬Φ holds in all of its states), it is possible to have
both Possibly(Φ) and Definitely(¬Φ) hold. Figure 4.16 illustrates the latter:
the computation satisfies both Possibly(y − x = 2) and Definitely(y − x ≠ 2).
Furthermore, if predicate Φ is stable, then Possibly(Φ) is equivalent to Definitely(Φ).
The converse, however, is not true in general.
The choice between detecting whether Φ currently holds versus whether
Φ possibly or definitely held in the past depends on which restrictions con-
found the debugging process more. Detecting a condition that occurred in
the past may require program replay or reverse execution in order to re-
cover the state of interest, which can be very expensive to provide. Hence,
detection in the past is better suited to a post-mortem analysis of a com-
putation. Detecting the fact that currently holds, on the other hand,
requires delaying the execution of processes, which can be a serious im-
pediment to debugging. By blocking some processes when the predicate
becomes potentially true, we may make the predicate either more or less
likely to occur. For example, a predicate may be less likely to occur if
processes “communicate” using timeouts or some other uncontrolled form
of communication. The latter is a particular problem when processes are
multithreaded; that is, consisting of multiple, independently schedulable
threads of control which may communicate through shared memory. In
fact, it is rarely practical to monitor such communication when debugging
without hardware or language support.

procedure Possibly(Φ);
var current: set of global states;
ℓ: integer;
begin
% Synchronize processes and distribute Φ
send Φ to all processes;
current := { Σ^{0...0} };
release processes;
ℓ := 0;
% Invariant: current contains all states of level ℓ that are reachable from Σ^{0...0}
while (no state in current satisfies Φ) do
if current = {final global state} then return false;
ℓ := ℓ + 1;
current := states of level ℓ
od;
return true
end

Figure 4.17. Algorithm for Detecting Possibly(Φ).

4.14.3 Detecting Possibly and Definitely


The algorithms for detecting Possibly(Φ) and Definitely(Φ) are based on
the lattice of consistent global states associated with the distributed compu-
tation. For every global state Σ in the lattice, there exists at least one run that
passes through Σ. Hence, if any global state in the lattice satisfies Φ, then
Possibly(Φ) holds. For example, in the global state lattice of Figure 4.16,
both Σ^{31} and Σ^{41} satisfy y − x = 2, meaning that Possibly(y − x = 2)
holds for the computation. The property Definitely(Φ) requires all possible
runs to pass through a global state that satisfies Φ. In Figure 4.16, the global
state Σ^{43} satisfies x = y. Since this is the only state of level 7, and all runs
must contain a global state of each level, Definitely(x = y) also holds for
the computation.
Figure 4.17 is a high-level procedure for detecting Possibly(Φ). The
procedure constructs the set of global states current with progressively
increasing levels (denoted by ℓ). When a member of current satisfies Φ, the
procedure terminates and indicates that Possibly(Φ) holds. If, however,
the procedure constructs the final global state (the global state in which the
computation terminates) and finds that this global state does not satisfy Φ,
then the procedure returns false: Possibly(Φ) does not hold.
In order to implement this procedure, the monitored processes send the
portion of their local states that is referenced in Φ to the monitor p_0. Process
p_0 maintains sequences of these local states, one sequence per process, and
uses them to construct the global states of a given level. The basic operation
used by the procedure is "current := states of level ℓ", and so we must be
able to determine when all of the global states of a given level can be
assembled, and we must be able to assemble them effectively.
Let Q_i be the sequence of local states, stored in FIFO order, that p_0 main-
tains for process p_i, where each state σ_i^k in Q_i is labeled with its vector
timestamp VC(σ_i^k). Define Σ_min(σ_i^k) to be the global state with the smallest
level containing σ_i^k and Σ_max(σ_i^k) to be the global state with the largest
level containing σ_i^k. For example, in Figure 4.16, Σ_min(σ_1^3) = Σ^{31} and
Σ_max(σ_1^3) = Σ^{33}. These states can be computed using Property 5 of vector
clocks as follows:

Σ_min(σ_i^k) = (σ_1^{c_1}, σ_2^{c_2}, ..., σ_n^{c_n}) where c_j = VC(σ_i^k)[j]

and

Σ_max(σ_i^k) = (σ_1^{c_1}, σ_2^{c_2}, ..., σ_n^{c_n}) where c_j = max { ℓ : VC(σ_j^ℓ)[i] ≤ VC(σ_i^k)[i] },

the maximum being taken over the states of p_j up to the state in which process p_j terminates.


Global states Σ_min(σ_i^k) and Σ_max(σ_i^k) bound the levels of the lattice in
which σ_i^k occurs. The minimum level containing σ_i^k is particularly easy
to compute: it is the sum of the components of the vector timestamp VC(σ_i^k).
Thus, p_0 can construct the set of states with level ℓ when, for each sequence
Q_i, the sum of the components of the vector timestamp of the last element
of Q_i is at least ℓ. For example, if p_0 monitors the computation shown in
Figure 4.16, then p_0 can start enumerating level 6 when it has received states
σ_1^5 and σ_2^4, because any global state containing σ_1^5 must have a level of at
least 8 (5 + 3) and any global state containing σ_2^4 must also have a level of
at least 8 (4 + 4). Similarly, process p_0 can remove state σ_i^k from Q_i when ℓ
is greater than the level of the global state Σ_max(σ_i^k). For the example in
Figure 4.16, Σ_max(σ_1^2) = Σ^{23}, and so p_0 can remove σ_1^2 from Q_1 once it has
set ℓ to 6.
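The level bookkeeping just described amounts to a few sums over vector timestamps. In this sketch (function names are illustrative), vector timestamps are plain tuples:

```python
def min_level(vc):
    # The smallest lattice level of any global state containing a local
    # state is the sum of the components of its vector timestamp.
    return sum(vc)

def can_enumerate(level, last_vcs):
    # p_0 can assemble all global states of `level` once the last state it
    # has received from every process has a vector-timestamp sum of at
    # least `level`: any state not yet received would have a strictly
    # larger sum and so cannot belong to a state of this level.
    return all(sum(vc) >= level for vc in last_vcs)
```

With the example values from the text, both σ_1^5 (timestamp (5, 3)) and σ_2^4 (timestamp (4, 4)) have minimum level 8, so level 6 can be enumerated.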
Given the set of states of some level ℓ, it is also straightforward (if some-
what costly) to construct the set of states of level ℓ + 1: for each state
(σ_1^{c_1}, σ_2^{c_2}, ..., σ_n^{c_n}) of level ℓ, one constructs the n global states
(σ_1^{c_1+1}, σ_2^{c_2}, ..., σ_n^{c_n}), ..., (σ_1^{c_1}, σ_2^{c_2}, ..., σ_n^{c_n+1}).

Then, Property 5 of vector clocks can be used to determine which of these



procedure Definitely(Φ);
var current, last: set of global states;
ℓ: integer;
begin
% Synchronize processes and distribute Φ
send Φ to all processes;
last := { Σ^{0...0} };
release processes;
remove all states in last that satisfy Φ;
ℓ := 1;
% Invariant: last contains all states of level ℓ − 1 that are reachable
% from Σ^{0...0} without passing through a state satisfying Φ
while (last ≠ ∅) do
current := states of level ℓ reachable from a state in last;
remove all states in current that satisfy Φ;
if current = {final global state} then return false;
ℓ := ℓ + 1;
last := current
od;
return true
end;

Figure 4.18. Algorithm for Detecting Definitely(Φ).

global states are consistent. One can be careful and avoid redundantly
constructing the same global state of level ℓ + 1 from different global states
of level ℓ, but the computation can still be costly.
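One level-by-level step of this construction can be sketched as follows, using the consistency test from Property 5 of vector clocks: a cut (c_1, ..., c_n) is consistent iff VC(σ_i^{c_i})[j] ≤ c_j for all i, j. The data layout (states as index tuples, timestamps as tuples indexed by state number) is an illustrative assumption.

```python
def consistent(state, vcs):
    # state: tuple (c_1, ..., c_n) of per-process state indices.
    # vcs[i][k]: vector timestamp of sigma_i^k, with VC(sigma_i^k)[i] = k.
    # The cut is consistent iff no component has seen more events of
    # another process than the cut includes.
    return all(vcs[i][ci][j] <= cj
               for i, ci in enumerate(state)
               for j, cj in enumerate(state))

def next_level(states, vcs):
    # From the consistent states of level l, build those of level l + 1 by
    # advancing one process at a time and keeping the consistent results.
    out = set()
    for state in states:
        for i in range(len(state)):
            if state[i] + 1 < len(vcs[i]):
                succ = state[:i] + (state[i] + 1,) + state[i + 1:]
                if consistent(succ, vcs):
                    out.add(succ)
    return out
```

For two processes where p_2's second event records receipt of p_1's first event, advancing from (0, 1) yields only (1, 1): the candidate (0, 2) is rejected as inconsistent, exactly the pruning the lattice construction relies on.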
Figure 4.18 gives the high-level algorithm used by the monitoring process
p_0 to detect Definitely(Φ). This algorithm iteratively constructs the set of
global states that have a given level and are reachable from the initial global
state without passing through a global state that satisfies Φ. If this set of
states becomes empty, then Definitely(Φ) holds; if this set contains only the
final global state, then Definitely(Φ) does not hold. Note that, unlike detecting
Possibly(Φ), not all global states need be examined. For example, in Figure 4.16,
suppose that when the global states of level 2 were constructed, it was determined
that Σ^{02} satisfied Φ. When constructing the states of level 3, global state Σ^{03}
need not be included, since it is reachable only through Σ^{02}.
The two detection algorithms are linear in the number of global states, but
unfortunately the number of global states is O(k^n), where k is the maximum
number of events a monitored process has executed. There are techniques that
can be used to limit the number of constructed global states. For example,
a process p_i need only send a message to p_0 when p_i potentially changes
the value of Φ or when p_i learns that another process has potentially changed it.
Another technique is for p_i to send an empty message to all other processes
when it potentially changes the value of Φ. These and other techniques for
limiting the number of global states are discussed in [19]. An alternative
approach is to restrict the global predicate to one that can be efficiently
detected, such as a conjunction and disjunction of local predicates [9].

4.15 Multiple Monitors


There are several good reasons for having multiple monitors observe the
same computation for the purposes of evaluating the same predicate [12].
One such reason is increased performance—in a large system, interested
parties may have the result of the predicate sooner by asking the monitor
that is closest to them.19 Another reason is increased reliability—if the
predicate encodes the condition guarding a critical system action (e.g.,
shutdown of a chemical plant), then having multiple monitors will ensure
the action despite a bounded number of failures.
The reactive-architecture solution to GPE based on passive observations
can be easily extended to multiple monitors without modifying its general
structure. The only change that is required is for the processes to use a
causal broadcast communication primitive to notify the group of moni-
tors [?]. In this manner, each monitor will construct a consistent, but not
necessarily the same, observation of the computation. Each observation
will correspond to a (possibly different) path through the global state
lattice of the computation. Whether the results of evaluating predicate Φ
by each monitor using local observations are meaningful depends on the
properties of Φ. In particular, if Φ is stable and some monitor observes
that it holds, then all monitors will eventually observe that it holds. For
example, in the case of deadlock detection with multiple monitors, if any
one of them detects a deadlock, eventually all of them will detect the same
deadlock since deadlock is a stable property. They may, however, disagree
on the identity of the process that is responsible for creating the deadlock
(the one that issued the last request forming the cycle in the WFG).
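As a sketch of how each monitor can build its own consistent observation from causally delivered notifications, the class below buffers incoming notification messages and delivers them using the standard vector-clock deliverability rule: deliver message m from process j once VC(m)[j] = D[j] + 1 and VC(m)[k] ≤ D[k] for all k ≠ j, where D counts messages already delivered per sender. The class and method names are illustrative, not from the chapter.

```python
class CausalMonitor:
    """A passive monitor that delivers notifications in causal order."""

    def __init__(self, n):
        self.delivered = [0] * n   # messages delivered so far, per sender
        self.buffer = []           # notifications not yet deliverable
        self.observation = []      # the (consistent) observation built

    def receive(self, sender, vc, state):
        # Buffer the notification, then deliver everything that is ready.
        self.buffer.append((sender, vc, state))
        self._drain()

    def _deliverable(self, sender, vc):
        # Standard causal-delivery condition on vector timestamps.
        return (vc[sender] == self.delivered[sender] + 1 and
                all(vc[k] <= self.delivered[k]
                    for k in range(len(vc)) if k != sender))

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for msg in list(self.buffer):
                sender, vc, state = msg
                if self._deliverable(sender, vc):
                    self.buffer.remove(msg)
                    self.delivered[sender] = vc[sender]
                    self.observation.append((sender, state))
                    progress = True
```

Two monitors fed the same causally related notifications in different arrival orders end up with the same observation here; for notifications that are concurrent, however, their observations may legitimately differ, which is precisely why nonstable predicates need the extra machinery discussed next.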
Multiple observations for evaluating nonstable predicates create prob-
lems similar to those discussed in Section 4.14.2. There are essentially two

19. In the limit, each process could act as a monitor such that the predicate could be
evaluated locally.
48 Chapter 4. Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms

possibilities for meaningfully evaluating nonstable predicates over multiple
observations. First, the predicate can be extended using Definitely or
Possibly such that it is made independent of the particular observation but
becomes a function of the computation, which is the same for all moni-
tors. Alternatively, the notification messages can be disseminated to the
group of monitors in a manner such that they all construct the same ob-
servation. This can be achieved by using a causal atomic broadcast primitive
that results in a unique total order consistent with causal precedence for
all messages (even those that are concurrent) at all destinations [6,?]. Note
that the resulting structure is quite similar to that proposed in Chapter
XXXstate-machines for handling replicated state machines.
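One well-known way to obtain such a unique total order (though not necessarily the construction of [6]) is a fixed sequencer that stamps every notification with a global sequence number; each monitor then delivers strictly in stamp order, buffering any gaps. A sketch of the delivery end, with hypothetical names:

```python
import heapq

class TotalOrderMonitor:
    """Delivers sequencer-stamped notifications in stamp order."""

    def __init__(self):
        self.next_seq = 1     # next sequence number expected
        self.pending = []     # min-heap of (seq, msg) not yet deliverable
        self.observation = []

    def receive(self, seq, msg):
        heapq.heappush(self.pending, (seq, msg))
        # Deliver every message whose turn has come, in stamp order.
        while self.pending and self.pending[0][0] == self.next_seq:
            _, m = heapq.heappop(self.pending)
            self.observation.append(m)
            self.next_seq += 1
```

Because delivery order is fixed by the stamps rather than by arrival order, all monitors construct the same observation, even for notifications that are causally concurrent.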
Now consider the case where the monitor is replicated for increased reli-
ability. If communication or processes in a distributed system are subject to
failures, then sending the same notification message to all monitor replicas
using causal delivery is not sufficient for implementing a causal broadcast
primitive. For example, if channels are not reliable, some of the notification
messages may be lost such that different monitors effectively observe
different computations. Again, we can accommodate communication and
process failures in our reactive architecture by using a reliable version of
causal broadcast as the communication primitive [?]. Informally, a reliable
causal broadcast, in addition to preserving causal precedence among mes-
sage send events, guarantees delivery of a message either to all or none of
the destination processes.20 A formal specification of reliable causal broad-
cast in the presence of failures has to be done with care and can be found
in Chapter XXXbcast.
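The all-or-none delivery guarantee is often obtained by having every process relay a message the first time it delivers it, so that a sender crashing mid-broadcast cannot leave only some of the correct destinations holding the message. The simulation below is a sketch under simplifying assumptions (synchronous calls stand in for the network, a crash is modeled as a flag); all names are hypothetical.

```python
class RBProcess:
    """Reliable broadcast by relay-on-first-delivery."""

    def __init__(self, pid, group):
        self.pid = pid
        self.group = group       # shared list of all group members
        self.delivered = set()
        self.crashed = False

    def broadcast(self, msg, reached=None):
        # 'reached' models a sender that crashes mid-broadcast and
        # manages to reach only a subset of the group.
        for p in (self.group if reached is None else reached):
            p.receive(msg)

    def receive(self, msg):
        if self.crashed or msg in self.delivered:
            return
        self.delivered.add(msg)
        # Relay on first delivery: if any correct process delivers the
        # message, every correct process eventually does.
        for p in self.group:
            if p is not self:
                p.receive(msg)
```

In the scenario below, the sender crashes after reaching a single destination, yet the relay step still propagates the message to every surviving member, giving the all-or-none behavior described above.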
Note that in an asynchronous system subject to failures, reliable causal
broadcast is the best one can hope for in that it is impossible to implement
communication primitives that achieve totally-ordered delivery using de-
terministic algorithms. Furthermore, in an environment where processes
may fail, and thus certain events never get observed, our notion of a consis-
tent global state may need to be re-examined. For some global predicates,
the outcome may be sensitive not only to the order in which events are ob-
served, but also to the order in which failures are observed by the monitors.
In such cases, it will be necessary to extend the causal delivery abstraction
to include not only actual messages but also failure notifications (as is done
in systems such as ISIS [2]).

20. Note that in Chapter XXXbcast, this primitive is called causal broadcast without the
reliable qualifier since all broadcast primitives are specified in the presence of failures.

4.16 Conclusions
We have used the GPE problem as a motivation for studying consistent
global states of distributed systems. Since many distributed systems prob-
lems require recognizing certain global conditions, constructing consistent
global states and evaluating predicates over them constitute fundamental
primitives. We derived two classes of solutions to the GPE problem: one
based on distributed snapshots and one based on passive observations. In
doing so, we have developed a set of concepts and mechanisms for rep-
resenting and reasoning about computations in asynchronous distributed
systems. These concepts generalize the notion of time in order to cope with
the uncertainty that is inherent to such systems.
We illustrated the practicality of our mechanisms by applying them
to distributed deadlock detection and distributed debugging. Reactive-
architecture solutions based on passive observations were shown to be
more flexible. In particular, these solutions can be easily adapted to deal
with nonstable predicates, multiple observations and failures. Each ex-
tension can be easily accommodated by using an appropriate communi-
cation primitive for notifications, leaving the overall reactive architecture
unchanged.

Acknowledgments The material on distributed debugging is derived from
joint work with Robert Cooper and Gil Neiger. We are grateful to them
for consenting to its inclusion here. The presentation has benefited greatly
from extensive comments by Friedemann Mattern, Michel Raynal and Fred
Schneider on earlier drafts.
Bibliography

[1] Baruch Awerbuch. Complexity of network synchronization. Journal of
the ACM, 32(4):804–823, October 1985.
[2] K. Birman. The process group approach to reliable distributed comput-
ing. Technical Report TR91-1216, Department of Computer Science,
Cornell University, January 1993. To appear in Communications of the
ACM.
[3] K. Birman, A. Schiper, and P. Stephenson. Lightweight causal and
atomic group multicast. ACM Transactions on Computer Systems,
9(3):272–314, August 1991.
[4] K. Mani Chandy and Leslie Lamport. Distributed snapshots: De-
termining global states of distributed systems. ACM Transactions on
Computer Systems, 3(1):63–75, February 1985.
[5] Robert Cooper and Keith Marzullo. Consistent detection of global
predicates. In ACM/ONR Workshop on Parallel and Distributed Debug-
ging, pages 163–173, Santa Cruz, California, May 1991.
[6] Flaviu Cristian, H. Aghili, H. Ray Strong, and Danny Dolev. Atomic
broadcast: From simple message diffusion to Byzantine agreement. In
Proceedings of the International Symposium on Fault-Tolerant Computing,
pages 200–206, Ann Arbor, Michigan, June 1985. A revised version
appears as IBM Technical Report RJ5244.
[7] D. Dolev, J.Y. Halpern, and R. Strong. On the possibility and impos-
sibility of achieving clock synchronization. In Proceedings of the ACM
Symposium on the Theory of Computing, pages 504–511, April 1984.
[8] C. J. Fidge. Timestamps in message-passing systems that preserve the
partial ordering. In Eleventh Australian Computer Science Conference,
pages 55–66, University of Queensland, February 1988.
[9] V. K. Garg and B. Waldecker. Unstable predicate detection in dis-
tributed programs. Technical Report TR-92-07-82, University of Texas
at Austin, March 1992.


[10] V. Gligor and S. Shattuck. On deadlock detection in distributed sys-
tems. IEEE Transactions on Software Engineering, SE-6:435–440, Septem-
ber 1980.
[11] David Harel and Amir Pnueli. On the development of reactive sys-
tems. In K. R. Apt, editor, Logics and Models of Concurrent Systems,
NATO ASI, pages 477–498. Springer-Verlag, 1985.
[12] J. Helary, C. Jard, N. Plouzeau, and M. Raynal. Detection of sta-
ble properties in distributed applications. In Proceedings of the ACM
Symposium on Principles of Distributed Computing, pages 125–136, Van-
couver, British Columbia, August 1987.
[13] M. Frans Kaashoek and Andrew S. Tanenbaum. Group communica-
tion in the Amoeba distributed operating system. In Proceedings of the
Eleventh International Conference on Distributed Computer Systems, pages
222–230, Arlington, Texas, May 1991. IEEE Computer Society.
[14] Hermann Kopetz. Sparse time versus dense time in distributed real-
time systems. In Proceedings of the Twelfth International Conference on
Distributed Computing Systems, pages 460–467, Yokohama, Japan, June
1992. IEEE Computer Society.
[15] Leslie Lamport. Time, clocks, and the ordering of events in a dis-
tributed system. Communications of the ACM, 21(7):558–565, July 1978.
[16] Leslie Lamport and P. M. Melliar-Smith. Synchronizing clocks in the
presence of faults. Journal of the ACM, 32(1):52–78, January 1985.
[17] H. M. Levy and E. D. Tempero. Modules, objects, and distributed
programming: Issues in RPC and remote object invocation. Software -
Practice and Experience, 21(1):77–90, January 1991.
[18] Kai Li and Paul Hudak. Memory coherence in shared virtual memory
systems. ACM Transactions on Computer Systems, 7(4):321–359, Novem-
ber 1989.
[19] Keith Marzullo and Gil Neiger. Detection of global state predicates. In
Proceedings of the Fifth International Workshop on Distributed Algorithms
(WDAG-91), Lecture Notes on Computer Science. Springer-Verlag,
Delphi, Greece, October 1991.
[20] Friedemann Mattern. Virtual time and global states of distributed
systems. In Michel Cosnard et al., editor, Proceedings of the International
Workshop on Parallel and Distributed Algorithms, pages 215–226. North-
Holland, October 1989.
[21] Friedemann Mattern. Efficient algorithms for distributed snapshots
and global virtual time approximation. Journal of Parallel and Dis-
tributed Computing, 1993. To appear.

[22] J. Misra. Distributed discrete-event simulation. ACM Computing Sur-
veys, 18(1):39–65, March 1986.
[23] Carroll Morgan. Global and logical time in distributed algorithms.
Information Processing Letters, 20:189–194, May 1985.
[24] Gil Neiger and Sam Toueg. Substituting for real time and common
knowledge in asynchronous distributed systems. In Proceedings of the
ACM Symposium on Principles of Distributed Computing, pages 281–293,
Vancouver, British Columbia, August 1987.
[25] Larry L. Peterson, Nick C. Bucholz, and Richard D. Schlichting. Pre-
serving and using context information in interprocess communication.
ACM Transactions on Computer Systems, 7(3):217–246, August 1989.
[26] M. Raynal. About logical clocks for distributed systems. Operating
Systems Review, 26(1):41–48, January 1992.
[27] M. Raynal, A. Schiper, and S. Toueg. The causal ordering abstrac-
tion and a simple way to implement it. Information Processing Letters,
39(6):343–350, September 1991.
[28] A. Sandoz and A. Schiper. A characterization of consistent distributed
snapshots using causal order. Technical Report 92-14, Departement
d’Informatique, Ecole Polytechnique Federale de Lausanne, Switzer-
land, October 1992.
[29] A. Schiper, J. Eggli, and A. Sandoz. A new algorithm to implement
causal ordering. In J.-C. Bermond and M. Raynal, editors, Proceedings
of the Third International Workshop on Distributed Algorithms, volume
392 of Lecture Notes on Computer Science, pages 219–232, Nice, France,
September 1989. Springer-Verlag.
[30] Reinhard Schwarz and Friedemann Mattern. Detecting causal relation-
ships in distributed computations: In search of the Holy Grail. Techni-
cal Report SFB124-15/92, Department of Computer Science, Univer-
sity of Kaiserslautern, Kaiserslautern, Germany, December 1992.
[31] Kim Taylor. The role of inhibition in asynchronous consistent-cut
protocols. In J.-C. Bermond and M. Raynal, editors, Proceedings of
the Third International Workshop on Distributed Algorithms, volume 392
of Lecture Notes on Computer Science, pages 280–291. Springer-Verlag,
Nice, France, September 1989.
