The Fault Detection Problem
1 Introduction
Handling faults is a key challenge in building reliable distributed systems. There
are two main approaches to this problem: Fault masking aims to hide the symp-
toms of a limited number of faults, so that users can be provided with correct
service in the presence of faults [4, 14], whereas fault detection aims at identi-
fying the faulty components, so that they can be isolated and repaired [7, 10].
These approaches are largely complementary. In this paper, we focus on fault
detection.
Fault detection has been extensively studied in the context of “benign” crash
faults, where it is assumed that a faulty component simply stops taking steps of
its algorithm [5, 6]. However, this assumption does not always hold in practice;
in fact, recent studies have shown that general faults (also known as Byzantine
faults [15]) can have a considerable impact on practical systems [17]. Thus, it
would be useful to apply fault detection to a wider class of faults. So far, very
little is known about this topic; there is a paper by Kihlstrom et al. [12] that
discusses Byzantine fault detectors for consensus and broadcast protocols, and
there are several algorithms for detecting certain types of non-crash faults, such
as PeerReview [10] and SUNDR [16]. However, many open questions remain; for
example, we still lack a formal characterization of the types of non-crash faults
that can be detected in general, and nothing is known about inherent costs of
detection.
This paper is a first step towards a better understanding of general fault
detection. We propose a formal model that allows us to formulate the fault
detection problem for arbitrary faults, including non-crash faults. We introduce
the notion of a fault class that captures a set of faults, i.e., deviations of system
components from their expected behavior. Solving the fault detection problem
for a fault class F means finding a transformation τF that, given any algorithm
A, constructs an algorithm Ā (called an extension of A) that works exactly like A
but does some additional work to identify and expose faulty nodes. Whenever a
fault instance from the class F appears, Ā must expose at least one faulty suspect
(completeness), it must not expose any correct nodes infinitely long (accuracy),
and, optionally, it may ensure that all correct nodes expose the same faulty
suspects (agreement).
Though quite weak, our definition of the fault detection problem still allows
us to answer two specific questions: Which faults can be detected, and how much
extra work does fault detection require from the extension? To answer the
first question, we show that the set of all fault instances can be divided into
four non-overlapping classes, and that the fault detection problem can be solved
for exactly two of them, which we call commission faults and omission faults.
Intuitively, a commission fault exists when a node sends messages a correct
node would not send, whereas an omission fault exists when a node does not
send messages a correct node would send.
To answer the second question, we study the message complexity of the fault
detection problem, that is, the ratio between the number of messages sent by the
most efficient extension and the number of messages sent by the original algo-
rithm. We derive tight lower bounds on the message complexity for commission
and omission faults, with and without agreement. Our results show that a) the
message complexity for omission faults is higher than that for commission faults,
and that b) the message complexity is (optimally) linear in the number of nodes
in the system, except when agreement is required for omission faults, in which
case it is quadratic in the number of nodes.
In summary, this paper makes the following four contributions: (1) a formal
model of a distributed system in which various kinds of faults can be selectively
analyzed, (2) a statement of the fault detection problem for arbitrary faults, (3) a
complete classification of all possible faults, including a precise characterization
of the set of faults for which the fault detection problem can be solved, and (4)
tight lower bounds on the message complexity of the fault detection problem.
Viewed collectively, our results constitute a first step toward understanding the
power and the inherent costs of fault detection in a distributed system.
The rest of this paper is organized as follows: We begin by introducing our
system model in Section 2 and then formally state the fault detection problem
in Section 3. In Section 4, we present our classification of faults, and we show for
which classes the fault detection problem can be solved. In Section 5, we derive
tight bounds on the message complexity, and we conclude by discussing related
work in Section 6 and future work in Section 7. Omitted proofs can be found in
the full version of this paper, which is available as a technical report [9].
2 Preliminaries
Let N be a set of nodes. Each node has a terminal and a network interface.
It can communicate with the other nodes by sending and receiving messages
over the network, and it can send outputs to, and receive inputs from, its local
terminal. We assume that processing times are negligible; when a node receives
an input, it can produce a response immediately.
Each message m has a unique source src(m) ∈ N and a unique destination
dest (m) ∈ N . We assume that messages are authenticated; that is, each node i
can initially create only messages m with src(m) = i, although it can delegate
this capability to other nodes (e.g., by revealing its key material). Nodes can
also forward messages to other nodes and include messages in other messages
they send, and we assume that a forwarded or included message can still be
authenticated.
A computation unfolds in discrete events. An event is a tuple (i, I, O), where
i ∈ N is a node on which the event occurs, I is a set of inputs (terminal inputs
or messages) that i receives in the event, and O is a set of outputs (terminal
outputs or messages) that i produces in the event. An execution e is a sequence
of events (i1 , I1 , O1 ), (i2 , I2 , O2 ), . . .. We write e|S for the subsequence of e that
contains the events with ik ∈ S; for i ∈ N, we abbreviate e|{i} as e|i . When a
finite execution e is a prefix of another execution e′ , we write e ⇂ e′ . Finally, we
write |e| to denote the number of unique messages that are sent in e.
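To make these definitions concrete, the following Python sketch shows one possible encoding of events, executions, the restriction e|S, and the quantity |e|. The class and field names are our own illustration and not part of the formal model.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Illustrative encoding only: messages are (src, dest, payload) triples,
# terminal inputs/outputs are plain strings.
Message = Tuple[str, str, str]

@dataclass(frozen=True)
class Event:
    node: str                          # the node i on which the event occurs
    inputs: FrozenSet = frozenset()    # terminal inputs and received messages
    outputs: FrozenSet = frozenset()   # terminal outputs and sent messages

def restrict(e: List[Event], S: set) -> List[Event]:
    """e|S: the subsequence of e containing the events of nodes in S."""
    return [ev for ev in e if ev.node in S]

def message_count(e: List[Event]) -> int:
    """|e|: the number of unique messages sent in e."""
    return len({m for ev in e for m in ev.outputs if isinstance(m, tuple)})
```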
A system is modeled as a set of executions. In this paper, we assume that
the network is reliable, that is, a) a message is only received if it has previously
been sent at least once, and b) a message that is sent is eventually received at
least once. Formally, we assume that, for every execution e of the system and
every message m:
– recv(i, m) ∈ e iff m is a message with i = dest(m) and (i, I, O) ∈ e with m ∈ I.
– send(i, m, j) ∈ e iff m is a message with j = dest(m) and (i, I, O) ∈ e with m ∈ O.
– in(i, t) ∈ e if t is a terminal input and (i, I, O) ∈ e with t ∈ I.
– out(i, t) ∈ e if t is a terminal output and (i, I, O) ∈ e with t ∈ O.
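As an illustration, the predicates above and the reliability assumption can be approximated on a finite trace as follows. The encoding (events as (node, inputs, outputs) triples, messages as (src, dest, payload) triples) is our own convention, and condition b) can only be checked on a finite prefix, whereas the model requires it eventually.

```python
# Sketch over a toy encoding: events are (node, inputs, outputs) triples,
# messages are (src, dest, payload) triples. Illustrative names only.

def recv_facts(e):
    """All pairs (i, m) such that recv(i, m) ∈ e."""
    return {(i, m) for (i, I, O) in e for m in I
            if isinstance(m, tuple) and m[1] == i}

def send_facts(e):
    """All pairs (i, m) such that send(i, m, dest(m)) ∈ e."""
    return {(i, m) for (i, I, O) in e for m in O if isinstance(m, tuple)}

def network_reliable_so_far(e):
    """Finite-trace approximation of the reliability assumption:
    a) every received message was sent at least once (ordering ignored here),
    b) every sent message has been received at least once (only required
       eventually in an infinite execution)."""
    sent = {m for (_, m) in send_facts(e)}
    received = {m for (_, m) in recv_facts(e)}
    return received <= sent and sent <= received
```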
2.3 Extensions
(Ā, A, µm, µs, XO) is called a reduction of an algorithm Ā = (M̄, T̄I, T̄O, Σ̄, σ̄0, ᾱ) to an algorithm A = (M, TI, TO, Σ, σ0, α) iff µm is a total map M̄ ↦ P(M), µs is a total map Σ̄ ↦ Σ, and the following conditions hold:
X1 T̄I = TI, that is, A accepts the same terminal inputs as Ā;
X2 T̄O = TO ∪ XO and TO ∩ XO = ∅, that is, A produces the same terminal outputs as Ā, except those in XO;
X3 µs(σ̄0) = σ0, that is, the initial state of Ā maps to the initial state of A;
X4 ∀m ∈ M ∃ m̄ ∈ M̄ : µm(m̄) = m, that is, every message of A has at least one counterpart in Ā;
X5 ∀σ ∈ Σ ∃ σ̄ ∈ Σ̄ : µs(σ̄) = σ, that is, every state of A has at least one counterpart in Σ̄;
X6 ∀σ̄1, σ̄2 ∈ Σ̄, m̄i, m̄o ⊆ M̄, ti ⊆ TI, to ⊆ T̄O : [ᾱ(σ̄1, m̄i ∪ ti) = (σ̄2, m̄o ∪ to)] ⇒ [α(µs(σ̄1), µm(m̄i) ∪ ti) = (µs(σ̄2), µm(m̄o) ∪ (to \ XO))], that is, there is a homomorphism between ᾱ and α.
Given a reduction (Ā, A, µm, µs, XO), each execution ē of Ā is mapped to an execution µe(ē) of A as follows:
1. Start with e = ∅.
2. For each new event (i, Ī, Ō), perform the following steps:
(a) Compute I := (Ī ∩ TIi) ∪ µm(Ī ∩ M̄) and O := (Ō ∩ TOi) ∪ µm(Ō ∩ M̄).
(b) Remove from I any m ∈ M with dest(m) ≠ i or recv(i, m) ∈ e.
(c) Remove from O any m ∈ M with send(i, m, j) ∈ e.
(d) For each node j ∈ N, compute Oj := {m ∈ O | src(m) = j}.
(e) If I ≠ ∅ or Oi ≠ ∅, append (i, I, Oi) to e.
(f) For each j ≠ i with Oj ≠ ∅, append (j, ∅, Oj) to e.
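The following Python sketch mirrors steps 1–2(f) for a finite trace. The representation of messages as (src, dest, payload) triples, the parameter names, and the treatment of terminal outputs (attributed to the node that produces them) are our own assumptions, not part of the definition.

```python
# e_bar is a list of (node, inputs, outputs) triples of the extension;
# TI, TO, M are the terminal inputs, terminal outputs and messages of A;
# M_bar the messages of A-bar; mu_m maps an extension message to a set
# of messages of A; N is the set of nodes.

def map_execution(e_bar, TI, TO, M, M_bar, mu_m, N):
    e = []                                          # step 1
    already_recv = set()                            # pairs (i, m) received so far in e
    already_sent = set()                            # messages sent so far in e
    for (i, I_bar, O_bar) in e_bar:                 # step 2
        # (a) keep A's terminal I/O, translate extension messages via mu_m
        I = {x for x in I_bar if x in TI} | {m for x in I_bar if x in M_bar for m in mu_m(x)}
        O = {x for x in O_bar if x in TO} | {m for x in O_bar if x in M_bar for m in mu_m(x)}
        # (b) drop misdelivered messages and duplicate receptions
        I = {m for m in I if m not in M or (m[1] == i and (i, m) not in already_recv)}
        # (c) drop messages that were already sent earlier in e
        O = {m for m in O if m not in M or m not in already_sent}
        # (d) group the remaining messages of A by their source
        by_src = {j: {m for m in O if m in M and m[0] == j} for j in N}
        O_i = by_src[i] | {x for x in O if x in TO}
        # (e) record i's own event ...
        if I or O_i:
            e.append((i, I, O_i))
        # (f) ... and attribute the other messages to their sources
        for j in N:
            if j != i and by_src[j]:
                e.append((j, set(), by_src[j]))
        already_recv |= {(i, m) for m in I if m in M}
        already_sent |= {m for m in O if m in M}
    return e
```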
A simple example of a reduction is the identity (A, A, id, id, ∅). Note that there
is a syntactic correspondence between an extension and its original algorithm,
not just a semantic one. In other words, the extension not only solves the same
problem as the original algorithm (by producing the same terminal outputs as
the original), it also solves it in the same way (by sending the same messages in
the same order). Recall that our goal is to detect whether or not the nodes in
the system are following a given algorithm; we are not trying to find a better
algorithm. Next, we state a few simple lemmas about extensions.
Lemma 1. Let Ā and A be two algorithms for which a reduction (Ā,A,µm ,µs ,XO)
exists. Then, if ē is an execution in which a node i is correct with respect to Ā,
i is correct in µe (ē) with respect to A.
Note that, if a node i is correct in ē with respect to Ā, then it must be correct in
µe (ē) with respect to A, but the reverse is not true. In other words, it is possible
for a node i to be faulty in ē with respect to Ā but still be correct in µe (ē) with
respect to A.
Lemma 2. Let Ā and A be two algorithms for which a reduction (Ā,A,µm ,µs ,XO)
exists, let ē1 be an execution of Ā, and let ē2 be a prefix of ē1. Then µe(ē2) is a
prefix of µe (ē1 ).
Lemma 3. Let Ā and A be two algorithms for which a reduction (Ā,A,µm ,µs ,XO)
exists, and let e be an execution of A. Then there exists an execution ē of Ā such
that a) µe (ē) = e (modulo duplicate messages sent by faulty nodes in e), and b)
a node i is correct in ē with respect to Ā iff it is correct in e with respect to A.
Fig. 1. Example scenario. Nodes F and H are supposed to each send a number between
1 and 10 to D, who is supposed to add the numbers and send the result to K. If K
receives 23, it knows that at least one of the nodes in S = {D, F, H} must be faulty,
but it does not know which ones, or how many.
The size of S indicates how precisely the fault can be localized. The best case is |S| = 1; this indicates that
the fault can be traced to exactly one node. The worst case is S = N \ C; this
indicates that the nodes in C know that a fault exists somewhere, but they are
unable to localize it.
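The scenario in Figure 1 can be phrased as a short sanity check. The function below is purely illustrative (the node names and the fixed ranges are taken from the caption); it only shows why K cannot narrow the suspect set below S = {D, F, H}.

```python
def k_suspects(reported_sum: int) -> set:
    """Toy version of Figure 1: F and H each send a number in 1..10 to D,
    who is supposed to forward their sum to K. A correct run therefore
    yields a sum in 2..20; anything else proves that some node in
    {D, F, H} is faulty, but K cannot tell which one (or how many)."""
    if 2 <= reported_sum <= 20:
        return set()                 # consistent with some correct execution
    return {"D", "F", "H"}           # smallest suspect set K can justify

assert k_suspects(23) == {"D", "F", "H"}   # the value 23 shown in Figure 1
```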
2.6 Environments
Our formulation of the fault detection problem does not require a bound on the
number of faulty nodes. However, if such a bound is known, it is possible to find
solutions with a lower message complexity. To formalize this, we use the notion
of an environment, which is a restriction on the fault patterns that may occur in
a system. In this paper, we specifically consider environments Ef , in which the
total number of faulty nodes is limited to f . If a system in environment Ef is
assigned a distributed algorithm A, the only executions that can occur are those
in which at most f nodes are faulty with respect to A.
We also consider the fault detection problem with agreement, which additionally requires that all correct nodes eventually expose the same set of faulty suspects.
Note that condition C2 does not require us to detect nodes that are faulty in
ē with respect to Ā, but correct in µe (ē) with respect to A. Thus, we avoid the
infinite recursion that would result from trying to detect faults in the detector
itself. Note also that condition C3 is weaker than the definition of eventual strong
accuracy in [6], which requires that correct nodes eventually output only faulty
nodes. This change is necessary to make the problem solvable in an asynchronous
environment.
In the rest of this paper, we assume that the only facts for which evidence
can exist are a) message transmissions, and b) message receptions. Specifically,
a properly authenticated message m̄ with µm (m̄) = m and src(m) = i in an
execution ē is evidence of a fact {e | send(i, m, dest (m)) ∈ e} about µe (ē), and a
properly authenticated message m̄′ with src(m̄′ ) = i, m ∈ m̄′ , and dest (m) = i
in an execution ē is evidence of a fact {e | recv(i, m) ∈ e} about µe (ē). Note
that in some systems it may be possible to construct evidence of additional
facts (e.g., when the system has more synchrony or access to more sophisticated
cryptographic primitives). In such systems, the following results may not apply.
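The two kinds of evidence can be read off a message mechanically. The sketch below assumes a toy message representation (fields src, mapped for µm(m̄), and included for forwarded messages) that is our own and not the paper's notation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Message = Tuple[str, str, str]          # (src, dest, payload) of the original algorithm A

@dataclass(frozen=True)
class ExtMessage:
    src: str                            # authenticated sender of the extension message
    mapped: FrozenSet = frozenset()     # mu_m(m_bar): messages of A it corresponds to
    included: FrozenSet = frozenset()   # messages of A forwarded inside it

def facts(m_bar: ExtMessage):
    """Facts about mu_e(e_bar) certified by a properly authenticated m_bar."""
    # m_bar with m in mu_m(m_bar) and src(m) = i is evidence of send(i, m, dest(m)).
    for m in m_bar.mapped:
        yield ("send", m[0], m)
    # an included m with dest(m) = src(m_bar) is evidence of recv(src(m_bar), m).
    for m in m_bar.included:
        if m[1] == m_bar.src:
            yield ("recv", m_bar.src, m)
```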
4.1 Definitions
Fig. 2. Classification of all fault instances, based on whether π(A, φ±(C, e), N), π(A, φ±(C, e), C ∪ S), and π(A, φ+(C, e), C ∪ S) are empty. The fault detection problem cannot be solved for fault instances in FNO (Theorem 2) or FAM (Theorem 3), but solutions exist for FOM and FCO (Theorem 4).
Intuitively, φ+(C, e) represents knowledge about all the messages sent or received in e by the nodes in C, while φ−(C, e) effectively represents knowledge about all the messages not sent or received in e by the nodes in C.
We also define the plausibility map π as follows. Let A be a distributed
algorithm, Z a set of facts, and C a set of nodes. Then π(A, Z, C) represents all
infinite executions e ∈ Z in which each node c ∈ C is correct in e with respect
to A. Intuitively, π(A, Z, C) is the set of executions of A that are plausible given
the facts in Z, and given that (at least) the nodes in C are correct.
A few simple properties of φ and π are: 1) C1 ⊆ C2 ⇒ φ(C2 , e) ⊆ φ(C1 , e),
that is, adding evidence from more nodes cannot reduce the overall knowledge;
2) p1 ⇂ p2 ⇒ φ(C, p2 ) ⊆ φ(C, p1 ), that is, knowledge can only increase during
an execution; 3) C1 ⊆ C2 ⇒ π(A, Z, C2 ) ⊆ π(A, Z, C1 ), that is, assuming that
more nodes are correct can only reduce the number of plausible executions; and
4) Z1 ⊆ Z2 ⇒ π(A, Z1 , C) ⊆ π(A, Z2 , C), that is, learning more facts can only
reduce the number of plausible executions.
FNO is the class of non-observable faults. For executions in this class, the nodes
in C cannot even be sure that the system contains any faulty nodes, since there
exists a correct execution of the entire system that is consistent with everything
they see. We will show later in this section that the fault detection problem
cannot be solved for faults in this class.
FAM is the class of ambiguous fault instances. When a fault instance is in
this class, the nodes in C know that a faulty node exists, but they cannot be
sure that it is one of the nodes in S. We will show later that the fault detection
problem cannot be solved for fault instances in this class. Note that the problem
here is not that the faults cannot be observed from C, but that the set S is too
small. If S is sufficiently extended (e.g., to N \ C), these fault instances become
solvable.
FOM is the class of omission faults. For executions in this class, the nodes in
C could infer that one of the nodes in S is faulty if they knew all the facts, but
the positive facts alone are not sufficient; that is, they would also have to know
that some message was not sent or not received. Intuitively, this occurs when
the nodes in S refuse to send some message they are required to send.
FCO is the class of commission faults. For executions in this class, the nodes
in C can infer that one of the nodes in S is faulty using only positive facts.
Intuitively, this occurs when the nodes in S send some combination of messages
they would never send in any correct execution.
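The four classes can be summarized as a decision cascade (cf. Figure 2). The sketch below is purely schematic: phi_pm and phi_plus stand for the facts derivable from all evidence and from positive evidence only, and plausible(A, Z, Cset) stands for the test "π(A, Z, Cset) ≠ ∅"; none of these oracles is computable in general.

```python
def classify(A, C, S, e, N, phi_pm, phi_plus, plausible):
    """Schematic classification of a fault instance (A, C, S, e)."""
    if plausible(A, phi_pm(C, e), N):
        return "F_NO"   # a correct execution of the entire system is still plausible
    if plausible(A, phi_pm(C, e), C | S):
        return "F_AM"   # a fault is visible, but it cannot be pinned to S
    if plausible(A, phi_plus(C, e), C | S):
        return "F_OM"   # only negative facts expose the fault: omission
    return "F_CO"       # positive facts alone expose the fault: commission
```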
Theorem 1. (FNO, FAM, FOM, FCO) is a partition of the set of all fault in-
stances.
Proof. First, we show that no fault instance can belong to more than one class.
Suppose ψ := (A, C, S, e) ∈ FNO; that is, there is a plausible correct exe-
cution e′ of the entire system. Then ψ can obviously not be in FAM , since
π(A, φ± (C, e), N ) cannot be both empty and non-empty. Since all nodes are
correct in e′, the nodes in C ∪ S in particular are also correct, so ψ ∉ FOM
(Section 4.1, Property 3), and they are still correct if negative facts are ignored,
so ψ ∉ FCO. Now suppose ψ ∈ FAM. Obviously, ψ cannot be in FOM, since
π(A, φ± (C, e), C ∪ S) cannot be both empty and non-empty. But ψ cannot be
in FCO either, since using fewer facts can only increase the number of plausible
executions (Section 4.1, Property 1). Finally, observe that ψ cannot be in both
FOM and FCO , since π(A, φ+ (C, e), C ∪S) cannot be both empty and non-empty.
It remains to be shown that any fault instance belongs to at least one of the
four classes. Suppose there is a fault instance ψ ∉ (FNO ∪ FAM ∪ FOM ∪ FCO).
Since ψ is not in FNO, we know that π(A, φ±(C, e), N) = ∅. But if this is true
and ψ is not in FAM , it follows that π(A, φ± (C, e), C ∪ S) = ∅. Given this and
that ψ is not in FOM , we can conclude that π(A, φ+ (C, e), C ∪ S) = ∅. But then
ψ would be in FCO , which is a contradiction.
Theorem 2. The fault detection problem cannot be solved for any fault class F
with F ∩ FNO ≠ ∅.
Proof sketch. The proof works by showing that, for any fault instance ψ :=
(A, C, S, e) ∈ FNO, we can construct two executions ēgood and ēbad of Ā := τ(A)
such that a) all the nodes are correct in ēgood , b) the fault occurs in ēbad , and
c) the two executions are indistinguishable from the perspective of the nodes
in C (that is, ēgood |C = ēbad |C ). Hence, the nodes in C would have to both
expose some node in S (to achieve completeness in ēbad ) and not expose any
node in S (to achieve accuracy in ēgood ) based on the same information, which
is impossible. For the full proof, see [9]. □
Theorem 3. The fault detection problem cannot be solved for any fault class F
with F ∩ FAM ≠ ∅.
Proof sketch. The proof is largely analogous to that of Theorem 2, except that
we now construct two executions ē∈S and ē∉S of Ā := τ(A) such that a) in ē∈S
the faulty node is a member of S, b) in ē∉S all the nodes in S are correct, and c)
the two executions are indistinguishable from C. For the full proof, see [9]. □
Corollary 1. If the fault detection problem can be solved for a fault class F ,
then F ⊆ FOM ∪ FCO .
Theorem 4. There is a solution to the fault detection problem with agreement
for the fault class FOM ∪ FCO .
For a transformation that solves the fault detection problem for this class, please
refer to the proof of Theorem 8 (Section 5.2) that appears in [9].
5 Message complexity
In this section, we investigate how expensive it is to solve the fault detection
problem, that is, how much additional work is required to detect faults. The
metric we use is the number of messages that must be sent by correct nodes.
(Obviously, the faulty nodes can send arbitrarily many messages). Since the
answer clearly depends on the original algorithm and on the actions of the faulty
nodes in a given execution, we focus on the following two questions: First, what
is the maximum number of messages that may be necessary for some algorithm,
and second, what is the minimum number of messages that is sufficient for any
algorithm?
5.1 Definitions
If τ is a solution of the fault detection problem, we say that the message complex-
ity γ(τ ) of τ is the largest number such that for all k, there exists an algorithm
A, an execution e of A, and an execution ē of τ (A) such that
(µe(ē) = e) ∧ (|e| ≥ k) ∧ |{m̄ | send(i, m̄, j) ∈ ē ∧ i ∈ corr(τ(A), ē)}| / |e| ≥ γ(τ)
In other words, the message complexity is the maximum number of messages that
must be sent by correct nodes in any ē per message sent in the corresponding
e := µe (ē). The message complexity of the fault detection problem as a whole is
the minimum message complexity over all solutions.
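For a single pair of finite traces, the ratio inside the definition can be evaluated directly. The helper below uses the toy event encoding from the earlier sketches (events as (node, inputs, outputs) triples, messages as tuples) and only computes one sample point, whereas γ(τ) is a worst case over all algorithms and executions.

```python
def sample_ratio(e_bar, e, correct_nodes):
    """|{m̄ sent by correct nodes in ē}| / |e| for one concrete pair (ē, e).
    Assumes e contains at least one message."""
    sent_by_correct = {m for (i, _, O) in e_bar if i in correct_nodes
                       for m in O if isinstance(m, tuple)}
    unique_in_e = {m for (_, _, O) in e for m in O if isinstance(m, tuple)}
    return len(sent_by_correct) / len(unique_in_e)
```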
Theorem 5. Any solution τ of the fault detection problem for FCO in the en-
vironment Ef has message complexity γ(τ ) ≥ f + 2, provided that f + 2 < |N |.
Theorem 6. The message complexity of the fault detection problem with agree-
ment for FCO in the environment Ef is at most f + 2, provided that f + 2 < |N |.
Corollary 2. The message complexity of the fault detection problem (with or
without agreement) for FCO in environment Ef is f +2, provided that f +2 < |N |.
Theorem 7. Any solution τ of the fault detection problem for FOM in the en-
vironment Ef has message complexity γ(τ ) ≥ 3f + 4, provided that f + 2 < |N |.
Theorem 8. The message complexity of the fault detection problem for FOM in
the environment Ef is at most 3f + 4, provided that f + 2 < |N |.
Interestingly, if we additionally require agreement, then the optimal message
complexity of the fault detection problem with respect to omission faults is
quadratic in |N |, under the condition that at least half of the nodes may fail.
Intuitively, if a majority of N is known to be correct, it should be possible to
delegate fault detection to a set ω with |ω| = 2f + 1, and to have the remaining
nodes follow the majority of ω. This would reduce the message complexity to
approximately |N | · (2f + 1).
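To make the contrast between the two regimes concrete (the numbers below are our own illustration, not taken from the paper): with |N| = 20 nodes, a correct majority and f = 6 would allow the delegation scheme sketched above, whereas once more than half of the nodes may fail, Theorem 9 applies.

```latex
% Illustrative numbers (ours), |N| = 20:
\[
  \underbrace{|N|\cdot(2f+1)}_{\text{delegation, } f = 6} = 20 \cdot 13 = 260
  \qquad\text{vs.}\qquad
  \underbrace{(|N|-1)^2}_{\text{Theorem 9}} = 19^2 = 361 .
\]
```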
Theorem 9. Any solution τ of the fault detection problem with agreement for
FOM in the environment Ef has message complexity γ(τ) ≥ (|N| − 1)², provided
that (|N| − 1)/2 < f < |N| − 2.
Proof sketch. In contrast to commission faults, there is no self-contained proof of
an omission fault; when a node is suspected of having omitted a message m, the
suspicion can always turn out to be groundless when m eventually arrives. We
show that, under worst-case conditions, such a ‘false positive’ can occur after
every single message. Moreover, since agreement is required, a correct node must
not suspect (or stop suspecting) another node unless every other correct node
eventually does so as well. Therefore, after each message, the correct nodes may
have to ensure that their own evidence is known to all the other correct nodes,
which in the absence of a correct majority requires reliable broadcast and thus
at least (|N| − 1)² messages. For the full proof, see [9]. □
Theorem 10. The message complexity of the fault detection problem with agree-
ment for FOM in the environment Ef is at most (|N| − 1)², provided that
f + 2 < |N |.
5.3 Summary
Table 1 summarizes the results in this section. Our two main results are that
a) detecting omission faults has a substantially higher message complexity than
detecting commission faults, and that b) the message complexity is generally
linear in the failure bound f , except when the fault class includes omission faults
and agreement is required, in which case the message complexity is quadratic in
the system size |N |.
Fault class    Fault detection problem        Fault detection problem with agreement
FCO            f + 2 (Corollary 2)            f + 2 (Corollary 2)
FOM            3f + 4 (Theorems 7 and 8)      (|N| − 1)² (Theorems 9 and 10)

Table 1. Message complexity in environments with up to f faulty nodes.
6 Related work
The results in this paper have important consequences for research on ac-
countability in distributed computing. Systems like PeerReview [10] provide ac-
countability by ensuring that faults can eventually be detected and irrefutably
linked to a faulty node. Since fault detection is an integral part of accountability,
this paper establishes an upper bound on the set of faults for which accountabil-
ity can be achieved, as well as a lower bound on the worst-case message com-
plexity. Note that practical accountability systems have other functions, such as
providing more detailed fault notifications, which we do not model here.
For example, accuracy could be strengthened such that a correct node is never suspected by any correct node; this would require stronger synchrony assumptions [6, 8].
On the other hand, completeness could be relaxed in such a way that faults must
only be detected with high probability. Preliminary evidence suggests that such
a definition would substantially reduce the message complexity [10].
In conclusion, we believe that this work is a step toward a better understand-
ing of the costs and limitations of fault detection in distributed systems. We also
believe that this work could be used as a basis for extending the spectrum of
fault classes with new intermediate classes, ranging between the “benign” crash
faults (which have proven to be too restrictive for modern software) and the
generic but rather pessimistic Byzantine faults.
References
1. Alvisi, L., Malkhi, D., Pierce, E.T., Reiter, M.K.: Fault detection for Byzantine
quorum systems. IEEE Trans. Parallel Distrib. Syst. 12(9), 996–1007 (2001)
2. Bar-Noy, A., Dolev, D., Dwork, C., Strong, H.R.: Shifting gears: changing algo-
rithms on the fly to expedite Byzantine agreement. In: PODC. pp. 42–51 (1987)
3. Bracha, G.: Asynchronous Byzantine agreement protocols. Information and Com-
putation 75(2), 130–143 (Nov 1987)
4. Castro, M., Liskov, B.: Practical Byzantine fault tolerance and proactive recovery.
ACM Transactions on Computer Systems 20(4), 398–461 (Nov 2002)
5. Chandra, T.D., Hadzilacos, V., Toueg, S.: The weakest failure detector for solving
consensus. J. ACM 43(4), 685–722 (Jul 1996)
6. Chandra, T.D., Toueg, S.: Unreliable failure detectors for reliable distributed sys-
tems. J. ACM 43(2), 225–267 (Mar 1996)
7. Denning, D.E.: An intrusion-detection model. IEEE Transactions on Software En-
gineering 13(2), 222–232 (1987)
8. Dolev, D., Dwork, C., Stockmeyer, L.: On the minimal synchronism needed for
distributed consensus. J. ACM 34(1), 77–97 (Jan 1987)
9. Haeberlen, A., Kuznetsov, P.: The fault detection problem (Oct 2009), Technical
Report MPI-SWS-2009-005, Max Planck Institute for Software Systems
10. Haeberlen, A., Kuznetsov, P., Druschel, P.: PeerReview: Practical accountability
for distributed systems. In: SOSP. pp. 175–188 (Oct 2007)
11. Halpern, J.Y., Moses, Y.: Knowledge and common knowledge in a distributed
environment. J. ACM 37(3), 549–587 (Jul 1990)
12. Kihlstrom, K.P., Moser, L.E., Melliar-Smith, P.M.: Byzantine fault detectors for
solving consensus. The Computer Journal 46(1), 16–35 (Jan 2003)
13. Ko, C., Fink, G., Levitt, K.: Automated detection of vulnerabilities in privileged
programs using execution monitoring. In: Proceedings of the 10th Annual Com-
puter Security Application Conference (Dec 1994)
14. Lamport, L.: The part-time parliament. ACM Trans. Comput. Syst. 16(2), 133–169
(May 1998)
15. Lamport, L., Shostak, R., Pease, M.: The Byzantine generals problem. ACM Trans.
Progr. Lang. Syst. 4(3), 382–401 (Jul 1982)
16. Li, J., Krohn, M., Mazières, D., Shasha, D.: Secure untrusted data repository
(SUNDR). In: OSDI (Dec 2004)
17. Vandiver, B., Balakrishnan, H., Liskov, B., Madden, S.: Tolerating Byzantine faults
in transaction processing systems using commit barrier scheduling. In: SOSP (Oct
2007)