
Paxos vs Raft: Have we reached consensus on distributed consensus?

Heidi Howard
University of Cambridge
Cambridge, UK
[email protected]

Richard Mortier
University of Cambridge
Cambridge, UK
[email protected]
arXiv:2004.05074v1 [cs.DC] 10 Apr 2020

Abstract

Distributed consensus is a fundamental primitive for constructing fault-tolerant, strongly-consistent distributed systems. Though many distributed consensus algorithms have been proposed, just two dominate production systems: Paxos, the traditional, famously subtle, algorithm; and Raft, a more recent algorithm positioned as a more understandable alternative to Paxos.

In this paper, we consider the question of which algorithm, Paxos or Raft, is the better solution to distributed consensus. We analyse both to determine exactly how they differ by describing a simplified Paxos algorithm using Raft's terminology and pragmatic abstractions.

We find that both Paxos and Raft take a very similar approach to distributed consensus, differing only in their approach to leader election. Most notably, Raft only allows servers with up-to-date logs to become leaders, whereas Paxos allows any server to be leader provided it then updates its log to ensure it is up-to-date. Raft's approach is surprisingly efficient given its simplicity as, unlike Paxos, it does not require log entries to be exchanged during leader election. We surmise that much of the understandability of Raft comes from the paper's clear presentation rather than being fundamental to the underlying algorithm being presented.

1 Introduction

State machine replication [32] is widely used to compose a set of unreliable hosts into a single reliable service that can provide strong consistency guarantees including linearizability [13]. As a result, programmers can treat a service implemented using replicated state machines as a single system, making it easy to reason about expected behaviour. State machine replication requires that each state machine receives the same operations in the same order, which can be achieved by distributed consensus.

The Paxos algorithm [16] is synonymous with distributed consensus. Despite its success, Paxos is famously difficult to understand, making it hard to reason about, implement correctly, and safely optimise. This is evident in the numerous attempts to explain the algorithm in simpler terms [4, 17, 22, 23, 25, 29, 35], and was the motivation behind Raft [28].

Raft's authors claim that Raft is as efficient as Paxos whilst being more understandable, and thus provides a better foundation for building practical systems. Raft seeks to achieve this in three distinct ways:

Presentation Firstly, the Raft paper introduces a new abstraction for describing leader-based consensus in the context of state machine replication. This pragmatic presentation has proven incredibly popular with engineers.

Simplicity Secondly, the Raft paper prioritises simplicity over performance. For example, Raft decides log entries in-order whereas Paxos typically allows out-of-order decisions but requires an extra protocol for filling the log gaps which can occur as a result.

Underlying algorithm Finally, the Raft algorithm takes a novel approach to leader election which alters how a leader is elected and thus how safety is guaranteed.

Raft rapidly became popular [30] and production systems today are divided between those which use Paxos [3, 5, 31, 33, 36, 38] and those which use Raft [2, 8–10, 15, 24, 34].

To answer the question of which, Paxos or Raft, is the better solution to distributed consensus, we must first answer the question of how exactly the two algorithms differ in their approach to consensus. Not only will this help in evaluating these algorithms, it may also allow Raft to benefit from the decades of research optimising Paxos' performance [6, 12, 14, 18–20, 26, 27] and vice versa [1, 37].

However, answering this question is not a straightforward matter. Paxos is often regarded not as a single algorithm but as a family of algorithms for solving distributed consensus. Paxos' generality (or underspecification, depending on your point of view) means that descriptions of the algorithm vary, sometimes considerably, from paper to paper.

To overcome this problem, we present here a simplified version of Paxos that results from surveying the various published descriptions of Paxos. This algorithm, which we refer to simply as Paxos, corresponds more closely to how Paxos is used today than to how it was first described [16]. It has been referred to elsewhere as multi-decree Paxos, or just MultiPaxos, to distinguish it from single-decree Paxos, which decides a single value instead of a totally-ordered sequence of values. We also describe our simplified algorithm using the style and abstractions from the Raft paper, allowing a fair comparison between the two different algorithms.
We conclude that there is no significant difference in understandability between the algorithms, and that Raft's leader election is surprisingly efficient given its simplicity.

2 Background

This paper examines distributed consensus in the context of state machine replication. State machine replication requires that an application's deterministic state machine is replicated across n servers, with each applying the same set of operations in the same order. This is achieved using a replication log, managed by a distributed consensus algorithm, typically Paxos or Raft.

We assume that the system is non-Byzantine [21] but we do not assume that the system is synchronous. Messages may be arbitrarily delayed and participating servers may operate at any speed, but we assume message exchange is reliable and in-order (e.g., through use of TCP/IP). We do not depend upon clock synchronisation for safety, though we must for liveness [11]. We assume each of the n servers has a unique id s where s ∈ {0..(n − 1)}. We assume that operations are unique, easily achieved by adding a pair of sequence number and server id to each operation.
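As an illustration of the last assumption, one possible way to make operations unique (a sketch of ours, not prescribed by either algorithm) is to tag each command with the pair described above:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TaggedOperation:
        server_id: int   # id s of the server that first received the command
        seq_no: int      # per-server sequence number for that command
        command: str     # the state-machine command itself

    # Two identical commands are still distinct operations:
    assert TaggedOperation(0, 1, "set x=1") != TaggedOperation(1, 1, "set x=1")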
3 Approach of Paxos & Raft

Many consensus algorithms, including Paxos and Raft, use a leader-based approach to solve distributed consensus. At a high level, these algorithms operate as follows:

One of the n servers is designated the leader. All operations for the state machine are sent to the leader. The leader appends each operation to its log and asks the other servers to do the same. Once the leader has received acknowledgements from a majority of servers that this has taken place, it applies the operation to its state machine. This process repeats until the leader fails. When the leader fails, another server takes over as leader. This process of electing a new leader involves at least a majority of servers, ensuring that the new leader will not overwrite any previously applied operations.

We now examine Paxos and Raft in more detail. Readers may find it helpful to refer to the summaries of Paxos and Raft provided in Appendices A & B. We focus here on the core elements of Paxos and Raft and, due to space constraints, do not compare garbage collection, log compaction, read operations or reconfiguration algorithms.

3.1 Basics

As shown in Figure 1, at any time a server can be in one of three states:

Follower A passive state where it is responsible only for replying to RPCs.
Candidate An active state where it is trying to become a leader using the RequestVote RPC.
Leader An active state where it is responsible for adding operations to the replicated log using the AppendEntries RPC.

Figure 1. State transitions between the server states for Paxos & Raft. The transitions in blue are specific to Raft.

Initially, servers are in the follower state. Each server continues as a follower until it believes that the leader has failed. The follower then becomes a candidate and tries to be elected leader using RequestVote RPCs. If successful, the candidate becomes a leader. The new leader must regularly send AppendEntries RPCs as keepalives to prevent followers from timing out and becoming candidates.

Each server stores a natural number, the term, which increases monotonically over time. Initially, each server has a current term of zero. The sending server's (hereafter, the sender) current term is included in each RPC. When a server receives an RPC, it (hereafter, the receiver) first checks the included term. If the sender's term is greater than the receiver's, then the receiver will update its term before responding to the RPC and, if it was either a candidate or a leader, step down to become a follower. If the sender's term is equal to that of the receiver, then the receiver will respond to the RPC as usual. If the sender's term is less than that of the receiver, then the receiver will respond negatively to the sender, including its own term in the response. When the sender receives such a response, it will step down to follower and update its term.

3.2 Normal operation

When a leader receives an operation, it appends it to the end of its log with the current term. The pair of operation and term is known as a log entry. The leader then sends AppendEntries RPCs to all other servers with the new log entry. Each server maintains a commit index to record which log entries are safe to apply to its state machine, and responds to the leader acknowledging successful receipt of the new log entry. Once the leader receives positive responses from a majority of servers, the leader updates its commit index and applies the operation to its state machine. The leader then includes the updated commit index in subsequent AppendEntries RPCs.
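To make this concrete, the following is a minimal Python sketch, using invented names such as handle_append_entries, of how a follower might process an AppendEntries RPC. It combines the term rules of Section 3.1 with the log-append and commit-index handling of Section 3.2, and simplifies the consistency check; it is an illustration only, not code from either paper.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    Entry = Tuple[str, int]   # (operation, term)

    @dataclass
    class Follower:
        current_term: int = 0
        log: List[Entry] = field(default_factory=list)   # log[i] is the entry at index i + 1
        commit_index: int = 0                             # highest index known to be committed

        def handle_append_entries(self, sender_term: int, prev_index: int,
                                  entries: List[Entry], leader_commit: int):
            # Term rules (Section 3.1): reject smaller terms, adopt larger ones.
            if sender_term < self.current_term:
                return self.current_term, False
            self.current_term = sender_term

            # Only append if our log already matches the leader's up to prev_index
            # (simplified to a length check; a full implementation also compares
            # the term of the entry at prev_index).
            if len(self.log) < prev_index:
                return self.current_term, False

            # Overwrite any conflicting suffix and append the new entries.
            self.log = self.log[:prev_index] + entries

            # Advance the commit index (Section 3.2), never past our own log.
            self.commit_index = max(self.commit_index,
                                    min(leader_commit, len(self.log)))
            return self.current_term, True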
Figure 2. Logs of three servers running Paxos. Figure (a) shows the logs when a leader election was triggered. Figures (b–d) show the logs after a leader has been elected but before it has sent its first AppendEntries RPC: in (b), s1 is leader in term 4 after a vote from s2; in (c), s2 is leader in term 5 after a vote from s3; in (d), s3 is leader in term 6 after a vote from s1 or s2. The black line shows the commit index and red text highlights the log changes.

A follower will only append a log entry (or set of log entries) if its log prior to that entry (or entries) is identical to the leader's log. This ensures that log entries are added in-order, preventing gaps in the log, and ensuring followers apply the correct log entries to their state machines.

3.3 Handling leader failures

This process continues until the leader fails, requiring a new leader to be established. Paxos and Raft take different approaches to this process, so we describe each in turn.

Paxos. A follower will timeout after failing to receive a recent AppendEntries RPC from the leader. It then becomes a candidate and updates its term to the next term t such that t mod n = s, where n is the number of servers and s is the candidate's server id. The candidate will send RequestVote RPCs to the other servers. This RPC includes the candidate's new term and commit index. When a server receives the RequestVote RPC, it will respond positively provided the candidate's term is greater than its own. This response also includes any log entries that the server has in its log subsequent to the candidate's commit index.

Once the candidate has received positive RequestVote responses from a majority of servers, it must ensure its log includes all committed entries before becoming a leader. It does so as follows. For each index after the commit index, the candidate reviews the log entries it has received alongside its own log. If the candidate has seen a log entry for an index then it will update its own log with that entry and the new term. If the candidate has seen multiple log entries for the same index then it will update its own log with the entry from the greatest term, relabelled with the new term. An example of this is given in Figure 2. The candidate can now become a leader and begin replicating its log to the other servers.
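The following Python sketch, with invented helper names, illustrates the two Paxos-specific steps above: choosing the next term t with t mod n = s, and rebuilding the log from the entries returned in RequestVote responses. It is our illustration of the simplified algorithm, not code from any published Paxos description.

    def next_paxos_term(current_term: int, server_id: int, n: int) -> int:
        # Smallest term t > current_term with t mod n == server_id.
        t = current_term + 1
        while t % n != server_id:
            t += 1
        return t

    def merge_vote_responses(own_log, commit_index, new_term, responses):
        # own_log: list of (operation, term); index i (1-based) is own_log[i - 1].
        # responses: one dict per positive vote, mapping index -> (operation, term)
        # for every entry the voter holds after the candidate's commit index.
        merged = {i: e for i, e in enumerate(own_log[:commit_index], start=1)}
        seen = {}
        for i, e in enumerate(own_log[commit_index:], start=commit_index + 1):
            seen.setdefault(i, []).append(e)
        for resp in responses:
            for i, e in resp.items():
                seen.setdefault(i, []).append(e)
        # For each index, keep the operation from the greatest term, relabelled
        # with the candidate's new term.
        for i, candidates in seen.items():
            op, _term = max(candidates, key=lambda e: e[1])
            merged[i] = (op, new_term)
        # In-order decisions mean there are no gaps, so this yields a dense log.
        return [merged[i] for i in sorted(merged)]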
Raft. At least one of the followers will timeout after not receiving a recent AppendEntries RPC from the leader. It will become a candidate and increment its term. The candidate will send RequestVote RPCs to the other servers. Each includes the candidate's term as well as the candidate's last log term and index. When a server receives the RequestVote request it will respond positively provided the candidate's term is greater than or equal to its own, it has not yet voted for a candidate in this term, and the candidate's log is at least as up-to-date as its own. This last criterion can be checked by ensuring that the candidate's last log term is greater than the server's or, if the terms are the same, that the candidate's last index is greater than or equal to the server's.

Once the candidate has received positive RequestVote responses from a majority of servers, the candidate can become a leader and start replicating its log. However, for safety Raft requires that the leader does not update its commit index until at least one log entry from the new term has been committed.

As there may be multiple candidates in a given term, votes may be split such that no candidate has a majority. In this case, the candidate times out and starts a new election with the next term.
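A Python sketch of the voter's checks just described (the function names are ours; the logic follows the rules above):

    def log_up_to_date(candidate_last_term, candidate_last_index,
                       my_last_term, my_last_index) -> bool:
        # True if the candidate's log is at least as up-to-date as ours.
        if candidate_last_term != my_last_term:
            return candidate_last_term > my_last_term
        return candidate_last_index >= my_last_index

    def grant_vote(state, candidate_id, candidate_term,
                   candidate_last_term, candidate_last_index) -> bool:
        # state: mutable dict with current_term, voted_for, last_term, last_index.
        if candidate_term < state["current_term"]:
            return False
        if candidate_term > state["current_term"]:
            # New term: adopt it and forget any earlier vote.
            state["current_term"] = candidate_term
            state["voted_for"] = None
        if state["voted_for"] not in (None, candidate_id):
            return False
        if not log_up_to_date(candidate_last_term, candidate_last_index,
                              state["last_term"], state["last_index"]):
            return False
        state["voted_for"] = candidate_id
        return True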
3.4 Safety

Both algorithms guarantee the following property:

Theorem 3.1 (State Machine Safety). If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.

There is at most one leader per term and a leader will not overwrite its own log, so we can prove this by proving the following:

Theorem 3.2 (Leader Completeness). If an operation op is committed at index i by a leader in term t then all leaders of terms > t will also have operation op at index i.

Proof sketch for Paxos. Assume operation op is committed at index i with term t. We will use proof by induction over the terms greater than t.
Base case: If there is a leader of term t + 1, then it will have the operation op at index i.

As any two majority quorums intersect and messages are ordered by term, at least one server with operation op at index i and with term t must have positively replied to the RequestVote RPC from the leader of term t + 1. This server cannot have deleted or overwritten this operation as it cannot have positively responded to an AppendEntries RPC from any other leader since the leader of term t. The leader will choose the operation op as it will not receive any log entries with a term > t.

Inductive case: Assume that any leaders of terms t + 1 to t + k have the operation op at index i. If there is a leader of term t + k + 1 then it will also have operation op at index i.

As any two majority quorums intersect and messages are ordered by term, at least one server with operation op at index i and with a term between t and t + k must have positively replied to the RequestVote RPC from the leader of term t + k + 1. This is because the server cannot have deleted or overwritten this operation as it has not positively responded to an AppendEntries RPC from any leader except those with terms t to t + k. From our induction hypothesis, all these leaders will also have operation op at index i and thus will not have overwritten it. The leader may choose another operation only if it receives a log entry with that different operation at index i and with a greater term. From our induction hypothesis, all leaders of terms t to t + k will also have operation op at index i and thus will not write another operation at index i. □

The proof for Raft uses the same induction but the details differ due to Raft's different approach to leader election.

4 Discussion

Raft and Paxos take different approaches to leader election, summarised in Table 1. We compare the two algorithms along two dimensions, understandability and efficiency, to determine which is best.

Understandability. Raft guarantees that if two logs contain the same operation then it will have the same index and term in both. In other words, each operation is assigned a unique index and term pair. However, this is not the case in Paxos, where an operation may be assigned a higher term by a future leader, as demonstrated by operations B and C in Figure 2b. In Paxos, a log entry before the commit index may be overwritten. This is safe because the log entry will only be overwritten by an entry with the same operation, but it is not as intuitive as Raft's approach.

The flip side of this is that Paxos makes it safe to commit a log entry if it is present on a majority of servers; this is not the case for Raft, which requires that a leader only commits a log entry from a previous term if it is present on a majority of servers and the leader has committed a subsequent log entry from the current term.

In Paxos, the log entries replicated by the leader are either from the current term or they are already committed. We can see this in Figure 2, where all log entries after the commit index on the leader have the current term. This is not the case in Raft, where a leader may be replicating uncommitted entries from previous terms.

Overall, we feel that Raft's approach is slightly more understandable than Paxos' but not significantly so.

Efficiency. In Paxos, if multiple servers become candidates simultaneously, the candidate with the higher term will win the election. In Raft, if multiple servers become candidates simultaneously, they may split the votes as they will have the same term, and so neither will win the election. Raft mitigates this by having followers wait an additional period, drawn from a uniform random distribution, after the election timeout. We thus expect that Raft will be both slower and have higher variance in the time taken to elect a leader.
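A sketch of this split-vote mitigation, with illustrative (not prescribed) timeout values; the names are ours:

    import random

    ELECTION_TIMEOUT = 0.15   # seconds; illustrative value only
    MAX_EXTRA_WAIT   = 0.15   # upper bound on the random additional wait

    def next_election_deadline(now: float) -> float:
        # Time at which this follower becomes a candidate unless it hears from a
        # leader first; the random extra wait makes simultaneous candidacies,
        # and hence split votes, unlikely.
        return now + ELECTION_TIMEOUT + random.uniform(0, MAX_EXTRA_WAIT)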
various options to reduce the number of log entries sent but
eration only if it receives a log entry with that different oper-
ultimately, it will always be necessary for some log entries
ation at index i and with a greater term. From our induction
to be sent if the leader’s log is not already up to date.
hypothesis, all leaders of terms t to t +k will also have oper-
It is not just with the RequestVote responses that Paxos
ation op at index i and thus will not write another operation
sends more log entries than Raft. In both algorithms, once
at index i. 
a candidate becomes a leader it will copy its log to all other
The proof for Raft uses the same induction but the details servers. In Paxos, a log entry may have been given a new
differ due to Raft’s different approach to leader election. term by the leader and thus the leader may send another
copy of the log entry to a server which already has a copy.
This is not the case with Raft, where each log entry keeps the
4 Discussion
same term throughout its time in the log. Again, there are
Raft and Paxos take different approaches to leader election, various options for mitigating this issue in Paxos but they
summarised in Table 1. We compare two dimensions, under- are beyond the scope of our simplified Paxos algorithm.
standability and efficiency, to determine which is best. Overall, compared to Paxos, Raft’s approach to leader elec-
Table 1. Summary of the differences between Paxos and Raft

How does it ensure that each term has at most one leader?
Paxos: A server s can only be a candidate in a term t if t mod n = s. There will only be one candidate per term, so only one leader per term.
Raft: A follower can become a candidate in any term. Each follower will only vote for one candidate per term, so only one candidate can get a majority of votes and become the leader.

How does it ensure that a new leader's log contains all committed log entries?
Paxos: Each RequestVote reply includes the follower's log entries. Once a candidate has received RequestVote responses from a majority of followers, it adds the entries with the highest term to its log.
Raft: A vote is granted only if the candidate's log is at least as up-to-date as the follower's. This ensures that a candidate only becomes a leader if its log is at least as up-to-date as a majority of followers.

How does it ensure that leaders safely commit log entries from previous terms?
Paxos: Log entries from previous terms are added to the leader's log with the leader's term. The leader then replicates the log entries as if they were from the leader's term.
Raft: The leader replicates the log entries to the other servers without changing the term. The leader cannot consider these entries committed until it has replicated a subsequent log entry from its own term.
5 Relation to classical Paxos

Readers who are familiar with Paxos may feel that our description of Paxos differs from those previously published, so we now outline how our Paxos algorithm relates to those found elsewhere in the literature.

Roles Some descriptions of Paxos divide the responsibility of Paxos into three roles: proposer, acceptor and learner [17], or leader, acceptor and replica [35]. Our presentation uses just one role, server, which incorporates all of these roles. A presentation using a single role has also used the name replica [7].

Terminology Terms are also referred to as views, ballot numbers [35], proposal numbers [17], round numbers, sequence numbers [7] or epochs. Our leader is also referred to as a master [7], primary, coordinator or distinguished proposer [17]. Typically, the period during which a server is a candidate is known as phase-1 and the period during which a server is a leader is known as phase-2. The RequestVote RPCs are often referred to as phase1a and phase1b messages [35], prepare request and response [17], or prepare and promise messages. The AppendEntries RPCs are often referred to as phase2a and phase2b messages [35], accept request and response [17], or propose and accept messages.

Terms Paxos requires only that terms are totally ordered, that each server is allocated a disjoint set of terms (for safety), and that each server can use a term greater than any other term (for liveness). Whilst some descriptions of Paxos use round-robin natural numbers as we do [7], others use lexicographically ordered pairs, consisting of an integer and the server ID, where each server only uses terms containing its own ID [35].
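For example, such lexicographically ordered terms could be modelled as follows (our sketch, not taken from any of the cited descriptions):

    from typing import NamedTuple

    class Term(NamedTuple):
        counter: int     # incremented when a server needs a higher term
        server_id: int   # owner of the term; makes each server's terms disjoint

    def next_term(current: Term, my_id: int) -> Term:
        # Smallest term owned by my_id that is greater than current.
        # Python compares tuples lexicographically, so (c, s) < (c', s')
        # iff c < c' or (c == c' and s < s').
        counter = current.counter if my_id > current.server_id else current.counter + 1
        return Term(counter, my_id)

    assert next_term(Term(3, 2), my_id=1) == Term(4, 1)   # must bump the counter
    assert next_term(Term(3, 0), my_id=2) == Term(3, 2)   # same counter is already greater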
Ordering Our log entries are replicated and decided in-order. This is not necessary but it does avoid the complexities of filling log gaps [17]. Similarly, some descriptions of Paxos bound the number of concurrent decisions, often necessary for reconfiguration [17, 35].

6 Summary

The Raft algorithm was proposed to address the longstanding issues with understandability of the widely studied Paxos algorithm.

In this paper, we have demonstrated that much of the understandability of Raft comes from its pragmatic abstraction and excellent presentation. By describing a simplified Paxos algorithm using the same approach as Raft, we find that the two algorithms differ only in their approach to leader election. Specifically:

(i) Paxos divides terms between servers, whereas Raft allows a follower to become a candidate in any term but followers will vote for only one candidate per term.
(ii) Paxos followers will vote for any candidate, whereas Raft followers will only vote for a candidate if the candidate's log is at least as up-to-date.
(iii) If a leader has uncommitted log entries from a previous term, Paxos will replicate them in the current term whereas Raft will replicate them in their original term.

The Raft paper claims that Raft is significantly more understandable than Paxos, and as efficient. On the contrary, we find that the two algorithms are not significantly different in understandability, but that Raft's leader election is surprisingly lightweight when compared to Paxos'. Both algorithms we have presented are naïve by design and could certainly be optimised to improve performance, though often at the cost of increased complexity.

References

[1] Arora, V., Mittal, T., Agrawal, D., El Abbadi, A., Xue, X., Zhiyanan, Z., and Zhujianfeng, Z. Leader or majority: Why have one when you can have both? Improving read scalability in Raft-like consensus protocols. In Proceedings of the 9th USENIX Conference on Hot Topics in Cloud Computing (USA, 2017), HotCloud '17, USENIX Association, p. 14.
[2] Atomix. https://atomix.io.
[3] Baker, J., Bond, C., Corbett, J. C., Furman, J., Khorlin, A., Larson, J., Leon, J.-M., Li, Y., Lloyd, A., and Yushprakh, V. Megastore: Providing scalable, highly available storage for interactive services. In Proceedings of the Conference on Innovative Data Systems Research (CIDR) (2011), pp. 223–234.
[4] Boichat, R., Dutta, P., Frølund, S., and Guerraoui, R. Deconstructing Paxos. SIGACT News 34, 1 (Mar. 2003), 47–67.
[5] Burrows, M. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (USA, 2006), OSDI '06, USENIX Association, p. 335–350.

[6] Camargos, L. J., Schmidt, R. M., and Pedone, F. Multicoordinated Paxos. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing (New York, NY, USA, 2007), PODC '07, Association for Computing Machinery, p. 316–317.
[7] Chandra, T. D., Griesemer, R., and Redstone, J. Paxos made live: An engineering perspective. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing (New York, NY, USA, 2007), PODC '07, Association for Computing Machinery, p. 398–407.
[8] CockroachDB. https://www.cockroachlabs.com.
[9] Consul by HashiCorp. https://www.consul.io.
[10] etcd. https://coreos.com/etcd/.
[11] Fischer, M. J., Lynch, N. A., and Paterson, M. S. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr. 1985), 374–382.
[12] Gafni, E., and Lamport, L. Disk Paxos. In Proceedings of the 14th International Conference on Distributed Computing (Berlin, Heidelberg, 2000), DISC '00, Springer-Verlag, p. 330–344.
[13] Herlihy, M. P., and Wing, J. M. Linearizability: A correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12, 3 (July 1990), 463–492.
[14] Kraska, T., Pang, G., Franklin, M. J., Madden, S., and Fekete, A. MDCC: Multi-data center consistency. In Proceedings of the 8th ACM European Conference on Computer Systems (New York, NY, USA, 2013), EuroSys '13, Association for Computing Machinery, p. 113–126.
[15] Kubernetes: Production-grade container orchestration. https://kubernetes.io.
[16] Lamport, L. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133–169.
[17] Lamport, L. Paxos made simple. ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) (December 2001), 51–58.
[18] Lamport, L. Generalized consensus and Paxos. Tech. Rep. MSR-TR-2005-33, Microsoft, March 2005.
[19] Lamport, L. Fast Paxos. Distributed Computing 19 (October 2006), 79–103.
[20] Lamport, L., and Massa, M. Cheap Paxos. In Proceedings of the 2004 International Conference on Dependable Systems and Networks (USA, 2004), DSN '04, IEEE Computer Society, p. 307.
[21] Lamport, L., Shostak, R., and Pease, M. The Byzantine generals problem. ACM Trans. Program. Lang. Syst. 4, 3 (July 1982), 382–401.
[22] Lampson, B. The ABCD's of Paxos. In Proceedings of the Twentieth Annual ACM Symposium on Principles of Distributed Computing (New York, NY, USA, 2001), PODC '01, Association for Computing Machinery, p. 13.
[23] Lampson, B. W. How to build a highly available system using consensus. In Proceedings of the 10th International Workshop on Distributed Algorithms (Berlin, Heidelberg, 1996), WDAG '96, Springer-Verlag, p. 1–17.
[24] M3: Uber's open source, large-scale metrics platform for Prometheus. https://eng.uber.com/m3/.
[25] Meling, H., and Jehl, L. Tutorial summary: Paxos explained from scratch. In Proceedings of the 17th International Conference on Principles of Distributed Systems - Volume 8304 (Berlin, Heidelberg, 2013), OPODIS 2013, Springer-Verlag, p. 1–10.
[26] Moraru, I., Andersen, D. G., and Kaminsky, M. There is more consensus in egalitarian parliaments. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, Association for Computing Machinery, p. 358–372.
[27] Moraru, I., Andersen, D. G., and Kaminsky, M. Paxos quorum leases: Fast reads without sacrificing writes. In Proceedings of the ACM Symposium on Cloud Computing (New York, NY, USA, 2014), SOCC '14, Association for Computing Machinery, p. 1–13.
[28] Ongaro, D., and Ousterhout, J. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (USA, 2014), USENIX ATC '14, USENIX Association, p. 305–320.
[29] Prisco, R. D., Lampson, B. W., and Lynch, N. A. Revisiting the Paxos algorithm. In Proceedings of the 11th International Workshop on Distributed Algorithms (Berlin, Heidelberg, 1997), WDAG '97, Springer-Verlag, p. 111–125.
[30] The Raft consensus algorithm. https://raft.github.io.
[31] Ramakrishnan, R., Sridharan, B., Douceur, J. R., Kasturi, P., Krishnamachari-Sampath, B., Krishnamoorthy, K., Li, P., Manu, M., Michaylov, S., Ramos, R., et al. Azure Data Lake Store: A hyperscale distributed file service for big data analytics. In Proceedings of the 2017 ACM International Conference on Management of Data (New York, NY, USA, 2017), SIGMOD '17, Association for Computing Machinery, p. 51–63.
[32] Schneider, F. B. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4 (Dec. 1990), 299–319.
[33] Schwarzkopf, M., Konwinski, A., Abd-El-Malek, M., and Wilkes, J. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems (New York, NY, USA, 2013), EuroSys '13, Association for Computing Machinery, p. 351–364.
[34] Trillian: General transparency. https://github.com/google/trillian/.
[35] Van Renesse, R., and Altinbuken, D. Paxos made moderately complex. ACM Comput. Surv. 47, 3 (Feb. 2015).
[36] Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., and Wilkes, J. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems (New York, NY, USA, 2015), EuroSys '15, Association for Computing Machinery.
[37] Zhang, Y., Ramadan, E., Mekky, H., and Zhang, Z.-L. When Raft meets SDN: How to elect a leader and reach consensus in an unruly network. In Proceedings of the First Asia-Pacific Workshop on Networking (New York, NY, USA, 2017), APNet '17, Association for Computing Machinery, p. 1–7.
[38] Zheng, J., Lin, Q., Xu, J., Wei, C., Zeng, C., Yang, P., and Zhang, Y. PaxosStore: High-availability storage made practical in WeChat. Proc. VLDB Endow. 10, 12 (Aug. 2017), 1730–1741.
A Paxos Algorithm

This summarises our simplified, Raft-style Paxos algorithm. The text in red is unique to Paxos.

State

Persistent state on all servers: (Updated on stable storage before responding to RPCs)
currentTerm latest term server has seen (initialized to 0 on first boot, increases monotonically)
log[ ] log entries; each entry contains command for state machine, and term when entry was received by leader (first index is 1)

Volatile state on all servers:
commitIndex index of highest log entry known to be committed (initialized to 0, increases monotonically)
lastApplied index of highest log entry applied to state machine (initialized to 0, increases monotonically)

Volatile state on candidates: (Reinitialized after election)
entries[ ] log entries received with votes

Volatile state on leaders: (Reinitialized after election)
nextIndex[ ] for each server, index of the next log entry to send to that server (initialized to leader commit index + 1)
matchIndex[ ] for each server, index of highest log entry known to be replicated on server (initialized to 0, increases monotonically)

AppendEntries RPC

Invoked by leader to replicate log entries; also used as heartbeat.
Arguments:
term leader's term
prevLogIndex index of log entry immediately preceding new ones
prevLogTerm term of prevLogIndex entry
entries[ ] log entries to store (empty for heartbeat; may send more than one for efficiency)
leaderCommit leader's commitIndex
Results:
term currentTerm, for leader to update itself
success true if follower contained entry matching prevLogIndex and prevLogTerm
Receiver implementation:
1. Reply false if term < currentTerm
2. Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm
3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it
4. Append any new entries not already in the log
5. If leaderCommit > commitIndex: set commitIndex = min(leaderCommit, index of last new entry)

RequestVote RPC

Invoked by candidates to gather votes.
Arguments:
term candidate's term
leaderCommit candidate's commit index
Results:
term currentTerm, for candidate to update itself
voteGranted true indicates candidate received vote
entries[ ] follower's log entries after leaderCommit
Receiver implementation:
1. Reply false if term < currentTerm
2. Grant vote and send any log entries after leaderCommit

Rules for Servers

All Servers:
• If commitIndex > lastApplied: increment lastApplied and apply log[lastApplied] to state machine
• If RPC request or response contains term T > currentTerm: set currentTerm = T and convert to follower

Followers:
• Respond to RPCs from candidates and leaders
• If election timeout elapses without receiving AppendEntries RPC from current leader or granting vote to candidate: convert to candidate

Candidates:
• On conversion to candidate, start election: increase currentTerm to the next t such that t mod n = s, copy any log entries after commitIndex to entries[ ], and send RequestVote RPCs to all other servers
• Add any log entries received from RequestVote responses to entries[ ]
• If votes received from majority of servers: update log by adding entries[ ] with currentTerm (using the value with the greatest term if there are multiple entries with the same index) and become leader

Leaders:
• Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server; repeat during idle periods to prevent election timeouts
• If command received from client: append entry to local log, respond after entry applied to state machine
• If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex
  – If successful: update nextIndex and matchIndex for follower
  – If AppendEntries fails because of log inconsistency: decrement nextIndex and retry
• If there exists an N such that N > commitIndex and a majority of matchIndex[i] ≥ N: set commitIndex = N
B Raft Algorithm

This is a reproduction of Figure 2 from the Raft paper [28]. The text in red is unique to Raft.

State

Persistent state on all servers: (Updated on stable storage before responding to RPCs)
currentTerm latest term server has seen (initialized to 0 on first boot, increases monotonically)
votedFor candidateId that received vote in current term (or null if none)
log[ ] log entries; each entry contains command for state machine, and term when entry was received by leader (first index is 1)

Volatile state on all servers:
commitIndex index of highest log entry known to be committed (initialized to 0, increases monotonically)
lastApplied index of highest log entry applied to state machine (initialized to 0, increases monotonically)

Volatile state on leaders: (Reinitialized after election)
nextIndex[ ] for each server, index of the next log entry to send to that server (initialized to leader last log index + 1)
matchIndex[ ] for each server, index of highest log entry known to be replicated on server (initialized to 0, increases monotonically)

AppendEntries RPC

Invoked by leader to replicate log entries; also used as heartbeat.
Arguments:
term leader's term
prevLogIndex index of log entry immediately preceding new ones
prevLogTerm term of prevLogIndex entry
entries[ ] log entries to store (empty for heartbeat; may send more than one for efficiency)
leaderCommit leader's commitIndex
Results:
term currentTerm, for leader to update itself
success true if follower contained entry matching prevLogIndex and prevLogTerm
Receiver implementation:
1. Reply false if term < currentTerm
2. Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm
3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it
4. Append any new entries not already in the log
5. If leaderCommit > commitIndex: set commitIndex = min(leaderCommit, index of last new entry)

RequestVote RPC

Invoked by candidates to gather votes.
Arguments:
term candidate's term
candidateId candidate requesting vote
lastLogIndex index of candidate's last log entry
lastLogTerm term of candidate's last log entry
Results:
term currentTerm, for candidate to update itself
voteGranted true indicates candidate received vote
Receiver implementation:
1. Reply false if term < currentTerm
2. If votedFor is null or candidateId, and candidate's log is at least as up-to-date as receiver's log: grant vote

Rules for Servers

All Servers:
• If commitIndex > lastApplied: increment lastApplied, apply log[lastApplied] to state machine
• If RPC request or response contains term T > currentTerm: set currentTerm = T and convert to follower

Followers:
• Respond to RPCs from candidates and leaders
• If election timeout elapses without receiving AppendEntries RPC from current leader or granting vote to candidate: convert to candidate

Candidates:
• On conversion to candidate, start election: increment currentTerm, vote for self, reset election timer and send RequestVote RPCs to all other servers
• If votes received from majority of servers: become leader
• If AppendEntries RPC received from new leader: convert to follower
• If election timeout elapses: start new election

Leaders:
• Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server; repeat during idle periods to prevent election timeouts
• If command received from client: append entry to local log, respond after entry applied to state machine
• If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex
  – If successful: update nextIndex and matchIndex for follower
  – If AppendEntries fails because of log inconsistency: decrement nextIndex and retry
• If there exists an N such that N > commitIndex and a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N
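As an illustration of the commit rule in the final bullet above, including Raft's extra log[N].term == currentTerm condition, a short Python sketch with invented names:

    def advance_commit_index(commit_index, log, match_index, current_term, n):
        # Return the new commitIndex for a Raft leader.
        # log is a list of (operation, term) with 1-based indexing (log[i - 1]);
        # match_index maps each follower to its highest replicated index;
        # the leader itself always holds every entry, so it counts as one replica.
        for candidate in range(len(log), commit_index, -1):
            replicated = 1 + sum(1 for m in match_index.values() if m >= candidate)
            if replicated * 2 > n and log[candidate - 1][1] == current_term:
                return candidate
        return commit_index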
