Consensus
In a synchronous system, processors execute in a lock step mode; every processor at any given
moment is executing the same step. The message transmission delay between any pair of processes
is finite and is known.
On the other hand, in asynchronous systems, there is no upper bound on the relative execution
speeds of the processors. Also, the upper bound on the message transmission delay between any
pair of processes is not known although it is finite.
A partially synchronous system lies between the synchronous and asynchronous systems. In a
partially synchronous system, the upper bounds on the difference between the relative speeds of
different processors and the message transmission delay between any pair of processors are not
known beforehand but they do exist. In another version of partially synchronous system, the upper
bounds on the relative processor speed difference and the message transmission delay are known
but they hold only from some unknown time onward. In partial synchrony, the system behaves
asynchronously most of the time, with periods of synchrony in between. Real distributed systems are
partially synchronous: machines may run out of memory and slow down, and the communication
network may become congested and drop messages.
A fault tolerant system should continue to operate even in the presence of failures without
significantly affecting the performance.
The terms fault, error and failure are better understood when they are defined together. A fault
existing in the system will at some point cause an error which may lead to failure of the system. A
fault is the root cause of an error. A system is considered to fail when it cannot provide the expected
services. An error may lead to system failure depending on how severe it is.
Failures can be classified into different types. Crash failure and arbitrary/byzantine failure are the
two basic types.
A crash failure is also called a fail-stop failure. When a process fails, it stops permanently. It is the
most benign type of failure. A crashed process may be fail-silent: it does not proceed any further or
communicate with the rest of the processes in the system. Its failure can be detected by other
processes in the system that try to communicate with it.
An arbitrary/byzantine failure is one in which a process behaves arbitrarily and produces random
output. It is very difficult to diagnose such failures. It is sometimes because of an external agent
taking over the process to produce arbitrary behavior.
Failure detection involves finding which processes are faulty and disseminating this information to all
the other processes in the system consistently. A straightforward way to detect failures is by pinging
wherein a process asks another whether it is alive. If the pinged process replies within a timeout
period, then the pinging process finds it to be fault free; faulty otherwise. In an alternative method,
a process periodically sends heartbeat messages to let the other processes know that it is fault free. If a
heartbeat message is not received from a process for a predefined period, then it is suspected to
have failed. The process is detected to have failed when the majority of processes in the system
suspect it to have failed. Yet another way is to find out failures is as a consequence of regular
communication between processes known as Gossiping. Gossip messages carry failure
suspicion/detection information with them. Lack of expected communication between processes
results in suspicion/detection of failed processes. The Gossiping gradually spreads throughout the
system and eventually every process knows about all the failures.
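The heartbeat-based detection described above can be sketched in a few lines. This is a minimal, illustrative Python example; the class name, the `timeout` parameter, and the explicit `now` arguments (used to keep the example deterministic) are assumptions, not part of any particular system.

```python
import time

class HeartbeatDetector:
    """Suspect a peer once its heartbeats have been silent for `timeout` seconds."""

    def __init__(self, peers, timeout):
        self.timeout = timeout
        # Record the last time a heartbeat arrived from each peer.
        self.last_seen = {p: time.monotonic() for p in peers}

    def on_heartbeat(self, peer, now=None):
        # Called whenever a heartbeat message arrives from `peer`.
        self.last_seen[peer] = time.monotonic() if now is None else now

    def suspected(self, now=None):
        # Peers whose heartbeats have been absent longer than the timeout.
        now = time.monotonic() if now is None else now
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}
```

In a real system, the "majority of processes suspect it" step from the text would sit on top of this: each process runs its own detector and the suspicions are combined.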
The consensus problem in distributed systems involves reaching agreement on some piece of
information among all the processes in the system.
In the Byzantine agreement problem, one of the processes, called the source process, has an initial
value and all the processes reach agreement on this initial value. It is similar to a source process
broadcasting a value to all the other processes. The Byzantine agreement problem can be formally
defined as:
1. Agreement: All the fault free processes agree to the same value.
2. Validity: The value agreed upon by all the fault free processes must be the same as the initial
value of the source process assuming that the source process is fault free.
3. Termination: Every fault free process agrees on the initial value eventually.
In the consensus problem, each process has an initial value and all fault free processes reach
agreement on the same value. It is like checking if all the processes have the same value. The
consensus problem can be formally defined as:
1. Agreement: All the fault free processes agree to the same value.
2. Validity: The value agreed upon by all the fault free processes must be the same assuming that all
the processes are fault free and have the same value.
In the interactive consistency problem, each process i has an initial value vi and all fault free
processes reach agreement on a set of values with one value for each process. It is the same as
gathering a value from all the processes at all the processes. The interactive consistency problem can
be formally defined as:
1. Agreement: All the fault free processes agree to the same set of values {v1, v2, …, vn}.
2. Validity: The value agreed upon by all the fault free processes at position i in the set must be the
same assuming that all the processes are fault free.
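The Agreement and Validity conditions above can be expressed as simple checks over the values decided by the fault-free processes. This Python sketch is purely illustrative (the function names and the dictionary representation are assumptions):

```python
def check_agreement(decided):
    # Agreement: all fault-free processes must decide the same value.
    # `decided` maps process id -> the value that process decided on.
    return len(set(decided.values())) == 1

def check_validity(decided, source_value, source_fault_free):
    # Validity (Byzantine agreement form): if the source is fault free,
    # the agreed value must equal the source's initial value.
    if not source_fault_free:
        return True
    return all(v == source_value for v in decided.values())
```

For interactive consistency, the same idea applies per position: every fault-free process must hold the same vector, and position i must equal process i's initial value when i is fault free.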
The consensus problem is solvable only in certain environments. Consensus in the presence of
failures is even more challenging. Table below summarizes the consensus feasibility results in
different environments considering different types of failures. Notably, the consensus problem
cannot be solved deterministically in an asynchronous system even in the presence of a single crash
failure. However, it can be solved in the presence of crash failures by augmenting the system with
unreliable failure detectors.
2.1.5 Commit protocols
Distributed commit algorithms play a central role in distributed applications like consistent failure
detection, distributed transactions in databases for consistency maintenance, etc. where an
operation needs to be performed by all the processes or none at all.
Distributed commit is often performed with the help of a coordinator. There are different types of
commit algorithms depending on the number of rounds of message exchanges between the
coordinator and the other members (participants). In coordinator-based consensus approaches,
the two-phase and three-phase commit algorithms are used.
In the one-phase commit algorithm, the coordinator tells all the other processes whether to locally
perform the operation. The problem here is that a participant has no way to know if the operation
has been performed at every other participant. Two-phase commit and three-phase commit are
widely used and are discussed below. The Paxos algorithm, which is challenging to implement, has
not been widely used even though it is non-blocking and has the same complexity as two-phase
commit.
The two-phase commit algorithm has a voting phase and a decision phase, each consisting of two
steps. When failures are ignored, it operates as below.
1. The coordinator initiates the protocol by sending a VOTE-REQUEST message to all the participants
and then waits for their votes.
2. Each participant replies with VOTE-COMMIT if it is prepared to locally perform the operation, or
with VOTE-ABORT otherwise.
3. The coordinator collects the votes: if every participant voted VOTE-COMMIT, it sends
GLOBAL-COMMIT to all the participants; if any participant voted VOTE-ABORT, it sends GLOBAL-ABORT.
4. Each participant waits for the coordinator's decision and then commits or aborts accordingly.
In a system where failures occur, the participants or the coordinator may wait indefinitely for
messages from each other resulting in deadlock. Timeout mechanisms are used to deal with the
deadlocks as explained in the following scenarios.
First, a participant may be waiting for the VOTE-REQUEST message from the coordinator. If it
receives no such message, it locally aborts and sends a VOTE-ABORT message to the coordinator
after the timeout. Second, the coordinator may be waiting for the vote message (VOTE-COMMIT or
VOTE-ABORT) from the participants. If no vote arrives in time, the coordinator sends a
GLOBAL-ABORT message to the participants. Finally, the participants may be waiting for the global
decision (GLOBAL-COMMIT or GLOBAL-ABORT) from the coordinator.
In this third case, a participant cannot simply abort after a timeout; if the coordinator has crashed,
the participant has to wait for it to recover. To ensure that the algorithm terminates, the state of the
coordinator and participants needs to be stored in persistent storage (again in a fault-tolerant
manner) for use during recovery, which is not discussed here. Alternatively, a participant (say B) can
contact one of the other participants (say A) and decide as shown in the table below.
Note that if no participant received a decision before the coordinator crashed, the algorithm blocks
and the participants keep waiting for the coordinator to recover. Thus, without storing process state
to disk, the two-phase commit protocol cannot achieve consensus when the coordinator fails, and it
is therefore called blocking.
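The coordinator's decision rule, including its timeout behaviour, can be sketched as below. This is an illustrative Python fragment, not a full protocol implementation; the message names follow the text, while the function name and the use of `None` for a timed-out reply are assumptions.

```python
def coordinator_decide(votes):
    """Phase-2 decision of a two-phase commit coordinator.

    `votes` holds one entry per participant: "VOTE-COMMIT", "VOTE-ABORT",
    or None when the reply timed out. Any abort vote or missing reply
    forces GLOBAL-ABORT; only unanimous commit votes yield GLOBAL-COMMIT.
    """
    if votes and all(v == "VOTE-COMMIT" for v in votes):
        return "GLOBAL-COMMIT"
    return "GLOBAL-ABORT"
```

Treating a timeout exactly like an abort vote is what makes the first two timeout scenarios above safe; the third scenario (a participant waiting for the global decision) is the one this rule cannot resolve, which is why two-phase commit blocks.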
The three-phase commit protocol adds a PREPARE-COMMIT step between the vote and the decision.
In the absence of failures, the protocol starts with the coordinator sending a VOTE-REQUEST message
to the participants. If any participant responds with VOTE-ABORT, the coordinator sends GLOBAL-ABORT;
otherwise, it sends a PREPARE-COMMIT message. Finally, it sends a GLOBAL-COMMIT message
after all participants have acknowledged the PREPARE-COMMIT message.
When failures occur, the coordinator and participants may be blocked waiting for messages. First, if
a participant is waiting for a VOTE-REQUEST message from the coordinator, it will abort thinking that
the coordinator crashed. Second, if the coordinator is waiting for votes from the participants, it will
abort thinking that a participant crashed and send GLOBAL-ABORT to participants. Third, if the
coordinator is waiting for acknowledgments of the PREPARE-COMMIT messages, it will assume that a
participant crashed and send a GLOBAL-COMMIT message to the participants; when the crashed
participant comes back up, it will commit as well.
Finally, a participant invokes the termination detection algorithm when it is blocked waiting for a
PREPARE-COMMIT or GLOBAL-COMMIT message from the crashed coordinator. A participant (say B)
then decides what to do based on the state of the other participant(s) (say A) it contacts, as shown in
the table above.
Two types of consensus algorithms are used in different Blockchain platforms – Proof-based and
Vote-based.
As mentioned, if every node in the blockchain network tried to broadcast its own block of verified
transactions, confusion could arise. For example, a transaction verified by many nodes would be
placed into each of their blocks and broadcast to the other nodes; if broadcasting were unrestricted,
the transaction could be duplicated across different blocks, rendering the ledger meaningless.
In order to reach agreement among all nodes about the newly added block, PoW
requires each node to solve a difficult puzzle with adjusted difficulty, to get the right to append a
new block to the current chain. The first node who solves the puzzle will have this right.
Specifically, before solving this puzzle, all the verifying nodes would have to put their verified
transactions, as well as other information like Prev_Hash and Timestamp, into a block. Then they
start solving this puzzle, by guessing a secret value, which is the nonce field as introduced in Section
2.2, and putting it into the block. All the information inside the block header is combined and fed
into the SHA-256 hash function [25]. If the output of this function is below a given threshold T,
which is determined by the difficulty, the secret value is accepted. Otherwise, the node has to make
another guess, until he finds the answer. Because of the effort spent guessing the right value, this
work is called PoW. Also, a node joining the network using
PoW can be called a miner, and the action of finding a suitable nonce is called mining.
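The nonce search described above can be sketched in a few lines of Python. This is an illustrative toy, not the real Bitcoin header format: the header fields are joined as a plain string, and the very generous threshold is an assumption so the demo finishes quickly (real difficulty makes the search take enormous numbers of guesses).

```python
import hashlib

def mine(prev_hash, tx_root, threshold):
    """Guess nonces until SHA-256(header) falls below the threshold."""
    nonce = 0
    while True:
        # Combine the (simplified) block header fields, as described above.
        header = f"{prev_hash}|{tx_root}|{nonce}".encode()
        digest = int.from_bytes(hashlib.sha256(header).digest(), "big")
        if digest < threshold:          # puzzle solved: nonce accepted
            return nonce, digest
        nonce += 1                      # otherwise, make another guess

# A generous threshold (top 6 bits zero) so this demo succeeds in ~64 tries.
nonce, digest = mine("00ab", "c3f1", threshold=2**250)
```

Lowering the threshold directly raises the expected number of guesses, which is how the network adjusts difficulty.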
When a node finds the secret value, he broadcasts his proposed block with this value to other nodes,
to notify them that the answer has been found. All the miners receiving this message who have not
yet found the answer to their own puzzles will stop guessing. Instead, they check the broadcasted
block: whether all its transactions are valid, whether its Prev_Hash value is the hash of the last block
in their chain, and whether its nonce field is valid. If all the verifications pass, these nodes append
the proposed block to their current chain and restart guessing the secret value, by repeating the
steps above.
However, there is a rare case in which more than one miner finds an answer to the puzzle before
noticing that another miner has also found a suitable answer. These miners will each broadcast their
block with the found nonce, and other miners will accept the first block they receive and ignore
those arriving later. This leads to the forking problem: in the verifying
network, there are different chains of blocks where there should be only one. Satoshi proposed that
the miners keep mining new blocks on their own forks until one fork becomes longer than the
others; at that point, all nodes follow this longest fork. Fig. 5 describes the forking problem and the
PoW solution to handle it. Whenever a block is recognized in the chain by all the nodes, the miner
appending this block will receive some bitcoins as a reward.
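The longest-fork rule described above amounts to a one-line choice. This sketch is illustrative only: chains are represented as lists of block ids, and the first-received tie-breaking mirrors the "accept the first block" behaviour in the text.

```python
def resolve_fork(forks):
    """Adopt the fork with the most blocks.

    `forks` is a list of competing chains (each a list of block ids).
    Python's max() keeps the first longest chain on a tie, matching the
    'first received wins' behaviour described above.
    """
    return max(forks, key=len)
```

Once one fork pulls ahead, every node converges on it, and blocks on the abandoned fork are discarded.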
2.2.1.2 Proof-of-stake
In PoS, the more stake a miner owns, the more chance he has to mine a new block. Specifically, if
there are a total b coins from all the miners, and miner M owns a coins (a < b), the chance for miner
M getting the right to mine a new block is a/b. The selection of the lucky miner is carried out every
60 minutes, at random, weighted by the stake of each miner, as mentioned. Once a
miner gets the chance to mine a new block, he will verify the transactions, collect them into a block,
then broadcast it to the other miners, and receive the rewarding fee.
In DPoS, there are many people who hold stake, and they vote for a delegation of "witnesses":
miners who verify the transactions and maintain the chain. The more stake a person owns, the more
voting power he has in choosing the witnesses. Once the delegation is formed, its witnesses verify
the transactions and produce blocks containing the valid ones. The list of witnesses is regularly
shuffled, and with a creation rate of one block every 2 seconds, the witnesses in the list must
produce blocks in turn. Any witness who fails to produce his block may be removed from the
delegation. Whenever a witness creates a block to
append to the chain, he will receive a reward.
2.2.1.3 Proof-of-elapsed-time
Proposed by Intel, proof of elapsed time is used in a blockchain platform called Sawtooth Lake. To
run the consensus algorithm, each miner simultaneously requests a wait time from a trusted
enclave, a trusted function running inside Intel SGX hardware. Each miner then receives its wait
time from the enclave, and all miners wait until their received time elapses. When a miner has
waited long enough and finds that no one else has finished waiting, he broadcasts to all other
miners that he is the winner, which gives him the chance to mine the new block. In short, the miner
with the shortest received wait time is the one who mines the new block.
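The wait-time lottery can be simulated in a few lines. In this sketch an ordinary pseudo-random generator stands in for the trusted enclave, which is of course not how the real SGX-backed function works; the miner names and seed are made up.

```python
import random

def poet_round(miners, rng):
    """One proof-of-elapsed-time round: shortest enclave-issued wait wins."""
    # Stand-in for the trusted enclave handing each miner a random wait time.
    wait_times = {m: rng.random() for m in miners}
    # The miner with the shortest wait finishes first and wins the round.
    winner = min(wait_times, key=wait_times.get)
    return winner, wait_times

winner, waits = poet_round(["A", "B", "C"], random.Random(7))
```

Because the enclave attests that the wait really elapsed, a miner cannot simply claim a short time; that attestation is what the trusted hardware provides.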
Vote-based platforms employ a consensus algorithm that simulates a real-life situation: the final
decision is made based on the results of the majority.
When the nodes execute the consensus work, some of them may be subverted and send different
results to different nodes. In the worst case, the network cannot resist them, and the ledger ends up
differing between nodes. As for crashes, crashed nodes cannot send their results to other nodes,
which makes it difficult to reach a final decision. Based on these failure scenarios, vote-based
consensus algorithms can be classified into two main kinds:
• Byzantine fault tolerance based consensus: a kind of consensus that can withstand both subverted
(Byzantine) nodes and crashed nodes.
• Crash fault tolerance based consensus: a kind of consensus that can only withstand crashed nodes.
All consensus algorithms in these two sub-categories have to make a trust assumption: among N
nodes, there should be at least t nodes (t < N) operating normally. In crash fault tolerance-based
consensus, t is usually set equal to ⌊N/2⌋ + 1, while in Byzantine fault tolerance-based consensus, t is
usually set equal to ⌊2N/3⌋ + 1.
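The two trust thresholds can be checked with a small worked example. The formulas below follow the usual majority and more-than-two-thirds rules for crash and Byzantine fault tolerance; the function names are illustrative.

```python
def cft_threshold(n):
    """Correct nodes needed under crash fault tolerance: a majority of N."""
    return n // 2 + 1

def bft_threshold(n):
    """Correct nodes needed under Byzantine fault tolerance: more than 2N/3."""
    return 2 * n // 3 + 1
```

For example, with N = 4 nodes, crash fault tolerance needs 3 correct nodes to keep a majority, and Byzantine fault tolerance also needs 3 correct nodes, which is why a 4-node BFT system can tolerate at most one Byzantine node.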
The blockchain platform called Chain employs a crash fault tolerance consensus algorithm named
Federated Consensus. In it, among the n nodes in the verifying network, one node is called the
'block generator' and the other nodes are called 'block signers'. The block generator receives
transactions from
clients, then verifies each of them, and stores the valid ones in a temporary list. At regular intervals,
the block generator sequentially takes some of the pending transactions and puts them into blocks,
which are sent to all the block signers. These signers validate the received blocks and sign them if they
are valid, then send them back to the block generator. If a block is signed by more than m (m < n)
different block signers, the block generator will append the block to his current chain, and propose
this block to the other nodes. Following this method, Chain can tolerate the crash of a block signer.
However, if the block generator crashes, the network is halted.
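The block generator's signing quorum can be sketched as a single check. This Python fragment is illustrative; the signer names and the function name are assumptions, and m and n are the quantities used in the text.

```python
def generator_commit(signatures_received, m):
    """Federated Consensus commit rule, as described above.

    `signatures_received` is the set of block signers that validated and
    signed the block; the generator appends and proposes the block only
    when more than m different signers have signed it.
    """
    return len(signatures_received) > m
```

A signer that crashes simply never returns its signature, so the block can still commit as long as more than m of the remaining signers respond; the single block generator, by contrast, has no backup in this scheme.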
A kind of Byzantine fault tolerance called Practical Byzantine Fault Tolerance (PBFT) was used for
Hyperledger Fabric. In PBFT, there are two kinds of nodes: A leader node, and some validating peers
(nodes); and these peers will execute some rounds for appending a block to the chain. Initially, the
clients send their requests for transactions to their corresponding validating peers. From here, the
receiving peer will validate the transactions, then broadcast them to other peers, including the
leader. After the number of transactions reaches a threshold called batch size, or after an interval,
the leader node will order the transactions by their creation time and put them into a block.
Afterwards, three phases of PBFT are executed. Firstly, in the Pre-prepare phase, the leader
broadcasts his proposed block to the other peers, who receive and store the block locally. Then, to
make sure that everyone received the same block from the leader, they double-check it by
broadcasting it in the Prepare and Commit phases. In the Prepare phase, if a node receives, from
more than 2/3 of all the nodes, blocks identical to the one it stored locally, it proceeds to the Commit
phase. The same check is repeated in the Commit phase, and passing it is the requirement for a
node to execute the transactions in the proposed block and append it to its current chain.
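The more-than-2/3 quorum rule used in both the Prepare and Commit phases can be sketched as below. This is an illustrative Python fragment; the function name is an assumption, and "matching messages" means received copies identical to the locally stored block, as described above.

```python
def quorum_reached(matching_messages, total_nodes):
    """PBFT-style quorum check for the Prepare and Commit phases.

    A node advances to the next phase only after receiving matching
    copies of the block from strictly more than 2/3 of all nodes.
    """
    return matching_messages > 2 * total_nodes / 3
```

For example, in a 4-node network the quorum is 3 matching messages; this is exactly the threshold that lets PBFT make progress while one of the four nodes is Byzantine.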