Blockchain Consensus
An Introduction to Classical, Blockchain, and Quantum Consensus Protocols
Imran Bashir
London, UK
Introduction ... xxi
Chapter 1: Introduction ... 1
Distributed Systems ... 1
Characteristics ... 2
Why Build Distributed Systems ... 4
Challenges ... 7
Parallel vs. Distributed vs. Concurrency ... 10
Centralized vs. Decentralized vs. Distributed ... 10
Distributed Algorithm ... 12
Elements of Distributed Computing/Pertinent Terms/Concepts ... 14
Types of Distributed Systems ... 19
Software Architecture Models ... 19
Distributed System Model ... 25
Synchrony and Timing ... 31
Time, Clocks, and Order ... 37
Physical Clocks ... 39
Happens-Before Relationship and Causality ... 50
CAP Theorem ... 61
Consistency ... 61
Availability ... 61
Partition Tolerance ... 61
Chapter 2: Cryptography ... 67
Introduction ... 67
A Typical Cryptosystem ... 68
Cryptographic Primitives ... 70
Symmetric Cryptography ... 70
Stream Ciphers ... 71
Block Ciphers ... 73
Advanced Encryption Standard ... 77
Some Basic Mathematics ... 79
Prime ... 79
Modular Arithmetic ... 80
Group ... 80
Abelian Group ... 80
Field ... 80
Finite Field (Galois Field) ... 80
Prime Fields ... 80
Generator ... 81
Public Key Cryptography ... 81
Diffie-Hellman Key Exchange ... 82
Digital Signatures ... 85
RSA ... 85
Elliptic Curve Cryptography ... 88
Digital Signatures ... 94
Authenticity ... 94
Unforgeability (Nonrepudiation) ... 94
Nonreusability ... 95
ECDSA Signatures ... 96
Multisignatures ... 97
Threshold Signatures ... 98
Aggregate Signatures ... 99
Ring Signatures ... 100
Hash Functions ... 101
Preimage Resistance ... 101
Second Preimage Resistance ... 101
Collision Resistance ... 102
Design of Secure Hash Algorithms (SHA) ... 103
Design of SHA-3 (Keccak) ... 105
Message Authentication Codes ... 107
Hash-Based MACs (HMACs) ... 108
Verifiable Delay Functions ... 109
Verifiable Random Functions ... 110
Summary ... 111
Bibliography ... 111
PBFT ... 315
Certificates in PBFT ... 319
PBFT Advantages and Disadvantages ... 323
Safety and Liveness ... 324
Blockchain and Classical Consensus ... 327
Summary ... 328
Bibliography ... 328
Index ... 431
About the Author
Imran Bashir has an MSc in information security from Royal
Holloway, University of London. He has a background in
software development, solution architecture, infrastructure
management, information security, and IT service
management. His current focus is on the latest technologies
such as the blockchain and quantum computing. He is
a member of the Institute of Electrical and Electronics
Engineers (IEEE) and the British Computer Society (BCS).
His book on blockchain technology, Mastering Blockchain,
is a widely accepted standard text on the subject. He has worked in various senior
technical roles for different organizations worldwide. Currently, he is working as a
researcher in London, UK.
About the Technical Reviewer
Prasanth Sahoo is a thought leader, an adjunct professor, a
technical speaker, and a full-time practitioner in blockchain,
DevOps, cloud, and Agile, working for PDI Software.
He was awarded the “Blockchain and Cloud Expert of
the Year Award 2019” from TCS Global Community
for his knowledge share within academic services to
the community. He is passionate about driving digital
technology initiatives by handling various community
initiatives through coaching, mentoring, and grooming
techniques.
Acknowledgments
This book would not have been possible without help from many people. First, I would
like to thank Aditee Mirashi from Apress for her time, patience, and dedication to this
project.
Over the years, I have gone through many books, papers, online resources, and
lectures from experts and academics in this field to learn about this subject. I want to
thank all those researchers and engineers who have shared their knowledge. I also want
to thank the reviewers whose suggestions have improved this book greatly.
I want to thank my wife and children for their support and bearing with me when
I was busy writing during weekends, which I was supposed to spend with them.
Finally, I want to thank my father, my beacon of light. He sacrificed everything for
me, guided me at every step in life, and empowered me to achieve the best in life. Thank
you, Dad! He motivated me to write this book and suggested that I publish it in 2022.
And my mother, whose unconditional love for me has no bounds. Thank you Ammi!
Introduction
This book is an introduction to distributed consensus and its use in the blockchain.
It covers classical protocols, blockchain age protocols that emerged after Bitcoin, and
quantum protocols. Many enthusiasts have come from different backgrounds into
the blockchain world and may not have traditional distributed systems experience.
This book fills that knowledge gap. It introduces classical protocols and foundations
of distributed consensus so that a solid foundation is built to understand the
research on blockchain consensus. Many other people have come from traditional
distributed systems backgrounds, either developers or theorists. Still, they may lack the
understanding of blockchain and relevant concepts such as Bitcoin and Ethereum. This
book will fill that gap too.
Moreover, as quantum computing will impact almost everything in the future, I have
also covered how quantum computing can help build quantum consensus algorithms.
A clear advantage can be realized in the efficiency and security of consensus algorithms
by using quantum computing. Therefore, an entire chapter is dedicated to quantum
consensus.
This book is for everyone who wants to understand this fascinating world of
blockchain consensus and distributed consensus in general. A basic understanding
of computer science is all that's required to fully benefit from this book. This book can
also serve as a study resource for a one-semester course on blockchain and distributed
consensus.
The book starts with a basic introduction to what distributed consensus is and covers
fundamental ideas such as causality, time, and various distributed system models. Then
to build the foundation for understanding the security aspects of blockchain consensus,
an introduction to cryptography is provided. Then a detailed introduction to distributed
consensus is presented. Next, an introduction to the blockchain is given, which gives a
solid understanding of what a blockchain is and how it is fundamentally a distributed
system. We then discuss blockchain consensus, focusing on the first cryptocurrency
blockchain, Bitcoin, and how it achieves its security and distributed consensus goals.
Starting from Chapter 6 is an introduction to early protocols covering classical work
like the Byzantine generals problem and its various solutions. After this, classical
protocols such as Paxos, DLS, and PBFT are covered. Next, the blockchain protocols
such as ETHASH, Tendermint, GRANDPA, BABE, HotStuff, and Casper are introduced.
These protocols are the latest in the research on blockchain consensus mechanisms. Of
course, we cannot cover everything due to the vastness of the subject. Still, this chapter
dedicated to blockchain consensus introduces all those protocols which are state of
the art and in use in mainstream blockchain platforms, such as Polkadot, Ethereum,
and Cosmos.
The next chapter covers another exciting topic, quantum consensus. With the advent of
quantum computing, it has been realized that quantum computing can significantly
enhance the classical distributed consensus results. Even results such as FLP
impossibility might be possible to refute using quantum properties like superposition
and entanglement.
Finally, the last chapter summarizes what we have learned in the book, introduces
some exotic protocols, and suggests some research directions.
As this book focuses on the foundations of the blockchain and consensus, I believe
that this book will serve as a great learning resource for all enthusiasts who want to learn
about the blockchain and blockchain consensus. Furthermore, I hope that this book will
serve technologists, researchers, students, developers, and indeed anyone who wants to
know about this fascinating subject well for many years to come.
CHAPTER 1
Introduction
In this chapter, we explore the foundations of distributed computing. We will answer questions about what a distributed system is, its fundamental abstractions, system models, and relevant ideas.
Distributed Systems
In the literature, there are many different definitions of distributed systems. Still, fundamentally they all address the fact that a distributed system is a collection of computers working together to solve a problem. Famous scholars in this field have offered their own definitions, but all of them share this core idea.
Characteristics
What makes a system distributed? Here are some fundamental properties:
1. No global physical clock
2. No global shared memory
3. Autonomous, independent nodes
4. Heterogeneous
5. Coherent
6. Concurrency/concurrent operation
No global physical clock implies that the system is distributed in nature and
asynchronous. The computers or nodes in a distributed system are independent with
their own memory, processor, and operating system. These systems do not have a global
shared clock as a source of time for the entire system, which makes the notion of time
tricky in distributed systems, and we will shortly see how to overcome this limitation.
The fact that there is no global shared memory implies that the only way processes can
communicate with each other is by consuming messages sent over a network using
channels or links.
All processes or computers or nodes in a distributed system are independent, with
their own operating system, memory, and processor. There is no global shared memory
in a distributed system, which implies that each processor has its own memory and
its own view of its state and has limited local knowledge unless a message from other
node(s) arrives and adds to the local knowledge of the node.
Distributed systems are usually heterogeneous with multiple different types
of computers with different architecture and processors. Such a setup can include
commodity computers, high-end servers, IoT devices, mobile devices, and virtually
any device or "thing" that runs the distributed algorithm to solve a common problem
(achieve a common goal) the distributed system has been designed for.
Distributed systems are also coherent. This feature abstracts away all minute details
of the dispersed structure of a distributed system, and to an end user, it appears as a
single cohesive system. This concept is known as distribution transparency.
Concurrency in a distributed system is concerned with the requirement
that the distributed algorithm should run concurrently on all processors in the
distributed system.
Figure 1-1 shows a generic model of a distributed system.
Why Build Distributed Systems
There are several reasons why we would want to build a distributed system. The most common reason is scalability. For example, imagine you have a single server serving 100 users a day; when the number of users grows, the usual method is to scale vertically by adding more powerful hardware, for example, a faster CPU, more RAM, a bigger hard disk, etc. However, you can only go so far vertically, and at some point, you have to scale horizontally by adding more computers and somehow distributing the load between them. Other common reasons include the following:
• Reliability
• Performance
• Resource sharing
• Inherently distributed
Reliability
Reliability is a key advantage of distributed systems. Imagine you have a single computer. When it fails, there is no choice but to reboot it or, if it has developed a significant fault, replace it. A distributed system, in contrast, has multiple nodes, which allows it to tolerate faults up to a certain level. Thus, even if some computers fail in a distributed network, the distributed system keeps functioning.
Reliability is one of the significant areas of study and research in distributed computing,
and we will look at it in more detail in the context of fault tolerance shortly.
Reliability encompasses several aspects, including availability, integrity, and fault tolerance.
Performance
In distributed systems, better performance can be achieved naturally. For example, in
the case of a cluster of computers working together, better performance can be achieved
by parallelizing the computation. Also, in a geographically dispersed distributed
network, clients (users) accessing nodes can get data from the node which is closer to
their geographic region, which results in quicker data access. For example, in the case
of Internet file download, a mirror that is closer to your geographic region will provide
much better download speed as compared to the one that might be in another continent.
Performance of a distributed system generally encompasses two facets,
responsiveness and throughput.
Responsiveness
This property guarantees that the system is reasonably responsive, and users can get
adequate response from the distributed system.
Throughput
Throughput of a distributed system is another measure by which the performance of
the distributed system can be judged. Throughput basically captures the rate at which
processing is done in the system; usually, it is measured in transactions per second.
As we will see later in Chapter 5, a high transactions-per-second rate is quite desirable
in blockchain systems (distributed ledgers). Quite often, transactions per second or
queries executed per second are measured for a distributed database as a measure
of the performance of the system. Throughput is impacted by different aspects of the
distributed system, for example, processing speeds, communication network quality,
speed and reliability, and the algorithm. If your hardware is good, but the algorithm
is designed poorly, then that can also impact the throughput, responsiveness, and the
overall performance of the system.
Resource Sharing
Resources in a distributed system can be shared with other nodes/participants in the
distributed system. Sometimes, there are expensive resources such as a supercomputer,
a quantum computer, or some industrial grade printer which can be too expensive to
be made available at each site; in that case, resources can be shared via communication
links to other nodes remotely. Another scenario could be where data can be divided into
multiple partitions (shards) to enable quick access.
Inherently Distributed
There are scenarios where there is no option but to build a distributed system because
the problem can only be solved by a distributed system. For example, a messaging
system is inherently distributed. A mobile network is distributed by nature.
In these and similar use cases, a distributed system is the only one that can solve the
problem; therefore, the system has to be distributed by design.
With all these benefits of distributed systems, there are some challenges that need to
be addressed when building distributed systems. The properties of distributed systems
such as no access to a global clock, asynchrony, and partial failures make designing
reliable distributed systems a difficult task. In the next section, we look at some of the
primary challenges that should be addressed while building distributed systems.
Challenges
Distributed systems are hard to build. There are multiple challenges that need to be
addressed while designing distributed systems. A collection of some common challenges
is presented as follows.
Fault Tolerance
With more computers, at times hundreds of thousands in a data center, for example, in the case of cloud computing, inevitably something somewhere will be failing. In other words, the probability that some part of the distributed system fails, be it a network cable, a processor, or some other hardware, increases with the number of computers.
This aspect of distributed systems requires that even if some parts of the distributed
system fail (usually a certain threshold), the distributed system as a whole must keep
operating. To this end, there are various problems that are studied in distributed
computing under the umbrella of fault tolerance. Fault-tolerant consensus is one such
example where the efforts are made to build consensus algorithms that continue to run
correctly as specified even in the presence of a threshold of faulty nodes or links in a
distributed system. We will see more details about that in Chapter 3.
A relevant area of study is failure detection which is concerned with the
development of algorithms that attempt to detect faults in a distributed system. This is
especially an area of concern in asynchronous distributed systems where there is no
upper bound on message delivery times. The problem becomes even trickier when there is no way to distinguish between a failed node, a node that is simply slow, and a message lost on the link. Failure detection algorithms give a probabilistic
indication about the failure of a process. This up or down status of the node then can be
used to handle that fault.
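To make the idea concrete, the following is a minimal sketch (in Python, not tied to any particular library) of a heartbeat-style failure detector. Each node remembers when it last heard from each peer and suspects a peer whose heartbeat is overdue; the peer names and the timeout value are illustrative assumptions.

import time

class HeartbeatFailureDetector:
    """Suspect a peer if no heartbeat has been seen within `timeout` seconds."""

    def __init__(self, peers, timeout=5.0):
        self.timeout = timeout
        # Assume every peer was alive when monitoring started.
        self.last_seen = {peer: time.monotonic() for peer in peers}

    def on_heartbeat(self, peer):
        # Called whenever a heartbeat message arrives from `peer`.
        self.last_seen[peer] = time.monotonic()

    def suspected(self):
        # A peer is only *suspected*: it may merely be slow, or its heartbeat
        # may have been lost on the link, hence the probabilistic indication.
        now = time.monotonic()
        return [p for p, t in self.last_seen.items() if now - t > self.timeout]

# Example usage with three hypothetical nodes.
fd = HeartbeatFailureDetector(["node-a", "node-b", "node-c"], timeout=5.0)
fd.on_heartbeat("node-a")
print(fd.suspected())   # peers whose heartbeats are overdue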
Another area of study is replication which provides fault tolerance on the principle
that if the same data is replicated across multiple nodes in a distributed system, then
even if some nodes go down the data is still available, which helps to keep the system
stable and continue to meet its specification (guarantees) and remain available to the
end users. We will see more about replication in Chapter 3.
Security
Being a distributed system with multiple users using it, out of which some might be
malicious, the security of distributed systems becomes a prime concern. This situation
is even more critical in geographically dispersed distributed systems and open systems
such as blockchains, for example, the Bitcoin blockchain. To this end, the fundamental science used for providing security in distributed systems is cryptography, which we will cover in detail in Chapter 2 and keep referring to throughout the book, especially in relation to blockchain consensus. Here, we study topics such as cryptography
and address challenges such as authentication, confidentiality, access control,
nonrepudiation, and data integrity.
Heterogeneity
A distributed system is not necessarily composed of exactly the same hardware nodes. Such a setup is possible and is called a homogeneous distributed system, but usually the hardware and operating systems differ from node to node. In this type of scenario, different operating
systems and hardware can behave differently, leading to synchronization complexities.
Some nodes might be slow, running a different operating system which could have bugs,
some might run faster due to better hardware, and some could be resource constrained
as mobile devices or IoT devices. With all these different types of nodes (processes,
computers) in a distributed system, it becomes challenging to build a distributed
algorithm that works correctly on all these different types of systems and continues to
operate correctly despite the differences in the local operating environment of the nodes.
Distribution Transparency
One of the goals of a distributed system is to achieve transparency. It means that the distributed system, no matter how many individual computers and peripherals it is built of, should appear as a single coherent system to the end user. For example,
an ecommerce website may have many database servers, firewalls, web servers, load
balancers, and many other elements in their distributed system, but all that should be
abstracted away from the end user. The end user is not necessarily concerned about
these backend "irrelevant" details but only that when they make a request, the system
responds. In summary, the distributed system is coherent if it behaves in accordance
with the expectation of the end user, despite its heterogeneous and dispersed structure.
For example, think about IPFS, a distributed file system. Even though the files are spread
and sharded across multiple computers in the IPFS network, to the end user all that
detail is transparent, and the end user operates on it almost as if they are using a local
file system. Similar observation can be made about other systems such as online email
platforms and cloud storage services.
Timing and Synchronization
Synchronization is a vital operation of a distributed system to ensure a stable global
state. As each process has its own view of time, depending on its internal physical clock, and these clocks can drift apart, time synchronization becomes one of the fundamental issues to address in designing distributed systems. We will see more details around this
interesting problem and will explore some solutions in our section on timing, orders,
and clocks in this chapter.
Global State
As the processes in a distributed system only have knowledge of their local states, it
becomes quite a challenge to ascertain the global state of the system. There are several
algorithms that can be used to do that, such as the Chandy-Lamport algorithm. We will
briefly touch upon that shortly.
Concurrency
Concurrency means multiple processes running at the same time. There is also a
distinction made between logical and physical concurrency. Logical concurrency refers
to the situation when multiple programs are executed in an interleaving manner on a
single processor. Physical concurrency is where program units from the same program
execute at the same time on two or more processors.
Distributed systems are ubiquitous. They are in everyday use and have become
part of our daily routine as a society. Be it the Internet, the World Wide Web, Bitcoin,
Ethereum, Google, Facebook, or Twitter, distributed systems are now part of our daily
lives. At the core of distributed systems, there are distributed algorithms which form the
foundation of the processing being performed by the distributed system. Each process
runs the same copy of the algorithm that intends to solve the problem for which the
distributed system has been developed, hence the term distributed algorithm.
A process can be a computer, an IoT device, or a node in a data center. We abstract
these devices and represent them as processes, although physically a process can be any kind of computing device.
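As an illustration of this abstraction, the sketch below shows the typical shape of a distributed algorithm as executed by one process: an event loop that reacts to incoming messages, updates the local state, and sends messages to peers. The network object with its peers, send, and inbox methods is a hypothetical placeholder, not a specific framework.

class Process:
    """One process (node) running its copy of a distributed algorithm."""

    def __init__(self, pid, network):
        self.pid = pid          # unique process identifier
        self.network = network  # assumed to offer peers(), send(), and inbox()
        self.state = {}         # local state, built up from events

    def on_message(self, sender, msg):
        # Internal event: update local knowledge with the received message.
        self.state[sender] = msg
        # Send events: share what was learned with the other processes.
        for peer in self.network.peers(self.pid):
            self.network.send(peer, {"from": self.pid, "seen": msg})

    def run(self):
        # Receive events drive the algorithm; every process runs this same loop.
        for sender, msg in self.network.inbox(self.pid):
            self.on_message(sender, msg)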
Now let’s see some of the relevant technologies and terminologies.
Parallel vs. Distributed vs. Concurrency
The following table contrasts parallel systems with distributed systems:
Aspect            Parallel System                          Distributed System
Memory            Shared memory, a common address space    Each processor has its own memory
Coupling          Tightly coupled                          Loosely coupled
Synchronization   Through a global shared clock            Through synchronization algorithms
Goal              High performance                         Scalability
Algorithms        Concurrent                               Distributed
Messaging         No network, shared memory                Message-passing network
There are some overlapping ideas in distributed computing, and sometimes it becomes a bit difficult for beginners to tell them apart. In the next section, I will try to clarify some of the pertinent terminology and resolve some ambiguities.
Centralized vs. Decentralized vs. Distributed
Figure 1-2 shows the traditional view of centralized, decentralized, and distributed
systems. However, in recent years a slightly different picture has started to emerge, which contrasts a system with a central controller against one with no controller at all, where all users participate equally without any dependence on a trusted third party. These new types of distributed systems are blockchains, especially public
blockchains, where there is no central controller, such as Bitcoin blockchain. We will
cover more on blockchain in Chapter 4 and then throughout the book. However,
let’s now look at Figure 1-3, which depicts this type of architecture and highlights the
differences from a control point of view.
Notice in Figure 1-3 that the topology of the distributed and decentralized systems may be the same, but the distributed system has a central controller, depicted by a symbolic hand at the top of the figure. In the decentralized system, however, no hand is shown, which depicts that there is no single central controller or authority.
So far, we have focused mainly on the architecture of distributed systems and
generally defined and explored what distributed systems are. Now let’s look at the
most important fundamental element of a distributed system, that is, the distributed
algorithm that enables a distributed system to do what it is supposed to do. It is the
algorithm that runs on each node in a distributed system to accomplish a common
goal. For example, a common goal in a cryptocurrency blockchain is to disallow double-
spending. The logic to handle that is part of the distributed algorithm that runs on
each node of the cryptocurrency blockchain, and collectively and collaboratively, the
blockchain (the distributed system) accomplishes this task (goal) to avoid double-
spending. Don’t worry if some of these terms don’t make sense now; they will become
clear in Chapter 4.
Distributed Algorithm
A distributed algorithm runs concurrently on multiple computers to accomplish a task in a distributed system. Each computer runs the same algorithm, and together they achieve a common goal.
Elements of Distributed Computing/Pertinent Terms/Concepts
Some pertinent terms and concepts in distributed computing are as follows:
• Processes
• Events
• Executions
• Links
• State
• Global state
• Cuts
Processes
A process in a distributed system is a computer that executes the distributed algorithm.
It is also called a node. It is an autonomous computer that can fail independently and
can communicate with other nodes in the distributed network by sending and receiving
messages.
Events
An event can be defined as some operation occurring in a process. Three types of events can occur in a process:
• Internal events
• Send (message send) events
• Receive (message delivery) events
State
The concept of state is critical in distributed systems. You will come across this term
quite a lot in this book and other texts on distributed systems, especially in the context of
distributed consensus. Events make up the local state of a node. In other words, a state is
composed of events (results of events) in a node. Or we can say that the contents of the
local memory, storage, and program as a result of events make up the process’s state.
Global State
The collection of states in all processes and communication links in a distributed system
is called a global state.
This is also known as configuration which can be defined as follows:
The configuration of a distributed system is composed of states of the processes and
messages in transit.
Execution
An execution in a distributed system is a run or computation of the distributed algorithm
by a process. There are two types of executions:
• Synchronous execution
• Asynchronous execution
Cuts
A cut can be defined as a line joining a single point in time on each process line in a
space-time diagram. Cuts on a space-time diagram can serve as a way of visualizing the
global state (at that cut) of a distributed computation. Also, it serves as a way to visualize
what set of events occurred before and after the cut, that is, in the past or future. All
events on the left of the cut are considered past, and all events on the right side of the
cut are said to be future. There are consistent cuts and inconsistent cuts. If all received
messages are sent within the elapsed time before the cut, that is, the past, it is called
a consistent cut. In other words, a cut that obeys causality rules is a consistent cut. An
inconsistent cut is where a message crosses the cut from the future (right side of the cut)
to the past (left side of the cut).
If a cut crosses over a message from the past to the future, it is a graphical
representation of messages in transit.
The diagram shown in Figure 1-6 illustrates this concept, where C1 is an inconsistent
cut and C2 is a consistent cut.
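A consistent cut can also be checked programmatically. In the minimal sketch below, a cut records how many events of each process it includes, and every message is recorded by the event indices of its send and receive; the cut is consistent if no message is received inside the cut but sent outside it. The event encoding is an illustrative assumption.

# Each message: ((sender, send_event_index), (receiver, receive_event_index)).
messages = [
    (("P1", 2), ("P2", 3)),   # sent as P1's 2nd event, delivered as P2's 3rd event
    (("P2", 4), ("P1", 5)),
]

def is_consistent(cut, messages):
    """`cut` maps each process to the number of its events included in the cut."""
    for (sender, s_idx), (receiver, r_idx) in messages:
        received_in_cut = r_idx <= cut.get(receiver, 0)
        sent_in_cut = s_idx <= cut.get(sender, 0)
        # A receive inside the cut whose send lies in the future violates causality.
        if received_in_cut and not sent_in_cut:
            return False
    return True

print(is_consistent({"P1": 1, "P2": 3}, messages))   # False: inconsistent cut
print(is_consistent({"P1": 2, "P2": 3}, messages))   # True: consistent cut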
The Chandy-Lamport snapshot algorithm, mentioned earlier, records such a consistent global state using marker messages: the initiating process records its own state and sends a marker on every outgoing channel; a process receiving a marker for the first time records its own state and starts recording messages arriving on its other incoming channels; and a process stops recording a channel when a marker arrives on that channel.
Note that any process can initiate the snapshot, the algorithm does not interfere with
the normal operation of the distributed system, and each process records the state of
incoming channels and its own.
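The following is a highly simplified sketch of this marker-based snapshot idea; it assumes FIFO channels and omits termination detection and snapshot collection. The channel objects and their send method are illustrative placeholders.

class SnapshotNode:
    def __init__(self, pid, out_channels):
        self.pid = pid
        self.out_channels = out_channels   # assumed outgoing channel objects
        self.recorded_state = None         # this node's recorded local state
        self.channel_msgs = {}             # in_channel -> in-transit messages
        self.done = set()                  # channels whose recording has stopped

    def start_snapshot(self, local_state, in_channels):
        # Any process may initiate: record own state, then send markers.
        self.recorded_state = local_state
        for ch in self.out_channels:
            ch.send(("MARKER", self.pid))
        self.channel_msgs = {ch: [] for ch in in_channels}

    def on_message(self, in_channel, msg, local_state, in_channels):
        if msg[0] == "MARKER":
            if self.recorded_state is None:
                # First marker seen: record own state and start recording channels.
                self.start_snapshot(local_state, in_channels)
            # A marker on a channel stops the recording of that channel.
            self.done.add(in_channel)
        elif in_channel in self.channel_msgs and in_channel not in self.done:
            # Normal message while recording: it was in transit at snapshot time.
            self.channel_msgs[in_channel].append(msg)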
Software Architecture Models
Distributed systems can be organized according to several software architecture models, described in the following sections.
Client-Server
This model is a common way to have two processes work together. A process assumes
the role of a client, and the other process assumes the role of a server. The server receives
requests made by the client and responds with a reply. There can be multiple client
processes but only a single server process. For example, a classic web client and web
server (browser to a web server) design follows this type of architecture. Figure 1-7
depicts the so-called physical view of this type of architecture.
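As a minimal illustration of this model, the sketch below uses Python's standard socket module: a server process accepts a request and sends back a reply, and a client process sends a request and waits for the response. The address, port, and message contents are arbitrary example values.

import socket

HOST, PORT = "127.0.0.1", 9000   # example address only

def run_server():
    # The server process: waits for a client request and sends a reply.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen()
        conn, _addr = srv.accept()
        with conn:
            request = conn.recv(1024)
            conn.sendall(b"reply to: " + request)

def run_client():
    # The client process: sends a request and waits for the server's reply.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((HOST, PORT))
        cli.sendall(b"GET /prices")
        print(cli.recv(1024))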
Multiserver
A multiserver architecture is where multiple servers work together. In one style of
architecture, the server in the client-server model can itself become a client of another
server. For example, if I have made a request from my web browser to a web server to
find prices of different stocks, it is possible that the web server now makes a request to
the backend database server or, via a web service, requests this pricing information from
some other server. In this scenario, the web server itself has become a client. This type of
architecture can be seen as a multiserver architecture.
Another quite common scenario is where multiple servers act together to provide a
service to a client, for example, multiple database servers providing data to a web server.
There are two usual methods to implement such collaborative architecture. The first is
data partitioning, and another is data replication. Another closely related term to data
partitioning is data sharding.
Data partitioning refers to an architecture where data is distributed among the nodes
in a distributed system, and each node becomes responsible for its partition (section) of
the data. Partitioning of data helps to achieve better performance, easier administration,
load balancing, and better availability. For example, data for each department of a
company can be divided into partitions and stored separately on different local servers.
Another way of looking at it is that if we have a large table with one million rows, I might
put half a million rows on one server and another half on another server. This scheme is an example of horizontal partitioning.
Note that data partitioning shown in Figure 1-8 is where a large central database
is partitioned into smaller datasets relevant to each region, and a regional server then
manages the partition. However, in another type of partitioning, a large table can be
partitioned into different tables, but it remains on the same physical server. It is called
logical partitioning.
A shard is a horizontal partition of data where each shard (fragment) resides on a
separate server. One immediate benefit of such an approach is load balancing to spread
the load between servers. This concept is shown in Figure 1-9.
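A common way to decide which shard holds a given row is to hash its key, as in the sketch below; the shard server names are purely illustrative. A known drawback of simple modulo hashing is that adding a server reshuffles most keys, which is why consistent hashing is often used instead.

import hashlib

SHARD_SERVERS = ["server-1", "server-2", "server-3"]   # illustrative shard servers

def shard_for(key: str) -> str:
    # Hash the row key and map the digest onto one of the shards.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return SHARD_SERVERS[int(digest, 16) % len(SHARD_SERVERS)]

# Each customer row lands on exactly one shard, spreading the load.
for customer_id in ["cust-1001", "cust-1002", "cust-1003"]:
    print(customer_id, "->", shard_for(customer_id))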
Data replication refers to an architecture where each node in the distributed system
holds an identical copy of the data. A simple example is that of a RAID 1 system; while the disks are not separate physical servers, the data is replicated across two disks, which makes it a data replication (commonly called mirroring) architecture. In another
scenario, a database server might run a replication service to replicate data across
multiple servers. This type of architecture allows for better performance, fault tolerance,
and higher availability. A specific type of replication and fundamental concept in a
distributed system is state machine replication used to build fault-tolerant distributed
systems. We will cover more about this in Chapter 3.
Figure 1-10 shows multiserver architectures where a variation of the client-server
model is shown. The server can act as a client to another server. This is another
approach where multiple servers work together closely to provide a service.
Figure 1-10. Multiple servers acting together (client-server and multiple servers
coordinating closely/closely coupled servers)
In summary, replication refers to a practice where a copy of the same data is kept on
multiple different nodes, whereas partitioning refers to a practice where data is split into
smaller subsets, and these smaller subsets are then distributed across different nodes.
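The distinction can be summed up in a few lines of illustrative code: a replicated write goes to every node, while a partitioned write goes only to the node that owns the key. The node names and the ownership rule are assumptions for illustration.

NODES = {"node-a": {}, "node-b": {}, "node-c": {}}   # each node's local store

def replicated_write(key, value):
    # Replication: every node keeps an identical copy of the data.
    for store in NODES.values():
        store[key] = value

def partitioned_write(key, value):
    # Partitioning: only the node responsible for this key stores it.
    owner = sorted(NODES)[hash(key) % len(NODES)]
    NODES[owner][key] = value

replicated_write("balance:alice", 100)   # present on all three nodes
partitioned_write("balance:bob", 50)     # present on exactly one node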
Proxy Servers
A proxy server–based architecture allows for intermediation between clients and
backend servers. A proxy server can receive the request from the clients and forward
it to the backend servers (most commonly, web servers). In addition, proxy servers
can interpret client requests and forward them to the servers after processing them.
This processing can include applying some rules to the request, perhaps anonymizing
the request by removing the client’s IP address. From a client’s perspective, using
proxy servers can improve performance by caching. These servers are usually used in
enterprise settings where corporate policies and security measures are applied to all
web traffic going in or out of the organization. For example, if some websites need to
be blocked, administrators can use a proxy server to do just that where all requests go
through the proxy server, and any requests for blocked sites are intercepted, logged, and
ignored.
The diagram in Figure 1-12 shows a proxy architecture.
Figure 1-12. Proxy architecture – one proxy between servers and clients
Peer to Peer
In the peer-to-peer architecture, the nodes do not have specific client or server roles.
They have equal roles. There is no single client or a server. Instead, each node can play
either a client or a server role, depending on the situation. The fact that all nodes have an
equal role resulted in the term "peer."
Peer-to-peer architecture is shown in the diagram in Figure 1-13.
In some scenarios, it is also possible that not all nodes have equal roles; some may
act as servers and clients to each other. Generally, however, all nodes have the same role
in a peer-to-peer network.
Now that we have covered some architectural styles of distributed systems, let’s focus
on a more theoretical side of the distributed system, which focuses on the abstract view
of the distributed systems. First, we explore the distributed system model.
Distributed System Model
Figure 1-14. Physical architecture (left) vs. abstract system model (right)
Now let’s see what the three fundamental abstractions in a distributed system are.
Failures characterize all these abstractions; we capture our assumptions about what faults might occur in our system. For example, processes or nodes can crash or act maliciously
in a distributed system. A network can drop messages, or messages can be delayed.
Message delays are captured using timing assumptions.
So, in summary, when a distributed system model is created, we make some
assumptions about the behavior of the system. This process includes timing assumptions
regarding processes and the network. We also make failure assumptions regarding the
network and the processors, for example, how a process can fail and whether it can
exhibit arbitrary failures, how an adversary can affect the processors or the network, and
whether processes can crash or recover after a crash. Is it possible that the network links
drop messages? In the next section, we discuss all these scenarios in detail.
Processes
A process or node is a fundamental element in a distributed system which runs the
distributed algorithm to achieve that common goal for which the distributed system has
been designed.
Now imagine what a process can do in a distributed system. First, let’s think about a
normal scenario. If a process is behaving according to the algorithm without any failures,
then it is called a correct process or honest process. So, in our model we say that a node
running correctly is one of the behaviors a node can exhibit. What else? Yes, of course, it
can fail. If a node fails, we say it’s faulty; if not, then it is nonfaulty or correct or honest.
There are different types of failures that can occur in a process, such as
• Crash-stop
• Omission
• Crash with recovery
• Eavesdropping
• Arbitrary
Crash-Stop Failure
Crash-stop faults are where a process crashes and never recovers. This model of faults
or node behavior captures an irreparable hardware fault, for example, short circuit in a
motherboard causing failure.
Omission Failure
Omission failures capture the fault scenarios where a processor fails to send a message
or receive a message. Omission failures are divided into three categories: send
omissions, receive omissions, and general omissions. Send omissions are where a processor does not send a message it was supposed to send as per the distributed algorithm; receive omissions occur when a process does not receive an expected message; and general omissions cover both. In practical terms, these omissions arise due to physical faults, memory issues, buffer overflows, malicious actions, and network congestion.
Crash with Recovery
A process exhibiting crash with recovery behavior can recover after a crash. It captures
a scenario where a process crashes, loses its in-memory state, but recovers and resumes
its operation later. This occurrence can be seen as an omission fault too, where now the
node will not send or receive any messages because it has crashed. In practical terms, it
can be a temporary intentional restart of a process or reboot after some operating system
errors. Some examples include resumption of the normal operation after rebooting due
to a blue screen in Windows or kernel panic in Linux.
When a process crashes, it may lose its internal state (called amnesia), making a
recovery tricky. However, we can alleviate this problem by keeping stable storage (a log)
which can help to resume operations from the last known good state. A node may also
lose all its state after recovery and must resynchronize with the rest of the network. It
may also happen that a node is down for a long time and has desynchronized with the
rest of the network (other nodes) and has its old view of the state. In that case, the node
must resynchronize with the network. This situation is especially true in blockchain
networks such as Bitcoin or Ethereum, where a node might be off the network for quite
some time. When it comes back online, it synchronizes again with the rest of the nodes
to resume its full normal operation.
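A minimal sketch of the stable-storage idea mentioned above: the process appends every state change to a log file before applying it in memory, so that after a crash it can replay the log and resume from the last known good state. The file name and record format are illustrative assumptions.

import json, os

LOG_FILE = "state.log"   # stable storage that survives a process crash

def apply_and_persist(state, update):
    # Write the update to stable storage first, then apply it in memory.
    with open(LOG_FILE, "a") as log:
        log.write(json.dumps(update) + "\n")
        log.flush()
        os.fsync(log.fileno())
    state.update(update)

def recover():
    # After a crash (or restart), rebuild the in-memory state from the log.
    state = {}
    if os.path.exists(LOG_FILE):
        with open(LOG_FILE) as log:
            for line in log:
                state.update(json.loads(line))
    return state

state = recover()
apply_and_persist(state, {"last_processed": 42})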
Eavesdropping
In this model, a distributed algorithm may leak confidential information, and an
adversary can eavesdrop to learn some information from the processes. This model
is especially true in untrusted and geographically dispersed environments such as
a blockchain. The usual defense against these attacks is encryption which provides
confidentiality by encrypting the messages.
Arbitrary (Byzantine)
A Byzantine process can exhibit any arbitrary behavior. It can deviate from the
algorithm in any possible way. It can be malicious, and it can actively try to sabotage
the distributed algorithm, selectively omit some messages, or covertly try to undermine
the distributed algorithm. This type of fault is the most complex and challenging in
a distributed algorithm or system. In practical terms, it could be a hacker coming up
with novel ways to attack the system, a virus or worm on the network, or some other
unprecedented attack. There is no restriction on the behavior of a Byzantine faulty node;
it can do anything.
A relevant concept is that of the adversary model, where the adversary behavior is
modelled. We will cover this later in the section “Adversary Model”.
Now we look at another aspect of the distributed system model, network.
Network
In a distributed network, links (communication links) are responsible for passing messages, that is, they take messages from one node and deliver them to other nodes. Usually, the assumption is a bidirectional point-to-point connection between nodes.
A network partition is a scenario where the network link becomes unavailable for
some finite time between two groups of nodes. In practice, this could be due to a data center network failure, for example, a faulty switch or a severed cable that disconnects one group of nodes from the other.
Link Failures
Links can experience crash failure where a correctly functioning link may stop carrying
messages. Another type of link failure is omission failure, where a link carries some messages and drops others. Finally, Byzantine or arbitrary failures can occur on links, where the link can create rogue messages, modify messages, and deliver some messages while dropping others.
With this model, we can divide the communication links into different types
depending on how they fail and deliver the messages.
Two types of events occur on links (channels), the send event where a message is
put on the link and the deliver event where the link dispenses a message, and a process
delivers it.
Fair-Loss Links
In this abstraction, we capture how messages on this link can be lost, duplicated, or
reordered. Messages may be lost but are eventually delivered if the sender and receiver processes are correct and the sender keeps retransmitting. More formally, the three
properties are as follows.
Fair-Loss
This property guarantees that the link with this property does not systematically drop
every message, which implies that, eventually, delivery of a message to the destination
node will be successful even if it takes several retransmissions.
Finite Duplication
This property ensures that the network does not perform more retransmissions than the
sender does.
No Creation
This property ensures that the network does not corrupt messages or create messages
out of thin air.
Stubborn Links
This abstraction captures the behavior of the link where the link delivers any message
sent infinitely many times. The assumption about the processes in this abstraction is
that both sender and receiver processes are correct. This type of link will stubbornly try
to deliver the message without considering performance. The link will just keep trying
regardless until the message is delivered.
Formally, there are two properties that stubborn links have.
Stubborn Delivery
This property means that if a message m is sent from a correct process p to a correct
process q once, it will be delivered infinitely many times by process q, hence the term
"stubborn"!
No Creation
This means that messages are not created out of the blue, and if a message is delivered
by some process, then it must have been sent by a process. Formally, if a process q
delivers a message m sent from process p, then the message m is indeed sent from
process p to process q.
Reliable Delivery
This property guarantees that if a correct process p sends a message m to a correct process q, then q eventually delivers m.
No Duplication
A correct process p does not deliver a message m more than once.
No Creation
This property ensures that messages are not created out of thin air, and if they are
delivered, they must have been created and sent by a correct process before delivering.
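These link abstractions are typically layered on top of one another: a stubborn link can be built over a fair-loss link by retransmitting every message forever, and reliable delivery with no duplication can be built over a stubborn link by filtering duplicates at the receiver. The sketch below illustrates the idea; fair_loss_send and deliver_to_application are assumed placeholders for the lower and upper layers.

sent = []          # messages this process keeps retransmitting (stubborn behavior)
delivered = set()  # message ids already delivered (guarantees no duplication)

def stubborn_send(dst, msg_id, payload):
    # Remember the message; it will be retransmitted forever over the fair-loss link.
    sent.append((dst, msg_id, payload))

def on_timer():
    # Periodic retransmission: a fair-loss link eventually delivers a message
    # that is retransmitted often enough.
    for dst, msg_id, payload in sent:
        fair_loss_send(dst, msg_id, payload)        # assumed lower-level primitive

def on_fair_loss_deliver(src, msg_id, payload):
    # Reliable delivery with no duplication: deliver each message id only once.
    if msg_id not in delivered:
        delivered.add(msg_id)
        deliver_to_application(src, payload)        # assumed upper-layer callback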
Arbitrary Links
In this abstraction, the link can exhibit any behavior. Here, we consider an active
adversary who has the power to control the messages. This link depicts scenarios where
an attacker can do malicious actions, modify the messages, replay them, or spoof them.
In short, on this link, any attack is possible.
In practical terms, this depicts a typical Internet connection where a hacker can
eavesdrop, modify, spoof, or replay the messages. But, of course, this could also be due
to Internet worms, traffic analyzers, and viruses.
Synchrony and Timing
In distributed systems, delays and speed assumptions capture the behavior of the
network.
In practical terms, delays are almost inevitable in a distributed system, first
because of inherent asynchrony, dispersion, and heterogeneity and specific causes
such as message loss, slow processors, and congestion on the network. Due to network
configuration changes, it may also happen that unexpected or new delays are introduced
in the distributed system.
Synchrony assumption in a distributed system is concerned with network delays and
processor delays incurred by slow network links or slow processor speeds.
In practical terms, processors can be slow because of memory exhaustion in the
nodes. For example, Java programs can pause execution altogether during the "stop the
world" type of garbage collection. On the other hand, some high-end processors are
inherently faster than low-end processors on resource-constrained devices. All these
differences and situations can cause delays in a distributed system.
In the following, we discuss three models of synchrony that capture the timing
assumption of distributed systems.
Synchronous
A synchronous distributed system has a known upper bound on the time it takes for
a message to reach a node. This situation is ideal. However, in practice, messages can
sometimes be delayed. Even in a perfect network, there are several factors, such as
network link quality, network latency, message loss, processing speed, or capacity of the
processors, which can adversely affect the delivery of the message.
In practice, synchronous systems exist, for example, a system on a chip (SoC),
embedded systems, etc.
Asynchronous
Asynchronous distributed systems are on the other end of the spectrum. In this model, no timing assumptions are made. In other words, there is no upper bound on the time it takes to deliver a message. There can be arbitrarily long and
unbounded delays in message delivery or processing in a node. The processes can run at
different speeds.
Also, a process can arbitrarily pause or delay the execution or can process faster
than other processes. You can probably imagine now that distributed algorithms
designed for such a system can be very robust and resilient. However, many problems
cannot be solved in an asynchronous distributed system. A whole class of results called
"impossibility results" captures the unsolvable problems in distributed systems. We will
look at impossibility results in more detail later in the chapter and then in Chapter 3.
Partially Synchronous
A partially synchronous model captures the assumption that the network is primarily
synchronous and well behaved, but it can sometimes behave asynchronously. For
example, processing speeds can differ, or network delays can occur, but the system
ultimately returns to a synchronous state to resume normal operation.
Another way to think about this is that real systems are usually synchronous but can unpredictably, and for a bounded amount of time, behave asynchronously; however, there are long enough periods of synchrony during which the system is able to make decisions and terminate.
In summary, we can quote Leonardo da Vinci:
Time stays long enough for anyone who will use it.
Eventually Synchronous
In the eventually synchronous version of partial synchrony, the system can be initially asynchronous, but there is a time called the global stabilization time (GST), unknown to the processors, after which the system becomes synchronous. Also,
it does not mean that the system will forever remain synchronous after GST. That is not
possible practically, but the system is synchronous for a long enough period after GST to
make a decision and terminate.
We can visualize the spectrum of synchrony models from asynchronous to
synchronous in Figure 1-16.
Both message delivery delay and relative speed of the processes are taken into
consideration in synchrony models.
Formal Definitions
Some formal definitions regarding the partial synchrony model are stated as follows:
• Δ is the upper bound on message delivery (communication) delay.
• Φ is the upper bound on the relative speed of processes (processing delay).
• GST is the global stabilization time after which the system behaves synchronously.
With these preceding variables defined, we can define various models of synchrony
as follows:
• Where fixed upper bounds Δ and Φ exist, but they are not known.
• Where fixed upper bounds Δ and Φ are known but hold only after some unknown time T. This is the eventually synchronous model, and the time after which the bounds hold is the GST.
• In another variation, after GST, Δ holds for long enough to allow the protocol to terminate.
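In practice, algorithms designed for this model do not need to know Δ in advance. A common technique, sketched below, is to guess a timeout and increase it whenever it proves too small, so that after GST the guess eventually exceeds the real bound and the request succeeds; send_request and await_reply are assumed placeholders.

def request_with_adaptive_timeout(send_request, await_reply):
    timeout = 0.5   # initial guess (in seconds) for the unknown bound
    while True:
        send_request()
        reply = await_reply(timeout)   # assumed: returns None on timeout
        if reply is not None:
            return reply
        # Timed out: either the guess was too small or we are before GST.
        # Doubling means the timeout eventually exceeds the real bound after GST.
        timeout *= 2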
Now that we have discussed the synchrony models, let's turn our attention to the adversary model, which allows us to make assumptions about the effect of an adversary on a distributed system: how an adversary can behave and what powers it may have to adversely influence the system.
Adversary Model
In addition to assumptions about synchrony and timing in a distributed system model,
there is another model where assumptions about the power of the adversary and how
it can adversely affect the distributed system are made. This is an important model
which allows a distributed system designer to reason about different properties of the
distributed system while facing the adversary. For example, a distributed algorithm may be guaranteed to work correctly only if fewer than half of the nodes are controlled by a malicious adversary. Therefore, adversary models usually place a limit on what an adversary can do; if an adversary is assumed to be all-powerful, able to do anything and control all nodes and communication links, then there is no guarantee that the system will ever work correctly.
Adversary models can be divided into different types depending on the kind of influence the adversary can have on the distributed system.
In this model, it is assumed there is an external entity that has corrupted the
processes and can control and coordinate faulty processes’ actions. This entity is called
an adversary. Note that there is a slight difference compared to the failure model here
because, in the failure model, the nodes can fail for all sorts of reasons, but no external
entity is assumed to take control of processes.
Adversaries can affect a distributed system in several ways. A system designer
using an adversary model considers factors such as the type of corruption, time
of corruption, and extent of corruption (how many processes simultaneously). In
addition, computational power available to the adversary, visibility, and adaptability
of the adversary are also considered. The adversary model also allows designers to
specify to what limit the number of processes in a network can be corrupted.
We will briefly discuss these types here.
Threshold Adversary
A threshold adversary is a standard and widely used model in distributed systems. In this
model, there is a limit imposed on the number of overall faulty processes in the system.
In other words, there is a fixed upper bound f on the number of faulty processes in the
network. This model is also called the global adversary model. Many different algorithms
have been developed under this assumption. Almost all of the consensus protocols
work under at least the threshold adversary model where it is assumed that an adversary
can control up to f nodes in a network. For example, the Paxos protocol, discussed in Chapter 7, like other classical consensus algorithms, achieves consensus under the assumption that an adversary can control fewer than half of the total number of nodes in the network.
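As a rough illustration of the threshold model, the following sketch computes the maximum number of faulty nodes that can be tolerated under bounds commonly cited in the literature, that is, n ≥ 2f + 1 for crash faults and n ≥ 3f + 1 for Byzantine faults. These are general rules of thumb shown here only for illustration; the exact bound assumed by a given protocol is defined by that protocol.

# Illustrative fault-tolerance thresholds commonly cited in the literature.
# They are not specific to any single protocol.

def max_crash_faults(n: int) -> int:
    """Largest f such that n >= 2f + 1 (a majority of correct nodes remains)."""
    return (n - 1) // 2

def max_byzantine_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1 (the classic BFT bound)."""
    return (n - 1) // 3

if __name__ == "__main__":
    for n in (4, 5, 7, 10, 100):
        print(f"n={n:3d}  crash f={max_crash_faults(n):2d}  "
              f"Byzantine f={max_byzantine_faults(n):2d}")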
Dynamic Adversary
Also called an adaptive adversary; in this model, the adversary can corrupt processes at any time during the execution of the protocol. A process, once corrupted, remains faulty until the execution ends.
Static Adversary
This type of adversary can perform its adversarial activities, such as corrupting processes, only before the protocol starts executing.
Passive Adversary
This type of adversary does not actively try to sabotage the system; however, it can learn
some information about the system while running the protocol. Thus, it can be called a
semi-honest adversary.
An adversary can cause faults under two models: the crash failure model and the
Byzantine failure model.
In the crash failure model, the adversary can, at any time during the execution, stop a process it controls from executing the protocol.
In the Byzantine failure model, the adversary has complete control over the
corrupted process and can control it to deviate arbitrarily from the protocol. Protocols
that work under these assumptions and tolerate such faults are called crash fault–
tolerant protocols (CFT) or Byzantine fault–tolerant protocols (BFT), respectively.
Time, Clocks, and Order
Computers rely on a notion of time for many purposes, for example, timestamping records, measuring timeouts, and scheduling internal events. All these use cases and countless other computer and distributed systems operations require some notion of time.
The notion of time in distributed systems is tricky. Events shown in Figure 1-18 need
to be ordered for a distributed system to be reasonably useful. Ordering of events in a
distributed system is one of the fundamental and critical requirements. As there is no
global shared clock in distributed systems, the ordering of events becomes a challenging
problem. To this end, the main concern here is to accomplish the correct order of
events in the system. We have this notion of time in our daily lives where we can say that
something happened before something else. For example, if I sat an exam and the results
came out a week later, we can say confidently that the exam must have occurred or
happened before the results came out. We can visualize this relationship in the diagram
in Figure 1-19.
Usually, we are familiar with the physical clock, that is, our typical day-to-day notion of time, where I can say something like "I will meet you at 3 PM today" or "the football match is at 11 AM tomorrow." Physical clocks can also be used in distributed systems, and several
algorithms are used to synchronize time across all nodes in a distributed system. These
algorithms can synchronize clocks in a distributed system using message passing.
Let’s first have a look at the physical clocks and see some algorithms that can be used
for time synchronization based on internal physical clocks and external time source.
Physical Clocks
Physical clocks are in everyday use. Digital clocks, which are now prevalent, are based on quartz crystals, whereas traditional mechanical clocks are based on spring mechanisms or pendulums. Digital clocks, from wristwatches to clocks on a computer motherboard,
make use of quartz crystals. In practice, an oscillator circuit regulated by a quartz crystal
is used to generate an accurate frequency. When the electric field is applied to a quartz
crystal, it bends and starts to resonate at a frequency depending upon its size, cut,
temperature, and housing. The most common frequency is 32768 Hz which is almost
universally used in quartz-based clocks. Figure 1-20 shows, from left to right, a quartz
crystal in a natural form, in a component form, and inside a casing with additional
oscillator circuitry.
Quartz-based clocks are usually accurate enough for general-purpose use. However,
several factors such as manufacturing differences, casing, and operating environment
(too cold, too hot) impact the operation of a quartz crystal. Usually, too low or high a
temperature can slow down the clock. Imagine if an electronic device operating in the
field is exposed to high temperatures; the clock can run slower than a clock working in
normal favorable conditions. This difference caused by the clock running faster or slower
is called drift. Drift is measured in parts per million (ppm) units.
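To get a feel for what a drift figure in ppm means in practice, the following sketch converts a drift rating into seconds of error accumulated per day and per month. The 20 ppm rating used here is an assumed example value, not a figure from this chapter.

# Convert a clock drift rating in parts per million (ppm) into seconds of error.

SECONDS_PER_DAY = 24 * 60 * 60

def drift_seconds(ppm: float, elapsed_seconds: float) -> float:
    """Worst-case error in seconds accumulated over elapsed_seconds."""
    return ppm * 1e-6 * elapsed_seconds

if __name__ == "__main__":
    ppm = 20.0  # assumed example rating for a typical quartz crystal
    print(f"{ppm} ppm is about {drift_seconds(ppm, SECONDS_PER_DAY):.2f} s/day")
    print(f"{ppm} ppm is about {drift_seconds(ppm, 30 * SECONDS_PER_DAY):.1f} s/month")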
In almost all quartz clocks, the frequency of the quartz crystal is 32,768 Hz due to its cut and size and how it is manufactured. The crystal has a specific cut and size, which looks like a tuning fork, due to which the frequency produced is always 32,768 Hertz. I decided to do
a small experiment with my oscilloscope and an old clock lying around to demonstrate
this fact.
Here are the results! Figure 1-21 shows a quartz crystal in a clock circuit producing
exactly 32,768 Hertz at normal room temperature, shown on the oscilloscope screen.
In Figure 1-21, the probes from the oscilloscope are connected to the quartz crystal
component on the clock circuit, and the waveform is shown on the oscilloscope screen.
Also, the frequency is displayed at the right bottom of the oscilloscope screen, which
reads 32.7680 kHz.
Atomic Clocks
Atomic clocks are based on the quantum mechanical properties of atoms. Atoms such
as cesium or rubidium and mercury are used, and resonant frequencies (oscillations) of
atoms are used to record accurate and precise times.
Historically, our notion of time has been based on astronomical observations such as changing seasons and the Earth's rotation. In an atomic clock, by contrast, the higher the oscillation frequency, the more precise the time; this is the principle on which atomic clocks work to produce highly precise time.
In 1967, the second was defined as "the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium-133 atom." In other words, the oscillation of cesium atoms between two energy states exactly 9,192,631,770 times under a controlled environment
defines a true second. An atomic clock is shown in Figure 1-22.
Now imagine a scenario where we discover a clock skew and see that one clock is running ten seconds behind. We can usually and simply advance it by ten seconds to make the clock accurate again. It is not ideal, but it is not as bad as the opposite case, where we discover a clock running ten seconds ahead. What can we do in that case? Can we simply push it back by ten seconds? That is not a good idea, because we can then run into situations where it would appear that a message was received before it was sent.
To address clock skews and drifts, we can synchronize clocks with a trusted and
accurate time source.
You might be wondering why there is such a requirement for more and more precise
clocks and sources of time. Quartz clocks are good enough for day-to-day use; then
we saw GPS as a more accurate time source, and then we saw atomic clocks that are
even more accurate and can drift only a second in about 300 million years!1 But why do
we need such highly accurate clocks? The answer is that for day-to-day use, it doesn’t
matter. If the time on my wristwatch is a few seconds different from other clocks, it’s
not a problem. If my post on a social media site has a timestamp that is a few seconds
1. www.nist.gov/si-redefinition/second/second-present
apart from the exact time I posted it, perhaps that is not an issue. Of course, as long
as the sequence is maintained, the timestamp is acceptable within a few seconds. But
the situation changes in many other practical scenarios and distributed systems. For
example, high-frequency trading systems are required (by the MiFID II regulation) to timestamp messages in the trading system with microsecond granularity and to be accurate to within 100 microseconds. From a clock synchronization point of view, only 100 microseconds of divergence from UTC is allowed. While such requirements are
essential for the proper functioning and regulation of the trading systems, they also
pose technical challenges. In such scenarios, the choice of source of accurate time,
choice of synchronization algorithms, and handling of skews and drifts become of prime
importance.
You can see specific MiFID requirements here as a reference:
https://ec.europa.eu/finance/securities/docs/isd/mifid/rts/160607-rts-25-
annex_en.pdf
Clock synchronization in a distributed system can be of two types:
1. External synchronization, where clocks are synchronized against an external authoritative time source such as UTC
2. Internal synchronization, where the clocks of the nodes are synchronized with one another, within some bound, without necessarily being close to an external reference
NTP
The network time protocol (NTP) allows clients to synchronize with UTC. In NTP, servers
are organized in so-called strata, where stratum 1 servers (primary time servers) are
directly connected to an accurate time source in stratum 0, for example, GPS or atomic
clock. Stratum 2 servers synchronize with stratum 1 servers over the network, and stratum 3 servers synchronize with stratum 2 servers. This type of architecture provides
a reliable, secure, and scalable protocol. Reliability comes from the use of redundant
servers and paths. Security is provided by utilizing appropriate authentication
mechanisms, and scalability is characterized by NTP’s ability to serve a large number
of clients. While NTP is an efficient and robust protocol, inherent network latency,
misconfigurations in the protocol setup, network misconfigurations that may block the
NTP protocol, and several other factors can still cause clocks to drift.
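The arithmetic at the heart of NTP-style synchronization can be sketched as follows: from the four timestamps of one request/response exchange (T1 client send, T2 server receive, T3 server send, T4 client receive), the client estimates its clock offset and the round-trip delay. The timestamp values below are made up for illustration; a real NTP client does much more (filtering, multiple samples, multiple servers).

# NTP-style offset and delay estimation from a single request/response
# exchange. T1..T4 are: client send, server receive, server send, client
# receive (all in seconds). The example values are made up.

def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float):
    offset = ((t2 - t1) + (t3 - t4)) / 2   # estimated client clock error
    delay = (t4 - t1) - (t3 - t2)          # round-trip network delay
    return offset, delay

if __name__ == "__main__":
    # In this made-up trace the client clock is about 0.200 s behind the server.
    t1, t2, t3, t4 = 100.000, 100.250, 100.260, 100.110
    offset, delay = ntp_offset_and_delay(t1, t2, t3, t4)
    print(f"estimated offset = {offset:+.3f} s, round-trip delay = {delay:.3f} s")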
In distributed systems, even if each process has a local clock and is synchronized
with some global clock source, there is still a chance that each local processor would see
the time differently. The clocks can drift over time, the processors can experience bugs,
or there can be an inherent drift, for example, quartz clocks or GPS systems, making it
challenging to handle time in a distributed system.
Imagine a distributed system with some nodes in the orbit and some in other
geographical locations on earth, and they all agree to use UTC. The physical clocks in
satellites or ISS will run at a different rate, and skew is inevitable. The core limitation
of depending on physical clocks is that even if we try to synchronize them perfectly, timestamps will always be slightly apart. Consequently, physical clocks cannot (should not) be used to establish the order of events in a distributed system, because it is difficult to
accurately find out the global order of events based on timestamps in different nodes.
Physical clocks are not very suitable for distributed systems because they can
drift apart. Even with one universal source, such as the atomic clock through NTP,
they can still drift and desynchronize over time with the source. Even a difference of a
second can sometimes cause a big issue. In addition, there can be software bugs in the
implementation that can cause unintended consequences. For example, let's look at a famous bug, the leap second bug, which has caused significant disruption of Internet services.
UTC Time
UTC time is a time standard used around the world. Two sources of time are used to make up coordinated universal time (UTC):
• Astronomical time, based on the Earth's rotation
• Atomic time, that is, international atomic time (TAI), based on atomic clocks
While TAI is highly accurate, it doesn’t consider the Earth’s rotation, that is, the
astronomically observed time that determines the true length of the day. Earth’s rotation
is not constant. It is occasionally faster and is slowing down overall. Therefore, days are
not exactly 24 hours. The impact on Earth’s rotation is due to celestial bodies such as
the moon, tides, and other environmental factors. Therefore, UTC is kept in constant comparison with astronomical time, and any difference is corrected in the form of a leap second: before the difference between UTC and astronomical time reaches 0.9 seconds, a leap second is added to UTC. This has been the practice since 1972.
OK, this seems like a reasonable solution to keep both times synced; however,
computers don’t seem to handle this situation well. Unix systems use Unix time (epoch),
simply the number of seconds elapsed since January 1, 1970. In a normal case, after 23:59:59 comes 00:00:00. When a leap second is added, however, after 23:59:59 there is 23:59:60 and then 00:00:00. Unix time cannot represent 23:59:60, so in practice the same Unix timestamp occurs twice, as if 23:59:59 happened twice. When Unix time deals with this addition of an extra second, it can produce arbitrary behavior. In the past, when leap seconds were added, servers across the Internet experienced issues, and services as critical as airline booking systems were disrupted.
A technique called "leap smear" has been developed, which spreads the extra second over a longer window (for example, a day) by adding a few milliseconds at a time, avoiding the problems associated with a sudden one-second addition.
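A minimal sketch of a linear leap smear follows, assuming the extra second is spread evenly over a 24-hour window; the window length and the linear ramp are illustrative assumptions (operators that smear choose their own policies), not part of the UTC standard.

# Linear "leap smear" sketch: instead of inserting one whole second at once,
# the extra second is spread evenly across a smear window.

SMEAR_WINDOW = 24 * 60 * 60  # assumed 24-hour window, in seconds

def smeared_offset(seconds_into_window: float) -> float:
    """Fraction of the leap second applied so far (0.0 .. 1.0 seconds)."""
    frac = min(max(seconds_into_window / SMEAR_WINDOW, 0.0), 1.0)
    return frac  # a full positive leap second is 1.0 s in total

if __name__ == "__main__":
    for hours in (0, 6, 12, 18, 24):
        off = smeared_offset(hours * 3600)
        print(f"{hours:2d}h into the window: clock slewed by {off * 1000:.1f} ms")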
OK, so far, we have seen that UTC and astronomical time are synced by adding a leap
second. With the "leap second smear" technique, we can gradually add a leap second
over time, which alleviates some of the issues associated with a suddenly added leap second. There are also calls to abolish this ritual altogether. However, so far, adding a leap second is seen as a reasonable solution, and it seems to work somewhat OK. We just add a leap second when the Earth's rotation slows down, but what if the Earth is spinning faster? In 2020, during the pandemic, the Earth did indeed spin faster, for whatever reason. Now the question is, do we remove one second from UTC, in other words, introduce a negative leap second? This situation can pose even more challenges, perhaps more demanding to address than adding a leap second.
The question is what to do about this. Should we just ignore it? What algorithm can help remove one second, that is, introduce a negative leap second?
So far, the suggestion is to simply skip 23:59:59, that is, go from 23:59:58 directly to 00:00:00. It is expected that this is easier to deal with than adding a leap
second. Perhaps, a solution is unnecessary because we may ignore the Earth spinning
faster or slower altogether and abolish the leap second adjustment practice, either
negative or positive. It is not ideal, but we might do that to avoid issues and ambiguity
associated with handling leap seconds, especially adding leap seconds! At the time of
writing, this is an open question.
More information can be found here: https://fanf.dreamwidth.org/133823.html
(negative leap second) and www.eecis.udel.edu/~mills/leap.html.
To avoid limitations and problems associated with physical clocks and
synchronization, for distributed systems, we can use logical clocks, which have no
correlation with physical clocks but are a way to order the events in a distributed system.
As we have seen, ordering of events and causal relationships are essential requirements in distributed systems, and logical clocks play a vital role in ensuring this.
From a distributed system point of view, we learned earlier that the notion of global
state is very important, which allows us to observe the state of a distributed system
and helps to snapshot or checkpoint. Thus, time plays a vital role here because if time
is not uniform across the system (each processor running at a different time), and we
try to read states from all different processors and links in a system, it will result in an
inconsistent state.
Computer systems typically provide two kinds of clocks, illustrated in the sketch that follows:
1. Time-of-day clocks, which report wall-clock time and can jump when adjusted
2. Monotonic clocks, which only ever move forward and are suited to measuring durations
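The difference between the two shows up directly in most language runtimes. In Python's standard library, for example, time.time() is a time-of-day clock that can jump when the system clock is adjusted, whereas time.monotonic() only ever moves forward and is the right choice for measuring durations.

import time

# Time-of-day clock: wall-clock time since the Unix epoch.
# Subject to NTP adjustments and leap-second handling, so it may jump.
wall_start = time.time()

# Monotonic clock: only moves forward; its absolute value is meaningless,
# but differences between readings are reliable for measuring durations.
mono_start = time.monotonic()

time.sleep(0.5)  # simulate some work

print(f"wall-clock elapsed: {time.time() - wall_start:.3f} s")
print(f"monotonic elapsed:  {time.monotonic() - mono_start:.3f} s")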
Set
A set is a collection of elements. It is denoted by a capital letter. All elements of the set
are listed inside the brackets. If an element x is present in a set, then it is written as x ∈ X,
which means x is in X or x belongs to X. Similarly, if an element x is not present in a set
X, it is written as x ∉ X. It does not matter which order the elements are in. Two sets are
equal if they have the same elements. Equality is expressed as X = Y, meaning set X is
equal to set Y. If two sets X and Y are not equal, it is written as X ≠ Y. A set that does not
have any elements is called an empty set and is denoted as { } or ∅. For example, consider the sets
X = {1,5,2,8,9}
Y = {2,8}
Every element of Y is also an element of X, so Y is a subset of X, written Y ⊆ X.
A union of two sets A and B contains all the elements in both A and B. For example, if
A = {1,2,3}
and
B = {3,4,5}
then their union is
S = A ∪ B = {1,2,3,4,5}
The cartesian product of two sets A and B is the set of ordered pairs (a, b) for each
element in sets A and B. It is denoted as A × B. It is a set of ordered pairs (a, b) for each
a ∈ A and b ∈ B.
An ordered pair is composed of two elements inside parentheses, for example, (1,
2) or (2, 1). Note here that the order of elements is important and matters in the case of
ordered pairs, whereas in sets the order of elements does not matter. For example, (1, 2)
is not the same as (2, 1), but {1,2} and {2,1} are the same or equal sets.
An example of a cartesian product, A × B, for sets shown earlier is
{(1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,3), (3,4), (3,5)}
Note that in the ordered pair, the first element is taken from set A and the second
element from set B.
Relation
A relation (binary) between two sets A and B is a subset of the cartesian product A × B.
The relationship between two elements is binary and can be written as a set of
ordered pairs. We can express this as a R b (infix notation) or (a, b) ∈ R, meaning the
ordered pair (a, b) is in relation R.
When a binary relation on a set S has properties of reflexivity, symmetry, and
transitivity, it is called an equivalence relation.
When a binary relation on a set S has three properties of reflexivity, antisymmetry,
and transitivity, then it is called a partial ordering on S.
Partial Order
It is a binary relation ≤ (less than or equal to – for comparison) between the elements of
a set S. A binary relation on a set S, which is reflexive, antisymmetric, and transitive, is
known as a partial ordering on S. We now define the three conditions.
Reflexivity
This property means that every element is related to itself. Mathematically, we can write
it like this: ∀a ∈ S, a ≤ a.
Antisymmetry
This means that two distinct elements cannot be related in both directions. Mathematically, it can be written as ∀a, b ∈ S, if a ≤ b ∧ b ≤ a, then a = b.
Transitivity
This means that if a is related to b and b is related to c, then a is also related to c. Mathematically, we can write it as ∀a, b, c ∈ S, if a ≤ b ∧ b ≤ c, then a ≤ c.
Irreflexive
This property means that there is no element that is related to itself. Mathematically,
we can write it like ∀a ∈ S, a ≰ a, or given a relation R on a set S, R is irreflexive if
∀s ∈ S : (s, s) ∉ R.
Total Order
A total order or linear order is a partial order in which each pair of elements is
comparable.
After this brief introduction to some math concepts, let us now look into what
causality is and what is a happens-before relationship.
• If there exists an event g such that e→g and g→f, then e→f. This is
called a transitive relationship.
If e and f are partially ordered, we then say that e happened before f. If e and f are not
partially ordered, then we say that e and f are concurrent. This does not necessarily mean that e and f are executed at exactly the same time; it just means that e and f are not causally related. In other words, there is no sequence of messages that leads from one event to the other. Concurrency is written as e ∥ f. Figure 1-24 shows an example
scenario in detail.
In Figure 1-24, the relations e → f, g → h, i → j are due to the order in which processes
execute the events. The relations f → g, h → j are due to messages m1 and m2. Moreover
e → g, e → h, e → j, f → h, f → j, g → j represent transitive relation. Finally, the concurrent
events are e ∥ i, f ∥ i, g ∥ i, h ∥ i.
Logical Clocks
Logical clocks do not depend on physical clocks and can be used to define the order of
events in a distributed system. Logical clocks only measure the order of events without
any reference to external physical time.
Lamport Clocks
A Lamport clock is a logical counter that is maintained by each process in a distributed
system, and with each occurrence of an event, it is incremented to provide a means of
maintaining and observing a happens-before relationship between events occurring in
the distributed system.
The key idea here is that each event is assigned a number which increments as the
event occurs in the system. This number is also called the Lamport clock. A Lamport
clock captures causality.
The algorithm for Lamport’s clocks/logical clocks is described as follows:
on init
t := 0
on event localcomputation do
t := t + 1
end
on event send(m) do
t := t + 1
send(m, t)
end
on event receive(m, t') do
t := max(t, t') + 1
end
Lamport clocks are consistent with causality. We can write this like
e → f ⇒ LC(e) < LC(f)
This means that if e happened before f, then it implies that the timestamp (Lamport
clock – LC) of event e is less than the timestamp of event f.
There is a correctness criterion called the clock condition which is used to evaluate
the logical clocks:
∀a, b : a → b ⇒ LC(a) < LC(b)
This is read as follows: for all a and b, a happened before b implies that the Lamport
clock (timestamp) of a is less than the Lamport clock (timestamp) of b.
This means that if event A happens before event B, then it implies that the Lamport
clock of event A is less than the Lamport clock of event B.
Now we can see a picture emerging. Without using physical clocks, we can now see
how events in a distributed system can be assigned a number which can be used for
ordering them by using Lamport clocks.
Now let’s run this algorithm on a simple distributed system composed of three
processes (nodes, computers) – P, Q, and R.
There are two key properties of this algorithm: it is consistent with causality (if a → b, then LC(a) < LC(b)), but the timestamps alone do not define a total order. The latter means that two events can have the same timestamp. As shown in Figure 1-25, on process lines P and R, notice that an event on each line carries the same timestamp 1. Did you spot a problem here? In this scheme, the total order is not guaranteed because two events can get the same timestamp.
One obvious way to fix this is to use an identifier for the process with the timestamp.
This way, the total order can be achieved.
Figure 1-26 shows executions with a totally ordered logical clock.
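To make this concrete, here is a small, runnable sketch of Lamport clocks that attaches a (timestamp, process identifier) pair to each event so that ties can be broken deterministically. The process names and the message pattern are made up for illustration; only the clock rules follow the algorithm described above.

# Minimal Lamport clock with (timestamp, process id) pairs for total order.

class Process:
    def __init__(self, name: str):
        self.name = name
        self.t = 0  # Lamport counter

    def local_event(self):
        self.t += 1
        return (self.t, self.name)

    def send(self):
        self.t += 1
        return (self.t, self.name)  # timestamp piggybacked on the message

    def receive(self, msg_t: int):
        self.t = max(self.t, msg_t) + 1
        return (self.t, self.name)

if __name__ == "__main__":
    P, Q = Process("P"), Process("Q")
    e1 = P.local_event()   # (1, 'P')
    m = P.send()           # (2, 'P'), sent to Q
    e2 = Q.local_event()   # (1, 'Q'), same counter value as e1
    e3 = Q.receive(m[0])   # (3, 'Q'), that is, max(1, 2) + 1
    # Lexicographic order on (timestamp, process id) gives a total order,
    # breaking the tie between e1 and e2 by process name.
    print(sorted([e1, e2, e3, m]))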
Knowing the order of events in a distributed system is very useful. The order of
events allows us to find the causality between the events. The knowledge of causality in
distributed systems helps to solve several problems. Some examples include but are not
limited to consistency in replicated databases, figuring out causal dependency between
different events, measuring the progress of executions in a distributed system, and
measuring concurrency.
We can use it to build distributed state machines. If events are timestamped, we
can also see when exactly an event has occurred and what happened before and what
occurred after, which can help debug and investigate distributed systems’ faults. This
knowledge can be instrumental in building debuggers, snapshotting a point in time,
pruning some data before a point in time, and many other use cases.
A limitation is that LC(a) < LC(b) does not imply that a → b. This means that Lamport clocks cannot tell whether two events are concurrent or not. This problem can be addressed using vector clocks.
Vector Clocks
It is a type of logical clock which allows detecting concurrent events in addition to
determining partial ordering of events and detecting causality violations. Here is how
it works:
–– At the start, all vector clocks in a distributed system are set to zero, that is, [0,0,0,0,0] for a system of five processes. The remaining update rules are sketched in the code below.
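The remaining update rules are the standard ones from the literature: on a local or send event, a process increments its own entry; on receive, it takes the element-wise maximum of its own vector and the vector carried by the message and then increments its own entry. The following sketch implements those standard rules; the two-process scenario and the happened_before helper are illustrative assumptions.

# Vector clock sketch following the standard rules from the literature.

class VectorClock:
    def __init__(self, n: int, i: int):
        self.v = [0] * n   # one entry per process, all start at zero
        self.i = i         # index of the owning process

    def local_event(self):
        self.v[self.i] += 1
        return list(self.v)

    def send(self):
        self.v[self.i] += 1
        return list(self.v)            # vector piggybacked on the message

    def receive(self, msg_v):
        self.v = [max(a, b) for a, b in zip(self.v, msg_v)]
        self.v[self.i] += 1
        return list(self.v)

def happened_before(a, b):
    """True if vector a causally precedes vector b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

if __name__ == "__main__":
    p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
    e1 = p0.local_event()            # [1, 0]
    m = p0.send()                    # [2, 0]
    e2 = p1.local_event()            # [0, 1]
    e3 = p1.receive(m)               # [2, 2]
    print(happened_before(e1, e3))   # True: causally related
    print(happened_before(e1, e2))   # False
    print(happened_before(e2, e1))   # False: e1 and e2 are concurrent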
Faults in a distributed system can arise in several parts of the system, for example:
• Process/program faults
• Communication/link faults
• Storage faults
There are several types of faults that have been formally defined in distributed
systems literature. These types are categorized under the so-called fault model which
basically tells us which kind of faults can occur.
We now define each one of these as follows.
Crash-Stop
In this scenario, a process can fail by stopping execution at any point in time. This can happen, for example, when a hardware fault occurs in a node. In this model, other nodes are unable to find out that the node has crashed.
Fail-Stop
In this model, a process can fail by stopping execution of the algorithm. Other nodes in
the distributed system can learn about this failure, usually by using failure detectors.
Omission Faults
Omission faults are where a message can be lost.
Send Omission
This is where a process fails to send messages that it was supposed to send.
Receive Omission
This is where a process fails to receive messages that were sent to it.
General Omission
This is where a process may exhibit either a send omission or a receive omission.
Covert Faults
This model captures a behavior where a failure might remain hidden or undetected.
Computation Faults
In this scenario, we capture the situation where a processor responds incorrectly.
Byzantine Faults
This model captures the arbitrary faults where a process may fail in arbitrarily
many ways.
One variant of this model assumes that although a process can exhibit arbitrary behavior, the messages received from it can be verified by using authentication and digital signatures. This nonrepudiation and verification can make dealing with Byzantine faults a bit easier.
In another variant, a process can exhibit arbitrary behavior, and no message verification is possible to ascertain the validity of the messages.
Timing Faults
This is where a process can exhibit slow behavior or may run faster than other processes.
This can initially look like a partially synchronous behavior, but a node that has not
received a message for a very long time can be seen as one example of this type of
fault. This covers scenarios where a message delivery is not in line with the expected delivery time or lies outside the specified time interval.
Failures can be detected using failure detectors where a process can be suspected of
a failure. For example, a message not received for an extended period of time or that has
gone past the threshold of timeout can be marked as a failed process.
More on failure detectors in Chapter 3; for now, let's look at fault models and fault classes.
In Figure 1-28, we can visualize various classes of faults, where Byzantine faults
encompass all types of faults at varying degrees of complexity and can happen
arbitrarily, whereas crash faults are the simplest type of faults.
Fault classes allow us to see what faults can occur, whereas fault models help us
to see what kind of faults the system can exhibit and what types of faults should be
tolerated in our distributed algorithm.
A system or algorithm that can tolerate crash faults only is called a crash fault
tolerant or CFT in short. In contrast, a system that can handle Byzantine faults is called
the Byzantine fault–tolerant system or algorithm. Usually, this applies to consensus
mechanisms categorized and developed with the goal of crash fault tolerance or
Byzantine fault tolerance. We will see more about this in Chapter 3, where we discuss
consensus algorithms.
Safety and Liveness
Remember we discussed in communication abstractions that broadcast protocols and
point-to-point links have some properties. For example, a fair-loss property ensures that
messages sent will eventually be delivered under fair-loss links. This type of property
where something will eventually happen is considered a liveness property. Colloquially
speaking, this means that something good will eventually occur.
Also, remember that under the finite duplication property for fair-loss links, we said
that there are finitely many message duplications. This type of property, where something can be measured and observed in finite time, is called a safety property. Colloquially speaking,
this means that something bad never happens. Of course, if you don’t do anything, then
nothing will ever happen, which theoretically satisfies the safety property; however, the
system is not making any progress in this scenario. Therefore, the liveness property,
which ensures the progress of the system, is also necessary.
These properties are used in many different distributed algorithms to reason about
the correctness of the protocols. In addition, they are frequently used in describing the
safety and liveness requirements and properties of consensus protocols. We will cover
distributed consensus in detail in Chapter 3.
Safety and liveness are correctness properties of a distributed algorithm. For
example, the safety and liveness of traffic signals at a crossing can be described as
follows. The safety properties in this scenario are that only one direction has a green light at any time, and that no signal has all its lights turned on at the same time. Another safety property could be that the system does not turn off any of the signals. And the liveness
property is that, eventually, each signal must get the green light.
For example, in a partially synchronous system, to prove safety properties, it is
assumed that the system is asynchronous, whereas to prove the liveness of the system,
the partial synchrony assumption is used. The progress (liveness) of the system is ensured in a partially synchronous system, for example, after GST, when the system is synchronous for long enough to allow the algorithm to achieve its objective and terminate.
For a distributed system to be practical, safety and liveness properties must be
specified and guaranteed.
CAP Theorem
The CAP theorem states that a distributed system can only deliver two of three desired
features, that is, consistency, availability, and partition tolerance. Let’s first define these
terms, and then we’ll investigate the theorem in more detail.
Consistency
The consistency property means that data should be consistent across all nodes in the
distributed system, and the client connecting to the distributed system at the same time
should see the same consistent data. This is commonly achieved using replication.
Availability
Availability means the distributed system responds to the client requests even in the
presence of faults. This is achieved using fault tolerance techniques such as replication,
partitioning, or sharding.
Partition Tolerance
A partition refers to a scenario where the communication link between two or more
nodes breaks. A distributed system should be able to tolerate that and continue to
operate correctly.
We know that partitions in a network are almost inevitable; sooner or later, there
will be some communication disruption. This means that as network partitions are
unavoidable, the choice really becomes to choose between availability and consistency.
The question becomes, in the case of partitions, what we are willing to sacrifice,
consistency or availability. It all depends on the use case. For example, in a financial
application, it’s best to sacrifice availability in favor of consistency, but perhaps on web
search results, we could sacrifice a bit of consistency in favor of availability. It should
be noted that when there are no network partitions, consistency and availability are
both provided. But then again, if a network partition occurs, then what do we choose,
availability or consistency?
A Venn diagram shown in Figure 1-29 can be used to visualize this concept.
The CAP theorem allows us to categorize databases (NoSQL DBs) based on the
properties they support. For example, a CP database provides consistency and partition
tolerance but sacrifices availability. In the case of a partition, the nonconsistent nodes
are shut down until the network partition heals. An AP database sacrifices consistency
but offers availability and partition tolerance. In the case of a network partition, there
is a chance that nodes that have not been able to get the updates due to a network
partition will continue to serve old data. This might be acceptable in some scenarios,
such as a web search. When the partition heals, the out-of-sync nodes are synchronized
with the latest updates. On the other hand, a CA database is not partition tolerant and
can provide both consistency and availability only if the network is healthy. As we saw
earlier, network partitions are inevitable; therefore, CA databases only exist in an ideal
world where no network partitions occur.
While the CAP theorem is helpful, there are many other more precise impossibility
results in distributed computing.
Let’s now discuss what eventual consistency is. Eventual consistency refers to a
situation where nodes may disagree or not update their local database, but, eventually,
the state is agreed upon and updated.
One example of such a scenario could be when an electronic voting system captures
voters’ votes and writes them to a central vote registration system. However, it could
happen that due to a network partition, the communication link in the central vote
registration system is lost, and this voting machine is now not able to write data to the
backend voting registration system. It could now keep receiving votes from the user and
record them locally, and when the network partition heals, it can write the ballots back
to the central vote registration system. During the network partition from the central vote
registration system’s point of view, the count of votes is different from what the voting
machine can see. The machine can write back to the central vote registration system
when the partition heals to achieve consistency. The consistency between the backend server storage and the local storage is not achieved immediately but over time; this type of consistency is called eventual consistency.
A now established example of an eventually consistent system is Bitcoin. We will
learn more about this in Chapter 4 and see how Bitcoin is eventually consistent.
The domain name system (DNS) is the most prevalent system that implements
eventual consistency. When a name is updated, it is distributed as per a configured
pattern, and, eventually, all clients see the update.
Through the lens of the CAP theorem, distributed consensus is a CP system where availability is sacrificed in favor of consistency. As a result, distributed consensus is used to provide strong consistency guarantees.
For example, if you have a five-node system and three nodes go down, then the whole system stalls until enough nodes come back up to restore a majority (at least three of the five). This is so that the consistency guarantee can be maintained, even if the system is unavailable for some time.
If we look at Bitcoin, it appears that it is an AP system where consistency is sacrificed
for some time due to forks, but, eventually, the consistency is achieved. Therefore,
Bitcoin can also be considered a CP system where consistency is eventually strong.
Usually, distributed consensus is used to provide strong consistency (also called linearizability); however, eventual consistency, as in systems like Bitcoin, is also acceptable.
Summary
We covered several topics in this chapter:
• The CAP theorem states that a distributed system can only deliver
two of three desired features, that is, consistency, availability, and
partition tolerance.
Bibliography
1. Safety and liveness properties were first formalized in a paper
by Alpern, B. and Schneider, F.B., 1987. Recognizing safety and
liveness. Distributed computing, 2(3), pp. 117–126.
CHAPTER 2
Cryptography
This chapter will cover cryptography and its two main types, symmetric cryptography
and public key cryptography. After exploring some fundamental ideas, we will dive
deeper into symmetric key primitives and then public key primitives.
Moreover, we will examine hash functions, message authentication codes, digital
signature schemes, and elliptic curve cryptography. Finally, we’ll shed some light on
some progressive ideas, proposals, and techniques, especially those which are used in
blockchain consensus.
Introduction
Cryptography is the science of secret communication in the presence of adversaries.
Historically, this subject was more of an art, but, now, it is a rigorous and formal science
with formal definitions, assumptions, and security proofs.
There are three fundamental principles of cryptography: confidentiality, integrity,
and authenticity. Confidentiality is the assurance that the information is available only
to the authorized entities. Integrity assures that only authorized entities can modify the
information. Finally, authenticity guarantees the message validity or identity of an entity.
Authentication can be of two types, entity authentication or data origin authentication.
Entity authentication ensures that an entity claiming some identity (the claimant)
is verifiably identifiable to another identity (the verifier) and that the entity is alive and
participating. Different methods such as something you have (e.g., a hardware token),
something you know (e.g., a password), and something you are (e.g., fingerprint) are
used to achieve entity authentication in identification protocols. Entity authentication
is of fundamental concern in a secure distributed system. As a distributed system is dispersed and heterogeneous, with multiple users, it can become an easy target for
A Typical Cryptosystem
A typical model of a cryptographic system is shown in Figure 2-1. We can define
a cryptographic system as a combined manifestation of cryptographic primitives,
protocols, and algorithms to accomplish specified security goals. Thus, a cryptosystem is
composed of several components.
There are three key actors in this system: sender, receiver, and adversary. The sender
wants to send a secret message to the receiver via an insecure channel in the presence
of an adversary who is a malicious attacker wishing to learn about the message. Other
elements are plaintext, ciphertext, keys, secure channel, encryption function, decryption
function, and key source:
Cryptographic Primitives
A cryptographic primitive is a fundamental method that delivers particular security
services, for example, confidentiality or integrity. These cryptographic primitives are
used to build security protocols, such as authentication protocols. Cryptographic
primitives include symmetric primitives, asymmetric primitives, and keyless primitives.
A high-level taxonomy of cryptographic primitives is shown in Figure 2-2.
Symmetric Cryptography
Symmetric cryptosystems use the same key for encryption and decryption. The key
must be kept secret and transferred over a secure channel before the data transfer
between a sender and a receiver. For secure key transfers, key establishment protocols
are used. Usually, public key cryptography is used to exchange keys, allowing for easier
key management than symmetric key management, where it can become challenging to
manage keys as the number of users grows.
Stream Ciphers
These cryptosystems encrypt the plaintext one bit at a time. The algorithm takes a single
bit of the plaintext as input, processes it, and produces a single bit of ciphertext. The
processing involves the use of XOR operations to perform encryption and decryption.
The model of stream ciphers is in Figure 2-4.
In this model, plaintext feeds into the encryption function bit by bit along with a
keystream generated by the key generator. The key generator expands a short secret key, usually 128 bits long, into a pseudorandom keystream as long as the plaintext. The keystream and plaintext go through the XOR to produce the ciphertext.
During decryption, the same process applies again, and plaintext is retrieved.
Pseudorandom generation means that the bits generated are not random but appear
random, hence the term pseudorandom. Keystreams are commonly generated using
linear feedback shift registers (LFSRs). The input bit of an LFSR is a linear function of its previous state, where the linear function is usually the XOR operation.
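Here is a toy LFSR sketch: the feedback bit is the XOR of selected tap positions of the current state, and the bit shifted out is emitted as keystream. The 16-bit register width and the tap positions are arbitrary illustrative choices; this is not a cipher anyone should use.

# Toy linear feedback shift register (LFSR) keystream generator.
# Register width and tap positions are arbitrary; NOT a secure cipher.

def lfsr_keystream(seed: int, taps=(15, 13, 12, 10), width=16):
    state = seed & ((1 << width) - 1)
    assert state != 0, "seed must be nonzero"
    while True:
        # Feedback bit = XOR of the tapped positions of the current state.
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        out = state & 1                               # keystream bit shifted out
        state = (state >> 1) | (fb << (width - 1))
        yield out

if __name__ == "__main__":
    ks = lfsr_keystream(seed=0xACE1)
    bits = [next(ks) for _ in range(16)]
    print("first 16 keystream bits:", bits)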
The key generator is a cryptographically secure pseudorandom number generator
(CSPRNG or CPRNG). Because it is "pseudo" random, the output can be recomputed: anyone who runs the generator with the same inputs obtains the same result, which implies that these PRNGs are deterministic. If the keystream were truly random and not deterministic, then, once generated, it could not be regenerated by anyone else, meaning decryption would not be possible. So the output looks random, but it is actually computable. CSPRNGs have the particular property that the numbers they generate are unpredictable.
There are two types of stream ciphers: synchronous stream ciphers and
asynchronous stream ciphers. In synchronous stream ciphers, the keystream is
dependent only on the key. In contrast, the keystream relies on the fixed number of
previously transmitted encrypted bits and the key in asynchronous stream ciphers.
Stream ciphers are usually more suited for hardware devices; however, they can also
be used in software environments. Many examples of stream ciphers exist, such as A5/1,
used in GSM communications to provide confidentiality. However, Salsa20 and ChaCha
are most used in software environments. Some other stream ciphers include Trivium,
Rabbit, RC4, and SEAL.
Block Ciphers
Block ciphers encrypt the plaintext by dividing it into blocks of fixed length. Historically,
block ciphers, such as DES, were built using Feistel networks. Modern ciphers, such
as AES, use a substitution-permutation network (SPN).
A simple model of a block cipher is shown in Figure 2-5.
Block ciphers can be used in several modes of operation; two common examples discussed here are:
• Electronic codebook mode
• Counter mode
Electronic Codebook
Electronic codebook (ECB) is a fundamental mode of operation in which the encrypted
data results from applying the encryption algorithm to each block of plaintext,
one by one.
This mode is the most straightforward, but we should not use it in practice as it is
insecure and can reveal information.
Figure 2-6 shows that we have plaintext P provided as an input to the block cipher
encryption function and a key, which produces ciphertext C as output.
Counter Mode
The counter (CTR) mode uses a block cipher as a stream cipher. In this case, a unique
nonce is concatenated with the counter value to generate a keystream.
As shown in Figure 2-8, CTR mode works by utilizing a nonce N and a counter
C that feed into the block cipher encryption function. The block cipher encryption
function takes the secret key “KEY” as input and produces a keystream (a stream of
pseudorandom or random characters), which, when XORed with the plaintext (P),
generates the ciphertext (C).
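To make the structure of CTR mode concrete, the following toy sketch uses a keyed SHA-256 hash as a stand-in for the block cipher encryption function; a real implementation would use AES (or another block cipher) and a standardized nonce/counter layout. Only the shape of the mode is the point here: keystream block = E_K(nonce ∥ counter), and ciphertext = keystream XOR plaintext.

# Toy CTR mode: a keyed SHA-256 hash stands in for the block cipher.
# This illustrates the structure of the mode only; it is NOT real AES-CTR.

import hashlib

BLOCK = 32  # our stand-in "block cipher" produces 32-byte keystream blocks

def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # E_K(nonce || counter): stand-in for a real block cipher encryption.
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()

def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        ks = keystream_block(key, nonce, i // BLOCK)
        chunk = data[i:i + BLOCK]
        out.extend(b ^ k for b, k in zip(chunk, ks))
    return bytes(out)

if __name__ == "__main__":
    key, nonce = b"an example key..", b"unique-nonce"
    plaintext = b"counter mode turns a block cipher into a stream cipher"
    ciphertext = ctr_xor(key, nonce, plaintext)
    # Encryption and decryption are the same operation in CTR mode.
    print(ctr_xor(key, nonce, ciphertext) == plaintext)   # True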
There are other modes that we can use for different purposes other than encryption.
We discuss some of these in the following section.
After state initialization using the input plaintext, AES sequentially performs the
following four operations to produce the ciphertext:
• ShiftRows: This step shifts each row to the left in the state array in a
cyclic and incremental manner. The first row is excluded, the second
row is shifted left by one byte, the third row is shifted left by two bytes,
and the fourth row is shifted left by three bytes or positions.
The abovementioned four steps form a single round of AES. In the final round,
step 4 (MixColumns) is not performed. Instead, it is replaced with the AddRoundKey
step to ensure that the first three steps cannot be simply reversed. This process is shown
in Figure 2-9.
Prime
A prime is a number which is only divisible fully by itself and 1. For example, 23 is a
prime number as it can only be divided precisely without leaving any remainder either
by 23 or 1.
Modular Arithmetic
It is a system of performing arithmetic operations on integers where numbers wrap
around when they reach a certain fixed number. This fixed number is called a modulus,
and all arithmetic operations are performed based on this modulus.
Group
A group G is a set whose elements can be combined with an operation ∘. It has the
following properties:
Closure means that all group operations are closed. Formally, ∀a,
b ∈ G : a ∘ b = c ∈ G.
Associativity means that all group operations are associative. Formally,
a ∘ (b ∘ c) = (a ∘ b ) ∘ c : ∀ a, b, c ∈ G.
There exists a special identity element i such that ∀a ∈ G : a ∘ i = i ∘ a = a.
For each element a ∈ G, there is a corresponding inverse element a−1 such that a ∘ a−1 = a−1 ∘ a = i.
Abelian Group
A group is a commutative or abelian group if in addition to the abovementioned
properties of groups, ∀a, b ∈ G : a ∘ b = b ∘ a .
Field
A field F is a set with two operations on F called addition and multiplication.
Prime Fields
A prime field is a finite field containing a prime number of elements.
Generator
A generator is an element of a group that can produce all other elements of the group through repeated application of the group operation. In elliptic curve cryptography, the generator (or base point) is a point on the curve that generates a cyclic subgroup.
A fundamental issue in symmetric key systems is that they need a secret key to be
shared before the communication using a secure channel, which can be challenging to
achieve. Another issue with symmetric key systems is key management. The number of keys grows quadratically as the number of users grows in the system. An n user network
will need n(n-1)/2 keys where each user will store n-1 keys. In a 100-user network, each
user will store 99 keys. The formula 100(100-1)/2 means there are 4950 keys in total,
which is quite tricky to manage practically. Public key cryptography solves this issue of
key distribution and key management.
A typical use of public key cryptography is to establish a shared secret key between
two parties. This shared secret key is used by symmetric algorithms, such as AES, to
encrypt the data. As they have already established a secret key, both parties can then
encrypt and decrypt without ever transmitting the secret key on the wire. This way, the
parties get the high security of public key cryptography with the speed of symmetric
encryption. Asymmetric cryptography is not used much for bulk encryption due to slow
performance; however, this is the norm for key establishment. Such systems where a
symmetric key is used to encrypt the data and a secret key is encrypted using public key
cryptography are called hybrid cryptosystems. For example, the Integrated Encryption
Scheme is a hybrid encryption scheme. ECIES is the elliptic curve (EC) version of the
IES scheme.
For example, suppose the public parameters are the prime p = 13 and the generator g = 6, Alice's private key is 5, and Bob's private key is 4. Each party computes its public key:
a. Alice
6^5 mod 13
7776 mod 13 = 2
Public key = 2
b. Bob
6^4 mod 13
1296 mod 13 = 9
Public key = 9
Bob then sends public key 9 to Alice, and Alice sends public key 2 to Bob.
a. Alice computes 9^5 mod 13 = 3 using Bob's public key.
b. Bob computes 2^4 mod 13 = 3 using Alice's public key.
Both arrive at the same shared secret, 3. The same exchange is reproduced in code below.
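The parameters here (prime p = 13, generator g = 6, private keys 5 and 4) are the tiny illustrative numbers from the example above; real deployments use primes of thousands of bits or elliptic curve groups.

# Toy Diffie-Hellman exchange with the tiny numbers from the example above.

p, g = 13, 6                 # public parameters: prime modulus and generator

a = 5                        # Alice's private key
b = 4                        # Bob's private key

A = pow(g, a, p)             # Alice's public key: 6^5 mod 13 = 2
B = pow(g, b, p)             # Bob's public key:   6^4 mod 13 = 9

shared_alice = pow(B, a, p)  # 9^5 mod 13 = 3
shared_bob = pow(A, b, p)    # 2^4 mod 13 = 3

assert shared_alice == shared_bob == 3
print("shared secret:", shared_alice)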
Public key cryptosystems rely on one-way trap door functions. Trapdoor functions
are easy to compute in one direction but difficult to compute in the opposite direction,
unless there is a special value, called the trapdoor, available. This concept can be
visualized in Figure 2-11.
3^2 mod 10 = 9
Now, given 9, finding 2, the exponent of the generator 3, is extremely hard to do. Formally, we can say that given numbers a and n where n is a prime, the function
f(b) = a^b mod n
is a one-way function, because calculating f(b) is easy, but given f(b), finding b is hard.
Another method developed in the mid-1980s is elliptic curve cryptography. Elliptic
curve cryptography has gained special attention due to its usage in blockchain platforms,
such as Bitcoin and Ethereum. Protocols such as the Elliptic Curve Diffie-Hellman key
exchange and elliptic curve digital signature algorithms are most prevalent in this space.
ECC is fundamentally a discrete logarithm problem but founded upon elliptic curves
over finite fields. A key advantage of ECC is that a smaller key size provides the same
level of security as a larger key size in RSA. For example, a security level of a 1024-bit
integer factorization scheme, such as RSA, can be achieved by only a 160-bit elliptic
curve–based scheme, such as ECDSA.
Public key cryptosystems can be used for encryption, though this is much less common because it is not efficient for large datasets. Public key cryptography is also used to provide other security services
and protocols, such as digital signatures, entity authentication, and key agreement.
Digital Signatures
Digital signatures are one of the most common uses of public key cryptography. Digital
signatures provide nonrepudiation services. Most common examples include RSA-based
digital signatures, digital signature algorithms, and ECDSA and Schnorr signatures.
Entity Authentication
Entity authentication or identification is another service that public key cryptosystems
can provide. Usually, challenge-response mechanisms are in widespread use where
a challenge sent by the verifier is required to be responded to correctly by the prover
(claimant of identity) to ascertain the legitimacy of the claimant.
Key Agreement
Key agreement protocols are used to establish secret keys before an encrypted data
transfer. The most common example of such protocols is the Diffie-Hellman key
exchange protocol.
RSA
RSA is widely used for secure key transport and building digital signatures. Diffie and
Hellman invented public key cryptography in 1976. Based on this idea, in 1978, the RSA
public key cryptosystem was developed by Rivest, Shamir, and Adleman.
In this section, I will walk you through the steps of generating key pairs in RSA and
how to encrypt and decrypt.
a. Select p and q, two large prime numbers, usually 1024 bits or more each.
b. Select the public exponent e ∈ {1, 2, . . . , ϕ(n) − 1} such that gcd(e, (p − 1)(q − 1)) = 1.
c. The private key, let's call it d, is calculated from the two primes p and q from step 1 and the public exponent e from step 2. The private key is the inverse of e modulo (p − 1)(q − 1), which we can write as
ed = 1 mod (p − 1)(q − 1), that is, ed = 1 mod ϕ(n)
For example, with ϕ(n) = 20 and e = 3, the private key is d = e^−1 mod 20 = 7, since 3 × 7 = 21 ≡ 1 (mod 20).
Encryption and Decryption
Now, let’s see how encryption and decryption operations are performed using RSA. RSA
uses the following equation to produce ciphertext:
C = P^e mod n
This means that plaintext P is raised to the power of e and then reduced to modulo n.
Decryption in RSA is provided in the following equation:
P = C^d mod n
This means that the receiver, who holds the corresponding private key d, can decipher the data by raising C to the power of d and then reducing modulo n.
For example, with p = 3 and q = 11, the modulus is n = pq = 3 × 11 = 33.
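Putting the pieces together, here is a toy run of textbook RSA with the tiny numbers from this example (p = 3, q = 11, e = 3, d = 7). The plaintext value 4 is an arbitrary choice; real RSA uses much larger primes and a padding scheme such as OAEP.

# Toy textbook RSA with the tiny numbers from the example (NOT secure).

from math import gcd

p, q = 3, 11
n = p * q                # 33, the modulus
phi = (p - 1) * (q - 1)  # 20

e = 3                    # public exponent, must satisfy gcd(e, phi) == 1
assert gcd(e, phi) == 1

d = pow(e, -1, phi)      # private exponent: modular inverse of e mod phi (Python 3.8+)
assert d == 7 and (e * d) % phi == 1

P = 4                          # an arbitrary plaintext value, P < n
C = pow(P, e, n)               # encryption:  C = P^e mod n  -> 31
assert pow(C, d, n) == P       # decryption:  P = C^d mod n
print(f"n={n}, e={e}, d={d}, plaintext={P}, ciphertext={C}")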
y² = x³ + ax + b mod p
Here, a and b belong to a finite field Zp or Fp (a prime finite field), that is, a, b ∈ Fp, and the curve also includes an imaginary point at infinity. The point at infinity ∞ is used to provide the identity operation for points on the curve.
Furthermore, the condition shown below ensures that the curve is nonsingular, meaning the curve does not self-intersect or have vertices:
4a³ + 27b² ≠ 0 mod p
To construct the discrete logarithm problem based on elliptic curves, a large enough
cyclic group is required. First, the group elements are identified as a set of points that
satisfy the elliptic curve equation. After this, group operations need to be defined on
these points. The fundamental group operations on elliptic curves are point addition
and point doubling. Point addition is a process where two different points are added,
and point doubling means that the same point is added to itself.
An elliptic curve can be visualized over real numbers as shown in Figure 2-12.
We can visualize the curve and group operations, that is, addition and doubling,
geometrically over real numbers, which helps to build intuition. In practice, however, the
curve over prime field is used to build ECC-based schemes. Though, when we try to plot
it, it appears quite random and not intuitive.
Point Addition
For adding two points, a line is drawn through points P and Q (the diagonal line in
Figure 2-13) to obtain a third point. This point, when reflected, is point R, shown as P+Q
in Figure 2-13.
Algebraically speaking, in the point addition operation, two points P = (x1, y1) and Q = (x2, y2) are added to obtain the coordinates of the third point R = (x3, y3) on the curve:
P + Q = R
s = (y2 − y1) / (x2 − x1) mod p
x3 = s² − x1 − x2 mod p
y3 = s(x1 − x3) − y1 mod p
Point Doubling
In point doubling, P is added to itself. In other words, P and Q are the same point. As the
point adds to itself, we can call this operation point doubling.
To double a point, a tangent line (the dotted diagonal line in Figure 2-14) is drawn
through point P, which obtains a second point where the line intersects with the curve.
This point is reflected to yield the result R, shown as 2P in Figure 2-14.
s = (3x1² + a) / (2y1) mod p
x3 = s² − 2x1 mod p
y3 = s(x1 − x3) − y1 mod p
Point multiplication (also called scalar multiplication) is the repeated addition of a point to itself:
P + P + … + P = dP
Q = dP
where P is a point on the curve, d is a randomly chosen integer serving as the private key, and Q is the public key obtained after the multiplication.
Making point multiplication faster is an active area of research. While there are many
algorithms for making scalar multiplication more quickly, we describe a quick example
here using the double and add algorithm. It combines point addition and doubling
operations to achieve performance.
For example, using addition only, to get 9P we must compute P + P + P + P + P + P + P + P + P, which quickly becomes impracticable as the multiplier grows. We can use the double-and-add mechanism to make this faster. Here, we first convert nine into binary. Starting from the most significant bit (MSB), for each bit that is one (high), perform the double and add operations, and for each zero, perform only the double operation. We do not perform any operation on the most significant bit itself. Nine is 1001 in binary, so scanning the bits from left to right we get P, 2P, 4P, 8P + P. This scheme produces 9P with only three doubling operations and one addition, instead of eight successive additions.
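To make the double-and-add idea concrete, here is a minimal Python sketch of point addition, point doubling, and scalar multiplication over the toy curve y² = x³ + 2x + 2 mod 17 from the worked examples above. The curve, base point, and multiplier are illustrative assumptions only; real schemes use standardized curves with very large parameters.

# Toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative only, not secure)
p, a, b = 17, 2, 2
G = (5, 1)   # base point

def point_add(P, Q):
    # None represents the point at infinity (the identity element)
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                    # P + (-P) = infinity
    if P == Q:                                         # point doubling
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:                                              # point addition
        s = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (s * s - x1 - x2) % p
    y3 = (s * (x1 - x3) - y1) % p
    return (x3, y3)

def scalar_mult(d, P):
    # Double-and-add: scan the bits of d from the most significant bit down
    R = None
    for bit in bin(d)[2:]:
        R = point_add(R, R)        # double for every bit
        if bit == "1":
            R = point_add(R, P)    # add for every 1 bit
    return R

print(scalar_mult(9, G))   # 9P = (7, 6) on this toy curve

A key pair is then simply a randomly chosen integer d together with the point Q = scalar_mult(d, G), as described next.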
Consider an elliptic curve E with two elements P and Q. The elliptic curve discrete logarithm problem is to find the integer d, where 1 ≤ d ≤ #E, such that
P + P + … + P = dP = Q
Here, Q is the public key (a point (x, y) on the curve), and d is the private key (an integer). The public key is a multiple of the generator point P, whereas the private key is the integer d that is used to generate that multiple. The generator point or base point G is a point on the curve that generates a cyclic subgroup, which means that every point in the subgroup can be reached by repeated addition of the base point.
#E represents the order of the group of the elliptic curve, that is, the number of points in the group formed by the points on the curve together with the point at infinity. The base point generates a cyclic subgroup of this group. The cofactor h is the number of points on the curve divided by the order of that subgroup.
The starting point P is a public parameter, and the public key Q is also published, whereas d, the private key, is kept secret. If d is not known, it is computationally infeasible to calculate it with only the knowledge of Q and P; this is the hard problem on which the ECDLP is built.
A key pair is linked with the specific domain parameters of an elliptic curve. Domain parameters are public values that are required to implement ECC schemes. These parameters are represented as a tuple {p, a, b, G, n, h}, where p is the prime modulus of the underlying field, a and b are the curve coefficients, G is the generator (base) point, n is the order of G, and h is the cofactor.
For example, Bitcoin uses the SECP256k1 curve with the equation y² = x³ + 7 and domain parameters as defined here: https://en.bitcoin.it/wiki/Secp256k1.
The most widely used curves are the NIST-proposed curves, such as P-256. Other curves include Curve25519, Curve1174, and many more. Of course, it is advisable to choose a safe curve. An excellent resource on safe and unsafe curves, along with explanations, is maintained online here: https://safecurves.cr.yp.to.
Digital Signatures
Public key cryptography is used to create digital signatures. It is one of the most common
applications of public key cryptography. In this section, we will discover how RSA,
ECDSA, and Schnorr signatures work. Concepts such as aggregate signatures and
multisignatures, also commonly used in blockchains, will be introduced.
Digital signatures provide a means of associating a message with an entity from
which the message has originated. Digital signatures are used to provide data origin
authentication and nonrepudiation.
Digital signatures are used in consensus algorithms, and especially in blockchain networks, to sign the transactions and messages sent by users on the network. Blocks are sealed cryptographically using a digital signature so that the recipient can verify the authenticity of the transmitted block. Similarly, all transactions are signed as well. It is common in consensus algorithms that blocks are sealed and broadcast to the network, and the recipients (other nodes) verify the signature to ascertain each block's authenticity. The blocks are inserted into the local blockchain after verification.
Digital signatures have three security properties: authenticity, unforgeability, and
nonreusability.
Authenticity
This means that the digital signatures are verifiable by the receiving party.
Unforgeability (Nonrepudiation)
This property guarantees that only the message’s sender can sign using the private
key. Digital signatures must also protect against forgery. Forgery means an adversary
fabricating a valid signature for a message without access to the legitimate signer’s
private key. In other words, unforgeability implies that no one other than the genuine sender can produce a valid signature on a message.
Nonreusability
This property necessitates that the digital signature cannot be separated from a message
and used again for another message. In other words, the digital signature is firmly bound
to the corresponding message and cannot be separated from its original message and
attached to another.
The process of signing and verification using digital signatures is shown in
Figure 2-15.
First, we produce the hash of the data for which we want to prove data origin
authentication. Then we encrypt the hash using the prover’s private key (signing key) to
create a “signature” and attach it with the data. Finally, this signed object is sent to the
verifier.
The verifier decrypts the encrypted hash of the data using the signer’s (sender)
public key to retrieve the original hash. The verifier then takes the data and hashes it
again through the hash function to produce the hash. If both these hashes match, the
verification is successful, proving that the signer indeed signed the data. It also proves
the data origin authentication, along with nonrepudiation and data integrity properties.
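As a minimal sketch of this hash-then-sign flow, the following uses textbook RSA, where signing literally raises the message hash to the private exponent, mirroring the description above. The tiny primes and the absence of padding are illustrative assumptions; real systems use large keys and standardized padding such as RSA-PSS.

import hashlib

# Toy RSA key pair (illustrative only): n = 61 * 53
n = 3233
e = 17                        # public exponent
d = pow(e, -1, 60 * 52)       # private exponent, using phi(n) = (61-1)(53-1)

def sign(message):
    # Hash the message, then "encrypt" the digest with the private key
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(h, d, n)

def verify(message, signature):
    # Recover the digest with the public key and compare with a fresh hash
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(signature, e, n) == h

sig = sign(b"hello")
print(verify(b"hello", sig))    # True
print(verify(b"hell0", sig))    # False: any change to the data breaks the signature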
Now we describe how ECDSA (elliptic curve digital signature algorithm) works.
ECDSA Signatures
ECDSA is a variant of the DSA based on elliptic curves. The DSA (Digital Signature Algorithm) is a standard for digital signatures and is based on modular exponentiation and the discrete logarithm problem. ECDSA is used on the Bitcoin and Ethereum blockchain platforms to validate messages and provide data integrity services.
Now, we’ll describe how ECDSA works.
To sign and verify using the ECDSA scheme, first a key pair needs to be generated:
• The modulus p and the curve coefficients a and b
• A base point A that generates a cyclic group of prime order q
• A randomly chosen integer d as the private key, with the corresponding public key B = dA
The public and private keys are then
Kpb = (p, a, b, q, A, B)
Kpr = d
Now, the signature can be generated using the private and public keys.
4. An ephemeral key Ke is chosen, where 0 < Ke < q. Ensure that Ke is random and that no two signatures are produced with the same ephemeral key; otherwise, the private key can be calculated.
s = (h(m) + d · r) · Ke⁻¹ mod q
Here, m is the message for which the signature is calculated, and h(m) is the hash of
the message m.
• Calculate point P:
P = u1 · A + u2 · B
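Putting these pieces together, the following sketch signs and verifies using the toy-curve helpers (point_add, scalar_mult) defined earlier in this chapter; here A = (5, 1) is the base point (the point G of the earlier sketch, renamed to match the book's notation) and q = 19 is assumed to be its order. The values u1 = h(m)·s⁻¹ mod q and u2 = r·s⁻¹ mod q follow the standard ECDSA verification equations (the corresponding steps are not reproduced above), and real implementations use large standardized curves and cryptographically secure randomness rather than the random module.

import hashlib
import random

A = (5, 1)   # base point (assumed, from the toy curve above)
q = 19       # order of the cyclic group generated by A (assumed)

def h(message):
    # Hash the message and reduce it modulo the group order
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % q

def keygen():
    d = random.randrange(1, q)     # private key (toy randomness only)
    B = scalar_mult(d, A)          # public key B = dA
    return d, B

def sign(message, d):
    while True:
        ke = random.randrange(1, q)                  # fresh ephemeral key every time
        r = scalar_mult(ke, A)[0] % q
        s = (h(message) + d * r) * pow(ke, -1, q) % q
        if r != 0 and s != 0:
            return r, s

def verify(message, r, s, B):
    w = pow(s, -1, q)
    u1, u2 = h(message) * w % q, r * w % q
    P = point_add(scalar_mult(u1, A), scalar_mult(u2, B))   # P = u1*A + u2*B
    return P is not None and P[0] % q == r

d, B = keygen()
r, s = sign(b"transfer 1 coin", d)
print(verify(b"transfer 1 coin", r, s, B))   # True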
Multisignatures
In this scheme, multiple unique keys held by their respective owners are used to sign a
single message. In blockchain implementations, multisignature schemes allow multiple users to sign a transaction, which results in increased security. Moreover, in blockchain networks, these schemes can be used so that users can require a minimum number of signatures to authorize a transaction.
For example, a 1-of-2 multisignature scheme can represent a joint account where
either one of the joint account holders is required to authorize a transaction by signing
it. In another variation, a 2-of-2 multisignature can be used where both joint account
holders’ signatures must authorize the transaction. This concept is generalized as m of
n signatures, where m is the minimum number of expected signatures and n is the total
number of signatures.
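A naive m-of-n check can be sketched in a few lines; here verify_sig is a placeholder for whatever single-signature verification function the underlying scheme provides (it is not a real library call).

def multisig_valid(message, signatures, m, verify_sig):
    # signatures is a list of (public_key, signature) pairs; the transaction is
    # authorized if at least m of them verify individually against the message.
    valid = sum(1 for pub, sig in signatures if verify_sig(message, sig, pub))
    return valid >= m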
Figure 2-16 shows the signing process on the left-hand side, where m different users, each holding a unique key, sign a single transaction. When the validator or verifier receives it, all the signatures in it need to be verified individually.
Threshold Signatures
This scheme does not rely on users to sign the message with their individual keys;
instead, it requires only one public key and one private key to produce the digital
signature. In a multisignature scheme, the signed message contains digital signatures from all signers, each of which must be verified individually by the verifying party; in a threshold signature scheme, the verifier must verify only one digital signature. The key idea in this
scheme is to split the private key into multiple parts, and each signer keeps its share of
the private key. The signing process requires each user to use their respective share of
the private key to sign the message. A particular communication protocol manages the
communication between the signers.
In contrast with multisignatures, the threshold signatures result in smaller
transaction sizes and are faster to verify. A downside, however, is that for threshold
signatures to work, all signers must remain online. In multisignature schemes, the
signatures can be delivered asynchronously. In other words, users can provide signatures
whenever available. One downside is that there could be a situation where users may
withhold their signature maliciously, resulting in a denial of service. We can also use threshold signatures to provide anonymity in a blockchain network because, unlike in multisignature schemes, the individual signers cannot be identified from the single combined signature.
Figure 2-17 shows the signing process on the left-hand side, where m different users, holding different parts (shares) of the private key, sign a single transaction. When the validator or verifier receives it, only one signature needs to be verified.
Aggregate Signatures
Aggregate signatures reduce the size of digital signatures. This scheme is beneficial
in scenarios where multiple digital signatures are in use. The core idea is to aggregate
multiple signatures into a single signature without increasing the size of the signature of
a single message. It is simply a type of digital signature that supports aggregation.
The small aggregate signature is enough to prove to the verifier that all users signed
their original messages. Thus, aggregate signatures are commonly used to reduce the
size of messages in network and security protocols. For example, we can significantly
reduce the size of digital certificate chains in Public Key Infrastructure (PKI) by
compressing all signatures in the chain into a single signature. Boneh-Lynn-Shacham
(BLS) aggregate signatures are a typical example of the aggregate signature. BLS has also
been used in various blockchains and especially in Ethereum 2.0.
Schnorr signatures are another type of signature based on elliptic curve
cryptography that allows key and signature aggregation. Schnorr signatures are 64 bytes
in size as compared to ECDSA, which is 71 bytes in signature size. ECDSA’s private key
size is 32 bytes and its public key is 33 bytes, whereas the Schnorr scheme’s private
and public keys are 32 bytes in size. Overall, Schnorr signatures are smaller and faster
than ECDSA.
Figure 2-18 shows how the aggregate signatures work.
Ring Signatures
Ring signature schemes are mechanisms where any member of a group of signers can
sign a message on behalf of the entire group. Each member of the ring group keeps
a public key and a private key. The key point here is that the identity of the actual
signer who signed the message must remain unknown (computationally infeasible to
determine) to an outside observer. It looks equally likely that anyone from the trusted
group of signers could have signed the message, but it is not possible to figure out the
individual user who signed the message. Thus, we can use ring signatures to provide an
anonymity service.
Hash Functions
Hash functions are keyless primitives which create fixed-length digests of arbitrarily long
input data. There are three security properties of hash functions.
Preimage Resistance
This property is also called a one-way property. It can be explained by using the simple
equation:
h(x) = y
where h is the hash function, x is the input, and y is the output hash. The first security
property requires that y cannot be reverse-computed to x. x is the preimage of y, thus the
name preimage resistance. This property is depicted in Figure 2-19.
Collision Resistance
The collision resistance property requires that it should be computationally infeasible to find two different input messages x and z that hash to the same output; in other words, h(x) ≠ h(z) for x ≠ z. Figure 2-21 shows a depiction of collision resistance.
In addition to these security properties, hash functions are expected to be easy (fast) to compute. Hash functions, due to their very nature, are always expected to have collisions, where two different messages hash to the same output, but in a good hash function, collisions must be computationally infeasible to find.
Moreover, hash functions should also have the property that a small change, even a single character change in the input text, results in an entirely different hash output. This is known as the avalanche effect.
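The avalanche effect is easy to observe with any modern hash function; for example, changing a single character of the input to SHA-256 yields a completely unrelated digest:

import hashlib

print(hashlib.sha256(b"blockchain consensus").hexdigest())
print(hashlib.sha256(b"Blockchain consensus").hexdigest())   # one character differs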
Hash functions are usually designed by using the iterated hash function method, where the input data is divided into blocks of equal size, which are then processed iteratively through a compression function.
Some prominent approaches to build hash functions using iterative methods are
listed as follows:
• Merkle-Damgard construction
• Sponge construction
The most common hash function schemes are SHA-0, SHA-1, SHA-2, SHA-3,
RIPEMD, and Whirlpool.
Design of SHA-256
SHA-256 has an input message size limit of 2⁶⁴ − 1 bits. The block size is 512 bits, and it has a word size of 32 bits. The output is a 256-bit digest.
The compression function processes a 512-bit message block and a 256-bit intermediate hash value. There are two main components of the algorithm: the compression function and the message schedule.
The algorithm works as follows, in nine steps.
Preprocessing
• Padding of the message is used to adjust the length of a block to 512
bits if it is smaller than the required block size of 512 bits.
• Parsing the message into message blocks, which ensures that the
message and its padding are divided into equal blocks of 512 bits.
• Setting up the initial hash value, which consists of the eight 32-bit
words obtained by taking the first 32 bits of the fractional parts of the
square roots of the first eight prime numbers. These initial values
are fixed and chosen to initialize the process. They provide a level of
confidence that no backdoor exists in the algorithm.
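As a quick check of this construction, the eight initial values can be reproduced directly from the definition; the first constant printed below, 0x6a09e667, matches the value listed in the SHA-256 specification (FIPS 180-4):

import math

primes = [2, 3, 5, 7, 11, 13, 17, 19]
# First 32 bits of the fractional part of the square root of each prime
H = [int((math.sqrt(x) % 1) * 2**32) for x in primes]
print([hex(v) for v in H])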
Hash Computation
• Each message block is then processed in a sequence, and it requires
64 rounds to compute the full hash output. Each round uses slightly
different constants to ensure that no two rounds are the same.
into the compression function with the first message. Subsequent blocks are fed into the
compression function until all blocks are processed to produce the output hash.
The compression function of SHA-256 is shown in Figure 2-23.
In Figure 2-23, a, b, c, d, e, f, g, and h are the registers for the eight working variables. Maj and Ch are functions which are applied bitwise, and Σ0 and Σ1 perform bitwise rotations. Wj is the message schedule word and Kj is the round constant for round j; both are added in the main loop (the compression function) of the hash function, which runs 64 times.
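For reference, these bitwise functions are small enough to sketch directly in Python; the rotation amounts are those defined for SHA-256:

MASK = 0xFFFFFFFF   # SHA-256 operates on 32-bit words

def rotr(x, n):
    # Rotate a 32-bit word right by n bits
    return ((x >> n) | (x << (32 - n))) & MASK

def ch(e, f, g):
    # Choose: each bit of e selects the corresponding bit of f or g
    return ((e & f) ^ (~e & g)) & MASK

def maj(a, b, c):
    # Majority of the three input bits at each position
    return (a & b) ^ (a & c) ^ (b & c)

def big_sigma0(a):
    return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)

def big_sigma1(e):
    return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)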
Figure 2-24 shows the sponge (absorb and squeeze) model on which SHA-3, or Keccak, is based. Analogous to a sponge, the input data m is first “absorbed” into the state after applying padding: each block is combined into part of the permutation state using XOR (exclusive OR). Finally, the output is “squeezed” out of the sponge function, representing the transformed state. The rate r is the input block size of the sponge function, whereas the capacity c determines the security level.
In Figure 2-24, the state size b is calculated by adding the bit rate r and the capacity c. r and c can be any values as long as b = r + c is one of 25, 50, 100, 200, 400, 800, or 1600. The state is a three-dimensional bit matrix which is initially set to 0. The data m is entered into the absorb phase block by block via XOR ⊕ after applying padding.
Table 2-1 shows the value of bit rate r (block size) and capacity c required to achieve
the desired output hash size under the most efficient setting of r + c = 1600.
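For example, to produce a 256-bit digest, SHA3-256 uses r = 1088 and c = 512; the other standard combinations are r = 1152 and c = 448 for a 224-bit output, r = 832 and c = 768 for 384 bits, and r = 576 and c = 1024 for 512 bits, with the capacity always set to twice the digest length.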
The key idea is to apply these transformations (the five step mappings θ, ρ, π, χ, and ι of the Keccak permutation) to achieve the avalanche effect, which ensures that even a tiny change in the input results in a substantial change in the output. These five operations combined form a single round. In the SHA-3 standard, 24 rounds are applied to achieve the desired level of security.
The tag T and the message M are sent to the receiver, who runs the same process and compares the received T with T′, which the verifier has generated by applying the same MAC function to M; if they match, the verification is successful.
There are pros and cons to both methods, and some attacks on both schemes have been reported. The HMAC construction uses ipad (inner padding) and opad (outer padding) for padding and is considered secure under certain assumptions.
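For example, Python's standard library can compute and verify an HMAC-SHA256 tag directly; the key and message below are placeholders:

import hashlib
import hmac

key = b"shared secret key"
message = b"consensus message"

tag = hmac.new(key, message, hashlib.sha256).digest()          # sender computes T
tag_check = hmac.new(key, message, hashlib.sha256).digest()    # receiver computes T'
print(hmac.compare_digest(tag, tag_check))                     # constant-time comparison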
Various significant applications of hash functions are used in peer-to-peer networks
and blockchain networks, such as Merkle trees, Patricia tries, Merkle Patricia tries, and
distributed hash tables.
Some recent advancements, such as verifiable delay functions (VDFs), are discussed next.
There are two security properties of VDFs: uniqueness and sequentiality. Uniqueness ensures that the output y produced by the VDF is unique for every input x. The sequentiality property ensures that the delay parameter t is enforced, that is, the output cannot be computed faster than by performing t sequential steps.
There are many proposals on how to construct VDFs. Some approaches use hardware enclaves to store cryptographic keys inside the enclave and use those keys to generate VDF outputs. Using a hash function to iteratively hash the output again as the input, forming a hash chain, is another way of constructing verifiable delay functions. Creating a hash chain with a hash function in this iterative fashion is an inherently sequential process and takes time; thus, it can serve as the evaluation function of a VDF. Another method gaining more popularity is the algebraic construction, where finite cyclic groups that are assumed to have unknown order are used.
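A minimal sketch of the hash-chain idea follows; note that this naive construction forces the verifier to redo the entire computation, which is precisely the limitation that the algebraic constructions aim to remove by allowing fast verification.

import hashlib

def vdf_eval(x, t):
    # Inherently sequential: each step needs the output of the previous one
    y = x
    for _ in range(t):
        y = hashlib.sha256(y).digest()
    return y

def vdf_verify(x, t, y):
    # Naive verification simply re-runs the chain
    return vdf_eval(x, t) == y

output = vdf_eval(b"seed", 100_000)
print(vdf_verify(b"seed", 100_000, output))   # True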
VDFs have many innovative applications in blockchains, including constructing consensus algorithms, providing verifiable randomness, and leader election. You will explore these applications in detail when we discuss the relevant consensus protocols in Chapter 8.
Summary
• Cryptography is the science of secret communication.
Bibliography
1. Paar, C. and Pelzl, J., 2009. Understanding cryptography: a
textbook for students and practitioners. Springer Science &
Business Media.
7. Boneh, D., Bonneau, J., Bünz, B., and Fisch, B., 2018, August.
Verifiable delay functions. In Annual international cryptology
conference (pp. 757–788). Springer, Cham.
CHAPTER 3
Distributed Consensus
Consensus is a fundamental problem in distributed computing. This chapter will cover
the fundamentals of the consensus problem and discuss some history covering the
Byzantine generals problem, building blocks of consensus, and how we can solve this
problem in distributed systems.
As fault tolerance is a fundamental requirement in distributed systems, several
primitives introduce fault tolerance. Fault-tolerant broadcast algorithms allow for the
development of fault-tolerant applications. Consensus enables processes to reach a
common decision despite failures. Both topics are well researched in academia and the
industry.
Before we dive into discussing consensus and agreement problems in detail, let’s
cover some building blocks in continuation of link abstractions from Chapter 1 that are
closely related to consensus and agreement problems.
Broadcast Primitives
Earlier in Chapter 1, we learned about links that pass messages between a pair of
processes in a point-to-point or one-to-one setting. This one-to-one communication
(also called unicast) is quite common and used in the client-server architecture. For
example, a web server making requests to a backend database can be seen as an
example of this type of two-sided connection. There is one sender and one specific
recipient, that is, the web server and backend database, respectively.
However, in many cases where multiple nodes are involved, the client-server type
scheme is not adequate. Moreover, in many situations, one-to-one communication
is not sufficient, and we need to use some mechanism that can send messages to
multiple nodes or a group of nodes simultaneously. In such situations, we use broadcast
protocols.
Figure 3-1. A node broadcasting a message m and all three nodes delivering it
Note that there is a difference between sending and receiving and broadcasting and
delivering. Sending and receiving are used in the context of point-to-point links, and
broadcast and delivery are used in broadcast abstractions where a message is broadcast
to multiple/all nodes.
Point-to-point links discussed in Chapter 1 are associated with send and receive
primitives where a node sends a message, and the recipient node receives them.
Broadcast abstractions with broadcast and deliver primitives depict a situation
where a node sends a message to multiple/all nodes in the network and nodes receive
them, but, here, the broadcast algorithm can store and buffer the message after receiving
it and deliver it to the process later. It depends on the broadcast algorithm (also called
middleware). For example, in total order broadcast, the message may be received by
broadcast algorithms running on each process but can be buffered until the conditions
meet to deliver the message to the application.
The diagram in Figure 3-2 shows this concept visually.
The communication occurs within a group of nodes where the number of nodes
might be static or dynamic. One process sends it, and all nodes in the group agree on
it and deliver it. If a single processor or some processors become faulty, the remaining
nodes carry on working. Broadcast messages are targeted to all processes.
Broadcast abstractions allow for the development of fault-tolerant applications.
There are several types which I describe as follows.
Best-Effort Broadcast
In this abstraction, reliability is guaranteed only if the sender process does not fail. This
is the weakest form of reliable broadcast. There are three properties that best-effort
broadcast has.
Validity
If a message m is broadcast by a correct process p, then message m is eventually
delivered by every correct process. This is a liveness property.
No Duplication
Every message is delivered at most once.
No Creation
If a process delivers a message m with a sender process p, then m was previously
broadcast by sender process p. In other words, messages are not created out of thin air.
Figure 3-3 depicts an execution of best-effort broadcast.
In Figure 3-3, notice that process p has broadcast the message but then crashed, and as per the system properties, message delivery is not guaranteed in this case. Notice that process q did not deliver the message because process p is no longer correct. However, process R delivered it. There is no delivery guarantee in this abstraction if the sender fails, as shown in Figure 3-3. If some processes deliver the message and some do not, the result is disagreement. As you can imagine, this abstraction may not be very useful in stricter scenarios; we need a more robust protocol. To address such limitations, a reliable broadcast abstraction is used.
Reliable Broadcast
A reliable broadcast abstraction introduces an additional liveness property called
agreement. No duplication and no creation properties remain the same as the best-
effort broadcast abstraction. The validity property is slightly weakened. Formally,
validity and agreement properties can be stated as follows.
Validity
If a message m is broadcast by a correct process p, then p itself eventually delivers m.
Agreement
If a message m is delivered by a correct process, then every correct process delivers m.
Remarks
In case the sender process crashes while broadcasting and has not been able to send the message to all processes, the agreement property ensures that the correct processes still agree on its delivery. It is possible that only some processes have received the message, but reliable broadcast ensures that if any correct process delivers it, every correct process eventually delivers it too. In other words, if the sender process crashes, reliable broadcast ensures that either all correct nodes eventually deliver the message or no correct node delivers the message at all.
If the sender process fails, this property ensures that either all correct nodes deliver the message or none of the correct nodes does. This property is achieved by correct processes retransmitting the message, which results in its eventual delivery.
This solution seems reasonable enough, but there might be situations in which
the broadcaster process may have been able to deliver to itself but then crashed before
it could send to other processes. This means that all correct processes will agree not
to deliver the message because they have not received it, but the original broadcaster
delivers it. Such situations can cause safety issues.
To address this limitation, uniform reliable broadcast, which provides a stronger
guarantee, is used.
Uniform Agreement
If a message m is delivered by a process p, every correct process eventually delivers m. p
can be either a correct or a failed process.
In all the preceding discussed abstractions, there’s a crucial element missing which
is required in many distributed services. For example, imagine a scenario of an online
chat app. If a user sends a message saying “England has won the cricket match,” and
another user replies “congratulations,” and a third user says “But I wanted Ireland to
win,” the expected sequence in which the messages are supposed to appear in the
chat app is
• User 1: England has won the cricket match
• User 2: congratulations
• User 3: But I wanted Ireland to win
However, if there is no order imposed on the message delivery, it might appear that
even if User 1’s message was sent first, it may turn out that in the app (to the end user)
the messages might appear like this:
• User 2: congratulations
• User 1: England has won the cricket match
Now this is not the expected order; the “congratulations” message without the
context of winning the match would appear confusing. This is the problem that is solved
by imposing an order guarantee on the broadcast abstractions.
Now we will discuss four abstractions that deliver messages with an order guarantee
with varying degrees of strictness: FIFO reliable broadcast, causal reliable broadcast,
total order reliable broadcast, and FIFO total order broadcast.
FIFO Reliable Broadcast
In this abstraction, all properties remain the same as reliable broadcast; however, a new property for FIFO delivery is introduced.
FIFO Delivery
If a process has broadcast two messages m1 and m2, respectively, then any correct
process does not deliver m2 before m1. In other words, if m1 is broadcast before m2 by
the same process, then no correct process delivers m2 unless it has delivered m1 first.
This guarantee, however, applies only when m1 and m2 are broadcast by the same process; if two different processes have broadcast the messages, then there is no guarantee of the order in which they will be delivered.
In practice, TCP is an example of FIFO delivery. If you need FIFO delivery in your use
case, you can simply use TCP.
Validity
If a correct process p broadcasts a message m, then some correct process eventually
delivers m.
Agreement
If a message m is delivered by a correct process p, then all correct processes eventually
deliver m.
Integrity
For any message m, each process delivers m at most once and only if m was previously
broadcast. In literature, this property is sometimes divided into two separate properties:
no duplication where no message is delivered more than once and no creation which
states that a delivered message must be broadcast by the sender process. In other words,
no messages are created out of thin air.
Total Order
This property states that if a message m1 is delivered before m2 at one process, then m1 is delivered before m2 at all processes.
Imagine another situation where, using reliable links, a node has sent a message to all nodes individually, but while in transit to some nodes, some messages were dropped. At this point, the sender process fails, and consequently no retransmission occurs. As a result, the nodes that did not receive the message will now never receive it because the sender process has crashed. How can we improve reliability in this scenario?
We can devise a scheme where if one node receives a message for the first time, it
broadcasts it again to other nodes through reliable channels. This way, all correct nodes
will receive all the messages, even if some nodes crash. This is called eager reliable
broadcast. Eager reliable broadcast is reliable; however, it can incur O(n) steps and O(n²) messages for n nodes.
Figure 3-4 visualizes the eager reliable protocol.
There are also other algorithms which we can call nature-inspired algorithms. For example, consider how an infectious disease or a rumor spreads: one person infects a few others, those others infect a few more, and the infection rate quickly increases. If a broadcast protocol is designed on this principle, it can be very effective at disseminating information (messages) throughout the network quickly. As these protocols are randomized, they do not guarantee that all nodes will receive a message, but there is usually a very high probability that all nodes eventually get all messages. Such probabilistic protocols, or gossip protocols, are commonly used in peer-to-peer networks. Many protocols have been designed based on this type of dissemination.
Figure 3-5 illustrates how a gossip protocol works. The idea here is that when a node
receives a message for the first time, it forwards it to some randomly chosen other nodes.
This technique is useful for broadcasting messages to many nodes, and the message
eventually reaches all nodes with high probability.
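A toy simulation of this forwarding rule is sketched below; the node count and fanout are arbitrary illustrative values.

import random

def gossip(n, fanout, source=0):
    # Every node that receives the message for the first time forwards it
    # to `fanout` randomly chosen peers.
    informed = {source}
    queue = [source]
    while queue:
        node = queue.pop()
        for peer in random.sample(range(n), fanout):
            if peer not in informed:
                informed.add(peer)
                queue.append(peer)
    return informed

# With a modest fanout, the message typically reaches (almost) all 100 nodes.
print(len(gossip(n=100, fanout=3)))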
Probabilistic Validity
If a correct process p broadcasts a message m, then every correct process eventually
delivers it with probability 1.
Integrity
Any message is delivered at most once, and the message delivered has been previously
broadcast by a process – in other words, no duplicate message and no message creation
out of thin air.
With this, we complete our discussion on broadcast protocols. Let’s now move
on to the agreement abstraction, which is one of the most fundamental problems in
distributed computing.
First, I’ll explain what an agreement is, and then we’ll build upon this fundamental
idea and present the consensus problem.
Agreement
In a distributed system, the agreement between the processes is a fundamental
requirement. There are many scenarios where processes need to agree for the
distributed system to achieve its goals. For example, in broadcast abstractions, an
agreement is needed between processes for the delivery of messages.
There are various agreement problems, and we will cover the most prominent of
them in the following and then focus more on consensus.
We have already covered reliable broadcast and total order broadcast. In this section,
I will explain some additional points briefly on reliable broadcast and total order
broadcast; then, we’ll explore the Byzantine agreement and consensus.
Reliable Broadcast
The reliable broadcast assures reliability even if the sender process fails. In other words,
reliability is guaranteed whether the sender is correct or not.
There are many variations of the consensus problem depending upon the system
model and failure models.
The abovementioned consensus problem is called uniform consensus where the
agreement property is strict and does not allow a crashed process to decide differently.
The first variation where the validity property is weak and the agreement is also weak
can be categorized as crash fault–tolerant consensus.
An algorithm that satisfies all these safety and liveness properties is called a correct
algorithm. Solving a consensus problem is not difficult in failure-free synchronous
systems; however, the problem becomes difficult in systems that are failure-prone.
Failures and asynchrony make solving consensus a complex problem.
Binary consensus is a simple type of consensus where the input is restricted, and
as a result, the decision value is restricted to a single bit, either zero or one. Multivalued
consensus is a type of consensus where the objective is to agree on multiple values, that
is, a series of values over time.
While binary consensus may not seem like a very useful construct at first glance, a solution to the binary consensus problem leads to a solution for multivalued consensus; hence, it is an important area of research.
The definitions of properties of consensus may change slightly depending on
the application. For example, usually in blockchain consensus protocols the validity
property is defined differently from established definitions and may settle for a weaker
variant. For example, in Tendermint consensus the validity property simply states,
“a decided value is valid i.e. it satisfies the predefined predicate denoted valid( )”. This
can be an application-specific condition. For example, in blockchain context it could
be required that a new block added to the Bitcoin blockchain must have a valid block
header that passes node validation checks. In other variations, the valid( ) predicate
requirement and the condition “if all honest processes propose the same value, then all
decide on that same value” can be combined. This is a combination of validity predicate
and traditional validity condition. There are variations and different definitions. Some
are strict, and some are not so strict, depending on the application. We will cover
blockchain consensus and relevant consensus protocols throughout this book and will
define and redefine these requirements in the context of the consensus algorithm and
fault models being discussed.
A consensus protocol is crash fault tolerant (CFT) if it can tolerate benign faults
up to a certain threshold. A consensus protocol is Byzantine fault tolerant (BFT) if it
can tolerate arbitrary faults. In order to achieve crash fault tolerance, the underlying
distributed network must satisfy the condition N >= 2F+1, where N is the number of
nodes in the network, and F is the number of faulty nodes. If the network satisfies this
condition, only then will it be able to continue to work correctly and achieve consensus. If Byzantine faults are required to be tolerated, then the condition becomes N >= 3F+1. We will cover this more formally when we discuss impossibility results later in this chapter. But remember these conditions as tight lower bounds.
A consensus problem applies to other problems in distributed computing too.
Problems like total order broadcast, leader election problem, and terminating reliable
broadcast require an agreement on a common value. These problems can be considered
consensus variants.
System Models
To study consensus and agreement problems and develop solutions, there are some
underlying assumptions that we make about the behavior of the distributed system. We
learned many of these abstractions regarding node and network behavior in Chapter 1.
Here, we summarize those assumptions and move on to discuss the consensus
problem in more detail. The reason for describing system models here is twofold: first,
to summarize what we learned in Chapter 1 regarding the behavior of the nodes and
networks and, second, to put this knowledge in the context of studying consensus and
agreement problems. For a detailed study, you can refer to Chapter 1.
Distributed System
A distributed system is a set of processes that communicate using message passing.
Consensus algorithms are designed based on assumptions made about timing and
synchrony behavior of the distributed system. These assumptions are captured under
the timing model or synchrony assumptions, which we describe next.
Timing Model/Synchrony
Synchrony assumptions capture the timing assumption about a distributed system. The
relative speed of processors and communication is also taken into consideration. There
are several synchrony models. We briefly describe those as follows:
for the asynchronous model, such as HoneyBadger. We will cover blockchain consensus
in Chapter 5 and then throughout this book. For now, I will focus on the distributed
consensus problem in general and from a traditional point of view.
Also, note that an asynchronous message-passing model with Byzantine faults
expresses conditions of a typical distributed system based on the Internet today.
This is especially true of public blockchain platforms such as Bitcoin or Ethereum.
Process Failures
Failure models allow us to make assumptions about which failures can occur and how
we can address them. Failure models describe the conditions under which the failure
may or may not occur. There are various classes, such as crash failures, where processes
can crash-stop or crash-fail, or omission failures where a processor can omit sending or
receiving a message.
Another type of omission fault is called the dynamic omission fault. In this model, a
system can lose a maximum number of messages in each round. However, the channels
on which the message losses occur may change from round to round.
Timing failures are those where processes do not comply with the synchrony
assumptions. The processes may exhibit Byzantine behavior where the processes can
behave arbitrarily or maliciously. In the Byzantine model, the corrupted processor can
duplicate, drop a message, and actively try to sabotage the entire system. We also define
an adversary model here where we make some assumptions about an adversary who can
adversely affect the distributed system and corrupt the processors.
In authenticated Byzantine failures, it is possible to identify the source of the
message via identification and detect the forged messages, usually via digital signatures.
Failures that occur under this assumption are called authenticated Byzantine failures.
Messages can be authenticated or non-authenticated. Authenticated messages usually use digital signatures to allow the detection of forgery and message tampering.
The agreement problem becomes comparatively easier to solve with authenticated
messages because recipients can detect the message forgery and reject the unsigned or
incorrectly signed messages or messages coming from unauthenticated processes. On
the other hand, distributed systems with non-authenticated messages are difficult to
deal with as there is no way to verify the authenticity of the messages. Non-authenticated
messages are also called oral messages or unsigned messages. Even though it is more difficult, agreement can still be reached with oral messages, provided that enough of the processes are honest.
Channel Reliability
It is often assumed that the channel is reliable. Reliable channels guarantee that if a
correct process p has sent a message m to a correct process q, then q will eventually
receive m. In practice, this is usually the TCP/IP protocol that provides reliability.
Lossy channels are another assumption that captures the notion of channels where
messages can be lost. This can happen due to poor network conditions, delays, denial-of-service or other hacking attacks, slow networks, network misconfiguration, noise, buffer overflows, network congestion, and physical disconnections. There might be many other reasons, but these are the most common ones.
There are two variations of lossy channels. In one variation, there is an upper bound k on the number of messages that can be lost; in the other, known as fair-loss channels, there is no such upper bound. The first variation is easier to handle, as the algorithm can retransmit the message k+1 times, ensuring that at least one copy is received. In the latter variation, the fair-loss channels, if the sender keeps resending a message, it is eventually delivered, provided that both the sender and the receiver are correct. We
discussed this in greater detail in Chapter 1.
History
Consensus problems have been studied for decades in distributed computing. Achieving
consensus under faults was first proposed by Lamport et al. in their paper “SIFT: Design
and analysis of a fault-tolerant computer for aircraft control.”
Later, a Byzantine fault–tolerant protocol under a synchronous setting was first
proposed by Lamport et al. in their seminal paper “Reaching Agreement in the Presence
of Faults.”
The impossibility of deterministically reaching an agreement in an asynchronous system, even if only a single process crash-fails, was proven by Fischer, Lynch, and Paterson. This discovery is famously known as the FLP impossibility result.
Ben-Or proposed asynchronous Byzantine fault tolerance using randomization to
circumvent FLP. In addition, partial synchrony was presented in DLS 88 for BFT.
In Figure 3-8, two generals must agree on the time to attack; otherwise, they cannot win.
The issue is that no general can ever be sure about the commitment from the
other general. If general 1 always attacks even if no acknowledgment is received from
general 2, then general 1 risks being alone in the attack if all messengers are lost. This
is the case because general 2 knows nothing about the attack. If general 1 attacks
only if a positive acknowledgment is received from general 2, then general 1 is safe.
General 2 is in the same situation as general 1 because now he is waiting for general 1’s
acknowledgment. General 2 might consider himself safe as he knows that general 1 will
only attack if general 2’s response is received by general 1. General 2 is now waiting for
the acknowledgment from general 1. They are both thinking about whether the other
general received their message or not, hence the paradox!
From a distributed system perspective, this experiment depicts a situation where two
processes have no common knowledge, and the only way they can find out about the
state of each other is via messages.
• The honest generals also don’t know who the traitors are, but traitors
can collude together.
The challenge here is whether in this situation an agreement can be reached and
what protocol can solve this problem, if any. Figure 3-9 shows this problem.
Figure 3-9. Byzantine generals problem – showing each army unit receiving
misleading, correct, or no messages at all
It turns out that this problem is impossible to solve in general. It has been proven that it can be solved only if fewer than one-third of the generals are traitors. For example, if there are 3t + 1 generals, only up to t can be malicious. This is a proven lower bound on Byzantine fault tolerance. We will see this more formally under the section “Impossibility Results,” where we discuss FLP, CFT lower bounds, and BFT lower bounds.
In distributed systems, we can draw an analogy where generals represent processes (nodes), traitors represent Byzantine processes, honest generals represent correct processes, messengers represent communication links, the loss of a message is a captured messenger, and the absence of a time limit for a messenger to reach the generals represents asynchrony. I think you get the picture now!
Replication
In this section, we will discuss replication. Replication is used to maintain an exact copy
of the data in multiple nodes. This technique has several advantages. One key advantage
is fault tolerance. One example of the simplest replication is RAID in storage systems.
For example, in RAID-1, there are two disks, and they are exact replicas (mirror) of each
other. If one is unavailable, the copy is available, resulting in fault tolerance and high
availability. In distributed systems, replication is used for various reasons and, unlike
RAID, is between multiple nodes instead of just between two disks within a server.
Also, if data remains unchanged, then replication is easy. You can just make a one-off
copy of the data and store it on another disk or node. The challenging part is how to keep
replication consistent when the data is subject to constant change.
There are several advantages:
• High cost: because multiple replica nodes are required, the setup might be expensive.
Replication can be achieved by using two methods. One is the state transfer, where
a state is sent from one node to another replica. The other approach is state machine
replication. Each replica is a state machine that runs commands deterministically in the
same sequence as other replicas, resulting in a consistent state across replicas. Usually,
in this case, a primary server receives the commands, which then broadcasts them to
other replicas that apply those commands.
There are two common techniques for achieving replication. We define them as
follows.
Active Replication
In this scheme, the client commands are ordered via an ordering protocol and forwarded
to replicas that execute those commands deterministically. The intuition here is that if all
commands are applied in the same order at all replicas, then each replica will produce
the same state update. This way, all replicas can be kept consistent with each other. The
key challenge here is to develop a scheme for ordering the commands so that all nodes
execute the same commands in the same order. Also, each replica starts in the same
state and is a copy of the original state machine. Active replication is also known as state
machine replication.
Passive Replication
In the passive replication approach, there is one replica that is designated as primary.
This primary replica is responsible for executing the commands and sending (broadcast)
the updates to each replica, including itself. All replicas then apply the state update
in the order received. Unlike active replication, the processing is not required to be
deterministic, and any anomalies are usually resolved by the designated primary replica, which produces deterministic state updates. This approach is also called primary backup
replication. In short, there is only a single copy of the state machine in the system kept
by the primary replica, and the rest of the replicas only maintain the state.
Pros and Cons
There are pros and cons of both approaches. Active replication can result in wastage of
resources if the operations are intensive. In the case of passive replication, large updates
can consume a large amount of network bandwidth. Furthermore, in passive replication,
as there is one primary replica, if it fails, the performance and availability of the system
are impacted.
In the passive approach, client write requests are preprocessed by the primary and
transformed into state update commands, which apply to all replicas in the same order.
Each replica is a copy of the state machine in active replication, whereas, in passive
replication, only the primary is a single copy of the state machine.
Note that even though there is a distinction between active and passive replication at
a fundamental level, they both are generic approaches to making a state machine
fault-tolerant.
Let’s now see how primary backup replication works. I am assuming a fail-stop model here.
1. The client sends a request to the primary.
2. The primary processes the request and transforms it into a state update command.
3. The primary sends the state update to all backup replicas, which apply it and respond to the primary.
4. The primary waits until it has received all responses from backup replicas.
5. The primary then commits the update.
6. After this, the primary sends the response back to the client.
How are failures handled? If the primary fails, one of the backups will take over.
Now, this looks like a suitable protocol for achieving fault tolerance, but what if the
primary fails? Primary failure can lead to downtime as recovery can take time. Also,
reading from the primary can produce incorrect results because, in scenarios where
the client makes a read request to the primary before the commit point, the primary
will not produce the result, even though all replicas have that update delivered to them.
One solution might be to deal with reads as updates, but this technique is relatively
inefficient. Also, the primary is doing all the work, that is, sending to other replicas,
receiving responses, committing, and then replying to the client. Also, the primary must
wait until all responses are received from the replicas for it to be able to respond to the
client. A better solution, as compared to the primary backup approach, is chain replication. Here, the core idea is that one of the servers replies to read requests, while another processes the update commands.
Chain Replication
Chain replication organizes replicas in a chain with a head and a tail. The head is the
server with the maximum number, whereas the tail is the one with the lowest number.
Write requests or update commands are sent to the head, which sends the request using
reliable FIFO links to the next replica on the chain, and the next replica then forwards it
to the next until the update reaches the last (tail) server. The tail server then responds to
the client. The head replica orders the requests coming in from the clients.
For a read request (query), the client sends it directly to the tail, and the tail replies.
It is easy to recover when the tail fails by just reselecting the predecessor as the new tail.
If the head fails, its successor becomes the new head and clients are notified. Figure 3-11
shows how chain replication works.
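A minimal sketch of this write path and tail read is shown below; the replica names and the stored value are hypothetical.

class ChainReplica:
    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor   # next replica toward the tail
        self.store = {}

    def write(self, key, value):
        self.store[key] = value                          # apply locally
        if self.successor is not None:
            return self.successor.write(key, value)      # forward down the chain
        return "ack from tail " + self.name              # tail acknowledges

    def read(self, key):
        return self.store.get(key)                       # reads are served by the tail

# Build a chain: head -> middle -> tail
tail = ChainReplica("r1")
middle = ChainReplica("r2", successor=tail)
head = ChainReplica("r3", successor=middle)

print(head.write("x", 42))   # clients send updates to the head
print(tail.read("x"))        # clients send queries directly to the tail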
state := initial
log := empty
while (true) {
  on event receivecommand(command) {
    appendtolog(command)
    (state, output) := statetransition(command, state)
    sendtoclient(output)
  }
}
In this pseudocode, the state machine starts with an initial state. When a command
is received from the client, it appends that command to the log. After that, it executes the command through the transition function, which updates the state and produces an output. This
output is sent to the client as a response.
Figure 3-12 illustrates this concept.
The key idea behind state machine replication is that if the system is modelled
as a state machine, then replica consistency can be achieved by simply achieving an
agreement on the order of operations. If the same commands are applied to all nodes in
the same order, then a general approach to keep all replicas consistent with each other is
achieved. However, the challenge here is to figure out how to achieve a common global
order of the commands.
In order to achieve an agreement on the order of operations, we can use agreement
protocols such as Byzantine agreement protocols or reliable broadcast protocols. We
discussed the total order broadcast abstraction earlier in this chapter. We can also
use consensus algorithms such as Paxos or PBFT to achieve this. Remember that in
total order broadcast, each process delivers the same message in the same order. This
property immediately solves our problem of achieving an agreement on the order of
operations, which is the core insight behind state machine replication. Total order
broadcast ensures that commands from different clients are delivered in the same order.
If commands are delivered in the same order, they will be executed in the same order
and, as the state machine is deterministic, all replicas will end up in the same state.
Each replica is a state machine which transitions its state to the next
deterministically as a result of executing the input command. The state on each replica
is maintained as a set of (key, value) pairs. Executing a command transitions the current state to the next and produces an output. Determinism is important because this ensures that each
command execution produces the same output. Each replica starts in the same initial
state. Total order broadcast delivers the same command to each replica in the global
order, which results in each replica executing the same sequence of commands and
transitioning to the same state. This achieves the same state at each replica.
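A small sketch of this idea follows: three replicas apply the same totally ordered log of commands to an initially empty key-value store and end in identical states. The log here is hard-coded; in a real system its order would be produced by total order broadcast or a consensus protocol.

def apply_command(state, command):
    # Deterministic transition function over a key-value state
    op, key, value = command
    if op == "set":
        state[key] = value
    elif op == "delete":
        state.pop(key, None)
    return state

log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]   # agreed order

replicas = [dict() for _ in range(3)]
for replica in replicas:
    for command in log:
        apply_command(replica, command)

print(replicas)   # all three replicas hold the same state: {'y': 2}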
This principle is also used in blockchains where a total order is achieved on the
sequence of transactions and blocks via some consensus mechanism, and each node
executes and stores those transactions in the same sequence as other replicas, as proposed by the leader (e.g., the proof-of-work winner). We will explore this in detail in Chapter 5.
Traditional protocols such as Practical Byzantine Fault Tolerance (PBFT) and RAFT are
state machine replication protocols.
SMR is usually used to achieve increased system performance and fault tolerance.
System performance can increase because multiple replicas host copies of data, and
more resources are available due to multiple replicas. Fault tolerance increases due to
the simple fact that as data is replicated on each replica, even if some replicas are not
available, the system will continue to operate and respond to client queries and updates.
Now let’s look at SMR properties formally.
Deterministic Operations
All correct replicas deterministically produce the same output and state for the same
input and state.
Coordination
All correct replicas process the same commands in the same order.
The coordination property requires the use of agreement protocols such as total
order broadcast or some consensus algorithms.
There are also two safety and liveness properties which we describe as follows.
Safety
All correct replicas execute the same commands. This is the agreement property. There
are two general approaches to achieve an agreement. We can use either a total order
broadcast or a consensus protocol. A total order broadcast protocol needs to run only
once for the whole state machine replication, whereas a consensus mechanism is instantiated for each position in the sequence of commands.
Liveness
All correct commands are eventually executed by correct replicas. This is also called the
completion property.
Safety ensures consistency, whereas liveness ensures availability and progress.
Figure 3-13 demonstrates how SMR generally works.
The replicated log on the replicas ensures that commands are executed by the state
machine in the same order on each replica. The consensus mechanism (in the top-left
corner of the image) ensures that an agreement is achieved on the order of commands
and as a result written into the log as such. This involves reaching an agreement on the
sequence of commands with other replicas. This replicated system will make progress if
a majority of the replicas are up.
Consensus and state machine replication are related in the sense that distributed
consensus establishes the global common order of state machine commands, whereas
the state machine executes these commands according to the global order determined
by the consensus (agreement) algorithm, and thus each node (state machine) reaches
the same state.
A crash fault–tolerant SMR requires at least 2f+1 replicas, whereas a BFT SMR
requires 3f+1 replicas, where f is the number of failed replicas.
State machine replication achieves consistency among replicas. There are various
replica consistency models. We’ll briefly explore them here.
Linearizability
Another stronger property that a state machine replication protocol may implement is
called linearizability. Linearizability is also called atomic consistency, and it means that
command execution appears as if executed on a single copy of the state machine, even
if there are multiple replicas. The critical requirement of linearizability is that the state
read is always up to date, and no stale data is ever read.
Consistency models allow developers to understand the behavior of the replicated
storage system. When interacting with a replicated system, the application developers
experience the same behavior as interacting with a single system. Such transparency
allows developers to use the same single server convention of writing application logic. If
a replicated system possesses such transparency, then it is said to be linearizable.
In literature, linearizability is also called strong consistency, atomic consistency, or
immediate consistency.
Sequential Consistency
In this type of consistency, all nodes see the same order of commands as other nodes.
Linearizability and sequential consistency are two classes of strong consistency.
Eventual Consistency
Under the eventual consistency model, the guarantee is that all replicas will eventually be in the same state once there are no more updates. However, there is no timing guarantee, because updates may never stop; therefore, this is a rather weak model. A stronger scheme is called strong eventual consistency, which
has two properties. Firstly, updates applied to one honest replica are eventually applied
to every nonfaulty replica. Secondly, regardless of the order in which the updates have
been processed, if two replicas have processed the same set of updates, they end up
in the same state. The first property is called eventual delivery, whereas the second
property is named convergence.
There are several advantages of this approach. It allows replicas to make progress even without network connectivity; once connectivity is restored, the replicas eventually converge to the same state. Eventual consistency can also work with weaker broadcast models instead of total order broadcast.
Eventual consistency has several variants, such as last write wins (LWW). The technique here is to apply the update with the most recent timestamp and discard any other updates writing to the same key (updating the same data) with lower timestamps. This means that we accept some data loss in favor of an eventually converging state at all replicas.
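As a rough illustration of last write wins (not from the book), the following Python sketch merges two replicas' key-value maps, keeping for each key the value carrying the highest timestamp and discarding the rest.

def lww_merge(replica_a, replica_b):
    """Merge two maps of key -> (timestamp, value), keeping the newest write per key."""
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

a = {"k": (5, "from A")}
b = {"k": (9, "from B"), "j": (2, "only on B")}
print(lww_merge(a, b))   # {'k': (9, 'from B'), 'j': (2, 'only on B')} -- A's older write is lost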
This table has been adapted from wonderful lectures on distributed computing by Dr. Martin Kleppmann.
Fundamental Results
In distributed computing, there are many fundamental results that have been reported
by researchers. These fundamental results provide the foundation on which the
distributed computing paradigm stands. Most interesting of these are impossibility
results.
Impossibility Results
Impossibility results provide us with an understanding of whether a problem is solvable and, if it is, the minimum resources required to solve it. If a problem is unsolvable, these results provide a clear understanding of why. Once an impossibility result is proven, no further research is necessary on that problem, and researchers can focus their attention on other problems or on circumventing the result somehow. These results show us that certain problems are unsolvable unless sufficient resources are provided; in other words, they show that certain problems cannot be computed if resources are insufficient. Some problems are outright unsolvable, and some are solvable only if given enough resources. Results that establish the minimum resources required to solve a problem are known as lower bound results.
In order to prove that some problems cannot be solved, it is essential to define a system model and the class of allowable algorithms. Some problems are solvable under one model but not under others. For example, consensus is unsolvable under asynchronous network assumptions but is solvable under synchronous and partially synchronous networks.
One of the most fundamental results in distributed computing is how many nodes/
processes are required to tolerate crash only and Byzantine faults.
Crash Failure
To achieve crash fault tolerance, the tight lower bound is N >= 2F + 1, where F is the number of faulty nodes. This means that a minimum of three processes is required to tolerate one crash failure. Consensus is impossible to solve if N <= 2F in crash fault–tolerant settings.
Byzantine Failure
To achieve Byzantine fault tolerance, the tight lower bound is N >= 3F + 1, where F is the number of faulty nodes. This means that a minimum of four nodes is required to tolerate one node that fails arbitrarily.
No algorithm can solve the consensus problem if N <= 3F, where N is the number of nodes and F is the number of Byzantine nodes. There is a proven tight lower bound of 3F + 1 on the number of processors required to tolerate F faulty processors.
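As a quick worked example of these lower bounds (a sketch, not from the book):

def min_processes(f, byzantine=False):
    """Minimum number of processes required to tolerate f faults."""
    return 3 * f + 1 if byzantine else 2 * f + 1

print(min_processes(1))                  # 3 processes to tolerate one crash fault
print(min_processes(1, byzantine=True))  # 4 processes to tolerate one Byzantine fault
print(min_processes(2, byzantine=True))  # 7 processes to tolerate two Byzantine faults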
Minimum Connectivity
The minimum network connectivity required to tolerate f Byzantine failures is 2f + 1.
Minimum Rounds
The minimum number of rounds required is f + 1, where f is the number of processes that can fail. With one round more than the number of failures, at least one round is guaranteed to be failure-free, which allows consensus to be reached.
FLP Impossibility
The FLP impossibility result states that it is impossible to solve consensus
deterministically in a message-passing asynchronous system in which at most one
process may fail by crashing. In other words, in a system comprising n nodes with
unbounded delays there is no algorithm that can solve consensus. Either there will be
executions in which no agreement is achieved or there will be an execution which does
not terminate (infinite execution).
The key issue on which the FLP impossibility result is based is that in an
asynchronous system, it is impossible to differentiate between a crashed process and a
process that is simply slow or has sent a message on a slow link, and it’s just taking time
to reach the recipient.
FLP is one of the most fundamental unsolvability results in distributed computing. FLP is named after the authors Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, who reported this result in their paper "Impossibility of Distributed Consensus with One Faulty Process," published in 1985.
A configuration of global state C is univalent if all executions starting from C output
the same value, that is, there is only one possible output. The configuration is 0-valent
if it results in deciding 0 and 1-valent if it results in deciding 1. A configuration of global
state C is bivalent if there are two executions starting from C that output different values.
We can visualize this in Figure 3-14.
The key idea behind FLP is that a bivalent configuration can always transition to another bivalent configuration. As there is an initial bivalent configuration, it follows that there is a nonterminating execution that leads only to bivalent configurations.
We can understand this through a scenario. Suppose you have two different sets of nodes, say set A and set B, each with five nodes. In a five-node network with no failures, the majority (i.e., three out of five) in each set will lead to consensus.
in set A the five nodes have a proposed value of 1 {11111}, then we know that in an
honest environment the decision will be value 1 by all nodes. Similarly in set B, if the
initial value is 0 at all nodes {00000}, then all nodes will agree to the value 0 in a fault-free
environment. We can say the configuration, that is, global state, is 1-valent and 0-valent
in set A and set B, respectively. However, now imagine a situation where not all nodes
are 0 or 1, but some are 0 and some have a value of 1. Imagine in set A, three nodes
are 1, and two have a value of 0, that is, {11100}. Similarly in set B, two nodes are 1 and
three nodes are holding value 0, that is, {11000}. Note that these sets now only have one
difference of a single node with value 1 in set A and value 0 in set B, that is, the middle
node (third element) in the set. Consensus of 1 is reached in set A due to three out of five
majority, whereas consensus 0 is reached in set B due to three out of five majority. Let’s
call these two sets, configurations or global states. So now we have two configurations,
one reaching consensus of 1 and the other 0. So far so good, but now imagine that one node fails, and it is the middle node, which is the only difference between these two sets. If the middle node fails in both sets A and B, both become {1100}, which means that the two configurations are now indistinguishable from each other, implying that both can reach a consensus of either 0 or 1 depending on the availability of the third element (the middle node). This also means that either of these configurations can reach both consensus decisions, 0 or 1, depending on the availability of node 3.
Now imagine that the default decision value is 0 when the votes are tied. With the middle node failed (removed), set A {11100} ends up reaching a consensus of 0, whereas it reaches a consensus of 1 if no node fails. This is an ambiguous situation, called a bivalent configuration: a consensus of 0 is reached if the middle node holding value 1 is unavailable, but a consensus of 1 is reached if no node fails. The nodes can thus reach a consensus of either 0 or 1, and the outcome is unpredictable.
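The following toy Python sketch (an illustration of the preceding scenario, not the formal FLP proof) shows how removing the middle node makes the two initial configurations indistinguishable, with ties falling back to the default value 0.

def decide(values, default=0):
    """Majority decision; ties fall back to the default value."""
    ones = sum(values)
    zeros = len(values) - ones
    if ones == zeros:
        return default
    return 1 if ones > zeros else 0

set_a = [1, 1, 1, 0, 0]     # 1-valent if no node fails
set_b = [1, 1, 0, 0, 0]     # 0-valent if no node fails
print(decide(set_a), decide(set_b))              # 1 0

# Crash the middle (third) node in both configurations:
a_crashed = set_a[:2] + set_a[3:]                # [1, 1, 0, 0]
b_crashed = set_b[:2] + set_b[3:]                # [1, 1, 0, 0]
print(a_crashed == b_crashed)                    # True: indistinguishable
print(decide(a_crashed))                         # 0, although set A decides 1 with no failure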
It has been proven that this ambiguous situation of a bivalent initial configuration always exists even in the case of a single failure and, secondly, that it can always lead to another ambiguous situation. In other words, an initial bivalent configuration can always transition to another bivalent configuration; hence the impossibility of consensus, as convergence to a univalent (either 0-valent or 1-valent) configuration cannot be guaranteed.
There are two observations which lead to FLP impossibility results. First, there
always exists at least one bivalent initial configuration in any consensus algorithm
working in the presence of faults. Second, a bivalent configuration can always transition
to another bivalent configuration.
The FLP result concludes that in asynchronous systems there is always a global state (configuration) in which the algorithm cannot decide; that is, there will always be a scenario in which the system is inconclusive. In other words, there is always an admissible run that remains in an indecisive state under asynchrony.
State machine replication under asynchrony is also prone to the FLP impossibility limitation. Blockchain networks are also subject to the FLP impossibility result. Bitcoin, Ethereum, and other blockchain networks would not have been possible to build if FLP impossibility hadn't been circumvented by introducing some level of synchrony.
Many efforts have been proposed to somehow circumvent the FLP impossibility.
This circumvention revolves around the use of oracles. The idea is to make an oracle
available to the distributed system to help solve a problem. An oracle can be defined
as a service or a black box that processes (nodes) can query to get some information
to help them decide a course of action. In the following, we introduce some common
oracles that provide enough information to distributed algorithms to solve a problem,
which might be unsolvable otherwise. We can use oracles to facilitate solving consensus
problems in distributed systems.
The key ideas behind the circumvention of FLP impossibility are based around
sacrificing asynchrony and determinism. Of course, as we have learned, deterministic
consensus is not possible under asynchrony even if one process crash-fails; therefore,
the trick is to slightly sacrifice either asynchrony or determinism, just enough to reach a decision and terminate. The common approaches are as follows:
• Random oracles
• Failure detectors
• Synchrony assumptions
• Hybrid models
Synchrony Assumptions
Under this approach, timing assumptions are introduced into the model. Remember, we discussed partial synchrony earlier in this chapter and in the first chapter. Partial synchrony is a technique that allows solving consensus by circumventing FLP impossibility. Under the partial synchrony model, asynchrony is somewhat forfeited to introduce some timing assumptions that allow the consensus problem to be solved. Similarly, under the eventual synchrony model, the system is assumed to become synchronous after an unknown time called the global stabilization time (GST). Another timing assumption is weak synchrony, which assumes that delays remain under a certain threshold and do not grow forever. Such timing assumptions allow a consensus algorithm to decide and terminate by assuming some notion of time (synchrony).
Random Oracles
Random oracles allow for the development of randomized algorithms. This is where
determinism is somewhat sacrificed in favor of reaching an agreement probabilistically.
The advantage of this approach is that there are no assumptions made about timing,
but the downside is that randomized algorithms are not very efficient. In randomized
consensus algorithms, one of the safety or liveness properties is changed to a
nondeterministic probabilistic version. For example, the liveness (termination) property becomes probabilistic: every correct process eventually decides with probability 1.
This addresses FLP impossibility in the sense that FLP impossibility means, in practice, that there are executions of the consensus algorithm that do not terminate. If termination is made probabilistic, the impossibility can be "circumvented": the probability that a correct process has decided after r rounds of the protocol tends to 1 as r tends to infinity.
Hybrid Models
In a hybrid model approach to circumvent FLP impossibility, a combination of
randomization and failure detectors is used.
Wormholes are extensions in a system model with stronger properties as compared
to other parts of the system. Usually, it is a secure, tamper-proof, and fail-silent trusted
hardware which provides a way for processes to correctly execute some crucial steps
of the protocol. Various wormholes have been introduced in the literature such as
attested append-only memory, which forces replicas to commit to a verifiable sequence
of operations. The trusted timely computing base (TTCB) was the first wormhole introduced to support consensus.
Failure Detectors
The intuition behind failure detectors is that if somehow we can get a hint about the
failure of a process, then we can circumvent FLP impossibility. Remember the FLP
impossibility result suggests that it is impossible to distinguish between a crashed
process and simply a very slow one. There is no way to find out, so if somehow we can
get an indication that some process has failed, then it would be easier to handle the
situation. In this setting, asynchrony is somewhat sacrificed because failure detectors
work based on heartbeats and timeout assumptions. Failure detectors are added as an
extension to the asynchronous systems.
A failure detector can be defined as a distributed oracle at each process that gives
hints about (suspects) whether a process is alive or has crashed. In a way, failure
detectors encapsulate timeout and partial synchrony assumptions as a separate module.
There are two categories of properties that define failure detectors: completeness and accuracy.
Based on the preceding two properties, eight classes of failure detectors have been
proposed by Chandra and Toueg in their seminal paper “Unreliable Failure Detectors
for Reliable Distributed Systems.” It is also possible to solve consensus by introducing a
weak unreliable failure detector. This work was also proposed by Chandra and Toueg.
The ability of a failure detector to accurately suspect failure or liveness depends
on the system model. A failure detector is usually implemented using a heartbeat
mechanism where heartbeat messages are exchanged between processes, and if
these messages are not received by some processes for some time, then failure can be
suspected. Another method is to implement a timeout mechanism which is based on
worst-case message round-trip time. If a message is not received by a process in the
expected timeframe, then timeout occurs, and the process is suspected failed. After this,
if a message is received from the suspected process, then the timeout value is increased,
and the process is no longer suspected failed. A failure detector using a heartbeat
mechanism is shown in Figure 3-15.
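A minimal Python sketch of such a heartbeat/timeout-based failure detector is shown in the following (an illustration, not a production implementation); note how a heartbeat from a suspected process removes the suspicion and increases the timeout, as described earlier.

import time

class HeartbeatFailureDetector:
    """Suspects a process when no heartbeat has arrived within the timeout."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}        # process id -> time of last heartbeat
        self.suspected = set()

    def heartbeat(self, process, now=None):
        now = time.time() if now is None else now
        self.last_seen[process] = now
        if process in self.suspected:
            self.suspected.discard(process)   # false suspicion detected
            self.timeout *= 2                 # be more conservative next time

    def check(self, now=None):
        now = time.time() if now is None else now
        for process, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.suspected.add(process)
        return self.suspected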
With a failure detector available, the blocking instruction "wait for message m from process p" becomes "wait for message m from process p or suspect p of failure." Now you can see that the blocking program becomes nonblocking, and there is no infinite waiting; if p is suspected, it is added to the suspected list, and the program continues its operation, whatever that might be.
Now let’s look at the properties of strong and weak completeness and accuracy.
Strong Completeness
This property requires that eventually every crashed process is permanently suspected
by every correct process.
Weak Completeness
The property requires that eventually each crashed process is permanently suspected by
some correct process.
Strong Accuracy
This property denotes that no process is suspected by any correct process before it crashes.
Weak Accuracy
This property describes that some correct process is never suspected by any correct
process.
Eventual Strong Accuracy
This property suggests that after some time, correct processes do not suspect any correct process any longer.
Eventual Weak Accuracy
This property implies that after some time, some correct process is no longer suspected by any correct process.
We can visualize strong and weak completeness in the diagram shown in Figure 3-16.
Now we discuss eight classes of failure detectors. There are four classes of failure
detectors which provide strong completeness. The first two failure detectors work under
synchronous systems, namely, perfect detector P and strong detector S. The other two work
under partially synchronous models, namely, eventually perfect detector (diamond P) and
eventually strong detector (diamond S).
We describe these classes now, first those with strong completeness.
Perfect Detector (P)
This type of failure detector satisfies the strong completeness and strong accuracy properties. P cannot be implemented in asynchronous systems because strong completeness and strong accuracy cannot both be achieved there.
Eventually Perfect Detector (Diamond P)
This class of FDs satisfies strong completeness and eventual strong accuracy.
Eventually Strong Detector (Diamond S)
This class of FDs satisfies strong completeness and eventual weak accuracy.
There are also four classes of failure detectors which provide weak completeness.
Detector Q and weak detector W work under synchronous models. Two other detectors,
eventually detector Q (diamond Q) and eventually weak detector (diamond W), work
under partial synchrony assumptions.
We describe these as follows.
Eventually Weak Detector (Diamond W)
This type satisfies the weak completeness and eventual weak accuracy properties.
Detector Q
This type satisfies the weak completeness and strong accuracy properties.
The properties of failure detectors fundamentally revolve around the idea of how
fast and correctly a failure detector detects faults while avoiding false positives. A perfect
failure detector will always correctly detect failed processes, whereas a weak failure
detector may only be able to detect very few or almost no faults accurately.
Quorums
A quorum can be defined as any set comprising a majority of processes. The concept is related to voting among a set of objects. Quorum systems are important for ensuring consistency, availability, efficiency, and fault tolerance in replicated systems.
A quorum can also be thought of as the minimum number of processes (votes) required to decide on an operation in a distributed system. A quorum-based methodology ensures consistency in a distributed system. We just learned in the "Replication" section that replication allows us to build a fault-tolerant, consistent distributed system. Here, the question arises of how many replicas are required to decide whether to finally commit an update or abort it.
Mathematically, a quorum is defined as follows:
A quorum system Q over a set of processes P is a collection of subsets (quorums) of P such that any two quorums intersect: for all Q1, Q2 in Q, Q1 ∩ Q2 ≠ ∅.
This means that any two quorums must intersect in one or more processes; this follows from the pigeonhole principle. It is called the consistency (intersection) property.
There must always be at least one quorum available that is not failed. This is the
quorum availability property.
Quorum systems are usually used in scenarios where a process, after broadcasting
its request, awaits until it has received a response from all processes that belong to a
quorum. This way, we can address the consistency requirements of a problem. Quorums
are usually used to achieve crash and Byzantine fault tolerance. In consensus algorithms,
for example, a certain size of a quorum is needed to guarantee safety and liveness. In
other words, algorithms based on quorums satisfy safety and liveness only if a quorum of
correct processes can be established.
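The following short Python sketch (not from the book) illustrates the majority-quorum argument: two quorums of size ⌊n/2⌋ + 1 drawn from n processes must overlap in at least one process.

def majority_quorum_size(n):
    return n // 2 + 1

def min_overlap(n):
    q = majority_quorum_size(n)
    return 2 * q - n        # pigeonhole: minimum size of the intersection of two quorums

for n in (3, 4, 5, 7):
    print(n, majority_quorum_size(n), min_overlap(n))   # the overlap is always >= 1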
Byzantine Quorums
Byzantine failures are difficult to handle. Imagine if there are N nodes, out of which
f number of nodes turn Byzantine. Now these f nodes can behave arbitrarily, and there
can be a case where they can vote in favor of a value and against it. They can make
different statements to different nodes on purpose. Such a situation can cause even correct nodes to have divergent states and can also lead to deadlocks.
A Byzantine quorum that can tolerate f faults has more than (n + f)/2 processes.
There is always an intersection of at least one correct process between two Byzantine
fault–tolerating quorums. The progress is guaranteed in Byzantine settings if N > 3f. In
other words, Byzantine fault tolerance requires that f < n/3.
For example, with n = 7 and f = 1:

(n + f) / 2 = (7 + 1) / 2 = 4

so each quorum must contain at least ⌈(n + f + 1) / 2⌉ = ⌈(7 + 1 + 1) / 2⌉ = 5 processes.

Each Byzantine quorum contains more than (n − f) / 2 honest processes; here (7 − 1) / 2 = 3, so each quorum contains at least four honest processes. Moreover, any two quorums intersect in more than f processes, so there is at least one correct process in the intersection of two Byzantine quorums.
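The same arithmetic can be checked programmatically; the following Python sketch (illustrative only) computes the Byzantine quorum size and the guaranteed overlap for a given n and f.

def byzantine_quorum_size(n, f):
    """Smallest integer strictly greater than (n + f) / 2."""
    return (n + f) // 2 + 1

def quorum_guarantees(n, f):
    assert n > 3 * f, "Byzantine fault tolerance requires n > 3f"
    q = byzantine_quorum_size(n, f)
    overlap = 2 * q - n               # minimum intersection of two quorums
    honest_in_overlap = overlap - f   # at least this many correct processes in the overlap
    return q, overlap, honest_in_overlap

print(quorum_guarantees(7, 1))    # (5, 3, 2)
print(quorum_guarantees(4, 1))    # (3, 2, 1)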
Classical Consensus
Classical consensus or traditional distributed consensus has been a topic of research
for around 40 years now. Starting with the SIFT project and Lamport’s and many other
researchers’ contributions, we now have a large body of work that deals with the classical
distributed consensus. Protocols such as Paxos, PBFT, and Raft are now the norm for implementation in various practical systems.
Summary
In this chapter, we covered the main concepts of agreement, broadcast, replication, and
consensus:
• The last half century of research has produced two main classes of consensus, that is, classical permissioned consensus and Nakamoto permissionless consensus.
In the next chapter, we will cover blockchain and describe what it is and how we can
see it in the light of what we have learned so far in this book.
Bibliography
1. Chandra, T.D. and Toueg, S., 1996. Unreliable failure detectors for
reliable distributed systems. Journal of the ACM (JACM), 43(2),
pp. 225–267.
3. Pease, M., Shostak, R., and Lamport, L., 1980. Reaching agreement
in the presence of faults. Journal of the ACM (JACM), 27(2),
pp. 228–234.
CHAPTER 4
Blockchain
In this chapter, we’ll learn what a blockchain is and its various elements and see the
blockchain through the lens of distributed computing. Also, we will present formal
definitions and properties of the blockchain. In addition, we will introduce Bitcoin and Ethereum. Finally, I will introduce some blockchain use cases.
Blockchains are fascinating because they touch many disciplines, including
distributed computing, networking, cryptography, economics, game theory,
programming languages, and computer science.
Blockchains are appealing to people from so many different areas, including but not
limited to the subjects mentioned earlier. With use cases in almost every walk of life,
blockchains have captured the public’s imagination and, indeed, many academics and
industry professionals.
The blockchain emerged in 2008 with Bitcoin, a peer-to-peer, decentralized,
electronic cash scheme that does not need any trusted third party to provide trust
guarantees associated with money.
What Is Blockchain
There are many definitions of a blockchain on the Internet and many different books.
While all those definitions are correct, and some are excellent, I will try to define the
blockchain in my own words.
First, we’ll define it from a layman’s perspective and then from a purely technical
standpoint.
Layman’s Definition
A blockchain is a shared record-keeping system where each participant keeps a copy
of the chronologically ordered records. Participants can add new records only if they
collectively agree to do so.
Technical Definition
A blockchain is a peer-to-peer, cryptographically secure, append-only, immutable, and
tamper-resistant shared distributed ledger composed of temporally ordered and publicly
verifiable transactions. Users can only add new records (transactions and blocks) in a
blockchain through consensus among peers on the network.
Background
The origins of the blockchain can be found in early systems developed for the digital
timestamping of documents. Also, the long-standing problem of creating secure
electronic cash with desirable features such as anonymity and accountability has
inspired blockchain development.
Some of the key ideas that contributed to the development of the blockchain are
discussed as follows.
Two fundamental issues need to be dealt with to create practical digital cash: accountability and prevention of double-spending. The question is how to resolve these accountability and double-spend issues. The schemes described in the following tried to address these issues and managed to achieve these properties; however, their usability was poor, and they relied on trusted third parties.
In Wei Dai's b-money proposal, the idea of participants depositing funds to be used as potential fines or rewards is very close to the concept that we know as proof of stake today. Similarly, the
idea of solving a previously unsolved computational problem is what we know today as
proof of work.
Another electronic cash proposal is Bitgold, introduced by Nick Szabo. Bitgold can be seen as a direct precursor of Bitcoin. The Bitgold proposal emphasized no dependence on trusted third parties and used proof of work by solving a "challenge string."
On the other hand, progress and development in cryptography and computer
technology generally resulted in several advances and innovative applications. Some of
these advances related to the blockchain are digital timestamping of documents, email
spam protection, and reusable proof of work.
The work on timestamping of digital documents to create an ordered chain of
documents (hashes) by using a timestamping service was first proposed by Haber and
Stornetta. This idea is closely related to the chain of blocks in a blockchain. However, the
timestamping service is centralized and needs to be trusted.
The origins of proof of work based on hash functions used in Bitcoin can be found
in previous work by Dwork and Naor to use proof of work to thwart email spam. Adam
Back invented the Hashcash proof of work scheme for email spam control. Moreover,
Hal Finney introduced reusable proof of work for token money, which used Hashcash to
mint a new PoW token.
Another technology that contributed to the development of Bitcoin is cryptography.
Cryptographic primitives and tools like hash functions, Merkle trees, and public
key cryptography all played a vital role in the development of Bitcoin. We covered
cryptography in Chapter 2 in detail.
Figure 4-1 illustrates this fusion of different techniques.
Benefits of Blockchain
Multiple benefits of blockchain technology are envisaged, and a lot has been
accomplished since the invention of Bitcoin. Especially with the advent of Ethereum,
a programmable platform is available where smart contracts can implement any logic,
which resulted in increased utility and paved the path for further adoption. Today, one
of the most talked-about applications of the blockchain, decentralized finance, or DeFi
for short, is seen as a significant disruptor of the current financial system. Non-fungible
tokens (NFTs) are another application that has gained explosive popularity. NFTs on the
blockchain enable tokenization of assets. Currently, there is almost 60 billion USD worth of value locked in the DeFi ecosystem. This huge investment is a testament to the fact that the blockchain has now become part of our economy. You can track this metric at https://defipulse.com/.
Now I list some of the most prominent benefits of the blockchain:
• Cost saving
• Due to the streamlining of processes, transparency, and a single data-sharing platform that comes with security guarantees, the blockchain can result in cost savings. Also, there is no need to create a separate secure infrastructure; users can use an already existing secure blockchain network with an entry-level computer running the blockchain software client.
• Transparency
• Auditability
• Security
• Supply chain
• Government
• Medicine/health
• Finance
• Trading
• Identity
• Insurance
Types of Blockchain
There are several types of blockchains. The original blockchain introduced with Bitcoin
is a public blockchain:
• Permissioned blockchain
• Private blockchain
• Consortium or enterprise blockchain
• Application-specific blockchain
• Heterogeneous multichain
Blockchains are also shared data platforms where multiple organizations can share
data in a tamper-resistant manner, ensuring data integrity. However, this sharing can
be achieved if there is only one standard blockchain, but many different blockchains
have emerged since the advent of Ethereum. This variety has resulted in a problem
where one blockchain runs a different protocol and cannot share its data with another
blockchain. This disconnect creates a situation where each blockchain is a silo. To
solve this issue, the organization that wants to join a consortium network must either
use the software client specific to that blockchain or somehow devise a complex
interoperability mechanism. This problem is well understood, and a lot of work is
underway regarding novel interoperability protocols. Also, new types of blockchains,
such as heterogeneous multichains and sharding-based approaches, are emerging.
One prime example is Polkadot, a replicated sharded state machine where heterogeneous chains can talk to each other through a so-called relay chain. Another effort is Ethereum
2.0, where sharded chains serve as a mechanism to provide scalability and cross-shard
interoperability. Cardano is another blockchain that is on course to provide scalability
and interoperability between chains. With all these platforms and the pace of the work
in progress to realize these ideas, we can envisage that in the next eight to ten years,
these blockchains and others will be running just like we have the Internet today with
seamless data sharing between different chains. The chains that facilitate such a level of
natural interoperability, giving rise to an ecosystem of multiple interoperating, general-
purpose enterprise chains and ASBCs, are called heterogeneous multichains.
Now let’s clarify an ambiguity. You might have heard the term “distributed ledger,”
which is sometimes used to represent a blockchain. While both terms, blockchain and
distributed ledger, are used interchangeably, there’s a difference. The distributed ledger
is an overarching term describing a ledger with distributed properties. Blockchains fall
under this umbrella. A blockchain is a distributed ledger, but not all distributed ledgers
are a blockchain. For example, some distributed ledgers do not use blocks composed of transactions in their ledger construction. Instead, they treat transaction records individually and store them as such. Usually, however, in most distributed ledgers,
blocks are used as containers for a batch of transactions and several other elements such
as block header, which also contains several components. Such use of blocks to bundle
transactions makes them a blockchain.
Properties
There are several properties associated with a blockchain, which are described as
follows.
Consistency
• All replicas hold the same up-to-date copy of the data. In the case of
public blockchains, it is usually eventual consistency, and in the case
of permissioned blockchains, it is strong consistency.
Fault Tolerant
• Blockchains are fault-tolerant distributed systems. A blockchain
network can withstand Byzantine or crash faults up to a threshold.
Finality
• Finality occurs when a transaction is considered irrevocable
and permanent. This event can be a certain number of blocks, a
time interval, or a step (phase) in the execution of a consensus
algorithm. For example, in Bitcoin it is usually six blocks after
which a transaction is considered irrevocable, and in permissioned
blockchains using BFT protocols, the moment the transaction is
committed, it is considered irrevocably final.
Immutability
• Blockchains are immutable, which means that once a record has
made it to the ledger, it can never be removed.
Append Only
• New records can only be appended to the blockchain. New records
cannot be inserted in between previously existing records. For
example, a new block can only be added after the last final block, not
in between other blocks.
Tamper Resistant/Proof
• It is practically impossible to remove or rearrange finalized blocks in
a blockchain.
Validity
• Only valid transactions and blocks are appended to the blockchain.
Order
• If a block x happens before block y and block y happens before block z, then block x happens before block z; the happens-before relation over blocks is transitive.
• It is an ordered ledger.
Verifiable
• All transactions and blocks in a blockchain are verifiable and adhere
to a validity predicate specific to the blockchain. Anyone can verify
the validity of a transaction.
Anatomy of a Blockchain
A blockchain is composed of blocks, where each block is linked to its previous block
except the first, the genesis block. The term blockchain was used by Satoshi Nakamoto in his Bitcoin code for the first time. Even though it is now used as one word, in his original Bitcoin code it was written as two separate words, "block chain." It can be visualized as a
chain of blocks, as shown in Figure 4-4.
Other structures such as DAGs, hash graphs, and Merkle trees are now used in some
distributed ledgers instead of the usual block-based model in modern blockchains. For
example, Avalanche uses DAGs for storage instead of a linear block-based structure.
We will cover these in detail when we discuss consensus protocols specific to these
blockchains (distributed ledgers) in Chapter 8.
Block
A block consists of a block header and transactions. A block header is composed of
several fields. A generic depiction is shown in Figure 4-5.
Platforms
In this section, we will describe two major blockchain platforms, Bitcoin and Ethereum.
Bitcoin
Bitcoin was invented in 2008 as the first blockchain by Satoshi Nakamoto. However, this is believed to be a pseudonym, as the identity of Satoshi Nakamoto is shrouded in mystery. After the introduction of Bitcoin, Satoshi remained active for some time but then left the community abruptly and has not been heard from since.
We discussed the prehistory and attempts to create digital cash and document
timestamping system before in this chapter. In this section, I will jump straight into
technical details.
Bitcoin is a peer-to-peer electronic cash system that solved the double-spending
problem without requiring a trusted third party. Furthermore, Bitcoin has this fantastic
property called “inclusive accountability,” which means that anyone on the Bitcoin
network can verify claims of possession of electronic cash, that is, the Bitcoin. This
property makes Bitcoin a transparent and verifiable electronic cash system.
The Bitcoin network is composed of nodes. There are three types of nodes in a
Bitcoin network: miner nodes, full nodes, and light nodes. Miner nodes perform mining
and keep a full copy of the chain. Bitcoin is a loosely coupled network composed of
nodes. All nodes communicate with each other using a peer-to-peer gossip protocol.
There are primarily three different types of nodes in the Bitcoin network. Full nodes
keep the entire history of the blockchain. Miner nodes keep the whole history and
participate in mining to add new blocks to the blockchain. Finally, light nodes do not
keep a copy of the entire blockchain. Instead, they only download the block headers
and use a method called simple payment verification to validate the authenticity of the
transactions. The Bitcoin node architecture is shown in Figure 4-6.
When a node starts up, it discovers other nodes using a process called node
discovery. In this process, the node first connects to the seed nodes, which are trusted
bootstrap nodes maintained by the core developers. After this initial connection,
further connections are made. At one point, there are x connections alive with other
peers. There is also spam protection built into the Bitcoin protocol, where a points-based reputation system scores nodes based on their behavior. If a node sends excessive messages to another node, its reputation score goes above a threshold of 100 points, and it gets blocked for 24 hours. The node discovery
and handshake between nodes rely on several protocol messages. A list is shown in the
following with their explanations. In Figure 4-7, you can visualize how node handshake
and message exchange occurs.
Some of the most used protocol messages and an explanation of them are listed as
follows:
• Version: This is the first message that a node sends out to the network,
advertising its version and block count. The remote node then replies
with the same information, and the connection is then established.
• Getblocks: This requests an inv packet containing the hashes of blocks starting after the last known hash, up to 500 blocks.
Figure 4-7. Node discovery and handshake diagram + header and block
synchronization
Cryptography in Bitcoin
Cryptography plays a vital role in the Bitcoin blockchain. The entire security of
the Bitcoin blockchain is indeed based on cryptography. Although we discussed
cryptography in Chapter 2, I will now describe which cryptographic protocols are used
in Bitcoin and how.
Wallets in Bitcoin are used to store cryptographic keys. Wallets sign transactions using private keys. A private key is generated by the wallet by randomly choosing a 256-bit number. The standard wallet included in the Bitcoin client is called a nondeterministic wallet.
Addresses and Accounts
Users are represented by addresses in Bitcoin. The Bitcoin address generation process is shown in Figure 4-8.
8. The first 4 bytes of the result produced from step 7 form the address checksum.
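For illustration, the following Python sketch reproduces the well-known hash-and-encode steps of P2PKH address generation (a simplified sketch, not the exact steps of Figure 4-8; it assumes your OpenSSL build exposes RIPEMD-160 through hashlib).

import hashlib

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check_encode(payload: bytes) -> str:
    """Append a 4-byte double-SHA-256 checksum and encode in Base58."""
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    data = payload + checksum
    num = int.from_bytes(data, "big")
    encoded = ""
    while num > 0:
        num, rem = divmod(num, 58)
        encoded = BASE58_ALPHABET[rem] + encoded
    for byte in data:                 # preserve leading zero bytes as '1'
        if byte != 0:
            break
        encoded = "1" + encoded
    return encoded

def p2pkh_address(public_key: bytes) -> str:
    sha = hashlib.sha256(public_key).digest()
    ripemd = hashlib.new("ripemd160", sha).digest()     # requires RIPEMD-160 support
    return base58check_encode(b"\x00" + ripemd)         # 0x00 = mainnet P2PKH version byte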
At a high level, a Bitcoin transaction flows through the network as follows:
• All nodes verify the transaction and place it in their transaction pools.
• Mining starts and one of the miners that solves the proof of work
problem wins the right to announce its block and earn bitcoins as
a reward.
A transaction is made up of several fields. Table 4-1 shows all fields and their
description.
Transactions are of two types. On-chain transactions are native to the Bitcoin
network, and off-chain transactions are performed outside the blockchain network.
On-chain transactions occur on the blockchain network and are validated on-chain
by network participants, whereas off-chain transactions use payment channels or other mechanisms that operate outside the main chain.
Bitcoin locks and unlocks transaction outputs using a simple stack-based scripting language. Some commonly used script opcodes are described as follows:
• OP_DUP: Takes the top item on the stack and duplicates it.
• OP_EQUAL: Checks the equality of the top two items on the stack.
Outputs TRUE if equal, otherwise FALSE, on the stack.
• OP_VERIFY: Checks if the top item on the stack is false; if it is, the
script terminates and outputs failure.
There are several types of scripts in Bitcoin. The most common is Pay-to-Public-Key-Hash (P2PKH), which is used to send a transaction to a bitcoin address. The format of this script is shown as follows:

OP_DUP OP_HASH160 <PubKeyHash> OP_EQUALVERIFY OP_CHECKSIG
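To show how this locking script is evaluated against an unlocking script of the form <signature> <public key>, here is a small Python sketch of the stack execution (the signature check itself is delegated to a hypothetical check_sig callback, since real verification is ECDSA over the transaction digest).

import hashlib

def hash160(data: bytes) -> bytes:
    return hashlib.new("ripemd160", hashlib.sha256(data).digest()).digest()

def run_p2pkh(signature, public_key, pubkey_hash, check_sig):
    """Evaluate <sig> <pubKey> OP_DUP OP_HASH160 <pubkey_hash> OP_EQUALVERIFY OP_CHECKSIG."""
    stack = [signature, public_key]       # the unlocking script pushes sig and pubKey
    stack.append(stack[-1])               # OP_DUP
    stack.append(hash160(stack.pop()))    # OP_HASH160
    stack.append(pubkey_hash)             # push <pubkey_hash> from the locking script
    if stack.pop() != stack.pop():        # OP_EQUALVERIFY
        return False
    pubkey, sig = stack.pop(), stack.pop()
    return check_sig(pubkey, sig)         # OP_CHECKSIG (ECDSA verification, stubbed here)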
While the bitcoin script is the original method of encoding payment conditions, and it works well, it is not very flexible. Ivy is a higher-level language developed for bitcoin that supports the development of smart contracts. Another solution that makes writing scripts easier and more structured is bitcoin miniscript.
Blocks and Blockchain
A blockchain is composed of blocks. Blocks are composed of a block header and
transactions. A block header consists of several fields. The first block in the Bitcoin
blockchain is called the genesis block, which doesn’t link back to any block, being the
first block. It is usually hardcoded in the software clients.
We can see a complete visualization of blocks, block headers, transactions, and
scripts in Figure 4-12.
Mining
Mining is the process by which new coins are added to the Bitcoin blockchain. This
process secures the network and incentivizes the users who spend resources to protect
the network. More details on the specifics are in Chapter 5; however, now I will touch
upon the mining hardware. When Bitcoin was introduced, it was easy to mine with CPUs; as the difficulty quickly increased, miners moved to GPUs. Shortly after the successful adoption of GPUs, FPGAs emerged as a mechanism to further speed up SHA-256 hashing. Soon, these were outperformed by ASICs, and now ASICs are the prevalent mechanism for mining Bitcoin. Solo mining, where individual users use their own mining hardware to mine, is no longer very profitable due to the exorbitant mining difficulty. Instead, mining farms comprising thousands of ASICs are commonly used now. Also, mining pools are common, where multiple users collectively solve the hash puzzle and earn rewards proportionate to their contribution.
Bitcoin As a Platform
Other than electronic cash, Bitcoin as a platform can be used for several use cases. For
example, it can be used as a timestamping service or a general ledger to store some
information permanently. In addition, we can use the OP_RETURN instruction to store
data, which can store up to 80 bytes of arbitrary data. Other use cases such as smart
property, smart assets, and blocks as a source of randomness also emerged.
The desire to use Bitcoin for different purposes also resulted in techniques to enhance Bitcoin, giving rise to projects such as colored coins, Rootstock, Omni Layer, and Counterparty. While Bitcoin did what it intended to do and a lot more in the form of the innovations mentioned earlier, the fundamental limitations of the Bitcoin protocol meant that all flexible new protocols would have to be built on top of Bitcoin. There is no inherent flexibility in Bitcoin to perform all these different tasks. Therefore, there was a need to do more than just cryptocurrency on a blockchain. This ambition motivated the invention of Ethereum, the first general-purpose blockchain platform that supported
smart contracts.
Ethereum
Ethereum was introduced in 2014 in a whitepaper by Vitalik Buterin. Ethereum
introduced a platform on which users can run arbitrary code in the form of smart
contracts. To thwart the denial-of-service attacks caused by infinite loops in code, the
concept of metered execution was also introduced. Metered executions require that
for every operation performed on the blockchain, a fee is charged, which is paid in
Ether, the native currency of the Ethereum blockchain. With smart contracts, Ethereum
opened a whole new world of generic platforms where the operations are no longer
limited to only bitcoin-style value transfer transactions, but users can execute any type of
diverse business logic on-chain due to Ethereum’s Turing complete design. Ethereum is
currently the most used blockchain platform for smart contracts.
Today's Internet, called Web 2, is centralized and dominated by large companies. Ethereum was developed with the vision of Web3, where anyone can participate in the network without relying on a third party. In the Web 2 model, big service providers provide services in return for personal data; in Web3, anyone can participate without giving up personal information in exchange for services. Moreover, with decentralized applications (DApps), anyone can provide a service that any user on the network can use, and no one can block access to that service.
Ethereum Network
An Ethereum network is composed of loosely coupled nodes which exchange messages
via a gossip protocol.
A high-level visualization of the Ethereum network is shown in Figure 4-13.
There are several types of nodes in an Ethereum network:
• Full node
• Light node
• Archive node
Full nodes store the entire chain data and validate blocks, transactions, and states. Light nodes only store the block headers and verify data against the state roots present in the block headers. Light nodes are suitable for resource-constrained devices, such as mobile devices. Archive nodes include everything that is in a full node but also build an archive of historical states. Miner nodes are full nodes that also perform the mining operation and participate in the proof of work consensus.
A new Ethereum node joining the network uses hardcoded bootstrap nodes
as an initial entry point into the network from where the further discovery of other
nodes begins.
RLPx is a TCP-based transport protocol. It enables secure communication between
Ethereum nodes by using the Elliptic Curve Integrated Encryption Scheme (ECIES) for
handshaking and key exchange.
DEVP2P or the wire protocol negotiates an application session between two
Ethereum nodes that have been discovered and have established a secure channel
using RLPx.
After discovering and establishing a secure transport channel and negotiating an
application session, the nodes exchange messages using “capability protocols,” for
example, eth (versions 62, 63, and 64), Light Ethereum Subprotocol (LES), Whisper,
and Swarm. These capability protocols or application subprotocols enable different
application-level communications, for example, eth for block synchronization.
The node discovery protocol and other relevant protocols are shown in Figure 4-15.
Cryptography in Ethereum
Like any other blockchain, Ethereum's security relies on cryptography. Ethereum uses cryptography throughout the blockchain and node design.
Accounts and Addresses
The Bitcoin model is based on transactions, whereas Ethereum is based on accounts.
Accounts are part of the Ethereum state and keep an intrinsic balance and transaction
count. 160-bit long addresses identify accounts. An account is how a user interacts with
the blockchain. A transaction signed by an account is verified and broadcast to the
network, which results in a state transition on the blockchain once executed. There are
two types of accounts, contract accounts (CAs) and externally owned accounts (EOAs).
EOAs are associated with a human user, whereas CAs have no intrinsic association
with a user.
A world state is a mapping between addresses and account states. An account state
consists of the fields shown in Table 4-2.
Nonce Number of transactions originated from an address or, in the case of smart contracts,
the number of contracts created by an account
Balance Number of Wei owned by this address
StorageRoot 256-bit hash of the root node of the Merkle Patricia trie, which encodes the storage
contents of the account
codeHash Hash of the associated EVM code (bytecode)
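As a rough Python sketch (field names are illustrative, following the table above), the world state can be pictured as a mapping from 160-bit addresses to account state records:

from dataclasses import dataclass

@dataclass
class AccountState:
    nonce: int = 0                # transactions (or contracts) originated by this account
    balance: int = 0              # balance in Wei
    storage_root: bytes = b""     # root hash of the account's storage trie
    code_hash: bytes = b""        # hash of the EVM bytecode (empty code hash for EOAs)

# The world state maps 20-byte (160-bit) addresses to account states.
world_state: dict[bytes, AccountState] = {}
world_state[bytes(20)] = AccountState(nonce=0, balance=10**18)   # 1 Ether in Wei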
Transactions and Executions
Transactions in Ethereum are signed instructions which once executed result in a
message call or contract creation (new account with associated code) on the blockchain.
Fundamentally, there are two types of transactions, message call and contract creation,
but over time for easier understanding, three types are now usually defined:
• Regular transactions: value transfer from one account to another
• Contract creation (deployment) transactions: transactions with no recipient, whose data field contains the contract bytecode
• Contract execution transactions: message calls that invoke a function of an already deployed smart contract
Blocks and Blockchain
Blocks in Ethereum are composed of a block header and transactions. A blockchain
consists of blocks, which contain transactions.
Like any other blockchain, blocks are the main building blocks of Ethereum. An
Ethereum block consists of the block header, the list of transactions, and the list of
ommer block headers. A block header also consists of several elements. All these
elements in a block are shown in Tables 4-4 and 4-5 with a description.
Parent hash Hash Keccak 256-bit hash of the parent block's header
Ommers hash Hash Keccak 256-bit hash of the list of ommers
Beneficiary Address 160-bit recipient address for the mining reward
State root Hash Keccak 256-bit hash of the root node of the state trie
Transaction root Hash Keccak 256-bit hash of the root node of the transaction trie
Receipts root Hash Keccak 256-bit hash of the root node of the transaction receipts trie, which contains the receipts of all transactions included in the block
Logs bloom Variable Bloom filter composed of the logger address and log topics
Difficulty Integer Difficulty level of the current block
Number Integer Total number of all previous blocks
Gas limit Integer Limit set on the gas consumption per block
Gas used Integer Total gas consumed by all transactions included in the block
Timestamp Integer Unix epoch timestamp
Extra Variable An optional free field for storing extra data
MixHash Integer Computational effort proof
Nonce Integer Combined with MixHash to prove computational effort
baseFeePerGas Integer (Post EIP-1559) Records the protocol-calculated base fee per gas required for a transaction to be included in the block
Ethereum uses a new data structure called Merkle Patricia trie to store and organize
transactions and relevant data. It is a combination of Patricia and Merkle trees with novel
properties.
There are four tries used in Ethereum to organize data such as transactions, state,
receipts, and contract storage.
Transaction Trie
Each Ethereum block contains the root of a transaction trie, which is composed of
transactions.
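To convey the idea of committing to a list of transactions with a single root hash, the following is a plain binary Merkle root sketch in Python; note that Ethereum itself uses a Merkle Patricia trie (and Keccak-256), not this simplified construction.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()      # stand-in for Keccak-256

def merkle_root(transactions: list[bytes]) -> bytes:
    if not transactions:
        return h(b"")
    level = [h(tx) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])            # duplicate the last node on odd-sized levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Changing any transaction changes the root, so the root in the block header
# commits to the whole transaction list.
print(merkle_root([b"tx1", b"tx2", b"tx3"]).hex())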
Transactions within the blocks are executed using the Ethereum virtual machine,
which we describe next.
Mining in Ethereum
In contrast with Bitcoin, mining in Ethereum is ASIC (application-specific integrated
circuit) resistant.
ASIC-based, special-purpose, efficient, and extremely fast hardware is built for
performing Bitcoin mining. These devices have only one specific job, and that is to run
hash function SHA-256 repeatedly and extremely fast.
Ethereum uses proof of work; however, the mining algorithm is memory-hard, which makes building ASICs difficult due to the large memory requirement. The protocol is called ETHASH, and it generates a large directed acyclic graph (DAG) to be used by miners. The DAG grows over time and has increased to roughly 4 GB in size. As this DAG must be held in memory, building ASICs with such large memory is prohibitively hard, thus making ETHASH an ASIC-resistant algorithm. We will explain ETHASH in more detail in Chapter 8.
The Ethereum 1.0 blockchain will continue to evolve according to its road map and
will eventually become a shard in phase 1 of Ethereum 2.0.
With this, we complete our brief discussion on the two most prominent and
pioneering blockchain platforms. More modern blockchain platforms, such as Polkadot,
Cardano, Solana, Avalanche, and Ethereum 2.0, will be introduced when we discuss their
respective consensus protocols in Chapter 8.
Summary
• A blockchain is a peer-to-peer, cryptographically secure, append-
only, immutable, and tamper-proof shared distributed ledger
composed of temporally ordered and publicly verifiable transactions.
Bibliography
1. Bashir, I., 2020. Mastering blockchain: a deep dive into
distributed ledgers, consensus protocols, smart contracts, DApps,
cryptocurrencies, Ethereum, and more.
5. Bitgold: https://unenumerated.blogspot.com/2005/12/bit-gold.html
12. Gupta, S., Hellings, J., and Sadoghi, M., 2021. Fault-Tolerant
Distributed Transactions on Blockchain. Synthesis Lectures on
Data Management, 16(1), pp. 1–268.
17. Fischer, M.J., Lynch, N.A., and Merritt, M., 1986. Easy impossibility
proofs for distributed consensus problems. Distributed
Computing, 1(1), pp. 26–39.
CHAPTER 5
Blockchain Consensus
Blockchain consensus is the core element of a blockchain, which ensures the integrity
and consistency of the blockchain data. Blockchain being a distributed system, in
the first instance, it may appear that we can apply traditional distributed consensus
protocols, such as Paxos or PBFT, to address the agreement and total order requirements
in a blockchain. However, this can only work in consortium chains where participants
are known and limited in number. In public chains, traditional consensus protocols
cannot work due to the permissionless environment. However, in 2008 a new class of
consensus algorithms emerged, which relied on proof of work to ensure random leader
election by solving a mathematical puzzle. The elected leader wins the right to append
to the blockchain. This is the so-called Nakamoto consensus protocol. This algorithm
for the very first time solved the problem of consensus in a permissionless public
environment with many anonymous participants.
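The puzzle itself is easy to sketch: the following toy Python example (illustrative only; real Bitcoin encodes a 256-bit target in the block header rather than a simple leading-zero-bits count) searches for a nonce whose double-SHA-256 hash falls below a target.

import hashlib
import itertools

def mine(header: bytes, difficulty_bits: int) -> int:
    """Find a nonce so that double-SHA-256(header || nonce) < target."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(hashlib.sha256(
            header + nonce.to_bytes(8, "little")).digest()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

# On average, about 2**20 hashes are needed for 20 difficulty bits.
print(mine(b"example block header", 20))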
We have already discussed distributed consensus from a traditional perspective
in Chapter 3. In this chapter, we will cover what blockchain consensus is, how the
traditional protocols can be applied to a blockchain, how proof of work works, how it was
developed, and what the blockchain consensus requirements are, and we will analyze
blockchain consensus such as proof of work through the lens of distributed consensus.
Also, we’ll see how the requirements of consensus may change depending upon the type
of blockchain in use. For example, for public blockchains proof of work might be a better
idea, whereas for permissioned blockchains BFT-style protocols may work better.
Background
Distributed consensus has always been a fundamental problem in distributed systems.
Similarly, in blockchain it plays a vital role in ensuring the integrity of the blockchain.
There are two broad classes of algorithms that have emerged as a result of the last almost
45 years of research on distributed consensus:
• Classical (voting-based) consensus protocols, such as Paxos and PBFT, designed for permissioned settings with known participants
• Nakamoto (lottery-based) consensus protocols, such as proof of work, designed for permissionless settings with unknown participants
Another point to keep in mind is the distinction between a broadcast problem and
a consensus problem. Consensus is a decision problem, whereas broadcast is a delivery
problem. The properties of both are the same but with slightly different definitions.
These properties include agreement, validity, integrity, and termination. In essence,
broadcast and consensus are interrelated and deeply connected problems as it is
possible to implement one from the other.
We will be focusing more on a consensus problem instead of a broadcast problem.
Blockchain Consensus
A blockchain consensus protocol is a mechanism that allows participants in a
blockchain system to agree on a sequence of transactions even in the presence of faults.
In other words, consensus algorithms ensure that all parties agree on a single source of
truth even if some parties are faulty.
There are some properties that are associated with blockchain consensus. The
properties are almost the same as standard distributed consensus but with a slight
variation.
As is standard, there are safety and liveness properties. These properties change depending on the type of blockchain. First, we define the safety and liveness properties for a permissioned/consortium blockchain and then for a public blockchain.
Traditional BFT
There are several properties that we can define for traditional BFT consensus, which are
commonly used in a permissioned blockchain. There are various variants, for example,
Tendermint, that are used in a blockchain. We covered traditional BFT in detail in
Chapter 3; however, in this section we will redefine that in the context of a blockchain
and especially permissioned blockchain. Most of the properties remain the same as
public permissionless consensus; however, the key difference is between deterministic
and probabilistic termination and agreement.
Agreement
No two honest processes decide on a different block. In other words, no two honest
processes commit different blocks at the same height.
Validity
If an honest process decides on a block b, then b satisfies the application-specific validity
predicate valid (). Also, the block b agreed must be proposed by some honest node.
Termination
Every honest process decides. After GST, every honest process continuously
commits blocks.
Agreement and validity are safety properties, whereas termination is a liveness property.
Integrity
A process must decide at most once in a consensus round.
Other properties can include the following.
Instant Irrevocability
Once a transaction has made it to the block and the block is finalized, the transaction
cannot be removed.
Consensus Finality
Finality is deterministic and immediate. Transactions are final as soon as they’ve made it
to the block, and blocks are final as soon as they’ve been appended to the blockchain.
While there are many blockchain consensus algorithms now, Nakamoto consensus was the first blockchain consensus protocol, introduced with Bitcoin, and it has several novel properties. Indeed, it is not a classical Byzantine algorithm with deterministic properties;
instead, it has probabilistic features.
Nakamoto Consensus
The Nakamoto or PoW consensus can be characterized with several properties. It is
commonly used in public blockchains, for example, Bitcoin.
Agreement
Eventually, no two honest processes decide on a different block.
Validity
If an honest process decides on a block b, then b satisfies the application-specific validity
predicate valid(). Also, the transactions within the block satisfy the application-specific
validity predicate valid(). In other words, only valid and correct transactions make it
to the block, and only correct and valid blocks make it to the blockchain. Only valid
transactions and blocks are accepted by the nodes. Also, mining nodes (miners) will only accept valid transactions. In addition, the decided value must have been proposed by some
honest process.
Termination
Every honest process eventually decides.
Agreement and validity are safety properties, whereas termination is a liveness
property.
Consensus Finality
With two correct processes p1 and p2, if p1 appends a block b to its local blockchain
before another block b’, then no other correct node appends b’ before b.
For a proof of work blockchain point of view, we can further carve out some
properties.
Consistent/Consistency
The blockchain must eventually heal a forked chain to arrive at a single longest chain. In
other words, everyone must see the same history.
Eventual Irrevocability
The probability of a transaction being rolled back decreases as more blocks are appended to the blockchain. This is a crucial property from the end users' point of view: once their transaction has been made part of a block and that block has been accepted, every new block added on top further assures them that the transaction is permanently and irrevocably part of the blockchain.
Table 5-1 shows some key differences between traditional BFT and Nakamoto
consensus.
Now we turn our attention to system models, which are necessary to describe as they capture the assumptions that we make about the environment in which blockchain consensus protocols operate.
System Model
Blockchain consensus protocols assume a system model under which they guarantee the
safety and liveness properties. Here, I describe two system models, which are generally
applicable to public and permissioned blockchain systems, respectively.
The term Sybil attack was coined after a book named Sybil, published in 1973, whose main character, Sybil Dorsett, has multiple personality disorder.
The first proof of work was introduced by Dwork and Naor in 1992 [1]. This work was aimed at combating junk email: by associating a computational cost, via so-called pricing functions, with sending email, it creates a type of access control mechanism where access to a resource can only be obtained by computing a moderately hard function, which prevents excessive use. Proof of work was also proposed in Adam Back's Hashcash proposal [10].
The key intuition behind proof of work in a blockchain is to universally slow
down the proposals for all participants, which achieves two goals. First, it allows all
participants to converge on a common consistent view, and, second, it makes Sybil
attacks very expensive, which helps with the integrity of the blockchain.
It has been shown (an impossibility result) that agreement cannot be achieved in a network where participants are anonymous, even if there is only one Byzantine node [2]. This is due to the Sybil attack, in which an attacker creates arbitrarily many identities and votes many times to game the system in their favor. Only if such attacks can be prevented is there some guarantee that the system will work as expected; otherwise, an attacker can create arbitrarily many identities to subvert it. This problem was solved practically by proof of work, or Nakamoto, consensus [3]. Before Bitcoin, the use of moderately hard puzzles to assign identities in
an anonymous network was first suggested by Aspnes [4]. However, the solution that
Aspnes introduced requires authenticated channels, whereas in Bitcoin unauthenticated
communication is used, and puzzles are noninteractive and publicly verifiable.
So even in the presence of the abovementioned impossibility results in classical
literature, Nakamoto consensus emerged, which for the first time showed that consensus
can be achieved in a permissionless model.
Remember we discussed random oracles in Chapter 3. In proof of work, hash
functions are used to instantiate random oracles. Since the output of hash functions
is sufficiently long and random, an adversary can neither predict future hashes nor cause hash collisions. These properties make SHA-256 a good choice as the hash function in the proof of work mechanism.
• Partition tolerance.
• Geographically dispersed.
The question is, how do you design a consensus protocol for such a difficult
environment? Yet, Bitcoin PoW has stood the test of time, and apart from some limited
and carefully orchestrated attacks and some inadvertent bugs, the Bitcoin network has largely been running without any issues for the last 13 years. How? I'll explain now.
one to accept or which to reject. Proposals are made at the same time, and nodes don't know which block to insert; perhaps they will insert both. Now some nodes have inserted only the block from proposer 1, others only the block from proposer 2, and some both. As you can imagine, there is no consensus here.
Imagine another scenario where two nodes simultaneously announce a block; now
the receiving nodes will receive two blocks, and instead of one chain, there are now two
chains. In other words, there are two logs and histories of events. Two nodes proposed a
block at the same time; all nodes added two blocks. Now it is no longer a single chain, it
is a tree, with two branches. This is called a fork. In other words, if nodes learn about two
different blocks pointing to the same parent at the same time, then the blockchain forks
into two chains.
Now in order to resolve this, we can allow nodes to pick the longest chain of blocks
at that time that they know of and add the new block to that chain and ignore the other
branch. If it so happens that there are two or more branches with the same height (same length), then one of the chains is picked at random and the new block is added to it. This way, the fork is resolved. Now all nodes, knowing the rule that only the longest chain is allowed to receive new blocks, will keep building on the longest chain; in the case of two or more chains of the same height, the block is simply added to one of them at random.
So far, so good! This scheme appears to work. A node decides to add the new block into a
randomly chosen chain and propagates that decision to others, and other nodes add that
same block to their chains. Over time, the longest chain takes over, and the shorter chain is
ignored because no new blocks are added to it, because it’s not the longest chain.
But now there is another problem. Imagine a situation where a node randomly chooses one chain after a fork, adds a block to it, and propagates that decision to others; other nodes add it as well, but some nodes, due to latency, don't hear about the decision. Some nodes add the block they heard about to one of their chains, others do the opposite, and this cycle repeats. Now you can clearly see that there are two chains, both getting new blocks. There is no consensus. There is a livelock situation where nodes can keep adding to both chains.
At this point, let's think about the fundamental reason why this livelock is occurring. The reason is that blocks are being generated too fast, and nodes receive many different blocks from different nodes, some quickly, some delayed. This
asynchrony results in a livelock. The solution? Slow it down! Give nodes time to converge
to one chain! Let’s see how.
We can introduce a random waiting period, which makes miners sleep for an arbitrary amount of time before mining. The key insight here is that the livelock (continuous fork) problem can be resolved by introducing a variable-speed timer at each node. When a node adds a new block to its chain, it stops its timer and sends the block to other nodes. Other nodes are waiting for their timers to expire, but during that waiting time, if they hear about this new block from another node, they simply stop their timers, add the new block, reset their timers, and start waiting again. This way, only one block is added to the chain instead of two. If the timers are long enough, the chances of forking and livelocking decrease significantly. Note, however, that the more nodes there are in the system, the higher the chance that some timer will expire soon, because the timers are random and there are many of them. To avoid the same livelock situation, we therefore need to increase the sleeping time of these timers as we add more nodes, so that the probability of several nodes adding a block quickly is reduced to a level where only one node eventually succeeds in adding a new block to its chain and announcing it to the network. The waiting period also ensures with high probability that forks are resolved during this time: it is enough time for a new valid block to propagate fully so that no other block for the same height is proposed.
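To make this intuition concrete, the following minimal Python sketch (purely illustrative; it is not how Bitcoin or any real client implements timers) simulates nodes with random waiting times and measures how often two timers expire almost together, which is when a fork would occur. Increasing the mean waiting time as the number of nodes grows keeps that probability low.

import random

def near_tie_rate(num_nodes, mean_wait_seconds, rounds=5000, tie_window=1.0):
    # In each round, every node draws a random waiting time; the earliest timer "wins".
    # A potential fork occurs if the runner-up expires within tie_window seconds of the winner.
    ties = 0
    for _ in range(rounds):
        timers = sorted(random.expovariate(1.0 / mean_wait_seconds) for _ in range(num_nodes))
        if timers[1] - timers[0] < tie_window:
            ties += 1
    return ties / rounds

print(near_tie_rate(num_nodes=10, mean_wait_seconds=600))      # few nodes: near ties are rare
print(near_tie_rate(num_nodes=1000, mean_wait_seconds=600))    # many nodes: near ties become common
print(near_tie_rate(num_nodes=1000, mean_wait_seconds=60000))  # longer waits restore rarity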
Bitcoin chooses this timeout period based on the rate of block generation of 2016
blocks, which is roughly two weeks. As the block generation should be roughly a single
block every ten minutes, if the protocol observes that the block generation has been
faster in the last two weeks, then it increases the timeout value, resulting in slower
generation of blocks. If the protocol observes that the block generation has been slower,
then it decreases the timeout value. Now one problem in this timeout mechanism is
that if a single node turns malicious and always manages to somehow make its timer
expire earlier than other nodes, this node will end up creating a block every time. Now
the requirement becomes to build a timer which is resistant to such cheating. One
way of doing this is to build a trusted mechanism with some cryptographic security
guarantees to act as a secure enclave in which the timer code runs. This way, due to
cryptographic guarantees, the malicious node may not be able to trick the timer into always expiring first.
This technique is used in the PoET (proof of elapsed time) algorithm used in
Hyperledger Intel Sawtooth blockchain. We will discuss this in Chapter 8.
Another way, and the original way in which Nakamoto designed the algorithm, is to make computers do a computationally complex task which takes time to solve, just enough that it can be solved roughly every ten minutes. Also, the task is formulated in such a
way that nodes cannot cheat, except to try to solve the problem. Any deviation from
the method of solving the problem will not help, as the only way to solve the problem
is to try every possible answer and match it with the expected answer. If the answer
matches with the expected answer, then the problem is solved; otherwise, the computer
will have to try the next answer and keep doing that in a brute-force manner until the
answer is found. This is a brilliant insight by Satoshi Nakamoto which ensures with
high probability that computers cannot cheat, and timers only expire almost every ten
minutes, giving one of the nodes the right to add its block to the blockchain. This is the
so-called proof of work, meaning a node has done enough work to demonstrate that it
has spent enough computational power to solve the math problem to earn the right to
insert a new block to the blockchain.
Proof of work is based on cryptographic hash functions. It requires that for a block
to be valid, its hash must be less than a specific value. This means that the hash of the
block must start with a certain number of zeroes. The only way to find such a hash is to repeatedly try different inputs and check whether the resulting hash meets the criterion; if not, try again until some node finds such a hash. This means that, on average, finding a valid hash takes
roughly ten minutes, thus introducing just enough delay which results in resolving forks
and convergence on one chain while minimizing the chance of one node winning the
right to create a new block every time.
Now it is easy to see that proof of work is a mechanism to introduce waiting time
between block creation and ensuring that only one leader eventually emerges, which can
insert the new block to the chain.
So, it turns out that PoW is not, precisely speaking, a consensus algorithm; it is a
consensus facilitation algorithm which, due to slowing down block generations, allows
nodes to converge to a common blockchain.
Now as we understand the intuition behind the proof of work mechanism, next we
will describe how exactly the proof of work algorithm works in Bitcoin.
PoW Formula
The PoW consensus process can be described with the help of a formula:
New difficulty = (previous difficulty × 2016 × 10 minutes) / (time taken to mine the most recent 2016 blocks)
This formula basically regulates the blockchain to produce new blocks roughly at a
mean rate of ten minutes.
Now in order to calculate the target, first calculate the difficulty using the following
formula:
difficulty = (maximum possible target) / (current target)

target = (maximum possible target) / (difficulty)
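As a minimal sketch of the arithmetic above (the maximum-target constant and the plain floating-point math are simplifications; real clients clamp the adjustment and encode the target compactly in the nBits field):

# Difficulty 1 corresponds to the maximum (easiest) target.
MAX_TARGET = 0xFFFF * 2 ** 208

def new_difficulty(previous_difficulty, minutes_for_last_2016_blocks):
    # Blocks arrived too fast -> denominator is small -> difficulty goes up, and vice versa.
    return previous_difficulty * (2016 * 10) / minutes_for_last_2016_blocks

def target_from_difficulty(difficulty):
    return int(MAX_TARGET / difficulty)

# Example: the last 2016 blocks took 12 days instead of the intended 14.
d = new_difficulty(previous_difficulty=20e12, minutes_for_last_2016_blocks=12 * 24 * 60)
print(d, hex(target_from_difficulty(d)))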
Now that we have established how the target value is calculated, let’s see what
miners do and how they find a hash which satisfies the preceding equation, that is, the
value obtained after hashing the block is less than the target value. In other words, the
block hash must match a specific pattern where the hash starts with a certain number
of zeroes. This is also known as the partial hash inversion problem. This problem is to
find a partial preimage to the double SHA-256 hash function, which can only be found (if
ever) by trying different inputs one by one until one of the inputs works.
Task of Miners
In the Bitcoin blockchain network, when new transactions are executed by a user they
are broadcast to all nodes on the network via a peer-to-peer gossip protocol. These
transactions end up in transaction pools of nodes. Miners perform several tasks:
• They also listen for new blocks and append any new valid blocks
to their chain. This is of course not only the task for a miner, other
nonmining nodes also simply synchronize the blocks.
• Fetch the reward by receiving Coinbase on the address that the miner
wants to send the reward to.
The diagram in Figure 5-2 shows how transactions from a transaction pool (bottom
left of the figure) are picked up and a Merkle tree is created, the root of which is included
in the candidate block. Finally, the double SHA-256 hash of the block is computed for comparison against the target.
Figure 5-2. Transaction pool transactions to the Merkle tree and candidate block
A nonce is a number from 0 to 2^32 − 1, that is, a 32-bit unsigned integer, which gets included in the block. Trying a nonce, hashing, and checking whether the resultant number is less than the target, iteration after iteration, is what is called mining. If the resultant number is less than the target, then the block is mined and valid, and it is broadcast to the network.
Because the nonce field in a block is a 32-bit unsigned integer, there are only 2^32 nonces to try, so miners can run out of them quite quickly. In other words, there are roughly four billion nonces to try, which miners can exhaust quickly given the powerful mining hardware available; it is also very easy even for a normal computer to check a candidate quickly.
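A rough back-of-the-envelope calculation makes the point (the hash rate figure is hypothetical):

nonce_space = 2 ** 32          # about 4.29 billion possible nonces
hash_rate = 100e12             # assume a hypothetical ASIC doing 100 TH/s
print(nonce_space / hash_rate) # ~0.00004 seconds to exhaust the whole nonce space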
This of course can create an issue where no one is able to find the required nonce
which produces the required hash. Even if miners try again, they will try the same thing
again with the same results. At this stage, we can use other attributes of the block as variables and keep modifying the block until the hash of the block is less than the target, that is, SHAd256(Block header ‖ nonce) < Target.
Now after going through all these iterations, what if the valid nonce is not found?
At this point, miners will have to increase the search space somehow. For this, they can
modify the block somewhat to get a different hash. They can do several things:
• Modify the timestamp slightly (in the range of two hours; otherwise,
it’s an invalid block). It can be done simply by adding just a
second, which will result in a different header and consequently a
different hash.
• Modify Coinbase via unused ScriptSig, where you can put any
arbitrary data. This will change the Merkle root and hence the header
and consequently the hash.
And miners can keep modifying with different variations until they reach
SHAd256(Block header ‖ nonce ) < target, which means that they’ve found a valid nonce
that solves the proof of work.
The discovery of the valid hash is based on the concept known as partial hash
inversion.
Proof of work has some key properties. Formally, we list them as follows.
Properties of PoW
Proof of work has five properties: completeness, computational complexity, dynamic cost adjustment, quick verification, and being progress free.
Completeness
This property implies that proofs produced by the prover are verifiable and acceptable
by the verifier.
Progress Free
This property implies that the chance of solving the proof of work is proportional to the
hash power contributed; however, it is still a chance, not a 100% guarantee that a miner
with the highest hash power will always win. In other words, miners with more hash
power get only a proportional advantage, and miners with less power get a proportional chance too, sometimes getting lucky and finding blocks before miners with more hash power.
In practice, this means that every miner is in fact working on a different candidate
block to solve the proof of work. Miners are not working on the same block; they are not
trying to find a valid nonce for the same hash. This is because of several differences, such
as transactions, version number, Coinbase differences, and other metadata differences,
which when hashed result in a totally different hash (SHA-256 twice). This means that
every miner is solving a different problem and solving a different part of the double
SHA-256 or conveniently written as SHAd256 search space.
The progress free property can be visualized in Figure 5-3. As shown in Figure 5-3,
miners are all working on their own candidate block, which is different from other blocks
due to differences mentioned earlier. So, every nonce that the miners concatenate with
the block data to get the hash will result in a hash that no other miner is aware of. This gives some advantage to a miner with less power: it can happen that this small miner finds a nonce that solves PoW for its block before a miner with more hash power finds a valid nonce for theirs.
Figure 5-3. Progress free property – each miner working on a different part of
double (SHA-256) search space
This is another elegant property of PoW which ensures that miners with more hash
power may have some advantage, but it also means that a miner with less hash power
can be lucky in finding the nonce that works before the large miners. The key point is
miners are not working on the same block! If it were the same block every time, the most
powerful miner would’ve won. This is called the progress free property of Bitcoin PoW.
It is, however, possible for many miners to work collaboratively on the same block (the same search space), dividing the work between themselves. Imagine the search space for a block is 1 to 100; it may be divided into 10 parts so that all miners can collectively work on a single block. This divides up the work, and all miners can contribute and earn their share of the reward. This is called pool mining. Unlike solo mining, where a single miner's entire effort is lost if it doesn't find the nonce and has to try again for the next block, in pool mining individual contributions are not wasted.
This concept can be visualized in Figure 5-4.
Figure 5-4. Mining pool – many miners working on a single block (shad256
search space)
In Figure 5-4, there are different miners working on the same hash search space
produced by the same block. This way, the pool operators split the proof of work into
different pieces and distribute them to the miners in the pool. All miners work and put in
the effort, and eventually one miner finds the block which is broadcast normally to the
Bitcoin network. The pool operator receives the block reward which is split between the
miners in proportion to the effort put in by the miners.
PoW is almost like rolling dice: if I have rolled the dice a few times, I cannot know when the next six will occur; I might get a six on the first attempt, never get one, or get one only after several rolls. Similarly, whether a single nonce has been tried or trillions and trillions of nonces have been tried, the time until a miner finds a valid nonce remains probabilistic. It doesn't matter whether 100 million nonces have already been tried or only one; the probability of the next attempt finding a valid nonce remains the same. So having tried millions of nonces doesn't make the next attempt any more likely to succeed, and even the very first attempt could find a valid nonce.
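A tiny simulation (illustrative only, with an artificially easy success probability) shows this memorylessness: the average number of further attempts needed is simply 1/p, no matter how many attempts have already failed.

import random

p = 1e-3  # per-attempt success probability (artificially high so the demo runs quickly)

def attempts_until_success():
    # Each attempt is independent; previous failures do not change the odds of the next one.
    n = 1
    while random.random() >= p:
        n += 1
    return n

samples = [attempts_until_success() for _ in range(5000)]
print(sum(samples) / len(samples))  # close to 1/p = 1000, regardless of any past failures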
A sequence of Bernoulli trials repeated at a high rate, so that successes occur essentially continuously in time rather than at discrete steps, approximates a Poisson process. Formally, a Poisson process is a sequence of discrete events where events occur independently at a known constant average rate, but the exact timing of events is random.
For example, movements in a stock price are often modeled as a Poisson process. A Poisson process has some properties: the average time between events is known, but the events themselves are randomly spaced (stochastic). In Bitcoin, the average time between two block generation events is known, that is, roughly ten minutes, but the generations are randomly spaced.
The mean time for a new block is ten minutes on average. We can use a simple formula to find the mean time to the next block for a particular miner:

Next block mean time (specific miner) = 10 minutes / (fraction of hash power controlled by the miner)

The probability of an attacker ever catching up from z blocks behind can be written as

q_z = 1 if p ≤ q, (q/p)^z if p > q

where
z = number of blocks the attacker needs to catch up
p = probability that an honest node finds the next block
q = probability that the attacker finds the next block
This means that if the honest hash rate is less than the attacker's hash rate, then the probability of the attacker catching up is one, and if the honest hash rate is more than the attacker's hash rate, then the probability of catching up is (q/p)^z.
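Both formulas can be sketched in a few lines (illustrative; the catch-up expression is the simplified form quoted above, not the full Poisson-weighted analysis in the Bitcoin white paper):

def mean_minutes_to_next_block(hash_power_fraction):
    # A miner controlling a fraction x of the total hash power expects a block every 10/x minutes.
    return 10 / hash_power_fraction

def catch_up_probability(q, z):
    # q: attacker's share of the hash power, p = 1 - q: honest share,
    # z: number of blocks the attacker needs to catch up.
    p = 1 - q
    return 1.0 if p <= q else (q / p) ** z

print(mean_minutes_to_next_block(0.01))  # a 1% miner: ~1000 minutes per block on average
print(catch_up_probability(q=0.1, z=6))  # with 10% of the hash power: roughly 1.9e-6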
In the next section, we formally write the proof of work algorithm.
PoW Algorithm
Formally, we can write the entire proof of work algorithm as follows:
nonce := 0
hashTarget := nBits
hash := null
while (true) {
    hash := SHA256(SHA256(blockheader || nonce))
    if (hash ≤ hashTarget) {
        append block to blockchain
        break
    } else {
        nonce := nonce + 1
    }
}
In the preceding algorithm, the nonce is initialized as zero. The hash target which
is the difficulty target of the network is taken from the nBits field of the candidate
block header. The hash is initialized to null. After that, an infinite loop runs, which first
concatenates the block header and the nonce and runs SHA-256 twice on it to produce
the hash. Next, if the produced hash is less than or equal to the target, then the block is accepted and appended to the blockchain; otherwise, the nonce is incremented and the process starts again. If no valid nonce is found in the entire nonce space, then the miner modifies the candidate block (as described earlier) and tries again.
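For illustration, here is a minimal runnable version of the same loop in Python over a toy header (it uses a much easier target than Bitcoin's so that it finishes quickly; real mining hashes the 80-byte serialized block header and reads the target from the compact nBits encoding):

import hashlib
import struct

def sha256d(data: bytes) -> bytes:
    # Double SHA-256, as used by Bitcoin.
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(header_without_nonce: bytes, target: int, max_nonce: int = 2 ** 32):
    for nonce in range(max_nonce):
        h = sha256d(header_without_nonce + struct.pack("<I", nonce))
        if int.from_bytes(h, "big") <= target:
            return nonce, h.hex()
    return None  # nonce space exhausted: modify the header (timestamp, coinbase) and retry

# Toy example: a string stands in for the real header fields, and the target is very easy.
easy_target = 1 << 240  # requires roughly 16 leading zero bits
print(mine(b"prev-hash|merkle-root|timestamp|nBits", easy_target))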
This process can be visualized in Figure 5-5.
In Figure 5-5, the previous block hash, transactions, and nonce are fed into a hash
function to produce a hash which is checked against the target value. If it is less than
the target value, then it’s a valid hash and the process stops; otherwise, the nonce is
incremented, and the entire process repeats until the resultant hash is less than the
target, where the process stops.
Games represent different strategic situations. There are some classical games such
as Bach or Stravinsky, prisoner's dilemma, Hawk-Dove, and matching pennies. In some games,
players are not aware of the actions of other players when making their own decisions;
such games are called simultaneous move games.
Games can be analyzed by creating a table where all possible actions of players and
payoffs are listed. This table is known as the strategic form of the game or payoff matrix.
A Nash equilibrium is a fundamental and powerful concept in game theory. In a
Nash equilibrium, each rational player chooses the best course of action in response
to the choice made by other players. Each player is aware of other players’ equilibrium
strategies, and no player can make any gains by changing only their own strategy. In
short, any deviation from the strategy does not result in any gain for the deviant.
Prisoner’s Dilemma
In this simultaneous move game, two suspects of a crime are put into separate cells
without any way to communicate. If they both confess, then they both will be imprisoned
for three years each. If one of them confesses and acts as a witness against the other,
then charges against him will be dropped; however, the other suspect will get four years
in prison. If none of them confesses, then both will be sentenced to only one year in
prison. Now you can see that if both suspects cooperate and don’t confess, then it results
in the best outcome for both. However, there is a big incentive of going free for both to
not cooperate and act as a witness against the other. This game results in gains for both if
they cooperate and don’t confess and results in only one year in prison each. Let’s name
these characters Alice and Bob for ease and see what possible outcomes there are in
this game.
There are four possible outcomes of this game: {confess, confess}, where both get three years; {confess, don't confess}, where the confessor goes free and the other gets four years; {don't confess, confess}, which is the mirror image; and {don't confess, don't confess}, where both get one year.
If Alice and Bob can somehow communicate, then they can jointly decide to not
confess, which will result in only a one-year sentence each. However, the dominant
strategy here is to confess rather than not to confess.
A dominant strategy is a strategy that results in the largest payoff regardless of the
behaviors of other players in the game.
We can represent this in a payoff matrix form as shown in Figure 5-6.
Alice and Bob are both aware of this matrix and know that they both have this matrix
to choose from. Alice and Bob are players, “confess” and “don’t confess” are actions, and
payoffs are prison sentences.
Regardless of what Alice or Bob does, the other player still confesses. Alice's reasoning is that if Bob confesses, she should confess too because a three-year prison sentence is better than four. If Bob does not confess, she should still confess because she will go free. The same reasoning applies to Bob. The dominant strategy here is to confess, regardless of what the other player does.
Both players confess and go to prison for three years each. This is because even if Bob had somehow managed to tell Alice about his intention not to confess, Alice would still have confessed and become a witness to avoid prison altogether, and the same holds from Bob's perspective. Therefore, the outcome for both becomes confession, which is the Nash equilibrium. This is written as {confess, confess}.
In the prisoner's dilemma, there is a benefit in cooperation for both players, but the possible incentive of going free entices each player to defect. When all players in a game are rational, the best choice for each is to play their Nash equilibrium strategy.
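The dominant-strategy argument can be checked mechanically. A small sketch (the payoffs are the prison terms from the description above, so lower is better; the action names are illustrative):

# payoff[(alice_action, bob_action)] = (alice_years, bob_years)
payoff = {
    ("confess", "confess"):           (3, 3),
    ("confess", "dont_confess"):      (0, 4),
    ("dont_confess", "confess"):      (4, 0),
    ("dont_confess", "dont_confess"): (1, 1),
}
actions = ["confess", "dont_confess"]

def best_response(player, other_action):
    # Choose the action that minimizes this player's prison years, given the other's action.
    if player == "alice":
        return min(actions, key=lambda a: payoff[(a, other_action)][0])
    return min(actions, key=lambda b: payoff[(other_action, b)][1])

# Confessing is the best response whatever the other player does, so it is dominant,
# and (confess, confess) is the Nash equilibrium.
for other in actions:
    print(best_response("alice", other), best_response("bob", other))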
Game theory models are highly abstract; therefore, they can be used in many
different situations, once developed for a particular situation. For example, the
prisoner’s dilemma model can be used in many other areas. In network communications
where wireless network devices compete for bandwidth, energy supply, etc., there is a
need to regulate node behavior in such a way that all devices on the network can work
in harmony. Imagine a network where two cell towers working on the same frequency
in close vicinity can affect each other’s performance. One way to counter this problem
is to run both towers at low energy so that they don’t interfere with each other, but that
will decrease the bandwidth of both towers. If one tower increases its energy and the other doesn't, then the one that doesn't loses out and runs at lower bandwidth. So, the
dominant strategy here becomes to run towers at maximum power regardless of what
the other tower does, so that they achieve maximum possible gain. This result is like the
prisoner’s dilemma where confession is the best strategy. Here, maximum power is the
best strategy.
Now in the light of the preceding explained concepts, we can analyze the Bitcoin
protocol from a game theoretic perspective.
happens, Bitcoin will almost certainly become worthless, because this event would
imply that the very cryptography that protects the network has been broken (assuming
that real Satoshi is not alive or has lost his private keys irrecoverably).
I have a feeling that Satoshi is not moving his coins because that can cause Bitcoin
to lose its value drastically.
Similarly, even if an adversary somehow gains 51% of the network hash power,
taking over the entire network may not be beneficial anymore. Why? Because the best
course of action in such a situation for the adversary is to keep mining silently with some
reasonable hash power to gain economic incentives (earn bitcoins) just like others on
the network, instead of utilizing the entire 51% hash power and thereby announcing the attack to the world. That would diminish the Bitcoin value almost entirely, and any gains by the
attacker would be worthless. Therefore, attackers do not have incentive to take over the
Bitcoin network, perhaps apart from some mishaps that occurred due to human errors
and compromised keys. This is the elegance and beauty of Bitcoin that even attackers do
not gain by attacking the network. All participants gain by just playing by the rules. The
dominant strategy for miners is to be honest.
For the very first time in distributed computing, a network is created which does not
rely on any trusted third party and is permissionless, yet it doesn’t let any attacker take
over the network. Here, I remember something, which is not directly relevant to Bitcoin,
but helps to feel what many distributed computing experts might feel about Bitcoin
when they first realize how elegant it is.
We can also think of the fork resolution mechanism as a Schelling point solution. This
is a game theory concept where a focal point, also called a Schelling point, is a solution that
people choose by default in the absence of communication. Similarly, in the proof of work
fork resolution mechanism, due to the longest (strongest) chain rule, nodes tend to choose
the longest chain as a canonical chain to add the block that they’ve received without
any communication or direction from other nodes. This concept of cooperating without
communication was introduced by Thomas Schelling in his book The Strategy of Conflict.
Common Prefix
This property implies that all honest nodes will share the same large common prefix.
Chain Quality
This property means that the blockchain contains a certain required level of correct
blocks created by honest miners. If the chain quality is compromised, then the validity
property of the protocol cannot be guaranteed.
Chain Growth
This property means that new correct blocks are continually added to the blockchain at a regular rate.
These properties can be seen as the equivalent of traditional consensus properties
in the Nakamoto world. Here, the common prefix is an agreement property, the chain
quality is a validity property, and chain growth can be seen as a liveness property.
leader is only changed when the primary fails, but in PoW a leader is elected for every block. Leader election in PoW is based on computational power; however, several other techniques have been used in permissioned blockchains, from simply choosing a leader at random or using a simple rotation formula to more complex means such as verifiable random functions. We will cover these techniques in Chapter 8 in detail.
The leader election formula is simply the same PoW formula that we have already covered in the section "How PoW Works." As soon as any miner solves the proof of work, it is immediately elected as the leader and earns the right to broadcast its newly mined block. At this point, the miner is also awarded 6.25 BTC. This reward halves every four years.
At the leader election stage, the miner node has successfully solved the PoW puzzle,
and now the log replication can start.
Log Replication
The log replication or block replication to achieve consistency among nodes is achieved
by broadcasting the newly mined block to other nodes via a gossip dissemination
protocol. The key differences between a normal log and a blockchain log are as follows:
• The blocks (content in the log) are verifiable from the previous block.
When a new block is broadcast, it is validated and verified by each honest node on
the network before it is appended to the blockchain. Log replication after leader election
can be divided into three steps.
This type of propagation ensures that eventually all nodes get the message with
high probability. Moreover, this pattern does not overwhelm a single node with the
requirement of broadcasting a message to all nodes.
Block Validation
Block validation can be seen as the state transition function (STF). This block validation
function has the following high-level rules:
• The block header hash is less than the network difficulty target.
• The block timestamp is not more than two hours in the future.
The protocol specifies very precise rules, details of which can be found at https://
en.bitcoin.it/wiki/Protocol_rules; however, the preceding list is a high-level list of
block validation checks a node performs.
Append to the Blockchain
The block is finally inserted into the blockchain by the nodes. When appending to the
blockchain, it may happen that those nodes may have received two valid blocks. In that
case, a fork will occur, and nodes will have to decide which chain to append the block to.
We can visualize this concept in Figure 5-9.
Fork Resolution
Fork resolution can be seen as a fault tolerance mechanism in Bitcoin. Fork resolution
rules ensure that only the chain that has the most work done to produce it is the one
that is always picked up by the nodes when inserting a new block. When a valid block
arrives for the same height, then the fork resolution mechanism allows the node to
ignore the shorter chain and add the block only to the longest chain. Also note that this
is not always the case that the longest chain has the most work done; it could happen
that a shorter chain may have the most computational hash power behind it, that is, the
accumulated proof of work, and in that case, that chain will be selected.
We can calculate the accumulated proof of work by first calculating the difficulty of each block. The difficulty of a block B can be defined as how much harder it is to find a valid proof of work nonce for this specific block B than it was for the genesis block.
The accumulated proof of work for a chain is then the sum of the difficulties of all blocks in the chain. The chain that has the most proof of work behind it will be chosen for a new block to be appended.
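A minimal sketch of this strongest-chain selection (the difficulty numbers are made up; real nodes track an equivalent cumulative chainwork value, as noted below):

def accumulated_work(chain):
    # Sum of per-block difficulty: how much harder each block was than a difficulty-1 block.
    return sum(block["difficulty"] for block in chain)

def select_chain(chains):
    # Pick the chain with the most accumulated proof of work, not simply the most blocks.
    return max(chains, key=accumulated_work)

long_but_easy = [{"difficulty": 1}] * 5
short_but_hard = [{"difficulty": 10}] * 3
print(select_chain([long_but_easy, short_but_hard]) is short_but_hard)  # True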
The longest chain rule was originally simply the chain with the highest number
of blocks. However, this simple rule was modified later, and the “longest” chain
became the chain with the most work done to create it, that is, the strongest chain.
In practice, there is a chainwork value in the block which helps to identify the chain
with the most work, that is, the correct “longest” or “strongest” chain.
For example, we use

bitcoin-cli getblockheader 0000000000000000000811608a01b388b167d9c94c0c0870377657d524ff0003

which returns the following:

{
  "result": {
    "hash": "0000000000000000000811608a01b388b167d9c94c0c0870377657d524ff0003",
    "confirmations": 1,
    "height": 687731,
    "version": 547356676,
    "versionHex": "20a00004",
    "merkleroot": "73f4a59b854ed2d6597b56e6bc499a7e0b8651376e63e0825dbcca3b9dde61ae",
    "time": 1623786185,
    "mediantime": 1623781371,
    "nonce": 2840970250,
    "bits": "170e1ef9",
    "difficulty": 19932791027262.74,
    "chainwork": "00000000000000000000000000000000000000001eb70918030b922df7533fd4",
    "nTx": 2722,
    "previousblockhash": "00000000000000000000f341e0046c6d82979fdfa09ab324a0e8ffbabd22815d"
  },
  "error": null,
  "id": null
}
There are several types of forks:
• Regular fork
• Hard fork
• Soft fork
• Byzantine fork
Regular fork
A fork can naturally occur in the Bitcoin blockchain when two miners competing
to solve the proof of work happen to solve it almost at the same time. As a result, two
new blocks are added to the blockchain. Miners will keep working on the longest chain
that they are aware of, and soon the shorter chain with so-called orphan blocks will be
ignored.
The diagram in Figure 5-10 shows how consensus finality is impacted by forks.
Due to the forking possibility, consensus finality is probabilistic. When a fork is resolved, transactions previously accepted on the losing branch are rolled back, and the longest (strongest) chain prevails.
The probability of these regular forks is quite low. A split of one block can occur
almost every two weeks and is quickly resolved when the next block arrives, referring
to the previous one as a parent. The probability of occurrence of a two-block split is
exponentially lower, which is almost once in 90 years. The probability of occurrence of a
four-block temporary fork is once in almost 700,000,000 years.
Hard fork
A hard fork occurs due to changes in the protocol, which are incompatible with the
existing rules. This essentially creates two chains, one running on the old rules and the
new one on new rules.
We can visualize how a hard fork behaves in Figure 5-11.
Soft fork
A soft fork occurs when changes in the protocol are backward compatible. It means
that there is no need to update all the clients; even if not all the clients are upgraded, the
chain is still one. However, any clients that do not upgrade won’t be able to operate using
the new rules. In other words, old clients will still be able to accept the new blocks.
This concept can be visualized in the diagram in Figure 5-12.
Byzantine fork
A Byzantine fork or malicious fork can occur in scenarios where an adversary may try
to create a new chain and succeeds in imposing its own version of the chain.
With this, we complete our discussion on forks.
A core feature of proof of work consensus is the Sybil resistance mechanism
which ensures that creating many new identities and using them is prohibitively
computationally complex. Let’s explore this concept in more detail.
Sybil Resistance
A Sybil attack occurs when an attacker creates multiple identities, all belonging to them, to subvert a network that relies on voting, by using all those identities to cast votes in their favor. If an attacker creates more nodes than the rest of the network combined, then the attacker can skew the network's decisions in their favor.
A Sybil attack can be visualized in Figure 5-13, where an attacker is controlling more
Sybil nodes than the network.
Proof of work makes it prohibitively expensive for an attacker to use multiple nodes under their control to participate in the network, because each node has to do computationally complex work in order to take part. Therefore, an attacker gains no extra influence merely by controlling a large number of identities; influence is proportional to computational power, not to the number of nodes.
Block timestamps not only serve to provide some variation for the block hash, which is useful in proof of work, but also help to protect against blockchain manipulation where an adversary could try to inject an invalid block into the chain. When a Bitcoin node
connects to another node, it receives the timestamp in UTC format from it. The receiving
node then calculates the offset of the received time from the local system clock and
stores it. The network adjusted time is then calculated as the local UTC system clock plus
the median offset from all connected nodes.
There are two rules regarding timestamps in Bitcoin blocks. A valid timestamp
must be greater than the median timestamp of the previous 11 blocks. It should also
be less than the median timestamp calculated based on the time received from other
connected nodes (i.e., network adjusted time) plus two hours. However, this network
time adjustment must never be more than 70 minutes from the local system clock.
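A sketch of these two checks and of the network-adjusted time calculation (all times are Unix timestamps in seconds; the constants are the ones quoted above):

import statistics

TWO_HOURS = 2 * 60 * 60
MAX_ADJUSTMENT = 70 * 60  # network time may differ from the local clock by at most 70 minutes

def network_adjusted_time(local_time, peer_offsets):
    offset = statistics.median(peer_offsets)
    # Clamp the adjustment so that peers cannot drag our notion of time too far.
    offset = max(-MAX_ADJUSTMENT, min(MAX_ADJUSTMENT, offset))
    return local_time + offset

def timestamp_valid(block_time, previous_11_block_times, local_time, peer_offsets):
    median_past = statistics.median(previous_11_block_times)
    upper_bound = network_adjusted_time(local_time, peer_offsets) + TWO_HOURS
    return median_past < block_time < upper_bound

# Example: a block timestamped 30 minutes ahead of our (slightly offset) view of time is valid.
print(timestamp_valid(1_700_001_800, [1_700_000_000 + i for i in range(11)], 1_700_000_000, [5, -3, 10]))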
The conclusion is that Bitcoin is in fact secure only under a synchronous network
model. More precisely, it is a lockstep-free synchrony where there exists some known
finite time bound, but execution is not in lockstep.
A Caveat
The order of transactions is not consensus driven. Each miner picks up transactions in an order hardcoded within the client, and indeed there have been attacks that result in transactions being censored, ignored, or reordered. Consensus is achieved in fact
on the block, and that is also not through voting; once a miner has solved PoW, it just
simply wins the right to append a new block to the chain. Of course, it will be validated
by other nodes when they receive it, but there is no real agreement or voting mechanism
after the mined block has been broadcast by the successful miner. There is no voting
or consensus which agrees on this new block; the miner who is the elected leader and
because they solved PoW has won the right to add a new block. Other nodes just accept
it if it passes the valid() predicate.
So, the caveat here is that when a candidate block is created by picking up
transactions from the transaction pool, they are picked up in a certain order which a
miner can influence. For example, some miners may choose not to include transactions
without any fee and only include those which are paying fee. Fair for the miner
perhaps, but unfair for the user and the overall Bitcoin system! However, eventually all
transactions will be added, even those without a fee, but they might be considered only
after some considerable time has elapsed since their inclusion in the transaction pool.
If they’ve aged, then they’ll be eventually included. Moreover, under the assumption
that usually there is a majority of honest miners always in the network, the transactions
are expected to be picked up in a reasonable amount of time in line with the protocol
specification.
Let us now see what that order is.
Transactions are picked up from the transaction pool based on their priority, which is calculated using the following formula [8]:

p = Σ (v_i × a_i) / s

where v_i is the value of input i (in base units, i.e., satoshis), a_i is the age of input i (the number of confirmations of the output it spends), and s is the size of the transaction in bytes.
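A sketch of this calculation (field names are illustrative; reference [8] measures value in base units and age as the number of confirmations of each spent input):

def priority(inputs, tx_size_bytes):
    # inputs: list of (value_in_satoshis, age_in_blocks) pairs, one per spent output.
    return sum(value * age for value, age in inputs) / tx_size_bytes

# Example: two 1 BTC inputs (100,000,000 satoshis), aged 10 and 20 blocks, in a 250-byte transaction.
print(priority([(100_000_000, 10), (100_000_000, 20)], 250))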
It is of concern that the ordering of transactions is not fair and leads to front running
and other relevant attacks. We will discuss fair ordering in Chapter 10.
that time. This mathematical puzzle serves two purposes; first, it is a proof that the
general is honest as they have solved the math puzzle, and, second, it stops the generals
from proposing too many times in quick succession, which will result in disagreement
and confusion between the generals. This mechanism can be seen as a solution to the Byzantine generals problem, with the compromise that temporary disagreement is acceptable.
Bitcoin PoW is a probabilistic consensus algorithm. The big question now arises
whether deterministic consensus can be achieved when the number of nodes is
unknown and in the presence of Byzantine nodes.
Now let’s revisit the validity, agreement, and termination properties defined at the
start of this chapter in the light of what we have learned so far about the proof of work
algorithm.
We can see clearly now that PoW is not a classical deterministic Byzantine consensus
algorithm. It is a protocol with probabilistic properties.
Let’s revisit the properties now.
Agreement
The agreement property is probabilistic. This is because it can happen that two different miners produce a valid block almost simultaneously, and some nodes add the block from one miner while other nodes add the block from the other. Eventually, however, the longest (strongest) chain rule will ensure that the chain with less proof of work behind it is pruned and the longest chain prevails. This can result in previously accepted transactions being rolled back; thus, agreement is probabilistic.
Validity
This is a deterministic property where honest nodes only accept blocks that are valid. Formally, we can say that if a correct process p eventually decides on a block b, then b must satisfy the application-specific valid() predicate. We discussed the validity predicate, that is, the block validation criteria, in detail earlier in the chapter.
Termination
Like agreement, termination is probabilistic: every honest process eventually decides, with high probability, as the chain grows.
PoW Concerns
There are several concerns regarding PoW, including attacks and extreme energy
consumption.
In the next section, we discuss some of the attacks that can be carried out against the
proof of work consensus, which adversely affects the Bitcoin network.
51% Attack
A 51% attack on Bitcoin can occur when more than 50% of the mining hash power is
controlled by an adversary.
Table 5-3 shows a list of actions that an adversary can possibly try to take after taking
over more than 50% hash power of the network.
Note that even then some attacks remain impossible, while some of the most detrimental to the system, such as double-spends, become possible.
Selfish Mining
This type of attack occurs when a miner who has found a block keeps it a secret instead
of announcing it and keeps building on top of it privately. Imagine the attacker has
managed to create another block. Now the attacker has two blocks in their private
forked chain. At this point, the attacker waits for someone else to find a block. When
the attacker sees this new block, they release their two-block chain. Because other
miners are honest and abiding by the longest chain rule, they will accept this new chain
being the longest. Now the block that was mined by someone else is orphaned, and the resources spent on it are wasted. The attacker could also wait for a
longer chain to be created, albeit mostly by luck, but if the attacker manages to create
such a private fork which is longer than the honest chain, then the attacker can release
that as soon as some other block is announced. Now when the nodes see this new
longest chain, according to the rules, they will start mining on top of this new longer
chain and orphaning the other chains, which could just be one block shorter than the
attacker’s chain. All the work that has gone into creating the honest chain is now wasted,
and the attacker gets the rewards, instead of other miners who did the work on the
honest chain.
Race Attack
This attack can occur in a situation where the adversary can make a payment to one
beneficiary and then a second one to themselves or someone else. Now if the first
payment is accepted by the recipient after zero confirmations, then it could happen that
the second transaction is mined and accepted in the next block, and the first transaction
could remain unmined. As a result, the first recipient may never get their payment.
Finney Attack
The Finney attack can occur when a recipient of a payment accepts the payment with
zero confirmations. It is a type of double-spend attack where an attacker creates two
transactions. The first of these transactions is a payment to the recipient (victim) and
the second to themselves. However, the attacker does not broadcast the first transaction;
instead, they include the second transaction in a block and mine it. Now at this point,
the attacker releases the first transaction and pays for the goods. The merchant does not
wait here for the confirmations and accepts the payment. Now the attacker broadcasts
the premined block with the second transaction that pays to themselves. This invalidates
the first transaction as the second transaction takes precedence over the first one.
Vector76 Attack
This attack is a combination of Finney and race attacks. This attack is powerful enough to
reverse a transaction even if it has one confirmation.
Eclipse Attack
This attack attempts to obscure a node’s correct view of the network, which can lead to
disruption to service, double-spend attacks, and waste of resources. There are several
solutions to fix the issue, which have been implemented in Bitcoin. More details can be
found here: https://cs-people.bu.edu/heilman/eclipse/.
ESG Impact
ESG metrics represent an overall picture of environmental, social, and governance
concerns. These metrics are used as a measure to assess a company’s exposure to
environmental, social, and governance risks. They are used by investors to make
investment decisions. Investors may not invest where ESG risks are higher and may
prefer companies where ESG risk is low.
Proof of work has been criticized for consuming too much energy. It is true that
at the time of writing, the total energy consumption of the Bitcoin blockchain is more than that of the entire country of Pakistan [9].
There are environmental, social, and governance concerns (ESG concerns) that have
been the cause of low interest from savvy mainstream investors. Nonetheless, Bitcoin
largely can be seen as a success despite its ESG concerns.
Not only has Bitcoin been criticized for its high energy consumption, but it is also often seen as a vehicle for criminal activity, having been accepted as a mode of payment for illicit drugs and other illegal goods and services.
A centralization problem is also a concern where some powerful miners with mining
farms take up most of the hash rate of the Bitcoin network. The ASICs that are used to
build these mining farms are produced only by a few manufacturers, which means that
this is also a highly centralized space. Moreover, a crackdown [13] on Bitcoin mining
could also result in more centralization, where only the most powerful miners may be able to withstand it and survive, leaving just a few powerful miners in the end.
There are however points in favor of Bitcoin. Bitcoin can be used as a cross-border
remittance mechanism for migrant families. It can also be used as a mode of payment in
struggling economies. It can serve the unbanked population, which is estimated to be 1.7
billion [12]. Bitcoin serves as a vehicle for financial inclusion.
We could think of scenarios where the heat produced by Bitcoin mining farms may
be used to heat up water and eventually homes. Even electricity could be generated
and fed back into the electricity grid by using thermoelectric generators, exploiting the thermoelectric effect. Of course, the economics and engineering need to be worked out; however, the idea could work.
Payment systems, and in fact any system, require electricity to run. Bitcoin is criticized for consuming too much energy; however, this is the price paid for the strength of the system. The network difficulty is so high now that even many attackers colluding
together won’t be able to generate enough hash power to launch a 51% attack. So yes,
electricity is consumed, but in return there are benefits. In addition to the security of
Bitcoin, there are other benefits such as
• Borderless payments.
In summary, Bitcoin, despite its energy consumption and not living up to its original
philosophy of One CPU = One Vote, still can be seen as a successful project with many
benefits.
Variants of PoW
There are two types of proof of work algorithms depending on the hardware it is
intended to run on:
• CPU-bound PoW
• Memory-bound PoW
CPU-Bound PoW
These puzzles run at the speed of the processor. CPU-bound PoW refers to a type of PoW
where the processing required to find the solution to the cryptographic hash puzzle is
directly proportional to the calculation speed of the CPU or hardware such as ASICs.
Because ASICs have dominated Bitcoin PoW and provide a somewhat undue advantage to the miners who can afford them, CPU-bound PoW is seen as shifting mining toward centralization. Moreover, mining pools with extraordinary hash power can shift
the balance of power toward them. Therefore, memory-bound PoW algorithms have
been introduced, which are ASIC resistant and are based on memory-oriented design
instead of CPU.
Memory-Bound PoW
Memory-bound PoW algorithms rely on system RAM to provide PoW. Here, the
performance is bound by the access speed of the memory or the size of the memory.
This reliance on memory also makes these PoW algorithms ASIC resistant. Equihash is
one of the most prominent memory-bound PoW algorithms.
There are other improvements and variations of proof of work, which we will
introduce in Chapter 8.
Summary
In this chapter, we covered blockchain consensus:
• Proof of work can be seen in the light of game theory where the
protocol is a Nash equilibrium, and the dominant strategy for all
players is to be honest.
• Proof of work consumes high energy, and there are ESG concerns;
however, there are benefits as well.
Bibliography
1. Proof of work originally introduced in: Cynthia Dwork and Moni
Naor. Pricing via processing or combatting junk mail. In Ernest
F. Brickell, editor, Advances in Cryptology – CRYPTO ’92, 12th
Annual International Cryptology Conference, Santa Barbara,
California, USA, August 16–20, 1992, Proceedings, volume
740 of Lecture Notes in Computer Science, pages 139–147.
Springer, 1992.
3. https://bitcoin.org/bitcoin.pdf
4. www.cs.yale.edu/publications/techreports/tr1332.pdf
5. https://satoshi.nakamotoinstitute.org/emails/cryptography/11/
6. https://hal.inria.fr/hal-01445797/document
8. https://en.bitcoin.it/wiki/Miner_fees#Priority_transactions
9. Digiconomist: https://digiconomist.net/bitcoin-energy-consumption/
10. www.hashcash.org
12. https://globalfindex.worldbank.org/sites/globalfindex/files/chapters/2017%20Findex%20full%20report_chapter2.pdf
13. www.coindesk.com/bitcoin-slips-37k-china-vicecrackdown-mining
CHAPTER 6
Early Protocols
In this chapter, I introduce early protocols. First, we start with a background on
distributed transactions and relevant protocols, such as the two-phase commit.
After that, we’ll continue our journey, look at the agreement protocols, and conclude
the chapter with some fundamental results in distributed computing. This chapter
introduces early consensus algorithms such as those presented in the works of Lamport
et al., Ben-Or et al., and Toueg et al. It is helpful to understand these fundamental ideas
before continuing our voyage toward more complex and modern protocols.
Introduction
In my view, the 1980s was the golden age for innovation and discovery in distributed
computing. Many fundamental problems, algorithms, and results such as the Byzantine
generals problem, FLP impossibility result, partial synchrony, and techniques to
circumvent FLP impossibility were discovered during the late 1970s and 1980s. Starting
from Lamport’s phenomenal paper “Time, Clocks, and the Ordering of Events in a
Distributed System” to the Byzantine generals problem and then Schneider’s state
machine replication paper, one after another, some of the most significant contributions were made to the consensus problem and to distributed computing in general.
Consensus can be defined as a protocol for achieving agreement. A high-level list of
major contributions is described as follows.
In his seminal paper in 1978, “Time, Clocks, and the Ordering of Events in a Distributed System”, Lamport described how to order events using synchronized clocks in the absence of faults. Then in 1980, the paper “Reaching Agreement in the Presence of Faults” posed the question of whether agreement can be reached in an unreliable distributed system. It was proven that agreement is achievable if the number of faulty nodes in a distributed system is less than one-third of the total number of processes, i.e., n ≥ 3f + 1, where n is the total number of nodes and f is the number of faulty processors. In the
paper “The Byzantine Generals Problem” in 1982, Lamport et al. showed that agreement
Chapter 6 Early Protocols
is solvable using oral messages if more than two-thirds of the generals are loyal. In 1982,
the paper “The Byzantine generals strike again” by Danny Dolev showed that unanimity
is achievable if less than one-third of the total number of processors are faulty and more
than one-half of the network’s connectivity is available.
Unanimity is a requirement where if all initial values of the processes are the same,
say v, then all processes decide on that value v. This is strong unanimity. However,
a weaker variant called weak unanimity only requires this condition to hold if all
processes are correct; in other words, no processes are faulty.
The paper also provided the first proof that a distributed system must have at least 3f + 1
nodes to tolerate f faults. However, the celebrated FLP result appeared a little later
which proved that deterministic asynchronous consensus is not possible even if a single
process is crash faulty. FLP impossibility implies that safety and liveness of a consensus
protocol cannot be guaranteed in an asynchronous network.
Lamport's algorithm was for a synchronous setting and assumed that eventually all the messages would be delivered. Moreover, it wasn't fault tolerant because a single failure would halt the algorithm.
After the FLP impossibility result appeared, attempts started to circumvent it and
solve the consensus problem nevertheless. The intuition behind circumventing FLP is to
relax some stricter requirements of timing and determinism.
Ben-Or proposed the earliest algorithms to sacrifice some level of determinism to
circumvent FLP. As FLP impossibility implies that under asynchrony, there will always
be an execution that does not terminate, one way of avoiding that is to try and make
termination probabilistic. So that instead of deterministic termination, probabilistic
termination is used. The intuition behind these algorithms is to use the “common coin”
approach, where a process randomly chooses its values if it doesn’t receive messages
from other nodes. In other words, a process is allowed to select a value to vote on if
it doesn’t receive a majority of votes on the value from the rest of the processes. This
means that eventually more than half of the nodes will end up voting for the same value.
However, this algorithm’s communication complexity increases exponentially with the
number of nodes. Later, another approach that achieved consensus in a fixed number
of rounds was proposed by Rabin. These proposals required 5f + 1 and 10f + 1 processes,
respectively, as compared to the 3f + 1 lower bound commonly known today.
Consensus protocols that relax timing (synchrony) requirements aim to provide safety
under all circumstances and liveness only when the network is synchronous. A significant
breakthrough was the work of Dwork, Lynch, and Stockmeyer, which for the first time
introduced a more realistic idea of partial synchrony. This model is more practical as it captures
how real distributed systems behave. More precisely, distributed systems can be asynchronous
for arbitrary periods but will eventually return to synchrony long enough for the system
to make a decision and terminate. This paper introduced various combinations of processor
and network synchrony and asynchrony and proved the lower bounds for such scenarios.
Table 6-1 shows the summary of results from the DLS88 paper showing a minimum
number of processors for which a fault-tolerant consensus protocol exists.
Failure type                Synchronous    Asynchronous    Partially synchronous
Fail-stop                   f              NA              2f + 1
Omission                    f              NA              2f + 1
Authenticated Byzantine     f              NA              3f + 1
Byzantine                   3f + 1         NA              3f + 1
This paper introduced the DLS algorithm which solved consensus under partial
synchrony.
Some major results are listed as follows, starting from the 1980s:
• Lamport showed in LPS 82 that under a synchronous setting, n > 2f is required with authentication and n > 3f with oral messages.
• The FLP result in 1982 showed that even with a single crash failure,
consensus is impossible under asynchrony, and at least n > 3f are
required for safety.
• Ben-Or in 1983 proposed a randomized solution under asynchrony.
Distributed Transactions
A distributed transaction is a sequence of events spread across multiple processes. A
transaction either concludes with a commit or abort. If committed, all events are executed,
and the output is generated, and if aborted, the transaction halts without complete
execution. A transaction is atomic if it executes and commits fully; otherwise, it rolls back
with no effect. In other words, atomic transactions either execute in full or not at all.
There are four properties that a transaction must satisfy, commonly known as the ACID properties:
• Atomicity: The transaction either executes in full or not at all.
• Consistency: The transaction takes the database from one consistent state to another.
• Isolation: Concurrently executing transactions do not interfere with each other.
• Durability: Once committed, the effects of a transaction survive failures.
One point to note here is that consistency is guaranteed much more easily in monolithic
architectures. In contrast, consistency is not immediate in distributed architectures, and
distributed architectures rely on so-called eventual consistency. Eventual consistency
means that all nodes in a system eventually (at some point in time in future) synchronize
and agree on a consistent state of the system. ACID properties must hold even if some
nodes (processes) fail.
Atomicity, isolation, and durability are easier to achieve in monolithic architectures,
but achieving these properties in distributed settings becomes more challenging.
A two-phase commit protocol is used to achieve atomicity across multiple processes.
Replicas should be consistent with one another. Atomic commit protocols are in fact
a kind of consensus mechanism because in transaction commit protocols nodes must
come to an agreement to either commit if all is well or roll back in case something goes
wrong. Imagine if a transaction is expected to be committed on all nodes in a distributed
system (a network), then either it must commit on all or none to maintain replica
consistency. We cannot have a situation where a transaction succeeds on some nodes
and not on others, leading to an inconsistent distributed system. This is where atomic
commit comes in. It can be seen fundamentally as a consensus algorithm because
this protocol requires an agreement between all nodes in a network. However, there are differences too: unlike a typical consensus protocol, atomic commit requires a yes vote from every participant, not just from a majority.
Two-Phase Commit
A two-phase commit (2PC) is an atomic commit protocol to achieve atomicity. It was
first published in a paper by Lampson and Sturgis in 1979. A two-phase commit enables
updating multiple databases in a single transaction and committing/aborting atomically.
As the name suggests, it works in two phases. The first phase is the vote collection
phase in which a coordinator node collects votes from each node participating in
the transaction. Each participant node either votes yes or no to either commit the
transaction or abort the transaction. When all votes are collected, the coordinator
(transaction manager) starts the second phase, called the decision phase. In the decision
phase, the coordinator commits the transaction if it has received all yes votes from other
nodes; otherwise, it aborts the transaction. Any node that had voted yes to commit
the transaction waits until it receives the final decision from the coordinator node. If it
receives no from the coordinator, it will abort the transaction; otherwise, it will commit
the transaction. Nodes that voted no immediately terminate the transaction without
waiting to receive a decision from the coordinator. When a transaction is aborted, any changes made are rolled back. At the nodes that voted yes, the changes are made permanent only after they receive a commit decision from the coordinator; until then, any changes made by the transaction are not permanent, and any locks taken during the write operations are still held. All participants send the acknowledgment back to
the coordinator after they’ve received the decision from the coordinator. As a failure
handling mechanism, a logging scheme is used in two-phase commits. In this scheme,
all messages are written to a local stable storage before they are sent out to the recipients
in the network. The coordinator writes its decision to the log on its local disk before sending it out; if it crashes, then when it recovers, it reads the logged decision and sends it to the other nodes. If no decision was logged before the crash, then it simply aborts the transaction. When a node fails (other than the coordinator node), the coordinator waits until it times out, and a decision is made to abort the transaction for all.
Figure 6-1 shows the two-phase commit protocol in action. Here, the client
(application) starts the transaction as usual and performs a usual read/write operation
on the database nodes, that is, on the transaction participant nodes. After a normal
transaction execution on each participant, when the client is ready to commit the
transaction, the coordinator starts the first phase, that is, the prepare phase. It sends
the prepare request to all nodes and asks them whether they can commit or not. If the
participants reply with a yes, it means that they are willing and ready to commit the
transaction, then the coordinator starts the second phase called the commit phase. This
is when the coordinator sends out the commit decision, and the transaction is finally
committed, and a commit actually takes place. If any of the participant nodes replies to
the prepare request with a no, then the coordinator sends out the abort request in phase
two, and all nodes abort accordingly. Note that after the first phase, there is a decision
point where the coordinator decides whether to commit or abort. The action after the
decision phase is either commit or abort, based on the yes or no received from the
participants.
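The decision rule just described can be sketched as a simplified, hypothetical coordinator loop in Python: commit only if every participant votes yes, otherwise abort. Durable logging, timeouts, and acknowledgments are omitted, and the participant interface (prepare/commit/abort) is an assumption for illustration only.

def run_two_phase_commit(participants, txn_id):
    # Phase 1: prepare (vote collection). A crashed or erroring participant
    # is treated as a no vote.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txn_id))   # True = yes, False = no
        except Exception:
            votes.append(False)

    # Decision point: commit only if every participant voted yes.
    decision = "commit" if votes and all(votes) else "abort"

    # Phase 2: broadcast the decision to all participants.
    for p in participants:
        if decision == "commit":
            p.commit(txn_id)
        else:
            p.abort(txn_id)
    return decision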
Three-Phase Commit
As we saw, the two-phase commit is not fault tolerant and blocks until the failed
coordinator recovers. If the coordinator or a participant fails in the commit phase, the
protocol cannot recover reliably. Even when the coordinator is replaced or recovers, it
cannot proceed to process the transaction reliably from where the failure occurred. The
three-phase commit solves this problem by introducing a new pre-commit intermediate
phase. After receiving a yes from all the participants, the coordinator moves to this
intermediate phase. Unlike 2PC, here, the coordinator does not immediately broadcast
commit; instead, it sends a pre-commit first, which indicates the intention to commit
the transaction. When participants receive the pre-commit message, they reply with the
ack messages. When the coordinator receives this ack from all participants, it sends the
commit message and proceeds as in the two-phase commit. If a participant fails before
sending back a message, the coordinator can still decide to commit the transaction. If the
coordinator crashes, the participants can still agree to abort or commit the transaction.
This is so because no actual commit or abort has taken place yet. The participants now
have another chance to decide by checking that if they have seen a pre-commit from the
coordinator, they commit the transaction accordingly. Otherwise, the participants abort
the transaction, as no commit message has been seen from the coordinator.
This process can be visualized in Figure 6-2.
An oral message is a message whose contents are under complete control of the
sender. The sender can send any possible message.
There is no solution to the Byzantine generals problem unless more than two-thirds
of generals are loyal. For example, if there are three generals and one is a traitor, then
there is no solution to BGP if oral messages are used. Formally, with oral messages, a solution exists only if n > 3m, where n is the total number of generals and m is the number of traitors. In other words, if n ≤ 3m, then a Byzantine agreement is not possible. The oral message algorithm OM(m) is recursive.
Algorithm
Base case OM(0): The commander sends his value to every lieutenant, and each lieutenant uses the value received from the commander, or a default value (e.g., retreat) if no value is received.
General case OM(m), m > 0: The commander sends his value to every lieutenant. Each lieutenant i then acts as the commander in OM(m – 1) to relay the value it received to the remaining lieutenants and finally takes the majority of its own received value and the values relayed by the others.
Figure 6-3. OM base case vs. OM(1) case, where the commander is the traitor
We can also visualize the case where a lieutenant is the traitor as shown in Figure 6-4.
For i = 1 : N – 1 do
    For j = 1 : N – 1 AND j ≠ i do
        Li stores the value received from Lj as vj
        vj = default if no value received
    End for
    Li chooses majority from {v1, v2, v3, ..., vN–1}
End for
As you may have noticed, this algorithm, while it works, is not very efficient due
to the number of messages required to be passed around. More precisely, from a
communication complexity perspective, this algorithm is exponential in the number
of traitors. If there are no traitors, as in the base case, then it is constant, O(1); otherwise, it is O(n^m), which means that it grows exponentially with the number of traitors, which makes it impractical when m is large.
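The recursion and the majority step can be modeled in a short Python simulation. This is only an illustration of OM(m) over perfect channels: a traitor is simulated by a sender that may report different values to different receivers, the default order is retreat, and ties fall back to the default.

from collections import Counter

DEFAULT = "retreat"

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else DEFAULT

def send(sender_is_loyal, value, receiver):
    # A loyal sender relays the value faithfully; a traitor sends arbitrary,
    # receiver-dependent values to create disagreement.
    return value if sender_is_loyal else ("attack" if receiver % 2 == 0 else "retreat")

def om(m, commander, value, lieutenants, loyal):
    # Returns {lieutenant: value used} after running OM(m).
    received = {l: send(loyal[commander], value, l) for l in lieutenants}
    if m == 0:
        return received                     # base case: use the received value
    decisions = {}
    for i in lieutenants:
        values = [received[i]]
        for j in lieutenants:
            if j == i:
                continue
            # Lieutenant j acts as commander in OM(m - 1), relaying what it received.
            sub = om(m - 1, j, received[j], [k for k in lieutenants if k != j], loyal)
            values.append(sub[i])
        decisions[i] = majority(values)
    return decisions

# Four generals, one traitorous lieutenant (id 3): loyal lieutenants 1 and 2 agree.
loyal = {0: True, 1: True, 2: True, 3: False}
print(om(1, 0, "attack", [1, 2, 3], loyal))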
Using the space-time diagram, we can visualize the base case as shown in Figure 6-5.
We can also visualize the m > 0 case where the commander is the traitor sending
conflicting messages to lieutenants in Figure 6-6.
Figure 6-6. Oral message protocol case where m =1, the commander is a traitor
In the digital world, commanders and lieutenants represent processes, and the
communication between these processes is achieved by point-to-point links and
physical channels.
So far, we have discussed the case with oral messages using no cryptography;
however, another solution with signed messages is also possible where digital signatures
are used to guarantee the integrity of the statements. In other words, the use of oral
messages does not allow the receiver to ascertain whether the message has been altered
or not. However, digital signatures provide a data authentication service that enables
receiving processes to check whether the message is genuine (valid) or not.
Based on whether oral messages are used, or digital signatures have been used,
Table 6-1, earlier in this chapter, summarizes the impossibility results under various
system models.
There is a signed solution to BGP which was proposed in the same BGP paper by
Lamport, Shostak, and Pease. It uses digital signatures to sign the messages. The additional assumptions under this model are that a loyal general's signature cannot be forged, any alteration of a signed message can be detected, and anyone can verify the authenticity of a general's signature.
Under this model, each lieutenant maintains a vector of signed orders received.
Then, the commander sends the signed messages to the lieutenants.
Generally, the algorithm works like this:
A lieutenant receives an order from either a commander or other lieutenants and
saves it in the vector that he maintains after verifying the message's authenticity. If there
are less than m signatures on the order, the lieutenant adds a signature to the order
(message) and relays this message to other lieutenants who have not seen it yet. When a
lieutenant does not receive any newer messages, he chooses the value from the vector as
a decision consensus value.
The lieutenants can detect that the commander is a traitor by using signed
messages because the commander's signature appears on two different messages. Our
assumptions under this model are that signatures are unforgeable, and anyone can verify
the signature's authenticity. This implies that the commander is a traitor because only he
could have signed two different messages.
Formally, the algorithm is described as follows.
Algorithm: For n generals and m traitor generals where n > 0. In this algorithm, each
lieutenant i keeps a set Vi of properly signed messages it has received so far. When the
commander is honest, then the set Vi contains only a single element.
Algorithm SM(m)
Initialization:
Vi = { }, that is, empty
1. The commander signs his order and sends the message v : 0 to every lieutenant.
2. For each i
a. If lieutenant i receives a message of the form v : 0 from the commander and has not yet received any message (order) from the commander, that is, Vi is empty, then
i. Set Vi = {v}.
ii. Sign the message and relay v : 0 : i to every other lieutenant.
b. If lieutenant i receives a message of the form v : 0 : j1 : … : jk and v is not in Vi, then he adds v to Vi and, if k < m, signs and relays v : 0 : j1 : … : jk : i to every lieutenant other than j1, …, jk.
3. For each i
When lieutenant i will receive no more messages, he applies the choice function and obeys the order choice(Vi).
Here, v : i is the value v signed by general i, and v : i : j is the message v : i counter
signed by general j. Each general i maintains a set Vi which contains all orders received.
The diagram in Figure 6-7 visualizes a traitor commander scenario.
With signed messages, it’s easy to detect if a commander is a traitor because its
signature would appear on two different orders, and by the assumption of unforgeable
signature, we know that only the commander could have signed the message.
Formally, for any m, the algorithm SM(m) solves the Byzantine generals problem if
there are at most m traitors. The lieutenants maintain a vector of values and run a choice
function to retrieve the order choice {attack, retreat}. Timeouts are used to ascertain if
no more messages will arrive. Also, in step 2, lieutenant i ignores any message v that is
already in the set Vi.
This algorithm has message complexity O(n^(m+1)), and it requires m + 1 rounds. This protocol works for N ≥ m + 2.
In contrast with the oral message protocol, the signed message protocol is more
resilient against faults; here, if at least two generals are loyal in three generals, the
problem is solvable. In the oral message, even if there is a single traitor in three generals,
the problem is unsolvable.
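The growth of the sets Vi and the traitor-detection property can be illustrated with a toy Python sketch. Signatures are simulated simply by recording the chain of signer IDs in each message (not real cryptography), and the lieutenant interface below is an assumption made purely for illustration.

def sign(order, chain, signer):
    # Simulated unforgeable signature: append the signer to the signature chain.
    return (order, chain + (signer,))

class Lieutenant:
    def __init__(self, ident):
        self.ident = ident
        self.V = set()                      # distinct orders received so far

    def receive(self, message, m, peers):
        order, chain = message
        if order in self.V:
            return                          # ignore orders already recorded
        self.V.add(order)
        if len(chain) <= m:                 # fewer than m lieutenant signatures: relay
            relayed = sign(order, chain, self.ident)
            for peer in peers:
                if peer is not self and peer.ident not in chain:
                    peer.receive(relayed, m, peers)

    def decide(self, choice=min):
        # |V| > 1 means the commander signed two different orders: a traitor.
        return choice(self.V) if self.V else "retreat"

# A traitorous commander (id 0) sends conflicting signed orders; m = 1.
lts = [Lieutenant(i) for i in (1, 2, 3)]
lts[0].receive(sign("attack", (), 0), 1, lts)
lts[1].receive(sign("retreat", (), 0), 1, lts)
print([sorted(l.V) for l in lts], [l.decide() for l in lts])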
The DLS algorithm achieves strong unanimity for a set V with an arbitrary value under
Byzantine faults with authentication.
The algorithm progresses in phases. Each phase k consists of four consecutive
rounds, from 4k – 3 to 4k. Each phase has a unique coordinator c which leads the phase.
A simple formula is used to select the coordinator: the coordinator of phase k is the process i such that i ≡ k (mod n), where k is the phase number, i is the process number, and n is the total number of processes.
Each process maintains some variables:
• A local variable LOCK which keeps the locked value. A process may
lock a value in a phase if it believes that some process may decide on
this value. Initially, no value is locked. A phase number is associated
with every lock. In addition, a proof of acceptability of the locked
value is also associated with every lock. Proof of acceptability is in the
form of a set of signed messages sent by n − t processes, indicating
that the locked value is acceptable and proper, that is, it is in their
PROPER sets at the start of the given phase.
Round 2: Round 4k – 2
The coordinator broadcasts a message of the form E(lock, v, k, proof), where the proof
is composed of the set of signed messages E(list, k) received from the n − t processes that
found v acceptable and proper.
Round 3: Round 4k – 1
If any process receives an E(lock, v, k, proof) message, it validates the proof to
ascertain that n − t processors do find v acceptable and proper at phase k. If the proof
is valid, it locks v, associating the phase number k and the message E(lock, v, k, proof)
with the lock, and sends an acknowledgment to the current coordinator. In this
case, the processes release any earlier lock placed on v. If the coordinator receives
acknowledgments from at least 2t + 1 processors, then it decides on the value v.
Round 4: Round 4k
This is where locks are released. Processes broadcast messages of the form
E(lock, v, h, proof), indicating that they have a lock on value v with associated phase h and
the associated proof and that a coordinator sent the message at phase h, which caused
the lock to be placed. If any process has a lock on some value v with associated phase
h and receives a properly signed message E(lock, w, h’, proof′) with w ≠ v and h′ ≥ h,
then the process releases its lock on v. This means that if a most recent properly signed
message is received by a process indicating a lock on some value which is different from
its locally locked value and the phase number is either higher or equal to the current
phase number, then it will release the lock from the local locked value.
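The release rule in round 4k can be captured as a small predicate, sketched below; it only illustrates the comparison just described and assumes that signature and proof validation have been done elsewhere.

def should_release_lock(my_value, my_phase, seen_value, seen_phase, proof_is_valid):
    # Release the lock on my_value (locked in phase my_phase) if a properly
    # signed lock message for a different value arrives with phase h' >= my_phase.
    return proof_is_valid and seen_value != my_value and seen_phase >= my_phase

print(should_release_lock("v", 3, "w", 5, True))   # True: different value, later phase
print(should_release_lock("v", 3, "v", 5, True))   # False: same value, keep the lock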
Notes
Assuming that the coordinator of a phase is correct, two different values cannot be locked in the same phase because a correct coordinator will never send conflicting messages that might suggest locks on two different values.
This algorithm achieves consistency, strong unanimity, and termination under
partial synchrony, with Byzantine faults and authentication, where n ≥ 3t + 1.
Authenticated Byzantine means that failures are arbitrary, but messages can be signed
with unforgeable digital signatures.
Consistency means no two different processes decide differently. Termination means
every process eventually decides. Unanimity has two flavors, strong unanimity and weak
unanimity. Strong unanimity requires that if all processes have the same initial value
v and if any correct process decides, then it only decides on v. Weak unanimity means
that if all processes have the same initial value v and all processes are correct, then if any
process decides, it decides on v. In other words, strong unanimity means that if all initial
values are the same, for example, v, then v is the only common decision. Under weak
unanimity, this condition is expected to hold only if all processes are correct.
Ben-Or Algorithms
The Ben-Or protocol was introduced in 1983. It is named after its author Michael Ben-
Or. This was the first protocol that solved the consensus problem with probabilistic
termination under a model with a strong adversary. The Ben-Or algorithm proposed
how to circumvent the FLP result and achieve consensus under asynchrony. There are
two algorithms proposed in the paper. The first algorithm tolerates t < n/2 crash failures,
and the second algorithm tolerates t < n/5 for Byzantine failures. In other words, with
N > 2t it tolerates crash faults and achieves an agreement, and with N > 5t the protocol
tolerates Byzantine faults and reaches an agreement. The protocol achieves consensus
under the conditions described earlier, but the expected running time of the protocol
is exponential. In other words, it requires exponential running time to terminate in the
worst case because it can require exponentially many rounds to terminate. It can, however, terminate in expected constant time if the value of t is very small, that is, t = O(√n).
This protocol works in asynchronous rounds. A round simulates time because all
messages are tagged with a round number, and because of this, processes can figure out
which messages belong to which round even if they arrive asynchronously. A process
ignores any messages for previous rounds and holds messages for future rounds in a
buffer. Each round has two phases or subrounds. The first is the proposal (suggestion)
phase, where each process p transmits its value v and waits until it receives from other
n − t processes. In the second phase, called the decision (ratification) phase, the protocol
checks if a majority is observed and takes that value; otherwise, it flips a coin. If a certain
threshold of processes sees the same majority value, then the decision is finalized. In
case some other value is detected as a majority, then the processor switches to that
value. Eventually, the protocol manages to terminate because at some point enough processes will flip their coins to the same value and a majority will be reached. You may have noticed that this
protocol only considers binary decision values, either a 0 or 1. Another important aspect
to keep in mind is that the protocol cannot wait indefinitely for all processes to respond
because they could be unavailable (offline).
This algorithm works only for binary consensus. There are two variables that need
to be managed in the algorithm, a value which is either 0 or 1 and phase (p), which
represents the stage where the algorithm is currently at. The algorithm proceeds in
rounds, and each round has two subrounds or phases.
Note that each process has its own coin. This class of algorithms that utilize such
coin scheme is called local coin algorithms. Local coin tossing is implemented using a
random number generator that outputs binary numbers. Each process tosses its own
coin and outputs 0 or 1, each with probability ½. The coin is tossed by a process to pick a
new local value if a majority was not found.
The algorithm for benign faults/crash faults only – non-Byzantine:
Here, r is the round number; x is the initial preference or value proposed by the process; 1 denotes the first subround (phase) of the main round; 2 denotes the second subround (phase) of the main round; * can be 0 or 1; ? represents no majority observed; N is the number of nodes (processes); D is an indication of approval (ratification), that is, an indication that the process has observed a majority of the same value; t is the number of faulty nodes; v is the value; and coinflip() is a uniform random number generator that generates either 0 or 1.
We can visualize this protocol in the diagram shown in Figure 6-8.
Figure 6-8. Ben-Or crash fault tolerant only agreement protocol – (non-Byzantine)
If n > 2t, the protocol guarantees with probability 1 that all processes will eventually
decide on the same value, and if all processes start with the value v, then within one
round all processes will decide on v. Moreover, if in some round a process decides on
v after receiving more than t D type messages, then all other processes will decide on v
within the next round.
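The two subrounds of the crash fault–only variant can be sketched as pure functions that take the messages a process has collected and return its next state. The helper names and message shapes below are illustrative assumptions; the network layer and round bookkeeping are left out.

import random
from collections import Counter

def phase1_report(suggested_values, n):
    # Subround 1: given at least n - t suggested values, report the value that
    # has a majority of n behind it, or '?' if there is no such value.
    value, count = Counter(suggested_values).most_common(1)[0]
    return value if count * 2 > n else "?"

def phase2_update(reports, t):
    # Subround 2: reports are either D-type values (a majority was observed)
    # or '?'. Returns (next_value, decided).
    d_counts = Counter(v for v in reports if v != "?")
    if d_counts:
        value, count = d_counts.most_common(1)[0]
        if count > t:
            return value, True     # enough D reports: decide now
        return value, False        # adopt the value, decide in a later round
    return random.randint(0, 1), False   # no D report at all: flip the local coin

# One round from a single process's point of view with n = 5, t = 1:
report = phase1_report([1, 1, 0, 1], n=5)      # -> 1 (a majority of 5)
print(phase2_update([report, 1, 1, "?"], t=1)) # -> (1, True)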
The protocol described earlier works for crash faults; for tolerating Byzantine faults,
slight modifications are required, which we describe next.
The notation here is the same as for the crash fault–only algorithm described earlier: r is the round number, x is the initial preference, D indicates that a majority of the same value was observed, ? indicates that no majority was observed, t is the number of faulty nodes, and coinflip() generates either 0 or 1.
We can visualize this protocol in Figure 6-9.
In the first subround or phase of the protocol, every process broadcasts its proposed (preferred) value and awaits n − t messages. If more than (n + t)/2 processes agree, then a majority is achieved, and the preferred value is set accordingly.
In the second subround or phase of the protocol, if a majority is observed in the first subround, then an indication of majority is broadcast (2, r, v, D); otherwise, if no majority (?) was observed in the first subround, then no majority is broadcast. The protocol then waits for n – t confirmations. If at least t + 1 confirmations of a majority of either 0 or 1 are observed, then the preferred value is set accordingly. Here, only the preferred value is set, but no decision is made. A decision is made by p only if more than (n + t)/2 confirmations are received; only then is the value decided. If neither t + 1 nor (n + t)/2 confirmations are received, then the coin is flipped to choose a uniform random value, either 0 or 1.
Note that, by waiting for n – t messages, the case where Byzantine processes maliciously decide not to vote is handled. This is because in the presence of t faults, at least n − t processes are honest. In the second subround, t + 1 confirmations of a majority value mean that at least one honest process has observed a majority. In the case of more than (n + t)/2 confirmations, it means a value has been observed by a majority.
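The two thresholds of the second subround can be written down as a tiny helper: t + 1 matching confirmations are enough to adopt a value, while more than (n + t)/2 are needed to decide. This sketch covers only the decision rule, under the same illustrative assumptions as before.

import random
from collections import Counter

def ratify(confirmations, n, t):
    # confirmations: the 0/1 values carried by D-type reports in subround 2.
    if confirmations:
        value, count = Counter(confirmations).most_common(1)[0]
        if 2 * count > n + t:
            return value, True      # more than (n + t)/2 confirmations: decide
        if count >= t + 1:
            return value, False     # adopt the value but do not decide yet
    return random.randint(0, 1), False  # neither threshold met: local coin flip

print(ratify([1, 1, 1, 1], n=6, t=1))   # 4 > (6 + 1)/2, so -> (1, True)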
So, in summary, if n > 5t, this protocol guarantees with probability 1 that all processes will eventually decide on the same value, and if all processes start with the value v, then within one round all processes will decide on v. Moreover, if in some round a correct process decides on v, then all other correct processes will decide on v within the next round. Later work improved on these bounds and reduced the required number of processes to 3f + 1 for Byzantine faults, while the crash fault–only variant, with its two subrounds per round under asynchrony, already meets the 2f + 1 lower bound.
The Ben-Or algorithms described earlier do not use any cryptographic primitives
and assume a strong adversary. However, a lot of work has also been carried out
where an asynchronous Byzantine agreement is studied under the availability of
cryptographic primitives. Of course, under this model the adversary is assumed to be
always computationally bounded. Some prominent early protocols under this model
are described earlier, such as the signed message protocol and the DLS protocol for the
authenticated Byzantine failure model. There are other algorithms that process coin
tosses cooperatively or, in other words, use global or shared coin tossing mechanisms.
A shared coin or global coin is a pseudorandom coin that produces the same result at
all processes in the same round. This attribute immediately implies that convergence is
much faster in the case of shared coin–based mechanisms. A similar technique was first
used in Rabin's algorithm [14], utilizing cryptographic techniques, which reduced the
expected running time to a constant number of rounds.
After this basic introduction to early consensus protocols, I’ll now introduce early
replication protocols, which of course are fundamentally based on consensus, but can be
classified as replication protocols rather than just consensus algorithms.
We saw earlier, in Chapter 3, that replication allows multiple replicas to achieve
consistency in a distributed system. It is a method to provide high availability in a
distributed system. There are different models including primary backup replication
and active replication. You can refer to Chapter 3 to read more about state machine
replication and other techniques.
The Chandra–Toueg protocol, which solves consensus using an eventually strong failure detector, works in rounds under asynchrony with a rotating coordinator. The
protocol uses reliable broadcast which ensures that any message broadcast is either not
received (delivered) at all by any process or exactly once by all honest processes.
The algorithm works as follows.
Each process maintains some variables:
• State: whether the process has decided yet and, if so, the decision value
• Estimate: the process's current estimate of the decision value, together with the round in which it was last updated
• Round number: the current round, which also determines the current coordinator
1. In each round, every process sends its current estimate, tagged with the round in which that estimate was last updated, to the coordinator of the current round.
2. The coordinator waits for estimates from a majority of processes, adopts the estimate with the most recent (highest) round tag as its new proposal, and broadcasts it to all processes.
3. Each process waits for the new proposal (estimate) from the
current coordinator or for the failure detector to suspect the
current coordinator. If it receives a new estimate, it updates its
preference, updates the last round variable to the current round,
and sends the ack message to the current coordinator. Otherwise,
it sends nack, suspecting that the current coordinator has crashed.
4. The current coordinator waits for (n + 1)/2 – that is, a majority of – replies from the processes, either ack or nack. If the current coordinator receives a majority of acks, meaning a majority of processes has accepted its estimate, it decides on that estimate and reliably broadcasts the decision to all processes.
• Finally, any undecided process that delivers a value via the reliable
broadcast accepts and decides on that value.
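The coordinator's majority-ack rule can be expressed in a couple of lines; the helper below is only an illustration of the counting, with the failure detector and reliable broadcast abstracted away.

def coordinator_can_decide(replies, n):
    # replies: the 'ack'/'nack' responses received in this round. The coordinator
    # decides (and reliably broadcasts its estimate) only after a majority of
    # acks, i.e., at least ceil((n + 1) / 2) = n // 2 + 1 of the n processes.
    majority = n // 2 + 1
    return sum(1 for r in replies if r == "ack") >= majority

print(coordinator_can_decide(["ack", "ack", "nack", "ack"], n=7))   # False: 3 < 4
print(coordinator_can_decide(["ack"] * 4, n=7))                     # True: 4 >= 4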
Note that there are other algorithms in the paper [10] as well, but I have described
here only the one that solves consensus using an eventually strong failure detector.
Now let’s see how agreement, validity, and termination requirements are met.
The agreement is satisfied. Let’s think about a scenario where it is possible that two
coordinators broadcast, and some processes end up accepting a value from the first
coordinator and some from the other. This will violate the agreement because, here,
two processes are deciding differently, that is, two different values are both chosen.
However, this cannot occur because for the first coordinator to send a decision, it must
have received enough acknowledgments (acks) from the majority of the processes.
All subsequent coordinators looking for the majority will see an overlap with the
previous one. The estimate will be the most recent one. As such, any two coordinators
broadcasting the decision are sending out the same decision.
Validity is also satisfied because every estimate is some process’s input value. The
protocol design does not allow generating any new estimates.
The protocol eventually terminates because the failure detector, an eventually
strong failure detector, will eventually stop suspecting some correct process, which
will eventually become the coordinator. With the new coordinator, in some round, all
correct processes will wait to receive this new coordinator’s estimate and will respond
with enough ack messages. When the coordinator collects the majority of ack messages,
it will send its decided estimate to all, and all processes will terminate. Note that if
some process ends up waiting for a response from an already terminated process, it
will also eventually get the message by retransmission through other correct nodes and
eventually decide and terminate. For example, suppose a process gets stuck waiting
for messages from a crashed coordinator. Eventually, due to the strong completeness
property of the eventually strong failure detector, the failed coordinator will be
suspected, ensuring progress.
Summary
This chapter covered early protocols that provide a solid foundation for most of the
consensus research done today. With the advent of blockchains, many of these protocols
inspired the development of new blockchain age protocols, especially for permissioned
blockchains. For example, Tendermint is based on the DLS protocol, that is, algorithm 2
from the DLS paper.
We did not discuss every algorithm in this chapter, but this chapter should provide
readers with a solid foundation to build on further. To circumvent FLP impossibility,
randomness can be introduced into the system by either assuming the randomized
model or local coin flips at the processes. The first proposal that assumes a randomized
model (also called fair scheduling, randomized scheduling) mechanism is by Bracha
and Toueg [17]. Algorithms based on the second approach where processes are provided
with a local coin flip operation were first proposed by Ben-Or [2], which is the first
randomized consensus protocol. The first approach to achieve an expected constant number of rounds, using a shared (global) coin implemented with digital signatures and a trusted dealer, was published by Rabin [14]. Protocols utilizing
failure detectors were proposed by Chandra and Toueg [15]. An excellent survey of
randomized protocols for asynchronous consensus is by Aspnes [16].
Randomized protocols are a way to circumvent an FLP result, but can we refute the
FLP impossibility result altogether? Sounds impossible, but we’ll see in Chapter 9 that
refuting the FLP result might be possible.
In the next chapter, we will cover classical protocols such as PBFT, which is seen as a
natural progression from the viewstamped replication (VR) protocol, which we will also
introduce in the next chapter. While VR dealt with crash faults only, PBFT also dealt with
Byzantine faults. We’ll cover other protocols, too, such as Paxos, which is the foundation
of most if not all consensus protocols. Almost all consensus algorithms utilize the
fundamental ideas presented in Paxos in one way or another.
Bibliography
1. Fischer, M.J., Lynch, N.A., and Paterson, M.S., 1985. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 32(2), pp. 374–382.
6. Dwork, C., Lynch, N., and Stockmeyer, L., 1988. Consensus in the
presence of partial synchrony. Journal of the ACM (JACM), 35(2),
pp. 288–323.
10. Chandra, T.D. and Toueg, S., 1996. Unreliable failure detectors for
reliable distributed systems. Journal of the ACM (JACM), 43(2),
pp. 225–267.
15. Chandra, T.D. and Toueg, S., 1996. Unreliable failure detectors for
reliable distributed systems. Journal of the ACM (JACM), 43(2),
pp. 225–267.
CHAPTER 7
Classical Consensus
Consensus and replication protocols that appeared in the 1980s have made profound
contributions in consensus protocol research. Early replication protocols like
viewstamped replication provided deep insights into how fault-tolerant replication can
be designed and implemented. Around the same time, Paxos was introduced, which
offered a practical protocol with rigorous formal specification and analysis. In 1999,
the first practical Byzantine fault–tolerant protocol was introduced. This chapter covers
these classical protocols in detail, their design, how they work, and how they provide
safety and liveness guarantees. Moreover, some ideas on how and if we can use them in
the blockchain are also presented. Additionally, more recently developed protocols such as RAFT, which builds on previous classical protocols to construct an easy-to-understand consensus protocol, are also discussed.
Viewstamped Replication
A viewstamped replication approach to replicate among peers was introduced by Brian
Oki and Barbara Liskov in 1988. This is one of the most fundamental mechanisms to
achieve replication to guarantee consistency (consistent view) over replicated data. It
works in the presence of crash faults and network partitions; however, it is assumed that
eventually nodes recover from crashes, and network partitions are healed. It is also a
consensus algorithm because to achieve consistency over replicated data, nodes must
agree on a replicated state.
Viewstamped replication has two primary purposes. One is to provide a distributed
system which is coherent enough that clients perceive it as if they were communicating
with a single server. The other is to provide state machine replication. State machine
replication requires that all replicas start in the same initial state and operations are
deterministic. With these requirements (assumptions), we can easily see that if all
replicas execute the same sequence of operations, then they will end up in the same
Chapter 7 Classical Consensus
state. Of course, the challenge here is to ensure that operations execute in the same order
at all replicas even in the event of failures. So, in summary the protocol provides fault
tolerance and consistency. It is based on a primary backup copy technique.
There are three subprotocols in the viewstamped replication (VR) protocol:
• Normal operation protocol: Handles the processing of client requests
• View change protocol: Handles primary failure and starts a new view with a new primary
• Recovery protocol: Handles the recovery of a failed replica so that it can rejoin the group with an up-to-date state
VR is inspired by the two-phase commit protocol, but unlike the two-phase commit,
it’s a failure-resilient protocol and does not block if the primary (coordinator in 2PC
terminology) or replicas fail. The protocol is reliable and ensures availability if no more
than f replicas are faulty. It uses replica groups of size 2f + 1 and tolerates crash failures under
asynchrony with quorums of size f + 1.
Every replica maintains a state which contains the following information:
• The configuration and the replica number
• The current view and the current status – normal, view change, or recovering
• The op number assigned to the latest request
• A log containing entries for the requests received so far, with their op numbers
• The client table, which holds the most recent request from each client, whether it has been executed or not, and the result associated with that request
Let’s see how the normal operation works in VR. First, let’s see the list of variables
and their meanings:
• op: The operation (with its arguments) requested by the client
• c: Client ID
• s: Request number assigned by the client
• v: View number
• x: Result
Protocol Steps
1. A client sends a request message of the form <REQUEST op, c, s, v> to the primary replica.
2. The primary assigns the next op number to the request, appends the request to its log, and sends a prepare message carrying the request and its op number to the backup replicas.
3. The backups process the prepare message in op number order:
a. The prepare message is only accepted if all previous requests preceding the op number in the prepare message have entries in their log.
b. Otherwise, they wait until the missing entries are updated – via state transfer.
4. Once a backup has added the request to its log, it sends a prepareok message to the primary.
5. When the primary has received prepareok messages from f backups, it considers the operation committed, executes it, and sends the reply with the result to the client. The backups execute the operation after learning of the commit from the primary.
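A sketch of the primary's side of normal operation might look like the following; the class, message tuples, and field names are assumptions for illustration, and the network, client table checks, and commit notifications to backups are left out.

from dataclasses import dataclass, field

@dataclass
class Primary:
    view: int = 0
    op_number: int = 0
    f: int = 1                                       # tolerated crash faults
    log: list = field(default_factory=list)          # (op_number, request) entries
    prepare_oks: dict = field(default_factory=dict)  # op_number -> ack count

    def on_request(self, op, c, s):
        # Assign the next op number, append to the log, and produce the
        # PREPARE message to broadcast to the backups.
        self.op_number += 1
        self.log.append((self.op_number, (op, c, s)))
        self.prepare_oks[self.op_number] = 0
        return ("PREPARE", self.view, (op, c, s), self.op_number)

    def on_prepare_ok(self, n):
        # Once f backups have acknowledged op n, the operation is committed
        # (the primary plus f backups form the f + 1 quorum) and can be
        # executed and replied to the client.
        self.prepare_oks[n] += 1
        return self.prepare_oks[n] >= self.f

p = Primary(f=1)
print(p.on_request("x = 1", c="client-7", s=1))
print(p.on_prepare_ok(1))   # True: with f = 1, one PREPAREOK completes the quorum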
When the primary fails, the view change protocol initiates. Failure is indicated by a timeout at the replicas. The following notation is used:
• v: View number
• i: Replica identifier
View Change
A view change protocol works as follows:
1. A replica that suspects the primary has failed (via timeout) advances its view number and notifies the other replicas that a view change is starting.
2. Each replica sends a message containing its log to the primary of the new view.
3. The new primary, once it has received these messages from a quorum of replicas:
a. Chooses the most recent log in the messages and picks that as its new log
b. Sets the op number to that of the latest entry in the new log
c. Informs the other replicas that the new view has started and sends them the new log
4. The other replicas then:
a. Replace their log with the new log received from the new primary
b. Set their op number to the one in the latest entry in the log
The key safety requirement here is that all committed operations make it to the next
views with their order preserved.
VR has deliberately not been discussed in all its intricate details, as we focus more on mainstream protocols. Still, it should give you an idea about the fundamental concepts
introduced in VR, which play an essential role in almost all replication and consensus
protocols, especially PBFT, Paxos, and RAFT. When you read the following sections,
you will see how PBFT is an evolved form of VR and other similarities between VR and
different protocols introduced in this chapter. When you read the section on RAFT, you
will find a good resemblance between VR and RAFT.
Let’s look at Paxos first, undoubtedly the most influential and fundamental
consensus protocol.
Paxos
Leslie Lamport discovered Paxos. It was proposed first in 1988 and then later more
formally in 1998. It is the most fundamental distributed consensus algorithm which
allows consensus over a value under unreliable communications. In other words, Paxos
is used to build a reliable system that works correctly, even in the presence of faults.
Paxos made state machine replication more practical to implement. A version of Paxos
called multi-Paxos is commonly used to implement a replicated state machine. It runs
under a message-passing model with asynchrony. It tolerates fewer than n/2 crash faults,
that is, it meets the lower bound of 2f + 1.
Earlier consensus mechanisms did not handle safety and liveness separately.
The Paxos protocol takes a different approach to solving the consensus problem by
separating the safety and liveness properties.
There are three roles that nodes in a system running the Paxos protocol can undertake. A single process may assume all three roles:
• Proposer: Proposes a value and drives the protocol by sending prepare and accept messages to the acceptors
• Acceptor: Responds to proposers by promising and accepting proposals; a value is chosen once a majority of acceptors accept it
• Learner: Learns which value has been chosen by the acceptors
There are also some rules associated with Paxos nodes. Paxos nodes must be
persistent, that is, they must durably record their actions and must remember what they've
accepted. Nodes must also know how many acceptors make a majority.
Paxos can be seen as similar to the two-phase commit protocol. A two-phase commit
(2PC) is a standard atomic commitment protocol to ensure that the transactions are
committed in distributed databases only if all participants agree to commit. Even if a
single node does not agree to commit the transaction, it is rolled back completely.
Similarly, in Paxos, the proposer sends a proposal to the acceptors in the first phase.
Then, the proposer broadcasts a request to commit to the acceptors if and when they
accept the proposal. Once the acceptors commit and report back to the proposer, the
proposal is deemed final, and the protocol concludes. In contrast with the two-phase
commit, Paxos introduced ordering, that is, sequencing, to achieve the total order of
the proposals. In addition, it also introduced a majority quorum–based acceptance of
the proposals rather than expecting all nodes to agree. This scheme allows the protocol
to make progress even if some nodes fail. Both improvements ensure the safety and
liveness of the Paxos algorithm.
The protocol is composed of two phases, the prepare phase and the accept phase.
At the end of the prepare phase, a majority of acceptors have promised a specific
proposal number. At the end of the accept phase, a majority of acceptors have accepted a
proposed value, and consensus is reached.
The algorithm works as follows:
Phase 1 – prepare phase
• The proposer picks a new, unique proposal number n and sends a prepare(n) message to the acceptors.
• When an acceptor receives prepare(n) with n higher than any proposal number it has already responded to, it replies with a promise message, promising never to accept a proposal with a number lower than n. The promise also includes the highest-numbered proposal (number and value) the acceptor has accepted so far, if any.
• The proposer waits until it gets responses from the majority of the acceptors for n.
Phase 2 – accept phase
• Once the proposer has received promises from a majority of acceptors for n, it sends an accept(n, v) message to the acceptors, where v is the value of the highest-numbered proposal contained in the promise responses, or the proposer's own value if the promises contained no previously accepted value.
• When an acceptor receives accept(n, v), it accepts the proposal unless it has already promised a higher proposal number, records (n, v), and sends an accepted message to the proposer and the learners.
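A minimal single-decree acceptor that follows these two rules can be sketched as below. The in-memory attributes stand in for the durable storage a real acceptor must use, and the message tuples are illustrative only.

class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest proposal number promised so far
        self.accepted_n = -1      # highest proposal number accepted so far
        self.accepted_v = None    # value accepted together with accepted_n

    def on_prepare(self, n):
        # Phase 1b: promise n if it is higher than any earlier promise and
        # report the highest accepted proposal, if any, back to the proposer.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", n, (self.accepted_n, self.accepted_v))
        return ("nack", n)

    def on_accept(self, n, v):
        # Phase 2b: accept (n, v) unless a higher proposal number was promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return ("accepted", n, v)
        return ("nack", n)

a = Acceptor()
print(a.on_prepare(1))         # ('promise', 1, (-1, None))
print(a.on_accept(1, "blue"))  # ('accepted', 1, 'blue')
print(a.on_prepare(2))         # ('promise', 2, (1, 'blue')): the next proposer must reuse 'blue'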
Sometimes, there is a distinction made between the accept phase and a third
phase called the learning phase where learners learn about the decided value from the
acceptors. We have not shown that separately in the preceding algorithm, as learning
is considered part of the second phase. As soon as a proposal is accepted in the accept
phase, the acceptor informs the learners. Figure 7-3 does show a third phase called the
learn phase, but it is just for visualizing the protocol in a simpler way; learning is in fact
part of phase 2, the accept phase.
We have used the term majority indicating that a majority of acceptors have responded to or accepted a message. A majority comes from a quorum. In the majority quorum, every quorum has n/2 + 1 nodes. Also note that in order to tolerate f faulty acceptors, a set consisting of at least 2f + 1 acceptors is required. We discussed quorum systems in Chapter 3.
The protocol is illustrated in Figure 7-3.
Note that once the Paxos algorithm has reached a single consensus, it will not proceed to another consensus; another run of Paxos is needed to reach another consensus. Moreover, Paxos cannot make progress if half or more of the nodes are faulty because in such a case a majority cannot be achieved, which is essential for making progress. It is safe because once a value is agreed, it is never changed. Even though Paxos is guaranteed to be safe, liveness of the protocol is not guaranteed. The assumption here is that a large portion of the network is correct (nonfaulty) for a sufficiently long time, and then the protocol reaches consensus; otherwise, the protocol may never terminate.
Usually, learners learn the decision value directly from the acceptors; however, it is
possible that in a large network learners may learn values from each other by relaying
what some of them (a small group) have learned directly from acceptors. Alternatively,
learners can poll the acceptors at intervals to check if there’s a decision. There can also
be an elected learner node which is notified by the acceptors, and this elected learner
disseminates the decision to other learners.
Now let’s consider some failure scenarios.
Failure Scenarios
Imagine if an acceptor fails in the first phase, that is, the prepare phase, then it won’t
send the promise message back to the proposer. However, if a majority quorum can
respond back, the proposer will receive the responses, and the protocol will make
progress. If an acceptor fails in the second phase, that is, the accept phase, then the
acceptor will not send the accepted message back to the proposer. Here again, if the
majority of the acceptors is correct and available, the proposer and learners will receive
enough responses to proceed.
What if the proposer failed either in the prepare phase or the accept phase? If a
proposer fails before sending any prepare messages, there is no impact; some other
proposer will run, and the protocol will continue. If a proposer fails in phase 1, after
sending the prepare messages, then acceptors will not receive any accept messages,
because promise messages did not make it to the proposer. In this case, some other
proposer will propose with a higher proposal number, and the protocol will progress.
The old prepare will become history. If a proposer fails during the accept phase after
sending the accept message which was received by at least one acceptor, some other
proposer will send a prepare message with a higher proposal number, and the acceptor
will respond to the proposer with a promise message that an earlier value is already
accepted. At this point, the proposer will switch to proposing the same earlier value
bearing the highest accepted proposal number, that is, send an accept message with the
same earlier value.
Another scenario could be if there are two proposers trying to propose their value at
the same time. Imagine there are two proposers who have sent their prepare messages
to the acceptors. In this case, any acceptor who had accepted a larger proposal number
previously from P1 would ignore the proposal if the proposal number proposed by the
proposer P2 is lower than what acceptors had accepted before. If there is an acceptor
A3 who has not seen any value before, it would accept the proposal number from P2
even if it is lower than the proposal number that the other acceptors have received and
accepted from P1 before because the acceptor A3 has no idea what other acceptors are
doing. The acceptor will then respond as normal back to P2. However, as proposers
wait for a majority of acceptors to respond, P2 will not receive promise messages from a
majority, because A3 only is not a majority. On the other hand, P1 will receive promise
messages from the majority, because A1 and A2 (other proposers) are in the majority
and will respond back to P1. When P2 doesn’t hear from a majority, it times out and can
retry with a higher proposal number.
Now imagine a scenario where with P1 the acceptors have already reached a
consensus, but there is another proposer P2 which doesn’t know that and sends a
prepare message with a higher than before proposal number. The acceptors at this point,
after receiving the higher proposal number message from P2, will check if they have
accepted any message at all before; if yes, the acceptors will respond back to P2 with
the promise message of the form promise(nfromp2,(nfromp1, vfromp1)) containing the
previous highest proposal number they have accepted, along with the previous accepted
value. Otherwise, they will respond normally back to P2 with a promise message. When
P2 receives this message, promise(nfromp2,(nfromp1, vfromp1)), it will check the
message, and value v will become vfromp1 if nfromp1 is the highest previous proposal
number. Otherwise, P2 will choose any value v it wants. In summary, if P2 has received
promise messages indicating that another value has already been chosen, it will propose
the previously chosen value with the highest proposal number. At this stage, P2 will
send an accept message with its n and v already chosen (vfromp1). Now acceptors are
happy because they see the highest n and will respond back with an accepted message
as normal and will inform the learners too. Note that the decided value is still the previously chosen value; P2 has simply re-proposed it with the higher proposal number n.
There are scenarios where the protocol could get into a livelock state and progress
can halt. A scenario could be where two different proposers are competing with
proposals. This situation is also known as “dueling proposers.” In such cases, the liveness
of Paxos cannot be guaranteed.
Imagine we have two proposers, P1 and P2. We have three acceptors, A1, A2, and A3.
Now, P1 sends the prepare messages to the majority of acceptors, A1 and A2. A1 and A2
reply with promise messages to P1. Imagine now the other proposer, P2, also proposes
and sends a prepare message with a higher proposal number to A2 and A3. A3 and A2
send the promise back to P2 because, by protocol rules, acceptors will promise back to
the prepare message if the prepare message comes with a higher proposal number than
what the acceptors have seen before. In phase 2, when P1 sends the accept message,
A1 will accept it and reply with accepted, but A2 will ignore this message because it has
already promised a higher proposal number from P2. In this case, P1 will eventually time
out, waiting for a majority response from acceptors because the majority will now never
respond. Now, P1 will try again with a higher proposal number and send the prepare
message to A1 and A2. Assume both A1 and A2 have responded with promise messages.
Now suppose P2 sends an accept message to get its value chosen to A2 and A3. A3 will
respond with an accepted message, but A2 will not respond to P2 because it has already
promised another higher proposal number from P1. Now, P2 will time out, waiting for
the majority response from the acceptors. P2 now will try again with a higher proposal
number. This cycle can repeat again and again, and consensus will never be reached
because there is never a majority response from the acceptors to any proposers.
This issue is typically handled by electing a single proposer as the leader to
administer all clients’ incoming requests. This way, there is no competition among
different proposers, and this livelock situation cannot occur. However, electing a leader is
also not straightforward. A unique leader election is equivalent to solving consensus. For
leader election, an instance of Paxos would have to run; that election may run into a
livelock too, and we are in the same situation again. One possibility is to use a different
type of election mechanism, for example, the bully algorithm. Some other leader
election algorithms are presented in the works of Aguilera et al. We may use some other
kind of consensus mechanism to elect a leader that perhaps guarantees termination
but somewhat sacrifices safety. Another way to handle the livelock problem is to use
random exponentially increasing delays, resulting in a client having to wait for a while
before proposing again. I think these delays may well also be introduced at proposers,
which will result in one proposer taking a bit of precedence over another and getting
its value accepted before the acceptors could receive another prepare message with a
higher proposal number. Note that there is no requirement in classical Paxos to have a
single elected leader, but in practical implementations, it is commonly the case to elect a
leader. Now if that single leader becomes the single point of failure, then another leader
must be elected.
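The randomized, exponentially increasing delay mentioned above can be sketched in a few lines; the base delay and cap are arbitrary illustrative values.

import random

def backoff_delay(attempt, base=0.05, cap=2.0):
    # The k-th retry waits a random time up to base * 2^k seconds (capped),
    # which makes it unlikely that two dueling proposers keep colliding forever.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

for attempt in range(4):
    print(round(backoff_delay(attempt), 3))   # wait this long before re-proposing with a higher number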
A key point to remember is that 2f + 1 acceptors are required for f crash faults to be
tolerated. Paxos can also tolerate omission faults. If a prepare message is lost and doesn't make it to the acceptors, the proposer will wait, time out, and retry with a higher proposal number. Also, another proposer can propose in the meantime with a higher proposal
number, and the protocol can still work. Also, as only a majority of acceptor responses
are required, as long as a majority of messages (2f + 1) made it through to the proposer
from acceptors, the protocol will progress. It is however possible that due to omission
faults, the protocol takes longer to reach consensus or may never terminate under some
scenarios, but it will always be safe.
Safety and Liveness
The Paxos algorithm solves the consensus problem by achieving safety and liveness
properties. We have some requirements for each property. Under safety, we mainly
have agreement and validity. Agreement means that no two different values are
chosen. Validity, sometimes called nontriviality, means no value is decided unless
proposed by some process participating in the protocol. Another safety requirement
which stems from the validity property and may be called “valid learning” is that if a
process learns a value, the value must have been decided by a process. An agreement
ensures that all processes decide on the same value. Validity and valid learning
requirements ensure that processes decide only on a proposed value and do not trivially
choose to not decide or just choose some predefined value.
Under liveness, there are two requirements. First, the protocol eventually decides,
that is, a proposed value is eventually decided. Second, if a value is decided, the learners
eventually learn that value.
Let’s now discuss how these safety and liveness requirements are met.
Intuitively, the agreement is achieved by ensuring that a majority of acceptors
can vote for only one proposal. Imagine two different values v1 and v2 are somehow
chosen (decided). We know that the protocol will choose a value only if a majority of the
acceptors accept the same accept message from a proposer. This condition implies that
a set of majority acceptors A1 must have accepted an accept message with a proposal
(n1,v1). Also, another accept message with proposal (n2, v2) must have been accepted
by another set of majority acceptors A2. The two majority sets A1 and A2 must
intersect, meaning they have at least one acceptor in common due to the quorum
intersection rule. If n1 = n2, this common acceptor must have accepted two different
proposals with the same proposal number. Such a scenario is impossible because an
acceptor ignores any prepare or accept message carrying a proposal number it has
already accepted.
If n1 ≠ n2, say n1 < n2, and n1 and n2 are consecutive proposal rounds, then
A1 must have accepted the accept message with proposal number n1 before
A2 accepted the accept message with n2. This is because an acceptor ignores any
prepare or accept message that carries a smaller proposal number than the one it has
previously promised. Also, the value proposed by a proposer must be either the value
of the earlier accepted proposal with the highest proposal number reported in the
promise responses or the proposer’s own value if no accepted value is reported. As we know, A1 and A2
must intersect with at least one common acceptor; this common acceptor must have
accepted the accept messages for both proposals (n1,v1) and (n2,v2). This scenario is
also impossible because the acceptor would have replied with (n1,v1) in response to the
prepare message with proposal number n2, and the proposer must have selected the
value v1 instead of v2. Even with nonconsecutive proposals, any intermediate proposals
must also select v1 as the chosen value.
Validity is ensured by allowing only the input values of proposers to be proposed. In
other words, the decided value is never predefined, nor is it proposed by any other entity
that is not part of the cluster running Paxos.
Liveness is not guaranteed in Paxos due to asynchrony. However, if some synchrony
assumption, that is, a partially synchronous environment, is assumed, then progress can
be made, and termination is achievable. We assume that after GST, at least a majority of
acceptors is correct and available, messages are delivered within a known upper bound,
and a unique elected leader proposer is nonfaulty and available.
In Practice
Paxos has been implemented in many practical systems. Even though the Paxos
algorithm is quite simple at its core, it is often viewed as difficult to understand.
As a result, many papers have been written to explain it. Still, it is often considered
complicated and tricky to comprehend fully. Nevertheless, this concern does not mean
that it has not been implemented anywhere. On the contrary, it has been implemented
in many production systems, such as Google’s Spanner and Chubby. The first
deployment of Paxos was in the Petal distributed storage system. Some other examples
include Apache ZooKeeper, the NoSQL Azure Cosmos database, and Apache Cassandra.
It has proven to be one of the most effective protocols for solving the consensus
problem. It has been shown that the two-phase commit is a special case of Paxos, and
PBFT is a refinement of Paxos.
Variants
There are many variants of classical Paxos, such as multi-Paxos, Fast Paxos, Byzantine
Paxos, Dynamic Paxos, Vertical Paxos, Disk Paxos, Egalitarian Paxos, Stoppable Paxos,
and Cheap Paxos.
Multi-Paxos
In classical Paxos, even in an all-correct environment, it takes two round trips to achieve
consensus on a single value. This approach is slow, and if consensus is required on a
growing sequence of values (which is practically the case), this single value consensus
must repeatedly run, which is not efficient. However, an optimization can make classical
Paxos efficient enough to be used in practical systems. Recall that Paxos has two phases.
Once phases 1 and 2 have both run completely once, a majority of acceptors is available
to the proposer that ran this round. This proposer is now a recognized leader. Instead of
rerunning phase 1, the proposer (leader) can keep running only phase 2 with the
available majority of acceptors. As long
as it does not crash, or some other proposer doesn’t come along and propose with a
higher proposal number, this process of successive accept messages can continue. The
proposer can keep running the accept/accepted round (phase 2) with even the same
proposal number without running the prepare/promise round (phase 1). In other words,
the message delays are reduced from four to two. When another proposer comes along
or the previous one fails, this new proposer can run another round of phases 1 and 2
by following classical Paxos. When this new proposer becomes the leader by receiving
a majority from the acceptors, the basic classical Paxos protocol upgrades to multi-
Paxos, and it can start running phase 2 only. As long as there is only a single leader in the
network, no acceptor would notify the leader that it has accepted any other proposal,
which will let the leader choose any value. This condition allows omitting the first phase
when only one elected proposer is the leader.
This optimized protocol is known as multi-Paxos. A normal run of multi-Paxos is shown
in Figure 7-4.
Figure 7-4. Multi-Paxos – note the first phase, prepare phase, skipped
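The optimization can be sketched in a few lines. This is only an illustration under the assumptions above (a stable leader and a known majority of acceptors); send_prepare and send_accept are hypothetical messaging helpers, not part of any specific implementation.

```python
class MultiPaxosProposer:
    """Sketch: once a leader has run phase 1 successfully, it keeps running
    only phase 2 (accept/accepted) for successive log slots."""

    def __init__(self, send_prepare, send_accept):
        self.send_prepare = send_prepare   # hypothetical: runs phase 1, True on majority promise
        self.send_accept = send_accept     # hypothetical: runs phase 2, True on majority accept
        self.is_leader = False
        self.proposal_number = 0
        self.next_slot = 0

    def become_leader(self):
        # Full classical Paxos round: phase 1 establishes leadership.
        self.proposal_number += 1
        self.is_leader = self.send_prepare(self.proposal_number)
        return self.is_leader

    def propose(self, value):
        # Multi-Paxos fast path: while leadership holds, phase 1 is skipped.
        if not self.is_leader and not self.become_leader():
            return False
        if self.send_accept(self.proposal_number, self.next_slot, value):
            self.next_slot += 1              # value chosen for this slot
            return True
        self.is_leader = False               # preempted by a higher proposal number
        return False
```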
RAFT
RAFT is designed in response to shortcomings in Paxos. RAFT stands for Replicated And
Fault Tolerant. The authors of RAFT had the main aim of developing a protocol which is
easy to understand and easy to implement. The key idea behind RAFT is to enable state
machine replication with a persistent log. The state of the state machine is determined
by the persistent log. RAFT allows cluster reconfiguration which enables cluster
membership changes without service interruption. Moreover, as logs can grow quite
large on high-throughput systems, RAFT allows log compaction to alleviate the issues of
excessive storage consumption and slow rebuilds after node crashes.
RAFT operates under a system model with the following assumptions:
• No Byzantine failures.
• Deterministic state machine on each node that starts with the same
initial state on each node.
• The client must communicate only with the current leader. This is
the client’s responsibility, as clients know all nodes and are statically
configured with this information.
Time in RAFT is logically divided into terms. A term (or epoch) is basically a
monotonically increasing value which acts as a logical clock to achieve global partial
ordering on events in the absence of a global synchronized clock. Each term starts with
an election of a new leader, where one or more candidates compete to become the
leader. Once a leader is elected, it serves as a leader until the end of the term. The key
role of terms is to identify stale information, for example, stale leaders. Each node stores
a current term number. When current terms are exchanged between nodes, it is checked
if one node’s current term number is lower than the other node’s term number; if it is,
then the node with the lower term number updates its current term to the larger value.
When a candidate or a leader finds out that its current term number is stale, it transitions
its state to follower mode. Any requests with a stale term number received by a node are
rejected.
Terms can be visualized in Figure 7-5.
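The term bookkeeping just described reduces to two simple rules, shown in the sketch below (field names are illustrative, not taken from any particular RAFT implementation).

```python
FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"

class RaftNode:
    def __init__(self):
        self.current_term = 0        # monotonically increasing, persisted term number
        self.state = FOLLOWER
        self.voted_for = None

    def observe_term(self, received_term: int) -> bool:
        """Apply the term rules to a term number seen in any incoming RPC.

        Returns True if the message should be processed further and
        False if it must be rejected as stale."""
        if received_term < self.current_term:
            return False                     # stale term: reject the request
        if received_term > self.current_term:
            self.current_term = received_term
            self.voted_for = None            # new term, no vote cast yet
            self.state = FOLLOWER            # a stale candidate or leader steps down
        return True
```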
The RAFT protocol works using two RPCs: the AppendEntries RPC, which is invoked by
the leader to replicate log entries and also serves as a heartbeat, and the RequestVote
RPC, which is invoked by candidates to collect votes.
RAFT consists of two phases: leader election and log replication. In the first phase, a
leader is elected; in the second phase, the leader accepts clients’ requests, updates the
logs, and sends heartbeats to all followers to maintain its leadership.
First, let’s see how leader election works.
Leader Election
A heartbeat mechanism is used to trigger a leader election process. All nodes start up as
followers. Followers will run as followers as long as they keep receiving valid RPCs from
a leader or a candidate. If a follower does not receive heartbeats from the leader for some
time, then an “election timeout” occurs, which indicates that the leader has failed. The
election timeout is randomly set to be between 150ms and 300ms.
Now the follower node undertakes the candidate role and attempts to become
the leader by starting the election process. The candidate increments the current
term number, votes for itself, resets election timer, and seeks votes from others via the
RequestVote RPC. If it receives votes from the majority of the nodes, then it becomes the
leader and starts sending heartbeats to other nodes, which are now followers. If another
candidate has won and become the valid leader, then this candidate starts receiving
heartbeats and returns to the follower role. If no one wins the election and the election
timeout occurs, the election process starts again with a new term.
Note that the receiver node grants its vote in response to the RequestVote RPC only
if the candidate’s log is at least as up to date as the receiver’s log. Also, the receiver
replies “false” if the received term number is lower than its current term.
The specific process of a leader election is shown in Figure 7-6.
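As an illustration only, the vote-granting rule can be written as a small function. The parameter names are assumptions of mine; persistence and RPC plumbing are omitted.

```python
def handle_request_vote(current_term, voted_for, my_last_log_index, my_last_log_term,
                        candidate_term, candidate_id,
                        candidate_last_log_index, candidate_last_log_term):
    """Return (vote_granted, voted_for) following the rules described above:
    reject stale terms, vote at most once per term, and require the
    candidate's log to be at least as up to date as the receiver's."""
    if candidate_term < current_term:
        return False, voted_for                          # stale term: reply "false"
    log_ok = (candidate_last_log_term > my_last_log_term or
              (candidate_last_log_term == my_last_log_term and
               candidate_last_log_index >= my_last_log_index))
    if log_ok and voted_for in (None, candidate_id):
        return True, candidate_id                        # grant the vote, remember the choice
    return False, voted_for
```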
A node can be in one of three states: follower, candidate, or leader. We can visualize the
server states in the state diagram shown in Figure 7-7, which also shows leader election.
Once a leader is elected, it is ready to receive requests from clients. Now the log
replication can start.
Log Replication
The log replication phase of RAFT is straightforward. First, the client sends commands/
requests to the leader to be executed by the replicated state machines. The leader
then assigns a term and index to the command so that the command can be uniquely
identified in the logs held by nodes.
The leader appends this command to its own log and, at the same time, sends out
requests to replicate the command to the follower nodes via the AppendEntries RPC.
When the leader has replicated the command to a majority of the follower nodes,
that is, a majority has acknowledged it, the entry is considered committed on the
cluster. Now the
leader executes the command in its state machine and returns the result to the client. It
also notifies the followers that the entry is committed via the AppendEntries RPC, and
the followers execute committed commands in their state machines. A set of logs from
five nodes is shown in Figure 7-8.
Notice that entries up to log index number 6 are replicated on a majority of servers
as the leader, follower 3, and follower 4 all have these entries, resulting in a majority –
three out of five nodes. This means that they are committed and are safe to apply to their
respective state machines. The log on followers 1 and 2 is not up to date, which could
be due to a fault on the node or communication link failure. If there is a crashed or slow
follower, the leader will keep retrying via the AppendEntries RPC until it succeeds.
The log replication process is shown in Figure 7-9.
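The commit rule (an entry is committed once it is replicated on a majority) can be sketched as follows. The match_index bookkeeping, the leader's record of how far each follower's log matches its own, follows the RAFT paper; the surrounding plumbing is omitted.

```python
def advance_commit_index(commit_index, match_index, leader_last_index, cluster_size):
    """Return the highest log index replicated on a majority of the cluster.

    match_index maps follower id -> highest index known to be replicated there;
    the leader's own log counts toward the majority. (The full protocol also
    requires the entry at this index to carry the leader's current term.)"""
    majority = cluster_size // 2 + 1
    replicated = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    candidate = replicated[majority - 1]       # replicated on at least `majority` servers
    return max(commit_index, candidate)        # the commit index never moves backward

# Example with five nodes: three of them (including the leader) have reached index 6.
print(advance_commit_index(0, {"f1": 6, "f2": 6, "f3": 3, "f4": 2}, 8, 5))  # -> 6
```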
Guarantees and Correctness
Guarantees provided by RAFT are
• Election correctness
• Log matching: If two logs on two different servers have an entry with
the same index and term, then these logs are identical in all previous
entries, and they store the same command.
Election correctness requires safety and liveness. Safety means that at most one
leader is allowed per term. Liveness requires that some candidate must win and become
a leader eventually. To ensure safety, each node votes only once per term and persists
its vote on storage. A majority is required to win the election, so no two different
candidates can obtain a majority in the same term.
Split votes can occur during leader election. If two followers become candidates
simultaneously, the votes may be divided so that no candidate obtains a majority; this is
the so-called “split vote.” RAFT uses randomized election timeouts to ensure that this
problem resolves quickly. This helps because, with random timeouts, usually only one
node times out first and can win the election before the other nodes time out. In
practice, this works well if the election timeout is large compared to the network
broadcast time.
Log matching achieves a high level of consistency between logs. We assume that the
leader is not malicious. A leader will never add more than one entry with the same index
and same term. Log consistency checks ensure that all previous entries are identical.
The leader keeps track of the latest index that it has committed in its log. The leader
broadcasts this information in every AppendEntries RPC. The RPC also carries the index
and term of the entry immediately preceding the new entries; if a follower does not have
a matching entry in its log, it will not accept the incoming entries. However, if the
follower accepts the AppendEntries RPC, the leader knows that the logs are identical on
both up to that point. Logs are generally consistent unless there are failures on the
network. In that case, the log consistency check ensures that nodes eventually catch up
and become consistent. If a log is inconsistent, the leader will retransmit missing entries
to followers that may not have received the message before or crashed and now have
recovered.
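A sketch of the follower-side consistency check follows; the simple list-of-pairs log representation is purely illustrative, and in the actual RPC the prevLogIndex and prevLogTerm fields carry the index and term of the entry preceding the new ones.

```python
def append_entries(log, prev_log_index, prev_log_term, entries):
    """Follower-side consistency check, simplified.

    log is a list of (term, command) pairs; index 1 of the RAFT log maps to
    position 0 of the list. Returns (success, new_log). The follower rejects
    the call if it has no entry at prev_log_index with term prev_log_term."""
    if prev_log_index > 0:
        if len(log) < prev_log_index or log[prev_log_index - 1][0] != prev_log_term:
            return False, log              # inconsistent: leader backs up and retries
    # Drop any conflicting suffix and append the leader's entries.
    new_log = log[:prev_log_index] + list(entries)
    return True, new_log

# A follower missing the preceding entry rejects the RPC; the leader will
# retry with earlier entries until the logs match.
ok, _ = append_entries([(1, "x=1")], prev_log_index=2, prev_log_term=1, entries=[(1, "y=2")])
print(ok)  # False
```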
Reconfiguration and log compaction are two useful features of RAFT. I have not
discussed those here as they are not related directly to the core consensus protocol. You
can refer to the original RAFT paper mentioned in the bibliography for more details.
PBFT
Remember, we discussed the oral message protocol and the Byzantine generals
problem earlier in the book. While it solved the Byzantine agreement, it was not a
practical solution. The oral message protocol only works in synchronous environments,
and computational complexity (runtime) is also high unless there is only one faulty
processor, which is not practical. However, systems show some level of communication
and processor asynchrony in practice. A very long algorithm runtime is also
unacceptable in real environments.
A practical solution was developed by Castro and Liskov in 1999 called practical
Byzantine fault tolerance (PBFT). As the name suggests, it is a protocol designed to
provide consensus in the presence of Byzantine faults. Before PBFT, Byzantine fault
tolerance was considered impractical. With PBFT, the duo demonstrated that practical
Byzantine fault tolerance is possible for the first time.
PBFT comprises three subprotocols: normal operation, view change, and
checkpointing. The normal operation subprotocol refers to the mechanism executed
when everything is running normally, and the system is error-free. The view change is a
subprotocol that runs when a faulty leader node is detected in the system. Checkpointing
is used to discard the old data from the system.
The PBFT protocol consists of three phases. These phases run one after another to
complete a single protocol run. These phases are pre-prepare, prepare, and commit,
which we will cover in detail shortly. In normal conditions, a single protocol run is
enough to achieve consensus.
The protocol runs in rounds where, in each round, a leader node, called the
primary node, handles the communication with the client. In each round, the protocol
progresses through the three previously mentioned phases. The participants in the PBFT
protocol are called replicas. One of the replicas becomes primary as a leader in each
round, and the rest of the nodes act as backups. PBFT enables state machine replication,
which we discussed earlier. Each node maintains a local log, and the logs are kept in sync
with each other via the consensus protocol, PBFT.
We know by now that to tolerate Byzantine faults, the minimum number of nodes
required is n = 3f + 1 in a partially synchronous environment, where n is the number of
nodes and f is the number of faulty nodes. PBFT ensures Byzantine fault tolerance as
long as the number of nodes in a system stays n ≥ 3f + 1.
When a client sends a request to the primary (leader), a sequence of operations
between replicas runs, leading to consensus and a reply to the client.
This sequence of operations is composed of three phases:
• Pre-prepare
• Prepare
• Commit
In addition, each replica maintains a local state containing three main elements:
• A service state
• A message log
• The current view number
If all these checks pass, the backup replicas accept the message, update their local
state, and move to the prepare phase.
In summary, the pre-prepare phase assigns a unique sequence number to the client
request. We can think of it as an orderer that imposes an order on the client requests.
Prepare phase – phase 2
Each backup replica sends the prepare message to all other replicas in the system.
Each backup replica waits for at least 2f + 1 prepare messages to arrive from other
replicas. They check the view number, sequence number, digital signature, and message
digest of these messages.
If all these checks pass, the replica updates its local state and moves to the
commit phase.
This phase ensures that honest replicas in the network agree on the total order of
requests within a view.
Commit phase
Each replica sends a commit message to all other replicas in the network in the
commit phase. Like the prepare phase, replicas wait for 2f + 1 commit messages to
arrive from other replicas. The replicas also check the view number, sequence number,
digital signature, and message digest values. If they are valid for 2f + 1 commit messages
received from other replicas, the replica executes the request, produces a result, and
finally updates its state to reflect a commit. If some messages are queued up, the replica
will execute those requests first before processing the latest sequence numbers. Finally,
the replica sends the result to the client in a reply message.
The client accepts the result only after receiving 2f + 1 reply messages containing the
same result.
The commit subprotocol steps are as follows:
• The replica waits for 2f + 1 prepare messages with the same view,
sequence, and request.
This phase ensures that honest replicas in the network agree on the total order of
client requests across views.
In essence, the PBFT protocol ensures that enough replicas process each request so
that the same requests are processed and in the same order.
We can visualize the normal mode of operation of the protocol in Figure 7-10.
During the execution of the protocol, the protocol must maintain the integrity of the
messages and operations to deliver an adequate level of security and assurance. Digital
signatures fulfill this requirement. It is assumed that digital signatures are unforgeable,
and hash functions are collision resistant. In addition, certificates are used to ensure the
proper majority of participants (nodes).
Certificates in PBFT
Certificates in PBFT protocols establish that at least 2f + 1 replicas have stored the
required information. In other words, the collection of 2f + 1 messages of a particular
type is considered a certificate. For example, suppose a node has collected 2f + 1
messages of type prepare. In that case, combining it with the corresponding pre-prepare
message with the same view, sequence, and request represents a certificate, called a
prepared certificate. Likewise, a collection of 2f + 1 commit messages is called a commit
certificate.
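The notion of a certificate can be expressed as a simple predicate over the message log, as in the sketch below; the dictionary-based message format is an assumption made purely for illustration.

```python
def has_prepared_certificate(message_log, view, seq, digest, f):
    """True if the log holds a matching pre-prepare message plus at least
    2f + 1 prepare messages for the same (view, sequence, request digest)."""
    pre_prepared = any(m["type"] == "pre-prepare" and
                       (m["view"], m["seq"], m["digest"]) == (view, seq, digest)
                       for m in message_log)
    preparers = {m["replica"] for m in message_log
                 if m["type"] == "prepare"
                 and (m["view"], m["seq"], m["digest"]) == (view, seq, digest)}
    return pre_prepared and len(preparers) >= 2 * f + 1
```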
There are also several variables that the PBFT protocol maintains to execute the
algorithm. These variables and their meanings are listed as follows:
• v: View number
• t: Timestamp
• c: Client identifier
• r: Reply
Let’s now look at the types of messages and their formats. These messages are easy to
understand if we refer to the preceding variable list.
Types of Messages
The PBFT protocol works by exchanging several messages. A list of these messages is
shown in Table 7-1 with their format and direction.
Note that all messages are signed with digital signatures, which enable every node to
identify which replica or client generated any given message.
View Change
A view change occurs when a primary replica is suspected faulty by other replicas. This
phase ensures protocol progress. A new primary is selected with a view change, which
starts normal mode operation again. The new primary is chosen in a round-robin
fashion using the formula p = v mod n, where v is the view number and n is the total
number of nodes in the system.
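The round-robin selection is just the modulus above; a one-line helper makes the rotation explicit.

```python
def next_primary(view: int, n: int) -> int:
    """New primary for a given view, chosen round-robin: p = v mod n."""
    return view % n

# With n = 4 replicas (ids 0..3), views 0..4 map to primaries 0, 1, 2, 3, 0.
print([next_primary(v, 4) for v in range(5)])  # [0, 1, 2, 3, 0]
```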
When a backup replica receives a request, it tries to execute it after validating the
message, but if, for any reason, the request is not executed for a while, the replica times
out. It then initiates the view change subprotocol.
During the view change, the replica stops accepting messages related to the current
view and updates its state to a view change. The only messages it can receive in this state
are checkpoint, view change, and new view messages. After that, it broadcasts a view
change message with the next view number to all replicas.
When this message reaches the new primary, the new primary waits for at least 2f view
change messages from other replicas for the next view. Once 2f + 1 view change
messages (including its own) are acquired, it broadcasts a new view message to all
replicas and runs normal operation mode once again.
When other replicas receive a new view message, they update their local state
accordingly and start the normal operation mode.
The algorithm for the view change protocol is as follows:
1. Stop accepting pre-prepare, prepare, and commit messages for
the current view.
2. Update the local state to view change.
3. Broadcast a view change message with the next view number and
a set of all the prepared certificates to all replicas.
The view change subprotocol is a means to achieve liveness. Three clever techniques
are used in this subprotocol to ensure it:
1. A replica that has broadcast the view change message waits for
2f + 1 view change messages and then starts its timer. If the timer
expires before the node receives a new view message for the next
view, the node starts a view change for the following view number
but increases its timeout value. This will also happen if the
replica times out before executing the first new request in the
new view.
2. If a replica receives f + 1 view change messages for views greater
than its current view, it sends a view change message for the
smallest view in the set so that the next view change does not
occur too late. This is the case even if its timer has not expired; it
will still send the view change for the smallest view.
3. As the view change will only occur if at least f + 1 replicas have
sent the view change message, this mechanism ensures that a
faulty primary cannot indefinitely stop progress by successively
requesting view changes.
Strengths
• PBFT provides immediate and deterministic transaction finality. In
comparison, in the PoW protocol, several confirmations are required
to finalize a transaction with high probability.
Weaknesses
• PBFT is not very scalable. This limitation is why it is more suitable
for consortium networks than public blockchains. It is, however,
considerably faster than PoW protocols.
Safety and Liveness
Liveness means that a client eventually gets a response to its request if the message
delivery delay does not increase quicker than the time itself indefinitely. In other words,
the protocol ensures progress if latency increases slower than the timeout threshold.
A Byzantine primary may induce delay on purpose. However, this delay cannot
be indefinite because every honest replica has a view change timer. This timer starts
whenever the replica receives a request. Suppose the replica times out before the request
is executed; the replica suspects the primary replica and broadcasts a view change
message to all replicas. As soon as f + 1 replicas suspect the primary as faulty, all honest
replicas enter the view change process. This scenario will result in a view change, and
the next replica will take over as the primary, and the protocol will progress.
Liveness is guaranteed as long as no more than (n − 1)/3 replicas are faulty and the
message delay does not grow faster than the time itself. It means that the protocol will
eventually make progress with the preceding two conditions. This weak synchrony
assumption is closer to realistic environments and enables the system to circumvent
the FLP result. A clever trick here is that if the view change timer expires before a replica
receives a valid new view message for the expected new view, the replica doubles the
timeout value and restarts its view change timer. The idea is that doubling the timeout
makes the replica wait longer, as the message delays might be larger. Ultimately, the
timer values become larger than the message delays, meaning
messages will eventually arrive before the timer expires. This mechanism ensures that
eventually a new view will be available on all honest replicas, and the protocol will make
progress.
Also, a Byzantine primary cannot do frequent view changes successively to slow
down the system. This is so because an honest replica joins the view change only when
it has received at least f + 1 view change messages. As there are at most f faulty replicas,
only f replicas cannot cause a view change when all honest replicas are live, and the
protocol is making progress. In other words, as at most f successive primary replicas can
be faulty, the system eventually makes progress after at most f + 1 view changes.
Replicas wait for 2f + 1 view change messages and only then start the timer for the new
view, which avoids starting the next view change too soon. Similarly, if a replica receives
f + 1 view change messages for a view greater than its current view, it broadcasts a view
change itself. This prevents starting the next view change too late.
Safety requires that each honest replica execute the received client requests in the
same total order, that is, all honest replicas execute the same requests in the same order.
PBFT is safe if the total number of nodes is at least 3f + 1, in which case f Byzantine
nodes are tolerated.
Let’s first recall what a quorum intersection is. If there are two sets, say S1 and S2,
with ≥2f + 1 nodes each, then there is always a correct node in S1 ∩ S2. This is true
because if there are two sets of at least 2f + 1 nodes each, and there are 3f + 1 nodes in
total, then the pigeonhole principle implies that the intersection of S1 and S2 will have
at least f + 1 nodes. As there are at most f faulty nodes, the intersection, S1 ∩ S2 must
contain at least 1 correct node.
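The pigeonhole argument is easy to check numerically, as in the small illustration below.

```python
def min_quorum_overlap(n: int, quorum: int) -> int:
    """Smallest possible intersection of two quorums of the given size drawn
    from n replicas (pigeonhole principle: |S1| + |S2| - n)."""
    return max(0, 2 * quorum - n)

for f in range(1, 5):
    n, q = 3 * f + 1, 2 * f + 1
    # The overlap is f + 1, so with at most f faulty replicas the
    # intersection always contains at least one correct replica.
    print(f"f={f}: n={n}, quorum={q}, minimum overlap={min_quorum_overlap(n, q)}")
```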
Each phase in PBFT must acquire 2f + 1 votes (a certificate) to be accepted. It turns out
that for a safety violation to occur, at least one honest node would have to vote twice on
the same sequence number, which is not possible because an honest node will not do so.
In other words, if the same sequence number is assigned to two different messages by a
malicious primary to violate safety, then at least one honest replica will reject it due to
a quorum intersection property. This is because a 2f + 1 quorum means that there is at
least one honest intersecting replica.
The commit phase ensures that the correct order is achieved even across views. If a
view change occurs, the new primary replica acquires prepared certificates from 2f + 1
replicas, which ensures that the new primary gets at least one prepared certificate for
every client request executed by a correct replica.
Summary
In this chapter, we covered a number of topics including viewstamped replication,
practical Byzantine fault tolerance, RAFT, and Paxos. Paxos and viewstamped replication
are fundamentally important because they introduced foundational ideas in the history
of the distributed consensus problem. Paxos especially provided a formal description
and proofs of protocol correctness. VR bears resemblance to multi-Paxos. RAFT is a
refinement of Paxos. PBFT is in fact seen as a Byzantine-tolerant version of Paxos,
though PBFT was developed independently.
This chapter serves as a foundation for understanding classical protocols before we
turn to blockchain age protocols in the next chapter. Many of the ideas that led to the
development of newer blockchain protocols originate from these classical protocols.
Bibliography
1. A Google TechTalk, 2/2/18, presented by Luis Quesada Torres.
https://youtu.be/d7nAGI_NZPk
8. https://en.wikipedia.org/wiki/Paxos_(computer_science)
10. Howard, H., 2014. ARC: Analysis of Raft Consensus (No. UCAM-
CL-TR-857). University of Cambridge, Computer Laboratory.
11. Ongaro, D. and Ousterhout, J., 2015. The Raft consensus algorithm.
CHAPTER 8
Blockchain Age Protocols
Introduction
Consensus protocols are at the core of any blockchain. A new class of consensus protocols
emerged with Bitcoin. Therefore, we can categorize all consensus protocols for a
blockchain that emerged with and after Bitcoin as “blockchain age consensus protocols.”
The primary aim of a consensus protocol in a blockchain is to achieve an agreement
on the state of the blockchain while preserving the safety and liveness of the system. The
state generally refers to the value, history, and rules of the blockchain. An agreement on
the canonical history of the blockchain is vital, and so is the agreement on the governing
rules of the chain. Additionally, consensus on values (data) added to the chain is
fundamentally critical.
Like traditional pre-blockchain protocols, safety and liveness are two key properties
that should be fulfilled by a consensus protocol to ensure the consistency and progress
of the blockchain.
Blockchain consensus protocols can be divided into two main categories: the
probabilistic finality protocols and absolute finality protocols – in other words,
probabilistic termination protocols and deterministic termination protocols.
Probabilistic protocols are abundantly used in cryptocurrency public blockchains
like Ethereum and Bitcoin. Deterministic protocols, usually from the BFT class of
protocols, are commonly used in enterprise blockchains; however, they are also
used in some public blockchains. While PBFT variants are more common in enterprise
blockchains, their usage in public chains is limited to only a few. For example, TowerBFT
used in Solana is a deterministic finality consensus protocol. BFT-DPOS used in EOSIO
is another example. Deterministic finality is also known as forward security: a guarantee
that a transaction, once finalized, will not be rolled back.
From the perspective of how the consensus algorithms work, blockchains or
distributed ledgers are based on one or a combination of the following types of
consensus algorithms:
After Bitcoin’s inception, many blockchains emerged, and alternative PoW algorithms
were introduced, for example, Litecoin. As PoW consumes excessive amounts of energy,
the community felt very early that alternatives that are not excessively energy consuming
needed to be designed. In this effort to introduce less energy-consuming protocols,
developers introduced proof of stake. With PoS, it has become possible to build
sustainable public blockchain networks. There are, however, some challenges and
caveats. After going through the mechanics of how PoS works, we will discuss these
limitations.
Proof of Stake
Even though Bitcoin’s PoW has proven to be a resilient and robust protocol, it has several
limitations:
calculated and a stakeholder is selected as the block proposer, the block proposed by the
proposer is readily accepted. The higher the stake, the better the chances of winning the
right to propose the next block.
A general PoS scheme is shown in Figure 8-1.
As shown in Figure 8-1, PoS uses a stake calculator function to calculate the amount
of staked funds, and, based on that, it selects a new proposer.
The next proposer is usually elected randomly. Proposers are incentivized either
with transaction fees or block rewards. To attack the network, an attacker would need to
control a large portion of the total stake.
Some element of randomness is introduced in the selection process to ensure
fairness and decentralization. Other factors in electing a proposer include the age of the
tokens, which takes into account how long the staked tokens have been unspent; the
longer the tokens have been unspent, the better the chances of being elected.
There are several types of PoS:
• Chain-based PoS
• BFT-based PoS
• Committee-based PoS
Chain-Based PoS
This scheme is the first alternative proposed to PoW. It was used first in Peercoin in 2012.
This mechanism is like PoW; however, the block generation method is changed, which
finalizes blocks in two steps:
• Set up a clock with a constant tick interval. At each clock tick, check
whether the hash of the block header concatenated with the clock
time is less than the product of the target value and the stake value.
We can show this simple formula as follows:
hash(block_header || clock_time) < target × stake_value
The stake value depends on how the algorithm works. In some chains, it is
proportional to the amount of stake. In others, it is based on the amount of time the
participant has held the stake. The target is the mining difficulty per unit of the value of
the stake.
This mechanism uses hashing puzzles, as in PoW. But, instead of competing to solve
the hashing puzzle by consuming high energy and using specialized hardware, the hashing
puzzle in PoS is solved only once at regular clock intervals. A hashing puzzle becomes
proportionally easier to solve if the stake value of the miner is high. This contrasts with
PoW where repeated brute-force hashing is required to solve the math puzzle.
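A toy version of the per-tick check is shown below; it is purely illustrative, and real chains differ in how the header, time, target, and stake value are encoded.

```python
import hashlib
import struct
import time

def pos_eligible(block_header: bytes, clock_time: int, target: int, stake_value: int) -> bool:
    """Chain-based PoS check: hash(block_header || clock_time) < target * stake_value.

    A larger stake scales the threshold up, so a block producer with more
    stake is proportionally more likely to qualify at any given tick; the
    check runs once per clock tick rather than in a brute-force loop."""
    digest = hashlib.sha256(block_header + struct.pack(">Q", clock_time)).digest()
    return int.from_bytes(digest, "big") < target * stake_value

# One illustrative tick: the same header and time, two different stake values.
header, now = b"example-block-header", int(time.time())
print(pos_eligible(header, now, target=2**247, stake_value=1),
      pos_eligible(header, now, target=2**247, stake_value=500))
```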
Committee-Based PoS
In this scheme, a group of stakeholders is chosen randomly, usually using a verifiable
random function (VRF). The VRF produces a random set of stakeholders based on their
stake and the current state of the blockchain. The chosen group of stakeholders becomes
responsible for proposing blocks in sequential order.
• When its turn comes, the chosen stakeholder collects transactions,
generates a block, appends the new block to the chain, and finally
broadcasts the block.
• The receiving nodes verify the block; if valid, they append the block
to their blockchain and gossip the block to others.
BFT-Based PoS
In this scheme, blocks are generated using a proof of stake mechanism in which a block
proposer is chosen to propose new blocks. The proposer is elected based on the stake
deposited in the system, and the chance of being chosen is proportional to the amount
of that stake. The proposer generates a block and appends it to a temporary pool of
blocks from which the BFT protocol finalizes one block.
A general scheme works as follows:
• Receiver: When other nodes receive the proposed block, they
validate it and, if valid, add it to the local temporary block pool.
Delegated PoS
DPoS works like proof of stake, but a critical difference is a voting and delegation
mechanism which incentivizes users to secure the network. DPoS limits the size of
the chosen consensus committee, which reduces the communication complexity in
the protocol. The consensus committee is composed of so-called delegates elected by
a delegation mechanism. The process works by stakeholders voting for delegates by
using their stake. Delegates (also called witnesses) are identifiable, and voters know
who they are, thus reducing the delegates’ chance of misbehavior. Also, a reputation-
based mechanism can be implemented, allowing delegates to earn a reputation based
on the services they offer and their behavior on the network. Delegates can also promote
themselves to earn more votes. Delegates who get the most votes become members
of the consensus committee or group. Usually, a BFT-style protocol runs between the
members of the chosen consensus committee to produce and finalize blocks. Members
take turns in a round-robin fashion to propose the next block, but this activity remains
within the elected consensus committee. Delegates earn incentives for producing blocks.
Again, under the BFT assumptions, the protocol within the consensus committee can
tolerate f faults in a 3f + 1 member group. In other words, it can tolerate one-third, or
roughly 33%, of delegates being faulty. This protocol provides instant finality and
incentives in proportion to the stake of the stakeholders. As network-wide consensus is
not required and only a smaller group of delegates oversees decision making, the efficiency
increases significantly. Delegated PoS is implemented in EOS, Lisk, Tron, and quite a few
other chains.
Liquid PoS
LPoS is a variant of DPoS. Token holders delegate their validation rights to validators
without transferring the ownership of their tokens. There exists a delegation market
where delegates compete to become the chosen validator. Here, the competition
is primarily on fees, services offered, reputation, payout frequency, and possibly other
factors. Any misbehavior such as charging high fees by a validator is detectable quickly
and will be penalized accordingly. Token holders are also free to move to any other
validator. LPoS supports a dynamic number of validators as compared to DPoS’s fixed
validator set. Token holders are also allowed to become validators themselves by self-
electing. Token holders with small amounts can delegate to holders with larger amounts,
and a number of small token holders can form a syndicate. Such a “liquid” protocol
allows much more flexibility than other PoS protocols and helps to thwart the creation
of lobbies that become a fixed validator set. LPoS is used in the Tezos blockchain.
There are some attacks against PoS, such as the nothing-at-stake attack, long-range
attack, and stake grinding attack. We explain these attacks as follows.
Attacks
PoS suffers generally from a costless simulation problem where an adversary can
simulate any history of the chain without incurring any additional cost, as opposed to
PoW where the cost is computational power. This no-cost block generation is the basis of
many attacks in PoS.
Nothing-at-Stake Problem
The nothing-at-stake or double bet problem occurs when multiple forks exist. An
attacker can generate a block on top of each fork without any additional cost. If a
significant number of nodes do this, then an attacker holding even less than 50% of the
tokens can launch a double-spend attack. To solve this problem, protocols introduce
economic penalties that prevent attackers from launching this attack.
Long-Range Attacks
Long-range attacks exist due to weak subjectivity and costless simulation. In such an
attack, an adversary creates a new branch starting from the genesis block with the aim
of taking over the main good chain once the bad chain becomes longer than the real
main chain. This can create an alternate history, which is detrimental to the blockchain.
A weak subjectivity problem affects new nodes and the nodes which were offline
for a long time and rejoined the network. As nodes are not synchronized and there are
usually multiple forks available in the network, these nodes are unable to differentiate
between which fork is correct and which one is malicious; they may well accept a
malicious fork as valid.
Other Attacks
Liveness denial is another attack that PoS can suffer from. In this attack, some or all
validators collectively decide to stop validating the blocks, resulting in halting block
production. Penalizing such activities by the protocol can prevent these types of attacks.
A selfish mining or block withholding attack occurs when an adversary mines their
own chain offline. Once the chain is at a desired length, the adversary releases this chain
to the network with the expectation that the bad chain will take over the main good
chain. It can cause disruption on the network as it results in honest validators wasting
resources.
A grinding attack on PoS occurs if a slot leader election process is not random. If no
randomness is introduced in this process, then a slot leader can increase the frequency
of its own election again and again, which can result in censorship or disproportionate
rewards. An easy way to solve this is to use some good random selection process, usually
based on verifiable random functions (VRFs).
Next, we discuss Ethereum’s proof of work – Ethash.
Ethash
Note that for quite some time, this memory hardness of Ethash prevented the
development of ASICs, but now various ASIC miners are available for Ethereum mining.
This algorithm requires subsets of a fixed resource called a directed acyclic graph
(DAG) to be chosen, depending on the nonce and block headers.
DAG is a large, pseudorandomly generated dataset. This graph is represented as
a matrix in the DAG file created during the Ethereum mining process. The Ethash
algorithm has the DAG as a two-dimensional array of 32-bit unsigned integers. Mining
only starts once the DAG is fully generated, the first time a mining node starts.
The Ethash algorithm requires this DAG file to work, using it as the dataset from which
pseudorandom data is fetched during mining. A new DAG is generated every epoch, that
is, every 30,000 blocks, and the DAG grows linearly as the chain size grows.
The protocol works as follows:
• The header from the latest block and a 32-bit random nonce are
combined using the Keccak-256 hash function.
• Once the data is fetched from the DAG, it is “mixed” with the mix to
produce the next mix, which is then used to fetch data from the DAG
and subsequently mixed again. This process repeats 64 times.
• Light clients can verify mining rounds very efficiently and should be
able to become operational quickly.
• Mining runs prohibitively slowly on light clients, as they are not
expected to mine.
As the Ethereum execution layer (formerly Eth1) advances toward the consensus
layer (formerly Ethereum 2), this PoW will eventually be phased out. When the current
EVM chain is docked into the beacon chain, that is, when the so-called “merge” happens,
Casper FFG will run on top of PoW. Eventually, however, Casper CBC, the pure PoS
algorithm, will take over.
Also, with ice age activation, the PoW will become almost impossible to mine due to
the extreme difficulty level induced by the “ice age,” and users will have no choice but to
switch to PoS.
Solana
Solana is a layer 1 blockchain with smart contract support introduced in 2018.
Developers of Solana aimed for speed, security, scalability, and decentralization. At
the time of writing, it is in Beta, and it is growing in popularity quickly. Though it is an
operational network with production systems running on it, there are some technical
issues which are being addressed.
The ledger is a verifiable delay function where time is a data structure, that is, data
is time. It potentially supports millions of nodes and utilizes GPUs for acceleration. SOL
coin is the native token on the platform used for governance and incentivization. The
main innovations include the following.
Proof of history (PoH) enables ordering of events using a data structure–based
cryptographic clock instead of an external source of time, which then leads to consensus.
TowerBFT is a protocol for consensus derived from PBFT. Note that PoH is not a
consensus protocol, it is simply a mechanism to enable ordering of events using a data
structure–based clock. A consensus mechanism is still needed to enable nodes to vote
on a correct branch of the ledger.
Turbine is another innovation which enables block propagation in small chunks
called shreds, which helps to achieve speed and efficiency. There are no memory pools
in Solana as transactions are processed so fast that memory pools do not form. This
mechanism has been named the Gulf stream.
Solana supports parallel execution of smart contracts, which again results in
efficiency gains.
Transactions are validated in an optimized fashion using pipelining in the so-called
transaction processing unit residing within validators.
Cloudbreak is the name given to the horizontally scalable database. Finally, archivers
(or replicators) are nodes which allow for distributed ledger storage, because in a high
throughput system such as Solana, data storage can become a bottleneck. For this
purpose, archivers, which are incentivized to store data, are used.
As our main focus is consensus algorithms, I will leave the introduction to
blockchains here and move on to discussing the actual consensus and relevant
mechanisms in Solana.
Solana uses proof of stake and the TowerBFT consensus algorithm. One of the key
innovations in Solana is proof of history, which is not a consensus algorithm but makes it
possible to create a self-consistent record of events, proving that some event occurred
before and after some point in time. This then leads to consensus and reduces the
communication overhead between nodes, which improves performance.
Proof of History
As discussed in Chapter 1, time in distributed systems is crucial. If time is synchronized
among processes, that is, a synchronized clock is available in a distributed network, then
communication can be reduced, which results in improved performance. A node can
deduce information from past events instead of asking another node repeatedly about
some information. For example, with the availability of a global clock where all nodes
are synchronized, the system can establish a notion of the system-wide history of events.
For example, a timestamp on an event can inform a node when this event occurred with
reference to the globally synchronized time across the network, instead of having to ask
the node that produced the event when it occurred.
Another application of a synchronized clock is that entities in the system can deduce
if something has expired, for example, a timestamped security token can immediately
tell a node how much time has elapsed since its creation. The node can then infer
whether the token is still valid or whether it was created in the distant past and is
therefore expired and no longer applicable.
In replication protocols, clock synchronization also plays a crucial role. If nodes
don’t have clocks synchronized, that can lead to inconsistency because every node will
have a different view of the order of events.
If the time is not synchronized among nodes, the system cannot establish a global
notion of time and history. Clock synchronization is usually achieved in practical
systems using the NTP protocol. We discussed this in Chapter 1 in the context of time
and order in distributed systems.
This is where PoH comes in. In Solana, one leader at a time processes transactions
and updates the state. Other validators read the state and send votes to the leader
to confirm them. This activity is split into very short successive sessions where one
leader after another performs this. It can be thought of as if the ledger is split into
small intervals. These small intervals are 400ms each. The leader rotation schedule
is predetermined and deterministic, based on several factors such as the stake and the
behavior in previous transactions. But how can we ensure that the leader rotation is done
at the right time and does not skip a leader’s turn?
In PoH, the passage of time is proven by creating a sequence of hashes, as shown
in Figure 8-3.
We can then sample this sequence at regular intervals to provide a notion of the
passage of time. This is so because hash generation takes some CPU time (roughly 1.75
cycles for SHA-256 instruction on an Intel or AMD CPU), and this process is purely
sequential; we can infer from looking at this sequence, that since the first hash is
generated, up to a later hash in the sequence, some time has passed. If we can also add
some data with the input hash to the hash function, then we can also deduce that this
data must have existed before the next hash and after the previous hash. This sequence
of hashes thus becomes a proof of history, proving cryptographically that some event,
let’s say event e, occurred before event f and after event d.
It is a sequential process that runs SHA-256 repeatedly and continuously, using its
previous output as its input. It periodically records a counter and the current state (hash
output) for each output sample, for example, every second, and these samples act like
clock ticks. Looking at this structure of hashes sampled at regular intervals, we can infer
that some time has passed. It is impossible to parallelize because the previous output is
the input for the next iteration. For example, we can say time has passed between counter
1 and counter N (Figure 8-3), where time is the SHA-256 counter. We can approximate
real time from this count. We can also associate some data, which we can append to
the input of the hash function; once hashed, we can be sure that data must have existed
before the hash is generated. This structure can only be generated in sequence; however,
we can verify it in parallel. For example, if 4000 samples took 40 seconds to produce, it
will take only 1 second to verify the entire data structure with a 4000 core GPU.
The key idea is that PoH transactional throughput is separated from consensus,
which is key to scaling. Note that the order of events generated, that is, the sequence, is
not globally unique. Therefore, a consensus mechanism is needed to ascertain the true
chain, as anyone can generate an alternate history.
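A minimal sketch of this idea follows: generation is strictly sequential, while verification can be split into segments and checked in parallel. It is an illustration of the concept only, not Solana's implementation.

```python
import hashlib

def poh_sequence(seed, ticks, events=None):
    """Generate a toy proof-of-history sequence.

    Each step hashes the previous hash, optionally mixed with event data, so
    the sequence can only be produced one step after another; the counter
    acts as a clock tick, and any mixed-in event is provably sandwiched
    between the hashes before and after it."""
    events = events or {}
    state, out = seed, []
    for counter in range(1, ticks + 1):
        data = events.get(counter, b"")
        state = hashlib.sha256(state + data).digest()
        out.append((counter, state, data))
    return out

def verify_segment(prev_hash, segment):
    """Re-run the hashes for one contiguous segment; independent segments
    can be verified in parallel on separate cores."""
    state = prev_hash
    for _, expected, data in segment:
        state = hashlib.sha256(state + data).digest()
        if state != expected:
            return False
    return True

seq = poh_sequence(b"genesis", 1000, events={500: b"tx: alice->bob"})
print(verify_segment(b"genesis", seq))  # True
```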
Proof of history is a cryptographically proven way of saying that time has elapsed. It
can be seen as an application-specific verifiable delay function. It encodes the passage
of time as data using SHA-256 hashing to hash the incoming events and transactions. It
produces a unique hash and count of each event, which produces a verifiable ordering of
events as a function of time. This means that time and ordering of events can be agreed
without waiting to hear from other nodes – in other words, no weak subjectivity where
nodes must rely on other nodes to determine the current state of the system. This results
in high throughput, because the information that is usually required to be provided by
other nodes is already there in the sequence generated by the PoH mechanism and is
cryptographically verifiable, ensuring integrity. This means that a global order of events
can be derived directly from the PoH sequence without additional rounds of
communication.
Tendermint
Tendermint is inspired by the DLS protocol that we covered in Chapter 6; the core ideas
it builds on were originally introduced in the DLS paper. It can also be seen as a variant
of PBFT, with similarities in the phases.
The Tendermint protocol works in rounds. In each round, an elected leader proposes
the next block. In Tendermint, the view change process is part of the normal operation.
This concept is different from PBFT, where a view change only occurs in the event of a
suspected faulty leader. Tendermint works similarly to PBFT, where three phases are
required to achieve consensus. A key innovation in Tendermint is the design of a new
termination mechanism. Unlike other PBFT-like protocols, Tendermint uses a single,
more straightforward mechanism resembling PBFT-style normal operation. Instead of
having two subprotocols for normal mode and view change mode (recovery in case of a
faulty leader), Tendermint terminates without additional communication cost.
Tendermint works under some assumptions about the operating environment,
which we describe next:
Only the proposal message contains the original value. The other two messages, pre-
vote and pre-commit, use a value identifier representing the initially proposed value.
There are three timeouts in the protocol, corresponding to each message type:
• Timeout-propose
• Timeout-prevote
• Timeout-precommit
These timeouts prevent the algorithm from waiting indefinitely for certain
conditions to be met. They also ensure that processes make progress through the rounds.
A mechanism to increase timeout with every new round assures that after reaching GST,
the communication between correct processes eventually becomes reliable, nodes can
reach a decision, and protocol terminates.
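The growing timeout can be sketched as a simple linear schedule; the constants below are illustrative placeholders, not Tendermint's actual defaults.

```python
def round_timeout(base_ms: int, delta_ms: int, round_number: int) -> int:
    """Timeout for a given round: it grows with the round number so that,
    after GST, messages from correct processes eventually arrive before
    the timer fires and the protocol can decide."""
    return base_ms + delta_ms * round_number

# Illustrative timeout-propose values for rounds 0..4.
print([round_timeout(3000, 500, r) for r in range(5)])  # [3000, 3500, 4000, 4500, 5000]
```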
All processes maintain some necessary variables in the protocol:
• Step: This variable holds the current state of the Tendermint state
machine in the current round.
• An array of decisions
HotStuff
HotStuff is a BFT protocol for state machine replication. Several innovations make
it a better protocol than traditional PBFT. However, like PBFT, it works under partial
synchrony in a message-passing network with a minimum of n = 3f + 1 replicas and
relies on a leader-based primary backup approach. It utilizes reliable and authenticated
communication links. HotStuff makes use of threshold signatures, where all nodes verify
against a single public key but each replica signs with its own share of the private key.
The use of threshold signatures results in reduced communication complexity.
HotStuff introduced several innovations, which we describe as follows.
Optimistic Responsiveness
Optimistic responsiveness means that, after GST, any correct leader needs only the first
n − f responses to ensure progress, instead of waiting for the maximum network delay.
This means that the protocol operates at network speed and can move to the next phase
without waiting unnecessarily for more messages from other nodes.
Chain Quality
This property provides fairness and liveness in the system by allowing frequent leader
rotation.
Hidden Lock
It also solves the hidden lock problem. A “hidden lock” problem occurs when a leader
validator does not wait for the expiration time of a round. The highest lock may not
get to the leader if we rely only on receiving n – f messages. The highest locked value
may be held in another replica from which the leader did not wait to get a response,
thus resulting in a situation where the leader is unaware of the highest locked value. If
a leader then proposes a lower lock value and some other nodes already have a higher
value locked, this can lead to liveness issues. The nodes will wait for a higher lock or the
same lock reply, but the leader is unaware of the highest lock value and will keep sending
a lower lock value, resulting in a race condition and liveness violation.
HotStuff has solved this problem by adding the precursor lock round before the
actual lock round. The insight here is that if 2f + 1 nodes accept the precursor lock, the
leader will get a response from them and learn the highest locked value. So now the
leader doesn’t have to wait for Δ (delta – an upper bound on a message delivery delay)
time and can learn the highest lock with n − f responses.
Pacemaker
HotStuff innovatively separates the safety and liveness mechanisms. Safety is ensured
through voting and commit rules for participants in the network. On the other hand,
liveness is the responsibility of a separate module, called pacemaker, which ensures a
new, correct, and unique leader is elected. Furthermore, pacemaker guarantees progress
after GST is reached. The first responsibility it has is to bring all honest replicas and a
unique leader to a common height for a sufficiently long period. For synchronization,
replicas keep increasing their timeouts gradually until progress is made. As we assume a
partially synchronous model, this mechanism is likely to work. Also, the leader election
process is based on a simple rotating coordinator paradigm, where a specific schedule,
usually round-robin, is followed by replicas to select a new leader. Pacemaker also
ensures that the leader chooses a proposal that replicas will accept.
In PBFT, a separate subprotocol runs when a view change occurs, but in HotStuff, the
view change can occur directly without invoking a separate subprotocol. Instead,
checking the threshold of the messages needed to change the view becomes part of the
normal view.
How It Works
HotStuff is composed of four phases: prepare, pre-commit, commit, and decide phases.
A quorum certificate (QC) is a data structure that represents a collection of
signatures produced by n – f nodes to indicate that a required threshold of messages has
been achieved. In other words, a collection of votes from n − f nodes is a QC.
Prepare
The protocol starts once a new leader has accumulated new view messages from n − f nodes. The leader processes these messages to determine the latest branch, that is, the branch in which the highest prepare quorum certificate is present.
Pre-commit
As soon as a leader accumulates n − f prepare votes, it creates a quorum certificate
called “prepare quorum certificate.” The leader broadcasts this certificate to other
nodes as a PRE-COMMIT message. When a node receives the PRE-COMMIT message,
it responds with a pre-commit vote. The quorum certificate indicates that the required
threshold of nodes has confirmed the request.
Commit
When the leader has accumulated n − f pre-commit votes, it creates a PRE-COMMIT
quorum certificate and broadcasts it to other nodes as the COMMIT message. When
nodes receive this COMMIT message, they respond with their commit vote. At this stage,
nodes lock the PRE-COMMIT quorum certificate to ensure the safety of the algorithm
even if a view change occurs.
Decide
When the leader receives n − f commit votes, it creates a COMMIT quorum certificate.
Then, the leader broadcasts this COMMIT quorum certificate to other nodes in the
DECIDE message. When nodes receive this DECIDE message, they execute the request
because this message contains an already committed certificate/value. The new view
starts once the state transition occurs due to the DECIDE message acceptance and
execution.
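To make the phase structure concrete, the following is a minimal, illustrative Python sketch of the leader's side of a single HotStuff view. It is not the actual HotStuff implementation: broadcast and collect_votes are hypothetical helpers, and signing, view numbers, and the safety checks performed by replicas are omitted.

PHASES = ["PREPARE", "PRE-COMMIT", "COMMIT"]

def leader_view(proposal, broadcast, collect_votes, n, f):
    quorum = n - f                       # votes needed to form a quorum certificate (QC)
    qc = None                            # QC carried over from the previous phase
    for phase in PHASES:
        broadcast(phase, proposal, qc)   # send the current phase message with the last QC
        votes = collect_votes(phase, quorum)   # wait only for the first n - f matching votes
        qc = {"phase": phase, "proposal": proposal, "votes": votes}
    broadcast("DECIDE", proposal, qc)    # replicas execute the request on receiving DECIDE
    return qc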
We can visualize this protocol in Figure 8-5.
• A new primary acquires new view messages from n-f nodes with
the highest prepare quorum certificate that each validator receives.
The primary looks at these messages and finds the prepare QC with
the highest view (round number). The leader then broadcasts the
proposal in a prepare message.
• When other nodes receive this prepare message from the leader, they
check whether the proposal extends the branch of the highest prepare QC and carries a higher view number than what they have currently locked.
• When n-f votes are acquired by the leader, it combines them into a
prepare QC and broadcasts this QC in a pre-commit message.
• Replicas reply to the leader with pre-commit votes. When the leader has n − f pre-commit votes, it combines them into a pre-commit QC and broadcasts it in a commit message; replicas respond with commit votes and lock on the pre-commit QC. When the leader receives n − f commit votes from the replicas, it combines them into a commit QC and broadcasts the decide message.
There are other optimizations, such as pipelining, which allow further performance improvements. Because all the phases are fundamentally identical, HotStuff is easy to pipeline. With pipelining, the leader proposes a new client request in every phase of a view, so it can concurrently process the pre-commit, commit, and decide messages for previous client requests, whose certificates are passed on to the next leader.
Safety and Liveness
HotStuff guarantees liveness by using the pacemaker, which ensures progress after
GST within a bounded time interval by advancing views. This component encapsulates
view synchronization logic to ensure liveness. It keeps enough honest nodes in the
same view for sufficiently long periods to ensure progress. This property is achieved by
progressively increasing the time until progress is made.
Whenever a node times out in a view, it broadcasts a timeout message and advances to the following view when a quorum certificate of 2f + 1 timeout messages is received. This certificate is also sent to the next leader, who takes the protocol further.
Polkadot
Polkadot is a modern blockchain protocol that connects a network of purpose-built blockchains and allows them to operate together. It is a heterogeneous multichain ecosystem with shared consensus and shared state.
Polkadot has a central main chain, called a relay chain. This relay chain manages the Parachains – the heterogeneous shards that are connected to the relay chain. The relay chain holds the states of all Parachains. All these Parachains can communicate with each other and share the security, which leads to a better and more robust ecosystem. As the Parachains are heterogeneous, they can serve different purposes; one chain can be dedicated to smart contracts, another to gaming, another to providing public services, and so on. The relay chain is secured by nominated proof of stake.
The validators on the relay chain produce blocks and communicate with Parachains
and finalize blocks. On-chain governance decides what the ideal number of validators
should be.
A depiction of the Polkadot chain with Parachains is shown in Figure 8-6.
Polkadot aims to be able to communicate with other blockchains as well. For its
purposes, bridges are used, which connect Parachains to external blockchains, such as
Bitcoin and Ethereum.
There are several components in Polkadot. The relay chain is the main chain
responsible for managing Parachains, cross-chain interoperability, and interchain
messaging, consensus, and security.
It consists of nodes and roles. Nodes can be light clients, full nodes, archive nodes, or
sentry nodes. Light clients consist of only the runtime and state. Full nodes are pruned at
configurable intervals. Archive nodes keep the entire history of blocks, and sentry nodes
protect validators and thwart DDoS attacks to provide security to the relay chain. There
are several roles that nodes can perform: validator, nominator, collator, and fisherman.
Validators are the highest level in charge in the system. They are block producers, and to
become block producers, they need to provide a sufficient bond deposit. They produce
and finalize blocks and communicate with Parachains. Nominators are stakeholders
and contribute to the validators’ security bond. They place trust in a validator to “be
good” and produce blocks. Collators are responsible for transaction execution. They create unsealed but valid blocks and provide them to validators, which propose them. Fishermen are used to
detect malicious behavior. Fishermen are rewarded for providing proof of misbehavior
of participants. Parachains are heterogeneous blockchains connected to the relay
chain. These are fundamentally the execution core of Polkadot. Parachains can have their own runtime and are also called application-specific blockchains. Another component, called a Parathread, is a blockchain that works within the Polkadot host and connects to the
relay chain. They can be thought of as pay-as-you-go chains. A Parathread can become
a Parachain via an auction mechanism. Bridges are used to connect Parachains with
external blockchain networks like Bitcoin and Ethereum.
Consensus in Polkadot
Consensus in Polkadot is achieved through a combination of various mechanisms. For
governance and accounting, a nominated proof of stake is used. For block production,
BABE is used. GRANDPA is the finality gadget. In the network, validators have their own
clocks, and a partially synchronous network is assumed.
Finality is usually probabilistic as we saw in traditional Nakamoto PoW consensus.
In most permissioned networks and some public networks, it tends to be deterministic,
that is, provable finality, for example, PBFT and Tendermint.
Because Polkadot is a heterogeneous multichain architecture, there can be situations where, due to conflicts between chains, rogue blocks are added. These rogue blocks need to be removed after conflict resolution; in such situations, deterministic finality is not suitable because it is irreversible. On
the other hand, PoW is too slow, energy consuming, and probabilistic. The solution for
this is to keep producing blocks as fast as possible but postpone finality until it is suitable to finalize. This way, block production can continue and is revertible, but the finality decision can be made separately and provably at a later stage.
This notion of provable finality is quite useful in a multichain heterogenous network
because it allows us to prove to other parties that are not involved in consensus that a
block is final. Also, provable finality makes it easier to make bridges to other blockchains.
This hybrid approach works by allowing validators to produce blocks even if only
one validator is online and correct, but the finalization of the blocks is offloaded to a
separate component called a finality gadget. Under normal conditions, block finalization
is also quite fast, but in case of issues such as state conflicts, the finalization can be
postponed until more scrutiny checks are performed on the blocks. In case of severe
attacks or huge network partitions, block production will continue; however, as a
fallback mechanism, Polkadot will fall back to the probabilistic finalization mechanism.
This way, liveness is guaranteed even under extreme scenarios, as long as at least one
validator is correct and alive. The blocks are produced by BABE, whereas they are
finalized by GRANDPA. GRANDPA finalizes a chain of blocks instead of finalizing block by block, which improves efficiency. As finalization is a separate process, block production can continue at whatever speed the network allows; finality does not
There can be some forks before a “best” chain is finalized by GRANDPA. We can
visualize this in Figure 8-7.
The diagram shows, on the right side, three blocks produced by BABE in three forks; GRANDPA resolves these forks and finalizes the chain. Now let's see how blocks are
produced by BABE.
Validators use public key cryptography. There are two types of key pairs. A private
key from the first key pair is used for block signing. The second pair is used for a
verifiable random function (VRF), also called the lottery key pair. A private key from the
latter pair is used as an input to the verifiable random function. Block signing provides
usual nonrepudiation, integrity, and data origin authentication guarantees, verifying that
the validator has indeed produced this block. In the VRF, the private key generates the
randomness, but the public key proves to other nodes that the randomness generated is
indeed reliable and that the validator did not cheat.
Each validator has an almost equal chance of being selected. A slot leader election
is done like PoW, where, if the result of VRF is lower than a predetermined threshold,
then the validator wins the right to produce the block. Also, the proof generated from the
VRF enables other participants to verify that the validator is following the rules and not
cheating; in other words, it proves that the randomness generated is reliable. If the value
produced from the VRF is higher than or equal to the target, then the validator simply
collects blocks from other validators.
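The following short Python sketch illustrates the threshold idea only: a keyed hash stands in for a real VRF, and the threshold constant is arbitrary, so this is illustrative rather than Polkadot's actual implementation. A real VRF additionally produces a proof that other validators verify with the public key.

import hashlib

THRESHOLD = 2 ** 252          # hypothetical target; lowering it makes winning a slot rarer

def pseudo_vrf(secret_key: bytes, epoch_randomness: bytes, slot: int) -> int:
    # Stand-in for the VRF output derived from the lottery private key and slot data.
    data = secret_key + epoch_randomness + slot.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def is_slot_leader(secret_key: bytes, epoch_randomness: bytes, slot: int) -> bool:
    # The validator wins the right to produce the block if the output is below the threshold.
    return pseudo_vrf(secret_key, epoch_randomness, slot) < THRESHOLD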
Here are the phases of the BABE protocol.
Genesis Phase
In this phase, the unique genesis block is created manually. A genesis block contains a
random number that is used during the first two epochs for the slot leader selection.
Normal Phase
Each validator divides its time into so-called slots after receiving the genesis block.
Validators determine the current slot number according to the relative time algorithm,
which we’ll explain shortly. Each validator during normal operation is expected to
produce a block, whereas other nonvalidator nodes simply receive the produced blocks
and synchronize. It is expected that each validator has a set of chains in the current
slot/epoch and has the best chain selected in the previous slot by using the best chain
selection mechanism, which we’ll explain shortly.
The slot leader selection is based on the output of the VRF. If the output of the VRF
is below a certain threshold, then the validator becomes the slot leader. If not, then it
simply collects the blocks from the leader.
The block generated by the leader is added to the best chain selected in the current
slot. The produced block must at least contain the slot number, the hash of the previous
block, the VRF output, VRF proof, transactions, and the digital signature. Once the
chain is updated with the new block, the block is broadcast. When another non-leader
validator receives the block, it checks if the signature is valid. It also verifies if a valid
leader has produced the block by checking the VRF output using the VRF verification
algorithm: if the output of the VRF is lower than the threshold, the leader is valid. It further checks whether there is a valid chain with the required header
available in which this received block is expected to be added, and if the transactions in
the block are valid.
If all is valid, then the validator adds the block to the chain. When the slot ends,
the validator finally selects the best chain using the best chain selection algorithm,
which eliminates all chains that do not include the finalized block by the finality gadget
GRANDPA.
Epoch Update
A new epoch starts every n number of slots. A validator must obtain the new epoch
randomness and active validator set for the new epoch before beginning the new epoch.
The new validator set for the new epoch is included in the relay chain to enable block
production. A new validator must wait for two epochs before the protocol can select
it. Adding a validator two epochs later ensures that VRF keys of the new validators are
added to the chain before the randomness of the future epoch in which they are going
to be active is revealed. A new randomness for the epoch is calculated based on the
previous two epochs by concatenating all the VRF outputs of blocks in those epochs.
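A minimal sketch of this derivation follows, assuming the VRF outputs of the relevant blocks are available as byte strings; hashing the concatenation merely stands in for the actual derivation function, so the names here are illustrative.

import hashlib

def next_epoch_randomness(prev_randomness: bytes, vrf_outputs: list) -> bytes:
    # Concatenate the VRF outputs of blocks from the previous epochs and hash them together.
    return hashlib.sha256(prev_randomness + b"".join(vrf_outputs)).digest()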
The diagram in Figure 8-9 illustrates this slot leader election process.
Figure 8-9. Slot leader election via VRF and block production in slots and epochs
The best chain selection algorithm simply removes all chains that do not contain a
finalized block by GRANDPA. In case GRANDPA does not finalize any block, the protocol
falls back to probabilistic finality, and the finalized block is chosen as the one which is
several blocks (a number) before the last block. This works almost like a chain depth
rule in PoW.
Time is managed in BABE using the relative time algorithm. It is critical for the
security of the BABE that all parties are aware of the current slot number. BABE does not
use a time source managed by NTP as clearly a central source of time cannot be trusted.
Validators realize a notion of logical time by using block arrival times as a reference
without relying on an external source of time. When a validator receives the genesis
block, it records the arrival time as a reference point of the beginning of the first slot. As
the beginning time of each slot is expected to be different on each node, an assumption
is made that this difference is reasonably limited. Each validator updates its clock by
calculating the median of the arrival times of the blocks in the epoch. Although the
mechanics are different, the fundamental concept appears to be similar to the logical
clocks we discussed in Chapter 1. Temporary clock adjustment until the next epoch is
also possible for validators that went offline and joined the network again.
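The following is a minimal sketch of the median idea, assuming each validator records when blocks actually arrived and when the slot schedule says they should have arrived; it is illustrative only and not the exact algorithm from the BABE specification.

from statistics import median

def clock_offset(observed_arrivals, scheduled_times):
    # One offset sample per block in the epoch: actual arrival minus scheduled slot time.
    offsets = [obs - sched for obs, sched in zip(observed_arrivals, scheduled_times)]
    return median(offsets)   # the validator applies this as a correction to its local clock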
Safety and Liveness
There are four security properties that BABE satisfies: chain growth, existential chain
quality, chain density, and common prefix.
Chain Growth
This property guarantees a minimum growth between slots. In other words, it’s a liveness
property, and chain growth is guaranteed as long as a supermajority of honest validators
is available. Malicious validators cannot stop the progress of the best chain.
Chain Quality
This property ensures that at least one honest block is contributed to any best chain
owned by an honest party in every x number of slots. The protocol guarantees that
even in the worst case, there will be at least one honest block included in the best chain
during an epoch. This ensures that the randomness is not biased.
Chain Density
This property ensures that any sufficiently long portion of blocks in the best chain
contains more than half of the blocks produced by honest validators. This property is
implied by chain growth and chain quality.
Common Prefix
This property ensures that any blocks before the last block in the best chain of an honest
validator cannot be changed and are final. Again, this property is satisfied due to the assumption of a supermajority of honest validators. It is rare for a malicious validator to be elected in a slot, and mostly honest validators will be elected; therefore, malicious validators are in such a minority that they cannot create another “best” chain that does not contain a finalized block.
GRANDPA, the finality gadget, proceeds in rounds with a rotating primary:
• The primary broadcasts the highest block that it thinks might be final from the previous round.
Safety
The protocol ensures that all votes are descendants of some block that could have been
finalized in the previous round. Nodes estimate the finalization possibility of a block based
on pre-votes and pre-commits. Before a new round starts, nodes ensure by acquiring
enough pre-commits that no block with this round’s estimate can be finalized on a
different chain or later on the same chain. In the next round, it also ensures that it only pre-
votes and pre-commits on the blocks that are descended from the last round’s estimate.
Liveness
The protocol selects a validator in rotation to become the primary. The primary starts
the round by broadcasting their estimate of block finalization from the last round.
Validators pre-vote for the best chain, including the primary’s proposed block if (1) the
block is at least the validator’s estimate and (2) the validator has acquired >2/3 pre-votes
for the block and its descendants in the last round.
The key insight here is that if the primary's proposed block has not yet been finalized, the round works toward finalizing it so that progress is made. For example, suppose the block proposed by the primary has not been finalized, and all validators have agreed on the best chain extending the last finalized block. In that case, progress is made by finalizing the latest agreed chain.
If GRANDPA cannot conclude, then BABE provides its probabilistic finality as a fallback
mechanism, ensuring progress.
GRANDPA and BABE are among the latest protocols designed for heterogeneous multichain systems. There are other protocols in this family, such as Casper FFG, used in the Ethereum consensus layer (Ethereum 2, the beacon chain).
Ethereum 2
Ethereum 2, also called Serenity or Eth2, is the final version of Ethereum. Currently, Ethereum is based on proof of work and is known as Eth1. Eth1 is now referred to as the execution layer, and the previous Eth1/Eth2 terminology is no longer used. Eth2 is now
called the consensus layer. As per the original plan, the existing PoW chain will
eventually be deprecated, and users and apps will migrate to the new PoS chain Eth2.
However, this process is expected to take years, and an alternative better proposal is
to continue improving the existing PoW chain and make it a shard of Ethereum 2. This
change will ease the transition to proof of stake and allow scaling up using rollups
instead of sharded execution. The beacon chain is already available; “the merge” phase
where the Ethereum mainnet merges with the beacon chain is expected in 2022. After
the merge, the beacon chain will become executable with proof of stake and EVM
capabilities. Old Ethereum 1 (Eth1) will become the execution layer with execution
clients, for example, “geth” from Eth1. Ethereum 2 (consensus) clients such as prysm and
lighthouse will continue operating on the beacon chain. Eventually, the shard chains
that expand Ethereum capacity and support execution are planned for 2023. Ethereum
2.0 “consensus” with Ethereum 1 “execution” as shard 0, along with other upgrades
based on their road map, can be visualized in Figure 8-10.
Figure 8-10. Ethereum upgrades, showing the merge and subsequent upgrades
In short, Eth1 is now called the execution layer, which handles transactions and
executions, whereas Eth2 is now called the consensus layer, which manages proof
of stake consensus. As part of the consensus layer, the “Ethereum 2” proof of stake
consensus protocol is proposed, which we discuss next.
Casper
Casper is a proof of stake protocol which is built to replace the current PoW algorithm in
Ethereum. There are two protocols in this family:
Casper FFG is a PoS BFT–style hybrid protocol that adds a PoS overlay to the current
PoW, whereas Casper the Friendly GHOST is purely a PoS protocol. Casper FFG provides
a transition phase before being replaced with Casper CBC, a pure PoS protocol. We discuss Casper FFG next.
Casper FFG
Casper can be seen as an improved PBFT with proof of stake for public blockchains.
Casper the Friendly Finality Gadget introduces some novel features:
• Accountability
• Dynamic validators
• Defenses
• Modular overlay
The LMD GHOST fork choice rule is based on GHOST. The LMD GHOST choice rule
selects the correct chain from multiple forks. The honest chain is the one that has the
most attestations from the validators and stake (i.e., weight). Forks occur due to network
partitions, Byzantine behavior, and other faults. We can see LMD GHOST in action in
Figure 8-11.
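As an illustration of the rule, here is a small Python sketch of a greedy heaviest-subtree walk; the block names and weights are invented, and real clients derive the weights from validators' latest attestations and their stake.

def lmd_ghost(children, weight, root):
    # Weight of a block's subtree = its own attestation weight plus all of its descendants'.
    def subtree_weight(block):
        return weight.get(block, 0) + sum(subtree_weight(c) for c in children.get(block, []))
    head = root
    while children.get(head):                      # follow the heaviest child at each step
        head = max(children[head], key=subtree_weight)
    return head

children = {"genesis": ["A", "B"], "A": ["A1"], "B": ["B1", "B2"]}
weight = {"A": 3, "A1": 2, "B": 4, "B1": 1, "B2": 2}
print(lmd_ghost(children, weight, "genesis"))      # prints B2, the head of the heaviest branch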
As Figure 8-11 illustrates, a parent block in the block tree can have multiple child blocks. Casper is responsible for choosing a single child from each
parent block, thus choosing a single canonical chain from the block tree.
Casper however does not deal with the entire block tree due to efficiency concerns;
instead, it considers a subtree of checkpoints forming the checkpoint tree.
New blocks are appended to the block tree structure. A subtree of a tree called the
checkpoint tree is where the decision is required. This structure is shown in Figure 8-12.
The genesis block is a checkpoint, and every 100th block is a checkpoint. The
distance from one checkpoint to another is called an epoch. In other words, validators
finalize checkpoints every 100 blocks. Each validator that joins the network places a security deposit (its stake). This deposit increases and decreases through the reward and penalty mechanisms. Validators broadcast votes. A vote's weight is proportional to a
validator’s stake. A validator can lose the entire deposit if it deviates from the protocol,
that is, violates any rules. This is to achieve safety.
Validators create the vote message, sign it, and broadcast it to the network. A vote references a source checkpoint s and a target checkpoint t, and the hash of a checkpoint is used to identify it. The vote is valid only if s is an ancestor of t in the checkpoint tree and the public key of the validator is in the validator set. When more than two-thirds of the validators (by stake) vote for a link from a source to a target, that link becomes a supermajority link. For example, with cp′ as the source and cp as the target, cp′ → cp is a supermajority link. A checkpoint is justified if it is the genesis block or if it is the target of a supermajority link whose source is justified. Precisely, we can say a checkpoint cp is justified if there
is a supermajority link cp′ → cp where cp′ is justified. A checkpoint cp is considered final
if it is justified and there exists a supermajority link cp → cp′ where cp′ is a direct child of
checkpoint cp.
Justified checkpoints are not considered final as there can exist conflicting justified
checkpoints. To finalize a checkpoint cp, a second round of confirmation is required
where a direct child cp′ of cp with a supermajority link cp → cp′ is justified.
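A minimal Python sketch of these two rules follows; vote aggregation, checkpoint hashes, and dynamic validator sets are abstracted away, so this is illustrative rather than Ethereum's implementation.

def supermajority_links(link_stake, total_stake):
    # A link (source, target) is a supermajority link if more than 2/3 of stake voted for it.
    return {link for link, stake in link_stake.items() if 3 * stake > 2 * total_stake}

def justified(cp, links, genesis):
    # A checkpoint is justified if it is genesis or the target of a link from a justified source.
    if cp == genesis:
        return True
    return any(t == cp and justified(s, links, genesis) for (s, t) in links)

def finalized(cp, links, parent, genesis):
    # cp is final if it is justified and a supermajority link goes from cp to a direct child.
    return justified(cp, links, genesis) and any(
        s == cp and parent.get(t) == cp for (s, t) in links)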
Protocol Steps
Essentially, the process works in three steps. First, votes are cast for a checkpoint; if more than 2/3 of the votes (by stake) are acquired, then the checkpoint is in a justified state. Finally, the chain is finalized to form the canonical chain. Justification does not mean finalized; there can be multiple justified chains in the block tree. Finality occurs when two consecutive checkpoints receive more than 2/3 of the votes.
Safety and Liveness
Any validator who violates the following conditions, known as the minimum slashing conditions, will be penalized by slashing their deposit (a small sketch of these checks follows the list):
• A validator must not publish two distinct votes for the same target checkpoint height.
• A validator must not vote within the source-to-target span of one of its other existing votes.
These conditions give Casper FFG its two key guarantees:
• Accountable safety
• Plausible liveness
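The following tiny Python sketch checks a pair of votes against the two conditions; votes are modeled simply as (source height, target height) pairs, which is a simplification of the real vote message.

def violates_slashing_conditions(vote1, vote2):
    s1, t1 = vote1
    s2, t2 = vote2
    double_vote = (t1 == t2) and (vote1 != vote2)   # two distinct votes for the same target height
    surround_vote = (s1 < s2 and t2 < t1) or (s2 < s1 and t1 < t2)   # one vote spans the other
    return double_vote or surround_vote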
Summary
This chapter discussed the blockchain age protocols that emerged after Bitcoin. There
are several types of blockchain consensus protocols; some are based on voting, some
are proof of work, and another class is proof of stake protocols. All these protocols have safety and liveness properties that ensure that the protocol is correct and works as expected.
Bibliography
1. Peercoin paper: www.peercoin.net/whitepapers/peercoin-paper.pdf
2. Xiao, Y., Zhang, N., Lou, W., and Hou, Y.T., 2020. A survey of
distributed consensus protocols for blockchain networks. IEEE
Communications Surveys & Tutorials, 22(2), pp. 1432–1465.
9. Buterin, V., Hernandez, D., Kamphefner, T., Pham, K., Qiao, Z.,
Ryan, D., Sin, J., Wang, Y., and Zhang, Y.X., 2020. Combining
GHOST and casper. arXiv preprint arXiv:2003.03052.
11. Burdges, J., Cevallos, A., Czaban, P., Habermeier, R., Hosseini, S.,
Lama, F., Alper, H.K., Luo, X., Shirazi, F., Stewart, A., and Wood, G.,
2020. Overview of polkadot and its design considerations. arXiv
preprint arXiv:2005.13456.
12. Yin, M., Malkhi, D., Reiter, M.K., Gueta, G.G., and Abraham, I.,
2018. HotStuff: BFT consensus in the lens of blockchain. arXiv
preprint arXiv:1803.05069.
CHAPTER 9
Quantum Consensus
This chapter covers quantum consensus. Before we explain what quantum consensus
is, a basic introduction to quantum computing and its advantages is given to build an
understanding of how a quantum computer works. Moreover, topics like quantum
networks, quantum Internet, quantum cryptography, and quantum blockchains are
also covered. Then we discuss quantum consensus and explain what it is, how quantum
computing impacts classical consensus in classical and quantum networks, and
how quantum computing can enhance existing distributed consensus protocols. We
survey what has been done so far in the research community and some open research
problems.
Introduction
The roots of the idea to combine quantum mechanics and information theory can
be traced back to as early as the 1970s. In 1979, Paul Benioff proposed a theoretical
foundation of quantum computing. In 1982, Richard Feynman gave a lecture in which
he argued that classical computers cannot possibly perform calculations that describe
quantum phenomena. Classical computers are inherently limited, and to simulate
quantum phenomena, the computing device must also be based on quantum principles,
thus allowing quantum mechanical simulations and calculations which otherwise
are not possible in the classical computing world. This was received well, and many
researchers started working on this.
In 1985, David Deutsch proposed a universal quantum computer and indicated
that it might perform simultaneous operations using quantum superposition. He also
suggested the “Deutsch algorithm,” which could determine if a quantum coin is biased
with a single toss. After this, interest sparked again but soon waned. However,
quantum computing came into the limelight when Peter Shor, in 1994, described a
quantum algorithm that could factorize large numbers quickly. This event sparked a
lot of interest, primarily because Internet security is based on RSA, which relies on prime factorization as a hard problem for its security. More precisely, it is computationally infeasible on classical computers to factor large numbers that are products of two large primes, which gives RSA its security. However, this can be done efficiently with quantum computers, thus breaking RSA, and hence the security of the Internet. As we can imagine, this was big news.
In 1996, Grover introduced his quantum search algorithm, which further renewed the
researchers’ interest in quantum computing. Almost 28 years later, we are at the stage
where some companies have claimed quantum supremacy. Many researchers from
academia and industry are working on quantum computing, and it now appears that
quantum computing is at a stage where classical computing was in the 1960s. It will
become mainstream in a decade or so in most large organizations, if not everywhere.
Perhaps, quantum computers may not become a household reality soon. Still, one thing
is clear; quantum computing is evolving rapidly and will start to impact (good or bad)
quite soon on our daily lives.
Quantum computers use ideas from various fields, including computer science,
engineering, quantum mechanics, physics, mathematics, and information theory.
Several subjects have emerged from this, such as quantum information science and
technology (QIST), a merger of quantum mechanics and information technology.
Quantum information science (QIS) is a subject at the intersection of computer
science, information theory, and quantum mechanics. QIS changes how we
fundamentally think about information processing and results in novel ways to solve
previously unsolvable computationally complex problems. A quantum computer stores
and processes data fundamentally differently from classical computing, where 0s and 1s
are used to encode data. This difference in how the information is processed in quantum
computers opens the door to achieving significant speedup to solve complex problems.
Examples of such complex problems include modelling climate change and many others. A simple example of a complex problem is organizing ten people around a table for dinner. It turns out that there are 3,628,800 ways to do this,1 which is simply 10 factorial.
Another problem is the travelling salesman problem, which is an NP-hard problem.
This problem aims to find the shortest route for a round trip among multiple cities.
We can solve many complex problems on classical computers, and we have
supercomputers available that can solve problems very fast, such as everyday math,
algebra problems, etc. However, the intractable problems are not solvable on even
modern supercomputers. This is where quantum computers come in. Especially in
combinatorial optimization problems where even supercomputers fail, quantum
computers provide a way to solve them.
Optimization problems are problems which try to find the best solution from all
feasible solutions. Quantum computers are good at solving these types of problems
where a large state space is explored.
Efficiently simulating molecules can help in new drug discovery. This problem of
simulation of molecules is difficult because all variations in the way atoms behave with
each other and even a small change in the way a single atom is positioned impact all
other atoms. Such problems where exponentially many variations exist are expected to
be solvable on quantum computers. Also, this information cannot be held on a classical computer, as that amount of storage is simply not available.
For example, a caffeine molecule is composed of 24 atoms, but representing that
requires 10^48 bits, which makes this problem intractable on a classical computer;
however, a quantum computer can handle this information in 160 qubits.
Route optimization of delivery companies is an exciting application where the aim
is to find optimized routes that minimize fuel usage while still delivering a greater number of packages.
Quantum computing applications are vast, including but not limited to
cryptography, machine learning, data analysis, computational biology, simulating
chemistry, and quantum simulation.
An application in chemistry helps to discover new materials and compounds, new
drugs, and improvements in fertilizer production, which leads to better agriculture. In
cybersecurity, better and more secure key generation and distribution mechanisms and novel cryptographic techniques become possible.
1 www.ibm.com/quantum-computing/what-is-quantum-computing/
The fundamental concepts in quantum computing include the following:
• Qubit
• Superposition
• Entanglement
• Teleportation
Qubit
A classical computer works based on two distinct states, 0 and 1. Classical computers use
transistors to create the absence or presence of an electric signal which represents 0 or 1,
respectively. Fundamentally, it is all transistors, even in most modern supercomputers.
With qubits in quantum computers, this fundamental paradigm shifts, which leads
to extraordinary speeds at which quantum computers can operate. A qubit is the state
of physical atomic particles, for example, a spin on an electron. A qubit can be in a
superposition of two states 0 and 1 simultaneously. The speedup rises exponentially as
more qubits are added. Eight bits together in a classical computer are called a byte. In
the quantum computing world, eight qubits together are called a qubyte.
Imagine 4 bits in a classical computer. These bits can take 16 possible states, but only one at a time, so the states can only be processed sequentially. In a quantum computer, however, 4 qubits can be in a superposition of all 16 possible states and thus be processed simultaneously. This phenomenon is called quantum parallelism and is the key to speeding up certain types of problems that are intractable on classical computers.
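As a standard illustration (a textbook identity rather than anything specific to a particular device), four qubits each placed in an equal superposition yield the state

∣ψ⟩ = (1/4)(∣0000⟩ + ∣0001⟩ + … + ∣1111⟩)

that is, an equal superposition of all 16 basis states, each with amplitude 1/√16 = 1/4.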
There are many ways to build a qubit physically. These techniques include trapped
ions, photons, neutral atom, NMR, and several others.
Dirac notation is used to represent qubits. Qubits are represented as ∣0⟩ and ∣1⟩, as compared to classical 0 and 1. The difference is that a qubit can be in a linear combination of states called a superposition. A qubit can be a superposition of, for example, an electron with spin-up or spin-down or a photon with +45-degree polarization or –45-degree polarization, among many other possibilities.
Dirac notation is used to represent quantum states and their superpositions. It has the form ∣0⟩ + ∣1⟩ where 0 and 1 are states.
A qubit can be in a quantum state ∣ψ⟩ = α∣0⟩ + β∣1⟩ where α, β ∈ C (complex amplitudes) and |α|² + |β|² = 1. This means that the state of a single qubit is represented by ∣ψ⟩ = α∣0⟩ + β∣1⟩, subject to the probability condition |α|² + |β|² = 1. This condition means that the values α and β can take are constrained: their squared magnitudes must add to one. C denotes the complex numbers.
Other than Dirac notation, we can also use vector representations. A single column vector can represent the state containing the amplitudes α and β:
∣ψ⟩ = [α, β]ᵀ
with the basis states ∣0⟩ = [1, 0]ᵀ and ∣1⟩ = [0, 1]ᵀ.
We can describe a single qubit as a point on the surface of the Bloch sphere. The
North pole represents state ∣0⟩, and the South pole represents state ∣1⟩. Qubit angles are
θ the latitude and ϕ the longitude. When a single gate operation is performed on the
qubit, the state ψ (the qubit) rotates to another point on the Bloch sphere.
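In terms of these two angles, the standard parametrization of a single-qubit state (a textbook identity) is

∣ψ⟩ = cos(θ/2) ∣0⟩ + e^(iϕ) sin(θ/2) ∣1⟩

so θ = 0 gives ∣0⟩ at the North pole and θ = π gives ∣1⟩ at the South pole.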
Superposition
Superposition is a fundamental principle of quantum mechanics. Superposition means
that quantum states can be added together to get another valid quantum state. This is
analogous to classical mechanics, where waves can be added together. Added quantum
states are so-called “superposed.” Superposition is the key to extraordinary speedup as it
allows many computation paths to be explored simultaneously.
Entanglement
Entanglement is an incredibly strong correlation that exists between particles, which
allows two or more particles to inseparably link with each other. It allows any two
quantum particles to exist in a shared state. Any action on one particle instantly affects
the other particle even at massive distances. Entanglement is usually performed by
bringing two qubits close together, performing an operation to entangle them, and, once
entangled, moving them apart again. They will remain entangled even if one of them is
on earth and the other is moved to outer space at a vast distance.
There are two features of entanglement which make it particularly suitable for a
large range of applications: maximal coordination and monogamy.
Maximal Coordination
When two qubits at different nodes in a network entangle, that is, the quantum state of the two particles becomes inseparably connected, they provide stronger correlation and coordination properties, which are nonexistent in classical networks. This property
is called maximal coordination. For example, for any measurement on the first qubit,
if the same measurement is made on the second qubit, instantaneously, the same
answer is shown, even though that answer is random and was not predetermined.
More precisely, they will always yield a zero or one at random, but both will produce
the same output always. This feature makes entanglement suitable for tasks requiring
coordination, for example, clock synchronization, leader election, and consensus.
Imagine clock synchronization in a distributed network without physical transfer; it can
make distributed networks extraordinarily fast. (Remember replacing communication
with local computation from the last chapter.) Also, state transfer/awareness during
consensus immediately makes consensus faster. The fundamental idea here is that
when entangled, it is possible to change the state globally (full state) by only performing
operations (changing parameters) in one qubit. This feature has far-reaching
implications; imagine being able to do immediate state transfer to all nodes in the
network. This can result in extraordinary speedup in consensus algorithms.
Monogamy
Quantum entanglement is not shareable. If two qubits are entangled, then a third qubit
from anywhere in the universe can never entangle with either of them. This property is
called monogamy of entanglement. This property can enable applications in privacy,
cryptographic key generation, and identification.
Quantum Gates
Just like in classical computing, in the quantum world we use gate operations for data processing. We are used to Boolean gates in the classical world, such as NOT, AND,
OR, XOR, NAND, and NOR. In the quantum world, we apply some operator to an input
state which transforms into an output state. This operator is called a quantum gate.
Quantum gates operate on a single qubit or multiple qubits. A rule here is that each gate
must have the same number of inputs as outputs. This is what makes the gate reversible
and consequently the quantum computers reversible. There are single qubit gates which
make rotations on a Bloch sphere. Then there are two qubit gates which combine single
gates to create more complex functions, which leads to building quantum computers.
There are many quantum gates; some common ones are introduced as follows.
Hadamard
This gate transforms a basis state into an even superposition of the two basis states.
Fundamentally, it allows us to create superpositions. It operates on one qubit and is
denoted by the symbol shown in Figure 9-2.
T
The T gate induces a π/4 phase between contributing basis states, that is, it rotates the relative phase by 45 degrees. The symbol shown in Figure 9-2 represents the T gate.
CNOT
This gate is called the controlled NOT. It is the same as the classical XOR gate, but
with the property of reversibility. It works with two qubits. The first qubit serves as the
control qubit, and the second qubit acts as the target qubit. It changes the state of the
target qubit only if the first qubit is in a specific state. This gate can be used to create an
entangled state in a two or more qubit system.
Toffoli (CCNOT)
This is the controlled-controlled NOT gate. It operates on three qubits. It switches the
third bit of a three-bit state where the first two bits are 1, that is, it switches ∣110⟩ to ∣111⟩
and vice versa. It is represented by the symbol shown in Figure 9-2. In other words, if the
first two bits are 1, the third bit inverts.
Z
It is a phase shift gate. It maps 1 to –1 and keeps 0 as 0. In other words, the amplitude of
∣1⟩ is negated. Fundamentally, it rotates the phase by 180 degrees. It is represented in the
circuit with the symbol Z in a box, as shown in Figure 9-2.
NOT
This gate switches ∣0⟩ to ∣1⟩ and vice versa. It is an analogue of the classical NOT gate.
Swap Gate
The swap gate swaps two qubits. It can be visualized in Figure 9-2. Of course, there are
many quantum gates, but we have introduced those which are commonly used and will
help us to understand the algorithms later in this chapter.
All these gates can be visualized in Figure 9-2.
Measurement
In addition to gates, another important element is measurements. A measurement takes
a quantum state and collapses it into one of the basis states. We can visualize this in
Figure 9-3.
Quantum Circuits
Using quantum gates, we build quantum circuits. A quantum circuit is basically a
sequence of quantum operations applied to qubits. It is composed of quantum gates
(operators), quantum registers containing qubits providing input, quantum wires
representing a sequence of operations over time, and measurements. Time runs from left
to right in quantum circuits. Figure 9-4 shows what a quantum circuit looks like.
Quantum gates are represented as boxes. On the left side, we have quantum
registers. Quantum wires represent a qubit, that is, a photon or an electron. Each gate
introduces a change in the qubit, for example, a change in the spin of an electron.
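As a concrete illustration of these elements, the following short Python sketch builds a two-qubit circuit with a Hadamard gate, a CNOT gate, and measurements. It assumes the open source Qiskit library is installed; it is an example written for this discussion, not code from a particular project.

from qiskit import QuantumCircuit

qc = QuantumCircuit(2, 2)     # a 2-qubit quantum register and 2 classical bits
qc.h(0)                       # Hadamard puts qubit 0 into an even superposition
qc.cx(0, 1)                   # CNOT entangles qubit 0 (control) with qubit 1 (target)
qc.measure([0, 1], [0, 1])    # measure both qubits into the classical bits
print(qc.draw())              # gates appear as boxes on horizontal wires, time running left to right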
Teleportation Circuit
Can we transfer a quantum state from one quantum device to another? Yes, we can; for
this purpose, teleportation is used, which uses entanglement to move a quantum state
from one quantum device to another.
In Figure 9-5, a teleportation circuit is shown, which can transport a quantum state
from one party to another.
GHZ Circuit
The Greenberger-Horne-Zeilinger (GHZ) state is an entangled state of three or more
qubits. If three or more particles get into an entangled state, it’s called a multipartite
entanglement.
Figure 9-6 visualizes this circuit.
GHZ states have been shown to be useful in quantum cryptography and quantum
Byzantine agreement (consensus) algorithms, as we’ll explore later in this chapter.
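For illustration, a three-qubit GHZ state can be prepared with one Hadamard and two CNOT gates, as in the following Qiskit sketch (again assuming the library is installed; this mirrors the circuit in Figure 9-6 only conceptually):

from qiskit import QuantumCircuit

ghz = QuantumCircuit(3)
ghz.h(0)         # superposition on the first qubit
ghz.cx(0, 1)     # entangle qubit 0 with qubit 1
ghz.cx(1, 2)     # extend the entanglement to qubit 2, giving (|000> + |111>)/sqrt(2)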
W State Circuit
The W state circuit is another way to achieve entanglement of three particles. The
difference with GHZ is that in the W state if one qubit is lost out of three, then the
remaining two will remain entangled. GHZ however does not have this property. The
circuit is shown in Figure 9-7.
The W state circuit has application in leader election algorithms, as we’ll see later in
this chapter.
Quantum Algorithms
As we know, algorithms are sets of instructions to solve a problem. Quantum algorithms
are the same in this regard; however, they run on quantum devices and contain at least
one quantum operation, for example, superpositions or entanglement operations. In
other words, a quantum algorithm is the same as the classical algorithm in the sense
that it is a set of instructions to solve a problem, but it has instructions for creating
superpositions and entanglements.
Famous quantum algorithms include the Deutsch-Jozsa black-box algorithm, Shor's algorithms for the discrete log problem and factorization, and Grover's search
algorithm. There is a catalogue maintained at the quantum algorithm zoo –
https://quantumalgorithmzoo.org.
There are primarily three classes of quantum algorithms: quantum search
algorithms, quantum Fourier transform–based algorithms, and quantum simulation
algorithms.
Computational complexity theory allows us to categorize algorithms based on the computational resources required to solve a problem. For this purpose, complexity classes are considered. With quantum computing, new complexity classes have also emerged. Quantum computers are expected to efficiently solve certain problems that classical computers cannot solve efficiently.
Several complexity classes exist; we describe them as follows.
P – Polynomial
A polynomial time class categorizes problems which are solvable in polynomial time,
that is, a reasonable amount of time.
Quantum Networks
Quantum networks are like classical networks with the same routing strategies and
topologies. The key difference is that nodes can implement quantum computations and
relevant quantum processes. Channels between quantum devices in a quantum network
can be quantum or classical.
Quantum Internet
Just as the ARPANET from 1969 with just four nodes became the Internet of today with
billions of entities2 on it, it is expected that small experimental scale quantum networks
will become a quantum Internet of tomorrow.
It is envisioned that a quantum network infrastructure will be developed to
interconnect remote quantum devices and enable quantum communication between
them. The quantum Internet is governed by laws of quantum mechanics. It transfers
qubits and distributes entangled quantum states. As the number of nodes grows in the
quantum Internet, so does the quantum power. This is so because, as the number of
qubits scales linearly with the number of quantum devices on the network, the quantum
Internet could enable an exponential quantum speedup, resulting in a “virtual quantum
computer” capable of solving previously impossible problems.
Traditional operations present in classical networks such as long-term data storage,
data duplication (copying), and straightforward state reading are no longer applicable in
quantum networks.
2 As almost everything can connect to the Internet now, the term “entities” is used, which includes users, things, devices, etc.
Quantum Blockchain
Inevitably with the quantum Internet and quantum distributed systems, we can envisage
a quantum blockchain that utilizes quantum computers as nodes and the underlying
quantum Internet as the communication layer.
There are two facets of blockchains in the quantum world. The first one is the
pure quantum blockchains running on top of the quantum Internet. Some work has
been done in this regard, and an innovative proposal by Rajan and Visser is to encode
blockchains into a temporal GHZ state.
The other aspect is the existence of classical blockchains in the post-quantum world.
Quantum computers can impact the security of blockchains and consensus adversely
due to the ability to break classical cryptography. More on this later in this chapter in the
last section.
Quantum Cryptography
Quantum cryptography is indeed the most talked-about aspect of quantum computing,
especially from the point of view of the impact that it can have on existing cryptography.
There are two dimensions here; one is quantum cryptography, and the other is post-
quantum cryptography. Quantum cryptography refers to cryptography primitives or
techniques that are based on properties of quantum mechanics; examples include quantum key distribution, quantum coin flipping, and quantum commitment. Using quantum
properties, a new type of unconditionally secure mechanisms can be developed, which
have no counterpart in the classical world. Quantum key distribution (QKD) protocols, such as BB84, proposed by Bennett and Brassard in 1984, allow two parties to construct private keys securely using qubits. The benefit of this quantum scheme is that, due to superposition, any adversary trying to eavesdrop will inevitably be detected.
The other dimension of the study of quantum cryptography is the impact of
quantum computers on classical cryptography. We know that using Shor’s algorithm, the
discrete log problem can be solved, and integer factorization can be sped up, which can
result in breaking commonly used public key cryptography schemes, such as RSA and
elliptic curve cryptography. Not as much impact is expected on symmetric cryptography, because key lengths can simply be increased to ensure that exhaustive searches, which take O(n) in the classical world and O(√n) using quantum techniques made possible by Grover's algorithm or similar, are no longer effective.
Quantum Consensus
Quantum consensus drives a quantum network with some qubits to a symmetric state.
Also, with the advent of quantum computing, the problems from the classical computing
world are being studied through the lens of quantum computing to see if there are any
improvements that can be made to the existing algorithms from the classical world by
harnessing the quantum power. One such problem is distributed consensus from the
classical computing/networking world, which we have studied throughout this book.
We’ll study the agreement/consensus in quantum distributed systems and the impact of
quantum computing on classical distributed systems, especially consensus. It has been
shown that the classical distributed consensus problem when studied under a quantum
framework results in enhancing the classical results and can also solve problems which
are otherwise unsolvable in classical networks.
Also, in quantum networks, reaching an agreement is required in many cases; thus,
pure quantum consensus for quantum networks and quantum Internet is also an area of
interest.
In this section, we’ll focus more on enhancement of classical results rather than
pure quantum consensus where a quantum network with some qubits is brought to
consensus, that is, in a symmetric state. Classical result enhancement is more relevant to
our study of consensus in this book.
Quantum consensus algorithms are a very active area of research. There are four
categories that have emerged in this context:
• Entanglement-based consensus
• Measurement-based consensus
The system model comprises n nodes, and each pair of nodes is connected
through a separate two-way quantum channel. The protocol works in rounds, and each
round has two phases. In the first phase, all processors send and receive messages,
and the second phase is the computation phase where nodes do local computation to
process received messages and decide what messages to send.
The protocol satisfies the following conditions:
a. Send the kth qubit to the kth player while keeping one part to yourself.
2. Generate the state ∣Lᵢ⟩ = (1/√(n^(3/2))) ∑_{a=1}^{n^(3/2)} ∣a, a, …, a⟩ on n qubits, an equal superposition of the numbers from 1 to n^(3/2).
Round 2
5. Select the process which has the highest leader value as the leader
of the round.
Using this algorithm, a weak global coin is obtained. The probability for either common outcome is at least 1/3 if 3t < n. The protocol works for crash faults.
Another protocol for Byzantine faults is presented in the paper, which can tolerate up to t < n/4 faulty nodes under an asynchronous environment.
This is so because the quantum entanglement here guarantees that entangled qubits
collapse to the same state. This means that all nodes will end up with the same state,
which means an agreement.
The GHZ state is used to realize this scheme. The key assumption made here is that
each node receives a qubit during the setup phase and then later measures it, which
results in the same collapsed state at all nodes.
The algorithm exploits this property directly. Agreement is provided simply because measuring any single fully entangled qubit
in the GHZ state causes all other qubits in the same GHZ state to collapse to the same
basis state. Validity is achieved because when the measurement is made on the first
node, it’s effectively proposing either a ∣0⟩ or ∣1⟩.
This protocol can tolerate network delays because even if a qubit arrives late on a
quantum node, the other nodes will keep working without any impact. When the late
qubit arrives at any time, it will already contain the agreed-upon value, again due to
entanglement. The fundamental reason why FLP can be refuted using this algorithm is that it requires only a one-way broadcast, and no classical response is required. Even if a
single process does not measure its qubit, it will not impact the overall outcome of the
computation, that is, even if that single measurement is not available, the other correct
nodes will continue to operate even if one qubit is missing. This algorithm will work if
the distribution of the original GHZ state completes successfully. Once that’s complete,
missing measurements won’t impact the protocol from there onward. In case of
Byzantine faults, the algorithm is also resilient. Any malicious party cannot tamper with
the value the algorithm will eventually choose. This is so because any measurement does
not impact the correlation of other qubits in the system. This means that any adversary
measuring the qubits cannot impact the final chosen value. Due to the quantum nature,
the qubit will always end up as 0 or 1. This means that a decision will always eventually be made regardless of Byzantine faults. This achieves termination.
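The following Python sketch is a purely classical simulation of this intuition: the first measurement of the shared GHZ state fixes a random bit, and every subsequent measurement observes the same bit, so all nodes decide the same value. The node and helper names are invented for illustration.

import random

def distribute_ghz(num_nodes):
    shared = {"collapsed": None}      # stands in for the shared entangled GHZ state
    return [shared] * num_nodes       # every node holds "one qubit" of the same state

def measure(qubit):
    if qubit["collapsed"] is None:    # the first measurement collapses the state at random
        qubit["collapsed"] = random.choice([0, 1])
    return qubit["collapsed"]         # all later measurements observe the same value

nodes = distribute_ghz(4)
decisions = [measure(q) for q in nodes]
assert len(set(decisions)) == 1       # every node decides the same bit: agreement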
–– If the same state exists on all quantum nodes, each individual node’s
wave function is a scalar multiple of the system’s state wave function.
The wave function then outputs zero.
–– In case of any different state on a node, then that node’s wave func-
tion will not be a scalar multiple of the entire system’s state function.
So, the difference between the system state wave function and the
node’s state wave function is not zero, indicating the discrepancy.
The difference increases as the discrepancy between the node’s state
and system state grows. The level of fault tolerance can be deter-
mined if the difference does not fall below a certain threshold. This
could be a percentage of the total system, in line with classical BFT
where roughly 33% of the nodes can be faulty in a network. It is
possible to apply similar thresholds here too, though not proven.
–– If all wave functions are the same, then the expected result is zero,
indicating a coherent system. Otherwise, the system is not coherent.
The larger the difference, the more incoherent the system is.
In such a scheme, checking for agreement can be done locally, by comparing the state of all qubits. This contrasts with a classical
high complexity request-response style of message passing, which increases time and
communication complexity. With quantum properties, this complexity is reduced, and
consensus can be reached in half the time as compared to the classical consensus. Such
efficiency gains can be utilized in blockchains to increase scalability. Moreover, any
attempt of manipulation of the original and the sent state in isolation is immediately
detectable due to entanglement, thus resulting in increased security of the system.
The paper also proposes several scalability, privacy, and performance enhancements
addressing the blockchain trilemma.
The W state for three qubits, used for leader election, is given as
∣W⟩ = (1/√3)(∣001⟩ + ∣010⟩ + ∣100⟩)
3. b = measurement of q
This protocol has time complexity of O(1), that is, constant, and there is no message
passing required. This is in complete contrast with the classical world where multiround
protocols with higher complexity are usually common. This quantum protocol works in
asynchronous networks too.
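The following classical Python simulation captures the intuition only: a measurement of the shared W state yields outcome 1 at exactly one node, which thereby becomes the leader without any message exchange. The code is illustrative and not part of the cited protocol.

import random

def measure_w_state(num_nodes):
    outcomes = [0] * num_nodes
    outcomes[random.randrange(num_nodes)] = 1   # exactly one node observes 1, as in |001>+|010>+|100>
    return outcomes

outcomes = measure_w_state(3)
leader = outcomes.index(1)                      # each node learns locally whether it is the leader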
Also, a simple algorithm for consensus is presented. Leader election was based on
symmetry breaking; however, this algorithm is dependent on symmetry preservation.
The idea is that to achieve totally correct anonymous quantum distributed consensus
where each process has one qubit initially, the processors are required to be entangled in
a GHZ state. It is shown that not only is it necessary but also a sufficient condition.
The core idea of the protocol is to share the GHZ entangled state between all nodes
participating in the consensus. This allows to create symmetry in one step.
The GHZ state for three qubits is given as
| GHZ 〉 =
(|000〉+|111〉 )
√2
Again, this protocol has a time complexity of O(1), and no message passing is needed. It works under different communication topologies and under asynchrony. The results in the paper show that the GHZ state is necessary and sufficient for consensus, while the W state is necessary and sufficient for leader election.
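A similarly simplified classical sketch, assuming an ideal, noiseless GHZ state, shows what the GHZ-based consensus looks like from the nodes' perspective: every node's local measurement returns the same random bit, so agreement is reached without any message passing.

```python
import random

def ghz_consensus(n):
    """Simulate measuring an n-qubit GHZ state (|00...0> + |11...1>)/sqrt(2)
    in the computational basis: all qubits collapse to the same random bit,
    so every node decides the same value with a single local measurement."""
    agreed_bit = random.randint(0, 1)
    return [agreed_bit] * n

print(ghz_consensus(4))   # e.g., [1, 1, 1, 1] -> all four nodes agree
```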
Other Algorithms
There is a wide range of quantum consensus algorithms and related proposals. It is not possible to cover them all in detail here; however, this section summarizes some of the prominent results.

Luca Mazzarella, Alain Sarlette, and Francesco Ticozzi, in "Consensus for Quantum Networks: From Symmetry to Gossip Iterations," extend the classical distributed consensus problem to networks of quantum systems. They propose a general framework for studying the consensus problem in the quantum world, and a quantum gossip-style algorithm is also presented in the paper.
Summary
As quantum computing is a vast and deep subject, we have not covered everything. Still, this chapter should give us a good understanding of quantum computing and how it can benefit distributed systems and consensus. Moore's law is nearing its end, and quantum computing offers a way to reinvigorate it. In computer science, new complexity classes are emerging due to quantum computing, while for physicists the interest lies in understanding more about quantum theory.

This chapter gives us the intuition to think more deeply and to prepare for further research and exploration. The main point to remember is that quantum consensus mechanisms largely exploit the quantum properties of superposition and entanglement.
In the next chapter, we conclude the ideas presented in this book and touch upon some exotic ideas and future research directions.
Bibliography
1. Rohde, P.P., 2021. The Quantum Internet: The Second Quantum
Revolution. Cambridge University Press.
13. Lamport, L., 1979. Constructing digital signatures from a one-way function. (Lamport signatures).
16. Sun, X., Kulicki, P., and Sopek, M., 2020. Multi-party quantum
Byzantine agreement without entanglement. Entropy, 22(10),
p. 1152.
17. Fitzi, M., Gisin, N., and Maurer, U., 2001. Quantum solution to the
Byzantine agreement problem. Physical Review Letters, 87(21),
p. 217901.
18. Webber, M., Elfving, V., Weidt, S., and Hensinger, W.K., 2022.
The impact of hardware specifications on reaching quantum
advantage in the fault tolerant regime. AVS Quantum Science, 4(1),
p. 013801.
CHAPTER 10
Conclusion
Congratulations on making it this far! We have come a long way and covered a lot of ground. In this chapter, we summarize some important topics, look at some of the latest research and ideas, and touch upon a few more consensus protocols. An important aspect of consensus protocol research is the formal design and verification of algorithms, and we briefly explain this important area in this chapter. We also compare some of the most common consensus protocols from different angles and introduce some important research directions.
Introduction
Consensus protocols are the backbone of distributed systems and especially blockchains. Throughout this book, we discussed several protocols and related topics. Currently, the blockchain is the most common setting in which consensus protocols are implemented; in fact, they sit at the core of the blockchain. With the blockchain, novel schemes are being introduced that address various problems, including node scalability, transaction throughput, consensus efficiency, fault tolerance, interoperability, and various security aspects.

Distributed consensus is used almost everywhere networked devices exist, not only in the classical distributed systems that we are used to. This includes the Internet of Things, multiagent systems, distributed real-time systems, embedded systems, and lightweight devices. With the evolution of the blockchain, blockchain consensus adoption is expected to grow in all these systems too.
Other Protocols
In this section, we briefly introduce protocols that we have not covered before. As this is a vast area, only a brief introduction is given.
PoET
The proof of elapsed time (PoET) algorithm was introduced by Intel in 2016. Recall from Chapter 5 that a key purpose PoW fulfills is the passage of some time until the network can converge to a canonical chain, and that the leader, the miner whose block is accepted, wins the right to propose by solving PoW. PoET is fundamentally a leader election algorithm that utilizes trusted hardware to ensure that a certain time has elapsed before the next leader is selected for block proposal: each node waits for a random period, and the node whose wait finishes first is elected to propose the new block.
PoET in fact emulates the passage of time that PoW mining would consume. The core idea is that every node waits for a random amount of time before producing a block. The random waiting process runs inside a trusted execution environment (TEE), such as Intel SGX or ARM TrustZone, to ensure that the time has genuinely passed. Because TEEs provide confidentiality and integrity, the network in turn trusts the block producers. PoET tolerates up to 50% faulty TEE nodes.
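The following toy sketch, which is only an illustration and ignores the TEE attestation that makes the wait times trustworthy in the real protocol, simulates one PoET lottery round under the assumption that each node's wait time is drawn from an exponential distribution.

```python
import random

def poet_round(node_ids, mean_wait=10.0):
    """Simulate one PoET round (no real TEE): each node asks its trusted
    environment for a random wait time, and the node whose timer expires
    first wins the right to propose the next block."""
    waits = {node: random.expovariate(1.0 / mean_wait) for node in node_ids}
    leader = min(waits, key=waits.get)
    return leader, waits

leader, waits = poet_round(["A", "B", "C", "D"])
print("leader:", leader)
print({node: round(w, 2) for node, w in waits.items()})
```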
However, there is a possibility of Sybil attacks, where an actor runs many TEE nodes and thereby shortens the effective random waiting time. This can lead to the creation of a malicious chain if more than 50% of the TEEs become malicious. Another limitation is the stale chip problem, highlighted by Ittay Eyal, which leads to hardware and resource wastage. The problem stems from the fact that it is financially beneficial for adversarial actors to collect many old, cheap SGX-enabled chips and build mining rigs out of them: rather than buying modern CPUs with SGX, which would support PoET consensus and remain useful for general computation, they hoard old chips that serve only one purpose, mining, and thereby increase their chances of winning the mining lottery. It is similar to Bitcoin miners racing to acquire as many fast ASICs as possible to increase their chances of being elected as miners, and it results in hardware wastage and keeps old, inefficient CPUs in use. There is also the possibility of hacking the chip's hardware. If an SGX chip is compromised, the malicious node can win the mining round every time, resulting in complete system compromise and undeserved incentivization of miners. This is called the broken chip problem.
Proof of Authority
We can think of PoA as a specific kind of proof of stake where the validators stake their identity instead of economic tokens. A validator's identity represents the authority associated with it. A typical process of earning this authority involves identity verification, reputation building, and a publicly scrutinized assessment process. The resulting group becomes a highly trusted set of validators that participate in the consensus protocol and produce blocks. Any violation of the protocol rules by a validator, or any failure to justify its earned right to produce blocks, results in the removal of the dishonest validator by the other validators and users on the network. PoA is used in the Rinkeby and Kovan Ethereum test networks. It provides good security because the validators are trusted, but the network is somewhat centralized. The resilience against collusion and other security threats depends on the consensus algorithm used by the validators; if it is a BFT variant, the usual BFT guarantee of tolerating up to roughly 33% faulty validators applies.
HoneyBadger BFT
HoneyBadger BFT (HBBFT) is a leaderless, randomized consensus protocol that works under asynchrony. It is the first practical asynchronous Byzantine consensus protocol. Generally, in distributed systems theory, randomized algorithms have been thought to be impractical; the HoneyBadger authors claim to refute this belief. Their approach is to improve efficiency by fine-tuning existing primitives and introducing new encryption techniques, which removes bottlenecks in the protocol.

The first limitation to address is the leader bottleneck. In PBFT and its variants, a standard reliable broadcast mechanism is used to disseminate information from the leader to the other nodes, which results in significant bandwidth consumption at the leader, on the order of O(nB), where B is the block (batch) size. To address this limitation, erasure coding can be used: the leader sends each node only an erasure-coded fragment of the block, each node then forwards its fragment to the other nodes, and all nodes reconstruct the full message. This reduces the leader's outgoing traffic to O(n) small fragments instead of n copies of the full block, taking most of the load off the leader. Transactions are processed in batches in HBBFT, which increases throughput.
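To illustrate why erasure coding relieves the leader, here is a deliberately simplified sketch that uses a single XOR parity fragment instead of the Reed-Solomon codes and Merkle proofs used by the actual protocol; it tolerates only one missing fragment, but it shows that the leader disseminates one small fragment per node rather than the full block to everyone.

```python
def make_fragments(block, n):
    """Split a block into n-1 data fragments plus one XOR parity fragment,
    so that any n-1 of the n fragments are enough to rebuild the block.
    (The real protocol uses a proper Reed-Solomon erasure code plus Merkle
    proofs; this XOR scheme tolerates only one missing fragment.)"""
    k = n - 1
    size = -(-len(block) // k)                 # ceiling division
    data = [block[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = data[0]
    for frag in data[1:]:
        parity = bytes(x ^ y for x, y in zip(parity, frag))
    return data + [parity]

# The leader sends one small fragment to each of the n nodes instead of the
# whole block to everyone; nodes then exchange fragments and reconstruct.
frags = make_fragments(b"block payload ...", n=4)
print([len(f) for f in frags])   # four fragments, each about a third of the block
```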
Avalanche
This new paradigm of consensus protocols achieves agreement through random network sampling. The Avalanche family of protocols allows a more relaxed form of agreement compared to deterministic agreement; however, it provides stronger safety than PoW and enjoys the node scalability of Nakamoto consensus. We can think of it as a probabilistic version of a traditional quorum-based protocol, but without requiring explicit quorum intersection in voting. The key idea is to combine the best of the Nakamoto family with the best of the classical family.

The safety of these protocols is probabilistic, with a negligible possibility of failure: an adjustable, system-chosen security parameter makes the probability of consensus failure negligibly small. In terms of liveness, the protocols terminate with high probability, and for liveness they rely on synchrony. Safety is achieved through metastability.

The family consists of several protocols that build up to the complete Avalanche protocol. The protocol is built gradually, starting from the so-called Slush protocol, which provides metastability; then Snowflake, which adds Byzantine fault tolerance; then Snowball, which strengthens decisions by adding confidence counters; and, finally, Avalanche, which adds the DAG structure to improve efficiency.

With these innovations, the protocol provides quick finality, low latency, high throughput, and scalability.
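A toy, single-node sketch of the repeated random sampling that drives the Slush/Snowball family follows; the parameter values (k, alpha, and the number of rounds) are illustrative only, and the rest of the network is frozen rather than co-evolving as it would in the real protocol.

```python
import random

def slush_round(my_color, peers, k=10, alpha=0.8):
    """One query round of the Slush subprotocol: sample k random peers and
    adopt a color if at least alpha * k of the sampled peers report it;
    otherwise keep the current color."""
    sample = random.sample(peers, k)
    for color in ("red", "blue"):
        if sum(1 for p in sample if p == color) >= alpha * k:
            return color
    return my_color

# Toy network snapshot: 70 nodes prefer blue, 30 prefer red.
network = ["blue"] * 70 + ["red"] * 30
color = random.choice(["red", "blue"])     # the querying node starts with a random color
for _ in range(15):                        # m consecutive query rounds
    color = slush_round(color, network, k=10, alpha=0.8)
print(color)   # almost always "blue": repeated sampling tips the node toward the majority
```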
Block-Based DAG
Every vertex of a blockDAG contains a block. Each block can be the child of multiple blocks, instead of having only one parent as in linear chain designs. SPECTRE, PHANTOM, and Meshcash are examples of block-based DAG protocols.
Transaction-Based DAG
Each vertex of a transaction-based DAG contains a transaction. IOTA Tangle, Byteball, Graphchain, and Avalanche are examples of transaction-based DAGs. Hashgraph is a permissioned graph-based blockchain with a BFT-type consensus protocol.
Ebb-and-Flow Protocols
This work is a response to a liveness issue found in Gasper. Gasper is a PoS-based
consensus mechanism proposed for Ethereum’s beacon chain. It is a combination of
Casper FFG, the finality gadget, and LMD GHOST, the fork choice rule. In this work,
“snap-and-chat” protocols are proposed that are provably secure.
Nakamoto-style protocols remain live under network partitions and dynamic network participation, but they sacrifice safety for liveness. BFT protocols, on the other hand, provide safety (finality) under network partitions and low participation (fewer than 3f + 1 participants) but sacrifice liveness. It has been shown that it is impossible for a protocol to be both live under dynamic participation and safe under network partitions. This work answers the question of whether there exists a consensus mechanism that guarantees both availability and finality. The key idea is elegant: create two ledgers instead of one. Remember, no single-ledger protocol can ensure both safety and liveness under network partitions and dynamic participation. In other words, longest chain-style mechanisms favor liveness over safety and provide dynamic availability under varying participation levels, whereas BFT protocols favor safety over liveness and provide finality. This is called the availability-finality dilemma, because a single ledger cannot provide both properties. Therefore, the proposal is to maintain two ledgers. The first is an "available full ledger" that is always live but safe only in the absence of network partitions, similar to a longest chain-type PoW protocol. The other ledger, called the "finalized prefix ledger," is always safe but not live in low-participation scenarios; this matches traditional BFT-style protocols, for example, PBFT, which stall unless a threshold of participants is available. As the finalized prefix ledger is a prefix of the available full ledger, both ledgers eventually converge into a single authentic chain of history. In other words, the finalized prefix ledger is safe under network partitions as long as less than one-third of the participants are faulty.
Moreover, the available full ledger is live under dynamic (low) participation as long as less than 50% of the active participants are Byzantine. This technique of combining a BFT-style protocol with a Nakamoto-style protocol and creating so-called nested ledgers is named the "ebb-and-flow" property, and the so-called "snap-and-chat" protocols are developed to achieve it. The finalized ledger is always a prefix of the available ledger, so together they form a proper single chain. At a high level, the mechanism works by first ordering transactions into a chain of blocks using some longest chain-type protocol, for example, PoW. Next, snapshots of prefixes of this blockchain are fed into a partially synchronous BFT-style protocol, for example, PBFT, which produces an ordered sequence of these snapshots (chains of blocks); removing duplicate and invalid transactions from this output yields the finalized prefix ledger. This finalized prefix ledger is then prepended to the output of the longest chain-style protocol, duplicates and invalid transactions are again removed, and the result is the single available ledger.
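A highly simplified sketch of this construction follows, assuming transactions are plain strings and ignoring validity checks; it only shows how the two ledgers relate and why the finalized ledger is a prefix of the available one.

```python
def sanitize(txs):
    """Remove duplicate transactions, keeping the first occurrence."""
    seen, out = set(), []
    for tx in txs:
        if tx not in seen:
            seen.add(tx)
            out.append(tx)
    return out

def build_ledgers(longest_chain_txs, finalized_snapshots):
    """Sketch of the snap-and-chat construction: the BFT protocol orders
    snapshots (prefixes) of the longest chain into the finalized prefix
    ledger; prepending it to the full longest-chain output and removing
    duplicates yields the available full ledger."""
    finalized = sanitize([tx for snapshot in finalized_snapshots for tx in snapshot])
    available = sanitize(finalized + longest_chain_txs)
    return finalized, available

longest_chain = ["tx1", "tx2", "tx3", "tx4", "tx5"]
snapshots = [["tx1", "tx2"], ["tx1", "tx2", "tx3"]]     # BFT-finalized prefixes
finalized, available = build_ledgers(longest_chain, snapshots)
print(finalized)   # ['tx1', 'tx2', 'tx3']            -> always safe
print(available)   # ['tx1', 'tx2', 'tx3', 'tx4', 'tx5'] -> always live; finalized is its prefix
```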
A myriad of consensus protocols exist, and we cannot cover them all. However, we have explored quite a few, covering different types and classes, including randomized, deterministic, CFT, BFT, and Nakamoto-style protocols.

Let's now turn our attention to formal verification, which allows us to ensure the correctness of all these different consensus protocols.
Formal Verification
Given all the activity in blockchain consensus research, we can appreciate what an active area of research this is. Many new protocols have been proposed to solve consensus problems in innovative ways: some address efficiency, some look at scalability problems, some try to reduce message complexity, some modify existing classical protocols to make them suitable for the blockchain, some try to speed up the consensus mechanism, and many other improvements and novelties are claimed. Here, a question arises: How do we ensure that these consensus protocols are correct and perform as intended? For this purpose, researchers usually write papers with proofs and arguments about the correctness of the protocols. Moreover, formal methods are used to ensure protocol correctness.
The essential step is to mechanically check a formal model of the protocol to ensure that the model satisfies its specification.
Two categories of techniques are commonly used for verification: state exploration-based approaches and proof-based approaches. State exploration-based methods are automatic but inefficient and difficult to scale. A usual problem is state explosion, where the number of states to check grows exponentially, becoming so large that the model no longer fits in a computer's memory. This is why the model must be finite, so that it can be verified efficiently. On the other hand, proof-based approaches (i.e., theorem proving) are more precise and consume less memory but require human interaction and deeper knowledge of proofs and the relevant techniques. Proof-based techniques are the most elegant way of reasoning about the properties of a system without any limit on the size of the specification, in contrast with model checking, where there must be a limit on the size of the model. With proof-based techniques, you can reason about the system states and prove that, for any input, the system will always behave as intended. Proof assistants such as Isabelle help with this reasoning by partially automating theorem proving.
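To give a flavor of state exploration, here is a minimal explicit-state model checker, written for this discussion rather than taken from any tool: it breadth-first searches the reachable states of a deliberately buggy toy model in which two nodes may decide values independently, and it reports a counterexample to the agreement invariant.

```python
from collections import deque

def model_check(initial, next_states, invariant):
    """Explicit-state exploration: breadth-first search over all reachable
    states, checking the invariant in each one. The number of visited states
    is what blows up in practice (the state explosion problem)."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return False, state, len(seen)      # counterexample found
        for nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True, None, len(seen)                # invariant holds in all reachable states

# Toy model: two nodes each decide 0 or 1 (None = undecided). The buggy
# transition rule lets each node decide independently, so agreement can fail.
def next_states(state):
    succs = []
    for i, val in enumerate(state):
        if val is None:
            for b in (0, 1):
                succs.append(tuple(b if j == i else v for j, v in enumerate(state)))
    return succs

def agreement(state):
    decided = [v for v in state if v is not None]
    return len(set(decided)) <= 1

print(model_check((None, None), next_states, agreement))
# -> (False, (0, 1), 9): a reachable state where the two nodes disagree
```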
Impossibility Results
Unsolvability results in distributed systems show that certain problems cannot be solved at all. Lower bound results show that certain problems cannot be solved with insufficient resources; in other words, they establish the minimum resources required to solve a problem, and the problem is solvable only when that threshold of resources is available.

Table 10-1 summarizes the core impossibility results related to the consensus problem.
The results in Table 10-1 are the standard impossibility results; however, there are many others. With the innovative research on blockchains, some new results have emerged. Andrew Lewis-Pye and Tim Roughgarden announced a fascinating new impossibility result, similar to the CAP theorem in that we can simultaneously choose only two of three properties. It states that no blockchain protocol can simultaneously operate in the unconstrained (unsized) setting (e.g., PoW), remain live in a synchronous environment with significant and sharp dynamic changes in network resources (e.g., participant numbers), and satisfy probabilistic finality (consistency) in the partially synchronous setting. Only two of these three properties can be chosen simultaneously.

For example, in an unsized environment such as Bitcoin, imagine a node stops receiving any new blocks. The node cannot distinguish whether the other nodes have lost their resources and can no longer produce blocks, or whether the block messages are merely delayed. If the node stops producing blocks while the other nodes really are low on resources and not producing blocks, it violates liveness, because this node must keep producing blocks even if the others do not. However, if it keeps producing blocks while the block messages are merely delayed, it risks violating consistency, because there could be conflicting blocks that are simply late.
Complexity and Performance
A consensus algorithm can be evaluated from a communication complexity point of view. This involves calculations such as how many messages must be exchanged to reach consensus when the protocol runs in normal mode (no failures), and how many messages are exchanged when a leader failure triggers a view change. Such metrics help us understand how the algorithm behaves in practice and estimate its efficiency.

A message delay concerns messages that cannot be sent before the previous message has been received. An algorithm requires n message delays if some execution contains a chain of n messages, each of which cannot be sent before the previous one is received.

To evaluate the cost of an algorithm, we can consider different complexity measures. There are three costs commonly associated with a consensus algorithm: message complexity, communication complexity, and time complexity.
Message Complexity
Message complexity denotes the total number of messages that the algorithm must exchange to reach consensus. For example, imagine an algorithm in which every process broadcasts to all other nodes; then n(n − 1) messages are received, so the algorithm has O(n²) message complexity.
Time Complexity
Time complexity is concerned with the amount of time needed to complete the execution of the algorithm. This time also depends on how long it takes to deliver the protocol's messages, and message delivery time is typically much larger than the local computation on a message. Time can therefore be thought of as the number of consecutive message delays. The algorithm from the previous example, running on a faultless network, has O(1) time complexity.
Space Complexity
Space complexity deals with the total amount of space required for the algorithm to run. It is mostly relevant in a shared memory setting.

In message-passing distributed systems such as blockchains, message complexity is usually the measure considered. Bit complexity is less relevant; however, if the messages are large, it can become another complexity measure to take into consideration.

Table 10-2 summarizes the complexity results of some common BFT protocols.
Comparison of Protocols
We can compare consensus algorithms from different perspectives. Table 10-3
summarizes the results.
Notes:
• PoA is assumed to be BFT based.
• While only a single example is given for each, there are many; for example, Ethereum uses PoW too.
• PBFT and RAFT are both state machine replication protocols with a leader-follower architecture, also called primary-backup. In the literature, the terms primary and backup are usually used for PBFT, whereas leader and follower are used for RAFT; fundamentally, they serve the same purpose.
Network Model
We can model a blockchain network in several ways, as shown in the following. Fundamentally, a network is either synchronous, asynchronous, or partially synchronous; however, several terms are used in the literature, and they are explained as follows.
Synchronous
All messages are delivered within a known time bound Δ.
Eventual Synchrony
After an unknown global stabilization time (GST), all messages are delivered within time Δ.
Partial Synchrony
The bound Δ exists, but the protocol does not know what Δ is.
Weak Synchrony
Δ varies with time. In practice, Δ is increased systematically until liveness is achieved; however, the delays are not expected to grow exponentially.
Asynchronous
All messages are eventually delivered, but there is no fixed upper bound on message delivery time: the delivery delay is finite, but no time bound is assumed for it.
Adversaries are primarily of two types, static and adaptive. A static adversary corrupts processes before the protocol executes, whereas an adaptive adversary can corrupt processes at any time during the execution of the protocol.
There are two failure models, the crash failure model and the Byzantine failure model. Round-based algorithms have a send step, a receive step, and a compute step, which together make up one round.
We can consider several aspects of consensus protocols and use them to study, evaluate, or classify the protocols:
• Time complexity: How long the protocol takes to run and how many message delays it requires.
• Trusted setup: Does the protocol need any setup, such as a PKI or a trusted dealer, or is no setup required?
Research Directions
Blockchain consensus is a very ripe area for research. Even though tremendous progress has been made, there are still a number of open research problems that should be addressed, each with its own possible directions of research.
Summary
In this last chapter, we summarized what we learned throughout the book. We also covered algorithms that we had not discussed before, especially newer consensus protocols such as Avalanche and the Ebb-and-Flow protocols, and touched upon some research directions that require further work. We have come a long way, from the Byzantine generals problem to Nakamoto consensus and now multichain consensus protocols. This is such a ripe area of research that we will only see more progress and more innovative ideas in the future.

In this book, we explored the foundations of blockchains and distributed consensus. We learned what impact quantum computing can have on distributed consensus and how agreement could be achieved in quantum networks. Blockchain consensus is possibly the strongest area of research in the blockchain space.

Thank you for staying with me on this wonderful journey. You are now capable of applying the knowledge from this book as a blockchain researcher and of continuing your learning and research in the field of blockchain consensus.
Bibliography
1. Decentralized thoughts: https://decentralizedthoughts.
github.io
2. Xiao, Y., Zhang, N., Lou, W., and Hou, Y.T., 2020. A survey of
distributed consensus protocols for blockchain networks. IEEE
Communications Surveys & Tutorials, 22(2), pp. 1432–1465.
3. Miller, A., Xia, Y., Croman, K., Shi, E., and Song, D., 2016, October.
The honeybadger of BFT protocols. In Proceedings of the 2016
ACM SIGSAC conference on computer and communications
security (pp. 31–42).
5. Neu, J., Tas, E.N., and Tse, D., 2021, May. Ebb-and-flow protocols:
A resolution of the availability-finality dilemma. In 2021 IEEE
Symposium on Security and Privacy (SP) (pp. 446–465). IEEE.