Distributed Systems For Practitioners Sample
Preface
Acknowledgements

I Fundamental Concepts

1 Introduction
  What is a distributed system and why we need it
  The fallacies of distributed computing
  Why distributed systems are hard
  Correctness in distributed systems
  System models
  The tale of exactly-once semantics
  Failure in the world of distributed systems
  Stateful and Stateless systems

References
Preface
Distributed systems are becoming ubiquitous in our lives nowadays: from how we communicate with our friends to how we shop online and much more. It might sometimes be invisible to us, but many companies make use of extremely complicated software systems under the hood to satisfy our needs. By using these kinds of systems, companies are capable of significant achievements, such as sending our message to a friend who is thousands of miles away in a matter of milliseconds, delivering our orders despite outages of whole datacenters, or searching the whole Internet by processing more than a million terabytes of data in less than a second. Putting all of this into perspective, it’s easy to understand the value that distributed systems bring to the current world and why it’s useful for software engineers to be able to understand and make use of them.

The ultimate goal of this book is to help these engineers get started with distributed systems.
Acknowledgements
Like any other book, this book may have been written by a single person, but it would not have been possible without the contributions of many others. Credit is due to all my previous employers and colleagues who have given me the opportunity to work with large-scale, distributed systems and to appreciate both their capabilities and complexities, and to the distributed systems community, which was always open to answering any questions. I would also like to thank Richard Gendal Brown for reviewing the case study on Corda and giving feedback that was very useful in helping me add clarity and remove ambiguity. Of course, this book would not have been possible without the understanding and support of my partner in life, Maria.
Part I
Fundamental Concepts
Chapter 1
Introduction
First of all, we need to define what a distributed system is. Multiple different definitions can be found, but we will use the following:

"A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another." [1]
As shown in Figure 1.1, this network can either consist of direct connections
between the components of the distributed system or there could be more
components that form the backbone of the network (if communication is
done through the Internet for example). These components can take many
forms; they could be servers, routers, web browsers or even mobile devices.
In an effort to keep an abstract and generic view, in the context of this book
we’ll refer to them as nodes, remaining agnostic to their real form. In some cases, such as when providing a concrete example, it might be useful to escape this generic view and see how things work in real life. In these cases, we might explain in detail the role of each node in the system.
As we will see later, the two parts highlighted in the definition above are central to how distributed systems function:
• the various parts that compose a distributed system are located remotely, separated by a network.
• these parts communicate and coordinate their actions by exchanging messages over this network.
Now that we have defined what a distributed system is, let’s explore its
value.
Why do we really need distributed systems?
Looking at all the complexity that distributed systems introduce, as we will
see during this book, that’s a valid question. The main benefits of distributed
systems come mostly in the following 3 areas:
• performance
• scalability
• availability
Let’s explain each one separately. The performance of a single computer has certain limits imposed by physical constraints on the hardware. Not only that, but after a certain point, improving the hardware of a single computer in order to achieve better performance becomes extremely expensive. As a result, one can achieve the same performance with two or more low-spec computers as with a single high-end computer. So, distributed systems allow us to achieve better performance at a lower cost. Note that better performance can translate to different things depending on the context, such as lower latency per request, higher throughput, etc.
"Scalability is the capability of a system, network, or process to
handle a growing amount of work, or its potential to be enlarged
to accommodate that growth." [2]
Most of the value derived from software systems in the real world comes from storing and processing data. As the customer base of a system grows, the system needs to handle larger amounts of traffic and store larger amounts of data. However, a system composed of a single computer can only scale up to a certain point, as explained previously. Building a distributed system allows us to split and store the data across multiple computers, while also distributing the processing work amongst them.¹ As a result, we are capable of scaling our systems to sizes that would not even be imaginable with a single-computer system.

¹ The approach of scaling a system by adding resources (memory, CPU, disk) to a single node is also referred to as vertical scaling, while the approach of scaling by adding more nodes to the system is referred to as horizontal scaling.
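To make the idea of splitting data and work across machines concrete, here is a minimal sketch of hash-based partitioning in Python. The node addresses and the modulo scheme are illustrative assumptions rather than anything prescribed by this chapter; real systems often use consistent hashing instead, so that adding or removing a node does not reshuffle most keys.

```python
import hashlib

# A hypothetical cluster of three nodes.
NODES = ["node-0:9000", "node-1:9000", "node-2:9000"]

def node_for_key(key: str) -> str:
    """Map a key to a node by hashing it and taking the result
    modulo the number of nodes (naive partitioning)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(NODES)
    return NODES[bucket]

# Each key lives on exactly one node, so both the stored data and
# the processing work are spread across the cluster.
for key in ["user:42", "user:43", "order:7"]:
    print(key, "->", node_for_key(key))
```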
In the context of software systems, availability is the probability that a system will work as required, when required, during a given period of time. Note that nowadays most online services are required to operate all the time (also known as 24/7 service), which makes availability a huge challenge. So, when a service states that it has five nines of availability, this means that it operates normally for 99.999% of the time. This implies that it’s allowed to be down for only about 5 minutes a year to satisfy this guarantee. Thinking about how unreliable hardware can be, one can easily understand how big an undertaking this is. Of course, using a single computer, it would be infeasible to provide this kind of guarantee. One of the mechanisms that is widely used to achieve higher availability is redundancy, which means storing the data in multiple, redundant computers. So, when one of them fails, we can easily and quickly switch to another one, preventing our customers from experiencing the failure. Given that the data is now stored in multiple computers, we end up with a distributed system!
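As a quick sanity check on the five-nines figure, the allowed downtime follows directly from the definition:

\[
(1 - 0.99999) \times 365.25 \times 24 \times 60 \;\text{minutes} \approx 5.26 \;\text{minutes per year}
\]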
Leveraging a distributed system, we can get all of the above benefits. However, as we will see later on, there is a tension between them and several other desirable properties.
The fallacies of distributed computing

Distributed systems are subject to a set of false assumptions that newcomers to the field tend to make, famously collected as the fallacies of distributed computing²:
• The network is reliable.
• Latency is zero.
• Bandwidth is infinite.
• The network is secure.
• Topology doesn’t change.
• There is one administrator.
• Transport cost is zero.
• The network is homogeneous.
As you progress through the book, you will get a deeper understanding of why these statements are fallacies. However, we will give you a sneak preview here by going quickly over them and explaining where they fall short. The first fallacy is sometimes reinforced by abstractions provided to developers by various technologies and protocols. As we will see in a later chapter, networking protocols such as TCP can make us believe that the network is reliable and never fails, but this is just an illusion and can have significant repercussions. Network connections are built on top of hardware that will fail at some point, and we should design our systems accordingly.
² See: https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
Frameworks for remote procedure calls, such as gRPC³ or Apache Thrift⁴, can similarly make remote calls look like local ones, hiding the fact that a network with all its limitations sits in between.

³ See: https://grpc.io/
⁴ See: https://thrift.apache.org/

Another fallacy worth adding to the list is the assumption that distributed systems have a global clock that can be used to identify when events happen. This assumption can be quite deceiving, since it’s somewhat intuitive and
holds true when working in systems that are not distributed. For instance,
an application that runs in a single computer can use the computer’s local
clock in order to decide when events happen and what’s the order between
them. Nonetheless, that’s not true in a distributed system, where every node
in the system has its own local clock, which runs at a different rate from the
other ones. There are ways to try and keep the clocks in sync, but some of
them are very expensive and do not eliminate these differences completely.
This limitation is again bound by physical laws.⁵ An example of such an approach is the TrueTime API built by Google [5], which explicitly exposes clock uncertainty as a first-class citizen. However, as we will see in the next chapters of the book, when one is mainly interested in cause and effect, there are other ways to reason about time, using logical clocks instead.
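To make logical clocks slightly more tangible, here is a minimal sketch of a Lamport clock in Python, the simplest kind of logical clock; the class and method names are ours, not from any particular library. It orders events by causality (the happened-before relation) rather than by wall-clock time:

```python
class LamportClock:
    """Minimal Lamport logical clock: a counter that orders events
    by causality (happened-before), not by wall-clock time."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """On receipt, jump past the sender's timestamp so the
        receive event is ordered after the send event."""
        self.time = max(self.time, msg_time) + 1
        return self.time

# If event A causally precedes event B, A's timestamp is smaller.
a, b = LamportClock(), LamportClock()
t_send = a.send()            # node A sends a message
t_recv = b.receive(t_send)   # node B receives it
assert t_send < t_recv
```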
Why distributed systems are hard

In general, distributed systems are hard to design, build and reason about, which increases the risk of error. This will become more evident later in the book, while exploring some algorithms that solve fundamental problems emerging in distributed systems. It’s worth asking: why are distributed systems so hard? The answer can help us understand the main properties that make distributed systems challenging, eliminating our blind spots and providing some guidance on which aspects we should pay attention to.
The main properties of distributed systems that make them challenging to
reason about are the following:
• network asynchrony
• partial failures
• concurrency
Network asynchrony is a property of communication networks whereby they cannot provide strong guarantees around the delivery of events, e.g. a maximum amount of time required for a message to be delivered. This can create a lot of counter-intuitive behaviours that would not be present in non-distributed systems. It is in contrast to memory operations, which can provide much stricter guarantees.⁶ For instance, in a distributed system, messages might take extremely long to be delivered, they might be delivered out of order, or they might not be delivered at all.
⁵ See: https://en.wikipedia.org/wiki/Time_dilation
⁶ See: https://en.wikipedia.org/wiki/CAS_latency
System models
A system model codifies the assumptions we make about the environment a distributed system operates in. For example, a synchronous system model assumes known bounds on how long message delivery and computation can take, while an asynchronous system model makes no such timing assumptions. The asynchronous model is much closer to real-life systems that communicate over networks such as the Internet, where we cannot have control over all the components involved and there are very limited guarantees on the time it will take for a message to be sent between two places. As a result, most of the algorithms we will be looking at in this book assume an asynchronous system model.
There are also several different types of failure. The most basic categories
are:
• Fail-stop: A node halts and remains halted permanently. Other nodes can detect that the node has failed (e.g. by communicating with it).
• Crash: A node halts and remains halted, but it halts in a silent way, so other nodes may not be able to detect this state (they can only assume it has failed when they cannot communicate with it).
• Omission: A node fails to respond to incoming requests.
• Byzantine: A node exhibits arbitrary behaviour: it may transmit arbitrary messages at arbitrary times, it may stop, or it may take an incorrect step.
Byzantine failures can be exhibited when a node does not behave according to the specified protocol/algorithm, e.g. because the node has been compromised by a malicious actor or because of a software bug. Coping with these failures introduces significant complexity to the resulting solutions. At the same time, most distributed systems in companies are deployed in environments that are assumed to be private and secure. Fail-stop failures are the simplest and the most convenient ones from the perspective of someone who builds distributed systems. However, they are also not very realistic, since there are cases in real-life systems where it’s not easy to identify whether another node has crashed or not. As a result, most of the algorithms analysed in this book work under the assumption of crash failures.
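In practice, the way other nodes "detect" a crash is typically a timeout-based failure detector. Here is a minimal sketch in Python, with illustrative names and a made-up timeout; note that under network asynchrony a timeout can only ever produce a suspicion, since a slow node is indistinguishable from a crashed one:

```python
import time

class FailureDetector:
    """Naive timeout-based failure detector: a node is suspected
    to have crashed if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout: float = 5.0) -> None:
        self.timeout = timeout
        self.last_heartbeat: dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        """Record a heartbeat received from `node`."""
        self.last_heartbeat[node] = time.monotonic()

    def suspected(self, node: str) -> bool:
        """True if `node` has been silent for longer than the timeout."""
        last = self.last_heartbeat.get(node)
        return last is None or time.monotonic() - last > self.timeout

fd = FailureDetector(timeout=0.1)
fd.heartbeat("node-1")
assert not fd.suspected("node-1")
time.sleep(0.2)                 # node-1 goes silent...
assert fd.suspected("node-1")   # ...so we can only suspect a crash
```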
In this context, it is useful to distinguish between the delivery of a message, i.e. its arrival at the recipient node over the network, and the processing of this message by the software application layer of the node. In most cases, what we really care about is how many times a message is processed, not how many times it has been delivered. For instance, in our previous e-mail example, we are mainly interested in whether the application will display the same e-mail twice, not whether it will receive it twice. As the previous examples demonstrated, it’s impossible to have exactly-once delivery in a distributed system. It’s still sometimes possible, though, to have exactly-once processing. With all that said, it’s important to understand the difference between these two notions and to make clear what you are referring to when you talk about exactly-once semantics.
Also, as a last note, it’s easy to see that at-most-once delivery semantics and at-least-once delivery semantics can be implemented trivially. The former can be achieved by sending every message only once, no matter what happens, while the latter can be achieved by sending a message continuously until we get an acknowledgement from the recipient.
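As a small illustration of these last two points, here is a sketch in Python of how at-least-once delivery combined with deduplication on the receiver can yield exactly-once processing. The message-id scheme and the class names are illustrative assumptions, not a prescription from the book:

```python
import uuid

class Receiver:
    """Turns at-least-once delivery into exactly-once processing by
    remembering the ids of messages it has already handled."""

    def __init__(self) -> None:
        self.processed_ids: set[str] = set()
        self.inbox: list[str] = []

    def deliver(self, msg_id: str, payload: str) -> bool:
        """Handle one delivery attempt and return an acknowledgement.
        Duplicate deliveries are acknowledged but not re-processed."""
        if msg_id not in self.processed_ids:
            self.processed_ids.add(msg_id)
            self.inbox.append(payload)  # the side effect happens once
        return True  # ack

def send_at_least_once(receiver: Receiver, payload: str) -> str:
    """Keep sending until an ack arrives; retries may duplicate deliveries."""
    msg_id = str(uuid.uuid4())  # a unique id makes deduplication possible
    while not receiver.deliver(msg_id, payload):
        pass  # no ack yet: retry
    return msg_id

r = Receiver()
msg_id = send_at_least_once(r, "hello")
r.deliver(msg_id, "hello")    # a duplicate delivery caused by a retry
assert r.inbox == ["hello"]   # ...but the message was processed exactly once
```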
Stateful and Stateless systems

We could say that a system belongs to one of the two following categories:
• stateless systems
• stateful systems
A stateless system is one that maintains no state of what has happened in the past and performs its capabilities purely based on the inputs provided to it. For instance, a contrived stateless system is one that receives a set of numbers as input, calculates their maximum and returns it as the result. Note that these inputs can be direct or indirect. Direct inputs are those included in the request, while indirect inputs are those potentially received from other systems to fulfil the request. For instance, imagine a service that calculates the price for a specific product by retrieving the initial price for it and any currently available discounts from some other services and then performing the necessary calculations with this data. This service would still be stateless. On the other hand, stateful systems are responsible for maintaining and mutating some state, and their results depend on this state. As an example, imagine a system that stores the age of all the employees of a company and can be asked for the employee with the maximum age. This system is stateful, since the result depends on the employees we’ve registered so far in the system.
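A tiny sketch in Python makes the contrast concrete, mirroring the two examples above (the names are ours, chosen only for illustration):

```python
from typing import Iterable

def max_of(numbers: Iterable[int]) -> int:
    """Stateless: the result depends only on the inputs provided."""
    return max(numbers)

class EmployeeAges:
    """Stateful: the result depends on the state accumulated so far."""

    def __init__(self) -> None:
        self.ages: dict[str, int] = {}

    def register(self, name: str, age: int) -> None:
        self.ages[name] = age  # mutates internal state

    def oldest(self) -> str:
        # The same call can return different answers as state changes.
        return max(self.ages, key=self.ages.get)

assert max_of([3, 1, 4]) == 4   # always 4 for these inputs

registry = EmployeeAges()
registry.register("alice", 34)
assert registry.oldest() == "alice"
registry.register("bob", 41)
assert registry.oldest() == "bob"  # the answer changed with the state
```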
There are some interesting observations to be made about these two types of systems:
• Stateful systems can be really useful in real life, since computers are much more capable of storing and processing data than humans.
• Maintaining state comes with additional complexity, such as deciding what’s the most efficient way to store and process it, how to perform backups, etc.
• As a result, it’s usually wise to create an architecture that contains clear boundaries between stateless components (which perform business capabilities) and stateful components (which handle data).
• Last and most relevant to this book, it’s much easier to design, build and scale stateless distributed systems compared to stateful ones. The main reason is that all the nodes (e.g. servers) of a stateless system are considered identical. This makes it a lot easier to balance traffic between them and to scale by adding or removing servers. Stateful systems, however, present many more challenges, since different nodes can hold different pieces of data, thus requiring additional work to direct traffic to the right place and to ensure each instance is in sync with the others.
As a result, some of the book’s examples might include stateless systems,
but the most challenging problems we will cover in this book are present
mostly in stateful systems.
References