Chapter Two - Distributed Systems (Part 1)
Functional Requirements
Computer systems are designed to meet one or multiple objectives. Without consideration of such
requirements, it is impossible to compare alternative designs.
A functional requirement specifies something the system must be able to do. It specifies a feature,
expressed as a specific function or as a broadly defined behavior. The system either has the feature,
and thus provides the function or behaves as specified, or it does not. Examples related to an online
game include:
1. The player can connect to the game servers and see who is connected to each server.
2. The player can join their friends and exchange messages.
3. Players engaged in parkour against the clock can do so wherever they are in the world,
even if the clocks on their computers are not synchronized.
4. Specific actions where multiple players compete to retrieve the same object are correctly
resolved by the game, and, subsequently, each player sees the same outcome.
5. Modifications made to an in-game object by the players are seen in the correct order by all
nearby players.
In contrast, a non-functional requirement focuses on how well the feature works, defining quality
attributes for the distributed system. Examples related to the same online game include:
1. Players experience that their actions (clicks) receive a reaction from the game within 150
milliseconds. This lag is not only bounded but also stable.
2. Players can access the game services every time they try.
3. Each machine used by the gaming platform is utilized at more than 50% of its capacity each day.
4. The gaming platform consumes at most 1 MWh of electricity per day.
Working with single computers builds strong intuitions about how system components interact:
• Naming: We expect that components of the same computer system find each other, by
identifier or name, and can communicate easily.
• Clock Synchronization: We also expect that the system provides a clock and implicitly
that the clock is synchronized for all system components; in other words, acting 'at the same
time' (synchronously) is trivial for components in a computer system.
Distributed systems counter all these intuitions. Because the machines hosting components of the
system are physically distributed, the laws of physics have an important impact: the real-world
time it takes for information to get from one component to another can be orders of magnitude
higher than it normally takes in the computers and smartphones we are used to. This real-world
information delay changes everything. Components in distributed systems cannot easily name or
communicate with other components. Distributed systems cannot easily achieve clock
synchronization, consensus, or consistency. Instead, all these functions require specialized
approaches in distributed systems.
The ability to communicate is essential for systems. Even single machines are constructed around
the movement of data, from input devices, to memory and persistent storage, to output devices.
Although computers are increasingly complex, this communication is well understood. Typical
functional requirements, which modern single-machine systems already meet, include that messages
arrive correctly at the receiver, that there is an upper limit on the time it takes to read or write a
message, and that developers know how much data can be safely exchanged between applications at
any point in time. In a distributed system, none of this is true without additional effort.
• A distributed system can only function if its components are able to communicate.
• The components in a distributed system are asynchronous: they run independently from,
and do not wait for, other components.
• Components must continue to function, even though communication with other
components can start and stop at any time.
• The networks used for communication in distributed systems are unreliable. Networks may
drop, delay, or reorder messages arbitrarily, and components need to take these possibilities
into account.
To enable communication between computers, they need to speak the same protocol. A protocol
defines the rules of communication, including the format and meaning of messages and the order in
which they can be exchanged.
How a protocol is defined depends on the technology that underlies it. Protocols that directly
use the network’s transport layer need to define their data fields as a sequence of bits or bytes.
Defining a protocol on this level, however, has multiple disadvantages. It is labor intensive, the
binary messages are challenging to debug, and it is difficult to achieve backward compatibility.
When a protocol defines data fields on the level of bits and bytes, adding or changing what data
can be sent while still supporting older implementations is difficult. For these and other reasons,
distributed systems often define their protocols on a higher layer of abstraction.
One of the simplest abstractions on top of byte streams is plain-text messages. These are used
widely in practice, especially in the older technologies that form the core of the Internet. For
example, protocols such as SMTP, IMAP, and HTTP/1.1 all use plain-text messages.
Instead of defining fields with specified bit or byte lengths, plain-text protocols are typically line-
based, meaning every message ends with a new line character (“\n”). The advantages of such
protocols are that they are easy to debug by both humans and computers, and that they offer
increased flexibility due to variable-length fields. Text-based protocols can easily be changed into
binary protocols without losing their advantages, by compressing the data before it is sent over the
network.
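As a minimal sketch of such a line-based exchange in Java (the host, port, and the "WHO" request are illustrative assumptions, not part of any real protocol), the client sends one request line and reads one response line over a TCP socket:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class LineProtocolClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical game-server endpoint; replace with a real host and port.
        try (Socket socket = new Socket("game.example.com", 4000);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            out.println("WHO");              // one request is one line, terminated by "\n"
            String response = in.readLine(); // one reply is one line
            System.out.println("Server replied: " + response);
        }
    }
}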
Moving one more level up brings us to structured-text protocols. Such protocols use specialized
languages designed to represent data, and use them to define messages. For example, REST APIs
typically use JSON to exchange data. A structured-text format comes with its own (more complex)
rules on how to format messages. Fortunately, many parser libraries exist for popular structured-text
formats such as XML and JSON, making it easier for distributed-system developers to use these
formats without writing the tools themselves.
Finally, structs and objects in programming languages can also be used as messages. Typically,
these structs are translated to and from structured-text or binary representations with little or no
work required from developers. Mapping programming-language-specific data structures to and
from message formats is called marshaling and unmarshaling, respectively. Marshaling libraries
and tools take care of both marshaling and unmarshaling. Examples of marshaling libraries for
structured-text protocols include Jackson for Java and the built-in JSON library for Golang.
Examples of marshaling libraries for binary formats include Java’s built-in Serializable interface
and Google’s Protocol Buffers.
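As a sketch of marshaling with Jackson (the PlayerUpdate message type and its fields are illustrative assumptions), a Java object is marshaled to a JSON string before sending and unmarshaled back after receiving:

import com.fasterxml.jackson.databind.ObjectMapper;

public class MarshalingExample {
    // Illustrative message type; Jackson maps its public fields to JSON properties.
    public static class PlayerUpdate {
        public String playerId;
        public int x;
        public int y;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        PlayerUpdate update = new PlayerUpdate();
        update.playerId = "alice";
        update.x = 4;
        update.y = 2;

        String json = mapper.writeValueAsString(update);               // marshaling
        PlayerUpdate copy = mapper.readValue(json, PlayerUpdate.class); // unmarshaling

        System.out.println(json);
        System.out.println(copy.playerId + " at (" + copy.x + "," + copy.y + ")");
    }
}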
Message passing always occurs between a sender and a receiver. It requires messages to traverse
a possibly unreliable communication environment and can start and end at synchronized or
arbitrary moments. This leads to multiple ways in which message-passing communication can
occur, and thus multiple useful models.
Depending on whether the message in transit through the communication environment is stored
(persisted) until it can be delivered or not, we distinguish between transient and persistent
communication:
• Transient communication only maintains the message while the sender and the receiver are
online, and only if no transmission error occurs. This model is the easiest to implement and
matches well the typical Internet router based on store-and-forward or cut-through technology.
For example, real-time games may occasionally drop updates and use local correction
mechanisms. In many cases, this allows the use of relatively simple designs, but for some game
genres it can lead to the perception of lag or choppy movement of avatars and objects.
• Persistent communication requires the communication environment to store the message until
it is received. This is convenient for the programmer, but much more complex for the
distributed system to guarantee. Worse, it typically leads to lower scalability than approaches
based on transient communication, due to the higher latency of the message broker storing
incoming messages on a persistent storage device, as well as potential limits on the number of
messages that can be persisted at the same time. An example of persistent communication
appears in the email system. Emails are sent and received using SMTP and IMAP, respectively:
SMTP copies email from a client or server to another server, and IMAP copies email from a
server to a client. The client can copy the email from their server repeatedly because the email
is persisted on the server.
Depending on whether the sender and/or the receiver has to wait (is blocked) in the process of
transmitting or receiving, we distinguish between asynchronous communication and
synchronous communication:
Implementation
Functionally, RPC aims to maintain the illusion that the program calls a local implementation
of the service. Since the caller and callee now reside on different machines, they need to agree on a
definition of what the procedure is: its name and parameters. This information is often encoded in
an interface written in an Interface Definition Language (IDL).
In order for local programs to be able to call the service, a stub is created that implements the
interface but, instead of doing function execution locally, encodes the name and the argument
values in a message that is forwarded to the callee. Since this is a mechanical, deterministic
process, the stub can be compiled automatically by a stub generator.
On the server side, the message is received and the arguments need to be unmarshalled from the
message so that the function can be invoked on behalf of the client. This is again performed by an
automatically generated stub, which on this side of the system is often referred to as a skeleton (the
server stub in the figure below).
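To illustrate what such stubs do, the following hand-written, simplified sketch (the InventoryService interface, the text message format, and the MessageChannel transport abstraction are assumptions) marshals the procedure name and argument into a message and forwards it to the callee:

// The interface both sides agree on (normally derived from an IDL definition).
interface InventoryService {
    int countItems(String playerId);
}

// Minimal transport abstraction assumed by this sketch.
interface MessageChannel {
    String sendAndReceive(String request);
}

// Simplified client stub: implements the same interface, but forwards calls over the network.
class InventoryServiceStub implements InventoryService {
    private final MessageChannel channel;

    InventoryServiceStub(MessageChannel channel) {
        this.channel = channel;
    }

    @Override
    public int countItems(String playerId) {
        // Marshal the procedure name and argument into a simple text message.
        String request = "countItems|" + playerId;
        // Send the request and block until the reply arrives (synchronous RPC).
        String reply = channel.sendAndReceive(request);
        // Unmarshal the result.
        return Integer.parseInt(reply);
    }
}

On the server side, the skeleton performs the mirror image: it splits the incoming message, invokes the local countItems implementation, and marshals the result into the reply.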
Since the late-1950s, the programming language community has proposed and developed
programming models that consider objects, rather than merely operations and procedures for
program control. Object-oriented programming languages, such as Java (and Kotlin), Python, and
C++, remain among the most popular programming languages. It is thus meaningful to ask the
question: Can RPC be extended to (remote) objects?
An object-oriented equivalent of RPC is remote method invocation (RMI). RMI is similar to RPC,
but has to deal with the additional complexity of remote-object state. In RMI, the object is located
on the server, together with its methods (equivalent of procedures for RPC). The client calls a
function on a proxy, which fulfills the same role as the client-stub in RPC. On the server side, the
RMI message is received by a skeleton, which executes the method call on the correct object.
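A minimal sketch using Java's built-in RMI support shows this division of roles; the Scoreboard interface, the registry name, and the port are illustrative assumptions:

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Remote interface: every method can fail with a RemoteException.
interface Scoreboard extends Remote {
    int getScore(String playerId) throws RemoteException;
}

// Server side: the object and its state live here; a stub is exported for remote callers.
class ScoreboardServer implements Scoreboard {
    public int getScore(String playerId) { return 42; } // illustrative value

    public static void main(String[] args) throws Exception {
        Scoreboard stub = (Scoreboard) UnicastRemoteObject.exportObject(new ScoreboardServer(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("scoreboard", stub); // make the object findable by name
    }
}

// Client side: looks up a proxy and invokes the method as if it were local.
class ScoreboardClient {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.getRegistry("localhost", 1099);
        Scoreboard scoreboard = (Scoreboard) registry.lookup("scoreboard");
        System.out.println(scoreboard.getScore("alice"));
    }
}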
The messages that are sent between machines, be they sent as plain messages, or as the underlying
technology of RPC, show distinct patterns over time depending on the properties of the system
that uses them. Below we describe some of the most prevalent communication patterns.
Communication in Practice
• RabbitMQ is a message broker, a middleware system that facilitates sending and receiving
messages. It is similar to Kafka, but not specifically built for stream-processing systems. A
minimal usage sketch follows below.
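As referenced in the bullet above, the following is a minimal sketch of publishing and consuming a message with RabbitMQ's Java client; the queue name, message, and broker host are assumptions, and the exact API may differ between client versions.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.nio.charset.StandardCharsets;

public class RabbitSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumes a broker running locally

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Declare a queue (idempotent) and publish one message to it.
        channel.queueDeclare("game-events", false, false, false, null);
        channel.basicPublish("", "game-events", null,
                "player joined".getBytes(StandardCharsets.UTF_8));

        // Consume messages as the broker delivers them.
        channel.basicConsume("game-events", true,
                (consumerTag, delivery) ->
                        System.out.println(new String(delivery.getBody(), StandardCharsets.UTF_8)),
                consumerTag -> { });
        // A real consumer keeps the connection open; call connection.close() when done.
    }
}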
Naming schemes (schemas) are the rules by which names are given to individual entities. There is
an infinite number of ways to assign names to entities. In this section, we identify and
discuss three categories of naming schemas:
• Simple naming,
• Hierarchical naming, and
• Attribute-based naming.
Simple Naming: Focusing on uniquely identifying one entity among many is the simplest way to
name entities in a distributed system. Such a name contains no information about the entity’s
location or role.
Advantages: The main advantage of this approach is simplicity. The effort required to assign a
name is low—the only requirement is that the name is not already taken. Various approaches can
simplify even this verification step, at the cost of a (very low) probability the name may cause a
collision with another chosen name.
Disadvantages: A simple name shifts the complexity of locating the named entity to the naming service.
Addressing the downside of simple naming, distributed systems can use rich names. Such names
not only uniquely identify an entity, but also contain additional information, for example, about the
entity's location or role.
Hierarchical Naming: In hierarchical naming, names are allowed to contain other names, creating
a tree structure.
Namespaces are commonly used in practice. Examples include file systems, the DNS, and package
imports in Java and other languages. These names consist of a concatenation of words separated
by a special character such as “.” or “/”. The tree structure forms a name hierarchy, which combines
well with, but is not the same as, a hierarchical name resolution approach. When a hierarchical
naming scheme is combined with hierarchical name resolution, a machine is typically responsible
for all names in one part of the hierarchy. For example, when using DNS to look up the name in
https://rift.com.et, we first contact one of the DNS root servers. These forward us to the "et"
servers, which forward us to the "com.et" servers, which in turn know where to find "rift.com.et".
Figure 1. Simplified attribute-based naming for a Minecraft-like game. Steps 1-3 are discussed in
the text.
Figure 1 illustrates how a player in the example might find a game of Minecraft located in an EU
datacenter. In step 1, the game client on the player's computer automates this, by querying the
naming service to "search((R=“EU”)(G=“Minecraft”))". Because the entries in attribute-based
naming are key-value pairs, searches are easy to make, and also partial searches can result in
matches. In step 2, the naming service returns the information that "server 42" is a server matching
the query. In step 3, the game client connects the player to server 42.
Naming schema in practice: The lightweight directory access protocol (LDAP) is a name-
resolution protocol that uses both hierarchical and attribute-based naming. Names consist of
attributes, and can be found by performing search operations. The protocol returns all names with
matching attributes. In addition to the attributes, names also have a distinguished name, which is
a unique hierarchical name similar to a file path. This name can change when, for example, the
name is moved to a different server. Because LDAP is a protocol, multiple implementations exist.
ApacheDS is one of these implementations.
Once every entity in the system has a name, we would like to use those names to address our
messages. Networking approaches assume that we know, for each entity, on which machine it is
currently running. In distributed systems, we want to break free of this limitation. Modern
datacenter architectures often run systems inside virtual machines that can be moved from one
physical machine to the next in seconds. Even if instances of entities are not moved, they may fail
or be shut down, while new instances of the same service are started on other machines. Naming
services address such complexity in distributed systems.
Name resolution: A subsystem is responsible for maintaining a mapping between entity names
and transport-layer addresses. Depending on the scalability requirements of the system, this could
be implemented on a single machine, as a distributed database, etc.
Publish-Subscribe systems: The entities only have to indicate which messages they are interested
in receiving. In other words, they subscribe to certain messages with the naming service. This
subscription can be based on multiple properties. Common properties for publish-subscribe
systems include messages of a certain topic, with certain content, or of a certain type. When an
entity wants to send a message, it sends it not to the interested entities, but to the naming service.
The naming service then proceeds by publishing the message to all subscribed entities.
The publish-subscribe service is reminiscent of the bus found in single-machine systems. For this
reason, the publish-subscribe service is often called the "enterprise bus". A bus provides a single
channel of communication to which all components are connected. When one component sends a
message, all others are able to read it. It is then up to the recipients to decide if that message is of
interest to them. Publish-subscribe differs from this approach by centralizing the logic that decides
which messages are of interest to which entities.
In step 1, user A subscribes to messages with certain properties, for example, a given topic. In step
2, users B and C send updates (messages) to the publish-subscribe system. These are stored
(published) and may be forwarded to users other than A.
In step 3, user D sends a new message to the publish-subscribe system. The system analyzes this
message and decides it fits the subscription made by user A. Consequently, in step 4, user A will
receive the new message.
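A minimal, single-process sketch of the topic-based variant of these steps (the class and names are assumptions, not the design of any particular product): subscribers register interest in a topic, and the service, not the sender, decides who receives each published message.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public class PubSubService {
    private final Map<String, List<Consumer<String>>> subscribers = new ConcurrentHashMap<>();

    // Step 1: an entity subscribes to a topic.
    public void subscribe(String topic, Consumer<String> subscriber) {
        subscribers.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>()).add(subscriber);
    }

    // Steps 2-4: a publisher sends a message to the service, which delivers it to all subscribers.
    public void publish(String topic, String message) {
        for (Consumer<String> subscriber : subscribers.getOrDefault(topic, List.of())) {
            subscriber.accept(message);
        }
    }

    public static void main(String[] args) {
        PubSubService service = new PubSubService();
        service.subscribe("minecraft-eu", msg -> System.out.println("User A received: " + msg));
        service.publish("minecraft-eu", "server 42 is now online");
    }
}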
Clock synchronization focuses on an agreement between components about the time, which is a
single, numerical value that changes continuously but unidirectionally (it only increases, if one
counts the elapsed number of milli- or microseconds since a commonly agreed start time, as
computer systems do) and monotonically (with the clock frequency). We also observe that a
synchronized clock enables establishing a happens-before relationship between events recorded in
the system; even without a physical clock, if we can otherwise establish this relationship, we have
the equivalent of a logical clock. Using the happens-before relationship between any two events,
we can create a total order of these events.
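Even without a physical clock, a logical clock can track the happens-before relationship. The following minimal Lamport-clock sketch (class and method names are assumptions) increments a counter on every local event and pulls the counter forward when a message is received:

public class LamportClock {
    private long time = 0;

    // Called for every local event, including sending a message.
    public synchronized long tick() {
        return ++time;
    }

    // Called when a message carrying the sender's timestamp is received.
    public synchronized long onReceive(long senderTime) {
        time = Math.max(time, senderTime) + 1;
        return time;
    }

    public static void main(String[] args) {
        LamportClock a = new LamportClock();
        LamportClock b = new LamportClock();
        long sendTimestamp = a.tick();                      // A sends a message at logical time 1
        long receiveTimestamp = b.onReceive(sendTimestamp); // B receives it at logical time 2 > 1
        System.out.println("send=" + sendTimestamp + " receive=" + receiveTimestamp);
    }
}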
Consensus focuses on any value, which, unlike the clock, need not be numerical. More importantly,
the value subject to consensus does not need to change as clocks do; in fact, it may not even change
at all. Consensus may focus on a single value but, by reaching consensus repeatedly, can also
enable a total ordering of events; such an approach, however, is expensive in time and resources.
Consistency focuses on any value from the many included in a dataset, creating a flexible order.
Consistency protocols in distributed systems define the kind of order that can be achieved, for
example, total order, and, more loosely, when the order will be achieved, for example, after each
operation, after some guaranteed maximum number of operations, or eventually. Ordering events
more weakly than a total order, and even tolerating some discrepancies between how different
components see the values in the database, is useful for many classes of applications because it can
often be achieved much faster and with much more scalable techniques.
In a distributed system, consensus is the ability to have all machines agree on a value. Consensus
protocols ensure this ability.
Consensus protocols (distributed algorithms) can create a total order of operations by repeatedly
agreeing on what operation to perform next.
Theoretical computer science has considered the problem of reaching consensus for many decades.
When machine failures can occur, reaching consensus is surprisingly difficult. If the delay of
transmitting a message between machines is unbounded, it has been proven that, even over
reliable networks, no distributed consensus protocol is guaranteed to complete. The proof itself is
known as the FLP proof, after the acronym of the family names of its creators. It can be found in
the aptly named article “Impossibility of Distributed Consensus with One Faulty Process” [1].
Consider, toward a contradiction, that the claim is not true: there exists a consensus protocol, a
distributed algorithm that always reaches consensus in bounded time. For the algorithm to be
correct, all machines that
decide on a value must decide on the same value. This prevents the algorithm from simply letting
the machines guess a value. Instead, they need to communicate to decide which value to choose.
This communication is done by sending messages. Receiving, processing, and sending messages
makes the algorithm progress toward completion. At the start of the algorithm, the system is in an
undecided state. After exchanging a certain number of messages, the algorithm decides. After a
decision, the algorithm - and the system - can no longer “change its mind.” The FLP proof shows
that there is no upper bound on the number of messages required to reach consensus.
A consensus protocol must provide two properties:
1. Safety, which guarantees "nothing incorrect can happen". The consensus protocol must
decide on a single value, and cannot decide on two values, or more, at once.
2. Liveness, which guarantees "something correct will happen, even if only slowly". The
consensus protocol, left without hard cases to address - for example, no failures for some
amount of time -, can and will reach its decision on which value is correct.
Many protocols have been proposed to achieve consensus, with various degrees of capability
regarding the forms of failure and messaging delays they tolerate.
Among the protocols that are used in practice, Paxos, multi-Paxos, and more recently Raft seem
to be very popular. For example, etcd is a distributed database built on top of the Raft consensus
algorithm. Its API is similar to that of Apache ZooKeeper (a widely-used open-source coordination
service), allowing users to store data in a hierarchical data-structure. Etcd is used by Kubernetes
and several other widely-used systems to keep track of shared state.
We sketch here the operation of the Raft approach to reach consensus. Raft is a consensus
algorithm specifically designed to be easy to understand. Compared to other consensus algorithms,
it has a smaller state space (the number of configurations the system can have), and fewer parts.
1. Raft first elects a leader ("leader election" in Figure 3). The other machines become
followers. Once a leader has been elected, the algorithm can start accepting new log entries
(data operations).
2. The log (data) is replicated across all the machines in the system ("log replication" in the
figure).
3. Users send new entries only to the leader.
4. The leader asks every follower to confirm each new entry. If most followers confirm, the
entry is committed to the log and the operation is performed.
We describe three key parts of Raft. These do not form the entirety of Raft, which is indicative
that even a consensus protocol designed to be easy to understand still has many aspects to cover.
The Raft leader election: Having a leader simplifies decision-making. The leader decides on the
values. The other machines are followers, accepting all decisions from the leader. Easy enough.
But how do we elect a leader? All machines must agree on who the leader is—leader election
requires reaching consensus, and must have safety and liveness properties.
In Raft, machines can try to become the new leader by starting an election. Doing so changes their
role to candidate. Leaders are appointed until they fail, and followers only start an election if they
believe the current leader to have failed. A new leader is elected if a candidate receives the majority
of votes. With one exception, which we discuss in the section on safety below, followers always
vote in favor of the candidate.
Raft uses terms to guarantee that voting is only done for the current election, even when messages
can be delayed. The term is a counter shared between all machines. It is incremented with each
election. A machine can only vote once for every term. If the election completes without selecting
a new leader, the next candidate increments the term number and starts a new election. This gives
machines a new vote, guaranteeing liveness. It also allows distinguishing old from new votes by
looking at the term number, guaranteeing safety.
An election is more likely to succeed if there are fewer concurrent candidates. To this end,
candidates wait a random amount of time after a failed election before trying again.
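A sketch of this randomized wait (the timeout bounds are illustrative assumptions and, in practice, a tuning parameter):

import java.util.concurrent.ThreadLocalRandom;

// Each machine picks a random election timeout so that, after a failed or split election,
// one machine is likely to time out first and win the next vote.
public class ElectionTimer {
    private static final long MIN_TIMEOUT_MS = 150; // illustrative bounds
    private static final long MAX_TIMEOUT_MS = 300;

    public static long nextElectionTimeout() {
        return ThreadLocalRandom.current().nextLong(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS + 1);
    }

    public static void main(String[] args) {
        System.out.println("Wait " + nextElectionTimeout() + " ms before starting a new election");
    }
}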
Log replication: In Raft, users only submit new entries to the leader, and log entries only move
from the leader to the followers. Users that contact a follower are redirected to the leader.
New entries are decided, or “chosen,” once they are accepted by a majority of machines. As Figure
4 illustrates, this happens in a single round-trip: (a) The leader propagates the entries to the
followers and, (b) counts the votes and accepts the entry only if a majority in the system voted
positively.
Log replication is relatively simple because it uses a leader. Having a leader means, for example,
that there cannot be multiple log entries contending for the same place in the log.
Safety in Raft: Electing a leader and then replicating new entries is not enough to guarantee safety.
For example, it is possible that a follower misses one or multiple log entries from the leader, the
leader fails, the follower becomes a candidate and becomes the new leader, and finally overwrites
these missed log entries. (Sequences of events that can cause problems are a staple of consensus-
protocol analysis.) Raft solves this problem by setting restrictions on which machines may be
elected leader. Specifically, machines vote "yes" for a candidate only if that candidate's log is at
least as up-to-date as theirs. This means two things must hold: (1) the term of the candidate's last
log entry is at least as high as the term of the voter's last entry, and (2) if these terms are equal, the
candidate's log is at least as long as the voter's.
When machines vote according to these rules, it cannot occur that an elected leader overwrites
chosen (voted upon) log entries. It turns out this is sufficient to guarantee safety; additional
information can be found in the original article.
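The voting rule can be sketched as a simple comparison of the last log entries, first by term and then by log length (field and method names are assumptions):

public class RaftVoteCheck {
    // A follower grants its vote only if the candidate's log is at least as up-to-date as its own.
    public static boolean candidateLogIsUpToDate(long candidateLastTerm, long candidateLastIndex,
                                                 long voterLastTerm, long voterLastIndex) {
        if (candidateLastTerm != voterLastTerm) {
            return candidateLastTerm > voterLastTerm; // the higher last term wins
        }
        return candidateLastIndex >= voterLastIndex;  // same term: the longer (or equal) log wins
    }

    public static void main(String[] args) {
        // Voter has entries up to (term 3, index 7); a candidate with only (term 3, index 5) is rejected.
        System.out.println(candidateLogIsUpToDate(3, 5, 3, 7)); // false
        System.out.println(candidateLogIsUpToDate(4, 2, 3, 7)); // true: newer last term
    }
}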
The essence of any discussion about consistency is the abstract notion of the data store. Data stores
can differ when servicing diverse applications, types of operations, and kinds of transactions, but
essentially a data store:
Many applications only have a single primary user. You are likely the only one accessing your
email, for business or leisure. You may have a private Dropbox folder, which you may want to
access at home, on the train, wherever you stay long enough to want to store new photos, etc. Many
mobile-first users recognize these and similar applications. Figure 1 depicts the data store for the
single primary user. Here, the user can connect from one location (or device), write new
information - a new email, a new Dropbox file, then disconnect. After moving to a new location
(or device), and reconnecting, the user should be able to resume the email and access the latest
version of the file.
Other applications have multiple users, writing together information to the same shared document,
changing together the state of an online game, making together transactions affecting many shared
accounts in a large data management system, etc. Here, the data store again has to manage the
data-updates, and deliver correct results when users query (read).
In a distributed system, achieving consistency falls upon the consistency model and consistency
(enforcing) mechanisms.
• Consistency models determine which data and operations are visible to a user or process,
and which kind of read and write operations are supported on them.
• Consistency mechanisms update data between replicas to meet the guarantees specified by
the model.
Classes of consistency models: The consistency model offers guarantees, but outside the
guarantees, almost anything is allowed, even if it seems counter-intuitive.
1. Strong consistency: an operation, particularly a query, can return only a consistent state.
2. Weak consistency: an operation, particularly a query, can return an inconsistent state, but
there is an expectation that there will be a moment when a consistent state is returned to the
client. Sometimes, the model guarantees which moment, or which (partial) state.
The strictest forms of consistency are so costly to maintain that, in practice, there may be some
tolerance for a bit of inconsistency after all. The CAP theorem suggests availability may suffer
under these strict models, and the PACELC framework further suggests that performance is also
traded off against how strict the consistency model can be.
Many views on consistency models exist. Traditional results from theoretical computer science
and formal methods indicate how strict guarantees can be formalized: notions of (i) linearizability
and (ii) serializability emerged to indicate that write operations can appear instantaneous while a
real-time or an arbitrary total order, respectively, is enforced.
A useful practical distinction is that (1) in operation-centric consistency models, a single client
accesses a single data object, whereas (2) in transaction-centric consistency models, multiple
clients can access any of multiple data objects.
Operation-Centric Consistency Models (single client, single data object, data store with multiple
replicas):
Several important models emerged in the past four decades, and more may continue to emerge:
Sequential consistency: All replicas see the same order of operations as all other replicas. This is
desirable, but of course prohibitively expensive.
Causal consistency weakens the promises, but also the operational costs, of sequential consistency:
as for sequential consistency, causally related operations must still be observed in the same order
by all replicas. However, for operations that are not causally related, different replicas may
see a different order of operations and thus of outcomes. Important cases of causal consistency,
with important applications, include:
1. Monotonic Reads: Subsequent reads by the same process always return a value that is at
least as recent as a previous read (see the sketch after this list). Important applications include
calendars, inventories in online games, etc.
2. Monotonic Writes: Subsequent writes by the same process follow each other in that order.
Important applications include email, coding on multiple machines, your bank account,
bank accounts in online games, etc.
3. Read Your Writes: A client that writes a value will, upon reading it, see a version that is at
least as recent as the version it wrote. Updating a webpage should always, in our
expectation, make the page refresh show the update.
4. Writes Follow Reads: A client that first reads and then writes a value, will write to the
same, or a more recent, version of the value it read. Imagine you want to post a reply on
social media. You expect this reply to appear following the post you read.
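As referenced in the list above, a minimal sketch of how a client-side session can enforce monotonic reads and read-your-writes by tracking version numbers; the Replica interface and the versioning scheme are assumptions:

public class SessionClient {
    private long lastSeenVersion = 0;

    // A replica returns a value together with the version it reflects (assumed API).
    public record Versioned(String value, long version) {}

    public interface Replica {
        Versioned read();
        long write(String value);
    }

    public String read(Replica replica) {
        Versioned result = replica.read();
        if (result.version() < lastSeenVersion) {
            // Monotonic reads / read-your-writes would be violated: retry elsewhere or wait.
            throw new IllegalStateException("replica is behind this session");
        }
        lastSeenVersion = result.version();
        return result.value();
    }

    public void write(Replica replica, String value) {
        long newVersion = replica.write(value);
        lastSeenVersion = Math.max(lastSeenVersion, newVersion);
    }
}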
Causal consistency is still remarkably difficult to ensure in practice. What could designers use that
is so lightweight it can scale to millions of clients or more? The key difficulty in scaling causal
consistency is that updates that require multiple replicas to coordinate could hit the system
concurrently, effectively slowing it down to non-interactive responses and breaking scalability
needs. A consistency model that can delay when replicas need to coordinate would be very useful
to achieve scale.
Eventual consistency: Absent new writes, all replicas will eventually have the same contents. Here,
the coordination required to achieve consistency can be delayed until the system is less busy, which
may mean indefinitely in a very crowded system; in practice, many systems are not heavily
overloaded much of the time, and eventual consistency can achieve good consistency results in a
matter of minutes or hours.
Under special circumstances, there is no need for the heavy, scalability-breaking coordination
needed to ensure consistency we saw for operation-centric consistency models (and can have an
intuition about the even heavier transaction-centric consistency models). Identifying such
circumstances in general has proven very challenging, but good patterns have emerged for specific
(classes of) applications.
Applications where small inconsistencies can be tolerated include social media, where for example
information can often be a bit stale (but not too much!) without much impact, online gaming, where
slight inconsistencies between the positions of in-game objects can be tolerated (but large
inconsistencies cannot), and even banking where inconsistent payments are tolerated as long as
their sum does not exceed the maximum amount allowed for the day. Consistency models where
limited inconsistency is allowed, but also tracked and not allowed to go beyond known bounds,
include conits (we discuss them in the next section).
We conclude by observing the designer of distributed systems and applications must have at least
a basic grasp of consistency, and of known classes of consistency models with a proven record for
exactly that kind of system or application. This can be a challenging learning process, and mistakes
are costly.
2.5.2 Consistency for Online Gaming, Virtual Environments, and the Metaverse
Dead Reckoning
One of the earliest consistency techniques in games is dead reckoning. The technique addresses
the key problem that information arriving over the network may be stale by the moment of arrival
due to network latency. The main intuition behind this technique is that many values in the game
follow a predictable trajectory, so updates to these values over time can largely be predicted. Thus,
as a latency-hiding technique, dead reckoning uses a predictive technique, which estimates the
next value and, without new information arriving over the network from the other nodes in the
distributed system, updates the value to match the prediction.
Although players are not extremely sensitive to accurate updates, and will experience the game as
smooth as long as the updated values seem to follow an intuitive trajectory, they are sensitive to
jumps in values. Thus, when the locally predicted values and the values arriving over the network
diverge, dead reckoning cannot simply replace the local value with the newly arrived one; such an
abrupt replacement would be perceived as a jump. Instead, a second, convergence technique
gradually steers the local value toward the value received over the network.
The interplay between the two techniques, the predictive and the convergence, makes dead
reckoning an eventually consistent technique, with continuous updates and managed
inconsistency.
Advantages: Although using two internal techniques may seem complex, dead reckoning is a
simple technique with excellent properties when used in distributed systems. It is also mature, with
many decades of practical experience already available.
For many gaming applications, trajectories are bound by limitations on allowed operations, so the
numerical inconsistency can be quantified as a function of the staleness of information.
Drawbacks: As a significant drawback, dead reckoning works only for applications where the two
techniques, especially the predictive, can work with relatively low overhead.
Example: Figure 1 illustrates how dead reckoning works in practice. In this example, an object is
located in a 2D space (so, has two coordinates), in which it moves with physical velocity (so, a 2D
velocity vector expressed as a pair). The game engine updates the position of each object after
each time tick, so at time t=0, t=1, t=2, etc. In the example, the local game engine receives an
update about the object, at t=0; this update positions the object at position (0,0), with velocity (2,2).
The dead reckoning predictor can easily predict the next positions the object will take during the
next time ticks: (2,2) at t=1, (4,4) at t=2, etc. If the local game engine receives no further updates,
this predictor can continue to update the object, indefinitely.
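The example can be mirrored in a short sketch (names and the blending factor used for convergence are assumptions): absent new updates, the position advances by the last known velocity each tick; when an update arrives, the local value is steered toward it rather than replaced.

public class DeadReckoning {
    private double x = 0, y = 0;   // current local position
    private double vx = 2, vy = 2; // last known velocity, as in the example

    // Predictive technique: advance the position by one time tick.
    public void predictTick() {
        x += vx;
        y += vy;
    }

    // Convergence technique: move a fraction of the way toward the received position,
    // instead of replacing the local value, which players would perceive as a jump.
    public void onNetworkUpdate(double newX, double newY, double newVx, double newVy) {
        double blend = 0.3; // illustrative convergence factor
        x += blend * (newX - x);
        y += blend * (newY - y);
        vx = newVx;
        vy = newVy;
    }

    public static void main(String[] args) {
        DeadReckoning object = new DeadReckoning();
        object.predictTick(); // t=1: (2,2)
        object.predictTick(); // t=2: (4,4)
        System.out.println("Predicted position: (" + object.x + "," + object.y + ")");
    }
}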
If the local game engine keeps receiving new information, dead reckoning ensures a state of
smooth inconsistency, which the players experience positively.
Lock-step Consistency
Toward the end of 1997, multiplayer gaming was already commonplace, and games like Age of
Empires were launched with much acclaim and sold to millions. The technical conditions were
much improved over the humble beginnings of such games, around the 1960s for small-scale
online games and through the 1970s for large-scale games with hundreds of concurrent players
(for example, in the PLATO metaverse). Players could connect with the main servers through high-
speed networks... of 28.8 Kbps, with connections established over dial-up (phone) lines with
modems. So, following a true Jevons' paradox, gaming companies developing real-time strategy
games focused on scaling up, from a few tens of units to hundreds, per player.
Consequently, the network became a main bottleneck - sending around information about
hundreds to thousands of units (location, velocity, direction, status, and every other tracked variable),
about 20 times per second as required in this game genre at the time, would quickly exceed the
limit of about 3,000 bytes per second. To expert designers, these network conditions could support
a couple of hundred but not 1,000 units. In a game like Age of Empires, the target limit set by
designers was even higher: 1,500 units across 8 players. How to ensure consistency under these
circumstances? (Similar situations continue to occur: For each significant advance in the speed of
the network and the processing power of the local gaming rig, game developers embark again on
new games that quickly exceed the new capabilities.)
One more ingredient is needed to have a game where the state of every unit - location, appearance,
activity, etc. - appears consistent across all players: the state needs to be the same at the same
moment because players are engaged in a synchronous contest against each other. So, the missing
ingredient is a synchronized clock linked to the consistency process.
Lock-step consistency occurs when simulations progress at the same rate and achieve the same
status at the end (or start) of each step (time tick).
One approach to achieve lock-step consistency is for all the computers in the distributed system
running the game to synchronize their game clocks. Players input their commands to their local
game engines, which communicate them over the network to all other game engines. Then, every
game engine updates the local status based on the received inputs.
A main benefit of this approach is that it trades off communication for local computation: the
communication is reduced to only the necessary updates, such as player inputs, and the game
engines recompute the state of the game using dead reckoning and the inputs. The network
bandwidth is therefore sufficient for a game like Age of Empires with 1,500 moving units.
As a drawback, this approach uses a sequence of three operations, which is prohibitive when the
target is to complete all of them in under 50 milliseconds (to enable updates 20 times per second,
as described earlier in the section). Suppose performance variability occurs in the distributed
system, either in transferring data over the Internet or in computing the updates, for any of the
players. In this case, the next step either cannot complete in time or has to wait for the slowest
player to complete (lock-step really means the step is locked until everyone completes it).
Another approach pipelines the communication and computation processes, that is, it updates the
state while receiving input from players. To prevent inconsistent results, this approach again uses
time ticks, and input received during one step is always applied two steps later.
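A sketch of this pipelined variant (the data structures and method names are assumptions): inputs received during tick t are queued for tick t + 2, so every game engine applies the same inputs at the same tick and reaches the same state.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LockStepEngine {
    private static final int INPUT_DELAY_TICKS = 2;
    private final Map<Long, List<String>> scheduledInputs = new HashMap<>();
    private long currentTick = 0;

    // Called when an input (local or received over the network) arrives during the current tick.
    public void onInputReceived(String input) {
        scheduledInputs.computeIfAbsent(currentTick + INPUT_DELAY_TICKS, t -> new ArrayList<>())
                       .add(input);
    }

    // Called once per time tick, on every machine, after all inputs scheduled for this tick are known.
    public void advanceTick() {
        List<String> inputs = scheduledInputs.remove(currentTick);
        if (inputs != null) {
            for (String input : inputs) {
                applyToLocalState(input); // identical inputs, in identical order, on every machine
            }
        }
        currentTick++;
    }

    private void applyToLocalState(String input) {
        System.out.println("tick " + currentTick + ": applying " + input);
    }
}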
Disadvantages: As for the first approach, performance variability, which is predominantly caused
by the processing of the slowest computer in the distributed game or by the laggiest Internet
connection, can cause problems. The problems occur only when the performance variability is
extreme, closer to 1,000 ms than to 400 ms over the typical performance at the time.
A third approach improves on the second by allowing turns to have variable lengths and thus
match performance variability. This approach works like the second whenever performance
stays near normal levels: the turn length stays at 200 ms, with ticks for communication
and computation set at 50 ms.
Whenever performance degrades, this approach provides a technique to lengthen the step duration,
typically up to 1,000 ms. Beyond this value, empirical studies indicate the game becomes much
less enjoyable. Not only is the turn lengthened when needed, but also the way it is allocated between
computation and communication tasks, next to local updates and rendering, is adjusted. This
approach allocates, from the total turn duration, more time for computation to accommodate a
slower computer among the players, or more time for communication to accommodate slow
Internet connections.
Beyond mere lock-step consistency: Lock-step approaches still suffer from a major drawback:
when every player has to simulate locally every input, the amount of computation can quickly
overwhelm slower computers, especially when game designers intend to scale well beyond 8
players, to possibly tens or hundreds or thousands for real-time strategy games.
A common remedy combines two ideas. First, we partition the virtual world into areas so that the
game engine can select only those of interest for each player. Second, the game engine updates
areas judiciously. Some areas do not
receive updates because no player is interested in them. Areas interesting for only one player are
updated on that player's machine. Each area that is interesting for two or more players is updated
with lock-step or communication-only consistency protocols, depending on the computation and
communication capabilities of the players interested in the area.
In summary: Trading off communication for computation needs is a typical problem for online
games, virtual environments, and metaverses. Lock-step consistency provides a solution based on
this trade-off, with many desirable properties. Still, lock-step consistency is challenging when the
system exhibits unstable behavior, such as performance variability. In production, games must
cope with unstable behavior often. Then, monitoring the system carefully while it operates, and
conducting careful empirical studies of how the players experience the game under different levels
of performance, is essential to addressing the unstable behavior satisfactorily.
Conit-based Consistency
Although lock-step consistency is useful, in games where many changes occur that do not fit local
predictors, and for which dead reckoning and other computationally efficient techniques are thus
difficult to find, it is better, when scaling the virtual world, to allow some inconsistency to occur.
In particular, games such as Minecraft could benefit from this.
Conits, an abbreviation of consistency unit, have been designed to support consistency approaches
where inconsistency can occur but should be quantified and managed. In the original design by Yu
and Vahdat [1], conits quantify three dimensions of inconsistency: numerical error, order error, and
staleness.
Any conit-based consistency protocol uses at least one conit to capture the inconsistency in the
system along the three dimensions. Time elapsed and data-changing operations lead to updates to
the conit state, typically increasing inconsistency values along one or more dimensions. At
runtime, when the limit of inconsistency set by the system operators is exceeded, the system
triggers a consistency-enforcing protocol and the conit is reset to (near-)zero inconsistency across
all dimensions.
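A sketch of a single conit tracking the three dimensions against operator-set limits (field names, units, and thresholds are illustrative assumptions):

public class Conit {
    private double numericalError = 0; // e.g., how far local values may deviate from the true ones
    private int orderError = 0;        // e.g., number of tentative, not yet globally ordered writes
    private long stalenessMs = 0;      // time elapsed since the last synchronization

    // Illustrative limits that a system operator could configure.
    private static final double MAX_NUMERICAL_ERROR = 10.0;
    private static final int MAX_ORDER_ERROR = 5;
    private static final long MAX_STALENESS_MS = 2000;

    public void onLocalWrite(double valueDelta) {
        numericalError += Math.abs(valueDelta);
        orderError += 1;
    }

    public void onTimeElapsed(long elapsedMs) {
        stalenessMs += elapsedMs;
    }

    // When true, the system triggers its consistency-enforcing protocol, then calls reset().
    public boolean mustSynchronize() {
        return numericalError > MAX_NUMERICAL_ERROR
                || orderError > MAX_ORDER_ERROR
                || stalenessMs > MAX_STALENESS_MS;
    }

    public void reset() {
        numericalError = 0;
        orderError = 0;
        stalenessMs = 0;
    }
}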
Conits provide a versatile base for consistency approaches. Still, they so far have not been much
used in practice for two main reasons: First, not many applications exist that would tolerate
significant amounts of inconsistency. Second, setting the thresholds after which consistency must
be enforced is error-prone and application-dependent.
To provide a seamless experience to their users, distributed systems often rely on data replication.
Replication allows companies such as Amazon, Dropbox, Google, and Netflix to move data close
to their users, significantly improving non-functional requirements such as latency and reliability.
We study in this section what replication is and what the main concerns are for the designer when
using replication. One of the main such concerns, consistency of data across the replicas, relates to
an important functional requirement and will be the focus of the next sections in this module.
What is Replication?
The core idea of replication is to repeat essential operations by duplicating, triplicating, and
generally multiplying the same service or physical resource, thread or virtual resource, or, at a
finer granularity and with a higher level of abstraction, data or code (computation).
Like resource sharing, replication can occur (i) in time, where multiple replicas (instances) co-exist
on the same machine (node), simultaneously, or (ii) in space, where multiple instances exist on
multiple machines. Figure 1 illustrates how data or services could be replicated in time or in space.
For example, data replication in space (Figure 1, bottom-left quadrant) places copies of the data
from Node 1 on several other nodes, here, Node 2 through n. As another example, service
replication in time (Figure 1, top-right quadrant) launches copies of the service on the same node,
Node 1.
To clarify replication, we must further distinguish it from other operational techniques that use
copies of services, physical resources, threads, virtual resources, data, computation, etc.
Replication differs from other techniques in many ways, including:
1. Unlike partitioning data or computation, replication makes copies of (and then uses) entire
sources, so entire datasets, entire compute tasks, etc. (A variant of replication, more
selective, focuses on making copies of entire sources, but only if sources are considered
important enough.)
2. Unlike load balancing, replication makes copies of the entire workload. (Selective
replication only makes copies of the part of the workload considered important enough.)
3. Unlike data persistence, checkpointing, and backups, replication techniques repeatedly act
on the replicas, and access to the source replica is similar to accessing the other replicas.
4. Unlike speculative execution, replication techniques typically consider replicas as
independent contributors to overall progress with the workload.
5. Unlike migration, replication techniques continue to use the source.
When replicating in space, because the many nodes are unlikely to all be affected by the same
performance issue when completing a share of the workload, the entire system delivers relatively
stable performance; in this case, replication also decreases performance variability.
Geographical replication, where nodes can be placed close-to-users, can lead to important
performance gains, guided by the laws of physics, particularly the speed of light.
Replication can lead to higher reliability and to what practice considers high availability: in a
system with more replicas, more of them need to fail before the entire system becomes unavailable,
relative to a system with only one replica. The danger of a single point of failure (see also the
discussion about scheduler architectures, in Module 4) is alleviated.
When multiple replicas can perform the same service concurrently, their local state may become
different, a consequence of the different operations performed by each replica. In this situation, if
the application cannot tolerate the inconsistency, the distributed system must enforce a consistency
protocol to resolve the inconsistency, either immediately, at some point in time but with specific
guarantees, or eventually. As explained during the introduction, the CAP theorem indicates
consistency is one of the properties of distributed systems that cannot be easily achieved, and in
particular it presents trade-offs with availability (and performance, as we will learn at the end of
this module). So, this approach may offset and even negate some of the benefits discussed earlier
in this section.
Replication Approaches
In a small-scale distributed system, replication is typically achieved by executing the incoming
stream of tasks (requests in web and database applications, jobs in Module 4) either (i) passively,
where the execution happens on a single replica, which then broadcasts to the others the results, or
(ii) actively, where each replica receives the input stream of tasks and executes it. However, many
more considerations appear as soon as the distributed system becomes larger than a few nodes
serving a few clients.
Replica-server location: Like any data or compute task, replicas require physical or virtual
machines on which to run. Thus, the problem of placing these machines, such that their locations
provide the best possible service to the system and a good trade-off with other considerations, is
important. This problem is particularly important for distributed systems with a highly
decentralized administration, for which decisions taken by the largely autonomous nodes can even
interfere with each other, and for distributed systems with highly volatile clients and particularly
those with high churn, where the presence of clients in one place or another can be difficult to
predict.
Replica-server location defines the conditions of a facility location problem, for example, finding
the best K locations out of the N possible, subject to many performance, cost, and other constraints,
with many theoretical solutions from Operations Research.
An interesting problem is how new replica locations should emerge. When replica-servers are
permanent, for example, when game operators run their own shared sites, or web operators mirror
their websites, all that is needed is to add a statically configured machine. However, to prevent resource
waste, it would be better to allow replica-servers to be added or removed as needed, related to
(anticipated) load. (This is the essential case of many modern IT operations, which underlies the
need for cloud and serverless computing.) In such a situation, derived from traditional systems
considerations, who should trigger adding or removing a replica-server, the distributed system or
the client? A traditional answer is both, which means that the designer must (1) consider whether
to allow the replica-server system to be elastic, adding and removing replicas as needed, and (2)
enable both system- and client-initiated elasticity.
Replica placement: Various techniques can help with placing replicas on available replica-
servers.
A simplistic approach is to place one replica on each available replica-server. The main advantage
of this approach is that each location, presumably enabled because of its good capability to service
clients, can actually do so. The main disadvantages are that this approach does not enable
replication in time (that is, multiple replicas in the same location when enough resources exist),
and that rapid changes in demand or system conditions cannot be accommodated. (Longer-term
changes can be accommodated by good management of replica-server locations.)
Another approach is to use a multi-objective solver to place replicas in the replica-server topology.
The topological space can be divided, e.g., into Voronoi diagrams; conditions such as poor
connectivity between adjacent divisions can be taken into account, etc. Online algorithms often
use simplifications, such as partitioning the topology only along the main axes, and greedy
approaches, such as placing servers first in the most densely populated areas.
What to update? Replicas need to achieve a consistent state, but how they do so can differ by
system and, in dynamic systems, even by replica itself (e.g., as in [3] for an online gaming
application). Two main classes of approaches exist: (i) updating from the result computed by one
replica (the coordinating-replica), and (ii) updating from the stream of input operations that,
applied identically, will lead to the same outcome and thus a consistent state across all replicas.
(Note (i) corresponds to the passive replication described at the start of this section, whereas (ii)
corresponds to active replication.)
Passive replication typically consumes fewer compute resources per replica receiving the result.
Conversely, active replication typically consumes fewer networking resources to send the update
to all replicas. In Section 2.5.2, Consistency for Online Gaming, Virtual Environments, and the
Metaverse, we see how these trade-offs are important to manage for online games.
When to perform updates? With synchronous updates, all replicas perform the same update, which
has the advantage that the system will be in a consistent state at the end of each update, but also
the drawbacks of waiting for the slowest part of the system to complete the operation and of having
to update each replica even if this is not immediately necessary.
With asynchronous updates, the source informs the other replicas of the changes, and often just
that a new operation has been performed or that enough time has elapsed since the last update.
Then, replicas mark their local data as (possibly) outdated. Each replica can decide if and when to
perform the update, lazily.
With push-based protocols, the system propagates modifications to replicas, informing the clients
when replica-updates must occur. This means replicas in the system must be stateful, thus able to
consider the need to propagate modifications by inspecting the previous state. This approach is
useful for applications with high ratios of operations that do not change the state, relative to those
that change it (e.g., high read:write ratios). With this approach, the system is more expensive to
operate, and typically less scalable, than when it does not have to maintain state.
With pull-based protocols, clients ask for updates. Different approaches exist: clients could poll
the system to check for updates, but if the polling frequency is too high the system can get
overloaded, and if it is too low, (i) the client may work with stale information from its local state,
or (ii) the client may have to wait for a relatively long time before obtaining the updated
information from the system, leading to low performance.
As is common in distributed systems, a hybrid approach could work better. Leases, where push-
based protocols are used while the lease is active, and pull-based protocols are used outside the
scope of the lease, are such a hybrid approach.
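A minimal sketch of the lease idea (the lease duration and the surrounding client logic are assumptions): while the lease is valid, the client relies on pushed updates; once it expires, the client falls back to pulling.

import java.time.Duration;
import java.time.Instant;

public class Lease {
    private final Instant expiresAt;

    public Lease(Duration duration) {
        this.expiresAt = Instant.now().plus(duration);
    }

    public boolean isValid() {
        return Instant.now().isBefore(expiresAt);
    }

    public static void main(String[] args) {
        Lease lease = new Lease(Duration.ofSeconds(30)); // illustrative lease length
        if (lease.isValid()) {
            System.out.println("Push-based: wait for the server to send updates");
        } else {
            System.out.println("Pull-based: poll the server for updates");
        }
    }
}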
References:
[1] The BBC, Microsoft says services have recovered after widespread outage, Jan 2022.
[2] Sacheendra Talluri, Empirical Characterization of User Reports about Cloud Failures, 2021.
[3] More on Consistency and Replication.