Saurav Prateek’s
Replication in Distributed Systems
Understanding how Replicated State Machines support Fault Tolerance and Availability
Table of Contents

Replicated State Machines
Introduction
Fault Tolerance
The Output Rule

Single Leader Replication
Synchronous Replication
Asynchronous Replication
Semi-Synchronous Configuration
Fault Tolerance

Multi-Leader Replication
Introduction
Use Case 1: Systems with Multiple Data Centres
Use Case 2: Apps like Google Calendar
Use Case 3: Collaborative Editing in Google Docs
Write Conflicts
Synchronous Conflict Detection
Asynchronous Conflict Detection
Method 1: Last Write Wins (LWW)
Method 2: Assign Unique ID to the replicas
Method 3: Manual Conflict Resolution
Method 4: Merging Values together
Replicated State Machines
This chapter discusses Replicated State Machines and how they ensure Fault Tolerance and High Availability in services built on Distributed Systems. It also discusses “The Output Rule” used by this scheme to ensure fault tolerance.
In the above architecture there are two machines: the Primary and the Backup replica. The client sends its request (packets) to the Primary replica (machine). The request packet first reaches the Virtual Machine Monitor (VMM). On arrival, the VMM generates a network packet arrival interrupt for the operating system. In addition, the VMM knows that the request is an input to the Primary replica and hence sends a copy of the request packet to the Backup’s Virtual Machine Monitor.
Once the Primary replica receives the packet, it starts processing the request and
then generates a reply/response packet and further sends it to its Virtual Machine
Monitor. The VMM sees the response packet sent by the Primary replica and then
sends the reply packet over the network back to the client.
Hence, if the Primary machine crashes, the Backup machine will stop receiving event signals from it and will soon conclude that the Primary replica has crashed and stopped functioning.
In that case the Backup replica goes Live. The VMM (Virtual Machine Monitor) present in the Backup replica now allows it to execute freely, which amounts to two things: it stops waiting for the Primary replica to send event signals over the channel, and it stops discarding the Backup’s output packets, so the Backup’s responses now go out over the network to the clients.
In the same way if the Backup replica machine fails then the Primary replica will
stop sending the event signals to the Backup replica and will function as a single
non-replicated server.
Let’s walk through an example. Suppose a data-item X currently holds the value 10 on both replicas. The Client C sends an INCR (Increment) request to the Primary replica. The Primary machine receives and executes the operation, incrementing the value of X to 11, and then sends the newly updated value back to the Client. The same request is also supposed to be sent to the Backup replica by the Primary machine. The Backup machine will then update the data-item X as well and produce a response, which will be discarded by its VMM (Virtual Machine Monitor).
Now suppose the Primary machine dies after sending the response packet to the Client but before the copy of the request packet reaches the Backup replica. The Client will have received the updated value (11) of the data-item X, but the Backup replica still holds the older copy (10).
This can lead to data inconsistency in our database servers, which can further lead to huge problems.
The way the Replicated State Machine scheme avoids this problem is through The Output Rule. The Output Rule states that:
“The Primary replica is not allowed to send any output response packet to the
Client until the Backup replica acknowledges that it has received all the event
records till that point”
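The idea can be sketched in a few lines of Python. The Primary and Backup classes below, the event format, and the acknowledgement are hypothetical simplifications rather than the actual VMM mechanism: the Primary computes its reply first, but returns it only after the Backup has acknowledged the event record.

```python
# A sketch of the Output Rule with hypothetical Primary and Backup classes
# (an illustration of the idea, not the actual VMM implementation).

class Backup:
    def __init__(self):
        self.x = 10

    def receive_event(self, event):
        # Apply the same deterministic operation as the Primary.
        if event == "INCR x":
            self.x += 1
        return "ACK"                  # acknowledgement sent to the Primary


class Primary:
    def __init__(self, backup):
        self.x = 10
        self.backup = backup

    def handle(self, request):
        if request == "INCR x":
            self.x += 1
        reply = f"x = {self.x}"

        # The Output Rule: hold the reply until the Backup acknowledges
        # that it has received this event record.
        if self.backup.receive_event(request) != "ACK":
            raise RuntimeError("no acknowledgement; reply must not be sent")
        return reply                  # now it is safe to answer the Client


backup = Backup()
primary = Primary(backup)
print(primary.handle("INCR x"))       # "x = 11", and the Backup also holds 11
```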
Single Leader Replication
This chapter discusses Replication in detail and its multiple schemes. It covers the Single Leader Replication scheme and the Synchronous and Asynchronous replication processes, along with the fault tolerant schemes that go with them.
There are multiple reasons to replicate our data:
1. To keep the data geographically closer to the Client. We can replicate our data on multiple servers and these servers can be geographically distributed. In this way clients can request the data from the server which is closest to them.
2. To increase Availability. Even if some of the machines holding the data go down, the remaining replicas can continue to serve the Clients’ requests.
3. To offer Scalability. Some systems are read-heavy, and distributing our data to multiple machines can allow all of them to take up the read requests from the Clients. This distributes the load and helps the system to scale.
These are some obvious reasons to introduce Replication into a system. If the data being replicated doesn’t change over time, then the entire replication process is fairly simple: we just need to copy the data to multiple machines and scale up the Read requests. The real challenge comes up when the data being replicated can change over time. A change made on one machine must then be replicated to all the other machines that hold a copy of the same data.
Multiple replication algorithms are used to handle these changes in the replicated data. This chapter and the next discuss two of them: the Single Leader and the Multi-Leader replication schemes.
Let’s first understand what the Leader and the Followers actually are. Suppose we have a Distributed System with multiple machines connected over a network. All the machines store a copy of the database; they are also known as Replicas. Among these replicas, one is elected as the Leader. The Leader is the machine which accepts all the Write requests from the Clients. Whenever a client wants to make any change in the database (update/delete), it sends the Write request to the Leader replica.
Now the Leader receives the Write request and makes the required changes in its
copy of the Database. Once the changes are made, the Leader sends the Write
request to the rest of the Followers. These followers then update their copy of the
data. All the Followers perform the write requests in the same order in which those
requests were executed by the Leader.
When a Client needs to perform a Read request, it can freely send its request to any of the replica machines, be it the Leader or a Follower.
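As a rough sketch of this flow, assuming the hypothetical Replica and Leader classes below: the Leader applies each write locally and forwards it to the Followers in the same order, and any replica can answer reads.

```python
# A rough sketch of Single Leader replication (hypothetical classes): the
# Leader accepts every write, applies it locally, and forwards it to the
# Followers in the same order; any replica can serve reads.

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def write(self, key, value):
        self.apply(key, value)              # 1. apply the change locally
        for follower in self.followers:     # 2. forward it in the same order
            follower.apply(key, value)


followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("x", 11)
print(leader.read("x"), followers[0].read("x"))   # 11 11
```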
Synchronous Replication
Let’s take the previous example, where the Leader receives an incoming Write request from the Client and then forwards it to all the Followers. In the Synchronous Replication process, the Leader will wait for the Followers to confirm that they have successfully received the forwarded write before reporting success to the Client.
Let’s say there’s an architecture having one Leader (L) and one Follower (F). The
Synchronous replication scheme will look something like this.
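Expressed as a minimal code sketch (hypothetical helper names, a single Follower), the synchronous case could look like this: the Leader treats the write as complete only once the Follower has acknowledged it.

```python
# A minimal sketch of synchronous replication (hypothetical helper names,
# a single Follower): the Leader reports success to the Client only after
# the Follower has acknowledged that it received and stored the write.

def send_to_follower(follower, key, value):
    """Deliver the write to the follower; return True if it acknowledges."""
    follower[key] = value
    return True                      # assume the acknowledgement arrives

def synchronous_write(leader, follower, key, value):
    leader[key] = value
    if not send_to_follower(follower, key, value):
        raise TimeoutError("no acknowledgement: the client reply is blocked")
    return "OK"                      # both copies are now up to date

leader, follower = {}, {}
print(synchronous_write(leader, follower, "x", 11))   # OK
print(follower["x"])                                  # 11
```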
Asynchronous Replication
Once the Leader receives the Write request from the Client, it will forward it to all
the Follower replicas. After forwarding the request to the Followers, it won’t wait for
them to acknowledge the receipt of the request. This is known as the Asynchronous
Replication process.
Let’s take another similar example of a Distributed System having a Leader (L) and a
Follower (F). The Asynchronous Replication scheme will look somewhat like this.
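A similarly minimal sketch of the asynchronous case (again with hypothetical helpers): the Leader answers the Client right away and ships the write to the Follower in the background.

```python
import threading

# A minimal sketch of asynchronous replication (hypothetical helpers): the
# Leader replies to the client immediately and ships the write to the
# Follower in the background, without waiting for any acknowledgement.

def replicate(follower, key, value):
    follower[key] = value            # reaches the follower a little later

def asynchronous_write(leader, follower, key, value):
    leader[key] = value
    # Fire-and-forget: the client's response does not wait for this thread.
    threading.Thread(target=replicate, args=(follower, key, value)).start()
    return "OK"                      # replied before the follower has the data

leader, follower = {}, {}
print(asynchronous_write(leader, follower, "x", 11))
# The follower may briefly serve the stale value until replication completes.
```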
Semi-Synchronous Configuration
In this configuration, one Follower is synchronous while the rest of the Followers are asynchronous. Since only one Follower is synchronous, the chance of the system being blocked by the failure of a Follower is reduced. If the synchronous Follower stops functioning, then one of the asynchronous Followers is made synchronous. This ensures that the Leader and at least one Follower always have an up-to-date copy of the data.
This configuration reduces the chance of data loss, since we have an up-to-date copy of the data on at least two machines.
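A small sketch of this configuration, assuming one designated synchronous Follower and the rest asynchronous (all names are hypothetical):

```python
# A small sketch of a semi-synchronous configuration (hypothetical helpers):
# the write waits only for the one designated synchronous Follower, while the
# remaining Followers are updated without waiting for their acknowledgements.

def semi_synchronous_write(leader, sync_follower, async_followers, key, value):
    leader[key] = value
    sync_follower[key] = value         # wait for this follower's acknowledgement
    # At this point at least two machines hold the up-to-date value,
    # so the Client can already be answered.
    for follower in async_followers:   # replicated lazily in a real system
        follower[key] = value
    return "OK"

leader, sync_f, async_fs = {}, {}, [{}, {}]
print(semi_synchronous_write(leader, sync_f, async_fs, "x", 11))   # OK
```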
Fault Tolerance
Since we are discussing Distributed Systems having multiple machines, there is a good chance that some of these machines will fail. We need a solid fault tolerant mechanism to handle such failures. If a Follower fails, it can catch up again from its log (discussed below), but if the Leader fails, one of the Followers has to take over. This failover happens in the following stages:
1. Stage 1: In this stage, the rest of the Followers in the system detect that the Leader has failed. This is achieved through the process of Heartbeats. All the replicas in the system keep exchanging messages back and forth with each other; these messages are termed Heartbeats. When one of the nodes dies, the others stop receiving messages from it and detect that the node has failed. There is a certain period of time for which a node waits for a message from another node; if nothing arrives within that period, it considers the node to be dead. (A small sketch of this timeout-based detection follows this list.)
2. Stage 2: One of the Followers is appointed as the new Leader, typically the one with the most up-to-date copy of the data, so that as few confirmed writes as possible are lost.
3. Stage 3: Once the new Leader has been appointed, all the clients must now send their write requests to the new Leader. Also, if the old Leader comes back, it must not act as a Leader anymore. It should recognise that a new Leader has already been appointed and become a Follower.
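The timeout-based detection from Stage 1 can be sketched as follows; the timeout value and function names are hypothetical.

```python
import time

# A sketch of heartbeat-based failure detection; the timeout value and the
# function names below are hypothetical, not taken from any particular system.

HEARTBEAT_TIMEOUT = 5.0        # seconds to wait before declaring a peer dead

last_heartbeat = {}            # node id -> time the last heartbeat arrived

def record_heartbeat(node_id):
    last_heartbeat[node_id] = time.monotonic()

def is_alive(node_id):
    seen = last_heartbeat.get(node_id)
    return seen is not None and time.monotonic() - seen < HEARTBEAT_TIMEOUT

record_heartbeat("leader")
print(is_alive("leader"))      # True right after a heartbeat arrives
# If the Leader crashes, its heartbeats stop arriving; once the timeout
# expires, is_alive("leader") returns False and failover can begin.
```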
Every Follower maintains a Log of the data changes received from the Leader on its local disk. If the Follower later dies or gets disconnected from the network, it can recover easily with the help of this Log. Once the Follower joins the network again, it can reconnect to the Leader and request all the data changes that came after the last transaction recorded in its Log. In this way it receives all the data changes that happened during the time it was disconnected.
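A minimal sketch of this catch-up process, using a plain Python list as the change log (the structures are hypothetical):

```python
# A minimal sketch of follower catch-up (hypothetical structures): the Leader
# keeps an ordered log of data changes, and a reconnecting Follower requests
# every entry after the last transaction recorded in its own local log.

leader_log = [("SET", "x", 10), ("INCR", "x"), ("SET", "y", 7)]

class Follower:
    def __init__(self):
        self.log = []                       # local copy of the change log

    def catch_up(self, leader_log):
        start = len(self.log)               # index of the first missing change
        for entry in leader_log[start:]:    # request only what it missed
            self.log.append(entry)

follower = Follower()
follower.log = leader_log[:1]               # it disconnected after one entry
follower.catch_up(leader_log)
print(follower.log == leader_log)           # True: fully caught up again
```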
Multi-Leader Replication
This chapter discusses the concept of the Multi-Leader Replication scheme in detail, along with multiple use cases where a Multi-Leader replication scheme is justified. It also introduces the problem of Conflict Resolution faced by this replication scheme and the steps to resolve Write Conflicts.
A single Leader has an obvious limitation: if a client cannot reach the Leader, it cannot write to the database at all. To address this, we can introduce multiple Leaders into our system that can take requests from the Clients. In that case, if a client somehow cannot reach Leader A, it might connect with Leader B to process its requests. This is the Multi-Leader Replication scheme. In this scheme, every Leader sends the updates it receives to all the other nodes present in the system.
There are multiple places where using a Multi-Leader replication scheme can make
sense. Let's discuss them one by one.
Use Case 1: Systems with Multiple Data Centres
Suppose your system currently runs out of a single data-centre. You now plan to scale the system and hence start building data-centres in Asia and Europe as well. These data-centres hold replicas of the same database. Every data-centre has its own Leader, and the clients can now connect to the Leader present in the data-centre closest to them. The architecture might look like this.
With a Single Leader scheme, every write request would have to travel to the one data-centre hosting the Leader, no matter where the client is. But with the Multi-Leader replication scheme in the picture, clients can now send their requests to the data-centre closest to them. This saves latency and reduces the response time of the servers. This is possible since every data-centre has its own Leader and functions independently.
Use Case 2: Apps like Google Calendar
Suppose we have our Google Calendar signed into two of our phones, one laptop and one tablet. We add an event to our Calendar through one of our phones while it is offline. The event will be saved to that phone’s local storage and will be synced with all the other devices when the phone comes back online.
This is similar to the Multi-Leader replication scheme, where each device acts as a Leader since it accepts the reads and writes made by its user. There is an asynchronous multi-leader replication process between the replicas of our calendar present on all of our devices.
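A rough sketch of this behaviour, with a hypothetical Device class standing in for a phone or laptop:

```python
# A rough sketch of the calendar scenario (a hypothetical Device class stands
# in for a phone or laptop): local writes always succeed, are queued while the
# device is offline, and are replayed to the other replicas once it is online.

class Device:
    def __init__(self):
        self.events = []            # local copy of the calendar
        self.pending = []           # writes not yet replicated to the others

    def add_event(self, event, online, peers):
        self.events.append(event)   # the local write always succeeds
        self.pending.append(event)  # remember it for replication
        if online:
            self.sync(peers)

    def sync(self, peers):
        for event in self.pending:
            for peer in peers:
                peer.events.append(event)   # asynchronous multi-leader sync
        self.pending.clear()

phone, laptop = Device(), Device()
phone.add_event("Dentist, 5 pm", online=False, peers=[laptop])
print(laptop.events)                # [] (the phone was offline)
phone.sync(peers=[laptop])          # the phone comes back online
print(laptop.events)                # ['Dentist, 5 pm']
```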
Use Case 3: Collaborative Editing in Google Docs
In collaborative editing, several users can edit the same document at the same time, with each user’s device holding a replica of the document, so conflicting edits are possible. One way to resolve this is through acquiring locks, which ensures that there are no Write Conflicts. A user must obtain a lock over the document before editing it, and if another user wants to edit the same document, he/she might need to wait until the first user has committed their changes and released the lock. This is equivalent to the Single Leader replication scheme, but it makes the whole collaborative editing process extremely slow, with a terrible user experience.
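A small sketch of what the lock-based approach amounts to, using a single in-process lock and hypothetical names purely for illustration:

```python
import threading

# A small sketch of the lock-based approach (all names are hypothetical): a
# user must hold the document lock to edit, so edits are serialized, at the
# cost of making collaborative editing effectively single-writer.

doc_lock = threading.Lock()
document = {"heading": "Untitled"}

def edit_heading(user, new_heading):
    with doc_lock:                  # wait until the previous editor is done
        document["heading"] = new_heading
        return f"{user} set the heading to {new_heading!r}"

print(edit_heading("Client 1", "Design Doc"))
print(edit_heading("Client 2", "Final Design Doc"))   # runs only afterwards
```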
Write Conflicts
A Write Conflict is a situation where our system ends up with two different values for a single data-item and is not able to decide which of the copies to keep and which one to discard.
Let’s take the previous example of the Google Docs collaborative editing feature (used in the previous section). We have a common document whose replica is present on two local devices. Client 1 tries to update the heading of the doc on Device 1. At the same time, Client 2 also tries to update the heading of the same doc on Device 2. Since both devices hold a replica of the same doc and have independent Leaders, both writes succeed at that moment, and the updates made by the clients are stored in the local storage of their respective devices. However, when the changes are later replicated asynchronously between the devices, the conflict is detected.
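One common way to detect such a conflict is to compare versions: if two writes were based on the same version of the document but produced different values, they conflict. The sketch below uses hypothetical structures to illustrate this idea; it is not the mechanism Google Docs actually uses.

```python
# A sketch of detecting the conflict (hypothetical structures, not the actual
# Google Docs mechanism): each leader records which version of the document a
# write was based on; two different values based on the same version conflict.

def detect_conflict(local, remote):
    """True if both writes started from the same base version but disagree."""
    return (local["base_version"] == remote["base_version"]
            and local["value"] != remote["value"])

write_on_device_1 = {"base_version": 3, "value": "Title A"}   # Client 1's edit
write_on_device_2 = {"base_version": 3, "value": "Title B"}   # Client 2's edit

print(detect_conflict(write_on_device_1, write_on_device_2))  # True: conflict
```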
For the Multi-Leader replication scheme we will look at ways to resolve such Write Conflicts. The end goal is to make all the replicas agree on a single value for the data-item.
Note: Both of the above conflict resolution schemes, Last Write Wins and assigning a unique ID to each replica, will lead to data loss.
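To make the data-loss point concrete, here is a hedged sketch of the Last Write Wins method mentioned above, with hypothetical data: the write carrying the latest timestamp survives and the other is discarded.

```python
# A sketch of Last Write Wins (one common way to implement it, shown here with
# hypothetical data): every write carries a timestamp and only the latest one
# survives, so the other client's update is silently discarded. That discarded
# update is exactly the data loss the note above warns about.

def last_write_wins(conflicting_writes):
    return max(conflicting_writes, key=lambda w: w["timestamp"])

writes = [
    {"value": "Title A", "timestamp": 1700000001},   # Client 1's update
    {"value": "Title B", "timestamp": 1700000002},   # Client 2's update
]
print(last_write_wins(writes)["value"])   # 'Title B'; Client 1's write is lost
```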