Distributed Transaction Management Notes
Consider the following partial execution of transactions T1 and T2 under distributed 2PL:
r1 [accno 789], w1 [accno 789], r2 [accno 123], r1 [accno 123]
T1 is unable to proceed since its next operation w1 [accno 123] is blocked waiting for T2 to release the
R-lock obtained by r2 [accno 123].
T2 is unable to proceed since its next operation r2 [accno 789] is blocked waiting for T1 to release the
W-lock obtained by w1 [accno 789].
In a centralised DB system, the ‘waits-for’ graph would contain a cycle, and either T1 or T2 would be
rolled back.
In a DDB, the LTMs store their own local waits-for graphs, but also periodically exchange ‘waits-for’
information between each other, possibly at the instruction of the GTM.
In our distributed DB example, the transaction fragments of T1 and T2 executing at the site of account
123 would cause a waits-for arc T1 → T2 which would eventually be transmitted to the site of account
789.
Similarly, the transaction fragments executing at the site of account 789 would cause a waits-for arc
T2 → T1 which would eventually be transmitted to the site of account 123.
Whichever site detects the deadlock first will notify the GTM, which will select one of the transactions
to be aborted and restarted.
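To illustrate, here is a minimal Python sketch (not part of any particular DBMS) of how a site might merge its local waits-for graph with the arcs received from other sites and then test for a cycle; the site and transaction names are taken from the example above.

# Sketch: global deadlock detection by merging local waits-for graphs.
# Site and transaction names are taken from the example above.

def merge_waits_for(*graphs):
    """Union a collection of waits-for graphs, each a set of
    (waiting_txn, holding_txn) arcs."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged

def has_cycle(arcs):
    """Detect a cycle in the merged waits-for graph by depth-first search."""
    adj = {}
    for a, b in arcs:
        adj.setdefault(a, set()).add(b)
    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:
            return True               # back-edge found: a cycle, i.e. deadlock
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(nxt) for nxt in adj.get(node, ())):
            return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(dfs(n) for n in list(adj))

site_123 = {("T1", "T2")}   # T1 waits for T2's R-lock on account 123
site_789 = {("T2", "T1")}   # T2 waits for T1's W-lock on account 789
print(has_cycle(merge_waits_for(site_123, site_789)))   # True: deadlock detected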
Distributed Commit
Once a global transaction has completed all its operations, the ACID properties require that it be made
durable when it commits.
This means that the LTMs participating in the execution of the transaction must either all commit or
all abort their sub-transactions.
The most common protocol for ensuring distributed atomic commitment is the two-phase commit
(2PC) protocol. It involves two phases:
1. • The GTM sends the message PREPARE to all the LTMs participating in the execution of the
global transaction, informing them that the transaction should now commit.
• An LTM may reply READY if it is ready to commit, after first “forcing” (i.e. writing to
persistent storage) a PREPARE record to its log. After that point it may not abort its sub-
transaction, unless instructed to do so by the GTM.
• Alternatively, an LTM may reply REFUSE if it is unable to commit, after first forcing an
ABORT record to its log. It can then abort its sub-transaction.
2. • If the GTM receives READY from all LTMs, it sends the message COMMIT to all LTMs,
after first forcing a GLOBAL-COMMIT record to its log. All LTMs commit after receiving this
message.
• If the GTM receives REFUSE from any LTM, it transmits ROLLBACK to all LTMs, after first
forcing a GLOBAL-ABORT record to its log. All LTMs roll back their sub-transactions on
receiving this message.
• After committing or rolling back their sub-transactions, the LTMs send an acknowledgement
back to the GTM, which then writes an end-of-transaction record in its log.
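As an illustration of the message and logging order described above, the following Python sketch shows the GTM's side of 2PC; send, collect_votes and the gtm_log object are hypothetical stand-ins for real messaging and stable-storage primitives.

# Sketch of the GTM (coordinator) side of 2PC. send(), collect_votes() and
# the gtm_log object are hypothetical stand-ins for real messaging and
# stable-storage primitives.

def two_phase_commit(gtm_log, ltms, send, collect_votes):
    # Phase 1: ask every participating LTM to prepare.
    for ltm in ltms:
        send(ltm, "PREPARE")
    votes = collect_votes(ltms)      # e.g. {"ltm1": "READY", "ltm2": "REFUSE"}

    # Phase 2: force the decision to the log, then tell every LTM.
    if all(v == "READY" for v in votes.values()):
        gtm_log.force("GLOBAL-COMMIT")
        decision = "COMMIT"
    else:
        gtm_log.force("GLOBAL-ABORT")
        decision = "ROLLBACK"
    for ltm in ltms:
        send(ltm, decision)

    # Once all LTMs have acknowledged, the GTM writes an end-of-transaction
    # record to its log (acknowledgement handling is omitted here).
    return decision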
A DDB system can suffer from the same types of failure as centralised systems (software/hardware
faults, disk failures, site failures); but also loss/corruption of messages, failure of communication links,
or partitioning of the system into two or more disconnected subsystems.
There is therefore a need for a termination protocol to deal with situations where the normal progress
of the 2PC protocol is interrupted, e.g. by a site failure or a lost message.
There are three situations in 2PC where the GTM or an LTM may be waiting for a message, and these
need to be dealt with:
• An LTM may be waiting for a PREPARE message from the GTM:
– If it does not receive one within a specified time period, it can unilaterally abort its sub-transaction.
• The GTM may be waiting for the LTMs’ READY/REFUSE replies:
– If the GTM does not receive a reply within a specified time period, it can decide to abort the
transaction, sending ROLLBACK to all LTMs.
• An LTM which voted READY may be waiting for a ROLLBACK/COMMIT message from the GTM:
– It can try contacting the other LTMs to find out if any of them has either
(i) already voted REFUSE or not voted READY, or
(ii) received a ROLLBACK/COMMIT message.
If it cannot get a reply from any LTM for which (i) or (ii) holds, then it is blocked. It is
unable to either commit or abort its sub-transaction, and must retain all the locks associated
with this sub-transaction while in this state of indecision.
The LTM will persist in this state until enough failures are repaired to enable it to communi-
cate with either the GTM or some other LTM for which (i) or (ii) holds.
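A rough Python sketch of the cooperative termination check a READY LTM might perform follows; ask_peer is a hypothetical helper that returns a peer LTM's view of the transaction, or None if that peer is unreachable.

# Sketch: cooperative termination check for an LTM that has voted READY.
# ask_peer() is a hypothetical helper returning a peer LTM's view of the
# transaction, or None if that peer cannot be contacted.

def terminate_or_block(peer_ltms, ask_peer):
    for peer in peer_ltms:
        view = ask_peer(peer)
        if view is None:
            continue                          # peer unreachable, try the next one
        if view in ("REFUSED", "NOT-READY", "ROLLED-BACK"):
            return "ABORT"                    # case (i): global decision must be abort
        if view == "COMMITTED":
            return "COMMIT"                   # case (ii): global decision was commit
    return "BLOCKED"                          # keep all locks; retry after repairs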
Distributed Consensus
(This topic is based on material from Section 23.2.2 of [SKS].)
The possibility of blocking in 2PC may have a negative impact on performance. Blocking can be avoided
by using the idea of fault-tolerant distributed consensus.
Two widely used protocols for distributed consensus are Paxos and Raft. For example, Google’s Spanner
uses 2PC with Paxos.
Failure of 2PC participants could make data unavailable, in the absence of replication. Distributed
consensus can also be used to keep replicas of a data item in a consistent state (see later).
The distributed consensus problem is as follows:
• A set of n nodes need to agree on a decision; in the case of 2PC, the decision is whether or not
to commit a particular transaction.
• The inputs needed to make the decision are provided to all the nodes, and then each node votes
on the decision.
• This decision should be made in such a way that all nodes will “learn” the same value for the
decision, even if some nodes fail during the execution of the protocol, or there are network partitions.
• Further, the protocol should not block, as long as a majority of the nodes participating remain
alive and can communicate with each other.
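As a toy illustration of the majority requirement only (Paxos and Raft additionally involve ballots/terms, leaders and logs), a value can be considered decided once a strict majority of all n nodes agree on it:

# Toy illustration of the majority requirement in distributed consensus.
# Real protocols (Paxos, Raft) also involve ballots/terms, leaders and logs.

from collections import Counter

def decided_value(votes, n_nodes):
    """votes maps node -> proposed value (failed nodes are simply absent).
    Returns the value agreed by a strict majority of all n_nodes, or None."""
    for value, count in Counter(votes.values()).items():
        if count > n_nodes // 2:
            return value
    return None

print(decided_value({"n1": "commit", "n2": "commit"}, 5))                    # None
print(decided_value({"n1": "commit", "n2": "commit", "n3": "commit"}, 5))    # 'commit'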
2 Data Replication
(This section is based on material from Section 23.4 of [SKS].)
Distributed databases are often expected to provide high availability, i.e., access to data remains even
in the presence of (node and network) failures. This is usually achieved by replicating data. Now, the
main challenge is to ensure that all replicas of the same data item (or partition) are consistent, i.e.,
have the same value. (Note that we are not considering transactions at this point, only individual data
items.)
Given that some nodes may be disconnected or may have failed, it is impossible to ensure that all copies
have the same value. Instead, the system should ensure that even if some replicas do not have the latest
value, reads of a data item get to see the latest value that was written.
In general, read and write operations on the replicas of a data item must ensure what is called linearis-
ability: Given a set of read and write operations on a data item,
• there must be a linear ordering of the operations such that each read in the ordering should see
the value written by the most recent write preceding the read (or the initial value if there is no
such write), and
• if an operation o1 finishes before an operation o2 begins (based on external time), then o1 must
precede o2 in the linear order.
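For small histories, linearisability of operations on a single data item can be checked by brute force, trying every ordering that respects real time; the following Python sketch assumes each operation is recorded as a (start, end, kind, value) tuple.

# Brute-force linearisability check for one data item (small histories only).
# Each operation is a tuple (start, end, kind, value): kind is "r" or "w",
# and value is the value written or the value the read returned.

from itertools import permutations

def linearisable(ops, initial=None):
    def respects_real_time(order):
        # If b finished before a started, b must not come after a.
        for i, a in enumerate(order):
            if any(b[1] < a[0] for b in order[i + 1:]):
                return False
        return True

    def reads_see_latest_write(order):
        current = initial
        for _start, _end, kind, value in order:
            if kind == "w":
                current = value
            elif value != current:           # a read must see the latest write
                return False
        return True

    return any(respects_real_time(o) and reads_see_latest_write(o)
               for o in permutations(ops))

# A write of 5 overlapping with a read that returns 5 is linearisable:
print(linearisable([(0, 4, "w", 5), (2, 6, "r", 5)]))        # True
# A read that returns 7 although only 5 was ever written is not:
print(linearisable([(0, 4, "w", 5), (5, 6, "r", 7)]))        # False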
Note that linearisability only addresses what happens to a single data item, and it is not related to
serialisability.
Concurrency Control with Replicas
Returning to management of global transactions, now in the presence of replication, GTMs commonly
employ a ROWA (Read One, Write All) locking protocol:
• an R-lock on a data item is only placed on the copy of the data item that is being read by a local
sub-transaction;
• a W-lock is placed on all copies of a data item that is being written by some local sub-transaction;
Since every conflict involves at least one W-lock, and a conflict only needs to be detected at one site
for a global transaction to be prevented from executing incorrectly, it is sufficient to place an R-lock on
just one copy of a data item being read and to place a W-lock on all copies of a data item being written.
When coupled with 2PL, ROWA ensures that reads see the value written by the most recent write of
the same data item.
An alternative to ROWA is the Majority Protocol, which states that:
• if a data item is replicated at n sites, then a lock request (whether R-lock or W-lock) must be sent
to, and granted by, more than half of the n sites.
ROWA has the advantage of lower overheads for Read operations compared to the majority protocol,
making it advantageous for predominantly Read workloads.
The majority protocol has the advantage of lower overheads for Write operations compared with ROWA,
making it advantageous for predominantly Write or mixed workloads.
Also, it can be extended to deal with site or network failures, in contrast to ROWA which requires all
sites holding copies of a data item that needs to be W-locked to be available.
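The difference in locking overheads can be made concrete with a small Python sketch that computes which replica sites must grant a lock under ROWA and under the majority protocol; the replica catalogue and site names are purely illustrative.

# Sketch: which replica sites must grant a lock under ROWA versus the
# majority protocol. The replica catalogue and site names are illustrative.

import random

REPLICAS = {"accno 123": ["siteA", "siteB", "siteC"]}

def rowa_lock_sites(item, mode):
    sites = REPLICAS[item]
    if mode == "R":
        return [random.choice(sites)]        # R-lock any single copy
    return list(sites)                       # W-lock every copy

def majority_lock_sites(item):
    sites = REPLICAS[item]
    k = len(sites) // 2 + 1                  # strictly more than half, R or W
    return sites[:k]                         # e.g. the first k reachable sites

print(rowa_lock_sites("accno 123", "R"))     # one site
print(rowa_lock_sites("accno 123", "W"))     # all three sites
print(majority_lock_sites("accno 123"))      # two of the three sites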
Trading Off Consistency for Availability
Many applications require maximum availability in the presence of failures, even at the cost of consistency.
The so-called CAP Theorem states that it is not possible to achieve all three of the following at all
times in a system that has distributed replicated data:
• Consistency
• Availability
• Partition tolerance
Here, “consistency” means “linearisability” (see earlier).
Protocols such as ROWA and Majority sacrifice availability, not consistency.
However, for some applications, availability is mandatory, while they may only require the so-called
“BASE” properties:
• Basically Available
• Soft state
• Eventual consistency
Such applications require updates to continue to be executed on whatever replicas are available, even in
the event of network partitioning.
“Soft state” refers to the fact that there may not be a single well-defined database state, with different
replicas of the same data item having different values.
“Eventual consistency” guarantees that, once the partitioning failures are repaired, eventually all replicas
will become consistent with each other. This may not be fully achievable by the database system itself
and may need application-level code to resolve some inconsistencies (see later).
Many NoSQL systems do not aim to provide Consistency at all times, aiming instead for Eventual
Consistency.
Replication Protocols
With the distributed locking and commit protocols described earlier, all data replicas are updated as
part of the same global transaction — this is known as eager or synchronous replication.
However, many systems, including some relational ones, support replication with a weaker form of
consistency.
A common approach is for the DBMS to update just one ‘primary’ copy of a database object (also
known as leader-based replication), and to propagate updates to the rest of the copies (also known
as followers) afterwards — this is known as lazy or asynchronous replication.
Whenever the leader writes new data to its local storage, it also sends the change to all of its followers
as part of a replication log.
Each follower takes the log and updates its local copy of the data.
When a client wants to read from the database, it can query either the leader or any of the followers.
However, writes are only accepted on the leader.
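A minimal Python sketch of this leader/follower arrangement, using an in-memory replication log, is given below; the class and method names are illustrative rather than any system's actual API.

# Sketch of leader-based (lazy/asynchronous) replication with an in-memory
# replication log. Class and method names are illustrative.

class Follower:
    def __init__(self):
        self.data = {}
        self.applied_upto = 0                # number of log entries applied so far

    def apply(self, log):
        # Apply any log entries not yet seen; this is also how a lagging
        # follower catches up after a disconnection.
        for key, value in log[self.applied_upto:]:
            self.data[key] = value
        self.applied_upto = len(log)

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []                        # replication log of (key, value) writes
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value               # write to local storage first
        self.log.append((key, value))
        for f in self.followers:             # in reality this happens asynchronously
            f.apply(self.log)

f1, f2 = Follower(), Follower()
leader = Leader([f1, f2])
leader.write("accno 123", 1000)
print(f1.data)                               # {'accno 123': 1000}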
The advantage of asynchronous replication is faster completion of operations, e.g. even when some of
the non-primary copies are temporarily unavailable, as well as lower concurrency control overheads (if
transactions are supported).
The disadvantage is that non-primary data may not always be in sync with the primary copy. For
a transaction-based system, this may result in non-serialisable executions of global transactions read-
ing/writing different versions of the same data item.
This may not be a problem for certain classes of applications, e.g. statistical analysis, decision support
systems. But it may not be appropriate for some application classes, e.g. financial applications.
When failed links/sites recover, followers that are “lagging behind” need to be automatically brought
“up-to-date” (i.e., to satisfy “eventual consistency”). They can simply request from the leader all changes
made since they were disconnected.
Failure of the leader is more complicated.
Leader failure
The process of recovering from leader failure is called failover. It comprises 3 steps:
1. Detect that the leader has failed. This is usually done using timeouts (the nodes frequently bounce
messages back and forth).
2. Elect a new leader. Getting all the nodes to agree on a new leader is an application of the distributed
consensus problem.
3. Reconfigure the system to use the new leader. When the old leader comes back online, it needs to
become a follower.
There are many subtle problems which can occur in this process, e.g., the new leader may not have
received all writes from the old leader before it failed (see Kleppmann’s book for more examples).
Read Your Own Writes
Many applications allow a user to submit some data and then view what they have submitted.
If the application writes to the leader but reads from a follower, the update may not yet have reached
the follower.
In this case, it will appear to the user that their submission was lost.
What is needed is read-after-write consistency, also known as read your own writes.
Some ways of achieving “read-after-write consistency” with leader-based replication are:
• When reading an item the user may have modified, read it from the leader; otherwise, read it from
a follower. For example, a user may only be able to update their own social media profile, so if
they query that, read it from the leader.
• Keep track of the time of the last update and, for one minute after that, say, make all reads from
the leader.
• The client can remember the timestamp of its most recent write, and the system can ensure that
the follower serving any reads for that client reflects updates until at least that timestamp.
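The third approach above can be sketched as follows: the client remembers the timestamp of its last write and falls back to the leader whenever the chosen follower has not yet caught up to that timestamp. The replica interface used here (set, get, last_applied_ts) is an assumption of the sketch.

# Sketch of read-after-write consistency: the client remembers the timestamp
# of its last write and only reads from a follower that has caught up to it.
# The replica interface (set, get, last_applied_ts) is an assumption.

import time

class Client:
    def __init__(self, leader, follower):
        self.leader, self.follower = leader, follower
        self.last_write_ts = 0.0

    def write(self, key, value):
        self.leader.set(key, value)
        self.last_write_ts = time.time()     # remember when we last wrote

    def read(self, key):
        if self.follower.last_applied_ts >= self.last_write_ts:
            return self.follower.get(key)    # follower is fresh enough
        return self.leader.get(key)          # otherwise fall back to the leader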
Multi-Leader Replication
Sometimes it is advantageous to have multiple leaders, say in an environment where there are multiple
data centres, in which case having one leader per data centre makes sense.
The biggest problem with multi-leader replication is that write conflicts can occur, i.e., the same data
can be modified in different ways at two leaders. In this case, some form of conflict resolution is needed.
Leaderless Replication
In leaderless replication, there is no leader. Instead, writes are sent to some number of replicas. Amazon’s
original Dynamo system used leaderless replication, as do Cassandra and Riak, for example.
To solve the problem that some nodes may have been down when an update was applied, read requests
are also sent to several nodes in parallel. Version numbers are used to determine the most recent value
if different results are received.
Version numbers are also used to detect write conflicts.
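A quorum read in this setting can be sketched as follows, using simple integer version numbers (version vectors are discussed below); the replica responses shown are illustrative.

# Sketch: leaderless quorum read using simple integer version numbers.
# The replica responses are illustrative; real systems also repair stale copies.

def quorum_read(responses, r):
    """responses is a list of (version, value) pairs from the replicas that
    replied; at least r replies are required, and the newest value wins."""
    if len(responses) < r:
        raise RuntimeError("not enough replicas responded")
    _version, value = max(responses, key=lambda pair: pair[0])
    return value

# Three replicas queried, one of them holding a stale value:
print(quorum_read([(3, "new"), (3, "new"), (2, "old")], r=2))    # 'new'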
Riak
In Riak, you can choose:
• N, the number of nodes on which each data item is replicated;
• W, the number of nodes that must acknowledge a write before it is deemed successful;
• R, the number of nodes that must respond to a read before it is deemed successful;
where R ≤ N and W ≤ N.
When R > 1, the “most recent” item is returned. This is determined using a “logical” clock, also known
as a “vector clock”, using the version-vector scheme.
Version-vector scheme
• Suppose a data item d is replicated at n nodes.
• Then each node i stores an array Vi[1, . . . , n] of version numbers for d.
• When node i updates d, it increments the version number Vi[i] by one.
• Its update and version vector are propagated to the other nodes on which d is replicated.
Example
Suppose d is replicated at nodes 1, 2 and 3.
If d is initially created at 1, the version vector V1 would be [1, 0, 0].
When this update is propagated to nodes 2 and 3, their version vectors V2 and V3 would each be [1, 0, 0].
If d is then updated at node 2, V2 becomes [1, 1, 0].
Assume R = 2 and item d is now read from nodes 1 and 2 before the update from 2 has been propagated.
V1 is [1, 0, 0] while V2 is [1, 1, 0], so V2 is more recent, and d from node 2 is used.
Detecting Write Conflicts
Two replicas can be updated concurrently, leading to a write conflict. This can be detected using the
version-vector scheme.
Whenever two nodes i and j exchange an updated data item d, they have to determine whether their
copies are consistent:
• If there is a pair of positions k and m such that Vi[k] < Vj[k] and Vi[m] > Vj[m], then the copies
are inconsistent.
Example
Recall the example where d is replicated at nodes 1, 2 and 3.
If d is initially created at 1, the version vector V1 would be [1, 0, 0].
Say this update is only propagated to 2 where it is then updated, giving V2 = [1, 1, 0].
Suppose now that this version of d is replicated to 3, and then both 2 and 3 concurrently update d.
Then, V2 would be [1, 2, 0], while V3 would be [1, 1, 1].
These vectors are inconsistent, since V2[2] = 2 while V3[2] = 1, whereas V2[3] = 0 while V3[3] = 1.
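The comparison rule can be captured in a few lines of Python; the sketch below reproduces both examples, reporting either which copy is more recent or that the copies conflict.

# Sketch: comparing two version vectors to decide which copy is newer, or
# whether the copies conflict because of concurrent updates.

def compare(v_i, v_j):
    i_ahead = any(a > b for a, b in zip(v_i, v_j))
    j_ahead = any(a < b for a, b in zip(v_i, v_j))
    if i_ahead and j_ahead:
        return "conflict"                    # inconsistent copies
    if i_ahead:
        return "first is newer"
    if j_ahead:
        return "second is newer"
    return "equal"

print(compare([1, 0, 0], [1, 1, 0]))         # 'second is newer' (read example)
print(compare([1, 2, 0], [1, 1, 1]))         # 'conflict' (write-conflict example)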
Having detected write conflicts, they should be resolved in some way, ideally automatically.
For example, if the write operations commute with one another (e.g., each write adds an item to a
shopping basket) then they can be resolved.
However, in general, there is no technique that can automatically resolve all kinds of conflicting writes.
It is then left to the application programmer to decide how to resolve them.
Or one can adopt a policy of last write wins (available in Riak, e.g.), which is based on the system
clocks of the nodes (which are not necessarily completely synchronised) rather than vector clocks.
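For instance, if each replica's basket is represented as a set of items, conflicting baskets can be merged automatically by taking their union, since additions commute; this contrasts with last write wins, which would simply discard one of the baskets. A minimal sketch:

# Sketch: resolving conflicting shopping-basket replicas by set union, which
# is safe because "add item to basket" operations commute.

def merge_baskets(basket_a, basket_b):
    return basket_a | basket_b               # union of the two baskets

basket_at_node_2 = {"book", "pen"}
basket_at_node_3 = {"book", "mug"}
print(merge_baskets(basket_at_node_2, basket_at_node_3))
# {'book', 'mug', 'pen'} (set order is arbitrary)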
Further Reading
In [SKS], distributed transactions are covered in Section 23.1, commit protocols in 23.2, replication in
23.4, replication with weak degrees of consistency in 23.6, coordinator selection in 23.7, and distributed
consensus protocols in 23.8.
In Martin Kleppmann’s book, replication is covered in Chapter 5, transactions in Chapter 7, and con-
sistency and consensus in Chapter 9.
A Little Riak Book, Eric Redmond and John Daly, see https://github.com/basho-labs/little_riak_book/blob/master/rendered/riaklil-print-en.pdf