Lecture 13
CIS 5050
Software Systems
©2016-2024 Linh Thi Xuan Phan
Recall: Replication
Why distributed commit?
• Suppose a large bank is operating a 'sharded'
account database
– Example: Node #1 has account data for customers whose first
names start with A, node #2 has B, node #3 C, ...
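A minimal sketch of the partitioning in this example, assuming nodes are numbered from 1 and first names start with an ASCII letter. The helpers `node_for` and `nodes_touched` are hypothetical names introduced only for illustration.

```python
def node_for(first_name: str) -> int:
    """Return the 1-based number of the node that stores this customer's account."""
    letter = first_name.strip()[0].upper()
    if not "A" <= letter <= "Z":
        raise ValueError(f"unsupported first letter: {letter!r}")
    return ord(letter) - ord("A") + 1


def nodes_touched(sender: str, receiver: str) -> set:
    """Return the nodes that must take part in a transfer between two customers."""
    return {node_for(sender), node_for(receiver)}


# A transfer from Alice to Bob involves two different nodes, so both
# nodes must agree on whether the transfer happened.
print(nodes_touched("Alice", "Bob"))    # {1, 2}
print(nodes_touched("Alice", "Adam"))   # {1}
```

Any transfer whose two customers live on different nodes is exactly the kind of operation that needs a distributed commit.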
Atomicity
• Goal: We need to ensure atomicity
– Either all parts of the transaction are completed, or none of them is!
– This is one of the four classical ACID properties from databases
• Atomicity, Consistency, Isolation, Durability
Why one-phase commit fails
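[Figure: the coordinator sends COMMIT to nodes A, B, and C but crashes in the middle. One node answers "OK!", another answers "But I already aborted!", and the outcome for the remaining node is unknown ("???").]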
Two-Phase Commit (2PC)
• Idea: We need two rounds of communication!
2PC: Steps in more detail
• When a transaction wants to commit:
– The coordinator sends a prepare message to each subordinate
– Subordinate force-writes an abort or prepare log record, then
sends a no (abort) or yes (prepare) message to coordinator
– The coordinator then considers the votes:
• If everyone has voted yes, it force-writes a commit log record and sends
commit message to all subordinates
• Else, it force-writes an abort log record and sends an abort message
– The subordinates force-write abort/commit log records based on
the message they get, and then send an ack message to
coordinator
– The coordinator writes an end log record after getting all the
acks
• Why is the 'end' record useful?
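One way to approach the question about the end record is to look at what a restarted coordinator could do with its log. The sketch below is an illustration only: `coordinator_recovery_action` and its return values are hypothetical names, and it assumes that force-written records survive a crash and that the log holds the records of a single transaction.

```python
def coordinator_recovery_action(log_records):
    """Decide what a restarted coordinator does for one transaction.

    log_records lists the record names written so far, oldest first,
    for example ["BEGIN", "COMMIT"].
    """
    if "END" in log_records:
        # All acks arrived before the crash; nothing is left to do.
        return "NOTHING"
    if "COMMIT" in log_records:
        # The decision was made, but some acks may still be missing.
        return "RESEND_COMMIT"
    # No COMMIT record: either ABORT was logged or no decision was logged.
    # The commit record is force-written before any commit message is sent,
    # so no subordinate can have been told to commit.
    return "RESEND_ABORT"


print(coordinator_recovery_action(["BEGIN", "COMMIT"]))         # RESEND_COMMIT
print(coordinator_recovery_action(["BEGIN", "COMMIT", "END"]))  # NOTHING
```

Under these assumptions, the end record tells the coordinator which transactions it no longer has to remember or answer questions about.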
2PC: Protocol (1/2)
void coordinator(Transaction t, Set nodes)
{
    log.write("BEGIN");
    foreach (n : nodes)
        send(n, "PREPARE");
    Set responses = new Set();
    bool allInFavor = true;
    if (!t.localPartCanCommit())
        allInFavor = false;
    while (!responses.equals(nodes) &&
           !timeout() && allInFavor){
        Node sender;
        Message msg = recv(&sender);
        responses.add(sender);
        if (msg == "NO")
            allInFavor = false;
    }
    if (timeout())
        allInFavor = false;
    String result;
    if (allInFavor)
        result = "COMMIT";
    else
        result = "ABORT";

    log.write(result); //commits the result
    foreach (n : nodes)
        send(n, result);
    if (result == "COMMIT")
        t.performLocalPart();
    Set finished = new Set();
    while (!finished.equals(nodes)){
        Node sender;
        Message msg = recv(&sender);
        if (msg == "STATUS?")
            send(sender, result);
        if (msg == "ACK")
            finished.add(sender);
    }
    log.write("END");
}
2PC: Protocol (2/2)
void subordinate(Transaction t, Node coordinator)
{
    log.write("BEGIN");
    while (true) {
        Message msg = recvFrom(coordinator);
        if (msg == "PREPARE") {
            if (t.localPartCanCommit()) {
                log.write("PREPARE");
                send(coordinator, "YES");
            } else {
                log.write("ABORT");
                send(coordinator, "NO");
            }
        } else if (msg == "COMMIT") {
            log.write("COMMIT");
            t.performLocalPart();
            log.write("END");
            send(coordinator, "ACK");
            break;
        } else if (msg == "ABORT") {
            log.write("ABORT");
            log.write("END");
            send(coordinator, "ACK");
            break;
        }
    }
}
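The coordinator and subordinate routines can be exercised in a single process. The Python sketch below is a simplified, hypothetical model and not the lecture's implementation: method calls stand in for messages, there are no timeouts or crashes, and the names `Subordinate` and `run_two_phase_commit` are made up for this example.

```python
class Subordinate:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # stands in for t.localPartCanCommit()
        self.log = ["BEGIN"]
        self.applied = False           # True once the local part was performed

    def on_prepare(self):
        """Phase 1: log the vote, then return it."""
        if self.can_commit:
            self.log.append("PREPARE")
            return "YES"
        self.log.append("ABORT")
        return "NO"

    def on_decision(self, decision):
        """Phase 2: log the decision, apply it if it is COMMIT, then ack."""
        self.log.append(decision)
        if decision == "COMMIT":
            self.applied = True        # stands in for t.performLocalPart()
        self.log.append("END")
        return "ACK"


def run_two_phase_commit(subordinates):
    coordinator_log = ["BEGIN"]
    # Phase 1: send PREPARE to everyone and collect the votes.
    votes = [s.on_prepare() for s in subordinates]
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    coordinator_log.append(decision)   # writing this record is the commit point
    # Phase 2: distribute the decision and collect the acks.
    acks = [s.on_decision(decision) for s in subordinates]
    if all(a == "ACK" for a in acks):
        coordinator_log.append("END")
    return decision, coordinator_log


nodes = [Subordinate("A"), Subordinate("B"), Subordinate("C", can_commit=False)]
decision, log = run_two_phase_commit(nodes)
print(decision)                      # ABORT
print([s.applied for s in nodes])    # [False, False, False]
```

A single NO vote is enough to abort everywhere, which is the all-or-nothing behavior the protocol is meant to provide.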
2PC: Illustration
• Participants: Coordinator, Subordinate 1, Subordinate 2 (steps in time order)
– Coordinator: force-write begin log entry
– Coordinator: send “prepare” to Subordinate 1 and Subordinate 2
– Subordinates 1 and 2: force-write prepare log entry, then send “yes”
– Coordinator: force-write commit log entry (this is the commit point)
– Coordinator: send “commit” to Subordinate 1 and Subordinate 2
– Subordinates 1 and 2: force-write commit log entry, then send “ack”
– Coordinator: write end log entry
• Can we do better?
How can we improve 2PC?
• What is the real reason why 2PC can block?
– Suppose both the coordinator and a subordinate crash
– The decision could have been COMMIT, and the subordinate may
have already completed the operation
– But the other subordinates have no way to distinguish this from the
situation where the decision was ABORT
• Phase 2: Precommit
– If at least one vote is NO, the coordinator sends ABORT as before
– If all votes are YES, the coordinator sends PRECOMMIT to
at least k subordinates (where k is a tunable parameter)
• OK to send more PRECOMMITs if there are not enough responses
– Each subordinate replies with an ACK
• Phase 3: Commit
– Once the coordinator has received k ACKs, it sends COMMIT to
each subordinate
– Each subordinate responds with an ACK
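The precommit and commit phases can be sketched as follows. This is a simplified, hypothetical model: calls are synchronous, `Participant` and `three_phase_commit` are made-up names, and the "UNDECIDED" result for fewer than k ACKs is an assumption of this sketch, since that case is not covered here.

```python
class Participant:
    def __init__(self, name, vote="YES", acks_precommit=True):
        self.name = name
        self.vote = vote
        # False models a node that does not answer PRECOMMIT in time.
        self.acks_precommit = acks_precommit
        self.state = "INIT"

    def on_precommit(self):
        if not self.acks_precommit:
            return None
        self.state = "PRECOMMITTED"
        return "ACK"

    def on_commit(self):
        self.state = "COMMITTED"
        return "ACK"

    def on_abort(self):
        self.state = "ABORTED"
        return "ACK"


def three_phase_commit(participants, k):
    # Voting, as in 2PC: a single NO vote aborts the transaction.
    if any(p.vote != "YES" for p in participants):
        for p in participants:
            p.on_abort()
        return "ABORT"
    # Phase 2: keep sending PRECOMMIT to further participants until k ACKs arrive.
    acks = 0
    for p in participants:
        if acks >= k:
            break
        if p.on_precommit() == "ACK":
            acks += 1
    if acks < k:
        return "UNDECIDED"   # assumption: this sketch stops without a decision
    # Phase 3: with k ACKs collected, send COMMIT to every participant.
    for p in participants:
        p.on_commit()
    return "COMMIT"


nodes = [Participant("A", acks_precommit=False), Participant("B"),
         Participant("C"), Participant("D")]
print(three_phase_commit(nodes, k=2))   # COMMIT
```

In this run, A does not answer, so the coordinator moves on to B and C to reach k = 2 ACKs before it commits.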
3PC: Handling coordinator failures
• What if some nodes fail, including the coordinator?
– Remaining nodes ask each other what the coordinator has told them
• Situation #1: Nobody has seen a PRECOMMIT
– 2PC would have blocked in this case!
– But with 3PC, the remaining nodes can safely ABORT, since the
failed nodes could not have made any changes yet
• ... at least unless more than k nodes have failed! (why?)
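The rule for Situation #1 can be written down as a small decision function. This is a hedged sketch: `situation_1_decision`, its parameters, and the "UNSAFE" result are hypothetical, and cases in which some node has seen a PRECOMMIT or COMMIT are deliberately left undecided because they are not covered here.

```python
def situation_1_decision(messages_seen, failed_count, k):
    """Decide what the surviving nodes may do after the coordinator failed.

    messages_seen maps each surviving node to the set of coordinator
    messages it has received, e.g. {"B": {"PREPARE"}}.
    failed_count is the number of failed nodes, including the coordinator.
    """
    all_seen = set().union(*messages_seen.values())
    if "ABORT" in all_seen:
        return "ABORT"       # the coordinator had already decided to abort
    if "PRECOMMIT" in all_seen or "COMMIT" in all_seen:
        return None          # not Situation #1; outside this sketch
    if failed_count > k:
        # Too many nodes are gone to rule out that the failed ones
        # already went past the precommit phase.
        return "UNSAFE"
    return "ABORT"           # Situation #1: nobody has seen a PRECOMMIT


survivors = {"B": {"PREPARE"}, "C": {"PREPARE"}}
print(situation_1_decision(survivors, failed_count=2, k=2))   # ABORT
print(situation_1_decision(survivors, failed_count=3, k=2))   # UNSAFE
```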
Recap: Distributed commit
• Goal: Do something on all nodes, or none of them
– A very common requirement in distributed systems
– Naïve solution (one-phase commit) isn't safe
Plan for today
• Distributed commit
– Two-phase commit (2PC)
– Three-phase commit (3PC)
– Centralized checkpointing
– Chandy-Lamport algorithm
Why logging and recovery?
• Suppose a distributed system fails in the middle of
a long and expensive computation
Message logging
• What if the latest checkpoint was some time ago?
– Rolling back the entire system would destroy a lot of useful work
Checkpointing on a single node
• How would you record a checkpoint on one node?
• Idea #1: Just write its memory contents to disk!
– Problem: Write can take a long time!
– If the node keeps running during that time, the checkpoint will be
inconsistent: the state at the beginning is older than that at the end
– The node may never have been in the state that the checkpoint
describes (at least not at any single point in time)
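The inconsistency can be reproduced with a tiny simulation. The sketch below is illustrative only: `naive_checkpoint` and `transfer` are hypothetical names, and the "node keeps running" part is modeled by a single state change halfway through the copy.

```python
def naive_checkpoint(state, mutate_midway):
    """Copy `state` key by key while the node keeps running."""
    keys = sorted(state)
    checkpoint = {}
    for i, key in enumerate(keys):
        if i == len(keys) // 2:
            mutate_midway(state)   # the node does some work during the copy
        checkpoint[key] = state[key]
    return checkpoint


def transfer(state):
    """Move 50 from alice to bob; the total stays at 200."""
    state["alice"] -= 50
    state["bob"] += 50


state = {"alice": 100, "bob": 100}
checkpoint = naive_checkpoint(state, transfer)
print(checkpoint)                 # {'alice': 100, 'bob': 150}
print(sum(checkpoint.values()))   # 250
print(sum(state.values()))        # 200
```

The checkpoint mixes Alice's balance from before the transfer with Bob's balance from after it, so it describes a state with a total of 250 that the node never held.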
[Figure: timeline of node B with events B0, B1, B2, and B3, two messages (labeled 1 and 2), and a failure.]
The Domino Effect
[Figure: node timelines with checkpoints marked, followed by a failure.]
Snapshots
• Can we get a consistent "snapshot"
of the system?
[Figure: Node A and Node B exchanging messages. m2 and m3 are sent before B's checkpoint but received after A's checkpoint.]
Some simplifying assumptions
• To simplify our discussion,
we will assume that:
– Each node N_i has a direct "channel" c_ij
to every other node N_j
– Channels are unidirectional and
deliver messages in FIFO order
– Channels are reliable, i.e., messages
are never lost
• Termination condition:
– The node that initiated the snapshot has received a marker from every other node
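Under these assumptions, a marker-based snapshot can be simulated in one process. The steps of the algorithm are not spelled out in this section, so the sketch below follows the usual description of Chandy-Lamport. `Node`, `System`, and the bank-transfer messages are hypothetical names chosen for illustration.

```python
from collections import deque

MARKER = "MARKER"


class Node:
    def __init__(self, balance):
        self.balance = balance
        self.recorded_balance = None   # local state captured by the snapshot
        self.channel_state = {}        # sender -> in-flight amounts recorded
        self.recording = set()         # incoming channels still being recorded


class System:
    def __init__(self, balances):
        self.nodes = {name: Node(b) for name, b in balances.items()}
        # One unidirectional FIFO channel per ordered pair of nodes.
        self.channels = {(a, b): deque()
                         for a in balances for b in balances if a != b}

    def send(self, src, dst, amount):
        self.nodes[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, name):
        """Record the local state, then send a marker on every outgoing channel."""
        node = self.nodes[name]
        node.recorded_balance = node.balance
        node.recording = {a for (a, b) in self.channels if b == name}
        for (a, b), channel in self.channels.items():
            if a == name:
                channel.append(MARKER)

    def deliver(self, src, dst):
        msg = self.channels[(src, dst)].popleft()
        node = self.nodes[dst]
        if msg == MARKER:
            if node.recorded_balance is None:
                self.start_snapshot(dst)        # first marker: record state now
            node.recording.discard(src)         # this channel is now complete
            node.channel_state.setdefault(src, [])
        else:
            node.balance += msg
            if src in node.recording:           # sent before the sender's marker
                node.channel_state.setdefault(src, []).append(msg)

    def snapshot_total(self):
        local = sum(n.recorded_balance for n in self.nodes.values())
        in_flight = sum(sum(msgs) for n in self.nodes.values()
                        for msgs in n.channel_state.values())
        return local + in_flight


system = System({"A": 100, "B": 200, "C": 300})
system.send("A", "B", 30)      # already in flight when the snapshot starts
system.start_snapshot("B")     # B records 200 and sends markers to A and C
system.send("C", "B", 50)      # C has not seen a marker yet
while any(system.channels.values()):            # deliver until all channels drain
    for (src, dst), channel in system.channels.items():
        if channel:
            system.deliver(src, dst)

print(system.snapshot_total())   # 600
```

The snapshot records balances of 70, 200, and 250 plus the two in-flight transfers of 30 and 50, which adds up to the 600 that the system holds at all times.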
[Figure: snapshot example with three accounts (Alice, Bob, Charlie) exchanging transfers. Legible balances include Bob: $300, $200, $50, $150 and Charlie: $100, $350, $450; the remaining labels are not recoverable.]