
University of Pennsylvania

CIS 5050
Software Systems

Linh Thi Xuan Phan

Department of Computer and Information Science


University of Pennsylvania

Lecture 13: Fault tolerance


March 28, 2024

©2016-2024 Linh Thi Xuan Phan


Announcements
• Each team please email me the GitHub IDs of your
team members

• Needed for creating the team GitHub repository

2
©2016-2024 Linh Thi Xuan Phan
Recall: Replication

• How does replication help?


– Durability: Once data is stored, the system should not lose it
– Availability: Data should be accessible whenever we need it
– Speed: Accessing the data should be fast
– Scalability: Distribute load and functionality across machines
• What do we need for replication?
– A replication protocol (how to propagate updates to all replicas)
– A consistency model (what is the ‘right’ value to return to the client)
• Are there any new challenges with replication?
3
©2016-2024 Linh Thi Xuan Phan
Plan for today
• Distributed commit NEXT

– Two-phase commit (2PC)


– Three-phase commit (3PC)

• Logging and recovery


– Centralized checkpointing
– Chandy-Lamport algorithm

4
©2016-2024 Linh Thi Xuan Phan
Why distributed commit?
• Suppose a large bank is operating a ‘shared’
account database
– Example: Node #1 has account data for customers whose first
names start with A, node #2 has B, node #3 C, ...

• Now suppose Alice wants to send $100 to Bob


– This involves changes on two separate nodes: Adding $100 to Bob's
account (node #2), and taking $100 from Alice's account (node #1)
– What if node #2 already finished the former, and then node #1
notices that there isn't enough money in Alice's account?
– What if node #1 finishes its part, but node #2 crashes before it can
complete the operation?
– ...

5
©2016-2024 Linh Thi Xuan Phan
Atomicity
• Goal: We need to ensure atomicity
– Either all parts of the transaction are completed, or none of them is!
– This is one of the four classical ACID properties from databases
• Atomicity, Consistency, Isolation, Durability

• How can we ensure atomicity?

• Idea: Let's do one-phase commit


– We elect one node as the coordinator; the others are subordinates
– The coordinator tells all the subordinates whether to finalize their part of
the transaction (commit), or whether to undo it (abort)
– The subordinates do what they are told, and then acknowledge

• Is this a good solution?
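Before answering, here is what the naive one-phase protocol would look like, written in the same pseudocode style as the 2PC listings later in this lecture (the helper names are illustrative):

void onePhaseCoordinator(Transaction t, Set nodes, String decision)
{
  foreach (n : nodes)
    send(n, decision);        // "COMMIT" or "ABORT"; there is no voting round and no log
  waitForAcks(nodes);         // subordinates simply obey and acknowledge
}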

6
©2016-2024 Linh Thi Xuan Phan
Why one-phase commit fails
[Figure: The coordinator sends COMMIT to subordinates A, B, and C, but crashes in the middle. One subordinate acknowledges with "OK!", another protests "But I already aborted!", and a third never learns the decision ("???").]
• Problem #1: Subordinate cannot independently abort the transaction
• Problem #2: Coordinator might crash before subordinate receives the message; partial results are lost

7
©2016-2024 Linh Thi Xuan Phan
Two-Phase Commit (2PC)
• Idea: We need two rounds of communication!

• First round: Voting


– Coordinator sends prepare message to each subordinate
– Each subordinate responds with yes if it is able to commit its part of
the transaction, otherwise with no
• If the subordinate needs locks to commit, it needs to acquire them before
responding with yes (why?)

• Second round: Decision


– Coordinator sends commit or abort to each subordinate
– Each subordinate responds with an ack

• Result: Any site can decide to abort a transaction!


8
©2016-2024 Linh Thi Xuan Phan
2PC: What about crashes?
• We also need to handle the case where a node
crashes in the middle
– Nodes need to be able to 'remember' some information!

• Idea: Each node is given a local, persistent log


– This log could be stored on disk, or in NVRAM
• Is this enough to ensure that data remains persistent despite a possible crash?
– When a node recovers after a crash, it can look at its local log to
figure out what the next steps should be

9
©2016-2024 Linh Thi Xuan Phan
2PC: Steps in more detail
• When a transaction wants to commit:
– The coordinator sends a prepare message to each subordinate
– Subordinate force-writes an abort or prepare log record, then
sends a no (abort) or yes (prepare) message to coordinator
– The coordinator then considers the votes:
• If everyone has voted yes, it force-writes a commit log record and sends
commit message to all subordinates
• Else, it force-writes an abort log record and sends an abort message
– The subordinates force-write abort/commit log records based on
the message they get, and then send an ack message to
coordinator
– The coordinator writes an end log record after getting all the
acks
• Why is the 'end' record useful?
(In the original slides, messages are shown in red and log records in green.)
10
©2016-2024 Linh Thi Xuan Phan
2PC: Protocol (1/2)
void coordinator(Transaction t, Set nodes)
{
  log.write("BEGIN");
  foreach (n : nodes)
    send(n, "PREPARE");

  Set responses = new Set();
  bool allInFavor = true;
  if (!t.localPartCanCommit())
    allInFavor = false;

  while (!responses.equals(nodes) &&
         !timeout() && allInFavor) {
    Node sender;
    Message msg = recv(&sender);
    responses.add(sender);
    if (msg == "NO")
      allInFavor = false;
  }
  if (timeout())
    allInFavor = false;

  String result;
  if (allInFavor)
    result = "COMMIT";
  else
    result = "ABORT";

  log.write(result);   // commits the result
  foreach (n : nodes)
    send(n, result);
  if (result == "COMMIT")
    t.performLocalPart();

  Set finished = new Set();
  while (!finished.equals(nodes)) {
    Node sender;
    Message msg = recv(&sender);
    if (msg == "STATUS?")
      send(sender, result);
    if (msg == "ACK")
      finished.add(sender);
  }
  log.write("END");
}
11
©2016-2024 Linh Thi Xuan Phan
2PC: Protocol (2/2)
void subordinate(Transaction t, Node coordinator)
{
  log.write("BEGIN");
  while (true) {
    Message msg = recvFrom(coordinator);
    if (msg == "PREPARE") {
      if (t.localPartCanCommit()) {
        log.write("PREPARE");
        send(coordinator, "YES");
      } else {
        log.write("ABORT");
        send(coordinator, "NO");
      }
    } else if (msg == "COMMIT") {
      log.write("COMMIT");
      t.performLocalPart();
      log.write("END");
      send(coordinator, "ACK");
      break;
    } else if (msg == "ABORT") {
      log.write("ABORT");
      log.write("END");
      send(coordinator, "ACK");
      break;
    }
  }
}
12
©2016-2024 Linh Thi Xuan Phan
2PC: Illustration
[Timeline: Coordinator, Subordinate 1, Subordinate 2]
1. The coordinator force-writes a begin log entry and sends "prepare" to both subordinates.
2. Each subordinate force-writes a prepare log entry and sends "yes" to the coordinator.
3. The coordinator force-writes a commit log entry (this is the commit point) and sends "commit" to both subordinates.
4. Each subordinate force-writes a commit log entry and sends "ack" to the coordinator.
5. The coordinator writes an end log entry.

©2016-2024 Linh Thi Xuan Phan 13


2PC: Some observations
• All log records for a transaction contain its ID and
the coordinator’s ID
– The coordinator’s abort/commit record also includes IDs of all
subordinates (why?)

• Every message reflects a decision by the sender


– To ensure that this decision survives failures, it is first recorded in
the local log

• There exists no distributed commit protocol that can recover without communicating with other processes, in the presence of multiple failures!

©2016-2024 Linh Thi Xuan Phan


What if a node fails in the middle?
• Suppose we find a commit or abort log record for
transaction T, but not an end record?
– Need to redo/undo T
– If this node is the coordinator for T, keep sending commit / abort
messages to subordinates until acks have been received
• Suppose we find a prepare log record for
transaction T, but not commit/abort?
– This node is a subordinate for T
– Repeatedly contact the coordinator to find status of T, then write
commit/abort log record; redo/undo T; and write end log record
• Suppose we don’t find even a prepare record for T?
– Unilaterally abort and undo T
– This site may be the coordinator! If so, subordinates may send messages asking about T; their parts of T need to be undone as well
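As a rough sketch, this recovery logic can be written in the same pseudocode style as the 2PC listings (helper names such as redoOrUndo and isCoordinator are illustrative, not part of the lecture's code):

void recover(Transaction t)
{
  if (log.hasCommitOrAbortRecord(t) && !log.hasEndRecord(t)) {
    // Decision was made, but not all acks had arrived before the crash
    redoOrUndo(t, log.decision(t));
    if (isCoordinator(t))
      resendDecisionUntilAllAcked(t, log.decision(t));  // then write the end record
  } else if (log.hasPrepareRecord(t)) {
    // We voted yes as a subordinate but never learned the outcome
    Message decision = askCoordinator(t, "STATUS?");    // retry until it answers
    log.write(decision);                                // COMMIT or ABORT
    redoOrUndo(t, decision);
    log.write("END");
  } else {
    // Not even a prepare record: safe to abort unilaterally
    undo(t);
    log.write("ABORT");
  }
}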

©2016-2024 Linh Thi Xuan Phan


Coordinator failure
• What should a subordinate do when the
coordinator fails after it has already voted yes?
– Problem: Cannot decide whether to commit or abort T until
coordinator recovers - T is blocked!
– Consequences?

• Suppose all the subordinates know each other?


– Can be implemented (requires extra overhead in prepare msg)
– But: They are still blocked, unless one of them voted no, or one of
them has already received the coordinator's decision
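A sketch of what a blocked subordinate could do if it knows its peers (illustrative pseudocode; the peer-query messages are an extension, not part of basic 2PC):

Message askPeersForDecision(Transaction t, Set peers)
{
  foreach (p : peers) {
    Message m = askStatus(p, t);     // peer replies COMMIT, ABORT, VOTED-NO, or UNKNOWN
    if (m == "COMMIT" || m == "ABORT")
      return m;                      // someone already saw the decision
    if (m == "VOTED-NO")
      return "ABORT";                // the decision can only have been ABORT
  }
  return "UNKNOWN";                  // everyone voted yes and nobody heard from
                                     // the coordinator: still blocked
}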

©2016-2024 Linh Thi Xuan Phan


Link and remote site failures
• What should a node do when a remote node does
not respond during the commit protocol for
transaction T (either because the remote node
failed or the link failed)?
– If the current node is the coordinator for T, should abort T
– If the current node is a subordinate, and has not yet voted yes, it
should abort T
– If the current node is a subordinate and has voted yes, it is blocked
until the coordinator responds!

• Can we do better?

©2016-2024 Linh Thi Xuan Phan


Plan for today
• Distributed commit
– Two-phase commit (2PC)
– Three-phase commit (3PC) NEXT

• Logging and recovery


– Centralized checkpointing
– Chandy-Lamport algorithm

18
©2016-2024 Linh Thi Xuan Phan
How can we improve 2PC?
• What is the real reason why 2PC can block?
– Suppose both the coordinator and a subordinate crash
– The decision could have been COMMIT, and the subordinate may
have already completed the operation
– But the other subordinates have no way to distinguish this from the
situation where the decision was ABORT

• Idea: Let's make sure that the subordinates know the decision before they execute anything!
– When the coordinator has received yes votes from all the subordinates, it tells them that there will be a COMMIT ("PRECOMMIT")
– Once everyone (or at least a large number) acknowledges the
receipt, the coordinator then sends the actual COMMIT message
– What happens now when the coordinator fails?
• ... during the PRECOMMIT phase? ... during the actual COMMIT phase? 19
©2016-2024 Linh Thi Xuan Phan
Three-phase commit (3PC)
• Phase 1: Voting
– Coordinator sends PREPARE to each subordinate
– Each subordinate votes either YES or NO

• Phase 2: Precommit
– If at least one vote is NO, the coordinator sends ABORT as before
– If all votes are YES, the coordinator sends PRECOMMIT to
at least k subordinates (where k is a tunable parameter)
• OK to send more PRECOMMITs if there are not enough responses
– Each subordinate replies with an ACK

• Phase 3: Commit
– Once the coordinator has received k ACKs, it sends COMMIT to
each subordinate
– Each subordinate responds with an ACK
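A sketch of the coordinator's side of 3PC, in the same pseudocode style as the 2PC listings (helper names such as allVotesYes and waitForAcks are illustrative; timeouts and abort handling after the precommit point are omitted):

void coordinator3PC(Transaction t, Set nodes, int k)
{
  log.write("BEGIN");
  foreach (n : nodes)                  // Phase 1: voting (as in 2PC)
    send(n, "PREPARE");
  if (!allVotesYes(nodes)) {
    log.write("ABORT");
    foreach (n : nodes)
      send(n, "ABORT");
    return;
  }

  log.write("PRECOMMIT");              // Phase 2: precommit
  foreach (n : nodes)
    send(n, "PRECOMMIT");
  waitForAcks(k);                      // need at least k acks; OK to send more
                                       // PRECOMMITs if responses are missing

  log.write("COMMIT");                 // Phase 3: commit
  t.performLocalPart();
  foreach (n : nodes)
    send(n, "COMMIT");
  waitForAcks(nodes.size());
  log.write("END");
}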
20
©2016-2024 Linh Thi Xuan Phan
3PC: Handling coordinator failures
• What if some nodes fail, including the coordinator?
– Remaining nodes ask each other what the coordinator has told them
• Situation #1: Nobody has seen a PRECOMMIT
– 2PC would have blocked in this case!
– But with 3PC, the remaining nodes can safely ABORT, since the
failed nodes could not have made any changes yet
• ... at least unless more than k nodes have failed! (why?)

• Situation #2: At least one PRECOMMIT or COMMIT


– The remaining subordinates know that the decision was (or was
going to be) COMMIT
– They can all decide to go ahead and COMMIT; once the other nodes
come back up, they can learn about this decision and COMMIT too
• Situation #3: Network partition
– 3PC isn't safe in this case! (why?)
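For situations #1 and #2, the surviving subordinates' decision rule can be sketched as follows (illustrative pseudocode):

void decideAfterCoordinatorFailure(Set survivors)
{
  bool sawPrecommitOrCommit = false;
  foreach (n : survivors)
    if (n.hasSeen("PRECOMMIT") || n.hasSeen("COMMIT"))
      sawPrecommitOrCommit = true;

  if (sawPrecommitOrCommit)
    decide("COMMIT");   // Situation #2: the decision was (or was going to be) COMMIT
  else
    decide("ABORT");    // Situation #1: safe as long as no more than k nodes failed
}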

21
©2016-2024 Linh Thi Xuan Phan
Recap: Distributed commit
• Goal: Do something on all nodes, or none of them
– A very common requirement in distributed systems
– Naïve solution (one-phase commit) isn't safe

• We have seen two solutions: 2PC and 3PC


– The key idea is for the nodes to 'vote' whether to commit or abort
– A coordinator is elected to collect votes and broadcast the decision
– 2PC is much simpler, but can block when the coordinator fails
– 3PC doesn't have that problem, but it is more complicated
– Neither 2PC nor 3PC can tolerate network partitions

22
©2016-2024 Linh Thi Xuan Phan
Plan for today
• Distributed commit
– Two-phase commit (2PC)
– Three-phase commit (3PC)

• Logging and recovery NEXT

– Centralized checkpointing
– Chandy-Lamport algorithm

23
©2016-2024 Linh Thi Xuan Phan
Why logging and recovery?
• Suppose a distributed system fails in the middle of
a long and expensive computation

• What should we do?


– Restarting from scratch may be expensive, or even impossible!

• Idea: Periodically record a checkpoint of the system


– Each node writes down ("logs") its current state, e.g., on its local disk
– If something bad happens, the nodes can go back to the latest
checkpoint and then continue from there!

24
©2016-2024 Linh Thi Xuan Phan
Message logging
• What if the latest checkpoint was some time ago?
– Rolling back the entire system would destroy a lot of useful work

• Idea: Use message logging + deterministic replay


– Whenever a message is sent, it is saved somewhere – either by the
sender (sender-based logging) or by the recipient (receiver-based)
– Nodes also record all nondeterministic events
• What are these? Why do they need to be remembered?
– When a node crashes, we can roll back only that node to its latest
checkpoint, and then feed it all the recorded messages and
nondeterministic events
– This will bring the node back to the state it was before the crash
• What are we assuming about the software on the node?
• Does this assumption generally hold? What does it take to make it hold?
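A sketch of sender-based logging and replay in the same pseudocode style (helper names are illustrative):

void sendLogged(Node dest, Message m)
{
  msgLog.append(dest, m);              // the sender keeps a copy of every outgoing message
  send(dest, m);
}

void onNondeterministicEvent(Event e)
{
  eventLog.append(e);                  // e.g., delivery order, timer values, random numbers
}

void recoverNode(Node crashed)
{
  restoreCheckpoint(crashed, latestCheckpoint(crashed));
  foreach (m : messagesDeliveredTo(crashed, sinceLatestCheckpoint))
    redeliver(crashed, m);             // in the recorded order
  foreach (e : eventLog.since(latestCheckpoint(crashed)))
    replay(crashed, e);
  // Assuming the node's software is deterministic given the same inputs,
  // this brings it back to the state it was in just before the crash.
}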

25
©2016-2024 Linh Thi Xuan Phan
Checkpointing on a single node
• How would you record a checkpoint on one node?
• Idea #1: Just write its memory contents to disk!
– Problem: Write can take a long time!
– If the node keeps running during that time, the checkpoint will be
inconsistent: the state at the beginning is older than that at the end
– The node may never have been in the state that the checkpoint
describes (at least not at any given time)

• Idea #2: Stop the node during the checkpoint


– Not ideal either – system will be unresponsive during that time!
– Actual checkpointing systems use techniques like copy-on-write
(CoW) to take a checkpoint in memory very quickly (a few ms)
• This can then be written to disk asynchronously!
– But this trick won't work if we have a distributed system!
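A minimal sketch of the copy-on-write idea mentioned above, in the same pseudocode style (markPagesCopyOnWrite and the background write are illustrative placeholders, not a real API):

void takeLocalCheckpoint()
{
  Snapshot snap = markPagesCopyOnWrite();   // freeze a logical copy; takes only a few ms
  // Execution resumes immediately; a write to a frozen page first copies
  // that page, so the snapshot itself never changes.
  inBackground {
    foreach (page : snap.pages())
      disk.write(page);                     // write the frozen copy asynchronously
    log.write("CHECKPOINT COMPLETE");
  }
}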
26
©2016-2024 Linh Thi Xuan Phan
Consistent vs. inconsistent cuts
[Figure: Node A executes events A0..A3 and node B executes events B0..B3; messages m1 and m2 are exchanged between them. Two cuts are drawn: an inconsistent cut, which includes the receipt of a message but not its send, and a consistent cut, which includes the send of every message it has received.]

• We can define a cut of the distributed execution


– Basically, one prefix of each node's local execution, taken together
• When can we call a cut consistent?
– If, for every event it contains, it also contains all events that
'happened before' that event.
– In particular, every received message must have been sent
• Which of the above cuts are consistent?
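In pseudocode, the message condition can be checked directly (a sketch with illustrative helper names; a cut is given as one prefix of each node's local execution):

bool isConsistent(Cut cut)
{
  foreach (e : cut.events())
    if (e.isReceive() && !cut.contains(e.matchingSend()))
      return false;     // the cut contains a receive without its send
  return true;
}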
27
©2016-2024 Linh Thi Xuan Phan
Recovery lines
[Figure: Several nodes take checkpoints over time, marked on each node's timeline, starting from the initial state. After a failure, the most recent set of checkpoints that forms a consistent cut is the recovery line; other combinations of checkpoints form inconsistent cuts.]

• Suppose we want to roll back the entire system


– Can we simply roll back each individual node to a recent checkpoint?
– Problem: Checkpoints could describe an inconsistent cut!
• A recovery line is a consistent set of checkpoints
– How can we find such a recovery line?

28
©2016-2024 Linh Thi Xuan Phan
The Domino Effect
[Figure: The nodes take checkpoints independently. After a failure, rolling one node back invalidates messages exchanged with the others, forcing them back to earlier checkpoints in turn; the rollbacks cascade until the only recovery line left is the initial state.]

• What if a set of checkpoints is not consistent?


– Need to roll back some of the nodes to an even earlier checkpoint!
– But that could create further inconsistencies!
• Problem: Cascading rollbacks! Why?
– In this example, the nodes just took their checkpoints individually
– Thus, it is unlikely that they would (ever) form a consistent cut
– The nodes need to coordinate their checkpointing!
29
©2016-2024 Linh Thi Xuan Phan
Centralized checkpointing
• Similar to 2PC with a coordinator

• Coordinator first multicasts a CHECKPOINT message to all processes

• When a process receives the message:


– Takes a local checkpoint
– Queues any subsequent outgoing messages (why?)
– Responds with an ACK

• When the coordinator receives all the ACKs, it sends a DONE message
– At that point, the queued messages are sent, and the execution
proceeds normally again
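A sketch of this coordinated protocol in the same pseudocode style (helper names are illustrative):

void checkpointCoordinator(Set nodes)
{
  foreach (n : nodes)
    send(n, "CHECKPOINT");
  waitForAcks(nodes);
  foreach (n : nodes)
    send(n, "DONE");
}

void onCheckpointMessage(Node coordinator)
{
  takeLocalCheckpoint();
  holdBackOutgoingMessages();      // queue new sends so the cut stays consistent
  send(coordinator, "ACK");
  waitFor(coordinator, "DONE");
  releaseQueuedMessages();         // resume normal execution
}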

©2016-2024 Linh Thi Xuan Phan


Can we do better?
• Centralized checkpointing has several drawbacks

• Problem #1: Need for a central coordinator


– Ideally, we would like a fully distributed protocol
– Any node should be able to initiate a checkpoint at any time

• Problem #2: Queueing


– This is necessary to ensure that the cut is consistent
– But it also holds up the system for a while!

• Problem #3: Messages are not captured


– Some messages may still be 'in flight' at checkpoint time
– If we roll back every node to the checkpoint, these messages will
not be re-sent!

31
©2016-2024 Linh Thi Xuan Phan
Snapshots
• Can we get a consistent "snapshot"
of the system?

• This should include:


– A checkpoint for each node
– The set of messages that were "in flight"
at the time (i.e., sent but not yet received)

• Can we expect to capture a state of the entire system as of some particular time?
– Not unless clocks are perfectly synchronized!
– All we can hope for is a state that the system could have been in,
and from which the actual state is reachable (why?)
32
©2016-2024 Linh Thi Xuan Phan
What are in flight messages from B to A?

[Figure: messages m1..m4 sent from Node B to Node A, relative to the two nodes' checkpoints]
– m1: sent and received before both A's and B's checkpoints (not in flight)
– m2, m3: sent before B's checkpoint but received after A's checkpoint (these are the in-flight messages)
– m4: sent and received after both A's and B's checkpoints (not in flight)
33
©2016-2024 Linh Thi Xuan Phan
Some simplifying assumptions
• To simplify our discussion,
we will assume that:
– Each node Ni has a direct "channel" cij
to every other node Nj
– Channels are unidirectional and
deliver messages in FIFO order
– Channels are reliable, i.e., messages
are never lost

• How can we make these assumptions true?


– Direct channel: This is just an abstraction; all we need is that every
node can send messages to every other node
– FIFO order: Can use sequence numbers
– Reliable: Can use retransmissions
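A sketch of how such a channel abstraction might be built on top of an unreliable network (illustrative pseudocode; sequence numbers give FIFO order, retransmission gives reliability):

void channelSend(Node dest, Message m)
{
  m.seq = nextSeq[dest]++;             // stamp each message with a sequence number
  retransmitUntilAcked(dest, m);       // reliability
}

void channelReceive(Node src, Message m)
{
  buffer[src].add(m);
  while (buffer[src].contains(expected[src])) {   // deliver strictly in FIFO order,
    deliverToApplication(buffer[src].remove(expected[src]));   // filling any gaps first
    expected[src]++;
  }
}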
34
©2016-2024 Linh Thi Xuan Phan
K. Mani Chandy and Leslie Lamport: "Distributed Snapshots: Determining Global States of Distributed Systems", ACM TOCS 3(1):63-75
Chandy-Lamport algorithm
• When a node Ni wants to initiate a snapshot:
– it takes a local checkpoint,
– it sends a special marker message on each of its outgoing channels, and
– it begins recording messages that arrive on its incoming channels
• When a node Nj receives a marker on channel cij:
– If Nj has not yet taken a checkpoint:
• it takes a local checkpoint and sends a special marker on each outgoing channel,
• it records the state of cij as the empty set, and
• it begins recording messages that arrive on any
other incoming channel ckj
– If Nj has already taken a checkpoint:
• it stops recording messages for cij

• Termination condition:
– The node that initiated the snapshot has received a marker from every other node

• What are the 'recorded' sets of messages?


– These are the messages that are "in flight" at the snapshot time
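A sketch of the algorithm in the same pseudocode style as the earlier listings (helper names such as startRecording are illustrative):

void initiateSnapshot()
{
  takeLocalCheckpoint();
  foreach (c : outgoingChannels())
    send(c, MARKER);
  foreach (c : incomingChannels())
    startRecording(c);                  // record messages arriving on each channel
}

void onMessage(Channel c, Message m)
{
  if (m == MARKER) {
    if (!checkpointTaken()) {
      takeLocalCheckpoint();
      foreach (out : outgoingChannels())
        send(out, MARKER);
      recordedState(c) = {};            // the state of c is the empty set
      foreach (in : incomingChannels())
        if (in != c)
          startRecording(in);
    } else {
      stopRecording(c);                 // recorded set = messages in flight on c
    }
  } else {
    if (isRecording(c))
      recordedState(c).add(m);
    deliverToApplication(m);
  }
}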
35
©2016-2024 Linh Thi Xuan Phan
Chandy-Lamport: Example
[Figure: three nodes running a money-transfer application over time.
 Node A (Alice):   $500 → $600 → $350 → $400
 Node B (Bob):     $300 → $200 → $150 → $50
 Node C (Charlie): $100 → $350 → $450
 Messages: Bob → Alice: $100;  Alice → Charlie: $250;  Bob → Alice: $50;  Bob → Charlie: $100
 The algorithm records the following snapshot:]

Node A: Chk = "Alice: $350";   S_BA = { Bob → Alice: $50 };       S_CA = {}
Node B: Chk = "Bob: $150";     S_AB = {};                         S_CB = {}
Node C: Chk = "Charlie: $100"; S_AC = { Alice → Charlie: $250 };  S_BC = {}

• What happens if we roll back & replay messages?


– We arrive at the state captured by the snapshot: each node's checkpoint plus the recorded in-flight messages (the orange dots in the original figure)! 36
©2016-2024 Linh Thi Xuan Phan
Recap: Chandy-Lamport algorithm
• Goal: A consistent "snapshot" of the system
– One checkpoint per node, and all the messages that were "in flight"
– Not necessarily a state that the system ever was in at any given
(wallclock) time

• Properties of the algorithm:


– Fully distributed; no need for a central coordinator
– Any node can trigger a snapshot at any time
– Execution can continue; no need to "queue" messages

• How can we use this?


– Example: Stable property detection
• Some of the algorithms we have discussed (e.g., deadlock detection via circular
wait) assume that we have a consistent snapshot of the global system state
– Example: Failure recovery

37
©2016-2024 Linh Thi Xuan Phan
