Chen 07
Basic Concepts
Fault Tolerance is closely related to the notion of
“Dependability”. In Distributed Systems, this is
characterized under a number of headings:
But, What Is “Failure”?
Definition:
Types of Fault
There are three main types of ‘fault’:
Failure Masking by Redundancy
Strategy: hide the occurrence of failure
from other processes using redundancy.
Three main types:
DS Fault Tolerance Topics
1. Process Resilience
2. Reliable Client/Server Communication
3. Reliable Group Communication
4. Distributed COMMIT
5. Recovery Strategies
1. Process Resilience
A process can be made fault tolerant by replicating it
into a group of identical processes.
Flat vs. Hierarchical Groups
a) Communication in a flat group: all processes are equal and decisions are made
collectively. There is no single point of failure, but decision making is more
complicated because consensus is required.
b) Communication in a simple hierarchical group: one process is elected coordinator,
and it selects another process (a worker) to perform the operation. The coordinator
is a single point of failure, but it can make decisions quickly and easily without
first having to reach consensus.
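The trade-off between the two group organizations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the strict-majority rule used for the flat group, and the "lowest-named worker" selection rule are hypothetical choices, not something the slides prescribe.

```python
from collections import Counter


def flat_group_decide(proposals):
    """Flat group: every member proposes a value; a strict majority wins.

    `proposals` maps member name -> proposed value (hypothetical format).
    Returns None when no value has a strict majority.
    """
    value, votes = Counter(proposals.values()).most_common(1)[0]
    if votes * 2 <= len(proposals):
        return None  # no consensus reached in this round
    return value


def hierarchical_assign(alive, coordinator):
    """Hierarchical group: the coordinator alone picks a worker.

    `alive` is the set of members currently running. If the coordinator
    is not among them, nothing can be decided (single point of failure).
    """
    if coordinator not in alive:
        raise RuntimeError("coordinator unavailable: single point of failure")
    workers = sorted(m for m in alive if m != coordinator)
    # Assumed selection rule: pick the first worker by name.
    return workers[0] if workers else coordinator


proposals = {"p1": "commit", "p2": "commit", "p3": "abort"}
flat_result = flat_group_decide(proposals)
worker = hierarchical_assign({"p1", "p2", "p3"}, coordinator="p1")
```

The flat variant keeps working when any single member disappears but may fail to decide at all, while the hierarchical variant decides immediately as long as the coordinator is alive.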
Failure Masking and Replication
By organizing a fault tolerant group of
processes, we can protect a single
vulnerable process.
The Goal of Agreement Algorithms
History Lesson: The Byzantine Empire
Time: 330-1453 AD.
Agreement in Faulty Systems (1)
How does a process group deal with a faulty member?
Example: RPC Semantics and Failures
The RPC mechanism works well as long as both the client
and server function perfectly. (the higher level)
The Five Classes of Failure (2)
An appropriate exception handling mechanism can
deal with a missing server. However, such
technologies tend to be very language-specific,
and they also tend to be non-transparent.
The Five Classes of Failure (3)
Server crashes are dealt with by adopting one
of three possible philosophies:
The Five Classes of Failure (4)
Lost replies are difficult to deal with.
Why was there no reply? Is the server dead, slow, or
did the reply just go missing? Emmmmm?
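One common way to make retransmission safe when the client cannot tell why a reply is missing is to tag each request with an identifier and have the server cache its replies. The sketch below simulates that idea; the `Server` class, the `call_with_retry` helper, and the `drop_replies` parameter are all hypothetical, and the slides as extracted do not say which technique they recommend.

```python
class Server:
    """Toy server that applies deposits and remembers replies by request ID."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # request_id -> cached reply

    def handle(self, request_id, amount):
        # A retransmitted request is answered from the cache, so the
        # operation itself is executed only once.
        if request_id in self.replies:
            return self.replies[request_id]
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance


def call_with_retry(server, request_id, amount, drop_replies, max_attempts=3):
    """Send a request, retransmitting when the reply is lost.

    `drop_replies` is how many replies the simulated network loses
    before one gets through (an assumption made for this demo).
    """
    for attempt in range(max_attempts):
        reply = server.handle(request_id, amount)
        if attempt < drop_replies:
            continue  # reply lost: the client times out and retransmits
        return reply
    raise TimeoutError("no reply after retries")


server = Server()
result = call_with_retry(server, request_id=1, amount=100, drop_replies=2)
```

Even though the request reaches the server three times here, the deposit is applied once, which is the property a client needs before it can retry blindly.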
Basic Reliable-Multicasting Schemes
Commit Protocols
One-Phase Commit Protocol:
◼ An elected coordinator tells all the other
processes to perform the operation in question.
The solutions:
◼ The Two-Phase and Three-Phase Commit
Protocols.
The Two-Phase Commit Protocol
First developed in 1978.
Summarized: GET READY, OK, GO AHEAD.
1. The coordinator sends a VOTE_REQUEST
message to all group members.
2. The group member returns VOTE_COMMIT if it
can commit locally, otherwise VOTE_ABORT.
3. All votes are collected by the coordinator. A
GLOBAL_COMMIT is sent if all the group
members voted to commit. If at least one group
member voted to abort, a GLOBAL_ABORT is
sent.
4. The group members then COMMIT or ABORT
based on the last message received from the
coordinator.
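The four steps above can be sketched as a single-process simulation. This is a minimal illustration, not a networked implementation: the function name `two_phase_commit` and the representation of each group member as a callable that reports whether it can commit locally are assumptions made for the example. The message names are the ones used in the steps.

```python
def two_phase_commit(participants):
    """Simulate one round of two-phase commit.

    `participants` maps a member name to a zero-argument callable that
    returns True if the member can commit locally (hypothetical format).
    Returns the coordinator's decision and each member's final action.
    """
    # Phase 1: the coordinator sends VOTE_REQUEST and collects the votes.
    votes = {}
    for name, can_commit in participants.items():
        votes[name] = "VOTE_COMMIT" if can_commit() else "VOTE_ABORT"

    # Phase 2: commit only if every member voted to commit.
    if all(vote == "VOTE_COMMIT" for vote in votes.values()):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"

    # Each member acts on the last message received from the coordinator.
    action = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"
    outcome = {name: action for name in participants}
    return decision, outcome


all_ready = {"p1": lambda: True, "p2": lambda: True}
one_refuses = {"p1": lambda: True, "p2": lambda: False}

decision_ok, outcome_ok = two_phase_commit(all_ready)
decision_bad, outcome_bad = two_phase_commit(one_refuses)
```

The simulation assumes every message is delivered and nobody crashes, which is precisely the assumption that does not hold in a real distributed system.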
Two-Phase Commit Finite State Machines
Big Problem with Two-Phase Commit
Both the coordinator and the group members can
block while waiting for each other's messages. If a
crash occurs at the wrong moment, the whole group
can wait indefinitely, in effect a deadlock.
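To see where the blocking comes from, consider a member that has voted to commit and then stops hearing from the coordinator. A commonly described workaround is for that member to ask its peers what state they are in. The sketch below encodes that decision rule; the function name and the state labels INIT, READY, COMMIT and ABORT are assumptions for this example and are not taken from the slides as extracted.

```python
def resolve_when_coordinator_silent(peer_states):
    """Decide for a member stuck after voting to commit.

    `peer_states` is a list of the states reported by reachable peers,
    each one of "INIT", "READY", "COMMIT" or "ABORT" (assumed labels).
    """
    # A peer that already committed must have seen GLOBAL_COMMIT.
    if "COMMIT" in peer_states:
        return "COMMIT"
    # A peer that aborted saw GLOBAL_ABORT; a peer still in INIT has not
    # voted yet, so the coordinator cannot have decided to commit.
    if "ABORT" in peer_states or "INIT" in peer_states:
        return "ABORT"
    # Every reachable peer is also waiting: nobody knows the outcome,
    # so the member has to keep blocking until the coordinator recovers.
    return "BLOCKED"


stuck = resolve_when_coordinator_silent(["READY", "READY"])
freed = resolve_when_coordinator_silent(["READY", "COMMIT"])
```

The `BLOCKED` case is the situation the slide warns about: when all surviving members are waiting, no amount of asking around resolves the outcome.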
Summary (1 of 2)
Fault Tolerance:
Types of failure:
Summary (2 of 2)
Fault Tolerance is generally achieved through use of
redundancy and reliable multicasting protocols.