Lecture 04
When you have 120 servers in the DC, the mean time to failure (MTTF) of the next machine is about 1 month.
When you have 12,000 servers in the DC, the MTTF drops to about 7.2 hours!
[Figure: 1000s of processes communicating over an unreliable network]
Group Membership Protocol
[Figure: process group over an unreliable communication network; crash-stop failures only]
I. pj crashes
II. Failure Detector: some process pi finds out quickly that pj crashed
III. Dissemination
Next
• How do you design a group membership protocol?
I. pj crashes
• Nothing we can do about it!
• A frequent occurrence
• Common case rather than exception
• Frequency goes up linearly with size of datacenter
II. Distributed Failure Detectors: Desirable Properties
• Completeness = each failure is detected
• Accuracy = there is no mistaken detection
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
Distributed Failure Detectors: Properties
• Completeness
• Accuracy
  Impossible together in lossy networks [Chandra and Toueg]
  (If possible, then can solve consensus!)
• Speed
  – Time to first detection of a failure
• Scale
  – Equal Load on each member
  – Network Message Load
What Real Failure Detectors Prefer
• Completeness: guaranteed
• Accuracy: partial/probabilistic guarantee
• Speed
  – Time to first detection of a failure (time until some process detects the failure)
• Scale
  – Equal Load on each member
  – Network Message Load (no bottlenecks/single failure point)
Failure Detector Properties
• Completeness: in spite of arbitrary simultaneous process failures
• Accuracy
• Speed
  – Time to first detection of a failure
• Scale
  – Equal Load on each member
  – Network Message Load
Centralized Heartbeating
[Figure: every process pi sends "pi, Heartbeat Seq. l++" to a central process pj, which becomes a hotspot]
• Heartbeats sent periodically
• If a heartbeat is not received from pi within the timeout, mark pi as failed
Ring Heartbeating
[Figure: processes arranged in a ring; each pi sends "pi, Heartbeat Seq. l++" to its ring neighbor(s) pj]
• Unpredictable on simultaneous multiple failures
All-to-All Heartbeating
[Figure: each process pi sends "pi, Heartbeat Seq. l++" to every other process pj]
• Equal load per member
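As a concrete illustration of the heartbeating variants above, here is a minimal sketch of the receive-side bookkeeping each member would run under all-to-all heartbeating; the class name, timeout value, and message shape are assumptions for illustration, not from the lecture.

```python
import time

# Sketch of the receive side of all-to-all heartbeating.
# The timeout value is illustrative; real deployments tune it.
HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat before marking a member failed

class HeartbeatTable:
    def __init__(self, members):
        now = time.time()
        # For each known member: (last heartbeat sequence number, last time heard from).
        self.last_seen = {m: (0, now) for m in members}

    def on_heartbeat(self, sender, seq):
        """Record a periodic heartbeat (sender, seq) received from some member."""
        last_seq, _ = self.last_seen.get(sender, (0, 0.0))
        if seq > last_seq:  # ignore stale or duplicate heartbeats
            self.last_seen[sender] = (seq, time.time())

    def failed_members(self):
        """Members not heard from within the timeout are marked as failed."""
        now = time.time()
        return [m for m, (_, t) in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]
```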
Next
• How do we increase the robustness of all-to-all heartbeating?
Gossip-style Heartbeating
[Figure: pi periodically gossips an array of heartbeat seq. numbers for a member subset]
• Good accuracy properties
Gossip-Style Failure Detection
[Figure: each node maintains a membership list of entries (Address, Heartbeat Counter, Time (local)); example lists at nodes 1–4 are shown]
Protocol:
• Nodes periodically gossip their membership list
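A minimal sketch of the membership-list merge this protocol relies on, assuming each entry is (heartbeat counter, local time the counter last increased); the helper names and the T_FAIL value are illustrative.

```python
import time

T_FAIL = 30.0  # seconds without a heartbeat-counter increase before suspecting a member

def merge(local, received):
    """Merge a gossiped membership list into the local one.

    Both lists map address -> (heartbeat counter, local time). For each member,
    keep whichever entry has the higher heartbeat counter, stamped with *local* time.
    """
    now = time.time()
    for addr, (hb, _) in received.items():
        if hb > local.get(addr, (-1, 0.0))[0]:
            local[addr] = (hb, now)
    return local

def suspected_failed(local):
    """Members whose heartbeat counter has not increased for T_FAIL seconds."""
    now = time.time()
    return [addr for addr, (_, t) in local.items() if now - t > T_FAIL]
```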
Multi-level Gossiping
[Figure: hierarchical topology, with a core router connecting subnets of N/2 nodes each]
• Network topology is hierarchical
• Random gossip target selection => core routers face O(N) load (why?)
• Fix: in subnet i, which contains ni nodes, pick gossip target in your subnet with probability (1 - 1/ni)
• Router load = O(1)
• Dissemination time = O(log(N))
• What about latency for multi-level topologies?
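A sketch of the target-selection fix above: stay inside your own subnet with probability (1 - 1/ni), otherwise gossip across the router. The data layout (subnet id -> node list) and the fallback for a lone subnet are assumptions for illustration.

```python
import random

def pick_gossip_target(self_id, my_subnet, subnets):
    """Pick a gossip target so that cross-subnet (router) traffic stays O(1).

    `subnets` maps subnet id -> list of node ids; `my_subnet` is this node's subnet.
    With probability 1 - 1/ni the target is chosen inside the local subnet.
    """
    local = [n for n in subnets[my_subnet] if n != self_id]
    n_i = len(subnets[my_subnet])
    if local and random.random() < 1.0 - 1.0 / n_i:
        return random.choice(local)  # stay inside the subnet
    # Otherwise cross the router: pick a node from some other (non-empty) subnet.
    others = [s for s in subnets if s != my_subnet and subnets[s]]
    if not others:
        return random.choice(local) if local else None
    return random.choice(subnets[random.choice(others)])
```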
Analysis/Discussion
• What happens if the gossip period Tgossip is decreased?
• A single heartbeat takes O(log(N)) time to propagate. So N heartbeats take:
  – O(log(N)) time to propagate, if the bandwidth allowed per node is O(N)
  – O(N·log(N)) time to propagate, if the bandwidth allowed per node is only O(1)
  – What about O(k) bandwidth?
• What happens to Pmistake (false positive rate) as Tfail and Tcleanup are increased?
• Tradeoff: false positive rate vs. detection time vs. bandwidth
Next
• So, is this the best we can do? What is the best we can do?
Failure Detector Properties …
• Completeness
• Accuracy
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
…Are Application-Defined Requirements
• Completeness: guarantee always
• Accuracy: probability PM(T) of a mistaken detection
• Speed: T time units
  – Time to first detection of a failure
• Scale
  – Equal Load on each member
  – Network Message Load (N*L: compare this across protocols)
All-to-All Heartbeating
[Figure: every T units, each pi sends "pi, Heartbeat Seq. l++" to all other processes]
• Load per member: L = N/T
Gossip-style Heartbeating
[Figure: every tg units (= gossip period), pi sends an O(N)-size gossip message with an array of heartbeat seq. numbers for a member subset]
• Detection time: T = log(N) * tg
• Load per member: L = N/tg = N*log(N)/T
What’s the Best/Optimal we can do?
• Optimal worst-case load: L* = [log(PM(T)) / log(pml)] · (1/T)
  (pml = probability of message loss)
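To make the formula concrete, a small illustrative calculation; the values of PM(T), pml, and T below are made up purely to show the arithmetic.

```python
import math

# Hypothetical requirements: false-detection probability PM(T) = 1e-6 within
# T = 10 s, under a per-message loss probability pml = 0.05.
PM_T, p_ml, T = 1e-6, 0.05, 10.0

L_star = math.log(PM_T) / math.log(p_ml) * (1.0 / T)
print(f"optimal load L* ≈ {L_star:.2f} messages/s per member")
# ≈ 0.46 msgs/s per member -- and note that L* does not depend on the group size N
```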
Heartbeating
• Optimal L is independent of N (!)
• All-to-all and gossip-based heartbeating are sub-optimal:
  – L = O(N/T)
  – They try to achieve simultaneous detection at all processes
  – They fail to distinguish the Failure Detection and Dissemination components
Key:
Separate the two components
Use a non-heartbeat-based Failure Detection component
Next
• Is there a better failure detector?
SWIM Failure Detector Protocol
[Figure: within each protocol period of T' time units, pi pings a random pj and waits for an ack; if the direct ping or ack is lost (X), pi sends ping-req to K randomly selected processes, which ping pj and relay its ack back to pi]
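A sketch of one SWIM protocol period, assuming transport helpers send_ping(target) and send_ping_req(helper, target) that return True iff an ack comes back within the period; those names, and K = 3, are illustrative assumptions.

```python
import random

K = 3  # number of members asked to probe indirectly

def swim_probe_once(self_id, members, send_ping, send_ping_req):
    """Run one protocol period: direct ping to a random member, then indirect
    ping-req via K other random members, and return the verdict."""
    candidates = [m for m in members if m != self_id]
    if not candidates:
        return None
    target = random.choice(candidates)

    # Direct probe: ping the randomly chosen target and wait for its ack.
    if send_ping(target):
        return (target, "alive")

    # Indirect probe: ask K random members to ping the target and relay the ack.
    pool = [m for m in candidates if m != target]
    helpers = random.sample(pool, min(K, len(pool)))
    if any(send_ping_req(h, target) for h in helpers):
        return (target, "alive")

    # No direct or indirect ack within the protocol period.
    return (target, "suspected")
```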
SWIM versus Heartbeating
[Plot: First Detection Time vs. process load. SWIM achieves constant load and constant detection time; heartbeating pays either O(N) load or O(N) detection time]
• L/L* = 28 and E[L]/L* = 8, for up to 15% loss rates
Detection Time
• Prob. of being pinged in T' = 1 - (1 - 1/N)^(N-1) ≈ 1 - e^(-1)
• E[T] = T' · e/(e - 1)
• Completeness: any alive member detects failure
  – Eventually
  – By using a trick: within worst case O(N) protocol periods
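Since e/(e-1) ≈ 1.58, the expected first-detection time works out to roughly 1.58 protocol periods, independent of N; a one-line numeric check:

```python
import math

# E[T] = (e / (e - 1)) * T'  ->  about 1.58 protocol periods, independent of N
print(f"E[T] ≈ {math.e / (math.e - 1):.2f} * T'")
```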
Next
• How do failure detectors fit into the big picture of a group membership protocol?
• What are the missing blocks?
Group Membership Protocol
[Figure: process group over an unreliable communication network; crash-stop failures only]
I. pj crashes
II. Failure Detector: some process pi finds out quickly that pj crashed
III. Dissemination
Dissemination Options
• Multicast (Hardware / IP)
– unreliable
– multiple simultaneous multicasts
• Point-to-point (TCP / UDP)
– expensive
• Zero extra messages: piggyback on Failure Detector messages
  – Infection-style Dissemination
Infection-style Dissemination
[Figure: the same ping / ping-req / ack exchanges as the SWIM failure detector, with a protocol period of T time units; membership information is piggybacked on these messages]
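A sketch of what the piggybacking might look like: recently learned membership updates ride along on ping traffic, so dissemination costs zero extra messages. The message layout, the update tuple (member, status, incarnation), and PIGGYBACK_LIMIT are illustrative assumptions.

```python
PIGGYBACK_LIMIT = 6  # cap on how many updates ride on any one message

def make_ping(sender, target, recent_updates):
    """Build a ping message that carries recent membership updates for free."""
    return {
        "type": "ping",
        "from": sender,
        "to": target,
        # Attach a bounded number of recent (member, status, incarnation) updates.
        "piggyback": recent_updates[:PIGGYBACK_LIMIT],
    }

def absorb_piggyback(msg, membership):
    """Merge piggybacked updates into the local membership table.

    Only the 'higher incarnation wins' rule is applied here; the full override
    rules appear with the suspicion mechanism below.
    """
    for member, status, inc in msg["piggyback"]:
        current = membership.get(member)
        if current is None or inc > current[1]:
            membership[member] = (status, inc)
    return membership
```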
Suspicion Mechanism
• False detections, due to
– Perturbed processes
– Packet losses, e.g., from congestion
• Indirect pinging may not solve the problem
• Key: suspect a process before declaring it as failed in the group
Suspicion Mechanism
[Figure: pi's state machine for the pj view element.
  Alive → Suspected: on FD:: pi ping failed, or Dissmn::(Suspect pj)
  Suspected → Alive: on FD:: pi ping success, or Dissmn::(Alive pj)
  Suspected → Failed: on timeout, or Dissmn::(Failed pj)
  Each transition also triggers dissemination: Dissmn (Suspect pj), Dissmn (Alive pj), Dissmn (Failed pj)]
Suspicion Mechanism
• Distinguish multiple suspicions of a process
– Per-process incarnation number
– Inc # for pi can be incremented only by pi
• e.g., when it receives a (Suspect, pi) message
– Somewhat similar to DSDV
• Higher inc# notifications over-ride lower inc#’s
• Within an inc#: (Suspect, inc #) > (Alive, inc #)
• (Failed, inc #) overrides everything else
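The override rules above can be captured in a few lines; this sketch assumes notifications of the form (status, incarnation number) and is only an illustration.

```python
from enum import Enum

class Status(Enum):
    ALIVE = 0
    SUSPECT = 1
    FAILED = 2

def overrides(new, old):
    """Return True if notification `new` = (status, inc#) should replace `old`."""
    new_status, new_inc = new
    old_status, old_inc = old
    # (Failed, inc #) overrides everything else.
    if new_status is Status.FAILED:
        return True
    if old_status is Status.FAILED:
        return False
    # Higher inc# notifications override lower inc#'s.
    if new_inc != old_inc:
        return new_inc > old_inc
    # Within an inc#: (Suspect, inc #) > (Alive, inc #).
    return new_status is Status.SUSPECT and old_status is Status.ALIVE

# Example: a fresher Alive (inc# 5) overrides an older Suspect (inc# 4),
# but an Alive cannot override a Suspect with the same inc#.
assert overrides((Status.ALIVE, 5), (Status.SUSPECT, 4))
assert not overrides((Status.ALIVE, 4), (Status.SUSPECT, 4))
```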
Wrap Up
• Failures are the norm, not the exception, in datacenters
• Every distributed system uses a failure detector
• Many distributed systems use a membership service