
CSE-813 (Distributed & Cloud Computing)

Even-’22

Dr. Atiqur Rahman (ড. আতিকুর রহমান)
Ph.D. (CQUPT, China), M.S. Engg. (CU), B.Sc. (CU)
Associate Professor
Department of Computer Science and Engineering
University of Chittagong

Lecture 4: Failure Detection and Membership
A Challenge
• You’ve been put in charge of a datacenter, and your
manager has told you, “Oh no! We don’t have any failures
in our datacenter!”

• Do you believe him/her?

• What would be your first responsibility?


• Build a failure detector
• What are some things that could go wrong if you didn’t do
this?
Failures are the Norm
… not the exception, in datacenters.

Say the rate of failure of one machine (OS/disk/motherboard/network, etc.) is once every 10 years (120 months) on average.

When you have 120 servers in the DC, the mean time to failure (MTTF) of the next machine is 1 month.

When you have 12,000 servers in the DC, the MTTF is about once every 7.2 hours! (A quick numeric check follows below.)

Soft crashes and failures are even more frequent!
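A quick numeric check of the figures above (a minimal sketch; the 120-month per-machine MTTF is the assumption stated on this slide):

```python
# Back-of-the-envelope check: with a per-machine MTTF of 120 months,
# the expected time until the *next* failure in the DC scales as 1/N.
MACHINE_MTTF_MONTHS = 120  # one failure per machine every ~10 years

for n_servers in (120, 12_000):
    dc_mttf_months = MACHINE_MTTF_MONTHS / n_servers
    print(f"{n_servers:>6} servers -> a failure roughly every "
          f"{dc_mttf_months:g} months (~{dc_mttf_months * 30 * 24:.1f} hours)")
```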


To build a failure detector
• You have a few options:
  1. Hire 1000 people, each to monitor one machine in the datacenter and report to you when it fails.
  2. Write a (distributed) failure detector program that automatically detects failures and reports to your workstation.
Target Settings
• Process ‘group’-based systems
– Clouds/Datacenters
– Replicated servers
– Distributed databases

• Crash-stop/Fail-stop process failures


Group Membership Service
[Diagram] An application process pi (e.g., gossip, overlays, DHTs) issues queries to a membership protocol, which maintains a membership list that is updated on joins, leaves, and failures of members, over unreliable communication.
Two sub-protocols
[Diagram] At each application process pi, the group membership list is maintained by two sub-protocols, Dissemination and a Failure Detector, both running over unreliable communication.
• Membership list options:
  – Complete list all the time (strongly consistent): virtual synchrony
  – Almost-complete list (weakly consistent): gossip-style, SWIM, … (focus of this series of lectures)
  – Or partial-random list (other systems): SCAMP, T-MAN, Cyclon, …
Large Group: Scalability Is a Goal
[Diagram] A process group of 1000’s of processes (“members”), including us (pi), communicating over an unreliable communication network.
Group Membership Protocol
[Diagram] Three pieces, over an unreliable communication network (crash-stop failures only):
  I.   pj crashes
  II.  Failure Detector: some process pi finds out quickly
  III. Dissemination: the information spreads to the rest of the group
Next
• How do you design a group membership
protocol?

I. pj crashes
• Nothing we can do about it!
• A frequent occurrence
• Common case rather than exception
• Frequency goes up linearly with size of
datacenter

II. Distributed Failure Detectors:
Desirable Properties
• Completeness = each failure is detected
• Accuracy = there is no mistaken detection
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
Distributed Failure Detectors: Properties
• Completeness and Accuracy: impossible to guarantee together over lossy networks [Chandra and Toueg] (if both were possible, we could solve consensus!)
• Speed
  – Time to first detection of a failure
• Scale
  – Equal Load on each member
  – Network Message Load
What Real Failure Detectors Prefer
• Completeness: guaranteed
• Accuracy: partial/probabilistic guarantee
• Speed
  – Time to first detection of a failure (time until some process detects the failure)
• Scale
  – Equal Load on each member
  – Network Message Load (no bottlenecks/single point of failure)
Failure Detector Properties
• Completeness: in spite of arbitrary simultaneous process failures
• Accuracy
• Speed
  – Time to first detection of a failure
• Scale
  – Equal Load on each member
  – Network Message Load
Centralized Heartbeating
[Diagram] Every process pi sends (pi, heartbeat seq++) periodically to a single central process pj.
• Heartbeats are sent periodically
• If a heartbeat is not received from pi within the timeout, mark pi as failed
• Downside: the central process pj is a hotspot
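A minimal sketch of the monitor-side timeout check, assuming a simple `last_heard` map and a `TIMEOUT` constant (illustrative names, not from the slides):

```python
import time

TIMEOUT = 5.0      # illustrative: seconds without a heartbeat before declaring failure
last_heard = {}    # process id -> local time of the last heartbeat received

def on_heartbeat(sender_id):
    """Called at the central monitor when (sender_id, heartbeat seq) arrives."""
    last_heard[sender_id] = time.monotonic()

def detect_failures():
    """Run periodically; return members whose heartbeats have timed out."""
    now = time.monotonic()
    return [pid for pid, t in last_heard.items() if now - t > TIMEOUT]
```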
Ring Heartbeating
[Diagram] Processes are arranged in a ring; each pi sends (pi, heartbeat seq++) to its neighbor(s) on the ring.
• Downside: unpredictable under simultaneous multiple failures
All-to-All Heartbeating
[Diagram] Every process pi sends (pi, heartbeat seq++) to all other processes pj.
• Advantage: equal load per member
Next
• How do we increase the robustness of all-to-all
heartbeating?

Gossip-style Heartbeating
[Diagram] Each process pi maintains an array of heartbeat sequence numbers and periodically sends it to a subset of members.
• Good accuracy properties
Gossip-Style Failure Detection
[Diagram] Example membership lists at several nodes; each entry records (address, heartbeat counter, local time last updated), e.g., node 2’s list at local time 70.
Protocol:
• Nodes periodically gossip their membership list: pick random nodes and send them the list
• On receipt, it is merged with the local membership list (clocks are asynchronous, so each node timestamps entries with its own local time)
• When an entry times out, the member is marked as failed
Gossip-Style Failure Detection
• If the heartbeat counter has not increased for more than Tfail seconds, the member is considered failed
• And after Tcleanup seconds, the entry is deleted from the membership list
• Why two different timeouts?
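A minimal sketch of the gossip merge plus the two timeouts, assuming Tcleanup is counted from the moment a member is marked failed (the slides do not pin this down) and using illustrative names:

```python
import time

T_FAIL = 24.0      # no heartbeat increase for T_FAIL seconds -> consider failed
T_CLEANUP = 24.0   # assumption: counted after the member is marked failed

membership = {}    # member id -> (heartbeat counter, local time of last increase)
failed = {}        # member id -> local time it was marked failed

def merge(received):
    """Merge a gossiped list {member id -> heartbeat counter} into ours."""
    now = time.monotonic()
    for member, hb in received.items():
        if member in failed:
            continue                        # don't resurrect members we declared failed
        if member not in membership or hb > membership[member][0]:
            membership[member] = (hb, now)  # newer heartbeat: note our local time

def expire():
    """Run periodically: apply T_FAIL, then T_CLEANUP."""
    now = time.monotonic()
    for member, (hb, t) in list(membership.items()):
        if now - t > T_FAIL:
            failed[member] = now            # considered failed; stop gossiping it
            del membership[member]
    for member, t in list(failed.items()):
        if now - t > T_CLEANUP:
            del failed[member]              # finally forget the entry
```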
Gossip-Style Failure Detection
• What if an entry pointing to a failed node is deleted right after Tfail (= 24) seconds?
[Diagram] Example: node 2 (current local time 75) has deleted the entry, but another node’s gossip still carries the old entry with its last heartbeat counter, so the merge would re-add the deleted member as if it were new.
• This is why the entry is only deleted after the additional Tcleanup period.
Multi-level Gossiping
• Network topology is hierarchical, e.g., two subnets of N/2 nodes each behind a router
• Random gossip target selection => core routers face O(N) load (why?)
• Fix: in subnet i, which contains ni nodes, pick a gossip target in your own subnet with probability (1 - 1/ni) (see the sketch below)
• Router load = O(1)
• Dissemination time = O(log(N))
• What about latency for multi-level topologies?
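A minimal sketch of the biased target selection above (subnet lists and names are illustrative):

```python
import random

def pick_gossip_target(my_subnet, other_subnet_nodes, me):
    """In a subnet of n_i nodes, gossip inside the subnet with probability
    1 - 1/n_i and across the core router with probability 1/n_i, so the
    router sees O(1) gossip traffic instead of O(N)."""
    n_i = len(my_subnet)
    if random.random() < 1.0 - 1.0 / n_i:
        candidates = [p for p in my_subnet if p != me]
    else:
        candidates = other_subnet_nodes
    return random.choice(candidates)

# Example: two subnets of 4 nodes each
print(pick_gossip_target(["a1", "a2", "a3", "a4"],
                         ["b1", "b2", "b3", "b4"], me="a1"))
```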
Analysis/Discussion
• What happens if the gossip period Tgossip is decreased?
• A single heartbeat takes O(log(N)) time to propagate. So N heartbeats take:
  – O(log(N)) time to propagate, if the bandwidth allowed per node is O(N)
  – O(N·log(N)) time to propagate, if the bandwidth allowed per node is only O(1)
  – What about O(k) bandwidth?
• What happens to Pmistake (false positive rate) as Tfail and Tcleanup are increased?
• Tradeoff: false positive rate vs. detection time vs. bandwidth
Next
• So, is this the best we can do? What is the best
we can do?

Failure Detector Properties …
• Completeness
• Accuracy
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
…Are Application-defined Requirements
• Completeness: guarantee always
• Accuracy: probability of mistake PM(T)
• Speed
  – Time to first detection of a failure: T time units
• Scale
  – Equal Load L on each member
  – Network Message Load: N*L (compare this across protocols)
All-to-All Heartbeating
• Every member sends (pi, heartbeat seq++) to all others every T time units
• Load per member: L = N/T
Gossip-style Heartbeating
• Every tg units (= gossip period), each member sends an O(N)-size gossip message to a member subset
• Detection time: T = log(N) * tg
• Load per member: L = N/tg = N*log(N)/T (compared numerically below)
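A quick numeric comparison of the two per-member loads, L = N/T versus L = N*log(N)/T, for illustrative N and T:

```python
import math

N = 10_000             # group size
T = 10.0               # required detection time (seconds)
t_g = T / math.log(N)  # gossip period chosen so that T = log(N) * t_g

load_all_to_all = N / T    # heartbeats per member per second
load_gossip = N / t_g      # = N * log(N) / T
print(f"all-to-all: {load_all_to_all:.0f} msgs/s per member, "
      f"gossip: {load_gossip:.0f} msgs/s per member")
```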
What’s the Best/Optimal we can do?
• Worst-case load L* per member in the group (messages per second)
  – as a function of T, PM(T), N
  – independent message loss probability pml

• L* = log(PM(T)) / log(pml) * 1/T
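A minimal sketch of evaluating this bound, with illustrative parameter values:

```python
import math

def optimal_load(pm_t, p_ml, T):
    """Worst-case load per member: L* = log(PM(T)) / log(p_ml) * 1/T.
    Note that L* does not depend on the group size N."""
    return math.log(pm_t) / math.log(p_ml) * (1.0 / T)

# Illustrative values: 0.1% false-positive probability, 5% message loss, T = 10 s
print(optimal_load(pm_t=1e-3, p_ml=0.05, T=10.0))   # ~0.23 messages/s per member
```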
Heartbeating
• Optimal L is independent of N (!)
• All-to-all and gossip-based heartbeating are sub-optimal:
  – L = O(N/T)
  – They try to achieve simultaneous detection at all processes
  – They fail to distinguish the Failure Detection and Dissemination components

Key:
• Separate the two components
• Use a non-heartbeat-based Failure Detection component
Next
• Is there a better failure detector?

SWIM Failure Detector Protocol
[Diagram] In each protocol period of T’ time units, pi:
• picks a random member pj and sends it a ping, expecting an ack
• if no ack arrives, sends ping-req(pj) to K randomly selected processes, each of which pings pj and relays pj’s ack back to pi
• if no ack has arrived by the end of the period, marks pj as failed (or, with the suspicion mechanism described later, suspects pj)
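A minimal sketch of one SWIM protocol period, assuming hypothetical `send_ping` and `send_ping_req` helpers that return True when an ack arrives within their sub-timeouts:

```python
import random

K = 3   # number of ping-req helpers when the direct ping gets no ack

def protocol_period(members, send_ping, send_ping_req):
    """One SWIM protocol period (length T') at pi.
    `members` is pi's current membership list (excluding pi itself);
    send_ping(target) / send_ping_req(helper, target) are assumed to
    return True iff an ack comes back within their sub-timeouts."""
    target = random.choice(members)
    if send_ping(target):
        return None                                   # direct ack: target is alive
    helpers = random.sample([m for m in members if m != target],
                            min(K, len(members) - 1))
    if any(send_ping_req(h, target) for h in helpers):
        return None                                   # an indirect ack arrived: alive
    return target                                     # no ack this period: mark/suspect target
```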
SWIM versus Heartbeating
[Plot] For a fixed false positive rate and message loss rate:
• Heartbeating: either the first-detection time is O(N) (for constant process load) or the process load is O(N) (for constant detection time)
• SWIM: constant first-detection time and constant process load
SWIM Failure Detector
• First Detection Time
  – Expected e/(e-1) (≈ 1.58) protocol periods
  – Constant (independent of group size)
• Process Load
  – Constant per period
  – < 8 L* for up to 15% message loss
• False Positive Rate
  – Tunable (via K)
  – Falls exponentially as load is scaled
• Completeness
  – Deterministic, time-bounded
  – Within O(log(N)) periods w.h.p.
Accuracy, Load
• PM(T) is exponential in -K; it also depends on pml (and pf) (see the SWIM paper)
• L/L* ≤ 28 and E[L]/L* ≤ 8, for up to 15% message loss rates
Detection Time
• Prob. of a given member being pinged in one protocol period T’: 1 - (1 - 1/N)^(N-1) ≈ 1 - e^(-1)
• E[T] = T’ * e/(e-1)
• Completeness: any alive member detects the failure
  – Eventually
  – By using a trick: within worst case O(N) protocol periods
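A quick numeric check of the two expressions above, for an illustrative N:

```python
import math

N = 1000
# Probability that a given member is pinged by someone in one protocol period:
prob_pinged = 1 - (1 - 1 / N) ** (N - 1)
print(prob_pinged, 1 - 1 / math.e)     # both ~0.632

T_prime = 1.0                          # protocol period length (arbitrary units)
expected_detection = T_prime * math.e / (math.e - 1)
print(expected_detection)              # ~1.58 protocol periods
```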
Next
• How do failure detectors fit into the big picture
of a group membership protocol?
• What are the missing blocks?

Group Membership Protocol
[Diagram, repeated from earlier] Three pieces, over an unreliable communication network (crash-stop failures only):
  I.   pj crashes
  II.  Failure Detector: some process pi finds out quickly
  III. Dissemination: the information spreads to the rest of the group
Dissemination Options
• Multicast (hardware / IP)
  – Unreliable
  – Multiple simultaneous multicasts
• Point-to-point (TCP / UDP)
  – Expensive
• Zero extra messages: piggyback on Failure Detector messages
  – Infection-style dissemination
Infection-style Dissemination
[Diagram] The same SWIM ping / ping-req / ack exchange as before (protocol period = T time units), but every message also carries piggybacked membership information (recent joins, leaves, and failures).
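A minimal sketch of piggybacking membership updates on the ping traffic; the message format and the `recent_updates` buffer are illustrative, not from the SWIM spec:

```python
import random

MAX_PIGGYBACK = 6     # how many updates fit in one ping/ack (illustrative)
recent_updates = []   # recent membership events, e.g. ("failed", "pj", inc_no)

def make_ping(target):
    """Build a ping message that carries piggybacked membership updates."""
    piggyback = random.sample(recent_updates,
                              min(MAX_PIGGYBACK, len(recent_updates)))
    return {"type": "ping", "target": target, "updates": piggyback}

def on_message(msg):
    """On receiving any ping/ack, absorb its piggybacked updates so that
    we spread them onward with our own future messages (infection-style)."""
    for update in msg["updates"]:
        if update not in recent_updates:
            recent_updates.append(update)
```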
Suspicion Mechanism
• False detections, due to
– Perturbed processes
– Packet losses, e.g., from congestion
• Indirect pinging may not solve the problem
• Key: suspect a process before declaring it as
failed in the group

Suspicion Mechanism
[Diagram] State machine at pi for the view element (membership-list entry) for pj:
• Alive -> Suspected: on FD:: ping to pj fails, or on Dissmn::(Suspect pj)
• Suspected -> Alive: on FD:: successful ping to pj, or on Dissmn::(Alive pj)
• Suspected -> Failed: on timeout
• Alive or Suspected -> Failed: on Dissmn::(Failed pj)
Suspicion Mechanism
• Distinguish multiple suspicions of a process
  – Per-process incarnation number
  – Inc # for pi can be incremented only by pi
    • e.g., when it receives a (Suspect, pi) message
  – Somewhat similar to DSDV
• Higher inc # notifications override lower inc #’s
• Within an inc #: (Suspect, inc #) > (Alive, inc #)
• (Failed, inc #) overrides everything else
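A minimal sketch of these override rules, assuming each notification is represented as a (state, incarnation #) pair (an illustrative encoding):

```python
# Rank of states within the same incarnation number:
# Failed overrides Suspect, which overrides Alive.
STATE_RANK = {"alive": 0, "suspect": 1, "failed": 2}

def overrides(new, current):
    """Does notification `new` = (state, inc#) override `current`?"""
    new_state, new_inc = new
    cur_state, cur_inc = current
    if new_state == "failed":
        return True                      # (Failed, inc#) overrides everything else
    if new_inc != cur_inc:
        return new_inc > cur_inc         # higher incarnation number wins
    return STATE_RANK[new_state] > STATE_RANK[cur_state]

# A fresh Alive from pj itself (higher inc#) cancels an older Suspect...
assert overrides(("alive", 5), ("suspect", 4))
# ...but within the same inc#, Suspect beats Alive.
assert not overrides(("alive", 4), ("suspect", 4))
```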
Wrap Up
• Failures are the norm, not the exception, in datacenters
• Every distributed system uses a failure detector
• Many distributed systems use a membership service

• Ring failure detection underlies
  – IBM SP2 and many other similar clusters/machines

• Gossip-style failure detection underlies
  – Amazon EC2/S3 (rumored!)
