
UNIT IV

RELIABILITY AND CLOCK SYNCHRONIZATION


Reliability and Clock Synchronization: Introduction to Reliability Evaluation Techniques – Reliability Models for Hardware Redundancy – Permanent Faults Only – Transient Faults – Introduction to Clock Synchronization – A Non-Fault-Tolerant Synchronization Algorithm – Fault-Tolerant Synchronization in Hardware – Completely Connected Zero-Propagation-Time System – Sparse Interconnection Zero-Propagation-Time System – Fault-Tolerance Analysis with Signal Propagation Delays.

4.1 INTRODUCTION TO RELIABILITY EVALUATION TECHNIQUES:


 Computers used in life-critical applications must be so reliable that they cannot be validated by experiment alone.
 We must therefore use mathematical models of reliability.
 We construct a mathematical model of the real-time computer and solve it. By doing this, we add one possible source of error: the assumptions of the mathematical model.
 If these assumptions are not correct, neither will be the results of our model.
 We introduce reliability evaluation techniques below.

4.2 RELIABILITY MODELS FOR HARDWARE REDUNDANCY


 The most difficult problem in reliability modelling is keeping the complexity of the models sufficiently small.
 Unless the various parameters of the model are exponentially distributed, the result is an unacceptable complexity for all current solution techniques. Techniques to reduce the complexity of such models consist largely of state aggregation, in which multiple states are grouped together and treated as a single state, and decomposition, in which the overall model is broken down into submodels, each of which is solved separately.
 The reliability of components is usually specified through a probability distribution function of the lifetime of the component.
 For example, if failures occur as a Poisson process with rate λ, the lifetime distribution is given by F_L(t) = 1 − exp(−λt).
 If failures occur as a Weibull process with shape parameter α and scale parameter λ, the lifetime distribution is F_L(t) = 1 − exp(−[λt]^α). We will denote by f_L(t) the associated density function (we will assume here that F_L(t) is differentiable).

The hazard rate h(t) of a component of age t is defined as the rate of failure at time t, given that the component has not failed up to time t. We can use Bayes's law to express the hazard rate as a function of the lifetime distribution function:

h(t) dt = Prob{system fails in [t, t + dt] | system has not failed up to t}

        = Prob{system fails in [t, t + dt] ∩ system has not failed up to t} / Prob{system has not failed up to t}

        = f_L(t) dt / (1 − F_L(t))

If the failure process is Poisson with rate λ (i.e., if the lifetime is exponentially distributed with mean 1/λ), then the hazard rate is

h(t) = λ exp(−λt) / exp(−λt) = λ

The hazard rate is thus independent of the age of the component if the failure process is Poisson.
If the failure process is Weibull with shape and scale parameters α and λ, respectively, the hazard rate is given by

h(t) = αλ(λt)^(α−1)

If 0 < α < 1, then h(t) decreases with time. This means that the failure rate of a component drops as it ages. Components with decreasing hazard rates are said to have the used-better-than-new property.

If α = 1, the failure process is Poisson. If α > 1, h(t) increases with time; that is, the failure rate increases with age, and such components have the new-better-than-used property.

Many real components show a mixture of these behaviours: the hazard rate initially drops as weak units fail early, then becomes approximately constant, before aging effects set in and cause the hazard rate to rise with age.

Figure: numerical illustrations of the lifetime distribution and hazard rate h(t) = αλ(λt)^(α−1) associated with the Weibull distribution.
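As a quick numerical illustration of these formulas (this sketch is not part of the original notes, and the parameter values are arbitrary), the following Python snippet evaluates the Weibull lifetime distribution and hazard rate for α below, at, and above 1:

import math

def weibull_cdf(t, lam, alpha):
    # Lifetime distribution F_L(t) = 1 - exp(-(lam*t)**alpha)
    return 1.0 - math.exp(-((lam * t) ** alpha))

def weibull_hazard(t, lam, alpha):
    # Hazard rate h(t) = alpha*lam*(lam*t)**(alpha-1)
    return alpha * lam * (lam * t) ** (alpha - 1)

lam = 1e-4  # scale parameter, illustrative value (failures per hour)
for alpha in (0.5, 1.0, 2.0):  # decreasing, constant, increasing hazard
    print(f"alpha={alpha}: h(100 h)={weibull_hazard(100.0, lam, alpha):.3e}, "
          f"h(10000 h)={weibull_hazard(10000.0, lam, alpha):.3e}, "
          f"F_L(10000 h)={weibull_cdf(10000.0, lam, alpha):.4f}")

Running this shows the hazard rate falling with age for α = 0.5, staying constant for α = 1, and rising for α = 2, exactly as described above.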

4.3 PERMANENT FAULTS ONLY

NMR CLUSTERS
Consider an N-modular-redundant (NMR) cluster. For the moment, let us assume that the fault latency is zero (i.e., faults start generating errors immediately when they arrive) and that faulty processors are immediately identified and disconnected from the system. As a result, the system will always consist of good processors only, and will continue to function until it has fewer than two functional processors.

Combinatorial model

The system will fail only if there are fewer than two functional processors left in the system. Since there is no repair and all the failures are assumed to be permanent, the failure probability can be found by counting (hence the term combinatorial) all the various ways in which fewer than two processors are left, weighting each by its probability of occurrence.

The probability that an individual processor suffers failure some time in an interval of duration t is given by F_L(t) = 1 − exp(−λt).

The probability of system failure over this interval is given by

Prob{system failure in [0, t]} = Σ_{i=0}^{1} Prob{exactly i processors are functional at time t}

This is, of course, a Bernoulli process, and the probability that exactly i processors are functional is given by

C(N, i) [1 − F_L(t)]^i [F_L(t)]^(N−i)

We therefore have, after some minor algebra,

Prob{system failure in [0, t]} = N F_L^(N−1)(t) − (N−1) F_L^N(t)
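The closed form can be sanity-checked against the direct binomial count. Here is a minimal Python sketch (not from the notes; all values are illustrative):

import math

def p_fail_closed_form(N, F):
    # N*F^(N-1) - (N-1)*F^N, from the minor algebra above
    return N * F ** (N - 1) - (N - 1) * F ** N

def p_fail_binomial(N, F):
    # Direct count: probability that exactly 0 or 1 processors are functional
    return sum(math.comb(N, i) * (1 - F) ** i * F ** (N - i) for i in (0, 1))

lam, t, N = 1e-4, 1000.0, 5       # illustrative failure rate, horizon, cluster size
F = 1.0 - math.exp(-lam * t)      # F_L(t) for Poisson failures
print(p_fail_closed_form(N, F))   # both expressions print the same number
print(p_fail_binomial(N, F))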

MARKOV CHAIN MODEL

Markov chain models, while more complex than combinatorial models for such simple cases, are the solution method of choice when the systems are more complex. The system can be modelled as a Markov chain, where the states represent the number of functional processors. Since failed units are removed immediately from the system, and there is no repair, we have a "pure-death process". This chain is identical to one discussed earlier.

Figure: Markov chain for an NMR system.

There it is shown that the probability π_i(t) of the system being in state i at time t (given that it started in state N at time 0) is given by

π_i(t) = C(N, i) e^(−iλt) [1 − e^(−λt)]^(N−i)

The probability that the system has failed by time t is given by the probability that it is in either state 0 or 1, that is,

Prob{system failure} = Σ_{i=0}^{1} π_i(t)

In fact, there is no need, for our purposes, to distinguish between states 0 and 1. Indeed, we could have defined states 0 and 1 as a single failed state, and computed the probability of ever entering that state up to time t.
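To verify the pure-death chain numerically, the sketch below (again with illustrative values, not from the notes) integrates the chain's rate equations by forward Euler and compares the result with the closed-form π_i(t):

import math

def pi_closed(N, lam, t, i):
    # pi_i(t) = C(N,i) e^(-i*lam*t) (1 - e^(-lam*t))^(N-i)
    return math.comb(N, i) * math.exp(-i * lam * t) * (1.0 - math.exp(-lam * t)) ** (N - i)

def pi_euler(N, lam, t, steps=100_000):
    # Rate equations of the pure-death chain:
    # dpi_i/dt = -i*lam*pi_i + (i+1)*lam*pi_{i+1}
    pi = [0.0] * (N + 1)
    pi[N] = 1.0  # start with all N processors functional
    dt = t / steps
    for _ in range(steps):
        pi = [pi[i] + dt * (-i * lam * pi[i] +
                            ((i + 1) * lam * pi[i + 1] if i < N else 0.0))
              for i in range(N + 1)]
    return pi

N, lam, t = 5, 1e-4, 1000.0  # illustrative values
print([round(pi_closed(N, lam, t, i), 6) for i in range(N + 1)])
print([round(p, 6) for p in pi_euler(N, lam, t)])

The two printed lists agree, confirming that the closed form solves the chain.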

Voter reliability

There are two typical designs, one in which there is exactly one voter providing output for the cluster, and a second in which there are N voters, one per processor. We focus on the first design, and leave the second as an exercise.

Let F_V,N(t) be the lifetime distribution of the voter when it has to arbitrate among N inputs. This is a function of N, since the voter complexity increases with N. The system will fail whenever fewer than two processors are functioning or the voter fails. Assuming the two events are independent, the probability Φ_N(t) that the system fails in [0, t] is given by

Φ_N(t) = 1 − [1 − F_V,N(t)] [1 − N F_L^(N−1)(t) + (N−1) F_L^N(t)]

where the second bracket is the probability that at least two processors are still functional (one minus the cluster failure probability derived above).

This equation raises the possibility that, because the voter becomes less reliable as the cluster size increases, an increase in the cluster size can actually decrease the reliability of the cluster; the precise condition under which this happens can be derived after some algebra. To make this more concrete, consider the following example of Poisson failures.
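The concrete numbers of the example are not reproduced in these notes; the following sketch assumes, purely for illustration, that each processor fails at rate λ and that a voter arbitrating N inputs fails at rate Nν, so that voter reliability degrades with cluster size:

import math

def system_failure_prob(N, lam, nu, t):
    F_L = 1.0 - math.exp(-lam * t)       # processor lifetime CDF
    F_V = 1.0 - math.exp(-N * nu * t)    # voter lifetime CDF (assumed N-fold rate)
    cluster_fail = N * F_L ** (N - 1) - (N - 1) * F_L ** N
    return 1.0 - (1.0 - F_V) * (1.0 - cluster_fail)

lam, nu, t = 1e-4, 2e-6, 1000.0  # illustrative rates and mission time
for N in (3, 5, 7, 9, 11):
    print(N, round(system_failure_prob(N, lam, nu, t), 5))
# The failure probability first falls as N grows, then rises again once the
# increasingly complex voter dominates: adding processors can hurt reliability.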

4.4 INTRODUCTION TO TRANSIENT FAULTS

Let a and b be the permanent and transient failure rates, respectively, of each processor. Failures are assumed to occur as a Poisson process. Intermittent failures are ignored in this model. We assume that all the failures are independent of one another, and that faults manifest themselves immediately (i.e., the fault latency is zero). Processors that are taken offline due to a fault being detected are tested continuously to see if their fault is transient, and if it is, whether it has died away. If this is the case, the processors are inducted back into the cluster. Let the time between when a processor suffers a transient failure and when it is brought back online be exponentially distributed with mean 1/e. We will ignore the time it takes to reintegrate the processor into the system; this can be taken into account by assuming it to be part of the delay time. What is the probability pFAIL(t) that such a system will fail by time t, given that it was in perfect working order at time 0? Here we assume that system failure occurs when there are fewer than two operational processors.

Once again, we use a Markov chain. However, unlike in the previous case, where processors could only be in one of two states (permanently failed and good), in this model they can be in one of three states: permanently failed, currently offline due to a transient failure, and good. Since the total number of processors is fixed at N, we need two state variables to denote the state of the system. Let us denote the state by (s1, s2), where s1 and s2 denote, respectively, the number of functional processors and the number of processors currently undergoing transient failure. The number of processors that have failed permanently is N − s1 − s2. In the Markov chain for this model, to avoid clutter, the individual arcs are not labelled with the associated transition rates; rather, an inset provides the transition rates out of each state (and, by implication, into each state). While it may look complicated, generating this chain is quite simple. Let us consider the transitions out of state (i, j).

In this state, we have i functional processors and j processors that are currently suffering transient failure. The rest of the processors, numbering N − i − j, have suffered permanent failure. Of course, the system does not know whether a failed processor is suffering a transient or a permanent failure. It will keep trying to run tests on all the failed processors, and if a previously failed processor recovers, it will pass the test.

The i functional processors may suffer either permanent or transient failure. The permanent failure rate per processor is a, so the overall rate due to permanent failure out of state (i, j) (and into state (i−1, j)) is ia. Similarly, the overall rate due to transient failure out of state (i, j) (and into state (i−1, j+1)) is ib. Even processors that are currently suffering transient failures are not immune to permanent failures; this explains the transition from state (i, j) to (i, j−1) with a rate of ja. Transient faults die away in an exponentially distributed duration of mean 1/e, and so the rate out of state (i, j) to (i+1, j−1) is je. When only two processors are functional, the failure of either of them spells failure for the whole system, which explains the transition to the FAIL state. Notice that FAIL is an absorbing state: once the system is in this state, there is no way out. Because we are modelling the onset of failure, we wish to compute the probability that the system will ever enter the FAIL state over a given interval [0, t].

It only remains for us to write the differential equations associated with this process. This can be done by inspection of the Markov chain. Let π_{i,j}(t) denote the probability of being in state (i, j), with i + j ≤ N. If i < 2, j < 0, or i + j > N, define π_{i,j}(t) = 0 (states with fewer than two functional processors are absorbed into FAIL). The differential equations are

dπ_{i,j}(t)/dt = −[i(a + b) + j(a + e)] π_{i,j}(t) + (i+1)a π_{i+1,j}(t) + (i+1)b π_{i+1,j−1}(t) + (j+1)a π_{i,j+1}(t) + (j+1)e π_{i−1,j+1}(t)

dπ_FAIL(t)/dt = Σ_j 2(a + b) π_{2,j}(t)

where the initial condition reflects the fact that we start the system in state (N, 0), i.e., π_{N,0}(0) = 1. These equations can now be solved numerically.
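As a sketch of such a numerical solution (the notes give no concrete parameters, so the values below are assumed), one can flatten the states (i, j) into a vector, add an absorbing FAIL component, and hand the rate equations to an ODE solver:

from scipy.integrate import solve_ivp

N, a, b, e = 4, 1e-4, 1e-3, 1.0   # cluster size and rates (illustrative values)
states = [(i, j) for i in range(2, N + 1) for j in range(0, N - i + 1)]
index = {s: k for k, s in enumerate(states)}
FAIL = len(states)                # last vector component holds pi_FAIL

def rates(t, pi):
    d = [0.0] * (len(states) + 1)
    for (i, j), k in index.items():
        d[k] -= (i * (a + b) + j * (a + e)) * pi[k]   # total outflow
        if i > 2:
            d[index[(i - 1, j)]] += i * a * pi[k]         # permanent failure
            d[index[(i - 1, j + 1)]] += i * b * pi[k]     # transient failure
        else:
            d[FAIL] += i * (a + b) * pi[k]                # i == 2: system fails
        if j > 0:
            d[index[(i, j - 1)]] += j * a * pi[k]         # offline unit dies
            d[index[(i + 1, j - 1)]] += j * e * pi[k]     # transient dies away
    return d

pi0 = [0.0] * (len(states) + 1)
pi0[index[(N, 0)]] = 1.0                                  # start in state (N, 0)
sol = solve_ivp(rates, (0.0, 1000.0), pi0, rtol=1e-8)
print("pFAIL(1000) ~", sol.y[FAIL, -1])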

4.5 INTRODUCTION TO CLOCK SYNCHRONIZATION


Clock synchronization is vital to the correct operation of real-time systems. Such activities as voting and synchronized rollback assume that the clocks are synchronized fairly tightly. In this unit, we look at some algorithms for fault-tolerant synchronization. These ensure that the functional processors remain synchronized despite a few processor or link failures.

We discuss both hardware and software synchronization algorithms.

Hardware synchronization requires special hardware, which is not required by software synchronization; in return, it offers much tighter synchronization.

4.6 A NON-FAULT-TOLERANT SYNCHRONIZATION ALGORITHM

Consider the following simple procedure for synchronization. At regular intervals of T (as measured by itself), each clock sends out its timing signal (clock tick) to the other clocks. A clock compares its own timing signals with those it receives from the others and adjusts itself appropriately.

For the moment, assume that the signal propagation times are zero and consider a three-clock system. Suppose the timing signals are as shown in the figure,

where t_i is the real time when clock c_i sends its signal. The middle clock is chosen as the correct clock, and the other two try to align themselves with it. It is tempting to do this by having each clock correct itself as soon as it can, by moving clock c1 back by t2 − t1 at real time t2 and clock c3 forward by t3 − t2 at real time t3. However, this is not acceptable, since a process which was using clock c1 would see time moving backwards (see the figure). For example, suppose this process timestamped event X at real time t_X and event Y at real time t_Y. Y occurred after X, but due to the clock adjustment, its timestamp will make it appear as if it occurred before X. This illustrates why we should never turn a clock back in the process of synchronization. It is also a bad idea to introduce a jump in the clock time. Instead, clock c1 will slow itself down and clock c3 will speed itself up so that their next clock ticks align as closely as possible with the next clock tick of clock c2 (which, because it delivered its tick between those of clocks c1 and c3, is used as the reference or trigger with which the other clocks align). Clock c2 is not corrected in any way. In other words, we will attempt to deliver the next ticks of clocks c1, c2, and c3 at time R2, which is T c-seconds after t2.

Amortized clock adjustment

Suppose, for example, that there is a dedicated link between each pair of clocks. Then it is (theoretically, at any rate) possible to correct for the propagation times, and reduce the problem to one where the propagation times are zero.

This is not always possible: for example, the clocks could send their clock ticks as messages on a store-and-forward network, and the propagation time then depends on the path chosen and the congestion on the path.

Consider the case where t1, t2, t3 and the propagation times are such that c1 and c3 are both sure that c2 is the middle clock, and thus is the reference they must synchronize to. Suppose they each estimate the propagation time from c2 to be x. Let us see what happens with both c1 and c3.

c1 receives the signal from c2 at real time t2 + µ2,1. Since it estimates the propagation time to be x, it believes that c2 transmitted that signal at real time t2 + µ2,1 − x. It therefore tries to deliver its next clock tick T c-units after this time. However, this interval is measured by its own clock, and an interval of nominal duration T c-units may actually be anything in the range [(1 − ρ)T, (1 + ρ)T] r-units. Hence c1 can deliver its next clock tick in the r-interval

I1 = [(1 − ρ)T + t2 + µ2,1 − x, (1 + ρ)T + t2 + µ2,1 − x]

By similar reasoning, c3 delivers its next clock tick in the r-interval

I3 = [(1 − ρ)T + t2 + µ2,3 − x, (1 + ρ)T + t2 + µ2,3 − x]

Clock c2 delivers its next clock tick in the r-interval

I2 = [(1 − ρ)T + t2, (1 + ρ)T + t2]

What does this do to the clock skew at the next tick? In the worst case, if µ2,1 = µmin and clock c1 is running as fast as is legally allowed, the next c1 tick will occur at r-time (1 − ρ)T + t2 + µmin − x. If µ2,3 = µmax and c3 is running as slow as is legally allowed, the next c3 tick will occur at r-time (1 + ρ)T + t2 + µmax − x. The clock skew will then be

[(1 + ρ)T + t2 + µmax − x] − [(1 − ρ)T + t2 + µmin − x] = 2ρT + µmax − µmin
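To get a feel for the magnitudes (numbers chosen purely for illustration): with a drift rate ρ = 10⁻⁵, a resynchronization interval T = 1 s, and µmax − µmin = 10 µs, the worst-case skew is 2 × 10⁻⁵ × 1 s + 10 µs = 30 µs.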

An alternative to mutual synchronization is to use a master-slave structure. The slave clocks try to align themselves with the master clock. They do this by sending out a read_clock request to the master, which responds with a message containing its clock time when it received the request.

Suppose the round-trip real time between sending the read_clock request and receiving the answer is r. Let µmin be the minimum time taken to send a message between the master and the slave. Let t_{s→m} and t_{m→s} be the respective times taken for a given read_clock request to propagate to the master, and for the master's reply (e.g., "present time is T") to propagate to the slave. By definition, r = t_{s→m} + t_{m→s} ≥ 2µmin.
When the slave receives the "present time is T" message from the master, the master clock will read a time in the interval I = [T + µmin(1 − ρ), T + (r − µmin)(1 + ρ)]. The derivation of this is simple. The lower bound arises when t_{m→s} = µmin (the minimum possible) and the master clock runs as slow as it is allowed to. The upper bound is computed as follows: the maximum possible value of t_{m→s} is r − µmin, for a given value of r, and the maximum rate at which the master clock can run is 1 + ρ.

Suppose for a moment that the slave clock can measure the round-trip delay r with perfect accuracy. The duration of the interval I is then (r − µmin)(1 + ρ) − µmin(1 − ρ) = r(1 + ρ) − 2µmin. The best estimate of the time told by the master clock when its "present time is T" message is received by the slave clock is the midpoint of the interval I, that is,

T + r(1 + ρ)/2 − ρµmin

The error in making this estimate is therefore upper-bounded by half the duration of I,

r(1 + ρ)/2 − µmin

But the slave clock cannot measure the round-trip delay with perfect accuracy; all it has is the round-trip delay as measured by itself. A duration that the slave measures as r may actually be as long as r(1 + ρ) r-units. Thus, interval I may be as wide as the interval

I' = [T + µmin(1 − ρ), T + (r(1 + ρ) − µmin)(1 + ρ)]

and the estimate of the time becomes the midpoint of I',

T + r(1 + ρ)²/2 − ρµmin

The estimation error is thus upper-bounded by

r(1 + ρ)²/2 − µmin

If the messages between the master and slave clocks pass over a network that is shared by other traffic, r can vary a lot depending on the intensity of that traffic. The slave clock can try to limit the estimation error by discarding readings with large round-trip delays, since the error bound grows with r.
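A minimal sketch of the slave's computation, using the bounds derived above (the function and parameter names are ours, not from the notes; the numbers are illustrative):

def estimate_master_time(T, r, mu_min, rho):
    # T: value in the master's "present time is T" reply
    # r: round-trip delay as measured by the slave's own clock
    # Returns the midpoint estimate of I' and the associated error bound.
    estimate = T + r * (1 + rho) ** 2 / 2 - rho * mu_min
    error_bound = r * (1 + rho) ** 2 / 2 - mu_min
    return estimate, error_bound

est, err = estimate_master_time(T=1000.0, r=0.004, mu_min=0.001, rho=1e-5)
print(f"estimate = {est:.6f} s, error bound = {err * 1e3:.3f} ms")

With a 4 ms measured round trip and a 1 ms minimum one-way delay, the error bound comes out to about 1 ms, which is why discarding large-r readings tightens the estimate.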

4.7 FAULT-TOLERANT SYNCHRONIZATION IN HARDWARE

To synchronize in hardware, we can use phase-locked loops. These date back to the 1930s, and have been widely used in radio and other forms of communication. The basic structure of a phase-locked loop is shown in the figure. The objective is to align, as closely as possible, the output of the oscillator with an oscillatory input signal. The comparator puts out a signal that is proportional to the difference between the phase of the input and that of the oscillator. This is passed through a filter, and the resultant signal is used to modify the frequency of a voltage-controlled oscillator (VCO).

Let us carry out a simple analysis of the phase-locked loop. The output voltage of the comparator at any time t is proportional to the difference between the phase θ_i(t) of the input signal and the phase θ_o(t) of the VCO output:

u(t) = K [θ_i(t) − θ_o(t)]

where K is the comparator gain.

Figure: Structure of a phase-locked loop.

Figure: Frequency control of an ideal VCO.
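To see the tracking behaviour concretely, here is a toy discrete-time simulation of a first-order loop (an assumed simplified model, not the notes' circuit): the VCO frequency is nudged in proportion to the instantaneous phase error.

K = 0.1                          # loop gain (illustrative)
w_in, w0 = 1.00, 0.95            # input and VCO free-running frequencies
theta_in, theta_vco = 0.0, 0.5   # VCO starts out of phase
dt = 0.01
for _ in range(20_000):          # 200 time units ~ 20 loop time-constants
    err = theta_in - theta_vco            # comparator: proportional to phase error
    theta_in += w_in * dt
    theta_vco += (w0 + K * err) * dt      # filtered error steers the VCO frequency
print(f"residual phase error ~ {theta_in - theta_vco:.4f}")  # -> (w_in - w0)/K = 0.5

A first-order loop locks in frequency but keeps a constant phase offset; adding an integrator in the loop filter drives that offset to zero as well.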
4.8 COMPLETELY CONNECTED, ZERO-PROPAGATION-TIME SYSTEM

The purpose of our analysis so far has been to convince the reader that the phase-locked loop has the ability to track input signals and thus synchronize its output with its input. Thus, if we can suitably define a reference (or trigger) input as a function of the outputs of the clocks in the system, we can synchronize the clocks.

The figure shows the structure of each of the clocks. Every clock is connected by a dedicated line to every other clock (see the figure for an example of a four-clock system). We assume, to begin with, that signal propagation times are zero.

Figure: Structure of a phase-locked loop used in synchronization.

Each clock has a reference circuit, which accepts as input the clock ticks from the other clocks in the system as well as that of its own VCO. It generates a reference signal to which its VCO tries to align itself. The problem of designing a fault-tolerant synchronizer thus reduces to obtaining a reference signal that will permit the system to remain synchronized in the face of up to a given number of failures.

The obvious approach is to make the reference signal equal to the median of the incoming signals (keep in mind that we are assuming zero message-propagation times). Unfortunately, this will not work if there are two or more maliciously faulty clocks.

Figure: A four-clock system.
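A minimal sketch of the median reference signal (illustrative, not from the notes), showing why a single maliciously early tick is tolerated:

import statistics

def reference_tick(tick_times):
    # Median of the latest tick times from all clocks, own tick included
    return statistics.median(tick_times)

# Three good clocks plus one faulty clock reporting a wildly early tick:
print(reference_tick([10.02, 9.99, 10.01, 3.0]))  # ~10.0: one liar is outvoted
# With two or more maliciously faulty clocks, the median itself can be dragged
# arbitrarily far from the good clocks' ticks, which is why this simple scheme fails.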


SPARSE INTERCONNECTION, ZERO-PROPAGATION-TIME SYSTEM

Suppose that, instead of a completely connected structure, we have clocks organized into multiple clusters.

Each clock in a cluster is connected by a dedicated link to every other clock in that cluster.
