MN5554 Reliability Notes
MSc Programme
PART B
Reliability
AMEE
MN5554 Reliability Engineering
Prepared by Joe Au, Tao Zhang, EngTseng Lau and John Harris
Advanced Manufacturing & Enterprise Engineering (AMEE)
College of Engineering, Design and Physical Sciences
Preface
This study pack is concerned with Reliability, the second component of the module
MN5554 Quality Management and Reliability. It comprises four units and it
addresses the following topics:
1. Unit 1 - Decision Analysis,
2. Unit 2 - Measuring the Reliability of Items,
3. Unit 3 - Weibull Analysis of Lifetime Data, and
4. Unit 4 - Assessing the Reliability of Systems.
In Unit 1, Decision Analysis techniques are used to identify the reliability problem
areas (Pareto analysis) and to determine whether or not the situation is improving
after being addressed (Trend analysis using the moving-average and CUSUM).
Unit 2, Measuring the Reliability of Items, deals with the theory and concepts of
reliability, the practical aspects of how reliability is quantified and the characteristic
patterns of failure. Statistical techniques are used for describing reliability and its
converse, unreliability (also known as failure probability).
The study pack contains not only these units, but also additional notes and
Microsoft Excel spreadsheets originally designed by Dr Joe Au of the School of
Engineering and Design, Brunel University London. The notes are provided to
clarify or amplify points and concepts presented in the units. The spreadsheets
contain alternative methods of solution to augment those described in the units and
solutions to exercises in the units. These are ‘live’ documents such that you can
experiment with them by changing values in different cells - a significant advantage
for exploring various ‘what-if’ scenarios readily and efficiently.
Finally, do make a point of visiting Blackboard Learn. Under this module, you
will find short questions and exercises posted for testing your understanding and
skills. If you have any questions while reading through the study material, you can,
of course, write to Dr Tao Zhang by email at [email protected].
I hope you will enjoy reading this study pack. Have fun!
Table of Contents
1 Initial Decision Analysis
1.1 Introduction
1.2 Pareto analysis
1.3 Exercise
1.4 Trend analysis
1.5 Moving average: a simple example
1.6 CUSUM
Initially, of course, his attention may have been drawn to the need to address
reliability performance by his detecting its downward trend, preferably at an early
stage. And having taken remedial action he will want confirmation of
improvement, ie clear evidence of an upward trend. This is where 'Trend
analysis', the subject of the second part of this unit, is required.
of the total income was paid to only 20% of the people (Things never change, do
they!)
In maintenance, maldistributions may be observed in such things as:
spares cost: most of the total accounted for by a very small fraction of the
items held;
manpower needs: dominated by a few, troublesome, parts of the plant;
outage time: mostly caused by a few critical, but unreliable items
Clearly, cost-effective remedial action in these and similar areas should first
address 'the vital few' (eg the Top Ten), leaving 'the trivial many' until later -
or even ignoring some of them altogether. The first task, therefore, is to identify
which is which, and this is where Pareto Analysis is used. A real example,
slightly simplified for clarity, will illustrate the technique, which is a very simple
one.
The first step was to calculate the total manhours for each sub-system type. Table
1, in which the sub-systems were ranked in order of manhours expended, was then
produced.
(Note that each of the first eleven rows refers to a single sub-system – one of the
eleven worst performers – whereas the last row gives the total manhours for the
remaining eighty nine sub-systems – each of which individually placed only
a minor, or sometimes zero, demand on manhours)
Two different forms of Pareto plot could then be derived from the data in the table,
viz
(i) A simple histogram, as in Figure 1, using the data from either Column 3 or
Column 4 (in this case from Column 3),
(ii) A cumulative plot, as in Figure 2, using the data from Column 5.
The particular analysis used for this example was, in fact, used to select those
platform sub-systems for which application of Reliability Centred Maintenance
analysis could be justified on the basis of cost-effectiveness.
[Figure 1: histogram of maintenance man-hours against System Ref. (A to K and L+)]
Clearly, Pareto Analysis is quite straightforward. It has been said, however, that –
‘In spite of this (ie its straightforwardness) it is a very powerful tool in the
assessment of field data and should not be pushed on one side in favour of more
sophisticated statistical techniques on account of its simplicity.'
1.3 Exercise
(i) Over a period of time data has been accumulated, from warranty
records, of the failures occurring in a well-known type of domestic washing
machine. The results are shown in Table 2. From the data, derive firstly
a histogram and secondly a cumulative Pareto plot (similar to Figure 1 and
Figure 2 respectively) indicating the most unreliable components. (Derived
from an example quoted in Practical Reliability Engineering, P.T.D.
O'Connor, see Bibliography, Unit 4)
Component                           No of failures
Cables and connectors                      3
Drive belt                                 6
Drum trunnions                            10
High-level switch                         68
Inlet solenoid valve                       2
Outlet pump                               76
Programme switch                         105
Seals                                     38
Spin dryer motor brake                     8
Spin dryer suspension                     10
Starting capacitor (dryer)                 4
Starting capacitor (main motor)            6
Another 28 components (Total No.)         42
(ii) Now consult your own plant history, maintenance records, stores
records, accounts, etc and produce a Pareto analysis of some pertinent
variable,
Eg: Failures per failure mode or per unit
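The ranking and cumulative percentages needed for the two Pareto plots in part (i) can be computed with a few lines of code. The sketch below is only one possible approach, not the method of the accompanying spreadsheets. The counts are copied from Table 2; the function name `pareto_table` and the choice to hold the grouped "Another 28 components" row back to the end of the ranking are assumptions made for illustration.

```python
# Failure counts copied from Table 2 (washing machine warranty records).
failures = {
    "Cables and connectors": 3,
    "Drive belt": 6,
    "Drum trunnions": 10,
    "High-level switch": 68,
    "Inlet solenoid valve": 2,
    "Outlet pump": 76,
    "Programme switch": 105,
    "Seals": 38,
    "Spin dryer motor brake": 8,
    "Spin dryer suspension": 10,
    "Starting capacitor (dryer)": 4,
    "Starting capacitor (main motor)": 6,
}
OTHERS_LABEL = "Another 28 components"  # grouped remainder, kept as the last row
OTHERS_COUNT = 42


def pareto_table(counts, others_label, others_count):
    """Return rows of (item, count, percent, cumulative percent), worst first."""
    total = sum(counts.values()) + others_count
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    ranked.append((others_label, others_count))  # assumption: remainder goes last
    rows, running = [], 0
    for item, n in ranked:
        running += n
        rows.append((item, n, 100 * n / total, 100 * running / total))
    return rows


rows = pareto_table(failures, OTHERS_LABEL, OTHERS_COUNT)
for item, n, pct, cum in rows:
    print(f"{item:34s} {n:4d} {pct:6.1f}% {cum:6.1f}%")
```

The count column gives the histogram and the last column gives the cumulative plot.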
Figure 3. Conventional plot of monthly data (Observation against Month)
Figure 4. Moving average plot (3-month moving average against Month)
1.6 CUSUM
A somewhat more sophisticated method developed by ICI to monitor trends
in plant performance is the Cumulative Sum (CUSUM) Chart. It has proved to be
very effective when dealing with maintenance and reliability data. Numerical
observations are made at fixed intervals and, each time, the deviation of the
observation from a pre-determined and constant target value is calculated. The
target is derived from an analysis of previous data on the parameter of interest.
The cumulative sum of these deviations is then plotted against time. For the data
of Table 3, for example, a monthly target, T, of 15 failures was adopted. The fourth
column in the table then shows the deviations, from this target, of each month's
recorded observation and the monthly entry in the fifth column shows the total of
these deviations up to each month (eg by Month 4 deviations of -3, +2, -1 and -1
have been observed, giving a total, or cumulative sum, of -3). When plotted, as in
Figure 5, the trends are very clearly revealed, ie a steady performance over
the first seven months, a significant improvement (ie a decreasing failure rate)
during the next thirteen months and a sharp deterioration over the remaining five
months or so.
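The two trend calculations can be sketched in code as follows. This is an illustration under stated assumptions, not a reproduction of Table 3: only the first four monthly values are chosen to match the deviations quoted above (-3, +2, -1, -1 against a target of 15), and the remaining values are hypothetical. The deviation is taken as observation minus target, and the moving average is taken as a trailing 3-month average; both conventions are assumptions.

```python
def moving_average(values, window=3):
    """Trailing moving average; the first (window - 1) points have no value."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def cusum(values, target):
    """Running total of (observation - target) for each period."""
    total, out = 0, []
    for value in values:
        total += value - target
        out.append(total)
    return out


# Months 1-4 reproduce the deviations quoted in the text;
# months 5 onward are hypothetical illustration values.
monthly_failures = [12, 17, 14, 14, 16, 15, 15, 11, 12, 10]
TARGET = 15  # monthly target T quoted in the text

cusum_values = cusum(monthly_failures, TARGET)
smoothed = moving_average(monthly_failures, window=3)

for month, (obs, c) in enumerate(zip(monthly_failures, cusum_values), start=1):
    print(f"month {month:2d}: observed {obs:2d}, deviation {obs - TARGET:+d}, CUSUM {c:+d}")
```

On a CUSUM chart it is the slope that carries the information: with this sign convention a level plot means performance is on target, a falling plot means fewer failures than target, and a rising plot means more.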
'The probability that an item will perform its required function in the desired
manner under all the relevant conditions and on the occasions, or during the time
intervals, when it is required so to perform.'
[Greene A E and Bourne J, p 25, Reliability Technology, Wiley 1972]
Like most definitions, some of the words used need clarification. Firstly,
'probability'; this is still a subject of philosophical debate. One (perfectly reputable)
school of thought – the 'Bayesian' school – maintains that it is a measure of our
'strength of belief' (eg that a pump will fail within the week) and a whole statistics
of reliability prediction can be (and has been) based on that idea. For our purposes,
however, we will adopt the position of the 'Frequentist' school and take it to be a
measure of what is expected to happen, on average, if a given event is repeated a
large number of times under identical conditions (eg the probability of getting a
five, say, when throwing a six-sided die – which is 1/6, or 16.67 %).
An 'item' could be -
A component: the smallest part which would be replaced or repaired on failure
(eg a spring, bolt or impeller)
A unit: comprising a number of components (eg a pump or compressor)
A system: comprising many units (eg a process line)
The '...required function in the desired manner under all the relevant conditions...'
refers to the duty undertaken. With electronic equipment the duty demanded
of any particular item will probably be much the same in one application as in
another; stresses are usually low and steady, and the equipment encapsulated. The
reliability observed in one context is therefore likely to be similar to that in another,
ie to be generic (characteristic of a large group or class; general, not specific or
special). This would rarely be the case with mechanical or hydraulic plant - as
found in process or power systems - which could be subject to wide ranges of
operating stress (start-up acceleration, throttled or open running etc),
environmental extremes (tropical or arctic, off or on shore, etc), and materials
handled (erosive, corrosive etc) so reliability assessment would have to take this
into account.
The phrase '...on the occasions, or during the time intervals...' indicates that there
are two basic sorts of reliability, namely –
(a) Time-independent
The item functions only on demand, being otherwise dormant (eg a pressure release
valve). Reliability is measured by the Probability of Successful Function, PS; eg if
a starter motor had failed to operate three times in a hundred demands then
PS = 97/100 = 0.97 (or 97 %)
And it would probably be more meaningful to quote this in terms of the Probability
of Failure on Demand, PF (or Fractional Dead Time, FDT), ie –
PF = 3/100 = 0.03 = 3 %
(b) Time-dependent
The item functions continuously (eg a turbo-alternator). Reliability is measured by
the probability R(t) that it will run successfully for some specified time t
(eg from one annual shutdown to the next). Thus, if a hundred identical pumps had
been started together and after three weeks twenty, say, had failed then
R(3 weeks) = 80/100 = 0.80 (or 80 %)
In this course we will usually be concerned with reliability of type (b), because it
is of relevance to the maintenance of most types of continuously operated industrial
plant. Of course, if the main concern was the maintenance of safety or stand-by
devices, eg smoke alarms, then reliability of type (a) would be the primary interest.
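The two kinds of reliability estimate reduce to simple ratios of counts. A minimal sketch, using the starter motor and pump figures from the examples above (the function names are hypothetical):

```python
def probability_of_success(demands, failures_on_demand):
    """Type (a): fraction of demands on which the item functioned."""
    return (demands - failures_on_demand) / demands


def survival_fraction(started, failed_by_t):
    """Type (b): fraction of items still running at time t, an estimate of R(t)."""
    return (started - failed_by_t) / started


ps = probability_of_success(demands=100, failures_on_demand=3)
pf = 1 - ps                                               # failure on demand
r_3_weeks = survival_fraction(started=100, failed_by_t=20)

print(f"PS = {ps:.2f}, PF = {pf:.2f}, R(3 weeks) = {r_3_weeks:.2f}")
```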
Note:
a) The first row of the table shows the standard statistical terms ('Class interval'
etc) for the types of quantity evaluated;
b) The figures in the fourth column are obtained by dividing those in the third by
100 hours, the width of the class interval used.
Using the data in the fourth column a histogram can be constructed, as in Figure 6.
By drawing it this way the area of the block above each class interval always equals
the relative frequency of failure in that time interval (even if unequal class intervals
were to be used, which is sometimes more convenient).
The assumption might now be made that the pattern of failures exhibited by this
sample is typical of all such pumps; ie the observed relative frequencies truly reflect
the expected probabilities of failure. The probability that any one pump of this kind
will last longer than, say, 700 hours is then given by the shaded area in Figure 6,
ie –
0.19 + 0.08 + 0.01 = 0.28 or 28%.
We now require some measures which will indicate the general nature of the
variability.
1. For its average magnitude, or central tendency, we shall use the arithmetic mean,
ie the sum, over all the class intervals, of (relative frequency × class mark):
mean = (0.02 × 350) + ... = 642 hours
where, for example, in the first bracket 0.02 is the relative frequency and 350 hours
the mid-point, or 'class mark', of the first quoted class-interval.
2. For its spread we shall use the variance, ie the sum, over all the class intervals,
of relative frequency × (class mark − mean)²:
variance = {0.02 × (350 − 642)²} + ...
where, as before, the first bracket, say, refers to the data for the first quoted class
interval and 642 hours is the previously calculated overall mean. A quantity
measured in 'hours-squared' is rather mysterious (although it is, in fact,
indispensable in many statistical calculations), so for presenting information on the
observed spread of the times-to-failure we quote its square-root, the 'standard
deviation'.
[Figure: probability density (relative frequency per unit of time), f(t), plotted against time to failure, t]
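The grouped-data calculations described above (area beyond 700 hours, mean, variance and standard deviation) can be sketched as follows. The relative frequencies are partly assumed: the values 0.02 (class mark 350 h) and 0.19, 0.08, 0.01 (classes above 700 h) are quoted in the text, whereas the three middle values 0.10, 0.20 and 0.40 are assumptions chosen so that the frequencies sum to 1 and reproduce the quoted mean of 642 hours. They should be replaced by the actual table values.

```python
import math

# Class marks (mid-points) of 100-hour class intervals, in hours.
class_marks = [350, 450, 550, 650, 750, 850, 950]
# Relative frequencies: first and last three are quoted in the text;
# the middle three (0.10, 0.20, 0.40) are ASSUMED for illustration.
rel_freqs = [0.02, 0.10, 0.20, 0.40, 0.19, 0.08, 0.01]


def grouped_mean(marks, freqs):
    """Sum of relative frequency x class mark."""
    return sum(f * t for t, f in zip(marks, freqs))


def grouped_variance(marks, freqs):
    """Sum of relative frequency x (class mark - mean) squared."""
    mean = grouped_mean(marks, freqs)
    return sum(f * (t - mean) ** 2 for t, f in zip(marks, freqs))


mean_life = grouped_mean(class_marks, rel_freqs)
variance = grouped_variance(class_marks, rel_freqs)
std_dev = math.sqrt(variance)

# Probability of lasting longer than 700 hours: area of the blocks above 700 h.
p_beyond_700 = sum(f for t, f in zip(class_marks, rel_freqs) if t > 700)

print(f"mean = {mean_life:.0f} h, variance = {variance:.0f} h^2, sd = {std_dev:.0f} h")
print(f"P(life > 700 h) = {p_beyond_700:.2f}")
```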
Failure probability F(t): A graph can be drawn (see Table 5 and Figure 10) of the
rise in the total fraction F(t) of pumps failed by a given time t.
If the tested pumps are representative the observed F(t) can be taken to be the
probability, for any one such pump, of its failure before a running time t. The
mathematical expression of a plot such as that in Figure 10 is called a cumulative
distribution function or cdf. For the negative exponential pdf, for example, the cdf
is
F(t) = 1 - exp(-λt)
Reliability R(t): Alternatively, the fraction R(t) of items surviving at running time
t could be tabulated and plotted, as in Table 6 and Figure 11. For the negative
exponential pdf,
R(t) = 1 - F(t) = exp(-λt).
Hazard rate Z(t): This is defined as the fraction, of those items which have
survived up to the time t, expected to fail, per unit time. Thus, at any time t,
Z(t) = f(t)/R(t)
For the negative exponential pdf this gives Z(t) = λ exp(-λt)/exp(-λt) = λ,
ie the failure rate is constant, the item is always 'as good as new', as already
explained. For the data of Table 6, Z(t) is calculated, tabulated and plotted in Table
7 and Figure 12.
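The relationships between f(t), F(t), R(t) and Z(t) for the negative exponential case can be checked numerically. The sketch below assumes a hypothetical failure rate of 0.002 per hour; the function names are illustrative only.

```python
import math


def exp_pdf(t, lam):
    """Negative exponential pdf, f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t)


def exp_cdf(t, lam):
    """Failure probability, F(t) = 1 - exp(-lam * t)."""
    return 1 - math.exp(-lam * t)


def exp_reliability(t, lam):
    """Survival probability, R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)


def hazard_rate(t, lam):
    """Z(t) = f(t) / R(t); constant and equal to lam for this pdf."""
    return exp_pdf(t, lam) / exp_reliability(t, lam)


LAM = 0.002  # hypothetical failure rate, per hour

for t in (0, 100, 500, 1000):
    print(
        f"t = {t:4d} h: F = {exp_cdf(t, LAM):.3f}, "
        f"R = {exp_reliability(t, LAM):.3f}, Z = {hazard_rate(t, LAM):.4f} /h"
    )
```

Whatever the running time, Z(t) comes out equal to λ, which is the numerical counterpart of 'as good as new'.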
Our frequent use of these expressions needs to be qualified. Time may, of course,
mean actual calendar time (before failure). However, it might sometimes
more appropriately refer to total running time, ie exclusive of stoppages, or
number of operational cycles, or whatever. Also, what constitutes a failure
depends on the operational requirement, even for the simplest electrical
component. Failure of a resistor, for example, could be a break, or a short, or either
one of these. There may be several different ways in which the operation of
a mechanical item could be degraded, but only one or two which would
actually impair system performance should they occur, eg a hydraulic valve
could suffer an internal leak, an external leak, failure to close, failure to open, or
spurious operation; also, a partial leak may be acceptable.
Figure 15. Death rate characteristic for males living in England and
Wales for the years 1960 - 1962
b) Calculate the mean and standard deviation, using your class-interval data,
c) Using a 'scientific' hand calculator find the mean and standard deviation
of the ungrouped data and compare with the result of (b),
d) Plot the variation in F(t), R(t), and Z(t) on ordinary graph paper
Weibull (a Swedish metallurgist working with the Bofors steel company) was
involved in analysing the results of load tests on many, nominally identical, test
specimens of a particular type of steel. Their Ultimate Tensile Strengths exhibited
random variability, as they always do. If F(x) was defined as the cumulative
fraction which exhibited strengths less than a particular load x (ie F(x) was the
cumulative distribution function or cdf, the distribution of the probability that a
specimen would fail under the load x), then a plot of F(x) looked like the one shown
in Figure 16. None failed before some given load x0 (the guaranteed strength) and
a few hung on to quite large loads.
Firstly, Weibull conjectured that it might be possible to represent such a cdf fairly
accurately by the expression
\[ F(x) = 1 - \exp\left\{-\left(\frac{x - x_0}{\eta}\right)^{\beta}\right\}, \]
which enabled him to correlate his test data very well. In addition, the expression
had some other very useful properties, as we shall see.
In the reliability problems that we are looking at here the stressing factor is not load
but running time t, since new or since last overhaul. The Weibull cdf for times to
failure is therefore written as
\[ F(t) = 1 - \exp\left\{-\left(\frac{t - t_0}{\eta}\right)^{\beta}\right\} \]
From this, some simple mathematics then leads to the appropriate expressions for
the Weibull pdf f(t), Reliability R(t), and Hazard Rate Z(t), ie –
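For reference, the standard three-parameter Weibull forms that follow from the cdf above (for t ≥ t0) are set out below. The reliability is obtained from R(t) = 1 − F(t), the pdf by differentiating F(t), and the hazard rate from Z(t) = f(t)/R(t). The notation is assumed to match that of the cdf.

```latex
\begin{align*}
R(t) &= 1 - F(t) = \exp\left\{-\left(\frac{t - t_0}{\eta}\right)^{\beta}\right\} \\
f(t) &= \frac{dF(t)}{dt}
      = \frac{\beta}{\eta}\left(\frac{t - t_0}{\eta}\right)^{\beta - 1}
        \exp\left\{-\left(\frac{t - t_0}{\eta}\right)^{\beta}\right\} \\
Z(t) &= \frac{f(t)}{R(t)}
      = \frac{\beta}{\eta}\left(\frac{t - t_0}{\eta}\right)^{\beta - 1}
\end{align*}
```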
Each of the terms in these expressions has a practical meaning and significance.
The threshold time-to-failure, or guaranteed life t0: In many cases of wear-out
the first failure does not appear until some significant running time t0 has elapsed.
In the Weibull expressions the time factor is therefore the time interval (t - t0).
The characteristic life, η: When t - t0 = η, R(t) = exp(-1) = 0.37, ie η is the
interval between t0 and the time at which it can be expected that 37 per cent of the
items will have survived (and hence 63 per cent will have failed).
The shape factor, β: Figure 17 shows how the Weibull pdf of time-to-failure
changes as β is changed (for clarity, on each plot t0 = 0 and η = 1).
*If β is significantly less than one, the pdf approximates to the hyper-exponential,
ie is characteristic of 'running-in' failure;
*If β = 1 the pdf becomes the simple negative exponential, characteristic of
'purely random' failure;
*As β rises above a value of about 2, the pdf converges ever more closely to the
Normal pdf, characteristic of 'wear-out' failure.
Figure 17. Influence of shape factor 𝜷 on the form of the Weibull pdf
of time to failure (For all plots t0 = 0 and 𝜼 = 1)
NB For the first two cases t0 must be zero, of course; for the wear-out case it may
or may not be. Also note from Figure 17 that 𝛽 characterises the consistency of
failure occurrence. The larger its value the greater is the tendency for the failures
to occur at about the same running time.
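The practical meaning of η and β can be explored numerically. The following sketch evaluates the standard three-parameter Weibull reliability and hazard rate (the function names are hypothetical) and checks two of the statements above: that R = exp(-1) ≈ 0.37 at the characteristic life whatever the value of β, and that the hazard rate falls, stays constant or rises according to whether β is below, equal to or above one. The functions assume t > t0.

```python
import math


def weibull_reliability(t, eta, beta, t0=0.0):
    """R(t) = exp{-((t - t0)/eta)^beta}, for t > t0."""
    return math.exp(-(((t - t0) / eta) ** beta))


def weibull_cdf(t, eta, beta, t0=0.0):
    """F(t) = 1 - R(t)."""
    return 1 - weibull_reliability(t, eta, beta, t0)


def weibull_hazard(t, eta, beta, t0=0.0):
    """Z(t) = (beta/eta) * ((t - t0)/eta)^(beta - 1), for t > t0."""
    return (beta / eta) * ((t - t0) / eta) ** (beta - 1)


ETA = 1.0  # as in Figure 17, t0 = 0 and eta = 1
for beta in (0.5, 1.0, 3.0):
    early = weibull_hazard(0.5, ETA, beta)
    late = weibull_hazard(2.0, ETA, beta)
    print(
        f"beta = {beta}: R(eta) = {weibull_reliability(ETA, ETA, beta):.3f}, "
        f"Z(0.5) = {early:.3f}, Z(2.0) = {late:.3f}"
    )
```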
3.2 Weibull probability paper
How do we test whether observed times-to-failure look as if they could be plausibly
represented by a Weibull cdf? In the language of statistics, whether they look as if
they have been sampled from such a distribution? And if they do, how do we
determine the values of t0, η and β which will give the distribution which best fits
the data? One easy way is to use Weibull probability graph paper. There are
several versions of this; we shall use the one that is most popular in the UK,
marketed by the Chartwell technical graph paper company (Ref No. 6572 in their
list). A full size sheet is reproduced on page 7. On this, the y-axis variable is the
cumulative fraction failed, F(t), expressed in per cent, and the x-axis variable is (t
- t0), in whatever are the appropriate units of time for the particular item studied
(as explained in Unit 2, 'time' in this context is a measure of usage and might
appropriately be 'number of operational cycles'). The axial scales are so arranged
that if a theoretical Weibull cdf were to be plotted on the paper (ie using values of
F(t) calculated from the expression given earlier) the points would lie on a perfectly
straight line. The following example shows how the paper is used.
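The reason the plot is straight is that the Weibull cdf can be rearranged as ln[-ln(1 - F)] = β ln(t - t0) - β ln η, which is linear in ln(t - t0) with slope β. The sketch below uses this rearrangement with an ordinary least-squares line in place of the fit by eye made on the paper; that substitution is an assumption for illustration, not the procedure of this unit. The function name is hypothetical, and the demonstration points are generated from a theoretical cdf (t0 = 900 h, η = 600 h, β = 3.5) rather than taken from Table 8.

```python
import math


def weibull_line_fit(times, fractions_failed, t0=0.0):
    """Least-squares line through (ln(t - t0), ln(-ln(1 - F))).

    Returns (beta, eta): beta is the slope and eta follows from the intercept.
    """
    xs = [math.log(t - t0) for t in times]
    ys = [math.log(-math.log(1 - f)) for f in fractions_failed]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = y_bar - beta * x_bar      # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta


# Synthetic check: points taken from a theoretical Weibull cdf.
T0, ETA, BETA = 900.0, 600.0, 3.5
times = [1100.0, 1300.0, 1500.0, 1700.0, 1900.0]
fractions = [1 - math.exp(-(((t - T0) / ETA) ** BETA)) for t in times]

beta_hat, eta_hat = weibull_line_fit(times, fractions, t0=T0)
print(f"beta = {beta_hat:.2f}, eta = {eta_hat:.0f} h")
```

With real data t0 is not known in advance; trial values are tried until the plotted points are as straight as possible.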
Table 8. Pump failure data
5. The characteristic life, η, is the value of t - t0 at which the line fitted to the
straightest plot reaches the 63% failed level, in this case 600 hours. (NB t - t0
= 600 h corresponds to a total actual running time of t = 1500 h, remembering
that t0 = 900 h)
6. As shown, a perpendicular is dropped from the fixed 'Estimation point' (printed
just above the top left-hand corner of the diagram) to the straight line fit. The
point at which this perpendicular intersects the special scale at the top of
the graph gives the value of β for the best-fit cdf (in this case approximately
3.5, clearly pointing to a wear-out mode of failure)
7. Note that the perpendicular also intersects another scale, marked F. This gives
the value of the cumulative percent failed, in this case 49.8%, at the mean life.
From the main straight-line plot we then read off that at this value of F(t), t -
t0 = 540 hours. So, since the estimated t0 is 900 hours, we deduce that the
estimated mean life is 900 + 540 = 1440 hours.
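The graphical mean-life estimate in step 7 can be cross-checked against the standard formula for the mean of a Weibull distribution, mean = t0 + η Γ(1 + 1/β), where Γ is the gamma function. This formula is a standard result and is not derived in this unit; the function names below are hypothetical.

```python
import math


def weibull_mean_life(eta, beta, t0=0.0):
    """Mean of a three-parameter Weibull: t0 + eta * Gamma(1 + 1/beta)."""
    return t0 + eta * math.gamma(1 + 1 / beta)


def fraction_failed_at_mean(beta):
    """F(t) evaluated at the mean life; depends only on beta."""
    return 1 - math.exp(-(math.gamma(1 + 1 / beta) ** beta))


# Parameter estimates read from the Weibull plot in the example above.
mean_life = weibull_mean_life(eta=600.0, beta=3.5, t0=900.0)
f_at_mean = fraction_failed_at_mean(3.5)

print(f"mean life = {mean_life:.0f} h, F at mean life = {100 * f_at_mean:.1f} %")
```

Both values agree closely with the readings taken from the paper (about 1440 hours and 49.8%).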
Figure 18. Weibull plot of pump failure data from Table 8.
3.3.1 Example
Plot the pump failure data of Table 4, Unit 2 (ie the cumulative, F(t), data)
on the Weibull paper provided and hence obtain values of t0, η and β for the
distribution.
(You should find t0 = 200 hrs, η = ... hrs and β = 4.0, APPROXIMATELY –
Remember, this is a graphical technique!)
3.4 Median ranks
In practice, it is most often the case that only a handful of times-to-failure have
been recorded, ie the data sample is small and possibly incomplete. Indeed, the
items under examination might be large and expensive and only a few might yet
have been made. In addition, some of the items might still be running, not having
reached the failure point (ie be 'suspended') or may have been withdrawn from the
trial (ie 'censored') because, in their case, the test conditions were accidentally
altered.
In this situation the results of any analysis will necessarily be subject to greater
statistical uncertainty. A Weibull analysis might still be needed, however, on the
grounds that an approximate result at the end of a fortnight may be of more value
than a precise one obtained by waiting for another three months.
In the above case, a technique in which so-called 'Median (or 50%) Rank'
estimations of the F(t) values are plotted on Weibull probability paper can then
produce meaningful estimates of t0, η and β. It has the additional advantage that,
using published tables of 5% Rank and 95% Rank estimations, confidence limits
can be assigned to the estimated parameters. The following example shows how
the technique is applied.
Ten springs are being tested to failure. The situation to date is as shown in Table 9.
A Weibull fit to the data is required.
1. The failure points are ranked in ascending order (Table 10, Column 2) and
classified as failed 'f' or suspended (or censored) 's' (Table 10, Column 3).
Table 9. Spring failure data
2. For the first failed item the 'New Increment' is calculated from the formula
New Increment = (N + 1 − previous Order Number) / (N + 1 − number of previous items)
where N = total number of items in the sample (in this case, 10). Since this is
the first failed item, the previous Order Number is zero. Also, in this case, the
number of previous items is zero and therefore the calculated New Increment is
1 (Column 4).
Note that if the first failure had been preceded by some suspended items the
New Increment would have been greater than 1, eg if the first two items had
been suspended, the New Increment would have been –
(10 + 1 − 0) / (10 + 1 − 2) = 1.22
3. The 'Order Number' of the first failed item is then obtained from the
expression
Order Number = New Increment + previous Order Number
ie in this case, Order Number = 1 + 0 = 1 (Column 5).
4. This procedure is repeated for all the remaining failed items, in succession, ie
Second failed item: New increment = (10 + 1 − 1) / (10 + 1 − 1) = 1
Order number = 1 + 1 = 2
Third failed item: New increment = (10 + 1 − 2) / (10 + 1 − 3) = 1.125
and so on......
NB After a suspended item, or group of suspended items, the value of the New
Increment remains constant, and therefore need not be revised, until the next
suspended item or group of such items.
[Mini-exercise: Check for yourself, from the data in this example, that this is
indeed the case]
5. Having completed Column 5, the corresponding 'Median Ranks' (Column 6)
are calculated from the formula
Median Rank = (Order Number − 0.3) / (N + 0.4)
eg for the fourth failed item
Median Rank = (4.250 − 0.3) / (10 + 0.4) = 3.950 / 10.4 = 0.380
6. The Median Ranks, expressed as percentages, are then plotted against cycles
(or time) to failure on Weibull graph paper, as in Figure 19, and values of η,
mean life and β obtained (in this case approximately 8200 cycles, 7400 cycles and
3.0 respectively), in the same way as in the previous example. In this case t0 = 0,
this giving a good straight line plot, but in the general case t0 would be established
by trial plots exactly as previously.
7. Tables of Median (or 50%) Ranks (derived via the formula used above - called
'Benard's Estimate’) and also of ‘5% Ranks' and ‘95% Ranks', for various
sample sizes and Order Numbers, can be found in various textbooks [eg in
Andrews and Moss, and in O'Connor, see bibliography at the end of Unit 4]
and are also appended to this unit. It is a simple matter to plot these Ranks, as
well as the Median Ranks, against the observed cycles (or times) to failure, thus
obtaining (see Figure 19) the 90% Confidence Band for the data, ie the band
within which it is 90% probable that the plot would lie that would be obtained
from a very large number of items. In the absence of suspended items the
procedure is, of course, much more straightforward, in that the Order Numbers
are simply 1, 2, 3, 4 ... N.
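Steps 2 to 5 above can be sketched in code as follows. The input is simply the list of 'f' and 's' flags in ascending order of life. Only the opening pattern of the demonstration list (two failures, one suspension, then two failures) is chosen to reproduce the Order Numbers quoted in the worked example; the rest of the list is hypothetical and is not the data of Table 10. The function name is also hypothetical.

```python
def median_ranks(statuses):
    """Order Numbers and Benard median ranks for a ranked, censored sample.

    statuses: 'f' (failed) or 's' (suspended) flags in ascending order of life.
    Returns one tuple per failed item:
    (rank position, New Increment, Order Number, Median Rank).
    """
    n = len(statuses)
    previous_order = 0.0
    results = []
    for previous_items, status in enumerate(statuses):
        if status != "f":
            continue  # suspended items receive no Order Number
        increment = (n + 1 - previous_order) / (n + 1 - previous_items)
        order = previous_order + increment
        rank = (order - 0.3) / (n + 0.4)
        results.append((previous_items + 1, increment, order, rank))
        previous_order = order
    return results


# First five flags match the worked example; the remainder are hypothetical.
statuses = ["f", "f", "s", "f", "f", "s", "f", "f", "s", "f"]

table = median_ranks(statuses)
for position, increment, order, rank in table:
    print(f"item {position:2d}: increment {increment:.3f}, "
          f"order {order:.3f}, median rank {100 * rank:.1f} %")
```

With no suspended items every New Increment is 1 and the Order Numbers reduce to 1, 2, 3 ... N, as noted above.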
Figure 19. Weibull plot for spring failures, Median, 5% and 95%
Ranks. Data of Table 10
3.4.1 Exercise
Ten identical lifts were operated under identical conditions. For each, the total
number of operating cycles achieved before breakdown was as follows –
Using the Median Rank method, fit a Weibull pdf to the data, evaluating 𝜷
and 𝜼 (cycles to breakdown). Also, using the Rank Tables provided –
(a) fit a 90% confidence band
(b) hence estimate the number of cycles for which we could be 95% confident
that the lift Reliability (survival probability) would be at least 70%
[Answers: From the data alone t0 is clearly zero. From the plot, 𝜷 = 1.1 – 1.2,
𝜼 = 32 – 36 × 10^4 cycles. (b) From the 95% Rank plot it can be seen that at t =
5 × 10^4 cycles, there is 95% confidence that F(t) should be at most 30%, ie R(t)
should be at least 70%]
4 Assessing Reliability of Systems
If we had acquired enough statistical data – of the kind, and using the analyses,
described in Units 2 and 3 – on all the component items of an engineering system
then, in principle, we should be able to assess the expected reliability performance
of the system as a whole. Unfortunately, both the length and complexity of such
an assessment increase exponentially with the number of items involved and the
complexity of their inter-dependence. All is not lost, however. Valuable insights
into probable system performance may often be gained from straightforward hand
calculations based on simplified modelling of the system. An introduction to such
modelling now follows. A fuller explanation, with more worked calculations and
some useful tables of formulae covering the more commonly occurring cases, can
be found in the IMechE guidebook.
[Davidson and Hunsley (Eds), pp 45-91, see bibliography at the end of this
unit].
To illustrate this, consider a system comprising two fuel pumps, both of which are
normally working. The system is designed so that enough flow will be achieved
even if one of the pumps were to fail. The RBD would then be as in the second
diagram of Figure 20, which indicates that the required flow may be achieved via
either, or both, of the pumps. This diagram would still be the appropriate RBD
whether the pumps were located in series, along a single fuel line, or in parallel,
provided that required flow would still be achieved even with one pump failed and
that such a failure would not block the fuel line. The RBD is determined by the
reliability logic, not by the physical layout.
4.2 Series Reliability
Consider the simplest system, of just two units. If successful system operation
requires both units to be working (ie if either one fails the system fails) then for a
reliability assessment they are considered to be in series dependency and are
represented as in the RBD at the top of Figure 20.
If the failure probabilities of the units can be assumed to be independent (ie the
failure behaviour of one is not influenced by that of the other) then the expected
system reliability at any given time t is given by the product of the two estimated
unit reliabilities at that time (This is directly analogous to betting on a racing
'double', where the quoted odds against such an event are obtained by multiplying
together the separate odds against each horse winning its own race), ie
RS(t) = R1(t) × R2(t)
4.2.1 Example
Two units in series reliability dependency, so the RBD is as the first one in
Figure 20. The unit reliabilities at t = 10,000 operating-hours are 0.90 and 0.95
respectively. So, system reliability at 10,000 hours will be
RS = 0.90 × 0.95 = 0.855 or 85.5%
It can be simply shown that if both types of unit exhibit negative exponential pdfs
(ie they are in their useful life, 'random failures' phase, see Unit 2) then the system
as a whole will also exhibit such a pdf, ie
fS(t) = (λ1 + λ2) exp{−(λ1 + λ2)t}
4.2.2 Example
Items 1 and 2 are in series dependency, as above. If –
(MTTF)1 = 100 hrs, and therefore λ1 = 1/100 = 0.01 /hr,
(MTTF)2 = 200 hrs, and therefore λ2 = 1/200 = 0.005 /hr,
then
λS = λ1 + λ2 = 0.01 + 0.005 = 0.015 /hr
and
(MTTF)S = 1/λS = 1/0.015 = 67 hrs
All that is required in the case of series reliability systems with more than two
units is simple extension of these calculations – and the same general logic also
applies to availability calculations for series systems.
4.2.3 Example
A simple process flow system of four units is required to run continuously, has
no redundancy (ie the units are in series reliability), and the units are subject to
randomly occurring failure-and-repair outages which result in average unit
availabilities of 98, 98, 96 and 95 per cent respectively (see Figure 21).
The system is available only when all units are working, so its expected average
availability over a long period will be
AS = 0.98 × 0.98 × 0.96 × 0.95 = 0.88 or 88%
And hence its unavailability
US = 1 − AS = 1 − 0.88 = 0.12 or 12%.
[In this example, because all the unit unavailabilities are very small,
eg U1 = 1 − 0.98 = 0.02,
we could have used the approximation
US ≈ U1 + U2 + U3 + U4 = 0.02 + 0.02 + 0.04 + 0.05 = 0.13 or 13%]
This example illustrates 'Lusser's Rule' (named after the Second World War
German ballistic engineer who, rather surprisingly, seems to have been the first
to state it in formal terms): the reliability of a series arrangement will always
be less than that of its least reliable component.
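The series calculations in Examples 4.2.1 to 4.2.3 can be reproduced with a short script. This is a minimal sketch assuming independent units; the function names are hypothetical, and the MTTF helper applies only when every unit has a constant failure rate (negative exponential pdf).

```python
import math


def series(values):
    """Reliability (or availability) of independent units in series: the product."""
    return math.prod(values)


def series_mttf(failure_rates):
    """MTTF of a series system of constant-failure-rate units: 1 / sum of rates."""
    return 1 / sum(failure_rates)


rs = series([0.90, 0.95])                       # Example 4.2.1
mttf_s = series_mttf([0.01, 0.005])             # Example 4.2.2
a_s = series([0.98, 0.98, 0.96, 0.95])          # Example 4.2.3
u_approx = sum(1 - a for a in [0.98, 0.98, 0.96, 0.95])

print(f"RS = {rs:.3f}, (MTTF)S = {mttf_s:.0f} h")
print(f"AS = {a_s:.2f}, US = {1 - a_s:.2f}, approximate US = {u_approx:.2f}")
```

The product is never greater than its smallest factor, which is the statement of Lusser's Rule above.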
So,
RS(t) = 1 − FS(t) = 1 − F1(t).F2(t) = 1 − {1 − R1(t)}.{1 − R2(t)}
4.3.1. Example
Two units are in parallel dependency, their separate reliabilities being as in the
earlier example, ie at t = 10,000 operating hours R1 = 0.90 and R2 = 0.95. The
system reliability at that time is therefore
RS = 1 − (1 − 0.90).(1 − 0.95) = 1 − (0.10 × 0.05) = 0.995 or 99.5%.
As with series systems, for more than two units the algebra is simply extended, ie
RS(t) = 1 − {1 − R1(t)}.{1 − R2(t)}.{1 − R3(t)}.......
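The active-parallel rule extends to any number of units in the same way, which the following minimal Python sketch illustrates (the function name is hypothetical, and the units are assumed to fail independently).

```python
import math


def parallel_reliability(unit_reliabilities):
    """Reliability of units in active parallel.

    The system fails only if every unit fails, so
    RS = 1 - (1 - R1)(1 - R2)...(1 - Rn).
    """
    return 1 - math.prod(1 - r for r in unit_reliabilities)


# Example 4.3.1: R1 = 0.90 and R2 = 0.95 at 10,000 operating hours
r_s = parallel_reliability([0.90, 0.95])
print(f"RS = {r_s:.3f}")  # RS = 0.995
```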
If both units were to be operating in their negative exponential phase then it can be
shown that the system pdf would not be a simple negative exponential. In fact, it
would be
fS(t) = λ1 exp(−λ1t) + λ2 exp(−λ2t) − (λ1 + λ2) exp{−(λ1 + λ2)t}
And in larger active-parallel configurations it would be a yet more complex
function.
Expressions for MTTF are likewise complicated, eg for the two-unit system
(MTTF)S = (1/λ1) + (1/λ2) − 1/(λ1 + λ2).
4.3.2. Example
If, for the two-unit active parallel system, λ1 = 0.01 /hr and λ2 = 0.005 /hr, then,
for the system,
(MTTF)S = (1/0.01) + (1/0.005) − 1/(0.01 + 0.005)
= 100 + 200 − 67
= 233 hrs.
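The closed-form MTTF can be cross-checked numerically. The sketch below relies on the standard result that MTTF is the area under the reliability curve R(t); the integration limit and step count are arbitrary choices made for illustration, and all function names are hypothetical.

```python
import math


def parallel_mttf_two_units(lam1, lam2):
    """Closed-form MTTF of two constant-failure-rate units in active parallel."""
    return 1 / lam1 + 1 / lam2 - 1 / (lam1 + lam2)


def system_reliability(t, lam1, lam2):
    """RS(t) = 1 - {1 - R1(t)}{1 - R2(t)} with Ri(t) = exp(-lam_i * t)."""
    return 1 - (1 - math.exp(-lam1 * t)) * (1 - math.exp(-lam2 * t))


def mttf_by_integration(lam1, lam2, t_max=5000.0, steps=50000):
    """Numerical check: area under RS(t) from 0 to t_max (trapezium rule)."""
    dt = t_max / steps
    area = 0.5 * (system_reliability(0.0, lam1, lam2)
                  + system_reliability(t_max, lam1, lam2))
    area += sum(system_reliability(i * dt, lam1, lam2) for i in range(1, steps))
    return area * dt


# Example 4.3.2
closed_form = parallel_mttf_two_units(0.01, 0.005)
numerical = mttf_by_integration(0.01, 0.005)
print(f"closed form: {closed_form:.1f} hrs")  # closed form: 233.3 hrs
print(f"numerical:   {numerical:.1f} hrs")    # numerical:   233.3 hrs
```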
4.3.3. Example
A system comprises three chemical reactors, RBD as in Figure 22. The system
would be down only if all three were down. The reactors are subject to randomly
occurring failure-and-repair outages which result in average availabilities of 60,
70, and 80 per cent respectively, ie average unavailabilities of 40, 30 and 20 per
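cent, or 0.40, 0.30 and 0.20. For the system, the unavailability is
US = 0.40 × 0.30 × 0.20 = 0.024
and hence the availability is
AS = 1 − US = 1 − 0.024 = 0.976 or 97.6%.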
The last three examples illustrate, of course, the general rule (the converse of
Lusser's Rule) that the reliability of a parallel (ie redundant) system will always be
greater than that of the most reliable of its component units.
4.3.4. Exercises
Calculate the availabilities of the two systems that are shown below in the form
of Reliability Block Diagrams (unit availabilities are as indicated).
[Reliability Block Diagrams for systems (a) and (b) are not reproduced here.]
4.4.1. Example
An aircraft, with four completely independent engines – normally all operating –
will continue to fly safely if at least any two engines are working. The observed
mean failure rate (given that it has started successfully) of any one of these
engines is 0.01 /hr, ie MTTF = 100 hrs (not the sort of engine to which I would
entrust life and limb!). Given that the aircraft has taken off successfully, what is
the probability that it will not fly safely for 10 hours, assuming that all failures
other than engine failure are negligible?
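Answer:
For each engine, R(10 hrs) = exp(−0.01 × 10) = 0.905, and so
F(10 hrs) = 1 − 0.905 = 0.095.
The aircraft will not fly safely if all four engines fail or if any three of the
four fail, so for the system of four engines
FS(10 hrs) = 0.095⁴ + (4 × 0.905 × 0.095³) ≈ 0.0032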
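The same answer follows from the general 'at least k out of n' binomial sum, sketched below in Python. The function name is hypothetical, and the sketch assumes identical, independent engines with a constant failure rate, as in Example 4.4.1.

```python
import math


def k_out_of_n_reliability(k, n, r):
    """Probability that at least k of n identical, independent units work."""
    return sum(math.comb(n, j) * r**j * (1 - r)**(n - j)
               for j in range(k, n + 1))


failure_rate = 0.01   # per hour, ie MTTF = 100 hrs
mission_time = 10.0   # hours

r_engine = math.exp(-failure_rate * mission_time)
r_system = k_out_of_n_reliability(2, 4, r_engine)
f_system = 1 - r_system

print(f"Engine reliability = {r_engine:.3f}")  # Engine reliability = 0.905
print(f"FS(10 hrs) = {f_system:.4f}")          # FS(10 hrs) = 0.0032
```

Summing over the working states (two, three or four engines running) and subtracting from 1 gives the same result as adding the two failed states directly, as is done in the worked answer.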
Analysis, even of this very simple arrangement, can be quite complex. There are
several different ways in which it could fail, viz
(i) failure of both unit 1 and unit 2,
(ii) failure of D to divert when required,
(iii) D wrongly diverting to a failed unit,
(iv) failure of D to transmit flow.
In addition, there are several different repair policies which could be adopted for the
failed, off-line unit.
Consider, for example, one of the simplest cases: D perfectly reliable, both units in
the negative exponential phase, ie R(t) = exp(−λt), and no repair or replacement. The
system operates until unit 1 fails, then diverts to unit 2 and operates until it fails.
System failure therefore requires two failures, with failures occurring at average
rate λ. It can be shown that, for the system,
f(t) = λ²t.exp(−λt)
and
R(t) = (1 + λt).exp(−λt)
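To get a feel for these expressions, the sketch below evaluates the standby reliability alongside that of two identical units in active parallel. The failure rate and the times are hypothetical values chosen only for illustration, and the switch D is assumed perfectly reliable, as in the text.

```python
import math


def standby_reliability(t, lam):
    """R(t) = (1 + lam*t) exp(-lam*t): one unit on line, one on standby."""
    return (1 + lam * t) * math.exp(-lam * t)


def standby_pdf(t, lam):
    """f(t) = lam^2 * t * exp(-lam*t): pdf of the time to the second failure."""
    return lam**2 * t * math.exp(-lam * t)


def active_parallel_reliability(t, lam):
    """Two identical units in active parallel, for comparison."""
    return 1 - (1 - math.exp(-lam * t))**2


lam = 0.01  # failures per hour (hypothetical value)
for t in (10, 50, 100):
    print(t,
          round(standby_reliability(t, lam), 4),
          round(active_parallel_reliability(t, lam), 4))
```

Under these idealised assumptions the standby arrangement is never less reliable than the active-parallel one, because the standby unit does not accumulate operating time until it is switched in.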
Extension of this analysis up to any number N of units in parallel, with n units on
standby, is relatively straightforward. If, however, the failed units can be repaired,
calculation of system reliability RS(t), or of average system availability AS,
demands that we account for
(i) random variation in repair time,
(ii) spares waiting time,
(iii) availability of maintenance resources (men, tools and spares),
(iv) repair priority rules, etc.
Clearly, one possibility is that on-line units could fail while the parallel units might
still be under or awaiting repair. Analysis can be very complex and is generally
accomplished via Markov Discrete State Chain Analysis or via Simulation – either
technique usually requiring iterative numerical procedures. Powerful PC-based
software is, however, readily available for doing this.
So far, it has been assumed that the unit or system can be in one of only two states,
working or failed. It will often be the case, of course, that some of the units, and
hence the system, can operate in a partially reduced state. If the likelihood of this is
significant it has to be considered in the reliability calculation, lengthening it still
further.
4.6 Complex systems: some further methods of reliability analysis
4.6.1. Truth table
This is actually the least analytical approach. It is best explained by working through
a simple illustration, which does not really call for this procedure, but which will
serve to explain it effectively.
Consider the system of Figure 23. Let X, Y, Z be the respective reliabilities R (t),
say, of the units at some time t of interest (they could also have been the long-term
average unit availabilities, if system availability had been the parameter of interest).
So (1 − X), (1 − Y) and (1 − Z) will be the respective failure probabilities F (t). Denoting
working status by 1 and failed by 0, we can now construct
Table 11, where each row shows firstly a possible combination of the unit states,
secondly the resulting system state, and finally, in the last column, the probability
of occurrence of that combination. In this case eight different combinations or
scenarios are possible, and thus there are eight rows in the table.
Table 11. Truth table for system of Figure 23.
Only the first three states are operational ones, so the system reliability is given by
the sum of the first three probabilities, ie
RS = XY + XZ − XYZ
4.6.2. Example
For the above system it has been found that, at t = 20 hrs, RX = 0.80 and both RY
and RZ = 0.70, so the expected system reliability at that time will be
RS (20) = (0.80 × 0.70) + (0.80 × 0.70) − (0.80 × 0.70 × 0.70) = 0.73 or 73%
The Truth Table method can be extended to consider reduced-output unit states.
Analysis of anything other than the smallest systems, however, has to be
computerised because even with the two-state model (units assumed either fully
operational or completely failed) each additional unit doubles the number of
possible system states (rows in the table) and hence the computing time. If the units
are of high reliability this problem can be alleviated by assuming that system states
with more than, say, three or four failures at any one time have a negligible
probability of occurrence, which considerably reduces the length of the table.
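The truth-table procedure of Section 4.6.1 lends itself to a short program. The sketch below enumerates every combination of unit states and sums the probabilities of the operational ones. Figure 23 is not available here, so the system structure coded in system_works (unit X in series with the parallel pair Y and Z) is an assumption inferred from the expression used in Example 4.6.2. All names are illustrative.

```python
from itertools import product


def system_works(x, y, z):
    """Assumed structure: unit X in series with the parallel pair (Y, Z)."""
    return x == 1 and (y == 1 or z == 1)


def truth_table(rel):
    """Return rows (unit states, system state, probability) for all 2^n cases.

    rel maps each unit name to its reliability; 1 = working, 0 = failed.
    """
    names = list(rel)
    rows = []
    for states in product((1, 0), repeat=len(names)):
        prob = 1.0
        for name, state in zip(names, states):
            prob *= rel[name] if state == 1 else 1 - rel[name]
        rows.append((states, int(system_works(*states)), prob))
    return rows


rows = truth_table({"X": 0.80, "Y": 0.70, "Z": 0.70})  # Example 4.6.2
for states, up, prob in rows:
    print(states, up, f"{prob:.3f}")

r_s = sum(prob for _, up, prob in rows if up)
print(f"RS = {r_s:.3f}")  # RS = 0.728
```

The loop over product((1, 0), repeat=n) makes the doubling explicit: each extra unit doubles the number of rows, which is why large systems need either truncation of the table or a different method.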
4.7 System reduction
The analysis of a large RBD may be carried out in stages, at each of which small
groups of units are analysed by the methods so far described, allowing the RBD to
be re-drawn in a reduced form.
4.7.1. Example
Consider the system RBD shown in Figure 24(a), where the number in the bottom
right hand corner of each numbered block is the assessed average availability of
the unit represented by that block.
Stage 1. Take each series-connected group, calculate its availability and assign
that value to an 'equivalent' single unit, which replaces the original group, ie –
For groups (2, 3), (4, 5) and (6, 7), availability = 0.90 × 0.80 = 0.72
For groups (8, 9, 10) and (11, 12, 13), availability = 0.95 × 0.90 × 0.80 = 0.684
For group (2-7), operational if at least two channels out of the three are
working,
availability = 0.72³ + [3 × 0.72² × (1 − 0.72)] = 0.81.
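The reduction steps quoted above can be reproduced with two small helper functions, as sketched below. Only the groups whose figures appear in the text are evaluated; how the reduced blocks combine further depends on Figure 24, which is not available here. The function and variable names are hypothetical.

```python
import math


def series(availabilities):
    """Availability of units in series: the product of the unit values."""
    return math.prod(availabilities)


def k_out_of_n(k, n, a):
    """Availability when at least k of n identical channels must be working."""
    return sum(math.comb(n, j) * a**j * (1 - a)**(n - j)
               for j in range(k, n + 1))


# Series groups (2, 3), (4, 5) and (6, 7) each reduce to one channel
channel = series([0.90, 0.80])
# Series groups (8, 9, 10) and (11, 12, 13) each reduce to one line
line = series([0.95, 0.90, 0.80])
# Group (2-7): at least two of the three reduced channels must be working
group_2_7 = k_out_of_n(2, 3, channel)

print(f"channel = {channel:.3f}")      # channel = 0.720
print(f"line = {line:.3f}")            # line = 0.684
print(f"group (2-7) = {group_2_7:.3f}")  # group (2-7) = 0.809
```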
Should a system, no matter how small, contain a 'cross-linked' unit (ie one
incorporated into more than one line or sub-system) then its analysis has to be
undertaken via the Truth Table method (or via a rather more sophisticated method
based on Bayes' Conditional Probability Theorem, which is not dealt with in this
unit). Powerful and user-friendly PC software for the reliability assessment of very
large complex systems and incorporating techniques such as Weibull Failure Data
Analysis (as in Unit 3), Failure Mode Effect and Criticality Analysis (FMECA),
Fault Tree Analysis, Markov Analysis and Simulation is available. Introductions to
these and other such matters are given in the IMechE handbook The Reliability of
Mechanical Systems.
[See bibliography at the end of this unit]
4.7.2. Example
Calculate, using the System Reduction procedure, the availabilities of the two
systems that are shown below in the form of Reliability Block Diagrams (unit
availabilities are as indicated).
4.8 Redundancy allocation for a series system
The reliability of a component can be increased by placing an identical component
in parallel to the existing component, both operating in what is called the active
redundancy mode. The act of adding the redundant component is called doubling.
Barlow and Proschan (1965) proposed a sequential search method for solving this
kind of problem and it is explained below.
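Consider a series system of n components with reliabilities p1, p2, …, pn, so that
the system reliability is
R = p1 p2 … pn.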
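In this case, which of the n components should be doubled first in order to maximise
the ensuing reliability of the system? To answer this question, let us choose at
random a component to double, say component i. The reliability of the system then
becomes
Ri = p1 p2 … [1 − (1 − pi)²] … pn
or
Ri = R(2 − pi). (1)
Since R is the same whichever component is chosen, Ri is greatest when pi is smallest,
ie the least reliable component should be doubled first.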
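One step of such a search can be sketched as follows. This is only an illustration of the idea behind equation (1), not a reproduction of Barlow and Proschan's full procedure: it tries adding one redundant unit to each stage in turn and keeps the choice that gives the highest system reliability. The component reliabilities, the 'copies' bookkeeping and the function names are all hypothetical.

```python
import math


def system_reliability(p, copies):
    """Series system in which stage i has copies[i] identical units in active parallel."""
    return math.prod(1 - (1 - pi)**m for pi, m in zip(p, copies))


def best_component_to_add(p, copies):
    """Return (index, reliability) for the most beneficial single added unit."""
    best_index, best_r = None, -1.0
    for i in range(len(p)):
        trial = copies.copy()
        trial[i] += 1            # add one redundant unit to stage i
        r = system_reliability(p, trial)
        if r > best_r:
            best_index, best_r = i, r
    return best_index, best_r


p = [0.90, 0.80, 0.95]   # hypothetical component reliabilities
copies = [1, 1, 1]       # no redundancy to begin with

base = system_reliability(p, copies)
index, r_i = best_component_to_add(p, copies)

print(f"R = {base:.4f}")                       # R = 0.6840
print(f"double component {index}: Ri = {r_i:.4f}")  # double component 1: Ri = 0.8208
```

With no redundancy to begin with, the search picks the least reliable component, and the resulting reliability agrees with equation (1): 0.684 × (2 − 0.80) = 0.8208. Repeating the step with the updated copies list gives a sequential search.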
Reference:
Barlow, R.E. and Proschan, F. (1965). Mathematical Theory of Reliability. New
York: Wiley.
4.9 Preventive maintenance planning
Having acquired pdfs of time-to-failure, mean failure rates, etc for the various parts
of a proposed system it will be clear that many of the parts will have a useful life
much less than that anticipated for the system. The desired system reliability will
therefore only be achieved if such parts are easily replaceable, ie such replacements
should –
(i) interfere as little as possible with the operation of other parts,
(ii) interfere as little as possible with normal operation or production,
(iii) be needed at intervals which exceed the maximum operational cycle or
production run.
If, in addition to the data on reliability, there is also information on the costs (direct
and indirect) of replacement, it might be possible, using one of the many Operational
Research models that have been developed for the purpose, to determine whether
such replacements should be periodic, grouped, opportunistic, on-failure, etc. More
often than not, however, replacement on
failure is the overall cheapest solution. Time-based replacement is cheapest only if
(i) the cost per item is much less than that of replacement on failure,
(ii) the item exhibits a pronounced wear-out behaviour.
Spares should be available when needed, of course, and statistical data on failure
rates etc are essential for determining optimum spares inventory policies.
4.10 Bibliography
Comprehensive introductions to this subject, with essential data, formulae,
statistical tables etc