
MN5554 Reliability Engineering

MSc Programme

MN 5554 Quality Management and Reliability

PART B

Reliability

AMEE

College of Engineering, Design and Physical Sciences

Prepared by Joe Au, Tao Zhang, Eng Tseng Lau and John Harris
Advanced Manufacturing & Enterprise Engineering (AMEE)
College of Engineering, Design and Physical Sciences

Brunel University London


Kingston Lane, London, Uxbridge UB8 3PH
Copyright © 2007 - 2018 Brunel University London
All rights reserved. No part of this material may be reproduced, stored in a retrieval system or transmitted in any
form, or by any means – electronic, mechanical, photocopying or otherwise, without permission in writing from
the publisher.

Preface

This study pack is concerned with Reliability, the second component of the module
MN5554 Quality Management and Reliability. It comprises four units and it
addresses the following topics:
1. Unit 1 - Decision Analysis,
2. Unit 2 - Measuring the Reliability of Items,
3. Unit 3 - Weibull Analysis of Lifetime Data, and
4. Unit 4 - Assessing the Reliability of Systems.

In Unit 1, Decision Analysis techniques are used to identify the reliability problem
areas (Pareto analysis) and to determine whether or not the situation is improving
after being addressed (Trend analysis using the moving-average and CUSUM).

Unit 2, Measuring the Reliability of Items, deals with the theory and concepts of
reliability, the practical aspects of how reliability is quantified and the characteristic
patterns of failure. Statistical techniques are used for describing reliability and its
converse, unreliability (also known as failure probability).

Weibull analysis presented in Unit 3 is a widely-used technique for the statistical


analysis of data on times-to-failure. The unit describes a graphical approach of the
Weibull analysis that provides a statistical description of failure.

Given the knowledge of the reliability of individual items in a system, it is natural


to ask how the reliability of the system itself can be assessed. This is the main
question that will be answered in Unit 4. It should be noted that the method
described in this unit for reliability calculation is equally applicable to availability
calculation.

These units are written by Dr John Harris, Honorary Fellow of University of


Manchester School of Engineering. Course units are modified regularly by Dr. Joe
Au, Dr. Tao Zhang and Dr. Eng Tseng Lau from Brunel University London. For
your information, they form part of the module, called Reliability Engineering, in
the MSc Process Engineering Management distance-learning programme at Brunel
University London.

The study pack contains not only these units, but also additional notes and
Microsoft Excel spreadsheets designed formerly by Dr Joe Au of School of
Engineering and Design, Brunel University London. The notes are provided to
clarify or amplify points and concepts presented in the units. The spreadsheets
contain alternative methods of solution to augment those described in the units and
solutions to exercises in the units. These are ‘live’ documents such that you can
experiment with them by changing values in different cells - a significant advantage
for exploring various ‘what-if’ scenarios readily and efficiently.

Finally, do make a point of visiting the Blackboard Learn. Under this module, you
will find short questions and exercises posted for testing your understanding and
skills. If you have any questions while reading through the study material, you can,
of course, write to Dr Tao Zhang by email to [email protected].

I hope you will enjoy reading this study pack. Have fun!


Table of Contents
1 Initial Decision Analysis ............................................................................................................................ 1
1.1 Introduction ............................................................................................................................................ 1
1.2 Pareto analysis ........................................................................................................................................ 1
1.3 Exercise ................................................................................................................................................... 5
1.4 Trend analysis ......................................................................................................................................... 6
1.5 Moving average: a simple example ........................................................................................................ 6
1.6 CUSUM ................................................................................................................................................... 9

2 Measuring the Reliability of Items .......................................................................................................... 10


2.1 Engineering Reliability ......................................................................................................................... 10
2.2 Statistical Analysis of Lifetime Items ................................................................................................... 12
2.2.1 Mean, variance and standard deviation ........................................................................................ 12
2.2.2 Probability Density Function (PDF) .............................................................................................. 14
2.2.3 Measures of Item Reliability .......................................................................................................... 17
2.3 The ‘whole-life’ picture ........................................................................................................................ 20
2.3.1 ‘Time to failure’ and what we mean by ‘failure’ ......................................................................... 21
2.3.2 ‘The whole life item failure profile ................................................................................................ 21
2.3.3 Diagnosis of recurrent failures and prescription of the remedy ................................................. 23
2.4 Now try this exercise ............................................................................................................................. 24

3 Weibull Analysis of Lifetime Data .......................................................................................................... 25


3.1 Now try this exercise ............................................................................................................................. 28
3.2 Weibull probability paper ..................................................................................................................... 29
3.3 Weibull analysis of a large and complete sample of times-to-failure.................................................. 29
3.3.1 Example ........................................................................................................................................... 31
3.4 Median ranks ........................................................................................................................................ 32
3.4.1 Exercise ............................................................................................................................................ 37

4 Assessing Reliability of Systems .............................................................................................................. 41


4.1 Reliability block diagram ...................................................................................................................... 41
4.2 Series Reliability ................................................................................................................................... 43
4.2.1 Example ........................................................................................................................................... 43
4.2.2 Example ........................................................................................................................................... 44
4.2.3 Example ........................................................................................................................................... 44
4.3 Active-parallel Reliability ..................................................................................................................... 45
4.3.1. Example ........................................................................................................................................... 45
4.3.2. Example ........................................................................................................................................... 46
4.3.3. Example ........................................................................................................................................... 46
4.3.4. Exercises .......................................................................................................................................... 47
4.4 Active-parallel reliability with partial redundancy .............................................................................. 47
4.4.1. Example ........................................................................................................................................... 47


4.5 Inactive parallel, or standby, reliability ............................................................................................... 48


4.6 Complex systems: some further methods of reliability analysis .......................................................... 50
4.6.1. Truth table ....................................................................................................................................... 50
4.6.2. Example ........................................................................................................................................... 51
4.7 System reduction ................................................................................................................................... 52
4.7.1. Example ........................................................................................................................................... 52
4.7.2. Example ........................................................................................................................................... 54
4.8 Redundancy allocation for a series system .......................................................................................... 55
4.9 Preventive maintenance planning ........................................................................................................ 56
4.10 Bibliography ........................................................................................................................... 57


1 Initial Decision Analysis


1.1 Introduction
The basic task of the maintenance manager in charge of complex production plant
is to maximise the contribution of the maintenance function to sustaining –
and, wherever possible, improving – the profitability of his plant. This he can do
by, among other things, applying the techniques of reliability engineering, as
introduced in this module, to ensuring that the plant will achieve and sustain its
inherent reliability (ie the design intent) and hence availability. However, his
resources will be finite and constrained (culturally, environmentally, statutorily,
etc), and the techniques can be demanding on manpower. Among the things
he will therefore need to do is to identify which parts of his plant or operation
are causing the biggest reliability problems, and hence which actions are likely to
induce the biggest improvement in performance for a given cost. Which is
where ‘Pareto' Analysis (or, more crudely, 'Top Ten' analysis), which will be
explained in the first part of this unit, comes in.

Initially, of course, his attention may have been drawn to the need to address
reliability performance by his detecting its downward trend – preferably at an early
stage. And having taken remedial action he will want confirmation of
improvement, ie clear evidence of an upward trend. Which is where 'Trend
analysis', the subject of the second part of this unit, is required.

1.2 Pareto analysis


Quantitative analysis of the many different aspects of a plant's maintenance
problems inevitably reveals striking examples of what has come to be known
as Pareto's Law of Maldistribution, after Vilfredo Pareto (1848-1923), an Italian
engineer who played a major role in creating his country's railway system and
then turned his hand to creating modern Mathematical Economics, or
'Econometrics'. In a study of the economy of East Prussia he found that 80%


of the total income was paid to only 20% of the people. (Things never change, do
they!)
In maintenance, maldistributions may be observed in such things as:
spares cost: most of the total accounted for by a very small fraction of the
items held;
manpower needs: dominated by a few, troublesome, parts of the plant;
outage time: mostly caused by a few critical, but unreliable items

Clearly, cost-effective remedial action in these and similar areas should first
address 'the vital few' (eg the Top Ten), leaving the 'the trivial many' until later -
or even ignoring some of them altogether. The first task, therefore, is to identify
which is which, and this is where Pareto Analysis is used. A real example,
slightly simplified for clarity, will illustrate the technique, which is a very simple
one.

An offshore oil and gas extraction system consisted of several identical


platforms. Each platform comprised one hundred different sub-systems. For
some years, records had been kept of the maintenance manhours expended on
each of these sub-systems. A Pareto Analysis of these records was required.

The first step was to calculate the total manhours for each sub-system type. Table
1, in which the sub-systems were ranked in order of manhours expended, was then
produced.
(Note that each of the first eleven rows refers to a single sub-system – one of the
eleven worst performers – whereas the last row gives the total manhours for the
remaining eighty-nine sub-systems – each of which individually placed only
a minor, or sometimes zero, demand on manhours.)

Two different forms of Pareto plot could then be derived from the data in the table,
viz

(i) A simple histogram, as in Figure 1, using the data from either Column 3 or
Column 4 (in this case from Column 3),
(ii) A cumulative plot, as in Figure 2, using the data from Column 5.


Table 1: Manhour analysis for platform sub-systems

The cumulative plot probably demonstrates the maldistribution more


dramatically. The histogram however is more generally useful because subsequent
changes can be more clearly shown for the individual items.

The particular analysis used for this example was, in fact, used to select those
platform sub-systems for which application of Reliability Centred Maintenance
analysis could be justified on the basis of cost-effectiveness.
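
The tabulation behind Table 1 and the cumulative plot of Figure 2 is easy to automate. A minimal Python sketch follows; the sub-system labels and man-hour figures here are illustrative placeholders, not the platform data.

# Pareto ranking of maintenance man-hours by sub-system.
# The man-hour figures below are illustrative placeholders, not the platform data.
subsystems = {"A": 2900, "B": 2100, "C": 1700, "D": 1300, "E": 900, "F": 700,
              "G": 600, "H": 500, "I": 450, "J": 400, "K": 350}
others = 1100          # total for the remaining eighty-nine minor sub-systems

total = sum(subsystems.values()) + others
ranked = sorted(subsystems.items(), key=lambda kv: kv[1], reverse=True)
ranked.append(("L+ (remaining 89)", others))       # lumped 'trivial many' last

cumulative = 0.0
print(f"{'Sub-system':20s} {'Man-hours':>9s} {'% of total':>10s} {'Cum. %':>7s}")
for name, hours in ranked:
    share = 100.0 * hours / total
    cumulative += share
    print(f"{name:20s} {hours:9d} {share:10.1f} {cumulative:7.1f}")

The cumulative column is exactly what is plotted in Figure 2; the man-hour column gives the histogram of Figure 1.
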


Figure 1: Histogram of ranked manhour data

[Bar chart: maintenance man-hours (0 to 3000) against System Ref. A to L+]

Figure 2: Cumulative plot of percentage man-hours expended

Clearly, Pareto Analysis is quite straightforward. It has been said, however, that –

‘In spite of this (ie its straightforwardness) it is a very powerful tool in the
assessment of field data and should not be pushed on one side in favour of more
sophisticated statistical techniques on account of its simplicity.'


[Taken from A.D.S. Carter, Mechanical Reliability (2nd Ed), Macmillan


Education (1986), an in-depth treatment of all aspects of assessing and
achieving reliability in mechanical equipment, based on the author's military
engineering work. He gives several illustrations of the value of Pareto
analysis.]

1.3 Exercise
(i) Over a period of time data has been accumulated, from warranty
records, of the failures occurring in a well-known type of domestic washing
machine. The results are shown in Table 2. From the data, derive firstly
a histogram and secondly a cumulative Pareto plot (similar to Figure 1 and
Figure 2 respectively) indicating the most unreliable components. (Derived
from an example quoted in Practical Reliability Engineering, P.T.D.
O'Connor, see Bibliography, Unit 4)

Table 2. Washing machine failures

Component No of failures
Cables and connectors 3
Drive belt 6
Drum trunnions 10
High-level switch 68
Inlet solenoid valve 2
Outlet pump 76
Programme switch 105
Seals 38
Spin dryer motor brake 8
Spin dryer suspension 10
Starting capacitor (dryer) 4
Starting capacitor (main motor) 6
Another 28 components (Total No.) 42

(ii) Now consult your own plant history, maintenance records, stores
records, accounts, etc and produce a Pareto analysis of some pertinent
variable,
Eg: Failures per failure mode or per unit

Forced downtime per unit


Direct maintenance cost (man-hours/spares) per unit
Demands per type of spare
Value of holding per type of spare
Or any other indicator of plant or workforce performance that could be of
interest to you.

1.4 Trend analysis


Pareto or ‘Top Ten’ analysis may have identified the main causes of failure,
or the most costly failures, and we may then have taken steps to eliminate these
causes or mitigate the impact of the failures. Some time later we will need to know
whether our efforts have been worthwhile. Is the reliability performance of our plant
getting better? Staying much the same? Or still getting worse? Are failure
rates or repair times decreasing or increasing? Unfortunately, a simple plot
of the month by month data on these parameters will usually be a very fluctuating
‘saw tooth’ line and any trend will be difficult to discern. A plot of the
‘moving average’ may, however, make things a little clearer.

1.5 Moving average: a simple example


For 25 months records were kept, at a particular plant, of the total number
of failures occurring each month, with the results shown in the first two columns of
Table 3 and in the straightforward plot, Figure 3. Because of the random fluctuation
it was difficult to detect any trend from either the raw data or the plot. Each month’s
entry in the third column of the table, however, is the average of the three months
up to and including that month (eg the first such entry at Month 3 is the average,
14.3, of 12, 17 and 14). When plotted, as shown in Figure 4, these three-monthly
moving averages strongly suggested that there had been a downward trend over the
first twenty months followed by a sharp upturn during the remaining five months.
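
As a check on the arithmetic, here is a short Python sketch of the three-month moving average, using the first ten monthly failure counts from Table 3:

# Three-month moving average of monthly failure counts (first months of Table 3).
failures = [12, 17, 14, 14, 17, 16, 14, 11, 13, 14]

window = 3
for month, count in enumerate(failures, start=1):
    if month >= window:
        avg = sum(failures[month - window:month]) / window
        print(f"Month {month:2d}: failures = {count:2d}, 3-month moving average = {avg:.1f}")
    else:
        print(f"Month {month:2d}: failures = {count:2d}")

The first averages printed (14.3 at Month 3, 15.0 at Month 4, ...) reproduce the third column of Table 3.
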


Month | Failures | Three-month moving average | Deviation from target T = 15 | Cumulative sum of deviations
1 12 -3 -3
2 17 2 -1
3 14 14.3 -1 -2
4 14 15.0 -1 -3
5 17 15.0 2 -1
6 16 15.7 1 0
7 14 15.7 -1 -1
8 11 13.7 -4 -5
9 13 12.7 -2 -7
10 14 12.7 -1 -8
11 15 14.0 0 -8
12 11 13.3 -4 -12
13 14 13.3 -1 -13
14 16 13.7 1 -12
15 13 14.3 -2 -14
16 14 14.3 -1 -15
17 11 12.7 -4 -19
18 12 12.3 -3 -22
19 13 12.0 -2 -24
20 16 13.7 1 -23
21 12 13.7 -3 -26
22 18 15.3 3 -23
23 18 16.0 3 -20
24 17 17.7 2 -18
25 20 18.3 5 -13

Table 3: Monthly data, moving average and CUSUM


Figure 3. Conventional plot of monthly data [y-axis: observations (failures per month); x-axis: month]
Figure 4. Moving average plot [y-axis: 3-month moving average; x-axis: month]

Figure 5: CUSUM plot


1.6 CUSUM
A somewhat more sophisticated method developed by ICI to monitor trends
in plant performance is the Cumulative Sum (CUSUM) Chart. It has proved to be
very effective when dealing with maintenance and reliability data. Numerical
observations are made at fixed intervals and each in turn is subtracted from
a pre-determined and constant target value derived from an analysis of previous
data on the parameter of interest. The cumulative sum of these deviations is then
plotted against time. For the data of Table 3 for example, a monthly target, T of 15
failures was adopted. The fourth column in the table then shows the deviations,
from this target, of each month’s recorded observation and the monthly entry in the
fifth column shows the total of these deviations up to each month (eg by Month 4
deviations of -3, +2, -1 and -1 have been observed, giving a total, or cumulative
sum, of -3). When plotted, as in
Figure 5, the trends are very clearly revealed, ie a steady performance over
the first seven months, a significant improvement (ie a decreasing failure rate)
during the next thirteen months and a sharp deterioration over the remaining five
months or so.
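
The CUSUM columns of Table 3 can be generated the same way; a minimal sketch, again using the first ten monthly counts and the target T = 15:

# CUSUM of deviations from a fixed target (T = 15 failures/month, Table 3).
failures = [12, 17, 14, 14, 17, 16, 14, 11, 13, 14]
target = 15

cusum = 0
for month, count in enumerate(failures, start=1):
    deviation = count - target
    cusum += deviation
    print(f"Month {month:2d}: deviation = {deviation:+3d}, cumulative sum = {cusum:+3d}")

The printed deviations (-3, +2, -1, ...) and cumulative sums (-3, -1, -2, ...) reproduce the last two columns of Table 3.
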

[This example – originally given in ‘BS 5703 Part 1: Guide to data analysis
and quality control using CUSUM techniques’, BSI 1980 – has been taken
from the article ‘Maintenance Decision Analysis’ by T R Moss, Maintenance
and Asset Management, Vol 12, No. 3, 1997]


2 Measuring the Reliability of Items


The solution of maintenance problems often requires us to think about the
reliability of machinery. For example, we might need to assess whether an
especially critical compressor is likely to run satisfactorily, without failing, from
one seasonal shutdown to another. (If not, should it be replaced, maintained
differently, or should we acquire a standby?). We know that we can never be
absolutely certain that it will. Even if it is well designed, well maintained and
carefully operated there is always a small chance that it will fail in service,
and we cannot say exactly when this might happen.

Thus, analysis of engineering reliability - like betting on horses - deals in


probabilities (ie likelihoods of success or failure) and probabilistic variables (ie
quantities that vary randomly, such as times-to-failure). In particular, it deals with
the application of statistical techniques to the analysis of patterns of component
and equipment failure – and acting on the information thus gained (eg
replacing or repairing a component at timely intervals, or re-designing a
complete engineering system, etc) in order to increase system reliability.

2.1 Engineering Reliability


A good definition of this was given by Green and Bourne in their authoritative
textbook on the subject, viz

'The probability that an item will perform its required function in the desired
manner under all the relevant conditions and on the occasions, or during the time
intervals, when it is required so to perform.'
[Green A E and Bourne A J, p 25, Reliability Technology, Wiley 1972]

Like most definitions, some of the words used need clarification. Firstly,
'probability'; this is still a subject of philosophical debate. One (perfectly reputable)
school of thought – the 'Bayesian' school – maintains that it is a measure of our
'strength of belief' (eg that a pump will fail within the week) and a whole statistics
of reliability prediction can be (and has been) based on that idea. For our purposes,
however, we will adopt the position of the 'Frequentist' school and take it to be a
measure of what is expected to happen, on average, if a given event is repeated a
large number of times under identical conditions (eg the probability of getting a
five, say, when throwing a six-sided die – which is 1/6, or 16.67%).


An 'item' could be –
A component: the smallest part which would be replaced or repaired on failure
(eg a spring, bolt or impeller)
A unit: comprising a number of components (eg a pump or compressor)
A system: comprising many units (eg a process line)

Reliability analysis is quite different in each case and it is therefore important


that we always clearly define the physical boundaries of the item being analysed.

The '...required function in the desired manner under all the relevant conditions...'
refers to the duty undertaken. With electronic equipment the duty demanded
of any particular item will probably be much the same in one application as in
another; stresses are usually low and steady, and the equipment encapsulated. The
reliability observed in one context is therefore likely to be similar to that in another,
ie to be generic (characteristic of a large group or class; general, not specific or
special). This would rarely be the case with mechanical or hydraulic plant - as
found in process or power systems - which could be subject to wide ranges of
operating stress (start-up acceleration, throttled or open running etc),
environmental extremes (tropical or arctic, off or on shore, etc), and materials
handled (erosive, corrosive etc) so reliability assessment would have to take this
into account.

The phrase '...on the occasions, or during the time intervals...' indicates that there
are two basic sorts of reliability, namely –

(a) Time-independent
The item functions only on demand, being otherwise dormant (eg a pressure release
valve). Reliability is measured by the Probability of Successful Function, PS; eg if
a starter motor had failed to operate three times in a hundred demands then
PS = 97/100 = 0.97 (or 97 %)
And it would probably be more meaningful to quote this in terms of the Probability
of Failure on Demand, PF (or Fractional Dead Time, FDT), ie –
PF = 3/100 = 0.03 = 3 %
(b) Time-dependent
The item functions continuously (eg a turbo-alternator). Reliability is measured by
the probability R(t) that it will run successfully for some specified time t
(eg from one annual shutdown to the next). Thus, if a hundred identical pumps had
been started together and after three weeks twenty, say, had failed then


R (3 weeks) = 80/100 = 0.8 or 80%
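
Both measures reduce to simple ratios; a small Python sketch using the counts quoted above:

# Time-independent reliability: probability of successful function on demand.
demands, failures_on_demand = 100, 3
PS = (demands - failures_on_demand) / demands      # 0.97
PF = failures_on_demand / demands                  # 0.03 (fractional dead time)

# Time-dependent reliability: fraction of a population surviving to time t.
pumps_started, failed_by_3_weeks = 100, 20
R_3_weeks = (pumps_started - failed_by_3_weeks) / pumps_started   # 0.80

print(f"PS = {PS:.2f}, PF = {PF:.2f}, R(3 weeks) = {R_3_weeks:.2f}")
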

In this course we will usually be concerned with reliability of type (b), because it
is of relevance to the maintenance of most types of continuously operated industrial
plant. Of course, if the main concern was the maintenance of safety or stand-by
devices, eg smoke alarms, then reliability of type (a) would be the primary interest.

[For a basic introduction to the reliability of such safety equipment see


Davidson J and Hunsley C, The Reliability of Mechanical Systems, 2nd Ed.,
Mechanical Engineering Publications, IMechE, London 1994, pp 59-62 –
material which was contributed by the present author]

2.2 Statistical Analysis of Lifetime Items

2.2.1 Mean, variance and standard deviation


Let us assume, as a highly idealised illustration, that we have been able to
test one hundred identical pumps of a new design by running them continuously
until each one has failed, with the results shown in Table 4.

Table 4. Pump failure data

Note:
a) The first row of the table shows the standard statistical terms ('Class interval'
etc) for the types of quantity evaluated;
b) The figures in the fourth column are obtained by dividing those in the third by
100 hours, the width of the class interval used.

Using the data in the fourth column a histogram can be constructed, as in Figure 6.
By drawing it this way the area of the block above each class interval always equals
the relative frequency of failure in that time interval (even if unequal class intervals
were to be used, which is sometimes more convenient).

Figure 6. Histogram of pump failure data.

The assumption might now be made that the pattern of failures exhibited by this
sample is typical of all such pumps; ie the observed relative frequencies truly reflect
the expected probabilities of failure. The probability that any one pump of this kind
will last longer than, say, 700 hours is then given by the shaded area in Figure 6,
ie –
0.19 + 0.08 + 0.01 = 0.28 or 28%.

We now require some measures which will indicate the general nature of the
variability.
1. For its average magnitude, or central tendency, we shall use the arithmetic mean

m = (0.02 × 350) + (0.09 × 450) + (0.21 × 550) + ....... etc. = 642 h


where, for example, in the first bracket 0.02 is the relative frequency and 350 hours
the mid-point, or 'class mark', of the first quoted class-interval.

2. For the spread, or dispersion, of the times-to-failure we shall calculate the


variance

s² = 0.02(350 − 642)² + 0.09(450 − 642)² + ............. etc. = 13 500 h²

where, as before, the first bracket, say, refers to the data for the first quoted class
interval and 642 hours is the previously calculated overall mean. A quantity
measured in 'hours-squared' is rather mysterious (although it is, in fact,
indispensable in many statistical calculations), so for presenting information on the
observed spread of the times-to-failure we quote its square root, the 'standard
deviation'

s = (13 500)^(1/2) = 116 hours.
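
The grouped-data arithmetic is easily checked by machine. The sketch below uses the class marks and relative frequencies quoted above; the 600–700 hour class is inferred as the remainder (0.40) so that the frequencies sum to one, which is why the printed values differ marginally from the 642 h and 13 500 h² quoted in the text.

# Mean, variance and standard deviation from grouped (class-interval) data.
# Class marks (hours) and relative frequencies; the 0.40 entry is inferred as the
# remainder, so treat the list as illustrative rather than as Table 4 itself.
grouped = [(350, 0.02), (450, 0.09), (550, 0.21), (650, 0.40),
           (750, 0.19), (850, 0.08), (950, 0.01)]

m = sum(f * x for x, f in grouped)                    # arithmetic mean, hours
s2 = sum(f * (x - m) ** 2 for x, f in grouped)        # variance, hours^2
s = s2 ** 0.5                                         # standard deviation, hours
print(f"m = {m:.0f} h, s^2 = {s2:.0f} h^2, s = {s:.0f} h")
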

2.2.2 Probability Density Function (PDF)


If thousands of pumps had been tested, instead of just a hundred, the width of the
class intervals in Figure 6 could have been reduced and a virtually continuous
probability density function or pdf obtained, as in Figure 7. Many failure processes
generate pdf's of time-to-failure which can be represented fairly accurately by
simple mathematical expressions. This can be useful in reliability-assessment
calculations.

Figure 7. Continuous probability density distribution [y-axis: probability density (relative frequency per unit of time), f(t); x-axis: time to failure, t]

The Normal, or 'Wear-out', pdf: Many engineering items exhibit definite


wear-out failure, ie they mostly fail around some mean operating age, although a
few fail sooner and a few later. The distribution of times-to-failure often
approximates to the symmetric, bell-shaped Normal pdf, a distribution which is of
pivotal importance in general statistical theory. (It is often called Gauss's
distribution because he derived it - by formulating a simple model of the way in
which errors of measurement are generated.) If the times-to-failure were to be
distributed in this way then, as indicated in Figure 8, 50% of them would fall in
the range (m − 0.67s) to (m + 0.67s), and 95% in the range (m − 2s) to (m + 2s), where
m is the measured mean of the distribution and s its standard deviation.
Statistical tables give other percentage probabilities for other ranges
(expressed as multiples of s) about the mean.

Figure 8. The Normal pdf (zero mean, unit std.dev.)


The Negative Exponential, or 'Random Failure' pdf: During their 'as-designed'
lives many engineering items, if properly operated, do not 'wear-out'.
On the contrary, they are as likely to fail sooner as later. The probability of failure
is constant and independent of running time (and probably small); the item is
always effectively 'as good as new'. This indicates that the cause of any failure is
external to the item, eg overload, impact by plant-generated missile etc. It can be
shown that, in this case, the distribution of times-to-failure is given by the negative
exponential pdf (see Figure 9)
f(t) = λ exp(−λt)
where λ = mean failure rate (failures/unit time/machine) and 1/λ = mean time to
failure (MTTF).

[The significance of this algebraic expression will become clearer in


Unit 4, on System Reliability, where numbers will be put into it.]

Figure 9. Hyper-exponential and exponential pdf's


The Hyper-exponential, or 'Running-in’ pdf: Sometimes, the probability of


failure is found to be higher immediately following installation or overhaul than
during subsequent operation. This can be represented by a pdf of time-to-failure
which exhibits two phases - an initial rapid fall and a later slower one (see Figure
9). This is evidence that some of the items concerned have manufacturing
defects, or have been re-assembled incorrectly, faults that show up during
running-in. Items that survive this stage are without such defects and go on to
exhibit the sort of time-independent failure probability previously discussed.
Note - they are not improving with age! Some items merely start off with a
better chance of survival than others.
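
For readers who want to see these shapes numerically, here is a brief sketch evaluating the Normal and negative exponential pdfs defined above. The hyper-exponential is modelled as a mixture of two exponentials, which is a common convention rather than a formula given in this unit, and all parameter values are illustrative.

import math

def normal_pdf(t, m, s):
    """Normal ('wear-out') pdf with mean m and standard deviation s."""
    return math.exp(-0.5 * ((t - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def exponential_pdf(t, lam):
    """Negative exponential ('random failure') pdf, f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t)

def hyper_exponential_pdf(t, lam1, lam2, p):
    """Mixture of two exponentials ('running-in' shape) - a modelling assumption."""
    return p * exponential_pdf(t, lam1) + (1 - p) * exponential_pdf(t, lam2)

# Illustrative parameter values only (the Normal uses the 642 h / 116 h figures above).
for t in (0, 100, 300, 600, 900):
    print(t, round(normal_pdf(t, 642, 116), 6),
          round(exponential_pdf(t, 1 / 642), 6),
          round(hyper_exponential_pdf(t, 1 / 100, 1 / 800, 0.3), 6))
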

2.2.3 Measures of Item Reliability


The information in Figure 6 can be presented in other ways that may be more
useful.

Failure probability F(t): A graph can be drawn (see Table 5 and Figure 10) of the
rise in the total fraction F(t) of pumps failed by a given time t.

Table 5. Rise in total fraction failed: pump data of Table 4.

Figure 10. Probability of failure before time t


If the tested pumps are representative the observed F(t) can be taken to be the
probability, for any one such pump, of its failure before a running time t. The
mathematical expression of a plot such as that in Figure 10 is called a cumulative
distribution function or cdf. For the negative exponential pdf, for example, the cdf
is
F(t) = 1 − exp(−λt)

Reliability R(t): Alternatively, the fraction R (t) of items surviving at running time
t could be tabulated and plotted, as in Table 6 and Figure 11.

Table 6. Fall in total fraction surviving: pump data of Table 4

Figure 11. Reliability at time t

Clearly, R(t) = 1 - F(t),

And for the negative exponential case, for example,

R(t) = exp(-λt).

Again, if the observed pumps are assumed to be representative R(t) can be


taken to be the Reliability at time t for any such pump.


Hazard rate Z (t): This is defined as the fraction, of those items which have
survived up to the time t, expected to fail, per unit time. Thus, at any time t,

Z(t) = (Fraction of original pumps failing per hour at time t) / (Fraction of original pumps still running at time t) = f(t) / R(t)
And for the negative exponential case,

Z(t) = {λ exp(−λt)} / {exp(−λt)} = λ,

ie the failure rate is constant, the item is always 'as good as new', as already
explained. For the data of Table 6, Z(t) is calculated, tabulated and plotted in Table
7 and Figure 12.
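
A short sketch of how F(t), R(t) and Z(t) follow from failures counted in equal class intervals. The counts are reconstructed from the relative frequencies quoted in the text for Table 4 (0.02, 0.09, 0.21 for the first three classes and 0.19, 0.08, 0.01 beyond 700 hours, with the 600–700 hour class taking the remainder), so treat them as illustrative.

# Build F(t), R(t) and Z(t) from failures counted in equal class intervals.
# Counts reconstructed from the relative frequencies quoted in the text; illustrative.
start_h, width_h = 300, 100
counts = [2, 9, 21, 40, 19, 8, 1]            # failures per 100-hour interval
n = sum(counts)

failed = 0
for i, c in enumerate(counts):
    t_end = start_h + (i + 1) * width_h
    f = (c / n) / width_h                    # pdf estimate over the interval
    R_start = 1 - failed / n                 # fraction still running at interval start
    Z = f / R_start                          # hazard rate over the interval
    failed += c
    F = failed / n                           # cumulative fraction failed by t_end
    print(f"by {t_end} h: F = {F:.2f}, R = {1 - F:.2f}, Z = {Z:.5f} per hour")
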

Table 7. Variation of hazard rate: pump data of Table 4

Figure 12. Hazard rate plot


2.3 The ‘whole-life’ picture


F(t), R(t) and Z(t) for all three component ages – running-in, useful-life and
wear-out - are compared in Figure 13.

Figure 13. Principal modes of failure


2.3.1 ‘Time to failure’ and what we mean by ‘failure’

Our frequent use of these expressions needs to be qualified. Time may, of course,
mean actual calendar time (before failure). However, it might sometimes
more appropriately refer to total running time, ie exclusive of stoppages, or
number of operational cycles, or whatever. Also, what constitutes a failure
depends on the operational requirement, even for the simplest electrical
component. Failure of a resistor, for example, could be a break, or a short, or either
one of these. There may be several different ways in which the operation of
a mechanical item could be degraded, but only one or two which would
actually impair system performance should they occur, eg a hydraulic valve
could suffer an internal leak, an external leak, failure to close, failure to open, or
spurious operation; also, a partial leak may be acceptable.

2.3.2 The whole-life item failure profile


By combining the three Z(t) curves of Figure 13 a single Z(t) curve as in
Figure 14 can be obtained which, broadly speaking, gives the whole-life profile
of failure probability for the generality of components. This is the much quoted,
and much abused, 'Bath-Tub Curve'. This is only the 'Bath-Tub Curve' when the
variable on the 'y-axis' is the Hazard Rate, Z(t), as we have defined it here. The
actual level of Z(t), the time scale involved and the relative lengths of the three
phases, will vary by orders of magnitude from one sort of item, and one
application, to another. Furthermore, in any specific case one or two of the phases
could be effectively absent (eg in the case of high reliability aircraft control gear,
where running-in failure is negligible and wear-out effectively non-existent).
Human beings strikingly exemplify this behaviour. Figure 15 is taken from Green
and Bourne. It gives the Hazard Rate for the UK male population in the 1960s
(where in this case the hazard is death, and the rate is called – by actuaries, the
people who calculate insurance premiums – the Force of Mortality).


Figure 14. Typical Z(t) characteristics for engineering devices

Figure 15. Death rate characteristic for males living in England and
Wales for the years 1960 - 1962


Figure 14 suggests that if an item's whole-life behaviour conformed to that shown,
and the various Hazard Rates and time scales were known, then perhaps we should
a) run such items in, for the duration of Phase I, before putting them into service,
b) carry out nothing other than minimally intrusive on-line routines, such as
lubrication and simple condition monitoring, during Phase II,
c) replace or overhaul early in Phase III.

Of course, in practice the necessary statistical information would rarely be known,


and even if it were there would be many other factors - output needs, economics,
opportunities - to take into account when identifying appropriate maintenance
policy.

2.3.3 Diagnosis of recurrent failures and prescription of the remedy


As has been shown, the shape of the pdf of time-to-failure or, more strikingly, the
trend of the Hazard Rate plot can, however, indicate whether a recurring failure is
of the run-in, wear-out or purely random kind. Combined with appropriate
physical or chemical investigations (eg microscopic examination of wear or
fracture surfaces, chemical analysis of lubricants) this information can facilitate
the diagnosis of the cause of the failure and prescription of remedial action. If, for
example, the failure was clearly a wear-out, the choice would be between
time-based, condition-based or failure-based replacement, or design-out.
The final judgement, however, would probably have to take into account a
combination of economic, environmental and safety factors.


2.4 Now try this exercise


Fifty observed times-to-failure (in weeks) of a type of pump are given
below. Group the data into 8 – 10 equal class intervals (NB Think
carefully about (i) where you set the boundaries of these, and (ii) what the
value of each interval's 'class mark' – see p. 5 – is) and hence–

a) Plot a histogram (see example Figure 6) of the distribution,

b) Calculate the mean and standard deviation, using your class-interval data,

c) Using a 'scientific' hand calculator find the mean and standard deviation
of the un- grouped data and compare with the result of (b),

d) Plot the variation in F(t), R(t), and Z(t) on ordinary graph paper

138 123 123 138


129 147 154 101
148 151 148 138
132 164 168 135
109 109 152 129
74 144 136 110
112 115 178 147
81 138 96 132
147 112 158 153
129 125 184 117
136 152 165 161
126 168 117 107
139 116


3 Weibull Analysis of Lifetime Data


Although, as explained in Unit 2, the Normal pdf can be used to represent wear-out
failure, the Exponential pdf purely random failure and the Hyper-exponential pdf
running-in failure (and statisticians have, in fact, conjured up many other such pdfs
for these purposes), there is one particular pdf, the Weibull pdf, which, because it
can represent any of the three basic types of failure, has been found to be
particularly useful. In addition, it has two other practical virtues, viz
(i) applicability via simple graphical techniques,
(ii) defining-parameters (the terms in the formula) which have an everyday
engineering significance.
The ideas underlying this pdf may be grasped from Weibull's own derivation,
which was neither mathematical nor statistical, but was based on a few practical
considerations.

Weibull (a Swedish metallurgist working with the Bofors steel company) was
involved in analysing the results of load tests on many, nominally identical, test
specimens of a particular type of steel. Their Ultimate Tensile Strengths exhibited
random variability, as they always do. If F(x) was defined as the cumulative
fraction which exhibited strengths less than a particular load x (ie F(x) was the
cumulative distribution function or cdf, the distribution of the probability that a
specimen would fail under the load x), then a plot of F(x) looked like the one shown
in Figure 16. None failed before some given load x0 (the guaranteed strength) and
a few hung on to quite large loads.

Firstly, Weibull conjectured that it might be possible to represent such a cdf fairly
accurately by the expression
F(x) = 1 – exp{–φ(x – x0)},

where φ(x – x0) would be some function, as yet undefined, of (x – x0), eg 3(x – x0)
or (x – x0)², or whatever, which itself increased as x increased. This would give
a plot which started at x0 and approached F(x) = 1 asymptotically, as required.
However, φ(x – x0) would have to be such that it gave the right rate of rise of the
value of F(x), and would have to be dimensionless (because it is an exponent).

Figure 16. Probability of failure of specimens as a function of load

Weibull found that the form

φ(x) = {(x − x0)/η}^β

where η was a characteristic load (determining, along with x0, the scale of the loads
involved) and β a curve-shaping factor, gave him an expression for the cdf

F(x) = 1 − exp{−[(x − x0)/η]^β}
which enabled him to correlate his test data very well. In addition, the expression
had some other very useful properties, as we shall see.

In the reliability problems that we are looking at here the stressing factor is not load
but running time t, since new or since last overhaul. The Weibull cdf for times to
failure is therefore written as

F(t) = 1 − exp{−[(t − t0)/η]^β}

From this, some simple mathematics then leads to the appropriate expressions for
the Weibull pdf f(t), Reliability R(t), and Hazard Rate Z(t), ie –

R(t) = exp{−[(t − t0)/η]^β}

f(t) = (β/η)[(t − t0)/η]^(β−1) exp{−[(t − t0)/η]^β}

Z(t) = f(t)/R(t) = (β/η)[(t − t0)/η]^(β−1)

Each of the terms in these expressions has a practical meaning and significance.
The threshold time-to-failure, or guaranteed life t0: In many cases of wear-out
the first failure does not appear until some significant running time t0 has elapsed.
In the Weibull expressions the time factor is therefore the time interval (t - t0).
The characteristic life, η: When t − t0 = η, R(t) = exp(−1) = 0.37, ie η is the interval
between t0 and the time at which it can be expected that 37 per cent of the items
will have survived (and hence 63 per cent will have failed).
The shape factor, β: Figure 17 shows how the Weibull pdf of time-to-failure
changes as β is changed (for clarity, on each plot t0 = 0 and η = 1).

*If β is significantly less than one, the pdf approximates to the hyper-exponential,
ie is characteristic of 'running-in' failure;
*If β = 1 the pdf becomes the simple negative exponential, characteristic of
'purely random' failure;
*As β rises above a value of about 2, the pdf converges ever more closely to the
Normal pdf, characteristic of 'wear-out' failure.

Figure 17. Influence of shape factor 𝜷 on the form of the Weibull pdf
of time to failure (For all plots t0 = 0 and 𝜼 = 1)

NB For the first two cases t0 must be zero, of course; for the wear-out case it may
or may not be. Also note from Figure 17 that 𝛽 characterises the consistency of
failure occurrence. The larger its value the greater is the tendency for the failures
to occur at about the same running time.

3.1 Now try this exercise


Get out your calculator and some ordinary graph paper. Assign actual values
to t0, 𝜼 and 𝜷 and, using the expressions for R(t) and Z(t) given earlier,
plot graphs of R(t) and Z(t) vs t. Firstly, try: t0 = 0, 𝜼 = 1 and 𝜷 = 0.5, 1.0,
2.0, and 4.0 respectively (Values of t from 0 to 4.0, say, in steps of 0.25).
Then try: t0 = 2, 𝜼 = 4 and 𝜷 = 3.0 (Values of t from 0 to 8.0, say, in steps of
0.25).
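
If you prefer to check your hand calculations, here is a minimal Python sketch of the same exercise, using the Weibull expressions given earlier and the suggested parameter values:

import math

def weibull_R(t, t0, eta, beta):
    """Weibull reliability R(t) = exp(-((t - t0)/eta)**beta) for t >= t0."""
    if t <= t0:
        return 1.0
    return math.exp(-(((t - t0) / eta) ** beta))

def weibull_Z(t, t0, eta, beta):
    """Weibull hazard rate Z(t) = (beta/eta) * ((t - t0)/eta)**(beta - 1)."""
    if t <= t0:
        return 0.0
    return (beta / eta) * (((t - t0) / eta) ** (beta - 1))

t0, eta = 0.0, 1.0
for beta in (0.5, 1.0, 2.0, 4.0):
    for t in (0.25, 0.5, 1.0, 2.0, 4.0):
        print(f"beta={beta}: t={t}: R={weibull_R(t, t0, eta, beta):.3f}, "
              f"Z={weibull_Z(t, t0, eta, beta):.3f}")
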

3.2 Weibull probability paper
How do we test whether observed times-to-failure look as if they could be plausibly
represented by a Weibull cdf? In the language of statistics, whether they look as if
they have been sampled from such a distribution? And if they do, how do we
determine the values of t0,  and  which will give the distribution which best fits
the data? One easy way is to use Weibull probability graph paper. There are
several versions of this; we shall use the one that is most popular in the UK,
marketed by the Chartwell technical graph paper company (Ref No. 6572 in their
list). A full size sheet is reproduced on page 7. On this, the y-axis variable is the
cumulative fraction failed, F(t), expressed in per cent, and the x-axis variable is (t
- t0), in whatever are the appropriate units of time for the particular item studied
(as explained in Unit 2, 'time' in this context is a measure of usage and might
appropriately be 'number of operational cycles'). The axial scales are so arranged
that if a theoretical Weibull cdf were to be plotted on the paper (ie using values of
F(t) calculated from the expression given earlier) the points would lie on a perfectly
straight line. The following example shows how the paper is used.

3.3 Weibull analysis of a large and complete sample of times-to-failure
One hundred identical pumps have been run continuously and their times-to-
failure noted. A Weibull fit to the data is required.
1. The data is tabulated as in Columns 1 and 2 of Table 8
2. Successive addition of the figures in Column 2 leads to Column 3, the
cumulative percentages of pumps failed by the ends of each of the class
intervals of Column 1.
3. Three or four possible values, thought likely to span the actual value, are
assigned to t0 (the guaranteed life). The resulting values of t - t0 are tabulated
in columns 4, 5 and 6. NB In each case t is the time of the end of the interval,
eg in Column 4, Row 3: t - t0 = 1300 - 800 = 500 hours.
4. On the Weibull probability paper, the Column 3 figures are plotted firstly
against those in Column 4, then against those in Column 5 and Column 6
respectively. The result is shown in Figure 18, which also includes plots for t0
= 700 h and t0 = 1100 h. The value of t0 which results in the straightest plot,
in this case 900 hours, is the one which gives a Weibull cdf which best
represents the data.

Table 8. Pump failure data

5. The characteristic life, η, is the value of t − t0 at which the line fitted to the
straightest plot reaches the 63% failed level, in this case 600 hours. (NB t − t0
= 600 h corresponds to a total actual running time of t = 1500 h, remembering
that t0 = 900 h)
6. As shown, a perpendicular is dropped from the fixed 'Estimation point' (printed
just above the top left-hand corner of the diagram) to the straight line fit. The
point at which this perpendicular intersects the special β scale at the top of
the graph gives the value of β for the best-fit cdf (in this case approximately
3.5, clearly pointing to a wear-out mode of failure)
7. Note that the perpendicular also intersects another scale, marked Fμ. This gives
the value of the cumulative percent failed, in this case 49.8%, at the mean life.
From the main straight-line plot we then read off that at this value of F(t), t −
t0 = 540 hours. So, since the estimated t0 is 900 hours, we deduce that the
estimated mean life is 900 + 540 = 1440 hours.
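
When probability paper is not to hand, essentially the same fit can be made numerically: since F(t) = 1 − exp{−[(t − t0)/η]^β} implies ln(−ln(1 − F)) = β ln(t − t0) − β ln η, a straight-line fit of ln(−ln(1 − F)) against ln(t − t0) for a trial t0 gives β as the slope. The Python sketch below is an illustrative stand-in for the graphical method; the times and cumulative fractions are placeholders, not the Table 8 figures.

import math

def weibull_fit(times, F, t0=0.0):
    """Least-squares fit of ln(-ln(1-F)) vs ln(t - t0): returns (beta, eta).
    A numerical stand-in for the graphical probability-paper method."""
    pts = [(math.log(t - t0), math.log(-math.log(1.0 - Fi)))
           for t, Fi in zip(times, F) if t > t0 and 0.0 < Fi < 1.0]
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    beta = (n * sxy - sx * sy) / (n * sxx - sx * sx)     # slope of the fitted line
    intercept = (sy - beta * sx) / n                      # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta

# Illustrative use: interval end-times (hours) and cumulative fractions failed.
times = [1100, 1200, 1300, 1400, 1500, 1600, 1700]
F =     [0.03, 0.10, 0.25, 0.45, 0.63, 0.82, 0.95]
print(weibull_fit(times, F, t0=900.0))

Trying several values of t0 and keeping the one that gives the best straight-line fit mirrors step 4 of the graphical procedure.
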

Figure 18. Weibull plot of pump failure data from Table 8.

3.3.1 Example
Plot the pump failure data of Table 4, Unit 2 (ie the cumulative, F(t), data)
on the Weibull paper provided and hence obtain values of t0, η and β for the
distribution.
(You should find t0 = 200 hrs, η = … hrs and β = 4.0, APPROXIMATELY –
Remember, this is a graphical technique!)

3.4 Median ranks
In practice, it is most often the case that only a handful of times-to-failure have
been recorded – the data sample is small, and possibly incomplete, ie
'censored'. Indeed, the items under examination might be large and expensive and
only a few might yet have been made. In addition, some of the items might still be
running, not having reached the failure point (ie be 'suspended') or may have been
withdrawn from the trial (ie 'censored') because, in their case, the test conditions
were accidentally altered.

In this situation the results of any analysis will necessarily be subject to greater
statistical uncertainty. A Weibull analysis might still be needed, however, on the
grounds that an approximate result at the end of a fortnight may be of more value
than a precise one obtained by waiting for another three months.

In the above case, a technique in which so-called 'Median (or 50%) Rank'
estimations of the F(t) values are plotted on Weibull probability paper can then
produce meaningful estimates of t0, η and β. It has the additional advantage that,
using published tables of 5% Rank and 95% Rank estimations, confidence limits
can be assigned to the estimated parameters. The following example shows how
the technique is applied.

Ten springs are being tested to failure. The situation to date is as shown in Table 9.
A Weibull fit to the data is required.
1. The failure points are ranked in ascending order (Table 10, Column 2) and
classified as failed 'f' or suspended (or censored) 's' (Table 10, Column 3)

Table 9. Spring failure data

Table 10. Ranked failure data

2. For the first failed item the 'New Increment' is calculated from the formula

New Increment = [N + 1 − (Order Number of previous failed item)] / [N + 1 − (number of previous items)]

Where N = total number of items in the sample (in this case, 10). Since this is
the first failed item, the previous Order Number is zero. Also, in this case, the
number of previous items is zero and therefore the calculated New Increment is
1 (Column 4).
Note that if the first failure had been preceded by some suspended items the
New Increment would have been greater than 1, eg if the first two items had
been suspended, the New Increment would have been –

(10 + 1 − 0) / (10 + 1 − 2) = 1.22

3. The 'Order Number' of the first failed item is then obtained from the
expression
Order Number = New Increment + previous Order Number
ie in this case, Order Number = 1 + 0 = 1 (Column 5).
4. This procedure is repeated for all the remaining failed items, in succession, ie

Second failed item: New Increment = (10 + 1 − 1) / (10 + 1 − 1) = 1

Order Number = 1 + 1 = 2

Third failed item: New Increment = (10 + 1 − 2) / (10 + 1 − 3) = 1.125

Order Number = 1.125 + 2 = 3.125

and so on......
NB After a suspended item, or group of suspended items, the value of the New
Increment remains constant, and therefore need not be revised, until the next
suspended item or group of such items.

[Mini-exercise: Check for yourself, from the data in this example, that this is
indeed the case]
5. Having completed Column 5, the corresponding 'Median Ranks' (Column 6)
are calculated from the formula
Median Rank = (Order Number − 0.3) / (N + 0.4)

eg for the fourth failed item

Median Rank = (4.250 − 0.3) / (10 + 0.4) = 3.950 / 10.4 = 0.380

6. The Median Ranks, expressed as percentages, are then plotted against cycles
(or time) to failure on Weibull graph paper, as in Figure 19, and values of η,
the mean life and β obtained (in this case approximately 8200 cycles, 7400
cycles and 3.0 respectively), in the same way as in the previous example. In
this case t0 = 0, this giving a good straight line plot, but in the general case t0
would be established by trial plots exactly as previously.
7. Tables of Median (or 50%) Ranks (derived via the formula used above - called
'Benard's Estimate’) and also of ‘5% Ranks' and ‘95% Ranks', for various
sample sizes and Order Numbers, can be found in various textbooks [eg in
Andrews and Moss, and in O'Connor, see bibliography at the end of Unit 4]
and are also appended to this unit. It is a simple matter to plot these Ranks, as
well as the Median Ranks, against the observed cycles (or times) to failure, thus
obtaining (see Figure 19) the 90% Confidence Band for the data, ie the band
within which it is 90% probable that the plot would lie that would be obtained
from a very large number of items. In the absence of suspended items the
procedure is, of course, much more straightforward, in that the Order Numbers
are simply 1, 2, 3, 4 ... N.
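
The Order Number and Median Rank bookkeeping set out in steps 2 to 5 is easy to automate; a minimal Python sketch follows (the times used to exercise it are placeholders, not the Table 9 spring data):

def median_ranks(items):
    """Order Numbers and Benard Median Ranks for failed ('f') / suspended ('s')
    items, already sorted by time.  items = [(time, 'f' or 's'), ...]."""
    N = len(items)
    results = []
    prev_order = 0.0
    for i, (time, status) in enumerate(items):      # i = number of previous items
        if status == 's':
            continue                                # suspensions get no Order Number
        # New Increment = (N + 1 - previous Order No.) / (N + 1 - no. of previous items)
        increment = (N + 1 - prev_order) / (N + 1 - i)
        order = prev_order + increment
        median_rank = (order - 0.3) / (N + 0.4)     # Benard's estimate
        results.append((time, order, median_rank))
        prev_order = order
    return results

# Illustrative times-to-failure (placeholders, not the Table 9 spring data).
sample = [(1.5e4, 'f'), (2.1e4, 'f'), (2.6e4, 's'), (3.0e4, 'f'),
          (3.4e4, 'f'), (4.0e4, 's'), (4.5e4, 'f')]
for time, order, mr in median_ranks(sample):
    print(f"t = {time:8.0f}: order number = {order:.3f}, median rank = {mr:.3f}")

The median ranks produced this way are the values that would be plotted, as percentages, on the Weibull paper.
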

Figure 19. Weibull plot for spring failures, Median, 5% and 95%
Ranks. Data of Table 10

3.4.1 Exercise
Ten identical lifts were operated under identical conditions. For each, the total
number of operating cycles achieved before breakdown was as follows –

Lift Cycles to breakdown


1 19 × 10⁴
2 7 × 10⁴
3 44 × 10⁴
4* 32 × 10⁴
5 16 × 10⁴
6 17 × 10⁴
7* 24 × 10⁴
8* 6 × 10⁴
9 4 × 10⁴
10 68 × 10⁴

*Lifts 4, 7 and 8 had not failed at the time of the survey

Using the Median Rank method, fit a Weibull pdf to the data, evaluating 𝜷
and 𝜼 (cycles to breakdown). Also, using the Rank Tables provided –
(a) fit a 90% confidence band
(b) hence estimate the number of cycles for which we could be 95% confident
that the lift Reliability (survival probability) would be at least 70%

[Answers: From the data alone t0 is clearly zero. From the plot, 𝜷 = 1.1–1.2,
𝜼 = 32–36 × 10⁴ cycles. (b) From the 95% Rank plot it can be seen that at t =
5 × 10⁴ cycles, there is 95% confidence that F(t) should be at most 30%, ie R(t)
should be at least 70%]

4 Assessing Reliability of Systems
If we had acquired enough statistical data – of the kind, and using the analyses,
described in Units 2 and 3 – on all the component items of an engineering system
then, in principle, we should be able to assess the expected reliability performance
of the system as a whole. Unfortunately, both the length and complexity of such
an assessment increase exponentially with the number of items involved and the
complexity of their inter-dependence. All is not lost, however. Valuable insights
into probable system performance may often be gained from straightforward hand
calculations based on simplified modelling of the system. An introduction to such
modelling now follows. A fuller explanation, with more worked calculations and
some useful tables of formulae covering the more commonly occurring cases, can
be found in the IMechE guidebook.
[Davidson and Hunsley (Eds), pp 45-91, see bibliography at the end of this
unit].

4.1 Reliability block diagram


When designing complex engineering plant one of the first tasks is to draw up a
'block diagram' in which each block represents one of the constituent sub-systems
or units. Clearly, this is going to be a simplification, an abstraction to facilitate
clearer and speedier analysis, and it will always be a matter of judgement – in the
light of experience – as to what should be the appropriate 'indenture' level, ie the
extent to which the plant, with all its engineered complexities, can be sub-divided
into a nexus of 'black boxes' each with just a few inputs and outputs. The result of
such an exercise could be a Schematic Block Diagram, showing how the items are
physically connected, or a Functional Block Diagram, showing flows of power,
material etc – with the inputs and outputs specified for each block. At a later stage
a Piping and Instrumentation (P&I) Diagram might also be constructed.

In a similar manner, the assessment of the overall system reliability may be


facilitated by constructing and analysing a Reliability Block Diagram (RBD). In
this, the connections symbolise the ways in which the system will function as
required and do not necessarily indicate the actual physical connections. Also, for
ease (and, very often, downright feasibility) of subsequent analysis each item is
usually modelled as either 'fully working' or 'totally failed' (conservatively, partial
failures are assumed to be total ones). An RBD is usually constructed on the basis
of the information provided by the Functional, Schematic, and/or P&I diagrams.

To illustrate this, consider a system comprising two fuel pumps, both of which are
normally working. The system is designed so that enough flow will be achieved
even if one of the pumps were to fail. The RBD would then be as in the second
diagram of Figure 20, which indicates that the required flow may be achieved via
either or both, of the pumps. This diagram would still be the appropriate RBD
whether the pumps were located in series, along a single fuel line, or in parallel,
provided that required flow would still be achieved even with one pump failed and
that such a failure would not block the fuel line. The RBD is determined by the
reliability logic, not by the physical layout.

Figure 20. Series and parallel reliability dependencies.

4.2 Series Reliability
Consider the simplest system, of just two units. If successful system operation
requires both units to be working (ie if either one fails the system fails) then for a
reliability assessment they are considered to be in series dependency and are
represented as in the RBD at the top of Figure 20.

If the failure probabilities of the units can be assumed to be independent (ie the
failure behaviour of one is not influenced by that of the other) then the expected
system reliability at any given time t is given by the product of the two estimated
unit reliabilities at that time (This is directly analogous to betting on a racing
'double', where the quoted odds against such an event are obtained by multiplying
together the separate odds against each horse winning its own race), ie

RS(t) = R1(t) × R2(t)

4.2.1 Example
Two units in series reliability dependency, so the RBD is as the first one in
Figure 20. The unit reliabilities at t = 10,000 operating-hours are 0.90 and 0.95
respectively. So, system reliability at 10,000 hours will be

0.90 × 0.95 = 0.855 or 85.5%

It can be simply shown that if both types of unit exhibit negative exponential pdfs
(ie they are in their useful life, 'random failures' phase, see Unit 2) then the system
as a whole will also exhibit such a pdf, ie

fS(t) = (λ1 + λ2) exp{−(λ1 + λ2) t},

and the system reliability will be

RS(t) = exp{−(λ1 + λ2) t},

the system mean-time-to-failure being

(MTTF)S = 1/(λ1 + λ2),

where the λ's are the respective unit failure rates.

4.2.2 Example
Items 1 and 2 are in series dependency, as above. If –
(MTTF)1 = 100 hrs, and therefore λ1 = 1/100 = 0.01 /hr,
(MTTF)2 = 200 hrs, and therefore λ2 = 1/200 = 0.005 /hr,

then, for the system,

fS(t) = 0.015 exp(−0.015 t),

RS(t) = exp(−0.015 t)

eg at 100 hrs,

RS = exp(−0.015 × 100) = 0.22 or 22%

and
(MTTF)S = 1/0.015 = 67 hrs
All that is required in the case of series reliability systems with more than two
units is a simple extension of these calculations – and the same general logic also
applies to availability calculations for series systems.
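
As an illustration (a sketch of my own, not part of the notes), the series formulas above can be coded directly; the numbers reproduce Example 4.2.2.

```python
import math

def series_reliability(rates, t):
    # R_S(t) = exp(-(lambda_1 + lambda_2 + ...) * t) for units in series,
    # each with a constant (negative exponential) failure rate
    return math.exp(-sum(rates) * t)

rates = [0.01, 0.005]                  # failures per hour, as in Example 4.2.2
print(series_reliability(rates, 100))  # ~0.22, ie about 22% at 100 hrs
print(1.0 / sum(rates))                # (MTTF)_S ~ 67 hrs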

4.2.3 Example
A simple process flow system of four units is required to run continuously, has
no redundancy (ie the units are in series reliability dependency), and the units are
subject to randomly occurring failure-and-repair outages which result in average
unit availabilities of 98, 98, 96 and 95 per cent respectively (see Figure 21).

Figure 21. Four units in series dependency, average availabilities as shown

The system is available only when all units are working, so its expected average
availability over a long period will be
AS = 0.98 × 0.98 × 0.96 × 0.95 = 0.88 or 88%
And hence its unavailability
US = 1 − AS = 1 − 0.88 = 0.12 or 12%.
[In this example, because all the unit unavailabilities are very small,
eg U1 = 1 − 0.98 = 0.02,
we could have used the approximation
US ≈ U1 + U2 + U3 + U4 = 0.02 + 0.02 + 0.04 + 0.05 = 0.13 or 13%.]

This example illustrates 'Lusser's Rule' – named after the Second World War
German ballistics engineer who, rather surprisingly, seems to have been the first
to state it in formal terms – that the reliability of a series arrangement will always
be less than that of its least reliable component.
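
The bracketed approximation above can be checked with a few lines of Python (an illustrative sketch, assuming independent units):

```python
availabilities = [0.98, 0.98, 0.96, 0.95]    # unit availabilities from Figure 21

A_s = 1.0
for a in availabilities:
    A_s *= a                                 # exact product rule for series systems

print(1.0 - A_s)                             # exact unavailability, ~0.12
print(sum(1.0 - a for a in availabilities))  # small-unavailability approximation, 0.13
```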

4.3 Active-parallel Reliability


Consider as before the simplest of systems, ie of just two units, both operating,
but in this case joined in such a way that the system only fails if both units fail.
Such an arrangement is termed an active-parallel dependency and is represented
as in the second RBD of Figure 20. For this, the logic is clearer if expressed in
terms of failure probabilities F(t) or unavailabilities U, rather than in reliability
R(t) or availability A as with series systems, ie in failure logic rather than in
success logic. Thus, the system fails only if both units fail, so its failure probability
is simply the product of the two unit failure probabilities (given the usual
assumptions about statistical independence and so on), ie
FS(t) = F1(t) × F2(t).

So,
RS(t) = 1 − FS(t) = 1 − F1(t).F2(t) = 1 − {1 − R1(t)}.{1 − R2(t)}

4.3.1. Example
Two units are in parallel dependency, their separate reliabilities being as in the
earlier example, ie at t = 10,000 operating hours R1 = 0.90 and R2 = 0.95. The
system reliability at that time is therefore
RS = 1 − (1 − 0.90).(1 − 0.95) = 1 − 0.10 × 0.05 = 0.995 or 99.5%.
As with series systems, for more than two units the algebra is simply extended, ie
RS(t) = 1 − {1 − R1(t)}.{1 − R2(t)}.{1 − R3(t)} ...
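
A minimal sketch (assuming independent, two-state units) of this failure-logic calculation for any number of active-parallel units:

```python
def parallel_reliability(unit_rels):
    # the system fails only if every unit fails, so multiply the unit unreliabilities
    q_all_fail = 1.0
    for r in unit_rels:
        q_all_fail *= (1.0 - r)
    return 1.0 - q_all_fail

print(parallel_reliability([0.90, 0.95]))   # 0.995, as in Example 4.3.1
```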
If both units were to be operating in their negative exponential phase then it can be
shown that the system pdf would not be a simple negative exponential. In fact, it
would be
fS(t) = λ1 exp(−λ1t) + λ2 exp(−λ2t) − (λ1 + λ2) exp{−(λ1 + λ2)t}
And in larger active-parallel configurations it would be a yet more complex
function.

Expressions for MTTF are likewise complicated, eg for the two-unit system
(MTTF)S = (1/λ1) + (1/λ2) − 1/(λ1 + λ2).
4.3.2. Example
If, for the two-unit active parallel system, λ1 = 0.01 /hr and λ2 = 0.005 /hr, then,
for the system,
(MTTF)S = (1/0.01) + (1/0.005) − 1/(0.01 + 0.005)
= 100 + 200 − 67
= 233 hrs.
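
The 233-hour figure can also be cross-checked numerically, since the MTTF is the area under R(t). The sketch below (my own, using simple rectangle-rule integration) makes that check:

```python
import math

l1, l2 = 0.01, 0.005          # failure rates per hour
dt, t, mttf = 0.1, 0.0, 0.0
while t < 5000.0:             # upper limit chosen so the remaining tail is negligible
    R = math.exp(-l1*t) + math.exp(-l2*t) - math.exp(-(l1 + l2)*t)
    mttf += R * dt            # accumulate the area under R(t)
    t += dt
print(mttf)                   # ~233 hours
```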

As before, the same logic applies to availability calculations.

4.3.3. Example
A system comprises three chemical reactors, RBD as in Figure 22. The system
would be down only if all three were down. The reactors are subject to randomly
occurring failure-and-repair outages which result in average availabilities of 60,
70, and 80 per cent respectively, ie average unavailabilities of 40, 30 and 20 per
cent – or 0.40, 0.30 and 0.20. For the system –

Unavailability, US = 0.40 × 0.30 × 0.20 = 0.024 or 2.4%,

Availability, AS = 1 − 0.024 = 0.976 or 97.6%.

Figure 22. Three units in active parallel dependency, availabilities as shown

The last three examples illustrate, of course, the general rule (the converse of
Lusser's Rule) that the reliability of a parallel (ie redundant) system will always be
greater than that of the most reliable of its component units.

4.3.4. Exercises
Calculate the availabilities of the two systems that are shown below in the form
of Reliability Block Diagrams (unit availabilities are as indicated).
(a) Three units in series, with availabilities 0.85, 0.90 and 0.95.

(b)

Ans. (a) 0.73 or 73%

(b) 0.999 or 99.9%
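
For (a), reading the diagram as three units in series, a one-line check (my own) confirms the quoted answer:

```python
print(0.85 * 0.90 * 0.95)   # ~0.727, ie 0.73 or 73%
```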

4.4 Active-parallel reliability with partial redundancy


In some cases the output from an active parallel system of several identical units
may be deemed acceptable if at least some of the units are working. Such systems
are termed partially redundant.

4.4.1. Example
An aircraft, with four completely independent engines – normally all operating –
will continue to fly safely if at least any two engines are working. The observed
mean failure rate (given that it has started successfully) of any one of these
engines is λ = 10⁻² /hr, ie MTTF = 100 hrs (not the sort of engine to which I would
entrust life and limb!). Given that the aircraft has taken off successfully, what is
the probability that it will not fly safely for 10 hours, assuming that all failures
other than engine failure are negligible?

Answers:

For each engine:


R(10 hrs) = exp(−λt) = exp(−0.01 × 10) = 0.905.
F(10 hrs) = 1 − 0.905 = 0.095

For the system of four engines

FS(10 hrs) = 0.095⁴ + 4 × 0.905 × 0.095³

(The first term is the probability of all four engines failing; the second is the
probability of any three engines failing while the fourth engine succeeds – NB
there are 4 ways this can happen.)

= 8.15 × 10⁻⁵ + 3.10 × 10⁻³

= 3.2 × 10⁻³ or 0.32%
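
The same "at least 2 of 4 engines" calculation can be written compactly with the binomial coefficient, as in this sketch (assuming independent, identical engines):

```python
from math import comb, exp

lam, t = 0.01, 10.0              # failure rate per hour, flight time in hours
R = exp(-lam * t)                # single-engine reliability over 10 hrs (~0.905)
F = 1.0 - R

# the aircraft fails only if 3 or 4 of its 4 engines fail
F_sys = comb(4, 4) * F**4 + comb(4, 3) * F**3 * R
print(F_sys)                     # ~0.0032, ie about 0.32%
```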

4.5 Inactive parallel, or standby, reliability


In much process or power plant active-parallel arrangements may not be desirable,
or even feasible. If, for example, a vessel was to be fed by two positive displacement
pumps in active-parallel it might be over-pressurised, so a better arrangement would
be to have one pump in-line and the other on standby, as modelled in the lowest
RBD in Figure 20, ie to have an inactive-parallel system.

Analysis, even of this very simple arrangement, can be quite complex. There are
several different ways in which it could fail, viz
(i) failure of both unit 1 and unit 2,
(ii) failure of D to divert when required,
(iii) D wrongly diverting to a failed unit,
(iv) failure of D to transmit flow.

In addition, there are several different repair policies which could be adopted for the
failed, off-line, unit.

Consider, for example, one of the simplest cases. D perfectly reliable, both units in
the negative exponential phase ie R(t) = exp(−λt), no repair or replacement. System
operates until unit 1 fails, then diverts to unit 2 and operates until it fails. System
failure therefore requires two failures, failures occurring at average rate λ. It can
be shown that, for the system,
f(t) = λ²t exp(−λt),
and,
R(t) = (1 + λt) exp(−λt)
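
A short sketch of these two-unit standby formulas (my own illustration; the failure rate below is an arbitrary value and the changeover device D is assumed perfect):

```python
import math

lam = 0.01                                   # assumed failure rate per hour

def standby_reliability(t, lam=lam):
    # R(t) = (1 + lambda*t) * exp(-lambda*t) for two identical units, one on standby
    return (1.0 + lam * t) * math.exp(-lam * t)

def standby_pdf(t, lam=lam):
    # f(t) = lambda^2 * t * exp(-lambda*t)
    return lam**2 * t * math.exp(-lam * t)

print(standby_reliability(100.0))            # ~0.74 at 100 hrs
print(standby_pdf(100.0))                    # ~0.0037 per hour at 100 hrs
print(2.0 / lam)                             # MTTF of the pair: 200 hrs, twice a single unit's
```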
Extension of this analysis up to any number N of units in parallel, with n units on
standby, is relatively straightforward. If, however, the failed units can be repaired
calculation of system reliability RS(t), or of average system availability AS,
demands that we account for
(i) random variation in repair time,
(ii) spares waiting time,
(iii) availability of maintenance resources (men, tools and spares),
(iv) repair priority rules, etc.

Clearly, one possibility is that on-line units could fail while the parallel units might
still be under or awaiting repair. Analysis can be very complex and is generally
accomplished via Markov Discrete State Chain Analysis or via Simulation – either
technique usually requiring iterative numerical procedures. Powerful PC-based
software is, however, readily available for doing this.

So far, it has been assumed that the unit or system can be in one of only two states,
working or failed. It will often be the case, of course, that some of the units, and
hence the system, can operate in a partially reduced state. If the likelihood of this is
significant it has to be considered in the reliability calculation, lengthening it still
further.

Combining analyses of the kind described into an assessment of the reliability of a
system of many units, joined in a complex arrangement of process, signal or energy
paths, is a lengthy job. However, there are some ways of systematising the task, as
we shall now see.

4.6 Complex systems: some further methods of reliability analysis
4.6.1. Truth table
This is actually the least analytical approach. It is best explained by working through
a simple illustration, which does not really call for this procedure, but which will
serve to explain it effectively.

Consider the system of Figure 23. Let X, Y, Z be the respective reliabilities R (t),
say, of the units at some time t of interest (they could also have been the long-term
average unit availabilities, if system availability had been the parameter of interest).
So (1 − X), (1 − Y) and (1 − Z) will be the respective failure probabilities F(t). Denoting
working status by 1 and failed by 0, we can now construct
Table 11, where each row shows firstly a possible combination of the unit states,
secondly the resulting system state, and finally, in the last column, the probability
of occurrence of that combination. In this case eight different combinations or
scenarios are possible, and thus there are eight rows in the table.

Figure 23. A system configuration.

Table 11. Truth table for system of Figure 23.
Only the first three states are operational ones, so the system reliability is given by
the sum of the first three probabilities ie

RS(t) = XYZ + XY(1 − Z) + X(1 − Y)Z

= XY + XZ − XYZ
= RX(t).RY(t) + RX(t).RZ(t) − RX(t).RY(t).RZ(t)

4.6.2. Example
For the above system it has been found that, at t = 20 hrs, RX = 0.80 and both RY
and RZ = 0.70, so the expected system reliability at that time will be

RS(20) = (0.80 × 0.70) + (0.80 × 0.70) − (0.80 × 0.70 × 0.70) = 0.73 or 73%

The Truth Table method can be extended to consider reduced-output unit states.
Analysis of anything other than the smallest systems, however, has to be
computerised because even with the two-state model (units assumed either fully
operational or completely failed) each additional unit doubles the number of
possible system states (rows in the table) and hence the computing time. If the units
are of high reliability this problem can be alleviated by assuming that system states
with more than, say, three or four failures at any one time have a negligible
probability of occurrence, which considerably reduces the length of the table.
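
The enumeration behind a truth table is easy to automate. The sketch below is my own; the system logic is inferred from the algebra above (ie X in series with Y and Z in active parallel) and it reproduces the 73% figure of Example 4.6.2.

```python
from itertools import product

R = {'X': 0.80, 'Y': 0.70, 'Z': 0.70}      # unit reliabilities at t = 20 hrs

def system_works(state):
    # inferred logic of Figure 23: X in series with (Y in parallel with Z)
    return state['X'] and (state['Y'] or state['Z'])

R_sys = 0.0
for bits in product([True, False], repeat=3):   # all 2^3 combinations of unit states
    state = dict(zip(R, bits))
    p = 1.0
    for unit, up in state.items():
        p *= R[unit] if up else (1.0 - R[unit])  # probability of this combination
    if system_works(state):
        R_sys += p                               # sum over the operational states only
print(R_sys)                                     # ~0.728, ie 73%
```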

4.7 System reduction
The analysis of a large RBD may be carried out in stages, at each of which small
groups of units are analysed by the methods so far described, allowing the RBD to
be re-drawn in a reduced form.

4.7.1. Example
Consider the system RBD shown in Figure 24(a), where the number in the bottom
right hand corner of each numbered block is the assessed average availability of
the unit represented by that block.

Stage 1. Take each series-connected group, calculate its availability and assign
that value to an 'equivalent' single unit, which replaces the original group, ie –

For groups (2, 3), (4, 5) and (6, 7), availability = 0.90 × 0.80 = 0.72
For groups (8, 9, 10) and (11, 12, 13), availability = 0.95 × 0.90 × 0.80 = 0.684

Using these results the RBD can be re-drawn as in Figure 24(b).

Stage 2. Take each parallel-connected group in Figure 24(b), calculate its
availability and assign that value to an equivalent single unit, as in Stage 1.

For group (2-7), operational if at least two channels out of the three are working,
availability = 0.72³ + [3 × 0.72² × (1 − 0.72)] = 0.81.

(The first term is the probability of all 3 channels being up; the second is that of
2 channels up and 1 down – there are 3 ways this can occur.)

For group (8-13), availability = 1 − (1 − 0.684)² = 0.90,

so the RBD can be redrawn as in Figure 24(c).

Stage 3. On Figure 24(c) proceed as in Stage 1, ie –

For group (1-7), availability = 0.95 × 0.81 = 0.77,
and the RBD is reduced to Figure 24(d).

Stage 4. The remaining, parallel-connected, group is now readily analysed, ie –

System availability = 1 − (1 − 0.77)(1 − 0.90) = 0.98 or 98%.
Figure 24. Successive reduction of a large configuration
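
For those who like to check the arithmetic, the staged reduction of Example 4.7.1 can be scripted as below (an illustrative sketch of my own, assuming independent, two-state units; the figures are the worked values quoted above):

```python
def series(*avail):
    a = 1.0
    for x in avail:
        a *= x                  # product rule for units in series
    return a

def parallel(*avail):
    u = 1.0
    for x in avail:
        u *= (1.0 - x)          # product of unavailabilities for active parallel
    return 1.0 - u

# Stage 1: series groups
channel = series(0.90, 0.80)            # groups (2,3), (4,5), (6,7): 0.72
branch  = series(0.95, 0.90, 0.80)      # groups (8,9,10), (11,12,13): 0.684

# Stage 2: 2-out-of-3 channels, and two parallel branches
group_2_7  = channel**3 + 3 * channel**2 * (1 - channel)   # ~0.81
group_8_13 = parallel(branch, branch)                      # ~0.90

# Stages 3 and 4
group_1_7 = series(0.95, group_2_7)                        # ~0.77
print(parallel(group_1_7, group_8_13))                     # ~0.977, ie 98%
```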

Should a system, no matter how small, contain a 'cross-linked' unit (ie one
incorporated into more than one line or sub-system) then its analysis has to be

undertaken via the Truth Table method (or via a rather more sophisticated method
based on Bayes' Conditional Probability Theorem, which is not dealt with in this
unit). Powerful and user-friendly PC software for the reliability assessment of very
large complex systems and incorporating techniques such as Weibull Failure Data
Analysis (as in Unit 3), Failure Mode Effect and Criticality Analysis (FMECA),
Fault Tree Analysis, Markov Analysis and Simulation is available. Introductions to
these and other such matters are given in the IMechE handbook The Reliability of
Mechanical Systems.
[See bibliography at the end of this unit]

4.7.2. Example
Calculate, using the System Reduction procedure, the availabilities of the two
systems that are shown below in the form of Reliability Block Diagrams (unit
availabilities are as indicated).

4.8 Redundancy allocation for a series system
The reliability of a component can be increased by placing an identical component
in parallel to the existing component, both operating in what is called the active
redundancy mode. The act of adding the redundant component is called doubling.

For a system of n components connected in series where these components have
different reliability values, doubling also improves the reliability of the system. But
the question is whether there is an optimal sequence for adding components which
will give the largest gain in the reliability of the system at each addition.

Barlow and Proschan (1965) proposed a sequential search method for solving this
kind of problem and it is explained below.

Consider a single component with reliability 𝑝. If an identical component is added
in active redundancy with it, the reliability of the pair becomes 1 − (1 − 𝑝)², which
can be simplified to 𝑝(2 − 𝑝). Thus, it can be seen that the reliability is increased
by a factor of (2 − 𝑝) by this act of doubling.

Now consider a system comprising n components 𝑥1, 𝑥2, … , 𝑥𝑛 connected in
series. If the reliabilities of the components are 𝑝1, 𝑝2, … , 𝑝𝑛 respectively, then
the reliability of the system is given by

𝑅 = 𝑝1 𝑝2 … 𝑝𝑛 .

In this case, which of the n components should be doubled first in order to maximise
the ensuing reliability of the system? To answer this question, let us choose at
random a component to double, say 𝑥𝑖. The reliability of the system is
𝑅𝑖 = 𝑝1 𝑝2 … [1 − (1 − 𝑝𝑖)²] … 𝑝𝑛
or
𝑅𝑖 = 𝑅(2 − 𝑝𝑖 ). (1)

Eq.(1) states that the system reliability is improved by a factor of (2 − 𝑝𝑖), as
observed before where a single component is doubled. Furthermore, Eq.(1)
indicates that 𝑹𝒊 is maximum if 𝒑𝒊 is minimum. In other words, by doubling the
least reliable component, the system reliability increases the most. Repeating this
reasoning, we either add another component in parallel with 𝑥𝑖 (if the 𝑥𝑖 pair still
gives the lowest reliability among all components in the system) or double the least
reliable component other than 𝑥𝑖 , and so on.
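
A minimal sketch of this sequential doubling rule follows; the three starting reliabilities are illustrative values of my own, not taken from the notes.

```python
base  = [0.70, 0.85, 0.95]      # assumed component reliabilities p_i
count = [1, 1, 1]               # components in active parallel at each series position

def position_reliability(k):
    # a group of m identical components in active parallel: 1 - (1 - p)^m
    return 1.0 - (1.0 - base[k]) ** count[k]

def system_reliability():
    R = 1.0
    for k in range(len(base)):
        R *= position_reliability(k)   # series product over the positions
    return R

for step in range(3):           # add three redundant components, one at a time
    i = min(range(len(base)), key=position_reliability)   # least reliable position
    count[i] += 1
    print(f"add a component at position {i + 1}: "
          f"system R = {system_reliability():.4f}")
```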

Reference:
Barlow, R.E. and Proschan, F. (1965). Mathematical Theory of Reliability. New
York: Wiley.
4.9 Preventive maintenance planning

Having acquired pdfs of time-to-failure, mean failure rates, etc for the various parts
of a proposed system it will be clear that many of the parts will have a useful life
much less than that anticipated for the system. The desired system reliability will
therefore only be achieved if such parts are easily replaceable, ie such replacements
should –
(i) interfere as little as possible with the operation of other parts,
(ii) interfere as little as possible with normal operation or production,
(iii) be needed at intervals which exceed the maximum operational cycle or
production run.
If, in addition to the data on reliability, there is also information on the costs (direct
and indirect) of replacement, it might be possible, using one of the many Operational
Research models that have been developed for the purpose, to determine whether
such replacements should be periodic, grouped, opportunistic, on-failure, etc. More
often than not, however, replacement on
failure is the overall cheapest solution. Time-based replacement is cheapest only if
(i) the cost per item is much less than that of replacement on failure,
(ii) the item exhibits a pronounced wear-out behaviour.
Spares should be available when needed, of course, and statistical data on failure
rates etc are essential for determining optimum spares inventory policies.

4.10 Bibliography
Comprehensive introductions to this subject, with essential data, formulae,
statistical tables etc

Davidson J and Hunsley C (Eds) (1994) The Reliability of Mechanical Systems,
2nd Ed., Mechanical Engineering Publications, IMechE, London
(A 'must' if you can buy or borrow a copy. The right level for this course and part-
written by the author of this unit)

Andrews J D and Moss T R (1993) Reliability and Risk Assessment, Longman
Scientific, pp 262-298.
(A natural 'follow-on' to the above, for those who want to go more deeply into this
subject. Moss was the moving force behind, and a contributor to, Davidson and
Hunsley)

O'Connor P D T & Kleyner A (2012) Practical Reliability Engineering (5th Ed),
Wiley
(A good general intro., covering most aspects but not in too much analytical detail
and a good source of references. About the right level for this course)
