MSc Strategic Quality Management
Quantitative Methods - QUAN
RELIABILITY
Aims of Session
By the end of the session the student should understand the basic concepts in reliability and
be able to apply the principles to a range of problems for reliability improvement.
Learning Approach
On completion of the above tasks you should be able to do the self-assessment exercises in
the attached notes.
Reading
You will find further reading on this topic in Besterfield (2004), Chapter 11. O‘Connor (2002
or earlier editions) provides a more extensive coverage.
Contents
Introduction to Reliability
What is reliability?
Quantitative Approach to Reliability
Concept of Bath-tub Curve
Measuring reliability
Exercises
Reliability Distributions
Reliability Prediction Using Exponential Distribution
Introduction to Mean Time between Failures and Mean Time to Failure
Exercises
Self-Assessment Exercise
The addition and multiplication rules of probability
Reliability of Systems - Series, Parallel, and Combination Systems
Exercises
Self-assessment Exercise
Exercises
Reliability Improvement Techniques
Fault Tree Analysis
Mathematical Appendix
Healthy laptop case study
RELIABILITY
Introduction to Reliability
Reliability has gained increasing importance in the last few years in manufacturing
organisations, the government and civilian communities. With recent concern about
government spending, agencies are trying to purchase systems with higher reliability and
lower maintenance costs. In defence, for example, these systems range from complete
weapon systems such as aircraft or tanks, to individual critical components such as diodes,
resistors, transistors, IC's and so on. As consumers, we are mainly concerned with buying
products that last longer and are cheaper to maintain, i.e., have higher reliability. But what is
reliability and how do we measure reliability?
What is reliability?
The reliability of a product (or system) can be defined as the probability that the product will
perform a required function under specified conditions for a certain period of time.
[Figure: reliability is linked to increased sales, improved safety and increased competition]
We can quantitatively define the reliability of a component or system as the probability that it will meet the qualitative definition above. If R(t) denotes the reliability at time t - the probability of surviving to time t - then at the start, before anything has had time to fail,

R(0) = 1 = 100%

After this, when t > 0, R(t) will decline as some components fail.
The hazard rate (usually represented by h(t) or λ) is a very useful quantity. This is defined as
the probability of a component failing in one (small) unit of time.
h(t) = N_F / (N_S * Δt)
For example, if there are N_S = 200 surviving components after 400 seconds, and N_F = 8 components fail over the next Δt = 10 seconds, the hazard rate is given by

h(400) = 8 / (200 * 10) = 0.004 per second
This simply means that 0.4% of the surviving components fail in each second.
(You may wonder why the above equation defines h(400) and not h(410). The reason is that
Δt is a small time interval, so it is reasonable to assume that the hazard rate will not change
appreciably during the interval. We then define the hazard rate using the beginning of the
interval for convenience. In the extreme we can make Δt infinitesimally small—which is the
basis of the differential calculus.)
The term hazard rate applies to non-repairable items such as transistors, light bulbs,
microprocessors etc. The failure rate is defined in just the same way as the hazard rate,
except that it applies to repairable items such as computers and TV sets.
The so-called bath-tub curve represents the pattern of failure for many products – e.g.
semiconductor devices and electronic components. The vertical axis in Figure 2 is the hazard
or failure rate at each point in time. Higher values here indicate higher probabilities of failure.
The bath-tub curve is divided into three regions: infant mortality, useful life and wear-out.
Infant Mortality:
This stage is also called early failure or debugging stage. In this stage, the failure rate is high
but decreases gradually with time. During this period, failures occur because engineering did
not test products or systems or devices sufficiently, or manufacturing made some defective
products. Therefore the failure rate at the beginning of infant mortality stage is high and then
it decreases with time after early failures are removed by burn-in or other stress screening
methods. Some of the typical early failures are:
poor welds
poor connections
Useful life:
This is the middle stage of the bath-tub curve. This stage is characterised by a constant failure
rate. This period is usually given the most consideration during design stage and is the most
significant period for reliability prediction and evaluation activities. Product or component
reliability with a constant failure rate can be predicted by the exponential distribution (which
we come to later in this session).
Wear-out stage:
This is the final stage where the failure rate increases as the products begin to wear out
because of age or lack of maintenance. When the failure rate becomes high, repair,
replacement of parts etc., should be done.
[Figure 2: Extended bath-tub curve - hazard/failure rate plotted against time]
Can we extend the useful life stage of a product? Products can have their useful life extended
by proper design, conscientious assembly, and careful handling. This leads to a change in the
shape of the bath tub curve (see Figure 2).
Measuring reliability
To see the level and pattern of the reliability of a product or component in practice, it is
necessary to make some measurements. The simplest way to do this is to test a large number
of products or components until they fail, and then analyse the resulting data. Exercise 2
below shows how this works. This enables us to estimate the hazard rate and reliability after
different lengths of time - and decide, empirically, if the bath tub curve applies, or if the
hazard rate shows some other pattern.
There are a number of obvious difficulties which may arise. If the useful life is long it may not be practical to wait until products or components fail. It may be too expensive to test
large samples, so small samples may have to suffice. And for some products (e.g. space
capsules) it may be difficult to simulate operating conditions at all closely. There are a
number of approaches to these difficulties - most of these are beyond the scope of this unit,
but they are discussed in the reading suggested for this session. (An exception is the
prediction of the reliability of a product from information about the reliability of its
components - this avoids the necessity to test the whole product - which is discussed in a later
section of the notes for this session).
It is also possible that the failure rate will depend on environmental conditions in a
predictable way. For example, one of the key factors affecting the reliability of electronic
components and systems is temperature – basically the higher the temperature of the device
the higher the failure rate. Most computer equipment therefore has some form of cooling,
ranging from a simple fan to forced chilled air cooling. The following chart is from a report
on the effect of temperature on a single board computer published by Crane Surface Warfare
Center Division in 2001, and clearly shows the effect of temperature on failure rate.
[Figure 3: Failure rate plotted against temperature (25 to 95 degrees) for a single board computer, showing the failure rate rising steeply as temperature increases]
Now consider and answer the following questions, the purpose of which is to test your retention of the information given in the course notes for this session. When you have done this, check back on any questions which you could not answer or where you gave the wrong answer, then go on to the Self-Appraisal Exercise.
Questions
1 Would you expect the bath tub curve to apply to a car? What about a human
being?
2 One thousand transistors are placed on life test, and the number of failures in each time interval is recorded. Find the reliability and the hazard rate at each point in time. (You may find it helpful to set this up on a spreadsheet.)
Do you think this component shows the bath tub pattern of failure?
Suggested Answers
Note that the hazard rate is per hour. (This is why the figures are so small.) A
hazard rate of 0.10% means that the probability of a failure in a given hour is
0.10%.
Note also that I am using data from the whole interval from 0-100 hours to
calculate the figures for 100 hours. This is an approximation which may lead to
small inaccuracies.
This result shows that the hazard rate is initially 0.16% per hour, and then drops
to around 0.10% to 0.11%. There is an infant mortality phase, a useful life phase
with a fairly constant hazard rate, but no wear out phase. However, there are still
311 components left after the 1000 hour test - the wear out phase may become
apparent if the test were to be prolonged. Obviously, the best way to show this
pattern would be to draw a graph of hazard rate against time.
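The suggested answer above refers to a table of failure counts that is not reproduced here. Purely to show how the spreadsheet calculation works, the sketch below (in Python) uses made-up interval counts, chosen to be consistent with the figures quoted in the answer (an initial hazard rate of about 0.16% per hour, falling to around 0.10%, and 311 survivors after 1000 hours).

# Sketch of the life-test calculation for Exercise 2.
# The failure counts below are illustrative only - substitute the real table.

interval_hours = 100                                   # width of each time interval
failures = [160, 86, 78, 70, 64, 58, 52, 47, 42, 32]   # failures per interval (hypothetical)

n_start = 1000                                         # transistors at the start of the test
survivors = n_start

print("time  survivors  R(t)   hazard/hour")
for i, n_fail in enumerate(failures):
    t = i * interval_hours                 # start of the interval
    reliability = survivors / n_start      # R(t) at the start of the interval
    # hazard rate = failures per surviving component per hour
    hazard = n_fail / (survivors * interval_hours)
    print(f"{t:4d}  {survivors:9d}  {reliability:5.3f}  {hazard:.4f}")
    survivors -= n_fail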
Reliability Distributions
There are many statistical distributions used for reliability analysis—for example, the
exponential distribution, the Weibull distribution, the normal distribution, the lognormal
distribution, and the gamma distribution. In this unit we look at the exponential distribution
only, as this is the simplest and the most widely applicable.
The exponential distribution applies when the hazard rate is constant - the graph is a
straight horizontal line instead of a “bath tub”. (It can be used to analyse the middle phase
of a bath tub - e.g. the period from 100 to 1000 hours in Exercise 2 above.) It is one of the
most commonly used distributions in reliability, and is used to predict the probability of
survival to a particular time. If λ is the failure rate and t is the time, then the reliability, R, can
be determined by the following equation
R(t) = e^(-λt)
There is a brief note on the mathematical background to this equation in the Appendix to this
session.
To see that it gives sensible results, imagine that there are initially 1000 components and that λ is 10% per hour. After one hour about 10% of the original 1000 components will have failed
- leaving about 900 survivors. After two hours, about 10% of the 900 survivors will have
failed leaving about 810. Similarly there will be about 729 survivors after the third hour,
which means that the reliability after 3 hours is 0.729.
Using the exponential distribution the reliability after 3 hours, with λ = 0.1, is given by

R(3) = e^(-3λ) = e^(-0.3) = 0.741
(You can work this out using a calculator or a spreadsheet—see the mathematical appendix for
more details.)
This is close to the earlier answer, as we should expect. The reason it is not identical is that the method of subtracting 10% every hour to obtain 900, 810, etc. ignores the fact that the number of survivors is changing all the time, not just every hour. The exponential formula allows for this continuous change, and so gives the more accurate answer.
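If you want to check this comparison yourself, here is a minimal sketch in Python (a spreadsheet would do equally well); it should reproduce the figures 0.729 and 0.741 above.

import math

lam = 0.1      # failure rate per hour
hours = 3

# Crude method: remove 10% of the survivors at the end of each hour
survivors = 1000
for _ in range(hours):
    survivors -= lam * survivors
print(survivors / 1000)          # 0.729

# Exponential method: allows for continuous failure
print(math.exp(-lam * hours))    # 0.741 (to three decimal places)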
If the failure rate is small in relation to the time involved, a much simpler method will give reasonable results. Let's suppose that λ = 0.01 (1%) in the example above. The formula now gives the reliability after three hours as

R(3) = e^(-3λ) = e^(-0.03) = 0.9704
A simpler way of working this out would be just to say that if the failure rate is 0.01 per hour
the total proportion of failures in 3 hours will be 0.03 (3%) so the reliability after three hours
is simply
R(t) = 1 – 0.03 = 0.97 (or 100% – 3% = 97%)
This is not exactly right because in each hour the expected number of failures will decline as the surviving pool of working components gets smaller. But when the failure rate is 1% and we are interested in what happens after three hours, the error is negligible. On the other hand, if we want to know what happens after 300 hours the simple method gives a silly answer (the reliability would be negative!) and we need to use the exponential method. You should be able to check this answer with your calculator or a computer (the answer should be 0.0498, or about 5%).
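Again, this is easy to verify. The short sketch below contrasts the simple linear approximation with the exponential formula for λ = 0.01, at 3 hours and at 300 hours.

import math

lam = 0.01     # failure rate per hour

for t in (3, 300):
    simple = 1 - lam * t          # linear approximation
    exact = math.exp(-lam * t)    # exponential formula
    print(t, round(simple, 4), round(exact, 4))

# After 3 hours:   0.97 vs 0.9704 - almost identical
# After 300 hours: -2.0 vs 0.0498 - the approximation is useless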
Mean time to Failure (MTTF) and Mean time between Failures (MTBF)
MTTF applies to non-repairable items or devices and is defined as "the average time an item
may be expected to function before failure". This can be estimated from a suitable sample of
items which have been tested to the point of failure: the MTTF is simply the average of all the
times to failure. For example, if four items have lasted 3,000 hours, 4,000 hours, 4,000 hours and 5,000 hours, the MTTF is 16,000/4 or 4,000 hours.
The MTBF applies to repairable items. The definition of this refers to "between" failures for
obvious reasons. It should be obvious that
MTBF = Total device hours / number of failures
For example, consider an item which has failed, say, 4 times over a period of 16,000 hours.
Then MTBF is 16,000/4 = 4,000 hours. (This is, of course, just the same method as for
MTTF.)
The failure rate is simply the reciprocal of the MTBF: λ = 1/MTBF. For example, the item above fails, on average, once every 4,000 hours, so the probability of failure in each hour is obviously 1/4000. This depends on the failure rate being constant - which is the condition for the exponential distribution.
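As a small illustration of how the MTBF feeds into the exponential formula, the sketch below converts the 4,000-hour MTBF above into a failure rate and then into a reliability; the 1,000-hour period is just an arbitrary example, not taken from the notes.

import math

mtbf = 4000            # hours (from the example above)
lam = 1 / mtbf         # failure rate per hour = 0.00025

mission = 1000         # hours - an arbitrary illustrative period
reliability = math.exp(-lam * mission)
print(round(reliability, 3))   # about 0.779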
Questions
2 What is the highest failure rate for a product if it is to have a reliability (or
probability of survival) of 98 percent at 5000 hours? Assume that the time to
failure follows an exponential distribution.
3 Suppose that a component we wish to model has a constant failure rate with a mean time between failures of 25 hours. Find:
Suggested Answers
1 The compressor has a constant failure rate, so the lifetimes follow the exponential distribution and the reliability is given by:

R = e^(-λt)
3 (a) Since the failure rate is constant, we will use the exponential distribution.
Also, the MTBF = 25 hours. We know, for an exponential distribution, MTBF
= 1/λ, so λ = 1/25 = 0.04 per hour.
SELF-APPRAISAL EXERCISE
Exercise
Here follows an exercise to give you further information and food for thought.
1 The equipment in a packaging plant has a MTBF of 1000 hours. What is the
probability that the equipment will operate for a period of 500 hours without
failure?
2 During World War II, the hazard rate for bomber aircraft flying over Europe
was believed to be a 4% chance of non-return from each mission, however
experienced the pilot was. Calculate the probability that a crew member will
survive 25 missions. How many missions would it take to reduce a crew
member's probability of survival to 10%?
Microwave Hours
1 2300
2 2150
3 2800
4 1890
5 2790
6 1890
7 2450
8 2630
9 2100
10 2120
What is the mean time to failure of the microwave ovens? (Note that the mean
life of the microwave is defined in terms of their mean time to failure because
no maintenance is performed on the ovens).
Suggested Answers
2 As the hazard rate is constant we can use the exponential model with a hazard rate of 0.04.

R(25) = e^(-0.04 x 25) = e^(-1) = 0.37

Trial and error (or natural logarithms - see the mathematical appendix) shows that

e^(-2.3) = 0.1 (approximately)

so the probability of survival falls to 10% when 0.04 x n = 2.3, i.e. after about 57 or 58 missions.
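For completeness, here is a sketch of the remaining self-appraisal calculations done in Python rather than on a calculator. The microwave figure is simply the average of the ten test times listed in the exercise.

import math

# Exercise 1: MTBF = 1000 hours, so λ = 0.001; operating period of 500 hours
print(round(math.exp(-500 / 1000), 3))      # 0.607

# Exercise 2: constant hazard rate of 4% per mission
print(round(math.exp(-0.04 * 25), 3))       # 0.368, i.e. about 37%
missions = math.log(0.1) / -0.04            # solve e^(-0.04 n) = 0.1
print(round(missions, 1))                   # about 57.6, so roughly 58 missions

# Exercise 3: mean time to failure of the microwave ovens
hours = [2300, 2150, 2800, 1890, 2790, 1890, 2450, 2630, 2100, 2120]
print(sum(hours) / len(hours))              # 2312 hours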
The next aspects of reliability theory we will consider depend on some probability theory—
the addition and multiplication rules. I will explain these in terms of dice and cards.
Similarly if you choose a single card from a pack of (52) cards, the probability of getting an
ace or a picture card (jack, queen or king) is
P(ace or picture) = 4/52 + 12/52 = 16/52
Because 4 of the 52 cards are aces, and another 12 are picture cards.
These examples suggest that if we want the probability of something or something else
happening, we can add the probabilities.
But … what about the probability of an ace or a red card (hearts or diamonds). Can we say
that
P(ace or red) = P(ace) + P(red) = 4/52 + 26/52 = 30/52 ??
This is obviously wrong because two of the aces are also red, so we are in effect double
counting these aces if we add the probabilities. Before adding the probabilities you need to check that the two events cannot both occur - i.e. they do not overlap, but are mutually exclusive (each excludes the other).
To explain the and rule we need to do more than one thing, so let‘s throw the dice and draw a
card from the pack. How can we work out the probability of getting a 6 and a spade?
This is a little more complicated than the or rule. It helps to imagine doing the experiment
lots of times—say 1000 times.
Obviously you will get a 6 on about 1/6 of these thousand times—i.e. about 167 times. Now
think about how many times you will get a spade as well. This will happen on about ¼ of
these 167 times, or about 42 times out of the 1000 times we imagined doing the experiment.
So the probability is about 42/1000.
Now do the same thought experiment, but working in terms of the probabilities this time. We
will get a 6 on about one sixth of the times, and a spade as well on about one quarter of this
one sixth. One quarter of one sixth means the same as one quarter times one sixth, so we
simply multiply the probabilities:
P(6 and spade) = P(6)*P(spade) = 1/6 * 1/4 = 1/24
which is, of course, about 42/1000 as before.
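If you find the thought experiment unconvincing, it can be simulated directly. The sketch below repeats the dice-and-card trial many times; the observed proportion should settle close to 1/24, which is about 0.042.

import random

trials = 100_000
count = 0
for _ in range(trials):
    die = random.randint(1, 6)                                        # throw a die
    suit = random.choice(["spades", "hearts", "diamonds", "clubs"])   # draw a card's suit
    if die == 6 and suit == "spades":
        count += 1

print(count / trials)   # close to 1/24 = 0.0417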
Just like the addition rule there is an important assumption here. We assumed that the two
events are statistically independent. This means that the two probabilities are independent of
each other: knowing whether one has happened is of no help in assessing the probability of
the other happening. In the example above, knowing that we‘ve got a 6 is of no relevance to
what will happen with the cards—so these events are statistically independent.
But in some situations you need to be very careful about this assumption. Suppose you know
that in a particular place the probability of rain falling on a given day is 1/3. Can you say that
P(rain today and rain tomorrow) = 1/3 * 1/3 = 1/9 ??
In practice this is likely to be wrong because the two events are not likely to be independent.
In England certainly, if it rains today the probability of rain tomorrow is likely to be higher
than if it did not rain today because the weather tends to set in to wet or dry spells. This
means that the second probability should probably be rather more than 1/3, so the result is
likely to be substantially more than 1/9.
Here follows an exercise to give you further information and food for thought.
Two football teams are due to play each other three times in the next few weeks. The
data from their matches in the past suggests that the probability of Team A winning is
50%, and the probability of Team B winning is 30%
2 Assuming that the results of the three matches are statistically independent what
is the probability of Team B winning all three matches?
3 Assuming that the results of the three matches are statistically independent what
is the probability of Team B winning at least one of the matches?
Suggested Answers
1 The match must end in a win for Team A, or a win for Team B, or a draw.
These three probabilities are mutually exclusive so they must add up to
100%. This means the probability of a draw is 20% because
50%+30%+20%=100%.
3 This is a little more difficult. The easiest approach to "at least one" questions is to work out the probability of the event never happening, and then subtract it from 1. In this case, the probability of Team B failing to win each match is 70% (i.e. the probability of a Team A win or a draw), so the probability of failing to win all three matches is

70% * 70% * 70% = 7/10 * 7/10 * 7/10 = 343/1000 = 34.3%
In all other circumstances Team B will win at least one match, so the probability is

P(B wins at least one match) = 100% - 34.3% = 65.7%

which is fairly likely!
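The same "at least one" reasoning is easy to put into a few lines of code; the numbers below are the ones from the exercise.

p_b_win = 0.30          # probability Team B wins a single match
matches = 3

p_no_wins = (1 - p_b_win) ** matches        # B fails to win every match
p_at_least_one = 1 - p_no_wins

print(round(p_no_wins, 3), round(p_at_least_one, 3))   # 0.343 and 0.657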
This method is used to derive one of the formulae (for parallel
configurations) below.
Reliability of Systems - Series, Parallel, and Combination Systems

It is often useful to be able to estimate the reliability of a whole system from the reliability of
the individual components. This enables designers to predict reliability levels without having
to build and test the whole system. It also enables them to see weak spots in the system, and
to experiment with ways of improving these - perhaps by installing backups for critical
components.
There are three basic configurations to consider:

1 Series configuration
2 Parallel configuration
3 Combination of series and parallel
1 Series Configuration
For a series configuration, the system will fail if any of the components in the system fail.
Like links in a chain, if one link breaks, then the entire chain fails to perform the intended
function. This can be represented by a block diagram like the one in Question 1 in the next set of questions below.
If R1, R2, R3, ..., Rn are the reliabilities of the first, second, third, ..., nth components in a series system, then the system reliability is given by:

Rsys = R1 x R2 x R3 x ... x Rn
This follows from the rules of probability. It depends on the assumption that the probabilities are statistically independent. This means that knowing one of the probabilities is of no help in estimating any of the others: they can all be estimated independently of each other.
When several components are placed in series, the total system reliability decreases quickly.
If we assume that R1 = R2 = R3 = ... = Rn = R, then

Rsys = R x R x R x ... x R = R^n

Assume that each component has a reliability of 0.95; then the system reliability of 5 components in series is given by:

Rsys = 0.95^5 = 0.774
2 Parallel Configuration
In a parallel configuration, the system fails only if all of the individual components fail. If one component fails the others serve as a backup - see the diagram in Question 2 below.
Assuming that there are two components in the system, then for the system to fail, both components have to fail, which will happen with probability

(1 - R1)(1 - R2)

so the reliability of the parallel system is

Rsys = 1 - (1 - R1)(1 - R2)
Electrical circuits or mechanical assemblies are often designed in configurations which are
part series, part parallel. To increase the reliability of the system, wherever possible, low
reliability components are placed in parallel, and components with relatively high reliability
are in series.
The following steps could be useful in dealing with these mixed configurations.

Step 1: Calculate the equivalent reliability of components in series within any parallel configuration.

Step 2: Calculate the equivalent reliability of each parallel configuration, so that it can be treated as a single component in series.

Step 3: Multiply the reliabilities in series together until the overall system reliability is derived.
The calculations for both series and parallel configurations depend on the assumption of
statistical independence. The answers may be very different if the probabilities are not
statistically independent - an example is provided by the first self appraisal exercise below.
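The series and parallel rules, and the three-step reduction of a mixed configuration, can be written as two small functions. This is only a sketch, and it assumes statistical independence throughout; the mixed configuration at the end uses made-up reliabilities purely to illustrate the steps.

def series(*rs):
    """Reliability of independent components in series: multiply."""
    result = 1.0
    for r in rs:
        result *= r
    return result

def parallel(*rs):
    """Reliability of independent components in parallel:
    1 minus the probability that every component fails."""
    all_fail = 1.0
    for r in rs:
        all_fail *= (1 - r)
    return 1 - all_fail

# Five components of reliability 0.95 in series (the example in the text)
print(round(series(0.95, 0.95, 0.95, 0.95, 0.95), 3))   # 0.774

# Two components of reliability 0.95 in parallel
print(round(parallel(0.95, 0.95), 4))                   # 0.9975

# A hypothetical mixed configuration, reduced step by step: a branch of two
# 0.95 components in series, backed up by a single 0.90 component in
# parallel, all in series with a 0.99 component.
branch = series(0.95, 0.95)            # Step 1: series inside the parallel part
block = parallel(branch, 0.90)         # Step 2: reduce the parallel part
print(round(series(block, 0.99), 3))   # Step 3: multiply the series chain - 0.980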
Now consider and answer the following questions, the purpose of which is to test your retention of the information given in the course notes for this session. When you have done this, check back on any questions which you could not answer or where you gave the wrong answer, then go on to the Self-Appraisal Exercise.
Questions
1 Five components in series have individual reliabilities of 0.99, 0.97, 0.90, 0.98, and
0.94 - as shown in the following (block) diagram. Calculate the reliability of the
system.
2 Three components in parallel have individual reliabilities of 0.99, 0.90, and 0.85.
Calculate the reliability of the system.
[Block diagram for Question 2: three components with reliabilities 0.99, 0.90 and 0.85 connected in parallel]
3 [Block diagram for Question 3: components 1 and 2 (reliability 0.90 each) in series, followed by a parallel section in which components 3 and 4 (reliability 0.90 each, in series) form one branch and component 5 (reliability 0.90) forms the other branch.]

The first two components are in series, the last configuration in parallel. Find the system reliability.
Suggested Answers
1 Because the given system is a series configuration, the reliability of the system can be obtained by:

Rsys = 0.99 x 0.97 x 0.90 x 0.98 x 0.94 = 0.796

2 Because the components are connected in parallel, the reliability of the system is given by:

Rsys = 1 - (1 - 0.99)(1 - 0.90)(1 - 0.85) = 1 - 0.01 x 0.10 x 0.15 = 0.99985

3 Components 3 and 4 are in series, so Rs1 = 0.90 x 0.90 = 0.81. Now Rs1 forms a parallel system with component 5. Therefore the reliability of the parallel system can be determined as follows:

1 - (1 - 0.81)(1 - 0.90) = 1 - 0.19 x 0.10 = 0.981

Now we have formed a series system with reliabilities 0.90, 0.90 and 0.981. Therefore the system reliability is given by:

Rsys = 0.90 x 0.90 x 0.981 = 0.795
SELF-APPRAISAL EXERCISE
Exercise
Here follows another exercise to give you further information and food for thought.
[Exercise diagram: rocket booster seal showing O-rings and vent]
Suggested Answers
1 a A parallel system as the rocket will fail only if both o-rings fail.
c This assumes the two potential failures are statistically independent - which
may be unlikely in practice. A single factor - such as temperature - may be
responsible for failures in both. This means that, if we know one o-ring has
failed, our estimate of the probability of the other failing will be higher. In
the extreme case, one o-ring fails whenever the other does, and the
reliability of the system is 0.95. In practice, the reliability of the whole
system is likely to be between 0.95 and 0.9975 (the answer based on the
assumption of independence).
2 a 2% per hour
b R(100) = e^(-0.02 x 100) = e^(-2) = 0.1353

c The parallel system will only fail if all components fail. The probability of each failing is 1 - 0.1353 = 0.8647. We need

1 - 0.8647^n = 0.9, or
0.8647^n = 0.1

which gives n = 16 (since 0.8647^16 is about 0.098).
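The value of n can be found either by trial and error, as above, or directly with natural logarithms. A minimal sketch of both approaches, using the figures from part (c):

import math

p_fail = 0.8647          # probability each component fails within 100 hours

# Direct solution of 0.8647^n = 0.1 using logarithms
n_exact = math.log(0.1) / math.log(p_fail)
print(round(n_exact, 1))                 # about 15.8, so 16 components are needed

# Trial and error: smallest whole number of components that works
n = 1
while p_fail ** n > 0.1:
    n += 1
print(n)                                 # 16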
The second (FMECA) is not discussed here as it is covered in another unit of the course
(Performance Evaluation).
iv Collecting basic data such as components' failure rates, repair rates, and failure
occurrence probability.
This method is frequently used as a qualitative evaluation method in order to assist the
designer, planner or operator in deciding how a system may fail and what remedies may be
used to overcome the causes of failure. The method can also be used for quantitative
evaluation, in which case the causes of system failure are gradually broken down into an
increasing number of hierarchical levels until a level is reached at which reliability data is
sufficient or precise enough for a quantitative assessment to be made. The appropriate data is then inserted into the tree at this hierarchical level and combined together using the logic of
the tree to give the reliability assessment of the complete system being studied.
In order to illustrate the application of this method, consider the electric power requirements
of the system in the following example. In this example, the failure event being considered is
'loss of the electric power'. In practice the electric power requirements may be both a.c. power, to supply energy for prime movers, and d.c. power, to operate relays and contactors, both of which are required to ensure the successful operation of the electric power system.
Consequently the event 'loss of electric power' can be divided into two sub-events, 'loss of a.c. power' and 'loss of d.c. power'. This is shown in the following figure (see Figure 4) with
the events being joined by an OR gate as failure of either, or both, causes the system to fail.
[Figure 4: fault tree for 'loss of electric power', with an OR gate at the top and an AND gate combining 'loss of offsite power' and 'loss of onsite power']
If this subdivision is insufficient, sub-events can be divided further. The event 'loss of a.c.
power' may be caused by 'loss of offsite power' (the grid supply) and by 'loss of onsite power'
(standby generators or similar devices). This process can be continued downwards to any
required level of subdivision. After developing a fault tree, it is necessary to evaluate the
probability of occurrence of the upper event by combining component probabilities using
basic rules of probability and the logic defined in the fault tree.
In the present example, suppose the probabilities of the events at the bottom of the tree are:
Prob (Loss of offsite power) = 0.067
Prob (Loss of onsite power) = 0.075
Prob (Loss of dc power) = 0.005
Notice that these are the probabilities of faults and are typically low, whereas the reliabilities
used in the block diagrams are typically larger. Obviously
Reliability of the offsite power = 1 – Prob (Loss of offsite power) = 1 – 0.067 = 0.933.
These probabilities for the faults at the bottom of the tree can now be combined using the
addition and multiplication rules of probability. For the AND gate at the bottom we multiply
the probabilities to work out
Prob (Loss of a.c. power) = Prob(Loss of offsite power) x Prob (Loss of onsite power)
= 0.067 x 0.075 = 0.005025.
For the OR gate we add the probabilities to get the probability of the top event:
Prob (Loss of electric power) = Prob (Loss of a.c. power) + Prob (Loss of d.c. power)
= 0.005025 + 0.005 = 0.010025.
This means that the probability of the top, undesirable, event – loss of electric power – is about 1%. The calculation makes two assumptions.
First, to multiply the two probabilities at the bottom of the tree we must assume they are
statistically independent. This is reasonable if the onsite and offsite systems are separate so
that the failure of one is independent of the other. Notice that the combined probability, Prob (Loss of a.c. power), is much less than either of the two probabilities entering the AND gate. The principle here is that of reducing the risk of a fault by using a backup system: this will not work, of course, if the two systems are dependent, so that if one fails there is an increased chance that the other will fail too.
The second assumption is that, when we add the probabilities for the OR gate, the events are mutually exclusive (i.e. they don't overlap). With small probabilities like the ones here, this is reasonable because the probability of more than one fault occurring at the same time is small.
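The gate logic can be expressed as two tiny helper functions, mirroring the block-diagram functions earlier. This is a sketch of the calculation above; as in the text, the AND gate assumes independent events and the OR gate assumes the events are (at least approximately) mutually exclusive.

def and_gate(*fault_probs):
    """All input faults must occur (assumes independence): multiply."""
    p = 1.0
    for prob in fault_probs:
        p *= prob
    return p

def or_gate(*fault_probs):
    """Any input fault causes the event (assumes they rarely overlap): add."""
    return sum(fault_probs)

p_offsite = 0.067
p_onsite = 0.075
p_dc = 0.005

p_ac = and_gate(p_offsite, p_onsite)       # 0.005025
p_top = or_gate(p_ac, p_dc)                # 0.010025, i.e. about 1%
print(p_ac, p_top)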
SELF-APPRAISAL EXERCISE
Exercise
(The events and probabilities are taken from a presentation by C. Leighton &
C. Dennis at a conference on Risk Analysis and Assessment, Edinburgh, 1994.)
Suggested Answers
2 Assuming the probabilities of the two events (suspension system deflated and
emergency springs failed) are independent, we can multiply the probabilities, so
the estimate for the probability of suspension system failure is 1.5 x 10^-10. In
practice, this assumption of independence may not be fully realistic if some
causes of the first event also make the second more likely. If the events are not
independent, the probabilities may be much higher.
3 To work out the probability of passenger train derailment we add the four
probabilities to give 1.7 x 10^-8. Note that the probability of the first event - over-speeding - is negligible compared with the others.
This assumes that these are the only faults that will lead to derailment. Sabotage,
for example, does not seem to be included. (Strictly, adding probabilities
presupposes the events are mutually exclusive. It is possible that more than one
event can occur. However, the probability of this is so small - of the order of 10^-18 - that it can reasonably be ignored.)
4 The probability of derailment per day is 100 x 20 x 2 x 1.7 x 10^-8 = 6.8 x 10^-5.
This is, in effect, a failure rate, so the mean time between derailments (failures) is 1/(6.8 x 10^-5), or 1.5 x 10^4, or about 15,000 days, which is about 41 years. Derailments can be expected, on average, every 41 years.
5 The accuracy and usefulness of the model are obviously dependent on the data
on which it is built. In particular, it is obviously important that all input
probabilities are accurate (e.g. of the emergency springs failing), that
assumptions about statistical independence are carefully checked, and that all
lists of events specifying the possible ways a fault can occur are as complete as
possible. This obviously requires systematic research.
[Fault tree diagram for 'Derailment', combining events through OR and AND gates]
Mathematical Appendix
Powers are also defined for negative and fractional indices. Experiment with your calculator or spreadsheet.
e^x is a function which arises from the mathematics of constant rates of growth. You should have a button for it on your calculator, and on a spreadsheet the function will be EXP(X) or something similar. Some examples (all rounded to two decimal places):

e^1 = 2.72
e^3 = 20.09
e^(-1.6) = 0.20
The inverse function to e^x is the natural logarithm of x, loge(x) or ln(x). This is useful if you want to know what x is if, for example,

e^x = 0.5

Then

x = loge(0.5) = -0.69315

This can be checked by working out e^(-0.69315): it should be very close to 0.5.
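If you prefer to check these values on a computer rather than a calculator, the equivalent functions in most programming languages are exp and log. A quick sketch in Python:

import math

print(math.exp(1))               # 2.718... (e to the power 1)
print(math.exp(-1.6))            # 0.2019...
print(math.log(0.5))             # -0.6931... (natural logarithm)
print(math.exp(math.log(0.5)))   # back to 0.5 - exp and log are inverses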
Healthy laptop case study

The home healthcare (HHC) division provides nurses (i.e. district nurses) to home care
patients. It employs 125 nurses and each one is provided with a laptop that allows them to
maintain patient records, point of care documentation, case loads, and payroll entry. The
nurses maintain a remote connection with the central server to access new cases and upload
any historical patient data gained throughout the day. The availability of the system, quality
of the data and timeliness of transcribing information are critical to the nurses.
The nurses have been complaining that they are having a lot of problems with the laptops and
the connection to the network and server. Reported laptop performance is very poor and 60%
of the nurses have complained of problems of one sort or another. All the laptops are from the
same manufacturer and have a similar configuration. Whenever a failure occurs the nurse has to return the laptop to the office, obtain a loan machine and then return that to the office when their own laptop has been repaired. The company knows how many machines have been returned to the manufacturer for repair but does not itself keep any records of failure symptoms and causes - these are held by the vendor's field service group which makes the repairs.
Discussions between HHC's IT Director and the manufacturer's Field Support Manager are getting progressively more heated, with HHC threatening to switch suppliers completely - with serious consequences for the manufacturer, who actually supplied all of HHC's IT hardware. Losing the laptop contract could well result in the entire account being lost. In order to try to bring the relationship back onto a more business-like footing, HHC decided to bring in an independent reliability specialist to arbitrate.
The manufacturer did not and would not publish reliability data for its products, saying that
"We do not quote MTBF (Mean Time Before Failure) numbers for our products or
believe this type of information should be used as a meaningful description of the quality
of our systems or component parts within those systems. MTBF is an older industry term
that today has very little value and is mostly misinterpreted and misunderstood. The
following is a definition and example of why MTBF should be avoided.
MTBF is the point at which 63.2% of the population, (everything of that component
built by a manufacturer), will fail. So for example, disk drives that claim 1,000,000 hour
MTBF with 720 power on hours per month would take 115 years to reach the 63.2%
mark. It does NOT mean that no failure should occur for 115 years. This is a total
population statement from the manufacturer‘s point of view not from the customer point
of view. As a Manufacturer, we know that some of that product will fail but we have no
idea which ones, and we do not know the distribution of good or bad product shipped to
any given customer. A single customer may get more or less than his fair share of drives
that may fail earlier than expected and therefore may achieve better or worse results and
still meet the MTBF criteria."
Exercise A.
Please comment on the manufacturer‘s refusal to publish MTBF data – is their position
reasonable? Where does the 63.2% come from?
===========================================================
The manufacturers were willing to show the defect data for HHC‘s laptops, but once again
would not compare those results with a control sample of the same number of the same
product. This laptop failure data (a Pareto diagram) was produced at the next meeting, as
follows;
[Figure: Pareto chart of laptop failures by category, with counts ranging up to about 25 per category. The main categories include damaged covers, hard drive failures, CD failures, and keyboard, display, battery and power faults.]
The average age of the machines was three years. This prompted some discussion that they
were therefore fully depreciated and should be written off and replacements bought. The
consultant thought that this was an accountant‘s view of the situation and was not a view that
could be supported from a quality and reliability perspective. At three years the machines
should be at the most reliable part of the reliability (bathtub) curve – well past early life
failure and not yet approaching the wear-out part of the curve when failures start to increase.
The high number of problems due to damaged covers, hard drives and CD failures made him
suspect that part of the problem lay in the environment in which the laptops were used and
the treatment that they received. If this were the case then replacing them with new machines
would not solve the problem as the new ones would ultimately suffer the same fate.
The diagram below shows a fault tree for one of these problem categories – hard drive
failures. (In order to make it legible it has not been fully developed—seek and logic failures
have not been explored.)
Exercise B
How can this diagram, and other similar diagrams, be used to analyse and help prevent
failures? Does it show how misuse can contribute to the failure of the component under
review?
===========================================================
The consultant decided to "walk the process" and observe a sample of the nurses at work over
a period of time. He was struck by the fact that the nurses used their cars as mobile offices.
They explained that they did not like taking the laptop into the patient‘s house and updating
patient records in front of them but preferred to do it in the privacy of their car in between
calls. He also noticed that usually the nurses did not pack the laptop away in its case when
they had finished to update and often put it on the car floor where it was easy to get to next
time. He also noticed that nurses without CD players in their cars were prone to use their
laptop to play their favoured music as they drove between calls.
This pointed him to the cause of the reliability problems. Most of the cover damage and
screen breakages were actually being caused by shock and vibration in the car when the
laptop was unprotected, or by being dropped when picked up from the floor. Some of the
other failures could be caused by contamination (dust and grit, etc) getting into the drives
while the machine was in the relatively dirty area of the car floor. The operating environment
in the car could cause other failures – the ambient temperature in the car if the laptop was left
in direct sunlight while the nurse was with a patient could easily exceed the manufacturer‘s
specification. The factors that are normally used to calculate failure rates are power on hours,
power on cycles, temperature, and relative humidity.
Although it was not possible to compare these findings to the manufacturer‘s own data it was
possible to compare them to the findings of published industry surveys. This indicated that
laptops were increasingly being used in environments for which they are not designed and
that field personnel were experiencing failure rates as high as 20% per week. This confirmed
the consultant's opinion that merely changing from one manufacturer's standard product to another's would not solve the problem.
He decided that two courses of action were required, both involving Failure Mode Effect Analysis (FMEA, which is covered in the Performance Evaluation Unit), in order to:
a) Recognise and evaluate the potential failure of a product or process and its effects.
b) Identify actions that could eliminate or reduce the chance of the potential failure
occurring.
With the agreement of HHC it was decided to apply FMEA to the way the nurses used
and treated their laptops (their process) to reduce the opportunities to induce failure.
Process FMEA is normally performed on the manufacturing process. In this case study we are applying the technique to the users' process. The objective is to identify where within the process damage or failure can be caused to the laptop, and what action can be taken by the nurse or HHC to minimise or eliminate it.
With the agreement of the manufacturer it was also decided to use FMEA to evaluate the
hardware in order to develop a "ruggedised" version of the laptop for use in this and
similar field situations.
Exercise A.
Please comment on the manufacturer’s refusal to publish MTBF data – is their position
reasonable?
In these situations it is always very frustrating for customers and user groups to be told that
they will not be given Mean Time Between Failure (MTBF) and other failure rate data. From
the manufacturer‘s perspective this is a sensible approach because there is always a danger
that the MTBF data will be misinterpreted by the customer. MTBF is a measure applied to
the population as a whole, not to a very small sample.
It is true that caution must be exercised when applying averages to individual cases. However,
provided that the customer understands that the MTBF is an average figure, the MTBF would
still be useful, particularly for a customer such as HHC which buys large numbers of laptops.
In this situation, a significant departure from the average figures would indicate either a
problem with the laptops supplied, or that they are being used in unsuitable conditions. (This
issue can be analysed more formally as a null hypothesis test – see Wood, 2003, Chapter 8).
The 63.2% comes from the exponential reliability distribution. If the MTBF is 1,000,000
hours, the failure rate is 0.000001 per hour and the exponential distribution gives
R = e^(-0.000001 x 1,000,000) = e^(-1) = 0.368 = 36.8%
which means that the other 63.2% must have failed. The quotation from the manufacturer is
misleading in that it is not a definition of the MTBF, but rather a fact that follows from the
exponential distribution.
===========================================================
Exercise B.
This diagram is a good illustration of how a fault tree can help in the analysis and prevention
of failures. It shows how external factors such as mechanical shock and dirt can damage
components. Other fault trees should be constructed for the other sub-systems contained in
the laptop schematic diagram and then combined into a fault tree for the entire laptop system.
The first stage of the analysis of the fault tree is a qualitative evaluation of the subsystem to
understand how failure occurs. The next stage is to gather data on component failure rates
and repair rates to allow a quantitative evaluation – where are the major sources of failure and
what action can be taken to minimise or eliminate them?
A process FMEA document for the Field Nurse Laptop Use Process, identifying reasonable changes that could be made, without inconvenience to the nurses, to reduce the failure rate of their laptops.
S = Severity; an assessment of the seriousness of the effect of the potential failure mode.
Rated on a scale of 1 (none) to 10 (most severe)
O = Occurrence; the chance that the specific cause will occur. Rated on a scale of 1 (least
chance of occurrence) to 10 (highest chance of occurrence).
An FMEA study on the laptop itself with the objective of identifying the failure points and the design changes that could minimise or eliminate the risk of failure occurring. First create the block diagram of a laptop and then the FMEA document (only complete the document for one item).
Design FMEA should always begin with a block diagram (Total Quality Management,
Besterfield). It is used to show the different flows involved with the item being analysed. Its
purpose is to understand the input to each block, the function of the block, and the output.
First list all the components of the system, their functions and the means of connection or attachment between components; the components are then placed in blocks and their functional relationships represented by lines connecting the blocks.
The laptop component diagram provided earlier only shows data flows between components
but it can be updated as follows: