
Unit QUAN – Session 7

Reliability
MSc Strategic Quality Management
Quantitative Methods - QUAN

RELIABILITY

Aims of Session

By the end of the session the student should understand the basic concepts in reliability and
be able to apply the principles to a range of problems for reliability improvement.

Learning Approach

Study the attached notes and the Healthy Laptop case study.

On completion of the above tasks you should be able to do the self-assessment exercises in
the attached notes.

Reading

You will find further reading on this topic in Besterfield (2004), Chapter 11. O'Connor (2002
or earlier editions) provides more extensive coverage.

Revised December 2009


Contents

Introduction to Reliability
What is reliability?
Quantitative Approach to Reliability
Concept of Bath-tub Curve
Measuring reliability
Exercises
Reliability Distributions
Reliability Prediction Using Exponential Distribution
Introduction to Mean Time between Failures and Mean Time to Failure
Exercises
Self-Assessment Exercise
The addition and multiplication rules of probability
Reliability of Systems - Series, Parallel, and Combination Systems
Exercises
Self-assessment Exercise
Exercises
Reliability Improvement Techniques
Fault Tree Analysis
Mathematical Appendix
Healthy laptop case study


RELIABILITY

Introduction to Reliability

Reliability has gained increasing importance in recent years in manufacturing
organisations, government and the civilian community. With recent concern about
government spending, agencies are trying to purchase systems with higher reliability and
lower maintenance costs. In defence, for example, these systems range from complete
weapon systems, such as aircraft or tanks, to individual critical components such as diodes,
resistors, transistors, ICs and so on. As consumers, we are mainly concerned with buying
products that last longer and are cheaper to maintain, i.e. have higher reliability. But what is
reliability, and how do we measure it?

What is reliability?

The reliability of a product (or system) can be defined as the probability that the product will
perform a required function under specified conditions for a certain period of time.

Why do we need high product, component or system reliability?

• Higher customer satisfaction

• Increased sales

• Improved safety

• Increased competitiveness

• Decreased warranty costs

• Decreased maintenance costs, etc.

Reliability is often expressed as a probability of success ranging from 0 to 1, or as a
percentage from 0 to 100%.


The Quantitative Approach to Reliability

Reliability

We can quantitatively define the reliability of a component or system as the probability that it
will meet the qualitative definition.

The probabilistic definition of reliability is given by:

R(t) = (number of survivors at time t) / (number of items put on test at time t = 0)

At time t = 0, the number of survivors is equal to the number of items put on test.


Therefore, the reliability at t = 0 is

R(0) = 1 = 100%

After this, when t > 0, R(t) will decline as some components fail.

The hazard rate and the failure rate

The hazard rate (usually represented by h(t) or λ) is a very useful quantity. This is defined as
the probability of a component failing in one (small) unit of time.

Let NF = the number of failures in a small time interval Δt, and NS = the number of
survivors at time t.

The hazard rate can then be calculated by the equation:

h(t) = NF / (NS × Δt)

For example, if there are 200 surviving components after 400 seconds, and 8 components fail
over the next 10 seconds, the hazard rate is given by

h(400) = 8 / (200 x 10) = 0.004 = 0.4%


This simply means that 0.4% of the surviving components fail in each second.

(You may wonder why the above equation defines h(400) and not h(410). The reason is that
Δt is a small time interval, so it is reasonable to assume that the hazard rate will not change
appreciably during the interval. We then define the hazard rate using the beginning of the
interval for convenience. In the extreme we can make Δt infinitesimally small—which is the
basis of the differential calculus.)
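This calculation is easy to automate. Below is a minimal Python sketch of the same arithmetic (the function name hazard_rate is purely illustrative, not from any library):

```python
def hazard_rate(failures, survivors, interval):
    """Estimate h(t) = NF / (NS * delta_t) from counts over a small interval."""
    return failures / (survivors * interval)

# Example from the text: 200 survivors at t = 400 s, 8 failures in the next 10 s
h = hazard_rate(failures=8, survivors=200, interval=10)
print(h)  # 0.004, i.e. 0.4% of the survivors fail per second
```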

The term hazard rate applies to non-repairable items such as transistors, light bulbs,
microprocessors etc. The failure rate is defined in just the same way as the hazard rate,
except that it applies to repairable items such as computers and TV sets.

The Concept of the Bath-tub Curve

The so-called bath-tub curve represents the pattern of failure for many products – e.g.
semiconductor devices and electronic components. The vertical axis in Figure 1 is the hazard
or failure rate at each point in time. Higher values indicate higher probabilities of failure.

The bath-tub curve is divided into three regions: infant mortality, useful life and wear-out.

Figure 1: Product failure pattern illustrated by the bath-tub curve
[Chart: observed failure rate plotted against time since the start of testing, showing the
infant mortality, useful life and wear-out stages.]


Infant Mortality:

This stage is also called the early failure or debugging stage. Here the failure rate is high
but decreases gradually with time. During this period, failures occur because products,
systems or devices were not tested sufficiently in engineering, or because manufacturing
produced some defective items. The failure rate at the beginning of the infant mortality stage
is therefore high, and it then decreases with time as early failures are removed by burn-in or
other stress screening methods. Some typical early failures are:

• poor welds

• poor connections

• contamination on surfaces or in materials

• incorrect positioning of parts, etc.

Useful life:

This is the middle stage of the bath-tub curve, characterised by a constant failure rate. This
period is usually given the most consideration during the design stage and is the most
significant period for reliability prediction and evaluation activities. Product or component
reliability with a constant failure rate can be predicted by the exponential distribution (which
we come to later in this session).

Wear-out stage:

This is the final stage, where the failure rate increases as products begin to wear out
because of age or lack of maintenance. When the failure rate becomes high, repair or
replacement of parts should be carried out.


Figure 2: Extended bath-tub curve
[Chart: observed failure rate against time, with a lengthened useful life stage.]

Can we extend the useful life stage of a product? Products can have their useful life extended
by proper design, conscientious assembly, and careful handling. This leads to a change in the
shape of the bath tub curve (see Figure 2).

Measuring reliability

To see the level and pattern of the reliability of a product or component in practice, it is
necessary to make some measurements. The simplest way to do this is to test a large number
of products or components until they fail, and then analyse the resulting data. Exercise 2
below shows how this works. This enables us to estimate the hazard rate and reliability after
different lengths of time - and decide, empirically, if the bath tub curve applies, or if the
hazard rate shows some other pattern.

There are a number of obvious difficulties which may arise. If the useful life is long it may
not be practical to wait until products or components fail. It may be too expensive to test
large samples, so small samples may have to suffice. And for some products (e.g. space
capsules) it may be difficult to simulate operating conditions at all closely. There are a
number of approaches to these difficulties - most of these are beyond the scope of this unit,
but they are discussed in the reading suggested for this session. (An exception is the
prediction of the reliability of a product from information about the reliability of its

components - this avoids the necessity to test the whole product - which is discussed in a later
section of the notes for this session).

It is also possible that the failure rate will depend on environmental conditions in a
predictable way. For example, one of the key factors affecting the reliability of electronic
components and systems is temperature – basically the higher the temperature of the device
the higher the failure rate. Most computer equipment therefore has some form of cooling,
ranging from a simple fan to forced chilled air cooling. The following chart is from a report
on the effect of temperature on a single board computer published by Crane Surface Warfare
Center Division in 2001, and clearly shows the effect of temperature on failure rate.

Figure 3: Failure rate over temperature
[Chart: failure rate (failures per million hours) plotted against temperature from 25 to 95°C,
rising steeply as temperature increases.]


QUESTIONS TO STIMULATE YOUR THINKING

Now consider and answer the following questions, the purpose of which
is to test your retention of the information given in the course notes for
this session. When you have done this, check back on any questions
which you could not answer or where you gave the wrong answer, then
go on to the Self-Appraisal Exercise.

Questions

1 Would you expect the bath tub curve to apply to a car? What about a human
being?

2 One thousand transistors are placed on life test, and the number of failures in
each time interval is recorded. Find the reliability and the hazard rate at each
point in time. (You may find it helpful to set this up on a spreadsheet.)

Time interval Number of failures


0-100 160
100-200 86
200-300 78
300-400 70
400-500 64
500-600 58
600-700 52
700-800 43
800-900 42
900-1000 36

Do you think this component shows the bath tub pattern of failure?




Suggested Answers

1 I think the bath tub pattern would apply to both.

2 My answers are below.

Time  Failures in next 100 h  Survivors  Reliability  Hazard rate (per hr)
0 160 1000 100.00% 0.16%
100 86 840 84.0% 0.10%
200 78 754 75.4% 0.10%
300 70 676 67.6% 0.10%
400 64 606 60.6% 0.11%
500 58 542 54.2% 0.11%
600 52 484 48.4% 0.11%
700 43 432 43.2% 0.10%
800 42 389 38.9% 0.11%
900 36 347 34.7% 0.10%

Note that the hazard rate is per hour. (This is why the figures are so small.) A
hazard rate of 0.10% means that the probability of a failure in a given hour is
0.10%.

Note also that I am using data from the whole interval from 0-100 hours to
calculate the figures for 100 hours. This is an approximation which may lead to
small inaccuracies.

This result shows that the hazard rate is initially 0.16% per hour, and then drops
to around 0.10% to 0.11%. There is an infant mortality phase, a useful life phase
with a fairly constant hazard rate, but no wear out phase. However, there are still
311 components left after the 1000 hour test - the wear out phase may become
apparent if the test were to be prolonged. Obviously, the best way to show this
pattern would be to draw a graph of hazard rate against time.
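If you set this up in code rather than a spreadsheet, a short Python sketch reproduces the table above (same approximate method, same failure counts from the question):

```python
failures = [160, 86, 78, 70, 64, 58, 52, 43, 42, 36]  # failures in each 100-hour interval
n0, interval = 1000, 100

survivors = n0
print("Time  Survivors  Reliability  Hazard rate (per hr)")
for i, f in enumerate(failures):
    t = i * interval
    hazard = f / (survivors * interval)   # failures over the coming interval
    print(f"{t:4d}  {survivors:9d}  {survivors / n0:10.1%}  {hazard:.2%}")
    survivors -= f
```

Plotting the hazard column against time shows the initial drop from 0.16% to a roughly constant 0.10–0.11% per hour.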



Reliability Distributions

There are many statistical distributions used for reliability analysis—for example, the
exponential distribution, the Weibull distribution, the normal distribution, the lognormal
distribution, and the gamma distribution. In this unit we look at the exponential distribution
only, as this is the simplest and the most widely applicable.

Reliability Prediction Using the Exponential Distribution

The exponential distribution applies when the hazard rate is constant - the graph is a
straight horizontal line instead of a “bath tub”. (It can be used to analyse the middle phase
of a bath tub - e.g. the period from 100 to 1000 hours in Exercise 2 above.) It is one of the
most commonly used distributions in reliability, and is used to predict the probability of
survival to a particular time. If λ is the failure rate and t is the time, then the reliability, R, can
be determined by the following equation

R(t) = e^(-λt)

There is a brief note on the mathematical background to this equation in the Appendix to this
session.

To see that it gives sensible results, imagine that there are initially 1000 components and that
λ is 10% per hour. After one hour about 10% of the original 1000 components will have failed,
leaving about 900 survivors. After two hours, about 10% of the 900 survivors will have
failed, leaving about 810. Similarly there will be about 729 survivors after the third hour,
which means that the reliability after 3 hours is 0.729.

Using the exponential distribution the reliability after 3 hours, with λ = 0.1, is given by
R(3) = e^(-3λ) = e^(-0.3) = 0.741

(You can work this out using a calculator or a spreadsheet—see the mathematical appendix for
more details.)

This is close to the earlier answer as we should expect. The reason it is not identical is that the
method of subtracting 10% every hour to obtain 900, 810, etc ignores the fact that the number
of survivors is changing all the time, not just every hour. The exponential formula uses


calculus to take this into account.

If the failure rate is small in relation to the time involved, a much simpler method will give
reasonable results. Let's suppose that λ = 0.01 (1%) in the example above. The formula now
gives the reliability after three hours as
R(3) = e^(-3λ) = e^(-0.03) = 0.9704
A simpler way of working this out would be to say that if the failure rate is 0.01 per hour,
the total proportion of failures in 3 hours will be 0.03 (3%), so the reliability after three hours
is simply
R(3) = 1 – 0.03 = 0.97 (or 100% – 3% = 97%)
This is not exactly right because in each hour the expected number of failures will decline as
the surviving pool of working components gets smaller. But when the failure rate is 1% and we
are interested in what happens after three hours, the error is negligible. On the other hand, if
we want to know what happens after 300 hours the simple method gives a silly answer (the
reliability would be negative!) and we need to use the exponential method. You should be able
to check this answer with your calculator or a computer (the answer should be 0.0498 or about
5%).
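A short Python sketch makes the comparison between the exact exponential formula and the simple approximation concrete (the numbers are the ones used above):

```python
import math

def reliability(lam, t):
    """Exponential model: R(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)

print(reliability(0.1, 3))                 # 0.7408 -- close to the 0.729 from hourly subtraction
print(reliability(0.01, 3), 1 - 0.01 * 3)  # 0.9704 vs 0.97: the simple method is fine here
print(reliability(0.01, 300))              # 0.0498 -- the simple method gives 1 - 3 = -2, a silly answer
```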

Mean time to Failure (MTTF) and Mean time between Failures (MTBF)

MTTF applies to non-repairable items or devices and is defined as "the average time an item
may be expected to function before failure". This can be estimated from a suitable sample of
items which have been tested to the point of failure: the MTTF is simply the average of all the
times to failure. For example, if four items have lasted 3,000 hours, 4,000 hours, 4,000 hours
and 5,000 hours, the MTTF is 16,000/4 or 4,000 hours.

The MTBF applies to repairable items. The definition refers to "between" failures for
obvious reasons. It should be obvious that
MTBF = total device hours / number of failures
For example, consider an item which has failed, say, 4 times over a period of 16,000 hours.
Then the MTBF is 16,000/4 = 4,000 hours. (This is, of course, just the same method as for
the MTTF.)

For the particular case of an exponential distribution,


λ = 1/MTBF (or 1/MTTF)
where λ is the hazard rate.


For example, the item above fails, on average, once every 4000 hours, so the probability of
failure for each hour is obviously 1/4000. This depends on the failure rate being constant -
which is the condition for the exponential distribution.

This equation can also be written the other way round:

MTBF (or MTTF) = 1/λ



For example, if the hazard rate is 0.00025, then

MTBF (or MTTF) = 1/0.00025 = 4,000 hours.
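These relationships are easy to check numerically; here is a minimal Python sketch using the figures from the text:

```python
# MTTF from a sample of non-repairable items tested to failure
times_to_failure = [3000, 4000, 4000, 5000]
mttf = sum(times_to_failure) / len(times_to_failure)
print(mttf)        # 4000.0 hours

# For the exponential model, the failure rate and MTBF/MTTF are reciprocals
lam = 1 / mttf
print(lam)         # 0.00025 failures per hour
print(1 / lam)     # back to 4000 hours
```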


QUESTIONS TO STIMULATE YOUR THINKING

Now consider and answer the following questions, the purpose of which
is to test your retention of the information given in the course notes for
this session. When you have done this, check back on any questions
which you could not answer or where you gave the wrong answer, then
go on to the Self-Appraisal Exercise.

Questions

1 An industrial machine compresses natural gas into an interstate gas pipeline.
The compressor is on line 24 hours a day. (If the machine is down, a gas field
has to be shut down until the natural gas can be compressed, so down time is
very expensive.) The vendor knows that the compressor has a constant failure
rate of 0.000001 failures/hr. What is the operational reliability after 2500 hours
of continuous service?

2 What is the highest failure rate for a product if it is to have a reliability (or
probability of survival) of 98 percent at 5000 hours? Assume that the time to
failure follows an exponential distribution.

3 Suppose that a component we wish to model has a constant failure rate with a
mean time between failures of 25 hours. Find:

(a) The reliability function.


(b) The reliability of the item at 30 hours.

4 A certain type of engine seal (non-repairable) is known to have an exponentially
distributed life with a constant failure rate λ = 0.03 × 10^-4 failures/hour.

(a) What is the MTTF of the seal?

(b) What is the reliability at the MTTF?



Suggested Answers

1 The compressor has a constant failure rate, so the exponential model applies
and the reliability is given by:

R(t) = e^(-λt)

Failure rate λ = 0.000001 failures/hr, operational time t = 2500 hours.

Reliability = e^(-0.000001 × 2500) = 0.9975

2 The reliability of the product is given to be 0.98. The reliability for an
exponential distribution is given by:
R(t) = e^(-λt)

i.e., 0.98 = e^(-λ × 5000)

Taking natural logarithms on both sides (see Appendix), we get
-0.02020 = -λ × 5000
Therefore λ = 4.04 × 10^-6 failures/hr.
(Alternatively, you could use a trial and error process with your calculator to
find a negative number which gives you 0.98 when you press the e^x button. This
number must be equal to -λ × 5000.)
Therefore the highest failure rate for a reliability of 0.98 at 5000 hours is
4.04 × 10^-6 failures/hour.

3 (a) Since the failure rate is constant, we will use the exponential distribution.
Also, the MTBF = 25 hours. We know that, for an exponential distribution,
MTBF = 1/λ.

Therefore λ = 1/25 = 0.04

The reliability function is given by: R(t) = e^(-λt) = e^(-0.04t)

(b) The reliability of the item at 30 hours = e^(-0.04 × 30) = e^(-1.2) = 0.3012

4 (a) λ = 0.03 × 10^-4 = 3 × 10^-6 failures/hour

MTTF = 1/λ = 333,333 hours
i.e., the average life of these seals is about 333,333 hours.

(b) MTTF = 333,333 hours

Therefore R(333,333) = e^(-0.000003 × 333,333) = e^(-1) = 0.368
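You can verify all four answers with a few lines of Python; this sketch mirrors each calculation:

```python
import math

# Q1: R(2500) with lambda = 1e-6 per hour
print(math.exp(-1e-6 * 2500))        # 0.9975

# Q2: solve 0.98 = exp(-lambda * 5000) for lambda
print(-math.log(0.98) / 5000)        # about 4.04e-06 failures per hour

# Q3: MTBF = 25 hours -> lambda = 0.04; reliability at 30 hours
print(math.exp(-(1 / 25) * 30))      # 0.3012

# Q4: lambda = 3e-6 -> MTTF = 1/lambda; reliability at the MTTF is always e^-1
mttf = 1 / 3e-6
print(mttf, math.exp(-3e-6 * mttf))  # 333333.3 hours, 0.368
```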



SELF-APPRAISAL EXERCISE
Exercise

Here follows an exercise to give you further information and food for thought.

The assigned task is:

1 The equipment in a packaging plant has a MTBF of 1000 hours. What is the
probability that the equipment will operate for a period of 500 hours without
failure?

2 During World War II, the hazard rate for bomber aircraft flying over Europe
was believed to be a 4% chance of non-return from each mission, however
experienced the pilot was. Calculate the probability that a crew member will
survive 25 missions. How many missions would it take to reduce a crew
member's probability of survival to 10%?

3 TALCO manufactures microwave ovens. In order to develop warranty
guidelines, TALCO randomly tested 10 microwave ovens continuously to
failure. The failure information for the 10 ovens is shown in the table below.

Microwave Hours
1 2300
2 2150
3 2800
4 1890
5 2790
6 1890
7 2450
8 2630
9 2100
10 2120

What is the mean time to failure of the microwave ovens? (Note that the mean
life of the microwaves is defined in terms of their mean time to failure because
no maintenance is performed on the ovens.)




Suggested Answers

1 Assuming the exponential model, the hazard rate is
1/MTBF = 0.001
So
R(500) = e^(-500 × 0.001) = e^(-0.5) = 0.61

2 As the hazard rate is constant we can use the exponential model with a hazard
rate of 0.04.
R(25) = e^(-0.04 × 25) = e^(-1) = 0.37

The crew have a 37% chance of surviving 25 missions.

To reduce this to 10% we need e^(-0.04t) = 0.1.

Trial and error (or natural logarithms - see the mathematical appendix) shows
that
e^(-2.3) = 0.1 (approximately)

So 0.04t = 2.3, and t is 2.3/0.04 or about 57 missions.

Note that "time" is measured in missions.

3 The MTTF is 2312 hours (simply the average).


The addition and multiplication rules of probability

The next aspects of reliability theory we will consider depend on some probability theory—
the addition and multiplication rules. I will explain these in terms of dice and cards.

Suppose that you throw a single dice. The probability of getting a 6 is


P(6) = 1/6
And the probability of getting a 5 is also
P(5) = 1/6
Equally obvious is that the probability of getting a 5 or a 6 is
P(5 or 6) = 2/6 = 1/3
And that
P(5 or 6) = P(5) + P(6)

Similarly if you choose a single card from a pack of (52) cards, the probability of getting an
ace or a picture card (jack, queen or king) is
P(ace or picture) = 4/52 + 12/52 = 16/52
because 4 of the 52 cards are aces, and another 12 are picture cards.

These examples suggest that if we want the probability of something or something else
happening, we can add the probabilities.

But what about the probability of an ace or a red card (hearts or diamonds)? Can we say
that
P(ace or red) = P(ace) + P(red) = 4/52 + 26/52 = 30/52 ??

This is obviously wrong because two of the aces are also red, so we are in effect double
counting these aces if we add the probabilities. Before adding probabilities you need to
check that the two events cannot both occur - i.e. they do not overlap: they are mutually
exclusive (each excludes the other).

The complete addition rule is


P(A or B) = P(A) + P(B) if A and B are mutually exclusive (i.e. they don’t overlap).
It can easily be extended to three or more events:
P(A or B or C or …) = P(A) + P(B) + P(C) + … if A, B, C … are mutually exclusive.

To explain the and rule we need to do more than one thing, so let's throw the dice and draw a
card from the pack. How can we work out the probability of getting a 6 and a spade?


This is a little more complicated than the or rule. It helps to imagine doing the experiment
lots of times—say 1000 times.

Obviously you will get a 6 on about 1/6 of these thousand times—i.e. about 167 times. Now
think about how many times you will get a spade as well. This will happen on about ¼ of
these 167 times, or about 42 times out of the 1000 times we imagined doing the experiment.
So the probability is about 42/1000.

Now do the same thought experiment, but working in terms of the probabilities this time. We
will get a 6 on about one sixth of the times, and a spade as well on about one quarter of this
one sixth. One quarter of one sixth means the same as one quarter times one sixth, so we
simply multiply the probabilities:
P(6 and spade) = P(6)*P(spade) = 1/6 * 1/4 = 1/24
which is, of course, about 42/1000 as before.

Just like the addition rule there is an important assumption here. We assumed that the two
events are statistically independent. This means that the two probabilities are independent of
each other: knowing whether one has happened is of no help in assessing the probability of
the other happening. In the example above, knowing that we've got a 6 is of no relevance to
what will happen with the cards - so these events are statistically independent.

But in some situations you need to be very careful about this assumption. Suppose you know
that in a particular place the probability of rain falling on a given day is 1/3. Can you say that
P(rain today and rain tomorrow) = 1/3 * 1/3 = 1/9 ??
In practice this is likely to be wrong because the two events are not likely to be independent.
In England certainly, if it rains today the probability of rain tomorrow is likely to be higher
than if it did not rain today because the weather tends to set in to wet or dry spells. This
means that the second probability should probably be rather more than 1/3, so the result is
likely to be substantially more than 1/9.

The multiplication rule, then, also has an important condition:


P(A and B) = P(A)*P(B) if A and B are statistically independent
And just as before it can be extended:
P(A and B and C and …) = P(A)*P(B)*P(C)*… if A, B, C, … are statistically
independent
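A quick simulation is a good way to convince yourself of the multiplication rule. This Python sketch repeats the dice-and-card experiment many times and compares the observed frequency with 1/24:

```python
import random

trials = 100_000
hits = 0
for _ in range(trials):
    die = random.randint(1, 6)       # throw the dice
    suit = random.choice("SHDC")     # draw a card (only the suit matters here)
    if die == 6 and suit == "S":     # a 6 AND a spade
        hits += 1

print(hits / trials, 1 / 24)         # both should be close to 0.0417
```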


SELF APPRAISAL EXERCISE


Exercise

Here follows an exercise to give you further information and food for thought.

The assigned task is:

Two football teams are due to play each other three times in the next few weeks. The
data from their matches in the past suggests that the probability of Team A winning is
50%, and the probability of Team B winning is 30%

1 What is the probability of a draw in the first match?

2 Assuming that the results of the three matches are statistically independent what
is the probability of Team B winning all three matches?

3 Assuming that the results of the three matches are statistically independent what
is the probability of Team B winning at least one of the matches?

4 Do you think the assumption of statistical independence is justified?



Suggested Answers

1 The match must end in a win for Team A, or a win for Team B, or a draw.
These three probabilities are mutually exclusive so they must add up to
100%. This means the probability of a draw is 20% because
50%+30%+20%=100%.

2 P(B winning three matches) = P(B wins 1st match) * P(B wins 2nd match) * P(B wins 3rd)
= 30% * 30% * 30% = 27/1000 = 0.027 = 2.7%
so it is unlikely!

3 This is a little more difficult. The easiest approach to "at least one"
questions is to work out the probability of the event never happening, and
then subtract it from 1. In this case, the probability of Team B failing to
win each match is 70% (i.e. the probability of a Team A win or a draw), so
the probability of failing to win all three matches is
70% * 70% * 70% = 7/10 * 7/10 * 7/10 = 343/1000 = 34.3%
In all other circumstances Team B will win at least one match, so the
probability is
P(B wins at least one match) = 100% - 34.3% = 65.7%
which is fairly likely!
This method is used to derive one of the formulae (for parallel
configurations) below.

4 This is a difficult question to which there is no easy answer. Perhaps if
Team B wins the first match they would become more confident and so
more likely to win the next match? Or perhaps Team A would get more
determined?


Reliability of Systems - Series, Parallel, and Combination Systems

It is often useful to be able to estimate the reliability of a whole system from the reliability of
the individual components. This enables designers to predict reliability levels without having
to build and test the whole system. It also enables them to see weak spots in the system, and
to experiment with ways of improving these - perhaps by installing backups for critical
components.

The following three types of configurations will be discussed:

1 Series configuration
2 Parallel configuration
3 Combination of series and parallel

1 Series Configuration

For a series configuration, the system will fail if any of the components in the system fails.
Like links in a chain, if one link breaks then the entire chain fails to perform the intended
function. This can be represented by a block diagram like the one in Question 1 in the next
set of questions to stimulate your thinking below.

If R1, R2, R3, …, Rn are the reliabilities of the first, second, third, …, nth components in a
series system, then the system reliability is given by:

Rsys. = R1 × R2 × R3 × … × Rn

This follows from the rules of probability. It depends on the assumption that the
probabilities are statistically independent. This means that knowing one of the probabilities
is of no help in estimating any of the others: they can all be estimated independently of
each other.

When several components are placed in series, the total system reliability decreases quickly.
If we assume that R1 = R2 = R3 = … = Rn = R, then

Rsys. = R × R × R × … × R = R^n

If each component has a reliability of 0.95, then the system reliability of 5 components in
series is given by:


Rsys. = 0.95^5 = 0.774

2 Parallel Configuration

In a parallel configuration, the system fails only if all of the individual components fail. If
one component fails, the others serve as a backup - see the diagram in Question 2 below.

Rsys. = P(component 1 works, or component 2 works, or … component n works)

Assuming that there are two components in the system, then for the system to fail, both
components have to fail, which will happen with probability

(1-R1) (1-R2)

So the system reliability is given by

Rsys. = 1-{ (1-R1) (1-R2)}

Again, this assumes that the reliabilities are statistically independent.

This argument can easily be extended to n components in parallel:

Rsys. = 1-{ (1-R1) ... (1-Rn)}

3 Combination of Series and Parallel Configuration

Electrical circuits or mechanical assemblies are often designed in configurations which are
part series, part parallel. To increase the reliability of the system, wherever possible, low
reliability components are placed in parallel, and components with relatively high reliability
are in series.

The following steps could be useful in dealing with these mixed configurations.

Step 1: Calculate the equivalent reliability of components in series within any parallel
configuration.


Step 2: Reduce each parallel configuration to an equivalent single reliability.

Step 3: Multiply reliabilities in a series together until the overall reliability is derived.

The exercises below should help make this clear.
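These rules and steps reduce to a few lines of code. The following Python sketch (assuming statistical independence throughout; the function names are just illustrative) defines series and parallel reliabilities and works through the three steps for the combination system in Question 3 below:

```python
from math import prod

def series(reliabilities):
    """A series system works only if every component works."""
    return prod(reliabilities)

def parallel(reliabilities):
    """A parallel system fails only if every component fails."""
    return 1 - prod(1 - r for r in reliabilities)

# Question 3 below: components 1 and 2 in series, then a parallel block
# in which components 3 and 4 (in series with each other) sit alongside component 5.
r34 = series([0.90, 0.90])        # Step 1: 0.81
rp = parallel([r34, 0.90])        # Step 2: 0.981
print(series([0.90, 0.90, rp]))   # Step 3: about 0.795
```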

The importance of statistical independence

The calculations for both series and parallel configurations depend on the assumption of
statistical independence. The answers may be very different if the probabilities are not
statistically independent - an example is provided by the first self appraisal exercise below.


QUESTIONS TO STIMULATE YOUR THINKING

Now consider and answer the following questions, the purpose of which is
to test your retention of the information given in the course notes for this
session. When you have done this, check back on any questions which you
could not answer or where you gave the wrong answer, then go on to the
Self-Appraisal Exercise.

Questions

1 Five components in series have individual reliabilities of 0.99, 0.97, 0.90, 0.98, and
0.94 - as shown in the following (block) diagram. Calculate the reliability of the
system.

0.99 0.97 0.90 0.98 0.94

2 Three components in parallel have individual reliabilities of 0.99, 0.90, and 0.85.
Calculate the reliability of the system.

0.99

0.90

0.85


3 Consider the following system:

[Block diagram: components 1 and 2 (each 0.90) in series, followed by components 3
and 4 (each 0.90) in series with each other, in parallel with component 5 (0.90).]

The first two components are in series, and the last configuration is in parallel. Find the
system reliability.


Suggested Answers

1 Because the given system is a series configuration, the reliability of the system
can be obtained by:

Rsys. = 0.99 * 0.97 * 0.90 * 0.98 * 0.94 = 0.796

2 Because the components are connected in parallel, the reliability of the system is
given by:

Rsys. = 1 - Qsys. = 1 - {(1 - R1)(1 - R2)(1 - R3)}
= 1 - {(1 - 0.99) * (1 - 0.90) * (1 - 0.85)} = 0.99985

3 This is a simple series/parallel configuration. The first step is to determine the
reliability of the series system with components 3 and 4.

Therefore Rs1 =0.90 * 0.90 = 0.81

Now Rs1 forms a parallel system with component 5. Therefore the reliability of
the parallel system can be determined as follows:

Rp1 = 0.81 + 0.90 - 0.90 * 0.81 = 0.981

Now we have formed a series system with reliabilities 0.90, 0.90 and 0.981.
Therefore the system reliability is given by:

Rsys. = 0.90 * 0.90 * 0.981 = 0.795


SELF-APPRAISAL EXERCISE
Exercise
Here follows another exercise to give you further information and food for thought.

The assigned task is:


1 Below is a cross-section diagram of a booster rocket outer shell at a joint
between two stages. The system will fail only if both o-rings fail.

(a) Do the two o-rings form a series or parallel system?

(b) Suppose the reliability of each o-ring is 0.95 during the most critical
phase of flight. What is the system reliability?

(c) Do you think the two reliabilities are likely to be statistically
independent? Would this make any difference to your answer for the
system reliability?

[Diagram: cross-section of the joint, showing the fuel side, the atmosphere, the two
o-rings and the vent.]

2 A certain electronic component has an exponential failure time with a mean of
50 hours.
(a) What is the instantaneous failure rate of this component?
(b) What is the reliability of this component at 100 hours?
(c) What is the minimum number of these components that should be placed
in parallel if we desire a reliability of 0.90 at 100 hours?



Suggested Answers

1 a A parallel system as the rocket will fail only if both o-rings fail.

b 1 - (1- 0.95)(1- 0.95) = 99.75%

c This assumes the two potential failures are statistically independent - which
may be unlikely in practice. A single factor - such as temperature - may be
responsible for failures in both. This means that, if we know one o-ring has
failed, our estimate of the probability of the other failing will be higher. In
the extreme case, one o-ring fails whenever the other does, and the
reliability of the system is 0.95. In practice, the reliability of the whole
system is likely to be between 0.95 and 0.9975 (the answer based on the
assumption of independence).

2 a 2% per hour (λ = 1/MTBF = 1/50 = 0.02)

b R(100) = e^(-0.02 × 100) = e^(-2) = 0.1353

c The parallel system will only fail if all components fail. The probability of
each failing is 1 - 0.1353 = 0.8647.

If there are n in parallel we need

1 - 0.8647^n = 0.9, or

0.8647^n = 0.1

By trial and error (or see the Appendix)

n = 16

We need 16 components in parallel.
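Instead of trial and error, n can be found directly with natural logarithms, as in this short Python sketch:

```python
import math

q = 1 - math.exp(-0.02 * 100)                 # probability one component fails by 100 hours: 0.8647
n = math.ceil(math.log(0.1) / math.log(q))    # smallest n with q**n <= 0.1
print(n, 1 - q ** n)                          # 16, giving a system reliability just over 0.90
```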


Reliability Improvement Techniques

There are two established techniques for this purpose:

1 Fault Tree Analysis

2 Failure Mode Effect and Criticality Analysis (FMECA)

The second (FMECA) is not discussed here as it is covered in another unit of the course
(Performance Evaluation).

Fault Tree Analysis

This is a commonly used technique in industry to evaluate the reliability of a system. It was
developed in the early 1960s to evaluate the reliability of the Minuteman Launch Control
System. Since then it has gained favour, especially for analysing complex systems. Fault
tree analysis begins by identifying the top event, known as the undesirable event of the
system. The undesirable event of the system is caused by events generated and connected by
logic gates such as AND, OR, etc (as you will see below). The following basic steps are
involved in performing fault tree analysis:

i Establishing system definition.

ii Constructing the fault tree.

iii Evaluating the fault tree qualitatively.

iv Collecting basic data such as components' failure rates, repair rates, and failure
occurrence probability.

v Evaluating fault tree quantitatively.

vi Recommending corrective measures.

This method is frequently used as a qualitative evaluation method in order to assist the
designer, planner or operator in deciding how a system may fail and what remedies may be
used to overcome the causes of failure. The method can also be used for quantitative
evaluation, in which case the causes of system failure are gradually broken down into an
increasing number of hierarchical levels until a level is reached at which reliability data is

sufficient or precise enough for a quantitative assessment to be made. The appropriate data is
then inserted into the tree at this hierarchical level and combined using the logic of
the tree to give the reliability assessment of the complete system being studied.

In order to illustrate the application of this method, consider the electric power requirements
of the system in the following example. Here the failure event being considered is
'loss of electric power'. In practice the electric power requirements may be both a.c.
power, to supply energy for prime movers, and d.c. power, to operate relays and contactors,
both of which are required to ensure the successful operation of the electric power system.
Consequently the event 'loss of electric power' can be divided into two sub-events, 'loss of
a.c. power' and 'loss of d.c. power'. This is shown in the following figure (Figure 4), with
the events being joined by an OR gate, as failure of either, or both, causes the system to fail.

Loss of electric power
   [OR]
   ├── Loss of a.c. power
   │      [AND]
   │      ├── Loss of offsite power
   │      └── Loss of onsite power
   └── Loss of d.c. power

Figure 4: Development of a fault tree

If this subdivision is insufficient, sub-events can be divided further. The event 'loss of a.c.
power' may be caused by 'loss of offsite power' (the grid supply) and by 'loss of onsite power'
(standby generators or similar devices). This process can be continued downwards to any
required level of subdivision. After developing a fault tree, it is necessary to evaluate the
probability of occurrence of the upper event by combining component probabilities using
basic rules of probability and the logic defined in the fault tree.

In the present example, suppose the probabilities of the events at the bottom of the tree are:
Prob (Loss of offsite power) = 0.067
Prob (Loss of onsite power) = 0.075
Prob (Loss of dc power) = 0.005

Notice that these are the probabilities of faults and are typically low, whereas the reliabilities
used in the block diagrams are typically larger. Obviously
Reliability of the offsite power = 1 – Prob (Loss of offsite power) = 1 – 0.067 = 0.933.

These probabilities for the faults at the bottom of the tree can now be combined using the
addition and multiplication rules of probability. For the AND gate at the bottom we multiply
the probabilities to work out
Prob (Loss of a.c. power) = Prob(Loss of offsite power) x Prob (Loss of onsite power)
= 0.067 x 0.075 = 0.005025.

For the OR gate we add the probabilities to get the probability of the top event:
Prob (Loss of electric power) = Prob (Loss of a.c. power) + Prob (Loss of d.c. power)
= 0.005025 + 0.005 = 0.010025.

This means that the probability of the top, undesirable, event – loss of electric power – is
about 1%. The calculation makes two assumptions.

First, to multiply the two probabilities at the bottom of the tree we must assume they are
statistically independent. This is reasonable if the onsite and offsite systems are separate, so
that the failure of one is independent of the other. Notice that the combined probability (loss
of a.c. power) is much less than either of the two probabilities entering the AND
gate. The principle here is that of reducing the risk of a fault by using a backup system: this
will not work, of course, if the two systems are dependent, so that if one fails there is an
increased chance that the other will fail too.

The second assumption is that, when we add the probabilities for the OR gate, the events
are mutually exclusive (i.e. they don't overlap). With small probabilities like the ones we
have here, this is reasonable because the probability of more than one fault occurring is small.
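The whole evaluation takes only a few lines of code. Here is a minimal Python sketch of the example, with an AND gate that multiplies (assuming independence) and an OR gate that adds (assuming mutually exclusive, rare events); the function names are just illustrative:

```python
from math import prod

def and_gate(probs):
    """All inputs must occur: multiply (assumes statistical independence)."""
    return prod(probs)

def or_gate(probs):
    """Any input causes the event: add (assumes mutually exclusive, rare events)."""
    return sum(probs)

p_ac = and_gate([0.067, 0.075])   # loss of a.c. power: 0.005025
p_top = or_gate([p_ac, 0.005])    # loss of electric power: 0.010025
print(p_ac, p_top)
```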


SELF-APPRAISAL EXERCISE
Exercise

A risk assessment of a proposed new railway included the following results:

Passenger train derailment would occur if
derailment occurred due to over-speeding (4.3 × 10^-11), or
derailment occurred due to rail faults (6.4 × 10^-9), or
derailment occurred due to rolling stock faults (3.9 × 10^-9), or
derailment occurred due to running into obstructions (7.1 × 10^-9).

The figures in brackets represent the estimated probability of occurrence of each of
these per train kilometre. For example, the first probability means that the probability for
the event in question is 4.3 × 10^-11 for each train for each kilometre. Clearly the
probability for six trains, each travelling 100 km, would be 600 times as great.

The probabilities were estimated by breaking the events down into more detail
and using historical data. For example, "derailment occurred due to rolling stock faults"
was broken down into seven events which would lead to rolling stock faults - these
included wheel failure and suspension system failure. Furthermore, suspension system
failure would occur if
the suspension system deflated (1.5 × 10^-6), and
the emergency springs failed (1.0 × 10^-4).

The assigned task is:


1 Draw a fault tree from the above information.
2 Estimate the probability (per train kilometre) of suspension system failure and
explain the assumptions on which your estimate is based.
3 Estimate the probability (per train kilometre) of passenger train derailment (the
top event of the fault tree) and explain the assumptions on which your answer is
based.
4 Assuming that the railway is 100 km long and there are 20 trains each way every
day, estimate the mean time between passenger train derailments. Do you think
this represents a satisfactory level of risk?
5 What do you think of the accuracy and usefulness of this type of analysis?

(The events and probabilities are taken from a presentation by C. Leighton &
C. Dennis at a conference on Risk Analysis and Assessment, Edinburgh, 1994.)


Suggested Answers

1 The fault tree is on the next page.

2 Assuming the probabilities of the two events (suspension system deflated and
emergency springs failed) are independent, we can multiply the probabilities, so
the estimate for the probability of suspension system failure is 1.5 × 10^-10. In
practice, this assumption of independence may not be fully realistic if some
causes of the first event also make the second more likely. If the events are not
independent, the probabilities may be much higher.

3 To work out the probability of passenger train derailment we add the four
probabilities to give 1.7 × 10^-8. Note that the probability of the first event - over-
speeding - is negligible compared with the others.

This assumes that these are the only faults that will lead to derailment. Sabotage,
for example, does not seem to be included. (Strictly, adding probabilities
presupposes the events are mutually exclusive. It is possible that more than one
event can occur. However, the probability of this is so small - of the order of
10^-18 - that it can reasonably be ignored.)

4 The probability of derailment per day is 100 × 20 × 2 × 1.7 × 10^-8 = 6.8 × 10^-5.
This is, in effect, a failure rate, so the mean time between derailments (failures)
is 1/(6.8 × 10^-5) or 1.5 × 10^4 days, i.e. about 15,000 days or roughly 41 years.
Derailments can be expected, on average, every 41 years.

5 The accuracy and usefulness of the model are obviously dependent on the data
on which it is built. In particular, it is obviously important that all input
probabilities are accurate (e.g. of the emergency springs failing), that
assumptions about statistical independence are carefully checked, and that all
lists of events specifying the possible ways a fault can occur are as complete as
possible. This obviously requires systematic research.
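The arithmetic in answers 2 to 4 can be reproduced with this short Python sketch:

```python
p_suspension = 1.5e-6 * 1.0e-4                      # AND gate: 1.5e-10 per train km
p_derailment = 4.3e-11 + 6.4e-9 + 3.9e-9 + 7.1e-9   # OR gate: about 1.7e-8 per train km
km_per_day = 100 * 20 * 2                           # 100 km line, 20 trains each way
p_per_day = km_per_day * p_derailment               # about 7.0e-5 (6.8e-5 with the rounded 1.7e-8)
print(p_suspension, p_derailment, p_per_day)
print(1 / p_per_day / 365)                          # mean time between derailments: roughly 40 years
```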


Fault tree for question 1

(Note that this tree omits some faults, as explained above.)

Derailment
   [OR]
   ├── Over-speeding
   ├── Rail fault
   ├── Rolling stock fault
   │      [OR]
   │      ├── Wheel failure
   │      └── Suspension failure
   │             [AND]
   │             ├── Suspension deflated
   │             └── Emergency springs failed
   └── Obstruction


Mathematical Appendix

Powers and roots

As you will doubtless know,

p × p × p × p (four p's multiplied together) can be written as p^4.

You may also need to do this backwards. Suppose that

p^4 = 0.7 (p^4 is entered as p^4 on a spreadsheet)

and you want to know p. p is the fourth root of 0.7, or

p = 0.7^(1/4) = 0.7^0.25 = 0.915

To check, try 0.915^4. This should come to 0.7 (approximately).

Powers are also defined for negative and any other fractional index. Experiment with your
calculator or spreadsheet.

Exponential functions and natural logarithms

e^x is a function which arises from the mathematics of constant rates of growth. You should
have a button for it on your calculator, and on a spreadsheet the function is EXP(x) or
something similar. Some examples (all rounded to two decimal places):

e^1 = 2.72
e^3 = 20.09
e^-1.6 = 0.20

The inverse function to e^x is the natural logarithm of x, loge(x) or ln(x). This is useful if you
want to know what x is if, for example,

e^x = 0.5

You can find the answer by using natural logarithms:

x = loge(0.5) = -0.69315

This can be checked by working out e^-0.69315. It should be very close to 0.5.
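On a spreadsheet these operations are 0.7^0.25, EXP(x) and LN(x); in Python the equivalents look like this:

```python
import math

print(0.7 ** 0.25)      # fourth root of 0.7: about 0.915
print(0.915 ** 4)       # check: back to roughly 0.7

print(math.exp(1))      # e^1 = 2.72
print(math.exp(-1.6))   # e^-1.6 = 0.20

x = math.log(0.5)       # natural logarithm: solves e^x = 0.5
print(x, math.exp(x))   # -0.69315, then back to 0.5
```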


Healthy Laptop Case Study


This case study explores some of the concepts of reliability. It concerns a private provider of
health care that runs a healthcare system with 8 acute care hospitals, 100 clinics, and home
healthcare.

The home healthcare (HHC) division provides nurses (i.e. district nurses) to home care
patients. It employs 125 nurses and each one is provided with a laptop that allows them to
maintain patient records, point of care documentation, case loads, and payroll entry. The
nurses maintain a remote connection with the central server to access new cases and upload
any historical patient data gained throughout the day. The availability of the system, quality
of the data and timeliness of transcribing information are critical to the nurses.

The nurses have been complaining that they are having a lot of problems with the laptops and
the connection to the network and server. Reported laptop performance is very poor and 60%
of the nurses have complained of problems of one sort or another. All the laptops are from the
same manufacturer and have a similar configuration. Whenever a failure occurs the nurse has
to return the laptop to the office, obtain a loan machine, and then return that to the office when
their own laptop has been repaired. The company knows how many machines have been
returned to the manufacturer for repair but does not itself keep any records of failure symptoms
and causes – that is done by the vendor's field service group who make the repairs.

Discussions between HHC's IT Director and the manufacturer's Field Support Manager are
getting progressively more heated, with HHC threatening to switch suppliers completely –
with serious consequences for the manufacturer, who actually supplied all HHC's IT
hardware. Losing the laptop contract could result in the entire customer being lost.

In order to try to bring the relationship back onto a more businesslike footing, HHC decided
to bring in an independent reliability specialist to arbitrate.

A typical laptop configuration is shown in the following diagram:

[Diagram: typical laptop configuration.]


The manufacturer did not and would not publish reliability data for its products, saying that:

"We do not quote MTBF (Mean Time Before Failure) numbers for our products or
believe this type of information should be used as a meaningful description of the quality
of our systems or component parts within those systems. MTBF is an older industry term
that today has very little value and is mostly misinterpreted and misunderstood. The
following is a definition and example of why MTBF should be avoided.

MTBF is the point at which 63.2% of the population (everything of that component
built by a manufacturer) will fail. So for example, disk drives that claim 1,000,000 hour
MTBF with 720 power-on hours per month would take 115 years to reach the 63.2%
mark. It does NOT mean that no failure should occur for 115 years. This is a total
population statement from the manufacturer's point of view, not from the customer's point
of view. As a manufacturer, we know that some of that product will fail but we have no
idea which ones, and we do not know the distribution of good or bad product shipped to
any given customer. A single customer may get more or less than his fair share of drives


that may fail earlier than expected and therefore may achieve better or worse results and
still meet the MTBF criteria."

Exercise A

Imagine that you are the independent reliability consultant.

Please comment on the manufacturer's refusal to publish MTBF data – is their position
reasonable? Where does the 63.2% come from?

===========================================================

The manufacturer was willing to show the defect data for HHC's laptops, but once again
would not compare those results with a control sample of the same number of the same
product. The following laptop failure data (a Pareto diagram) was produced at the next
meeting:

Laptop failure causes

[Pareto chart: number of failures by cause, on a scale of 0 to 25. The categories legible in
the source appear to include damaged covers, hard drive failures, CD failures, keyboard
failures, display faults, battery and power problems, floppy drive faults and modem faults.]

The average age of the machines was three years. This prompted some discussion that they
were therefore fully depreciated and should be written off and replacements bought. The
consultant thought that this was an accountant's view of the situation and not a view that
could be supported from a quality and reliability perspective. At three years the machines
should be at the most reliable part of the reliability (bath-tub) curve – well past early life
failure and not yet approaching the wear-out part of the curve, when failures start to increase.
The high number of problems due to damaged covers, hard drives and CD failures made him
suspect that part of the problem lay in the environment in which the laptops were used and


the treatment that they received. If this were the case then replacing them with new machines
would not solve the problem as the new ones would ultimately suffer the same fate.

The diagram below shows a fault tree for one of these problem categories – hard drive
failures. (In order to make it legible it has not been fully developed—seek and logic failures
have not been explored.)

[Fault tree diagram: hard drive failures.]


Exercise B

How can this diagram, and other similar diagrams, be used to analyse and help prevent
failures? Does it show how misuse can contribute to the failure of the component under
review?

===========================================================

The consultant decided to "walk the process" and observe a sample of the nurses at work over
a period of time. He was struck by the fact that the nurses used their cars as mobile offices.
They explained that they did not like taking the laptop into the patient's house and updating
patient records in front of them, but preferred to do it in the privacy of their car in between
calls. He also noticed that the nurses usually did not pack the laptop away in its case when
they had finished updating, and often put it on the car floor where it was easy to get to next
time. He also noticed that nurses without CD players in their cars were prone to use their
laptop to play their favourite music as they drove between calls.

This pointed him to the cause of the reliability problems. Most of the cover damage and
screen breakages were actually being caused by shock and vibration in the car when the

laptop was unprotected, or by being dropped when picked up from the floor. Some of the
other failures could be caused by contamination (dust, grit, etc.) getting into the drives
while the machine was in the relatively dirty area of the car floor. The operating environment
in the car could cause other failures – the ambient temperature in the car, if the laptop was left
in direct sunlight while the nurse was with a patient, could easily exceed the manufacturer's
specification. The factors that are normally used to calculate failure rates are power-on hours,
power-on cycles, temperature, and relative humidity.

Although it was not possible to compare these findings with the manufacturer's own data, it
was possible to compare them with the findings of published industry surveys. These indicated
that laptops were increasingly being used in environments for which they were not designed,
and that field personnel were experiencing failure rates as high as 20% per week. This
confirmed the consultant's opinion that merely changing from one manufacturer's standard
product to another's would not solve the problem.
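
It is worth pausing on what a 20% weekly failure rate implies under the exponential model used earlier in these notes. The back-of-envelope sketch below assumes a constant failure rate and treats the laptop as powered for the whole 168-hour week (with fewer power-on hours the implied MTBF would be lower still):

```python
import math

# If 20% of field laptops fail in a week, weekly reliability is 0.8.
weekly_reliability = 0.80
hours_per_week = 7 * 24          # 168

# Exponential model: R(t) = exp(-lambda * t)  =>  lambda = -ln(R) / t
failure_rate = -math.log(weekly_reliability) / hours_per_week
mtbf = 1 / failure_rate

print(f"Implied failure rate: {failure_rate:.6f} per hour")  # ~0.001328
print(f"Implied MTBF: {mtbf:.0f} hours")                     # ~753 hours
```

An MTBF of roughly 750 hours is orders of magnitude below the million-hour figures quoted by manufacturers, underlining how far these field conditions were from the design assumptions.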

He decided that two courses of action were required, both involving Failure Mode Effect
Analysis (FMEA, which is covered in the Performance Evaluation Unit), in order to:

a) Recognise and evaluate the potential failure of a product or process and its effects.
b) Identify actions that could eliminate or reduce the chance of the potential failure
occurring.

 With the agreement of HHC it was decided to apply FMEA to the way the nurses used
and treated their laptops (their process) to reduce the opportunities to induce failure.
Process FMEA is normally performed on the manufacturing process; in this case study
we are applying the technique to the users' process. The objective is to identify where
within the process damage or failure can be caused to the laptop, and what action can be
taken by the nurse or HHC to minimise or eliminate it.
 With the agreement of the manufacturer it was also decided to use FMEA to evaluate the
hardware in order to develop a "ruggedised" version of the laptop for use in this and
similar field situations.

These FMEA documents are in the Appendix.


Laptop Case Study – Answers and Appendix

Exercise A.

Please comment on the manufacturer’s refusal to publish MTBF data – is their position
reasonable?

In these situations it is always very frustrating for customers and user groups to be told that
they will not be given Mean Time Between Failure (MTBF) and other failure rate data. From
the manufacturer's perspective this is a sensible approach because there is always a danger
that the MTBF data will be misinterpreted by the customer. MTBF is a measure applied to
the population as a whole, not to a very small sample.

It is true that caution must be exercised when applying averages to individual cases. However,
provided that the customer understands that the MTBF is an average figure, the MTBF would
still be useful, particularly for a customer such as HHC which buys large numbers of laptops.
In this situation, a significant departure from the average figures would indicate either a
problem with the laptops supplied, or that they are being used in unsuitable conditions. (This
issue can be analysed more formally as a null hypothesis test – see Wood, 2003, Chapter 8).
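
A minimal sketch of how such a check might look, assuming the fleet's total operating hours are known and that failures follow a Poisson process. All of the figures below (fleet size, usage, claimed MTBF, observed failures) are invented for illustration:

```python
from scipy.stats import poisson

# Hypothetical figures: a fleet of 100 laptops, each used 30 hours a week
# for 50 weeks, against a claimed MTBF of 10,000 hours.
fleet_hours = 100 * 30 * 50              # 150,000 operating hours in total
claimed_mtbf = 10_000                    # hours (assumed vendor claim)
expected_failures = fleet_hours / claimed_mtbf   # 15 expected failures

observed_failures = 40                   # assumed field count

# One-sided p-value: probability of seeing 40 or more failures
# if the MTBF claim is true and failures are Poisson-distributed.
p_value = poisson.sf(observed_failures - 1, expected_failures)
print(f"Expected failures: {expected_failures:.0f}, observed: {observed_failures}")
print(f"p-value = {p_value:.2e}")
```

A very small p-value would suggest the laptops are failing much faster than the claim implies – pointing either to a product problem or, as in this case, to an unsuitable operating environment.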

The 63.2% comes from the exponential reliability distribution. If the MTBF is 1,000,000
hours, the failure rate is 0.000001 per hour and the exponential distribution gives

R = e^(–0.000001 × 1,000,000) = e^(–1) = 0.368 = 36.8%

which means that the other 63.2% must have failed. The quotation from the manufacturer is
misleading in that it is not a definition of the MTBF, but rather a fact that follows from the
exponential distribution.
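
A short numerical check of this calculation, using Python's standard library (nothing here is specific to the case study; it simply reproduces the arithmetic above):

```python
import math

# With an MTBF of 1,000,000 hours, the constant failure rate is 1/MTBF.
mtbf = 1_000_000            # hours
failure_rate = 1 / mtbf     # 0.000001 failures per hour

# Probability that a unit survives a mission time equal to one MTBF:
t = mtbf
reliability = math.exp(-failure_rate * t)   # e^-1

print(f"R(t = MTBF) = {reliability:.3f}")          # 0.368 -> 36.8% survive
print(f"Fraction failed = {1 - reliability:.3f}")  # 0.632 -> 63.2% fail
```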
===========================================================

Exercise B.

This diagram is a good illustration of how a fault tree can help in the analysis and prevention
of failures. It shows how external factors such as mechanical shock and dirt can damage
components. Other fault trees should be constructed for the other sub-systems contained in
the laptop schematic diagram and then combined into a fault tree for the entire laptop system.
The first stage of the analysis of the fault tree is a qualitative evaluation of the subsystem to
understand how failure occurs. The next stage is to gather data on component failure rates
and repair rates to allow a quantitative evaluation – where are the major sources of failure and
what action can be taken to minimise or eliminate them?
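
To make the quantitative stage concrete, the sketch below combines basic-event probabilities through AND and OR gates, assuming independent events. The structure loosely echoes the hard drive branch of the case study, but the probabilities are invented for illustration only:

```python
# Quantitative fault tree evaluation: a minimal sketch with invented numbers.

def or_gate(*probs):
    """Output event occurs if ANY input event occurs (independent events)."""
    p_none = 1.0
    for p in probs:
        p_none *= (1 - p)
    return 1 - p_none

def and_gate(*probs):
    """Output event occurs only if ALL input events occur (independent events)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

# Hypothetical basic-event probabilities over some period of use:
p_shock = 0.05                             # shock/vibration while unprotected
p_dirt_ingress = 0.03                      # dust and grit from the car floor
p_head_crash = and_gate(p_shock, 0.40)     # shock AND head in motion (assumed)

p_hard_drive_failure = or_gate(p_head_crash, p_dirt_ingress)
print(f"P(hard drive failure) = {p_hard_drive_failure:.4f}")   # 0.0494
```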

Appendix: FMEA Documents

A process FMEA document for the Field Nurse Laptop Use Process, identifying reasonable
changes that could be made, without inconvenience to the nurses, to reduce the failure
rate of their laptops.

Failure Mode Effect Analysis (Process FMEA)

Process step: Put laptop in car
Potential failure mode: drop laptop (S = 1)
Potential effects of failure: broken screen; dislodged component; damaged hard disk; damaged outer casing (O = 1)
Current controls: none
Recommended actions (process change):
1. Fit laptops into a permanent protective case with a hinged top.
2. Make nurses responsible for damage caused by carelessness or abuse, or reward those who do not cause any problems.

Process step: Enter patient data
Potential failure mode: drop laptop when picking it up
Potential effects of failure: broken screen; dislodged component; damaged hard disk; damaged outer casing (O = 4)
Current controls: none
Recommended actions (process change):
1. Instruct nurses to carry the laptop into the patient's home and not to use their car as an office.
2. Investigate re-engineering the process; for example, use voice recognition software so that nurses can phone in after each call and dictate directly into the server, eliminating the need for laptops.

Process step: Put laptop on car floor while driving
Potential failure mode: excessive vibration; dirt/dust entry; excessive heat
Potential effects of failure: hard drive damage; electronic component failure (O = 8)
Current controls: none
Recommended actions (process change): instruct nurses to put laptops in the car boot in the protective case; this also avoids the security risk of laptops being stolen.

Process step: Play music CD in laptop
Potential failure mode: dirt/dust entry; excessive CD usage
Potential effects of failure: failure of CD drive
Current controls: none
Recommended actions (process change): as above.

Process step: Take laptop out of car
Potential failure mode: drop laptop (S = 1)
Potential effects of failure: broken screen; dislodged component; damaged hard disk; damaged outer casing
Current controls: none
Recommended actions (process change):
1. Fit laptops into a permanent protective case with a hinged top.
2. Make nurses responsible for damage caused by carelessness or abuse, or reward those who do not cause any problems.

S = Severity; an assessment of the seriousness of the effect of the potential failure mode.
Rated on a scale of 1 (none) to 10 (most severe)

O = Occurrence; the chance that the specific cause will occur. Rated on a scale of 1 (least
chance of occurrence) to 10 (highest chance of occurrence).
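
Once S and O ratings have been assigned, the failure modes can be ranked so that improvement effort goes to the riskiest first. A full FMEA would normally also rate Detection (D) and compute a Risk Priority Number, RPN = S × O × D; with only S and O recorded here, their product still gives a workable ordering. In the sketch below the ratings are partly taken from the table above and partly assumed where the source leaves gaps:

```python
# Ranking process failure modes by S x O (a reduced form of the usual
# RPN = S x O x D, since Detection was not rated in this study).
# Ratings marked "assumed" are illustrative, not from the source table.

failure_modes = [
    # (process step, failure mode, S, O)
    ("Put laptop in car", "Drop laptop", 1, 1),
    ("Enter patient data", "Drop laptop when picking it up", 4, 4),        # S assumed
    ("Put laptop on car floor", "Vibration / dirt / heat exposure", 8, 8), # S assumed
    ("Play music CD in laptop", "Dirt ingress / excessive CD use", 5, 6),  # both assumed
]

ranked = sorted(failure_modes, key=lambda fm: fm[2] * fm[3], reverse=True)
for step, mode, s, o in ranked:
    print(f"S x O = {s * o:3d}   {step}: {mode}")
```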


An FMEA study on the laptop itself, with the objective of identifying the failure points and
the design changes that could minimise or eliminate the risk of failure occurring. First
create the block diagram of the laptop and then the FMEA document (the document is completed
here for one item only).

Design FMEA should always begin with a block diagram (Besterfield, Total Quality Management).
The block diagram is used to show the different flows involved with the item being analysed;
its purpose is to understand the input to each block, the function of the block, and the
output. First, list all the components of the system, their functions, and the means of
connection or attachment between them; then place the components in blocks and represent
their functional relationships by lines connecting the blocks.

The laptop component diagram provided earlier only shows data flows between components,
but it can be updated as follows:


Failure Mode Effect Analysis (Product Design FMEA)

Item / function: 30 GB hard drive – store data for use by the operating system and applications; read data from the disk and provide it to the processor chip when required; write data to the disk when instructed by the processor chip.

Potential failure mode: head/disk interference
Potential effect of failure: complete functional failure – data can neither be read from nor written to the disk (S = 10)

Potential cause/mechanism of failure: contamination during manufacture (O = 3)
Current design controls: product is assembled in a clean room
Recommended actions (design change):
1. Increase the frequency of checks to ensure that the clean room environment meets specification.
2. Increase checks on clean room staff clothing and cleanliness.

Potential cause/mechanism of failure: disk becoming distorted by physical damage (O = 5)
Current design controls: laptop casing designed to absorb impact without damage to contents
Recommended actions (design change):
1. Use a better energy-absorbing material for the casing.
2. Integrate a carrying handle into the casing.

Potential cause/mechanism of failure: dirt ingress during operation (O = 2)
Current design controls: product is a sealed unit
Recommended actions (design change): review the construction of the disk drive case – can it fail in use and allow dirt ingress?

Potential failure mode: slow disk rotation
Potential effect of failure: excessive data errors on read/write; complete functional failure as speed drops (S = 9)

Potential cause/mechanism of failure: bearing 'sticky' after a long idle period (O = 5)
Current design controls: a sample is laboratory tested during manufacture
Recommended actions (design change): shelf-life problem – apply stock control limits to spare parts and assemblies.

Potential cause/mechanism of failure: motor defective (O = 1)
Current design controls: manufacturing puts a sample on life test
Recommended actions (design change): change to a higher specification if test results show a life problem.