RCM Fundamentals - Meridium
Maintenance
Table of Contents
Operational
Repair Only
RCM-HO-05a Assigning Consequences
Applicable and Effective
Tolerable levels of Risk
Hidden Failures
The Famous Pump Example
Exercise 1
Exercise 2
Exercise 3
Case Study - BP refinery Incident
Managing Safety and Environmental Consequences
Economic Consequences
RCM-DO-06 Applicability and Task Selection
Types of Maintenance
Preventive Maintenance (PM’s)
Predictive Maintenance
Detective Maintenance
Exercise 1 – Task Categories
Exercise 2 – Which type of maintenance?
The Basis of Task Preference
RCM-DO-06c Uses of MTBF
What MTBF can tell us?
At what level can we apply MTBF?
How can MTBF add value to Reliability Initiatives?
Summary
RCM-DO-06d Advanced Detective Maintenance Techniques
Exercise 1 – Steam Turbine
Exercise 2 – Steel Plant
Common Cause Failure Modes
Exercise 4 - Hoist
Options for redesign
Multiple Redundant Devices
Exercise 5 – Pumps and PSV’s
Managing Risk in Hidden Failures
Voting Systems
Economic Consequences
Exercise 6 – Economic Hidden Failures
RCM-DO-07 The Value of RCM
The Cashable Results of RCM
The Non-cashable Results of RCM
The Principal Barrier to Value Realization
The Role of the RCM Facilitator/Analyst
Foreword
The Reliability-Centered Maintenance (RCM) approach was first documented in
the detailed book on the subject by F. Stanley Nowlan, Director, Maintenance
Analysis, and Howard F. Heap, Manager, Maintenance Program Planning, both
of United Airlines¹. The book was sponsored by the Office of the Assistant
Secretary of Defense (Manpower, Reserve Affairs and Logistics) and was
published in 1978. From that book:
For years maintenance was a craft learned through experience and rarely examined
analytically. As new performance requirements led to increasingly complex equipment, however,
maintenance costs grew accordingly. By the late 1950's the volume of these costs in the airline
industry had reached a level that warranted a new look at the entire concept of preventive
maintenance. By that time studies of actual operating data had also begun to contradict certain
basic assumptions of traditional maintenance practice.
One of the underlying assumptions of maintenance theory has always been that there is a
fundamental cause-and-effect relationship between scheduled maintenance and operating
reliability. This assumption was based on the intuitive belief that because mechanical parts wear
out, the reliability of any equipment is directly related to operating age. It therefore followed that
the more frequently equipment was overhauled, the better protected it was against the likelihood of
failure. The only problem was in determining what age limit was necessary to assure reliable
operation.
In the case of aircraft it was also commonly assumed that all reliability problems were
directly related to operating safety. Over the years, however, it was found that many types of
failures could not be prevented no matter how intensive the maintenance activities. Moreover, in a
field subject to rapidly expanding technology it was becoming increasingly difficult to eliminate
uncertainty. Equipment designers were able to cope with this problem, not by preventing failures,
but by preventing such failures from affecting safety. In most aircraft essential functions are
protected by redundancy features which ensure that, in the event of a failure, the necessary
function will still be available from some other source. Although fail-safe and "failure-tolerant"
design practices have not entirely eliminated the relationship between safety and reliability, they
have dissociated the two issues sufficiently that their implications for maintenance have become
quite different.
A major question still remained, however, concerning the relationship between scheduled
maintenance and reliability. Despite the time-honored belief that reliability was directly related to
the intervals between scheduled overhauls, searching studies based on actuarial analysis of failure
data suggested that the traditional hard-time policies were, apart from their expense, ineffective in
controlling failure rates. This was not because the intervals were not short enough, and surely not
because the teardown inspections were not sufficiently thorough. Rather, it was because, contrary
to expectations, for many items the likelihood of failure did not in fact increase with increasing
operating age. Consequently a maintenance policy based exclusively on some maximum operating
age would, no matter what the age limit, have little or no effect on the failure rate.
¹ F. Stanley Nowlan and Howard F. Heap, Reliability Centered Maintenance, United Airlines and Dolby Press,
sponsored and published by the Office of Assistant Secretary of Defense, 1978
In 1960 a task force of FAA and airline personnel was formed to investigate
scheduled maintenance and resulted in an FAA/Industry Reliability Program in
1961. Building upon this work, in 1965 United Airlines developed a
rudimentary decision-diagram technique. This technique was refined and
embodied in the 747 Maintenance Steering Group (MSG) Handbook:
Maintenance Evaluation and Program Development (MSG-1) from the Air
Transport Association in 1968. MSG-1 was used to develop the maintenance
program for the Boeing 747, the first maintenance program to apply RCM
concepts. Subsequent improvements led to MSG-2, which was used to develop
the maintenance programs for the Lockheed L-1011 and the Douglas DC-10. A
similar document, European Maintenance System Guide, served as the basis for
development of the initial programs for the Concorde and the Airbus A300.
The objective of the approach outlined in MSG-1 and MSG-2 was to develop a
scheduled maintenance program that assured the maximum safety and
reliability of equipment at the lowest cost. An example of the success of this
approach can be seen by comparing the Douglas DC-8, which had 339 items
scheduled for overhaul under a traditional maintenance program, with the DC-10, based
upon MSG-2, which had only seven items to be overhauled. The latest
commercial aircraft maintenance guidance is based upon MSG-3 (Rev 2) for the
Boeing 757 and 767 aircraft.
In the early 1970's this work attracted the attention of the Office of the Secretary
of Defense. The Navy was the first military organization to apply RCM to both
new design and in-service aircraft. Also in the early 1970's, the Navy embarked
on a major program to change the way nuclear submarines were maintained.
Over the next 20 years the Navy would virtually eliminate scheduled overhaul on
the nuclear submarine based upon an aggressive Condition Monitoring Program
and other technical advances to the ship systems. RCM is currently being used
on all new ship designs.
The RCM methodology has subsequently been applied in a wide variety of
commercial and military applications. The Electric Power Research Institute
(EPRI) has tested the methodology at several nuclear power utility sites of Florida
Power & Electric, Duke Power, and Southern California Edison. Puget Sound
Power and Light Co. has been using RCM since 1991 in both substations and
line maintenance. NASA has long used RCM in analyzing Space Shuttle and
² Westbrook, Dennis, Boeing Commercial Airplane Group, and William H. Closser, C&A Consulting, “Transition of
an Organization to a Reliability Based Culture”, Proceedings of 14th Annual International Maintenance Conference,
August 3-7, 1997, Atlanta, GA
³ Nicholas, Jack R. “The Controversy about Reliability Centered Maintenance Methodology, Its Variants and
Derivatives”, Proceedings of the 18th International Maintenance Conference, Dec. 7-10, 2003, Clearwater, FL.
Reliability-centered Maintenance
• Productivity
– How much are we producing?
• Cost-Effectiveness
– What is it costing us to do so?
• Safety & Environment
– Are we hurting anybody or damaging the environment in
the process?
• Quality
– Are we producing at a consistent high level of quality?
• Corporate Learning
– How can I make sure that I will be able to sustain/improve
this into the future?
Understanding Failure
The “Wear-out” Curve
The belief that all assets have a “life”. That is, a period of few random failures followed by a
wear-out zone.
Eventually people started to believe that many assets actually suffered early life failures.
The “Bathtub” Curve
The “bath-tub” curve makes up the basis of many engineers' beliefs in asset performance.
B 2% 1% 17%
C 5% 4% 3%
D 7% 11% 6%
to do)
– First, define what the users want the asset to do in its present operating context
What is RCM?
~John Moubray
6. What should be done to predict or prevent each failure? (Proactive Tasks and Task Intervals)
7. What should be done if a suitable proactive task cannot be found? (Default Actions)
Functions
The FMEA
Consequences
The Seven Questions of RCM
(SAE JA1011 5a.-5g., 2002)
1. What are the functions and associated desired standards of
performance of the asset in its present operating context?
(Functions)
[Figure: RCM Decision diagram. For each consequence column the diagram asks whether a Predictive Task applies, then “Is a Preventive Replacement task technically feasible and effective?”, and, in two of the columns, whether a Detective Task applies, each with Yes/No branches.]
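The order of questions in the decision diagram (predictive task first, then preventive replacement, then a detective task where one is offered) can be sketched as a small selection function. This is only an illustration under stated assumptions: the function name, the yes/no inputs, and the rule that detective tasks are considered only for hidden failures are choices made for the example, not a transcription of the diagram.

```python
def select_task(predictive_ok: bool,
                preventive_replacement_ok: bool,
                hidden: bool = False,
                detective_ok: bool = False) -> str:
    """Return the first task type answered "Yes" for one failure mode.

    Each *_ok flag is the analyst's answer to "is this task technically
    feasible and effective?" (hypothetical inputs for this sketch).
    """
    if predictive_ok:
        return "Predictive Task"
    if preventive_replacement_ok:
        return "Preventive Replacement Task"
    # Assumption: detective tasks are only considered for hidden failures.
    if hidden and detective_ok:
        return "Detective Task"
    # Nothing proactive was found, so question 7 (Default Actions) applies.
    return "Default Action"


print(select_task(False, True))
print(select_task(False, False, hidden=True, detective_ok=True))
```

A failure mode for which every answer is "No" falls through to a default action, which is the situation addressed by question 7.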
6. What should be done to predict or prevent each failure? (Proactive Tasks and Task Intervals)
Each task must be applicable and effective
Default Actions
The Seven Questions of RCM
(SAE JA1011 5a.-5g., 2002)
Operating Context
1. Duty Cycles…
2. Weather and the immediate Environmental…
3. Applicable regulations and laws…
5. Remoteness…
6. How it is managed…
Our car is a Ford Focus. Great car…we maintain it to the manufacturer's specifications… Why?
Writing Functions
SAE JA1011, 5.1.3 - All functions shall contain a verb, an object and a
performance standard (quantified in every case where this can be done)
We accept that “time’s arrow” means that assets will deteriorate.
Performance standards tell us the minimum level of performance acceptable to the users or
owners of the asset.
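The verb / object / performance standard rule can be pictured as a small record with a completeness check. The sketch below is illustrative only: the class and method names are assumptions, and the example wording borrows the 19°C to 23°C office temperature range used later in this handout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionStatement:
    """One function statement: verb + object + performance standard."""
    verb: str
    obj: str
    standard: Optional[str] = None  # quantified wherever this can be done

    def is_complete(self) -> bool:
        # All three parts must be present and non-blank.
        return bool(self.verb.strip()) and bool(self.obj.strip()) \
            and bool(self.standard and self.standard.strip())

    def text(self) -> str:
        parts = ["To", self.verb, self.obj]
        if self.standard:
            parts.append(self.standard)
        return " ".join(parts)


good = FunctionStatement("maintain", "office air temperature",
                         "between 19 °C and 23 °C")
vague = FunctionStatement("be", "reliable")  # no standard: who says so?

print(good.text(), good.is_complete())
print(vague.text(), vague.is_complete())
```

The second statement reads as a sentence but fails the check, which is the kind of ambiguity a quantified standard is meant to remove.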
© Copyright Meridium, Inc. 2007
Performance Standards
1. Between Limits
2. Specific
3. Varying – Up To
4. Total
6. Open
Example: Up to 800 l/minute at 100 bar
[Figure: performance scale showing what the asset can do, what its users want it to do, and the margin for deterioration between the two]
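The margin for deterioration is the gap between what the asset can do and what its users want it to do. A minimal sketch of that arithmetic follows; the function names are assumptions, the 800 l/minute figure is the example above, and the 1000 l/minute initial capability is a made-up value used only to give the subtraction something to work on.

```python
def margin_for_deterioration(initial_capability: float,
                             desired_performance: float) -> float:
    """Gap between what the asset can do and what users want it to do."""
    if initial_capability < desired_performance:
        raise ValueError("asset cannot meet the desired performance even when new")
    return initial_capability - desired_performance


def has_failed(current_capability: float, desired_performance: float) -> bool:
    """The asset has functionally failed once it drops below the standard."""
    return current_capability < desired_performance


desired = 800.0    # l/minute, from the example above
as_new = 1000.0    # l/minute, hypothetical initial capability

print(margin_for_deterioration(as_new, desired))
print(has_failed(790.0, desired))
```

Maintenance can only keep the asset somewhere inside this margin; it cannot raise capability above what the asset could do when new.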
• To be safe…
• To be reliable…
• To comply with environmental standards…
• To comply with IE2314356XXX (etc)…
Performance standards need to be quantified where possible to avoid ambiguity.
E.g. What is reliable, and who says so?
Exercises
• An office chair…
Secondary Functions
(SAE JA1012 6.2.2)
Secondary functions are all the other requirements we have of
the asset(s) that are not covered by the primary function.
Environmental Integrity
Safety / Structural Integrity
Control / Containment / Comfort
Appearance
Protective Devices and Systems
Economy and Efficiency
Superfluous
Operational Description
The system is very simple and consists of a reciprocating piston compressor, a condenser, a thermal
expansion valve, and an evaporator. A three-phase electric squirrel cage motor drives the compressor
via four parallel v-belts. A guard is in place to stop people touching the belts while they are in use.
Setting air conditioning temperatures can be very individual and is almost never without complaints.
Over the years the company has determined that a temperature in the range of 19°C (~66°F) to 23°C
(~73°F) is the most comfortable to work at, and causes the fewest arguments. The thermostat
is set to 21°C (~70°F), and they would like it to not exceed 23°C, or to not go below 19°C.
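The temperature requirement is a "between limits" performance standard, so it has two distinct failed states: too hot and too cold. A minimal sketch, assuming the limits themselves count as acceptable and using hypothetical names and labels:

```python
LOWER_LIMIT_C = 19.0   # from the description: not below 19 °C
UPPER_LIMIT_C = 23.0   # from the description: not above 23 °C


def temperature_state(temp_c: float) -> str:
    """Classify an office temperature against the two limits."""
    if temp_c > UPPER_LIMIT_C:
        return "failed: too hot"
    if temp_c < LOWER_LIMIT_C:
        return "failed: too cold"
    # Assumption: a reading exactly on a limit is still acceptable.
    return "ok"


for reading in (18.5, 21.0, 23.0, 23.4):
    print(reading, temperature_state(reading))
```

Keeping the two failed states separate matters when writing functional failures, because "too hot" and "too cold" are likely to have different failure modes behind them.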
The compressor is oil lubricated, and compresses a standard refrigerant gas, which is a known
greenhouse gas. Any release of the refrigerant breaches a number of environmental regulations. It
takes low-pressure superheated gas from the evaporator, compresses it to high-pressure superheated
gas, and pushes it through the condenser.
A draft over the condenser coils comes from a three-phase electric fan, which removes the heat and
changes the high-pressure vapor to a high-pressure liquid. When the condenser is working well there is
a temperature differential of 3.1°C (10°F) across the condenser.
De-superheated high pressure liquid leaves the condenser in the liquid line to the thermal expansion
valve (TX valve). The TX valve regulates the flow of high-pressure liquid refrigerant into the
evaporator coil. It is designed to open just enough to let refrigerant flow while maintaining a high
pressure differential from its inlet to its outlet. The pressure at the exit of the expansion valve is low
enough that it initiates a phase change in the liquid refrigerant to a vapor.
A three phase motor forces draft air over the evaporator coils and superheats the vapor. This creates
the cooling effect. Both the evaporator fan and the condenser fan have lightweight steel cowls to stop
foreign objects from damaging the fan blades.
The refrigerant then leaves the evaporator as a superheated gas and returns to the compressor, where
the cycle begins again. Any failure of the evaporator means that there is a possibility of liquid entering
the compressor, destroying the internal components. When the evaporator is working well there is a
temperature differential of 3.1°C (10°F) across its coils.
The electric motors driving the compressor and the evaporator fan have thermal overloads that trip
the circuit if the current reaches 125% of full load amps (FLA); the condenser fan is protected at
115% of FLA.
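The trip settings above can be sketched as a simple threshold check. This is only an illustration: the function name, the dictionary, and the 10 A / 12 A figures are hypothetical and are not part of the case study.

```python
def overload_trips(current_amps: float, fla_amps: float, trip_fraction: float) -> bool:
    """Return True when the measured current reaches the overload trip threshold."""
    return current_amps >= trip_fraction * fla_amps


# Trip settings as fractions of FLA, taken from the description above.
TRIP_FRACTION = {"compressor": 1.25, "evaporator fan": 1.25, "condenser fan": 1.15}

# Hypothetical example: each motor is rated at 10 A full load and is drawing 12 A.
for motor, fraction in TRIP_FRACTION.items():
    print(motor, overload_trips(12.0, 10.0, fraction))
```

With these assumed figures only the condenser fan would trip, because its protection is set lower than that of the other two motors.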
The company has local research reports that show that bacteria,
viruses and fungi tend to thrive in that part of the world when the
humidity is greater than 47%. Similar “wellness” reports have
shown that workers in an office environment are most
comfortable between 30% and 44%. If the humidity is too low
workers often suffer from dry eyes and increased static, and it feels
colder than it is. If it is too high, workers feel very uncomfortable
and it feels hotter than it is.
The air conditioner typically needs to run for 8-10 minutes before
the dehumidification process can commence. At its present
design capacity, it will run for 100% of the time in summer, and
40-50% of the time during other seasons in this climate.
However, if the thermostat fails, and stops the compressor at
temperatures above its set point, then this will cause short run
times, and will not allow the unit to dehumidify the air in the
office space.
The company using this unit has other similar systems installed
in other offices and finds them to be reliable and economical to install and to run. However,
discussions with the manufacturer and a study of the history of similar systems have produced the
following list of common failures.
a) Condenser fins flattened, preventing forced airflow over the condenser coils. (Installation
errors)
b) Evaporator fins flattened, preventing forced airflow over the evaporator coils. (Installation
errors)
c) Clogging of the TX valve, causing a total failure of the system (Normally occurs every 2 years)
d) Wear out of the valves within the compressor. (Normally once every 5 years)
e) Failure of the thermostat, meaning it will not trip at all (once every 4 years), or it will trip at
temperatures greater than the set point. (once every 6 years)
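As a rough aid when discussing this list, the quoted intervals can be turned into average failure rates. The sketch below assumes that "once every N years" means an average rate of 1/N per year, which is a simplification and not something the manufacturer's data states. Items a) and b) are left out because no interval is given for them.

```python
from fractions import Fraction

# Years between failures, as quoted in items c) to e) above.
INTERVAL_YEARS = {
    "TX valve clogged": 2,
    "compressor valves worn out": 5,
    "thermostat does not trip at all": 4,
    "thermostat trips above the set point": 6,
}


def failures_per_year(interval_years: int) -> Fraction:
    # Assumption: "once every N years" is read as an average rate of 1/N per year.
    return Fraction(1, interval_years)


total = sum(failures_per_year(n) for n in INTERVAL_YEARS.values())
print(total, float(total))
```

Under that assumption the listed failure modes add up to 67/60, or a little more than one failure per year for the unit as a whole.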
While these are common failure modes, they do not include all of the likely failure modes. For
example, the drive motors for the compressor, the condenser, and the evaporator are all standard three-
phase squirrel cage electric motors and suffer from the failure modes that generally occur in these
types of motors.
Functional Failures
The Seven Questions of RCM
(SAE JA1011 5a. -5g. 2002 )
1. What are the functions and associated desired standards of performance of
the asset in its present operating context?
(Functions)
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Failed States
Functional Failures indicate failed states – “How” it is unable to do
what we want it to.
• We need to define all of the Failed States for every function.
– Failed states are derived directly from the function statements and
their performance standards
– Generally cover too much, too little (partial) and not at all…(total)
• To pump water from tank A to tank B at up to 800 l/minute
(Varying)
– Unable to pump at all
– Pumps at more than 800 l/minute (?)
• To pump water from tank A to tank B at between 800 l/minute
and 1000 l/minute (Multiple)
– Unable to pump at all
– Pumps at less than 800 l/minute
– Pumps at more than 1000 l/minute
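The way failed states fall out of a performance standard can be sketched in a few lines. The function name and the sample flows are hypothetical; the limits of 800 and 1000 l/minute come from the function statement for the second pump.

```python
def failed_state(flow_l_min: float, lower: float, upper: float) -> str:
    """Classify a measured flow against a performance standard of lower..upper l/minute."""
    if flow_l_min <= 0:
        return "total: unable to pump at all"
    if flow_l_min < lower:
        return f"partial: pumps at less than {lower:g} l/minute"
    if flow_l_min > upper:
        return f"partial: pumps at more than {upper:g} l/minute"
    return "no functional failure"


# Hypothetical measured flows checked against the 800 to 1000 l/minute standard.
for flow in (0, 750, 900, 1100):
    print(flow, "->", failed_state(flow, 800, 1000))
```

Each branch corresponds to one failed state: not at all (total), too little, and too much.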
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Exercise
The primary function of a grinding machine may be listed as: “To grind bearing journals in a
cycle time of 3.00 minutes ± 3 seconds, to a diameter of 75 mm ± 0.1 mm, with a surface
finish of no greater than Ra 0.2.”
[Figure: tolerance scales for the exercise. Diameter: 75 mm with 0.05 mm graduations either side.
Cycle time: 2.54, 2.57, 3 minutes, 3.03, 3.06.]
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Exercises
[Figure: exercise diagram labeled 100 bar and 800 l/minute]
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Reasonably Likely
Pump struck by lightning:
• North West Australia – reasonably likely
• The Atacama Desert in Chile – highly unlikely
Levels of reasonableness are determined by the analysis group.
If no agreement is possible then the organization that owns the assets must make a decision.
© Copyright Meridium, Inc. 2007
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Causality
Level 1?
Unable to pump water at all
1. Motor fails
2. Pump fails
3. Pipes fail
4. Inlet to tank B blocked
5. Outlet from tank A blocked
… or Level 2?
Unable to pump water at all
1. Motor fails due to stator earth fault
2. Motor fails due to short between the coils
3. Motor fails due to fan end bearing failure
4. Motor fails due to drive end bearing failure
5. Motor fails due to overheating
6. Motor fails due to loose connections
… or Level 3?
Unable to pump water at all
1. Drive end bearing fails due to ingress of water
2. Drive end bearing fails due to lack of adequate grease
3. Drive end bearing fails due to misalignment
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Due to inadequate
training of the
lubrication technician
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Function: To pump water from tank A to tank B at 800 l/minute
Functional failure: Unable to pump water from tank A to tank B
Failure modes:
– Drive end motor bearing failed due to lack of grease
– Short in motor windings due to insulation degrading over time
– Drive end motor bearing seized due to misalignment on installation
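The relationship between a function, its functional failures, and their failure modes can be held in a simple nested record. This is a minimal sketch only; the class and field names are hypothetical and do not describe how any particular RCM software stores its data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FunctionalFailure:
    description: str
    failure_modes: List[str] = field(default_factory=list)


@dataclass
class Function:
    statement: str
    functional_failures: List[FunctionalFailure] = field(default_factory=list)


# The pump example from the worksheet above.
pump = Function(
    statement="To pump water from tank A to tank B at 800 l/minute",
    functional_failures=[
        FunctionalFailure(
            description="Unable to pump water from tank A to tank B",
            failure_modes=[
                "Drive end motor bearing failed due to lack of grease",
                "Short in motor windings due to insulation degrading over time",
                "Drive end motor bearing seized due to misalignment on installation",
            ],
        )
    ],
)

# Print the worksheet rows: one line per failure mode.
for failure in pump.functional_failures:
    for number, mode in enumerate(failure.failure_modes, start=1):
        print(f"{failure.description} | {number}. {mode}")
```

Keeping failure modes nested under the functional failure they cause mirrors the worksheet layout, where each failure mode is read against one failed state.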
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Types of Failures
[Figure: three diagrams comparing "What it can do" with "What its users want it to do"]
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
________________________________________
Knowledge
70%
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Effects
The Seven Questions of RCM
(SAE JA1011 5a. -5g. 2002 )
1. What are the functions and associated desired standards of performance of
the asset in its present operating context?
(Functions)
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
• SAE JA1011, 5.4.1: “Failure effects shall describe what would happen if
no specific task is done to anticipate, prevent or detect the failure.”
• They are the typical worst case scenario, not the extreme worst case
scenario.
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
A Hierarchy of Consequences
[Figure: hierarchy of consequences. Default actions shown: Combination of Tasks?; Redesign is
Compulsory; Redesign may be desirable; No scheduled maintenance]
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Hidden or Evident?
[Figure: process diagram with six outputs, each labeled "To Process"]
# The Maintenance Scorecard, Daryl Mather,
Industrial Press, ISBN 0831131810
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Hidden Failures
HN HO HE HS ES EE EO EN
Will the loss of function caused by this failure mode on its own become evident to the operating crew
under normal circumstances?
• No (hidden):
– HS: Is there an intolerable risk that the multiple failure could kill or injure someone?
– HE: Is there an intolerable risk that the multiple failure could breach a known environmental
standard or regulation?
– HO: Does the failure have a direct adverse effect on operational capability?
• Yes (evident):
– ES: Is there an intolerable risk that the failure could kill or injure someone?
– EE: Is there an intolerable risk that the failure could breach a known environmental standard
or regulation?
– EO: Does the failure have a direct adverse effect on operational capability?
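The questions on this diagram can be read as a short decision routine. The sketch below assumes the questions are asked in the order safety, environmental, operational, and that a failure mode answering "no" to all three falls into the non-operational category (HN or EN). The function and argument names are hypothetical. For a hidden failure, the safety and environmental answers refer to the multiple failure.

```python
def consequence_category(evident: bool, safety: bool,
                         environmental: bool, operational: bool) -> str:
    """Return one of HS, HE, HO, HN, ES, EE, EO, EN for a failure mode."""
    prefix = "E" if evident else "H"  # evident or hidden
    if safety:
        return prefix + "S"
    if environmental:
        return prefix + "E"
    if operational:
        return prefix + "O"
    return prefix + "N"  # assumed default: non-operational consequences


# Hypothetical answers for two failure modes.
print(consequence_category(evident=False, safety=True, environmental=False, operational=False))
print(consequence_category(evident=True, safety=False, environmental=False, operational=True))
```

The first question that is answered "yes" decides the category, so a failure mode with both safety and operational consequences is treated under safety.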
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
________________________________________
Safety
Safety Consequences
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Environmental
Environmental Consequences
• Will not default to run to failure under any circumstances; some action must
always be taken
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Operational
Operational Consequences
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Repair Only
Non-operational consequences
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
f) A speed sensor protects a turbine from over-speed, preventing it from speeding up to destruction,
sending debris in every direction. The sensor has failed in such a way that it will not trip the
turbine on over speed.
g) A standby motor to drive a pumping system has developed false brinelling (flat spots) of the
bearings. This means that when it is called on to run it will run for a short while before
tripping the motor on overload. If it runs continually in this fashion, it could also cause secondary
damage to the motor shaft.
h) When the level in a tank reaches the low level, the low-level switch starts a pump. Because of
vibration in the surrounding area, one of the terminals comes loose and the switch will not work
when it is required to.
i) A pumping system has a duty and a standby pump. The stand by pump takes over the function if
ever the duty pump should fail. Over time, the resistance of the insulation within the duty motor
breaks down, and it suffers an earth fault.
j) Due to a pinhole leak, the air pressure has gone out of the spare tire in your car.
k) Each aircraft is equipped with life preserver jackets for passenger use in case of a water landing.
One of these has developed a failure, preventing it from inflating when required.
l) An electrically driven “pony” pump primes a lubrication system on start-up; at a specified pressure
the main pump takes over to run the system at operating pressure. This is an effort to minimize the
energy usage of the plant, and the main pump could easily start up under full load with no
consequence aside from increased energy usage. The pony pump has suffered a failure of the mechanical
seal and will be unserviceable for a time.
m) An air-conditioning system has had the condenser fins flattened out by vandalism; the result is that
the airflow through the condenser is not sufficient to reduce the temperature prior to the refrigerant
gas travelling to the evaporator. The result is that the system will not reduce room temperature
below the 35°C ambient temperature. This affects the health of the people working in the room and
results in two people suffering from heatstroke.
n) The high-high level switch on a tank trips the pump when there is a high-high level. This then
needs a manual reset. At present, this switch has spurious trips that cause the pump to stop when
there is no high level.
o) A large-scale screening facility gets its supply from a conveyor running the length of the building
some four stories above the ground. Along the side of the conveyor are walkways with handrails.
One of the handrails has a crack in it that is not visible to the naked eye. However, if somebody
were to use it, it would give way, leaving the person to fall four stories to their death.
p) An IT data center houses all of the servers containing the corporate IT information. The cooling
system of a data center requires the rooms to be continuously at a temperature of between 20°C
and 25°C, and a humidity of between 40% and 60%.
A failure of the power supply could lead to outright server failure, or at the very least increase
failure rates of electronic components. This would have a catastrophic effect on business
continuity. For these reasons, a diesel generator set is on permanent standby protecting the power
supply to the coolers and humidifiers; an uninterruptible power supply or UPS further protects this.
The diesel generator set has developed a failure in the starter circuit due to corroded battery
terminals, meaning it will not be able to start when required.
q) An operating company used a tank farm to store flammable liquid raw material. A pressure safety
valve (PSV) set at the tank maximum allowable working pressure (MAWP) of 100 psig protected
one of the tanks containing a highly reactive material.
The previous PHA identified the plugging of the PSV inlet as a potential concern. The PSV’s
annual inspection reports verified plugging, substantiating this concern. The PHA team
recommended the installation of a rupture disc upstream of the PSV.
A month later, an overpressure event (triggered by contamination) caused the tank pressure to
reach 180 psig before the rupture disc blew and vented the tank contents. The ensuing Incident
Investigation revealed that the rupture disc had developed a pinhole leak and the space between the
rupture disc and PSV had pressurized to the normal tank pressure of 80 psig.
Applicable
Before selecting any failure management policy, analysts first need to determine whether or not the
task is actually possible!

Effective
Then they need to determine whether the task will be worthwhile in terms of either cost or risk.
(Based on the consequences)

Within RCM, NO task can be applied to any failure mode without first establishing that it is actually
possible to do the task, and secondly without ensuring that it will adequately manage the
consequences.
© Copyright Meridium, Inc. 2007
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Ideal Reality
Risk is the likelihood of
an unwanted event
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Individual risk minimum (Worker)   1 x 10^-5   Not Used    Not Used    Not Used
Individual risk minimum (Public)   1 x 10^-6   Not Used    1 x 10^-6   Not Used
Individual risk maximum (Worker)   1 x 10^-3   Not Used    Not Used    Not Used
Individual risk maximum (Public)   1 x 10^-4   1 x 10^-5   1 x 10^-6   1 x 10^-6
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Hidden Failures
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
[Worksheet columns: Item | Function | Function Failure | Failure Modes and Effects]
Hidden
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Slide 18
Protective
Device C Fails
If the failure rate is once in four years, then the probability that it
will fail in one year is 1 in 4.
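
The 1-in-4 figure treats the yearly probability as simply 1/MTBF. As a rough sketch only, and assuming a constant failure rate (an exponential model that the slide does not state), the two estimates can be compared in Python. The function names are illustrative and not part of the course material.

```python
import math


def simple_probability(mtbf_years: float, period_years: float = 1.0) -> float:
    """Rule-of-thumb estimate used on the slide: period / MTBF."""
    return period_years / mtbf_years


def exponential_probability(mtbf_years: float, period_years: float = 1.0) -> float:
    """Probability of at least one failure in the period,
    ASSUMING a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-period_years / mtbf_years)


print(simple_probability(4))                  # 0.25, i.e. 1 in 4
print(round(exponential_probability(4), 3))   # 0.221
```

The two figures are close when the period is short compared with the MTBF, which is why the simple ratio is usually good enough for this kind of estimate.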
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Slide 19
One year
Protected Function B: Fails
Protective Device C: Fails
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Slide 20
One year
Protected Function B: Mean Time Between Failures = 4 years
Protective Device C: Availability = 67%, Downtime = 33%
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
• 6 identical PSVs have each been checked once a year for 5 years (FFI = 1 year)
• We know that the failed devices failed some time during the year before the checks, but not
when…
[Diagram labels: To Process; PSVs 1, 2, 3]
2. So on the basis of these figures it appears that: FFI = 2 x DTdevice x MTBFdevice
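
The relationship FFI = 2 x DTdevice x MTBFdevice can be checked against inspection records. The sketch below is illustrative only: the number of failures found (3) is a hypothetical figure, since the slide text does not give one, and it assumes that each failed device had, on average, been failed for half a checking interval when it was found.

```python
def estimate_from_checks(devices: int, years: float, ffi_years: float,
                         failures_found: int) -> tuple[float, float]:
    """Return (MTBF_device, DT_device) estimated from periodic check records."""
    service_years = devices * years                 # total device-years observed
    mtbf_device = service_years / failures_found    # device-years per failure
    # A hidden failure is only found at the next check, so on average the
    # device has already been failed for half an interval (assumption).
    downtime_years = failures_found * ffi_years / 2
    dt_device = downtime_years / service_years      # fraction of time unavailable
    return mtbf_device, dt_device


# 6 PSVs checked yearly for 5 years; failures_found=3 is hypothetical.
mtbf, dt = estimate_from_checks(devices=6, years=5, ffi_years=1, failures_found=3)
print(mtbf, dt)        # 10.0 0.05
print(2 * dt * mtbf)   # 1.0, which recovers the 1-year checking interval
```

Whatever failure count is used, the product 2 x DTdevice x MTBFdevice returns the checking interval under these assumptions, which is where the formula on the slide comes from.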
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Slide 24
[Graph: axis label "Time"]
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
________________________________________
One year
Protected Function B fails: Mean Time Between Failures = 10 years
Protective Device C: Failed
1 in 10 x 1 in 100 = 1 in 1000

Step One: Decide what probability we tolerate for the multiple failure
Step Two: Determine / estimate how often the protected function is likely to need the protective
device
Step Three: Calculate what unavailability of the protective device enables us to achieve 1 given 2

If:
DTdevice = Unavailability of the protective device
MTBFfunction = Mean time between failures of the protected function
MTBFmultiple = Mean time between failures of the multiple failure
Then:
(1/MTBFfunction) x DTdevice = 1/MTBFmultiple
or
DTdevice = MTBFfunction / MTBFmultiple
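
As a minimal sketch of the three steps, the two relationships on this slide can be written as small Python functions. The function and parameter names are illustrative choices, and the figures are the ones shown on the slide (a 10-year MTBF for the protected function and a 1 in 100 unavailability for the device).

```python
def multiple_failure_rate(mtbf_function_years: float, dt_device: float) -> float:
    """Expected multiple failures per year: (1 / MTBFfunction) x DTdevice."""
    return dt_device / mtbf_function_years


def required_unavailability(mtbf_function_years: float,
                            mtbf_multiple_years: float) -> float:
    """Step Three: DTdevice = MTBFfunction / MTBFmultiple."""
    return mtbf_function_years / mtbf_multiple_years


# 1 in 10 x 1 in 100 = 1 in 1000
print(multiple_failure_rate(10, 0.01))     # 0.001

# Working backwards: tolerate one multiple failure in 1000 years (Step One),
# with the protected function failing once in 10 years (Step Two).
print(required_unavailability(10, 1000))   # 0.01
```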
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
________________________________________
FFI = 2 x DTdevice x MTBFdevice …1
…and that…
DTdevice = MTBFfunction / MTBFmultiple …2
Therefore, substituting 2 into 1 gives…
FFI = (2 x MTBFfunction x MTBFdevice) / MTBFmultiple
Where:
• FFI = failure finding task interval
• DTdevice = Unavailability of the protective device
• MTBFdevice = MTBF of the protective device
• MTBFfunction = MTBF of the protected function
• MTBFmultiple = MTBF of the multiple failure
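
The combined formula is easy to wrap in a helper. This is a sketch only: the function name and the example figures (50, 10 and 1000 years) are hypothetical and are not taken from the exercises that follow.

```python
def failure_finding_interval(mtbf_device: float, mtbf_function: float,
                             mtbf_multiple: float) -> float:
    """FFI = (2 x MTBFfunction x MTBFdevice) / MTBFmultiple.

    All three inputs must be in the same time unit (for example years);
    the result is in that unit too.
    """
    if min(mtbf_device, mtbf_function, mtbf_multiple) <= 0:
        raise ValueError("All MTBF values must be positive")
    return 2 * mtbf_function * mtbf_device / mtbf_multiple


# Hypothetical figures: device MTBF 50 years, protected function MTBF 10 years,
# tolerated multiple failure no more than once in 1000 years.
ffi_years = failure_finding_interval(mtbf_device=50, mtbf_function=10,
                                     mtbf_multiple=1000)
print(ffi_years)   # 1.0, i.e. check the device once a year
```

Keeping the three MTBF values in one unit avoids the most common slip with this formula, which is mixing hours from a data source with years from plant history.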
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Exercise 1
A small chemical plant has an eye bath to enable people to wash their eyes if dangerous chemicals
contaminate them. When asked what checks have been done on the eye bath in the past, the
maintenance department said “that’s the production department’s job”. However, production thought
the safety officer was doing it, who in turn thought it was “looked after by the preventive maintenance
system”. As a result, it appears that the eye bath has never been checked, at least on a routine basis.
The eye bath has been in place for eight years. A quick check now reveals that the eye bath is actually
in working order, so the only data we have about the reliability of this bath is that it has not failed in
eight years. Further investigation reveals that someone needed to use it in an emergency on two
occasions since it was installed.
The plant manager has asked you to set up a checking routine for this eye bath as a matter of urgency.
How often should the check be done?
The safety committee decided that they do not want the eye bath to be inoperable when it is needed
more than once in 1,000,000 years. A series of phone calls to other companies reveals 60 eye baths
that have been installed for a total of 720 years between them. Two of these have been found to be in
a failed state in that period.
FFI = (2 x MTBFdevice x MTBFfunction) / MTBFmultiple
Exercise 2
A tank is used to store diesel and is
enclosed in a concrete bund. This
is intended to prevent anything
which might escape from the tank
seeping into the ground and
breaching a variety of
environmental regulations. The
review group decides that they
would not like this to happen more
than once every 10,000 years.
A review of a number of similar
systems and discussions with users
suggest that a significant quantity
of liquid is likely to escape into the
bund no more than once every 150
years on average, usually due to leaks in pipeline flanges or seals. The integrity of the bund itself has
never been checked until now, but it can be done in a number of ways.
One is to fill the bund with water to a depth of (say) 100 millimeters, and check whether the water
level drops by more than the rate of evaporation over a period of (say) two days. Such a check is
carried out on the bund, and reveals that it is still intact.
So in the absence of any hard data at all, and after considerable discussion, the group decides that in
any one year the chance of the bund springing an invisible leak (due to subsidence, latent construction
defects or whatever) is “1 in 100”.
FFI = (2 x MTBFdevice x MTBFfunction) / MTBFmultiple
Exercise 3
A steel producing plant needs many product-handling assets to move the raw iron ore around prior to
processing.
As part of this asset base, they have 10 large conveyors. Each of these has 4 e-stops, one on either
side of the head end, and one on either side of the tail end of the conveyor.
The management has tasked the maintenance team with determining a frequency for testing the
function of each of these e-stops, to make sure that they will work when needed.
After some discussion, they consulted relevant specifications and determined that they wanted these
e-stops to meet their SIL-2 classification. For this company that means a likelihood of 1:100,000
(1 in 10^5) that any one would have a failure in any one year.
They found that on their own plant they had never experienced a failure of one of the emergency stops.
However, on consulting a commercial data store they found the following information:
• A population was tested over a time period of 10^6 hours
• During this time the item was found to have failed 8 times in an undetected and unsafe manner
• And 60 times in a detected safe fashion
All of the conveyors were installed at roughly the same time 20 years ago. After conversations with a
few of the longer serving people they were able to ascertain that they had needed to use an e-stop,
either to protect people or to protect life, approximately 15 times.
How often will they need to perform the detective task to maintain the level of risk that the
company has deemed tolerable?
Function: It needs a failure of the function before the consequences of a hidden failure are realised!
Device: Failure of the device has no consequences by itself…
This is the multiple failure: the result of a protected function failing while the protective device
is in a failed state…
Therefore… for a detective maintenance task to be technically feasible we need to:
1. Ensure that the task will not increase the probability of a multiple failure
2. Determine whether it is practical to do the task in the desired intervals
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
Economic Consequences
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
RCM Decision Diagram (based on Example 2, SAE JA1012)

Will the loss of function become evident to the operating crew under normal circumstances?
No = Hidden; Yes = Evident

For each consequence category the questions are asked in order. "Yes" selects that task; "No" moves
on to the next question.

Hidden Economic Consequence (HO, HN)
1. Is a Predictive (On-Condition) task technically feasible and effective? Yes: Predictive Task
2. Is a Preventive Restoration task technically feasible and effective? Yes: Preventive Restoration Task
3. Is a Preventive Replacement task technically feasible and effective? Yes: Preventive Replacement Task
4. Is a Detective task to detect the failure technically feasible and effective? Yes: Detective Task
5. Run-to-Fail?
6. Redesign may be desirable

Hidden Safety and Environmental Consequence (HS, HE)
1. Is a Predictive task technically feasible and effective? Yes: Predictive Task
2. Is a Preventive Restoration task technically feasible and effective? Yes: Preventive Restoration Task
3. Is a Preventive Replacement task technically feasible and effective? Yes: Preventive Replacement Task
4. Is a Detective task to detect the failure technically feasible and effective? Yes: Detective Task
5. Redesign is compulsory

Evident Safety and Environmental Consequence (ES, EE)
1. Is a Predictive task technically feasible and effective? Yes: Predictive Task
2. Is a Preventive Restoration task technically feasible and effective? Yes: Preventive Restoration Task
3. Is a Preventive Replacement task technically feasible and effective? Yes: Preventive Replacement Task
4. Is a combination of tasks technically feasible and effective? Yes: Combination of Tasks
5. Redesign is compulsory

Evident Economic Consequences (EO, EN)
1. Is a Predictive task technically feasible and effective? Yes: Predictive Task
2. Is a Preventive Restoration task technically feasible and effective? Yes: Preventive Restoration Task
3. Is a Preventive Replacement task technically feasible and effective? Yes: Preventive Replacement Task
4. Run-to-Fail?
5. Redesign may be desirable

Technical Feasibility Criteria (Applicable)
Before selecting any failure management policy, analysts first need to determine whether or not the
task is actually possible!

Predictive Tasks
Is there a clear potential failure condition? What is it? What is the P-F interval? Is the interval long
enough for action to be taken to avoid or minimise the consequences of the failure?

Preventive Restoration Tasks
Is there an age at which there is an increase in the conditional probability of failure? (Life?) What is
this age? Do enough items survive to this age to satisfy the effectiveness criteria? Will the task
restore the original resistance to failure? When there are safety or environmental consequences, all
items need to survive to this age.

Preventive Replacement Tasks
Is there an age at which there is an increase in the conditional probability of failure? (Life?) What is
this age? Do enough items survive to this age to satisfy the effectiveness criteria? When there are
safety or environmental consequences, all items need to survive to this age.

Detective Tasks

Run-to-Fail or a Combination of Tasks
For Hidden Safety & Environmental consequences, if no Failure Finding Task is feasible then re-design
is compulsory.
For Evident Safety & Environmental consequences, if no combination of tasks is feasible then re-design
is compulsory.
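
The order of questions in the diagram can be sketched as a small lookup. This is one reading of the diagram, not an official implementation of SAE JA1012; the category keys, task labels and function name are hypothetical, and the caller is assumed to have already judged each task as "technically feasible and effective" or not.

```python
from typing import Mapping

# Questions common to all four consequence categories, in diagram order.
COMMON = ["predictive", "preventive_restoration", "preventive_replacement"]

# Task questions per consequence category (labels are illustrative).
QUESTIONS = {
    "hidden_economic": COMMON + ["detective"],
    "hidden_safety_environmental": COMMON + ["detective"],
    "evident_safety_environmental": COMMON + ["combination_of_tasks"],
    "evident_economic": COMMON,
}

# Where the diagram ends up when every question is answered "No".
DEFAULTS = {
    "hidden_economic": "run-to-fail (redesign may be desirable)",
    "hidden_safety_environmental": "redesign is compulsory",
    "evident_safety_environmental": "redesign is compulsory",
    "evident_economic": "run-to-fail (redesign may be desirable)",
}


def select_policy(category: str, feasible_and_effective: Mapping[str, bool]) -> str:
    """Return the first task answered 'Yes', else the category's default."""
    for task in QUESTIONS[category]:
        if feasible_and_effective.get(task, False):
            return task
    return DEFAULTS[category]


print(select_policy("hidden_safety_environmental", {"detective": True}))
# detective
print(select_policy("evident_safety_environmental", {}))
# redesign is compulsory
```

The ordered list makes the task preference explicit: a predictive task is always considered before restoration or replacement, and redesign only appears once every task has been rejected.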
Effectiveness Criteria (Effective)
Then they need to determine whether the task will be worthwhile in terms of either cost or risk.
(Based on the consequences)

Hidden Economic Consequence
To be effective: over a period of time, the failure management policy must reduce the risk of a
multiple failure (and associated total costs) to an acceptable minimum.

Hidden Safety and Environmental Consequence
To be effective: the failure management policy must reduce the risk of the failure to a tolerable level.

Evident Safety and Environmental Consequence
To be effective: the failure management policy must reduce the risk of the failure to a tolerable level.

Evident Economic Consequences
To be effective: over a period of time, the failure management policy must cost less than the cost of
the operational consequences (if any) plus the total cost of repair.
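
For evident economic consequences, the effectiveness test is a straightforward cost comparison over a chosen period. The sketch below is illustrative only: all names and figures are hypothetical, and it makes the simplifying assumption that doing the task prevents the failures it is aimed at.

```python
def cost_over_period(cost_per_event: float, interval_years: float,
                     period_years: float) -> float:
    """Total cost of something that recurs every `interval_years` on average."""
    return cost_per_event * period_years / interval_years


def is_cost_effective(policy_cost: float, operational_cost: float,
                      repair_cost: float) -> bool:
    """Policy must cost less than operational consequences plus repair."""
    return policy_cost < operational_cost + repair_cost


period = 10  # years, hypothetical comparison period

# Hypothetical task: costs 200 each time, done every 6 months.
policy = cost_over_period(200, 0.5, period)          # 4000.0
# Hypothetical failure without the task: once every 5 years on average.
operational = cost_over_period(20_000, 5, period)    # 40000.0
repair = cost_over_period(5_000, 5, period)          # 10000.0

print(is_cost_effective(policy, operational, repair))  # True
```

For safety and environmental consequences the test is different: the policy is judged on whether it brings the risk down to a tolerable level, not on cost.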
Technically Feasible
Types of Maintenance
RCM Term: Predictive Maintenance
Colloquial Terms: On-Condition Maintenance; Condition Based Maintenance (CBM); Condition Monitoring (CM); Inspections
What it is…: Check an item for signs of potential failures and leave it in place on the condition that it will make it to its next inspection interval. (Planned)
Abbreviation: PTIVE

RCM Term: Preventive Restoration
Colloquial Terms: Overhaul; Scheduled Restoration; Rework
What it is…: A task to restore an asset's original resistance to failure prior to its failure; this is a preventive task. (Planned)
Abbreviation: PRES
RCM will always direct maintainers to choose a maintenance or operational activity over a redesign as it is
almost always the most cost effective means of managing failure.
Characteristics of Preventive tasks…
• If the task is a restoration task then it needs to restore the item's original resistance to failure…
• The majority of items must survive until this point (only a few "random" failures)
• There must be an age where the conditional probability of failure will increase dramatically (a Life)
[Figure: conditional probability of failure against age, with the "Life" marked]
Within reliability-centred maintenance there are two types of preventive tasks.
• Preventive Restoration – tasks to restore an item's original resistance to failure.
• Preventive Replacement – tasks to replace an asset.
Characteristics of Preventive tasks…
• If the failure mode has safety or environmental consequences then all items must survive to this age!
• There must be an age where the conditional probability of failure will increase dramatically (a Life)
• If the task is a restoration task then it needs to restore the item's original resistance to failure…
[Figure: conditional probability of failure against age, with the "Life" marked]
89% of failures were not related to age!
Yet despite knowing these facts many people are reluctant to let go of time based maintenance (such as many scheduled shutdowns). Why?
© Copyright Meridium, Inc. 2007
Predictive Maintenance
[Figure: P-F Interval. Labels on the curve: P1 (1 Month), P2, 2 weeks, P3]
Predictive Maintenance
• There is a clear potential failure condition (in other words, there is a clear warning of the onset of failure)
• The P-F interval is long enough for action to be taken to avoid, eliminate or minimise the consequences of failure
[Figure: The P-F Interval]
[Figure: two P-F Interval curves side by side, one labelled P1 (10 Months) and one labelled P1 (1 Month); each also marks P2 and P3]
Condition Monitoring
Detective Maintenance
This is the multiple failure: the result of a protected function failing while the protective device is in a failed state…
It needs a failure of the function before the consequences of a hidden failure are realised!
Failure of the device has no consequences by itself…
[Figure labels: Function, Device]
Therefore… for a detective maintenance task to be technically feasible we need to:
1. Ensure that the task will not increase the probability of a multiple failure
2. Determine whether it is practical to do the task in the desired intervals
3. A hydraulic oil system provides pressure to drive the hydraulic motor that powers an apron
feeder. Every so often there is a differential pressure alarm, which signals when the filter is no
longer able to filter to the correct level and rate. When this occurs, the maintenance team
cleans the filter.
4. A weight meter in a product handling plant is routinely calibrated to ensure that the production
(profitability) of the plant is accurately measured.
6. A tank contains corrosive acid which is prohibited by law from seeping into the ground.
A task has been scheduled to perform a seepage test on this tank every 4 years.
We know that once cracks are able to be detected via the dye-penetrant test, the
bearing usually has around 3 months left prior to total failure.
• Preventive Restoration – When directed at specific components and parts it will lead to a reduction
in the overall failure rate of items that have a dominant failure mode.
• Detective Tasks – If the other three are not able to be applied, then this is the best option for hidden
failures.
If the frequency is practical, it can be done safely and the task itself does not substantially increase
the risk of failure, then this is the selected option.
Unlike the other three tasks this option will leave the function in a state of unavailability for a period
4 This module deals with MTBF in isolation and does not discuss other metrics such as MTTF (Mean Time To Failure) or MTTR (Mean Time To Repair).
Table 1 contains some information that should immediately provoke some questions. For example, we
have counted four failures in our system level MTBF, yet the table contains 13 failures. (Not counting
the system failures)
To understand this we need to review the functions for each of the components mentioned.
For example, the function of the High-High Level Switch is to trip the pumping system when water
reaches the high-high level. If there is a failure preventing this asset from performing its function, it
will not prevent the system from pumping water. We have had one failure on the switch that we know
about in this period.
Another obvious issue is the fact that we have had seven failures of the Duty Pump. However, during
this time we have also only had two failures on the Stand-By pump, a dormant function, which we
know of. As this system has redundancy built into it, we can only experience a loss of the primary
function if we have a failure of the Duty pump and the Standby pump at the same time.
The four failures causing the loss of function at the system level were:
• One multiple failure of the duty and standby pump
• One failure of the High Level switch, meaning the level reached the High-High level once during
the 5-year period.
• One failure of the Low Level switch, resulting in the Low-Low level tripping the downstream
process
• One failure of the piping causing downtime
Figure 2 - MTBF at Different Levels
[Figure: System level. MTBF 1.25 years; in any year $8 \times 10^{-1}$]
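The system-level figures can be reproduced from the counts given in the text: four functional failures over the 5-year period. The sketch below is a minimal illustration only; the function names are assumptions and are not part of the course material.

```python
def mtbf_years(period_years: float, failures: int) -> float:
    """Mean time between failures: observed period divided by failure count."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return period_years / failures


def annual_failure_rate(mtbf: float) -> float:
    """Average number of failures per year implied by an MTBF in years."""
    return 1.0 / mtbf


# Four losses of system function in the 5-year period described in the text.
system_mtbf = mtbf_years(period_years=5, failures=4)
system_rate = annual_failure_rate(system_mtbf)

print(f"System MTBF: {system_mtbf} years")
print(f"Failures in any year: {system_rate}")
```

Only the four failures that caused a loss of the system function are counted here; the 13 component-level failures would give a different (component-level) MTBF.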
All the other failures mentioned were either hidden to the operations team until revealed by inspection,
or their function was protected by other assets (in the case of the failures on the Duty Pump).
As shown in Figure 1, MTBF is useful at any level throughout an asset base. However, its application
must be on the functions of the assets, and the total time required of each function, at each level of
performance measurement.
How can MTBF add value to Reliability Initiatives?
In the hands of a skilled RCM facilitator the measurement and manipulation of MTBF can be used to
set the performance expectations of the physical asset base, as well as providing a base for evaluation
of strategies, and to indicate the overall performance of assets; not just the performance of their
functions.
This helps organizations in the change process because they begin to think about what the assets do,
rather than what they are. That is, an appreciation of functional performance as opposed to asset
performance.
For example, in the system described in Figure 1 we can break the system down into its functions, and begin to assign
performance expectations to each of these (see note 7).
Function 1 - To pump water to tank B at a rate of between 900 l/minute and 1000 l/minute
Functional Failure 1.A – Does not pump water at all
The water pump in this example provides, say, the cooling water for a petrochemical plant. If the system is unable to pump
water, there will be a loss of production. The tank contains enough water to keep the plant running for a minimum period
of 2 hours, and a maximum period of 6 hours.
A multiple failure of both pumps would nominally result in a loss of production equal to, say, USD $2,000,000. In this case
the asset owner would like to keep the likelihood of this occurring to a reasonably low level and after some discussion he
decides on a level of 1:10,000 years, or an annual rate of $10^{-4}$.
This means managing all failure modes that cause this consequence (an adverse impact on operational capability) to the
same level of likelihood.
Function 2 – To trip the pumping system when water reaches the high-high level
Functional Failure 2.A – Does not trip when the water reaches the high-high level.
In the case of the water system, an overflow of the tank would result in water in the surrounding area. While this is a slip
hazard for employees sent to correct the issue, the asset owner does not regard it as a serious hazard, nor will it result in
any damage to additional equipment.
The failure mode is dormant, meaning it will only have consequences when there is a failure of the high-level switch and
the high-high level switch. In this particular case, the asset owner is at ease accepting a higher level of risk of occurrence,
say, one in every 100 years, or a likelihood of $10^{-2}$ in any one year.
Function 3 – To alarm when the tank level is at the low-low level
Functional Failure 3.A – Does not trip when the tank is at the low-low level.
As with the High-High protection this alarm is only required once there has already been a failure of some sort, in this case,
notably a failure of the Low-Level Switch.
If this was to occur, and the tank consequently ran dry, the results would be catastrophic in financial terms. The
downstream equipment would run dry, and the plant would be without cooling water forcing a loss of production estimated
at around 3 days or USD $6,000,000 in this case. There would also be damages conservatively estimated at
USD$1,500,000 for producing assets.
The asset owner sees this as the worst possible outcome of a failure of this system. As a result, he would like to keep the
likelihood of failure at 1:100,000 years, or $10^{-5}$ per year. The resulting performance expectations of failure modes are in
Table 2 below.
We can see that the sum of each of the failure modes contributing to the loss of function must equal the desired failure rate,
or risk, at the above level. (Assuming these are all the relevant failure modes)
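The summing rule above can be sketched as a simple budget check. The tolerable rates are those quoted in the text for each functional failure; the failure modes and their individual rates are hypothetical (Table 2 is not reproduced here), and treating the target as a ceiling rather than an exact equality is an assumption.

```python
import math

# Tolerable annual likelihoods quoted in the text for each functional failure.
TOLERABLE_RATE = {"1.A": 1e-4, "2.A": 1e-2, "3.A": 1e-5}


def check_budget(mode_rates, target_rate, rel_tol=1e-9):
    """Return (summed annual rate, True if the sum stays within the target)."""
    total = math.fsum(mode_rates.values())
    return total, total <= target_rate * (1.0 + rel_tol)


# Hypothetical failure modes and rates for Functional Failure 1.A.
modes_1a = {
    "hypothetical mode A": 5e-5,
    "hypothetical mode B": 3e-5,
    "hypothetical mode C": 2e-5,
}
total_1a, ok_1a = check_budget(modes_1a, TOLERABLE_RATE["1.A"])
print(f"1.A summed rate {total_1a:.1e} per year, within target: {ok_1a}")

# Adding one more hypothetical mode pushes the function over its target.
modes_1a_over = dict(modes_1a, **{"hypothetical mode D": 1e-5})
total_over, ok_over = check_budget(modes_1a_over, TOLERABLE_RATE["1.A"])
print(f"1.A summed rate {total_over:.1e} per year, within target: {ok_over}")
```

Adding rates like this assumes the failure modes are independent and that the list of modes is complete, as the text notes.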
7 Full details about how to construct a risk profile based on performance expectations are contained in module RCM-DO-05a Tolerable Levels of Risk (A Study of Industry)
Here we can see the desired failure rates set out in Table 2 for each function, and translated into a
performance requirement for each failure mode.
We can also record actual MTBF measures against this to see how effective we have been in managing
the failures of this asset to the desired levels of performance. However, this would only be a guide.
The measured MTBF only covers the period since measurement began. The best use of this
approach is to provide valuable input for RCM analysts, as well as for other applications within the
reliability field. It would also give asset owners a pre-determined risk envelope that they require their
assets to work within, increasing their control over asset performance, and hence over corporate
profitability.
Summary
MTBF is an exceptionally useful metric in the field of physical asset management and it is possible to
apply it at any level throughout the physical asset base.
The principal benefit of wide ranging use of MTBF is that it begins the process of focusing a company
on how the assets work to fulfill a function, rather than what those assets actually are. This is one of
the fundamental concepts of Reliability-centered Maintenance.
As such, at whatever level it is applied, MTBF measures the function performed by that asset, asset
system, or entire process. It is also useful for proactively establishing the performance expectations of
the asset base, particularly in the areas of the Efficiency function.
# The Maintenance Scorecard, Daryl Mather,
Industrial Press, ISBN 0831131810
[Figure: timeline showing Protective Device C failing]
If the failure rate is once in four years, then the probability that it will fail in one year is 1 in 4.
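The conversion from a failure rate to a yearly probability can be sketched as below. The first function is the simple approximation used in the text. The second assumes a constant failure rate (exponential model), which the text does not state; it is shown only to indicate how close the approximation is. Function names are illustrative.

```python
import math


def annual_probability_simple(mtbf_years: float) -> float:
    """Approximation used in the text: one failure in N years -> 1 in N per year."""
    return 1.0 / mtbf_years


def annual_probability_exponential(mtbf_years: float) -> float:
    """Probability of at least one failure in a year, ASSUMING a constant failure rate."""
    return 1.0 - math.exp(-1.0 / mtbf_years)


simple = annual_probability_simple(4)
exponential = annual_probability_exponential(4)
print(f"Simple approximation: {simple:.2f}")
print(f"Constant-rate model:  {exponential:.2f}")
```

For a 4-year MTBF the two values differ by a few percentage points; the gap shrinks as the MTBF grows relative to the one-year window.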
[Figure: one-year timeline showing Protected Function B failing and Protective Device C failing]
[Figure: one-year timeline. Protected Function B: Mean Time Between Failures = 4 years. Protective Device C: Availability = 67%, Downtime = 33%]
• 6 identical PSVs have each been checked once a year for 5 years (FFI = 1 year)
• We know that the failed devices failed some time during the year before the checks – but not when…
2. So on the basis of these figures it appears that: $FFI = 2 \times DT_{device} \times MTBF_{device}$
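The PSV reasoning can be checked numerically. In the sketch below the number of devices, years and check interval come from the example; the count of three failed devices is hypothetical, since the figure with the actual results did not survive extraction. It assumes that each failed device was, on average, failed for half a check interval before being found.

```python
def estimate_from_checks(devices, years, ffi_years, failures_found):
    """Estimate device MTBF and unavailability from periodic check results."""
    device_years = devices * years                 # total time in service
    mtbf = device_years / failures_found           # years per failure
    downtime = failures_found * ffi_years / 2.0    # half an interval each, on average
    unavailability = downtime / device_years
    return mtbf, unavailability


# 6 PSVs, 5 years, yearly checks; 3 failures found is a HYPOTHETICAL count.
mtbf_device, dt_device = estimate_from_checks(
    devices=6, years=5, ffi_years=1.0, failures_found=3
)
ffi_check = 2 * dt_device * mtbf_device  # should recover the 1-year interval

print(f"MTBF device: {mtbf_device} years")
print(f"Unavailability: {dt_device:.1%}")
print(f"2 x DT x MTBF = {ffi_check} years")
```

Whatever failure count is used, the product 2 x DT x MTBF returns the check interval, which is the relationship the slide draws from the figures.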
[Figure: one-year timeline. Protected Function B fails: Mean Time Between Failures = 10 years. Protective Device C: failed]
1 in 10 x 1 in 100 = 1 in 1000
Step One: Decide what probability we tolerate for the multiple failure.
Step Two: Determine / estimate how often the protected function is likely to need the protective device.
Step Three: Calculate what unavailability of the protective device enables us to achieve 1 given 2.
If:
$DT_{device}$ = Unavailability of the protective device
$MTBF_{function}$ = MTBF of the protected function (its failure rate is $1/MTBF_{function}$)
$MTBF_{multiple}$ = MTBF of the multiple failure (its failure rate is $1/MTBF_{multiple}$)
then:
$$\frac{1}{MTBF_{function}} \times DT_{device} = \frac{1}{MTBF_{multiple}}$$
or
$$DT_{device} = \frac{MTBF_{function}}{MTBF_{multiple}}$$
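The three steps can be sketched in a few lines. The numbers reuse the slide's "1 in 10 x 1 in 100 = 1 in 1000" example, reading the 1 in 100 as the unavailability of the protective device, which is an assumption about the lost figure. Function names are illustrative.

```python
def multiple_failure_rate(mtbf_function: float, dt_device: float) -> float:
    """Annual rate of the multiple failure: demand rate times device unavailability."""
    return (1.0 / mtbf_function) * dt_device


def required_unavailability(mtbf_function: float, mtbf_multiple: float) -> float:
    """Device unavailability needed to hold the multiple failure to its target MTBF."""
    return mtbf_function / mtbf_multiple


# Protected function fails once in 10 years; device is unavailable 1% of the time.
rate = multiple_failure_rate(mtbf_function=10, dt_device=0.01)
print(f"Multiple failure: 1 in {1 / rate:.0f} years")

# Working backwards from a tolerated multiple failure of 1 in 1000 years.
dt_needed = required_unavailability(mtbf_function=10, mtbf_multiple=1000)
print(f"Required device unavailability: {dt_needed:.0%}")
```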
$$FFI = 2 \times DT_{device} \times MTBF_{device} \quad \ldots (1)$$
…and that…
$$DT_{device} = \frac{MTBF_{function}}{MTBF_{multiple}} \quad \ldots (2)$$
Where:
• $FFI$ = failure finding task interval
• $DT_{device}$ = Unavailability of the protective device
• $MTBF_{device}$ = MTBF of the protective device
• $MTBF_{function}$ = MTBF of the protected function
• $MTBF_{multiple}$ = MTBF of the multiple failure
Therefore, substituting (2) into (1) gives…
$$FFI = \frac{2 \times MTBF_{function} \times MTBF_{device}}{MTBF_{multiple}}$$
$$FFI = \frac{2 \times MTBF_{device} \times MTBF_{function}}{MTBF_{multiple}}$$
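A minimal sketch of the failure finding interval formula follows. The protected function MTBF (10 years) and multiple failure target (1 in 1000 years) reuse figures from the earlier slide; the device MTBF of 10 years is a hypothetical value chosen for illustration. The formula relies on the device's average unavailability being roughly half the check interval divided by its MTBF, so it should be read as an approximation for intervals that are short relative to the device MTBF.

```python
def failure_finding_interval(mtbf_device, mtbf_function, mtbf_multiple):
    """FFI = 2 x MTBF_device x MTBF_function / MTBF_multiple (all in years)."""
    return 2.0 * mtbf_device * mtbf_function / mtbf_multiple


# mtbf_device=10 is HYPOTHETICAL; the other two values follow the earlier slide.
ffi_years = failure_finding_interval(
    mtbf_device=10, mtbf_function=10, mtbf_multiple=1000
)
print(f"FFI: {ffi_years} years (about {ffi_years * 12:.1f} months)")
```

As a cross-check, the required unavailability is 10 / 1000 = 1%, and 2 x 1% x 10 years gives the same 0.2 years.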
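If we call the MTBF of each of the three failure modes MD1, MD2 and MD3 respectively, then the failure rates of the three modes add to give the failure rate of the device:

$$\frac{1}{MTBF_{device}} = \frac{1}{MD_1} + \frac{1}{MD_2} + \frac{1}{MD_3}$$

Therefore:

$$FFI = \frac{2 \times MTBF_{function}}{MTBF_{multiple} \times \left(\frac{1}{MD_1} + \frac{1}{MD_2} + \frac{1}{MD_3}\right)}$$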
Exercise 4 - Hoist
A speed sensor on the hoist drum of a crane used in a machine shop is designed to activate the
emergency brake on the main hoist if the drum starts turning too fast. If any aspect of the emergency
braking system does not work when required and the hoist drum runs away, industry statistics
suggest that there is a 5% chance that someone could be badly hurt or killed as a result.
The group performing the review decides that they would like to reduce the probability of this
happening to once in 200,000 years.
If there is only a 1 in 20 chance (5%) that the multiple failure of the over speeding drum and failed
emergency brakes will hurt or kill someone, an overall probability of 1 death or injury in 200,000
years for this reason can be achieved if the probability of the multiple failure itself is reduced to 1 in
10,000 years.
This is a new system, so the users of the crane have no historical data about its performance. However,
the suppliers of the speed sensor advise that it has an MTBF in this context of 300 years, and the
emergency brake an MTBF in this context of 100 years. No information is available about the
reliability of the electrical circuit between the two, but the behavior of similar circuits on similar
cranes suggests an MTBF of 200 years.
The circumstances under which the drum over speeds and needs the emergency brake occur on
average once every 50 years. You are asked to determine how often the emergency braking system
should be tested to reduce the multiple failure probability to the required level.
$$FFI = \frac{2 \times MTBF_{function}}{MTBF_{multiple} \times \left(\frac{1}{MD_1} + \frac{1}{MD_2} + \frac{1}{MD_3}\right)}$$
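The hoist figures can be pushed through this formula as a quick check. The sketch below is one possible working, not the official answer to the exercise: the function and variable names are ours, and it assumes the three elements (sensor, circuit, brake) fail independently so that their failure rates simply add.

```python
def ffi_multi_mode(mtbf_function, mtbf_multiple, mode_mtbfs):
    """Failure finding interval (years) for a device with several failure modes.

    Assumes independent failure modes, so the device failure rate is the
    sum of the individual rates 1/MD1 + 1/MD2 + ...
    """
    device_failure_rate = sum(1.0 / md for md in mode_mtbfs)
    return 2.0 * mtbf_function / (mtbf_multiple * device_failure_rate)


# Tolerable risk: 1 death or injury in 200,000 years, with a 5% chance
# that a multiple failure hurts someone -> multiple failure 1 in 10,000 years.
tolerable_injury_interval = 200_000
chance_of_injury = 0.05
mtbf_multiple = tolerable_injury_interval * chance_of_injury

# Sensor 300 years, emergency brake 100 years, circuit 200 years;
# the drum over speeds on average once every 50 years.
ffi_years = ffi_multi_mode(50, mtbf_multiple, [300, 100, 200])
print(f"FFI = {ffi_years:.3f} years, about {ffi_years * 365:.0f} days")
```

Under these assumptions the interval comes out at a little over half a year.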
What if we re-did the speed sensor example, but with different figures? (A higher level of tolerable
risk and a lower device failure rate?)
…The electric utility which operates the turbine decides that they will accept a probability of failure of
the multiple failure once in (say) 1,000,000 years for any one turbine.
The utility has twenty similar turbines in operation for an average of ten years each, giving a total of
200 years of operating experience. As far as anyone knows, only two of these turbines have tripped out
due to over-speeding during this period. This corresponds to an MTBF of the protected function of 100
years for any one turbine.
The utility has never found one of the over speed mechanisms to be in a failed state when they have
carried out failure finding checks on their own machines, but data from a commercial data bank
indicate an MTBF of 100 years.
How often should the utility perform a failure finding task on the over speed mechanism in order to
reduce the probability of failure of the multiple failure to the desired level?
$$FFI = \frac{2 \times 100 \times 100}{1{,}000{,}000} = 0.02 \text{ years} \approx 7.3 \text{ days}$$
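The arithmetic for the single over speed mechanism can be sketched in a few lines. The function name is ours and a 365-day year is assumed for the conversion to days.

```python
def ffi_single_device(mtbf_device, mtbf_function, mtbf_multiple):
    """FFI (years) for one protective device:
    FFI = 2 x MTBF_device x MTBF_function / MTBF_multiple."""
    return 2.0 * mtbf_device * mtbf_function / mtbf_multiple


# Turbine example: device 100 years, protected function 100 years,
# tolerable multiple failure once in 1,000,000 years.
ffi_years = ffi_single_device(100, 100, 1_000_000)
ffi_days = ffi_years * 365  # assumes a 365-day year
print(f"FFI = {ffi_years} years = {ffi_days:.1f} days")
```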
We can…
Make the function evident somehow
…or…
Provide additional layers of protection
[Figure: protected function (Mean Time Between Failures = 5 years); protective device (Availability = 75%, Downtime = 25%); protective device 2; time axis of one year. Annotation: $10^{-2} \times 10^{-2} = 10^{-4}$.]
Therefore…

$$FFI = 2 \times MTBF_{device} \times \left(\frac{MTBF_{function}}{MTBF_{multiple}}\right)^{1/2}$$
What if we re-did the speed sensor example, but with different figures? (Higher level of tolerable risk,
and lower device failure rate?) (Now two sensors)
…The electric utility which operates the turbine decides that they will accept a probability of failure of
the multiple failure once in (say) 1,000,000 years for any one turbine.
The utility has twenty similar turbines in operation for an average of ten years each, giving a total of
200 years of operating experience. As far as anyone knows, only two of these turbines have tripped out
due to over-speeding during this period. This corresponds to an MTBF of the protected function of
100 years for any one turbine.
The utility has never found one of the over speed mechanisms to be in a failed state when they have
carried out failure finding checks on their own machines, but data from a commercial data bank
indicate an MTBF of 100 years.
How often should the utility perform a failure finding task on the over speed mechanism in order to
reduce the probability of failure of the multiple failure to the desired level?
$$FFI = 2 \times 100 \times \left(\frac{100}{1{,}000{,}000}\right)^{1/2} = 2 \text{ years!}$$
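The jump from days to years when a second sensor is added can be reproduced with a short sketch. It assumes, as the two-device formula does, that both sensors have the same MTBF, fail independently, and are tested at the same interval; the function name is ours.

```python
import math


def ffi_two_devices(mtbf_device, mtbf_function, mtbf_multiple):
    """FFI (years) when two identical protective devices must both be failed.

    Each device may be unavailable for sqrt(MTBF_function / MTBF_multiple)
    of the time, and unavailability = FFI / (2 x MTBF_device).
    """
    allowed_unavailability = math.sqrt(mtbf_function / mtbf_multiple)
    return 2.0 * mtbf_device * allowed_unavailability


ffi_years = ffi_two_devices(100, 100, 1_000_000)
print(f"FFI with two sensors = {ffi_years:.2f} years")
```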
$$FFI = MTBF_{device} \times \left(\frac{(n+1) \times MTBF_{function}}{MTBF_{multiple}}\right)^{1/n}$$

where n is the number of identical protective devices in parallel.
A hydraulic system is protected from overpressure by four Pressure Relief Valves (PRVs). One is
placed in the line directly from the duty and standby pumping arrangements, and there is
one PRV in each of the supply lines to the accumulators. If the pressure exceeds the safe working
pressure then the PRVs will relieve the pressure in the lines back to the hydraulic oil tank. All PRVs
are set to the same pressure level.
The unit operates under extremely high pressures and if the safe working pressure is exceeded there is
a chance of a pipe rupturing, exposing people in the surrounding areas to pressures likely to cause
serious injuries. Risk ranking structures set up by the corporate safety department have deemed this
asset a high criticality asset. This means that it will need to be managed to a tolerable probability of
failure of 1:1,000,000.
In the 12 years that the hydraulic system has been installed it has never once required any PRV to
relieve the pressure within the hydraulic circuit to the accumulators. For this system they were
unable to find failure rate information in commercial databanks.
However, a quick call to the 5 other plants in their company showed them that there were 4 such
systems in the company, with a combined operating life of 80 years. Incident records show that the
PRVs have been used to relieve the pumps 10 times. Evidence from the manufacturer suggests that the
PRVs have a failure rate of 1:100.
[Figure: hydraulic circuit showing PSV 1, PSV 2 and PSV 3 and the accumulators.]
Given that all three will be tested at the same time, what is the failure finding frequency required to
achieve the tolerable probability of a multiple failure?
$$FFI = MTBF_{device} \times \left(\frac{(n+1) \times MTBF_{function}}{MTBF_{multiple}}\right)^{1/n}$$
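One way to work the relief valve question is sketched below. Several readings are assumptions on our part: n = 3 (following "all three will be tested at the same time"), the 1:100 failure rate taken as an MTBF of 100 years per valve, the 1:1,000,000 tolerable probability taken as one multiple failure per 1,000,000 years, and the demand rate taken from the company-wide records (10 demands in 80 years).

```python
def ffi_parallel_devices(mtbf_device, mtbf_function, mtbf_multiple, n):
    """FFI (years) for n identical protective devices tested together:
    FFI = MTBF_device x ((n + 1) x MTBF_function / MTBF_multiple) ** (1/n)."""
    return mtbf_device * ((n + 1) * mtbf_function / mtbf_multiple) ** (1.0 / n)


# Company-wide experience: 10 demands in 80 years of combined operation.
mtbf_function = 80 / 10  # 8 years between demands (assumed reading)

ffi_years = ffi_parallel_devices(
    mtbf_device=100,          # assumed reading of the 1:100 failure rate
    mtbf_function=mtbf_function,
    mtbf_multiple=1_000_000,  # assumed reading of 1:1,000,000
    n=3,
)
print(f"FFI = {ffi_years:.2f} years")
```

With n = 1 the same function reduces to the single-device formula, which is a useful sanity check on the implementation.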
Third, determine if a Preventive task (Preventive Restoration or Preventive Replacement) is applicable and effective.
Fourth, determine if a Detective (Failure Finding) task is applicable and effective.
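Voting Systems
Formula for k out of n systems
If “r” = the number of units that need to be in a failed state before the entire system would fail, then:

$$r = n - k + 1$$

$$FFI = MTBF_{device} \times \left(\frac{(n-r)! \times r! \times (r+1) \times MTBF_{function}}{n! \times MTBF_{multiple}}\right)^{1/r}$$

With r = n (every unit must fail before the system fails) this reduces to the formula for n parallel devices.
[Figure: voting arrangement of units U1, U2 and U3.]
! = Factorial (used a lot in combinatorics and other probability theory and statistical formulae)
5! = 1 x 2 x 3 x 4 x 5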
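A short sketch of the voting calculation follows. It reads the factorial term as r!(n-r)!/n!, which is 1 divided by the number of ways of choosing r failed units out of n; that reading is an assumption, chosen because it reduces to the n-parallel-device formula when r = n. The function name and the example figures are ours.

```python
import math


def ffi_voting(mtbf_device, mtbf_function, mtbf_multiple, n, k):
    """FFI (years) for a k-out-of-n voting arrangement of identical devices.

    r = n - k + 1 units must be failed before the system fails.
    The factorial term is taken as 1 / C(n, r) (an assumption, see lead-in).
    """
    r = n - k + 1
    combinations = math.comb(n, r)  # n! / (r! x (n - r)!)
    ratio = (r + 1) * mtbf_function / (combinations * mtbf_multiple)
    return mtbf_device * ratio ** (1.0 / r)


# Illustrative figures only: device 100 years, function 8 years,
# tolerable multiple failure once in 1,000,000 years.
ffi_1oo3 = ffi_voting(100, 8, 1_000_000, n=3, k=1)  # all three must fail
ffi_2oo3 = ffi_voting(100, 8, 1_000_000, n=3, k=2)  # any two failed is enough
print(f"1oo3: {ffi_1oo3:.2f} years, 2oo3: {ffi_2oo3:.2f} years")
```

As expected, the 2-out-of-3 arrangement needs testing far more often than the arrangement where all three units must fail.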
Economic Consequences
The annualized cost of the multiple failure will be:

$$\frac{C_{multiple} \times FFI}{2 \times MTBF_{device} \times MTBF_{function}}$$

The annualized cost of doing a failure finding task will be:

$$\frac{C_{FF}}{FFI}$$

If FFI is a fairly small fraction of MTBF Function, the annualized cost of repairing the failed protective device will be approximately:

$$\frac{C_{Device}}{MTBF_{Device}}$$

Likewise for the function:

$$\frac{C_{Function}}{MTBF_{Function}}$$
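$$C_{Total} = \frac{C_{multiple} \times FFI}{2 \times MTBF_{device} \times MTBF_{function}} + \frac{C_{FF}}{FFI} + \frac{C_{Device}}{MTBF_{Device}} + \frac{C_{Function}}{MTBF_{Function}}$$

$C_{Total}$ is at a minimum when $\frac{dC_{Total}}{dFFI} = 0$.
[Figure: cost against FFI, showing the annualised cost of a multiple failure rising with FFI and the annualised cost of failure finding, $C_{FF}/FFI$, falling with FFI.]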
$$C_{Total} = \frac{C_{multiple} \times FFI}{2 \times MTBF_{device} \times MTBF_{function}} + \frac{C_{FF}}{FFI} + \frac{C_{Device}}{MTBF_{Device}} + \frac{C_{Function}}{MTBF_{Function}}$$

$$\frac{dC_{Total}}{dFFI} = \frac{C_{multiple}}{2 \times MTBF_{device} \times MTBF_{function}} - \frac{C_{FF}}{FFI^2}$$

Setting the derivative to zero:

$$FFI^2 = \frac{2 \times MTBF_{device} \times MTBF_{function} \times C_{FF}}{C_{multiple}}$$

$$FFI = \left(\frac{2 \times MTBF_{device} \times MTBF_{function} \times C_{FF}}{C_{multiple}}\right)^{1/2}$$

Where:
• $C_{multiple}$ = Cost of one multiple failure
• $C_{FF}$ = Cost of one failure finding task
• $MTBF_{device}$ = MTBF of the protective device
• $MTBF_{function}$ = MTBF of the protected function
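The economic optimum can be checked numerically with a small sketch. The cost figures below are hypothetical and chosen only to give round numbers; the function names are ours. The two repair cost terms are left out of the cost function because they do not depend on FFI and so do not move the minimum.

```python
import math


def ffi_dependent_cost(ffi, c_multiple, c_ff, mtbf_device, mtbf_function):
    """Annualised cost terms that vary with FFI: multiple failure + failure finding."""
    cost_of_multiple_failure = c_multiple * ffi / (2.0 * mtbf_device * mtbf_function)
    cost_of_failure_finding = c_ff / ffi
    return cost_of_multiple_failure + cost_of_failure_finding


def economic_ffi(c_multiple, c_ff, mtbf_device, mtbf_function):
    """FFI that minimises total annualised cost (derivative set to zero)."""
    return math.sqrt(2.0 * mtbf_device * mtbf_function * c_ff / c_multiple)


# Hypothetical figures: a multiple failure costs 1,000,000, one failure
# finding task costs 200, and both MTBFs are 100 years.
params = dict(c_multiple=1_000_000, c_ff=200, mtbf_device=100, mtbf_function=100)
best_ffi = economic_ffi(**params)
best_cost = ffi_dependent_cost(best_ffi, **params)
print(f"Optimum FFI = {best_ffi:.2f} years, cost = {best_cost:.0f} per year")
```

At the optimum the two FFI-dependent cost terms are equal, which is where the two curves on the cost graph cross.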
8 OPEX – Operational Expenditure
9 CAPEX – Capital Expenditure
10 Asset-Intensive – Industries where asset maintenance and asset replacement form major parts of OPEX and CAPEX
11 Maintenance refers to both routine and corrective or reactive activities.
12 The issues surrounding RCM and WoL asset management are covered in more detail in “RCM-DO-10 RCM and Whole-of-Life (WoL) Asset Management”
When maintenance is developed using an unstructured method there are common errors that can occur.
Ineffective Maintenance
One of the great misleading statistics in asset maintenance today is the calculation of average life for
bearings. The effect of this is to support the outdated and almost mystical belief in a link between
age and failure.
Based on this way of thinking, it is still common to find maintenance departments carrying out hard-
time bearing replacement programs as a means of managing risk.
However, it has been the experience of the author that hard time bearing replacement policies can
increase, rather than decrease, the likelihood of failure while at the same time increasing the direct
maintenance costs.
This flies in the face of popular beliefs and is an example of how RCM thinking can drive reductions
in routine maintenance levels.
The original Nowlan and Heap report13 specifically spoke about bearings when addressing failure in
complex assets.
A complex item, as opposed to a simple item, is one that is subject to many failure modes. As a result,
the failure processes may involve a dozen different stress and resistance considerations.
Even with complex items, failures related to age will concentrate about an average age for that mode.
However, bearings have many failure modes.
Where there is no dominant failure mode14, as is the case in complex items such as most bearings, the
average lives of the individual failure modes are widely dispersed along the entire exposure
axis.15 Therefore, failure will be unrelated to operating age. This is a unique feature of complex items.
When deciding maintenance policy for bearings, this issue is further exacerbated by the manufacturers'
provision of the L10 life. This number represents the point at which 10% of the items may have
failed, meaning that 90% will have survived.
Lieblein and Zelen, in their seminal work on the subject of bearing life16, found that the characteristic
life, the point where statistically 63.2% of the items will have failed, was roughly 5 times the L10 life.
They also found that the “life” forecasts had a median Weibull Beta value of 1.4, indicating a near
constant probability of failure. This means that the likelihood of failure at any point in the life of the
bearings in their study increased only marginally as the asset aged.
Other published analyses have quoted a beta of “1.3” for Ball and Roller Bearings, and a beta of “1”
for sleeve bearings.17
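The relationship between the L10 life and the characteristic life can be checked against a two-parameter Weibull model. This is a sketch under that modelling assumption, with function names of our own; it is not a reproduction of the Lieblein and Zelen analysis.

```python
import math


def weibull_life_at_fraction(eta, beta, fraction_failed):
    """Age by which a given fraction has failed, for a two-parameter Weibull
    with characteristic life eta and shape beta."""
    return eta * (-math.log(1.0 - fraction_failed)) ** (1.0 / beta)


def weibull_hazard(t, eta, beta):
    """Conditional probability of failure (hazard rate) at age t."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


beta = 1.4  # median shape value quoted in the text
eta = 1.0   # characteristic life, in arbitrary units

l10 = weibull_life_at_fraction(eta, beta, 0.10)
ratio = eta / l10
fraction_failed_at_eta = 1.0 - math.exp(-1.0)

print(f"Characteristic life / L10 life = {ratio:.2f}")
print(f"Fraction failed at characteristic life = {fraction_failed_at_eta:.1%}")
```

With a beta of 1.4 the ratio comes out close to 5, in line with the figure quoted above. With a beta of exactly 1 the hazard function is constant, which is the purely random case.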
13 Reliability-centered Maintenance, F.S. Nowlan et al., United Airlines, San Francisco, December 1978
14 Dominant failure mode – the most common cause of failure
15 Reliability-centered Maintenance, F.S. Nowlan et al., United Airlines, San Francisco, December 1978
16 Statistical Investigation of the Fatigue Life of Deep-Groove Ball Bearings, J. Lieblein and M. Zelen, Journal of Research of the National Bureau of Standards, Vol 57, No 5, November 1956.
In process manufacturing industries, we find contaminated oil to be one of the frequent reasons for early life
failures. However, this is only one of the multitude of stresses that bearings face as complex assets.
Others can include poor storage leading to false brinelling and early corrosion, excessive heat and
pressure, overloading, exposure to vibration, abrasions and cracks. All of these could contribute to
either early life failures or premature wear out.
Often, the L10 life is mistaken for an end life point for bearings, and thus used as a reference interval
for replacement tasks. However, as can be seen from the information above, it is not the end-life,
rather a minimum guaranteed life for 90% of bearings under specific load conditions.
This is in line with Nowlan and Heap’s findings and shows that in many cases we are at best wasting a
large portion of the bearing’s useful life, making this an ineffective use of maintenance resources.18
[Figure: conditional probability of failure (constant / random) with the L10 life, L50 life, average life and characteristic life (63.2%) marked. Callouts: “Complex assets, such as bearings, do not have a dominant failure mode. Instead they have many different stresses leading to failure.” “These failures are distributed along the stress axis, making failure unrelated to age. This is unique to complex assets.”]
Increased bearing life and decreased labor costs are not the only potential savings. By frequently
replacing bearings on, say, motor shafts, we introduce the likelihood of a range of additional failure modes.
For example, installation and frequent change out failures include:
• Wear of the motor shaft, decreasing the adequacy of the interference fit and leading to bearings spinning on the shaft (a failure of the motor, not of the bearing)
• Overheating of the bearing, leading to early life failures and distortion of the inner race
• Use of excessive force (i.e. hammers) instead of bearing pullers, damaging the races of the bearings and leading to early life failures
• Bearing misalignment
• Wrong bearing selection
• Pre-failed bearings due to poor storage techniques
17 Bloch, Heinz P. and Fred K. Geitner, 1994, Practical Machinery Management for Process Plants, Volume 2: Machinery Failure Analysis and Troubleshooting, 2nd Edition, Gulf Publishing Company, Houston, TX
18 Over one machine, this appears to be a very small maintenance cost item. However, when applied throughout a plant, or on the so-called “critical” assets, it amounts to a significant maintenance cost.
While we can manage some of these, others are a direct result of frequent bearing changes.
Therefore, if we use hard time bearing replacement as a maintenance policy then we are:
a) reducing the maximum used life of the bearing, and
b) increasing the likelihood of failure through the introduction of several additional failure modes
In the Meridium RCM decision algorithm19, a management policy for an Evident Operational and
Non-Operational failure mode must comply with the following:
“Over a period of time, the failure management policy must cost less than the cost of the operational
consequences (if any) plus the total cost of repair.”
Ineffective maintenance is more common than most professionals think. It can also include areas such
as maintenance out of context, where maintenance regimes are unaligned with how the asset is used,
or practices that decrease an asset’s efficient operation.
Using the decision algorithm in RCM, the first option available to the team is Predictive Maintenance.
Where this is both applicable and effective it will increase the effectiveness of maintenance in a range
of areas:
Predictive Maintenance detects the signs of the onset of failure. As such, it provides the capability to
manage all failures, including random failures.
It can be done in-situ and often without interfering with the normal operation of the process.
It will ensure that the asset utilizes all of its economically useful life. (As opposed to hard-time
replacements)
Inapplicable Maintenance
This mistaken belief that there is always a relationship between age and failure leads maintenance
departments to all sorts of policies that, in practice, are achieving nothing.
Often these occur during maintenance turnarounds. The opportunity to access items that are normally
in a running state drives people to inspect items just in case a life related failure mode has developed.
In particular, this again is a common activity in relation to bearing management.
For example, a turbine turnaround occurs once every 3 years (say) for other failure management
reasons.
19 The Meridium RCM Decision Algorithm is based on Figure 17 – A Second Decision Diagram Example, page 49, SAE JA1012, 2002-01
The maintenance department has taken this opportunity to perform a dye penetrant check on the
bearing to see if any cracks are starting to form, requiring them to take action.
On the face of it, this appears to be a perfectly valid, even wise, use of the opportunity. However, on
applying the RCM logic a little closer this perception changes dramatically.
For the sake of this example, we will say that the P-F interval is about 3 months, meaning that once we
detect cracks in this particular bearing, we have around three months prior to functional failure.
If we test the bearing on a hard-time basis of every three years, and the P-F interval is three months,
then the following logic applies.
a) The dye penetrant test is only useful if the bearing failure is occurring at the time of inspection.
b) This means it had to start developing at less than 3 months prior to opening.
As we shut down every 36 months, the likelihood of this occurring (given the randomness of bearing
failure) is around 1:12. Moreover, the likelihood of it not occurring is around 11:12. This task
does not satisfy the RCM applicability criteria and is a waste of resources.
[Figure: turnaround interval = 3 years against P-F interval = 3 months; likelihood of detection 1:12, likelihood of non-detection 11:12.]
In addition, by opening the bearing housing and interfering with the bearing, which presumably is
operating fine, we again introduce the possibility of human error.20
It is difficult to categorize this maintenance practice directly, but the closest match in RCM is
Predictive Maintenance (PTIVE).
In the Meridium RCM decision algorithm, this means the team needs to answer all of the following
questions before this task is applicable:
Is there a clear potential failure condition?
What is it?
What is the P-F interval?
Is the interval long enough to take action to avoid or minimise the consequences of failure?
Is the P-F interval reasonably consistent?
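The 1:12 figure used in the turnaround example comes from comparing the P-F interval with the inspection interval. A minimal sketch, assuming the onset of failure is equally likely at any point between inspections (the function name is ours):

```python
def detection_likelihood(pf_interval_months, inspection_interval_months):
    """Chance that a randomly timed potential failure is present, but has not
    yet become a functional failure, when the inspection takes place."""
    return min(1.0, pf_interval_months / inspection_interval_months)


likelihood = detection_likelihood(pf_interval_months=3, inspection_interval_months=36)
print(f"Likelihood of detection:     {likelihood:.3f} (about 1 in {1 / likelihood:.0f})")
print(f"Likelihood of non-detection: {1 - likelihood:.3f}")
```

The same function shows why a condition check normally needs to be done at an interval shorter than the P-F interval: only then does the likelihood of detection reach 1.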
20 Human error is discussed in detail in module RCM-DO-06a Introducing Human Error.
Increases in Revenue
There are two specific areas where an RCM team can claim savings.
a) Where an asset, or system, has a history of failures leading to lost production opportunities.
Principally this refers to unplanned shutdowns, overrun turnarounds, and start up issues of an
asset or system.
b) Where an asset, or system, has a history of failures leading to reduced production output. This
includes areas such as utilization, quality, and reduced availability. For example:
a. Reduced turnaround times
b. Increased yield (quality)
c. Increased availability for full production rates
As these are historic failures, issues such as quantification of lost production, direct maintenance costs,
and the frequency of failure are relatively easy to find out.
However, an alternative is to use sophisticated forecasting techniques such as Crow-AMSAA. This is
time proven as an accurate method for forecasting failure rates; enabling the team to then calculate
savings from the changes to asset maintenance. This is also a valid method for forecasting savings in
direct costs.
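As an illustration of the forecasting idea, the sketch below fits a Crow-AMSAA (power law) model, where the expected cumulative number of failures by time t is lambda x t^beta, and uses it to forecast failures over the next period. The estimator shown is the usual maximum likelihood form for observation stopped at a fixed time; the failure dates are hypothetical and the function names are ours.

```python
import math


def fit_crow_amsaa(failure_times, observation_end):
    """Fit N(t) = lam * t**beta to failure times observed up to observation_end.

    Uses the time-terminated maximum likelihood estimates:
    beta = n / sum(ln(T / t_i)),  lam = n / T**beta.
    """
    n = len(failure_times)
    beta = n / sum(math.log(observation_end / t) for t in failure_times)
    lam = n / observation_end ** beta
    return lam, beta


def expected_failures(lam, beta, start, end):
    """Expected number of failures between two points in time."""
    return lam * (end ** beta - start ** beta)


# Hypothetical failure history: ages in years at which failures were recorded
# during 10 years of observation.
failure_times = [0.4, 1.1, 1.9, 3.2, 4.0, 5.5, 7.1, 9.0]
lam, beta = fit_crow_amsaa(failure_times, observation_end=10.0)
forecast = expected_failures(lam, beta, start=10.0, end=12.0)
print(f"beta = {beta:.2f}, forecast failures in the next 2 years = {forecast:.2f}")
```

Multiplying the forecast number of failures by the cost per failure, before and after the maintenance change, gives one way of estimating the savings.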
However, it is possible to represent some non-cashable benefits in monetary terms. The most common
of these is cost avoidance.
Risk Mitigation
When the mitigated risk is economic, it is often termed cost avoidance.
Where the team has implemented a policy for a reasonably likely21 failure mode where there was an
inadequate existing strategy in place, the team is justified in claiming this as a potential benefit of
RCM, even though the failure has not occurred previously.
These benefits count as non-cashable for a number of reasons:
1. They will never appear as part of the profit and loss of any enterprise. Nor will they cause a change
to maintenance budgets or revenues.
2. The team requires estimates to calculate the cost avoidance benefit. Some failure modes may have
similar consequences, affect similar assets, and have overlapping impacts on production.
For example, RCM teams can find themselves presenting benefits of several times the value of the
entire installation. If not explained correctly this is a false representation, which can erode the
credibility of RCM, and of the team attempting to implement it.
They are nevertheless valid and important benefits for the RCM team to claim.
Note the emphasis on “an inadequate existing strategy”. RCM did not invent maintenance, and often
there are adequate existing failure management policies in place.
As an output, the team will find that some maintenance regimes will disappear, some will remain, and
they will add some new, more sophisticated, regimes.
This occurs because some of the maintenance policies in place are redundant, some are either
inapplicable or ineffective, yet others are adequate
[Figure: existing pre-RCM routines, showing redundancy and the remaining pre-RCM routines.]
… consequences through regulatory penalties, or through secondary economic damages caused by the
failure. Where this is the case then the team can calculate the value of the cost avoided in a similar
method to economic only consequences.22
21 What constitutes reasonably likely is specific to each company, and often to each RCM analysis. Methods for determining reasonableness are not included in this module.
Where the failure mode will not have significant economic consequences, the delta between the
discovered risk and the managed risk can represent the benefit of risk mitigation.
The Principal Barrier to Value Realization
The benefits of RCM are obvious to anybody who has studied it or to any maintenance practitioner
who can relate to the concepts espoused in the method.
All levels within the corporation generally see different advantages to RCM and there is rarely a lack
of motivation for improvement.
Implementation problems commence due to fundamental misunderstandings about maintenance and
the functions of physical asset management23. This leads maintenance departments to see increased
risk where it does not exist.
For example, a maintenance manager could face any of the following recommendations (among others):
• Elimination of hard-time replacement policies where applicable and effective,
• Elimination of invasive inspection while we have the opportunity on planned turnarounds.
[Figure: benefits of RCM. Cashable: Increased Revenue, Reduced Costs. Non-Cashable: Risk Mitigation, Knowledge Increases.]
This reluctance to change comes from the perception that this is risky, and instead of implementing
the policy changes, things stay as they are.
The result is more of the same.
• Risk of unplanned failure stays provably higher, and
• the effectiveness of maintenance stays provably lower.
Moreover, resources remain tight performing maintenance that is not required, or repairing problems
caused by the activities that are supposed to prevent them.
It is clear that before we can successfully implement the strategy outcomes of RCM, we first need to
make sure that there is a deep understanding within the company of modern reliability principles.
22 Cost avoidance calculation methods are available in Handout RCM-DO-07a Calculating Costs Avoided, inspired by the work of Steve Soos on this subject.
23 The Role of the Maintenance Manager, Daryl Mather, 2008:
• Design effective maintenance policy
• Execute them as efficiently as possible
• Collect relevant data for higher confidence decisions in the future.