Root Cause Analysis
Root cause analysis (RCA) is a class of problem-solving methods aimed at identifying the root
causes of problems or events. The practice of RCA is predicated on the belief that problems are
best solved by attempting to correct or eliminate root causes, as opposed to merely addressing
the immediately obvious symptoms. By directing corrective measures at root causes, it is hoped
that the likelihood of problem recurrence will be minimized. However, it is recognized that
complete prevention of recurrence by a single intervention is not always possible. Thus, RCA is
often considered to be an iterative process, and is frequently viewed as a tool of continuous
improvement.
RCA is initially a reactive method of problem detection and solving: the analysis is done after
an event has occurred. As practitioners gain expertise, RCA can also become a proactive
method, able to anticipate the possibility of an event before it occurs.
Root cause analysis is not a single, sharply defined methodology; there are many different tools,
processes, and philosophies of RCA in existence. However, most of these can be classed into
five very broadly defined "schools" that are named here by their basic fields of origin:
safety-based, production-based, process-based, failure-based, and systems-based.
Safety-based RCA descends from the fields of accident analysis and occupational safety
and health.
Production-based RCA has its origins in the field of quality control for industrial
manufacturing.
Process-based RCA is basically a follow-on to production-based RCA, but with a scope
that has been expanded to include business processes.
Failure-based RCA is rooted in the practice of failure analysis as employed in
engineering and maintenance.
Systems-based RCA has emerged as an amalgamation of the preceding schools, along
with ideas taken from fields such as change management, risk management, and systems
analysis.
Despite the seeming disparity in purpose and definition among the various schools of root cause
analysis, there are some general principles that could be considered as universal. Similarly, it is
possible to define a general process for performing RCA.
A failure modes and effects analysis (FMEA) is a procedure in product development and
operations management for analysis of potential failure modes within a system for classification
by the severity and likelihood of the failures. A successful FMEA activity helps a team to identify
potential failure modes based on past experience with similar products or processes, enabling
the team to design those failures out of the system with the minimum of effort and resource
expenditure, thereby reducing development time and costs. It is widely used in manufacturing
industries in various phases of the product life cycle and is now increasingly finding use in the
service industry. Failure modes are any errors or defects in a process, design, or item, especially
those that affect the customer, and can be potential or actual.
Basic terms
[Figure: FMEA cycle]
Failure mode: "The manner by which a failure is observed; it generally describes the way the
failure occurs."
Failure effect: Immediate consequences of a failure on operation, function or functionality, or
status of some item.
Indenture levels: An identifier for item complexity. Complexity increases as levels are closer to
one.
Local effect: The failure effect as it applies to the item under analysis.
Next higher level effect: The failure effect as it applies at the next higher indenture level.
End effect: The failure effect at the highest indenture level or total system.
Failure cause: Defects in design, process, quality, or part application, which are the underlying
cause of the failure or which initiate a process which leads to failure.
Severity: "The consequences of a failure mode. Severity considers the worst potential
consequence of a failure, determined by the degree of injury, property damage, or system
damage that could ultimately occur."
Implementation
In FMEA, failures are prioritized according to how serious their consequences are, how
frequently they occur and how easily they can be detected. An FMEA also documents current
knowledge and actions about the risks of failures for use in continuous improvement. FMEA is
used during the design stage with an aim to avoid future failures. Later it is used for process
control, before and during ongoing operation of the process. Ideally, FMEA begins during the
earliest conceptual stages of design and continues throughout the life of the product or service.
The outcome of an FMEA development is actions to prevent or reduce the severity or likelihood
of failures, starting with the highest-priority ones. It may be used to evaluate risk management
priorities for mitigating known threat vulnerabilities. FMEA helps select remedial actions that
reduce cumulative impacts of life-cycle consequences (risks) from a system failure (fault).
The process for conducting an FMEA is straightforward. It is developed in three main phases, in
which appropriate actions need to be defined. Before starting an FMEA, however, it is important to
complete some pre-work to confirm that robustness and past history are included in the analysis.
A robustness analysis can be obtained from interface matrices, boundary diagrams, and
parameter diagrams. Many failures are due to noise factors and shared interfaces with other
parts and/or systems, because engineers tend to focus on what they control directly.
To start it is necessary to describe the system and its function. A good understanding simplifies
further analysis. This way an engineer can see which uses of the system are desirable and which
are not. It is important to consider both intentional and unintentional uses. Unintentional uses are
a form of hostile environment.
Then, a block diagram of the system needs to be created. This diagram gives an overview of the
major components or process steps and how they are related. These are called logical relations
around which the FMEA can be developed. It is useful to create a coding system to identify the
different system elements. The block diagram should always be included with the FMEA.
Before starting the actual FMEA, a worksheet needs to be created, which contains the important
information about the system, such as the revision date or the names of the components. On this
worksheet all the items or functions of the subject should be listed in a logical manner, based on
the block diagram.
Example FMEA worksheet entry:
Function: Fill tub
Failure mode: High level sensor never trips
Effects: Liquid spills on customer floor
S (severity rating): 8
Cause(s): Level sensor failed; level sensor disconnected
O (occurrence rating): 2
Current controls: Fill timeout based on time to fill to low level sensor
D (detection rating): 5
CRIT (critical characteristic): N
RPN (risk priority number): 80
Recommended actions: Perform cost analysis of adding additional sensor halfway between low and high level sensors
Responsibility and target completion date: Jane Doe, 10-Oct-2010
Action taken: (blank)
Determine all failure modes based on the functional requirements and their effects. Examples of
failure modes are electrical short-circuiting, corrosion and deformation. A failure mode in one
component can lead to a failure mode in another component, so each failure mode should
be listed in technical terms and for function. Next, the ultimate effect of each failure mode
needs to be considered. A failure effect is defined as the result of a failure mode on the function
of the system as perceived by the user. It is therefore convenient to write these effects down in
terms of what the user might see or experience. Examples of failure effects are degraded
performance, noise or even injury to a user. Each effect is given a severity number (S) from 1 (no
danger) to 10 (critical). These numbers help an engineer to prioritize the failure modes and their
effects. If the severity of an effect has a number 9 or 10, actions are considered to change the
design by eliminating the failure mode, if possible, or protecting the user from the effect. A
severity rating of 9 or 10 is generally reserved for those effects which would cause injury to a
user or otherwise result in litigation.
In this step it is necessary to look at the cause of a failure mode and how many times it occurs.
This can be done by looking at similar products or processes and the failure modes that have
been documented for them. A failure cause is looked upon as a design weakness. All the
potential causes for a failure mode should be identified and documented. Again this should be in
technical terms. Examples of causes are: erroneous algorithms, excessive voltage or improper
operating conditions. A failure mode is given an occurrence ranking (O), again 1–10. Actions
need to be determined if the occurrence is high (meaning > 4 for non-safety failure modes and
> 1 when the severity-number from step 1 is 9 or 10). This step is called the detailed
development section of the FMEA process. Occurrence can also be defined as a percentage: for
example, a non-safety issue that occurs less than 1% of the time can be given a ranking of 1. The
exact mapping depends on the product and the customer specification.
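The occurrence thresholds quoted above can be written as a small check. This is only a sketch of the rule as stated in the text (action when O > 4 for non-safety failure modes, and when O > 1 if the severity number is 9 or 10). The function name and the input validation are illustrative assumptions, not part of any standard.

```python
def needs_occurrence_action(severity: int, occurrence: int) -> bool:
    """Return True when the occurrence ranking calls for action.

    Hypothetical helper: it encodes only the thresholds quoted in the text.
    """
    for name, value in (("severity", severity), ("occurrence", occurrence)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    # Severity 9 or 10 is treated as safety-related, so the threshold is stricter.
    threshold = 1 if severity >= 9 else 4
    return occurrence > threshold


print(needs_occurrence_action(severity=8, occurrence=2))  # non-safety, O not above 4
print(needs_occurrence_action(severity=9, occurrence=2))  # safety-related, O above 1
```

A real FMEA procedure would take these thresholds from the organisation's own ranking tables rather than hard-coding them.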
When appropriate actions are determined, it is necessary to test their effectiveness. A design
verification is also needed, and the proper inspection methods need to be chosen. First, an engineer
should look at the current controls of the system that prevent failure modes from occurring or
that detect the failure before it reaches the customer. Next, one should identify testing,
analysis, monitoring and other techniques that can be or have been used on similar systems to
detect failures. From these controls an engineer can learn how likely it is for a failure to be
identified or detected. Each combination from the previous two steps receives a detection number
(D). This ranks the ability of planned tests and inspections to remove defects or detect failure
modes in time. The assigned detection number measures the risk that the failure will escape
detection. A high detection number indicates that the chances are high that the failure will escape
detection, or in other words, that the chances of detection are low.
After these three basic steps, risk priority numbers (RPNs) are calculated.
RPNs do not play an important part in the choice of an action against failure modes; they serve
more as threshold values in the evaluation of these actions.
After ranking the severity, occurrence and detectability the RPN can be easily calculated by
multiplying these three numbers: RPN = S × O × D
This has to be done for the entire process and/or design. Once this is done it is easy to determine
the areas of greatest concern. The failure modes that have the highest RPN should be given the
highest priority for corrective action. This means it is not always the failure modes with the
highest severity numbers that should be treated first. There could be less severe failures, but
which occur more often and are less detectable.
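The calculation and ranking described above can be sketched as follows. The first entry uses the ratings from the example worksheet (S = 8, O = 2, D = 5, giving RPN = 80). The other two entries, and all class and function names, are hypothetical and serve only to show that the highest-severity mode does not necessarily rank first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # S, 1-10
    occurrence: int  # O, 1-10
    detection: int   # D, 1-10 (high means the failure is likely to escape detection)

    @property
    def rpn(self) -> int:
        # RPN = S x O x D, as given in the text.
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return the failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


modes = [
    FailureMode("High level sensor never trips", 8, 2, 5),      # worksheet example
    FailureMode("Hypothetical: frequent, hard to detect", 4, 6, 7),
    FailureMode("Hypothetical: severe, rare, easy to detect", 9, 1, 2),
]

for mode in rank_by_rpn(modes):
    print(f"{mode.rpn:4d}  S={mode.severity}  {mode.name}")
```

In this made-up data the mode with severity 4 ranks above the one with severity 9 because it occurs more often and is harder to detect.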
After these values are allocated, recommended actions with targets, responsibility and dates of
implementation are noted. These actions can include specific inspection, testing or quality
procedures, redesign (such as selection of new components), adding more redundancy and
limiting environmental stresses or operating range. Once the actions have been implemented in
the design/process, the new RPN should be checked, to confirm the improvements. These tests
are often put in graphs, for easy visualisation. Whenever a design or a process changes, an
FMEA should be updated.
Try to eliminate the failure mode (some failures are more preventable than others)
Minimize the severity of the failure
Evaluate the requirements of the customer to ensure that they do not give rise to
potential failures.
Ensure that any failure that could occur will not injure the customer or seriously impact a
system.
Advantages
Improve the quality, reliability and safety of a product/process
Improve company image and competitiveness
Limitations
Since FMEA is effectively dependent on the members of the committee which examines product
failures, it is limited by their experience of previous failures. If a failure mode cannot be
identified, then external help is needed from consultants who are aware of the many different
types of product failure. FMEA is thus part of a larger system of quality control, where
documentation is vital to implementation. General texts and detailed publications are available in
forensic engineering and failure analysis. It is a general requirement of many specific national
and international standards that FMEA is used in evaluating product integrity. If used as a top-
down tool, FMEA may only identify major failure modes in a system. Fault tree analysis (FTA)
is better suited for "top-down" analysis. When used as a "bottom-up" tool FMEA can augment or
complement FTA and identify many more causes and failure modes resulting in top-level
symptoms. FMEA is not able to discover complex failure modes involving multiple failures within a
subsystem, or to report expected failure intervals of particular failure modes up to the upper level
subsystem or system.
Additionally, the multiplication of the severity, occurrence and detection rankings may result in
rank reversals, where a less serious failure mode receives a higher RPN than a more serious
failure mode.[7] The reason for this is that the rankings are ordinal scale numbers, and
multiplication is not defined for ordinal numbers. The ordinal rankings only say that one ranking
is better or worse than another, but not by how much. For instance, a ranking of "2" may not be
twice as bad as a ranking of "1," or an "8" may not be twice as bad as a "4," but multiplication
treats them as though they are. See Level of measurement for further discussion.
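A rank reversal of this kind can be illustrated with a small, entirely hypothetical example. The rankings below are invented for illustration. The second function applies an order-preserving relabelling of the severity scale (cubing), which is just as valid for ordinal data as the original labels, to show that the RPN ordering depends on the arbitrary numeric labels.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Conventional risk priority number: S x O x D."""
    return severity * occurrence * detection


def relabelled_rpn(severity: int, occurrence: int, detection: int) -> int:
    """Same product after an order-preserving relabelling of severity.

    Cubing keeps the severity rankings in the same order, so as ordinal
    labels they carry the same information as the original 1-10 values.
    """
    return severity ** 3 * occurrence * detection


# Hypothetical rankings, not taken from any real analysis.
serious = {"severity": 10, "occurrence": 2, "detection": 2}
minor = {"severity": 4, "occurrence": 5, "detection": 5}

# The less serious mode receives the higher conventional RPN.
print(rpn(**serious), rpn(**minor))
# After relabelling severity, the ordering of the two modes flips.
print(relabelled_rpn(**serious), relabelled_rpn(**minor))
```

Because both labellings respect the same ordinal rankings, neither ordering of the two failure modes can be said to be the correct one. This is the sense in which multiplication is not meaningful on ordinal scales.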
Types of FMEA
Process: analysis of manufacturing and assembly processes
Design: analysis of products prior to production
Concept: analysis of systems or subsystems in the early design concept stages
Equipment: analysis of machinery and equipment design before purchase
Service: analysis of service industry processes before they are released to impact the
customer
System: analysis of the global system functions
Software: analysis of the software functions
Fault tree analysis (FTA) is a failure analysis in which an undesired state of a system is analyzed
using Boolean logic to combine a series of lower-level events. This analysis method is mainly used in
the field of safety engineering to quantitatively determine the probability of a safety hazard.
Methodology
FTA methodology is described in several industry and government standards, including
NRC NUREG–0492 for the nuclear power industry, an aerospace-oriented revision to
NUREG–0492 for use by NASA[11], SAE ARP4761 for civil aerospace, and MIL–HDBK–338
for military systems[12]. IEC standard IEC 61025[13] is intended
for cross-industry use and has been adopted as European Norm EN 61025.
Since no system is perfect, dealing with a subsystem fault is a necessity, and any working
system eventually will have a fault in some place. However, the probability for a
complete or partial success is greater than the probability of a complete failure or partial
failure. Assembling an FTA is thus not as tedious as assembling a success tree, which can
turn out to be very time consuming.
Because assembling an FTA can be costly and cumbersome, a practical approach is to
analyze subsystems separately. Working with smaller systems reduces the likelihood of
errors and the amount of analysis required for each. Afterward, the subsystems are integrated to
form the fully analyzed overall system.
An undesired effect is taken as the root ('top event') of a tree of logic. There should be
only one Top Event and all concerns must tree down from it. Then, each situation that
could cause that effect is added to the tree as a series of logic expressions. When fault
trees are labeled with actual numbers about failure probabilities (which are often in
practice unavailable because of the expense of testing), computer programs can calculate
failure probabilities from fault trees.
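As a rough sketch of what such a program computes, the following evaluates the top-event probability of a small tree built from AND and OR gates. It assumes that all basic events are statistically independent and that each basic event appears only once in the tree. Real tools handle repeated events and many more gate types. The class names, event names and probabilities are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class BasicEvent:
    name: str
    probability: float


@dataclass(frozen=True)
class Gate:
    kind: str      # "AND" or "OR"
    inputs: tuple  # child BasicEvent or Gate nodes


def probability(node) -> float:
    """Probability that the event represented by node occurs."""
    if isinstance(node, BasicEvent):
        return node.probability
    child_probabilities = [probability(child) for child in node.inputs]
    if node.kind == "AND":
        # All inputs must occur: multiply (valid for independent events).
        return prod(child_probabilities)
    if node.kind == "OR":
        # At least one input occurs: complement of "none occur".
        return 1.0 - prod(1.0 - p for p in child_probabilities)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree: the top event occurs if both pumps fail or the valve sticks.
top_event = Gate("OR", (
    Gate("AND", (BasicEvent("pump A fails", 0.1), BasicEvent("pump B fails", 0.1))),
    BasicEvent("valve sticks", 0.01),
))

print(f"P(top event) = {probability(top_event):.4f}")
```

The single root node mirrors the requirement above that there be only one top event from which all other concerns descend.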
[Figure: A fault tree diagram]
The Tree is usually written out using conventional logic gate symbols. The route through
a tree between an event and an initiator in the tree is called a Cut Set. The shortest
credible way through the tree from fault to initiating event is called a Minimal Cut Set.
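Cut sets can be enumerated mechanically for small trees. The sketch below uses the common set-based formulation, in which a cut set is a set of basic events that together cause the top event and a minimal cut set is one with no smaller cut set inside it. The text above describes the same idea as a route through the tree. The nested-tuple tree encoding and the event names are assumptions made for this example.

```python
from itertools import product


def cut_sets(node):
    """Return the cut sets of a tree as a set of frozensets of event names.

    A node is either a string (basic event) or a tuple
    ("AND" | "OR", child, child, ...).
    """
    if isinstance(node, str):
        return {frozenset([node])}
    kind, *children = node
    child_sets = [cut_sets(child) for child in children]
    if kind == "OR":
        # Any cut set of any input triggers the gate.
        return set().union(*child_sets)
    if kind == "AND":
        # One cut set from every input is needed; merge each combination.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate kind: {kind}")


def minimal_cut_sets(node):
    """Keep only cut sets that do not strictly contain another cut set."""
    sets = cut_sets(node)
    return {s for s in sets if not any(other < s for other in sets)}


# Hypothetical tree in which "valve sticks" appears in two branches.
tree = ("OR",
        ("AND", "pump A fails", "pump B fails"),
        "valve sticks",
        ("AND", "valve sticks", "pump A fails"))

for cut_set in sorted(minimal_cut_sets(tree), key=sorted):
    print(sorted(cut_set))
```

This brute-force expansion grows quickly with tree size, so it is suitable only as an illustration.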
Some industries use both Fault Trees and Event Trees (see Probabilistic Risk
Assessment). An Event Tree starts from an undesired initiator (loss of critical supply,
component failure etc.) and follows possible further system events through to a series of
final consequences. As each new event is considered, a new node on the tree is added
with a split of probabilities of taking either branch. The probabilities of a range of 'top
events' arising from the initial event can then be seen.
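The branching described above can be sketched for the simplest case: a binary event tree in which each subsequent system event either succeeds or fails, independently of the others. Real event trees often prune branches or use conditional probabilities. The initiator probability, event names and success probabilities below are invented for illustration.

```python
from itertools import product


def event_tree_outcomes(initiator_probability, events):
    """Enumerate every path through a binary event tree.

    events is a list of (name, probability_of_success) pairs, assumed
    independent. Returns a dict mapping each path, a tuple of
    "name: ok" / "name: failed" labels, to its probability.
    """
    outcomes = {}
    for path in product((True, False), repeat=len(events)):
        probability = initiator_probability
        labels = []
        for (name, p_success), succeeded in zip(events, path):
            # Multiply by the probability of the branch taken at this node.
            probability *= p_success if succeeded else 1.0 - p_success
            labels.append(f"{name}: {'ok' if succeeded else 'failed'}")
        outcomes[tuple(labels)] = probability
    return outcomes


# Hypothetical initiator and follow-on events.
events = [("alarm", 0.9), ("sprinkler", 0.8)]
outcomes = event_tree_outcomes(0.01, events)

for path, probability in outcomes.items():
    print(f"{probability:.4f}  {', '.join(path)}")
```

The path probabilities sum to the initiator probability, which is a quick consistency check on any event tree of this form.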
Classic programs include the Electric Power Research Institute's (EPRI) CAFTA
software, which is used by many of the US nuclear power plants and by a majority of US
and international aerospace manufacturers, and the Idaho National Laboratory's
SAPHIRE, which is used by the U.S. Government to evaluate the safety and reliability of
nuclear reactors, the Space Shuttle, and the International Space Station. Outside the US,
the software RiskSpectrum is a popular tool for Fault Tree and Event Tree analysis and is
licensed for use at almost half of the world's nuclear power plants for Probabilistic Safety
Assessment.
5 Whys
Example
The following example demonstrates the basic process:
I will start maintaining my car according to the recommended service schedule. (solution)
The questioning for this example could be taken further to a sixth, seventh, or even greater level.
This would be legitimate, as the "five" in 5 Whys is not gospel; rather, it is postulated that five
iterations of asking why is generally sufficient to get to a root cause. The real key is to encourage
the troubleshooter to avoid assumptions and logic traps and instead to trace the chain of causality
in direct increments from the effect through any layers of abstraction to a root cause that still has
some connection to the original problem. Note that in this example the fifth why suggests a
broken process or an alterable behavior, which is typical of reaching the root-cause level.
History
The technique was originally developed by Sakichi Toyoda and was later used within Toyota
Motor Corporation during the evolution of their manufacturing methodologies. It is a critical
component of problem solving training delivered as part of the induction into the Toyota
Production System. The architect of the Toyota Production System, Taiichi Ohno, described the
5 whys method as "the basis of Toyota's scientific approach . . . by repeating why five times, the
nature of the problem as well as its solution becomes clear."[1] The tool has seen widespread use
beyond Toyota, and is now used within Kaizen, lean manufacturing, and Six Sigma.
Ishikawa diagram, also known as the fishbone diagram or cause-and-effect diagram. The
Ishikawa diagram is often preferred by project managers for conducting RCA,
mainly due to its simplicity compared with the complexity of the other methods[1].
Kepner-Tregoe Problem Analysis - a root cause analysis process developed in 1958,
which provides a fact-based approach to systematically rule out possible causes and
identify the true cause
Pareto analysis
RPR Problem Diagnosis - An ITIL-aligned method for diagnosing IT problems.
Cause Mapping - a simple method to research, investigate and solve complex problems
Apollo Root Cause Analysis - a formal root cause analysis method focusing on cause and
effect relationships that is universally applicable to any industry and discipline
Why-Because analysis - causal systems analysis for incidents and accidents based on the
logic of counterfactuals