Root Cause Analysis

Root cause analysis (RCA) is a problem-solving method used to identify the underlying cause of problems in order to prevent recurrence. There are different schools of RCA including safety, production, process, failure, and systems-based approaches. General principles of RCA include systematically investigating problems to find the true root cause, establishing timelines of causal factors, and transforming culture to prevent future problems. The general RCA process involves defining the problem, gathering data, identifying the root cause through questioning, determining corrective actions, and implementing and observing solutions. Common RCA techniques are barrier analysis, causal factor trees, change analysis, and failure mode and effects analysis (FMEA).

Root cause analysis (RCA) is a class of problem solving methods aimed at identifying the root
causes of problems or events. The practice of RCA is predicated on the belief that problems are
best solved by attempting to correct or eliminate root causes, as opposed to merely addressing
the immediately obvious symptoms. By directing corrective measures at root causes, it is hoped
that the likelihood of problem recurrence will be minimized. However, it is recognized that
complete prevention of recurrence by a single intervention is not always possible. Thus, RCA is
often considered to be an iterative process, and is frequently viewed as a tool of continuous
improvement.

RCA is initially a reactive method of problem detection and solving: the analysis is done after
an event has occurred. As expertise in RCA grows, it can become a proactive method that is
able to forecast the possibility of an event before it occurs.

Root cause analysis is not a single, sharply defined methodology; there are many different tools,
processes, and philosophies of RCA in existence. However, most of these can be classed into
five very broadly defined "schools" that are named here by their basic fields of origin:
safety-based, production-based, process-based, failure-based, and systems-based.

• Safety-based RCA descends from the fields of accident analysis and occupational safety
and health.
• Production-based RCA has its origins in the field of quality control for industrial
manufacturing.
• Process-based RCA is basically a follow-on to production-based RCA, but with a scope
that has been expanded to include business processes.
• Failure-based RCA is rooted in the practice of failure analysis as employed in
engineering and maintenance.
• Systems-based RCA has emerged as an amalgamation of the preceding schools, along
with ideas taken from fields such as change management, risk management, and systems
analysis.

Despite the seeming disparity in purpose and definition among the various schools of root cause
analysis, there are some general principles that could be considered as universal. Similarly, it is
possible to define a general process for performing RCA.

General principles of root cause analysis


1. The primary aim of RCA is to identify the root cause of a problem in order to create
effective corrective actions that will prevent that problem from ever re-occurring,
otherwise known as the '100 year fix'.
2. To be effective, RCA must be performed systematically as an investigation, with
conclusions and the root cause backed up by documented evidence.
3. There is always one true root cause for any given problem; the difficult part is having the
stamina to reach it.
4. To be effective, the analysis must establish a sequence of events or timeline to understand
the relationships between contributory factors, the root cause and the defined problem.
5. Root cause analysis can help to transform an old culture that reacts to problems into a
new culture that solves problems before they escalate and, more importantly, reduces the
instances of problems occurring over time within the environment where the RCA
process is operated.

General process for performing and documenting an RCA-based Corrective Action

Notice that RCA (in steps 3, 4 and 5) forms the most critical part of successful corrective action,
because it directs the corrective action at the true root cause of the problem. The root cause is
secondary to the goal of prevention, but without knowing the root cause, we cannot determine
what an effective corrective action for the defined problem will be.

1. Define the problem.
2. Gather data/evidence.
3. Ask why and identify the true root cause associated with the defined problem.
4. Identify corrective action(s) that will prevent recurrence of the problem (your 100 year
fix).
5. Identify effective solutions that prevent recurrence, are within your control, meet your
goals and objectives and do not cause other problems.
6. Implement the recommendations.
7. Observe the recommended solutions to ensure effectiveness.
8. Variability Reduction methodology for problem solving and problem avoidance.

Root cause analysis techniques


• Barrier analysis - a technique often used in process industries. It is based
on tracing energy flows, with a focus on barriers to those flows, to identify how and why
the barriers did not prevent the energy flows from causing harm.
• Bayesian inference
• Causal factor tree analysis - a technique based on displaying causal factors in a tree
structure such that cause-effect dependencies are clearly identified.
• Change analysis - an investigation technique often used for problems or accidents. It is
based on comparing a situation that does not exhibit the problem to one that does, in
order to identify the changes or differences that might explain why the problem occurred.
• Current Reality Tree - a method developed by Eliyahu M. Goldratt in his Theory of
Constraints that guides an investigator to identify and relate all root causes using a
cause-effect tree whose elements are bound by rules of logic (Categories of Legitimate
Reservation). The CRT begins with a brief list of the undesirable things we see around
us, and then guides us towards one or more root causes. This method is particularly
powerful when the system is complex, there is no obvious link between the observed
undesirable things, and a deep understanding of the root cause(s) is desired.

• Failure mode and effects analysis, also known as FMEA.

A failure mode and effects analysis (FMEA) is a procedure in product development and
operations management for analysis of potential failure modes within a system for classification
by the severity and likelihood of the failures. A successful FMEA activity helps a team to identify
potential failure modes based on past experience with similar products or processes, enabling
the team to design those failures out of the system with the minimum of effort and resource
expenditure, thereby reducing development time and costs. It is widely used in manufacturing
industries in various phases of the product life cycle and is now increasingly finding use in the
service industry. Failure modes are any errors or defects in a process, design, or item, especially
those that affect the customer, and can be potential or actual.

Basic terms

FMEA cycle.

Failure mode: "The manner by which a failure is observed; it generally describes the way the
failure occurs."
Failure effect: Immediate consequences of a failure on operation, function or functionality, or
status of some item.
Indenture levels: An identifier for item complexity. Complexity increases as levels are closer to
one.
Local effect: The Failure effect as it applies to the item under analysis.
Next higher level effect: The Failure effect as it applies at the next higher indenture level.
End effect: The failure effect at the highest indenture level or total system.
Failure cause: Defects in design, process, quality, or part application, which are the underlying
cause of the failure or which initiate a process which leads to failure.
Severity: "The consequences of a failure mode. Severity considers the worst potential
consequence of a failure, determined by the degree of injury, property damage, or system
damage that could ultimately occur."

Implementation
In FMEA, failures are prioritized according to how serious their consequences are, how
frequently they occur and how easily they can be detected. An FMEA also documents current
knowledge and actions about the risks of failures for use in continuous improvement. FMEA is
used during the design stage with an aim to avoid future failures. Later it is used for process
control, before and during ongoing operation of the process. Ideally, FMEA begins during the
earliest conceptual stages of design and continues throughout the life of the product or service.

The outcome of an FMEA development is actions to prevent or reduce the severity or likelihood
of failures, starting with the highest-priority ones. It may be used to evaluate risk management
priorities for mitigating known threat vulnerabilities. FMEA helps select remedial actions that
reduce cumulative impacts of life-cycle consequences (risks) from a systems failure (fault).

It is used in many formal quality systems such as QS-9000 or ISO/TS 16949.

Using FMEA when designing

FMEA can provide an analytical approach when dealing with potential failure modes and their
associated causes. When considering possible failures in a design (such as safety, cost,
performance, quality and reliability), an engineer can get a lot of information about how to alter
the development/manufacturing process in order to avoid these failures. FMEA provides an easy
tool to determine which risk is of greatest concern, and therefore where an action is needed to
prevent a problem before it arises. The development of these specifications will ensure the
product will meet the defined requirements.

The pre-work

The process for conducting an FMEA is straightforward. It is developed in three main phases, in
which appropriate actions need to be defined. But before starting with an FMEA, it is important to
complete some pre-work to confirm that robustness and past history are included in the analysis.

A robustness analysis can be obtained from interface matrices, boundary diagrams, and
parameter diagrams. A lot of failures are due to noise factors and shared interfaces with other
parts and/or systems, because engineers tend to focus on what they control directly.
To start it is necessary to describe the system and its function. A good understanding simplifies
further analysis. This way an engineer can see which uses of the system are desirable and which
are not. It is important to consider both intentional and unintentional uses. Unintentional uses are
a form of hostile environment.

Then, a block diagram of the system needs to be created. This diagram gives an overview of the
major components or process steps and how they are related. These are called logical relations
around which the FMEA can be developed. It is useful to create a coding system to identify the
different system elements. The block diagram should always be included with the FMEA.

Before starting the actual FMEA, a worksheet needs to be created, which contains the important
information about the system, such as the revision date or the names of the components. On this
worksheet all the items or functions of the subject should be listed in a logical manner, based on
the block diagram.

Example FMEA Worksheet

Worksheet column                            Example entry
Function                                    Fill tub
Failure mode                                High level sensor never trips
Effects                                     Liquid spills on customer floor
S (severity rating)                         8
Cause(s)                                    Level sensor failed; level sensor disconnected
O (occurrence rating)                       2
Current controls                            Fill timeout based on time to fill to low level sensor
D (detection rating)                        5
CRIT (critical characteristic)              N
RPN (risk priority number)                  80
Recommended actions                         Perform cost analysis of adding additional sensor
                                            halfway between low and high level sensors
Responsibility and target completion date   Jane Doe, 10-Oct-2010
Action taken                                (blank)

Step 1: Severity

Determine all failure modes based on the functional requirements and their effects. Examples of
failure modes are: electrical short-circuiting, corrosion or deformation. A failure mode in one
component can lead to a failure mode in another component; therefore each failure mode should
be listed in technical terms and by function. Next, the ultimate effect of each failure mode
needs to be considered. A failure effect is defined as the result of a failure mode on the function
of the system as perceived by the user. It is therefore convenient to write these effects down in
terms of what the user might see or experience. Examples of failure effects are: degraded
performance, noise or even injury to a user. Each effect is given a severity number (S) from 1 (no
danger) to 10 (critical). These numbers help an engineer to prioritize the failure modes and their
effects. If the severity of an effect has a number 9 or 10, actions are considered to change the
design by eliminating the failure mode, if possible, or protecting the user from the effect. A
severity rating of 9 or 10 is generally reserved for those effects which would cause injury to a
user or otherwise result in litigation.

Step 2: Occurrence

In this step it is necessary to look at the cause of a failure mode and how often it occurs.
This can be done by looking at similar products or processes and the failure modes that have
been documented for them. A failure cause is looked upon as a design weakness. All the
potential causes for a failure mode should be identified and documented. Again this should be in
technical terms. Examples of causes are: erroneous algorithms, excessive voltage or improper
operating conditions. A failure mode is given an occurrence ranking (O), again 1–10. Actions
need to be determined if the occurrence is high (meaning > 4 for non-safety failure modes and
> 1 when the severity number from step 1 is 9 or 10). This step is called the detailed
development section of the FMEA process. Occurrence can also be defined as a percentage: for
example, a non-safety issue that occurs less than 1% of the time can be given a ranking of 1.
The scale depends on the product and the customer specification.
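
The occurrence thresholds above can be written as a small check. This is a minimal sketch, not part of any FMEA standard: the function name `occurrence_needs_action` is hypothetical, and it assumes that "safety" failure modes are exactly those with a severity of 9 or 10, which is how the text links the two thresholds.

```python
def occurrence_needs_action(severity: int, occurrence: int) -> bool:
    """Return True when the occurrence ranking is 'high' per the thresholds in the text.

    Assumption: a failure mode counts as safety-related when its severity is 9 or 10.
    """
    for name, value in (("severity", severity), ("occurrence", occurrence)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")

    safety_related = severity >= 9
    # Act when O > 1 for severity 9-10, otherwise when O > 4.
    threshold = 1 if safety_related else 4
    return occurrence > threshold


# The worksheet example (S = 8, O = 2) is below the non-safety threshold.
print(occurrence_needs_action(8, 2))   # False
# The same occurrence with a severity of 9 calls for action.
print(occurrence_needs_action(9, 2))   # True
```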

Step 3: Detection

When appropriate actions are determined, it is necessary to test their efficiency. Also a design
verification is needed. The proper inspection methods need to be chosen. First, an engineer
should look at the current controls of the system that prevent failure modes from occurring or
that detect the failure before it reaches the customer. Next, one should identify testing,
analysis, monitoring and other techniques that can be or have been used on similar systems to
detect failures. From these controls an engineer can learn how likely it is for a failure to be
identified or detected. Each combination from the previous two steps receives a detection number
(D). This ranks the ability of planned tests and inspections to remove defects or detect failure
modes in time. The assigned detection number measures the risk that the failure will escape
detection. A high detection number indicates that the chances are high that the failure will escape
detection, or in other words, that the chances of detection are low.

After these three basic steps, risk priority numbers (RPN) are calculated.

Risk priority numbers

RPNs do not play an important part in the choice of an action against failure modes. They are
more like threshold values in the evaluation of these actions.

After ranking the severity, occurrence and detectability the RPN can be easily calculated by
multiplying these three numbers: RPN = S × O × D
This has to be done for the entire process and/or design. Once this is done it is easy to determine
the areas of greatest concern. The failure modes that have the highest RPN should be given the
highest priority for corrective action. This means it is not always the failure modes with the
highest severity numbers that should be treated first. There could be less severe failures, but
which occur more often and are less detectable.
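
As an illustration of the calculation, the sketch below computes RPN = S × O × D for a few failure modes and sorts them by priority. The `FailureMode` class and `rank_by_rpn` function are hypothetical names introduced here; the first entry uses the ratings from the example worksheet, and the other two entries are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # S, 1-10
    occurrence: int  # O, 1-10
    detection: int   # D, 1-10 (a high value means the failure is likely to escape detection)

    @property
    def rpn(self) -> int:
        # RPN = S x O x D, as defined in the text
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return the failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


modes = [
    FailureMode("High level sensor never trips", 8, 2, 5),  # ratings from the example worksheet
    FailureMode("Seal leaks slowly (invented)", 4, 6, 7),
    FailureMode("Label misprinted (invented)", 2, 3, 2),
]

for mode in rank_by_rpn(modes):
    print(f"{mode.rpn:4d}  S={mode.severity}  {mode.name}")
```

In this invented data set the slow leak has a lower severity than the sensor failure but a higher RPN, because it occurs more often and is harder to detect.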

After these values are allocated, recommended actions with targets, responsibility and dates of
implementation are noted. These actions can include specific inspection, testing or quality
procedures, redesign (such as selection of new components), adding more redundancy and
limiting environmental stresses or operating range. Once the actions have been implemented in
the design/process, the new RPN should be checked, to confirm the improvements. These tests
are often put in graphs, for easy visualisation. Whenever a design or a process changes, an
FMEA should be updated.

A few logical but important thoughts come to mind:

• Try to eliminate the failure mode (some failures are more preventable than others)
• Minimize the severity of the failure
• Reduce the occurrence of the failure mode
• Improve the detection

Timing of FMEA

The FMEA should be updated:

• At the beginning of a cycle (new product/process)
• When changes are made to the operating conditions
• When a change is made in the design
• When new regulations are instituted
• When customer feedback indicates a problem

Uses of FMEA

• Development of system requirements that minimize the likelihood of failures.
• Development of methods to design and test systems to ensure that the failures have been
eliminated.
• Evaluation of the requirements of the customer to ensure that those do not give rise to
potential failures.
• Identification of certain design characteristics that contribute to failures, and minimization or
elimination of those effects.
• Tracking and managing potential risks in the design. This helps avoid the same failures in future
projects.
• Ensuring that any failure that could occur will not injure the customer or seriously impact a
system.
• Production of world-class quality products.

Advantages

• Improve the quality, reliability and safety of a product/process
• Improve company image and competitiveness
• Increase user satisfaction
• Reduce system development timing and cost
• Collect information to reduce future failures, capture engineering knowledge
• Reduce the potential for warranty concerns
• Early identification and elimination of potential failure modes
• Emphasize problem prevention
• Minimize late changes and associated cost
• Catalyst for teamwork and idea exchange between functions
• Reduce the possibility of the same kind of failure in the future

Limitations

Since FMEA is effectively dependent on the members of the committee which examines product
failures, it is limited by their experience of previous failures. If a failure mode cannot be
identified, then external help is needed from consultants who are aware of the many different
types of product failure. FMEA is thus part of a larger system of quality control, where
documentation is vital to implementation. General texts and detailed publications are available in
forensic engineering and failure analysis. It is a general requirement of many specific national
and international standards that FMEA is used in evaluating product integrity. If used as a top-
down tool, FMEA may only identify major failure modes in a system. Fault tree analysis (FTA)
is better suited for "top-down" analysis. When used as a "bottom-up" tool FMEA can augment or
complement FTA and identify many more causes and failure modes resulting in top-level
symptoms. FMEA is not able to discover complex failure modes involving multiple failures within a
subsystem, or to report expected failure intervals of particular failure modes up to the upper level
subsystem or system.

Additionally, the multiplication of the severity, occurrence and detection rankings may result in
rank reversals, where a less serious failure mode receives a higher RPN than a more serious
failure mode.[7] The reason for this is that the rankings are ordinal scale numbers, and
multiplication is not defined for ordinal numbers. The ordinal rankings only say that one ranking
is better or worse than another, but not by how much. For instance, a ranking of "2" may not be
twice as bad as a ranking of "1," or an "8" may not be twice as bad as a "4," but multiplication
treats them as though they are. See Level of measurement for further discussion.
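
A rank reversal is easy to reproduce with two sets of rankings. The values below are invented for illustration only, and the helper name `rpn` is hypothetical; the sketch simply shows a failure mode with the maximum severity receiving a lower RPN than a much less severe one.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number as the plain product of the three ordinal rankings."""
    return severity * occurrence * detection


# Invented rankings (S, O, D) for two failure modes.
serious = (10, 1, 2)  # most severe ranking, rare, usually detected
minor = (4, 3, 3)     # moderate severity, more frequent, harder to detect

rpn_serious = rpn(*serious)  # 10 * 1 * 2 = 20
rpn_minor = rpn(*minor)      # 4 * 3 * 3 = 36

# A reversal occurs when the more severe failure mode gets the lower RPN.
reversal = serious[0] > minor[0] and rpn_serious < rpn_minor
print(rpn_serious, rpn_minor, reversal)
```

One common way to limit this effect, consistent with Step 1, is to review any failure mode with a severity of 9 or 10 regardless of where its RPN falls in the ranking.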

Types of FMEA
• Process: analysis of manufacturing and assembly processes
• Design: analysis of products prior to production
• Concept: analysis of systems or subsystems in the early design concept stages
• Equipment: analysis of machinery and equipment design before purchase
• Service: analysis of service industry processes before they are released to impact the
customer
• System: analysis of the global system functions
• Software: analysis of the software functions

• Fault tree analysis

Fault tree analysis (FTA) is a failure analysis in which an undesired state of a system is analyzed
using Boolean logic to combine a series of lower-level events. This analysis method is mainly used in
the field of safety engineering to quantitatively determine the probability of a safety hazard.

Methodology
• FTA methodology is described in several industry and government standards, including
NRC NUREG–0492 for the nuclear power industry, an aerospace-oriented revision to
NUREG–0492 for use by NASA[11], SAE ARP4761 for civil aerospace, and
MIL–HDBK–338 for military systems[12]. IEC standard IEC 61025[13] is intended
for cross-industry use and has been adopted as European Norm EN 61025.
• Since no system is perfect, dealing with a subsystem fault is a necessity, and any working
system eventually will have a fault in some place. However, the probability for a
complete or partial success is greater than the probability of a complete failure or partial
failure. Assembling an FTA is thus not as tedious as assembling a success tree, which can
turn out to be very time consuming.
• Because assembling an FTA can be a costly and cumbersome experience, a practical
method is to consider subsystems. Dealing with smaller systems in this way reduces
the probability of errors and the amount of system analysis required. Afterward, the
subsystems are integrated to form the well-analyzed larger system.
• An undesired effect is taken as the root ('top event') of a tree of logic. There should be
only one Top Event and all concerns must tree down from it. Then, each situation that
could cause that effect is added to the tree as a series of logic expressions. When fault
trees are labeled with actual numbers about failure probabilities (which are often in
practice unavailable because of the expense of testing), computer programs can calculate
failure probabilities from fault trees.
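
The calculation such programs perform can be sketched for a very small tree. This is a minimal illustration, not the method of any particular tool: it assumes all basic events are statistically independent and that no basic event appears in more than one branch, and the event names, probabilities and the `evaluate` function are invented for the example.

```python
from math import prod


def and_gate(probabilities):
    """Output occurs only if every input occurs (independent inputs assumed)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output occurs if at least one input occurs (independent inputs assumed)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def evaluate(node, basic_events):
    """Evaluate a tree given as nested ("AND" | "OR", [children]) tuples.

    A node that is a string is looked up as a basic event probability.
    """
    if isinstance(node, str):
        return basic_events[node]
    gate, children = node
    child_probabilities = [evaluate(child, basic_events) for child in children]
    if gate == "AND":
        return and_gate(child_probabilities)
    if gate == "OR":
        return or_gate(child_probabilities)
    raise ValueError(f"unknown gate type: {gate!r}")


# Invented example: the top event occurs if both pumps fail or if power is lost.
basic_events = {"pump_a_fails": 0.1, "pump_b_fails": 0.1, "power_lost": 0.01}
top_event = ("OR", [("AND", ["pump_a_fails", "pump_b_fails"]), "power_lost"])

print(f"P(top event) = {evaluate(top_event, basic_events):.4f}")
```

When the same basic event feeds several gates, the independence assumption no longer holds and this simple bottom-up product gives the wrong answer; dedicated tools handle that case with cut set methods.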



A fault tree diagram.
• The Tree is usually written out using conventional logic gate symbols. The route through
a tree between an event and an initiator in the tree is called a Cut Set. The shortest
credible way through the tree from fault to initiating event is called a Minimal Cut Set.
• Some industries use both Fault Trees and Event Trees (see Probabilistic Risk
Assessment). An Event Tree starts from an undesired initiator (loss of critical supply,
component failure etc.) and follows possible further system events through to a series of
final consequences. As each new event is considered, a new node on the tree is added
with a split of probabilities of taking either branch. The probabilities of a range of 'top
events' arising from the initial event can then be seen.
• Classic programs include the Electric Power Research Institute's (EPRI) CAFTA
software, which is used by many of the US nuclear power plants and by a majority of US
and international aerospace manufacturers, and the Idaho National Laboratory's
SAPHIRE, which is used by the U.S. Government to evaluate the safety and reliability of
nuclear reactors, the Space Shuttle, and the International Space Station. Outside the US,
the software RiskSpectrum is a popular tool for Fault Tree and Event Tree analysis and is
licensed for use at almost half of the world's nuclear power plants for Probabilistic Safety
Assessment.
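
The event tree idea described above, where each new event splits the tree into two branches with their own probabilities, can be sketched as follows. This is an illustrative simplification under stated assumptions: every event is treated as an independent success/failure split, no branch is pruned, and the event names, probabilities and the function `enumerate_outcomes` are invented for the example.

```python
def enumerate_outcomes(branches, initiator_probability=1.0):
    """Expand a binary event tree.

    branches: ordered list of (event_name, probability_of_success).
    Returns a dict mapping each path, a tuple of (event_name, succeeded) pairs,
    to the probability of following that path after the initiator.
    """
    outcomes = {(): initiator_probability}
    for name, p_success in branches:
        expanded = {}
        for path, probability in outcomes.items():
            # Each existing path splits into a success branch and a failure branch.
            expanded[path + ((name, True),)] = probability * p_success
            expanded[path + ((name, False),)] = probability * (1.0 - p_success)
        outcomes = expanded
    return outcomes


# Invented example: after loss of a critical supply, does the backup start,
# and is an alarm raised?
branches = [("backup_supply_starts", 0.9), ("alarm_raised", 0.8)]
outcomes = enumerate_outcomes(branches)

for path, probability in outcomes.items():
    label = ", ".join(f"{name}={'yes' if ok else 'no'}" for name, ok in path)
    print(f"{probability:.3f}  {label}")
```

The probabilities of all final consequences sum to the probability of the initiator, which is a useful sanity check on a hand-built tree.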

• 5 Whys

The 5 Whys is a question-asking method used to explore the cause/effect relationships
underlying a particular problem. Ultimately, the goal of applying the 5 Whys method is to
determine a root cause of a defect or problem.

Example
The following example demonstrates the basic process:

• My car will not start. (the problem)

1. Why? - The battery is dead. (first why)
2. Why? - The alternator is not functioning. (second why)
3. Why? - The alternator belt has broken. (third why)
4. Why? - The alternator belt was well beyond its useful service life and has never been
replaced. (fourth why)
5. Why? - I have not been maintaining my car according to the recommended service
schedule. (fifth why, a root cause)
6. Why? - Replacement parts are not available because of the extreme age of my vehicle.
(sixth why, optional footnote)

• I will start maintaining my car according to the recommended service schedule. (solution)

The questioning for this example could be taken further to a sixth, seventh, or even greater level.
This would be legitimate, as the "five" in 5 Whys is not gospel; rather, it is postulated that five
iterations of asking why are generally sufficient to get to a root cause. The real key is to encourage
the troubleshooter to avoid assumptions and logic traps and instead to trace the chain of causality
in direct increments from the effect through any layers of abstraction to a root cause that still has
some connection to the original problem. Note that in this example the fifth why suggests a
broken process or an alterable behavior, which is typical of reaching the root-cause level.
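
The chain of questions in the example can be recorded as a simple mapping from each statement to its direct cause. The sketch below is only a way of writing the example down, not a tool that finds causes: the dictionary contents come from the example above, while the function name `ask_why` and the `max_depth` parameter are assumptions made for illustration.

```python
# Each statement maps to its direct cause, as established by asking "why?".
causes = {
    "My car will not start": "The battery is dead",
    "The battery is dead": "The alternator is not functioning",
    "The alternator is not functioning": "The alternator belt has broken",
    "The alternator belt has broken":
        "The alternator belt was well beyond its useful service life",
    "The alternator belt was well beyond its useful service life":
        "I have not been maintaining my car according to the recommended service schedule",
}


def ask_why(problem, causes, max_depth=5):
    """Follow the chain of recorded causes, asking 'why?' at most max_depth times."""
    chain = [problem]
    current = problem
    for _ in range(max_depth):
        if current not in causes:
            break  # no deeper cause has been recorded
        current = causes[current]
        chain.append(current)
    return chain


chain = ask_why("My car will not start", causes)
for depth, statement in enumerate(chain):
    prefix = "Problem" if depth == 0 else f"Why #{depth}"
    print(f"{prefix}: {statement}")
```

Because `max_depth` is a parameter rather than a fixed constant, the same walk can stop earlier or continue to a sixth or seventh why when further causes have been recorded.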

History
The technique was originally developed by Sakichi Toyoda and was later used within Toyota
Motor Corporation during the evolution of their manufacturing methodologies. It is a critical
component of problem solving training delivered as part of the induction into the Toyota
Production System. The architect of the Toyota Production System, Taiichi Ohno, described the
5 whys method as "the basis of Toyota's scientific approach . . . by repeating why five times, the
nature of the problem as well as its solution becomes clear."[1] The tool has seen widespread use
beyond Toyota, and is now used within Kaizen, lean manufacturing, and Six Sigma.

• Ishikawa diagram, also known as the fishbone diagram or cause and effect diagram. The
Ishikawa diagram is the preferred method for Project Managers for conducting RCA,
mainly due to its simplicity, and the complexity of the rest of the methods[1].
• Kepner-Tregoe Problem Analysis - a root cause analysis process developed in 1958,
which provides a fact-based approach to systematically rule out possible causes and
identify the true cause
• Pareto analysis
• RPR Problem Diagnosis - an ITIL-aligned method for diagnosing IT problems
• Cause Mapping - a simple method to research, investigate and solve complex problems
• Apollo Root Cause Analysis - a formal root cause analysis method focusing on cause and
effect relationships that is universally applicable to any industry and discipline
• Why-Because analysis - a causal systems analysis for incidents and accidents based on the
logic of counterfactuals
