Thinking About Problems
By Hank Marquis

Hank is EVP of Knowledge Management at Universal Solutions Group, and Founder and Director of NABSM.ORG. Contact Hank by email at [email protected]. View Hank's blog at www.hankmarquis.info.
The IT Infrastructure Library (ITIL) describes the steps of the root cause analysis method called Kepner-Tregoe: Define and Describe the Problem, Establish possible causes, Test the most probable cause, and Verify the true cause. The ITIL mentions Kepner-Tregoe, but does not give enough detail to use it to solve difficult problems.

Simple as it sounds, most technicians and technical leads do not actually follow Kepner-Tregoe. They rely instead upon preconceived ideas and often skip important steps. Then, without a plan and in desperation, they fall back on the good old "when in doubt swap it out" technique.

Taking the time to use Kepner-Tregoe can result in dramatic improvements in troubleshooting, and deliver permanent fixes that prevent future problems as well. Below, I provide a template for using Kepner-Tregoe that problem managers and staff can use to accelerate root cause analysis.
Kepner-Tregoe
The actual name is Kepner-Tregoe Problem Solving and Decision Making (PSDM). The part of PSDM that ITIL refers to is what Kepner-Tregoe calls Problem Analysis.

Problem Analysis helps the practitioner make sound decisions. It provides a process to identify and sort all the issues surrounding a decision. As a troubleshooting tool, Problem Analysis helps prevent jumping to conclusions. Immature troubleshooters use hunches, instinct, and intuition. These individual acts of heroism may seem brilliant, but they can also create more problems, since jumping to conclusions often compounds or expands problems instead of solving them.

Problem Analysis leverages the combined knowledge, experience, intuition, and judgment of a team, resulting in faster and better decisions. Using Problem Analysis to aid Problem Management not only brings the team together, but also helps identify root cause.

Problem Analysis is a problem solving and decision making framework. Six Sigma, Lean Manufacturing, and ITIL all describe Problem Analysis. The Problem Analysis process divides decision-making into five steps:

1. Define the Problem
2. Describe the Problem
3. Establish possible causes
4. Test the most probable cause
5. Verify the true cause
Since problem management is inherently a team exercise, it is important to have a group understanding of the problem. Consider the following examples. A poor problem definition might appear as follows: "The server crashed."

A better problem definition should include more information. A good model for clarifying statements of all sorts is the Goal Question Metric (GQM) method. It produces a statement with a clear Object, Purpose, Focus, Environment, and Viewpoint, which makes the statement unambiguous and easily understood. A clarified problem definition might be: "The e-mail system crashed after the 3rd shift support engineer applied hot-fix XYZ to Exchange Server 123."

When developing a problem definition, always use the "5 Whys" technique to arrive at the point where there is no explanation for the problem. Using 5 Whys with Kepner-Tregoe only accelerates the process.
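To make the GQM-style problem definition described above concrete, here is a minimal Python sketch. The `ProblemDefinition` class, its field names, and the way the example statement is split across the five facets are assumptions made for illustration; they are not part of Kepner-Tregoe, ITIL, or GQM itself.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProblemDefinition:
    """A problem statement split into the five GQM facets (hypothetical model)."""

    obj: str          # Object: the thing that failed
    purpose: str      # Purpose: why the team is analyzing it
    focus: str        # Focus: the aspect of the object under study
    environment: str  # Environment: the context in which it happened
    viewpoint: str    # Viewpoint: whose observation this is

    def missing_facets(self) -> list[str]:
        """Names of facets still blank, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# "The server crashed." leaves most facets empty.
poor = ProblemDefinition("The server", "", "crash", "", "")

# One possible (assumed) reading of the clarified statement from the text.
better = ProblemDefinition(
    obj="E-mail system on Exchange Server 123",
    purpose="Find the root cause",
    focus="crash",
    environment="after hot-fix XYZ was applied",
    viewpoint="3rd shift support engineer",
)

print(poor.missing_facets())
print(better.missing_facets())
```

A team could use a check like `missing_facets()` as a quick gate: a definition with blank facets probably needs another round of questions before analysis begins.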
basis of the troubleshooting. The last column provides space to list any changes made that could account for the differences.
         IS                    COULD BE but IS NOT                       DIFFERENCES  CHANGES
WHAT     System failure        Similar systems/situations not failed     ?            ?
WHERE    Failure location      Other locations that did not fail         ?            ?
WHEN     Failure time          Other times where failure did not occur   ?            ?
EXTENT   Other failed systems  Other systems without failure             ?            ?
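For teams that prefer to keep the worksheet in a script or ticketing tool instead of on paper, the layout above can be sketched as a small data structure. This is an illustrative sketch only: the names `WorksheetRow`, `blank_worksheet`, and `open_questions` are hypothetical, and the cell text is copied from the blank worksheet.

```python
from dataclasses import dataclass


@dataclass
class WorksheetRow:
    """One row of the Problem Analysis worksheet (hypothetical model)."""

    dimension: str         # WHAT, WHERE, WHEN, or EXTENT
    is_fact: str           # the IS column
    is_not_fact: str       # the COULD BE but IS NOT column
    differences: str = ""  # what distinguishes IS from IS NOT
    changes: str = ""      # changes that could account for the differences


def blank_worksheet() -> list[WorksheetRow]:
    """Return the four rows with DIFFERENCES and CHANGES still empty."""
    return [
        WorksheetRow("WHAT", "System failure", "Similar systems/situations not failed"),
        WorksheetRow("WHERE", "Failure location", "Other locations that did not fail"),
        WorksheetRow("WHEN", "Failure time", "Other times where failure did not occur"),
        WorksheetRow("EXTENT", "Other failed systems", "Other systems without failure"),
    ]


def open_questions(rows: list[WorksheetRow]) -> list[tuple[str, str]]:
    """List the (dimension, column) cells that still hold no answer."""
    return [
        (row.dimension, column)
        for row in rows
        for column in ("differences", "changes")
        if not getattr(row, column).strip()
    ]


worksheet = blank_worksheet()
# Assumed example entry; the change text is taken from the article's scenario.
worksheet[2].changes = "Hot-fix XYZ applied"
print(open_questions(worksheet))
```

Listing the open cells keeps the team focused on the "?" entries, which is where the differences and recent changes are supposed to surface.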
Table 2. Problem Analysis Worksheet Example

History (and best practice) says that the root cause of a problem is probably some recent change.
With the completed worksheet, some new possible solutions become apparent. From the worksheet shown above, it becomes clear that the root cause is probably procedural: the vendor did not apply the hot-fix, but rather gave procedures for the hot-fix to the company.
Only Exchange Server 123 has this problem    Maybe
Same procedure crashes another server        Probably
Problem did not always reoccur               Probably not
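One way to picture the "test the most probable cause" step is to check each candidate cause against the facts gathered on the worksheet and see which one accounts for the most of them. The sketch below is an assumption-laden illustration, not the official Kepner-Tregoe procedure: the function `rank_causes`, the fact strings, and the two candidate causes are all hypothetical.

```python
def rank_causes(facts: list[str], explains: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Rank candidate causes by how many worksheet facts each accounts for."""
    scores = [
        (cause, sum(1 for fact in facts if fact in explained))
        for cause, explained in explains.items()
    ]
    # Highest score first; ties keep the order in which causes were listed.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical facts, loosely based on the article's e-mail example.
facts = [
    "IS: crash followed the hot-fix on Exchange Server 123",
    "IS NOT: servers that did not receive the hot-fix kept running",
    "IS: crash began during 3rd shift",
]

# Hypothetical judgment calls by the team about what each cause explains.
explains = {
    "Hardware fault on the server": {facts[1]},
    "Hot-fix procedure applied incorrectly": {facts[0], facts[1], facts[2]},
}

ranking = rank_causes(facts, explains)
for cause, score in ranking:
    print(f"{score}/{len(facts)} facts explained: {cause}")
```

The score only orders the candidates for testing. The final step, verifying the true cause, still has to happen against the real system.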
It is important here as well to think about how to prevent similar problems from occurring in the future. The Problem Manager should consider how the issue arose in the first place by asking some questions:
Where else might this problem appear?
Are there other occurrences of this problem in the past?
Do any procedures need to change?
Summary
Kepner-Tregoe is a mature process with decades of proven capabilities. There are worksheets, training programs, and consulting firms all schooled in the process. You can take courses at many local colleges as well.

Kepner-Tregoe Problem Analysis was used by NASA to troubleshoot Apollo XIII. Even though the technicians did not believe the results, they followed the process and saved the mission. The rest of the story, as they say, is history...

Even without a lot of time available, using Kepner-Tregoe Problem Analysis can result in the most efficient problem resolutions. Armed with tools like 5 Whys and Ishikawa diagramming, a Problem Manager can capture the combined experience and knowledge of a team. When these tools are used with Kepner-Tregoe Problem Analysis, the result is amazing.