Tips For Recognizing and Transforming Non-Normal Data
Peter J. Sherman
Six Sigma professionals should be familiar with normally distributed processes: the characteristic bell-shaped curve that is symmetrical about the mean, with tails approaching plus
and minus infinity (Figure 1).
When data fits a normal distribution, practitioners can make statements about the population using common analytical techniques, including control charts and capability indices
(such as sigma level, Cp, Cpk, defects per million opportunities and so on).
But what happens when a business process is not normally distributed? How do practitioners know the data is not normal? How should this type of data be treated? Practitioners can benefit from an overview of normal and non-normal distributions, from some simple tools for detecting non-normality, and from techniques for accurately determining whether a process is in control and capable. Several signs indicate that process data may not be normally distributed:
1. The histogram does not look bell shaped. Instead, it is skewed positively or negatively (Figure 2).
2. A natural process limit exists. Zero is often the natural process limit when describing cycle times and lead times. For example, when a restaurant promises to deliver a pizza
in 30 minutes or less, zero minutes is the natural lower limit.
3. A time series plot shows large shifts in data.
4. There is known seasonal process data.
5. Process data fluctuates (e.g., because the product mix changes).
Transactional processes and most metrics that involve time measurements, such as cycle times, lead times and waiting times, tend to follow non-normal distributions.
Consider, for example, emergency room (ER) waiting times (the histogram appears in Figure 3). There are a few ways to tell the data may not be normal. First, the histogram is skewed to the right (positively). Second, the control chart shows a lower control limit that is less than the natural limit of zero. Third, there are a number of high points and no real low points. These tell-tale signs indicate the data may not be sufficiently normal for an individuals control chart. When control charts are used with non-normal data, they can give false special-cause signals. Therefore, the data must be transformed to follow the normal distribution; once this is done, standard control chart calculations can be applied to the transformed data.
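The following minimal sketch (in Python with NumPy, which the article itself does not use; the figures appear to come from a statistical software package) illustrates the problem. It computes individuals (I-chart) control limits using the standard mean ± 2.66 × average moving range formula on simulated right-skewed waiting times; the data and variable names are illustrative only.

```python
# Minimal sketch: individuals (I-chart) limits on right-skewed data.
import numpy as np

# Simulated, right-skewed "waiting times" in minutes (illustrative only)
rng = np.random.default_rng(1)
waiting_times = rng.lognormal(mean=3.0, sigma=0.6, size=100)

moving_ranges = np.abs(np.diff(waiting_times))
mr_bar = moving_ranges.mean()

center = waiting_times.mean()
ucl = center + 2.66 * mr_bar  # 2.66 = 3 / d2, with d2 = 1.128 for a moving range of 2
lcl = center - 2.66 * mr_bar

print(f"center = {center:.1f}, LCL = {lcl:.1f}, UCL = {ucl:.1f}")
# With skewed data the LCL typically falls below the natural limit of zero,
# and points pile up toward the UCL -- the tell-tale signs described above.
```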
Type A data – One way to properly analyze the data is to identify it with the appropriate distribution (e.g., lognormal, Weibull or exponential). Some common distributions, their data types and examples associated with them appear in Table 1; a sketch of fitting these candidate distributions follows the table.
Table 1.
Distribution | Data type | Typical examples
Normal | Continuous | Useful when it is equally likely that readings will fall above or below the average
Weibull | Continuous | Mean time-to-failure data, time to repair and material strength
Poisson | Discrete | Number of events in a specific time period (defect counts per interval, such as arrivals, failures or defects)
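As a sketch of this approach (using the same illustrative waiting-time data as above), SciPy's distribution objects can be fitted to the data by maximum likelihood. Fixing the location at zero (`floc=0`) to respect the natural lower limit is a modeling choice of this sketch, not something the article prescribes.

```python
# Sketch: fit candidate distributions from Table 1-style choices to positive data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
waiting_times = rng.lognormal(mean=3.0, sigma=0.6, size=100)  # illustrative data

candidates = {
    "lognormal": stats.lognorm,
    "Weibull": stats.weibull_min,
    "exponential": stats.expon,
}

for name, dist in candidates.items():
    # floc=0 anchors each distribution at the natural lower limit of zero
    params = dist.fit(waiting_times, floc=0)
    print(name, params)
```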
A second way is to transform the data so that it follows the normal distribution. A common technique is the Box-Cox transformation, a power transformation in which the original measurements are raised to a power lambda (λ). Some common lambda values, the corresponding transformation and the resulting transformed value assuming Y = 4 appear in Table 2; a sketch of applying the transformation follows the table.
Table 2.
Lambda (λ) | Transformation | Transformed value (Y = 4)
0.0 | ln(Y), the natural logarithm, with base e ≈ 2.71828 (the natural log of a positive number n is the exponent x to which e must be raised to equal n) | ln(4) ≈ 1.39
1.0 | Y | 4
2.0 | Y² | 4² = 16
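The Box-Cox transformation is available directly in SciPy. The sketch below, again on illustrative data, lets `scipy.stats.boxcox` estimate the lambda that best normalizes the data rather than fixing one of the values in Table 2.

```python
# Sketch: Box-Cox power transformation (requires strictly positive data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
waiting_times = rng.lognormal(mean=3.0, sigma=0.6, size=100)  # illustrative data

# Let SciPy estimate the lambda that best normalizes the data
transformed, best_lambda = stats.boxcox(waiting_times)
print(f"estimated lambda = {best_lambda:.2f}")

# A fixed lambda from Table 2 can also be applied directly; lambda = 0 is ln(Y)
log_transformed = stats.boxcox(waiting_times, lmbda=0.0)
```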
Type B data – If none of the distributions or transformations fit, the non-normal data may be “pollution” caused by a mixture of multiple distributions or processes. Examples of this type of pollution include complex work activities; multiple shifts, locations or customers; and seasonality. Practitioners can try stratifying, or breaking down, the data into categories to make sense of it. For example, the cycle time required for attorneys to complete contract documents is generally not normally distributed, nor does it follow a lognormal distribution. Stratifying the data can reveal that some contract documents, such as residential real estate closings, are much simpler to research, draft and execute than more complex contracts. Hence, the complex contracts account for the longer cycle times, while the simpler contracts have shorter times. Another approach is to convert all of the process data to a common denominator, such as contract draft time per page. Afterward, all of the data can be recombined and tested for a single distribution. Both approaches are sketched below.
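A sketch of the stratify-and-normalize idea using pandas; the column names (`contract_type`, `cycle_time_days`, `pages`) and the tiny data set are hypothetical and serve only to show splitting by category and converting to a common denominator such as draft time per page.

```python
# Sketch: stratify mixed process data, then convert to a common denominator.
import pandas as pd

# Hypothetical contract cycle-time records
contracts = pd.DataFrame({
    "contract_type": ["residential", "residential", "commercial", "commercial"],
    "cycle_time_days": [3, 4, 21, 35],
    "pages": [6, 8, 60, 120],
})

# Stratify: summarize each category separately
print(contracts.groupby("contract_type")["cycle_time_days"].describe())

# Common denominator: draft time per page, so simple and complex contracts
# can be recombined and tested as a single distribution
contracts["days_per_page"] = contracts["cycle_time_days"] / contracts["pages"]
print(contracts["days_per_page"])
```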
Notice that the histogram of the transformed data (Figure 6) is much closer to a normal shape (bell-shaped, symmetrical) than the histogram in Figure 3.
An alternative to transforming the data is to find a non-normal distribution that does fit the data. Figure 7 shows probability plots for the ER waiting time using the normal, lognormal,
exponential and Weibull distributions.
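Plots similar to Figure 7 can be produced with `scipy.stats.probplot` and matplotlib. The sketch below uses the illustrative data, with the lognormal and Weibull shape parameters estimated from the data; fixing the location at zero is an assumption of the sketch.

```python
# Sketch: probability plots for four candidate distributions.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
waiting_times = rng.lognormal(mean=3.0, sigma=0.6, size=100)  # illustrative data

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

stats.probplot(waiting_times, dist="norm", plot=axes[0, 0])
axes[0, 0].set_title("Normal")

s, loc, scale = stats.lognorm.fit(waiting_times, floc=0)
stats.probplot(waiting_times, dist=stats.lognorm, sparams=(s, loc, scale), plot=axes[0, 1])
axes[0, 1].set_title("Lognormal")

stats.probplot(waiting_times, dist="expon", plot=axes[1, 0])
axes[1, 0].set_title("Exponential")

c, loc, scale = stats.weibull_min.fit(waiting_times, floc=0)
stats.probplot(waiting_times, dist=stats.weibull_min, sparams=(c, loc, scale), plot=axes[1, 1])
axes[1, 1].set_title("Weibull")

plt.tight_layout()
plt.show()
```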
The Anderson-Darling normality test can be used as an indicator of goodness-of-fit. It produces a p-value, a probability that is compared to the decision criterion, the alpha (α) risk. Assume α = 0.05, meaning there is a 5 percent risk of rejecting the null hypothesis when it is true. The hypothesis test for this example is:
H0: The data follow the specified distribution.
Ha: The data do not follow the specified distribution.
If the p-value is less than or equal to alpha, there is evidence that the data do not follow the specified distribution. Conversely, a p-value greater than alpha suggests the data are consistent with that distribution.
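The p-values in the article come from a statistical package that reports them for the Anderson-Darling test directly. SciPy's `scipy.stats.anderson` returns the test statistic and critical values instead, so the equivalent decision compares the statistic to the 5 percent critical value, as sketched below; screening lognormality by testing the logged data for normality is an assumption of the sketch, not something the article states.

```python
# Sketch: Anderson-Darling normality check at the 5 percent level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
waiting_times = rng.lognormal(mean=3.0, sigma=0.6, size=100)  # illustrative data

result = stats.anderson(waiting_times, dist="norm")
crit_5pct = result.critical_values[list(result.significance_level).index(5.0)]

if result.statistic > crit_5pct:
    print("Reject H0: the data do not appear to be normally distributed")
else:
    print("Fail to reject H0: the data are consistent with a normal distribution")

# Lognormality can be screened the same way by testing the logged data
log_result = stats.anderson(np.log(waiting_times), dist="norm")
```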
The p-value for the lognormal distribution is 0.058, while the p-value for the Weibull distribution is 0.162. Both are above the 0.05 alpha risk, but the Weibull distribution is the better fit because its higher p-value indicates weaker evidence against the null hypothesis.
Now the Weibull distribution can be used to construct the proper individuals control chart (Figure 8). Notice that all of the data points fall within the control limits; hence, the process is stable and predictable.
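One common convention for charting non-normal data without transforming it is to place the control limits at the 0.135th and 99.865th percentiles of the fitted distribution, which cover the same 99.73 percent of values as ±3 sigma limits on a normal-based chart. The sketch below applies that convention to a Weibull fit of the illustrative data; it mirrors the general idea rather than the exact method behind Figure 8.

```python
# Sketch: control limits from a fitted Weibull (0.135th / 99.865th percentiles).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
waiting_times = rng.lognormal(mean=3.0, sigma=0.6, size=100)  # illustrative data

c, loc, scale = stats.weibull_min.fit(waiting_times, floc=0)

lcl = stats.weibull_min.ppf(0.00135, c, loc=loc, scale=scale)
center = stats.weibull_min.median(c, loc=loc, scale=scale)
ucl = stats.weibull_min.ppf(0.99865, c, loc=loc, scale=scale)

print(f"LCL = {lcl:.1f}, center = {center:.1f}, UCL = {ucl:.1f}")
```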
Now that the process is in control, it can be assessed using capability indices such as Cpk (Figure 9). Overall, this is a predictable process, with 8.85 percent of ER visit times out of specification.
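The out-of-specification percentage can be estimated from the fitted distribution's survival function. In the sketch below, the 45-minute upper specification limit is hypothetical, and the simulated data will not reproduce the article's 8.85 percent figure.

```python
# Sketch: percent of visits expected beyond a hypothetical upper spec limit (USL).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
waiting_times = rng.lognormal(mean=3.0, sigma=0.6, size=100)  # illustrative data

usl = 45.0  # hypothetical upper spec limit, in minutes
c, loc, scale = stats.weibull_min.fit(waiting_times, floc=0)

pct_out_of_spec = 100 * stats.weibull_min.sf(usl, c, loc=loc, scale=scale)
print(f"Estimated {pct_out_of_spec:.2f}% of visits exceed {usl} minutes")
```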
Comments
Yathiesh
Excellent Post – Very Informative
Kevin C
This is a great post. What impact would we see if this were a short-term analysis, to the point where the "newer" data points have less variance simply because they are newer, whereas the "older" data has had more time to accrue substantial outliers? Is there a definitive way to address this, or is it just a matter of trimming the outliers off or simply waiting longer to analyze?
Victoria
Hello, thanks for this post. One question: if you use a transformation on the data, how do you assess the error? E.g., with a regression analysis?
Chas Ward
Yes; the answer is to be found early-on:
When data fits a normal distribution, practitioners can make statements about the population using common analytical techniques, including control charts
and capability indices …
If you are clear that the sample is representative of the population, then the characteristics describing the shape should be identical for both sample and population. The normal distribution is, or should be, the shape of both the sample and the population. A larger sample size should, if randomly selected, be more representative of the population than a smaller one. HTH
AL
You can use the Kolmogorov-Smirnov test for large sample sizes and the Shapiro-Wilk test for samples smaller than 2,000.
Sean
Hi,
Very helpful! Was this completed in R? Is there any place I can find this code/dataset on the web?
Thanks!