

Supplemental Text Material to Support


Introduction to Statistical Quality Control 4th Edition
Douglas C. Montgomery
John Wiley & Sons, New York, 2001

1. Independent Random Variables


Preliminary Remarks
Readers encounter random variables throughout the textbook. An informal definition of
and notation for random variables is used. A random variable may be thought of
informally as any variable for which the measured or observed value depends on a
random or chance mechanism. That is, the value of a random variable cannot be known
in advance of actually observing the phenomenon. Formally, of course, a random
variable is a function that assigns a real number to each outcome in the sample space of
the observed phenomenon. Furthermore, it is customary to distinguish between the random
variable and its observed value or realization by using an upper-case letter to denote the
random variable (say X) and a corresponding lower-case letter for the actual numerical
value x that is the result of an observation or a measured value. This formal notation is
not used in the book because (1) it is not widely employed in the statistical quality control
field and (2) it is usually quite clear from the context whether we are discussing the
random variable or its realization.
Independent Random Variables
In the textbook, we make frequent use of the concept of independent random variables.
Most readers have been exposed to this in a basic statistics course, but here a brief review
of the concept is given. For convenience, we consider only the case of continuous
random variables. For the case of discrete random variables, refer to Montgomery and
Runger (1999).
Often there will be two or more random variables that jointly define some physical
phenomena of interest. For example, suppose we consider injection-molded components
used to assemble a connector for an automotive application. To adequately describe the
connector, we might need to study both the hole interior diameter and the wall thickness
of the component. Let x1 represent the hole interior diameter and x2 represent the wall
thickness. The joint probability distribution (or density function) of these two
continuous random variables can be specified by providing a method for calculating the
probability that x1 and x2 assume a value in any region R of two-dimensional space, where
the region R is often called the range space of the random variable. This is analogous to
the probability density function for a single random variable. Let this joint probability
density function be denoted by f ( x1 , x2 ) . Now the double integral of this joint probability
density function over a specified region R provides the probability that x1 and x2 assume
values in the range space R.
A joint probability density function has the following properties:
a. $f(x_1, x_2) \ge 0$ for all $x_1, x_2$
b. $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1\,dx_2 = 1$
c. For any region $R$ of two-dimensional space, $P\{(x_1, x_2) \in R\} = \iint_R f(x_1, x_2)\,dx_1\,dx_2$

The two random variables $x_1$ and $x_2$ are independent if $f(x_1, x_2) = f_1(x_1) f_2(x_2)$, where $f_1(x_1)$ and $f_2(x_2)$ are the marginal probability distributions of $x_1$ and $x_2$, respectively, defined as
$$f_1(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2 \quad \text{and} \quad f_2(x_2) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1$$

In general, if there are $p$ random variables $x_1, x_2, \ldots, x_p$, then the joint probability density function is $f(x_1, x_2, \ldots, x_p)$, with the properties:
a. $f(x_1, x_2, \ldots, x_p) \ge 0$ for all $x_1, x_2, \ldots, x_p$
b. The integral of $f(x_1, x_2, \ldots, x_p)$ over the entire range space equals 1
c. For any region $R$ of $p$-dimensional space,
$$P\{(x_1, x_2, \ldots, x_p) \in R\} = \int\!\!\int \cdots \int_R f(x_1, x_2, \ldots, x_p)\,dx_1\,dx_2 \cdots dx_p$$
The random variables $x_1, x_2, \ldots, x_p$ are independent if
$$f(x_1, x_2, \ldots, x_p) = f_1(x_1) f_2(x_2) \cdots f_p(x_p)$$
where $f_i(x_i)$ is the marginal probability distribution of $x_i$, defined as
$$f_i(x_i) = \int\!\!\int \cdots \int_{R_{x_i}} f(x_1, x_2, \ldots, x_p)\,dx_1 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_p$$

2. Random Samples
To properly apply many statistical techniques, the sample drawn from the population of
interest must be a random sample. To properly define a random sample, let x be a
random variable that represents the results of selecting one observation from the
population of interest. Let f ( x ) be the probability distribution of x. Now suppose that n
observations (a sample) are obtained independently from the population under
unchanging conditions. That is, we do not let the outcome from one observation influence
the outcome from another observation. Let xi be the random variable that represents the
observation obtained on the ith trial. Then the observations x1 , x2 ,..., xn are a random
sample.

In a random sample the marginal probability distributions $f(x_1), f(x_2), \ldots, f(x_n)$ are all identical, the observations in the sample are independent, and by definition, the joint probability distribution of the random sample is $f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n)$.

3. Development of the Poisson Distribution


The Poisson distribution is widely used in statistical quality control and improvement,
frequently as the underlying probability model for count data. As noted in Section 2-2.3
of the text, the Poisson distribution can be derived as a limiting form of the binomial
distribution, and it can also be developed from a probability argument based on the birth
and death process. We now give a summary of both developments.
The Poisson Distribution as a Limiting Form of the Binomial Distribution
Consider the binomial distribution
$$p(x) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{n!}{x!(n-x)!}\, p^x (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n$$
Let $\lambda = np$ so that $p = \lambda/n$. We may now write the binomial distribution as
$$p(x) = \frac{n(n-1)(n-2)\cdots(n-x+1)}{x!} \left(\frac{\lambda}{n}\right)^{x} \left(\frac{n-\lambda}{n}\right)^{n-x}$$
$$= \frac{\lambda^x}{x!}\left[(1)\left(1-\frac{1}{n}\right)\left(1-\frac{2}{n}\right)\cdots\left(1-\frac{x-1}{n}\right)\right]\left(1-\frac{\lambda}{n}\right)^{-x}\left(1-\frac{\lambda}{n}\right)^{n}$$
Let $n \to \infty$ and $p \to 0$ so that $\lambda = np$ remains constant. The terms
$$\left(1-\frac{1}{n}\right), \left(1-\frac{2}{n}\right), \ldots, \left(1-\frac{x-1}{n}\right) \quad \text{and} \quad \left(1-\frac{\lambda}{n}\right)^{-x}$$
all approach unity. Furthermore,
$$\left(1-\frac{\lambda}{n}\right)^{n} \to e^{-\lambda} \quad \text{as } n \to \infty$$
Thus, upon substitution we see that the limiting form of the binomial distribution is
$$p(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$
which is the Poisson distribution.
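As a quick numerical illustration of this limiting argument, the following Python sketch (assuming the scipy library is available; the rate lambda = 2 and the sample sizes shown are arbitrary choices) compares the binomial probabilities with p = lambda/n to the Poisson probabilities as n grows:

# Sketch: compare binomial(n, p = lam/n) probabilities with the Poisson(lam) limit.
# lam = 2 and the values of n are arbitrary illustration choices.
from scipy.stats import binom, poisson

lam = 2.0
for n in (10, 50, 250, 1000):
    p = lam / n
    # maximum absolute difference between the two pmfs over x = 0, ..., 10
    diff = max(abs(binom.pmf(x, n, p) - poisson.pmf(x, lam)) for x in range(11))
    print(f"n = {n:5d}: max |binomial - Poisson| = {diff:.6f}")

The maximum difference between the two probability functions shrinks toward zero as n increases.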

Development of the Poisson Distribution from the Poisson Process


Consider a collection of time-oriented events, arbitrarily called “arrivals” or “births”. Let
$x_t$ be the number of these "arrivals" or "births" that occur in the interval $[0, t)$. Note that the range space of $x_t$ is $R = \{0, 1, \ldots\}$. Assume that the numbers of births during non-overlapping time intervals are independent random variables, and that there is a positive constant $\lambda$ such that for any small time interval $\Delta t$, the following statements are true:
1. The probability that exactly one birth will occur in an interval of length $\Delta t$ is $\lambda \cdot \Delta t$.
2. The probability that zero births will occur in the interval is $1 - \lambda \cdot \Delta t$.
3. The probability that more than one birth will occur in the interval is zero.
The parameter $\lambda$ is often called the mean arrival rate or the mean birth rate. This type of process, in which the probability of observing exactly one event in a small interval of time is constant (or the probability of occurrence of an event is directly proportional to the length of the time interval), and in which the occurrences of events in non-overlapping time intervals are independent, is called a Poisson process.
In the following, let
$$P\{x_t = x\} = p(x) = p_x(t), \quad x = 0, 1, 2, \ldots$$
Suppose that there have been no births up to time $t$. The probability that there are no births at the end of time $t + \Delta t$ is
$$p_0(t + \Delta t) = (1 - \lambda \cdot \Delta t)\, p_0(t)$$
Note that
$$\frac{p_0(t + \Delta t) - p_0(t)}{\Delta t} = -\lambda\, p_0(t)$$
so consequently
$$\lim_{\Delta t \to 0}\left[\frac{p_0(t + \Delta t) - p_0(t)}{\Delta t}\right] = p_0'(t) = -\lambda\, p_0(t)$$
For $x > 0$ births at the end of time $t + \Delta t$ we have
$$p_x(t + \Delta t) = p_{x-1}(t)\,\lambda \cdot \Delta t + (1 - \lambda \cdot \Delta t)\, p_x(t)$$
and
$$\lim_{\Delta t \to 0}\left[\frac{p_x(t + \Delta t) - p_x(t)}{\Delta t}\right] = p_x'(t) = \lambda\, p_{x-1}(t) - \lambda\, p_x(t)$$
Thus we have a system of differential equations that describe the arrivals or births:
$$p_0'(t) = -\lambda\, p_0(t) \quad \text{for } x = 0$$
$$p_x'(t) = \lambda\, p_{x-1}(t) - \lambda\, p_x(t) \quad \text{for } x = 1, 2, \ldots$$
The solution to this set of equations is
$$p_x(t) = \frac{(\lambda t)^x e^{-\lambda t}}{x!}, \quad x = 0, 1, 2, \ldots$$
Obviously, for a fixed value of $t$ this is the Poisson distribution.
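The same distribution can be checked by simulating the birth process directly. The sketch below (numpy assumed available; the values lambda = 3, t = 2, and the number of replicates are arbitrary illustration choices) generates exponential interarrival times with rate lambda, counts the births in [0, t), and compares the sample mean and variance of the counts with lambda*t:

# Sketch: simulate a Poisson process by summing exponential interarrival times
# and counting births in [0, t); the counts should behave like Poisson(lam * t).
import numpy as np

rng = np.random.default_rng(1)
lam, t, reps = 3.0, 2.0, 20_000          # arbitrary illustration values

counts = np.empty(reps, dtype=int)
for k in range(reps):
    arrival_time, n_births = 0.0, 0
    while True:
        arrival_time += rng.exponential(1.0 / lam)   # next interarrival time
        if arrival_time >= t:
            break
        n_births += 1
    counts[k] = n_births

print("simulated mean:", counts.mean(), " simulated variance:", counts.var())
print("Poisson value lambda*t:", lam * t)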

4. Expected Value and Variance Operators


Readers should have prior exposure to mathematical expectation from a basic statistics
course. Here some of the basic properties of expectation are reviewed.
The expected value of a random variable $x$ is denoted by $E(x)$ and is given by
$$E(x) = \begin{cases} \sum_{\text{all } x_i} x_i\, p(x_i), & x \text{ is a discrete random variable} \\ \int_{-\infty}^{\infty} x f(x)\,dx, & x \text{ is a continuous random variable} \end{cases}$$
The expectation of a random variable is very useful in that it provides a straightforward
characterization of the distribution, and it has a simple practical interpretation as the
center of mass, centroid, or mean of the distribution.
Now suppose that $y$ is a function of the random variable $x$, say $y = h(x)$. Note that $y$ is also a random variable. The expectation of $h(x)$ is defined as
$$E[h(x)] = \begin{cases} \sum_{\text{all } x_i} h(x_i)\, p(x_i), & x \text{ is a discrete random variable} \\ \int_{-\infty}^{\infty} h(x) f(x)\,dx, & x \text{ is a continuous random variable} \end{cases}$$

An interesting result, sometimes called the "theorem of the unconscious statistician," states that if $x$ is a continuous random variable with probability density function $f(x)$ and $y = h(x)$ is a function of $x$ having probability density function $g(y)$, then the expectation of $y$ can be found either by using the definition of expectation with $g(y)$ or in terms of its definition as the expectation of a function of $x$ with respect to the probability density function of $x$. That is, we may write either
$$E(y) = \int_{-\infty}^{\infty} y\, g(y)\,dy$$
or
$$E(y) = E[h(x)] = \int_{-\infty}^{\infty} h(x) f(x)\,dx$$

The name for this theorem comes from the fact that we often apply it without consciously
thinking about whether the theorem is true in our particular case.

Useful Properties of Expectation I:

Let $x$ be a random variable with mean $\mu$, and let $c$ be a constant. Then
1. $E(c) = c$
2. $E(x) = \mu$
3. $E(cx) = cE(x) = c\mu$
4. $E[ch(x)] = cE[h(x)]$
5. If $c_1$ and $c_2$ are constants and $h_1$ and $h_2$ are functions, then
$$E[c_1 h_1(x) + c_2 h_2(x)] = c_1 E[h_1(x)] + c_2 E[h_2(x)]$$
Because of property 5, expectation is called a linear (or distributive) operator.
Now consider the function $h(x) = (x - c)^2$ where $c$ is a constant, and suppose that $E[(x-c)^2]$ exists. To find the value of $c$ for which $E[(x-c)^2]$ is a minimum, write
$$E[(x-c)^2] = E[x^2 - 2xc + c^2] = E(x^2) - 2cE(x) + c^2$$
Now the derivative of $E[(x-c)^2]$ with respect to $c$ is $-2E(x) + 2c$, and this derivative is zero when $c = E(x)$. Therefore, $E[(x-c)^2]$ is a minimum when $c = E(x) = \mu$.
The variance of the random variable $x$ is defined as
$$V(x) = E[(x-\mu)^2] = \sigma^2$$
and we usually call $V(x) = E[(x-\mu)^2]$ the variance operator. It is straightforward to show that if $c$ is a constant, then
$$V(cx) = c^2\sigma^2$$
The variance is analogous to the moment of inertia in mechanics.

Useful Properties of Expectation II:

Let $x_1$ and $x_2$ be random variables with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively, and let $c_1$ and $c_2$ be constants. Then
1. $E(x_1 + x_2) = \mu_1 + \mu_2$
2. It is possible to show that $V(x_1 + x_2) = \sigma_1^2 + \sigma_2^2 + 2\,\mathrm{Cov}(x_1, x_2)$, where
$$\mathrm{Cov}(x_1, x_2) = E[(x_1 - \mu_1)(x_2 - \mu_2)]$$
is the covariance of the random variables $x_1$ and $x_2$. The covariance is a measure of the linear association between $x_1$ and $x_2$. More specifically, we may show that if $x_1$ and $x_2$ are independent, then $\mathrm{Cov}(x_1, x_2) = 0$.
3. $V(x_1 - x_2) = \sigma_1^2 + \sigma_2^2 - 2\,\mathrm{Cov}(x_1, x_2)$
4. If the random variables $x_1$ and $x_2$ are independent, $V(x_1 \pm x_2) = \sigma_1^2 + \sigma_2^2$
5. If the random variables $x_1$ and $x_2$ are independent, $E(x_1 x_2) = E(x_1)E(x_2) = \mu_1\mu_2$
6. Regardless of whether $x_1$ and $x_2$ are independent, in general
$$E\left(\frac{x_1}{x_2}\right) \ne \frac{E(x_1)}{E(x_2)}$$
7. For the single random variable $x$,
$$V(x + x) = 4\sigma^2$$
because $\mathrm{Cov}(x, x) \equiv \sigma^2$.

Moments
Although we do not make much use of the notion of the moments of a random variable in the book, for completeness we give the definition. Let the function of the random variable $x$ be
$$h(x) = x^k$$
where $k$ is a positive integer. Then the expectation of $h(x) = x^k$ is called the $k$th moment about the origin of the random variable $x$ and is given by
$$E(x^k) = \begin{cases} \sum_{\text{all } x_i} x_i^k\, p(x_i), & x \text{ is a discrete random variable} \\ \int_{-\infty}^{\infty} x^k f(x)\,dx, & x \text{ is a continuous random variable} \end{cases}$$
Note that the first origin moment is just the mean $\mu$ of the random variable $x$. The second origin moment is
$$E(x^2) = \mu^2 + \sigma^2$$
Moments about the mean are defined as
$$E[(x-\mu)^k] = \begin{cases} \sum_{\text{all } x_i} (x_i - \mu)^k\, p(x_i), & x \text{ is a discrete random variable} \\ \int_{-\infty}^{\infty} (x-\mu)^k f(x)\,dx, & x \text{ is a continuous random variable} \end{cases}$$
The second moment about the mean is the variance $\sigma^2$ of the random variable $x$.


5. The Mean and Variance of the Normal Distribution


In Section 2-3.1 we introduce the normal distribution, with probability density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}, \quad -\infty < x < \infty$$
and we stated that $\mu$ and $\sigma^2$ are the mean and variance, respectively, of the distribution. We now show that this claim is correct.
Note that $f(x) > 0$. We first evaluate the integral $I = \int_{-\infty}^{\infty} f(x)\,dx$, showing that it is equal to 1. In the integral, change the variable of integration to $z = (x-\mu)/\sigma$. Then
$$I = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz$$
Since $I > 0$, if $I^2 = 1$ then $I = 1$. Now we may write
$$I^2 = \frac{1}{2\pi}\left[\int_{-\infty}^{\infty} e^{-x^2/2}\,dx\right]\left[\int_{-\infty}^{\infty} e^{-y^2/2}\,dy\right] = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\,dx\,dy$$
If we switch to polar coordinates, then $x = r\cos\theta$, $y = r\sin\theta$ and
$$I^2 = \frac{1}{2\pi}\int_0^{2\pi}\int_0^{\infty} e^{-r^2/2}\, r\,dr\,d\theta = \frac{1}{2\pi}\int_0^{2\pi} d\theta = \frac{1}{2\pi}(2\pi) = 1$$
So we have shown that $f(x)$ has the properties of a probability density function.


The integrand obtained by the substitution $z = (x-\mu)/\sigma$ is, of course, the standard normal distribution, an important special case of the more general normal distribution. The standard normal probability density function has a special notation, namely
$$\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad -\infty < z < \infty$$
and the cumulative standard normal distribution is
$$\Phi(z) = \int_{-\infty}^{z} \phi(t)\,dt$$
Several useful properties of the standard normal distribution can be found by basic calculus:
1. $\phi(-z) = \phi(z)$ for all real $z$, so $\phi(z)$ is an even function of $z$ (symmetric about 0)
2. $\phi'(z) = -z\,\phi(z)$
3. $\phi''(z) = (z^2 - 1)\,\phi(z)$

Consequently, $\phi(z)$ has a unique maximum at $z = 0$, inflection points at $z = \pm 1$, and both $\phi(z) \to 0$ and $\phi'(z) \to 0$ as $z \to \pm\infty$.
The mean and variance of the standard normal distribution are found as follows:
$$E(z) = \int_{-\infty}^{\infty} z\,\phi(z)\,dz = -\int_{-\infty}^{\infty} \phi'(z)\,dz = -\phi(z)\Big|_{-\infty}^{\infty} = 0$$
and
$$E(z^2) = \int_{-\infty}^{\infty} z^2\phi(z)\,dz = \int_{-\infty}^{\infty} \left[\phi''(z) + \phi(z)\right]dz = \phi'(z)\Big|_{-\infty}^{\infty} + \int_{-\infty}^{\infty}\phi(z)\,dz = 0 + 1 = 1$$
Because the variance of a random variable can be expressed in terms of expectation as $\sigma^2 = E(z-\mu)^2 = E(z^2) - \mu^2$, we have shown that the mean and variance of the standard normal distribution are 0 and 1, respectively.
Now consider the case where $x$ follows the more general normal distribution. Using the substitution $z = (x-\mu)/\sigma$, we have
$$E(x) = \int_{-\infty}^{\infty} x\,\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\,dx = \int_{-\infty}^{\infty} (\mu + z\sigma)\,\phi(z)\,dz = \mu\int_{-\infty}^{\infty}\phi(z)\,dz + \sigma\int_{-\infty}^{\infty} z\,\phi(z)\,dz = \mu(1) + \sigma(0) = \mu$$
and
$$E(x^2) = \int_{-\infty}^{\infty} x^2\,\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\,dx = \int_{-\infty}^{\infty} (\mu + z\sigma)^2\,\phi(z)\,dz = \mu^2\int_{-\infty}^{\infty}\phi(z)\,dz + 2\sigma\mu\int_{-\infty}^{\infty} z\,\phi(z)\,dz + \sigma^2\int_{-\infty}^{\infty} z^2\,\phi(z)\,dz = \mu^2 + \sigma^2$$
Therefore, it follows that $V(x) = E(x^2) - \mu^2 = (\mu^2 + \sigma^2) - \mu^2 = \sigma^2$.

6. More About the Gamma Distribution


The gamma distribution is introduced in Section 3-3.3. The gamma probability density function is
$$f(x) = \frac{\lambda}{\Gamma(r)}(\lambda x)^{r-1} e^{-\lambda x}, \quad x \ge 0$$
where $r > 0$ is a shape parameter and $\lambda > 0$ is a scale parameter. The parameter $r$ is called a shape parameter because it determines the basic shape of the graph of the density function. For example, if $r = 1$, the gamma distribution reduces to an exponential distribution. There are actually three basic shapes: $r < 1$, or hyperexponential; $r = 1$, or exponential; and $r > 1$, or unimodal with right skew.
The cumulative distribution function of the gamma is
$$F(x; r, \lambda) = \int_0^x \frac{\lambda}{\Gamma(r)}(\lambda t)^{r-1} e^{-\lambda t}\,dt$$

The substitution $u = t/\lambda$ in this integral results in $F(x; r, \lambda) = F(x/\lambda; r, 1)$, which depends on $\lambda$ only through the ratio $x/\lambda$. We typically call such a parameter a scale parameter. It can be important to have a scale parameter in a probability distribution so that the results do not depend on the scale of measurement actually used. For example, suppose that we are measuring time in months, and $\lambda = 6$. The probability that $x$ is less than or equal to 12 months is $F(12/6; r, 1) = F(2; r, 1)$. If we wish instead to measure time in weeks, then $\lambda = 24$ and the probability that $x$ is less than or equal to 48 weeks is just $F(48/24; r, 1) = F(2; r, 1)$. Therefore, different scales of measurement can be accommodated by changing the scale parameter without having to change to a more general form of the distribution.
When $r$ is an integer, the gamma distribution is sometimes called the Erlang distribution. Another special case of the gamma distribution arises when we let $r = 1/2, 1, 3/2, 2, \ldots$ and $\lambda = 1/2$; this is the chi-square distribution with degrees of freedom $r/\lambda = 1, 2, \ldots$. The chi-square distribution is very important in statistical inference.

7. The Lognormal Distribution


Another very general distribution not mentioned specifically in the textbook is the lognormal distribution. The lognormal distribution is defined only for positive values of the random variable $x$, and the probability density function is
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(\ln x - \mu)^2}, \quad x > 0$$

The parameters of the lognormal distribution are $-\infty < \mu < \infty$ and $0 < \sigma^2 < \infty$. The lognormal random variable is related to the normal random variable in that $y = \ln x$ is normally distributed with mean $\mu$ and variance $\sigma^2$.
The mean and variance of the lognormal distribution are
$$E(x) = \mu_x = e^{\mu + \frac{1}{2}\sigma^2}$$
$$V(x) = \sigma_x^2 = e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right)$$
The median and mode of the lognormal distribution are
$$\tilde{x} = e^{\mu} \quad \text{and} \quad m_o = e^{\mu - \sigma^2}$$
In general, the $k$th origin moment of the lognormal random variable is
$$E(x^k) = e^{k\mu + \frac{1}{2}k^2\sigma^2}$$
Like the gamma and Weibull distributions, the lognormal finds application in reliability engineering, often as a model for the survival time of components or systems. Some important properties of the lognormal distribution are:
1. If $x_1$ and $x_2$ are independent lognormal random variables with parameters $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$, respectively, then $y = x_1 \cdot x_2$ is a lognormal random variable with parameters $\mu_1 + \mu_2$ and $\sigma_1^2 + \sigma_2^2$.
2. If $x_1, x_2, \ldots, x_k$ are independently and identically distributed lognormal random variables with parameters $\mu$ and $\sigma^2$, then the geometric mean of the $x_i$, or $\left(\prod_{i=1}^{k} x_i\right)^{1/k}$, has a lognormal distribution with parameters $\mu$ and $\sigma^2/k$.
3. If $x$ is a lognormal random variable with parameters $\mu$ and $\sigma^2$, and if $a$, $b$, and $c$ are constants such that $b = e^c$, then the random variable $y = bx^a$ has a lognormal distribution with parameters $c + a\mu$ and $a^2\sigma^2$.
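These moment formulas and property 1 are easy to check by simulation. The following sketch (numpy assumed available; the parameter values are arbitrary illustrations) compares sample results to the formulas above:

# Sketch: check the lognormal mean formula and property 1 (the product of two
# independent lognormals is lognormal with added parameters).
import numpy as np

rng = np.random.default_rng(7)
mu, sigma = 1.0, 0.5
x = rng.lognormal(mean=mu, sigma=sigma, size=500_000)
print("sample mean:", x.mean(), " theory exp(mu + sigma^2/2):", np.exp(mu + 0.5 * sigma**2))

# property 1: y = x1 * x2 with parameters (mu1, s1^2) and (mu2, s2^2)
x1 = rng.lognormal(1.0, 0.5, 500_000)
x2 = rng.lognormal(0.3, 0.4, 500_000)
y = x1 * x2
print("mean of ln(y):", np.log(y).mean(), " expected mu1 + mu2:", 1.0 + 0.3)
print("var of ln(y): ", np.log(y).var(), " expected s1^2 + s2^2:", 0.5**2 + 0.4**2)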

8. The Failure Rate for the Exponential Distribution


The exponential distribution
$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$
was introduced in Section 2-3.2 of the text. The exponential distribution is frequently used in reliability engineering as a model for the lifetime or time to failure of a component or system. Generally, we define the reliability function of the unit as
$$R(t) = P\{x > t\} = 1 - \int_0^t f(x)\,dx = 1 - F(t)$$
where, of course, $F(t)$ is the cumulative distribution function. In biomedical applications, the reliability function is usually called the survival function. For the exponential distribution, the reliability function is
$$R(t) = e^{-\lambda t}$$
The Hazard Function
The mean and variance of a distribution are quite important in reliability applications, but an additional property called the hazard function or the instantaneous failure rate is also useful. The hazard function is the conditional density of failure at time $x$, given that the unit has survived until time $x$. Therefore, letting $X$ denote the random variable and $x$ denote the realization,
$$h(x) = f(x \mid X \ge x) = F'(x \mid X \ge x) = \lim_{\Delta x \to 0} \frac{F(x + \Delta x \mid X \ge x) - F(x \mid X \ge x)}{\Delta x} = \lim_{\Delta x \to 0} \frac{P(x \le X \le x + \Delta x \mid X \ge x)}{\Delta x}$$
$$= \lim_{\Delta x \to 0} \frac{P(x \le X \le x + \Delta x,\; X \ge x)}{\Delta x\, P\{X \ge x\}} = \lim_{\Delta x \to 0} \frac{P(x \le X \le x + \Delta x)}{\Delta x\,[1 - F(x)]} = \frac{f(x)}{1 - F(x)}$$
It turns out that specifying a hazard function completely determines the cumulative distribution function (and vice versa).
The Hazard Function for the Exponential Distribution
For the exponential distribution, the hazard function is
$$h(x) = \frac{f(x)}{1 - F(x)} = \frac{\lambda e^{-\lambda x}}{e^{-\lambda x}} = \lambda$$
That is, the hazard function for the exponential distribution is constant, or the failure rate is just the reciprocal of the mean time to failure.


A constant failure rate implies that the reliability of the unit at time t does not depend on
its age. This may be a reasonable assumption for some types of units, such as electrical
components, but it’s probably unreasonable for mechanical components. It is probably
not a good assumption for many types of system-level products that are made up of many
components (such as an automobile). Generally, an increasing hazard function indicates
that the unit is more likely to fail in the next increment of time than it would have been in
an earlier increment of time of the same length. This is likely due to aging or wear.
Despite the apparent simplicity of its hazard function, the exponential distribution has
been an important distribution in reliability engineering. This is partly because the
constant failure rate assumption is probably not unreasonable over some region of the
unit’s life.

9. The Failure Rate for the Weibull Distribution


The instantaneous failure rate, or hazard function, was defined in Section 8 of the Supplemental Text Material. For the Weibull distribution, the hazard function is
$$h(x) = \frac{f(x)}{1 - F(x)} = \frac{(\beta/\theta)(x/\theta)^{\beta - 1}\, e^{-(x/\theta)^\beta}}{e^{-(x/\theta)^\beta}} = \frac{\beta}{\theta}\left(\frac{x}{\theta}\right)^{\beta - 1}$$
Note that if $\beta = 1$ the Weibull hazard function is constant. This should be no surprise, since for $\beta = 1$ the Weibull distribution reduces to the exponential. When $\beta > 1$, the Weibull hazard function is increasing in $x$, approaching $\infty$ as $x \to \infty$. Consequently, the Weibull is a fairly common choice as a model for components or systems that experience deterioration due to wear-out or fatigue. For the case where $\beta < 1$, the Weibull hazard function is decreasing in $x$, approaching 0 as $x \to \infty$.
For comparison purposes, note that the hazard function for the gamma distribution with parameters $r$ and $\lambda$ is also constant for the case $r = 1$ (the gamma also reduces to the exponential when $r = 1$). Also, when $r > 1$ the hazard function increases, and when $r < 1$ the hazard function decreases. However, when $r > 1$ the hazard function approaches $\lambda$ from below, while if $r < 1$ the hazard function approaches $\lambda$ from above. Therefore, even though the graphs of the gamma and Weibull distributions can look very similar, and both can produce reasonable fits to the same sample of data, they clearly have very different characteristics in terms of describing survival or reliability data.

10. More About Parameter Estimation


Throughout the book, estimators of various population or process parameters are given without much discussion concerning how these estimators are generated. Often they are simply "logical" or intuitive estimators, such as using the sample average $\bar{x}$ as an estimator of the population mean $\mu$.
There are methods for developing point estimators of population parameters. These methods are typically discussed in detail in courses in mathematical statistics. We now give a brief overview of some of these methods.

The Method of Maximum Likelihood

One of the best methods for obtaining a point estimator of a population parameter is the method of maximum likelihood. Suppose that $x$ is a random variable with probability distribution $f(x; \theta)$, where $\theta$ is a single unknown parameter. Let $x_1, x_2, \ldots, x_n$ be the observations in a random sample of size $n$. Then the likelihood function of the sample is
$$L(\theta) = f(x_1; \theta) \cdot f(x_2; \theta) \cdots f(x_n; \theta)$$
The maximum likelihood estimator of $\theta$ is the value of $\theta$ that maximizes the likelihood function $L(\theta)$.
Example 1 The Exponential Distribution
To illustrate the maximum likelihood estimation procedure, let $x$ be exponentially distributed with parameter $\lambda$. The likelihood function of a random sample of size $n$, say $x_1, x_2, \ldots, x_n$, is
$$L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_{i=1}^{n} x_i}$$
Now it turns out that, in general, if the maximum likelihood estimator maximizes $L(\theta)$, it will also maximize the log likelihood, $\ln L(\theta)$. For the exponential distribution, the log likelihood is
$$\ln L(\lambda) = n\ln\lambda - \lambda\sum_{i=1}^{n} x_i$$
Now
$$\frac{d\ln L(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i$$
Equating the derivative to zero and solving for the estimator of $\lambda$, we obtain
$$\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}$$
Thus the maximum likelihood estimator (or the MLE) of $\lambda$ is the reciprocal of the sample average.
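A minimal numerical sketch of this result (numpy assumed available; the true value lambda = 2.5 and the sample size are arbitrary choices) generates exponential data, computes the MLE 1/x-bar, and confirms that the log likelihood is largest near that value:

# Sketch: the MLE of lambda for exponential data is 1 / (sample mean).
import numpy as np

rng = np.random.default_rng(42)
lam_true, n = 2.5, 2000
x = rng.exponential(scale=1.0 / lam_true, size=n)   # numpy uses scale = 1/lambda

lam_hat = 1.0 / x.mean()
print("MLE of lambda:", lam_hat)

# the log likelihood n*ln(lam) - lam*sum(x) is largest near lam_hat
def log_lik(lam):
    return n * np.log(lam) - lam * x.sum()

for lam in (0.8 * lam_hat, lam_hat, 1.2 * lam_hat):
    print(f"lambda = {lam:.3f}  log L = {log_lik(lam):.2f}")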

Maximum likelihood estimation can be used in situations where there are several unknown parameters, say $\theta_1, \theta_2, \ldots, \theta_p$, to be estimated. The maximum likelihood estimators are found simply by setting the $p$ first partial derivatives $\partial L(\theta_1, \theta_2, \ldots, \theta_p)/\partial\theta_i$, $i = 1, 2, \ldots, p$, of the likelihood (or the log likelihood) equal to zero and solving the resulting system of equations.
Example 2 The Normal Distribution
Let $x$ be normally distributed with the parameters $\mu$ and $\sigma^2$ unknown. The likelihood function of a random sample of size $n$ is
$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2} = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2}$$
The log-likelihood function is
$$\ln L(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
Now
$$\frac{\partial \ln L(\mu, \sigma^2)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$$
$$\frac{\partial \ln L(\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$$
The solution to these equations yields the MLEs
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x} \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Generally, we like the method of maximum likelihood because when $n$ is large, (1) it results in estimators that are approximately unbiased, (2) the variance of an MLE is as small as or nearly as small as the variance that could be obtained with any other estimation technique, and (3) MLEs are approximately normally distributed. Furthermore, the MLE has an invariance property; that is, if $\hat{\theta}$ is the MLE of $\theta$, then the MLE of a function of $\theta$, say $h(\theta)$, is the same function $h(\hat{\theta})$ of the MLE. There are also some other "nice" statistical properties that MLEs enjoy; see a book on mathematical statistics, such as Hogg and Craig (1978) or Bain and Engelhardt (1987).

The unbiased property of the MLE is a "large-sample" or asymptotic property. To illustrate, consider the MLE for $\sigma^2$ in the normal distribution of Example 2 above. We can easily show that
$$E(\hat{\sigma}^2) = \frac{n-1}{n}\,\sigma^2$$
Now the bias in estimation of $\sigma^2$ is
$$E(\hat{\sigma}^2) - \sigma^2 = \frac{n-1}{n}\,\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}$$
Notice that the bias in estimating $\sigma^2$ goes to zero as the sample size $n \to \infty$. Therefore, the MLE is an asymptotically unbiased estimator.
The Method of Moments
Estimation by the method of moments involves equating the origin moments of the probability distribution (which are functions of the unknown parameters) to the sample moments, and solving for the unknown parameters. We can define the first $p$ sample moments as
$$M_k' = \frac{\sum_{i=1}^{n} x_i^k}{n}, \quad k = 1, 2, \ldots, p$$
and the first $p$ moments about the origin of the random variable $x$ are just
$$\mu_k' = E(x^k), \quad k = 1, 2, \ldots, p$$
Example 3 The Normal Distribution
For the normal distribution the first two origin moments are
$$\mu_1' = \mu \quad \text{and} \quad \mu_2' = \mu^2 + \sigma^2$$
and the first two sample moments are
$$M_1' = \bar{x} \quad \text{and} \quad M_2' = \frac{1}{n}\sum_{i=1}^{n} x_i^2$$
Equating the sample and origin moments results in
$$\mu = \bar{x} \quad \text{and} \quad \mu^2 + \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2$$
The solution gives the moment estimators of $\mu$ and $\sigma^2$:
$$\hat{\mu} = \bar{x} \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
The method of moments often yields estimators that are reasonably good. For example,
in the above example the moment estimators are identical to the MLEs. However,
generally moment estimators are not as good as MLEs because they don’t have statistical
properties that are as nice. For example, moment estimators usually have larger
variances than MLEs.
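The following short sketch (numpy assumed; the data are simulated purely for illustration) computes the moment estimators for the normal case of Example 3 directly from the first two sample moments and confirms that they coincide with the MLEs of Example 2:

# Sketch: moment estimators for normal data from the first two sample moments.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=1000)   # illustrative data, mu = 10, sigma = 2

m1 = x.mean()                 # first sample moment
m2 = np.mean(x**2)            # second sample moment
mu_hat = m1
sigma2_hat = m2 - m1**2       # solves mu^2 + sigma^2 = M2'

print("moment estimators:", mu_hat, sigma2_hat)
print("MLEs             :", x.mean(), np.mean((x - x.mean())**2))   # identical values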
Least Squares Estimation
The method of least squares is one of the oldest and most widely used methods of
parameter estimation. Unlike the method of maximum likelihood and the method of
moments, least squares can be employed when the distribution of the random variable is
unknown.
To illustrate, suppose that the simple location model can describe the random variable $x$:
$$x_i = \mu + \varepsilon_i, \quad i = 1, 2, \ldots, n$$
where the parameter $\mu$ is unknown and the $\varepsilon_i$ are random errors. We don't know the distribution of the errors, but we can assume that they have mean zero and constant variance. The least squares estimator of $\mu$ is chosen so that the sum of the squares of the model errors $\varepsilon_i$ is minimized. The least squares function for a sample of $n$ observations $x_1, x_2, \ldots, x_n$ is
$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n}(x_i - \mu)^2$$
Differentiating $L$ and equating the derivative to zero results in the least squares estimator of $\mu$:
$$\hat{\mu} = \bar{x}$$
In general, the least squares function will contain $p$ unknown parameters, and $L$ will be minimized by solving the equations that result when the first partial derivatives of $L$ with respect to the unknown parameters are equated to zero. These equations are called the least squares normal equations.
The method of least squares dates from work by Carl Friedrich Gauss in the early 1800s. It has a
very well-developed and indeed quite elegant theory. For a discussion of the use of least
squares in estimating the parameters in regression models and many illustrative
examples, see Montgomery, Peck and Vining (2001), and for a very readable and concise
presentation of the theory, see Myers and Milton (1991).

11. Proof That $E(\bar{x}) = \mu$ and $E(S^2) = \sigma^2$


It is easy to show that the sample average $\bar{x}$ and the sample variance $S^2$ are unbiased estimators of the corresponding population parameters $\mu$ and $\sigma^2$, respectively. Suppose that the random variable $x$ has mean $\mu$ and variance $\sigma^2$, and that $x_1, x_2, \ldots, x_n$ is a random sample of size $n$ from the population. Then
$$E(\bar{x}) = E\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu$$
because the expected value of each observation in the sample is $E(x_i) = \mu$. Now consider
$$E(S^2) = E\left(\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\right) = \frac{1}{n-1}\, E\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)$$
It is convenient to write $\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$, and so
$$E(S^2) = \frac{1}{n-1}\, E\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = \frac{1}{n-1}\left[\sum_{i=1}^{n} E(x_i^2) - nE(\bar{x}^2)\right]$$
Now $E(x_i^2) = \mu^2 + \sigma^2$ and $E(\bar{x}^2) = \mu^2 + \sigma^2/n$. Therefore
$$E(S^2) = \frac{1}{n-1}\left[\sum_{i=1}^{n}(\mu^2 + \sigma^2) - n(\mu^2 + \sigma^2/n)\right] = \frac{1}{n-1}\left(n\mu^2 + n\sigma^2 - n\mu^2 - \sigma^2\right) = \frac{(n-1)\sigma^2}{n-1} = \sigma^2$$

Note that:
a. These results do not depend on the form of the distribution of the random variable $x$. Many people think that an assumption of normality is required, but this is unnecessary.
b. Even though $E(S^2) = \sigma^2$, the sample standard deviation $S$ is not an unbiased estimator of the population standard deviation. This is discussed more fully in the next section.
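Point (a) can be illustrated by simulation. In the sketch below (numpy assumed; the exponential population, subgroup size n = 5, and number of replicates are arbitrary choices), the average of S squared over many samples is close to the true variance even though the population is far from normal, while the average of S falls short of the true standard deviation, anticipating the next section:

# Sketch: E(S^2) = sigma^2 even for a non-normal population.  The population is
# exponential with mean 1, so sigma^2 = 1.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 5, 200_000
samples = rng.exponential(scale=1.0, size=(reps, n))

s2 = samples.var(axis=1, ddof=1)   # sample variance with divisor n - 1
s = np.sqrt(s2)                    # sample standard deviation

print("average S^2:", s2.mean(), " (true sigma^2 = 1)")
print("average S  :", s.mean(), " (true sigma = 1, so S is biased low)")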

12. Proof That $E(S) \ne \sigma$


In Section 11 of the Supplemental Text Material we showed that the sample variance is an unbiased estimator of the population variance; that is, $E(S^2) = \sigma^2$, and that this result does not depend on the form of the distribution. However, the sample standard deviation $S$ is not an unbiased estimator of the population standard deviation $\sigma$. This is easy to demonstrate for the case where the random variable $x$ follows a normal distribution.
Let $x$ have a normal distribution with mean $\mu$ and variance $\sigma^2$, and let $x_1, x_2, \ldots, x_n$ be a random sample of size $n$ from the population. Now the distribution of
$$\frac{(n-1)S^2}{\sigma^2}$$
is chi-square with $n-1$ degrees of freedom, denoted $\chi^2_{n-1}$. Therefore the distribution of $S^2$ is $\sigma^2/(n-1)$ times a $\chi^2_{n-1}$ random variable. So, when sampling from a normal distribution, the expected value of $S^2$ is
$$E(S^2) = E\left(\frac{\sigma^2}{n-1}\,\chi^2_{n-1}\right) = \frac{\sigma^2}{n-1}\, E(\chi^2_{n-1}) = \frac{\sigma^2}{n-1}\,(n-1) = \sigma^2$$
because the mean of a chi-square random variable with $n-1$ degrees of freedom is $n-1$. Now it follows that the distribution of
$$\frac{\sqrt{n-1}\, S}{\sigma}$$
is a chi distribution with $n-1$ degrees of freedom, denoted $\chi_{n-1}$. The expected value of $S$ can be written as
$$E(S) = E\left(\frac{\sigma}{\sqrt{n-1}}\,\chi_{n-1}\right) = \frac{\sigma}{\sqrt{n-1}}\, E(\chi_{n-1})$$
The mean of the chi distribution with $n-1$ degrees of freedom is
$$E(\chi_{n-1}) = \sqrt{2}\,\frac{\Gamma(n/2)}{\Gamma[(n-1)/2]}$$
where the gamma function is $\Gamma(r) = \int_0^{\infty} y^{r-1} e^{-y}\,dy$. Then
$$E(S) = \sigma\sqrt{\frac{2}{n-1}}\,\frac{\Gamma(n/2)}{\Gamma[(n-1)/2]} = c_4\sigma$$
The constant c4 is given in Appendix table VI.
While $S$ is a biased estimator of $\sigma$, the bias gets small fairly quickly as the sample size $n$
increases. From Appendix table VI, note that c4 = 0.94 for a sample of n = 5, c4 = 0.9727
for a sample of n = 10, and c4 = 0.9896 or very nearly unity for a sample of n = 25.
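The constant c4 can be computed directly from the gamma-function expression above. A short sketch (standard-library Python only) reproduces the values just quoted:

# Sketch: c4 = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), compared with the
# values quoted from Appendix Table VI.
from math import sqrt, gamma

def c4(n):
    return sqrt(2.0 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)

for n in (5, 10, 25):
    print(f"n = {n:2d}: c4 = {c4(n):.4f}")
# expected output: 0.9400, 0.9727, 0.9896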

13. The Sample Variance $S^2$ Is Not Always an Unbiased Estimator of $\sigma^2$


An important property of the sample variance is that it is an unbiased estimator of the
population variance, as demonstrated in Section 11 of the Supplemental Text Material.


However, this unbiased property depends on the assumption that the sample data has
been drawn from a stable process; that is, a process that is in statistical control. In
statistical quality control work we sometimes make this assumption, but if it is incorrect,
it can have serious consequences on the estimates of the process parameters we obtain.
To illustrate, suppose that in the sequence of individual observations
$$x_1, x_2, \ldots, x_t, x_{t+1}, \ldots, x_m$$
the process is in control with mean $\mu_0$ and standard deviation $\sigma$ for the first $t$ observations, but between $x_t$ and $x_{t+1}$ an assignable cause occurs that results in a sustained shift in the process mean to a new level $\mu = \mu_0 + \delta\sigma$, and the mean remains at this new level for the remaining sample observations $x_{t+1}, \ldots, x_m$. Under these conditions, Woodall and Montgomery (2000-01) show that
$$E(S^2) = \sigma^2 + \frac{t(m-t)}{m(m-1)}(\delta\sigma)^2. \qquad (13\text{-}1)$$
In fact, this result holds for any case in which the mean of $t$ of the observations is $\mu_0$ and the mean of the remaining $m - t$ observations is $\mu_0 + \delta\sigma$, since the order of the observations is not relevant in computing $S^2$. Note that $S^2$ is biased upward; that is, $S^2$ tends to overestimate $\sigma^2$. Furthermore, the extent of the bias depends on the magnitude of the shift in the mean ($\delta\sigma$), the time period at which the shift occurs ($t$), and the number of available observations ($m$). For example, if there are $m = 25$ observations and the process mean shifts from $\mu_0$ to $\mu = \mu_0 + \sigma$ (that is, $\delta = 1$) between the 20th and the 21st observation ($t = 20$), then $S^2$ will overestimate $\sigma^2$ by 16.7% on average. If the shift in the mean occurs earlier, say between the 10th and 11th observations, then $S^2$ will overestimate $\sigma^2$ by 25% on average.
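The two percentages quoted above follow directly from Equation (13-1). A small sketch (plain Python; the arguments simply restate the example) is:

# Sketch: percentage by which S^2 overestimates sigma^2 under a sustained mean shift,
# from Equation (13-1).
def pct_bias(m, t, delta):
    return 100.0 * t * (m - t) * delta**2 / (m * (m - 1))

print(pct_bias(m=25, t=20, delta=1))   # shift after the 20th observation -> about 16.7
print(pct_bias(m=25, t=10, delta=1))   # shift after the 10th observation -> 25.0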
The proof of Equation (13-1) is straightforward. Since we can write
$$S^2 = \frac{1}{m-1}\left(\sum_{i=1}^{m} x_i^2 - m\bar{x}^2\right)$$
then
$$E(S^2) = \frac{1}{m-1}\, E\left(\sum_{i=1}^{m} x_i^2 - m\bar{x}^2\right) = \frac{1}{m-1}\left(\sum_{i=1}^{m} E(x_i^2) - mE(\bar{x}^2)\right)$$
Now
$$\frac{1}{m-1}\sum_{i=1}^{m} E(x_i^2) = \frac{1}{m-1}\left(\sum_{i=1}^{t} E(x_i^2) + \sum_{i=t+1}^{m} E(x_i^2)\right) = \frac{1}{m-1}\left[t(\mu_0^2 + \sigma^2) + (m-t)\left((\mu_0 + \delta\sigma)^2 + \sigma^2\right)\right] = \frac{1}{m-1}\left[t\mu_0^2 + (m-t)(\mu_0 + \delta\sigma)^2 + m\sigma^2\right]$$
and
$$\frac{1}{m-1}\, mE(\bar{x}^2) = \frac{m}{m-1}\left[\left(\mu_0 + \frac{m-t}{m}\,\delta\sigma\right)^2 + \frac{\sigma^2}{m}\right]$$
Therefore
$$E(S^2) = \frac{1}{m-1}\left[t\mu_0^2 + (m-t)(\mu_0 + \delta\sigma)^2 + m\sigma^2 - m\left(\mu_0 + \frac{m-t}{m}\,\delta\sigma\right)^2 - \sigma^2\right]$$
$$= \sigma^2 + \frac{1}{m-1}\left[t\mu_0^2 + (m-t)(\mu_0 + \delta\sigma)^2 - m\left(\mu_0 + \frac{m-t}{m}\,\delta\sigma\right)^2\right]$$
$$= \sigma^2 + \frac{1}{m-1}\left[(m-t)(\delta\sigma)^2 - \frac{(m-t)^2}{m}(\delta\sigma)^2\right]$$
$$= \sigma^2 + \frac{1}{m-1}(m-t)(\delta\sigma)^2\left(1 - \frac{m-t}{m}\right)$$
$$= \sigma^2 + \frac{t(m-t)}{m(m-1)}(\delta\sigma)^2$$

14. The Mean Square Successive Difference as an Estimator of $\sigma^2$


An alternative to the moving range estimator of the process standard deviation $\sigma$ is the mean square successive difference as an estimator of $\sigma^2$. The mean square successive difference is defined as
$$MSSD = \frac{1}{2(n-1)}\sum_{i=2}^{n}(x_i - x_{i-1})^2$$
It is easy to show that the MSSD is an unbiased estimator of $\sigma^2$. Let $x_1, x_2, \ldots, x_n$ be a random sample of size $n$ from a population with mean $\mu$ and variance $\sigma^2$. Without any loss of generality, we may take the mean to be zero. Then
$$E(MSSD) = E\left[\frac{1}{2(n-1)}\sum_{i=2}^{n}(x_i - x_{i-1})^2\right] = \frac{1}{2(n-1)}\, E\left[\sum_{i=2}^{n}\left(x_i^2 + x_{i-1}^2 - 2x_i x_{i-1}\right)\right] = \frac{1}{2(n-1)}\left[(n-1)\sigma^2 + (n-1)\sigma^2\right] = \frac{2(n-1)}{2(n-1)}\,\sigma^2 = \sigma^2$$
Therefore, the mean square successive difference is an unbiased estimator of the
population variance.
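A short sketch of the computation (numpy assumed; the simulated in-control data use arbitrary values mu = 50 and sigma = 2) compares the MSSD with the usual sample variance:

# Sketch: the mean square successive difference as an estimator of sigma^2,
# compared with the sample variance on simulated in-control data.
import numpy as np

def mssd(x):
    d = np.diff(x)                        # successive differences x_i - x_{i-1}
    return np.sum(d**2) / (2 * (len(x) - 1))

rng = np.random.default_rng(11)
x = rng.normal(loc=50.0, scale=2.0, size=100)   # sigma^2 = 4, illustrative values

print("MSSD            :", mssd(x))
print("sample variance :", x.var(ddof=1))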

15. More About Checking Assumptions in the t-Test


The two-sample t-test can be presented from the viewpoint of a simple linear regression
model. This is a very instructive way to think about the t-test, as it fits in nicely with the
general notion of a factorial experiment with factors at two levels. This type of
experiment is very important in process development and improvement, and is discussed
extensively in Chapter 12. This also leads to another way to check assumptions in the t-
test. This method is equivalent to the normal probability plotting of the original data
discussed in Chapter 3.
In the t-test scenario, we have a factor x with two levels, which we can arbitrarily call
“low” and “high”. We will use x = -1 to denote the low level of this factor (Formulation
1 of the gasoline) and x = +1 to denote the high level of this factor (Formulation 2 of the
gasoline). The figure below is a scatter plot (from Minitab) of the road octane number
data from Chapter 3.

[Scatter plot (Minitab): Octane No. on the vertical axis (89 to 92) versus Formulation on the horizontal axis (-1, 0, 1).]

We will fit a simple linear regression model to these data, say
$$y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij}$$
where $\beta_0$ and $\beta_1$ are the intercept and slope, respectively, of the regression line, and the regressor or predictor variable is $x_{1j} = -1$ and $x_{2j} = +1$. The method of least squares can be used to estimate the slope and intercept in this model. Assuming that we have equal sample sizes $n$ for each factor level, the least squares normal equations are:
$$2n\hat{\beta}_0 = \sum_{i=1}^{2}\sum_{j=1}^{n} y_{ij}$$
$$2n\hat{\beta}_1 = \sum_{j=1}^{n} y_{2j} - \sum_{j=1}^{n} y_{1j}$$
The solution to these equations is
$$\hat{\beta}_0 = \bar{y} \quad \text{and} \quad \hat{\beta}_1 = \frac{1}{2}(\bar{y}_2 - \bar{y}_1)$$
Note that the least squares estimator of the intercept is the average of all the observations from both samples, while the estimator of the slope is one-half of the difference between the sample averages at the "high" and "low" levels of the factor $x$. Below is the output from the linear regression procedure in Minitab for the octane number data.


Regression Analysis: Octane No. versus Formulation

The regression equation is


Octane No. = 90.8 + 0.050 Formulation

Predictor Coef SE Coef T P


Constant 90.7500 0.2455 369.63 0.000
Formulat 0.0500 0.2455 0.20 0.841

S = 1.098 R-Sq = 0.2% R-Sq(adj) = 0.0%

Analysis of Variance

Source DF SS MS F P
Regression 1 0.050 0.050 0.04 0.841
Residual Error 18 21.700 1.206
Total 19 21.750

Notice that the estimate of the slope (given in the column labeled "Coef" and the row labeled "Formulat" above) is $\hat{\beta}_1 = 0.05 = \frac{1}{2}(\bar{y}_2 - \bar{y}_1) = \frac{1}{2}(90.8 - 90.7)$ and the estimate of the intercept is $\hat{\beta}_0 = 90.75 = \frac{1}{2}(\bar{y}_2 + \bar{y}_1) = \frac{1}{2}(90.7 + 90.8)$. Furthermore, notice that the t-statistic associated with the slope is equal to 0.20, exactly the same value (apart from sign, because we subtracted the averages in the reverse order) we gave in the text. Now in simple linear regression, the t-test on the slope is actually testing the hypotheses
$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \ne 0$$
and this is equivalent to testing $H_0: \mu_1 = \mu_2$.
It is easy to show that the t-test statistic used for testing that the slope equals zero in simple linear regression is identical to the usual two-sample t-test. Recall that to test the above hypotheses in simple linear regression, the t-statistic is
$$t_0 = \frac{\hat{\beta}_1}{\sqrt{\dfrac{\hat{\sigma}^2}{S_{xx}}}}$$
where $S_{xx} = \sum_{i=1}^{2}\sum_{j=1}^{n}(x_{ij} - \bar{x})^2$ is the "corrected" sum of squares of the $x$'s. Now in our specific problem, $\bar{x} = 0$, $x_{1j} = -1$ and $x_{2j} = +1$, so $S_{xx} = 2n$. Therefore, since we have already observed that the estimate of $\sigma$ is just $S_p$,

$$t_0 = \frac{\hat{\beta}_1}{\sqrt{\dfrac{\hat{\sigma}^2}{S_{xx}}}} = \frac{\frac{1}{2}(\bar{y}_2 - \bar{y}_1)}{\sqrt{\dfrac{S_p^2}{2n}}} = \frac{\bar{y}_2 - \bar{y}_1}{S_p\sqrt{\dfrac{2}{n}}}$$

This is the usual two-sample t-test statistic for the case of equal sample sizes.
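The equivalence is easy to verify numerically. The sketch below (numpy and scipy assumed available) uses the octane numbers read from the residual listing that follows, so treat the data values as copied from that output; it reproduces the slope of 0.05 and the t-statistic of about 0.20 from both the regression and the two-sample t-test:

# Sketch: the two-sample t-test and the regression slope test give the same statistic.
import numpy as np
from scipy import stats

y1 = np.array([89.5, 90.0, 91.0, 91.5, 92.5, 91.0, 89.0, 89.5, 91.0, 92.0])  # x = -1
y2 = np.array([89.5, 91.5, 91.0, 89.0, 91.5, 92.0, 92.0, 90.5, 90.0, 91.0])  # x = +1

# ordinary two-sample t-test (equal variances)
t_two_sample = stats.ttest_ind(y2, y1, equal_var=True)

# simple linear regression of octane number on the coded factor x = -1, +1
x = np.concatenate([-np.ones(10), np.ones(10)])
y = np.concatenate([y1, y2])
reg = stats.linregress(x, y)

print("slope estimate:", reg.slope)                 # 0.5*(ybar2 - ybar1) = 0.05
print("regression t  :", reg.slope / reg.stderr)    # about 0.20
print("two-sample t  :", t_two_sample.statistic)    # same value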
Most regression software packages will also compute a table or listing of the residuals
from the model. The residuals from the Minitab regression model fit obtained above are
as follows:
Obs Formulat Octane N Fit SE Fit Residual St Resid
1 -1.00 89.500 90.700 0.347 -1.200 -1.15
2 -1.00 90.000 90.700 0.347 -0.700 -0.67
3 -1.00 91.000 90.700 0.347 0.300 0.29
4 -1.00 91.500 90.700 0.347 0.800 0.77
5 -1.00 92.500 90.700 0.347 1.800 1.73
6 -1.00 91.000 90.700 0.347 0.300 0.29
7 -1.00 89.000 90.700 0.347 -1.700 -1.63
8 -1.00 89.500 90.700 0.347 -1.200 -1.15
9 -1.00 91.000 90.700 0.347 0.300 0.29
10 -1.00 92.000 90.700 0.347 1.300 1.25
11 1.00 89.500 90.800 0.347 -1.300 -1.25
12 1.00 91.500 90.800 0.347 0.700 0.67
13 1.00 91.000 90.800 0.347 0.200 0.19
14 1.00 89.000 90.800 0.347 -1.800 -1.73
15 1.00 91.500 90.800 0.347 0.700 0.67
16 1.00 92.000 90.800 0.347 1.200 1.15
17 1.00 92.000 90.800 0.347 1.200 1.15
18 1.00 90.500 90.800 0.347 -0.300 -0.29
19 1.00 90.000 90.800 0.347 -0.800 -0.77
20 1.00 91.000 90.800 0.347 0.200 0.19

The column labeled “Fit” contains the predicted values of octane number from the
regression model, which just turn out to be the averages of the two samples. The
residuals are in the sixth column of this table. They are just the differences between the
observed values of the octane number and the corresponding predicted values. A normal
probability plot of the residuals follows.

[Normal probability plot of the residuals (RESI1), with ML estimates mean = 0.0000000 and StDev = 1.04163; goodness of fit AD* = 0.942. Vertical axis: Percent; horizontal axis: Data.]

Notice that the residuals plot approximately along a straight line, indicating that there is
no problem with the normality assumption in these data. This is equivalent to plotting the
original octane number data on separate probability plots as we did in Chapter 3.

16. Expected Mean Squares in the Single-Factor Analysis of Variance


In the book we give the expected values of the mean squares for treatments and error in
the single-factor analysis of variance (ANOVA). These quantities may be derived by
straightforward application of the expectation operator.
Consider first the mean square for treatments:
$$E(MS_{\text{Treatments}}) = E\left(\frac{SS_{\text{Treatments}}}{a-1}\right)$$
Now for a balanced design (an equal number of observations in each treatment)
$$SS_{\text{Treatments}} = \frac{1}{n}\sum_{i=1}^{a} y_{i.}^2 - \frac{1}{an}\, y_{..}^2$$
and the single-factor ANOVA model is
$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, n$$
In addition, we will find the following useful:
$$E(\varepsilon_{ij}) = E(\varepsilon_{i.}) = E(\varepsilon_{..}) = 0, \quad E(\varepsilon_{ij}^2) = \sigma^2, \quad E(\varepsilon_{i.}^2) = n\sigma^2, \quad E(\varepsilon_{..}^2) = an\sigma^2$$

Now
$$E(SS_{\text{Treatments}}) = E\left(\frac{1}{n}\sum_{i=1}^{a} y_{i.}^2\right) - \frac{1}{an}\, E(y_{..}^2)$$
Consider the first term on the right-hand side of the above expression:
$$E\left(\frac{1}{n}\sum_{i=1}^{a} y_{i.}^2\right) = \frac{1}{n}\sum_{i=1}^{a} E(n\mu + n\tau_i + \varepsilon_{i.})^2$$
Squaring the expression in parentheses and taking expectation results in
$$E\left(\frac{1}{n}\sum_{i=1}^{a} y_{i.}^2\right) = \frac{1}{n}\left[a(n\mu)^2 + n^2\sum_{i=1}^{a}\tau_i^2 + an\sigma^2\right] = an\mu^2 + n\sum_{i=1}^{a}\tau_i^2 + a\sigma^2$$
because the three cross-product terms are all zero. Now consider the second term on the right-hand side of $E(SS_{\text{Treatments}})$:
$$E\left(\frac{1}{an}\, y_{..}^2\right) = \frac{1}{an}\, E\left(an\mu + n\sum_{i=1}^{a}\tau_i + \varepsilon_{..}\right)^2 = \frac{1}{an}\, E(an\mu + \varepsilon_{..})^2$$
since $\sum_{i=1}^{a}\tau_i = 0$. Upon squaring the term in parentheses and taking expectation, we obtain
$$E\left(\frac{1}{an}\, y_{..}^2\right) = \frac{1}{an}\left[(an\mu)^2 + an\sigma^2\right] = an\mu^2 + \sigma^2$$
since the expected value of the cross-product is zero. Therefore,
$$E(SS_{\text{Treatments}}) = an\mu^2 + n\sum_{i=1}^{a}\tau_i^2 + a\sigma^2 - (an\mu^2 + \sigma^2) = \sigma^2(a-1) + n\sum_{i=1}^{a}\tau_i^2$$
Consequently the expected value of the mean square for treatments is

$$E(MS_{\text{Treatments}}) = E\left(\frac{SS_{\text{Treatments}}}{a-1}\right) = \frac{\sigma^2(a-1) + n\sum_{i=1}^{a}\tau_i^2}{a-1} = \sigma^2 + \frac{n\sum_{i=1}^{a}\tau_i^2}{a-1}$$
This is the result given in the textbook.
For the error mean square, we obtain
$$E(MS_E) = E\left(\frac{SS_E}{N-a}\right) = \frac{1}{N-a}\, E\left[\sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{i.})^2\right] = \frac{1}{N-a}\, E\left[\sum_{i=1}^{a}\sum_{j=1}^{n} y_{ij}^2 - \frac{1}{n}\sum_{i=1}^{a} y_{i.}^2\right]$$
Substituting the model into this last expression, we obtain
$$E(MS_E) = \frac{1}{N-a}\, E\left[\sum_{i=1}^{a}\sum_{j=1}^{n}(\mu + \tau_i + \varepsilon_{ij})^2 - \frac{1}{n}\sum_{i=1}^{a}\left(\sum_{j=1}^{n}(\mu + \tau_i + \varepsilon_{ij})\right)^2\right]$$
After squaring and taking expectation, this last equation becomes
$$E(MS_E) = \frac{1}{N-a}\left(N\mu^2 + n\sum_{i=1}^{a}\tau_i^2 + N\sigma^2 - N\mu^2 - n\sum_{i=1}^{a}\tau_i^2 - a\sigma^2\right) = \sigma^2$$

17. Fixed Versus Random Factors in the Analysis of Variance


In chapter 3, we present the standard analysis of variance (ANOVA) for a single-factor
experiment, assuming that the factor is a fixed factor. By a fixed factor, we mean that all
levels of the factor of interest were studied in the experiment. Sometimes the levels of a
factor are selected at random from a large (theoretically infinite) population of factor
levels. This leads to a random effects ANOVA model.
In the single factor case, there are only modest differences between the fixed and random
models. The model for a random effects experiment is still written as
$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}$$
but now the treatment effects $\tau_i$ are random variables, because the treatment levels actually used in the experiment have been chosen at random. The population of treatment effects is assumed to be normally and independently distributed with mean zero and variance $\sigma_\tau^2$. Note that the variance of an observation is
$$V(y_{ij}) = V(\mu + \tau_i + \varepsilon_{ij}) = \sigma_\tau^2 + \sigma^2$$

We often call $\sigma_\tau^2$ and $\sigma^2$ variance components, and the random model is sometimes called the components of variance model. All of the computations in the random model are the same as in the fixed effects model, but since we are studying an entire population of treatments, it doesn't make much sense to formulate hypotheses about the individual factor levels selected in the experiment. Instead, we test the following hypotheses about the variance of the treatment effects:
$$H_0: \sigma_\tau^2 = 0$$
$$H_1: \sigma_\tau^2 > 0$$
The test statistic for these hypotheses is the usual F-ratio, $F = MS_{\text{Treatments}}/MS_E$. If the null hypothesis is not rejected, there is no evidence of variability in the population of treatments, while if the null hypothesis is rejected, there is significant variability among the treatments in the entire population that was sampled. Notice that the conclusions of the ANOVA extend to the entire population of treatments.
The expected mean squares in the random model are different from their fixed effects model counterparts. It can be shown that
$$E(MS_{\text{Treatments}}) = \sigma^2 + n\sigma_\tau^2$$
$$E(MS_E) = \sigma^2$$
Frequently, the objective of an experiment involving random factors is to estimate the variance components. A logical way to do this is to equate the expected values of the mean squares to their observed values and solve the resulting equations. This leads to
$$\hat{\sigma}_\tau^2 = \frac{MS_{\text{Treatments}} - MS_E}{n} \quad \text{and} \quad \hat{\sigma}^2 = MS_E$$
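A short sketch of these variance-component calculations (numpy assumed; the balanced data set is simulated with sigma_tau = 2 and sigma = 1 purely for illustration) is:

# Sketch: ANOVA (moment) estimates of the variance components in a one-way
# random effects model, computed from MS_Treatments and MS_E.
import numpy as np

rng = np.random.default_rng(5)
a, n = 30, 6                                  # a randomly chosen treatments, n obs each
tau = rng.normal(0.0, 2.0, size=a)            # random treatment effects
y = 10.0 + tau[:, None] + rng.normal(0.0, 1.0, size=(a, n))

ybar_i = y.mean(axis=1)
ms_treat = n * np.sum((ybar_i - y.mean())**2) / (a - 1)
ms_error = np.sum((y - ybar_i[:, None])**2) / (a * (n - 1))

sigma2_tau_hat = (ms_treat - ms_error) / n
sigma2_hat = ms_error
print("sigma_tau^2 estimate:", sigma2_tau_hat, " (true value 4.0)")
print("sigma^2 estimate    :", sigma2_hat, " (true value 1.0)")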
A typical application of experiments where some of the factors are random is in
measurement systems capability study, as discussed in Chapter 7. The model used there
is a factorial model, so the analysis and the expected mean squares are somewhat more
complicated than in the single factor model considered here.

18. Probability Limits on Control Charts


In Chapter 5 of the textbook, probability limits for variables control charts are discussed.
The usual three-sigma limits are almost always used with variables control charts,


although as we point out, there can be some occasional advantage to the use of
probability limits, such as on the range chart to obtain a non-zero lower control limit.
The standard applications of attributes control charts almost always use the three-sigma
limits as well, although their use is potentially somewhat more troublesome here. When
three-sigma limits are used on attributes charts, we are basically assuming that the normal
approximation to either the binomial or Poisson distribution is appropriate, at least to the
extent that the distribution of the attribute chart statistic is approximately symmetric, and
that the symmetric three-sigma control limits are satisfactory.
This will, of course, not always be the case. If the binomial probability p is small and the
sample size n is not large, or if the Poisson mean is small, then symmetric three-sigma
control limits on the p or c chart may not be appropriate, and probability limits may be
much better.
For example, consider a p chart with p = 0.07 and n = 100. The center line is at 0.07, and the usual three-sigma control limits are LCL = -0.007 (which is negative, so the LCL is set to 0) and UCL = 0.147. A short table of cumulative binomial probabilities computed from Minitab follows.

Cumulative Distribution Function

Binomial with n = 100 and p = 0.0700000

x P( X <= x )
0.00 0.0007
1.00 0.0060
2.00 0.0258
3.00 0.0744
4.00 0.1632
5.00 0.2914
6.00 0.4443
7.00 0.5988
8.00 0.7340
9.00 0.8380
10.00 0.9092
11.00 0.9531
12.00 0.9776
13.00 0.9901
14.00 0.9959
15.00 0.9984
16.00 0.9994
17.00 0.9998
18.00 0.9999
19.00 1.0000
20.00 1.0000

If the lower control limit is at zero and the upper control limit is at 0.147, then any
sample with 15 or more defective items will plot beyond the upper control limit. The
above table shows that the probability of obtaining 15 or more defectives when the
process is in-control is 1 – 0.9959 = 0.0041. This is about 50% greater than the false
alarm rate on the normal-theory three-sigma limit control chart (0.0027). However, if we
were to set the lower control limit at 0.01 and the upper control limit at 0.15, and
conclude that the process is out-of-control only if a control limit is exceeded, then the false alarm rate is 0.0007 + 0.0016 = 0.0023, which is very close to the advertised value of 0.0027. Furthermore, there is a nonzero LCL, which can be very useful in practice.
Notice that the control limits are not symmetric around the center line. However, the
distribution of p̂ is not symmetric, so this should not be too surprising.
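The tail probabilities used above come directly from the binomial distribution. A brief sketch (scipy assumed available) reproduces them:

# Sketch: false alarm probabilities for the p chart example, from binomial(100, 0.07).
from scipy.stats import binom

n, p = 100, 0.07

# symmetric 3-sigma limits: LCL = 0, UCL = 0.147 -> signal when 15 or more defectives
print("P(15 or more):", 1 - binom.cdf(14, n, p))          # about 0.0041

# probability limits: LCL = 0.01, UCL = 0.15 -> signal for 0 defectives or 16 or more
print("P(0) + P(16 or more):", binom.cdf(0, n, p) + (1 - binom.cdf(15, n, p)))  # about 0.0023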
There are several other interesting approaches to setting probability-type limits on
attribute control charts. Refer to Ryan (2000), Acosta-Mejia (1999), Ryan and
Schwertman (1997), Schwertman and Ryan (1997), and Shore (2000).

19. Should We Use $d_2$ or $d_2^*$ in Estimating $\sigma$ via the Range Method?


In the textbook, we make use of the range method for estimation of the process standard deviation $\sigma$, particularly in constructing variables control charts (for example, see the $\bar{x}$ and $R$ charts of Chapter 5). We use the estimator $\bar{R}/d_2$. Sometimes an alternative estimator, $\bar{R}/d_2^*$, is encountered. In this section we discuss the nature and potential uses of these two estimators. Much of this discussion is adapted from Woodall and Montgomery (2000-01). The original work on using ranges to estimate the standard deviation of a normal distribution is due to Tippett (1925). See also the paper by Duncan (1955).
Suppose one has $m$ independent samples, each of size $n$, from one or more populations assumed to be normally distributed with standard deviation $\sigma$. We denote the sample ranges of the $m$ samples or subgroups as $R_1, R_2, \ldots, R_m$. Note that this type of data arises frequently in statistical process control applications and in gauge repeatability and reproducibility (R & R) studies (refer to Chapter 7). It is well known that $E(R_i) = d_2\sigma$ and $Var(R_i) = d_3^2\sigma^2$, for $i = 1, 2, \ldots, m$, where $d_2$ and $d_3$ are constants that depend on the sample size $n$. Values of these constants are tabled in virtually all textbooks and training materials on statistical process control. See, for example, Appendix Table VI for values of $d_2$ and $d_3$ for $n = 2$ to 25.
There are two estimators of the process standard deviation $\sigma$ based on the average sample range,
$$\bar{R} = \frac{\sum_{i=1}^{m} R_i}{m}, \qquad (19\text{-}1)$$
that are commonly encountered in practice. The estimator
$$\hat{\sigma}_1 = \bar{R}/d_2 \qquad (19\text{-}2)$$
is widely used after the application of control charts to estimate process variability and to assess process capability. In Chapter 3 we report the relative efficiency of the range estimator given in Equation (19-2) to the sample standard deviation for various sample sizes. For example, if $n = 5$, the relative efficiency of the range estimator compared to the sample standard deviation is 0.955. Consequently, there is little practical difference between the two estimators. Equation (19-2) is also frequently used to determine the usual 3-sigma limits on the Shewhart $\bar{x}$ chart in statistical process control. The estimator

$$\hat{\sigma}_2 = \bar{R}/d_2^* \qquad (19\text{-}3)$$
is more often used in gauge R & R studies and in variables acceptance sampling. Here $d_2^*$ represents a constant whose value depends on both $m$ and $n$. See Chrysler, Ford, GM (1995), Military Standard 414 (1957), and Duncan (1986).
Patnaik (1950) showed that $\bar{R}/\sigma$ is distributed approximately as a multiple of a $\chi$-distribution. In particular, $\bar{R}/\sigma$ is distributed approximately as $d_2^*\chi/\sqrt{\nu}$, where $\nu$ represents the fractional degrees of freedom of the $\chi$ distribution. Patnaik (1950) used a series approximation of the form
$$d_2^* \approx d_2\left(1 + \frac{1}{4\nu} + \frac{1}{32\nu^2} - \frac{5}{128\nu^3}\right). \qquad (19\text{-}4)$$

It has been pointed out by Duncan (1986), Wheeler (1995), and Luko (1996), among others, that $\hat{\sigma}_1$ is an unbiased estimator of $\sigma$ and that $\hat{\sigma}_2^2$ is an unbiased estimator of $\sigma^2$. For $\hat{\sigma}_2^2$ to be an unbiased estimator of $\sigma^2$, however, David (1951) showed that no approximation for $d_2^*$ is required. He showed that
$$d_2^* = \left(d_2^2 + V_n/m\right)^{1/2}, \qquad (19\text{-}5)$$
where $V_n$ is the variance of the sample range for a sample of size $n$ from a normal population with unit variance. It is important to note that $V_n = d_3^2$, so Equation (19-5) can easily be used to determine values of $d_2^*$ from the widely available tables of $d_2$ and $d_3$. Thus, a table of $d_2^*$ values, such as the ones given by Duncan (1986), Wheeler (1995), and many others, is not required so long as values of $d_2$ and $d_3$ are tabled, as they usually are (once again, see Appendix Table VI). Also, use of the approximation
$$d_2^* \approx d_2\left(1 + \frac{1}{4\nu}\right)$$
given by Duncan (1986) and Wheeler (1995) becomes unnecessary.
The table of $d_2^*$ values given by Duncan (1986) is the most frequently recommended. If a table is required, the ones by Nelson (1975) and Luko (1996) provide values of $d_2^*$ that are slightly more accurate, since their values are based on Equation (19-5).
It has been noted that as m increases, d 2* approaches d2. This has frequently been argued
using Equation (19-4) and noting that Q increases as m increases. The fact that
d 2* approaches d2 as m increases is more easily seen, however, from Equation (19-5) as
pointed out by Luko (1996).
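As a small numerical illustration of Equation (19-5), the Python sketch below computes $d_2^*$ from tabled values of $d_2$ and $d_3$ and then forms the two range-based estimators. The constants are the standard tabled values for n = 5, and the subgroup ranges are hypothetical numbers chosen only for the example.

```python
import numpy as np

# Tabled control chart constants for subgroups of size n = 5
d2, d3 = 2.326, 0.864

# Hypothetical sample ranges from m = 5 subgroups of size n = 5
ranges = np.array([4.2, 3.8, 5.1, 4.4, 3.9])
m = len(ranges)
Rbar = ranges.mean()

# Equation (19-5): d2* computed from d2 and d3 (since Vn = d3^2)
d2_star = np.sqrt(d2**2 + d3**2 / m)

sigma1 = Rbar / d2        # unbiased for sigma, Equation (19-2)
sigma2 = Rbar / d2_star   # sigma2^2 is unbiased for sigma^2, Equation (19-3)

print(f"d2* = {d2_star:.4f}")
print(f"sigma_hat_1 = {sigma1:.4f}, sigma_hat_2 = {sigma2:.4f}")
```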
Sometimes use of Equation (19-3) is recommended without any explanation. See, for
example, the AIAG measurement systems capability guidelines [Chrysler, Ford, and GM

(1995)]. The choice between $\hat{\sigma}_1$ and $\hat{\sigma}_2$ has often not been explained clearly in the literature. It is frequently stated that the use of Equation (19-2) requires that $\bar{R}$ be obtained from a fairly large number of individual ranges. See, for example, Bissell (1994, p. 289). Grant and Leavenworth (1996, p. 128) state that "Strictly speaking, the validity of the exact value of the $d_2$ factor assumes that the ranges have been averaged for a fair number of subgroups, say, 20 or more. When only a few subgroups are available, a better estimate of $\sigma$ is obtained using a factor that writers on statistics have designated as $d_2^*$." Nelson (1975) writes, "If fewer than a large number of subgroups are used, Equation (19-2) gives an estimate of $\sigma$ which does not have the same expected value as the standard deviation estimator." In fact, Equation (19-2) produces an unbiased estimator of $\sigma$ regardless of the number of samples m, whereas the pooled standard deviation does not (refer to Section 12 of the Supplemental Text Material). The choice between $\hat{\sigma}_1$ and $\hat{\sigma}_2$ depends upon whether one is interested in obtaining an unbiased estimator of $\sigma$ or of $\sigma^2$. As m increases, the two estimators (19-2) and (19-3) become equivalent, since each is a consistent estimator of $\sigma$.
It is interesting to note that among all estimators of the form $c\bar{R}$ $(c > 0)$, the one minimizing the mean squared error in estimating $\sigma$ has
$$c = d_2/(d_2^*)^2.$$
The derivation of this result is in the proofs at the end of this section. If we let
$$\hat{\sigma}_3 = \frac{d_2}{(d_2^*)^2}\,\bar{R}$$
then it is shown in the proofs below that
$$MSE(\hat{\sigma}_3) = \sigma^2\left(1 - \frac{d_2^2}{(d_2^*)^2}\right).$$
Luko (1996) compared the mean squared error of $\hat{\sigma}_2$ in estimating $\sigma$ to that of $\hat{\sigma}_1$ and recommended $\hat{\sigma}_2$ on the basis of uniformly lower MSE values. By definition, $\hat{\sigma}_3$ leads to further reduction in MSE. It is shown in the proofs at the end of this section that the percentage reduction in MSE from using $\hat{\sigma}_3$ instead of $\hat{\sigma}_2$ is
$$50\left(\frac{d_2^* - d_2}{d_2^*}\right).$$

Values of the percentage reduction are given in Table 19-1. Notice that when both the number of subgroups and the subgroup size are small, a moderate reduction in mean squared error can be obtained by using $\hat{\sigma}_3$.

Table 19-1. Percentage Reduction in Mean Squared Error from Using $\hat{\sigma}_3$ Instead of $\hat{\sigma}_2$

Subgroup                            Number of Subgroups, m
Size, n        1        2        3        4        5        7       10       15       20
   2      10.1191   5.9077   4.1769   3.2314   2.6352   1.9251   1.3711   0.9267   0.6998
   3       5.7269   3.1238   2.1485   1.6374   1.3228   0.9556   0.6747   0.4528   0.3408
   4       4.0231   2.1379   1.4560   1.1040   0.8890   0.6399   0.4505   0.3017   0.2268
   5       3.1291   1.6403   1.1116   0.8407   0.6759   0.4856   0.3414   0.2284   0.1716
   6       2.5846   1.3437   0.9079   0.6856   0.5507   0.3952   0.2776   0.1856   0.1394
   7       2.2160   1.1457   0.7726   0.5828   0.4679   0.3355   0.2356   0.1574   0.1182
   8       1.9532   1.0058   0.6773   0.5106   0.4097   0.2937   0.2061   0.1377   0.1034
   9       1.7536   0.9003   0.6056   0.4563   0.3660   0.2623   0.1840   0.1229   0.0923
  10       1.5963   0.8176   0.5495   0.4138   0.3319   0.2377   0.1668   0.1114   0.0836
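As a quick check on Table 19-1, the short sketch below reproduces the m = 1, n = 2 entry from the tabled constants $d_2 = 1.128$ and $d_3 = 0.853$; it is only a verification of the percentage-reduction formula, not part of the derivation.

```python
import math

d2, d3 = 1.128, 0.853                      # tabled constants for n = 2
m = 1
d2_star = math.sqrt(d2**2 + d3**2 / m)     # Equation (19-5)

pct_reduction = 50 * (d2_star - d2) / d2_star
print(f"{pct_reduction:.4f}")              # approximately 10.12, matching Table 19-1
```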

Proofs

Result 1: Let $\hat{\sigma} = c\bar{R}$. Then $MSE(\hat{\sigma}) = \sigma^2[c^2(d_2^*)^2 - 2cd_2 + 1]$.

Proof:
$$MSE(\hat{\sigma}) = E[(c\bar{R} - \sigma)^2] = E[c^2\bar{R}^2 - 2c\sigma\bar{R} + \sigma^2] = c^2E(\bar{R}^2) - 2c\sigma E(\bar{R}) + \sigma^2$$
Now $E(\bar{R}^2) = Var(\bar{R}) + [E(\bar{R})]^2 = d_3^2\sigma^2/m + (d_2\sigma)^2$. Thus
$$MSE(\hat{\sigma}) = c^2d_3^2\sigma^2/m + c^2d_2^2\sigma^2 - 2c\sigma(d_2\sigma) + \sigma^2 = \sigma^2[c^2(d_3^2/m + d_2^2) - 2cd_2 + 1] = \sigma^2[c^2(d_2^*)^2 - 2cd_2 + 1]$$

Result 2: The value of c that minimizes the mean squared error of estimators of the form $c\bar{R}$ in estimating $\sigma$ is $\dfrac{d_2}{(d_2^*)^2}$.

Proof:
$$MSE(\hat{\sigma}) = \sigma^2[c^2(d_2^*)^2 - 2cd_2 + 1]$$
$$\frac{d\,MSE(\hat{\sigma})}{dc} = \sigma^2[2c(d_2^*)^2 - 2d_2] = 0 \;\Rightarrow\; c = \frac{d_2}{(d_2^*)^2}.$$

Result 3: The mean squared error of $\hat{\sigma}_3 = \dfrac{d_2}{(d_2^*)^2}\bar{R}$ is $\sigma^2\left(1 - \dfrac{d_2^2}{(d_2^*)^2}\right)$.

Proof:
$$MSE(\hat{\sigma}_3) = \sigma^2\left[\frac{d_2^2}{(d_2^*)^4}(d_2^*)^2 - 2\frac{d_2}{(d_2^*)^2}d_2 + 1\right] \quad \text{(from Result 1)}$$
$$= \sigma^2\left[\frac{d_2^2}{(d_2^*)^2} - \frac{2d_2^2}{(d_2^*)^2} + 1\right] = \sigma^2\left(1 - \frac{d_2^2}{(d_2^*)^2}\right)$$

Note that $MSE(\hat{\sigma}_3) \to 0$ as $n \to \infty$ and $MSE(\hat{\sigma}_3) \to 0$ as $m \to \infty$.

Result 4: Let $\hat{\sigma}_2 = \dfrac{\bar{R}}{d_2^*}$ and $\hat{\sigma}_3 = \dfrac{d_2}{(d_2^*)^2}\bar{R}$. Then
$$\left[\frac{MSE(\hat{\sigma}_2) - MSE(\hat{\sigma}_3)}{MSE(\hat{\sigma}_2)}\right] \times 100,$$
the percent reduction in mean squared error from using the minimum mean squared error estimator $\hat{\sigma}_3$ instead of $\bar{R}/d_2^*$ [as recommended by Luko (1996)], is
$$50\left(\frac{d_2^* - d_2}{d_2^*}\right)$$

Proof:

Luko (1996) shows that $MSE(\hat{\sigma}_2) = \dfrac{2\sigma^2(d_2^* - d_2)}{d_2^*}$, therefore
$$MSE(\hat{\sigma}_2) - MSE(\hat{\sigma}_3) = \frac{2\sigma^2(d_2^* - d_2)}{d_2^*} - \sigma^2\left(1 - \frac{d_2^2}{(d_2^*)^2}\right)$$
$$= \sigma^2\left[\frac{2(d_2^* - d_2)}{d_2^*} - \frac{(d_2^*)^2 - d_2^2}{(d_2^*)^2}\right]$$
$$= \sigma^2\left[\frac{2(d_2^* - d_2)}{d_2^*} - \frac{(d_2^* - d_2)(d_2^* + d_2)}{(d_2^*)^2}\right]$$
$$= \sigma^2\,\frac{(d_2^* - d_2)}{d_2^*}\left(2 - \frac{d_2^* + d_2}{d_2^*}\right)$$
$$= \sigma^2\,\frac{(d_2^* - d_2)}{d_2^*}\left(\frac{d_2^* - d_2}{d_2^*}\right)$$
$$= \sigma^2\,\frac{(d_2^* - d_2)^2}{(d_2^*)^2}$$
Consequently
$$\left[\frac{MSE(\hat{\sigma}_2) - MSE(\hat{\sigma}_3)}{MSE(\hat{\sigma}_2)}\right] \times 100 = \frac{\sigma^2(d_2^* - d_2)^2/(d_2^*)^2}{2\sigma^2(d_2^* - d_2)/d_2^*} \times 100 = 50\left(\frac{d_2^* - d_2}{d_2^*}\right).$$

20. Control Charting Past Values Versus Future Observations (Phase 1 & Phase 2
of Control Chart Usage)
There are two distinct phases of control chart usage. In Phase 1 (some authors say Stage
1) we plot a group of points all at once in a retrospective analysis, constructing trial
control limits to determine if the process has been in control over the period of time
where the data was collected, and to see if reliable control limits can be established to
monitor future production. In Phase 2 we use the control limits to monitor the process
by comparing the sample statistic for each sample as it is drawn from the process to the
control limits.
Thus in Phase 1 we are comparing a collection of, say, m points to a set of control limits. Suppose for the moment that the process is normally distributed with known mean and variance, and that the usual three-sigma limits are used on the control chart for $\bar{x}$, so that the probability of a single point plotting outside the control limits when the process is in control is 0.0027. The question we address is this: if the averages of a set of m samples or subgroups from an in-control process are plotted on this chart, what is the probability that at least one of the averages will plot outside the control limits?
This is just the probability of obtaining at least one success out of m Bernoulli trials with
constant probability of success p = 0.0027. A brief tabulation of the calculations is
shown below:
Number of subgroups, m                         1        5       10       20       25       50
Probability of at least one point
  beyond the control limits               0.0027   0.0134   0.0267   0.0526   0.0656   0.1264
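These entries are easy to reproduce; the minimal sketch below simply evaluates $1 - (1 - 0.0027)^m$ for the values of m in the table.

```python
p = 0.0027   # probability a single in-control point plots outside three-sigma limits
for m in (1, 5, 10, 20, 25, 50):
    prob_at_least_one = 1 - (1 - p) ** m
    print(f"m = {m:2d}: P(at least one point beyond limits) = {prob_at_least_one:.4f}")
```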

Notice that as the number of subgroups increases, the probability of finding at least one point outside the limits increases. Furthermore, for the typical number of samples or subgroups often used to construct Shewhart control charts (20 or 25), the chance of finding at least one sample out of control is about an order of magnitude larger than the probability of finding a single point out of control when the points are plotted one at a time in monitoring future production (0.0027).
Two obvious questions are (1) what is going on here, and (2) is this anything to worry
about?
Question 1 has an obvious answer. These are simply binomial probability calculations
and they give you some insight about what is likely to happen when you first analyze a
collection of samples retrospectively to establish control limits. They also point out what
is likely to occur when a batch of new samples is employed to revise the control limits on
a chart. It could also happen in monitoring future production if several points are added
to the chart at once, because the chart is only updated once per shift or once per day.
The answer to question 2 is also simple. In effect, when establishing trial control limits you should probably expect some points to fall outside the limits. This may happen because the process was not in control when the preliminary samples were taken, but even if the process is stable, the chance of seeing at least one out-of-control point is not 0.0027; it depends on how many preliminary samples are used in the calculations. It is possible to adjust the control limits (make them wider) to compensate for this, but there isn't a practical necessity to do so in most cases. This is also one of the reasons that, when we can't find an assignable cause for all the out-of-control points in the preliminary data used to establish trial control limits, we usually just delete those points without too much fanfare unless there are a lot of them (you might find it helpful to reread the discussions on trial control limits in Chapters 5 and 6).
These calculations have been performed assuming that the process parameters were
known, so that the limits are based on standard values. In practice, the preliminary


observations will be used to estimate parameters and control limits. Consequently, the
deviations of the group of points plotted on the chart from the limits exhibit correlation,
and we can’t even calculate the false alarm probabilities analytically. In Shewhart
control charting, we assume that enough preliminary data is used so that very reliable
(essentially known) values of the process parameters are being used in the control limit
calculations. That’s why we recommend that 20 – 25 subgroups be used when setting up
control charts. It would be possible to determine the probabilities (or control limits that
give specified probability of false alarms) by simulation. Refer to Sullivan and Woodall
(1996) for a nice discussion of this. We presented their simulation based control limits in
the multivariate control chart example for subgroups of size n = 1 in Chapter 10.
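The kind of simulation described here can be sketched in a few lines. The code below is one hypothetical version: it repeatedly generates m in-control subgroups, estimates the $\bar{x}$ chart limits from the grand average and $\bar{R}/d_2$, and records how often at least one subgroup average falls outside its own estimated limits. The subgroup size, number of subgroups, and number of replications are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, reps = 25, 5, 5000
d2 = 2.326                                        # tabled constant for n = 5
signals = 0
for _ in range(reps):
    data = rng.normal(0.0, 1.0, size=(m, n))      # m in-control subgroups of size n
    xbar = data.mean(axis=1)
    rbar = np.ptp(data, axis=1).mean()            # average subgroup range
    center = xbar.mean()
    halfwidth = 3 * rbar / (d2 * np.sqrt(n))      # A2 * Rbar
    if np.any(np.abs(xbar - center) > halfwidth):
        signals += 1
print("estimated probability of at least one out-of-control signal:", signals / reps)
```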
It is fairly typical in Phase I to assume that the process is initially out of control, so a
frequent objective of the analyst is to bring the process into a state of statistical control.
Sometimes this will require several cycles in which the control chart is employed,
assignable causes are detected and corrected, revised control limits are calculated, and the
out-of-control action plan is updated and expanded. Generally, Shewhart control charts
are very effective in Phase I because they are easy to construct, patterns on the control
charts are often easy to interpret and have direct physical meaning, and most often the
types of assignable causes that usually occur in Phase 1 result in fairly large process
shifts – exactly the scenario in which the Shewhart control chart is most effective.
In Phase 2 we usually assume that the process is reasonably stable. Often, the assignable
causes that occur in Phase 2 result in smaller process shifts, because (hopefully) most of
the really ugly sources of variability have been systematically removed during Phase 1.
Our emphasis is now on process monitoring, not on bringing an unruly process into
control. Shewhart control charts are much less likely to be effective in Phase 2 because
they are not very sensitive to small to moderate size process shifts. Attempts to solve this
problem by employing sensitizing rules such as those discussed in Chapter 4 are likely to
be unsatisfactory, because the use of these supplemental sensitizing rules increases the
false alarm rate of the Shewhart control chart. Consequently, Cusum and EWMA control
charts are much more likely to be effective in Phase 2.

21. A Simple Alternative to Runs Rules on the $\bar{x}$ Chart


It is well known that while Shewhart control charts detect large shifts quickly, they are relatively insensitive to small or moderately sized process shifts. Various sensitizing rules (sometimes called runs rules) have been proposed to enhance the effectiveness of the chart in detecting small shifts. Of these rules, the Western Electric rules are among the most popular. The Western Electric rules are of the r out of m form; that is, if r out of the last m consecutive points exceed some limit, an out-of-control signal is generated.
In a very fundamental paper, Champ and Woodall (1987) point out that the use of these
sensitizing rules does indeed increase chart sensitivity, but at the expense of (sometimes
greatly) increasing the rate of false alarms, hence decreasing the in-control ARL.
Generally, I do not think that the sensitizing rules should be used routinely on a control
chart, particularly once the process has been brought into a state of control. They do have
some application in the establishment of control limits (Phase 1 of control chart usage)


and in trying to bring an unruly process into control, but even then they need to be used
carefully to avoid false alarms.
Obviously, Cusum and EWMA control charts provide an effective alternative to
Shewhart control charts for the problem of small shifts. However, Klein (2000) has
proposed another solution. His solution is simple but elegant: use an r out of m
consecutive point rule, but apply the rule to a single control limit rather than to a set of
interior “warning” type limits. He analyzes the following two rules:
1. If two consecutive points exceed a control limit, the process is out of control. The width of the control limits should be $1.78\sigma$.
2. If two out of three consecutive points exceed a control limit, the process is out of control. The width of the control limits should be $1.93\sigma$.
These rules would be applied to one side of the chart at a time, just as we do with the
Western Electric rules.
Klein (2000) presents the ARL performance of these rules for the $\bar{x}$ chart, using actual control limit widths of $\pm 1.7814\sigma$ and $\pm 1.9307\sigma$, as these choices make the in-control ARL equal to 370, the value associated with the usual three-sigma limits on the Shewhart chart. The table shown below is adapted from his results. Notice that Professor Klein's procedure greatly improves the ability of the Shewhart $\bar{x}$ chart to detect small shifts. The improvement is not as great as can be obtained with an EWMA or a Cusum, but it is substantial, and considering the simplicity of Klein's procedure, it should be more widely used in practice.

Shift in process mean,    ARL for the Shewhart     ARL for the Shewhart     ARL for the Shewhart
in standard deviation     chart with three-sigma   chart with 1.7814σ       chart with 1.9307σ
units                     control limits           control limits           control limits
0                         370                      350                      370
0.2                       308                      277                      271
0.4                       200                      150                      142
0.6                       120                       79                       73
0.8                        72                       44                       40
1                          44                       26                       23
2                           6.3                      4.6                      4.3
3                           2                        2.4                      2.4
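A quick way to see where numbers of this kind come from is simulation. The sketch below estimates the ARL of Klein's two-consecutive-points rule with $\pm 1.7814\sigma$ limits; standard normal subgroup averages are generated directly, so the shift is expressed in standard deviation units of $\bar{x}$, and the estimates should land near the tabled values.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 1.7814                      # Klein's two-of-two control limit multiplier

def run_length(shift, L, rng):
    """Observations until two consecutive points fall beyond the same limit."""
    above = below = 0
    n = 0
    while True:
        n += 1
        x = rng.normal(shift, 1.0)
        above = above + 1 if x > L else 0
        below = below + 1 if x < -L else 0
        if above == 2 or below == 2:
            return n

for shift in (0.0, 0.4, 1.0):
    arl = np.mean([run_length(shift, L, rng) for _ in range(10000)])
    print(f"shift = {shift}: estimated ARL = {arl:.1f}")
```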


22. Determining When the Process has Shifted


Control charts monitor a process to determine whether an assignable cause has occurred. Knowing when the assignable cause occurred would be very helpful in its identification and eventual removal. Unfortunately, the time of occurrence of the assignable cause does not always coincide with the control chart signal. In fact, given what is known about the average run length performance of control charts, it is actually very unlikely that the assignable cause occurs at the time of the signal. Therefore, when a signal occurs, the control chart analyst should look earlier in the process history to determine the assignable cause.
But where should we start? The Cusum control chart provides some guidance – simply
search backwards on the Cusum status chart to find the point in time where the Cusum
last crossed zero (refer to Chapter 8). However, the Shewhart $\bar{x}$ control chart provides no such simple guidance.
Samuel, Pignatiello, and Calvin (1998) use some theoretical results by Hinkley (1970) on change-point problems to suggest a procedure for determining the time of a shift in the process mean following a signal on the Shewhart $\bar{x}$ control chart. They assume the standard $\bar{x}$ control chart with in-control value of the process mean $\mu_0$. Suppose that the chart signals at subgroup T, with subgroup average $\bar{x}_T$. The in-control subgroups are $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_t$, and the out-of-control subgroups are $\bar{x}_{t+1}, \bar{x}_{t+2}, \ldots, \bar{x}_T$, where obviously $t \le T$. Their procedure consists of finding the value of t in the range $0 \le t < T$ that maximizes
$$C_t = (T - t)(\bar{\bar{x}}_{T,t} - \mu_0)^2$$
where
$$\bar{\bar{x}}_{T,t} = \frac{1}{T - t}\sum_{i=t+1}^{T} \bar{x}_i$$
is the reverse cumulative average, that is, the average of the T − t most recent subgroup averages. The value of t that maximizes $C_t$ is the estimator of the last subgroup that was selected from the in-control process.
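The search over t is simple to carry out once a signal has occurred. The sketch below is a direct implementation for a sequence of subgroup averages; the data and the in-control mean are hypothetical.

```python
import numpy as np

def estimate_change_point(xbars, mu0):
    """Return the value of t (number of in-control subgroups) maximizing
    C_t = (T - t) * (reverse cumulative average - mu0)^2."""
    T = len(xbars)
    best_t, best_C = 0, -np.inf
    for t in range(T):                          # 0 <= t < T
        tail = xbars[t:]                        # subgroups t+1, ..., T
        C_t = (T - t) * (np.mean(tail) - mu0) ** 2
        if C_t > best_C:
            best_t, best_C = t, C_t
    return best_t

# Hypothetical subgroup averages: the mean shifts upward after subgroup 12
rng = np.random.default_rng(7)
xbars = np.concatenate([rng.normal(10.0, 0.5, 12), rng.normal(10.8, 0.5, 6)])
print("estimated last in-control subgroup:", estimate_change_point(xbars, mu0=10.0))
```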

23. Difference Control Charts


The difference control chart is briefly mentioned in Chapter 9, and a reference is given to a paper by Grubbs (1946). There are actually two types of difference control charts in the literature. Grubbs compared samples from a current production process to a reference sample; his application was in the context of testing ordnance. The plotted quantity was the difference between the current sample average and the reference sample average. This quantity would be plotted on a control chart with center line at zero and control limits at $\pm A_2\sqrt{\bar{R}_1^2 + \bar{R}_2^2}$, where $\bar{R}_1$ and $\bar{R}_2$ are the average ranges for the reference samples (1) and the current production samples (2) used to establish the control limits.
The second type of difference control chart was suggested by Ott (1947), who considered
the situation where differences are observed between paired measurements within each
subgroup (much as in a paired t-test), and the average difference for each subgroup is
plotted on the chart. The center line for this chart is zero, and the control limits are at $\pm A_2\bar{R}$, where $\bar{R}$ is the average of the ranges of the differences. This chart would be


useful in instrument calibration, where one measurement on each unit is from a standard
instrument (say in a laboratory) and the other is from an instrument used in different
conditions (such as in production).

24. Control Charts for Contrasts


There are many manufacturing processes where process monitoring is important but
traditional statistical control charts cannot be effectively used because of rational
subgrouping considerations. Examples occur frequently in the chemical and processing
industries, stamping, casting and molding operations, and electronics and semiconductor
manufacturing.
As an illustration, consider a furnace used to create an oxide layer on silicon wafers. In
each run of the furnace a set of m wafers will be processed, and at the completion of the
run a single measurement of oxide thickness will be taken at each of n sites or locations
on each wafer. These mn thickness measurements will be evaluated to ensure the
stability of the process, check for the possible presence of assignable causes, and to
determine any necessary modifications to the furnace operating conditions (or the recipe)
before any subsequent runs are initiated. Figure 24-1, adapted from Runger and Fowler
(1999) and Czitrom and Reece (1997), shows a typical oxidation furnace with m = 4
wafers and n = 9 sites on each wafer. In Chapter 5 of the textbook, Example 5-11
illustrates an aerospace casting where vane height and inner diameter are the
characteristics of interest. Each casting has five vanes that are measured to monitor the
height characteristic and the diameter of a casting is measured at 24 locations using a
coordinate measuring machine.
In these applications it would be inappropriate to monitor the process with traditional $\bar{x}$ and R charts. For example, in the oxidation furnace, assuming a rational subgroup of either n = 9 or n = 36 is not correct, because all sites experience the processing activities during each furnace run simultaneously. That is, there is much less variability between the observations at the 9 sites than would be anticipated in observations collected from a process where all measurements reflect the processing activities that each unit experiences independently. What usually occurs when the standard charts are misapplied in this way is that the control limits on the $\bar{x}$ chart are too narrow. Then, if the process experiences moderate run-to-run variability, there will be many out-of-control points on the $\bar{x}$ chart that engineers and process operating personnel cannot associate with specific upsets or assignable causes.


[Figure 24-1 appears here: a schematic of the oxidation furnace showing the four wafer locations, each with nine numbered measurement sites.]

Figure 24-1. Diagram of a furnace where four wafers are simultaneously processed and nine quality measurements are performed on each wafer.

The most widely used approach to monitoring these processes is to first consider the
average of all mn observations from a run as a single observation and to use a Shewhart
control chart for individuals to monitor the overall process mean. The control limits for
this chart are usually found by applying a moving range to the sequence of averages.
Thus, the control limits for the individuals chart reflect run-to-run variability, not
variability within a run. The variability within a run is monitored by applying a control
chart for S (the standard deviation) or $S^2$ to all mn observations from each run. It is interesting to note that this approach is so widely used that at least one popular statistical software package (Minitab) includes it as a standard control charting option (called the "between/within" procedure in Minitab). This procedure was illustrated in Example 5-11.
Runger and Fowler (1999) show how the structure of the data obtained on these processes
can be represented by an analysis of variance model, and how control charts based on
contrasts can be designed to detect specific assignable causes of potential interest.
Below we briefly review their results and relate them to some other methods. Then we
analyze the average run length performance of the contrast charts and show that the use of specifically designed contrast charts can greatly enhance the ability of the monitoring scheme to detect assignable causes. We confine our analysis to Shewhart charts, but both Cusum and EWMA control charts would be very effective alternatives, because they are more effective in detecting small process shifts, which are likely to be of interest in many of these applications.
Contrast Control Charts
We consider the oxidation process in Figure 24-1, but allow m wafers in each run with n measurements or sites per wafer. The appropriate model for oxide thickness is
$$y_{ij} = r_i + s_j + \varepsilon_{ij} \qquad (24\text{-}1)$$
where $y_{ij}$ is the oxide thickness measurement from run i and site j, $r_i$ is the run effect, $s_j$ is the site effect, and $\varepsilon_{ij}$ is a random error component. We assume that the site effects are fixed effects, since the measurements are generally taken at the same locations on all wafers. The run effect is a random factor, and we assume it is distributed as NID$(0, \sigma_r^2)$. We assume that the error term is distributed as NID$(0, \sigma_\varepsilon^2)$. Notice that Equation (24-1) is essentially an analysis of variance model.
Let $\mathbf{y}_t$ be a vector of all measurements from the process at the end of run t. It is customary in most applications to update the control charts at the completion of every run. A contrast is a linear combination of the elements of the observation vector $\mathbf{y}_t$, say
$$c = \mathbf{c}'\mathbf{y}_t$$
where the elements of the vector $\mathbf{c}$ sum to zero and, for convenience, we assume that the contrast vector has unit length. That is,
$$\mathbf{c}'\mathbf{1} = 0 \quad \text{and} \quad \mathbf{c}'\mathbf{c} = 1$$
Any contrast vector is orthogonal to the vector that generates the mean, since the mean can be written as
$$\bar{y}_t = \mathbf{1}'\mathbf{y}_t/mn$$
Thus, a contrast generates information that is different from the information produced by the overall mean from the current run. Based on the particular problem, the control chart analyst can choose the elements of the contrast vector $\mathbf{c}$ to provide information of interest to that specific process.
For example, suppose that we were interested in detecting process shifts that could cause a difference in mean thickness between the top and bottom of the furnace. The engineering cause of such a difference could be a temperature gradient along the furnace from top to bottom. To detect this disturbance, we would want the contrast to compare the average oxide thickness of the top wafer in the furnace to the average thickness of the bottom wafer. Thus, if m = 4, the vector $\mathbf{c}$ has mn = 36 components, the first 9 of which are +1, the last 9 of which are −1, and the middle 18 elements are zero. To normalize the contrast to unit length we would actually use
$$\mathbf{c}' = [1,1,\ldots,1,\;0,0,\ldots,0,\;-1,-1,\ldots,-1]/\sqrt{18}$$
One could also divide the elements of $\mathbf{c}$ by nine to compute the averages of the top and bottom wafers, but this is not really necessary.
In practice, a set of k contrasts, say
$$\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k,$$
can be used to define control charts to monitor a process to detect k assignable causes of interest. These simultaneous control charts have overall false alarm rate $\alpha$, where
$$\alpha = 1 - \prod_{i=1}^{k}(1 - \alpha_i) \qquad (24\text{-}2)$$
and $\alpha_i$ is the false alarm rate for the ith contrast. If the contrasts are orthogonal, then Equation (24-2) holds exactly, while if the contrasts are not orthogonal then the Bonferroni inequality applies and the $\alpha$ in Equation (24-2) is a lower bound on the false alarm rate.
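As a sketch of how a contrast chart might be set up, the code below forms the top-versus-bottom contrast for the m = 4, n = 9 layout, estimates its center line and standard deviation from a set of in-control runs, and checks a new run against three-sigma limits. The simulated run-to-run and site-level variability are hypothetical and are not taken from any real furnace data.

```python
import numpy as np

m, n = 4, 9                                    # wafers per run, sites per wafer
c = np.zeros(m * n)
c[:n], c[-n:] = 1.0, -1.0                      # top wafer vs. bottom wafer
c /= np.sqrt(np.sum(c**2))                     # normalize to unit length

rng = np.random.default_rng(11)
def simulate_run(rng, run_sd=0.5, site_sd=0.2):
    # run effect common to all sites plus independent site-level noise (hypothetical model)
    return rng.normal(0, run_sd) + rng.normal(0, site_sd, m * n)

baseline = np.array([c @ simulate_run(rng) for _ in range(50)])   # Phase 1 runs
center = baseline.mean()
limit = 3 * baseline.std(ddof=1)               # three-sigma limits on the contrast

new_run = simulate_run(rng)
new_run[:n] += 0.6                             # a shift confined to the top wafer
stat = c @ new_run
print(f"contrast = {stat:.3f}, control limits = {center:.3f} ± {limit:.3f}")
```

Note that the run effect cancels out of any contrast, so the contrast chart's limits reflect only within-run variability, which is what makes it sensitive to location-specific shifts.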

Related Procedures
Several authors have suggested related approaches for process monitoring when non-standard conditions relative to rational subgrouping apply. Yashchin (1994), Czitrom and Reece (1997), and Hurwicz and Spagon (1997) all present control charts or other similar techniques based on variance components. The major difference of this approach in comparison to these authors is the use of an analysis-of-variance type partitioning based on contrasts, instead of variance components, as the basis of the monitoring scheme. Roes and Does (1995) do discuss the use of contrasts, and Hurwicz and Spagon discuss contrasts to estimate the variance contributed by sites within a wafer. However, the Runger and Fowler model is the most widely applicable of all the techniques we have encountered.
Even though the methodology used to monitor specific differences in processing
conditions has been studied by all these authors, the statistical performance of these
charts has not been demonstrated. We now present some performance results for
Shewhart control charts.


Average Run Length Performance of Shewhart Charts


In this section we assume that the process shown in Figure 24-1 is of interest. The
following scenarios are considered:
- A change in the mean of the top versus the bottom wafer.
- Changes on the left versus the right side of all wafers.
- Significant changes between the outside and the inside of each wafer.
- Four wafers are selected from the tube.
The contrasts for these charts are:

$$\mathbf{c}_1 = \tfrac{1}{\sqrt{18}}\,[1,1,1,1,1,1,1,1,1,\ 0,0,0,0,0,0,0,0,0,\ 0,0,0,0,0,0,0,0,0,\ -1,-1,-1,-1,-1,-1,-1,-1,-1]$$

$$\mathbf{c}_2 = \tfrac{1}{\sqrt{16}}\,[0,0,0,0,0,1,1,-1,-1,\ 0,0,0,0,0,1,1,-1,-1,\ 0,0,0,0,0,1,1,-1,-1,\ 0,0,0,0,0,1,1,-1,-1]$$

$$\mathbf{c}_3 = \tfrac{1}{\sqrt{32}}\,[0,1,1,1,1,-1,-1,-1,-1,\ 0,1,1,1,1,-1,-1,-1,-1,\ 0,1,1,1,1,-1,-1,-1,-1,\ 0,1,1,1,1,-1,-1,-1,-1]$$

A comparison of the ARL values obtained using these contrasts and the traditional approach (an individuals control chart for the mean of all 36 observations) is presented in Tables 24-1, 24-2, and 24-3. From inspection of these tables, we see that the charts for
the orthogonal contrasts, originally with the same in-control ARL as the traditional chart,
are more sensitive to changes at specific locations, thus improving the chances of early
detection of an assignable cause. Notice that the improvement is dramatic for small shifts,
say on the order of 1.5 standard deviations or less.

A similar analysis was performed for a modified version of the process shown in Figure 24-1. In this example, there are seven measurements per wafer, for a total of 28 measurements in a run. There are still three measurements at the center of the wafer, but now there are only four measurements at the perimeter, one in each "corner." The same types of contrasts used in the previous example (top versus bottom, left versus right, and edge versus center) were analyzed, and the ARL results are presented in Tables 24-4, 24-5, and 24-6.


Table 24-1. Average Run Length Performance of Traditional and Orthogonal Contrast Charts for a Shift in the Edges of All Wafers. In this chart m = 4 and n = 9.

Size of Shift          Edge versus Center      Traditional
(multiples of σ)       Contrast                Chart
0.5                    11.7                    13.6
1                       1.9                     2.2
1.5                     1.1                     1.1
2                       1                       1
2.5                     1                       1
3                       1                       1

Table 24-2. Average Run Length Performance of Traditional and Orthogonal Contrast Charts for a Shift in the Top Wafer. In this chart m = 4 and n = 9.

Size of Shift          Top versus Bottom       Traditional
(multiples of σ)       Contrast                Chart
0.5                    23.4                    47
1                       3.9                    10
1.5                     1.5                     3.4
2                       1.1                     1.7
2.5                     1                       1.2
3                       1                       1


Table 24-3. Average Run Length Performance of Traditional and Orthogonal Contrast Charts for a Shift in the Left Side of All Wafers. In this chart m = 4 and n = 9.

Size of Shift          Left versus Right       Traditional
(multiples of σ)       Contrast                Chart
0.5                    26.7                    57.2
1                       4.6                    13.6
1.5                     1.7                     4.6
2                       1.1                     2.2
2.5                     1                       1.4
3                       1                       1.1

Table 24-4. Average Run Length Comparison between Traditional and Orthogonal Contrast Charts for a Shift in the Edge of All Wafers. In this chart m = 4 and n = 7.

Size of Shift          Edge versus Center      Traditional
(multiples of σ)       Contrast                Chart
0.5                    26.7                    46.4
1                       4.6                     9.8
1.5                     1.7                     3.3
2                       1.1                     1.7
2.5                     1                       1.2
3                       1                       1


Table 24-5. Average Run Length Comparison between Traditional and Orthogonal Contrast Charts for a Change in the Top Wafer. In this chart m = 4 and n = 7.

Size of Shift          Top versus Bottom       Traditional
(multiples of σ)       Contrast                Chart
0.5                    30.8                    57.9
1                       5.5                    13.8
1.5                     2                       4.7
2                       1.2                     2.2
2.5                     1                       1.4
3                       1                       1.1

Table 24-6. Average Run Length Performance of Traditional and Orthogonal Contrast Charts for a Shift in the Left Side of All Wafers. In this chart m = 4 and n = 7.

Size of Shift          Left versus Right       Traditional
(multiples of σ)       Contrast                Chart
0.5                    26.7                    46.4
1                       4.6                     9.8
1.5                     1.7                     3.3
2                       1.1                     1.7
2.5                     1                       1.2
3                       1                       1

Decreasing the number of measurements per wafer has increased the relative importance
of the changes in the mean of a subset of the observations and the traditional control
charts signal the shift faster than in the previous example. Still, note that the control
charts based on orthogonal contrasts represent a considerable improvement over the
traditional approach.


25. More About Monitoring Variability with Individual Observations


As noted in the textbook, when one is monitoring a process using individual (as opposed
to subgrouped) measurements, a moving range control chart does not provide much
useful additional information about shifts in process variance beyond that which is
provided by the individuals control chart (or a Cusum or EWMA of the individual
observations). Sullivan and Woodall (1996) describe a change-point procedure that is
much more effective than the individuals chart (or Cusum/EWMA) with a moving range chart.
Assume that the process is normally distributed. Then divide the n observations into two partitions of $n_1$ and $n_2$ observations, with $n_1 = 2, 3, \ldots, n - 2$ observations in the first partition and $n - n_1$ in the second. The log-likelihood of each partition is computed, using the maximum likelihood estimators of $\mu$ and $\sigma^2$ in that partition, and the two log-likelihoods are added; call the sum $L_a$. Let $L_0$ denote the log-likelihood computed without any partition. Then find the maximum value of the likelihood ratio statistic $r = 2(L_a - L_0)$ over the possible partitions. The value of $n_1$ at which this maximum occurs is the change point; that is, it is the estimate of the point in time at which a change in either the process mean or the process variance (or both) has occurred.
Sullivan and Woodall show how to obtain a control chart for the likelihood ratio r. The
control limits must be obtained either by simulation or by approximation. When the
control chart signals, the quantity r can be decomposed into two components; one that is
zero if the means in each partition are equal, and another that is zero if the variances in
each partition are equal. The relative size of these two components suggests whether it is
the process mean or the process variance that has shifted.
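A minimal sketch of this change-point calculation is given below, with the statistic written as $2(L_a - L_0)$ so that larger values indicate stronger evidence of a change; the data are hypothetical, with a variance increase introduced partway through the series. Control limits for the statistic would still have to be obtained by simulation or approximation, as noted above.

```python
import numpy as np

def normal_loglik(x):
    """Log-likelihood of a normal sample evaluated at its MLEs."""
    k = len(x)
    var = np.var(x)                # MLE of the variance (divides by k)
    return -0.5 * k * (np.log(2 * np.pi * var) + 1.0)

def change_point_statistic(x):
    """Maximize the likelihood ratio statistic over all admissible split points."""
    n = len(x)
    L0 = normal_loglik(x)
    best_r, best_n1 = -np.inf, None
    for n1 in range(2, n - 1):     # n1 = 2, ..., n - 2
        La = normal_loglik(x[:n1]) + normal_loglik(x[n1:])
        r = 2 * (La - L0)
        if r > best_r:
            best_r, best_n1 = r, n1
    return best_r, best_n1

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 15), rng.normal(0, 2.5, 10)])   # variance increase after obs 15
r, n1 = change_point_statistic(x)
print(f"max likelihood ratio statistic = {r:.2f} at n1 = {n1}")
```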

26. Non-Manufacturing Applications of Control Charts


The textbook gives some account of using control charts and other statistical methods for
improving non-manufacturing processes. Some other useful references are Cinimera and
Lease (1992), Chamberlin, Lane, and Kennedy (1993), Finison, Spencer, and Finison
(1993), and Rodriguez (1996) (health care); Charnes and Gitlow (1995) (law
enforcement); Levey and Jennings (1992), and Howarth (1995) (analytical chemical and
clinical laboratories); Herath, Park, and Prueitt (1994) (project management); and the
Gardiner and Montgomery (1987) reference in the text (software development). The
author has control charted his golf scores, hoping to find the assignable cause.

27. Analysis of Variance Methods for Measurement Systems Capability Studies


In Chapter 7 an analysis of variance model approach to measurement systems studies is
presented as an alternative to the “classical” gage R & R study based on a tabular
approach. The tabular approach is a relatively simple method, but it is not the most
general or efficient approach to conducting gage studies. Gage and measurement systems
studies are designed experiments, and often we find that the gage study must be
conducted using an experimental design that does not nicely fit into the tabular analysis
scheme. For example, suppose that the operators used with each instrument (or gage) are


different because the instruments are in different physical locations. Then operators are
nested within instruments, and the experiment has been conducted as a nested design.
As another example, suppose that the operators are not selected at random, because the
specific operators used in the study are the only ones that actually perform the
measurements. This is a mixed model experiment, and the random effects approach that
the tabular method is based on is inappropriate. The random effects model analysis of
variance approach in the text is also inappropriate for this situation. Dolezal, Burdick,
and Birch (1998) and Montgomery (1997) discuss the mixed model analysis of variance
for gage R & R studies.
The tabular approach does not lend itself to constructing confidence intervals on the
variance components or functions of the variance components of interest. For that reason
we do not recommend the tabular approach for general use. There are three general
approaches to constructing these confidence intervals: (1) the Satterthwaite method, (2)
the maximum likelihood large-sample method, and (3) the modified large sample
method. Montgomery (1997) gives an overview of these different methods. Of the three
approaches, there is good evidence that the modified large sample approach is the best, in the sense that it produces confidence intervals that are closest to the stated level of
confidence.
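For readers who want to see the variance components themselves (the quantities for which the intervals just mentioned would be constructed), the sketch below computes the usual method-of-moments estimates from a balanced two-factor random-effects gauge R & R layout. It is only the random-effects point-estimation step, not the mixed-model analysis or the interval methods discussed above, and the simulated data are hypothetical.

```python
import numpy as np

def grr_variance_components(y):
    """y has shape (p parts, o operators, r replicates); balanced random-effects ANOVA."""
    p, o, r = y.shape
    grand = y.mean()
    part_means = y.mean(axis=(1, 2))
    oper_means = y.mean(axis=(0, 2))
    cell_means = y.mean(axis=2)

    ss_part = o * r * np.sum((part_means - grand) ** 2)
    ss_oper = p * r * np.sum((oper_means - grand) ** 2)
    ss_cell = r * np.sum((cell_means - grand) ** 2)
    ss_inter = ss_cell - ss_part - ss_oper
    ss_error = np.sum((y - cell_means[:, :, None]) ** 2)

    ms_part = ss_part / (p - 1)
    ms_oper = ss_oper / (o - 1)
    ms_inter = ss_inter / ((p - 1) * (o - 1))
    ms_error = ss_error / (p * o * (r - 1))

    # Method-of-moments estimates from the expected mean squares (negatives set to zero)
    repeatability = ms_error
    interaction = max((ms_inter - ms_error) / r, 0.0)
    operator = max((ms_oper - ms_inter) / (p * r), 0.0)
    part = max((ms_part - ms_inter) / (o * r), 0.0)
    return {"part": part, "operator": operator, "interaction": interaction,
            "repeatability": repeatability,
            "gauge R&R": repeatability + interaction + operator}

# Hypothetical data: 10 parts, 3 operators, 2 replicates
rng = np.random.default_rng(9)
y = (rng.normal(0, 2.0, (10, 1, 1))        # part effects
     + rng.normal(0, 0.5, (1, 3, 1))       # operator effects
     + rng.normal(0, 0.8, (10, 3, 2)))     # repeatability error
print(grr_variance_components(y))
```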
Hamada and Weerahandi (2000) show how generalized inference can be applied to the
problem of determining confidence intervals in measurement systems capability studies.
The technique is somewhat more involved than the three methods referenced above.
Either numerical integration or simulation must be used to find the desired confidence
intervals.
While the tabular method should be abandoned, the control charting aspect of
measurement systems capability studies should be used more consistently. All too often
a measurement study is conducted and analyzed via some computer program without
adequate graphical analysis of the data. Furthermore, some of the advice in various
quality standards and reference sources regarding these studies is just not very good and
can produce results of questionable validity.

28. The Brook and Evans Markov Chain Approach to Finding the Average Run
Length of the Cusum and EWMA Control Charts

When the observations drawn from the process are independent, average run lengths or
ARLs are easy to determine for Shewhart control charts because the points plotted on the
chart are independent. The distribution of run length is geometric, so the ARL of the
chart is just the mean of the geometric distribution, or 1/p, where p is the probability that
a single point plots outside the control limits.
The sequence of plotted points on Cusum and EWMA charts is not independent, so
another approach must be used to find the ARLs. The Markov chain approach developed
by Brook and Evans (1972) is very widely used. We give a brief discussion of this
procedure for a one-sided Cusum.


The Cusum control chart statistics $C^+$ (or $C^-$) form a Markov process with a continuous state space. By discretizing the continuous random variable $C^+$ (or $C^-$) into a finite set of values, approximate ARLs can be obtained from Markov chain theory. For the upper one-sided Cusum with decision interval H, the intervals are defined as follows:
$$(-\infty, w/2],\; [w/2, 3w/2],\; \ldots,\; [(j - 1/2)w, (j + 1/2)w],\; \ldots,\; [(m - 3/2)w, H],\; [H, \infty)$$
where m + 1 is the number of states and $w = 2H/(2m - 1)$. The elements of the transition probability matrix of the Markov chain, $\mathbf{P} = [p_{ij}]$, are
$$p_{i0} = \int_{-\infty}^{w/2} f(x - iw + k)\,dx, \quad i = 0, 1, \ldots, m - 1$$
$$p_{ij} = \int_{(j-1/2)w}^{(j+1/2)w} f(x - iw + k)\,dx, \quad i = 0, 1, \ldots, m - 1;\; j = 1, 2, \ldots, m - 1$$
$$p_{im} = \int_{H}^{\infty} f(x - iw + k)\,dx, \quad i = 0, 1, \ldots, m - 1$$
$$p_{mj} = 0, \quad j = 0, 1, \ldots, m - 1, \qquad p_{mm} = 1$$
The absorbing state is m and f denotes the probability density function of the variable that
is being monitored with the Cusum.
From the theory of Markov chains, the expected first-passage times from state i to the absorbing state satisfy
$$\mu_i = 1 + \sum_{j=0}^{m-1} p_{ij}\mu_j, \quad i = 0, 1, \ldots, m - 1$$
Thus, $\mu_i$ is the ARL given that the process started in state i. Let $\mathbf{Q}$ be the matrix of transition probabilities obtained by deleting the last row and column of $\mathbf{P}$. Then the vector of ARLs $\boldsymbol{\mu}$ is found by computing
$$\boldsymbol{\mu} = (\mathbf{I} - \mathbf{Q})^{-1}\mathbf{1}$$
where $\mathbf{1}$ is an $m \times 1$ vector of 1s and $\mathbf{I}$ is the $m \times m$ identity matrix.
When the process is out of control, this procedure gives a vector of initial-state (or zero-
state) ARLs. That is, the process shifts out of control at the initial start-up of the control
chart. It is also possible to calculate steady-state ARLs that describe performance
assuming that the process shifts out of control after the control chart has been operating
for a long period of time. There is typically very little difference between initial-state and
steady-state ARLs.
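A compact implementation of the Brook and Evans calculation for an upper one-sided Cusum is sketched below, assuming normally distributed observations with unit standard deviation; the reference value k, decision interval H, and number of transient states m are inputs, and SciPy is assumed to be available for the normal cdf. For k = 0.5 and H = 5 the in-control result should be near the commonly quoted ARL of about 465.

```python
import numpy as np
from scipy.stats import norm

def cusum_arl(k, H, mu=0.0, m=100):
    """Zero-state ARL of an upper one-sided Cusum via the Brook-Evans Markov chain."""
    w = 2 * H / (2 * m - 1)
    Q = np.zeros((m, m))
    for i in range(m):
        # The new Cusum value is (state value iw) + x - k, with x ~ N(mu, 1),
        # so the new value is normal with the mean below and unit variance.
        center = i * w - k + mu
        Q[i, 0] = norm.cdf(w / 2, loc=center)              # fall at or below w/2 -> state 0
        for j in range(1, m):
            Q[i, j] = (norm.cdf((j + 0.5) * w, loc=center)
                       - norm.cdf((j - 0.5) * w, loc=center))
    arl = np.linalg.solve(np.eye(m) - Q, np.ones(m))       # mu = (I - Q)^{-1} 1
    return arl[0]                                          # chart started from zero (state 0)

print(f"in-control ARL          ~ {cusum_arl(0.5, 5.0, mu=0.0):.1f}")
print(f"ARL after 1-sigma shift ~ {cusum_arl(0.5, 5.0, mu=1.0):.1f}")
```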
Let $P(n, i)$ be the probability that the run length takes on the value n, given that the chart started in state i. Collect these quantities into a vector, say
$$\mathbf{p}_n' = [P(n, 0), P(n, 1), \ldots, P(n, m - 1)]$$
for n = 1, 2, .... These probabilities can be calculated by solving the following equations:
$$\mathbf{p}_1 = (\mathbf{I} - \mathbf{Q})\mathbf{1}$$
$$\mathbf{p}_n = \mathbf{Q}\,\mathbf{p}_{n-1}, \quad n = 2, 3, \ldots$$
This technique can be used to calculate the probability distribution of the run length, given that the control chart started in state i. Some authors believe that the distribution of run length, or its percentiles, is more useful than the ARL, since the distribution of run length is usually highly skewed, and so the ARL may not be a "typical" value in any sense.

29. Integral Equations Versus Markov Chains for Finding the ARL
Two methods are used to find the ARL distribution of control charts, the Markov chain
method and an approach that uses integral equations. The Markov chain method is
described in Section 28 of the Supplemental Text Material. This section gives an
overview of the integral equation approach for the Cusum control chart. Some of the
notation defined in Section 28 will be used here.
Let $P(n, u)$ and $R(u)$ be the probability that the run length takes on the value n and the ARL for the Cusum when the procedure begins with initial value u. For the one-sided upper Cusum
$$P(1, u) = 1 - \int_{-\infty}^{H} f(x - u + k)\,dx = 1 - \int_{-\infty}^{w/2} f(x - u + k)\,dx - \sum_{j=1}^{m-1}\int_{(j-1/2)w}^{(j+1/2)w} f(x - u + k)\,dx$$
and
$$P(n, u) = P(n-1, 0)\int_{-\infty}^{0} f(x - u + k)\,dx + \int_{0}^{H} P(n-1, y)\,f(y - u + k)\,dy$$
$$\approx P(n-1, 0)\int_{-\infty}^{0} f(x - u + k)\,dx + P(n-1, \eta_0)\int_{0}^{w/2} f(x - u + k)\,dx + \sum_{j=1}^{m-1} P(n-1, \eta_j)\int_{(j-1/2)w}^{(j+1/2)w} f(x - u + k)\,dx$$
for n = 2, 3, ..., where $\eta_0 \in (0, w/2)$ and $\eta_j \in [(j - 1/2)w, (j + 1/2)w)$, $j = 1, 2, \ldots, m - 1$. If w is small, then $\eta_j$ is approximately the midpoint $jw$ of the jth interval for j = 1, 2, ..., m − 1, and considering only the values of $P(n, u)$ for which $u = iw$ results in
$$P(1, iw) = 1 - \sum_{j=0}^{m-1} p_{ij}$$
$$P(n, iw) = \sum_{j=0}^{m-1} P(n-1, jw)\,p_{ij}, \quad n = 2, 3, \ldots$$


But these last equations are just the equations used for calculating the probabilities of
first-passage times in a Markov chain. Therefore, the solution to the integral equation
approach involves solving equations identical to those used in the Markov chain
procedure.
Champ and Rigdon (1991) give an excellent discussion of the Markov chain and integral
equation techniques for finding ARLs for both the Cusum and the EWMA control charts.
They observe that the Markov chain approach involves obtaining an exact solution to an
approximate formulation of the ARL problem, while the integral equation approach
involves finding an approximate solution to the exact formulation of the ARL problem.
They point out that more accurate solutions can likely be found via the integral equation
approach. However, there are problems for which only the Markov chain method will
work, such as the case of a drifting mean.
