Process Capability
In this chapter we give the mathematical background of process capability analysis, in particular
the capability indices Cp and Cpk and related topics like tolerance intervals and density estimation.
For a detailed mathematical account of capability indices, we refer to [13].
If the process is not centred, then the expected proportion of non-conforming items will be higher
than the value of Cp seems to indicate. Therefore the following index has been introduced for
non-centred processes:

Cpk = min( (USL − µ)/(3σ), (µ − LSL)/(3σ) ).
Using the identity min(a, b) = ½(a + b − |a − b|), we obtain the following representations:

Cpk = min(USL − µ, µ − LSL) / (3σ)
    = ( d − |µ − ½(LSL + USL)| ) / (3σ)
    = ( 1 − |µ − ½(LSL + USL)| / d ) Cp.    (1.2)
We immediately read off from (1.2) that Cp ≥ Cpk. Moreover, since Cp = d/(3σ), we also have that Cp = Cpk if and only if the process is centred. The notation

k = |µ − ½(LSL + USL)| / d

is often used. It is also possible to define Cpk in terms of a target value T instead of the process mean µ.
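For concreteness, here is a minimal computational sketch of these definitions (the specification limits and process parameters below are hypothetical, not taken from the text):

def capability_indices(mu, sigma, lsl, usl):
    """Cp, Cpk and the offset k for a normal process with mean mu and standard deviation sigma."""
    d = (usl - lsl) / 2.0                         # half-width of the specification interval
    m = (usl + lsl) / 2.0                         # midpoint of the specification interval
    cp = (usl - lsl) / (6.0 * sigma)              # Cp = d / (3 sigma)
    cpk = min(usl - mu, mu - lsl) / (3.0 * sigma)
    k = abs(mu - m) / d                           # so that Cpk = (1 - k) Cp
    return cp, cpk, k

if __name__ == "__main__":
    print(capability_indices(mu=10.2, sigma=0.5, lsl=8.0, usl=12.0))
    # Cpk <= Cp always, with equality exactly when the process is centred (mu = m).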
The expected proportion of non-conforming items for a non-centred process with a normal distribution can be expressed in terms of Cp and Cpk as follows (cf. (1.1)). The expected proportion equals Φ((LSL − µ)/σ) + 1 − Φ((USL − µ)/σ). Now assume that ½(USL + LSL) ≤ µ ≤ USL. Then Cpk = (USL − µ)/(3σ) and

(LSL − µ)/(3σ) = ( (USL − µ) − (USL − LSL) )/(3σ) = Cpk − 2Cp ≤ −Cpk,

because Cp ≥ Cpk. Hence, the expected proportion of non-conforming items can be expressed as Φ(3(Cpk − 2Cp)) + 1 − Φ(3Cpk).
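This expression can be checked numerically. The sketch below (with illustrative, hypothetical values for a non-centred process) compares the direct computation Φ((LSL − µ)/σ) + 1 − Φ((USL − µ)/σ) with the expression in terms of Cp and Cpk:

from math import erf, sqrt

def phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Hypothetical, non-centred process with (USL + LSL)/2 <= mu <= USL.
mu, sigma, lsl, usl = 10.4, 0.5, 8.0, 12.0
cp = (usl - lsl) / (6.0 * sigma)
cpk = min(usl - mu, mu - lsl) / (3.0 * sigma)

direct = phi((lsl - mu) / sigma) + 1.0 - phi((usl - mu) / sigma)
via_indices = phi(3.0 * (cpk - 2.0 * cp)) + 1.0 - phi(3.0 * cpk)
print(direct, via_indices)    # the two numbers coincide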
The rationale behind this definition is as follows. The estimate of θ is obtained by maximizing the likelihood function L(θ; x1, . . . , xn) as a function of θ for fixed x1, . . . , xn, while the estimate of τ(θ) is obtained by maximizing the induced likelihood function L(τ(θ); x1, . . . , xn) as a function of τ(θ) for fixed x1, . . . , xn.
The following theorem describes a useful invariance property of Maximum Likelihood estima-
tors. Note that the theorem does not require any assumption on the function τ .
Proof: Define τ⁻¹(ξ) := {θ ∈ Θ | τ(θ) = ξ} for any ξ in the range of τ. Obviously, θ ∈ τ⁻¹(τ(θ)) for all θ ∈ Θ. Hence, we have for any such ξ that

Mτ(ξ; x1, . . . , xn) = sup_{θ ∈ τ⁻¹(ξ)} L(θ; x1, . . . , xn)
                      ≤ sup_{θ ∈ Θ} L(θ; x1, . . . , xn)
                      = L(Θ̂; x1, . . . , xn)
                      = sup_{θ ∈ τ⁻¹(τ(Θ̂))} L(θ; x1, . . . , xn)
                      = Mτ(τ(Θ̂); x1, . . . , xn).

Thus τ(Θ̂) maximizes the induced likelihood function, as required. Inspection of the proof reveals that if Θ̂ is the unique MLE of (θ1, . . . , θk), then τ(Θ̂) is the unique MLE of τ((θ1, . . . , θk)). ¤
We now give some examples that illustrate how to use this invariance property in order to obtain
an MLE of a function of a parameter.
Examples 1.2.3 Let X, X1 , X2 , . . . , Xn be independent random variables, each distributed ac-
cording to the normal distribution with parameters µ and σ 2 . Let Z be a standard normal random
variable with distribution function Φ. Recall that the ML estimators for µ and σ² are µ̂ = X̄ and σ̂² = (1/n) Σ_{i=1}^n (Xᵢ − X̄)², respectively.
b) Suppose we want to estimate 1/σ, when µ is unknown. Theorem 1.2.2 with Θ = (0, ∞) and τ(x) = 1/√x yields that the MLE of 1/σ equals ( (1/n) Σ_{i=1}^n (Xᵢ − X̄)² )^{−1/2}. The MLEs for Cp and Cpk easily follow from the MLE of 1/σ and are given by (USL − LSL)/(6σ̂) and min(USL − X̄, X̄ − LSL)/(3σ̂).
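As an illustration, here is a small sketch of these plug-in MLEs computed from a sample (the data and specification limits are hypothetical); note the divisor n in the estimate of σ, as required by maximum likelihood:

import numpy as np

def mle_capability(x, lsl, usl):
    """ML estimates of Cp and Cpk, using the MLE of sigma (divisor n, not n - 1)."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    sigma_hat = np.sqrt(np.mean((x - xbar) ** 2))    # MLE of sigma
    cp_hat = (usl - lsl) / (6.0 * sigma_hat)
    cpk_hat = min(usl - xbar, xbar - lsl) / (3.0 * sigma_hat)
    return cp_hat, cpk_hat

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.2, scale=0.5, size=50)    # hypothetical process data
print(mle_capability(sample, lsl=8.0, usl=12.0))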
c) Let p be an arbitrary number between 0 and 1 and assume that both µ and σ² are unknown. Suppose that we want to estimate the p-th quantile of X, that is, we want to estimate the unique number x_p such that P(X ≤ x_p) = p. Since

p = P(X ≤ x_p) = P( Z ≤ (x_p − µ)/σ ) = Φ( (x_p − µ)/σ ),

it follows that x_p = µ + z_p σ, where z_p := Φ⁻¹(p). Thus Theorem 1.2.2 with Θ = R × (0, ∞) and τ(x, y) = x + z_p √y yields that the MLE of x_p equals X̄ + z_p σ̂, where σ̂ is as in a).
d) Let a < b be arbitrary real numbers and assume that both µ and σ² are unknown. Suppose we want to estimate P(a < X < b) = F(b) − F(a). Since

P(a < X < b) = P( (a − µ)/σ < Z < (b − µ)/σ ) = Φ( (b − µ)/σ ) − Φ( (a − µ)/σ ),

Theorem 1.2.2 with Θ = R × (0, ∞) and τ(x, y) = Φ( (b − x)/√y ) − Φ( (a − x)/√y ) yields that the MLE for P(a < X < b) equals

Φ( (b − X̄)/σ̂ ) − Φ( (a − X̄)/σ̂ ),

where σ̂ is as in a).
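The invariance property thus reduces both estimation problems to plugging X̄ and σ̂ into the corresponding function of (µ, σ). A brief sketch, with hypothetical data and using scipy for Φ and Φ⁻¹:

import numpy as np
from scipy.stats import norm

def mle_quantile_and_prob(x, p, a, b):
    """MLEs of the p-th quantile x_p and of P(a < X < b), obtained via the invariance property."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    sigma_hat = np.sqrt(np.mean((x - xbar) ** 2))    # MLE of sigma (divisor n)
    x_p = xbar + norm.ppf(p) * sigma_hat             # plug-in MLE of the p-th quantile
    prob = norm.cdf((b - xbar) / sigma_hat) - norm.cdf((a - xbar) / sigma_hat)
    return x_p, prob

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=100)    # hypothetical data
print(mle_quantile_and_prob(sample, p=0.95, a=-1.0, b=1.0))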
Since (X̄, S) are jointly complete sufficient statistics, the Rao–Blackwell theorem in combination with the Lehmann–Scheffé theorem yields that E(Y | X̄, S) is a UMVU estimator of the proportion of non-conforming items. For various explicit formulas for this quantity, we refer to [6, 7, 23].
The corresponding estimators of Cp and Cpk based on a sample are

Ĉp = (USL − LSL)/(6S)

and

Ĉpk = min( (USL − X̄)/(3S), (X̄ − LSL)/(3S) ),

where X̄ denotes the sample mean and S denotes the sample standard deviation.
Confidence intervals and hypothesis tests for Cp easily follow from the identity

P( Ĉp/Cp > c ) = P( χ²_{n−1} < (n − 1)/c² ).
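For example, the identity yields a lower confidence bound for Cp. The following sketch (one common construction, with hypothetical data) computes a 100(1 − α)% lower bound Ĉp √(χ²_{n−1;α}/(n − 1)), where χ²_{n−1;α} denotes the α-quantile of the χ²_{n−1}-distribution:

import numpy as np
from scipy.stats import chi2

def cp_lower_bound(x, lsl, usl, alpha=0.05):
    """100(1 - alpha)% lower confidence bound for Cp, based on (n - 1) S^2 / sigma^2 ~ chi^2_{n-1}."""
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x.std(ddof=1)                                # sample standard deviation S
    cp_hat = (usl - lsl) / (6.0 * s)
    q = chi2.ppf(alpha, df=n - 1)                    # alpha-quantile of the chi^2_{n-1} distribution
    return cp_hat * np.sqrt(q / (n - 1))

rng = np.random.default_rng(2)
sample = rng.normal(10.0, 0.4, size=30)              # hypothetical data
print(cp_lower_bound(sample, lsl=8.0, usl=12.0))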
where Tν (λ) denotes a random variable distributed according to the noncentral t-distribution with
ν degrees of freedom and noncentrality parameter λ.
Proof: Recall that nσ̂²/σ² follows a χ²-distribution with n degrees of freedom. Combining this with the definition of the noncentral t-distribution (see Definition 1.3.1), we obtain

P( X̄ + z_p σ̂ ≤ t ) = P( (X̄ − µ)/(σ/√n) + z_p √n σ̂/σ ≤ √n (t − µ)/σ )
                    = P( Z + √n (µ − t)/σ ≤ −z_p √n σ̂/σ )
                    = P( T_n( √n (µ − t)/σ ) ≤ −z_p √n ).
¤
With this theorem, we can compute the asymptotic distribution of many estimators, in particular
those discussed above. Among other things, this is useful for constructing confidence intervals
(especially when the finite sample distribution is intractable).
We start with the asymptotic distribution of the sample variance. We first recall a multivariate
version of the Central Limit Theorem.
Proof: Because the variance does not depend on the mean, we assume without loss of generality that µ = 0. Since we have finite fourth moments, we infer from the multivariate Central Limit Theorem (Theorem 1.4.2) that

√n [ ( (1/n) Σ_{i=1}^n Xᵢ , (1/n) Σ_{i=1}^n Xᵢ² ) − (0, σ²) ] →d N(0, Σ).

The last statement follows from the fact that for zero-mean normal distributions µ₄ = 3σ⁴. ¤
Theorem 1.4.4 Let X, X1, X2, . . . be independent normal random variables with mean µ and variance σ². If µ and σ² are unknown, then the following asymptotic result holds for the MLE X̂_p = X̄ + z_p σ̂ of x_p:

√n ( X̂_p − (µ + z_p σ) ) →d N( 0, σ²(1 + ½ z_p²) ).

Proof: The Central Limit Theorem yields that √n (X̄ − µ) →d N(0, σ²). Combining Theorems 1.4.1 and 1.4.3, we have that √n (σ̂ − σ) →d N(0, ½σ²). Now recall that since we have a normal sample, X̄ and σ̂ are independent. Hence, it follows from Slutsky's Lemma that

√n ( X̄ + z_p σ̂ − µ − z_p σ ) →d N(0, σ²) ∗ N(0, ½ z_p² σ²) = N( 0, σ²(1 + ½ z_p²) ). ¤
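The asymptotic variance σ²(1 + ½z_p²) can be checked by simulation. A rough Monte Carlo sketch (sample size, number of replications and parameter values are arbitrary illustrative choices):

import numpy as np
from scipy.stats import norm

# Monte Carlo check of the limit in Theorem 1.4.4 (all parameter values are illustrative).
rng = np.random.default_rng(3)
mu, sigma, p, n, reps = 0.0, 1.0, 0.95, 200, 20000
z_p = norm.ppf(p)

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
sigma_hat = np.sqrt(((samples - xbar[:, None]) ** 2).mean(axis=1))   # MLE of sigma per replication
xp_hat = xbar + z_p * sigma_hat                                      # MLE of the p-th quantile

empirical = n * xp_hat.var()                          # approximates the variance of sqrt(n)(X_p-hat - x_p)
theoretical = sigma ** 2 * (1.0 + 0.5 * z_p ** 2)
print(empirical, theoretical)                         # the two values should be close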
Theorem 1.4.5 Let X, X1, X2, . . . be independent normal random variables with mean µ and variance σ² and let ϕ be the standard normal density. If µ and σ² are unknown, then the following asymptotic result holds for the MLE of P(a < X < b) = P((a, b)):

√n ( Φ((b − X̄)/σ̂) − Φ((a − X̄)/σ̂) − P(a < X < b) ) →d N( 0, σ²( c1² + 4 c1 c2 µ + 2 c2² (2µ² + σ²) ) ),

where

c1 = ϕ((b − µ)/σ) · (bµ − (µ² + σ²))/σ³ − ϕ((a − µ)/σ) · (aµ − (µ² + σ²))/σ³,
c2 = ϕ((b − µ)/σ) · (µ − b)/(2σ³) − ϕ((a − µ)/σ) · (µ − a)/(2σ³).
Proof: We infer from the multivariate Central Limit Theorem (Theorem 1.4.2) that

√n [ ( (1/n) Σ_{i=1}^n Xᵢ , (1/n) Σ_{i=1}^n Xᵢ² ) − (µ, µ² + σ²) ] →d N(0, Σ),

where Σ is the covariance matrix of X and X². Since E Z⁴ = 3 and X =d µ + σZ, where Z is a standard normal random variable, we have

E X³ = µ³ + 3µσ²,
Var X² = µ⁴ + 6µ²σ² + 3σ⁴ − (µ² + σ²)² = 2σ²(2µ² + σ²).

Hence,

Σ = ( σ²        2µσ²
      2µσ²      2σ²(2µ² + σ²) ).
Now we wish to apply Theorem 1.4.1 with

g(x, y) = Φ( (b − x)/√(y − x²) ) − Φ( (a − x)/√(y − x²) ).

Evaluating the partial derivatives of g at x = µ and y = µ² + σ² (with ϕ the standard normal density), we see that they reduce to

∂g/∂x = ϕ((b − µ)/σ) · (bµ − (µ² + σ²))/σ³ − ϕ((a − µ)/σ) · (aµ − (µ² + σ²))/σ³,
∂g/∂y = ϕ((b − µ)/σ) · (µ − b)/(2σ³) − ϕ((a − µ)/σ) · (µ − a)/(2σ³).

Putting everything together yields the result. ¤
Many practical situations require knowledge about the location of the complete distribution. E.g.,
one would like to construct intervals that cover a certain percentage of a distribution. Such
intervals are known as tolerance intervals. Although they are of great practical importance, this
topic is ignored in many textbooks. Many practical applications (and theory) can be found in [1].
The monograph [9], the review paper [16] and the bibliographies [10, 11] are also excellent sources
of information on this topic.
In this section we will give an introduction to tolerance intervals based on the normal distribution.
It is also possible to construct intervals for other distributions (see e.g., [1]).
The random variable P (T (X1 , . . . , Xn )) is called the coverage of the tolerance interval.
This type of tolerance interval is sometimes called a guaranteed content interval. There also exists
a one-sided version of this type of tolerance interval.
There are interesting relations between these concepts and quantiles. Let X1 , . . . , Xn be a sample
from a continuous distribution P with distribution function F . Since
P (F (U ) ≤ β) = P (U ≤ F −1 (β)),
it follows immediately that an upper (lower) (α, β) tolerance limit is an upper (lower) confidence
interval for the quantile F −1 (β) and vice-versa.
Prediction intervals are usually associated with regression analysis, but also appear in other con-
texts as we shall see. The following proposition shows a surprising link between β-expectation
tolerance intervals and prediction intervals. It also has interesting corollaries as we shall see later
on.
E (E (Y | X)) = E Y.
Hence, rewriting the probability in the definition of prediction interval in terms of an expectation,
we obtain:
as required. ¤
Proof: We first have to show that the expectation is finite. Note that the coverage may be written in this case as

F(X̄ + kS) − F(X̄ − kS) = Φ( (X̄ − µ)/σ + k S/σ ) − Φ( (X̄ − µ)/σ − k S/σ ).

Since 0 ≤ Φ(x) ≤ 1 for all x ∈ R, it suffices to show that the expectations of X̄ and S are finite. The first expectation is trivial, while the second one follows from Exercise 4.
Now Proposition 1.5.5 yields that it suffices to choose k such that (X̄ − kS, X̄ + kS) is a 100β% prediction interval. In other words, k must be chosen such that

P( X̄ − kS < X < X̄ + kS ) = β,

where X is independent of X1, . . . , Xn, but follows the same normal distribution. Hence, X − X̄ =d N(0, σ²(1 + 1/n)). Thus we have the following equalities:

β = P( X̄ − kS < X < X̄ + kS )
  = P( −k < (X − X̄)/S < k )
  = P( −k < √(1 + 1/n) T_{n−1} < k ),

from which we see that we must choose k = √(1 + 1/n) t_{n−1;(1−β)/2}. ¤
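A small sketch of the resulting β-expectation tolerance interval (X̄ − kS, X̄ + kS), with hypothetical data; here t_{n−1;(1−β)/2} is obtained as the (1 + β)/2 quantile of the t_{n−1}-distribution:

import numpy as np
from scipy.stats import t

def beta_expectation_interval(x, beta=0.95):
    """beta-expectation tolerance interval (Xbar - k S, Xbar + k S) for a normal sample."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar, s = x.mean(), x.std(ddof=1)
    # upper (1 - beta)/2 critical point of t_{n-1}, i.e. its (1 + beta)/2 quantile
    k = np.sqrt(1.0 + 1.0 / n) * t.ppf((1.0 + beta) / 2.0, df=n - 1)
    return xbar - k * s, xbar + k * s

rng = np.random.default_rng(4)
sample = rng.normal(10.0, 0.5, size=25)               # hypothetical data
print(beta_expectation_interval(sample, beta=0.95))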
For several explicit tolerance intervals under different assumptions, we refer to the exercises. There
is no closed solution for (α, β) tolerance intervals for normal distributions when both µ and σ 2 are
unknown; numerical procedures for this problem can be found in [4].
where |Ij | denotes the length of the interval Ij . It is clear that the histogram is a stepwise constant
function. Two major disadvantages of the histogram are
• the stepwise constant nature of the histogram
• the fact that the histogram heavily depends on the choice of the partition
In order to illustrate the last point, consider the following two histograms that represent the same
data set:
[Figure: two histograms, based on different partitions, of the same sample of size 50 from a mixture of 2 normal distributions.]
It is because of this phenomenon that histograms are not to be recommended. A natural way to
improve on histograms is to get rid of the fixed partition by putting an interval around each point.
If h > 0 is fixed, then

N̂_n(x) := P_n( (x − h, x + h) ) / (2h)    (1.10)
is called the naive density estimator and was introduced in 1951 by Fix and Hodges in an unpub-
lished report (reprinted in [5]) dealing with discriminant analysis. The motivation for the naive
estimator is that

P(x − h < X < x + h) = ∫_{x−h}^{x+h} f(t) dt ≈ 2h f(x).    (1.11)
Note that the naive estimator is a local procedure; it uses only the observations close to the point
at which one wants to estimate the unknown density. Compare this with the empirical distribution
function, which uses all observations to the right of the point at which one is estimating.
It is intuitively clear from (1.11) that the bias of N̂_n decreases as h tends to 0. However, if h tends to 0, then one is using fewer and fewer observations, and hence the variance of N̂_n increases.
This phenomenon occurs often in density estimation. The optimal value of h is a compromise
between the bias and the variance. We will return to this topic of great practical importance when
we discuss the MSE.
The naive estimator is a special case of the following class of density estimators. Let K be a kernel function, that is, a nonnegative function such that

∫_{−∞}^{∞} K(x) dx = 1.    (1.12)

The kernel estimator with kernel K and bandwidth h is defined by

f̂_n(x) := (1/n) Σ_{i=1}^n (1/h) K( (x − Xᵢ)/h ).    (1.13)
Thus, the kernel indicates the weight that each observation receives in estimating the unknown
density. It is easy to verify that kernel estimators are densities and that the naive estimator is a
kernel estimator with kernel

K(x) = ½ if |x| < 1, and K(x) = 0 otherwise.
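A direct (if slow) implementation of (1.13) is straightforward; the sketch below supports the Gaussian and the naive/rectangular kernel and uses illustrative data from a mixture of two normal distributions:

import numpy as np

def kernel_estimate(x_grid, data, h, kernel="gaussian"):
    """Kernel density estimate (1/(n h)) sum_i K((x - X_i)/h), evaluated on a grid of points."""
    u = (np.asarray(x_grid, dtype=float)[:, None] - np.asarray(data, dtype=float)[None, :]) / h
    if kernel == "gaussian":
        k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    elif kernel == "naive":                           # rectangular kernel: the naive estimator
        k = 0.5 * (np.abs(u) < 1.0)
    else:
        raise ValueError("unknown kernel")
    return k.mean(axis=1) / h

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(-2, 1, 25), rng.normal(2, 1, 25)])   # mixture of 2 normals
grid = np.linspace(-5.0, 5.0, 201)
f_hat = kernel_estimate(grid, data, h=0.6)
print(f_hat.sum() * (grid[1] - grid[0]))              # integrates to approximately 1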
Remark 1.6.1 The kernel estimator can also be written in terms of the empirical distribution
function F_n:

f̂_n(x) = ∫_{−∞}^{∞} (1/h) K( (x − y)/h ) dF_n(y),
where the integral is a Stieltjes integral.
Examples of other kernels include:

name                 function
Gaussian             (1/√(2π)) e^{−x²/2}
naive/rectangular    ½ · 1_{(−1,1)}(x)
triangular           (1 − |x|) · 1_{(−1,1)}(x)
biweight             (15/16) (1 − x²)² · 1_{(−1,1)}(x)
Epanechnikov         (3/4) (1 − x²) · 1_{(−1,1)}(x)
The kernel estimator is a widely used density estimator. A good impression of kernel estimation
is given by the books [20] and [22]. For other types of estimators, we refer to [20] and [21].
Proof: This follows from the fact that for a random variable X with density f, we have E g(X) = ∫_{−∞}^{∞} g(x) f(x) dx. ¤
( f̂_n(x) )² = (1/n²) Σ_{i=1}^n (1/h²) K²( (x − Xᵢ)/h ) + (1/n²) Σ_{i≠j} (1/h²) K( (x − Xᵢ)/h ) K( (x − Xⱼ)/h ).

Then

E ( f̂_n(x) )² = (1/(nh²)) ∫_{−∞}^{∞} K²( (x − y)/h ) f(y) dy + ((n − 1)/(nh²)) ( ∫_{−∞}^{∞} K( (x − y)/h ) f(y) dy )².

Next use Theorem 4.2 and the well-known fact that Var X = E X² − (E X)². ¤
The following general result due to Rosenblatt (see [19] for a slightly more general result) shows
that we cannot have unbiasedness for all x.
Theorem 1.6.4 (Rosenblatt [19]) A kernel estimator cannot be unbiased for all x ∈ R.

Proof: We argue by contradiction. Assume that E f̂_n(x) = f(x) for all x ∈ R. Then ∫_a^b f̂_n(x) dx is an unbiased estimator for F(b) − F(a), since

E ∫_a^b f̂_n(x) dx = ∫_a^b E f̂_n(x) dx = ∫_a^b f(x) dx = F(b) − F(a),
where the interchange of integrals is allowed since the integrand is positive. Now it can be shown
that the only unbiased estimator of F (b) − F (a) symmetric in X1 , . . . , Xn is Fn (b) − Fn (a). This
leads to a contradiction, since it implies that the empirical distribution function is differentiable.
¤
For point estimators, the MSE is a useful concept. We now generalize this concept to density
estimators.
Theorem 1.6.6 For a kernel density estimator f̂_n with kernel K the MSE and MISE can be expressed as:

MSE_x(f̂_n) = (1/(nh²)) ∫_{−∞}^{∞} K²( (x − y)/h ) f(y) dy − (1/(nh²)) { ∫_{−∞}^{∞} K( (x − y)/h ) f(y) dy }²
              + ( (1/h) ∫_{−∞}^{∞} K( (x − y)/h ) f(y) dy − f(x) )².    (1.18)

MISE(f̂_n) = (1/(nh²)) ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} K²( (x − y)/h ) f(y) dy − { ∫_{−∞}^{∞} K( (x − y)/h ) f(y) dy }² ) dx
             + ∫_{−∞}^{∞} ( (1/h) ∫_{−∞}^{∞} K( (x − y)/h ) f(y) dy − f(x) )² dx.    (1.19)
Proof: Combination of Exercise 14 with formulas (1.14) and (1.15) yields the formula for the
MSE. Integrating this formula with respect to x, we obtain the formula for the MISE. ¤
The above formulas cannot in general be evaluated explicitly. When both the kernel and the unknown density are Gaussian, straightforward but tedious computations yield explicit formulas, as shown in [8]. These formulas were extended in [14] to the case of mixtures of normal distributions. Marron and Wand claim in [14] that the class of mixtures of normal distributions is very rich and that it is thus possible to perform exact calculations for many distributions. These
calculations can be used to choose an optimal bandwidth h (see [14] for details).
For other examples of explicit MSE calculations, we refer to [3] and the exercises.
We conclude this section with a note on the use of Fourier analysis. Recall that the convolution
of two functions g1 and g2 is defined as

(g1 ∗ g2)(x) := ∫_{−∞}^{∞} g1(t) g2(x − t) dt.

One of the elementary properties of the Fourier transform is that it transforms the complicated convolution operation into the elementary multiplication operation, i.e. F(g1 ∗ g2) = (F g1) · (F g2).
The formulas (1.14) and (1.15) show that E f̂_n(x) and Var f̂_n(x) can be expressed in terms of convolutions of the kernel with the unknown density. The exercises contain examples in which Fourier transforms yield explicit formulas for the mean and the variance of the kernel estimator.
Another (even more important) use of Fourier transforms is the computation of the kernel estimate
itself. Computing density estimates directly from the definition is often very time consuming.
Define the function u by

u(s) = (1/n) Σ_{j=1}^n e^{i s Xⱼ}.    (1.20)
Then the Fourier transform of the kernel estimator is the product of u and the Fourier transform of the kernel (see Exercise 18). Using the Fast Fourier Transform (FFT), one can efficiently compute
good approximations to the kernel estimates. For details we refer to [20, pp. 61-66] and [22,
Appendix D].
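As a rough illustration of the idea (a sketch only, not the exact algorithms described in [20] or [22]): bin the observations on a regular grid and compute the convolution of the bin counts with the kernel by FFT.

import numpy as np

def fft_kde(data, h, grid_size=512, pad=3.0):
    """Approximate Gaussian kernel estimate via binning and FFT-based (circular) convolution."""
    data = np.asarray(data, dtype=float)
    lo, hi = data.min() - pad * h, data.max() + pad * h
    grid = np.linspace(lo, hi, grid_size)
    dx = grid[1] - grid[0]
    counts, _ = np.histogram(data, bins=grid_size, range=(lo, hi))     # nearest-bin approximation
    # Gaussian kernel (1/h) K(./h) sampled on the periodic grid, centred at 0
    offsets = np.minimum(np.arange(grid_size), grid_size - np.arange(grid_size)) * dx
    kern = np.exp(-0.5 * (offsets / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    # circular convolution of the bin counts with the kernel, computed with the FFT
    f_hat = np.fft.irfft(np.fft.rfft(counts) * np.fft.rfft(kern), n=grid_size) / data.size
    return grid, f_hat

rng = np.random.default_rng(6)
grid, f_hat = fft_kde(rng.normal(size=500), h=0.3)
print(f_hat.sum() * (grid[1] - grid[0]))              # approximately 1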
Theorem 1.6.7 (Bochner) Let K be a bounded kernel function such that lim|y|→∞ y K(y) = 0.
Define for any absolutely integrable function g the functions
g_n(x) := (1/h_n) ∫_{−∞}^{∞} K( y/h_n ) g(x − y) dy,

where (h_n)_{n∈N} is a sequence of positive numbers such that lim_{n→∞} h_n = 0. If g is continuous at x, then we have

lim_{n→∞} g_n(x) = g(x).    (1.21)
Proof: Since ∫_{−∞}^{∞} (1/h_n) K( y/h_n ) dy = ∫_{−∞}^{∞} K(y) dy = 1, we may write

|g_n(x) − g(x)| = | g_n(x) − g(x) ∫_{−∞}^{∞} (1/h_n) K( y/h_n ) dy |
                ≤ ∫_{−∞}^{∞} |g(x − y) − g(x)| (1/h_n) K( y/h_n ) dy.
Let δ > 0 be arbitrary. We now split the integration interval into 2 parts: {y : |y| ≥ δ} and
{y : |y| < δ}. The first integral can be bounded from above by
∫_{|y|≥δ} ( |g(x − y)|/|y| ) ( |y|/h_n ) K( y/h_n ) dy + |g(x)| ∫_{|y|≥δ} (1/h_n) K( y/h_n ) dy
    ≤ ( sup_{|v|≥δ/h_n} |v K(v)| / δ ) ∫_{|y|≥δ} |g(x − y)| dy + |g(x)| ∫_{|t|≥δ/h_n} K(t) dt
    ≤ ( sup_{|v|≥δ/h_n} |v K(v)| / δ ) ∫_{−∞}^{∞} |g(u)| du + |g(x)| ∫_{|t|≥δ/h_n} K(t) dt.
Letting n → ∞ and using that K is absolutely integrable, we see that these terms can be made
arbitrarily small. The integral over the second region can be bounded from above by
sup_{|y|<δ} |g(x − y) − g(x)| ∫_{|t|<δ/h_n} K(t) dt ≤ sup_{|y|<δ} |g(x − y) − g(x)|.
Since this holds for all δ > 0 and g is continuous at x, the above expression can be made arbitrarily
small. ¤.
As a corollary, we obtain the following asymptotic results (taken from [17]) for the mean and
variance of the kernel estimator at a point x.
Corollary 1.6.8 (Parzen) Let f̂_n be a kernel estimator such that its kernel K is bounded and satisfies lim_{|y|→∞} y K(y) = 0. Then f̂_n is an asymptotically unbiased estimator for f at all continuity points x if lim_{n→∞} h_n = 0.
In the above corollary, there is no restriction on the rate at which (h_n)_{n∈N} converges to 0. The next corollaries show that if (h_n)_{n∈N} converges to 0 more slowly than n^{−1}, then f̂_n(x) is consistent in the sense that the MSE converges to 0.
Corollary 1.6.9 (Parzen) Let f̂_n be a kernel estimator such that its kernel K is bounded and satisfies lim_{|y|→∞} y K(y) = 0. If lim_{n→∞} h_n = 0 and x is a continuity point of the unknown density f, then

lim_{n→∞} n h_n Var f̂_n(x) = f(x) ∫_{−∞}^{∞} K²(y) dy.
Proof: First note that since K is bounded, K 2 also satisfies the conditions of Theorem 1.6.7.
Hence, the result follows from applying Theorem 1.6.7 and Exercise 21 to Formula (1.15). ¤.
Corollary 1.6.10 (Parzen) Let f̂_n be a kernel estimator such that its kernel K is bounded and satisfies lim_{|y|→∞} y K(y) = 0. If lim_{n→∞} h_n = 0, lim_{n→∞} n h_n = ∞ and x is a continuity point of the unknown density f, then

lim_{n→∞} MSE_x(f̂_n) = 0.

Proof: It follows from Corollary 1.6.9 that lim_{n→∞} Var f̂_n(x) = 0. The result now follows by combining Corollary 1.6.8 and Exercise 14. ¤
Although the above theorems give insight into the asymptotic behaviour of density estimators, they are not sufficient for practical purposes. Therefore, we now refine them by using Taylor expansions.
Theorem 1.6.11 Let f̂_n be a kernel estimator such that its kernel K is bounded and symmetric and such that ∫_{−∞}^{∞} |t³| K(t) dt exists and is finite. If the unknown density f has a bounded third derivative, then we have that

E f̂_n(x) = f(x) + ½ h² f″(x) ∫_{−∞}^{∞} t² K(t) dt + o(h²),    h ↓ 0    (1.22)

Var f̂_n(x) = (1/(nh)) f(x) ∫_{−∞}^{∞} K²(t) dt + o(1/(nh)),    h ↓ 0 and nh → ∞    (1.23)

MSE_x(f̂_n) = (1/(nh)) f(x) ∫_{−∞}^{∞} K²(t) dt + ¼ h⁴ (f″(x))² ( ∫_{−∞}^{∞} t² K(t) dt )² + o(1/(nh)) + o(h⁴),
    h ↓ 0 and nh → ∞.    (1.24)
Proof: By Formula (1.14) and a change of variables, we may write the bias as

E f̂_n(x) − f(x) = ∫_{−∞}^{∞} K(t) { f(x − th) − f(x) } dt.

Now Taylor's Theorem with the Lagrange form of the remainder says that

f(x − th) = f(x) − th f′(x) + ((th)²/2) f″(x) − ((th)³/3!) f‴(ξ),

where ξ depends on x, t, and h and is such that |x − ξ| < |th|. Since ∫_{−∞}^{∞} K(t) dt = 1, it follows that

E f̂_n(x) − f(x) = ∫_{−∞}^{∞} K(t) ( −th f′(x) + ((th)²/2) f″(x) − ((th)³/3!) f‴(ξ) ) dt,

which because of the symmetry of K simplifies to

E f̂_n(x) − f(x) = ∫_{−∞}^{∞} K(t) ( ((th)²/2) f″(x) − ((th)³/3!) f‴(ξ) ) dt.

If M denotes an upper bound for f‴, then the first result follows from

| E f̂_n(x) − f(x) − ½ h² f″(x) ∫_{−∞}^{∞} t² K(t) dt | ≤ (h³/3!) ∫_{−∞}^{∞} | t³ K(t) f‴(ξ) | dt ≤ (h³/3!) M ∫_{−∞}^{∞} |t³ K(t)| dt,

where the last term obviously is o(h²).
The asymptotic expansion of the variance follows immediately from Corollary 1.6.9. In order
to obtain the asymptotic expansion for the MSE, it suffices to combine Exercise 14 with Formu-
las (1.22) and (1.23). ¤
These asymptotic expressions are much easier to interpret than the exact expressions of the previous section. For example, we can now clearly see that the bias decreases when h is small and that the variance decreases when h is large (cf. our discussion of the naive density estimator).
Theorem 1.6.11 is essential for obtaining optimal choices of the bandwidth. If we assume that f″ is square integrable, then it follows from Formula (1.24) that for h ↓ 0 and nh → ∞:

MISE(f̂_n) = (1/(nh)) ∫_{−∞}^{∞} K²(t) dt + ¼ h⁴ ∫_{−∞}^{∞} (f″(x))² dx ( ∫_{−∞}^{∞} t² K(t) dt )² + o(1/(nh)) + o(h⁴).    (1.25)
The expression

(1/(nh)) ∫_{−∞}^{∞} K²(t) dt + ¼ h⁴ ∫_{−∞}^{∞} (f″(x))² dx ( ∫_{−∞}^{∞} t² K(t) dt )²    (1.26)
is called the asymptotic MISE, often abbreviated as AMISE. Note that Formula (1.26) is much
easier to understand than Formula (1.19). We now see (cf. Exercise 22) how to balance between
squared bias and variance in order to obtain a choice of h that minimizes the MISE:
h_AMISE = ( ∫_{−∞}^{∞} K²(t) dt / ( n ( ∫_{−∞}^{∞} t² K(t) dt )² ∫_{−∞}^{∞} (f″(x))² dx ) )^{1/5}.    (1.27)
An important drawback of Formula (1.27) is that it depends on ∫_{−∞}^{∞} (f″(x))² dx, which is unknown. However, there are good methods for estimating this quantity. For details, we refer to the literature ([20, 22]). An example of a simple method is given in Exercise 23.
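One simple possibility, presumably in the spirit of Exercise 23, is to evaluate ∫(f″)²(x) dx as if f were a normal density with the estimated standard deviation; for the Gaussian kernel this gives a normal-reference bandwidth of roughly 1.06 σ̂ n^{−1/5}. A sketch (the normal-reference assumption is ours, not a prescription from the text):

import numpy as np

def amise_bandwidth(r_kernel, mu2_kernel, r_f2, n):
    """h_AMISE from R(K) = int K^2, mu2 = int t^2 K(t) dt and R(f'') = int (f'')^2, cf. (1.27)."""
    return (r_kernel / (n * mu2_kernel ** 2 * r_f2)) ** 0.2

def normal_reference_bandwidth(data):
    """Gaussian kernel, with int (f'')^2 evaluated as if f were normal (an assumption about f)."""
    data = np.asarray(data, dtype=float)
    n, sigma_hat = data.size, data.std(ddof=1)
    r_kernel = 1.0 / (2.0 * np.sqrt(np.pi))               # int K^2 for the Gaussian kernel
    mu2_kernel = 1.0                                      # int t^2 K(t) dt for the Gaussian kernel
    r_f2 = 3.0 / (8.0 * np.sqrt(np.pi) * sigma_hat ** 5)  # int (f'')^2 for a N(mu, sigma^2) density
    return amise_bandwidth(r_kernel, mu2_kernel, r_f2, n) # roughly 1.06 * sigma_hat * n**(-1/5)

rng = np.random.default_rng(7)
print(normal_reference_bandwidth(rng.normal(size=400)))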
Given an optimal choice of the bandwidth h, we may wonder which kernel gives the smallest
MISE. It turns out that the Epanechnikov kernel is the optimal kernel. However, the other kernels
perform nearly as well, so that the optimality property of the Epanechnikov kernel is not very
important in practice. For details, we refer to [20, 22].
1.7 Exercises
In all exercises X, X1 , X2 , . . . are independent identically distributed normal random variables with
mean µ and variance σ 2 , unless otherwise stated.
Exercise 1 Assume that the main characteristic of a production process follows a normal distribution and that Cp equals 1.33.
a) What is the percentage of non-conforming items if the process is centred (that is, if µ = (USL + LSL)/2)?
Exercise 6 Construct a β-expectation tolerance interval in the trivial case when both µ and σ 2
are known.
Exercise 9 Construct a β-content tolerance interval at confidence level α in the trivial case when
both µ and σ 2 are known. What values can α take?
Exercise 11 Construct a β-content tolerance interval at confidence level α when µ is known and
σ 2 is unknown.
Exercise 16 Calculate the following estimators for µ and σ², respectively, where f̂_n is a kernel estimator for f with a symmetric kernel K (that is, K(x) = K(−x)):

a) µ̂ = ∫_{−∞}^{∞} x f̂_n(x) dx.

b) σ̂² = ∫_{−∞}^{∞} (x − µ̂)² f̂_n(x) dx.
Exercise 17 Verify by direct computation that the naive estimator is biased in general.
Exercise 18 Use (1.20) to find a formula for the Fourier transform of fbn .
Exercise 19 Suppose that K is a symmetric kernel, i.e. K(x) = K(−x). Show that MISE(f̂_n) equals

(1/(2πnh)) ∫_{−∞}^{∞} (FK)²(t) dt + (1/(2π)) ∫_{−∞}^{∞} { (1 − 1/n) (FK)²(ht) − 2 FK(ht) + 1 } |Ff(t)|² dt.
Exercise 20 The Laplace kernel is defined by K(x) := ½ e^{−|x|}. Use the results of the previous
exercise to derive an expression for the MISE of the kernel estimator with the Laplace kernel when
the density is an exponential density.
Exercise 22 Prove Formula (1.27) for the optimal bandwidth based on the AMISE.
How can this be used to select a bandwidth? What is the rationale behind this bandwidth choice?
Exercise 24 Suppose that we take hn = c n−γ where c > 0 and γ ∈ (0, 1). Which value of γ gives
the optimal rate of convergence for the MSE?
Bibliography
[1] J. Aitchison and I. Dunsmore, Statistical Prediction Analysis, Cambridge University Press,
1975.
[2] R.B. D’Agostino and M.A. Stephens (eds.), Goodness-of-fit Techniques, Marcel Dekker, New
York, 1986.
[4] K.R. Eberhardt, R.W. Mee and C.P. Reeve, Computing factors for exact two-sided tolerance
limits for a normal distribution, Communications in Statistics - Simulation and Computation
18 (1989), 397–413.
[5] E. Fix and J.L. Hodges, Discriminatory analysis - nonparametric discrimination: consistency
properties, International Statistical Reviews 57 (1989), 238–247.
[6] J.L. Folks, D.A. Pierce and C. Stewart, Estimating the fraction of acceptable product, Tech-
nometrics 7 (1965), 43–50.
[7] W.C. Guenther, A note on the Minimum Variance Unbiased estimate of the fraction of a
normal distribution below a specification limit, American Statistician 25 (1971), 18–20.
[8] M.J. Fryer, Some errors associated with the non-parametric estimation of density functions,
Journal of the Institute of Mathematics and its Applications 18 (1976), 371–380.
[9] I. Guttman, Statistical Tolerance regions: Classical and Bayesian, Charles Griffin, 1970.
[11] M. Jı́lek and H. Ackermann, A bibliography of statistical tolerance regions II, Statistics 20
(1989), 165–172.
[12] V.E. Kane, Process Capability Indices, J. Qual. Technol. 18 (1986), 41–52.
[13] S. Kotz and N.L. Johnson, Process Capability Indices, Chapman and Hall, London, 1993.
[14] J.S. Marron and M.P. Wand, Exact mean integrated squared error, Annals of Statistics 20
(1992), 712–736.
[15] D.B. Owen, A survey of properties and applications of the noncentral t-distribution, Techno-
metrics 10 (1968), 445–478.
[16] J.K. Patel, Tolerance limits - a review, Communications in Statistics A - Theory and Methods
15 (1986), 2719–2762.
[17] E. Parzen, On estimation of a probability density function and mode, Annals of Mathematical
Statistics 33 (1962), 1065–1076.
[18] E. Paulson, A note on control limits, Annals of Mathematical Statistics 14 (1943), 90–93.
[23] D.J. Wheeler, The variance of an estimator in variables sampling, Technometrics 12 (1970),
751–755.
[24] P.W. Zehna, Invariance of Maximum Likelihood estimation, Annals of Mathematical Statistics
37 (1966), 744.