stat2602_chapter4
§ 4.1 Introduction
The answer is: we can specify a “margin of error” for the point estimate from the
sample, which, when added to and subtracted from the sample result, gives us a
range of values. We use the whole range to estimate the population mean. Then we
can justify this range by considering our confidence that this range contains the
actual population mean. Such a range is called a confidence interval.
Definition
Let \( \mathbf{X} = (X_1, X_2, \ldots, X_n) \) be a random vector of a statistical model with
parameter vector \( \boldsymbol{\theta} \). Let \( \theta \) denote a specific parameter, and let \( L(\mathbf{X}) \) and \( U(\mathbf{X}) \)
denote two statistics that satisfy
Example 4.1
which is equivalent to \( P(0 < \lambda < 3/X) = 0.9502 \). Note that the equation holds for all
values of \( \lambda \in (0, \infty) \). Therefore \( (0,\ 3/X) \) is a 95.02% confidence interval for \( \lambda \).
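The 95.02% figure can be checked numerically. The sketch below assumes, as the value 0.9502 = 1 - exp(-3) suggests, that X is a single observation from an exponential distribution with rate parameter `rate`; the function name and the simulation settings are illustrative choices, not part of the original example.

```python
import math
import random

def coverage_of_0_to_3_over_x(rate, reps=20000, seed=1):
    """Estimate how often the random interval (0, 3/X) contains `rate`
    when X is one draw from an exponential distribution with that rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = rng.expovariate(rate)      # X ~ Exponential(rate)
        if rate * x < 3.0:             # same event as rate < 3 / X
            hits += 1
    return hits / reps

exact = 1.0 - math.exp(-3.0)           # P(rate * X < 3), the same for every rate
for rate in (0.5, 2.0, 10.0):
    print(rate, round(coverage_of_0_to_3_over_x(rate), 3), round(exact, 4))
```

The simulated proportions should stay close to the exact value for every rate tried, which is what "the equation holds for all values of the parameter" means in practice.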
Stat2602 Probability and Statistics II Fall 2014-2015
Example 4.2
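Suppose historical data suggest that the lifetime of a particular electronic device
follows an exponential distribution with unknown parameter \( \lambda \). Suppose that we
observe a random sample of size \( n \), i.e. \( X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Exponential}(\lambda) \). From
Example 2.1, the sampling distribution of the sample mean was determined as
\[ \bar{X} \sim \text{Gamma}(n,\ n\lambda). \]
Hence
\[ 2n\lambda\bar{X} \sim \text{Gamma}\left(n,\ \tfrac{1}{2}\right) \equiv \chi^2_{2n}, \]
and a \( 100(1-\alpha)\% \) confidence interval for \( \lambda \) is
\[ \left( \frac{\chi^2_{2n,\,1-\alpha/2}}{2n\bar{X}},\ \frac{\chi^2_{2n,\,\alpha/2}}{2n\bar{X}} \right), \]
where \( \chi^2_{2n,\,\alpha} \) denotes the upper \( \alpha \) point of the \( \chi^2_{2n} \) distribution.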
Example 4.3
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1). \]
Using the percentiles of the standard normal distribution, we have the following
probability statement:
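\[ P\left( -Z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le Z_{\alpha/2} \right) = 1 - \alpha \quad \text{for all } \mu \in (-\infty, \infty). \]
Rearranging the inequalities gives
\[ P\left( \bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha \quad \text{for all } \mu \in (-\infty, \infty). \]
Therefore \( \left( \bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right) \) is a \( 100(1-\alpha)\% \) confidence interval for \( \mu \).
For simplicity, it is often denoted as \( \bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \). The quantity \( Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \) is called
the margin of error, which essentially represents an upper bound of the estimation
error.
Note that if \( \sigma \) is unknown (which is the usual situation), we may use the pivot
variable
\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}. \]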
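As a minimal sketch of the known-variance interval described in this example, the function below computes the sample mean plus and minus the margin of error using only the Python standard library. The function name `z_interval` and the inputs in the usage lines are hypothetical and serve only to exercise the formula.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf=0.95):
    """Return (lower, upper, margin) for xbar +/- z_{alpha/2} * sigma / sqrt(n)."""
    alpha = 1.0 - conf
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 point of N(0, 1)
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin, margin

# Hypothetical inputs, chosen only to exercise the function.
lo, hi, margin = z_interval(xbar=50.0, sigma=8.0, n=64, conf=0.95)
print(round(lo, 2), round(hi, 2), round(margin, 2))
```

When the standard deviation is unknown, the same structure applies with S in place of sigma and a t-table value in place of `z`; the standard library has no t quantile function, so that value would have to be supplied from a table.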
Example 4.4
A paint manufacturer wants to determine the average drying time of a new interior
wall paint. The drying times (in minutes) for 12 test areas of equal size are
obtained below.
58, 60, 61, 61, 68, 64, 76, 62, 73, 75, 59, 79
Assuming normal distribution of the drying time, a 95% C.I. for the mean drying
time is given by
\[ 66.33 \pm t_{11,\,0.025}\frac{7.51}{\sqrt{12}} = 66.33 \pm 2.201 \times \frac{7.51}{\sqrt{12}} = 66.33 \pm 4.77 = (61.56,\ 71.10) \]
Note that the range \( 61.56 \le \mu \le 71.10 \) is just a conjecture about the population
mean. We can’t claim that the true average drying time must be between 61.56
mins and 71.10 mins. We can only claim that we have high confidence (95%) that
the range can cover the true mean.
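The arithmetic in this example can be reproduced with a short script. This is a sketch using only the standard library; the t-value 2.201 is taken from the example rather than computed, and the variable names are illustrative.

```python
import math
import statistics

drying_times = [58, 60, 61, 61, 68, 64, 76, 62, 73, 75, 59, 79]
t_11_025 = 2.201                       # t_{11, 0.025}, table value quoted in the example

n = len(drying_times)
xbar = statistics.mean(drying_times)   # sample mean
s = statistics.stdev(drying_times)     # sample standard deviation (divisor n - 1)
margin = t_11_025 * s / math.sqrt(n)   # margin of error

print(f"mean = {xbar:.2f}, S = {s:.2f}, margin = {margin:.2f}")
print(f"95% C.I. = ({xbar - margin:.2f}, {xbar + margin:.2f})")
```

Because the script carries full precision through each step, the upper limit can differ from the hand calculation in the last decimal place.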
Remarks
1. (Important note) The confidence interval \( (L(\mathbf{X}),\ U(\mathbf{X})) \) is a random interval. It
depends on the particular sample observations. The interpretation of a
confidence interval is that if we draw many independent random samples and
compute one interval from each sample, then about
\( 100(1-\alpha)\% \) of these intervals will contain the parameter \( \theta \). Hence the
confidence level \( 100(1-\alpha)\% \) is a probability describing the behaviour of all
possible intervals rather than just the one we obtain. The confidence level
represents the performance of the entire procedure. In Example 4.4, it is
incorrect to write \( P(61.56 \le \mu \le 71.10) = 0.95 \).
3. With a fixed confidence level, we can reduce the margin of error (hence
producing a shorter confidence interval) by increasing the sample size.
4. The confidence interval is not unique. For any specific confidence level \( 1-\alpha \),
there are infinitely many pairs of constants \( (c_1, c_2) \) satisfying \( P(c_1 \le V(\mathbf{X}, \theta) \le c_2) = 1-\alpha \).
In Example 4.4, \( \left( \bar{X} - t_{n-1,\,0.01}\frac{S}{\sqrt{n}},\ \bar{X} + t_{n-1,\,0.04}\frac{S}{\sqrt{n}} \right) \) is also a 95% confidence
interval for \( \mu \). For simplicity and for a shorter interval, a common practice is
to evenly allocate \( \alpha/2 \) as the tail areas to each side of \( (c_1, c_2) \). However, when
the parameter space has a finite boundary, we may allocate \( \alpha \) to only one side.
In Example 4.2, \( \left( 0,\ \frac{\chi^2_{2n,\,\alpha}}{2n\bar{X}} \right) \) is also a \( 100(1-\alpha)\% \) confidence interval for \( \lambda \).
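5. By the Central Limit Theorem, the pivot
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]
is approximately standard normal, so we can still use
\[ \bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]
for large samples, even when the population distribution is not normal. If \( \sigma^2 \) is
unknown, we can replace it by the sample variance \( S^2 \) because \( S^2 \) is a consistent
estimator of \( \sigma^2 \). Then the C.I. will be
\[ \bar{X} \pm Z_{\alpha/2}\frac{S}{\sqrt{n}}. \]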
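The repeated-sampling interpretation in the first remark can be illustrated by simulation. The sketch below draws many normal samples with a known standard deviation, builds the z-interval from each, and reports the fraction that contain the true mean. All parameter values and the function name are hypothetical.

```python
import math
import random
from statistics import NormalDist

def simulate_coverage(mu, sigma, n, conf=0.95, reps=4000, seed=2602):
    """Fraction of simulated z-intervals (known sigma) that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    margin = z * sigma / math.sqrt(n)          # same width for every sample
    covered = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - margin <= mu <= xbar + margin:
            covered += 1
    return covered / reps

print(simulate_coverage(mu=66.0, sigma=7.5, n=12))
```

The printed proportion should be close to the nominal level. Each individual interval either contains the mean or does not; the confidence level describes the long-run behaviour of the procedure.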
Suppose we have two populations of the same measurement which are distributed
as normal. We may want to compare the two population means or the population
variances, e.g. comparing the mean height of the people of two races. We can draw
random samples from these two populations independently and use these samples
to make inferences.
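Let
\[ X_1, X_2, \ldots, X_m \overset{iid}{\sim} N(\mu_x, \sigma_x^2); \qquad Y_1, Y_2, \ldots, Y_n \overset{iid}{\sim} N(\mu_y, \sigma_y^2). \]
Then
\[ \bar{X} - \bar{Y} \sim N\left( \mu_x - \mu_y,\ \frac{\sigma_x^2}{m} + \frac{\sigma_y^2}{n} \right), \]
so that
\[ Z = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\sigma_x^2/m + \sigma_y^2/n}} \sim N(0,1), \]
and, when \( \sigma_x^2 \) and \( \sigma_y^2 \) are known, a \( 100(1-\alpha)\% \) confidence interval for \( \mu_x - \mu_y \) is
\[ (\bar{X} - \bar{Y}) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_x^2}{m} + \frac{\sigma_y^2}{n}}. \]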
Usually, both x2 and y2 are unknown. In such case we may estimate them by the
sample variances and use the following pivot variable:
\[ T = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{S_x^2/m + S_y^2/n}} \sim t_v. \]
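Note that this sampling distribution is not exact; it is known as Satterthwaite’s
approximation, with the degrees of freedom \( v \) computed by
\[ v = \frac{\left( \dfrac{S_x^2}{m} + \dfrac{S_y^2}{n} \right)^2}{\dfrac{S_x^4}{m^2(m-1)} + \dfrac{S_y^4}{n^2(n-1)}}. \]
The corresponding approximate \( 100(1-\alpha)\% \) confidence interval for \( \mu_x - \mu_y \) is
\[ (\bar{X} - \bar{Y}) \pm t_{v,\,\alpha/2}\sqrt{\frac{S_x^2}{m} + \frac{S_y^2}{n}}. \]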
Remarks
Example 4.5
The effectiveness of a mental training program was tested in a military training
program. In an anti-aircraft artillery examination, scores for an experimental group
and a control group were recorded. According to the following data, does the
mental training program make a difference in their scores?
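Satterthwaite’s approximation:
\[ v = \frac{\left( \dfrac{20.96^2}{20} + \dfrac{23.94^2}{16} \right)^2}{\dfrac{20.96^4}{20^2(20-1)} + \dfrac{23.94^4}{16^2(16-1)}} \approx 30 \]
95% C.I. for \( \mu_x - \mu_y \): \( (-52.3,\ -21.2) \)
We have 95% confidence that \( \mu_x - \mu_y \le -21.2 \). We may conclude that the mean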
score of the experimental group is significantly lower than the mean score of the
control group by more than 21. Hence the mental training program would, on
average, result in a lower score in the anti-aircraft artillery examination.
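The degrees of freedom in this example can be checked with a few lines of code. The sketch below implements Satterthwaite's formula directly; the function name is illustrative, and the inputs are the sample standard deviations and sample sizes that appear in the example's formula (20.96 with 20 observations, 23.94 with 16 observations).

```python
def satterthwaite_df(s_x, m, s_y, n):
    """Approximate degrees of freedom for the unequal-variance t pivot."""
    a = s_x ** 2 / m                   # S_x^2 / m
    b = s_y ** 2 / n                   # S_y^2 / n
    # a^2 / (m - 1) equals S_x^4 / (m^2 (m - 1)), and likewise for b.
    return (a + b) ** 2 / (a ** 2 / (m - 1) + b ** 2 / (n - 1))

v = satterthwaite_df(s_x=20.96, m=20, s_y=23.94, n=16)
print(round(v, 1))
```

The result is close to 30, in line with the value used in the example. In practice the computed value is usually not an integer and is rounded before a t-table is consulted.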
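If the two population variances are equal, say \( \sigma_x^2 = \sigma_y^2 = \sigma^2 \), we can estimate the common variance by the pooled sample variance
\[ S_{pool}^2 = \frac{(m-1)S_x^2 + (n-1)S_y^2}{m+n-2}, \]
for which
\[ W = \frac{(m+n-2)S_{pool}^2}{\sigma^2} = \frac{(m-1)S_x^2}{\sigma^2} + \frac{(n-1)S_y^2}{\sigma^2} \sim \chi^2_{m+n-2}. \]
Since
\[ Z = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\sigma^2/m + \sigma^2/n}} = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sigma\sqrt{1/m + 1/n}} \sim N(0,1), \]
and \( Z \) is independent of \( W \), we have
\[ T = \frac{Z}{\sqrt{W/(m+n-2)}} = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{S_{pool}\sqrt{1/m + 1/n}} \sim t_{m+n-2}. \]
Hence a \( 100(1-\alpha)\% \) confidence interval for \( \mu_x - \mu_y \) is
\[ (\bar{X} - \bar{Y}) \pm t_{m+n-2,\,\alpha/2}\, S_{pool}\sqrt{\frac{1}{m} + \frac{1}{n}}. \]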
Example 4.6
A bank’s loan department found that 11 home loans processed during April had a
mean value of $78,100 and a standard deviation of $6300. An analysis of the 9
loans in May showed a mean value of $82,700 with a standard deviation of $7100.
Suppose these home loans represent independent random samples of the values of
home loan applications approved in the bank’s service area. Find a 98% confidence
interval for the increase in the mean level of approved home loan applications from
April to May.
Denote X as the home loan in April and Y as the home loan in May. From the data
we have
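\[ m = 11, \quad \bar{X} = 78100, \quad S_x = 6300 \]
\[ n = 9, \quad \bar{Y} = 82700, \quad S_y = 7100 \]
\[ S_{pool} = \sqrt{\frac{(11-1)\,6300^2 + (9-1)\,7100^2}{11+9-2}} = 6667 \]
A 98% C.I. for \( \mu_y - \mu_x \) is
\[ (82700 - 78100) \pm t_{18,\,0.01} \times 6667\sqrt{\frac{1}{11} + \frac{1}{9}} = 4600 \pm 2.552 \times 2997 = 4600 \pm 7648 = (-3048,\ 12248) \]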
Since zero is a possible value in the interval, we don’t have sufficient evidence to
conclude that the mean level of approved home loan applications has increased.
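The pooled-variance calculation in this example can be reproduced as follows. This is a sketch with illustrative variable names; the t-value 2.552 is the table value quoted in the example, not something the script computes.

```python
import math

m, xbar, s_x = 11, 78100.0, 6300.0     # April loans
n, ybar, s_y = 9, 82700.0, 7100.0      # May loans
t_18_01 = 2.552                        # t_{18, 0.01}, table value quoted in the example

# Pooled standard deviation under the equal-variance assumption.
s_pool = math.sqrt(((m - 1) * s_x ** 2 + (n - 1) * s_y ** 2) / (m + n - 2))
std_err = s_pool * math.sqrt(1.0 / m + 1.0 / n)
margin = t_18_01 * std_err
diff = ybar - xbar                     # estimated increase from April to May

print(f"S_pool = {s_pool:.0f}, margin = {margin:.0f}")
print(f"98% C.I. = ({diff - margin:.0f}, {diff + margin:.0f})")
```

The interval straddles zero because the margin of error exceeds the estimated increase, which is why the example cannot conclude that the mean level has risen.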
Remarks
1. Assuming equal population variances and then using the pooled sample variance
is a popular approach. However, there is no guarantee that the assumption
is satisfied. It relies on our basic knowledge about the variables in the problem
under consideration. For instance, if we believe that the effect of a certain
treatment will be more or less the same on all experimental objects, then it may
be regarded as a shift of the distribution without altering the variation. In
such a case it would be reasonable to assume equal population variances for the
treatment and control groups. Moreover, we may check the equal population
variance assumption by comparing the sample variances, as described in the next
section.
2. The above procedures for constructing confidence intervals were derived based
on the assumption that the population(s) is/are normal. If the normal assumption
is violated, the above procedures are still valid with large samples. For small
sample problems with non-normal population(s), we will need to use non-
parametric statistical methods, which are beyond the scope of this course.
3. Independence between the two samples is also a crucial assumption for the
above procedures. Usually it can be guaranteed by proper design of the
sampling or experimental procedures. However, in some typical experiments,
data are measured on the same objects and the results cannot be regarded as
independent samples, although the data look as if they were obtained from two samples.
\[ F = \frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2} \sim F(m-1,\ n-1) \]
Based on this pivot variable, the following probability equation holds for all
parameter values:
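\[ P\left( F_{m-1,\,n-1,\,1-\alpha/2} \le \frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2} \le F_{m-1,\,n-1,\,\alpha/2} \right) = 1-\alpha. \]
Hence a \( 100(1-\alpha)\% \) confidence interval for \( \sigma_x^2/\sigma_y^2 \) is
\[ \left( \frac{1}{F_{m-1,\,n-1,\,\alpha/2}}\frac{S_x^2}{S_y^2},\ \frac{1}{F_{m-1,\,n-1,\,1-\alpha/2}}\frac{S_x^2}{S_y^2} \right), \]
or equivalently,
\[ \left( \frac{1}{F_{m-1,\,n-1,\,\alpha/2}}\frac{S_x^2}{S_y^2},\ F_{n-1,\,m-1,\,\alpha/2}\frac{S_x^2}{S_y^2} \right), \]
as \( F_{r_1,\,r_2,\,1-\alpha} = \dfrac{1}{F_{r_2,\,r_1,\,\alpha}} \) according to the property of the F-distribution.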
Example 4.6
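\[ \left( \frac{1}{F_{m-1,\,n-1,\,\alpha/2}}\frac{S_x^2}{S_y^2},\ F_{n-1,\,m-1,\,\alpha/2}\frac{S_x^2}{S_y^2} \right) = \left( \frac{1}{F_{10,\,8,\,\alpha/2}} \times \frac{6300^2}{7100^2},\ F_{8,\,10,\,\alpha/2} \times \frac{6300^2}{7100^2} \right) = (0.251,\ 2.378) \]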
Since the interval contains 1, there is no evidence that the two population variances
are different.
Many surveys have their objective as the estimation of the proportion of people or
objects in a large group that possess a particular attribute. As described in Section
2.2.4, we can use the sample proportion p̂ to estimate the population proportion p
and the sampling distribution of p̂ can be well approximated by normal if the
sample size n is large enough. Therefore the interval estimate of p can be
constructed based on the normal approximation.
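Let \( X \) denote the number of objects in the sample possessing the attribute of
interest; then \( X \sim b(n, p) \). When \( n \) is large, by the Central Limit Theorem, we can use
the pivot variable
\[ \frac{X - np}{\sqrt{np(1-p)}} \overset{\cdot}{\sim} N(0,1), \]
which gives
\[ P\left( -Z_{\alpha/2} \le \frac{X - np}{\sqrt{np(1-p)}} \le Z_{\alpha/2} \right) \approx 1-\alpha \quad \text{for all } p \in (0,1). \]
Solving the quadratic inequality will give an approximate \( 100(1-\alpha)\% \) C.I. for \( p \):
\[ \frac{\left( 2\hat{p} + Z_{\alpha/2}^2/n \right) \pm \sqrt{\left( 2\hat{p} + Z_{\alpha/2}^2/n \right)^2 - 4\hat{p}^2\left( 1 + Z_{\alpha/2}^2/n \right)}}{2\left( 1 + Z_{\alpha/2}^2/n \right)}, \quad \text{where } \hat{p} = \frac{X}{n}. \]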
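This expression is somewhat nasty. To obtain a simpler formula, we may drop the
term \( Z_{\alpha/2}^2/n \) as it is negligible when \( n \) is large. As a result, the C.I. becomes
\[ \hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \]
Similarly, an approximate \( 100(1-\alpha)\% \) C.I. for the difference \( p_1 - p_2 \) between two population proportions is
\[ (\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}, \]
where \( n_1, n_2 \) are the sample sizes of the samples from the two populations and
\( \hat{p}_1, \hat{p}_2 \) are the corresponding sample proportions.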
Example 4.7
It was found that 41 people in a random sample of 500 persons from the labour
force of a country were unemployed. Since the sample size \( n = 500 \) is quite large, a
95% confidence interval for the rate of unemployment in the country can be
constructed as
\[ \hat{p} \pm Z_{0.025}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.082 \pm 1.96\sqrt{\frac{0.082 \times 0.918}{500}} = 0.082 \pm 0.024 = (5.8\%,\ 10.6\%) \]
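To see how much the simplification matters for these data, the sketch below computes both the simplified interval used in this example and the interval obtained by solving the quadratic inequality without dropping any terms. The function names are illustrative and only the standard library is used.

```python
import math
from statistics import NormalDist

def wald_interval(x, n, conf=0.95):
    """Simplified interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    p_hat = x / n
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def quadratic_interval(x, n, conf=0.95):
    """Interval from solving (X - n p)^2 <= z^2 * n * p * (1 - p) for p."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    p_hat = x / n
    c = z ** 2 / n
    b = 2.0 * p_hat + c
    root = math.sqrt(b ** 2 - 4.0 * p_hat ** 2 * (1.0 + c))
    denom = 2.0 * (1.0 + c)
    return (b - root) / denom, (b + root) / denom

print(wald_interval(41, 500))
print(quadratic_interval(41, 500))
```

With a sample of 500 the two intervals differ only in the third decimal place, which supports using the simpler formula when the sample is large.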
Example 4.8
We want to compare the proportion of defective electric motors turned out by two
shifts of workers. From the large number of motors produced in a given week,
\( n_1 = 50 \) motors were selected from the output of shift I and \( n_2 = 40 \) motors were
selected from the output of shift II. The sample from shift I revealed 4 to be
defective, and the sample from shift II showed 6 faulty motors.
Sample proportions: \( \hat{p}_1 = \dfrac{4}{50} = 0.08 \), \( \hat{p}_2 = \dfrac{6}{40} = 0.15 \)
Since the interval overlaps zero, zero cannot be ruled out as a plausible value of the
true difference between proportions of defective motors. Therefore there does not
appear to be any significant difference between the defective rates for the two
shifts.
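A minimal sketch of the two-proportion interval for these data follows. The confidence level used in the example is not shown in the text, so the 95% level below is an assumption; the function name is illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_interval(x1, n1, x2, n2, conf):
    """(p1_hat - p2_hat) +/- z * sqrt(p1_hat(1-p1_hat)/n1 + p2_hat(1-p2_hat)/n2)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    p1, p2 = x1 / n1, x2 / n2
    std_err = math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    diff = p1 - p2
    return diff - z * std_err, diff + z * std_err

# conf=0.95 is an assumption; the level used in the example is not shown.
lo, hi = two_proportion_interval(4, 50, 6, 40, conf=0.95)
print(round(lo, 3), round(hi, 3))
```

At this assumed level the lower limit is negative and the upper limit is positive, consistent with the conclusion that zero cannot be ruled out.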
Remarks
1. The above procedures rely on some approximations. First of all, the population
is assumed to be much larger than the sample so that it can be approximately
regarded as an infinite population. As a rule of thumb, the population size
should be at least ten times the sample size for an adequate approximation.
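3. When the sample size \( n \) is small, we can use the following formulae to
construct one-sided C.I.s for \( p \):
\[ [0,\ \hat{p}_U] \quad \text{where } \hat{p}_U \text{ satisfies } \sum_{j=0}^{X}\binom{n}{j}\hat{p}_U^{\,j}(1-\hat{p}_U)^{n-j} = \alpha, \]
or
\[ [\hat{p}_L,\ 1] \quad \text{where } \hat{p}_L \text{ satisfies } \sum_{j=X}^{n}\binom{n}{j}\hat{p}_L^{\,j}(1-\hat{p}_L)^{n-j} = \alpha. \]
Note that \( \hat{p}_L \) and \( \hat{p}_U \) are functions of \( X \) and therefore are statistics. It can be
proved that (refer to the supplementary notes) they satisfy the following
probability inequalities:
\[ P(p \le \hat{p}_U) \ge 1-\alpha, \qquad P(p \ge \hat{p}_L) \ge 1-\alpha. \]
Since the confidence level may be larger than the required \( 1-\alpha \), they are called
conservative confidence intervals.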
A common and practical consideration before any actual survey takes place is: how large should the sample be?
No statistician can answer this question without knowing how accurate the survey-
taker wishes the estimate to be. Usually, we will try to minimize the sample size
subject to some precision requirement.
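Suppose the margin of error is required to be at most \( D \). For the confidence interval
\[ \bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]
of a population mean, solving \( D \ge Z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}} \) gives \( n \ge \dfrac{Z_{\alpha/2}^2\sigma^2}{D^2} \).
For a population proportion, solving \( D \ge Z_{\alpha/2}\sqrt{\dfrac{p(1-p)}{n}} \) gives \( n \ge \dfrac{Z_{\alpha/2}^2\, p(1-p)}{D^2} \).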
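The two sample-size formulas translate directly into code. The sketch below rounds the bound up to the next integer, since a sample size must be a whole number; the function names and the planning values in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist

def n_for_mean(sigma, d, conf=0.95):
    """Smallest n with z * sigma / sqrt(n) <= d."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    return math.ceil((z * sigma / d) ** 2)

def n_for_proportion(p, d, conf=0.95):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= d."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    return math.ceil(z ** 2 * p * (1.0 - p) / d ** 2)

# Hypothetical planning values, for illustration only.
print(n_for_mean(sigma=400.0, d=120.0))
print(n_for_proportion(p=0.082, d=0.02))
```

Both functions need a prior guess for the unknown quantity (the standard deviation or the proportion), typically taken from past data or a pilot study.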
Example 4.9
\[ n \ge \frac{Z_{\alpha/2}^2\sigma^2}{D^2} = \frac{Z_{\alpha/2}^2\sigma^2}{120^2} = 42.68. \]
Example 4.10
In Example 4.4, the mean drying time of the new interior wall paint was estimated
to be from 61.56 mins to 71.10 mins, based on a sample with \( n = 12 \), \( \bar{X} = 66.33 \),
\( S = 7.51 \). The width of this interval is \( 71.10 - 61.56 = 9.54 \), which may not be
informative enough. Suppose we want a more informative interval estimate
such that the width of the interval is at most 8 mins. How many test areas
should we take as our sample?
Since the width of the confidence interval is two times the margin of error, the
precision requirement can be achieved with \( D = 4 \). Equating this to the margin of
error:
\[ 4 \ge t_{n-1,\,0.025}\frac{S}{\sqrt{n}} \]
Hence we have
\[ \frac{t_{n-1,\,0.025}}{\sqrt{n}} \le \frac{4}{7.51} = 0.533, \quad \text{i.e. } n \ge \left( \frac{t_{n-1,\,0.025}}{0.533} \right)^2. \]
It is hard to analytically solve this equation as it involves the inverse cdf of the t-
distribution. To find the smallest sample size to fulfil the requirement, we can use
the method of trial and error.
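Starting with \( n = 15 \), for which \( t_{14,\,0.025} = 2.145 \), we need
\[ n \ge \left( \frac{2.145}{0.533} \right)^2 = 16.20. \]
This result suggests that we may take \( n = 17 \), for which \( t_{16,\,0.025} = 2.120 \) and we need
\[ n \ge \left( \frac{2.120}{0.533} \right)^2 = 15.82. \]
With \( n = 16 \), for which \( t_{15,\,0.025} = 2.131 \), we need
\[ n \ge \left( \frac{2.131}{0.533} \right)^2 = 15.98. \]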
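If we want the confidence interval to be shorter than 3 mins, then the precision
requirement would be \( D = 1.5 \). Equating the precision requirement to the margin of
error:
\[ \frac{t_{n-1,\,0.025}}{\sqrt{n}} \le \frac{1.5}{7.51} = 0.200, \quad \text{i.e. } n \ge \left( \frac{t_{n-1,\,0.025}}{0.200} \right)^2. \]
For example, with \( t_{19,\,0.025} = 2.093 \) we need
\[ n \ge \left( \frac{2.093}{0.200} \right)^2 = 109.52. \]
We may need a much larger \( n \) to fulfil such a precision requirement. Since the
t-values are very close to Z-values when the degrees of freedom \( n-1 \) is large, we
may simply use \( Z_{0.025} = 1.96 \) to replace \( t_{n-1,\,0.025} \) and calculate
\[ n \ge \left( \frac{1.96}{0.200} \right)^2 = 96.04. \]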
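The trial-and-error search in this example can be written as a small loop. The sketch below uses only the t-table values quoted in the example and the ratio D/S rounded to 0.533 as in the text; the function name and the table dictionary are illustrative.

```python
# Upper 2.5% points of the t-distribution quoted in the example, keyed by
# degrees of freedom n - 1.
T_TABLE = {14: 2.145, 15: 2.131, 16: 2.120}

def smallest_n(ratio, t_table):
    """Smallest tabulated n with n >= (t_{n-1, 0.025} / ratio)^2, else None."""
    for df in sorted(t_table):
        n = df + 1
        required = (t_table[df] / ratio) ** 2
        if n >= required:
            return n
    return None

# ratio = D / S = 4 / 7.51, rounded to 0.533 as in the text.
print(smallest_n(0.533, T_TABLE))

# Large-sample shortcut for D = 1.5: replace the t-value by Z_{0.025} = 1.96.
print((1.96 / 0.200) ** 2)
```

The comparison at n = 16 is very close (16 against roughly 15.98), so the answer is sensitive to how the ratio 4/7.51 is rounded; taking one extra observation is a cheap safeguard.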
Example 4.11
In Example 4.6, the increase in mean level of approved home loan applications
from April to May was estimated as $4600 with a margin of error of $7648. The
result is inconclusive as the margin of error is larger than the estimated increase.
Suppose we want to have a more conclusive result next year such that the
estimation error can be reduced to as low as $4000. How many loan applications
do we need in order to achieve this requirement?
From the past data we have the pooled sample standard deviation calculated as
\( S_{pool} = 6667 \). For simplicity, we use equal sample sizes, i.e. \( n_1 = n_2 = n \). Equating
the precision requirement to the margin of error gives
\[ 4000 \ge t_{n+n-2,\,0.01}\, S_{pool}\sqrt{\frac{1}{n} + \frac{1}{n}} = t_{2n-2,\,0.01} \times 6667\sqrt{\frac{2}{n}}, \]
i.e.
\[ n \ge \left( \frac{6667\sqrt{2}}{4000}\, t_{2n-2,\,0.01} \right)^2 = \left( 2.357\, t_{2n-2,\,0.01} \right)^2. \]
Example 4.12
\[ n \ge \frac{Z_{0.025}^2\, p(1-p)}{D^2} = \frac{1.96^2 \times 0.082 \times 0.918}{D^2} \]
A larger sample of \( n = 723 \) individuals from the labour force will be needed.
Example 4.13
A food products company has hired a marketing research firm to sample two
markets, I and II, to compare the proportions of consumers who prefer the company’s
frozen dinners over its competitors’ products. No prior information is available on
the magnitude of the proportions p1 and p2 . If the food products company wishes
to estimate the difference in proportions of consumers who prefer its products
correct to within 0.04 with 95% confidence, how many consumers must be
sampled in each market?
Suppose we take samples with equal sizes, i.e. \( n_1 = n_2 = n \). Equating the precision
requirement to the margin of error, we have
\[ 0.04 \ge Z_{0.025}\sqrt{\frac{p_1(1-p_1)}{n} + \frac{p_2(1-p_2)}{n}} = 1.96\sqrt{\frac{p_1(1-p_1) + p_2(1-p_2)}{n}} \]
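One way to finish this calculation, sketched below, is to use the worst case for the unknown proportions. Since p(1 - p) is never larger than 1/4, setting both proportions to 0.5 gives a sample size that meets the requirement whatever the true values are. Taking this route is an assumption about how the example continues, motivated by the statement that no prior information is available.

```python
import math

z_025 = 1.96                           # Z_{0.025}
d = 0.04                               # required margin of error

# Worst case: p(1 - p) is largest at p = 0.5, so take p1 = p2 = 0.5.
p1 = p2 = 0.5
n = math.ceil(z_025 ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / d ** 2)
print(n)
```

The bound before rounding is 1200.5, so under this conservative assumption about 1201 consumers would be sampled in each market.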