Sample Solution
Chapter 1
Introduction
researchers to assess whether the sample was representative of the target population
of all Americans. Because of the shortcomings in the survey design, it is impossible
to know whether the conclusions in Kripke et al. (2002) about sleep and mortality
are valid or not.” (pp. 97–98)
Lohr, S. (2008). "Coverage and sampling," chapter 6 of International Handbook of Survey Methodology, ed. E. de Leeuw, J. Hox, D. Dillman. New York: Erlbaum, 97–112.
1.25 Students will have many different opinions on this issue. Of historical interest is this excerpt of a letter written by James Madison to Thomas Jefferson on February 14, 1790:
A Bill for taking a census has passed the House of Representatives, and is
with the Senate. It contained a schedule for ascertaining the component
classes of the Society, a kind of information extremely requisite to the
Legislator, and much wanted for the science of Political Economy. A
repetition of it every ten years would hereafter aÆord a most curious
and instructive assemblage of facts. It was thrown out by the Senate
as a waste of trouble and supplying materials for idle people to make a
book. Judge by this little experiment of the reception likely to be given
to so great an idea as that explained in your letter of September.
Chapter 2
Simple Probability Samples
(ii)
\[
V[\bar y] = \frac14(135.33 - 142.5)^2 + \frac12(143.67 - 142.5)^2 + \frac14(147.33 - 142.5)^2
= 12.84 + 0.68 + 5.84 = 19.36.
\]
2.3 No, because thick books have a higher inclusion probability than thin books.
2.4 (a) A total of \(\binom{8}{3} = 56\) samples are possible, each with probability of selection \(\frac{1}{56}\). The R function samplist below will (inefficiently!) generate each of the 56 samples.

Of course, this variance could have been more easily calculated using the formula in (2.7):
\[
V[\bar y] = \left(1 - \frac{n}{N}\right)\frac{S^2}{n} = \left(1 - \frac{3}{8}\right)\frac{6.8571429}{3} = 1.429.
\]
(b) A total of \(8^3 = 512\) samples are possible when sampling with replacement.
Fortunately, we need not list all of these to find the sampling distribution of ȳ. Let
Xi be the value of the ith unit drawn. Since sampling is done with replacement,
X1 , X2 , and X3 are independent; Xi (i = 1, 2, 3) has distribution
k P (Xi = k)
1 1/8
2 1/8
4 2/8
7 3/8
8 1/8
Using the independence, then, we have the following probability distribution for
X̄, which serves as the sampling distribution of ȳ.
k        P(ȳ = k)     k        P(ȳ = k)
1        1/512        4 2/3    12/512
1 1/3    3/512        5        63/512
1 2/3    3/512        5 1/3    57/512
2        7/512        5 2/3    21/512
2 1/3    12/512       6        57/512
2 2/3    6/512        6 1/3    36/512
3        21/512       6 2/3    6/512
3 1/3    33/512       7        27/512
3 2/3    15/512       7 1/3    27/512
4        47/512       7 2/3    9/512
4 1/3    48/512       8        1/512
The with-replacement variance of \(\bar y\) is
\[
V_{wr}[\bar y] = \frac{1}{512}(1 - 5)^2 + \cdots + \frac{1}{512}(8 - 5)^2 = 2.
\]
Or, using the formula with population variance (see Exercise 2.28),
\[
V_{wr}[\bar y] = \frac{1}{n}\sum_{i=1}^{N}\frac{(y_i - \bar y_U)^2}{N} = \frac{6}{3} = 2.
\]
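As a check on the table and both variance formulas, the full with-replacement sampling distribution can be enumerated by brute force. A short Python sketch (illustrative; the population values {1, 2, 4, 4, 7, 7, 7, 8} are inferred from the per-draw distribution above):

```python
from itertools import product
from fractions import Fraction

# Population implied by the per-draw distribution:
# P(X=1)=1/8, P(X=2)=1/8, P(X=4)=2/8, P(X=7)=3/8, P(X=8)=1/8.
y = [1, 2, 4, 4, 7, 7, 7, 8]
n = 3

# Enumerate all 8^3 = 512 ordered with-replacement samples of size 3.
dist = {}
for sample in product(y, repeat=n):
    ybar = Fraction(sum(sample), n)
    dist[ybar] = dist.get(ybar, 0) + Fraction(1, len(y) ** n)

mean = sum(k * p for k, p in dist.items())                 # E[ybar] = 5
var = sum((k - mean) ** 2 * p for k, p in dist.items())    # V_wr[ybar] = 2
print(mean, var)  # 5 2
```

Using exact `Fraction` arithmetic reproduces the table entries (for example, \(P(\bar y = 7\tfrac23) = 9/512\)) as well as \(V_{wr}[\bar y] = 2\).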
(c) No; a sample of size 50 is probably not large enough for ȳ to be normally
distributed, because of the skewness of the original data.
The sample skewness of the data is (from SAS) 1.593. This can be calculated by hand, finding
\[
\frac{1}{n}\sum_{i \in S}(y_i - \bar y)^3 = 28.9247040,
\]
so that the skewness is \(28.9247040/(2.682^3) = 1.499314\). Note this estimate differs from SAS PROC UNIVARIATE since SAS adjusts for degrees of freedom, using the formula
\[
\text{skewness} = \frac{n}{(n-1)(n-2)}\sum_{i \in S}\frac{(y_i - \bar y)^3}{s^3}.
\]
Whichever estimate is used, however, formula (2.23) says we need a minimum of
\[
28 + 25(1.5)^2 = 84
\]
observations to use the central limit theorem.
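The criterion in (2.23) is easy to evaluate for either skewness estimate (a quick Python check; either value gives a minimum sample size of roughly 84):

```python
def min_n_for_clt(skewness):
    # Criterion (2.23): minimum n is 28 + 25 * (skewness)^2.
    return 28 + 25 * skewness ** 2

print(min_n_for_clt(1.5))       # 84.25
print(min_n_for_clt(1.499314))  # about 84.2
```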
(d) \(\hat p = 28/50 = 0.56\).
\[
SE(\hat p) = \sqrt{\frac{(0.56)(0.44)}{49}\left(1 - \frac{50}{807}\right)} = 0.0687.
\]
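The standard error above can be verified with a few lines of Python (a sketch using the values from the solution: n = 50 draws from N = 807, with 28 successes):

```python
import math

# SE of a sample proportion from an SRS, with fpc.
n, N = 50, 807
p_hat = 28 / n
se = math.sqrt(p_hat * (1 - p_hat) / (n - 1) * (1 - n / N))
print(p_hat, round(se, 4))  # 0.56 0.0687
```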
2.7 (a) A 95% confidence interval for the proportion of entries from the South is
\[
\frac{175}{1000} \pm 1.96\sqrt{\frac{\dfrac{175}{1000}\left(1 - \dfrac{175}{1000}\right)}{1000}} = [.151, .199].
\]
(b) As 0.309 is not in the confidence interval, there is evidence that the percentages differ.
2.8 Answers will vary.
2.9 If \(n_0 \le N\), then, with \(n = n_0/(1 + n_0/N)\),
\[
\begin{aligned}
z_{\alpha/2}\sqrt{1 - \frac{n}{N}}\,\frac{S}{\sqrt n}
&= z_{\alpha/2}\sqrt{1 - \frac{n_0}{N(1 + n_0/N)}}\;\frac{S}{\sqrt{n_0}}\sqrt{1 + \frac{n_0}{N}} \\
&= z_{\alpha/2}\sqrt{1 + \frac{n_0}{N} - \frac{n_0}{N}}\;\frac{S}{\sqrt{n_0}} \\
&= z_{\alpha/2}\frac{S}{\sqrt{n_0}} \\
&= z_{\alpha/2}\frac{S}{z_{\alpha/2}S/e} \\
&= e.
\end{aligned}
\]
2.10 Design 3 gives the most precision because its sample size is largest, even
though it is a small fraction of the population. Here are the variances of ȳ for the
three samples:
2.11 (a)
[Histogram of age (months) for the sampled children, with frequency on the vertical axis and ages ranging from about 10 to 20 months.]
The histogram appears skewed with tail on the right. With a mildly skewed distri-
bution, though, a sample of size 240 is large enough that the sample mean should
be normally distributed.
(b) \(\bar y = 12.07917\); \(s^2 = 3.705003\); \(SE[\bar y] = \sqrt{s^2/n} = 0.12425\).
(Since we do not know the population size, we ignore the fpc, at the risk of a slightly-too-large standard error.)
A 95% confidence interval is
\[
12.079 \pm (1.96)(0.12425) = [11.84, 12.32].
\]
(c)
\[
n = \frac{(1.96)^2(3.705)}{(0.5)^2} = 57.
\]
2.12 (a) Using (2.17) and choosing the maximum possible value of \((0.5)^2\) for \(S^2\),
\[
n_0 = \frac{(1.96)^2(0.5)^2}{(0.1)^2} = 96.04.
\]
Then
\[
n = \frac{n_0}{1 + n_0/N} = \frac{96.04}{1 + 96.04/580} = 82.4.
\]
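The two-step sample-size calculation can be sketched in Python (the margin of error e = 0.1 is an inference from the value \(n_0 = 96.04\) in the solution):

```python
# Sample size with fpc correction: n0 first, then n = n0 / (1 + n0/N).
z, S2, e, N = 1.96, 0.25, 0.1, 580
n0 = z**2 * S2 / e**2
n = n0 / (1 + n0 / N)
print(round(n0, 2), round(n, 1))  # 96.04 82.4
```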
(b) Since sampling is with replacement, no fpc is used. An approximate 95% confidence interval for the proportion of children not overdue for vaccination is
\[
\frac{27}{120} \pm 1.96\sqrt{\frac{\dfrac{27}{120}\left(1 - \dfrac{27}{120}\right)}{119}} = [0.15, 0.30].
\]
so an approximate 95% CI is
\[
0.2 \pm 1.96\sqrt{0.0001557149} = [.176, .224].
\]
(b) The above analysis is valid only if the respondents are a random sample of the
selected sample. If respondents differ from the nonrespondents—for example, if the
nonrespondents are more likely to have been bullied—then the entire CI may be
biased.
2.14 Here is SAS output:
Data Summary
Class
Variable Levels Values
sex 2 f m
Statistics
Std Error
Variable Level Mean of Mean 95% CL for Mean
__________________________________________________________________
sex f 0.306667 0.034353 0.23878522 0.37454811
m 0.693333 0.034353 0.62545189 0.76121478
Statistics
\[
\text{CI: } 301953.7 \pm 1.96\sqrt{\frac{s^2}{300}\left(1 - \frac{300}{3078}\right)}, \text{ or } [264883, 339025]
\]
data golfsrs;
infile golfsrs delimiter="," dsd firstobs=2;
/* The dsd option allows SAS to read the missing values between
successive delimiters */
sampwt = 14938/120;
\[
\text{95\% CI: } \frac{85}{120} \pm 1.96\sqrt{\frac{\dfrac{85}{120}\left(1 - \dfrac{85}{120}\right)}{119}\left(1 - \frac{120}{14938}\right)} = .708 \pm .081,
\]
or [0.627, 0.790].
2.19 Assume the maximum value for the variance, with p = 0.5. Then use \(n_0 = 1.96^2(0.5)^2/(.04)^2 = 600.25\) and \(n = n_0/(1 + n_0/N)\).
City n0 n
Buckeye 600.25 535
Gilbert 600.25 595
Gila Bend 600.25 446
Phoenix 600.25 600
Tempe 600.25 598
The finite population correction only makes a difference for Buckeye and Gila Bend.
2.20 Sixty of the 70 samples yield confidence intervals, using this procedure, that
include the true value t = 40. The exact confidence level is 60/70 = 0.857.
2.21 (a) A number of diÆerent arguments can be made that this method results
in a simple random sample. Here is one proof, which assumes that the random
number table indeed consists of independent random numbers. In the context of
the problem, M = 999, N = 742, and n = 30. Of course, many students will give a
more heuristic argument.
Let \(U_1, U_2, U_3, \ldots\) be independent random variables, each with a discrete uniform distribution on \(\{0, 1, 2, \ldots, M\}\). Now define
\[
T_1 = \min\{i : U_i \in [1, N]\}
\]
and
\[
T_k = \min\{i > T_{k-1} : U_i \in [1, N],\ U_i \notin \{U_{T_1}, \ldots, U_{T_{k-1}}\}\}
\]
for \(k = 2, \ldots, n\). Then for \(\{x_1, \ldots, x_n\}\) a set of n distinct elements in \(\{1, \ldots, N\}\),
\[
P(S = \{x_1, \ldots, x_n\}) = \frac{n!(N - n)!}{N!} = \frac{1}{\binom{N}{n}},
\]
this, let’s look at a simpler case: selecting one number between 1 and 74 using this
procedure.
Let U1 , U2 , . . . be independent random variables, each with a discrete uniform dis-
tribution on {0, . . . , 9}. Then the first random number considered in the sequence
is 10U1 + U2 ; if that number is not between 1 and 74, then 10U2 + U3 is considered,
etc. Let
\[
T = \min\{i : 10U_i + U_{i+1} \in [1, 74]\}.
\]
Then for \(x = 10x_1 + x_2\), \(x \in [1, 74]\),
\[
P(S = \{x\}) = P(10U_T + U_{T+1} = x) = P(U_T = x_1, U_{T+1} = x_2).
\]
For part (a), the stopping times were irrelevant for the distribution of UT1 , . . . , UTn ;
here, though, the stopping time makes a diÆerence. One way to have T = 2 is if
10U1 + U2 = 75. In that case, you have rejected the first number solely because the
second digit is too large, but that second digit becomes the first digit of the random
number selected. To see this formally, note that
\[
\begin{aligned}
P(S = \{x\}) &= P\bigl(10U_1 + U_2 = x \ \text{or}\ \{10U_1 + U_2 \notin [1, 74] \ \text{and}\ 10U_2 + U_3 = x\} \\
&\qquad \text{or}\ \{10U_1 + U_2 \notin [1, 74] \ \text{and}\ 10U_2 + U_3 \notin [1, 74] \ \text{and}\ 10U_3 + U_4 = x\} \ \text{or}\ \ldots\bigr) \\
&= P(U_1 = x_1, U_2 = x_2)
 + \sum_{t=2}^{\infty} P\left(\bigcap_{i=1}^{t-1}\bigl\{U_i > 7 \ \text{or}\ [U_i = 7 \ \text{and}\ U_{i+1} > 4]\bigr\}\ \text{and}\ U_t = x_1 \ \text{and}\ U_{t+1} = x_2\right).
\end{aligned}
\]
(f) Let’s look at the probability student j in class i is chosen for first unit in the
sample. Let U1 , U2 , . . . be independent discrete uniform {1, . . . , 20} and let V1 , V2 , . . .
18 CHAPTER 2. SIMPLE PROBABILITY SAMPLES
be independent discrete
P20uniform {1, . . . , 40}. Let Mi denote the number of students
in class i, with K = i=1 Mi . Then, because all random variables are independent,
Thus, before duplicates are eliminated, a student has probability 1/K of being
selected on any given draw. The argument in part (a) may then be used to show
that when duplicates are discarded, the resulting sample is an SRS.
2.22 (a) From (2.13),
\[
CV(\bar y) = \frac{\sqrt{V(\bar y)}}{E(\bar y)} = \sqrt{1 - \frac{n}{N}}\;\frac{S}{\sqrt{n}\,\bar y_U}.
\]
2.23
\[
P(\text{no missing data}) = \frac{\displaystyle\binom{3059}{300}\binom{19}{0}}{\displaystyle\binom{3078}{300}}
= \frac{(2778)(2777)\cdots(2760)}{(3078)(3077)\cdots(3060)} = 0.1416421.
\]
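The hypergeometric probability can be checked exactly with Python's integer arithmetic (an illustrative computation with the numbers from the solution):

```python
from math import comb

# P(no missing data): an SRS of 300 drawn from 3078 records,
# 19 of which have missing data.
p = comb(3059, 300) * comb(19, 0) / comb(3078, 300)
print(p)  # approximately 0.1416421
```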
2.24
\[
g(n) = L(n) + C(n) = k\left(1 - \frac{n}{N}\right)\frac{S^2}{n} + c_0 + c_1 n.
\]
\[
\frac{dg}{dn} = -\frac{kS^2}{n^2} + c_1.
\]
Setting the derivative equal to 0 and solving for n gives
\[
n = \sqrt{\frac{kS^2}{c_1}}.
\]
The sample size, in the decision theoretic approach, should be larger if the cost of a
bad estimate, k, or the variance, S 2 , is larger; the sample size is smaller if the cost
of sampling is larger.
2.25 (a) Skewed, with tail on right.
(b) ȳ = 20.15, s2 = 321.357, SE [ȳ] = 1.63
2.26 In a systematic sample, the population is partitioned into k clusters, each of
size n. One of these clusters is selected with probability 1/k, so \(\pi_i = 1/k\) for each i.
But many of the samples that could be selected in an SRS cannot be selected in a
systematic sample. For example,
P (Z1 = 1, . . . , Zn = 1) = 0 :
since every kth unit is selected, the sample cannot consist of the first n units in the
population.
2.27 (a)
\[
P(\text{you are in sample}) = \frac{\displaystyle\binom{99{,}999{,}999}{999}\binom{1}{1}}{\displaystyle\binom{100{,}000{,}000}{1000}}
= \frac{99{,}999{,}999!}{999!\,99{,}999{,}000!}\cdot\frac{1000!\,99{,}999{,}000!}{100{,}000{,}000!}
= \frac{1000}{100{,}000{,}000} = \frac{1}{100{,}000}.
\]
(b)
\[
P(\text{you are not in any of the 2000 samples}) = \left(1 - \frac{1}{100{,}000}\right)^{2000} = 0.9802.
\]
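A one-line check in Python (illustrative, using the inclusion probability from part (a)):

```python
# Chance of never appearing in any of 2000 independent SRSs of 1000
# drawn from a population of 100 million.
p_in = 1000 / 100_000_000          # = 1/100000 per sample
p_never = (1 - p_in) ** 2000
print(round(p_never, 4))  # 0.9802
```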
(c)
\[
\begin{aligned}
V[\hat t] &= V\left[\frac{N}{n}\sum_{i=1}^{N} Q_i y_i\right] \\
&= \left(\frac{N}{n}\right)^2 \sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j \,\mathrm{Cov}[Q_i, Q_j] \\
&= \left(\frac{N}{n}\right)^2 \left\{\sum_{i=1}^{N} y_i^2\, np_i(1 - p_i) + \sum_{i=1}^{N}\sum_{j \ne i} y_i y_j(-np_i p_j)\right\} \\
&= \left(\frac{N}{n}\right)^2 \left\{\frac{n}{N}\left(1 - \frac{1}{N}\right)\sum_{i=1}^{N} y_i^2 - \frac{n}{N}\frac{1}{N}\sum_{i=1}^{N}\sum_{j \ne i} y_i y_j\right\} \\
&= \frac{N}{n}\left\{\sum_{i=1}^{N} y_i^2 - N\bar y_U^2\right\} \\
&= \frac{N^2}{n}\,\frac{\sum_{i=1}^{N}(y_i - \bar y_U)^2}{N}.
\end{aligned}
\]
\[
P(S_{k-1}) = \frac{1}{\binom{n+k-1}{n}} = \frac{n!(k-1)!}{(n+k-1)!}.
\]
Now let \(U_k \sim \mathrm{Uniform}(0, 1)\), let \(V_k\) be discrete uniform \((1, \ldots, n)\), and suppose \(U_k\) and \(V_k\) are independent. Let A be a subset of size n of \(\{1, \ldots, n+k\}\). If A does not contain unit \(n+k\), then A can be achieved as a sample at step \(k-1\) and
\[
P(S_k = A) = P\left(S_{k-1} = A \ \text{and}\ U_k > \frac{n}{n+k}\right)
= P(S_{k-1})\,\frac{k}{n+k} = \frac{n!\,k!}{(n+k)!}.
\]
If A does contain unit \(n+k\), then the sample at step \(k-1\) must contain \(A_{k-1} = A - \{n+k\}\) plus one other unit among the k units not in \(A_{k-1}\):
\[
P(S_k = A) = \sum_{j \notin A_{k-1},\ j \le n+k-1} P\left(S_{k-1} = A_{k-1} \cup \{j\} \ \text{and}\ U_k \le \frac{n}{n+k} \ \text{and}\ V_k = j\right)
= k\,\frac{n!(k-1)!}{(n+k-1)!}\,\frac{n}{n+k}\,\frac{1}{n}
= \frac{n!\,k!}{(n+k)!}.
\]
2.30 I always use this activity in my classes. Students generally get estimates of
the total area that are biased upwards for the purposive sample. They think, when
looking at the picture, that they don’t have enough of the big rectangles and so tend
to oversample them. This is also a good activity for reviewing confidence intervals
and other concepts from an introductory statistics class.
Chapter 3
Stratified Sampling
(b) From Stratum 1, we have the following probability distribution for t̂1 :
j P (t̂1 = j)
6 1/6
10 1/6
12 1/6
18 1/6
20 1/6
24 1/6
k P (t̂2 = k)
22 1/2
28 1/2
Because we sample independently in Strata 1 and 2,
data acls;
input stratum $ popsize returns percfem;
females = round(returns*percfem/100);
males = returns - females;
sampwt = popsize/returns;
datalines;
Literature 9100 636 38
Classics 1950 451 27
Philosophy 5500 481 18
History 10850 611 19
Linguistics 2100 493 36
PoliSci 5500 575 13
Sociology 9000 588 26
;
data aclslist;
set acls;
do i = 1 to females;
femind = 1;
output;
end;
do i = 1 to males;
femind = 0;
output;
end;
We obtain t̂ = 10858 with SE 313. These values differ from those in Example 4.4 because of rounding.
3.5 (a) The sampled population consists of members of the organizations who would
respond to the survey.
(b)
\[
\hat p_{str} = \sum_{h=1}^{7}\frac{N_h}{N}\hat p_h
= \left(\frac{9{,}100}{44{,}000}\right)(0.37) + \left(\frac{1{,}950}{44{,}000}\right)(0.23) + \cdots + \left(\frac{9{,}000}{44{,}000}\right)(0.41)
= 0.334.
\]
\[
SE[\hat p_{str}] = \sqrt{\sum_{h=1}^{7}\left(1 - \frac{n_h}{N_h}\right)\left(\frac{N_h}{N}\right)^2\frac{\hat p_h(1 - \hat p_h)}{n_h - 1}}
= \sqrt{1.46 \times 10^{-5} + 5.94 \times 10^{-7} + \cdots + 1.61 \times 10^{-5}}
= 0.0079.
\]
3.6 (a) We use Neyman allocation (= optimal allocation when costs in the strata are equal), with \(n_h \propto N_h S_h\). We take \(R_h\) to be the relative standard deviation in stratum h, and let \(n_h = 900(N_h R_h)/125{,}000\).
Stratum Nh Rh Nh Rh nh
Houses 35,000 2 70,000 504
Apartments 45,000 1 45,000 324
Condos 10,000 1 10,000 72
Sum 90,000 125,000 900
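The allocation in the table can be reproduced with a short Python sketch (using the \(N_h\) and \(R_h\) values above):

```python
# Neyman allocation n_h proportional to N_h * R_h, with n = 900.
strata = {"Houses": (35_000, 2), "Apartments": (45_000, 1), "Condos": (10_000, 1)}
total = sum(N * R for N, R in strata.values())  # 125,000
alloc = {name: round(900 * N * R / total) for name, (N, R) in strata.items()}
print(alloc)  # {'Houses': 504, 'Apartments': 324, 'Condos': 72}
```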
(b) Let’s suppose we take a sample of 900 observations. (Any other sample size will
give the same answer.)
With proportional allocation, we sample 350 houses, 450 apartments, and 100 condominiums. If the assumptions about the variances hold,
\[
V_{str}[\hat p_{str}] = \left(\frac{350}{900}\right)^2\frac{(.45)(.55)}{350} + \left(\frac{450}{900}\right)^2\frac{(.25)(.75)}{450} + \left(\frac{100}{900}\right)^2\frac{(.03)(.97)}{100} = .000215.
\]
If these proportions hold in the population, then
\[
p = \frac{35}{90}(.45) + \frac{45}{90}(.25) + \frac{10}{90}(.03) = 0.3033
\]
and, with an SRS of size 900,
\[
V_{srs}[\hat p_{srs}] = \frac{(0.3033)(1 - .3033)}{900} = .000235.
\]
The gain in efficiency is given by
\[
\frac{V_{str}[\hat p_{str}]}{V_{srs}[\hat p_{srs}]} = \frac{.000215}{.000235} = 0.9144.
\]
For any sample size n, using the same argument as above, we have
\[
V_{str}[\hat p_{str}] = \frac{.193233}{n} \quad\text{and}\quad V_{srs}[\hat p_{srs}] = \frac{.211322}{n}.
\]
We only need 0.9144n observations, taken in a stratified sample with proportional allocation, to achieve the same variance as in an SRS with n observations.
Note: The ratio \(V_{str}[\hat p_{str}]/V_{srs}[\hat p_{srs}]\) is the design effect, to be discussed further in Section 7.5.
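The design effect computed above can be sketched in Python (illustrative, using the stratum sizes and assumed proportions from this exercise):

```python
# Design effect V_str / V_srs under proportional allocation.
strata = [(35_000, 0.45), (45_000, 0.25), (10_000, 0.03)]  # (N_h, p_h)
N = sum(Nh for Nh, _ in strata)
n = 900

# Proportional allocation assigns n_h = n * N_h / N to stratum h.
v_str = sum((Nh / N) ** 2 * ph * (1 - ph) / (n * Nh / N) for Nh, ph in strata)
p = sum((Nh / N) * ph for Nh, ph in strata)
v_srs = p * (1 - p) / n
print(round(v_str / v_srs, 4))  # 0.9144
```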
3.7 (a) Here are summary statistics for each stratum:
Stratum
Biological Physical Social Humanities
average 3.142857 2.105263 1.230769 0.4545455
variance 6.809524 8.210526 4.358974 0.8727273
Since we took a simple random sample in each stratum, we use
\[
(102)(3.142857) = 320.5714
\]
to estimate the total number of publications in the biological sciences, with estimated variance
\[
(102)^2\left(1 - \frac{7}{102}\right)\frac{6.809524}{7} = 9426.327.
\]
The following table gives estimates of the total number of publications and estimated
variance of the total for each of the four strata:
Estimated total Estimated variance
Stratum number of publications of total
Biological Sciences 320.571 9426.33
Physical Sciences 652.632 38982.71
Social Sciences 267.077 14843.31
Humanities 80.909 2358.43
Total 1321.189 65610.78
We estimate the total number of refereed publications for the college by adding the totals for each of the strata; as sampling was done independently in each stratum, the variance of the college total is the sum of the variances of the population stratum totals. Thus we estimate the total number of refereed papers as 1321.2, with standard error \(\sqrt{65610.78} = 256.15\).
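The stratum-by-stratum arithmetic can be reproduced with a short Python sketch (the \(N_h\), \(n_h\), means, and variances are the summary statistics above):

```python
import math

# (N_h, n_h, stratum mean, stratum sample variance) for each division.
strata = {
    "Biological": (102, 7, 3.142857, 6.809524),
    "Physical":   (310, 19, 2.105263, 8.210526),
    "Social":     (217, 13, 1.230769, 4.358974),
    "Humanities": (178, 11, 0.4545455, 0.8727273),
}
t_hat = sum(N * ybar for N, _, ybar, _ in strata.values())
v_hat = sum(N**2 * (1 - n / N) * s2 / n for N, n, _, s2 in strata.values())
print(round(t_hat, 1), round(math.sqrt(v_hat), 2))  # 1321.2 256.15
```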
(b) From Exercise 2.6, using an SRS of size 50, the estimated total was t̂srs =
1436.46, with standard error 296.2. Here, stratified sampling ensures that each
division of the college is represented in the sample, and it produces an estimate
with a smaller standard error than an SRS with the same number of observations.
The sample variance in Exercise 2.8 was s2 = 7.19. Only Physical Sciences had a
sample variance larger than 7.19; the sample variance in Humanities was only 0.87.
Observations within many strata tend to be more homogeneous than observations
in the population as a whole, and the reduction in variance in the individual strata
often leads to a reduced variance for the population estimate.
(c) In the table, the final column is \(\left(1 - \dfrac{n_h}{N_h}\right)\dfrac{N_h^2}{N^2}\,\dfrac{\hat p_h(1 - \hat p_h)}{n_h - 1}\).

Stratum                N_h   n_h   p̂_h           (N_h/N)p̂_h   final column
Biological Sciences    102    7     1/7  = .143      .018          .0003
Physical Sciences      310   19    10/19 = .526      .202          .0019
Social Sciences        217   13     9/13 = .692      .186          .0012
Humanities             178   11     8/11 = .727      .160          .0009

\[
\hat p_{str} = 0.567, \qquad SE[\hat p_{str}] = \sqrt{0.0043} = 0.066.
\]
3.8 (a) Because the budget for interviews is $15,000, a total of 15,000/30 = 500
in-person interviews can be taken. The variances in the phone and nonphone strata
are assumed similar, so proportional allocation is optimal: 450 phone households
and 50 nonphone households would be selected for interview.
(b) The variances in the two strata are assumed equal, so optimal allocation gives
\[
n_h \propto \frac{N_h}{\sqrt{c_h}}.
\]

Stratum     c_h   N_h/N   (N_h/N)/√c_h
Phone        10    0.9     0.284605
Nonphone     40    0.1     0.015811
Total              1.0     0.300416
The calculations in the table imply that
\[
n_{phone} = \frac{0.284605}{0.300416}\,n;
\]
the cost constraints imply that
\[
10\,n_{phone} + 40\,n_{non} = 15{,}000.
\]
Solving, we have \(n_{phone} = 1227\), \(n_{non} = 68\), and \(n = 1295\).
Because of the reduced costs of telephone interviewing, more households can be
selected in each stratum.
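The cost-constrained allocation can be solved numerically (a Python sketch, assuming the $15,000 budget and per-interview costs of $10 and $40 used above):

```python
import math

# Optimal allocation n_h proportional to (N_h/N)/sqrt(c_h),
# subject to the budget 10*n_phone + 40*n_nonphone = 15000.
strata = {"phone": (0.9, 10), "nonphone": (0.1, 40)}  # (N_h/N, c_h)
total = sum(w / math.sqrt(c) for w, c in strata.values())
frac = {h: (w / math.sqrt(c)) / total for h, (w, c) in strata.items()}

# Solve the budget constraint for n, then allocate.
n = 15_000 / sum(c * frac[h] for h, (_, c) in strata.items())
alloc = {h: round(frac[h] * n) for h in strata}
print(round(n), alloc)  # 1295 {'phone': 1227, 'nonphone': 68}
```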
3.9 (a) Summary statistics for acres87:

Region   N_h    n_h   ȳ_h        s²_h        (N_h/N)ȳ_h   [(N_h − n_h)/N_h](N_h/N)²(s²_h/n_h)
NC       1054   103   308188.3   2.943E+10   105532.98    30225148
NE        220    21   109009.6   1.005E+10     7791.46     2211633
S        1382   135   212687.2   5.698E+10    95495.05    76782239
W         422    41   654458.7   3.775E+11    89727.61   156241957
Total    3078   300                          298547.10   265460977
For acres87, \(\bar y_{str} = \sum_h (N_h/N)\bar y_h = 298547.1\) and
\[
SE(\bar y_{str}) = \sqrt{\sum_h \frac{N_h - n_h}{N_h}\left(\frac{N_h}{N}\right)^2\frac{s_h^2}{n_h}} = \sqrt{265460977} = 16293.
\]
Of course, \(\bar y_{str}\) could also be calculated using the column of weights in the data set, as:
\[
\bar y_{str} = \frac{\sum_{i \in S} w_i y_i}{\sum_{i \in S} w_i} = \frac{918927923}{3078} = 298547.1.
\]
For largef92, \(\bar y_{str} = \sum_h (N_h/N)\bar y_h = 56.70\) and
\[
SE(\bar y_{str}) = \sqrt{\sum_h \frac{N_h - n_h}{N_h}\left(\frac{N_h}{N}\right)^2\frac{s_h^2}{n_h}} = 3.56.
\]
data strattot;
input region $ _total_;
cards;
NE 220
NC 1054
S 1382
W 422
;
proc surveymeans data=agstrat total = strattot mean sum clm clsum df;
stratum region ;
var acres87 farms92 largef92 smallf92 ;
weight strwt;
run;
3.10 For this problem, note that N_h, the total number of dredge tows needed to cover the stratum, must be calculated. We use \(N_h = 25.6 \times \text{Area}_h\).
(a) Calculate \(\hat t_h = N_h\bar y_h\).

Stratum   N_h     n_h   ȳ_h    s²_h    t̂_h     N_h²(1 − n_h/N_h)s²_h/n_h
1          5704    4    0.44   0.068    2510     552718
2          1270    6    1.17   0.042    1486      11237
3          1286    3    3.92   2.146    5041    1180256
4          5064    5    1.80   0.794    9115    4068262
Sum       13324   18            18152  5812472
(b)

Stratum   N_h     n_h   ȳ_h    s²_h    t̂_h    N_h²(1 − n_h/N_h)s²_h/n_h
1          8260    8    0.63   0.083   5204    707176
4          5064    5    0.40   0.046   2026    235693
Sum       13324   13            7229    942869
3.11 Note that the paper is somewhat ambiguous on how the data were collected. The abstract says random stratified sampling was used, while on p. 224 the authors say: "a sampling grid covering 20% of the total area was made . . . by picking 40 numbers between one and 200 with the random number generator." It's possible that poststratification was really used, but for exercise purposes, let's treat it as a stratified random sample. Also note that because the original data were not available, data were generated that were consistent with the summary statistics in the paper.
(a) Summary statistics are in the following table:
Zone Nh nh ȳh s2h
1 68 17 1.765 3.316
2 84 12 4.417 11.538
3 48 11 10.545 46.073
Total 200 40
Using (3.1),
\[
\hat t_{str} = \sum_h N_h\bar y_h = 68(1.76) + 84(4.42) + 48(10.55) = 997.
\]
From (3.3),
\[
\hat V(\hat t_{str}) = \sum_{h=1}^{H}\left(1 - \frac{n_h}{N_h}\right)N_h^2\frac{s_h^2}{n_h}
= \left(1 - \frac{17}{68}\right)68^2\frac{3.316}{17} + \left(1 - \frac{12}{84}\right)84^2\frac{11.538}{12} + \left(1 - \frac{11}{48}\right)48^2\frac{46.073}{11}
\]
\[
= 676.5 + 5815.1 + 7438.7 = 13930.2,
\]
so
\[
SE(\hat t_{str}) = \sqrt{13930.2} = 118.
\]
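Before turning to the SAS code, the same computation can be sketched in Python (using the zone summary statistics in the table above):

```python
import math

# (N_h, n_h, mean, s^2) for the three zones of breathing-hole counts.
zones = [(68, 17, 1.765, 3.316), (84, 12, 4.417, 11.538), (48, 11, 10.545, 46.073)]
t_hat = sum(N * ybar for N, _, ybar, _ in zones)
v_hat = sum((1 - n / N) * N**2 * s2 / n for N, n, _, s2 in zones)
print(round(t_hat), round(math.sqrt(v_hat)))  # 997 118
```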
data seals;
infile seals delimiter="," firstobs=2;
input zone holes;
if zone = 1 then sampwt = 68/17;
if zone = 2 then sampwt = 84/12;
if zone = 3 then sampwt = 48/11;
run;
data strattot;
input zone _total_;
datalines;
1 68
2 84
3 48
;
Data Summary
Number of Strata 3
Number of Observations 40
Sum of Weights 200
Statistics
Std Error
Variable Mean of Mean 95% CL for Mean
Statistics
(b) (i) If the goal is estimating the total number of breathing holes, we should use
optimal allocation. Using the values of s2h from this survey as estimates of Sh2 , we
have:
Zone Nh s2h Nh sh
1 68 3.316 123.83
2 84 11.538 285.33
3 48 46.073 325.81
Total 200 734.97
Then n1 = (123.83/734.97)n = 0.17n; n2 = 0.39n; n3 = 0.44n. The high variance
in zone 3 leads to a larger sample size in that zone.
(ii) If the goal is to compare the density of the breathing holes in the three zones,
we would like to have equal precision for ȳh in the three strata. Ignoring the fpc,
that means we would like
\[
\frac{S_1^2}{n_1} = \frac{S_2^2}{n_2} = \frac{S_3^2}{n_3},
\]
which implies that nh should be proportional to Sh2 to achieve equal variances.
Using the sample variances s2h instead of the unknown population variances Sh2 , this
leads to
\[
n_1 = \frac{s_1^2}{s_1^2 + s_2^2 + s_3^2}\,n = 0.05n, \qquad n_2 = 0.19n, \qquad n_3 = 0.76n.
\]
3.12 We use \(n_h = 300 N_h s_h \big/ \sum_k N_k s_k\).
Region Nh Nh sh nh
Northeast 220 19,238,963 7
North Central 1,054 181,392,707 69
South 1,382 319,918,785 122
West 422 265,620,742 101
Total 3,078 786,171,197 300
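The allocation column follows directly from the \(N_h s_h\) column (an illustrative Python check):

```python
# Allocation proportional to N_h * s_h with n = 300, using the
# N_h * s_h column from the table.
nh_sh = {"Northeast": 19_238_963, "North Central": 181_392_707,
         "South": 319_918_785, "West": 265_620_742}
total = sum(nh_sh.values())
alloc = {r: round(300 * v / total) for r, v in nh_sh.items()}
print(alloc)  # {'Northeast': 7, 'North Central': 69, 'South': 122, 'West': 101}
```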
\[
\hat V(\bar y_{str}) = \sum_{h=1}^{3}\left(\frac{N_h}{N}\right)^2\frac{s_h^2}{n_h} = 8.85288 \times 10^{-5}.
\]
Thus, a 95% CI is
\[
3.9386 \pm 1.96\sqrt{8.85288 \times 10^{-5}} = [3.92, 3.96].
\]
This is a very small CI. Remember, though, that it reflects only the sampling error.
In this case, the author was unable to reach some of the stores and in addition some
of the market basket items were missing, so there was nonsampling error as well.
3.16 (a)
Stratum Nh nh ȳh s2h t̂h V̂ (t̂h )
1 89 19 1.74 5.43 154.6 1779.5
2 61 20 1.75 6.83 106.8 854.0
3 40 22 13.27 58.78 530.9 1923.7
4 47 21 4.10 15.59 192.5 907.2
Total 237 82 984.7 5464.3
The estimated total number of otter holts is \(\hat t_{str} = 985\), with
\[
SE[\hat t_{str}] = \sqrt{5464} = 73.9.
\]
data exer0316;
infile otters delimiter=’,’ firstobs=2;
input section habitat holts;
if habitat = 1 then sampwt = 89/19;
if habitat = 2 then sampwt = 61/20;
if habitat = 3 then sampwt = 40/22;
if habitat = 4 then sampwt = 47/21;
;
data strattot;
input habitat _total_;
datalines;
1 89
2 61
3 40
4 47
;
proc surveymeans data=exer0316 total = strattot mean clm sum clsum;
stratum habitat;
weight sampwt;
var holts;
run;
Data Summary
Number of Strata 4
Number of Observations 82
Sum of Weights 237
Statistics
Std Error
Variable Mean of Mean 95% CL for Mean
Statistics
3.17 (a) We form a new variable, weight = 1/samprate. Then the number of divorces in the divorce registration area is
\[
\sum_{h=1}^{H} (\text{weight})_h(\text{numrecs})_h = 571{,}185.
\]
Note that this is the population value, not an estimate, because samprate \(= n_h/N_h\) and \((\text{numrecs})_h = n_h\). Thus
\[
\sum_{h=1}^{H} (\text{weight})_h(\text{numrecs})_h = \sum_{h=1}^{H}\frac{N_h}{n_h}\,n_h = N.
\]
(b) They wanted a specified precision within each state (= stratum). You can see
that, except for a few states in which a census is taken, the number of records sam-
pled is between 2400 and 6200. That gives roughly the same precision for estimates
within each of those states. If the same sampling rate were used in each state, states
with large population would have many more records sampled than states with small
population.
(c) (i) For each stratum,
\[
\bar y_h = \hat p_h = \frac{\text{hsblt20} + \text{hsb20-24}}{\text{numrecs}}
\]
and
\[
\hat V(\hat t_{str}) = \sum_h N_h^2\left(1 - \frac{n_h}{N_h}\right)\frac{\hat p_h(1 - \hat p_h)}{n_h - 1} = \sum_h \text{varcont}_h.
\]
state   rate    n_h     N_h     husb ≤ 24   p̂_h       N_h p̂_h   varcont
AL 0.1 2460 24600 295 0.11992 2950 23376
AK 1 3396 3396 371 0.10925 371 0
CT 0.5 6003 12006 333 0.05547 666 629
DE 1 2938 2938 238 0.08101 238 0
DC 1 2525 2525 90 0.03564 90 0
GA 0.1 3404 34040 440 0.12926 4400 34491
HI 1 4415 4415 394 0.08924 394 0
ID 0.5 2949 5898 380 0.12886 760 662
IL 1 46986 46986 4349 0.09256 4349 0
IA 0.5 5259 10518 541 0.10287 1082 971
KS 0.5 6170 12340 768 0.12447 1536 1345
KY 0.2 3879 19395 567 0.14617 2835 9685
MD 0.2 3104 15520 156 0.05026 780 2964
MA 0.2 3367 16835 163 0.04841 815 3103
MI 0.1 3996 39960 270 0.06757 2700 22664
MO 1 24984 24984 2876 0.11511 2876 0
MT 1 4125 4125 432 0.10473 432 0
NE 1 6236 6236 620 0.09942 620 0
NH 1 4947 4947 458 0.09258 458 0
NY 1 67993 67993 3809 0.05602 3809 0
OH 0.05 2465 49300 102 0.04138 2040 37171
OR 0.2 3124 15620 233 0.07458 1165 4314
PA 0.1 3883 38830 248 0.06387 2480 20900
RI 1 3684 3684 246 0.06678 246 0
SC 1 13835 13835 1429 0.10329 1429 0
SD 1 2699 2699 93 0.03446 93 0
TN 0.1 3042 30420 426 0.14004 4260 32982
UT 0.5 4489 8978 591 0.13166 1182 1027
VT 1 2426 2426 162 0.06678 162 0
VA 1 25608 25608 2075 0.08103 2075 0
WI 0.2 3384 16920 280 0.08274 1400 5138
WY 1 3208 3208 346 0.10786 346 0
Total 280983 571185 49039 201422
Thus, for estimating the total number of divorces granted to men aged 24 or less,
\[
\hat t_{str} = 49039 \quad\text{and}\quad SE(\hat t_{str}) = \sqrt{201{,}422} = 449.
\]
A 95% confidence interval is
\[
49039 \pm (1.96)(449) = [48159, 49919]
\]
and
\[
\hat V(\hat p_{str}) = \sum_{h=1}^{H}\left(\frac{N_h}{N}\right)^2\left(1 - \frac{n_h}{N_h}\right)\frac{\hat p_h(1 - \hat p_h)}{n_h - 1},
\]
so
\[
SE(\hat p_{str}) = \sqrt{(7.19 \times 10^{-8}) + 0 + (5.80 \times 10^{-9}) + \cdots + (2.83 \times 10^{-8}) + 0} = 9.75 \times 10^{-4}.
\]
3.18 (a)
(b) Let \(w_h\) be the relative sampling weight for stratum h. Then \(N_h \propto n_h w_h\). For each response, we may calculate
\[
\bar y_{str} = \frac{\sum_h n_h w_h \bar y_h}{\sum_h n_h w_h};
\]
data nybight;
infile nybight delimiter=’,’ firstobs=2;
input year stratum catchnum catchwt numspp depth temp ;
select (stratum);
when (1,2) relwt=1;
when (3,4,5,6) relwt=2;
end;
if year = 1974;
proc surveymeans data=nybight mean clm ;
stratum stratum;
var catchnum catchwt;
weight relwt;
run;
(c) The procedure is the same as that in part (b). Summary statistics for 1975 are:
Number of fish Weight of fish
Stratum nh wh ȳh s2h ȳh s2h
1 14 1 486.9 94132.0 127.0 3948.0
2 16 1 262.7 42234.8 109.5 7189.8
3 15 2 119.6 9592.0 33.5 867.8
4 13 2 238.3 12647.2 84.1 1583.6
5 3 2 119.7 789.3 20.6 18.4
6 3 2 70.7 3194.3 12.0 255.0
Response ȳˆstr SE(ȳˆstr )
Number of fish 223.9 18.5
Weight of fish 70.6 5.7
3.19 (a)
Respondents Respondents
Stratum to survey to breakaga, nh
1 288 232
2 533 514
3 91 86
4 73 67
Total 985 899
(b) In the table,
\[
\text{varcont}_h = \left(1 - \frac{n_h}{N_h}\right)\left(\frac{N_h}{N}\right)^2\frac{\hat p_h(1 - \hat p_h)}{n_h - 1}.
\]
3.23
(a) We take an SRS of n/H observations from each of the H strata (each of size N/H), so there are a total of
\[
\binom{N/H}{n/H}^H = \left[\frac{(N/H)!}{(n/H)!\,(N/H - n/H)!}\right]^H
\]
possible stratified samples.
(b) By Stirling's formula,
\[
\binom{N}{n} = \frac{N!}{n!\,(N-n)!}
\approx \frac{\sqrt{2\pi N}\,\left(\dfrac{N}{e}\right)^N}{\sqrt{2\pi n}\,\left(\dfrac{n}{e}\right)^n\sqrt{2\pi(N-n)}\,\left(\dfrac{N-n}{e}\right)^{N-n}}
= \sqrt{\frac{N}{2\pi n(N-n)}}\;\frac{N^N}{n^n (N-n)^{N-n}}.
\]
We use the same argument, substituting N/H for N and n/H for n in the equation above, to obtain:
\[
\binom{N/H}{n/H} \approx \sqrt{\frac{NH}{2\pi n(N-n)}}\;\frac{N^{N/H}}{n^{n/H}(N-n)^{(N-n)/H}}.
\]
Consequently,
\[
\binom{N/H}{n/H}^H \bigg/ \binom{N}{n}
\approx \left[\sqrt{\frac{NH}{2\pi n(N-n)}}\;\frac{N^{N/H}}{n^{n/H}(N-n)^{(N-n)/H}}\right]^H
\bigg/ \left[\sqrt{\frac{N}{2\pi n(N-n)}}\;\frac{N^N}{n^n(N-n)^{N-n}}\right]
= H^{H/2}\left[\frac{N}{2\pi n(N-n)}\right]^{(H-1)/2}.
\]
Lagrange multipliers are often used for such problems. (See, for example, Thomas, G.B. and Finney, R.L. (1982). Calculus and Analytic Geometry, Fifth edition. Reading, MA: Addison-Wesley, p. 617.) Define
\[
f(n_1, \ldots, n_H, \lambda) = \sum_{h=1}^{H}\left(1 - \frac{n_h}{N_h}\right)N_h^2\frac{S_h^2}{n_h} - \lambda\left(C - c_0 - \sum_{h=1}^{H} c_h n_h\right).
\]
Then
\[
\frac{\partial f}{\partial n_k} = -N_k^2\frac{S_k^2}{n_k^2} + \lambda c_k
\]
for \(k = 1, \ldots, H\), and
\[
\frac{\partial f}{\partial \lambda} = c_0 + \sum_{h=1}^{H} c_h n_h - C.
\]
Setting the partial derivatives equal to 0 and solving gives
\[
n_k = \frac{N_k S_k}{\sqrt{\lambda c_k}}
\]
for \(k = 1, \ldots, H\), and
\[
\sum_{h=1}^{H} c_h n_h = C - c_0,
\]
which implies that
\[
\sqrt{\lambda} = \frac{\sum_{h=1}^{H}\sqrt{c_h}\,N_h S_h}{C - c_0}.
\]
Then
\[
\frac{\partial g}{\partial n_k} = c_k - \lambda N_k^2 S_k^2/n_k^2
\]
and
\[
\frac{\partial g}{\partial \lambda} = V - \sum_{h=1}^{H}\left(1 - \frac{n_h}{N_h}\right)N_h^2\frac{S_h^2}{n_h}.
\]
(b)
\[
\begin{aligned}
V_{prop}(\hat t_{str}) - V_{Neyman}(\hat t_{str})
&= \frac{N}{n}\sum_{h=1}^{H} N_h S_h^2 - \sum_{h=1}^{H} N_h S_h^2
 - \frac{1}{n}\left(\sum_{h=1}^{H} N_h S_h\right)^2 + \sum_{h=1}^{H} N_h S_h^2 \\
&= \frac{N}{n}\sum_{h=1}^{H} N_h S_h^2 - \frac{1}{n}\left(\sum_{h=1}^{H} N_h S_h\right)^2 \\
&= \frac{N^2}{n}\left[\sum_{h=1}^{H}\frac{N_h}{N}S_h^2 - \left(\sum_{h=1}^{H}\frac{N_h}{N}S_h\right)^2\right] \\
&= \frac{N^2}{n}\sum_{h=1}^{H}\frac{N_h}{N}\left[S_h^2 - S_h\sum_{l=1}^{H}\frac{N_l}{N}S_l\right].
\end{aligned}
\]
But
\[
\sum_{h=1}^{H}\frac{N_h}{N}\left[S_h - \sum_{l=1}^{H}\frac{N_l}{N}S_l\right]^2
= \sum_{h=1}^{H}\frac{N_h}{N}\left[S_h^2 - 2S_h\sum_{l=1}^{H}\frac{N_l}{N}S_l + \left(\sum_{l=1}^{H}\frac{N_l}{N}S_l\right)^2\right]
= \sum_{h=1}^{H}\frac{N_h}{N}S_h^2 - \left(\sum_{l=1}^{H}\frac{N_l}{N}S_l\right)^2,
\]
so the difference is
\[
V_{prop}(\hat t_{str}) - V_{Neyman}(\hat t_{str}) = \frac{N^2}{n}\sum_{h=1}^{H}\frac{N_h}{N}\left[S_h - \sum_{l=1}^{H}\frac{N_l}{N}S_l\right]^2 \ge 0.
\]
3.34
(a) In the data step, define the variable one to have the value 1 for every observation. Then \(\sum_{i \in S} w_i \cdot 1 = N\). Here, \(\sum_{i \in S} w_i = 85174776\). The standard error is zero because this is a stratified sample. The weights are \(N_h/n_h\), so the sum of the weights in stratum h is \(N_h\) exactly. There is no sampling variability.
Here is the code used to obtain these values:
(b) The estimated total number of truck miles driven is \(1.115 \times 10^{12}\); the standard error is 6,492,344,384 and a 95% CI is \([1.102 \times 10^{12}, 1.127 \times 10^{12}]\).
(c) Because these are stratification variables, we can calculate estimates for each
truck type by summing whj yhj separately for each h. We obtain:
(d) The estimated average mpg is 16.515427 with standard error 0.039676; a 95% CI
is [16.4377, 16.5932]. These CIs are very small because the sample size is so large.
Chapter 4
Ratio and Regression Estimation
4.2
(a) We have \(t_x = 69\), \(t_y = 83\), \(S_x = 4.092676\), \(S_y = 5.333333\), \(R = 0.8112815\), and \(B = 1.202899\).
(b)
Sample number   Sample S     x̄_S      ȳ_S      B̂       t̂_SRS     t̂_yr
1 {1, 2, 3} 10.333 10.000 0.968 90.000 66.774
2 {1, 2, 4} 10.667 11.333 1.063 102.000 73.313
3 {1, 2, 5} 8.000 8.333 1.042 75.000 71.875
4 {1, 2, 6} 7.667 6.000 0.783 54.000 54.000
5 {1, 2, 7} 10.333 11.000 1.065 99.000 73.452
6 {1, 2, 8} 7.667 8.000 1.043 72.000 72.000
7 {1, 2, 9} 8.333 7.000 0.840 63.000 57.960
8 {1, 3, 4} 12.000 13.333 1.111 120.000 76.667
9 {1, 3, 5} 9.333 10.333 1.107 93.000 76.393
10 {1, 3, 6} 9.000 8.000 0.889 72.000 61.333
11 {1, 3, 7} 11.667 13.000 1.114 117.000 76.886
12 {1, 3, 8} 9.000 10.000 1.111 90.000 76.667
13 {1, 3, 9} 9.667 9.000 0.931 81.000 64.241
14 {1, 4, 5} 9.667 11.667 1.207 105.000 83.276
15 {1, 4, 6} 9.333 9.333 1.000 84.000 69.000
16 {1, 4, 7} 12.000 14.333 1.194 129.000 82.417
17 {1, 4, 8} 9.333 11.333 1.214 102.000 83.786
18 {1, 4, 9} 10.000 10.333 1.033 93.000 71.300
19 {1, 5, 6} 6.667 6.333 0.950 57.000 65.550
20 {1, 5, 7} 9.333 11.333 1.214 102.000 83.786
21 {1, 5, 8} 6.667 8.333 1.250 75.000 86.250
22 {1, 5, 9} 7.333 7.333 1.000 66.000 69.000
23 {1, 6, 7} 9.000 9.000 1.000 81.000 69.000
24 {1, 6, 8} 6.333 6.000 0.947 54.000 65.368
25 {1, 6, 9} 7.000 5.000 0.714 45.000 49.286
26 {1, 7, 8} 9.000 11.000 1.222 99.000 84.333
27 {1, 7, 9} 9.667 10.000 1.034 90.000 71.379
28 {1, 8, 9} 7.000 7.000 1.000 63.000 69.000
29 {2, 3, 4} 10.000 12.333 1.233 111.000 85.100
30 {2, 3, 5} 7.333 9.333 1.273 84.000 87.818
31 {2, 3, 6} 7.000 7.000 1.000 63.000 69.000
32 {2, 3, 7} 9.667 12.000 1.241 108.000 85.655
33 {2, 3, 8} 7.000 9.000 1.286 81.000 88.714
34 {2, 3, 9} 7.667 8.000 1.043 72.000 72.000
35 {2, 4, 5} 7.667 10.667 1.391 96.000 96.000
36 {2, 4, 6} 7.333 8.333 1.136 75.000 78.409
37 {2, 4, 7} 10.000 13.333 1.333 120.000 92.000
38 {2, 4, 8} 7.333 10.333 1.409 93.000 97.227
39 {2, 4, 9} 8.000 9.333 1.167 84.000 80.500
40 {2, 5, 6} 4.667 5.333 1.143 48.000 78.857
41 {2, 5, 7} 7.333 10.333 1.409 93.000 97.227
42 {2, 5, 8} 4.667 7.333 1.571 66.000 108.429
43 {2, 5, 9} 5.333 6.333 1.188 57.000 81.938
44 {2, 6, 7} 7.000 8.000 1.143 72.000 78.857
45 {2, 6, 8} 4.333 5.000 1.154 45.000 79.615
46 {2, 6, 9} 5.000 4.000 0.800 36.000 55.200
47 {2, 7, 8} 7.000 10.000 1.429 90.000 98.571
48 {2, 7, 9} 7.667 9.000 1.174 81.000 81.000
49 {2, 8, 9} 5.000 6.000 1.200 54.000 82.800
50 {3, 4, 5} 9.000 12.667 1.407 114.000 97.111
51 {3, 4, 6} 8.667 10.333 1.192 93.000 82.269
52 {3, 4, 7} 11.333 15.333 1.353 138.000 93.353
53 {3, 4, 8} 8.667 12.333 1.423 111.000 98.192
54 {3, 4, 9} 9.333 11.333 1.214 102.000 83.786
55 {3, 5, 6} 6.000 7.333 1.222 66.000 84.333
56 {3, 5, 7} 8.667 12.333 1.423 111.000 98.192
57 {3, 5, 8} 6.000 9.333 1.556 84.000 107.333
58 {3, 5, 9} 6.667 8.333 1.250 75.000 86.250
59 {3, 6, 7} 8.333 10.000 1.200 90.000 82.800
60 {3, 6, 8} 5.667 7.000 1.235 63.000 85.235
61 {3, 6, 9} 6.333 6.000 0.947 54.000 65.368
62 {3, 7, 8} 8.333 12.000 1.440 108.000 99.360
63 {3, 7, 9} 9.000 11.000 1.222 99.000 84.333
64 {3, 8, 9} 6.333 8.000 1.263 72.000 87.158
65 {4, 5, 6} 6.333 8.667 1.368 78.000 94.421
66 {4, 5, 7} 9.000 13.667 1.519 123.000 104.778
67 {4, 5, 8} 6.333 10.667 1.684 96.000 116.211
68 {4, 5, 9} 7.000 9.667 1.381 87.000 95.286
69 {4, 6, 7} 8.667 11.333 1.308 102.000 90.231
70 {4, 6, 8} 6.000 8.333 1.389 75.000 95.833
71 {4, 6, 9} 6.667 7.333 1.100 66.000 75.900
72 {4, 7, 8} 8.667 13.333 1.538 120.000 106.154
73 {4, 7, 9} 9.333 12.333 1.321 111.000 91.179
74 {4, 8, 9} 6.667 9.333 1.400 84.000 96.600
75 {5, 6, 7} 6.000 8.333 1.389 75.000 95.833
76 {5, 6, 8} 3.333 5.333 1.600 48.000 110.400
77 {5, 6, 9} 4.000 4.333 1.083 39.000 74.750
78 {5, 7, 8} 6.000 10.333 1.722 93.000 118.833
79 {5, 7, 9} 6.667 9.333 1.400 84.000 96.600
80 {5, 8, 9} 4.000 6.333 1.583 57.000 109.250
81 {6, 7, 8} 5.667 8.000 1.412 72.000 97.412
82 {6, 7, 9} 6.333 7.000 1.105 63.000 76.263
83 {6, 8, 9} 3.667 4.000 1.091 36.000 75.273
84 {7, 8, 9} 6.333 9.000 1.421 81.000 98.053
(c)
52 CHAPTER 4. RATIO AND REGRESSION ESTIMATION
[Figure: two histograms (Frequency scale) of the sampling distributions of N ȳ and t̂yr.]
The shapes are actually quite similar; however, it appears that the histogram of the
ratio estimator is a little less spread out.
(d) The mean of the sampling distribution of t̂yr is 83.733; the variance is 208.083 and
the bias is 83.733 − 83 = 0.733. By contrast, the mean of the sampling distribution
of N ȳ is 83 and its variance is 518.169.
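The pattern in (d) — N ȳ exactly unbiased, the ratio estimator of t slightly biased — can be reproduced in miniature. The sketch below uses a hypothetical 5-unit population (our illustration, not the exercise's data) and enumerates every SRS of size 2:

```python
from itertools import combinations
from statistics import mean

# Hypothetical population for illustration only (not the exercise's data)
x = [4, 5, 6, 6, 9]
y = [6, 8, 9, 10, 12]
N, n = len(x), 2
t_y, t_x = sum(y), sum(x)

nybar_vals, tyr_vals = [], []
for s in combinations(range(N), n):       # all possible SRSs of size n
    xbar = mean(x[i] for i in s)
    ybar = mean(y[i] for i in s)
    nybar_vals.append(N * ybar)           # SRS estimator N*ybar
    tyr_vals.append(t_x * ybar / xbar)    # ratio estimator t_x*(ybar/xbar)

print(mean(nybar_vals) - t_y)   # 0.0: N*ybar is exactly unbiased
print(mean(tyr_vals) - t_y)     # nonzero: the ratio estimator is biased
```

Averaging each estimator over all 10 equally likely samples gives its exact expectation, which is how the bias in (d) was obtained.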
(e) From (4.6),
Bias (ȳˆr ) = 0.07073094.
4.3 (a) The solid line is from regression estimation; the dashed line from ratio
estimation; the dashed/dotted line has equation y = 107.4.
[Figure: plot of age vs. diameter (diameter 6–12), with the regression, ratio, and y = 107.4 lines.]
(b, c)
Method ȳˆ SE (ȳˆ)
SRS, ȳ 107.4 6.35
Ratio 117.6 4.35
Regression 118.4 3.96
For ratio estimation, B̂ = 11.41946; for regression estimation, B̂0 = −7.808 and
B̂1 = 12.250. Note that the sample correlation of age and diameter is 0.78, so we
would expect both ratio and regression estimation to improve precision.
To calculate V̂(ȳˆr) using (4.9), note that s_e² = 321.933, so that

V̂(ȳˆr) = (1 − 20/1132)(10.3/9.405)²(321.933/20) = 18.96

and SE(ȳˆr) = 4.35. For the regression estimator, we have s_e² = 319.6, so

V̂(ȳˆreg) = (1 − 20/1132)(319.6/20) = 15.7.
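The arithmetic for both variance estimates can be verified with a short Python check (ours, not part of the original SAS analysis):

```python
from math import sqrt

n, N = 20, 1132                 # sample and population sizes
fpc = 1 - n / N                 # finite population correction

# Ratio estimator, (4.9): includes the factor (xbar_U / xbar)^2
v_ratio = fpc * (10.3 / 9.405) ** 2 * 321.933 / n
# Regression estimator: fpc * s_e^2 / n
v_reg = fpc * 319.6 / n

print(round(v_ratio, 2), round(sqrt(v_ratio), 2))  # 18.96 4.35
print(round(v_reg, 1), round(sqrt(v_reg), 2))      # 15.7 3.96
```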
data trees;
input treenum diam age @@;
sampwt = 1132/20;
datalines;
1 12.0 125 11 5.7 61
2 11.4 119 12 8.0 80
3 7.9 83 13 10.3 114
4 9.0 85 14 12.0 147
5 10.5 99 15 9.2 122
6 7.9 117 16 8.5 106
7 7.3 69 17 7.0 82
8 10.2 133 18 10.7 88
9 11.7 154 19 9.3 97
10 11.3 168 20 8.2 99
;
proc print data=trees;
run;
data ratioout1;
set ratioout;
xmean = 10.3;
ratiomean = ratio*xmean;
semean = stderr*xmean;
lowercls = lowercl*xmean;
uppercls = uppercl*xmean;
data treesresid;
set trees;
resid = age - 11.419458*diam;
resid2 = resid*(10.3/ 9.405);
data trees2;
set trees;
resid = age - (-7.8080877 + 12.2496636*diam);
proc surveymeans data=trees2 total =1132;
weight sampwt;
var resid age diam;
run;
The output from proc surveyreg gives a larger standard error for the regression
estimator:
Standard
Parameter Estimate Error t Value Pr > |t|
4.5 There are 85 18-hole courses in the sample. For these 85 courses, the sample
mean weekend greens fee is

ȳd = 34.829

and the sample variance is

s_d² = 395.498.

Using results from Section 4.3,

SE[ȳd] = √(395.498/85) = 2.16.
filename golfsrs
data golfsrs;
infile golfsrs delimiter="," dsd firstobs=2;
/* The dsd option allows SAS to read the missing values between
successive delimiters */
input RN state $ holes type $ yearblt wkday18 wkday9 wkend18
wkend9 backtee rating par cart18 cart9 caddy $ pro $;
sampwt = 14938/120;
if holes = 18 then holes18 = 1;
else holes18=0;
Data Summary

Statistics

                                       Std Error
holes18   Variable    N    Mean        of Mean
0         wkend18     0    .           .
1         wkend18    85    34.828824   2.144660

Statistics

holes18   Variable    95% CL for Mean
0         wkend18     .           .
1         wkend18     30.5639320  39.0937150
4.6 As you can see from the plot of weekend greens fee vs. back-tee yardage, this
is not a “classical” straight-line relationship. The variability in weekend greens fee
appears to increase with the back-tee yardage. Nevertheless, we can estimate the
slope and intercept, with
ŷ = −37.26 + 0.0113x.

(We’ll discuss standard errors in Chapter 11.) For estimating the ratio, we have

B̂ = ȳ/x̄ = 34.83/6392.29 = 0.00545.

Using (4.10), with s_e² the sample variance of the residuals,

V̂(B̂) ≈ s_e²/(n x̄²) = 362.578/[(85)(6392.29)²] = 1.044 × 10⁻⁷

SE(B̂) = 0.00032.
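A quick numerical check of the ratio and its standard error (our Python sketch, not part of the original solution):

```python
from math import sqrt

n = 85                      # 18-hole courses in the sample
ybar, xbar = 34.83, 6392.29 # mean weekend fee, mean back-tee yardage
se2 = 362.578               # sample variance of the ratio residuals

B_hat = ybar / xbar
v_B = se2 / (n * xbar ** 2)     # (4.10), fpc omitted as in the text

print(round(B_hat, 5))          # 0.00545
print(round(sqrt(v_B), 5))      # 0.00032
```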
4.7 (a) 88 courses have a golf professional. For these 88 courses, ȳd1 = 23.5983 and
s_d1² = 387.7194, so

V̂(ȳd1) = 387.7194/88 = 4.4059.

(b) For the 32 courses without a golf professional, ȳd2 = 10.6797 and s_d2² = 19.146,
so

V̂(ȳd2) = 19.146/32 = 0.5983.
s_e² = 149,902,393,481.

ŷ = 267029.8 + 47.65325x

Then

ȳˆreg = 267029.8 + 47.65325(647.7467) = 297897.04

and

t̂yreg = 3078 ȳˆreg = 916,927,075.

The estimated variance of the residuals from the regression is s_e² = 118,293,647,832,
which implies from (4.19) that

SE[t̂yreg] = 3078 √[(1 − 300/3078)(s_e²/300)] = 58,065,813.
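The regression total and its standard error can be checked numerically (a Python sketch of ours; small differences reflect the rounding of ȳˆreg in the text):

```python
from math import sqrt

N, n = 3078, 300
ybar_reg = 297897.04        # regression estimate of the mean of acres92
se2 = 118293647832          # residual variance from the regression

t_hat = N * ybar_reg
se_t = N * sqrt((1 - n / N) * se2 / n)   # (4.19)

print(round(t_hat))   # within rounding of 916,927,075
print(round(se_t))    # within rounding of 58,065,813
```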
(d) Clearly, for this response, it is better to use acres87 as an auxiliary variable.
The correlation of farms87 with acres92 is only 0.06; using farms87 as an auxiliary
variable does not improve on the SRS estimate N ȳ. The correlation of acres92 and
acres87, however, exceeds 0.99. Here are the various estimates for the population
total of acres92:
Estimate t̂ SE [t̂]
SRS, N ȳ 916,927,110 58,169,381
Ratio, x = acres87 951,513,191 5,344,568
Ratio, x = farms87 960,155,061 65,364,822
Regression, x = farms87 916,927,075 58,065,813
59
Moral: Ratio estimation can lead to greatly increased precision, but should not
be used blindly. In this case, ratio estimation with auxiliary variable farms87 had
larger standard error than if no auxiliary information were used at all. The regression
estimate of t is similar to N ȳ, because the regression slope is small relative to the
magnitude of the data. The regression slope is not significantly different from 0;
as can be seen from the picture in (a), the straight-line regression model does not
describe the counties with few but large farms.
4.9 We use results from Section 4.2. (a) Let yi = acres92 for county i, and
xi = farms92 for county i. Define

Then

and

SE[t̂y1] = N √[(1 − 300/3078)(s_u²/300)] = 3078 √[(1 − 300/3078)(109,710,284,064/300)] = 55,919,525.

(b) Now

t̂y2 = 3078(136123.2) = 418,987,302

and

SE[t̂y2] = 3078 √[(1 − 300/3078)(53,195,371,851/300)] = 38,938,277.
4.10 (a)
data cherry;
infile cherries delimiter=',' firstobs=2;
input diam height vol;
sampwt = 2967/31;
obsnum = _n_;
label diam = ’diam (in) at 4.5 feet’
height = ’height of tree (feet)’
vol = ’volume of tree (cubic feet)’
sampwt = ’sampling weight’
;
/* Plot and print the data set */
proc surveymeans data = cherry total=2967 mean clm sum clsum ratio ;
weight sampwt;
var diam vol;
ratio ’vol/diam’ vol/diam;
ods output Statistics=statsout Ratio=ratioout;
run;
data ratioout1;
set ratioout;
xtotal = 41835;
ratiosum = ratio*xtotal;
sesum = stderr*xtotal;
lowercls = lowercl*xtotal;
uppercls = uppercl*xtotal;
Using this code, we obtain t̂yr = 95272.16 with 95% CI of [84,098, 106,446].
(c) SAS code and output follow:
Standard
Parameter Estimate Error t Value Pr > |t|
95% Confidence
Parameter Interval
Note that the estimate from regression estimation is quite a bit higher than the
estimate from ratio estimation. In addition, the CI for regression estimation is
narrower than the CIs for t̂yr or N ȳ. This is because the regression model is a
better fit to the data than the ratio model.
4.11 (a) The variable number of physicians has a skewed distribution. The first
histogram excludes Cook County, Illinois (with yi = 15,153) for slightly better
visibility.
The next histogram, of all 100 counties, depicts the logarithm of (number of physi-
cians + 1) (which is still skewed).
N ȳ = (3141)(297.17) = 933,411

with

SE(N ȳ) = N √[(1 − n/N)(s_y²/n)]
        = 3141 √[(1 − 100/3141)(2,534,052/100)]
        = 3141 √24533.75
        = 491,983.
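The SRS total and its standard error can be confirmed with a short Python check (ours, not part of the original solution):

```python
from math import sqrt

N, n = 3141, 100
ybar, s2y = 297.17, 2534052   # sample mean and variance of physicians

t_hat = N * ybar
se = N * sqrt((1 - n / N) * s2y / n)

print(round(t_hat))   # 933411
print(round(se))      # 491983
```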
The standard error is large compared with the estimated total number of physicians.
The extreme skewness of the data makes us suspect that N ȳ does not follow an
approximate normal distribution, and that a confidence interval of the form N ȳ ±
1.96 SE(N ȳ) would not have 95% coverage in practice. In fact, when we substitute
sample quantities into (2.23), we obtain

as the required minimum sample size for ȳ to approximately follow a normal distribution.
(c) Again, we omit Cook County.
Regression estimation:

and

SE(ȳˆreg) = √[(1 − n/N)(s_e²/n)] = √[(1 − 100/3141)(114644.1/100)] = 33.316.

Consequently,

t̂yreg = N ȳˆreg = 3141(186.52) = 585871

and

SE(t̂yreg) = N SE(ȳˆreg) = 3141(33.316) = 104645.
The standard error from proc surveyreg is smaller and equals 92535.
(e) Ratio estimation and regression estimation both lead to a smaller standard error,
and an estimate that is closer to the true value.
Here is SAS code for performing these analyses:
data counties;
infile counties firstobs=2 delimiter=",";
input RN State County landarea totpop physician enroll
percpub civlabor unemp farmpop numfarm farmacre fedgrant
fedciv milit veterans percviet ;
sampwt = 3141/100;
logphys = log(physician+1);
data countnoCook;
set counties;
if physician gt 10000 then delete;
data ratioout1;
set ratioout;
xtot = 255077536;
ratiotot = ratio*xtot;
setot = stderr*xtot;
lowercls = lowercl*xtot;
uppercls = uppercl*xtot;
data resid;
set counties;
resid = physician -0.002507*totpop;
resid2 = resid*(255077536/ 372306374);
/* Use g-weights in SE formula*/
4.12 (a) The distribution appears to be skewed, but not quite as skewed as the
distribution in Exercise 4.11.
with

SE(N ȳ) = 3141 √[(1 − 100/3141)(1,138,630/100)] = 329,787.
SAS proc surveymeans gives the same result.
(c) Note that corr(farmpop,landarea) = −0.058. We would not expect ratio or
regression estimation to do well here.
Here is SAS code that may be used to compute these estimates. See Exercise 4.11
solution for reading in the data.
data ratioout1;
set ratioout;
xtot = 3536278;
ratiotot = ratio*xtot;
setot = stderr*xtot;
lowercls = lowercl*xtot;
uppercls = uppercl*xtot;
4.13 (a) As in Exercise 4.11, we omit Cook County, Illinois, from the histogram so
we can see the other data points better. Cook County has 457,880 veterans. The
distribution is very skewed; Cook County is an extreme outlier.
N ȳ = (3141)(12249.71) = 38,476,339

SE(N ȳ) = 3141 √[(1 − 100/3141)(2,263,371,150/100)] = 14,703,478.
(c) Again, Cook County is omitted from this plot. These data appear very close to
a straight line. We would expect ratio or regression estimation to help immensely.
Regression estimation:
data countnoCook;
set counties;
if veterans gt 400000 then delete;
data ratioout1;
set ratioout;
xtot = 255077536;
ratiotot = ratio*xtot;
setot = stderr*xtot;
lowercls = lowercl*xtot;
uppercls = uppercl*xtot;
data resid;
set counties;
resid = veterans -0.002507*totpop;
resid2 = resid*(255077536/ 372306374);
/* Use g-weights in SE formula*/
From (4.8),

E[(B̂ − B)²] ≈ (1 − n/N)(1/(n x̄_U²))(S_y² − 2B R S_x S_y + B² S_x²).

The approximate MSE is thus of order 1/n, while the squared bias is of order 1/n².
Consequently, MSE(B̂) ≈ V(B̂).
4.22 A rigorous proof, showing that the lower order terms are negligible, is beyond
the scope of this book. We give an argument for (4.6).

E[B̂ − B] = E[ ȳ/x̄ − ȳ_U/x̄_U ]
         = E[ (ȳ/x̄_U)(x̄_U/x̄) − ȳ_U/x̄_U ]
         = E[ (ȳ/x̄_U)(1 − (x̄ − x̄_U)/x̄) − ȳ_U/x̄_U ]
         = −E[ ȳ(x̄ − x̄_U)/(x̄_U x̄) ]
         = E[ (ȳ(x̄ − x̄_U)/x̄_U²)((x̄ − x̄_U)/x̄ − 1) ]
         = (1/x̄_U²){ B V(x̄) − Cov(x̄, ȳ) + E[(B̂ − B)(x̄ − x̄_U)²] }
         ≈ (1/x̄_U²)[ B V(x̄) − Cov(x̄, ȳ) ].
and

ȳ2 − ȳ_U2 = (1/(1 − x̄_U))[ ȳ − ū − ((t_y − t_u)/(N − t_x))(1 − x̄) ](1 + (x̄ − x̄_U)/x̄).

The covariance follows because the expected values of terms involving x̄ − x̄_U are
small compared with the other terms.
(b) Note that because xi takes on only values 0 and 1, xi² = xi, xi ui = ui, and

Σ_{i=1}^N [ui − ū_U − (t_u/t_x)(xi − x̄_U)][yi − ȳ_U − ui + ū_U + ((t_y − t_u)/(N − t_x))(xi − x̄_U)]

= Σ_{i=1}^N [ui − ū_U − (t_u/t_x)(xi − x̄_U)][yi − ui + ((t_y − t_u)/(N − t_x)) xi]

= Σ_{i=1}^N [ ui yi − ui² + ((t_y − t_u)/(N − t_x)) ui xi − (t_u/t_x) xi yi + (t_u/t_x) ui xi − (t_u/t_x)((t_y − t_u)/(N − t_x)) xi² ]
  + ((t_u/t_x) x̄_U − ū_U)(t_y − t_u + ((t_y − t_u)/(N − t_x)) t_x)

= Σ_{i=1}^N [ ((t_y − t_u)/(N − t_x)) ui − (t_u/t_x) ui + (t_u/t_x) ui − (t_u/t_x)((t_y − t_u)/(N − t_x)) xi ] + 0

= ((t_y − t_u)/(N − t_x)) t_u − (t_u/t_x)((t_y − t_u)/(N − t_x)) t_x

= 0.

Consequently,

Cov( ū − (t_u/t_x) x̄ , ȳ − ū − ((t_y − t_u)/(N − t_x))(1 − x̄) ) = 0.
4.25 We use the multivariate delta method (see for example Lehmann, 1999, p.
295). Let

g(a, b) = (b/a) x̄_U

so that g(x̄, ȳ) = ȳˆr and g(x̄_U, ȳ_U) = ȳ_U. Then,

∂g/∂a = −x̄_U b/a²,

and

∂g/∂b = x̄_U/a.

Thus, the asymptotic distribution of

√n [g(x̄, ȳ) − g(x̄_U, ȳ_U)] = √n [ȳˆr − ȳ_U]
4.28

Σ_{i=1}^N (yi − ȳ_U − B1[xi − x̄_U])²/(N − 1)

= Σ_{i=1}^N [(yi − ȳ_U)² − 2B1(xi − x̄_U)(yi − ȳ_U) + B1²(xi − x̄_U)²]/(N − 1)

= S_y² − 2B1 R S_x S_y + B1² S_x²

= S_y² − 2(R S_y/S_x) R S_x S_y + (R² S_y²/S_x²) S_x²

= S_y²(1 − R²).

Thus,

Now,

B̂1 = Σ_{i∈S}(xi − x̄)(yi − ȳ) / Σ_{i∈S}(xi − x̄)²

and

(xi − x̄)(yi − ȳ) = (xi − x̄)(yi − ȳ_U + ȳ_U − ȳ)
                 = (xi − x̄)[di + B1(xi − x̄_U) + ȳ_U − ȳ]
                 = (xi − x̄)[di + B1(xi − x̄) + B1(x̄ − x̄_U) + ȳ_U − ȳ]

with

Σ_{i∈S}(xi − x̄)(yi − ȳ) = Σ_{i∈S}(xi − x̄)di + B1 Σ_{i∈S}(xi − x̄)².

Thus,

B̂1 = B1 + [ Σ_{i∈S}(xi − x̄_U)di + (x̄_U − x̄) Σ_{i∈S} di ] / Σ_{i∈S}(xi − x̄)².
and E[d̄(x̄_U − x̄)²] is of smaller order than Cov(q̄, x̄), so the approximation is shown.
4.32 From linear models theory, if Y = Xβ + ε, with E[ε] = 0 and Cov[ε] = σ²A,
then the weighted least squares estimator of β is

with

V[β̂] = σ²(XᵀA⁻¹X)⁻¹.

This result may be found in any linear models book (for instance, Christensen, 1996,
p. 31). In our case,

so

β̂ = (XᵀA⁻¹X)⁻¹XᵀA⁻¹Y = Σ_{i=1}^n Yi / Σ_{i=1}^n xi = ȳ/x̄

and

V[β̂] = σ²(XᵀA⁻¹X)⁻¹ = σ²/Σ_{i=1}^n xi.
(e) From linear models theory, if Y = Xβ + ε, with E[ε] = 0 and Cov[ε] = σ²A,
then the weighted least squares estimator of β is

with

V[β̂] = σ²(XᵀA⁻¹X)⁻¹.

Here, A = diag(xi²), so

XᵀA⁻¹X = Σ_{i∈S} xi (1/xi²) xi = n

and

XᵀA⁻¹Y = Σ_{i∈S} xi (1/xi²) yi = Σ_{i∈S} Yi/xi.
4.42
(a)
domain business;
run;
Business in
which
vehicle was
most often
used during
2002          Variable      Std Dev         95% CL for Sum
For-hire tra MILES_ANNL 1608230919 6.91207E10 7.54249E10
Vehicle leas MILES_ANNL 1213307392 1.76465E10 2.24027E10
Agriculture, MILES_ANNL 1354386330 2.14654E10 2.67745E10
Mining MILES_ANNL 360265917 2705422697 4117663856
Utilities MILES_ANNL 942933274 8396528057 1.20928E10
Construction MILES_ANNL 2821651145 7.03757E10 8.14366E10
Manufacturin MILES_ANNL 1399406209 1.26417E10 1.81274E10
Wholesale tr MILES_ANNL 1348917090 1.43196E10 1.96073E10
Retail trade MILES_ANNL 1422261638 2.46828E10 3.02581E10
Information MILES_ANNL 923917751 3811137245 7432891659
Waste manage MILES_ANNL 901989658 8941377763 1.24772E10
Arts, entert MILES_ANNL 353650310 1090929855 2477237856
Accommodatio MILES_ANNL 677802928 4487821313 7144806463
Other servic MILES_ANNL 2201296141 3.14617E10 4.00907E10
(b)
Type of
Transmission Variable Mean
(c)
5.1 If the nonresponse can be ignored, then p̂ is the ratio estimate of the proportion.
The variance estimate given in the problem, though, assumes that an SRS of voters
was taken. But this was a cluster sample—the sampling unit was a residential
telephone number, not an individual voter. As we expect that voters in the same
household are more likely to have similar opinions, the estimated variance using
simple random sampling is probably too small.
5.3 (a) This is a cluster sample because there are two levels of sampling units: the
wetlands are the psus and the sites are the ssus.
(b) The analysis is not appropriate. A two-sample t test assumes that all obser-
vations are independent. This is a cluster sample, however, and sites within the
same wetland are expected to be more similar than sites selected at random from
the population.
5.4 (a) This is a cluster sample because the primary sampling unit is the journal,
and the secondary sampling unit is an article in the journal from 1988.
(b) Let
Mi = number of articles in journal i
and

Σ_{i∈S} Mi = 148.
CHAPTER 5. CLUSTER SAMPLING WITH EQUAL PROBABILITIES
data journal;
infile journal delimiter=',' firstobs=2;
input numemp prob nonprob ;
sampwt = 1285/26;
/* weight = N/n since this is a one-stage cluster sample */
5.5
Data Summary
Number of Clusters 10
Number of Observations 196
Sum of Weights 1411.2
Statistics
Std Error
Variable Mean of Mean 95% CL for Mean
Statistics
5.6 (a) The SAS code below was used to calculate summary statistics and the
ANOVA table. The output is given below.
data worms;
do case = 1 to 12;
do can = 1 to 3;
input worms @@;
wt = (580/12)*(24/3);
output;
end;
end;
cards;
1 5 7
4 2 4
0 1 2
3 6 6
4 9 8
0 7 3
5 5 1
3 0 2
7 3 5
3 1 4
4 7 9
0 0 0
;
proc print data=worms;
run;
case 12 1 2 3 4 5 6 7 8 9 10 11 12
Sum of
Source DF Squares Mean Square F Value Pr > F
Level of ------------worms------------
1 3 4.33333333 3.05505046
2 3 3.33333333 1.15470054
3 3 1.00000000 1.00000000
4 3 5.00000000 1.73205081
5 3 7.00000000 2.64575131
6 3 3.33333333 3.51188458
7 3 3.66666667 2.30940108
8 3 1.66666667 1.52752523
9 3 5.00000000 2.00000000
10 3 2.66666667 1.52752523
11 3 6.66666667 2.51661148
12 3 0.00000000 0.00000000
Note that the second term contributes little to the standard error.
Here is the approximation from SAS, using PROC SURVEYMEANS:
Data Summary
Number of Clusters 12
Number of Observations 36
Sum of Weights 13920
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
______________________________________________________________
worms 36 3.638889 0.614716 2.28590770 4.99187008
______________________________________________________________
5.7 We used SAS to obtain the mean and standard deviation for each city, and to
plot the data.
city 6 1 2 3 4 5 6
Sum of
Source DF Squares Mean Square F Value Pr > F
Level of ------------cases------------
city N Mean Std Dev
1 10 177.300000 83.5996411
2 4 93.250000 29.2389580
3 7 116.285714 54.5396317
4 8 104.625000 64.5930945
5 2 17.500000 7.7781746
6 3 63.000000 22.2710575
Data Summary
Number of Clusters 6
Number of Observations 34
Sum of Weights 1267.5
Statistics
Std Error
Variable Mean of Mean 95% CL for Mean
___________________________________________________________
cases 120.688145 19.730137 69.9702138 171.406075
___________________________________________________________
We can also use (5.18) and (5.24) for calculating the total number of cases sold, and
(5.26) and (5.28) for calculating the average number of cases sold per supermarket.
Summary quantities are given in the spreadsheet below.
City   Mi   mi   ȳi   s_i²   t̂i   t̂i − Mi ȳˆr   (1 − mi/Mi) Mi² s_i²/mi
From (5.18),

t̂unb = (45/6)(20396.3) = 152972.

Using (5.24),

SE[t̂unb] = √[ 45²(1 − 6/45)(10952882/6) + (45/6)(2716418) ]
         = √(3,203,717,941 + 20,373,134)
         = 56,781.
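The unbiased total and its two-term standard error can be verified directly (a Python check of ours, not part of the original solution):

```python
from math import sqrt

N, n = 45, 6
sum_that = 20396.3          # sum of the estimated psu totals
s_t2 = 10952882             # sample variance of the estimated psu totals
second = 2716418            # sum of the within-psu variance terms

t_unb = (N / n) * sum_that
v = N**2 * (1 - n / N) * s_t2 / n + (N / n) * second

print(round(t_unb, 1))      # ~152972.2
print(round(sqrt(v)))       # 56781
```

Note how small the second (within-psu) term is relative to the first.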
data books;
infile booksdat delimiter=',' firstobs=2;
input shelf Mi number purchase replace;
sampwt = (44/12)*(Mi/5);
/* The crucial part for estimating the total is correctly
specifying the sampling weight. */
using SAS. Note that this approximation does not include the
second term in (5.21). */
It appears that the means and variances differ quite a bit for different shelves.
(b) Quantities used in calculation are in the spreadsheet below.
Shelf   Mi   mi   ȳi   s_i²   t̂i = Mi ȳi   (1 − mi/Mi) Mi² s_i²/mi   (t̂i − Mi ȳˆr)²
2 26 5 9.6 17.80 249.6 1943.76 132696.9
4 52 5 6.2 1.70 322.4 830.96 819661.7
11 70 5 9.2 18.70 644.0 17017.00 1017561.8
14 47 5 7.2 3.20 338.4 1263.36 594901.6
20 5 5 41.8 2666.70 209.0 0.00 8271.3
22 28 5 29.8 748.70 834.4 96432.56 30033.9
23 27 5 51.8 702.70 1398.6 83480.76 579293.8
31 29 5 61.2 353.20 1774.8 49165.44 1188301.2
37 21 5 50.4 147.30 1058.4 9898.56 316493.1
38 31 5 36.6 600.80 1134.6 96848.96 162144.0
40 14 5 54.2 595.70 758.8 15011.64 183399.3
43 27 5 6.6 4.80 178.2 570.24 210944.1
The standard error of t̂unb is 5733.52, which is quite large when compared with the
estimated total. The estimated coefficient of variation for t̂unb is

SE(t̂unb)/t̂unb = 5733.52/32637.73 = 0.176.
Statistics
Std Error
Variable Mean of Mean 95% CL for Mean
Statistics
Note that the with-replacement variance is too large because the psu sampling frac-
tion is large.
Here is the approximation using SAS and the without-replacement variance:
Statistics
Std Error
Variable Mean of Mean 95% CL for Mean
Statistics
Note that 5613 = 44√16274.6; SAS calculates only the first term of the without-replacement variance.
(c) To find the average replacement cost per book with the information given, we
use the ratio estimate in (5.28):
ȳˆr = Σ_{i∈S} t̂i / Σ_{i∈S} Mi = 8901.2/377 = 23.61.
5.9 (a)
It appears that the means and variances differ quite a bit for different shelves.
Here is SAS code and output for the purchase price of the books.
run;
Data Summary
Number of Clusters 12
Number of Observations 60
Sum of Weights 1382.33333
Statistics
Std Error
Variable Mean of Mean 95% CL for Mean
Statistics
5.10 Here is the sample ANOVA table for the books data, calculated using SAS.
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 11 25570.98333 2324.635 4.76 0.0001
Error 48 23445.20000 488.442
Corrected Total 59 49016.18333
We use (5.10) to estimate Ra²: we have

M̂SW = 488.44167,

S² = SSTO/dfTO = 49016.18333/59 = 830.78,

so

R̂a² = 1 − M̂SW/Ŝ² = 1 − 488.44/830.78 = 0.41.
The positive value of R̂a2 indicates that books on the same shelf do tend to have
more similar replacement costs.
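The adjusted R² computation follows directly from the ANOVA table; a quick Python check (ours, not part of the original solution):

```python
# ANOVA quantities from the SAS output above
msw = 23445.20000 / 48          # SSW / df(within)  = 488.44167
s2_hat = 49016.18333 / 59       # SSTO / df(total)  = 830.78

r2_adj = 1 - msw / s2_hat
print(round(msw, 5), round(s2_hat, 2), round(r2_adj, 2))  # 488.44167 830.78 0.41
```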
5.11 (a) Here, N = 828 and n = 85. We have the following frequencies for ti =
number of errors in claim i:
Number of errors Frequency
0 57
1 22
2 4
3 1
4 1
The 85 claims have a total of Σ_{i∈S} ti = 37 errors, so from (5.1),

t̂ = (828/85)(37) = 360.42

and

s_t² = (1/84) Σ_{i∈S} (ti − 37/85)²
     = (1/84)[ 57(0 − 37/85)² + 22(1 − 37/85)² + 4(2 − 37/85)² + (3 − 37/85)² + (4 − 37/85)² ]
     = (1/84)[10.800 + 7.02 + 9.79 + 6.58 + 12.71]
     = 0.558263.

ȳˆ = t̂/(NM) = 360.42/[(828)(215)] = 0.002025.

From (5.6),

SE[ȳˆ] = (1/215)√[(1 − 85/828)(s_t²/85)] = (1/215)(0.07677) = 0.000357.
(c) If the same number of errors (37) had been obtained using an SRS of 18,275 of
the 178,020 fields, the error rate would be

37/18,275 = 0.002025

(the same as in (a)). But the estimated variance from an SRS would be

V̂[p̂srs] = (1 − 18,275/178,020) p̂srs(1 − p̂srs)/18,274 = 9.92 × 10⁻⁸.

Thus

(Estimated variance under cluster design)/(Estimated variance under SRS) = (1.28 × 10⁻⁷)/(9.92 × 10⁻⁸) = 1.29.
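The whole of 5.11 can be reproduced numerically (our Python check, not part of the original solution; carrying full precision gives a design effect near 1.28, while the text's 1.29 comes from the rounded intermediate values):

```python
from math import sqrt

N, n, M = 828, 85, 215
t_sum = 37
freq = {0: 57, 1: 22, 2: 4, 3: 1, 4: 1}     # errors per sampled claim

t_hat = N / n * t_sum                        # (5.1)
tbar = t_sum / n
s_t2 = sum(f * (t - tbar) ** 2 for t, f in freq.items()) / (n - 1)
se_ybar = sqrt((1 - n / N) * s_t2 / n) / M   # (5.6)
v_cluster = se_ybar ** 2

# The same 37 errors from an SRS of 18,275 of the 178,020 fields
p = t_sum / 18275
v_srs = (1 - 18275 / 178020) * p * (1 - p) / 18274

print(round(t_hat, 2), round(s_t2, 6), round(se_ybar, 6))
print(v_cluster / v_srs)    # design effect, about 1.28-1.29
```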
5.13 (a) Cluster sampling is needed for this application because the household is the
sampling unit. Yet the Arizona Statutes specify the statistic that must be used for
estimating the error rate: it must be estimated by the sample mean. It is therefore
important to make sure that a self-weighting sample is taken. In this application,
a self-weighting sample will result if an SRS of n households is taken from the
population of N households in the county, and if all individuals in the household
are measured. It makes sense in this example to take a one-stage cluster sample.
V(t̂cluster)/V(t̂SRS) = 1 + [N(M − 1)/(N − 1)] Ra²,

so we can calculate a sample size by multiplying the number of persons needed for
an SRS (310) by the ratio of variances, then dividing by M. The following table
gives some sample sizes for different values of Ra²:
M Ra2 sample size
1 0.1 310
2 0.1 171
3 0.1 124
4 0.1 101
5 0.1 87
1 0.5 310
2 0.5 233
3 0.5 207
4 0.5 194
6 0.5 181
1 0.8 310
2 0.8 279
3 0.8 269
4 0.8 264
5 0.8 260
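Each table entry follows from the variance-ratio formula with N large, so that N(M − 1)/(N − 1) ≈ M − 1. A hedged Python sketch of the computation (the rounding convention — half up — is inferred from the table):

```python
def cluster_sample_size(M, Ra2, n_srs=310):
    """Per-household cluster sample size: multiply the SRS size by the
    design effect 1 + (M-1)*Ra2, divide by M, round half up."""
    deff = 1 + (M - 1) * Ra2
    return int(n_srs * deff / M + 0.5)

print(cluster_sample_size(2, 0.1))   # 171
print(cluster_sample_size(5, 0.1))   # 87
print(cluster_sample_size(4, 0.5))   # 194
print(cluster_sample_size(3, 0.8))   # 269
```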
5.14 (a) Treating the proportions as means, and letting Mi and mi be the number
of female students in the school and interviewed, respectively, we have the following
summaries.
School   Mi   mi   smokers   ȳi   s_i²   t̂i   t̂i − Mi ȳˆr   (1 − mi/Mi) Mi² s_i²/mi
1 792 25 10 0.4 .010 316.8 -24.7 243
2 447 15 3 0.2 .011 89.4 -103.3 147
3 511 20 6 0.3 .011 153.3 -67.0 139
4 800 40 27 0.675 .006 540.0 195.1 86
Sum 2550 100 1099.5 614
Var 17943
Then,

ȳˆr = 1099.5/2550 = 0.43

SE[ȳˆr] = (1/637.5)√[ (1 − 4/29)(17943/4) + (1/(4 · 29))(614) ]
        = (1/637.5)√(3867.0 + 5.3)
        = 0.098.
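A quick numerical check of the ratio estimate and its standard error (our Python sketch, not part of the original solution):

```python
from math import sqrt

Mbar = 637.5            # average psu size, 2550/4
n, N = 4, 29
t_hat_sum = 1099.5      # sum of estimated psu totals
between = 17943         # between-psu variance term
within = 614            # sum of within-psu variance terms

ybar_r = t_hat_sum / 2550
se = sqrt((1 - n / N) * between / n + within / (n * N)) / Mbar

print(round(ybar_r, 2), round(se, 3))   # 0.43 0.098
```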
(b) We construct a sample ANOVA table from the summary information in part
(a). Note that

ssw = Σ_{i=1}^4 Σ_{j=1}^{mi} (yij − ȳi)² = Σ_{i=1}^4 (mi − 1) s_i²,
and
Source df SS MS
Between psus 3
Within psus 96 0.837 0.0087
Total 99
5.15 (a) A cluster sample was used for this study because Arizona has no list of all
elementary school teachers in the state. All schools would have to be contacted to
construct a sampling frame of teachers, and this would be expensive. Taking a clus-
ter sample also makes it easier to distribute surveys. It’s possible that distributing
questionnaires through the schools might improve cooperation with the survey and
give respondents more assurance that their data are kept confidential.
(b) The means and standard deviations, after eliminating records with missing val-
ues, are in the following table:
School   Mi   mi   ȳi   s_i²   Mi ȳi   resids   Mi²(1 − mi/Mi) s_i²/mi
11 33 10 33.990 3.814 1121.670 5.501 289.508
12 16 13 36.123 24.428 577.969 36.797 90.196
13 22 3 34.583 0.521 760.833 16.721 72.569
15 24 24 36.756 0.557 882.150 70.391 0.000
16 27 24 36.840 1.165 994.669 81.440 3.933
18 18 2 35.000 0.000 630.000 21.181 0.000
19 16 3 34.867 0.053 557.867 16.694 3.698
20 12 8 36.356 6.197 436.275 30.396 37.185
21 19 5 35.410 9.948 672.790 30.147 529.234
22 33 13 35.677 24.835 1177.338 61.170 1260.867
23 31 16 35.175 11.506 1090.425 41.903 334.393
24 30 9 31.944 0.740 958.333 -56.366 51.776
25 23 8 31.250 0.446 718.750 -59.186 19.252
28 53 17 31.465 6.903 1667.630 -125.005 774.766
29 50 8 29.106 5.955 1455.313 -235.852 1563.270
30 26 22 35.791 3.045 930.564 51.158 14.393
31 25 18 34.525 1.761 863.125 17.543 17.118
32 23 16 35.456 2.932 815.494 37.558 29.503
33 21 5 26.820 0.145 563.220 -147.069 9.710
34 33 7 27.421 0.835 904.907 -211.261 102.333
36 25 4 36.975 8.769 924.375 78.793 1150.953
38 38 10 37.660 1.231 1431.080 145.795 130.978
41 30 2 36.875 0.101 1106.250 91.551 42.525
Sum 628 21241.026 6528.159
Variance 9349.2524
ȳˆr = 33.82

V(ȳˆr) = (1/27.30²)[ (1 − 23/245)(9349.252/23) + 6528.159/((23)(245)) ]
       = (1/745.53)[368.33 + 1.16]
       = 0.50.
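A numerical check of this variance calculation (our Python sketch, not part of the original solution):

```python
xbar = 27.30            # average number of teachers per school
n, N = 23, 245
between = 9349.2524     # between-psu variance term
within = 6528.159       # sum of within-psu variance terms

v = ((1 - n / N) * between / n + within / (n * N)) / xbar**2
print(round(v, 2))      # 0.5
```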
5.16 (a) Summary quantities for estimating ȳ and its variance are given in the table
below. Here, ki denotes the number sampled in school i. We use the number of
respondents in school i as mi .
School   Mi   ki   mi   Return   ȳi   t̂i   t̂i − Mi ȳˆr   Mi²(1 − mi/Mi) s_i²/mi
1 78 40 38 19 0.5000 39.0000 -6.1580 21.0811
2 238 38 36 19 0.5278 125.6111 -12.1786 342.3401
3 261 19 17 13 0.7647 199.5882 48.4828 716.1696
4 174 30 30 18 0.6000 104.4000 3.6630 207.3600
5 236 30 26 12 0.4615 108.9231 -27.7087 492.6675
6 188 25 24 13 0.5417 101.8333 -7.0089 332.8031
7 113 23 22 15 0.6818 77.0455 11.6243 106.2293
8 170 43 36 21 0.5833 99.1667 0.7455 158.1944
9 296 38 35 23 0.6571 194.5143 23.1456 511.9485
10 207 21 17 7 0.4118 85.2353 -34.6070 595.3936
Sum 1961 307 281 160 1135.3175 3484.1873
var 581.79702
An approximate 95% confidence interval for the percentage of parents who returned
the questionnaire is

0.5789 ± 1.96√0.001381 = [0.506, 0.652].

(c) If the clustering were (incorrectly!) ignored, we would have had p̂ = 160/281 =
0.569 with V̂(p̂) = 0.569(1 − 0.569)/280 = 0.000876.
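Both the clustered confidence interval and the (incorrect) SRS variance can be checked numerically (our Python sketch, not part of the original solution):

```python
from math import sqrt

p_hat = 1135.3175 / 1961          # ratio estimate of the return proportion
v_hat = 0.001381                  # clustered variance estimate
lo = p_hat - 1.96 * sqrt(v_hat)
hi = p_hat + 1.96 * sqrt(v_hat)

# Ignoring the clustering (incorrectly):
p_srs = 160 / 281
v_srs = p_srs * (1 - p_srs) / 280

print(round(p_hat, 4), round(lo, 3), round(hi, 3))  # 0.5789 0.506 0.652
print(round(p_srs, 3), round(v_srs, 6))             # 0.569 0.000876
```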
5.17 (a) The following table gives summary quantities; the column ȳi gives the
estimated proportion of children who had previously had measles in each school.
School   Mi   mi   Hadmeas   ȳi   t̂i   t̂i − Mi ȳˆr   Mi²(1 − mi/Mi) s_i²/mi
1 78 40 32 0.8000 62.4000 28.0573 12.1600
2 238 38 10 0.2632 62.6316 -42.1576 249.4572
3 261 19 12 0.6316 164.8421 49.9262 816.4986
4 174 30 19 0.6333 110.2000 33.5894 200.6400
5 236 30 16 0.5333 125.8667 21.9581 417.2408
6 188 25 6 0.2400 45.1200 -37.6546 232.8944
7 113 23 11 0.4783 54.0435 4.2906 115.3497
8 170 43 23 0.5349 90.9302 16.0808 127.8864
9 296 38 5 0.1316 38.9474 -91.3787 235.8449
10 207 21 11 0.5238 108.4286 17.2884 480.1837
Sum 1961 307 145 863.41 2888.1556
Var 1890.1486
(b)

ȳˆr = 863.41/1961 = 0.4403

V̂(ȳˆr) = (1/196.1²)[ (1 − 10/46)(1890.15/10) + 2888.16/((10)(46)) ]
       = (1/196.1²)[147.92 + 6.28]
       = 0.004

An approximate 95% CI is

0.4403 ± 1.96√0.004 = [0.316, 0.564].
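The same computation in a short Python check (ours, not part of the original solution):

```python
from math import sqrt

Mbar = 196.1            # average psu size, 1961/10
n, N = 10, 46
ybar_r = 863.41 / 1961
v = ((1 - n / N) * 1890.15 / n + 2888.16 / (n * N)) / Mbar**2
lo = ybar_r - 1.96 * sqrt(v)
hi = ybar_r + 1.96 * sqrt(v)

print(round(ybar_r, 4), round(v, 4))    # 0.4403 0.004
print(round(lo, 3), round(hi, 3))       # 0.316 0.564
```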
5.18 With the different costs, the last two columns of Table 5.4 change. The table
now becomes:
Number of Stems Cost to
Sampled Sample Relative
per Site ȳˆ SE(ȳˆ) One Field Net Precision
1 1.12 0.15 50 0.15
2 1.01 0.10 70 0.14
3 0.96 0.08 90 0.13
4 0.91 0.07 110 0.12
5 0.91 0.06 130 0.12
Now the relative net precision is highest when one stem is sampled per site.
5.19 (a) Here is the sample ANOVA table from SAS PROC GLM:
Sum of
Source DF Squares Mean Square F Value Pr > F
We estimate Ra² by

1 − 59910564633/(3.5496086 × 10¹³/299) = 0.4953455.
Thus

ICC = Σ_{i=1}^N Σ_{j=1}^M Σ_{k≠j} (yij − ȳ_U)(yik − ȳ_U) / [(NM − 1)(M − 1)S²]

    = [ M(SSB) − Σ_{i=1}^N Σ_{j=1}^M (yij − ȳ_U)² ] / [(NM − 1)(M − 1)S²]

    = [M(SSB) − SSTO] / [(M − 1)SSTO]

    = [M(SSTO − SSW) − SSTO] / [(M − 1)SSTO]

    = 1 − (M/(M − 1))(SSW/SSTO),

proving the result.
5.23 From (5.8),

ICC = 1 − (M/(M − 1))(SSW/SSTO).

Rewriting, we have

MSW = (1/(N(M − 1))) ((M − 1)/M) SSTO (1 − ICC)
    = ((NM − 1)/(NM)) S² (1 − ICC).
Using Table 5.1, we know that SSW + SSB = SSTO, so from (5.8),

ICC = 1 − (M/(M − 1)) [SSTO − (N − 1)MSB]/SSTO
    = −1/(M − 1) + M(N − 1)MSB/[(M − 1)(NM − 1)S²]

and

MSB = ((NM − 1)/(M(N − 1))) S² [1 + (M − 1)ICC].
St² = M(MSB);

Thus,

V(ȳˆunb) = (1/(NM)²)[ N²(1 − n/N)(M(MSB)/n) + (N/n)(1 − m/M)(M²/m) N(MSW) ]
         = (1 − n/N)(MSB/(nM)) + (1 − m/M)(MSW/(nm)).
(b) The first equality follows directly from (5.9). Because SSTO = SSB + SSW,

MSB = (1/(N − 1))[SSTO − N(M − 1)MSW]
    = (1/(N − 1))[(NM − 1)S² − N(M − 1)S²(1 − Ra²)]
    = (1/(N − 1))[(N − 1)S² + N(M − 1)S² Ra²]
    = S²[ N(M − 1)Ra²/(N − 1) + 1 ].
(c)

V(ȳˆ) = (1 − n/N)(S²/(nM))[ N(M − 1)Ra²/(N − 1) + 1 ] + (1 − m/M)(S²/(nm))(1 − Ra²).
(c) Let

Zi = 1 if psu i is in the sample, and 0 otherwise.

SSŴ = Σ_{i∈S} Σ_{j∈Si} (yij − ȳi)²

E[SSŴ] = E[ Σ_{i=1}^N Zi Σ_{j∈Si} (yij − ȳi)² ]
       = E{ Σ_{i=1}^N Zi E[ Σ_{j∈Si} (yij − ȳi)² | Z ] }
       = E{ Σ_{i=1}^N Zi (m − 1)E[s_i² | Z] }
       = E{ Σ_{i=1}^N Zi (m − 1)S_i² }
       = (n/N)(m − 1) Σ_{i=1}^N S_i²

Thus,

E[msw] = (1/N) Σ_{i=1}^N S_i² = MSW.
E[(n − 1)msb] = mE( E[ Σ_{i=1}^N Zi ȳi² | Z1, . . . , Zn ] ) − mnE[ȳˆunb²]

= mE[ Σ_{i=1}^N Zi {V(ȳi | Z) + ȳiU²} ] − mn[V(ȳˆunb) + ȳ_U²]

= mE[ Σ_{i=1}^N Zi {(1 − m/M)(S_i²/m) + ȳiU²} ]
  − mn[ (1/M²)(1 − n/N)(St²/n) + (1/(nN)) Σ_{i=1}^N (1 − m/M)(S_i²/m) + ȳ_U² ]

= (mn/N) Σ_{i=1}^N [ (1 − m/M)(S_i²/m) + ȳiU² ] − (m/M²)(1 − n/N)St²
  − (m/N) Σ_{i=1}^N (1 − m/M)(S_i²/m) − mnȳ_U²

= (m(n − 1)/N)(1 − m/M) Σ_{i=1}^N S_i²/m + mn[ (1/N) Σ_{i=1}^N ȳiU² − ȳ_U² ] − (m/M)(1 − n/N)(MSB)

= (m(n − 1)/N)(1 − m/M) Σ_{i=1}^N S_i²/m + (mn/N) Σ_{i=1}^N (ȳiU − ȳ_U)²
  − (m/M)(1 − n/N)(1/(N − 1))SSB

= (m(n − 1)/N)(1 − m/M) Σ_{i=1}^N S_i²/m + [ mn/(NM) − (m/(M(N − 1)))(1 − n/N) ]SSB

= (m(n − 1)/N)(1 − m/M) Σ_{i=1}^N S_i²/m + [ ((N − 1)mn − m(N − n))/(NM(N − 1)) ]SSB

= (n − 1)(1 − m/M)MSW + (n − 1)(m/M)MSB.

Thus

E[msb] = (1 − m/M)MSW + (m/M)MSB.
Now,

E[ Σ_{i∈S} Σ_{j∈Si} (yij − ȳˆ)² ] = E[SSŴ] + E[SSB̂]

= (n/N)(m − 1) Σ_{i=1}^N S_i² + (n − 1)(1 − m/M)MSW + (n − 1)(m/M)MSB

= [ (m − 1)n + (n − 1)(1 − m/M) ]MSW + (n − 1)(m/M)MSB

= [ nm − 1 − (n − 1)(m/M) ]MSW + (n − 1)(m/M)MSB

= [ nm − 1 − (n − 1)(m/M) ](1/(N(M − 1)))SSW + ((n − 1)/(N − 1))(m/M)SSB

= (nm − 1)/(NM − 1) [ ((NM − 1)/(N(M − 1)))SSW + ((n − 1)m/((N − 1)M))((NM − 1)/(nm − 1))SSB ]
  − (m(n − 1)/(NM(M − 1)))SSW

= (nm − 1)/(NM − 1) [ {1 + (N − 1)/(N(M − 1))}SSW
  + {1 + (nm(M − 1) − (m − 1)NM − M + m)/((N − 1)M(nm − 1))}SSB ]
  − (m(n − 1)/(NM(M − 1)))SSW

= (nm − 1)/(NM − 1) SSTO + (nm − 1)/(NM − 1) · [nm(M − 1) − (m − 1)NM − M + m]/((N − 1)M(nm − 1)) SSB
  + [ (nm − 1)/(NM − 1) · (N − 1)/(N(M − 1)) − m(n − 1)/(NM(M − 1)) ]SSW

= (nm − 1)/(NM − 1) [ SSTO + O(1/n)SSB + O(1/n)SSW ].
Consequently,
$$\begin{aligned}
E[\hat S^2] &= \frac{M(N-1)}{m(NM-1)}E[\text{msb}] + \frac{(m-1)NM+M-m}{m(NM-1)}E[\text{msw}]\\
&= \frac{M(N-1)}{m(NM-1)}\Big[\frac{m}{M}\text{MSB}+\Big(1-\frac{m}{M}\Big)\text{MSW}\Big] + \frac{(m-1)NM+M-m}{m(NM-1)}\text{MSW}\\
&= \frac{1}{NM-1}\text{SSB} + \frac{1}{m(NM-1)}\big[(N-1)(M-m)+(m-1)NM+M-m\big]\text{MSW}\\
&= \frac{1}{NM-1}\text{SSB} + \frac{N(M-1)}{NM-1}\text{MSW}\\
&= S^2.
\end{aligned}$$
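The identity just obtained, $S^2 = \text{SSB}/(NM-1) + N(M-1)\text{MSW}/(NM-1)$, can be checked numerically. A minimal Python sketch (not part of the original solution; the population values below are invented for illustration):

```python
# Artificial two-stage population: N = 3 psus, M = 4 ssus each.
pop = [[1, 2, 3, 4], [2, 4, 6, 8], [5, 5, 7, 9]]
N, M = len(pop), len(pop[0])

allv = [y for psu in pop for y in psu]
ybar = sum(allv) / (N * M)
S2 = sum((y - ybar) ** 2 for y in allv) / (N * M - 1)

ybar_i = [sum(psu) / M for psu in pop]            # psu means
SSB = M * sum((yi - ybar) ** 2 for yi in ybar_i)  # between-psu sum of squares
SSW = sum((y - ybar_i[i]) ** 2                    # within-psu sum of squares
          for i, psu in enumerate(pop) for y in psu)
MSW = SSW / (N * (M - 1))

rhs = SSB / (N * M - 1) + N * (M - 1) * MSW / (N * M - 1)
print(abs(S2 - rhs) < 1e-9)  # True
```

The check works because SSTO = SSB + SSW and $N(M-1)\text{MSW} = \text{SSW}$, so the right-hand side is just SSTO/(NM−1).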
5.27 The cost constraint implies that $n = C/(c_1 + c_2 m)$; substituting into (5.30), we have
$$g(m) = V(\hat{\bar y}_{\text{unb}}) = \frac{(c_1+c_2m)\text{MSB}}{CM} - \frac{\text{MSB}}{NM} + \Big(1-\frac{m}{M}\Big)\frac{(c_1+c_2m)\text{MSW}}{Cm},$$
$$\frac{dg}{dm} = \frac{c_2\,\text{MSB}}{CM} - \frac{c_1\,\text{MSW}}{Cm^2} - \frac{c_2\,\text{MSW}}{CM}.$$
Setting the derivative equal to zero and solving for $m$, we have
$$m = \sqrt{\frac{c_1 M\,\text{MSW}}{c_2(\text{MSB}-\text{MSW})}}.$$
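The optimal-$m$ formula is straightforward to evaluate. A small Python sketch (not in the original solution); the cost and mean-square values plugged in below are hypothetical, chosen only to illustrate the calculation:

```python
from math import sqrt

def optimal_m(c1, c2, M, msw, msb):
    """m = sqrt(c1 * M * MSW / (c2 * (MSB - MSW))), from setting dg/dm = 0."""
    return sqrt(c1 * M * msw / (c2 * (msb - msw)))

# Hypothetical inputs: psu cost c1 = 64, ssu cost c2 = 1,
# M = 400 ssus per psu, MSW = 25, MSB = 100.
m_opt = optimal_m(64, 1, 400, 25, 100)
print(round(m_opt, 1))  # 92.4
```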
5.28 This exercise does not rely on methods developed in this chapter (other than
for a general knowledge of systematic sampling), but represents the type of problem
a sampling practitioner might encounter. (A good sampling practitioner must be
versatile.)
(a) For all three cases, P (detect the contaminant) = P (distance between container
and nearest grid point is less than R). We can calculate the probability by using
simple geometry and trigonometry.
Case 1: R < D. Since we assume that the waste container is equally likely to be anywhere in the square relative to the nearest grid point (for a square grid with spacing 2D, the container falls uniformly in a 2D × 2D square centered at that grid point), the probability of detection is the ratio of the circle's area to the square's area:
$$P(\text{detect}) = \frac{\pi R^2}{4D^2}.$$

Case 2: $D \le R \le \sqrt{2}\,D$. The circle now extends past the four sides of the square, so we subtract the four circular segments that fall outside:
$$P(\text{detect}) = \frac{\pi R^2 - 4\big[R^2\arccos(D/R) - D\sqrt{R^2 - D^2}\big]}{4D^2}.$$

Case 3: $R > \sqrt{2}\,D$. The probability of detection is 1.
(b) Even though the square grid is commonly used in practice, we can increase the
probability of detecting a contaminant by staggering the rows.
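The detection probabilities can also be checked by simulation. The sketch below (Python, not part of the original solution) assumes the geometry used above, a square grid with spacing 2D, so the container falls uniformly in a 2D × 2D square centered at the nearest grid point:

```python
import math
import random

def detect_prob(R, D, trials=200_000, seed=1):
    """Monte Carlo estimate of P(container lies within distance R of the
    nearest grid point), for a square grid with spacing 2D: the container
    falls uniformly in the 2D x 2D square centered at a grid point."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x, y = rng.uniform(-D, D), rng.uniform(-D, D)
        if x * x + y * y <= R * R:
            hits += 1
    return hits / trials

D = 1.0
print(detect_prob(0.5, D))   # Case 1: close to pi R^2 / (4 D^2) = 0.196
print(detect_prob(1.2, D))   # Case 2: close to 0.951
print(detect_prob(1.5, D))   # Case 3 (R > sqrt(2) D): exactly 1.0
```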
(b) Let
$$c_{ij} = \begin{cases} b_{ij} - 1 & \text{if } i \in S \text{ and } j \in S_i \\ -1 & \text{otherwise.} \end{cases}$$
Then
$$\hat T - T = \sum_{i=1}^N \sum_{j=1}^{M_i} c_{ij} y_{ij}.$$
$$E[(\hat T - T)^2] = \sum_{i\in S}\Big[\sum_{j\in S_i}b_{ij} - M_i\Big]^2\sigma_A^2 + \sum_{i\notin S}M_i^2\sigma_A^2
+ \sum_{i\in S}\Big[\sum_{j\in S_i}(b_{ij}^2 - 2b_{ij}) + M_i\Big]\sigma^2 + \sum_{i\notin S}M_i\sigma^2.$$
Then
$$\frac{\partial g}{\partial\lambda} = \sum_{i\in S} m_i - L$$
and, for $k \in S$,
$$\frac{\partial g}{\partial m_k} = -\frac{M_k^2}{m_k^2} + \lambda.$$
Setting the partial derivatives equal to zero, we have that $M_k/m_k = \sqrt{\lambda}$ is constant for all $k$; that is, $m_k$ is proportional to $M_k$.
Chapter 6
118 CHAPTER 6. SAMPLING WITH UNEQUAL PROBABILITIES
psu   ψ_i        Cumulative ψ_i (from, to)
1 0.000110 0.000000 0.000110
2 0.018556 0.000110 0.018666
3 0.062999 0.018666 0.081665
4 0.078216 0.081665 0.159881
5 0.075245 0.159881 0.235126
6 0.073983 0.235126 0.309109
7 0.076580 0.309109 0.385689
8 0.038981 0.385689 0.424670
9 0.040772 0.424670 0.465442
10 0.022876 0.465442 0.488318
11 0.003721 0.488318 0.492039
12 0.024917 0.492039 0.516956
13 0.040654 0.516956 0.557610
14 0.014804 0.557610 0.572414
15 0.005577 0.572414 0.577991
16 0.070784 0.577991 0.648775
17 0.069635 0.648775 0.718410
18 0.034650 0.718410 0.753060
19 0.069492 0.753060 0.822552
20 0.036590 0.822552 0.859142
21 0.033853 0.859142 0.892995
22 0.016959 0.892995 0.909954
23 0.009066 0.909954 0.919020
24 0.021795 0.919020 0.940815
25 0.059185 0.940815 1.000000
(Note: the numbers in the "Cumulative ψ_i" columns were rounded to fit in the table.)
Ten random numbers I generated between 0 and 1 were:
{0.46242032, 0.34980142, 0.35083063, 0.55868338, 0.62149246,
0.03779992, 0.88290415, 0.99612658, 0.02660724, 0.26350658}.
Using these ten random numbers would result in psu’s 9, 7, 7, 14, 16, 3, 21, 25, 3,
and 6 being the sample.
(b) Here max{ψ_i} = 0.078216. To use Lahiri's method, we select two random numbers for each draw: the first is a random integer between 1 and 25, and the second is a random uniform between 0 and 0.08 (or any other number larger than max{ψ_i}). Thus, if our pair of random numbers is (20, 0.054558), we reject the pair and try again because 0.054558 > ψ_20 = 0.03659. If the next pair is (8, 0.028979), we include psu 8 in the sample.
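Lahiri's rejection step is easy to code. A Python sketch (the original solution has no code for this part), using the ψ_i from the table above:

```python
import random

# psi_i for psus 1-25, from the table above
psi = [0.000110, 0.018556, 0.062999, 0.078216, 0.075245,
       0.073983, 0.076580, 0.038981, 0.040772, 0.022876,
       0.003721, 0.024917, 0.040654, 0.014804, 0.005577,
       0.070784, 0.069635, 0.034650, 0.069492, 0.036590,
       0.033853, 0.016959, 0.009066, 0.021795, 0.059185]

def lahiri_draw(psi, cap, rng):
    """One pps draw by Lahiri's method: generate (psu, uniform) pairs and
    accept the psu when the uniform falls below its psi; cap > max(psi)."""
    while True:
        i = rng.randrange(len(psi))   # random psu, 0-based
        u = rng.uniform(0.0, cap)
        if u < psi[i]:
            return i + 1              # 1-based psu label

rng = random.Random(42)
sample = [lahiri_draw(psi, 0.08, rng) for _ in range(10)]
print(sample)  # ten psus drawn with replacement with probabilities psi
```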
6.3 Calculate $\hat t_{\psi S} = t_i/\psi_i$ for each sample. Then
$$E[\hat t_\psi] = \frac{1}{16}(1200) + \frac{2}{16}(600) + \frac{3}{16}(400) + \frac{10}{16}(120) = 300$$
and
$$V[\hat t_\psi] = \frac{1}{16}(810{,}000) + \cdots + \frac{10}{16}(32{,}400) = 84{,}000.$$
6.4
Store   ψ_i    t_i    t̂_ψS     (t̂_ψS − t)²
A 7/16 11 25.14 75546.4
B 3/16 20 106.67 37377.8
C 3/16 24 128.00 29584.0
D 3/16 245 1306.67 1013377.8
As shown in (6.3) for the general case,
$$E[\hat t_\psi] = \frac{7}{16}(25.14) + \frac{3}{16}(106.67) + \frac{3}{16}(128) + \frac{3}{16}(1306.67) = 300.$$
Using (6.4),
$$V[\hat t_\psi] = \frac{7}{16}(75{,}546.4) + \frac{3}{16}(37{,}377.8) + \frac{3}{16}(29{,}584) + \frac{3}{16}(1{,}013{,}377.8) = 235{,}615.2.$$
This is a poor sampling design. Store A, with the smallest sales, is sampled with
the largest probability, while Store D is sampled with a smaller probability.
The ψ_i used in this exercise produce a higher variance than simple random sampling.
6.5 We use (6.5) to calculate $\hat t_\psi$ for each sample. So for sample (A,A),
$$\hat t_\psi = \frac{1}{2}\left(\frac{11}{\psi_A} + \frac{11}{\psi_A}\right) = \frac{11}{\psi_A} = 176.$$
Then
$$E[\hat t_\psi] = \frac{1}{256}(176) + \cdots + \frac{100}{256}(392) = 300$$
and
$$V[\hat t_\psi] = \frac{1}{256}(176-300)^2 + \cdots + \frac{100}{256}(392-300)^2 = 7124.$$
Of course, an easier solution is to note that (6.5) and (6.6) imply that $E[\hat t_\psi] = t$, and that $V[\hat t_\psi]$ will be half of the variance found when taking a sample of one psu in Section 6.2; i.e., $V[\hat t_\psi] = 14248/2 = 7124$.
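The with-replacement enumeration can also be verified directly. A Python sketch (not in the original solution) using ψ_i = (1, 2, 3, 10)/16 and t_i = (11, 20, 24, 245), the values consistent with the calculations above:

```python
from itertools import product

psi = [1 / 16, 2 / 16, 3 / 16, 10 / 16]   # draw probabilities
t = [11, 20, 24, 245]                     # psu totals; t = 300

E = V = 0.0
for i, j in product(range(4), repeat=2):  # all 16 ordered WR samples
    p = psi[i] * psi[j]
    t_hat = 0.5 * (t[i] / psi[i] + t[j] / psi[j])
    E += p * t_hat
    V += p * t_hat ** 2
V -= E ** 2
print(E, V)  # 300.0 7124.0
```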
6.6 (a) The following table does the calculations, using (6.4) to find the variance.

county      t_i     ψ_i     t̂_ψ = t_i/ψ_i   ψ_i(t̂_ψ − t)²   t̂_SRS = 13t_i   (1/13)(t̂_SRS − t)²
Apache 31621 0.0572 553292.1 20477223 411073 1997590608
Cochise 51126 0.0969 527405.6 194693778 664638 656992453
Coconino 53443 0.0958 558108.6 19071085 694759 1155043188
Gila 28189 0.0423 667034.6 379902891 366457 3256832592
Graham 11430 0.0276 414597.1 684957758 148590 13804863397
Greenlee 3744 0.0070 532113.6 11318255 48672 21084888877
La Paz 15133 0.0162 932417.7 2105687838 196729 10845710928
Mohave 80062 0.1276 627317.4 387423351 1040806 16890146325
Navajo 47413 0.0802 590892.8 27974547 616369 149926608
Pinal 81154 0.1480 548502.8 83232656 1055002 17929037997
Santa Cruz 13036 0.0316 412582.0 805214838 169468 12477690693
Yavapai 81730 0.1379 592659.0 57604012 1062490 18489514797
Yuma 74140 0.1317 562787.3 11723897 963820 11796136677
Sum 572221 1.0000 4789282131 1.30534E+11
Thus V(t̂_ψ) = 4,789,282,131 and V(t̂_SRS) = 130,534,375,140.
The unequal-probability sample is more efficient because $t_i$ and $M_i$ are highly correlated: the correlation is 0.9905. This means that the quantity $t_i/\psi_i$ does not vary much from sample to sample.
6.7 From (6.6),
$$\hat V(\hat t_\psi) = \frac{1}{n}\frac{1}{n-1}\sum_{i\in R}\left(\frac{t_i}{\psi_i} - \hat t_\psi\right)^2.$$
When $\psi_i = 1/N$ for all $i$, $t_i/\psi_i = Nt_i$ and $\hat t_\psi = N\bar t$, so
$$\hat V(\hat t_\psi) = \frac{1}{n}\frac{1}{n-1}\sum_{i\in R}(Nt_i - N\bar t)^2 = \frac{N^2}{n}\frac{1}{n-1}\sum_{i\in R}(t_i - \bar t)^2.$$
6.8 We use (6.13) and (6.14), along with the following calculations from a spread-
sheet.
Academic
unit    M_i   ψ_i        y_ij       ȳ_i    t̂_i     t̂_i/ψ_i
14      65    0.0805452  3,0,0,4    1.75   113.75   1412.25
23      25    0.0309789  2,1,2,0    1.25    31.25   1008.75
9       48    0.0594796  0,0,1,0    0.25    12.00    201.75
14      65    0.0805452  2,0,1,0    0.75    48.75    605.25
16       2    0.0024783  2,0        1.00     2.00    807.00
6       62    0.0768278  0,2,2,5    2.25   139.50   1815.75
14      65    0.0805452  1,0,0,3    1.00    65.00    807.00
19      62    0.0768278  4,1,0,0    1.25    77.50   1008.75
21      61    0.0755886  2,2,3,1    2.00   122.00   1614.00
11      41    0.0508055  2,5,12,3   5.50   225.50   4438.50
                                          average   1371.90
                                          std. dev. 1179.47
Thus $\hat t_\psi = 1371.90$ and $SE(\hat t_\psi) = (1/\sqrt{10})(1179.47) = 372.98$.
Here is SAS code for calculating these estimates. Note that unit 14 appears 3 times; in SAS, you have to give each of these repetitions a different unit number. Otherwise, SAS will just put all of the observations in the same psu for calculations.
data faculty;
input unit $ Mi psi y1 y2 y3 y4;
array yarray{4} y1 y2 y3 y4;
sampwt = (1/( 10*psi))*(Mi/4);
if unit = 16 then sampwt = (1/( 10*psi));
do i = 1 to 4;
y = yarray{i};
if y ne . then output;
end;
datalines;
/* Note: label the 3 unit 14s with different psu numbers */
14a 65 0.0805452 3 0 0 4
23 25 0.0309789 2 1 2 0
9 48 0.0594796 0 0 1 0
14b 65 0.0805452 2 0 1 0
16 2 0.0024783 2 0 . .
6 62 0.0768278 0 2 2 5
14c 65 0.0805452 1 0 0 3
19 62 0.0768278 4 1 0 0
21 61 0.0755886 2 2 3 1
11 41 0.0508055 2 5 12 3
;
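For comparison, the same estimates can be computed directly from the $\hat t_i/\psi_i$ column of the table; a Python sketch (not part of the original SAS solution):

```python
from math import sqrt
from statistics import mean, stdev

# The t_i-hat / psi_i column of the table above
u = [1412.25, 1008.75, 201.75, 605.25, 807.00,
     1815.75, 807.00, 1008.75, 1614.00, 4438.50]

t_psi = mean(u)                 # estimate of the population total
se = stdev(u) / sqrt(len(u))    # standard error of that mean
print(round(t_psi, 2), round(se, 2))  # 1371.9 372.98
```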
6.12 (a) The correlation between ψ_i and t_i (= number of farms in county) is 0.26.
We expect some benefit from pps sampling, but not a great deal—sampling with
probability proportional to population works better for quantities highly correlated
with population, such as number of physicians as in Example 6.5.
(b) As in Example 6.5, we form a new column $t_i/\psi_i$. The mean of this column is
$$\hat t_\psi = 1{,}896{,}300$$
and
$$SE[\hat t_\psi] = \frac{3{,}674{,}225}{\sqrt{100}} = 367{,}423.$$
A histogram of the $t_i/\psi_i$ exhibits strong skewness, however, so a confidence interval using the normal approximation may not be appropriate.
Here is SAS code for producing these estimates. Note that we do not use the cluster
statement in proc surveymeans since the observations are psu totals.
data statepop;
infile statepop delimiter=’,’ firstobs=2;
input state $ county $ landarea popn physicns farmpop
numfarm farmacre veterans percviet;
psii = popn/255077536;
wt = 1/(100*psii); /* weight = 1/(n \psi_i) */
6.13 (a) Corr (population, number of veterans) = .99. We expect the unequal
probability sampling to be very efficient here.
(b)
$$\hat t_\psi = \frac{1}{100}\sum_{i\in S}\frac{t_i}{\psi_i} = 27{,}914{,}180,$$
$$SE[\hat t_\psi] = \frac{10{,}874{,}532}{\sqrt{100}} = 1{,}087{,}453.$$
Note that Allen County, Ohio appears to be an outlier among the $t_i/\psi_i$: it has population 26,405 and 12,642 veterans.
(c) For each county, we calculate the number of Vietnam veterans as vietvet_i = veterans_i × percviet_i/100. We then form a column with $i$th entry vietvet$_i/\psi_i$, and find the mean (= 8,050,477) and standard deviation (= 3,273,372) of that column. Then
$$\hat t_\psi = 8{,}050{,}477, \qquad SE[\hat t_\psi] = \frac{3{,}273{,}372}{\sqrt{100}} = 327{,}337.$$
data statepop;
infile statepop delimiter=’,’ firstobs=2;
input state $ county $ landarea popn physicns farmpop
numfarm farmacre veterans percviet;
vietvet = veterans*percviet/100;
psii = popn/255077536;
wt = 1/(100*psii); /* weight = 1/(n \psi_i) */
6.14 (a) We use (6.28) and (6.29) to calculate the variances. We have
Class t̂i V̂ (t̂i )
4 110.00 16.50
10 106.25 185.94
1 154.00 733.33
9 195.75 2854.69
14 200.00 1200.00
From Table 6.7, and calculating
$$\hat V(\hat t_i) = M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{s_i^2}{m_i},$$
we have that the second term in (6.28) and (6.29) is
$$\sum_{i\in S}\frac{\hat V(\hat t_i)}{\pi_i} = 11{,}355.$$
To calculate the first term of (6.28), note that if we set $\pi_{ii} = \pi_i$, we can write
$$\sum_{i\in S}(1-\pi_i)\frac{\hat t_i^2}{\pi_i^2} + \sum_{i\in S}\sum_{\substack{k\in S\\ k\neq i}}\frac{\pi_{ik}-\pi_i\pi_k}{\pi_{ik}}\frac{\hat t_i \hat t_k}{\pi_i\pi_k} = \sum_{i\in S}\sum_{k\in S}\frac{\pi_{ik}-\pi_i\pi_k}{\pi_{ik}}\frac{\hat t_i\hat t_k}{\pi_i\pi_k}.$$
Statistics

                              Std Error
    Variable       Mean         of Mean      95% CL for Mean
    y          3.450000        0.393436    2.35764732  4.54235268
(b) Let $W$ represent the number of pairs of random numbers that must be generated to obtain the first valid psu. Since sampling is done with replacement, and hence all psu's are selected independently, we will have that $E[X] = nE[W]$. But $W$ has a geometric distribution with success probability
$$p = P(U_1 = i, V_1 \le M_i \text{ for some } i) = \sum_{i=1}^N \frac{1}{N}\frac{M_i}{J}.$$
Then
$$P(W = k) = (1-p)^{k-1}p$$
and
$$E[W] = \frac{1}{p} = NJ\Big/\sum_{i=1}^N M_i.$$
Hence,
$$E[X] = nNJ\Big/\sum_{i=1}^N M_i.$$
$$E\bigg[V\bigg(\frac{1}{n}\sum_{i=1}^N\sum_{j=1}^{Q_i}\frac{\hat t_{ij}}{\psi_i}\,\bigg|\,Q_1,\ldots,Q_N\bigg)\bigg] = E\bigg[\frac{1}{n^2}\sum_{i=1}^N\sum_{j=1}^{Q_i}\frac{V(\hat t_{ij})}{\psi_i^2}\bigg] = E\bigg[\frac{1}{n^2}\sum_{i=1}^N Q_i\frac{V_i}{\psi_i^2}\bigg] = \frac{1}{n}\sum_{i=1}^N\frac{V_i}{\psi_i}.$$
This equality uses the assumptions that $V(\hat t_{ij}) = V_i$ for any $j$, and that the estimates $\hat t_{ij}$ are independent (the last step uses $E[Q_i] = n\psi_i$). Thus,
$$V[\hat t_\psi] = \frac{1}{n}\sum_{i=1}^N \psi_i\left(\frac{t_i}{\psi_i} - t\right)^2 + \frac{1}{n}\sum_{i=1}^N\frac{V_i}{\psi_i}.$$
(c) To show that (6.9) is an unbiased estimator of the variance, note that
$$\begin{aligned}
E[\hat V(\hat t_\psi)] &= E\bigg[\frac{1}{n}\sum_{i=1}^N\sum_{j=1}^{Q_i}\frac{(\hat t_{ij}/\psi_i - \hat t_\psi)^2}{n-1}\bigg]\\
&= \frac{1}{n}\frac{1}{n-1}E\bigg\{\sum_{i=1}^N\sum_{j=1}^{Q_i}\bigg[\Big(\frac{\hat t_{ij}}{\psi_i}\Big)^2 - 2\frac{\hat t_{ij}}{\psi_i}\hat t_\psi + \hat t_\psi^2\bigg]\bigg\}\\
&= \frac{1}{n}\frac{1}{n-1}E\bigg[\sum_{i=1}^N\sum_{j=1}^{Q_i}\Big(\frac{\hat t_{ij}}{\psi_i}\Big)^2\bigg] - \frac{E[\hat t_\psi^2]}{n-1}\\
&= \frac{1}{n}\frac{1}{n-1}E\bigg\{\sum_{i=1}^N\sum_{j=1}^{Q_i}E\bigg[\Big(\frac{\hat t_{ij}}{\psi_i}\Big)^2\,\bigg|\,Q_1,\ldots,Q_N\bigg]\bigg\} - \frac{1}{n-1}[t^2 + V(\hat t_\psi)]\\
&= \frac{1}{n}\frac{1}{n-1}E\bigg[\sum_{i=1}^N Q_i\frac{t_i^2 + V_i}{\psi_i^2}\bigg] - \frac{1}{n-1}[t^2 + V(\hat t_\psi)]\\
&= \frac{1}{n-1}\sum_{i=1}^N \frac{t_i^2 + V_i}{\psi_i} - \frac{1}{n-1}[t^2 + V(\hat t_\psi)]\\
&= \frac{1}{n-1}\bigg(\sum_{i=1}^N \psi_i\frac{t_i^2}{\psi_i^2} - t^2\bigg) + \frac{1}{n-1}\bigg(\sum_{i=1}^N\frac{V_i}{\psi_i} - V(\hat t_\psi)\bigg)\\
&= \frac{1}{n-1}\bigg[\sum_{i=1}^N\psi_i\Big(\frac{t_i}{\psi_i}-t\Big)^2 + \sum_{i=1}^N\frac{V_i}{\psi_i}\bigg] - \frac{1}{n-1}V(\hat t_\psi)\\
&= \frac{n}{n-1}V(\hat t_\psi) - \frac{1}{n-1}V(\hat t_\psi) = V(\hat t_\psi).
\end{aligned}$$
6.17 It is sufficient to show that (6.22) and (6.23) are equivalent. When an SRS of psus is selected, then $\pi_i = n/N$ and $\pi_{ik} = \frac{n(n-1)}{N(N-1)}$ for all $i$ and $k$. So, starting with the SYG form, and writing $\pi_1 = n/N$ and $\pi_{12} = n(n-1)/[N(N-1)]$,
$$\begin{aligned}
\frac{1}{2}\sum_{i\in S}\sum_{\substack{k\in S\\k\neq i}}\frac{\pi_i\pi_k-\pi_{ik}}{\pi_{ik}}\Big(\frac{t_i}{\pi_i}-\frac{t_k}{\pi_k}\Big)^2
&= \frac{1}{2}\frac{1}{\pi_1^2}\frac{\pi_1^2-\pi_{12}}{\pi_{12}}\sum_{i\in S}\sum_{\substack{k\in S\\k\neq i}}(t_i-t_k)^2\\
&= \frac{1}{2}\frac{1}{\pi_1^2}\frac{\pi_1^2-\pi_{12}}{\pi_{12}}\sum_{i\in S}\sum_{\substack{k\in S\\k\neq i}}(t_i^2+t_k^2-2t_it_k)\\
&= \frac{1}{\pi_1^2}\frac{\pi_1^2-\pi_{12}}{\pi_{12}}(n-1)\sum_{i\in S}t_i^2 - \frac{1}{\pi_1^2}\frac{\pi_1^2-\pi_{12}}{\pi_{12}}\sum_{i\in S}\sum_{\substack{k\in S\\k\neq i}}t_it_k.
\end{aligned}$$
But in an SRS,
$$\frac{1}{\pi_1^2}\frac{\pi_1^2-\pi_{12}}{\pi_{12}}(n-1) = \Big(\frac{N}{n}\Big)^2\frac{N-n}{N(n-1)}(n-1) = \frac{N^2}{n^2}\Big(1-\frac{n}{N}\Big),$$
which proves the result.
6.18 To show the results for stratified sampling, we treat the $H$ strata as the $N$ psus. Note that since all strata are subsampled, we have $\pi_i = 1$ for each stratum. Thus, (6.25) becomes
$$\hat t_{HT} = \sum_{i=1}^H \hat t_i = \sum_{i=1}^H N_i\bar y_i.$$
For (3.3), since all strata are sampled, either (6.26) or (6.27) gives
$$V(\hat t_{HT}) = \sum_{i=1}^H V(\hat t_i).$$
(b) Using (6.21), $V(\hat t_{HT}) = 1643$. Using (6.46) (we only need the first term), $V_{WR}(\hat t_\psi) = 1867$.
$$\begin{aligned}
&= \frac{N}{n}\Big(1-\frac{n}{N}\Big)\sum_{i=1}^N t_{ix}t_{iy} + \Big(\frac{N}{n}\Big)^2\sum_{i=1}^N\sum_{\substack{k=1\\k\neq i}}^N\Big[\frac{n(n-1)}{N(N-1)} - \Big(\frac{n}{N}\Big)^2\Big]t_{ix}t_{ky}\\
&= \frac{N}{n}\Big(1-\frac{n}{N}\Big)\sum_{i=1}^N t_{ix}t_{iy} + \frac{N}{n}\sum_{i=1}^N\sum_{\substack{k=1\\k\neq i}}^N\Big[\frac{n-1}{N-1} - \frac{n}{N}\Big]t_{ix}t_{ky}\\
&= \frac{N}{n}\Big(1-\frac{n}{N}\Big)\sum_{i=1}^N t_{ix}t_{iy} - \frac{N}{n}\frac{N-n}{N(N-1)}\sum_{i=1}^N\sum_{\substack{k=1\\k\neq i}}^N t_{ix}t_{ky}\\
&= \frac{N}{n}\Big(1-\frac{n}{N}\Big)\bigg[\sum_{i=1}^N t_{ix}t_{iy} - \frac{1}{N-1}\sum_{i=1}^N\sum_{\substack{k=1\\k\neq i}}^N t_{ix}t_{ky}\bigg]\\
&= \frac{N}{n}\Big(1-\frac{n}{N}\Big)\bigg[\sum_{i=1}^N t_{ix}t_{iy} - \frac{1}{N-1}\sum_{i=1}^N\sum_{k=1}^N t_{ix}t_{ky} + \frac{1}{N-1}\sum_{i=1}^N t_{ix}t_{iy}\bigg]\\
&= \frac{N}{n}\Big(1-\frac{n}{N}\Big)\bigg[\frac{N}{N-1}\sum_{i=1}^N t_{ix}t_{iy} - \frac{1}{N-1}\sum_{i=1}^N\sum_{k=1}^N t_{ix}t_{ky}\bigg]\\
&= \frac{N^2}{n}\Big(1-\frac{n}{N}\Big)\frac{1}{N-1}\bigg[\sum_{i=1}^N t_{ix}t_{iy} - \frac{t_xt_y}{N}\bigg].
\end{aligned}$$
(b) Now,
$$t_{iy} = \sum_{j=1}^M y_{ij} \qquad\text{and}\qquad t_{iu} = \sum_{j=1}^M x_{ij}y_{ij}.$$
So if psu i is in domain 1 then xij = 1 for every ssu in psu i, tix = M , and tiu = tiy ;
if psu i is in domain 2 then xij = 0 for every ssu in psu i, tix = 0, and tiu = 0. We
$$\begin{aligned}
&\sum_{i=1}^N\bigg\{t_{iu}t_{iy}-t_{iu}^2+\frac{t_y-t_u}{N-t_x}t_{iu}t_{ix}-\frac{t_u}{t_x}\Big(t_{ix}t_{iy}-t_{ix}t_{iu}+\frac{t_y-t_u}{N-t_x}t_{ix}^2\Big)\bigg\}\\
&= \sum_{i\in\text{domain 1}}\bigg\{t_{iu}t_{iy}-t_{iu}^2+\frac{t_y-t_u}{N-t_x}t_{iu}t_{ix}-\frac{t_u}{t_x}\Big(t_{ix}t_{iy}-t_{ix}t_{iu}+\frac{t_y-t_u}{N-t_x}t_{ix}^2\Big)\bigg\}\\
&\quad+ \sum_{i\in\text{domain 2}}\bigg\{t_{iu}t_{iy}-t_{iu}^2+\frac{t_y-t_u}{N-t_x}t_{iu}t_{ix}-\frac{t_u}{t_x}\Big(t_{ix}t_{iy}-t_{ix}t_{iu}+\frac{t_y-t_u}{N-t_x}t_{ix}^2\Big)\bigg\}\\
&= \sum_{i\in\text{domain 1}}\bigg\{t_{iy}^2-t_{iy}^2+\frac{t_y-t_u}{N-t_x}t_{iy}M-\frac{t_u}{t_x}\Big(Mt_{iy}-Mt_{iy}+\frac{t_y-t_u}{N-t_x}M^2\Big)\bigg\}\\
&= \sum_{i\in\text{domain 1}}\bigg\{\frac{t_y-t_u}{N-t_x}t_{iy}M-\frac{t_u}{t_x}\frac{t_y-t_u}{N-t_x}M^2\bigg\}\\
&= \frac{t_y-t_u}{N-t_x}t_uM - \frac{t_x}{M}\frac{t_u}{t_x}\frac{t_y-t_u}{N-t_x}M^2\\
&= 0,
\end{aligned}$$
since $\sum_{i\in\text{domain 1}}t_{iy} = t_u$ and domain 1 contains $t_x/M$ psus.
(c) Almost any example will work, as long as some psus have units from both
domains.
6.22 (a) Since $E(Z_i) = \pi_i$,
$$E(\hat t_y) = \sum_{i=1}^N \frac{u_i}{\pi_i}E(Z_i) = \sum_{i=1}^N u_i = \sum_{i=1}^N\sum_{k=1}^M \ell_{ik}\frac{y_k}{L_k} = \sum_{k=1}^M y_k.$$
The last equality follows since $\sum_{i=1}^N \ell_{ik} = L_k$.
The variance given is the variance of the one-stage Horvitz-Thompson estimator.
(b) Note that
$$\hat t_y = \sum_{i=1}^N \frac{Z_i}{\pi_i}\sum_{k=1}^M \ell_{ik}\frac{y_k}{L_k} = \sum_{k=1}^M \frac{y_k}{L_k}\sum_{i=1}^N \ell_{ik}\frac{Z_i}{\pi_i}.$$
But the sum is over all $k$ from 1 to $M$, not just the units in $S_B$. We need to show that $\sum_{i=1}^N (Z_i/\pi_i)\ell_{ik} = 0$ for $k \notin S_B$, where
$$w_k^* = \frac{\displaystyle\sum_{i=1}^N \frac{Z_i}{\pi_i}\ell_{ik}}{\displaystyle\sum_{i=1}^N \ell_{ik}}.$$
But a student is in $S_B$ if and only if s/he is linked to one of the sampled units in $S_A$. In other words, $k \in S_B$ if and only if $\sum_{i\in S_A}\ell_{ik} > 0$. For $k \notin S_B$, we must have $\ell_{ik} = 0$ for each $i \in S_A$.
                 Element k from U_B
                 ℓ_i1   ℓ_i2   u_i
Unit i = 1        1      0     4/2 = 2
from U_A, i = 2   1      1     4/2 + 6/2 = 5
         i = 3    0      1     6/2 = 3

Sample   t̂_y
{1,2}    (3/2)(2 + 5) = 21/2
{1,3}    (3/2)(2 + 3) = 15/2
{2,3}    (3/2)(5 + 3) = 24/2

Consequently,
$$E[\hat t_y] = \frac{1}{3}\left(\frac{21}{2} + \frac{15}{2} + \frac{24}{2}\right) = 10,$$
so it is unbiased. But
$$V[\hat t_y] = \frac{1}{3}\left[\Big(\frac{21}{2}-10\Big)^2 + \Big(\frac{15}{2}-10\Big)^2 + \Big(\frac{24}{2}-10\Big)^2\right] = \frac{1}{3}[0.25 + 6.25 + 4] = 3.5.$$
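The enumeration can be checked in a few lines. A Python sketch (not in the original solution) using the $u_i$ from the table above; the factor 3/2 is $1/\pi_i$ for an SRS of 2 of the 3 units:

```python
from itertools import combinations

u = [2, 5, 3]   # u_i from the table above
# SRS of n = 2 of N = 3 units: pi_i = 2/3, so t_hat = (3/2) * sum of sampled u_i
estimates = [1.5 * (u[i] + u[j]) for i, j in combinations(range(3), 2)]

E = sum(estimates) / 3
V = sum((e - E) ** 2 for e in estimates) / 3
print(E, V)  # 10.0 3.5
```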
(e) We construct the variable $u_i = \sum_{k=1}^M \ell_{ik}y_k/L_k$ for each adult in the sample, where $L_k$ = (number of other adults + 1). Using weight $w_i = 40{,}000/100 = 400$, we calculate $\hat t_u = \sum_{i\in S}w_iu_i = 7200$ with
$$\hat V(\hat t_u) = \frac{N^2}{n}s_u^2 = 1{,}900{,}606.$$
This gives a 95% CI of [4464.5, 9935.5]. Note that the without-replacement variance estimator could also be used.
The following SAS code will compute the estimates.
data wtshare;
infile wtshare delimiter="," firstobs=2;
input id child preschool numadult;
yoverLk = preschool/(numadult+1);
data sumout;
set sumout;
sampwt = 40000/100;
6.23 (a) We have $\pi_i$'s 0.50, 0.25, 0.50, 0.75; $V(\hat t_{HT}) = 180.1147$; and $V(\hat t_\psi) = 101.4167$.
(b)
$$\begin{aligned}
\frac{1}{2n}\sum_{i=1}^N\sum_{k=1}^N \pi_i\pi_k\Big(\frac{t_i}{\pi_i}-\frac{t_k}{\pi_k}\Big)^2
&= \frac{1}{2n}\sum_{i=1}^N\sum_{k=1}^N \pi_i\pi_k\Big[\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)-\Big(\frac{t_k}{\pi_k}-\frac{t}{n}\Big)\Big]^2\\
&= \frac{1}{2n}\sum_{i=1}^N\sum_{k=1}^N \pi_i\pi_k\Big[\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)^2+\Big(\frac{t_k}{\pi_k}-\frac{t}{n}\Big)^2-2\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)\Big(\frac{t_k}{\pi_k}-\frac{t}{n}\Big)\Big]\\
&= \frac{1}{n}\sum_{i=1}^N\sum_{k=1}^N \pi_i\pi_k\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)^2-\frac{1}{n}\Big[\sum_{i=1}^N \pi_i\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)\Big]^2\\
&= \frac{1}{n}\sum_{i=1}^N\sum_{k=1}^N n^2\psi_i\psi_k\Big(\frac{t_i}{n\psi_i}-\frac{t}{n}\Big)^2-\frac{1}{n}\Big[\sum_{i=1}^N(t_i-\psi_i t)\Big]^2\\
&= \frac{1}{n}\sum_{i=1}^N \psi_i\Big(\frac{t_i}{\psi_i}-t\Big)^2\\
&= V(\hat t_\psi).
\end{aligned}$$
(c)
$$\begin{aligned}
V(\hat t_\psi) - V(\hat t_{HT}) &= \frac{1}{2n}\sum_{i=1}^N\sum_{k=1}^N \pi_i\pi_k\Big(\frac{t_i}{\pi_i}-\frac{t_k}{\pi_k}\Big)^2 - \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N(\pi_i\pi_k-\pi_{ik})\Big(\frac{t_i}{\pi_i}-\frac{t_k}{\pi_k}\Big)^2\\
&= \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N\Big(\frac{\pi_i\pi_k}{n}-\pi_i\pi_k+\pi_{ik}\Big)\Big(\frac{t_i}{\pi_i}-\frac{t_k}{\pi_k}\Big)^2\\
&> \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N\Big[\pi_i\pi_k\Big(\frac{1}{n}-1\Big)+\frac{n-1}{n}\pi_i\pi_k\Big]\Big(\frac{t_i}{\pi_i}-\frac{t_k}{\pi_k}\Big)^2\\
&= 0.
\end{aligned}$$
$$\begin{aligned}
&\sum_{i=1}^N\sum_{k\neq i}\Big(\pi_{ik} - \frac{n-1}{n}\pi_i\pi_k\Big)\bigg[\Big(\frac{t_i}{\pi_i}\Big)^2 - \frac{t_i}{\pi_i}\frac{t_k}{\pi_k}\bigg]\\
&= \sum_{i=1}^N\sum_{k\neq i}\pi_{ik}\Big(\frac{t_i}{\pi_i}\Big)^2 - \sum_{i=1}^N\sum_{k\neq i}\pi_{ik}\frac{t_it_k}{\pi_i\pi_k} - \frac{n-1}{n}\sum_{i=1}^N \pi_i(n-\pi_i)\Big(\frac{t_i}{\pi_i}\Big)^2 + \frac{n-1}{n}\sum_{i=1}^N\sum_{k\neq i}t_it_k\\
&= \sum_{i=1}^N (n-1)\pi_i\Big(\frac{t_i}{\pi_i}\Big)^2 - \sum_{i=1}^N\sum_{k\neq i}\pi_{ik}\frac{t_it_k}{\pi_i\pi_k} - (n-1)\sum_{i=1}^N\pi_i\Big(\frac{t_i}{\pi_i}\Big)^2 + \frac{n-1}{n}\sum_{i=1}^N t_i^2 + \frac{n-1}{n}\sum_{i=1}^N\sum_{k\neq i}t_it_k.
\end{aligned}$$
Consequently,
$$\frac{n-1}{n}\sum_{i=1}^N\sum_{k=1}^N t_it_k \ge \sum_{i=1}^N\sum_{k\neq i}\pi_{ik}\frac{t_it_k}{\pi_i\pi_k},$$
or
$$\sum_{i=1}^N\sum_{k=1}^N a_{ik}t_it_k \ge 0.$$
6.25 (a)
$$\begin{aligned}
P\{\text{psu's } i \text{ and } j \text{ are in the sample}\}
&= P\{\text{psu } i \text{ drawn first and psu } j \text{ drawn second}\}\\
&\quad + P\{\text{psu } j \text{ drawn first and psu } i \text{ drawn second}\}\\
&= \frac{a_i}{\sum_{k=1}^N a_k}\,\frac{\psi_j}{1-\psi_i} + \frac{a_j}{\sum_{k=1}^N a_k}\,\frac{\psi_i}{1-\psi_j}\\
&= \frac{\psi_i(1-\psi_i)\psi_j}{\sum_{k=1}^N a_k\,(1-\psi_i)(1-\pi_i)} + \frac{\psi_j(1-\psi_j)\psi_i}{\sum_{k=1}^N a_k\,(1-\psi_j)(1-\pi_j)}\\
&= \frac{\psi_i\psi_j}{\sum_{k=1}^N a_k}\Big(\frac{1}{1-\pi_i}+\frac{1}{1-\pi_j}\Big).
\end{aligned}$$
In the third step above, we used the constraint that $\sum_{j=1}^N \pi_j = n = 2$, so $\sum_{j=1}^N \psi_j = 1$.
Now note that
$$2\sum_{k=1}^N a_k = 2\sum_{k=1}^N\frac{\psi_k(1-\psi_k)}{1-2\psi_k} = \sum_{k=1}^N\frac{\psi_k(1-2\psi_k+1)}{1-2\psi_k} = 1 + \sum_{k=1}^N\frac{\psi_k}{1-\pi_k}.$$
and
$$P(\text{psu } j \text{ on 2nd draw} \mid \text{psu } i \text{ on 1st draw}) = \frac{\psi_j}{1-\psi_i}.$$
Then
$$P\{S = (1,2)\} = \frac{0.26667}{1.47437}\cdot\frac{0.16}{0.8} = 0.036174,$$
$$P\{S = (2,1)\} = \frac{0.19765}{1.47437}\cdot\frac{0.2}{0.84} = 0.031918,$$
and
$$\pi_{12} = P\{S=(1,2)\} + P\{S=(2,1)\} = 0.068.$$
Continuing in like manner, we have the following table of ºij .
i\j 1 2 3 4 5
1 — .068 .193 .090 .049
2 .068 — .148 .068 .036
3 .193 .148 — .193 .107
4 .090 .068 .193 — .049
5 .049 .036 .107 .049 —
Sum .400 .320 .640 .400 .240
We use (6.21) to calculate the variance of the Horvitz-Thompson estimator.
i   j   π_ij    π_i    π_j    t_i   t_j    (π_iπ_j − π_ij)(t_i/π_i − t_j/π_j)²
1 2 0.068 0.40 0.32 20 25 47.39
1 3 0.193 0.40 0.64 20 38 5.54
1 4 0.090 0.40 0.40 20 24 6.96
1 5 0.049 0.40 0.24 20 21 66.73
2 3 0.148 0.32 0.64 25 38 20.13
2 4 0.068 0.32 0.40 25 24 19.68
2 5 0.036 0.32 0.24 25 21 3.56
3 4 0.193 0.64 0.40 38 24 0.02
3 5 0.107 0.64 0.24 38 21 37.16
4 5 0.049 0.40 0.24 24 21 35.88
Sum 1 243.07
Note that for this population, t = 128. To check the results, we see that $\sum_{j\neq i}\pi_{ij} = (n-1)\pi_i = \pi_i$ for each $i$: the column sums in the table of $\pi_{ij}$ above equal the $\pi_i$.
The (l + 1)st SRSWR is chosen to be the sample if each of the previous l SRSWR’s
is rejected because the two psu’s are the same. Now
$$P(\text{the two psu's are the same in an SRSWR}) = \sum_{k=1}^N \psi_k^2,$$
Thus
$$\hat t_{RHC} = \sum_{k=1}^n \frac{t_{\alpha(k)}}{x_{k,\alpha(k)}} = \sum_{k=1}^n\sum_{i=1}^N I_{ki}Z_i\frac{t_i}{x_{ki}}.$$
Since E[t̂RHC | I11 , . . . , InN ] = t for any random grouping of psu’s, we have that
E[t̂RHC ] = t.
To find the variance, note that
Since E[t̂RHC | I11 , . . . , InN ] = t, however, we know that V [E(t̂RHC | I11 , . . . , InN )] =
0. Conditionally on the grouping, the kth term in t̂RHC estimates the total of group
k using an unequal-probability sample of size one. We can thus use (6.4) within
each group to find the conditional variance, noting that psu's in different groups
are selected independently. (We can obtain the same result by using the indicator
Now to find $E[V(\hat t_{RHC}\mid I_{11},\ldots,I_{nN})]$, we need $E[I_{ki}]$ and $E[I_{ki}I_{kj}]$ for $i \neq j$. Let $N_k$ be the number of psu's in group $k$. Then
$$E[I_{ki}] = P\{\text{psu } i \text{ in group } k\} = \frac{N_k}{N}$$
and, for $i \neq j$,
$$E[I_{ki}I_{kj}] = P\{\text{psu's } i \text{ and } j \text{ in group } k\} = \frac{N_k}{N}\frac{N_k-1}{N-1}.$$
Thus, letting $\psi_i = M_i/\sum_{j=1}^N M_j$,
The second factor equals $nV(\hat t_\psi)$, with $V(\hat t_\psi)$ given in (6.46), assuming one-stage cluster sampling.
What should $N_1, \ldots, N_n$ be in order to minimize $V[\hat t_{RHC}]$? Note that
$$\sum_{k=1}^n N_k(N_k - 1) = \sum_{k=1}^n N_k^2 - N.$$
6.29 (a)
$$\begin{aligned}
\sum_{k\neq i}\tilde\pi_{ik} &= \sum_{k\neq i}\pi_i\pi_k\Big[1-(1-\pi_i)(1-\pi_k)\Big/\sum_{j=1}^N c_j\Big]\\
&= \sum_{k\neq i}\pi_i\pi_k - \frac{\pi_i(1-\pi_i)}{\sum_{j=1}^N c_j}\sum_{k\neq i}\pi_k(1-\pi_k)\\
&= \pi_i(n-\pi_i) - \frac{\pi_i(1-\pi_i)}{\sum_{j=1}^N c_j}\Big[\sum_{k=1}^N\pi_k(1-\pi_k)-\pi_i(1-\pi_i)\Big]\\
&= \pi_i(n-\pi_i) - \pi_i(1-\pi_i) + \frac{\pi_i^2(1-\pi_i)^2}{\sum_{j=1}^N c_j}\\
&= \pi_i(n-1) + \frac{\pi_i^2(1-\pi_i)^2}{\sum_{j=1}^N c_j}.
\end{aligned}$$
$$\sum_{j=1}^N c_j = \sum_{j=1}^N\frac{n}{N}\Big(1-\frac{n}{N}\Big) = n\Big(1-\frac{n}{N}\Big)$$
and
$$\begin{aligned}
\tilde\pi_{ik} &= \frac{n^2}{N^2}\bigg[1-\frac{(1-n/N)(1-n/N)}{n(1-n/N)}\bigg] = \frac{n}{N^2}\Big[n-1+\frac{n}{N}\Big]\\
&= \frac{n}{N}\Big[\frac{n-1}{N}+\frac{n}{N^2}\Big] = \frac{n}{N}\,\frac{n-1}{N-1}\Big[1+\frac{N-n}{(n-1)N^2}\Big].
\end{aligned}$$
$$\begin{aligned}
V_{Haj}(\hat t_{HT}) &= \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N(\pi_i\pi_k-\tilde\pi_{ik})\Big(\frac{t_i}{\pi_i}-\frac{t_k}{\pi_k}\Big)^2\\
&= \frac{1}{2}\sum_{i=1}^N\sum_{k=1}^N\pi_i\pi_k\frac{(1-\pi_i)(1-\pi_k)}{\sum_{j=1}^N c_j}\Big(\frac{t_i}{\pi_i}-\frac{t_k}{\pi_k}\Big)^2\\
&= \frac{1}{2\sum_{j=1}^N c_j}\sum_{i=1}^N\sum_{k=1}^N\pi_i\pi_k(1-\pi_i)(1-\pi_k)\Big(\frac{2t_i^2}{\pi_i^2}-\frac{2t_it_k}{\pi_i\pi_k}\Big)\\
&= \frac{1}{\sum_{j=1}^N c_j}\sum_{i=1}^N\sum_{k=1}^N\pi_i\pi_k(1-\pi_i)(1-\pi_k)\Big(\frac{t_i^2}{\pi_i^2}-\frac{t_it_k}{\pi_i\pi_k}\Big)\\
&= \sum_{i=1}^N\pi_i(1-\pi_i)\frac{t_i^2}{\pi_i^2}-\frac{1}{\sum_{j=1}^N c_j}\bigg[\sum_{i=1}^N\pi_i(1-\pi_i)\frac{t_i}{\pi_i}\bigg]^2\\
&= \sum_{i=1}^N c_i\frac{t_i^2}{\pi_i^2}-\frac{1}{\sum_{j=1}^N c_j}\bigg[\sum_{i=1}^N c_i\frac{t_i}{\pi_i}\bigg]^2\\
&= \sum_{i=1}^N c_i\Big(\frac{t_i}{\pi_i}-A\Big)^2,
\end{aligned}$$
where $A = \sum_{k=1}^N c_k(t_k/\pi_k)\big/\sum_{j=1}^N c_j$.
6.30 Since
$$\sum_{\substack{k=1\\k\neq i}}^N \pi_i\pi_k = \pi_i(n-\pi_i) \qquad\text{and}\qquad \sum_{\substack{k=1\\k\neq i}}^N \pi_{ik} = (n-1)\pi_i,$$
we have
$$\sum_{i=1}^N\sum_{\substack{k=1\\k\neq i}}^N(\pi_i\pi_k - \pi_{ik})\Big(\frac{t_i}{\pi_i} - \frac{t}{n}\Big)^2 = \sum_{i=1}^N[\pi_i(n-\pi_i) - (n-1)\pi_i]\Big(\frac{t_i}{\pi_i} - \frac{t}{n}\Big)^2 = \sum_{i=1}^N \pi_i(1-\pi_i)\Big(\frac{t_i}{\pi_i} - \frac{t}{n}\Big)^2.$$
This gives the first two terms in (6.47); the third term is the cross-product term above.
(b) For an SRS, $\pi_i = n/N$ and $\pi_{ik} = [n(n-1)]/[N(N-1)]$. The first term is
$$\sum_{i=1}^N\frac{n}{N}\Big(1-\frac{n}{N}\Big)\Big(\frac{Nt_i}{n}-\frac{t}{n}\Big)^2 = \frac{N}{n}\Big(1-\frac{n}{N}\Big)\sum_{i=1}^N(t_i-\bar t_U)^2 = \frac{N(N-1)}{n}\Big(1-\frac{n}{N}\Big)S_t^2.$$
(c) Substituting $\pi_i\pi_k(c_i + c_k)/2$ for $\pi_{ik}$, the third term in (6.47) is
$$\begin{aligned}
\sum_{i=1}^N\sum_{\substack{k=1\\k\neq i}}^N(\pi_{ik}-\pi_i\pi_k)\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)\Big(\frac{t_k}{\pi_k}-\frac{t}{n}\Big)
&= \sum_{i=1}^N\sum_{\substack{k=1\\k\neq i}}^N\pi_i\pi_k\,\frac{c_i+c_k-2}{2}\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)\Big(\frac{t_k}{\pi_k}-\frac{t}{n}\Big)\\
&= \sum_{i=1}^N\sum_{k=1}^N\pi_i\pi_k\,\frac{c_i+c_k-2}{2}\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)\Big(\frac{t_k}{\pi_k}-\frac{t}{n}\Big) - \sum_{i=1}^N\pi_i^2(c_i-1)\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)^2\\
&= \sum_{i=1}^N\pi_i^2(1-c_i)\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)^2.
\end{aligned}$$
If $c_i = \dfrac{n-1}{n-\pi_i}$, then the variance approximation in (6.48) for an SRS is
$$\sum_{i=1}^N \pi_i(1-c_i\pi_i)\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)^2 = \frac{N}{n}\Big(1-\frac{n-1}{N-1}\Big)\sum_{i=1}^N(t_i-\bar t_U)^2 = \frac{N(N-1)}{n}\Big(1-\frac{n-1}{N-1}\Big)S_t^2.$$
If
$$c_i = \frac{n-1}{1 - 2\pi_i + \frac{1}{n}\sum_{k=1}^N \pi_k^2},$$
then
$$\sum_{i=1}^N \pi_i(1-c_i\pi_i)\Big(\frac{t_i}{\pi_i}-\frac{t}{n}\Big)^2 = \frac{N(N-1)}{n}\Big(1-\frac{n(n-1)}{N-n}\Big)S_t^2.$$
$$\frac{\partial g}{\partial m_k} = -\frac{M_k^2S_k^2}{m_k^2\psi_k} + n\lambda\psi_k, \qquad \frac{\partial g}{\partial\lambda} = n\sum_{i=1}^N \psi_i m_i - C.$$
Setting the partial derivatives equal to zero gives
$$m_k = \frac{1}{\sqrt{n\lambda}}\frac{M_kS_k}{\psi_k}$$
and
$$\sqrt{\lambda} = \frac{\sqrt{n}}{C}\sum_{i=1}^N M_iS_i.$$
Also,
$$P(\text{reach no one on an attempt}) = \sum_{i=1}^N\frac{1}{N}\Big(1 - \frac{M_i}{100}\Big) = 1 - \frac{M_0}{100N}.$$
Chapter 7

Complex Surveys
7.6 Here is SAS code for solving this problem. Note that for the population, we have $\bar y_U = 17.73$,
$$S^2 = 38.1451726 = \frac{1}{1999}\Bigg[\sum_{i=1}^{2000}y_i^2 - \frac{1}{2000}\Bigg(\sum_{i=1}^{2000}y_i\Bigg)^2\Bigg] = \Big(704958 - \frac{35460^2}{2000}\Big)\Big/1999,$$
data integerwt;
infile integer delimiter="," firstobs=2;
input stratum y;
ysq = y*y;
run;
152 CHAPTER 7. COMPLEX SURVEYS
data strattot;
input stratum _total_;
datalines;
1 200
2 800
3 400
4 600
;
data pseudopop;
set stratsamp;
retain stratum y;
do i = 1 to SamplingWeight;
output ;
end;
var y ysq;
run;
The estimates from the last two surveymeans statements are the same (not the
standard errors, however).
7.7 Let y = number of species caught.
y fˆ(y) y fˆ(y)
1 .0328 10 .2295
3 .0328 11 .0491
4 .0820 12 .0820
5 .0328 13 .0328
6 .0656 14 .0164
7 .0656 16 .0328
8 .1803 17 .0164
9 .0328 18 .0164
Here is SAS code for constructing this table:
data nybight;
infile nybight delimiter=’,’ firstobs=2;
input year stratum catchnum catchwt numspp depth temp ;
select (stratum);
when (1,2) relwt=1;
when (3,4,5,6) relwt=2;
end;
if year = 1974;
data htpop_epmf;
set htpop_epmf;
epmf = percent/100;
ecdf = cum_pct/100;
7.8 We first construct a new variable, weight, with the following values:

Stratum   weight
large     (245/23)(M_i/m_i)
sm/me     (66/8)(M_i/m_i)
Because there is nonresponse on the variable hrwork, for this exercise we take mi to
be the number of respondents in that cluster. The weights for each teacher sampled
in a school are given in the following table:
dist school popteach mi weight
sm/me 1 2 1 16.50000
sm/me 2 6 4 12.37500
sm/me 3 18 7 21.21429
sm/me 4 12 7 14.14286
sm/me 6 24 11 18.00000
sm/me 7 17 4 35.06250
sm/me 8 19 5 31.35000
sm/me 9 28 21 11.00000
large 11 33 10 35.15217
large 12 16 13 13.11037
large 13 22 3 78.11594
large 15 24 24 10.65217
large 16 27 24 11.98370
large 18 18 2 95.86957
large 19 16 3 56.81159
large 20 12 8 15.97826
large 21 19 5 40.47826
large 22 33 13 27.04013
large 23 31 16 20.63859
large 24 30 9 35.50725
large 25 23 8 30.62500
large 28 53 17 33.20972
large 29 50 8 66.57609
large 30 26 22 12.58893
large 31 25 18 14.79469
large 32 23 16 15.31250
large 33 21 5 44.73913
large 34 33 7 50.21739
large 36 25 4 66.57609
large 38 38 10 40.47826
large 41 30 2 159.78261
The epmf is given below, with y =hrwork.
y fˆ(y) y fˆ(y)
20.00 0.0040 34.55 0.0019
26.25 0.0274 34.60 0.0127
26.65 0.0367 35.00 0.1056
27.05 0.0225 35.40 0.0243
27.50 0.0192 35.85 0.0164
27.90 0.0125 36.20 0.0022
28.30 0.0050 36.25 0.0421
29.15 0.0177 36.65 0.0664
30.00 0.0375 37.05 0.0023
30.40 0.0359 37.10 0.0403
30.80 0.0031 37.50 0.1307
31.25 0.0662 37.90 0.0079
32.05 0.0022 37.95 0.0019
32.10 0.0031 38.35 0.0163
32.50 0.0370 38.75 0.0084
32.90 0.0347 39.15 0.0152
33.30 0.0031 40.00 0.0130
33.35 0.0152 40.85 0.0018
33.75 0.0404 41.65 0.0031
34.15 0.0622 52.50 0.0020
7.10 (Histograms of the data, drawn without and with the survey weights, are omitted here.) Using the weights makes a huge difference, since the counties with large numbers of veterans also have small weights.
7.13 The variable agefirst contains information on the age at first arrest. Missing
values are coded as 99; for this exercise, we use the non-missing cases.
Estimated Without With
Quantity Weights Weights
Mean 13.07 13.04
Median 13 13
25th Percentile 12 12
75th Percentile 15 15
Calculating these quantities in SAS is easy: simply include the weight variable in
PROC UNIVARIATE.
The weights change the estimates very little, largely because the survey was designed
to be self-weighting.
7.14
Quantity Variable p̂
Age ∑ 14 age .1233
Violent oÆense crimtype .4433
Both parents livewith .2974
Male sex .9312
Hispanic ethnicity .1888
Single parent livewith .5411
Illegal drugs everdrug .8282
7.15 (a) We use the following SAS code to obtain ȳˆ = 18.03, with 95% CI [17.48,
18.58].
data nhanes;
infile nhanes delimiter=’,’ firstobs=2;
input sdmvstra sdmvpsu wtmec2yr age ridageyr riagendr ridreth2
dmdeduc indfminc bmxwt bmxbmi bmxtri
bmxwaist bmxthicr bmxarml;
label age = "Age at Examination (years)"
riagendr = "Gender"
ridreth2 = "Race/Ethnicity"
dmdeduc = "Education Level"
indfminc = "Family income"
bmxwt = "Weight (kg)"
bmxbmi = "Body mass index"
bmxtri = "Triceps skinfold (mm)"
bmxwaist = "Waist circumference (cm)"
bmxthicr = "Thigh circumference (cm)"
bmxarml = "Upper arm length (cm)";
run;
(c) The SAS code in part (a) also gives the following.
data groupage;
set nhanes;
bmigroup = round(bmxbmi,5);
trigroup = round(bmxtri,5);
run;
goptions reset=all;
goptions colors = (black);
axis3 label=(’Body Mass Index, rounded to 5’) order=(10 to 70 by 10);
axis4 label=(angle=90 ’Triceps skinfold, rounded to 5’)
order=(0 to 55 by 10);
goptions reset=all;
goptions colors = (gray);
axis4 label=(angle=90 ’Triceps skinfold’) order = (0 to 55 by 10);
axis3 label=(’Body Mass Index’) order=(10 to 70 by 10);
axis5 order=(0 to 55 by 5) major=none minor=none value=none;
symbol interpol=join width=2 color = black;
data plotsmth;
set bubbleage bmxsmooth; /* concatenates the data sets */
run;
7.17 We define new variables that take on the value 1 if the person has been a
victim of at least one violent crime and 0 otherwise, and another variable for injury.
The SAS code and output follows.
data ncvs;
infile ncvs delimiter = ",";
input age married sex race hispanic hhinc away employ numinc
violent injury medtreat medexp robbery assault
pweight pstrat ppsu;
if violent > 0 then isviol = 1;
else isviol = 0;
if injury > 0 then isinjure = 1;
else isinjure = 0;
run;
Data Summary
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
numinc 79360 0.070071 0.002034 0.06605010 0.07409164
isviol 79360 0.013634 0.000665 0.01232006 0.01494718
isinjure 79360 0.003754 0.000316 0.00312960 0.00437754
Std Error
isinjure Variable N Mean of Mean 95% CL for Mean
0 medexp 79093 0 0 0.0000000 0.000000
1 medexp 267 101.6229 33.34777 35.7046182 167.541160
7.18 Note that $\sum_{q_1\le y\le q_2} yf(y)$ is the sum of the middle $N(1-2\alpha)$ observations in the population divided by $N$, and $\sum_{q_1\le y\le q_2} f(y) = F(q_2) - F(q_1) \approx 1-2\alpha$. Consequently,
$$\bar y_{U\alpha} = \frac{\text{sum of middle } N(1-2\alpha) \text{ observations in the population}}{N(1-2\alpha)}.$$
To estimate the trimmed mean, substitute $\hat f$, $\hat q_1$, and $\hat q_2$ for $f$, $q_1$, and $q_2$.
7.21 As stated in Section 7.1, the $y_i$'s are the measurements on observation units. If unit $i$ is in stratum $h$, then $w_i = N_h/n_h$. To express this formally, let
$$x_{hi} = \begin{cases}1 & \text{if unit } i \text{ is in stratum } h\\ 0 & \text{otherwise.}\end{cases}$$
Then we can write
$$w_i = \sum_{h=1}^H \frac{N_h}{n_h}x_{hi}$$
and
$$\sum_y y\hat f(y) = \frac{\displaystyle\sum_{i\in S}w_iy_i}{\displaystyle\sum_{i\in S}w_i}
= \frac{\displaystyle\sum_{i\in S}\sum_{h=1}^H(N_h/n_h)x_{hi}y_i}{\displaystyle\sum_{i\in S}\sum_{h=1}^H(N_h/n_h)x_{hi}}
= \frac{\displaystyle\sum_{h=1}^H N_h\sum_{i\in S}x_{hi}y_i/n_h}{\displaystyle\sum_{h=1}^H N_h\sum_{i\in S}x_{hi}/n_h}
= \frac{\displaystyle\sum_{h=1}^H N_h\bar y_h}{\displaystyle\sum_{h=1}^H N_h}
= \sum_{h=1}^H\frac{N_h}{N}\bar y_h.$$
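The conclusion, that the weighted estimator reproduces the stratified mean, can be verified on a toy stratified sample. A Python sketch (the strata below are invented for illustration):

```python
# Toy stratified sample: stratum h has N_h population units and the listed
# sample values; each sampled unit gets weight N_h / n_h.
strata = {1: (100, [2.0, 4.0]),
          2: (50, [10.0, 12.0, 14.0])}

num = den = 0.0
for Nh, ys in strata.values():
    w = Nh / len(ys)          # weight N_h / n_h
    num += w * sum(ys)
    den += w * len(ys)
weighted_mean = num / den     # sum(w_i y_i) / sum(w_i)

N = sum(Nh for Nh, _ in strata.values())
strat_mean = sum((Nh / N) * (sum(ys) / len(ys)) for Nh, ys in strata.values())
print(weighted_mean, strat_mean)  # equal
```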
Thus,
$$\sum_y y^2\hat f(y) = \sum_{i\in S}\frac{y_i^2}{n}, \qquad \sum_y y\hat f(y) = \sum_{i\in S}\frac{y_i}{n} = \bar y,$$
and
$$\hat S^2 = \frac{N}{N-1}\bigg\{\sum_y y^2\hat f(y) - \Big[\sum_y y\hat f(y)\Big]^2\bigg\} = \frac{N}{N-1}\bigg\{\sum_{i\in S}\frac{y_i^2}{n} - \bar y^2\bigg\} = \frac{N}{N-1}\sum_{i\in S}\frac{(y_i-\bar y)^2}{n} = \frac{N}{N-1}\frac{n-1}{n}s^2.$$
If n < N , Ŝ 2 is smaller than s2 (although they will be close if n is large).
7.23 We need to show that the inclusion probability is the same for every unit in $S_2$. Let $Z_i = 1$ if $i \in S$ and 0 otherwise, and let $D_i = 1$ if $i \in S_2$ and 0 otherwise. We have $P(Z_i = 1) = \pi_i$ and $P(D_i = 1 \mid Z_i = 1) \propto 1/\pi_i$. Then
$$P(i\in S_2) = P(Z_i=1, D_i=1) = P(D_i=1\mid Z_i=1)P(Z_i=1) \propto \frac{1}{\pi_i}\pi_i = 1.$$
7.24 A rare disease affects only a few children in the population. Even if all cases belong to the same cluster, a disease with estimated incidence of 2.1 per 1,000 is unlikely to affect all children in that cluster.
7.25 (a) Inner-city areas are sampled at twice the rate of non-inner-city areas. Thus
the selection probability for a household not in the inner city is one-half the selection
probability for a household in the inner city. The relative weight for a non-inner-city
household, then, is 2.
(b) Let º represent the probability that a household in the inner city is selected.
Then, for 1-person inner city households,
Chapter 8

Nonresponse
8.1 (a) Oversampling the low-income families is a form of substitution. One ad-
vantage of substitution is that the number of low-income families in the sample is
larger. The main drawback, however, is that the low-income families that respond
may differ from those that do not respond. For example, mothers who work outside
the home may be less likely to breast feed and less likely to respond to the survey.
(b) The difference in the percentage of mothers with one child indicates that the
weighting does not completely adjust for the nonresponse.
(c) Weights were used to try to adjust for nonresponse in this survey. We can never
know whether the adjustment is successful, however, unless we have some data from
the nonrespondents. The response rate for the survey decreased from 54% in 1984
to 46% in 1989. It might have been better for the survey researchers to concentrate
on increasing the response rate and obtaining accurate responses instead of tripling
the sample size.
Because the survey was poststratified using ethnic background, age, and education,
the weighted counts must agree with census figures for those variables. A possible
additional variable to use for poststratification would be number of children.
8.2 (a) The respondents report a total of
$$\sum_{i\in R} y_i = (66)(32) + (58)(41) + (26)(54) = 5894$$
hours of television viewing, so $\bar y = 5894/150 = 39.3$,
166 CHAPTER 8. NONRESPONSE
and
$$SE(\bar y) = \sqrt{\Big(1-\frac{150}{2000}\Big)\frac{403.6}{150}} = 1.58.$$
Note that this is technically a ratio estimate, since the number of respondents (here,
150) would vary if a different sample were taken. We are estimating the average
hours of TV watched in the domain of respondents.
(b)
GPA Group    Respondents   Nonrespondents   Total
3.00–4.00             66                9      75
2.00–2.99             58               14      72
Below 2.00            26               27      53
Total                150               50     200
\begin{align*}
X^2 &= \sum_{\text{cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \\
&= \frac{[66 - (.75)(75)]^2}{(.75)(75)} + \frac{[9 - (.25)(75)]^2}{(.25)(75)} + \cdots + \frac{[27 - (.25)(53)]^2}{(.25)(53)} \\
&= 1.69 + 5.07 + 0.30 + 0.89 + 4.76 + 14.27 \\
&= 26.97.
\end{align*}
Comparing the test statistic to a $\chi^2$ distribution with 2 df, the p-value is $1.4 \times 10^{-6}$.
This is strong evidence against the null hypothesis that the three groups have the
same response rates.
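The test statistic can be reproduced directly from the table; for 2 df the upper-tail probability of the $\chi^2$ distribution has the closed form $e^{-x/2}$. A minimal sketch:

```python
import math

# Respondents / nonrespondents by GPA group, from the table above
respondents    = [66, 58, 26]
nonrespondents = [9, 14, 27]
totals = [r + nr for r, nr in zip(respondents, nonrespondents)]
rate = sum(respondents) / sum(totals)   # overall response rate = 0.75

x2 = 0.0
for r, nr, t in zip(respondents, nonrespondents, totals):
    for observed, p in ((r, rate), (nr, 1 - rate)):
        expected = p * t
        x2 += (observed - expected) ** 2 / expected

# For a chi-square distribution with 2 df, P(X^2 > x) = exp(-x/2)
pvalue = math.exp(-x2 / 2)   # about 1.4e-6
```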
The hypothesis test indicates that the nonresponse is not MCAR, because response
rates appear to be related to GPA. We do not know whether the nonresponse is
MAR, or whether it is nonignorable.
(c)
(c)
$$SSB = \sum_{i=1}^{3}\sum_{j=1}^{n_i} (\bar y_i - \bar y)^2 = 9303.1, \qquad MSW = s^2 = 403.6.$$
The ANOVA table is as follows:
Source df SS MS F p-value
Between groups 2 9303.1 4651.5 11.5 0.0002
Within groups 147 59323.0 403.6
Total, about mean 149 68626.1
Both the nonresponse rate and the TV viewing appear to be related to GPA, so
it would be a reasonable variable to consider for weighting class adjustment or
poststratification.
(d) The initial weight for each person in the sample is 2000/200 = 10. After increasing
the weights for the respondents in each class to adjust for the nonrespondents, the
weight for each respondent is 10(75/66) = 11.36 in the 3.00–4.00 class, 10(72/58) = 12.41
in the 2.00–2.99 class, and 10(53/26) = 20.38 in the below-2.00 class.
8.6 (a) For this exercise, we classify the missing data in the “Other/Unknown”
category. Typically, raking would be used in situations in which the classification
variables were known (and known to be accurate) for all respondents.
                 Population   Respondents   Response Rate (%)
Ph.D.                 10235          3036                  30
Master's               7071          1640                  23
Other/Unknown          1303           325                  25
Industry               5397          1809                  34
Academia               6327          2221                  35
Government             2047           880                  43
Other/Unknown          4838            91                   2
These response rates are pretty dismal. The nonresponse does not appear to be
MCAR, as it differs by degree and by type of employment. I doubt that it is
MAR—I think that more information than is known from this survey would be
needed to predict the nonresponse.
(b) The cell counts from the sample are:
Industry Academia Other
PhD 798 1787 451 3036
non-PhD 1011 434 520 1965
1809 2221 971 5001
The initial sum of weights for each cell are:
Industry Academia Other
PhD 2969.4 6649.5 1678.2 11297.1
non-PhD 3762.0 1614.9 1934.9 7311.9
6731.4 8264.5 3613.1 18609.0
After adjusting for the population row counts (10235 for Ph.D. and 8374 for non-
Ph.D.) the new table is:
Industry Academia Other
PhD 2690.2 6024.4 1520.4 10235
non-PhD 4308.5 1849.5 2216.0 8374
6998.7 7873.9 3736.4 18609
Raking to the population column totals (Industry, 5397; Academia, 6327; Other,
6885) gives:
Industry Academia Other
PhD 2074.6 4840.8 2801.6 9717.0
non-PhD 3322.4 1486.2 4083.4 8892.0
5397.0 6327.0 6885.0 18609.0
As you can see, the previous two tables are still far apart, so the cycle of adjustments
must be repeated. After iterating, the weight sums converge to a final table that
matches both the population row totals and the population column totals.
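The raking iteration itself can be sketched with a few lines; repeating the row and column scalings until both margins match gives the converged weight sums (initial cell weight sums from the table above):

```python
# Raking (iterative proportional fitting): repeatedly scale rows to the
# population row totals and columns to the population column totals.
cells = [[2969.4, 6649.5, 1678.2],   # PhD:     Industry, Academia, Other
         [3762.0, 1614.9, 1934.9]]   # non-PhD: Industry, Academia, Other
row_targets = [10235, 8374]
col_targets = [5397, 6327, 6885]

for _ in range(200):
    for i, target in enumerate(row_targets):
        scale = target / sum(cells[i])
        cells[i] = [c * scale for c in cells[i]]
    for j, target in enumerate(col_targets):
        colsum = cells[0][j] + cells[1][j]
        cells[0][j] *= target / colsum
        cells[1][j] *= target / colsum

row_sums = [sum(row) for row in cells]
col_sums = [cells[0][j] + cells[1][j] for j in range(3)]
```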
                    Respondent    Nonrespondent   Total
Literature         636 (651.6)      279 (263.4)     915
Classics           451 (450.8)      182 (182.2)     633
Philosophy         481 (468.6)      177 (189.4)     658
History            611 (608.9)      244 (246.1)     855
Linguistics        493 (475.0)      174 (192.0)     667
Political Science  575 (593.2)      258 (239.8)     833
Sociology          588 (586.8)      236 (237.2)     824
Total                     3835             1550    5385
The Pearson test statistic is
$$X^2 = \sum_{\text{cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = 6.8.$$
Comparing the test statistic to a $\chi^2_6$ distribution, we calculate p-value 0.34. There
is no evidence that the response rates differ among strata.
The estimated correlation coefficient of the response rate and the percent female
members is 0.19. Performing a hypothesis test for association (Pearson correlation,
Spearman correlation, or Kendall's $\tau$) gives p-value > .10. There is no evidence that
the response rate is associated with the percentage of members who are female.
8.12 (a) The overall response rate, using the file teachmi.dat, was 310/754=0.41.
(b) As with many nonresponse problems, it’s easy to think of plausible reasons why
the nonresponse bias might go either direction. The teachers who work many hours
may be working so hard they are less likely to return the survey, or they may be
more conscientious and thus more likely to return it.
(c) The means and variances from the file teachnr.dat (ignoring missing values)
are
hrwork size preprmin assist
responses 26 25 26 26
ȳ 36.46 24.92 160.19 152.31
s2 2.61 25.74 3436.96 49314.46
V̂ (ȳ) 0.10 1.03 132.19 1896.71
The corresponding estimates from teachers.dat, the original cluster sample, are:
hrwork size preprmin assist
ȳˆr 33.82 26.93 168.74 52.00
V̂ (ȳˆr ) 0.50 0.57 70.57 228.96
8.14 (a) We are more likely to delete an observation if the value of xi is small.
Since xi and yi are positively correlated, we expect the mean of y to be too big.
(b) The population mean of acres92 is ȳU = 308582.
8.17 The argument is similar to the previous exercise. If the classes are sufficiently
large, then $E[1/\tilde\phi_c] \approx 1/\bar\phi_c$.
8.19
\begin{align*}
V(\hat{\bar y}_{wc})
&= V\left[\frac{n_1}{n}\frac{1}{n_{1R}}\sum_{i=1}^N Z_iR_ix_iy_i + \frac{n_2}{n}\frac{1}{n_{2R}}\sum_{i=1}^N Z_iR_i(1-x_i)y_i\right] \\
&= E\left\{V\left[\frac{n_1}{n}\frac{1}{n_{1R}}\sum_{i=1}^N Z_iR_ix_iy_i + \frac{n_2}{n}\frac{1}{n_{2R}}\sum_{i=1}^N Z_iR_i(1-x_i)y_i \,\middle|\, Z_1,\ldots,Z_N\right]\right\} \\
&\quad + V\left\{E\left[\frac{n_1}{n}\frac{1}{n_{1R}}\sum_{i=1}^N Z_iR_ix_iy_i + \frac{n_2}{n}\frac{1}{n_{2R}}\sum_{i=1}^N Z_iR_i(1-x_i)y_i \,\middle|\, Z_1,\ldots,Z_N\right]\right\} \\
&= E\left\{V\left[\frac{n_1}{n}\frac{\sum_{i=1}^N Z_iR_ix_iy_i}{\sum_{i=1}^N Z_iR_ix_i} + \frac{n_2}{n}\frac{\sum_{i=1}^N Z_iR_i(1-x_i)y_i}{\sum_{i=1}^N Z_iR_i(1-x_i)} \,\middle|\, Z_1,\ldots,Z_N\right]\right\} \\
&\quad + V\left\{E\left[\frac{n_1}{n}\frac{\sum_{i=1}^N Z_iR_ix_iy_i}{\sum_{i=1}^N Z_iR_ix_i} + \frac{n_2}{n}\frac{\sum_{i=1}^N Z_iR_i(1-x_i)y_i}{\sum_{i=1}^N Z_iR_i(1-x_i)} \,\middle|\, Z_1,\ldots,Z_N\right]\right\}.
\end{align*}
We use the ratio approximations from Chapter 4 to find the approximate expected
values and variances.
\begin{align*}
E\left[\frac{n_1}{n}\frac{\sum_{i=1}^N Z_iR_ix_iy_i}{\sum_{i=1}^N Z_iR_ix_i} \,\middle|\, Z_1,\ldots,Z_N\right]
&= E\left[\frac{n_1\sum_{i=1}^N Z_iR_ix_iy_i}{n\phi_1\sum_{i=1}^N Z_ix_i}\left(1 - \frac{\sum_{i=1}^N Z_i[R_i-\phi_1]x_i}{\sum_{i=1}^N Z_iR_ix_i}\right) \middle|\, Z_1,\ldots,Z_N\right] \\
&= \frac{1}{n\phi_1}E\left[\sum_{i=1}^N Z_iR_ix_iy_i - \sum_{i=1}^N Z_iR_ix_iy_i\,\frac{\sum_{i=1}^N Z_i[R_i-\phi_1]x_i}{\sum_{i=1}^N Z_iR_ix_i} \middle|\, Z_1,\ldots,Z_N\right] \\
&\approx \frac{1}{n}\sum_{i=1}^N Z_ix_iy_i - \frac{1}{(n\phi_1)^2}\sum_{i=1}^N Z_iV(R_i)x_iy_i \\
&\approx \frac{1}{n}\sum_{i=1}^N Z_ix_iy_i.
\end{align*}
Consequently,
\begin{align*}
V\left\{E\left[\frac{n_1}{n}\frac{\sum_{i=1}^N Z_iR_ix_iy_i}{\sum_{i=1}^N Z_iR_ix_i} + \frac{n_2}{n}\frac{\sum_{i=1}^N Z_iR_i(1-x_i)y_i}{\sum_{i=1}^N Z_iR_i(1-x_i)} \,\middle|\, Z_1,\ldots,Z_N\right]\right\}
&\approx V\left\{\frac{1}{n}\sum_{i=1}^N Z_ix_iy_i + \frac{1}{n}\sum_{i=1}^N Z_i(1-x_i)y_i\right\} \\
&= \left(1 - \frac{n}{N}\right)\frac{S_y^2}{n},
\end{align*}
the variance that would be obtained if there were no nonresponse. For the other
term,
\begin{align*}
V\left[\frac{n_1}{n}\frac{\sum_{i=1}^N Z_iR_ix_iy_i}{\sum_{i=1}^N Z_iR_ix_i} \,\middle|\, Z_1,\ldots,Z_N\right]
&= V\left[\frac{1}{n\phi_1}\sum_{i=1}^N Z_iR_ix_iy_i\left(1 - \frac{\sum_{i=1}^N Z_i[R_i-\phi_1]x_i}{\sum_{i=1}^N Z_iR_ix_i}\right) \middle|\, Z_1,\ldots,Z_N\right] \\
&\approx \frac{1}{(n\phi_1)^2}\sum_{i=1}^N Z_iV(R_i)x_iy_i^2 \\
&\approx \frac{\phi_1(1-\phi_1)}{(n\phi_1)^2}\sum_{i=1}^N Z_ix_iy_i^2.
\end{align*}
Thus,
\begin{align*}
E\left\{V\left[\frac{n_1}{n}\frac{\sum_{i=1}^N Z_iR_ix_iy_i}{\sum_{i=1}^N Z_iR_ix_i} + \frac{n_2}{n}\frac{\sum_{i=1}^N Z_iR_i(1-x_i)y_i}{\sum_{i=1}^N Z_iR_i(1-x_i)} \,\middle|\, Z_1,\ldots,Z_N\right]\right\}
&\approx E\left\{\frac{\phi_1(1-\phi_1)}{(n\phi_1)^2}\sum_{i=1}^N Z_ix_iy_i^2 + \frac{\phi_2(1-\phi_2)}{(n\phi_2)^2}\sum_{i=1}^N Z_i(1-x_i)y_i^2\right\} \\
&= \frac{\phi_1(1-\phi_1)}{n\phi_1^2}\sum_{i=1}^N x_iy_i^2 + \frac{\phi_2(1-\phi_2)}{n\phi_2^2}\sum_{i=1}^N (1-x_i)y_i^2.
\end{align*}
8.20 (a) Respondents are divided into 5 classes on the basis of the number of nights
the respondent was home during the 4 nights preceding the survey call.
The sampling weight wi for respondent i is then multiplied by 5/(ki + 1). The
respondents with k = 0 were only home on one of the five nights and are assigned to
represent their share of the population plus the share of four persons in the sample
who were called on one of their “unavailable” nights. The respondents most likely
to be home have k = 4; it is presumed that all persons in the sample who were home
every night were reached, so their weights are unchanged.
(b) This method of weighting is based on the premise that the most accessible per-
sons will tend to be overrepresented in the survey data. The method is easy to use,
theoretically appealing, and can be used in conjunction with callbacks. But it still
misses people who were not at home on any of the five nights, or who refused to par-
ticipate in the survey. Since in many surveys done over the telephone, nonresponse
is due in large part to refusals, the HPS method may not be helpful in dealing with
all nonresponse. Values of k may also be in error, because people may err when
recalling how many evenings they were home.
Chapter 9
Variance Estimation in Complex Surveys
9.1 All of the methods discussed in this chapter would be appropriate. Note that
the replication methods might slightly overestimate the variance because sampling
is done without replacement, but since the sampling fractions are fairly small we
expect the overestimation to be small.
9.2 We calculate $\bar y = 8.23333333$ and $s^2 = 15.978$, so $s^2/30 = 0.5326$.
For jackknife replicate $j$, the jackknife weight is $w_{j(j)} = 0$ for observation $j$ and
$w_{i(j)} = (30/29)w_i = (30/29)(100/30) = 3.44828$ for $i \neq j$. Using the jackknife
weights, we find $\bar y_{(1)} = 8.2413, \ldots, \bar y_{(30)} = 8.20690$, so, by (9.8),
$$\hat V_{JK}(\bar y) = \frac{29}{30}\sum_{j=1}^{30}[\bar y_{(j)} - \bar y]^2 = 0.5326054.$$
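The jackknife computation can be reproduced with a short sketch using the 30 sample values (they appear in the SAS datalines below); for the sample mean, the delete-one jackknife returns exactly $s^2/n$:

```python
# Delete-one jackknife for the mean of an SRS of n = 30 from N = 100
y = [8, 5, 2, 6, 6, 3, 8, 6, 10, 7, 15, 9, 15, 3, 5, 6,
     7, 10, 14, 3, 4, 17, 10, 6, 14, 12, 7, 8, 12, 9]
n = len(y)
ybar = sum(y) / n
s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)

# Leave-one-out replicate means
reps = [(sum(y) - yj) / (n - 1) for yj in y]
vjk = (n - 1) / n * sum((r - ybar) ** 2 for r in reps)   # equals s2/n
```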
$$\hat V[\hat F(\hat\theta_{1/2})] = \left(1 - \frac{n}{N}\right)\frac{0.25862069}{n} = \left(1 - \frac{30}{100}\right)\frac{0.25862069}{30} = 0.006034483.$$
This is a small sample, so we use the $t_{29}$ critical value of 2.045 to calculate
$$2.045\sqrt{\hat V[\hat F(\hat\theta_{1/2})]} = 0.1588596.$$
The lower confidence bound is $\hat F^{-1}(.5 - 0.1588596) = \hat F^{-1}(0.3411404)$ and the up-
per confidence bound for the median is $\hat F^{-1}(.5 + 0.1588596) = \hat F^{-1}(0.6588596)$.
Interpolating, we have that the lower confidence bound is
$$5 + \frac{0.34114 - 0.23333}{0.4 - 0.23333}(6 - 5) = 5.6$$
and the upper confidence bound is
$$8 + \frac{0.6588596 - 0.6}{0.666667 - 0.6}(9 - 8) = 8.8.$$
Thus an approximate 95% CI is [5.6, 8.8].
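The interpolated confidence bounds can be reproduced from the empirical cdf of the same 30 observations; a sketch, with the ecdf inversion done by linear interpolation:

```python
import math

y = [8, 5, 2, 6, 6, 3, 8, 6, 10, 7, 15, 9, 15, 3, 5, 6,
     7, 10, 14, 3, 4, 17, 10, 6, 14, 12, 7, 8, 12, 9]
n, N = 30, 100

vals = sorted(set(y))
ecdf = {v: sum(1 for x in y if x <= v) / n for v in vals}

s2 = n / (n - 1) * 0.5 * 0.5                     # 0.25862069
half = 2.045 * math.sqrt((1 - n / N) * s2 / n)   # 0.1588596

def finv(q):
    """Invert the ecdf with linear interpolation between adjacent values."""
    for lo, hi in zip(vals, vals[1:]):
        if ecdf[lo] < q <= ecdf[hi]:
            return lo + (q - ecdf[lo]) / (ecdf[hi] - ecdf[lo]) * (hi - lo)

lower = finv(0.5 - half)   # about 5.65
upper = finv(0.5 + half)   # about 8.88
```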
SAS code below gives approximately the same interval:
data srs30;
input y @@;
wt = 100/30;
datalines;
8 5 2 6 6 3 8 6 10 7 15 9 15 3 5 6
7 10 14 3 4 17 10 6 14 12 7 8 12 9
;
data htpop_epmf;
set htpop_epmf;
epmf = percent/100;
ecdf = cum_pct/100;
run;
data calcvar;
set srs30;
ui = 0;
if y le 7 then ui = 1;
ei = ui - .5;
9.5
       (a)      (b)      (c)      (d)      (e)      (f)      (g)
       Age   Violent  Bothpar   Male  Hispanic   Sinpar   Drugs
θ̂₁  0.12447  0.52179  0.29016  0.90160  0.30106  0.55691  0.90072
θ̂₂  0.09528  0.43358  0.31309  0.84929  0.20751  0.52381  0.84265
θ̂₃  0.08202  0.36733  0.34417  0.99319  0.17876  0.51068  0.82960
θ̂₄  0.21562  0.37370  0.25465  0.96096  0.08532  0.55352  0.80869
θ̂₅  0.21660  0.42893  0.30181  0.91314  0.14912  0.54480  0.74491
θ̂₆  0.07321  0.48006  0.30514  0.96786  0.15752  0.55350  0.82232
θ̂₇  0.02402  0.51201  0.27299  0.96558  0.25170  0.54490  0.84977
9.6 From Exercise 3.4, $\hat B = 11.41946$, $\hat{\bar y}_r = \bar x_U\hat B = 10.3\hat B = 117.6$, and $SE(\hat{\bar y}_r) =
3.98$. Using the jackknife, we have $\hat B_{(\cdot)} = 11.41937$, $\hat{\bar y}_{r(\cdot)} = 117.6$, and $SE(\hat{\bar y}_r) =
10.3\sqrt{0.1836} = 4.41$. The jackknife standard error is larger, partly because it does
not include the fpc.
9.7 We use
$$\hat V_{JK}(\hat{\bar y}_r) = \frac{n-1}{n}\sum_{j=1}^{10}(\hat{\bar y}_{r(j)} - \hat{\bar y}_r)^2.$$
The $\hat{\bar y}_{r(j)}$'s for returnf and hadmeas are given in the following table:
For hadmeas,
$$\hat V_{JK}(\hat{\bar y}_r) = \frac{9}{10}\sum_{j=1}^{10}(\hat{\bar y}_{r(j)} - 0.4402907)^2 = 0.00526.$$
9.8 We have $\hat B_{(\cdot)} = 0.9865651$ and $\hat V_{JK}(\hat B) = 3.707 \times 10^{-5}$. With the fpc, the
linearization variance estimate is $\hat V_L(\hat B) = 3.071 \times 10^{-5}$.
9.9 The median weekday greens fee for nine holes is $\hat\theta = 12$. For the SRS of size
120,
$$V[\hat F(\theta_{0.5})] = \frac{(.5)(.5)}{120} = 0.0021.$$
An approximate 95% confidence interval for the median is therefore
$$[\hat F^{-1}(.5 - 1.96\sqrt{.0021}),\ \hat F^{-1}(.5 + 1.96\sqrt{.0021})] = [\hat F^{-1}(.4105),\ \hat F^{-1}(.5895)],$$
with standard error 1.39. This leads to a 95% CI of [10.1, 15.6] for the median.
9.13 (a) Since $h''(t) = -2$, the remainder term is
$$\int_a^x (x-t)h''(t)\,dt = -2\int_a^x (x-t)\,dt = -2\left(\frac{x^2}{2} - ax + \frac{a^2}{2}\right) = -(x-a)^2.$$
(b) The remainder term is likely to be smaller than the other terms because it has
$(\hat p - p)^2$ in it. This will be small if $\hat p$ is close to $p$.
(c) To find the exact variance, we need to find $V(\hat p - \hat p^2)$, which involves the fourth
moments. For an SRSWR, $X = n\hat p \sim \mathrm{Bin}(n, p)$, so we can find the moments using
the moment generating function of the Binomial:
\begin{align*}
E(X) &= M_X'(t)\Big|_{t=0} = n(pe^t + q)^{n-1}pe^t\Big|_{t=0} = np \\
E(X^2) &= M_X''(t)\Big|_{t=0} = \left[n(n-1)(pe^t + q)^{n-2}(pe^t)^2 + n(pe^t + q)^{n-1}pe^t\right]\Big|_{t=0} \\
&= n(n-1)p^2 + np = n^2p^2 + np(1-p) \\
E(X^3) &= M_X'''(t)\Big|_{t=0} = np(1 - 3p + 3np + 2p^2 - 3np^2 + n^2p^2).
\end{align*}
Then
\begin{align*}
V[\hat p(1-\hat p)] &= V(\hat p) + V(\hat p^2) - 2\,\mathrm{Cov}(\hat p, \hat p^2) \\
&= E[\hat p^2] - p^2 + E[\hat p^4] - [E(\hat p^2)]^2 - 2E[\hat p^3] + 2pE(\hat p^2) \\
&= \frac{p(1-p)}{n} + \frac{p}{n^3}(1 - 7p + 7np + 12p^2 - 18np^2 + 6n^2p^2 - 6p^3 + 11np^3 - 6n^2p^3 + n^3p^3) \\
&\quad - \left[p^2 + \frac{p(1-p)}{n}\right]^2 - 2\frac{p}{n^2}(1 - 3p + 3np + 2p^2 - 3np^2 + n^2p^2) + 2p\left[p^2 + \frac{p(1-p)}{n}\right] \\
&= \frac{p(1-p)}{n}(1 - 4p + 4p^2) + \frac{1}{n^2}(-2p + 12p^2 - 20p^3 + 10p^4) + \frac{1}{n^3}(p - 7p^2 + 12p^3 - 6p^4).
\end{align*}
Note that the first term is $(1-2p)^2V(\hat p)$, and the other terms are $(\text{constant})/n^2$
and $(\text{constant})/n^3$. The remainder terms become small relative to the first term
when $n$ is large. You can see why statisticians use the linearization method so
frequently: even for this simple example, the exact calculations of the variance are
nasty.
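The expansion can be checked against an exact enumeration of the Binomial distribution (the three-term expression is exact, not merely asymptotic); a sketch:

```python
from math import comb

def exact_var(n, p):
    """Exact Var[p-hat * (1 - p-hat)] by enumerating X ~ Bin(n, p)."""
    g = [(k / n) * (1 - k / n) for k in range(n + 1)]
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mean = sum(gi * wi for gi, wi in zip(g, pmf))
    return sum((gi - mean)**2 * wi for gi, wi in zip(g, pmf))

def series_var(n, p):
    """The three-term expression derived above."""
    return (p * (1 - p) * (1 - 2 * p)**2 / n
            + (-2 * p + 12 * p**2 - 20 * p**3 + 10 * p**4) / n**2
            + (p - 7 * p**2 + 12 * p**3 - 6 * p**4) / n**3)
```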
Note that with an SRS without replacement, the result is much more complicated.
Results from the following paper may be used to find the moments.
Finucan, H. M., Galbraith, R. F., and Stone, M. (1974). Moments without tears
in simple random sampling from a finite population. Biometrika, 61, 151–154.
9.14 (a) Write $B_1 = h(t_{xy}, t_x, t_y, t_{x^2}, N)$, where
$$h(a, b, c, d, e) = \frac{a - bc/e}{d - b^2/e} = \frac{ea - bc}{ed - b^2}.$$
The last equality follows from the normal equations. Then, by linearization,
\begin{align*}
\hat B_1 - B_1 &\approx \frac{\partial h}{\partial a}(\hat t_{xy} - t_{xy}) + \frac{\partial h}{\partial b}(\hat t_x - t_x) + \frac{\partial h}{\partial c}(\hat t_y - t_y) + \frac{\partial h}{\partial d}(\hat t_{x^2} - t_{x^2}) + \frac{\partial h}{\partial e}(\hat N - N) \\
&= \frac{N}{Nt_{x^2} - (t_x)^2}\left[\hat t_{xy} - t_{xy} - (B_0 - B_1\bar x_U)(\hat t_x - t_x) - \bar x_U(\hat t_y - t_y) - B_1(\hat t_{x^2} - t_{x^2}) + B_0\bar x_U(\hat N - N)\right] \\
&= \frac{N}{Nt_{x^2} - (t_x)^2}\left[\sum_{i\in S} w_i\left\{x_iy_i - (B_0 - B_1\bar x_U)x_i - \bar x_Uy_i - B_1x_i^2 + B_0\bar x_U\right\}\right] \\
&\quad - \frac{N}{Nt_{x^2} - (t_x)^2}\left[t_{xy} - t_x(B_0 - B_1\bar x_U) - \bar x_Ut_y - B_1t_{x^2} + B_0N\bar x_U\right] \\
&= \frac{N}{Nt_{x^2} - (t_x)^2}\sum_{i\in S} w_i(y_i - B_0 - B_1x_i)(x_i - \bar x_U).
\end{align*}
\begin{align*}
\frac{\partial h}{\partial t_1} &= \frac{1}{t_3 - 1} \\
\frac{\partial h}{\partial t_2} &= -\frac{2t_2}{t_3(t_3 - 1)} \\
\frac{\partial h}{\partial t_3} &= -\frac{1}{t_3 - 1}\left[\frac{1}{t_3 - 1}\left(t_1 - \frac{t_2^2}{t_3}\right) - \frac{t_2^2}{t_3^2}\right]
\end{align*}
Then, by linearization,
$$\hat S^2 - S^2 \approx \frac{\partial h}{\partial t_1}(\hat t_1 - t_1) + \frac{\partial h}{\partial t_2}(\hat t_2 - t_2) + \frac{\partial h}{\partial t_3}(\hat t_3 - t_3).$$
Let
\begin{align*}
q_i &= \frac{\partial h}{\partial t_1}y_i^2 + \frac{\partial h}{\partial t_2}y_i + \frac{\partial h}{\partial t_3} \\
&= \frac{1}{t_3 - 1}y_i^2 - \frac{2t_2}{t_3(t_3 - 1)}y_i - \frac{1}{t_3 - 1}\left[\frac{1}{t_3 - 1}\left(t_1 - \frac{t_2^2}{t_3}\right) - \frac{t_2^2}{t_3^2}\right] \\
&= \frac{1}{t_3 - 1}\left(y_i^2 - 2\frac{t_2}{t_3}y_i - \frac{1}{t_3 - 1}\left(t_1 - \frac{t_2^2}{t_3}\right) + \frac{t_2^2}{t_3^2}\right).
\end{align*}
$$h(a, b, c, d, e, f) = \frac{d - ab/f}{\sqrt{(c - a^2/f)(e - b^2/f)}} = \frac{fd - ab}{\sqrt{(fc - a^2)(fe - b^2)}}.$$
Then, by linearization,
\begin{align*}
\hat R - R &\approx \frac{\partial h}{\partial a}(\hat t_x - t_x) + \frac{\partial h}{\partial b}(\hat t_y - t_y) + \frac{\partial h}{\partial c}(\hat t_{x^2} - t_{x^2}) + \frac{\partial h}{\partial d}(\hat t_{xy} - t_{xy}) + \frac{\partial h}{\partial e}(\hat t_{y^2} - t_{y^2}) + \frac{\partial h}{\partial f}(\hat N - N) \\
&= \frac{1}{N(N-1)S_xS_y}\left(-t_y + \frac{t_xRS_y}{S_x}\right)(\hat t_x - t_x) + \frac{1}{N(N-1)S_xS_y}\left(-t_x + \frac{t_yRS_x}{S_y}\right)(\hat t_y - t_y) \\
&\quad - \frac{R}{2}\frac{1}{(N-1)S_x^2}(\hat t_{x^2} - t_{x^2}) + \frac{1}{(N-1)S_xS_y}(\hat t_{xy} - t_{xy}) - \frac{R}{2}\frac{1}{(N-1)S_y^2}(\hat t_{y^2} - t_{y^2}) \\
&\quad + \left[\frac{t_{xy}}{N(N-1)S_xS_y} - \frac{R}{2}\left(\frac{t_{y^2}}{N(N-1)S_y^2} + \frac{t_{x^2}}{N(N-1)S_x^2}\right)\right](\hat N - N) \\
&= \frac{1}{N(N-1)S_xS_y}\left[\left(-t_y + \frac{t_xRS_y}{S_x}\right)(\hat t_x - t_x) + \left(-t_x + \frac{t_yRS_x}{S_y}\right)(\hat t_y - t_y)\right. \\
&\quad - \frac{NRS_y}{2S_x}(\hat t_{x^2} - t_{x^2}) + N(\hat t_{xy} - t_{xy}) - \frac{NRS_x}{2S_y}(\hat t_{y^2} - t_{y^2}) \\
&\quad \left. + \left\{t_{xy} - \frac{R}{2}\left(\frac{t_{y^2}S_x}{S_y} + \frac{t_{x^2}S_y}{S_x}\right)\right\}(\hat N - N)\right].
\end{align*}
and
$$\left.\frac{\partial h}{\partial b_l}\right|_{t_1,\ldots,t_L,N_1,\ldots,N_L} = -\frac{t_l}{N_l}.$$
Consequently,
$$h(\hat t_1, \ldots, \hat t_L, \hat N_1, \ldots, \hat N_L) \approx t + \sum_{l=1}^L(\hat t_l - t_l) - \sum_{l=1}^L\frac{t_l}{N_l}(\hat N_l - N_l)$$
and
$$V(\hat t_{post}) \approx V\left[\sum_{l=1}^L\left(\hat t_l - \frac{t_l}{N_l}\hat N_l\right)\right].$$
Suppose the random groups are independent. Then $\bar y_1, \ldots, \bar y_R$ are independent and
identically distributed random variables with
$$E[\bar y_r] = 0, \qquad V[\bar y_r] = E[\bar y_r^2] = \frac{S^2}{m} = \kappa_2(\bar y_1), \qquad E[\bar y_r^4] = \kappa_4(\bar y_1).$$
We have
\begin{align*}
E\left[\frac{1}{R(R-1)}\sum_{r=1}^R(\bar y_r - \bar y)^2\right] &= \frac{1}{R(R-1)}\left\{\sum_{r=1}^RE[\bar y_r^2] - RE[\bar y^2]\right\} \\
&= \frac{1}{R(R-1)}\sum_{r=1}^R[V(\bar y_r) - V(\bar y)] \\
&= \frac{1}{R(R-1)}\sum_{r=1}^R\left[\frac{S^2}{m} - \frac{S^2}{n}\right] \\
&= \frac{1}{R(R-1)}\sum_{r=1}^R\left[R\frac{S^2}{n} - \frac{S^2}{n}\right] \\
&= \frac{S^2}{n}.
\end{align*}
Also,
\begin{align*}
E\left[\left\{\sum_{r=1}^R(\bar y_r - \bar y)^2\right\}^2\right] &= E\left[\left\{\sum_{r=1}^R\bar y_r^2 - R\bar y^2\right\}^2\right] \\
&= E\left[\sum_{r=1}^R\sum_{s=1}^R\bar y_r^2\bar y_s^2 - 2R\bar y^2\sum_{r=1}^R\bar y_r^2 + R^2\bar y^4\right].
\end{align*}
Consequently,
\begin{align*}
E\left[\hat V_2^2(\hat\theta)\right] &= \frac{1}{R^2(R-1)^2}\left[\left(1 - \frac{2}{R} + \frac{1}{R^2}\right)R\kappa_4(\bar y_1) + \left(1 - \frac{2}{R} + \frac{3}{R^2}\right)R(R-1)\kappa_2^2(\bar y_1)\right] \\
&= \frac{1}{R^3}\kappa_4(\bar y_1) + \frac{1}{R^3(R-1)}(R^2 - 2R + 3)\kappa_2^2(\bar y_1)
\end{align*}
and
\begin{align*}
V\left[\frac{1}{R(R-1)}\sum_{r=1}^R(\bar y_r - \bar y)^2\right] &= \frac{1}{R^3}\kappa_4(\bar y_1) + \frac{R^2 - 2R + 3}{R^3(R-1)}\kappa_2^2(\bar y_1) - \left(\frac{S^2}{n}\right)^2 \\
&= \frac{1}{R^3}\kappa_4(\bar y_1) + \frac{R^2 - 2R + 3}{R^3(R-1)}\left(\frac{S^2}{m}\right)^2 - \left(\frac{S^2}{Rm}\right)^2,
\end{align*}
so
\begin{align*}
CV^2\left[\frac{1}{R(R-1)}\sum_{r=1}^R(\bar y_r - \bar y)^2\right] &= \frac{\dfrac{1}{R^3}\kappa_4(\bar y_1) + \dfrac{R^2 - 2R + 3}{R^3(R-1)}\left(\dfrac{S^2}{m}\right)^2 - \left(\dfrac{S^2}{Rm}\right)^2}{\left(\dfrac{S^2}{Rm}\right)^2} \\
&= \frac{1}{R}\left[\frac{\kappa_4(\bar y_1)m^2}{S^4} - \frac{R-3}{R-1}\right].
\end{align*}
We now need to find $\kappa_4(\bar y_1) = E[\bar y_r^4]$ to finish the problem. A complete argument
giving the fourth moment for an SRSWR is given by
Hansen, M. H., Hurwitz, W. N., and Madow, W. G. (1953). Sample Survey Methods
and Theory, Volume 2. New York: Wiley, pp. 99–100.
They note that
$$\bar y_r^4 = \frac{1}{m^4}\left[\sum_{i\in S_r}y_i^4 + 4\sum_{i\ne j}y_i^3y_j + 3\sum_{i\ne j}y_i^2y_j^2 + 6\sum_{i\ne j\ne k}y_i^2y_jy_k + \sum_{i\ne j\ne k\ne l}y_iy_jy_ky_l\right]$$
so that
$$\kappa_4(\bar y_1) = E[\bar y_r^4] = \frac{1}{m^3(N-1)}\sum_{i=1}^N(y_i - \bar y_U)^4 + 3\frac{m-1}{m^3}S^4.$$
This results in
$$CV^2\left[\frac{1}{R(R-1)}\sum_{r=1}^R(\bar y_r - \bar y)^2\right] = \frac{1}{R}\left[\frac{\kappa_4(\bar y_1)m^2}{S^4} - \frac{R-3}{R-1}\right] = \frac{1}{R}\left[\frac{\kappa}{m} + 3\frac{m-1}{m} - \frac{R-3}{R-1}\right],$$
where $\kappa = \sum_{i=1}^N(y_i - \bar y_U)^4/[(N-1)S^4]$.
The number of groups, R, has more impact on the CV than the group size m: the
random group estimator of the variance is unstable if R is small.
9.19 First note that
\begin{align*}
\bar y_{str}(\alpha_r) - \bar y_{str} &= \sum_{h=1}^H\frac{N_h}{N}\bar y_h(\alpha_r) - \sum_{h=1}^H\frac{N_h}{N}\frac{y_{h1} + y_{h2}}{2} \\
&= \sum_{h=1}^H\frac{N_h}{N}\left(\frac{\alpha_{rh} + 1}{2}y_{h1} - \frac{\alpha_{rh} - 1}{2}y_{h2}\right) - \sum_{h=1}^H\frac{N_h}{N}\frac{y_{h1} + y_{h2}}{2} \\
&= \sum_{h=1}^H\frac{N_h}{N}\alpha_{rh}\frac{y_{h1} - y_{h2}}{2}.
\end{align*}
Then
\begin{align*}
\hat V_{BRR}(\bar y_{str}) &= \frac{1}{R}\sum_{r=1}^R[\bar y_{str}(\alpha_r) - \bar y_{str}]^2 \\
&= \frac{1}{R}\sum_{r=1}^R\left[\sum_{h=1}^H\frac{N_h}{N}\alpha_{rh}\frac{y_{h1} - y_{h2}}{2}\right]^2 \\
&= \frac{1}{R}\sum_{r=1}^R\sum_{h=1}^H\sum_{\ell=1}^H\frac{N_h}{N}\alpha_{rh}\frac{y_{h1} - y_{h2}}{2}\cdot\frac{N_\ell}{N}\alpha_{r\ell}\frac{y_{\ell1} - y_{\ell2}}{2} \\
&= \frac{1}{R}\sum_{r=1}^R\sum_{h=1}^H\left(\frac{N_h}{N}\right)^2\alpha_{rh}^2\frac{(y_{h1} - y_{h2})^2}{4} + \frac{1}{R}\sum_{h=1}^H\sum_{\ell\ne h}\frac{N_h}{N}\frac{y_{h1} - y_{h2}}{2}\cdot\frac{N_\ell}{N}\frac{y_{\ell1} - y_{\ell2}}{2}\sum_{r=1}^R\alpha_{rh}\alpha_{r\ell} \\
&= \sum_{h=1}^H\left(\frac{N_h}{N}\right)^2\frac{(y_{h1} - y_{h2})^2}{4} \\
&= \hat V_{str}(\bar y_{str}).
\end{align*}
The last step holds because $\sum_{r=1}^R\alpha_{rh}\alpha_{r\ell} = 0$ for $\ell \ne h$.
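The equality of the BRR and stratified variance estimators is easy to verify numerically with a small balanced design; a sketch, with hypothetical stratum data and half-sample signs taken from the orthogonal columns of a 4×4 Hadamard matrix:

```python
# Balanced repeated replication for H = 3 strata with 2 observations each.
# alpha[r][h] = +1 selects y_h1, -1 selects y_h2 in replicate r; the columns
# below are orthogonal, so sum_r alpha_rh * alpha_rl = 0 for h != l.
alpha = [(1, 1, 1), (-1, 1, -1), (1, -1, -1), (-1, -1, 1)]
Nh = [100, 200, 300]
y = [(3, 5), (10, 6), (2, 8)]   # (y_h1, y_h2) for each stratum (hypothetical)
N = sum(Nh)

ystr = sum(Nh[h] / N * (y[h][0] + y[h][1]) / 2 for h in range(3))

def half_sample(a):
    return sum(Nh[h] / N * (y[h][0] if a[h] == 1 else y[h][1]) for h in range(3))

v_brr = sum((half_sample(a) - ystr) ** 2 for a in alpha) / len(alpha)
v_str = sum((Nh[h] / N) ** 2 * (y[h][0] - y[h][1]) ** 2 / 4 for h in range(3))
```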
9.20 As noted in the text,
$$\hat V_{str}(\bar y_{str}) = \sum_{h=1}^H\left(\frac{N_h}{N}\right)^2\frac{(y_{h1} - y_{h2})^2}{4}.$$
Also,
$$\hat\theta(\alpha_r) = \bar y_{str}(\alpha_r) = \sum_{h=1}^H\frac{\alpha_{rh}}{2}\frac{N_h}{N}(y_{h1} - y_{h2}) + \bar y_{str},$$
so
$$\hat\theta(\alpha_r) - \hat\theta(-\alpha_r) = \sum_{h=1}^H\alpha_{rh}\frac{N_h}{N}(y_{h1} - y_{h2})$$
and, using the property $\sum_{r=1}^R\alpha_{rh}\alpha_{rk} = 0$ for $k \ne h$,
\begin{align*}
\frac{1}{4R}\sum_{r=1}^R[\hat\theta(\alpha_r) - \hat\theta(-\alpha_r)]^2 &= \frac{1}{4R}\sum_{r=1}^R\sum_{h=1}^H\sum_{k=1}^H\alpha_{rh}\alpha_{rk}\frac{N_h}{N}\frac{N_k}{N}(y_{h1} - y_{h2})(y_{k1} - y_{k2}) \\
&= \frac{1}{4R}\sum_{r=1}^R\sum_{h=1}^H\left(\frac{N_h}{N}\right)^2(y_{h1} - y_{h2})^2 \\
&= \frac{1}{4}\sum_{h=1}^H\left(\frac{N_h}{N}\right)^2(y_{h1} - y_{h2})^2 = \hat V_{str}(\bar y_{str}).
\end{align*}
Similarly,
\begin{align*}
\frac{1}{2R}\sum_{r=1}^R\left\{[\hat\theta(\alpha_r) - \hat\theta]^2 + [\hat\theta(-\alpha_r) - \hat\theta]^2\right\}
&= \frac{1}{2R}\sum_{r=1}^R\left\{\left[\sum_{h=1}^H\frac{\alpha_{rh}}{2}\frac{N_h}{N}(y_{h1} - y_{h2})\right]^2 + \left[\sum_{h=1}^H\frac{-\alpha_{rh}}{2}\frac{N_h}{N}(y_{h1} - y_{h2})\right]^2\right\} \\
&= \frac{1}{2R}\sum_{r=1}^R\sum_{h=1}^H2\left(\frac{\alpha_{rh}}{2}\frac{N_h}{N}\right)^2(y_{h1} - y_{h2})^2 \\
&= \frac{1}{4}\sum_{h=1}^H\left(\frac{N_h}{N}\right)^2(y_{h1} - y_{h2})^2.
\end{align*}
and
$$[\hat t(\alpha_r)]^2 = \sum_{h=1}^H\sum_{k=1}^H\frac{N_hN_k\alpha_{rh}\alpha_{rk}}{4}(y_{h1} - y_{h2})(y_{k1} - y_{k2}) + 2\hat t\sum_{h=1}^H\frac{N_h\alpha_{rh}}{2}(y_{h1} - y_{h2}) + \hat t^2.$$
Thus,
$$\hat t(\alpha_r) - \hat t(-\alpha_r) = \sum_{h=1}^HN_h\alpha_{rh}(y_{h1} - y_{h2}),$$
$$[\hat t(\alpha_r)]^2 - [\hat t(-\alpha_r)]^2 = 2\hat t\sum_{h=1}^HN_h\alpha_{rh}(y_{h1} - y_{h2}),$$
and
$$\hat\theta(\alpha_r) - \hat\theta(-\alpha_r) = (2a\hat t + b)\sum_{h=1}^HN_h\alpha_{rh}(y_{h1} - y_{h2}).$$
Consequently, using the balanced property $\sum_{r=1}^R\alpha_{rh}\alpha_{rk} = 0$ for $k \ne h$, we have
\begin{align*}
\frac{1}{4R}\sum_{r=1}^R[\hat\theta(\alpha_r) - \hat\theta(-\alpha_r)]^2 &= \frac{1}{4R}(2a\hat t + b)^2\sum_{r=1}^R\sum_{h=1}^H\sum_{k=1}^HN_hN_k\alpha_{rh}\alpha_{rk}(y_{h1} - y_{h2})(y_{k1} - y_{k2}) \\
&= \frac{1}{4}(2a\hat t + b)^2\sum_{h=1}^HN_h^2(y_{h1} - y_{h2})^2.
\end{align*}
Using linearization,
$$h(\hat t) \approx h(t) + (2at + b)(\hat t - t),$$
so
$$V_L(\hat\theta) = (2at + b)^2V(\hat t)$$
and
$$\hat V_L(\hat\theta) = (2a\hat t + b)^2\frac{1}{4}\sum_{h=1}^HN_h^2(y_{h1} - y_{h2})^2,$$
which is the same as $\frac{1}{4R}\sum_{r=1}^R[\hat\theta(\alpha_r) - \hat\theta(-\alpha_r)]^2$.
9.23 We can write
$$\hat t_{post} = g(w, y, x_1, \ldots, x_L) = \sum_{l=1}^LN_l\frac{\sum_{j\in S}w_jx_{lj}y_j}{\sum_{j\in S}w_jx_{lj}}.$$
Then,
\begin{align*}
z_i = \frac{\partial g(w, y, x_1, \ldots, x_L)}{\partial w_i}
&= \sum_{l=1}^L\left\{\frac{N_lx_{li}y_i}{\sum_{j\in S}w_jx_{lj}} - \frac{N_lx_{li}\sum_{j\in S}w_jx_{lj}y_j}{\left(\sum_{j\in S}w_jx_{lj}\right)^2}\right\} \\
&= \sum_{l=1}^L\left\{\frac{N_lx_{li}y_i}{\hat N_l} - \frac{N_lx_{li}\hat t_{yl}}{\hat N_l^2}\right\} \\
&= \sum_{l=1}^L\frac{N_l}{\hat N_l}x_{li}\left(y_i - \frac{\hat t_{yl}}{\hat N_l}\right).
\end{align*}
Thus,
$$\hat V(\hat t_{post}) = \hat V\left(\sum_{i\in S}w_iz_i\right).$$
Note that this variance estimator differs from the one in Exercise 9.17, although
they are asymptotically equivalent.
9.24 From Chapter 5,
\begin{align*}
V(\hat t) &\approx N^2M\frac{MSB}{n} = \frac{N^2M}{n}\frac{NM - 1}{M(N-1)}S^2[1 + (M-1)\mathrm{ICC}] \\
&\approx \frac{NM}{n}\frac{NM}{M}p(1-p)[1 + (M-1)\mathrm{ICC}].
\end{align*}
Consequently, the relative variance $v$ can be written as $\beta_0 + \beta_1/t$, where
$$\beta_0 = -\frac{1}{nM}[1 + (M-1)\mathrm{ICC}] \qquad\text{and}\qquad \beta_1 = \frac{N}{n}[1 + (M-1)\mathrm{ICC}].$$
9.25 (a) From (9.2),
\begin{align*}
V[\hat B] &\approx E\left[\left\{-\frac{t_y}{t_x^2}(\hat t_x - t_x) + \frac{1}{t_x}(\hat t_y - t_y)\right\}^2\right] \\
&= \frac{t_y^2}{t_x^2}E\left[\left\{-\frac{1}{t_x}(\hat t_x - t_x) + \frac{1}{t_y}(\hat t_y - t_y)\right\}^2\right] \\
&= \frac{t_y^2}{t_x^2}\left[\frac{V(\hat t_x)}{t_x^2} + \frac{V(\hat t_y)}{t_y^2} - \frac{2}{t_xt_y}\mathrm{Cov}(\hat t_x, \hat t_y)\right] \\
&= \frac{t_y^2}{t_x^2}\left[\frac{V(\hat t_x)}{t_x^2} + \frac{V(\hat t_y)}{t_y^2} - \frac{2B}{t_xt_y}V(\hat t_x)\right] \\
&= \frac{t_y^2}{t_x^2}\left[\frac{V(\hat t_y)}{t_y^2} - \frac{V(\hat t_x)}{t_x^2}\right].
\end{align*}
Chapter 10
Categorical Data Analysis in Complex Surveys
10.1 Many data sets used for chi-square tests in introductory statistics books use
dependent data. See Alf and Lohr (2007) for a review of how books ignore clustering
in the data.
10.3 (a) Observed and expected (in parentheses) proportions are given in the fol-
lowing table:

                           Abuse
                      No             Yes
Symptom   No    .7542 (.7109)   .1017 (.1451)
          Yes   .0763 (.1196)   .0678 (.0244)

(b)
$$X^2 = 118\left[\frac{(.7542 - .7109)^2}{.7109} + \cdots + \frac{(.0678 - .0244)^2}{.0244}\right] = 12.8$$
$$G^2 = 2(118)\left[.7542\ln\left(\frac{.7542}{.7109}\right) + \cdots + .0678\ln\left(\frac{.0678}{.0244}\right)\right] = 10.3.$$
Both p-values are less than .002.
Because the expected count in the Yes-Yes cell is small, we also perform Fisher’s
exact test, which gives p-value .0016.
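Both statistics can be reproduced from the proportions in the table; a sketch:

```python
import math

n = 118
observed = [.7542, .1017, .0763, .0678]
expected = [.7109, .1451, .1196, .0244]   # products of the margins

# Pearson and likelihood-ratio statistics for the 2x2 table
x2 = n * sum((o - e) ** 2 / e for o, e in zip(observed, expected))
g2 = 2 * n * sum(o * math.log(o / e) for o, e in zip(observed, expected))
```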
10.4 (a) This is a test of independence. A sample of students is taken, and each
student is classified by instructor and grade.
(b) $X^2 = 34.8$. Comparing this to a $\chi^2_3$ distribution, we see that the p-value is
less than 0.0001. A similar conclusion follows from the likelihood ratio test, with
$G^2 = 34.5$.
(c) Students are probably not independent; most likely, a cluster sample of students
was taken, with the Math II classes as the clusters. The p-values in part (b) are
thus lower than they should be.
10.5 The following table gives the value of $\hat\theta$ for the 7 random groups:

Random Group       θ̂
1              0.0132
2              0.0147
3              0.0252
4             −0.0224
5              0.0073
6             −0.0057
7              0.0135
Average        0.0065
std. dev.      0.0158

Using the random group method, the standard error of $\hat\theta$ is $0.0158/\sqrt{7} = 0.0060$,
so the test statistic is
$$\frac{\hat\theta^2}{\hat V(\hat\theta)} = 0.79.$$
Since our estimate of the variance from the random group method has only 6 df, we
compare the test statistic to an $F(1, 6)$ distribution rather than to a $\chi^2_1$ distribution,
obtaining a p-value of 0.4.
10.6 (a) The contingency table (for complete data) is as follows:

                          Break again?
                           No    Yes   Total
Faculty                    65    167     232
Classified staff           55    459     514
Administrative staff       11     75      86
Academic professional       9     58      67
Total                     140    759     899

$X^2_P = 37.3$; comparing to a $\chi^2_3$ distribution gives p-value < .0001. We can use the
$\chi^2$ test for homogeneity because we assume product-multinomial sampling. (Class
is the stratification variable.)
(b) Using the weights (with the respondents who answered both questions), we esti-
mate the probabilities as

                       Work
                     No      Yes    Total
Breakaga   No     0.0832   0.0859   0.1691
           Yes    0.6496   0.1813   0.8309
Total             0.7328   0.2672   1.0000
To estimate the proportion in the Yes–Yes cell, I used:
$$\hat p_{yy} = \frac{\text{sum of weights of persons answering yes to both questions}}{\text{sum of weights of respondents to both questions}}.$$
Other answers are possible, depending on how you want to treat the nonresponse.
(c) The odds ratio, calculated using the table in part (b), is
$$\frac{0.0832/0.0859}{0.6496/0.1813} = 0.27065.$$
(Or, you could get $1/.27065 = 3.695$.)
The estimated proportions ignoring the weights are

                       Work
                     No      Yes    Total
breakaga   No     0.0850   0.0671   0.1521
           Yes    0.6969   0.1510   0.8479
Total             0.7819   0.2181   1.0000
Without weights the odds ratio is
$$\frac{0.0850/0.0671}{0.6969/0.1510} = 0.27448$$
(or, $1/.27448 = 3.643$).
Weights appear to make little difference in the value of the odds ratio.
(d) $\hat\theta = (.0832)(.1813) - (.6496)(.0859) = -0.04068$.
(e) Using linearization, define
$$q_i = \hat p_{22}y_{11i} + \hat p_{11}y_{22i} - \hat p_{12}y_{21i} - \hat p_{21}y_{12i},$$
where $y_{jki}$ is an indicator variable for membership in class $(j, k)$. We then estimate
$V(\bar q_{str})$ using the usual methods for stratified samples. Using the summary statistics,

Stratum    $N_h$   $n_h$   $\bar q_h$   $s_h^2$   $(N_h/N)^2(1 - n_h/N_h)s_h^2/n_h$
Faculty     1374    228     −.117    0.0792    4.04 × 10^-5
C.S.        1960    514     −.059    0.0111    4.52 × 10^-6
A.S.         252     86     −.061    0.0207    7.42 × 10^-7
A.P.          95     66     −.076    0.0349    1.08 × 10^-7
Total       3681    894                        4.58 × 10^-5
data nhanes;
infile nhanes delimiter=’,’ firstobs=2;
input sdmvstra sdmvpsu wtmec2yr age ridageyr riagendr ridreth2
dmdeduc indfminc bmxwt bmxbmi bmxtri
bmxwaist bmxthicr bmxarml;
bmiclass = .;
if 0 < bmxbmi and bmxbmi < 25 then bmiclass = 1;
else if bmxbmi >= 25 and bmxbmi < 30 then bmiclass = 2;
else if bmxbmi >= 30 then bmiclass = 3;
if age < 30 then ageclass = 1;
else if age >= 30 then ageclass = 2;
label age = "Age at Examination (years)"
riagendr = "Gender"
ridreth2 = "Race/Ethnicity"
dmdeduc = "Education Level"
indfminc = "Family income"
bmxwt = "Weight (kg)"
bmxbmi = "Body mass index"
bmxtri = "Triceps skinfold (mm)"
bmxwaist = "Waist circumference (cm)"
bmxthicr = "Thigh circumference (cm)"
bmxarml = "Upper arm length (cm)";
run;
proc surveyfreq data=nhanes;
strata sdmvstra;
cluster sdmvpsu;
weight wtmec2yr;
tables bmiclass*ageclass/chisq deff;
run;
F Value 162.4424
Num DF 2
Den DF 30
Pr > F <.0001
data ncvs;
infile ncvs delimiter = "," firstobs=2;
input age married sex race hispanic hhinc away employ numinc
F Value 29.2529
Num DF 1
Den DF 143
Pr > F <.0001
There is strong evidence that males are more likely to be victims of violent crime
than females.
10.13 This test statistic does not in general give correct p-values for data from a
complex survey. It ensures that the sum of the “observed” counts is n but does not
adjust for stratification or clustering.
To see this, note that for the data in Example 10.4, the proposed test statistic is
the same as $X^2$ because all weights are equal. But in that example $X^2/2$, not $X^2$,
has a null $\chi^2_1$ distribution because of the clustering.
10.14 (a) For the Wald test,
$$\hat\theta = \hat p_{11}\hat p_{22} - \hat p_{12}\hat p_{21}.$$
Then, using Taylor linearization,
$$\hat\theta \approx \theta + p_{22}(\hat p_{11} - p_{11}) + p_{11}(\hat p_{22} - p_{22}) - p_{12}(\hat p_{21} - p_{21}) - p_{21}(\hat p_{12} - p_{12}).$$
Then $\hat V_L(\hat\theta) = \hat V(\hat{\bar q})$, and
$$\hat V_L(\hat\theta) = \frac{1}{n}\left(\frac{1}{\hat p_{11}} + \frac{1}{\hat p_{12}} + \frac{1}{\hat p_{21}} + \frac{1}{\hat p_{22}}\right) = \frac{1}{x_{11}} + \frac{1}{x_{12}} + \frac{1}{x_{21}} + \frac{1}{x_{22}}.$$
This is the estimated variance given in Section 10.1.1.
10.17 In a multinomial sample, all design effects are 1. From (10.9), under $H_0$,
\begin{align*}
E[X^2] &= \sum_{i=1}^r\sum_{j=1}^c(1 - p_{ij}) - \sum_{i=1}^r(1 - p_{i+}) - \sum_{j=1}^c(1 - p_{+j}) \\
&= rc - 1 - (r - 1) - (c - 1) \\
&= (r - 1)(c - 1).
\end{align*}
Since $Y \sim N(0, \Sigma)$, $U \sim N(0, P^T\Sigma^{-1/2}\Sigma\Sigma^{-1/2}P) = N(0, I)$, so $W_i = U_i^2 \sim \chi^2_1$
and the $W_i$'s are independent.
(b) Using a central limit theorem for survey sampling, we know that $V(\hat\theta)^{-1/2}(\hat\theta - \theta)$
has an asymptotic $N(0, I)$ distribution under $H_0: \theta = 0$. Using part (a), then,
$$\hat\theta^TA^{-1}\hat\theta = \hat\theta^TV(\hat\theta)^{-1/2}V(\hat\theta)^{1/2}A^{-1}V(\hat\theta)^{1/2}V(\hat\theta)^{-1/2}\hat\theta$$
has the same asymptotic distribution as $\sum\lambda_iW_i$, where the $\lambda_i$'s are the eigenvalues
of
$$V(\hat\theta)^{1/2}A^{-1}V(\hat\theta)^{1/2}.$$
(c)
$$E[\hat\theta^TA^{-1}\hat\theta] \approx \sum_i\lambda_i, \qquad V[\hat\theta^TA^{-1}\hat\theta] \approx 2\sum_i\lambda_i^2,$$
and similarly for the other cells. Then we use equations (5.4)–(5.6) to estimate the
mean and variance for each cell. We have the following frequency data:
Freq. SM SF NM NF M F S N
0 41 45 58 34 30 17 24 28
1 17 20 11 22 24 24 19 19
2 13 6 2 15 17 30 28 24
ȳˆ .3028 .2254 .1056 .3662 .4085 .5915 .5282 .4718
V̂ (ȳˆ) .0022 .0015 .0008 .0022 .0022 .0022 .0026 .0026
Then
$$X^2_F = \frac{X^2}{1.07} = \frac{17.89}{1.07} = 16.7,$$
with p-value < 0.0001.
10.20
Both statistics are very large. The Rao–Scott $\chi^2$ statistic is 2721, and the Wald
test statistic is 4838. There is strong evidence that the variables are associated.
Chapter 11
Regression with Complex Survey Data
11.3 The average score for the students planning a trip is $\bar y_1 = 77.158730$ and the
average score for the students not planning a trip is $\bar y_2 = 61.887218$. Using SAS
PROC SURVEYREG, we get $\bar y_1 - \bar y_2 = 15.27$ with 95% CI [7.6247634, 22.9182608].
Since 0 is not in the CI, there is evidence that the domain means differ.
11.4 (a) From SAS, the fitted regression line for the truncated data set is
data anthrop;
infile anthrop firstobs=2 delimiter=",";
input finger height ;
one = 1;
run;
data anthrop1; /* Keep the lowest 2000 values in the data set */
set anthrop;
if _N_ <= 2000;
run;
goptions reset=all;
goptions colors = (black);
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
(b) We use exactly the same code as before, except now we sort the data by finger
instead of by height.
data anthrop2;
set anthrop;
if _N_ <= 2000;
run;
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
These values of the slope and intercept are quite close to the values given in Figure
11.4. The standard errors are larger, however, reflecting the smaller number of data
points and the reduced spread of the x's.
$\hat V_L(\hat B_1) = 0.000261$, so $SE_L(\hat B_1) = \sqrt{0.000261} = .016$.
data nybight;
infile nybight delimiter="," firstobs=2;
input year stratum catchnum catchwt numspp depth temp ;
if stratum = 1 or stratum = 2 then relwt = 1;
else if (stratum ge 3 and stratum le 6) then relwt = 2;
if year = 1974;
run;
Standard
Parameter Estimate Error t Value Pr > |t|
11.7 Using the weights as in Exercise 11.4, the estimated regression coefficients are
$\hat B_0 = 7.569$ and $\hat B_1 = 0.0778$. From equation (11.8), $\hat V_L(\hat B_1) = 0.068$ (alternatively,
$\hat V_{JK}(\hat B_1) = 0.070$). The slope is not significantly different from 0. Here is SAS
output:
11.10 (a)
(b)
data nhanes;
infile nhanes delimiter=’,’ firstobs=2;
input sdmvstra sdmvpsu wtmec2yr age ridageyr riagendr ridreth2
dmdeduc indfminc bmxwt bmxbmi bmxtri
bmxwaist bmxthicr bmxarml;
if riagendr = 1 then x = 0; /* x=0 is male*/
if riagendr = 2 then x = 1; /* x=1 is female */
if age ge 15 then over15 = 1;
else if age lt 15 then over15 = 0;
else over15=.;
one = 1;
label age = "Age at Examination (years)"
agesq = "Age^2"
riagendr = "Gender"
ridreth2 = "Race/Ethnicity"
dmdeduc = "Education Level"
indfminc = "Family income"
bmxwt = "Weight (kg)"
bmxbmi = "Body mass index"
bmxtri = "Triceps skinfold (mm)"
bmxwaist = "Waist circumference (cm)"
bmxthicr = "Thigh circumference (cm)"
bmxarml = "Upper arm length (cm)";
run;
goptions reset=all;
goptions colors = (gray);
axis4 label=(angle=90 ’Triceps Skinfold’) order=(0 to 50 by 10);
axis3 label=(’Body Mass Index’) order = (10 to 70 by 10);
axis5 order=(0 to 50 by 10) major=none minor=none value=none;
symbol interpol=join width=2 color = black;
goptions reset=all;
goptions colors = (gray);
axis3 label=(’Predicted Values’) order = (15 to 30 by 5);
axis4 label=(angle=90 ’Residuals’) order=(-20 to 40 by 10);
axis5 order=(10 to 70 by 10) major=none minor=none value=none;
R2 = 0.38. Note the pattern in the residuals vs. predicted values plot. You may
want to use a model with log transformations instead.
11.15
R2 = 0.57.
11.16
data ncvs;
infile ncvs delimiter = ",";
input age married sex race hispanic hhinc away employ numinc
violent injury medtreat medexp robbery assault
pweight pstrat ppsu;
agesq = age*age;
if violent ge 1 then isviol = 1;
else if violent = 0 then isviol = 0;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
From this model, younger people and males are more likely to have at least one
violent victimization. A quadratic term in age is not significant.
11.17 From (11.4),
\begin{align*}
B_1 &= \frac{\sum_{i=1}^Nx_iy_i - \left(\sum_{i=1}^Nx_i\right)\left(\sum_{i=1}^Ny_i\right)\Big/N}{\sum_{i=1}^Nx_i^2 - \left(\sum_{i=1}^Nx_i\right)^2\Big/N} \\
&= \frac{N_1\bar y_{1U} - N_1\bar y_U}{N_1 - N_1^2/N} \\
&= \frac{\bar y_{1U} - (N_1\bar y_{1U} + N_2\bar y_{2U})/N}{1 - N_1/N} \\
&= \bar y_{1U} - \bar y_{2U}.
\end{align*}
From (11.5),
\begin{align*}
B_0 &= \frac{t_y - B_1t_x}{N} = \bar y_U - B_1\bar x_U \\
&= \frac{N_1\bar y_{1U} + N_2\bar y_{2U}}{N} - \frac{N_1}{N}(\bar y_{1U} - \bar y_{2U}) \\
&= \bar y_{2U}.
\end{align*}
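A tiny hypothetical example illustrates the result: with a 0/1 indicator $x$, the slope from (11.4) is the difference of the two group means, and the intercept from (11.5) is the mean of the $x = 0$ group. A sketch:

```python
# x is a 0/1 group indicator (illustrative data, not from the text)
x = [1, 1, 1, 0, 0]
y = [5, 7, 9, 2, 4]
N = len(x)

sxy = sum(a * b for a, b in zip(x, y))
sx, sy, sxx = sum(x), sum(y), sum(a * a for a in x)

B1 = (sxy - sx * sy / N) / (sxx - sx ** 2 / N)
B0 = (sy - B1 * sx) / N

mean1 = sum(b for a, b in zip(x, y) if a == 1) / x.count(1)
mean0 = sum(b for a, b in zip(x, y) if a == 0) / x.count(0)
```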
11.18 (a)
(b) $\hat\beta_0 = -4.096$; $\hat\beta_1 = 6.049$
(c) We estimate the model-based variance using the regression software: $\hat V_M(\hat\beta_1) =
0.541$.
$$\hat V_L(\hat\beta_1) = \frac{n\sum(x_i - \bar x)^2(y_i - \hat\beta_0 - \hat\beta_1x_i)^2}{(n-1)\left[\sum(x_i - \bar x)^2\right]^2} = 0.685$$
$\hat V_L$ is larger, as we would expect since the plot exhibits unequal variances.
11.19 From (11.10), for straight-line regression,
$$\hat{B} = \Bigl(\sum_{i\in S} w_i\mathbf{x}_i\mathbf{x}_i^T\Bigr)^{-1}\sum_{i\in S} w_i\mathbf{x}_iy_i$$
with $\mathbf{x}_i = [1\ x_i]^T$. Here,
$$\sum_{i\in S} w_i\mathbf{x}_i\mathbf{x}_i^T = \begin{bmatrix}\sum_{i\in S} w_i & \sum_{i\in S} w_ix_i\\[4pt] \sum_{i\in S} w_ix_i & \sum_{i\in S} w_ix_i^2\end{bmatrix}$$
and
$$\sum_{i\in S} w_i\mathbf{x}_iy_i = \begin{bmatrix}\sum_{i\in S} w_iy_i\\[4pt] \sum_{i\in S} w_ix_iy_i\end{bmatrix},$$
so
$$\hat{B} = \frac{1}{\Bigl(\sum_{i\in S} w_i\Bigr)\Bigl(\sum_{i\in S} w_ix_i^2\Bigr) - \Bigl(\sum_{i\in S} w_ix_i\Bigr)^2}
\begin{bmatrix}\sum_{i\in S} w_ix_i^2 & -\sum_{i\in S} w_ix_i\\[4pt] -\sum_{i\in S} w_ix_i & \sum_{i\in S} w_i\end{bmatrix}
\begin{bmatrix}\sum_{i\in S} w_iy_i\\[4pt] \sum_{i\in S} w_ix_iy_i\end{bmatrix}.$$
Thus,
$$\hat{B}_1 = \frac{\sum_{i\in S} w_ix_iy_i - \Bigl(\sum_{i\in S} w_ix_i\Bigr)\Bigl(\sum_{i\in S} w_iy_i\Bigr)\Big/\sum_{i\in S} w_i}{\sum_{i\in S} w_ix_i^2 - \Bigl(\sum_{i\in S} w_ix_i\Bigr)^2\Big/\sum_{i\in S} w_i}$$
and
$$\hat{B}_0 = \frac{\Bigl(\sum_{i\in S} w_ix_i^2\Bigr)\Bigl(\sum_{i\in S} w_iy_i\Bigr) - \Bigl(\sum_{i\in S} w_ix_i\Bigr)\Bigl(\sum_{i\in S} w_ix_iy_i\Bigr)}{\Bigl(\sum_{i\in S} w_i\Bigr)\Bigl(\sum_{i\in S} w_ix_i^2\Bigr) - \Bigl(\sum_{i\in S} w_ix_i\Bigr)^2}
= \Bigl(\sum_{i\in S} w_i\Bigr)^{-1}\Bigl[\sum_{i\in S} w_iy_i - \hat{B}_1\sum_{i\in S} w_ix_i\Bigr].$$
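The closed-form weighted coefficients above can be sketched directly. Any data lying exactly on $y = 2x + 1$ must return a slope of 2 and an intercept of 1 for every choice of positive weights; the values below are illustrative only:

```python
# Weighted least squares slope and intercept via the Exercise 11.19 sums.
def weighted_slr(w, x, y):
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b1 = (swxy - swx * swy / sw) / (swxx - swx**2 / sw)
    b0 = (swy - b1 * swx) / sw
    return b0, b1

# Points on y = 2x + 1 with unequal weights:
b0, b1 = weighted_slr([1.0, 2.0, 5.0], [1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
print(b0, b1)   # 1.0 2.0
```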
216 CHAPTER 11. REGRESSION WITH COMPLEX SURVEY DATA
$$\begin{aligned}
\mathbf{z}_i &= \frac{\partial\hat{B}}{\partial w_i}\\
&= \Bigl[\frac{\partial}{\partial w_i}\Bigl(\sum_{j\in S} w_j\mathbf{x}_j\mathbf{x}_j^T\Bigr)^{-1}\Bigr]\sum_{j\in S} w_j\mathbf{x}_jy_j + \Bigl(\sum_{j\in S} w_j\mathbf{x}_j\mathbf{x}_j^T\Bigr)^{-1}\frac{\partial}{\partial w_i}\Bigl[\sum_{j\in S} w_j\mathbf{x}_jy_j\Bigr]\\
&= -\Bigl(\sum_{j\in S} w_j\mathbf{x}_j\mathbf{x}_j^T\Bigr)^{-1}\mathbf{x}_i\mathbf{x}_i^T\Bigl(\sum_{j\in S} w_j\mathbf{x}_j\mathbf{x}_j^T\Bigr)^{-1}\sum_{j\in S} w_j\mathbf{x}_jy_j + \Bigl(\sum_{j\in S} w_j\mathbf{x}_j\mathbf{x}_j^T\Bigr)^{-1}\mathbf{x}_iy_i\\
&= \Bigl(\sum_{j\in S} w_j\mathbf{x}_j\mathbf{x}_j^T\Bigr)^{-1}\bigl(-\mathbf{x}_i\mathbf{x}_i^T\hat{B} + \mathbf{x}_iy_i\bigr)\\
&= \Bigl(\sum_{j\in S} w_j\mathbf{x}_j\mathbf{x}_j^T\Bigr)^{-1}\mathbf{x}_i\bigl(y_i - \mathbf{x}_i^T\hat{B}\bigr).
\end{aligned}$$
with
$$\hat{B} = \Bigl(\sum_{i\in S} w_i\frac{1}{\sigma_i^2}\mathbf{x}_i\mathbf{x}_i^T\Bigr)^{-1}\sum_{i\in S} w_i\frac{1}{\sigma_i^2}\mathbf{x}_iy_i.$$
Thus,
$$\hat{t}_{y\mathrm{GREG}} = \sum_{i\in S} w_iy_i + \Bigl(\mathbf{t}_x - \sum_{i\in S} w_i\mathbf{x}_i\Bigr)^T\hat{B}$$
and
$$\begin{aligned}
z_i &= \frac{\partial\hat{t}_{y\mathrm{GREG}}}{\partial w_i}\\
&= \bigl(y_i - \mathbf{x}_i^T\hat{B}\bigr) + \bigl(\mathbf{t}_x - \hat{\mathbf{t}}_x\bigr)^T\frac{\partial\hat{B}}{\partial w_i}\\
&= \bigl(y_i - \mathbf{x}_i^T\hat{B}\bigr) + \bigl(\mathbf{t}_x - \hat{\mathbf{t}}_x\bigr)^T\Bigl(\sum_{j\in S} w_j\frac{1}{\sigma_j^2}\mathbf{x}_j\mathbf{x}_j^T\Bigr)^{-1}\mathbf{x}_i\frac{1}{\sigma_i^2}\bigl(y_i - \mathbf{x}_i^T\hat{B}\bigr).
\end{aligned}$$
$$\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}.$$
Thus,
$$\mathrm{Cov}_M(\hat{\boldsymbol\beta}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathrm{Cov}(\mathbf{Y})[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T]^T = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol\Sigma\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}.$$
$$V_M(\hat\beta_1) = \Bigl(\bar{x}^2\sum\sigma_i^2 - 2\bar{x}\sum x_i\sigma_i^2 + \sum x_i^2\sigma_i^2\Bigr)\Big/\Bigl[\sum(x_i-\bar{x})^2\Bigr]^2 = \frac{\sum(x_i-\bar{x})^2\sigma_i^2}{\bigl[\sum(x_i-\bar{x})^2\bigr]^2}.$$
To see the relation to Section 11.2.1, let
$$Q_i = Y_i - \beta_0 - \beta_1x_i.$$
Then $V_M(Q_i) = \sigma_i^2$; since observations are independent under the model,
$$V_M\Bigl[\sum_i(x_i-\bar{x})Q_i\Bigr] = \sum_i(x_i-\bar{x})^2\sigma_i^2.$$
and
$$\mathbf{X}_S^T\mathbf{W}_S\boldsymbol\Sigma_S^{-1}\mathbf{y}_S = \frac{1}{\sigma^2}\begin{bmatrix}\sum_{i\in S} w_iy_i\\[4pt] \sum_{i\in S} w_ix_iy_i\end{bmatrix} = \frac{1}{\sigma^2}\begin{bmatrix}\hat{t}_y\\ \hat{t}_{xy}\end{bmatrix}.$$
Using (11.20),
$$\hat{B} = \begin{bmatrix}\hat{N} & \hat{t}_x\\ \hat{t}_x & \hat{t}_{x^2}\end{bmatrix}^{-1}\begin{bmatrix}\hat{t}_y\\ \hat{t}_{xy}\end{bmatrix} = \frac{1}{\hat{N}\hat{t}_{x^2} - (\hat{t}_x)^2}\begin{bmatrix}\hat{t}_{x^2}\hat{t}_y - \hat{t}_x\hat{t}_{xy}\\ -\hat{t}_x\hat{t}_y + \hat{N}\hat{t}_{xy}\end{bmatrix}.$$
If $y = x$,
$$\hat{t}_{x\mathrm{GREG}} = \hat{t}_x + \frac{1}{\hat{N}\hat{t}_{x^2} - (\hat{t}_x)^2}\bigl[0 + (t_x - \hat{t}_x)\bigl(-\hat{t}_x^2 + \hat{N}\hat{t}_{x^2}\bigr)\bigr] = \hat{t}_x + (t_x - \hat{t}_x) = t_x.$$
Chapter 12
Two-Phase Sampling
12.1 We use (12.4) to estimate the total, and (12.7) to estimate its variance. We obtain
$$\hat{t}_{\mathrm{str}}^{(2)} = \frac{100000}{1000}\sum_{\mathrm{cells}}\frac{n_h}{m_h}r_h = 48310.$$
From (12.7), we estimate $s_h^{2(2)} = \hat{p}_h^{(2)}\bigl(1-\hat{p}_h^{(2)}\bigr)m_h/(m_h-1)$ and obtain
$$\begin{aligned}
\hat{V}\bigl(\hat{t}_{\mathrm{str}}^{(2)}\bigr) &= N(N-1)\sum_{h=1}^H\Bigl(\frac{n_h-1}{n-1} - \frac{m_h-1}{N-1}\Bigr)\frac{n_h}{n}\frac{s_h^{2(2)}}{m_h} + \frac{N^2}{n-1}\Bigl(1-\frac{n}{N}\Bigr)\sum_{h=1}^H\frac{n_h}{n}\bigl(\bar{y}_h^{(2)} - \hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)^2\\
&= 100000(99999)(0.000912108) + \frac{100000^2}{999}(0.109282588)\\
&= 10214904.
\end{aligned}$$
Thus $\mathrm{SE}\bigl(\hat{t}_{\mathrm{str}}^{(2)}\bigr) = 3196$.
Note that we can do a rough check using SAS PROC SURVEYMEANS, which will
capture the variability due to the phase II sample.
data exer1201;
input strat nh mh diabcount ;
nondiab = mh - diabcount ;
datalines;
1 241 96 86
2 113 45 17
3 174 35 29
4 472 47 8
;
data exer1201;
set exer1201;
do i = 1 to diabcount;
sampwt = nh/mh*(100000/1000);
diab = 1;
output;
end;
do i = 1 to nondiab;
sampwt = nh/mh*(100000/1000);
diab = 0;
output;
end;
This code gives $\hat{t} = 48310$ with SE 3059.0785. We can then add the second term in (12.7) to obtain
$$\hat{V}\bigl(\hat{t}_{\mathrm{str}}^{(2)}\bigr) = 3059.0785^2 + \frac{100000^2}{999}(0.109282588) = 10451881.$$
This is a little bit larger than the estimate obtained above because we did not incorporate fpcs.
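The point estimate itself can also be checked outside SAS. A minimal Python sketch using the stratum counts from the datalines above:

```python
# Exercise 12.1 point estimate: phase I sample of n = 1000 from N = 100000,
# phase II subsamples of size m_h per stratum with r_h positives observed.
N, n = 100000, 1000
strata = [  # (n_h, m_h, r_h), taken from the datalines above
    (241, 96, 86),
    (113, 45, 17),
    (174, 35, 29),
    (472, 47, 8),
]
t_hat = (N / n) * sum(nh / mh * rh for nh, mh, rh in strata)
print(round(t_hat))   # 48310
```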
12.3 Using the population and sample sizes of $N = 2130$, $n^{(1)} = 201$, and $n^{(2)} = 12$, the phase I weight is $w_i^{(1)} = 2130/201 = 10.597$ for every phase I unit. For the units in phase II, the phase II weight is $w_i^{(2)} = 201/12 = 16.75$. We have the following information and summary statistics from shorebirds.dat:
$$\hat{t}_x^{(1)} = \sum_{i\in S^{(1)}} w_i^{(1)}x_i = 44284.93,$$
$$\hat{t}_x^{(2)} = \sum_{i\in S^{(2)}} w_i^{(1)}w_i^{(2)}x_i = 34790,$$
and $\hat{t}_y^{(2)} = 43842.5$. Using (12.9),
$$\hat{t}_{yr}^{(2)} = \hat{t}_x^{(1)}\,\frac{\hat{t}_y^{(2)}}{\hat{t}_x^{(2)}} = 44284.93\,\frac{43842.5}{34790} = 55808.$$
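The ratio estimate follows directly from the three quoted totals:

```python
# Two-phase ratio estimate of Exercise 12.3, from the summary totals above.
tx1 = 44284.93   # phase I estimate of t_x
tx2 = 34790.0    # phase II estimate of t_x
ty2 = 43842.5    # phase II estimate of t_y
t_yr = tx1 * ty2 / tx2
print(round(t_yr))   # 55808
```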
We estimate the variance using (12.11): we have $s_y^2 = 115.3561$, $s_e^2 = 7.453911$, and
$$\begin{aligned}
\hat{V}\bigl(\hat{t}_{yr}^{(2)}\bigr) &= N^2\Bigl(1-\frac{n^{(1)}}{N}\Bigr)\frac{s_y^2}{n^{(1)}} + N^2\Bigl(1-\frac{n^{(2)}}{n^{(1)}}\Bigr)\frac{s_e^2}{n^{(2)}}\\
&= (2130)^2\Bigl(1-\frac{201}{2130}\Bigr)\frac{115.3561}{201} + (2130)^2\Bigl(1-\frac{12}{201}\Bigr)\frac{7.453911}{12}\\
&= 2358067 + 2649890 = 5007958.
\end{aligned}$$
Then, using (12.8) (which we may use since the fpc is negligible),
$$\begin{aligned}
\hat{V}\bigl(\hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr) &\approx \sum_{h=1}^H\frac{n_h-1}{n-1}\frac{n_h}{n}\frac{s_h^{2(2)}}{m_h} + \frac{1}{n-1}\sum_{h=1}^H\frac{n_h}{n}\bigl(\bar{y}_h^{(2)} - \hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)^2\\
&= 0.00186941 + 9.924\times10^{-5} + 3.201\times10^{-5} + \frac{1}{1558}(0.007192 + 0.003654 + 0.012126)\\
&= 0.002015,
\end{aligned}$$
so the standard error is 0.045. Note that the second term adds little to the variability since the phase I sample size is large.
12.5 (a) We use the final weights for the phase 2 sample to calculate the proportions in the table, and use (12.8) to find the standard error for each (given in parentheses after the proportion).

                               Case?
  Proportion (SE)        No               Yes
  Gender  Male     0.2265 (0.0397)  0.1496 (0.0312)  0.3761 (0.0444)
          Female   0.2164 (0.0399)  0.4075 (0.0426)  0.6239 (0.0444)
                   0.4430 (0.0449)  0.5570 (0.0449)
(b) We can calculate the Rao-Scott correction for a test statistic based on a sample of size $n = 1558$. Then, (10.2) gives
$$X^2 = n\sum_{i=1}^r\sum_{j=1}^c\frac{(\hat{p}_{ij} - \hat{p}_{i+}\hat{p}_{+j})^2}{\hat{p}_{i+}\hat{p}_{+j}} = (1558)(0.062035726) = 96.65.$$
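The chi-square statistic can be recomputed from the rounded proportions in the part (a) table; small rounding differences from the quoted 96.65 are to be expected:

```python
# X^2 from (10.2), using the 2x2 proportions of Exercise 12.5(a).
p = [[0.2265, 0.1496],
     [0.2164, 0.4075]]
n = 1558
row = [sum(r) for r in p]
col = [sum(c) for c in zip(*p)]
X2 = n * sum((p[i][j] - row[i] * col[j])**2 / (row[i] * col[j])
             for i in range(2) for j in range(2))
print(X2)   # approximately 96.6
```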
To find the design effect, we divide the estimated variance from part (a) by the variance that would have been obtained if an SRS of size 1558 had been selected, namely $\hat{p}(1-\hat{p})/1558$. We obtain the following table:

                        Case?
  Design effect      No       Yes
  Gender  Male     13.987   11.929   13.072
          Female   14.661   11.708   13.072
                   12.715   12.715
Using (10.9),
$$E[X^2] \approx \sum_{i=1}^r\sum_{j=1}^c(1-p_{ij})d_{ij} - \sum_{i=1}^r(1-p_{i+})d_i^R - \sum_{j=1}^c(1-p_{+j})d_j^C = 13.601.$$
data exer1205;
input strat nh mh gender $ case count ;
datalines;
1 1049 60 m 0 16
1 1049 60 m 1 8
1 1049 60 f 0 17
1 1049 60 f 1 19
2 237 48 m 0 9
2 237 48 m 1 8
2 237 48 f 0 5
2 237 48 f 1 26
3 272 142 m 0 15
3 272 142 m 1 28
3 272 142 f 0 8
3 272 142 f 1 91
;
data exer1205;
set exer1205;
do i = 1 to count;
sampwt = nh/mh;
output;
end;
This code treats the phase I sample as a population, so it underestimates the variance
slightly. But since in this case n is large, the results are very close. SAS calculates
the Rao-Scott chi-square statistic as 7.01, and p-value as 0.008.
12.9 We estimate $W_h$ by $n_h/n$, and estimate $S_h^2$ by $s_h^{2(2)}$. Then, using (12.17), we have

  Stratum         $\hat{W}_h$   $\hat{S}_h^2$   $\hat{W}_h\hat{S}_h^2$   $\nu_h$
  Yes                0.3658        0.1995           0.0730                0.40
  No                 0.3895        0.1313           0.0511                0.32
  Not available      0.2447        0.2437           0.0596                0.44
  Total              1.0000                         0.1837

We estimate $S^2$ using
$$(n-1)\hat{S}^2 = \sum_{h=1}^H(n_h-1)\hat{S}_h^2 + \sum_{h=1}^H n_h(\hat{p}_h - \hat{p})^2,$$
$$V\bigl(\hat{t}_y^{(2)}\bigr) = V\bigl(\hat{t}_y^{(1)}\bigr) + E\bigl(V\bigl[\hat{t}_y^{(2)}\mid Z\bigr]\bigr).$$
$$P(D_iD_j = 1\mid Z_iZ_j = 1) = \frac{n^{(2)}(n^{(2)}-1)}{n^{(1)}(n^{(1)}-1)}\quad\text{for } j\neq i,$$
$$w_i^{(1)} = \frac{N}{n^{(1)}},$$
and
$$w_i^{(2)} = Z_i\,\frac{n^{(1)}}{n^{(2)}}.$$
In addition,
$$P(Z_i = 1) = \frac{n^{(1)}}{N}$$
and
$$P(Z_iZ_j = 1) = \frac{n^{(1)}(n^{(1)}-1)}{N(N-1)}.$$
Thus, using (12.1) to write $\hat{t}_y^{(2)}$,
$$\begin{aligned}
V\bigl[\hat{t}_y^{(2)}\mid Z\bigr] &= V\Bigl[\sum_{i=1}^N Z_iD_i\frac{N}{n^{(1)}}\frac{n^{(1)}}{n^{(2)}}y_i\,\Big|\,Z\Bigr]\\
&= \Bigl(\frac{N}{n^{(2)}}\Bigr)^2E\Bigl[\sum_{i=1}^N\sum_{k=1}^N Z_iZ_kD_iD_ky_iy_k\,\Big|\,Z\Bigr] - \bigl[\hat{t}_y^{(1)}\bigr]^2\\
&= \Bigl(\frac{N}{n^{(2)}}\Bigr)^2\sum_{i=1}^N Z_iy_i^2\,\frac{n^{(2)}}{n^{(1)}} + \Bigl(\frac{N}{n^{(2)}}\Bigr)^2\sum_{i=1}^N\sum_{j=1,j\neq i}^N Z_iZ_jy_iy_j\,\frac{n^{(2)}(n^{(2)}-1)}{n^{(1)}(n^{(1)}-1)} - \bigl[\hat{t}_y^{(1)}\bigr]^2.
\end{aligned}$$
$$\begin{aligned}
E\bigl(V\bigl[\hat{t}_y^{(2)}\mid Z\bigr]\bigr) &= \Bigl(\frac{N}{n^{(2)}}\Bigr)^2\sum_{i=1}^N\frac{n^{(1)}}{N}\,y_i^2\,\frac{n^{(2)}}{n^{(1)}} + \Bigl(\frac{N}{n^{(2)}}\Bigr)^2\sum_{i=1}^N\sum_{j=1,j\neq i}^N y_iy_j\,\frac{n^{(1)}(n^{(1)}-1)}{N(N-1)}\,\frac{n^{(2)}(n^{(2)}-1)}{n^{(1)}(n^{(1)}-1)}\\
&\qquad - V\bigl[\hat{t}_y^{(1)}\bigr] - \bigl(E\bigl[\hat{t}_y^{(1)}\bigr]\bigr)^2\\
&= \frac{N}{n^{(2)}}\sum_{i=1}^N y_i^2 + \frac{N(n^{(2)}-1)}{n^{(2)}(N-1)}\sum_{i=1}^N\sum_{j=1,j\neq i}^N y_iy_j - V\bigl[\hat{t}_y^{(1)}\bigr] - \bigl(E\bigl[\hat{t}_y^{(1)}\bigr]\bigr)^2.
\end{aligned}$$
Thus,
$$V\bigl[\hat{t}_y^{(2)}\bigr] = N^2\Bigl(1-\frac{n^{(2)}}{N}\Bigr)\frac{S_y^2}{n^{(2)}}.$$
Now
$$E\Bigl[\frac{s_h^{2(2)}}{m_h}\,\Big|\,Z\Bigr] = \frac{s_h^{2(1)}}{m_h}$$
and
$$\begin{aligned}
E\Bigl[\sum_{h=1}^H\frac{n_h}{n}\bigl(\bar{y}_h^{(2)} - \hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)^2\,\Big|\,Z\Bigr]
&= \sum_{h=1}^H\frac{n_h}{n}E\bigl[(\bar{y}_h^{(2)})^2\mid Z\bigr] - E\bigl[\bigl(\hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)^2\mid Z\bigr]\\
&= \sum_{h=1}^H\frac{n_h}{n}\Bigl[\Bigl(1-\frac{m_h}{n_h}\Bigr)\frac{s_h^{2(1)}}{m_h} + \bigl(\bar{y}_h^{(1)}\bigr)^2\Bigr]
- \sum_{h=1}^H\Bigl(\frac{n_h}{n}\Bigr)^2\Bigl(1-\frac{m_h}{n_h}\Bigr)\frac{s_h^{2(1)}}{m_h} - \Bigl[\sum_{h=1}^H\frac{n_h}{n}\bar{y}_h^{(1)}\Bigr]^2\\
&= \sum_{h=1}^H\frac{n_h}{n}\Bigl(1-\frac{n_h}{n}\Bigr)\Bigl(1-\frac{m_h}{n_h}\Bigr)\frac{s_h^{2(1)}}{m_h} + \frac{1}{n}\Bigl[(n-1)s_y^{2(1)} - \sum_{h=1}^H(n_h-1)s_h^{2(1)}\Bigr];
\end{aligned}$$
(The last equality follows after a lot of algebra.) Since $E[s_y^{2(1)}] = S_y^2$, the unbiasedness is shown.
12.12 (a) Equation (A.9) implies these results.
(b) From the solution to Exercise 12.10,
$$V\bigl(\hat{t}_{yr}^{(2)}\bigr) = V\bigl[\hat{t}_y^{(1)}\bigr] + E\bigl[V\bigl(\hat{t}_d^{(2)}\mid Z\bigr)\bigr],$$
$$V\bigl[\hat{t}_y^{(1)}\bigr] = N^2\Bigl(1-\frac{n^{(1)}}{N}\Bigr)\frac{S_y^2}{n^{(1)}},$$
and
$$\begin{aligned}
E\bigl[V\bigl(\hat{t}_d^{(2)}\mid Z\bigr)\bigr] &= N^2\Bigl(1-\frac{n^{(2)}}{N}\Bigr)\frac{S_d^2}{n^{(2)}} - V\bigl[\hat{t}_d^{(1)}\bigr]\\
&= N^2\Bigl(1-\frac{n^{(2)}}{N}\Bigr)\frac{S_d^2}{n^{(2)}} - N^2\Bigl(1-\frac{n^{(1)}}{N}\Bigr)\frac{S_d^2}{n^{(1)}}\\
&= NS_d^2\Bigl[\frac{N-n^{(2)}}{n^{(2)}} - \frac{N-n^{(1)}}{n^{(1)}}\Bigr]\\
&= N^2\Bigl(1-\frac{n^{(2)}}{n^{(1)}}\Bigr)\frac{S_d^2}{n^{(2)}}.
\end{aligned}$$
(c) Follows because $s_y^2$ and $s_e^2$ estimate $S_y^2$ and $S_d^2$, respectively.
12.14 (a)
$$z_i^{(1)} = \frac{\partial\hat{t}_{yr}^{(2)}}{\partial w_i^{(1)}} = x_i\,\frac{\hat{t}_y^{(2)}}{\hat{t}_x^{(2)}}$$
and
$$\begin{aligned}
z_i^{(2)} &= \frac{\partial\hat{t}_{yr}^{(2)}}{\partial w_i^{(2)}}\\
&= \hat{t}_x^{(1)}\Bigl[\frac{y_i}{\hat{t}_x^{(2)}} - \frac{x_i\hat{t}_y^{(2)}}{\hat{t}_x^{(2)}\hat{t}_x^{(2)}}\Bigr]\\
&= \frac{\hat{t}_x^{(1)}}{\hat{t}_x^{(2)}}\bigl[y_i - x_i\hat{B}^{(2)}\bigr].
\end{aligned}$$
Thus,
$$\hat{V}_{DR}\bigl(\hat{t}_{yr}^{(2)}\bigr) = \hat{V}\Bigl(\sum_{i\in S^{(1)}} w_i^{(1)}z_i^{(1)} + \sum_{i\in S^{(2)}} w_iz_i^{(2)}\Bigr)
= \hat{V}\Bigl(\sum_{i\in S^{(1)}} w_i^{(1)}x_i\hat{B}^{(2)} + \sum_{i\in S^{(2)}} w_i\,\frac{\hat{t}_x^{(1)}}{\hat{t}_x^{(2)}}\bigl[y_i - x_i\hat{B}^{(2)}\bigr]\Bigr).$$
$$E[m_h] = \nu_hE[n_h] = \nu_hE\Bigl[\sum_{i=1}^N Z_ix_{ih}\Bigr] = \nu_h\sum_{i=1}^N\frac{n}{N}x_{ih} = n\nu_hW_h.$$
Thus,
$$E[C] = cn + n\sum_{h=1}^H c_h\nu_hW_h$$
and
$$n = \frac{E[C]}{c + \sum_{h=1}^H c_h\nu_hW_h}.$$
Then
$$V\bigl(\hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr) = S^2\Biggl[\frac{c + \sum_{h=1}^H c_h\nu_hW_h}{E[C]} - \frac{1}{N}\Biggr] + \frac{c + \sum_{h=1}^H c_h\nu_hW_h}{E[C]}\sum_{h=1}^H W_hS_h^2\Bigl(\frac{1}{\nu_h}-1\Bigr)$$
and
$$\frac{\partial V\bigl(\hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)}{\partial\nu_k} = c_kW_k\frac{S^2}{E[C]} + \frac{1}{E[C]}\Biggl[c_kW_k\sum_{h=1}^H W_hS_h^2\Bigl(\frac{1}{\nu_h}-1\Bigr) - \frac{W_kS_k^2}{\nu_k^2}\Bigl(c + \sum_{h=1}^H c_h\nu_hW_h\Bigr)\Biggr]$$
for $k = 1,\dots,H$. Thus
$$\begin{aligned}
0 &= \sum_{k=1}^H c_k\nu_kW_k\Biggl[S^2 + \sum_{h=1}^H W_hS_h^2\Bigl(\frac{1}{\nu_h}-1\Bigr)\Biggr] - \sum_{k=1}^H\frac{W_kS_k^2}{\nu_k}\Bigl(c + \sum_{h=1}^H c_h\nu_hW_h\Bigr)\\
&= \sum_{k=1}^H c_k\nu_kW_k\Bigl[S^2 - \sum_{h=1}^H W_hS_h^2\Bigr] - c\sum_{k=1}^H\frac{W_kS_k^2}{\nu_k}
\end{aligned}$$
and
$$\sum_{h=1}^H\frac{W_hS_h^2}{\nu_h} = \Bigl(S^2 - \sum_{h=1}^H W_hS_h^2\Bigr)\frac{1}{c}\sum_{k=1}^H c_k\nu_kW_k,$$
and, consequently,
$$\nu_k = \sqrt{\frac{cS_k^2}{c_k\Bigl(S^2 - \sum_{h=1}^H W_hS_h^2\Bigr)}}.$$
(c) To meet the expected cost constraint with the optimal allocation, set
$$n = \frac{E[C]}{c + \sum_{h=1}^H c_h\nu_h^*W_h},$$
with
$$\nu_h^* = \sqrt{\frac{cS_h^2}{c_h\Bigl(S^2 - \sum_{j=1}^H W_jS_j^2\Bigr)}}.$$
12.18 Let $A = S_y^2 - \sum_{h=1}^H W_hS_h^2$. Then, from (12.17),
$$\nu_{h,\mathrm{opt}} = \sqrt{\frac{c^{(1)}S_h^2}{c_h\Bigl(S^2 - \sum_{j=1}^H W_jS_j^2\Bigr)}} = \sqrt{\frac{c^{(1)}S_h^2}{c_hA}}$$
and
$$n_{\mathrm{opt}}^{(1)} = \frac{C^*}{c^{(1)} + \sum_{h=1}^H c_hW_h\nu_{h,\mathrm{opt}}} = \frac{C^*}{c^{(1)} + \sum_{h=1}^H W_hS_h\sqrt{c_h}\sqrt{\dfrac{c^{(1)}}{A}}}.$$
Then,
$$\begin{aligned}
V_{\mathrm{opt}}\bigl(\hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr) &= \frac{S_y^2}{n_{\mathrm{opt}}^{(1)}} - \frac{S_y^2}{N} + \frac{1}{n_{\mathrm{opt}}^{(1)}}\sum_{h=1}^H W_hS_h^2\Bigl(\frac{1}{\nu_{h,\mathrm{opt}}}-1\Bigr)\\
&= \frac{S_y^2}{n_{\mathrm{opt}}^{(1)}} - \frac{S_y^2}{N} + \frac{1}{n_{\mathrm{opt}}^{(1)}}\Biggl(\sum_{h=1}^H W_hS_h\sqrt{c_h}\sqrt{\frac{A}{c^{(1)}}} - \sum_{h=1}^H W_hS_h^2\Biggr)\\
&= \frac{S_y^2}{n_{\mathrm{opt}}^{(1)}} - \frac{S_y^2}{N} + \frac{1}{n_{\mathrm{opt}}^{(1)}}\Biggl(\sum_{h=1}^H W_hS_h\sqrt{c_h}\sqrt{\frac{A}{c^{(1)}}} + A - S_y^2\Biggr)\\
&= \frac{1}{n_{\mathrm{opt}}^{(1)}}\Biggl(\sum_{h=1}^H W_hS_h\sqrt{c_h}\sqrt{\frac{A}{c^{(1)}}} + A\Biggr) - \frac{S_y^2}{N}\\
&= \frac{1}{C^*}\Biggl(c^{(1)} + \sum_{h=1}^H W_hS_h\sqrt{c_h}\sqrt{\frac{c^{(1)}}{A}}\Biggr)\Biggl(\sum_{h=1}^H W_hS_h\sqrt{c_h}\sqrt{\frac{A}{c^{(1)}}} + A\Biggr) - \frac{S_y^2}{N}\\
&= \frac{1}{C^*}\Biggl[\sqrt{c^{(1)}A}\sum_{h=1}^H W_hS_h\sqrt{c_h} + \Biggl(\sum_{h=1}^H W_hS_h\sqrt{c_h}\Biggr)^2 + Ac^{(1)} + \sqrt{c^{(1)}}\sqrt{A}\sum_{h=1}^H W_hS_h\sqrt{c_h}\Biggr] - \frac{S_y^2}{N}\\
&= \frac{1}{C^*}\Biggl[\sum_{h=1}^H W_hS_h\sqrt{c_h} + \sqrt{c^{(1)}}\sqrt{S_y^2 - \sum_{h=1}^H W_hS_h^2}\Biggr]^2 - \frac{S_y^2}{N}.
\end{aligned}$$
12.19 The easiest way to solve this optimization problem is to use Lagrange multipliers. Using the variance in (12.10), the function we wish to minimize is
$$g\bigl(n^{(1)}, n^{(2)}, \lambda\bigr) = \Bigl(\frac{1}{n^{(1)}} - \frac{1}{N}\Bigr)S_y^2 + \Bigl(\frac{1}{n^{(2)}} - \frac{1}{n^{(1)}}\Bigr)S_d^2 - \lambda\Bigl[C - c^{(1)}n^{(1)} - c^{(2)}n^{(2)}\Bigr].$$
Setting the partial derivatives with respect to $n^{(1)}$, $n^{(2)}$, and $\lambda$ equal to 0, we have
$$\frac{\partial g}{\partial n^{(1)}} = -\frac{S_y^2}{\bigl[n^{(1)}\bigr]^2} + \frac{S_d^2}{\bigl[n^{(1)}\bigr]^2} + \lambda c^{(1)} = 0,$$
$$\frac{\partial g}{\partial n^{(2)}} = -\frac{S_d^2}{\bigl[n^{(2)}\bigr]^2} + \lambda c^{(2)} = 0,$$
and
$$\frac{\partial g}{\partial\lambda} = -\Bigl[C - c^{(1)}n^{(1)} - c^{(2)}n^{(2)}\Bigr] = 0.$$
Consequently, using the first two equations, we have
$$\bigl[n^{(1)}\bigr]^2 = \frac{S_y^2 - S_d^2}{\lambda c^{(1)}}$$
and
$$\bigl[n^{(2)}\bigr]^2 = \frac{S_d^2}{\lambda c^{(2)}}.$$
Taking the ratio gives
$$\Bigl(\frac{n^{(2)}}{n^{(1)}}\Bigr)^2 = \frac{c^{(1)}S_d^2}{c^{(2)}\bigl(S_y^2 - S_d^2\bigr)},$$
which proves the result.
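The closed-form allocation can be checked against a brute-force search. The variance components and costs below are made-up illustrative values, not from any exercise:

```python
# Grid check of the Exercise 12.19 ratio: minimize the two-phase variance
# over n1 on the cost line c1*n1 + c2*n2 = C, then compare n2/n1 with the
# closed form sqrt(c1*Sd2 / (c2*(Sy2 - Sd2))).
import math

Sy2, Sd2 = 100.0, 25.0     # hypothetical variance components
c1, c2, C = 4.0, 1.0, 1000.0

def variance(n1):
    n2 = (C - c1 * n1) / c2
    return (Sy2 - Sd2) / n1 + Sd2 / n2   # dropping terms free of n1, n2

n1_opt = min((1.0 + 0.05 * k for k in range(int((C / c1 - 1) / 0.05))),
             key=variance)
n2_opt = (C - c1 * n1_opt) / c2
analytic = math.sqrt(c1 * Sd2 / (c2 * (Sy2 - Sd2)))
print(n2_opt / n1_opt, analytic)   # both near 1.155
```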
12.20 (a) These results follow directly from the contingency table. For example,
$$W_1p_1 = \frac{N_1}{N}p_1 = \frac{C_{21}}{N} = \frac{C_{21}}{C_{2+}}\,\frac{C_{2+}}{N} = \Bigl(1 - \frac{C_{22}}{C_{2+}}\Bigr)\frac{C_{2+}}{N} = (1-S_2)p$$
and
$$W_2p_2 = \frac{N_2}{N}p_2 = \frac{C_{22}}{N} = \frac{C_{22}}{C_{2+}}\,\frac{C_{2+}}{N} = S_2p.$$
The other results are shown similarly.
(b) From (12.19),
$$\frac{V_{\mathrm{opt}}\bigl(\hat{p}_{\mathrm{str}}^{(2)}\bigr)}{V_{\mathrm{SRS}}(\hat{p})} \approx \Biggl[\sum_{h=1}^2 W_h\frac{S_h}{S_y} + \sqrt{\frac{c^{(1)}}{c^{(2)}}}\sqrt{\frac{S_y^2 - \sum_{h=1}^2 W_hS_h^2}{S_y^2}}\Biggr]^2.$$
Since $x_i$ and $y_i$ are binary,
$$\sum_{i=1}^N(x_i - \bar{x}_U)^2 = \sum_{i=1}^N(x_i - W_2)^2 = \sum_{i=1}^N x_i - NW_2^2 = NW_1W_2$$
and
$$\sum_{i=1}^N(y_i - \bar{y}_U)^2 = \sum_{i=1}^N(y_i - p)^2 = \sum_{i=1}^N y_i - Np^2 = Np(1-p).$$
Consequently,
$$S_yR = \frac{C_{22} - NpW_2}{N\sqrt{W_1W_2}} = \frac{p(S_2 - W_2)}{\sqrt{W_1W_2}}.$$
$$\begin{aligned}
S_y^2 - \sum_{h=1}^2 W_hS_h^2 &= p(1-p) - W_1p_1(1-p_1) - W_2p_2(1-p_2)\\
&= W_1p_1^2 + W_2p_2^2 - p^2\\
&= \frac{(1-S_2)^2p^2}{W_1} + \frac{S_2^2p^2}{W_2} - p^2\\
&= \frac{p^2}{W_1W_2}\bigl[W_2(1-S_2)^2 + W_1S_2^2 - W_1W_2\bigr]\\
&= \frac{p^2}{W_1W_2}\bigl[W_2^2 - 2W_2S_2 + S_2^2\bigr]\\
&= \frac{p^2}{W_1W_2}\bigl[S_2 - W_2\bigr]^2\\
&= S_y^2R^2.
\end{aligned}$$
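The identity is easy to verify numerically. The values of $p$, $W_2$, and $S_2$ below are arbitrary, chosen only to exercise the algebra:

```python
# Check: Sy^2 - sum_h W_h S_h^2 = p^2 (S2 - W2)^2 / (W1 W2)  (Exercise 12.20).
p, W2, S2 = 0.3, 0.4, 0.7
W1 = 1 - W2
p1 = (1 - S2) * p / W1   # so that W1 * p1 = (1 - S2) * p
p2 = S2 * p / W2         # so that W2 * p2 = S2 * p
lhs = p * (1 - p) - W1 * p1 * (1 - p1) - W2 * p2 * (1 - p2)
rhs = p**2 * (S2 - W2)**2 / (W1 * W2)
print(lhs, rhs)   # equal up to floating-point rounding
```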
Then,
$$\begin{aligned}
P(F_i = 1,\ i\in S)
&= \frac{\dbinom{m_1}{m_1}\dbinom{N_1-m_1}{n_1-m_1}\dbinom{m_2}{m_2}\dbinom{N_2-m_2}{n_2-m_2}\cdots\dbinom{m_H}{m_H}\dbinom{N_H-m_H}{n_H-m_H}}{\dbinom{N_1}{n_1}\dbinom{N_2}{n_2}\cdots\dbinom{N_H}{n_H}}\\[4pt]
&= \frac{\text{total number of stratified samples containing } S}{\text{number of possible stratified samples}}.
\end{aligned}$$
13.1 Students may answer this in several different ways. The maximum likelihood estimate is
$$\hat{N} = \frac{n_1n_2}{m} = \frac{(500)(300)}{120} = 1250,$$
with 95% CI (using the likelihood ratio method) of [1116, 1422]. A bootstrap CI is
[1103, 1456].
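The Lincoln-Petersen point estimate is a one-line computation:

```python
# Lincoln-Petersen estimate for Exercise 13.1: n1 marked, n2 recaptured,
# m marked among the recaptures.
n1, n2, m = 500, 300, 120
N_hat = n1 * n2 / m
print(N_hat)   # 1250.0
```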
238 CHAPTER 13. ESTIMATING POPULATION SIZE
13.4 (a) We treat the radio transmitter bears and feces sample bears as the two
samples to obtain $\hat{N} = 483.8$ with 95% CI [413.7, 599.0].
13.5 The model with all two-factor interactions has $G^2 = 3.3$, with 4 df. Comparing
to a $\chi^2_4$ distribution gives a p-value of 0.502. No simpler model appears to fit the data.
Using this model and the function captureci in R, we estimate 3645 persons in the
missing cell, with approximate 95% confidence interval [2804, 4725].
13.6 (a) $\hat{N} = 336$, with 95% CI [288, 408]. $\tilde{N} = 333$ with 95% CI [273, 428]. The linearization-based standard errors are
$$\mathrm{SE}(\hat{N}) = \sqrt{\frac{n_1^2n_2(n_2-m)}{m^3}} = 37$$
and
$$\mathrm{SE}(\tilde{N}) = \sqrt{\frac{(n_1+1)(n_2+1)(n_1-m)(n_2-m)}{(m+1)^2(m+2)}} = 29.$$
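Both standard-error formulas are easy to wrap as functions. The inputs below are the Example 13.1 counts ($n_1 = 200$, $n_2 = 100$, $m = 20$), used here only to exercise the formulas, not the Exercise 13.6 data:

```python
# Linearization SEs for the MLE and Chapman capture-recapture estimators.
import math

def se_mle(n1, n2, m):
    return math.sqrt(n1**2 * n2 * (n2 - m) / m**3)

def se_chapman(n1, n2, m):
    return math.sqrt((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)
                     / ((m + 1)**2 * (m + 2)))

print(se_mle(200, 100, 20))       # 200.0
print(se_chapman(200, 100, 20))   # about 173.6
```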
(b) The following SAS code may be used to obtain estimates for the models.
data hep;
input elist dlist tlist count;
datalines;
0 0 1 63
0 1 0 55
0 1 1 18
1 0 0 69
1 0 1 17
1 1 0 21
1 1 1 28
;
13.7 (a) The assumption of independence of the two sources is probably met, at
least approximately. The registry is from state and local health departments, while
BDMP is from hospital data. Presumably, the health departments do not use hospi-
tal newborn discharge information when compiling their statistics. However, there
might be a problem if congenital rubella syndrome is misclassified in both data sets,
for instance, if both sources tend to miss cases.
We do not know how easily records were matched, but the paper said matching was
not a problem.
The assumption of simple random sampling is probably not met. The BDMP was
from a sample of hospitals, giving a cluster sample of records. In addition, selection
of the hospitals for the BDMP was not random—hospitals were self-selected. It is
unclear how much the absence of simple random sampling in this source affects the
results.
(b)
Year Ñ
1970 244.33
1971 95
1972 48
1973 79.5
1974 44.5
1975 114
1976 41.67
1977 30.5
1978 62.33
1979 159
1980 31.5
1981 4
1982 35
1983 3
1984 3
1985 1
The sum of these estimates is 996.3333.
(c) Using the aggregated data, the total number of cases of congenital rubella syn-
drome between 1970 and 1985 is estimated to be
[Figure: plot of log(N) versus year.]
13.8
  Model          G²     df   p-value
  Independence   11.1    3   0.011
  1*2             2.8    2   0.250
  1*3            10.7    2   0.005
  2*3             9.4    2   0.009
The model with interaction between sample 1 and sample 2 appears to fit well.
Using that model, we estimate N̂ = 2378 with approximate 95% confidence interval
[2142, 2664].
13.9
A positive interaction between presence in sample 1 and presence in sample 2 (as
there is) suggests that some fish are “trap-happy”—they are susceptible to repeated
trapping. An interaction between presence in sample 1 and presence in sample 3
might mean that the fin clipping makes it easier or harder to catch the fish with the
net.
13.10 (a) The maximum likelihood estimate is N̂ = 73.1 and Chapman’s estimate
is Ñ = 70.6. A 95% confidence interval for N , using N̂ and the function captureci, is
[55.4, 124.1]. Another approximate 95% confidence interval for N , using Chapman’s
estimate and the bootstrap, is [52.7, 127.8].
13.12 For the data in Example 13.1, $\hat{p} = \dfrac{20}{100} = \dfrac{1}{5}$ and a 95% confidence interval for $p$ is
$$0.2 \pm 1.96\sqrt{\frac{(0.2)(0.8)}{100}} = [0.12, 0.28].$$
The confidence limits $L(\hat{p})$ and $U(\hat{p})$ satisfy $P[L(\hat{p}) \le p \le U(\hat{p})] \approx 0.95$.
Thus, a 95% confidence interval for $N$ is $[n_1/U(\hat{p}),\ n_1/L(\hat{p})]$; for these data, the
interval is [718, 1645]. The interval is comparable to those from the inverted chi-square
tests and bootstrap; like them, it is not symmetric.
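The inverted interval can be reproduced directly from the Example 13.1 counts:

```python
# Exercise 13.12: invert the CI for p = m/n2 to get a CI for N = n1/p.
import math

n1, n2, p_hat = 200, 100, 0.2
half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n2)
L, U = p_hat - half, p_hat + half        # [0.1216, 0.2784]
N_lo, N_hi = n1 / U, n1 / L
print(round(N_lo), round(N_hi))   # 718 1645
```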
13.13 Note that
$$\begin{aligned}
L(N-1\mid n_1, n_2) &= \frac{\dbinom{n_1}{m}\dbinom{N-1-n_1}{n_2-m}}{\dbinom{N-1}{n_2}}\\[4pt]
&= \frac{\dbinom{n_1}{m}\dbinom{N-n_1}{n_2-m}}{\dbinom{N}{n_2}}\,\frac{N-n_1-(n_2-m)}{N-n_1}\,\frac{N}{N-n_2}\\[4pt]
&= L(N\mid n_1, n_2)\,\frac{N-n_1-n_2+m}{N-n_1}\,\frac{N}{N-n_2}.
\end{aligned}$$
Thus, if $N > n_1$ and $N > n_2$,
$$L(N) \ge L(N-1) \quad\text{iff}\quad \frac{N-n_1-n_2+m}{N-n_1}\,\frac{N}{N-n_2} \le 1 \quad\text{iff}\quad mN \le n_1n_2,$$
so the likelihood is maximized at
$$\hat{N} = \frac{n_1n_2}{m}.$$
Note that the second derivative is
$$\frac{d^2\log L(N)}{dN^2} = \frac{m}{N^2} - \frac{(n_2-m)n_1(2N-n_1)}{N^2(N-n_1)^2} = \frac{mN^2 - 2Nn_1n_2 + n_1^2n_2}{N^2(N-n_1)^2}.$$
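The likelihood-ratio argument can be confirmed numerically with exact binomial coefficients; the counts below are those of Example 13.1:

```python
# Check that the hypergeometric likelihood of Exercise 13.13 peaks at
# N_hat = n1*n2/m.
from math import comb

def L(N, n1, n2, m):
    return comb(n1, m) * comb(N - n1, n2 - m) / comb(N, n2)

n1, n2, m = 200, 100, 20
N_hat = n1 * n2 // m   # 1000
assert L(N_hat, n1, n2, m) >= L(N_hat - 1, n1, n2, m)
assert L(N_hat, n1, n2, m) >= L(N_hat + 1, n1, n2, m)
print(N_hat)   # 1000
```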
$$n_1^2 - 2Nn_1 + NC = 0,$$
or
$$n_1 = \frac{2N \pm \sqrt{4N^2 - 4NC}}{2}.$$
Since $n_1 \le N$, we take
$$n_1 = N - \sqrt{N(N-C)}$$
and
$$n_2 = C - N + \sqrt{N(N-C)}.$$
Consequently,
$$V(\hat{\bar{y}}_d) = A^2V(\bar{y}_1) + (1-A)^2V(\bar{y}_2) \approx \frac{A^2S_1^2}{n_1p_1} + \frac{(1-A)^2S_2^2}{n_2p_2}.$$
(b) With these assumptions, we have $A = \dfrac{N_1p_1}{Np}$, $1-A = \dfrac{N_2p_2}{Np}$, $n_1p_1 = kf_2p_1N_1$, $n_2p_2 = f_2p_2N_2$, and
$$\begin{aligned}
V(\hat{\bar{y}}_d) &\approx \frac{S_1^2}{f_2}\Bigl(\frac{A^2}{kN_1p_1} + \frac{(1-A)^2}{N_2p_2}\Bigr)\\
&= S_1^2\Biggl[\Bigl(\frac{N_1p_1}{Np}\Bigr)^2\frac{1}{kf_2N_1p_1} + \Bigl(\frac{N_2p_2}{Np}\Bigr)^2\frac{1}{f_2N_2p_2}\Biggr]\\
&= \frac{S_1^2}{(Np)^2f_2}\Bigl(\frac{N_1p_1}{k} + N_2p_2\Bigr).
\end{aligned}$$
The constraint on the sample size is $n = n_1 + n_2 = kf_2N_1 + f_2N_2$; solving for $f_2$, we have
$$f_2 = \frac{n}{N_1k + N_2}.$$
248 CHAPTER 14. RARE POPULATIONS AND SMALL AREA ESTIMATION
$$\begin{aligned}
\frac{d}{d\theta}V\bigl(\hat{t}_{y,\theta}\bigr) &= \frac{d}{d\theta}\Bigl\{V\bigl[\hat{t}_a^A\bigr] + \theta^2V\bigl[\hat{t}_{ab}^A\bigr] + 2\theta\,\mathrm{Cov}\bigl[\hat{t}_a^A, \hat{t}_{ab}^A\bigr] + (1-\theta)^2V\bigl[\hat{t}_{ab}^B\bigr] + V\bigl[\hat{t}_b^B\bigr] + 2(1-\theta)\,\mathrm{Cov}\bigl[\hat{t}_b^B, \hat{t}_{ab}^B\bigr]\Bigr\}\\
&= 2\theta V\bigl[\hat{t}_{ab}^A\bigr] + 2\,\mathrm{Cov}\bigl[\hat{t}_a^A, \hat{t}_{ab}^A\bigr] - 2(1-\theta)V\bigl[\hat{t}_{ab}^B\bigr] - 2\,\mathrm{Cov}\bigl[\hat{t}_b^B, \hat{t}_{ab}^B\bigr].
\end{aligned}$$
Setting the derivative equal to 0 and solving gives the optimal value of $\theta$.
14.7 (a) We write $\tilde\theta_d(a) - \theta_d = a(v_d + e_d) - v_d$. Then
$$E[\tilde\theta_d(a) - \theta_d] = E[a(v_d + e_d) - v_d] = 0.$$
(b)
$$V[\tilde\theta_d(a) - \theta_d] = E\bigl\{[a(v_d + e_d) - v_d]^2\bigr\} = (a-1)^2\sigma_v^2 + a^2\psi_d.$$
Setting the derivative with respect to $a$ equal to 0 and solving for $a$ gives $a = \sigma_v^2/(\sigma_v^2 + \psi_d) = \alpha_d$. The minimum variance achieved is $\sigma_v^2\psi_d/(\sigma_v^2 + \psi_d) = \alpha_d\psi_d$.
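A grid search confirms the shrinkage factor; $\sigma_v^2 = 4$ and $\psi_d = 1$ are arbitrary illustrative values:

```python
# Exercise 14.7(b): the MSE (a-1)^2*sigma_v^2 + a^2*psi is minimized at
# a = sigma_v^2 / (sigma_v^2 + psi).
sigma_v2, psi = 4.0, 1.0

def mse(a):
    return (a - 1)**2 * sigma_v2 + a**2 * psi

alpha = sigma_v2 / (sigma_v2 + psi)                 # 0.8
a_grid = min((k / 1000 for k in range(1001)), key=mse)
print(alpha, a_grid)   # 0.8 0.8
```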
14.8 Here is SAS code for constructing the population and samples:
data domainpop;
do strat = 1 to 20;
do psu = 1 to 4;
do j = 1 to 3;
y = strat;
dom = 1;
output;
end; end;
do psu = 5 to 8;
do j = 1 to 3;
y = strat;
dom = 2;
output;
end; end;
end;
data psuid;
do strat = 1 to 20;
do psu = 1 to 8;
output;
end; end;
data samp1 ;
merge psusamp1 (in=Insample) domainpop ;
/* When a data set contributes an observation for
the current BY group, the IN= value is 1. */
by strat psu;
if Insample ; /*delete obsns not in sample */
run;
data samp1d1;
set samp1;
if dom = 1;
weight SamplingWeight;
run;
data samp1d2;
set samp1;
if dom = 2;
Chapter 15
Survey Quality
15.2 This is a stratified sample, so we use formulas from stratified sampling to find $\hat\phi$ and $\hat{V}(\hat\phi)$.

[Table of stratum-level calculations of $N_h$, $n_h$, number of yes responses, $\hat\phi_h$, $\frac{N_h}{N}\hat\phi_h$, and $\frac{N_h-n_h}{N_h}\frac{N_h^2s_h^2}{N^2n_h}$ omitted.]

Thus $\hat\phi = 0.1577$ with $\hat{V}(\hat\phi) = 1.05\times10^{-4}$. The probability $P$ that a person is asked the sensitive question is the probability that a red ball is drawn from the box, 30/50. Also,
$$p_I = P(\text{white ball drawn}\mid\text{red ball not drawn}) = 4/20.$$
Thus, using (12.10),
$$\hat{p}_S = \frac{\hat\phi - (1-P)p_I}{P} = \frac{0.1577 - (1-0.6)(0.2)}{0.6} = 0.130$$
and
$$\hat{V}(\hat{p}_S) = \frac{1.05\times10^{-4}}{(0.6)^2} = 2.91\times10^{-4},$$
so the standard error is 0.017.
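The randomized-response back-calculation is short enough to script:

```python
# Exercise 15.2: recover the sensitive proportion from the observed
# proportion of "yes" answers under the randomized-response design.
phi_hat = 0.1577          # estimated proportion answering "yes"
V_phi = 1.05e-4           # its estimated variance
P = 30 / 50               # P(red ball) = P(sensitive question asked)
p_I = 4 / 20              # P("yes" | innocuous question)

p_S = (phi_hat - (1 - P) * p_I) / P
V_pS = V_phi / P**2
print(round(p_S, 4), round(V_pS**0.5, 3))   # 0.1295 0.017
```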
15.3 (a)
$$P(\text{“1”}) = P(\text{“1”}\mid\text{sensitive})p_S + P(\text{“1”}\mid\text{not sensitive})(1-p_S) = \theta_1p_S + \theta_2(1-p_S).$$
A.1
$$P(\text{match exactly 3 numbers}) = \frac{\dbinom{5}{3}\dbinom{30}{2}}{\dbinom{35}{5}} = \frac{(10)(435)}{324{,}632} = \frac{4350}{324{,}632}$$
$$\begin{aligned}
P(\text{match at least 1 number}) &= 1 - P(\text{match no numbers})\\
&= 1 - \frac{\dbinom{5}{0}\dbinom{30}{5}}{\dbinom{35}{5}}\\
&= 1 - \frac{142{,}506}{324{,}632} = \frac{182{,}126}{324{,}632}.
\end{aligned}$$
A.2
$$P(\text{no 7s}) = \frac{\dbinom{3}{0}\dbinom{5}{4}}{\dbinom{8}{4}} = \frac{5}{70}$$
$$P(\text{exactly one 7}) = \frac{\dbinom{3}{1}\dbinom{5}{3}}{\dbinom{8}{4}} = \frac{30}{70}$$
$$P(\text{exactly two 7s}) = \frac{\dbinom{3}{2}\dbinom{5}{2}}{\dbinom{8}{4}} = \frac{30}{70}.$$
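The binomial-coefficient counts in A.1 and A.2 can be verified with the standard library:

```python
# Verify the counts quoted in A.1 and A.2 using math.comb.
from math import comb

assert comb(5, 3) * comb(30, 2) == 4350    # A.1: match exactly 3 numbers
assert comb(35, 5) == 324632               # total 5-number combinations
assert comb(30, 5) == 142506               # combinations matching no numbers
assert comb(3, 0) * comb(5, 4) == 5        # A.2: no 7s
assert comb(3, 1) * comb(5, 3) == 30       # exactly one 7
assert comb(8, 4) == 70                    # total hands of 4
print("all counts check out")
```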
Property 4:
A.5
$$\mathrm{Corr}[\bar{x}, \bar{y}] = \frac{\mathrm{Cov}[\bar{x}, \bar{y}]}{\sqrt{V[\bar{x}]V[\bar{y}]}}
= \frac{\frac{1}{n}\bigl(1-\frac{n}{N}\bigr)RS_xS_y}{\sqrt{\bigl[\frac{1}{n}\bigl(1-\frac{n}{N}\bigr)S_x^2\bigr]\bigl[\frac{1}{n}\bigl(1-\frac{n}{N}\bigr)S_y^2\bigr]}}
= R.$$