
Chapter 1

Introduction

1.1 Target population: Unclear, but presumed to be readers of Parade magazine.


Sampling frame: Persons who know about the telephone survey.
Sampling unit = observation unit: One call. (Although it would also be correct to
consider the sampling unit to be a person. The survey is so badly done that it is
difficult to tell what the units are.)
As noted in Section 1.3, samples that consist only of volunteers are suspect. This is
especially true of surveys in which respondents must pay to participate, as here—
persons willing to pay 75 cents a call are likely to have strong opinions about the
legalization of marijuana, and it is impossible to say whether pro- or anti-legalization
adherents are more likely to call. This survey is utterly worthless for measuring
public opinion because of its call-in format. Other potential biases, such as requiring
a touch-tone telephone, or the sensitive subject matter or the ambiguity of the
wording (what does “as legal as alcoholic beverages” mean?) probably make little
difference because the call-in structure destroys all credibility for the survey by itself.
1.2 Target population: All mutual funds.
Sampling frame: Mutual funds listed in newspaper.
Sampling unit = observation unit: One listing.
As funds are listed alphabetically by company, there is no reason to believe there
will be any selection bias from the sampling frame. There may be undercoverage,
however, if smaller or new funds are not listed in the newspaper.
1.3 Target population: Not specified, but a target population of interest would be
persons who have read the book.
Sampling frame: Persons who visit the website
Sampling unit = observation unit: One review.


The reviews are contributed by volunteers. They cannot be taken as representative
of readers’ opinions. Indeed, there have been instances where authors of competing
books have written negative reviews of a book, although amazon.com tries to curb
such practices.
1.4 Target population: Persons eligible for jury duty in Maricopa County.
Sampling frame: County residents who are registered voters or licensed drivers over
18.
Sampling unit = observation unit: One resident.
Selection bias occurs largely because of undercoverage and nonresponse. Eligible
jurors may not appear in the sampling frame because they are not registered to vote
and they do not possess an Arizona driver’s license. Addresses on either list may not
be up to date. In addition, jurors fail to appear or are excused; this is nonresponse.
A similar question for class discussion is whether there was selection bias in selecting
which young men in the U.S. were to be drafted and sent to Vietnam.
1.5 Target population: All homeless persons in study area.
Sampling frame: Clinics participating in the Health Care for the Homeless project.
Sampling unit: Unclear. Depending on assumptions made about the survey design,
one could say either a clinic or a homeless person is the sampling unit.
Observation unit: Person.
Selection bias may be a serious problem for this survey. Even though the demo-
graphics for HCH patients are claimed to match those of the homeless population
(but do we know they match?) and the clinics are readily accessible, the patients
differ in two critical ways from non-patients: (1) they needed medical treatment,
and (2) they went to a clinic to get medical treatment. One does not know the
likely direction of selection bias, but there is no reason to believe that the same
percentages of patients and non-patients are mentally ill.
1.6 Target population: Female readers of Prevention magazine.
Sampling frame: Women who see the survey in a copy of the magazine.
Sampling unit = observation unit: One woman.
This is a mail-in survey of volunteers, and we cannot trust any statistics from it.
1.7 Target population: All cows in region.
Sampling frame: List of all farms in region.
Sampling unit: One farm.
Observation unit: One cow.
There is no reason to anticipate selection bias in this survey. The design is a single-stage cluster sample, discussed in Chapter 5.


1.8 Target population: Licensed boarding homes for the elderly in Washington
state.
Sampling frame: List of 184 licensed homes.
Sampling unit = observation unit: One home.
Nonresponse is the obvious problem here, with only 43 of 184 administrators or food
service managers responding. It may be that the respondents are the larger homes,
or that their menus have better nutrition. The problem with nonresponse, though,
is that we can only conjecture the direction of the nonresponse bias.
1.13 Target population: All attendees of the 2005 JSM.
Sampling population: E-mail addresses provided by the attendees of the 2005 JSM.
Sampling unit: One e-mail address.
It is stated that the small sample of conference registrants was selected randomly.
This is good, since the ASA can control the quality better and follow up on non-
respondents. It also means, since the sample is selected, that persons with strong
opinions cannot flood the survey. But nonresponse is a potential problem—response
is not mandatory and it might be feared that only attendees with strong opinions
or a strong sense of loyalty to the ASA will respond to the survey.
1.14 Target population: All professors of education
Sampling population: List of education professors
Sampling unit: One professor
Information about how the sample was selected was not given in the publication,
but let’s assume it was a random sample. Obviously, nonresponse is a huge problem
with this survey. Of the 5324 professors selected to be in the sample, only 900 were
interviewed. Professors who travel during summer could of course not be contacted;
also, summer is the worst time of year to try to interview professors for a survey.
1.15 Target population: All adults
Sampling population: Friends and relatives of American Cancer Society volunteers
Sampling unit: One person
Here’s what I wrote about the survey elsewhere:
“Although the sample contained Americans of diverse ages and backgrounds, and
the sample may have provided valuable information for exploring factors associated
with development of cancer, its validity for investigating the relationship between
amount of sleep and mortality is questionable. The questions about amount of
sleep and insomnia were not the focus of the original study, and the survey was not
designed to obtain accurate responses to those questions. The design did not allow
researchers to assess whether the sample was representative of the target population
of all Americans. Because of the shortcomings in the survey design, it is impossible
to know whether the conclusions in Kripke et al. (2002) about sleep and mortality
are valid or not.” (pp. 97–98)
Lohr, S. (2008). "Coverage and sampling." Chapter 6 of International Handbook of
Survey Methodology, ed. E. de Leeuw, J. Hox, and D. Dillman. New York: Erlbaum,
97–112.
1.25 Students will have many different opinions on this issue. Of historical interest
is this excerpt of a letter written by James Madison to Thomas Jefferson on February
14, 1790:

A Bill for taking a census has passed the House of Representatives, and is
with the Senate. It contained a schedule for ascertaining the component
classes of the Society, a kind of information extremely requisite to the
Legislator, and much wanted for the science of Political Economy. A
repetition of it every ten years would hereafter afford a most curious
and instructive assemblage of facts. It was thrown out by the Senate
as a waste of trouble and supplying materials for idle people to make a
book. Judge by this little experiment of the reception likely to be given
to so great an idea as that explained in your letter of September.
Chapter 2

Simple Probability Samples

2.1 (a) ȳU = (98 + 102 + 154 + 133 + 190 + 175)/6 = 142.
(b) For each plan, we first find the sampling distribution of ȳ.
Plan 1:
Sample number P (S) ȳS
1 1/8 147.33
2 1/8 142.33
3 1/8 140.33
4 1/8 135.33
5 1/8 148.67
6 1/8 143.67
7 1/8 141.67
8 1/8 136.67
(i) E[ȳ] = (1/8)(147.33) + (1/8)(142.33) + · · · + (1/8)(136.67) = 142.
(ii) V[ȳ] = (1/8)(147.33 − 142)² + (1/8)(142.33 − 142)² + · · · + (1/8)(136.67 − 142)² = 18.94.
(iii) Bias[ȳ] = E[ȳ] − ȳU = 142 − 142 = 0.
(iv) Since Bias[ȳ] = 0, MSE[ȳ] = V[ȳ] = 18.94.
Plan 2:
Sample number P (S) ȳS
1 1/4 135.33
2 1/2 143.67
3 1/4 147.33
(i) E[ȳ] = (1/4)(135.33) + (1/2)(143.67) + (1/4)(147.33) = 142.5.


(ii) V[ȳ] = (1/4)(135.33 − 142.5)² + (1/2)(143.67 − 142.5)² + (1/4)(147.33 − 142.5)²
= 12.84 + 0.68 + 5.84
= 19.36.

(iii) Bias[ȳ] = E[ȳ] − ȳU = 142.5 − 142 = 0.5.


(iv) MSE[ȳ] = V[ȳ] + (Bias[ȳ])² = 19.61.
(c) Clearly, Plan 1 is better. It has smaller variance and is unbiased as well.
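As a numerical check on parts (a)–(c), the moments of both sampling distributions can be recomputed from the tabulated values; the repeating decimals are exact thirds, so exact fractions are used in this Python sketch:

```python
from fractions import Fraction as F

# Sample means transcribed from the Plan 1 and Plan 2 tables (147.33 = 442/3, etc.)
plan1 = [(F(1, 8), F(m, 3)) for m in (442, 427, 421, 406, 446, 431, 425, 410)]
plan2 = [(F(1, 4), F(406, 3)), (F(1, 2), F(431, 3)), (F(1, 4), F(442, 3))]
ybarU = F(98 + 102 + 154 + 133 + 190 + 175, 6)  # population mean = 142

def moments(dist):
    e = sum(p * m for p, m in dist)             # E[ybar]
    v = sum(p * (m - e) ** 2 for p, m in dist)  # V[ybar]
    mse = v + (e - ybarU) ** 2                  # MSE = V + Bias^2
    return e, v, mse

e1, v1, mse1 = moments(plan1)
e2, v2, mse2 = moments(plan2)
print(float(e1), float(v1), float(mse1))  # Plan 1: 142, 18.94, 18.94
print(float(e2), float(v2), float(mse2))  # Plan 2: 142.5, 19.36, 19.61
```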
1 1 1
2.2 (a) Unit 1 appears in samples 1 and 3, so º1 = P (S1 ) + P (S3 ) = + = .
8 8 4
Similarly,
1 3 5
º2 = + =
4 8 8
1 1 3
º3 = + =
8 4 8
1 3 1 5
º4 = + + =
8 8 8 8
1 1 1
º5 = + =
8 8 4
1 1 3 5
º6 = + + =
8 8 8 8
1 1 3
º7 = + =
4 8 8
1 1 3 1 7
º8 = + + + = .
4 8 8 8 8
P8
Note that i=1 ºi = 4 = n.
(b)
Sample, S P (S) t̂
{1, 3, 5, 6} 1/8 38
{2, 3, 7, 8} 1/4 42
{1, 4, 6, 8} 1/8 40
{2, 4, 6, 8} 3/8 42
{4, 5, 7, 8} 1/8 52

Thus the sampling distribution of t̂ is:


k P (t̂ = k)
38 1/8
40 1/8
42 5/8
52 1/8
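The inclusion probabilities in part (a) follow mechanically from the table of samples in part (b); a short Python sketch that recomputes them:

```python
from fractions import Fraction as F

# Samples and selection probabilities from the table in part (b)
samples = {
    (1, 3, 5, 6): F(1, 8),
    (2, 3, 7, 8): F(1, 4),
    (1, 4, 6, 8): F(1, 8),
    (2, 4, 6, 8): F(3, 8),
    (4, 5, 7, 8): F(1, 8),
}
# pi_i = sum of P(S) over the samples S that contain unit i
pi = {i: sum(p for s, p in samples.items() if i in s) for i in range(1, 9)}
print(pi[8], sum(pi.values()))  # 7/8 and 4
```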
7

2.3 No, because thick books have a higher inclusion probability than thin books.
2.4 (a) A total of (8 choose 3) = 56 samples are possible, each with probability of
selection 1/56. The R function samplist below will (inefficiently!) generate each of
the 56 samples. To find the sampling distribution of ȳ, I used the commands

samplist <- function(popn,sampsize){


popvals <- 1:length(popn)
temp <- comblist(popvals,sampsize)
matrix(popn[t(temp)],nrow=nrow(temp),byrow=T)
}

comblist <- function(popvals, sampsize)


{
popsize <- length(popvals)
if(sampsize > popsize)
stop("sample size cannot exceed population size")
nvals <- popsize - sampsize + 1
nrows <- prod((popsize - sampsize + 1):popsize)/prod(1:sampsize)
ncols <- sampsize
yy <- matrix(nrow = nrows, ncol = ncols)
if(sampsize == 1) {yy <- popvals}
else {
nvals <- popsize - sampsize + 1
nrows <- prod(nvals:popsize)/prod(1:sampsize)
ncols <- sampsize
yy <- matrix(nrow = nrows, ncol = ncols)
rep1 <- rep(1, nvals)
if(nvals > 1) {
for(i in 2:nvals)
rep1[i] <- (rep1[i - 1] * (sampsize + i - 2))/(i - 1)
}
rep1 <- rev(rep1)
yy[, 1] <- rep(popvals[1:nvals], rep1)
for(i in 1:nvals) {
yy[yy[, 1] == popvals[i], 2:ncols] <- Recall(
popvals[(i + 1):popsize], sampsize - 1)
}
}
yy
}
temp1 <-samplist(c(1,2,4,4,7,7,7,8),3)
temp2 <-apply(temp1, 1, mean)
table(temp2)

The following, then, is the sampling distribution of ȳ.


k       P(ȳ = k)
2 1/3   2/56
3       1/56
3 1/3   4/56
3 2/3   1/56
4       6/56
4 1/3   8/56
4 2/3   2/56
5       6/56
5 1/3   7/56
5 2/3   3/56
6       6/56
6 1/3   6/56
7       1/56
7 1/3   3/56

Using the sampling distribution,

E[ȳ] = (2/56)(2 1/3) + · · · + (3/56)(7 1/3) = 5.

The variance of ȳ for an SRS without replacement of size 3 is

V[ȳ] = (2/56)(2 1/3 − 5)² + · · · + (3/56)(7 1/3 − 5)² = 1.429.

Of course, this variance could have been more easily calculated using the formula in (2.7):

V[ȳ] = (1 − n/N)(S²/n) = (1 − 3/8)(6.8571429/3) = 1.429.
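The same sampling distribution can be obtained without the comblist machinery by enumerating all 56 samples directly; a Python equivalent of the R commands above:

```python
from itertools import combinations
from fractions import Fraction as F

popn = (1, 2, 4, 4, 7, 7, 7, 8)
# All 56 without-replacement samples of size 3 and their means
means = [F(sum(s), 3) for s in combinations(popn, 3)]
e = sum(means, F(0)) / 56
v = sum((m - e) ** 2 for m in means) / 56
print(e, float(v))  # mean 5, variance 10/7 = 1.4286
```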

(b) A total of 8³ = 512 samples are possible when sampling with replacement.
Fortunately, we need not list all of these to find the sampling distribution of ȳ. Let
Xi be the value of the ith unit drawn. Since sampling is done with replacement,
X1 , X2 , and X3 are independent; Xi (i = 1, 2, 3) has distribution
k P (Xi = k)
1 1/8
2 1/8
4 2/8
7 3/8
8 1/8
Using the independence, then, we have the following probability distribution for
X̄, which serves as the sampling distribution of ȳ.

k       P(ȳ = k)     k       P(ȳ = k)
1       1/512        4 2/3   12/512
1 1/3   3/512        5       63/512
1 2/3   3/512        5 1/3   57/512
2       7/512        5 2/3   21/512
2 1/3   12/512       6       57/512
2 2/3   6/512        6 1/3   36/512
3       21/512       6 2/3   6/512
3 1/3   33/512       7       27/512
3 2/3   15/512       7 1/3   27/512
4       47/512       7 2/3   9/512
4 1/3   48/512       8       1/512

The with-replacement variance of ȳ is

Vwr[ȳ] = (1/512)(1 − 5)² + · · · + (1/512)(8 − 5)² = 2.

Or, using the formula with population variance (see Exercise 2.28),

Vwr[ȳ] = (1/n) Σ_{i=1}^N (yi − ȳU)²/N = 6/3 = 2.

2.5 (a) The sampling weight is 100/30 = 3.3333.

(b) t̂ = Σ_{i∈S} wi yi = 823.33.

(c) V̂(t̂) = N²(1 − n/N)(s²y/n) = 100²(1 − 30/100)(15.9781609/30) = 3728.238, so

SE(t̂) = √3728.238 = 61.0593

and a 95% CI for t is

823.33 ± (2.045230)(61.0593) = 823.33 ± 124.8803 = [698.45, 948.21].

The fpc is (1 − 30/100) = .7, so it reduces the width of the CI.
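The arithmetic in (b)–(c), collected in one short Python sketch (the multiplier 2.045230 is the t critical value with n − 1 = 29 df):

```python
import math

N, n = 100, 30
that, s2 = 823.33, 15.9781609
vhat = N**2 * (1 - n / N) * s2 / n  # includes the fpc of 0.7
se = math.sqrt(vhat)
t29 = 2.045230                      # t critical value, 29 df
ci = (that - t29 * se, that + t29 * se)
print(round(se, 4), [round(v, 2) for v in ci])  # 61.0593 [698.45, 948.21]
```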


2.6 (a)

The data are quite skewed because 28 faculty have no publications.


(b) ȳ = 1.78; s = 2.682;

SE[ȳ] = (2.682/√50) √(1 − 50/807) = 0.367.

(c) No; a sample of size 50 is probably not large enough for ȳ to be normally
distributed, because of the skewness of the original data.
The sample skewness of the data is (from SAS) 1.593. This can be calculated by
hand, finding

(1/n) Σ_{i∈S} (yi − ȳ)³ = 28.9247040,

so that the skewness is 28.9247040/(2.682³) = 1.499314. Note this estimate differs
from SAS PROC UNIVARIATE since SAS adjusts for df, using the formula

skewness = [n/((n − 1)(n − 2))] Σ_{i∈S} (yi − ȳ)³/s³.

Whichever estimate is used, however, formula (2.23) says we need a minimum of

28 + 25(1.5)² ≈ 84

observations to use the central limit theorem.
(d) p̂ = 28/50 = 0.56.
SE(p̂) = √[(0.56)(0.44)/49 × (1 − 50/807)] = 0.0687.

A 95% confidence interval is


0.56 ± 1.96(0.0687) = [0.425, 0.695].
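A Python sketch that reproduces the standard errors in (b) and (d) and the confidence interval:

```python
import math

N, n = 807, 50
s = 2.682
se_ybar = s / math.sqrt(n) * math.sqrt(1 - n / N)      # part (b)

p = 28 / 50                                            # part (d)
se_p = math.sqrt(p * (1 - p) / (n - 1) * (1 - n / N))
ci = (p - 1.96 * se_p, p + 1.96 * se_p)
print(round(se_ybar, 3), round(se_p, 4), [round(v, 3) for v in ci])
```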

2.7 (a) A 95% confidence interval for the proportion of entries from the South is

175/1000 ± 1.96 √[(175/1000)(1 − 175/1000)/1000] = [.151, .199].

(b) As 0.309 is not in the confidence interval, there is evidence that the percentages
differ.
2.8 Answers will vary.
2.9 If n₀ ≤ N, then

zα/2 √(1 − n/N) S/√n = zα/2 √(1 − n₀/[N(1 + n₀/N)]) S√(1 + n₀/N)/√n₀
= zα/2 √(1 + n₀/N − n₀/N) S/√n₀
= zα/2 S/√n₀
= zα/2 S/(zα/2 S/e)
= e.

2.10 Design 3 gives the most precision because its sample size is largest, even
though it is a small fraction of the population. Here are the variances of ȳ for the
three samples:

Sample Number V (ȳ)


1 (1 − 400/4000)S²/400 = 0.00225S²
2 (1 − 30/300)S²/30 = 0.03S²
3 (1 − 3000/300,000,000)S²/3000 = 0.00033333S²
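The three coefficients of S² can be recomputed in a one-line Python sketch:

```python
designs = {1: (400, 4000), 2: (30, 300), 3: (3000, 300_000_000)}
# coefficient of S^2 in V(ybar) = (1 - n/N) S^2 / n
coef = {k: (1 - n / N) / n for k, (n, N) in designs.items()}
print(coef)
assert min(coef, key=coef.get) == 3  # Design 3 has the smallest variance
```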

2.11 (a)

[Histogram of age (months) for the sample, frequency on the vertical axis.]

The histogram appears skewed with tail on the right. With a mildly skewed distri-
bution, though, a sample of size 240 is large enough that the sample mean should
be normally distributed.
(b) ȳ = 12.07917; s² = 3.705003; SE[ȳ] = √(s²/n) = 0.12425.
(Since we do not know the population size, we ignore the fpc, at the risk of a
slightly-too-large standard error.)
A 95% confidence interval is

12.08 ± 1.96(0.12425) = [11.84, 12.32].

(c) n = (1.96)²(3.705)/(0.5)² = 57.
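Parts (b) and (c) in a short Python sketch (no fpc in (b), since the population size is unknown):

```python
import math

ybar, s2, n = 12.07917, 3.705003, 240
se = math.sqrt(s2 / n)                       # fpc ignored: N is unknown
ci = (ybar - 1.96 * se, ybar + 1.96 * se)
n_needed = math.ceil(1.96**2 * s2 / 0.5**2)  # part (c), margin of error e = 0.5
print(round(se, 5), [round(v, 2) for v in ci], n_needed)
```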
2.12 (a) Using (2.17) and choosing the maximum possible value of (0.5)² for S²,

n₀ = (1.96)²S²/e² = (1.96)²(0.5)²/(0.1)² = 96.04.

Then

n = n₀/(1 + n₀/N) = 96.04/(1 + 96.04/580) = 82.4.

(b) Since sampling is with replacement, no fpc is used. An approximate 95% confidence interval for the proportion of children not overdue for vaccination is

27/120 ± 1.96 √[(27/120)(1 − 27/120)/120] = [0.15, 0.30]
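Both calculations, sketched in Python:

```python
import math

# (a) sample size with fpc: N = 580, e = 0.1, worst-case S^2 = (0.5)^2
n0 = 1.96**2 * 0.5**2 / 0.1**2
n = n0 / (1 + n0 / 580)

# (b) with-replacement 95% CI for a proportion (no fpc)
p = 27 / 120
half = 1.96 * math.sqrt(p * (1 - p) / 120)
print(round(n0, 2), round(n, 1), round(p - half, 2), round(p + half, 2))
```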

2.13 (a) We have p̂ = .2 and

V̂(p̂) = (1 − 745/2700)(.2)(.8)/744 = 0.0001557149,

so an approximate 95% CI is

0.2 ± 1.96 √0.0001557149 = [.176, .224].

(b) The above analysis is valid only if the respondents are a random sample of the
selected sample. If respondents differ from the nonrespondents—for example, if the
nonrespondents are more likely to have been bullied—then the entire CI may be
biased.
2.14 Here is SAS output:

The SURVEYMEANS Procedure

Data Summary

Number of Observations 150


Sum of Weights 864

Class Level Information

Class
Variable Levels Values

sex 2 f m

Statistics

Std Error
Variable Level Mean of Mean 95% CL for Mean
__________________________________________________________________
sex f 0.306667 0.034353 0.23878522 0.37454811
m 0.693333 0.034353 0.62545189 0.76121478

Statistics

Variable Level Sum Std Dev 95% CL for Sum


__________________________________________________________________
sex f 264.960000 29.680756 206.310434 323.609566
m 599.040000 29.680756 540.390434 657.689566
__________________________________________________________________

2.15 (a) ȳ = 301,953.7, s² = 118,907,450,529.

CI: 301953.7 ± 1.96 √[(s²/300)(1 − 300/3078)], or [264883, 339025]

(b) ȳ = 599.06, s2 = 161795.4


CI : [556, 642]
(c) ȳ = 56.593, s2 = 5292.73
CI : [48.8, 64.4]
(d) ȳ = 46.823, s2 = 4398.199
CI : [39.7, 54.0]
2.16 (a) The data appear skewed with tail on right.

(b) ȳ = 5309.8, s2 = 3,274,784, SE [ȳ] = 164.5


Here is SAS code for problems 2.16 and 2.17:

filename golfsrs ’C:\golfsrs.csv’;


options ls=78 nodate nocenter;

data golfsrs;
infile golfsrs delimiter="," dsd firstobs=2;
/* The dsd option allows SAS to read the missing values between
successive delimiters */
sampwt = 14938/120;

input RN state $ holes type $ yearblt wkday18 wkday9 wkend18


wkend9 backtee rating par cart18 cart9 caddy $ pro $ ;

/* Make sure the data were read in correctly */


proc print data=golfsrs;
run;

proc univariate data= golfsrs;


var wkday9 backtee;
histogram wkday9 /endpoints = 0 to 110 by 10;
histogram backtee /endpoints = 0 to 8000 by 500;
run;

proc surveymeans data=golfsrs total = 14938;


weight sampwt;
var wkday9 backtee;
run;

2.17 (a) The data appear skewed with tail on left.

(b) ȳ = 5309.8, s2 = 3,274,784, SE [ȳ] = 164.5


2.18 p̂ = 85/120 = 0.708

95% CI: 85/120 ± 1.96 √[(85/120)(1 − 85/120)/119 × (1 − 120/14938)] = .708 ± .081,
or [0.627, 0.790].

2.19 Assume the maximum value for the variance, with p = 0.5. Then use n₀ =
1.96²(0.5)²/(.04)², n = n₀/(1 + n₀/N).
City n0 n
Buckeye 600.25 535
Gilbert 600.25 595
Gila Bend 600.25 446
Phoenix 600.25 600
Tempe 600.25 598
The finite population correction only makes a difference for Buckeye and Gila Bend.
2.20 Sixty of the 70 samples yield confidence intervals, using this procedure, that
include the true value t = 40. The exact confidence level is 60/70 = 0.857.
2.21 (a) A number of different arguments can be made that this method results
in a simple random sample. Here is one proof, which assumes that the random
number table indeed consists of independent random numbers. In the context of
the problem, M = 999, N = 742, and n = 30. Of course, many students will give a
more heuristic argument.
Let U1, U2, U3, . . . be independent random variables, each with a discrete uniform
distribution on {0, 1, 2, . . . , M}. Now define

T1 = min{i : Ui ∈ [1, N]}

and

Tk = min{i > Tk−1 : Ui ∈ [1, N], Ui ∉ {UT1, . . . , UTk−1}}

for k = 2, . . . , n. Then for {x1, . . . , xn} a set of n distinct elements in {1, . . . , N},

P(S = {x1, . . . , xn}) = P({UT1, . . . , UTn} = {x1, . . . , xn})

P{UT1 = x1, . . . , UTn = xn} = E[P{UT1 = x1, . . . , UTn = xn | T1, T2, . . . , Tn}]
= (1/N)(1/(N − 1))(1/(N − 2)) · · · (1/(N − n + 1))
= (N − n)!/N!.

Conditional on the stopping times T1, . . . , Tn, UT1 is discrete uniform on {1, . . . , N};
(UT2 | T1, . . . , Tn, UT1) is discrete uniform on {1, . . . , N} − {UT1}, and so on. Since
x1, . . . , xn are arbitrary,

P(S = {x1, . . . , xn}) = n!(N − n)!/N! = 1/(N choose n),

so the procedure results in a simple random sample.


(b) This procedure does not result in a simple random sample. Units starting with
5, 6, or 7 are more likely to be in the sample than units starting with 0 or 1. To see
this, let's look at a simpler case: selecting one number between 1 and 74 using this
procedure.
Let U1, U2, . . . be independent random variables, each with a discrete uniform
distribution on {0, . . . , 9}. Then the first random number considered in the sequence
is 10U1 + U2; if that number is not between 1 and 74, then 10U2 + U3 is considered,
etc. Let

T = min{i : 10Ui + Ui+1 ∈ [1, 74]}.

Then for x = 10x1 + x2, x ∈ [1, 74],

P(S = {x}) = P(10UT + UT+1 = x)
= P(UT = x1, UT+1 = x2).

For part (a), the stopping times were irrelevant for the distribution of UT1, . . . , UTn;
here, though, the stopping time makes a difference. One way to have T = 2 is if
10U1 + U2 = 75. In that case, you have rejected the first number solely because the
second digit is too large, but that second digit becomes the first digit of the random
number selected. To see this formally, note that

P(S = {x}) = P(10U1 + U2 = x or {10U1 + U2 ∉ [1, 74] and 10U2 + U3 = x}
    or {10U1 + U2 ∉ [1, 74] and 10U2 + U3 ∉ [1, 74] and 10U3 + U4 = x} or . . .)
= P(U1 = x1, U2 = x2)
    + Σ_{t=2}^∞ P( ∩_{i=1}^{t−1} {Ui > 7 or [Ui = 7 and Ui+1 > 4]} and Ut = x1 and Ut+1 = x2 ).

Every term in the series is larger if x1 > 4 than if x1 ≤ 4.


(c) This method almost works, but not quite. For the first draw, the probability
that 131 (or any number in {1, . . . , 149, 170}) is selected is 6/1000; the probability
that 154 (or any number in {150, . . . , 169}) is selected is 5/1000.
(d) This clearly does not produce an SRS, because no odd numbers can be included.
(e) If class sizes are unequal, this procedure does not result in an SRS: students in
smaller classes are more likely to be selected for the sample than are students in
larger classes.
Consider the probability that student j in class i is chosen on the first draw.
P{select student j in class i} = P{select class i} P{select student j | class i}
= (1/20) × 1/(number of students in class i).

(f) Let's look at the probability student j in class i is chosen for first unit in the
sample. Let U1, U2, . . . be independent discrete uniform {1, . . . , 20} and let V1, V2, . . .
be independent discrete uniform {1, . . . , 40}. Let Mi denote the number of students
in class i, with K = Σ_{i=1}^{20} Mi. Then, because all random variables are independent,

P(student j in class i selected)
= P(U1 = i, V1 = j) + P(U2 = i, V2 = j) P( ∪_{k=1}^{20} {U1 = k, V1 > Mk} )
    + · · · + P{Ul+1 = i, Vl+1 = j} ∏_{q=1}^{l} P( ∪_{k=1}^{20} {Uq = k, Vq > Mk} ) + · · ·
= (1/20)(1/40) Σ_{l=0}^∞ [ ∏_{q=1}^{l} P( ∪_{k=1}^{20} {Uq = k, Vq > Mk} ) ]
= (1/800) Σ_{l=0}^∞ [ Σ_{k=1}^{20} (1/20)(40 − Mk)/40 ]^l
= (1/800) Σ_{l=0}^∞ [ 1 − K/800 ]^l
= (1/800) × 1/(1 − (1 − K/800)) = 1/K.

Thus, before duplicates are eliminated, a student has probability 1/K of being
selected on any given draw. The argument in part (a) may then be used to show
that when duplicates are discarded, the resulting sample is an SRS.
2.22 (a) From (2.13),

CV(ȳ) = √V(ȳ)/E(ȳ) = √(1 − n/N) S/(√n ȳU).

Substituting p̂ for ȳ, and Np(1 − p)/(N − 1) for S², we have

CV(p̂) = √[(1 − n/N) Np(1 − p)/((N − 1)np²)] = √[((N − n)/(N − 1)) (1 − p)/(np)].

The CV for a sample of size 1 is √((1 − p)/p). The sample size in (2.26) will be
zα/2² CV²/r².
(b) I used Excel to calculate these values.


p        0.001    0.005   0.01    0.05   0.1    0.3    0.5
Fixed    4.3      21.2    42.3    202.8  384.2  896.4  1067.1
Relative 4264176  849420  422576  81100  38416  9959.7 4268.4

p        0.7     0.9    0.95   0.99  0.995  0.999
Fixed    896.4   384.2  202.8  42.3  21.2   4.3
Relative 1829.3  474.3  224.7  43.1  21.4   4.3
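The table can be reproduced without Excel; a Python sketch, assuming (as the table values imply) an absolute margin e = 0.03, a relative margin r = 0.03, and z = 1.96:

```python
z, e, r = 1.96, 0.03, 0.03  # margins inferred from the table values (assumption)
ps = (0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99, 0.995, 0.999)
fixed = {p: z**2 * p * (1 - p) / e**2 for p in ps}        # absolute precision
relative = {p: z**2 * (1 - p) / (p * r**2) for p in ps}    # relative precision
for p in ps:
    print(p, round(fixed[p], 1), round(relative[p], 1))
```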

2.23

P(no missing data) = (3059 choose 300)(19 choose 0) / (3078 choose 300)
= [(2778)(2777) · · · (2760)] / [(3078)(3077) · · · (3060)]
= 0.1416421.
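With exact integer arithmetic, the probability can be computed directly:

```python
from math import comb

# P(none of the 19 records with missing data fall in the SRS of 300 from 3078)
p = comb(3059, 300) * comb(19, 0) / comb(3078, 300)
print(p)
```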

2.24

g(n) = L(n) + C(n) = k(1 − n/N)S²/n + c₀ + c₁n.

dg/dn = −kS²/n² + c₁

Setting the derivative equal to 0 and solving for n gives

n = √(kS²/c₁).

The sample size, in the decision theoretic approach, should be larger if the cost of a
bad estimate, k, or the variance, S 2 , is larger; the sample size is smaller if the cost
of sampling is larger.
2.25 (a) Skewed, with tail on right.
(b) ȳ = 20.15, s2 = 321.357, SE [ȳ] = 1.63
2.26 In a systematic sample, the population is partitioned into k clusters, each of
size n. One of these clusters is selected with probability 1/k, so πi = 1/k for each i.
But many of the samples that could be selected in an SRS cannot be selected in a
systematic sample. For example,

P (Z1 = 1, . . . , Zn = 1) = 0 :

since every kth unit is selected, the sample cannot consist of the first n units in the
population.
2.27 (a)

P(you are in sample) = (99,999,999 choose 999)(1 choose 1) / (100,000,000 choose 1000)
= [99,999,999!/(999! 99,999,000!)] × [1000! 99,999,000!/100,000,000!]
= 1000/100,000,000 = 1/100,000.

(b)

P(you are not in any of the 2000 samples) = (1 − 1/100,000)^2000 = 0.9802

(c) P(you are not in any of x samples) = (1 − 1/100,000)^x. Solving for x in
(1 − 1/100,000)^x = 0.5 gives x log(.99999) = log(0.5), or x = 69314.4. Almost
70,000 samples need to be taken! This problem provides an answer to the common
question, "Why haven't I been sampled in a poll?"
2.28 (a) We can think of drawing a simple random sample with replacement as
performing an experiment n independent times; on each trial, outcome i (for i ∈
{1, . . . , N}) occurs with probability pi = 1/N. This describes a multinomial experiment.
We may then use properties of the multinomial distribution to answer parts (b) and
(c):
E[Qi] = npi = n/N,
V[Qi] = npi(1 − pi) = (n/N)(1 − 1/N),
and
Cov[Qi, Qj] = −npi pj = −(n/N)(1/N) for i ≠ j.
(b)

E[t̂] = E[ (N/n) Σ_{i=1}^N Qi yi ] = (N/n) Σ_{i=1}^N (n/N) yi = t.

(c)

V[t̂] = (N/n)² V[ Σ_{i=1}^N Qi yi ]
= (N/n)² Σ_{i=1}^N Σ_{j=1}^N yi yj Cov[Qi, Qj]
= (N/n)² { Σ_{i=1}^N yi² npi(1 − pi) + Σ_{i=1}^N Σ_{j≠i} yi yj (−npi pj) }
= (N/n)² { (n/N)(1 − 1/N) Σ_{i=1}^N yi² − (n/N)(1/N) Σ_{i=1}^N Σ_{j≠i} yi yj }
= (N/n) { Σ_{i=1}^N yi² − N ȳU² }
= (N²/n) Σ_{i=1}^N (yi − ȳU)²/N.
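The final identity can be verified by brute force on the small population from Exercise 2.4, enumerating all ordered with-replacement samples:

```python
from itertools import product
from fractions import Fraction as F

popn = (1, 2, 4, 4, 7, 7, 7, 8)
N, n = len(popn), 3
# t-hat = (N/n) * sum of the n drawn values, over all 8^3 ordered samples
totals = [F(N * sum(s), n) for s in product(popn, repeat=n)]
e = sum(totals, F(0)) / len(totals)
v = sum((t - e) ** 2 for t in totals) / len(totals)

ybarU = F(sum(popn), N)
rhs = F(N**2, n) * sum((y - ybarU) ** 2 for y in popn) / N
print(e, v, rhs)  # 40, 128, 128
```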

2.29 We use induction. Clearly, S₀ is an SRS of size n from a population of size n.

Now suppose Sk−1 is an SRS of size n from Uk−1 = {1, 2, . . . , n + k − 1}, where
k ≥ 1. We wish to show that Sk is an SRS of size n from Uk = {1, 2, . . . , n + k}.
Since Sk−1 is an SRS, we know that

P(Sk−1) = 1/(n+k−1 choose n) = n!(k − 1)!/(n + k − 1)!.

Now let Uk ∼ Uniform(0, 1), let Vk be discrete uniform (1, . . . , n), and suppose Uk
and Vk are independent. Let A be a subset of size n from Uk. If A does not contain
unit n + k, then A can be achieved as a sample at step k − 1 and

P(Sk = A) = P(Sk−1 = A and Uk > n/(n + k))
= P(Sk−1) k/(n + k)
= n!k!/(n + k)!.

If A does contain unit n + k, then the sample at step k − 1 must contain Ak−1 =
A − {n + k} plus one other unit among the k units not in Ak−1.

P(Sk = A) = Σ_{j ∈ Uk−1\Ak−1} P(Sk−1 = Ak−1 ∪ {j} and Uk ≤ n/(n + k) and Vk = j)
= k × [n!(k − 1)!/(n + k − 1)!] × [n/(n + k)] × (1/n)
= n!k!/(n + k)!.

2.30 I always use this activity in my classes. Students generally get estimates of
the total area that are biased upwards for the purposive sample. They think, when
looking at the picture, that they don’t have enough of the big rectangles and so tend
to oversample them. This is also a good activity for reviewing confidence intervals
and other concepts from an introductory statistics class.
Chapter 3

Stratified Sampling

3.2 (a) For each stratum, we calculate t̂h = 4ȳh


Stratum 1
Sample, S1 P (S1 ) {yi , i 2 S1 } t̂1S1
{1, 2} 1/6 1, 2 6
{1, 3} 1/6 1, 4 10
{1, 8} 1/6 1, 8 18
{2, 3} 1/6 2, 4 12
{2, 8} 1/6 2, 8 20
{3, 8} 1/6 4, 8 24
Stratum 2
Sample, S2 P (S2 ) {yi , i 2 S2 } t̂2S2
{4, 5} 1/6 4, 7 22
{4, 6} 1/6 4, 7 22
{4, 7} 1/6 4, 7 22
{5, 6} 1/6 7, 7 28
{5, 7} 1/6 7, 7 28
{6, 7} 1/6 7, 7 28

(b) From Stratum 1, we have the following probability distribution for t̂1 :
j P (t̂1 = j)
6 1/6
10 1/6
12 1/6
18 1/6
20 1/6
24 1/6

The sampling distribution for t̂2 is:


k P (t̂2 = k)
22 1/2
28 1/2
Because we sample independently in Strata 1 and 2,

P (t̂1 = j and t̂2 = k) = P (t̂1 = j)P (t̂2 = k)

for all possible values of j and k. Thus,


j k j+k P (t̂1 = j and t̂2 = k)
6 22 28 1/12
6 28 34 1/12
10 22 32 1/12
10 28 38 1/12
12 22 34 1/12
12 28 40 1/12
18 22 40 1/12
18 28 46 1/12
20 22 42 1/12
20 28 48 1/12
24 22 46 1/12
24 28 52 1/12

So the sampling distribution of t̂str is


k P (t̂str = k)
28 1/12
32 1/12
34 2/12
38 1/12
40 2/12
42 1/12
46 2/12
48 1/12
52 1/12
(c)

E[t̂str] = Σ_k k P(t̂str = k) = 40
V[t̂str] = Σ_k (k − 40)² P(t̂str = k) = 47 1/3.
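The mean and variance of t̂str can be recomputed directly from the two stratum distributions, using independence across strata; a Python sketch:

```python
from itertools import product
from fractions import Fraction as F

# Sampling distributions of the stratum totals from parts (a)-(b)
t1 = [(F(1, 6), v) for v in (6, 10, 12, 18, 20, 24)]
t2 = [(F(1, 2), 22), (F(1, 2), 28)]

# Independent strata: joint probability is the product
dist = [(p1 * p2, v1 + v2) for (p1, v1), (p2, v2) in product(t1, t2)]
e = sum(p * v for p, v in dist)
var = sum(p * (v - e) ** 2 for p, v in dist)
print(e, var)  # 40 and 142/3 = 47 1/3
```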

3.3 (a) ȳU = 71.83333, S² = 86.16667.


(b) (6 choose 4) = 15.
(c) The 15 samples are:

Units in sample y values in sample Sample Mean


1 2 3 4 66 59 70 83 69.50
1 2 3 5 66 59 70 82 69.25
1 2 3 6 66 59 70 71 66.50
1 2 4 5 66 59 83 82 72.50
1 2 4 6 66 59 83 71 69.75
1 2 5 6 66 59 82 71 69.50
1 3 4 5 66 70 83 82 75.25
1 3 4 6 66 70 83 71 72.50
1 3 5 6 66 70 82 71 72.25
1 4 5 6 66 83 82 71 75.50
2 3 4 5 59 70 83 82 73.50
2 3 4 6 59 70 83 71 70.75
2 3 5 6 59 70 82 71 70.50
2 4 5 6 59 83 82 71 73.75
3 4 5 6 70 83 82 71 76.50
Using (2.9),

V(ȳ) = (1 − 4/6)(86.16667/4) = 7.180556.
(d) (3 choose 2)(3 choose 2) = 9.
2 2
(e) You cannot have any of the samples from (c) which contain 3 units from one of
the strata. This eliminates the first 3 samples, which contain {1, 2, 3} and the three
samples containing students {4, 5, 6}. The stratified samples are
Units in S1 Units in S2 y values in sample ȳstr
1 2 4 5 66 59 83 82 72.50
1 2 4 6 66 59 83 71 69.75
1 2 5 6 66 59 82 71 69.50
1 3 4 5 66 70 83 82 75.25
1 3 4 6 66 70 83 71 72.50
1 3 5 6 66 70 82 71 72.25
2 3 4 5 59 70 83 82 73.50
2 3 4 6 59 70 83 71 70.75
2 3 5 6 59 70 82 71 70.50
V(ȳstr) = (1 − 2/3)(3/6)²(31/2) + (1 − 2/3)(3/6)²(44.33333/2) = 3.14.
The variance is smaller because the extreme samples from (c) are excluded by the
stratified design. The variances S12 = 31 and S22 = 44.33 are much smaller than the
population variance S 2 .
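Both variances can be checked by enumerating the samples; a Python sketch using the six population values and the two strata {1, 2, 3} and {4, 5, 6}:

```python
from itertools import combinations
from statistics import mean

y = {1: 66, 2: 59, 3: 70, 4: 83, 5: 82, 6: 71}
mu = mean(y.values())  # 71.8333

# SRS of size 4: all 15 possible sample means
srs_means = [mean(y[i] for i in s) for s in combinations(y, 4)]
v_srs = mean((m - mu) ** 2 for m in srs_means)

# Stratified: 2 from each stratum, 9 possible samples
str_means = [mean((y[a], y[b], y[c], y[d]))
             for a, b in combinations((1, 2, 3), 2)
             for c, d in combinations((4, 5, 6), 2)]
v_str = mean((m - mu) ** 2 for m in str_means)
print(round(v_srs, 6), round(v_str, 2))  # 7.180556 3.14
```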
3.4 Here is SAS code for creating the data set:

data acls;
input stratum $ popsize returns percfem;
females = round(returns*percfem/100);
males = returns - females;
sampwt = popsize/returns;
datalines;
Literature 9100 636 38
Classics 1950 451 27
Philosophy 5500 481 18
History 10850 611 19
Linguistics 2100 493 36
PoliSci 5500 575 13
Sociology 9000 588 26
;

proc print data=acls;


run;

data aclslist;
set acls;
do i = 1 to females;
femind = 1;
output;
end;
do i = 1 to males;
femind = 0;
output;
end;

/* Check whether we created the data set correctly*/


proc freq data=aclslist;
tables stratum * femind;
run;

proc surveymeans data=aclslist mean clm sum clsum;


stratum stratum;
weight sampwt;
var femind;
run;

We obtain t̂ = 10858 with SE 313. These values differ from those in Example 4.4
because of rounding.
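The point estimate from the SAS run can be replicated without SAS; the rounding of the stratum female counts in the data step is what produces 10858 rather than a direct Σ Nh p̂h. A Python sketch:

```python
# Stratum data from the SAS datalines above: (popsize, returns, percfem)
strata = {
    "Literature":  (9100, 636, 38),
    "Classics":    (1950, 451, 27),
    "Philosophy":  (5500, 481, 18),
    "History":     (10850, 611, 19),
    "Linguistics": (2100, 493, 36),
    "PoliSci":     (5500, 575, 13),
    "Sociology":   (9000, 588, 26),
}
# females per stratum rounded, as in the SAS data step, then weighted up
that = sum(popsize / returns * round(returns * pf / 100)
           for popsize, returns, pf in strata.values())
print(round(that))  # 10858
```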
3.5 (a) The sampled population consists of members of the organizations who would
respond to the survey.

(b)

p̂str = Σ_{h=1}^7 (Nh/N) p̂h
= (9,100/44,000)(0.37) + (1,950/44,000)(0.23) + · · · + (9,000/44,000)(0.41)
= 0.334.

SE[p̂str] = √[ Σ_{h=1}^7 (1 − nh/Nh)(Nh/N)² p̂h(1 − p̂h)/(nh − 1) ]
= √(1.46 × 10⁻⁵ + 5.94 × 10⁻⁷ + · · · + 1.61 × 10⁻⁵)
= 0.0079.

3.6 (a) We use Neyman allocation (= optimal allocation when costs in the strata
are equal), with nh ∝ NhSh.
We take Rh to be the relative standard deviation in stratum h, and let nh =
900(NhRh)/125000.
Stratum Nh Rh Nh Rh nh
Houses 35,000 2 70,000 504
Apartments 45,000 1 45,000 324
Condos 10,000 1 10,000 72
Sum 90,000 125,000 900
(b) Let’s suppose we take a sample of 900 observations. (Any other sample size will
give the same answer.)
With proportional allocation, we sample 350 houses, 450 apartments, and 100 con-
dominiums. If the assumptions about the variances hold,
Vstr[p̂str] = (350/900)²(.45)(.55)/350 + (450/900)²(.25)(.75)/450 + (100/900)²(.03)(.97)/100
= .000215.
If these proportions hold in the population, then

p = (35/90)(.45) + (45/90)(.25) + (10/90)(.03) = 0.3033

and, with an SRS of size 900,

Vsrs[p̂srs] = (0.3033)(1 − .3033)/900 = .000235.
The gain in efficiency is given by

Vstr[p̂str]/Vsrs[p̂srs] = .000215/.000235 = 0.9144.

For any sample size n, using the same argument as above, we have

Vstr[p̂str] = .193233/n   and   Vsrs[p̂srs] = .211322/n.

We only need 0.9144n observations, taken in a stratified sample with proportional
allocation, to achieve the same variance as in an SRS with n observations.
Note: The ratio Vstr[p̂str]/Vsrs[p̂srs] is the design effect, to be discussed further in
Section 7.5.
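The 3.6(b) arithmetic can be verified independently. The following sketch (Python rather than the SAS used elsewhere in these solutions; variable names are ours) recomputes both variances and the design effect:

```python
# Check of the 3.6(b) variances; stratum weights, assumed proportions,
# and the proportional allocation n_h come from the exercise.
Wh = [35/90, 45/90, 10/90]   # stratum weights N_h / N
ph = [0.45, 0.25, 0.03]      # assumed stratum proportions
nh = [350, 450, 100]         # proportional allocation of n = 900

# Stratified variance: sum over strata of (N_h/N)^2 p_h (1 - p_h) / n_h
v_str = sum(w**2 * p * (1 - p) / n for w, p, n in zip(Wh, ph, nh))

# SRS variance with overall proportion p = sum of W_h p_h
p = sum(w * p_ for w, p_ in zip(Wh, ph))
v_srs = p * (1 - p) / 900

deff = v_str / v_srs
print(round(v_str, 6), round(v_srs, 6), round(deff, 4))
```

Running it reproduces Vstr ≈ .000215, Vsrs ≈ .000235, and a design effect of about 0.914.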
3.7 (a) Here are summary statistics for each stratum:
Stratum
Biological Physical Social Humanities
average 3.142857 2.105263 1.230769 0.4545455
variance 6.809524 8.210526 4.358974 0.8727273
Since we took a simple random sample in each stratum, we use

(102)(3.142857) = 320.5714

to estimate the total number of publications in the biological sciences, with estimated
variance

(102)² (1 − 7/102)(6.809524/7) = 9426.327.
The following table gives estimates of the total number of publications and estimated
variance of the total for each of the four strata:
Estimated total Estimated variance
Stratum number of publications of total
Biological Sciences 320.571 9426.33
Physical Sciences 652.632 38982.71
Social Sciences 267.077 14843.31
Humanities 80.909 2358.43
Total 1321.189 65610.78
We estimate the total number of refereed publications for the college by adding the
totals for each of the strata; as sampling was done independently in each stratum,
the variance of the college total is the sum of the variances of the population stratum
totals. Thus we estimate the total number of refereed papers as 1321.2, with
standard error √65610.78 = 256.15.
(b) From Exercise 2.6, using an SRS of size 50, the estimated total was t̂srs =
1436.46, with standard error 296.2. Here, stratified sampling ensures that each
division of the college is represented in the sample, and it produces an estimate
with a smaller standard error than an SRS with the same number of observations.
The sample variance in Exercise 2.8 was s² = 7.19. Only Physical Sciences had a
sample variance larger than 7.19; the sample variance in Humanities was only 0.87.
Observations within many strata tend to be more homogeneous than observations
in the population as a whole, and the reduction in variance in the individual strata
often leads to a reduced variance for the population estimate.
(c)

Stratum                 Nh    nh   p̂h      (Nh/N)p̂h   varconth
Biological Sciences     102    7   1/7      .018       .0003
Physical Sciences       310   19   10/19    .202       .0019
Social Sciences         217   13   9/13     .186       .0012
Humanities              178   11   8/11     .160       .0009
Total                   807   50            .567       .0043

where varconth = (1 − nh/Nh)(Nh/N)² p̂h(1 − p̂h)/(nh − 1). Thus

p̂str = 0.567
SE[p̂str] = √0.0043 = 0.066.
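As an independent check of this table (a Python sketch, not part of the original solution; the p̂h are entered as exact fractions):

```python
# Recompute p̂_str and its SE for 3.7(c) from (N_h, n_h, p̂_h).
data = [
    (102, 7, 1/7),     # Biological Sciences
    (310, 19, 10/19),  # Physical Sciences
    (217, 13, 9/13),   # Social Sciences
    (178, 11, 8/11),   # Humanities
]
N = sum(Nh for Nh, _, _ in data)

p_str = sum((Nh / N) * ph for Nh, _, ph in data)
v_str = sum((1 - nh / Nh) * (Nh / N) ** 2 * ph * (1 - ph) / (nh - 1)
            for Nh, nh, ph in data)
se = v_str ** 0.5
print(N, round(p_str, 3), round(se, 3))
```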

3.8 (a) Because the budget for interviews is $15,000, a total of 15,000/30 = 500
in-person interviews can be taken. The variances in the phone and nonphone strata
are assumed similar, so proportional allocation is optimal: 450 phone households
and 50 nonphone households would be selected for interview.
(b) The variances in the two strata are assumed equal, so optimal allocation gives

nh ∝ Nh/√ch.

Stratum     ch    Nh/N    (Nh/N)/√ch
Phone       10    0.9      0.284605
Nonphone    40    0.1      0.015811
Total             1.0      0.300416
The calculations in the table imply that

nphone = (0.284605/0.300416) n;

the cost constraints imply that

10 nphone + 40 nnon = 10 nphone + 40(n − nphone) = 15,000.



Solving, we have
nphone = 1227
nnon = 68
n = 1295.
Because of the reduced costs of telephone interviewing, more households can be
selected in each stratum.
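The solving step can be reproduced numerically; this Python sketch (names are ours) applies the allocation rule and the budget constraint:

```python
import math

# Solve the 3.8(b) allocation; costs and stratum fractions are from the
# exercise, variable names are ours.
cost = {"phone": 10, "nonphone": 40}
W = {"phone": 0.9, "nonphone": 0.1}

# n_h proportional to (N_h/N) / sqrt(c_h) when stratum variances are equal
a = {h: W[h] / math.sqrt(cost[h]) for h in cost}
f = a["phone"] / sum(a.values())          # fraction of sample in phone stratum

# Budget constraint: 10 n_phone + 40 (n - n_phone) = 15000
n = 15000 / (10 * f + 40 * (1 - f))
n_phone, n_non = f * n, (1 - f) * n
print(int(n), round(n_phone), round(n_non))
```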
3.9 (a) Summary statistics for acres87:

Region   Nh     nh    ȳh         s²h          (Nh/N)ȳh    (1 − nh/Nh)(Nh/N)² s²h/nh
NC       1054   103   308188.3   2.943E+10    105532.98   30225148
NE       220    21    109009.6   1.005E+10    7791.46     2211633
S        1382   135   212687.2   5.698E+10    95495.05    76782239
W        422    41    654458.7   3.775E+11    89727.61    156241957
Total    3078   300                           298547.10   265460977
For acres87, ȳstr = Σh (Nh/N)ȳh = 298547.1 and

SE(ȳstr) = √( Σh (1 − nh/Nh)(Nh/N)² s²h/nh ) = 16293.

Of course, ȳstr could also be calculated using the column of weights in the data set,
as:

ȳstr = Σ_{i∈S} wi yi / Σ_{i∈S} wi = 918927923/3078 = 298547.1
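The acres87 estimates can be recomputed directly from the summary-statistics table; a Python sketch (our variable names):

```python
# Recompute the acres87 stratified mean and SE from the stratum summaries.
regions = [  # (N_h, n_h, ybar_h, s2_h), values as printed above
    (1054, 103, 308188.3, 2.943e10),   # NC
    (220, 21, 109009.6, 1.005e10),     # NE
    (1382, 135, 212687.2, 5.698e10),   # S
    (422, 41, 654458.7, 3.775e11),     # W
]
N = sum(Nh for Nh, _, _, _ in regions)

ybar_str = sum((Nh / N) * ybar for Nh, _, ybar, _ in regions)
v_str = sum((1 - nh / Nh) * (Nh / N) ** 2 * s2 / nh
            for Nh, nh, _, s2 in regions)
print(N, round(ybar_str, 1), round(v_str ** 0.5))
```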

(b) Summary statistics for farms92:

Region   Nh     nh    ȳh       s²h         (Nh/N)ȳh   (1 − nh/Nh)(Nh/N)² s²h/nh
NC       1054   103   750.68   128226.50   257.06     131.71
NE       220    21    528.10   128645.90   37.75      28.31
S        1382   135   578.59   222972.8    259.78     300.44
W        422    41    602.34   311508.4    82.58      128.94
Total    3078   300                        637.16     589.40

For farms92, ȳstr = Σh (Nh/N)ȳh = 637.16 and

SE(ȳstr) = √( Σh (1 − nh/Nh)(Nh/N)² s²h/nh ) = √589.40 = 24.28.

(c) Summary statistics for largef92:

Region   Nh     nh    ȳh       s²h        (Nh/N)ȳh   (1 − nh/Nh)(Nh/N)² s²h/nh
NC       1054   103   70.91    4523.34    24.28      4.65
NE       220    21    8.19     90.16      0.59       0.02
S        1382   135   38.84    2450.47    17.44      3.30
W        422    41    104.98   11328.97   14.39      4.69
Total    3078   300                       56.70      12.66

For largef92, ȳstr = Σh (Nh/N)ȳh = 56.70 and

SE(ȳstr) = √( Σh (1 − nh/Nh)(Nh/N)² s²h/nh ) = √12.66 = 3.56.

(d) Summary statistics for smallf92:

Region   Nh     nh    ȳh       s²h         (Nh/N)ȳh   (1 − nh/Nh)(Nh/N)² s²h/nh
NC       1054   103   44.26    1286.43     15.16      1.32
NE       220    21    47.24    2364.79     3.38       0.52
S        1382   135   47.39    6205.45     21.28      8.36
W        422    41    124.39   100640.94   17.05      41.66
Total    3078   300                        56.86      51.86

For smallf92, ȳstr = Σh (Nh/N)ȳh = 56.86 and

SE(ȳstr) = √( Σh (1 − nh/Nh)(Nh/N)² s²h/nh ) = √51.86 = 7.20.
h

Here is SAS code for finding these estimates:

data strattot;
input region $ _total_;
cards;
NE 220
NC 1054
S 1382
W 422
;

proc surveymeans data=agstrat total = strattot mean sum clm clsum df;
stratum region ;
var acres87 farms92 largef92 smallf92 ;
weight strwt;
run;

3.10 For this problem, note that Nh, the total number of dredge tows needed to
cover the stratum, must be calculated. We use Nh = 25.6 × Areah.
(a) Calculate t̂h = Nh ȳh.

Stratum   Nh      nh   ȳh     s²h     t̂h     (1 − nh/Nh) Nh² s²h/nh
1         5704    4    0.44   0.068   2510    552718
2         1270    6    1.17   0.042   1486    11237
3         1286    3    3.92   2.146   5041    1180256
4         5064    5    1.80   0.794   9115    4068262
Sum       13324   18                  18152   5812472

Thus t̂str = 18152 and

SE[t̂str] = √5812472 = 2411.

(b)

Stratum   Nh      nh   ȳh     s²h     t̂h    (1 − nh/Nh) Nh² s²h/nh
1         8260    8    0.63   0.083   5204   707176
4         5064    5    0.40   0.046   2026   235693
Sum       13324   13                  7229   942869

Here, t̂str = 7229 and

SE[t̂str] = √942869 = 971.

3.11 Note that the paper is somewhat ambiguous on how the data were collected.
The abstract says random stratified sampling was used, while on p. 224 the authors
say: "a sampling grid covering 20% of the total area was made . . . by picking 40
numbers between one and 200 with the random number generator." It's possible
that poststratification was really used, but for exercise purposes, let's treat it as a
stratified random sample. Also note that the original data were not available; data
were generated that were consistent with summary statistics in the paper.
(a) Summary statistics are in the following table:
Zone Nh nh ȳh s2h
1 68 17 1.765 3.316
2 84 12 4.417 11.538
3 48 11 10.545 46.073
Total 200 40
Using (3.1),

t̂str = Σh Nh ȳh = 68(1.76) + 84(4.42) + 48(10.55) = 997.

From (3.3),

V̂(t̂str) = Σ_{h=1}^{H} (1 − nh/Nh) Nh² s²h/nh
        = (1 − 17/68) 68² (3.316/17) + (1 − 12/84) 84² (11.538/12) + (1 − 11/48) 48² (46.073/11)
        = 676.5 + 5815.1 + 7438.7
        = 13930.2,

so

SE(t̂str) = √13930.2 = 118.
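These figures can be verified from the zone table alone; a Python sketch (our names):

```python
# Recompute the 3.11(a) total and SE from the zone summaries.
zones = [  # (N_h, n_h, ybar_h, s2_h)
    (68, 17, 1.765, 3.316),
    (84, 12, 4.417, 11.538),
    (48, 11, 10.545, 46.073),
]
t_hat = sum(Nh * ybar for Nh, _, ybar, _ in zones)
v_hat = sum((1 - nh / Nh) * Nh ** 2 * s2 / nh for Nh, nh, _, s2 in zones)
print(round(t_hat), round(v_hat, 1), round(v_hat ** 0.5))
```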

SAS code to calculate these quantities is given below:

data seals;
infile seals delimiter="," firstobs=2;
input zone holes;
if zone = 1 then sampwt = 68/17;
if zone = 2 then sampwt = 84/12;
if zone = 3 then sampwt = 48/11;
run;

data strattot;
input zone _total_;
datalines;
1 68
2 84
3 48
;

proc surveymeans data=seals total=strattot mean clm sum clsum;


strata zone;
weight sampwt;
var holes;
run;

The SURVEYMEANS Procedure

Data Summary

Number of Strata 3
Number of Observations 40
Sum of Weights 200

Statistics

Std Error
Variable Mean of Mean 95% CL for Mean

holes 4.985909 0.590132 3.79018761 6.18163058

Statistics

Variable Sum Std Dev 95% CL for Sum

holes 997.181818 118.026447 758.037521 1236.32612

(b) (i) If the goal is estimating the total number of breathing holes, we should use
optimal allocation. Using the values of s2h from this survey as estimates of Sh2 , we
have:
Zone Nh s2h Nh sh
1 68 3.316 123.83
2 84 11.538 285.33
3 48 46.073 325.81
Total 200 734.97
Then n1 = (123.83/734.97)n = 0.17n; n2 = 0.39n; n3 = 0.44n. The high variance
in zone 3 leads to a larger sample size in that zone.
(ii) If the goal is to compare the density of the breathing holes in the three zones,
we would like to have equal precision for ȳh in the three strata. Ignoring the fpc,
that means we would like

S1²/n1 = S2²/n2 = S3²/n3,

which implies that nh should be proportional to Sh² to achieve equal variances.
Using the sample variances s²h instead of the unknown population variances Sh²,
this leads to

n1 = [s1²/(s1² + s2² + s3²)] n = 0.05n
n2 = 0.19n
n3 = 0.76n.
3.12 We use nh = 300 Nh sh / Σk Nk sk.

Region          Nh      Nh sh         nh
Northeast       220     19,238,963    7
North Central   1,054   181,392,707   69
South           1,382   319,918,785   122
West            422     265,620,742   101
Total           3,078   786,171,197   300
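A quick check of the allocation column (Python, our names); each nh is rounded from 300 Nh sh / Σk Nk sk:

```python
# Reproduce the 3.12 Neyman allocation from the N_h s_h column.
NhSh = {"NE": 19238963, "NC": 181392707, "S": 319918785, "W": 265620742}
total = sum(NhSh.values())
n_h = {region: round(300 * v / total) for region, v in NhSh.items()}
print(total, n_h)
```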

3.13 Answers will vary since students select different samples.


3.14
Method p̂str SE[p̂str ]
Role play 0.96 0.011
Problem solving 0.82 0.217
Simulations 0.45 0.028
Empathy building 0.45 0.028
Gestalt exercises 0.11 0.017
Note that the standard errors calculated for role play and for gestalt exercises are
unreliable because the formula relies on a normal approximation to p̂h : here, the
sample sizes in the strata are small and p̂h ’s are close to 0 or 1, so the accuracy of
the normal approximation is questionable.
3.15 (a) An advantage is that using the same number of stores in each stratum
gives the best precision for comparing strata if the within-stratum variances are the
same. In addition, people may perceive that allocation as fair. A disadvantage is
that estimates may lose precision relative to optimal allocation if some strata have
higher variances than others.
(b)

ȳstr = Σ_{h=1}^{3} (Nh/N) ȳh = 3.9386.

V̂(ȳstr) = Σ_{h=1}^{3} (Nh/N)² s²h/nh = 8.85288 × 10^-5.

Thus, a 95% CI is

3.9386 ± 1.96 √(8.85288 × 10^-5) = [3.92, 3.96].

This is a very small CI. Remember, though, that it reflects only the sampling error.
In this case, the author was unable to reach some of the stores and in addition some
of the market basket items were missing, so there was nonsampling error as well.
3.16 (a)
Stratum Nh nh ȳh s2h t̂h V̂ (t̂h )
1 89 19 1.74 5.43 154.6 1779.5
2 61 20 1.75 6.83 106.8 854.0
3 40 22 13.27 58.78 530.9 1923.7
4 47 21 4.10 15.59 192.5 907.2
Total 237 82 984.7 5464.3
The estimated total number of otter holts is t̂str = 985, with

SE[t̂str] = √5464 = 73.9.
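A numeric check from the stratum summaries (Python, our names); because the printed ȳh and s²h are rounded, the results agree with 984.7 and 73.9 only approximately:

```python
# Recompute the 3.16(a) total and SE from the stratum summaries.
strata = [  # (N_h, n_h, ybar_h, s2_h)
    (89, 19, 1.74, 5.43),
    (61, 20, 1.75, 6.83),
    (40, 22, 13.27, 58.78),
    (47, 21, 4.10, 15.59),
]
t_hat = sum(Nh * ybar for Nh, _, ybar, _ in strata)
v_hat = sum((1 - nh / Nh) * Nh ** 2 * s2 / nh for Nh, nh, _, s2 in strata)
print(round(t_hat), round(v_hat ** 0.5, 1))
```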

Here is SAS code and output for estimating these quantities:

data exer0316;
infile otters delimiter=’,’ firstobs=2;
input section habitat holts;
if habitat = 1 then sampwt = 89/19;
if habitat = 2 then sampwt = 61/20;
if habitat = 3 then sampwt = 40/22;
if habitat = 4 then sampwt = 47/21;
;

data strattot;
input habitat _total_;
datalines;
1 89
2 61
3 40
4 47
;
proc surveymeans data=exer0316 total = strattot mean clm sum clsum;
stratum habitat;
weight sampwt;
var holts;
run;

The SURVEYMEANS Procedure

Data Summary

Number of Strata 4
Number of Observations 82
Sum of Weights 237

Statistics

Std Error
Variable Mean of Mean 95% CL for Mean

holts 4.154912 0.311903 3.53396136 4.77586336

Statistics

Variable Sum Std Dev 95% CL for Sum

holts 984.714229 73.920990 837.548842 1131.87962

3.17 (a) We form a new variable, weight = 1/samprate. Then the number of
divorces in the divorce registration area is

Σ_{h=1}^{H} (weight)h (numrecs)h = 571,185.

Note that this is the population value, not an estimate, because samprate = nh/Nh
and (numrecs)h = nh. Thus

Σ_{h=1}^{H} (weight)h (numrecs)h = Σ_{h=1}^{H} (Nh/nh) nh = N.

(b) They wanted a specified precision within each state (= stratum). You can see
that, except for a few states in which a census is taken, the number of records
sampled is between 2400 and 6200. That gives roughly the same precision for estimates
within each of those states. If the same sampling rate were used in each state, states
with large population would have many more records sampled than states with small
population.
(c) (i) For each stratum,

ȳh = p̂h = (hsblt20 + hsb20-24)/numrecs.

The following spreadsheet shows calculations done to obtain

t̂str = Σh Nh p̂h

and

V̂(t̂str) = Σh Nh² (1 − nh/Nh) p̂h(1 − p̂h)/(nh − 1) = Σh varconth.

state   rate   nh   Nh   husb ≤ 24   p̂h   Nh p̂h   varcont
AL 0.1 2460 24600 295 0.11992 2950 23376
AK 1 3396 3396 371 0.10925 371 0
CT 0.5 6003 12006 333 0.05547 666 629
DE 1 2938 2938 238 0.08101 238 0
DC 1 2525 2525 90 0.03564 90 0
GA 0.1 3404 34040 440 0.12926 4400 34491
HI 1 4415 4415 394 0.08924 394 0
ID 0.5 2949 5898 380 0.12886 760 662
IL 1 46986 46986 4349 0.09256 4349 0
IA 0.5 5259 10518 541 0.10287 1082 971
KS 0.5 6170 12340 768 0.12447 1536 1345
KY 0.2 3879 19395 567 0.14617 2835 9685
MD 0.2 3104 15520 156 0.05026 780 2964
MA 0.2 3367 16835 163 0.04841 815 3103
MI 0.1 3996 39960 270 0.06757 2700 22664
MO 1 24984 24984 2876 0.11511 2876 0
MT 1 4125 4125 432 0.10473 432 0
NE 1 6236 6236 620 0.09942 620 0
NH 1 4947 4947 458 0.09258 458 0
NY 1 67993 67993 3809 0.05602 3809 0
OH 0.05 2465 49300 102 0.04138 2040 37171
OR 0.2 3124 15620 233 0.07458 1165 4314
PA 0.1 3883 38830 248 0.06387 2480 20900
RI 1 3684 3684 246 0.06678 246 0
SC 1 13835 13835 1429 0.10329 1429 0
SD 1 2699 2699 93 0.03446 93 0
TN 0.1 3042 30420 426 0.14004 4260 32982
UT 0.5 4489 8978 591 0.13166 1182 1027
VT 1 2426 2426 162 0.06678 162 0
VA 1 25608 25608 2075 0.08103 2075 0
WI 0.2 3384 16920 280 0.08274 1400 5138
WY 1 3208 3208 346 0.10786 346 0
Total 280983 571185 49039 201422
Thus, for estimating the total number of divorces granted to men aged 24 or less,

t̂str = 49039

and

SE(t̂str) = √201,422 = 449.

A 95% confidence interval is

49039 ± (1.96)(449) = [48159, 49919].

(ii) Similarly, for the women,

t̂str = 4600 + 664 + 1330 + · · · + 658 = 86619

and

SE(t̂str) = √(33672 + 0 + 1183 + · · · + 8327 + 0) = 564.

A 95% CI is [85513, 87725].


(d) For estimating the proportions,

p̂str = Σ_{h=1}^{H} (Nh/N) p̂h

and

V̂(p̂str) = Σ_{h=1}^{H} (Nh/N)² (1 − nh/Nh) p̂h(1 − p̂h)/(nh − 1).

(i) For the men:

p̂str = (24600/571185)(390/2460) + (3396/571185)(647/3396) + · · · + (3208/571185)(560/3208) = 0.1928.

SE(p̂str) = √((9.06 × 10^-8) + 0 + (6.54 × 10^-9) + · · · + (3.37 × 10^-8) + 0)
         = 1.068 × 10^-3.

A 95% confidence interval for the proportion of men aged 40–49 at the time of the
decree is

0.1928 ± 1.96(1.068 × 10^-3) = [0.191, 0.195].

(ii) For the women:

p̂str = (24600/571185)(296/2460) + (3396/571185)(495/3396) + · · · + (3208/571185)(437/3208) = 0.1566

and

SE(p̂str) = √((7.19 × 10^-8) + 0 + (5.80 × 10^-9) + · · · + (2.83 × 10^-8) + 0)
         = 9.75 × 10^-4.

A 95% confidence interval is

0.1566 ± 1.96(9.75 × 10^-4) = [0.155, 0.158].

3.18 (a)

(b) Let wh be the relative sampling weight for stratum h. Then Nh ∝ nh wh. For
each response, we may calculate

ȳstr = Σh nh wh ȳh / Σh nh wh;

equivalently, we may define a new column

weighthi = 1 if h ∈ {1, 2},   2 if h ∈ {3, 4, 5, 6}

and calculate

ȳstr = Σh Σi weighthi yhi / Σh Σi weighthi.
Summary statistics for 1974 are:

                            Number of fish         Weight of fish
Stratum   nh   wh   Nh/N    ȳh      s²h           ȳh     s²h
1         13   1    0.213   496.2   108528.8      60.4   522.5
2         12   1    0.197   101.9   10185.0       32.1   299.5
3         9    2    0.295   38.2    504.4         15.0   130.9
4         3    2    0.098   39.0    147.0         6.9    9.1
5         3    2    0.098   299.3   189681.3      39.2   3598.4
6         3    2    0.098   103.7   8142.3        8.2    76.4
We have, using (3.5) and ignoring the fpc when calculating the standard error,
Response ȳˆstr SE(ȳˆstr )
Number of fish 180.5 32.5
Weight of fish 29.0 4.0
SAS code for calculating these estimates is given below.

data nybight;
infile nybight delimiter=’,’ firstobs=2;
input year stratum catchnum catchwt numspp depth temp ;
select (stratum);
when (1,2) relwt=1;
when (3,4,5,6) relwt=2;
end;
if year = 1974;
proc surveymeans data=nybight mean clm ;
stratum stratum;
var catchnum catchwt;
weight relwt;
run;

(c) The procedure is the same as that in part (b). Summary statistics for 1975 are:

                     Number of fish        Weight of fish
Stratum   nh   wh    ȳh      s²h          ȳh      s²h
1         14   1     486.9   94132.0      127.0   3948.0
2         16   1     262.7   42234.8      109.5   7189.8
3         15   2     119.6   9592.0       33.5    867.8
4         13   2     238.3   12647.2      84.1    1583.6
5         3    2     119.7   789.3        20.6    18.4
6         3    2     70.7    3194.3       12.0    255.0
Response ȳˆstr SE(ȳˆstr )
Number of fish 223.9 18.5
Weight of fish 70.6 5.7
3.19 (a)

Stratum   Respondents to survey   Respondents to breakaga (nh)
1         288                     232
2         533                     514
3         91                      86
4         73                      67
Total     985                     899
(b) In the table,

varconth = (1 − nh/Nh)(Nh/N)² p̂h(1 − p̂h)/(nh − 1).

Stratum   Nh     nh    p̂h     (Nh/N)p̂h   varcont
1         1374   232   .720   .269       1.01 × 10^-4
2         1960   514   .893   .475       3.90 × 10^-5
3         252    86    .872   .060       4.05 × 10^-6
4         95     67    .866   .022       3.46 × 10^-7
Total     3681   899          .826       1.44 × 10^-4

Thus,

p̂str = 0.826
SE(p̂str) = √(1.44 × 10^-4) = 0.012.

(c) The weights are as follows:


Stratum weight
1 5.922
2 3.813
3 2.930
4 1.418
The answer again is p̂str = 0.826.
(e)
                                   Response Rate (%)
Stratum   Employee Type            Survey   breakaga
1         Faculty                    58        46
2         Classified staff           82        79
3         Administrative staff       93        88
4         Academic professional      77        71
The faculty have the lowest response rate (somehow, this did not surprise me).
Stratification assumes that the nonrespondents in a stratum are similar to respondents
in that stratum.
3.20 (b) This is done by dividing popsize by sampsize for each county. The first few
weights for strata are:
countynum countyname weight
1 Aitkin 1350.0
2 Anoka 1261.4
3 Beltrami 2750.0
(c)
response mean SE 95% CI
radon 4.898551 0.154362 [4.59560775, 5.20149511]
lograd 1.301306 0.028777 [1.24482928, 1.35778371]
(d) The total number of homes with excessive radon is estimated as 722781, with
standard error 28107 and 95% CI [667620, 777942].


3.22 (a) nh = 2000 (Nh Sh / Σi Ni Si)

Stratum   Nh/N   Sh                       Sh Nh/N   nh
1         0.4    √((.10)(.90)) = .3000    .1200     1079
2         0.6    √((.03)(.97)) = .1706    .1024     921
Total     1.0                             .2224     2000
(b) For both proportional and optimal allocation,

S1² = (.10)(.90) = .09   and   S2² = (.03)(.97) = .0291.

Under proportional allocation, n1 = 800 and n2 = 1200.

Vprop(p̂str) = (.4)²(.09/800) + (.6)²(.0291/1200) = 2.67 × 10^-5.

For optimal allocation,

Vopt(p̂str) = (.4)²(.09/1079) + (.6)²(.0291/921) = 2.47 × 10^-5.

For an SRS, p = (.4)(.10) + (.6)(.03) = .058 and

Vsrs(p̂srs) = .058(1 − .058)/2000 = 2.73 × 10^-5.
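All four quantities can be reproduced with a few lines (Python sketch, our names):

```python
import math

# Check the 3.22 comparisons; the proportions and stratum weights are
# from the exercise.
W1, W2 = 0.4, 0.6
p1, p2 = 0.10, 0.03
S1, S2 = math.sqrt(p1 * (1 - p1)), math.sqrt(p2 * (1 - p2))
n = 2000

n1 = round(n * W1 * S1 / (W1 * S1 + W2 * S2))   # Neyman allocation
n2 = n - n1

v_prop = W1**2 * S1**2 / 800 + W2**2 * S2**2 / 1200
v_opt = W1**2 * S1**2 / n1 + W2**2 * S2**2 / n2
p = W1 * p1 + W2 * p2
v_srs = p * (1 - p) / n
print(n1, n2, v_prop, v_opt, v_srs)
```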

3.23
(a) We take an SRS of n/H observations from each of the N/H strata, so there are
a total of

( N/H choose n/H )^H = [ (N/H)! / ( (n/H)! (N/H − n/H)! ) ]^H

possible stratified samples.
(b) By Stirling's formula,

( N choose n ) = N! / (n! (N − n)!)
              ≈ √(2πN) (N/e)^N / [ √(2πn) (n/e)^n · √(2π(N − n)) ((N − n)/e)^(N−n) ]
              = √( N / (2πn(N − n)) ) · N^N / (n^n (N − n)^(N−n)).

We use the same argument, substituting N/H for N and n/H for n in the equation
above, to obtain:

( N/H choose n/H ) ≈ √( NH / (2πn(N − n)) ) · N^(N/H) / (n^(n/H) (N − n)^((N−n)/H)).

Consequently,

( N/H choose n/H )^H / ( N choose n )
   ≈ [ √( NH / (2πn(N − n)) ) · N^(N/H) / (n^(n/H) (N − n)^((N−n)/H)) ]^H
     / [ √( N / (2πn(N − n)) ) · N^N / (n^n (N − n)^(N−n)) ]
   = H^(H/2) [ N / (2πn(N − n)) ]^((H−1)/2).
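The quality of this approximation can be checked numerically; in the sketch below (Python; the values N = 1000, n = 100, H = 5 are our choice, not from the text), the exact ratio of counts agrees with the closed form to within a few percent:

```python
import math

# Compare the exact ratio of sample counts with the Stirling-based
# closed form from 3.23(b); H must divide both N and n.
N, n, H = 1000, 100, 5

exact = math.comb(N // H, n // H) ** H / math.comb(N, n)
approx = H ** (H / 2) * (N / (2 * math.pi * n * (N - n))) ** ((H - 1) / 2)
rel_err = abs(exact - approx) / exact
print(exact, approx, rel_err)
```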

3.24 We wish to minimize

V[t̂str] = Σ_{h=1}^{H} (1 − nh/Nh) Nh² Sh²/nh

subject to the constraint

C = c0 + Σ_{h=1}^{H} ch nh.

Lagrange multipliers are often used for such problems. (See, for example, Thomas,
G.B. and Finney, R. L. (1982). Calculus and Analytic Geometry, Fifth edition.
Reading, MA: Addison-Wesley, p. 617.)
Define

f(n1, . . . , nH, λ) = Σ_{h=1}^{H} (1 − nh/Nh) Nh² Sh²/nh − λ ( C − c0 − Σ_{h=1}^{H} ch nh ).

Then

∂f/∂nk = −Nk² Sk²/nk² + λ ck

for k = 1, . . . , H, and

∂f/∂λ = c0 + Σ_{h=1}^{H} ch nh − C.

Setting the partial derivatives equal to 0 and solving gives

nk = Nk Sk / √(λ ck)

for k = 1, . . . , H, and

Σ_{h=1}^{H} ch nh = C − c0,

which implies that

√λ = Σ_{h=1}^{H} √ch Nh Sh / (C − c0)

and hence that

nk = (Nk Sk / √ck) · (C − c0) / Σ_{h=1}^{H} √ch Nh Sh.

Note that we also have nk ∝ Nk Sk/√ck if we want to minimize the cost for a fixed
variance V. Then, let
g(n1, . . . , nH, λ) = c0 + Σ_{h=1}^{H} ch nh − λ [ V − Σ_{h=1}^{H} (1 − nh/Nh) Nh² Sh²/nh ].

Then

∂g/∂nk = ck − λ Nk² Sk²/nk²

and

∂g/∂λ = V − Σ_{h=1}^{H} (1 − nh/Nh) Nh² Sh²/nh.

Setting the partial derivatives equal to 0 and solving gives

nk = √λ Nk Sk / √ck.

3.25 (a) We substitute nh,Neyman = n Nh Sh / Σ_{l=1}^{H} Nl Sl for nh in (3.4):

VNeyman(t̂str) = Σ_{h=1}^{H} (1 − nh/Nh) Nh² Sh²/nh
             = Σ_{h=1}^{H} [ 1 − (Nh Sh / Σ_{l=1}^{H} Nl Sl)(n/Nh) ] Nh² (Sh²/n) (Σ_{l=1}^{H} Nl Sl)/(Nh Sh)
             = Σ_{h=1}^{H} [ 1 − n Sh / Σ_{l=1}^{H} Nl Sl ] Nh Sh (Σ_{l=1}^{H} Nl Sl)/n
             = (1/n) ( Σ_{h=1}^{H} Nh Sh )² − Σ_{h=1}^{H} Nh Sh².

(b)

Vprop(t̂str) − VNeyman(t̂str) = (N/n) Σ_{h=1}^{H} Nh Sh² − Σ_{h=1}^{H} Nh Sh²
                                − (1/n) ( Σ_{h=1}^{H} Nh Sh )² + Σ_{h=1}^{H} Nh Sh²
                            = (N/n) Σ_{h=1}^{H} Nh Sh² − (1/n) ( Σ_{h=1}^{H} Nh Sh )²
                            = (N²/n) [ Σ_{h=1}^{H} (Nh/N) Sh² − ( Σ_{h=1}^{H} (Nh/N) Sh )² ]
                            = (N²/n) Σ_{h=1}^{H} (Nh/N) [ Sh² − Sh Σ_{l=1}^{H} (Nl/N) Sl ].

But

Σ_{h=1}^{H} (Nh/N) [ Sh − Σ_{l=1}^{H} (Nl/N) Sl ]²
   = Σ_{h=1}^{H} (Nh/N) [ Sh² − 2 Sh Σ_{l=1}^{H} (Nl/N) Sl + ( Σ_{l=1}^{H} (Nl/N) Sl )² ]
   = Σ_{h=1}^{H} (Nh/N) Sh² − ( Σ_{l=1}^{H} (Nl/N) Sl )²,

proving the result.


(c) When H = 2, the difference from (b) is

(N²/n) Σ_{h=1}^{2} (Nh/N) [ Sh − Σ_{l=1}^{2} (Nl/N) Sl ]²
   = (N²/n) [ (N1/N)( S1 − (N1/N)S1 − (N2/N)S2 )² + (N2/N)( S2 − (N1/N)S1 − (N2/N)S2 )² ]
   = (N²/n) [ (N1/N)( (N2/N)S1 − (N2/N)S2 )² + (N2/N)( (N1/N)S2 − (N1/N)S1 )² ]
   = (N²/n) [ (N1/N)(N2/N)² + (N2/N)(N1/N)² ] (S1 − S2)²
   = (N1 N2/n) (S1 − S2)².
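Both closed forms can be spot-checked on a small example (Python; the two-stratum population here is our invention, chosen for easy arithmetic):

```python
# Spot-check the 3.25 identities on a two-stratum example.
N_h, S_h = [60, 40], [3.0, 1.0]
N, n = sum(N_h), 10

def v_strat(alloc):
    # V(t-hat_str) = sum of (1 - n_h/N_h) N_h^2 S_h^2 / n_h
    return sum((1 - m / Nh) * Nh**2 * Sh**2 / m
               for Nh, Sh, m in zip(N_h, S_h, alloc))

tot = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
prop = [n * Nh / N for Nh in N_h]                       # proportional
neyman = [n * Nh * Sh / tot for Nh, Sh in zip(N_h, S_h)]  # Neyman

v_prop, v_ney = v_strat(prop), v_strat(neyman)
v_ney_closed = tot**2 / n - sum(Nh * Sh**2 for Nh, Sh in zip(N_h, S_h))
diff_closed = N_h[0] * N_h[1] * (S_h[0] - S_h[1])**2 / n
print(v_prop, v_ney, v_ney_closed, diff_closed)
```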

3.34
(a) In the data step, define the variable one to have the value 1 for every observation.
Then Σ_{i∈S} wi · 1 = N̂. Here, Σ_{i∈S} wi = 85174776. The standard error is zero
because this is a stratified sample. The weights are Nh/nh, so the sum of the weights
in stratum h is Nh exactly. There is no sampling variability.
Here is the code used to obtain these values:

proc surveymeans data=vius mean clm sum clsum;


weight tabtrucks;
stratum stratum;
var one miles_annl mpg;

(b) The estimated total number of truck miles driven is 1.115 × 10^12; the standard
error is 6492344384 and a 95% CI is [1.102 × 10^12, 1.127 × 10^12].
(c) Because these are stratification variables, we can calculate estimates for each
truck type by summing whj yhj separately for each h. We obtain:

proc sort data=vius;


by trucktype;
proc surveymeans data=vius sum clsum;
by trucktype;
weight tabtrucks;
stratum stratum;
var miles_annl;
ods output Statistics=Mystat;
proc print data=Mystat;
run;

Obs VarName VarLabel Sum

1 MILES_ANNL Number of Miles Driven During 2002 428294502082


2 MILES_ANNL Number of Miles Driven During 2002 541099850893
3 MILES_ANNL Number of Miles Driven During 2002 41279084490
4 MILES_ANNL Number of Miles Driven During 2002 31752656137
5 MILES_ANNL Number of Miles Driven During 2002 72301789843

Obs LowerCLSum StdDev UpperCLSum

1 4.19064E11 4708839922 4.37525E11


2 5.32459E11 4408042207 5.4974E11
3 4.05032E10 395841910 4.2055E10
4 3.107E10 348294378 3.24353E10
5 7.12861E10 518195242 7.33175E10

(d) The estimated average mpg is 16.515427 with standard error 0.039676; a 95% CI
is [16.4377, 16.5932]. These CIs are very small because the sample size is so large.
Chapter 4

Ratio and Regression Estimation

4.2 (a) We have tx = 69, ty = 83, Sx = 4.092676, Sy = 5.333333, R = 0.8112815, and
B = 1.202899.
(b)


Sample Sample
Number S x̄S ȳS B̂ t̂SRS t̂yr
1 {1, 2, 3} 10.333 10.000 0.968 90.000 66.774
2 {1, 2, 4} 10.667 11.333 1.063 102.000 73.313
3 {1, 2, 5} 8.000 8.333 1.042 75.000 71.875
4 {1, 2, 6} 7.667 6.000 0.783 54.000 54.000
5 {1, 2, 7} 10.333 11.000 1.065 99.000 73.452
6 {1, 2, 8} 7.667 8.000 1.043 72.000 72.000
7 {1, 2, 9} 8.333 7.000 0.840 63.000 57.960
8 {1, 3, 4} 12.000 13.333 1.111 120.000 76.667
9 {1, 3, 5} 9.333 10.333 1.107 93.000 76.393
10 {1, 3, 6} 9.000 8.000 0.889 72.000 61.333
11 {1, 3, 7} 11.667 13.000 1.114 117.000 76.886
12 {1, 3, 8} 9.000 10.000 1.111 90.000 76.667
13 {1, 3, 9} 9.667 9.000 0.931 81.000 64.241
14 {1, 4, 5} 9.667 11.667 1.207 105.000 83.276
15 {1, 4, 6} 9.333 9.333 1.000 84.000 69.000
16 {1, 4, 7} 12.000 14.333 1.194 129.000 82.417
17 {1, 4, 8} 9.333 11.333 1.214 102.000 83.786
18 {1, 4, 9} 10.000 10.333 1.033 93.000 71.300
19 {1, 5, 6} 6.667 6.333 0.950 57.000 65.550
20 {1, 5, 7} 9.333 11.333 1.214 102.000 83.786
21 {1, 5, 8} 6.667 8.333 1.250 75.000 86.250
22 {1, 5, 9} 7.333 7.333 1.000 66.000 69.000
23 {1, 6, 7} 9.000 9.000 1.000 81.000 69.000
24 {1, 6, 8} 6.333 6.000 0.947 54.000 65.368
25 {1, 6, 9} 7.000 5.000 0.714 45.000 49.286
26 {1, 7, 8} 9.000 11.000 1.222 99.000 84.333
27 {1, 7, 9} 9.667 10.000 1.034 90.000 71.379
28 {1, 8, 9} 7.000 7.000 1.000 63.000 69.000
29 {2, 3, 4} 10.000 12.333 1.233 111.000 85.100
30 {2, 3, 5} 7.333 9.333 1.273 84.000 87.818
31 {2, 3, 6} 7.000 7.000 1.000 63.000 69.000
32 {2, 3, 7} 9.667 12.000 1.241 108.000 85.655
33 {2, 3, 8} 7.000 9.000 1.286 81.000 88.714
34 {2, 3, 9} 7.667 8.000 1.043 72.000 72.000
35 {2, 4, 5} 7.667 10.667 1.391 96.000 96.000
36 {2, 4, 6} 7.333 8.333 1.136 75.000 78.409
37 {2, 4, 7} 10.000 13.333 1.333 120.000 92.000
38 {2, 4, 8} 7.333 10.333 1.409 93.000 97.227
39 {2, 4, 9} 8.000 9.333 1.167 84.000 80.500
40 {2, 5, 6} 4.667 5.333 1.143 48.000 78.857

Sample Sample
Number S x̄S ȳS B̂ t̂SRS t̂yr
41 {2, 5, 7} 7.333 10.333 1.409 93.000 97.227
42 {2, 5, 8} 4.667 7.333 1.571 66.000 108.429
43 {2, 5, 9} 5.333 6.333 1.188 57.000 81.938
44 {2, 6, 7} 7.000 8.000 1.143 72.000 78.857
45 {2, 6, 8} 4.333 5.000 1.154 45.000 79.615
46 {2, 6, 9} 5.000 4.000 0.800 36.000 55.200
47 {2, 7, 8} 7.000 10.000 1.429 90.000 98.571
48 {2, 7, 9} 7.667 9.000 1.174 81.000 81.000
49 {2, 8, 9} 5.000 6.000 1.200 54.000 82.800
50 {3, 4, 5} 9.000 12.667 1.407 114.000 97.111
51 {3, 4, 6} 8.667 10.333 1.192 93.000 82.269
52 {3, 4, 7} 11.333 15.333 1.353 138.000 93.353
53 {3, 4, 8} 8.667 12.333 1.423 111.000 98.192
54 {3, 4, 9} 9.333 11.333 1.214 102.000 83.786
55 {3, 5, 6} 6.000 7.333 1.222 66.000 84.333
56 {3, 5, 7} 8.667 12.333 1.423 111.000 98.192
57 {3, 5, 8} 6.000 9.333 1.556 84.000 107.333
58 {3, 5, 9} 6.667 8.333 1.250 75.000 86.250
59 {3, 6, 7} 8.333 10.000 1.200 90.000 82.800
60 {3, 6, 8} 5.667 7.000 1.235 63.000 85.235
61 {3, 6, 9} 6.333 6.000 0.947 54.000 65.368
62 {3, 7, 8} 8.333 12.000 1.440 108.000 99.360
63 {3, 7, 9} 9.000 11.000 1.222 99.000 84.333
64 {3, 8, 9} 6.333 8.000 1.263 72.000 87.158
65 {4, 5, 6} 6.333 8.667 1.368 78.000 94.421
66 {4, 5, 7} 9.000 13.667 1.519 123.000 104.778
67 {4, 5, 8} 6.333 10.667 1.684 96.000 116.211
68 {4, 5, 9} 7.000 9.667 1.381 87.000 95.286
69 {4, 6, 7} 8.667 11.333 1.308 102.000 90.231
70 {4, 6, 8} 6.000 8.333 1.389 75.000 95.833
71 {4, 6, 9} 6.667 7.333 1.100 66.000 75.900
72 {4, 7, 8} 8.667 13.333 1.538 120.000 106.154
73 {4, 7, 9} 9.333 12.333 1.321 111.000 91.179
74 {4, 8, 9} 6.667 9.333 1.400 84.000 96.600
75 {5, 6, 7} 6.000 8.333 1.389 75.000 95.833
76 {5, 6, 8} 3.333 5.333 1.600 48.000 110.400
77 {5, 6, 9} 4.000 4.333 1.083 39.000 74.750
78 {5, 7, 8} 6.000 10.333 1.722 93.000 118.833
79 {5, 7, 9} 6.667 9.333 1.400 84.000 96.600
80 {5, 8, 9} 4.000 6.333 1.583 57.000 109.250
81 {6, 7, 8} 5.667 8.000 1.412 72.000 97.412
82 {6, 7, 9} 6.333 7.000 1.105 63.000 76.263
83 {6, 8, 9} 3.667 4.000 1.091 36.000 75.273
84 {7, 8, 9} 6.333 9.000 1.421 81.000 98.053

Average 7.667 9.222 1.214 83.000 83.733


Variance 3.767 6.397 0.044 518.169 208.083

(c)
[Figure: side-by-side histograms of the 84 sample values of the SRS estimate of the
total (left) and of the ratio estimate of the total (right), each on a horizontal axis
running from 50 to 150.]

The shapes are actually quite similar; however, it appears that the histogram of the
ratio estimator is a little less spread out.
(d) The mean of the sampling distribution of t̂yr is 83.733; the variance is 208.083 and
the bias is 83.733 − 83 = 0.733. By contrast, the mean of the sampling distribution
of N ȳ is 83 and its variance is 518.169.
(e) From (4.6),
Bias (ȳˆr ) = 0.07073094.

4.3 (a) The solid line is from regression estimation; the dashed line from ratio
estimation; the dashed/dotted line has equation y = 107.4.
[Figure: scatterplot of age (vertical axis, 60 to 160) versus diameter (horizontal
axis, 6 to 12) for the 20 sampled trees, with the three fitted lines described above.]

(b, c)
Method ȳˆ SE (ȳˆ)
SRS, ȳ 107.4 6.35
Ratio 117.6 4.35
Regression 118.4 3.96

For ratio estimation, B̂ = 11.41946; for regression estimation, B̂0 = −7.808 and
B̂1 = 12.250. Note that the sample correlation of age and diameter is 0.78, so we
would expect both ratio and regression estimation to improve precision.
To calculate V̂(ȳ̂r) using (4.9), note that s²e = 321.933, so that

V̂(ȳ̂r) = (1 − 20/1132)(10.3/9.405)²(321.933/20) = 18.96

and SE(ȳ̂r) = 4.35. For the regression estimator, we have s²e = 319.6, so

V̂(ȳ̂reg) = (1 − 20/1132)(319.6/20) = 15.7.

From a design-based perspective, it makes little difference which estimator is used.


From a model-based perspective, though, a plot of residuals vs. predicted values
exhibits a “funnel shape” indicating that the variability increases with x. Thus a
model-based analysis for these data should incorporate the unequal variances. From
that perspective, the ratio model might be more appropriate.
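The numbers in the table can be reproduced directly from the 20 sampled trees (the data are the same as in the SAS datalines below); a Python sketch with our own variable names:

```python
# Tree data: (diameter, age) for the SRS of n = 20 from N = 1132 trees.
diam = [12.0, 11.4, 7.9, 9.0, 10.5, 7.9, 7.3, 10.2, 11.7, 11.3,
        5.7, 8.0, 10.3, 12.0, 9.2, 8.5, 7.0, 10.7, 9.3, 8.2]
age = [125, 119, 83, 85, 99, 117, 69, 133, 154, 168,
       61, 80, 114, 147, 122, 106, 82, 88, 97, 99]
n, N, xbarU = 20, 1132, 10.3   # xbarU = population mean diameter

xbar = sum(diam) / n
ybar = sum(age) / n
B = ybar / xbar                # ratio estimate of the slope
ybar_ratio = B * xbarU

# Ratio SE from the residuals e_i = y_i - B x_i (their mean is 0 here)
s2e = sum((y - B * x) ** 2 for x, y in zip(diam, age)) / (n - 1)
se_ratio = ((1 - n / N) * (xbarU / xbar) ** 2 * s2e / n) ** 0.5

# Regression estimator
Sxx = sum((x - xbar) ** 2 for x in diam)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(diam, age))
b1 = Sxy / Sxx
ybar_reg = ybar + b1 * (xbarU - xbar)
print(round(ybar_ratio, 1), round(se_ratio, 2), round(b1, 3), round(ybar_reg, 1))
```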
Note that the variances calculated using SAS PROC SURVEYREG are larger since
they use V̂2 from Section 11.7.
Here is code for SAS:

data trees;
input treenum diam age @@;
sampwt = 1132/20;
datalines;
1 12.0 125 11 5.7 61
2 11.4 119 12 8.0 80
3 7.9 83 13 10.3 114
4 9.0 85 14 12.0 147
5 10.5 99 15 9.2 122
6 7.9 117 16 8.5 106
7 7.3 69 17 7.0 82
8 10.2 133 18 10.7 88
9 11.7 154 19 9.3 97
10 11.3 168 20 8.2 99
;
proc print data=trees;
run;

/* proc surveymeans will estimate ratios with keyword ’ratio’ */

proc surveymeans data=trees total=1132 mean stderr clm


sum clsum ratio ;
var diam age; /* need both in var statement */
ratio ’age/diameter’ age/diam;
weight sampwt;
ods output Statistics=statsout Ratio=ratioout;
run;

/* Can get ratio estimates of totals by taking output from


proc surveymeans and multiplying by N */

data ratioout1;
set ratioout;
xmean = 10.3;
ratiomean = ratio*xmean;
semean = stderr*xmean;
lowercls = lowercl*xmean;
uppercls = uppercl*xmean;

proc print data = ratioout1;


run;

/* Can also calculate ratio estimate by hand */

data treesresid;
set trees;
resid = age - 11.419458*diam;
resid2 = resid*(10.3/ 9.405);

proc univariate data=treesresid;


run;

proc surveyreg data=trees total=1132;


model age = diam / clparm solution ;
/* fits the regression model */
weight sampwt;
estimate ’Mean age of trees’ intercept 1 diam 10.3;
run;

proc gplot data=trees;


plot age*diam;
run;

data trees2;
set trees;
resid = age - (-7.8080877 + 12.2496636*diam);
proc surveymeans data=trees2 total =1132;
weight sampwt;
var resid age diam;
run;

The output from proc surveyreg gives a larger standard error for the regression
estimator:

Analysis of Estimable Functions



Standard
Parameter Estimate Error t Value Pr > |t|

Mean age of trees 118.363449 5.20417420 22.74 <.0001

4.5 There are 85 18-hole courses in the sample. For these 85 courses, the sample
mean weekend greens fee is ȳd = 34.829 and the sample variance is s²d = 395.498.
Using results from Section 4.3,

SE[ȳd] = √(395.498/85) = 2.16.

Here is SAS code:

filename golfsrs
data golfsrs;
infile golfsrs delimiter="," dsd firstobs=2;
/* The dsd option allows SAS to read the missing values between
successive delimiters */
input RN state $ holes type $ yearblt wkday18 wkday9 wkend18
wkend9 backtee rating par cart18 cart9 caddy $ pro $;
sampwt = 14938/120;
if holes = 18 then holes18 = 1;
else holes18=0;

proc surveymeans data=golfsrs total = 14938;


weight sampwt;
var wkend18;
domain holes18;
run;

Data Summary

Number of Observations 120


Sum of Weights 14938

Statistics

Std Error
Variable N Mean of Mean

wkend18 85 34.828824 2.148380

Statistics

Variable 95% CL for Mean

wkend18 30.5565341 39.1011129

Domain Analysis: holes18

Std Error
holes18 Variable N Mean of Mean

0 wkend18 0 . .
1 wkend18 85 34.828824 2.144660

Domain Analysis: holes18

holes18 Variable 95% CL for Mean

0 wkend18 . .
1 wkend18 30.5639320 39.0937150

4.6 As you can see from the plot of weekend greens fee vs. back-tee yardage, this
is not a “classical” straight-line relationship. The variability in weekend greens fee
appears to increase with the back-tee yardage. Nevertheless, we can estimate the
slope and intercept, with

ŷ = −37.26 + 0.0113x.

(We’ll discuss standard errors in Chapter 11.) For estimating the ratio, we have

B̂ = ȳ/x̄ = 34.83/6392.29 = 0.00545.

Using (4.10), with s²e the sample variance of the residuals,

V̂(B̂) ≈ s²e/(n x̄²) = 362.578/((85)(6392.29)²) = 1.044 × 10⁻⁷

SE(B̂) = 0.00032.
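These ratio calculations can be checked in a few lines of Python (a sketch using the summary numbers quoted above):

```python
import math

# B_hat = ybar/xbar and its linearized SE from (4.10); fpc omitted.
n = 85
ybar, xbar = 34.83, 6392.29
s2_e = 362.578                 # sample variance of residuals y_i - B_hat * x_i

B_hat = ybar / xbar
se_B = math.sqrt(s2_e / (n * xbar**2))
print(round(B_hat, 5), round(se_B, 5))  # 0.00545 0.00032
```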

4.7 (a) 88 courses have a golf professional. For these 88 courses, ȳd1 = 23.5983 and
s²d1 = 387.7194, so

V̂(ȳd1) = 387.7194/88 = 4.4059.

(b) For the 32 courses without a golf professional, ȳd2 = 10.6797 and s²d2 = 19.146,
so

V̂(ȳd2) = 19.146/32 = 0.5983.

4.8 (a) (b) B̂ = ȳ/x̄ = 297897/647.7467 = 459.8975. Thus

t̂yr = tx B̂ = (2087759)(459.8975) = 960,155,061.

The estimated variance of the residuals about the line y = B̂x is

s²e = 149,902,393,481.

Using (4.11), then, with farms87 as the auxiliary variable,

SE[t̂yr] = 3078 √(1 − 300/3078) √(s²e/300) = 65,364,822.

(c) The least squares regression equation is

ŷ = 267029.8 + 47.65325x.

Then

ȳˆreg = 267029.8 + 47.65325(647.7467) = 297897.04

and

t̂yreg = 3078 ȳˆreg = 916,927,075.

The estimated variance of the residuals from the regression is s²e = 118,293,647,832,
which implies from (4.19) that

SE[t̂yreg] = 3078 √(1 − 300/3078) √(s²e/300) = 58,065,813.
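As a numerical cross-check, the ratio and regression totals and their standard errors can be reproduced from the summary statistics (a Python sketch, not the book's code):

```python
import math

N, n = 3078, 300
t_x, xbar, ybar = 2087759, 647.7467, 297897.0  # farms87 auxiliary; y = acres92

# Ratio estimator and its SE, as in (4.11)
B_hat = ybar / xbar
t_ratio = t_x * B_hat
se_ratio = N * math.sqrt((1 - n / N) * 149_902_393_481 / n)

# Regression estimator and its SE, as in (4.19); fitted slope/intercept above
ybar_reg = 267029.8 + 47.65325 * xbar
t_reg = N * ybar_reg
se_reg = N * math.sqrt((1 - n / N) * 118_293_647_832 / n)

print(round(t_ratio), round(se_ratio))
print(round(t_reg), round(se_reg))
```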

(d) Clearly, for this response, it is better to use acres87 as an auxiliary variable.
The correlation of farms87 with acres92 is only 0.06; using farms87 as an auxiliary
variable does not improve on the SRS estimate N ȳ. The correlation of acres92 and
acres87, however, exceeds 0.99. Here are the various estimates for the population
total of acres92:
Estimate                     t̂              SE[t̂]
SRS, N ȳ                    916,927,110    58,169,381
Ratio, x = acres87          951,513,191     5,344,568
Ratio, x = farms87          960,155,061    65,364,822
Regression, x = farms87     916,927,075    58,065,813

Moral: Ratio estimation can lead to greatly increased precision, but should not
be used blindly. In this case, ratio estimation with auxiliary variable farms87 had
a larger standard error than if no auxiliary information were used at all. The
regression estimate of t is similar to N ȳ because the regression slope is small
relative to the magnitude of the data. The regression slope is not significantly
different from 0; as can be seen from the picture in (a), the straight-line regression
model does not describe the counties with few but large farms.
4.9 We use results from Section 4.2. (a) Let yi = acres92 for county i, and
xi = farms92 for county i. Define ui. Then

t̂y1 = N ū = 3078(161773.8) = 497,939,808

and

SE[t̂y1] = N √(1 − 300/3078) √(s²u/300) = 3078 √(1 − 300/3078) √(109,710,284,064/300) = 55,919,525.

(b) Now

t̂y2 = 3078(136123.2) = 418,987,302

and

SE[t̂y2] = 3078 √(1 − 300/3078) √(53,195,371,851/300) = 38,938,277.

4.10 (a)

(b) Here is code and output from SAS:

filename cherries 'cherry.csv';



data cherry;
infile cherries delimiter=',' firstobs=2;
input diam height vol;
sampwt = 2967/31;
obsnum = _n_;
label diam = 'diam (in) at 4.5 feet'
      height = 'height of tree (feet)'
      vol = 'volume of tree (cubic feet)'
      sampwt = 'sampling weight'
;
/* Plot and print the data set */

proc print data = cherry;
var diam height vol sampwt;

proc gplot data = cherry;
plot vol *diam ;

proc surveymeans data = cherry total=2967 mean clm sum clsum ratio ;
weight sampwt;
var diam vol;
ratio 'vol/diam' vol/diam;
ods output Statistics=statsout Ratio=ratioout;
run;

data ratioout1;
set ratioout;
xtotal = 41835;
ratiosum = ratio*xtotal;
sesum = stderr*xtotal;
lowercls = lowercl*xtotal;
uppercls = uppercl*xtotal;

proc print data = ratioout1;


run;

Using this code, we obtain t̂yr = 95272.16 with 95% CI of [84,098, 106,446].
(c) SAS code and output follow:

proc surveyreg data=cherry total=100;
model vol=diam / clparm solution;
weight sampwt;
estimate 'Total volume' intercept 2967 diam 41835;
/* substitute N for intercept, t_x for diam */
run;

Analysis of Estimable Functions

Standard
Parameter Estimate Error t Value Pr > |t|

Total volume 102318.860 2233.70776 45.81 <.0001

Analysis of Estimable Functions

95% Confidence
Parameter Interval

Total volume 97757.0204 106880.700

Note that the estimate from regression estimation is quite a bit higher than the
estimate from ratio estimation. In addition, the CI for regression estimation is
narrower than the CIs for t̂yr or N ȳ. This is because the regression model is a
better fit to the data than the ratio model.
4.11 (a) The variable number of physicians has a skewed distribution. The first
histogram excludes Cook County, Illinois (with yi = 15,153) for slightly better
visibility.

The next histogram, of all 100 counties, depicts the logarithm of (number of
physicians + 1) (which is still skewed).

(b) ȳ = 297.17, s²y = 2,534,052.

Thus the estimated total number of physicians is

N ȳ = (3141)(297.17) = 933,411

with

SE(N ȳ) = N √((1 − n/N) s²y/n)
        = 3141 √((1 − 100/3141)(2,534,052/100))
        = 3141 √24533.75
        = 491,983.

The standard error is large compared with the estimated total number of physicians.
The extreme skewness of the data makes us suspect that N ȳ does not follow an
approximate normal distribution, and that a confidence interval of the form N ȳ ±
1.96 SE(N ȳ) would not have 95% coverage in practice. In fact, when we substitute
sample quantities into (2.23), we obtain

nmin = 28 + 25(6.04)² = 940

as the required minimum sample size for ȳ to approximately follow a normal
distribution.
(c) Again, we omit Cook County.

There appears to be increasing variance as population increases, so ratio estimation
may be more appropriate.
(d) Ratio estimation:

B̂ = ȳ/x̄ = 297.17/118531.2 = 0.002507.

t̂yr = B̂ tx = (0.002507)(255,077,536) = 639,506.

Using (4.11),

SE(t̂yr) = 3141 √((1 − 100/3141) × (255077536)²/(372306374)² × 172268/100) = 87885.
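The same calculation in Python (a sketch; the factor (tx/t̂x)² matches the form of the SE used above):

```python
import math

# Ratio estimate of the total number of physicians and its SE.
N, n = 3141, 100
ybar, xbar = 297.17, 118531.2
t_x = 255_077_536        # census total of the auxiliary variable (totpop)
t_x_hat = 372_306_374    # estimated total of x from the sample
s2_e = 172_268           # variance of residuals y_i - B_hat * x_i

B_hat = ybar / xbar
t_yr = B_hat * t_x
se = N * math.sqrt((1 - n / N) * (t_x / t_x_hat) ** 2 * s2_e / n)
print(round(t_yr), round(se))
```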

Regression estimation:

B̂0 = −54.23,  B̂1 = 0.00296.

From (4.15) and (4.19),

ȳˆreg = B̂0 + B̂1 x̄U = −54.23 + (0.00296)(255,077,536/3141) = 186.52

and

SE(ȳˆreg) = √((1 − n/N) s²e/n) = √((1 − 100/3141)(114644.1/100)) = 33.316.

Consequently,

t̂yreg = N ȳˆreg = 3141(186.52) = 585,871

and

SE(t̂yreg) = N SE(ȳˆreg) = 3141(33.316) = 104,645.

The standard error from proc surveyreg is smaller and equals 92,535.
(e) Ratio estimation and regression estimation both lead to a smaller standard error,
and an estimate that is closer to the true value.
Here is SAS code for performing these analyses:

data counties;
infile counties firstobs=2 delimiter=",";
input RN State County landarea totpop physician enroll
percpub civlabor unemp farmpop numfarm farmacre fedgrant
fedciv milit veterans percviet ;
sampwt = 3141/100;
logphys = log(physician+1);

/* The following histogram is really, really skewed because of Cook County */
proc univariate data=counties ;
var physician;
histogram / endpoints = 0 to 16000 by 500;
run;

data countnoCook;
set counties;
if physician gt 10000 then delete;

proc univariate data=countnoCook;
var physician ;
histogram / endpoints = 0 to 4400 by 200;
run;

proc univariate data=counties;
var logphys ;
histogram / endpoints = 0 to 10 by .5;
run;

proc surveymeans data=counties total = 3141 mean clm sum clsum;
weight sampwt;
var physician;
run;

proc gplot data=countnoCook;
plot physician * totpop;
run;

proc surveymeans data=counties total=3141 mean clm sum clsum ratio;
var physician totpop; /* need both in var statement */
ratio 'physician/totpop' physician/totpop;
weight sampwt;
ods output Statistics=statsout Ratio=ratioout;
run;

/* Can get ratio estimates of totals by taking output from
proc surveymeans and multiplying by t_x */

data ratioout1;
set ratioout;
xtot = 255077536;
ratiotot = ratio*xtot;
setot = stderr*xtot;
lowercls = lowercl*xtot;
uppercls = uppercl*xtot;

proc print data = ratioout1;


run;

/* Can also calculate ratio estimate by hand */

data resid;
set counties;
resid = physician -0.002507*totpop;
resid2 = resid*(255077536/ 372306374);
/* Use g-weights in SE formula*/

proc surveymeans data=resid total=3141 mean clm sum clsum;
weight sampwt;
var resid resid2;
run;

proc surveyreg data=counties total=3141;
model physician=totpop / clparm solution ;
/* fits the regression model */
weight sampwt;
estimate 'Total number of physicians' intercept 3141
totpop 255077536;
estimate 'Average number of physicians' intercept 3141
totpop 255077536/divisor = 3141;
run;

4.12 (a) The distribution appears to be skewed, but not quite as skewed as the
distribution in Exercise 4.11.

(b) ȳ = 1146.87, s²y = 1,138,630.

Thus the estimated total farm population, using N ȳ, is

N ȳ = (3141)(1146.87) = 3,602,319

with

SE(N ȳ) = 3141 √((1 − 100/3141)(1,138,630/100)) = 329,787.
SAS proc surveymeans gives the same result.
(c) Note that corr(farmpop, landarea) = −0.058. We would not expect ratio or
regression estimation to do well here.

(d) Ratio estimation:

B̂ = ȳ/x̄ = 1146.87/944.92 = 1.21372.

s²e = 3,297,971

t̂yr = B̂ tx = (1.21372)(3,536,278) = 4,292,058

SE(t̂yr) = 3141 √((1 − 100/3141) × (3,536,278)²/(2967994)² × 3,297,971/100) = 668,727.
Note that the SE for the ratio estimate is higher than that for N ȳ.
Regression estimation:

B̂0 = 1197.35,  B̂1 = −0.05342218,  s²e = 1,134,785.

From (4.15) and (4.19),

ȳˆreg = B̂0 + B̂1 x̄U = 1197.35 − 0.0534(3,536,278/3141) = 1137.2

SE(ȳˆreg) = √((1 − 100/3141)(1,134,785/100)) = 104.82.

Consequently,

t̂yreg = N ȳˆreg = 3141(1137.2) = 3,571,960

SE(t̂yreg) = N SE(ȳˆreg) = 3141(104.82) = 329,230.

SAS proc surveyreg gives SE 350,595.
(e) The “true” value is ty = 3,871,583.

Here is SAS code that may be used to compute these estimates. See Exercise 4.11
solution for reading in the data.

proc univariate data=counties ;
var farmpop;
histogram / endpoints = 0 to 6000 by 500;
run;

proc gplot data=counties;
plot farmpop * landarea;
run;

proc surveymeans data=counties total=3141 mean clm sum clsum ratio ;
var farmpop landarea; /* need both in var statement */
ratio 'farmpop/landarea' farmpop/landarea;
weight sampwt;
ods output Statistics=statsout Ratio=ratioout;

data ratioout1;
set ratioout;
xtot = 3536278;
ratiotot = ratio*xtot;
setot = stderr*xtot;
lowercls = lowercl*xtot;
uppercls = uppercl*xtot;

proc print data = ratioout1;


run;

proc surveyreg data=counties total=3141;
model farmpop=landarea / clparm solution ;
weight sampwt;
estimate 'Total farmpop' intercept 3141 landarea 3536278;
estimate 'Average farmpop' intercept 3141
landarea 3536278/divisor = 3141;
run;

4.13 (a) As in Exercise 4.11, we omit Cook County, Illinois, from the histogram so
we can see the other data points better. Cook County has 457,880 veterans. The
distribution is very skewed; Cook County is an extreme outlier.

(b) ȳ = 12249.71, s²y = 2,263,371,150.

N ȳ = (3141)(12249.71) = 38,476,339

SE(N ȳ) = 3141 √((1 − 100/3141)(2,263,371,150/100)) = 14,703,478.
3141 100

(c) Again, Cook County is omitted from this plot. These data appear very close to
a straight line. We would expect ratio or regression estimation to help immensely.

(d) Ratio estimation:

B̂ = ȳ/x̄ = 12249.71/118531.2 = 0.1033

t̂yr = B̂ tx = (0.1033)(255,077,536) = 26,361,219.

s²e = (1/99) Σ_{i∈S} (yi − B̂ xi)² = 59,771,493

SE(t̂yr) = 3141 √((1 − 100/3141) × (255077536)²/(372306374)² × 59,771,493/100) = 1,637,046.

Regression estimation:

B̂0 = 1534.201,  B̂1 = 0.0904,  s²e = 13,653,852

ȳˆreg = B̂0 + B̂1 x̄U = 1534.2 + (0.0904)(255,077,536/3141) = 8875.7.

SE(ȳˆreg) = √((1 − 100/3141)(13,653,852/100)) = 363.58

t̂yreg = N ȳˆreg = 27,878,564

SE(t̂yreg) = N SE(ȳˆreg) = 1,142,010.

The SE from proc surveyreg is 1,063,699.
Here is SAS code. The data are read in as in Exercise 4.11.

/* The following histogram is skewed because of Cook County */
proc univariate data=counties ;
var veterans;
histogram / endpoints = 0 to 440000 by 10000;
run;

data countnoCook;
set counties;
if veterans gt 400000 then delete;

proc univariate data=countnoCook;
var veterans ;
histogram / endpoints = 0 to 100000 by 5000;
run;

proc gplot data=countnoCook;
plot veterans * totpop;
run;

proc surveymeans data=counties total=3141 mean clm sum clsum ratio;
var veterans totpop; /* need both in var statement */
ratio 'veterans/totpop' veterans/totpop;
weight sampwt;
ods output Statistics=statsout Ratio=ratioout;
run;

data ratioout1;
set ratioout;
xtot = 255077536;
ratiotot = ratio*xtot;
setot = stderr*xtot;
lowercls = lowercl*xtot;
uppercls = uppercl*xtot;

proc print data = ratioout1;


run;

/* Can also calculate ratio estimate by hand */

data resid;
set counties;
resid = veterans - 0.1033*totpop;
/* B_hat = 0.1033 for veterans/totpop, from part (d) */
resid2 = resid*(255077536/ 372306374);
/* Use g-weights in SE formula */

proc surveymeans data=resid total=3141 mean clm sum clsum;
weight sampwt;
var resid resid2;
run;

proc surveyreg data=counties total=3141;
model veterans=totpop / clparm solution ;
/* fits the regression model */
weight sampwt;
estimate 'Total number of veterans' intercept 3141
totpop 255077536;
estimate 'Average number of veterans'
intercept 3141 totpop 255077536/divisor = 3141;
run;

(e) Here, ty = 27,481,055.


4.15 (a) A 95% CI for the average concentration of lead is

127 ± 1.96 (146/√121) = [101.0, 153.0].

For copper, the corresponding interval is

35 ± 1.96 (16/√121) = [32.1, 37.85].

Note that we do not use an fpc here. Soil samples are collected at grid intersections,
and we may assume that the amount of soil in the sample is negligible compared
with that in the region.
(b) Because the samples are systematically taken on grid points, we know that
Nh/N = nh/n. Using (4.21),

LEAD:   ȳpost = (82/121)(71) + (31/121)(259) + (8/121)(189) = 127
COPPER: ȳpost = (82/121)(28) + (31/121)(50) + (8/121)(45) = 35.

Not surprisingly, these are the same numbers as in the first table. The variances
and confidence intervals, however, differ. From (4.22),

LEAD:   V̂(ȳpost) = (82/121)(28²/121) + (31/121)(232²/121) + (8/121)(79²/121) = 121.8
COPPER: V̂(ȳpost) = (82/121)(9²/121) + (31/121)(18²/121) + (8/121)(15²/121) = 1.26.

The corresponding 95% confidence intervals are

LEAD:   127 ± 1.96 √121.8 = [105.4, 148.6]
COPPER: 35 ± 1.96 √1.26 = [32.8, 37.2].

These confidence intervals are both narrower than the CIs in part (a); this indicates
that stratified sampling would increase precision in future surveys.
The following table gives the estimated coefficients of variation for the SRS and
poststratified estimates:

           CV(ȳ)
           SRS      Poststratified
Lead       0.1045   0.0869
Copper     0.0416   0.0321
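A small Python sketch reproduces the poststratified estimates for lead (stratum counts, means, and standard deviations as summarized above):

```python
import math

# Poststratified mean and variance, as in (4.21) and (4.22), for lead.
n = 121
n_h = [82, 31, 8]          # grid points falling in each soil zone
ybar_h = [71, 259, 189]    # stratum means
s_h = [28, 232, 79]        # stratum standard deviations

ybar_post = sum(nh / n * yh for nh, yh in zip(n_h, ybar_h))
v_post = sum(nh / n * sh**2 / n for nh, sh in zip(n_h, s_h))
half = 1.96 * math.sqrt(v_post)
print(round(ybar_post), round(v_post, 1))   # 127 121.8
print(round(ybar_post - half, 1), round(ybar_post + half, 1))
```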
4.18 As di = yi − Bxi, d̄ = ȳ − Bx̄. Then, using (A.10),

V(d̄) = V(ȳ − Bx̄)
     = V(ȳ) − 2 Cov(ȳ, Bx̄) + B² V(x̄)
     = (1 − n/N)[S²y/n − 2B(R Sx Sy)/n + B² S²x/n].

4.21 From (4.6), the squared bias of B̂ is approximately

(1/x̄U⁴)(1 − n/N)²(1/n²)[B S²x − R Sx Sy]².

From (4.8),

E[(B̂ − B)²] ≈ (1 − n/N)(1/(n x̄U²))(S²y − 2BR Sx Sy + B² S²x).

The approximate MSE is thus of order 1/n, while the squared bias is of order 1/n².
Consequently, MSE(B̂) ≈ V(B̂).
4.22 A rigorous proof, showing that the lower-order terms are negligible, is beyond
the scope of this book. We give an argument for (4.6).

E[B̂ − B] = E[ȳ/x̄ − ȳU/x̄U]
         = E[(ȳ/x̄U)(x̄U/x̄) − ȳU/x̄U]
         = E[(ȳ/x̄U)(1 − (x̄ − x̄U)/x̄) − ȳU/x̄U]
         = −E[ȳ(x̄ − x̄U)/(x̄U x̄)]
         = E[(ȳ(x̄ − x̄U)/x̄U²)((x̄ − x̄U)/x̄ − 1)]
         = (1/x̄U²){B V(x̄) − Cov(x̄, ȳ) + E[(B̂ − B)(x̄ − x̄U)²]}
         ≈ (1/x̄U²)[B V(x̄) − Cov(x̄, ȳ)].

4.24 (a) From (4.5),

ȳ1 − ȳU1 = (1/x̄U)[ū − (tu/tx)x̄][1 − (x̄ − x̄U)/x̄]

and

ȳ2 − ȳU2 = (1/(1 − x̄U))[ȳ − ū − ((ty − tu)/(N − tx))(1 − x̄)][1 + (x̄ − x̄U)/(1 − x̄)].

The covariance follows because the expected values of terms involving x̄ − x̄U are
small compared with the other terms.
(b) Note that because xi takes on only the values 0 and 1, x²i = xi, xi ui = ui, and
ui yi = u²i. Now look at the numerator of (4.26).

Σ_{i=1}^N [ui − ūU − (tu/tx)(xi − x̄U)][yi − ȳU − ui + ūU + ((ty − tu)/(N − tx))(xi − x̄U)]

= Σ_{i=1}^N [ui − ūU − (tu/tx)(xi − x̄U)][yi − ui + ((ty − tu)/(N − tx))xi]

= Σ_{i=1}^N [ui yi − u²i + ((ty − tu)/(N − tx))ui xi − (tu/tx)xi yi + (tu/tx)ui xi − (tu/tx)((ty − tu)/(N − tx))x²i]
  + (x̄U(tu/tx) − ūU)[ty − tu + ((ty − tu)/(N − tx))tx]

= Σ_{i=1}^N [((ty − tu)/(N − tx))ui − (tu/tx)ui + (tu/tx)ui − (tu/tx)((ty − tu)/(N − tx))xi] + 0

= ((ty − tu)/(N − tx))tu − (tu/tx)((ty − tu)/(N − tx))tx

= 0.

Consequently,

Cov[ū − x̄(tu/tx), ȳ − ū − (1 − x̄)((ty − tu)/(N − tx))] = 0.

4.25 We use the multivariate delta method (see, for example, Lehmann, 1999, p.
295). Let

g(a, b) = (b/a) x̄U

so that g(x̄, ȳ) = ȳˆr and g(x̄U, ȳU) = ȳU. Then

∂g/∂a = −x̄U b/a²

and

∂g/∂b = x̄U/a.

Thus, the asymptotic distribution of

√n [g(x̄, ȳ) − g(x̄U, ȳU)] = √n [ȳˆr − ȳU]

is normal with mean 0 and variance

V = (∂g/∂a)² S²x + 2(∂g/∂a)(∂g/∂b) R Sx Sy + (∂g/∂b)² S²y,

with the partial derivatives evaluated at (x̄U, ȳU), so

V = B² S²x − 2B R Sx Sy + S²y.

4.26 Using (4.8) and (4.17),

V(ȳˆr) − V(ȳˆreg) ≈ (1 − n/N)(1/n)(S²y − 2BR Sx Sy + B² S²x)
                    − (1 − n/N)(1/n) S²y (1 − R²)
                  = (1 − n/N)(1/n)[−2BR Sx Sy + B² S²x + R² S²y]
                  = (1 − n/N)(1/n)[R Sy − B Sx]²
                  ≥ 0.

4.28

Σ_{i=1}^N (yi − ȳU − B1[xi − x̄U])²/(N − 1)
= Σ_{i=1}^N [(yi − ȳU)² − 2B1(xi − x̄U)(yi − ȳU) + B1²(xi − x̄U)²]/(N − 1)
= S²y − 2B1 R Sx Sy + B1² S²x
= S²y − 2(R Sy/Sx)(R Sx Sy) + (R² S²y/S²x)S²x
= S²y(1 − R²).

4.29 From (4.15),

ȳˆreg = ȳ + B̂1(x̄U − x̄)
      = ȳ + B1(x̄U − x̄) + (B̂1 − B1)(x̄U − x̄).

Thus,

E[ȳˆreg − ȳU] = E[ȳ + B1(x̄U − x̄) + (B̂1 − B1)(x̄U − x̄)] − ȳU
              = E[(B̂1 − B1)(x̄U − x̄)]
              = −Cov(B̂1, x̄).

Now,

B̂1 = Σ_{i∈S}(xi − x̄)(yi − ȳ) / Σ_{i∈S}(xi − x̄)²

and

(xi − x̄)(yi − ȳ) = (xi − x̄)(yi − ȳU + ȳU − ȳ)
                  = (xi − x̄)[di + B1(xi − x̄U) + ȳU − ȳ]
                  = (xi − x̄)[di + B1(xi − x̄) + B1(x̄ − x̄U) + ȳU − ȳ]

with

Σ_{i∈S}(xi − x̄)(yi − ȳ) = Σ_{i∈S}(xi − x̄)di + B1 Σ_{i∈S}(xi − x̄)².

Thus,

B̂1 = B1 + [Σ_{i∈S}(xi − x̄U)di + (x̄U − x̄)Σ_{i∈S}di] / Σ_{i∈S}(xi − x̄)².

Let qi = di(xi − x̄U). Then

Σ_{i=1}^N qi = Σ_{i=1}^N (yi − ȳU)(xi − x̄U) − B1 Σ_{i=1}^N (xi − x̄U)² = 0

by the definition of B1, so q̄U = 0. Consequently,

E[(B̂1 − B1)(x̄U − x̄)] = E[(x̄U − x̄)(n q̄ + n(x̄U − x̄)d̄)/((n − 1)s²x)]
                       ≈ (1/S²x) E[q̄(x̄U − x̄) + d̄(x̄U − x̄)²]
                       = (1/S²x){−Cov(q̄, x̄) + E[d̄(x̄U − x̄)²]}.

Since

Cov(q̄, x̄) = (1 − n/N)(1/n)(1/(N − 1)) Σ_{i=1}^N (qi − q̄U)(xi − x̄U)
          = (1 − n/N)(1/n)(1/(N − 1)) Σ_{i=1}^N qi(xi − x̄U)

and E[d̄(x̄U − x̄)²] is of smaller order than Cov(q̄, x̄), the approximation is shown.
4.32 From linear models theory, if Y = Xβ + ε, with E[ε] = 0 and Cov[ε] = σ²A,
then the weighted least squares estimator of β is

β̂ = (XᵀA⁻¹X)⁻¹XᵀA⁻¹Y

with

V[β̂] = σ²(XᵀA⁻¹X)⁻¹.

This result may be found in any linear models book (for instance, Christensen, 1996,
p. 31). In our case,

Y = (Y1, ..., Yn)ᵀ,  X = (x1, ..., xn)ᵀ,  and  A = diag(x1, ..., xn),

so

β̂ = (XᵀA⁻¹X)⁻¹XᵀA⁻¹Y = Σ_{i=1}^n Yi / Σ_{i=1}^n xi = ȳ/x̄

and

V[β̂] = σ²(XᵀA⁻¹X)⁻¹ = σ² / Σ_{i=1}^n xi.
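The algebra can be checked numerically: for any data, the weighted least squares fit with A = diag(x₁, …, xₙ) returns ȳ/x̄. A sketch in plain Python with made-up illustrative data:

```python
# WLS estimate of beta in the model E[Y_i] = beta * x_i with
# Var(Y_i) = sigma^2 * x_i: beta_hat = (X'A^{-1}X)^{-1} X'A^{-1}Y.
# With A = diag(x_i), the sums telescope to ybar/xbar.
x = [2.0, 5.0, 7.5, 1.0, 4.0]     # illustrative data
y = [4.1, 10.3, 14.6, 2.2, 8.4]

xt_ainv_x = sum(xi * (1.0 / xi) * xi for xi in x)                 # = sum(x)
xt_ainv_y = sum(xi * (1.0 / xi) * yi for xi, yi in zip(x, y))     # = sum(y)
beta_hat = xt_ainv_y / xt_ainv_x

ybar = sum(y) / len(y)
xbar = sum(x) / len(x)
assert abs(beta_hat - ybar / xbar) < 1e-12
print(beta_hat)
```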

4.35 (a) No, because some values are 0.

(b) When the entire population is sampled, t̂x = tx and t̂y = ty, so B̂ = ty/tx and
tx B̂ = ty.

(c) Answers will vary.

(d) Letting Zi be the indicator variable for inclusion in the sample,

E[b̄ − B] = E[(1/n) Σ_{i=1}^N Zi bi − ty/tx]
         = (1/N) Σ_{i=1}^N bi − Σ_{i=1}^N bi xi / tx
         = (Σ_{i=1}^N bi)(Σ_{j=1}^N xj)/(N tx) − Σ_{i=1}^N bi xi / tx
         = −(1/tx)[Σ_{i=1}^N bi xi − N b̄U x̄U]
         = −(N − 1) Sbx / tx.

(e) From linear models theory, if Y = Xβ + ε, with E[ε] = 0 and Cov[ε] = σ²A,
then the weighted least squares estimator of β is

β̂ = (XᵀA⁻¹X)⁻¹XᵀA⁻¹Y

with

V[β̂] = σ²(XᵀA⁻¹X)⁻¹.

Here, A = diag(x²i), so

XᵀA⁻¹X = Σ_{i∈S} xi (1/x²i) xi = n

and

XᵀA⁻¹Y = Σ_{i∈S} xi (1/x²i) Yi = Σ_{i∈S} Yi/xi,

so β̂ = (1/n) Σ_{i∈S} Yi/xi = b̄.

4.42 (a)

proc surveymeans data = vius mean sum clm clsum;
weight tabtrucks;
strata stratum;
var miles_annl;
domain business;
run;

Business in
which
vehicle was
most often
used during Std Error
2002 Variable Mean of Mean

For-hire tra MILES_ANNL 56452 1246.317720


Vehicle leas MILES_ANNL 23306 790.981413
Agriculture, MILES_ANNL 10768 415.312729
Mining MILES_ANNL 19210 1126.305405
Utilities MILES_ANNL 15081 830.112607
Construction MILES_ANNL 16714 381.173223
Manufacturin MILES_ANNL 19650 1018.600046
Wholesale tr MILES_ANNL 23052 1184.715817
Retail trade MILES_ANNL 17948 561.147469
Information MILES_ANNL 14927 1396.653144
Waste manage MILES_ANNL 14410 726.842687
Arts, entert MILES_ANNL 9536.588165 1267.485898
Accommodatio MILES_ANNL 20461 1300.857231
Other servic MILES_ANNL 16818 600.239019

Domain Analysis: Business in which vehicle was most often used

Business in
which
vehicle was
most often
used during
2002 Variable 95% CL for Mean Sum

For-hire tra MILES_ANNL 54009.5907 58895.1519 72272793289


Vehicle leas MILES_ANNL 21755.6067 24856.2511 20024589014
Agriculture, MILES_ANNL 9954.2886 11582.3131 24119946651
Mining MILES_ANNL 17002.3551 21417.4684 3411543277
Utilities MILES_ANNL 13454.3175 16708.3561 10244675655
Construction MILES_ANNL 15966.8537 17461.0515 75906142636
Manufacturin MILES_ANNL 17653.3448 21646.2535 15384530602
Wholesale tr MILES_ANNL 20729.7496 25373.8316 16963450921
Retail trade MILES_ANNL 16848.3582 19048.0543 27470445448
Information MILES_ANNL 12189.9160 17664.7915 5622014452
Waste manage MILES_ANNL 12985.5076 15834.7285 10709275945
Arts, entert MILES_ANNL 7052.3180 12020.8583 1784083855
Accommodatio MILES_ANNL 17911.2857 23010.6416 5816313888
Other servic MILES_ANNL 15641.1955 17994.1304 35776203775

Business in
which
vehicle was
most often
used during
2002 Variable Std Dev 95% CL for Sum
For-hire tra MILES_ANNL 1608230919 6.91207E10 7.54249E10
Vehicle leas MILES_ANNL 1213307392 1.76465E10 2.24027E10
Agriculture, MILES_ANNL 1354386330 2.14654E10 2.67745E10
Mining MILES_ANNL 360265917 2705422697 4117663856
Utilities MILES_ANNL 942933274 8396528057 1.20928E10
Construction MILES_ANNL 2821651145 7.03757E10 8.14366E10
Manufacturin MILES_ANNL 1399406209 1.26417E10 1.81274E10
Wholesale tr MILES_ANNL 1348917090 1.43196E10 1.96073E10
Retail trade MILES_ANNL 1422261638 2.46828E10 3.02581E10
Information MILES_ANNL 923917751 3811137245 7432891659
Waste manage MILES_ANNL 901989658 8941377763 1.24772E10
Arts, entert MILES_ANNL 353650310 1090929855 2477237856
Accommodatio MILES_ANNL 677802928 4487821313 7144806463
Other servic MILES_ANNL 2201296141 3.14617E10 4.00907E10

(b)

proc surveymeans data = vius mean clm ;
weight tabtrucks;
strata stratum;
var mpg;
domain transmssn;
run;

Domain Analysis: Type of Transmission

Type of
Transmission Variable Mean

Automatic MPG 16.665277


Manual MPG 16.022122
Semi-Automat MPG 14.846222
Automated Ma MPG 16.732086

Domain Analysis: Type of Transmission

Type of                    Std Error
Transmission Variable      of Mean      95% CL for Mean

Automatic MPG 0.043659 16.5797047 16.7508490


Manual MPG 0.097739 15.8305539 16.2136901
Semi-Automat MPG 1.012044 12.8626219 16.8298213
Automated Ma MPG 1.599588 13.5969068 19.8672658

(c)

The estimated ratio is 0.124410 with 95% CI [0.12258, 0.12624].


Chapter 5

Cluster Sampling with Equal Probabilities

5.1 If the nonresponse can be ignored, then p̂ is the ratio estimate of the proportion.
The variance estimate given in the problem, though, assumes that an SRS of voters
was taken. But this was a cluster sample—the sampling unit was a residential
telephone number, not an individual voter. As we expect that voters in the same
household are more likely to have similar opinions, the estimated variance using
simple random sampling is probably too small.
5.3 (a) This is a cluster sample because there are two levels of sampling units: the
wetlands are the psus and the sites are the ssus.
(b) The analysis is not appropriate. A two-sample t test assumes that all obser-
vations are independent. This is a cluster sample, however, and sites within the
same wetland are expected to be more similar than sites selected at random from
the population.
5.4 (a) This is a cluster sample because the primary sampling unit is the journal,
and the secondary sampling unit is an article in the journal from 1988.
(b) Let

Mi = number of articles in journal i

and

ti = number of articles in journal i that use non-probability sampling designs.

From the data file,

Σ_{i∈S} ti = 137

and

Σ_{i∈S} Mi = 148.

Then, using (5.16),

ȳˆr = Σ_{i∈S} ti / Σ_{i∈S} Mi = 137/148 = 0.926.

The estimated variance of the residuals is

Σ_{i∈S} (ti − ȳˆr Mi)²/(n − 1) = 0.993221;

using (5.18), with M̄S substituted for M̄U,

SE[ȳˆr] = √((1 − 26/1285)(1/(26(5.69)²))(0.993221)) = 0.034.
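A Python sketch of the same one-stage cluster calculation (journal counts as summarized above):

```python
import math

# Ratio estimate of the proportion of articles using non-probability
# designs, with SE from (5.18) using the sample mean cluster size.
N, n = 1285, 26
sum_t, sum_M = 137, 148

yhat_r = sum_t / sum_M
Mbar_S = sum_M / n                 # approximately 5.69
s2_r = 0.993221                    # variance of the residuals t_i - yhat_r * M_i
se = math.sqrt((1 - n / N) * s2_r / (n * Mbar_S**2))
print(round(yhat_r, 3), round(se, 3))  # 0.926 0.034
```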

Here is SAS code:

data journal;
infile journal delimiter=',' firstobs=2;
input numemp prob nonprob ;
sampwt = 1285/26;
/* weight = N/n since this is a one-stage cluster sample */

proc surveymeans data=journal total = 1285 mean clm sum clsum;
weight sampwt;
var numemp nonprob;
ratio 'nonprob/(number of articles)' nonprob/numemp;
run;

5.5

options ls=78 nodate nocenter;

data spanish;
infile spanish delimiter=',' firstobs=2;
input class score trip;
sampwt = 72/10;
/* weight = N/n = 72/10 since one-stage cluster sample */

proc print data=spanish;
run;

proc surveymeans data=spanish total = 72 mean clm sum clsum;
weight sampwt;
cluster class;
var trip score;
run;
/* Construct a boxplot of the data */

proc sort data=spanish;
by class;

proc boxplot data=spanish;
plot score * class;
run;

/* Since this is a one-stage cluster sample, there is
no contribution to variance from subsampling. */

proc surveymeans data=spanish mean clm sum clsum;
weight sampwt;
cluster class;
var trip score;
run;

The SURVEYMEANS Procedure

Data Summary

Number of Clusters 10
Number of Observations 196
Sum of Weights 1411.2

Statistics

Std Error
Variable Mean of Mean 95% CL for Mean

trip 0.321429 0.079001 0.1427164 0.5001408


score 66.795918 2.919409 60.1917561 73.4000806

Statistics

Variable Sum Std Dev 95% CL for Sum

trip 453.600000 120.502946 181.0034 726.197


score 94262 7301.420092 77745.4402 110779.360

Here is a side-by-side boxplot for score.

5.6 (a) The SAS code below was used to calculate summary statistics and the
ANOVA table. The output is given below.

data worms;
do case = 1 to 12;
do can = 1 to 3;
input worms @@;
wt = (580/12)*(24/3);
output;
end;
end;
cards;
1 5 7
4 2 4
0 1 2
3 6 6
4 9 8
0 7 3
5 5 1
3 0 2
7 3 5
3 1 4
4 7 9
0 0 0
;
proc print data=worms;
run;

proc glm data=worms;
class case;
model worms = case;
means case;
run;

/* SAS does not calculate the extra term for variance
due to 2-stage sampling */
proc surveymeans data=worms total = 580;
weight wt;
cluster case;
var worms;
run;

The GLM Procedure

Class Level Information

Class Levels Values

case 12 1 2 3 4 5 6 7 8 9 10 11 12

Number of Observations Read 36


Number of Observations Used 36

Dependent Variable: worms

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 11 149.6388889 13.6035354 3.00 0.0117

Error 24 108.6666667 4.5277778

Corrected Total 35 258.3055556

R-Square Coeff Var Root MSE worms Mean

0.579310 58.47547 2.127858 3.638889

Level of ------------worms------------

case N Mean Std Dev

1 3 4.33333333 3.05505046
2 3 3.33333333 1.15470054
3 3 1.00000000 1.00000000
4 3 5.00000000 1.73205081
5 3 7.00000000 2.64575131
6 3 3.33333333 3.51188458
7 3 3.66666667 2.30940108
8 3 1.66666667 1.52752523
9 3 5.00000000 2.00000000
10 3 2.66666667 1.52752523
11 3 6.66666667 2.51661148
12 3 0.00000000 0.00000000

From the output, we see that

ȳˆunb = 3.639

(this works here because Mi = M for all i and mi = m for all i; in this case, (5.26)
reduces to the sample mean).

SE[ȳˆunb] = √((1 − 12/580)(13.604/((12)(3))) + (1/580)(1 − 3/24)(4.528/3))
          = √(0.3701 + 0.0023)
          = 0.61.

Note that the second term contributes little to the standard error.
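The same two-stage variance arithmetic in Python (ANOVA mean squares taken from the output above):

```python
import math

# SE of ybar_unb for a two-stage cluster sample with equal sizes:
# first term from between-case variation, second from subsampling cans.
N, n = 580, 12     # cases in the population, cases sampled
M, m = 24, 3       # cans per case, cans sampled per case
msb, mse = 13.604, 4.528   # between- and within-case mean squares

var1 = (1 - n / N) * msb / (n * m)
var2 = (1 / N) * (1 - m / M) * mse / m
print(round(math.sqrt(var1 + var2), 2))  # 0.61
```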
Here is the approximation from SAS, using PROC SURVEYMEANS:

The SURVEYMEANS Procedure

Data Summary

Number of Clusters 12
Number of Observations 36
Sum of Weights 13920

Statistics

Std Error
Variable N Mean of Mean 95% CL for Mean
______________________________________________________________
worms 36 3.638889 0.614716 2.28590770 4.99187008
______________________________________________________________

5.7 We used SAS to obtain the mean and standard deviation for each city, and to
plot the data.

The GLM Procedure

Class Level Information

Class Levels Values

city 6 1 2 3 4 5 6

Number of Observations Read 34


Number of Observations Used 34

The GLM Procedure

Dependent Variable: cases

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 5 68339.9641 13667.9928 3.37 0.0166

Error 28 113570.6536 4056.0948

Corrected Total 33 181910.6176

R-Square Coeff Var Root MSE cases Mean


0.375679 53.85163 63.68748 118.2647

The GLM Procedure

Level of ------------cases------------
city N Mean Std Dev

1 10 177.300000 83.5996411
2 4 93.250000 29.2389580
3 7 116.285714 54.5396317
4 8 104.625000 64.5930945
5 2 17.500000 7.7781746
6 3 63.000000 22.2710575

The SURVEYMEANS Procedure

Data Summary

Number of Clusters 6
Number of Observations 34
Sum of Weights 1267.5

Statistics

Std Error
Variable Mean of Mean 95% CL for Mean
___________________________________________________________
cases 120.688145 19.730137 69.9702138 171.406075
___________________________________________________________

Sum Std Dev 95% CL for Sum


______________________________________________
152972 56601 7473.70477 298470.742

We can also use (5.18) and (5.24) for calculating the total number of cases sold, and
(5.26) and (5.28) for calculating the average number of cases sold per supermarket.
Summary quantities are given in the spreadsheet below.

City   Mi   mi    ȳi      si     t̂i = Mi ȳi   t̂i − Mi ȳˆr   (1 − mi/Mi)Mi² s²i/mi
1      52   10   177.30   83.60     9219.6        2943.8              1526376
2      19    4    93.25   29.24     1771.8        −521.3                60913
3      37    7   116.29   54.54     4302.6        −162.9               471682
4      39    8   104.63   64.59     4080.4        −626.5               630534
5       8    2    17.50    7.78      140.0        −825.5                 1452
6      14    3    63.00   22.27      882.0        −807.6                25461
Sum   169   34                     20396.3                            2716418

The sample variance of the t̂i is s²t = 10952882, and the sample variance of the
residuals t̂i − Mi ȳˆr is s²r = 2138111.

From (5.18),

t̂unb = (45/6)(20396.3) = 152,972.

Using (5.24),

SE[t̂unb] = √(45²(1 − 6/45)(10952882/6) + (45/6)(2716418))
         = √(3,203,717,941 + 20,373,134)
         = 56,781.

From (5.26) and (5.28),

ȳˆr = 20,396/169 = 120.7

and

SE[ȳˆr] = (1/28.17)√((1 − 6/45)(2,138,111/6) + (1/(6(45)))(2,716,418))
        = 21.05.
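The unbiased total and its SE can be reproduced from the cluster totals t̂i (a Python sketch; the within-cluster sum is taken from the table above):

```python
import math

# Unbiased estimator of the total for unequal-size clusters, (5.18) and (5.24).
N, n = 45, 6
t_hat = [9219.6, 1771.8, 4302.6, 4080.4, 140.0, 882.0]
within = 2_716_418     # sum over i of (1 - m_i/M_i) M_i^2 s_i^2 / m_i

t_unb = N / n * sum(t_hat)
tbar = sum(t_hat) / n
s2_t = sum((t - tbar) ** 2 for t in t_hat) / (n - 1)
se = math.sqrt(N**2 * (1 - n / N) * s2_t / n + (N / n) * within)
print(round(t_unb), round(se))
```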

5.8 (a) The SAS code below computes estimates.

data books;
infile booksdat delimiter=',' firstobs=2;
input shelf Mi number purchase replace;
sampwt = (44/12)*(Mi/5);
/* The crucial part for estimating the total is correctly
specifying the sampling weight. */

proc print data=books;
run;

/* Construct the plot for part (a) */

proc boxplot data=books;
plot replace * shelf;
run;

/* Here is the with-replacement approximation to the variance */

proc surveymeans data=books mean clm sum clsum;
weight sampwt;
cluster shelf;
var replace;
run;

/* Here is the without-replacement approximation to the variance
using SAS. Note that this approximation does not include the
second term in (5.21). */

proc surveymeans data=books total = 44 mean clm sum clsum;
weight sampwt;
cluster shelf;
var replace;
run;

It appears that the means and variances differ quite a bit for different shelves.
(b) Quantities used in calculation are in the spreadsheet below.
Shelf   Mi   mi   ȳi     si²       t̂i = Mi·ȳi   (1 − mi/Mi)·Mi²·si²/mi   (t̂i − Mi·ȳ̂_r)²
2 26 5 9.6 17.80 249.6 1943.76 132696.9
4 52 5 6.2 1.70 322.4 830.96 819661.7
11 70 5 9.2 18.70 644.0 17017.00 1017561.8
14 47 5 7.2 3.20 338.4 1263.36 594901.6
20 5 5 41.8 2666.70 209.0 0.00 8271.3
22 28 5 29.8 748.70 834.4 96432.56 30033.9
23 27 5 51.8 702.70 1398.6 83480.76 579293.8
31 29 5 61.2 353.20 1774.8 49165.44 1188301.2
37 21 5 50.4 147.30 1058.4 9898.56 316493.1
38 31 5 36.6 600.80 1134.6 96848.96 162144.0
40 14 5 54.2 595.70 758.8 15011.64 183399.3
43 27 5 6.6 4.80 178.2 570.24 210944.1

Sum     377        8901.2   372463.2   5243902.9
Var              268531.00
To estimate the total replacement value of the book collection, we use the unbiased
estimator since M0, the total number of books, is unknown.

    t̂_unb = (N/n) Σ_{i∈S} t̂i = (44/12)(8901.2) = 32,637.73.
From the spreadsheet, s_t² = 268,531 and Σ_{i∈S} Mi²(1 − mi/Mi)si²/mi = 372,463.2, so

    SE(t̂_unb) = N √[ (1 − n/N)(s_t²/n) + (1/(nN)) Σ_{i∈S} Mi²(1 − mi/Mi)si²/mi ]
              = 44 √[ (1 − 12/44)(268,531/12) + 372,463.2/((12)(44)) ]
              = 44 √(16,274.6 + 705.4) = 5733.52.

The standard error of t̂_unb is 5733.52, which is quite large when compared with the
estimated total. The estimated coefficient of variation for t̂_unb is

    SE(t̂_unb)/t̂_unb = 5733.52/32,637.73 = 0.176.

Here is the approximation using SAS and the with-replacement variance:

Statistics

Std Error
Variable Mean of Mean 95% CL for Mean

replace 23.610610 6.344128 9.64727746 37.5739427

Statistics

Variable Sum Std Dev 95% CL for Sum

replace 32638 6582.020830 18150.8032 47124.6635

Note that the with-replacement variance is too large because the psu sampling
fraction is large.
Here is the approximation using SAS and the without-replacement variance:

Statistics

Std Error
Variable Mean of Mean 95% CL for Mean

replace 23.610610 5.410291 11.7026400 35.5185801

Statistics

Variable Sum Std Dev 95% CL for Sum

replace 32638 5613.166224 20283.2378 44992.2289

Note that 5613 = 44·√16,274.6; SAS calculates only the first term of the
without-replacement variance.
(c) To find the average replacement cost per book with the information given, we
use the ratio estimator in (5.28):

    ȳ̂_r = Σ_{i∈S} t̂i / Σ_{i∈S} Mi = 8901.2/377 = 23.61.

In the formula for the variance in (5.29),

    V̂(ȳ̂_r) = (1 − 12/44)·s_r²/(n·M̄²) + (1/((12)(44)·M̄²)) Σ_{i∈S} Mi²(1 − mi/Mi)si²/mi,

we use M̄ = 31.417, the average of the Mi in our sample. So we have

    s_r² = Σ_{i∈S} (t̂i − Mi·ȳ̂_r)² / (n − 1) = 5,243,902.93/11 = 476,718.4455

and

    SE(ȳ̂_r) = (1/31.417) √[ (1 − 12/44)(476,718.4455/12) + 705.4 ]
             = (1/31.417) √(28,892.0 + 705.4) = 5.476.

The estimated coefficient of variation for ȳ̂_r is

    SE(ȳ̂_r)/ȳ̂_r = 5.476/23.61 = 0.232.

5.9 (a)

It appears that the means and variances differ quite a bit for different shelves.
Here is SAS code and output for the purchase price of the books.

filename booksdat ’books.csv’;

options ls=78 nodate nocenter;


data books;
infile booksdat delimiter=’,’ firstobs=2;
input shelf Mi number purchase replace;
sampwt = (44/12)*(Mi/5);
/* The crucial part for estimating the total is correctly
specifying the sampling weight. */

proc print data=books;


run;

/* Construct the plot for part (a) */

proc boxplot data=books;


plot purchase * shelf;
run;

/* Here is the with-replacement approx to the variance using SAS */

proc surveymeans data=books mean clm sum clsum;


weight sampwt;
cluster shelf;
var purchase;

run;

/* Here is the without-replacement approximation to the variance


using SAS. Note that this approximation does not include the
second term in (5.21). */

proc surveymeans data=books total = 44 mean clm sum clsum;


weight sampwt;
cluster shelf;
var purchase;
run;

The SURVEYMEANS Procedure

Data Summary

Number of Clusters 12
Number of Observations 60
Sum of Weights 1382.33333

Statistics

Std Error
Variable Mean of Mean 95% CL for Mean

purchase 12.942706 3.630539 4.95194389 20.9334672

Statistics

Variable Sum Std Dev 95% CL for Sum

purchase 17891 3854.814474 9406.74388 26375.5228

5.10 Here is the sample ANOVA table for the books data, calculated using SAS.
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 11 25570.98333 2324.635 4.76 0.0001
Error 48 23445.20000 488.442
Corrected Total 59 49016.18333
We use (5.10) to estimate Ra²: we have

    MSW = 488.44167,

and we (with slight bias) estimate S² by

    Ŝ² = SSTO/dfTO = 49,016.18333/59 = 830.78,

so

    R̂a² = 1 − MSW/Ŝ² = 1 − 488.44/830.78 = 0.41.
The positive value of R̂a2 indicates that books on the same shelf do tend to have
more similar replacement costs.
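The same computation takes only a few lines (a hedged Python sketch; the ANOVA quantities are copied from the SAS output above, and the variable names are ours):

```python
# Quantities from the sample ANOVA table (SAS PROC GLM output above).
msw = 488.44167               # error (within) mean square
ssto, df_to = 49016.18333, 59

s2_hat = ssto / df_to         # slightly biased estimate of S², ≈ 830.78
ra2_hat = 1 - msw / s2_hat    # estimate of R_a², ≈ 0.41
```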
5.11 (a) Here, N = 828 and n = 85. We have the following frequencies for ti =
number of errors in claim i:
Number of errors Frequency
0 57
1 22
2 4
3 1
4 1
The 85 claims have a total of Σ_{i∈S} ti = 37 errors, so from (5.1),

    t̂ = (828/85)(37) = 360.42
and

    s_t² = (1/84) Σ_{i∈S} (ti − 37/85)²
         = (1/84)[ 57(0 − 37/85)² + 22(1 − 37/85)² + 4(2 − 37/85)²
                   + (3 − 37/85)² + (4 − 37/85)² ]
         = (1/84)[10.80 + 7.02 + 9.79 + 6.58 + 12.71]
         = 0.558263.

Thus the error rate, using (5.4), is

    ȳ̂ = t̂/(NM) = 360.42/((828)(215)) = 0.002025.

From (5.6),

    SE[ȳ̂] = (1/215) √[ (1 − 85/828)(s_t²/85) ] = (1/215)(.07677) = .000357.
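The whole calculation can be reproduced directly from the frequency table (a Python sketch; variable names are ours):

```python
import math

# N = 828 claims, n = 85 sampled, M = 215 fields per claim; the frequency
# table of t_i = number of errors in sampled claim i is from the solution above.
N, n, M = 828, 85, 215
freq = {0: 57, 1: 22, 2: 4, 3: 1, 4: 1}

total_errors = sum(t * f for t, f in freq.items())                  # 37
t_hat = (N / n) * total_errors                                      # (5.1)
tbar = total_errors / n
st2 = sum(f * (t - tbar) ** 2 for t, f in freq.items()) / (n - 1)

ybar_hat = t_hat / (N * M)                                          # (5.4)
se_ybar = (1 / M) * math.sqrt((1 - n / N) * st2 / n)                # (5.6)
```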

(b) From calculations in (a),

    t̂ = 360.42

and, using (5.3),

    SE[t̂] = 828 √[ (1 − 85/828)(s_t²/85) ] = 63.565.

(c) If the same number of errors (37) had been obtained using an SRS of 18,275 of
the 178,020 fields, the error rate would be

    37/18,275 = .002025
(the same as in (a)). But the estimated variance from an SRS would be

    V̂[p̂srs] = (1 − 18,275/178,020)·p̂srs(1 − p̂srs)/18,274 = 9.92 × 10⁻⁸.

The estimated variance from (a) is

    (SE[ȳ̂])² = 1.28 × 10⁻⁷.

Thus

    (Estimated variance under cluster design)/(Estimated variance under SRS)
        = (1.28 × 10⁻⁷)/(9.92 × 10⁻⁸) = 1.29.

5.12 Students should plot the data similarly to Figure 5.3.


    ȳ̂_r = Σ_{i∈S} t̂i / Σ_{i∈S} Mi = 85,478.56/1757 = 48.65.

s_r² is the sample variance of the residuals t̂i − Mi·ȳ̂_r; here,

    s_r² = Σ_{i∈S} (t̂i − Mi·ȳ̂_r)² / 183 = 280.0261.

Since N is unknown, (5.29) yields

    SE[ȳ̂_r] = (1/9.549) √(280.0261/184) = 0.129.

5.13 (a) Cluster sampling is needed for this application because the household is the
sampling unit. Yet the Arizona Statutes specify the statistic that must be used for
estimating the error rate: it must be estimated by the sample mean. It is therefore
important to make sure that a self-weighting sample is taken. In this application,
a self-weighting sample will result if an SRS of n households is taken from the
population of N households in the county, and if all individuals in the household
are measured. It makes sense in this example to take a one-stage cluster sample.

(b) We know that if we were taking an SRS of persons, we would calculate
n0 = (1.96)²(0.1)(0.9)/(0.03)² = 385 and

    n = n0/(1 + n0/N) = 310.

Since we obtain at least some additional information by sampling everyone in the
household, we need at most 310 households. To calculate the number of households
to be sampled, we need an idea of the measure of homogeneity within the households.
We know from the equation following (5.11) that

    V(t̂_cluster)/V(t̂_SRS) = 1 + [N(M − 1)/(N − 1)]·Ra²,

so we can calculate a sample size by multiplying the number of persons needed for
an SRS (310) by the ratio of variances, then dividing by M. The following table
gives some sample sizes for different values of Ra²:
M Ra2 sample size
1 0.1 310
2 0.1 171
3 0.1 124
4 0.1 101
5 0.1 87
1 0.5 310
2 0.5 233
3 0.5 207
4 0.5 194
6 0.5 181
1 0.8 310
2 0.8 279
3 0.8 269
4 0.8 264
5 0.8 260
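The table can be reproduced programmatically (a Python sketch; we assume N is large, so the variance-ratio factor N(M − 1)/(N − 1) is taken as M − 1, and we round to the nearest whole household, which matches the printed values, including the M = 6 row of the Ra² = 0.5 block):

```python
# Households needed so that a one-stage cluster sample of households of size M,
# with adjusted homogeneity ra2, matches the precision of an SRS of 310 persons.
def households_needed(M, ra2, n_srs=310):
    variance_ratio = 1 + (M - 1) * ra2   # large-N approximation
    return int(n_srs * variance_ratio / M + 0.5)   # round to nearest household
```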
5.14 (a) Treating the proportions as means, and letting Mi and mi be the number
of female students in the school and interviewed, respectively, we have the following
summaries.
School   Mi    mi   smokers   ȳi      si²    t̂i      t̂i − Mi·ȳ̂_r   (1 − mi/Mi)·Mi²·si²/mi
1 792 25 10 0.4 .010 316.8 -24.7 243
2 447 15 3 0.2 .011 89.4 -103.3 147
3 511 20 6 0.3 .011 153.3 -67.0 139
4 800 40 27 0.675 .006 540.0 195.1 86
Sum 2550 100 1099.5 614
Var 17943
98 CHAPTER 5. CLUSTER SAMPLING WITH EQUAL PROBABILITIES

Then,

    ȳ̂_r = 1099.5/2550 = 0.43

and

    SE[ȳ̂_r] = (1/637.5) √[ (1 − 4/29)(17,943/4) + (1/(4·29))(614) ]
             = (1/637.5) √(3867.0 + 5.3)
             = 0.098.

(b) We construct a sample ANOVA table from the summary information in part
(a). Note that

    ssw = Σ_{i=1}^{4} Σ_{j=1}^{mi} (yij − ȳi)² = Σ_{i=1}^{4} (mi − 1)·si²,

and

    Source         df     SS      MS
    Between psus    3
    Within psus    96    0.837   0.0087
    Total          99
5.15 (a) A cluster sample was used for this study because Arizona has no list of all
elementary school teachers in the state. All schools would have to be contacted to
construct a sampling frame of teachers, and this would be expensive. Taking a cluster
sample also makes it easier to distribute surveys. It’s possible that distributing
questionnaires through the schools might improve cooperation with the survey and
give respondents more assurance that their data are kept confidential.
(b) The means and standard deviations, after eliminating records with missing val-
ues, are in the following table:

School Mean Std. Dev.


11 33.99 1.953
12 36.12 4.942
13 34.58 0.722
15 36.76 0.746
16 36.84 1.079
18 35.00 0.000
19 34.87 0.231
20 36.36 2.489
21 35.41 3.154
22 35.68 4.983
23 35.17 3.392
24 31.94 0.860
25 31.25 0.668
28 31.46 2.627
29 29.11 2.440
30 35.79 1.745
31 34.52 1.327
32 35.46 1.712
33 26.82 0.380
34 27.42 0.914
36 36.98 2.961
38 37.66 1.110
41 36.88 0.318
There appears to be large variation among the school means. An ANOVA table for
the data is shown below.
Df Sum of Sq Mean Sq F Value Pr(F)
school 22 1709.187 77.69034 13.76675 0
Residuals 224 1264.106 5.64333
(c) There appears to be a wider range of standard deviations for schools with higher
means.
(d) Calculations are in the following table.

School   Mi   mi    ȳi       si²      Mi·ȳi      resids     Mi²·(1 − mi/Mi)·si²/mi
11 33 10 33.990 3.814 1121.670 5.501 289.508
12 16 13 36.123 24.428 577.969 36.797 90.196
13 22 3 34.583 0.521 760.833 16.721 72.569
15 24 24 36.756 0.557 882.150 70.391 0.000
16 27 24 36.840 1.165 994.669 81.440 3.933
18 18 2 35.000 0.000 630.000 21.181 0.000
19 16 3 34.867 0.053 557.867 16.694 3.698
20 12 8 36.356 6.197 436.275 30.396 37.185
21 19 5 35.410 9.948 672.790 30.147 529.234
22 33 13 35.677 24.835 1177.338 61.170 1260.867
23 31 16 35.175 11.506 1090.425 41.903 334.393
24 30 9 31.944 0.740 958.333 -56.366 51.776
25 23 8 31.250 0.446 718.750 -59.186 19.252
28 53 17 31.465 6.903 1667.630 -125.005 774.766
29 50 8 29.106 5.955 1455.313 -235.852 1563.270
30 26 22 35.791 3.045 930.564 51.158 14.393
31 25 18 34.525 1.761 863.125 17.543 17.118
32 23 16 35.456 2.932 815.494 37.558 29.503
33 21 5 26.820 0.145 563.220 -147.069 9.710
34 33 7 27.421 0.835 904.907 -211.261 102.333
36 25 4 36.975 8.769 924.375 78.793 1150.953
38 38 10 37.660 1.231 1431.080 145.795 130.978
41 30 2 36.875 0.101 1106.250 91.551 42.525
Sum 628 21241.026 6528.159
Variance 9349.2524

    ȳ̂_r = 33.82

    V̂(ȳ̂_r) = (1/(27.30)²)[ (1 − 23/245)(9349.252/23) + 6528.159/((23)(245)) ]
            = (1/745.53)[368.33 + 1.16]
            = 0.50.

5.16 (a) Summary quantities for estimating ȳ and its variance are given in the table
below. Here, ki denotes the number sampled in school i. We use the number of
respondents in school i as mi .
School   Mi    ki   mi   Return    ȳi       t̂i        t̂i − Mi·ȳ̂_r   Mi²·(1 − mi/Mi)·si²/mi
1 78 40 38 19 0.5000 39.0000 -6.1580 21.0811
2 238 38 36 19 0.5278 125.6111 -12.1786 342.3401
3 261 19 17 13 0.7647 199.5882 48.4828 716.1696
4 174 30 30 18 0.6000 104.4000 3.6630 207.3600
5 236 30 26 12 0.4615 108.9231 -27.7087 492.6675
6 188 25 24 13 0.5417 101.8333 -7.0089 332.8031
7 113 23 22 15 0.6818 77.0455 11.6243 106.2293
8 170 43 36 21 0.5833 99.1667 0.7455 158.1944
9 296 38 35 23 0.6571 194.5143 23.1456 511.9485
10 207 21 17 7 0.4118 85.2353 -34.6070 595.3936
Sum 1961 307 281 160 1135.3175 3484.1873
var 581.79702

(b) Using (5.26) and (5.28),

    ȳ̂_r = 1135.3175/1961 = 0.5789

    V̂(ȳ̂_r) = (1/(196.1)²)[ (1 − 10/46)(581.797/10) + 3484.1873/((10)(46)) ]
            = (1/(196.1)²)[45.532 + 7.574]
            = 0.001381.

An approximate 95% confidence interval for the proportion of parents who returned
the questionnaire is

    0.5789 ± 1.96 √0.001381 = [0.506, 0.652].

(c) If the clustering were (incorrectly!) ignored, we would have had p̂ = 160/281 =
.569 with V̂(p̂) = .569(1 − .569)/280 = .000876.
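A quick check of the interval in part (b) (a Python sketch; the inputs are the column sums from the table above, and the variable names are ours):

```python
import math

# Column sums from the table: n = 10 schools sampled from N = 46,
# with average psu size Mbar = 196.1.
N, n, Mbar = 46, 10, 196.1
sum_that, sum_M = 1135.3175, 1961
st2_resid = 581.797        # sample variance of the residuals t̂_i − M_i·ȳ̂_r
sum_within = 3484.1873     # Σ M_i²(1 − m_i/M_i)s_i²/m_i

ybar_r = sum_that / sum_M
v_hat = ((1 - n / N) * st2_resid / n + sum_within / (n * N)) / Mbar**2
half_width = 1.96 * math.sqrt(v_hat)
ci = (ybar_r - half_width, ybar_r + half_width)
```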
5.17 (a) The following table gives summary quantities; the column ȳi gives the
estimated proportion of children who had previously had measles in each school.

School   Mi    mi   Hadmeas    ȳi       t̂i        t̂i − Mi·ȳ̂_r   Mi²·(1 − mi/Mi)·si²/mi
1 78 40 32 0.8000 62.4000 28.0573 12.1600
2 238 38 10 0.2632 62.6316 -42.1576 249.4572
3 261 19 12 0.6316 164.8421 49.9262 816.4986
4 174 30 19 0.6333 110.2000 33.5894 200.6400
5 236 30 16 0.5333 125.8667 21.9581 417.2408
6 188 25 6 0.2400 45.1200 -37.6546 232.8944
7 113 23 11 0.4783 54.0435 4.2906 115.3497
8 170 43 23 0.5349 90.9302 16.0808 127.8864
9 296 38 5 0.1316 38.9474 -91.3787 235.8449
10 207 21 11 0.5238 108.4286 17.2884 480.1837
Sum 1961 307 145 863.41 2888.1556
Var 1890.1486

(b)

    ȳ̂_r = 863.41/1961 = 0.4403

    V̂(ȳ̂_r) = (1/(196.1)²)[ (1 − 10/46)(1890.15/10) + 2888.16/((10)(46)) ]
            = (1/(196.1)²)[147.92 + 6.28]
            = 0.004.

An approximate 95% CI is

    0.4403 ± 1.96 √0.004 = [0.316, 0.564].

5.18 With the different costs, the last two columns of Table 5.4 change. The table
now becomes:
Number of Stems                          Cost to Sample   Relative
Sampled per Site    ȳ̂      SE(ȳ̂)       One Field        Net Precision
1 1.12 0.15 50 0.15
2 1.01 0.10 70 0.14
3 0.96 0.08 90 0.13
4 0.91 0.07 110 0.12
5 0.91 0.06 130 0.12
Now the relative net precision is highest when one stem is sampled per site.
5.19 (a) Here is the sample ANOVA table from SAS PROC GLM:

Dependent Variable: acres92

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 41 2.0039161E13 488760018795 8.16 <.0001

Error 258 1.5456926E13 59910564633

Corrected Total 299 3.5496086E13

R-Square Coeff Var Root MSE acres92 Mean

0.564546 82.16474 244766.3 297897.0

We estimate Ra² by

    1 − 59,910,564,633/(3.5496086 × 10¹³/299) = 0.4953455.

This value is greater than 0, indicating that there is a clustering effect.


(b) Using (5.32), with c1 = 15c2 and Ra² = 0.5, and using the approximation
M(N − 1) ≈ NM − 1, we have

    m̄_opt = √[ 15(1 − .5)/.5 ] = 3.9.

Taking m̄ = 4, we would have n = 300/4 = 75.
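A one-line check of the optimal subsample size (Python sketch; the function name is ours):

```python
import math

# (5.32) with c1/c2 = 15 and R_a² = 0.5, under the approximation M(N−1) ≈ NM−1.
def m_opt(cost_ratio, ra2):
    return math.sqrt(cost_ratio * (1 - ra2) / ra2)

mbar = m_opt(15, 0.5)   # ≈ 3.9, so take m̄ = 4 and n = 300/4 = 75 psus
```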
5.20 Answers will vary.
5.21 Answers will vary.
5.22 First note that

    Σ_{i=1}^{N} Σ_{j=1}^{M} Σ_{k=1}^{M} (yij − ȳU)(yik − ȳU) = Σ_{i=1}^{N} [ Σ_{j=1}^{M} (yij − ȳU) ]²
                                                             = Σ_{i=1}^{N} [M(ȳiU − ȳU)]²
                                                             = M(SSB).

Thus

    ICC = Σ_{i=1}^{N} Σ_{j=1}^{M} Σ_{k≠j} (yij − ȳU)(yik − ȳU) / [(NM − 1)(M − 1)S²]
        = [ M(SSB) − Σ_{i=1}^{N} Σ_{j=1}^{M} (yij − ȳU)² ] / [(NM − 1)(M − 1)S²]
        = [ M(SSB) − SSTO ] / [(M − 1)·SSTO]
        = [ M(SSTO − SSW) − SSTO ] / [(M − 1)·SSTO]
        = 1 − (M/(M − 1))·(SSW/SSTO),

proving the result.
5.23 From (5.8),

    ICC = 1 − (M/(M − 1))·(SSW/SSTO).

Rewriting, we have

    MSW = (1/(N(M − 1)))·((M − 1)/M)·SSTO·(1 − ICC)
        = ((NM − 1)/(NM))·S²·(1 − ICC).

Using Table 5.1, we know that SSW + SSB = SSTO, so from (5.8),

    ICC = 1 − (M/(M − 1))·(SSTO − (N − 1)MSB)/SSTO
        = −1/(M − 1) + M(N − 1)MSB/[(M − 1)(NM − 1)S²]

and

    MSB = ((NM − 1)/(M(N − 1)))·S²·[1 + (M − 1)ICC].

5.24 (a) From (5.24), we have that

    V(ȳ̂_unb) = (1/(NM)²)[ N²(1 − n/N)(S_t²/n) + (N/n) Σ_{i=1}^{N} (1 − m/M)·M²·Si²/m ]
             = (1 − n/N)·S_t²/(nM²) + (1 − m/M)·(1/(Nnm)) Σ_{i=1}^{N} Si².

The equation above (5.7) states that

    S_t² = M(MSB);

the definition of Si² in Section 5.1 implies

    Σ_{i=1}^{N} Si² = Σ_{i=1}^{N} (1/(M − 1)) Σ_{j=1}^{M} (yij − ȳiU)² = SSW/(M − 1) = N(MSW).

Thus,

    V(ȳ̂_unb) = (1 − n/N)·MSB/(nM) + (1 − m/M)·MSW/(nm).

(b) The first equality follows directly from (5.9). Because SSTO = SSB + SSW,

    MSB = (1/(N − 1))[ SSTO − N(M − 1)MSW ]
        = (1/(N − 1))[ (NM − 1)S² − N(M − 1)S²(1 − Ra²) ]
        = (1/(N − 1))[ (N − 1)S² + N(M − 1)S²·Ra² ]
        = S²[ N(M − 1)Ra²/(N − 1) + 1 ].

(c)

    V(ȳ̂) = (1 − n/N)·(S²/(nM))·[ N(M − 1)Ra²/(N − 1) + 1 ] + (1 − m/M)·(S²/(nm))·(1 − Ra²).

(d) The coefficient of S²·Ra² in (c) is

    (1 − n/N)·N(M − 1)/(nM(N − 1)) − (1 − m/M)·(1/(nm))
        = [ (N − n)(M − 1)m − (M − m)(N − 1) ] / [ Mnm(N − 1) ].

But if N(m − 1) > nm, then

    (N − n)(M − 1)m − (M − m)(N − 1) = M[ N(m − 1) − nm + 1 ] + m(n − 1) > 0,

so V(ȳ̂) is an increasing function of Ra².


5.25 (a) If Mi = M for all i, and mi = m for all i, then from (5.15),

    ȳ̂_r = Σ_{i∈S} M·ȳi/(nM) = (1/(NM))·t̂_unb = ȳ̂_unb.

(b) In the following, let ȳ̂ = ȳ̂_r = ȳ̂_unb.

    Source             df         SS
    Between clusters   n − 1      ssb = Σ_{i∈S} Σ_{j∈Si} (ȳi − ȳ̂)²
    Within clusters    n(m − 1)   ssw = Σ_{i∈S} Σ_{j∈Si} (yij − ȳi)²
    Total              nm − 1     Σ_{i∈S} Σ_{j∈Si} (yij − ȳ̂)²

(c) Let Zi = 1 if psu i is in the sample, and Zi = 0 otherwise. Then

    ssw = Σ_{i∈S} Σ_{j∈Si} (yij − ȳi)²

and

    E[ssw] = E[ Σ_{i=1}^{N} Zi Σ_{j∈Si} (yij − ȳi)² ]
           = E{ Σ_{i=1}^{N} Zi E[ Σ_{j∈Si} (yij − ȳi)² | Z ] }
           = E{ Σ_{i=1}^{N} Zi (m − 1) E[si² | Z] }
           = E{ Σ_{i=1}^{N} Zi (m − 1) Si² }
           = (n/N)(m − 1) Σ_{i=1}^{N} Si².

Thus,

    E[msw] = E[ssw]/(n(m − 1)) = (1/N) Σ_{i=1}^{N} Si² = MSW.

Since Mi = M and mi = m for all i,

    ȳ̂_unb = (1/n) Σ_{i∈S} ȳi = (1/n) Σ_{i=1}^{N} Zi ȳi

and

    ssb = Σ_{i∈S} Σ_{j∈Si} (ȳi − ȳ̂_unb)² = m Σ_{i∈S} (ȳi − ȳ̂_unb)².
    E[ssb] = m·E[ Σ_{i∈S} (ȳi² − 2ȳi·ȳ̂_unb + ȳ̂_unb²) ]
           = m·E[ Σ_{i∈S} ȳi² − n·ȳ̂_unb² ]
           = m·E( E[ Σ_{i=1}^{N} Zi ȳi² | Z1, ..., ZN ] ) − mn·E[ȳ̂_unb²]
           = m·E[ Σ_{i=1}^{N} Zi { V(ȳi | Z) + ȳiU² } ] − mn[ V(ȳ̂_unb) + ȳU² ]
           = m·E[ Σ_{i=1}^{N} Zi { (1 − m/M)(Si²/m) + ȳiU² } ]
             − mn[ (1/M²)(1 − n/N)(S_t²/n) + (1/(nN)) Σ_{i=1}^{N} (1 − m/M)(Si²/m) + ȳU² ]
           = (mn/N) Σ_{i=1}^{N} [ (1 − m/M)(Si²/m) + ȳiU² ] − (m/M²)(1 − n/N)·S_t²
             − (m/N) Σ_{i=1}^{N} (1 − m/M)(Si²/m) − mn·ȳU²
           = (m(n − 1)/N) Σ_{i=1}^{N} (1 − m/M)(Si²/m) + mn[ (1/N) Σ_{i=1}^{N} ȳiU² − ȳU² ]
             − (m/M)(1 − n/N)(MSB)
           = (m(n − 1)/N) Σ_{i=1}^{N} (1 − m/M)(Si²/m) + (mn/N) Σ_{i=1}^{N} (ȳiU − ȳU)²
             − (m/M)(1 − n/N)·SSB/(N − 1)
           = (m(n − 1)/N) Σ_{i=1}^{N} (1 − m/M)(Si²/m)
             + [ mn/(NM) − (m/(M(N − 1)))(1 − n/N) ]·SSB
           = (m(n − 1)/N) Σ_{i=1}^{N} (1 − m/M)(Si²/m)
             + [ ((N − 1)mn − m(N − n))/(NM(N − 1)) ]·SSB
           = (n − 1)(1 − m/M)·MSW + (n − 1)(m/M)·MSB.

Thus

    E[msb] = (1 − m/M)·MSW + (m/M)·MSB.

(d) From (c),

    E[M̂SB] = (M/m)(1 − m/M)·MSW + MSB − (M/m − 1)·MSW = MSB,

so M̂SB = (M/m)·msb − (M/m − 1)·msw is unbiased for MSB.

(e) From (5.22) and (5.23),

    s_t² = (1/(n − 1)) Σ_{i∈S} (t̂i − t̂_unb/N)²
         = (1/(n − 1)) Σ_{i∈S} (M·ȳi − M·ȳ̂)²
         = (M²/m)·(1/(n − 1)) Σ_{i∈S} Σ_{j∈Si} (ȳi − ȳ̂)²
         = (M²/m)·msb

and

    Σ_{i∈S} si² = Σ_{i∈S} (1/(m − 1)) Σ_{j∈Si} (yij − ȳi)² = n·msw.

Then, using (5.24),

    V̂(ȳ̂) = (1/(NM)²)[ N²(1 − n/N)(s_t²/n) + (N/n)(1 − m/M)(M²/m) Σ_{i∈S} si² ]
          = (1 − n/N)·msb/(nm) + (1/N)(1 − m/M)·msw/m.

5.26 (a) From Exercise 5.25,

    msto = (1/(nm − 1)) Σ_{i∈S} Σ_{j∈Si} (yij − ȳ̂)².

Now,

    E[ Σ_{i∈S} Σ_{j∈Si} (yij − ȳ̂)² ] = E[ssw] + E[ssb]
        = (n/N)(m − 1) Σ_{i=1}^{N} Si² + (n − 1)(1 − m/M)·MSW + (n − 1)(m/M)·MSB
        = [ (m − 1)n + (n − 1)(1 − m/M) ]·MSW + (n − 1)(m/M)·MSB
        = [ nm − 1 − (n − 1)(m/M) ]·MSW + (n − 1)(m/M)·MSB
        = [ nm − 1 − (n − 1)(m/M) ]·SSW/(N(M − 1)) + ((n − 1)/(N − 1))(m/M)·SSB.

Writing each coefficient as (nm − 1)/(NM − 1) plus a remainder of order 1/n, and
using SSTO = SSB + SSW, this is

    ((nm − 1)/(NM − 1))·[ SSTO + O(1/n)·SSB + O(1/n)·SSW ].

The notation O(1/n) denotes a term that tends to 0 as n → ∞.

(b) Follows from the last line of (a).
(c) From Exercise 5.25, E[msw] = MSW and

    E[msb] = (m/M)·MSB + (1 − m/M)·MSW.

Consequently,

    E[Ŝ²] = (M(N − 1)/(m(NM − 1)))·E[msb] + (((m − 1)NM + M − m)/(m(NM − 1)))·E[msw]
          = (M(N − 1)/(m(NM − 1)))·[ (m/M)·MSB + (1 − m/M)·MSW ]
            + (((m − 1)NM + M − m)/(m(NM − 1)))·MSW
          = (1/(NM − 1))·SSB
            + (1/(m(NM − 1)))·[ (N − 1)(M − m) + (m − 1)NM + M − m ]·MSW
          = (1/(NM − 1))·SSB + (N(M − 1)/(NM − 1))·MSW
          = S².

5.27 The cost constraint implies that n = C/(c1 + c2·m); substituting into (5.30),
we have

    g(m) = V(ȳ̂_unb) = (c1 + c2·m)·MSB/(CM) − MSB/(NM) + (1 − m/M)(c1 + c2·m)·MSW/(Cm)

    dg/dm = c2·MSB/(CM) − c1·MSW/(Cm²) − c2·MSW/(CM).

Setting the derivative equal to zero and solving for m, we have

    m = √[ c1·M·(MSW) / (c2·(MSB − MSW)) ].

Using Exercise 5.24,

    m = √[ c1·M·S²(1 − Ra²) / (c2·S²[ N(M − 1)Ra²/(N − 1) + 1 − (1 − Ra²) ]) ]
      = √[ c1·M(N − 1)(1 − Ra²) / (c2·(NM − 1)·Ra²) ].

5.28 This exercise does not rely on methods developed in this chapter (other than
a general knowledge of systematic sampling), but represents the type of problem
a sampling practitioner might encounter. (A good sampling practitioner must be
versatile.)
(a) For all three cases, P(detect the contaminant) = P(distance between container
and nearest grid point is less than R). We can calculate the probability by using
simple geometry and trigonometry.
Case 1: R < D.

Since we assume that the waste container is equally likely to be anywhere in the
square relative to the nearest grid point, the probability is the ratio

    (area of shaded part)/(area of square) = πR²/(4D²).

Case 2: D ≤ R ≤ √2·D.


The probability is again

    (area of shaded part)/(area of square).

The area of the shaded part is

    2·(1/2)·D·√(R² − D²) + 2·((π/2 − 2θ)/(π/2))·(πR²/4)
        = D·√(R² − D²) + (1 − 4θ/π)(πR²/4),

where cos(θ) = D/R. The probability is thus

    √[(R/D)² − 1] + (1 − 4θ/π)·(πR²/(4D²)).

Case 3: R > √2·D.
The probability of detection is 1.
(b) Even though the square grid is commonly used in practice, we can increase the
probability of detecting a contaminant by staggering the rows.
5.30 (a) Because the Ai’s and the εij’s are independent,

    V_M1[ Σ_{i∈S} Σ_{j∈Si} bij·Yij ] = V_M1[ Σ_{i∈S} Ai ( Σ_{j∈Si} bij ) ] + V_M1[ Σ_{i∈S} Σ_{j∈Si} bij·εij ]
                                     = Σ_{i∈S} ( Σ_{j∈Si} bij )²·σ_A² + Σ_{i∈S} Σ_{j∈Si} bij²·σ².

(b) Let

    cij = bij − 1 if i ∈ S and j ∈ Si, and cij = −1 otherwise.

Then

    T̂ − T = Σ_{i=1}^{N} Σ_{j=1}^{Mi} cij·yij.

Using the same argument as in part (a),

    V_M1[T̂ − T] = Σ_{i=1}^{N} ( Σ_{j=1}^{Mi} cij )²·σ_A² + Σ_{i=1}^{N} Σ_{j=1}^{Mi} cij²·σ²
                = Σ_{i∈S} [ Σ_{j∈Si} (bij − 1) − (Mi − mi) ]²·σ_A² + Σ_{i∉S} Mi²·σ_A²
                  + Σ_{i∈S} [ Σ_{j∈Si} (bij − 1)² + (Mi − mi) ]·σ² + Σ_{i∉S} Mi·σ²
                = Σ_{i∈S} [ Σ_{j∈Si} bij − Mi ]²·σ_A² + Σ_{i∉S} Mi²·σ_A²
                  + Σ_{i∈S} [ Σ_{j∈Si} (bij² − 2bij) + Mi ]·σ² + Σ_{i∉S} Mi·σ².

5.32 Recall that

    T̂r = Σ_{i∈S} Σ_{j∈Si} bij·yij,   with   bij = M0·Mi/(mi·Σ_{k∈S} Mk).

Then, from (5.36),

    V_M1[T̂r − T] = σ_A²·[ Σ_{i∈S} ( M0·Mi/Σ_{k∈S} Mk − Mi )² + Σ_{i∉S} Mi² ]
                   + σ²·[ Σ_{i∈S} Σ_{j∈Si} ( (M0·Mi/(mi·Σ_{k∈S} Mk))² − 2·M0·Mi/(mi·Σ_{k∈S} Mk) ) + M0 ]
                 = σ_A²·[ Σ_{i∈S} Mi²·( M0/Σ_{k∈S} Mk − 1 )² + Σ_{i∉S} Mi² ]
                   + σ²·Σ_{i∈S} [ M0²·Mi²/(mi·(Σ_{k∈S} Mk)²) − 2·M0·Mi/Σ_{k∈S} Mk ] + σ²·M0,

which is minimized when

    Σ_{i∈S} Mi²/mi

is minimized. Let

    g(m1, ..., mN, λ) = Σ_{i∈S} Mi²/mi − λ( L − Σ_{i∈S} mi ).

Then

    ∂g/∂λ = Σ_{i∈S} mi − L

and, for k ∈ S,

    ∂g/∂mk = −Mk²/mk² + λ.

Setting the partial derivatives equal to zero, we have that Mk/mk = √λ is constant
for all k; that is, mk is proportional to Mk.
Chapter 6

Sampling with Unequal Probabilities
6.2 (a) Instead of creating columns of cumulative Mi ranges as in Example 6.2,
we create columns for the cumulative ψi ranges. Then draw 10 random numbers
between 0 and 1 to select the psu’s with replacement. A table giving the cumulative
ψi range for each psu follows:

psu      ψi          Cumulative ψi range
1 0.000110 0.000000 0.000110
2 0.018556 0.000110 0.018666
3 0.062999 0.018666 0.081665
4 0.078216 0.081665 0.159881
5 0.075245 0.159881 0.235126
6 0.073983 0.235126 0.309109
7 0.076580 0.309109 0.385689
8 0.038981 0.385689 0.424670
9 0.040772 0.424670 0.465442
10 0.022876 0.465442 0.488318
11 0.003721 0.488318 0.492039
12 0.024917 0.492039 0.516956
13 0.040654 0.516956 0.557610
14 0.014804 0.557610 0.572414
15 0.005577 0.572414 0.577991
16 0.070784 0.577991 0.648775
17 0.069635 0.648775 0.718410
18 0.034650 0.718410 0.753060
19 0.069492 0.753060 0.822552
20 0.036590 0.822552 0.859142
21 0.033853 0.859142 0.892995
22 0.016959 0.892995 0.909954
23 0.009066 0.909954 0.919020
24 0.021795 0.919020 0.940815
25 0.059185 0.940815 1.000000
(Note: the numbers in the “Cumulative ψi range” columns were rounded to fit in the
table.)
Ten random numbers I generated between 0 and 1 were:
{0.46242032, 0.34980142, 0.35083063, 0.55868338, 0.62149246,
0.03779992, 0.88290415, 0.99612658, 0.02660724, 0.26350658}.
Using these ten random numbers would result in psu’s 9, 7, 7, 14, 16, 3, 21, 25, 3,
and 6 being the sample.
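The selection rule can be checked mechanically: accumulate the ψi and find, for each random number, the psu whose cumulative range contains it (a Python sketch using the ψi and draws listed above; names are ours):

```python
from bisect import bisect_left
from itertools import accumulate

# ψ_i for the 25 psus, in order, copied from the table above.
psi = [0.000110, 0.018556, 0.062999, 0.078216, 0.075245, 0.073983, 0.076580,
       0.038981, 0.040772, 0.022876, 0.003721, 0.024917, 0.040654, 0.014804,
       0.005577, 0.070784, 0.069635, 0.034650, 0.069492, 0.036590, 0.033853,
       0.016959, 0.009066, 0.021795, 0.059185]
cum = list(accumulate(psi))        # upper endpoints of the cumulative ψ ranges

draws = [0.46242032, 0.34980142, 0.35083063, 0.55868338, 0.62149246,
         0.03779992, 0.88290415, 0.99612658, 0.02660724, 0.26350658]

# A draw u selects the psu whose cumulative range contains u (labels 1-based).
sample = [bisect_left(cum, u) + 1 for u in draws]
```

The resulting `sample` is the with-replacement sample 9, 7, 7, 14, 16, 3, 21, 25, 3, 6 given above.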
(b) Here max{ψi} = 0.078216. To use Lahiri’s method, we select two random
numbers for each draw—the first is a random integer between 1 and 25, and the
second is a random uniform between 0 and 0.08 (or any other number larger than
max{ψi}). Thus, if our pair of random numbers is (20, 0.054558), we reject the pair
and try again because 0.054558 > ψ20 = 0.03659. If the next pair is (8, 0.028979), we
include psu 8 in the sample.
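A minimal sketch of the accept/reject step in Lahiri’s method (Python; only the two ψi used in the example are included, and the function name is ours):

```python
# ψ_i for the two psus mentioned in the example above.
psi = {8: 0.038981, 20: 0.036590}

def lahiri_draw(pairs, psi):
    """Return the first accepted psu label from a stream of (label, u) pairs."""
    for label, u in pairs:
        if u <= psi[label]:     # accept: u falls at or below ψ_label
            return label
    return None                 # every pair was rejected

# (20, 0.054558) is rejected since 0.054558 > ψ_20 = 0.03659;
# (8, 0.028979) is then accepted, so psu 8 enters the sample.
chosen = lahiri_draw([(20, 0.054558), (8, 0.028979)], psi)
```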
6.3 Calculate t̂_ψS = ti/ψi for each sample.

Store    ψi       ti    t̂_ψS    (t̂_ψS − t)²

  A      1/16     75    1200     810,000
  B      2/16     75     600      90,000
  C      3/16     75     400      10,000
  D     10/16     75     120      32,400

    E[t̂_ψ] = (1/16)(1200) + (2/16)(600) + (3/16)(400) + (10/16)(120) = 300.

    V[t̂_ψ] = (1/16)(810,000) + · · · + (10/16)(32,400) = 84,000.
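Since the ψi and ti are all listed, the two displays can be verified by direct enumeration (a Python sketch; names are ours):

```python
# Exercise 6.3: one store is drawn with probability ψ_i, and t̂_ψ = t_i/ψ_i.
psi = {'A': 1/16, 'B': 2/16, 'C': 3/16, 'D': 10/16}
t = {'A': 75, 'B': 75, 'C': 75, 'D': 75}

t_hat = {s: t[s] / psi[s] for s in psi}                 # 1200, 600, 400, 120
E = sum(psi[s] * t_hat[s] for s in psi)                 # expected value: 300
V = sum(psi[s] * (t_hat[s] - E) ** 2 for s in psi)      # variance: 84,000
```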

6.4

    Store    ψi      ti     t̂_ψS      (t̂_ψS − t)²
      A      7/16    11      25.14       75546.4
      B      3/16    20     106.67       37377.8
      C      3/16    24     128.00       29584.0
      D      3/16   245    1306.67     1013377.8

As shown in (6.3) for the general case,

    E[t̂_ψ] = (7/16)(25.14) + (3/16)(106.67) + (3/16)(128) + (3/16)(1306.67) = 300.

Using (6.4),

    V[t̂_ψ] = (7/16)(75546.4) + (3/16)(37377.8) + (3/16)(29584) + (3/16)(1013377.8)
           = 235,615.2.

This is a poor sampling design. Store A, with the smallest sales, is sampled with
the largest probability, while Store D is sampled with a smaller probability.
The ψi used in this exercise produce a higher variance than simple random sampling.
6.5 We use (6.5) to calculate t̂_ψ for each sample. So for sample (A,A),

    t̂_ψ = (1/2)(2·11/ψA) = 11/ψA = 176.

For sample (A,B),

    t̂_ψ = (1/2)[ 11/ψA + 20/ψB ] = (1/2)[176 + 160] = 168.

The results are given in the following table:

Sample, S    P(S)     t1/ψ1    t2/ψ2    t̂_ψ    P(S)·(t̂_ψ − t)²
(A,A) 1/256 176 176 176 60.06
(A,B) 2/256 176 160 168 136.13
(A,C) 3/256 176 128 152 256.69
(A,D) 10/256 176 392 284 10.00
(B,A) 2/256 160 176 168 136.13
(B,B) 4/256 160 160 160 306.25
(B,C) 6/256 160 128 144 570.38
(B,D) 20/256 160 392 276 45.00
(C,A) 3/256 128 176 152 256.69
(C,B) 6/256 128 160 144 570.38
(C,C) 9/256 128 128 128 1040.06
(C,D) 30/256 128 392 260 187.50
(D,A) 10/256 392 176 284 10.00
(D,B) 20/256 392 160 276 45.00
(D,C) 30/256 392 128 260 187.50
(D,D) 100/256 392 392 392 3306.25
Total 1 7124.00

    E[t̂_ψ] = (1/256)(176) + · · · + (100/256)(392) = 300.

    V[t̂_ψ] = (1/256)(176 − 300)² + · · · + (100/256)(392 − 300)² = 7124.

Of course, an easier solution is to note that (6.5) and (6.6) imply that E[t̂_ψ] = t,
and that V[t̂_ψ] will be half of the variance found when taking a sample of one psu
in Section 6.2; i.e., V[t̂_ψ] = 14248/2 = 7124.
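The 16-row table can likewise be generated by brute force (a Python sketch; it confirms the expected value 300 and the variance 7124):

```python
# Exercise 6.5: enumerate all 16 ordered with-replacement samples of size n = 2.
psi = {'A': 1/16, 'B': 2/16, 'C': 3/16, 'D': 10/16}
t = {'A': 11, 'B': 20, 'C': 24, 'D': 245}

E = V = 0.0
for i in psi:
    for k in psi:
        p = psi[i] * psi[k]                          # P(sample (i, k))
        t_hat = (t[i] / psi[i] + t[k] / psi[k]) / 2  # (6.5) with n = 2
        E += p * t_hat
        V += p * (t_hat - 300) ** 2
```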
6.6 (a) The following table does the calculations, using (6.4) to find the variance.

name      ti      ψi      t̂_ψ = ti/ψi    ψi·(t̂_ψ − t)²    t̂SRS = 13·ti    (1/13)·(t̂SRS − t)²
Apache 31621 0.0572 553292.1 20477223 411073 1997590608
Cochise 51126 0.0969 527405.6 194693778 664638 656992453
Coconino 53443 0.0958 558108.6 19071085 694759 1155043188
Gila 28189 0.0423 667034.6 379902891 366457 3256832592
Graham 11430 0.0276 414597.1 684957758 148590 13804863397
Greenlee 3744 0.0070 532113.6 11318255 48672 21084888877
La Paz 15133 0.0162 932417.7 2105687838 196729 10845710928
Mohave 80062 0.1276 627317.4 387423351 1040806 16890146325
Navajo 47413 0.0802 590892.8 27974547 616369 149926608
Pinal 81154 0.1480 548502.8 83232656 1055002 17929037997
Santa Cruz 13036 0.0316 412582.0 805214838 169468 12477690693
Yavapai 81730 0.1379 592659.0 57604012 1062490 18489514797
Yuma 74140 0.1317 562787.3 11723897 963820 11796136677
Sum 572221 1.0000 4789282131 1.30534E+11

The t̂_ψ’s are shown in the table, and V(t̂_ψ) = 4,789,282,131.

(b) Using ψi = 1/13 for each i, we get the SRS estimators in the t̂SRS column, with
V(t̂_ψ) = 130,534,375,140.
The unequal-probability sample is more efficient because ti and Mi are highly
correlated: the correlation is 0.9905. This means that the quantity ti/ψi does not
vary much from sample to sample.
6.7 From (6.6),

    V̂(t̂_ψ) = (1/n)·(1/(n − 1)) Σ_{i∈R} ( ti/ψi − t̂_ψ )².

In an SRSWR, ψi = 1/N, so we have t̂_ψ = N·t̄ and

    V̂(t̂_ψ) = (1/n)·(1/(n − 1)) Σ_{i∈R} (N·ti − N·t̄)² = (N²/n)·(1/(n − 1)) Σ_{i∈R} (ti − t̄)².

6.8 We use (6.13) and (6.14), along with the following calculations from a
spreadsheet.

    Academic
    Unit     Mi    ψi          yij         ȳi     t̂i       t̂i/ψi
    14       65    0.0805452   3 0 0 4     1.75   113.75   1412.25
    23       25    0.0309789   2 1 2 0     1.25    31.25   1008.75
     9       48    0.0594796   0 0 1 0     0.25    12.00    201.75
    14       65    0.0805452   2 0 1 0     0.75    48.75    605.25
    16        2    0.0024783   2 0         1.00     2.00    807.00
     6       62    0.0768278   0 2 2 5     2.25   139.50   1815.75
    14       65    0.0805452   1 0 0 3     1.00    65.00    807.00
    19       62    0.0768278   4 1 0 0     1.25    77.50   1008.75
    21       61    0.0755886   2 2 3 1     2.00   122.00   1614.00
    11       41    0.0508055   2 5 12 3    5.50   225.50   4438.50

    average                                                1371.90
    std. dev.                                              1179.47
Thus t̂_ψ = 1371.90 and SE(t̂_ψ) = (1/√10)(1179.47) = 372.98.
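A check of the mean and standard error computed from the t̂i/ψi column (a Python sketch; the column values are copied from the table above):

```python
import math

# The t̂_i/ψ_i column from the spreadsheet above (n = 10 draws with replacement).
ratios = [1412.25, 1008.75, 201.75, 605.25, 807.00,
          1815.75, 807.00, 1008.75, 1614.00, 4438.50]
n = len(ratios)

t_psi = sum(ratios) / n                                        # t̂_ψ = 1371.90
s = math.sqrt(sum((r - t_psi) ** 2 for r in ratios) / (n - 1)) # ≈ 1179.47
se = s / math.sqrt(n)                                          # SE(t̂_ψ) ≈ 372.98
```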
Here is SAS code for calculating these estimates. Note that unit 14 appears 3 times;
in SAS, you have to give each of these repetitions a different unit number. Otherwise,
SAS will just put all of the observations in the same psu for calculations.

data faculty;
input unit $ Mi psi y1 y2 y3 y4;
array yarray{4} y1 y2 y3 y4;
sampwt = (1/( 10*psi))*(Mi/4);
if unit = 16 then sampwt = (1/( 10*psi));
do i = 1 to 4;
y = yarray{i};

if y ne . then output;
end;
datalines;
/* Note: label the 3 unit 14s with different psu numbers */
14a 65 0.0805452 3 0 0 4
23 25 0.0309789 2 1 2 0
9 48 0.0594796 0 0 1 0
14b 65 0.0805452 2 0 1 0
16 2 0.0024783 2 0 . .
6 62 0.0768278 0 2 2 5
14c 65 0.0805452 1 0 0 3
19 62 0.0768278 4 1 0 0
21 61 0.0755886 2 2 3 1
11 41 0.0508055 2 5 12 3
;

proc surveymeans data=faculty mean clm sum clsum;


cluster unit;
weight sampwt;
var y;
run;

6.12 (a) The correlation between ψi and ti (= number of farms in county) is 0.26.
We expect some benefit from pps sampling, but not a great deal—sampling with
probability proportional to population works better for quantities highly correlated
with population, such as number of physicians as in Example 6.5.
(b) As in Example 6.5, we form a new column ti/ψi. The mean of this column is
1,896,300, and the standard deviation of the column is 3,674,225. Thus

    t̂_ψ = 1,896,300

and

    SE[t̂_ψ] = 3,674,225/√100 = 367,423.

A histogram of the ti/ψi exhibits strong skewness, however, so a confidence interval
using the normal approximation may not be appropriate.
Here is SAS code for producing these estimates. Note that we do not use the cluster
statement in proc surveymeans since the observations are psu totals.

data statepop;
infile statepop delimiter=’,’ firstobs=2;
input state $ county $ landarea popn physicns farmpop
numfarm farmacre veterans percviet;
psii = popn/255077536;
wt = 1/(100*psii); /* weight = 1/(n \psi_i) */

/* Be careful when constructing your dataset that if a psu is


selected multiple times, then it appears that many times
in the data. Otherwise, you will get the wrong estimate.*/
proc print data=statepop;
proc corr data=statepop;

proc gplot data=statepop;


plot numfarm*psii;
run;

/*Omit the ’total’ option because sampling is with replacement*/

proc surveymeans data=statepop nobs mean sum clm clsum;


var numfarm;
weight wt;
run;

6.13 (a) Corr(population, number of veterans) = 0.99. We expect the unequal-probability sampling to be very efficient here.

(b)

   t̂_ψ = (1/100) Σ_{i∈S} t_i/ψ_i = 27,914,180

   SE[t̂_ψ] = 10,874,532/√100 = 1,087,453

Note that Allen County, Ohio appears to be an outlier among the t_i/ψ_i: it has
population 26,405 and 12,642 veterans.

(c) For each county,

   Vietvet_i = veterans_i × percviet_i/100.

We then form a column with ith entry Vietvet_i/ψ_i, and find the mean (= 8,050,477)
and standard deviation (= 3,273,372) of that column. Then

   t̂_ψ = 8,050,477
   SE[t̂_ψ] = 3,273,372/√100 = 327,337

Here is SAS code for calculating these estimates:

data statepop;
infile statepop delimiter=',' firstobs=2;
input state $ county $ landarea popn physicns farmpop
numfarm farmacre veterans percviet;
vietvet = veterans*percviet/100;
psii = popn/255077536;
wt = 1/(100*psii); /* weight = 1/(n \psi_i) */

proc print data=statepop; run;



proc corr data=statepop; run;

proc gplot data=statepop;


plot veterans*psii;

proc surveymeans data=statepop nobs mean sum clm clsum;


var veterans vietvet;
weight wt;
run;

6.14 (a) We use (6.28) and (6.29) to calculate the variances. We have

   Class    t̂_i      V̂(t̂_i)
     4     110.00      16.50
    10     106.25     185.94
     1     154.00     733.33
     9     195.75    2854.69
    14     200.00    1200.00

From Table 6.7, and calculating

   V̂(t̂_i) = M_i² (1 − m_i/M_i) s_i²/m_i,

we have that the second term in (6.28) and (6.29) is

   Σ_{i∈S} V̂(t̂_i)/π_i = 11,355.

To calculate the first term of (6.28), note that if we set π_ii = π_i, we can write

   Σ_{i∈S} (1 − π_i) t̂_i²/π_i² + Σ_{i∈S} Σ_{k∈S, k≠i} [(π_ik − π_i π_k)/π_ik] t̂_i t̂_k/(π_i π_k)
     = Σ_{i∈S} Σ_{k∈S} [(π_ik − π_i π_k)/π_ik] t̂_i t̂_k/(π_i π_k).

We obtain that the first term of (6.28) is

   Σ_{i∈S} Σ_{k∈S} [(π_ik − π_i π_k)/π_ik] t̂_i t̂_k/(π_i π_k) = 6059.6,

so

   V̂_HT(t̂_HT) = 6059.6 + 11,355 = 17,414.6.

Similarly, computing the first term with the SYG form gives

   V̂_SYG(t̂_HT) = 66,139.41.

Note how widely these values differ because of the instability of the estimators with
this small sample size (n = 5).
(b) Here is the pseudo-fpc calculated by SAS.

Statistics

Std Error
Variable Mean of Mean 95% CL for Mean
______________________________________________________________
y 3.450000 0.393436 2.35764732 4.54235268
______________________________________________________________

Statistics

Variable Sum Std Dev 95% CL for Sum


_____________________________________________________________
y 2232.150000 254.552912 1525.39781 2938.90219
_____________________________________________________________

This standard error is actually pretty close to the SYG SE of 257.


6.15 (a) Let J be an integer greater than max{M_i}. Let U_1, U_2, … be independent
discrete uniform {1, …, N} random variables, and let V_1, V_2, … be independent
discrete uniform {1, …, J} random variables. Assume that all U_i and V_j are
independent. Then, on any given iteration of the procedure,

   P(select psu i)
     = P(select psu i with first pair of random numbers)
       + P(select psu i with second pair) + ⋯
     = P(U_1 = i and V_1 ≤ M_i)
       + P(U_2 = i, V_2 ≤ M_i) P(∪_{j=1}^N {U_1 = j, V_1 > M_j})
       + ⋯ + P(U_k = i, V_k ≤ M_i) ∏_{l=1}^{k−1} P(∪_{j=1}^N {U_l = j, V_l > M_j}) + ⋯
     = (1/N)(M_i/J) + (1/N)(M_i/J) [(1/N) Σ_{j=1}^N (J − M_j)/J]
       + ⋯ + (1/N)(M_i/J) [(1/N) Σ_{j=1}^N (J − M_j)/J]^{k−1} + ⋯
     = (1/N)(M_i/J) Σ_{k=0}^∞ [(1/N) Σ_{j=1}^N (J − M_j)/J]^k
     = (1/N)(M_i/J) · 1/[1 − Σ_{j=1}^N (J − M_j)/(JN)]
     = M_i / Σ_{j=1}^N M_j.

(b) Let W represent the number of pairs of random numbers that must be generated
to obtain the first valid psu. Since sampling is done with replacement, and hence
all psus are selected independently, we have E[X] = nE[W]. But W has
a geometric distribution with success probability

   p = P(U_1 = i, V_1 ≤ M_i for some i) = Σ_{i=1}^N (1/N)(M_i/J).

Then

   P(W = k) = (1 − p)^{k−1} p

and

   E[W] = 1/p = NJ / Σ_{i=1}^N M_i.

Hence,

   E[X] = nNJ / Σ_{i=1}^N M_i.
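A numeric sketch of both results, with made-up psu sizes M_i (any values below J work):

```python
# (a) Summing the geometric series for P(select psu i) should recover
#     M_i / sum(M_j).  (b) E[W] = 1/p with p = sum(M_i)/(N*J).
M = [3, 7, 5, 10]          # toy psu sizes, all less than J
N, J = len(M), 12

reject = sum((J - Mj) / (N * J) for Mj in M)   # P(one pair is rejected)
p_select = [sum((Mi / (N * J)) * reject**k for k in range(1000)) for Mi in M]
exact = [Mi / sum(M) for Mi in M]

p = sum(M) / (N * J)       # success probability for one pair
EW = 1 / p                 # expected pairs per accepted psu = N*J/sum(M_i)
print(p_select, exact, EW)
```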

6.16 The random variables Q_1, …, Q_N have a joint multinomial distribution with
n trials and probabilities ψ_1, …, ψ_N. Consequently,

   Σ_{i=1}^N Q_i = n,
   E[Q_i] = nψ_i,
   V[Q_i] = nψ_i(1 − ψ_i),

and

   Cov(Q_i, Q_k) = −nψ_i ψ_k for i ≠ k.

(a) Using successive conditioning,

   E[t̂_ψ] = E[ (1/n) Σ_{i=1}^N Σ_{j=1}^{Q_i} t̂_ij/ψ_i ]
           = E{ E[ (1/n) Σ_{i=1}^N Σ_{j=1}^{Q_i} t̂_ij/ψ_i | Q_1, …, Q_N ] }
           = E[ (1/n) Σ_{i=1}^N Q_i t_i/ψ_i ]
           = (1/n) Σ_{i=1}^N nψ_i t_i/ψ_i = t.

Thus t̂_ψ is unbiased for t.


(b) To find V(t̂_ψ), note that

   V(t̂_ψ) = V[E(t̂_ψ | Q_1, …, Q_N)] + E[V(t̂_ψ | Q_1, …, Q_N)]
           = V[ (1/n) Σ_{i=1}^N Q_i t_i/ψ_i ]
             + E[ V( (1/n) Σ_{i=1}^N Σ_{j=1}^{Q_i} t̂_ij/ψ_i | Q_1, …, Q_N ) ].

Looking at the two terms separately,

   V[ (1/n) Σ_{i=1}^N Q_i t_i/ψ_i ]
     = (1/n²) Σ_{i=1}^N Σ_{k=1}^N (t_i/ψ_i)(t_k/ψ_k) Cov(Q_i, Q_k)
     = (1/n²) Σ_{i=1}^N (t_i/ψ_i)² nψ_i(1 − ψ_i)
       + (1/n²) Σ_{i=1}^N Σ_{k≠i} (t_i/ψ_i)(t_k/ψ_k)(−nψ_i ψ_k)
     = (1/n) Σ_{i=1}^N (t_i/ψ_i)² [ψ_i(1 − ψ_i) + ψ_i²] − (1/n) Σ_{i=1}^N Σ_{k=1}^N t_i t_k
     = (1/n) Σ_{i=1}^N ψ_i (t_i/ψ_i − t)².

For the second term,

   E[ V( (1/n) Σ_{i=1}^N Σ_{j=1}^{Q_i} t̂_ij/ψ_i | Q_1, …, Q_N ) ]
     = E[ (1/n²) Σ_{i=1}^N Σ_{j=1}^{Q_i} V(t̂_ij)/ψ_i² ]
     = E[ (1/n²) Σ_{i=1}^N Q_i V_i/ψ_i² ]
     = (1/n) Σ_{i=1}^N V_i/ψ_i.

This equality uses the assumptions that V(t̂_ij) = V_i for any j, and that the estimates
t̂_ij are independent. Thus,

   V[t̂_ψ] = (1/n) Σ_{i=1}^N ψ_i (t_i/ψ_i − t)² + (1/n) Σ_{i=1}^N V_i/ψ_i.
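In the one-stage case (V_i = 0, so t̂_ij = t_i), the unbiasedness and the variance formula can be verified by exact enumeration over all ordered with-replacement samples; the population below is made up for illustration.

```python
# Exact enumeration of the with-replacement pps estimator for N = 3, n = 2:
# E[t-hat_psi] should equal t, and V should match (1/n) sum psi_i (t_i/psi_i - t)^2.
from itertools import product

t = [10.0, 20.0, 30.0]
psi = [0.2, 0.3, 0.5]
n, total = 2, sum(t)

E = V = 0.0
for draw in product(range(3), repeat=n):          # all ordered samples
    prob = 1.0
    for i in draw:
        prob *= psi[i]
    that = sum(t[i] / psi[i] for i in draw) / n   # t-hat_psi for this sample
    E += prob * that
    V += prob * that**2
V -= E**2

V_formula = sum(p * (ti / p - total) ** 2 for ti, p in zip(t, psi)) / n
print(E, V, V_formula)
```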

(c) To show that (6.9) is an unbiased estimator of the variance, note that

   E[V̂(t̂_ψ)] = E[ (1/n) Σ_{i=1}^N Σ_{j=1}^{Q_i} (t̂_ij/ψ_i − t̂_ψ)²/(n − 1) ]
     = (1/n) E[ Σ_i Σ_j { (t̂_ij/ψ_i)² − 2(t̂_ij/ψ_i) t̂_ψ + t̂_ψ² }/(n − 1) ]
     = (1/(n(n−1))) E[ Σ_i Σ_j (t̂_ij/ψ_i)² ] − E[t̂_ψ²]/(n − 1)
     = (1/(n(n−1))) E[ E{ Σ_i Σ_j (t̂_ij/ψ_i)² | Q_1, …, Q_N } ] − [t² + V(t̂_ψ)]/(n − 1)
     = (1/(n(n−1))) E[ Σ_{i=1}^N Q_i (t_i² + V_i)/ψ_i² ] − [t² + V(t̂_ψ)]/(n − 1)
     = (1/(n−1)) Σ_{i=1}^N (t_i² + V_i)/ψ_i − [t² + V(t̂_ψ)]/(n − 1)
     = (1/(n−1)) [ Σ_{i=1}^N ψ_i t_i²/ψ_i² − t² ] + (1/(n−1)) Σ_{i=1}^N V_i/ψ_i − V(t̂_ψ)/(n − 1)
     = (1/(n−1)) [ Σ_{i=1}^N ψ_i (t_i/ψ_i − t)² + Σ_{i=1}^N V_i/ψ_i ] − V(t̂_ψ)/(n − 1)
     = n V(t̂_ψ)/(n − 1) − V(t̂_ψ)/(n − 1) = V(t̂_ψ).

6.17 It is sufficient to show that (6.22) and (6.23) are equivalent. When an SRS
of psus is selected, π_i = π₁ = n/N and π_ik = π₁₂ = n(n−1)/[N(N−1)] for all i and k ≠ i.
So, starting with the SYG form,

   (1/2) Σ_{i∈S} Σ_{k∈S, k≠i} [(π_i π_k − π_ik)/π_ik] (t_i/π_i − t_k/π_k)²
     = (1/2) [(π₁² − π₁₂)/(π₁₂ π₁²)] Σ_{i∈S} Σ_{k≠i} (t_i − t_k)²
     = (1/2) [(π₁² − π₁₂)/(π₁₂ π₁²)] Σ_{i∈S} Σ_{k≠i} (t_i² + t_k² − 2 t_i t_k)
     = [(π₁² − π₁₂)/(π₁₂ π₁²)] (n − 1) Σ_{i∈S} t_i²
       − [(π₁² − π₁₂)/(π₁₂ π₁²)] Σ_{i∈S} Σ_{k≠i} t_i t_k.

But in an SRS,

   [(π₁² − π₁₂)/(π₁₂ π₁²)] (n − 1) = (N/n)² · [(N − n)/(N(n − 1))] · (n − 1)
     = (N²/n²)(1 − n/N),

which proves the result.
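The equivalence can be confirmed numerically: with SRS inclusion probabilities, the HT form (6.22) and the SYG form (6.23) both reduce to the familiar SRS variance N²(1 − n/N)S_t²/n. The t_i values below are made up.

```python
# HT form, SYG form, and the SRS variance should all agree for an SRS design.
t = [4.0, 9.0, 1.0, 6.0, 5.0]
N, n = len(t), 2
pi = n / N
pik = n * (n - 1) / (N * (N - 1))

vHT = sum((1 - pi) / pi * ti**2 for ti in t)
vHT += sum((pik - pi * pi) * t[i] * t[k] / (pi * pi)
           for i in range(N) for k in range(N) if k != i)

vSYG = 0.5 * sum((pi * pi - pik) * (t[i] / pi - t[k] / pi) ** 2
                 for i in range(N) for k in range(N) if k != i)

tbar = sum(t) / N
St2 = sum((ti - tbar) ** 2 for ti in t) / (N - 1)
vSRS = N**2 / n * (1 - n / N) * St2
print(vHT, vSYG, vSRS)   # all three equal 63.75 for these t_i
```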
6.18 To show the results from stratified sampling, we treat the H strata as the N
psus. Note that since all strata are subsampled, we have π_i = 1 for each stratum.
Thus, (6.25) becomes

   t̂_HT = Σ_{i=1}^H t̂_i = Σ_{i=1}^H N_i ȳ_i.

For (3.3), since all strata are sampled, either (6.26) or (6.27) gives

   V(t̂_HT) = Σ_{i=1}^H V(t̂_i).

Result (3.4) follows from (6.29) similarly.


6.19 (a) We use the method on page 239 to calculate the probabilities.

   π_jk
   Unit    1     2     3     4     5     6     7     8     9     π_i
   1      —    0.049 0.080 0.088 0.007 0.021 0.080 0.021 0.035  0.381
   2     0.049  —    0.041 0.045 0.003 0.010 0.041 0.010 0.018  0.218
   3     0.080 0.041  —    0.073 0.006 0.017 0.067 0.017 0.029  0.330
   4     0.088 0.045 0.073  —    0.006 0.019 0.073 0.019 0.032  0.356
   5     0.007 0.003 0.006 0.006  —    0.001 0.006 0.001 0.002  0.033
   6     0.021 0.010 0.017 0.019 0.001  —    0.017 0.004 0.007  0.097
   7     0.080 0.041 0.067 0.073 0.006 0.017  —    0.017 0.029  0.330
   8     0.021 0.010 0.017 0.019 0.001 0.004 0.017  —    0.007  0.097
   9     0.035 0.018 0.029 0.032 0.002 0.007 0.029 0.007  —     0.159
   π_k   0.381 0.218 0.330 0.356 0.033 0.097 0.330 0.097 0.159  2.000

(b) Using (6.21), V(t̂_HT) = 1643. Using (6.46) (we only need the first term),
V_WR(t̂_ψ) = 1867.

6.20 (a) We write

   Cov(t̂_x, t̂_y) = Cov( Σ_{i=1}^N Z_i t_ix/π_i , Σ_{k=1}^N Z_k t_ky/π_k )
     = Σ_{i=1}^N Σ_{k=1}^N (t_ix/π_i)(t_ky/π_k) Cov(Z_i, Z_k)
     = Σ_{i=1}^N (t_ix t_iy/π_i²) π_i(1 − π_i)
       + Σ_{i=1}^N Σ_{k≠i} (π_ik − π_i π_k) t_ix t_ky/(π_i π_k)
     = Σ_{i=1}^N [(1 − π_i)/π_i] t_ix t_iy
       + Σ_{i=1}^N Σ_{k≠i} (π_ik − π_i π_k) t_ix t_ky/(π_i π_k).

(b) If the design is an SRS,

   Cov(t̂_x, t̂_y)
     = Σ_{i=1}^N [(1 − π_i)/π_i] t_ix t_iy + Σ_{i=1}^N Σ_{k≠i} (π_ik − π_i π_k) t_ix t_ky/(π_i π_k)
     = (N/n)(1 − n/N) Σ_{i=1}^N t_ix t_iy
       + (N/n)² [ n(n−1)/(N(N−1)) − (n/N)² ] Σ_{i=1}^N Σ_{k≠i} t_ix t_ky
     = (N/n)(1 − n/N) Σ_i t_ix t_iy + Σ_i Σ_{k≠i} [ N(n−1)/(n(N−1)) − 1 ] t_ix t_ky
     = (N/n)(1 − n/N) Σ_i t_ix t_iy − [(N − n)/(n(N−1))] Σ_i Σ_{k≠i} t_ix t_ky
     = (N/n)(1 − n/N) [ Σ_i t_ix t_iy − (1/(N−1)) Σ_i Σ_{k≠i} t_ix t_ky ]
     = (N/n)(1 − n/N) [ Σ_i t_ix t_iy − (1/(N−1)) Σ_i Σ_k t_ix t_ky + (1/(N−1)) Σ_i t_ix t_iy ]
     = (N/n)(1 − n/N) [ (N/(N−1)) Σ_i t_ix t_iy − (1/(N−1)) t_x t_y ]
     = (N²/n)(1 − n/N) (1/(N−1)) [ Σ_{i=1}^N t_ix t_iy − t_x t_y/N ].

6.21 (a) Note that

   ȳ̂ = Σ_{i∈S} Σ_{j=1}^M w_ij y_ij / Σ_{i∈S} Σ_{j=1}^M w_ij = t̂_y/(NM).

Thus,

   (NM)² Cov[ ū̂ − (t_u/t_x) x̄̂ , (ȳ̂ − ū̂) − ((t_y − t_u)/(NM − t_x))(1 − x̄̂) ]
     = Cov[ t̂_u − (t_u/t_x) t̂_x , t̂_y − t̂_u − ((t_y − t_u)/(NM − t_x))(NM − t̂_x) ]
     = Cov[ t̂_u , t̂_y − t̂_u + ((t_y − t_u)/(NM − t_x)) t̂_x ]
       − (t_u/t_x) Cov[ t̂_x , t̂_y − t̂_u + ((t_y − t_u)/(NM − t_x)) t̂_x ]
     = (N²/n)(1 − n/N)(1/(N−1)) [ Σ_{i=1}^N { t_iu t_iy − t_iu² + ((t_y − t_u)/(NM − t_x)) t_iu t_ix
         − (t_u/t_x)( t_ix t_iy − t_ix t_iu + ((t_y − t_u)/(NM − t_x)) t_ix² ) } ]
       − (N²/n)(1 − n/N)(1/(N−1))(1/N) [ t_u t_y − t_u² + ((t_y − t_u)/(NM − t_x)) t_u t_x
         − (t_u/t_x)( t_x t_y − t_x t_u + ((t_y − t_u)/(NM − t_x)) t_x² ) ]
     = (N²/n)(1 − n/N)(1/(N−1)) Σ_{i=1}^N { t_iu t_iy − t_iu² + ((t_y − t_u)/(NM − t_x)) t_iu t_ix
         − (t_u/t_x)( t_ix t_iy − t_ix t_iu + ((t_y − t_u)/(NM − t_x)) t_ix² ) },

since the population-total term in the second set of brackets equals zero.

(b) Now,

   t_iy = Σ_{j=1}^M y_ij   and   t_iu = Σ_{j=1}^M x_ij y_ij.

So if psu i is in domain 1 then x_ij = 1 for every ssu in psu i, t_ix = M, and t_iu = t_iy;
if psu i is in domain 2 then x_ij = 0 for every ssu in psu i, t_ix = 0, and t_iu = 0. We
may then write the sum in the covariance as

   Σ_{i=1}^N { t_iu t_iy − t_iu² + ((t_y − t_u)/(NM − t_x)) t_iu t_ix
       − (t_u/t_x)( t_ix t_iy − t_ix t_iu + ((t_y − t_u)/(NM − t_x)) t_ix² ) }
     = Σ_{i∈domain 1} { ⋯ } + Σ_{i∈domain 2} { ⋯ }
     = Σ_{i∈domain 1} { t_iy² − t_iy² + ((t_y − t_u)/(NM − t_x)) M t_iy
         − (t_u/t_x)( M t_iy − M t_iy + ((t_y − t_u)/(NM − t_x)) M² ) }
     = Σ_{i∈domain 1} { ((t_y − t_u)/(NM − t_x)) M t_iy − (t_u/t_x) ((t_y − t_u)/(NM − t_x)) M² }
     = ((t_y − t_u)/(NM − t_x)) M t_u − (t_x/M)(t_u/t_x) ((t_y − t_u)/(NM − t_x)) M²
     = 0,

since every domain-2 term vanishes, Σ_{i∈domain 1} t_iy = t_u, and domain 1 contains
t_x/M psus.

(c) Almost any example will work, as long as some psus have units from both
domains.
6.22 (a) Since E(Z_i) = π_i,

   E(t̂_y) = Σ_{i=1}^N u_i E(Z_i)/π_i = Σ_{i=1}^N u_i = Σ_{i=1}^N Σ_{k=1}^M ℓ_ik y_k/L_k = Σ_{k=1}^M y_k.

The last equality follows since Σ_{i=1}^N ℓ_ik = L_k.
The variance given is the variance of the one-stage Horvitz–Thompson estimator.

(b) Note that

   t̂_y = Σ_{i=1}^N (Z_i/π_i) Σ_{k=1}^M ℓ_ik y_k/L_k = Σ_{k=1}^M (y_k/L_k) Σ_{i=1}^N ℓ_ik Z_i/π_i.

But the sum is over all k from 1 to M, not just the units in S^B. We need to show
that Σ_{i=1}^N ℓ_ik Z_i/π_i = 0 for k ∉ S^B. (This sum is the numerator of

   w_k* = [ Σ_{i=1}^N ℓ_ik Z_i/π_i ] / [ Σ_{i=1}^N ℓ_ik ].)

But a student is in S^B if and only if s/he is linked to one of the sampled units in
S^A. In other words, k ∈ S^B if and only if Σ_{i∈S^A} ℓ_ik > 0. For k ∉ S^B, we must have
ℓ_ik = 0 for each i ∈ S^A, so every term ℓ_ik Z_i vanishes.

(c) Suppose L_k = 1 for all k. Then,

   t̂_y = Σ_{i=1}^N (Z_i/π_i) Σ_{k=1}^M ℓ_ik y_k = Σ_{i=1}^N Z_i y_i/π_i

because Σ_{k=1}^M ℓ_ik y_k = y_i.
(d) The values of u_i are:

                      Element k of U^B
                      k = 1   k = 2     u_i
   Unit i   1           1       0      4/2 = 2
   from U^A 2           1       1      4/2 + 6/2 = 5
            3           0       1      6/2 = 3

Here are the three SRSs from U^A:

   Sample     t̂_y
   {1, 2}    (3/2)(2 + 5) = 21/2
   {1, 3}    (3/2)(2 + 3) = 15/2
   {2, 3}    (3/2)(5 + 3) = 24/2

Consequently,

   E[t̂_y] = (1/3)(21/2 + 15/2 + 24/2) = 10,

so it is unbiased. But

   V[t̂_y] = (1/3)[ (21/2 − 10)² + (15/2 − 10)² + (24/2 − 10)² ]
           = (1/3)[0.25 + 6.25 + 4]
           = 3.5.
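The enumeration above is small enough to reproduce directly:

```python
# Enumerate all SRSs of size 2 from the N = 3 units with u = (2, 5, 3);
# the estimator is t-hat_y = (N/n) * (sum of sampled u_i).
from itertools import combinations

u = [2.0, 5.0, 3.0]
N, n = 3, 2
samples = list(combinations(range(N), n))
thats = [N / n * (u[i] + u[k]) for i, k in samples]

E = sum(thats) / len(samples)
V = sum((x - E) ** 2 for x in thats) / len(samples)
print(thats, E, V)   # [10.5, 7.5, 12.0], E = 10, V = 3.5
```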
PM
(e) We construct the variable ui = k=1 `ik yk /Lk for each adult in the sample,
where Lk = (number
P of other adults +1). Using weight wi = 40,000/100 = 400, we
calculate t̂u = i2S wi ui = 7200 with
N2 2
V̂ (t̂u ) =
s = 1,900,606.
n u
This gives a 95% CI of [4464.5, 9935.5]. Note that the without-replacement variance
estimator could also be used.
The following SAS code will compute the estimates.

data wtshare;
infile wtshare delimiter="," firstobs=2;
input id child preschool numadult;
yoverLk = preschool/(numadult+1);

proc sort data=wtshare;


by id;
run;

proc means data=wtshare sum noprint;


by id;
var yoverLk;
output out=sumout sum = u;

data sumout;
set sumout;
sampwt = 40000/100;

proc surveymeans data=sumout mean sum clm clsum;


weight sampwt;
var u;
run;

6.23 (a) We have π_i's 0.50, 0.25, 0.50, 0.75, with V(t̂_HT) = 180.1147 and
V(t̂_ψ) = 101.4167.

(b)

   (1/(2n)) Σ_{i=1}^N Σ_{k=1}^N π_i π_k (t_i/π_i − t_k/π_k)²
     = (1/(2n)) Σ_i Σ_k π_i π_k [ (t_i/π_i − t/n) − (t_k/π_k − t/n) ]²
     = (1/(2n)) Σ_i Σ_k π_i π_k [ (t_i/π_i − t/n)² + (t_k/π_k − t/n)²
         − 2(t_i/π_i − t/n)(t_k/π_k − t/n) ]
     = (1/(2n)) · 2 [Σ_k π_k] Σ_i π_i (t_i/π_i − t/n)² − (1/n) [ Σ_i π_i (t_i/π_i − t/n) ]²
     = Σ_i π_i (t_i/π_i − t/n)²
         [ using Σ_k π_k = n and Σ_i π_i (t_i/π_i − t/n) = Σ_i (t_i − ψ_i t) = 0 ]
     = Σ_i nψ_i (1/n²)(t_i/ψ_i − t)²
     = (1/n) Σ_{i=1}^N ψ_i (t_i/ψ_i − t)²
     = V(t̂_ψ).
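A numeric check of this identity, with made-up ψ_i (summing to 1) and t_i values:

```python
# (1/2n) sum_i sum_k pi_i pi_k (t_i/pi_i - t_k/pi_k)^2 should equal
# V(t-hat_psi) = (1/n) sum_i psi_i (t_i/psi_i - t)^2, with pi_i = n*psi_i.
t = [12.0, 3.0, 25.0, 8.0]
psi = [0.4, 0.1, 0.3, 0.2]
n = 2
pi = [n * p for p in psi]
total = sum(t)
N = len(t)

lhs = sum(pi[i] * pi[k] * (t[i] / pi[i] - t[k] / pi[k]) ** 2
          for i in range(N) for k in range(N)) / (2 * n)
rhs = sum(p * (ti / p - total) ** 2 for ti, p in zip(t, psi)) / n
print(lhs, rhs)
```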

(c)

   V(t̂_ψ) − V(t̂_HT)
     = (1/(2n)) Σ_{i=1}^N Σ_{k=1}^N π_i π_k (t_i/π_i − t_k/π_k)²
       − (1/2) Σ_i Σ_k (π_i π_k − π_ik)(t_i/π_i − t_k/π_k)²
     = (1/2) Σ_i Σ_k [ π_i π_k/n − π_i π_k + π_ik ] (t_i/π_i − t_k/π_k)²
     ≥ (1/2) Σ_i Σ_k [ π_i π_k (1/n − 1) + ((n−1)/n) π_i π_k ] (t_i/π_i − t_k/π_k)²
     = 0,

where the inequality holds when π_ik ≥ ((n−1)/n) π_i π_k for all i and k.

(d) If π_ik ≥ (n−1)π_i π_k/n for all i and k, then

   min_k (π_ik/π_k) ≥ ((n−1)/n) π_i for all i.

Consequently,

   Σ_{i=1}^N min_k (π_ik/π_k) ≥ ((n−1)/n) Σ_{i=1}^N π_i = n − 1.

(e) Suppose V(t̂_HT) ≤ V(t̂_ψ). Then from part (b),

   0 ≤ V(t̂_ψ) − V(t̂_HT)
     = (1/2) Σ_{i=1}^N Σ_{k≠i} [ π_ik − ((n−1)/n) π_i π_k ] (t_i/π_i − t_k/π_k)²
     = (1/2) Σ_i Σ_{k≠i} [ π_ik − ((n−1)/n) π_i π_k ] [ (t_i/π_i)² + (t_k/π_k)² − 2(t_i/π_i)(t_k/π_k) ]
     = Σ_i Σ_{k≠i} [ π_ik − ((n−1)/n) π_i π_k ] [ (t_i/π_i)² − (t_i/π_i)(t_k/π_k) ]
     = (n−1) Σ_i t_i²/π_i − Σ_i Σ_{k≠i} π_ik t_i t_k/(π_i π_k)
       − (n−1) Σ_i t_i²/π_i + ((n−1)/n) Σ_i t_i² + ((n−1)/n) Σ_i Σ_{k≠i} t_i t_k,

using Σ_{k≠i} π_ik = (n−1)π_i and Σ_{k≠i} π_k = n − π_i. Consequently,

   ((n−1)/n) Σ_{i=1}^N Σ_{k=1}^N t_i t_k ≥ Σ_{i=1}^N Σ_{k≠i} π_ik t_i t_k/(π_i π_k),

or

   Σ_{i=1}^N Σ_{k=1}^N a_ik t_i t_k ≥ 0,

where a_ii = 1 and

   a_ik = 1 − (n/(n−1)) π_ik/(π_i π_k)   if i ≠ k.

The matrix A must be nonnegative definite, which means that all principal submatrices
must have determinant ≥ 0. Using a 2 × 2 submatrix, we have 1 − a_ik² ≥ 0,
which gives the result.
6.24 (a) We have

   π_ik                 k
          1      2      3      4      π_i
   i  1  0.00   0.31   0.20   0.14   0.65
      2  0.31   0.00   0.03   0.01   0.35
      3  0.20   0.03   0.00   0.31   0.54
      4  0.14   0.01   0.31   0.00   0.46
   π_k   0.65   0.35   0.54   0.46   2.00

(b)

   Sample, S   t̂_HT    V̂_HT(t̂_HT)   V̂_SYG(t̂_HT)
   {1, 2}      9.56       38.10      −0.9287681
   {1, 3}      5.88       −4.74       2.4710422
   {1, 4}      4.93       −3.68       8.6463858
   {2, 3}      7.75     −100.25      71.6674365
   {2, 4}     21.41     −165.72     323.3238494
   {3, 4}      3.12        3.42      −0.1793659

Note that Σ_i Σ_{k>i} π_ik V̂(t̂_HT) = 6.74 for each method.

6.25 (a)

   P{psus i and j are in the sample}
     = P{psu i drawn first and psu j drawn second}
       + P{psu j drawn first and psu i drawn second}
     = (a_i / Σ_{k=1}^N a_k) · ψ_j/(1 − ψ_i) + (a_j / Σ_{k=1}^N a_k) · ψ_i/(1 − ψ_j)
     = ψ_i(1 − ψ_i)ψ_j / [ Σ_{k=1}^N a_k (1 − ψ_i)(1 − π_i) ]
       + ψ_j(1 − ψ_j)ψ_i / [ Σ_{k=1}^N a_k (1 − ψ_j)(1 − π_j) ]
     = (ψ_i ψ_j / Σ_{k=1}^N a_k) [ 1/(1 − π_i) + 1/(1 − π_j) ].

(b) Using (a),

   P{psu i in sample} = Σ_{j=1, j≠i}^N π_ij
     = (ψ_i / Σ_k a_k) Σ_{j≠i} ψ_j [ 1/(1 − π_i) + 1/(1 − π_j) ]
     = (ψ_i / Σ_k a_k) { Σ_{j=1}^N ψ_j [ 1/(1 − π_i) + 1/(1 − π_j) ] − 2ψ_i/(1 − π_i) }
     = (ψ_i / Σ_k a_k) { 1/(1 − π_i) + Σ_{j=1}^N ψ_j/(1 − π_j) − π_i/(1 − π_i) }
     = (ψ_i / Σ_k a_k) { 1 + Σ_{j=1}^N ψ_j/(1 − π_j) }.

In the third step above, we used the constraint that Σ_{j=1}^N π_j = n = 2, so
Σ_{j=1}^N ψ_j = 1 (and 2ψ_i = π_i). Now note that

   2 Σ_{k=1}^N a_k = 2 Σ_{k=1}^N ψ_k(1 − ψ_k)/(1 − 2ψ_k)
     = Σ_{k=1}^N ψ_k (1 − 2ψ_k + 1)/(1 − 2ψ_k)
     = 1 + Σ_{k=1}^N ψ_k/(1 − π_k).

Thus P{psu i in sample} = 2ψ_i = π_i.


(c) Using part (a),

   π_i π_j − π_ij = 4ψ_i ψ_j − (ψ_i ψ_j / Σ_k a_k) [ 1/(1 − π_i) + 1/(1 − π_j) ]
     = ψ_i ψ_j [ 4 Σ_k a_k (1 − π_i)(1 − π_j) − (1 − π_j) − (1 − π_i) ]
       / [ Σ_k a_k (1 − π_i)(1 − π_j) ].

Using the result in part (b),

   4 Σ_k a_k (1 − π_i)(1 − π_j) − (1 − π_j + 1 − π_i)
     = 2 [ 1 + Σ_k ψ_k/(1 − π_k) ] (1 − π_i)(1 − π_j) − 2 + π_i + π_j
     = 2(1 − π_i)(1 − π_j) Σ_k ψ_k/(1 − π_k) + 2 − 2π_i − 2π_j + 2π_i π_j − 2 + π_i + π_j
     = 2(1 − π_i)(1 − π_j) Σ_k ψ_k/(1 − π_k) − π_i − π_j + 2π_i π_j
     ≥ 2(1 − π_i)(1 − π_j) [ ψ_i/(1 − π_i) + ψ_j/(1 − π_j) ] − π_i − π_j + 2π_i π_j
     = 2(1 − π_j)ψ_i + 2(1 − π_i)ψ_j − π_i − π_j + 2π_i π_j
     = (1 − π_j)π_i + (1 − π_i)π_j − π_i − π_j + 2π_i π_j
     = 0.

Thus π_i π_j − π_ij ≥ 0, and the SYG estimator of the variance is guaranteed to be
nonnegative.
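Both identities in parts (b) and (c) are easy to verify numerically; the ψ_i values below are illustrative (they must sum to 1, with n = 2 and π_i = 2ψ_i).

```python
# Check: row sums of pi_ij give pi_i = 2*psi_i, and
# 2*sum(a_k) = 1 + sum(psi_k/(1 - pi_k)).
psi = [0.20, 0.16, 0.32, 0.20, 0.12]
N = len(psi)
pi = [2 * p for p in psi]
a = [p * (1 - p) / (1 - 2 * p) for p in psi]
A = sum(a)

def pij(i, j):
    return psi[i] * psi[j] / A * (1 / (1 - pi[i]) + 1 / (1 - pi[j]))

row = [sum(pij(i, j) for j in range(N) if j != i) for i in range(N)]
ident = 2 * A - (1 + sum(p / (1 - q) for p, q in zip(psi, pi)))
print(row, pi, ident)
```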
6.26 The desired probabilities of inclusion are π_i = 2M_i / Σ_{j=1}^5 M_j. We calculate
ψ_i = π_i/2 and a_i = ψ_i(1 − ψ_i)/(1 − π_i) for each psu in the following table:

   psu, i   M_i    π_i    ψ_i     a_i
   1         5    0.40   0.20   0.26667
   2         4    0.32   0.16   0.19765
   3         8    0.64   0.32   0.60444
   4         5    0.40   0.20   0.26667
   5         3    0.24   0.12   0.13895
   Total    25    2.00   1.00   1.47437

According to Brewer's method,

   P(select psu i on 1st draw) = a_i / Σ_{j=1}^5 a_j

and

   P(psu j on 2nd draw | psu i on 1st draw) = ψ_j/(1 − ψ_i).

Then

   P{S = (1, 2)} = (0.26667/1.47437)(0.16/0.8) = 0.036174,
   P{S = (2, 1)} = (0.19765/1.47437)(0.2/0.84) = 0.031918,

and

   π_12 = P{S = (1, 2)} + P{S = (2, 1)} = 0.068.
Continuing in like manner, we have the following table of ºij .
i\j 1 2 3 4 5
1 — .068 .193 .090 .049
2 .068 — .148 .068 .036
3 .193 .148 — .193 .107
4 .090 .068 .193 — .049
5 .049 .036 .107 .049 —
Sum .400 .320 .640 .400 .240
We use (6.21) to calculate the variance of the Horvitz–Thompson estimator.

   i  j   π_ij    π_i    π_j    t_i   t_j   (π_i π_j − π_ij)(t_i/π_i − t_j/π_j)²
1 2 0.068 0.40 0.32 20 25 47.39
1 3 0.193 0.40 0.64 20 38 5.54
1 4 0.090 0.40 0.40 20 24 6.96
1 5 0.049 0.40 0.24 20 21 66.73
2 3 0.148 0.32 0.64 25 38 20.13
2 4 0.068 0.32 0.40 25 24 19.68
2 5 0.036 0.32 0.24 25 21 3.56
3 4 0.193 0.64 0.40 38 24 0.02
3 5 0.107 0.64 0.24 38 21 37.16
4 5 0.049 0.40 0.24 24 21 35.88
Sum 1 243.07
Note that for this population, t = 128. To check the results, we see that

   Σ_S P(S) t̂_HT,S = 128   and   Σ_S P(S)(t̂_HT,S − 128)² = 243.07.
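The whole calculation can be redone from scratch in a few lines (a Python sketch, using the π_ij formula from Exercise 6.25(a) rather than enumerating Brewer draws pair by pair):

```python
# Recompute pi_12 and V(t-hat_HT) for the population of Exercise 6.26:
# M = (5, 4, 8, 5, 3), t = (20, 25, 38, 24, 21), n = 2, pi_i = 2 M_i / 25.
M = [5, 4, 8, 5, 3]
t = [20.0, 25.0, 38.0, 24.0, 21.0]
N = len(M)
psi = [m / sum(M) for m in M]
pi = [2 * p for p in psi]
a = [p * (1 - p) / (1 - 2 * p) for p in psi]
A = sum(a)

def pij(i, j):
    # Joint inclusion probability under Brewer's method (Exercise 6.25(a)).
    return psi[i] * psi[j] / A * (1 / (1 - pi[i]) + 1 / (1 - pi[j]))

p12 = pij(0, 1)
V = sum((pi[i] * pi[k] - pij(i, k)) * (t[i] / pi[i] - t[k] / pi[k]) ** 2
        for i in range(N) for k in range(i + 1, N))
print(round(p12, 3), round(V, 2))   # 0.068 and 243.07
```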

6.27 A sequence of simple random samples with replacement (SRSWR) is drawn


until the first SRSWR in which the two psu’s are distinct. As each SRSWR in
the sequence is selected independently, for the lth SRSWR in the sequence, and for
i ≠ j,

   P{psus i and j chosen in lth SRSWR}
     = P{psu i chosen first and psu j chosen second}
       + P{psu j chosen first and psu i chosen second}
     = 2ψ_i ψ_j.

The (l + 1)st SRSWR is chosen to be the sample if each of the previous l SRSWRs
is rejected because the two psus are the same. Now

   P(the two psus are the same in an SRSWR) = Σ_{k=1}^N ψ_k²,

so because SRSWRs are drawn independently,

   P(reject first l SRSWRs) = ( Σ_{k=1}^N ψ_k² )^l.

Thus

   π_ij = P{psus i and j are in the sample}
     = Σ_{l=0}^∞ P{psus i and j chosen in (l+1)st SRSWR, and the first l SRSWRs are rejected}
     = Σ_{l=0}^∞ 2ψ_i ψ_j ( Σ_{k=1}^N ψ_k² )^l
     = 2ψ_i ψ_j / ( 1 − Σ_{k=1}^N ψ_k² ).

Equation (6.18) implies

   π_i = Σ_{j=1, j≠i}^N π_ij
     = Σ_{j≠i} 2ψ_i ψ_j / ( 1 − Σ_{k=1}^N ψ_k² )
     = ( 2ψ_i Σ_{j=1}^N ψ_j − 2ψ_i² ) / ( 1 − Σ_{k=1}^N ψ_k² )
     = 2ψ_i (1 − ψ_i) / ( 1 − Σ_{k=1}^N ψ_k² ).

Note that, as (6.17) predicts,

   Σ_{i=1}^N π_i = 2 ( 1 − Σ_{i=1}^N ψ_i² ) / ( 1 − Σ_{k=1}^N ψ_k² ) = 2.



6.28 Note that, using the indicator variables,

   Σ_{k=1}^n I_ki = 1 for all i,
   x_ki = M_i / Σ_{j=1}^N I_kj M_j,
   P(Z_i = 1 | I_11, …, I_nN) = Σ_{k=1}^n I_ki M_i / Σ_{j=1}^N I_kj M_j = Σ_{k=1}^n I_ki x_ki,

and

   t̂_RHC = Σ_{k=1}^n t_α(k)/x_{k,α(k)} = Σ_{k=1}^n Σ_{i=1}^N I_ki Z_i t_i/x_ki.

We show that t̂_RHC is conditionally unbiased for t given the grouping:

   E[t̂_RHC | I_11, …, I_nN] = E[ Σ_{k=1}^n Σ_{i=1}^N I_ki Z_i t_i/x_ki | I_11, …, I_nN ]
     = Σ_{k=1}^n Σ_{i=1}^N I_ki (t_i/x_ki) x_ki
     = Σ_{k=1}^n Σ_{i=1}^N I_ki t_i
     = Σ_{i=1}^N t_i = t.

Since E[t̂_RHC | I_11, …, I_nN] = t for any random grouping of psus, we have that
E[t̂_RHC] = t.
To find the variance, note that

   V[t̂_RHC] = E[V(t̂_RHC | I_11, …, I_nN)] + V[E(t̂_RHC | I_11, …, I_nN)].

Since E[t̂_RHC | I_11, …, I_nN] = t, however, we know that V[E(t̂_RHC | I_11, …, I_nN)] =
0. Conditionally on the grouping, the kth term in t̂_RHC estimates the total of group
k using an unequal-probability sample of size one. We can thus use (6.4) within
each group to find the conditional variance, noting that psus in different groups
are selected independently. (We can obtain the same result by using the indicator
variables directly, but it's messier.) Then

   V(t̂_RHC | I_11, …, I_nN) = Σ_{k=1}^n Σ_{i=1}^N I_ki x_ki ( t_i/x_ki − Σ_{j=1}^N I_kj t_j )²
     = Σ_{k=1}^n Σ_{i=1}^N I_ki t_i²/x_ki − Σ_{k=1}^n ( Σ_{i=1}^N I_ki t_i )( Σ_{j=1}^N I_kj t_j )
     = Σ_{k=1}^n Σ_{i=1}^N Σ_{j=1}^N I_ki I_kj ( M_j t_i²/M_i − t_i t_j ).

Now to find E[V(t̂_RHC | I_11, …, I_nN)], we need E[I_ki] and E[I_ki I_kj] for i ≠ j.
Let N_k be the number of psus in group k. Then

   E[I_ki] = P{psu i in group k} = N_k/N

and, for i ≠ j,

   E[I_ki I_kj] = P{psus i and j in group k} = (N_k/N) · (N_k − 1)/(N − 1).

Thus, letting ψ_i = M_i / Σ_{j=1}^N M_j,

   V[t̂_RHC] = E[V(t̂_RHC | I_11, …, I_nN)]
     = E[ Σ_{k=1}^n Σ_{i=1}^N Σ_{j=1}^N I_ki I_kj ( M_j t_i²/M_i − t_i t_j ) ]
     = Σ_{k=1}^n Σ_{i=1}^N (N_k/N)(t_i² − t_i²)
       + Σ_{k=1}^n Σ_{i=1}^N Σ_{j≠i} (N_k/N) · ((N_k − 1)/(N − 1)) ( M_j t_i²/M_i − t_i t_j )
     = [ Σ_{k=1}^n (N_k/N) · (N_k − 1)/(N − 1) ] ( Σ_{i=1}^N t_i²/ψ_i − t² )
     = [ Σ_{k=1}^n (N_k/N) · (N_k − 1)/(N − 1) ] Σ_{i=1}^N ψ_i ( t_i/ψ_i − t )².

The second factor equals nV(t̂_ψ), with V(t̂_ψ) given in (6.46), assuming one-stage
cluster sampling.

What should N_1, …, N_n be in order to minimize V[t̂_RHC]? Note that

   Σ_{k=1}^n N_k(N_k − 1) = Σ_{k=1}^n N_k² − N

is smallest when all N_k's are equal. If N/n = L is an integer, take N_k = L for
k = 1, 2, …, n. With this design,

   V[t̂_RHC] = ((L − 1)/(N − 1)) Σ_{i=1}^N ψ_i ( t_i/ψ_i − t )²
             = ((N − n)/(N − 1)) V(t̂_ψ).

6.29 (a)

   Σ_{k≠i} π̃_ik = Σ_{k≠i} π_i π_k [ 1 − (1 − π_i)(1 − π_k) / Σ_{j=1}^N c_j ]
     = Σ_{k≠i} π_i π_k − [ π_i(1 − π_i) / Σ_{j=1}^N c_j ] Σ_{k≠i} π_k(1 − π_k)
     = π_i(n − π_i) − [ π_i(1 − π_i) / Σ_{j=1}^N c_j ] [ Σ_{k=1}^N π_k(1 − π_k) − π_i(1 − π_i) ]
     = π_i(n − π_i) − π_i(1 − π_i) + π_i²(1 − π_i)² / Σ_{j=1}^N c_j
     = π_i(n − 1) + π_i²(1 − π_i)² / Σ_{j=1}^N c_j.

(b) If an SRS is taken, π_i = n/N, so

   Σ_{j=1}^N c_j = Σ_{j=1}^N (n/N)(1 − n/N) = n (1 − n/N)

and

   π̃_ik = (n²/N²) [ 1 − (1 − n/N)² / (n(1 − n/N)) ]
     = (n²/N²) [ 1 − (1 − n/N)/n ]
     = (n/N²) [ n − 1 + n/N ]
     = (n/N) [ (n − 1)/N + n/N² ]
     = (n/N) · ((n − 1)/(N − 1)) [ (N − 1)/N + n(N − 1)/((n − 1)N²) ]
     = (n/N) · ((n − 1)/(N − 1)) [ 1 + (N − n)/((n − 1)N²) ].

(c) First note that

   π_i π_k − π̃_ik = π_i π_k (1 − π_i)(1 − π_k) / Σ_{j=1}^N c_j.

Then, letting B = Σ_{j=1}^N c_j,

   V_Haj(t̂_HT) = (1/2) Σ_{i=1}^N Σ_{k=1}^N ( π_i π_k − π̃_ik )( t_i/π_i − t_k/π_k )²
     = (1/(2B)) Σ_i Σ_k π_i π_k (1 − π_i)(1 − π_k)( t_i/π_i − t_k/π_k )²
     = (1/(2B)) Σ_i Σ_k π_i π_k (1 − π_i)(1 − π_k) [ t_i²/π_i² + t_k²/π_k² − 2 t_i t_k/(π_i π_k) ]
     = (1/B) Σ_i Σ_k π_i π_k (1 − π_i)(1 − π_k) [ t_i²/π_i² − t_i t_k/(π_i π_k) ]
     = Σ_i π_i(1 − π_i) t_i²/π_i² − (1/B) [ Σ_i π_i(1 − π_i) t_i/π_i ]²
     = Σ_i c_i t_i²/π_i² − (1/B) [ Σ_i c_i t_i/π_i ]²
     = Σ_{i=1}^N c_i ( t_i/π_i − A )²,

where A = Σ_{k=1}^N c_k (t_k/π_k) / Σ_{k=1}^N c_k.

6.30 (a) From (6.21),


   V(t̂_HT) = (1/2) Σ_{i=1}^N Σ_{k≠i} ( π_i π_k − π_ik )( t_i/π_i − t_k/π_k )²
     = (1/2) Σ_i Σ_{k≠i} ( π_i π_k − π_ik ) [ (t_i/π_i − t/n) − (t_k/π_k − t/n) ]²
     = (1/2) Σ_i Σ_{k≠i} ( π_i π_k − π_ik ) [ (t_i/π_i − t/n)² + (t_k/π_k − t/n)²
         − 2(t_i/π_i − t/n)(t_k/π_k − t/n) ]
     = Σ_i Σ_{k≠i} ( π_i π_k − π_ik ) [ (t_i/π_i − t/n)² − (t_i/π_i − t/n)(t_k/π_k − t/n) ].

From Theorem 6.1, we know that

   Σ_{k=1}^N π_k = n   and   Σ_{k≠i} π_ik = (n − 1)π_i,

so

   Σ_i Σ_{k≠i} ( π_i π_k − π_ik )( t_i/π_i − t/n )²
     = Σ_i [ π_i(n − π_i) − (n − 1)π_i ] ( t_i/π_i − t/n )²
     = Σ_{i=1}^N π_i(1 − π_i)( t_i/π_i − t/n )².

This gives the first two terms in (6.47); the third term is the cross-product term
above.
(b)
For an SRS, π_i = n/N and π_ik = n(n − 1)/[N(N − 1)]. The first term is

   Σ_{i=1}^N (n/N)( Nt_i/n − t/n )² = (N/n) Σ_{i=1}^N ( t_i − t̄_U )² = N(N − 1) S_t²/n.

The second term is

   Σ_{i=1}^N (n/N)² ( Nt_i/n − t/n )² = Σ_{i=1}^N ( t_i − t̄_U )² = (N − 1) S_t².

(c) Substituting π_i π_k (c_i + c_k)/2 for π_ik, the third term in (6.47) is

   Σ_{i=1}^N Σ_{k≠i} ( π_ik − π_i π_k )( t_i/π_i − t/n )( t_k/π_k − t/n )
     = Σ_i Σ_{k≠i} π_i π_k ((c_i + c_k − 2)/2)( t_i/π_i − t/n )( t_k/π_k − t/n )
     = Σ_{i=1}^N Σ_{k=1}^N π_i π_k ((c_i + c_k − 2)/2)( t_i/π_i − t/n )( t_k/π_k − t/n )
       − Σ_{i=1}^N π_i²(c_i − 1)( t_i/π_i − t/n )²
     = Σ_{i=1}^N π_i²(1 − c_i)( t_i/π_i − t/n )²,

where the complete double sum vanishes because Σ_i π_i ( t_i/π_i − t/n ) = 0.
Then, from (6.47),

   V(t̂_HT) ≈ Σ_{i=1}^N π_i ( t_i/π_i − t/n )² − Σ_{i=1}^N π_i² ( t_i/π_i − t/n )²
       + Σ_{i=1}^N π_i²(1 − c_i)( t_i/π_i − t/n )²
     = Σ_{i=1}^N π_i (1 − c_i π_i)( t_i/π_i − t/n )².

If c_i = (n − 1)/(n − π_i), then the variance approximation in (6.48) for an SRS is

   Σ_{i=1}^N π_i(1 − c_i π_i)( t_i/π_i − t/n )² = (N/n)( 1 − (n − 1)/(N − 1) ) Σ_{i=1}^N ( t_i − t̄_U )²
     = (N(N − 1)/n)( 1 − (n − 1)/(N − 1) ) S_t².

If

   c_i = (n − 1) / [ 1 − 2π_i + (1/n) Σ_{k=1}^N π_k² ],

then

   Σ_{i=1}^N π_i(1 − c_i π_i)( t_i/π_i − t/n )² = (N(N − 1)/n)( 1 − n(n − 1)/(N − n) ) S_t².

6.31 We wish to minimize

   (1/n) Σ_{i=1}^N ψ_i ( t_i/ψ_i − t )² + (1/n) Σ_{i=1}^N M_i² S_i²/(m_i ψ_i)

subject to the constraint that

   C = E[ Σ_{i∈S} m_i ] = E[ Σ_{i=1}^N Q_i m_i ] = n Σ_{i=1}^N ψ_i m_i.

Using Lagrange multipliers, let

   g(m_1, …, m_N, λ) = Σ_{i=1}^N M_i² S_i²/(m_i ψ_i) − λ ( C − n Σ_{i=1}^N ψ_i m_i ).

Then

   ∂g/∂m_k = −M_k² S_k²/(m_k² ψ_k) + nλψ_k,
   ∂g/∂λ = n Σ_{i=1}^N ψ_i m_i − C.

Setting the partial derivatives equal to zero gives

   m_k = M_k S_k / ( √(nλ) ψ_k )

and

   √(nλ) = (n/C) Σ_{i=1}^N M_i S_i.

Thus, the optimal allocation has m_i ∝ M_i S_i/ψ_i. For comparison, a self-weighting
design would have m_i ∝ M_i/ψ_i.
6.32 Let M_i be the number of residential numbers in psu i. When you dial a
number based on the method,

   P(reach working number in psu i on an attempt)
     = P(select psu i) P(get working number | select psu i)
     = (1/N)(M_i/100).

Also,

   P(reach no one on an attempt) = 1 − Σ_{i=1}^N (1/N)(M_i/100) = 1 − M_0/(100N).

Then,

   P(select psu i as first in sample)
     = P(select psu i on first attempt)
       + P(reach no one on first attempt, select psu i on second attempt)
       + P(reach no one on first and second attempts, select psu i on third attempt)
       + ⋯
     = (1/N)(M_i/100) + (1/N)(M_i/100)( 1 − M_0/(100N) )
       + (1/N)(M_i/100)( 1 − M_0/(100N) )² + ⋯
     = (1/N)(M_i/100) Σ_{j=0}^∞ ( 1 − M_0/(100N) )^j
     = (1/N)(M_i/100) / [ 1 − ( 1 − M_0/(100N) ) ]
     = M_i/M_0.
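Summing the geometric series numerically confirms the closed form; the M_i values below are made up (M_i residential numbers out of each bank of 100):

```python
# P(select psu i as first in sample) via the truncated geometric series
# should equal M_i / M_0.
M = [40, 75, 20, 95]
N = len(M)
M0 = sum(M)
miss = 1 - M0 / (100 * N)        # P(reach no one on an attempt)

first = [sum((Mi / (100 * N)) * miss**j for j in range(2000)) for Mi in M]
exact = [Mi / M0 for Mi in M]
print(first, exact)
```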
Chapter 7

Complex Surveys

7.6 Here is SAS code for solving this problem. Note that for the population, we
have ȳ_U = 17.73,

   S² = [ Σ_{i=1}^{2000} y_i² − ( Σ_{i=1}^{2000} y_i )²/2000 ] / 1999
      = ( 704958 − 35460²/2000 ) / 1999 = 38.1451726,

and θ̂_0.25 = 13.098684, θ̂_0.50 = 16.302326, θ̂_0.75 = 19.847458.
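The population mean and variance quoted above can be checked from the two sums alone:

```python
# S^2 = (sum(y^2) - (sum(y))^2/N) / (N - 1) for the population of 2000 values.
sum_y = 35460
sum_ysq = 704958
n_pop = 2000

ybar = sum_y / n_pop
S2 = (sum_ysq - sum_y**2 / n_pop) / (n_pop - 1)
print(ybar, S2)   # 17.73 and about 38.145
```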

data integerwt;
infile integer delimiter="," firstobs=2;
input stratum y;
ysq = y*y;
run;

/* Calculate the population characteristics for comparison */

proc means data=integerwt mean var;


var y;
run;

proc surveymeans data=integerwt mean sum percentile = (25 50 75);


var y ysq;
/* Without a weight statement, SAS assumes all weights are 1 */
run;

proc glm data=integerwt;


class stratum;
model y = stratum;
means stratum;
run;


/* Before selecting the sample,


you need to sort the data set by stratum */

proc sort data=integerwt;


by stratum;

proc surveyselect data=integerwt method=srs sampsize = (50 50 20 25)


out = stratsamp seed = 38572 stats;
strata stratum;
run;

proc print data = stratsamp;


run;

data strattot;
input stratum _total_;
datalines;
1 200
2 800
3 400
4 600
;

proc surveymeans data=stratsamp total = strattot mean clm sum


percentile = (25 50 75);
strata stratum;
weight SamplingWeight;
var y ysq;
run;

/* Create a pseudo-population using the weights */

data pseudopop;
set stratsamp;
retain stratum y;
do i = 1 to SamplingWeight;
output ;
end;

proc means data=pseudopop mean var;


var y;
run;

proc surveymeans data=pseudopop mean sum percentile = (25 50 75);



var y ysq;
run;

The estimates from the last two surveymeans statements are the same (not the
standard errors, however).
7.7 Let y = number of species caught.

y fˆ(y) y fˆ(y)
1 .0328 10 .2295
3 .0328 11 .0491
4 .0820 12 .0820
5 .0328 13 .0328
6 .0656 14 .0164
7 .0656 16 .0328
8 .1803 17 .0164
9 .0328 18 .0164
Here is SAS code for constructing this table:

data nybight;
infile nybight delimiter=',' firstobs=2;
input year stratum catchnum catchwt numspp depth temp ;
select (stratum);
when (1,2) relwt=1;
when (3,4,5,6) relwt=2;
end;
if year = 1974;

/*Construct empirical probability mass function and empirical cdf.*/

proc freq data=nybight;


tables numspp / out = htpop_epmf outcum;
weight relwt;

/*SAS proc freq gives values in percents, so we divide each by 100*/

data htpop_epmf;
set htpop_epmf;
epmf = percent/100;
ecdf = cum_pct/100;

proc print data=htpop_epmf;


run;

7.8 We first construct a new variable, weight, with the following values:

   Stratum    weight
   large      (245/23)(M_i/m_i)
   sm/me      (66/8)(M_i/m_i)
Because there is nonresponse on the variable hrwork, for this exercise we take mi to
be the number of respondents in that cluster. The weights for each teacher sampled
in a school are given in the following table:
dist school popteach mi weight
sm/me 1 2 1 16.50000
sm/me 2 6 4 12.37500
sm/me 3 18 7 21.21429
sm/me 4 12 7 14.14286
sm/me 6 24 11 18.00000
sm/me 7 17 4 35.06250
sm/me 8 19 5 31.35000
sm/me 9 28 21 11.00000
large 11 33 10 35.15217
large 12 16 13 13.11037
large 13 22 3 78.11594
large 15 24 24 10.65217
large 16 27 24 11.98370
large 18 18 2 95.86957
large 19 16 3 56.81159
large 20 12 8 15.97826
large 21 19 5 40.47826
large 22 33 13 27.04013
large 23 31 16 20.63859
large 24 30 9 35.50725
large 25 23 8 30.62500
large 28 53 17 33.20972
large 29 50 8 66.57609
large 30 26 22 12.58893
large 31 25 18 14.79469
large 32 23 16 15.31250
large 33 21 5 44.73913
large 34 33 7 50.21739
large 36 25 4 66.57609
large 38 38 10 40.47826
large 41 30 2 159.78261
The epmf is given below, with y = hrwork.

y fˆ(y) y fˆ(y)
20.00 0.0040 34.55 0.0019
26.25 0.0274 34.60 0.0127
26.65 0.0367 35.00 0.1056
27.05 0.0225 35.40 0.0243
27.50 0.0192 35.85 0.0164
27.90 0.0125 36.20 0.0022
28.30 0.0050 36.25 0.0421
29.15 0.0177 36.65 0.0664
30.00 0.0375 37.05 0.0023
30.40 0.0359 37.10 0.0403
30.80 0.0031 37.50 0.1307
31.25 0.0662 37.90 0.0079
32.05 0.0022 37.95 0.0019
32.10 0.0031 38.35 0.0163
32.50 0.0370 38.75 0.0084
32.90 0.0347 39.15 0.0152
33.30 0.0031 40.00 0.0130
33.35 0.0152 40.85 0.0018
33.75 0.0404 41.65 0.0031
34.15 0.0622 52.50 0.0020
7.10 [Two histograms, one without weights and one with weights, omitted here.]

Using the weights makes a huge difference, since the counties with large numbers of
veterans also have small weights.
7.13 The variable agefirst contains information on the age at first arrest. Missing
values are coded as 99; for this exercise, we use the non-missing cases.
Estimated Without With
Quantity Weights Weights
Mean 13.07 13.04
Median 13 13
25th Percentile 12 12
75th Percentile 15 15
Calculating these quantities in SAS is easy: simply include the weight variable in
PROC UNIVARIATE.
The weights change the estimates very little, largely because the survey was designed
to be self-weighting.
7.14
Quantity Variable p̂
Age ∑ 14 age .1233
Violent offense crimtype .4433
Both parents livewith .2974
Male sex .9312
Hispanic ethnicity .1888
Single parent livewith .5411
Illegal drugs everdrug .8282
7.15 (a) We use the following SAS code to obtain ȳ̂ = 18.03, with 95% CI [17.48,
18.58].

data nhanes;
infile nhanes delimiter=',' firstobs=2;
input sdmvstra sdmvpsu wtmec2yr age ridageyr riagendr ridreth2
dmdeduc indfminc bmxwt bmxbmi bmxtri
bmxwaist bmxthicr bmxarml;
label age = "Age at Examination (years)"
riagendr = "Gender"
ridreth2 = "Race/Ethnicity"
dmdeduc = "Education Level"
indfminc = "Family income"
bmxwt = "Weight (kg)"
bmxbmi = "Body mass index"
bmxtri = "Triceps skinfold (mm)"
bmxwaist = "Waist circumference (cm)"
bmxthicr = "Thigh circumference (cm)"
bmxarml = "Upper arm length (cm)";
run;

proc surveymeans data=nhanes mean clm percentile = (0 25 50 75 100);


stratum sdmvstra;
cluster sdmvpsu;
weight wtmec2yr;
var bmxtri age;
run;

(b) The data appear skewed.

(c) The SAS code in part (a) also gives the following.

Percentile Value Std Error


Minimum 2.8
25 10.98 0.177
50 16.35 0.324
75 23.95 0.425
Maximum 44.6
Men:
Percentile Value
Minimum 2.80
25 9.19
50 12.92
75 18.11
Maximum 42.4
Women:
Percentile Value
Minimum 4.00
25 14.92
50 21.94
75 28.36
Maximum 44.6
(d) Here is SAS code for constructing the plots:

data groupage;
set nhanes;
bmigroup = round(bmxbmi,5);
trigroup = round(bmxtri,5);
run;

proc sort data=groupage;


by bmigroup trigroup;

proc means data=groupage;


by bmigroup trigroup;
var wtmec2yr;
output out=circleage sum=sumwts;

goptions reset=all;
goptions colors = (black);
axis3 label=(’Body Mass Index, rounded to 5’) order=(10 to 70 by 10);
axis4 label=(angle=90 ’Triceps skinfold, rounded to 5’)
order=(0 to 55 by 10);

/* This gives the weighted circle plot */



proc gplot data=circleage;


bubble trigroup * bmigroup= sumwts/
bsize=12 haxis = axis3 vaxis = axis4;
run;

/* The following draws the bubble plot with trend line */

ods graphics on;


proc loess data=nhanes;
model bmxtri=bmxbmi / degree = 1 select=gcv;
weight wtmec2yr;
ods output OutputStatistics = bmxsmooth ;
run;
ods graphics off;

proc print data=bmxsmooth;


run;

proc sort data=bmxsmooth;


by bmxbmi;

goptions reset=all;
goptions colors = (gray);
axis4 label=(angle=90 ’Triceps skinfold’) order = (0 to 55 by 10);
axis3 label=(’Body Mass Index’) order=(10 to 70 by 10);
axis5 order=(0 to 55 by 5) major=none minor=none value=none;
symbol interpol=join width=2 color = black;

/* Display the trend line with the bubble plot */

data plotsmth;
set circleage bmxsmooth; /* concatenates the data sets */
run;

proc gplot data=plotsmth;


bubble bmxtri*bmxbmi = sumwts/
bsize=10 haxis = axis3 vaxis = axis4;
plot2 Pred*bmxbmi/haxis = axis3 vaxis = axis5;
run;

7.17 We define a new variable that takes on the value 1 if the person has been a
victim of at least one violent crime and 0 otherwise, and another variable for injury.
The SAS code and output follow.

data ncvs;
infile ncvs delimiter = ",";
input age married sex race hispanic hhinc away employ numinc
violent injury medtreat medexp robbery assault
pweight pstrat ppsu;
if violent > 0 then isviol = 1;
else isviol = 0;
if injury > 0 then isinjure = 1;

else isinjure = 0;
run;

proc surveymeans data=ncvs;


weight pweight;
strata pstrat;
cluster ppsu;
var numinc isviol isinjure;
run;

proc surveymeans data=ncvs;


weight pweight;
strata pstrat;
cluster ppsu;
var medexp;
domain isinjure;
run;

The SURVEYMEANS Procedure

Data Summary

Number of Strata 143


Number of Clusters 286
Number of Observations 79360
Sum of Weights 226204704

Statistics

Std Error
Variable N Mean of Mean 95% CL for Mean
numinc 79360 0.070071 0.002034 0.06605010 0.07409164
isviol 79360 0.013634 0.000665 0.01232006 0.01494718
isinjure 79360 0.003754 0.000316 0.00312960 0.00437754

Domain Analysis: isinjure

Std Error
isinjure Variable N Mean of Mean 95% CL for Mean
0 medexp 79093 0 0 0.0000000 0.000000
1 medexp 267 101.6229 33.34777 35.7046182 167.541160

7.18 Note that \sum_{q_1 \le y \le q_2} y f(y) is the sum of the middle N(1-2\alpha) observations
in the population divided by N, and \sum_{q_1 \le y \le q_2} f(y) = F(q_2) - F(q_1) \approx 1 - 2\alpha.
Consequently,
\[
\bar{y}_{U\alpha} = \frac{\text{sum of middle } N(1-2\alpha) \text{ observations in the population}}{N(1-2\alpha)}.
\]
To estimate the trimmed mean, substitute \hat{f}, \hat{q}_1, and \hat{q}_2 for f, q_1, and q_2.
7.21 As stated in Section 7.1, the y_i's are the measurements on observation units.
If unit i is in stratum h, then w_i = N_h/n_h. To express this formally, let
\[
x_{hi} = \begin{cases} 1 & \text{if unit } i \text{ is in stratum } h \\ 0 & \text{otherwise.} \end{cases}
\]
Then we can write
\[
w_i = \sum_{h=1}^H \frac{N_h}{n_h} x_{hi}
\]
and
\begin{align*}
\sum_y y \hat{f}(y) &= \frac{\displaystyle\sum_{i \in S} w_i y_i}{\displaystyle\sum_{i \in S} w_i}
= \frac{\displaystyle\sum_{i \in S} \sum_{h=1}^H y_i (N_h/n_h) x_{hi}}{\displaystyle\sum_{i \in S} \sum_{h=1}^H (N_h/n_h) x_{hi}}
= \frac{\displaystyle\sum_{h=1}^H N_h \sum_{i \in S} x_{hi} y_i / n_h}{\displaystyle\sum_{h=1}^H N_h \sum_{i \in S} x_{hi} / n_h} \\
&= \frac{\displaystyle\sum_{h=1}^H N_h \bar{y}_h}{\displaystyle\sum_{h=1}^H N_h}
= \sum_{h=1}^H \frac{N_h}{N} \bar{y}_h.
\end{align*}

7.22 For an SRS, w_i = N/n for all i and
\[
\hat{f}(y) = \frac{\displaystyle\sum_{i \in S: y_i = y} \frac{N}{n}}{\displaystyle\sum_{i \in S} \frac{N}{n}}.
\]
Thus,
\[
\sum_y y^2 \hat{f}(y) = \sum_{i \in S} \frac{y_i^2}{n}, \qquad
\sum_y y \hat{f}(y) = \sum_{i \in S} \frac{y_i}{n} = \bar{y},
\]
and
\begin{align*}
\hat{S}^2 &= \frac{N}{N-1} \left\{ \sum_y y^2 \hat{f}(y) - \Big[ \sum_y y \hat{f}(y) \Big]^2 \right\}
= \frac{N}{N-1} \left\{ \sum_{i \in S} \frac{y_i^2}{n} - \bar{y}^2 \right\} \\
&= \frac{N}{N-1} \sum_{i \in S} \frac{(y_i - \bar{y})^2}{n}
= \frac{N}{N-1} \, \frac{n-1}{n} \, s^2.
\end{align*}
If n < N , Ŝ 2 is smaller than s2 (although they will be close if n is large).
7.23 We need to show that the inclusion probability is the same for every unit in
S_2. Let Z_i = 1 if i \in S and 0 otherwise, and let D_i = 1 if i \in S_2 and 0 otherwise.
We have P(Z_i = 1) = \pi_i and P(D_i = 1 \mid Z_i = 1) \propto 1/\pi_i. Then
\[
P(i \in S_2) = P(Z_i = 1, D_i = 1) = P(D_i = 1 \mid Z_i = 1) P(Z_i = 1) \propto \frac{1}{\pi_i} \, \pi_i = 1.
\]

7.24 A rare disease affects only a few children in the population. Even if all cases
belong to the same cluster, a disease with estimated incidence of 2.1 per 1,000 is
unlikely to affect all children in that cluster.
7.25 (a) Inner-city areas are sampled at twice the rate of non-inner-city areas. Thus
the selection probability for a household not in the inner city is one-half the selection
probability for a household in the inner city. The relative weight for a non-inner-city
household, then, is 2.
(b) Let \pi represent the probability that a household in the inner city is selected.
Then, for 1-person inner-city households,
\[
P(\text{person selected} \mid \text{household selected}) \, P(\text{household selected}) = 1 \times \pi.
\]
For k-person inner-city households,
\[
P(\text{person selected} \mid \text{household selected}) \, P(\text{household selected}) = \frac{1}{k} \, \pi.
\]
Thus the relative weight for a person in an inner-city household is the number
of adults in the household. The relative weight for a person in a non-inner-city
household is 2 \times (number of adults in household).

The table of relative weights is:


Number of adults Inner city Non-inner city
1 1 2
2 2 4
3 3 6
4 4 8
5 5 10
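The table above follows directly from the two weighting factors. A minimal sketch; `relative_weight` is a hypothetical helper, not notation from the text:

```python
# Relative weights for Exercise 7.25: inner-city households are sampled at
# twice the rate, so a non-inner-city household carries relative weight 2;
# selecting one adult per household multiplies the person's weight by the
# number of adults k.

def relative_weight(k_adults, inner_city):
    base = 1 if inner_city else 2  # household weight relative to inner city
    return base * k_adults         # person selected with probability 1/k within household

table = {k: (relative_weight(k, True), relative_weight(k, False))
         for k in range(1, 6)}
print(table)
```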
Chapter 8

Nonresponse

8.1 (a) Oversampling the low-income families is a form of substitution. One advantage
of substitution is that the number of low-income families in the sample is
larger. The main drawback, however, is that the low-income families that respond
may differ from those that do not respond. For example, mothers who work outside
the home may be less likely to breast-feed and less likely to respond to the survey.
(b) The difference in the percentage of mothers with one child indicates that the
weighting does not completely adjust for the nonresponse.
(c) Weights were used to try to adjust for nonresponse in this survey. We can never
know whether the adjustment is successful, however, unless we have some data from
the nonrespondents. The response rate for the survey decreased from 54% in 1984
to 46% in 1989. It might have been better for the survey researchers to concentrate
on increasing the response rate and obtaining accurate responses instead of tripling
the sample size.
Because the survey was poststratified using ethnic background, age, and education,
the weighted counts must agree with census figures for those variables. A possible
additional variable to use for poststratification would be number of children.
8.2 (a) The respondents report a total of
\[
\sum y_i = (66)(32) + (58)(41) + (26)(54) = 5894
\]
hours of TV, with
\[
\sum y_i^2 = [65(15)^2 + 66(32)^2] + [57(19)^2 + 58(41)^2] + [25(25)^2 + 26(54)^2] = 291725.
\]
Then, for the respondents,
\[
\bar{y} = \frac{5894}{150} = 39.3, \qquad
s^2 = \frac{291725 - (150)(39.3)^2}{149} = 403.6,
\]
and
\[
\mathrm{SE}(\bar{y}) = \sqrt{\left(1 - \frac{150}{2000}\right) \frac{403.6}{150}} = 1.58.
\]
Note that this is technically a ratio estimate, since the number of respondents (here,
150) would vary if a different sample were taken. We are estimating the average
hours of TV watched in the domain of respondents.
(b)
GPA Group Respondents Nonrespondents Total
3.00–4.00 66 9 75
2.00–2.99 58 14 72
Below 2.00 26 27 53
Total 150 50 200

\begin{align*}
X^2 &= \sum_{\text{cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \\
&= \frac{[66 - (.75)(75)]^2}{(.75)(75)} + \frac{[9 - (.25)(75)]^2}{(.25)(75)} + \cdots + \frac{[27 - (.25)(53)]^2}{(.25)(53)} \\
&= 1.69 + 5.07 + 0.30 + 0.89 + 4.76 + 14.27 \\
&= 26.97.
\end{align*}
Comparing the test statistic to a \chi^2 distribution with 2 df, the p-value is 1.4 \times 10^{-6}.
This is strong evidence against the null hypothesis that the three groups have the
same response rates.
The hypothesis test indicates that the nonresponse is not MCAR, because response
rates appear to be related to GPA. We do not know whether the nonresponse is
MAR, or whether it is nonignorable.
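The chi-square computation can be checked directly from the table, assuming the pooled response rate 150/200 = 0.75 under the null hypothesis:

```python
import math

# (respondents, nonrespondents) by GPA group, from the table above
observed = [(66, 9), (58, 14), (26, 27)]
totals = [sum(row) for row in observed]

x2 = 0.0
for (resp, nonresp), n in zip(observed, totals):
    x2 += (resp - 0.75 * n) ** 2 / (0.75 * n)      # expected respondents: .75 n
    x2 += (nonresp - 0.25 * n) ** 2 / (0.25 * n)   # expected nonrespondents: .25 n

# For a chi-square distribution with 2 df, P(X^2 > x) = exp(-x/2).
pvalue = math.exp(-x2 / 2)
print(round(x2, 2), pvalue)  # approximately 26.97 and 1.4e-6
```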
(c)
\[
\mathrm{SSB} = \sum_{i=1}^{3} \sum_{j=1}^{n_i} (\bar{y}_i - \bar{y})^2 = 9303.1, \qquad
\mathrm{MSW} = s^2 = 403.6.
\]
The ANOVA table is as follows:
Source df SS MS F p-value
Between groups 2 9303.1 4651.5 11.5 0.0002
Within groups 147 59323.0 403.6
Total, about mean 149 68626.1
Both the nonresponse rate and the TV viewing appear to be related to GPA, so
it would be a reasonable variable to consider for weighting class adjustment or
poststratification.
(d) The initial weight for each person in the sample is 2000/200=10. After increasing
the weights for the respondents in each class to adjust for the nonrespondents, the
weight for each respondent with GPA \ge 3 is
\[
\text{initial weight} \times \frac{\text{sum of weights for sample}}{\text{sum of weights for respondents}}
= 10 \times \frac{75(10)}{66(10)} = 11.36364.
\]

Sample   Number of           Weight for each
Size     Respondents (n_R)   Respondent (w)    \bar{y}   w n_R \bar{y}   w n_R
 75          66                11.36364           32         24000         750
 72          58                12.41379           41         29520         720
 53          26                20.38462           54         28620         530
200         150                                              82140        2000

Then t̂wc = 82140 and ȳwc = 82140/2000 = 41.07.


The weighting class adjustment leads to a higher estimate of average viewing time,
because the GPA group with the highest TV viewing also has the most nonresponse.
(e) The poststratified weight for each respondent with GPA \ge 3 is
\[
w_{\text{post}} = \text{initial weight} \times \frac{\text{population count}}{\text{sum of respondent weights}}
= 10 \times \frac{700}{(10)(66)} = 10.60606.
\]
Here, n_R denotes number of respondents.

        Population
 n_R      Count      w_post     \bar{y}   w_post \bar{y} n_R   w_post n_R
  66       700       10.60606     32            22400              700
  58       800       13.79310     41            32800              800
  26       500       19.23077     54            27000              500
 150      2000                                  82200             2000
The last column is calculated to check the weights constructed—the sum of the
poststratified weights in each poststratum should equal the population count for
that poststratum.
\[
\hat{t}_{\text{post}} = 82200
\]
and
\[
\bar{y}_{\text{post}} = \frac{82200}{2000} = 41.1.
\]
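A quick numerical check of the weighting-class total in (d) and the poststratified total in (e), using the group counts and means from the problem (initial weight 2000/200 = 10 for everyone):

```python
# (sample size, respondents, population count, mean hours) by GPA group
groups = [
    (75, 66, 700, 32),
    (72, 58, 800, 41),
    (53, 26, 500, 54),
]

t_wc = 0.0
t_post = 0.0
for n, nr, Nh, ybar in groups:
    w_wc = 10 * n / nr             # weighting-class adjusted weight
    w_post = 10 * Nh / (10 * nr)   # poststratified weight
    t_wc += w_wc * nr * ybar
    t_post += w_post * nr * ybar

print(round(t_wc, 1), round(t_wc / 2000, 2))     # approximately 82140 and 41.07
print(round(t_post, 1), round(t_post / 2000, 2)) # approximately 82200 and 41.1
```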

8.6 (a) For this exercise, we classify the missing data in the “Other/Unknown”
category. Typically, raking would be used in situations in which the classification
variables were known (and known to be accurate) for all respondents.

Response
Population Respondents Rate (%)
Ph.D. 10235 3036 30
Master’s 7071 1640 23
Other/Unknown 1303 325 25
Industry 5397 1809 34
Academia 6327 2221 35
Government 2047 880 43
Other/Unknown 4838 91 19
These response rates are pretty dismal. The nonresponse does not appear to be
MCAR, as it differs by degree and by type of employment. I doubt that it is
MAR; more information than is known from this survey would be needed to predict
the nonresponse.
(b) The cell counts from the sample are:
Industry Academia Other
PhD 798 1787 451 3036
non-PhD 1011 434 520 1965
1809 2221 971 5001
The initial sum of weights for each cell are:
Industry Academia Other
PhD 2969.4 6649.5 1678.2 11297.1
non-PhD 3762.0 1614.9 1934.9 7311.9
6731.4 8264.5 3613.1 18609.0
After adjusting for the population row counts (10235 for Ph.D. and 8374 for
non-Ph.D.), the new table is:
Industry Academia Other
PhD 2690.2 6024.4 1520.4 10235
non-PhD 4308.5 1849.5 2216.0 8374
6998.7 7873.9 3736.4 18609
Raking to the population column totals (Industry, 5397; Academia, 6327; Other,
6885) gives:
Industry Academia Other
PhD 2074.6 4840.8 2801.6 9717.0
non-PhD 3322.4 1486.2 4083.4 8892.0
5397.0 6327.0 6885.0 18609.0
As you can see, the previous two tables are still far apart. After iterating, the final
table of the weight sums is:

Industry Academia Other


PhD 2239.2 4980.6 3015.2 10235.0
non-PhD 3157.8 1346.4 3869.8 8374.0
5397.0 6327.0 6885.0 18609.0
The raking has dramatically increased the weights in the "Other" employment category.
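The raking iterations can be sketched as iterative proportional fitting, starting from the initial weight sums and the known margins; after convergence the cell sums should reproduce the final table above (up to rounding). Variable names here are illustrative:

```python
# Iterative proportional fitting (raking) for Exercise 8.6(b).
rows = [[2969.4, 6649.5, 1678.2],   # PhD: Industry, Academia, Other
        [3762.0, 1614.9, 1934.9]]   # non-PhD
row_tot = [10235.0, 8374.0]
col_tot = [5397.0, 6327.0, 6885.0]

for _ in range(100):
    for i in range(2):                       # scale rows to the row totals
        s = sum(rows[i])
        rows[i] = [x * row_tot[i] / s for x in rows[i]]
    for j in range(3):                       # scale columns to the column totals
        s = rows[0][j] + rows[1][j]
        for i in range(2):
            rows[i][j] *= col_tot[j] / s

print([[round(x, 1) for x in r] for r in rows])
```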
To calculate the proportions using the raking weights, create a new variable weight.
For respondents with PhD’s who work in industry, weight = 2239.2/798 = 2.806.
For the question, “Should the ASA develop some sort of certification?” the estimated
percentages are:
Without With Raking
Weights Weights
No response 0.2 0.3
Yes 26.4 25.8
Possibly 22.3 22.3
No opinion 5.4 5.4
Unlikely 6.7 6.9
No 39.0 39.3
(c) I think such a conclusion is questionable because of the very high nonresponse
rate. This survey is closer to a self-selected opinion poll than to a probability sample.
8.7
Response Female
Discipline Rate (%) Members (%)
Literature 69.5 38
Classics 71.2 27
Philosophy 73.1 18
History 71.5 19
Linguistics 73.9 36
Political Science 69.0 13
Sociology 71.4 26
The model implicitly adopted in Example 4.3 was that nonrespondents within each
stratum were similar to respondents in that stratum.
We can use a \chi^2 test to examine whether the nonresponse rate varies among strata.
The observed counts are given in the following table, with expected counts in paren-
theses:

Respondent Nonrespondent
Literature 636 (651.6) 279 (263.4) 915
Classics 451 (450.8) 182 (182.2) 633
Philosophy 481 (468.6) 177 (189.4) 658
History 611 (608.9) 244 (246.1) 855
Linguistics 493 (475.0) 174 (192.0) 667
Political Science 575 (593.2) 258 (239.8) 833
Sociology 588 (586.8) 236 (237.2) 824
3835 1550 5385
The Pearson test statistic is
\[
X^2 = \sum_{\text{cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = 6.8.
\]

Comparing the test statistic to a \chi^2_6 distribution, we calculate p-value 0.34. There
is no evidence that the response rates differ among strata.
The estimated correlation coefficient of the response rate and the percent female
members is 0.19. Performing a hypothesis test for association (Pearson correlation,
Spearman correlation, or Kendall's \tau) gives p-value > .10. There is no evidence that
the response rate is associated with the percentage of members who are female.
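The statistic can be verified from the table of observed counts; the p-value below uses the closed-form chi-square survival function for 6 df:

```python
import math

# Respondents and stratum totals from the table above
respondents = [636, 451, 481, 611, 493, 575, 588]
totals = [915, 633, 658, 855, 667, 833, 824]
rate = sum(respondents) / sum(totals)   # pooled response rate 3835/5385

x2 = 0.0
for r, n in zip(respondents, totals):
    x2 += (r - rate * n) ** 2 / (rate * n)
    x2 += ((n - r) - (1 - rate) * n) ** 2 / ((1 - rate) * n)

# chi-square survival function with 6 df: exp(-x/2) * sum_{j=0}^{2} (x/2)^j / j!
half = x2 / 2
pvalue = math.exp(-half) * (1 + half + half ** 2 / 2)
print(round(x2, 1), round(pvalue, 2))  # approximately 6.8 and 0.34
```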
8.12 (a) The overall response rate, using the file teachmi.dat, was 310/754=0.41.
(b) As with many nonresponse problems, it’s easy to think of plausible reasons why
the nonresponse bias might go either direction. The teachers who work many hours
may be working so hard they are less likely to return the survey, or they may be
more conscientious and thus more likely to return it.
(c) The means and variances from the file teachnr.dat (ignoring missing values)
are
hrwork size preprmin assist
responses 26 25 26 26
ȳ 36.46 24.92 160.19 152.31
s2 2.61 25.74 3436.96 49314.46
V̂ (ȳ) 0.10 1.03 132.19 1896.71
The corresponding estimates from teachers.dat, the original cluster sample, are:
hrwork size preprmin assist
ȳˆr 33.82 26.93 168.74 52.00
V̂ (ȳˆr ) 0.50 0.57 70.57 228.96
8.14 (a) We are more likely to delete an observation if the value of xi is small.
Since xi and yi are positively correlated, we expect the mean of y to be too big.
(b) The population mean of acres92 is ȳU = 308582.

8.16 We use the approximations from Chapter 3 to obtain:
\[
E[\hat{\bar{y}}_R]
= E\left[ \frac{\displaystyle\sum_{i=1}^N Z_i R_i w_i y_i}{\displaystyle\sum_{i=1}^N \phi_i}
\left\{ 1 - \frac{\displaystyle\sum_{i=1}^N (Z_i R_i w_i - \phi_i)}{\displaystyle\sum_{i=1}^N Z_i R_i w_i} \right\} \right]
\approx \frac{\displaystyle\sum_{i=1}^N \phi_i y_i}{\displaystyle\sum_{i=1}^N \phi_i}
\approx \frac{1}{N \bar{\phi}_U} \sum_{i=1}^N \phi_i y_i.
\]
Thus the bias is
\[
\mathrm{Bias}[\hat{\bar{y}}_R]
\approx \frac{1}{N \bar{\phi}_U} \sum_{i=1}^N (\phi_i - \bar{\phi}_U) y_i
= \frac{1}{N \bar{\phi}_U} \sum_{i=1}^N (\phi_i - \bar{\phi}_U)(y_i - \bar{y}_U)
\approx \frac{1}{\bar{\phi}_U} \, \mathrm{Cov}(\phi_i, y_i).
\]
8.17 The argument is similar to the previous exercise. If the classes are sufficiently
large, then E[1/\tilde{\phi}_c] \approx 1/\bar{\phi}_c.
8.19
\begin{align*}
V(\hat{\bar{y}}_{wc})
&= V\left[ \frac{n_1}{n} \frac{1}{n_{1R}} \sum_{i=1}^N Z_i R_i x_i y_i + \frac{n_2}{n} \frac{1}{n_{2R}} \sum_{i=1}^N Z_i R_i (1-x_i) y_i \right] \\
&= E\left\{ V\left[ \frac{n_1}{n} \frac{1}{n_{1R}} \sum_{i=1}^N Z_i R_i x_i y_i + \frac{n_2}{n} \frac{1}{n_{2R}} \sum_{i=1}^N Z_i R_i (1-x_i) y_i \,\Big|\, Z_1, \ldots, Z_N \right] \right\} \\
&\quad + V\left\{ E\left[ \frac{n_1}{n} \frac{1}{n_{1R}} \sum_{i=1}^N Z_i R_i x_i y_i + \frac{n_2}{n} \frac{1}{n_{2R}} \sum_{i=1}^N Z_i R_i (1-x_i) y_i \,\Big|\, Z_1, \ldots, Z_N \right] \right\} \\
&= E\left\{ V\left[ \frac{n_1}{n} \frac{\sum_{i=1}^N Z_i R_i x_i y_i}{\sum_{i=1}^N Z_i R_i x_i} + \frac{n_2}{n} \frac{\sum_{i=1}^N Z_i R_i (1-x_i) y_i}{\sum_{i=1}^N Z_i R_i (1-x_i)} \,\Big|\, Z_1, \ldots, Z_N \right] \right\} \\
&\quad + V\left\{ E\left[ \frac{n_1}{n} \frac{\sum_{i=1}^N Z_i R_i x_i y_i}{\sum_{i=1}^N Z_i R_i x_i} + \frac{n_2}{n} \frac{\sum_{i=1}^N Z_i R_i (1-x_i) y_i}{\sum_{i=1}^N Z_i R_i (1-x_i)} \,\Big|\, Z_1, \ldots, Z_N \right] \right\}.
\end{align*}
We use the ratio approximations from Chapter 4 to find the approximate expected
values and variances.
\begin{align*}
E\left[ \frac{n_1}{n} \frac{\sum_{i=1}^N Z_i R_i x_i y_i}{\sum_{i=1}^N Z_i R_i x_i} \,\Big|\, Z_1, \ldots, Z_N \right]
&= E\left[ \frac{n_1}{n} \frac{\sum_{i=1}^N Z_i R_i x_i y_i}{\phi_1 \sum_{i=1}^N Z_i x_i}
\left( 1 - \frac{\sum_{i=1}^N Z_i [R_i - \phi_1] x_i}{\sum_{i=1}^N Z_i R_i x_i} \right) \,\Big|\, Z_1, \ldots, Z_N \right] \\
&= \frac{1}{n\phi_1} E\left[ \sum_{i=1}^N Z_i R_i x_i y_i
- \sum_{i=1}^N Z_i R_i x_i y_i \frac{\sum_{i=1}^N Z_i [R_i - \phi_1] x_i}{\sum_{i=1}^N Z_i R_i x_i} \,\Big|\, Z_1, \ldots, Z_N \right] \\
&\approx \frac{1}{n} \sum_{i=1}^N Z_i x_i y_i - \frac{1}{(n\phi_1)^2} \sum_{i=1}^N Z_i V(R_i) x_i y_i \\
&\approx \frac{1}{n} \sum_{i=1}^N Z_i x_i y_i.
\end{align*}
Consequently,
\begin{align*}
V\left\{ E\left[ \frac{n_1}{n} \frac{\sum Z_i R_i x_i y_i}{\sum Z_i R_i x_i} + \frac{n_2}{n} \frac{\sum Z_i R_i (1-x_i) y_i}{\sum Z_i R_i (1-x_i)} \,\Big|\, Z_1, \ldots, Z_N \right] \right\}
&\approx V\left\{ \frac{1}{n} \sum_{i=1}^N Z_i x_i y_i + \frac{1}{n} \sum_{i=1}^N Z_i (1-x_i) y_i \right\} \\
&= \left( 1 - \frac{n}{N} \right) \frac{S_y^2}{n},
\end{align*}
the variance that would be obtained if there were no nonresponse. For the other
term,
\begin{align*}
V\left[ \frac{n_1}{n} \frac{\sum_{i=1}^N Z_i R_i x_i y_i}{\sum_{i=1}^N Z_i R_i x_i} \,\Big|\, Z_1, \ldots, Z_N \right]
&= V\left[ \frac{1}{n\phi_1} \sum_{i=1}^N Z_i R_i x_i y_i
\left( 1 - \frac{\sum_{i=1}^N Z_i [R_i - \phi_1] x_i}{\sum_{i=1}^N Z_i R_i x_i} \right) \,\Big|\, Z_1, \ldots, Z_N \right] \\
&\approx \frac{1}{(n\phi_1)^2} \sum_{i=1}^N Z_i V(R_i) x_i y_i^2
\approx \frac{\phi_1 (1 - \phi_1)}{(n\phi_1)^2} \sum_{i=1}^N Z_i x_i y_i^2.
\end{align*}
Thus, since E[Z_i] = n/N,
\begin{align*}
E\left\{ V\left[ \cdots \,\Big|\, Z_1, \ldots, Z_N \right] \right\}
&\approx E\left\{ \frac{\phi_1(1-\phi_1)}{(n\phi_1)^2} \sum_{i=1}^N Z_i x_i y_i^2
+ \frac{\phi_2(1-\phi_2)}{(n\phi_2)^2} \sum_{i=1}^N Z_i (1-x_i) y_i^2 \right\} \\
&= \frac{\phi_1(1-\phi_1)}{n\phi_1^2} \, \frac{1}{N} \sum_{i=1}^N x_i y_i^2
+ \frac{\phi_2(1-\phi_2)}{n\phi_2^2} \, \frac{1}{N} \sum_{i=1}^N (1-x_i) y_i^2.
\end{align*}

8.20 (a) Respondents are divided into 5 classes on the basis of the number of nights
the respondent was home during the 4 nights preceding the survey call.
The sampling weight wi for respondent i is then multiplied by 5/(ki + 1). The
respondents with k = 0 were only home on one of the five nights and are assigned to
represent their share of the population plus the share of four persons in the sample
who were called on one of their “unavailable” nights. The respondents most likely
to be home have k = 4; it is presumed that all persons in the sample who were home
every night were reached, so their weights are unchanged.
(b) This method of weighting is based on the premise that the most accessible per-
sons will tend to be overrepresented in the survey data. The method is easy to use,
theoretically appealing, and can be used in conjunction with callbacks. But it still
misses people who were not at home on any of the five nights, or who refused to
participate in the survey. Since in many surveys done over the telephone, nonresponse
is due in large part to refusals, the HPS method may not be helpful in dealing with
all nonresponse. Values of k may also be in error, because people may err when
recalling how many evenings they were home.
Chapter 9

Variance Estimation in Complex Surveys

9.1 All of the methods discussed in this chapter would be appropriate. Note that
the replication methods might slightly overestimate the variance because sampling
is done without replacement, but since the sampling fractions are fairly small we
expect the overestimation to be small.
9.2 We calculate \bar{y} = 8.23333333 and s^2 = 15.978, so s^2/30 = 0.5326.
For jackknife replicate j, the jackknife weight is w_{j(j)} = 0 for observation j and
w_{i(j)} = (30/29) w_i = (30/29)(100/30) = 3.44828 for i \neq j. Using the jackknife
weights, we find \bar{y}_{(1)} = 8.2413, \ldots, \bar{y}_{(30)} = 8.20690, so, by (9.8),
\[
\hat{V}_{JK}(\bar{y}) = \frac{29}{30} \sum_{j=1}^{30} [\bar{y}_{(j)} - \bar{y}]^2 = 0.5326054.
\]
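The delete-one jackknife for a sample mean is easy to check numerically; a sketch using the 30 observations listed with Exercise 9.3:

```python
# Jackknife variance of the mean for the SRS of Exercise 9.2 (n = 30, N = 100).
y = [8, 5, 2, 6, 6, 3, 8, 6, 10, 7, 15, 9, 15, 3, 5, 6,
     7, 10, 14, 3, 4, 17, 10, 6, 14, 12, 7, 8, 12, 9]
n = len(y)
ybar = sum(y) / n
s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)

# delete-one replicate means
reps = [(sum(y) - yj) / (n - 1) for yj in y]
v_jk = (n - 1) / n * sum((r - ybar) ** 2 for r in reps)

print(round(ybar, 4), round(s2 / n, 4), round(v_jk, 4))
```

For the sample mean the jackknife reproduces s^2/n exactly, which is why the two numbers in the text agree.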

9.3 Here is the empirical cdf F̂ (y):

Obs y COUNT PERCENT CUM_FREQ CUM_PCT epmf ecdf

1 2 3.3333 3.3333 3.333 3.333 0.03333 0.03333


2 3 10.0000 10.0000 13.333 13.333 0.10000 0.13333
3 4 3.3333 3.3333 16.667 16.667 0.03333 0.16667
4 5 6.6667 6.6667 23.333 23.333 0.06667 0.23333
5 6 16.6667 16.6667 40.000 40.000 0.16667 0.40000
6 7 10.0000 10.0000 50.000 50.000 0.10000 0.50000
7 8 10.0000 10.0000 60.000 60.000 0.10000 0.60000
8 9 6.6667 6.6667 66.667 66.667 0.06667 0.66667
9 10 10.0000 10.0000 76.667 76.667 0.10000 0.76667
10 12 6.6667 6.6667 83.333 83.333 0.06667 0.83333
11 14 6.6667 6.6667 90.000 90.000 0.06667 0.90000
12 15 6.6667 6.6667 96.667 96.667 0.06667 0.96667


13 17 3.3333 3.3333 100.000 100.000 0.03333 1.00000

Note that \hat{F}(7) = .5, so the median is \hat{\theta}_{0.5} = 7. No interpolation is needed.

As in Example 9.12, \hat{F}(\theta_{1/2}) is the sample proportion of observations that take on
value at most \theta_{1/2}, so
\[
\hat{V}[\hat{F}(\hat{\theta}_{1/2})] = \left(1 - \frac{n}{N}\right) \frac{0.25862069}{n}
= \left(1 - \frac{30}{100}\right) \frac{0.25862069}{30} = 0.006034483.
\]
This is a small sample, so we use the t_{29} critical value of 2.045 to calculate
\[
2.045 \sqrt{\hat{V}[\hat{F}(\hat{\theta}_{1/2})]} = 0.1588596.
\]
The lower confidence bound is \hat{F}^{-1}(.5 - 0.1588596) = \hat{F}^{-1}(0.3411404) and the upper
confidence bound for the median is \hat{F}^{-1}(.5 + 0.1588596) = \hat{F}^{-1}(0.6588596).
Interpolating, we have that the lower confidence bound is
\[
5 + \frac{0.34114 - 0.23333}{0.4 - 0.23333} (6 - 5) = 5.6
\]
and the upper confidence bound is
\[
8 + \frac{0.6588596 - 0.6}{0.666667 - 0.6} (9 - 8) = 8.8.
\]
Thus an approximate 95% CI is [5.6, 8.8].
SAS code below gives approximately the same interval:

data srs30;
input y @@;
wt = 100/30;
datalines;
8 5 2 6 6 3 8 6 10 7 15 9 15 3 5 6
7 10 14 3 4 17 10 6 14 12 7 8 12 9
;

/* We use two methods. First, the "hand" calculations */

/* Find the empirical cdf */

proc freq data=srs30;


tables y / out = htpop_epmf outcum;
weight wt;
run;

data htpop_epmf;
set htpop_epmf;
epmf = percent/100;
ecdf = cum_pct/100;
run;

proc print data=htpop_epmf;


run;

/* Find the variance of \hat{F}(median) */

data calcvar;
set srs30;
ui = 0;
if y le 7 then ui = 1;
ei = ui - .5;

proc univariate data=calcvar;


var ei;
run;

/* Calculate the stratified variance for the total of variable ei */


proc surveymeans data=calcvar total = 100 sum stderr;
weight wt;
var ei;
run;

/* Method 2: Use sas directly to find the CI */

proc surveymeans data=srs30 total=100


percentile=(25 50 75) nonsymcl;
weight wt;
var y;
run;

Quantiles

Variable Percentile Estimate Std Error 95% Confidence Limits


___________________________________________________________________
y 25% Q1 5.100000 0.770164 2.65604792 5.8063712
50% Median 7.000000 0.791213 5.64673564 8.8831609
75% Q3 9.833333 1.057332 7.16875624 11.4937313
___________________________________________________________________
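The "hand" interpolation can also be reproduced outside SAS; a minimal sketch (variable names illustrative) that inverts the empirical cdf at 0.5 -/+ 0.1588596 and agrees with the SAS endpoints above (about 5.65 and 8.88, reported as [5.6, 8.8] in the text):

```python
# Interpolated confidence bounds for the median in Exercise 9.3.
y = [8, 5, 2, 6, 6, 3, 8, 6, 10, 7, 15, 9, 15, 3, 5, 6,
     7, 10, 14, 3, 4, 17, 10, 6, 14, 12, 7, 8, 12, 9]
n = len(y)
pts = sorted(set(y))
ecdf = [(v, sum(1 for yi in y if yi <= v) / n) for v in pts]

def inv_ecdf(p):
    # linear interpolation of the inverse ecdf between adjacent jump points
    for (v0, f0), (v1, f1) in zip(ecdf, ecdf[1:]):
        if f0 < p <= f1:
            return v0 + (p - f0) / (f1 - f0) * (v1 - v0)
    return pts[-1]

half = 0.1588596
lo = inv_ecdf(0.5 - half)
hi = inv_ecdf(0.5 + half)
print(lo, hi)
```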

9.5
              (a)      (b)      (c)      (d)      (e)      (f)      (g)
              Age    Violent  Bothpar   Male   Hispanic  Sinpar   Drugs
θ̂_1        0.12447  0.52179  0.29016  0.90160  0.30106  0.55691  0.90072
θ̂_2        0.09528  0.43358  0.31309  0.84929  0.20751  0.52381  0.84265
θ̂_3        0.08202  0.36733  0.34417  0.99319  0.17876  0.51068  0.82960
θ̂_4        0.21562  0.37370  0.25465  0.96096  0.08532  0.55352  0.80869
θ̂_5        0.21660  0.42893  0.30181  0.91314  0.14912  0.54480  0.74491
θ̂_6        0.07321  0.48006  0.30514  0.96786  0.15752  0.55350  0.82232
θ̂_7        0.02402  0.51201  0.27299  0.96558  0.25170  0.54490  0.84977

θ̂          0.12330  0.44325  0.29743  0.93119  0.18877  0.54108  0.82821
θ̃          0.11875  0.44534  0.29743  0.93594  0.19014  0.54116  0.82838
V̂_1(θ̂)    0.00076  0.00055  0.00012  0.00036  0.00072  0.00004  0.00032
V̂_2(θ̂)    0.00076  0.00055  0.00012  0.00036  0.00072  0.00004  0.00032

9.6 From Exercise 3.4, \hat{B} = 11.41946, \hat{\bar{y}}_r = \bar{x}_U \hat{B} = 10.3\hat{B} = 117.6, and SE(\hat{\bar{y}}_r) =
3.98. Using the jackknife, we have \hat{B}_{(\cdot)} = 11.41937, \hat{\bar{y}}_{r(\cdot)} = 117.6, and SE(\hat{\bar{y}}_r) =
10.3\sqrt{0.1836} = 4.41. The jackknife standard error is larger, partly because it does
not include the fpc.
9.7 We use
\[
\hat{V}_{JK}(\hat{\bar{y}}_r) = \frac{n-1}{n} \sum_{j=1}^{10} (\hat{\bar{y}}_{r(j)} - \hat{\bar{y}}_r)^2.
\]

The ȳˆr(j) ’s for returnf and hadmeas are given in the following table:

School, j returnf, ȳˆr(j) hadmeas, ȳˆr(j)


1 0.5822185 0.4253903
2 0.5860165 0.4647582
3 0.5504290 0.4109223
4 0.5768984 0.4214941
5 0.5950112 0.4275614
6 0.5829014 0.4615285
7 0.5726580 0.4379689
8 0.5785320 0.4313120
9 0.5650470 0.4951728
10 0.5986785 0.4304341
For returnf,
\[
\hat{V}_{JK}(\hat{\bar{y}}_r) = \frac{9}{10} \sum_{j=1}^{10} (\hat{\bar{y}}_{r(j)} - 0.5789482)^2 = 0.00160.
\]
For hadmeas,
\[
\hat{V}_{JK}(\hat{\bar{y}}_r) = \frac{9}{10} \sum_{j=1}^{10} (\hat{\bar{y}}_{r(j)} - 0.4402907)^2 = 0.00526.
\]

9.8 We have \hat{B}_{(\cdot)} = 0.9865651 and \hat{V}_{JK}(\hat{B}) = 3.707 \times 10^{-5}. With the fpc, the
linearization variance estimate is \hat{V}_L(\hat{B}) = 3.071 \times 10^{-5}; the linearization variance
estimate if we ignore the fpc is 3.071 \times 10^{-5}/\sqrt{1 - 300/3078} = 3.232 \times 10^{-5}.

9.9 The median weekday greens fee for nine holes is \hat{\theta} = 12. For the SRS of size
120,
\[
V[\hat{F}(\theta_{0.5})] = \frac{(.5)(.5)}{120} = 0.0021.
\]
An approximate 95% confidence interval for the median is therefore
\[
[\hat{F}^{-1}(.5 - 1.96\sqrt{.0021}), \; \hat{F}^{-1}(.5 + 1.96\sqrt{.0021})] = [\hat{F}^{-1}(.4105), \hat{F}^{-1}(.5895)].
\]
We have the following values for the empirical distribution function:

   y          10.25   10.8    11     11.5    12
   \hat{F}(y) .3917  .4000  .4167  .4333  .5167
   y          13      14     15      16
   \hat{F}(y) .5250  .5417  .5833  .6000

Interpolating,
\[
\hat{F}^{-1}(.4105) = 10.8 + \frac{.4105 - .4}{.4167 - .4} (11 - 10.8) = 10.9
\]
and
\[
\hat{F}^{-1}(.5895) = 15 + \frac{.5895 - .5833}{.6 - .5833} (16 - 15) = 15.4.
\]
Thus, an approximate 95% confidence interval for the median is [10.9, 15.4].
Note: If we apply the bootstrap to these data, we get
\[
\frac{1}{1000} \sum_{r=1}^{1000} \hat{\theta}^*_r = 12.86
\]
with standard error 1.39. This leads to a 95% CI of [10.1, 15.6] for the median.
9.13 (a) Since h''(t) = -2, the remainder term is
\[
\int_a^x (x - t) h''(t)\,dt = -2 \int_a^x (x - t)\,dt = -2\,\frac{(x - a)^2}{2} = -(x - a)^2.
\]
Thus,
\[
h(\hat{p}) = h(p) + h'(p)(\hat{p} - p) - (\hat{p} - p)^2 = p(1 - p) + (1 - 2p)(\hat{p} - p) - (\hat{p} - p)^2.
\]
(b) The remainder term is likely to be smaller than the other terms because it has
(\hat{p} - p)^2 in it. This will be small if \hat{p} is close to p.
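Because h(p) = p(1 - p) has constant second derivative, the second-order expansion in (a) is exact, not just approximate; a one-line numeric check (values arbitrary):

```python
# Exactness of the expansion p(1-p) + (1-2p)(phat-p) - (phat-p)^2 = phat(1-phat).
p, phat = 0.3, 0.41
lhs = phat * (1 - phat)
rhs = p * (1 - p) + (1 - 2 * p) * (phat - p) - (phat - p) ** 2
print(abs(lhs - rhs))
```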

(c) To find the exact variance, we need to find V(\hat{p} - \hat{p}^2), which involves the fourth
moments. For an SRSWR, X = n\hat{p} \sim \mathrm{Bin}(n, p), so we can find the moments using
the moment generating function of the Binomial:
\[
M_X(t) = (pe^t + q)^n.
\]
So,
\begin{align*}
E(X) &= M_X'(t)\big|_{t=0} = n(pe^t + q)^{n-1} pe^t \big|_{t=0} = np \\
E(X^2) &= M_X''(t)\big|_{t=0}
= \left[ n(n-1)(pe^t + q)^{n-2}(pe^t)^2 + n(pe^t + q)^{n-1} pe^t \right]\big|_{t=0} \\
&= n(n-1)p^2 + np = n^2 p^2 + np(1-p) \\
E(X^3) &= M_X'''(t)\big|_{t=0} = np(1 - 3p + 3np + 2p^2 - 3np^2 + n^2 p^2) \\
E(X^4) &= np(1 - 7p + 7np + 12p^2 - 18np^2 + 6n^2 p^2 - 6p^3 + 11np^3 - 6n^2 p^3 + n^3 p^3).
\end{align*}
Then,
\begin{align*}
V[\hat{p}(1 - \hat{p})]
&= V(\hat{p}) + V(\hat{p}^2) - 2\,\mathrm{Cov}(\hat{p}, \hat{p}^2) \\
&= E[\hat{p}^2] - p^2 + E[\hat{p}^4] - [E(\hat{p}^2)]^2 - 2E[\hat{p}^3] + 2pE(\hat{p}^2) \\
&= \frac{p(1-p)}{n}
+ \frac{p}{n^3}(1 - 7p + 7np + 12p^2 - 18np^2 + 6n^2 p^2 - 6p^3 + 11np^3 - 6n^2 p^3 + n^3 p^3) \\
&\quad - \left[ p^2 + \frac{p(1-p)}{n} \right]^2
- 2\,\frac{p}{n^2}(1 - 3p + 3np + 2p^2 - 3np^2 + n^2 p^2)
+ 2p\left[ p^2 + \frac{p(1-p)}{n} \right] \\
&= \frac{p(1-p)}{n}(1 - 4p + 4p^2)
+ \frac{1}{n^2}(-2p + 14p^2 - 22p^3 + 12p^4) + \frac{1}{n^3}(p - 7p^2 + 12p^3 - 6p^4).
\end{align*}
Note that the first term is (1 - 2p)^2 V(\hat{p}), and the other terms are (constant)/n^2
and (constant)/n^3. The remainder terms become small relative to the first term
when n is large. You can see why statisticians use the linearization method so
frequently: even for this simple example, the exact calculations of the variance are
nasty.

Note that with an SRS without replacement, the result is much more complicated.
Results from the following paper may be used to find the moments:
Finucan, H. M., Galbraith, R. F., and Stone, M. (1974). Moments without tears
in simple random sampling from a finite population. Biometrika, 61, 151–154.
9.14 (a) Write B_1 = h(t_{xy}, t_x, t_y, t_{x^2}, N), where
\[
h(a, b, c, d, e) = \frac{a - bc/e}{d - b^2/e} = \frac{ea - bc}{ed - b^2}.
\]
The partial derivatives, evaluated at the population quantities, are:
\begin{align*}
\frac{\partial h}{\partial a} &= \frac{e}{ed - b^2} \\
\frac{\partial h}{\partial b} &= -\frac{c}{ed - b^2} + 2b\,\frac{ea - bc}{(ed - b^2)^2}
= -\frac{c}{ed - b^2} + \frac{2bB_1}{ed - b^2} \\
&= \frac{e}{ed - b^2}\left[ -\frac{c}{e} + \frac{b}{e}B_1 + \frac{b}{e}B_1 \right]
= -\frac{e}{ed - b^2}(B_0 - B_1\bar{x}_U) \\
\frac{\partial h}{\partial c} &= -\frac{b}{ed - b^2} = -\frac{e}{ed - b^2}\,\bar{x}_U \\
\frac{\partial h}{\partial d} &= -\frac{e}{ed - b^2}\,B_1 \\
\frac{\partial h}{\partial e} &= \frac{a}{ed - b^2} - \frac{d(ea - bc)}{(ed - b^2)^2}
= \frac{a}{ed - b^2} - \frac{dB_1}{ed - b^2}
= \frac{e}{ed - b^2}\,B_0\bar{x}_U.
\end{align*}
The last equality follows from the normal equations. Then, by linearization,
\begin{align*}
\hat{B}_1 - B_1
&\approx \frac{\partial h}{\partial a}(\hat{t}_{xy} - t_{xy}) + \frac{\partial h}{\partial b}(\hat{t}_x - t_x)
+ \frac{\partial h}{\partial c}(\hat{t}_y - t_y)
+ \frac{\partial h}{\partial d}(\hat{t}_{x^2} - t_{x^2}) + \frac{\partial h}{\partial e}(\hat{N} - N) \\
&= \frac{N}{N t_{x^2} - (t_x)^2}\Big[ \hat{t}_{xy} - t_{xy} - (B_0 - B_1\bar{x}_U)(\hat{t}_x - t_x)
- \bar{x}_U(\hat{t}_y - t_y) - B_1(\hat{t}_{x^2} - t_{x^2}) + B_0\bar{x}_U(\hat{N} - N) \Big] \\
&= \frac{N}{N t_{x^2} - (t_x)^2}\left[ \sum_{i \in S} w_i \left\{ x_i y_i - (B_0 - B_1\bar{x}_U)x_i
- \bar{x}_U y_i - B_1 x_i^2 + B_0\bar{x}_U \right\} \right] \\
&\quad - \frac{N}{N t_{x^2} - (t_x)^2}\left[ t_{xy} - t_x(B_0 - B_1\bar{x}_U) - \bar{x}_U t_y - B_1 t_{x^2} + B_0 N\bar{x}_U \right] \\
&= \frac{N}{N t_{x^2} - (t_x)^2} \sum_{i \in S} w_i (y_i - B_0 - B_1 x_i)(x_i - \bar{x}_U).
\end{align*}

9.15 (a) Write
\[
S^2 = \frac{1}{N-1}\sum_{i=1}^N \left( y_i - \frac{1}{N}\sum_{j=1}^N y_j \right)^2
= \frac{1}{t_3 - 1}\left( t_1 - \frac{t_2^2}{t_3} \right).
\]
(b) Substituting, we have
\[
\hat{S}^2 = \frac{1}{\hat{t}_3 - 1}\left( \hat{t}_1 - \frac{\hat{t}_2^2}{\hat{t}_3} \right).
\]
(c) We need to find the partial derivatives:
\begin{align*}
\frac{\partial h}{\partial t_1} &= \frac{1}{t_3 - 1} \\
\frac{\partial h}{\partial t_2} &= -2\,\frac{t_2}{t_3(t_3 - 1)} \\
\frac{\partial h}{\partial t_3} &= -\frac{1}{(t_3 - 1)^2}\left( t_1 - \frac{t_2^2}{t_3} \right) + \frac{1}{t_3 - 1}\,\frac{t_2^2}{t_3^2}.
\end{align*}
Then, by linearization,
\[
\hat{S}^2 - S^2 \approx \frac{\partial h}{\partial t_1}(\hat{t}_1 - t_1)
+ \frac{\partial h}{\partial t_2}(\hat{t}_2 - t_2) + \frac{\partial h}{\partial t_3}(\hat{t}_3 - t_3).
\]
Let
\begin{align*}
q_i &= \frac{\partial h}{\partial t_1} y_i^2 + \frac{\partial h}{\partial t_2} y_i + \frac{\partial h}{\partial t_3} \\
&= \frac{1}{t_3 - 1} y_i^2 - \frac{2t_2}{t_3(t_3 - 1)} y_i
- \frac{1}{(t_3 - 1)^2}\left( t_1 - \frac{t_2^2}{t_3} \right) + \frac{1}{t_3 - 1}\,\frac{t_2^2}{t_3^2} \\
&= \frac{1}{t_3 - 1}\left[ y_i^2 - 2\,\frac{t_2}{t_3} y_i
- \frac{1}{t_3 - 1}\left( t_1 - \frac{t_2^2}{t_3} \right) + \frac{t_2^2}{t_3^2} \right].
\end{align*}

9.16 (a) Write R = h(t_1, \ldots, t_6), where
\[
h(a, b, c, d, e, f) = \frac{d - ab/f}{\sqrt{(c - a^2/f)(e - b^2/f)}} = \frac{fd - ab}{\sqrt{(fc - a^2)(fe - b^2)}}.
\]
The partial derivatives, evaluated at the population quantities, are:
\begin{align*}
\frac{\partial h}{\partial a} &= \frac{1}{\sqrt{(fc - a^2)(fe - b^2)}}\left( -b + \frac{a(fd - ab)}{fc - a^2} \right)
= \frac{-t_y}{N(N-1)S_xS_y} + \frac{t_x R}{N(N-1)S_x^2} \\
\frac{\partial h}{\partial b} &= \frac{1}{\sqrt{(fc - a^2)(fe - b^2)}}\left( -a + \frac{b(fd - ab)}{fe - b^2} \right)
= \frac{-t_x}{N(N-1)S_xS_y} + \frac{t_y R}{N(N-1)S_y^2} \\
\frac{\partial h}{\partial c} &= -\frac{1}{2\sqrt{(fc - a^2)(fe - b^2)}}\,\frac{f(fd - ab)}{fc - a^2}
= -\frac{R}{2}\,\frac{1}{(N-1)S_x^2} \\
\frac{\partial h}{\partial d} &= \frac{f}{\sqrt{(fc - a^2)(fe - b^2)}} \\
\frac{\partial h}{\partial e} &= -\frac{1}{2\sqrt{(fc - a^2)(fe - b^2)}}\,\frac{f(fd - ab)}{fe - b^2}
= -\frac{R}{2}\,\frac{1}{(N-1)S_y^2} \\
\frac{\partial h}{\partial f} &= \frac{d}{\sqrt{(fc - a^2)(fe - b^2)}}
- \frac{fd - ab}{2\sqrt{(fc - a^2)(fe - b^2)}}\left( \frac{e}{fe - b^2} + \frac{c}{fc - a^2} \right) \\
&= \frac{t_{xy}}{N(N-1)S_xS_y} - \frac{R}{2}\left( \frac{t_{y^2}}{N(N-1)S_y^2} + \frac{t_{x^2}}{N(N-1)S_x^2} \right).
\end{align*}
Then, by linearization,
\begin{align*}
\hat{R} - R
&\approx \frac{\partial h}{\partial a}(\hat{t}_x - t_x) + \frac{\partial h}{\partial b}(\hat{t}_y - t_y)
+ \frac{\partial h}{\partial c}(\hat{t}_{x^2} - t_{x^2})
+ \frac{\partial h}{\partial d}(\hat{t}_{xy} - t_{xy})
+ \frac{\partial h}{\partial e}(\hat{t}_{y^2} - t_{y^2}) + \frac{\partial h}{\partial f}(\hat{N} - N) \\
&= \frac{1}{N(N-1)S_xS_y}\Bigg[ \left( -t_y + \frac{t_x R S_y}{S_x} \right)(\hat{t}_x - t_x)
+ \left( -t_x + \frac{t_y R S_x}{S_y} \right)(\hat{t}_y - t_y) \\
&\qquad - \frac{N R S_y}{2S_x}(\hat{t}_{x^2} - t_{x^2}) + N(\hat{t}_{xy} - t_{xy})
- \frac{N R S_x}{2S_y}(\hat{t}_{y^2} - t_{y^2}) \\
&\qquad + \left\{ t_{xy} - \frac{R}{2}\left( \frac{t_{y^2} S_x}{S_y} + \frac{t_{x^2} S_y}{S_x} \right) \right\}(\hat{N} - N) \Bigg].
\end{align*}
This is somewhat easier to do in matrix terms. Let
\[
\delta = \left[ -\bar{y}_U + \frac{\bar{x}_U R S_y}{S_x}, \;
-\bar{x}_U + \frac{\bar{y}_U R S_x}{S_y}, \;
-\frac{R S_y}{2S_x}, \;
-\frac{R S_x}{2S_y}, \;
1, \;
\frac{t_{xy}}{N} - \frac{R}{2N}\left( \frac{t_{y^2} S_x}{S_y} + \frac{t_{x^2} S_y}{S_x} \right) \right]^T,
\]
then
\[
V(\hat{R}) \approx \frac{1}{[(N-1)S_xS_y]^2}\,\delta^T \,\mathrm{Cov}(\hat{t})\,\delta.
\]

9.17 Write the function as h(a_1, \ldots, a_L, b_1, \ldots, b_L). Then
\[
\frac{\partial h}{\partial a_l}\bigg|_{t_1, \ldots, t_L, N_1, \ldots, N_L} = 1
\qquad \text{and} \qquad
\frac{\partial h}{\partial b_l}\bigg|_{t_1, \ldots, t_L, N_1, \ldots, N_L} = -\frac{t_l}{N_l}.
\]
Consequently,
\[
h(\hat{t}_1, \ldots, \hat{t}_L, \hat{N}_1, \ldots, \hat{N}_L)
\approx t + \sum_{l=1}^L (\hat{t}_l - t_l) - \sum_{l=1}^L \frac{t_l}{N_l}(\hat{N}_l - N_l)
\]
and
\[
V(\hat{t}_{\text{post}}) \approx V\left[ \sum_{l=1}^L \left( \hat{t}_l - \frac{t_l}{N_l}\hat{N}_l \right) \right].
\]

9.18 From (9.5),


R
1 1 X
V̂2 (µ̂) = (µ̂r ° µ̂)2 .
RR°1
r=1
PR
Without loss of generality, let ȳU = 0. We know that ȳ = r=1 ȳr /R.

Suppose the random groups are independent. Then ȳ1 , . . . , ȳR are independent and
identically distributed random variables with

E[ȳr ] = 0,

S2
V [ȳr ] = E[ȳr2 ] = = ∑2 (ȳ1 ),
m
E[ȳr4 ] = ∑4 (ȳ1 ).

We have
" R
# R
1 X 1 X £ §
E (ȳr ° ȳ)2 = E ȳr2 ° (ȳ)2
R(R ° 1) R(R ° 1)
r=1 r=1
XR
1
= [V (ȳr ) ° V (ȳ)]
R(R ° 1)
r=1
XR ∑ ∏
1 S2 S2
= °
R(R ° 1) m n
r=1
R ∑
X ∏
1 S2 S2
= R °
R(R ° 1) n n
r=1
S2
= .
n
186 CHAPTER 9. VARIANCE ESTIMATION IN COMPLEX SURVEYS

Also,
2( )2 3
XR
E4 (ȳr ° ȳ)2 5
r=1
2( )2 3
XR
= E4 ȳr2 ° Rȳ 2 5
r=1
" R X
R R
#
X X
= E ȳr2 ȳs2 ° 2Rȳ 2
ȳr2 + R ȳ
2 4

r=1 s=1 r=1


2 3
R µ
R X
X X X ∂ X X
2 1
= E4 ȳr2 ȳs2 ° ȳr2 ȳs2 + 2 ȳj ȳk ȳr ȳs 5
R R r s
r=1 s=1 j k
2 3
µ ∂XR µ ∂ XR X R
2 1 2 3
= E4 1° + 2 ȳr4 + 1 ° + 2 ȳr2 ȳs2 5
R R R R
r=1 r=1 s6=r
µ ∂ µ ∂
2 1 2 3
= 1 ° + 2 R∑4 (ȳ1 ) + 1 ° + 2 R(R ° 1)∑22 (ȳ1 )
R R R R

Consequently,
h i
E V̂22 (µ̂)
∑µ ∂ µ ∂ ∏
1 2 1 2 3
= 1 ° + 2 R∑4 (ȳ1 ) + 1 ° + 2 R(R ° 1)∑2 (ȳ1 )
2
R2 (R ° 1)2 R R R R
1 1
= ∑4 (ȳ1 ) + 3 (R2 ° 2R + 3)∑22 (ȳ1 )
R3 R (R ° 1)

and
$$V\left[\frac{1}{R(R-1)}\sum_{r=1}^R(\bar y_r - \bar y)^2\right] = \frac{1}{R^3}\kappa_4(\bar y_1) + \frac{R^2 - 2R + 3}{R^3(R-1)}\kappa_2^2(\bar y_1) - \left(\frac{S^2}{n}\right)^2 = \frac{1}{R^3}\kappa_4(\bar y_1) + \frac{R^2 - 2R + 3}{R^3(R-1)}\left(\frac{S^2}{m}\right)^2 - \left(\frac{S^2}{Rm}\right)^2,$$
so
$$CV^2\left[\frac{1}{R(R-1)}\sum_{r=1}^R(\bar y_r - \bar y)^2\right] = \frac{\dfrac{1}{R^3}\kappa_4(\bar y_1) + \dfrac{R^2 - 2R + 3}{R^3(R-1)}\left(\dfrac{S^2}{m}\right)^2 - \left(\dfrac{S^2}{Rm}\right)^2}{\left(\dfrac{S^2}{Rm}\right)^2} = \frac{1}{R}\left[\frac{\kappa_4(\bar y_1)m^2}{S^4} - \frac{R-3}{R-1}\right].$$
We now need to find $\kappa_4(\bar y_1) = E[\bar y_r^4]$ to finish the problem. A complete argument giving the fourth moment for an SRSWR is given by
Hansen, M. H., Hurwitz, W. N., and Madow, W. G. (1953). Sample Survey Methods and Theory, Volume 2. New York: Wiley, pp. 99-100.
They note that
$$\bar y_r^4 = \frac{1}{m^4}\left[\sum_{i\in S_r} y_i^4 + 4\sum_{i\ne j} y_i^3y_j + 3\sum_{i\ne j} y_i^2y_j^2 + 6\sum_{i\ne j\ne k} y_i^2y_jy_k + \sum_{i\ne j\ne k\ne l} y_iy_jy_ky_l\right]$$
so that
$$\kappa_4(\bar y_1) = E[\bar y_r^4] = \frac{1}{m^3(N-1)}\sum_{i=1}^N (y_i - \bar y_U)^4 + 3\frac{m-1}{m^3}S^4.$$

This results in
$$CV^2\left[\frac{1}{R(R-1)}\sum_{r=1}^R(\bar y_r - \bar y)^2\right] = \frac{1}{R}\left[\frac{\kappa_4(\bar y_1)m^2}{S^4} - \frac{R-3}{R-1}\right] = \frac{1}{R}\left[\frac{\kappa}{m} + 3\frac{m-1}{m} - \frac{R-3}{R-1}\right],$$
where $\kappa = \sum_{i=1}^N (y_i - \bar y_U)^4/[(N-1)S^4]$.
The number of groups, $R$, has more impact on the CV than the group size $m$: the random group estimator of the variance is unstable if $R$ is small.
9.19 First note that
$$\begin{aligned}
\bar y_{str}(\alpha_r) - \bar y_{str}
&= \sum_{h=1}^H \frac{N_h}{N}\bar y_h(\alpha_r) - \sum_{h=1}^H \frac{N_h}{N}\frac{y_{h1} + y_{h2}}{2}\\
&= \sum_{h=1}^H \frac{N_h}{N}\left(\frac{\alpha_{rh}+1}{2}y_{h1} - \frac{\alpha_{rh}-1}{2}y_{h2}\right) - \sum_{h=1}^H \frac{N_h}{N}\frac{y_{h1} + y_{h2}}{2}\\
&= \sum_{h=1}^H \frac{N_h}{N}\alpha_{rh}\frac{y_{h1} - y_{h2}}{2}.
\end{aligned}$$
Then
$$\begin{aligned}
\hat V_{BRR}(\bar y_{str})
&= \frac{1}{R}\sum_{r=1}^R [\bar y_{str}(\alpha_r) - \bar y_{str}]^2\\
&= \frac{1}{R}\sum_{r=1}^R\left[\sum_{h=1}^H \frac{N_h}{N}\alpha_{rh}\frac{y_{h1} - y_{h2}}{2}\right]^2\\
&= \frac{1}{R}\sum_{r=1}^R\sum_{h=1}^H\sum_{\ell=1}^H \alpha_{rh}\alpha_{r\ell}\frac{N_h}{N}\frac{y_{h1} - y_{h2}}{2}\frac{N_\ell}{N}\frac{y_{\ell1} - y_{\ell2}}{2}\\
&= \frac{1}{R}\sum_{r=1}^R\sum_{h=1}^H \alpha_{rh}^2\left(\frac{N_h}{N}\right)^2\frac{(y_{h1} - y_{h2})^2}{4} + \frac{1}{R}\sum_{h=1}^H\sum_{\ell\ne h} \frac{N_h}{N}\frac{y_{h1} - y_{h2}}{2}\frac{N_\ell}{N}\frac{y_{\ell1} - y_{\ell2}}{2}\sum_{r=1}^R \alpha_{rh}\alpha_{r\ell}\\
&= \sum_{h=1}^H\left(\frac{N_h}{N}\right)^2\frac{(y_{h1} - y_{h2})^2}{4}\\
&= \hat V_{str}(\bar y_{str}).
\end{aligned}$$
The last step holds because $\sum_{r=1}^R \alpha_{rh}\alpha_{r\ell} = 0$ for $\ell \ne h$.
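The identity in 9.19 can also be checked numerically. Below is a small Python sketch (the manual itself uses SAS; this is only arithmetic verification) using columns of a 4×4 Hadamard matrix as a balanced set of $\alpha$'s for $H = 3$ strata; the stratum sizes and half-sample values are arbitrary test data.

```python
import numpy as np

# Columns 2-4 of a 4x4 Hadamard matrix: sum_r alpha_rh * alpha_rl = 0 for h != l.
alpha = np.array([[ 1,  1,  1],
                  [-1,  1, -1],
                  [ 1, -1, -1],
                  [-1, -1,  1]])                 # R = 4 replicates, H = 3 strata
assert np.allclose(alpha.T @ alpha, 4 * np.eye(3))   # balance property

Nh = np.array([100.0, 200.0, 300.0])             # arbitrary stratum sizes
N = Nh.sum()
y1 = np.array([3.0, 5.0, 2.0])                   # y_{h1}
y2 = np.array([4.0, 1.0, 6.0])                   # y_{h2}

ybar_str = (Nh / N * (y1 + y2) / 2).sum()
# half-sample mean: picks y_{h1} when alpha_rh = 1 and y_{h2} when alpha_rh = -1
ybar_rep = ((Nh / N) * ((alpha + 1) / 2 * y1 - (alpha - 1) / 2 * y2)).sum(axis=1)

v_brr = ((ybar_rep - ybar_str) ** 2).mean()
v_str = ((Nh / N) ** 2 * (y1 - y2) ** 2 / 4).sum()
assert abs(v_brr - v_str) < 1e-12                # BRR variance = stratified variance
```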
9.20 As noted in the text,
$$\hat V_{str}(\bar y_{str}) = \sum_{h=1}^H\left(\frac{N_h}{N}\right)^2\frac{(y_{h1} - y_{h2})^2}{4}.$$
Also,
$$\hat\theta(\alpha_r) = \bar y_{str}(\alpha_r) = \sum_{h=1}^H \frac{\alpha_{rh}}{2}\frac{N_h}{N}(y_{h1} - y_{h2}) + \bar y_{str},$$
so
$$\hat\theta(\alpha_r) - \hat\theta(-\alpha_r) = \sum_{h=1}^H \alpha_{rh}\frac{N_h}{N}(y_{h1} - y_{h2})$$
and, using the property $\sum_{r=1}^R \alpha_{rh}\alpha_{rk} = 0$ for $k \ne h$,
$$\begin{aligned}
\frac{1}{4R}\sum_{r=1}^R [\hat\theta(\alpha_r) - \hat\theta(-\alpha_r)]^2
&= \frac{1}{4R}\sum_{r=1}^R\sum_{h=1}^H\sum_{k=1}^H \alpha_{rh}\alpha_{rk}\frac{N_h}{N}\frac{N_k}{N}(y_{h1} - y_{h2})(y_{k1} - y_{k2})\\
&= \frac{1}{4R}\sum_{r=1}^R\sum_{h=1}^H\left(\frac{N_h}{N}\right)^2(y_{h1} - y_{h2})^2\\
&= \frac{1}{4}\sum_{h=1}^H\left(\frac{N_h}{N}\right)^2(y_{h1} - y_{h2})^2 = \hat V_{str}(\bar y_{str}).
\end{aligned}$$
Similarly,
$$\begin{aligned}
\frac{1}{2R}\sum_{r=1}^R \{[\hat\theta(\alpha_r) - \hat\theta]^2 + [\hat\theta(-\alpha_r) - \hat\theta]^2\}
&= \frac{1}{2R}\sum_{r=1}^R\left\{\left[\sum_{h=1}^H \frac{\alpha_{rh}}{2}\frac{N_h}{N}(y_{h1} - y_{h2})\right]^2 + \left[\sum_{h=1}^H \frac{-\alpha_{rh}}{2}\frac{N_h}{N}(y_{h1} - y_{h2})\right]^2\right\}\\
&= \frac{1}{2R}\sum_{r=1}^R\sum_{h=1}^H 2\frac{\alpha_{rh}^2}{4}\left(\frac{N_h}{N}\right)^2(y_{h1} - y_{h2})^2\\
&= \frac{1}{4}\sum_{h=1}^H\left(\frac{N_h}{N}\right)^2(y_{h1} - y_{h2})^2.
\end{aligned}$$

9.21 Note that
$$\hat t(\alpha_r) = \sum_{h=1}^H N_h\left[\frac{\alpha_{rh}}{2}(y_{h1} - y_{h2}) + \bar y_h\right] = \sum_{h=1}^H \frac{N_h\alpha_{rh}}{2}(y_{h1} - y_{h2}) + \hat t$$
and
$$[\hat t(\alpha_r)]^2 = \sum_{h=1}^H\sum_{k=1}^H \frac{N_hN_k\alpha_{rh}\alpha_{rk}}{4}(y_{h1} - y_{h2})(y_{k1} - y_{k2}) + 2\hat t\sum_{h=1}^H \frac{N_h\alpha_{rh}}{2}(y_{h1} - y_{h2}) + \hat t^2.$$
Thus,
$$\hat t(\alpha_r) - \hat t(-\alpha_r) = \sum_{h=1}^H N_h\alpha_{rh}(y_{h1} - y_{h2}),$$
$$[\hat t(\alpha_r)]^2 - [\hat t(-\alpha_r)]^2 = 2\hat t\sum_{h=1}^H N_h\alpha_{rh}(y_{h1} - y_{h2}),$$
and
$$\hat\theta(\alpha_r) - \hat\theta(-\alpha_r) = (2a\hat t + b)\sum_{h=1}^H N_h\alpha_{rh}(y_{h1} - y_{h2}).$$
Consequently, using the balanced property $\sum_{r=1}^R \alpha_{rh}\alpha_{rk} = 0$ for $k \ne h$, we have
$$\begin{aligned}
\frac{1}{4R}\sum_{r=1}^R [\hat\theta(\alpha_r) - \hat\theta(-\alpha_r)]^2
&= \frac{1}{4R}(2a\hat t + b)^2\sum_{r=1}^R\sum_{h=1}^H\sum_{k=1}^H N_hN_k\alpha_{rh}\alpha_{rk}(y_{h1} - y_{h2})(y_{k1} - y_{k2})\\
&= \frac{1}{4}(2a\hat t + b)^2\sum_{h=1}^H N_h^2(y_{h1} - y_{h2})^2.
\end{aligned}$$
Using linearization,
$$h(\hat t) \approx h(t) + (2at + b)(\hat t - t),$$
so
$$V_L(\hat\theta) = (2at + b)^2 V(\hat t)$$
and
$$\hat V_L(\hat\theta) = (2a\hat t + b)^2\frac{1}{4}\sum_{h=1}^H N_h^2(y_{h1} - y_{h2})^2,$$
which is the same as $\frac{1}{4R}\sum_{r=1}^R [\hat\theta(\alpha_r) - \hat\theta(-\alpha_r)]^2$.
9.23 We can write
$$\hat t_{post} = g(w, y, x_1, \dots, x_L) = \sum_{l=1}^L N_l\frac{\displaystyle\sum_{j\in S} w_jx_{lj}y_j}{\displaystyle\sum_{j\in S} w_jx_{lj}}.$$
Then,
$$\begin{aligned}
z_i = \frac{\partial g(w, y, x_1, \dots, x_L)}{\partial w_i}
&= \sum_{l=1}^L\left\{\frac{N_lx_{li}y_i}{\displaystyle\sum_{j\in S} w_jx_{lj}} - \frac{N_lx_{li}\displaystyle\sum_{j\in S} w_jx_{lj}y_j}{\left(\displaystyle\sum_{j\in S} w_jx_{lj}\right)^2}\right\}\\
&= \sum_{l=1}^L\left\{\frac{N_lx_{li}y_i}{\hat N_l} - \frac{N_lx_{li}\hat t_{yl}}{\hat N_l^2}\right\}\\
&= \sum_{l=1}^L \frac{N_l}{\hat N_l}x_{li}\left(y_i - \frac{\hat t_{yl}}{\hat N_l}\right).
\end{aligned}$$
Thus,
$$\hat V(\hat t_{post}) = \hat V\left(\sum_{i\in S} w_iz_i\right).$$
Note that this variance estimator differs from the one in Exercise 9.17, although they are asymptotically equivalent.
9.24 From Chapter 5,
$$\begin{aligned}
V(\hat t) &\approx N^2\frac{M\,\mathrm{MSB}}{n}\\
&= \frac{N^2M}{n}\frac{NM-1}{M(N-1)}S^2[1 + (M-1)\mathrm{ICC}]\\
&\approx \frac{NM}{n}\frac{NM}{M}p(1-p)[1 + (M-1)\mathrm{ICC}].
\end{aligned}$$
Consequently, the relative variance $v = V(\hat t)/t^2$ can be written as $\beta_0 + \beta_1/t$, where
$$\beta_0 = -\frac{1}{nM}[1 + (M-1)\mathrm{ICC}]\qquad\text{and}\qquad \frac{\beta_1}{t} = \frac{N}{nt}[1 + (M-1)\mathrm{ICC}].$$
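The decomposition $v = \beta_0 + \beta_1/t$ can be verified by direct arithmetic. A short Python sketch (the parameter values $N, M, n, p,$ ICC are arbitrary choices, used only to check the algebra):

```python
# Check that V(t-hat)/t^2 equals beta0 + beta1/t for Exercise 9.24.
N, M, n, p, ICC = 400, 10, 50, 0.3, 0.2   # arbitrary test values

kappa = 1 + (M - 1) * ICC                 # the common factor [1 + (M-1)ICC]
t = N * M * p                             # population total of a 0/1 variable
v = (N * M) ** 2 * p * (1 - p) * kappa / (n * M) / t ** 2   # relative variance

beta0 = -kappa / (n * M)
beta1 = N * kappa / n
assert abs(v - (beta0 + beta1 / t)) < 1e-12
```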
9.25 (a) From (9.2),
$$\begin{aligned}
V[\hat B] &\approx E\left[\left\{-\frac{t_y}{t_x^2}(\hat t_x - t_x) + \frac{1}{t_x}(\hat t_y - t_y)\right\}^2\right]\\
&= \frac{t_y^2}{t_x^2}E\left[\left\{-\frac{1}{t_x}(\hat t_x - t_x) + \frac{1}{t_y}(\hat t_y - t_y)\right\}^2\right]\\
&= \frac{t_y^2}{t_x^2}\left[\frac{V(\hat t_x)}{t_x^2} + \frac{V(\hat t_y)}{t_y^2} - \frac{2}{t_xt_y}\mathrm{Cov}(\hat t_x, \hat t_y)\right]\\
&= \frac{t_y^2}{t_x^2}\left[\frac{V(\hat t_x)}{t_x^2} + \frac{V(\hat t_y)}{t_y^2} - \frac{2B}{t_xt_y}V(\hat t_x)\right]\\
&= \frac{t_y^2}{t_x^2}\left[\frac{V(\hat t_y)}{t_y^2} - \frac{V(\hat t_x)}{t_x^2}\right].
\end{aligned}$$
Using the fitted model from (9.13),
$$\frac{\hat V(\hat t_x)}{\hat t_x^2} = a + \frac{b}{\hat t_x}\qquad\text{and}\qquad \frac{\hat V(\hat t_y)}{\hat t_y^2} = a + \frac{b}{\hat t_y}.$$
Consequently, substituting estimators for the population quantities,
$$\hat V[\hat B] = \hat B^2\left[a + \frac{b}{\hat t_y} - a - \frac{b}{\hat t_x}\right],$$
which gives the result.

(b) When $B$ is a proportion, $\hat t_y = \hat B\hat t_x$, so
$$\hat V[\hat B] = \hat B^2\left[\frac{b}{\hat t_y} - \frac{b}{\hat t_x}\right] = \hat B^2\left[\frac{b}{\hat B\hat t_x} - \frac{b}{\hat t_x}\right] = \frac{b\hat B(1 - \hat B)}{\hat t_x}.$$
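The simplification in part (b) is pure algebra and can be checked numerically; the values of $a$, $b$, $\hat B$, and $\hat t_x$ below are arbitrary test values:

```python
# Check that B^2[(a + b/ty) - (a + b/tx)] reduces to b*B*(1-B)/tx when ty = B*tx.
a, b = 0.002, 1.5
Bhat, tx = 0.3, 5000.0
ty = Bhat * tx                          # ty = B*tx when B is a proportion

general = Bhat**2 * ((a + b / ty) - (a + b / tx))   # part (a) formula
proportion = b * Bhat * (1 - Bhat) / tx             # part (b) simplification
assert abs(general - proportion) < 1e-12
```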
Chapter 10

Categorical Data Analysis in Complex Surveys
10.1 Many data sets used for chi-square tests in introductory statistics books use
dependent data. See Alf and Lohr (2007) for a review of how books ignore clustering
in the data.
10.3 (a) Observed and expected (in parentheses) proportions are given in the following table:

                      Abuse
                      No        Yes
Symptom   No          .7542     .1017
                      (.7109)   (.1451)
          Yes         .0763     .0678
                      (.1196)   (.0244)

(b)
$$X^2 = 118\left[\frac{(.7542 - .7109)^2}{.7109} + \cdots + \frac{(.0678 - .0244)^2}{.0244}\right] = 12.8$$
$$G^2 = 2(118)\left[.7542\ln\left(\frac{.7542}{.7109}\right) + \cdots + .0678\ln\left(\frac{.0678}{.0244}\right)\right] = 10.3.$$
Both p-values are less than .002.
Because the expected count in the Yes-Yes cell is small, we also perform Fisher’s
exact test, which gives p-value .0016.
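The arithmetic in part (b) can be reproduced directly from the tabled proportions; here is a short Python check (Python is used only for verification, since the manual's own code is SAS):

```python
import math

# Recompute X^2 and G^2 for 10.3(b) from the observed/expected proportions, n = 118.
n = 118
obs = [0.7542, 0.1017, 0.0763, 0.0678]   # observed cell proportions
exp = [0.7109, 0.1451, 0.1196, 0.0244]   # expected proportions under independence

X2 = n * sum((o - e) ** 2 / e for o, e in zip(obs, exp))
G2 = 2 * n * sum(o * math.log(o / e) for o, e in zip(obs, exp))
assert abs(X2 - 12.8) < 0.1
assert abs(G2 - 10.3) < 0.1
```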
10.4 (a) This is a test of independence. A sample of students is taken, and each student is classified by instructor and grade.
(b) $X^2 = 34.8$. Comparing this to a $\chi^2_3$ distribution, we see that the p-value is less than 0.0001. A similar conclusion follows from the likelihood ratio test, with $G^2 = 34.5$.
(c) Students are probably not independent; most likely, a cluster sample of students was taken, with the Math II classes as the clusters. The p-values in part (b) are thus lower than they should be.
10.5 The following table gives the value of µ̂ for the 7 random groups:

Random Group µ̂
1 0.0132
2 0.0147
3 0.0252
4 -0.0224
5 0.0073
6 -0.0057
7 0.0135
Average 0.0065
std. dev. 0.0158
Using the random group method, the standard error of $\hat\theta$ is $0.0158/\sqrt{7} = 0.0060$, so the test statistic is
$$\frac{\hat\theta^2}{\hat V(\hat\theta)} = 0.79.$$
Since our estimate of the variance from the random group method has only 6 df, we compare the test statistic to an $F(1,6)$ distribution rather than to a $\chi^2_1$ distribution, obtaining a p-value of 0.4.
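The summary statistics for the seven random groups can be reproduced as follows (a Python verification sketch; the manual's analyses are in SAS):

```python
import math

# Recompute the mean, standard deviation, and random-group SE for Exercise 10.5.
theta_r = [0.0132, 0.0147, 0.0252, -0.0224, 0.0073, -0.0057, 0.0135]
R = len(theta_r)
mean = sum(theta_r) / R
sd = math.sqrt(sum((t - mean) ** 2 for t in theta_r) / (R - 1))
se = sd / math.sqrt(R)                  # standard error of theta-hat
assert abs(mean - 0.0065) < 1e-4
assert abs(sd - 0.0158) < 1e-4
assert abs(se - 0.0060) < 1e-4
```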
10.6 (a) The contingency table (for complete data) is as follows:
                          Break again?
                          No    Yes
Faculty                   65    167    232
Classified staff          55    459    514
Administrative staff      11     75     86
Academic professional      9     58     67
                         140    759    899

$X_p^2 = 37.3$; comparing to a $\chi^2_3$ distribution gives p-value < .0001. We can use the $\chi^2$ test for homogeneity because we assume product-multinomial sampling. (Class is the stratification variable.)
(b) Using the weights (with the respondents who answer both questions), we estimate the probabilities as

                    Work
                    No      Yes
Break     No        0.0832  0.0859  0.1691
again?    Yes       0.6496  0.1813  0.8309
                    0.7328  0.2672  1.0000

To estimate the proportion in the Yes-Yes cell, I used:
$$\hat p_{yy} = \frac{\text{sum of weights of persons answering yes to both questions}}{\text{sum of weights of respondents to both questions}}.$$
Other answers are possible, depending on how you want to treat the nonresponse.
(c) The odds ratio, calculated using the table in part (b), is
$$\frac{0.0832/0.0859}{0.6496/0.1813} = 0.27065.$$
(Or, you could get $1/0.27065 = 3.695$.)
The estimated proportions ignoring the weights are

                    Work
                    No      Yes
Break     No        0.0850  0.0671  0.1521
again?    Yes       0.6969  0.1510  0.8479
                    0.7819  0.2181  1.0000

Without weights the odds ratio is
$$\frac{0.0850/0.0671}{0.6969/0.1510} = 0.27448$$
(or, $1/0.27448 = 3.643$).
Weights appear to make little difference in the value of the odds ratio.
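The odds ratios and the $\hat\theta$ of part (d) can be recomputed from the rounded cell proportions above (a Python check; small discrepancies against the manual's values are rounding in the tabled proportions):

```python
# Recompute the odds ratios of 10.6(c) and theta-hat of 10.6(d).
p11, p12, p21, p22 = 0.0832, 0.0859, 0.6496, 0.1813     # weighted table
or_w = (p11 / p12) / (p21 / p22)
theta = p11 * p22 - p21 * p12                           # part (d)

u11, u12, u21, u22 = 0.0850, 0.0671, 0.6969, 0.1510     # unweighted table
or_u = (u11 / u12) / (u21 / u22)

assert abs(or_w - 0.27065) < 0.001
assert abs(or_u - 0.27448) < 0.0001
assert abs(theta - (-0.04068)) < 0.0001
```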
(d) $\hat\theta = (.0832)(.1813) - (.6496)(.0859) = -0.04068.$
(e) Using linearization, define
$$q_i = \hat p_{22}y_{11i} + \hat p_{11}y_{22i} - \hat p_{12}y_{21i} - \hat p_{21}y_{12i},$$
where $y_{jki}$ is an indicator variable for membership in cell $(j,k)$. We then estimate $V(\bar q_{str})$ using the usual methods for stratified samples. Using the summary statistics,

Stratum    N_h    n_h   q̄_h     s_h²     (N_h/N)²(1 − n_h/N_h)s_h²/n_h
Faculty    1374   228   −.117   0.0792   4.04 × 10⁻⁵
C.S.       1960   514   −.059   0.0111   4.52 × 10⁻⁶
A.S.        252    86   −.061   0.0207   7.42 × 10⁻⁷
A.P.         95    66   −.076   0.0349   1.08 × 10⁻⁷
Total      3681   894                    4.58 × 10⁻⁵
Thus $\hat V_L(\hat\theta) = 4.58 \times 10^{-5}$ and
$$X_W^2 = \frac{\hat\theta^2}{\hat V_L(\hat\theta)} = \frac{0.00165}{4.58\times 10^{-5}} = 36.2.$$
We reject the null hypothesis with p-value < 0.0001.
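The variance total and Wald statistic in part (e) follow directly from the stratum summaries; here is a Python verification sketch using the tabled $(N_h, n_h, s_h^2)$ values:

```python
# Recompute V_L(theta-hat) and the Wald statistic for 10.6(e); N = 3681.
strata = [(1374, 228, 0.0792), (1960, 514, 0.0111),
          (252, 86, 0.0207), (95, 66, 0.0349)]   # (N_h, n_h, s_h^2)
N = 3681
vL = sum((Nh / N) ** 2 * (1 - nh / Nh) * sh2 / nh for Nh, nh, sh2 in strata)
theta = 0.0832 * 0.1813 - 0.6496 * 0.0859
XW2 = theta ** 2 / vL
assert abs(vL - 4.58e-5) < 0.05e-5
assert abs(XW2 - 36.2) < 0.5
```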
10.7 Answers will vary, depending on how the categories for zprep are formed.
10.8 (a) Under the null hypothesis of independence the expected proportions are:

                             Fitness Level
                      Recommended   Minimum Acceptable   Unacceptable
Smoking   Current     .241          .140                 .159
Status    Occasional  .020          .011                 .013
          Never       .186          .108                 .123

Using (10.2) and (10.3),
$$X^2 = n\sum_{i=1}^r\sum_{j=1}^c \frac{(\hat p_{ij} - \hat p_{i+}\hat p_{+j})^2}{\hat p_{i+}\hat p_{+j}} = 18.25$$
$$G^2 = 2n\sum_{i=1}^r\sum_{j=1}^c \hat p_{ij}\ln\left(\frac{\hat p_{ij}}{\hat p_{i+}\hat p_{+j}}\right) = 18.25$$

Comparing each statistic to a $\chi^2_4$ distribution gives p-value = .001.

(b) Using (10.9), $E[X^2] \approx E[G^2] \approx 6.84$.

(c) $X_F^2 = G_F^2 = \dfrac{4X^2}{6.84} = 10.7$, with p-value = .03 (comparing to a $\chi^2_4$ distribution).
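The first-order design-effect correction in (c) scales the Pearson statistic by df$/E[X^2]$, where df $= (r-1)(c-1) = 4$; a one-line Python check:

```python
# First-order Rao-Scott correction for 10.8(c).
X2 = 18.25          # Pearson statistic from part (a)
df = 4              # (r-1)(c-1)
EX2 = 6.84          # estimated E[X^2] from (10.9)
XF2 = df * X2 / EX2
assert abs(XF2 - 10.7) < 0.05
```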
10.9 (a) Under the null hypothesis of independence, the expected proportions are:

                               Males   Females
Decision-making managers       0.076   0.065
Advisor-managers               0.018   0.016
Supervisors                    0.064   0.054
Semi-autonomous workers        0.103   0.087
Workers                        0.279   0.238

Using (10.2) and (10.3),
$$X^2 = n\sum_{i=1}^r\sum_{j=1}^c \frac{(\hat p_{ij} - \hat p_{i+}\hat p_{+j})^2}{\hat p_{i+}\hat p_{+j}} = 55.1$$
$$G^2 = 2n\sum_{i=1}^r\sum_{j=1}^c \hat p_{ij}\ln\left(\frac{\hat p_{ij}}{\hat p_{i+}\hat p_{+j}}\right) = 56.6$$

Comparing each statistic to a $\chi^2$ distribution with $(2-1)(5-1) = 4$ df gives "p-values" that are less than $1\times 10^{-9}$.

(b) Using (10.9), we have $E[X^2] \approx E[G^2] \approx 4.45$.

(c) df = (number of psus) − (number of strata) = 34.

(d)
$$X_F^2 = \frac{4X^2}{4.45} = .899X^2 = 49.5,\qquad G_F^2 = \frac{4G^2}{4.45} = 50.8.$$
The p-values for these statistics are still small, less than $1.0\times 10^{-9}$.
(e) The p-value for $X_S^2$ is $2.6\times 10^{-8}$, still very small.
10.11 Here is SAS code and output:

options ovp nocenter ls=85;


filename nhanes ’C:\nhanes.csv’;

data nhanes;
infile nhanes delimiter=’,’ firstobs=2;
input sdmvstra sdmvpsu wtmec2yr age ridageyr riagendr ridreth2
dmdeduc indfminc bmxwt bmxbmi bmxtri
bmxwaist bmxthicr bmxarml;
bmiclass = .;
if 0 < bmxbmi and bmxbmi < 25 then bmiclass = 1;
else if bmxbmi >= 25 and bmxbmi < 30 then bmiclass = 2;
else if bmxbmi >= 30 then bmiclass = 3;
if age < 30 then ageclass = 1;
else if age >= 30 then ageclass = 2;
label age = "Age at Examination (years)"
riagendr = "Gender"
ridreth2 = "Race/Ethnicity"
dmdeduc = "Education Level"
indfminc = "Family income"
bmxwt = "Weight (kg)"
bmxbmi = "Body mass index"
bmxtri = "Triceps skinfold (mm)"
bmxwaist = "Waist circumference (cm)"
bmxthicr = "Thigh circumference (cm)"
bmxarml = "Upper arm length (cm)";
run;

proc surveyfreq data=nhanes ;


stratum sdmvstra;
cluster sdmvpsu;

weight wtmec2yr;
tables bmiclass*ageclass/chisq deff;
run;

Table of bmiclass by ageclass

Weighted Std Dev of Std Err of Design


bmiclass ageclass Frequency Frequency Wgt Freq Percent Percent Effect
------------------------------------------------------------------------------------
1 1 881 9566761 716532 6.0302 0.3131 0.8572
2 75 2686500 495324 1.6934 0.2434 1.7639

Total 956 12253261 1083245 7.7236 0.3951 1.0855


------------------------------------------------------------------------------------
2 1 788 19494074 1615408 12.2878 0.6689 2.0573
2 1324 57089320 4300350 35.9853 1.2334 3.2727

Total 2112 76583394 5452762 48.2731 1.2969 3.3381


------------------------------------------------------------------------------------
3 1 627 15269676 1538135 9.6250 0.6261 2.2338
2 1262 54539886 4631512 34.3783 1.2349 3.3499

Total 1889 69809562 5816243 44.0033 1.4066 3.9794


------------------------------------------------------------------------------------
Total 1 2296 44330511 3336974 27.9430 1.0011 2.4666
2 2661 114315706 8538003 72.0570 1.0011 2.4666

Total 4957 158646217 11335366 100.000


------------------------------------------------------------------------------------
Frequency Missing = 4686

Rao-Scott Chi-Square Test

Pearson Chi-Square 525.1560


Design Correction 1.6164

Rao-Scott Chi-Square 324.8848


DF 2
Pr > ChiSq <.0001

F Value 162.4424
Num DF 2
Den DF 30
Pr > F <.0001

There is strong evidence of an association.


10.12 Here is SAS code and output:

data ncvs;
infile ncvs delimiter = "," firstobs=2;
input age married sex race hispanic hhinc away employ numinc

violent injury medtreat medexp robbery assault


pweight pstrat ppsu;
if violent > 0 then isviol = 1;
else if violent = 0 then isviol = 0;
run;

proc surveyfreq data=ncvs;


stratum pstrat ;
cluster ppsu;
weight pweight;
tables isviol*sex/chisq;
run;

Table of isviol by sex

Weighted Std Dev of Std Err of


isviol sex Frequency Frequency Wgt Freq Percent Percent
----------------------------------------------------------------------------
0 0 36120 107752330 1969996 47.6830 0.1620
1 42161 115149882 1906207 50.9566 0.1638

Total 78281 222902212 3813397 98.6396 0.0663


----------------------------------------------------------------------------
1 0 558 1746669 101028 0.7729 0.0445
1 441 1327436 81463 0.5874 0.0341

Total 999 3074105 155677 1.3604 0.0663


----------------------------------------------------------------------------
Total 0 36678 109498999 1985935 48.4560 0.1597
1 42602 116477318 1930795 51.5440 0.1597

Total 79280 225976317 3853581 100.000


----------------------------------------------------------------------------
Frequency Missing = 80

Rao-Scott Chi-Square Test

Pearson Chi-Square 30.6160


Design Correction 1.0466

Rao-Scott Chi-Square 29.2529


DF 1
Pr > ChiSq <.0001

F Value 29.2529
Num DF 1
Den DF 143
Pr > F <.0001

There is strong evidence that males are more likely to be victims of violent crime than females.
10.13 This test statistic does not in general give correct p-values for data from a complex survey. It ensures that the sum of the "observed" counts is $n$ but does not adjust for stratification or clustering.
To see this, note that for the data in Example 10.4, the proposed test statistic is the same as $X^2$ because all weights are equal. But in that example $X^2/2$, not $X^2$, has a null $\chi^2_1$ distribution because of the clustering.
10.14 (a) For the Wald test,
$$\theta = p_{11}p_{22} - p_{12}p_{21}$$
and
$$\hat\theta = \hat p_{11}\hat p_{22} - \hat p_{12}\hat p_{21}.$$
Then, using Taylor linearization,
$$\hat\theta \approx \theta + p_{22}(\hat p_{11} - p_{11}) + p_{11}(\hat p_{22} - p_{22}) - p_{12}(\hat p_{21} - p_{21}) - p_{21}(\hat p_{12} - p_{12})$$
and
$$\begin{aligned}
V_L(\hat\theta) &= V[p_{22}\hat p_{11} + p_{11}\hat p_{22} - p_{12}\hat p_{21} - p_{21}\hat p_{12}]\\
&= p_{22}^2V(\hat p_{11}) + p_{11}^2V(\hat p_{22}) + p_{12}^2V(\hat p_{21}) + p_{21}^2V(\hat p_{12})\\
&\quad + 2p_{11}p_{22}\mathrm{Cov}(\hat p_{11}, \hat p_{22}) - 2p_{22}p_{12}\mathrm{Cov}(\hat p_{11}, \hat p_{21})\\
&\quad - 2p_{22}p_{21}\mathrm{Cov}(\hat p_{11}, \hat p_{12}) - 2p_{11}p_{12}\mathrm{Cov}(\hat p_{22}, \hat p_{21})\\
&\quad - 2p_{11}p_{21}\mathrm{Cov}(\hat p_{22}, \hat p_{12}) + 2p_{12}p_{21}\mathrm{Cov}(\hat p_{21}, \hat p_{12}).
\end{aligned}$$
To estimate $V_L(\hat\theta)$, define
$$y_{jki} = \begin{cases}1 & \text{if unit } i \text{ in cell } (j,k)\\ 0 & \text{otherwise}\end{cases}$$
for $j,k \in \{1,2\}$ and let
$$q_i = \hat p_{22}y_{11i} + \hat p_{11}y_{22i} - \hat p_{12}y_{21i} - \hat p_{21}y_{12i}.$$
Then
$$\hat V_L(\hat\theta) = \hat V(\hat{\bar q}).$$

(b) For multinomial sampling,
$$\begin{aligned}
V_L(\hat\theta) &= p_{22}^2\frac{p_{11}(1 - p_{11})}{n} + \cdots + p_{21}^2\frac{p_{12}(1 - p_{12})}{n}\\
&\quad - 2\frac{p_{11}^2p_{22}^2}{n} + 2\frac{p_{11}p_{22}p_{12}p_{21}}{n} + \cdots - 2\frac{p_{12}^2p_{21}^2}{n}\\
&= \frac{1}{n}\{-4p_{11}^2p_{22}^2 - 4p_{12}^2p_{21}^2 + 8p_{11}p_{22}p_{12}p_{21}\\
&\qquad\quad + p_{11}p_{22}(p_{11} + p_{22}) + p_{12}p_{21}(p_{12} + p_{21})\}.
\end{aligned}$$
Under $H_0: p_{11}p_{22} = p_{12}p_{21}$,
$$V_L(\hat\theta) = \frac{1}{n}p_{11}p_{22} = \frac{1}{n}p_{12}p_{21} = \frac{1}{n}p_{1+}p_{+1}p_{2+}p_{+2}$$
and
$$\frac{1}{p_{11}} + \frac{1}{p_{12}} + \frac{1}{p_{21}} + \frac{1}{p_{22}} = \frac{p_{11} + p_{22}}{p_{11}p_{22}} + \frac{p_{21} + p_{12}}{p_{12}p_{21}} = \frac{1}{p_{11}p_{22}}.$$
Thus, if $H_0$ is true,
$$\frac{1}{V_L(\hat\theta)} = n\left(\frac{1}{p_{11}} + \frac{1}{p_{12}} + \frac{1}{p_{21}} + \frac{1}{p_{22}}\right) = n\left(\frac{1}{p_{1+}p_{+1}} + \frac{1}{p_{1+}p_{+2}} + \frac{1}{p_{2+}p_{+1}} + \frac{1}{p_{2+}p_{+2}}\right).$$
Also note that, for any $j,k \in \{1,2\}$,
$$\hat\theta^2 = (\hat p_{11}\hat p_{22} - \hat p_{12}\hat p_{21})^2 = (\hat p_{jk} - \hat p_{j+}\hat p_{+k})^2.$$
Thus, estimating $V_L(\hat\theta)$ under $H_0$,
$$X_W^2 = n\sum_{j=1}^2\sum_{k=1}^2 \frac{(\hat p_{jk} - \hat p_{j+}\hat p_{+k})^2}{\hat p_{j+}\hat p_{+k}} = X_p^2.$$
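The $H_0$ simplifications above are algebraic identities and can be checked numerically; the cell probabilities below are arbitrary but constructed to satisfy independence (so $H_0$ holds):

```python
# Check the H0 reductions in 10.14(b): the bracketed expression equals p11*p22,
# and the sum of reciprocals equals 1/(p11*p22).
prow, pcol = [0.4, 0.6], [0.3, 0.7]
p11, p12 = prow[0] * pcol[0], prow[0] * pcol[1]
p21, p22 = prow[1] * pcol[0], prow[1] * pcol[1]   # independent => p11*p22 = p12*p21

bracket = (-4 * p11**2 * p22**2 - 4 * p12**2 * p21**2
           + 8 * p11 * p22 * p12 * p21
           + p11 * p22 * (p11 + p22) + p12 * p21 * (p12 + p21))
assert abs(bracket - p11 * p22) < 1e-12
assert abs(sum(1 / p for p in (p11, p12, p21, p22)) - 1 / (p11 * p22)) < 1e-9
```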

10.15 (a) We can rewrite
$$\theta = \log(p_{11}) + \log(p_{22}) - \log(p_{12}) - \log(p_{21}).$$
Then, using Taylor linearization,
$$\hat\theta \approx \theta + \frac{\hat p_{11} - p_{11}}{p_{11}} + \frac{\hat p_{22} - p_{22}}{p_{22}} - \frac{\hat p_{12} - p_{12}}{p_{12}} - \frac{\hat p_{21} - p_{21}}{p_{21}}$$
and
$$\begin{aligned}
V_L(\hat\theta) &= V\left(\frac{\hat p_{11}}{p_{11}} + \frac{\hat p_{22}}{p_{22}} - \frac{\hat p_{12}}{p_{12}} - \frac{\hat p_{21}}{p_{21}}\right)\\
&= \sum_{j=1}^2\sum_{k=1}^2 \frac{1}{p_{jk}^2}V(\hat p_{jk})\\
&\quad + \frac{2}{p_{11}p_{22}}\mathrm{Cov}(\hat p_{11}, \hat p_{22}) - \frac{2}{p_{11}p_{12}}\mathrm{Cov}(\hat p_{11}, \hat p_{12})\\
&\quad - \frac{2}{p_{11}p_{21}}\mathrm{Cov}(\hat p_{11}, \hat p_{21}) - \frac{2}{p_{12}p_{22}}\mathrm{Cov}(\hat p_{22}, \hat p_{12})\\
&\quad - \frac{2}{p_{22}p_{21}}\mathrm{Cov}(\hat p_{22}, \hat p_{21}) + \frac{2}{p_{12}p_{21}}\mathrm{Cov}(\hat p_{12}, \hat p_{21}).
\end{aligned}$$

(b) Under multinomial sampling,
$$V_L(\hat\theta) = \sum_{j=1}^2\sum_{k=1}^2 \frac{p_{jk}(1 - p_{jk})}{np_{jk}^2} + \frac{4}{n} = \frac{1}{n}\left(\frac{1}{p_{11}} + \frac{1}{p_{12}} + \frac{1}{p_{21}} + \frac{1}{p_{22}}\right)$$
and
$$\hat V_L(\hat\theta) = \frac{1}{n}\left(\frac{1}{\hat p_{11}} + \frac{1}{\hat p_{12}} + \frac{1}{\hat p_{21}} + \frac{1}{\hat p_{22}}\right) = \frac{1}{x_{11}} + \frac{1}{x_{12}} + \frac{1}{x_{21}} + \frac{1}{x_{22}}.$$
This is the estimated variance given in Section 10.1.1.
10.17 In a multinomial sample, all design effects are 1. From (10.9), under $H_0$,
$$E[X^2] = \sum_{i=1}^r\sum_{j=1}^c (1 - p_{ij}) - \sum_{i=1}^r (1 - p_{i+}) - \sum_{j=1}^c (1 - p_{+j}) = rc - 1 - (r-1) - (c-1) = (r-1)(c-1).$$

10.18 (a) Write
$$\mathbf Y^T\mathbf C\mathbf Y = \mathbf Y^T\boldsymbol\Sigma^{-1/2}\boldsymbol\Sigma^{1/2}\mathbf C\boldsymbol\Sigma^{1/2}\boldsymbol\Sigma^{-1/2}\mathbf Y.$$
Since $\mathbf C$ is symmetric and positive definite, so is $\boldsymbol\Sigma^{1/2}\mathbf C\boldsymbol\Sigma^{1/2}$, and we can write $\boldsymbol\Sigma^{1/2}\mathbf C\boldsymbol\Sigma^{1/2} = \mathbf P\boldsymbol\Lambda\mathbf P^T$ for an orthogonal matrix $\mathbf P$ and diagonal matrix $\boldsymbol\Lambda$, where each diagonal entry of $\boldsymbol\Lambda$ is positive. Let $\mathbf U = \mathbf P^T\boldsymbol\Sigma^{-1/2}\mathbf Y$; then
$$\mathbf Y^T\mathbf C\mathbf Y = \mathbf U^T\boldsymbol\Lambda\mathbf U = \sum_{i=1}^k \lambda_iU_i^2.$$
Since $\mathbf Y \sim N(\mathbf 0, \boldsymbol\Sigma)$, $\mathbf U \sim N(\mathbf 0, \mathbf P^T\boldsymbol\Sigma^{-1/2}\boldsymbol\Sigma\boldsymbol\Sigma^{-1/2}\mathbf P) = N(\mathbf 0, \mathbf I)$, so $W_i = U_i^2 \sim \chi^2_1$ and the $W_i$'s are independent.
(b) Using a central limit theorem for survey sampling, we know that $\mathbf V(\hat{\boldsymbol\theta})^{-1/2}(\hat{\boldsymbol\theta} - \boldsymbol\theta)$ has an asymptotic $N(\mathbf 0, \mathbf I)$ distribution under $H_0: \boldsymbol\theta = \mathbf 0$. Using part (a), then,
$$\hat{\boldsymbol\theta}^T\mathbf A^{-1}\hat{\boldsymbol\theta} = \hat{\boldsymbol\theta}^T\mathbf V(\hat{\boldsymbol\theta})^{-1/2}\mathbf V(\hat{\boldsymbol\theta})^{1/2}\mathbf A^{-1}\mathbf V(\hat{\boldsymbol\theta})^{1/2}\mathbf V(\hat{\boldsymbol\theta})^{-1/2}\hat{\boldsymbol\theta}$$
has the same asymptotic distribution as $\sum \lambda_iW_i$, where the $\lambda_i$'s are the eigenvalues of
$$\mathbf V(\hat{\boldsymbol\theta})^{1/2}\mathbf A^{-1}\mathbf V(\hat{\boldsymbol\theta})^{1/2}.$$
(c)
$$E[\hat{\boldsymbol\theta}^T\mathbf A^{-1}\hat{\boldsymbol\theta}] \approx \sum \lambda_i,\qquad V[\hat{\boldsymbol\theta}^T\mathbf A^{-1}\hat{\boldsymbol\theta}] \approx 2\sum \lambda_i^2,$$
since $E[W_i] = 1$ and $V[W_i] = 2$.
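A consequence of part (a) worth checking numerically is that $E[\mathbf Y^T\mathbf C\mathbf Y] = \mathrm{tr}(\mathbf C\boldsymbol\Sigma) = \sum\lambda_i$. Here is a Python sketch with arbitrary symmetric positive definite test matrices:

```python
import numpy as np

# Verify that the eigenvalues of Sigma^{1/2} C Sigma^{1/2} are positive and
# sum to tr(C Sigma), which is E[Y'CY] for Y ~ N(0, Sigma).
rng = np.random.default_rng(0)
k = 4
A = rng.standard_normal((k, k))
Sigma = A @ A.T + k * np.eye(k)        # SPD covariance matrix
B = rng.standard_normal((k, k))
C = B @ B.T + k * np.eye(k)            # SPD quadratic-form matrix

# Symmetric square root of Sigma via its eigendecomposition
vals, vecs = np.linalg.eigh(Sigma)
Sig_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T

lam = np.linalg.eigvalsh(Sig_half @ C @ Sig_half)
assert np.all(lam > 0)
assert abs(lam.sum() - np.trace(C @ Sigma)) < 1e-8
```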

10.19 This sample is self-weighting, so the estimated cell probabilities are:

          S       N
Male      .3028   .1056   .4084
Female    .2254   .3662   .5916
          .5282   .4718   1.0000

The variances under the (incorrect) assumption of multinomial sampling are:

          S         N
Male      .001487   .000665   .001701
Female    .001229   .001634   .001701
          .001755   .001755

We use the information in Table 10.1 to find the estimated variances for each cell and margin using the cluster sample. For the schizophrenic males, define
$$t_{SMi} = \text{number of schizophrenic males in psu } i,$$
and similarly for the other cells. Then we use equations (5.4)-(5.6) to estimate the mean and variance for each cell. We have the following frequency data:

Freq.      SM      SF      NM      NF      M       F       S       N
0          41      45      58      34      30      17      24      28
1          17      20      11      22      24      24      19      19
2          13       6       2      15      17      30      28      24
ȳ̂         .3028   .2254   .1056   .3662   .4085   .5915   .5282   .4718
V̂(ȳ̂)     .0022   .0015   .0008   .0022   .0022   .0022   .0026   .0026

Thus the estimated variances using the clustering are

          S         N
Male      .002161   .000796   .002244
Female    .001488   .002209   .002244
          .002604   .002604

and the estimated design effects are

          S       N
Male      1.453   1.197   1.319
Female    1.210   1.352   1.319
          1.484   1.484

Using equation (10.9), $E[X^2]$ is estimated by
$$\sum_{i=1}^2\sum_{j=1}^2 (1 - \hat p_{ij})d_{ij} - \sum_{i=1}^2 (1 - \hat p_{i+})d_i^R - \sum_{j=1}^2 (1 - \hat p_{+j})d_j^C = 1.075.$$
Then
$$X_F^2 = \frac{X^2}{1.07} = \frac{17.89}{1.07} = 16.7$$
with p-value < 0.0001.
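The estimate of $E[X^2]$ follows directly from the tabled proportions and design effects; here is a Python verification sketch:

```python
# Recompute the estimated E[X^2] in 10.19 from cell proportions p, cell design
# effects d, and the row/column margin proportions and design effects.
p = {(1, 1): .3028, (1, 2): .1056, (2, 1): .2254, (2, 2): .3662}
d = {(1, 1): 1.453, (1, 2): 1.197, (2, 1): 1.210, (2, 2): 1.352}
prow, drow = [.4084, .5916], [1.319, 1.319]
pcol, dcol = [.5282, .4718], [1.484, 1.484]

EX2 = (sum((1 - p[jk]) * d[jk] for jk in p)
       - sum((1 - pi) * di for pi, di in zip(prow, drow))
       - sum((1 - pj) * dj for pj, dj in zip(pcol, dcol)))
XF2 = 17.89 / EX2
assert abs(EX2 - 1.075) < 0.01
assert abs(XF2 - 16.7) < 0.2
```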
10.20 Both statistics are very large. The Rao-Scott $\chi^2$ statistic is 2721, while the Wald test statistic is 4838. There is strong evidence that the variables are associated.
Chapter 11

Regression with Complex Survey Data

11.3 The average score for the students planning a trip is $\bar y_1 = 77.158730$ and the average score for the students not planning a trip is $\bar y_2 = 61.887218$. Using SAS PROC SURVEYREG, we get $\bar y_1 - \bar y_2 = 15.27$ with 95% CI [7.6247634, 22.9182608]. Since 0 is not in the CI, there is evidence that the domain means differ.
11.4 (a) From SAS, the fitted regression line for the truncated data set is

data anthrop;
infile anthrop firstobs=2 delimiter=",";
input finger height ;
one = 1;
run;

proc sort data=anthrop;


by height;

data anthrop1; /* Keep the lowest 2000 values in the data set */
set anthrop;
if _N_ <= 2000;
run;

proc reg data = anthrop1;


model height = finger;
output out=regpred pred=predicted residual=resid;
run;

goptions reset=all;
goptions colors = (black);


axis1 label=(’Left Middle Finger Length (cm)’)


order = (10 to 13.5 by .5);
axis2 label=(angle=90 ’Height (inches)’) order=(55 to 75 by 5);
axis3 order=(55 to 75 by 5) major=none minor=none value=none;
symbol interpol=join width=2 color = black;

proc sort data=regpred;


by finger height;
run;

proc means data=regpred noprint;


by finger height;
var one predicted resid;
output out=circlepred sum=sumn sumpred sumresid
mean = meanone meanpred meanresid;
run;

proc gplot data=circlepred;


bubble height*finger=sumn/haxis=axis1 vaxis=axis2;
plot2 meanpred*finger/haxis=axis1 vaxis =axis3 ;
run;

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 43.78957 0.80553 54.36 <.0001


finger 1 1.78861 0.07094 25.21 <.0001

The line is much flatter than the one in Figure 11.4.



(b) We use exactly the same code as before, except now we sort the data by finger
instead of by height.

proc sort data=anthrop;


by finger;

data anthrop2;
set anthrop;
if _N_ <= 2000;
run;

proc reg data = anthrop2;


model height = finger;
output out=regpred pred=predicted residual=resid;
run;

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 31.19794 1.33050 23.45 <.0001


finger 1 2.96393 0.11821 25.07 <.0001

These values of the slope and intercept are quite close to the values given in Figure 11.4. The standard errors are larger, however, reflecting the smaller number of data points and the reduced spread of the x's.

(c) Regression uses conditional expectation, conditional on x. Thus, if the model


holds for all the observations, you should get unbiased estimates of the parameters
if you take a subset of the data using x to define the mechanism.
11.5 We obtain $\hat B_0 = 14.2725$ and $\hat B_1 = 0.08138$. Using equation (11.8), with $q_i = (y_i - \hat B_0 - \hat B_1x_i)(x_i - \hat{\bar x})$ and $\hat{\bar x} = \sum w_ix_i/\sum w_i = 180.541$, we have
$$\hat V_L(\hat B_1) = \hat V\left(\sum w_iq_i\right)\bigg/\left[\sum w_ix_i^2 - \frac{\left(\sum w_ix_i\right)^2}{\sum w_i}\right]^2 = \hat V\left(\frac{\sum w_iq_i}{\sum w_ix_i^2 - \left(\sum w_ix_i\right)^2/\sum w_i}\right) = 0.000261,$$
$$SE_L(\hat B_1) = .016.$$
Here is output from SAS:

data nybight;
infile nybight delimiter="," firstobs=2;
input year stratum catchnum catchwt numspp depth temp ;
if stratum = 1 or stratum = 2 then relwt = 1;
else if (stratum ge 3 and stratum le 6) then relwt = 2;
if year = 1974;
run;

proc surveyreg data=nybight;


weight relwt;
stratum stratum;
model catchwt = catchnum;
run;

Estimated Regression Coefficients

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 14.2725002 2.13426397 6.69 <.0001


catchnum 0.0813836 0.01612252 5.05 <.0001

11.7 Using the weights as in Exercise 11.4, the estimated regression coefficients are $\hat B_0 = 7.569$ and $\hat B_1 = 0.0778$. From equation (11.8), $\hat V_L(\hat B_1) = 0.068$ (alternatively, $\hat V_{JK}(\hat B_1) = 0.070$). The slope is not significantly different from 0. Here is SAS output:

Estimated Regression Coefficients

Standard 95% Confidence


Parameter Estimate Error t Value Pr > |t| Interval

Intercept 7.56852711 4.31739366 1.75 0.0879 -1.1793434 16.3163976


temp 0.07780553 0.26477065 0.29 0.7705 -0.4586708 0.6142818

11.10 (a)

(b)

(c) Here is output from SAS.

Estimated Regression Coefficients

Standard 95% Confidence


Parameter Estimate Error t Value Pr > |t| Interval

Intercept 9.61708567 3.56639563 2.70 0.0208 1.76750181 17.4666695


purchase 1.08119004 0.12136848 8.91 <.0001 0.81405982 1.3483203

NOTE: The denominator degrees of freedom for the t tests is 11.

We use (number of psus) - (number of strata) = 12 - 1 = 11 df.


11.14 Here is code and output from SAS:

data nhanes;
infile nhanes delimiter=’,’ firstobs=2;
input sdmvstra sdmvpsu wtmec2yr age ridageyr riagendr ridreth2
dmdeduc indfminc bmxwt bmxbmi bmxtri
bmxwaist bmxthicr bmxarml;
if riagendr = 1 then x = 0; /* x=0 is male*/
if riagendr = 2 then x = 1; /* x=1 is female */
if age ge 15 then over15 = 1;
else if age lt 15 then over15 = 0;
else over15=.;
one = 1;
label age = "Age at Examination (years)"

agesq = "Age^2"
riagendr = "Gender"
ridreth2 = "Race/Ethnicity"
dmdeduc = "Education Level"
indfminc = "Family income"
bmxwt = "Weight (kg)"
bmxbmi = "Body mass index"
bmxtri = "Triceps skinfold (mm)"
bmxwaist = "Waist circumference (cm)"
bmxthicr = "Thigh circumference (cm)"
bmxarml = "Upper arm length (cm)";
run;

proc surveyreg data=nhanes;


stratum sdmvstra;
cluster sdmvpsu;
weight wtmec2yr;
model bmxtri=bmxbmi /clparm;
output out=quad pred=quadpred residual=quadres;
ods output ParameterEstimates = quadcoefs;
run;

/* Do a weighted bubble plot of data with the regression line. */

goptions reset=all;
goptions colors = (gray);
axis4 label=(angle=90 ’Triceps Skinfold’) order=(0 to 50 by 10);
axis3 label=(’Body Mass Index’) order = (10 to 70 by 10);
axis5 order=(0 to 50 by 10) major=none minor=none value=none;
symbol interpol=join width=2 color = black;

proc sort data=quad;


by bmxtri bmxbmi;

proc means data=quad noprint;


by bmxtri bmxbmi;
var wtmec2yr quadpred quadres;
output out=quadplot sum=sumwts sumquad sumres
mean=meanwt meanquad meanres;
run;

proc gplot data=quadplot;


bubble bmxtri*bmxbmi= sumwts/bsize=10 haxis = axis3 vaxis=axis4;
plot2 meanquad*bmxbmi/ vaxis=axis5;
run;

/* Plot residuals vs predicted values */

goptions reset=all;
goptions colors = (gray);
axis3 label=(’Predicted Values’) order = (15 to 30 by 5);
axis4 label=(angle=90 ’Residuals’) order=(-20 to 40 by 10);
axis5 order=(10 to 70 by 10) major=none minor=none value=none;

proc gplot data=quadplot;


bubble meanres*meanquad= sumwts;
run;

Estimated Regression Coefficients

Standard 95% Confidence


Parameter Estimate Error t Value Pr > |t| Interval

Intercept -3.4248758 0.47196343 -7.26 <.0001 -4.4308420 -2.4189096


bmxbmi 0.8581404 0.02265388 37.88 <.0001 0.8098548 0.9064260

NOTE: The denominator degrees of freedom for the t tests is 15.

R2 = 0.38. Note the pattern in the residuals vs. predicted values plot. You may
want to use a model with log transformations instead.
11.15

Estimated Regression Coefficients



Standard 95% Confidence


Parameter Estimate Error t Value Pr > |t| Interval

Intercept 5.02260509 1.38143897 3.64 0.0024 2.07813762 7.96707255


bmxthicr 1.68332900 0.02540473 66.26 <.0001 1.62918010 1.73747790

NOTE: The denominator degrees of freedom for the t tests is 15.

R2 = 0.57.

11.16

data ncvs;
infile ncvs delimiter = ",";
input age married sex race hispanic hhinc away employ numinc
violent injury medtreat medexp robbery assault
pweight pstrat ppsu;
agesq = age*age;
if violent ge 1 then isviol = 1;
else if violent = 0 then isviol = 0;
run;

proc surveylogistic data=ncvs;


stratum pstrat;
cluster ppsu;
weight pweight;
model isviol (event=’1’)= age sex ;
run;

proc surveylogistic data=ncvs;


stratum pstrat;
cluster ppsu;
weight pweight;
model isviol (event=’1’)= age agesq sex ;
run;

Analysis of Maximum Likelihood Estimates

Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 -2.6743 0.0891 900.7557 <.0001


age 1 -0.0418 0.00211 392.9618 <.0001
sex 1 -0.2925 0.0637 21.0673 <.0001

From this model, younger people and males are more likely to have at least one
violent victimization. A quadratic term in age is not significant.
11.17 From (11.4),
$$B_1 = \frac{\displaystyle\sum_{i=1}^N x_iy_i - \left(\sum_{i=1}^N x_i\right)\left(\sum_{i=1}^N y_i\right)\Big/N}{\displaystyle\sum_{i=1}^N x_i^2 - \left(\sum_{i=1}^N x_i\right)^2\Big/N} = \frac{N_1\bar y_{1U} - N_1\bar y_U}{N_1 - N_1^2/N} = \frac{\bar y_{1U} - (N_1\bar y_{1U} + N_2\bar y_{2U})/N}{1 - N_1/N} = \bar y_{1U} - \bar y_{2U}.$$
From (11.5),
$$B_0 = \frac{t_y - B_1t_x}{N} = \bar y_U - B_1\bar x_U = \frac{N_1\bar y_{1U} + N_2\bar y_{2U}}{N} - \frac{N_1}{N}(\bar y_{1U} - \bar y_{2U}) = \bar y_{2U}.$$
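The result in 11.17 says that regressing $y$ on a 0/1 group indicator recovers the two domain means; this can be verified on a small artificial population (Python sketch with arbitrary test data):

```python
import numpy as np

# Slope = mean(x=1 group) - mean(x=0 group); intercept = mean(x=0 group).
x = np.array([1, 1, 1, 0, 0, 0, 0], dtype=float)
y = np.array([5.0, 7.0, 6.0, 2.0, 3.0, 1.0, 2.0])

X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
assert abs(b1 - (y[x == 1].mean() - y[x == 0].mean())) < 1e-10
assert abs(b0 - y[x == 0].mean()) < 1e-10
```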

11.18 (a)
(b) $\hat\beta_0 = -4.096$; $\hat\beta_1 = 6.049$.
(c) We estimate the model-based variance using the regression software: $\hat V_M(\hat\beta_1) = 0.541$. Using linearization,
$$\hat V_L(\hat\beta_1) = \frac{n\sum (x_i - \bar x)^2(y_i - \hat\beta_0 - \hat\beta_1x_i)^2}{(n-1)\left[\sum (x_i - \bar x)^2\right]^2} = 0.685.$$
$\hat V_L$ is larger, as we would expect since the plot exhibits unequal variances.
11.19 From (11.10), for straight-line regression,
$$\hat{\mathbf B} = \left(\sum_{i\in S} w_i\mathbf x_i\mathbf x_i^T\right)^{-1}\sum_{i\in S} w_i\mathbf x_iy_i$$
with $\mathbf x_i = [1\;\; x_i]^T$. Here,
$$\sum_{i\in S} w_i\mathbf x_i\mathbf x_i^T = \begin{bmatrix}\displaystyle\sum_{i\in S} w_i & \displaystyle\sum_{i\in S} w_ix_i\\[2ex] \displaystyle\sum_{i\in S} w_ix_i & \displaystyle\sum_{i\in S} w_ix_i^2\end{bmatrix}
\qquad\text{and}\qquad
\sum_{i\in S} w_i\mathbf x_iy_i = \begin{bmatrix}\displaystyle\sum_{i\in S} w_iy_i\\[2ex] \displaystyle\sum_{i\in S} w_ix_iy_i\end{bmatrix},$$
so
$$\hat{\mathbf B} = \frac{1}{\left(\displaystyle\sum_{i\in S} w_i\right)\left(\displaystyle\sum_{i\in S} w_ix_i^2\right) - \left(\displaystyle\sum_{i\in S} w_ix_i\right)^2}
\begin{bmatrix}\displaystyle\sum_{i\in S} w_ix_i^2 & -\displaystyle\sum_{i\in S} w_ix_i\\[2ex] -\displaystyle\sum_{i\in S} w_ix_i & \displaystyle\sum_{i\in S} w_i\end{bmatrix}
\begin{bmatrix}\displaystyle\sum_{i\in S} w_iy_i\\[2ex] \displaystyle\sum_{i\in S} w_ix_iy_i\end{bmatrix}.$$
Thus,
$$\hat B_1 = \frac{\displaystyle\sum_{i\in S} w_ix_iy_i - \left(\sum_{i\in S} w_ix_i\right)\left(\sum_{i\in S} w_iy_i\right)\Big/\sum_{i\in S} w_i}{\displaystyle\sum_{i\in S} w_ix_i^2 - \left(\sum_{i\in S} w_ix_i\right)^2\Big/\sum_{i\in S} w_i}$$
and
$$\hat B_0 = \frac{\left(\displaystyle\sum_{i\in S} w_ix_i^2\right)\left(\displaystyle\sum_{i\in S} w_iy_i\right) - \left(\displaystyle\sum_{i\in S} w_ix_i\right)\left(\displaystyle\sum_{i\in S} w_ix_iy_i\right)}{\left(\displaystyle\sum_{i\in S} w_i\right)\left(\displaystyle\sum_{i\in S} w_ix_i^2\right) - \left(\displaystyle\sum_{i\in S} w_ix_i\right)^2} = \left(\sum_{i\in S} w_i\right)^{-1}\left[\sum_{i\in S} w_iy_i - \hat B_1\sum_{i\in S} w_ix_i\right].$$
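The closed-form $\hat B_0, \hat B_1$ above agree with a weighted least-squares fit; a Python sketch checking them against `np.linalg.lstsq` on the $\sqrt{w}$-scaled design (arbitrary test data):

```python
import numpy as np

# Closed-form weighted-regression coefficients vs. a numerical WLS solution.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 20)
y = 2.0 + 0.5 * x + rng.standard_normal(20)
w = rng.uniform(1, 5, 20)                      # arbitrary positive weights

sw, swx, swy = w.sum(), (w * x).sum(), (w * y).sum()
swxx, swxy = (w * x * x).sum(), (w * x * y).sum()
B1 = (swxy - swx * swy / sw) / (swxx - swx**2 / sw)
B0 = (swy - B1 * swx) / sw

# WLS via sqrt-weight scaling: minimize sum w_i (y_i - b0 - b1 x_i)^2
X = np.column_stack([np.ones_like(x), x])
coef = np.linalg.lstsq(np.sqrt(w)[:, None] * X, np.sqrt(w) * y, rcond=None)[0]
assert np.allclose([B0, B1], coef)
```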

11.20 In matrix terms,
$$\hat{\mathbf B} = \left(\sum_{j\in S} w_j\mathbf x_j\mathbf x_j^T\right)^{-1}\sum_{j\in S} w_j\mathbf x_jy_j.$$
Then, using the matrix identity in the hint,
$$\begin{aligned}
\mathbf z_i = \frac{\partial\hat{\mathbf B}}{\partial w_i}
&= \left[\frac{\partial}{\partial w_i}\left(\sum_{j\in S} w_j\mathbf x_j\mathbf x_j^T\right)^{-1}\right]\sum_{j\in S} w_j\mathbf x_jy_j + \left(\sum_{j\in S} w_j\mathbf x_j\mathbf x_j^T\right)^{-1}\frac{\partial}{\partial w_i}\left[\sum_{j\in S} w_j\mathbf x_jy_j\right]\\
&= -\left(\sum_{j\in S} w_j\mathbf x_j\mathbf x_j^T\right)^{-1}\mathbf x_i\mathbf x_i^T\left(\sum_{j\in S} w_j\mathbf x_j\mathbf x_j^T\right)^{-1}\sum_{j\in S} w_j\mathbf x_jy_j + \left(\sum_{j\in S} w_j\mathbf x_j\mathbf x_j^T\right)^{-1}\mathbf x_iy_i\\
&= \left(\sum_{j\in S} w_j\mathbf x_j\mathbf x_j^T\right)^{-1}\left(-\mathbf x_i\mathbf x_i^T\hat{\mathbf B} + \mathbf x_iy_i\right)\\
&= \left(\sum_{j\in S} w_j\mathbf x_j\mathbf x_j^T\right)^{-1}\mathbf x_i\left(y_i - \mathbf x_i^T\hat{\mathbf B}\right).
\end{aligned}$$
Then the estimated variance is
$$\hat V(\hat{\mathbf B}) = \hat V\left(\sum_{i\in S} w_i\mathbf z_i\right) = \left(\sum_{j\in S} w_j\mathbf x_j\mathbf x_j^T\right)^{-1}\hat V\left(\sum_{i\in S} w_i\mathbf q_i\right)\left(\sum_{j\in S} w_j\mathbf x_j\mathbf x_j^T\right)^{-1},$$
where $\mathbf q_i = \mathbf x_i(y_i - \mathbf x_i^T\hat{\mathbf B})$.

11.21 First express the estimator as a function of the weights:
$$\hat t_{yGREG} = \hat t_y + (\mathbf t_x - \hat{\mathbf t}_x)^T\hat{\mathbf B},$$
with
$$\hat{\mathbf B} = \left(\sum_{i\in S} w_i\frac{1}{\sigma_i^2}\mathbf x_i\mathbf x_i^T\right)^{-1}\sum_{i\in S} w_i\frac{1}{\sigma_i^2}\mathbf x_iy_i.$$
Thus,
$$\hat t_{yGREG} = \sum_{i\in S} w_iy_i + \left(\mathbf t_x - \sum_{i\in S} w_i\mathbf x_i\right)^T\hat{\mathbf B}.$$
Using the same argument as in Exercise 11.20,
$$\begin{aligned}
\frac{\partial\hat{\mathbf B}}{\partial w_i}
&= -\left(\sum_{j\in S} w_j\frac{\mathbf x_j\mathbf x_j^T}{\sigma_j^2}\right)^{-1}\frac{\mathbf x_i\mathbf x_i^T}{\sigma_i^2}\left(\sum_{j\in S} w_j\frac{\mathbf x_j\mathbf x_j^T}{\sigma_j^2}\right)^{-1}\sum_{j\in S} w_j\frac{\mathbf x_jy_j}{\sigma_j^2} + \left(\sum_{j\in S} w_j\frac{\mathbf x_j\mathbf x_j^T}{\sigma_j^2}\right)^{-1}\frac{\mathbf x_iy_i}{\sigma_i^2}\\
&= \left(\sum_{j\in S} w_j\frac{\mathbf x_j\mathbf x_j^T}{\sigma_j^2}\right)^{-1}\frac{\mathbf x_i}{\sigma_i^2}\left(y_i - \mathbf x_i^T\hat{\mathbf B}\right),
\end{aligned}$$
so
$$\begin{aligned}
z_i = \frac{\partial\hat t_{yGREG}}{\partial w_i}
&= y_i - \mathbf x_i^T\hat{\mathbf B} + (\mathbf t_x - \hat{\mathbf t}_x)^T\frac{\partial\hat{\mathbf B}}{\partial w_i}\\
&= y_i - \mathbf x_i^T\hat{\mathbf B} + (\mathbf t_x - \hat{\mathbf t}_x)^T\left(\sum_{j\in S} w_j\frac{\mathbf x_j\mathbf x_j^T}{\sigma_j^2}\right)^{-1}\frac{\mathbf x_i}{\sigma_i^2}\left(y_i - \mathbf x_i^T\hat{\mathbf B}\right)\\
&= g_i(y_i - \mathbf x_i^T\hat{\mathbf B}).
\end{aligned}$$

11.26 The OLS estimator of $\boldsymbol\beta$ is
$$\hat{\boldsymbol\beta} = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf Y.$$
Thus,
$$\mathrm{Cov}_M(\hat{\boldsymbol\beta}) = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathrm{Cov}(\mathbf Y)[(\mathbf X^T\mathbf X)^{-1}\mathbf X^T]^T = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\boldsymbol\Sigma\mathbf X(\mathbf X^T\mathbf X)^{-1}.$$
For a straight-line regression model,
$$\mathbf X^T\mathbf X = \begin{bmatrix}n & \sum x_i\\ \sum x_i & \sum x_i^2\end{bmatrix},\qquad
(\mathbf X^T\mathbf X)^{-1} = \frac{1}{\sum (x_i - \bar x)^2}\begin{bmatrix}\sum x_i^2/n & -\bar x\\ -\bar x & 1\end{bmatrix},$$
and
$$\mathbf X^T\boldsymbol\Sigma\mathbf X = \begin{bmatrix}\sum \sigma_i^2 & \sum x_i\sigma_i^2\\ \sum x_i\sigma_i^2 & \sum x_i^2\sigma_i^2\end{bmatrix}.$$
Thus, in straight-line regression,
$$V_M(\hat\beta_1) = \frac{\bar x^2\sum \sigma_i^2 - 2\bar x\sum x_i\sigma_i^2 + \sum x_i^2\sigma_i^2}{\left[\sum (x_i - \bar x)^2\right]^2} = \frac{\sum (x_i - \bar x)^2\sigma_i^2}{\left[\sum (x_i - \bar x)^2\right]^2}.$$
To see the relation to Section 11.2.1, let
$$Q_i = Y_i - \beta_0 - \beta_1x_i.$$
Then $V_M(Q_i) = \sigma_i^2$; since observations are independent under the model,
$$V_M\left[\sum_i (x_i - \bar x)Q_i\right] = \sum_i (x_i - \bar x)^2\sigma_i^2.$$
If we take $w_i = 1$ for all $i$, then Equation (11.8) provides an estimate of $V_M(\hat\beta_1)$.
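The sandwich simplification for $V_M(\hat\beta_1)$ can be confirmed by direct matrix computation (Python sketch; the $x$ values and heteroskedastic variances are arbitrary test data):

```python
import numpy as np

# The (2,2) element of (X'X)^{-1} X' Sigma X (X'X)^{-1} should equal
# sum((x - xbar)^2 * sig2) / (sum((x - xbar)^2))^2.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 15)
sig2 = rng.uniform(0.5, 2.0, 15)          # heteroskedastic model variances

X = np.column_stack([np.ones_like(x), x])
XtX_inv = np.linalg.inv(X.T @ X)
sandwich = XtX_inv @ X.T @ np.diag(sig2) @ X @ XtX_inv

xbar = x.mean()
ssx = ((x - xbar) ** 2).sum()
v_beta1 = ((x - xbar) ** 2 * sig2).sum() / ssx**2
assert abs(sandwich[1, 1] - v_beta1) < 1e-12
```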


11.29 For this model (see Exercise 11.19),
\[
X_S^T W_S \Sigma_S^{-1} X_S = \frac{1}{\sigma^2}
\begin{bmatrix} \sum_{i\in S} w_i & \sum_{i\in S} w_i x_i \\ \sum_{i\in S} w_i x_i & \sum_{i\in S} w_i x_i^2 \end{bmatrix}
\]
and
\[
X_S^T W_S \Sigma_S^{-1} y_S = \frac{1}{\sigma^2}
\begin{bmatrix} \sum_{i\in S} w_i y_i \\ \sum_{i\in S} w_i x_i y_i \end{bmatrix}
= \frac{1}{\sigma^2} \begin{bmatrix} \hat{t}_y \\ \hat{t}_{xy} \end{bmatrix}.
\]
Using (11.20),
\[
\hat{B} = \begin{bmatrix} \hat{N} & \hat{t}_x \\ \hat{t}_x & \hat{t}_{x^2} \end{bmatrix}^{-1}
\begin{bmatrix} \hat{t}_y \\ \hat{t}_{xy} \end{bmatrix}
= \frac{1}{\hat{N}\hat{t}_{x^2} - (\hat{t}_x)^2}
\begin{bmatrix} \hat{t}_{x^2}\hat{t}_y - \hat{t}_x \hat{t}_{xy} \\ -\hat{t}_x \hat{t}_y + \hat{N}\hat{t}_{xy} \end{bmatrix}
\]
and
\[
\hat{t}_{y\mathrm{GREG}} = \hat{t}_y + [N - \hat{N},\; t_x - \hat{t}_x]\,\hat{B}
= \hat{t}_y + \frac{1}{\hat{N}\hat{t}_{x^2} - (\hat{t}_x)^2}\Bigl[(N - \hat{N})(\hat{t}_{x^2}\hat{t}_y - \hat{t}_x \hat{t}_{xy}) + (t_x - \hat{t}_x)(-\hat{t}_x \hat{t}_y + \hat{N}\hat{t}_{xy})\Bigr].
\]
If $y = x$,
\[
\hat{t}_{x\mathrm{GREG}} = \hat{t}_x + \frac{1}{\hat{N}\hat{t}_{x^2} - (\hat{t}_x)^2}\Bigl[0 + (t_x - \hat{t}_x)(-\hat{t}_x^2 + \hat{N}\hat{t}_{x^2})\Bigr] = t_x.
\]
Chapter 12

Two-Phase Sampling

12.1 We use (12.4) to estimate the total, and (12.7) to estimate its variance. We obtain
\[
\hat{t}_{\mathrm{str}}^{(2)} = \frac{100{,}000}{1000} \sum_{\text{cells}} \frac{n_h}{m_h} r_h = 48{,}310.
\]
From (12.7), we estimate $s_h^{2(2)} = \hat{p}_h^{(2)}\bigl(1 - \hat{p}_h^{(2)}\bigr)\,m_h/(m_h - 1)$ and obtain
\[
\hat{V}\bigl(\hat{t}_{\mathrm{str}}^{(2)}\bigr)
= N(N-1) \sum_{h=1}^{H} \Bigl( \frac{n_h - 1}{n - 1} - \frac{m_h - 1}{N - 1} \Bigr) \frac{n_h}{n} \frac{s_h^{2(2)}}{m_h}
+ \frac{N^2}{n-1}\Bigl(1 - \frac{n}{N}\Bigr) \sum_{h=1}^{H} \frac{n_h}{n} \bigl(\bar{y}_h^{(2)} - \hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)^2
\]
\[
= 100{,}000\,(99{,}999)(0.000912108) + \frac{100{,}000^2}{999}(0.109282588)
= 10{,}214{,}904.
\]
Thus $\mathrm{SE}\bigl(\hat{t}_{\mathrm{str}}^{(2)}\bigr) = 3196$.
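The arithmetic above can be verified directly; the sketch below (Python standing in for the solution's SAS, with variable names of our choosing) reproduces the point estimate and reassembles the variance from the two pieces quoted in the computation.

```python
# Arithmetic check of Exercise 12.1 (a sketch; not part of the original solution).
strata = [(241, 96, 86), (113, 45, 17), (174, 35, 29), (472, 47, 8)]  # (n_h, m_h, r_h)
N, n = 100_000, 1_000

# Estimated total from (12.4): (N/n) * sum over cells of (n_h/m_h) * r_h
t_hat = (N / n) * sum(nh / mh * rh for nh, mh, rh in strata)

# Variance from (12.7), assembled from the two numeric pieces quoted in the solution
v_hat = N * (N - 1) * 0.000912108 + N**2 / (n - 1) * 0.109282588
se = v_hat ** 0.5
```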
Note that we can do a rough check using SAS PROC SURVEYMEANS, which will
capture the variability due to the phase II sample.

data exer1201;
input strat nh mh diabcount ;
nondiab = mh - diabcount ;
datalines;
1 241 96 86
2 113 45 17
3 174 35 29
4 472 47 8
;


data exer1201;
set exer1201;
do i = 1 to diabcount;
sampwt = nh/mh*(100000/1000);
diab = 1;
output;
end;
do i = 1 to nondiab;
sampwt = nh/mh*(100000/1000);
diab = 0;
output;
end;

proc surveymeans data=exer1201 mean clm sum clsum;


stratum strat;
weight sampwt;
var diab;
run;

This code gives $\hat{t} = 48{,}310$ with SE 3059.0785. We can then add the second term in (12.7) to obtain
\[
\hat{V}\bigl(\hat{t}_{\mathrm{str}}^{(2)}\bigr) = 3059.0785^2 + \frac{100{,}000^2}{999}(0.109282588) = 10{,}451{,}881.
\]
This is a little bit larger than the estimate obtained above because we did not incorporate fpcs.
12.3 Using the population and sample sizes $N = 2130$, $n^{(1)} = 201$, and $n^{(2)} = 12$, the phase I weight is $w_i^{(1)} = 2130/201 = 10.597$ for every phase I unit. For the units in phase II, the phase II weight is $w_i^{(2)} = 201/12 = 16.75$. We have the following information and summary statistics from shorebirds.dat:
\[
\hat{t}_x^{(1)} = \sum_{i \in S^{(1)}} w_i^{(1)} x_i = 44{,}284.93,
\qquad
\hat{t}_x^{(2)} = \sum_{i \in S^{(2)}} w_i^{(1)} w_i^{(2)} x_i = 34{,}790,
\]
and $\hat{t}_y^{(2)} = 43{,}842.5$. Using (12.9),
\[
\hat{t}_{yr}^{(2)} = \hat{t}_x^{(1)} \frac{\hat{t}_y^{(2)}}{\hat{t}_x^{(2)}} = 44{,}284.93 \times \frac{43{,}842.5}{34{,}790} = 55{,}808.
\]
We estimate the variance using (12.11): we have $s_y^2 = 115.3561$, $s_e^2 = 7.453911$, and
\[
\hat{V}\bigl(\hat{t}_{yr}^{(2)}\bigr)
= N^2 \Bigl(1 - \frac{n^{(1)}}{N}\Bigr) \frac{s_y^2}{n^{(1)}}
+ N^2 \Bigl(1 - \frac{n^{(2)}}{n^{(1)}}\Bigr) \frac{s_e^2}{n^{(2)}}
= (2130)^2\Bigl(1 - \frac{201}{2130}\Bigr)\frac{115.3561}{201}
+ (2130)^2\Bigl(1 - \frac{12}{201}\Bigr)\frac{7.453911}{12}
= 2{,}358{,}067 + 2{,}649{,}890 = 5{,}007{,}958,
\]
so the standard error is 2238.
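The ratio estimate and its variance are simple arithmetic given the quoted summary statistics; a quick check (sketched in Python, variable names ours):

```python
# Numeric check of Exercise 12.3 (a sketch; the summary statistics are quoted above).
N, n1, n2 = 2130, 201, 12
tx1 = 44284.93           # phase I estimate of t_x
tx2 = 34790.0            # phase II estimate of t_x
ty2 = 43842.5            # phase II estimate of t_y
s2y, s2e = 115.3561, 7.453911

tyr = tx1 * ty2 / tx2    # ratio estimator (12.9)
# variance estimator (12.11): phase I SRS term plus phase II SRS term
v = N**2 * (1 - n1 / N) * s2y / n1 + N**2 * (1 - n2 / n1) * s2e / n2
se = v ** 0.5
```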


Note that the paper by Bart and Earnst has some inconsistencies so it is possible
that the data set constructed for this problem does not reflect the distribution of
shorebirds in the region.
12.4 (a) The phase II weight is $1049/60 = 17.48$ for stratum 1, $237/48 = 4.9375$ for stratum 2, and $272/142 = 1.915$ for stratum 3.
(b) We use (12.5) to estimate
\[
\hat{\bar{y}}_{\mathrm{str}}^{(2)} = \sum_{h=1}^{H} \frac{n_h}{n} \bar{y}_h^{(2)} = 0.3030 + 0.1078 + 0.1426 = 0.5534.
\]
Then, using (12.8) (which we may use since the fpc is negligible),
\[
\hat{V}\bigl(\hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)
\approx \sum_{h=1}^{H} \frac{n_h - 1}{n - 1} \frac{n_h}{n} \frac{s_h^{2(2)}}{m_h}
+ \frac{1}{n - 1} \sum_{h=1}^{H} \frac{n_h}{n} \bigl(\bar{y}_h^{(2)} - \hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)^2
\]
\[
= 0.00186941 + 9.924\times 10^{-5} + 3.201\times 10^{-5}
+ \frac{1}{1558}(0.007192 + 0.003654 + 0.012126)
= 0.002015,
\]
so the standard error is 0.045. Note that the second term adds little to the variability since the phase I sample size is large.
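A quick check of the weights, the estimate, and the assembled variance (a Python sketch built from the per-stratum pieces quoted above):

```python
# Arithmetic check of Exercise 12.4 (a sketch; not part of the original solution).
n = 1558
w = [1049 / 60, 237 / 48, 272 / 142]     # part (a): phase II weights n_h/m_h
ybar = 0.3030 + 0.1078 + 0.1426          # part (b): sum of (n_h/n) * ybar_h^(2)
term1 = 0.00186941 + 9.924e-05 + 3.201e-05
term2 = (0.007192 + 0.003654 + 0.012126) / n
v = term1 + term2
se = v ** 0.5
```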
12.5 (a) We use the final weights for the phase II sample to calculate the proportions in the table, and use (12.8) to find the standard error for each (given in parentheses after the proportion).

                                    Case?
  Proportion (SE)        No                Yes               Total
  Gender  Male        0.2265 (0.0397)   0.1496 (0.0312)   0.3761 (0.0444)
          Female      0.2164 (0.0399)   0.4075 (0.0426)   0.6239 (0.0444)
  Total               0.4430 (0.0449)   0.5570 (0.0449)

(b) We can calculate the Rao–Scott correction for a test statistic based on a sample of size $n = 1558$. Then, (10.2) gives
\[
X^2 = n \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(\hat{p}_{ij} - \hat{p}_{i+}\hat{p}_{+j})^2}{\hat{p}_{i+}\hat{p}_{+j}} = (1558)(0.062035726) = 96.65.
\]
To find the design effect, we divide the estimated variance from part (a) by the variance that would have been obtained if an SRS of size 1558 had been selected, namely $\hat{p}(1-\hat{p})/1558$. We obtain the following table:

                                Case?
  Design effect         No        Yes      Total
  Gender  Male        13.987    11.929    13.072
          Female      14.661    11.708    13.072
  Total               12.715    12.715

Using (10.9),
\[
E[X^2] \approx \sum_{i=1}^{r}\sum_{j=1}^{c} (1 - p_{ij}) d_{ij} - \sum_{i=1}^{r} (1 - p_{i+}) d_i^R - \sum_{j=1}^{c} (1 - p_{+j}) d_j^C = 13.601.
\]
Then, from Section 10.2.2, $X^2/E[X^2]$ approximately follows a $\chi^2_1$ distribution if the null hypothesis is true. We calculate $X^2/E[X^2] = 96.65/13.601 = 7.1$, so the p-value is approximately 0.008. There is evidence against the null hypothesis of independence.
Note that we could equally well have calculated $X^2$ as $m \sum_{i=1}^{r}\sum_{j=1}^{c} (\hat{p}_{ij} - \hat{p}_{i+}\hat{p}_{+j})^2/(\hat{p}_{i+}\hat{p}_{+j})$ and calculated the variance under an SRS as $\hat{p}(1-\hat{p})/m$; this gives the same result.
Since the phase I sample size is relatively large, and the second term in (12.8) is small relative to the first term, we can use SAS PROC SURVEYFREQ to obtain an approximate check on our results.
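The corrected statistic and its p-value can also be checked numerically; the sketch below (Python, names ours) uses the chi-square(1) upper-tail identity $P(X > x) = \operatorname{erfc}(\sqrt{x/2})$.

```python
import math

# Numeric check of the Rao-Scott first-order correction in Exercise 12.5(b) (a sketch).
n = 1558
X2 = n * 0.062035726      # Pearson X^2 from (10.2)
E_X2 = 13.601             # estimated E[X^2] from (10.9)
F = X2 / E_X2             # corrected statistic, referred to a chi-square(1)
# Upper-tail probability of a chi-square with 1 df
p_value = math.erfc(math.sqrt(F / 2))
```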

data exer1205;
input strat nh mh gender $ case count ;
datalines;
1 1049 60 m 0 16
1 1049 60 m 1 8
1 1049 60 f 0 17
1 1049 60 f 1 19
2 237 48 m 0 9
2 237 48 m 1 8
2 237 48 f 0 5
2 237 48 f 1 26
3 272 142 m 0 15
3 272 142 m 1 28
3 272 142 f 0 8
3 272 142 f 1 91
;

data exer1205;
set exer1205;
do i = 1 to count;
sampwt = nh/mh;
output;
end;

proc surveyfreq data=exer1205;


stratum strat;
weight sampwt;
tables gender*case / chisq deff;
run;

This code treats the phase I sample as a population, so it underestimates the variance
slightly. But since in this case n is large, the results are very close. SAS calculates
the Rao-Scott chi-square statistic as 7.01, and p-value as 0.008.
12.9 We estimate $W_h$ by $n_h/n$, and estimate $S_h^2$ by $s_h^{2(2)}$. Then, using (12.17), we have

  Stratum          Ŵ_h      Ŝ_h²     Ŵ_h·Ŝ_h²   ν_h
  Yes              0.3658   0.1995   0.0730     0.40
  No               0.3895   0.1313   0.0511     0.32
  Not available    0.2447   0.2437   0.0596     0.44
  Total            1.0000            0.1837

We estimate $S^2$ using
\[
(n-1)\hat{S}^2 = \sum_{h=1}^{H} (n_h - 1)\hat{S}_h^2 + \sum_{h=1}^{H} n_h (\hat{p}_h - \hat{p})^2,
\]
which gives $\hat{S}^2 = 0.2468$.
This allocation takes many more observations in the "Yes" and "No" strata than did the allocation that was used. Proportional allocation would have $\nu_1 = \nu_2 = \nu_3$.
12.10 From property 5 in Section A.4,
\[
V\bigl(\hat{t}_y^{(2)}\bigr) = V\bigl(\hat{t}_y^{(1)}\bigr) + E\bigl(V[\hat{t}_y^{(2)} \mid Z]\bigr).
\]
Because the phase I sample is an SRS,
\[
V\bigl(\hat{t}_y^{(1)}\bigr) = N^2 \Bigl(1 - \frac{n^{(1)}}{N}\Bigr) \frac{S_y^2}{n^{(1)}}.
\]
Because the phase II sample is also an SRS,
\[
P(D_i = 1 \mid Z_i = 1) = n^{(2)}/n^{(1)},
\qquad
P(D_i D_j = 1 \mid Z_i Z_j = 1) = \frac{n^{(2)}(n^{(2)} - 1)}{n^{(1)}(n^{(1)} - 1)} \quad \text{for } j \ne i,
\]
\[
w_i^{(1)} = \frac{N}{n^{(1)}},
\qquad\text{and}\qquad
w_i^{(2)} = Z_i \frac{n^{(1)}}{n^{(2)}}.
\]
In addition,
\[
P(Z_i = 1) = \frac{n^{(1)}}{N}
\qquad\text{and}\qquad
P(Z_i Z_j = 1) = \frac{n^{(1)}(n^{(1)} - 1)}{N(N - 1)}.
\]
Thus, using (12.1) to write $\hat{t}_y^{(2)}$,
\[
V[\hat{t}_y^{(2)} \mid Z]
= V\Biggl[\sum_{i=1}^{N} Z_i D_i \frac{N}{n^{(1)}} \frac{n^{(1)}}{n^{(2)}} y_i \,\Big|\, Z\Biggr]
= \Bigl(\frac{N}{n^{(2)}}\Bigr)^2 E\Biggl[\sum_{i=1}^{N}\sum_{k=1}^{N} Z_i Z_k D_i D_k y_i y_k \,\Big|\, Z\Biggr] - \bigl[\hat{t}_y^{(1)}\bigr]^2
\]
\[
= \Bigl(\frac{N}{n^{(2)}}\Bigr)^2 \sum_{i=1}^{N} Z_i y_i^2 \frac{n^{(2)}}{n^{(1)}}
+ \Bigl(\frac{N}{n^{(2)}}\Bigr)^2 \sum_{i=1}^{N}\sum_{j=1, j\ne i}^{N} Z_i Z_j y_i y_j \frac{n^{(2)}(n^{(2)}-1)}{n^{(1)}(n^{(1)}-1)}
- \bigl[\hat{t}_y^{(1)}\bigr]^2.
\]
Then
\[
E\bigl(V[\hat{t}_y^{(2)} \mid Z]\bigr)
= \Bigl(\frac{N}{n^{(2)}}\Bigr)^2 \sum_{i=1}^{N} \frac{n^{(1)}}{N} y_i^2 \frac{n^{(2)}}{n^{(1)}}
+ \Bigl(\frac{N}{n^{(2)}}\Bigr)^2 \sum_{i=1}^{N}\sum_{j=1, j\ne i}^{N} y_i y_j \frac{n^{(1)}(n^{(1)} - 1)}{N(N - 1)} \frac{n^{(2)}(n^{(2)} - 1)}{n^{(1)}(n^{(1)} - 1)}
- V\bigl[\hat{t}_y^{(1)}\bigr] - \bigl(E[\hat{t}_y^{(1)}]\bigr)^2
\]
\[
= \frac{N}{n^{(2)}} \sum_{i=1}^{N} y_i^2
+ \frac{N(n^{(2)} - 1)}{n^{(2)}(N - 1)} \sum_{i=1}^{N}\sum_{j=1, j\ne i}^{N} y_i y_j
- V\bigl[\hat{t}_y^{(1)}\bigr] - t_y^2
= N^2 \Bigl(1 - \frac{n^{(2)}}{N}\Bigr) \frac{S_y^2}{n^{(2)}} - V\bigl[\hat{t}_y^{(1)}\bigr].
\]
Thus,
\[
V\bigl[\hat{t}_y^{(2)}\bigr] = N^2 \Bigl(1 - \frac{n^{(2)}}{N}\Bigr) \frac{S_y^2}{n^{(2)}}.
\]

12.11 Conditioning on the phase I units,
\[
E\bigl[\hat{V}(\hat{t}_{\mathrm{str}}^{(2)}) \mid Z\bigr]
= N(N-1)\sum_{h=1}^{H} \Bigl(\frac{n_h-1}{n-1} - \frac{m_h-1}{N-1}\Bigr) \frac{n_h}{n}\, E\biggl[\frac{s_h^{2(2)}}{m_h} \,\Big|\, Z\biggr]
+ \frac{N^2}{n-1}\Bigl(1-\frac{n}{N}\Bigr) \sum_{h=1}^{H} E\biggl[\frac{n_h}{n}\bigl(\bar{y}_h^{(2)} - \hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)^2 \,\Big|\, Z\biggr].
\]
Now
\[
E\biggl[\frac{s_h^{2(2)}}{m_h} \,\Big|\, Z\biggr] = \frac{s_h^{2(1)}}{m_h}
\]
and
\[
\sum_{h=1}^{H} E\biggl[\frac{n_h}{n}\bigl(\bar{y}_h^{(2)} - \hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)^2 \,\Big|\, Z\biggr]
= \sum_{h=1}^{H} \frac{n_h}{n} E\bigl[(\bar{y}_h^{(2)})^2 \mid Z\bigr] - E\bigl[(\hat{\bar{y}}_{\mathrm{str}}^{(2)})^2 \mid Z\bigr]
\]
\[
= \sum_{h=1}^{H} \frac{n_h}{n}\biggl[\Bigl(1 - \frac{m_h}{n_h}\Bigr)\frac{s_h^{2(1)}}{m_h} + \bigl(\bar{y}_h^{(1)}\bigr)^2\biggr]
- \sum_{h=1}^{H} \Bigl(\frac{n_h}{n}\Bigr)^2 \Bigl(1 - \frac{m_h}{n_h}\Bigr)\frac{s_h^{2(1)}}{m_h}
- \biggl[\sum_{h=1}^{H} \frac{n_h}{n}\bar{y}_h^{(1)}\biggr]^2
\]
\[
= \sum_{h=1}^{H} \frac{n_h}{n}\Bigl(1 - \frac{n_h}{n}\Bigr)\Bigl(1 - \frac{m_h}{n_h}\Bigr)\frac{s_h^{2(1)}}{m_h}
+ \frac{1}{n}\biggl[(n-1)s_y^{2(1)} - \sum_{h=1}^{H} (n_h - 1) s_h^{2(1)}\biggr];
\]
the last equality follows by applying the sums-of-squares identity
\[
(n-1)s_y^{2(1)} = \sum_{h=1}^{H}(n_h - 1)s_h^{2(1)} + \sum_{h=1}^{H} n_h \biggl(\bar{y}_h^{(1)} - \sum_{k=1}^{H} \frac{n_k}{n}\bar{y}_k^{(1)}\biggr)^2
\]
to the phase I sample.
Plugging in to the first equation, we have
\[
E\bigl[\hat{V}(\hat{t}_{\mathrm{str}}^{(2)}) \mid Z\bigr]
= \frac{N^2}{n}\Bigl(1 - \frac{n}{N}\Bigr)s_y^{2(1)}
+ N^2 \sum_{h=1}^{H} \frac{s_h^{2(1)}}{m_h}\biggl\{\frac{N-1}{N}\Bigl(\frac{n_h-1}{n-1} - \frac{m_h-1}{N-1}\Bigr)\frac{n_h}{n}
+ \frac{1}{n-1}\Bigl(1 - \frac{n}{N}\Bigr)\biggl[\frac{n_h}{n}\Bigl(1 - \frac{n_h}{n}\Bigr)\Bigl(1 - \frac{m_h}{n_h}\Bigr) - \frac{m_h(n_h - 1)}{n}\biggr]\biggr\}
\]
\[
= \frac{N^2}{n}\Bigl(1 - \frac{n}{N}\Bigr)s_y^{2(1)}
+ N^2 \sum_{h=1}^{H} \frac{s_h^{2(1)}}{m_h}\Bigl(\frac{n_h}{n}\Bigr)^2\Bigl(1 - \frac{m_h}{n_h}\Bigr).
\]
(The last equality follows after a lot of algebra.) Since $E[s_y^{2(1)}] = S_y^2$, the unbiasedness is shown.
12.12 (a) Equation (A.9) implies these results.
(b) From the solution to Exercise 12.10,
\[
V\bigl(\hat{t}_{yr}^{(2)}\bigr) = V\bigl[\hat{t}_y^{(1)}\bigr] + E\bigl[V(\hat{t}_d^{(2)} \mid Z)\bigr],
\qquad
V\bigl[\hat{t}_y^{(1)}\bigr] = N^2\Bigl(1 - \frac{n^{(1)}}{N}\Bigr)\frac{S_y^2}{n^{(1)}},
\]
and
\[
E\bigl[V(\hat{t}_d^{(2)} \mid Z)\bigr]
= N^2\Bigl(1 - \frac{n^{(2)}}{N}\Bigr)\frac{S_d^2}{n^{(2)}} - V\bigl[\hat{t}_d^{(1)}\bigr]
= N^2\Bigl(1 - \frac{n^{(2)}}{N}\Bigr)\frac{S_d^2}{n^{(2)}} - N^2\Bigl(1 - \frac{n^{(1)}}{N}\Bigr)\frac{S_d^2}{n^{(1)}}
= N S_d^2 \biggl[\frac{N - n^{(2)}}{n^{(2)}} - \frac{N - n^{(1)}}{n^{(1)}}\biggr]
= N^2\Bigl(1 - \frac{n^{(2)}}{n^{(1)}}\Bigr)\frac{S_d^2}{n^{(2)}}.
\]
(c) Follows because $s_y^2$ and $s_e^2$ estimate $S_y^2$ and $S_d^2$, respectively.
12.13 Using the hint,
\[
S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{y}_U)^2
= \frac{1}{N-1}\sum_{i=1}^{N}(y_i - Bx_i + Bx_i - B\bar{x}_U)^2
= \frac{1}{N-1}\sum_{i=1}^{N}\bigl[(y_i - Bx_i)^2 + B^2(x_i - \bar{x}_U)^2 + 2(y_i - Bx_i)B(x_i - \bar{x}_U)\bigr]
= S_d^2 + B^2 S_x^2 + 2B S_{xd}.
\]
Then,
\[
V\bigl(\hat{t}_{yr}^{(2)}\bigr)
\approx N^2\Bigl(1 - \frac{n^{(1)}}{N}\Bigr)\frac{S_y^2}{n^{(1)}} + N^2\Bigl(1 - \frac{n^{(2)}}{n^{(1)}}\Bigr)\frac{S_d^2}{n^{(2)}}
= N^2\Bigl(1 - \frac{n^{(1)}}{N}\Bigr)\frac{2BS_{xd} + B^2 S_x^2 + S_d^2}{n^{(1)}} + N^2\Bigl(1 - \frac{n^{(2)}}{n^{(1)}}\Bigr)\frac{S_d^2}{n^{(2)}}
\]
\[
= N^2\Bigl(1 - \frac{n^{(1)}}{N}\Bigr)\frac{2BS_{xd} + B^2 S_x^2}{n^{(1)}} + N^2\Bigl(1 - \frac{n^{(2)}}{N}\Bigr)\frac{S_d^2}{n^{(2)}}.
\]

12.14 (a)
\[
z_i^{(1)} = \frac{\partial \hat{t}_{yr}^{(2)}}{\partial w_i^{(1)}} = x_i \frac{\hat{t}_y^{(2)}}{\hat{t}_x^{(2)}}
\]
and
\[
z_i^{(2)} = \frac{\partial \hat{t}_{yr}^{(2)}}{\partial w_i^{(2)}}
= \hat{t}_x^{(1)}\Biggl[\frac{y_i}{\hat{t}_x^{(2)}} - \frac{x_i \hat{t}_y^{(2)}}{\hat{t}_x^{(2)}\hat{t}_x^{(2)}}\Biggr]
= \frac{\hat{t}_x^{(1)}}{\hat{t}_x^{(2)}}\bigl[y_i - x_i \hat{B}^{(2)}\bigr],
\]
where $\hat{B}^{(2)} = \hat{t}_y^{(2)}/\hat{t}_x^{(2)}$. Thus,
\[
\hat{V}_{DR}\bigl(\hat{t}_{yr}^{(2)}\bigr)
= \hat{V}\Biggl(\sum_{i\in S^{(1)}} w_i^{(1)} z_i^{(1)} + \sum_{i\in S^{(2)}} w_i z_i^{(2)}\Biggr)
= \hat{V}\Biggl(\sum_{i\in S^{(1)}} w_i^{(1)} x_i \hat{B}^{(2)} + \sum_{i\in S^{(2)}} w_i \frac{\hat{t}_x^{(1)}}{\hat{t}_x^{(2)}}\bigl[y_i - x_i \hat{B}^{(2)}\bigr]\Biggr).
\]

12.16 Note that
\[
E\bigl[\hat{V}_{HT}(\hat{t}_y^{(1)}) \mid Z\bigr]
= E\Biggl[\sum_{i\in S^{(2)}}\sum_{k\in S^{(2)}} \frac{\pi_{ik}^{(1)} - \pi_i^{(1)}\pi_k^{(1)}}{\pi_{ik}^{(1)}\pi_{ik}^{(2)}} \frac{y_i y_k}{\pi_i^{(1)}\pi_k^{(1)}} \,\Big|\, Z\Biggr]
= E\Biggl[\sum_{i\in S^{(1)}}\sum_{k\in S^{(1)}} D_i D_k \frac{\pi_{ik}^{(1)} - \pi_i^{(1)}\pi_k^{(1)}}{\pi_{ik}^{(1)}\pi_{ik}^{(2)}} \frac{y_i y_k}{\pi_i^{(1)}\pi_k^{(1)}} \,\Big|\, Z\Biggr]
\]
\[
= \sum_{i\in S^{(1)}}\sum_{k\in S^{(1)}} \frac{\pi_{ik}^{(1)} - \pi_i^{(1)}\pi_k^{(1)}}{\pi_{ik}^{(1)}} \frac{y_i y_k}{\pi_i^{(1)}\pi_k^{(1)}}
= \hat{V}_{HT}^{(1)}\bigl(\hat{t}_y^{(1)}\bigr).
\]

12.17 (a) Since $m_h = \nu_h n_h$ and $\nu_h$ is known,
\[
E[m_h] = \nu_h E[n_h] = \nu_h E\Biggl[\sum_{i=1}^{N} Z_i x_{ih}\Biggr] = \nu_h \sum_{i=1}^{N} \frac{n}{N} x_{ih} = n \nu_h W_h.
\]
Thus,
\[
E[C] = cn + n\sum_{h=1}^{H} c_h \nu_h W_h.
\]
(b) Using the constraint, set
\[
n = \frac{E[C]}{c + \sum_{h=1}^{H} c_h \nu_h W_h}.
\]
Then
\[
V\bigl(\hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)
= S^2\Biggl[\frac{c + \sum_{h=1}^{H} c_h \nu_h W_h}{E[C]} - \frac{1}{N}\Biggr]
+ \frac{c + \sum_{h=1}^{H} c_h \nu_h W_h}{E[C]} \sum_{h=1}^{H} W_h S_h^2\Bigl(\frac{1}{\nu_h} - 1\Bigr)
\]
and
\[
\frac{\partial V\bigl(\hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)}{\partial \nu_k}
= \frac{S^2}{E[C]} c_k W_k
+ \frac{c_k W_k}{E[C]} \sum_{h=1}^{H} W_h S_h^2\Bigl(\frac{1}{\nu_h} - 1\Bigr)
- \frac{W_k S_k^2}{\nu_k^2\,E[C]}\Bigl(c + \sum_{h=1}^{H} c_h \nu_h W_h\Bigr).
\]
Setting the derivatives equal to 0 (and multiplying through by $\nu_k E[C]$), we have
\[
S^2 c_k \nu_k W_k + c_k \nu_k W_k \sum_{h=1}^{H} W_h S_h^2\Bigl(\frac{1}{\nu_h} - 1\Bigr)
- \frac{W_k S_k^2}{\nu_k}\Bigl(c + \sum_{h=1}^{H} c_h \nu_h W_h\Bigr) = 0 \tag{*}
\]
for $k = 1, \dots, H$. Summing over $k$,
\[
0 = \sum_{k=1}^{H} c_k \nu_k W_k \Biggl[S^2 + \sum_{h=1}^{H} W_h S_h^2\Bigl(\frac{1}{\nu_h} - 1\Bigr)\Biggr]
- \sum_{k=1}^{H} \frac{W_k S_k^2}{\nu_k}\Bigl(c + \sum_{h=1}^{H} c_h \nu_h W_h\Bigr)
= \sum_{k=1}^{H} c_k \nu_k W_k \Biggl[S^2 - \sum_{h=1}^{H} W_h S_h^2\Biggr] - c\sum_{k=1}^{H} \frac{W_k S_k^2}{\nu_k},
\]
and
\[
\sum_{h=1}^{H} \frac{W_h S_h^2}{\nu_h} = \frac{1}{c}\Biggl(S^2 - \sum_{h=1}^{H} W_h S_h^2\Biggr)\sum_{k=1}^{H} c_k \nu_k W_k.
\]
Substituting into (*), we have
\[
0 = c_k \nu_k W_k \Biggl[S^2 - \sum_{h=1}^{H} W_h S_h^2 + \frac{1}{c}\Biggl(S^2 - \sum_{h=1}^{H} W_h S_h^2\Biggr)\sum_{h=1}^{H} c_h \nu_h W_h\Biggr]
- \frac{W_k S_k^2}{\nu_k}\Bigl(c + \sum_{h=1}^{H} c_h \nu_h W_h\Bigr)
\]
\[
= c_k \nu_k W_k \Biggl(S^2 - \sum_{h=1}^{H} W_h S_h^2\Biggr)\frac{1}{c}\Bigl(c + \sum_{h=1}^{H} c_h \nu_h W_h\Bigr)
- \frac{W_k S_k^2}{\nu_k}\Bigl(c + \sum_{h=1}^{H} c_h \nu_h W_h\Bigr),
\]
which implies that
\[
\frac{c_k \nu_k^2}{c}\Biggl(S^2 - \sum_{h=1}^{H} W_h S_h^2\Biggr) = S_k^2,
\]
and, consequently,
\[
\nu_k = \sqrt{\frac{c\,S_k^2}{c_k\bigl(S^2 - \sum_{h=1}^{H} W_h S_h^2\bigr)}}.
\]
(c) To meet the expected cost constraint with the optimal allocation, set
\[
n = \frac{E[C]}{c + \sum_{h=1}^{H} c_h \nu_h^* W_h},
\qquad\text{with}\qquad
\nu_h^* = \sqrt{\frac{c\,S_h^2}{c_h\bigl(S^2 - \sum_{j=1}^{H} W_j S_j^2\bigr)}}.
\]

12.18 Let $A = S_y^2 - \sum_{h=1}^{H} W_h S_h^2$. Then, from (12.17),
\[
\nu_{h,\mathrm{opt}} = \sqrt{\frac{c^{(1)} S_h^2}{c_h\bigl(S^2 - \sum_{j=1}^{H} W_j S_j^2\bigr)}} = \sqrt{\frac{c^{(1)} S_h^2}{c_h A}}
\]
and
\[
n_{\mathrm{opt}}^{(1)} = \frac{C^*}{c^{(1)} + \sum_{h=1}^{H} c_h W_h \nu_{h,\mathrm{opt}}}
= \frac{C^*}{c^{(1)} + \sum_{h=1}^{H} W_h S_h \sqrt{c_h}\,\sqrt{c^{(1)}/A}}.
\]
Then,
\[
V_{\mathrm{opt}}\bigl(\hat{\bar{y}}_{\mathrm{str}}^{(2)}\bigr)
= \frac{S_y^2}{n_{\mathrm{opt}}^{(1)}} - \frac{S_y^2}{N} + \frac{1}{n_{\mathrm{opt}}^{(1)}}\sum_{h=1}^{H} W_h S_h^2\Bigl(\frac{1}{\nu_{h,\mathrm{opt}}} - 1\Bigr)
\]
\[
= \frac{S_y^2}{n_{\mathrm{opt}}^{(1)}} - \frac{S_y^2}{N}
+ \frac{1}{n_{\mathrm{opt}}^{(1)}}\Biggl(\sum_{h=1}^{H} W_h S_h \sqrt{c_h}\,\sqrt{\frac{A}{c^{(1)}}} - \sum_{h=1}^{H} W_h S_h^2\Biggr)
= \frac{S_y^2}{n_{\mathrm{opt}}^{(1)}} - \frac{S_y^2}{N}
+ \frac{1}{n_{\mathrm{opt}}^{(1)}}\Biggl(\sum_{h=1}^{H} W_h S_h \sqrt{c_h}\,\sqrt{\frac{A}{c^{(1)}}} + A - S_y^2\Biggr)
\]
\[
= \frac{1}{n_{\mathrm{opt}}^{(1)}}\Biggl(\sum_{h=1}^{H} W_h S_h \sqrt{c_h}\,\sqrt{\frac{A}{c^{(1)}}} + A\Biggr) - \frac{S_y^2}{N}
= \frac{1}{C^*}\Biggl(c^{(1)} + \sum_{h=1}^{H} W_h S_h \sqrt{c_h}\,\sqrt{\frac{c^{(1)}}{A}}\Biggr)\Biggl(\sum_{h=1}^{H} W_h S_h \sqrt{c_h}\,\sqrt{\frac{A}{c^{(1)}}} + A\Biggr) - \frac{S_y^2}{N}
\]
\[
= \frac{1}{C^*}\Biggl[\sqrt{c^{(1)} A}\sum_{h=1}^{H} W_h S_h\sqrt{c_h} + \Biggl(\sum_{h=1}^{H} W_h S_h\sqrt{c_h}\Biggr)^2
+ A c^{(1)} + \sqrt{c^{(1)} A}\sum_{h=1}^{H} W_h S_h\sqrt{c_h}\Biggr] - \frac{S_y^2}{N}
\]
\[
= \frac{1}{C^*}\Biggl[\sum_{h=1}^{H} W_h S_h\sqrt{c_h} + \sqrt{c^{(1)}}\,\sqrt{S_y^2 - \sum_{h=1}^{H} W_h S_h^2}\Biggr]^2 - \frac{S_y^2}{N}.
\]

12.19 The easiest way to solve this optimization problem is to use Lagrange multipliers. Using the variance in (12.10), the function we wish to minimize is
\[
g(n^{(1)}, n^{(2)}, \lambda)
= \Bigl(\frac{1}{n^{(1)}} - \frac{1}{N}\Bigr)S_y^2 + \Bigl(\frac{1}{n^{(2)}} - \frac{1}{n^{(1)}}\Bigr)S_d^2
- \lambda\bigl[C - c^{(1)}n^{(1)} - c^{(2)}n^{(2)}\bigr].
\]
Setting the partial derivatives with respect to $n^{(1)}$, $n^{(2)}$, and $\lambda$ equal to 0, we have
\[
\frac{\partial g}{\partial n^{(1)}} = -\frac{S_y^2}{[n^{(1)}]^2} + \frac{S_d^2}{[n^{(1)}]^2} + \lambda c^{(1)} = 0,
\qquad
\frac{\partial g}{\partial n^{(2)}} = -\frac{S_d^2}{[n^{(2)}]^2} + \lambda c^{(2)} = 0,
\]
and
\[
\frac{\partial g}{\partial \lambda} = -\bigl[C - c^{(1)}n^{(1)} - c^{(2)}n^{(2)}\bigr] = 0.
\]
Consequently, using the first two equations, we have
\[
\bigl[n^{(1)}\bigr]^2 = \frac{S_y^2 - S_d^2}{\lambda c^{(1)}}
\qquad\text{and}\qquad
\bigl[n^{(2)}\bigr]^2 = \frac{S_d^2}{\lambda c^{(2)}}.
\]
Taking the ratio gives
\[
\Biggl(\frac{n^{(2)}}{n^{(1)}}\Biggr)^2 = \frac{c^{(1)} S_d^2}{c^{(2)}\bigl(S_y^2 - S_d^2\bigr)},
\]
which proves the result.
12.20 (a) These results follow directly from the contingency table. For example,
\[
p_1 = \frac{N_1}{N} = \frac{C_{21}}{N} = \frac{C_{21}}{C_{2+}}\frac{C_{2+}}{N}
= \Bigl(1 - \frac{C_{22}}{C_{2+}}\Bigr)\frac{C_{2+}}{N} = (1 - S_2)p,
\qquad
p_2 = \frac{N_2}{N} = \frac{C_{22}}{N} = \frac{C_{22}}{C_{2+}}\frac{C_{2+}}{N} = S_2 p.
\]
The other results are shown similarly.
(b) From (12.19),
\[
\frac{V_{\mathrm{opt}}\bigl(\hat{p}_{\mathrm{str}}^{(2)}\bigr)}{V_{\mathrm{SRS}}(\hat{p})}
\approx \Biggl[\sum_{h=1}^{2} W_h \frac{S_h}{S_y}
+ \sqrt{\frac{c^{(1)}}{c^{(2)}}}\,\frac{\sqrt{S_y^2 - \sum_{h=1}^{2} W_h S_h^2}}{S_y}\Biggr]^2.
\]
Using the results from part (a),
\[
\sum_{h=1}^{2} W_h\frac{S_h}{S_y}
= \sqrt{\Bigl(\frac{N_1}{N}\Bigr)^2 \frac{p_1(1-p_1)}{p(1-p)}} + \sqrt{\Bigl(\frac{N_2}{N}\Bigr)^2 \frac{p_2(1-p_2)}{p(1-p)}}
= \sqrt{\frac{(1 - S_2)p(1-p)S_1}{p(1-p)}} + \sqrt{\frac{pS_2(1-p)(1 - S_1)}{p(1-p)}}
= \sqrt{(1 - S_2)S_1} + \sqrt{S_2(1 - S_1)}.
\]
For the second term, note that
\[
\sum_{i=1}^{N}(x_i - \bar{x}_U)(y_i - \bar{y}_U) = \sum_{i=1}^{N} x_i(y_i - p) = C_{22} - NpW_2,
\]
\[
\sum_{i=1}^{N}(x_i - \bar{x}_U)^2 = \sum_{i=1}^{N}(x_i - W_2)^2 = \sum_{i=1}^{N} x_i - NW_2^2 = NW_1W_2,
\]
and
\[
\sum_{i=1}^{N}(y_i - \bar{y}_U)^2 = \sum_{i=1}^{N}(y_i - p)^2 = \sum_{i=1}^{N} y_i - Np^2 = Np(1-p).
\]
Consequently,
\[
S_y R = \frac{(C_{22} - NpW_2)/N}{\sqrt{W_1W_2}} = \frac{p(S_2 - W_2)}{\sqrt{W_1W_2}}.
\]
Then
\[
S_y^2 - \sum_{h=1}^{2} W_h S_h^2
= p(1-p) - W_1p_1(1-p_1) - W_2p_2(1-p_2)
= W_1p_1^2 + W_2p_2^2 - p^2
= \frac{(1-S_2)^2p^2}{W_1} + \frac{S_2^2p^2}{W_2} - p^2
\]
\[
= \frac{p^2}{W_1W_2}\bigl[W_2(1-S_2)^2 + W_1S_2^2 - W_1W_2\bigr]
= \frac{p^2}{W_1W_2}\bigl[W_2^2 - 2W_2S_2 + S_2^2\bigr]
= \frac{p^2}{W_1W_2}\,[S_2 - W_2]^2
= S_y^2R^2.
\]

(c) The following calculations were done in Excel.

                      Cost ratio
  S1      0.0001   0.01    0.1     0.5     1
  0.5     1.00     1.02    1.06    1.15    1.21
  0.6     0.97     1.02    1.15    1.42    1.64
  0.7     0.85     0.93    1.15    1.61    2.01
  0.8     0.65     0.76    1.04    1.68    2.25
  0.9     0.37     0.48    0.78    1.53    2.25
  0.95    0.20     0.28    0.54    1.23    1.92
12.21 Suppose that $S$ is a subset of $m$ units from $U$, and suppose $S$ has $m_h$ units from stratum $h$, for $h = 1, \dots, H$. Let $Z_i = 1$ if unit $i$ is selected to be in the final sample and 0 otherwise; similarly, let $F_i = 1$ if unit $i$ is selected to be in the stratified sample and 0 otherwise, and let $D_i = 1$ if unit $i$ is selected to be in the subsample chosen from the stratified sample and 0 otherwise. Then the probability that $S$ is chosen to be the sample is
\[
P(S) = P(Z_i = 1,\ i \in S, \text{ and } Z_i = 0,\ i \notin S).
\]
We can write $P(S)$ as
\[
P(S) = P(F_i = 1,\ i \in S)\,P(D_i = 1,\ i \in S \mid F_1, \dots, F_N).
\]
Then,
\[
P(F_i = 1,\ i \in S)
= \frac{\binom{m_1}{m_1}\binom{N_1 - m_1}{n_1 - m_1}\binom{m_2}{m_2}\binom{N_2 - m_2}{n_2 - m_2}\cdots\binom{m_H}{m_H}\binom{N_H - m_H}{n_H - m_H}}
{\binom{N_1}{n_1}\binom{N_2}{n_2}\cdots\binom{N_H}{n_H}}
= \frac{\text{total number of stratified samples containing } S}{\text{number of possible stratified samples}}.
\]
Also,
\[
P(D_i = 1,\ i \in S \mid \mathbf{F})
= P(M_h = m_h,\ h = 1,\dots,H)\,P(D_i = 1,\ i\in S \mid \mathbf{M} = \mathbf{m})
= \frac{\binom{N_1}{m_1}\cdots\binom{N_H}{m_H}}{\binom{N}{m}} \cdot \frac{1}{\binom{n_1}{m_1}\binom{n_2}{m_2}\cdots\binom{n_H}{m_H}}.
\]
Note that for each $h$,
\[
\frac{\binom{N_h - m_h}{n_h - m_h}\binom{N_h}{m_h}}{\binom{N_h}{n_h}\binom{n_h}{m_h}}
= \frac{(N_h - m_h)!}{(n_h - m_h)!(N_h - n_h)!}\cdot\frac{N_h!}{m_h!(N_h - m_h)!}\cdot\frac{n_h!(N_h - n_h)!}{N_h!}\cdot\frac{m_h!(n_h - m_h)!}{n_h!} = 1.
\]
Consequently,
\[
P(S) = P(F_i = 1,\ i\in S)\,P(D_i = 1,\ i\in S \mid F_1,\dots,F_N)
= \frac{1}{\binom{N}{m}}\prod_{h=1}^{H}\frac{\binom{N_h - m_h}{n_h - m_h}\binom{N_h}{m_h}}{\binom{N_h}{n_h}\binom{n_h}{m_h}}
= \frac{1}{\binom{N}{m}},
\]
so this procedure results in an SRS of size $m$.


Chapter 13

Estimating Population Size

13.1 Students may answer this in several different ways. The maximum likelihood estimate is
\[
\hat{N} = \frac{n_1 n_2}{m} = \frac{(500)(300)}{120} = 1250
\]
with 95% CI (using the likelihood ratio method) of [1116, 1422]. A bootstrap CI is [1103, 1456].

xmat <- cbind(c(1,1,0),c(1,0,1))


y <- c(120,380,180)
captureci(xmat,y)

bootout <- captureboot(converty(y,xmat[,1],xmat[,2]),


crossprod(xmat[,1],y),nboot=999,nfunc=nmle)

13.2 (a) The maximum likelihood estimate is
\[
\hat{N} = \frac{n_1 n_2}{m} = \frac{(7)(12)}{4} = 21.
\]
Chapman's estimate is
\[
\tilde{N} = \frac{(n_1 + 1)(n_2 + 1)}{m + 1} - 1 = \frac{(8)(13)}{5} - 1 = 19.8.
\]
Because of the small sample sizes, we do not wish to employ a confidence interval that requires $\hat{N}$ or $\tilde{N}$ to be approximately normally distributed. Using the R function captureci gives $\hat{N} = 21$ with approximate 95% confidence interval [15.3, 47.2]. Alternatively, we could use the bootstrap to find an approximate confidence interval for $\tilde{N}$. (Theoretically, we could also use the bootstrap to find a confidence interval for $\hat{N}$ as well; in this data set, however, $m^*$ for resamples can be 0, so we only use the procedure with Chapman's estimator.) The bootstrap gives a 95% confidence interval [12, 51] for $N$, using Chapman's estimator.
Here is the code from R:
Here is the code from R:


xmat <- cbind(c(1,1,0),c(1,0,1))


y <- c(4,3,8)
captureci(xmat,y)

bootout <- captureboot(converty(y,xmat[,1],xmat[,2]),


crossprod(xmat[,1],y),nboot=999,nfunc=nchapman)
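Both point estimates above are one-line computations and easy to verify directly (a Python sketch, separate from the R code):

```python
# Check of Exercise 13.2(a): MLE and Chapman's bias-corrected estimate.
n1, n2, m = 7, 12, 4
N_mle = n1 * n2 / m                              # maximum likelihood estimate
N_chapman = (n1 + 1) * (n2 + 1) / (m + 1) - 1    # Chapman's estimate
```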

(b) N̂ = 27.6 with approximate 95% confidence interval [24.1, 37.7].


(c) You are assuming that the two samples are independent. This means that fishers
are equally likely to be captured on each occasion.
13.3 We obtain N̂ = 65.4 with 95% CI [60.9, 74.4].
Here is the code from R:

xmat <- cbind(c(1,1,0),c(1,0,1))


y <- c(33,15,12)
captureci(xmat,y)

13.4 (a) We treat the radio transmitter bears and feces sample bears as the two
samples to obtain N̂ = 483.8 with 95% CI [413.7, 599.0].

xmat <- cbind(c(1,1,0),c(1,0,1))


y <- c(36,311-36,20)
captureci(xmat,y)

(b) N̂ = 486.5 with 95% CI [392.0, 646.4].

xmat <- cbind(c(1,1,0),c(1,0,1))


y <- c(28,239-28,57-28)
captureci(xmat,y)

(c) N̂ = 450 with 95% CI [427, 480].

xmat <- cbind(c(1,1,0),c(1,0,1))


y <- c(165, 311-165,239-165)
captureci(xmat,y)

13.5 The model with all two-factor interactions has $G^2 = 3.3$, with 4 df. Comparing to a $\chi^2_4$ distribution gives a p-value of 0.502. No simpler model appears to fit the data. Using this model and function captureci in R, we estimate 3645 persons in the missing cell, with approximate 95% confidence interval [2804, 4725].

13.6 (a) $\hat{N} = 336$, with 95% CI [288, 408]. $\tilde{N} = 333$ with 95% CI [273, 428]. The linearization-based standard errors are
\[
\mathrm{SE}(\hat{N}) = \sqrt{\frac{n_1^2 n_2 (n_2 - m)}{m^3}} = 37
\]
and
\[
\mathrm{SE}(\tilde{N}) = \sqrt{\frac{(n_1 + 1)(n_2 + 1)(n_1 - m)(n_2 - m)}{(m + 1)^2(m + 2)}} = 29.
\]

xmat <- cbind(c(1,1,0),c(1,0,1))


y <- c(49,73,86)
captureci(xmat,y)

bootout <- captureboot(converty(y,xmat[,1],xmat[,2]),


crossprod(xmat[,1],y),nboot=999,nfunc=nchapman)

(b) The following SAS code may be used to obtain estimates for the models.

data hep;
input elist dlist tlist count;
datalines;
0 0 1 63
0 1 0 55
0 1 1 18
1 0 0 69
1 0 1 17
1 1 0 21
1 1 1 28
;

proc means data=hep sum;


var count;
run;

proc print data=hep;


run;

proc catmod data=hep;


weight count;
model elist*dlist*tlist = _response_ /pred=freq ml=nr;
loglin elist dlist tlist;
/* Model of independent factors */
run;

proc genmod data=hep;


CLASS elist dlist tlist / param=effect;
MODEL count = elist dlist tlist / dist=poisson link=log type3;
run;

proc catmod data=hep;


weight count;
model elist*dlist*tlist = _response_ /pred=freq ml;
loglin elist dlist tlist elist*dlist;
/* Model with elist and dlist dependent */
run;

proc catmod data=hep;


weight count;
model elist*dlist*tlist = _response_ /pred=freq ml;
loglin elist dlist tlist tlist*dlist;
/* Model with tlist and dlist dependent */
run;

proc catmod data=hep;


weight count;
model elist*dlist*tlist = _response_ /pred=freq ml;
loglin elist dlist tlist elist*tlist;
/* Model with elist and tlist dependent */
run;

proc catmod data=hep;


weight count;
model elist*dlist*tlist = _response_ /pred=freq;
loglin elist|dlist|tlist@2;
/* Model with all 2-way intrxns */
run;

13.7 (a) The assumption of independence of the two sources is probably met, at least approximately. The registry is from state and local health departments, while BDMP is from hospital data. Presumably, the health departments do not use hospital newborn discharge information when compiling their statistics. However, there might be a problem if congenital rubella syndrome is misclassified in both data sets, for instance, if both sources tend to miss cases.
We do not know how easily records were matched, but the paper said matching was not a problem.
The assumption of simple random sampling is probably not met. The BDMP was from a sample of hospitals, giving a cluster sample of records. In addition, selection of the hospitals for the BDMP was not random; hospitals were self-selected. It is unclear how much the absence of simple random sampling in this source affects the results.
(b)
  Year   Ñ
  1970   244.33
  1971   95
  1972   48
  1973   79.5
  1974   44.5
  1975   114
  1976   41.67
  1977   30.5
  1978   62.33
  1979   159
  1980   31.5
  1981   4
  1982   35
  1983   3
  1984   3
  1985   1
The sum of these estimates is 996.3333.
(c) Using the aggregated data, the total number of cases of congenital rubella syndrome between 1970 and 1985 is estimated to be
\[
\tilde{N} = (263 + 1)(93 + 1)/(19 + 1) - 1 = 1239.8.
\]
Equation (12.8) results in $\hat{V}(\tilde{N}) = 53{,}343$, and in an approximate 95% confidence interval $1240 \pm 1.96\sqrt{53{,}343} = [787, 1693]$. Using the bootstrap function gives a 95% confidence interval of [855, 1908].
The aggregated estimate should be more reliable: each yearly estimate is based on only a few cases and has large variability.
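The aggregated Chapman estimate and the normal-theory interval can be checked directly (a Python sketch; the variance estimate 53,343 is taken from the solution):

```python
# Check of Exercise 13.7(c): Chapman estimate and normal-theory CI, aggregated counts.
n1, n2, m = 263, 93, 19
N_tilde = (n1 + 1) * (n2 + 1) / (m + 1) - 1
v = 53343                        # variance estimate quoted in the solution
lo = N_tilde - 1.96 * v ** 0.5
hi = N_tilde + 1.96 * v ** 0.5
```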
(d) Many methods could be used to assess whether the incidence of congenital
rubella syndrome has changed, using these data. You could use a change-point test
or divide the data into two groups and test whether the incidence is the same in
both. The plot below shows the relation between year and log(Ñ ), estimated in the
table in part(b). The decrease after 1980 is apparent.
[Figure: log(Ñ) plotted against year, 1970–1985.]

13.8
  Model          G²     df   p-value
  Independence   11.1   3    0.011
  1*2            2.8    2    0.250
  1*3            10.7   2    0.005
  2*3            9.4    2    0.009
The model with interaction between sample 1 and sample 2 appears to fit well. Using that model, we estimate $\hat{N} = 2378$ with approximate 95% confidence interval [2142, 2664].
13.9
A positive interaction between presence in sample 1 and presence in sample 2 (as
there is) suggests that some fish are “trap-happy”—they are susceptible to repeated
trapping. An interaction between presence in sample 1 and presence in sample 3
might mean that the fin clipping makes it easier or harder to catch the fish with the
net.
13.10 (a) The maximum likelihood estimate is N̂ = 73.1 and Chapman’s estimate
is Ñ = 70.6. A 95% confidence interval for N , using N̂ and the function captureci, is
[55.4, 124.1]. Another approximate 95% confidence interval for N , using Chapman’s
estimate and the bootstrap, is [52.7, 127.8].

13.12 For the data in Example 13.1, $\hat{p} = \dfrac{20}{100} = \dfrac{1}{5}$ and a 95% confidence interval for $p$ is
\[
0.2 \pm 1.96\sqrt{\frac{(0.2)(0.8)}{100}} = [0.12, 0.28].
\]
The confidence limits $L(\hat{p})$ and $U(\hat{p})$ satisfy
\[
P\{L(\hat{p}) \le p \le U(\hat{p})\} = 0.95.
\]
Because $N = n_1/p$, we can write
\[
P\{n_1/U(\hat{p}) \le N \le n_1/L(\hat{p})\} = 0.95.
\]
Thus, a 95% confidence interval for $N$ is $[n_1/U(\hat{p}),\ n_1/L(\hat{p})]$; for these data, the interval is [718, 1645]. The interval is comparable to those from the inverted chi-square tests and bootstrap; like them, it is not symmetric.
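The inversion can be checked numerically; in Example 13.1, $n_1 = 200$ fish are marked (the value implied by the quoted interval), so a Python sketch is:

```python
# Check of Exercise 13.12: invert the CI for p = m/n2 into a CI for N = n1/p.
n1, n2, m = 200, 100, 20         # n1 marked; recapture sample of n2 contains m marked
p_hat = m / n2
half = 1.96 * (p_hat * (1 - p_hat) / n2) ** 0.5
L, U = p_hat - half, p_hat + half
N_lo, N_hi = n1 / U, n1 / L      # larger p maps to smaller N, so the limits swap
```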
13.13 Note that
\[
L(N-1 \mid n_1, n_2)
= \frac{\binom{n_1}{m}\binom{N - 1 - n_1}{n_2 - m}}{\binom{N-1}{n_2}}
= \frac{\binom{n_1}{m}\binom{N - n_1}{n_2 - m}}{\binom{N}{n_2}}\cdot\frac{N - n_1 - (n_2 - m)}{N - n_1}\cdot\frac{N}{N - n_2}
= L(N \mid n_1, n_2)\,\frac{N - n_1 - n_2 + m}{N - n_1}\cdot\frac{N}{N - n_2}.
\]
Thus, if $N > n_1$ and $N > n_2$,
\[
L(N) \ge L(N-1)
\iff \frac{N - n_1 - n_2 + m}{N - n_1}\cdot\frac{N}{N - n_2} \le 1
\iff mN \le n_1 n_2.
\]
Take $\hat{N}$ to be the integer part of $n_1 n_2/m$. Then for any integer $k \le \hat{N}$,
\[
mk \le m\frac{n_1 n_2}{m} = n_1 n_2,
\]
so $L(k) \ge L(k-1)$ for $k \le \hat{N}$. Similarly, for $k > \hat{N}$ ($k$ integer),
\[
mk > m\frac{n_1 n_2}{m} > n_1 n_2,
\]
so $L(k) < L(k-1)$ for $k > \hat{N}$. Thus $\hat{N}$ is the maximum likelihood estimator of $N$.
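The argument can be confirmed numerically by maximizing the hypergeometric likelihood over $N$ on a grid; the counts below are made up for illustration (chosen so $n_1 n_2/m$ is not an integer, which keeps the maximizer unique):

```python
from math import comb

# Numeric check of Exercise 13.13: L(N) = C(n1,m) C(N-n1, n2-m) / C(N,n2)
# is maximized at the integer part of n1*n2/m.
def likelihood(N, n1, n2, m):
    return comb(n1, m) * comb(N - n1, n2 - m) / comb(N, n2)

n1, n2, m = 150, 100, 23         # illustrative counts, not from the exercise
Ns = range(n1 + n2 - m, 2001)    # N must be at least n1 + n2 - m
N_mle = max(Ns, key=lambda N: likelihood(N, n1, n2, m))
```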
13.14 (a) Setting the derivative equal to zero, we have
\[
m(\hat{N} - n_1) = (n_2 - m)n_1,
\qquad\text{or}\qquad
\hat{N} = \frac{n_1 n_2}{m}.
\]
Note that the second derivative is
\[
\frac{d^2 \log L(N)}{dN^2} = \frac{m}{N^2} - \frac{(n_2 - m)n_1(2N - n_1)}{N^2(N - n_1)^2}
= \frac{mN^2 - 2Nn_1n_2 + n_1^2n_2}{N^2(N - n_1)^2},
\]
which is negative when evaluated at $\hat{N}$.
(b) Noting that $E[m] = n_1n_2/N$, the Fisher information is
\[
I(N) = -E_m\biggl[\frac{\partial^2}{\partial N^2}\log L(N; m, n_1, n_2)\biggr]
= -E_m\biggl[\frac{mN^2 - 2Nn_1n_2 + n_1^2n_2}{N^2(N - n_1)^2}\biggr]
= -\frac{Nn_1n_2 - 2Nn_1n_2 + n_1^2n_2}{N^2(N - n_1)^2}
= \frac{n_1n_2}{N^2(N - n_1)}.
\]
Consequently, the asymptotic variance of $\hat{N}$ is
\[
V(\hat{N}) = \frac{1}{I(N)} = \frac{N^2(N - n_1)}{n_1n_2}.
\]

13.15 Substituting $C - n_1$ for $n_2$ in the variance, we have
\[
g(n_1) = V(\hat{N}) = \frac{N^2(N - n_1)}{n_1(C - n_1)}.
\]
Taking the derivative,
\[
\frac{dg}{dn_1} = -\frac{N^2}{n_1(C - n_1)} - \frac{N^2(N - n_1)(C - 2n_1)}{n_1^2(C - n_1)^2}
= -\frac{N^2}{n_1^2(C - n_1)^2}\bigl[n_1(C - n_1) + (N - n_1)(C - 2n_1)\bigr].
\]
Equating the derivative to 0 gives
\[
n_1^2 - 2Nn_1 + NC = 0,
\qquad\text{or}\qquad
n_1 = \frac{2N \pm \sqrt{4N^2 - 4NC}}{2}.
\]
Since $n_1 \le N$, we take
\[
n_1 = N - \sqrt{N(N - C)}
\qquad\text{and}\qquad
n_2 = C - N + \sqrt{N(N - C)}.
\]

13.16 (a) $X$ is hypergeometric;
\[
P(X = m) = \frac{\binom{n_1}{m}\binom{N - n_1}{n_2 - m}}{\binom{N}{n_2}}.
\]
(b) In the following, we let $q = n_2 + 1$.
\[
E[\tilde{N}] = E\biggl[\frac{(n_1 + 1)(n_2 + 1)}{X + 1} - 1\biggr]
= \sum_{m=0}^{n_2} \frac{\binom{n_1}{m}\binom{N - n_1}{n_2 - m}}{\binom{N}{n_2}}\cdot\frac{(n_1 + 1)(n_2 + 1)}{m + 1} - 1
\]
\[
= (N + 1)\sum_{m=0}^{n_2} \frac{\binom{n_1 + 1}{m + 1}\binom{N + 1 - (n_1 + 1)}{n_2 + 1 - (m + 1)}}{\binom{N + 1}{n_2 + 1}} - 1
= (N + 1)\sum_{k=1}^{q} \frac{\binom{n_1 + 1}{k}\binom{N - n_1}{q - k}}{\binom{N + 1}{q}} - 1
\]
\[
= (N + 1)\Biggl[\sum_{k=0}^{q} \frac{\binom{n_1 + 1}{k}\binom{N - n_1}{q - k}}{\binom{N + 1}{q}} - \frac{\binom{N - n_1}{q}}{\binom{N + 1}{q}}\Biggr] - 1.
\]
The first term inside the brackets is $\sum_k P(Y = k)$ for $Y$ a hypergeometric random variable; it thus equals 1. If $n_2 \ge N - n_1$, then $q > N - n_1$ and $\binom{N - n_1}{q} = 0$. Hence, if $n_2 \ge N - n_1$, $E[\tilde{N}] = N$.
Chapter 14

Rare Populations and Small Area Estimation
Area Estimation
X
ri yi
i2S
14.2 (a) Note that ȳ1 = X1
, so using properties of ratio estimation [see equa-
ri
i2S1
tion (4.10)],
N1
X
1 1 S12
V (ȳ1 ) º (ri yi ° ri ȳ1U )2
= (M1 ° 1)S 2
1 º .
n1 (N1 ° 1)p21 i=1 n1 (N1 ° 1)p21 n1 p1

Consequently,
V (ȳˆd ) = A2 V (ȳ1 ) + (1 ° A)2 V (ȳ2 )
A2 S12 (1 ° A)2 S22
º + .
n1 p1 n2 p2

N1 p1 N2 p2
(b) With these assumptions, we have A = , 1°A = , n1 p1 = kf2 p1 N1 ,
Np Np
n2 p2 = f2 p2 N2 , and
µ ∂
S12 A2 (1 ° A)2
V (ȳˆd ) º +
f2 kN1 p1 N2 p2
"µ ∂2 µ ∂ #
N1 p1 1 N2 p2 2 1
= S1
2
+
Np kf2 N1 p1 Np f2 N1 p1
µ ∂
S12 N1 p1
= + N2 p2 .
(N p) f2
2 k
The constraint on the sample size is n = n1 + n2 = kf2 N1 + f2 N2 ; solving for f2 ,
we have
n
f2 = .
N1 k + N2

247
248 CHAPTER 14. RARE POPULATIONS AND SMALL AREA ESTIMATION

Consequently, we wish to minimize


µ ∂
S12 N1 p1
V (ȳˆd ) º + N2 p2
(N p)2 f2 k
µ ∂
S12 N1 p1
= (N1 k + N2 ) + N2 p2 ,
(N p)2 n k

or, equivalently, to minimize


µ ∂
N1 p1
g(k) = (N1 k + N2 ) + N 2 p2 .
k

Setting the derivative


dg N1 N2 p1
= N1 N2 p2 °
dk k2
to 0 gives k 2 = p1 /p2 .
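The claimed optimum can be confirmed numerically with made-up population values (a Python sketch; N1, N2, p1, p2 are illustrative, not from the exercise):

```python
# Numeric check of Exercise 14.2(b): g(k) = (N1*k + N2)(N1*p1/k + N2*p2)
# is minimized at k = sqrt(p1/p2).
N1, N2, p1, p2 = 1000.0, 9000.0, 0.4, 0.05   # illustrative values

def g(k):
    return (N1 * k + N2) * (N1 * p1 / k + N2 * p2)

k_opt = (p1 / p2) ** 0.5
ks = [0.01 * i for i in range(1, 1001)]      # grid search over (0, 10]
k_best = min(ks, key=g)
```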
14.3 (a) The estimator is unbiased because each component is unbiased for its respective population quantity. The variance formula follows because the random variables for inclusion in sample A are independent of the random variables for inclusion in sample B.
(b) We take the derivative of the variance with respect to $\theta$:
\[
\frac{d}{d\theta}V(\hat{t}_{y,\theta})
= \frac{d}{d\theta}\Bigl\{V\bigl[\hat{t}_a^A\bigr] + \theta^2V\bigl[\hat{t}_{ab}^A\bigr] + 2\theta\operatorname{Cov}\bigl[\hat{t}_a^A, \hat{t}_{ab}^A\bigr]
+ (1 - \theta)^2V\bigl[\hat{t}_{ab}^B\bigr] + V\bigl[\hat{t}_b^B\bigr] + 2(1 - \theta)\operatorname{Cov}\bigl[\hat{t}_b^B, \hat{t}_{ab}^B\bigr]\Bigr\}
\]
\[
= 2\theta V\bigl[\hat{t}_{ab}^A\bigr] + 2\operatorname{Cov}\bigl[\hat{t}_a^A, \hat{t}_{ab}^A\bigr]
- 2(1 - \theta)V\bigl[\hat{t}_{ab}^B\bigr] - 2\operatorname{Cov}\bigl[\hat{t}_b^B, \hat{t}_{ab}^B\bigr].
\]
Setting the derivative equal to 0 and solving gives the optimal value of $\theta$.
14.7 (a) We write
\[
\tilde{\theta}_d(a) - \theta_d = a(x_d^T\beta + v_d + e_d) + (1 - a)x_d^T\beta - (x_d^T\beta + v_d).
\]
Then
\[
E[\tilde{\theta}_d(a) - \theta_d] = E[a(v_d + e_d) - v_d] = 0.
\]
(b)
\[
V[\tilde{\theta}_d(a) - \theta_d] = E\bigl\{[a(v_d + e_d) - v_d]^2\bigr\} = (a - 1)^2\sigma_v^2 + a^2\psi_d
\]
since $e_d$ and $v_d$ are independent. Then
\[
\frac{d}{da}\bigl[(a - 1)^2\sigma_v^2 + a^2\psi_d\bigr] = 2(a - 1)\sigma_v^2 + 2a\psi_d;
\]
setting this equal to 0 and solving for $a$ gives $a = \sigma_v^2/(\sigma_v^2 + \psi_d) = \alpha_d$. The minimum variance achieved is
\[
V[\tilde{\theta}_d(\alpha_d) - \theta_d] = (\alpha_d - 1)^2\sigma_v^2 + \alpha_d^2\psi_d
= \Bigl(\frac{\psi_d}{\sigma_v^2 + \psi_d}\Bigr)^2\sigma_v^2 + \Bigl(\frac{\sigma_v^2}{\sigma_v^2 + \psi_d}\Bigr)^2\psi_d
= \frac{\psi_d^2\sigma_v^2 + \psi_d\sigma_v^4}{(\sigma_v^2 + \psi_d)^2}
= \frac{\psi_d\sigma_v^2(\psi_d + \sigma_v^2)}{(\sigma_v^2 + \psi_d)^2}
= \alpha_d\psi_d.
\]

14.8 Here is SAS code for construction the population and samples:

options ls=78 nodate nocenter;

data domainpop;
do strat = 1 to 20;
do psu = 1 to 4;
do j = 1 to 3;
y = strat;
dom = 1;
output;
end; end;
do psu = 5 to 8;
do j = 1 to 3;
y = strat;
dom = 2;
output;
end; end;
end;

proc sort data=domainpop;


by strat psu;

proc print data=domainpop;


run;

/* Select SRS of 2 psus from each stratum */

data psuid;
do strat = 1 to 20;
do psu = 1 to 8;

output;
end; end;

proc surveyselect data=psuid out=psusamp1 sampsize=2 seed=425;


strata strat;

proc sort data=psusamp1;


by strat psu;

proc print data=psusamp1;


run;

/* Merge back with data */

data samp1 ;
merge psusamp1 (in=Insample) domainpop ;
/* When a data set contributes an observation for
the current BY group, the IN= value is 1. */
by strat psu;
if Insample ; /*delete obsns not in sample */
run;

proc print data=samp1;


run;

/* Here is the correct analysis */

proc surveymeans data=samp1 nobs mean sum clm clsum;


stratum strat;
cluster psu;
var y;
weight SamplingWeight;
domain dom;
run;

/*Now do incorrect analysis by deleting observations not in domain*/

data samp1d1;
set samp1;
if dom = 1;

proc surveymeans data=samp1d1 nobs mean sum clm clsum;


stratum strat;
cluster psu;
var y;

weight SamplingWeight;
run;

data samp1d2;
set samp1;
if dom = 2;

proc surveymeans data=samp1d2 nobs mean sum clm clsum;


stratum strat;
cluster psu;
var y;
weight SamplingWeight;
run;
Chapter 15

Survey Quality

15.2 This is a stratified sample, so we use formulas from stratified sampling to find $\hat{\phi}$ and $\hat{V}(\hat{\phi})$.

  Stratum        N_h     n_h    yes   φ̂_h     (N_h/N)·φ̂_h   (N_h²/N²)·((N_h−n_h)/N_h)·(s_h²/n_h)
  Undergrads     8972    900    123   0.1367   0.1077        7.34 × 10⁻⁵
  Graduates      1548    150    27    0.1800   0.0245        1.66 × 10⁻⁵
  Professional   860     80     27    0.3375   0.0255        1.47 × 10⁻⁵
  Total          11380   1130   177            0.1577        1.05 × 10⁻⁴

Thus $\hat{\phi} = 0.1577$ with $\hat{V}(\hat{\phi}) = 1.05\times 10^{-4}$. The probability $P$ that a person is asked the sensitive question is the probability that a red ball is drawn from the box, 30/50. Also,
\[
p_I = P(\text{white ball drawn} \mid \text{red ball not drawn}) = 4/20.
\]
Thus, using (12.10),
\[
\hat{p}_S = \frac{\hat{\phi} - (1 - P)p_I}{P} = \frac{0.1577 - (1 - 0.6)(0.2)}{0.6} = 0.130
\]
and
\[
\hat{V}(\hat{p}_S) = \frac{1.05\times 10^{-4}}{(0.6)^2} = 2.91\times 10^{-4},
\]
so the standard error is 0.017.
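The table and the follow-up computations can be reproduced with a few lines of arithmetic (a Python sketch, names ours):

```python
# Arithmetic check of Exercise 15.2 (a sketch from the quoted counts).
strata = [(8972, 900, 123), (1548, 150, 27), (860, 80, 27)]  # (N_h, n_h, # yes)
N = sum(Nh for Nh, nh, yh in strata)
phi_hat = sum((Nh / N) * (yh / nh) for Nh, nh, yh in strata)
v_phi = 0.0
for Nh, nh, yh in strata:
    ph = yh / nh
    s2 = ph * (1 - ph) * nh / (nh - 1)        # sample variance of 0/1 responses
    v_phi += (Nh / N) ** 2 * ((Nh - nh) / Nh) * s2 / nh
P, pI = 30 / 50, 4 / 20                        # P(red ball); P(white | not red)
p_s = (phi_hat - (1 - P) * pI) / P             # estimated sensitive proportion
v_ps = v_phi / P ** 2
se = v_ps ** 0.5
```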
15.3 (a)
\[
P(\text{"1"}) = P(\text{"1"} \mid \text{sensitive})\,p_s + P(\text{"1"} \mid \text{not sensitive})(1 - p_s)
= \theta_1 p_s + \theta_2(1 - p_s).
\]
(b) Let $\hat{p}$ be the proportion of respondents who report "1." Let
\[
\hat{p}_s = \frac{\hat{p} - \theta_2}{\theta_1 - \theta_2}.
\]
(We must have $\theta_1 \ne \theta_2$.)
(c) If an SRS is taken,
\[
V(\hat{p}_s) = \frac{1}{(\theta_1 - \theta_2)^2}V(\hat{p})
= \frac{1}{(\theta_1 - \theta_2)^2}\cdot\frac{p(1 - p)}{n - 1}.
\]
Appendix A: Probability Concepts Used in Sampling

A.1
\[
P(\text{match exactly 3 numbers}) = \frac{\binom{5}{3}\binom{30}{2}}{\binom{35}{5}} = \frac{(10)(435)}{324{,}632} = \frac{4350}{324{,}632},
\]
\[
P(\text{match at least 1 number}) = 1 - P(\text{match no numbers})
= 1 - \frac{\binom{5}{0}\binom{30}{5}}{\binom{35}{5}}
= 1 - \frac{142{,}506}{324{,}632} = \frac{182{,}126}{324{,}632}.
\]
A.2
\[
P(\text{no 7s}) = \frac{\binom{3}{0}\binom{5}{4}}{\binom{8}{4}} = \frac{5}{70},
\qquad
P(\text{exactly one 7}) = \frac{\binom{3}{1}\binom{5}{3}}{\binom{8}{4}} = \frac{30}{70},
\qquad
P(\text{exactly two 7s}) = \frac{\binom{3}{2}\binom{5}{2}}{\binom{8}{4}} = \frac{30}{70}.
\]
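Both exercises can be verified with exact integer combinatorics (a Python sketch using the standard library):

```python
from math import comb

# Check of Exercises A.1 and A.2 with exact binomial coefficients.
# A.1: lottery with 5 winning numbers out of 35
p3 = comb(5, 3) * comb(30, 2) / comb(35, 5)          # match exactly 3
p_at_least_1 = 1 - comb(30, 5) / comb(35, 5)          # match at least 1
# A.2: draw 4 of 8 tickets, 3 of which are 7s
p_no7 = comb(3, 0) * comb(5, 4) / comb(8, 4)
p_one7 = comb(3, 1) * comb(5, 3) / comb(8, 4)
p_two7 = comb(3, 2) * comb(5, 2) / comb(8, 4)
```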

A.3 Property 1: Let $Y = g(X)$. Then
\[
P(Y = y) = \sum_{x: g(x) = y} P(X = x),
\]
so, using (A.3),
\[
E[Y] = \sum_y yP(Y = y)
= \sum_y y\sum_{x: g(x) = y} P(X = x)
= \sum_y \sum_{x: g(x) = y} g(x)P(X = x)
= \sum_x g(x)P(X = x).
\]
Property 2: Using Property 1, let $g(x) = ax + b$. Then
\[
E[aX + b] = \sum_x (ax + b)P(X = x)
= a\sum_x xP(X = x) + b\sum_x P(X = x)
= aE[X] + b.
\]
Property 3: If $X$ and $Y$ are independent, then $P(X = x, Y = y) = P(X = x)P(Y = y)$ for all $x$ and $y$. Then
\[
E[XY] = \sum_x\sum_y xyP(X = x, Y = y)
= \sum_x\sum_y xyP(X = x)P(Y = y)
= \biggl[\sum_x xP(X = x)\biggr]\biggl[\sum_y yP(Y = y)\biggr]
= (EX)(EY).
\]
Property 4:
\[
\operatorname{Cov}[X, Y] = E[(X - EX)(Y - EY)]
= E[XY - Y(EX) - X(EY) + (EX)(EY)]
= E[XY] - (EX)(EY).
\]

Property 5: Using Property 4,
\[
\operatorname{Cov}\Biggl[\sum_{i=1}^{n}(a_iX_i + b_i), \sum_{j=1}^{m}(c_jY_j + d_j)\Biggr]
= E\Biggl[\sum_{i=1}^{n}\sum_{j=1}^{m}(a_iX_i + b_i)(c_jY_j + d_j)\Biggr]
- E\Biggl[\sum_{i=1}^{n}(a_iX_i + b_i)\Biggr]E\Biggl[\sum_{j=1}^{m}(c_jY_j + d_j)\Biggr]
\]
\[
= \sum_{i=1}^{n}\sum_{j=1}^{m}\bigl[a_ic_jE(X_iY_j) + a_id_jEX_i + b_ic_jEY_j + b_id_j\bigr]
- \sum_{i=1}^{n}\sum_{j=1}^{m}\bigl[a_iE(X_i) + b_i\bigr]\bigl[c_jE(Y_j) + d_j\bigr]
\]
\[
= \sum_{i=1}^{n}\sum_{j=1}^{m}a_ic_j\bigl[E(X_iY_j) - (EX_i)(EY_j)\bigr]
= \sum_{i=1}^{n}\sum_{j=1}^{m}a_ic_j\operatorname{Cov}[X_i, Y_j].
\]
Property 6: Using Property 4,
\[
V[X] = \operatorname{Cov}(X, X) = E[X^2] - (EX)^2.
\]
Property 7: Using Property 5,
\[
V[X + Y] = \operatorname{Cov}[X + Y, X + Y]
= \operatorname{Cov}[X, X] + \operatorname{Cov}[Y, X] + \operatorname{Cov}[X, Y] + \operatorname{Cov}[Y, Y]
= V[X] + V[Y] + 2\operatorname{Cov}[X, Y].
\]
(It follows from the definition of Cov that $\operatorname{Cov}[X, Y] = \operatorname{Cov}[Y, X]$.)
Property 8: From Property 7,
\[
V\Biggl[\frac{X}{\sqrt{V(X)}} + \frac{Y}{\sqrt{V(Y)}}\Biggr]
= V\Biggl[\frac{X}{\sqrt{V(X)}}\Biggr] + V\Biggl[\frac{Y}{\sqrt{V(Y)}}\Biggr]
+ 2\operatorname{Cov}\Biggl[\frac{X}{\sqrt{V(X)}}, \frac{Y}{\sqrt{V(Y)}}\Biggr]
= 2 + 2\operatorname{Corr}[X, Y].
\]
Since the variance on the left must be nonnegative, we have $2 + 2\operatorname{Corr}[X, Y] \ge 0$, which implies $\operatorname{Corr}[X, Y] \ge -1$. Similarly, the relation
\[
0 \le V\Biggl[\frac{X}{\sqrt{V(X)}} - \frac{Y}{\sqrt{V(Y)}}\Biggr] = 2 - 2\operatorname{Corr}[X, Y]
\]
implies that $\operatorname{Corr}[X, Y] \le 1$.


A.4 Note that $Z_i^2 = Z_i$, so $E[Z_i^2] = E[Z_i] = n/N$. Thus,
\[
V[Z_i] = E[Z_i^2] - [E(Z_i)]^2 = \frac{n}{N} - \Bigl(\frac{n}{N}\Bigr)^2 = \frac{n(N - n)}{N^2}.
\]
\[
\operatorname{Cov}[Z_i, Z_j] = E[Z_iZ_j] - (EZ_i)(EZ_j)
= \frac{n(n - 1)}{N(N - 1)} - \Bigl(\frac{n}{N}\Bigr)^2
= \frac{n(n - 1)N - n^2(N - 1)}{N^2(N - 1)}
= -\frac{n(N - n)}{N^2(N - 1)}.
\]
A.5
\[
\operatorname{Corr}[\bar{x}, \bar{y}] = \frac{\operatorname{Cov}[\bar{x}, \bar{y}]}{\sqrt{V[\bar{x}]\,V[\bar{y}]}}
= \frac{\frac{1}{n}\bigl(1 - \frac{n}{N}\bigr)RS_xS_y}
{\sqrt{\bigl[\frac{1}{n}\bigl(1 - \frac{n}{N}\bigr)S_x^2\bigr]\bigl[\frac{1}{n}\bigl(1 - \frac{n}{N}\bigr)S_y^2\bigr]}}
= R.
\]
