0% found this document useful (0 votes)

391 views

STAT501 Complex Survey

This document examines and compares several software packages for analyzing complex survey sample data, including SAS, Stata, R, SUDAAN, SPSS, WesVar, AM, and IVEware. It finds that while SAS and SPSS are dominant in their domains, they are not optimal for serious complex survey analysis. Stata and R are found to be the most technically capable and economically viable options, as they offer commercial-grade power, flexibility, and sophistication at low or no cost. The document tests the software on various sampling designs and datasets, finding that Stata and R perform well on most tasks, including very complex designs that challenge other packages. It concludes that Stata and R should be the recommended packages for

Uploaded by

Percy Soto Becerra

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

391 views

STAT501 Complex Survey

Uploaded by

Percy Soto Becerra

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 104

Strong, Handy, and Cheap:

The New Generation of Complex Survey Analysis Software

by Jack Wiedrick

a thesis in statistical literature and problems

submitted in partial fulfillment of requirements for
the Master of Science in Statistics
at Portland State University

Spring 2014

Under the direction of Dr Robert Fountain

ABSTRACT

We examine the performance and characteristics of the following complex survey sample analysis
software packages: SAS, Stata, R, SUDAAN, SPSS, WesVar, AM, IVEware. (Due to cost considerations,
SUDAAN and SPSS results were not obtained directly but through product documentation and published
accounts; as a consequence, SUDAAN and SPSS performance measures are not available for all tasks
considered in this paper.) Where appropriate, each package is asked to conduct an analysis, and the
results of the analysis are compared to those of other packages and, when feasible, to reference results
obtained using standard methods. We go into some detail about the technical challenges of complex
sampling analysis and examine the need for software with specialized capability in this area. The primary
mode of investigation involves exploring and analyzing the behavior of candidate software packages on
a variety of datasets with complex sampling designs. Some of the datasets contain deliberate errors or
inconsistent information for the purpose of measuring the software's response to unexpected input.
Among the designs considered are complex multistage plans involving stratification and clustering,
including one with a very complex plan that proved too difficult for several of the packages under
consideration. We also examine the utility of the software for the application of replicate-based
methods of variance estimation and postsampling tasks. The following conclusions are made: 1) SAS and
SPSS, while currently dominant in their respective domains, are not optimal choices for the serious
analysis of complex survey sample data; 2) Stata and R are both the most technically capable and the
most economically viable of the packages, and should be recommended for general use; 3) SUDAAN and
WesVar remain among the most competent packages in their respective domains, but the continuing
need for either package has waned due to the advanced capabilities and better usability features of
Stata and R; 4) AM and IVEware are interesting and feature-rich but not recommended for general use
due to lack of capabilities for analyzing very complex designs. Appendices provide: a full implementation
in R of standard textbook methods for complex survey sample analysis involving up to two stages of
complexity; sample code templates and process descriptions for all the packages under consideration;
version information and 2014 availability of each software package; descriptions and listings (or links to
downloadable copies) of all the datasets used in the investigation; and proofs of selected mathematical
assertions presented in the paper (very complicated proofs were deferred to published presentations).
CONTENTS

INTRODUCTION 07

METHODS AND CONSIDERATIONS 10

A MOTIVATING EXAMPLE 11
Simple Random Sampling With Replacment (SRSWR) 11
Simple Random Sampling Without Replacment (SRSWOR) 12
Simple Stratified Sampling 13
Simple Cluster Sampling 14
Stratified Cluster Sampling 15
Bias and Design Effects 17
Applying the Designs 18

GENERAL PROBABILITY SAMPLING 19

TESTING BENCHMARKS 22

TESTING PROBLEMATIC DATA 26

COMPARING ON SUDAAN DESIGNS 34

OPTIONS FOR VARIANCE ESTIMATION 44

Linearization 45
Balanced Repeated Replication (BRR) 47
Jackknife 49
Bootstrap 52

PUSHING THE LIMITS 55

WRAPPING THINGS UP 64

APPENDIX A: SOME COMPLEX SURVEY ANALYSIS IMPLEMENTED IN R 67

APPENDIX B: SAMPLE CODE FOR TYPICAL OPERATIONS 77

APPENDIX C: SOFTWARE VERSION INFORMATION AND AVAILABILITY 81

APPENDIX D: DATASETS 83

APPENDIX E: PROOFS OF SELECTED RESULTS 91

REFERENCES 99
INTRODUCTION

I recently ran across an article online that set me thinking. The title of the article (written in 2013 by Bob
Muenchen at r4stats.com, a site devoted to R and its role in the world of modern data analytics) was
"Will 2014 be the Beginning of the End for SAS and SPSS?", and the main thesis was that the established
paradigm of large, bloated, and expensive statistical computing environments is now being increasingly
challenged by a new crop of up-and-coming software packages operating under an entirely different
paradigm to offer commercial-grade power, research-grade sophistication, and development-grade
programming flexibility, and to do it all at minimal (or no) monetary cost to the user. Whereas the old
guard of statistical computing packages like SAS and SPSS pride themselves on rock-solid analysis and
thorough vetting and certification of all procedures on offer, the new guard, represented by Stata and R
(among others), favor rapid development and an agility of operation that allows them to increase in
power and sophistication at a much faster pace than SAS or SPSS. But even more stunning is that what
SAS and SPSS charge users thousands of dollars a year for, exclusively through annual licensing schemes
that require regular (and large) infusions of cash from the user base, Stata is doing for one-time fees on
the order of a few hundred dollars and R is doing completely for free. SAS and SPSS offer some limited
extensibility through the definition of macro procedures that are awkward to program and live in a
separate computing universe from the software core, but Stata and R (especially R) allow users to write
full-fledged programs that are easily and seamlessly integrated into the software environment. In fact,
most of R is written in R, and the open-source licensing means the complete codebase for the language
can be viewed, modified, and recompiled at will. SAS and SPSS offer updates on a timescale of years, but
Stata and R on a timescale of weeks. Simply everything about SAS and SPSS seems big and complicated,
with pages and pages of rules for a multitude of complicated and often interrelated options, but the
driving theme behind development in Stata and R is to keep the procedures small and well-defined, with
clear syntax and friendly behavior at all times. If SAS and SPSS are the mammoths of the statistical
software world sober and staid, with massive heft and power then Stata and R are the Cro-Magnon
hunters bringing them down with razor-sharp ingenuity and light but deftly crafted tools.

Many recent examinations into the qualities and differences among the various statistical software
packages on offer have come to the same conclusion. Mitchell [72] compared SAS, SPSS, and Stata on a
number of different axes, including cost, power, usability, availability, and support, and concluded that
where SAS and SPSS have strengths, they are strengths of continuity with tradition, such as support for
obsolete file formats and procedures like MANOVA that have been largely supplanted by the more
sophisticated multilevel modeling (MLM) approaches. A telling example from Mitchell's paper recounts
his experiences teaching SAS, SPSS, and Stata to undergraduates. Whereas a typical SAS or SPSS class
would get bogged down in procedural statements and require several lab helpers running around
putting out fires so the students could stay on track, a typical Stata class would require at most one lab
helper, who often had little to do because the students were easily able to follow along and self-correct
when they made errors. Stata is just simpler to learn and use. (It should be said that Mitchell also lumps
R into the hard-to-learn category, but that is not a fair comparison. R is a programming language, and to
learn it effectively the user must have some basic programming skills to begin with.) Along similar lines,
Acock [60], writing primarily for research practitioners in the social sciences, recommends Stata over
SAS, and especially over SPSS, for its logical and highly consistent command structure, greater technical
sophistication, and ease of extensibility. The reason why the new paradigm is pushing out the old one is
because the new paradigm is better Cro-Magnons could rapidly adapt to changing climate conditions,
but mammoths couldn't.

7
When the domain of investigation shifts to complex survey sample analysis, the dividing line between
the old guard and the new guard becomes even sharper. SAS and SPSS were both slow to arrive at the
complex survey sample analysis game [66], and both entered in a somewhat half-hearted way. SAS
wrote a set of "SURVEY"-tagged procedures that are roughly on the same order of sophistication as a
graduate-level textbook in sampling theory, and SPSS added a new "Complex Samples" module
(available only at considerable additional cost to the user) that can run through the usual laundry list of
procedures but shows little innovation or support for advanced methods of analysis such as replication.
In contrast, Stata, starting about where SAS left off, has been growing by leaps and bounds in both
technical sophistication and power for the last decade, with each new release providing more and better
tools for users. And R's development is nothing short of remarkable. It is easily the most technically
advanced of all the packages we will consider in this paper, and the author (yes, a single author: Thomas
Lumley, a professor of biostatistics at the University of Auckland, NZ) continues to actively develop and
promote the software, offering greater power and sophistication with each new release. Agility and
adaptability are the name of the game nowadays, and SAS and SPSS, like the massive corporations that
back them, just cannot keep up with the nimbler and more energetic new guys.

Other software packages entering the arena, like AM and IVEware, tend to be either research-backed
projects implementing original methodology and offering the results to the public for free, or public
releases of moderately capable but overly domain-specific government-developed software, like the
Center for Disease Control's Epi-Info and the US Census Bureau's CSPRO, and although some of the older
niche contenders like WesVar and SUDAAN are still going strong, these are the exceptions that prove the
rule WesVar, maintained by a small but dedicated development team operating within the statistical
consulting firm Westat that has always remained strongly focused on a core area of expertise, has
chosen to discontinue charging for usage but continues to provide the same level of support, while
SUDAAN, just one arm of the much larger research mission pursued by the Research Triangle Institute,
continues to position itself as the statistical add-on for SAS, and prices itself accordingly (high).

Indeed, cost can be a blocking factor in many endeavors requiring the analysis of complex survey sample
data, and not least among those endeavors is graduate research. In preparing this paper I was unable to
afford to directly test all the packages mentioned above because the cost of purchasing some of them is
too high to justify. My university maintains a SAS license (almost by necessity, given the historically deep
penetration of SAS software at all levels of quantitative analysis) and an SPSS license, but the SPSS
license does not include the Complex Samples module because there is limited need for it and the price
is exorbitant (see APPENDIX C). SUDAAN is not kept for the same reasons. So wherever comparison of
SPSS or SUDAAN was called for, I was forced to rely on product documentation and published accounts.
By way of contrast, I should mention that I was so impressed with Stata's capabilities that I purchased a
personal copy for myself, even though it was already available to me through a university license. And of
course packages like R, WesVar, AM, and IVEware could be downloaded for free, so I was quite happy to
try them out and see what they could do. (I use R regularly anyway because it is so powerful at anything
and everything statistical.) The point is that SAS, SPSS, and SUDAAN rely on their brand for advertising,
where Stata, R, and others are forced to earn their way by just being better at what they do. That is not
to say that SAS, SPSS, or SUDAAN are incompetent far from it, especially in the case of the extremely
powerful SUDAAN but they are not under the same development pressure as the other packages, and
it shows in the somewhat lackadaisical approach they all seem to take when it comes to the evolution of
their product.

8
Investigators looking into the relative merits of all these packages [see e.g. 61; 62; 64; 67; 69; 73; 75]
have almost invariably found very few differences in output from the various packages on typical
datasets (such as household surveys, public-owned data, etc.) and therefore tend to discriminate on the
basis of ease of use or cost or other economic or taste considerations. Studies along these lines
frequently find one or two deficiencies among the packages in the area of offerings, such as a lack of
this-or-that advanced regression capability in one of the packages where the others all offer it, but holes
of this nature are generally patched up within a few release cycles and tend not to be stable indicators
of package utility. Much more pertinent to the evaluation of complex survey sample analysis software is
the question of how well the software reacts when pushed to the limits of its capabilities.

One very interesting study in this vein is the DACSEIS Project funded by the official Eurostat division of
the European Union ([71]; for more about this study, see TESTING PROBLEMATIC DATA), which ran
several major packages (and some minor ones) through a very complete set of tests over the course of a
few years and produced a large set of reports describing their results. The conclusions? SAS, SPSS, and
Stata are on roughly equal footing as general-purpose statistical packages able to handle extremely large
datasets and many advanced estimation tasks, but Stata is noticeably stronger in the area of complex
survey sample analysis. (And this was in 2004, well before most of Stata's more advanced features were
added. The disparity between Stata and SAS/SPSS is much wider now.) For more demanding analysis
tasks, SUDAAN and WesVar were recommended as capable alternatives, but neither was without flaws:
SUDAAN has no data management capabilities, and WesVar chokes on very large data. Other more
esoteric offerings like Bascula, Clan, and Poulpe were considered as innovative research-driven projects
implementing cutting-edge estimators, but all require the support of a larger system (usually SAS) and
quite a bit of statistical expertise to use appropriately, and for those reasons could not be recommended
for general use. (R was not considered a serious contender, since its survey package was still in its
infancy at the time. The fact that R has outpaced all the others in sophistication in the span of a mere
decade is a real testament to the urgency gap in software development circles.)

While the DACSEIS report can hardly be faulted for accurately reporting the state of software capabilities
at the time, relying on its results can be misleading, for two reasons: 1) software capabilities grow over
time as new features are added and old ones refined, making any analysis of this nature obsolete within
a few development cycles; and 2) a better question (which was not asked by DACSEIS) is whether the
software shows enduring promise as a go-to solution for the full gamut of complex survey sample
analysis tasks. On (1), it is illustrative that the 1985 first edition of Wolter [56] listed 15 then-current
complex survey sample analysis packages, and only two SUDAAN and WesVar remain viable today.
On (2), investigations into software should take into account not just the current capabilities (which we
do in later sections), but also consider the historical trend of development and some assessment of the
long-term usability of the software over a wide range of analysis tasks that might be asked of not just
data consumers but data producers as well.

So as we progress through the investigation, we will try to highlight those features and personality
trends in the software that suggest growth potential. After all, our goal is not to waste big time and big
bucks learning the ins and outs of every piece of software we can find, but to identify a few that we can
learn deeply and trust to be there for us when future needs come knocking. We will find that the open
and extensible win out over the closed and regimented, that the innovative win out over the certified,
and that the cheap win out over the expensive not because they are cheap, but because they are
better. In this software domain, there really is $10 wine you can serve to friends.

9
METHODS AND CONSIDERATIONS

In the sections to follow, we will examine the performance and operating characteristics of the following
software packages on a variety of complex survey sample datasets. The packages we will consider are
(see APPENDIX C for version information and 2014 availability):

SAS
Stata
R
SPSS
SUDAAN
WesVar
AM
IVEware

A specific class of software was deliberately excluded from consideration: multilevel modeling software
such as Mplus, MLwiN, and GLLAMM. (See e.g. Carle [34] or Chantala and Suchindran [35] for detailed
discussion of the capabilities of these packages.) Although the application of multilevel models to
complex survey sample data has been progressing in recent years [32; 36], the multilevel modeling
approach of decomposing variance into level-tiered components along the lines of e.g. [47] involves
some complexities relating to a clash of assumptions concerning the idea of a sampling variance "model"
and the role that sampling design weights should play in the analysis of the model [16], and it was
deemed that these subtleties were too far-reaching and profound to adequately deal with in this
context. In the sampling theory literature there has been a long-standing debate concerning the use of
models, as opposed to the traditional design-based inference founded on randomization theory [18].
See Royall and Cumberland [52] for an early application of a model-based approach, or Grnewald and
Hssjer [11] for a recent synthesis. A hybrid "model-assisted" approach was championed by the
influential Srndal, Swensson, and Wretman [10]. These approaches are interesting and promising, but
for our purposes, the can of worms is better left capped.

Our primary mode of investigation will be the exploration and analysis of the behavior of the candidate
software packages on a variety of datasets with complex sampling designs. Some of the datasets contain
deliberate errors or inconsistent information for the purpose of measuring the software's response to
unexpected input. Where appropriate, each package is asked to conduct an analysis, and the results of
the analysis are compared to those of other packages and, when feasible, to reference results obtained
using standard methods. Among the designs considered are a variety of complex multistage plans
involving stratification and clustering, including one with a very complex plan (see PUSHING THE LIMITS).
What we are looking for from the packages is evidence of the appropriate refinement of estimates in the
face of increasingly complex sampling designs. As the complexity of a design increases (by the nesting of
deeper levels of sampling or other complications) so does the difficulty of forming precise and consistent
estimates of population parameters. It goes without saying that software intended for the purpose of
correctly analyzing such data should be able to recognize these complexities and handle them in a
design-correct manner by respecting the sampling weights in all estimation procedures. All of the
packages that we consider here do this to an extent, but it is the scope of that extent that will be our
target of investigation in the sections to follow.

10
A MOTIVATING EXAMPLE

To understand why the proper choice of software is essential for the correct analysis of complex survey
sample data, it will be helpful to look at a small set of observations through the lens of a few different
sampling plans and see how the analysis of the data can change depending on the way the observations
were sampled from the population. Suppose that we have obtained 20 measurements on units
sampled from a population of 220 units. To make the example concrete, we can imagine that the
population is a group of overweight girls and boys (in the ratio 6:5) who attended a summer weight-loss
camp program and the measurements are their weight-loss scores expressed in percentages of the
target loss amount. Say the camp is interested in estimating the average fraction of target loss attained
by the participants, and would like to claim that better-than-50%-of-target results can be expected from
the program. (See APPENDIX D for the dataset used in this example.)

SIMPLE RANDOM SAMPLING WITH REPLACEMENT (SRSWR)

| | ""

Under SRSWR, we assume that the population of samples is essentially infinite, as though there were
infinitely many copies of each kid piled into some bottomless pool in just the right proportions and we
simply scooped out of them after thorough mixing. We might select the same kid times or we might
get all different kids, or some other combination. This plan corresponds to independent and identically
distributed (iid) sampling following a multinomial trial of size where 1/ for
1, , . But of course there are not really an infinite number of possible samples from a finite
population, and if we performed the multinomial trial a large enough number of times we would get the
same samples repeated infinitely often. The actual number of possible samples is | | , which
means we only really need copies of each kid in the pool to get the same properties as the multinomial
trial. In our case | | 67109806368068911898249788990, which is quite a
lot, so it makes sense to think of it as infinite at a conceptual level, but we are actually more interested
in each individual kid's probability of inclusion, and for that the "infinite population" idea offers no help.

We can apply the inclusion-exclusion rule of unions of finite sets to calculate the probability of inclusion
for an individual kid; because we are sampling with replacement, this probability will be the same for
every kid. Letting we have (because of independent draws):

1
1

because 1/ for each draw 1, , . For our example, this probability is:
11
20 1 6143479331859073478931832027561662440039128399
1 .087
220 70542949868640404420794777600000000000000000000

Notice that this is less than the 20220 .091 probability of selection under sampling without
replacement (see below). The reason for this is that when drawing without replacement, probabilities of
selection in subsequent draws given that a kid has not been selected thus far increase with every draw.
That does not happen when drawing with replacement per-draw probabilities are always the same.
What SRSWR means for our sample is that we simply generated random integers in the range of and
selected each kid whose index popped up. Some of the kids could be selected more than once.

We are interested in estimating , the mean of the ordered -tuple of values of

some quantity defined for every individual 1, , in the finite population, and unbiased
estimators of and are of course the sample mean and sample variance of the mean. (For proofs
of these and other results, see APPENDIX E.) For SRSWR these estimators are:

1
1

SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT (SRSWOR)

| |

In the case of SRSWOR we allow a single individual to be selected at most once for the sample; that is,
once a kid is selected, we remove that index from further consideration. The selection process must be
such that each of the possible samples occurs with the same probability. (All ways of doing this are
equivalent to randomly shuffling integers in the range of and taking the first from the shuffled stack.)
But because we reduce the size of the pool after each draw, we introduce a dependency among draws.
If we let then from the definition of the sampling process it follows that:

1/
| 1/

So all kids still have the same probability of inclusion, which in our case is 20/220 1/11, but this
probability is somewhat higher than in the SRSWR case, just like in Russian roulette the more times
you pull the trigger, the more likely you are to reach the bullet.

12
Estimators in the WOR case are similar to those for the WR case, except for the wrinkle of smaller
variance, which is handled by a finite population correction (FPC) defined as 1 , where / is
the sampling fraction. It turns out (see APPENDIX E) that this factor exactly corrects for the shrinkage in
variance that occurs under WOR sampling. Thus the estimators are:

1
SIMPLE STRATIFIED SAMPLING

| |

Stratified sampling involves partitioning the population into groups called strata and then sampling
independently within each stratum. The type of sampling can be different in different strata without
introducing any additional complications; as long as we can obtain unbiased point and variance
estimates within each stratum, these can be simply combined (with the proper weighting) to yield
unbiased point and variance estimates for the population. If the strata are indexed by and we
have unbiased estimates and obtained in accordance with the sampling plan specific to stratum ,
then the estimators for the population (assuming SRSWOR of units within strata) are:

Notice that and 1 are nothing more than the standard estimators for the total and
variance of the total inside stratum , so all the stratified sampling estimators are doing here is adding
up the estimated stratum totals and scaling them to population units. This is a common theme that we
see over and over again in complex sampling theory: the estimate for a particular stage in the sampling
process is a weighted linear combination of the estimates obtained from the next lower stage. The only
complications involve finding the correct weights.

To apply stratification to our example, we can imagine a scenario where the camp took a sexist approach
to weight-loss motivation and segregated the campers by sex, encouraging the boys to lose weight by
playing outdoor sports and encouraging the girls to lose weight by sitting around doing indoor crafts to
take their minds off cake and cookies. Since we would obviously expect pretty large differences in the
effectiveness of these two approaches, it would make sense to stratify the sample by sex as well, in
order to guard against the possibility of getting a sample that included too many boys (which might drive
13
the weight-loss scores up) or too many girls (which might drive the scores down). Notice that a stratified
sampling plan forbids a large class of samples from the population: namely, all samples that do not
include the specified numbers of girls and boys. In our case only about 17% of the possible samples
(WOR) of size 20 from our population are actually obtainable under stratified sampling if equal numbers
of boys and girls are sampled. An added bonus of this strategy is that it is likely to increase the precision
(i.e. lower the variance) of our point estimate by eliminating the potentially large between-sex variance
component entirely, since (due to independence of sampling from strata) the variance estimator above
does not depend in any way on the variance between strata.

SIMPLE CLUSTER SAMPLING

| |

The key difference between stratified sampling and cluster sampling is that in cluster sampling not all
the groups are represented in the sample. Rather than sample within the groups (as is done in stratified
sampling), we sample the groups themselves. Once a group is sampled, all members of that group are
scored and the group total becomes a known quantity. As in stratified sampling, we severely limit the
number of samples we are willing to take now only samples are allowed, where is the number
of clusters in the population and is the number we take into the sample but this time the strategy is
unlikely to improve the precision of the estimate unless we pack most or all of the variation into the
clusters themselves, where it will wash out due to the fact that we take a census within clusters. From a
variance component perspective, in stratified sampling we use the within-group variance to estimate the
population variance, whereas in cluster sampling we choose to use the between-group variance instead,
as shown by the unbiased estimators (assuming SRSWOR of clusters):

where is the total and the size of cluster , and / is the sampling fraction. We can think of
1/ as the weight of a cluster total, since we only sample clusters and expect their totals to do
the work of estimating the total over the full clusters in the population; therefore each cluster total
stands for 1/ of the population total. (We can also imagine scenarios where some clusters are
selected with higher probability than others, in which case the weight 1/ would depend on the
index of the particular cluster. See the section below on sampling proportional to size.) Note that if the
cluster sizes are not all known in advance, then is unknown and should be estimated by .
The variance estimator changes in that case (because the mean estimator becomes nonlinear, as a ratio
of random variables), but we will not present the other version here; see [7 p179-180] for details.

14
A cluster sample of our weight-loss camp population could be constructed by randomly assigning the
kids numbers in 1, , and then selecting of those numbers for the sample. The key is random
assignment of group numbers. If we cluster the kids on some systematic basis (such as pairing friends)
that makes kids within groups more likely to have similar weight-loss scores, then we wind up with less
variance within groups and more variance between, which drives up the cluster variance estimate. For a
sample of 20, we could divide the 220 kids into 110 pairs and select 10 pairs, or we could divide them
into 22 groups of 10 and select 2, or something in between. (We could even divide the kids arbitrarily
using unequal group sizes, but this tends to increase the variance [7 p179; 31 p70].) In general the most
efficient estimates (as measured by decrease in magnitude of the standard error) under cluster sampling
are obtained when the group sizes are large and reflective of the larger population [31 p76].

Note that a special case of cluster sampling is systematic sampling, a scheme where only a single cluster
is selected. In systematic sampling, we imagine the population as flowing in a stream past us, and given
some arbitrary starting index, we select into the sample every th unit in the stream, where is some
suitably-chosen integer ( 1 gives a census, and takes only a single observation). It is easy to
see that this is the same as a cluster sample where we form clusters according to unit index remainders
modulo and then select only the cluster containing unit indices whose remainder modulo is equal to
the starting index . This design is extremely hard to analyze correctly, and may not even be
meaningfully analyzable unless the stream of observations is actually a random shuffle of the population
or displays some predictable trend [2 p212-216; 7 p196-198]. Most damningly, since only one PSU is
selected, a design-based consistent estimator of variance does not exist for any statistic we might wish
to compute on the sample [2 p229]. (Systematic samples are usually analyzed as SRSWR designs,
operating on the assumption that the stream was a random shuffle [7 p196], or by imposing some
model assumptions on the population [2 p223-227].) These designs require care and expertise.

STRATIFIED CLUSTER SAMPLING

One reason to stratify or cluster the population before sampling is to mitigate the cost of constructing
sampling frames, which can be substantial [1 p229-331; 3 p77-84]. A real case illustrating this difficulty is
the National Agricultural Workers Survey (NAWS) conducted year-round through the US Department of
Labor (http://www.doleta.gov/agworker/naws.cfm). Since a large proportion of farm workers are
undocumented or migrant throughout the year as crops revolve, sampling frames can be obtained only
at the most local level, so the NAWS survey combats this volatility by stratifying by region and crop cycle
and then sampling through several stages of nested clusters: farming clusters, counties within farming
clusters, agricultural employers within counties, where finally a sampling frame of farm workers can be
created from daily labor information provided by the employer [28].

When combining clustering and stratification, we can either nest the clusters within the strata or nest
the strata within the clusters. The former involves partitioning the population first according to some set
of characteristics (often geography) and then forming groupings of the population segments in each
stratum, after which standard cluster sampling is performed independently within each stratum. But we
could also decide to select clusters across stratum lines and then subsample independently within each
stratum subset induced by the selected cluster; this is the latter case, and in some sense it requires more
ingenuity in the formation of clusters, because we have to ensure that each cluster contains units that
represent all the different strata.

15
Whether we decide to stratify first and then cluster or cluster first and then stratify, no new difficulties
arise in the construction of unbiased estimators because independence across strata saves the day. If we
stratify first, then in each stratum we conduct a cluster sample and return point and variance estimates;
these can be scaled and summed to yield population estimates, i.e.:

(clusters in strata)

| |

And similarly, if we cluster first then the estimates in each cluster are for a simple stratified sample, so
we scale them by the cluster weight to get estimates for the population, i.e.:

(strata in clusters)

| |

1
1 1

The estimator for the mean simply replaces the usual (census) total for cluster with the totals summed
across strata within the cluster, and the second component in the variance estimator is the variance
within each subsampled cluster , weighted and summed across strata within the cluster. If the clusters
are formed so cleverly that we take a stratified census in every one, the second component is eliminated
because independent sampling across strata ensures no covariance terms among the strata. In that case,
it is easy to see that the method becomes just a silly way of doing an ordinary simple cluster sample.

16
Getting back to our weight-loss camp population, we could imagine conducting the two methods of
stratified sampling along the following lines:

(clusters in strata) Divide the campers into boys and girls and then randomly assign the boys into
groups and the girls into groups. Select some of the boy groups and some of the girl groups in
such a way that the choices of boys groups do not affect the choices of girl groups and vice versa.
(strata in clusters) Group the kids randomly according to a usual clustering scheme, but make
sure each group contains at least some girls and some boys. The same range of cluster numbers
could be assigned to boys and girls separately and the final clusters formed by combining each
girl group and boy group having the same number. Then select some of the groups, and within
each selected group, independently subsample some of the boys and some of the girls.

BIAS AND DESIGN EFFECTS

The small set of simple designs presented above afford plenty of opportunities to see the dramatic
effects of different sampling plans on the statistics of interest, but before delving into the example we
should discuss how those statistics should be compared.

One axis of comparison is bias. Although all of the point estimators presented above are unbiased under
certain assumptions (see APPENDIX E), when the size of the complete population is unknown, estimating
it using the sum of the observation weights as is commonly done in practice introduces bias into
any estimator using that estimate [7 p117-119]. This is because the sum of the observation weights is
itself a random variable. Cochran [7 p160-162] discusses the magnitude of this bias for a certain class of
estimators and concludes that it can be ignored provided the coefficient of variation of the mean of the
observation weights is less than about 10%. Of course under any sampling plan the bias of consistent
estimators will become negligible as the sample size increases.

Another point to consider is that even unbiased estimators can give different estimates under different
sampling plans. For example, examination of the mean estimator for simple stratified sampling shows
that it will not give the same estimate as the sample mean unless the sampling fraction / is
the same for all strata . Both estimates are unbiased, but they will not agree in general.

Perhaps the most important consideration when comparing estimators is their precision. One way to
measure precision in a complex survey sampling context is to compare variances. Specifically, we take
the ratio of the variance of the estimator under the complex design with the variance of that same
estimator under the assumption that the sampling plan was SRSWOR instead. This ratio (introduced by
Kish [6 p257-260]) is called the design effect. Formally, the design effect of an estimator on a
sample obtained under a complex design is computed as:

|
|

Most complex survey sample software provide an option to output this statistic for any estimation
procedure, but if not, it can be computed easily enough by running twin analyses one under the
assumptions of and one under the assumptions of SRSWOR and forming the ratio by hand.

17
It should be apparent from this definition that if a design effect is known to be approximately constant
for an estimator over essentially any reasonable sample, then computing the (approximately) correct
design variance of estimates on any future samples can be greatly simplified by simply analyzing the new
sample as a SRSWOR and multiplying the variance estimate by the known design effect.

The design effect also gives a measure of the effectiveness of the complex sampling plan relative to
SRSWOR. If the design effect is less than unity, the plan is more efficient than SRSWOR, but if the effect
is much larger than unity, the plan is considerably less efficient than SRSWOR. Sudman [31 p73-74] gives
a nice example illustrating how a large design effect can reduce the effective sample size:

Suppose you want to estimate the proportions of black and white residents in a neighborhood using a
cluster sampling plan that calls for sampling 10 blocks and subsampling 5 households within each block.
The sample size is 50 households. But when you arrive at the neighborhood you discover that it is
completely segregated by race, so that some blocks contain only black families and other blocks contain
only white families. If you select a block and a house within the block and then knock on the door, if a
white person opens the door you can assume that the entire block is white, and similarly, if a black
person opens that means the entire block is black. Each block only provides one effective observation,
so the effective sample size is actually 10 instead of 50, which implies that the design effect is

50 10 5. In other words, the badly-chosen clustering has made our first-level
cluster totals no better than if we had just chosen blocks by SRSWOR and assigned white or black status
to the block based on the race of the first person we encountered there.

APPLYING THE DESIGNS

With all that theory under our belt, now we can finally take a look at the weight-loss camp data. Let's
assume that we heard about the differential treatment of girls and boys at the camp and have reason to
suspect that the boys' weight-loss scores will be significantly better than the girls'. How should we
sample the kids to make sure that this gets taken into account? (Hint: stratified sampling.) Recall that
the interest of the camp program is to be able to claim with some (say 95%) confidence that the average
weight-loss score is better than 50% of the goal, so we will need to look at the widths of confidence
intervals, which are inversely proportional to the precisions of the estimates. The table below shows the
results of sampling the kids according to the six different plans discussed above. (The results were
obtained using the survey package in R.)

mean se(mean) 95%CI deff effective size

SRSWR 61.70 7.2417 (47.51,75.89) 1.10 18
SRSWOR 61.70 6.9047 (48.17,75.23) 1.00 20
STRATIFIED 59.18 3.6577 (52.01,66.35) 0.28 71
CLUSTER 61.70 9.0744 (43.91,79.49) 1.73 12
STRAT+CLUST 59.18 3.2811 (52.75,65.61) 0.22 91
CLUST+STRAT 59.18 3.3961 (52.53,65.84) 0.24 83
Table 1: performance of complex sampling plans over the same observations

What a dramatic difference in precision! Careful inspection of the dataset reveals that the boys' scores
were a lot higher than the girls', so all the stratified plans smartly removed the considerable variation

18
between girl scores and boy scores, resulting in much more precise estimates. But the simple clustering
plan stupidly formed clusters containing either all boys or all girls, so all this variation wound up
expressed in the precision of the estimate; notice that the effective size under this plan suggests that
each cluster of two kids really only represented about one observation. Most interestingly of all, the
stratified cluster plans got the best of both worlds: eliminating most between-sex variation from the
estimate and removing some of the within-sex variation too by generally pairing kids with dissimilar
scores. (Of course this kind of information would not be known beforehand, but any attempt to tip the
scales toward heterogeneity e.g. by putting some fatter kids and thinner kids in the same cluster
should yield improvements in the performance of cluster sampling. By a similar token, any good attempt
to make individual strata as internally homogeneous as possible should yield big dividends.) Notice that
only the stratified designs would provide enough precision to allow the camp program to tout its claim.

Note that the software package gives a warning in the SRSWR case to the effect that the lack of any
specified weights implies equal probability sampling. ("Are you sure you really want to do this??") That
is what we intended in this case, but it is an important point to consider. Under most complex sampling
plans the observations will not have equal weights. By way of comparison, the stratified plans give girls
more weight because they are slightly more numerous in the population and therefore each one is
slightly less likely to be selected within the girl stratum than a boy within the boy stratum.

GENERAL PROBABILITY SAMPLING

The designs discussed above are really just special cases of a more general arbitrary-probability WOR
sampling design known as ps sampling. The theory gets more tricky in this general case, so we will only
present a cursory overview, beginning with an important example.

Sampling with probability proportional to size (PPS) is a special case of ps sampling where all the unit
inclusion probabilities depend on some external measure of the "size" of the unit, such as population
size, economic importance, etc. If units are chosen consistently according to this size measure at each
stage of sampling, then the sample takes on the convenient property of being "self-weighting": all
observations carry the same weight, so the nave unweighted estimators become unbiased estimators.
That is, to obtain correct point and variance estimates of the mean for a perfectly self-weighting design,
we simply compute the means of all the primary sampling units (PSUs), and then find the sample mean
and variance of the PSU means (i.e. treating the PSU means as the observations) [7 p232].

In general ps sampling, however, the self-weighting property does not hold. Suppose the probability
that a unit score for 1, , is included in a sample of size drawn from a finite population of
size is given by , and the probability that two unit scores and will be jointly included in the
sample is given by . Note that we can think of the latter as | , i.e. the probability that is in
the sample times the probability that is included in the sample given that has already been selected,
and because exactly items are included in the sample, the inclusion probabilities satisfy:

| 1

19
The quantity / can be thought of as the average probability of inclusion [7 p241]. In a landmark
paper, Horvitz and Thompson [12] proved that the following estimators are unbiased for the population
total and variance of the estimate of the total respectively, under literally any single-stage method of
selection that gives for the sample; these estimators have since become known as the Horvitz-
Thompson (HT) estimators:

1 1

While this estimator of the total is probably one of the most useful (if not the most useful) results ever
conceived in the context of complex sampling theory, the variance estimator is unfortunately
quite awkward to use in practice, because it requires knowledge of the full joint inclusion probability
matrix over every , pair in the sample. Moreover, despite the theoretical unbiasedness,
estimates generated from this formula tend to be wildly unstable, and even negative because the factor
1 can take on negative values [2 p261; 19]. To get around this awkwardness, a number of
alternative formulas and sundry approximate methods have been proposed [2 p261-270; 7 p274-277;
38], but the perhaps the most widely implemented of these is the WR approximation because of its
usefulness for approximating multistage arbitrary-probability WOR samples [7 p246]:

1
1

(This is the ps estimator implemented for the "pps" design type in APPENDIX A.) The theory behind this
estimator is very simple but clever: Scale the observations by their average probability of selection,
obtaining an "averaged total" observation for each ; since the sample can now be seen as a sequence
of mean values obtained by the same random process that generated the original values, we can label
its distribution and estimate the mean sampling variance of that distribution using / . (Note in this
regard that the expected value of a single is
,
which is unbiasedly estimated by .)

If this estimator is used to approximate the variance of totals in multistage samples, it is appropriate to
replace the with the estimated totals of the PSUs. An asymptotically unbiased estimator of the
population mean is . There are also extensions to the Horvitz-Thompson estimators for
samples involving more than one stage (see e.g. Lohr [7 p245] and Wolter [56 p14]), but in general these
seem to be more trouble than they are worth.

Because of the finicky theory and odious requirement to compute or store an matrix (where
may be very large indeed; Laaksonen et al [71] describe a recent German microcensus with nearly a
million observations, which would require a joint inclusion matrix approximately 4 terabytes in size!),
software support for ps designs is rather poor.
20
Of the packages we consider, SAS, Stata, and WesVar (and of course AM and IVEware) offer no support
whatsoever for ps estimators. In those packages, designs of this type must be approximated using
some weighted single-stage design. If only weights are supplied, it is fairly likely that some or all of them
default to a variance estimator along the lines of . (This is definitely the case for both SAS and Stata;
see [92 "Variance and Standard Deviation of the Total"] and [104 p104]. WesVar takes a different
approach altogether; see OPTIONS FOR VARIANCE ESTIMATION.)

SPSS [95 p13] and SUDAAN [106 "UNEQWOR"] both provide some support for ps estimators, but in
both packages the user is required to feed the procedure the precomputed joint inclusion probability
matrix . If the problem is so large that the matrix will not even fit into memory, good luck getting either
of those procedures to run. And as for help computing the matrix, forget it. Neither package wants to be
bothered with something like that. This kind of "support" should be considered a toy at best.

But there is some light in the world, and its name is R. The survey package in R not only provides the
typical "support" requiring the user to feed in a joint inclusion probability matrix Hey, it's there if you
want it but goes many steps beyond by providing automatic procedures for several single-stage ps
estimators and approximations [85], including the Brewer, Hartley-Rao, and Overton approximations,
which do not require specification of , as well as both the original Horvitz-Thompson estimator and the
unbiased Sen-Yates-Grundy estimator if is provided. There is even graphical visualization support for a
user-supplied or an internally estimated so the user can see the probability structure at a glance.
This level of support is truly remarkable given the miserable state of competition among other complex
survey analysis packages, and Thomas Lumley (the implementor of the survey package, among others)
deserves high praise for going to such lengths.

Now that we have gotten a taste of what is involved in making estimates from data obtained under a
complex sampling plan, we will begin to look deeply into the capabilities of the various software
packages. We will focus almost exclusively on their capabilities in the traditional estimation areas of
mean and total and ratio estimation, as well as their support for alternative methods of variance
estimation and robustness in the face of problematic input or extremely complicated designs. We will
not focus unduly on features of the user interface or other ease-of-use questions, beyond pointing out
areas where procedures are difficult to find or awkward to specify. Tastes in software vary as much as
tastes in anything else, so one analyst's dog may be another's purple unicorn prancing down rainbows.

It should be said that all the software can do more in most cases, quite a bit more sophisticated
estimation than the procedures we consider in this paper, including very advanced regression modeling
and survival analysis under complex sampling plans, publication-quality tabulation and graphical output,
advanced testing of hypotheses, goodness of fit and linear contrasts, single and multiple imputation, and
cutting-edge estimators like GREG (generalized regression) and GEE (generalized estimating equations);
see Heeringa et al [25 p402-406] for a very thorough list. All these bells and whistles are very nice, but if
the fundamental estimation of means and totals is not solid, then we should question the reliability of
everything else up the chain of estimation. At a minimum, we must insist that any prospective software
package be free of errors on the procedures we consider. (Since the theory of estimation under complex
sampling allows for alternative but equally valid estimators, minor differences in estimates are not
counted as errors. More egregious are refusals to handle common designs or poor handling of violations
of assumptions.)

21
TESTING BENCHMARKS

Wolter [56 p410-415] provides a number of simple benchmark datasets designed to test survey software
handling of a two-stage equal-probability stratified sampling design involving large sampling fractions
and the possibility of singleton and certainty clusters at each stage of sampling. This design is well
chosen as a make-or-break test of the two tiers of survey software we have been examining thus far
because it very effectively separates the can-do packages from the can't-do packages. Moreover, the
benchmark datasets are small enough that hand (or spreadsheet) computation of means, totals, and
standard errors is feasible. The datasets are reproduced in APPENDIX D.

The unbiased estimator of the total is the usual Horvitz-Thompson estimator [2 p300,316; 6 p165; 7
p2554,282; 10 p145]:

where is the index set of strata, is the index set of primary clusters (PSUs) within stratum , and
is the index set of secondary clusters (SSUs) within cluster of stratum . The quantity is a distinct
observation drawn according to the sampling plan with probability 1 . Under equal probability of
selection, if is the population size of stratum and the population size of cluster within stratum
, and and are the corresponding sample sizes, then . A somewhat more
convenient way to think about the design weights is in terms of the sampling fractions
and , giving 1 . Then the design-based estimator of variance of the total is
given by [2, p303,316; 5 p316-319; 6 p174; 7 p260-262,282; 10 p145]:

1
1 1

where is the sample standard deviation of the individual cluster totals

within stratum and is the sample standard

deviation of the observations within cluster of stratum .

These are very complicated formulas, but this design is actually the simplest case of stratified two-stage
finite-population sampling designs, so if a survey software package is unable to handle even this level of
complexity, it will certainly not fare well on the even more complicated multistage unequal-probability
sampling designs commonly encountered in practice. In the tables that follow, the "REFERENCE" column
lists estimates of the totals and standard errors computed "by hand" (using functions implemented in R;
see APPENDIX A for the relevant code) on the benchmark datasets.

The layout of all the benchmark datasets is essentially the same: stratum identifier, stratum size, cluster
identifier, cluster size, observation. Benchmark set 1 has just one stratum (i.e. a nonstratified design),
set 2 has three strata, and sets 3 through 6 have four strata each. Stratum size is generally 15 for
all strata and 3 PSUs are selected from each (for a largish sampling fraction of 315 20%), but

22
sets 3 and 4 have singleton PSUs in the last stratum, with different interpretations in each case. In set 3
the singleton PSU was simply the only one selected from the population of 15, but in set 4 the singleton
is a certainty PSU, meaning that the stratum has a population of 1 and the only PSU in the stratum was
chosen with certainty. The random singleton PSU provides no information about the variance within the
stratum, but a certainty PSU implies that the variance within the stratum is zero. Sets 5 and 6 also
contain singleton units, but at the second stage of sampling instead of the first. The implications for
variance calculations are similar. A correct implementation should assign zero variance to strata or
clusters with certainty units, but employ some ad-hoc method to obtain a variance estimate for strata or
clusters with random singletons. The usual strategy is to collapse strata with singleton units into nearby
strata and thereby eliminate the issue at the cost of distorting the true design in a small way [10 p109;
57 p50-57], but some software packages allow other methods, such as recentering stratum means or
estimating the missing strata variances using averages of the nonmissing ones. Also at issue is whether
or not finite population corrections are computed; in this case, where sampling fractions are very large
at both stages (50% in the second stage), it is a serious analytical error to omit them.

Tables 27 present a comparison of the results from several of the software packages under
consideration. For the reasons mentioned above, SPSS and SUDAAN results are missing from these
tables, but according to Heeringa et al [25 p402-403] both of these packages are able to handle these
designs. SPSSS documentation [95 p9] describes how to set up plans of this type, and as we will see later
on (see COMPARING ON SUDAAN DESIGNS), some of these benchmark designs are specifically
addressed in the SUDAAN documentation [109]. WesVar is not considered at this juncture because its
variance estimation model is radically different from that of the other packages (see OPTIONS FOR
VARIANCE ESTIMATION), and a direct comparison would not be very illuminating.

The statistics examined are totals, means, standard errors, and design effects. Parentheses or a "." entry
under one of these columns in the table indicates that the software package in question was unable or
unwilling to produce the estimate under normal operation. This is not a problem for statistics such as
means and totals and their standard errors, since a simple rescaling using the sum of the design weights
easily converts one into the other, but unwillingness to compute a design effect without undue hassle
(such as e.g. forcing the user to rerun the analysis and compute the effect by hand) feels like a glaring
omission. Note that sometimes design effects for means and totals will be different in the presence of
singletons, depending on various technical details of the estimation method and variance estimation
correction; these details will be glossed over in the table, and the smaller of the two design effects
reported. Other information indicated in the table are error messages or other comments from the
programs and a brief description of whether it was necessary to precompute design weights before
running the analysis (and whether that precomputation could be done within the software or not).

weight
1 total se(total) mean se(mean) deff messages
computation
REFERENCE 600 79.844 4 .532 1.653 Automatic
SAS 600 77.460 4 .516 . Within
Stata 600 79.844 4 .532 1.653 Within
R 600 79.844 4 .532 1.653 Automatic
AM (600) (86.550) 4 .577 1.750 Outside
IVEware (600) (86.550) 4 .577 1.750 Outside
Table 2: Benchmark set 1 results (2-stage SRSWOR, unstratified)

23
weight
2 total se(total) mean se(mean) deff messages
computation
REFERENCE 1800 138.293 4 .307 1.731 Automatic
SAS 1800 134.164 4 .298 . Within
Stata 1800 138.293 4 .307 1.731 Within
R 1800 138.293 4 .307 1.731 Automatic
AM (1800) (149.850) 4 .333 1.833 Outside
IVEware (1800) (150.000) 4 .333 1.833 Outside
Table 3: Benchmark set 2 results (2-stage SRSWOR, stratified)

weight
3 total se(total) mean se(mean) deff messages
computation
REFERENCE 2250 175.594 3.75 .293 1.702 Automatic
Only one cluster in a stratum for
SAS 2250 134.164 3.75 .224 . variable(s) y. The estimate of variance Within
for y will omit this stratum.
Note: Variances scaled within each
Stata 2250 175.357 3.75 .292 1.699 stage to handle strata with a single Within
sampling unit.
R (2250) 159.690 3.75 .266 1.409 options("survey.lonely.psu"="average") Automatic
WARNING: Stratum 4 has single PSU
AM (2250) (187.800) 3.75 .313 1.780 (cluster). Outside
Error: Bad strata. Only one cluster for
IVEware . . . . . stratum 4. Outside
Table 4: Benchmark set 3 results (2-stage SRSWOR, stratified, singleton PSU)

weight
4 total se(total) mean se(mean) deff messages
computation
REFERENCE 1830 138.384 3.978 .301 1.850 Automatic
Only one cluster in a stratum for
SAS 1830 134.164 3.978 .224 . variable(s) y. The estimate of variance Within
for y will omit this stratum.
Stata 1830 138.384 3.978 .301 1.861 singleunit(certainty) Within
R 1830 138.384 3.978 .301 1.861 options("survey.lonely.psu"="certainty") Automatic
WARNING: Stratum 4 has single PSU
AM (1830) (150.420) 3.978 .327 1.957 (cluster). Outside
Error: Bad strata. Only one cluster for
IVEware . . . . . stratum 4. Outside
Table 5: Benchmark set 4 results (2-stage SRSWOR, stratified, certainty PSU)

weight
5 total se(total) mean se(mean) deff messages
computation
REFERENCE 2300 146.202 3.833 .244 1.388 Automatic
SAS 2300 141.421 3.833 .236 . Within
Note: Variances scaled within each
Stata 2300 146.629 3.833 .244 1.465 stage to handle strata with a single Within
sampling unit.
R 2300 159.450 3.833 .245 1.478 options("survey.lonely.psu"="adjust") Automatic
AM (2300) (158.114) 3.833 .264 1.545 Outside
IVEware (2300) (158.114) 3.833 .264 1.545 Outside
Table 6: Benchmark set 5 results (2-stage SRSWOR, stratified, singleton SSU)

24
weight
6 total se(total) mean se(mean) deff messages
computation
REFERENCE 2165 203.359 3.901 .260 3.164 DEFF based on SRSWR: 2.844 Automatic
SAS 2165 199.950 3.901 .360 . Within
Stata 2165 203.359 3.901 .262 1.602 singleunit(certainty) Within
R 2165 203.359 3.901 .262 1.602 options("survey.lonely.psu"="certainty") Automatic
AM (2165) (157.065) 3.901 .283 1.682 Outside
IVEware (2165) (156.897) 3.901 .283 1.683 Outside
Table 7: Benchmark set 6 results (2-stage SRSWOR, stratified, certainty SSU)

Perhaps one of the most peculiar facts we can glean from these tables is that some software packages
neglect to estimate totals. Neither AM nor IVEware provided any straightforward method to obtain
unbiased estimates of totals. But, as mentioned earlier, means and their standard errors are sufficient
for computing totals and their standard errors, so perhaps the software designers simply felt that
providing the extra capability to estimate totals was redundant. Nevertheless, it feels profoundly
unsatisfying for an analyst not to have such basic methods readily available.

Another fact worth pointing out is that while all the software packages easily computed correctly
weighted point estimates, some of them struggled to produce correctly weighted standard errors. Even
in cases of considerable agreement, there are occasional small discrepancies that seem too large to
explain away as rounding error. (Floating-point hardware nowadays is highly accurate for computations
of this nature. Any "rounding errors" here are almost certainly artifacts of computational decisions made
at the implementation level.) These small discrepancies hint at the existence of minor idiosyncrasies in
the implementations of what ought to be standard computational methods.

The presence of singletons also created difficulties for some of the packages, and IVEware in particular
refused to compute any estimates with singletons. Without a mechanism in this package to distinguish
between random singletons and certainty units, aborting with an error is probably wise, but an overly
conservative choice. The other packages all found reasonable compromises and reported an appropriate
warning. In the cases of Stata and R, the default behavior is to produce point estimates but omit
variance estimates until more information is provided by the user [104 p186-187; 84 p43-44]. To obtain
a variance estimate, the analyst needs to classify the singularities as random or due to certainty
sampling, and if the former, select a preferred approximate estimation strategy (a kind of imputation).
The approach adopted by SAS to simply discard all strata with singletons is very ill-advised indeed,
as it can lead to serious underestimation of the variance. Eliminating entire strata from the population
introduces nonignorable downward bias in the variance due to the discarding of all information about
the within-stratum components of variance for the omitted strata. The correct strategy is actually the
opposite of what SAS does: instead we should collapse strata boundaries and use the inflation in the
between-strata variance components to compensate for the missing within-stratum components [5
p438-439]. In this as in many regards, the SAS operating paradigm for complex survey analysis feels far
out of step with the times, which are increasingly moving toward a much higher standard of accounting.
By way of comparison, the reference "package" used for baseline comparisons (see APPENDIX A) was
programmed by a single individual (me) over a weekend using very ordinary development tools. It seems
unbelievable that SAS development professionals with the vast resources at their disposal cannot
provide more robustness or flexibility in their survey package.

25
Since SAS does not account for the full design variance, its standard errors are understated when the
finite population correction is applied, so using this correction within SAS on a two-stage design with
large sampling fractions should be considered anticonservative [10 p146]. The conservative approach of
not applying the correction is illustrated by AM and IVEware, whose standard errors are somewhat
inflated in all cases. Note that by default, SAS does not apply the correction either it must be
requested by the user and therefore does exhibit conservative behavior under nave use.

On the Benchmark 5 dataset, R estimated the total and mean using different methods: observe that
even though the sum of weights is 600, the estimated total 159.45 147 .
The design effect reported in the table is for the mean, not the total (which was larger: R reported
1.733). In contrast, Stata's estimates for both quantities and their design effects are consistent on this
dataset. But on the Benchmark 6 dataset, all packages but SAS resort to alternate estimation methods
for the total and mean, resulting in a relatively smaller standard error for the mean than for the total.
(For the design effect of the total in this case, Stata reports 3.298 and R reports 3.145. These numbers
are in line with the reference design effect estimate, which is based on the total.) It seems likely that all
packages sometimes quietly employ a ratio estimate of the mean (which is likely to be nearly unbiased
[10 p181-184]) when this is more efficient than the standard unbiased estimate.

On a final note, the singleton problems in datasets 5 and 6 go completely unnoticed by SAS, AM, and
IVEware because those packages disregard all stages of the sampling design beyond the first. This
sloppiness is forgivable for AM and IVEware, free packages developed for academic research purposes.
Their approach (and SAS's default approach) of overestimating the variance in the first stage in order to
account for the slop in subsequent stages is a sensible one (based on "ultimate clusters" [20]), but users
should expect better from industry-standard commercial software like SAS. The technology in the R
survey package, for example, is a quantum leap in sophistication above what SAS provides.

TESTING PROBLEMATIC DATA

Another test of software fitness is how the software deals with unexpected input. Ideally we would
expect professional-quality software to detect anomalous situations, report them to the user in plain
language, and either take appropriate corrective measures or abort the analysis. In this section we will
run a problematic dataset (see APPENDIX D) on all the software packages under consideration. The
dataset was obtained from the DACSEIS report [71] discussed below. Although up-to-date SPSS and
SUDAAN results could not be produced for the reasons mentioned above, the DACSEIS report did test
those packages, so we will include the DACSEIS-reported results verbatim, with the caveat that the
handling within both packages may have changed in the decade or so since the report was released.

The EU-supported DACSEIS Project, initiated in 2000 and completed in 2004, was a transnational
initiative to produce a recommended practice manual on the analysis of complex survey data obtained
from official European household surveys overseen by Eurostat. One component of the project was a
comprehensive investigation into the capabilities and usability of professional-quality software packages
for complex sampling variance estimation that were available to European data analysis agencies at the
time. Although the report's results are now ten years old and largely obsolete on the major packages
under consideration here, the questions the investigators raised provide helpful guidelines for the

26
evaluation of variance estimation software in general. In particular, the DACSEIS report [71 p69-75]
examined the behavior of various software packages on several types of problematic data: a) missing
observations on a variable (a.k.a. item nonresponse); b) missing, nonpositive, or fractional weights
(between 0 and 1); c) contradictory design specifications; and d) missing or singleton sampling units.

In connection with the handling of item nonresponse, it should be mentioned that there are three
typical approaches. The incorrect approach is to simply omit all units with missing items and treat the
reduced sample as the full sample. The more technically correct approach is to treat the nonmissing
items as a subpopulation and perform subdomain estimation on them; this strategy implies the use of a
robust variance estimation method [10; 37; 41; 45]. Finally, another reasonable approach is imputation,
which has the effect of imposing model assumptions on the missingness [15 p35-43].

Except AM, all the major packages under consideration here (including SPSS and SUDAAN) have a
method for correct subdomain estimation, and all (including AM) can be coerced into producing robust
estimates of variance on variables with item nonresponse. R, Stata, SUDAAN and AM do this by default
[84 p34; 104 p6; 106 "Missing Values"; 76 "Variance Estimation"], SAS provides for it through the
"NOMCAR" option (and also allows missingness to be treated as a response value via the "MISSING"
option, as does SUDAAN via the "MISSVAL" keyword) [92 "SURVEYMEANS Missing Values"; 106], and
SPSS has similar capabilities and default behavior as SAS [95 p16]. IVEware allows the use of the "BY"
keyword applied to a categorical variable [78; 79], but in general is more oriented toward imputation.
WesVar can perform subdomain analysis when computing statistics (called "table requests" in the
documentation), but in general does not account for missing values when computing replicate weights
for use in variance estimation; WesVar documentation recommends careful imputation of missing
values before replicate weight creation [110 pB35-B37].

All the packages here except WesVar can also perform automated multiple imputation of missing data.
In fact, design-correct multiple imputation of missing data applied to complex survey samples is
IVEware's whole raison d'tre [77], and many options are provided for doing so. Multiple imputation is
fully integrated into the range of complex survey analysis functionality offered in R, Stata, SAS, SUDAAN,
and even AM; see [84; 104; 87; 106; 76] for package-specific details. SPSS sells a separate Missing Values
module (http://www-03.ibm.com/software/products/en/spss-missing-values), but since this module is
not bundled with the Complex Samples module it is not considered to be a fully integrated solution.
WesVar can use appropriate estimation procedures on (and generate correct replicate weights for)
datasets containing imputation, but there is no procedure for performing the imputation within WesVar
itself [110 pB7]. Imputation will not be considered for the problematic dataset because the degrees of
freedom are too small in this case and would lead to grossly distorted variance estimates, but it is worth
pointing out that this kind of functionality is now commonplace.

Tables 815 illustrate the behavior of SAS, Stata, R, SUDAAN, SPSS, AM, IVEware, and WesVar on the
various features of the problematic dataset. (Recall that the SUDAAN and SPSS results are not current
and may not be reflective of actual performance in newer versions of the software. The results in the
tables were obtained using SPSS 12 and SUDAAN 8; see [71].) The WesVar analysis employed the JKn
jackknife variance estimation method, which is appropriate for designs containing more than two PSUs
per stratum [110 pD1-D18]. The dataset encodes a stratified single-stage cluster sampling design with
two strata featuring both large (20%) and small (4%) sampling fractions. For each problem, the software
was asked to perform a full-domain estimate of the item total and then separate subdomain estimates
of the total for each stratum. (As before, AM and IVEware compute only means, so the values appearing
27
for those packages in the tables were obtained by multiplying the mean estimates and standard errors
by the sum of the design weights. Also, AM cannot easily do subdomain estimation, so the separate
stratum estimates are missing for AM.) Finite population corrections were applied when available (i.e.
for all packages but AM and IVEware); other settings were left at their default. The tables also show any
relevant messages obtained from the software during execution.

OVERALL STRATUM 1 STRATUM 2

1 messages
total (stderr) total (stderr) total (stderr)
SAS 1250 (604.6) 200 (85.7) 1050 (598.5)
Note: 1 stratum omitted because it contains no
Stata 1250 (609.9) 200 (88.3) 1050 (603.5) subpopulation members.
R 1250 (696.4) 200 (100.4) 1050 (689.1)
SUDAAN 1250 (696.4) 200 (100.4) 1050 (689.1)
SPSS 1250 (609.9) 200 (88.3) 1050 (603.5)
AM 1250 (618.3) . .
IVEware 1250 (618.3) 200 (95.8) 1050 (610.8)
Warning: 3 observations were excluded from the
preceding table. These observations were
WesVar 1250 (696.4) 200 (100.4) 1050 (689.1) excluded because they contained one or more
requested variables with missing values.
Table 8: Item Nonresponse

OVERALL STRATUM 1 STRATUM 2

2 messages
total (stderr) total (stderr) total (stderr)
NOTE: Due to nonpositive weights, 2
SAS 1845 (883.6) 120 (28.8) 1725 (883.1) observation(s) were deleted.
Note: 1 stratum omitted because it contains no
Stata 1845 (883.6) 120 (28.8) 1725 (883.1) subpopulation members.
Error in (function (object, ...) : missing values in
R 1845 (883.6) 120 (28.8) 1725 (883.1) `weights'
SUDAAN 1845 (883.6) 120 (28.8) 1725 (883.1)
SPSS 1845 (883.6) 120 (28.8) 1725 (883.1)
AM 1845 (898.2) . .
IVEware 1845 (898.1) 120 (31.2) 1725 (897.6)
One or more records have zero, missing or
WesVar 1845 (951.6) 120 (44.4) 1725 (950.6) negative weight(s).
Table 9: Missing Weights

OVERALL STRATUM 1 STRATUM 2

3 messages
total (stderr) total (stderr) total (stderr)
NOTE: Due to nonpositive weights, 2
SAS 1845 (883.6) 120 (28.8) 1725 (883.1) observation(s) were deleted.
Stata 1845 (951.6) 120 (44.4) 1725 (950.6)
R 1845 (951.6) 120 (44.4) 1725 (950.6)
Number of observations skipped: 2 (WEIGHT
SUDAAN 1845 (883.6) 120 (28.8) 1725 (883.1) variable nonpositive)
SPSS 1845 (883.6) 120 (28.8) 1725 (883.1)
AM 1845 (898.2) . .
IVEware 1845 (898.1) 120 (31.2) 1725 (897.6)
One or more records have zero, missing or
WesVar 1845 (951.6) 120 (44.4) 1725 (950.6) negative weight(s).
Table 10: Zero-valued Weights

28
OVERALL STRATUM 1 STRATUM 2
4 messages
total (stderr) total (stderr) total (stderr)
NOTE: Due to nonpositive weights, 2
SAS 1845 (883.6) 120 (28.8) 1725 (883.1) observation(s) were deleted.
Stata . . . negative weights encountered r(402);

R 1789 (972.0) 108 (53.4) 1681 (970.5)

Number of observations skipped: 2 (WEIGHT
SUDAAN 1845 (883.6) 120 (28.8) 1725 (883.1) variable nonpositive)
SPSS 1845 (883.6) 120 (28.8) 1725 (883.1)
AM 1845 (898.2) . .
IVEware 1845 (898.1) 120 (31.2) 1725 (897.6)
One or more records have zero, missing or
WesVar 1789 (972.0) 108 (53.4) 1681 (970.5) negative weight(s).
Table 11: Negative Weights

OVERALL STRATUM 1 STRATUM 2

5 messages
total (stderr) total (stderr) total (stderr)
SAS 1858 (947.7) 124 (41.5) 1734 (946.8)
Stata 1858 (947.7) 124 (41.5) 1734 (946.8)
R 1858 (947.7) 124 (41.5) 1734 (946.8)
SUDAAN 1858 (947.7) 124 (41.5) 1734 (946.8)
SPSS 1858 (947.7) 124 (41.5) 1734 (946.8)
AM 1860 (871.3) . .
IVEware 1858 (870.5) 124 (29.5) 1734 (869.1)
WesVar 1858 (947.7) 124 (41.5) 1734 (946.8)
Table 12: Fractional Weights

OVERALL STRATUM 1 STRATUM 2

6 messages
total (stderr) total (stderr) total (stderr)
NOTE: Only one cluster in a stratum in domain
stratum1 for variable(s) y. The estimate of
variance for y will omit this stratum.
SAS 2975 (1315.7) 150 (27.8) 2825 (1322.1) NOTE: Only one cluster in a stratum in domain
stratum2 for variable(s) y. The estimate of
variance for y will omit this stratum.
fpc for all observations within a stratum must be
Stata . . . the same r(461);
Warning message:
R 2975 (1315.7) 150 (27.8) 2825 (1396.8) In as.fpc(fpc, strata, ids, pps = pps) : `fpc' varies
within strata: stratum 1 at stage 1
SUDAAN . . . Old and new TOTCNT are unequal.
Population size values are unequal or missing.
SPSS 2975 (920.5) 1250 (920.5) 1725 (0) The first valid value found will be used.
AM 2975 (1067.5) . .
IVEware 2975 (1067.5) . . Error: Bad strata. Only one cluster for stratum 1.
Stratum 2 has non-consecutive varunits for
WesVar . . . METHOD=JKn.
Table 13: Inconsistent Stratum Identifiers

29
OVERALL STRATUM 1 STRATUM 2
7 messages
total (stderr) total (stderr) total (stderr)
ERROR: Population total 4 for stratum 2 in data
set PROBLEM is smaller than the number of
SAS 2975 (.) . . clusters 5.
NOTE: The SAS System stopped processing this
step because of errors.
fpc must be <= 1 if a rate, or >= no. sampled
units per stratum if unit totals r(462);
Stata . 150 (27.8) . Note: 1 stratum omitted because it contains no
subpopulation members.
record 5 stage 1 : popsize= 4 sampsize= 5
R . . . Error in as.fpc(fpc, strata, ids, pps = pps) : FPC
implies >100% sampling in some strata
SUDAAN . . . Population count (4) is less than sample size (5).

SPSS 2975 (27.8) 150 (27.8) .

AM 2975 (1097.1) . .
IVEware 2975 (1097.0) 150 (31.1) 2825 (1096.6)
WesVar . . . One or more FPC factors not in range (0,1].

Table 14: Inconsistent Population Sizes

OVERALL STRATUM 1 STRATUM 2

8 messages
total (stderr) total (stderr) total (stderr)
NOTE: Only one cluster in a stratum for
SAS 2945 (1074.4) 120 (0) 2825 (1074.4) variable(s) y. The estimate of variance for y will
omit this stratum.
Note: Missing standard error because of stratum
with single sampling unit.
Stata 2945 (.) 120 (.) 2825 (1074.4) Note: 1 stratum omitted because it contains no
subpopulation members.
Error in onestrat(x[index, , drop = FALSE],
clusters[index], nPSU[index][1], : Stratum (1)
has only one PSU at stage 1
R . . 2825 (1074.4) Error in tapply(1:NROW(x), list(factor(strata)),
function(index) { : arguments must have same
length
SUDAAN . . . Sample size is 1 but TOTCNT variable is 20.

SPSS 2945 (1074.4) 120 (0) 2825 (1074.4)

AM 2945 (1133.3) . . WARNING: Stratum 1 has single PSU (cluster).
Error: Bad strata. Only one cluster for stratum 1.
IVEware . . . Only one cluster for stratum 2.
WesVar . . . Stratum 1 has only 1 varunits for METHOD=JKn.

Table 15: Singleton Sampling Units

The results in the tables suggest two points of interest: 1) certain kinds of "problems" in datasets are no
longer seen as problematic by typical software in use today; and 2) the software packages differ widely
in their handling and reporting of actual problems.

Problems with weights are handled with grace by all the software we consider here, with the minor
exception of Stata objecting to negative weights and refusing to carry the analysis further. (This is
probably a correct default response. Under certain replicate-weighting schemes, some of the replicate
weights can be negative, as shown in e.g. [51], but in general negative sampling weights should be
considered suspicious at the very least. Note that Stata can be forced to use the negative weights by
specifying the weights as "iweight" importance weights instead of as the usual "pweight".)
30
Missing weights are handled by deleting cases in all packages except WesVar, which treats them as zero
instead. It is open to debate which approach is more correct, as a missing weight may imply that the
weight is truly unknown or accidentally omitted rather than simply zero, and software cannot be
expected to know the truth; on the other hand, in forming a variance estimate one would still like to use
the collected values of variables of interest on those cases. But the majority opinion seen here is to
ignore data that cannot be relied on. It is unfortunate that SUDAAN, SPSS, AM, and IVEware neglect to
report even a warning, as something like a missing weight should probably be looked into before
proceeding with the analysis. All the software packages typically include a line to the effect of "Number
of cases read" in the output, but users may not always check the boilerplate portions of the output;
issuing a warning is a better approach.

Negative and zero weights are handled consistently by all the packages, but there are two camps: the
get-rid-of-it camp (SAS, SUDAAN, SPSS, AM, IVEware) and the try-and-use-it camp (Stata, R, WesVar).
The get-rid-of-it camp treat the offending weights as missing and remove the cases entirely, while the
try-and-use-it camp just throw them into the formulas and let it all work out in the wash. Since negative
and zero weights (especially zero weights) can certainly occur in e.g. subdomain analysis or calibration
[18], ignoring cases with zero weights is a non-robust and anticonservative approach that leads to
deflation in variance estimates, as can be seen by comparing the numbers across packages in Tables 10
and 11. Since the software should not be making assumptions about the real-world interpretation of the
weights, which can be truly arbitrary under importance-weighting schemes, treating all weight values as
valid is the correct approach.

Fractional weights that is, positive weight values less than unity are seen as entirely unproblematic
by all the professional software packages, and indeed there are a wide variety of sampling situations
where fractional weights make sense, such as e.g. PPSWR sampling [7 p234-235]. (For example, if the
sampling probability for a unit is 1 , where is the number of samples taken
with replacement from the population, then the unit has the sampling weight 1 1.) Yet despite
the clear validity of fractional weights, AM and IVEware still persist in the quietly-get-rid-of-it approach.
The results of this test alone should disqualify those packages from serious consideration as truly
general-purpose complex survey analysis tools, since only being able to handle "nice" sampling designs
is a grave limitation in practice.

All the packages can handle item nonresponse (generally by deleting cases from analysis), but sadly most
do not issue any sort of warning; Stata and WesVar are welcome exceptions here. However, there are
discrepancies in how the missingness affects the variance estimates. SAS illustrates the case where the
records are deleted outright and do not enter into the variance estimation, i.e. the variance estimate
becomes conditional on the nonmissing records; notice that the standard error is understated by SAS as
a consequence. (But recall that SAS can be forced to use a more robust method by specifying the
"NOMCAR" option.) Stata, SPSS, AM, and IVEware are clearly using the same robust method, but the
adjustment they perform is minor enough that the method could be considered an "approximate
deletion" approach, most likely using Woodruff's method with ratio estimation on the remaining cases
[57; 58]. (Or perhaps it is nothing more than a simple degrees-of-freedom correction in the variance
estimator, along the lines suggested in [26 p115-121] or [6 p427-429].) In contrast, R, SUDAAN, and
WesVar produce a fully robust estimate by assigning missing values zero weight and thereby convert the
problem into one of subdomain estimation. The latter is the correct approach, and it is somewhat

31
disappointing to see Stata fall into the former camp. (Note that Stata could be forced to perform a
SUDAAN-style analysis by use of the "subpop" command applied to an indicator for missingness, but this
is a somewhat sophisticated step that nave users might not take.)

A deeper problem sometimes encountered when working with datasets is inadvertent misspecification
of design information in individual records. Robust software should be able to detect these situations
and act appropriately. For instance, if the wrong stratum identifier is attached to a record when finite
population corrections are applied, the software will see that the "population size" is inconsistent within
the stratum. This must be an error, so the correct response is to abort analysis and require the user to fix
the problem, but several of the software packages are too eager to try something anyway, even if the
estimates are garbage because they are based on an inconsistent design. Stata, SUDAAN, and WesVar
take the correct approach and issue error messages, although it should be said that only Stata's provides
the user with any real hint as to what the problem might be. R and SPSS also issue appropriate warning
messages but barrel on ahead with the analysis regardless; notice that their variance estimates are
wildly divergent (and both wrong). SAS and IVEware also issue warnings, but for the wrong reasons.
IVEware does not know about the finite population size, so it only runs into trouble when attempting a
subdomain analysis and finding a singleton unit in a subdomain; if IVEware could handle singletons in
general, it would have proceeded without trouble. Similarly, AM is blind to the problem and cannot do
subdomain analysis anyway, so it happily spits out the wrong answer and the user would never know the
difference. But SAS should know the difference and yet it does nothing. The warning message from SAS
suggests that the problem is the presence of singletons rather than the inconsistent input, yet another
example of the overall sloppiness of implementation of the complex survey analysis procedures in SAS.

The problem addressed in Table 14 is that of design values suggesting greater than 100% sampling of
populations. Again, software should detect the inconsistency and abort, and this is what most of the
software packages do. R, SUDAAN, and WesVar flat out refuse to proceed, and all three issue helpful
error messages. SAS, Stata, and SPSS also issue informative error messages but still try to estimate
whatever they can, which is not the best response. If there is a critical error in the data, then all of it
should be regarded as suspect. Notice that the overall standard error reported by SPSS is ludicrous. On a
side note, AM and IVEware behave properly given their analytical limitations. Since they cannot perform
finite population corrections anyway, their estimates are appropriately conservative.

Finally, we compare how the different packages respond to the presence of singletons. Having a
singleton within a stratum or cluster is not necessarily an error, as many highly stratified designs select
only a single unit from each stratum (see e.g. [49]), but since variance estimation in these situations
requires delicate handling, it is usually more appropriate for the software to require some extra
specification from the user before committing to an analysis. Stata, R, and SUDAAN take this approach.
Even though all three can perform ad-hoc adjustments and produce approximate variance estimates,
they wisely refrain until more information is forthcoming. Interestingly, of the three only Stata and R
went on to produce correct variance estimates in subdomains containing no singletons, which is the
appropriate behavior; SUDAAN just aborted with cryptic errors. (R's errors were also a bit too technical.)
Stata reported correct point estimates as well, just no standard errors for problematic domains. We
noted above how SAS responds badly to singleton units by deleting the stratum altogether! and
from Table 15 it is apparent that SPSS and AM take the same approach, blithely reporting obviously
underestimated standard errors. IVEware and WesVar go to the opposite extreme and refuse to work
with singletons at all, forcing the user to restructure the design before proceeding.

32
By way of summary, we can make a few initial assessments of the fitness of the various software
packages under consideration, based on their performance on the benchmarks and the problematic
dataset. To wit:

SAS, the industry-standard software solution (and deservedly so) in many fields of statistical
analysis, is by far the least powerful of the major packages in the area of complex survey sample
analysis and also displays unwelcome default behavior in many cases
SPSS is more powerful than SAS in the area of complex survey sample analysis [94; 95] but suffers
from many of the same drawbacks (including its exorbitant pricing scheme!)
SUDAAN and WesVar are serious and well-implemented packages with impressive technological
power and robust behavior; each could rightly be considered the gold standard of its respective
complex survey sample analysis domain (SUDAAN in the area of linearized variance estimation
and WesVar in the area of replicate-based variance estimation)
Stata and R are flexible, powerful, and robust, with sophisticated complex survey sample analysis
capabilities that make them serious contenders in competition with both SUDAAN and WesVar
AM and IVEware are not sufficiently flexible, powerful, or robust in the area of complex survey
sample analysis to recommend them for general use

We will continue to examine the capabilities and performance of SAS (mainly to show why it should not
be seen as the best choice for analyzing complex survey samples), as well as SUDAAN, Stata, R, and
WesVar in the sections to follow, but the other packages will not be considered further in this paper.
SPSS deserves more consideration, but as mentioned in the introduction, its unusually high cost makes it
a non-starter.

We should also not be too hard on AM and IVEware placing them in the same playing field as the
others is more than a little unfair, and it would be unkind to go on pointing out their shortcomings in the
area of complex survey sample analysis. Both packages were designed and implemented by academic
researchers to meet specific research needs, and both are still considered by their designers to be in the
early beta phase of development.

The original mission for AM was design-correct analysis of large-scale student performance assessments
using cutting-edge marginal maximum likelihood methods (see http://am.air.org/about2.asp for more
details), and it has grown to incorporate an impressive scope of advanced regression, imputation, and
survival models. While not strong on complex survey analysis per se, it is an extremely sophisticated tool
for what it was designed to do, and free to boot. If the need for very careful accounting of variance in
arbitrary-probability multistage sampling designs is not critical, then AM is a rather sensible choice as an
easy-to-use free alternative to SAS.

Similarly, IVEware was designed for careful design-correct multiple imputation of missing values in
complex sample data, including the generation of synthetic datasets for the purpose of design
obfuscation (to maintain confidentiality of data sources) and robust variance estimation in the presence
of imputation. (See http://www.isr.umich.edu/src/smp/ive for more details.) As such, it is excellent at
what it does and there is probably no other tool like it if those are your needs. It is also free, and while
not flexible or powerful enough in the area of complex survey analysis to recommend for general use, if
your analysis needs are modest but imputation needs are great then IVEware is a good choice.

33
COMPARING ON SUDAAN DESIGNS

For years SUDAAN has been considered the go-to software package for analyzing complex survey sample
data [62; 68; 75], particularly in the area of linearized variance estimation under complex multistage
designs (although since version 7.5 the software has also had the capability to generate variance
estimates for some replicate-weight designs [105 p31]). A typical operating environment for the well-
funded survey data analyst would be SAS on the back end for data management and summary statistics
and SUDAAN on the front end for more serious design-weighted analysis and modeling [72]. Baisden and
Hu [63] provide a succinct account of the similarities and key differences in behavior and usability
between SAS and SUDAAN.

Although it was not possible to test SUDAAN directly due to cost considerations (see INTRODUCTION),
excellent and detailed documentation of SUDAAN's capabilities is available online [106] from RTI
International (http://www.rti.org), the independent nonprofit research-and-development organization
responsible for its design and development. The documentation contains many fully worked examples of
a wide range of analyses performed on public-access data from a few large-scale national surveys,
complete with downloadable datasets and annotated code and output [108]. Later on we will compare
results from SAS, Stata, R, and WesVar against SUDAAN results on an actual national health survey with
a fairly simple design (see OPTIONS FOR VARIANCE ESTIMATION), but first we will examine the range of
designs that SUDAAN is capable of handling [107], with a view to determining whether it is still the best
or only choice for these types of designs.

Siller and Tompkins [74] compared SAS, SPSS, Stata, and SUDAAN and found "identical results" on a few
datasets with fairly trivial sampling plans involving only single-stage complexities. They concluded that
the primary criteria for choosing between software packages were economics and ergonomics. This is
fine if your only interest is in computing statistics on public-access datasets as a consumer of the data. If
only first-stage information is available as is typically the case with most public-access datasets
then almost any decent statistical package will compute correctly-weighted point estimates, and as we
have seen, there are plenty of cheap or free packages that will do correct standard errors as well. (This
was not the case ten years ago, however. In 2005, Brogan [66] warned against attempting to run survey
sample analyses in "standard packages" and recommended that researchers learn and use SUDAAN. But
there has been rapid and accelerating development of survey analysis software since then [56 p410],
and nowadays there are plenty of good choices available [70], including all the "standard packages", and
many with an impressive array of sophisticated statistical procedures [25 p402-406].) But if you are a
producer of data interested in accounting for the exact covariance among observations obtained from a
complex survey design, then any prospective software should be put through harder tests.

In this section we will pose a series of tests graded in increasing complexity and difficulty. These tests
were suggested by a set of design statement examples provided in the SUDAAN documentation [109].
Since they were specifically presented as examples of SUDAAN's capabilities, we will take it as given that
SUDAAN can analyze them correctly; the "REFERENCE" column (computed using the implementation in
APPENDIX A) will serve as proxy for SUDAAN results, with the important caveat that the reference
numbers are not always exactly what SUDAAN would report. (Some minor discrepancies in standard
error calculations occur between the reference results and those of the professional packages because
the reference implementation estimates the variance of a mean using a textbook formula [2 p305],
whereas the professional packages do something more sophisticated. Wherever discrepancies appear,
they are on the order of 5% or so, with the professional packages falling on the conservative side.)
34
The common (fabricated) setup in the test designs is study of mean GPA among high-school students in
a certain area of interest. The students are assumed to have been randomly selected within schools,
which were independently sampled within regions that partition the area of interest. Thus we can view
the regions as strata, the schools as PSUs, and the students as SSUs, making this a stratified two-stage
cluster sampling design. (The datasets for each test are described in APPENDIX D.) For the initial tests
some simplifying assumptions are made, and as the tests progress in difficulty the design is fleshed out
and then expanded in the final test to a three-stage design with additional second-stage stratification.

We will be comparing the performance of SAS, Stata, and R on all the tests in this section. But this time
the focus is different: Whereas before we were looking to shake all the monkeys from the trees, here we
are seeing how well they hang on. The goal is absolute uniformity of results across all the tests. To the
extent that this can be achieved, we may conclude that the packages left standing are just as capable as
SUDAAN on this class of designs, a class of designs where SUDAAN superiority has been touted (in e.g.
[68]). As in the benchmark tests (see TESTING BENCHMARKS), WesVar was deliberately excluded from
these tests because it uses a replicate-based variance estimation paradigm rather than one based on
Taylor-series linearization [110 pA1-A2]. (We will examine WesVar and replication methods in more
detail later on; see OPTIONS FOR VARIANCE ESTIMATION.)

TEST 1: Use WR simplification due to small sampling fractions at the first stage

In this test we ignore population sizes at all levels of sampling. SUDAAN documentation suggests that
this simplification is appropriate when the first-stage sampling fractions in all strata are below 10%.
(Cochran [2 p25] provides justification for this view.) In order to see the effects of this simplification, we
also include results taking the sampling fractions into account; our sampling fractions within strata are
0.6% and 0.2% for this test.

WR WR WOR WOR
1 comments
mean se(mean) mean se(mean)
REFERENCE 2.7928 .1819 2.7703 .2287
SAS 2.7928 .1819 2.7703 .2287
Stata 2.7928 .1819 2.7703 .2287
R 2.7928 .1819 2.7703 .2287
Table 16: WR simplification vs WOR

All the packages aced this easy test. They were forced to analyze the data as a single-stage design,
ignoring weights for the WR approximation and incorporating them for WOR. Note that if weights are
applied but the FPC is ignored then the estimated standard error is .2290 (all packages agree on this as
well). Clearly, with sampling fractions this small, a weighted single-stage approximation with or without
an FPC is very accurate, and even an unweighted one is not terrible (albeit biased and anticonservative).

TEST 2: Use WR simplification due to small sampling fractions at the first stage, recode certainty units

This test is the same as the previous one except that one school was chosen with certainty. SUDAAN
documentation recommends the standard practice [26 p53] of placing the certainty school into its own
stratum and treating the students as PSUs in that stratum.

35
WR WR WOR WOR
2 comments
mean se(mean) mean se(mean)
REFERENCE 2.7928 .1714 2.6291 .3027
SAS 2.7928 .1714 2.6291 .3027
Stata 2.7928 .1714 2.6291 .3027
R 2.7928 .1714 2.6291 .3027
Table 17: WR simplification vs WOR, with certainty PSU

All the packages aced this test as well. In this case the straight WR approximation is very biased and
severely understates the variance because the GPA scores at the certainty school are given weights
similar to those at the other schools, whereas in fact they have much lower weight because students in
the certainty school have a much higher probability of selection than students from other schools.

TEST 3: Use WR simplification due to small sampling fractions at the first stage, add substratification

As a trivial adjustment to the design in Test 1, we can imagine further substratifying the strata in order
to increase the level of homogeneity within each stratum and improve the efficiency of the sample as a
consequence [6 p76-77; 32 p88]. For this test, regions (strata) were partitioned into counties (substrata)
and then schools selected independently within the counties. SUDAAN documentation points out that
the correct analysis of this design simply involves creating a set of superstrata formed by all
combinations of levels of the strata and substrata.

WR WR WOR WOR
3 comments
mean se(mean) mean se(mean)
REFERENCE 2.9865 .1207 3.0654 .1740
SAS 2.9865 .1207 3.0654 .1740
Stata 2.9865 .1207 3.0654 .1740
R 2.9865 .1207 3.0654 .1740
Table 18: WR simplification vs WOR, with substratification

Clearly we are still on easy ground all the packages perform identically again. This time there is very
little error in point estimate using the WR approximation, although the variance is still quite understated
because the substratification increased the sampling fractions to 5.3%, 1.6%, 1.1%, 0.7%, 0.6%, and
0.4%, so the scores in some strata should carry much less weight than in others. Tests 2 and 3 both show
that design-correct weights need to be applied whenever the sampling fractions are unequal, even if
they are all small. The straight WR approximation really only makes sense when all the strata are quite
large and roughly the same size.

TEST 4: Use WOR analysis due to large sampling fractions at the first stage

Now suppose that sampling fractions are much larger than 10% in at least one stratum. In this case it is
necessary to apply the finite population correction [5 p124], so a fully correct analysis requires the
capacity to apply finite population corrections to both stages of sampling. Since we already know that
SAS is unable to handle this design (see TESTING BENCHMARKS), the table below will include columns of
single-stage approximations for comparison purposes.

36
2-stage WOR 2-stage WOR 1-stage WOR 1-stage WOR
4 comments
mean se(mean) mean se(mean)
REFERENCE 2.5962 .2574 2.5962 .2575 textbook method
SAS 2.5962 . 2.5962 .2653 using weights
Stata 2.5962 .2689 2.5962 .2653 2-stage exact
R 2.5962 .2689 2.5962 .2653 2-stage exact
Table 19: 2-stage WOR vs 1-stage approximation

Note that the reference standard error estimates were computed using a standard textbook formula
that apparently slightly understates the actual standard errors when the stratum sizes are vastly unequal
(as they are in this case). But all the professional packages agree with each other, so we will make the
reasonable assumption that the professional computational algorithms are more sophisticated and
correct. Here we find the first monkey losing its grip SAS. Although the SAS results for the single-stage
approximation are in agreement with the others, there is no option in SAS to move to an exact analysis.
We also find, as we did before, that a one-stage WOR approximation to a multistage WOR design is
anticonservative; the weighted WR approximation is to be preferred in this case [26 p267-269].
Performing a weighted WR analysis in SAS yields the standard error estimate 0.269817, which is off by
less than 1% and errs in the right direction.

TEST 5: Use WOR analysis for strata with large sampling fractions and WR for small sampling fractions

We could relax the requirement for WOR analysis in Test 4 by noting that only one of the two strata has
a large sampling fraction (stratum 1, 20%); the other has smallish sampling fraction (stratum 2, 3%). In
this case, SUDAAN documentation recommends (for unknown reasons) the simplification of using a WR
approximation on low-sampled strata. This is accomplished by specifying a 0% sampling rate for these
strata. (Note that population sizes cannot be used here since the required size would be .) By way of
comparison, we repeat the exact 2-stage numbers from Table 19.

WOR+WR WOR+WR 2-stage WOR 2-stage WOR

5 comments
mean se(mean) mean se(mean)
REFERENCE 2.5962 .2612 2.5962 .2574 textbook method
SAS 2.5962 . 2.5962 .
Stata 2.5962 .2705 2.5962 .2689 exact
R 2.5962 .2705 2.5962 .2689 exact
Table 20: Combining WOR with WR approximations in some strata

Again we see that the reference (textbook) standard errors are too small, but Stata and R both agree so
we will take their figures as correct. SAS is unable to handle even this simplified design, but it should be
said that a single-stage analysis in SAS accounting for only the large sampling rate yields a very good
standard error approximation: 0.269326, which is both conservatively larger than the exact value and
yet a bit tighter than the fully WR estimate, and moreover only off from the two-stage approximate
value by less than 0.5%. Perhaps this is the reason why this design example was included in SUDAAN
documentation because it appears to be somewhat superior to the fully WR method in certain cases
like this one.

37
TEST 6: WOR analysis required due to large sampling fractions at the first stage, recode certainty units

Another wrinkle that could be added to the Test 4 design is the inclusion of certainty units. As in Test 2,
a valid analysis strategy is to lump all the certainty units together into a new stratum of their own.
Heeringa et al [25 p103-107] go into some detail about practical methods for creating, combining, and
scrambling strata under a variety of analysis and publication constraints. Since this and most of the
remaining test designs in the battery preclude exact analysis in SAS by virtue of being multistage designs,
in the tables that follow we will include "the-best-that-SAS-can-do" results in parentheses, with
comments describing the nature of the approximations required.

2-stage WOR 2-stage WOR

6 comments
mean se(mean)
REFERENCE 2.5012 .2769 textbook method
SAS 2.5012 (.2970) weighted single-stage WR
Stata 2.5012 .2968 exact
R 2.5012 .2968 exact
Table 21: 2-stage WOR, with certainty PSUs

Yet again there is a worrying shrinkage in the so-called textbook estimate, so as usual we will regard the
Stata and R numbers as correct. The SAS approximation is accurate to three decimal places and
appropriately conservative.

TEST 7: Use WOR analysis for high-sampled strata, WR for low-sampled strata, recode certainty units

This test design is an amalgam of the Test 5 and Test 6 designs, requiring some fairly convoluted
specification in survey sample analysis software, but with unclear payoffs for the added complexity.
(Some of these SUDAAN designs feel like the documentation team was just showing off what it could do
without regard to actual practical utility!) The certainty units must be handled, of course, but if the
software can handle a fully two-stage analysis without trouble (and with a simpler specification to boot),
then it seems perverse not to just ask for that.

WOR+WR WOR+WR
7 comments
mean se(mean)
REFERENCE 2.5012 .2796 textbook method
SAS 2.5012 (.2970) weighted single-stage WR
Stata 2.5012 .2979 exact
R 2.5012 .2979 exact
Table 22: Combining WOR with WR approximations in some strata, with certainty PSUs

In this case, unlike with the design in Test 5, the weighted single-stage WR approximation remains the
best approximation from SAS. The presence of more certainty PSUs has made it more difficult to obtain
a conservative single-stage approximation, as the low within-cluster variance component of the
certainty PSU is substituted for the relatively higher between-cluster variance within strata; since
knowledge of the stratum sizes is used to form the grouping into certainty versus noncertainty clusters,
the estimate of variance is likely to be biased downward [4 p73-74]. Again we see that Stata and R agree
exactly (and the textbook method continues to underestimate), so we take their figures as correct.

38
TEST 8: Use WOR at the first stage and WR analysis at the second stage

This test differs from the single-stage comparison columns of Test 4 in that here both stages are taken
into account, so a variance component is computed for the second stage, but the finite population
correction is ignored at that stage. One reason for doing this might be to use the WR approximation at
the second stage to cover a further subnested multistage sampling design in deeper stages [26 p258].

WOR then WR WOR then WR

8 comments
mean se(mean)
REFERENCE 2.5962 .2619 textbook method
SAS 2.5962 (.2698) weighted single-stage WR
Stata 2.5962 .2690 exact
R 2.5962 .2690 exact
Table 23: WOR at primary stage then WR at secondary stage

As usual, the best strategy in SAS is to stick with the weighted single-stage WR approximation, regardless
of the underlying design. Stata and R continue to agree, so we regard their estimates as correct. Note
that the textbook method is also much closer to being correct for this test than for some of the other
two-stage tests in the battery.

TEST 9: WOR analysis required due to large sampling fractions at the first stage, census at second stage

In this test, a stratified single-stage cluster sampling design, we use the finite population correction at
the first stage to account for the large sampling fractions but then assume that the second stage was a
census within clusters (i.e. that all units within the clusters were selected and measured).

1-stage WOR 1-stage WOR

9 comments
mean se(mean)
REFERENCE 2.7603 .2539 exact
SAS 2.7603 .2539 exact
Stata 2.7603 .2539 exact
R 2.7603 .2539 exact
Table 24: Single-stage WOR cluster sampling

This test was obviously somewhat trivial in conception, and all the packages agreed exactly, including
SAS (and the reference package using textbook formulas), but it is worth examining for one reason:
because it exactly illustrates the assumption involved in deciding to apply a finite population correction
to survey data in SAS. Doing so is tantamount to assuming that the second-stage observations were the
result of a census on the cluster populations. If that is not the case (as it almost never is), then an analyst
should think twice before using the finite population correction in a software package like SAS that only
accounts for first-stage complexities. As we saw in Test 5, there may be some situations where a more
efficient estimate can be obtained by including limited information about population size when sampling
rates are very large say, on the order of 25% or so [5 p124; 6 p44] but in general it seems to be
good conservative practice to avoid applying the correction at any stage where your software cannot
handle the later-stage complexities.

39
TEST 10: Analyze a three-stage stratified design with large sampling fractions at the first two stages

Here is where things get thorny. Suppose that within regions we randomly sampled counties and then
applied stratification to each county before drawing schools independently within each county stratum.
Since students are further subsampled within schools, we now have a complex three-stage design
combining a stratified two-stage cluster sample nested within a stratified one-stage cluster sample.
There are no ready textbook formulas for a design this complicated, but one can be cobbled together
from standard formulas for simpler designs, viz. for a population total (see Cochran [2 p285-289],
Deming [3 p156], Hansen, Hurwitz, and Madow [5 p181-185], and Sampath [9 p77-78,143-147]) under
equal-probability sampling, we have:

where is the sample variance of the cluster totals within the first-stage stratum and:

1
1 1

If we let 1 stand for the design variance of a cluster total computed over , then the
expression for simplifies to:

where is some estimate of the variance of the complex design of the within-cluster sampling of SSUs.
The index in the SSU variance weight / (an artifact of the second stage of stratification;
see APPENDIX E) requires that the weighting be done inside the per-stratum calculations, which is an
inconvenience if you are trying to plug unstratified two-stage cluster sampling estimates into the larger
formula. Following an old idea of Cochran's (from the 1953 first edition of Sampling Techniques, p232;
cf. 2 p289), we could let and try the approximation:

Then the estimate of the variance of the estimated population mean could be expressed as
, where is the sum of the design weights over all records.
We will refer to this as the final-weight approximation in the table below.

40
The above is the approach adopted for the reference estimate (because the implementation given in
APPENDIX A is unable to handle a complex three-stage design directly); as we see in the table below, it is
quite a good estimate but still mildly anticonservative (due to the two-stage cluster sampling estimator).
It is interesting as a comparison to what SAS comes up with given similar technical limitations.

complex 3-stage complex 3-stage

10 using WR at end using WR at end comments
mean se(mean)
REFERENCE 3.0383 .1930 final-weight approximation = .1874
SAS 3.0383 (.2124) weighted single-stage WR
Stata 3.0383 .1933 exact
R 3.0383 .1933 exact
Table 25: Stratified WOR cluster sampling of stratified two-stage cluster sampling

Apparently nothing beats the weighted single-stage WR in SAS-world! Applying the finite population
correction in SAS yields the standard error estimate 0.177684, which is much too small. Notice that this
design approximation strategy (suggested by SUDAAN documentation) is inherently anticonservative.
While it is motivated by the very large sampling fractions in the early sampling stages, these corrections
overstate the variance reduction because they do not consider the considerable variance added by the
later stages. Doing a fully design-correct analysis, accounting for the stratification and sampling fractions
at all three stages yields the standard error estimate .2107588 (same answer from both Stata and R),
which is larger than all of the "exact" (to the order of the approximation) estimates above. This is a nice
illustration of a theme we have seen running throughout these tests Don't use FPCs without a very
good reason! Good reasons might include: a) the exact sampling design is simple; b) the design involves
only a couple of stages and the first-stage sampling fractions are large for all strata; c) your pointy-haired
boss insists on having them in the analysis and will not listen to reason; etc. West [20] also gives a very
nice discussion on this point, defending the SAS paradigm against criticisms of neglectfulness of the
range of complexity in sampling designs.

Although this series of tests has shown that SAS cannot keep up with the new pack of younger, stronger,
and more agile software, it has also revealed that the SAS strategy of "analyze everything as a weighted
single-stage WR design" is not a bad one in practice. It usually leads to appropriately conservative
estimates that are not too far off from the design-correct estimates in typical cases where the complex
design is not unusually efficient.

To cap off this section, we will compare the performance of SAS, Stata, R, and WesVar on a larger and
somewhat more realistic (but still fabricated) example adapted from a dataset available on a page of
downloadable datasets (http://www.stata-press.com/data/r13/svymain.html) created by StataCorp for
use with the Stata 13 survey reference manual [104]. (See APPENDIX D for a description and link to a
copy of the adapted dataset.) This dataset has a typical stratified three-stage structure, with students
selected WOR within high schools, schools selected WOR within counties, and counties selected WOR
within states, which act as top-level strata; however, since student populations for individual schools are
not provided with the data, for analysis purposes the familiar WR approximation is applied at the third
stage. (This design is similar to Test 10 above.) An additional wrinkle is that the weights attached to
students are not simply the sampling design weights, but have been raked and scaled to reflect national
demographic distributions and nonresponse (see [7 p342-346] for how this is done in principle), and sum

41
to the estimated US population of high school students, so analysis software must be provided with the
adjusted weights in order to compute correct estimates. And since the design involves sampling exactly
two PSUs from each stratum, it is a prime candidate for analysis using a replicate-based method such as
balanced repeated replication (BRR; see OPTIONS FOR VARIANCE ESTIMATION for more on this). Thus
WesVar was used to create a set of BRR replicate weights so its performance could be compared with
the other packages on a dataset where all packages could compete on a level playing field.

The statistic of interest will be the mean height-to-weight ratio of students, examined both across and
within sex (male vs female) and race (white vs black vs other) categories. Ratio estimation within
subdomains is a fairly delicate process since the estimator is technically a ratio of ratios [2 p183-184; 6
p503-505; 7 p133-138], so we will be looking hard for evidence of analytical failure. As we saw above,
SAS will not be able to handle the design as specified, but we can apply the all-purpose tool of weighted
single-stage WR approximation and see how it fares.

In the following tables, estimated ratios obtained from each software package are presented along with
the corresponding estimated standard errors in parentheses. It is important to be aware that the
WesVar estimates are computed in a very different way from how the other packages compute them.
WesVar employs a kind of bootstrap process: the user specifies an arbitrary function of sample
quantities in this case, "ratio = height / weight" and the program computes an estimate of
the function using each of the BRR weights in turn. These estimates are then averaged to form the
reported statistic. Thus there is a process-imposed potential for mismatch with standard estimates.

OVERALL
estimate comments
mean (stderr)
SAS 2.6992 (.0107) reported stderr = .010749
Stata 2.6992 (.0107) reported stderr = .0106909
R 2.6992 (.0107) reported stderr = .01069091
WesVar 2.7551 (.0104) using the geometric mean
Table 26: Mean height-to-weight ratio of students

BY SEX SEX
mean (stderr) Male Female
SAS 2.5693 (.0150) 2.8437 (.0175)
Stata 2.5693 (.0145) 2.8437 (.0171)
R 2.5693 (.0145) 2.8437 (.0171)
WesVar 2.6055 (.0147) 2.9014 (.0162)
Table 27: Mean height-to-weight ratios of students by sex

BY RACE RACE
mean (stderr) White Black Other
SAS 2.7035 (.0120) 2.6037 (.0372) 2.9694 (.0533)
Stata 2.7035 (.0118) 2.6037 (.0359) 2.9694 (.0563)
R 2.7035 (.0118) 2.6037 (.0359) 2.9694 (.0563)
WesVar 2.7577 (.0118) 2.6701 (.0381) 3.0154 (.0667)
Table 28: Mean height-to-weight ratios of students by race

42
BY SEX RACE
AND RACE
mean White Black Other
(stderr)
SAS Stata R WesVar SAS Stata R WesVar SAS Stata R WesVar
Male 2.5602 2.5602 2.5602 2.5934 2.5875 2.5875 2.5875 2.6447 2.9003 2.9003 2.9003 2.9514
S (.0160) (.0153) (.0153) (.0157) (.0452) (.0445) (.0445) (.0445) (.0827) (.0869) (.0869) (.1109)
E
X SAS Stata R WesVar SAS Stata R WesVar SAS Stata R WesVar
Female 2.8689 2.8689 2.8689 2.9236 2.6172 2.6172 2.6172 2.6895 3.0340 3.0340 3.0340 3.0689
(.0197) (.0191) (.0191) (.0183) (.0547) (.0529) (.0529) (.0545) (.0600) (.0628) (.0628) (.0737)
Table 29: Mean height-to-weight ratios of students by sex and race

The impression gained from the tables is that there is not much difference in precision between the
estimates produced by SAS, Stata, and R. Although SAS is forced to apply the single-stage approximation
to this three-stage design, its reported standard errors are mostly on the order of 5% off of those of the
other packages, and mostly on the conservative side, even in this fairly challenging problem involving
subdomain estimation using a nonlinear estimator. Examination of the SAS log reveals that some of the
subdomain definitions created random singletons in strata, which were omitted per usual SAS operating
procedure (see TESTING BENCHMARKS); this occurred in the strata involving the "Other" race category,
and as we can see in the tables, those standard errors are predictably anticonservative. Stata and R
encountered the same trouble with singletons but handled the problem more gracefully and produced
robust estimates. (Again recall that SAS can be forced to employ robust variance estimation by inclusion
of the "NOMCAR" option in the procedure statement, but Stata and R do this by default.)

We should also not be too hasty to dismiss WesVar's estimates as poorly executed. In fact, the balanced
replication procedure is robust [56 p354-366] and on solid mathematical footing [56 p107-112], and in
many situations has been shown to produce more accurate confidence intervals (in the sense of better
coverage probabilities) than the linearization methods used by default in SAS, Stata, and R [48]. And
Royall and Cumberland (in [15 p293-309]) demonstrated that even under a simple sampling plan, the
ratio estimator itself can be extremely volatile both in terms of bias and average error, particularly when
the sampled units are badly unbalanced in terms of size and variance, as subdomains often are. Thus it is
entirely possible that WesVar's bootstrap-style estimates actually have smaller average error. Although
the WesVar standard errors are on par with those of SAS (biased slightly high because the BRR estimator
tends to be biased high in general [48]), the WesVar point estimates are all slightly larger than the
"approximately unbiased" standard estimates, indicating that either the standard estimates are biased
downward on this particular sample or the BBR estimates are biased upward; majority opinion counts
for little with nonlinear estimators, which are inherently biased anyway, and consistent estimation of
nonlinear statistics is one of the areas where BRR estimators are strong [1 p210-241].

Stata and R were rock-solid on all of the tests in this section, neither deviating one iota from the other.
Based on these results, both packages should be viewed as entirely equal to SUDAAN in performance for
an important class of designs on which SUDAAN used to reign supreme. Indeed, a scan of the feature
comparison charts in Heeringa et al [25 p402-406] shows that Stata and R can do basically everything
SUDAAN can, and a whole lot more as well by virtue of their being general statistical analysis packages.
The era of expensive SUDAAN-style specialized complex survey sample analysis software is coming to an
end. In its place are rising equally capable do-it-all packages that are both cheaper and more flexible.

43
OPTIONS FOR VARIANCE ESTIMATION

We have seen that Stata and R are essentially the equal of SUDAAN when it comes to linearized variance
estimation methods, and even SAS is not too bad in this arena, provided that the analyst is careful to
stick to the weighted single-stage WR approximation for most (if not all) designs. But Taylor-series
linearization is not the only game in town when it comes to variance estimation for statistics computed
on complex survey samples. A competing model of variance estimation and the model adopted by
software packages such as WesVar [110 pA1-A31] is to compute the statistic of interest on a
subsample taken from the original sample and then use the variance of the set of estimates computed
on each subsample to estimate the variance of the statistic itself [7 p373; 39].

A variety of resampling methods exist, including the method of random groups (a.k.a. interpenetrating
samples) [1 p208-210; 7 p370-373; 10 p423-425; 31 p174-178; 56 p21-106], the method of balanced
repeated replication (BRR) [1 p210-214; 7 p373-380; 10 p430-437; 31 p178-182; 56 p107-150], several
variants of the jackknife method [1 p206-208; 7 p380-383; 10 p437-442; 56 p151-193], and several
variants of the general bootstrap method [1 p214-228; 7 p384-386; 10 p442-444; 56 p194-225]. We will
discuss the major variants below.

Table 30 (see also [25 p403]) gives a brief overview of direct support for Taylor-series linearization and
resampling methods in the software packages under consideration. Details will be discussed later.

Linearization BRR Jackknife Bootstrap

SAS single-stage only external or internal external or internal none
Stata multistage external or internal external or internal external only
R multistage external or internal external or internal external or internal
WesVar none external or internal external or internal external only
SUDAAN multistage external only external or internal none
SPSS multistage none none none
AM single-stage only external only external only none
IVEware single-stage only internal only internal only none
Table 30: Software support for alternative methods of variance estimation

From Table 30 we can see that the software packages fall into two classes of variance estimation
paradigms: focused or broad-based. The focused class includes WesVar (focused on replication) and
SPSS (focused on linearization), and the broad-based class includes the rest. Among the broad-based
packages, some are highly capable (R, Stata, SUDAAN, and SAS, in decreasing order of capability), and
the others are limited (AM and IVEware, in decreasing order of capability). The specific capabilities of
each package on a particular method will be discussed in conjunction with the description of that
method below.

The method of random groups is not included in the table because it simply involves partitioning the
sample into groups that each mirror the overall sample design; then the statistic of interest is computed
on each group and the sample variance of the statistic over the groups is used to estimate its sampling
variance. Since this procedure can be done using any statistical software whatsoever, we will not have
cause to consider it further, beyond noting the computational formula and offering a brief summary.

44
To facilitate discussion of the various methods of variance estimation, we adopt the notation:

parameter of interest
full-sample estimate of
th estimate of (i.e. estimate computed on the th subsample)
average value of
overall scaling factor (i.e. a known scalar or matrix that depends on the method)
scaling factor for the th estimate (i.e. is a known 1 vector that depends on the method)

In addition to computational formulas and brief summaries of the philosophy and procedures involved,
discussion of the methods will also include a performance comparison of the major packages (SAS, Stata,
R, SUDAAN, and WesVar, where applicable) on a sample dataset involving a fairly simple design plan that
any of the variance estimation methods can accommodate. This dataset (described in APPENDIX D)
comes from an actual study of WIC (Special Supplemental Nutrition Program for Women, Infants, and
Children) participants. A question of interest is whether race or educational level influences a new
mother's decision to initiate breastfeeding, and whether there are systematic differences in infant birth
weight along the same axes [105], so we will ask each package to estimate the proportion of mothers
who have initiated breastfeeding and mean birth weight (in ounces) across race and educational levels;
standard errors will be computed using the variance estimation method under discussion. SUDAAN
results on this dataset using version 7.5 are taken from Bieler and Williams [105].

LINEARIZATION

Suppose is a smooth function of a random vector of sample totals with expected value
and covariance having consistent design-based estimators and respectively.
Then under mild regularity conditions such as local continuity of the parameter space [37; 53] we have:

. .

where is the gradient of evaluated at . Note that this gradient plays the role of the
scaling factor in the notation above, i.e. if we let then . Demnati and Rao [41; 42]
further extend this basic framework, generalizing an on-the-fly variance evaluation algorithm due to
Woodruff [58]. Although the approach is well developed in survey sampling literature, it has been
criticized for applying infinite-population convergence results to finite populations for which those
results may not hold [56 p232]. Other disadvantages of the method are the need for special-case
derivative evaluations for every statistic of interest, and the inescapable fact that not every statistic of
interest can be expressed as a smooth function of sample totals [7 p369]. In particular, functions of
quantiles (e.g. medians) fall into this category, and the method has also been shown to be volatile on
certain other statistics such as multiple correlation coefficients (see e.g. [55]). Nevertheless, this has
been the preferred method of robust variance estimation in software for decades [18], although
resampling procedures are gaining in popularity with the advent of cheap computing power [17].

45
With the notable exception of WesVar, all of the software packages under consideration support the
linearization method of variance estimation. But if this is your preferred method, then your first choice
should be one of Stata, R, or SUDAAN (or SPSS), since SAS and the weaker programs do not exploit the
theory to its full potential.

BREASTFEEDING RACE EDUCATION

proportion OVERALL < High High > High
White Black Latina Other
(stderr) School School School
.5370 .5104 .3202 .8324 .6319 .5187 .4991 .6804
SAS
(.0310) (.0318) (.0445) (.0348) (.1220) (.0520) (.0323) (.0460)
.5370 .5104 .3202 .8318 .6319 .5187 .4991 .6789
Stata
(.0310) (.0318) (.0445) (.0351) (.1220) (.0520) (.0323) (.0462)
.5370 .5104 .3202 .8318 .6319 .5187 .4991 .6789
R
(.0310) (.0318) (.0445) (.0351) (.1220) (.0520) (.0323) (.0462)
.5370 .5104 .3202 .8324 .6319 .5187 .4991 .6804
SUDAAN
(.0310) (.0318) (.0445) (.0348) (.1220) (.0520) (.0323) (.0460)
Table 31: Estimated proportions of breastfeeding WIC mothers (LINEARIZED)

BIRTHWEIGHT RACE EDUCATION

mean OVERALL < High High > High
White Black Latina Other
(stderr) School School School
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
SAS
(.925) (1.03) (1.61) (1.26) (4.64) (1.34) (1.30) (2.32)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
Stata
(.925) (1.03) (1.61) (1.26) (4.64) (1.34) (1.30) (2.32)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
R
(.925) (1.03) (1.61) (1.26) (4.64) (1.34) (1.30) (2.32)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
SUDAAN
(.925) (1.03) (1.61) (1.26) (4.64) (1.34) (1.30) (2.32)
Table 32: Estimated mean birth weights of infants born to WIC mothers (LINEARIZED)

There are very minor discrepancies in some of the cells (most likely due to idiosyncratic handling of item
nonresponse in SAS and SUDAAN), but on the whole all packages agree completely using this method.

RANDOM GROUPS

Suppose we divide the sample into groups in such a way that each group is a small snapshot of the
complete survey. (Note that the sizes of clusters must be large enough to support a reasonably large
value of or the point and variance estimates are likely to be strongly biased, especially for nonlinear
statistics [7 p373; 31 p176-177].) Then we have the estimators:

1
1

46
Notice that the variance estimator here is nothing more than the estimated sampling variance of . The
scaling factor is 1/ 1 . This method is simple but robust, and easily explained. With a large
enough number of groups it can handle essentially any statistic, and weighting adjustments do not
throw it off in the slightest; all it requires is approximate independence of the groups [56 p440-444].
Wolter [56 p50-57,64-83] gives a nice discussion of some variations on this method.

BALANCED REPEATED REPLICATION (BRR)

In the very specific (but common) case of a stratified sample where exactly two PSUs are chosen within
each stratum, a refinement of the random group method can be used. Two groups can be formed based
on the sample: each group includes exactly one of the two PSUs from each stratum. But if these groups
are selected randomly, the variance estimate is likely to be extremely volatile due to the fact that there
is only one degree of freedom for estimation (i.e. because we would have 2); a better approach is
to choose half-samples according to a balancing plan akin to a fractional factorial design [7 p375].
Suppose that each of the half-samples is associated with an | | 1 vector 1, 1 | | , where
| | is the number of strata, constructed so that 0. Within stratum , let be
the first PSU and be the second, and define the first value in the th subsample as:

1

1

Note that we can define the subsamples in this way by means of a Hadamard matrix [56 p367-368], a
square matrix on 1, 1 with the balanced orthogonality property described above. Clearly this
matrix should be | | | | or larger (for technical reasons the dimension should be at least the smallest
multiple of 4 strictly greater than | | in order to ensure full orthogonality of the balance [56 p112]); the
th column is associated with stratum , and the th row with half-sample . See [30] for an online
library of Hadamard matrices of various sizes. To form a half-sample, we scan across the row and select
if the th cell in the row contains 1 and if the cell contains 1. Then the estimators are:

where is simply the estimate of computed according to the selection of the subset of
observations that define the th half-sample. Note that 1/ for this method. There are some
variations on the theme, most notably one due to Fay [46] that replaces the scaling factor with
1/ 1 , where 0 1 is a perturbation of the weights: instead of zero-weighting the
negative half-sample and double-weighting the positive one, the negative half-sample is assigned a
weight of and the positive one a weight of 2 . (The WesVar manual [110 pA9] recommends a value
of .3 for most applications.) Further extensions of BRR methods to designs with more than two PSUs
per stratum and allowing a smaller number of half-samples (for lower storage demands and greater

47
computational efficiency) are discussed in Wolter [56 p123-139], but in typical usage this method is
restricted to the two-PSUs-per-stratum design [7 p380].

Almost all the major software packages we consider fully support BRR variance estimation. SPSS is a very
glaring exception in this regard. By "fully support" we mean that a package accepts precomputed
weights in a datafile and uses them appropriately, and in addition has the capability to compute BRR
weights on the fly if need be. Stata is fussy in requiring the user to specify a Hadamard matrix in the
latter situation, but the documentation describes an easy way to do this [104 p87]. SAS and R (and of
course WesVar) carry their own libraries of matrices and can select an appropriate one automatically.
SUDAAN and AM can compute BRR estimators, but you need to provide the replicates yourself, and
IVEware offers the option of doing its own special-case version of BRR (called "paired selection") for
strata that meet the design requirement, but there are no tuning parameters for the user. In the tables
below we force all the packages to use the BRR weights provided in the dataset in order to maintain
comparability with SUDAAN results.

BREASTFEEDING RACE EDUCATION

proportion OVERALL < High High > High
White Black Latina Other
(stderr) School School School
.5370 .5104 .3202 .8324 .6319 .5187 .4991 .6804
SAS
(.0311) (.0321) (.0469) (.0351) (.1496) (.0521) (.0337) (.0466)
.5370 .5104 .3202 .8318 .6319 .5187 .4991 .6789
Stata
(.0311) (.0321) (.0466) (.0353) (.1481) (.0521) (.0337) (.0469)
.5370 .5104 .3202 .8318 .6319 .5187 .4991 .6789
R
(.0311) (.0321) (.0466) (.0353) (.1481) (.0521) (.0337) (.0469)
.5370 .5104 .3202 .8324 .6319 .5187 .4991 .6804
SUDAAN
(.0311) (.0321) (.0469) (.0351) (.1496) (.0521) (.0337) (.0466)
.5367 .5104 .3202 .8318 .6319 .5187 .4991 .6789
WesVar
(.0311) (.0321) (.0469) (.0354) (.1496) (.0521) (.0337) (.0469)
Table 33: Estimated proportions of breastfeeding WIC mothers (BRR)

BIRTHWEIGHT RACE EDUCATION

mean OVERALL < High High > High
White Black Latina Other
(stderr) School School School
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
SAS
(.922) (1.10) (1.64) (1.36) (5.52) (1.38) (1.32) (2.40)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
Stata
(.922) (1.10) (1.63) (1.36) (5.46) (1.38) (1.32) (2.40)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
R
(.922) (1.10) (1.63) (1.36) (5.46) (1.38) (1.32) (2.40)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
SUDAAN
(.922) (1.10) (1.64) (1.36) (5.52) (1.38) (1.32) (2.40)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
WesVar
(.922) (1.10) (1.64) (1.36) (5.52) (1.38) (1.32) (2.40)
Table 34: Estimated mean birth weights of infants born to WIC mothers (BRR)

48
Again we see only small discrepancies, most likely due to handling of item nonresponse. Interestingly,
WesVar sometimes agrees with SAS and SUDAAN, and sometimes with Stata and R. We can conclude
that all the major packages compute essentially equivalent BRR estimates, but some packages (namely
SAS and R) nicely automate the process and save the user from the fuss of specifying replicate weights.

JACKKNIFE

The usual jackknife estimator is a tool to reduce first-order bias in nonlinear estimators [56 p152-159],
but it has been developed extensively in complex survey sampling theory as a variance-estimation tool
as well [17], with several proposed variants [48]. The essential idea of all the developments is the same:
resample by deleting one observation (or a codependent group of observations) and compute the
statistic of interest on the reduced sample; the variance between the subsamples then becomes the
estimate of variance for the statistic. Wolter [56 p169] observes that for most linear statistics the
jackknife estimator simply reproduces the standard variance estimators found in textbooks. This is a nice
property, but the real utility of the jackknife comes in estimating variance in complex multistage samples
involving nonlinear statistics. In the most general case of a stratified multistage sample, one PSU at a
time is deleted from each stratum separately (taking with it the entire ultimate cluster of observations
nested within it), and the statistic of interest is computed within the stratum on the reduced set of
observations. In practice this is accomplished by setting the weight of the deleted PSU in stratum to
zero and upscaling those of the rest of the observations by / 1 , where is the sample size of
the stratum. Then the sum of squared deviations for the stratum is downscaled by 1 / to give
an unbiased estimator in the linear case. (The exact asymptotic properties are unknown for nonlinear
statistics in the case of finite populations, but see [56 p369-383] for some known results.)

Formally, let be the total count of PSUs in all strata and be the estimate
of computed on the sample induced by deleting PSU from stratum , where for each observation
1, , over the full sample of size we apply the weight:

0

1

where is the index subset induced by selecting stratum , the index subset induced by selecting
PSU within stratum , and the index subset induced by selecting all PSUs except in stratum .
Then we can define the estimators [56 p174]:

1 1 1
1 1 1
| |

49
where . This can be extended to WOR sampling by replacing every 1 factor
with 1 1 . It is easy to see that we need exactly as many replications as deletions, so a total
of replicates are obtained, as hinted above. Putting these estimators into the general form of point
and variance estimators used in the description of the other methods, we can define:

1 1 1

| |

and write:

| |

The advantage of this formulation is apparent. No matter what the design or statistic, we can throw it
into the formulas and obtain consistent estimators, making the jackknife a one-size-fits-all approach.
There is one important caveat: the variances of certain statistics such as quantiles (e.g. medians) are not
well estimated by the jackknife estimator in single-stage samples [7 p383]. But in stratified multistage
samples where either there is a large number of strata or a large number of PSUs within each stratum,
the jackknife has been shown to be asymptotically consistent [110 pA15]. In all cases it is possible to
obtain conservative estimates by replacing with in the variance estimator above [56 p179], and
other methods are available (see e.g. [57]) for estimating statistics such as population quantiles. One
disadvantage of the jackknife over BRR is that it requires replicates equal to the total number of PSUs,
where BRR only requires replicates on the order of the number of strata, which may be much smaller in
practice. But this is a minor quibble nowadays, when computing resources are cheap and plentiful.

The method just described is called the JKn jackknife, and it is fully supported in all of the major software
packages except (predictably) SPSS. R and WesVar in particular can generate replicate weights for
several different jackknife variants. IVEware also supports JKn internally for some procedures but with
no user-tunable parameters, and AM accepts precomputed weights for the same jackknife variants that
WesVar supports but will not generate replicates on the fly like the major packages will. The latter ability
is actually a considerable boon to analysts, because datasets do not need to include replicates directly,
and unlike with BRR, users do not need to concern themselves with mathematical details like Hadamard
matrices. Just specify jackknife replication and go. Note that in the case of a single stratum, the formulas

50
reduce by setting | | 1 and , yielding what is known as the JK1 jackknife. WesVar can
compute another variant called the JK2, which is a simpler and more efficient (but not equivalent)
version of JKn in the special case when exactly two PSUs are selected in each stratum [110 pA10-A11], so
since the sample dataset meets this condition, in the tables that follow we will include a second WesVar
row showing the JK2 results for comparison.

BREASTFEEDING RACE EDUCATION

proportion OVERALL < High High > High
White Black Latina Other
(stderr) School School School
.5370 .5104 .3202 .8324 .6319 .5187 .4991 .6804
SAS
(.0310) (.0318) (.0449) (.0352) (.1279) (.0523) (.0323) (.0461)
.5370 .5104 .3202 .8318 .6319 .5187 .4991 .6789
Stata
(.0310) (.0318) (.0448) (.0353) (.1260) (.0522) (.0323) (.0463)
.5370 .5104 .3202 .8318 .6319 .5187 .4991 .6789
R
(.0310) (.0318) (.0448) (.0354) (.1279) (.0523) (.0323) (.0463)
.5370 .5104 .3202 .8324 .6319 .5187 .4991 .6804
SUDAAN
(.0310) (.0318) (.0449) (.0352) (.1279) (.0523) (.0323) (.0461)
.5367 .5104 .3202 .8318 .6319 .5187 .4991 .6789
WesVar (JKn)
(.0311) (.0318) (.0449) (.0354) (.1279) (.0523) (.0323) (.0463)
.5367 .5104 .3202 .8318 .6319 .5187 .4991 .6789
WesVar (JK2)
(.0310) (.0319) (.0439) (.0339) (.1121) (.0496) (.0324) (.0472)
Table 35: Estimated proportions of breastfeeding WIC mothers (JACKKNIFE)

BIRTHWEIGHT RACE EDUCATION

mean OVERALL < High High > High
White Black Latina Other
(stderr) School School School
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
SAS
(.925) (1.03) (1.64) (1.27) (4.87) (1.34) (1.31) (2.32)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
Stata
(.925) (1.03) (1.63) (1.27) (4.79) (1.34) (1.30) (2.32)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
R
(.925) (1.03) (1.64) (1.27) (4.87) (1.34) (1.32) (2.32)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
SUDAAN
(.925) (1.03) (1.64) (1.27) (4.87) (1.34) (1.31) (2.32)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
WesVar (JKn)
(.925) (1.03) (1.64) (1.27) (4.87) (1.34) (1.31) (2.32)
116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
WesVar (JK2)
(.914) (1.02) (1.50) (1.18) (4.20) (1.29) (1.31) (2.38)
Table 36: Estimated mean birth weights of infants born to WIC mothers (JACKKNIFE)

Again there are a few minor discrepancies, but the JKn estimates in all packages are mostly the same. It
is interesting that R agrees more with SAS and SUDAAN than with Stata, marking a break from the usual
pattern so far; Stata gives some lowball standard errors in the smaller subdomains. The JK2 standard
error estimates are quite a bit different from the JKn generally smaller and closer to Stata's so we
see evidence that JK2 is not just a smaller JKn but actually a different estimator in its own right.
51
BOOTSTRAP

Of all the resampling methods, the bootstrap has the lowest level of software support, despite the tone
of optimism concerning bootstrap estimators in complex sampling literature [48; 54]. Of all the packages
under consideration, only R provides full support for the method, including numerous tuning options for
the user [84 p46]. Part of the problem stems from theoretical uncertainties concerning how (and how
many) bootstrap samples are selected, and which form of the estimator holds the best properties; see
Rao and Wu [51] for a thorough discussion. The basic issue is that the nave choice of resampling a large
number of observations with replacement according to the probabilities implied by the sampling
weights leads to an inconsistent estimator in the general case [56 p204]. Another issue is that many
replicates are usually required to stabilize the estimate, due to the effects of Monte Carlo error in the
resampling process itself. (This effect is avoided in the BRR and jackknife methods by resampling
according to a balanced or systematic plan.) But of course more replicates requires more storage space
and more computing time, which can become burdensome if large numbers of statistics need to be
computed.

If is any function on the finite population totals and is the sample size, then given
a particular sample
taken with replacement from , where is the size of the
finite population, we can assume the natural estimate
and construct an estimator of using
. If is not smooth (i.e. not continuously differentiable in ) then may not be identifiable
given , due to the possible lack of a well-defined inverse mapping, so we should further impose a
smoothness condition on . Then the bootstrap point and variance estimators based on subsamples of
size taken with replacement are [56 p214-215]:

1 1

1 1

The beauty of these estimators is their sheer simplicity, but as discussed above, this apparent simplicity
belies some theoretical subtleties in the finite population application. In particular, we gloss over the
important fact that each should be computed using a consistent (and preferably unbiased) estimator
appropriate to the sampling design; otherwise will not be reliable. Perhaps for these reasons, most
statistical software developers do not see a great need to add bootstrap variance estimation to their
complex survey sample analysis offerings, especially when (as we have seen) other methods like the
jackknife seem to do just fine. Stata and WesVar will obligingly use bootstrap replicates if you provide
them yourself, but the results may be questionable due to the theoretical difficulties mentioned above.
In contrast, R makes a great effort to be careful and precise in its implementation of bootstrapping for
variance estimation, and provides a few alternative methods based on some recent research findings.
(See http://www.inside-r.org/packages/cran/survey/docs/bootweights for current discussion and
references.) In the tables to follow, we will include only estimates from R (since the other packages
would simply be repeating the analysis using weights computed by R) on two of the available methods,
along with the linearization method results for comparison. In both cases we generate just 200

52
bootstrap subsamples because simulation results from Rao and Wu [51] suggest that for typical
situations not much gain in performance is obtained by generating more than this.

BREASTFEEDING RACE EDUCATION

proportion OVERALL < High High > High
White Black Latina Other
(stderr) School School School
R .5370 .5104 .3202 .8318 .6319 .5187 .4991 .6789
linearized (.0310) (.0318) (.0445) (.0351) (.1220) (.0520) (.0323) (.0462)
R .5370 .5104 .3202 .8318 .6319 .5187 .4991 .6789
standard boot (.0296) (.0323) (.0477) (.0370) (.1252) (.0518) (.0307) (.0489)
R .5370 .5104 .3202 .8318 .6319 .5187 .4991 .6789
rescaled boot (.0316) (.0308) (.0481) (.0377) (.1499) (.0567) (.0320) (.0426)
Table 37: Estimated proportions of breastfeeding WIC mothers (BOOTSTRAP)

BIRTHWEIGHT RACE EDUCATION

mean OVERALL < High High > High
White Black Latina Other
(stderr) School School School
R 116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
linearized (.925) (1.03) (1.61) (1.26) (4.64) (1.34) (1.30) (2.32)
R 116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
standard boot (.906) (1.06) (1.67) (1.34) (5.33) (1.33) (1.35) (2.39)
R 116.62 118.68 108.69 120.37 119.25 115.21 116.54 120.40
rescaled boot (.912) (.965) (1.71) (1.23) (5.12) (1.32) (1.33) (2.35)
Table 38: Estimated mean birth weights of infants born to WIC mothers (BOOTSTRAP)

The results show an interesting trend toward larger variance estimates for small subdomains but smaller
estimates for larger subdomains (and by extension, the full sample); similar patterns in behavior were
found by Zhang, Weng, Salvucci, and Hu in a study of the performance of resampling estimators in the
case where the number of PSUs is small due to intense stratification [59]. The different bootstrap
methods also show no apparent systematic bias in one direction or the other, although the trend just
mentioned is more pronounced for the rescaled bootstrap than for the standard bootstrap. Overall,
these estimates seem to be very consistent with what we obtained from the other methods, although
there is a certain level of negligible volatility in the numbers due to the nature of the Monte Carlo
process. But then again, this just seems to make a case for not really needing a bootstrap method at all
the other methods do just as well and are more controlled. The big advantage to having a bootstrap
procedure available comes when you want to estimate much scarier kinds of statistics on much larger
and rougher datasets, where linearization is out of the question, fitting the design into a BRR mold
would involve too much collapsing of the design information, and a jackknife would require maybe
hundreds or thousands of replicates. That kind of wilderness is where the bootstrap really shines, so we
stand with R in asserting that it deserves a place at the complex survey analysis table.

Another kind of resampling method that we have not considered above is successive difference
replication, which was suggested by Fay and Train [43] specifically to handle variance estimation for
systematic sampling designs under a resampling estimator framework. Stata provides direct support for
this method but requires that the user supply the weights, and IVEware can use the method internally
for strata that fit the assumptions but (as usual) provides no user-tunable parameters for use. We will

53
not pursue this method further, beyond noting that it is one further option for (mainly Stata) users who
happen to run across datasets containing successive-difference replicate weights. (See the current Stata
user manual [104] for more details.)

In light of these examples, what can we say about the software packages themselves in terms of their
utility in producing correct analyses from replicate weights? Broene, Rust, and Westat [65] undertook a
similar study of software packages, testing their performance using both linearization and resampling
methods, and concluded that results were much more consistent and reliable across packages when
replicate weights were created by data producers and included along with the public-release datasets.
Our findings so far seem to bear that observation out: we see much less worrisome variation in package
performance and output when all packages are either using the same set of replicate weights or
generating them on the fly using essentially the same algorithms. This strategy also provides robustness
against vagaries in the design itself, as Zhang, Weng, Salvucci, and Hu [59] found that SUDAAN-style
linearization estimators performed poorly when the underlying design involved a stage of systematic
sampling, whereas bootstrap estimators never broke a sweat.

Another point to consider is that although all the packages performed about the same on our examples,
where all that was called for was to crack open some canned data and warm it up, they would not all
perform the same in more demanding situations involving data production. On this front, no one can
beat R for sheer capability. SAS, Stata, and WesVar (especially WesVar) can all create and output
replicate weights for BRR or jackknife designs, and WesVar provides extremely sophisticated tools for
poststratification, raking, nonresponse adjustment, degrees-of-freedom corrections, and other delicate
post-weighting procedures [110 p423-437]. WesVar also supports the definition of nearly arbitrary
statistics to be estimated using the replicate weights. But R can do all this and more in a fully interactive
programming environment, and R is the only package to provide any serious support for bootstrap
estimation. And R is free. (Of course WesVar is also free.) If a free tool can do all this, there is no real
reason to purchase software that does less.

So if your survey sample analysis needs are merely on the consumption end, any of the major packages
will do adequately on most tasks involving public datasets, which almost never publish full design
information anyway and often provide postadjusted replicate weights to facilitate correct variance
estimation. But if your needs involve careful analysis of complex plans or require the generation of
replicate weights for others to use, the big-box packages like SAS or SUDAAN (or especially SPSS) are not
nearly as useful as cheaper, more capable, and more programmable packages like Stata, which in turn is
not quite as useful as free and highly capable packages like WesVar (even though WesVar has the
serious limitations of not being programmable or able to work with linearized estimators). And none of
them can even compete with free, highly capable, and highly programmable packages like R.

Nevertheless, we should conclude that all the packages we have looked at, excluding SPSS, provide
reasonable support for resampling estimators, and none can be excluded from further consideration on
that basis alone. It is quite a testament to the pace and increasing sophistication of survey software
development that just a decade ago [71] many of these packages (including R and Stata) could not do
better than single-stage SRSWR or SRSWOR designs using textbook formulas. SUDAAN was the only
game in town for linearized estimation, and WesVar the only game in town for replication. In the next
section we will take things one step further by pushing all the software packages to their limits on a
large dataset with a very complex design that is difficult to restructure into a replication-friendly plan.

54
PUSHING THE LIMITS

Having put all the software through their paces in various tests of skill and agility, we now turn our
attention to one final challenge: a large and very complex dataset with multiple challenges. Our goal is
once again to shake the monkeys from their trees and look for the last monkey standing. (Swinging?)

Some of the alternative variance estimation techniques we have seen so far rely on a particular kind of
structure to the dataset. BRR requires (or nearly requires, without theoretical extension) exactly two
PSUs per stratum, and the standard advice [110 pD4-D8] when this condition is not met is to combine
and collapse strata in order to twist the design into one that does have two PSUs per stratum. Jackknife
replication is more flexible, but runs into problems when the total number of PSUs across all first-stage
strata is too large (many replicates needed) or too small (few degrees of freedom for estimation). And
while bootstrap replication is the most flexible of all, we have already seen (see OPTIONS FOR VARIANCE
ESTIMATION) that software support for this method is very poor in general, and besides, if the dataset is
very large then the creation of the weights and all estimations using the weights will be time-consuming.

So okay, if replication estimation is awkward or undesirable for any of those reasons, then why not fall
back on the tried-and-true linearized variance estimation? There is nearly universal software support for
this method, and the theoretical properties are well known. But if the design is very complex or contains
multiple stages of stratification and/or clustering, then some software may not be able to handle the
analysis at the level of rigor we desire, particularly if some of the subsampling stages contain very large
sampling fractions or singleton units. WR approximations under the ultimate cluster model [56 p35] may
misstate the actual variance to an unacceptable degree. There may even be large fractions of missing
items that would call for multiple imputation. (We will not examine multiple imputation here, because it
would take us too far afield.)

Some other challenging tasks we may be asked to perform are tabulation across subdomains that are
weakly represented in the dataset or analysis of the data using weights that are not reflective of the
actual design characteristics. There may also be a need to correct or adjust weighting according to some
poststratification criteria, and it would be nice to rely on automated software support for this kind of
adjustment rather than having to dig in and muck about with the weights ourselves. In short, there is no
shortage of opportunities for a software package to trip up in the analysis of real-life data.

If we are presented with a sample containing inconvenient features like this, then we will want to know
precisely the point at which our software fails to respond. That is the purpose of this final challenge.

The dataset we consider is fabricated (see APPENDIX D for a full description), but it contains many
realistic features, including a large sample size, random singleton units, complex nested structure, and
random oversampling of subpopulations. Very large sampling fractions (as large as 60%) also occur at
some stages, and the design as realized will not readily fit into any of the prefab replication models.
Attempting to analyze even a portion of the design by hand would be next to impossible, so we must
rely on software. Which software package should we trust? (Hint: It doesn't rhyme with GAS.)

But hold on before we dirty our hands on the really muddy data, we should start slow with a smaller
and simplified model of the actual dataset, one that still contains some of the complexities but none of
the really difficult random stuff. After cutting our teeth on that, we can scale up.

55
The data collection plan we imagine (for both the small-scale and the large-scale sets) is some kind of
nationwide poll of citizen support for some proposed measure. ("Should the US blast oil-bearing shale
deposits with tactical nuclear weapons? Yes or no.") Respondents are also asked to provide their age
and sex. Suppose that the respondents were chosen according to a complex four-stage plan with
stratification at each stage:

1. States chosen from state categories (e.g. size, political leanings, etc.)
2. Counties chosen from county categories (e.g. size, location, etc.) within states
3. Cities chosen from city categories (e.g. size, economy, etc.) within counties
4. Citizens chosen from citizen categories (e.g. sex, race, etc.) within cities

A design like this seems simple enough to execute, particularly since we have readily-available sampling
frames for all stages except the last, and since cities are pretty large on the whole, we can always treat
the last stage as a WR sample. But correct analysis of this design is a nightmare because of the extremely
complex stratification and substratification. Strata are independent at a given level, but not independent
of the cluster selection at the levels above them, and this intricate covariance structure needs to be
carried through four stages of clusteringwhich is three more stages than any of us care to think about.

Supposing everything went off without a hitch and we have this dataset ready to go, we can ask our
software packages to perform a few analyses of potential interest:

What proportion of citizens support the measure?

Does support vary by sex?
What are the average ages of citizens who support and don't support the measure?

To make things even nicer, let's say that some underpaid but unusually clever data preparer kindly
computed a set of 200 bootstrap weights for the benefit of packages unable to handle the full design.
We can look at the bootstrap analysis first and use those numbers as a baseline. There will be small
discrepancies from these due to the Monte Carlo error, but they will serve as a useful guide. Since we
already know (see OPTIONS FOR VARIANCE ESTIMATION) that SAS does not support bootstrap weights
(although it can be fooled into thinking the weights are BRR weights instead), the bootstrap analysis will
be run in Stata, R, and WesVar.

mean
Support Support (M) Support (F) Average Age (Y) Average Age (N)
(stderr)
.4886 .4992 .4806 41.93 41.66
Stata
(.0080) (.0172) (.0103) (.4530) (.5101)
.4886 .4992 .4806 41.93 41.66
R
(.0080) (.0172) (.0103) (.4530) (.5101)
.4886 .4992 .4806 41.93 41.66
WesVar
(.0082) (.0174) (.0104) (.4531) (.5108)
Table 39: Bootstrap analysis, easy dataset

WesVar's standard errors are mysteriously larger than those of the others, but only in the third or fourth
decimal place. Otherwise the numbers agree across the board. Next we try using the full design:

56
mean
Support Support (M) Support (F) Average Age (Y) Average Age (N)
(stderr)
.4886 .4992 .4806 41.93 41.66
SAS
(.0109) (.0231) (.0143) (.6071) (.6860)
.4886 .4992 .4806 41.93 41.66
Stata
(.) (.) (.) (.) (.)
R . . . . .
Table 40: Multistage analysis (except SAS), easy dataset

Whoawhat happened? Further inspection of this supposedly "clean" dataset reveals that the last-stage
stratification introduced some singletons. Apparently the last-stage stratification was by sex in order to
guard against disproportionate sampling of males and females (58% of the sample turned out female),
but it backfired because some of the sampled locations did not have enough of one sex or the other.
Stata handed out some point estimates but balked at trying for standard errors, and R spat out an error
to the effect that there are singletons in some strata. Of course, SAS did not even recognize the problem
because the best it can do in this case is a weighted single-stage WR approximation anyway; notice how
inflated the standard errors are compared to those in Table 39.

Here is where poststratification can come in handy: We can disregard the last-stage stratification (and
FPC factors, since the population totals will no longer be correct), and then poststratify using theoretical
rates of 50% for each sex. Since WesVar can apply poststratification to the bootstrap weights, we will
include some WesVar estimates as well for comparison purposes.

mean
Support Support (M) Support (F) Average Age (Y) Average Age (N)
(stderr)
.4899 .4992 .4806 42.02 41.62
SAS
(.0119) (.0231) (.0143) (.6124) (.7598)
.4899 .4992 .4806 42.02 41.62
Stata
(.0116) (.0237) (.0156) (.6240) (.7448)
.4899 .4992 .4806 42.02 41.62
R
(.0116) (.0237) (.0156) (.6231) (.7441)
.4899 .4992 .4806 42.02 41.62
WesVar
(.0089) (.0174) (.0104) (.4584) (.5699)
Table 41: Multistage analysis (except SAS), easy dataset, poststratifying on sex

WesVar's poststratification seems to have done very little, as these are essentially the same numbers we
saw from WesVar before. Perhaps the other packages are all overcompensating with their corrections,
but whatever the case, the poststratification adjustments to the bootstrap weights do not affect the
analysis very much. Also notice that the "Support (M)" and "Support (F)" columns for both SAS and
WesVar are utterly unchanged from the original analysis. When the poststratification is over the same
domain that the analysis is subsetting on, the poststratification leaves the domain invariant, as though
an unmodified-weight analysis had been done instead (because the rescales exactly cancel).

Another very subtle error from SAS's analysis is the low standard errors. In a design like this, even the
weighted single-stage WR approximation can be inadequately conservative It is not a panacea! Also
note that not even Stata and R can agree on standard errors for the subdomains after poststratification,

57
where they did agree before poststratification. Clearly "automatic poststratification" is applied in
idiosyncratic ways in the different packages, and analysts should treat the procedures with a certain
amount of caution.

Having played around with the practice dataset and uncovered some potential sources for error, we
now turn to the "real" dataset. This one is much larger ( 11039), and there are no bootstrap weights
to guide us here. The overall setup is the same, but now there are a few extra bumps in the road.

This dataset is intended to simulate a scaling up and imperfect realization of the idealized plan. States
were stratified into three groups according to political leaning and sampled without replacement inside
each stratum. Since one of the strata ("Independent") contained only seven states, the sampling fraction
within this stratum is over 40%. Selected states were then stratified into rural and urban counties, and a
certain number of counties were selected within the second-stage strata. It was intended that two
counties be chosen from each second-stage stratum, but due to some unfortunate cost and coverage
logistics, in some cases only one county could be chosen, and in other cases three needed to be chosen
instead. This means there are some random singletons at this stage. Next the selected counties were
stratified into groups of large and small cities before individual cities are chosen, but with a catch: due to
coverage concerns for small rural counties, only the five largest cities or towns within each third-stage
stratum were enumerated, and cities chosen from the list of five. Again it was intended to visit two cities
in each stratum, but similar logistical problems caused some strata to contain only one city and others to
contain three cities, as at the previous stage. So there are singletons here as well, but drawing from a
pool of only five cities means that the sampling fractions are 60% in strata with three selected cities. At
the last stage the individual citizens were subsampled from the selected cities, without stratification.
(The pollsters learned their lesson in the practice run and tried hard to avoid introducing singletons in
the final stage.) But as before, females are way oversampled, at a rate of 63:37.

What we have is a four-stage without-replacement cluster sampling design employing stratification in

the first three stages. Clearly this will be too much for SAS to handle, but just to give everyone a chance
to weigh in, we will approach the analysis by stages, starting with a single-stage approximation, moving
to a two-stage approximation, and so on up to the fourth stage. At the second stage we will be forced to
deal with the presence of singletons, and owing to the very complex stratification, it would be extremely
challenging if not impossible to collapse strata into neighbors. How do you define "nearby" consistently
across multiple levels of stratification nested within codependent clusters? Any attempt to collapse
strata at one stage is almost sure to break the semantics of the stratification at subsequent stages, so if
we want to remain faithful to the design structure, we will need to deal with singletons in a somewhat
more automated manner through recentering adjustments. (Note that it is an error to consider the
singletons as certainty units in this case, since the stratum counts were generated by a random process.)
SAS and WesVar will not be concerned with these subtleties because they will look no further than the
first-stage design structure.

Since there are three PSUs in all first-stage strata, WesVar will use jackknife weights, but there is an
issue with these: nine PSUs and three strata mean just six degrees of freedom for estimation using this
method. Bootstrapping would be preferable, but WesVar cannot create the weights.

After the final-stage analysis, we will take another look at poststratifying by sex to see how much
agreement can be obtained.

58
The questions of interest this time will be:

What proportion of citizens support the measure?

Does support vary by sex?
What is the average citizen age?
Does support vary by age category (split into decades)? (We need to discretize the age variable.)

Now for the first-stage approximation:

mean
Support Support by Sex Age Support by Age
(stderr)
<20 0 (.)
20-29 .1677 (.0298)
.2146
M 30-39 .2814 (.0222)
.2819 (.0155) 45.56
SAS 40-49 .2909 (.0212)
(.0128) .3250 (.0554)
F 50-59 .2777 (.0318)
(.0165)
60-69 .2720 (.0400)
>=70 .2018 (.1787)
<20 0 (.)
20-29 .1677 (.0298)
.2146
M 30-39 .2814 (.0222)
.2819 (.0155) 45.56
Stata 40-49 .2909 (.0212)
(.0128) .3250 (.0554)
F 50-59 .2777 (.0318)
(.0165)
60-69 .2720 (.0400)
>=70 .2018 (.1787)
<20 0 (.)
20-29 .1677 (.0298)
.2146
M 30-39 .2814 (.0222)
.2819 (.0155) 45.56
R 40-49 .2909 (.0212)
(.0128) .3250 (.0554)
F 50-59 .2777 (.0318)
(.0165)
60-69 .2720 (.0400)
>=70 .2018 (.1787)
<20 0 (.)
20-29 .1677 (.0319)
.2146
M 30-39 .2814 (.0232)
.2819 (.0155) 45.56
WesVar 40-49 .2909 (.0222)
(.0128) .3250 (.0566)
F 50-59 .2777 (.0364)
(.0170)
60-69 .2720 (.0403)
>=70 .2018 (.2113)
Table 42: One-stage analysis, weighted WR approximation (WesVar = JKn jackknife)

As expected, all the linearized estimates agree, and the jackknife estimates are very close. WesVar's
jackknife estimates were also checked against jackknife estimates from the other three packages. SAS
agreed completely, but both Stata and R had slightly different standard errors (Stata's somewhat

59
smaller, R's very close to WesVar's but sometimes off by a bit in the third or fourth decimal place),
indicating that even with "the same" method in three highly competent packages, mileage may vary.

Now we move on to the second-stage approximation. Here is where SAS and WesVar must be left
behind because their answers will be the same for all subsequent trials, but it is good to remember that
this is for different reasons: SAS because it refuses to consider anything but first-stage complexity no
matter what, but WesVar because theoretically the replications do account for later-stage complexities.
(However, in this case we will see that they are on weak footing because of the low degrees of freedom.)

It will be interesting to see how the estimates change from here on out. Our approach will be to use
FPCs at all stages above where we cut the analysis so for the two-stage approximation we apply FPCs
at stage one but not at stage two and to handle singletons by centering them at the grand mean
instead of at the stratum mean.

mean
Support Support by Sex Age Support by Age
(stderr)
<20 0 (.)
20-29 .1677 (.0318)
.2146
M 30-39 .2814 (.0218)
.2819 (.0152) 45.56
Stata 40-49 .2909 (.0203)
(.0123) .3250 (.0633)
F 50-59 .2777 (.0313)
(.0156)
60-69 .2720 (.0397)
>=70 .2018 (.1721)
<20 0 (.)
20-29 .1677 (.0318)
.2146
M 30-39 .2814 (.0218)
.2819 (.0152) 45.56
R 40-49 .2909 (.0203)
(.0123) .3250 (.0633)
F 50-59 .2777 (.0313)
(.0156)
60-69 .2720 (.0397)
>=70 .2018 (.1721)
Table 43: Two-stage analysis, weighted WR approximation from second stage

So far so good: Stata and R produce the same estimates using the same handling for singletons. Notice
that the estimates are not greatly changed from the single-stage approximation, but there is movement
of standard errors in both directions.

Next we move to three stages. Remember that there are singletons in the third stage as well, but their
distribution is independent of the distribution of the second-stage singletons, so the corrections at the
third stage should be seen as adding to (rather than refining) those at the second stage.

60
mean
Support Support by Sex Age Support by Age
(stderr)
<20 0 (.)
20-29 .1677 (.0318204)
.2146
M 30-39 .2814 (.0218289)
.2819 (.0151632) 45.56
Stata 40-49 .2909 (.0203396)
(.0123129) .3250 (.0632470)
F 50-59 .2777 (.0312572)
(.0156383)
60-69 .2720 (.0396751)
>=70 .2018 (.1720749)
<20 0 (.)
20-29 .1677 (.0318204)
.2146
M 30-39 .2814 (.0218289)
.2819 (.0151632) 45.56
R 40-49 .2909 (.0203396)
(.0123129) .3250 (.0632470)
F 50-59 .2777 (.0312572)
(.0156383)
60-69 .2720 (.0396751)
>=70 .2018 (.1720749)
Table 44: Three-stage analysis, weighted WR approximation from third stage

We see only the tiniest shift in the estimated standard errors. We needed to extend the precision of
reporting in order to see the changes because they were too small to affect the rounding at four decimal
places. So far Stata and R are in complete lockstep, even as the complexity ramps up. Now for the full
four-stage analysis, pulling out all the stops by using all the FPCs, with no approximation:

mean
Support Support by Sex Age Support by Age
(stderr)
<20 0 (.)
20-29 .1677 (.0318396)
.2146
M 30-39 .2814 (.0218301)
.2819 (.0151617) 45.56
Stata 40-49 .2909 (.0203377)
(.0123111) .3250 (.0633374)
F 50-59 .2777 (.0312572)
(.0156381)
60-69 .2720 (.0396815)
>=70 .2018 (.1720863)
<20 0 (.)
20-29 .1677 (.0318396)
.2146
M 30-39 .2814 (.0218301)
.2819 (.0151617) 45.56
R 40-49 .2909 (.0203377)
(.0123111) .3250 (.0633374)
F 50-59 .2777 (.0312572)
(.0156381)
60-69 .2720 (.0396815)
>=70 .2018 (.1720863)
Table 45: Four-stage analysis, exact

No discrepancies. When Stata and R say they can handle multistage sampling designs of arbitrary
nesting, they really mean it. Even this very complicated design was nowhere near fiendish enough to
shake them from the tree. However, it bears pointing out that the exactness here is all after the fourth
decimal place, differences that would be considered negligible by almost any standard. We saw some

61
real changes in the standard errors moving from the one-stage to the two-stage approximation
(although the one-stage approximation was not bad), but after the second stage all the movement was
small enough to be hidden by typical rounding practice.

It is also interesting to compare what the numbers would have been if we had chosen a different
strategy for handling the singleton units. Another common approach is to replace the singleton stratum
variances with the average of other stratum variances at that stage. Doing so yields the following table:

mean
Support Support by Sex Age Support by Age
(stderr)
<20 0 (.)
20-29 .1677 (.0356212)
.2146
M 30-39 .2814 (.0231125)
.2819 (.0153951) 45.56
Stata 40-49 .2909 (.0205802)
(.0124061) .3250 (.0938060)
F 50-59 .2777 (.0307701)
(.0161053)
60-69 .2720 (.0416816)
>=70 .2018 (.1736048)
<20 0 (.)
20-29 .1677 (.0307259)
.2146
M 30-39 .2814 (.0218312)
.2819 (.0144396) 45.56
R 40-49 .2909 (.0198080)
(.0118734) .3250 (.0638420)
F 50-59 .2777 (.0301036)
(.0153908)
60-69 .2720 (.0378687)
>=70 .2018 (.)
Table 46: Four-stage analysis, exact, using average variance for singletons

Here we finally see some differences. Clearly Stata and R are doing different kinds of averaging. R takes
the average over all the other strata at the same stage and uses that for the variance of the singleton
stratum, whereas Stata performs a rescaling of the stratum variances within the stage by using the
variance of the grouped singletons as another (less-than-full-weight) estimate of a stratum variance for
that stage. As to which method is better, who can say. On this example, R's method is more consistent
with the estimates obtained by recentering at the grand mean, while Stata's method tends to inflate all
the variances somewhat. But R's method also apparently yielded nonsense results for one of the
subdomain variances, leaving it unable to report a standard error for that subdomain. All these methods
are ad-hoc adjustments used to essentially impute variance information that does not really exist in the
dataset, so discrepancies here and there are to be expected. Interestingly, if we instruct Stata and R to
remove the singleton units entirely from variance analysis, their answers differ after the second stage.
Using the third-stage approximation, R reports a standard error of .0118092 for the support proportion
but Stata reports .0118273, and at the fourth (exact) stage, R's standard error estimate is .0118123 while
Stata's is .0123419. Apparently telling them to "remove" the singletons does not mean the same thing in
both packages. R's smaller standard errors in these cases suggests that perhaps Stata is trying to adjust
for the removal somehow, whereas R is simply removing the units.

Now for the final test: How will Stata and R compare on poststratification adjustments in the face of all
this other complexity? We return to the recentering-at-the-grand-mean method for variance adjustment
since both packages agree using that method. When we look at the next table, we should be prepared
62
for large changes in estimates. In the dataset we saw a roughly 2:1 ratio of women to men, and we are
going to adjust that to 1:1 instead. This will affect not only the standard errors, but the point estimates
as well. For good measure, we will throw WesVar back into the mix since it can do poststratification too.
Theoretically SAS can as well, but for some reason it refused to compute anything and returned an
unhelpful error on this dataset when we tried to add poststratification to the analysis. (The error was
not syntactic but was somehow related to the structure or contents of the dataset. Something about this
dataset rubbed SAS the wrong way when it tried to do poststratification. An error of this sort feels like a
handling flaw, since none of the other packages had any trouble running the poststratification.)

mean
Support Support by Sex Age Support by Age
(stderr)
<20 .
20-29 .
30-39 .
M .
SAS . . 40-49 .
F .
50-59 .
60-69 .
>=70 .
<20 0 (.)
20-29 .1748 (.0367702)
.2146
M 30-39 .2675 (.0190580)
.2698 (.0151617) 45.51
Stata 40-49 .2815 (.0226267)
(.0119668) .3250 (.0759249)
F 50-59 .2612 (.0333005)
(.0156381)
60-69 .2587 (.0400430)
>=70 .2122 (.1755965)
<20 0 (.)
20-29 .1748 (.0367702)
.2146
M 30-39 .2675 (.0190580)
.2698 (.0151617) 45.51
R 40-49 .2815 (.0226267)
(.0119668) .3250 (.0759249)
F 50-59 .2612 (.0333005)
(.0156381)
60-69 .2587 (.0400430)
>=70 .2122 (.1755965)
<20 0 (.)
20-29 .1748 (.0389)
.2146
M 30-39 .2675 (.0195)
.2698 (.0155) 45.51
WesVar 40-49 .2815 (.0254)
(.0123) .3250 (.0743)
F 50-59 .2612 (.0397)
(.0170)
60-69 .2587 (.0404)
>=70 .2122 (.2106)
Table 47: Four-stage analysis, exact (only Stata and R), poststratifying on sex

Once again Stata and R agree (although this is not always the case when poststratifying, as we saw with
the easy dataset in Table 41 above). WesVar's poststratified point estimates are correct and the
standard errors somewhat changed from the original analysis in Table 42, but the magnitude of the

63
change seems less than with Stata and R. Judging from the two examples in this section, it appears as
though poststratification has less effect on replicate-based estimates than on linearized estimates.

All in all, the packages here handled these tough datasets with aplomb. SAS could do no better than a
weighted single-stage WR approximation as usual, but its standard errors were in the ballpark, and its
poststratification works at least under some conditions. WesVar obtained quite competitive standard
error estimates using jackknife replicates and could also poststratify.

Stata and R were the stars of this test. We set out trying to find the precise point at which either or both
would fail, and as it turned out, neither failed. Neither one even flinched except when we asked them to
(by forcing them to use different methods of dealing with singletons). It is too bad that we were unable
to test SPSS and SUDAAN on these datasets, because it would be very illuminating to see whether they
would agree with Stata and R as well, since SPSS and SUDAAN also claim correct behavior on designs of
this type and use the same class of variance estimators that Stata and R use by default. But even if SPSS
and SUDAAN were to pass this test with flying colors, there would be no particular reason to
recommend them over either Stata or R, who are both much more flexible and capable in other areas as
well. Long story short: If you have some burning complex survey sample analysis needs, then either
Stata or R should be your first choice of software package.

WRAPPING THINGS UP

So where does this leave us? We have examined a variety of software packages that claim some level of
competence on complex survey samples, and found a great deal of consistency but also a few wide gaps
in both capability and usability. Let's take a moment to recap what we have learned about them:

All the packages acknowledge that complex sampling designs require special handling, and all
make an effort to produce properly weighted point estimates and properly design-adjusted
standard errors, at least to the level of first-stage sampling complexities. All of them also offer a
variety of estimation tools and procedures besides ordinary descriptive analyses, including linear
and logistic regression at a minimum. Most go well beyond this in their offerings.
All the packages except SPSS provide at least some support for replicate-based methods of
variance estimation, and all those except IVEware can competently use precomputed weights
from a dataset. Several of the packages can also generate their own replicate weights. All except
WesVar will apply linearized variance estimation to at least some complex designs.
As non-commercial packages still in beta development, AM and IVEware deserve credit for being
as good as they are both are written by researchers for researchers, and both fill a niche
but they are neither finished enough nor capable enough at complex survey sample analysis to
recommend them for serious use in that area. Their pace of future development is uncertain.
As a very expensive commercial package with decades of development to draw on, SAS deserves
criticism for being as bad as it is at complex survey sample analysis its capabilities are barely
above what far lesser packages provide for free and there has been little indication of a desire
to do better. It produces reasonable estimates, but only just. The SURVEY procedures feel hastily
conceived, with no plausible avenue for extension, and can make poor choices on trouble data.
SPSS is ludicrously expensive for what it offers as a complex survey sample analysis tool. Avoid.

64
SUDAAN and WesVar are each very good at what they do, but the times are outgrowing them.
Both are highly focused niche tools who are content to remain highly focused niche tools while
the rest of the statistical software world passes them by. I am not aware of any capability that
only SUDAAN possesses; it seems entirely supplantable by either Stata or R. As for WesVar, there
is a certain charm to its approach and it is the only package I know of that can create JK2 weights,
but its finicky expectations, lack of data tools, and awkward menus make continued use tedious.
If you need a tool to generate replicate weights, up to and including postsampling calibration and
nonresponse adjustment, or a tool to compute design-weighted values for arbitrary user-defined
estimators, then R is a better and more capable choice.
There is currently no better choice for complex survey sample analysis software than Stata or R.
Both are not only at the top of the heap in terms of power, features, flexibility, extensibility, ease
of use, low cost, technological sophistication, and developmental momentum, but at the top of
their game as well. Stata is a rising star among commercial packages because of its rich offerings
and rapid pace of development, and R is a rising star in research settings because of its deep
programmability and massive and ever-growing codebase of highly sophisticated software tools.
Both gobbled up every design we threw their way and came back for seconds.

To put it in a nutshell: AM and IVEware are not ready and may never be, SAS and SPSS think much too
highly of themselves, and SUDAAN and WesVar are slowly settling into senescence. Meanwhile Stata
and R are turning cartwheels and climbing trees. Both are bursting with so much development energy
they don't even realize they're already king of the hill. And we really hope they don't for a while yet.

It is important to keep in mind that we are not just calling Stata and R better because they have better
tools and talents at the moment that could easily change. What will happen in the software arena
over the next ten years or so is anybody's guess. SAS could get serious, or WesVar could branch out and
start making new friends. Looking purely at complex survey sample analysis competence, SUDAAN is
probably still just about as capable as Stata (but probably not as capable as R). If those are the kinds of
measures you use, you will need to revisit the assessment every couple of years as features are added
and refined across the playing field. The reason we are calling Stata and R better is because they have a
better model for growth and a rising (rather than a falling) trajectory. They are not finally the best of the
old guard, they are a new guard beating the old guard at its own game.

Although Stata and R could be deemed equal competitors in the area of complex survey sample analysis,
they are not equivalent. Stata has more spit and polish, much better data management tools, cleaner
syntax, tighter integration, and a more consistent and predictable user experience. Stata documentation
is so good it puts good documentation to shame, whereas R documentation is spotty even on its best
day. (If you want to learn to use R well, you need to buy a lot of books.) But R has full and option-rich
support for things Stata has not even tried yet, such as quantile estimation (more difficult than it sounds
under complex sampling), arbitrary-probability without-replacement designs, advanced postsampling
adjustments, GREG estimators, and especially bootstrap replicate weights (in several flavors). R is more
flexible than Stata by virtue of being a fully programmable open-source software use-and-development
environment. Today's ideal complex survey sample analysis setup should include both Stata and R.
Fortunately this is very easy, because Stata is cheap and R is free.

65
APPENDIX A: SOME COMPLEX SURVEY ANALYSIS IMPLEMENTED IN R

The following suite of functions was initially developed for the sole purpose of checking the accuracy of
the various complex survey software packages being considered in this paper. During the course of
development the range of functionality was expanded to include five major "textbook" designs to add
more flexibility and power to the suite. But please be aware that it was specifically written to produce
correct results on a few small datasets with known complex survey sampling designs, and has not been
tested to any significant degree beyond that, so if you use it, use it at your own risk.

### By-hand R implementation of a rudimentary complex survey package

### Written by Jack Wiedrick 2014
### Released without restriction to the public domain
###
### This package computes totals, means, standard errors, and design
### effects for univariate (or one-variable-at-a-time) data collected
### according to one of five basic sampling designs:
###
### srs simple random sampling with replacement (SRSWR)
### 1stage simple random sampling without replacement (SRSWOR)
### pps selection probability proportional to size (PPSWR)
### pps2 two stages: PPSWR at stage 1 and SRS(WR/WOR) at stage 2
### 2stage two stages: SRSWOR at stage 1 and SRS(WR/WOR) at stage 2
### (Note that this design can also handle one-stage cluster
### sampling schemes; see function header below for details.)
###
### Any of the basic designs above can also be combined with stratification
### at either stage (or both stages) of sampling, although the main
### cmplxsvy.analyze interface only allows for stratification at the first
### stage. (To implement stratification at the second stage, the user
### needs to make use of the est.* and var.* functions directly.)
###
### The est.* and var.* functions are designed for internal use and do not
### print results by default, but they can be called directly and forced
### to print output if preferred. More complicated designs than are
### supported by cmplxsvy.analyze can be implemented using these.
###
### For typical situations use the cmplxsvy.analyze function, passing
### the appropriate design specification matching your data situation.
### Even this interface is somewhat awkward to work with, as the user needs
### to pay close attention to the layout of the data in order to obtain
### correct results. A professional package would allow more flexibility.
###
### Error handling is almost nonexistent. Only design errors that prevent
### continuation are flagged by halting execution and printing a message.
### The parameters are also assumed to be known constants and not estimated
### quantities, so standard errors will be slightly misstated (and probably
### understated) if the population "parameters" are themselves estimates.
### Efficiency of code execution was also not considered.
###
### THE UPSHOT: Please do not use this as a substitute for a professionally-
### written complex survey software package! It is designed for
### educational purposes only, to demonstrate how to implement
### some common "textbook" designs by hand. Count on bugs.
###
#############################################################################

67
### simple estimate: PPSWR
### data is one-dimensional
### P is a scalar or vector of single-draw WR probabilities
### set print=TRUE for output
est.pps=function(data,P,print=FALSE)
{
n=length(data)
nP=length(P)
if (nP>1 && nP!=n)
return ("error (est.pps): weights and observations are mismatched")
w=1/(n*P)
t=sum(data*w)
m=t/mean(1/P)
if (print)
{
value=round(c(t,m),4)
names(value)=c("total","mean")
cat("Point Estimates\n\n")
print(value)
}
invisible(list(N=n*mean(w),psu=list(total=t,mean=m,n=n,f=1/w),ssu=NULL))
}
###
### simple variance: PPSWR
### data is one-dimensional
### P is a scalar or vector of single-draw WR probabilities
### accepts a precomputed est.pps structure if available
### set print=TRUE for output
var.pps=function(data,P,est=NULL,print=FALSE)
{
if (is.null(est)) est=est.pps(data,P)
t=var(data/P)/est$psu$n; if (is.na(t)) t=0
m=t/mean(1/P^2)
if (print)
{
value=round(c(t,m),4)
names(value)=c("total","mean")
cat("Variance Estimates\n\n")
print(value)
}
invisible(list(total=t,mean=m,cluster=NULL,est=est))
}
###
### two-stage estimate: PPSWR + SRS
### data is three-dimensional:
### column 1 is the psu identifier
### column 2 is the finite population size of the psu
### column 3 is the matched vector of observations
### P is a scalar or vector of single-draw WR probabilities
### set print=TRUE for output
est.pps2=function(data,P,print=FALSE)
{
data=as.data.frame(data)
names(data)=c("psu","M","y")
I=unique(data$psu)
n=length(I)
nP=length(P)
if (nP>1 && nP!=n)
return ("error (est.pps2): weights and observations are mismatched")
else if (nP==1) P=rep(P,n)
est=vector("list",n)
68
names(est)=I
f=n*P
t=m=0
w=numeric(n)
for(i in 1:n)
{
di=subset(data,psu==I[i])
Mi=unique(di$M)
if (length(Mi)!=1)
return("error (est.2stage): inconsistent cluster sizes")
est[[i]]=est.1stage(di$y/f[i],Mi)
t=t+est[[i]]$psu$total
w[i]=est[[i]]$psu$n/(f[i]*est[[i]]$psu$f)
}
m=t/mean(d$M/P)
if (print)
{
value=round(c(t,m),4)
names(value)=c("total","mean")
cat("Point Estimates\n\n")
print(value)
}
invisible(list(N=sum(w),psu=list(total=t,mean=m,n=n,f=1/w),ssu=est))
}
###
### two-stage variance: PPSWR + SRS
### data is three-dimensional:
### column 1 is the psu identifier
### column 2 is the finite population size of the psu
### column 3 is the matched vector of observations
### P is a scalar or vector of single-draw WR probabilities
### accepts a precomputed est.pps2 structure if available
### set wr=TRUE to use WR at the second stage
### set print=TRUE for output
var.pps2=function(data,P,est=NULL,wr=FALSE,print=FALSE)
{
data=as.data.frame(data)
names(data)=c("psu","M","y")
if (is.null(est)) est=est.pps2(data,P)
I=unique(data$psu)
v=vector("list",est$psu$n)
names(v)=I
ssu.t=numeric(est$psu$n)
for (i in 1:est$psu$n)
{
di=subset(data,psu==I[i])
v[[i]]=if (wr) var.srs(di$y,unique(di$M))
else var.1stage(di$y,unique(di$M))
ssu.t[i]=est$ssu[[i]]$psu$total
}
t=var(est$psu$n*ssu.t)/est$psu$n; if (is.na(t)) t=0
m=t/est$N^2
if (print)
{
value=round(c(t,m),4)
names(value)=c("total","mean")
cat("Variance Estimates\n\n")
print(value)
}
invisible(list(total=t,mean=m,cluster=v,est=est))
}
69
###
### simple estimate: SRSWR
### data is one-dimensional
### N is interpreted as the size of a finite population
### set print=TRUE for output
est.srs=function(data,N,print=FALSE)
{
est.pps(data,1/N,print) # Horvitz-Thompson estimator
}
###
### simple variance: SRSWR
### data is one-dimensional
### N is interpreted as the size of a finite population
### accepts a precomputed est.srs structure if available
### set print=TRUE for output
var.srs=function(data,N,est=NULL,print=FALSE)
{
var.pps(data,1/N,est,print)
}
###
### one-stage estimate: SRSWOR
### data is one-dimensional
### N is interpreted as the size of a finite population
### set print=TRUE for output
est.1stage=function(data,N,print=FALSE)
{
est.pps(data,1/N,print) # Horvitz-Thompson estimator
}
###
### one-stage variance: SRSWOR
### data is one-dimensional
### N is interpreted as the size of a finite population
### accepts a precomputed est.1stage structure if available
### set print=TRUE for output
var.1stage=function(data,N,est=NULL,print=FALSE)
{
if (is.null(est)) est=est.1stage(data,N)
m=(1-est$psu$f)*var(data)/est$psu$n; if (is.na(m)) m=0
t=m*N^2
if (print)
{
value=round(c(t,m),4)
names(value)=c("total","mean")
cat("Variance Estimates\n\n")
print(value)
}
invisible(list(total=t,mean=m,cluster=NULL,est=est))
}
###
### two-stage estimate: SRS at both stages
### data is three-dimensional:
### column 1 is the psu identifier
### column 2 is the finite population size of the psu
### column 3 is the matched vector of observations
### N is interpreted as the finite size of the first-stage population
### set print=TRUE for output
est.2stage=function(data,N,print=FALSE)
{
data=as.data.frame(data)
names(data)=c("psu","M","y")
I=unique(data$psu)
70
n=length(I)
est=vector("list",n)
names(est)=I
f=n/N
t=w=0
for(i in 1:n)
{
di=subset(data,psu==I[i])
Mi=unique(di$M)
if (length(Mi)!=1)
return("error (est.2stage): inconsistent cluster sizes")
est[[i]]=est.1stage(di$y,Mi)
t=t+est[[i]]$psu$total
w=w+est[[i]]$psu$n/(f*est[[i]]$psu$f)
}
t=t/f
m=t/w
if (print)
{
value=round(c(t,m),4)
names(value)=c("total","mean")
cat("Point Estimates\n\n")
print(value)
}
invisible(list(N=w,psu=list(total=t,mean=m,n=n,f=f),ssu=est))
}
###
### two-stage variance: SRS at both stages
### data is three-dimensional:
### column 1 is the psu identifier
### column 2 is the finite population size of the psu
### column 3 is the matched vector of observations
### N is interpreted as the finite size of the first-stage population
### accepts a precomputed est.2stage structure if available
### set wr=c(TRUE/FALSE,TRUE/FALSE) to use WR at either stage
### *NOTE: it is an error to use wr=c(TRUE,FALSE) with this estimator;
### if that is your situation, use the var.pps2 estimator instead
### *NOTE: for single-stage cluster sampling, set column 2 to the
### second-stage sample size and wr=c(...,FALSE)
### set print=TRUE for output
var.2stage=function(data,N,est=NULL,wr=c(FALSE,FALSE),print=FALSE)
{
if (wr[1] && !wr[2])
print("warning (var.2stage): first-stage WR implies SRSWR")
data=as.data.frame(data)
names(data)=c("psu","M","y")
if (is.null(est)) est=est.2stage(data,N)
I=unique(data$psu)
v=vector("list",est$psu$n)
names(v)=I
ssu.v=numeric(est$psu$n)
ssu.t=numeric(est$psu$n)
ssu.m=numeric(est$psu$n)
for (i in 1:est$psu$n)
{
di=subset(data,psu==I[i])
Mi=unique(di$M)
v[[i]]=if (wr[2]) var.srs(di$y,Mi)
else var.1stage(di$y,Mi)
ssu.v[i]=v[[i]]$total
ssu.t[i]=est$ssu[[i]]$psu$total
71
ssu.m[i]=(ssu.t[i]-Mi*est$psu$total/est$N)^2
}
s2ssu=sum(ssu.v)/est$psu$f
vt=var(ssu.t); if (is.na(vt)) vt=0
t=(N^2)*(1-ifelse(wr[1],0,est$psu$f))*vt/est$psu$n+s2ssu
vm=sum(ssu.m)/(est$psu$n-1); if (is.na(vm)) vm=0
m=(N^2)*(1-ifelse(wr[1],0,est$psu$f))*vm/est$psu$n+s2ssu
m=m/est$N^2
if (print)
{
value=round(c(t,m),4)
names(value)=c("total","mean")
cat("Variance Estimates\n\n")
print(value)
}
invisible(list(total=t,mean=m,cluster=v,est=est))
}
###
### stratified estimate: any design
### requires a list of estimates from within-stratum designs
### set print=TRUE for output
est.stratified=function(est,print=FALSE)
{
n=length(est)
t=N=0
for (i in 1:n)
{
t=t+est[[i]]$psu$total
N=N+est[[i]]$N
}
m=t/N
if (print)
{
value=round(c(t,m),4)
names(value)=c("total","mean")
cat("Point Estimates\n\n")
print(value)
}
invisible(list(N=N,psu=list(total=t,mean=m,n=n,f=1),strata=est))
}
###
### stratified variance: any design
### requires a list of variances from within-stratum designs
### use singleton to specify strata with only a single psu
### *NOTE: this is not the same as a certainty psu; the software
### handles certainty psus by setting their variance
### contribution equal to zero
### set print=TRUE for output
var.stratified=function(var,singleton=NULL,print=FALSE)
{
n=length(var)
t=m=N=0
for (i in 1:n)
{
t=t+var[[i]]$total
m=m+var[[i]]$mean*var[[i]]$est$N^2
N=N+var[[i]]$est$N
}
m=m/N^2
if (!is.null(singleton))
{
72
f=(n/(n-length(singleton)))^2
I=setdiff(1:n,singleton)
v=numeric(length(I))
for (i in I) v[i]=var[[i]]$total
t=t+f*mean(v)
m=t/N^2
}
if (print)
{
value=round(c(t,m),4)
names(value)=c("total","mean")
cat("Variance Estimates\n\n")
print(value)
}
invisible(list(total=t,mean=m,strata=var))
}
###
### analyze a complex design (USE THIS FUNCTION)
### the (data,parameter) pair must satisfy the requirements of the design
### argument, which must be one of:
###
### "srs": data = one-dimensional vector of observations
### parameter = finite population size (e.g. sum of weights)
###
### "1stage": data = one-dimensional vector of observations
### parameter = finite population size (e.g. sum of weights)
###
### "2stage": data = three-dimensional matrix or data frame:
### column 1 is the psu identifier
### column 2 is the finite population size of the psu
### column 3 is the matched vector of observations
### parameter = finite size of (first-stage) population of psus
###
### "pps": data = one-dimensional vector of observations
### parameter = scalar or vector of single-draw WR probabilities
###
### "pps2": data = three-dimensional matrix or data frame:
### column 1 is the psu identifier
### column 2 is the finite population size of the psu
### column 3 is the matched vector of observations
### parameter = scalar or vector of single-draw WR probabilities
###
### for stratified designs (first-stage only), specify the column index of
### the first-stage stratum identifiers in the strata argument, e.g.:
###
### strata=4 stage 1 is stratified and the first-stage stratum
### numbers are in column 4
###
### when the design is stratified, the parameter argument must be a vector
### (for non-pps designs) or a list of vectors (for pps designs) of length
### equal to the number of strata, with each element specifying the
### parameter value for the corresponding stratum
###
### depending on the design, the first column or first three columns,
### excluding the strata column, are assumed to be the columns needed for
### analysis; any other columns in the dataset are ignored
###
### set print=FALSE to suppress output
cmplxsvy.analyze=function(data,parameter,design="srs",strata=NULL,print=TRUE)
{
data=as.data.frame(data)
73
e=v=NULL
stratified=as.numeric(!is.null(strata))
if (stratified) data=data[,c(strata,setdiff(1:dim(data)[2],strata))]
if (is.element(design,c("srs","1stage","pps")))
{
if (stratified)
{
H=unique(data[,1])
nH=length(H)
est=vector("list",nH)
var=vector("list",nH)
singleton=NULL
if (design=="srs")
{
if (print) cat("Stratified SRSWR Design\n\n")
for (h in 1:nH)
{
dh=data[which(data[,1]==H[h]),]
est[[h]]=est.srs(dh[,2],parameter[h])
var[[h]]=var.srs(dh[,2],parameter[h],est[[h]])
if (est[[h]]$psu$n==1 && parameter[h]!=1)
singleton=c(singleton,h)
}
}
else if (design=="1stage")
{
if (print) cat("Stratified SRSWOR Design\n\n")
for (h in 1:nH)
{
dh=data[which(data[,1]==H[h]),]
est[[h]]=est.1stage(dh[,2],parameter[h])
var[[h]]=var.1stage(dh[,2],parameter[h],est[[h]])
if (est[[h]]$psu$n==1 && parameter[h]!=1)
singleton=c(singleton,h)
}
}
else if (design=="pps")
{
if (print) cat("Stratified PPSWR Design\n\n")
for (h in 1:nH)
{
dh=data[which(data[,1]==H[h]),]
est[[h]]=est.pps(dh[,2],parameter[[h]])
var[[h]]=var.pps(dh[,2],parameter[[h]],est[[h]])
if (est[[h]]$psu$n==1 && parameter[[h]][1]!=1)
singleton=c(singleton,h)
}
}
e=est.stratified(est)
v=var.stratified(var,singleton)
}
else
{
if (design=="srs")
{
if (print) cat("Simple SRSWR Design\n\n")
e=est.srs(data[,1],parameter)
v=var.srs(data[,1],parameter,e)
}
else if (design=="1stage")
{
74
if (print) cat("Simple SRSWOR Design\n\n")
e=est.1stage(data[,1],parameter)
v=var.1stage(data[,1],parameter,e)
}
else if (design=="pps")
{
if (print) cat("Simple PPSWR Design\n\n")
e=est.pps(data[,1],parameter)
v=var.pps(data[,1],parameter,e)
}
}
}
else if (is.element(design,c("2stage","pps2")))
{
if (stratified)
{
H=unique(data[,1])
nH=length(H)
est=vector("list",nH)
var=vector("list",nH)
singleton=NULL
if (design=="2stage")
{
if (print) cat("Stratified Two-Stage SRSWOR Design\n\n")
for (h in 1:nH)
{
dh=data[which(data[,1]==H[h]),]
est[[h]]=est.2stage(dh[,2:4],parameter[h])
var[[h]]=var.2stage(dh[,2:4],parameter[h],est[[h]])
if (est[[h]]$psu$n==1 && parameter[h]!=1)
singleton=c(singleton,h)
}
}
else if (design=="pps2")
{
if (print) cat("Stratified Two-Stage PPSWR Design\n\n")
for (h in 1:nH)
{
dh=data[which(data[,1]==H[h]),]
est[[h]]=est.pps2(dh[,2:4],parameter[[h]])
var[[h]]=var.pps2(dh[,2:4],parameter[[h]],est[[h]])
if (est[[h]]$psu$n==1 && parameter[[h]][1]!=1)
singleton=c(singleton,h)
}
}
e=est.stratified(est)
v=var.stratified(var,singleton)
}
else
{
if (design=="2stage")
{
if (print) cat("Two-Stage SRSWOR Design\n\n")
e=est.2stage(data[,1:3],parameter)
v=var.2stage(data[,1:3],parameter,e)
}
else if (design=="pps2")
{
if (print) cat("Two-Stage PPSWR+SRSWOR Design\n\n")
e=est.pps2(data[,1:3],parameter)
v=var.pps2(data[,1:3],parameter,e)
75
}
}
}
else return("error (cmplxsvy.analyze): unsupported design")
value=round(c(e$psu$total,sqrt(v$total),e$psu$mean,sqrt(v$mean)),4)
names(value)=c("total","se.total","mean","se.mean")
if (print) print(value)
invisible(list(est=e,var=v,data=data,par=parameter))
}
###
### compute a design effect on a cmplxsvy.analyze object
### set datacol to the column number of the observation data
### by default SRSWR is used as the reference design, but this can
### be changed to SRSWOR instead by setting wr=FALSE
### set print=FALSE to suppress output
cmplxsvy.deff=function(cmplxsvy.object,datacol,wr=TRUE,print=TRUE)
{
ref.design=if (wr) "srs" else "1stage"
srs.object=cmplxsvy.analyze(cmplxsvy.object$data[,datacol],
cmplxsvy.object$est$N,
design=ref.design,print=FALSE)
deff=cmplxsvy.object$var$total/srs.object$var$total
names(deff)=c("deff")
if (print)
{
cat("Design Effect vs",ifelse(wr,"SRSWR","SRSWOR"),"\n\n")
print(deff)
}
invisible(deff)
}
###
#############################################################################

76
APPENDIX B: SAMPLE CODE FOR TYPICAL OPERATIONS

Below are simple code templates and process flow instructions for performing a typical complex survey
sample analysis task (estimating a mean) in all the software packages under consideration in this paper.
For more detailed examples, refer to the documentation for each package (see APPENDIX C).

SAS
proc surveymeans data=dataset total=fpc_dataset;
stratum strata;
cluster clusters;
weight weights;
domain subdomains;
class categoricals;
var analyze_this;
run;

The SAS model is procedure-based, meaning that each procedure must be a self-contained unit that
completely describes the parameters for analysis. Datafiles must be loaded into libraries and referenced
in procedure (proc) statements. The typical SAS environment provides a code editor and several output
windows for examining the results of code execution. Only first-stage complexities can be defined.

Stata
use datafile
svyset clusters1 [pweight=weights], strata(strata1) fpc(fpc1)
|| clusters2, strata(strata2) fpc(fpc2)
|| . . .
svy, subpop(subdomains): mean analyze_this

The Stata model is project-based, meaning that one dataset is loaded into memory at a time and all
operations refer to that dataset. Before a dataset can be analyzed as a complex survey sample, it must
be defined using the svyset command. (Note that the brackets are required in the pweight definition.)
The typical Stata environment provides a menu-based windowed interface with a small command editor
and large output window. Complexities may be defined for an unlimited number of stages.

R
library(survey)
dataset=read.table("datafile.txt",header=TRUE)
data.svy=svydesign(ids=~clusters1+clusters2+...,
strata=~strata1+strata2+...,
fpc=~fpc1+fpc2...,
weights=~weights,
data=dataset)
svymean(~analyze_this,subset(data.svy,subdomains))

R is technically an interpreted programming language, so arbitrary variables and procedures can be

defined and kept in memory for later use. The typical R environment provides an integrated code editor
and output windows. Complexities may be defined for an unlimited number of stages.

77
SUDAAN
proc descript data=dataset design=WOR;
nest strata1 clusters1 strata2 clusters2 ...;
totcnt fpc_s1 fpc_c1 fpc_s2 fpc_c2 ...;
weight weights;
subgroup subdomains;
levels numbers_of_levels_for_subdomains;
var analyze_this;
run;

The SUDAAN model is essentially the same as the SAS model, where each procedure is a self-contained
unit of analysis describing the full design. Typically SUDAAN procedures are called from within a SAS
environment, but a standalone computing environment is also available. Complexities may be defined
for an unlimited number of stages. (Note that FPC variables must be provided for each level in the nest
statement, but some of these can be "_ZERO_" to indicate no variance contribution.)

SPSS
csdescriptives
/plan file = "CSPLAN_file"
/summary variables = analyze_this
/subpop table = subdomains
/mean
/statistics desired_statistics
/missing scope = analysis

The SPSS computational model is similar to Stata's, but the emphasis in SPSS is on point-and-click user
operations. A "CSPLAN" file describing the complex survey sampling structure must be created prior to
analysis, but all this is automated by wizards that guide the user through every step of the operation.
Code (such as e.g. above) can be captured for later reuse, but in general is not helpful or enlightening
about what SPSS does behind the scenes. The typical SPSS environment is a modern windowed interface
with buttons and menus. Complexities may be defined for an unlimited number of stages.

WesVar
(no code interface)

The WesVar interactive environment is point-and-click only. In the latest version (5.1) there are some
limited options for creating batch files [111], but only for routine preliminary operations. Although the
interface is somewhat old-fashioned and bizarre-looking by modern standards, it is easy enough to work
with once you get the hang of it. The typical process flow involves importing a dataset, describing the
key design variables to WesVar, and then creating and saving a new file with a set of WesVar-generated
replicate weights for analysis. This file is then attached to an analysis workbook where procedures such
as tabulation, regression, and general descriptive analysis can be performed. Output is generated to a
separate window and can be exported in a number of different formats for later perusal. Input and
output file management can become a nuisance, as WesVar creates a very large number of separate
files associated with any given project and provides no linking structure (beyond naming) between them.
Only first-stage complexities can be defined, but this is not a limitation for replicate-based methods (see
OPTIONS FOR VARIANCE ESTIMATION).
78
AM
(no code interface)

Like WesVar, AM provides no coding language for customizing analysis plans, but unlike WesVar, the
interface is fairly streamlined and modern in appearance. The operational model is point-and-click and
click-and-drag, with all procedures defined through wizards. The typical process flow involves importing
a dataset, which produces a listing of variables in a subwindow. All tests, regressions, and other analyses
are selected through menus and defined by dragging variables from the list into the appropriate slots
within the wizard windows. Default output is in HTML, and AM attempts to use your default browser for
display, which can cause problems with modern browsers that heavily sandbox all calls from external
processes, but ASCII output can also be requested. Only first-stage complexities can be defined.

IVEware/SRCware
datain dataset;
stratum strata;
cluster clusters;
weight weights;
by subdomains;
mean analyze_this;
MODEL MULT;
run;

The IVEware model is very similar to that of SAS and SUDAAN, but IVEware provides an integrated
windowed interface with menus and wizard-like operations. A datafile must be imported and then
described to IVEware using the DATAPREP (IVEware) or METADATA (SRCware) procedures, which define
the parameters of the complex survey sampling design. Other wizards are available to guide the user
through the definition of an analysis plan, or code may be entered and run directly. The code syntax is
similar to SAS and the IVEware version of the software provides a SAS-callable interface. A very large
number of output files and other files for internal use by the software are generated by each procedure,
but IVEware provides helpful linking by organizing them into a consistent folder structure (indexed by
date and time) in a predefined location. Only first-stage complexities can be defined.

79
APPENDIX C: SOFTWARE VERSION INFORMATION AND AVAILABILITY

Below is a brief summary of the software packages examined in this paper. The costs presented (current
to 2014) are for default user-level packages with no special discounts or add-ons (except where noted).

SAS
Version 9.2 (2010; most current version is 9.4, released 2013)
Platforms Windows families, UNIX/Linux families
Cost US$8700 for a first-year license
Support http://support.sas.com/documentation/
Comments Big and bloated. Powerful at most statistical analysis tasks, but weak for survey samples.
Very large and established worldwide user base and extensive support offerings.

Stata
Version 13 (most current version, released 2013)
Platforms Windows families, UNIX/Linux families, Mac families
Cost US$395 for a permanent license
Support http://www.stata.com/support/documentation/
Comments Nimble, with many extremely advanced modeling tools. Very strong for survey samples.
Voluminous documentation and enthusiastic user base focused on high-level research.

R (survey package)
Version 3.29.5 (most current version, released 2013)
Platforms Windows families, UNIX/Linux families, Mac families, open-source compilable
Cost Free
Support http://cran.r-project.org/web/packages/survey/
Comments Frequently updated, powerful, and extensible. Most advanced for survey samples.
Add-on to the R programming environment. Avid user base but scattered documentation.

SUDAAN (RTI)
Version 11.0.1 (most current version, released 2013)
Platforms Windows families, UNIX/Linux families
Cost US$1090 for a first-year license; US$2265 for a permanent license
Support http://cran.r-project.org/web/packages/survey/
Comments Highly tuned, with powerful SAS-like tools. Gold standard for survey samples, but aging.
Long history of focused development. Good documentation and SAS-callable interface.

SPSS with Complex Samples (IBM)

Version 22 (most current version, released 2013)
Platforms Windows families, UNIX/Linux families, Mac families
Cost US$5430 + US$1370 for a first-year Standard license and Complex Samples license
Support http://pic.dhe.ibm.com/infocenter/spssstat/v22r0m0/
Comments Big and bloated. Powerful at linear modeling, but below the curve on survey samples.
Widely used in social sciences and for large-business analytics. Dense documentation.

81
WesVar (Westat)
Version 5.1 (most current version, released 2013)
Platforms Windows families
Cost Free
Support http://www.westat.com/Westat/expertise/information_systems/WesVar/
Comments Taut and highly capable, focusing on replicate-based methods of survey sample analysis.
Long history of research and development. Excellent documentation but quirky interface.

AM (AIR)
Version 0.06.04beta (most current version, released 2011)
Platforms Windows families
Cost Free
Support http://am.air.org/
Comments Advanced but unfinished tool for analysis of assessment surveys. Good for simple samples
but weak on complex samples. Sporadic development and limited documentation.

IVEware/SRCware (University of Michigan)

Version 0.2beta (most current version, released 2011)
Platforms Windows families, UNIX/Linux families, Mac families
Cost Free
Support http://www.isr.umich.edu/src/smp/ive/
Comments Sophisticated tool for multiple imputation and single-stage variance estimation, but weak
on complex samples. SAS-callable. Sporadic development and limited documentation.

82
APPENDIX D: DATASETS

All datasets used in the paper are reproduced or described below, along with source information and
pertinent details. A link to downloadable copies of the datasets is provided at the end of this appendix.

CLUSTERS-VS-STRATA DATASET (original)

id y N Ns Nc Nsc Ncs s1 c2 c1 s2
1 87 220 100 110 10 10 1 12 2 21
2 94 220 100 110 10 10 1 12 2 21
3 84 220 100 110 10 10 1 13 4 41
4 77 220 100 110 10 10 1 13 4 41
5 100 220 100 110 10 10 1 16 5 51
6 92 220 100 110 10 10 1 16 5 51
7 89 220 100 110 10 10 1 17 6 61
8 97 220 100 110 10 10 1 17 6 61
9 95 220 100 110 10 10 1 19 10 101
10 79 220 100 110 10 10 1 19 10 101
11 71 220 120 110 12 10 2 22 2 22
12 12 220 120 110 12 10 2 22 2 22
13 41 220 120 110 12 10 2 23 4 42
14 15 220 120 110 12 10 2 23 4 42
15 35 220 120 110 12 10 2 25 5 52
16 9 220 120 110 12 10 2 25 5 52
17 41 220 120 110 12 10 2 26 6 62
18 28 220 120 110 12 10 2 26 6 62
19 65 220 120 110 12 10 2 28 10 102
20 23 220 120 110 12 10 2 28 10 102

SRSWR: use only the observations (y)

SRSWOR: same as SRSWR but use population size (N) for the FPC
STRATIFIED: use s1 codes as strata and Ns for the FPC
CLUSTERED: use c2 codes as clusters and Nc for the FPC
STRATIFED-THEN-CLUSTERED: use s1 codes as strata and c2 codes as clusters within strata; use Nsc for
the first-stage (stratum) FPC and Ncs for the second-stage (cluster) FPC
CLUSTERED-THEN-STRATIFIED: use c1 codes as clusters and s2 codes as strata within clusters; use Ncs
for the first-stage (cluster) FPC and Nsc for the second-stage (stratum) FPC

BENCHMARK DATASET (taken from Wolter [55 p410-415])

stratum fpcstr cluster fpcclu y wt

1 15 1 10 1 10
1 15 1 10 2 10
1 15 1 10 3 10
1 15 1 10 4 10
1 15 1 10 5 10
1 15 2 10 2 10
1 15 2 10 3 10
1 15 2 10 4 10
1 15 2 10 5 10
1 15 2 10 6 10
1 15 3 10 3 10
1 15 3 10 4 10
1 15 3 10 5 10
1 15 3 10 6 10
1 15 3 10 7 10
2 15 1 10 1 10
2 15 1 10 2 10
2 15 1 10 3 10
2 15 1 10 4 10
2 15 1 10 5 10

83
stratum fpcstr cluster fpcclu y wt
2 15 2 10 2 10
2 15 2 10 3 10
2 15 2 10 4 10
2 15 2 10 5 10
2 15 2 10 6 10
2 15 3 10 3 10
2 15 3 10 4 10
2 15 3 10 5 10
2 15 3 10 6 10
2 15 3 10 7 10
3 15 1 10 1 10
3 15 1 10 2 10
3 15 1 10 3 10
3 15 1 10 4 10
3 15 1 10 5 10
3 15 2 10 2 10
3 15 2 10 3 10
3 15 2 10 4 10
3 15 2 10 5 10
3 15 2 10 6 10
3 15 3 10 3 10
3 15 3 10 4 10
3 15 3 10 5 10
3 15 3 10 6 10
3 15 3 10 7 10
4 15 1 10 1 30
4 15 1 10 2 30
4 15 1 10 3 30
4 15 1 10 4 30
4 15 1 10 5 30

Benchmark 1: use the data for stratum 1 only

Benchmark 2: use the data for strata 13 only
Benchmark 3: use the full dataset
Benchmark 4: use the full dataset, but change all records in stratum 4 to fpcstr=1 and wt=2
Benchmark 5: use the full dataset and append the following records:

stratum fpcstr cluster fpcclu y wt

4 15 2 10 2 10
4 15 2 10 3 10
4 15 2 10 4 10
4 15 2 10 5 10
4 15 2 10 6 10
4 15 3 10 3 50

Benchmark 6: use the Benchmark 5 dataset, but change the final record to fpcclu=1 and wt=5

PROBLEMATIC DATASET (taken from DACSEIS [70])

id stratum x y wt wtmiss wtzero wtneg wt01 strsize num badstr badtot

1 1 3 6 5 . 0 -2 .7 20 4 1 20
2 1 25 12 5 5 5 5 5 20 4 1 20
3 1 . 7 5 5 5 5 5 20 4 1 20
4 1 12 5 5 5 5 5 5 20 4 1 20
1 2 1 44 25 . 0 -1 .2 125 5 1 4
2 2 . 42 25 25 25 25 25 125 5 2 4
3 2 . 19 25 25 25 25 25 125 5 2 4
4 2 29 4 25 25 25 25 25 125 5 2 4
5 2 12 4 25 25 25 25 25 125 5 2 4

Note that for the test in Table 15, observations 24 were deleted and the weight (wt) of observation 1
was changed to 20.

84
SUDAAN DESIGN EXAMPLE DATASETS (adapted from examples taken from RTI International [109])

region stratum school psu student fpc_region fpc_stratum gpa

1 1 10 10 101 530 530 2.52
1 1 10 10 102 530 530 3.28
1 1 10 10 103 530 530 1.96
1 1 20 20 201 530 530 3.68
1 1 20 20 202 530 530 2.1
1 1 20 20 203 530 530 4
1 1 30 30 301 530 530 3.33
1 1 30 30 302 530 530 2.88
1 1 30 30 303 530 530 1.78
2 2 40 40 401 1640 1640 1.58
2 2 40 40 402 1640 1640 1.9
2 2 40 40 403 1640 1640 3.02
2 2 50 50 501 1640 1640 2.56
2 2 50 50 502 1640 1640 3.71
2 2 50 50 503 1640 1640 2.6
2 60 60 601 601 1640 3 3.34
2 60 60 602 602 1640 3 3.22
2 60 60 603 603 1640 3 2.81

Test 1: use region codes as strata, school codes as PSUs, and fpc_region for the stratum sizes
Test 2: use stratum codes as strata, psu codes as PSUs, and fpc_stratum for the stratum sizes

region county stratum school student fpc_stratum gpa

1 1 1 10 101 57 2.52
1 1 1 10 102 57 3.28
1 1 1 10 103 57 1.96
1 1 1 20 201 57 3.68
1 1 1 20 202 57 2.1
1 1 1 20 203 57 4
1 1 1 30 301 57 3.33
1 1 1 30 302 57 2.88
1 1 1 30 303 57 1.78
1 2 2 40 401 188 1.58
1 2 2 40 402 188 1.9
1 2 2 40 403 188 3.02
1 2 2 50 501 188 2.56
1 2 2 50 502 188 3.71
1 2 2 50 503 188 2.6
1 2 2 60 601 188 3.34
1 2 2 60 602 188 3.22
1 2 2 60 603 188 2.81
1 3 3 70 701 285 3.83
1 3 3 70 702 285 3.34
1 3 3 70 703 285 3.24
1 3 3 80 801 285 2.5
1 3 3 80 802 285 3.55
1 3 3 80 803 285 2.46
1 3 3 90 901 285 3.08
1 3 3 90 902 285 1.54
1 3 3 90 903 285 2.75
2 1 4 100 1001 418 2.4
2 1 4 100 1002 418 2.68
2 1 4 100 1003 418 3.78
2 1 4 110 1101 418 3.44
2 1 4 110 1102 418 3.56
2 1 4 110 1103 418 4
2 1 4 120 1201 418 1.46
2 1 4 120 1202 418 3.79
2 1 4 120 1203 418 3.9
2 2 5 130 1301 487 1.54
2 2 5 130 1302 487 5.2
2 2 5 130 1303 487 3.62
2 2 5 140 1401 487 2.44
2 2 5 140 1402 487 2.37
2 2 5 140 1403 487 4
2 2 5 150 1501 487 3.91
2 2 5 150 1502 487 2.94
2 2 5 150 1503 487 1.39
2 3 6 160 1601 735 4

85
region county stratum school student fpc_stratum gpa
2 3 6 160 1602 735 3.64
2 3 6 160 1603 735 3.38
2 3 6 170 1701 735 3.07
2 3 6 170 1702 735 3.51
2 3 6 170 1703 735 4
2 3 6 180 1801 735 2.18
2 3 6 180 1802 735 1.68
2 3 6 180 1803 735 2.83

Test 3: use stratum codes as strata, school codes as PSUs, and fpc_stratum for the stratum sizes

region school student fpc_region fpc_school wt_student gpa

1 10 101 15 250 416.67 2.52
1 10 102 15 250 416.67 3.28
1 10 103 15 250 416.67 1.96
1 20 201 15 125 208.33 3.68
1 20 202 15 125 208.33 2.1
1 20 203 15 125 208.33 4
1 30 301 15 75 125 3.33
1 30 302 15 75 125 2.88
1 30 303 15 75 125 1.78
2 40 401 100 150 1666.67 1.58
2 40 402 100 150 1666.67 1.9
2 40 403 100 150 1666.67 3.02
2 50 501 100 70 777.78 2.56
2 50 502 100 70 777.78 3.71
2 50 503 100 70 777.78 2.6
2 60 601 100 50 555.56 3.34
2 60 602 100 50 555.56 3.22
2 60 603 100 50 555.56 2.81

Test 4: use region codes as strata, school codes as PSUs, student codes as SSUs, fpc_region for the
stratum sizes, and fpc_school for the school sizes
Test 5: use the Test 4 dataset, but ignore fpc_region in region 2
Test 6: use the Test 4 dataset, but change records as follows:

i. set region=3 for schools 30 and 60

ii. set fpc_region=2 for schools 30 and 60
iii. recompute wt_student=(fpc_region/2)x(fpc_school/3) for all records

Test 7: use the Test 6 dataset (containing three regions), but ignore fpc_region in region 2
Test 8: use the Test 4 dataset (containing two regions), but ignore fpc_school
Test 9: use the Test 4 dataset, but change fpc_school=3 and recompute wt_student=fpc_region/3

region county urbrur school student fraccty fracsch fracst frac2st wtst wt2st gpa
1 1 1 10 101 0.5 0.67 0 0.01 2.985074627 298.5074627 2.52
1 1 1 10 102 0.5 0.67 0 0.01 2.985074627 298.5074627 3.28
1 1 1 10 103 0.5 0.67 0 0.01 2.985074627 298.5074627 1.96
1 1 1 20 201 0.5 0.67 0 0.003 2.985074627 995.0248756 3.68
1 1 1 20 202 0.5 0.67 0 0.003 2.985074627 995.0248756 2.1
1 1 1 20 203 0.5 0.67 0 0.003 2.985074627 995.0248756 4
1 1 2 30 301 0.5 0.5 0 0.041 4 97.56097561 3.33
1 1 2 30 302 0.5 0.5 0 0.041 4 97.56097561 2.88
1 1 2 30 303 0.5 0.5 0 0.041 4 97.56097561 1.78
1 1 2 40 401 0.5 0.5 0 0.027 4 148.1481481 1.58
1 1 2 40 402 0.5 0.5 0 0.027 4 148.1481481 1.9
1 1 2 40 403 0.5 0.5 0 0.027 4 148.1481481 3.02
1 2 1 50 501 0.5 0.4 0 0.002 5 2500 2.56
1 2 1 50 502 0.5 0.4 0 0.002 5 2500 3.71
1 2 1 50 503 0.5 0.4 0 0.002 5 2500 2.6
1 2 1 60 601 0.5 0.4 0 0.005 5 1000 3.34
1 2 1 60 602 0.5 0.4 0 0.005 5 1000 3.22
1 2 1 60 603 0.5 0.4 0 0.005 5 1000 2.81
1 2 2 70 701 0.5 0.67 0 0.032 2.985074627 93.28358209 3.83

86
region county urbrur school student fraccty fracsch fracst frac2st wtst wt2st gpa
1 2 2 70 702 0.5 0.67 0 0.032 2.985074627 93.28358209 3.34
1 2 2 70 703 0.5 0.67 0 0.032 2.985074627 93.28358209 3.24
1 2 2 80 801 0.5 0.67 0 0.092 2.985074627 32.44646334 2.5
1 2 2 80 802 0.5 0.67 0 0.092 2.985074627 32.44646334 3.55
1 2 2 80 803 0.5 0.67 0 0.092 2.985074627 32.44646334 2.46
2 1 1 90 901 0.3 0.25 0 0.004 13.33333333 3333.333333 3.08
2 1 1 90 902 0.3 0.25 0 0.004 13.33333333 3333.333333 1.54
2 1 1 90 903 0.3 0.25 0 0.004 13.33333333 3333.333333 2.75
2 1 1 100 1001 0.3 0.25 0 0.006 13.33333333 2222.222222 2.4
2 1 1 100 1002 0.3 0.25 0 0.006 13.33333333 2222.222222 2.68
2 1 1 100 1003 0.3 0.25 0 0.006 13.33333333 2222.222222 3.78
2 1 2 110 1101 0.3 0.22 0 0.081 15.15151515 187.0557426 3.44
2 1 2 110 1102 0.3 0.22 0 0.081 15.15151515 187.0557426 3.56
2 1 2 110 1103 0.3 0.22 0 0.081 15.15151515 187.0557426 4
2 1 2 120 1201 0.3 0.22 0 0.015 15.15151515 1010.10101 1.46
2 1 2 120 1202 0.3 0.22 0 0.015 15.15151515 1010.10101 3.79
2 1 2 120 1203 0.3 0.22 0 0.015 15.15151515 1010.10101 3.9
2 2 1 130 1301 0.3 0.2 0 0.003 16.66666667 5555.555556 1.54
2 2 1 130 1302 0.3 0.2 0 0.003 16.66666667 5555.555556 5.2
2 2 1 130 1303 0.3 0.2 0 0.003 16.66666667 5555.555556 3.62
2 2 1 140 1401 0.3 0.2 0 0.01 16.66666667 1666.666667 2.44
2 2 1 140 1402 0.3 0.2 0 0.01 16.66666667 1666.666667 2.37
2 2 1 140 1403 0.3 0.2 0 0.01 16.66666667 1666.666667 4
2 2 2 150 1501 0.3 0.25 0 0.047 13.33333333 283.6879433 3.91
2 2 2 150 1502 0.3 0.25 0 0.047 13.33333333 283.6879433 2.94
2 2 2 150 1503 0.3 0.25 0 0.047 13.33333333 283.6879433 1.39
2 2 2 160 1601 0.3 0.25 0 0.077 13.33333333 173.1601732 4
2 2 2 160 1602 0.3 0.25 0 0.077 13.33333333 173.1601732 3.64
2 2 2 160 1603 0.3 0.25 0 0.077 13.33333333 173.1601732 3.38

Test 10: use region codes as the first-stage strata, county codes within region as the PSUs, then urbrur
codes as the second-stage strata, school codes within urbrur as the SSUs, and finally student codes as
TSUs; use fraccty for the sampling fractions within regions, fracsch for the sampling fraction within
counties, and fracst for the sampling fraction within schools. (The frac2st and wt2st columns are for
comparison purposes; see COMPARING ON SUDAAN DESIGNS.)

HIGH SCHOOL DATASET (adapted from a pedagogic sample created by StataCorp [104])

observations 4071
variables 11
BRR replicate weights 52

Fictitious height (height) and weight (weight) measurements taken on high school students (id), whose
sex (sex; 1=male, 2=female) and race (race; 1=white, 2=black, 3=other) were also recorded. Students
are sampled WOR within schools (school) of unspecified population size (i.e. assume WR sampling); the
schools are sampled WOR within counties (county) of specified size (nschools) within strata (state) of
specified size (ncounties). Each student is associated with a postadjusted sampling weight (sampwgt).
Since exactly two PSUs were sampled per stratum, BRR variance estimation is appropriate. See the
DOWNLOAD SITE section below for a link to a downloadable copy of the dataset.

WIC DATASET (taken from RTI International [108])

observations 953
variables 21
BRR replicate weights 24

87
The dataset contains a subset of data obtained during a year-long study of infant feeding practices
among new mothers (ID) enrolled in the WIC (Special Supplemental Nutrition Program for Women,
Infants, and Children) program. A sample of 42 local agencies (SITE) was selected from a national listing,
implicitly stratified by region and state (STRATUM). Within each agency, approximately 23 new mothers
participating in the WIC program were selected. (The selection counts vary considerably across sites.
Some sites present as many as 36 participants and other sites as few as 9 participants.) For each new
mother, measurements of infant weight (BABYWGT) were obtained, and the women were asked whether
or not they had initiated breastfeeding (BRFDINIT). Various demographic information such as maternal
race (RACEMOM) and highest level of education (EDUC) was also elicited. Some raking was performed on
the counts and a final analysis weight (ANALWGT1) generated for each new mother. (Other variables were
also measured and included in the dataset. See [105] for more discussion, and also Rush et al 1988,
"Longitudinal Study of Pregnant Women", American Journal of Clinical Nutrition 48:439-483 for more
about the larger study.) Since exactly two agencies were selected per stratum, BRR variance estimation
is appropriate. See the DOWNLOAD SITE section below for a link to a downloadable copy of the dataset.

POLL DATASETS (adapted from a pedagogic sample created by StataCorp [104])

"Ideal" Data
observations 1440
variables 17
bootstrap weights 200

Fictitious poll data (x=support:Y/N, y=age, z=sex:M/F) collected on citizens according to a WOR sampling
design involving stratification at four levels (st1,st2,st3,st4) with clusters (id1,id2,id3,id4) selected
recursively within each level. FPCs (fpc1,fpc2,fpc3,fpc4) are applied at each stage, and the sampling
weights (wt) accurately reflect the sampling plan. A poststratification weight (pwt) is also provided for
poststratification based on sex, and bootstrap weights (bt1bt200) for variance analysis. Despite the
complexity of the plan, the organization of the design is very regular, with exactly 3 PSUs per stratum
selected at stage 1, exactly 2 SSUs per stratum selected at stage 2, exactly 4 TSUs per stratum at stage 3,
and exactly 5 QSUs (quaternary sampling units) at stage 4 if stratification is ignored at that stage. With
no surprises, software capable of arbitrarily multistage analysis should produce correct answers on this
dataset, and bootstrap variance estimation can provide a check of correctness.

"Real" Data
observations 11039
variables 18

This dataset (also fictitious) is meant to simulate a scaling up and imperfect realization of the idealized
plan described above. Imagine a national citizen poll on support (Support) for some measure. States are
stratified into three groups according to political leanings (RedBlue) and then sampled (State) WOR.
Then each state is stratified into rural and urban counties (RuralUrban), and counties (County) are
selected WOR within each second-stage stratum. Selected counties are further stratified into groups of
large and small cities (LargeSmall), and individual cities (City) are chosen WOR from the set of the five
largest cities within each third-stage stratum. Finally, individual citizens (Id) are selected WOR from the
chosen cities. In addition to their support for the measure, their age (Age) and sex (Sex) are recorded.
88
Three independent sets of weights are provided in the dataset. One set is the design-correct sampling
probability weights (ProbWeight), which have been scaled uniformly by 1/8 to approximately account
for the actual adult voting population in the US. (The actual US voting-eligible population is about 230
million, but the ProbWeight weights sum to 288 million instead because choosing only from the largest
cities in the penultimate stage overrepresents the true population.) Software that claims to compute
correct probability weights automatically should produce ProbWeightx8. The second set of weights can
be considered importance weights (ImpWeight); assume that they have been assigned according to
some arbitrary weighting scheme that bears little if any relation to the actual sampling probabilities.
These sum to 529810.5 and can be used to check whether software fully respects weight specifications
or allows design weights to intrude into computations. The third set of weights is a poststratification
weight (SexWeight) to be used for poststratifying on the assumption that males and females are present
in equal numbers in the voting population; the weight is 1/2 for every record. Also included in the
dataset are FPC variables (fpc_state,fpc_county,fpc_city,fpc_id) representing population totals at
each stage, and a variable recoding age into age decades (AgeCat) as a categorical variable for use in
tabulation analysis.

Compared to the idealized dataset, the "real" dataset presents many problems to the analyst. There are
exactly 3 PSUs per stratum in the first stage, and one of the sampling fractions is large (3/7). The small
number of PSUs makes jackknife replication undesirable because of the low degrees of freedom, and the
extremely complex nesting and subnesting of various stratification variables makes it very difficult if not
impossible to faithfully split the design out into a 2-PSU-per-stratum one in order to make it amenable to
BRR analysis. Furthermore, in stages 2 and 3 there are multiple random singleton units (imagine that
some cities or counties could not be sampled due to cost logistics) that multistage-capable software will
have to contend with. Again, the very complex stratification makes it extremely challenging to collapse
strata into neighbors because "nearby" is hard to define consistently across multiple levels of analysis;
any attempt to collapse strata at one stage is almost sure to break the semantics of the stratification at
subsequent stages. Also, sampling fractions are very large (as high as 3/5) at the third stage, so early WR
approximations are likely to misstate the variance by quite a bit. In short, this is a nightmare dataset.

DOWNLOAD SITE

All datasets described above are available for download in tab-delimited ASCII format from:

http://web.pdx.edu/~wiedrick/stat501.html

89
APPENDIX E: PROOFS OF SELECTED RESULTS

Below we present proofs of the major estimation results presented in the early sections of the paper.

PROOF THAT SRSWR ESTIMATORS ARE UNBIASED

1 1

1 1
, 0

And is known to be unbiased for under iid sampling.

PROOF THAT SRSWOR ESTIMATORS ARE UNBIASED

1,
For the variance estimator, let 1 ~ , where / . Then:
0,

1
1 ,1 |
1 1

We will also require the identity:

2 2

Then using these results we obtain:

1 1 1 1 1 1
1 1

1 1 1
1 1 1 ,1

1
1 |

1
1
1

91
1
1
1

1
1
1
1
1
1 1
1
1
1
1
1

where . Then all we need to show is that , viz.:

1
1
1
1
1
1
1
1
1
1
1
1
1
1 1
1
1 1
1
1 1
1
1 1

92
PROOF THAT SIMPLE STRATIFIED SAMPLING ESTIMATORS ARE UNBIASED

1 1 1

because is estimated under SRSWOR. By independence of sampling across strata we have:

1 1
1

and was already shown to be unbiased for under SRSWOR.

PROOF THAT SIMPLE CLUSTER SAMPLING ESTIMATORS ARE UNBIASED

since the clusters are selected by SRSWOR. For the same reason, is unbiased for .

PROOF THAT STRATIFIED CLUSTER SAMPLING ESTIMATORS ARE UNBIASED

since the strata are independent and clusters within strata are selected by SRSWOR. For the same
reason, is unbiased for .

1 1 1
|

For the variance estimator, we want to show:

1
1 1

93
The proof of this can be broken into three steps:

1. , where is the estimator of the mean in two-stage cluster sampling

without stratification
2. 1 and 1
3.

The proof of part 1 is rather involved and will not be repeated here; see Lohr [7 p255-262] for a
complete and general demonstration. For part 2, the formula for follows by the fact that PSUs
under this scheme are selected by simple cluster sampling, and the formula for can be seen as:

because each SSU is selected within its parent PSU by simple cluster sampling and carries a variance
component scaled by the weight of the parent PSU; for those weights are equal to 1 .
For part 3, the second term follows by independence of sampling across strata, where the weights for
must be adjusted in the second stage because the probability of including a PSU depends on
the sampling fraction we need to obtain from each stratum (i.e. PSUs need to meet a size requirement
that impacts their probabilities of selection in a finite population). If is the overall sampling fraction
for stratum across PSUs 1, , and is the inclusion probability for PSU , then we can write:

because 0 for 1, , . The lowest-entropy solution to this relation is for every , so

choosing / as the PSU weight gives an estimate that is unbiased in expectation. Finally, the sum
follows by part 1, where under the sampling plan is replaced by .

PROOF THAT THE EFFECTIVE SAMPLE SIZE FORMULA IS VALID

1 1

PROOF THAT THE WR APPROXIMATION TO PS SAMPLING BECOMES CONSERVATIVE IN EXPECTATION

We need to show that two results hold under arbitrary-probability WR (PPZ) sampling:

94
where and . Following Cochran [2 p252-253], we define
the random vector ~ , where # and recognize that
marginally each ~ , with , when . Then:

1 1 1
,

1
1

1 1
1

And since 2 2 we can write:

2 2

because 1. This proves the first claim. For the second claim we note:

So if we can assume | we have:

| | |

95
1

This will prove the claim if we can show that | . What the condition means is
that the "probability" of being drawn into the sample is equal to the fraction of single draws that hit
over draws. Clearly this is not an actual probability but an expected inclusion count given independent
WR draws, but the idea is to show that assuming the single-draw probabilities are equal to the "average"
inclusion probabilities is a conservative approach on average, meaning that the expected estimate of
variance using the WR approximation will be larger than the variance under true WR sampling.

1
|

1
1 0

for large enough, because eventually 1 by the

Cauchy-Schwarz inequality. Therefore, for reasonable sample sizes, the WR approximation will become
appropriately conservative.

96
PROOF THAT STRATIFIED TWO-STAGE CLUSTER SAMPLING ESTIMATORS ARE UNBIASED

Given a set of probability weights, the Horvitz-Thompson estimator of the total is always unbiased for
the population total implied by the weights [12]. The variance estimator can be written as:

Since two-stage cluster sampling is conducted within each stratum, is unbiased for the
variance of the stratum mean, as shown by Lohr [7 p255-262], and hence is unbiased for
the variance of the stratum total. By independence of sampling across strata, is unbiased for the
variance of the population total.

PROOF THAT STRATIFIED THREE-STAGE VARIANCE ESTIMATORS ARE UNBIASED

The estimator of the total is the usual Horvitz-Thompson estimator, which is always unbiased [12]. The
estimator of the variance of the estimator of the total can be written as:

where 1 is an unbiased estimator of the variance of the cluster totals within

first-stage stratum , and is an unbiased estimator (as shown by Lohr [7 p255-262]) of
the two-stage cluster variance within second-stage stratum nested within PSU inside first-stage
stratum . With weights chosen as / / and given the unbiasedness of
already asserted, the sum is an unbiased estimator of
the variance of the stratified sampling estimator of the PSU total, due to independence of sampling
across strata and the application of results demonstrated above. Then is a properly weighted and
unbiased estimator of the aggregate within-cluster variance of the PSUs in stratum , and given the
unbiasedness of already asserted, is unbiased for the variance of the estimator of
the total for stratum . Finally, again invoking the independence of sampling across strata, the sum
is an unbiased estimator of .

97
REFERENCES

GENERAL
[1] Chaudhuri, A; Stenger, H. 2005. Survey Sampling: Theory and Methods, 2e. CRC: Boca Raton, FL.
[2] Cochran, WG. 1977. Sampling Techniques, 3e. Wiley: New York.
[3] Deming, WE. 1966. Some Theory of Sampling. Dover: New York.
[4] Fuller, WA. 1987. Measurement Error Models. Wiley: New York.
[5] Hansen, MH; Hurwitz, WN; Madow, WG. 1953. Sample Survey Methods and Theory, Volumes I
(Methods and Applications) & II (Theory). Wiley: New York.
[6] Kish, L. 1965. Survey Sampling. Wiley: New York.
[7] Lohr, SL. 2009. Sampling: Design and Analysis. Cengage: Boston.
[8] Rice, JA. 2007. Mathematical Statistics and Data Analysis, 3e. Duxbury: Belmont, CA.
[9] Sampath, S. 2001. Sampling Theory and Methods. CRC: Boca Raton, FL.
[10] Srndal, CE; Swensson, B; Wretman, J. 1992. Model Assisted Survey Sampling. Springer: New York.

THEORETICAL TOPICS
[11] Grnewald, M; Hssjer, O. 2012. A General Statistical Framework for Multistage Designs.
Scandinavian Journal of Statistics 39:131152.
[12] Horvitz, DG; Thompson, DJ. 1952. A Generalization of Sampling Without Replacement from a Finite
Universe. Journal of the American Statistical Association 47(260):663685.
[13] Kolenikov, S. Retrieved 2014. Analysis of Complex Survey Data. University of Missouri, Columbia.
http://web.missouri.edu/~kolenikovs/Stat9100svy/ComplexSvy-LectureNotes.pdf
[14] Nafiu, LA. 2012. Comparison of One-Stage, Two-Stage, and Three-Stage Estimators Using Finite
Population. Pacific Journal of Science and Technology 13(2):166171.
[15] Namboodiri, NK. (ed.) 1978. Survey Sampling and Measurement. Academic Press: New York.
[16] Pfeffermann, D. 1993. The Role of Sampling Weights when Modeling Survey Data. International
Statistical Review 61:317337.
[17] Rao, JNK. 1997. Developments in Sample Survey Theory: An Appraisal. Canadian Journal of
Statistics 25(1):121.
[18] Rao, JNK. 2005. Interplay between Sample Survey Theory and Practice: An Appraisal. Survey
Methodology 31(2):117138.
[19] Stehman, SV; Overton, WS. 1987. Estimating the Variance of the Horvitz-Thompson Estimator in
Variable Probability Systematic Samples. Proceedings of the Section on Survey Research Methods,
American Statistical Association, 743748.
[20] West, B. Retrieved 2014. Accounting for Multi-stage Sample Designs in Complex Sample Variance
Estimation. Survey Methodology Program, University of Michigan.
http://www.isr.umich.edu/src/smp/asda/first_stage_ve_new.pdf
[21] Williams, RL. 2000. A Note on Robust Variance Estimation for Cluster-Correlated Data. Biometrics
56(2): 645646.
[22] Young, DS. Retrieved 2014. Regression Methods. Pennsylvania State University.
https://onlinecourses.science.psu.edu/stat501/node/46

APPLIED TOPICS
[23] Chambers, RL; Skinner, CJ. (eds.) 2003. Analysis of Survey Data. Wiley: New York.
[24] Dargatz, DA; Hill, GW. 1996. Analysis of Survey Data. Preventive Veterinary Medicine 28:225237.
[25] Heeringa, SG; West, BT; Berglund, PA. 2010. Applied Survey Data Analysis. CRC: Boca Raton, FL.

99
[26] Lehtonen, R; Pahkinen, EJ. 2004. Practical Methods for Design and Analysis of Complex Surveys, 2e.
Wiley: New York.
[27] Obenauf, W. 2003. An Application of Sampling Theory to a Large Federal Survey. Portland State
University, unpublished manuscript.
[28] Osborne, JW. 2011. Best Practices in Using Large, Complex Samples: The Importance of Using
Appropriate Weights and Design Effect Compensation. Practical Assessment, Research &
Evaluation 16(12):17.
[29] Ross, KC; Renckly, TR. 2002. Air University Sampling and Surveying Handbook. US Air Force:
Maxwell Air Force Base, AL.
[30] Sloane, NJA. Retrieved 2014. A Library of Hadamard Matrices. http://neilsloane.com/hadamard/
[31] Sudman, S. 1976. Applied Sampling. Academic Press: New York.
[32] United Nations Statistics Division. 2005 (Retrieved 2014). Household Sample Surveys in Developing
and Transition Countries. https://unstats.un.org/unsd/hhsurveys/pdf/Household_surveys.pdf

MULTILEVEL MODELING
[33] Asparouhov, T; Muthen, B. 2006. Multilevel Modeling of Complex Survey Data. Proceedings of the
Section on Survey Research Methods, American Statistical Association, 27182726.
[34] Carle, AC. 2009. Fitting Multilevel Models in Complex Survey Data with Design Weights:
Recommendations. BMC Medical Research Methodology 9(49):113.
[35] Chantala, K; Suchindran, C. 2006. Adjusting for Unequal Selection Probability in Multilevel Models:
A Comparison of Software Packages. Proceedings of the Section on Survey Research Methods,
American Statistical Association, 28152824.
[36] Zhang, F; Salvucci, S; Cohen, M. 2000. Multilevel Linear Regression Analysis of Complex Survey
Data. Proceedings of the Section on Survey Research Methods, American Statistical Association,
197202.

VARIANCE ESTIMATION
[37] Binder, DA. 1983. On the Variances of Asymptotically Normal Estimators from Complex Surveys.
International Statistical Review 51(3):279292.
[38] Brewer, KRW; Hanif, M. 1970. Durbin's New Multistage Variance Estimator. Journal of the Royal
Statistical Society, Series B 32(2):302311.
[39] Brick, JM; Morganstein, D; Valliant, R. 2000 (Retrieved 2014). Analysis of Complex Sample Data
Using Replication. Westat. http://www.westat.com/westat/pdf/wesvar/acs-replication.pdf
[40] Chaudhuri, A; Arnab, R. 1982. On Unbiased Variance-Estimation with Various Multi-Stage Sampling
Strategies. Sankhy: The Indian Journal of Statistics, Series B 44(1):92101.
[41] Demnati, A; Rao, JNK. 2004. Linearization Variance Estimators for Survey Data. Survey
Methodology 30(1):1726.
[42] Demnati, A; Rao, JNK. 2007. Linearization Variance Estimators for Survey Data: Some Recent Work.
ICESIII: Papers Presented at the Third International Conference on Establishment Surveys, 916925.
[43] Fay, RE; Train, GF. 1995. Aspects of Survey and Model-Based Postcensal Estimation of Income and
Poverty Characteristics for States and Counties. US Census Bureau Conference Papers.
http://www.census.gov/did/www/saipe/publications/files/FayTrain95.pdf
[44] Ghosh, D; Vogt, A. 2004. Covariance Estimates in Stratified and Multistage Clustered Sampling.
Proceedings of the Section on Survey Research Methods, American Statistical Association, 3577
3580.
[45] Goga, C. 2008. Retrieved 2014. Variance Estimators in Survey Sampling. Universit de Bourgogne.
http://goga.perso.math.cnrs.fr/ChapVar1_coursBesan.pdf
100
[46] Judkins, DR. 1990. Fays Method for Variance Estimation. Journal of Official Statistics 6(3):223239.
[47] Korn, EL; Graubard, BI. 2003. Estimating Variance Components by Using Survey Data. Journal of the
Royal Statistical Society, Series B 65(1):175190.
[48] Kovar, JG; Rao, JNK; Wu, CFJ. 1988. Bootstrap and Other Methods to Measure Errors in Survey
Estimates. Canadian Journal of Statistics 16 Supplement: A Special Issue of Papers Presenting
Current Statistical Work at Statistics Canada, 2545.
[49] Mantel, H; Giroux, S. 2009. Variance Estimation in Complex Surveys with One PSU per Stratum.
Proceedings of the Section on Survey Research Methods, American Statistical Association, 3069
3082.
[50] Rao, JNK; Lanke, J. 1984. Simplified Unbiased Variance Estimation for Multistage Designs.
Biometrika 71(2):387395.
[51] Rao, JNK; Wu, CFJ. 1988. Resampling Inference with Complex Survey Data. Journal of the American
Statistical Association 83(401):231241.
[52] Royall, RM; Cumberland, WG. 1978. Variance Estimation in Finite Population Sampling. Journal of
the American Statistical Association 73(362):351358.
[53] Shah, BV. 2005. Linearization Methods of Variance Estimation. Encyclopedia of Biostatistics, 2e.
Edited by Armitage, P. and Colton, T. Wiley: New York.
[54] Shao, J. 2003. Impact of the Bootstrap on Sample Surveys. Statistical Science 18(2):191198.
[55] Wilson, M. 1989. An Evaluation of Woodruff's Technique for Variance Estimation in Educational
Surveys. Journal of Educational Statistics 14(1):81101.
[56] Wolter, KW. 2007. Introduction to Variance Estimation, 2e. Springer: New York.
[57] Woodruff, RS. 1971. A Simple Method for Approximating the Variance of a Complicated Estimate.
Journal of the American Statistical Association 66(334):411414.
[58] Woodruff, RS; Causey, BD. 1976. Computerized Method for Approximating the Variance of a
Complicated Estimate. Journal of the American Statistical Association 71(354):315321.
[59] Zhang, F; Weng, S; Salvucci, S; Hu, M. 2001. A Study of Variance Estimation Methods. US
Department of Education Office of Educational Research and Improvement, National Center for
Education Statistics Working Paper 2001-03.

SOFTWARE COMPARISONS
[60] Acock, AC. 2005. SAS, Stata, SPSS: A Comparison. Journal of Marriage and Family 67(4):10931095.
[61] Ahmad, T; Rai, A. Retrieved 2014. Packages for Survey Data Analysis. Indian Agricultural Statistics
Research Institute. http://www.iasri.res.in/design/ebook/EB_SMAR/e-book_pdf files/Manual I/
12-Packages for survey data analysis.pdf
[62] Ahti-Miettinen, O. 2008 (Retrieved 2014). Estimation in Complex Sample Design with Different
Statistical Software Packages. Statistics Estonia Workshop on Survey Sampling Theory and
Methodology. www-1.ms.ut.ee/samp2008/Presentations/OAhtiMiettinen.pdf
[63] Baisden, KL; Hu, P. 2006. The Enigma of Survey Data Analysis: Comparison of SAS Survey
Procedures and SUDAAN Procedures. Proceedings of the Thirty-First Annual SAS Users Group
International Conference (SUGI 31), Paper 194-31.
[64] Bell-Ellison, B; Kromrey, J. 2007. Software Alternatives for Variance Estimation in the Analysis of
Complex Sample Surveys: A Comparison of SAS Survey Procedures, SUDAAN, and AM. Proceedings
of the Section on Survey Research Methods, American Statistical Association, 26592666.
[65] Broene, P; Rust, K; Westat. 2000. Strengths and Limitations of Using SUDAAN, Stata, and
WesVarPC for Computing Variances from NCES Data Sets. US Department of Education Office of
Educational Research and Improvement, National Center for Education Statistics Working Paper
2000-03.
101
[66] Brogan, D. 2005. Software for Sample Survey Data, Misuse of Standard Packages. Encyclopedia of
Biostatistics, 2e. Edited by Armitage, P. and Colton, T. Wiley: New York.
[67] Brogan, D. 2005. Sampling Error Estimation for Survey Data. Household Sample Surveys in
Developing and Transition Countries (op. cit., Chapter XXI). With Annex: Illustrative and
Comparative Analyses of the Burundi Immunization Survey using Five Sample Survey Software
Packages. (Retrieved 2014) http://unstats.un.org/unsd/hhsurveys/pdf/Annex_CD-Rom.pdf
[68] Carlson, BL. (2005). Software for Sample Survey Data. Encyclopedia of Biostatistics, 2e. Edited by
Armitage, P. and Colton, T. Wiley: New York.
[69] Dowd, AC; Duggan, MB. 2001 (Retrieved 2014). Computing Variances from Data with Complex
Sampling Designs: A Comparison of Stata and SPSS. North American Stata Users Group.
http://www.stata.com/meeting/1nasug/dowdduggan.pdf
[70] Hahs-Vaughn, DL; McWayne, CM; Bulotsky-Shearer, RJ; Wen, X; Faria, AM. 2011. Complex Sample
Data Recommendations and Troubleshooting. Evaluation Review 35(3):304313.
[71] Laaksonen, S; Ollila, P; Sstra, K; Berger, Y; Boonstra, HJ; Van den Brakel, J; Davison, A; Sardy, S;
Magg, K; Mnnich, R; Ohly, D. 2004 (Retrieved 2014). Evaluation of Software for Variance
Estimation in Complex Surveys. IST-2000-26057-DACSEIS Project, Workpackage 4, Deliverables 4.1
and 4.2. https://www.uni-trier.de/index.php?id=29730
[72] Mitchell, MN. 2007 (Retrieved 2014). Strategically using General Purpose Statistics Packages: A
Look at Stata, SAS and SPSS. UCLA Academic Technology Services Statistical Consulting Group
Technical Report Series 1. http://www2.jura.uni-hamburg.de/instkrim/kriminologie/
Mitarbeiter/Enzmann/Lehre/StatIIKrim/Mitchell_2007.pdf
[73] Oyeyemi, GM; Adewara, AA; Adeyemi, RA. 2010. Complex Survey Data Analysis: A Comparison of
SAS, SPSS and STATA. Asian Journal of Mathematics and Statistics 3(1):3339.
[74] Siller, AB; Tompkins, L. 2005. The Big Four: Analyzing Complex Sample Survey Data Using SAS,
SPSS, STATA, and SUDAAN. Proceedings of the Eighteenth Annual NorthEast SAS Users Group
Conference (NESUG 2005), Poster 3.
[75] Tabladillo, M; Blanton, C. 2007. Data Analysis Software for Complex Sample Designs. Proceedings
of the 2007 National Conference on Tobacco or Health, Poster.

SOFTWARE AM
[76] American Institutes for Research. Retrieved 2014. AM Statistical Software Manual. -
http://am.air.org/help/JSTree/MainFrame.asp

SOFTWARE IVEware
[77] Raghunathan, TE; Lepkowski, JM; Van Hoewyk, J; Solenberger, P. 2001. A Multivariate Technique
for Multiply Imputing Missing Values Using a Sequence of Regression Models. Survey Methodology
27(1):8595.
[78] Survey Research Center Institute for Social Research. 2011 (Retrieved 2014). IVEware Version 0.2
User Guide and Installation Instructions. Survey Methodology Program, University of Michigan.
ftp://ftp.isr.umich.edu/pub/src/smp/ive/ive21_user.pdf
[79] Survey Research Center Institute for Social Research. 2011 (Retrieved 2014). SRCware Version 0.2
User Guide and Installation Instructions. Survey Methodology Program, University of Michigan.
ftp://ftp.isr.umich.edu/pub/src/smp/ive/src2_user.pdf

102
SOFTWARE R
[80] Chandra, H. Retrieved 2014. Introduction to Survey Data Analysis through Statistical Packages.
Indian Agricultural Statistics Research Institute. http://www.iasri.res.in/ebook/TEFCPI_sampling/
INTRODUCTION TO SURVEY DATA ANALYSIS THROUGH STATISTICAL PACKAGES.pdf
[81] Damico, A. 2009. Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in
Health Policy Data. The R Journal 1(2):37-44.
[82] Lumley, T. 2004. Analysis of Complex Survey Samples. Journal of Statistical Software 9(1):119.
[83] Lumley, T. 2008. Analysis of Complex Samples in R. The Survey Statistician 57(1):2025.
[84] Lumley, T. 2010. Complex Surveys: A Guide to Analysis Using R. Wiley: New York.
[85] Lumley, T. 2013 (Retrieved 2014). Describing PPS Designs to R. Survey Package Vignette.
http://cran.r-project.org/web/packages/survey/vignettes/pps.pdf

SOFTWARE SAS
[86] An, A; Watts, D. 2000. SAS Procedures for Analysis of Sample Survey Data. Proceedings of the
Section on Survey Research Methods, American Statistical Association, 120-129.
[87] Berglund, PA. 2010. An Introduction to Multiple Imputation of Complex Sample Data using SAS
v9.2. Proceedings of the 2010 SAS Global Forum, Paper 265-2010.
[88] Cassell, D. 2010. BootstrapMania!: Re-sampling the SAS Way. Proceedings of the 2010 SAS Global
Forum, Paper 268-2010.
[89] Lewis, TH. 2010. Principles of Proper Inferences from Complex Survey Data. Proceedings of the
2010 SAS Global Forum, Paper 266-2010.
[90] SAS Institute Inc. 1999. Introduction to Survey Sampling and Analysis Procedures. SAS/STAT 8
Users Guide, Chapter 11.
[91] SAS Institute Inc. 2008. Introduction to Survey Sampling and Analysis Procedures. SAS/STAT 9.2
Users Guide, Chapter 14.
[92] SAS Institute Inc. Retrieved 2014. SAS/STAT 9.2 User's Guide, 2e.
http://support.sas.com/documentation/
[93] Wang, Z; Waldron, WR. 2010. Using the SAS Survey Procedures for Subpopulation Analysis with
Jackknife Repeated Replication Methods in SAS 9.2. Proceedings of the 2010 SAS Global Forum,
Paper 267-2010.

SOFTWARE SPSS
[94] IBM Corporation. 2011. IBM SPSS Complex Samples: Correctly Compute Complex Samples
Statistics. IBM Software Business Analytics.
[95] IBM Corporation. 2013. IBM SPSS Complex Samples 22. IBM SPSS Version 22 User Manuals.

SOFTWARE STATA
[96] Acock, AC. 2010. A Gentle Introduction to Stata, 3e. Stata Press: College Station, TX.
[97] Hamilton, L. C. 2012. Statistics with Stata: Updated for Version 12. Cengage: Boston.
[98] Kolenikov, S. 2010. Resampling Variance Estimation for Complex Survey Data. The Stata Journal
10(2):165199.
[99] Kreuter, F; Valliant, R. 2007. A Survey on Survey Statistics: What Is Done and Can Be Done in Stata.
The Stata Journal 7(1):121.
[100] Leidi, S; Stern, R; McDermott, B; Abeyasekera, S; Palmer, A. 2013 (Retrieved 2014). STATA 10 for
Surveys Manual. University of Reading. http://www.personal.reading.ac.uk/~sns97aal/
stata4surveys/STATA10_for_surveys_manual_part1.pdf (also _part2.pdf)

103
[101] Pfaff, T. 2009 (Retrieved 2014). A Brief Introduction to Stata with 50+ Basic Commands.
http://www.wiwi.uni-muenster.de/ioeb/Downloads/Forschen/Pfaff/
Introduction_to_Stata_with_50+_Basic_Commands.pdf
[102] Pitblado, J. 2009 (Retrieved 2014). Survey Data Analysis in Stata. Canadian Stata Users Group
Meeting. http://www.stata.com/meeting/dcconf09/dc09_pitblado_svy.pdf
[103] Rabe-Hesketh, S; Everitt, B. 2004. A Handbook of Statistical Analyses using Stata, 3e. CRC: Boca
Raton, FL.
[104] StataCorp. 2013. Stata Survey Data Reference Manual: Release 13. Stata Press: College Station, TX.

SOFTWARE SUDAAN
[105] Bieler, GS; Williams, RL. 1997. Analyzing Survey Data Using SUDAAN Release 7.5. Research Triangle
Institute: Research Triangle Park, NC.
[106] RTI International. Retrieved 2014. Online Help Manual for SUDAAN 10.
http://www.rti.org/Sudaan/onlinehelp/SUDAAN10/Default.htm
[107] RTI International. Retrieved 2014. SUDAAN Design Options.
http://www.rti.org/sudaan/page.cfm/SUDAAN_Design_Options
[108] RTI International. Retrieved 2014. SUDAAN 11 Examples. http://www.rti.org/sudaan/page.cfm/
SUDAAN_Eleven_Examples
[109] RTI International. Retrieved 2014. SUDAAN Technical Assistance: Design Statement Examples.
http://www.rti.org/sudaan/pdf_files/sudaanDesignExamples.pdf

SOFTWARE WESVAR
[110] Westat. 2007. WesVar 4.3 User's Guide. Westat: Rockville, MD.
[111] Westat. Retrieved 2014. Addendum to the WesVar Users Guide: New Features in WesVar 5.1.
http://www.westat.com/Westat/pdf/wesvar/addendum_users_guide.pdf

104

(eBook PDF) Programming Language Pragmatics, 4th Editionpdf download
100% (4)
(eBook PDF) Programming Language Pragmatics, 4th Editionpdf download
58 pages
Lesson 11 Multiple Linear Regression
No ratings yet
Lesson 11 Multiple Linear Regression
35 pages
Global Positioning System
No ratings yet
Global Positioning System
52 pages
Adaline/Madaline:Applications
100% (1)
Adaline/Madaline:Applications
25 pages
Sensor Fusion of Differential Gps and Inertial Measuring Unit To Measure State of A Test Vehicle
No ratings yet
Sensor Fusion of Differential Gps and Inertial Measuring Unit To Measure State of A Test Vehicle
216 pages
Millimeter Wave Radar Design Considerations
No ratings yet
Millimeter Wave Radar Design Considerations
4 pages
Bolometers For Submm/mm-Wave Astronomy
No ratings yet
Bolometers For Submm/mm-Wave Astronomy
35 pages
Study and Prioritizing Effective Factors On Human Resource Productivity by Achieve Model and Topsis Method
No ratings yet
Study and Prioritizing Effective Factors On Human Resource Productivity by Achieve Model and Topsis Method
10 pages
FAR IR Instrumentation
No ratings yet
FAR IR Instrumentation
11 pages
Create High Speed Ping Scanning Script With Python
0% (1)
Create High Speed Ping Scanning Script With Python
17 pages
(1912) Guide To Central Experimental Farm: Ottawa, Ontario, Canada
100% (1)
(1912) Guide To Central Experimental Farm: Ottawa, Ontario, Canada
72 pages
Detection and High Resolution of Vehicles at Hypersonic Velocities
100% (1)
Detection and High Resolution of Vehicles at Hypersonic Velocities
46 pages
Microbolometer Detectors For Passive Millimeter-Wave Imaging
No ratings yet
Microbolometer Detectors For Passive Millimeter-Wave Imaging
43 pages
Raspberry Pi: Create A Voice Kit
No ratings yet
Raspberry Pi: Create A Voice Kit
76 pages
Getting Started With Pico
No ratings yet
Getting Started With Pico
70 pages
Criteria For Classifying Forecasting Me - 2020 - International Journal of Foreca PDF
No ratings yet
Criteria For Classifying Forecasting Me - 2020 - International Journal of Foreca PDF
11 pages
GPS World - September 2015
No ratings yet
GPS World - September 2015
61 pages
Geometric Dilution of Precision Computation
No ratings yet
Geometric Dilution of Precision Computation
25 pages
Great Big Natural Language Processing Primer KDnuggets
No ratings yet
Great Big Natural Language Processing Primer KDnuggets
25 pages
William R. Bell, Scott H. Holan, Tucker S. McElroy - Economic Time Series - Modeling and Seasonality-Chapman and Hall - CRC (2012)
No ratings yet
William R. Bell, Scott H. Holan, Tucker S. McElroy - Economic Time Series - Modeling and Seasonality-Chapman and Hall - CRC (2012)
544 pages
Financial Time Series
No ratings yet
Financial Time Series
34 pages
Object Detection and Avoidance in Unmanned Ground Vehicle Using Arduino1
No ratings yet
Object Detection and Avoidance in Unmanned Ground Vehicle Using Arduino1
4 pages
Forcasting NHL Success
No ratings yet
Forcasting NHL Success
12 pages
L12 Bayesian Network
No ratings yet
L12 Bayesian Network
35 pages
Orbital Mechanics For Engineering Students Fourth Edition. Edition Howard D. Curtis - Ebook PDF All Chapter Instant Download
100% (3)
Orbital Mechanics For Engineering Students Fourth Edition. Edition Howard D. Curtis - Ebook PDF All Chapter Instant Download
51 pages
Emitter-Detection DTIC ADA471571
No ratings yet
Emitter-Detection DTIC ADA471571
116 pages
Optimal Integration of GPS With Inertial Sensors
No ratings yet
Optimal Integration of GPS With Inertial Sensors
161 pages
Journal of Statistical Software: Regularization Paths For Generalized Linear Models Via Coordinate Descent
No ratings yet
Journal of Statistical Software: Regularization Paths For Generalized Linear Models Via Coordinate Descent
22 pages
Karanja Evanson Mwangi Cit Masters Report Libre PDF
No ratings yet
Karanja Evanson Mwangi Cit Masters Report Libre PDF
136 pages
The Benefits of Balance Training
No ratings yet
The Benefits of Balance Training
3 pages
Deep Feature Learning and Classification of Remote Sensing Images
No ratings yet
Deep Feature Learning and Classification of Remote Sensing Images
19 pages
A Step by Step Backpropagation Example - Matt Mazur
No ratings yet
A Step by Step Backpropagation Example - Matt Mazur
7 pages
A Systematic Review For Transformer-Based Long-Term Series Forecasting
No ratings yet
A Systematic Review For Transformer-Based Long-Term Series Forecasting
30 pages
Complete Supervised Machine Learning For Text Analysis in R 1st Edition Emil Hvitfeldt Julia Silge PDF For All Chapters
100% (1)
Complete Supervised Machine Learning For Text Analysis in R 1st Edition Emil Hvitfeldt Julia Silge PDF For All Chapters
59 pages
Sequential Analysis Hypothesis Testing and Changepoint Detection ( Etc.) (Z-Library)
No ratings yet
Sequential Analysis Hypothesis Testing and Changepoint Detection ( Etc.) (Z-Library)
600 pages
Design of An Airfoil by Mathematical Modelling Using Database
No ratings yet
Design of An Airfoil by Mathematical Modelling Using Database
8 pages
Introduction to Integral Calculus Systematic Studies with Engineering Applications for Beginners 1st Edition Ulrich L. Rohde - Read the ebook online or download it as you prefer
No ratings yet
Introduction to Integral Calculus Systematic Studies with Engineering Applications for Beginners 1st Edition Ulrich L. Rohde - Read the ebook online or download it as you prefer
52 pages
Kalman
No ratings yet
Kalman
36 pages
TDOA Equation
100% (1)
TDOA Equation
27 pages
Frozenonthe Rates
No ratings yet
Frozenonthe Rates
8 pages
A Survey On TOA Based Wireless Localization and NLOS Mitigation Techniques
No ratings yet
A Survey On TOA Based Wireless Localization and NLOS Mitigation Techniques
18 pages
One-Sample T-Test
No ratings yet
One-Sample T-Test
9 pages
Adafruit Ultimate Gps PDF
No ratings yet
Adafruit Ultimate Gps PDF
52 pages
Trainner XBEE DIGI March 2017
No ratings yet
Trainner XBEE DIGI March 2017
221 pages
Introduction To Machine Learning
No ratings yet
Introduction To Machine Learning
209 pages
Working With Sensor
No ratings yet
Working With Sensor
46 pages
Recommended Reading For Time Series Analysis
No ratings yet
Recommended Reading For Time Series Analysis
2 pages
Machine Learning
No ratings yet
Machine Learning
2 pages
BMA150
No ratings yet
BMA150
56 pages
Stan Reference 2.7.0
No ratings yet
Stan Reference 2.7.0
534 pages
Time Series Forecasting
No ratings yet
Time Series Forecasting
59 pages
Quick Refmaerence Guide Matlab
No ratings yet
Quick Refmaerence Guide Matlab
2 pages
Rojas-Time series analysis and forecasting-Book16
No ratings yet
Rojas-Time series analysis and forecasting-Book16
384 pages
Crossformer - Transformer Utilizing Cross-Dimension Dependency For Multivariate Time Series Forecasting
No ratings yet
Crossformer - Transformer Utilizing Cross-Dimension Dependency For Multivariate Time Series Forecasting
21 pages
Analysis Complex Samples 131108
No ratings yet
Analysis Complex Samples 131108
31 pages
Comparison Python, R, SAS
No ratings yet
Comparison Python, R, SAS
6 pages
List of Statistical Packages
No ratings yet
List of Statistical Packages
9 pages
StatisticalSoftwareTodayandTomorrow
No ratings yet
StatisticalSoftwareTodayandTomorrow
17 pages
Spss
No ratings yet
Spss
35 pages
Comparison Table
No ratings yet
Comparison Table
8 pages
Statistical Analysis Overview
No ratings yet
Statistical Analysis Overview
9 pages
Auditor Changes and Discretionary Accruals DeFond and Subramanyam 1998
No ratings yet
Auditor Changes and Discretionary Accruals DeFond and Subramanyam 1998
33 pages
随机抽样 vs 随机分配
100% (1)
随机抽样 vs 随机分配
11 pages
02 Permutation Test
No ratings yet
02 Permutation Test
15 pages
The Normal Distribution: Armando A. Camana JR., Maed
No ratings yet
The Normal Distribution: Armando A. Camana JR., Maed
23 pages
BSSEII
No ratings yet
BSSEII
12 pages
Naive Bayes Classifier
100% (1)
Naive Bayes Classifier
4 pages
3 Sampling and Data Gathering Techniques
No ratings yet
3 Sampling and Data Gathering Techniques
38 pages
Relationship Between Self-efficacy and Language Proficiency- A Meta-Analysis
No ratings yet
Relationship Between Self-efficacy and Language Proficiency- A Meta-Analysis
11 pages
STAT 221 Mid Term (2) Midterm
No ratings yet
STAT 221 Mid Term (2) Midterm
4 pages
Chapter 10 One Sample Tests of Hypothesis
No ratings yet
Chapter 10 One Sample Tests of Hypothesis
36 pages
BM2 Chapter 5 Forecasting
No ratings yet
BM2 Chapter 5 Forecasting
21 pages
Interval Estimation
No ratings yet
Interval Estimation
46 pages
Bsm201 Model Questions
No ratings yet
Bsm201 Model Questions
7 pages
AML Winter 2021 Solution
No ratings yet
AML Winter 2021 Solution
6 pages
Business STAT 2 Class Lectures
No ratings yet
Business STAT 2 Class Lectures
15 pages
Proiect Econometrie
No ratings yet
Proiect Econometrie
15 pages
Annova
No ratings yet
Annova
4 pages
Statistical Significance Versus Clinical Relevance
No ratings yet
Statistical Significance Versus Clinical Relevance
38 pages
Unit 4 Regression analysis
No ratings yet
Unit 4 Regression analysis
28 pages
S.No. Experiment Date: Index
No ratings yet
S.No. Experiment Date: Index
24 pages
Statistics and Probability2021 - Quarter 3 2
No ratings yet
Statistics and Probability2021 - Quarter 3 2
38 pages
Logistic Regression Notes
No ratings yet
Logistic Regression Notes
49 pages
Pengaruh Corporate Governance, Bonus Plan, Dan Firm Size Terhadap Manajemen Laba
No ratings yet
Pengaruh Corporate Governance, Bonus Plan, Dan Firm Size Terhadap Manajemen Laba
13 pages
WEEK 7 Handout
No ratings yet
WEEK 7 Handout
5 pages
Hypothesis Assignment Final
No ratings yet
Hypothesis Assignment Final
1 page
Measures of Central Tendency
No ratings yet
Measures of Central Tendency
11 pages
Report in Resmeth
No ratings yet
Report in Resmeth
37 pages
Midterm Exam
0% (1)
Midterm Exam
2 pages
Output Spss Statistika 1581
No ratings yet
Output Spss Statistika 1581
15 pages