Innovative Statistical Methods for Public Health Data
ICSA Book Series in Statistics
Series Editors: Jiahua Chen, Department of Statistics, University of British Columbia, Vancouver, Canada
Editors
Ding-Geng (Din) Chen, School of Social Work, University of North Carolina, Chapel Hill, NC, USA
Jeffrey Wilson, Arizona State University, Tempe, AZ, USA

Preface
This book originated when we became part of the leadership of the
Applied Public Health Statistics Section (https://www.apha.org/apha-communities/
member-sections/applied-public-health-statistics) in the American Public Health
Association. Professor Chen was the Chair-Elect (2012), Chair (2013), and Past-
Chair (2014) while Professor Wilson was the Chair-Elect (2011), Chair (2012) and
Past-Chair (2013). In addition, Professor Wilson has been the Chair of the Editorial
Board of the American Journal of Public Health for the past 3 years and a member
for 5 years. He has been a reviewer for the journal and a contributor to Statistically
Speaking.
During our leadership of the Statistics Section, we also served as APHA Program
Planners for the Section in the annual meetings by organizing abstracts and scientific
sessions as well as supporting the student paper competition. During our tenure, we
gained a close-up view of the expertise and the knowledge of statistical principles and methods that need to be disseminated to aid development and growth in the area of Public Health. We became convinced that this need could best be met through the compilation of a book on public health statistics.
This book is a compilation of present and new developments in statistical
methods and their applications to public health research. The data and computer
programs used in this book are publicly available, so readers have the opportunity to replicate the model development and data analyses presented in each chapter. This facilitates learning and supports ease of computation, so that these new methods can be readily applied.
The book strives to bring together experts engaged in public health statistical
methodology to present and discuss recent issues in statistical methodological
development and their applications. The book is timely and has high potential to
impact model development and data analysis of public health research across a wide
spectrum of the discipline. We expect the book to foster the use of these novel ideas
in healthcare and public health research.
The book consists of 15 chapters which we present in three parts. Part I consists
of methods to model clustered data; Part II consists of methods to model incomplete
or missing data; while Part III consists of other healthcare research models.
and illustrate the importance of disentangling the structural and sampling zeros in
alcohol research using simulated as well as real study data.
Chapter 7: Modeling Based on Progressively Type-I Interval Censored Sample.
In this chapter, several parametric modeling procedures (including model selection) are presented using maximum likelihood estimation, the method of moments, probability plot estimation, and Bayesian estimation. In addition, the general data structure and a simulation procedure for generating progressively type-I interval censored samples are presented.
Chapter 8: Techniques for Analyzing Incomplete Data in Public Health Research.
This chapter deals with the causes and problems created by incomplete data and
recommends techniques for handling such data through multiple imputation.
Chapter 9: A Continuous Latent Factor Model for Non-ignorable Missing Data.
This chapter presents a continuous latent factor model as a novel approach to
overcome limitations which exist in pattern mixture models through the speci-
fication of a continuous latent factor. The advantages of this model, including
small sample feasibility, are demonstrated by comparing with Roy’s pattern mixture
model using an application to a clinical study of AIDS patients with advanced
immune suppression.
In Part III, we present a series of Healthcare Research Models which consists
of six chapters. Chapter 10: Health Surveillance. This chapter deals with the
application of statistical methods for health surveillance, including those for health
care quality monitoring and those for disease surveillance. The methods rely on
techniques borrowed from the industrial quality control and monitoring literature.
However, distinctions from the industrial setting are made when necessary and taken into account in these methods.
Chapter 11: Standardization and Decomposition Analysis: A Useful Analytical
Method for Outcome Difference, Inequality and Disparity Studies. This chapter
deals with a traditional demographic analytical method that is widely used for comparing rates between populations that differ in composition. The results can be readily applied to cross-sectional outcome comparisons as well as longitudinal studies. While standardization and decomposition analysis (SDA) does not rely on traditional assumptions, it lacks statistical significance testing. This chapter presents techniques for such significance testing.
Chapter 12: Cusp Catastrophe Modeling in Medical and Health Research. This
chapter presents the cusp catastrophe modeling method, including the general
principle and two analytical approaches to statistically solving the model for actual
data analysis: (1) the polynomial regression method for longitudinal data and (2) the
likelihood estimation method for cross-sectional data. A dedicated R package, “cusp”, is presented for the likelihood-based data analysis.
Chapter 13: On Ranked Set Sampling Variation and Its Applications to Public
Health Research. This chapter presents ranked set sampling (RSS) as a cost-effective alternative to traditional sampling schemes. In RSS, the desired information is obtained from a small fraction of the available units, which improves the precision of estimation.
Chapter 14: Weighted Multiple Testing Correction for Correlated Endpoints in
Survival Data. In this chapter, a weighted multiple testing correction method for
Methods for Analyzing Secondary Outcomes in Public Health Case–Control Studies
1 Introduction
Case–control studies are very common in public health research, where conducting a
designed experiment is not feasible. This may happen, for example, if assigning sub-
jects to certain treatments is unethical. Particularly in these situations, researchers
have to rely on observational studies, in which groups having different outcomes are
identified and compared. The example used in this chapter is a typical one—subjects
are classified based on an outcome, in this instance a (binary) cancer status, and the
goal of the case–control analysis is to identify factors that are associated with this
outcome. In the context of our example, the ‘controls’ are the cancer-free subjects
and the ‘cases’ are the cancer patients.
While the main focus in such studies is on the primary outcome (e.g., cancer
status), there is substantial interest in taking advantage of existing large case–control studies to identify whether any exposure variables are associated with secondary outcomes
that are often collected in these studies. In particular, secondary outcome analyses
are now becoming popular in genetic epidemiology, where the interest is in studying
associations between genetic variants and human quantitative traits using data
collected from case–control studies of complex diseases (Lettre et al. 2008; Sanna
et al. 2008; Monsees et al. 2009; Schifano et al. 2013). For example, in the lung
cancer genome-wide association study conducted in Massachusetts General Hospi-
tal (MGH), several continuous measures of smoking behavior were collected from
both cases and controls (Schifano et al. 2013). In a secondary outcome analysis, one
is interested in identifying genetic variants (single nucleotide polymorphisms, or
SNPs) that are associated with smoking behavior while accounting for case–control
ascertainment. Since the subjects from the case–control study were sampled based
on lung cancer status (primary disease outcome), careful attention is warranted for
inferences regarding the secondary smoking outcomes, because the case–control
sample might not represent a random sample from the population as the cases have
been over-sampled. Consequently, the population association of the genetic variant
with a secondary outcome can be distorted in a case–control sample, and analysis
methods that ignore or improperly account for this sampling mechanism can lead to
biased estimates of the effect of the genetic variant on the secondary outcome.
One of the most common and simple approaches for analyzing secondary
quantitative traits involves performing the analysis on the control subjects only.
This strategy is appropriate only when the disease is rare, in which case the control
sample can be considered an approximation to a random sample from the general
population. Because the information from the cases is totally ignored, however, this
method is generally inefficient. Alternatively, one may attempt to analyze cases and
controls separately, or treat case–control status as a predictor in the regression model
of the secondary outcome. However, each of these methods may result in erroneous
conclusions, as the association between a secondary outcome and an exposure
variable in the case and control samples can be different from the association in
the underlying population.
Other analysis methods have been proposed to explicitly account for case–control
ascertainment and eliminate sampling bias. To study the effect of exposure (such
as a genetic variant) on a secondary outcome, Lin and Zeng (2009) proposed a
likelihood-based approach, reflecting case–control sampling by using a likelihood
conditional on disease status. Extensions and generalizations of the Lin and Zeng
(2009) likelihood framework may be found in Wei et al. (2013) and Ghosh et al.
(2013). Tchetgen Tchetgen (2014) proposed a general model based on a careful
re-parameterization of the conditional model for the secondary outcome given the
case–control outcome and regression covariates that relies on fewer model assump-
tions. In this chapter, we focus on comparing two popular methods, namely, inverse
probability weighting (IPW) (Robins et al. 1994) and propensity score matching
(PSM) (Rosenbaum and Rubin 1983). These two approaches can be thought of
more generally as design-based and model-based approaches, respectively. In the
design-based approach, one obtains the probability of selecting a subject from a
certain cohort from the sampling distribution of the primary outcome, while in the
model-based approach, one obtains estimates for the selection probability using a
statistical model. In the model-based approach one fits a logistic regression model
in which the response is the primary outcome, and the log-odds are assumed to be
a linear combination of several explanatory variables. The two approaches differ in
the way that the probability that a subject is included in the study is computed, and
in the way these probabilities are used in the analysis of the secondary outcome, but
both aim to compensate for the over-sampling of one group (i.e., the cases).
We use a subset of the MGH lung cancer case–control data to illustrate the two
methods for analyzing secondary outcomes. In Sect. 2 we describe the data in more
detail, while in Sects. 3 and 4 we describe the various design-based and model-
based techniques for analyzing such data. We demonstrate these methods in Sect. 5
and conclude with a brief discussion in Sect. 6.
Fig. 1 The distribution of CPD (average number of cigarettes per day) and √CPD
the secondary outcome (√CPD), X represents the exposure variable (SNP), D is a binary disease indicator (D = 1 for cases, D = 0 for controls), and S is a binary sampling indicator (S = 1 if sampled in the case–control data set, S = 0 if not sampled in the case–control data set). We are interested in detecting the dashed association between X and Y (the SNP and the smoking habits), but both X and Y may be associated with disease status D, which, in turn, influences who gets sampled in the case–control data set (S = 1). In general, if Y is associated with D, then the simple analysis techniques of analyzing cases and controls separately, or treating case–control status as a predictor in the regression model of the secondary outcome, will yield invalid results (Lin and Zeng 2009). Since smoking is associated with lung cancer, one must take the sampling mechanism into account.
The idea behind IPW is as follows. Consider a sample of n subjects and the linear model $y_i = x_i^T\beta + \epsilon_i$, $i = 1, \ldots, n$, where $\beta$ is a vector of regression coefficients and the $\epsilon_i$ are independent random errors. In matrix form, we write $Y = X\beta + \epsilon$, and we assume $E[Y \mid X] = X\beta$ in the population. In general, we can estimate the coefficients $\beta$ by solving an estimating equation such as

$$\frac{1}{n}\sum_{i=1}^{n} u_i(\beta) = 0\,.$$

If $E[u_i(\beta)] = 0$ for the true $\beta$, then the estimating equation is unbiased for estimating $\beta$. For instance, in a randomly selected cohort of size n from the population, the estimating equation

$$\frac{1}{n}\sum_{i=1}^{n} x_i\,(y_i - x_i^T\beta) = 0$$

is unbiased. In the case–control sample, the corresponding estimating equation is

$$\frac{1}{n}\sum_{i=1}^{n} s_i\, x_i\,(y_i - x_i^T\beta) = 0 \qquad (1)$$

which includes the sampling indicator $s_i$ for each subject. Clearly $s_i = 1$ for all n subjects in the case–control sample. Using the law of iterated expectations, it can be shown that (1) is not unbiased in general. However, by including the IPW $w_i = 1/\Pr(s_i = 1 \mid d_i)$, the estimating equation

$$\frac{1}{n}\sum_{i=1}^{n} \frac{s_i}{\Pr(s_i = 1 \mid d_i)}\; x_i\,(y_i - x_i^T\beta) = 0$$

is unbiased provided that the probability of selection into the study depends solely on the disease status. Note that weighting the observations in this manner can be inefficient (Robins et al. 1994). However, the simplicity of the model makes it amenable to extensions which can gain power in other ways (e.g., in the joint analysis of multiple secondary outcomes (Schifano et al. 2013)).
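To make the weighting concrete, solving the last estimating equation is equivalent to a weighted least squares fit with weights $s_i/\Pr(s_i = 1 \mid d_i)$. The following R sketch illustrates this on simulated data; the selection probabilities and all variable names are assumptions for illustration, not part of the chapter's analysis.

```r
# Minimal sketch (simulated data): solving the IPW estimating equation
#   (1/n) sum_i [ s_i / Pr(s_i = 1 | d_i) ] x_i (y_i - x_i' beta) = 0
# is the same as weighted least squares with weights w_i = s_i / Pr(s_i = 1 | d_i).
set.seed(1)
n     <- 500
x     <- cbind(1, rnorm(n))               # design matrix with intercept
y     <- x %*% c(1, 0.5) + rnorm(n)       # secondary outcome
d     <- rbinom(n, 1, 0.2)                # disease status (hypothetical)
p_sel <- ifelse(d == 1, 0.9, 0.3)         # assumed selection probabilities Pr(s = 1 | d)
s     <- rbinom(n, 1, p_sel)              # sampling indicator
w     <- s / p_sel                        # inverse-probability weights (0 if not sampled)

# Closed-form weighted least squares solution: (X'WX)^{-1} X'Wy
W        <- diag(w)
beta_hat <- solve(t(x) %*% W %*% x, t(x) %*% W %*% y)
beta_hat

# The same estimate via lm() with case weights
coef(lm(y ~ x[, 2], weights = w))
```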
In case–control studies the ‘treatment’ (disease status) and the inclusion in the study
are not independent. Consequently, covariates that are associated with the disease
might also be associated with the probability (propensity) that a subject is selected
for the study. This is in contrast to randomized trials, where the randomization
implies that each covariate has the same distribution in both treatment and control
(for large samples). This balance is key to unbiased estimation of covariate effects
in controlled experiments. Thus, in observational studies, the estimates of effect for
covariates which are not balanced across treatments are likely to be biased, and may
lead to incorrect conclusions.
PSM (Rosenbaum and Rubin 1983; Abadie and Imbens 2006) is a method which
attempts to achieve balance in covariates across treatments by matching cases to
similar controls, in the sense that the two have similar covariate values. More
formally, recall our previous notation where D is the disease indicator, and X is a set
of predictors which may affect $\Pr(D = 1)$ and the secondary outcome, Y. Denote the potential outcome by Y(d), so that Y(1) is the outcome for cases and Y(0) is the outcome for controls. Of course, we only observe one of Y(0) or Y(1). Denote the propensity score by $p(X) = \Pr(D = 1 \mid X)$. Rosenbaum and Rubin (1983) introduced the assumption of 'strong ignorability,' which means that given the covariates, X, the potential outcomes Y(0) and Y(1) are independent of the treatment. More formally, strong ignorability is satisfied if the following conditions hold:

$$D \perp\!\!\!\perp \big(Y(0),\, Y(1)\big) \mid X \quad \text{almost surely;}$$
$$p(D = 1 \mid X = x) \in [p_L,\, p_U] \quad \text{almost surely, for some } p_L > 0 \text{ and } p_U < 1.$$
3. Check that the data set which consists of matched pairs is balanced for each
covariate.
4. Perform the multivariate statistical analysis on the matched data set.
In the context of our example, in the last step we fit a regression model in which √CPD is the response variable, but instead of using the complete data set, we use a smaller data set which consists of matched pairs.
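As a minimal sketch of these steps (not the chapter's own code), the matching and the matched-pairs regression could be carried out with the MatchIt package of Ho et al. (2011), cited in the references; the data frame dat and its column names (case, rs1051730, cpd_sqrt, and so on) are hypothetical placeholders.

```r
# Minimal sketch, assuming a data frame `dat` with hypothetical columns:
# case (0/1 disease status), rs1051730, age, sex, education, pc1, pc2, pc3, cpd_sqrt.
library(MatchIt)

# Estimate propensity scores and match each case to a similar control
m.out <- matchit(case ~ rs1051730 + age + sex + education + pc1 + pc2 + pc3,
                 data = dat, method = "nearest")

# Step 3: check that the matched data set is balanced for each covariate
summary(m.out)

# Step 4: multivariate analysis of the secondary outcome on the matched pairs only
matched <- match.data(m.out)
fit_psm <- lm(cpd_sqrt ~ rs1051730 + age + sex + education + pc1 + pc2 + pc3,
              data = matched)
summary(fit_psm)
```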
5 Data Analysis
Recall from Sect. 2 that we are interested in the effect of a genetic variant (rs1051730) on smoking habits (√CPD). First, we fit a logistic regression model for the primary outcome variable (cancer status). The results, summarized in Table 1, suggest that the risk for lung cancer increases as the average number of
cigarettes smoked daily increases. The risk also increases with age, and for subjects
who have a college degree. Additionally, the risk for lung cancer decreases for
increasing numbers of the C allele (or increases for increasing numbers of the T
allele).
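For concreteness, a logistic regression of this form might be fit in R as in the sketch below; the data frame dat and the variable names are hypothetical placeholders (not the MGH variable names), and the covariate list is assumed from the surrounding discussion of Table 1.

```r
# Sketch of the primary-outcome model (hypothetical variable names):
# logistic regression of lung cancer status on smoking, the SNP, and covariates.
fit_primary <- glm(case ~ cpd_sqrt + rs1051730 + age + sex + education +
                     pc1 + pc2 + pc3,
                   family = binomial, data = dat)
summary(fit_primary)    # log-odds coefficients, analogous to Table 1
exp(coef(fit_primary))  # odds ratios
```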
Since both the genetic variant and smoking behavior are associated with the
primary disease outcome, we need to account for selection bias when testing the
association between the genetic variant and the secondary smoking outcome. Let us
first check if such an association exists between the genetic variant and the secondary outcome without accounting for the potential selection bias problem.
We fit a linear regression model with √CPD as the response variable. The results
are summarized in Table 2, and they suggest that the genetic variant is, indeed,
associated with smoking behavior. It would appear that the same variant which is
strongly associated with lung cancer is also associated with heavy smoking, and
that may be, at least in part, why those individuals end up with the disease. In the
following subsections we describe how we use the IPW and PSM methods to test whether √CPD is associated with this genetic variant. We show that when selection
bias is accounted for, this genetic variant is not associated with the smoking variable.
One way to account for selection bias is to weight observations based on the
probability that a subject is sampled in the case–control data set. As described in
Monsees et al. (2009), we use the function geeglm in the R (R Core Team 2014)
package geepack (Højsgaard and Halekoh 2005) in order to perform the IPW
analysis. For $n_1$ case subjects and $n_0$ control subjects ($n_0 + n_1 = n$), the weight to be applied to cases is $1/\Pr(S = 1 \mid D = 1) \propto \pi/p_{n_1}$, where $\pi = 0.000745$ is the disease prevalence (VanderWeele et al. 2012) and $p_{n_1} = n_1/(n_0 + n_1)$ is the proportion of cases in the data set, and the weight for controls is $1/\Pr(S = 1 \mid D = 0) \propto (1 - \pi)/(1 - p_{n_1})$. Table 3 contains the results of the IPW analysis. The effect of genetic variant rs1051730 on √CPD, after adjusting for covariates in the IPW model, is 0.1229 (z = 2.65, p = 0.103). Only the sex and education variables are associated with the smoking behavior at the 0.05 level when applying the IPW. Note that in this data set, the weights are proportional to $(1 - \pi)/(1 - p_{n_1}) = 1.92407$ for the controls and $\pi/p_{n_1} = 0.00155$ for the cases. This illustrates how the IPW procedure down-weights the cases and up-weights the controls to reflect the true population structure.
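A compact sketch of how such an IPW fit could be set up with geeglm, using the same hypothetical data frame and variable names as in the earlier sketches; only the prevalence value 0.000745 comes from the text, everything else is an assumption.

```r
# Sketch of the IPW analysis with geepack::geeglm (hypothetical variable names).
# Cases are weighted by pi/p_n1 and controls by (1 - pi)/(1 - p_n1), as described above.
library(geepack)

pi_prev <- 0.000745                      # disease prevalence used in the text
p_n1    <- mean(dat$case)                # proportion of cases in the data set
dat$ipw <- ifelse(dat$case == 1,
                  pi_prev / p_n1,
                  (1 - pi_prev) / (1 - p_n1))
dat$id  <- seq_len(nrow(dat))            # one cluster per subject

fit_ipw <- geeglm(cpd_sqrt ~ rs1051730 + age + sex + education + pc1 + pc2 + pc3,
                  id = id, weights = ipw, data = dat, family = gaussian)
summary(fit_ipw)                         # robust (sandwich) standard errors
```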
Table 5 A summary of the differences between the case and control groups, using the original data set

Variable       Means treated  Means control  SD control  Mean diff.  eQQ Med  eQQ Mean  eQQ Max
Distance       0.53           0.43           0.15        0.10        0.10     0.10      0.11
rs1051730      1.12           1.29           0.68        0.16        0.00     0.16      1.00
Age            65.19          59.21          11.56       5.98        5.95     6.02      14.61
Sex = 0        0.54           0.47           0.50        0.07        0.00     0.07      1.00
Sex = 1        0.46           0.53           0.50        0.07        0.00     0.07      1.00
Education = 1  0.07           0.03           0.16        0.05        0.00     0.05      1.00
pc1            0.00           0.00           0.02        0.00        0.00     0.00      0.02
pc2            0.00           0.00           0.02        0.00        0.00     0.00      0.02
pc3            0.00           0.00           0.02        0.00        0.00     0.00      0.01

The eQQ Med, Mean, and Max columns provide the corresponding summary statistics of the empirical quantile–quantile function of Treated vs. Control
Table 6 A summary of the differences between the case and control groups, using the matched-pairs data set, after applying the propensity score matching method to adjust for selection bias

Variable       Means treated  Means control  SD control  Mean diff.  eQQ Med  eQQ Mean  eQQ Max
Distance       0.49           0.49           0.13        0.00        0.00     0.00      0.00
rs1051730      1.22           1.21           0.69        0.01        0.00     0.02      1.00
Age            62.77          62.89          9.93        0.13        0.62     0.68      2.63
Sex = 0        0.55           0.51           0.50        0.03        0.00     0.03      1.00
Sex = 1        0.45           0.49           0.50        0.03        0.00     0.03      1.00
Education = 1  0.04           0.03           0.18        0.00        0.00     0.00      1.00
pc1            0.00           0.00           0.02        0.00        0.00     0.00      0.01
pc2            0.00           0.00           0.02        0.00        0.00     0.00      0.01
pc3            0.00           0.00           0.02        0.00        0.00     0.00      0.01

The eQQ Med, Mean, and Max columns provide the corresponding summary statistics of the empirical quantile–quantile function of Treated vs. Control
covariates that were found to be strongly associated with the √CPD variable using the IPW method, the PSM method also identifies a significant association between the first two population substructure principal components and √CPD.
To check that the PSM algorithm achieved the desired balance, consider Tables 5
and 6, and notice that for all variables the differences between the two groups after
matching (Table 6) are significantly reduced, compared with the differences between
the groups when using the original data (Table 5). Note that the education = 0 level
is absent from the tables since the secondary response is well-balanced across the
two groups in the original data. Another way to visualize the effect of the matching
on the distribution is to use side-by-side boxplots. The matching provided the most
noticeable improvement in balancing age and PC1 across the two groups, as can be
seen in Fig. 3.
Fig. 3 The distributions of Age and PC1 in the case and control groups, before and after applying the propensity score matching method
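For completeness, a brief sketch of how such balance diagnostics and before/after boxplots might be produced in R, continuing the hypothetical m.out, dat, and matched objects from the earlier MatchIt sketch.

```r
# Sketch of balance diagnostics (continuing the hypothetical `m.out`, `dat`,
# and `matched` objects from the earlier MatchIt sketch).
summary(m.out)    # tabular balance summary, analogous to Tables 5 and 6

# Side-by-side boxplots of age and PC1 by case/control, before and after matching
op <- par(mfrow = c(2, 2))
boxplot(age ~ case, data = dat,     main = "Age, before matching")
boxplot(age ~ case, data = matched, main = "Age, after matching")
boxplot(pc1 ~ case, data = dat,     main = "PC1, before matching")
boxplot(pc1 ~ case, data = matched, main = "PC1, after matching")
par(op)
```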
6 Discussion
Acknowledgements The authors wish to thank Dr. David C. Christiani for generously sharing
his data. This research was supported in part by the National Institute of Mental Health, Award
Number K01MH087219. The content of this paper is solely the responsibility of the authors, and
it does not represent the official views of the National Institute of Mental Health or the National
Institutes of Health.
References
Abadie, A., Imbens, G.: Large sample properties of matching estimators for average treatment
effects. Econometrica 74(1), 235–267 (2006)
Amos, C.I., Wu, X., Broderick, P., Gorlov, I.P., Gu, J., Eisen, T., Dong, Q., Zhang, Q., Gu, X.,
Vijayakrishnan, J., Sullivan, K., Matakidou, A., Wang, Y., Mills, G., Doheny, K., Tsai, Y.Y.,
Chen, W.V., Shete, S., Spitz, M.R., Houlston, R.S.: Genome-wide association scan of tag SNPs
identifies a susceptibility locus for lung cancer at 15q25.1. Nat. Genet. 40(5), 616–622 (2008)
Ghosh, A., Wright, F., Zou, F.: Unified analysis of secondary traits in case-control association
studies. J. Am. Stat. Assoc. 108, 566–576 (2013)
Ho, D.E., Imai, K., King, G., Stuart, E.A.: MatchIt: nonparametric preprocessing for parametric
causal inference. J. Stat. Softw. 42(8), 1–28 (2011). http://www.jstatsoft.org/v42/i08/
Højsgaard, S., Halekoh, U., Yan, J.: The R package geepack for generalized estimating equations.
J. Stat. Softw. 15(2), 1–11 (2005)
Lettre, G., Jackson, A.U., Gieger, C., Schumacher, F.R., Berndt, S.I., Sanna, S., Eyheramendy, S.,
Voight, B.F., Butler, J.L., Guiducci, C., Illig, T., Hackett, R., Heid, I.M., Jacobs, K.B., Lyssenko,
V., Uda, M., Boehnke, M., Chanock, S.J., Groop, L.C., Hu, F.B., Isomaa, B., Kraft, P., Peltonen,
L., Salomaa, V., Schlessinger, D., Hunter, D.J., Hayes, R.B., Abecasis, G.R., Wichmann, H.E.,
Mohlke, K.L., Hirschhorn, J.N.: Identification of ten loci associated with height highlights new
biological pathways in human growth. Nat. Genet. 40(5), 584–591 (2008)
Lin, D.Y., Zeng, D.: Proper analysis of secondary phenotype data in case-control association
studies. Genet. Epidemiol. 33(3), 356–365 (2009)
Monsees, G.M., Tamimi, R.M., Kraft, P.: Genome-wide association scans for secondary traits using
case-control samples. Genet. Epidemiol. 33, 717–728 (2009)
R Core Team: R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria (2014). http://www.R-project.org/
Robins, J., Rotnitzky, A., Zhao, L.: Estimation of regression coefficients when some regressors are
not always observed. J. Am. Stat. Assoc. 89, 198–203 (1994)
Rosenbaum, P., Rubin, D.: The central role of the propensity score in observational studies for
causal effects. Biometrika 70, 41–55 (1983)
Sanna, S., Jackson, A.U., Nagaraja, R., Willer, C.J., Chen, W.M., Bonnycastle, L.L., Shen, H.,
Timpson, N., Lettre, G., Usala, G., Chines, P.S., Stringham, H.M., Scott, L.J., Dei, M., Lai,
S., Albai, G., Crisponi, L., Naitza, S., Doheny, K.F., Pugh, E.W., Ben-Shlomo, Y., Ebrahim,
S., Lawlor, D.A., Bergman, R.N., Watanabe, R.M., Uda, M., Tuomilehto, J., Coresh, J.,
Hirschhorn, J.N., Shuldiner, A.R., Schlessinger, D., Collins, F.S., Davey Smith, G., Boerwinkle,
E., Cao, A., Boehnke, M., Abecasis, G.R., Mohlke, K.L.: Common variants in the GDF5-
UQCC region are associated with variation in human height. Nat. Genet. 40(2), 198–203 (2008)
Schifano, E., Li, L., Christiani, D., Lin, X.: Genome-wide association analysis for multiple
continuous phenotypes. Am. J. Hum. Genet. 92(5), 744–759 (2013)
Tchetgen Tchetgen, E.: A general regression framework for a secondary outcome in case-control
studies. Biostatistics 15, 117–128 (2014)
VanderWeele, T.J., Asomaning, K., Tchetgen Tchetgen, E.J., Han, Y., Spitz, M.R., Shete, S., Wu,
X., Gaborieau, V., Wang, Y., McLaughlin, J., Hung, R.J., Brennan, P., Amos, C.I., Christiani,
D.C., Lin, X.: Genetic variants on 15q25.1, smoking, and lung cancer: an assessment of
mediation and interaction. Am. J. Epidemiol. 175(10), 1013–1020 (2012)
Wei, J., Carroll, R., Muller, U., Van Keilegom, I., Chatterjee, N.: Locally efficient estimation for
homoscedastic regression in the secondary analysis of case-control data. J. Roy. Stat. Soc. Ser.
B 75, 186–206 (2013)
Controlling for Population Density
Using Clustering and Data Weighting
Techniques When Examining Social Health
and Welfare Problems
1 Introduction
Risk proneness and its relationship to depressive symptoms are vital in understand-
ing the underlying processes that determine health risk behaviors among youth, such
as alcohol use and sexual behavior. The National Longitudinal Survey on Youth
(NLSY) 1998 Young Adult cohort has been selected to illustrate how sensation
seeking operates in predicting alcohol use and sexual behavior among adolescents,
using a weighted path analysis model. By introducing the weighting technique, particularly with respect to computation of the covariance matrix necessary to execute path analysis, which has not previously been applied to these data in order to normalize against the US population, the resulting path model is compared with the results obtained when a design effect procedure is applied.
2 Background
From a review of the literature, it appears that most studies using various
components of the NLSY Mother–Child cohorts or Young Adult data sets to conduct
analyses do not employ the raw data weights, let alone a transformed data weight,
in conjunction with an algebraic formula (Crockett et al. 2006; Pachter et al. 2006).
Thus, the application of the weighted approach extends the illustration of weighting
procedures beyond the econometric and/or demography literature into the broader
behavioral and epidemiological sciences (Horowitz and Manski 1998; MaCurdy
et al. 1998). The NLSY data weights have been used to examine employment and
wage trends, but not the relationship between underlying psychosocial mechanisms
and health-related outcomes (MaCurdy et al. 1998). A post-stratification procedure
is necessary to reduce bias in standard error estimates (Rubin et al. 1983). This
research makes an important contribution by using a weighted case approach in
testing the difference between non-weighted vs. weighted models.
Indeed, Lang and Zagorsky (2001) assert that not using weights may intro-
duce heteroskedasticity (unequal variances across observations). Therefore, it is
necessary to examine and compare the standard errors when performing analyses,
using a weight formula. Horowitz and Manski (1998) explain the application of the
weight formula from Little and Rubin (1987) and Rubin et al. (1983), as applied
to econometric analysis. Moreover, MaCurdy et al. (1998) discuss why and how
the raw weights in each of the NLSY survey years differ, accounting for the non-
response rate and attrition. Since the weights differ in each year and particularly
since the calculation of the weight changed in 2002 (Ohio State University and
Center for Human Resource Research 2006), MaCurdy et al. (1998) assert that
longitudinal analysis using weighted data for multiple wave analysis from the NLSY
is not accurate.
Finally, regarding techniques to control for oversampling of certain under
represented groups in large population data sets, Stapleton (2002) suggests using
design weights in the calculation of the covariance matrices in multi-level and
structural equation models. Alternatively, she recommends using the design weight
variables as covariates in the hypothesized model. She compares the results of
the normalization vs. non-normalization procedures in a structural equation model.
Moreover, both Stapleton (2002) and Hahs-Vaughn and Lomax (2006) warn that ignoring weights leads to serious bias in parameter estimates, along with underestimation of standard errors. Finally, Stapleton (2002) declares
“when modeling with effective sample size weights, care must be taken in devel-
oping syntax to be submitted to the SEM software program. Using traditional
SEM software, the analyst must provide the scaling factor for the between group
covariance model (the square root of the common group size).”
3 Method
In a later paper, Stapleton (2006) presents an argument for design effect adjusted weighting based on unequal probability selection. Given that a probability sample is a proportional selection procedure, this data collection approach requires an estimator of the form

$$t = \sum_{i=1}^{n} Y_i\, I_i\, \pi_i^{-1},$$

which is called the Horvitz–Thompson estimator. In this formula, t is the estimator of that group, with I as the indicator function or Boolean indicator, i.e., {0, 1}, in the non-response population, where $\pi_i = n_i/N$; $n_i$ is the sample size of the ith unit and N is the population size. The t, or weight for each observation, is then the sum over all the cases of each case multiplied by the population size divided by the number of people in that group sampled within the population. The expression below shows that each weight, or t, is calculated separately for each group sampled in the population and then added, to yield a distinct proportion or weight in decimal form for each study subject in the data:

$$t = Y_1 I_1 \pi_1^{-1} + Y_2 I_2 \pi_2^{-1} + \cdots + Y_n I_n \pi_n^{-1}.$$

However, the t, or weighting cell estimator, or raw weight, must then be converted to relative weights or mean values, that is, the proportion of that group in the population that was sampled, based on race, age, and gender:
$$\bar{y}_w = \sum_{i=1}^{N} w^{-1} y_i, \qquad \text{where } w = I\,\pi^{-1}$$

and $I \in \{0, 1\}$, with 0 meaning no sample selection. So if $I_j$, or the population mean, is zero, then the relative weight is infinite:

$$\bar{y}_j = y_j\, I_j\, \frac{N_j}{n_j} \;\Longrightarrow\; y_j\, I_j\, \pi_j^{-1}, \qquad \pi_j^{-1} = \frac{N_j}{n_j}.$$
The result, then, is a higher weight or value for a group whose sample size is oversampled in order to compensate for under-representation of that particular race or ethnic cluster. With n = 3 groups, as in the hypothetical illustration below, nominal weights are calculated based on the average or mean of each group. As an example, for the first group the weight is equal to 0.1, the second is 0.4, and the third weight is 0.2. The algebraic weight formula, when subsequently applied, yields a revised or transformed sample weight, normalized against the population as follows:

$$w_i = \frac{I_j\,\pi_j^{-1}}{\sum_k I_k\,\pi_k^{-1}}$$

Thus, for example, when the given raw weight number 1 is equal to 0.1, the transformed weight becomes 0.14:

$$w_i = \frac{0.1}{0.1 + 0.4 + 0.2} = \frac{0.1}{0.7} = \frac{1}{7} = 0.14,$$

$$\sum \begin{bmatrix} 1 & 0.1/0.7 \\ 2 & 0.4/0.7 \\ 3 & 0.2/0.7 \end{bmatrix} = 1,$$

$$w_i = \frac{\text{raw weight}_i}{\sum_{i}^{n} w_i}.$$
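A small R sketch of this normalization, reproducing the hypothetical three-group example above; the raw weights are those given in the text.

```r
# Sketch of the relative-weight calculation for the hypothetical three-group example:
# nominal (raw) weights 0.1, 0.4 and 0.2 are rescaled to sum to one.
raw_w <- c(0.1, 0.4, 0.2)
rel_w <- raw_w / sum(raw_w)   # w_i = raw_i / sum_k raw_k
round(rel_w, 2)               # 0.14 0.57 0.29
sum(rel_w)                    # the normalized weights total one
```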
For equal inclusion probabilities, weights are applied to minimize the variance in
the groups with oversampling due to under-representation in a population.
$$\operatorname{var}(\hat{u}) = \frac{\sum_{i=1}^{n} w_i\,(y_i - \hat{u})^2}{\sum_{i=1}^{n} w_i\left(\sum_{i=1}^{n} w_i - 1\right)}$$
When the transformed weights in decimals are normalized, the weights total to one.
To implement the application of the transformed algebraic weights, the procedure
in statistical software entails creating a weighted variance–covariance matrix from
the original variables (Stapleton 2002). In SPSS, the variance–covariance matrix
is generated with the transformed weight or new weight turned on. The weighted
matrix and sample size are then supplied to the SEM software or AMOS for testing
the relationship of the paths.
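For readers working in R rather than SPSS, a rough equivalent of this step is sketched below with stats::cov.wt; the data frame nlsy, its variable names, and the weight column norm_weight are hypothetical placeholders.

```r
# R analogue of the SPSS step (a sketch; the data frame `nlsy`, its variable names,
# and the weight column `norm_weight` are hypothetical placeholders).
vars <- c("neigh_quality", "parent_close", "depress", "risk_prone",
          "alcohol_use", "sex_risk")
wcov <- cov.wt(nlsy[, vars], wt = nlsy$norm_weight)
wcov$cov    # weighted variance-covariance matrix to supply, together with the
            # sample size, to the SEM program (e.g., AMOS) for the path model
```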
On a programming level, the weighting cell estimator formula is used to
transform the raw weights which represent the number of people in that group
collected as part of the NLSY 1998 young adult cohort. The first procedure entails
selecting “analyze” then clicking on “descriptive statistics” followed by the function
“frequency.” Using the revised raw sample weight variable provided in the NLSY
data set by the Ohio State University, Center for Human Resource Research, the
“statistics button” is chosen on the bottom of the draw-down menu window and
then “sum.” This procedure prints out the sum of the weights of all the cases in
these data. A new weight for each variable is then calculated using the following
formula:

$$\text{Normalized Weight} = \frac{yaw \times n}{\sum yaw}$$

where
yaw = young adult's raw weight variable provided in the NLSY Young Adult Data set
n = number of cases in the NLSY Young Adult cohort 1998
$\sum yaw$ = sum of the raw weights of all cases
Wi         Wi * ratio
Wi = 2     2 * 4/6.5 = 1.230769
Wi = 3     3 * 4/6.5 = 1.846154
Wi = 1     1 * 4/6.5 = 0.6153846
Wi = 0.5   0.5 * 4/6.5 = 0.3076923
Total      = 4
This normalized weight is then applied to all path analyses executed. While
performing analyses, the weight “on” command in SPSS is selected to ensure the
variables in the sample are normalized against the US population from which they
were originally drawn.
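The same normalization can be sketched in R, reproducing the small example above (raw weights of 2, 3, 1, and 0.5 with n = 4, so the ratio is 4/6.5).

```r
# Sketch of the normalization described above:
# normalized weight = raw weight * n / sum(raw weights), so the weights total n.
raw_w  <- c(2, 3, 1, 0.5)            # hypothetical raw weights
n      <- length(raw_w)              # 4 cases
norm_w <- raw_w * n / sum(raw_w)     # ratio = 4 / 6.5
norm_w                               # 1.230769 1.846154 0.6153846 0.3076923
sum(norm_w)                          # 4
```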
4 Results
4.1 Bivariate
4.1.1 Correlation Table Comparison
All the individual measures in the study were correlated to determine how strongly
associated the variables are with each other, yielding a bivariate final sample n of
4,648, as displayed in Table 1. Correlations were also employed to ascertain the
relationships among the descriptive as well as scale indicators used in the analyses.
Only those correlations that were both significant at p < 0.05 or 0.01 and below
are discussed. In the correlation Table 1 depicting both weighted and non-weighted
Pearson correlation coefficients, some correlations are strengthened with higher val-
ues when compared to non-weighted results. Indeed, in some cases, such as alcohol
use and age (weighted Pearson correlation coefficient (w.P.c.c. D 0.080) vs. non-
weighted Pearson correlation coefficient (n.w.P.c.c. D 0.021)), or sexual risk taking
and gender (w.P.c.c. D 0.062 vs. n.w.P.c.c. D 0.005), the association between two
weighted variables becomes significant. The correlation values generally remain
at the same significance level even when weighted, but some correlation values
increase, while a few decrease as indicated in the ensuing paragraphs.
Table 1 Correlations for NLSY 1998 variables used in analysis (n D 4,648): non-weighted vs. weighted (in parentheses)
Variable 1 2 3 4 5 6 7 8 9
1. Age 1.0
2. Gender 0.011 1.0
Male (0.023)
3. Race 0.103** 0.033 1.0
White (0.076**) (0.044**)
4. Neighborhood 0.064** 0.006 0.247** 1.0
Quality (0.034*) (0.012) (0.253**)
5. Perceived closeness 0.120** 0.010 0.100** 0.043** 1.0
between parents (0.140**) (0.003) (0.112**) (0.079**)
6. Depressive symptoms 0.001 0.169** 0.020 0.142** 0.015 1.0
index (0.005) (0.188**) (0.014) (0.149**) (0.000)
7. Risk proneness 0.123** 0.088** 0.229** 0.036* 0.098** 0.125** 1.0
(0.104**) (0.124**) (0.208**) (0.011) (0.087**) (0.121**)
8. Alcohol use 0.021 0.020 0.141** 0.082** 0.017 0.013 0.138** 1.0
(0.080**) (0.033*) (0.132**) (0.061**) (0.040**) (0.011) (0.153**)
9. Sexual risk taking 0.286** 0.005 0.071** 0.104** 0.112** 0.112** 0.029 0.105** 1.0
(0.295**) (0.062**) (0.059**) (0.155**) (0.149**) (0.138**) (0.012) (0.164**)
*p < 0.05; **p < 0.01
Thus, youth’s age and white race yielded a weighted Pearson correlation coeffi-
cient of 0.076 (vs. n.w.P.c.c. D 0.103), with high significance at p < 0.01. Neigh-
borhood quality correlated with age at interview date (1998) with a weighted Pear-
son correlation coefficient of 0.034 (vs. n.w.P.c.c. D 0.064) with significance at
p < 0.05. Perceived parental closeness between the mother and biological father also
negatively correlated with youth’s age (w.P.c.c. D 0.140 vs. n.w.P.c.c. D 0.120)
and also with risk proneness (w.P.c.c. D 0.104 vs. n.w.P.c.c. D 0.123) at signif-
icance level of p < 0.01. Alcohol use (w.P.c.c. D 0.080 vs. n.w.P.c.c. D 0.021) and
sexual risk taking (w.P.c.c. D 0.295 vs. n.w.P.c.c. D 0.286) were both correlated with
age and highly significant at p < 0.01.
Male gender produced positive correlations with white race (w.P.c.c. D 0.044
vs. n.w.P.c.c. D 0.033) and risk proneness (w.P.c.c. D 0.124 vs. n.w.P.c.c. D 0.088),
again significant at p < 0.01, and alcohol use (w.P.c.c. D 0.033 vs. n.w.P.c.c. D 0.022)
significant at p < 0.05. Male gender negatively correlated with depressive
symptoms (w.P.c.c. D 0.188 vs. n.w.P.c.c. D 0.169, p < 0.01) and sexual
risk taking (w.P.c.c. D 0.062, p < 0.01 vs. n.w.P.c.c. D 0.005 not significant).
White race positively correlated with neighborhood quality (w.P.c.c. D 0.253
vs. n.w.P.c.c. D 0.247), perceived parental closeness between the mother
and biological father (w.P.c.c. D 0.112 vs. n.w.P.c.c. D 0.100), risk proneness
(w.P.c.c. D 0.208 vs. n.w.P.c.c. D 0.229) and alcohol use (w.P.c.c. D 0.132 vs.
n.w.P.c.c. D 0.141), all significant at p < 0.01. However, white race negatively
correlated with sexual risk taking (w.P.c.c. D 0.059 vs. n.w.P.c.c. D 0.071), at
p < 0.01 significance level. Further, neighborhood quality correlated with perceived
parental closeness between the mother and biological father at (w.P.c.c. D 0.079
vs. n.w.P.c.c. D 0.043) and alcohol use (w.P.c.c. D 0.061 vs. n.w.P.c.c. D 0.082)
also significant at p < 0.01. Other negative correlations with neighborhood
quality (meaning lower score, worse quality neighborhood) included: depressive
symptoms index (w.P.c.c. D 0.149 vs. n.w.P.c.c. D 0.142) and sexual risk taking
(w.P.c.c. D 0.155 vs. n.w.P.c.c. D 0.104), both highly significant at p < 0.01.
The variable perceived parental closeness between mother and biological father
(the lower the score, the worse the parenting) also correlated with risk prone-
ness (w.P.c.c. D 0.087 vs. n.w.P.c.c. D 0.098), alcohol use (w.P.c.c. D 0.040 vs.
n.w.P.c.c. D 0.017, not significant), and sexual risk taking (w.P.c.c. D 0.149 vs.
n.w.P.c.c. D 0.112), all weighted Pearson correlation coefficients significant at
p < 0.01. Further, depressive symptoms index was associated with risk proneness
(w.P.c.c. D 0.121 vs. n.w.P.c.c. D 0.125), and sexual risk taking (w.P.c.c. D 0.138 vs.
n.w.P.c.c. D 0.112) at significance level of p < 0.01. Finally, risk proneness was mod-
erately correlated with alcohol use (w.P.c.c. D 0.153 vs. n.w.P.c.c. D 0.138), which
was also associated with sexual risk taking (w.P.c.c. D 0.164 vs. n.w.P.c.c. D 0.105)
but highly significant at p < 0.01.
[Fig. 1: Path diagram relating perceived neighborhood quality, perceived parental closeness, depression, risk proneness, alcohol use, and sexual risk taking, with non-weighted and weighted path coefficients and error terms e1–e4]
4.2 Multivariate
As with the bivariate analysis, the beta coefficients in the path model again change as a consequence of applying weights. Most of the values increase, thereby demonstrating that when the transformed data weights are applied, the association between and among the variables differs from the non-weighted findings. Therefore, as seen in Fig. 1, the weighted path analysis results show that higher neighborhood quality is correlated with higher perceived parental closeness (weighted beta coefficient (w.b.c.) = 0.08 vs. non-weighted beta coefficient (n.w.b.c.) = 0.04). Poorer neighborhood quality is related to higher depression scores (w.b.c. = 0.15 vs. n.w.b.c. = 0.14). Higher neighborhood quality and higher alcohol use are also associated (w.b.c. = 0.06 vs. n.w.b.c. = 0.08) in this model. Poorer neighborhood quality influences higher sexual risk taking (w.b.c. = 0.14 vs. n.w.b.c. = 0.09). Moreover, lower perceived parental closeness between mother and biological father also promotes higher risk proneness (w.b.c. = 0.08 vs. n.w.b.c. = 0.04). Poor perceived parental closeness between mother and biological father is also related to elevated sexual risk taking (w.b.c. = 0.15 vs. n.w.b.c. = 0.14). Higher depression scores are associated with increased risk proneness (w.b.c. = 0.12, same as n.w.b.c. = 0.12) and greater sexual risk taking (w.b.c. = 0.12 vs. n.w.b.c. = 0.10). Risk proneness leads to greater alcohol use (w.b.c. = 0.15 vs. n.w.b.c. = 0.13). Finally, greater alcohol use promotes higher sexual risk taking (w.b.c. = 0.18 vs. n.w.b.c. = 0.11).
Table 3 Indirect, direct, and total effects comparing non-weighted and weighted effects from the path analysis using perceived parental closeness between mother and biological father on sexual risk taking (n = 4,648)

                                 Non-weighted                         Weighted
Variable                         Total indirect  Direct   Total       Total indirect  Direct   Total
                                 effect          effect   effect      effect          effect   effect
Perceived parental closeness     0.001           0.110    0.109       0.002           0.145    0.143
  between mother and bio-father
Neighborhood quality             0.006           0.094    0.100       0.007           0.137    0.144
Depressive symptoms              0.002           0.099    0.101       0.003           0.116    0.119
Risk proneness                   0.015           0.000    0.015       0.027           0.000    0.027
Alcohol use                      0.000           0.113    0.113       0.000           0.177    0.177
Table 2 presents the fit indices for both the non-weighted vs. weighted path
analysis models. The chi-square for both the non-weighted and weighted models
is low and not significant at p < 0.001, meaning the model has good fit in both cases.
However, the weighted chi-square results clearly show a slightly higher chi-square
with even less significance at p < 0.001. Further, the other fit indices, Comparative
Fit Index (CFI), Adjusted Goodness of Fit Index (AGFI), Root Mean Square Error
of Approximation (RMSEA) and Tucker Lewis Index (TLI) per Olobatuyi (2006),
when comparing the non-weighted to the weighted model, all meet the criteria of
between 0.9 and 1. But the weighted model indicates stronger fit with all values
closer to the upper threshold than the non-weighted. Thus, the CFI compares the
tested model to the null, i.e. no paths between variables. The AGFI measures
the amount of variance and covariance in observed and reproduced matrices. The
RMSEA is an indicator of parsimony in the model, meaning the simplest and fewest
number of variables. Finally, the TLI, using chi-square values of the null vs. the
proposed model, reinforces the rigor of this model.
Table 3 describes the direct and indirect effects of all the variables together with
perceived parental closeness between mother and biological father and mother and
step-father, respectively, used in this path analysis on sexual risk taking dependent
variable. Similar findings are reported for both non-weighted and weighted data.
Indeed, the detrimental effect of perceived parental closeness on sexual risk taking
is evidenced by the negative coefficient value in both models using ratings of the
mother and biological father (0.15 non-weighted vs. weighted 0.03). However,
better neighborhood appraisal in the perceived parental closeness between the
mother and biological father had a direct effect (0.06) on increased alcohol use.
Thus, youth who perceived greater neighborhood quality used more alcohol. How-
ever, worse neighborhood quality (0.14) and lower perceived parental closeness
(0.15) has a direct negative effect on sexual risk taking. Or conversely, teens
who perceive a worse environment engage in less sexual activity. Thus, both
neighborhood perception and perceived parental closeness have a protective effect
on sexual risk taking, which then, in turn, is diminished by risk proneness (0.00)
and alcohol use (0.18).
Nevertheless, a paradox arises among indirect effects associated with neighbor-
hood quality. Those youth who rate neighborhood quality as high also report using
more alcohol and increased sexual risk taking, which can possibly be attributed
to more disposable income. In another indirect effect, lower neighborhood quality operates through higher depression and higher risk proneness, in turn leading to higher alcohol use and higher sexual risk taking (a total effect of 0.144 for the biological-parents path analysis model, based on four multiplicative paths summed together, per Cohen and Cohen 1983; see Fig. 1). The total effects for all variables in the weighted model (Fig. 1) are higher in value than in the non-weighted model, with the exception of: (1) perceived parental closeness and risk proneness (w.b.c. = 0.09 vs. n.w.b.c. = 0.10); (2) neighborhood quality and alcohol use (w.b.c. = 0.06 vs. n.w.b.c. = 0.08); and (3) depressive symptoms and risk proneness (both w.b.c. and n.w.b.c. equal to 0.12).
5 Discussion
values. While these distinctions are not marked, applying the transformed weight
to the variance–covariance matrix does present implications for generalizability and
external validity.
Indeed, in path analysis, what happens to the covariance matrix to a degree
is data dependent (McDonald 1996). When using mean substitution in a path
model, variability and standard error decreases thereby reducing the strength of
the correlations and weakening the covariance matrix (Olinsky et al. 2003), as is
the case with this example using the NLSY—Young Adult Survey. However, if the
missing data is indeed missing at random (MAR), mean substitution is considered
the most conservative and therefore the least distorting of the data distribution (Little
and Rubin 1987; Shafer and Olsen 1998). Moreover, by applying the algebraic
weighting technique shown in this study, the contribution of each variable in the
equation is proportionally adjusted to reflect the actual distribution, normalized to
reflect the actual population from which the sample is drawn. On a graphical level,
the data weighting procedure thus pulls the scatter plot in closer, away from the imaginary axes defining the boundaries, tightening the ellipse and yielding a better estimated regression line (Green 1977). Because the underlying notion of causal inference is the focus of path analysis (Rothman and Greenland 2005), applying
the Rubin (1987) MAR mean substitution is not only required in determining the
causal reasoning with large secondary data sets, such as the NLSY, but essential for
demonstrating in particular the temporal ordering of the endogenous variables’ role
in affecting exogenous outcomes (Pearl 2000).
The weighting technique is critical even when the model is applied to primary
data. For example, if this model were tested on a sample taken from a geographic
area with overrepresentation of a particular ethnicity or race, then the weight
formula would need to be applied to ensure generalizability and replicability. Thus,
in order to determine policy initiatives and objectives with respect to demonstration
projects and/or interventions, the data needs to be representative of the population,
standardized against the original distribution ensured by implementing the weight
formula.
The effects of individual, family, and neighborhood quality on adolescent
substance use and sexual activity are evaluated to explain the relationship of
the individual adolescent to the environmental context and how these factors are
associated with co-morbid mental and physical health conditions. Understanding the
mechanisms, such as how depression, sensation seeking, lack of perceived parental closeness (discord on rules), and poorer neighborhood quality operate, elucidates the link to health risk behaviors in situ. This study makes an important contribution by
using a weighted case approach in potentially testing different samples of youth by
race/ethnicity. Determining policy initiatives and objectives requires that the data be
representative of the population, ensured by applying the weight formula.
References
Cohen, J., Cohen, A.: Applied Multiple Regression/Correlation Analysis for the Behavioral
Sciences. Lawrence Erlbaum Associates, Hillsdale (1983)
Crockett, L.J., Raffaelli, M., Shen, Y.-L.: Linking self-regulation and risk proneness to risky
sexual behavior: pathways through peer pressure and early substance use. J. Res. Adolesc. 16,
503–525 (2006)
Green, B.F.: Parameter sensitivity in multivariate methods. J. Multivar. Behav. Res. 12, 263–287
(1977)
Hahs-Vaughn, D.L., Lomax, R.G.: Utilization of sample weights in single-level structural equation
modeling. J. Exp. Educ. 74, 163–190 (2006)
Horowitz, J.L., Manski, C.F.: Censoring of outcomes and regressors due to survey nonresponse:
identification and estimation using weights and imputations. J. Econ. 84, 37–58 (1998)
Lang, K., Zagorsky, J.L.: Does growing up with a parent absent really hurt? J. Hum. Resour. 36,
253–273 (2001)
Little, R.J.A., Rubin, D.A.: Statistical Analysis with Missing Data. Wiley, New York (1987)
MaCurdy, T., Mroz, T., Gritz, R.M.: An evaluation of the national longitudinal survey on youth.
J. Hum. Resour. 33, 345–436 (1998)
McDonald, R.P.: Path analysis with composite variables. Multivar. Behav. Res. 31, 239–270 (1996)
Oh, H.L., Scheuren, F.J.: Weighting adjustment for unit nonresponse. In: Incomplete Data in
Sample Surveys, Chap. 3. Academic Press, New York (1983)
Ohio State University, Center for Human Resource Research: NLSY 79 Child & Young
Adult Data Users Guide. Ohio State University, Center for Human Resource Research,
Columbus (2006). Retrieved from http://www.nlsinfo.org/pub/usersvc/Child-Young-Adult/
2004ChildYA-DataUsersGuide.pdf
Olinsky, A., Chen, S., Harlow, L.: The comparative efficacy of imputation methods for missing
data in structural equation modeling. Eur. J. Oper. Res. 151, 53–79 (2003)
Olobatuyi, R.: A User’s Guide to Path Analysis. University Press of America, Lanham (2006)
Pachter, L.M., Auinger, P., Palmer, R., Weitzman, M.: Do parenting and home environment,
maternal depression, neighborhood and chronic poverty affect child behavioral problems
differently in different age groups? Pediatrics 117, 1329–1338 (2006)
Pearl, J.: Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge
(2000)
Rothman, K.J., Greenland, S.: Causation and causal inference in epidemiology. Am. J. Public
Health 95, S144–S150 (2005)
Rubin, D.B., Olkin, I., Madow, W.G.: Incomplete Data in Sample Surveys. Academic Press, Inc.,
New York (1983)
Shafer, J.L., Olsen, M.K.: Multiple imputation for multivariate missing-data problems: a data
analyst’s perspective. Multivar. Behav. Res. 33, 545–571 (1998)
Stapleton, L.M.: The incorporation of sample weights into multilevel structural equation models.
Struct. Equ. Model. 9, 475–502 (2002)
Stapleton, L.M.: An assessment of practical solutions for structural equation modeling with
complex data. Struct. Equ. Model. 13, 28–58 (2006)
Steinley, D., Brusco, M.J.: A new variable weighting and selection procedure for K-means cluster
analysis. Multivar. Behav. Res. 43, 77–108 (2008)
On the Inference of Partially Correlated Data
with Applications to Public Health Issues
1 Introduction
between the two responses is small, two other tests may be used: a test proposed
by Ekbohm when the homoscedasticity assumption is not strongly violated, and
otherwise a Welch-type statistic suggested by Lin and Stivers (1974) (for further
discussion, see Ekbohm 1976).
Alternatively, researchers tend to ignore some of the data—either the correlated
or the uncorrelated data depending on the size of each subset. However, in case the
missingness is not completely at random (MCAR), Looney and Jones (2003) argued
that ignoring some of the correlated observations would bias the estimation of the
variance of the difference in treatment means and would dramatically affect the
performance of the statistical test in terms of controlling type I error rates and
statistical power (see Snedecor and Cochran 1980). They propose a corrected z-
test method to overcome the challenges created by ignoring some of the correlated
observations. However, our preliminary investigation shows that the method of
Looney and Jones (2003) pertains to large samples and is not the most powerful test
procedure. Furthermore, Rempala and Looney (2006) studied asymptotic properties
of a two-sample randomized test for partially dependent data. They indicated that
a linear combination of randomized t-tests is asymptotically valid and can be used
for non-normal data. However, the large sample permutation tests are difficult to
perform and only have some optimal asymptotic properties in the Gaussian family of
distributions when the correlation between the paired observations is positive. Other
researchers such as Xu and Harrar (2012) and Konietschke et al. (2012) also discuss
the problem for continuous variables including the normal distribution by using
weighted statistics. However, the procedure suggested by Xu and Harrar (2012) is
a functional smoothing to the Looney and Jones (2003) procedure. As such, the Xu
and Harrar procedure is not a practical alternative for the non-statistician researcher.
The procedure suggested by Konietschke et al. (2012) is a nonparametric procedure
based on ranking.
The aforementioned methods cannot be used for non-normal data, moderate or small sample sizes, or categorical data. Samawi and Vogel (2011) introduced several
weighted tests when the variables of interest are categorical. They showed that
their test procedures compete with other tests in the literature. Moreover, there are
several attempts to provide nonparametric test procedures under MCAR and MAR
designs (for example, see Brunner and Puri 1996; Brunner et al. 2002; Akritas
et al. 2002; Im KyungAh 2002; Tang 2007). However, there is still a need for
intensive investigation to develop more powerful nonparametric testing procedures
for MCAR and MAR designs. Samawi et al. (2014) discussed and proposed some
nonparametric testing procedures to handle data when partially correlated data is
available without ignoring the cases with missing responses. They introduced more
powerful testing procedure which combined all cases in the study. These procedures
will be of special importance in meta-analysis where partially correlated data is a
concern when combining results of various studies.
Methods that are most commonly used to analyze a combination of correlated and non-correlated data, when the data are assumed to be normally distributed, are:
1. Using all data with a t-test for two independent samples assuming no correlation
among the observations in the two treatments.
2. Ignoring the paired observations and performing the usual t-test of two independent samples after deleting the correlated data.
3. Ignoring the independent observations of treatments 1 and 2 and performing the usual paired t-test on the correlated data.
4. The corrected z-test by Looney and Jones (2003).
To perform the Looney and Jones test, let {X_1, X_2, ..., X_{n_1}} and {Y_1, Y_2, ..., Y_{n_2}}
denote two independent random samples of subjects receiving either treatment 1
or treatment 2, respectively. Suppose there are n paired subjects in which one
member of the pair receives treatment 1 and the other paired member receives
treatment 2. Let {(U_1, V_1), (U_2, V_2), ..., (U_n, V_n)} denote the observed values of the
paired (correlated) subjects. Looney and Jones assumed that x- and u-observations
come from a common normal parent population and y- and v-observations come
from a common normal parent population, which may be different from the x- and
u-observations. Let M_1 denote the sample mean for all treatment 1 subjects, that is,
the mean of all x- and u-values combined, and let M_2 denote the sample mean for
all treatment 2 subjects, that is, the mean of all y- and v-values combined. Let S_1^2
denote the sample variance for all treatment 1 subjects and let S_2^2 denote the sample
variance for all treatment 2 subjects. The Looney and Jones proposed test statistic is:
Z_{Corr} = \frac{M_1 - M_2}{\sqrt{\dfrac{S_1^2}{n_1+n} + \dfrac{S_2^2}{n_2+n} - \dfrac{2nS_{uv}}{(n_1+n)(n_2+n)}}}
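For illustration, a minimal R sketch of this corrected z-test follows, assuming S_{uv} denotes the sample covariance of the paired observations; the function name and the two-sided p-value step are our own additions, not part of Looney and Jones (2003).

# Corrected z-test of Looney and Jones (2003): a minimal sketch.
# x, y: independent observations under treatments 1 and 2;
# u, v: paired observations under treatments 1 and 2, respectively.
looney.jones.z <- function(x, y, u, v) {
  n1 <- length(x); n2 <- length(y); n <- length(u)
  M1 <- mean(c(x, u))          # mean of all treatment 1 values
  M2 <- mean(c(y, v))          # mean of all treatment 2 values
  S1sq <- var(c(x, u))         # variance of all treatment 1 values
  S2sq <- var(c(y, v))         # variance of all treatment 2 values
  Suv  <- cov(u, v)            # sample covariance of the paired observations
  z <- (M1 - M2) /
    sqrt(S1sq / (n1 + n) + S2sq / (n2 + n) - 2 * n * Suv / ((n1 + n) * (n2 + n)))
  p <- 2 * pnorm(-abs(z))      # two-sided p-value from N(0, 1)
  list(Z.corr = z, p.value = p)
}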
Under the null hypothesis, Z_{Corr} has an asymptotic N(0,1) distribution. However,
this test works only for large sample sizes. In the case of small sample sizes, the exact
distribution is not clear. An approximation to the exact distribution critical values is
needed. Bootstrap methods to find the p-value of the test may also work. In addition,
under the assumption of a large sample size, this test is not a uniformly powerful
test. Its power depends on the correlation between the correlated observations. As
an alternative we propose the following test procedure.
As in Looney and Jones (2003), let {X_1, X_2, ..., X_{n_1}} and {Y_1, Y_2, ..., Y_{n_2}} denote
two independent random samples of subjects receiving either treatment 1 or
treatment 2, respectively. Suppose there are n paired subjects in which one member
of the pair receives treatment 1 and the other paired member receives treatment 2.
Let {(U_1, V_1), (U_2, V_2), ..., (U_n, V_n)} denote the observed values of the paired
subjects. Assume that x- and u-observations come from a common normal parent
population N(\mu_1, \sigma^2) and y- and v-observations come from a common normal parent
population N(\mu_2, \sigma^2). Let D_i = U_i - V_i, i = 1, 2, ..., n. Then D_i is N(\mu_1 - \mu_2, \sigma_D^2),
where \sigma_D^2 = 2\sigma^2(1 - \rho) and \rho is the correlation coefficient between U and V.
Let \bar{X}, \bar{Y} and \bar{D} denote the sample means of the x-observations, y-observations, and
d-observations, respectively. Also, let S_x^2, S_y^2 and S_d^2 denote the sample variances of the
x-observations, y-observations, and d-observations, respectively. Let N = n_1 + n_2 + n
and \gamma = \frac{n_1 + n_2}{N}. Samawi and Vogel (2014) proposed the following test procedure for
testing the null hypothesis H_0: \mu_1 = \mu_2, where \mu_1 and \mu_2 are the respective response
means of treatment 1 and treatment 2:
T_0 = \sqrt{\gamma}\,\frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{S_x^2}{n_1} + \dfrac{S_y^2}{n_2}}} + \sqrt{1-\gamma}\,\frac{\bar{D}}{S_d/\sqrt{n}}.     (1)
When \gamma = 1, this test reduces to the two-sample t-test. Also, when \gamma = 0, this test is
the matched-pairs t-test.
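A minimal R sketch of the weighted statistic T_0 in Eq. (1) follows; the function name is hypothetical, and gamma is computed as (n_1 + n_2)/N as defined above.

# Weighted test statistic T0 of Eq. (1): a minimal sketch.
T0.stat <- function(x, y, u, v) {
  n1 <- length(x); n2 <- length(y); n <- length(u)
  N <- n1 + n2 + n
  gamma <- (n1 + n2) / N
  d <- u - v
  t.ind  <- (mean(x) - mean(y)) / sqrt(var(x) / n1 + var(y) / n2)  # two-sample part
  t.pair <- mean(d) / (sd(d) / sqrt(n))                            # paired part
  sqrt(gamma) * t.ind + sqrt(1 - gamma) * t.pair
}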
Case 1. Large sample sizes will generally mean both a large number of matched-pairs
observations and large numbers of independent observations in the two samples from
treatment 1 and treatment 2. By applying Slutsky's Theorem and under the null hypothesis,
T0 has an approximate N(0,1) distribution. The p-value of the test can therefore be
directly calculated from the standard normal distribution.
Case 2. Without loss of generality, we will consider the case where the paired data have a small
sample size while the two independent samples from the two treatments have large
sample sizes. To find the distribution of the weighted test under the null hypothesis, let

T_o = \sqrt{\gamma}\, X + \sqrt{1-\gamma}\, T_k,

where X has a N(0,1) distribution and T_k has a t-distribution with k
degrees of freedom. Then
f_{T_0}(t) = \frac{1}{\sqrt{\gamma}} \int_{-\infty}^{\infty} \phi\!\left(\frac{t - \sqrt{1-\gamma}\,x}{\sqrt{\gamma}}\right) t_k(x)\, dx.     (2)
Nason (2005) in an unpublished report found the distribution of T_o when the degrees
of freedom are odd. The distribution provided by Nason (2005) is very complicated
and cannot be used directly to find percentiles from this distribution. To find the
p-value of T_o, one needs to use a package published by Nason (2005). Therefore, we
provide a simple bootstrap algorithm to find the p-value of the test procedure based on
the distribution of T_o. A similar approach may be taken when the paired data have a
large sample size and the independent data have a small sample size.
Case 3. Both data sets, the independent samples and the matched-pairs data, have
small sample sizes. Under the null hypothesis, T_0 has a weighted t-distribution. Let
T_o = \sqrt{\gamma}\, T_{k_1} + \sqrt{1-\gamma}\, T_{k_2}, where T_{k_1} and T_{k_2} are two independent t-variates with k_1 and
k_2 degrees of freedom, respectively. Walker and Saw (1978) derived the distribution
of a linear combination of t-variates when all degrees of freedom are odd. In our
case, since we have only two t-variates with k_1 and k_2 degrees of freedom, we need
to assume that k_1 and k_2 are odd numbers, or we can manipulate the data so that both
numbers are odd. Using the Walker and Saw (1978) results, one can find the p-value
of the suggested test statistic T_0. However, the Walker and Saw (1978) method still
requires an extensive amount of computation. Therefore, a bootstrap algorithm may
also be used to find the p-value of T_0.
where

S_d^2 = \frac{\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}{n-1}, \qquad S_P^2 = \frac{\sum_{i=1}^{n_1}\left(X_i - \bar{X}\right)^2 + \sum_{i=1}^{n_2}\left(Y_i - \bar{Y}\right)^2}{n_1 + n_2 - 2}, \qquad \omega_1 = \frac{1}{n} \quad \text{and} \quad \omega_2 = \frac{1}{n_1} + \frac{1}{n_2}.
Note that \mathrm{Var}(\bar{D} + \bar{X} - \bar{Y}) = \omega_1\sigma_d^2 + \omega_2\sigma^2 under the normality assumption and H_0: \mu_1 = \mu_2, and

\frac{(n-1)S_d^2}{\sigma_d^2} \xrightarrow{L} \chi^2_{(n-1)} \quad \text{and} \quad \frac{\sum_{i=1}^{n_1}\left(X_i - \bar{X}\right)^2 + \sum_{i=1}^{n_2}\left(Y_i - \bar{Y}\right)^2}{\sigma^2} \xrightarrow{L} \chi^2_{(n_1+n_2-2)}.
Therefore, under the null hypothesis and using Satterthwaite's method,

T_{New} \xrightarrow{L} t(df_s), \qquad \text{where} \quad df_s \approx \frac{\left(\omega_1 S_d^2 + \omega_2 S_P^2\right)^2}{\dfrac{\left(\omega_1 S_d^2\right)^2}{n-1} + \dfrac{\left(\omega_2 S_P^2\right)^2}{n_1+n_2-2}}.     (4)
Uniform bootstrap resampling was introduced by Efron (1979). Uniform resampling
for the two independent sample case is discussed by Ibrahim (1991) and
Samawi et al. (1996, 1998). We suggest applying uniform bootstrap resampling
as a means of obtaining p-values for our test procedure. However, since our test
procedure involves a t-statistic, some conditions discussed by Janssen and
Pauls (2003) and Janssen (2005) need to be verified to ensure that the test statistic
under consideration has a proper convergence rate. They indicated that the
bootstrap works if and only if the so-called central limit theorem holds for the test
statistics.
In our case {X_1, X_2, ..., X_{n_1}}, {Y_1, Y_2, ..., Y_{n_2}} and {(U_1, V_1),
(U_2, V_2), ..., (U_n, V_n)} are independent samples; thus the bootstrap p-value can
be calculated as follows:
1. Use the original sample sets {X_1, X_2, ..., X_{n_1}}, {Y_1, Y_2, ..., Y_{n_2}} and
{(U_1, V_1), (U_2, V_2), ..., (U_n, V_n)} to calculate T_0.
2. Let {X_1^*, ..., X_{n_1}^*}, {Y_1^*, ..., Y_{n_2}^*} and {(U_1^*, V_1^*), (U_2^*, V_2^*), ..., (U_n^*, V_n^*)}
be the centered samples obtained by subtracting the sample means \bar{X}, \bar{Y} and (\bar{U}, \bar{V}),
respectively.
3. With probabilities (1/n_1, 1/n_2, 1/n) placed on the samples in step (2), respectively,
generate independent bootstrap samples, namely {X_{i1}^{**}, X_{i2}^{**}, ..., X_{in_1}^{**}},
{Y_{i1}^{**}, Y_{i2}^{**}, ..., Y_{in_2}^{**}} and {(U_{i1}^{**}, V_{i1}^{**}), (U_{i2}^{**}, V_{i2}^{**}), ..., (U_{in}^{**}, V_{in}^{**})},
for i = 1, 2, ..., B.
4. For each set of samples i = 1, 2, ..., B in step 3, compute the corresponding
bootstrap version of the T_0 statistic, namely T_1^{**}, T_2^{**}, ..., T_B^{**}.
5. The bootstrap p-value is then computed as

P = \frac{\sum_{i=1}^{B} I\left(\left|T_i^{**}\right| \ge |T_0|\right)}{B}, \quad \text{where} \quad I\left(\left|T_i^{**}\right| \ge |T_0|\right) = \begin{cases} 1 & \text{if } \left|T_i^{**}\right| \ge |T_0| \\ 0 & \text{otherwise.} \end{cases}
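A minimal R sketch of this bootstrap algorithm follows; it reuses the hypothetical T0.stat() function sketched above, and B, the number of bootstrap replicates, is a user choice.

# Bootstrap p-value for T0: a minimal sketch (assumes T0.stat() defined above).
boot.pvalue.T0 <- function(x, y, u, v, B = 2000) {
  T0 <- T0.stat(x, y, u, v)
  # Step 2: center each sample at its own mean so that H0 holds in the bootstrap world
  xc <- x - mean(x); yc <- y - mean(y)
  uc <- u - mean(u); vc <- v - mean(v)
  n1 <- length(x); n2 <- length(y); n <- length(u)
  Tstar <- numeric(B)
  for (i in 1:B) {
    xs <- sample(xc, n1, replace = TRUE)      # resample independent x's
    ys <- sample(yc, n2, replace = TRUE)      # resample independent y's
    idx <- sample(1:n, n, replace = TRUE)     # resample the pairs jointly
    Tstar[i] <- T0.stat(xs, ys, uc[idx], vc[idx])
  }
  mean(abs(Tstar) >= abs(T0))                 # step 5: bootstrap p-value
}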
Table 3 indicates that all of the tests discussed in this paper show strong statistical
evidence that, on average, the satisfaction scores are lower on the second survey than
on the first survey.
1. The first test of significance for combining more than one table or result was
initially proposed by Tippett (1931). He pointed out that if p_1, p_2, ..., p_k are
independent p-values from continuous test statistics, each has a uniform
distribution under the null hypothesis. In this case, we reject H_0 at significance
level \alpha if p_{[1]} < 1 - (1-\alpha)^{1/k}, where p_{[1]} is the minimum of p_1, p_2, ..., p_k (see
Hedges and Olkin 1985).
2. Inverse chi-square method. This is the most widely used combination procedure
and was proposed by Fisher (1932). Given k independent studies and p-values
p_1, p_2, ..., p_k, Fisher's procedure uses the product of p_1, p_2, ..., p_k to combine
the p-values, noting that if U has a uniform distribution, then -2\log U has a chi-square
distribution with two degrees of freedom. If H_0 is true, then -2\sum_{i=1}^{k}\log p_i is chi-square
with 2k degrees of freedom. Therefore, we reject H_0 if -2\sum_{i=1}^{k}\log p_i > C,
where C is obtained from the upper tail of the chi-square distribution with 2k
degrees of freedom (see Hedges and Olkin 1985). A minimal R sketch of both
combination rules is given after this list.
3. An incorrect practice is to use the Pearson chi-square test for homogeneity on an
unmatched group by converting the matched-pair data in Table 4 to the unmatched
data in Table 5. Since the two groups of matched and unmatched data are
independent, the two data sets are sometimes joined into one unmatched data set as in Table 6.
If the matched-pair data is small, with only a few pairs of matched data compared
to the unmatched data, then investigators tend to ignore the matched-pair portion
of the data and test the above hypothesis using a Pearson chi-square test from the
design in Table 5 as follows: under the null hypothesis (see Agresti 1990)
4. If the matched-pair data is large compared to the unmatched data, then investiga-
tors tend to ignore the unmatched portion of the data and test the hypothesis of
association using the McNemar test from the design in Table 4 as follows: under
the null hypothesis (see Agresti 1990)
\chi_3^2 = \frac{(b-c)^2}{b+c} \sim \chi^2_{(1)}     (7)
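As referenced above, a minimal R sketch of Tippett's and Fisher's combination rules follows; p is a vector of independent p-values and alpha the chosen significance level (the function name is ours).

# Tippett's minimum-p rule and Fisher's inverse chi-square rule: a minimal sketch.
combine.pvalues <- function(p, alpha = 0.05) {
  k <- length(p)
  tippett.reject <- min(p) < 1 - (1 - alpha)^(1 / k)   # reject H0 if the smallest p-value is small enough
  fisher.stat <- -2 * sum(log(p))                      # chi-square with 2k df under H0
  fisher.p <- pchisq(fisher.stat, df = 2 * k, lower.tail = FALSE)
  list(tippett.reject = tippett.reject,
       fisher.statistic = fisher.stat, fisher.p.value = fisher.p)
}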
In order to use all information in a study that has both matched-pair data and
unmatched data, we propose using a weighted chi-square test of homogeneity. The
methods will combine the benefits of using matched-pair data tests with the benefits
and strengths of unmatched data test procedures.
1. The proposed test for partially matched-pair data using the design in Tables 7
and 8 is as follows:

\chi_w^2 = w\,\chi_2^2 + (1-w)\,\chi_3^2,     (8)
\chi_C^2 = \chi_2^2 + \chi_3^2.     (9)
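A minimal R sketch of the statistics in Eqs. (7)-(9) follows; the function name, the 2 x 2 table inputs, and the use of a chi-square reference with 2 degrees of freedom for the combined statistic are our illustrative assumptions (a critical value for the weighted statistic, which follows a weighted chi-square distribution, is not computed here).

# Weighted and combined chi-square statistics of Eqs. (8) and (9): a minimal sketch.
# matched: 2x2 table of matched pairs (discordant counts b and c off the diagonal);
# unmatched: 2x2 table of the unmatched subjects; w: weight in (0, 1).
partial.match.chisq <- function(matched, unmatched, w = 0.5) {
  chi3 <- (matched[1, 2] - matched[2, 1])^2 / (matched[1, 2] + matched[2, 1])  # McNemar, Eq. (7)
  chi2 <- chisq.test(unmatched, correct = FALSE)$statistic                     # Pearson chi-square
  chi.w <- w * chi2 + (1 - w) * chi3        # Eq. (8): weighted chi-square
  chi.c <- chi2 + chi3                      # Eq. (9): combined chi-square
  # Under H0 the combined statistic is a sum of two independent chi-square(1) variates,
  # i.e., chi-square with 2 df; the weighted statistic needs its own critical value.
  list(chi.weighted = as.numeric(chi.w),
       chi.combined = as.numeric(chi.c),
       p.combined = pchisq(as.numeric(chi.c), df = 2, lower.tail = FALSE))
}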
The data used in this project were obtained from the National Survey of Children's
Health of 2003 (NSCH) and contain all US states. Two states, namely Georgia
(GA) and Washington (WA), are of primary concern for the comparison of
children insurance disparity. To demonstrate the use of the suggested methods
in Sect. 2, we control for two confounding factors, age and gender. Matching is
performed on all subjects by age (0–17 years) and gender (male and female) for
both states. Georgia is retained as a reference state, and children insurance is the
subject of interest. The subsequent section provides summaries of the data analysis
based on unmatched, matched-pair, and the combination of unmatched and matched
models.
Cross tabulations of insurance and WA State, using Georgia as reference state, are
presented in Table 7. A summary of statistical inference using McNemar’s test to
compare states’ children insurance disparities is presented in Table 8.
McNemar’s test is used to assess the significance of the difference in insurance
disparity between the states of Georgia and Washington controlling for age and
gender using the matched-pairs model. There is significant statistical evidence that
the children insurance disparity in GA is worse than in WA. Based on the estimated
odds ratio, the odds of a child who resides in Georgia not having health insurance
are 1.48 times those of a child living in Washington.
The matched-pairs analysis was based on the part of the data that could be matched.
However, the other part of the data is considered unmatched data and the usual
Pearson chi-square test is used to test for children’s insurance disparity difference
between GA and WA States.
The following tables are created from the interviewees who remained unmatched.
Pearson's chi-square analysis is therefore conducted to see what kind of information
these remaining interviewees provide (Table 9).
Table 10 shows a conclusion similar to that in Table 8. There is a statistically significant
difference in children insurance disparity comparing GA to WA at the 0.05 level of significance.
Also, based on the estimated odds ratio, the odds of a child residing in Georgia and
not having health insurance are 2.1 times those of a child living in Washington.
Table 11 shows the results of using the suggested methods for combining chi-square
tests for partially correlated data.
Table 10 Summary analysis for unmatched data ignoring the matched-pairs data
States Statistics DF p-Value Odds ratio 95 % confidence intervals
GA vs. WA 4.86 1 0.0275 2.01 (1.07, 3.76)
The critical value for the Tippett test is 0.025 and the critical value for the
weighted chi-square test is 2.99. The methods that combine tests provide results
and conclusions similar to those given in the literature, but with greater power. For
situations in which there is only marginal significance in one or both types of data,
combining the strength of the two types of data (matched-pairs and unmatched data)
provides greater power to detect small differences in conditional probabilities.
In conclusion, choosing the right test for combining the matched and the
unmatched data for testing the null hypothesis of homogeneity depends on the
impact of the weights and the strength of the association between the case and control
groups in both data sets. Our investigation revealed that any of the competing tests
(the combined chi-square, the inverse chi-square, and the weighted chi-square tests)
can be recommended, since they all show superiority over the other tests in most of the
cases.
Conover 1999). The matched-pairs sign test statistic, denoted by T_1, for testing the
above hypotheses equals the number of "+" pairs:
T_1 = \sum_{i=1}^{n} I(U_i > V_i),     (10)

where I(U_i > V_i) = \begin{cases} 1 & \text{if } U_i > V_i \\ 0 & \text{otherwise.} \end{cases}
Alternatively, define D_i = U_i - V_i, i = 1, 2, ..., n. Then the null hypothesis
(H_0: the median of the differences is zero) can be tested using the sign test.
Therefore, the test statistic can be written as T_1 = \sum_{i=1}^{n} I(D_i > 0). All tied pairs are
discarded, and n represents the number of remaining pairs. Depending on whether
the alternative hypothesis is one- or two-tailed, and if n \le 20, use the binomial
distribution (i.e., Bin(n, p = 1/2)) to find the critical region of approximately
size \alpha. Under H_0 and for n > 20, T_1 \approx N(n/2, n/4). Therefore, the critical region can be
defined based on the normal distribution:

Z_1 = \frac{T_1 - n/2}{\sqrt{n/4}} \xrightarrow{L} N(0,1).     (11)
For uncorrelated data, let {X_1, X_2, ..., X_{n_1}} and {Y_1, Y_2, ..., Y_{n_2}} denote two independent
simple random samples of subjects exposed to method 1 and method 2,
respectively. It can be shown that H_0: \theta_1 = \theta_2 is valid for two independent
samples, where \theta_1 is a measure of location (median) of F_X(x) and \theta_2 is a measure
of location (median) of F_Y(y). If the distributions of X and Y have the same shape,
then the null hypothesis of interest is H_0: \theta_1 - \theta_2 = 0. Define:

T_2 = \sum_{j=1}^{n_1}\sum_{k=1}^{n_2} I(X_j > Y_k),     (12)

where I(X_j > Y_k) = \begin{cases} 1 & \text{if } X_j > Y_k \\ 0 & \text{otherwise.} \end{cases}
Then T_2 is the Mann–Whitney–Wilcoxon two-sample test statistic. Therefore,
E(T_2) = \frac{n_1 n_2}{2} and Var(T_2) = \frac{n_1 n_2 (n_1+n_2+1)}{12} (for example, see Conover 1999). For large
samples and under H_0: \theta_1 - \theta_2 = 0, the critical region can be defined based
on the normal distribution (again, see Conover 1999):

Z_2 = \frac{T_2 - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1+n_2+1)}{12}}} \xrightarrow{L} N(0,1).     (13)
Case 1. Small sample sizes. For small sample sizes, we propose the following
test procedure to combine the sign test for correlated data with the Mann–Whitney–Wilcoxon
test for uncorrelated data:
(1) Let T_c = T_1 + T_2.
(2) Let 0 < \lambda < 1; then the two tests can be combined as follows:
define T_\lambda = \lambda T_1 + (1-\lambda) T_2.
Using notation similar to that found in Hettmansperger and McKean (2011), we
construct the following theorem:
Theorem 1. Given n1 x’ s , n2 y’ s, and n pairs of (u, v) and under Ho , let:
P ;n2 .l/ D PH0 .T2 D l/ ; l D 0; 1; 2; : : : ; n1 n2 I and Pn .i/ D PH0 .T1 D i/ D
n1
n 1 n
; i D 0; 1; 2; : : : ; n: Then,
i 2
(i)
X X
P .Tc D t/ D Pn1 ;n2 .l/Pn .i/; t D 0; 1; 2; : : : ; n1 n2 C nI and (14)
lCiDt
(ii)
X X
P T Dt D Pn1 ;n2 .l/Pn .i/; (15)
lC.1 /iDt
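For illustration, a minimal R sketch of the exact null distribution in part (i) of Theorem 1 follows; it convolves the Bin(n, 1/2) distribution of T_1 with the null distribution of T_2, which R's dwilcox() supplies in its Mann–Whitney form.

# Exact null distribution of Tc = T1 + T2 (Theorem 1, part (i)): a minimal sketch.
# T1 ~ Bin(n, 1/2); T2 has the Mann-Whitney null distribution on 0, ..., n1*n2.
exact.dist.Tc <- function(n, n1, n2) {
  p.T1 <- dbinom(0:n, size = n, prob = 0.5)          # P(T1 = i), i = 0, ..., n
  p.T2 <- dwilcox(0:(n1 * n2), n1, n2)               # P(T2 = l), l = 0, ..., n1*n2
  tmax <- n + n1 * n2
  p.Tc <- numeric(tmax + 1)                          # P(Tc = t), t = 0, ..., n + n1*n2
  for (i in 0:n) {
    idx <- i + 0:(n1 * n2)                           # values t = i + l
    p.Tc[idx + 1] <- p.Tc[idx + 1] + p.T1[i + 1] * p.T2
  }
  data.frame(t = 0:tmax, prob = p.Tc)
}

# Example: upper-tail p-value for an observed value tc with n = 5 pairs, n1 = 6, n2 = 7:
# dist <- exact.dist.Tc(5, 6, 7); sum(dist$prob[dist$t >= tc])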
Case 2. Large sample sizes. For large sample sizes and under H_0, let

\frac{n}{n+n_1+n_2} \rightarrow \lambda \quad \text{as} \quad \{n \rightarrow \infty \text{ and large } n_1, n_2 < \infty; \text{ or } n \rightarrow \infty \text{ and } n_1 \rightarrow \infty \text{ and large } n_2 < \infty; \text{ or } n, n_1, n_2 \rightarrow \infty\}.

We propose to use:
(i)

Z_0 = \frac{T_1 + T_2 - \left(\frac{n}{2} + \frac{n_1 n_2}{2}\right)}{\sqrt{\frac{n}{4} + \frac{n_1 n_2 (n_1+n_2+1)}{12}}} \xrightarrow{L} N(0,1); \quad \text{and}     (16)

(ii)

T_Z = \sqrt{\lambda}\, Z_1 + \sqrt{1-\lambda}\, Z_2 \xrightarrow{L} N(0,1).     (17)
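A minimal R sketch of the large-sample combined procedure in Eqs. (16) and (17) follows; the function name and the two-sided p-values are our own additions.

# Large-sample combined sign / Mann-Whitney test of Eqs. (16)-(17): a minimal sketch.
# u, v: paired observations; x, y: independent observations from the two methods.
combined.np.test <- function(x, y, u, v) {
  d <- u - v
  d <- d[d != 0]                                   # discard tied pairs
  n <- length(d); n1 <- length(x); n2 <- length(y)
  T1 <- sum(d > 0)                                 # sign statistic, Eq. (10)
  T2 <- sum(outer(x, y, ">"))                      # Mann-Whitney statistic, Eq. (12)
  lambda <- n / (n + n1 + n2)
  Z1 <- (T1 - n / 2) / sqrt(n / 4)                                       # Eq. (11)
  Z2 <- (T2 - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)          # Eq. (13)
  Z0 <- (T1 + T2 - (n / 2 + n1 * n2 / 2)) /
    sqrt(n / 4 + n1 * n2 * (n1 + n2 + 1) / 12)                           # Eq. (16)
  TZ <- sqrt(lambda) * Z1 + sqrt(1 - lambda) * Z2                        # Eq. (17)
  list(Z0 = Z0, TZ = TZ,
       p.Z0 = 2 * pnorm(-abs(Z0)), p.TZ = 2 * pnorm(-abs(TZ)))
}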
If we assume that U and V are exchangeable random variables, then U - V and V - U
both have symmetric distributions and the Wilcoxon test is clearly justified. Let
D_i = U_i - V_i, i = 1, 2, ..., n. Under H_0, we may use a Wilcoxon signed-rank test as
follows. Let I_i be an indicator for when |D|_{(i)} corresponds to a positive observation,
where |D|_{(1)} < \cdots < |D|_{(n)} are the ordered absolute values. Then,

T_{W^+} = \sum_{i=1}^{n} i\,I_i = \sum_{i=1}^{n} R_i\, s(D_i)     (18)
is the Wilcoxon signed-rank statistic, where R_i is the rank of |D_i| and I_i = s(D_i),
where s(D_i) = 1 if D_i > 0 and 0 otherwise. It has been shown that, under the
null hypothesis, E(T_{W^+}) = \frac{n(n+1)}{4} and Var(T_{W^+}) = \frac{n(n+1)(2n+1)}{24}. Since D_i has a
symmetric distribution under the H_0 assumption, T_{W^+} is a linear combination of
i.i.d. Bernoulli(1/2) random variables. For large samples,

Z_{W^+} = \frac{T_{W^+} - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}} \xrightarrow{L} N(0,1).     (19)
Case 1. Small sample sizes. For small sample sizes, we propose the following test
procedure to combine the two tests:
(i) Let T_{cw} = T_{W^+} + T_2 (see Dubnicka et al. 2002).
(ii) Let 0 < \lambda < 1; then the two tests can be combined as follows:
define T_{\lambda w} = \lambda T_{W^+} + (1-\lambda) T_2.
Again, using notation similar to that found in Hettmansperger (1984) and
Hettmansperger and McKean (2011), we construct the following theorem:
Theorem 2. Given n1 x’ s , n2 y’ s, and n pairs of (u, v) and under H0 , let
Pwn .b/ n .n C 1/
Pwn .b/ D PH0 .TWC D b/ D ; b D 0; 1; 2; : : : ; : (21)
2 n 2
Then:
(i)

P(T_{cw} = t) = \sum_{l+b=t} P_{n_1,n_2}(l)\,P_n^w(b), \quad t = 0, 1, 2, \ldots, n_1 n_2 + \frac{n(n+1)}{2};

and
(ii)

P(T_{\lambda w} = t) = \sum_{\lambda l + (1-\lambda)b = t} P_{n_1,n_2}(l)\,P_n^w(b),

where P_{n_1,n_2}(l) is the same as that in Theorem 1 and P_n(b) = P_{n-1}(b) +
P_{n-1}(b-n), with P_0(0) = 1, P_0(b) = 0 otherwise, and P_n(b) = 0 for b < 0.
The proof is a result of Theorems 3.2.2 and 3.2.3 as well as Exercise 3.7.3 in
Hettmansperger (1984). Similarly, examples of these null distributions for selected
sample sizes are provided in the Appendix. Additionally, R codes are provided on
the following website to calculate the exact discrete distribution of the proposed
tests: http://personal.georgiasouthern.edu/~hsamawi/.
Case 2. Large sample sizes. For large samples and under H_0, let

\frac{n}{n+n_1+n_2} \rightarrow \lambda \quad \text{as} \quad \{n \rightarrow \infty \text{ and large } n_1, n_2 < \infty; \text{ or } n \rightarrow \infty \text{ and } n_1 \rightarrow \infty \text{ and large } n_2 < \infty; \text{ or } n, n_1, n_2 \rightarrow \infty\}.
We propose to use:
(i)

Z_{0D} = \frac{T_{W^+} + T_2 - \left(\frac{n(n+1)}{4} + \frac{n_1 n_2}{2}\right)}{\sqrt{\frac{n(n+1)(2n+1)}{24} + \frac{n_1 n_2 (n_1+n_2+1)}{12}}} \xrightarrow{L} N(0,1); \quad \text{and}

(ii)

T_Z = \sqrt{\lambda}\, Z_{W^+} + \sqrt{1-\lambda}\, Z_2 \xrightarrow{L} N(0,1).
Under the assumption of MCAR and in order to test the null hypothesis H_0: \theta_1 = \theta_2,
the following notation must be defined: D_i = U_i - V_i, i = 1, 2, ..., n, and
D_{jk} = X_j - Y_k, j = 1, 2, ..., n_1, k = 1, 2, ..., n_2. Let N = n + n_1 n_2; DD_1 = D_1, ...,
DD_n = D_n; DD_{n+1} = D_{11}, ..., DD_N = D_{n_1 n_2}; and let I_m be an indicator for when
|DD|_{(m)} corresponds to a positive DD observation, where m = 1, 2, ..., N and
|DD|_{(1)} < \cdots < |DD|_{(N)} are the ordered absolute values. Then,

T_{New} = \sum_{m=1}^{N} m\,I_m = \sum_{m=1}^{N} R_m\, s(DD_m)     (22)
P(T_{New} = t) = \frac{1}{\binom{N}{n}} \sum_{m=1}^{\binom{N}{n}} P_m(T_{W^+} = t_{W^+})\, P_m(T_2 = t - t_{W^+}), \quad t = 0, 1, 2, \ldots, \frac{N(N+1)}{2},     (23)
where P_m(T_{W^+} = t_{W^+}) and P_m(T_2 = t - t_{W^+}) are expressed as in Theorem 2 for
the mth permutation, with P_m^w(b) = P_m(T_{W^+} = b) = \frac{P_m(b)}{2^m}, b = 0, 1, 2, \ldots, \frac{m(m+1)}{2},
and P_m(b) = P_{m-1}(b) + P_{m-1}(b-m), P_0(0) = 1, P_0(b) = 0 otherwise, and P_m(b) =
0 for b < 0.
R codes to find (23) by calculating the exact discrete distribution of the proposed
test are provided on the following website: http://personal.georgiasouthern.edu/~
hsamawi/. Using (22), it is easy to show that the mean and the variance of our
proposed test statistic are as follows:
E(T_{New}) = \frac{N(N+1)}{2\binom{N}{n}},     (24)

V(T_{New}) = \frac{4N(N+1)(2N+1)\binom{N}{n} - 6N^2(N+1)^2}{24\binom{N}{n}^2}.     (25)
Note that both the mean and the variance in (24) and (25) are finite and decreasing
as N increases, provided that \frac{n}{n+n_1+n_2} \rightarrow \lambda as \{n \rightarrow \infty and large n_1, n_2 < \infty;
or n \rightarrow \infty and n_1 \rightarrow \infty and large n_2 < \infty; or n, n_1, n_2 \rightarrow \infty\}.
Under the MCAR design and the null hypothesis, the asymptotic distribution of T_{New}
is as follows:

Z_{New} = \frac{T_{New} - E(T_{New})}{\sqrt{V(T_{New})}} \xrightarrow{L} N(0,1).     (26)
Table 12 contains data on eight patients taken from Weidmann et al. (1992). The
purpose of this study was to compare the proportions of certain T cell receptor
gene families (the Vβ gene families) on tumor infiltrating lymphocytes (TILs) and
Acknowledgements We are grateful to the Center for Child & Adolescent Health for providing
us with the 2003 National Survey of Children’s Health. Also, we would like to thank the referees
and the associate editor for their valuable comments which improved the manuscript.
Appendix
References
Samawi, H.M., Vogel, R.L.: Tests of homogeneity for partially matched-pairs data. Stat. Methodol.
8, 304–313 (2011)
Samawi, H.M., Vogel, R.L.: Notes on two sample tests for partially correlated (paired) data. J.
Appl. Stat. 41(1), 109–117 (2014)
Samawi, H.M., Woodworth, G.G., Al-Saleh, M.F.: Two-sample importance resampling for the
bootstrap. Metron. LIV(3–4) (1996)
Samawi, H.M., Woodworth, G.G., Lemke, J.: Power estimation for two-sample tests using
importance and antithetic resampling. Biom. J. 40(3), 341–354 (1998)
Samawi, H.M., Yu, L., Vogel, R.L.: On some nonparametric tests for partially correlated data:
proposing a new test. Unpublished manuscript (2014)
Snedecor, G.W., Cochran, W.G.: Statistical Methods, 7th edn. Iowa State University Press, Ames
(1980)
Steere, A.C., Green, J., Schoen, R.T., Taylor, E., Hutchinson, G.J., Rahn, D.W., Malawista, S.E.:
Successful parenteral penicillin therapy of established Lyme arthritis. N. Engl. J. Med. 312(14),
869–874 (1985)
Tang, X.: New test statistic for comparing medians with incomplete paired data. M.S. Thesis,
Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh,
Pittsburgh, PA. http://www.google.com/search?hl=en&rlz=1T4ADRA_enUS357US357&q=
Tang+X.+%282007%29New+Test+Statistic+for+Comparing+Medians+with+Incomplete+
Paired+Data&btnG=Search&aq=f&aqi=&aql=&oq= (2007)
Tippett, L.H.C.: The Method of Statistics. Williams & Norgate, London (1931)
Walker, G.A., Saw, J.G.: The distribution of linear combinations of t-variables. J. Am. Stat. Assoc.
73(364), 876–878 (1978)
Weidmann, E., Whiteside, T.L., Giorda, R., Herberman, R.B., Trucco, M.: The T-cell receptor V
beta gene usage in tumor-infiltrating lymphocytes and blood of patients with hepatocellular
carcinoma. Cancer Res. 52(21), 5913–5920 (1992)
Xu, J., Harrar, S.W.: Accurate mean comparisons for paired samples with missing data: an
application to a smoking-cessation trial. Biom. J. 54, 281–295 (2012)
Modeling Time-Dependent Covariates
in Longitudinal Data Analyses
Trent L. Lalonde
Abstract Often public health data contain variables of interest that change over
the course of longitudinal data collection. In this chapter a discussion is presented
of analysis options for longitudinal data with time-dependent covariates. Relevant
definitions are presented and explained in the context of practical applications,
such as different types of time-dependent covariates. The consequences of ignoring
the time-dependent nature of variables in models are discussed. Modeling options
for time-dependent covariate data are presented in two general classes: subject-
specific models and population-averaged models. Specific subject-specific mod-
els include random-intercept models and random-slopes models. Decomposition
of time-dependent covariates into “within” and “between” components within
each subject-specific model is discussed. Specific population-averaged models
include the independent GEE model and various forms of the GMM (generalized
method of moments) approach, including researcher-determined types of time-
dependent covariates along with data-driven selection of moment conditions using
the Extended Classification. A practical data example is presented along with
example programs for both SAS and R.
The term “longitudinal data” refers to data that involve the collection of the same
variables repeatedly over time. Typically the term is used to refer to longitudinal
panel data, which denotes the case of collecting data repeatedly from the same
subjects. This type of data is very common in practice, and allows for researchers to
assess trends over time and gives power to typical population comparisons (Zeger
and Liang 1992; Diggle et al. 2002; Hedeker and Gibbons 2006; Fitzmaurice
et al. 2012). Specific research interests can include comparisons of mean responses
at different times; comparisons of mean responses across different populations,
accounting for the effects of time; and assessing the impacts of independent
In the following sections, the distinctions among TDC will be explored, including
methods of identifying types of TDC. Modeling TDC data using conditional
methods is discussed, followed by modeling using marginal methods. The chapter
concludes with a data example exemplifying all relevant modeling techniques.
2.1 Exogeneity
One of the most common distinctions made of TDC is that of exogeneity (Cham-
berlain 1982; Amemiya 1985; Diggle et al. 2002). An exogenous variable has a
stochastic process that can be determined by factors outside the system under study,
and is not influenced by the individual under study. An exogenous TDC can be
thought of as a randomly fluctuating covariate that cannot be explained using other
variables in the study. It is most important to determine exogeneity with respect to
the response. A TDC is said to be exogenous with respect to the response process if
that time-dependent variable at one time is conditionally independent of all previous
responses.
Formally, let the response for subject i at time t be denoted by Y_{it}, and let x_{it}
denote a TDC for subject i at time t. Then x is exogenous with respect to the response
process Y if

f_X(x_{it} \mid Y_{i1}, \ldots, Y_{it};\; x_{i1}, \ldots, x_{i(t-1)}) = f_X(x_{it} \mid x_{i1}, \ldots, x_{i(t-1)}),     (1)

where f_X denotes the density of x. Under the definition of Eq. (1), while x_{it} may be
associated with previous covariate values x_{i1}, \ldots, x_{i(t-1)}, it will not be associated
with previous or current responses Y_{i1}, \ldots, Y_{it}. A consequence of this definition is
that the current response Yit will be independent of future covariate values, even if
there is an association with prior covariate values,
Lai and Small (2007) defined three types of TDC, and a fourth type was defined
by Lalonde et al. (2014). Each type of TDC is related to the extent of non-exogeneity
with respect to the response and can help determine appropriate analysis techniques.
A covariate is said to be a Type I TDC if

E\!\left[\frac{\partial \mu_{is}}{\partial \beta_j}\,(Y_{it} - \mu_{it})\right] = 0 \quad \forall\, s, t,     (3)

where \mu_{is} and \mu_{it} are evaluated at the true parameter values \beta, and j is the index
of the TDC in question. The expectation must be satisfied for all combinations of
times s and t, suggesting there is no relationship between the TDC and the response
at different times. A sufficient condition for a TDC to be Type I is
Thus the response is independent of all TDC values at different times. The sufficient
requirement of Eq. (4) would seem to be a stronger condition than the exogeneity
presented by Eq. (2), in that Eq. (4) requires the response at time t to be independent
of all other TDC values, even those prior to t. Variables that involve predictable
changes over time, such as age or time of observation, are typically treated as Type I
TDCs. A covariate is said to be a Type II TDC if
E\!\left[\frac{\partial \mu_{is}}{\partial \beta_j}\,(Y_{it} - \mu_{it})\right] = 0 \quad \forall\, s \ge t.     (5)
The expectation must be satisfied when s ≥ t, but not necessarily when s < t,
suggesting dependence between the response and covariate. In this case the TDC
process is not associated with prior responses, but the response process can be
associated with prior TDC values. A sufficient condition for a covariate to be
Type II is
As discussed in Lai and Small (2007), this is similar but not equivalent to exogeneity
with respect to the response process. It can be shown that exogeneity is sufficient for
a TDC to be of Type II (Chamberlain 1982; Lai and Small 2007). Examples of Type
II TDCs include covariates that may have a “lagged” association with the response
in that previous TDC values can affect the response, but covariate values will
not be affected by previous response values. One example is the covariate “blood
pressure medication” as a Type II covariate with the response “blood pressure,” as
the accumulated effects of medication over time can be expected to have an impact
on blood pressure at any time. A covariate is said to be a Type III TDC if
E\!\left[\frac{\partial \mu_{is}}{\partial \beta_j}\,(Y_{it} - \mu_{it})\right] = 0 \quad \forall\, s = t.     (6)
A covariate is said to be a Type IV TDC if

E\!\left[\frac{\partial \mu_{is}}{\partial \beta_j}\,(Y_{it} - \mu_{it})\right] = 0 \quad \forall\, s \le t.     (7)
The expectation must be satisfied for s ≤ t, but not necessarily when s > t,
suggesting dependence between the response and covariate. For a Type IV TDC,
a covariate can be associated with previous response values, but the response is not
associated with previous covariate values. A sufficient condition for a covariate to
be Type IV is
Type IV TDC are associated with prior response values, but the response at time
t is only associated with the TDC at time t. One example is the covariate “blood
pressure” as a Type IV covariate with the response “weight.” While there is an
association between weight and blood pressure, the direction of the effect seems
to be that weight impacts blood pressure, but the reverse is unlikely.
Different types of TDCs are associated with different relationships with the
response. It is important to be able to identify different types of TDCs to guide
model selection and to provide appropriate interpretations. Lai and Small (2007)
proposed selecting the type of TDC by choice of the researcher, but also presented a
\chi^2 test to compare two different selections of types for TDC. The idea is to construct
competing quadratic forms using the expressions from Eqs. (3), (5), (6), and (7) with
zero expectation, so that additional expressions from a different selection of a type
of TDC can inflate the quadratic form if those additional expressions do not, in
fact, have zero expectation. However, this method will only allow for comparisons
between possible selections of types of TDC, but will not make the selection for
the researcher. The Extended Classification method, described in Sect. 4.3, presents
such a process.
3 Subject-Specific Modeling
Here the interpretation of conditional model fixed effects can be made clear. The
parameter \beta_k represents the expected change in the (transformed) mean response
for a unit increase in x_{k,it} for an individual subject, holding all other predictors fixed.
In other words, if predictor x_k changes for an individual subject, \beta_k represents the
expected impact on the mean response.
In the presence of TDC, the standard conditional models are often adjusted
to allow for both “within” and “between” components of effects associated with
TDC (Neuhaus and Kalbfleisch 1998). If a covariate includes both variation within
subjects and variation between subjects, it is believed these two distinct sources of
variation can be associated with different effects. The term in the model representing
each TDC can be decomposed into two terms: one associated with variation within
subjects and the other associated with variation between subjects,
\beta x_{it} \;\rightarrow\; \beta_W (x_{it} - \bar{x}_{i\cdot}) + \beta_B\, \bar{x}_{i\cdot}.
In this expression the parameter \beta_W represents the expected change in the mean
response associated with changes of the TDC within subjects, while the parameter
\beta_B is more of a population-averaged parameter that represents the expected change
in the mean response associated with changes of the TDC across subjects.
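A minimal R sketch of this within/between decomposition follows; the data frame dat, the subject identifier id, and the covariate x are hypothetical, and the random-intercept logistic model shown is in the spirit of the example of Sect. 5.

# Decompose a TDC into within-subject and between-subject components: a minimal sketch.
library(lme4)                                         # for glmer()
dat$x.between <- ave(dat$x, dat$id, FUN = mean)       # subject mean of x (between component)
dat$x.within  <- dat$x - dat$x.between                # deviation from subject mean (within component)
# Random-intercept logistic model with the decomposed covariate
fit <- glmer(y ~ x.within + x.between + (1 | id), family = binomial, data = dat)
summary(fit)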
S(\beta) = \sum_{i=1}^{N} \left[\frac{\partial \mu(\beta; x_i)}{\partial \beta}\right]^T W_i\,(Y_i - \mu(\beta; x_i)) = 0,
where the weight matrix Wi is often taken to be the inverse of the variance–
covariance of the marginal response (Diggle et al. 2002). Pepe and Anderson (1994)
showed that these estimating equations have zero expectation only if the data meet
the assumption,
for each TDC. The assumption of Eq. (8) is met trivially for TIC. Notice that
exogeneity is not a sufficient condition, as Eq. (2) implies that the response at
one time will be independent of an exogenous covariate’s values at future times.
Equation (8), on the other hand, suggests the response at one time should be
independent of covariate values at all other times. When this assumption is satisfied,
In this derivation, the step of removing the derivative term from the inner expectation
depends on the assumption of Eq. (8). Depending on the choice of the weight
matrix Wi , the estimating equations may require combinations of the first term (the
derivative of the systematic component) with the second term (the raw error term)
across different times. Specifically, this will be the case for any non-diagonal weight
matrix. The assumption presented by Pepe and Anderson (1994) requires that the
derivative and raw residual terms are independent at any two time points combined
by the weight matrix.
The standard conditional generalized linear models induce a block-diagonal
variance–covariance structure for the marginal response, and thus the condition of
Eq. (8) must be satisfied if the standard weight matrix is applied. Notice Eq. (8) is
a sufficient condition for a covariate to be a Type I TDC. For other types of TDC,
the condition is likely not satisfied. If the condition is not satisfied, the likelihood-
based estimating equations will not have zero expectation, leading to bias and loss
of efficiency in parameter estimates (Pepe and Anderson 1994; Diggle et al. 2002).
4 Population-Averaged Modeling
Unlike the conditional models of Sect. 3, marginal models for longitudinal data
do not involve specification of a conditional response distribution using random
effects. Instead a marginal model involves specification of the marginal response
distribution, or at least moments of the response distribution (McCullagh and Nelder
1989; Hardin and Hilbe 2003; Diggle et al. 2002). This type of model is associated
with “population-averaged” interpretations, or standard regression interpretations.
Parameters in marginal longitudinal models provide a comparison of the mean
response between two populations with different average values of the predictor of
interest. While marginal conclusions can be obtained through conditional models,
the term “marginal model” will be used to refer to a model specifically intended
for marginal expression and interpretations (Lee and Nelder 2004). A marginal
correlated generalized linear model can be written,
Random Component:

Marginal Mean:

\ln(\mu(x_{it})) = x_{it}^T\,\beta.
For this type of model D is assumed to be a distribution from the exponential family
of distributions, but may not be fully specified within a marginal model. Instead, the
mean \mu(x_{it}) and variance V(\mu(x_{it})) (with possible overdispersion parameter \phi) are
supplied by the researcher. While there are many marginal methods for estimating
parameters in a longitudinal generalized linear model, this chapter will focus on two
methods: the generalized estimating equations (GEE) and the generalized method
of moments (GMM).
The GEE approach to model fitting has been covered extensively in the literature
(Liang and Zeger 1986; Zeger and Liang 1986; Liang et al. 1992; Ziegler 1995;
Hardin and Hilbe 2003; Diggle et al. 2002). Briefly, parameter estimates are
obtained by solving the equations

S(\beta) = \sum_{i=1}^{N} \left[\frac{\partial \mu(\beta; x_i)}{\partial \beta}\right]^T V_i(\mu(\beta; x_i))^{-1}\,(Y_i - \mu(\beta; x_i)) = 0,

with working variance–covariance structure

V_i(\mu(x_{it})) = A_i^{1/2}\, R_i(\alpha)\, A_i^{1/2}.
Pepe and Anderson (1994) argued that the structure of the GEE requires
satisfaction of the assumption,
so that the GEE will have zero expectation. As with conditional model estimation,
this assumption is met trivially for TIC. When the assumption is met, the first term
of the GEE can be factored out of the expectation of S(\beta), producing unbiased
estimating equations. When the assumption is not met, the GEE will not have
zero expectation unless the working correlation structure is selected so that all
components of the GEE involve only a single observation time. This is achieved by a
G(\beta; Y, X) = \frac{1}{N} \sum_{i=1}^{N} g_i(\beta; Y, X).
When presenting the GMM, Hansen (1982) argued that the optimal choice for the
weight matrix is the inverse of the variance–covariance structure of the subject-level
vector of valid moment conditions,
The type of TDC will determine which combinations of times form valid
moment conditions. For all predictors in the model, the concatenation of all valid
moment conditions will form the vector gi for each subject. Notice that this method
avoids choosing a general weight matrix to apply across all covariates, as with
likelihood-based estimation or with the GEE. Instead, the GMM allows expressions
to be constructed separately for each TDC, which provides the ability to treat
each covariate according to its type. This eliminates a major restriction from both
likelihood-based methods and the GEE.
\hat{d}_{sj} = \frac{\partial \hat{\mu}_s}{\partial \beta_j}, \qquad \hat{r}_t = y_t - \hat{\mu}_t,
and standardized for comparison. Assuming all fourth moments of \hat{\rho}_{sjt} exist and
are finite,

\frac{\hat{\rho}_{sjt}}{\sqrt{\hat{\sigma}_{22}/N}} \;\approx\; N(0,1),

where \hat{\sigma}_{22} = (1/N)\sum_i (\tilde{d}_{sji})^2 (\tilde{r}_{ti})^2, and N is the total number of subjects.
Significantly correlated components show evidence of non-independence between the
derivative and raw residual terms, and therefore the associated product of derivative
and raw error should not reasonably have zero expectation and can be omitted as
a potential valid moment condition. To account for the large number of hypothesis
tests involved in the Extended Classification process, p-values for all correlation
tests can be collectively evaluated (Conneely and Boehnke 2007).
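For intuition, a minimal R sketch of this screening step for a single candidate moment condition follows; the matrices d.hat and r.hat of per-subject derivative and raw-residual terms are assumed to have been computed from a preliminary fit, and the multiple-testing adjustment of Conneely and Boehnke (2007) is not shown.

# Extended Classification screening of one candidate moment condition (s, t) for
# covariate j: a minimal sketch. d.hat[, s] holds the derivative terms for time s,
# r.hat[, t] the raw residuals for time t, one row per subject.
screen.moment <- function(d.hat, r.hat, s, t) {
  N <- nrow(d.hat)
  ds <- scale(d.hat[, s]); rt <- scale(r.hat[, t])      # standardize each component
  rho.hat <- mean(ds * rt)                              # sample correlation of the two terms
  sigma22 <- mean(ds^2 * rt^2)                          # fourth-moment variance estimate
  z <- rho.hat / sqrt(sigma22 / N)                      # approximately N(0, 1) under independence
  p <- 2 * pnorm(-abs(z))
  c(z = z, p.value = p)                                 # large |z| => omit this moment condition
}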
The method of Extended Classification removes the potentially subjective deci-
sion of the type of each TDC by the researcher and allows the data to determine
appropriate valid moment conditions. Extended Classification also allows for more
than four discrete types, admitting all possible combinations of times instead of the
four cases corresponding to the four types. The Extended Classification process has
shown to be effective in determining appropriate types of TDC, with results similar
or superior to those of subjectively chosen types (Lalonde et al. 2014).
The quadratic form is then minimized to obtain final parameter estimates \hat{\beta}. The
TSGMM process appears to be the most commonly applied method in the literature.
The IGMM process involves an iterative repeat of the steps in TSGMM. After the
quadratic form Q_{TS} has been minimized to obtain updated parameter estimates \hat{\beta}_{(1)},
the estimate of the weight matrix is updated, providing \hat{W}_{(1)}. The process then
iterates between updating \hat{\beta}_{(i)} using the quadratic form and updating \hat{W}_{(i)} using
the resulting estimates,

\hat{\beta}_{(i+1)} = \arg\min_{\beta}\left[ G^T(\beta; Y, X)\, \hat{W}_{(i)}^{-1}\, G(\beta; Y, X) \right].
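For intuition, a compact R sketch of the iterated GMM update follows; moment.fn is a hypothetical user-supplied function returning the matrix of per-subject valid moment conditions, and the identity starting weight matrix and general-purpose optimizer are illustrative choices rather than the author's implementation.

# Iterated GMM for a user-supplied moment function: a minimal sketch.
# moment.fn(beta, Y, X) must return an N x q matrix whose i-th row is g_i(beta).
igmm <- function(moment.fn, beta0, Y, X, iter = 5) {
  beta <- beta0
  W <- diag(length(moment.fn(beta0, Y, X)[1, ]))        # stage 1: identity weight matrix
  Q <- function(b, W) {
    g <- moment.fn(b, Y, X)
    G <- colMeans(g)                                    # G(beta) = average moment vector
    drop(t(G) %*% solve(W) %*% G)                       # quadratic form G' W^{-1} G
  }
  for (k in 1:iter) {
    beta <- optim(beta, Q, W = W)$par                   # minimize the quadratic form
    g <- moment.fn(beta, Y, X)
    W <- cov(g)                                         # update the weight matrix estimate
  }
  beta
}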
5 Data Example
The RI logistic regression model can be written with a decomposition of all TDC,
except for the time indicators,

\text{logit}(\pi_{it}) = \beta_0 + \sum_{k=1}^{4}\left(\beta_{kW}(x_{k,it} - \bar{x}_{k,i\cdot}) + \beta_{kB}\,\bar{x}_{k,i\cdot}\right) + \beta_{t2}\,\text{Time2} + \beta_{t3}\,\text{Time3} + \gamma_{0i},

where \pi_{it} indicates the probability of rehospitalization within 30 days for subject
i at time t, and \gamma_{0i} indicates the random subject effect. The model can be fit using
SAS or R with the commands provided in Sect. 7.1.
The RS logistic regression model can be written similarly, including a random
slope for the length of stay predictor. This will allow the effect of length of stay on the
probability of rehospitalization to vary randomly among subjects,

\text{logit}(\pi_{it}) = \beta_0 + \sum_{k=1}^{4}\left(\beta_{kW}(x_{k,it} - \bar{x}_{k,i\cdot}) + \beta_{kB}\,\bar{x}_{k,i\cdot}\right) + \beta_{t2}\,\text{Time2} + \beta_{t3}\,\text{Time3} + \gamma_{0i} + \gamma_{1i}\,\text{LOS}_{it},

where \gamma_{1i} represents the random variation in the slope for length of stay. The model
can be fit using SAS or R with the commands provided in Sect. 7.2.
The IGEE logistic model will be written without the decomposition of TDC, and
without random subject effects,

\text{logit}(\pi_{it}) = \beta_0 + \sum_{k=1}^{4}\beta_k\, x_{k,it} + \beta_{t2}\,\text{Time2} + \beta_{t3}\,\text{Time3}.
This GEE model can be fit with the independent working correlation structure using
SAS or R with the commands provided in Sect. 7.3.
The systematic and link components for the GMM-Types model will look
identical to that of the IGEE model. For the GMM-Types model, specific types will
be assumed for each TDC. Both time indicators will be treated as Type I TDC, as
is common for such deterministic variables. Both “length of stay” and “existence
of coronary atherosclerosis” will be treated as Type II TDC, as it is reasonable
to assume an accumulated effect on the response from these two variables, but
it is unlikely that the response at one time will affect future values of these
covariates. Both “number of diagnoses” and “number of procedures” will be treated
as Type III TDC, as it is reasonable to assume feedback between the probability of
rehospitalization within 30 days and these two counts.
For the GMM-EC model there will be no assumptions of specific types of TDC.
Instead the extended classification process will be used to select appropriate valid
moment conditions to be used in the GMM quadratic form. These GMM methods
are not yet available in SAS; R functions written by the author can be requested.
Results of fitting all five models are presented in Table 2. First consider the
results of the conditional models. For the model including a random intercept, the
variation associated with that intercept (0.1472) is significant, suggesting there is
significant individual variation in the baseline probability of rehospitalization within
30 days. For all models the time indicators have significant negative coefficients,
which implies the chance of rehospitalization within 30 days is significantly lower
for later follow-up visits. This is suggestive of either a patient fatigue effect in which
an individual tires of visiting the hospital, or the positive impact of multiple visits
on curing an illness.
The decomposed TDC in this model provide interesting interpretations. The
“between” components of the TDC provide population-averaged types of conclu-
sions. For example, there is evidence that subjects with higher average length of stay
tend to have a higher probability of rehospitalization (0.0736), perhaps an indication
of more serious illnesses. The “within” components provide interpretations of
individual effects over time. For example, there is evidence that an increase in the
number of diagnoses for an individual is associated with a higher probability of
rehospitalization (0.0780), perhaps an indication of identifying additional illnesses.
Results for the model including a random-slope for length of stay are similar.
Within the RS model, the variation in the length of stay slope (0.0025) is significant,
indicating meaningful individual variation in the effect of length of stay on the
probability of rehospitalization. The variation in the intercept (0.1512) remains
significant. Two changes are evident when compared to the random-intercept model.
First, the random-slope model shows a significant positive association with length of
stay within subjects, suggesting an increase in length of stay over time is associated
with a higher probability of rehospitalization within 30 days. Second, the RS model
shows a significant positive association with existence of coronary atherosclerosis
between subjects, suggesting an increase in the probability of rehospitalization
within 30 days for subjects who eventually develop coronary atherosclerosis.
Next consider the results of the marginal models. For all three of the models
IGEE, GMM-Types, and GMM-EC, the parameter associated with length of stay is
positive and significant. This indicates that, when comparing two populations with
different average lengths of stay, the population with the higher length of stay has
a higher probability of rehospitalization within 30 days. Notice that while all three
marginal models show a negative effect for the number of procedures, significance is
identified with GMM but not with GEE. This is to be expected, as GMM is intended
to improve the efficiency over the conservative IGEE process. Also notice that the
signs of significant “between” effects for the conditional models are similar to those
of the corresponding effects in the marginal models. This is also to be expected, as
“between” effects produce conclusions similar to the population-averaged marginal
model conclusions.
Overall fit statistics are provided but may not provide meaningful information for
selection between conditional and marginal models. Selecting the most appropriate
model is often based on researcher intentions. The IGEE model is a safe choice,
Table 2 Conditional and marginal logistic regression models

                 Parameter estimates and significance
Parameter        RI—within  RI—between  RS—within  RS—between  IGEE     GMM-types  GMM-EC
Diagnoses        0.0780     0.0444      0.0686     0.0362      0.0648   0.0613     0.0573
Procedures       0.0188     0.0824      0.0092     0.0915      0.0306   0.0458     0.0366
LOS              0.0008     0.0736      0.0200     0.0952      0.0344   0.0530     0.0497
C.A.             0.2607     0.2223      0.2646     0.3050      0.1143   0.0536     0.0802
Time 2           0.3730                 0.4061                 0.3876   0.4004     0.3933
Time 3           0.2130                 0.2357                 0.2412   0.2417     0.2633
Intercept        0.1472                 0.1512
Slope            –                      0.0025
Gen chi-sq/DF    0.98                   0.96
QIC                                                            6648.52
QICu                                                           6646.56
Significance cut-points: 0, 0.001, 0.01, 0.05, 0.10
but generally lacks the power of the GMM models. The conditional models are an
appropriate choice when subject-specific designs and conclusions are of interest,
but also impose the assumption of a block-diagonal marginal variance–covariance
structure.
The most powerful and appropriate choice appears to be the GMM method that
avoids the necessary condition of Eq. (8) presented by Pepe and Anderson (1994),
and allows for TDC to be treated differently from each other. In this sense the
Extended Classification method provides the most flexibility, as moment conditions
are selected individually based on empirical evidence from the dataset. In this data
example the results of both the GMM-Types and GMM-EC models are quite similar,
yielding the same signs of parameter estimates and similar significance levels,
which suggests the researcher-selected types of covariates are probably appropriate
according to the dataset.
6 Discussion
TDC occur commonly in practice, as data collected for longitudinal studies often
change over time. There are numerous ways to classify TDC. The most common
type of classification is as exogenous versus endogenous covariates. Exogenous
covariates vary according to factors external to the system under consideration,
while endogenous covariates show association with other recorded variables. It is
most important to identify exogeneity with respect to the response variable.
TDC more recently have been classified according to four “types” that reflect the
nature of the association between the TDC and the response. While these definitions
are related to exogeneity, they do not represent the same characteristics. Instead, the
different types of TDC reflect different levels of association between covariates and
responses at different times, with the most substantial relationship a “feedback” loop
between covariates and response at different times.
Existing methods for modeling longitudinal data with TDC can be split into two
classes: conditional models and marginal models. Conditional models incorporate
random effects into the systematic component of the model to account for the
autocorrelation in responses. To accommodate TDC, individual regression terms
can be decomposed into contributions from variation “within” subjects and vari-
ation “between” subjects. When maximum-likelihood-type methods are applied
to estimate parameters in conditional models, there is an implicit assumption of
independence between the response at one time and covariate values at other times.
If this assumption is not met, the likelihood estimating equations will not have
zero expectation because of off-diagonal components of the response variance–
covariance structure, which can bias parameter estimates.
Marginal models, on the other hand, define a marginal response (quasi-) distribu-
tion through specification of a marginal mean and a marginal variance–covariance
structure. The most commonly used such method is the GEE. To accommodate
TDC, it has been recommended that the independent working correlation structure
The random-intercept (RI) model discussed in Sect. 5 can be fit using the
following SAS commands.
/* PROC GLIMMIX DOES NOT REQUIRE INITIAL VALUES */
PROC GLIMMIX DATA=ASID_DATA;
CLASS subject_id;
MODEL readmission(event = ’1’) = diagnoses_w diagnoses_b
procedures_w procedures_b
LOS_w LOS_b
CA_w CA_b
time2 time3
/ DIST=BINARY LINK=LOGIT
DDFM=BW SOLUTION;
RANDOM INTERCEPT/ subject=subject_id;
RUN;
PROC NLMIXED DATA=ASID_DATA;
eta = beta0 + u + beta1*diagnoses_w + beta2*diagnoses_b
    + beta3*procedures_w + beta4*procedures_b
    + beta5*LOS_w + beta6*LOS_b + beta7*CA_w + beta8*CA_b
    + beta9*time2 + beta10*time3;
exp_eta = exp(eta);
pi = ((exp_eta)/(1+exp_eta));
MODEL readmission ~ BINARY(pi);
RANDOM u ~ NORMAL(0, sigmau*sigmau) SUBJECT=subject_id;
RUN;
Alternatively, the model can be fit using R with the following commands.
install.packages("lme4")
library(lme4)
# USE start=c(diagnoses_w=, ... ) OPTION TO SPECIFY INITIAL VALUES #
# USE INDEPENDENT GEE FOR INITIAL VALUES #
R_Int = glmer(readmission ~ diagnoses_w+diagnoses_b
+procedures_w+procedures_b+LOS_w+LOS_b
+CA_w+CA_b
+time2+time3 + (1|subject_id),family=binomial,
REML=FALSE,data=ASID_DATA)
summary(R_Int)
The random-slope (RS) model discussed in Sect. 5 can be fit using the
following SAS commands.
/* PROC GLIMMIX DOES NOT REQUIRE INITIAL VALUES */
PROC GLIMMIX DATA=ASID_DATA;
CLASS subject_id;
MODEL readmission(event = ’1’) = diagnoses_w diagnoses_b
procedures_w procedures_b
LOS_w LOS_b
CA_w CA_b
time2 time3
/ DIST=BINARY LINK=LOGIT
DDFM=BW SOLUTION;
RANDOM INTERCEPT LOS / subject=subject_id;
run;
PROC NLMIXED DATA=ASID_DATA;
eta = beta0 + u + beta1*diagnoses_w + beta2*diagnoses_b
    + beta3*procedures_w + beta4*procedures_b
    + beta5*LOS_w + beta6*LOS_b
    + beta7*CA_w + beta8*CA_b
    + beta9*time2 + beta10*time3
    + rb1*LOS;
exp_eta = exp(eta);
pi = ((exp_eta)/(1+exp_eta));
MODEL readmission ~ BINARY(pi);
RANDOM u rb1 ~ NORMAL([0, 0], [s2u, 0, s2f]) SUBJECT=subject_id;
RUN;
Alternatively, the model can be fit using R with the following commands.
install.packages("lme4")
library(lme4)
# USE start=c(diagnoses_w=, ... ) OPTION TO SPECIFY INITIAL VALUES #
# USE INDEPENDENT GEE FOR INITIAL VALUES #
R_Slopes = glmer(readmission ~ diagnoses_w+diagnoses_b
+procedures_w+procedures_b+LOS_w+LOS_b+CA_w+CA_b
+time2+time3 + (1|subject_id)+(0+LOS|subject_id),
family=binomial, REML=FALSE,
start=c(diagnoses_w=, ...), data=ASID_DATA)
summary(R_Slopes)
Alternatively, the model can be fit using R with the following commands.
install.packages("geepack")
library(geepack)
Ind_GEE = geeglm(readmission ~ diagnoses+procedures+LOS+CA
+time2+time3, family=binomial,
id=subject_id,corstr="independence",
data=ASID_DATA)
summary(Ind_GEE)
References
Phillips, M.M., Phillips, K.T., Lalonde, T.L., Dykema, K.R.: Feasibility of text messaging for
ecological momentary assessment of marijuana use in college students. Psychol. Assess.
26(3), 947–957 (2014)
Wooldridge, J.M.: Introductory Econometrics: A Modern Approach, 4th edn. Cengage Learning,
South Melbourne (2008)
Zeger, S.L., Liang, K.Y.: Longitudinal data analysis for discrete and continuous outcomes.
Biometrics 42, 121–130 (1986)
Zeger, S.L., Liang, K.Y.: An overview of methods for the analysis of longitudinal data. Stat. Med.
11(14–15), 1825–1839 (1992)
Zeger, S.L., Liang, K.Y., Albert, P.S.: Models for longitudinal data: a generalized estimating
equation approach. Biometrics 44(4), 1049–1060 (1988)
Ziegler, A.: The different parametrizations of the gee1 and gee2. In: Seeber, G.U.H., et al. (eds.)
Statistical Modelling, pp. 315–324. Springer, New York (1995)
Solving Probabilistic Discrete Event Systems
with Moore–Penrose Generalized Inverse Matrix
Method to Extract Longitudinal Characteristics
from Cross-Sectional Survey Data
1 Background
up to 2006. By the end of the last follow-up, we will have longitudinal data for this
birth cohort by years from 2001 to 2006, corresponding to the age groups from 10
to 15 (the dashed-lined boxes in the figure). Transitional probabilities that describe
the progression of the smoking behavior (e.g., from never user to experimenter, to
regular user, etc.) during a 1-year period can then be estimated directly by dividing
the number of new users identified during the 1-year follow-up by the total number
of at-risk subjects at the beginning of each period.
In contrast to the longitudinal data, if one-wave cross-sectional survey was
conducted with a random sample of individual subjects 10–15 years old in 2005,
we would have data by age for subjects also 10–15 years old (the solid-lined boxes,
Fig. 1) as compared to the longitudinal data collected from 2001 to 2006. If we add
up the number of various types of smokers by age, the differences in the numbers
of these smokers between any two consecutive age groups from the cross-sectional
data are analogous to the differences in the numbers of smokers in a birth cohort in
any two consecutive years from the longitudinal survey in the same age range. In
other words, cross-sectional data do contain longitudinal information.
Besides the fact that cross-sectional surveys are less costly to collect, and hence
much more survey data are available, cross-sectional data have advantages over
longitudinal data in the following aspects. (1) Selection biases due to attrition:
attrition or loss of follow-up is a common and significant concern with survey
data collected through a longitudinal design. Data from social behavior research
indicate that participants who missed the follow-up are more likely to be smokers
(or other drug users). This selective attrition will threaten the validity of longitudinal
data. (2) Inaccuracy of survey time: for an ideal longitudinal survey, each wave
of data collection should be completed at one time point (e.g., January 1, 2005
for wave 1 and January 1, 2006 for wave 2). However, a good survey usually
involves a population with large numbers of participants. Collecting data from such
large samples cannot be completed within 1 or 2 days, resulting in time errors in
measuring behavior progression even with advanced methodologies. For example,
a participant may be surveyed once on January 1, 2005 and then again on March
1, 2006, instead of January 1, 2006. This will cause a time error. (3) Hawthorne
(survey) effect: repeatedly asking the same subjects the same questions over time
may result in biased data. (4) Recall biases: to obtain data on behavior dynamics, a
longitudinal survey may ask each participant to recall in great detail on his or her
behavior in the past (e.g., exact date of smoking onset, exact age when voluntarily
stopped smoking after experimentation); this may result in erroneous data due to
memory loss. (5) Age range of the subjects in a longitudinal sample shifts up as the
subjects are followed up over time, affecting the usefulness of such data. (6) Time
required: a longitudinal survey takes several years to complete, while a cross-sectional
survey can be done in a relatively short period of time.
Despite the success, the established PDES method has a limitation: the model
cannot be determined without extra exogenous equations. Furthermore, such equa-
tions are often impractical to obtain and even if an equation is derived, the data
supporting the construction of the equation may be error prone. To overcome the
limitation of the PDES modeling, we propose to use the Moore–Penrose (M–P)
inverse matrix method that can solve the established PDES model without exoge-
nous equation(s) to create a full-ranked coefficient matrix. The Moore–Penrose
(M–P) generalized inverse matrix theory (Moore and Barnard 1935; Penrose 1955)
is a powerful tool to solve a linear equation system that cannot be solved by using
the classical inverse of the coefficient matrix. Although M–P matrix theory has been
used to solve challenging problems in operations research, signal processing, systems
control, and various other fields (Campbell and Meyer 1979; Ying and Jia 2009;
Nashed 1976; Cline 1979), to date this method has not been used in human behavior
research. In this chapter, we demonstrate the applicability of this M–P Approach
with PDES modeling in characterizing the health risk behavior of an adolescent
population. The application of the M–P inverse matrix based methodology (or M–P
Approach for short) will increase the efficiency and utility of PDES modeling in
investigating many dynamics of human behavior. To facilitate the use of the M–P
Approach, an R program with examples and data is provided in the Appendix for
interested readers to apply to their own research data.
We give a brief review of the PDES model in this section; detailed descriptions of
this PDES can be found in Lin and Chen (2010) and Chen and Lin (2012). We
make use of the notation in Lin and Chen (2010) to describe the PDES model.
According to Lin and Chen (2010), in estimating the transitional probability with
cross-sectional survey data to model smoking multi-stage progression (Fig. 2), five
behavioral states/stages are defined to construct a PDES as follows:
• NS—never-smoker, a person who has never smoked by the time of the survey.
• EX—experimenter, a person who smokes but not on a regular basis after
initiation.
[Fig. 2 Smoking behavior progression among the behavioral states NS, EX, SS, RS, and QU, with transition probabilities σ1–σ11]
$$G = (Q, \Sigma, \delta, q_0) \qquad (1)$$
For example, Eq. (2) states that the percentage of people who are never-smokers
at age a + 1 is equal to the percentage of never-smokers at age a minus the
percentage of never-smokers at age a times the percentage of never-smokers who
start smoking at age a. Similar explanations can
be given for the other equations. Furthermore, we have the following additional
equations with respect to Fig. 2.
The above ten equations, Eqs. (2) to (11), can be cast into the matrix format:

$$
\begin{bmatrix}
0 & -NS(a) & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & NS(a) & 0 & -EX(a) & SS(a) & 0 & -EX(a) & 0 & 0 & 0 & 0\\
0 & 0 & 0 & EX(a) & -SS(a) & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & EX(a) & 0 & -RS(a) & QU(a) & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & RS(a) & -QU(a) & 0\\
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{bmatrix}
\begin{bmatrix}
\sigma_1(a)\\ \sigma_2(a)\\ \sigma_3(a)\\ \sigma_4(a)\\ \sigma_5(a)\\ \sigma_6(a)\\ \sigma_7(a)\\ \sigma_8(a)\\ \sigma_9(a)\\ \sigma_{10}(a)\\ \sigma_{11}(a)
\end{bmatrix}
=
\begin{bmatrix}
NS(a+1) - NS(a)\\ EX(a+1) - EX(a)\\ SS(a+1) - SS(a)\\ RS(a+1) - RS(a)\\ QU(a+1) - QU(a)\\ 1\\ 1\\ 1\\ 1\\ 1
\end{bmatrix}
\qquad (12)
$$
ago) and (2) QU, old quitters (e.g., those who quit smoking 1 year ago). With data
for these two types of newly defined smokers, two more independent equations (the
proportion of old self-stoppers at age a + 1 equals SS(a)σ6(a), and the proportion of
old quitters at age a + 1 equals QU(a)σ11(a)) are derived to ensure that the equation
set (12) has a unique solution. However, the introduction of these two newly defined
types of smokers may also have brought in more errors, because they must be derived
from data recalled 1 year earlier than the other data. If this is the case, errors introduced
through these two newly defined smokers will affect the estimated transitional
probabilities that are related to self-stoppers and quitters, including σ3, σ4, σ5, σ6,
σ10, and σ11 (refer to Fig. 2). When
searching for methods that can help to solve Eq. (12) without depending on the two
additional equations, we found that the generalized inverse matrix approach can
be applied here successfully. It is this “M–P Approach” that makes the impossible
PDES model possible.
Inspired by the generalized inverse matrix theory, particularly the work by Moore and
Barnard (1935) and Penrose (1955), we introduce a mathematical approach to this
problem: the M–P Approach.
In their famous work, Moore and Barnard (1935) proposed four conditions for the
generalized inverse $A^{-}$ defined above. They are as follows:

$$A A^{-} A = A, \qquad (13)$$

$$A^{-} A A^{-} = A^{-}, \qquad (14)$$
To demonstrate the M–P Approach in solving the PDES model, a linear equation
system without full rank, we make use of the R library “MASS” (Venables and
Ripley 2002). This package includes a function named “ginv.” It is devised specif-
ically to calculate the Moore–Penrose generalized-inverse of a matrix. We used
this function to calculate the Moore–Penrose generalized-inverse of the coefficient
matrix A in the PDES smoking behavior model described in Eq. (12).
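As a quick illustration of what ginv does, the following minimal R sketch applies it to an arbitrary toy rank-deficient system (the values are illustrative and not the PDES coefficient matrix); the Appendix code follows the same pattern on the actual model.

library(MASS)

# Minimal sketch: Moore-Penrose solution of a rank-deficient system A x = b
A <- rbind(c(1, 1, 0),
           c(0, 1, 1))       # 2 equations, 3 unknowns: no classical inverse exists
b <- c(1, 2)

x_mp <- ginv(A) %*% b        # minimum-norm solution of A x = b
print(x_mp)
print(A %*% x_mp)            # reproduces b, confirming x_mp solves the system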
As presented in Lin and Chen (2010), smoking data from the 2000 National Survey
on Drug Use and Health (NSDUH) are compiled for US adolescents and young
adults aged 15–21 (Table 1). According to the PDES, the state probability for each
of the seven types of smokers by single year of age was calculated with the NSDUH
data (Table 1). The state probabilities were estimated as the percentages of subjects
in various behavioral states. Since the five smoking stages (i.e., NS, EX, SS, RS,
QU) are all defined on the current year, their probabilities sum to one (i.e., 100 %);
the two additional states, old self-stoppers and old quitters, were defined as the
participants who had self-stopped smoking or quit 1 year before.
With data for the first five types of smokers in Table 1, we estimated the transition
probabilities with the M–P Approach. The results are shown in Table 2.
For validation and comparison purposes, we also computed the transitional proba-
bilities using data for all seven types of smokers and the original PDES method by
Lin and Chen (2010) using R (the codes are also included in the Appendix). The results
in Table 3 were almost identical to those reported in the original study by Lin
and Chen (2010).
As expected, by comparing the results in Table 2 with those in Table 3, for the
five transitional probabilities (i.e., σ1, σ2, σ7, σ8, σ9) that are not directly affected
by the two additionally defined stages (old self-stoppers and old quitters),
the results from the "M–P Approach" are almost identical to those from the original
method. By contrast, the other six estimated probabilities (σ3, σ4, σ5,
σ6, σ10, σ11) show noticeable differences between the two methods. For example,
compared with the original estimates by Lin and Chen (2010), σ10 (the transitional
probability of relapsing to smoking again) from the "M–P Approach" is higher and
σ11 (the transitional probability of remaining a quitter) is lower; furthermore,
these two probabilities show little variation across ages compared to the originally
reported results.
To our understanding, the results from the “M–P Approach” are more valid for a
number of reasons. (1) The M–P Approach did not use additional data from which
more errors could be introduced. (2) More importantly, the results from the M–P
Approach scientifically make more sense than those estimated with the original
method. Using σ10 and σ11 as examples, it has been documented biologically that
it is much harder for adolescent smokers who quit to remain quitters than to
relapse and smoke again (Turner et al. 2005; Kralikova et al. 2013; Reddy et al.
2003). Consistent with this finding, the estimated σ10 (quitters relapsing to regular
smoking) was higher and σ11 (quitters remaining as quitters) was lower with the new
method than with the original method. The results from the "M–P Approach"
more accurately characterize these two steps of smoking behavior progression.
Furthermore, the likelihood of relapsing or of remaining a quitter is largely determined
by the level of addiction to nicotine rather than by chronological age (Panlilio et al. 2012;
Carmody 1992; Drenan and Lester 2012; Govind et al. 2009; De Biasi and Salas
2008). Consistent with this evidence, the estimated σ10 and σ11 with the "M–P
Approach" varied much less with age than those estimated with the original
method. Similar evidence supporting the high validity of the "M–P Approach" is the
difference in the estimated σ6 (self-stoppers remaining as self-stoppers) between
the two methods. The probability estimated through the "M–P Approach" showed
a declining trend with age, reflecting the dominant influence of peers and society
rather than nicotine dependence (Turner et al. 2005; Lim et al. 2012; Castrucci and
Gerlach 2005). However, no clear age trend was observed in the same probability
σ6 estimated using the original method by Lin and Chen (2010).
library(MASS)

newPDES <- function(a){
  # get the coefficient matrix in Eq. (12) (10 equations)
  r1  <- c(0, -dat$NS[a], 0, 0, 0, 0, 0, 0, 0, 0, 0)
  r2  <- c(0, dat$NS[a], 0, -dat$EX[a], dat$SS[a], 0, -dat$EX[a], 0, 0, 0, 0)
  r3  <- c(0, 0, 0, dat$EX[a], -dat$SS[a], 0, 0, 0, 0, 0, 0)
  r4  <- c(0, 0, 0, 0, 0, 0, dat$EX[a], 0, -dat$RS[a], dat$QU[a], 0)
  r5  <- c(0, 0, 0, 0, 0, 0, 0, 0, dat$RS[a], -dat$QU[a], 0)
  r6  <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
  r7  <- c(0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0)
  r8  <- c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)
  r9  <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0)
  r10 <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
  out <- rbind(r1, r2, r3, r4, r5, r6, r7, r8, r9, r10)
  # get the right-side vector
  vec <- c(dat$NS[a + 1] - dat$NS[a], dat$EX[a + 1] - dat$EX[a],
           dat$SS[a + 1] - dat$SS[a], dat$RS[a + 1] - dat$RS[a],
           dat$QU[a + 1] - dat$QU[a], 1, 1, 1, 1, 1)
  t(ginv(out) %*% matrix(vec, ncol = 1))
} # end of "newPDES"

# Step 3.2. Calculations to generate Table 2 in the current paper
newtab2 <- rbind(newPDES(1), newPDES(2), newPDES(3), newPDES(4), newPDES(5), newPDES(6))
colnames(newtab2) <- paste("sig", 1:11, sep = "")
rownames(newtab2) <- dat$Age[-7]
print(newtab2)  # Output as seen in Table 2 in the current paper
References
Campbell, S.L., Meyer, C.D.J.: Generalized Inverses of Linear Transformations. Pitman, London
(1979)
Carmody, T.P.: Preventing relapse in the treatment of nicotine addiction: current issues and future
directions. J. Psychoactive Drugs 24(2), 131–158 (1992)
Castrucci, B.C., Gerlach, K.K.: The association between adolescent smokers’ desire and intentions
to quit smoking and their views of parents’ attitudes and opinions about smoking. Matern. Child
Health J. 9(4), 377–384 (2005)
Chen, X., Lin, F.: Estimating transitional probabilities with cross-sectional data to assess smoking
behavior progression: a validation analysis. J. Biom. Biostat. 2012, S1-004 (2012)
Chen, X., Lin, F., Zhang, X.: Validity of PDES method in extracting longitudinal information from
cross-sectional data: an example of adolescent smoking progression. Am. J. Epidemiol. 171,
S141 (2010)
Chen, X.G., Ren, Y., Lin, F., MacDonell, K., Jiang, Y.: Exposure to school and community based
prevention programs and reductions in cigarette smoking among adolescents in the United
States, 2000–08. Eval. Program Plann. 35(3), 321–328 (2012)
Cline, R.E.: Elements of the Theory of Generalized Inverses for Matrices. Modules and Mono-
graphs in Undergraduate Mathematics and its Application Project. Education Development
Center, Inc., Knoxville (1979)
De Biasi, M., Salas, R.: Influence of neuronal nicotinic receptors over nicotine addiction and
withdrawal. Exp. Biol. Med. (Maywood) 233(8), 917–929 (2008)
Drenan, R.M., Lester, H.A.: Insights into the neurobiology of the nicotinic cholinergic system
and nicotine addiction from mice expressing nicotinic receptors harboring gain-of-function
mutations. Pharmacol. Rev. 64(4), 869–879 (2012)
Govind, A.P., Vezina, P., Green, W.N.: Nicotine-induced upregulation of nicotinic receptors:
underlying mechanisms and relevance to nicotine addiction. Biochem. Pharmacol. 78(7),
756–765 (2009)
Kralikova, E., et al.: Czech adolescent smokers: unhappy to smoke but unable to quit. Int. J. Tuberc.
Lung Dis. 17(6), 842–846 (2013)
Lim, M.K., et al.: Role of quit supporters and other factors associated with smoking abstinence
in adolescent smokers: a prospective study on quitline users in the Republic of Korea. Addict.
Behav. 37(3), 342–345 (2012)
Lin, F., Chen, X.: Estimation of transitional probabilities of discrete event systems from cross-
sectional survey and its application in tobacco control. Inf. Sci. (Ny) 180(3), 432–440 (2010)
Moore, E.H., Barnard, R.W.: General Analysis, Part I. The American Philosophical Society,
Philadelphia (1935)
Nashed, M.Z.: Generalized Inverses and Applications. Academic, New York (1976)
Panlilio, L.V., et al.: Novel use of a lipid-lowering fibrate medication to prevent nicotine reward
and relapse: preclinical findings. Neuropsychopharmacology 37(8), 1838–1847 (2012)
Penrose, R.: A generalized inverse for matrices. Proc. Camb. Philos. Soc. 51, 406–413 (1955)
Reddy, S.R., Burns, J.J., Perkins, K.C.: Tapering: an alternative for adolescent smokers unwilling
or unable to quit. Pediatr. Res. 53(4), 213a–214a (2003)
Turner, L.R., Veldhuis, C.B., Mermelstein, R.: Adolescent smoking: are infrequent and occasional
smokers ready to quit? Subst. Use Misuse 40(8), 1127–1137 (2005)
Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S, 4th edn. Springer, New York
(2002)
Ying, Z., Jia, S.H.: Moore–Penrose generalized inverse matrix and solution of linear equation
group. Math. Pract. Theory 39(9), 239–244 (2009)
Part II
Modelling Incomplete or Missing Data
On the Effects of Structural Zeros
in Regression Models
Hua He, Wenjuan Wang, Ding-Geng (Din) Chen, and Wan Tang
Abstract Count variables are commonly used in public health research. However,
the count variables often do not precisely capture differences among subjects in a
study population of interest. For example, drinking outcomes such as the number
of days of any alcohol drinking (DAD) over a period of time are often used to
assess alcohol use in alcohol studies. A DAD value of 0 for a subject could mean
that the subject was continually abstinent from drinking such as lifetime abstainers
or that the subject was alcoholic, but happened not to use any alcohol during the
period of time considered. In statistical analysis, zeros of the first kind are referred
to as structural zeros, to distinguish them from the second type, sampling zeros. As
the example indicates, the structural and sampling zeros represent two groups of
subjects with quite different psychosocial outcomes. Although many recent studies
have begun to explicitly account for the differences between the two types of zeros in
modeling drinking variables as responses, none have acknowledged the implications
of the different types of zeros when such drinking variables are used as predictors.
This chapter is an updated version of He et al. (J Data Sci 12(3), 2014), where we
first attempted to tackle the issue and illustrate the importance of disentangling the
structural and sampling zeros in alcohol research using simulated as well as real
study data.
1 Introduction
Count data with structural zeros is a common phenomenon in public health research
and it is important, both conceptually and methodologically, to pay special attention
to structural zeros in such count variables. Structural zeros refer to zero responses
by those subjects whose count responses will always be zero, in contrast to random
(or sampling) zeros that occur in subjects whose count responses can be greater than
zero, but appear to be zero due to sampling variability. For example, the number of
days of alcohol drinking (DAD) is commonly used to measure alcohol consumption
in alcohol research. Subjects who were always, or became, continually abstinent
from drinking during a given time period yield structural zeros and form a non-
risk group of individuals for such drinking outcomes, while the remaining subjects
constitute the at-risk group. In HIV/AIDS research, the number of sexual partners
is often used as a risk factor for HIV/AIDS; subjects with lifetime celibacy yield
structural zeros and constitute the non-risk group for HIV/AIDS, while subjects
who happened to have no sex during the study period have random zeros and are part
only supported by the excessive number of zeros observed in the distributions of
count responses such as alcohol consumptions from many epidemiologic studies
focusing on alcohol and related substance use, but also conceptually needed to
serve as a basis for causal inference, as the two groups of subjects can have quite
different psychosocial outcomes. In fact, the issue of structural zeros has been well
acknowledged (Horton et al. 1999; Pardini et al. 2007; Neal et al. 2005; Hagger-
Johnson et al. 2011; Connor et al. 2011; Buu et al. 2011; Fernandez et al. 2011;
Cranford et al. 2010; Hildebrandt et al. 2010; Hernandez-Avila et al. 2006) and it
has been an active research topic for over a decade. However, nearly all of the studies
focus on the cases when the count variables are treated as response (outcome)
variables, by using mixture modeling approaches such as zero-inflated Poisson (ZIP)
models (Hall 2000; Hall and Zhang 2004; Yu et al. 2012; Tang et al. 2012) to model
both the count outcome for the at-risk group and the structural zeros for the non-risk group.
However, it is also important to study the issue of structural zeros when such count
variables are serving as predictors.
For instance, DAD is often used as a predictor to study the effects of alcohol
use on other psychosocial outcomes such as depression. Common approaches are
to treat the DAD as a continuous predictor or to dichotomize it into a binary predictor
(zeros vs. non-zeros, i.e., positive outcomes). Neither approach can distinguish
the differential effects of structural zeros from those of their random counterparts. Compared
to the dichotomized version of the count variable, the continuous DAD predictor
allows one to study the dose effects of alcohol use, but it can model neither the
differential effects of structural and random zeros on the outcome nor yield
valid inferences on the other components of the model. This practice is
often adopted for modeling convenience, but in many studies it does not reflect the
true associations of the variables involved. Hence it is essential to model the difference
between the structural and random zeros.
In this chapter, we use simulated studies as well as real study data to illustrate
the importance of modeling the differential effects of structural zeros when a count
variable is used as a predictor. In Sect. 2, we present some background for the
structural zeros issue in regression models with such count variables as predictors.
We then propose models to assess the differential effects of structural zeros in
Sect. 3, and compare the proposed models with conventional models where the
effects of structural and random zeros are not delineated in terms of possible
biases that may result and the impact on power using simulation studies in Sect. 4.
The results from a real data example are presented in Sect. 5, and the chapter is
concluded with a discussion in Sect. 6. Much of the material in this chapter was
previously presented in He et al. (2014).
2 Background
Count variables in public health, especially in behavioral research, often measure the
severity of some kind of behavior, such as alcohol use, and often contain excessive
zeros, referred to as structural zeros. The presence of structural zeros has long
been acknowledged as a serious problem for data analysis (Clifford Cohen 1965;
Johnson and Kotz 1969; Lachenbruch 2001). In practice, such count variables are
treated in a variety of ways when they are analyzed as the response variables.
For example, in alcohol studies the DAD has been transformed to percentages
as in “percentage of days abstinent” (Babor et al. 1994) or “percentage of heavy
drinking days" (Allen 2003). However, a fundamental problem with such data is
the high peak at one end of the distribution, in other words, a
point mass at "zero": for example, zero percent days abstinent (common among
participants in epidemiological studies) or zero percent days of heavy drinking
(common among participants in epidemiological studies and alcoholism trials).
Accordingly, researchers have regularly used various transformations to improve
data distributions. For example, an arcsine transformation has been recommended
and routinely adopted to improve the heteroscedasticity of the distribution of the
percent days abstinent variable (Babor et al. 1994). However, the point mass remains:
regardless of the transformation, the zeros are all mapped to a single
value (0 under the arcsine transformation), which then clusters at one end of the
distribution of the transformed variable. Since
the outcomes are in fact a mixture of degenerate zeros consisting of non-risk subjects
and count responses from an at-risk group, it is natural to use mixture models. For
example, under a zero-inflated Poisson regression model, one may apply a log-linear
Poisson regression model for the at-risk group and logistic regression model for
the non-risk group. Since zeros are not identified as structural or random, the two
components of a zero-inflated model must be estimated jointly. Research activity
on zero-inflated models has been extensive, and such models have
become increasingly popular in practice; a search for "zero-inflated" in Google Scholar
returned about 14,300 articles.
However, little attention has been paid to the issue of structural zeros when the
count variable is treated as a predictor. Since the nature of structural
and random zeros can be quite different, their effects on the outcome may also be
very different. Under conventional models, whether the count variable is treated
as continuous or transformed, the differences between structural and random zeros
cannot be assessed, and biased inferences may result.
For notational brevity, we consider only the cross-sectional data setting. The
considerations as well as the conclusions obtained also apply for longitudinal study
data. Given a sample of n subjects, let $y_i$ denote the outcome of interest from the
ith subject $(1 \le i \le n)$. We are interested in assessing the effects of some personal
trait, such as alcohol dependency, on the outcome, along with other covariates,
collectively denoted by $\mathbf{z}_i = (z_{i1}, \ldots, z_{ip})^{\top}$. Suppose that the trait is measured by a
count variable $x_i$ with structural zeros.
Let $r_i$ be the indicator of whether $x_i$ is a structural zero, i.e., $r_i = 1$ if $x_i$
is a structural zero and $r_i = 0$ otherwise. For simplicity, the structural zeros are
assumed to be observed throughout the chapter unless stated otherwise. The
indicator $r_i$ partitions the study sample (population) into two
distinct groups, one consisting of the subjects with $r_i = 1$ and the
other comprising the remaining subjects with $r_i = 0$. Since the trait in many
studies is often a risk factor such as alcohol use, the first group is often referred to
as the non-risk group, while the second as the at-risk group.
Without distinguishing structural zeros from random zeros, one may apply gen-
eralized linear models (GLM) to model the association between the explanatory
variables, the predictor of interest $x_i$ plus the covariates $\mathbf{z}_i$, and the outcome. For
example, if $y_i$ is continuous, we may use the following linear model:

$$y_i = \alpha x_i + \mathbf{z}_i^{\top}\boldsymbol{\beta} + \varepsilon_i, \qquad (1)$$

where $\varepsilon_i$ is an error term.
Here one may include a covariate with a constant value of 1 in $\mathbf{z}_i$ so that the intercept
is included in $\boldsymbol{\beta}$ as well. In general, we may use generalized linear models,

$$g\bigl(E(y_i \mid x_i, \mathbf{z}_i)\bigr) = \alpha x_i + \mathbf{z}_i^{\top}\boldsymbol{\beta}, \qquad (2)$$

where $g(\cdot)$ is a link function, such as a logarithm function for count responses and a
probit or logistic function for binary outcomes.
If the predictor $x_i$ has structural zeros, the structural zeros have a quite different
conceptual interpretation than their random zero counterparts, and this conceptual
difference carries a significant implication for the interpretation of the coefficient $\alpha$
in (1). For example, within the context of drinking outcomes, the difference in $y_i$
between subjects with $x_i = 1$ and $x_i = 0$ has a different interpretation, depending
on whether $x_i = 0$ is a structural or random zero. If $x_i = 0$ is a random zero,
this difference represents the differential effect of drinking on $y_i$ within the drinker
subgroup when the drinking outcome changes from 0 to 1. For a structural zero,
such a difference in $y_i$ speaks to the effect of the trait of drinking on the response
$y_i$. Thus, the model in (1) is flawed since it does not delineate the two types of effects
and must be revised to incorporate the information on structural zeros.
One way to model the effects of a count variable with structural zeros as a
predictor in regression analysis is to distinguish between random and structural
zeros by including an indicator of structural zeros in the model, in addition to the
count variable itself. By expanding $(x_i, \mathbf{z}_i)$ to include $r_i$, it follows from (1) that

$$y_i = \alpha x_i + \gamma r_i + \mathbf{z}_i^{\top}\boldsymbol{\beta} + \varepsilon_i. \qquad (3)$$

Under the refined model above, the association between the trait and the response
can be tested by checking whether both $\alpha = 0$ and $\gamma = 0$. This involves a composite
linear contrast, $H_0: \alpha = 0, \gamma = 0$. If the null $H_0$ is rejected, then either $\alpha \neq 0$, or
$\gamma \neq 0$, or both. The coefficient $\gamma$ is interpreted as the trait effect on the
response $y_i$, all other things being equal. The coefficient $\alpha$ measures the change in
$y_i$ per unit increase in $x_i$ within the at-risk group.
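The refined model (3) can be fit with standard software. The following R sketch is a hedged illustration only: it assumes a hypothetical data frame dat containing the outcome y, the count predictor x, the structural-zero indicator r, and two covariates z1 and z2 (all names are placeholders, not variables from any study cited here).

# Hedged sketch of fitting model (3) and testing the composite null alpha = gamma = 0
fit_full    <- lm(y ~ x + r + z1 + z2, data = dat)   # refined model with the indicator r
fit_reduced <- lm(y ~ z1 + z2, data = dat)            # model without the trait terms

summary(fit_full)               # estimates of alpha (x), gamma (r), and beta (z1, z2)
anova(fit_reduced, fit_full)    # nested-model F test of the composite trait effect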
Similarly, we may introduce a link function to the linear model in (3) to extend
it to generalized linear models to accommodate categorical and count responses.
When the outcome yi itself is a count response with structural zeros, it is not
appropriate to apply Poisson or negative binomial (NB) log-linear models, the
popular models for count responses. Instead, one needs to apply the zero-inflated
Poisson (ZIP) or zero-inflated negative binomial (ZINB) model (Tang et al. 2012;
Lambert 1992; Welsh et al. 1996). The ZIP model extends the Poisson model by
including an additional logistic regression component so that it can model both the
at- and non-risk groups. Estimates from the Poisson log-linear regression component
assess the increased/reduced effects of an explanatory variable on the count response
of interest within the at-risk subgroup, while those from the logistic regression
component indicate the increased/reduced likelihood of being in the non-risk group
associated with an explanatory variable. By replacing the Poisson with the NB, ZINB
models also address the inability of the Poisson component of the ZIP to account for
overdispersion, a common violation of the Poisson assumption that the variance
equals the mean.
By ignoring the structural zeros in $x_i$, one may model $y_i$ using a ZIP model as
follows:

$$\text{structural zero in } y_i \mid x_i, \mathbf{z}_i \sim \mathrm{Bernoulli}(\pi_i), \quad \mathrm{logit}(\pi_i) = \alpha' x_i + \mathbf{z}_i^{\top}\boldsymbol{\beta}'. \qquad (4)$$
4 Simulation Studies
In the first simulation study, the continuous response Y was generated from the linear model Y = c0 + c1 X + c2 R + ε (Model I below), where ε is the error term. If c1 and c2 have different signs, say c1 > 0 > c2, then
the mean of the at-risk subgroup defined by positive X > the mean of the at-risk
subgroup defined by the random zeros of X > the mean of the non-risk group defined
by the structural zeros of X. In this case, this monotone relationship among the three
subgroups will remain, even if the random and structural zeros are not distinguished
from each other. However, if c1 and c2 have the same sign, say both are positive,
c1 , c2 > 0, then the mean of the at-risk subgroup defined by positive X > the mean
of the at-risk group defined by the random zeros of X < the mean of the non-risk
group defined by the structural zeros of X. In such cases, the mean of the non-risk
group may be bigger than the at-risk subgroup defined by positive X, depending on
the relationship between c1 and c2 , and the monotone relationship among the three
subgroups may fail, if random and structural zeros are combined. Thus, to assess
power, we ran simulations to cover both situations, where c1 and c2 had the same
and different signs.
The zero-inflated predictor X was simulated from a ZIP with the probability of
being a structural zero p = 0.2 and the mean of the Poisson component λ = 0.3.
We simulated 1,000 samples with sample sizes of 100, 200, 500, and 1,000 for
several sets of parameters:
For each simulated data set, we fit the four aforementioned models, i.e.,

$$\text{Model I: } Y = c_0 + c_1 X + c_2 R + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2), \qquad (8)$$
$$\text{Model II: } Y = c_0 + c_1 X + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2), \qquad (9)$$
$$\text{Model III: } Y = c_0 + c_1 I_X + c_2 R + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2), \qquad (10)$$
$$\text{Model IV: } Y = c_0 + c_1 I_X + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2), \qquad (11)$$

where $I_X = 1$ if $X > 0$ and $I_X = 0$ otherwise.
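To make the setup concrete, the following R sketch generates one data set under this design (structural-zero probability 0.2, Poisson mean 0.3) and fits Models I–IV; the particular values chosen for c0, c1, c2, and the error standard deviation are illustrative and not necessarily those used for the tables below.

# Hedged sketch of one replicate of the continuous-response simulation
set.seed(1)
n  <- 200
R  <- rbinom(n, 1, 0.2)                      # structural-zero (non-risk) indicator
X  <- ifelse(R == 1, 0, rpois(n, 0.3))       # zero-inflated count predictor
IX <- as.numeric(X > 0)                      # dichotomized predictor

c0 <- 0.5; c1 <- 0.25; c2 <- 0.5; sigma <- 0.5     # illustrative parameter values
Y  <- c0 + c1 * X + c2 * R + rnorm(n, 0, sigma)    # response generated from Model I

mI   <- lm(Y ~ X  + R)   # Model I
mII  <- lm(Y ~ X)        # Model II
mIII <- lm(Y ~ IX + R)   # Model III
mIV  <- lm(Y ~ IX)       # Model IV
lapply(list(I = mI, II = mII, III = mIII, IV = mIV), coef)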
Table 1  Parameter estimates (Mean) and standard errors (Std err) averaged over 1,000 MC
replications for the four models considered in the simulation study with a continuous response

Cases                c0                              c1                              c2
(c1 =)               Model I   II      III     IV    Model I   II      III     IV    Model I   III
-0.5    Mean         0.50      0.62    0.50    0.63  -0.50     -0.60   -0.58   -0.70  0.49      0.49
        Std err      0.046     0.042   0.047   0.042  0.072     0.071   0.094   0.092  0.096     0.097
-0.25   Mean         0.50      0.62    0.50    0.63  -0.25     -0.35   -0.29   -0.42  0.49      0.49
        Std err      0.046     0.042   0.047   0.042  0.072     0.071   0.091   0.089  0.096     0.097
 0      Mean         0.50      0.62    0.50    0.63   0.00     -0.10    0.00   -0.13  0.49      0.49
        Std err      0.046     0.042   0.047   0.042  0.072     0.071   0.090   0.088  0.096     0.097
 0.25   Mean         0.50      0.62    0.50    0.63   0.25      0.15    0.29    0.16  0.49      0.49
        Std err      0.046     0.042   0.047   0.042  0.072     0.071   0.092   0.090  0.096     0.097
 0.5    Mean         0.50      0.62    0.50    0.63   0.50      0.40    0.58    0.45  0.49      0.49
        Std err      0.046     0.042   0.047   0.042  0.072     0.071   0.096   0.094  0.096     0.097
values, the differences reflected the sampling variability. Note that the "true" value
of the parameter c1 under Model III should in fact be

$$E(Y \mid X > 0) - E(Y \mid X = 0 \text{ and } R = 0) = \frac{c_1 \lambda}{\Pr(X > 0)} = \frac{0.3\,c_1}{1 - \exp(-0.3)}, \qquad (12)$$

i.e., -0.58, -0.29, 0.00, 0.29, and 0.58, respectively, because of the grouping of
subjects with X > 0.
Even if one does not care about the size of the effects of X on Y and just wants
to detect association between the two variables, an application of incorrect models
such as Models II and Model IV may still be quite problematic. For example, we
also examined power in detecting association between the trait and the outcome
for the different models, with a type I error of 0.05. For Models II and IV, we can
simply test the null H0: c1 = 0 for this purpose. However, for Models I and
III, there are two terms pertaining to the association of interest, one relating to the
difference between the structural and random zeros in X (c2) and the other associated
with the difference between positive X and random zeros in X (c1). So, we need to test a
composite null, H0: c1 = c2 = 0, in Models I and III. We computed the proportions
of p-values that were less than 0.05 for these hypothesis tests as the empirical power
estimates. Shown in Table 2 are the estimated powers to test the effects of the trait
based on 1,000 MC replications with sample sizes 100, 200, 500, and 1,000 in the
range of values of c1 (and c2 ) considered. The models with the structural zero issue
addressed (Models I and III) were much more powerful in detecting the association
between Y and X than their counterparts (Models II and IV). Thus, models that do
not account for structural zeros such as Models II and IV may not even be able to
perform such a “crude” task.
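A hedged R sketch of this power computation for Model I is given below; the number of replicates, sample size, and parameter values are illustrative choices.

# Hedged sketch: empirical power of the composite test H0: c1 = c2 = 0 under Model I
set.seed(2)
nrep <- 1000; n <- 200
c0 <- 0.5; c1 <- 0.25; c2 <- 0.5; sigma <- 0.5    # illustrative parameter values

reject <- replicate(nrep, {
  R <- rbinom(n, 1, 0.2)
  X <- ifelse(R == 1, 0, rpois(n, 0.3))
  Y <- c0 + c1 * X + c2 * R + rnorm(n, 0, sigma)
  fit1 <- lm(Y ~ X + R)                      # Model I
  fit0 <- lm(Y ~ 1)                          # null model without the trait terms
  anova(fit0, fit1)[2, "Pr(>F)"] < 0.05      # composite F test at the 0.05 level
})
mean(reject)   # empirical power estimate for Model I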
We considered a count response with structural zeros generated from the following
ZIP:

$$\text{non-structural-zero count: } Y \mid X, R \sim \mathrm{Poisson}(\mu), \quad \log(\mu) = c_0 + c_1 X + c_2 R, \qquad (13)$$
$$\text{structural zero: } Y \mid X, R \sim \mathrm{Bernoulli}(\pi), \quad \mathrm{logit}(\pi) = c_0' + c_1' X + c_2' R.$$

As in the previous cases, we fit four different ZIPs to the data simulated with the
same set of parameter values (in addition to $c_0$, $c_1$, and $c_2$ as in the previous cases, we
set $c_0' = c_0$, $c_1' = c_1$, and $c_2' = c_2$). Again, we report the results for the case with
sample size = 200 for the parameter estimates.
Shown in Table 3 are the averaged estimates of the parameters c0, c1, and c2, and the
associated standard errors over the 1,000 MC replications. The same patterns of bias
again emerged from the incorrect models (Models II and IV). The incorrect models
also yielded much lower power than their correct counterparts. Shown in Table 4 are
the estimated powers for testing the effects of the trait on the response. As seen, the
Table 3  Parameter estimates (Mean) and standard errors (Std err) averaged over 1,000 MC
replications for the four models considered in the simulation study with a ZIP response

Cases                c0                              c1                              c2
(c1 =)               Model I   II      III     IV    Model I   II      III     IV    Model I   III
-0.5    Mean         0.48      0.63    0.48    0.63  -0.55     -0.67   -0.60   -0.75  0.48      0.49
        Std err      0.155     0.126   0.155   0.126  0.370     0.360   0.425   0.416  0.298     0.299
-0.25   Mean         0.49      0.63    0.48    0.63  -0.30     -0.43   -0.33   -0.48  0.48      0.49
        Std err      0.152     0.125   0.155   0.126  0.330     0.325   0.392   0.383  0.297     0.299
 0      Mean         0.49      0.63    0.48    0.63  -0.05     -0.17   -0.04   -0.19  0.48      0.49
        Std err      0.153     0.126   0.155   0.126  0.295     0.292   0.342   0.331  0.297     0.299
 0.25   Mean         0.49      0.63    0.48    0.63   0.22      0.10    0.26    0.12  0.48      0.49
        Std err      0.150     0.125   0.155   0.126  0.266     0.263   0.315   0.302  0.295     0.299
 0.5    Mean         0.49      0.63    0.48    0.63   0.49      0.37    0.57    0.42  0.48      0.48
        Std err      0.150     0.125   0.154   0.126  0.229     0.221   0.284   0.268  0.295     0.299
ZIP seems to have similar power as the binary response Y, which is not surprising
given that one of the components of ZIP is the binary response for modeling the
structural zero of Y. Note that there are two components in ZIP models and thus
the results are obtained from testing composite hypotheses. To see if the trait is
associated with the outcome, we tested the null H0: c1 = c1′ = 0 for Models II
and IV, but a different null, H0: c1 = c1′ = c2 = c2′ = 0, for Models I and III.
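For readers who wish to carry out this kind of composite test in R, a hedged sketch using the zeroinfl function from the pscl package and a likelihood-ratio test from the lmtest package is shown below; the data frame dat and its variables Y, X, and R are placeholders for a simulated data set such as the one described above.

# Hedged sketch: ZIP Model I vs. an intercept-only ZIP, with a likelihood-ratio test
# of the composite null c1 = c1' = c2 = c2' = 0
library(pscl)     # zeroinfl()
library(lmtest)   # lrtest()

zip_I    <- zeroinfl(Y ~ X + R | X + R, data = dat, dist = "poisson")  # Model I
zip_null <- zeroinfl(Y ~ 1 | 1,         data = dat, dist = "poisson")  # no trait terms

lrtest(zip_null, zip_I)   # composite test of the trait effect (4 degrees of freedom)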
We now illustrate the effects of structural zeros with a real data example based on the
2009–2010 National Health and Nutrition Examination Survey (NHANES). In this
database, we identified a measure of alcohol use to be examined as an explanatory
variable for depressive symptoms (count response). Both the alcohol and depression
outcomes show a preponderance of zeros because a large percentage of the surveyed
population is not at risk for either of the health issues. The relationship between the
two has been reported in a number of studies (Gibson and Becker 1973; Pettinati
et al. 1982; Dackis et al. 1986; Willenbring 1986; Brown and Schuckit 1988; Penick
et al. 1988; Davidson 1995; Merikangas et al. 1998; Swendsen et al. 1998; Hasin
and Grant 2002).
The NHANES is an annual national survey of the health and nutritional status
of adults and children in the United States. A nationally representative sample
of about 5,000 persons participates each year. Interviews and assessments are
conducted in respondents’ homes. Health assessments are performed in equipped
mobile centers, which travel to locations throughout the country. Starting in 2007,
NHANES has been oversampling all Hispanics (previously Mexican Americans
were oversampled). In the 2009–2010 data set, data were collected from 10,537
individuals of all ages during the 2-year period between January 2009 and December
2010. The race/ethnicity of the sample is 22.5 % Hispanic-Mexican American,
10.8 % Hispanic-other, 18.6 % non-Hispanic Black, 42.1 % non-Hispanic White,
and 6.1 % other.
Alcohol Use Measure In NHANES, for measurement of alcohol use, a different
assessment was done for those aged 12–19 vs. those aged 20 and older; the
assessment for the former age group asked only about the past 30 days, while the
one administered to the latter age group asked about the past year. Therefore, for the
current case study example we only used the data from the cohort aged 20 and
older. Alcohol use (for those aged 20 or above) was assessed with a computer-
assisted personal interview (CAPI). Specific questions of interest for the current
work included number of days of any alcohol drinking (DAD) in the past year,
which is commonly used in alcohol research. This variable was converted to average
number of days drinking per month in our analysis. There were 6,218 subjects in the
data set with age of 20 and older.
In NHANES, one question asks “In your entire life, have you had at least 12
drinks of any type of alcoholic beverage?”. This variable has been used previously
to differentiate lifetime abstainers, who answered "no" to this question, from ex-
drinkers, who answered "yes" (Tsai et al. 2012). Thus, an answer of "no" to this
question is a proxy for structural zeros. Hence the zeros of DAD were contributed by two
distinct risk groups in this study population for the drinking question.
Depression Symptoms Depression Symptoms were measured in those aged 12 and
above in the 2009–2010 NHANES with the Patient Health Questionnaire (PHQ-9)
administered by CAPI. The PHQ-9 is a multiple-choice self-report inventory of nine
items specific to depression. Each item of the PHQ-9 evaluates the presence of one
of the nine DSM-IV criteria for depressive disorder during the last 2 weeks. Each of
the nine items is scored 0 (not at all), 1 (several days), 2 (more than half the days),
or 3 (nearly every day), and a total score is obtained. Among the 6,218 subjects
with CAPI, 5,283 subjects completed the PHQ-9, so there are 935 subjects with
missing values on the PHQ-9.
Covariates In epidemiological samples, several demographic characteristics,
including female gender, older age, not being married, low education, low income
level, poor physical health, social isolation, minority status, and urban residence,
have been associated with higher levels of depressive symptoms or presence of a
major depressive disorder, though overlap among some of these factors suggests
that these may not all be independent influences (Oh et al. 2013; Roberts et al.
1997; Leiderman et al. 2012; Wilhelm et al. 2003; González et al. 2010; Rushton
et al. 2002; Weissman et al. 1996). Based on these findings, in our analyses of
the relationship of alcohol use to depressive symptoms, we incorporated relevant
demographic variables available in NHANES (age, gender, education, race) as
covariates.
Shown in Fig. 2 are the distributions of PHQ9 and DAD, both exhibiting a
preponderance of zeros. Goodness-of-fit tests rejected the Poisson fit of the data in
each case (p-value < 0.001). Further, the Vuong test showed that the
ZIP provided a much better fit than the Poisson (p-value < 0.001). These findings
are consistent with our prior knowledge that this study sample is from a mixed
population consisting of an at-risk and a non-risk subgroup for each of the behavioral
and health outcomes.

[Fig. 2 Distributions of DAD and PHQ9 for the 2009–2010 NHANES data, with the darker-shaded
bar in the distribution of DAD representing structural zeros]
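A hedged R sketch of this Poisson-versus-ZIP comparison, using glm for the Poisson fit and the zeroinfl and vuong functions from the pscl package, is shown below; the data frame nhanes and the variables phq9, dad, and the covariates are placeholders for analysis variables that would have to be constructed from the NHANES files.

# Hedged sketch of the Poisson vs. ZIP goodness-of-fit comparison via the Vuong test
library(pscl)

pois_fit <- glm(phq9 ~ dad + gender + age + race + education,
                family = poisson, data = nhanes)
zip_fit  <- zeroinfl(phq9 ~ dad + gender + age + race + education,
                     data = nhanes, dist = "poisson")

vuong(pois_fit, zip_fit)   # a significant result favors the ZIP over the plain Poisson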
Statistical Models We applied the ZIP to model the PHQ-9 score with DAD in the
past month as the predictor, adjusting for age, gender, race, and education. Since
we had the information to identify the non-risk group for the DAD variable, we
conducted the analysis using two different models. In the first ZIP model, or Model
I, we explicitly modeled the effects of the structural zeros of DAD on PHQ9 using a
binary indicator (NeverDrink = 1 for structural and NeverDrink = 0 for sampling
zeros), and thus both the indicator of the non-risk group for drinking (NeverDrink)
and the DAD variable were included as predictors. As a comparison, we also fit the data
with only the DAD predictor and thus the structural and sampling zeros were not
distinguished in the second ZIP, or Model II. We used SAS 9.3 PROC GENMOD
to fit the models, with parameter estimates based on the maximum likelihood
approach.
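As a hedged alternative to the SAS analysis just described, the two ZIP models could be fit in R with pscl::zeroinfl roughly as sketched below; the data frame nhanes and the variable names (phq9, dad, neverdrink, and the covariates) are placeholders for the analysis file, so this is an illustration of the model specification rather than a reproduction of the published fit.

# Hedged sketch of ZIP Models I and II for the NHANES example (placeholder variable names)
library(pscl)

# Model I: structural zeros of DAD modeled explicitly through the NeverDrink indicator
zip_I  <- zeroinfl(phq9 ~ neverdrink + dad + gender + age + race + education |
                          neverdrink + dad + gender + age + race + education,
                   data = nhanes, dist = "poisson")

# Model II: structural and sampling zeros of DAD not distinguished
zip_II <- zeroinfl(phq9 ~ dad + gender + age + race + education |
                          dad + gender + age + race + education,
                   data = nhanes, dist = "poisson")

summary(zip_I)
summary(zip_II)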
Analysis Results Among the 5,283 subjects with both CAPI and PHQ-9, there
was a small number of missing values in the covariates, and the actual sample
size used for the analysis was 5,261. Shown in Tables 5 and 6 are the parameter
estimates of the Poisson and zero-inflated components of the two ZIP models, respectively.
Table 5 Comparison of model estimates (Estimate), standard errors (Std err), and p-values
(P-value) from the Poisson component for the count response (including random zeros)
for the real study data
Parameter estimates from Poisson component of ZIP
Parameter Model I Model II
Estimate Std err P-value Estimate Std err P-value
Intercept 2.2936 0.0495 <0.0001 2.2695 0.0490 <0.0001
NeverDrink Yes 0.0878 0.0253 0.0005
NeverDrink No 0.0000 0.0000 –
DAD 0.0023 0.0011 0.0423 0.0017 0.0011 0.1409
Gender Male 0.1988 0.0164 <0.0001 0.1890 0.0162 <0.0001
Female 0.0000 0.0000 – 0.0000 0.0000 –
AGE 0.0025 0.0005 <0.0001 0.0026 0.0005 <0.0001
Race/ethnicity Mexican American 0.0954 0.0408 0.0192 0.0928 0.0407 0.0228
Other Hispanic 0.0322 0.0425 0.4495 0.0282 0.0425 0.5072
Non-Hispanic white 0.0952 0.0376 0.0114 0.0864 0.0375 0.0213
Non-Hispanic black 0.0030 0.0401 0.9402 0.0087 0.0401 0.8278
Other race 0.0000 0.0000 – 0.0000 0.0000 –
Education 0.1267 0.0067 <0.0001 0.1246 0.0067 <0.0001
Scale 1.0000 0.0000 1.0000 0.0000
Table 6 Comparison of model estimates (Estimate), standard errors (Std err), and p-values (P-
value) from the Logistic component for the probability of occurrence of structural zeros for the
real study data
Parameter estimates from logistic component of ZIP
Parameter Model I Model II
Estimate Std err P-value Estimate Std err P-value
Intercept 1.9351 0.1998 <0.0001 1.7999 0.1966 <0.0001
NeverDrink Yes 0.4230 0.0945 <0.0001
NeverDrink No 0.0000 0.0000 –
DAD 0.0021 0.0043 0.6191 0.0055 0.0042 0.1868
Gender Male 0.5715 0.0641 <0.0001 0.5174 0.0626 <0.0001
Female 0.0000 0.0000 – 0.0000 0.0000 –
AGE 0.0158 0.0018 <0.0001 0.0164 0.0018 <0.0001
Race/ethnicity Mexican American 0.0883 0.1596 0.5799 0.0594 0.1590 0.7087
Other Hispanic 0.0904 0.1694 0.5934 0.1209 0.1688 0.4739
Non-Hispanic white 0.2250 0.1471 0.1262 0.2714 0.1463 0.0637
Non-Hispanic black 0.0603 0.1563 0.6999 0.0309 0.1558 0.8428
Other race 0.0000 0.0000 – 0.0000 0.0000 –
Education 0.0480 0.0261 0.0658 0.0394 0.0259 0.1284
6 Discussion
In this chapter, we discussed the importance of untangling the structural and random
zeros for count variables with structural zeros in public health. Although older
studies completely ignored structural zeros, many newer ones have attempted to
address this issue. However, all efforts to date have focused on the statistical
problems when the count outcomes are used as a response, or dependent variable,
in regression analysis, with no attention paid to the equally important problem of
biased estimates when such outcomes are used as an explanatory, or independent,
variable. We failed to find any study in the extant literature that even acknowledged
this problem. Our findings are significant in this respect since they show for the first
time the critical importance of delineating the effects of the two different types of
zeros in count outcomes like DAD.
Both our simulated and real study examples demonstrate that it is critical that we
model and delineate the effects of structural and random zeros when using a zero-
inflated count outcome as an explanatory variable in regression analysis. Otherwise,
not only are we likely to miss opportunities to find associations between such a
variable and an outcome of interest (due to a significant loss of power), but we are also likely
to obtain results that are difficult to interpret because of the high bias in the estimates
and the dual interpretation of the value zero of such a count variable. For example, the
estimated coefficient 0.0017 of DAD in the Poisson component of ZIP Model II for
the relationship between PHQ9 and DAD had about 30 % upward bias as compared
with 0.0023 for the same coefficient in the Poisson component of ZIP Model I in the
NHANES analysis. Even ignoring such bias, the estimate 0.0017 was
difficult to interpret; without accounting for structural zeros as in ZIP Model I, the
change in DAD from 0 to 1 has a dubious meaning, since it may mean a change
in the amount of drinking within alcohol users or it may mean the difference between
alcohol users and lifetime abstainers.
In all the examples considered, we assumed linear functions of explanatory
variables for notational brevity. In practice, more complex functions of explanatory
variables may be considered utilizing piecewise linear, polynomial functions or even
nonparametric methods such as local polynomial regression. Also, we limited our
considerations to cross-sectional studies, but the same considerations are readily
applied to longitudinal studies.
We assumed that structural zeros of a count explanatory variable are known in
the simulation studies and the case study example. Although as in the case study
example we sometimes may be able to provide concrete examples of structural
zeroes (e.g., stable abstinence in a clinical trial), they are more appropriately
conceptualized as latent (i.e., unobservable) variables and hence new statistical
methods treating them as such are needed. Treating structural zeroes as latent or
unobserved variables avoids the need to make a priori decisions associated with
observed variables, for example in defining a cutoff to consider abstinence from
drinking as “stable,” a decision that involves subjective judgment and introduces
error. Although lifetime abstainers in alcohol epidemiology studies represent a case
where structural zeroes are conceptually more observable, this too is imperfect,
for example a 23-year-old with no lifetime drinking history would be treated
as a lifetime abstainer in such data. Further, in many studies, such proxies of
structural zeros are not available. For example, no lifetime-abstinence information was collected
in NHANES for heavy drinking. Thus, it is not possible to study the effects of
structural zeros using the models considered in the study. Further research is needed
to address this methodological issue to facilitate research in public health.
Acknowledgements The study was supported in part by National Institute on Drug Abuse grant
R33DA027521, National Institute of General Medical Sciences R01GM108337, Eunice Kennedy
Shriver National Institute of Child Health and Human Development grant R01HD075635, UR
CTSI 8UL1TR000042-09, and a Faculty Research Support Grant from School of Nursing,
University of Rochester.
References
Allen, J.P.: Measuring outcome in interventions for alcohol dependence and problem drinking:
executive summary of a conference sponsored by the national institute on alcohol abuse and
alcoholism. Alcohol. Clin. Exp. Res. 27(10), 1657–1660 (2003)
Babor, T.F., Longabaugh, R., Zweben, A., Fuller, R.K., Stout, R.L., Anton, R.F., Randall, C.L.:
Issues in the definition and measurement of drinking outcomes in alcoholism-treatment
research. J. Stud. Alcohol Suppl. 12, 101–111 (1994)
Brown, S.A., Schuckit, M.A.: Changes in depression among abstinent alcoholics. J. Stud. Alcohol
Drugs 49(05), 412–417 (1988)
Buu, A., Johnson, N.J., Li, R., Tan, X.: New variable selection methods for zero-inflated count data
with applications to the substance abuse field. Stat. Med. 30(18), 2326–2340 (2011)
Clifford Cohen, A. Jr.: Estimation in mixtures of discrete distributions. In: Classical and Conta-
gious Discrete Distributions (Proc. Internat. Sympos., McGill Univ., Montreal, Que., 1963),
pp. 373–378. Statistical Publishing Society, Calcutta (1965)
Connor, J.L., Kypri, K., Bell, M.L., Cousins, K.: Alcohol outlet density, levels of drinking and
alcohol-related harm in New Zealand: a national study. J. Epidemiol. Commun. Health 65(10),
841–846 (2011)
Cranford, J.A., Zucker, R.A., Jester, J.M., Puttler, L.I., Fitzgerald, H.E.: Parental alcohol involve-
ment and adolescent alcohol expectancies predict alcohol involvement in male adolescents.
Psychol. Addict. Behav. 24(3), 386–396 (2010)
Dackis, C.A., Gold, M.S., Pottash, A.L.C., Sweeney, D.R.: Evaluating depression in alcoholics.
Psychiatry Res. 17(2), 105–109 (1986)
Davidson, K.M.: Diagnosis of depression in alcohol dependence: changes in prevalence with
drinking status. Br. J. Psychiatry 166(2), 199–204 (1995)
Fernandez, A.C., Wood, M.D., Laforge, R., Black, J.T.: Randomized trials of alcohol-use interven-
tions with college students and their parents: lessons from the transitions project. Clin. Trials
8(2), 205–213 (2011)
Gibson, S., Becker, J.: Changes in alcoholics’ self-reported depression. Q. J. Stud. Alcohol 34,
829–836 (1973)
González, H.M., Tarraf, W., Whitfield, K.E., Vega, W.A.: The epidemiology of major depression
and ethnicity in the United States. J. Psychiatr. Res. 44(15), 1043–1051 (2010)
Hagger-Johnson, G., Bewick, B.M., Conner, M., O’Connor, D.B., Shickle, D.: Alcohol, conscien-
tiousness and event-level condom use. Br. J. Health Psychol. 16(4), 828–845 (2011)
Hall, D.B.: Zero-inflated Poisson and binomial regression with random effects: a case study.
Biometrics 56(4), 1030–1039 (2000)
Hall, D.B., Zhang, Z.: Marginal models for zero inflated clustered data. Stat. Model. 4(3), 161–180
(2004)
Hasin, D.S., Grant, D.F.: Major depression in 6050 former drinkers - association with past alcohol
dependence. Arch. Gen. Psychiatry 59(9), 794–800 (2002)
He, H., Wang, W., Crits-Christoph, P., Gallop, R., Tang, W., Chen, D., Tu, X.: On the implication
of structural zeros as independent variables in regression analysis: applications to alcohol
research. J. Data Sci. (2013, in press)
Hernandez-Avila, C.A., Song, C., Kuo, L., Tennen, H., Armeli, S., Kranzler, H.R.: Targeted versus
daily naltrexone: secondary analysis of effects on average daily drinking. Alcohol. Clin. Exp.
Res. 30(5), 860–865 (2006)
Hildebrandt, T., McCrady, B., Epstein, E., Cook, S., Jensen, N.: When should clinicians switch
treatments? an application of signal detection theory to two treatments for women with alcohol
use disorders. Behav. Res. Therapy 48(6), 524–530 (2010)
Horton, N.J., Bebchuk, J.D., Jones, C.L., Lipsitz, S.R., Catalano, P.J., Zahner, G.E.P., Fitzmaurice,
G.M.: Goodness-of-fit for GEE: an example with mental health service utilization. Stat. Med.
18(2), 213–222 (1999)
Johnson, N.L., Kotz, S.: Distributions in Statistics: Discrete Distributions. Houghton Mifflin,
Boston (1969)
Lachenbruch, P.A.: Comparisons of two-part models with competitors. Stat. Med. 20(8),
1215–1234 (2001)
Lambert, D.: Zero-inflated poisson regression, with an application to defects in manufacturing.
Technometrics 34(1), 1–14 (1992)
Leiderman, E.A., Lolich, M., Vázquez, G.H., Baldessarini, R.J.: Depression: point-prevalence and
sociodemographic correlates in a Buenos Aires community sample. J. Affect. Disord. 136(3),
1154–1158 (2012)
Merikangas, K.R., Mehta, R.L., Molnar, B.E., Walters, E.E., Swendsen, J.D., Aguilar-Gaziola, S.,
Bijl, R., Borges, G., Caraveo-Anduaga, J.J., Dewit, D.J., et al.: Comorbidity of substance use
disorders with mood and anxiety disorders: results of the international consortium in psychiatric
epidemiology. Addict. Behav. 23(6), 893–907 (1998)
Neal, D.J., Sugarman, D.E., Hustad, J.T.P., Caska, C.M., Carey, K.B.: It’s all fun and games. . . or is
it? Collegiate sporting events and celebratory drinking. J. Stud. Alcohol. Drugs 66(2), 291–294
(2005)
Oh, D.H., Kim, S.A., Lee, H.Y., Seo, J.Y., Choi, B.-Y., Nam, J.H.: Prevalence and correlates of
depressive symptoms in Korean adults: results of a 2009 Korean community health survey. J.
Korean Med. Sci. 28(1), 128–135 (2013)
Pardini, D., White, H.R., Stouthamer-Loeber, M.: Early adolescent psychopathology as a predictor
of alcohol use disorders by young adulthood. Drug Alcohol Depend. 88, S38–S49 (2007)
Penick, E.C., Powell, B.J., Liskow, B.I., Jackson, J.O., Nickel, E.J.: The stability of coexisting
psychiatric syndromes in alcoholic men after one year. J. Stud. Alcohol Drugs 49(05), 395–405
(1988)
Pettinati, H.M., Sugerman, A.A., Maurer, H.S.: Four-year MMPI changes in abstinent and drinking
alcoholics. Alcohol. Clin. Exp. Res. 6(4), 487–494 (1982)
Roberts, R.E., Kaplan, G.A., Shema, S.J., Strawbridge, W.J.: Does growing old increase the risk
for depression? Am. J. Psychiatry 154(10), 1384–1390 (1997)
Rushton, J.L., Forcier, M., Schectman, R.M.: Epidemiology of depressive symptoms in the National
Longitudinal Study of Adolescent Health. J. Am. Acad. Child Adolesc. Psychiatry 41(2), 199–205
(2002)
Swendsen, J.D., Merikangas, K.R., Canino, G.J., Kessler, R.C., Rubio-Stipec, M., Angst, J.:
The comorbidity of alcoholism with anxiety and depressive disorders in four geographic
communities. Compr. Psychiatry 39(4), 176–184 (1998)
Tang, W., He, H., Tu, X.: Applied Categorical and Count Data Analysis. Chapman & Hall/CRC,
Boca Raton (2012)
Tsai, J., Ford, E.S., Li, C., Zhao, G.: Past and current alcohol consumption patterns and elevations
in serum hepatic enzymes among US adults. Addict. Behav. 37, 78–84 (2012)
Weissman, M.M., et al.: Cross-national epidemiology of major depression and bipolar disorder.
JAMA 276, 293–299 (1996)
Welsh, A.H., Cunningham, R.B., Donnelly, C.F., Lindenmayer, D.B.: Modelling the abundance of
rare species: statistical models for counts with extra zeros. Ecol. Model. 88(1), 297–308 (1996)
Wilhelm, K., Mitchell, P., Slade, T., Brownhill, S., Andrews, G.: Prevalence and correlates of DSM-
IV major depression in an Australian national survey. J. Affect. Disord. 75(2), 155–162 (2003)
Willenbring, M.L.: Measurement of depression in alcoholics. J. Stud. Alcohol 47(5), 367–372
(1986)
Yu, Q., Chen, R., Tang, W., He, H., Gallop, R., Crits-Christoph, P., Hu, J., Tu, X.M.: Distribution-
free models for longitudinal count responses with overdispersion and structural zeros. Stat.
Med. 32(14), 2390–2405 (2012)
Modeling Based on Progressively Type-I
Interval Censored Sample
Yu-Jau Lin, Nan Jiang, Y.L. Lio, and Ding-Geng (Din) Chen
Abstract Progressively type-I interval censored data arise very often in public
health studies. For example, 112 patients with plasma cell myeloma were admitted for
treatment at the National Cancer Institute, and all patients were under examination
at the scheduled times (in months) 5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 40.5, 50.5, and
60.5. The data reported by Carbone et al. (Am J Med 42:937–948,
1967) show the number of patients at risk in each time interval and the number
withdrawn at each examination time, which is the right end point of
each time interval. After 60.5 months, the study was terminated. The patients with-
drawn at the right end point of a time interval have no further follow-up. This table did
not provide any patient's exact lifetime. The data structure presented in the table is
called progressively type-I interval censored data. In this chapter, many parametric
modeling procedures will be discussed via maximum likelihood estimate, moment
method estimate, probability plot estimate, and Bayesian estimation. Finally, model
selection based on the Bayesian concept will be addressed. The chapter will also
include the model presentation of the general data structure and a simulation procedure
for generating a progressively type-I interval censored sample. Basically, this chapter
will provide the techniques published by Ng and Wang (J Stat Comput Simul 79:
145–159, 2009), Chen and Lio (Comput Stat Data Anal 54:1581–1591, 2010), and
Lin and Lio (2012). R and WinBUGS implementation for the techniques will be
included.
Y.-J. Lin
Department of Applied Mathematics, Chung-Yuan Christian University,
Chung-Li, Taoyuan City, Taiwan
e-mail: [email protected]
N. Jiang () • Y.L. Lio
Department of Mathematical Sciences, University of South Dakota, Vermillion, SD 57069, USA
e-mail: [email protected]; [email protected]; [email protected]
D.-G. Chen
School of Social Work, University of North Carolina, Chapel Hill, NC 27599, USA
e-mail: [email protected]
1 Introduction
In industrial life testing and medical survival analysis, we often encounter
situations in which the object is lost or withdrawn before failure, or the object's lifetime
is only known to fall within an interval. Hence, the obtained sample is called a censored
sample (or an incomplete sample). The most common censoring schemes are type-I
censoring, type-II censoring, and progressive censoring. Under type-I censoring the life
testing is ended at a pre-scheduled time, while under type-II censoring the life testing
is ended when a pre-specified number of failures is reached. In the type-I and type-
II censoring schemes, the tested items are allowed to be withdrawn only at the
end of the life testing. However, in many real-life cases, subjects can be missing
or withdrawn at some other time before the end of the life testing, which motivated
the investigation of the progressive censoring scheme. Balakrishnan and Aggarwala
(2000) provided more information about progressive censoring combined with
type-I or type-II censoring. Aggarwala (2001) combined type-I interval censoring with
progressive censoring and developed statistical inference for the exponential distribution
based on progressively type-I interval censored data. Under progressive type-I interval
censoring, lifetimes are only known to lie between two consecutive pre-scheduled times,
and subjects are allowed to withdraw at any time before the end of treatment.
Table 1 displays a typical progressively type-I interval censored data set that consists of 112 patients with plasma cell myeloma treated at the National Cancer Institute. This data set was reported by Carbone et al. (1967) and discussed in Lawless (2003). The rightmost column in Table 1 shows the number of patients who were found to have dropped out of the study at the right end of each time interval. These patients are known to have survived to the right end of the corresponding time interval but have no further follow-up. Hence, the rightmost column in Table 1 provides the numbers of withdrawals, $R_i$, $i = 1, \ldots, m = 9$. The numbers of failures, $X_i$, $i = 1, \ldots, m$, can easily be calculated to be $X = (18, 16, 18, 10, 11, 8, 13, 4, 1)$ from the number at risk and the number of withdrawals.
$$L(\theta) \propto \prod_{i=1}^{m}\left[F(t_i;\theta) - F(t_{i-1};\theta)\right]^{X_i}\left[1 - F(t_i;\theta)\right]^{R_i}, \qquad (1)$$
Suppose that the $X_i$ failure units in each subinterval $(t_{i-1}, t_i]$ occurred at the center of the interval, $m_i = \frac{t_{i-1}+t_i}{2}$, and that the $R_i$ censored items were withdrawn at the censoring time $t_i$. Then the likelihood function (1) can be approximately represented as
$$L^{M}(\theta) \propto \prod_{i=1}^{m}\left[f(m_i;\theta)\right]^{X_i}\left[1 - F(t_i;\theta)\right]^{R_i}.$$
The approximated likelihood function is usually simpler than the likelihood function of the original progressive type-I interval censored sample. However, in many situations the MLE of the parameter vector $\theta$ still cannot be obtained in closed form.
2.3 EM-Algorithm
The EM algorithm will be implemented by applying the E-step first, followed by the M-step. This iterative procedure can also be implemented by applying the M-step first, followed by the E-step; the algorithm is then called the ME algorithm. In this chapter the ME algorithm will be used. More detailed procedures will be addressed case by case.
where
$$E\left[T^k \mid T \in [a,b)\right] = \frac{\int_a^b t^k f(t;\theta)\,dt}{F(b;\theta) - F(a;\theta)}$$
for a given positive integer $k$ and $0 \le a < b \le \infty$. Let $L$ be the dimension of $\theta$. The moment method starts with setting the $k$th sample moment equal to the corresponding $k$th population moment, for $k = 1, 2, \ldots, L$. Then the moment method estimate of $\theta$ will be obtained through solving the system of $L$ equations. Since a closed form solution is usually not available, an iterative numerical process will be used to obtain the solution. More information will be provided in the case studies.
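As an illustration of the conditional moments used by the moment method, the following minimal R sketch (an illustrative example, not the chapter's own code) evaluates $E[T^k \mid T \in [a,b)]$ numerically for a user-supplied density fT and distribution function FT; the exponential choice below is purely an assumption for demonstration.

# Numerical truncated k-th moment E[T^k | T in [a, b)]
trunc.moment <- function(k, a, b, fT, FT) {
  num <- integrate(function(t) t^k * fT(t), lower = a, upper = b)$value
  num / (FT(b) - FT(a))                 # divide by the probability of [a, b)
}
# illustrative call with an exponential lifetime of rate 0.1:
trunc.moment(1, 5.5, 10.5, fT = function(t) dexp(t, 0.1),
             FT = function(t) pexp(t, 0.1))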
$$\hat{F}(t_i) = 1 - \prod_{j=1}^{i}\left(1 - \hat{p}_j\right), \qquad i = 1, 2, \ldots, m, \qquad (3)$$
where
$$\hat{p}_j = \frac{X_j}{\,n - \sum_{k=0}^{j-1} X_k - \sum_{k=0}^{j-1} R_k\,}, \qquad j = 1, 2, \ldots, m.$$
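To make the product-limit computation in (3) concrete, the following minimal R sketch (not part of the chapter's appendix) evaluates $\hat p_j$ and $\hat F(t_i)$; the failure counts X are those quoted above for Table 1, while the withdrawal counts R are hypothetical placeholders since the full table is not reproduced here.

# Product-limit estimate (3); R (withdrawals) is a hypothetical placeholder
X <- c(18, 16, 18, 10, 11, 8, 13, 4, 1)   # failures per interval (from the text)
R <- c(1, 1, 1, 1, 1, 1, 1, 1, 5)         # hypothetical withdrawals per interval
n <- sum(X) + sum(R)                      # total number of patients
m <- length(X)
p.hat <- F.hat <- numeric(m)
surv  <- 1
for (j in 1:m) {
  at.risk  <- n - sum(X[seq_len(j - 1)]) - sum(R[seq_len(j - 1)])
  p.hat[j] <- X[j] / at.risk              # conditional failure probability in interval j
  surv     <- surv * (1 - p.hat[j])
  F.hat[j] <- 1 - surv                    # product-limit estimate of F(t_j)
}
F.hat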
Ng and Wang (2009) introduced the Weibull distribution into progressive type-I interval data modeling. The Weibull distribution has probability density function, distribution function, and hazard function given by
$$f_W(t;\gamma,\beta) = \gamma\beta t^{\beta-1} e^{-\gamma t^{\beta}}, \qquad (4)$$
$$F_W(t;\gamma,\beta) = 1 - e^{-\gamma t^{\beta}}, \qquad (5)$$
$$h_W(t;\gamma,\beta) = \gamma\beta t^{\beta-1}, \qquad (6)$$
where $\beta > 0$ is the shape parameter and $\gamma > 0$ is the scale parameter. When the shape parameter $\beta = 1$, the Weibull distribution reduces to the conventional exponential distribution. The Weibull hazard function can be increasing, decreasing, or constant depending upon the shape parameter. Therefore, the Weibull distribution provides flexibility in modeling lifetime data.
Given a progressively type-I interval censored sample $\{X_i, R_i, t_i\}$, $i = 1, 2, \ldots, m$, of size $n = \sum_{i=1}^{m}(X_i + R_i)$ from the two-parameter Weibull distribution defined by (4), the likelihood function (1) can be specified as follows:
$$L_W(\gamma,\beta) \propto \prod_{i=1}^{m}\left[e^{-\gamma t_{i-1}^{\beta}} - e^{-\gamma t_i^{\beta}}\right]^{X_i} e^{-\gamma t_i^{\beta} R_i}, \qquad (7)$$
and
$$\sum_{i=1}^{m} \frac{X_i\left[-e^{-\gamma t_{i-1}^{\beta}} \ln(t_{i-1})\, t_{i-1}^{\beta} + e^{-\gamma t_i^{\beta}} \ln(t_i)\, t_i^{\beta}\right]}{e^{-\gamma t_{i-1}^{\beta}} - e^{-\gamma t_i^{\beta}}} = \sum_{i=1}^{m} \ln(t_i)\, t_i^{\beta} R_i. \qquad (10)$$
The MLEs of $\gamma$ and $\beta$ can be obtained by solving Eqs. (9) and (10) simultaneously. Since no closed form is available for the solution, a numerical iteration method can be used to evaluate the MLEs.
Next, let us apply the EM algorithm to obtain the MLEs. When the lifetime is Weibull distributed, the likelihood function (2) based on the complete random sample of lifetimes, $\tau_{i,j}$, $j = 1, 2, \ldots, X_i$, and $\tau_{i,j}^{*}$, $j = 1, 2, \ldots, R_i$, of the $n$ items can be represented as
$$L_W^{c}(\gamma,\beta) = \gamma^{n}\beta^{n}\prod_{i=1}^{m}\left[\prod_{j=1}^{X_i}\tau_{i,j}^{\beta-1} e^{-\gamma\tau_{i,j}^{\beta}} \prod_{j=1}^{R_i}\left(\tau_{i,j}^{*}\right)^{\beta-1} e^{-\gamma\left(\tau_{i,j}^{*}\right)^{\beta}}\right] \qquad (11)$$
$$\beta = \frac{n}{\sum_{i=1}^{m}\left[\sum_{j=1}^{X_i}\left(\gamma\,\tau_{i,j}^{\beta}\ln(\tau_{i,j}) - \ln(\tau_{i,j})\right) + \sum_{j=1}^{R_i}\left(\gamma\,\left(\tau_{i,j}^{*}\right)^{\beta}\ln(\tau_{i,j}^{*}) - \ln(\tau_{i,j}^{*})\right)\right]}. \qquad (14)$$
Given $t_0 < t_1 < \cdots < t_m$, the lifetimes of the $X_i$ failures in the $i$th interval $(t_{i-1}, t_i]$ are independent and follow a doubly truncated Weibull distribution, truncated from the left at $t_{i-1}$ and from the right at $t_i$, and the lifetimes of the $R_i$ censored items in the $i$th interval $(t_{i-1}, t_i]$ are independent and follow a Weibull distribution truncated from the left at $t_i$, $i = 1, 2, \ldots, m$. The required expected values of a doubly truncated Weibull distribution, truncated from the left at $a$ and from the right at $b$ with $0 < a < b \le \infty$, for the EM algorithm are given by
$$E_{\gamma,\beta}\left(\ln(Y) \mid Y \in [a,b)\right) = \frac{\int_a^b \ln(y) f_W(y;\gamma,\beta)\,dy}{F_W(b;\gamma,\beta) - F_W(a;\gamma,\beta)} = \frac{\int_a^b \gamma\beta y^{\beta-1}\ln(y)\, e^{-\gamma y^{\beta}}\,dy}{e^{-\gamma a^{\beta}} - e^{-\gamma b^{\beta}}},$$
$$E_{\gamma,\beta}\left(Y^{\beta} \mid Y \in [a,b)\right) = \frac{\int_a^b \gamma\beta y^{2\beta-1} e^{-\gamma y^{\beta}}\,dy}{e^{-\gamma a^{\beta}} - e^{-\gamma b^{\beta}}},$$
$$E_{\gamma,\beta}\left(\ln(Y)\,Y^{\beta} \mid Y \in [a,b)\right) = \frac{\int_a^b \gamma\beta y^{2\beta-1}\ln(y)\, e^{-\gamma y^{\beta}}\,dy}{e^{-\gamma a^{\beta}} - e^{-\gamma b^{\beta}}}.$$
and
$$\beta = \frac{n}{\sum_{i=1}^{m}\left[\gamma\left(X_i E_{5i} + R_i E_{6i}\right) - \left(X_i E_{3i} + R_i E_{4i}\right)\right]}. \qquad (16)$$
• The M-step solves Eqs. (15) and (16) to obtain the next values, $\hat\gamma^{(k+1)}$ and $\hat\beta^{(k+1)}$, of $\gamma$ and $\beta$:
$$\hat\gamma^{(k+1)} = \frac{n}{\sum_{i=1}^{m}\left(X_i E_{1i} + R_i E_{2i}\right)}, \qquad (17)$$
$$\hat\beta^{(k+1)} = \frac{n}{\sum_{i=1}^{m}\left[\hat\gamma^{(k+1)}\left(X_i E_{5i} + R_i E_{6i}\right) - \left(X_i E_{3i} + R_i E_{4i}\right)\right]}. \qquad (18)$$
3. Check convergence: if convergence occurs, the current $\hat\gamma^{(k+1)}$ and $\hat\beta^{(k+1)}$ are the approximate MLEs of $\gamma$ and $\beta$ via the EM algorithm; otherwise, set $k = k + 1$ and go to Step 2.
$$L_W^{M}(\gamma,\beta) \propto \prod_{i=1}^{m} \gamma^{X_i}\beta^{X_i}\, m_i^{X_i(\beta-1)}\, e^{-\gamma X_i m_i^{\beta}}\, e^{-\gamma R_i t_i^{\beta}} \qquad (19)$$
$$l_W^{M}(\gamma,\beta) = \sum_{i=1}^{m}\left[X_i\left(\ln(\gamma) + \ln(\beta)\right) + X_i(\beta-1)\ln(m_i) - \gamma X_i m_i^{\beta} - \gamma R_i t_i^{\beta}\right]. \qquad (20)$$
Setting the derivatives of $l_W^{M}(\gamma,\beta)$ with respect to $\gamma$ and $\beta$ equal to 0, respectively, the likelihood equations are given as
$$\gamma = \frac{\sum_{i=1}^{m} X_i}{\sum_{i=1}^{m}\left(X_i m_i^{\beta} + R_i t_i^{\beta}\right)} \qquad (21)$$
and
$$\beta = \frac{\sum_{i=1}^{m} X_i}{\gamma\sum_{i=1}^{m}\left(X_i m_i^{\beta}\ln(m_i) + R_i t_i^{\beta}\ln(t_i)\right) - \sum_{i=1}^{m} X_i\ln(m_i)}. \qquad (22)$$
Let $T$ be a random variable having the Weibull distribution (4). Then the $k$th moment of the Weibull distribution is
$$E(T^k) = \gamma^{-k/\beta}\,\Gamma(1 + k/\beta),$$
where $\Gamma(\cdot)$ is the complete gamma function and $k$ is a positive integer. Setting the first and second sample moments, which were defined in Sect. 2.4, equal to the corresponding population moments, the following two equations are obtained for solving the moment method estimates of $\gamma$ and $\beta$:
$$\gamma^{-1/\beta}\,\Gamma(1 + 1/\beta) = \frac{1}{n}\left[\sum_{i=1}^{m} X_i E_{\gamma,\beta}\left(T \mid T \in [t_{i-1}, t_i)\right) + R_i E_{\gamma,\beta}\left(T \mid T \in [t_i, \infty)\right)\right], \qquad (23)$$
and
$$\gamma^{-2/\beta}\,\Gamma(1 + 2/\beta) = \frac{1}{n}\left[\sum_{i=1}^{m} X_i E_{\gamma,\beta}\left(T^2 \mid T \in [t_{i-1}, t_i)\right) + R_i E_{\gamma,\beta}\left(T^2 \mid T \in [t_i, \infty)\right)\right]. \qquad (24)$$
Since the solutions to Eqs. (23) and (24) cannot be obtained in closed form, an iterative numerical process to obtain the parameter estimates is described as follows:
1. Let the initial estimates of $\gamma$ and $\beta$ be $\gamma^{(0)}$ and $\beta^{(0)}$ and set $k = 0$.
2. In the $(k+1)$th iteration,
• compute $E_{\gamma^{(k)},\beta^{(k)}}\left(T^j \mid T \in [t_{i-1}, t_i)\right)$ and $E_{\gamma^{(k)},\beta^{(k)}}\left(T^j \mid T \in [t_i, \infty)\right)$ for $j = 1, 2$, and solve the following equation, which is derived from Eqs. (23) and (24), for $\beta$, say $\beta^{(k+1)}$:
$$\frac{\left[\Gamma(1 + 1/\beta)\right]^2}{\Gamma(1 + 2/\beta)} = \frac{\left\{\sum_{i=1}^{m} X_i E_{\gamma^{(k)},\beta^{(k)}}\left[T \mid T \in [t_{i-1}, t_i)\right] + R_i E_{\gamma^{(k)},\beta^{(k)}}\left[T \mid T \in [t_i, \infty)\right]\right\}^2}{n\sum_{i=1}^{m}\left\{X_i E_{\gamma^{(k)},\beta^{(k)}}\left[T^2 \mid T \in [t_{i-1}, t_i)\right] + R_i E_{\gamma^{(k)},\beta^{(k)}}\left[T^2 \mid T \in [t_i, \infty)\right]\right\}}.$$
• The solution for $\gamma$, say $\gamma^{(k+1)}$, is obtained based on Eq. (23):
$$\gamma^{(k+1)} = \left\{\frac{n\,\Gamma\left(1 + 1/\beta^{(k+1)}\right)}{\sum_{i=1}^{m} X_i E_{\gamma^{(k)},\beta^{(k)}}\left[T \mid T \in [t_{i-1}, t_i)\right] + R_i E_{\gamma^{(k)},\beta^{(k)}}\left[T \mid T \in [t_i, \infty)\right]}\right\}^{\beta^{(k+1)}}.$$
3. Check convergence: if convergence occurs, the current $\gamma^{(k+1)}$ and $\beta^{(k+1)}$ are the estimates of $\gamma$ and $\beta$ by the method of moments; otherwise, set $k = k + 1$ and go to Step 2.
Let the product-limit distribution estimate $\hat{F}(t)$ described in Sect. 2.5 be the estimate of the Weibull distribution function (5). Then the estimates of $\gamma$ and $\beta$ in the Weibull distribution based on the probability plot can be obtained by minimizing
$$\sum_{i=1}^{m}\left[t_i - \left(\frac{-\ln\left(1 - \hat{F}(t_i)\right)}{\gamma}\right)^{1/\beta}\right]^2$$
with respect to $\gamma$ and $\beta$. A nonlinear optimization procedure is applied to find the minimizers as the estimates of $\gamma$ and $\beta$.
In addition, Ng and Wang (2009) mentioned that the estimates of $\gamma$ and $\beta$ based on the probability plot can also be obtained by a least squares fit of the linear regression model
$$y = \ln(\gamma) + \beta x + \epsilon, \qquad (25)$$
with data set $(x_i, y_i) = \left(\ln(t_i),\ \ln\left(-\ln\left(1 - \hat{F}(t_i)\right)\right)\right)$ for $i = 1, 2, \ldots, m$, where $\epsilon$ is an error term.
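A minimal R sketch of the least squares fit (25) is given below (it is not taken from the chapter's appendix); ti are the inspection times of Table 1, while F.hat stands for the product-limit estimates of Sect. 2.5 and is filled here with placeholder values.

# Probability-plot (least squares) estimation for the Weibull model (25)
ti    <- c(5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 40.5, 50.5, 60.5)
F.hat <- seq(0.15, 0.90, length.out = 9)   # placeholder product-limit estimates
x <- log(ti)
y <- log(-log(1 - F.hat))                  # Weibull probability-plot transform
fit <- lm(y ~ x)                           # y = ln(gamma) + beta * x + error
beta.pp  <- unname(coef(fit)[2])           # slope estimates the shape beta
gamma.pp <- exp(unname(coef(fit)[1]))      # exp(intercept) estimates the scale gamma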
$$h_{GE}(t;\alpha,\lambda) = \frac{\alpha\lambda\left(1 - e^{-\lambda t}\right)^{\alpha-1} e^{-\lambda t}}{1 - \left(1 - e^{-\lambda t}\right)^{\alpha}}, \qquad t > 0,\ \alpha > 0,\ \lambda > 0, \qquad (28)$$
where $\theta = (\alpha, \lambda)$, $\alpha > 0$ is the shape parameter and $\lambda > 0$ is the scale parameter. If $\alpha = 1$, the GE distribution defined above reduces to the conventional exponential distribution. If $\alpha < 1$, the density function (26) is decreasing, and if $\alpha > 1$, the density function (26) is unimodal. Similar to the Weibull distribution, the GE hazard function can be increasing, decreasing, or constant depending upon the shape parameter $\alpha$. The GE distribution has been studied by numerous authors, for example, Chen and Lio (2010) and Gupta and Kundu (1999, 2001a,b, 2002, 2003). Gupta and Kundu (2001a, 2003) mentioned that the two-parameter GE distribution can be used quite effectively in analyzing many lifetime data sets and provides a better fit than the two-parameter Weibull distribution in many situations. An extensive survey of recent developments for the two-parameter GE distribution based on a complete random sample can be found in Gupta and Kundu (2007).
$$L_{GE}(\alpha,\lambda) \propto \prod_{i=1}^{m}\left[\left(1 - e^{-\lambda t_i}\right)^{\alpha} - \left(1 - e^{-\lambda t_{i-1}}\right)^{\alpha}\right]^{X_i}\left[1 - \left(1 - e^{-\lambda t_i}\right)^{\alpha}\right]^{R_i}, \qquad (29)$$
$$l_{GE}(\alpha,\lambda) = \text{constant} + \sum_{i=1}^{m}\left\{X_i\ln\left[\left(1 - e^{-\lambda t_i}\right)^{\alpha} - \left(1 - e^{-\lambda t_{i-1}}\right)^{\alpha}\right] + R_i\ln\left[1 - \left(1 - e^{-\lambda t_i}\right)^{\alpha}\right]\right\}. \qquad (30)$$
By setting the derivatives of the log likelihood function with respect to $\alpha$ or $\lambda$ to zero, the MLEs of $\alpha$ and $\lambda$ are the solutions of the following likelihood equations:
$$\sum_{i=1}^{m}\frac{R_i t_i e^{-\lambda t_i}\left(1 - e^{-\lambda t_i}\right)^{\alpha-1}}{1 - \left(1 - e^{-\lambda t_i}\right)^{\alpha}} = \sum_{i=1}^{m}\frac{X_i\left[\left(1 - e^{-\lambda t_i}\right)^{\alpha-1} t_i e^{-\lambda t_i} - \left(1 - e^{-\lambda t_{i-1}}\right)^{\alpha-1} t_{i-1} e^{-\lambda t_{i-1}}\right]}{\left(1 - e^{-\lambda t_i}\right)^{\alpha} - \left(1 - e^{-\lambda t_{i-1}}\right)^{\alpha}}, \qquad (31)$$
and
$$\sum_{i=1}^{m}\frac{R_i\left(1 - e^{-\lambda t_i}\right)^{\alpha}\ln\left(1 - e^{-\lambda t_i}\right)}{1 - \left(1 - e^{-\lambda t_i}\right)^{\alpha}} = \sum_{i=1}^{m}\frac{X_i\left[\left(1 - e^{-\lambda t_i}\right)^{\alpha}\ln\left(1 - e^{-\lambda t_i}\right) - \left(1 - e^{-\lambda t_{i-1}}\right)^{\alpha}\ln\left(1 - e^{-\lambda t_{i-1}}\right)\right]}{\left(1 - e^{-\lambda t_i}\right)^{\alpha} - \left(1 - e^{-\lambda t_{i-1}}\right)^{\alpha}}. \qquad (32)$$
No closed form solution can be found for the above equations, and an iterative numerical search can be used to obtain the MLEs. Let $\hat\alpha$ and $\hat\lambda$ be the solutions to the above equations. Since there is no closed form for the MLEs, the EM algorithm and a mid-point approximation are introduced as follows for finding the MLEs of $\alpha$ and $\lambda$.
Similarly to Sect. 3.1, let $\tau_{i,j}$, $j = 1, 2, \ldots, X_i$, be the survival times within the subinterval $(t_{i-1}, t_i]$ and $\tau_{i,j}^{*}$, $j = 1, 2, \ldots, R_i$, be the survival times of the items withdrawn at $t_i$, for $i = 1, 2, 3, \ldots, m$. Then the likelihood $L_{GE}^{c}(\alpha,\lambda)$ and log likelihood $l_{GE}^{c}(\alpha,\lambda) = \ln\left(L_{GE}^{c}(\alpha,\lambda)\right)$ for the complete lifetimes of the $n$ items from the two-parameter GE distribution can be obtained, and
$$l_{GE}^{c}(\alpha,\lambda) = \left[\ln(\alpha) + \ln(\lambda)\right] n - \lambda\sum_{i=1}^{m}\left(\sum_{j=1}^{X_i}\tau_{i,j} + \sum_{j=1}^{R_i}\tau_{i,j}^{*}\right) + (\alpha - 1)\sum_{i=1}^{m}\left[\sum_{j=1}^{X_i}\ln\left(1 - e^{-\lambda\tau_{i,j}}\right) + \sum_{j=1}^{R_i}\ln\left(1 - e^{-\lambda\tau_{i,j}^{*}}\right)\right]. \qquad (33)$$
and
$$\frac{n}{\lambda} = \sum_{i=1}^{m}\left[\sum_{j=1}^{X_i}\tau_{i,j} + \sum_{j=1}^{R_i}\tau_{i,j}^{*}\right] - (\alpha - 1)\sum_{i=1}^{m}\left[\sum_{j=1}^{X_i}\frac{\tau_{i,j}\, e^{-\lambda\tau_{i,j}}}{1 - e^{-\lambda\tau_{i,j}}} + \sum_{j=1}^{R_i}\frac{\tau_{i,j}^{*}\, e^{-\lambda\tau_{i,j}^{*}}}{1 - e^{-\lambda\tau_{i,j}^{*}}}\right]
= \sum_{i=1}^{m}\left[\sum_{j=1}^{X_i}\tau_{i,j} + \sum_{j=1}^{R_i}\tau_{i,j}^{*}\right] - (\alpha - 1)\sum_{i=1}^{m}\left[\sum_{j=1}^{X_i}\frac{\tau_{i,j}}{e^{\lambda\tau_{i,j}} - 1} + \sum_{j=1}^{R_i}\frac{\tau_{i,j}^{*}}{e^{\lambda\tau_{i,j}^{*}} - 1}\right]. \qquad (35)$$
The lifetimes of the $X_i$ failures in the $i$th interval $(t_{i-1}, t_i]$ are independent and follow a doubly truncated GE distribution, truncated from the left at $t_{i-1}$ and from the right at $t_i$. The lifetimes of the $R_i$ censored items in the $i$th interval $(t_{i-1}, t_i]$ are independent and follow a GE distribution truncated from the left at $t_i$, $i = 1, 2, \ldots, m$. The required expected values of a doubly truncated GE distribution, truncated from the left at $a$ and from the right at $b$ with $0 < a < b \le \infty$, for the EM algorithm are given by
$$E_{\alpha,\lambda}\left[Y \mid Y \in [a,b)\right] = \frac{\int_a^b y\, f_{GE}(y;\alpha,\lambda)\,dy}{F_{GE}(b;\alpha,\lambda) - F_{GE}(a;\alpha,\lambda)},$$
$$E_{\alpha,\lambda}\left[\ln\left(1 - e^{-\lambda Y}\right) \mid Y \in [a,b)\right] = \frac{\int_a^b \ln\left(1 - e^{-\lambda y}\right) f_{GE}(y;\alpha,\lambda)\,dy}{F_{GE}(b;\alpha,\lambda) - F_{GE}(a;\alpha,\lambda)},$$
$$E_{\alpha,\lambda}\left[\frac{Y}{e^{\lambda Y} - 1} \,\Big|\, Y \in [a,b)\right] = \frac{\int_a^b \frac{y}{e^{\lambda y} - 1}\, f_{GE}(y;\alpha,\lambda)\,dy}{F_{GE}(b;\alpha,\lambda) - F_{GE}(a;\alpha,\lambda)}.$$
$$E_{5i} = E_{\hat\alpha^{(k)},\hat\lambda^{(k)}}\left[\frac{Y}{e^{\hat\lambda^{(k)} Y} - 1} \,\Big|\, Y \in [t_{i-1}, t_i)\right], \qquad E_{6i} = E_{\hat\alpha^{(k)},\hat\lambda^{(k)}}\left[\frac{Y}{e^{\hat\lambda^{(k)} Y} - 1} \,\Big|\, Y \in [t_i, \infty)\right],$$
and the likelihood Eqs. (34) and (35) are replaced by
$$\frac{n}{\alpha} = -\sum_{i=1}^{m}\left[X_i E_{2i} + R_i E_{4i}\right] \qquad (36)$$
and
$$\frac{n}{\lambda} = \sum_{i=1}^{m}\left[X_i E_{1i} + R_i E_{3i}\right] - (\alpha - 1)\sum_{i=1}^{m}\left[X_i E_{5i} + R_i E_{6i}\right]; \qquad (37)$$
• the M-step solves Eqs. (36) and (37) and obtains the next values, $\hat\alpha^{(k+1)}$ and $\hat\lambda^{(k+1)}$, of $\alpha$ and $\lambda$, respectively, as follows:
$$\hat\alpha^{(k+1)} = \frac{-n}{\sum_{i=1}^{m}\left(X_i E_{2i} + R_i E_{4i}\right)}, \qquad (38)$$
$$\hat\lambda^{(k+1)} = \frac{n}{\sum_{i=1}^{m}\left(X_i E_{1i} + R_i E_{3i}\right) - \left(\hat\alpha^{(k+1)} - 1\right)\sum_{i=1}^{m}\left(X_i E_{5i} + R_i E_{6i}\right)}. \qquad (39)$$
3. Check convergence: if convergence occurs, the current $\hat\alpha^{(k+1)}$ and $\hat\lambda^{(k+1)}$ are the approximate MLEs of $\alpha$ and $\lambda$ via the EM algorithm; otherwise, set $k = k + 1$ and go to Step 2.
It can easily be seen that the EM algorithm does not involve likelihood equations as complicated as Eqs. (31) and (32) for solving the MLEs of $\alpha$ and $\lambda$. Therefore, it can be efficiently implemented in a computing program.
Suppose that the $X_i$ failure units in each subinterval $(t_{i-1}, t_i]$ occurred at the center of the interval, $m_i = \frac{t_{i-1}+t_i}{2}$, and that the $R_i$ censored items were withdrawn at the censoring time $t_i$. Then the log likelihood function (30) can be approximately represented as
$$\ln\left(L_{GE}^{M}\right) \propto \sum_{i=1}^{m}\left[X_i\ln\left(f(m_i;\theta)\right) + R_i\ln\left(1 - F(t_i;\theta)\right)\right]
= \left[\ln(\alpha) + \ln(\lambda)\right]\sum_{i=1}^{m} X_i - \lambda\sum_{i=1}^{m} X_i m_i + (\alpha - 1)\sum_{i=1}^{m} X_i\ln\left(1 - e^{-\lambda m_i}\right) + \sum_{i=1}^{m} R_i\ln\left(1 - \left(1 - e^{-\lambda t_i}\right)^{\alpha}\right). \qquad (40)$$
The MLEs, $\hat\alpha$ and $\hat\lambda$, of $\alpha$ and $\lambda$ are the solutions of the following system of equations:
$$\hat\alpha\sum_{i=1}^{m} X_i\ln\left(1 - e^{-\hat\lambda m_i}\right) + \sum_{i=1}^{m} X_i = \hat\alpha\sum_{i=1}^{m} R_i\,\frac{\left(1 - e^{-\hat\lambda t_i}\right)^{\hat\alpha}\ln\left(1 - e^{-\hat\lambda t_i}\right)}{1 - \left(1 - e^{-\hat\lambda t_i}\right)^{\hat\alpha}}, \qquad (41)$$
and
$$\sum_{i=1}^{m} X_i/\hat\lambda + \left(\hat\alpha - 1\right)\sum_{i=1}^{m}\frac{X_i m_i e^{-\hat\lambda m_i}}{1 - e^{-\hat\lambda m_i}} = \sum_{i=1}^{m} X_i m_i + \hat\alpha\sum_{i=1}^{m}\frac{t_i R_i e^{-\hat\lambda t_i}\left(1 - e^{-\hat\lambda t_i}\right)^{\hat\alpha-1}}{1 - \left(1 - e^{-\hat\lambda t_i}\right)^{\hat\alpha}}. \qquad (42)$$
There is no closed form for the solutions, and an iterative numerical search is needed to obtain the parameter estimates, $\hat\alpha_{Mid}$ and $\hat\lambda_{Mid}$, from the above equations. Although there is no closed form solution, the mid-point likelihood equations are simpler than the original likelihood equations.
Let $T$ be a random variable with the pdf (26). Gupta and Kundu (1999) showed that the mean and the variance are
$$\mu = E(T) = \frac{\psi(\alpha + 1) - \psi(1)}{\lambda},$$
$$\sigma^2 = E\left[(T - \mu)^2\right] = \frac{\psi'(1) - \psi'(\alpha + 1)}{\lambda^2},$$
where $\psi(t)$ is the digamma function and $\psi'(t)$ is its derivative. The $k$th moment of a doubly truncated GE distribution on the interval $(a, b)$ with $0 < a < b \le \infty$ is given by
$$E_{\alpha,\lambda}\left[T^k \mid T \in [a,b)\right] = \frac{\int_a^b t^k f_{GE}(t)\,dt}{F_{GE}(b) - F_{GE}(a)}.$$
Since no closed form for the solutions of Eqs. (43) and (44) can be derived, an iterative numerical process to obtain the parameter estimates is described as follows:
1. Let the initial estimates of $\alpha$ and $\lambda$ be $\alpha^{(0)}$ and $\lambda^{(0)}$. Set $k = 0$.
2. In the $(k+1)$th iteration,
• compute $E_{1i} = E_{\alpha^{(k)},\lambda^{(k)}}\left(T \mid T \in [t_{i-1}, t_i)\right)$, $E_{3i} = E_{\alpha^{(k)},\lambda^{(k)}}\left(T \mid T \in [t_i, \infty)\right)$, $E_{7i} = E_{\alpha^{(k)},\lambda^{(k)}}\left(T^2 \mid T \in [t_{i-1}, t_i)\right)$, and $E_{8i} = E_{\alpha^{(k)},\lambda^{(k)}}\left(T^2 \mid T \in [t_i, \infty)\right)$, and solve the following equation for $\alpha$, say $\alpha^{(k+1)}$:
$$\frac{\left[\psi(\alpha + 1) - \psi(1)\right]^2}{\left[\psi(\alpha + 1) - \psi(1)\right]^2 + \psi'(1) - \psi'(\alpha + 1)} = \frac{\left[\sum_{i=1}^{m}\left(X_i E_{1i} + R_i E_{3i}\right)\right]^2}{n\sum_{i=1}^{m}\left(X_i E_{7i} + R_i E_{8i}\right)}; \qquad (45)$$
• the solution for $\lambda$ is obtained through the following equation and labeled $\lambda^{(k+1)}$:
$$\frac{\psi\left(\alpha^{(k+1)} + 1\right) - \psi(1)}{\lambda} = \frac{1}{n}\sum_{i=1}^{m}\left[X_i E_{1i} + R_i E_{3i}\right]. \qquad (46)$$
3. Check convergence: if convergence occurs, the current $\alpha^{(k+1)}$ and $\lambda^{(k+1)}$ are the estimates of $\alpha$ and $\lambda$ by the method of moments; otherwise, set $k = k + 1$ and go to Step 2.
The data set of 112 patients with plasma cell myeloma treated at the National Cancer Institute (Carbone et al. 1967) is used for modeling the two-parameter Weibull distribution and the two-parameter GE distribution.
The three-parameter exponentiated Weibull distribution (EWD) of Mudholkar and Srivastava (1993), with distribution function
$$F_{EWD}(t;\alpha,\gamma,\beta) = \left(1 - e^{-\gamma t^{\beta}}\right)^{\alpha}, \qquad (47)$$
where $\alpha > 0$, $\gamma > 0$, and $\beta > 0$, is also considered. It is clear that the EWD of (47) reduces to the GE distribution defined by (26) when $\beta = 1$, to the Weibull distribution (WD) when $\alpha = 1$, and to the classical exponential distribution (ED) when both $\beta = 1$ and $\alpha = 1$. Here, the three-parameter EWD is used to fit the given data set, and it is statistically tested whether it can be reduced to the WD model or to the GE model for the given data set.
Fitting the EWD of (47) to the given data, the estimated parameters are $\hat\theta = (\hat\alpha, \hat\gamma, \hat\beta) = (1.064, 0.026, 1.185)$ with $-2\log L(\text{EWD}) = 460.693$. Fitting the classical Weibull distribution yields the estimated parameters $(\hat\gamma, \hat\beta) = (0.021, 1.227)$ with $-2\log L(\text{WD}) = 460.681$. Fitting the GE distribution results in the estimated parameters $(\hat\alpha, \hat\lambda) = (1.433, 0.057)$ with $-2\log L(\text{GE}) = 460.941$. Therefore, the log likelihood ratio statistics between the EWD and its reduced WD and GE sub-models are negligible, indicating that the three-parameter EWD does not fit the given data set significantly better than either reduced model.
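A small R sketch (not part of the chapter's appendix) shows how such likelihood ratio comparisons could be carried out from the reported $-2\log L$ values; each reduced model differs from the EWD by one parameter, so a $\chi^2_1$ reference distribution is used, and absolute values are taken because of rounding in the reported fits.

# Likelihood ratio comparisons based on the reported -2 log-likelihoods
neg2logL.EWD <- 460.693
neg2logL.WD  <- 460.681
neg2logL.GE  <- 460.941
lrt.WD <- abs(neg2logL.WD - neg2logL.EWD)          # EWD vs. nested WD
pchisq(lrt.WD, df = 1, lower.tail = FALSE)         # large p-value: WD not rejected
lrt.GE <- abs(neg2logL.GE - neg2logL.EWD)          # EWD vs. nested GE
pchisq(lrt.GE, df = 1, lower.tail = FALSE)         # large p-value: GE not rejected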
Applying the estimation processes developed in Sect. 3, the parameter estimates are $\hat\gamma = (0.021, 0.019, 0.021, 0.020, 0.021)$ and $\hat\beta = (1.231, 1.263, 1.227, 1.248, 1.224)$ for the maximum likelihood estimate, the midpoint approximation estimate, the EM-algorithm estimate, the moment method estimate, and the probability plot estimate, respectively, under the Weibull distribution modeling procedure with the given data set. Meanwhile, applying the estimation processes developed in Sect. 4, the parameter estimates are $\hat\alpha = (1.433, 1.514, 1.433, 1.513, 1.499)$ and $\hat\lambda = (0.057, 0.059, 0.057, 0.059, 0.059)$ for the maximum likelihood estimate, the midpoint approximation estimate, the EM-algorithm estimate, the moment method estimate, and the probability plot estimate, respectively, under the GE distribution modeling procedure with the given data set. It can be seen that these estimates are virtually identical, as are the estimated GE density functions, distribution functions, and hazard functions. Since all the estimates of $\alpha$ are greater than 1, the estimated GE densities are unimodal and the estimated GE hazard functions are increasing.
Let the likelihood function of (1), based on progressively type-I interval censored data $D = \{(X_i, R_i, t_i),\ i = 1, 2, \ldots, m\}$, be represented as follows:
$$L(\theta \mid D) \propto \prod_{i=1}^{m}\left[F(t_i;\theta) - F(t_{i-1};\theta)\right]^{X_i}\left[1 - F(t_i;\theta)\right]^{R_i}, \qquad (48)$$
and let the joint prior distribution for $\theta$ in the likelihood function $L(\theta \mid D)$ be denoted by $h(\theta)$. Then the posterior joint likelihood for a given progressively type-I censored data set $D$ of size $n$ can be obtained as follows:
3. Set $j = j + 1$.
4. Repeat Step (2) to Step (3) until $j = k + 1$.
5. Set $i = i + 1$.
6. Repeat Step (1) to Step (5) for a large number of iterations, say until $i = N + 1$.
Given $j = 1, 2, \ldots, k$, the empirical distribution of $\theta_j$ can then be described by the realizations of $\theta_j$ from the constructed Markov chain after some burn-in period.
Since no prior information is provided along with the real data set, improper priors for $\alpha$ and $\lambda$ with hyperparameters $a = b = c = d = 0$ may be used for the investigation of the MCMC process in this study. Sun (1997) gave a detailed discussion of the Jeffreys priors for the parameters $\gamma$ and $\beta$ of the Weibull distribution in the Bayesian estimation procedure based on a complete random sample. However, the Jeffreys prior, which is proportional to the square root of the determinant of the Fisher information matrix, is difficult to derive based on progressively type-I interval censored data. For simplicity, the Jeffreys priors under random sampling can be adopted to demonstrate the MCMC process. The Jeffreys priors for the Weibull distribution of (4) have the following form:
$$h_1(\beta)\, h_2(\gamma) \propto \frac{1}{\beta\gamma}. \qquad (53)$$
Lin and Lio (2012) studied GE distribution modeling and Weibull distribution modeling using a Bayesian approach based on progressive type-I interval censored data under the same pre-specified inspection times (in months) as the given real data set, and gave a detailed discussion of Bayesian estimation based on progressive type-I interval censored data.
The data set of 112 patients with plasma cell myeloma treated at the National Cancer Institute (Carbone et al. 1967) has been used for two-parameter GE distribution modeling and two-parameter Weibull distribution modeling through the non-Bayesian approaches in Sect. 5. It was mentioned there that the two-parameter GE and two-parameter Weibull models were virtually indistinguishable for modeling the data set in Table 1 through the likelihood process. In this section, model selection between the GE and Weibull distributions is investigated within a Bayesian framework.
Equation (54) can be viewed as the expectation $E\left(f(D \mid \theta, M)\right)$ taken with respect to the prior distribution $\Pi(\theta \mid M)$ and can be approximated by the Monte Carlo method as follows:
$$L(D \mid M) \approx \frac{1}{N - N_b}\sum_{i=N_b}^{N} f\left(D \mid \theta^{(i)}, M\right), \qquad (55)$$
where $\theta^{(i)} = (\alpha^{(i)}, \lambda^{(i)})$ or $\theta^{(i)} = (\beta^{(i)}, \gamma^{(i)})$, with the index $i$ running from the burn-in period $N_b$ to the Gibbs sampler size $N$.
However, there are some difficulties with this approximation of the likelihood; see, for example, Robert and Marin (2008). In this study, the approximation (55) is infeasible, because the size of the Monte Carlo simulation would have to be very large, say $10^{10}$ or more, to guarantee the convergence of the desired quantity. Regarding the model selection, we also tried to find the limiting model probabilities of a reversible jump MCMC between the two models' posterior likelihoods [for more information about reversible jump MCMC, the reader may refer to Green (1995)]. Yet we never know when the two competing models' transition probabilities reach balance. Instead, we propose a novel idea for dealing with such a "mixed" type of data: a "supervised" mixture model $M_m$ containing both the GE and Weibull distributions. The MCMC method is then employed to calculate the mixing proportions, $\pi_w$ and $\pi_G$, of the two components, which serve as the weights of the two models' posterior probabilities and as the criterion for model selection. Through this process we find the calculation fast and simple. More detail is given as follows.
Assume the interval censored data $\{X_i : i = 1, 2, \ldots, m\}$ came from the mixture model of the GE and Weibull distributions,
$$L(D \mid M_m) = \pi_w\, L(D \mid M_w) + (1 - \pi_w)\, L(D \mid M_G), \qquad (56)$$
where $L(D \mid M_G)$ and $L(D \mid M_w)$ are the posterior likelihoods of the GE and Weibull models defined in (54). When $\pi_w$ is 0, the $X_i$ follow the GE model, and when $\pi_w$ is 1, the $X_i$ follow the Weibull model. Also notice that if $\pi_w > 1/2$, the Weibull model is preferred, since the Weibull distribution has more weight in the mixture model $M_m$; otherwise, the GE model is preferred. Next, the estimation of $\pi_w$ can be done by the usual MCMC process.
Assume the prior distribution of $\pi_w$ is uniform over (0,1). Then the Gibbs scheme to estimate $\pi_w$ in the mixture model is given as follows (a small R sketch of the weight update appears after the list).
• Set the initial values of all parameters.
• For $i = 1$ to $N_w$, do
1. Update the parameters $\theta^{(i)} = (\alpha^{(i)}, \lambda^{(i)})$ of the GE distribution using the posterior likelihood $L(D \mid M_m)$ in (56).
2. Update the parameters $\theta'^{(i)} = (\beta^{(i)}, \gamma^{(i)})$ of the Weibull distribution using the posterior likelihood $L(D \mid M_m)$ in (56).
3. Update the parameter $\pi_w$ using the M-H algorithm through the posterior likelihood in (56), with the proposal density $q_w(\pi_w^{*} \mid \pi_w) \sim U(0,1)$ to avoid local extrema. Specifically, draw the candidate $\pi_w^{*}$ from $q \sim U(0,1)$, then accept $\pi_w^{*}$ as the $i$th state value, $\pi_w^{(i)}$, with probability
$$\min\left\{1,\ \frac{\pi_w^{*}\, L\left(D \mid M_w, \theta'^{(i)}\right) + \left(1 - \pi_w^{*}\right) L\left(D \mid M_G, \theta^{(i)}\right)}{\pi_w^{(i-1)}\, L\left(D \mid M_w, \theta'^{(i)}\right) + \left(1 - \pi_w^{(i-1)}\right) L\left(D \mid M_G, \theta^{(i)}\right)}\right\};$$
otherwise, the $i$th state value, $\pi_w^{(i)}$, is $\pi_w^{(i-1)}$.
• The Bayes estimate of $\pi_w$ is the sample mean of $\{\pi_w^{(i)}\}$ after some burn-in period $N_b''$.
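The following minimal R sketch (with assumed names; it is not the chapter's own implementation) illustrates the M-H update of the mixture weight in Step 3; postW and postG denote the posterior likelihoods $L(D \mid M_w, \theta'^{(i)})$ and $L(D \mid M_G, \theta^{(i)})$ evaluated at the current parameter draws.

# One M-H update of the mixture weight pi.w with a U(0,1) proposal
update.mix.weight <- function(pi.w, postW, postG) {
  pi.cand <- runif(1)                              # candidate drawn from U(0,1)
  num <- pi.cand * postW + (1 - pi.cand) * postG   # mixture likelihood at the candidate
  den <- pi.w    * postW + (1 - pi.w)    * postG   # mixture likelihood at the current state
  if (runif(1) < min(1, num / den)) pi.cand else pi.w
}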
Fig. 1 Time series plots of the MCMC samplers of the parameters (α, λ) of the GE distribution and (β, γ) of the Weibull distribution
The simulation study shows that when the progressively type-I censored data $\{X_i : i = 1, 2, \ldots, 8\}$ are generated from the GE distribution or from the Weibull distribution, the MCMC estimate of $\pi_w$ identifies the correct model about 989 times out of 1,000 for each situation.
Given the real data set in Table 1, we apply the proposed Bayesian procedure with the improper priors for the GE model and the Jeffreys priors for the Weibull model, and show the time series plots of the MCMC samplers of the parameters $(\alpha, \lambda)$ in the GE distribution and $(\beta, \gamma)$ in the Weibull distribution in Fig. 1. Even with different initial values of the parameters, the time series plots are stable and have a similar pattern. The Bayes estimates are $\hat\alpha = 1.547$ (variance $0.0048^2$) and $\hat\lambda = 0.0592$ (variance $0.0136^2$) if the data are assumed to come from the GE distribution. On the other hand, the Bayes estimates are $\hat\beta = 1.2789$ (variance $0.0003^2$) and $\hat\gamma = 0.0189$ (variance $0.0000013$) if the data are assumed to come from the Weibull distribution.
With the aid of the above mixture model, the MCMC estimate of $\pi_w$ ($N = 200{,}000$, $N_b = 100{,}000$) is 0.128294 (with standard deviation 0.103). Since this estimate is well below 1/2, the GE model is preferred for this data set.
Appendix
############################################
#Random sample from Generalized exponential distribution(GED)
# n: sample size; alpha, lambda: parameters
############################################
# random number for GED
rGexp=function(n,alpha,lambda){
U=runif(n, min=0, max=1)
return(-1/lambda*log(1.0-U^(1/alpha)))
}
# density for GED
dGexp=function(x,alpha,lambda){
alpha *lambda*(1.0-exp(-lambda *x))^(alpha -1)*exp(-lambda *x)
}
# Distribution function of GED
# x is input; alpha, lambda: parameters
pGexp=function(x,alpha,lambda) (1.0 - exp(-x*lambda))^alpha
######################################
# Procedure for Maximum Likelihood Estimate
######################################
estMLE = function(x,R,T,inita,initb)
{
  m = length(x)
  # negative log-likelihood based on the interval-censored likelihood (1)
  obj.mle = function(parm)
  {
    alpha = parm[1]; lambda = parm[2]
    tmpFi  = pGexp(T[-1], alpha, lambda)       # F(t_i)
    tmpFi1 = pGexp(T[-(m+1)], alpha, lambda)   # F(t_{i-1})
    tmp  = (tmpFi - tmpFi1)^x * (1 - tmpFi)^R
    logL = log(tmp)
    # logL could be infinite because of the log; drop non-finite terms
    -sum(logL[is.finite(logL)])
  }
  pa  = c(inita, initb)
  tmp = optim(pa, obj.mle, method="L-BFGS-B", lower=c(0.001, 0.001))
  tmp$par                                      # MLEs of (alpha, lambda)
}
################################################
# Procedure for Mid-point Module from DC
################################################
MidPT = function(x,R,T,inita,initb)
{
  m  = length(R)
  mi = (T[-1] + T[-(m+1)])/2                   # interval mid-points
  obj.MLE = function(parm){
    alpha = parm[1]; lambda = parm[2]
    logL = x*log(dGexp(mi, alpha, lambda)) +
           R*log(1 - pGexp(T[-1], alpha, lambda))
    -sum(logL)                                 # negative mid-point log-likelihood
  } # end of the obj.MLE
  pa  = c(inita, initb)
  tmp = optim(pa, obj.MLE, method="L-BFGS-B", lower=c(0.001, 0.001))
  tmp$par                                      # mid-point estimates of (alpha, lambda)
}
###################################################
# Procedure for EM algorithm
###################################################
# ydg, lndg and y2Edg (not shown in this extract) are the integrands
# y*dGexp(y), log(1-exp(-lambda*y))*dGexp(y) and y/(exp(lambda*y)-1)*dGexp(y)
# used for the conditional expectations E1-E6 of Sect. 4.
EM = function(x,R,T,inal,inlam)
{
  m = length(R)
  n = sum(x) + sum(R)
  al1  = inal
  lam1 = inlam
  E1=E2=E3=E4=E5=E6= numeric(m)
  mm = m+1
  Cont = TRUE
  while( Cont ){
    alp = al1
    lam = lam1
    # E-step
    for(i in 2:mm){
      d1 = pGexp(T[i],alp,lam) - pGexp(T[i-1],alp,lam)
      d2 = 1.0 - pGexp(T[i],alp,lam)
      EMva1 = integrate(ydg,lower=T[i-1],upper=T[i],alpha=alp,lambda=lam)
      E1[i-1] = EMva1$value/d1
      EMva1 = integrate(lndg,lower=T[i-1],upper=T[i],alpha=alp,lambda=lam)
      E2[i-1] = EMva1$value/d1
      EMva1 = integrate(ydg,lower=T[i],upper=Inf,alpha=alp,lambda=lam)
      E3[i-1] = EMva1$value/d2
      EMva1 = integrate(lndg,lower=T[i],upper=Inf,alpha=alp,lambda=lam)
      E4[i-1] = EMva1$value/d2
      EMva1 = integrate(y2Edg,lower=T[i-1],upper=T[i],alpha=alp,lambda=lam)
      E5[i-1] = EMva1$value/d1
      EMva1 = integrate(y2Edg,lower=T[i],upper=Inf,alpha=alp,lambda=lam)
      E6[i-1] = EMva1$value/d2
    }
    # M-step: Eqs. (38) and (39)
    al1  = -n/sum(x*E2 + R*E4)
    lam1 = n/( sum(x*E1+R*E3) - (al1-1)*sum(x*E5+R*E6) )
    # Convergence checking
    Cont = max(abs(al1-alp), abs(lam1-lam)) > 1e-6
  } # end of while loop
  c(al1, lam1)                                 # EM estimates of (alpha, lambda)
}
############################################
# Procedure for Moment method
############################################
MMM = function(x,R,T,inal,inlam)
{
  m = length(R)
  n = sum(x) + sum(R)
  al1  = inal
  lam1 = inlam
  E1 = numeric(m)
  E2 = numeric(m)
  E3 = numeric(m)
  E4 = numeric(m)
  mm = m+1
  Cont = TRUE
  while( Cont ){
    alp = al1
    lam = lam1
    # truncated first and second moments E1, E2 on [t_{i-1},t_i) and E3, E4 on
    # [t_i, Inf); ydg and y2dg are the integrands y*dGexp(y) and y^2*dGexp(y)
    # (their definitions are not shown in this extract)
    for(i in 2:mm){
      d1 = pGexp(T[i],alp,lam) - pGexp(T[i-1],alp,lam)
      d2 = 1.0 - pGexp(T[i],alp,lam)
      MMEva = integrate(ydg,lower=T[i-1],upper=T[i],alpha=alp,lambda=lam)
      E1[i-1] = MMEva$value/d1
      MMEva = integrate(y2dg,lower=T[i-1],upper=T[i],alpha=alp,lambda=lam)
      E2[i-1] = MMEva$value/d1
      MMEva = integrate(ydg,lower=T[i],upper=Inf,alpha=alp,lambda=lam)
      E3[i-1] = MMEva$value/d2
      MMEva = integrate(y2dg,lower=T[i],upper=Inf,alpha=alp,lambda=lam)
      E4[i-1] = MMEva$value/d2
    }
    #-----------------------------------------
    # psigamma(x, deriv=1) is the derivative of the digamma function;
    # MMEq(b) = 0 corresponds to the moment equation (45)
    #-----------------------------------------
    MMEq = function(b){
      n*(sum(x*E2+R*E4))/(sum(x*E1+R*E3))^2 - 1.0 -
        (psigamma(1,deriv=1)-psigamma(b+1,deriv=1)) /
        ( digamma(b+1)-digamma(1) )^2
    }
    # update alpha from Eq. (45) and lambda from Eq. (46)
    al1  = uniroot(MMEq, c(0.01, 50))$root     # bracketing interval is an arbitrary choice
    lam1 = n*(digamma(al1+1)-digamma(1))/sum(x*E1+R*E3)
    # Convergence checking
    Cont = max(abs(al1-alp), abs(lam1-lam)) > 1e-6
  } # end of while loop
  c(al1, lam1)                                 # moment method estimates
}
#####################################################
# Procedure of Probability-plot Estimation Method
#####################################################
Pplot = function(x,R,Ti,inita,initb)
{
  n = sum(x) + sum(R)
  m = length(R)
  p    = numeric(m)
  T    = numeric(m)
  hatF = numeric(m)
  cux  = 0.0
  cur  = 0.0
  mm   = m+1
  for(i in 2:mm){ T[i-1] = Ti[i] }             # right end points t_1, ..., t_m
  for (j in 1:m){
    sur = n - cux - cur                        # number still at risk
    if ( sur <= 0 ) { p[j] = 1 }
    else { p[j] = x[j]/sur }                   # conditional failure probability
    hatF[j] = 1 - prod(1 - p[1:j])             # product-limit estimate of F(t_j)
    cux = cux + x[j]
    cur = cur + R[j]
  }
  pa = c(inita, initb)
  # minimize sum (t_j - F^{-1}(hatF_j))^2, the GE analogue of the Weibull
  # probability-plot criterion in Sect. 3
  obj.pp = function(parm){
    alpha = parm[1]; lambda = parm[2]
    tq = -log(1 - hatF^(1/alpha))/lambda       # GE quantiles at hatF
    sum((T - tq)^2)
  }
  tmp = optim(pa, obj.pp, method="L-BFGS-B", lower=c(0.001, 0.001))
  tmp$par                                      # probability-plot estimates
}
#########################################################
#Ox program code to implement Model Selection
# Between WB and GED from Mixture Model
#########################################################
#include<oxstd.h>
#include<oxprob.h>
WBCDF(xt, beta, gamma) {return (1-exp(-gamma*(xt^beta)));}
GEDF(xt, alpha, lambda) {return (1-exp(-lambda*xt))^alpha;}
Mixed(xt, beta, gamma, alpha, lambda, probWB)
{return probWB*(1-exp(-gamma*(xt^beta)))+(1- probWB)*
(1-exp(-lambda*xt))^alpha;}
logLtypeIWB(icX,icR,T,beta,gamma){
decl i,j, iInterval=rows(T), logLL=0 ;
decl logProb=zeros(iInterval,1),logProbR=zeros(iInterval,1);
logProb[0] =log(1-exp(-gamma*(T[0])^beta));
logProbR[0]=log(1-WBCDF(T[0],beta,gamma));
for(i=1;i<iInterval;i++){ //begin of for i loop
logProb[i] =log(WBCDF(T[i],beta,gamma)-WBCDF(T[i-1],
beta,gamma));
logProbR[i]=log(1-WBCDF(T[i],beta,gamma));
} // end of for i loop
logLL=sumc(logProb.*icX)+sumc(logProbR.*icR);
return logLL;
} // end of logLtypeI
logLMixed(icX,icR,T,alpha,lambda,beta,gamma,probWB){
// This function compute the loglikelihood of given counts
// to be used in M-H algorithm
decl i,j, iInterval=rows(T), logLL=0 ;
decl logProb=zeros(iInterval,1),logProbR=zeros(iInterval,1);
logProb[0] =log(Mixed(T[0],beta,gamma,alpha,lambda,probWB));
logProbR[0]=log(1-Mixed(T[0],beta,gamma,alpha,lambda,probWB));
for(i=1;i<iInterval;i++){ //begin of for i loop
logProb[i]=log(Mixed(T[i],beta,gamma,alpha,lambda,probWB)
-Mixed(T[i-1], beta, gamma,alpha, lambda, probWB));
logProbR[i]=log(1-Mixed(T[i],beta,gamma,alpha,lambda,probWB));
} // end of for i loop
logLL=sumc(logProb.*icX)+sumc(logProbR.*icR);
return logLL;
} // end of logLtypeIGE
logLtypeIGE(icX,icR,T,alpha,lambda){
decl i,j, iInterval=rows(T), logLL=0 ;
decl logProb=zeros(iInterval,1),logProbR=zeros(iInterval,1);
logProb[0] =alpha*log(1-exp(-lambda*T[0]));
logProbR[0]=log(1-GEDF(T[0],alpha,lambda));
for(i=1;i<iInterval;i++){ //begin of for i loop
logProb[i] =log(GEDF(T[i],alpha,lambda)
-GEDF(T[i-1],alpha,lambda));
logProbR[i]=log(1-GEDF(T[i],alpha,lambda));
} // end of for i loop
logLL=sumc(logProb.*icX)+sumc(logProbR.*icR);
return logLL;
} // end of logLtypeIGE
main()
{ decl seed=182632;
decl simmodel=2;
//simmodel=0,if WB;simmodel=1,if GE;simmodel=2 if real data
decl alpha_sim=1.56,lambda_sim=.06;
// targeted values if simmodel=1
decl beta_sim=1.12,gamma_sim=.03;
// targeted values if simmodel==0
decl N_mcmc=1000*100; // length of MCMC
decl burninperiod=1000*50; //the burn-in period in MCMC
decl censorT=<5.5,10.5,15.5,20.5,25.5,30.5,40.5,50.5,60.5>’;
//decl prob=<0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.25,1>;
//the prob of withdrawals
decl prob=<0,0,0,0,0,0,0,0,1>; // the prob of withdrawals
decl N_obsn=112;//sample size if simmodel=0 or 1, disabled
if THE real data is applied
decl beta_start=beta_sim+0.1*rann(1,1)[0]; //starting points
decl gamma_start=gamma_sim+.0013*+rann(1,1)[0];
// starting points
decl alpha_start=alpha_sim+0.1*rann(1,1)[0];
// starting points
decl lambda_start=lambda_sim+.000125*+rann(1,1)[0];
// starting points
decl cXF,vec,cXR,i,hh,hh1=.001, sum_Like=0,iInt=rows(censorT);
decl logL, logLcand;
decl gamma_mc,beta_mc,alpha_mc,lambda_mc;
decl invgamma_cand,invlambda_cand;
decl gamma_cand,beta_cand,alpha_cand,lambda_cand;
decl nowstate,newstate;
decl alpha_rjmc, lambda_rjmc,beta_rjmc, gamma_rjmc;
decl cGE=0, cWB=0,cRJMCMC=0,probGE=1/5, probWB=1/2;
if (imod(i,5000*2)==0)print(".") ;
}// end of for loop
print("\n The likelihood of Weibull model =",
1/((sum_Like)/(N_mcmc-burninperiod)[0]),"\n");
alpha_mc=ones(N_mcmc,1);
lambda_mc=ones(N_mcmc,1);
lambda_mc[0]=lambda_start; alpha_mc[0]=alpha_start;
// starting points of parameters
sum_Like=0;
//compute the averaged (posterior) likelihood
// based on the MCMC outputs
for (i=1;i<N_mcmc;i++){
// estimate the parameters of GE model
if (i<2000) hh=hh1*5;else hh=hh1;
lambda_cand= 1/(1/lambda_mc[i-1] + rann(1,1) );
alpha_cand=alpha_mc[i-1]+.05*rann(1,1);
// in case lambda_cand is negative
while( alpha_cand>2.8)alpha_cand=alpha_mc[i-1]+hh*rann(1,1);
logL = logLtypeIGE(cXF,cXR,censorT,alpha_mc[i-1],
lambda_mc[i-1]);
logLcand = logLtypeIGE(cXF,cXR,censorT,alpha_cand,
lambda_cand);
if(log(ranu(1,1)) < logLcand-logL)
{alpha_mc[i]= alpha_cand; lambda_mc[i]= lambda_cand; }
else{alpha_mc[i]= alpha_mc[i-1];
lambda_mc[i]= lambda_mc[i-1];}
if (imod(i,5000*2)==0) print(".");
if(i>burninperiod){
if(i==burninperiod)
sum_Like=1/exp(logLtypeIGE(cXF,cXR,censorT,alpha_mc[i],
lambda_mc[i]))[0];
else sum_Like=sum_Like+1/exp(logLtypeIGE(cXF,cXR,censorT,
alpha_mc[i], lambda_mc[i]))[0]; }
}
print("\n The likelihood of GE model=",
1/((sum_Like)/(N_mcmc-burninperiod)),"\n");
for (i=0; i<N_mcmc;i++)
fprint(file1," ", "\t",beta_mc[i],"\t",
gamma_mc[i],"\t", alpha_mc[i],"\t", lambda_mc[i],"\n");
alpha_mc=ones(N_mcmc,1)*alpha_sim;
lambda_mc=ones(N_mcmc,1)*lambda_sim;
lambda_mc[0]=0.06; alpha_mc[0]=1.6;
beta_mc=ones(N_mcmc,1)*beta_sim;
gamma_mc=ones(N_mcmc,1)*gamma_sim;
gamma_mc[0]=0.03;//meanc(gamma_mc[burninperiod:N_mcmc-1])[0];
beta_mc[0]=1.2;//meanc(beta_mc[burninperiod:N_mcmc-1])[0];
probWB_mc= ones(N_mcmc,1)*.5;
for (i=1;i<N_mcmc;i++){ // estimate the parameters of WB
gamma_cand= 1/(1/gamma_mc[i-1] + rann(1,1) );
beta_cand=beta_mc[i-1]+.01*(rann(1,1));
logL=logLMixed(cXF,cXR,censorT,alpha_mc[i-1],lambda_mc[i-1],
beta_mc[i], gamma_mc[i], probWB_mc[i-1]);
logLcand = logLMixed(cXF,cXR,censorT,alpha_mc[i-1],
lambda_mc[i-1],beta_cand, gamma_cand, probWB_mc[i-1]);
probWB_cand=ranu(1,1);
while( probWB_cand<0 ||probWB_cand>1) probWB_cand=
probWB_mc[i-1]+.05*(rann(1,1));
logL = logLMixed(cXF,cXR,censorT,alpha_mc[i], lambda_mc[i],
beta_mc[i], gamma_mc[i], probWB_mc[i-1]);
logLcand=logLMixed(cXF,cXR,censorT,alpha_mc[i],lambda_mc[i],
beta_mc[i], gamma_mc[i], probWB_cand);
if(log(ranu(1,1))<logLcand-logL){probWB_mc[i]=probWB_cand;}
else { probWB_mc[i]= probWB_mc[i-1]; }
if (imod(i,5000*2)==0) print(".");
}// end of for loop
alpha_rjmc=meanc(alpha_mc[burninperiod:N_mcmc-1])[0];
lambda_rjmc=meanc(lambda_mc[burninperiod:N_mcmc-1])[0];
gamma_rjmc=meanc(gamma_mc[burninperiod:N_mcmc-1])[0];
beta_rjmc=meanc(beta_mc[burninperiod:N_mcmc-1])[0];
print("\n The MCMC estimates for WeiBull is ");
print("\n beta hat=",beta_rjmc);
print("\t gamma hat=",gamma_rjmc);
print("\n The MCMC estimates for GE is ");
print("\n alpha hat=", alpha_rjmc);
print("\t lambda hat=",lambda_rjmc);
if(simmodel==0) print("\n\n The true model is Weibull");
if(simmodel==1) print("\n\n The true model is GED");
if(simmodel>1) print("\n\n THE real data is applied.");
print("\t The last 20 obsn of probWB is=",
probWB_mc[N_mcmc-20:N_mcmc-1]’);
print("\t min probWB hat=",
min(probWB_mc[burninperiod:N_mcmc-1])[0],"\n" );
print("\t max probWB hat=",
max(probWB_mc[burninperiod:N_mcmc-1])[0],"\n" );
print("\t probWB hat=",
meanc(probWB_mc[burninperiod:N_mcmc-1])[0],"\n" );
print("\t sd probWB hat=",
varc(probWB_mc[burninperiod:N_mcmc-1])[0]^.5,"\n" );
for (i=0; i<N_mcmc;i++) fprint(file," ", probWB_mc[i], "\n");
fclose(file); fclose(file1);
}
#################################################################
# Using WinBUGS to implement Built-in probability transition jump
# Selecting Probability Model between WB and GED
#################################################################
model { # Define the mixture model using the Poisson zeros trick
const<- 5; zero<-0; zero ~ dpois(zero.mean)
zero.mean <- const + (-1)*logL
logL<- log( rho*exp(sum(logLGEx[])+sum(logLGEr[]))
+(1-rho)*exp(sum(logLWBx[])+sum(logLWBr[])) )
# compute the likelihood of GE model
pr[1] <- pow(1-exp(-lambda*tt[1]), alpha)
for (i in 2:9) {
pr[i] <-pow(1-exp(-lambda*tt[i]),alpha)
-pow(1-exp(-lambda*tt[i-1]), alpha)
}
pr[10] <- 1-pow(1-exp(-lambda*tt[9]), alpha)
xxsum <- sum(xx[1:10]) ;xx[1:10] ~ dmulti (pr[], xxsum)
for (i in 1:8) {
prob[i]<-equals(rr[i],0)*1/2+
(1-equals(rr[i],0))*sum(pr[(i+1):10])
nn [i] <-rr[i]+equals(rr[i],0);rr[i]~dbin(prob[i],nn[i])}
for (i in 1:10) {logLGEx[i]<-xx[i]*log(pr[i])}
for (i in 1:8) {logLGEr[i]<-rr[i]*log(prob[i])}
References
Aggarwala, R.: Progressive interval censoring: some mathematical results with applications to
inference. Commun. Stat. Theory Methods 30, 1921–1935 (2001)
Balakrishnan, N., Aggarwala, R.: Progressive Censoring: Theory, Methods and Applications.
Birkhauser, Boston (2000)
Carbone, P.P., Kellerhouse, L.E., Gehan, E.A.: Plasmacytic myeloma: A study of the relationship
of survival to various clinical manifestations and anomalous protein type in 112 patients. Am.
J. Med. 42, 937–948 (1967)
Chen, D.G., Lio, Y.L.: Parameter estimations for generalized exponential distribution under
progressive type-I interval censoring. Comput. Stat. Data Anal. 54, 1581–1591 (2010)
Dempster, A.P., Laird N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM
algorithm. J. R. Stat. Soc. Ser. B 39, 1–38 (1977)
Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions and the Bayesian restoration of
images. IEEE Trans. Pattern. Anal. Mach. Intell. 6, 721–741 (1984)
Green, P.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determi-
nation. Biometrika 82, 711–732 (1995)
Gupta, R.D., Kundu, D.: Generalized exponential distributions. Aust. N. Z. J. Stat. 41, 173–188
(1999)
Gupta, R.D., Kundu, D.: Exponentiated exponential distribution: an alternative to gamma and
Weibull distributions. Biom. J. 43, 117–130 (2001a)
Gupta, R.D., Kundu, D.: Generalized exponential distributions: different method of estimations.
J. Stat. Comput. Simul. 69, 315–338 (2001b)
Gupta, R.D., Kundu, D.: Generalized exponential distributions: statistical inferences. J. Stat.
Theory Appl. 1, 101–118 (2002)
Gupta, R.D., Kundu, D.: Discriminating between Weibull and generalized exponential distribu-
tions. Comput. Stat. Data Anal. 43, 179–196 (2003)
Gupta, R.D., Kundu, D.: Generalized exponential distributions: existing results and some recent
developments. J. Stat. Plann. Inference 137, 3537–3547 (2007)
Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications.
Biometrika 57, 97–109 (1970)
Kundu, D., Gupta, R.D.: Generalized exponential distribution: Bayesian inference. Comput. Stat.
Data Anal. 52, 1873–1883 (2008)
Kundu, D., Pradhan, B.: Bayesian inference and life testing plans for generalized exponential
distribution. Sci. China Ser. A Math. 52, 1373–1388 (2009)
Lawless, J.F.: Statistical Models and Methods for Lifetime Data, 2nd edn. Wiley, New Jersey (2003)
Lin, Y.-J., Lio, Y.L.: Bayesian inference under progressive type-I interval censoring. J. Appl. Stat.
39, 1811–1824 (2012)
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equations of state
calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953)
Mudholkar, G.S., Srivastava, D.K.: Exponentiated Weibull family for analyzing bathtub failure
data. IEEE Trans. Reliab. 42, 299–302 (1993)
Mudholkar, G.S., Srivastava, D.K., Friemer, M.: The exponentiated Weibull family: a reanalysis of
the bus-motor-failure data. Technometrics 37, 436–445 (1995)
Mudholkar, G.S., Srivastava, D.K., Kollia, G.D.: A generalization of the Weibull distribution with
application to the analysis of survival data. J. Am. Stat. Assoc. 91, 1575–1583 (1996)
Ng, H., Wang, Z.: Statistical estimation for the parameters of Weibull distribution based on
progressively type-I interval censored sample. J. Stat. Comput. Simul. 79, 145–159 (2009)
Robert, C.P., Marin, J.M.: On some difficulties with a posterior probability approximation
technique. Bayesian Anal. 3, 427–442 (2008)
Sun, D.: A note on noninformative priors for Weibull distributions. J. Stat. Plann. Inference 61,
319–338 (1997)
Techniques for Analyzing Incomplete Data
in Public Health Research
1 Introduction
There are several resources available which offer suggestions on how to prevent
or reduce non-response and attrition that should be considered in advance of data
collection (Little et al. 2012; Chang et al. 2009). There should be a sincere attempt
to avoid unplanned non-response when possible, but it is not realistic to assume it
can be avoided altogether. Even the most well-designed studies are susceptible to
missing values! In a research setting in public health, the challenge is even greater.
Participants could miss appointments, refuse to respond, become unwell, or simply
lose interest in the study.
Since missing values are difficult to control, it is important to know how to move
forward and to understand what assumptions you are willing (or unwilling) to make.
It should be mentioned that many statistical analysis and estimation procedures were
not designed to handle missing values. Conventional statistical methods assume that
all variables of a particular model are measured for all cases. Therefore, the software
being used, simply by default, may discard all incomplete cases and proceed. A lot
of research has been devoted to demonstrating that this approach is only reasonable
under some very strict assumptions regarding the nature of the missing data, but
is very unreasonable and misleading otherwise. For this reason, a lot of emphasis
should be placed on what assumptions are reasonable to make with your data.
The first part of this chapter is devoted to the missing data theory, as described by
Rubin (1976). It continues with an overview of some ad hoc approaches to handling
missing values and describes the inherent faults with these techniques. Likelihood-
based methods and multiple imputation are then discussed with particular attention
given towards multiple imputation. Next, an overview of useful R packages for visu-
alization of missing data and analysis through multiple imputation are presented. An
example in public health is also used to demonstrate the impact of missing data and
the utility of multiple imputation for obtaining unbiased and efficient estimates.
Let the complete data be denoted by $Y_{n \times p}$, where $n$ denotes the number of observational units and $p$ represents the number of variables. The pattern of missingness describes which variables are missing and where the missing data are located.
Schafer and Graham (2002) describe some of the patterns of missingness which
include univariate, monotone, and arbitrary patterns. A univariate pattern describes
the situation in which missingness occurs in only one of the variables while all other
variables are completely observed. A monotone pattern is often used to describe
a dropout pattern. That is, an observational unit has observed values up until the
ith position, but has all subsequent values missing. Finally, an arbitrary pattern
describes when missing values occur in no particular pattern.
Figure 1 shows some examples of each type of missingness pattern. These plots are available with the R package VIM and are a useful initial assessment of your data. Evaluating these plots can illustrate the frequency at which particular variables are missing and can show particular patterns in the data. In Fig. 1a, 59 % of experimental units have observed data for variables $x_1$–$x_4$, whereas 41 % of experimental units have observed data for variables $x_1$–$x_3$, but have missing values for $x_4$. This is a univariate missingness pattern since the only missing values in that particular data set occurred in one of the variables, in our case $x_4$.
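A one-line R sketch of how such plots can be produced with VIM is shown below; dat is a placeholder name for any data frame containing missing values.

# Missingness proportions and pattern combinations, as in Fig. 1
library(VIM)
aggr(dat, prop = TRUE, numbers = TRUE)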
Fig. 1 Examples of potential non-response patterns: gray regions represent missing data. (a) Uni-
variate missingness. (b) Monotone missingness (i). (c) Monotone missingness (ii). (d) Arbitrary
missingness
An indicator matrix $R_{n \times p}$ can also be used to indicate the pattern of non-response. The entries of $R$ are defined as $r_{ij} = 1$, $i = 1, \ldots, n$ and $j = 1, \ldots, p$, if the observation for the $j$th variable of subject $i$ ($y_{ij}$) is missing, and $r_{ij} = 0$ if it is observed.
The strongest of the missing data assumptions is MCAR. It implies that the probability of missing values is unrelated to both observed and unobserved values. This mechanism of missingness is given by
$$P(R \mid Y_{obs}, Y_{mis}, \varphi) = P(R \mid \varphi)$$
for all values of $\varphi$ and at the realized values of $R$ and $Y_{obs}$. In other words, the missing values are a random sample of all data values.
Little (1988) lists a number of important instances where verification of MCAR
is important. One reason is that many methods for handling incomplete data will
work well when this assumption is valid.
Is MCAR a realistic assumption for your data? It depends. Imagine a situation
where the collection of all data is costly. Therefore, researchers may design a study
in which the data on particular variables is only collected on a complete random
sample of participants. This refers to a study that is missing by design (Graham et al.
1996) and since the data is not missing for any reason related to the given variable
or other observed values, then missing completely at random can be considered a
reasonable assumption. If the data is not missing due to a known and completely
random process, then MCAR may be difficult to justify. We will discuss some
methods used for determining types of missingness in Sect. 2.2.4.
Data are missing at random if the probability of missing values can be described through observed values only. This mechanism is given by
$$P(R \mid Y_{obs}, Y_{mis}, \varphi) = P(R \mid Y_{obs}, \varphi)$$
for all values of $\varphi$ and at the realized values of $R$ and $Y_{obs}$. This mechanism requires that the missing values are a random sample of all values within a subclass defined by the observed data. Notice that MCAR is a special case of MAR.
Is MAR a realistic assumption for your data? Again, it depends. Consider a
study where particular variables of interest can only be measured through biopsies.
Since biopsies can be costly, invasive, and risky, it may only be ethical to perform a
biopsy if the participant is considered high risk. If risk is an observed and measured
characteristic of all participants and if the probability that a participant is biopsied
can be fully explained by risk, then the missing biopsy data could be considered
MAR. This is another case of data that is missing by design. However, if the data is
missing for reasons not controlled (or fully understood) by the experimenter, then
the assumption of missing at random may be difficult to justify.
In Bayesian or likelihood-based inference, if the parameters that govern the missingness mechanism ($\varphi$) are distinct from the parameters of the data model ($\theta$) and if the data are MAR or MCAR, then the missing data mechanism is said to be ignorable (Little and Rubin 2002). These parameters are considered distinct if the joint prior distribution for $\theta$ and $\varphi$ is equal to the product of two independent priors. From a frequentist perspective, the parameters are distinct if the joint parameter space is the Cartesian cross-product of the individual parameter spaces of $\theta$ and $\varphi$ (Schafer 1997). There is a marked benefit to being able to classify the missingness mechanism as ignorable, in that one does not have to model the mechanism by which the data became missing.
If the relationship of MAR does not hold, then data are missing not at random.
This implies that the probability that a value is missing cannot be described
fully by observed values. MNAR data automatically implies that the missing data
mechanism is non-ignorable. That means that valid estimation would require that
the missing data mechanism be modeled. Unfortunately, this is not an easy task and
inferences could be highly sensitive to any misspecification. Non-ignorable missing
data techniques are not covered in this chapter. Some resources to consider in the
event of non-ignorable missingness are Little and Rubin (2002), Fitzmaurice et al.
(2011), and Diggle and Kenward (1994).
It is well documented that ad hoc treatments for missing data can lead to serious
problems including loss of efficiency, biased results, and underestimation of error
variation (Belin 2009; White and Carlin 2010; Harel et al. 2012). Many methods
have been discounted for leading to one of those aforementioned problems.
One such method is complete case analysis (CCA), which deletes observational units with any missing values. Therefore, if the data are not MCAR, implementing such a procedure may fail to account for the systematic difference between observed and not fully observed cases. This will potentially cause bias in inferences about parameters. Even if the data are MCAR, CCA will result in a marked loss of efficiency depending on the proportion of incomplete cases.
A second missing data method is single imputation in which researchers replace
missing values with some plausible values. There are a wide range of procedures
which seek to fill in these missing values including mean imputation, hot deck
imputation, regression imputation, or, in longitudinal settings, last observation
carried forward. All of these methods have been shown to underestimate the error
variation which will lead to inflated Type I error. These imputation methods are
discussed in detail by Schafer and Graham (2002).
There are several more sophisticated methods for treating incomplete data
that perform much better than the above methods including maximum likelihood
estimation via the EM algorithm (Dempster et al. 1977), Bayesian analysis (Gelman
et al. 2003), and multiple imputation (Rubin 1987). This chapter focuses on multiple imputation, which is therefore discussed in detail in the following section.
4 Multiple Imputation
One principled method for handling incomplete data is with multiple imputation
(Rubin 1987). This method can be applied to an array of different models, which
can make it more appealing than other principled missing data techniques. The
procedure entails creating several data sets by replacing missing values with a set
of plausible values that represent the uncertainty about the true unobserved values.
There are three stages of multiple imputation: (1) Imputation: Replace each missing
value with m > 1 imputations under a suitable model, (2) Analysis: Analyze each
of the completed data sets using complete data techniques, and (3) Combination:
Combine the m sets of estimates and standard errors using Rubin’s (1987) rules.
There are some considerations that should be made during the imputation stage.
You will need to decide on appropriate variables to use for imputation, the type of
imputation approach you will use, and how many imputations you should make.
4.1.1 Approaches
There are two general approaches to imputing multivariate data: joint modeling and chained equations (van Buuren and Oudshoorn 2000). Both approaches should begin by deciding on what variables should be used for imputation. Collins et al.
(2001) found that including all auxiliary variables in the imputation model prevents
one from inadvertently omitting important causes of missingness. This will aid in
the plausibility of the missing at random assumption. That is, there is evidence
which suggests that the imputation model should utilize all available variables,
including those that are not of specific interest for the analysis. At the very least
you should include all variables that will be used in your analysis.
The first approach, joint modeling, involves specifying a multivariate distribution for the data, and drawing imputations from the conditional distributions of $Y_{mis} \mid Y_{obs}$, usually by MCMC techniques. The multivariate normal distribution is the most
widely used probability model for continuous multivariate data (Schafer 1997).
Schafer (1997) mentions that even when data sets deviate from this distribution,
the multivariate normal model may still be useful in the imputation framework.
One obvious reason for this is that a suitable transformation may help make the
assumption of multivariate normality seem more realistic. Further, in many settings
it is believed that inference by multiple imputation is relatively robust to departures
from the imputation model, particularly when there are small amounts of missing
information (Schafer 1997).
Suppose we have a set of variables $x_1, x_2, \ldots, x_k$ that are believed to follow a multivariate normal distribution. The parameters of this joint distribution are estimated from the observed data. Then, the missing values are imputed from the conditional distribution (also multivariate normal) of the missing values given the observed values for each missing data pattern. Under the assumption of multivariate normality, we can use the R package norm (Schafer 2012) to perform these imputations.
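A minimal sketch of this joint-modeling imputation with the norm package follows; Y is a placeholder name for a numeric data matrix with missing values, and the number of data augmentation steps is an arbitrary illustrative choice.

# Joint-model (multivariate normal) imputation with the norm package
library(norm)
s        <- prelim.norm(as.matrix(Y))          # preliminary sufficient statistics
thetahat <- em.norm(s)                         # ML estimates of the mean and covariance
rngseed(1234)                                  # seed required before simulation
theta    <- da.norm(s, thetahat, steps = 100)  # data augmentation draw of the parameters
Y.imp    <- imp.norm(s, theta, as.matrix(Y))   # one imputed (completed) data set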
The chained equations approach specifies a multivariate imputation model by a series of univariate conditional regressions, one for each incomplete variable. For example, suppose we have a set of variables $x_1, x_2, \ldots, x_k$, where $x_1, x_2, x_3$ have some missing values. Suppose further that $x_1$ is binary, $x_2$ is a count, and $x_3$ is continuous. The chained equations approach would proceed in the following steps (a short sketch using the mice package follows the list):
• Initially, all missing values are filled in at random.
• Then, for all cases where $x_1$ is observed, a logistic regression is performed where $x_1$ is regressed on $x_2, \ldots, x_k$.
• Missing values for $x_1$ are then replaced by simulated draws from the posterior predictive distribution of $x_1$.
• Then, for all cases where $x_2$ is observed, a Poisson regression is performed where $x_2$ is regressed on $x_1, x_3, \ldots, x_k$.
• Missing values for $x_2$ are then replaced by simulated draws from the posterior predictive distribution of $x_2$.
• Then, for all cases where $x_3$ is observed, a linear regression is performed where $x_3$ is regressed on $x_1, x_2, x_4, \ldots, x_k$.
• Missing values for $x_3$ are then replaced by simulated draws from the posterior predictive distribution of $x_3$.
• The process is repeated several times.
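A minimal sketch of this chained-equations procedure using the mice package is given below; dat is a placeholder data frame containing $x_1, \ldots, x_k$, the analysis model for $x_3$ is illustrative only, and the number of imputations is an arbitrary choice.

# Chained-equations imputation and analysis with the mice package
library(mice)
imp <- mice(dat, m = 20, printFlag = FALSE)    # one conditional model per incomplete variable
fit <- with(imp, lm(x3 ~ x1 + x2 + x4))        # analyze each completed data set
summary(pool(fit))                             # combine the results by Rubin's rules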
Next, each of the $M$ data sets can be analyzed using complete data methods. Suppose that $Q$ is a quantity of interest, for example a mean or a regression coefficient. Assume that with complete data, inference about $Q$ would be based on the statement that $\hat{Q} - Q \sim N(0, U)$, where $\hat{Q}$ is the complete-data statistic estimating the parameter $Q$ and $U$ is the complete-data statistic providing the variance of $\hat{Q} - Q$. Since each missing value is replaced by $M$ simulated values, we have $M$ complete data sets and $M$ estimates of $Q$ and $U$. The overall estimate of $Q$ is
$$\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}^{(m)},$$
while the within-imputation and between-imputation variances are
$$\bar{U} = \frac{1}{M}\sum_{m=1}^{M} U^{(m)} \quad\text{and}\quad B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{Q}^{(m)} - \bar{Q}\right)^2,$$
where $U^{(m)}$ is the complete-data variance estimate from the $m$th imputed data set. The total variance $T$ of $(Q - \bar{Q})$ is then
$$T = \bar{U} + \left(1 + \frac{1}{M}\right)B.$$
In the equation above, $\bar{U}$ estimates the variance if the data were complete and $\left(1 + \frac{1}{M}\right)B$ estimates the increase in variance due to the missing data (Rubin 1987).
Interval estimates and significance levels for the scalar $Q$ are based on a Student-$t$ reference distribution,
$$T^{-1/2}\left(Q - \bar{Q}\right) \sim t_{\nu}.$$
Various improved estimates for $\nu$ have been proposed over the years (Barnard and Rubin 1999; Reiter 2007; Marchenko and Reiter 2009). Wagstaff and Harel (2011) compared the various estimates and found that those of Barnard and Rubin (1999) and Reiter (2007) performed satisfactorily.
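The combining rules are easy to apply directly; the following minimal R sketch uses placeholder values for the $M$ point estimates and their complete-data variances, and the degrees of freedom are computed with Rubin's (1987) original formula rather than the improved estimates cited above.

# Manual pooling of a scalar estimate over M imputations
Q.hat <- c(1.02, 0.97, 1.05, 0.99, 1.01)       # placeholder point estimates
U     <- c(0.040, 0.042, 0.039, 0.041, 0.040)  # placeholder complete-data variances
M     <- length(Q.hat)
Q.bar <- mean(Q.hat)                           # overall estimate
U.bar <- mean(U)                               # within-imputation variance
B     <- var(Q.hat)                            # between-imputation variance
T.var <- U.bar + (1 + 1/M) * B                 # total variance
r     <- (1 + 1/M) * B / U.bar                 # relative increase in variance
nu    <- (M - 1) * (1 + 1/r)^2                 # Rubin's (1987) degrees of freedom
c(estimate = Q.bar, se = sqrt(T.var), df = nu)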
Thankfully, draws from the posterior predictive distribution, along with the combination of parameter estimates (including the extension to multivariate parameter estimates), can be accomplished with the R packages mice and norm and are discussed in the example to follow.
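To make the combining rules concrete, here is a minimal sketch for a scalar quantity of interest; qhat and uhat are assumed to hold the M complete-data estimates and their variances, and the degrees-of-freedom refinements cited above are omitted.

## Rubin's rules for a scalar quantity of interest
## qhat: the M complete-data estimates; uhat: their M complete-data variances
pool_scalar <- function(qhat, uhat) {
  M    <- length(qhat)
  qbar <- mean(qhat)                     # overall estimate
  ubar <- mean(uhat)                     # within-imputation variance
  b    <- var(qhat)                      # between-imputation variance
  t    <- ubar + (1 + 1/M) * b           # total variance
  list(estimate = qbar, total.var = t,
       missing.info = b / (ubar + b))    # estimated rate of missing information
}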
One way to describe the impact of missing data uncertainty is a measure called the rate of missing information (Rubin 1987). This measure can be used in diagnostics to indicate how the missing data influence the quantity being estimated (Schafer 1997). If $Y_{mis}$ carries no information about Q, then the estimates $\hat{Q}$ across the imputed data sets are identical, and the total variance (T) reduces to $\bar{U}$. Therefore, $(1 + M^{-1})B/\bar{U}$ estimates the relative increase in variance due to missing data. An estimate of the rate of missing information due to $Y_{mis}$ is
$$\hat{\lambda} = \frac{B}{\bar{U} + B},$$
which does not tend to decrease as the number of imputations increases (Rubin 1987).
Harel (2007) establishes the asymptotic behavior of $\hat{\lambda}$ to help determine the number of imputations necessary when accurate estimates of the missing-information rates are of interest. From this distribution, an approximate 95 % confidence interval for the population rate of missing information is
$$\hat{\lambda} \pm 1.96\,\frac{\hat{\lambda}(1-\hat{\lambda})}{\sqrt{M/2}}.$$
As this interval illustrates, when the rates of missing information are of interest, more imputations are required. In addition, there is a general trend toward increasing the number of imputations, particularly when p-values are of interest (Graham et al. 2007; Bodner 2008; White et al. 2011).
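A short calculation shows how this interval translates into a required number of imputations; the sketch below assumes the worst case $\hat{\lambda} = 0.5$ and reproduces the figure of roughly 300 imputations for a margin of error of 0.04 used later in the example.

## smallest M such that 1.96 * lambda * (1 - lambda) / sqrt(M/2) <= moe
imputations_needed <- function(moe, lambda = 0.5) {
  ceiling(2 * (1.96 * lambda * (1 - lambda) / moe)^2)
}

imputations_needed(moe = 0.04)   # 301, i.e. approximately 300 imputations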
5 Example
Let’s look at a study of sexual risk behavior in South Africa. Cain et al. (2012)
studied the effects of patronizing alcohol serving establishments (shebeens) and
alcohol use in predicting HIV risk behaviors. For the data of this particular analysis,
men and women were recruited from inside shebeens and in the surrounding
areas near shebeens in 8 different communities. Surveys were administered to
measure demographic characteristics, alcohol use, shebeen attendance, and sexual
risk behaviors. It was of interest to determine whether social influences and environmental factors in shebeens contribute to sexual risk behavior independently of alcohol consumption. The variables of interest are:
GENDER 1 if female, 0 if male
AGE Age of survey participant
UNEMP 1 if unemployed, 0 if employed
ALC Alcohol index
SHEBEEN 1 if attends shebeen, 0 if does not attend
RBI Risk behavior index
The alcohol index was measured as the product of alcohol use frequency
and alcohol consumption quantity. The risk behavior index was measured as the
logarithm of one plus the product of number of sexual partners in the past 30 days
and number of unprotected sex acts in the past 30 days. The ultimate goal is to
predict this risky behavior index from all other independent variables.
The original analysis utilized 1,473 people, which will serve as our completely
observed data. We deliberately introduce missing data on several of the variables
under the missing at random assumption for illustrative purposes. For AGE,
24 % of observations were removed conditional on GENDER. For ALC, 29 %
of observations were removed conditional on GENDER and RBI. For RBI, 49 %
of observations were removed conditional on shebeen attendance and auxiliary information regarding the community where the participant was recruited; community 4 had a much higher prevalence of missing RBI responses. Data were set to missing based on probabilities derived from the following models:
Logit(P(AGE missing)) = 0.10 + 0.15 GENDER
Logit(P(ALC missing)) = 0.35 + 0.65 GENDER − 0.05 RBI
Logit(P(RBI missing)) = 0.5 + 1.03 SHEBEEN + 1.5 COMM4
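For readers who wish to replicate this kind of setup, the following is a generic sketch of imposing MAR missingness through a logistic model; the coefficient values shown are placeholders rather than the exact values used above.

set.seed(2023)

## AGE is set to missing with probability depending only on the observed GENDER
p_age <- plogis(-1.3 + 0.15 * risk$GENDER)    # placeholder intercept and slope
risk$AGE[rbinom(nrow(risk), 1, p_age) == 1] <- NA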
Listwise deletion on this set of variables leaves 399 records, which is less than
one-third the original sample size.
Now, let's forget for a moment that we understand the nature of how the data came to be missing. Ordinarily, we would be uncertain of this process and would be required to make some assumptions. Initial visualizations of our data set risk can be produced with R code along the following lines:
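A minimal sketch using the VIM package (Templ et al. 2013), assuming the incomplete variables are collected in a data frame named risk:

library(VIM)   # Templ et al. (2013)

## aggregation plot: proportion of missing values per variable and the
## frequency of each missingness combination (cf. Fig. 3)
aggr(risk, prop = TRUE, numbers = TRUE)

## matrix plots: values of every variable against the observation index,
## sorted by one variable at a time, to look for structure in the
## missingness (cf. Fig. 4)
matrixplot(risk, sortby = "AGE")
matrixplot(risk, sortby = "ALC")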
Figures 3 and 4 show the resulting output. We see that there is an arbitrary pattern
of missingness and that there are no clearly visible relationships in the matrix plots
that might explain why the data came to be missing. Next, we may want to examine
the plausibility of the MCAR assumption:
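One option, sketched here under the assumption that the MissMech package cited in the references was used, is the omnibus test of Jamshidian et al. (2014); Little's (1988) chi-square test is a common alternative.

library(MissMech)   # Jamshidian, Jalal, and Jansen (2014)

## omnibus test of multivariate normality, homoscedasticity, and MCAR;
## rejection casts doubt on the MCAR assumption
TestMCARNormality(data = risk[, c("AGE", "ALC", "RBI", "UNEMP")])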
The output of this test states that the hypothesis of MCAR is rejected at the 0.05 significance level. This indicates that the missing values are unlikely to be due to chance alone (no surprise to us!). Further, this tells us that if we ran an analysis of just the complete cases, our estimates might be biased. The following code could be used in the event that CCA is appropriate and there is little concern about the reduction in sample size:
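A minimal sketch of such a complete case analysis; lm() applies listwise deletion by default, so the fit below uses only the 399 complete records.

## complete case analysis: lm() drops incomplete records via na.omit by default
cca_fit <- lm(RBI ~ AGE + GENDER + UNEMP + ALC + SHEBEEN, data = risk)
summary(cca_fit)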
Fig. 3 Data example: aggregation plot of the proportion of missing values in each variable (GENDER, AGE, UNEMP, ALC, SHEBEEN, RBI) and of the observed missingness combinations (combination proportions ranging from 0.032 to 0.271)
Suppose that subject experts feel that MAR is a reasonable assumption. Since we
have auxiliary information available, we will proceed with constructing imputations
with all available variables in the data set. In addition to the variables of primary
interest in our model, we also include:
COMM Identifies location of recruitment (1-8)
STRESS Index measure of self-perceived feelings of stress (rated low to high)
GBV Index measure of attitude towards gender based violence
Fig. 4 Data example: matrix plots of ALC, AGE, UNEMP, RBI, GENDER, and SHEBEEN against the observation index
Using the recommendation from Sect. 4.3, if we are also interested in obtaining
the rates of missing information for our estimates of interest with 95 % confidence
and margin of error 0.04, then approximately 300 imputations are required.
Under the joint modeling approach, we would now have to decide on a joint
model for our data. Suppose we assume that the variables form a multivariate normal
model. We could then obtain imputations with the following code:
> library(norm); library(mice)
> # s and thetahat are assumed to come from earlier (not shown) calls to
> # prelim.norm() and em.norm(); M is the number of imputations (here 300)
> results <- list()
> # generate imputations using data augmentation
> for(i in 1:M){
    imps <- da.norm(s, start=thetahat, steps=100,
                    showits=FALSE, return.ymis=TRUE)$ymis
    imputeddata <- risk
    for(j in 1:length(imps)){
      imputeddata[which(is.na(risk))[j]] <- imps[j]
    }
    imputeddata <- data.frame(imputeddata)
    results[[i]] <- lm(RBI ~ AGE + GENDER + UNEMP + ALC + SHEBEEN,
                       data=imputeddata)
  }
> # take the results of the complete-data analyses from the imputed data sets
> # and turn them into a mira object that can be pooled
> results <- as.mira(results)
> # combine results from the M imputed data sets with the df estimate
> # from Barnard and Rubin (1999)
> analysis <- summary(pool(results, method="smallsample"))
> # view results
> analysis
While multivariate normality may not seem the most realistic model (we have several dichotomous variables), more complex joint models are not as readily available. This is where the chained equations approach makes sense to use. For our data, GENDER and SHEBEEN are fully observed. We impute the variables assumed to be normally distributed using Bayesian linear regression, our dichotomous variables using logistic regression, and our categorical variables using predictive mean matching. Specifically, we impute AGE, ALC, RBI, STRESS, and GBV using Bayesian linear regression, UNEMP using logistic regression, and COMM using predictive mean matching. These methods are specified through the method argument of the mice() function. There are many other methods available, and details can be found in van Buuren and Groothuis-Oudshoorn (2011). The following sketches our code for generating imputations under this approach:
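This is a minimal sketch rather than a full listing; the data frame name risk, the conversion of UNEMP to a factor, and the simplified mice() options are assumptions of the illustration.

library(mice)

risk$UNEMP <- factor(risk$UNEMP)       # logistic imputation expects a factor

meth <- make.method(risk)              # "" for fully observed variables
meth[c("AGE", "ALC", "RBI", "STRESS", "GBV")] <- "norm"   # Bayesian linear regression
meth["UNEMP"] <- "logreg"                                 # logistic regression
meth["COMM"]  <- "pmm"                                    # predictive mean matching

imp <- mice(risk, m = 300, method = meth, printFlag = FALSE)

## fit the analysis model in each imputed data set and pool with Rubin's rules
fit      <- with(imp, lm(RBI ~ AGE + GENDER + UNEMP + ALC + SHEBEEN))
analysis <- summary(pool(fit))
analysis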
The results from complete case analysis, multiple imputation with norm, and
multiple imputation with mice are compared to the original complete data in Table 1.
The most important variable in the original analysis by Cain et al. (2012) was the
significance and magnitude of the SHEBEEN variable. In the original complete data,
the variable SHEBEEN was significant. However, in CCA, this significance is not
captured. Both mice and norm provide similar results to one another. While there
is negative bias of the coefficient of SHEBEEN, the bias is far less extreme than it
is with CCA. Additionally, both multiple imputation methods find SHEBEEN to be
significant. In terms of bias, the chained equations approach outperformed the joint
model approach where we assumed multivariate normality—this may be due to the
inadequacy of the multivariate normal model for this data. For all other coefficients,
multiple imputation consistently provided narrower interval estimates than complete
case analysis. The rates of missing information are relatively high, which implies the
missing values have substantial impact on the coefficients we estimated.
6 Concluding Remarks
The problem of missing data is one that researchers encounter regularly. Visualizing the missing data is an important first step in understanding how much of a problem it may be, and it may also help suggest which missingness mechanisms are plausible for your data. Expert knowledge and unverifiable assumptions accompany any missing data method, so assumptions should be made cautiously. While there are
several principled and established methods for dealing with missing data, multiple
imputation provides flexibility in terms of analysis capabilities and is readily
available to implement in many statistical software programs.
Acknowledgements The authors wish to thank Dr. Seth Kalichman for generously sharing his
data. This project was supported in part by the National Institute of Mental Health, Award Number
K01MH087219. The content is solely the responsibility of the authors, and it does not represent
the official views of the National Institute of Mental Health or the National Institutes of Health.
References
Barnard, J., Rubin, D.B.: Small-sample degrees of freedom with multiple imputation.
Biometrika 86, 948–955 (1999)
Belin, T.: Missing data: what a little can do and what researchers can do in response. Am. J.
Ophthalmol. 148(6), 820–822 (2009)
Bodner, T.E.: What improves with increased missing data imputations? Struct. Equ. Model. 15(4),
651–675 (2008)
Cain, D., Pare, V., Kalichman, S.C., Harel, O., Mthembu, J., Carey, M.P., Carey, K.B., Mehlomakulu, V., Simbayi, L.C., Mwaba, K.: HIV risks associated with patronizing alcohol serving establishments in South African townships, Cape Town. Prev. Sci. 13(6), 627–634 (2012)
Chang, C.-C.H., Yang, H.-C. , Tang, G., Ganguli, M.: Minimizing attrition bias: a longitudinal
study of depressive symptoms in an elderly cohort. Int. Psychogeriatr. 21(05), 869–878 (2009)
Collins, L., Schafer, J., Kam, C.: A comparison of inclusive and restrictive strategies in modern
missing data procedures. Psychol. Methods 6, 330–351 (2001)
Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 39(1), 1–38 (1977)
Diggle, P., Kenward, M.G.: Informative drop-out in longitudinal data analysis. Appl. Stat., 49–93
(1994)
Fitzmaurice, G., Laird, N., Ware, J.: Applied Longitudinal Analysis. Wiley Series in Probability
and Statistics. Wiley (2011)
Gelman, A., Carlin, J., Stern, H., Rubin, D.: Bayesian Data Analysis. Chapman and Hall/CRC,
Boca Raton, FL (2003)
Graham, J.W., Hofer, S.M., MacKinnon, D.P.: Maximizing the usefulness of data obtained with
planned missing value patterns: An application of maximum likelihood procedures. Multivar.
Behav. Res. 31(2), 197–218 (1996)
Graham, J.W., Olchowski, A.E., Gilreath, T.D.: How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prev. Sci. 8(3), 206–213 (2007)
Harel, O.: Inferences on missing information under multiple imputation and two-stage multiple
imputation. Stat. Methodol. 4, 75–89 (2007)
Harel, O., Pellowski, J., Kalichman, S.: Are we missing the importance of missing values in
HIV prevention randomized clinical trials? Review and recommendations. AIDS Behav. 16,
1382–1393 (2012)
Jamshidian, M., Jalal, S.: Tests of homoscedasticity, normality, and missing completely at random
for incomplete multivariate data. Psychometrika 75(4), 649–674 (2010)
Jamshidian, M., Jalal, S., Jansen, C.: MissMech: an R package for testing homoscedasticity,
multivariate normality, and missing completely at random (MCAR). J. Stat. Softw. 56(6), 1–31
(2014)
Little, R., Rubin, D.: Statistical Analysis with Missing Data, 2nd edn. Wiley, Hoboken, NJ (2002)
Little, R.J.: A test of missing completely at random for multivariate data with missing values.
J. Am. Stat. Assoc. 83(404), 1198–1202 (1988)
Little, R.J., D’Agostino, R., Cohen, M.L., Dickersin, K., Emerson, S.S., Farrar, J.T., Frangakis, C.,
Hogan, J.W., Molenberghs, G., Murphy, S.A., et al.: The prevention and treatment of missing
data in clinical trials. N. Engl. J. Med. 367(14), 1355–1360 (2012)
Marchenko, Y.V., Reiter, J.P.: Improved degrees of freedom for multivariate significance tests
obtained from multiply imputed, small-sample data. Stata J. 9(3), 388–397 (2009)
Reiter, J.P.: Small-sample degrees of freedom for multi-component significance tests with multiple
imputation for missing data. Biometrika 94, 502–508 (2007)
Rubin, D.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
Rubin, D.: Multiple Imputation for Nonresponse in Surveys. Wiley, Hoboken, NJ (1987)
Schafer, J.: Analysis of Incomplete Multivariate Data. Chapman and Hall/CRC, Boca Raton, FL
(1997)
Schafer, J., Graham, J.: Missing data: our view of the state of the art. Psychol. Methods 7, 147–177
(2002)
Schafer, J.L.: Norm: analysis of multivariate normal datasets with missing values. R package
version 1.0-9.4 (2012)
Templ, M., Alfons, A., Kowarik, A., Prantner, B.: VIM: visualization and imputation of missing
values. R package version 4.0.0. (2013)
van Buuren, S., Groothuis-Oudshoorn, K.: mice: multivariate imputation by chained equations in R. J. Stat. Softw. 45(3), 1–67 (2011)
van Buuren, S., Oudshoorn, K.: Multivariate imputation by chained equations: MICE V1.0 user's manual (2000)
Wagstaff, D.A., Harel, O.: A closer examination of three small-sample approximations to the
multiple-imputation degrees of freedom. Stata J. 11(3), 403–419 (2011)
White, I., Carlin, J.: Bias and efficiency of multiple imputation compared with complete-case
analysis for missing covariate values. Stat. Med. 28, 2920–2931 (2010)
White, I.R., Royston, P., Wood, A.M.: Multiple imputation using chained equations: Issues and
guidance for practice. Stat. Med. 30(4), 377–399 (2011)
A Continuous Latent Factor Model
for Non-ignorable Missing Data
1 Introduction
Missing values in multivariate studies pose many challenges. The primary research
of interest focuses on accurate and efficient estimation of means and covariance
structure in the population. The assumption and estimation of the covariance
structure provide the foundation of many statistical models, for instance, structural equation modeling, principal component analysis, and so on. The literature on multivariate missing data methods was reviewed by Little and Rubin (2002) and Schafer (1997). For some frequentist statistical procedures, we may generally ignore the distribution of missingness only when the missing data are missing completely at random (MCAR).

J. Zhang
Bayer Healthcare Pharmaceuticals Inc., 100 Bayer Boulevard, Whippany, NJ 07981, USA
e-mail: [email protected]

M. Reiser
School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ 85260, USA
e-mail: [email protected]
2 Models
Many longitudinal studies suffer from missing data due to subjects dropping into or out of a study or not being available at some measurement times, which can cause bias in the analysis if the missingness is informative. For likelihood-based estimation of linear mixed models, we may generally ignore the distribution of the missingness indicators when the missing data are MAR (ignorable likelihood estimation), that is, when missingness depends only on observed information. However, when the missing data mechanism is related to the unobservable missing values, the missing data are non-ignorable and the distribution of missingness has to be considered. To
account for informative missingness, a number of model based approaches have
been proposed to jointly model the longitudinal outcome and the non-ignorable
missing mechanism. Little and Rubin (2002) described three major formulations
of joint modeling approaches: selection model, pattern-mixture model, and shared-
parameter model, while Verbeke and Molenberghs (2000) provided applications
for these models in their book. Other researchers have extended this field in the
last decade. Some authors have incorporated latent class structure into pattern-
mixture models to jointly describe the pattern of missingness and the outcome
of interest (Lin et al. 2004; Muthén et al. 2003; Roy 2003). Lin et al. (2004)
proposed a latent pattern-mixture model where the mixture patterns are formed
from latent classes that link a longitudinal response with a missingness process.
Roy (2003) investigated latent classes to model dropouts in longitudinal studies to
effectively reduce the number of missing-data patterns. Muthén et al. (2003) also
discussed how latent classes could be applied to non-ignorable missingness. Jung
et al. (2011) extended traditional latent class models, where the classes are defined
by the missingness indicators alone.
All of the above extensions come from the family of pattern-mixture models; these models stratify the data according to time to dropout or to the missingness indicators alone and formulate a model for each stratum. This usually results in under-identifiability, since many pattern-specific parameters must be estimated even though the eventual interest is usually in the marginal parameters. Further, there is a contentious and practically important modeling issue in using latent class models, namely determining a suitable number of latent classes. Some authors have suggested information criteria as a way of comparing models with different numbers of classes. In our work using simulation studies, we found that the selection of the number of latent classes is sensitive to many factors related to the missing data, and a simulation study on class selection is strongly recommended if one wants to apply latent class modeling to missing data. Moreover, the uncertainty of model selection makes latent class models inefficient in estimating population parameters. Instead of modeling the missingness indicators with latent categorical classes, one possible alternative approach is to model missingness with a continuous latent variable.
As the alternative, Guo et al. (2004) extended pattern-mixture to a random
pattern-mixture model for longitudinal data with dropouts. The extended model
works effectively in the case where a good surrogate for dropout is representative of the dropout process. In most real studies, however, it may be impossible to find good measures of the missingness mechanism. For instance, in a longitudinal study with many intermittent missing values, time to dropout is not necessarily a good measure, and it probably would not capture most features of the missingness; in particular, it cannot represent subjects who have drop-in responses. Instead, modeling the missingness indicators directly is necessary in this case.
Further, models other than the normal distribution will be required to describe the
missingness process. The violation of joint multivariate normality will lead to an
increase of computation difficulties. In the proposed new model, missing indicators
are directly modeled with a continuous latent variable, and this latent factor is
treated as a predictor of the latent subject-level random effects in the primary model of interest. Some informative variables related to missingness (e.g., time to first missing value, number of switches between observed and missing responses) will serve as covariates in the modeling of the missingness indicators. A detailed description of the new
model will be given in the next section.
For analyzing multivariate categorical data, continuous latent factor modeling, often referred to as categorical variable factor analysis (Muthén 1978) or item response modeling (Lord 1980; Embretson and Reise 2000), is probably the most widely used method. In the terminology of educational testing, the binary variables involved are called items and the observed values are referred to as binary or dichotomous responses. In this chapter, we extend this model to describe the missing data process.
Let $r_{i1}, \ldots, r_{iJ}$ be the J binary responses (missing indicators) at J given time points for a given individual i out of a sample of n individuals, $i = 1, \ldots, n$ and $j = 1, \ldots, J$. In concrete cases, 1 and 0 may correspond to an unobserved or an observed outcome, respectively, in a longitudinal study. In the continuous latent factor model there are two sets of parameters. The probability of $r_{ij}$ being 1 or 0 can depend on an individual parameter $u_i$, specific to and characteristic of the individual in the study; this parameter is also referred to as a latent parameter. In addition, the probability may depend on a parameter $\delta_j$ for the different time points (items), characteristic of the particular time point.
We use the following notation to define the probability of a missing outcome as a function of the latent individual factor: $\pi_{ij} = \Pr(r_{ij} = 1 \mid u_i)$.
Fig. 1 Missing probability as a function of the random factor $u_i$, shown separately for Times 1–5
In the literature, two main models for a latent trait have been suggested. The normal ogive or probit model takes the missingness probability to be the standard normal cumulative distribution function evaluated at the linear predictor, while the logistic model uses $\Psi(x) = e^x/(1 + e^x)$ ($-\infty < x < \infty$), the cumulative distribution function of the standard logistic random variable.
There is a series of continuous latent variable models for different kinds of categorical data. Here, we present the two-parameter logistic (2PL) item response model for binary data, which can be reduced to the model discussed above. The 2PL model is used to estimate the probability ($\pi_{ij}$) of a missing response for subject i at time point j while considering the item (time)-varying parameters, $\gamma_{2j}$ for the item (time) location parameters and $\gamma_{1j}$ for the item (time) slope parameters, which allow different weights for different times, and the person-varying latent trait variables $u_i$. The 2PL model is expressed as
$$\pi_{ij} = \frac{\exp\{\gamma_{1j}(u_i - \gamma_{2j})\}}{1 + \exp\{\gamma_{1j}(u_i - \gamma_{2j})\}}.$$
As $\gamma_{1j}$ increases, the item (time) has a stronger association with the underlying missingness. When $\gamma_{1j}$ is fixed to be 1, the 2PL model reduces to a Rasch model (Rasch 1960) or 1PL model. As $\gamma_{2j}$ increases, the response is more likely to be observed. The 2PL model has been shown to be mathematically equivalent to a confirmatory factor analysis model for binary data (Takane and de Leeuw 1987). IRT models can be expressed as generalized mixed or multilevel models (Adams et al. 1997; Rijmen et al. 2003). Consider a mixed logistic regression model for binary data:
$$\Pr(r_{ij} = 1 \mid x_{ij}, z_{ij}, \beta, u_i) = \frac{\exp(x_{ij}^T\beta + z_{ij}^T u_i)}{1 + \exp(x_{ij}^T\beta + z_{ij}^T u_i)}$$
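To illustrate the point that these IRT-type missingness models can be fitted as generalized linear mixed models, the following is a minimal sketch of a Rasch-type fit with lme4; the long-format data frame miss_long and its column names are assumptions of this illustration.

library(lme4)

## miss_long: one row per subject-by-time record, with
##   r    = 1 if the response at that time is missing, 0 otherwise
##   time = factor of measurement occasions (the "items")
##   id   = subject identifier
## The time effects play the role of the item location parameters and the
## random intercept plays the role of the latent factor u_i.
rasch_fit <- glmer(r ~ 0 + time + (1 | id), data = miss_long, family = binomial)
summary(rasch_fit)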
In this section we present a continuous latent factor model (CLFM) for longitudinal data with non-ignorable missingness. For a J-time-period study, which may have as many as $2^J$ possible missing-data patterns, modeling the relationships among the missing indicators and their relationships to the observed data is a challenge. The underlying logic of our new model comes from the assumption that a continuous latent variable exists and allows flexible modeling of the missingness indicators. Suppose we have a data set with n independent individuals. For individual i ($i = 1, \ldots, n$), let $Y_i = (Y_{i1}, \ldots, Y_{iJ})'$ be a J-dimensional observed vector with continuous elements used to measure a q-dimensional continuous latent variable $b_i$. Let $R_i = (r_{i1}, \ldots, r_{iJ})'$ be a J-dimensional missing data indicator vector with binary elements, and let $u_i$ be a continuous latent variable, which is used to measure $R_i$. The primary model of interest will be the joint distribution of $Y_i$ and $R_i$, given $u_i$ and possibly additional observed covariates $X_i$, where $X_i$ represents p-dimensional fully observed covariates. Figure 2 provides a diagram representing the proposed model for all the observed and latent variables. As indicated in Fig. 2, $X_{1i}$, containing both time-varying and time-invariant attributes for subject i, is a $p_1$-dimensional covariate vector used in model B; $X_{2i}$ is a $p_2$-dimensional covariate vector used in model A; and a $p_3$-dimensional time-invariant covariate vector $X_{3i}$ is used in modeling the link function between $b_i$ and $u_i$. These three covariate vectors form the covariates for the model, i.e., $p = p_1 + p_2 + p_3$.

Fig. 2 Proposed model diagram: observed quantities are shown in square boxes, latent quantities in circled boxes
One of the fundamental assumptions of this new model is that Yi is conditionally
independent of Ri given the latent variables ui and bi . This is a natural assumption
when modeling relationships between variables measured with error, i.e., we want
to model the relationship between the underlying variables, not the ones with error.
Finally, we assume that Yi is conditionally independent of ui given bi , and likewise,
Ri is conditionally independent of bi given ui . Hence, we introduce the following
model for the joint distribution of the responses Yi and missing indicators Ri ,
$$f(Y_i, R_i \mid X_i) = \iint f(Y_i \mid b_i, X_{1i})\, f(R_i \mid u_i, X_{2i})\, f(b_i \mid u_i, X_{3i})\, f(u_i)\, du_i\, db_i \qquad (1)$$
with specific parametric models specified as follows ($N_p(a, B)$ denotes the p-variate normal distribution with mean a and covariance matrix B):
$$f(R_i \mid u_i, X_{2i}) = \prod_{j=1}^{J} \pi_{ij}^{\,r_{ij}} (1 - \pi_{ij})^{1 - r_{ij}} \qquad (5)$$
A linear mixed model (growth curve model) is used for the relationship between $Y_i$ and $b_i$, where $X_{1i}$ is a known $(J \times p_1)$ design matrix containing fixed within-subject and between-subject covariates (both time-invariant and time-varying), with associated unknown $(p_1 \times 1)$ parameter vector $\beta$; $Z_{1i}$ is a known $(J \times q)$ matrix for modeling random effects; and $b_i$ is an unknown $(q \times 1)$ random coefficient vector. We specify $Y_i = X_{1i}\beta + Z_{1i}b_i + \epsilon_i$, where the random error term $\epsilon_i$ is a J-dimensional vector with $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \Sigma_\epsilon$, and $\epsilon_i$ is assumed independent of $b_i$. Furthermore, the $J \times J$ covariance matrix $\Sigma_\epsilon$ is assumed to be diagonal, so that any correlations found in the observation vector $Y_i$ are due to their relationship with the common $b_i$ and not due to some spurious correlation among the elements of $\epsilon_i$. A continuous latent variable model is assumed for the relationship between $R_i$ and $u_i$, with $\pi_{ij} = \Pr(r_{ij} = 1)$ representing the probability that the response for subject i at time point j is missing. We apply the logit link for the probability of missingness, i.e., $\log\{\pi_{ij}(u_i, X_{2i})/(1 - \pi_{ij}(u_i, X_{2i}))\} = u_i - \delta_j = X_{2i}\alpha + Z_{2i}u_i$, where the $\delta_j$ are unknown parameters that determine whether an observation at time point j is missing. As discussed earlier, this relationship is equivalent to a random logistic regression, with appropriate design matrices $X_{2i}$ and $Z_{2i}$. A latent variable regression, $b_i = \Gamma^T X'_{3i} + \xi_i$, is used to establish the relationship between the latent variables $b_i$ and $u_i$, where $X'_{3i} = [X_{3i}, u_i]$ is a $(p_3 + 1)$-dimensional vector combining $X_{3i}$ and $u_i$, $\Gamma$ is the $(p_3 + 1) \times q$ matrix of unknown regression coefficients for $X'_{3i}$, and the $q \times q$ matrix $\Psi$ determines the variance-covariance structure of the error term $\xi_i$. Finally, the latent continuous variable $u_i$ is assumed to be normally distributed with mean 0 and variance $\sigma_u^2$.
Note that maximum likelihood (ML) estimation of the model (2)-(4) requires the maximization of the observed likelihood, after integrating out the missing data $Y_{mis}$ and the latent variables b and u from the complete-data likelihood function. Details of the ML estimation technique are given in the next section.
The main objective of this section is to obtain the ML estimate of parameters in the
model and standard errors on the basis of the observed data Yobs and R. The ML
approach is an important statistical procedure which has many optimal properties
such as consistency, efficiency, etc. Furthermore, it is also the foundation of
many important statistical methods, for instance, the likelihood ratio test, statistical
diagnostics such as Cook’s distance and local influence analysis, among others. To
perform ML estimation, the computational difficulty arises because of the need to
integrate over continuous latent factor u, random subject-level effects b, as well
as missing responses Ymis . The classic Expectation-Maximization (EM) algorithm
provides a tool for obtaining maximum likelihood estimates under models that
yield intractable likelihood equations. The EM algorithm is an iterative routine
requiring two steps in each iteration: computation of a particular conditional
expectation of the log-likelihood (E-step) and maximization of this expectation
over the parameters of interest (M-step). In our situations, in addition to the real
missing data Ymis , we will treat the latent variables b and u as missing data.
However, due to the complexities associated with the missing data structure and
the nonlinearity part of the model, the E-step of the algorithm, which involves the
computations of high-dimensional complicated integrals induced by the conditional
expectations, is intractable. To solve this difficulty, we propose to approximate the
conditional expectations by sample means of the observations simulated from the
appropriate conditional distributions, which is known as Monte Carlo Expectation
Maximization algorithm. We will develop a hybrid algorithm that combines two
advanced computational tools in statistics, namely the Gibbs sampler (Geman and
Geman 1984) and the Metropolis Hastings (MH) algorithm (Hastings 1970) for
simulating the observations. The M-step does not require intensive computations
due to the distinctness of parameters in the proposed model. Hence, the proposed
algorithm is a Monte Carlo EM (MCEM) type algorithm (Wei and Tanner 1990).
The description of the observed likelihood function is given in the following.
Given the parametric model (2)-(4) and the i.i.d. $J \times 1$ variables $Y_i$ and $R_i$, for $i = 1, \ldots, n$, estimation of the model parameters can proceed via the maximum likelihood method. Let $W_i = (Y^{obs}_i, R_i)$ be the observed quantities, $d_i = (Y^{mis}_i, b_i, u_i)$ the missing quantities, and $\theta = (\alpha, \beta, \delta_j, \Gamma, \Psi, \sigma_u^2, \Sigma_\epsilon)$ the vector of parameters relating $W_i$ to $d_i$ and the covariates $X_i$. Under Birch's (1964) regularity conditions for the parameter vector $\theta$, the observed likelihood function for the model (2)-(4) can be written as
$$L_o(\theta \mid Y^{obs}, R) = \prod_{i=1}^{n} f(W_i \mid X_i; \theta) = \prod_{i=1}^{n} \int f(W_i, d_i \mid X_i; \theta)\, dd_i \qquad (6)$$
where the notation for the integral over $d_i$ is taken generally to include the multiple continuous integral over $u_i$ and $b_i$, as well as the missing observations $Y^{mis}_i$. In detail, the above function can be rewritten as follows:
$$L_o(\theta \mid Y^{obs}, R) = \prod_{i=1}^{n} \iiint \frac{1}{\sqrt{2\pi}}\, |\Sigma_\epsilon|^{-1/2} \exp\!\left\{-\tfrac{1}{2}\,(Y^{com}_i - X_{1i}\beta - Z_{1i}b_i)^T \Sigma_\epsilon^{-1} (Y^{com}_i - X_{1i}\beta - Z_{1i}b_i)\right\}$$
$$\times \frac{1}{\sqrt{2\pi}}\, |\Sigma_b|^{-1/2} \exp\!\left\{-\tfrac{1}{2}\,(b_i - \Gamma^T X'_{3i})^T \Sigma_b^{-1} (b_i - \Gamma^T X'_{3i})\right\} \frac{1}{\sqrt{2\pi\sigma_u^2}} \exp\!\left\{-\frac{u_i^2}{2\sigma_u^2}\right\}$$
$$\times \prod_{j=1}^{J} \left\{\frac{\exp(X_{2i}\alpha + Z_{2i}u_i)}{1 + \exp(X_{2i}\alpha + Z_{2i}u_i)}\right\}^{r_{ij}} \left\{1 - \frac{\exp(X_{2i}\alpha + Z_{2i}u_i)}{1 + \exp(X_{2i}\alpha + Z_{2i}u_i)}\right\}^{1 - r_{ij}} du_i\, db_i\, dY^{mis}_i \qquad (7)$$
where $Y^{com}_i = (Y^{obs}_i, Y^{mis}_i)$ and $\Sigma_b$ is the marginal covariance matrix of $b_i$ implied by the latent variable regression (a function of $\sigma_u^2$, $\Gamma$, and $\Psi$). As discussed above, the E-step involves complicated, intractable, and high-dimensional integrations. Hence, the Monte Carlo EM algorithm is applied to obtain the ML estimates. Details of the MCEM technique are given in the following section.
Inspired by the key idea of the EM algorithm, we will treat $d_i$ as missing data and implement the expectation-maximization (EM) algorithm for maximizing (7). Since it is difficult to maximize the observed-data likelihood $L_o$ directly, we construct the complete-data likelihood and apply the EM algorithm to the augmented log-likelihood $\ln L_c(W, d \mid \theta)$ to obtain the MLE of $\theta$ over the observed likelihood function $L_o(Y^{obs}, R \mid \theta)$, where it is assumed that $L_o(Y^{obs}, R \mid \theta) = \int L_c(W, d \mid \theta)\, dd$. [W and d are ensemble matrices for the vectors $W_i$ and $d_i$ defined in (6).] In detail, the EM algorithm iterates between computation of the expected complete-data likelihood
$$Q(\theta \mid \hat{\theta}^{(r)}) = E_{\hat{\theta}^{(r)}}\{\ln L_c(W, d \mid \theta) \mid Y^{obs}, R\} \qquad (8)$$
and the maximization of $Q(\theta \mid \hat{\theta}^{(r)})$ over $\theta$, where the value of $\theta$ maximizing $Q$ at the $(r+1)$th iteration is denoted by $\hat{\theta}^{(r+1)}$ and $\hat{\theta}^{(r)}$ denotes the maximizer obtained at the rth iteration. Specifically, r indexes the EM iterations. Under regularity conditions, the sequence of values $\{\hat{\theta}^{(r)}\}$ converges to the MLE $\hat{\theta}$ (see Wu 1983).
The expectation in (8) can be written as $Q(\theta \mid \hat{\theta}^{(r)}) = \int \ln L_c(W, d \mid \theta)\, g(d \mid Y^{obs}, R; \hat{\theta}^{(r)})\, dd$, where $g(d \mid Y^{obs}, R; \hat{\theta}^{(r)})$ is the joint conditional distribution of the latent variables given the observed data and $\hat{\theta}^{(r)}$. A hybrid algorithm that combines the Gibbs sampler and the MH algorithm is developed to obtain Monte Carlo samples from this conditional distribution. Once we draw a sample $d^{(r)}_1, \ldots, d^{(r)}_T$ from the distribution $g(d \mid Y^{obs}, R; \hat{\theta}^{(r)})$, this expectation can be estimated by the Monte Carlo average
$$Q_T(\theta \mid \hat{\theta}^{(r)}) = \frac{1}{T}\sum_{t=1}^{T} \ln L_c(W, d^{(r)}_t \mid \theta) \qquad (9)$$
where T is the MC sample size, which also indicates the dependence of the current estimator on the MC sample size. By the law of large numbers, the estimator given in (9) converges to the theoretical expectation in (8). Thus the classic EM algorithm can be modified into an MCEM algorithm in which the E-step is replaced by the estimated quantity from (9). The M-step maximizes (9) over $\theta$.
$$E\{h(Y^{mis}, b, u) \mid Y^{obs}, R; \hat{\theta}\} = \frac{1}{T}\sum_{t=1}^{T} h(Y^{mis(t)}, b^{(t)}, u^{(t)}) \qquad (10)$$
The required draws come from the full conditional distributions. The first of these is $f(Y^{mis} \mid Y^{obs}, b; \theta)$, which is again a normal distribution by the properties of conditional distributions of the multivariate normal. This conditional can be further simplified in our case because the variance-covariance matrix $\Sigma_\epsilon$ in model (2) is assumed diagonal. In detail, for subjects $i = 1, \ldots, n$, since the $Y_i$ are mutually independent given $b_i$, the $Y^{mis}_i$ are also mutually independent given $b_i$. Since $\Sigma_\epsilon$ is diagonal, $Y^{mis}_i$ is conditionally independent of $Y^{obs}_i$ given $b_i$. Hence, it follows from model (2) that
$$f(Y^{mis} \mid Y^{obs}, b; \theta) = \prod_{i=1}^{n} f(Y^{mis}_i \mid b_i; \theta)$$
and
$$(Y^{mis}_i \mid b_i; \theta) \sim \mathrm{MVN}\!\left(X^{mis}_{1i}\beta + Z^{mis}_{1i}b_i,\; \Sigma^{mis}_{\epsilon,i}\right)$$
$$p_1(b_i \mid Y^{com}_i, R_i, u_i; \theta) \propto \exp\!\left\{-\tfrac{1}{2}(Y^{com}_i - X_{1i}\beta - Z_{1i}b_i)^T \Sigma_\epsilon^{-1} (Y^{com}_i - X_{1i}\beta - Z_{1i}b_i) - \tfrac{1}{2}(b_i - \Gamma^T X'_{3i})^T \Psi^{-1} (b_i - \Gamma^T X'_{3i})\right\}$$
$$p_2(u_i \mid Y^{com}_i, R_i, b_i; \theta) \propto \exp\!\left\{-\frac{u_i^2}{2\sigma_u^2} - \tfrac{1}{2}(b_i - \Gamma^T X'_{3i})^T \Psi^{-1} (b_i - \Gamma^T X'_{3i})\right\} \prod_{j=1}^{J}\left\{\frac{\exp(X_{2i}\alpha + Z_{2i}u_i)}{1+\exp(X_{2i}\alpha + Z_{2i}u_i)}\right\}^{r_{ij}}\left\{1 - \frac{\exp(X_{2i}\alpha + Z_{2i}u_i)}{1+\exp(X_{2i}\alpha + Z_{2i}u_i)}\right\}^{1-r_{ij}} \qquad (11)$$
Based on expression (11), the associated full conditional distributions for b and u are not standard and are relatively complex. Hence we choose to apply the M-H algorithm to simulate observations efficiently. The M-H algorithm is one of the classic MCMC methods and has been widely used for obtaining random samples from a target density with the help of a proposal distribution when direct sampling is difficult. Here $p_1(b_i \mid Y^{com}_i, R_i, u_i; \theta)$ and $p_2(u_i \mid Y^{com}_i, R_i, b_i; \theta)$ are treated as the target densities. Following the discussion in Robert and Casella (2010), a random-walk proposal can be used for each of these conditionals, with acceptance probabilities formed from ratios of the target densities, where $p_1(\cdot)$ and $p_2(\cdot)$ are calculated from Eq. (11). The proposal variance $\sigma^2$ can be chosen such that the average acceptance rate is approximately 1/4, as suggested by Robert and Casella (2010).
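As an illustration of this tuning idea, here is a generic sketch (not the authors' implementation) of a single random-walk Metropolis update for a scalar quantity such as $u_i$; the function name and arguments are placeholders.

## One random-walk Metropolis update for a scalar parameter.
## log_target: function returning the unnormalized log density (e.g. log p2(u | ...))
## current:    current state of the chain;  sigma: proposal standard deviation
rw_metropolis_step <- function(log_target, current, sigma) {
  proposal   <- rnorm(1, mean = current, sd = sigma)
  log_accept <- log_target(proposal) - log_target(current)
  if (log(runif(1)) < log_accept) proposal else current
}

## sigma is tuned so that roughly a quarter of the proposals are accepted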
Instead of allowing the candidate distributions for b and u to depend on the present state of the chain, an attractive alternative is to choose proposal distributions that are independent of the present state; we then obtain a special case known as Independent Metropolis-Hastings. To implement this method, we generate a candidate for $b_i$ at step t, $b^{*}_i$, from a multivariate normal distribution with mean vector 0 and variance-covariance matrix $\Sigma_b$ (denote this density by $h_1(\cdot)$), and a candidate for $u_i$ at step t, $u^{*}_i$, from a univariate normal distribution with mean 0 and variance $\sigma_u^2$ (denote this density by $h_2(\cdot)$). The acceptance probabilities for the proposed values of $b^{(t+1)}_i$ and $u^{(t+1)}_i$ ($i = 1, 2, \ldots, n$) are
$$\min\!\left\{1,\; \frac{p_1(b^{*}_i \mid Y^{com}_i, R_i, u_i; \theta)\, h_1(b^{(t)}_i)}{p_1(b^{(t)}_i \mid Y^{com}_i, R_i, u_i; \theta)\, h_1(b^{*}_i)}\right\}, \qquad \min\!\left\{1,\; \frac{p_2(u^{*}_i \mid Y^{com}_i, R_i, b_i; \theta)\, h_2(u^{(t)}_i)}{p_2(u^{(t)}_i \mid Y^{com}_i, R_i, b_i; \theta)\, h_2(u^{*}_i)}\right\}.$$
With the Monte Carlo samples in hand, the conditional expectations required in the E-step are approximated by
$$E[Y_i - Z_{1i}b_i \mid Y^{obs}_i, R_i; \theta] = T^{-1}\sum_{t=1}^{T} (Y^{(t)}_i - Z_{1i}b^{(t)}_i)$$
$$E[\epsilon_i \epsilon_i' \mid Y^{obs}_i, R_i; \theta] = T^{-1}\sum_{t=1}^{T} (Y^{(t)}_i - X_{1i}\beta - Z_{1i}b^{(t)}_i)(Y^{(t)}_i - X_{1i}\beta - Z_{1i}b^{(t)}_i)'$$
$$E[b_i \mid Y^{obs}_i, R_i; \theta] = T^{-1}\sum_{t=1}^{T} b^{(t)}_i$$
$$E[\xi_i \xi_i' \mid Y^{obs}_i, R_i; \theta] = T^{-1}\sum_{t=1}^{T} (b^{(t)}_i - \Gamma^T X'^{(t)}_{3i})(b^{(t)}_i - \Gamma^T X'^{(t)}_{3i})'$$
$$E[u_i \mid Y^{obs}_i, R_i; \theta] = T^{-1}\sum_{t=1}^{T} u^{(t)}_i, \qquad E[u_i u_i' \mid Y^{obs}_i, R_i; \theta] = T^{-1}\sum_{t=1}^{T} u^{(t)}_i u'^{(t)}_i \qquad (12)$$
where $X'^{(t)}_{3i} = [X_{3i}, u^{(t)}_i]$.
At the M-step we need to maximize $Q(\theta \mid \theta^{(r)})$ with respect to $\theta$. In other words, the following system of equations needs to be solved:
$$\frac{\partial Q(\theta \mid \theta^{(r)})}{\partial \theta} = E\!\left\{\frac{\partial}{\partial\theta} \ln L_c(W, d \mid \theta) \,\Big|\, Y^{obs}, R; \theta^{(r)}\right\} = 0 \qquad (13)$$
It can be shown that the ML estimates can be obtained from the observed components of Y, given information on u and b. Specifically, the dimension of the integration in the E-step reduces to two, instead of three.
Due to the complexity of the observed likelihood function, the exact value of $K(\theta^{(r+1)}, \theta^{(r)})$, the change in the observed-data log-likelihood between successive iterations, is difficult to obtain. However, as pointed out by Meng and Schilling (1996), it can be approximated by
$$\hat{K}(\theta^{(r+1)}, \theta^{(r)}) = \log\left\{\sum_{t=1}^{T}\left[\frac{L_c(W, d^{r,(t)} \mid \theta^{(r+1)})}{L_c(W, d^{r,(t)} \mid \theta^{(r)})}\right]^{1/2}\right\} - \log\left\{\sum_{t=1}^{T}\left[\frac{L_c(W, d^{r+1,(t)} \mid \theta^{(r)})}{L_c(W, d^{r+1,(t)} \mid \theta^{(r+1)})}\right]^{1/2}\right\} \qquad (15)$$
where $d^{r,(t)}$, $t = 1, \ldots, T$, are random samples generated from $g(d \mid W, \theta^{(r)})$ by the hybrid algorithm. In determining the convergence of the MCEM algorithm, we plot $\hat{K}(\theta^{(r+1)}, \theta^{(r)})$ against the iteration index r. Approximate convergence is claimed to be achieved when the plot shows a curve converging to zero.
To obtain standard errors, the second derivative of the observed-data log-likelihood is approximated, following Louis (1982), by
$$\frac{\partial^2 L_o(Y^{obs}, R \mid \theta)}{\partial\theta\,\partial\theta^T} = -\,T_1^{-2}\left(\sum_{t=1}^{T_1}\frac{\partial L_c(W, d^{(t)} \mid \theta)}{\partial\theta}\right)\left(\sum_{t=1}^{T_1}\frac{\partial L_c(W, d^{(t)} \mid \theta)}{\partial\theta}\right)^T\Bigg|_{\theta=\hat{\theta}} + T_1^{-1}\sum_{t=1}^{T_1}\left\{\frac{\partial^2 L_c(W, d^{(t)} \mid \theta)}{\partial\theta\,\partial\theta^T} + \frac{\partial L_c(W, d^{(t)} \mid \theta)}{\partial\theta}\,\frac{\partial L_c(W, d^{(t)} \mid \theta)}{\partial\theta}^T\right\}\Bigg|_{\theta=\hat{\theta}} \qquad (17)$$
Finally, the standard errors are obtained from the diagonal elements of the inverse of the Hessian matrix $\partial^2 L_o(Y^{obs}, R \mid \theta)/\partial\theta\,\partial\theta^T$, evaluated at $\hat{\theta}$.
In this section, we present an application using a data set that has appeared previously in the literature. We illustrate the application of CLFM using data from a randomized, double-blind study of AIDS patients with advanced immune suppression, defined as CD4 counts of 50 cells/mm³ or fewer (Henry et al. 1998).
Patients in an AIDS Clinical Trial Group (ACTG) Study 193 A were randomized to
dual or triple combinations of HIV-1 reverse transcriptase inhibitors. Specifically,
HIV patients were randomized to one of four daily regimens containing 600 mg
of zidovudine: zidovudine plus 2.25 mg of zalcitabine; zidovudine plus 400 mg of
didanosine; zidovudine alternating monthly with 400 mg didanosine; or zidovudine
plus 400 mg of didanosine plus 400 mg of nevirapine (triple therapy). In this study,
we focus on the comparison of the first three treatment regimens (dual therapy) with the fourth (triple therapy), as described in the work of Fitzmaurice et al. (2004).
Measurements of CD4 counts were scheduled to be collected at baseline and at 8-week intervals during follow-up. However, the CD4 count data are unbalanced, with unequally spaced measurements, and contain missing values caused by skipped visits and dropout. Table 1 presents four randomly selected subjects. The number of measurements of CD4 counts during the first 40 weeks of follow-up varied from 1 to 9, with a median of 4, based on the available data. The goal in this study is to compare the dual and triple therapy groups in terms of short-term changes in CD4 counts from baseline to week 40. The responses of interest are log-transformed CD4 counts, log(CD4 count + 1), available on 1,309 patients.
Fig. 3 Lowess smoothed curves of log(CD4 + 1) against time (in weeks), for subjects in the dual and triple therapy groups in ACTG Study 193A
Figure 3 describes the trend in the mean response in the dual and triple therapy groups via lowess smoothed curves fitted to the observed data. The curves reveal a modest decline in the mean response during the first 16 weeks for the dual therapy group, followed by a steeper decline from week 16 to week 40. By comparison, for the triple therapy group the mean response increases during the first 16 weeks and declines thereafter. The rate of decline from week 16 to week 40 appears to be similar for the two groups. However, one has to note that there is a substantial amount of missing data in the study, so the plot of the mean response over time can be potentially misleading unless the data are missing completely at random (MCAR). Moreover, based on a small random sample of individuals, we observed that those who dropped out tended to have larger CD4 counts. In other words, patients in the study tended to skip a visit when the magnitude of their current CD4 count was large; that is, a patient tends to skip a visit because of no treatment benefits or side effects. When data are missing for this reason, a plot of the mean response over time can be deceptive. Figure 4 shows the proportion of observed responses at the different visit points in each group. Almost all patients in both groups are treated at baseline and have their CD4 count data collected. There are two sharp decreases in the response rate, one from week 0 to week 8 and the other from week 32 to week 40. Approaching the end of the study, most patients have dropped out, and the response
Fig. 4 Proportions of observed responses in the dual and triple therapy groups in ACTG Study 193A
rates at week 40 are close to 20 % for both treatments. The missing information
can substantially influence the analysis and even bias our findings. In the example,
we implement CLFM, which assumes the missing data are non-ignorable, and compare it with the conventional model that ignores the missingness. We also compare
the maximum likelihood CLFM results to results from Roy’s (2003) latent class
model and to results from a Bayesian method given in Zhang (2014).
In the following we describe a model for the mean response that enables the rates of change before and after week 16 to differ within and between groups; this model was also adopted by Fitzmaurice et al. (2004) in their work. Specifically, one could assume that each patient has a piecewise linear spline with a knot at week 16. That is, the response trajectory of each patient can be described by an intercept and two slopes—one slope for the changes in response before week 16, another slope for the changes in response after week 16. Further, we assume that the average slopes for changes in response before and after week 16 are allowed to vary by group. Because this is a randomized study, the mean response at baseline is assumed to be the same in the two groups, as supported by Fig. 3. Hence, instead of the conventional growth curve model, we apply a piecewise growth curve model to capture the changing trends in the CD4 count responses.
Let $t_{ij}$ denote the time since baseline for the jth measurement on the ith subject, with $t_{ij} = 0$ at baseline. We consider the following linear mixed effects model:
$$E(Y_{ij} \mid b_i) = \beta_1 + \beta_2 t_{ij} + \beta_3 (t_{ij} - 16)_+ + \beta_4\, \mathrm{Group}_i\, t_{ij} + \beta_5\, \mathrm{Group}_i\, (t_{ij} - 16)_+ + b_{1i} + b_{2i} t_{ij} + b_{3i} (t_{ij} - 16)_+$$
where $\mathrm{Group}_i = 1$ if the ith subject is randomized to triple therapy and $\mathrm{Group}_i = 0$ otherwise; $(t_{ij} - 16)_+ = t_{ij} - 16$ if $t_{ij} > 16$ and $(t_{ij} - 16)_+ = 0$ if $t_{ij} \le 16$; and $b_{1i}$, $b_{2i}$, and $b_{3i}$ are random effects in this spline growth curve model. In this model, $(\beta_1 + b_{1i})$ is the intercept for the ith subject and has an interpretation as the true log CD4 count at baseline, i.e., when $t_{ij} = 0$. Similarly, $\beta_2 + b_{2i}$ is the ith subject's slope, or rate of change in log CD4 counts from baseline to week 16, if this patient is randomized to dual therapy; $(\beta_2 + \beta_4 + b_{2i})$ is the ith subject's slope if randomized to triple therapy. Finally, the ith subject's slope from week 16 to week 40 is given by $\{(\beta_2 + \beta_3) + (b_{2i} + b_{3i})\}$ if randomized to dual therapy and $\{(\beta_2 + \beta_3 + \beta_4 + \beta_5) + (b_{2i} + b_{3i})\}$ if randomized to triple therapy. The model described above will first be fitted without incorporating the missing-data mechanism (a sketch of such a fit is given below). In order to fit CLFM, one has to specify
the model for the missing part. Assume that R is a missing indicator matrix where
its .i; j/th element rij D 1 if Yij is missing and rij D 0 if it is observed. Within
a framework of CLFM, we incorporate information on missing values through
modeling the missing information matrix R with time location parameters, and a
continuous latent factor u. Further, there are strong indications that support an application of this model. Based on Fig. 4, one can see that the response variable tends to be missing over time; in other words, the time locations are good indicators for explaining the missing data. From Fig. 4 one might also notice that the two therapies have essentially identical missing proportions, which suggests that a group effect for therapy is not necessary in modeling R. The continuous latent factor u is used to describe individuals' variability in missingness, and two regression parameters are specified to provide information on the random intercept and slope, in order to correct estimation bias. A third regression parameter, linking u with $b_3$, was also explored, but the results showed that this parameter is not significant. Hence we exclude this parameter from the final results. To estimate CLFM, we adopt
two approaches: MCEM to obtain ML estimates and full Bayesian estimates with
specified conjugate priors. Point estimates and corresponding standard errors from
a Bayesian perspective are summarized by posterior mean and standard deviation.
Roy's model is also implemented by summarizing the missing-data patterns in R into three latent classes (the number of latent classes for Roy's model is determined by information criteria).
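For reference, a minimal sketch of the ignorable (MAR) piecewise linear mixed model fit described above, using lme4; the long-format data frame actg and its column names are assumptions of this illustration.

library(lme4)

## actg: long-format data with columns
##   logcd4 = log(CD4 count + 1), week = time since baseline,
##   group  = 1 for triple therapy, 0 for dual therapy, id = patient identifier
actg$week16 <- pmax(actg$week - 16, 0)   # spline term with a knot at week 16

## common baseline mean across groups, so no main effect of group is included
mar_fit <- lmer(logcd4 ~ week + week16 + group:week + group:week16 +
                  (week + week16 | id), data = actg)
summary(mar_fit)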
Table 2 Estimated regression coefficients (fixed effects) and variance components (random effects) for the log CD4 counts from a MAR model, Roy's model, and CLFM (MCEM and Bayesian approaches); entries are Estimate (SE)

Variable                     MAR                 Roy                 MCEM                Bayesian
Intercept                    2.9415 (0.0256)     2.9223 (0.0374)     2.9300 (0.0250)     2.9320 (0.0262)
t_ij                         -0.0073 (0.0020)    -0.0051 (0.0056)    -0.0040 (0.0052)    -0.0047 (0.0058)
(t_ij - 16)+                 -0.0120 (0.0032)    -0.0201 (0.0052)    -0.0221 (0.0090)    -0.0223 (0.0092)
Group_i × t_ij               0.0269 (0.0039)     0.0271 (0.0062)     0.0272 (0.0105)     0.0273 (0.0109)
Group_i × (t_ij - 16)+       -0.0277 (0.0062)    -0.0240 (0.0102)    -0.0243 (0.0169)    -0.0243 (0.0177)
Var(b_1i) = g11              585.742 (34.754)    364.000 (49.000)    630.050 (32.430)    640.600 (34.7300)
Var(b_2i) = g22              0.923 (0.160)       1.000 (0.500)       2.3190 (0.9990)     2.3230 (1.0050)
Var(b_3i) = g33              1.240 (0.395)       2.000 (1.013)       37.640 (1.9503)     38.8600 (2.0840)
Cov(b_1i, b_2i) = g12        7.254 (1.805)       7.106 (3.001)       8.6240 (3.0500)     8.5240 (4.0760)
Cov(b_1i, b_3i) = g13        12.348 (2.730)      1.500 (3.120)       2.5150 (5.3000)     2.5220 (6.5000)
Cov(b_2i, b_3i) = g23        0.919 (0.236)       6.405 (0.892)       7.0130 (0.9980)     7.1530 (1.0070)
Var(e_i) = sigma^2           306.163 (10.074)    412.000 (36.000)    500.6300 (6.7390)   515.3000 (9.3570)
In this study, one research question of interest concerns treatment effects on the changes in log CD4 counts. The null hypothesis of no treatment group differences can be expressed as $H_0: \beta_4 = \beta_5 = 0$. The ML estimates of the fixed effects from three models are given in Table 2, including the conventional model with a MAR assumption, Roy's model, which handles non-ignorable missing data from a pattern-mixture perspective, and CLFM. The Bayesian estimates for CLFM are also displayed in Table 2. For the likelihood approach with MAR assumptions, a test of $H_0: \beta_4 = \beta_5 = 0$ yields a Wald statistic of $W^2 = 59.12$ with 2 degrees of freedom, and the corresponding p-value is less than 0.0001. For the full Bayesian approach, we compute the deviance information criterion (DIC) to compare two models: one assumes no treatment effects by excluding the interaction terms between treatment group and study time; the other allows for treatment effects. The DIC for the model including treatment effects is 15,792.7, which is less than that of the model with no group effects, 18,076.5. Based on the criterion "the smaller the better," there is evidence that treatment group differences in the changes in log CD4 counts are significant. The tests from Roy's model and from the MCEM approach to CLFM also support this group difference, with p-values less than 0.0001 for both. Based on the magnitude of the estimate of $\beta_4$ and its standard error from all approaches, there is a significant group difference in the rates of change from baseline to week 16. The estimated response curves for the two groups are displayed in Fig. 5. In this figure, dashed lines represent the response curves from CLFM, dotted lines correspond to the results from Roy's model, and solid lines are the results from the MAR approach. In the dual therapy group, there is a significant decrease in the mean of the log CD4 counts from baseline to
week 16, based on the ignorable likelihood approach. The estimated change during the first 16 weeks is −0.12, which can be obtained from 16 × (−0.0073). On the untransformed scale, this corresponds to an approximate 10 % decrease in CD4 counts. However, CLFM, which assumes the missing data are not ignorable, suggests that this decrease is not significant, since the 95 % credible interval for $\beta_2$ covers zero ([−0.01638, 0.006517]). Further, Roy's model also confirms this finding, with 95 % confidence interval [−0.016076, 0.005876]. Examining the missingness from baseline to week 16, subjects with higher log CD4 counts tend to be missing. CLFM incorporates the non-ignorable missing data into the analysis, and the average of the log CD4 counts tends to recover to a higher value. Hence, the decrease in the mean of the log CD4 counts from baseline to week 16 is not significant when the non-ignorable missing data are taken into account. By comparison, in the triple therapy group there is a significant increase in the mean response. Based on the ignorable approach, the estimated change during the first 16 weeks in the triple therapy group is 0.31 (16 × (−0.0073 + 0.0269)); the estimated slope for the triple therapy group is 0.0196 with a standard error of 0.0033. On the untransformed scale, this corresponds to an approximate 35 % increase in CD4 counts. In CLFM, a similar estimate is obtained: the corresponding estimated change is 0.36 (16 × (−0.0047 + 0.0273)); the estimated slope for the triple therapy group is 0.0226, which corresponds to an approximate 40 % increase in CD4 counts.
The loess curves in Fig. 3 suggest that the rate of decline from week 16 to week 40 is similar for the two groups. The null hypothesis of no treatment group difference in the rates of change in log CD4 counts from week 16 to week 40 can be expressed as $H_0: \beta_4 + \beta_5 = 0$. The estimates of $\beta_4$ and $\beta_5$ from all approaches appear to support the null hypothesis, since they are of similar magnitude but with opposite signs. In the work of Fitzmaurice et al. (2004), a test of the null hypothesis $H_0: \beta_4 + \beta_5 = 0$ yields a Wald statistic of $W^2 = 0.07$ with 1 degree of freedom; the corresponding p-value is greater than 0.75 based on the ignorable likelihood approach. The DIC comparison for the Bayesian version of CLFM also suggests that the two groups have a similar rate of decline from week 16 to week 40. The Wald tests for Roy's model and for the MCEM version of CLFM further indicate parallel change profiles after week 16, with both p-values greater than 0.6.
The estimated variances of the random effects in Table 2 indicate that there is substantial individual variability in baseline CD4 counts and in the rates of change in CD4 counts. For instance, in the triple therapy group, many patients show increases in CD4 counts during the first 16 weeks, but some patients have declining CD4 counts. Specifically, approximately 95 % of patients are expected to have changes in log CD4 counts from baseline to week 16 between −0.64 and 1.27. Hence, approximately 26 % of patients are expected to have decreases in CD4 counts during the first 16 weeks of triple therapy, based on the ignorable likelihood approach; by comparison, a larger variability from patient to patient is indicated by CLFM: 95 % of patients are expected to have changes in log CD4 counts from baseline to week 16 between −1.15 and 1.87, and correspondingly approximately 30 % of patients are expected to have decreases in CD4 counts under CLFM. Substantial
Fig. 5 Fitted response curve in the dual and triple therapy groups in ACTG study 193A
components of variability due to measurement error are also suggested by all of the models (Table 2).
To examine sensitivity to various distributions of the latent factor u, we adopted two distributional forms: the normal distribution and the logistic distribution. In specifying the parameters of the logistic distribution, we chose them so that the logistic distribution has a shape similar to that of the normal distribution, in order to achieve comparability. The estimation procedure was performed within the full Bayesian framework, and the estimation results for the parameters of the linear mixed model, including point estimates and standard errors, are given in Table 3. The routine required a longer time to obtain stable, well-mixed Markov chains when the logistic distribution was used; in detail, we extended the burn-in to 20,000 iterations and ran another 30,000 iterations to obtain posterior estimates, with a thinning interval of 10. From Table 3 one can observe that the two distributions produced essentially identical results, owing to the similar distributional shapes specified. Furthermore, one advantage worth mentioning is that the proposed Bayesian estimation scheme is more flexible in extending the distribution of the repeated measures, beyond allowing different distributional shapes for the latent factor u.
In this study, the missing data are potentially non-ignorable, as suggested by analysis of a randomly selected subsample, especially for the first 16 weeks. To evaluate the effectiveness of the treatment therapies, we compared three approaches: the ignorable model, which assumes the missing data are MAR; Roy's model, which handles non-ignorable missing data from a pattern-mixture perspective; and CLFM with an NMAR assumption. Conflicting results were obtained for the rate of change of log CD4 counts in the dual therapy group during the first 16 weeks: the ignorable approach suggested a significant decrease in log CD4 counts, whereas both Roy's model and CLFM indicated that this decrease is not substantial. This disagreement is due to the potentially non-ignorable missing values. However, all approaches supported the conclusion that the triple therapy group has a rate of change in log CD4 counts from week 16 to week 40 similar to that of the dual therapy group. Further, by incorporating the missing values, CLFM shows the efficacy of both therapy groups to be more substantial, as can be seen from the log CD4 counts at week 40. Compared with Roy's model, the proposed CLFM is more flexible in extending the model with a more general distribution.
References
Adams, R.J., Wilson, M., Wu, M.: Multilevel item response models: An approach to errors in
variables regression. J. Educ. Behav. Stat. 22, 47–76 (1997)
Birch, M.W.: A new proof of the Pearson-Fisher theorem. Ann. Math. Stat. 35, 818–824 (1964)
Bock, R.D., Aitkin, M.: Marginal maximum likelihood estimation of item parameters: application
of an EM algorithm. Psychometrika 46, 443–458 (1981)
Diggle, P., Kenward, M.G.: Informative drop-out in longitudinal data analysis. Appl. Stat. 43,
49–73 (1994)
Diggle, P., Liang, K.Y., Zeger, S.L.: Analysis of Longitudinal Data. Oxford University Press,
Oxford (1994)
Embretson, S.E., Reise, S.P.: Item Response Theory for Psychologists. Erlbaum, Mahwah (2000)
Fitzmaurice, G.M., Laird, N.M., Ware, J.H.: Applied Longitudinal Analysis. Wiley, New York
(2004)
Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
Guo, W., Ratcliffe, S.J., Ten Have, T.R.: A random pattern-mixture model for longitudinal data
with dropouts. J. Am. Stat. Assoc. 99(468), 929–937 (2004)
Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)
Henry, K., Erice, A., Tierney, C., Balfour, H.H. Jr., Fischl, M.A., Kmack, A., Liou, S.H., Kenton, A., Hirsch, M.S., Phair, J., Martinez, A., Kahn, J.O.: A randomized, controlled, double-blind study comparing the survival benefit of four different reverse transcriptase inhibitor therapies for the treatment of advanced AIDS. J. Acquir. Immune Defic. Syndr. Hum. Retrovirol. 19(3), 339–349 (1998)
Jung, H., Schafer, J.L., Seo, B.: A latent class selection model for nonignorably missing data.
Comput. Stat. Data Anal. 55(1), 802–812 (2011)
Laird, N.M., Ware, J.H.: Random-effects models for longitudinal data. Biometrics 38(4), 963–974 (1982)
Lin, H., McCulloch, C.E., Rosenheck, R.A.: Latent pattern mixture models for informative
intermittent missing data in longitudinal studies. Biometrics 60(2), 295–305 (2004)
Little, R.J.A.: Pattern-mixture models for multivariate incomplete data. J. Am. Stat. Assoc. 88,
125–134 (1993)
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley Series in Probability and
Statistics. Wiley, New York (2002)
Lord, F.: A Theory of Test Scores (Psychometric Monograph No. 7). Psychometric Corporation,
Richmond (1952)
Lord, F.M.: The relation of test score to the trait underlying the test. Educ. Psychol. Meas. 13,
517–548 (1953)
Lord, F.M.: Applications of Item Response Theory to Practical Testing Problems. Erlbaum,
Hillsdale (1989).
Louis, T.A.: Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. Ser. B 44, 226–233 (1982)
McCulloch, C.E., Searle, S.R.: Generalized, Linear, and Mixed Models. Wiley, New York (2001)
Meng, X.L., Schilling, S.: Fitting full-information item factor models and an empirical investiga-
tion of bridge sampling. J. Am. Stat. Assoc. 91, 1254–1267 (1996)
Muthén, B.: Contributions to factor analysis of dichotomous variables. Psychometrika 43, 551–560
(1978)
Muthén, B., Jo, B., Brown, C.H.: Principal stratification approach to broken randomized experiments: a case study of school choice vouchers in New York City [with comment]. J. Am. Stat. Assoc. 98(462), 311–314 (2003)
Pirie, P.L., Murray, D.M., Luepker, R.V.: Smoking prevalence in a cohort of adolescents, including
absentees, dropouts, and transfers. Am. J. Public Health 78, 176–178 (1988)
Rasch, G.: Probabilistic Models for Some Intelligence and Attainment Tests. University of Chicago
Press, Chicago (1960)
Raudenbush, S.W., Johnson, C., Sampson, R.: A multivariate, multilevel Rasch model with applications to self-reported criminal behavior. Sociol. Methodol. 33, 169–211 (2003)
Rijmen, F., Tuerlinckx, F., De Boeck, P., Kuppens, P.: A nonlinear mixed model framework for
item response theory. Psychol. Methods 8, 185–205 (2003)
Robert, C.P., Casella, G.: Introducing Monte Carlo Methods with R. Springer, New York (2010)
Roy, J.: Modeling longitudinal data with nonignorable dropouts using a latent dropout class model.
Biometrics 59(4), 829–836 (2003)
Schafer, J.L.: Analysis of Incomplete Multivariate Data. Chapman and Hall, New York (1997)
Takane, Y., de Leeuw, J.: On the relationship between item response theory and factor analysis of discretized variables. Psychometrika 52, 393–408 (1987)
Verbeke, G., Molenberghs, G.: Linear Mixed Models for Longitudinal Data. Springer, New York
(2000)
Wei, G.C., Tanner, M.A.: A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. J. Am. Stat. Assoc. 85, 699–704 (1990)
Wu, C.F.J.: On the convergence properties of the EM algorithm. Ann. Stat. 11, 95–103 (1983)
Zhang, J.: A continuous latent factor model for non-ignorable missing data in longitudinal studies. Unpublished doctoral dissertation, Arizona State University (2014)
Zhang, J., Reiser, M.: Simulation study on selection of latent class models with missing
data. In: JSM Proceedings, Biometrics Section, pp. 98–111. American Statistical Association,
Alexandria (2012)
Part III
Healthcare Research Models
Health Surveillance
Abstract This chapter describes the application of statistical methods for health
surveillance, including those for health care quality monitoring and those for
disease surveillance. The former includes adverse event surveillance as well as the
monitoring of non-disease health outcomes, such as rates of caesarean section or
hospital readmission rates. The latter includes various types of disease surveillance,
including traditional surveillance as well as syndromic surveillance. The methods
described are drawn from the industrial quality control and monitoring literature
where they are frequently referred to as “control charts.” The chapter includes a
detailed background of that literature as well as a discussion of the criteria and
metrics used to assess the performance of health surveillance methods.
1 Introduction
outbreak will naturally grow and recede even when no action is taken to mitigate it, whereas in the industrial case process degradations typically persist until the cause is detected and corrected.
Figure 1, taken from Fricker (2013, p. 6), is a basic taxonomy of public
health surveillance, which includes the surveillance of adverse reactions to medical
interventions (particularly drugs and vaccines) and how health services are used,
as well as disease (epidemiologic) surveillance. Brookmeyer and Stroup (2004,
p. 1) quote Thacker (2000) in defining public health surveillance as “the ongoing
systematic collection, analysis, interpretation, and dissemination of health data
for the purpose of preventing and controlling disease, injury, and other health
problems.”
In this chapter, we describe the application of industrial process monitoring (also
referred to as statistical process control) methods to health care, where we bifurcate
the various health surveillance activities shown in Fig. 1 into those for health care
quality monitoring and those for disease surveillance. The former includes adverse
event surveillance, such as death following surgery, as well as the monitoring of non-
disease health outcomes, such as rates of caesarean section or hospital readmission
rates. The latter includes various types of disease surveillance, including traditional
surveillance as well as syndromic surveillance. In so doing, we also discuss the
criteria and metrics used to assess the performance of health surveillance methods.
In any system, there is a certain amount of noise present that cannot be
reduced without fundamentally changing the system. Occasionally, however, some
change is introduced into the system resulting in a change to the output. This change
could affect the mean response, the variability of the response, or it could influence
the process output in some other way. Monitoring industrial processes via control
charts dates back to the 1920s when Walter Shewhart suggested that there is a
distinction between common causes of variability, the inherent noise in the system,
and special causes of variability, those sources which induce a change in the system
(Shewhart 1931).
Shewhart's insight was to plot quality measures of the output and to specify upper and lower limits within which the plotted measure is likely to fall if the process is in-control, that is, producing output with the same mean and variance. Points outside these control limits would then be taken to indicate that the process has changed. Often the control limits are placed three standard deviations above and below the process mean, since the probability of a random variable being beyond three standard deviations is very small (e.g., the probability is 0.0027 if the normal distribution is an accurate model for the outcomes).
The kind of chart that is used to monitor the process depends on the type of data
collected. These are discussed in the next few subsections.
When the outcome is the measurement of some quantity, such as length, weight,
time, density, etc., then the data are said to be continuous. The quality control
literature often uses the term variables data for continuous measures.
For continuous data, the typical procedure is to take subgroups of size $n$ (often $n = 3$ to 5) and from each subgroup compute the average $\bar{x}$ and some measure of the variability, such as the range ($R = x_{\max} - x_{\min}$) or the sample standard deviation $s$. These statistics, $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$ and either $R_1, R_2, R_3, \ldots$ or $s_1, s_2, s_3, \ldots$, are then plotted in time order to monitor the mean and variance of the process.
If the process is normally distributed with mean $\mu_0$ and standard deviation $\sigma_0$ when the process is in control, then the upper control limit (UCL) and the lower control limit (LCL) for the "$\bar{X}$-chart" are
$$\mathrm{UCL}_{\bar{X}} = \mu_0 + 3\frac{\sigma_0}{\sqrt{n}}, \qquad \mathrm{LCL}_{\bar{X}} = \mu_0 - 3\frac{\sigma_0}{\sqrt{n}}.$$
Fig. 2 An illustrative $\bar{X}$-chart, plotted against the time/subgroup index $i$ with UCL and LCL shown, where the process goes out of control at time 12
The control limits for the "R-chart" are $\mathrm{UCL}_R = D_2\sigma_0$ and $\mathrm{LCL}_R = D_1\sigma_0$, while for the "s-chart" they are $\mathrm{UCL}_s = B_6\sigma_0$ and $\mathrm{LCL}_s = B_5\sigma_0$. The constants $D_1$, $D_2$, $B_5$, and $B_6$ are functions of the subgroup size $n$, and are tabulated in most books on statistical quality control (e.g., see Montgomery 2009, Appendix VI, p. 702).
The basic idea of a control chart is to then monitor future observations. Those
that fall within the LCL and UCL are determined to have only common variability
and thus the process is assumed to be behaving normally. However, if one or more
points fall outside of the control limits, that is an indication that one or more special
causes of variability are present, and thus the process is not behaving normally.
Under these conditions, the control chart is said to signal and the process should be
investigated and the special causes of variability identified and rectified. Figure 2
is an example of an $\bar{X}$-chart with "3-sigma" control limits where the control chart signals an out-of-control condition at time $i = 12$.
In practice, of course, the parameters $\mu_0$ and $\sigma_0$ are unknown and must be estimated from data taken when the process is in control. The iterative process of collecting data, estimating parameters, discarding data for which there is an explainable cause, and re-estimating the parameters is called Phase I. This process is often more difficult than it might sound; see Jordan and Benneyan (2012) for a description of the issues involved when health care data are being monitored.
The usual estimate for $\mu_0$ is the grand average of the subgroup means for data taken when the process is in control. That is, for $m$ subgroups,
$$\bar{\bar{x}} = \frac{1}{m}\sum_{i=1}^{m} \bar{x}_i.$$
The estimate of $\sigma_0$ is
$$\hat{\sigma}_0 = \bar{R}/d_2,$$
if the R-chart is used to monitor the variability, where $\bar{R}$ is the average of the $m$ range measures, or
$$\hat{\sigma}_0 = \frac{1}{m}\sum_{i=1}^{m} s_i$$
if the s-chart is used. As with the other constants, the constant $d_2$ is tabulated in, for example, Montgomery (2009, Appendix VI, p. 702).
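To make the Phase I calculations concrete, the following short Python sketch (not from the book; the simulated data and the tabulated constant $d_2 = 2.326$ for subgroups of size 5 are illustrative assumptions) estimates $\mu_0$ by the grand average of the subgroup means, estimates $\sigma_0$ by $\bar{R}/d_2$, and forms the 3-sigma limits for the $\bar{X}$-chart.

# A minimal sketch of Phase I estimation for an X-bar chart: mu_0 is estimated
# by the grand average of the subgroup means and sigma_0 by R-bar / d2.
# The data and the constant d2 = 2.326 (the standard value for n = 5) are
# illustrative assumptions, not values from the chapter.
import numpy as np

rng = np.random.default_rng(1)
subgroups = rng.normal(loc=10.0, scale=2.0, size=(25, 5))   # m = 25 subgroups of size n = 5

m, n = subgroups.shape
xbar_i = subgroups.mean(axis=1)                              # subgroup means
R_i = subgroups.max(axis=1) - subgroups.min(axis=1)          # subgroup ranges

xbarbar = xbar_i.mean()                                      # estimate of mu_0
d2 = 2.326                                                   # tabulated constant for n = 5
sigma_hat = R_i.mean() / d2                                  # estimate of sigma_0

UCL = xbarbar + 3 * sigma_hat / np.sqrt(n)
LCL = xbarbar - 3 * sigma_hat / np.sqrt(n)
print(f"Xbar chart: LCL = {LCL:.3f}, center = {xbarbar:.3f}, UCL = {UCL:.3f}")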
Once the process parameters are estimated from historical data with reasonable
accuracy, that is, with a sufficiently large number of subgroups, the real-time
monitoring of the process begins. This is called Phase II and it is the phase that
is most often associated with the use of control charts. Recent studies indicate that
the needed sample sizes can be much larger than previously thought; see Champ
et al. (2005), Jensen et al. (2006), and Champ and Jones-Farmer (2007).
Subgrouping is widely recommended because a sample average is more likely to
signal a change (if there is one) than control charts based on individual observations.
There are times, however, when each individual data value should be plotted and
a decision made about the process. For example, if data points are taken very
infrequently, it might be desirable to plot each one. In cases like this, the “individuals
chart," or simply the "X-chart," can be applied. If the mean and standard deviation are known, then the control limits for an individuals chart based on 3-sigma limits are simply
$$\mathrm{UCL}_X = \mu_0 + 3\sigma_0, \qquad \mathrm{LCL}_X = \mu_0 - 3\sigma_0.$$
If the parameters are unknown, the usual practice is to compute the average moving range $\overline{MR}$, the average of the moving ranges $MR_i = |x_i - x_{i-1}|$, and then estimate $\sigma_0$ as $\hat{\sigma}_0 = \overline{MR}/1.128$. This "short term" estimate of the variability is less likely to overestimate $\sigma_0$.
Some authors suggest running an X-chart to monitor the process mean and a chart
of the moving ranges to monitor the process dispersion. Rigdon et al. (1994) have
shown that the MR-chart is nearly powerless to detect changes in variability. They
suggest plotting only an X-chart to monitor both mean and variability.
Rather than measuring the quality of a unit on a continuous scale, there are cases
where each unit can only be classified as conforming or nonconforming, where
conforming means that the unit meets the requisite quality standards. For example,
a requirement that the unit should be free of surface blemishes does not yield a
measurement; a unit either has or does not have surface blemishes. A similar type
of data occurs when, for example, one is counting the number of scratches in a roll
of sheet metal. Data such as these are called attributes data in the quality literature
and discrete data in much of the statistics literature.
In the situation where each unit is either conforming or nonconforming, the usual procedure is to take a subgroup of size $n_i$ at time $i$ and observe the number $X_i$ of nonconforming units. If the units are independent with constant probability (within the subgroup) of being nonconforming, $p_i$, then $X_i$ has a binomial distribution with parameters $n_i$ and $p_i$. When the process is in-control, the probability is constant, that is, $p_i = p_0$ for all $i$. The goal is to detect a change as quickly as possible if the nonconforming probability shifts to $p_1$, which could be larger or smaller than $p_0$.
A chart of $\hat{p}_i = x_i/n_i$ against the time index $i$ is called a "p-chart." The mean and variance of $\hat{p}_i$ are $E(\hat{p}_i) = p_0$ and $V(\hat{p}_i) = p_0(1-p_0)/n_i$ for an in-control process. The control limits are then placed three standard deviations above and below the mean:
$$\mathrm{UCL}_p = p_0 + 3\sqrt{\frac{p_0(1-p_0)}{n_i}}, \qquad (1)$$
$$\mathrm{LCL}_p = \max\!\left(0,\ p_0 - 3\sqrt{\frac{p_0(1-p_0)}{n_i}}\right). \qquad (2)$$
The max in the formula for the LCL is needed because the second expression in Eq. (2) can be negative for small values of $p_0$ or $n_i$. If the LCL is equal to 0, then no signal can be raised for a decrease in the proportion nonconforming. It is usually desirable to detect a decrease in $p$ for two reasons: first, a low value of $\hat{p}$ could be due to measurement error (e.g., a new employee who misunderstands the criteria for nonconforming), and second, a change in the process that leads to better quality is worth knowing about so that the change can be made permanent (or more widely implemented).
Often, the subgroup size ni is constant, in which case the control limits in Eqs. (1)
and (2) are constant. However, there are cases where the ni will vary from subgroup
to subgroup. For example, in monitoring surgical outcomes, the time frame might
be fixed at one month, and the number of surgeries will vary from month to month.
In these types of cases, the control limits will vary.
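The following sketch computes the p-chart limits of Eqs. (1) and (2) when the subgroup sizes vary; the in-control proportion $p_0$ and the monthly counts are illustrative assumptions.

# A minimal sketch of p-chart limits with varying subgroup sizes, as in Eqs. (1)
# and (2). p0 and the counts below are illustrative assumptions.
import numpy as np

p0 = 0.08                                         # assumed in-control proportion
n_i = np.array([45, 52, 38, 60, 41, 55])          # subgroup sizes (e.g., surgeries per month)
x_i = np.array([3, 5, 2, 12, 4, 3])               # nonconforming (adverse) counts

p_hat = x_i / n_i
se = np.sqrt(p0 * (1 - p0) / n_i)
UCL = p0 + 3 * se
LCL = np.maximum(0.0, p0 - 3 * se)                # max(0, .) as in Eq. (2)

for i, (p, lo, hi) in enumerate(zip(p_hat, LCL, UCL), start=1):
    flag = "signal" if (p > hi or p < lo) else ""
    print(f"subgroup {i}: p_hat={p:.3f}  limits=({lo:.3f}, {hi:.3f}) {flag}")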
In practice, $p_0$ is estimated from the Phase I data by
$$\bar{p}_0 = \frac{x_1 + x_2 + \cdots + x_m}{n_1 + n_2 + \cdots + n_m}.$$
Of course, if $n_i = n$ for all subgroups, then this estimate of $p_0$ is simply the average of the $\hat{p}_i$.
Occasionally, the total number of nonconforming units $X_i$ is monitored rather than the proportion. This is called an "np-chart" since $X_i = n_i\hat{p}_i$ when the process is stable. The np-chart is normally used only when the subgroup sizes $n_i$ are constant.
There are situations where the output is a count of the number of nonconformities
per unit. For example, the measurement might be the number of voids (air pockets)
in a plastic molded part; for any unit, there could be 0, or 1, or 2, etc., voids.
That is, there can be more than one nonconformity per unit. For count data such
as this, the Poisson distribution is often an appropriate model for the number of
nonconformities $X_i$ per unit at time $i$. The Poisson distribution has one parameter $\lambda$, which is also the mean and variance of the distribution: $E(X_i) = \lambda$ and $V(X_i) = \lambda$. A plot of $x_i$ against $i$ is called a "c-chart," and the control limits are
$$\mathrm{UCL}_c = \lambda_0 + 3\sqrt{\lambda_0},$$
$$\mathrm{LCL}_c = \max\!\left(0,\ \lambda_0 - 3\sqrt{\lambda_0}\right), \qquad (3)$$
where $\lambda_0$ is the Poisson distribution parameter when the process is in control. The maximum function is needed in Eq. (3) because $\lambda_0 - 3\sqrt{\lambda_0}$ is negative when $\lambda_0 < 9$. Of course, in practice the value of $\lambda_0$ is unknown and must be estimated from prior data,
$$\hat{\lambda}_0 = \frac{1}{m}\sum_{i=1}^{m} x_i,$$
where the estimate is calculated for the $m$ Phase I data periods when the process is in control.
The Poisson distribution is often a reasonable model for the number of events
that occur in a fixed time interval, or the number of occurrences on a fixed area
of the output. There are cases, though, where the variance is larger than the mean.
This phenomenon is called overdispersion, and if this occurs for some data set, the
negative binomial distribution (a two-parameter distribution) is often used in place
of the Poisson.
210 S.E. Rigdon and R.D. Fricker, Jr.
The c-chart assumes that the sample consists of a single unit, and the number $x_i$ of nonconformities on that unit is recorded. The sample at each time unit could, however, consist of $n_i$ units, rather than a single unit. The statistic $u_i = x_i/n_i$ is then plotted. A plot of $u_i$ against time $i$ is called a "u-chart." The control limits for the u-chart are
$$\mathrm{UCL}_u = \lambda_0 + 3\sqrt{\frac{\lambda_0}{n_i}}, \qquad \mathrm{LCL}_u = \max\!\left(0,\ \lambda_0 - 3\sqrt{\frac{\lambda_0}{n_i}}\right).$$
All of the charts described in Sects. 2.1 and 2.2 are commonly referred to as
Shewhart charts—named after Walter Shewhart, who first used them—and they
share the property that the decision made at the current time is based on data
collected only at the current time. If, for example, a point is inside the control limits,
then the process is deemed to be in-control and when the next data point is collected,
this point and all past data points are ignored.1
For Shewhart charts, the number of subgroups between signals has a geometric distribution with parameter $p$, which is the probability of being outside of the control limits. The expected number of subgroups between signals is commonly referred to as the average run length (or ARL), where $\mathrm{ARL} = 1/p$, and the ARL is used to quantify and compare the performance of control charts.
For the $\bar{X}$-chart with 3-sigma limits, for example, the probability of signaling when the process is in-control is $p = 0.0027$ and so the in-control ARL, or ARL(0), is $1/p \approx 370$. This is the average time until a false signal (it is a false signal because the process is in-control), and thus ARL(0) is a measure of how well the control chart performs when the process is in-control.
Now, if a process were to go out-of-control, say with the mean increasing by one standard deviation (i.e., $\mu_1 = \mu_0 + \sigma/\sqrt{n}$), then the probability is $p = 0.0227$ that a subgroup mean will exceed the UCL (and there is a negligible probability, 0.00003, that the subgroup mean would fall below the LCL). Under these conditions, the out-of-control ARL, or ARL(1), is $1/p \approx 44$. (Here, $\mathrm{ARL}(\delta)$ is the average run length when the process mean shifts by the amount $\delta$ standard deviations.) The result is that it can take a Shewhart $\bar{X}$-chart a long time to signal for small to moderate changes in the mean.
¹ Sometimes Shewhart charts are used with supplementary runs rules, such as "also signal if there are eight points in a row on the same side of the center line." In these cases, it is no longer true that past data are ignored, but even with the addition of such rules, the charts just described are often referred to as Shewhart charts.
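The run-length arithmetic above can be reproduced directly from the geometric distribution. The sketch below (using the normal model for subgroup means, an assumption stated in the text) returns $\mathrm{ARL} = 1/p$ for a given shift, measured in units of the subgroup-mean standard deviation.

# A small sketch computing the ARL of a Shewhart X-bar chart with 3-sigma limits:
# the run length is geometric, so ARL = 1/p where p is the per-subgroup signal
# probability under a normal model.
from scipy.stats import norm

def shewhart_arl(delta, L=3.0):
    # P(signal) = P(Z > L - delta) + P(Z < -L - delta) when the mean shifts by delta
    p = norm.sf(L - delta) + norm.cdf(-L - delta)
    return 1.0 / p

for delta in [0.0, 1.0, 2.0, 3.0]:
    print(f"shift = {delta:.1f} sd: ARL = {shewhart_arl(delta):.1f}")   # 370, 44, ...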
The CUSUM chart is based on the sequential probability ratio test of Wald (1945), which is designed to test the simple hypotheses $H_0: \mu = \mu_0$ and $H_1: \mu = \mu_1$. The sequential probability ratio test is designed to do this sequentially in time; that is, at each stage, the decision can be to accept $H_0$, reject $H_0$, or continue taking data. Wald (1945) showed that the optimal form of the test is to compute the cumulative sum
$$X_i = X_{i-1} + \log\frac{L_{1i}}{L_{0i}},$$
where $L_{ji}$ is the likelihood under $H_j$, $j = 0, 1$. The sequential probability ratio test terminates when
$$X_i > b = \log\frac{1-\beta}{\alpha}$$
or when
$$X_i < a = \log\frac{\beta}{1-\alpha},$$
where $\alpha$ and $\beta$ are the desired (or target) probabilities of Type I and Type II errors, respectively. (The values of $a$ and $b$ given above yield values of $\alpha$ and $\beta$ that are only approximately correct; the true probabilities of Type I or Type II errors
will differ slightly from the target.) In the case of process monitoring, whether it
be quality or health, there is really never an option to “accept” the null hypothesis,
so the lower limit is ignored. Thus, the lower boundary is normally replaced by a
reflecting boundary, usually at zero.
Because the null hypothesis is never “accepted” and eventually the statistic Xi
will cross its boundary b, the probabilities of Type I and Type II errors are really
1 and 0, respectively. For this reason, the metrics of Type I and Type II errors are
never used in process monitoring. Rather, we look at properties of the run length
distribution. Since we want to detect quickly a large shift, the run length should
be small when the process change is large, and since we don’t want false alarms,
we want the run length to be large when there is no change. The ARL defined in
Sect. 2.2 is a common metric, especially in the quality monitoring literature, but
there are other metrics that can be used. See Fraker et al. (2008) for a discussion of
other metrics.
The CUSUM control chart of Page (1954) and Lorden (1971) is a well-known
industrial process monitoring methodology. The simplest form involves the sum
$$C_i = \sum_{j=1}^{i}(x_j - \mu_0), \qquad (4)$$
which can be computed recursively as
$$C_i = C_{i-1} + (x_i - \mu_0).$$
This version of the CUSUM chart involves plotting Ci against the time index i and
looking for changes in the slope of the data. This is rather difficult to do by eye, so
graphical procedures, such as the V-mask (Montgomery 2009, p. 415), have been
developed.
An alternative to the V-mask is to accumulate two separate cumulative sums: one to detect upward increases in the mean, and one to detect decreases. Suppose that $\mu_0$ and $\sigma_0$ are the process mean and standard deviation when the process is in-control, and suppose it is desirable to detect a change of $k$ standard deviations in the mean, i.e., a shift from $\mu_0$ to $\mu_1 = \mu_0 + k\sigma_0/\sqrt{n}$ if subgroups of size $n$ are used. The two CUSUMs are defined by
$$C_0^+ = 0, \qquad C_i^+ = \max\!\left(0,\ C_{i-1}^+ + \frac{x_i - \mu_0}{\sigma_0} - k\right) \qquad (5)$$
and
$$C_0^- = 0, \qquad C_i^- = \min\!\left(0,\ C_{i-1}^- + \frac{x_i - \mu_0}{\sigma_0} + k\right). \qquad (6)$$
The CUSUM chart raises a signal when $C_i^+ > h$ or $C_i^- < -h$. Since in some cases it is more desirable to detect quickly an increase in the mean than a decrease (or vice-versa), it is possible to use different values of $k$ and $h$ for the upper and lower CUSUMs.
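A minimal implementation of the two-sided CUSUM of Eqs. (5) and (6) is sketched below; $\mu_0$, $\sigma_0$, $k$, $h$, and the simulated data are illustrative assumptions.

# A minimal sketch of the two-sided CUSUM of Eqs. (5) and (6), applied to
# standardized observations. Parameter values and data are illustrative.
import numpy as np

def cusum(x, mu0, sigma0, k=0.5, h=5.0):
    c_plus, c_minus = 0.0, 0.0
    signals = []
    for i, xi in enumerate(x):
        z = (xi - mu0) / sigma0
        c_plus = max(0.0, c_plus + z - k)      # Eq. (5)
        c_minus = min(0.0, c_minus + z + k)    # Eq. (6)
        if c_plus > h or c_minus < -h:
            signals.append(i)
    return signals

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(1, 1, 30)])   # mean shifts up at i = 50
sig = cusum(x, mu0=0.0, sigma0=1.0)
print("first signal at index:", sig[0] if sig else None)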
For small to moderate shifts, the CUSUM chart will signal a change with a
shorter ARL than the Shewhart chart when the two charts have the same in-control
ARL. For example, the CUSUM chart with $k = 0.5$ and $h = 5$ yields much shorter out-of-control ARLs for such shifts.
Thus, the CUSUM will catch a one standard deviation shift in the mean, on average, in about one-fifth the time of the Shewhart chart with the same ARL(0) performance.
Although the CUSUM will catch small to moderate shifts much quicker than the Shewhart, the reverse is true when there is a very large shift. For example, $\mathrm{ARL}_{\mathrm{CUSUM}}(4) = 2.0$ whereas $\mathrm{ARL}_{\mathrm{Shewhart}}(4) = 1.2$. For this reason, the CUSUM and the Shewhart charts are often used in tandem, often with limits of $\pm 3.5$ standard deviations or higher on the Shewhart chart.
The CUSUM can also be used to monitor process variability. For example, to monitor an increase in process variability, following Hawkins and Olwell (1998, p. 67), use the CUSUM recursion of Eq. (5) with the standardized observation replaced by the transformed value
$$y_i = \frac{\sqrt{|x_i|} - 0.822}{0.394}.$$
As recommended by Hawkins and Olwell, the same value of $k$ should be used in these CUSUMs for monitoring variability as in the CUSUMs for the mean.
The EWMA chart of Roberts (1959) calculates weighted averages of the current data value ($x_i$) and the previous EWMA statistic ($z_{i-1}$),
$$z_i = \lambda x_i + (1-\lambda)z_{i-1}. \qquad (7)$$
The weights on past data values decrease exponentially, hence the name of the
control chart.
When the process is in control with mean $\mu_0$ and standard deviation $\sigma_0$,
$$E(z_i) = \mu_0$$
and
$$V(z_i) = \sigma_0^2\,\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right].$$
The exact control limits for the EWMA chart are therefore
$$\mathrm{UCL}_{\mathrm{EWMA}} = \mu_0 + L\sigma_0\sqrt{\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right]},$$
$$\mathrm{LCL}_{\mathrm{EWMA}} = \mu_0 - L\sigma_0\sqrt{\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right]},$$
while the asymptotic control limits are
$$\mathrm{UCL}_{\mathrm{EWMA}} = \mu_0 + L\sigma_0\sqrt{\frac{\lambda}{2-\lambda}}, \qquad \mathrm{LCL}_{\mathrm{EWMA}} = \mu_0 - L\sigma_0\sqrt{\frac{\lambda}{2-\lambda}}.$$
The values of $\lambda$ and $L$ are chosen to give the desired in-control and out-of-control ARLs. The EWMA chart has properties much like the CUSUM chart. ARLs for small to moderate shifts are much smaller for the EWMA or CUSUM chart than for the Shewhart chart, for example, with $\lambda = 0.1$ and $L = 2.79$. Comparing these numbers to those of the CUSUM chart, note that the EWMA and CUSUM charts have similar ARL properties.
Very small values of $\lambda$, such as $\lambda = 0.05$, produce a nearly uniform weighting of past observations, with very little weight on the current data value. As a result, similar to the CUSUM, large shifts are difficult to detect quickly with the EWMA chart.
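The sketch below implements the EWMA recursion of Eq. (7) with the exact, time-varying control limits given above; $\lambda = 0.1$ and $L = 2.79$ follow the example in the text, while the data are simulated for illustration.

# A minimal sketch of the EWMA chart of Eq. (7) with the exact control limits.
import numpy as np

def ewma_chart(x, mu0, sigma0, lam=0.1, L=2.79):
    z = mu0
    out = []
    for i, xi in enumerate(x, start=1):
        z = lam * xi + (1 - lam) * z                       # Eq. (7)
        width = L * sigma0 * np.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * i)))
        out.append((z, mu0 - width, mu0 + width, z > mu0 + width or z < mu0 - width))
    return out

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0, 1, 40), rng.normal(0.75, 1, 40)])   # shift at i = 40
first = next((i for i, (_, _, _, sig) in enumerate(ewma_chart(x, 0.0, 1.0)) if sig), None)
print("first EWMA signal at index:", first)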
In even the simplest process there is frequently more than one quality characteristic
to be monitored and these quality characteristics are often correlated. For example,
in industrial process monitoring, measurements of dimensions on a plastic part are
generally affected by the pressure and length of time that the plastic was intruded
into the mold. If the time and pressure are both high, then all dimensions of the part
will tend to be on the high side. Similar issues arise in health monitoring; for example, when monitoring systolic and diastolic blood pressure.
Assuming the quality measures have a multivariate, specifically a $p$-variate, normal distribution with mean vector $\boldsymbol{\mu}_0$ and covariance matrix $\boldsymbol{\Sigma}$, so that $\mathbf{x} \sim N_p(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})$, the Mahalanobis distance
$$T^2 = (\mathbf{x} - \boldsymbol{\mu}_0)'\,\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_0) \qquad (8)$$
is the distance from $\mathbf{x}$ to the distribution's mean $\boldsymbol{\mu}_0$, taking into account the covariance. Two points with the same Mahalanobis distance will have equal probability density height. Note that observations of the multiple quality characteristics within a single unit are correlated, but successive random vectors are independent.
A chart based on the $T^2$ statistic is called the Hotelling $T^2$ chart (Hotelling 1947). If subgroups of size $n$ are used, then the sample mean vector $\bar{\mathbf{x}}_i$ is computed for each subgroup and the $T^2$ statistic is computed with $\bar{\mathbf{x}}_i$ in place of $\mathbf{x}$ in Eq. (8). Since $T^2$ measures the distance from the middle of the distribution, there is no LCL.
If the parameters are unknown, they are estimated by the grand mean
$$\bar{\bar{\mathbf{x}}} = \frac{1}{m}\sum_{j=1}^{m}\bar{\mathbf{x}}_j$$
and
$$\bar{\mathbf{S}} = \frac{1}{m}\sum_{j=1}^{m}\mathbf{S}_j,$$
where $\bar{\mathbf{x}}_j$ and $\mathbf{S}_j$ are the sample mean vector and sample covariance matrix within the $j$th subgroup, and the $T^2$ statistic becomes
$$T_i^2 = \left(\bar{\mathbf{x}}_i - \bar{\bar{\mathbf{x}}}\right)'\,\bar{\mathbf{S}}^{-1}\left(\bar{\mathbf{x}}_i - \bar{\bar{\mathbf{x}}}\right).$$
With estimated parameters, the upper control limit is
$$\mathrm{UCL}_{T^2} = \frac{p(m+1)(m-1)}{m(m-p)}\,F_{\alpha;\,p,\,m-p}.$$
Champ et al. (2005) showed that very large sample sizes are needed in order to make the $T^2$ chart with estimated parameters behave like the $T^2$ chart with assumed known parameters.
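The following sketch computes a Hotelling $T^2$ statistic for a new subgroup mean using the grand mean vector and the average within-subgroup covariance matrix estimated from Phase I data, as described above; the simulated data and dimensions are illustrative assumptions, and no formal UCL is computed.

# A minimal sketch of a Phase II Hotelling T^2 statistic for a subgroup mean,
# with parameters estimated from Phase I subgroups. Data are illustrative.
import numpy as np

rng = np.random.default_rng(5)
p, n, m = 2, 5, 30
cov = np.array([[1.0, 0.7], [0.7, 1.0]])
phase1 = rng.multivariate_normal(mean=[0, 0], cov=cov, size=(m, n))   # m subgroups of size n

xbar_j = phase1.mean(axis=1)                       # subgroup mean vectors
S_j = np.array([np.cov(g, rowvar=False) for g in phase1])
xbarbar = xbar_j.mean(axis=0)                      # grand mean vector
Sbar_inv = np.linalg.inv(S_j.mean(axis=0))         # inverse of the average covariance

def t2(xbar_i):
    d = xbar_i - xbarbar
    return float(d @ Sbar_inv @ d)

new_subgroup = rng.multivariate_normal(mean=[1.0, 1.0], cov=cov, size=n)   # shifted mean
print("T^2 for new subgroup mean:", round(t2(new_subgroup.mean(axis=0)), 2))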
The $T^2$ chart is a Shewhart chart in the sense that the decision at time $i$ depends only on data from time $i$; no accumulation of data is done. It is also directionally invariant; that is, the run length distribution depends only on the magnitude of the shift, measured by the Mahalanobis distance between the shifted and in-control mean vectors.
Crosier's MCUSUM chart accumulates the deviations
$$\mathbf{C}_i = \mathbf{C}_{i-1} + \mathbf{x}_i - \boldsymbol{\mu}_0,$$
where $\boldsymbol{\mu}_0$ is the mean vector and $\boldsymbol{\Sigma}_0$ is the variance–covariance matrix when the process is in-control. It then "shrinks" the cumulative sum toward zero whenever its length, measured with respect to $\boldsymbol{\Sigma}_0^{-1}$, exceeds a reference value. The control chart starts with $\mathbf{C}_0 = \mathbf{0}$ and it signals that a change has occurred when $S_i \geq h$, for some threshold $h$, where $S_i$ denotes the length of the shrunken cumulative sum.
The literature contains a number of other MCUSUM control charts. In fact, Crosier's MCUSUM control chart described above is one of a number of multivariate CUSUM-like algorithms he proposed, but Crosier generally preferred the above procedure after extensive simulation comparisons. Pignatiello and Runger (1990) proposed other multivariate CUSUM-like algorithms but found that they performed similarly to Crosier's. Healy (1987) derived a sequential likelihood ratio
The MEWMA chart of Lowry et al. (1992) computes the vector of exponentially weighted moving averages
$$\mathbf{z}_0 = \boldsymbol{\mu}_0, \qquad \mathbf{z}_i = \lambda\mathbf{x}_i + (1-\lambda)\mathbf{z}_{i-1},$$
and the statistic
$$T_i^2 = (\mathbf{z}_i - \boldsymbol{\mu}_0)'\,\boldsymbol{\Sigma}_{\mathbf{z}_i}^{-1}(\mathbf{z}_i - \boldsymbol{\mu}_0), \qquad (10)$$
where
$$\boldsymbol{\Sigma}_{\mathbf{z}_i} = \frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right]\boldsymbol{\Sigma}.$$
A signal is raised on the MEWMA chart whenever $T_i^2$ exceeds the value $h$. Just as for the univariate EWMA chart, the parameters $\lambda$ and $h$ are chosen to produce the desired ARL properties of the chart. Note that it is possible to use the exact covariance matrix or the asymptotic covariance matrix
$$\boldsymbol{\Sigma}_{\mathbf{z}} \approx \frac{\lambda}{2-\lambda}\boldsymbol{\Sigma}$$
in the computation of the $T^2$ statistic in Eq. (10). Thus, there are actually two versions of the MEWMA chart. Tables for choosing $\lambda$ and $h$ are given in Lowry et al. (1992) and Montgomery (2009).
The MCUSUM and MEWMA can detect small to moderate shifts in the mean more quickly than the Hotelling $T^2$ chart. For example, when $p = 6$, $\lambda = 0.2$, $h = 17.51$, and the shift is
$$\delta = \left[(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)'\,\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)\right]^{1/2},$$
the MEWMA ARL is 14.6. In contrast, the Hotelling $T^2$ chart with $\mathrm{UCL} = h = 18.55$ gives an ARL of 74.4 for the same shift. Both control charts have an in-control ARL of 200.
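A minimal MEWMA sketch is given below, using the asymptotic covariance matrix $\frac{\lambda}{2-\lambda}\boldsymbol{\Sigma}$ in the $T^2$ statistic of Eq. (10); $p = 6$, $\lambda = 0.2$, and $h = 17.51$ follow the example above, while the simulated data and the shift are illustrative assumptions.

# A minimal sketch of the MEWMA recursion and the statistic in Eq. (10), using
# the asymptotic covariance matrix. Data and the shift are illustrative.
import numpy as np

def mewma(x, mu0, Sigma, lam=0.2, h=17.51):
    z = mu0.copy()
    Sigma_z_inv = np.linalg.inv(Sigma * lam / (2 - lam))   # asymptotic covariance
    for i, xi in enumerate(x):
        z = lam * xi + (1 - lam) * z
        T2 = float((z - mu0) @ Sigma_z_inv @ (z - mu0))
        if T2 > h:
            return i
    return None

rng = np.random.default_rng(6)
mu0 = np.zeros(6)
Sigma = np.eye(6)
x = np.vstack([rng.multivariate_normal(mu0, Sigma, 50),
               rng.multivariate_normal(mu0 + 0.5, Sigma, 50)])   # shift at i = 50
print("first MEWMA signal at index:", mewma(x, mu0, Sigma))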
In quality monitoring, it is usually assumed that the input materials are homo-
geneous, and that the resulting process output has a fixed mean and variance.
Health care monitoring of individual patients, on the other hand, must account
for differences among patients. This case mix, that is, the variability in risk
factors among the patients being monitored, must be taken into account. Otherwise,
providers who take on patients with high risk factors would be penalized when fewer
patients survive.
Thus, the first important difference from industrial process monitoring is that
health monitoring data must be risk-adjusted, so that comparison among or across
providers is done fairly. In this context, risk adjustment means building a model
using historical data relating risk factors, such as age, body mass index (BMI),
diabetes status, etc., to the outcome variable. What is charted, then, is some statistic
that does not depend on the levels of the predictor variable.
Before any risk-adjusted chart can be applied, a model must be developed that relates the probability $p_i$ of the adverse outcome for patient $i$ to the predictor variables $x_{i1}, x_{i2}, \ldots, x_{ip}$. A logistic regression model assumes that this relationship is of the form
$$\mathrm{logit}(p_{0i}) = \log\frac{p_{0i}}{1-p_{0i}} = \beta_{00} + \beta_{01}x_1 + \cdots + \beta_{0p}x_p. \qquad (11)$$
The parameters $\beta_{00}, \beta_{01}, \ldots, \beta_{0p}$ must be estimated from some baseline set of data taken when the process is stable. We will look for a change in the parameters from $\beta_{00}, \beta_{01}, \ldots, \beta_{0p}$ to $\beta_{10}, \beta_{11}, \ldots, \beta_{1p}$. If we write $\mathbf{x}_i = (1, x_{i1}, \ldots, x_{ip})'$ and $\boldsymbol{\beta}_0 = (\beta_{00}, \beta_{01}, \ldots, \beta_{0p})'$, then we can write the logistic model in Eq. (11) as
$$p_{0i} = \frac{\exp(\mathbf{x}_i'\boldsymbol{\beta}_0)}{1+\exp(\mathbf{x}_i'\boldsymbol{\beta}_0)}. \qquad (12)$$
Consider, for example, the cardiac survival data from Steiner et al. (2000). The response variable is a dichotomous variable that indicates death within 30 days ($Y = 1$) or survival past 30 days ($Y = 0$).² The predictor variable is the Parsonnet score, a measure of the patient's preoperative surgical risk.
² Attributes or discrete data are much more common in health care monitoring. In fact, many variables are dichotomized, that is, changed from a continuous measurement into a yes/no measurement. Here, for example, the variable of interest is whether or not the patient survived for 30 days, not the actual survival time.
Fig. 3 Outcome (death within 30 days of surgery) versus Parsonnet score. Some “jitter” has been
introduced to avoid many overlapping points
Fitting a logistic regression model to the first two years of data gives, approximately,
$$\mathrm{logit}(p_{0i}) = -3.67 + 0.077\,x_i \qquad (13)$$
or, equivalently,
$$p_{0i} = \frac{\exp(-3.67 + 0.077\,x_i)}{1+\exp(-3.67 + 0.077\,x_i)}, \qquad (14)$$
where $x_i$ is the Parsonnet score for patient $i$. For $n$ patients with outcomes $y_1, \ldots, y_n$ and risks $p_1, \ldots, p_n$,
$$E\left(\sum_{i=1}^{n} y_i\right) = \sum_{i=1}^{n} p_i \qquad \text{and} \qquad V\left(\sum_{i=1}^{n} y_i\right) = \sum_{i=1}^{n} V(y_i) = \sum_{i=1}^{n} p_i(1-p_i).$$
Because the varying risk factors cause the $p_i$ to vary, the sum $\sum_{i=1}^{n} y_i$ does not have a binomial distribution. For the usual p-chart in industrial monitoring, the Central Limit Theorem is applied to argue that the proportion of nonconforming units in each subgroup is approximately normally distributed, and therefore that three standard deviation limits above and below the mean should include nearly all of the observed proportions. Similar reasoning applies here.
For the risk-adjusted p-chart, the plotted statistic is the proportion of adverse events $\hat{\theta} = \sum_{i=1}^{n} y_i / n$.³ Since the mean and standard deviation of the plotted
Since the mean and standard deviation of the plotted
3
We use O rather than the pO used in industrial quality monitoring because we reserve pi to be the
(estimated) probability of adverse outcome or patient i.
statistic $\hat{\theta}$ are
$$E(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n} p_i$$
and
$$\sqrt{V(\hat{\theta})} = \sqrt{\frac{1}{n^2}\sum_{i=1}^{n} p_i(1-p_i)} = \frac{1}{n}\sqrt{\sum_{i=1}^{n} p_i(1-p_i)},$$
the control limits are placed three standard deviations above and below the mean:
$$\mathrm{UCL} = \frac{1}{n}\sum_{i=1}^{n} p_i + \frac{3}{n}\sqrt{\sum_{i=1}^{n} p_i(1-p_i)}, \qquad \mathrm{LCL} = \max\!\left(0,\ \frac{1}{n}\sum_{i=1}^{n} p_i - \frac{3}{n}\sqrt{\sum_{i=1}^{n} p_i(1-p_i)}\right). \qquad (15)$$
Here, the sums are over all of the outcomes in the current subgroup. Often, the LCL
is negative, making it impossible for the risk-adjusted p chart to detect a decrease in
the probability of an adverse outcome (i.e., an improvement in outcomes). Note that
the control limits will vary from one subgroup to the next because of the varying
risk factors.
The choice of the subgroup size n involves some trade-offs. If n is large, then
there will be a lot of information in each subgroup, making it likely that a shift will
be detected on that subgroup. Large subgroups, however, mean that data points for
the chart will be obtained infrequently (since n patients must be accumulated before
a subgroup is completed) making quick detection more difficult. On the other hand,
small subgroups mean that the plotted statistics will be obtained more frequently
but each will contain less information. See Jordan and Benneyan (2012) for a more
detailed description of the issues involved in selecting the subgroup size.
To illustrate the risk-adjusted CUSUM chart, we return to the Steiner et al. (2000)
cardiac surgery data and the logistic regression model fit in Eqs. (13) and (14). Now,
consider, for illustration, the first patient, who had a Parsonnet score of 19. Using
the logistic model from the first two years’ worth of data, we would estimate this
person's probability of death to be
$$p_{01} = \frac{\exp(-3.67 + 0.077 \times 19)}{1+\exp(-3.67 + 0.077 \times 19)} = 0.09912,$$
assuming, of course, that the process is operating at the standard defined by the logistic regression model in Eq. (13). This patient did survive past 30 days, so $y_1 = 0$.
If we were to use a subgroup size of $n = 20$, then patients 1–20 would be in the first subgroup, patients 21–40 would be in the second subgroup, etc. The number of adverse outcomes in the first 20 patients was one, since only patient 12 died within the 30-day window. The proportion of deaths was then $\hat{\theta}_1 = 1/20 = 0.05$. The expected proportion of deaths is equal to the average of the risks
$$\frac{1}{20}\sum_{i=1}^{20} p_i = \frac{1.706}{20} = 0.085.$$
The UCL for this subgroup is therefore
$$\mathrm{UCL} = 0.085 + \frac{3}{20}\sqrt{1.396} = 0.262,$$
where $1.396 = \sum_{i=1}^{20} p_i(1-p_i)$, and the LCL is zero since the formula for the LCL in Eq. (15) yields a negative number.
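The subgroup-1 arithmetic can be checked with a few lines of code. The sketch below implements the risk-adjusted limits of Eq. (15) for a generic vector of patient-specific risks and then reproduces the reported UCL from the summary quantities $\sum p_i = 1.706$ and $\sum p_i(1-p_i) = 1.396$; the equal-risk vector used to call the helper is hypothetical, since the individual Parsonnet scores are not listed here.

# A minimal sketch of the risk-adjusted p-chart limits in Eq. (15).
import numpy as np

def risk_adjusted_p_limits(p):
    """p: array of model-based adverse-event probabilities for one subgroup."""
    n = len(p)
    center = p.mean()
    half_width = (3.0 / n) * np.sqrt(np.sum(p * (1 - p)))
    return max(0.0, center - half_width), center, center + half_width

# Hypothetical equal-risk subgroup, just to show how the helper is called:
print(risk_adjusted_p_limits(np.full(20, 0.0853)))

# Check against the summary numbers reported for subgroup 1 (n = 20):
n, sum_p, sum_pq = 20, 1.706, 1.396
ucl = sum_p / n + (3.0 / n) * np.sqrt(sum_pq)
print(f"UCL for subgroup 1: {ucl:.3f}")    # approximately 0.262, as in the text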
Figure 4 shows the p-chart for the first 40 subgroups. The 27th observation was
slightly above the UCL, indicating that the process is not operating according to the
standard set by the logistic regression model. This is indicated on the chart by the
solid dot.
Figure 5 shows the resulting p-charts for the entire data set for subgroups of size $n = 20$, 40, 80, and 160. Note that the LCLs are zero in most cases for $n \leq 80$. Only for larger subgroup sizes is the risk-adjusted p-chart able to detect an improvement in surgical outcomes. Among the four p-charts in Fig. 5 there are five signals (three on the $n = 20$ chart and two on the $n = 80$ chart), which in this case may be false alarms. In all five cases, the plotted point is barely above the UCL, and the signals do not match up in chronological time.
Fig. 4 p-Chart of the proportion of deaths for the first 40 subgroups of size 20
Fig. 5 p-Charts for cardiac survival data. Response is whether the patient survived for 30 days.
The four plots show the p-chart for subgroup sizes of 20, 40, 80, and 160
We would expect more false alarms for smaller sample sizes, such as $n = 20$, simply because there are more plotted points and more opportunities for a false alarm.
Let
$$R = \frac{p_1/(1-p_1)}{p_0/(1-p_0)} \qquad (16)$$
be the odds ratio; then our null and alternative hypotheses for testing whether the odds ratio has changed by a factor of $R$ can be written as
$$H_0: p = p_0, \qquad H_1: p = \frac{Rp_0}{1+(R-1)p_0}.$$
The CUSUM weight for observation $i$ is the log-likelihood ratio
$$W_i = \log\frac{p_1^{y_i}(1-p_1)^{1-y_i}}{p_0^{y_i}(1-p_0)^{1-y_i}}
     = \log\frac{R^{y_i}}{1+(R-1)p_0}
     = \begin{cases} \log\dfrac{R}{1+(R-1)p_0}, & \text{if } y_i = 1, \\[2mm] \log\dfrac{1}{1+(R-1)p_0}, & \text{if } y_i = 0. \end{cases} \qquad (17)$$
The upper and lower CUSUMs are then
$$X_0 = 0, \qquad X_i = \max\left(0,\ X_{i-1} + W_i\right) \qquad (18)$$
and
$$X_0 = 0, \qquad X_i = \min\left(0,\ X_{i-1} - W_i\right). \qquad (19)$$
In Eq. (18), $p_1 > p_0$, whereas in Eq. (19), $p_1 < p_0$. For the CUSUM with limit $h = 5$ for the upper chart and $-5$ for the lower chart, the in-control ARL is approximately 6,939 (estimated by simulation).
Consider now the risk-adjusted CUSUM chart. We assume that a logistic regression model has already been developed that relates the predictor variable(s) to the response and that the model parameters have been estimated. The probability $p_i$ of an adverse outcome, computed from the logistic model, is incorporated into the likelihood ratio. Note that the parameters $\beta_0, \beta_1, \ldots, \beta_p$ in the logistic model are assumed to be known, much as the mean and variance in the $\bar{X}$- and R-charts
are assumed known. Note also that $Y_i \sim \mathrm{Bin}(1, p_{0i})$, so that $E(Y_i) = p_{0i}$. The resulting weights $W_i$ are as in Eq. (17), except now the weight for patient $i$ depends on the probability of the adverse event for that patient. Often, the change we would like to detect is expressed in terms of the odds ratio $R$, rather than in terms of the ratio of probabilities $p_{1i}/p_{0i}$ or the difference of probabilities $p_{1i} - p_{0i}$, because
$$R = \frac{p_{1i}/(1-p_{1i})}{p_{0i}/(1-p_{0i})}
   = \exp\left[\log\frac{p_{1i}}{1-p_{1i}} - \log\frac{p_{0i}}{1-p_{0i}}\right]
   = \exp\left[(\beta_{10}-\beta_{00}) + (\beta_{11}-\beta_{01})x_{1i} + \cdots + (\beta_{1p}-\beta_{0p})x_{pi}\right].$$
Here $\boldsymbol{\beta}_0$ is the parameter vector when the process is in control (operating at the standard) and $\boldsymbol{\beta}_1$ is the parameter vector when the process is out of control (not operating at the standard). Thus, $R$ is independent of the levels of the predictor variables $\mathbf{x}_i$ if and only if $\beta_{11} = \beta_{01}, \ldots, \beta_{1p} = \beta_{0p}$; in other words, the only change in $\boldsymbol{\beta}$ is in the constant term $\beta_0$, which shifts from $\beta_{00}$ to $\beta_{10}$.
The risk-adjusted CUSUM is then defined the same as in Eqs. (18) and (19), although the weights will differ; in this case, the weights will depend on the patient's condition through the predictor variables.
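The following sketch pieces these elements together: patient-specific probabilities from the logistic model reported above ($\mathrm{logit}\,p = -3.67 + 0.077x$, with $x$ the Parsonnet score), weights as in Eq. (17) with $p_0$ replaced by $p_{0i}$, and the upper recursion of Eq. (18). The odds ratio $R = 2$ matches the worked example that follows, while the threshold $h$ and the example scores and outcomes are illustrative assumptions.

# A minimal sketch of the risk-adjusted CUSUM for 30-day mortality.
import numpy as np

def p0i(parsonnet):
    eta = -3.67 + 0.077 * parsonnet          # fitted logistic model from the text
    return np.exp(eta) / (1 + np.exp(eta))

def weight(y, p0, R=2.0):
    # Eq. (17) with the patient-specific probability p0i in place of p0
    return np.log(R / (1 + (R - 1) * p0)) if y == 1 else np.log(1 / (1 + (R - 1) * p0))

def risk_adjusted_cusum(scores, outcomes, R=2.0, h=5.0):
    X = 0.0
    for i, (s, y) in enumerate(zip(scores, outcomes)):
        X = max(0.0, X + weight(y, p0i(s), R))    # Eq. (18)
        if X > h:
            return i
    return None

scores = [19, 0, 7, 42, 3]          # hypothetical Parsonnet scores
outcomes = [0, 0, 0, 1, 0]          # 1 = death within 30 days
print("weight for patient 1:", round(weight(outcomes[0], p0i(scores[0])), 4))   # about -0.0945
print("signal index:", risk_adjusted_cusum(scores, outcomes))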
Consider, for illustration, the cardiac surgery data from Sect. 3.1. If we set up the risk-adjusted CUSUM chart to detect a doubling of the odds ratio (that is, $R = 2$), then since the first patient survived for 30 days, the weight for patient 1 is
$$W_1 = \log\frac{1}{1+(2-1)(0.09912)} = -0.0945.$$
Note that had this first patient died within the 30-day window, the weight would have been
$$W_1 = \log\frac{2}{1+(2-1)(0.09912)} = 0.599.$$
Thus, since the first patient did survive past 30 days, the CUSUM at time $i = 1$ is $\max(0,\ 0 - 0.0945) = 0$. The second patient had a Parsonnet score of $x_2 = 0$, so the probability of death was $p_{02} = \exp(-3.67)/(1+\exp(-3.67)) = 0.02484$. This patient, who also survived for 30 days, produces a weight of
$$W_2 = \log\frac{1}{1+(2-1)(0.02484)} = -0.0245.$$
This weight is less in magnitude than the weight for the first patient, who had a higher Parsonnet score. Had the second patient died within the 30-day period, the weight would have been
$$W_2 = \log\frac{2}{1+(2-1)(0.02484)} = 0.669.$$
Thus, the death of a low risk patient contributes more to the upper CUSUM than the death of a higher risk patient. Analogously, the survival of a higher risk patient contributes more in magnitude to the lower CUSUM than the survival of a low risk patient.
The lower CUSUM might be set up to detect a halving (i.e., $R = 0.5$) of the odds ratio. In this case, the weight for the first patient, who survived, would be
$$W_1 = \log\frac{1}{1+(0.5-1)(0.09912)} = 0.0508.$$
Fig. 6 Risk-adjusted CUSUM of cardiac survival data. Response is whether the patient survived
for 30 days
If $p_i$ is the probability of the adverse outcome, obtained from some model relating the risk factors $\mathbf{x}_i$ for patient $i$, then the expected value of $y_i$ is
$$E(y_i \mid \mathbf{x}_i) = p_i.$$
Thus, $y_i$ is the observed outcome and $p_i$ is the expected outcome at time $i$. A plot of the accumulated values of
$$\mathrm{Observed}_i - \mathrm{Expected}_i = O_i - E_i = y_i - p_i$$
on the y-axis and time $i$ on the x-axis is called a variable life adjusted display (VLAD), although it goes by other names, such as the cumulative risk-adjusted mortality (CRAM) chart. The cumulative sum
$$S_i = \sum_{j=1}^{i}(y_j - p_j)$$
represents the difference between the cumulative number of deaths and the expected value of this quantity. For this reason, the cumulative sum can be interpreted as the
number of lives lost above that which was expected. The vertical axis is often labeled as "Lives Lost" or "Excess Mortality." If the cumulative sums are instead formed as
$$S_i = \sum_{j=1}^{i}(p_j - y_j),$$
then the plot can be interpreted as the cumulative number of lives saved relative to what was expected. Writing $V_i = \sum_{j=1}^{i}(O_j - E_j)$, we have, when the process is operating at the standard,
$$E(V_i) = E\left(\sum_{j=1}^{i}(O_j - E_j)\right) = \sum_{j=1}^{i}(p_j - p_j) = 0$$
and
$$\mathrm{Var}(V_i) = \mathrm{Var}\left(\sum_{j=1}^{i}(O_j - E_j)\right) = \sum_{j=1}^{i}\mathrm{Var}(O_j) = \sum_{j=1}^{i}p_j(1-p_j),$$
so control limits for $V_i$ could, in principle, be placed at $\pm k\sqrt{\mathrm{Var}(V_i)}$ for some constant $k$.
Because these control limits create a convex in-control region opening to the right,
they are often called “rocket tails.” The reason these limits are not recommended is
that if the change occurs when the plotted statistic is currently near the middle or
opposite end of the shifted direction, then the chart will take a long time to signal.
This phenomenon is called inertia in the quality literature.
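A VLAD trajectory is simply a cumulative sum of observed minus expected outcomes, as sketched below; the simulated risks and outcomes are illustrative assumptions.

# A minimal sketch of the VLAD statistic S_i = sum_j (y_j - p_j).
import numpy as np

rng = np.random.default_rng(7)
p = rng.uniform(0.02, 0.20, size=500)        # model-based risks for 500 patients
y = rng.binomial(1, p)                        # observed 30-day outcomes

vlad = np.cumsum(y - p)                       # "excess mortality" trajectory
print("final excess mortality:", round(vlad[-1], 2))
print("sd of V_n under the standard:", round(np.sqrt(np.sum(p * (1 - p))), 2))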
The VLAD chart for the cardiac surgery data is shown in Fig. 7. There seems
to be a decrease in excess mortality starting just before patient 5,000, though it
Fig. 7 Variable life adjusted display of cardiac survival data. Response is whether the patient
survived for 30 days
is important not to overinterpret the VLAD plot. Figure 8 shows the same VLAD
chart as in Fig. 7, with an additional four plots simulated with nothing but noise.
Even though there are some ups and downs in Fig. 8 that seem to be as distinct as those
in Fig. 7, in the last four plots there was no change in either the surgical performance
or the distribution of risk factors.
Woodall et al. (2015) suggest that the VLAD chart is easily understood and can
serve as a visual aid to get an overall sense of the data. However, since there is no
good way for the VLAD to raise an out-of-control signal, the VLAD should be used
together with some other method, such as the risk-adjusted CUSUM, that can raise
a signal.
The weights from the risk-adjusted CUSUM in Eq. (17) can be written as
$$W_i = \log\frac{L_{1i}}{L_{0i}} = y_i\,\log\frac{p_1/(1-p_1)}{p_0/(1-p_0)} - \log\frac{1-p_0}{1-p_1},$$
which has the observed-minus-expected form
$$\log\frac{L_{1i}}{L_{0i}} = A\,O_i - B\,E_i, \qquad (20)$$
where
$$A = \log\frac{p_1/(1-p_1)}{p_0/(1-p_0)}$$
Fig. 8 Variable life adjusted display of cardiac survival data. Response is whether the patient survived for 30 days. The top figure is the cardiac survival data. The other four are simulated from a stable process
and
$$B = \frac{\log\dfrac{1-p_0}{1-p_1}}{p_0}.$$
The weights from the risk-adjusted CUSUM are given in Eq. (17). Note that there is no value of $(p_0, p_1)$ which makes the coefficients $A$ and $B$ in Eq. (20) both equal to 1, which are the coefficients of $O_i$ and $E_i$ in the VLAD. This implies that the VLAD is never equivalent to the risk-adjusted CUSUM procedure.
The EWMA chart described in Sect. 2 can be applied to risk-adjusted data. One way to do this is to compute the observed statistic $O_i = y_i$ minus the expected statistic $E_i = p_i$, as in the VLAD chart, and to maintain an EWMA statistic on these values. This is the approach described by Steiner (2014). More general approaches are described in Grigg and Spiegelhalter (2007) and Steiner and Jones (2010). The EWMA statistic is computed by
$$z_0 = 0, \qquad z_i = \lambda(y_i - p_i) + (1-\lambda)z_{i-1}.$$
Fig. 9 The risk-adjusted EWMA chart applied to the observed minus expected statistics $y_i - p_i$
Fig. 10 Histogram of the observed minus expected statistics $y_i - p_i$, showing a highly nonnormal distribution
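The risk-adjusted EWMA recursion above is straightforward to implement; in the sketch below the smoothing constant and the simulated risks and outcomes are illustrative assumptions, and no control limits are computed.

# A minimal sketch of the risk-adjusted EWMA applied to y_i - p_i.
import numpy as np

def risk_adjusted_ewma(y, p, lam=0.05):
    z = 0.0
    path = []
    for yi, pi in zip(y, p):
        z = lam * (yi - pi) + (1 - lam) * z
        path.append(z)
    return np.array(path)

rng = np.random.default_rng(8)
p = rng.uniform(0.02, 0.20, size=300)     # model-based risks
y = rng.binomial(1, p)                     # observed outcomes
z = risk_adjusted_ewma(y, p)
print("last few EWMA values:", np.round(z[-5:], 4))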
Chen (1978) proposed the “Sets” method for surveilling the occurrence of congeni-
tal malformations. The idea behind the sets method is to monitor the times between
successive events, that is the number of elements in the “set” of patients between
events. Normally, the set includes the first patient after the previous event and the
next patient who had the event. For the non-risk-adjusted chart, the probability of
the event is constant from patient to patient, so the random variable which is equal to
the number of patients G in the set has a geometric distribution with probability p0 ,
where p0 is the probability of the adverse event when the process is in control. The
rule for inferring “out of control” is that n successive G’s are less than or equal to the
value T. The values chosen for n and T then determine the chart. The sets method
was studied further by Gallus et al. (1986). Later Chen (1987) compared the sets
method with the risk-adjusted CUSUM chart and found that “The sets technique is
. . . more efficient than monthly cusum when the number of cases expected in a year
is no greater than five, but less efficient otherwise.”
Figure 11 shows the sets plot for the first 700 or so patients in the cardiac data set.
The plotted statistic increases by one for each additional patient who survives. After
each death, the y-coordinate is reset to zero. A plot such as this is called a “grass”
Fig. 11 Grass plot for the unadjusted sets method for the cardiac data
plot, since the plotted lines look like blades of grass. It is helpful to put a horizontal
line on the graph at the value of T and to make notes of those cases where the set
has size less than or equal to $T$. If we choose $T = 20$ and $n = 8$, then we would say "out of control" when we get eight straight sets of 20 or fewer patients. In Fig. 11 the out-of-control flag would be raised at time 286, when eight consecutive sets had 20 or fewer patients.
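The unadjusted sets rule can be implemented in a few lines: accumulate the set sizes between adverse events and flag when $n$ consecutive sets have size $T$ or less. In the sketch below, $T = 20$ and $n = 8$ follow the text, while the simulated outcome sequence is illustrative.

# A minimal sketch of the unadjusted sets method (T = 20, n = 8).
import numpy as np

def sets_flag(outcomes, T=20, n=8):
    sets, count, run = [], 0, 0
    for i, y in enumerate(outcomes):
        count += 1
        if y == 1:                          # an adverse event closes the current set
            sets.append(count)
            run = run + 1 if count <= T else 0
            count = 0
            if run >= n:
                return i, sets              # patient index at which the flag is raised
    return None, sets

rng = np.random.default_rng(9)
outcomes = rng.binomial(1, 0.08, size=600)  # hypothetical 30-day outcomes
flag_at, set_sizes = sets_flag(outcomes)
print("flag raised at patient:", flag_at, "after", len(set_sizes), "sets")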
The risk-adjusted sets method of Grigg and Farewell (2004) involves computing
the size of the set beginning with the first patient after the previous event and ending
with the patient having the current event. Now, the “size” of the set is defined to be
the scaled total risk of this patient pool. More precisely, we define
$$w_i = \frac{p_{0i}}{\bar{p}_0},$$
where $p_{0i}$ is the probability of the adverse event for patient $i$ under the assumption that the process is operating at the standard level. This method of scaling is based on the intuitive assumption that a patient with a high probability of the adverse event who does not experience the event should add more to the risk pool for that set than a patient with a lower probability. This scaling also reduces to the unadjusted sets method because if all patients have the same probability, say $p_0$, then the expected value of the measure of the set would be
$$E\left(\frac{p_0}{\bar{p}}\,G\right) = \frac{p_0}{\bar{p}}\,E(G) = \frac{p_0}{\bar{p}}\cdot\frac{1}{p_0} = \frac{1}{\bar{p}}.$$
Fig. 12 Grass plot (cumulative risk since the previous failure) for the adjusted sets method for the cardiac data
For the risk-adjusted sets method, a grass plot is defined to be a plot of the
accumulated risk between events. The same out of control rule, n consecutive sets
whose total risk is T or less, is used here. Figure 12 shows the risk-adjusted grass
plot for the cardiac data.
Donabedian (2005, 1988) suggests three categories for assessing the quality of
health care: structural, process, and outcomes. The structural category refers to the
availability of equipment, nurse-to-bed ratios, etc. The process category involves
measurements on variables related to the delivery of health care; for example, lab
turnaround time, “door-to-balloon” time (for certain myocardial infarction patients),
and hand washing compliance. The outcomes category involves characteristics of
the patients after receiving treatment, and includes, for example, 30-day survival
after surgery, hospital readmission within 30 days, ventilator-associated pneumo-
nia, etc.
We will add to this list of categories one which we call personal, whereby the
health characteristics of a single patient are monitored in much the same way as
process or outcomes are monitored. For example, a patient may monitor his blood
pressure daily, where any departure from normal could be reported to the physician.
Fig. 13 Individuals charts for systolic blood pressure (SBP) and diastolic blood pressure (DBP)
Fig. 14 Hotelling $T^2$ chart for the variables systolic blood pressure (SBP) and diastolic blood pressure (DBP)
Figure 13 shows individuals charts of systolic and diastolic blood pressures for
a hypothetical patient. Values that plot outside the three standard deviation limits
above and below the mean would suggest a potential health problem and, in Fig. 13,
we see that the last data point on the systolic blood pressure exceeds the UCL,
causing a signal that the process has changed.
However, since blood pressure is inherently a two-dimensional process (systolic
and diastolic), a multivariate control chart, such as those described in Sect. 2.4,
would be appropriate. Figure 14 shows the bivariate plot of systolic and diastolic
blood pressure measurements with older observations in a lighter shade of gray.
The other part of Fig. 14 shows the Hotelling $T^2$ chart of
$$T^2 = \begin{pmatrix} x_{\mathrm{SBP}} - \bar{x}_{\mathrm{SBP}} \\ x_{\mathrm{DBP}} - \bar{x}_{\mathrm{DBP}} \end{pmatrix}' \mathbf{S}^{-1} \begin{pmatrix} x_{\mathrm{SBP}} - \bar{x}_{\mathrm{SBP}} \\ x_{\mathrm{DBP}} - \bar{x}_{\mathrm{DBP}} \end{pmatrix},$$
where $x_{\mathrm{SBP}}$ and $x_{\mathrm{DBP}}$ are the observed systolic and diastolic blood pressures at each time period, while $\bar{x}_{\mathrm{SBP}}$ and $\bar{x}_{\mathrm{DBP}}$ are the mean systolic and diastolic blood pressures, and $\mathbf{S}$ is the sample covariance matrix taken when the patient is in his or her normal condition.
In Fig. 14, the Hotelling $T^2$ chart raises a signal at the very last data point. The ellipse in the first part of Fig. 14 is the "control ellipse" in the sense that a point inside the ellipse will lead to a $T^2$ value below the UCL, and a point outside the ellipse will lead to a $T^2$ value above the UCL.
The ease with which personal data can be collected may lead to tremendous
opportunities in the monitoring of data on individuals. This can lead to the
availability of massive data sets, often called “big data,” which have the potential
to monitor health to a greater extent. To illustrate one possible use of monitoring
personal data, consider the touch sensitive floor called a “magic carpet” that can be
installed in the home or in independent or assisted living communities. Baker (2008)
described this concept in his book The Numerati. This special flooring can record
the exact time, location, angle and pressure of every step the person takes. Numerous
characteristics about a person’s gait can be gleaned from such data. Obviously, the
absence of any such data on a given day is cause for alarm (although it could also just
indicate the person was on vacation). Data showing slower steps, or uneven steps,
could indicate a problem as well. Intel has developed SHIMMER (Sensing Health
with Intelligence, Modularity, Mobility and Experimental Reusability) technology
(see Boran 2007). These involve wearable Bluetooth devices that can monitor many
health characteristics. Large data sets that would be obtained using such devices
present challenges for data analysis and opportunities for improving health care
while reducing costs.
4 Disease Surveillance
Many of the methods used in disease surveillance have been drawn from or are
related to those of industrial process monitoring. However, there are important
differences between the typical industrial process monitoring application and
disease surveillance, which means that the standard industrial methods described
in Sect. 2 usually must be modified before being applied to disease surveillance.
In this section, to distinguish between the two different uses, methods designed
for or discussed within an industrial context are referred to as control charts while
methods designed for or discussed in a disease surveillance context are referred to
as detection methods.
A key difference in the two applications is the assumption that the observations
are independent and identically distributed (or iid). In industrial process monitoring
applications, with appropriate implementation of the procedures, this assumption
can often be met with the raw data. But this is often not the case for disease
surveillance data. There are two main reasons for the difference.
• First, while industrial process monitoring and disease surveillance are both
time series data, in industrial monitoring the process is explicitly controlled
and thus the raw data is the noise resulting from the process when it is in-
control. As such, the data can reasonably be assumed to be identically distributed.
Industrial process monitoring:
1. The in-control distribution of the data is (or can reasonably be assumed to be) stationary
2. Observations can be drawn from the process so they are independent (or nearly so)
3. The asymptotic distributions of the statistics being monitored are known and thus can be used to design control charts
4. Monitoring the process mean and standard deviation is usually sufficient
5. The out-of-control condition remains until it is detected and corrective action is taken
6. Temporal detection is the critical problem

Disease surveillance:
1. There is little to no control over disease incidence and the disease incidence distribution
2. Autocorrelation and the potential need to monitor all the data can result in dependence
3. Individual observations may be monitored; if so, asymptotic sampling distributions are not relevant
4. Little is known about which statistics are useful; often looking for anything unusual
5. Outbreaks are transient, with disease incidence returning to its original state when the outbreak has run its course
6. Detecting both temporal and spatial anomalies is critical
Disease surveillance data often have systematic effects (i.e., explainable trends and
patterns). These can include day-of-the-week effects, where patient health-seeking
behavior systematically varies according to the day of the week. It may also include
seasonal effects where, for example, influenza-like illness is generally higher in the
winter months of the year compared to the summer months.
These trends and patterns can be used to build models and the models can then
be used to better understand and characterize historical trends, to assess how the
current state compares to historical trends, and to forecast what is likely to occur
in the near future. For example, one might use a model $f$ to forecast the disease incidence at time $i$, $\hat{x}_i$, using past disease incidence ($x_{i-1}, x_{i-2}, x_{i-3}, \ldots$) as well as other covariates ($y_1, y_2, y_3, \ldots$):
$$\hat{x}_i = f(x_{i-1}, x_{i-2}, x_{i-3}, \ldots;\ y_1, y_2, y_3, \ldots).$$
For example, many diseases have a clear seasonal component to their incidence
rate. Some diseases such as influenza or pertussis peak in the winter, whereas others
such as E. coli peak in the summer. Serfling (1963) first considered the use of
trigonometric models that account for the seasonality in disease rates. He applied
models of the form
$$\mu_i = a_0 + \sum_{k=1}^{p}\left[b_k\sin\!\left(\frac{2\pi k i}{52}\right) + c_k\cos\!\left(\frac{2\pi k i}{52}\right)\right] \qquad (21)$$
for weekly counts of influenza. For his purpose, the counts were large enough
that the normal distribution was reasonable. For diseases that are much rarer than
influenza, such an assumption is unreasonable. Rigdon et al. (2014) considered the
reportable diseases that are monitored by the state of Missouri. For most diseases a
first-order model, that is, $p = 1$ in Eq. (21), is sufficient, but in some cases a second-order model ($p = 2$) is needed. They assumed Poisson distributed counts for the
reportable diseases and constructed control limits that were based on the current
estimated mean; thus, the control limits varied in a seasonal fashion along with the
estimated disease incidence.
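As an illustration of this kind of seasonal modeling, the sketch below fits a first-order ($p = 1$) harmonic model in the spirit of Eq. (21), but as a Poisson regression with a log link so that rare-disease counts are handled on the count scale, and then forms seasonally varying limits of the form $\hat{\mu}_i + 3\sqrt{\hat{\mu}_i}$. The weekly counts are simulated and the use of statsmodels is an implementation choice for this sketch, not the authors' code.

# A minimal sketch of a seasonal (harmonic) Poisson model for weekly counts.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
weeks = np.arange(1, 52 * 5 + 1)                              # five years of weekly data
season = 0.8 * np.sin(2 * np.pi * weeks / 52) + 0.4 * np.cos(2 * np.pi * weeks / 52)
y = rng.poisson(np.exp(1.0 + season))                         # simulated weekly counts

X = sm.add_constant(np.column_stack([np.sin(2 * np.pi * weeks / 52),
                                     np.cos(2 * np.pi * weeks / 52)]))
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = fit.predict(X)                                       # seasonally varying mean
ucl = mu_hat + 3 * np.sqrt(mu_hat)                            # Poisson-based upper limit
print("weeks above the seasonal UCL:", int(np.sum(y > ucl)))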
One advantage of modeling the counts directly is that the resulting chart is easily
understood. Also, one can see from the plot the current situation along with the past
history of the disease. Figure 15 shows the incidence of pertussis in Missouri from
2002 to 2011. From this plot, the years when there was a pertussis outbreak are
obvious.
Perhaps most important for this discussion, many of the detection methods
discussed in Sect. 2 are most effective when the systematic components of disease
surveillance data are removed. This is best accomplished by first modeling the data,
where the model is used to estimate the systematic effects, and then using the
detection methods on the model residuals.

[Fig. 15 Weekly counts of pertussis in Missouri, 2002–2011, shown in yearly panels; the vertical axes give weekly counts (0–60) and the horizontal axes the week of the year (0–50)]

The residuals $r_i$ are what remain after the modeled values are subtracted from the raw data, $r_i = x_i - \hat{x}_i$, and thus what is
monitored are changes from forecast values. Correctly done, the residuals may then
be independent, or nearly so, and then the industrial process monitoring methods of
Sect. 2 more appropriately apply. There are also cases where the disease counts are
modeled directly. See Fricker (2013, Chap. 5) for a more in-depth discussion about
methods for modeling disease incidence data.
This is just a specific form of the X-chart with two standard deviation limits.
Interestingly, rather than plotting the data as a time series on a control chart, the
CDC uses a bar plot of the natural log-transformed (4-week) counts. For example,
Fig. 16 is Figure I from “Notifiable Diseases/Deaths in Selected Cities Weekly
Information” for week 47 of 2009 (CDC 2009), where for this week the mumps
count exceeded its historical limits as shown by the crosshatched top of the bar.
4 See www.cdc.gov/ncphi/disss/nndss/phs/infdis.htm for a complete list of reportable diseases.
[Figure I: bar plot, on a log-scale ratio axis (0.25–8), of current 4-week counts relative to historical limits (decrease/increase) for Giardiasis (815 current cases), Hepatitis A, acute (63), Hepatitis B, acute (122), Hepatitis C, acute (35), Legionellosis (146), Measles (1), Meningococcal disease (44), Mumps (155), and Pertussis (334); bars beyond historical limits are crosshatched]
Fig. 16 Figure I from “Notifiable Diseases/Deaths in Selected Cities Weekly Information” for the
week 47 of 2009 (CDC 2009). For this week, the mumps count exceeded its historical limits
In the preceding example, a formal model of disease counts was not required as
comparisons to historical data were limited to those time periods in previous years
expected to be similar to the current period. This too is a model, but an informal
one which assumes that counts at the same time of the year are iid. For other types
of disease surveillance, particularly in biosurveillance, a more formal model may
have to be applied. Once done, many of the detection methods of Sect. 2, applied to
forecast residuals, are useful for monitoring disease incidence.
Now, for some types of surveillance, only monitoring for increases in disease
incidence is of interest. A benefit of so doing is greater detection power for the
same rate of false positive signals. In such cases, it is only necessary to use an UCL
for Shewhart charts. Similarly, for the CUSUM, one only need calculate CC , using
Eq. (5), and not C using Eq. (6), and signal at time i when CiC > h. For the EWMA,
in addition to only using the UCL, the EWMA can be “reflected” (in the spirit of the
CUSUM) to improve its performance in detecting increases (Crowder and Hamilton
1992). To do so, Eq. (7) is modified to

$$z_i = \max\left[\mu_0,\; \lambda x_i + (1-\lambda)z_{i-1}\right].$$

With this modification, the EWMA statistic must always be greater than or equal to $\mu_0$, meaning that it cannot drift too far downwards, and thus it will more readily signal when a positive shift occurs. The method starts at $z_0 = \mu_0$ and a signal is generated as before when $z_i > h$, where the UCL is

$$\mathrm{UCL} = h = \mu_0 + L\hat{\sigma}\sqrt{\frac{\lambda}{2-\lambda}\left[1 - (1-\lambda)^{2t}\right]}.$$
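A minimal sketch of this reflected EWMA: the recursion and the time-varying UCL are implemented below with the hypothetical values mu0 = 0, sigma = 1, lambda = 0.3, and L = 2.815, applied to simulated data with an upward shift; none of these numbers come from the chapter's data.

# Sketch of the reflected one-sided EWMA: z_i = max(mu0, lambda * x_i + (1 - lambda) * z_{i-1}),
# with the time-varying UCL given above; mu0, sigma, lambda, and L are illustrative values.
reflected_ewma <- function(x, mu0 = 0, sigma = 1, lambda = 0.3, L = 2.815) {
  n <- length(x)
  z <- numeric(n); ucl <- numeric(n)
  z.prev <- mu0                                    # the statistic starts at mu0
  for (i in 1:n) {
    z[i]   <- max(mu0, lambda * x[i] + (1 - lambda) * z.prev)
    ucl[i] <- mu0 + L * sigma * sqrt(lambda / (2 - lambda) * (1 - (1 - lambda)^(2 * i)))
    z.prev <- z[i]
  }
  data.frame(z = z, ucl = ucl, signal = z > ucl)
}

set.seed(2)
x   <- c(rnorm(100), rnorm(50, mean = 1))          # upward shift after observation 100
res <- reflected_ewma(x)
which(res$signal)[1]                               # first signaling time (NA if none)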
• Conditional expected delay (CED) is the mean number of time periods it takes
for the method to first signal, given that an outbreak is occurring and that the
method signals during the outbreak. Thus, the CED is the expected number of
time periods from the start of an outbreak until the first true signal during that
outbreak.
• Probability of successful detection (PSD) is the probability the method signals
during an outbreak, where the probability of detection is both a function of the
EED method and the type of outbreak.
The metrics are mathematically defined as follows. Let $S_t$ denote a generic detection method statistic at time $t$, where $S_0$ is the value of the statistic when the detection method is first started. Let $h$ denote the method's threshold, where if $S_t \geq h$ the method signals at time $t$. Also, let $s$ denote the first day of a disease outbreak, where the notation $s = \infty$ means that an outbreak never occurs, and let $e$ denote the last day of an outbreak, where if $s = \infty$ then by definition $e = \infty$. Finally, let $t^*$ denote the first time the method signals, $t^* = \min(t : S_t \geq h)$, and let $t^{**}$ denote the next time the method signals, $t^{**} = \min(t : t > t^* \text{ and } S_t \geq h)$.
Then
and
Mathematically, the ATFS metric as defined in Eq. (23) is the same as the in-
control ARL because after each signal the method’s statistic is re-set to its starting
value. However, some disease surveillance practitioners prefer not to re-set after
each signal, so in that case,
Note the difference between Eqs. (23) and (26): in the former, the statistic is re-set
to its starting value after each time the detection method signals, while in the latter
it is not. If the time series of statistics is autocorrelated, then the resulting ATFS
performance can be very different since, with autocorrelation, once a signal has
occurred in one time period more signals are likely to occur in subsequent periods.
Under the condition that the statistic is not re-set, Fraker et al. (2008) have
proposed the average time between signal events (ATBSE) metric, where a signal
event is defined as consecutive time periods during which an EED method signals.
Under these conditions, the ATBSE may be a more informative measure, since it
quantifies the length of time between groups of signals, but it may not provide
sufficient information about the number of false positive signals that will occur.
where d is the maximum delay required for a successful detection, and where
“successful detection” means early enough in the outbreak that an intervention is
medically effective.
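These metrics are easy to approximate by simulation. The sketch below estimates PSD and CED for a one-sided Shewhart-type rule applied to standardized data containing a single transient outbreak; the threshold, outbreak size, and outbreak timing are hypothetical choices used only for illustration.

# Sketch: Monte Carlo estimates of PSD and CED for the rule "signal at time t if S_t >= h",
# with S_t taken to be the standardized observation itself. The outbreak is a transient mean
# shift of size 2 on days s through e; all settings are illustrative.
set.seed(3)
h <- 3; s <- 101; e <- 115; nrep <- 2000
first.signal <- rep(NA_integer_, nrep)
for (r in 1:nrep) {
  x <- rnorm(150)                            # in-control standardized data
  x[s:e] <- x[s:e] + 2                       # transient outbreak effect
  hits <- which(x[s:e] >= h)                 # true signals during the outbreak
  if (length(hits) > 0) first.signal[r] <- hits[1]   # delay, counted from the outbreak start
}
psd <- mean(!is.na(first.signal))            # probability of signaling during the outbreak
ced <- mean(first.signal, na.rm = TRUE)      # mean delay, given a signal during the outbreak
c(PSD = psd, CED = ced)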
Fraker et al. (2008) note that “Substantially more metrics have been proposed in
the public health surveillance literature than in the industrial monitoring literature.”
These include run length and time to signal based metrics such as the ARL and
average time to signal (ATS). However, these metrics fail both to account for the
transient nature of disease outbreaks and that detection method statistics are often
not re-set after they signal. In comparison, when an industrial process goes out-
of-control it stays in that condition until the control chart signals and the cause
is identified and corrected. Thus, in industrial process monitoring, once a process
goes out of control any signal is a true signal, and so the probability of signaling
during an out-of-control condition is always 1. This is not the case in disease
surveillance where outbreaks are transient and after some period of time disappear.
In this situation, it is possible for a detection method to fail to signal during an
outbreak, after which a signal is a false signal.
To overcome the issues associated with applying the control chart ARL metrics
to disease surveillance, various modifications have been proposed. For example,
in addition to the ATBSE, Fraker et al. (2008) also define the average signal
event length (ASEL) as how long, on average, a detection method signals over
consecutive time periods. The ATBSE and ASEL metrics are designed for how
disease surveillance systems are often currently operated, where the detection
methods are not re-set after they signal. In this situation, disease surveillance system
operators allow the detection methods to continue to run after they signal and
interpret the resulting sequence of signals (or lack thereof) as additional information
about a potential outbreak. Under these conditions, the ATBSE may be preferred to
the ATFS metric. See Fricker (2013, Chap. 6) for a more in-depth discussion of
these and other metrics.
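Because a signal event is simply a run of consecutive signaling time periods, one simple operationalization of the ASEL and ATBSE can be computed directly from a 0/1 vector of signals, as in the sketch below; the example vector is made up, and for simplicity the leading and trailing non-signal runs are included in the between-event average.

# Sketch: ASEL and ATBSE from a 0/1 vector of signals (1 = the method signals in that period).
# A signal event is a run of consecutive 1s; the example vector is purely illustrative.
signal.metrics <- function(sig) {
  r <- rle(sig)                                    # run-length encoding of the signal series
  list(ASEL  = mean(r$lengths[r$values == 1]),     # average signal event length
       ATBSE = mean(r$lengths[r$values == 0]))     # average gap between signal events
}

sig <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0)
signal.metrics(sig)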
Fig. 17 Plot of 2-1/2 years of GI syndrome data from a hospital. The dotted lines indicate a period
of “normal” (i.e., non-seasonal outbreak) disease incidence from days 220 to 400
Fig. 18 Plot of the hospital GI syndrome data with signal times, where the Shewhart, CUSUM and
EWMA detection method signaling times are indicated with the vertical lines. Thresholds were set
with an ATFS of 365 days and the methods are not re-set after signals
Ignoring day and other systematic effects potentially present in the data, one
approach is to simply standardize future observations with
$$y_i = \frac{x_i - \hat{\mu}_0}{\hat{\sigma}_0} = \frac{x_i - 13.6}{4.5}, \qquad i = 401, 402, \ldots$$
and apply the Shewhart, CUSUM, and EWMA detection methods to the $y_i$ (without re-setting after signals). For an ATFS of 365 days, assuming the standardized values are approximately normally distributed (which a Q-Q plot shows is reasonable), set $h = 2.7775$ for the Shewhart, $h = 1.35$ and $k = 1.5$ for the CUSUM, and $L = 2.815$ and $\lambda = 0.3$ for the EWMA.
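The sketch below carries out exactly this step on a simulated count series standing in for the hospital GI data (only the values 13.6, 4.5, and the thresholds above are taken from the text; the counts themselves are made up), running the three one-sided methods without re-setting after signals.

# Sketch: standardize counts with mu0 = 13.6 and sigma0 = 4.5 and run one-sided Shewhart,
# CUSUM, and EWMA rules with the thresholds quoted above. The counts are simulated stand-ins.
set.seed(4)
x <- rpois(300, lambda = 13.6)
x[150:200] <- x[150:200] + 12                      # a hypothetical outbreak bump
y <- (x - 13.6) / 4.5                              # standardized values

shewhart <- y > 2.7775                             # Shewhart: h = 2.7775

cusum <- function(y, k = 1.5, h = 1.35) {          # one-sided CUSUM, C+ only
  cp <- numeric(length(y)); prev <- 0
  for (i in seq_along(y)) { prev <- max(0, prev + y[i] - k); cp[i] <- prev }
  cp > h
}

ewma <- function(y, lambda = 0.3, L = 2.815) {     # one-sided EWMA, UCL only
  z <- numeric(length(y)); prev <- 0
  for (i in seq_along(y)) { z[i] <- lambda * y[i] + (1 - lambda) * prev; prev <- z[i] }
  ucl <- L * sqrt(lambda / (2 - lambda) * (1 - (1 - lambda)^(2 * seq_along(y))))
  z > ucl
}

signals <- data.frame(shewhart = shewhart, cusum = cusum(y), ewma = ewma(y))
sapply(signals, function(s) which(s)[1])           # first signal time for each method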
The results are shown at the top of Fig. 18, where the signaling times for each
detection method are indicated by the short vertical lines. The figure shows that all
three detection methods clearly indicate the two large seasonal increases (after day
400) in GI. However, there are some differences in how they indicate the duration of
the outbreaks and, because the methods are not re-set, the CUSUM’s and EWMA’s
signals are more persistent.
In particular, note how the CUSUM continues to signal well after the two
seasonal increases have subsided. In addition, the EWMA and CUSUM detection
methods tend to signal earlier because of the gradual increase in GI counts at the start
of the outbreaks. Finally, note that there are a couple of smaller potential outbreaks
in between the two larger outbreaks that are more obvious given the signals. See
Chap. 9 of Fricker (2013) for additional detail and further development of this
example.
5 Summary
Many of the methods of industrial quality control can be used directly, or adapted,
in order to monitor health data. Monitoring of patient outcomes can be done so
long as the patients’ risk is taken into account. It is also possible to monitor process
variables, such as laboratory times, using methods such as control charts. Public
health can be studied by monitoring the rates of disease. Methods of plotting that
are related to control charts can provide information about disease outbreaks so that
practitioners can take action.
6 Further Reading
1. Faltin, F., Kenett, R., and Ruggeri, F., editors (2012) Statistical Methods in
Healthcare, Wiley, West Sussex, UK.
2. Winkel, P. and Zhang, N. F. (2007) Statistical Development of Quality in
Medicine, Wiley, West Sussex, UK.
1. Brookmeyer, R. and D.F. Stroup, eds. (2004). Monitoring the Health of Popula-
tions: Statistical Principles and Methods for Public Health Surveillance, Oxford
University Press.
2. Fricker, Jr., R.D. (2013). Introduction to Statistical Methods for Biosurveillance:
With an Emphasis on Syndromic Surveillance, Cambridge University Press.
3. Lombardo, J.S., and D.L. Buckeridge, eds. (2007). Disease Surveillance:
A Public Health Informatics Approach, Wiley-Interscience.
References
Loke, C.K., Gan, F.F.: Joint monitoring scheme for clinical failures and predisposed risk. Qual.
Technol. Quant. Manag. 9, 3–21 (2012)
Lorden, G.: Procedures for reacting to a change in distribution. Ann. Math. Stat. 42, 1897–1908
(1971)
Lowry, C.A., Woodall, W.H., Champ, C.W., Rigdon, S.E.: A multivariate exponentially weighted
moving average control chart. Technometrics 34, 46–53 (1992)
Montgomery, D.C.: Introduction to Statistical Quality Control, 6th edn. Wiley, London (2009)
Moustakides, G.V.: Optimal stopping times for detecting a change in distribution. Ann. Stat. 14, 1379–1388 (1986)
Page, E.S.: Continuous inspection schemes. Biometrika 41, 100–115 (1954)
Pignatiello, J.J. Jr., Runger, G.C.: Comparisons of multivariate CUSUM charts. J. Qual. Technol.
3, 173–186 (1990)
Rigdon, S.E., Cruthis, E.N., Champ, C.W.: Design strategies for individuals and moving range
control charts. J. Qual. Technol. 26, 274–287 (1994)
Rigdon, S.E., Turabelidze, G., Jahanpour, E.: Trigonometric regression for analysis of public health
surveillance data. J. Appl. Math. 2014 (2014)
Roberts, S.W.: Control chart tests based on geometric moving averages. Technometrics 1, 239–250
(1959)
Serfling, R.E.: Methods for current statistical analysis of excess pneumonia-influenza deaths.
Public Health Rep. 78, 494–506 (1963)
Shewhart, W.A. (ed.): Economic Control of Quality of Manufactured Product. D. van Nostrand
Company, New York (1931)
Sonesson, C., Bock, D.: A review and discussion of prospective statistical surveillance in public
health. J. R. Stat. Soc. Ser. A 166, 5–21 (2003)
Steiner, S.H.: Risk-adjusted monitoring of outcomes in health care. In: Lawless, J.F. (ed.) Statistics
in Action: A Canadian Outlook, pp. 225–241. CRC/Chapman Hall, Boca Raton/London (2014)
Steiner, S.H., Jones, M.: Risk-adjusted survival time monitoring with an updating exponentially
weighted moving average (EWMA) control chart. Stat. Med. 29, 444–454 (2010)
Steiner, S.H., Cook, R., Farewell, V.T., Treasure, T.: Monitoring surgical performance using risk-
adjusted cumulative sum charts. Biostatistics 1, 441–452 (2000)
Thacker, S.B.: Historical development. In: Teutsh, S.M., Churchill, R.E. (eds.) Principles and
Practices of Public Health Surveillance, pp. 1–16. Oxford University Press, Oxford (2000)
Wald, A.: Sequential tests of statistical hypotheses. Ann. Math. Stat. 16, 117–186 (1945)
Winkel, P., Zhang, N.F.: Statistical Development of Quality in Medicine. Wiley, London (2007)
Woodall, W.H., Fogel, S.L., Steiner, S.H.: The monitoring and improvement of surgical outcome
quality. J. Qual. Technol. (2015, To appear)
Zhang, X., Woodall, W.H.: Dynamic probability control limits for risk-adjusted Bernoulli CUSUM
charts. Stat. Med. (2015, to appear)
Standardization and Decomposition Analysis:
A Useful Analytical Method for Outcome
Difference, Inequality and Disparity Studies
Jichuan Wang
J. Wang
Department of Epidemiology and Biostatistics, The George Washington University School of Medicine, Children's National Medical Center, 111 Michigan Avenue Northwest, Washington, DC 20010, USA
e-mail: [email protected]; [email protected]

1 Introduction
(i.e., the death rates for specific age groups). This paradox is often a result of the
fact that the first population has a considerably larger proportion of its population
in age groups (e.g., age 5–44) that are subject to lower mortality. In this death rate
example, the observed difference in crude death rate between the two populations is
confounded by difference in composition of a confounding factor (i.e., age structure)
between the two populations. Once the composition of the confounding factor (i.e.,
age structure) is standardized across the two populations, the adjusted death rate of
the first population would be certainly higher than that of the second population.
SDA is a useful analytical method for studying outcome difference, inequality
or disparity between populations. The technique of standardization is used to adjust
or purge the confounding effects on observed (crude) rate of some phenomena in
two or more populations. Decomposition takes standardization a step further by
revealing the relative contributions of various confounding factors in explaining the
observed outcome difference, inequality or disparity. Its results tell how much of
the difference in the crude rate between populations is “real” rate difference that is
attributed to different factor-specific rates; and what factors significantly confound
the crude rate difference, and how much of the observed rate difference is attributed
to each specific confounding factor (Kitagawa 1955, 1964; Pullum 1978; United
Nations 1979). Returning to the death rate comparison example, based on the results
of standardization, the difference in crude death rate between the two populations
can be decomposed into two additive component effects: (1) the rate effect attributed
to the difference in the age-specific death rates; and (2) the factor component effect
attributed to the difference in age structure between the two populations. The former
is the “real” rate difference, and the latter is the observed rate difference due to
confounding effects (i.e., difference in age structure in this example). If another
confounding factor (e.g., ethnicity) were taken into account, the difference in the
crude death rate would be decomposed into three component effects: (1) the rate
effect due to the difference in the factor specific (i.e., age-ethnicity specific) death
rates; (2) factor-1 component effect due to the difference in age structure; and (3)
factor-2 component effect due to the difference in ethnic composition.
SDA can be readily applied to outcome comparison in various research fields.
For example, suppose we would like to study difference, inequality or disparity in
health and health care outcomes (e.g., prevalence of HIV, cancer, asthma, diabetes,
hypertension, obesity, mental disorder, likelihood of health service utilization,
etc.) between Black and White populations, application of SDA would tell how
much of the observed outcome difference could be the “real” difference in the
outcome measure between the populations; and how much would be attributed
to compositional differences of specific confounding factors (e.g., age, gender,
education, family income, location, immigrant status, etc.) in the populations.
The advantages of SDA include, but are not limited to, the following: (1) its results can be presented in an intuitively understandable manner, such as a decomposition of the observed rate difference into component effects whose relative contributions sum to 100 %; (2) because SDA is based on algebraic calculation, it imposes no constraints on the specification of the relationship (e.g., linearity), the nature of the variables (e.g., random), the form of
the variable distributions (e.g., normality), or the independence of observations, which are the usual assumptions of statistical analyses; (3) SDA can be used to study a wide range of outcome measures such as rates, percentages, proportions, ratios, and arithmetic means; and (4) SDA can also be readily applied to analyze outcome change, and the confounding effects on that change, in longitudinal studies.
Various SDA methods have been developed by demographers. In general,
the methods of standardization and decomposition are grouped into two broad
categories (Das Gupta 1991, 1993). In the first category, a crude rate is
expressed as a function of one or several factors (Bongaarts 1978; Pullum et al.
1989; Nathanson and Kim 1989; Wojtkiewicz et al. 1990). In the second and
more common category, standardization and decomposition are performed on cross-classified or contingency table data (Kitagawa 1955, 1964; Das Gupta 1991, 1993; Cho and Retherford 1973; Kim and Strobino 1984; Liao 1989). In both
categories, standardization and decomposition are all performed based on algebraic
calculation rather than statistical modeling. In a series of papers, Clogg and his
colleagues (Clogg and Eliason 1988; Clogg et al. 1990) have developed a statistical
model—the Clogg Model—to standardize rates. Based on log-linear models, the
Clogg Model centers around the idea of purging the effects of confounding factors.
However, the Clogg model is not designed for, and can’t be applied directly
to, decomposition analysis. Liao (1989) has developed a method which applies
the results of the Clogg models to decompose the difference in crude rates into
component effects representing compositional effects, rate effect, and possible
interactions between the two.
The choice of a standardization and decomposition method depends first on the type of data (aggregate data, contingency table, or individual data) available for analysis; beyond that, the choice of a method is a matter of personal preference. Nonetheless, Das Gupta's method is preferable because its symmetric approach integrates factor interactions into additive main effects (Das Gupta 1991, 1993). As such, multiple factors can easily be included and the results become much easier to interpret. Unfortunately, none of the existing SDA methods can take sampling variability into account when survey data are analyzed, because they are all based on algebraic calculation. Although Liao's (1989) method builds on the results of statistical modeling (i.e., the Clogg purging model) (Clogg and Eliason 1988; Clogg et al. 1990), the actual calculation of the component effects in the decomposition analysis is still based on algebraic equations; therefore, like the other methods, it does not provide statistical significance testing for component effects either. Wang and colleagues (2000) developed a Windows-based computer program, DECOMP, that employs bootstrapping techniques (Efron 1979, 1981) to estimate standard errors of SDA component effects, so that significance testing for component effects becomes possible in SDA.
SDA is an important demographical analytical method that has been widely
used to compare birth rates, death rates, unemployment rates, etc. between different
populations/groups in population studies. It has also been increasingly applied to
study outcomes difference, inequality or disparity in many other research fields.
Wang et al. (2000) applied the computer program DECOMP to conduct SDA to
compare gender difference with regard to HIV seropositive rate among injection
drug users (IDUs) in the U.S. A sample of 7,378 IDUs (1,745 females and 5,633
males) was obtained from the National Institute on Drug Abuse’s National AIDS
Demonstration Research (NADR) project (Brown and Beschner 1993) for the study.
Their findings show that the HIV seropositive rate among the IDUs was high (overall
36.73 %) in the U.S. IDU population. The corresponding rates for the male and
female IDUs were 37.39 % and 34.60 %, respectively, and the gender difference
(about 2.79 %) is statistically significant (t = 2.10, d.f. = 7,376, p = 0.0358). In
addition, the age structure and ethnic composition were significantly different
between the gender populations. To evaluate the difference in HIV seropositive
rate between the gender populations, age structure and ethnic composition were
standardized across the two populations. And then, the observed difference in
HIV seropositive rate between the two populations was decomposed into three
components: (1) the rate effect attributed to difference in factor-specific rates;
(2) factor-1 component effect attributed to difference in age structures; and (3)
factor-2 component effect attributed to difference in ethnic composition between
the populations. Once age structure and ethnic composition were standardized,
gender difference in HIV seropositive rate disappeared, indicating that the observed
gender difference was simply because of compositional difference between the
two populations. However, only age structure shows a significant confounding
effect on the observed difference in HIV seropositive rate between the gender IDU
populations. Similar studies were conducted to assess difference in HIV prevalence
rate in different regions in the U.S., such as high HIV prevalence region vs.
low HIV prevalence region (Wang 2003), and between four different geographic
regions (Wang and Kelly 2014). Their results show that ethnicity and education are
important confounding factors in HIV prevalence comparison among IDUs across
different U.S. regions.
The SDA was also applied to compare drug abuse among rural stimulant drug
users in three geographically distinct areas of the U.S. (Arkansas, Kentucky, and
Ohio) (Wang et al. 2007). The findings show that the observed rate of “ever
used” methamphetamine and the frequency of methamphetamine use in the past
30 days were much higher on average in Kentucky than in the other two states.
However, after the compositions of socio-demographic confounding factors (e.g.,
gender, ethnicity, age, and education) were standardized across the populations,
the two measures of methamphetamine use ranked highest in Arkansas, followed
by Kentucky, and then Ohio. Different confounding factors contributed in various
dimensions to the differences in the observed measures of methamphetamine use
between the geographical drug injection populations. Differential ethnic composi-
tions in the populations largely accounted for the observed difference in both ever
used methamphetamine and frequency of using methamphetamine in the past 30
days between Arkansas and the other project sites. Because non-Whites were found to be less likely to report methamphetamine use than Whites, regardless of location, the much higher proportion of non-Whites in Arkansas made the observed measures of methamphetamine use substantially lower than the real level.
2 Method
For two-population comparison with only one confounding factor, the algebraic
expression of SDA can be shown as follows (Das Gupta 1991; Das et al. 1993):
$$R_{1.} = \sum_{j=1}^{J} \frac{N_{1j} R_{1j}}{N_{1.}} = \sum_{j=1}^{J} F_{1j} R_{1j} \tag{1}$$

$$R_{2.} = \sum_{j=1}^{J} \frac{N_{2j} R_{2j}}{N_{2.}} = \sum_{j=1}^{J} F_{2j} R_{2j} \tag{2}$$
where $R_{1.}$ denotes the observed rate (or mean if the outcome is a continuous measure) for Population 1; $R_{1j}$ the observed factor-specific rate in the $j$th category of the confounding factor with $J$ categories ($j = 1, 2, \ldots, J$) in Population 1; $N_{1.}$ is the total number of cases in Population 1; $N_{1j}$ specifies the number of cases in the $j$th category of the confounding factor in Population 1; and $F_{1j} = N_{1j}/N_{1.}$ represents the proportion or relative frequency of the Population 1 members who fall into the $j$th category of the confounder, with $\sum_j F_{1j} = 1$. $R_{2.}$, $R_{2j}$, $N_{2.}$, $N_{2j}$, and $F_{2j}$ are the equivalent notations for Population 2. In both Eqs. (1) and (2), the observed rate is expressed as a summation of weighted factor-specific rates: for instance, the weight is $F_{1j} = N_{1j}/N_{1.}$ for Population 1, and $F_{2j} = N_{2j}/N_{2.}$ for Population 2, which are the compositions of the confounder in the respective populations. The rate difference between Populations 1 and 2 can be expressed accordingly:
$$R_{1.} - R_{2.} = \sum_{j=1}^{J} \frac{F_{1j} + F_{2j}}{2}\left(R_{1j} - R_{2j}\right) + \sum_{j=1}^{J} \frac{R_{1j} + R_{2j}}{2}\left(F_{1j} - F_{2j}\right) \tag{3}$$
Equation (3) shows that the difference between the two observed rates, $(R_{1.} - R_{2.})$, can be decomposed into two components: a rate effect (i.e., the first term on the right side of the equation) and a factor component effect (i.e., the second term on the right side of the equation). As shown in the first term, the composition of the confounding factor is standardized across populations; thus, the observed rate difference contained in this term can be considered to have resulted from differential factor-specific rates between the populations under study. Therefore, we call it the rate effect. In contrast, the second term on the right side of Eq. (3), where the factor-
specific rate is standardized, represents the component in the crude rate difference
that is attributed to differential factor compositions between the two populations.
We call this term factor component effect, which describes the effect of the factor
composition on the observed rate difference.
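To make the two-population decomposition concrete, the sketch below evaluates Eqs. (1)–(3) in R for a small hypothetical example with one confounding factor (three age groups); every number in it is made up.

# Sketch of Eqs. (1)-(3) with one confounding factor (age group); all numbers are hypothetical.
N1 <- c(2000, 5000, 3000); R1 <- c(0.002, 0.004, 0.030)   # Population 1: category sizes and rates
N2 <- c(1000, 3000, 6000); R2 <- c(0.003, 0.005, 0.032)   # Population 2

F1 <- N1 / sum(N1); F2 <- N2 / sum(N2)                    # compositions of the confounder
crude1 <- sum(F1 * R1)                                    # Eq. (1)
crude2 <- sum(F2 * R2)                                    # Eq. (2)

rate.effect   <- sum((F1 + F2) / 2 * (R1 - R2))           # first term of Eq. (3)
factor.effect <- sum((R1 + R2) / 2 * (F1 - F2))           # second term of Eq. (3)

c(crude.difference = crude1 - crude2,
  rate.effect      = rate.effect,
  factor.effect    = factor.effect,
  check            = rate.effect + factor.effect)         # the two effects sum to the difference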
The traditional SDA could only deal with two confounding factors and compare
two populations (Kitagawa 1955, 1964). When multiple confounding factors are
involved, the decomposition equations become progressively more complex because
of the proliferation of relationships between variables. In addition, when multiple
populations are involved in SDA, naive pairwise comparisons are usually conducted
separately, which is inappropriate because the results of pairwise comparisons may
lack internal consistency (Das Gupta 1991, 1993). As a result, the estimate
of a standardized rate for each population may not be consistent in different pairwise
comparisons. For example, when comparing three populations, the difference in
the estimated standardized rates between Population 1 and Population 2 plus the
difference between Population 2 and Population 3 may not equal the difference
between Population 1 and Population 3. The same problem remains when studying
temporal outcome changes in a single population at multiple time-points. The SDA was generalized by Das Gupta (1991, 1993) for multiple population comparisons with multiple confounding factors. In theory, the generalized SDA does not have a limit on the number of populations to compare or the number of confounding factors to analyze. The formulas for comparing Populations 1 and 2 in the presence of Populations 3, 4, . . . , and K are as follows (Das Gupta 1991, 1993):
$$A_{1.23\ldots K} = \frac{\displaystyle\sum_{j=2}^{K} A_{1.j}}{K-1} + \frac{\displaystyle\sum_{i=2}^{K}\left(\sum_{j \neq 1,i}^{K} A_{i.j} - (K-2)A_{i.1}\right)}{K(K-1)} \tag{4}$$
$$A_{12.3\ldots K} = A_{12} + \frac{\displaystyle\sum_{j=3}^{K}\left(A_{12} + A_{2.j} - A_{1.j}\right)}{K} \tag{5}$$
where $A_{1.23\ldots K}$ and $A_{12.3\ldots K}$ are the standardized rate in Population 1 and the component effect of factor A, respectively, standardizing all other factors but A,
where $\hat{\theta}_{(\cdot)} = \sum_{b=1}^{B} \hat{\theta}_b / B$. When the bootstrapping option is specified in the computer
program DECOMP, B resamples will be generated and each of them will be used
to generate a contingency table for SDA. Component effects will be estimated
separately from each of the B bootstrap resamples, and the empirical distributions
of the component effect estimates will be used to estimate the standard error
of each component effect. The distribution normality of the component effects
estimated from bootstrapping with various numbers of resamples was tested in the
author’s previous study, using Q–Q normal probability plots (Wang et al. 2000). The
results show that the plot points cluster around a straight line, indicating that the
bootstrapping estimated values of component effects display a normal distribution
when a moderately large number (e.g., 200) of bootstrap resamples were used.
However, statisticians recommend that 800 or more bootstrap resamples be used to estimate the standard error of $\hat{\theta}$ (Booth and Sarkar 1998). Note that bootstrapping is conducted using raw (individual-level) survey data. If only grouped survey data (a contingency table) are available, a utility function built into DECOMP allows one to convert the grouped data into raw (individual) data, provided the outcome is a rate, percentage, proportion, or ratio (Wang et al. 2000).
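In the spirit of the bootstrapping implemented in DECOMP (though this is not the DECOMP code itself), the sketch below resamples individual records and recomputes the two-population decomposition of Eq. (3) to obtain bootstrap standard errors and t-ratios for the component effects; the data frame, variable names, and sample sizes are hypothetical.

# Sketch: bootstrap standard errors for the Eq. (3) component effects from individual-level data.
# 'dat' is a hypothetical data frame with a 0/1 outcome, a group indicator, and one confounder.
decompose <- function(dat) {
  r1 <- tapply(dat$outcome[dat$group == 1], dat$age[dat$group == 1], mean)  # factor-specific rates
  r2 <- tapply(dat$outcome[dat$group == 2], dat$age[dat$group == 2], mean)
  f1 <- prop.table(table(dat$age[dat$group == 1]))                          # compositions
  f2 <- prop.table(table(dat$age[dat$group == 2]))
  c(rate.effect   = sum((f1 + f2) / 2 * (r1 - r2)),
    factor.effect = sum((r1 + r2) / 2 * (f1 - f2)))
}

set.seed(5)
dat <- data.frame(group   = rep(1:2, each = 1500),
                  age     = sample(c("<30", "30-39", "40+"), 3000, replace = TRUE),
                  outcome = rbinom(3000, 1, 0.1))

B <- 800                                                   # Booth and Sarkar (1998): use >= 800
boot.est <- replicate(B, decompose(dat[sample(nrow(dat), replace = TRUE), ]))
est <- decompose(dat)
se  <- apply(boot.est, 1, sd)
cbind(estimate = est, se = se, t.ratio = est / se)         # t-ratios as in the chapter's tables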
two-population SDA (Wang and Kelly 2014; Wang et al. 2007). All populations will be compared in a pairwise way, adjusting for internal inconsistency.
Application of the Windows version of DECOMP is straightforward (Wang et al. 2000). The steps are: (1) open the program in Windows and read in a data file in text or ASCII format; (2) specify the population variable (a categorical
variable that has as many categories as the number of populations); (3) specify
the outcome variable and select the confounding factors; (4) specify the number
of bootstrapping resamples if significance testing is preferred; and then (5) click
the Run button. The results of SDA can be saved in different formats (e.g., pdf,
MS Word). The program is freely available to download online (www.wright.edu/~jichuan.wang/).
This chapter demonstrates how to apply SDA in real research. The example of
application assesses regional difference in the HIV prevalence rate among IDUs
in the U.S. A total of 9,824 IDUs located in three geographic regions (i.e.,
Northeast, Midwest, and West) were retrieved from the large national database of the
National Institute on Drug Abuse’s Cooperative Agreement for AIDS Community-
Based Outreach/Intervention Research Program (COOP) (Needle et al. 1995) for
the purpose of demonstration. In the SDA, differences in the observed regional
HIV prevalence rate were decomposed into component effects, such as “real”
difference in HIV prevalence that is attributed to difference in factor-specific rates;
and factor component effects that are attributed to compositional differences of
specific confounding factors. The outcome is a dichotomous variable at individual
level (1-HIV positive; 0-HIV negative); thus, the mean of the outcome in a regional
population is an estimated HIV prevalence rate in the regional target population.
The confounding factors included in the analysis are: Ethnicity (0-Nonwhite;
1-White), gender (0-Female; 1-Male), age group (1: <30; 2: 30–39; 3: 40C), and
education level (1: <High school; 2: High school; 3: CollegeC). The Windows
version DECOMP was used to implement the multi-population SDA with multiple
confounding factors.
3 Results
The sample descriptive statistics and the estimates of HIV prevalence rates among
IDUs are shown by region in Table 1. The HIV prevalence rate was high in the
Northeast (17.52 %), moderate in the West (8.04 %), and low in the Midwest
(4.56 %). The HIV prevalence rate was higher among Black IDUs than among
White IDUs across the regions. Age and education are significantly associated with
HIV prevalence rate only in the West. Gender is not significantly associated with HIV prevalence rate in any of the regions. Notably, the compositions of the socio-
demographic factors, ethnicity in particular, vary across regions. For example, only
47.16 % of the IDUs in the West were Blacks, while the corresponding figures were
76.64 % in the Northeast, and 80.10 % in the Midwest, respectively.
The results of SDA are shown in Table 2. The upper panel of the table shows
the results comparing HIV prevalence rates between the Northeast and Midwest
regions. The observed HIV prevalence rate was about 12.91 % higher in the
Northeast than in the Midwest. Significance testing for the rate difference was
conducted using t-test, where the standard error of the difference was estimated
based on 1,000 bootstrap resamples. Only education shows significant confounding
effect (t-ratio = 0.0027/0.0011 = 2.45) on the regional difference in HIV prevalence
rate. Nonetheless, the factor component effect is very small, contributing only
2.09 % to the observed rate difference. Other socio-demographic factors do not
significantly confound the regional difference of HIV prevalence rates. As such,
the regional rate difference (12.81 %) remains almost unchanged after adjusting for
confounding effects.
The HIV prevalence difference between the Northeast and the West decreased
from 9.47 to 6.51 % after adjusting for the confounding factors (see the middle
panel of Table 2). That is, adjusting for socio-demographic compositions, the
regional difference in HIV prevalence rate would be about 31.26 % smaller.
The adjusted prevalence difference reflects the factor-specific rate difference,
which accounts for about 68.71 % of the observed regional prevalence difference
(see the last column of the panel in Table 2). Ethnic composition had a significant confounding effect (t-ratio = 0.0257/0.0028 = 9.18), accounting for about 27.13 % of the observed HIV prevalence difference. Education had a significant confounding effect (t-ratio = 0.0033/0.0012 = 2.75), but its contributions to the regional difference in HIV prevalence rate were limited, accounting for only 3.48 %. Gender (t-ratio = 0.0007/0.0011 = 0.64) and age
4 Discussion
This chapter briefly introduces SDA and demonstrates its application using real
research data. The results of the example SDA show that ethnicity and education,
particularly ethnicity, are important confounding factors in comparison of HIV
prevalence among IDUs between the U.S. geographic regions. Gender and age had no significant confounding effects on the regional HIV prevalence difference because age structure and gender composition do not vary much between the regions (see Table 1). It should be noted that decomposition of outcome differences
using SDA is not equivalent to analyzing variation of a dependent variable in a
regression model or ANOVA. A variable may significantly explain the variation
of a dependent variable in regression, but may not have a significant confounding
effect in SDA. For example, both bivariate and multivariate statistics may show a significant relationship between a variable and an outcome measure of interest;
however, the variable would have no significant confounding effect on outcome
difference between populations if its composition does not vary much across the
populations under study. In the example demonstrated in this chapter, ethnicity
shows significant confounding effect when comparing the West region with the
Northeast and Midwest regions. However, such a confounding effect was not statistically significant (t-ratio = 0.0019/0.0015 = 1.26) when comparing the Northeast
and Midwest regions because ethnic composition was similar in the two regions
(see Table 1). That is, although ethnicity is significantly related to HIV seropositive
status, its confounding effect on regional difference of HIV prevalence rate depends
on the regional ethnic composition.
In many research fields, such as epidemiology, health and health care studies, and
behavior research, outcome difference, inequality or disparity is often a significant
concern. An observed outcome measure (e.g., crude death rate, prevalence of a
specific disease, adverse event or symptom, average medical expenses, average cost
of medical insurance, school dropout rate, crime rate, etc.) depends not just on
the level of the outcome, but also the composition of the underlying population.
That is, the observed outcome measure for a population depends on a number of
confounding factors that need to be taken into account when comparing outcome
measures across populations. SDA is a useful analytical method for such outcome
comparison. It makes it possible to evaluate which factors contribute, and how much, to the observed outcome difference, inequality or disparity between populations. It provides not only an opportunity to view and interpret outcome difference, inequality or disparity from a different perspective, but also important policy implications with
regard to intervention efforts that are tailored to meet the needs of the populations.
References
Bongaarts, J.: A framework for analyzing the proximate determinants of fertility. Popul. Dev. Rev.
4, 105–132 (1978)
Booth, J.G., Sarkar, S.: Monte Carlo approximation of bootstrap variances. Am. Stat. 52, 354–357
(1998)
Brown, B.S., Beschner, G.M.: Handbook on Risk of AIDS: Injection Drug Users and Sexual
Partners. Greenwood Press, Westport (1993)
Chevan, A., Sutherland, M.: Revisiting Das Gupta: refinement and extension of standardization
and decomposition. Demography 46, 429–449 (2009)
Cho, L.J., Retherford, R.D.: Comparative analysis of recent fertility trends in East Asia. Proc.
IUSSP Int. Popul. Conf. 2, 163–181 (1973)
Clogg, C.C., Eliason, S.R.: A flexible procedure for adjusting rates and proportions, including
statistical method for group comparisons. Am. Sociol. Rev. 53, 267–283 (1988)
Clogg, C.C., Shockey, J.W., Eliason, S.R.: A general statistical framework for adjustment of rates.
Sociol. Methods Res. 19, 156–195 (1990)
Das Gupta, P.: Decomposition of the difference between two rates and its consistency when more
than two populations are involved. Math. Popul. Stud. 3, 105–125 (1991)
Das Gupta, P.: Standardization and Decomposition of Rates: A User’s Manual. U.S. Bureau of
the Census, Current Population Reports, Series P23-186. U.S. Government Printing Office,
Washington, DC (1993)
Efron, B.: Bootstrap methods: another look at the Jackknife. Ann. Stat. 7, 1–26 (1979)
Efron, B.: Nonparametric standard errors and confidence intervals (with discussion). Can. J. Stat.
9, 139–172 (1981)
Kim, Y.J., Strobino, D.M.: Decomposition of the difference between two rates with hierarchical
factors. Demography 15, 99–112 (1984)
Kitagawa, E.M.: Components of a difference between two rates. J. Am. Stat. Assoc. 50, 1168–1194
(1955)
Kitagawa, E.M.: Standardized comparisons in population research. Demography 1, 296–315
(1964)
Liao, T.F.: A flexible approach for the decomposition of rate differences. Demography 26, 717–726
(1989)
Nathanson, C.A., Kim, Y.J.: Components of change in adolescent fertility. Demography 26, 85–98
(1989)
Needle, R., Fisher, D.G., Weatherby, N., Chitwood, D., Brown, B., Cesari, H., et al.: Reliability of
self-reported HIV risk behaviors of drug users. Psychol. Addict. Behav. 9, 242–250 (1995)
Pullum, T.W.: Standardization (World Fertility Survey Technical Bulletins, No. 597). International
Statistical Institute, Voorburg (1978)
Pullum, T.W., Tedrow, L.M., Herting, J.R.: Measuring change and continuity in parity distributions.
Demography 26, 485–498 (1989)
Ruggles, S.: Software for Multiple Standardization and Demographic Decomposition. http://www.hist.umn.edu/~ruggles/DECOMP.html (1986–1989)
United Nations: The Methodology of Measuring the Impact of Family Planning Programs, Manual
IX, Population Studies, No. 66. U.N., New York (1979)
Wang, J.: Components of difference in HIV seropositivity rate among injection drug users between
low and high HIV prevalence regions. AIDS Behav. 7, 1–8 (2003)
Wang, J., Kelly, B.: Gauging regional differences in the HIV prevalence rate among injection drug
users in the U.S. Open Addict. J. 7, 1–7 (2014)
Wang, J., Rahman, A., Siegal, H.A., Fisher, J.H.: Standardization and decomposition of rates:
useful analytic techniques for behavior and health studies. Behav. Res. Methods Instrum.
Comput. 32, 357–366 (2000)
Wang, J., Carlson, R.G., Falck, R.S., Leukefeld, C., Booth, B.M.: Multiple sample standardization
and decomposition analysis: an application to comparisons of methamphetamine use among
rural drug users in three American states. Stat. Med. 26, 3612–3623 (2007)
Wojtkiewicz, R.A., Mclanahan, S.S., Garfinkel, I.: The growth of families headed by women:
1950–1980. Demography 27, 19–30 (1990)
Yuan, C., Wei, C., Wang, J., Qian, H., Ye, X., Liu, Y., Hinds, P.S.: Self-efficacy difference among
patients with cancer with different socioeconomic status: application of latent class analysis
and standardization and decomposition analysis. Cancer Epidemiol. 38, 298–306 (2014)
Cusp Catastrophe Modeling in Medical
and Health Research
Abstract Further advancement in medical and health research calls for analytical
paradigm shifting from linear and continuous approach to nonlinear and discrete
approach. In response to this need, we introduced the cusp catastrophe modeling
method, including the general principle and two analytical approaches to statis-
tically solving the model for actual data analysis: (1) the polynomial regression
method and (2) the likelihood estimation method, with the former for analyzing longitudinal data and the latter for cross-sectional data. The polynomial regression
method can be conducted using most software packages, including SAS, SPSS, and
R. A special R-based package “cusp” is needed to run the likelihood method for data
analysis. To assist researchers interested in using the method, two examples with
empirical data analyses are included, including R codes for the “cusp” package.
paths; the development of a drug addiction and the recovery from the addiction
treatment also follow different paths. The limitation is obvious if methods based
on the LC paradigm are used to quantify a condition with multiple progression
paths.
3. The LC paradigm assumes that changes in y are continuous in response to changes in x. However, in medical and behavioral research, non-continuous changes are
not uncommon. Typical examples in medicine include the occurrence of heart
attack, stroke, asthma, and fatigue; and in health behaviors include injury and
accidents, use of drugs and substance, sexual debut, and condom use during
sex. In these cases, analytical methods based on the LC paradigm are inadequate
to quantify the process, including the overall trends and the threshold in x for the
sudden change in y.
To deal with categorical variables that are neither continuous nor linear, a likelihood approach is adopted that capitalizes on the linear and continuous paradigm. The most typical example is the logistic regression model established for a binary dependent variable, which uses the independent variables x (which can be continuous, such as age or blood pressure) to predict the likelihood (e.g., the odds ratio or other similar measures) of the occurrence of the dependent variable y (e.g., a disease or death). With this approach, the predicted likelihood/probability is a nonlinear function of the linear combination of the independent variables. This approach extends the LC paradigm and can be more generally termed the nonlinear continuous (NLC) paradigm. Mathematically, the NLC paradigm expresses the outcome as a nonlinear but continuous function of the independent variables and covariates.
[Fig. 3 An illustration of the nonlinear discrete (NLD) paradigm: the dependent variable follows different paths as x increases or declines, with an upward threshold u, a downward threshold d, and an inaccessible region between the two paths]
When the independent variable x declines (moving from right to left), however, changes in the dependent variable y follow a different path. It first declines gradually, also following a nonlinear path; when x passes the threshold d (d for downward), y drops suddenly, and further declines in x result in little additional change in y. We term this the nonlinear discrete (NLD) paradigm. If NLD data were analyzed using a linear regression method, the results might appear “good”, but the conclusions would be wrong.
Likewise, an NLD paradigm can be mathematically expressed as follows:

$$y = f_{nld}(x; u, d, c),$$

where x and y represent the independent and dependent variables, u and d are the two thresholds on x, c represents a set of covariates, and $f_{nld}$ is a nonlinear and discrete function characterizing the relationship between x and y.
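As a toy numerical illustration of such an NLD relationship (not a model proposed in this chapter), the sketch below generates a hysteresis-type response in which y jumps upward once x rises past u and drops back only after x falls below d; the thresholds and response levels are arbitrary.

# Toy sketch of a nonlinear discrete (NLD) relationship with two thresholds u > d:
# y jumps up when x rises past u and drops back only when x falls below d.
# The thresholds and response levels are arbitrary illustration values.
nld_path <- function(x, u = 0.6, d = 0.3) {
  y <- numeric(length(x)); state <- 0                    # start in the "low" state
  for (i in seq_along(x)) {
    if (state == 0 && x[i] > u) state <- 1               # sudden upward jump at u
    if (state == 1 && x[i] < d) state <- 0               # sudden downward drop at d
    y[i] <- if (state == 1) 1 + 0.5 * x[i] else 0.5 * x[i]
  }
  y
}

x <- c(seq(0, 1, by = 0.01), seq(1, 0, by = -0.01))      # x rises, then declines
y <- nld_path(x)
plot(x, y, pch = 20, xlab = "x", ylab = "y",
     main = "Different paths as x rises and declines")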
To date, no analytical method has been established in medical and health research to simultaneously quantify an NLD relationship. Adaptation of bifurcation analysis (Chitnis et al. 2006) to establish new statistical methods may represent a promising approach to advance medical and health research. Bifurcation analysis has been widely used in physics and engineering but has not been used in medical and health research.
Similar to the dual wave–particle character of light, the dynamics of many medical and health phenomena may also contain a continuous component and a discontinuous, or quantum, component. For example, the occurrence of a disease (e.g., heart disease) involves (1) a continuous and cumulative exposure to disease-causing risk factors (e.g., fat and salt intake and lack of exercise) and (2) a sudden process (e.g., a trigger, such as stress). Another example is the process by which a smoker decides to quit smoking. He or she may make the decision based on (1) a careful assessment of the pros and cons of continuing to smoke versus quitting, concluding that the pros of quitting greatly outweigh the cons; or (2) simply quitting “cold turkey” without an effortful assessment, perhaps deciding after watching a movie on smoking and cancer or mimicking a role model who quit smoking. A conceptual framework capable of guiding new methodologies to characterize this type of change can be termed the quantum (Q) paradigm. Certainly a Q paradigm will be closer to the truth than any of the LC, NLC, and NLD paradigms introduced above in reflecting the reality of medical and health issues. More research effort is needed to develop this paradigm, including the establishment of analytical methodologies for use in medical and health research.
Both authors of this chapter have collaborated since 2010 on this line of methodological research. Recently, they received 5 years of funding through a research grant (Award #: R01HD075635, period: 2013–2018) from the National Institutes of Health to establish a set of quantum paradigm-based methodologies for health behavior research.
Along with the development of social and behavioral research, the catastrophe modeling approach has emerged. Catastrophe theory was established by Thom (1973, 1975) and Gilmore (1993) to describe complex phenomena in science. According to Thom, many seemingly very complex systems in the universe, such as severe weather, earthquakes, and social turmoil, are, in fact, determined by a small number of factors. He termed these factors control variables. According to the number of control variables and the complexity of the relationship, seven elementary catastrophe models have been developed; from simple to complex, these models are (1) Fold (one control variable), (2) Cusp (two control variables), (3) Swallowtail (three control variables), (4) Butterfly (four control variables), (5) Hyperbolic Umbilic (three control variables), (6) Elliptic Umbilic (three control variables), and (7) Parabolic Umbilic (four control variables). Although the application of catastrophe modeling methods in the physical sciences is well accepted, these methods have not been widely used in medical and health research. Readers who are interested in the catastrophe models can consult the related books.
Among the seven elementary catastrophe models described above, the cusp is the most widely used in research, probably due to the efforts of a number of researchers,
particularly Zeeman (Zeeman 1973, 1974, 1976), Guastello (Guastello 1982, 1989;
Guastello et al. 2008, 2012), and others (Stewart and Peregoy 1983; Poston and
Stewart 1996). Inspired and encouraged by this published research, the two authors of this chapter have used the cusp modeling method in health and behavior research since 2009. In addition to applying the method, the two authors have also developed methods to determine sample size and statistical power for cusp catastrophe modeling (Chen et al. 2014a).
In the cusp catastrophe model, the dynamics of a disease or a health behavior
z is presented as a function of two control factors (i.e., independent or predictor
variables in statistical terminology) x and y as follows:
$$V(z; x, y) = \frac{1}{4}z^4 + \frac{1}{2}z^2 y + zx \tag{4}$$
In the model, the variable x is termed the asymmetry and the variable y the bifurcation.
Figure 4 depicts the first derivative of the cusp model (Eq. 4). It is a curved
equilibrium plane that determines the dynamic change in z along with change in x
and y. The two control variables x and y form a control plane (see Fig. 5), governing
the dynamics of the dependent variable z.
Fig. 4 An illustration of the cusp catastrophe model [axes x, y, z; cusp point O; thresholds O-Q and O-R; paths B and C shown on the equilibrium surface]
Fig. 5 Illustration of the control plane of cusp catastrophe (with reference to Fig. 4) [asymmetry axis x, bifurcation axis y, cusp point O, and cusp region]
From Figs. 4 and 5, it can be seen that the cusp model contains two threshold lines
O-Q (ascending) and O-R (descending), one continuous changing region (behind
point O), two stable regions (bottom front and upper front) and one unstable region
between O-Q and O-R. When a behavior z moves within the two stable regions, changes in either x or y result in little change in z; however, when z moves into the unstable region between O-Q and O-R, small changes in x and/or y will result in a phase change in z.
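This geometry can be reproduced numerically: setting the derivative of the potential in Eq. (4) to zero gives the equilibrium condition $z^3 + zy + x = 0$, and counting its real roots over a grid of (x, y) separates the cusp region (three equilibria: two stable, one unstable) from the rest of the control plane (one equilibrium). The grid below is an illustrative choice, and the orientation of the cusp region depends on the sign convention of Eq. (4).

# Sketch: count the real equilibria of the cusp model, i.e. the real roots of z^3 + z*y + x = 0,
# over a grid of the control variables x and y. Grid limits are illustrative choices.
n.real.roots <- function(x, y) {
  r <- polyroot(c(x, y, 0, 1))                  # coefficients of x + y*z + 0*z^2 + z^3
  sum(abs(Im(r)) < 1e-8)                        # number of (numerically) real roots
}

xs <- seq(-2, 2, length.out = 121)
ys <- seq(-3, 3, length.out = 121)
roots <- outer(xs, ys, Vectorize(n.real.roots))
image(xs, ys, roots, xlab = "asymmetry x", ylab = "bifurcation y",
      main = "Number of equilibria (3 inside the cusp region, 1 outside)")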
The three paths in Fig. 4 further illustrate how the cusp catastrophe model works.
When the bifurcation variable y < O, Path A represents the linear and continuous
relationship between x and z. This relationship is familiar to most researchers and
can be adequately quantified with many conventional methods, including linear and
nonlinear regression analyses. When y > O, the relationship between x and z can take
two very different paths. Path B represents the process of sudden occurrence of a
behavior. As x increases, z will experience little change; however, just as x passes
O-Q, z will experience a sudden increase.
Likewise, Path C represents a relapse/recovery of a behavior, with the sudden change occurring at the descending threshold O-R. In addition, in the unstable
region, z can take two different values with opposite status at the same value of x
and y. This phenomenon cannot be directly captured by the traditional continuous
statistical approaches based on LC and NLC paradigms as shown in Figs. 1 and 2;
however, it does reflect the NLD paradigm as shown in Fig. 3.
Despite the wide acceptance of the LC and NLC paradigms, the large number of analytical and statistical methods guided by these two paradigms, and the great success of these methodologies in medical and health research, a closer examination of the research findings produced with them reveals obvious limitations. In most if not all reported etiological research, analytical models based on linear and continuous (i.e., linear regression) or nonlinear and continuous (i.e., logistic regression) hypotheses can explain only a small proportion of the variance. This issue is even more pronounced in studies focused on health-related attitudes and behaviors. In such studies, a theory-guided linear model can explain only 15–25 % of the variance of a variable of interest in etiological research and a small to moderate effect in intervention trials (Godin and Kok 1996; Armitage and Conner 2001; Wu et al. 2005). The solid theoretical basis of these research studies and the high quality of the data strongly suggest that the limitations lie in the analytical paradigm.
Theoretically, from Figs. 1, 2, and 3 it can be seen what happens when a linear or a curvilinear continuous method is used to analyze phenomena that are nonlinear and discrete. These models may fit the data well when judged by the criteria established for significance testing, including the significance tests for the model parameters and R² for data–model fit; however, such models can hardly explain a large amount of variance because these linear and nonlinear continuous models pick up only part of the underlying relationship between the predictor and outcome variables.
The lack of tools to solve the cusp model has prevented the application of the method in medical and health behavior research (Chen 1989). Solving the cusp model statistically as described by Eq. (4) is a challenge (Cobb and Ragade 1978; Guastello 1982). Through the efforts of a group of scientists, a polynomial regression approach was established (Guastello 1982). This approach takes the first derivative of Eq. (4) with respect to z:

$$\partial V / \partial z = z^3 + zy + x \tag{5}$$
Let $x_1$, $y_1$, and $z_1$ be the variables observed at time 1, use the difference $\Delta z$ of the observed values of the dependent variable z at two consecutive times as a numerical proxy for the derivative, insert regression $\beta$ coefficients for the related variables, and rearrange the terms to obtain the following:

$$\Delta z = \beta_0 + \beta_1 z_1^3 + \beta_2 z_1^2 + \beta_3 y_1 z_1 + \beta_4 x_1 + \beta_5 y_1 \tag{6}$$
Two methods have been established to assess if the study variable follows a cusp
catastrophe. Each method provides unique evidence.
In this method, a number of models that are similar to the polynomial cusp model but do not contain the higher-order terms of the outcome variable z are used to assess whether the cusp model is superior to these alternative models (Guastello 1982; Chen et al. 2010a). Four alternative regression models are often used to model the same data. These four alternative models often take the following forms:
$$\Delta z = \beta_0 + \beta_1 z_1 + \beta_4 x_1 + \beta_5 y_1 \tag{7}$$
$$\Delta z = \beta_0 + \beta_1 z_1 + \beta_3 y_1 z_1 + \beta_4 x_1 + \beta_5 y_1 \tag{8}$$
$$z_2 = \beta_0 + \beta_1 z_1 + \beta_4 x_1 + \beta_5 y_1 \tag{9}$$
$$z_2 = \beta_0 + \beta_1 z_1 + \beta_3 y_1 z_1 + \beta_4 x_1 + \beta_5 y_1 \tag{10}$$
Among these four models, the first two are differential linear regression models and the second two are pre–post linear regression models. The second and the fourth models contain an interaction term between $y_1$ and $z_1$, making them closer to the polynomial cusp model.
In the modeling analysis, the R² of these alternative models and of the cusp model are obtained and compared. If the R² for the cusp model is greater than those for the four alternative models, this provides evidence supporting the superiority of the cusp model over the alternative models.
The following five steps are to be followed for a polynomial cusp modeling analysis (a brief R sketch follows these steps).
Step 1 Tabulate the dependent variable to see if it shows a bimodal distribution with two
peaks. If not, the cusp catastrophe model may not be relevant.
Step 2 Standardize all variables, including x, y, and z, to create a new dataset.
Standardization is done by subtracting the mean and then dividing by the
standard deviation to create a set of new standardized x, y, and z.
Step 3 Create the new variables z1³, z1², and y1·z1 through simple arithmetic computation
and add them to the dataset.
Step 4 Conduct regression analyses using the standardized and newly created
variables and the five linear equations (6) to (10). In the modeling analysis,
ask the program to output the R2 for all five models. These R2 values will be used later
for comparison purposes to determine which model is superior to the others when
reporting results.
Step 5 Report the results.
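To make Steps 2–4 concrete, the following is a minimal R sketch (not from the original analysis), assuming a hypothetical data frame dat whose columns z1 and z2 hold the outcome measured at two consecutive times and whose columns x1 and y1 hold the candidate asymmetry and bifurcation variables:

# Steps 2-4 of the polynomial cusp modeling analysis (sketch only;
# the data frame "dat" and its column names are assumed)
dat_s <- as.data.frame(scale(dat))            # Step 2: standardize all variables
dat_s$dz     <- dat_s$z2 - dat_s$z1           # difference score, proxy of the derivative
dat_s$z1cube <- dat_s$z1^3                    # Step 3: create z1^3
dat_s$z1sq   <- dat_s$z1^2                    #         create z1^2
dat_s$yz1    <- dat_s$y1 * dat_s$z1           #         create y1*z1

# Step 4: fit the five models of Eqs. (6)-(10)
m6  <- lm(dz ~ z1cube + z1sq + yz1 + x1 + y1, data = dat_s)   # polynomial cusp, Eq. (6)
m7  <- lm(dz ~ z1 + x1 + y1,                  data = dat_s)   # Eq. (7)
m8  <- lm(dz ~ z1 + yz1 + x1 + y1,            data = dat_s)   # Eq. (8)
m9  <- lm(z2 ~ z1 + x1 + y1,                  data = dat_s)   # Eq. (9)
m10 <- lm(z2 ~ z1 + yz1 + x1 + y1,            data = dat_s)   # Eq. (10)

# compare the R^2 of the five models
sapply(list(eq6 = m6, eq7 = m7, eq8 = m8, eq9 = m9, eq10 = m10),
       function(m) summary(m)$r.squared)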
Data used in this example were derived from a randomized controlled trial conducted
in the Bahamas to test a program encouraging HIV-protective behaviors
(e.g., condom use) and discouraging HIV risk behaviors (e.g., engaging in risky sex).
A total of 1,360 middle school students from 15 government-run schools were
randomized into three groups: the first group (n = 427) with only students receiving
the intervention, the second group (n = 436) with both students and their parents
receiving the intervention, and the third group (n = 497) receiving an environmental
conservation intervention as the attentional control. Among the total, 366 (85.7 %)
participants in group one, 389 (89.2 %) in group two, and 417 (83.9 %) in group three
participated in the follow-up assessment 24 months post-intervention. The program
showed significant effects at multiple follow-up assessments (Chen et al. 2009,
2010b; Gong et al. 2009). To assess factors associated with sexual initiation, the
polynomial cusp model and the four alternative models described above were fitted
to these data; the results are summarized in Table 2.
Table 2 Comparison of the R² of the four alternative models with the cusp model

Model name                                       Expression                                              R²
Polynomial cusp                                  Δz = β0 + β1·z1³ + β2·z1² + β3·y1·z1 + β4·x1 + β5·y1    0.51
Differential linear model                        Δz = β0 + β1·z1 + β4·x1 + β5·y1                         0.20
Differential linear model with interaction term  Δz = β0 + β1·z1 + β3·y1·z1 + β4·x1 + β5·y1              0.21
Pre-post model                                   z2 = β0 + β1·z1 + β4·x1 + β5·y1                         0.14
Pre-post model with interaction                  z2 = β0 + β1·z1 + β3·y1·z1 + β4·x1 + β5·y1              0.14
Results in Table 2 indicate that, with exactly the same variables, the polynomial
cusp model performed significantly better than the other four alternative models
with regard to the amount of variance explained by a model.
One unanswered question in cusp modeling analysis for medical and health behavior
research is which of the potential control variables should be tested as asymmetry
variables and which as bifurcation variables. More research is needed to develop guidelines and
standards for variable selection (Guastello 1982; Stewart and Peregoy
1983; Chen et al. 2010a, 2013). The following are a couple of experience-based
and commonly accepted rules.
Variables more likely to be modeled as asymmetry are those that are relatively
stable, whose development is gradual, whose dynamic changes over time are smooth,
and that reflect primarily intra-personal characteristics. Typical examples include
chronological age, knowledge, skills, hormone levels, and cognitive function. Often the
relationship between the asymmetry variable and the outcome variable is stable and
robust without the impact of the bifurcation variable.
Bifurcation variables, on the other hand, are rather volatile; they reflect either
contextual factors, a perception of situational conditions, or emotion-related factors.
One characteristic common to these variables is that they change more rapidly
than the asymmetry variables. Typical examples include stress, peer pressure, self-efficacy,
beliefs, and attitudes.
It is worth noting that whether a variable is asymmetry or bifurcation is
also relative. For example, in a study investigating the role of inflammation and
cognitive functioning in fatigue, a model with inflammation as the asymmetry variable and
cognitive functioning as the bifurcation variable fit the data well (Chen et al. 2014b). Here,
cognitive function is also an intrapersonal factor that is relatively stable; however, it
was used as a bifurcation variable. The reason for this selection is that, relative to the process
of inflammation, cognitive function is more volatile and less stable. Executive
function is situational and can also be affected by emotions, while neuromuscular
inflammation follows specific pathological processes.
Figure 6 illustrates the density distribution of the cusp model (Eq. 12). It is worth
noting that at different regions of the x–y control plane, the density function of
Eq. (12) takes different forms: (1) when the bifurcation variable y is above zero,
the density distribution is bimodal with two peaks as x and y vary within the cusp
region (the gray area); (2) in the rest of the control plane, the density function is
unimodal with only one peak; and (3) the density distribution tends to be symmetrical
toward the center where x = 0, skewed to the right when x < 0, and skewed to the left
when x > 0.
Fig. 6 Density distribution of a cusp catastrophe model at different regions of the x-y control plane
With the density function of Eq. (12) of the stochastic cusp model, the theory of
maximum likelihood can be employed for parameter estimation and statistical
inference. For a study with n subjects, the following log-likelihood function can be
established:

l(z, α, β | Z, X, Y) = Σ_{i=1}^{n} log ψ_i + Σ_{i=1}^{n} (α_i·z_i + ½β_i·z_i² − ¼z_i⁴),   (13)

where ψ_i is the normalizing constant of the density in Eq. (12) for subject i.
With the likelihood function of Eq. (13), the model parameters can be estimated by
applying the powerful likelihood theory when data are available from a randomly
selected sample.
The likelihood function of Eq. (13) in Sect. 7.1 above indicates that the likelihood-based
stochastic cusp approach does not require longitudinal data. Instead of modeling
changes as in the polynomial regression approach, the likelihood function only
requires the measurement of the observed variables at a single time point for all subjects
in a study. This greatly increases the opportunities to apply the cusp catastrophe
modeling method in medical and health research. Collecting cross-sectional data
is much more cost-effective. In addition, many existing data sets that are available
for modeling analysis are cross-sectional in nature, including both medical data (e.g.,
many datasets from clinical records) and health behavior research data (such as the
National Health and Nutrition Examination Survey, the National Survey on Drug
Use and Health, and the National Health Interview Survey).
In addition to cross-sectional data, longitudinal models can also be tested by
selecting the control variables x and y assessed at an earlier period to predict the
outcome variable z at a subsequent time. This approach has also been used in
reported studies to assess HIV risk behaviors and tobacco use among adolescents
(Chen et al. 2010a, 2012, 2013).
One limitation of the polynomial regression approach described in Sect. 6 is that it
allows only one observed variable for each of the three model variables x, y, and z.
The specification of the stochastic model by Eq. (12) removes this limitation
of the polynomial regression approach. For example, in a study with n participants,
researchers often measure more than one variable for each underlying (latent)
construct, such as cognitive functioning, self-efficacy, health literacy, etc. Assume
p dependent variables Z_1, …, Z_p, q asymmetry variables X_1, …, X_q, and r bifurcation variables
Y_1, …, Y_r are observed. Assuming a linear relationship between a group of observed
variables and the corresponding latent construct, the following linear combinations
for each of the three subconstructs can be used for stochastic cusp modeling:
z = w_0 + w_1·Z_1 + w_2·Z_2 + ⋯ + w_p·Z_p   (14)
x = α_0 + α_1·X_1 + α_2·X_2 + ⋯ + α_q·X_q   (15)
y = β_0 + β_1·Y_1 + β_2·Y_2 + ⋯ + β_r·Y_r   (16)
With the multivariate specification of a cusp model, the statistical solution becomes
much more complex, as compared with the polynomial regression method as
previously described in Sect. 6. Several researchers, including Oliva (Oliva et al.
1987), Lange and Oliva (Lange et al. 2000), and Cobb (Cobb and Ragade 1978;
Cobb 1981; Cobb and Zacks 1985) have proposed and tested various statistical
methods based on the likelihood theory with cross-sectional data and multivariate
predictor variables. Assuming that only the two control variables are multivariate,
Cobb established a set of computing methods for parameter estimation (Cobb
and Ragade 1978; Cobb 1981; Cobb and Zacks 1985), but no computing software
was developed. Following Cobb's approach, Lange and Oliva (Lange et al. 2000)
developed the software GEMCAT II and used it in research (Lange et al. 2000, 2004).
However, this method only allows the two control variables to be multivariate.
Based on previous work (Flay 1978; Cobb and Watson 1980; Cobb 1981; Cobb
et al. 1983; Cobb and Zacks 1985; Oliva et al. 1987; van der Maas
et al. 2003), the R package "cusp" was developed and reported by Grasman and
colleagues (2009). In developing the software, the Broyden–Fletcher–Goldfarb–
Shanno algorithm with bounds (Zhu et al. 1997) was used to minimize the negative
log-likelihood function for the optimal solution of the cusp model. The package is very efficient for
modeling analysis. In addition to fitting the cusp catastrophe to data, this R-based
cusp package contains a number of functions for modeling analysis, including utility
functions to generate observations from the estimated cusp density, to evaluate
the density and cumulative distribution function, to evaluate data-model fit, and
to display the modeling results, including plots. Different from the polynomial
regression method, there is no need for researchers to convert the data; the cusp
package normalizes the data using a QR decomposition approach
before modeling analysis.
R² = 1 − Var(error)/Var(z)   (18)
However, since the relationship between the predictors and the outcome variable
is only implicitly expressed in a cusp model, we cannot use the method of linear
regression models, where the relationship between the predictor and the outcome
variables is explicitly specified. Therefore, the methods used to compute the variance
of the error for linear regression cannot be used to determine the variance of the
error for cusp models. To estimate a conceptually similar R², the cusp package adopts
two different conventions to evaluate the variances: the delay convention (with the mode of
the density closest to the observed state value as the expected value) and the Maxwell
convention (with the mode at which the density is highest as the expected value).
To distinguish it from the R² computed in linear regression, the R² in the cusp package is
termed pseudo-R². As usual, a larger R² indicates a better data-model fit.
Fourth, the Akaike Information Criterion (AIC) (Akaike 1974) and the Bayesian
Information Criterion (BIC) (Gelfand and Dey 1994) are computed for cusp models
as well as for the corresponding linear and logistic regression models. Statistically, a
model with a smaller AIC or BIC indicates a better data-model fit.
The R-based cusp package is relatively easy to use by following the steps described
below.
Step 1: Prepare data and save it in .csv format for R modeling
You can prepare your data using any software, including Excel, SAS, or SPSS,
and name the variables as usual. After quality checking, save the data in ".csv"
format (comma-separated values). CSV is one of the most
commonly used data formats for R to read for analysis.
Step 2: Install R and the cusp package
R is a free software environment for statistical computing and graphics distributed
through the Comprehensive R Archive Network (CRAN) (http://www.r-project.
org/). You can install the software on your computer by:
(1) Searching the web using the key phrase "download R," finding the link, and
downloading and installing R on your computer; or
(2) Going directly to the official R website: http://cran.r-project.org/.
After R is installed (or if your computer already has R installed), you can
install the cusp package on your computer:
(1) Run R on your computer.
(2) Click the tab "Packages" at the top of the "R Console" screen, then click "Install
package(s)"; a list of countries/regions (CRAN mirrors) appears. Select one location near your
physical location by clicking on it. You will then see a long list of numerous
statistical packages. Scroll down the list to locate the word "cusp," which is the
name of the cusp analysis package.
(3) Click on "cusp." In a little while, this package will be automatically installed
on your computer. (Equivalent console commands are shown below.)
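Equivalently, if you prefer typing commands, the same installation can be done from the R console (an internet connection is assumed):

install.packages("cusp")    # download and install the cusp package from a CRAN mirror
library(cusp)               # load the package into the current R session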
Step 3: Develop your R code for the analysis
The R code includes the following key components: (a) read in the data; (b) specify the
model; (c) instruct the program to fit the model; (d) output the modeling results; and (e) output
the graphic results.
Step 1: Prepare the data
For simplicity, we assume that the obtained data are manually entered into the
computer and then saved as "data4cusp.csv" on drive C with the path "C:\cusp\data\".
The three variables in data4cusp.csv are: gripstrength (scores ranging from 1 to
10), cytokine (a proxy of inflammation), and cognition (scale scores assessing
executive functioning).
Step 2: Read data into R and obtain basic statistics of the data
## read the saved data set into an R data frame
## and name it "datcusp"
# header = TRUE indicates that the first row of the file
#   contains the variable names
# na.strings = "." indicates that "." is used for missing data
# (forward slashes are used in the file path so that R can read it)
datcusp <- read.csv("C:/cusp/data/data4cusp.csv",
                    header = TRUE, na.strings = ".")

## check the data
# list the variables in the dataset datcusp
names(datcusp)
# compute basic statistics of all variables in the data
summary(datcusp)
Step 3: Prepare a dataset "datmodel" for modeling
In actual data analysis, researchers may have a larger number of variables
included in the csv dataset. After reading in the data in Step 2 above, these variables
will all be available for analysis. The data.frame() function can be used to select
the specific variables needed for modeling from the longer list of variables:

datmodel <- data.frame(z = datcusp$gripstrength,
                       y = datcusp$cognition,
                       x = datcusp$cytokine)
Step 4: Modeling analysis
The following R statement conducts the cusp modeling analysis using the dataset
"datmodel" created in the previous step. In the cusp() call, y ~ z is equivalent to Eq. (14):
z = w_0 + w_1·gripstrength; likewise, alpha ~ x is equivalent to Eq. (15) and beta ~ y is
equivalent to Eq. (16).

# fit the cusp model
fit <- cusp(y ~ z, alpha ~ x, beta ~ y,
            data = datmodel)
Step 5: Modeling results
Results from the cusp catastrophe modeling can be obtained by calling the
following summary statements from the cusp package. The first one gives the
general results, and the second one produces results of alternative models for
comparison, including results from logistic regression models.

# cusp modeling results that can be formatted
# for better presentation
summary(fit)
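The second summary statement referred to above relies on the logist argument of summary() in the cusp package; a minimal sketch, assuming the fitted object fit from Step 4:

# results of alternative models for comparison,
# including a logistic-curve fit
summary(fit, logist = TRUE)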
Table 3 Comparison of the cusp catastrophe modeling with multiple linear regression modeling
and nonlinear logistic regression modeling

          Model coefficients                      Data-model fit
Model     Cytokine          Cognition             −Log likelihood   R²       AIC     BIC
Cusp      0.1349 (<0.001)   0.1599 (0.013)        1,145.84          0.7918   2,303   2,332
Linear    1.0481 (0.020)    2.3129 (<0.001)       3,408.94          0.0543   6,825   6,845
Logistic  0.0573 (<0.001)   0.1153 (0.003)        3,408.11          0.0559   6,826   2,332
Table 3 summarizes the results from the cusp modeling analysis. First of all, the
model coefficients for the cusp model were all statistically significant. Second, among the
three models, the −log likelihood and the AIC were the smallest and the R² was
the largest for the cusp model. Evidence from these results suggests that grip
strength follows a cusp process. However, since the BIC was the same for both the
cusp and the logistic regression models and the model coefficients also point in
the same direction, this evidence suggests that if we do not consider the variance
explained by a model, logistic regression may also provide an approach to assess
factors related to grip strength.
Since cusp modeling analysis is still in its early stages, you may encounter
unexpected results. For example, you may find a very good data-model fit, but
results that are not consistent with the graphic presentation. We are investigating
these inconsistencies in our funded project.
References
Armitage, C.J., Conner, M.: Efficacy of the theory of planned behaviour: a meta-analytic review.
Br. J. Soc. Psychol. 40(Pt 4), 471–499 (2001)
Barnes, R.D., Blomquist, K.K., et al.: Exploring pretreatment weight trajectories in obese patients
with binge eating disorder. Compr. Psychiatry 52(3), 312–318 (2011)
Beyer, I., Bautmans, I., et al.: Effects on muscle performance of NSAID treatment with piroxicam
versus placebo in geriatric patients with acute infection-induced inflammation. A double blind
randomized controlled trial. BMC Musculoskelet. Disord. 12, 292 (2011)
Byrne, D.G.: A cusp catastrophe analysis of changes to adolescent smoking behaviour in response
to smoking prevention programs. Nonlinear Dynamics Psychol. Life Sci. 5, 115–137 (2001)
Chen, X., Brogan, K.: Developmental trajectories of overweight and obesity of US youth through
the life course of adolescence to young adulthood. Adolesc. Health Med. Ther. 3, 33–42 (2012)
Chen, X., Jacques-Tiura, A.J.: Smoking initiation associated with specific periods in the life course
from birth to young adulthood: data from the National Longitudinal Survey of Youth 1997. Am.
J. Public Health 104(2), e119–e126 (2014)
Chen, X.: [Xin Wei Yi Xue] Behavioral Medicine. Shanghai Scientific Publication House,
Shanghai (1989)
Chen, X., Lunn, S., et al.: A cluster randomized controlled trial of an adolescent HIV prevention
program among Bahamian youth: effect at 12 months post-intervention. AIDS Behav. 13(3),
499–508 (2009)
Chen, X., Lunn, S., et al.: Modeling early sexual initiation among young adolescents using
quantum and continuous behavior change methods: implications for HIV prevention. Nonlinear
Dynamics Psychol. Life Sci. 14(4), 491–509 (2010a)
Chen, X., Stanton, B., et al.: Effects on condom use of an HIV prevention programme 36 months
postintervention: a cluster randomized controlled trial among Bahamian youth. Int. J. STD
AIDS 21(9), 622–630 (2010b)
Chen, X., Gong, J., et al.: Cusp catastrophe modeling of cigarette smoking among vocational high
school students. Conference Paper on the 140th American Public Health Association Meeting
(2012)
Chen, X., Stanton, B., et al.: Intention to use condom, cusp modeling, and evaluation of an HIV
prevention intervention trial. Nonlinear Dynamics Psychol. Life Sci. 17(3), 385–403 (2013)
Chen, D., Chen, X., et al.: Cusp catastrophe polynomial model: power and sample size. Open J.
Stat. 4(4), 803–813 (2014a)
Chen, D.G., Lin, F., et al.: Cusp catastrophe model: a nonlinear model for health outcomes in
nursing research. Nurs. Res. 63(3), 211–220 (2014b)
Chitnis, M., Cushing, J.M., et al.: Bifurcation analysis of a mathematical model for malaria
transmission. J. SIAM Appl. Math. 67(1), 24–45 (2006)
Clair, S.: A cusp catastrophe model for adolescent alcohol use: an empirical test. Nonlinear
Dynamics Psychol. Life Sci. 2(3), 217–241 (2004)
Cobb, L.: Stochastic differential equations for social sciences. In: Cobb, L., Thrall, R.M. (eds.)
Mathematical Frontiers of the Social and Policy Sciences. Westview Press, Boulder (1981)
Cobb, L., Ragade, R.K.: Applications of catastrophe theory in the behavioral and life sciences.
Behav. Sci. 23, 291–419 (1978)
Cobb, L., Watson, B.: Statistical catastrophe theory: an overview. Math. Model. 1(4), 311–317
(1980)
Cobb, L., Zacks, S.: Applications of catastrophe theory for statistical modeling in the biosciences.
J. Am. Stat. Assoc. 80(392), 793–802 (1985)
Cobb, L., Koppstein, P., et al.: Estimation and moment recursion relations for multimodal
distributions of the exponential family. J. Am. Stat. Assoc. 78(381), 124–130 (1983)
Flay, B.R.: Catastrophe theory in social psychology: some applications to attitudes and social
behavior. Behav. Sci. 23(5), 335–350 (1978)
Gilmore, R.: Catastrophe Theory for Scientists and Engineers. Dover Publications, New York
(1993)
Godin, G., Kok, G.: The theory of planned behavior: a review of its applications to health-related
behaviors. Am. J. Health Promot. 11(2), 87–98 (1996)
Gong, J., Stanton, B., et al.: Effects through 24 months of an HIV/AIDS prevention intervention
program based on protection motivation theory among preadolescents in the Bahamas.
Pediatrics 123(5), e917–e928 (2009)
Grasman, R.P.P.P., van der Maas, H.L.J., et al.: Fitting the cusp catastrophe in R: a cusp package primer. J. Stat.
Softw. 32(8), 1–27 (2009)
Guastello, S.J.: Moderator regression and the cusp catastrophe: an application of two-stage
personal selection, training, therapy and policy evaluation. Behav. Sci. 27, 259–272 (1982)
Guastello, S.J.: Catastrophe modeling of the accident processes: evaluation of an accident reduction
program using the occupational hazards survey. Accid. Anal. 21, 17–28 (1989)
Guastello, S.J., Aruka, Y., et al.: Cross-cultural generalizability of a cusp catastrophe model for
binge drinking among college students. Nonlinear Dynamics Psychol. Life Sci. 12(4), 397–407
(2008)
Guastello, S.J., Boeh, H., et al.: Cusp catastrophe models for cognitive workload and fatigue in a
verbally cued pictorial memory task. Hum. Factors 54(5), 811–825 (2012)
Hartelman, P.A., et al.: Detecting and modelling developmental transitions. Br. J. Dev. Psychol. 16,
97–122 (1998)
Jones, B.L., Nagin, D.S.: Advances in group-based trajectory modeling and an SAS procedure for
estimating them. Sociol. Methods Res. 35(4), 542–571 (2007)
Lachman, M.E., Agrigoroaei, S., et al.: Frequent cognitive activity compensates for education
differences in episodic memory. Am. J. Geriatr. Psychiatry 18(1), 4–10 (2011)
Lange, R., Oliva, T.A., et al.: An algorithm for estimating multivariate catastrophe models:
GEMCAT II. Stud. Nonlinear Dyn. Econ. 4(3), 137–168 (2000)
Lange, R., McDade, S.R., et al.: The estimation of a cusp model to describe the adoption of word
for windows. J. Prod. Innov. Manag. 21(1), 15–32 (2004)
MacDonald, S.W., DeCarlo, C.A., et al.: Linking biological and cognitive aging: toward improving
characterizations of developmental time. J. Gerontol. B Psychol. Sci. Soc. Sci. 66(Suppl. 1),
i59–i70 (2011)
Mazanov, J., Byrne, D.G.: A cusp catastrophe model analysis of changes in adolescent substance
use: assessment of behavioural intention as a bifurcation variable. Nonlinear Dynamics
Psychol. Life Sci. 10(4), 445–470 (2006)
Muthen, B., Muthen, L.K.: Integrating person-centered and variable-centered analyses: growth
mixture modeling with latent trajectory classes. Alcohol. Clin. Exp. Res. 24(6), 882–891 (2000)
Nagin, D.S.: Group-Based Modeling of Development. Harvard University Press, Cambridge
(2005)
Oliva, T.A., Desarbo, W.S., et al.: GEMCAT – a general multivariate methodology for estimating
catastrophe models. Behav. Sci. 32(2), 121–137 (1987)
Poston, T., Stewart, I.N.: Catastrophe Theory and Its Applications. Dover, New York (1996)
Smerz, K.E., Guastello, S.J.: Cusp catastrophe model for binge drinking in a college population.
Nonlinear Dynamics Psychol. Life Sci. 12(2), 205–224 (2008)
Stewart, I.N., Peregoy, P.L.: Catastrophe theory modeling in psychology. Psychol. Bull. 94(2),
336–362 (1983)
Thom, R.: Structural Stability and Morphogenesis: Essai D’une Theorie Generale Des Modeles.
W. A. Benjamin, California (1973)
Thom, R.: Structural Stability and Morphogenesis. Benjamin-Addison-Wesley, New York (1975)
van der Maas, H.L., Kolstein, R., et al.: Sudden transitions in attitudes. Sociol. Methods Res. 23(2),
125–152 (2003)
West, R., Sohal, T.: “Catastrophic” pathways to smoking cessation: findings from national survey.
BMJ 332(7539), 458–460 (2006)
Wong, E.S., Wang, B.C., et al.: BMI trajectories among the severely obese: results from an
electronic medical record population. Obesity (Silver Spring) 20(10), 2107–2112 (2012)
Wu, Y., Stanton, B.F., et al.: Protection motivation theory and adolescent drug trafficking:
relationship between health motivation and longitudinal risk involvement. J. Pediatr. Psychol.
30(2), 127–137 (2005)
Zeeman, E.: Catastrophe theory in brain modelling. Int. J. Neurosci. 6, 39–41 (1973)
Zeeman, E.: On the unstable behavior of the stock exchanges. J. Math. Econ. 1, 39–44 (1974)
Zeeman, E.C.: Catastrophe theory. Scientific American. 234(4), 65–83 (1976).
Zhu, C., Byrd, R., et al.: L-BFGS-B: Fortran subroutines for large-scale bound constrained
optimization. ACM Trans. Math. Softw. 23(4), 550–560 (1997)
On Ranked Set Sampling Variation
and Its Applications to Public Health Research
Keywords Ranked set sample (RSS) • Extreme ranked set sample (ERSS)
• Median ranked set sample (MRSS) • Simple random sample (SRS)
• Simulation • Naive estimator • Regression estimator • Ratio estimator
• Normal data • Concomitant variable • Varied set size ranked set sampling
(VSRSS) • Bilirubin • Quantiles • Bivariate ranked set sampling (BVRSS)
• Clinical trials
1 Introduction
H. Samawi ()
Department of Biostatistics, Jiann-Ping Hsu College of Public Health, Georgia Southern
University, Hendricks Hall, Room 1006, Statesboro, GA 30460, USA
e-mail: [email protected]
R. Vogel ()
Department of Biostatistics, Jiann-Ping Hsu College of Public Health, Georgia Southern
University, Hendricks Hall, Room 1013, Statesboro, GA 30460, USA
e-mail: [email protected]
visually by observing the color of the face, the color of the chest, the color of the lower part of
the body, and the color of the terminal parts of the whole body. As the yellowing progresses
from the face to the terminal parts, the level of bilirubin in the blood increases (Nelson et al.
1992; Samawi and Al-Sageer 2001).
In some circumstances, considerable cost savings can be achieved if the number
of measured sampling units is only a small fraction of the number of available units,
yet all units contribute to the information content of the measured units. Ranked set
sampling (RSS) is a method of sampling that can achieve this goal. RSS was first
introduced by McIntyre (1952). RSS is highly powerful and superior to
standard simple random sampling (SRS) for estimating some of the population
parameters. The RSS procedure is obtained by selecting r random sets, each of
size r, from the target population. In most practical situations, the set size r will be 2,
3, or 4. Rank each set by a suitable method of ranking, such as prior information
or visual inspection. In sampling notation, let Xij denote the jth observation in the
ith set and Xi(j) the jth order statistic in the ith set. X1(1), X2(2), …, Xr(r) are
quantified by obtaining the element with the smallest rank from the first set, the
second smallest from the second set, and so on until the largest unit from the rth
set is measured; this represents one cycle of RSS. We can repeat the whole
procedure m times to get an RSS of size n = mr.
A variety of extreme ranked set sample (ERSS) procedures have been introduced
and investigated by Samawi et al. (1996) to estimate the population mean. Similar
to RSS, in ERSS we only quantify the minimum and the maximum ranked
observations. In the case of symmetric populations, Samawi et al. (1996) showed
that the ERSS procedure gives an unbiased estimate of the population mean and
is more efficient than the SRS mean, using the same number of quantified
units. Recently, ERSS has been applied in genetics to quantitative trait loci (QTL) mapping
(see Chen 2007). Chen indicated that, since the frequency of the Q allele in the
general population is small, instead of drawing a simple random sample
(SRS) from the population, one of the approaches adopted for detecting QTL using
population data is to truncate the population at a certain quantile of the distribution
of Y and take a random sample from the truncated portion as well as a random sample from
the whole population. The two samples drawn are genotyped and compared on the
number of Q alleles. If a significant difference exists, the candidate QTL is
claimed as a true QTL (see Slatkin 1999; Xu et al. 1999; Chen 2007). However, this
approach requires that a large number of individuals be screened before a sample
can be taken from the truncated portion and hence it is not practical. Alternatively,
the ERSS can be used as follows: individuals are taken in sets and the individuals
within each set are ranked according to their trait values. The one with the largest
trait value is put into an upper sample and the one with the smallest trait value is put
into a lower sample. The two samples obtained this way are then genotyped
and compared. The ERSS approach has also recently been applied to linkage disequilibrium
mapping of QTL by Chen et al. (2005).
The ERSS has been applied to a sib-pair regression model where extremely
concordant and/or discordant sib-pairs are selected by the ERSS (see Zheng et al.
2006). As indicated by Chen (2007), the ERSS approach can be applied also to
many other genetic problems such as the transmission disequilibrium test (TDT)
(Spielman et al. 1993) and the gamete competition model (Sinsheimer et al. 2000).
Another variation of RSS is the median ranked set sampling (MRSS) investigated
by Muttlak (1997). The ratio estimator using RSS was investigated by Samawi
and Muttlak (1996). The ratio estimator is used to obtain increased precision for
estimating the population mean or total by taking advantage of the correlation
between an auxiliary variable X and the variable of interest Y. Samawi and Muttlak
(2001) used MRSS in ratio estimation. They showed that MRSS gives an approximately
unbiased estimate of a population ratio in the case of symmetric populations
and is more efficient than SRS, using the same number of quantified units.
Moreover, Al-Saleh and Al-Kadiri (2000) showed that the efficiency in estimating
the population mean can be improved even further by using a double ranked set
sampling (DRSS) technique. Samawi (2002) suggested double extreme ranked set
sampling (DERSS) for the mean and ratio estimators. Additional information about
RSS and its applications can be found in Kaur et al. (1995) and Patil et al. (1999).
Stratified RSS was introduced by Samawi (1996) and used to improve ratio
estimation by Samawi and Siam (2003). A varied set size RSS (VSRSS) has been
introduced and investigated by Samawi (2011) for estimating population means
and ratios. This approach can be useful in queuing and epidemiology studies where
cases come in batches of different sizes.
Research on the estimation of multiple characteristics has been performed by Patil et al.
(1993, 1994) and Norris et al. (1995). They used a bivariate ranked set sampling
(BVRSS) procedure ranking on only one of the characteristics (X or Y). However,
BVRSS with ranking on both characteristics (X and Y) was introduced by Al-Saleh and
Zheng (2002). They indicated that the BVRSS procedure could easily be extended to a
multivariate one. The performance of BVRSS in comparison with RSS and SRS for
estimating the population means, using ratio and regression estimators, was considered
by Samawi and Al-Saleh (2007).
Another application of RSS is in treatment comparison experiments,
including some clinical trials. In RSS, many more sampling units are sampled and
discarded than are eventually fully measured. This might not be desirable in
situations where sampling units are not easy to obtain, which is especially the case
in clinical trials. Ozturk and MacEachern (2004) and Zheng et al. (2006) separately
considered an RSS approach which generates RSSs for each treatment without
discarding any sampling units (see Chen 2007). The approach as described by Zheng
et al. (2006) is as follows: assume that the response variable (Y) is correlated with
a common concomitant variable (X). Let the set size k in RSS be even. The RSS is
carried out two sets at a time. That is, each time two random sets of experimental
units are taken and ranked separately according to the values of X. For the first
ranked set, units with odd ranks are assigned to treatment 1 and units with even
ranks are assigned to treatment 2. For the second ranked set, units with odd ranks
are assigned to treatment 2 and units with even ranks are assigned to treatment 1.
This process produces two correlated general RSS samples, one for each treatment,
and it does not discard any experimental units. It is shown in Zheng et al. (2006) that
this method of treatment assignment is much more efficient than a simple random
assignment.
The RSS procedure can be obtained by selecting r random sets, each of size r, from the
target population. In most practical situations, the set size r will be 2, 3, or 4.
Rank each set by a suitable method of ranking, such as prior information, visual
inspection, or by the experimenter himself. In sampling notation, let Xij denote the
jth observation in the ith set and Xi(j) the jth order statistic in the ith set. If only
X1(1), X2(2), …, Xr(r) are quantified, by obtaining the element with the smallest rank
from the first set, the second smallest from the second set, and so on until the largest
unit from the rth set is measured, then this represents one cycle of RSS. We can
repeat the whole procedure m times to get an RSS of size n = mr.
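As an illustration of this procedure, the following is a minimal R sketch (not part of the original text) of drawing an RSS of size n = mr from a population vector pop, assuming that the judgment ranking within each set is perfect (i.e., based on the true values):

# draw a ranked set sample of size m*r from "pop" (perfect ranking assumed)
rss_sample <- function(pop, r, m) {
  out <- numeric(0)
  for (cycle in 1:m) {
    for (i in 1:r) {
      set <- sample(pop, r)            # a random set of r units
      out <- c(out, sort(set)[i])      # quantify only the ith ranked unit
    }
  }
  out
}

# example: RSS of size n = 3 * 4 = 12 from a simulated population
set.seed(1)
pop <- rnorm(10000, mean = 50, sd = 10)
x_rss <- rss_sample(pop, r = 3, m = 4)
mean(x_rss)                            # RSS estimate of the population mean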
The variance of the RSS sample mean X̄_RSS is given by

Var(X̄_RSS) = σ²/n − (1/(mr²)) Σ_{i=1}^{r} (μ_(i) − μ)²,   (1)

where μ_(i) = E[X_(i)] and σ² is the population variance of X. See Takahasi and
Wakimoto (1968).
For 0 < p < 1, the population pth quantile is defined as ξ_p = inf{x : F(x) ≥ p} and is
denoted by F⁻¹(p). Suppose X_1, X_2, …, X_n is an SRS of size n from a population.
Then, for a given t, F(t) can be estimated by

F̂(t) = (1/n) Σ_{i=1}^{n} I(X_i ≤ t),   (2)

where I(·) is an indicator function. Clearly, E[F̂(t)] = F(t) and Var[F̂(t)] =
(1/n) F(t)(1 − F(t)).
Let X_(1), X_(2), …, X_(n) be the order statistics of an SRS of size n. Then ξ_p can be
estimated by the sample pth quantile, which is defined as follows:

ξ̂_p = X_(np) if np is an integer, and ξ̂_p = X_([np]+1) if np is not an integer,   (3)

see Serfling (1980). Under some mild conditions on F(t), Bahadur (1966) showed
that √n (ξ̂_p − ξ_p) converges in distribution to N(0, p(1 − p)/f²(ξ_p)).
Based on an RSS, F(t) can be estimated by

F*(t) = (1/n) Σ_{i=1}^{r} Σ_{k=1}^{m} I(X_(i)k ≤ t)   (4)

with variance

Var(F*(t)) = (1/n) { F(t) − Σ_{i=1}^{r} [I_{F(t)}(i, r − i + 1)]²/r },   (5)

where I_{F(t)}(i, r − i + 1) is the incomplete beta function. Also, if f_(i)(x), i = 1,
2, …, r, is positive in a neighborhood of ξ_p and is continuous at ξ_p, Samawi
(2001) and Chen (2000) showed that √n (ξ̃_p − ξ_p) converges in distribution to

N( 0, { p − Σ_{i=1}^{r} [I_p(i, r − i + 1)]²/r } / f²(ξ_p) ),

where ξ̃_p is the sample quantile based on the RSS.
Samawi (2001) showed that the relative efficiency of RSS relative to SRS for
estimating quantiles, for different values of p, ranges from 1.05 to 1.77
for r = 3, and the efficiency increases as the set size r increases. As an application of
quantile estimation using RSS, Samawi (2001) illustrated the method using
data from the Iowa 65+ Rural Health Study. He found the normal ranges (p = 0.05 and
p = 0.95) of the hemoglobin level in the blood of women aged 70+ who were disease
free.
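A small simulation sketch of this comparison, reusing the rss_sample() helper sketched in the procedure above, with R's default sample quantile standing in for Eq. (3):

# compare SRS and RSS estimates of the p = 0.95 quantile by simulation
set.seed(2)
pop <- rexp(50000, rate = 1/10)          # a skewed simulated population
xi_true <- quantile(pop, 0.95)           # "true" population quantile

est <- replicate(2000, {
  srs <- sample(pop, 12)                 # SRS with 12 quantified units
  rss <- rss_sample(pop, r = 3, m = 4)   # RSS with the same number of quantified units
  c(srs = quantile(srs, 0.95), rss = quantile(rss, 0.95))
})
rowMeans((est - xi_true)^2)              # simulated mean squared errors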
The ratio estimator is used to improve the precision of estimating the population mean
of a variable of interest (Y) using some concomitant variable (X). Let (X, Y) have
the c.d.f. F(x, y) with means μ_x and μ_y, variances σ²_x and σ²_y, and correlation coefficient
ρ; then R = μ_y/μ_x denotes the ratio of means. Using a simple bivariate random sample
from F(x, y), the estimator of R is given by R̂_SRS = Ȳ/X̄, where X̄ and Ȳ are the sample
means of X and Y, respectively.
Hansen et al. (1953) showed that the variance of R̂_SRS can be approximated by

Var(R̂_SRS) ≈ (R²/n)(V²_x + V²_y − 2ρV_xV_y),   (6)

where V_x = σ_x/μ_x, V_y = σ_y/μ_y, and ρ = E[(X − μ_x)(Y − μ_y)]/(σ_xσ_y).
The ratio estimator using RSS data, when ranking is on the variable X with errors in
ranking for the variable Y, is given by Samawi and Muttlak (1996) as R̂_RSS = Ȳ_[r]/X̄_(r),
where Ȳ_[r] = (1/n) Σ_{k=1}^{m} Σ_{i=1}^{r} Y_i[i]k and X̄_(r) = (1/n) Σ_{k=1}^{m} Σ_{i=1}^{r} X_i(i)k are the sample means using
RSS and n = mr. Note that (·) denotes perfect ranking while [·] denotes ranking with
error. Also, they showed that the approximate variance of R̂_RSS is given by

Var(R̂_RSS) ≈ (R²/n) { V²_x + V²_y − 2ρV_xV_y − (1/r) Σ_{i=1}^{r} (M_x(i) − M_y[i])² },   (7)

where M_x(i) = (μ_x(i) − μ_x)/μ_x and M_y[i] = (μ_y[i] − μ_y)/μ_y.
Moreover, Samawi and Muttlak (1996) showed that if ranking on X is perfect and
ranking on Y is with error, then this ratio estimator is more efficient
than when ranking on Y is perfect and ranking on X is with error.
They also showed that the relative efficiency of RSS relative to SRS for estimating
the ratio (the population mean using the ratio estimate), when ranking on X, ranges
from 1.62 to 1.88 for r = 3, and the efficiency increases as r increases.
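The following is a minimal R sketch (with hypothetical simulated data) of computing the SRS and RSS ratio estimators when ranking is on X and the Y value attached to each selected unit is recorded:

# ratio estimators of R = mu_y / mu_x under SRS and RSS (ranking on X)
set.seed(3)
N <- 20000
x <- rgamma(N, shape = 4, scale = 5)
y <- 2 * x + rnorm(N, sd = 5)             # Y correlated with X

r <- 3; m <- 4; n <- r * m

# SRS ratio estimator
idx <- sample(N, n)
R_srs <- mean(y[idx]) / mean(x[idx])

# RSS ratio estimator: in each set of r units, rank on X and keep one pair
sets  <- t(replicate(n, sample(N, r)))                 # n random sets of unit indices
ranks <- rep(1:r, m)                                   # rank quantified in each set
keep  <- sapply(1:n, function(s) sets[s, order(x[sets[s, ]])[ranks[s]]])
R_rss <- mean(y[keep]) / mean(x[keep])                 # = Ybar[r] / Xbar(r)

c(R_srs = R_srs, R_rss = R_rss, R_true = mean(y) / mean(x))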
ratio of the two variables. Regression estimators using SRS and RSS have been investigated
by Sukhatme and Sukhatme (1970) and Yu and Lam (1997), respectively.
Let (X_i, Y_i), i = 1, 2, …, n, be a bivariate sample from F(x, y), and assume that

Y_i = μ_y + β(X_i − μ_x) + ε_i,   (8)

where μ_x and μ_y are the means of X and Y, respectively, and, for fixed X_i, the ε_i,
i = 1, 2, …, n, are i.i.d. with mean zero and variance σ²_ε.
When μ_x is unknown, the method of double sampling can be used to obtain an
estimate of μ_x. This involves drawing a large random sample of size n′, which
is used to estimate μ_x. Then a sub-sample of size n is selected from the originally
selected units to study the primary characteristic Y. Set n′ = r³m and n =
r(rm) = r²m when the first- and second-phase samples are both drawn by the
SRS scheme. Then the double-sampling regression estimator Ȳ_ds is given by

Ȳ_ds = Ȳ_SRS + β̂ (X̄′ − X̄_SRS),   (9)

where X̄_SRS = (1/n) Σ X_i, Ȳ_SRS = (1/n) Σ Y_i, X̄′ is the sample mean of X based on the r³m
observations from the first phase, and

β̂ = Σ (X_i − X̄_SRS)(Y_i − Ȳ_SRS) / Σ (X_i − X̄_SRS)².
If the assumption of the linear relationship in (8) is invalid, then the SRS
regression estimator in (9) is in general a biased estimator of μ_y.
Using the bivariate RSS, ranking only on X, assume that the relationship between
Y_[i]k and X_(i)k is

Y_[i]k = μ_y + β(X_(i)k − μ_x) + ε_[i]k,   (11)

i = 1, 2, …, r and k = 1, 2, …, rm. Again, when μ_x is unknown, the method of double
sampling (two-phase sampling) can be used to obtain an estimate of μ_x. Set n′ =
r³m and n = r(rm) = r²m, where the first-phase sample is an SRS and the
second-phase sample is an RSS. Then the double-sampling regression estimator
Ȳ_Rds based on RSS is given by Yu and Lam (1997) as:
Ȳ_Rds = Ȳ_RSS + β̂_RSS (X̄′ − X̄_RSS),   (12)

where X̄′ = (1/n′) Σ_{i=1}^{n′} X_i is the sample mean of X based on the r³m observations
of the first phase, X̄_RSS = (1/n) Σ_{k=1}^{rm} Σ_{i=1}^{r} X_(i)k, Ȳ_RSS = (1/n) Σ_{k=1}^{rm} Σ_{i=1}^{r} Y_[i]k, and

β̂_RSS = Σ_{k=1}^{rm} Σ_{i=1}^{r} (X_(i)k − X̄_RSS)(Y_[i]k − Ȳ_RSS) / Σ_{k=1}^{rm} Σ_{i=1}^{r} (X_(i)k − X̄_RSS)².

By using the basic properties of conditional moments, Yu and Lam (1997)
showed that, under model (11), Ȳ_Rds is an unbiased estimator of μ_y and its variance is
given by

Var(Ȳ_Rds) = (σ²_ε/n) [ 1 + E( (Z̄_RSS − Z̄′)² / S²_zR ) ] + (1/(rn)) ρ²σ²_y,   (13)

where Z̄′ = (X̄′ − μ_x)/σ_x, Z_(i)k = (X_(i)k − μ_x)/σ_x, Z̄_RSS = (1/n) Σ_{k=1}^{rm} Σ_{i=1}^{r} Z_(i)k,
S²_zR = (1/n) Σ_k Σ_i (Z_(i)k − Z̄_RSS)², and n = r²m.
Moreover, if the assumption of the linear relationship is invalid, the RSS regression
estimator in (12) is in general a biased estimator of μ_y.
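A minimal R sketch of the double-sampling regression estimator in Eq. (12), with simulated data and perfect ranking on X (the sub-sampling details are simplified for illustration):

# double-sampling regression estimator of mu_y based on RSS (Eq. 12)
set.seed(7)
N <- 100000
x <- rnorm(N, 50, 10)
y <- 5 + 0.8 * (x - 50) + rnorm(N, 0, 4)

r <- 3; m <- 4
n1 <- r^3 * m                            # first phase: SRS, X measured only
n  <- r^2 * m                            # second phase: RSS, (X, Y) measured

phase1 <- sample(N, n1)
xbar1  <- mean(x[phase1])                # first-phase estimate of mu_x

# second phase: RSS of size r^2 * m drawn from the first-phase units, ranking on X
sets  <- t(replicate(n, sample(phase1, r)))
ranks <- rep(1:r, length.out = n)
keep  <- sapply(1:n, function(s) sets[s, order(x[sets[s, ]])[ranks[s]]])

xbar_rss <- mean(x[keep]); ybar_rss <- mean(y[keep])
beta_rss <- sum((x[keep] - xbar_rss) * (y[keep] - ybar_rss)) /
            sum((x[keep] - xbar_rss)^2)
ybar_Rds <- ybar_rss + beta_rss * (xbar1 - xbar_rss)   # Eq. (12)
ybar_Rds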
The varied set size ranked set sample (VSRSS) is investigated by Samawi (2011).
The VSRSS is obtained by randomly selecting c sets of different sizes, say
{k_1², k_2², …, k_c²}. Apply the scheme of RSS on each set separately to obtain c
RSSs of sizes {k_1, k_2, …, k_c}, respectively. This will produce a VSRSS of size
n = Σ_{l=1}^{c} k_l. Then {X_1(1):k_1, …, X_k_1(k_1):k_1; X_1(1):k_2, …, X_k_2(k_2):k_2; …; X_1(1):k_c, …, X_k_c(k_c):k_c}
denotes the VSRSS. Note that X_j(j):k_l is the jth order statistic of the jth sample of
the lth set, l = 1, 2, …, c. Let X have a probability density function (p.d.f.) f(x)
and a cumulative distribution function (c.d.f.) F(x), with mean μ and variance σ².
Let X_(j):s denote the jth order statistic from a sample of size s. Furthermore, let
μ_(j):s = E[X_(j):s], σ²_(j):s = Var[X_(j):s], and let f_(j):s(x) and F_(j):s(x) be the p.d.f. and
c.d.f. of X_(j):s, respectively.
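A minimal R sketch (perfect ranking assumed) of drawing a VSRSS from a population vector pop for set sizes k = (k_1, …, k_c):

# draw a varied set size ranked set sample (VSRSS); perfect ranking assumed
vsrss_sample <- function(pop, k) {
  unlist(lapply(k, function(kl) {
    # for set size kl: kl random sets of size kl, keeping the jth order
    # statistic from the jth set
    sapply(1:kl, function(j) sort(sample(pop, kl))[j])
  }))
}

# example: VSRSS of size n = 2 + 3 + 4 = 9
set.seed(4)
pop <- rlnorm(50000, meanlog = 3, sdlog = 0.5)
x_vsrss <- vsrss_sample(pop, k = c(2, 3, 4))
mean(x_vsrss)                             # VSRSS estimate of the population mean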
Theorem 1 (Samawi 2011) Let X have a probability density function (p.d.f.) f(x)
and a cumulative distribution function (c.d.f.) F(x), with mean μ and variance σ².
Then:

(1) Σ_{i=1}^{c} Σ_{j=1}^{k_i} f_(j):k_i(x) = f(x) Σ_{i=1}^{c} k_i;   (14)

(2) Σ_{i=1}^{c} Σ_{j=1}^{k_i} F_(j):k_i(x) = F(x) Σ_{i=1}^{c} k_i.   (15)

(3) Σ_{i=1}^{c} Σ_{j=1}^{k_i} μ_(j):k_i = μ Σ_{i=1}^{c} k_i.   (16)

(4) Σ_{i=1}^{c} Σ_{j=1}^{k_i} σ²_(j):k_i = σ² Σ_{i=1}^{c} k_i − Σ_{i=1}^{c} Σ_{j=1}^{k_i} (μ_(j):k_i − μ)².   (17)

Consequently, the variance of the VSRSS sample mean X̄_VSRSS is

Var(X̄_VSRSS) = σ²/n − (1/n²) Σ_{i=1}^{c} Σ_{j=1}^{k_i} (μ_(j):k_i − μ)²,  where n = Σ_{i=1}^{c} k_i.   (18)
Let the ranking be performed on the variable X. Let (·) denote perfect ranking while
[·] denotes ranking with error. We assume ranking on X is perfect while ranking on Y
is with error. Let

{(X_1(1):k_1, Y_1[1]:k_1), …, (X_k_1(k_1):k_1, Y_k_1[k_1]:k_1); (X_1(1):k_2, Y_1[1]:k_2), …, (X_k_2(k_2):k_2, Y_k_2[k_2]:k_2); …;
(X_1(1):k_c, Y_1[1]:k_c), …, (X_k_c(k_c):k_c, Y_k_c[k_c]:k_c)}

be the VSRSS, where X_j(j):k_i is the jth smallest X unit in the jth bivariate RSS of set
size k_i and Y_j[j]:k_i is the corresponding Y observation, i = 1, 2, …, c.
Let μ̂_x = X̄_(VSRSS) and μ̂_y = Ȳ_[VSRSS], where X̄_(VSRSS) = (1/n) Σ_{i=1}^{c} Σ_{j=1}^{k_i} X_j(j):k_i,
Ȳ_[VSRSS] = (1/n) Σ_{i=1}^{c} Σ_{j=1}^{k_i} Y_j[j]:k_i, and n = Σ_{i=1}^{c} k_i. Also, let σ²_x = Var(X), σ²_y = Var(Y),
σ²_x(j):k_i = Var(X_j(j):k_i), σ²_y[j]:k_i = Var(Y_j[j]:k_i), and σ_x(j)y[j]:k_i = Cov(X_j(j):k_i, Y_j[j]:k_i).
The estimator of the population ratio R using VSRSS is given by

R̂_VSRSS = Ȳ_[VSRSS] / X̄_(VSRSS).   (19)

Assume that the population is large enough so that the sampling fraction n/N is
negligible. Then, by using a Taylor series expansion, the variance of R̂_VSRSS
can be approximated by
Var(R̂_VSRSS) ≈ (R²/n) [ V²_x + V²_y − 2ρV_xV_y − (1/n) Σ_{i=1}^{c} Σ_{j=1}^{k_i} (M_x(j):k_i − M_y[j]:k_i)² ],   (20)
by using SRS as 214,631.4 tons; however, the corresponding figure reported in William et al. (2004) was 238,600
tons. This indicates that using VSRSS is more accurate than using SRS in this case.
An SSRS (for example, see Hansen et al. 1953) is a sampling plan in which
a population is divided into L mutually exclusive and exhaustive strata, and an
SRS of n_h elements is taken and quantified within each stratum h. The sampling
is performed independently across the strata. In essence, we can think of an SSRS
scheme as one consisting of L separate simple random samples.
An SRSS is a sampling plan in which a population is divided into L mutually
exclusive and exhaustive strata, and an RSS of n_h elements is quantified within each
stratum, h = 1, 2, …, L. The sampling is performed independently across the strata.
Therefore, we can think of an SRSS scheme as a collection of L separate ranked set
samples.
Suppose that the population is divided into L mutually exclusive and exhaustive
strata. Let X*_h11, X*_h12, …, X*_h1n_h; X*_h21, X*_h22, …, X*_h2n_h; …; X*_hn_h1, X*_hn_h2, …, X*_hn_hn_h be
n_h independent random samples, each of size n_h, taken from stratum h
(h = 1, 2, …, L). Assume that each element X*_hij in the sample has the same
distribution function F_h(x) and density function f_h(x), with mean μ_h and variance σ²_h.
For simplicity of notation, we will assume that X_hij denotes the quantitative measure
of the unit X*_hij. Then, according to our description, X_h11, X_h21, …, X_hn_h1 could be
considered as the SRS from the hth stratum. Let X_hi(1), X_hi(2), …, X_hi(n_h) be the
order statistics of the ith sample X_hi1, X_hi2, …, X_hin_h (i = 1, 2, …, n_h) taken from
the hth stratum. Then X_h1(1), X_h2(2), …, X_hn_h(n_h) denotes the RSS for the hth
stratum. If N_1, N_2, …, N_L represent the numbers of sampling units within the respective
strata, and n_1, n_2, …, n_L represent the numbers of sampling units measured within
each stratum, then N = Σ_{h=1}^{L} N_h is the total population size and n = Σ_{h=1}^{L} n_h is
the total sample size.
The following notation and results will be used throughout this section. For all
i = 1, 2, …, n_h and h = 1, 2, …, L, let μ_h = E[X_hij], σ²_h = Var(X_hij), μ_h(i) =
E[X_hi(i)], and σ²_h(i) = Var(X_hi(i)) for all j = 1, 2, …, n_h, and let T_h(i) = μ_h(i) − μ_h.
As in Dell and Clutter (1972), one can easily show that for a particular stratum
h (h = 1, 2, …, L), f_h(x) = (1/n_h) Σ_{i=1}^{n_h} f_h(i)(x), and hence Σ_{i=1}^{n_h} μ_h(i) = n_h μ_h, Σ_{i=1}^{n_h} T_h(i) =
0, and Σ_{i=1}^{n_h} σ²_h(i) = n_h σ²_h − Σ_{i=1}^{n_h} T²_h(i).
The mean of the variable X for the entire population is given by

μ = (1/N) Σ_{h=1}^{L} N_h μ_h = Σ_{h=1}^{L} W_h μ_h,   (21)

where W_h = N_h/N.
If, within a particular stratum h, we have selected an SRS of n_h elements
from the N_h elements in the stratum and each sample element is measured with respect
to some variable X, then the estimate of the mean μ_h using an SRS of size n_h is given by

X̄_h = (1/n_h) Σ_{i=1}^{n_h} X_hi1.   (22)
The mean and variance of X̄_h are known to be E[X̄_h] = μ_h and Var(X̄_h) = σ²_h/n_h,
respectively, assuming the N_h's are large enough. The estimate of the population mean
using an SSRS of size n is defined by

X̄_SSRS = (1/N) Σ_{h=1}^{L} N_h X̄_h = Σ_{h=1}^{L} W_h X̄_h.   (23)

The mean and the variance of X̄_SSRS are known to be E[X̄_SSRS] = μ and

Var(X̄_SSRS) = Σ_{h=1}^{L} W²_h σ²_h/n_h.   (24)
Similarly, the estimate of the mean μ_h based on the RSS of size n_h from stratum h is

X̄_h(n_h) = (1/n_h) Σ_{i=1}^{n_h} X_hi(i).   (25)

It can be shown that the mean and variance of X̄_h(n_h) are E[X̄_h(n_h)] = μ_h and

Var(X̄_h(n_h)) = σ²_h/n_h − (1/n²_h) Σ_{i=1}^{n_h} T²_h(i),   (26)

respectively. The estimate of the population mean using an SRSS of size n is then defined by

X̄_SRSS = (1/N) Σ_{h=1}^{L} N_h X̄_h(n_h) = Σ_{h=1}^{L} W_h X̄_h(n_h).   (27)

Algebraically, it can be shown that the mean and the variance of X̄_SRSS are
E[X̄_SRSS] = μ (i.e., an unbiased estimator) and

Var(X̄_SRSS) = Σ_{h=1}^{L} W²_h ( σ²_h/n_h − (1/n²_h) Σ_{i=1}^{n_h} T²_h(i) ),   (28)
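A minimal R sketch of the SRSS estimator in Eq. (27), reusing the rss_sample() helper sketched earlier in this chapter, with hypothetical stratum populations and perfect ranking:

# stratified ranked set sampling (SRSS) estimate of the population mean
set.seed(5)
strata <- list(rnorm(3000, mean = 24, sd = 3),    # hypothetical stratum 1
               rnorm(1400, mean = 28, sd = 4))    # hypothetical stratum 2
n_h <- c(7, 7)                                    # units quantified per stratum

N_h <- sapply(strata, length)
W_h <- N_h / sum(N_h)                             # stratum weights W_h = N_h / N

# per-stratum RSS means (one cycle, set size n_h), then combine as in Eq. (27)
xbar_h    <- mapply(function(pop, nh) mean(rss_sample(pop, r = nh, m = 1)),
                    strata, n_h)
xbar_srss <- sum(W_h * xbar_h)
xbar_srss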
respectively, assuming Nh ’s are large enough. Samawi (1996) showed that using
SRSS for estimating the population mean is more efficient than using SSRS. As an
illustration to this method Samawi (1996) used Iowa 65C Rural Health Study. In
Table 3 he presented three samples of size 7 each, from baseline interview data for
the (RHS), which is a longitudinal cohort study of 3,673 individuals (1,420 men arid
2,253 women) ages 65 or older living in Washington and Iowa countries of the State
of Iowa in 1982. In the Iowa 65C RHS there were 33 diabetic women aged 80–85,
of whom 14 reported urinary incontinence. The question of interest is to estimate the
mean body mass index (BMI) of diabetic women. BMI may be different for diabetic
women with or without urinary incontinence. Thus, here is a situation where stratifi-
cation might work well. The 33 women were divided into two strata, the first consists
of those women without urinary incontinence and the second consists of those 14
women with urinary incontinence. Four samples of size .n D 7/ each were drawn
from those women using SSRS, SRSS, RSS, and SRS. Note that in case of SRSS
and RSS the selecting samples are drawn with replacement. The calculated values
of BMI are given in Table 2. These calculations indicate the same pattern of conclu-
sions that were obtained earlier, and illustrate the method described in Sect. 2.
The BMI data are a good example where we need stratification to find an
unbiased estimator for the population mean for diabetic women aged 80–85
years. The 33 women were divided into two strata: the first consists of those
women without urinary incontinence and the second consists of those women with
urinary incontinence. It is clear that the mean BMI in each stratum will be
different. Also, there is potential for women to be ranked visually according to
their BMI. In this situation, using SRSS to estimate the mean BMI of those women
is recommended. SRSS will give an unbiased and more efficient estimate of the mean
BMI. Moreover, SRSS can provide an unbiased and more efficient estimate for the
mean of each stratum.
5 Bivariate Population
Samawi and Muttlak (1996) used a modification of the RSS procedure in the case of
bivariate distributions to estimate a population ratio. The procedure is described as
follows:
First, choose r² independent bivariate elements from a population that has a
bivariate distribution function F(x, y) with parameters μ_x, μ_y, σ²_x, σ²_y, and correlation
coefficient ρ. Rank each set with respect to one of the variables Y or X. Suppose
ranking is on the variable X. Then divide the elements into r sets. From the first set
obtain the element with the smallest rank of X, together with the associated value
of the variable Y. From the second set obtain the second smallest element of X,
together with the associated value of the variable Y. The procedure is continued
until the element with the largest ranking observation of X is measured from the rth
set. The whole procedure can be repeated m times to get a bivariate RSS sample of
size n = mr, ranking only on one variable. In sampling notation, {(X_i(i)k, Y_i[i]k), i = 1,
2, …, r; k = 1, 2, …, m} will denote the bivariate RSS. However, ranking on both
variables X and Y was introduced by Al-Saleh and Zheng (2002). Based on the Al-Saleh
and Zheng (2002) description, a BVRSS can be obtained as follows: suppose (X, Y)
is a bivariate random vector with the joint probability density function (p.d.f.) f(x, y).
Step 1. A random sample of size r⁴ is identified from the population and randomly
allocated into r² pools of size r² each, so that each pool is a square matrix with r
rows and r columns.
Step 2. In the first pool, rank each set (row) by a suitable method of ranking with
respect to (w.r.t.) the first characteristic (X). Then from each row identify the
unit with the smallest rank w.r.t. X.
Step 3. Rank the r minima obtained in Step 2 in a similar manner but w.r.t. the
second characteristic (Y). Then identify and measure the unit with the smallest
rank w.r.t. Y. This pair of measurements (x, y), which is denoted by the label
(1, 1), is the first element of the BVRSS sample.
Step 4. Repeat Steps 2 and 3 for the second pool, but in Step 3 the pair that
corresponds to the second smallest rank w.r.t. the second characteristic (Y) is
chosen for actual measurement (quantification). This pair is denoted by the label
(1, 2).
Step 5. The process continues until the pair with label (r, r) is obtained from the r²th (last)
pool (a code sketch of the procedure follows these steps).
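A minimal R sketch (not from the original text) of one cycle of this selection procedure, with perfect ranking and hypothetical simulated data; the population is given as paired vectors x and y:

# one cycle of bivariate ranked set sampling (BVRSS), ranking on both X and Y
bvrss_cycle <- function(x, y, r) {
  out <- matrix(NA_real_, nrow = r * r, ncol = 2,
                dimnames = list(NULL, c("x", "y")))
  row_id <- 1
  for (i in 1:r) {            # rank used for X within each row
    for (j in 1:r) {          # rank used for Y among the selected candidates
      pool <- matrix(sample(length(x), r * r), nrow = r)   # an r x r pool of unit indices
      # from each row of the pool, take the unit with the ith smallest X
      cand <- apply(pool, 1, function(row) row[order(x[row])[i]])
      # among these r candidates, measure the one with the jth smallest Y
      pick <- cand[order(y[cand])[j]]
      out[row_id, ] <- c(x[pick], y[pick])
      row_id <- row_id + 1
    }
  }
  out
}

# example: one BVRSS cycle of size r^2 = 9
set.seed(6)
x <- rnorm(100000, 35, 20)
y <- 0.1 * x + rnorm(100000, 0, 3)
bvrss <- bvrss_cycle(x, y, r = 3)
colMeans(bvrss)               # naive BVRSS estimates of (mu_x, mu_y)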
The above procedure produces a BVRSS of size r². The procedure can be
repeated m times to obtain a sample of size n = mr². In sampling notation,
assume that a random sample of size mr⁴ is identified (no measurements
taken) from a bivariate probability density function, say f(x, y), (x, y) ∈ R²,
with means μ_x and μ_y, variances σ²_x and σ²_y, and correlation coefficient ρ.
Following the Al-Saleh and Zheng (2002) definition of BVRSS,
{(X_[i](j)k, Y_(i)[j]k); i = 1, 2, …, r; j = 1, 2, …, r; and k = 1, 2, …, m}
denotes the BVRSS. Now, let f_{X[i](j), Y(i)[j]}(x, y) be the joint p.d.f. of (X_[i](j)k, Y_(i)[j]k),
k = 1, 2, …, m. Al-Saleh and Zheng (2002), with m = 1, showed that

(1) (1/r²) Σ_{j=1}^{r} Σ_{i=1}^{r} f_{[i](j),(i)[j]}(x, y) = f(x, y);
(2) (1/r²) Σ_{j=1}^{r} Σ_{i=1}^{r} f_{X[i](j)}(x) = f_X(x); and
(3) (1/r²) Σ_{j=1}^{r} Σ_{i=1}^{r} f_{Y(i)[j]}(y) = f_Y(y).
The estimator of the population ratio R using BVRSS is given by

R̂_BVRSS = Ȳ_BVRSS / X̄_BVRSS.   (29)
By using a Taylor expansion (see, for example, Bickel and Doksum 1977) and
assuming a large population size, it is easy to show that E[R̂_BVRSS] = μ_y/μ_x + O(1/n),
and the variance of R̂_BVRSS is approximated by

Var(R̂_BVRSS) ≈ (R²/n) [ V²_x + V²_y − 2ρV_xV_y
  − m( Σ_{i=1}^{r} Σ_{j=1}^{r} T²_x[i](j)/(nμ²_x) + Σ_{i=1}^{r} Σ_{j=1}^{r} T²_y(i)[j]/(nμ²_y) − 2 Σ_{i=1}^{r} Σ_{j=1}^{r} T_x[i](j)y(i)[j]/(nμ_xμ_y) ) ],
(30)
where V²_x and V²_y are as in Eq. (20), T_x[i](j) = μ_x[i](j) − μ_x, T_y(i)[j] = μ_y(i)[j] − μ_y, and
T_x[i](j)y(i)[j] = (μ_x[i](j) − μ_x)(μ_y(i)[j] − μ_y). However, in the case of SRS,

Var(R̂_BVSRS) ≈ (R²/n)(V²_x + V²_y − 2ρV_xV_y).
Hence,

Var(R̂_BVSRS) − Var(R̂_BVRSS)
 = m( Σ_{i=1}^{r} Σ_{j=1}^{r} T²_x[i](j)/(nμ²_x) + Σ_{i=1}^{r} Σ_{j=1}^{r} T²_y(i)[j]/(nμ²_y) − 2 Σ_{i=1}^{r} Σ_{j=1}^{r} T_x[i](j)y(i)[j]/(nμ_xμ_y) )
 = (m/n) Σ_{i=1}^{r} Σ_{j=1}^{r} ( T_x[i](j)/μ_x − T_y(i)[j]/μ_y )² ≥ 0.   (31)
Now, when ρ > 0, Σ_{i=1}^{r} T_x[i](j)y(i)[j] tends to be positive because X tends to increase
as Y increases and since (1/r) Σ_{i=1}^{r} μ_x[i](j) = μ_x and (1/r) Σ_{i=1}^{r} μ_y(i)[j] = μ_y. Also, when ρ < 0,
Σ_{i=1}^{r} T_x[i](j)y(i)[j] tends to be negative because X tends to decrease as Y increases.
Therefore, from (31), Var(R̂_BVSRS) − Var(R̂_BVRSS) is much larger when ρ < 0
than when ρ > 0. Hence, the relative efficiency of R̂_BVRSS relative to R̂_BVSRS is much
higher when ρ < 0 than when ρ > 0.
Simulations from the bivariate normal and Plackett's distributions showed that,
in all cases, estimating the population ratio using BVRSS is more efficient than
using BVSRS or RSS. Also, the asymptotic relative efficiency increases as the set
size r increases for any given positive or negative value of ρ. The performance of
the ratio estimator using BVRSS improves over SRS as the absolute value of ρ
increases.
In the two-phase regression estimator using BVRSS, for the kth cycle, k = 1, 2, …,
m, in the first stage, suppose that (X, Y) is a bivariate random vector with the joint
p.d.f. f(x, y). A random sample of size r⁴ is identified from the population and randomly
allocated into r² pools of size r² each, where each pool is a square matrix with r rows
and r columns. Then proceed as follows:
Step (a): From the first r pools, rank each set (row) in each pool by a suitable method
of ranking, such as prior information, visual inspection, or by the experimenter
himself, w.r.t. the first characteristic (X), and then from each row
identify and obtain the actual measurement of the unit with the smallest rank
w.r.t. X.
From each row in each of the second r pools, identify and obtain the actual
measurement (in the same way as in Step 1) of the second minimum w.r.t. the first
characteristic (X), and so on, until you identify and obtain the actual measurements
of the rth smallest unit (maximum) from each row of each of the last r pools.
Note that there will be r pools of quantified samples (w.r.t. the variable X), each
of size r². Repeat this m times. Use the quantified sample of size r³m to
estimate μ_x. Then proceed as follows:
Step (b): For a fixed k, from any given pool produced in Step (a), identify the ith minimum
value by judgment w.r.t. the second characteristic (Y) from the ith row of that
pool, and quantify the second characteristic only, as the first characteristic was
already quantified in Step (a).
Steps (a) and (b) describe a procedure to produce a BVRSS of size n = r²m for
regression estimators.
Using the BVRSS sample {(X_[i](j)k, Y_(i)[j]k); i = 1, 2, …, r; j = 1, 2, …, r; and
k = 1, 2, …, m}, assume that

Y_(i)[j]k = μ_y + β(X_[i](j)k − μ_x) + ε_ijk,   i, j = 1, 2, …, r; k = 1, 2, …, m,   (32)

where β is the model slope and the ε_ijk are random errors with E[ε_ijk] = 0,
Var(ε_ijk) = σ²_e, and Cov(ε_ijk, ε_lst) = 0 for i ≠ l, j ≠ s, and/or k ≠ t. Also, assume that
X_[i](j)k and ε_ijk are independent. From the first stage, let X̄_RSS be the sample mean
based on the RSS samples of size r³m, i.e., X̄_RSS = (1/(r³m)) Σ_{k=1}^{m} Σ_{z=1}^{r} Σ_{j=1}^{r} Σ_{i=1}^{r} X_i(j)k. Note that,
by Al-Saleh and Zheng (2002),

E[X̄_RSS] = μ_x   (33)
Var(X̄_RSS) = σ²_x/(r³m) − (1/(r⁴m)) Σ_{j=1}^{r} (μ_x(j) − μ_x)².   (34)
Using the BVRSS sample, the regression estimator of the population mean μ_y can be
defined as

Ȳ_RegBVRSS = Ȳ_BVRSS + β̂_BVRSS (X̄_RSS − X̄_BVRSS),   (35)

where

β̂_BVRSS = Σ_{k=1}^{m} Σ_{i=1}^{r} Σ_{j=1}^{r} (X_[i](j)k − X̄_BVRSS) Y_(i)[j]k / Σ_{k=1}^{m} Σ_{i=1}^{r} Σ_{j=1}^{r} (X_[i](j)k − X̄_BVRSS)²,

X̄_BVRSS = (1/(r²m)) Σ_{k=1}^{m} Σ_{i=1}^{r} Σ_{j=1}^{r} X_[i](j)k, and the variance of the estimator is

Var(Ȳ_RegBVRSS) = (σ²_y(1 − ρ²)/n) [ 1 + E( (Z̄_RSS − Z̄_BVRSS)² / S²_BVRSS ) ] + (β²/(r²n)) Σ_{j=1}^{r} σ²_x(j),

where n = r²m,

Z_[i](j)k = (X_[i](j)k − μ_x)/σ_x,   S²_BVRSS = (1/n) Σ_{k=1}^{m} Σ_{j=1}^{r} Σ_{i=1}^{r} (Z_[i](j)k − Z̄_BVRSS)²,

Z̄_BVRSS = (X̄_BVRSS − μ_x)/σ_x,  and  Z̄_RSS = (X̄_RSS − μ_x)/σ_x.
Simulation studies conducted by Samawi and Al-Saleh (2007) show that the regression estimator based on BVRSS from the bivariate normal distribution is more efficient than the naive estimator using BVRSS only whenever |ρ| > 0.90 and the set size is small. This is always the case when using the RSS technique for the regression estimator (Yu and Lam 1997). Clearly, the relative efficiency is affected only slightly by the number of cycles m. However, the regression estimator using BVRSS from the bivariate normal distribution is more efficient than naive estimators based on SRS and RSS whenever |ρ| > 0.4.
In addition, they showed that the regression estimator using BVRSS is always superior to double sampling regression estimators using SRS and RSS. Although the efficiency is affected by the value of ρ and the sample size, Ȳ_RegBVRSS is still more efficient than estimators based on other sampling methods. Even with departures from the normality assumption, they showed that Ȳ_RegBVRSS is still more efficient than other regression estimators based on SRS and RSS.
To investigate the performance of the methods introduced in this section on a real data set, we compare BVRSS regression estimation with BVSRS regression estimation using data collected from trauma victims in a hospital setting. Each observation consists of the patient's age, bd score, and gender. The bd score is a measure indicating the level of blunt force trauma as reported by the administering doctor. The data contain N = 1,480 females. For the analysis we treat the data as the population and resample it 5,000 times under the various sampling mechanisms (i.e., BVRSS and BVSRS) to estimate the mean bd score (Y) using the covariate age (X). The exact population values of the data are as follows.
For females, the mean age is μ_x = 35.44 with variance σ_x² = 412.58; the mean bd score is μ_y = 2.25 with variance σ_y² = 12.19; and ρ = 0.21. Using BVSRS, the estimator has mean 2.41 with variance 0.21. Using BVRSS, the estimator has mean 2.48 with variance 0.20. From these results we conclude that both sampling techniques exhibit similar performance in terms of bias, with BVRSS performing better in terms of variance.
Finally, whenever BVRSS can be conducted and the relationship between X and Y is approximately linear, ratio and regression estimators using BVRSS are recommended.
References
Al-Saleh, M.F., Al-Kadiri, M.A.: Double ranked set sampling. Stat. Probab. Lett. 48(2), 205–212
(2000)
Al-Saleh, M.F., Zheng, G.: Estimation of bivariate characteristics using ranked set sampling. Aust.
N. Z. J. Stat. 44, 221–232 (2002)
Bahadur, R.R.: A note on quantiles in large samples. Ann. Math. Stat. 37, 577–590 (1966)
Bickel, P.J., Doksum, K.A.: Mathematical Statistics. Holden-Day, Inc., Oakland (1977)
Chen, Z.: On ranked-set sample quantiles and their application. J. Stat. Plan. Inference 83, 125–135
(2000)
Chen, Z., Zheng, G., Ghosh, K., Li, Z.: Linkage disequilibrium mapping of quantitative trait loci
by selective genotyping. Am. J. Hum. Genet. 77, 661–669 (2005)
Chen, Z.: Ranked set sampling: Its essence and new applications. Environ. Ecol. Stat. 14, 355–363
(2007)
Dell, T.R., Clutter, J.L.: Ranked set sampling theory with order statistics background. Biometrics
28, 545–555 (1972)
Food and Agriculture Organization (FAO): Date Production and Protection. FAO Plant Production
and Protection Paper 35. FAO (1982)
Hansen, M.H., Hurwitz, W.N., Madow, W.G.: Sampling Survey Methods and Theory, vol. 2. Wiley,
New York (1953)
Kaur, A., Patil, G.P., Sinha, A.K., Taillie, C.: Ranked set sampling: an annotated bibliography.
Environ. Ecol. Stat. 2, 25–54 (1995)
McIntyre, G.A.: A method for unbiased selective sampling using ranked set. Aust. J. Agr. Res. 3,
385–390 (1952)
Muttlak, H.A.: Median ranked set sampling. J. Appl. Stat. Sci. 6(4), 245–255 (1997)
Nelson, W.E., Behrman, R.E., Kliegman, R.M., Vaughan, V.C.: Textbook of Pediatrics, 4th edn.
W. B. Saunders Company Harcourt Brace Jovanovich, Inc, Philadelphia (1992)
Norris, R.C., Patil, G.P., Sinha, A.K.: Estimation of multiple characteristics by ranked set sampling
methods. COENOSES 10(2&3), 95–111 (1995)
Ozturk, O., MacEachern, S.N.: Order restricted randomized designs for control versus treatment
comparison. Ann. Inst. Stat. Math. 56, 701–720 (2004)
Patil, G.P., Sinha, A.K., Taillie, C.: Relative efficiency of ranked set sampling: comparison with
regression estimator. Environmetrics 4, 399–412 (1993)
Patil, G.P., Sinha, A.K., Taillie, C.: Ranked set sampling for multiple characteristics. Int. J. Ecol.
Environ. Sci. 20, 94–109 (1994)
Patil, G.P., Sinha, A.K., Taillie, C.: Ranked set sampling: a bibliography. Environ. Ecol. Stat. 6,
91–98 (1999)
Samawi, H.M.: Stratified ranked set sample. Pak. J. Stat. 12(1), 9–16 (1996)
Samawi, H.M.: On quantiles estimation using ranked samples with some applications. JKSS 30(4),
667–678 (2001)
Samawi, H.M.: Varied set size ranked set sampling with application to mean and ratio estimation.
Int. J. Model. Simul. 32(1), 6–13 (2011)
Samawi, H.M., Al-Sageer, O.A.: On the estimation of the distribution function using extreme and
median ranked set sampling. Biom. J. 43(3), 357–373 (2001)
Samawi, H.M., Al-Saleh, M.F.: On bivariate ranked set sampling for ratio and regression
estimators. Int. J. Model. Simul. 27(4), 1–7 (2007)
Samawi, H.M.: On double extreme ranked set sample with application to regression estimator.
Metron LX(1–2), 53–66 (2002)
Samawi, H.M., Muttlak, H.A.: Estimation of ratio using ranked set sampling. Biom. J. 38(6),
753–764 (1996)
Samawi, H.M., Muttlak, H.A.: On ratio estimation using median ranked set sampling. J. Appl. Stat.
Sci. 10(2), 89–98 (2001)
Samawi, H.M., Siam, M.I.: Ratio estimation using stratified ranked set sample. Metron LXI(1),
75–90 (2003)
Samawi, H.M., Ahmed, M.S., Abu Dayyeh, W.: Estimating the population mean using extreme
ranked set sampling. Biom. J. 38(5), 577–586 (1996)
Serfling, R.J.: Approximation Theorems of Mathematical Statistics. Wiley, New York (1980)
Sinsheimer, J.S., Blangero, J., Lange, K.: Gamete competition models. Am. J. Hum. Genet. 66,
1168–1172 (2000)
Slatkin, M.: Disequilibrium mapping of a quantitative-trait locus in an expanding population. Am.
J. Hum. Genet. 64, 1765–1773 (1999)
Spielman, R.S., McGinnis, R.E., Ewens, W.J.: Transmission test for linkage disequilibrium: the
insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52,
506–516 (1993)
Stokes, S.L., Sager, T.W.: Characterization of a ranked set sample with application to estimating
distribution functions. J. Am. Stat. Assoc. 83, 374–381 (1988)
Sukhatme, P.V., Sukhatme, B.V.: Sampling Theory of Surveys with Applications. Iowa State
University Press, Ames (1970)
Takahasi, K., Wakimoto, K.: On unbiased estimates of the population mean based on the stratified
sampling by means of ordering. Ann. Inst. Stat. Math. 20, 1–31 (1968)
William, E., Ahmed, M., Ahmed, O., Zaki, L., Arash, N., Tamer, B., Subhy, R.: Date Palm in the
GCC countries of the Arabian Peninsula. http://www.icarda.org/aprp/datepalm/introduction/
intro-body.htm (2004)
Xu, X.P., Rogus, J.J., Terwedom, H.A., et al.: An extreme-sib-pair genome scan for genes
regulating blood pressure. Am. J. Hum. Genet. 64, 1694–1701 (1999)
Yu, L.H., Lam, K.: Regression estimator in ranked set sampling. Biometrics 53, 1070–1080 (1997)
Zheng, G., Ghosh, K., Chen, Z., Li, Z.: Extreme rank selections for linkage analysis of quantitative
trait loci using selected sibpairs. Ann. Hum. Genet. 70, 857–866 (2006)
Weighted Multiple Testing Correction
for Correlated Endpoints in Survival Data
1 Introduction
Multiple correlated time-to-event endpoints are often collected to test the treatment
effect in clinical trials. For example, the Outcome Reduction with an Initial
Glargine Intervention (ORIGIN) trial (The ORIGIN Trial Investigators 2008) has
two co-primary endpoints: the first is a composite of cardiovascular death, non-
fatal myocardial infarction (MI), or non-fatal stroke; the second is a composite
of these three events plus revascularization or hospitalization for heart failure.
These two endpoints are correlated. Also, the first endpoint is considered more
important than the second endpoint. The issue of multiplicity occurs when multiple
hypotheses are tested in this way. Ignoring multiplicity can cause false positive
results. Many statistical methods have been proposed to control family-wise error
rate (FWER), which is the probability of rejecting at least one true null hypothesis.
When some hypotheses are more important than others, weighted multiple testing
correction methods may be useful. Commonly used weighted multiple testing
correction methods to control FWER include the weighted Bonferroni correction,
the Bonferroni fixed sequence (BFS), the alpha-exhaustive fallback (AEF), and
the weighted Holm procedure. The weighted Bonferroni correction computes the adjusted p-value for p_i as p_adj,i = min(1, p_i/w_i), where w_i > 0, i = 1, …, m, are the weights with Σ_{i=1}^m w_i = 1 (m is the total number of tests performed), and rejects the null hypothesis H_i if the adjusted p-value p_adj,i ≤ α (or, equivalently, p_i ≤ w_i α). Combining a Bonferroni adjustment and the fixed sequence (FS) testing procedure, Wiens (2003) proposed a Bonferroni fixed sequence (BFS) procedure, where each of the null hypotheses is given a certain significance level and a pre-specified testing sequence that allows the significance level to accumulate for later testing when the null hypotheses are rejected. Wiens and Dmitrienko (2005, 2010) developed this method further to use more of the available alpha and provide an alpha-exhaustive fallback (AEF) procedure with more power than the original BFS. Holm (Holm 1979; Westfall et al. 2004) presented a weighted Holm method as follows. Let q_i = p_i/w_i, i = 1, …, m. Without loss of generality, suppose q_1 ≤ q_2 ≤ … ≤ q_m. Then the adjusted p-value for the first hypothesis is p_adj,1 = min(1, q_1). Inductively, the adjusted p-value for the jth hypothesis is p_adj,j = min(1, max(p_adj,(j−1), (w_j + … + w_m) q_j)), j = 2, …, m. The method rejects a hypothesis if the adjusted p-value is less than the FWER, α.
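To make the two corrections concrete, the base-R sketch below computes weighted Bonferroni and weighted Holm adjusted p-values directly from the formulas above; it is a minimal illustration written for these formulas, not code from the chapter itself.

### Weighted Bonferroni and weighted Holm adjusted p-values
# p: raw p-values; w: positive weights summing to 1
weighted_bonferroni <- function(p, w) pmin(1, p / w)

weighted_holm <- function(p, w) {
  q <- p / w
  o <- order(q)                                  # sort hypotheses by q = p / w
  w_tail <- rev(cumsum(rev(w[o])))               # w_j + ... + w_m in sorted order
  adj <- pmin(1, cummax(w_tail * q[o]))          # the max() step enforces monotonicity
  out <- numeric(length(p))
  out[o] <- adj
  out
}

# Applied to the unadjusted p-values of Table 2 with the weights used in Sect. 4
p <- c(0.0252, 0.0000012, 0.72, 0.85)
w <- c(0.4, 0.2, 0.2, 0.2)
weighted_holm(p, w)    # approximately 0.050, 0.000006, 1, 1

Up to rounding of the reported inputs, this reproduces the weighted Holm adjusted p-values given for the Fernald example in Sect. 4.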
It is notable that all these weighted multiple testing methods disregard the cor-
relation among the endpoints and they are therefore appropriately called weighted
nonparametric multiple testing methods. They are conservative if test statistics are
correlated leading to false negative results. In other words, ignoring the correlation
when correcting for multiple testing can lower the power of a study. Recently,
weighted parametric multiple testing methods have been proposed to take into
account correlations among the test statistics. These methods require the correlation
matrix for the correlated tests related to the corresponding correlated endpoints.
For continuous data or binary data, the correlation matrix can be directly estimated
from the corresponding correlated endpoints (Conneely et al. 2007; Xie 2012; Xie
et al. 2013; Xie 2014). However, it is challenging to directly estimate the correlation
matrix from the multiple time-to-event endpoints in survival data since censoring
is involved. Pocock et al. (1987) discussed the analysis of multiple endpoints in
survival data using log-rank tests and gave the normal approximation to the log-rank
test. However, the correlation matrix was not given despite being specified for other
situations such as binary data. Alosh and Huque (2009) considered the correlation
of a survival endpoint between the overall population and a subgroup. Their method
was based on the proportion of subjects in the subgroup, which is not suitable for the
estimation of the correlations between different survival endpoints measured in the
same population. Wei et al. (1989) proposed a method called the WLW method to
analyze multiple endpoints in survival data using the marginal Cox models. Instead
of estimating the correlation matrix from the multiple time-to-event endpoints
directly, they proposed a robust sandwich covariance matrix estimate for the
maximum partial likelihood estimates for the event-specific regression coefficients.
Neither Pocock’s method nor the WLW method considered giving different weights
to different endpoints. In this chapter, we will use the WLW method to estimate
the correlations among the test statistics. With the estimated correlation matrix, we
propose a weighted multiple testing correction for correlated endpoints, WMTCc,
which can be used to apply different weights to hypotheses when conducting
multiple testing for correlated time-to-event endpoints. Simulations are conducted
to study the family-wise type I error rate of the proposed method and compare
the power performance of the proposed method to the power performance of the
alpha-exhaustive fallback (AEF), the fixed-sequence (FS), and the weighted Holm-
Bonferroni method when used for the correlated time-to-event endpoints. One
might consider other parametric methods, such as Huque and Alosh's (2008) flexible fixed-sequence (FFS) testing method and Li and Mehrotra's adaptive α allocation approach (4A), using the estimated correlation matrix from the WLW method. However, we previously compared the WMTCc method with both the FFS method and the 4A method and showed that the WMTCc method has an advantage (Xie 2014), so we will not discuss the FFS method and the 4A method further in this chapter.
In the next section, the WMTCc method for correlated time-to-event endpoints
is presented. In Sect. 3, simulations are conducted to evaluate the proposed method.
A real example to illustrate use of the proposed method for correlated time-to-event
endpoints is given in Sect. 4, followed by discussion and concluding remarks in
Sect. 5.
Suppose there are n subjects and each subject can have up to K potential failure times (events). Let X_ki be the covariate process associated with the kth event for the ith subject. The marginal Cox models are given by

h_k(t) = h_{k0}(t)\,e^{\beta_k' X_{ki}(t)}, \quad k = 1, \ldots, K \text{ and } i = 1, \ldots, n, \qquad (1)

where h_k0(t) is the event-specific baseline hazard function for the kth event and β_k is the (event-specific) column vector of regression coefficients for the kth event. The WLW method estimates β_1, …, β_K by the maximum partial likelihood estimates β̂_1, …, β̂_K, respectively, and uses a robust sandwich covariance matrix estimate Σ̂ for (β̂_1', …, β̂_K')' to account for the dependence of the multiple endpoints. This robust sandwich covariance matrix estimate can be obtained using the PHREG procedure in SAS. After we have the estimated robust sandwich covariance matrix, the WMTCc method is applied.
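The chapter obtains Σ̂ with SAS PROC PHREG; for readers working in R, the sketch below shows one commonly used way (our implementation choice, not the authors' code) to obtain an analogous robust sandwich covariance with the survival package. The long-format data layout, the variable names, and the explicit event-specific treatment covariates are assumptions made for illustration.

### WLW-style marginal Cox fit with a robust sandwich covariance in R
library(survival)

# Hypothetical long-format data: one row per subject per event type
set.seed(1)
n <- 240; K <- 3
d <- data.frame(
  id     = rep(1:n, each = K),
  event  = rep(1:K, times = n),
  trt    = rep(rbinom(n, 1, 0.5), each = K),
  time   = rexp(n * K, rate = 1),
  status = rbinom(n * K, 1, 0.7)
)

# Event-specific treatment covariates so each event type gets its own coefficient;
# strata(event) gives event-specific baseline hazards, cluster(id) the sandwich variance
for (k in 1:K) d[[paste0("trt", k)]] <- d$trt * (d$event == k)

fit <- coxph(Surv(time, status) ~ trt1 + trt2 + trt3 + strata(event) + cluster(id),
             data = d)
fit$var            # robust sandwich covariance of (beta1_hat, beta2_hat, beta3_hat)
cov2cor(fit$var)   # estimated correlation matrix used by the WMTCc method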
Assume that the test statistics follow a multivariate normal distribution with the estimated correlation matrix Σ̂ obtained by the WLW method above. Let p_1, …, p_m be the observed p-values for null hypotheses H_0^(1), …, H_0^(m), respectively, and let w_i > 0, i = 1, …, m, be the weight for null hypothesis H_0^(i). Note that we do not require that Σ_{i=1}^m w_i = 1. It can be seen from Eq. (3) or (4) below that the adjusted p-values depend only on the ratios of the weights. For each i = 1, …, m, calculate q_i = p_i/w_i. Then the adjusted p-value for the null hypothesis H_0^(i) is

p_{adj,i} = P\left(\min_j q_j \le q_i\right) = 1 - P\left(\text{all } q_j > q_i\right) = 1 - P\left(\text{all } p_j > p_i w_j / w_i\right) = 1 - P\left(\bigcap_{j=1}^{m}\left\{a_j \le X_j \le b_j\right\}\right), \qquad (2)
removing the k − 1 rejected null hypotheses, using the corresponding correlation matrix and weights. This procedure is continued until no null hypothesis is left or no remaining null hypothesis can be rejected.
Computation of the adjusted p-values in (2) requires integration of the mul-
tivariate normal density function, which has no closed-form solution. However,
Genz (1992, 1993) and Genz et al. (2014) have developed a computationally
efficient method for numerical integration of the multivariate normal distribution
and incorporated it into the package mvtnorm in the R software environment (R Development Core Team 2014). Based on the magnitude of the p-values and
the nature of the analysis, one may choose the precision level to devote more
computational resources to a high-precision analysis and improve computational
efficiency.
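For two-sided tests with standard normal test statistics under the null, the adjusted p-value in (2) can be evaluated with pmvnorm from the mvtnorm package, as in the sketch below. The function name wmtcc_adjust, the toy correlation matrix, and the example p-values and weights are placeholders, and only the single-step adjustment is shown; the chapter applies it in a stepdown fashion, recomputing after removing rejected hypotheses.

### Adjusted p-values in (2) via numerical integration of the multivariate normal
library(mvtnorm)

wmtcc_adjust <- function(p, w, R) {
  m <- length(p)
  sapply(seq_len(m), function(i) {
    # two-sided critical value for test j at raw level p_i * w_j / w_i
    c_j <- qnorm(1 - pmin(1, p[i] * w / w[i]) / 2)
    1 - pmvnorm(lower = -c_j, upper = c_j, corr = R)[1]
  })
}

# Toy example with a hypothetical 3 x 3 correlation matrix (not estimated from data)
R <- matrix(0.5, 3, 3); diag(R) <- 1
wmtcc_adjust(p = c(0.01, 0.04, 0.30), w = c(5, 4, 1), R = R)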
3 Simulations
Correlated survival data were generated from the Cox proportional hazards model

h(t \mid X) = \lambda_0(t)\,e^{\beta' X},

where λ_0(t) is the baseline hazard and X is a vector of covariates. This model is equivalent to the following transformed regression model (Fan et al. 1997):

\log H_0(T) = -\beta' X + \log e,

where e ~ exp(1) and H_0(t) is the baseline cumulative hazard function. In order
to obtain correlated survival data, we generated samples from a multi-exponential
distribution with a given correlation matrix. This correlation matrix is the correlation
of the different cumulative hazard functions, which is specifically designed for
survival data to allow censoring. If the event times have an exponential distribution,
this correlation matrix is the same as the correlation matrix of multivariate event
times. This may not hold if the event times do not have exponential distribution.
We simulated a clinical trial with three correlated time-to-event endpoints and 240 individuals. Each individual had probability 0.5 of receiving the active treatment and probability 0.5 of receiving placebo. The baseline hazard λ_0(t) was set to 24t². An equal correlation structure was used with ρ chosen as 0.0, 0.3, 0.5, 0.7, and 0.9. The treatment effect size was set to 0.0, 0.05, and 0.2. The survival times were censored by the uniform distribution U(0, 3). The weights for the three endpoints were (5, 4, 1), which corresponds to alpha allocations (0.025, 0.02, 0.005).
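The chapter does not spell out the generator beyond the description above, so the following sketch is one possible implementation (our assumption, not the authors' code): a Gaussian copula produces correlated Exp(1) cumulative hazards, which are then inverted through H_0(t) = 8t³ (the cumulative hazard of λ_0(t) = 24t²) with treatment effects applied, and censored by U(0, 3).

### One possible generator for correlated time-to-event endpoints (illustrative assumption)
library(MASS)

simulate_trial <- function(n = 240, rho = 0.5, beta = c(0.2, 0.05, 0.05)) {
  K <- length(beta)
  trt <- rbinom(n, 1, 0.5)                        # treatment indicator
  R <- matrix(rho, K, K); diag(R) <- 1            # equal correlation structure
  Z <- mvrnorm(n, mu = rep(0, K), Sigma = R)      # correlated standard normals
  E <- qexp(pnorm(Z))                             # correlated Exp(1) variables
  H <- E * exp(-outer(trt, beta))                 # H0(T) = E * exp(-beta * trt)
  T <- (H / 8)^(1 / 3)                            # invert H0(t) = 8 t^3
  C <- matrix(runif(n * K, 0, 3), n, K)           # uniform censoring U(0, 3)
  list(time = pmin(T, C), status = (T <= C) * 1, trt = trt)
}

set.seed(123)
dat <- simulate_trial(rho = 0.5)
colMeans(dat$status)    # observed event proportions per endpoint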
The simulation results are summarized in Table 1. From these simulations, we can
conclude the following:
1. The proposed method using estimated correlation matrices from the WLW
method can control the family-wise type I error rate well (the first part of Table 1).
Both the WMTCc and the FS methods keep the family-wise type I error rate at the 5 % level as the correlation ρ increases. However, the family-wise type I error rates for the AEF and the weighted Holm methods decrease as the correlation ρ increases, resulting in decreased power with increasing correlation.
2. The WMTCc method has higher power for testing the first hypothesis than the AEF and the weighted Holm methods (Table 1).
3. The WMTCc method has a power advantage over the weighted Holm method for each individual hypothesis, especially when the correlation ρ is high.
4. The WMTCc method has a higher chance of rejecting at least one hypothesis compared to the weighted Holm and the AEF methods, especially when the correlation ρ is high.
5. The FS method has the highest power for testing the first hypothesis. However,
the WMTCc method can still have high power for testing other hypotheses when
the power for testing the first hypothesis is very low, which is not true for the FS
method.
To illustrate the proposed method, we analyze data from the Fernald Community
Cohort (FCC). Community members of a small Ohio town, living near a uranium
refinery, participated in a medical monitoring program from 1990 to 2008. The
Fernald Medical Monitoring Program (FMMP) provided health screening and
promotion services to 9,782 persons who resided within close proximity to the plant.
For more details, see Wones et al. (2009) or the cohort website (Fernald Medical
Monitoring Program website 2014). For illustration purposes, we considered four
time-to-event outcomes among female participants: three types of incident cancers
(colon cancer, lung cancer, and breast cancer) and their composite (any of the
three cancers). We were interested in testing the association between smoking
and the four outcomes after adjusting for age and uranium exposure. First, we
tested one outcome at a time; the results are shown in Table 2. Although the two-tailed unadjusted p-values for colon cancer and breast cancer are large, the coefficients of smoking for all four outcomes are positive, indicating potentially harmful associations with smoking. Since we have four tests, we need to consider adjusting for the multiple testing. To illustrate the weighted multiple testing methods, the weights 0.4, 0.2, 0.2, and 0.2 were given to the composite endpoint, lung cancer, colon cancer, and breast cancer, respectively. The corresponding α
Table 1 Three endpoints: simulated power (%) or significance level (%) based on 100,000 runs for selected treatment differences for the WMTCc method, AEF, FS, and the weighted Holm-Bonferroni method. In each cell the first entry is for the first endpoint, the second entry for the second endpoint, and the third entry for the third endpoint; the probability (%) that at least one hypothesis is rejected is given in brackets. The α allocations are (0.025, 0.02, 0.005), equivalently weights (5, 4, 1). The total sample size is 240.

Effect size (0.0, 0.0, 0.0):
  ρ = 0.0   WMTCc: 2.6, 2.1, 0.5 (5.0)    AEF: 2.5, 2.1, 0.6 (5.0)    FS: 5.0, 0.3, 0.01 (5.0)   Weighted Holm: 2.5, 2.1, 0.05 (5.0)
  ρ = 0.3   WMTCc: 2.7, 2.2, 0.6 (5.0)    AEF: 2.5, 2.1, 0.6 (4.8)    FS: 5.0, 0.5, 0.1 (5.0)    Weighted Holm: 2.6, 2.1, 0.6 (4.8)
  ρ = 0.5   WMTCc: 3.0, 2.5, 0.8 (5.1)    AEF: 2.6, 2.2, 0.8 (4.5)    FS: 5.1, 1.0, 0.3 (5.1)    Weighted Holm: 2.7, 2.2, 0.8 (4.5)
  ρ = 0.7   WMTCc: 3.4, 2.9, 1.3 (5.0)    AEF: 2.6, 2.4, 1.2 (4.1)    FS: 5.0, 1.7, 0.9 (5.0)    Weighted Holm: 2.8, 2.4, 1.1 (4.1)
  ρ = 0.9   WMTCc: 4.2, 3.7, 2.4 (5.0)    AEF: 2.7, 2.6, 1.9 (3.3)    FS: 5.0, 3.0, 2.3 (5.0)    Weighted Holm: 2.8, 2.5, 1.9 (3.3)

Effect size (0.2, 0.05, 0.05):
  ρ = 0.0   WMTCc: 75.4, 8.6, 3.6 (76.9)    AEF: 74.9, 9.2, 2.7 (76.7)    FS: 82.9, 9.3, 1.1 (82.9)    Weighted Holm: 75.2, 8.6, 3.5 (76.7)
  ρ = 0.3   WMTCc: 75.7, 9.3, 4.6 (76.2)    AEF: 74.8, 9.8, 3.6 (75.5)    FS: 83.0, 10.4, 2.5 (83.0)   Weighted Holm: 75.0, 9.1, 4.5 (75.5)
  ρ = 0.5   WMTCc: 76.4, 9.8, 5.4 (76.6)    AEF: 74.9, 10.0, 4.5 (75.1)   FS: 83.0, 10.8, 3.7 (83.0)   Weighted Holm: 74.9, 9.3, 5.2 (75.1)
  ρ = 0.7   WMTCc: 77.7, 10.3, 6.4 (77.9)   AEF: 74.9, 10.2, 5.6 (75.1)   FS: 83.1, 10.9, 5.3 (83.1)   Weighted Holm: 74.9, 9.5, 6.0 (75.1)
  ρ = 0.9   WMTCc: 80.0, 11.0, 8.0 (80.3)   AEF: 74.8, 10.2, 7.5 (74.9)   FS: 82.9, 10.9, 7.7 (82.9)   Weighted Holm: 74.8, 9.5, 7.3 (74.9)

Effect size (0.05, 0.05, 0.2):
  ρ = 0.0   WMTCc: 7.3, 6.3, 55.4 (59.7)    AEF: 7.2, 6.2, 55.3 (59.5)    FS: 11.3, 1.3, 1.1 (11.3)    Weighted Holm: 7.2, 6.2, 55.2 (59.5)
  ρ = 0.3   WMTCc: 7.8, 6.8, 55.5 (57.7)    AEF: 7.5, 5.5, 54.9 (56.9)    FS: 11.3, 2.6, 2.5 (11.3)    Weighted Holm: 7.5, 6.6, 54.8 (56.9)
  ρ = 0.5   WMTCc: 8.2, 7.3, 56.0 (57.0)    AEF: 7.7, 6.9, 54.4 (55.4)    FS: 11.2, 3.8, 3.7 (11.2)    Weighted Holm: 7.7, 6.9, 54.4 (55.4)
  ρ = 0.7   WMTCc: 8.8, 8.0, 57.1 (57.6)    AEF: 7.9, 7.3, 54.1 (54.5)    FS: 11.2, 5.3, 5.3 (11.2)    Weighted Holm: 7.9, 7.3, 54.1 (54.5)
  ρ = 0.9   WMTCc: 9.9, 9.2, 59.7 (60.1)    AEF: 8.0, 7.6, 54.0 (54.2)    FS: 11.3, 7.8, 7.6 (11.3)    Weighted Holm: 8.0, 7.6, 54.0 (54.2)

Effect size (0.2, 0.2, 0.2):
  ρ = 0.0   WMTCc: 80.5, 79.9, 75.1 (96.9)   AEF: 79.5, 80.1, 75.5 (96.8)   FS: 83.0, 68.9, 57.2 (83.0)   Weighted Holm: 80.4, 79.8, 75.0 (96.8)
  ρ = 0.3   WMTCc: 80.1, 79.3, 74.0 (92.9)   AEF: 78.8, 79.2, 74.2 (92.5)   FS: 82.9, 71.0, 62.2 (82.9)   Weighted Holm: 79.7, 78.9, 73.6 (92.5)
  ρ = 0.5   WMTCc: 80.0, 79.3, 73.9 (90.0)   AEF: 78.4, 78.7, 73.7 (89.1)   FS: 83.0, 73.0, 66.1 (83.0)   Weighted Holm: 79.2, 78.4, 73.2 (89.1)
  ρ = 0.7   WMTCc: 80.3, 79.6, 74.5 (87.1)   AEF: 77.8, 78.1, 73.5 (85.0)   FS: 83.0, 75.2, 70.3 (83.0)   Weighted Holm: 78.5, 77.8, 73.0 (85.0)
  ρ = 0.9   WMTCc: 81.4, 80.6, 76.6 (84.2)   AEF: 76.8, 77.0, 74.1 (79.5)   FS: 82.9, 78.5, 75.9 (82.9)   Weighted Holm: 77.4, 76.6, 73.9 (79.5)
Table 2 The results of analyzing each of the four endpoints: Cox model with smoking as covariate, adjusting for age and uranium exposure

Outcome                                        Coefficient of smoking   SE of coefficient   Hazard ratio   Unadjusted P-value
Composite of lung, colon and breast cancer     0.2947                   0.1317              1.34           0.0252
Lung cancer                                    2.1997                   0.4533              9.02           0.0000012
Colon cancer                                   0.1347                   0.3771              1.14           0.72
Breast cancer                                  0.0288                   0.1550              1.03           0.85
Table 3 The estimated correlation matrix of the test statistics for the four endpoints, using the WLW method

                                                Composite   Lung cancer   Colon cancer   Breast cancer
The composite of lung, colon, and breast cancer   1           0.290         0.342          0.838
Lung cancer                                       0.290       1             0.004          0.040
Colon cancer                                      0.342       0.004         1              0.002
Breast cancer                                     0.838       0.040         0.002          1
allocations are 0.02, 0.01, 0.01, and 0.01. The testing sequence for the AEF and the FS methods is the composite endpoint, lung cancer, colon cancer, breast cancer. In applying the WMTCc method, we estimated the correlation matrix of the test statistics for the four endpoints using the WLW method. This estimated correlation matrix is given in Table 3. The adjusted p-values from the WMTCc method are 0.041, 0.000005, 0.92, and 0.92, respectively. The first and the second null hypotheses, corresponding to the composite endpoint and lung cancer respectively, can be rejected. The weighted Holm-Bonferroni method gave the adjusted p-values as 0.051, 0.000005, 1.0, and 1.0, respectively. Only the second null hypothesis, corresponding to lung cancer, can be rejected. In this example, the AEF method has the same results as the weighted Holm-Bonferroni method. The FS method has the same results as the WMTCc method since we specified the right testing sequence. For illustration purposes, if we change the testing sequence to the composite endpoint, colon cancer, lung cancer, breast cancer, even the null hypothesis for lung cancer cannot be rejected under the FS method. Although the AEF method depends on the testing sequence, it can still reject the null hypothesis corresponding to lung cancer (since 0.0000012 < 0.01), but not the others. The WMTCc method and the weighted Holm-Bonferroni method do not depend on the testing sequence.
5 Discussions
In this chapter, we investigated the weighted multiple testing correction for cor-
related time-to-event endpoints. Extensive simulations were conducted to evaluate
the WMTCc method in comparison with three existing nonparametric methods. The
simulations showed that the proposed method using estimated correlation matrices
from the WLW method can control the family-wise type I error rate very well as
summarized in the first part of Table 1. The proposed method has a power advantage
over the weighted Holm method for each individual hypothesis, especially when
the correlation is high. It also has higher power for testing the first hypothesis
(which is usually the most important hypothesis) than the AEF method. For the FS
method, if we cannot reject the first hypothesis, the remaining hypotheses cannot
be rejected even if their unadjusted p-values are very small. This is not true for the
WMTCc method, which can still have high power for testing other hypotheses when
the power for testing the first hypothesis is very low.
It should be noted that the WMTCc method assumes that test statistics are asymp-
totically distributed as multivariate normal with the estimated correlation matrix
from the data, using the WLW method. The positive semi-definite assumption needs
to be checked since the estimated correlation matrix from an inconsistent data set
might not be positive semi-definite. If this is the case, the algorithm proposed by
Higham (2002) can be used to compute the nearest correlation matrix.
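In R, one convenient option (an implementation choice on our part, not prescribed by the chapter) is the nearPD function in the Matrix package, which implements a Higham-type alternating-projection algorithm; the input matrix below is a hypothetical example.

### Repairing a non-positive-semi-definite correlation matrix estimate
library(Matrix)

# Hypothetical estimated matrix that is not a valid correlation matrix
R_hat <- matrix(c( 1.0, 0.9, -0.9,
                   0.9, 1.0,  0.9,
                  -0.9, 0.9,  1.0), nrow = 3)
min(eigen(R_hat)$values)                              # negative eigenvalue

R_fixed <- as.matrix(nearPD(R_hat, corr = TRUE)$mat)  # nearest correlation matrix
min(eigen(R_fixed)$values)                            # now non-negative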
In conclusion, the WMTCc method outperforms the existing nonparametric
methods in multiple testing for correlated time-to-event multiple endpoints in
clinical trials.
Acknowledgements This project was supported by the National Center for Research Resources
and the National Center for Advancing Translational Sciences, National Institutes of Health,
through Grant 8 UL1 TR000077-04 and National Institute for Environmental Health Sciences,
Grant P30-ES006096 (UC Center for Environmental Genetics). The content is solely the responsi-
bility of the authors and does not necessarily represent the official views of the NIH.
References
Alosh, M., Huque, M.F.: A flexible strategy for testing subgroups and overall population. Stat.
Med. 28, 3–23 (2009)
Conneely, K.N., Boehnke, M.: So many correlated tests, so little time! Rapid adjustment of p values
for multiple correlated tests. Am. J. Hum. Genet. 81, 1158–1168 (2007)
Fan, J., Gijbels, I., King, M.: Local likelihood and local partial likelihood in hazard regression.
Ann. Stat. 25(4), 1661–1690 (1997)
Fernald Medical Monitoring Program Website: http://www.eh.uc.edu/fmmp (2014). Cited 15 Nov 2014
Genz, A.: Numerical computation of multivariate normal probabilities. J. Comput. Graph. Stat. 1,
141–149 (1992)
Genz, A.: Comparison of methods for the computation of multivariate normal probabilities.
Comput. Sci. Stat. 25, 400–405 (1993)
Genz, A., Bretz, F., Hothorn, T.: mvtnorm: multivariate normal and t distribution. R package version 2.12.0. Available at http://cran.r-project.org/web/packages/mvtnorm/index.html (2014). Cited 15 Nov 2014
Higham, N.J.: Computing the nearest correlation matrix - A problem from finance. IMA J. Numer.
Anal. 22(3), 329–343 (2002)
Holm, S.: A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70 (1979)
Huque, M.F., Alosh, M.: A flexible fixed-sequence testing method for hierarchically ordered
correlated multiple endpoints in clinical trials. J. Stat. Plann. Inference 138, 321–335 (2008)
Pocock, S.J., Geller, N.L., Tsiatis, A.A.: The analysis of multiple endpoints in clinical trials.
Biometrics 43, 487–498 (1987)
R Development Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing. Available at http://www.r-project.org/ (2014). Cited 15 Nov 2014
The ORIGIN Trial Investigators: Rationale, design and baseline characteristics for a large simple
international trial of cardiovascular disease prevention in people with dysglycaemia: the
ORIGIN trial. Am. Heart J. 155, 26–32 (2008)
Wei, L.J., Lin, D.Y., Weissfeld, L.: Regression analysis of multivariate incomplete failure time data
by modeling marginal distributions. J. Am. Stat. Assoc. 84, 1065–1073 (1989)
Westfall, P.H., Kropf, S., Finos, L.: Weighted FWE-controlling methods in high-dimensional
situations. In: Recent Developments in Multiple Comparison Procedures. IMS Lecture Notes
and Monograph Series, vol. 47, pp. 143–154. Institute of Mathematical Statistics, Beachwood
(2004)
Wiens, B.L.: A fixed sequence Bonferroni procedure for testing multiple endpoints. Pharm. Stat.
2, 211–215 (2003)
Wiens, B.L., Dmitrienko, A.: The fallback procedure for evaluating a single family of hypotheses.
J. Biopharm. Stat. 15, 929–942 (2005)
Wiens, B.L., Dmitrienko, A.: On selecting a multiple comparison procedure for analysis of a
clinical trial: fallback, fixed sequence, and related procedures. Stat. Biopharm. Res. 2, 22–32
(2010)
Wones, R., Pinney, S.M., Buckholz, J., Deck-Tebbe, C., Freyberg, R., Pesce, A.: Medical
monitoring: a beneficial remedy for residents living near an environmental hazard site. J.
Occup. Environ. Med. 51(12), 1374–1383 (2009)
Xie, C.: Weighted multiple testing correction for correlated tests. Stat. Med. 31, 341–352 (2012)
Xie, C., Lu, X., Pogue, J., Chen, D.: Weighted multiple testing corrections for correlated binary
endpoints. Commun. Stat. Simul. Comput. 42(8), 1693–1702 (2013)
Xie, C.: Relations among three parametric multiple testing methods for correlated tests. J. Stat.
Comput. Simul. 84(4), 812–818 (2014)
Meta-Analytic Methods for Public Health
Research
Y. Ma () • W. Zhang
Department of Epidemiology and Biostatistics, The George Washington University,
Washington, DC, USA
e-mail: [email protected]
D.-G. Chen
School of Social Work, University of North Carolina, Chapel Hill, NC 27599, USA
e-mail: [email protected]
an estimate of the overall pooled effect size with optimal precision. With rising
cost of health research, many studies are carried out with small sample sizes, resulting in low power for detecting a useful effect size. This phenomenon has increased the chance of producing conflicting results from different studies. By pooling the estimated effect sizes from each of the component studies through meta-analytic methods, information from a larger number of patients and increased power is
expected (Peto 1987). More detailed discussions on meta-analysis for clinical trials
can be found from Whitehead (2002) and the implementation of meta-analysis in R
can be found in Chen and Peace (2013).
A typical MA deals with n independent studies in which a parameter of interest θ_i (i = 1, 2, …, n) is estimated. It can be applied to a broad range of study designs such as single-arm or multiple-arm studies, randomized controlled studies, and observational studies. For illustrative purposes, we focus on MA of two-arm studies, where θ_i is some form of the effect size between the two groups. The most popular choice for θ_i is the standardized mean difference for a continuous outcome, or the odds ratio, risk ratio, or risk difference for a dichotomous outcome. In most cases, an estimate θ̂_i of the true θ_i and its associated standard error can be directly extracted from each study. The ultimate goal of meta-analysis is to produce an optimal estimate of the population effect size by pooling the estimates θ̂_i (i = 1, 2, …, n) from individual studies using appropriate statistical models.
This chapter presents an overview of MA intended to help public health researchers understand and apply the methods of MA. Emphasis is placed on classical
statistical methods for estimation of the parameters of interest in MA as well as
recent development in research in MA. Specifically, univariate and multivariate
fixed- and random-effects MAs, as well as meta-regression are discussed. All
methods are illustrated by examples of published MAs in health research. We
demonstrate how these approaches can be implemented in R.
2 Univariate Meta-Analysis
In any MA the point estimates of the effect size θ̂_i from different studies are inevitably different due to two sources of variation: within- and between-study variation. The within-study variation is caused by sampling error, which is random or non-systematic. In contrast, the between-study variation results from systematic differences among studies. If it is believed that between-study variation does not exist, the effect estimates θ̂_i are considered homogeneous. Otherwise, they are heterogeneous. Underlying causes of heterogeneity may include differences across studies in patient characteristics, the specific interventions and design of the studies, or hidden factors. In MA, the assumption of homogeneity states that θ_i (i = 1, 2, …, n) are the same in all studies, that is
\theta_1 = \theta_2 = \cdots = \theta_n = \theta. \qquad (1)
Further, this assumption can be examined using a statistical test known as Cochran's χ² test or the Q-test (Cochrane Injuries Group Albumin Reviewers 1998; Whitehead and Whitehead 1991). A significant test indicates lack of homogeneity. However, the test has been criticized for its low statistical power when the number of studies in an MA is small (Hardy and Thompson 1998). Higgins and Thompson (2002) developed a heterogeneity statistic I² to quantify heterogeneity in an MA. The I² (0 % ≤ I² ≤ 100 %) has an interpretation as the proportion of total variation in the estimates of effect size that is due to heterogeneity between studies. For example, an I² of 0 % (100 %) implies that all variability in effect size estimates is due to sampling error (between-study heterogeneity).
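For concreteness, the base-R sketch below computes Cochran's Q and the I² statistic from generic study-level estimates y and within-study variances v (placeholder inputs, not data from this chapter).

### Cochran's Q and the Higgins-Thompson I^2 statistic
heterogeneity <- function(y, v) {
  w  <- 1 / v                                     # inverse-variance weights
  th <- sum(w * y) / sum(w)                       # fixed-effect pooled estimate
  Q  <- sum(w * (y - th)^2)                       # Cochran's Q statistic
  df <- length(y) - 1
  I2 <- max(0, (Q - df) / Q) * 100                # I^2 as a percentage
  c(Q = Q, df = df, p.value = pchisq(Q, df, lower.tail = FALSE), I2 = I2)
}

# Toy usage
heterogeneity(y = c(0.30, 0.45, 0.25, 0.60), v = c(0.020, 0.030, 0.015, 0.050))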
The fixed effects univariate meta-analysis (F-UMA) model is

Y_i = \theta + \varepsilon_i, \quad i = 1, 2, \ldots, n, \qquad (2)

where for study i, Y_i represents the effect size, θ the population effect size, and ε_i the sampling error with mean 0 and variance σ_i². In general, ε_i is assumed to follow a normal distribution N(0, σ_i²). A pooled estimate of θ is given by the weighted least squares estimate

\hat{\theta} = \frac{\sum_{i=1}^{n} w_i Y_i}{\sum_{i=1}^{n} w_i},

where a popular choice of weight is w_i = 1/σ_i², and the variance σ_i² is estimated using the sample variance σ̂_i² of Y_i from study i. Hence, with Var(θ̂) = 1/Σ_{i=1}^n w_i, the 95 % confidence interval of θ is given by

\left(\hat{\theta} - t_{0.025,(n-1)} \sqrt{\mathrm{Var}\left(\hat{\theta}\right)},\; \hat{\theta} + t_{0.025,(n-1)} \sqrt{\mathrm{Var}\left(\hat{\theta}\right)}\right).
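Continuing the same base-R sketch with the same generic inputs, the F-UMA pooled estimate and its t-based 95 % confidence interval can be computed as follows.

### Fixed effects univariate meta-analysis (F-UMA)
fuma <- function(y, v, level = 0.95) {
  w      <- 1 / v
  theta  <- sum(w * y) / sum(w)                   # pooled estimate
  var_th <- 1 / sum(w)                            # Var(theta_hat)
  tcrit  <- qt(1 - (1 - level) / 2, df = length(y) - 1)
  c(estimate = theta, se = sqrt(var_th),
    lower = theta - tcrit * sqrt(var_th), upper = theta + tcrit * sqrt(var_th))
}

fuma(y = c(0.30, 0.45, 0.25, 0.60), v = c(0.020, 0.030, 0.015, 0.050))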
The random effects univariate meta-analysis (R-UMA) model is

Y_i = \theta + b_i + \varepsilon_i, \quad i = 1, 2, \ldots, n, \qquad (3)

where for study i, Y_i represents the effect size, θ the population effect size, b_i the random effect with mean 0 and variance τ², and ε_i the sampling error with mean 0 and variance σ_i². It is assumed that b_i and ε_i are independent and follow normal distributions N(0, τ²) and N(0, σ_i²), respectively. Let θ_i = θ + b_i, i = 1, 2, …, n. Then the random effects model (3) can be simplified as

Y_i = \theta_i + \varepsilon_i,

where θ_i represents the true effect size for study i. All θ_i (i = 1, 2, …, n) are random samples from the same normal population

\theta_i \sim N\left(\theta, \tau^2\right)

rather than being a constant θ for F-UMA (1). Further, the marginal variance of Y_i is given by Var(Y_i) = τ² + σ_i². Following DerSimonian and Laird (1986), the between-study variance τ² can be estimated by the method of moments,

\hat{\tau}^2 = \max\left\{0,\; \frac{Q - (n - 1)}{\sum_{i=1}^{n} \hat{\sigma}_i^{-2} - \sum_{i=1}^{n} \hat{\sigma}_i^{-4} \big/ \sum_{i=1}^{n} \hat{\sigma}_i^{-2}}\right\}, \qquad (4)

where Q is Cochran's heterogeneity statistic. The truncation at zero in (4) is to ensure that the variance estimate is non-negative. Further, the estimate of the population effect size is given by

\hat{\theta} = \frac{\sum_{i=1}^{n} w_i Y_i}{\sum_{i=1}^{n} w_i}, \qquad (5)

where w_i = 1/(σ_i² + τ̂²). The variance of θ̂ is simply

\mathrm{Var}\left(\hat{\theta}\right) = 1 \Big/ \sum_{i=1}^{n} w_i,

and the 95 % confidence interval can be calculated as \left(\hat{\theta} - t_{0.025,(n-1)}\sqrt{\mathrm{Var}(\hat{\theta})},\; \hat{\theta} + t_{0.025,(n-1)}\sqrt{\mathrm{Var}(\hat{\theta})}\right).
The ML and REML estimates of τ² and θ do not have closed forms. In particular, the REML estimates are shown to be iteratively equivalent to the weighted estimators in (4) and (5) (Shadish and Haddock 2009).
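A corresponding base-R sketch of the R-UMA with the DerSimonian-Laird moment estimator in (4) and the pooled estimate in (5), again with generic placeholder inputs, is:

### Random effects univariate meta-analysis (R-UMA) with the DerSimonian-Laird estimator
ruma_dl <- function(y, v, level = 0.95) {
  w     <- 1 / v
  th_f  <- sum(w * y) / sum(w)                                            # fixed-effect estimate
  Q     <- sum(w * (y - th_f)^2)                                          # Cochran's Q
  tau2  <- max(0, (Q - (length(y) - 1)) / (sum(w) - sum(w^2) / sum(w)))   # Eq. (4)
  ws    <- 1 / (v + tau2)                                                 # random-effects weights
  th    <- sum(ws * y) / sum(ws)                                          # Eq. (5)
  se    <- sqrt(1 / sum(ws))
  tcrit <- qt(1 - (1 - level) / 2, df = length(y) - 1)
  c(estimate = th, se = se, tau2 = tau2,
    lower = th - tcrit * se, upper = th + tcrit * se)
}

ruma_dl(y = c(0.30, 0.45, 0.25, 0.60), v = c(0.020, 0.030, 0.015, 0.050))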
2.4 Meta-Regression
The fixed effects univariate meta-regression (F-UMR) model relates the effect sizes to p study-level predictors x_i1, …, x_ip:

Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, 2, \ldots, n. \qquad (6)

The model assumes that the variation in effect sizes can be completely explained by these predictors. In other words, the variation is predictable.
The random effects univariate meta-regression (R-UMR) model can be obtained by adding random effects b_i to the fixed model (6):

Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + b_i + \varepsilon_i, \quad i = 1, 2, \ldots, n.
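The F-UMR model can be fit by weighted least squares with the within-study variances treated as known, as in the base-R sketch below (generic placeholder inputs); fitting the R-UMR additionally requires estimating τ², for which packages such as mvmeta (used in the Appendix) or metafor are typically employed.

### Fixed effects univariate meta-regression (F-UMR) by weighted least squares
fumr <- function(y, v, X) {
  X <- cbind(Intercept = 1, X)                    # design matrix with intercept
  W <- diag(1 / v)                                # known inverse-variance weights
  V <- solve(t(X) %*% W %*% X)                    # (X' W X)^{-1}
  beta <- V %*% t(X) %*% W %*% y                  # WLS coefficient estimates
  cbind(estimate = drop(beta), se = sqrt(diag(V)))
}

# Toy usage with centered publication year as the single predictor
year <- c(-3, -1, 0, 2, 4)
fumr(y = c(0.35, 0.30, 0.33, 0.40, 0.28),
     v = c(0.020, 0.030, 0.010, 0.040, 0.020),
     X = cbind(year = year))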
3 Multivariate Meta-Analysis
Following the notation for UMA, the formulation for MMA, based on the multivariate random-effects model, is written as

Y_i = \beta + b_i + \varepsilon_i, \quad i = 1, 2, \ldots, n,

where for the ith study, β describes the vector of population effect sizes of the M outcomes, b_i the vector of between-study random effects, and ε_i the vector of within-study sampling errors. It is often assumed that b_i and ε_i are independent, following multivariate normal distributions with zero means and M × M variance–covariance matrices Var(b_i) = D and Var(ε_i) = Ω_i, respectively.
For illustrative purposes, we focus on the simple case of meta-analysis with a bivariate outcome variable Y_i = (Y_i1, Y_i2)', i = 1, 2, …, n, as these methods can be easily extended to meta-analysis with M > 2 outcomes. With M = 2 in the random-effects model, the marginal variability in Y_i accounts for both between-study variation (D) and within-study variation (Ω_i). In particular, the variance–covariance matrix for a bivariate meta-analysis (BMA) boils down to

\mathrm{Var}(Y_i) = D + \Omega_i, \qquad
D = \begin{pmatrix} \tau_1^2 & \tau_{12} \\ \tau_{12} & \tau_2^2 \end{pmatrix}, \quad
\Omega_i = \begin{pmatrix} \sigma_{i1}^2 & \sigma_{i12} \\ \sigma_{i12} & \sigma_{i2}^2 \end{pmatrix}, \quad
\tau_{12} = \tau_1 \tau_2 \rho_b, \quad \sigma_{i12} = \sigma_{i1} \sigma_{i2} \rho_{iw},

where τ_1², τ_2², and ρ_b describe the between-study variation and correlation, whereas σ_i1², σ_i2², and ρ_iw capture the within-study variation and correlation. Similar to the UMA setting, the within-study variances σ_i1² and σ_i2² can be estimated using the sample variances σ̂_i1² and σ̂_i2². Although the within-study correlation (WSC) ρ_iw is generally unknown, several approaches for addressing this issue have been discussed in Riley (2009). For simplicity, throughout this section, the within-study correlation is assumed equal across studies (ρ_iw = ρ_w, i = 1, 2, …, n) and a sensitivity analysis over a wide range of ρ_w is conducted.
There are three major methods in the literature for MMA: Restricted maximum
likelihood approach (Berkey et al. 1998) (REML), the multivariate extension of the
DerSimonian and Laird’s method of moments (Jackson et al. 2010), and U-statistic
based approach (Ma and Mazumdar 2011). Through extensive simulation studies, it
is shown in Ma and Mazumdar (2011) that estimates from these three approaches
are very similar. In addition, since REML was developed first, it has been the
default approach for MMA. We next introduce the statistical properties and applications of REML.
The REML approach has been widely used and is incorporated in most statistical software. By assuming the within-study variance matrix Ω_i to be known, the remaining parameters of interest to be estimated are β, τ_1², τ_2², and ρ_b. Under REML, b_i and ε_i are usually assumed to follow bivariate normal distributions, b_i ~ N(0, D) and ε_i ~ N(0, Ω_i). The outcome variable Y_i, as a result, follows a bivariate normal distribution with mean β and variance D + Ω_i, i = 1, 2, …, n.
Normal random effects and sampling errors are discussed here as that is what is most commonly assumed. No closed-form REML estimates exist, and therefore iterative procedures (e.g., the Newton-Raphson algorithm or the E-M algorithm) have been developed for estimating β and the variance matrix D. Briefly, the REML estimate of β can be derived as a function of D̂. This is equivalent to the Restricted Iterative Generalized Least Squares (RIGLS) estimate

\hat{\beta} = \left(\sum_{i=1}^{n}\left(\hat{D} + \Omega_i\right)^{-1}\right)^{-1}\left(\sum_{i=1}^{n}\left(\hat{D} + \Omega_i\right)^{-1} y_i\right)

when outcomes are normally distributed (Riley et al. 2007). Asymptotically, the estimate β̂ above follows a normal distribution with mean β and variance

\mathrm{Var}\left(\hat{\beta}\right) = \left(\sum_{i=1}^{n}\left(\hat{D} + \Omega_i\right)^{-1}\right)^{-1}.
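The chapter's analyses use the mvmeta package in R (see the Appendix). The sketch below shows a bivariate REML fit with hypothetical inputs: y is an n × 2 matrix of effect estimates and S holds the within-study (co)variances, assuming mvmeta's convention of supplying the vectorized lower triangle (var1, cov12, var2) of each within-study covariance matrix; the numbers are placeholders, not Example 1 data.

### Bivariate random-effects meta-analysis (REML) with mvmeta: illustrative sketch
library(mvmeta)

y <- cbind(y1 = c(0.35, 0.30, 0.40, 0.25, 0.33),
           y2 = c(-0.40, -0.35, -0.45, -0.30, -0.38))
v1 <- c(0.020, 0.030, 0.010, 0.040, 0.020)
v2 <- c(0.010, 0.020, 0.010, 0.030, 0.010)
rho_w <- 0.5                                      # assumed within-study correlation
S <- cbind(v1, rho_w * sqrt(v1 * v2), v2)         # var1, cov12, var2 per study

fit <- mvmeta(y, S = S, method = "reml")
summary(fit)                                      # pooled effects and between-study D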
Y_i = X_i \beta + b_i + \varepsilon_i \qquad (7)
There has been extensive debate regarding when, how, and why MMA can differ
from two independent UMAs (Berkey et al. 1998; Riley et al. 2007; Sohn 2000).
Since MMA is able to “borrow strength” across outcomes through the within-study
correlations, this method is expected to produce increased precision of estimates
compared to a UMA of each outcome separately. Riley et al. (2007) demonstrated
that the completeness of data plays a role when assessing the benefits of using
MMA. When all outcomes are available in each individual study (i.e., complete
data), the advantages of MMA tend to be marginal. When at least one of the
outcomes is unavailable for some studies (i.e., missing data), these advantages
become more apparent. In particular, MMA can still incorporate those incomplete
studies in analysis. However, studies with a missing outcome will be excluded in
a UMA for the outcome. Therefore, Riley et al. (2007) recommended that MMA
should be used if there are missing data and multiple separate UMAs are sufficient
if there are complete data.
4 Applications
We apply UMA and MMA to two published meta-analyses, one with complete data and the other with missing data.
Table 2 Bivariate and univariate meta-analysis results of Example 1: surgical versus non-surgical procedure for treating periodontal disease (PD probing depth, AL attachment level, F-UMA fixed effects univariate meta-analysis, F-BMA fixed effects bivariate meta-analysis, R-UMA random effects univariate meta-analysis, R-BMA random effects bivariate meta-analysis)

Method   β̂_PD (s.e.)      95 % CI (PD)      τ̂²_PD    β̂_AL (s.e.)       95 % CI (AL)        τ̂²_AL    ρ̂_b
F-UMA    0.347 (0.0289)   (0.267, 0.427)             −0.393 (0.0189)   (−0.445, −0.340)
F-BMA    0.307 (0.0286)   (0.228, 0.387)             −0.394 (0.0186)   (−0.446, −0.343)
R-UMA    0.361 (0.0592)   (0.196, 0.525)    0.0119   −0.346 (0.0885)   (−0.591, −0.100)    0.0331
R-BMA    0.353 (0.0588)   (0.190, 0.517)    0.0117   −0.339 (0.0879)   (−0.583, −0.095)    0.0327   0.609
In addition to MA, we also performed fixed and random UMR and BMR
(Table 3). In these regression analyses, year of publication, a proxy for the time at which a study was conducted, was used as the predictor. Similar to the MAs in Table 2,
we also found in MR analyses that all the random effects standard errors (SEs) were
considerably larger than the corresponding fixed effects SEs. The year of publication
was not a significant predictor in any of the analyses. The differences between UMR
and BMR estimates (coefficients and SEs) were only marginal for both fixed and
random effects MRs.
Table 3 Meta-regression analysis results of Example 1: surgical versus non-surgical procedure for treating periodontal disease (PD probing depth, AL attachment level, F-UMR fixed effects univariate meta-regression, F-BMR fixed effects bivariate meta-regression, R-UMR random effects univariate meta-regression, R-BMR random effects bivariate meta-regression). The PD model is Y_PD = b1 + b2·year and the AL model is Y_AL = b3 + b4·year; for the b1, b2 rows the UMR columns refer to the PD regression and for the b3, b4 rows to the AL regression.

                 F-UMR           F-BMR           R-UMR           R-BMR
b̂1 (SE)         0.345 (0.029)   0.305 (0.029)   0.363 (0.073)   0.359 (0.073)
b̂2 (SE)         0.008 (0.008)   0.005 (0.008)   0.005 (0.02)    0.005 (0.022)
τ̂1²                                             0.019           0.02
b̂3 (SE)         0.394 (0.019)   0.399 (0.019)   0.333 (0.091)   0.336 (0.098)
b̂4 (SE)         0.012 (0.007)   0.01 (0.006)    0.014 (0.027)   0.012 (0.03)
τ̂2²                                             0.033           0.04
ρ̂_b                                                             0.561
Table 4 Data of Example 2: Gamma nail versus sliding hip screw (SHS) for extracapsular hip fractures in adults (G Gamma nail, S SHS, NA not available)

                Sample size       Length of surgery (minutes)                 Operative blood loss (ml)
Study   G      S       Mean(G)   Std(G)   Mean(S)   Std(S)     Mean(G)   Std(G)   Mean(S)   Std(S)
1       203    197     55.4      20       61.3      22.2       244.4     384.9    260.4     325.5
2       60     60      47.1      20.8     53.4      8.3        152.3     130.7    160.3     110.8
3       73     73      65        29       51        22         240       190      280       280
4       53     49      59        23.9     47        13.3       258.7     145.4    259.2     137.5
5       104    106     46        11       44        15         NA        NA       NA        NA
6       31     36      56.7      17       54.3      16.4       NA        NA       NA        NA
7       93     93      NA        NA       NA        NA         814       548      1,043     508
respectively. A UMA was also conducted for each outcome. Since nearly 85 % (I²) of the total variation in LOS and 30 % in BL was from the between-study variation, we only conducted random effects MAs.
Shown in Table 5 are the estimates of β_LOS, β_BL, τ²_LOS, τ²_BL, and ρ_b, together with the 95 % confidence intervals of β̂. The results of all four analyses imply that the Gamma nail was associated with longer LOS and less BL, but these findings were not statistically significant. For all three BMAs, the effect size estimates (in absolute value) of LOS and BL were all greater than those of the UMAs. Compared to UMA, in BMA, by modeling β_LOS and β_BL simultaneously, β̂_LOS (β̂_BL) "borrows strength" from the BL (LOS), despite the fact that the LOS (BL) is missing in a few studies. This leads to more precise estimation with smaller standard errors and narrower confidence intervals under BMA. For example, when the within-study correlation is moderate (ρ_w = 0.5), the width of the 95 % CI in BMA was narrower by 2 % and 9 %, compared to that in UMA, for LOS and BL, respectively.
Appendix
We performed all our analyses in R using the mvmeta package. In this appendix we
provide the R code written for the meta-analysis and meta-regression in Example 1.
###R code for fixed effects UMA:
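A minimal sketch of how a fixed effects UMA could be run with mvmeta is shown below; it uses placeholder vectors y_PD and v_PD for the study estimates and within-study variances of the probing depth outcome and is our illustration rather than the authors' original listing.

library(mvmeta)

# Placeholder inputs (not the actual Example 1 data)
y_PD <- c(0.30, 0.40, 0.35, 0.32, 0.38)        # per-study effect sizes for probing depth
v_PD <- c(0.010, 0.015, 0.008, 0.012, 0.011)   # within-study variances

fuma_fit <- mvmeta(y_PD, S = v_PD, method = "fixed")   # fixed effects UMA
summary(fuma_fit)

ruma_fit <- mvmeta(y_PD, S = v_PD, method = "reml")    # random effects UMA (REML)
summary(ruma_fit)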
Table 5 Meta-analysis results of Example 2: Gamma nail versus sliding hip screw (SHS) for extracapsular hip fractures in adults (LOS length of surgery, BL blood loss, UMA univariate meta-analysis, BMA bivariate meta-analysis, CI 95 % confidence interval, Δ(CI) the relative change in the width of the CI, defined as (width of CI for the current BMA − width of CI for UMA)/(width of CI for UMA) × 100 %; BMA(ρ_w) denotes the BMA with assumed within-study correlation ρ_w)

Method     β̂_LOS (SE)       95 % CI_LOS        Δ_LOS(CI)   τ̂²_LOS   β̂_BL (SE)         95 % CI_BL         Δ_BL(CI)   τ̂²_BL   ρ̂_b
UMA        0.117 (0.1691)   (−0.318, 0.552)                0.141    −0.144 (0.0811)   (−0.369, 0.082)               0.01
BMA(0.2)   0.160 (0.1695)   (−0.275, 0.596)    0.11 %      0.147    −0.158 (0.0737)   (−0.363, 0.047)    6.65 %     0.007   1
BMA(0.5)   0.148 (0.1658)   (−0.278, 0.574)    2.07 %      0.139    −0.160 (0.0759)   (−0.371, 0.050)    9.09 %     0.009   1
BMA(0.8)   0.138 (0.1640)   (−0.283, 0.560)    3.10 %      0.136    −0.163 (0.0779)   (−0.379, 0.054)    3.99 %     0.01    1
References
Antczak-Bouckoms, A., Joshipura, K., Burdick, E., Tulloch, J.F.C.: Meta-analysis of surgical
versus non-surgical method of treatment for periodontal disease. J. Clin. Periodontol. 20, 259–
68 (1993)
Berkey, C.S., Hoaglin, D.C., Antczak-Bouckoms, A., Mosteller, F., Colditz, G.A.: Meta-analysis
of multiple outcomes by regression with random effects. Stat. Med. 17, 2537–2550 (1998)
Brownson, R.C., Fielding, J.E., Maylahn, C.M.: Evidence-based public health: a fundamental concept for public health practice. Annu. Rev. Public Health 30, 175–201 (2009)
Chen, D.G., Peace, K.E.: Applied Meta-Analysis with R. Chapman and Hall/CRC Biostatistics
Series. Chapman and Hall/CRC, Boca Raton (2013)
Cochrane Injuries Group Albumin Reviewers: Human albumin administration in critically ill
patients: systematic review of randomised controlled trials. Br. Med. J. 317, 235–240 (1998)
DerSimonian, R., Laird, N.: Meta-analysis in clinical trials. Control. Clin. Trials 7, 177–188 (1986)
Fielding, J.E., Briss, P.A.: Promoting evidence-based public health policy: Can we have better
evidence and more action? Health Aff. Millwood 25, 969–78 (2006)
Glasziou, P., Longbottom, H.: Evidence-based public health practice. Aust. N. Z. J. Public Health
23, 436–40 (1999)
Hardy, R.J., Thompson, S.G.: Detecting and describing heterogeneity in meta-analysis. Stat. Med.
17, 841–856 (1998)
Hedges, L.V.: Distribution theory for glass’s estimator of effect size and related estimators. J. Educ.
Stat. 6, 107–28 (1981)
Higgins, J.P.T., Thompson, S.G.: Quantifying heterogeneity in a meta-analysis. Stat. Med. 21,
1539–1558 (2002)
Jackson, D., White, I.R., Thompson, S.G.: Extending DerSimonian and Laird’s methodology to
perform multivariate random effects meta-analysis. Stat. Med. 29, 1282–1297 (2010)
Ma, Y., Mazumdar, M.: Multivariate meta-analysis: a robust approach based on the theory of
U-statistic. Stat. Med. 30, 2911–2929 (2011)
Parker, M.J., Handoll, H.H.G.: Gamma and other cephalocondylic intramedullary nails versus
extramedullary implants for extracapsular hip fractures in adults. Cochrane Database Syst. Rev.
16(3), CD000093 (2008)
Peto, R.: Why do we need systematic overviews of randomized trials? Stat. Med. 6, 233–240
(1987)
Raudenbush, S.W., Bryk, A.S.: Empirical Bayes meta-analysis. J. Educ. Stat. 10(2), 75–98 (1985)
Riley, R.D., Abrams, K.R., Lambert, P.C., Sutton, A.J., Thompson, J.R.: An evaluation of bivariate
random-effects meta-analysis for the joint synthesis of two correlated outcomes. Stat. Med. 26,
78–97 (2007)
Riley, R.D.: Multivariate meta-analysis: the effect of ignoring within-study correlation. J. R. Stat
Soc. Ser. A. 172(4), 789–811 (2009)
Shadish, W.R., Haddock, C.K.: Combining estimates of effect size. In: Cooper, H., Hedges, L.V.,
(eds.) The Handbook of Research Synthesis, pp. 261–284. Russel Sage Foundation, New York
(2009)
Sohn, S.Y.: Multivariate meta-analysis with potentially correlated marketing study results. Nav. Res.
Logist. 47, 500–510 (2000)
Sutton, A.J., Abrams, K.R., Jones, D.R., Sheldon, T.A., Song, F.: Methods for Meta-Analysis in
Medical Research. Wiley, New York (2000)
Whitehead, A., Whitehead, J.: A general parametric approach to the meta-analysis of randomised
clinical trials. Stat. Med. 10, 1665–1677 (1991)
Whitehead, A.: Meta-Analysis of Controlled Clinical Trials. Wiley, Chichester (2002)
Index
A
Absolute value of error loss, 136
AIDS Clinical Trial Group (ACTG) Study, 189–191, 195
Akaike information criterion (AIC), 283, 286
Alpha-exhaustive fallback (AEF), 316, 320–322
Analytical paradigm, LC and NLC paradigms, 272–273
The Arizona state inpatient database (SID), 58, 70
ASEL. See Average signal event length (ASEL)
ATBSE. See Average time between signal events (ATBSE)
ATFS. See Average time between false signals (ATFS)
ATS. See Average time to signal (ATS)
Attrition, 83
Average run length (ARL), 210, 213, 217, 224, 226, 243, 244
  CUSUM chart, 212
  EWMA chart, 214
  out-of-control, 210
  performance metrics, 242
  probability, signaling, 210
Average signal event length (ASEL), 244
Average time between false signals (ATFS), 242, 243, 244, 246
Average time between signal events (ATBSE), 243–244
Average time to signal (ATS), 244

B
Bayesian (Bayes) approach
  data, 138
  factor, 138
  likelihood function, 135–136
  MCMC process, 136–137
  model selection, 138–141
Bayesian framework, 196
Bayesian information criterion (BIC), 283, 286
Bayesian model comparison, 138
Behavioral epidemiology, 293
Behavioral system, progression, 82
BFS procedure. See Bonferroni fixed sequence (BFS) procedure
BIC. See Bayesian information criterion (BIC)
Bilirubin, 291–292
Bivariate meta-analysis (BMA)
  extracapsular hip fractures in adults, 338
  periodontal disease, treatment, 335
  UMA, 334
  variance–covariance matrix, 331
Bivariate population
  BVRSS (see Bivariate ranked set sampling (BVRSS))
  elements, 306
  Plackett's distributions, 308
Bivariate ranked set sampling (BVRSS)
  bivariate random vector, 306
  population ratio, 307
  procedure, 306–307
  regression estimator, 309–311
  RSS and SRS, 293
BMA. See Bivariate meta-analysis (BMA)
Bonferroni correction, 316