Analysis of Longitudinal Data with Examples
You-Gan Wang
Liya Fu
Sudhir Paul
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk.
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
DOI: 10.1201/9781315153636
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
To those who have a passion for longitudinal data analysis
Contents
List of Tables xv
Preface xvii
Contributors xxiii
Acknowledgment xxv
1 Introduction 1
1.1 Longitudinal Studies 1
1.2 Notation 2
4 Parameter Estimation 37
4.1 Likelihood Approach 38
4.2 Quasi-likelihood Approach 39
4.3 Gaussian Approach 41
4.4 Generalized Estimating Equations (GEE) 44
4.4.1 Estimation of Mean Parameters β 45
4.4.2 Estimation of Variance Parameters τ 48
4.4.2.1 Gaussian Estimation 48
4.4.2.2 Extended Quasi-likelihood 49
4.4.2.3 Nonlinear Regression 50
4.4.2.4 Estimation of Scale Parameter φ 50
4.4.3 Estimation of Correlation Parameters 51
4.4.3.1 Stationary Correlation Structures 51
4.4.3.2 Generalized Markov Correlation Structure 54
4.4.3.3 Second Moment Method 55
4.4.3.4 Gaussian Estimation 55
4.4.3.5 Quasi Least-squares 57
4.4.3.6 Conditional Residual Method 58
4.4.3.7 Cholesky Decomposition 59
4.4.4 Covariance Matrix of β̂ 63
4.4.5 Example: Epileptic Data 66
4.4.6 Infeasibility 68
4.5 Quadratic Inference Function 72
5 Model Selection 75
5.1 Introduction 75
5.2 Selecting Covariates 76
5.2.1 Quasi-likelihood Criterion 76
5.2.2 Gaussian Likelihood Criterion 78
5.3 Selecting Correlation Structure 79
5.3.1 CIC Criterion 79
5.3.2 C(R) Criterion 79
5.3.3 Empirical Likelihood Criteria 80
5.4 Examples 82
5.4.1 Examples for Variable Selection 82
5.4.2 Examples for Correlation Structure Selection 82
6 Robust Approaches 89
6.1 Introduction 89
6.2 Rank-based Method 89
6.2.1 An Independence Working Model 89
6.2.2 A Weighted Method 90
6.2.3 Combined Method 92
6.2.4 A Method Based on GEE 93
6.2.5 Pediatric Pain Tolerance Study 94
6.3 Quantile Regression 97
6.3.1 An Independence Working Model 98
6.3.2 A Weighted Method Based on GEE 99
6.3.3 Modeling Correlation Matrix via Gaussian Copulas 99
6.3.3.1 Constructing Estimating Functions 100
6.3.3.2 Parameter and Covariance Matrix Estimation 101
6.3.4 Working Correlation Structure Selection 103
6.3.5 Analysis of Dental Data 103
6.4 Other Robust Methods 105
6.4.1 Score Function and Weighted Function 106
6.4.2 Main Algorithm 107
6.4.3 Choice of Tuning Parameters 108
Bibliography 201
List of Figures

6.1 Boxplots of the time (in log2 seconds) of pain tolerance in four trials in the pediatric pain study. 95
6.2 Boxplots of the time (in log2 seconds) of pain tolerance for girls and boys in four trials in the pediatric pain study. 96
6.3 Scatterplot of the time (in log2 seconds) of pain tolerance for four trials in the pediatric pain study. 97
6.4 The distance versus the measurement time for boys and girls. 104
6.5 Boxplots of distances for girls and boys. 105
10.1 Plot of the log-transformed gene expression level with time for the first twenty genes. 196
10.2 Histogram of the log-transformed gene expression level in the yeast cell-cycle process. 197
10.3 Boxplot of the log-transformed gene expression level in the yeast cell-cycle process. 198
List of Tables
2.1 The seizure count data for 5 subjects assigned to placebo (0) and 5
subjects assigned to progabide (1). 7
5.1 The values of the criteria QIC, EQIC, GAIC, and GBIC under two
different working correlation structures (CS) for Example 1. 83
5.2 The results obtained by the GEE with independence (IN), exchange-
able (EX), and AR(1) correlation structures in Example 2. 84
5.3 The values of the criteria under independence (IN), exchangeable
(EX), and AR(1) correlation structures in Example 2. 84
5.4 The results obtained via the QIC function in the MESS package. 85
5.5 The results obtained by the GEE with independence (IN), exchange-
able (EX), and AR(1) correlation structures in Example 3. 86
5.6 The results obtained by the GEE with independence (IN), exchange-
able (EX), and AR(1) correlation structures when removing the
outliers in Example 3. 87
5.7 The results of the criteria for the data with and without the outliers
in Example 3. 87
5.8 The correlation matrix of the data with and without the outliers in
Example 3. 87
6.1 Parameter estimates and their standard errors (SE) for the pediatric
pain tolerance study. 98
6.2 Parameter estimates (Est) and their standard errors (SE) of estimators β̂I, β̂EX, β̂AR, and β̂MA obtained from estimating equations with independence, exchangeable, AR(1), and MA(1) correlation structures, respectively, and frequencies of the correlation structure identified using the GAIC and GBIC criteria for the dental dataset. 106
7.1 Artificial data on systolic blood pressure of children from 10 families. 121
7.2 Data from a toxicological experiment (Paul, 1982). (i) Number of live fetuses affected by treatment. (ii) Total number of live fetuses. 126
10.1 The number of TFs selected for the G1-stage yeast cell-cycle process
with penalized GEE (PGEE), penalized Exponential squared loss
(PEXPO), and penalized Huber loss (PHUBER) with SCAD penalty. 197
10.2 List of selected TFs for the G1-stage yeast cell-cycle process
with penalized GEE (PGEE), penalized Exponential squared loss
(PEXPO), and penalized Huber loss (PHUBER) with SCAD penalty. 199
Preface
the true one. The likelihood-based estimation is important as it lays the foundation for
estimation and inference. If we are interested in quantifying the uncertainties in the
resultant estimates (such as standard errors and confidence intervals), we would have
to investigate implications of any misspecified model components, i.e., the variance
and correlation parts in our cases. Chapter 4 presents various parameter estimation
approaches for the mean and variance functions and correlation structures.
The marginal model consists of three components, the mean function, variance
function, and correlation structure. Chapter 5 introduces the criteria for selecting
each of these three key parts. Selecting the useful predictors (regressors) in the mean
function is a classical topic. We introduce only the quasi-likelihood and Gaussian likelihood methods in this respect for longitudinal data analysis. In fact, the Lasso approach using the L1 norm is also applicable, although it is more often used when the number of predictors is very high.
In longitudinal studies, the collected data often deviate from normality, and the response variable and/or predictors may contain potential outliers, which often cause serious problems for variable selection and parameter estimation.
Therefore, robust methods have attracted much attention in recent years. Robust ap-
proaches using rank and quantile regression are given in Chapter 6.
Clustered data refer to a set of measurements collected from subjects that are
structured in clusters, which arise in many biostatistical practices and environmen-
tal studies. Responses from a group of genetically-related members from a familial
pedigree constitute clustered data in which the responses are correlated. Chapter 7 is
devoted to the methodology of such data analysis.
Missing data or missing values occur when no information is available on the re-
sponse or some of the predictors or both the response and some predictors for some
subjects who participate in a study of interest. There can be a variety of reasons for
the occurrence of missing values. Nonresponse occurs when the respondent does not
respond to certain questions due to stress, fatigue, or lack of knowledge. Some in-
dividuals in the study may not respond because some questions are sensitive. Miss-
ing data can create difficulty in the analysis because nearly all standard statistical
methods presume complete information for all the variables included in the analysis.
Chapter 8 deals with methodologies for the analysis of longitudinal data with missing
values.
The traditional designed experiments involve data that need to be analyzed using a fixed-effects model or a random-effects model. Central to variance components models is the distinction between fixed and random effects. Each effect in a variance components model must be classified as either fixed or random. Fixed effects arise when the levels of an effect constitute the entire population about which one is interested. Chapter 9 develops the methodology of analyzing longitudinal data
using random effects and transitional models.
High-dimensional longitudinal data involving many variables are often collected, for example in phenotypic studies aimed at identifying responsible genes. The inclusion of redundant variables can reduce the accuracy and efficiency of parameter estimation, inference, and prediction (an inflated false discovery rate and reduced power). Therefore, it is important to select the appropriate covariates in analyzing longitudinal data. However, variable selection for longitudinal data is challenging because of the underlying correlations and the unavailability of a full likelihood. Chapter 10 presents how the lasso-type approach can be used in longitudinal data analysis.
This book is written for (1) applied statisticians and scientists who are interested
in how dependence is taken care of in analyzing longitudinal data; (2) data analysts
who are interested in the development of techniques for longitudinal data analysis;
and (3) graduate students and researchers who are interested in research on correlated data analysis. This book is also suitable as a graduate textbook to assist students in learning advanced statistical thinking and seeking potential projects in correlated data analysis.
Author Bios
published in most of the top-tier journals in statistics (Journal of the Royal Statistical Society, Biometrika, Biometrics, Journal of the American Statistical Association, Technometrics). Professor Paul has supervised over 50 graduate students, including 16 Ph.D. students, and has published over 100 papers.
Contributors
Acknowledgment
Chapter 1
Introduction
We first consider the special case when ni = 1 and all observations are independent of each other. This is basically the setup of generalized linear models (GLMs). In GLMs, we have µij = h(Xij>β), a function of a linear combination of the covariates, and the variance σ²ij is related to the covariates via some known function of the mean.
Later on we will allow multiple observations from the same subject; correlations will then be allowed among these within-subject observations, while observations from different subjects are assumed independent.
In longitudinal studies, a variety of models can be used to meet different purposes of the research. For example, some experiments focus on individual responses, while others emphasize average characteristics. Two popular approaches have been developed to accommodate different scientific objectives: the random-effects model and the marginal model (Liang et al., 1992).
The random-effects model is a subject-specific model that models the source of heterogeneity explicitly. Its basic premise is that there is natural heterogeneity across individuals in a subset of the regression coefficients; that is, a subset of the regression coefficients is assumed to vary across individuals according to some distribution. The coefficients thus have an interpretation for individuals.
The marginal model is a population-average model. When inferences about the population average are the focus, marginal models are appropriate. For example, in a clinical trial, the average difference between control and treatment is most important, not the difference for a particular individual.
The main difference between the marginal and random-effects models is the way in which the multivariate distribution of responses is specified. In a marginal model, the mean response is modeled conditional only on fixed covariates, while in random-effects models it is conditional on both covariates and random effects.
The random-effects model can be seen as a likelihood-based approach, while the marginal approach is semiparametric in the sense that only the mean function and the covariance structure are modeled via some parametric functions (and the likelihood function is avoided).
1.2 Notation

In general, we will use capital letters to represent vectors and small letters to represent scalars. Whether a letter denotes a random variable or an observed value will be clear from the context.

Yi = g(Xi β̃) + εi,
Chapter 2
Examples and Organization of the Book

We now introduce some longitudinal studies. Some of the datasets will be used for illustration throughout this book.
[Figure: log10(RNA copies) plotted against days.]
by each subject over the 8-week period prior to the start of administration of the as-
signed treatment. It is common in such studies to record such baseline measurements
so that the effect of treatment for each subject may be measured relative to how that
subject behaved before treatment. Following the commencement of treatment, the
number of seizures for each subject was counted for each of four 2-week consecu-
tive periods. The age of each subject at the start of the study was also recorded, as it
was suspected that the age of the subject might be associated with the effect of the
treatment somehow. The primary objective of the study was to determine whether
progabide reduces the rate of seizures in subjects like those in the trial.
The data for the first five subjects in each treatment group are shown in Table 2.1.
The boxplots of the number of seizures for epileptics (see Figure 2.2) indicate that
there exist outliers. Thus, a robust method should be considered when analyzing this
dataset. Figure 2.3 indicates that there are also strong within-subject correlations in the epileptic data, as reported in Thall and Vail (1990), who presented a number of covariance models to account for overdispersion, heteroscedasticity, and dependence among repeated observations.
Table 2.1 The seizure count data for 5 subjects assigned to placebo (0) and 5 subjects assigned
to progabide (1).
ID   Period 1   Period 2   Period 3   Period 4   Treatment   Baseline   Age
 1       5          3          3          3          0           11      31
 2       3          5          3          3          0           11      30
 3       2          4          0          5          0            6      25
 4       4          4          1          4          0            8      26
 5       7         18          9         21          0           66      22
 ...
29      11         14          9          8          1           76      18
30       8          7          9          4          1           38      32
31       0          4          3          0          1           19      20
32       3          6          1          3          1           10      30
33       2          6          7          4          1           19      18
Figure 2.2 Boxplots of the number of seizures for epileptics at baseline and four subsequent
2-week periods: “0” indicates baseline.
Figure 2.3 Scatterplot matrix for epileptics at baseline and four subsequent 2-week periods:
solid circle for placebo; hollow circle for progabide.
number of normal fetuses per dam clearly shows the adverse effects of this toxic agent. Even without any parametric modeling, we can see an obvious adverse effect from this monotonic trend. The adjusted proportions of normal fetuses for the four dose levels are (0.865, 0.760, 0.592, 0.576), assuming N, the average initial population size for each group, is 7.96, as observed in the control group. These proportions of normal fetuses become (0.765, 0.673, 0.524, 0.509) if we assume N = 9. Note that for any value assumed for N, the ratios to the control group are still (0.878, 0.684, 0.666). This shows that the excess risks for the low, medium, and high dose groups are roughly 0.122, 0.316, and 0.334 (the actual dose values are not available).
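These ratios and excess risks can be reproduced directly from the quoted proportions; a quick numerical check:

```python
# Numerical check of the dose-response summaries quoted above.
# Adjusted proportions of normal fetuses (control, low, medium, high dose)
# as given in the text, assuming N = 7.96.
props = [0.865, 0.760, 0.592, 0.576]

# Ratio of each dose group to the control group.
ratios = [p / props[0] for p in props[1:]]
# Excess risk relative to control is one minus the ratio.
excess = [1 - r for r in ratios]

print([round(r, 3) for r in ratios])   # close to (0.878, 0.684, 0.666)
print([round(e, 3) for e in excess])   # close to (0.122, 0.316, 0.334)
```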
Parametric models are usually used to quantify the relationship between the response and dose levels and other covariates. This will allow us to determine the virtually safe dose (Ryan, 1992). Traditional risk assessment is based on the probabilities of abnormal fetuses within each litter. The risk of death in utero is not accounted for because the inference
Figure 2.4 Progesterone levels (on the log scale) plotted against the number of days in a standardized menstrual cycle. The solid line indicates the estimated population mean curve.
is based on successful implantations. The additional risk of death in utero due to the toxic agent has to be modeled to obtain the overall risk. Estimation of the overall risk ratios for different dose levels will depend on modeling the average number of responses (including deaths in utero) per dam, which may not be directly observed or reliably recorded. If death in utero is ignored, the risk due to the agent will be underestimated.
Developing offspring are subject to the risk of failing to form due to toxins, although the number of such littermates may not always be fully observed. This risk can also be interpreted as unsuccessful implantation. The offspring will be further
subject to adverse effects after successful implantation. It seems appropriate to take death in utero, as well as other observed abnormalities in fetuses, into account in risk assessment, although this is not currently required by regulatory agencies such as the Environmental Protection Agency and the Food and Drug Administration.
Figure 2.5 Parallel lines of pain scores for active and placebo groups.
Figure 2.6 Boxplots of pain scores for placebo and active groups.
Figure 2.7 Histograms of pain scores for active and placebo groups.
incomes as years of education increase, and the average income in the South is lower than that elsewhere for women with the same education.
When modeling this dataset, the candidate covariates are (i) age: the age at the time of the survey; (ii) grade: the current grade that has been completed; and (iii) south: whether a person came from the South.
Figure 2.8 Scatter plot of log wage against education for the South and other regions.
Figure 2.9 Boxplots of the log wage of women with different years of education.
Figure 2.10 Time series plots of the values of total cyanophytes (mg/L) from 4 sites (30004, 30016, 30017, 30018).
Figure 2.11 Q-Q plot for total Cyanophyte counts from 20 sites.
Chapter 3
Model Framework and Its Components
Suppose that data are collected from N subjects, and Yi is the response data from
subject i, where i = 1, . . . , N. We first assume Yi is univariate in this chapter, and
then allow Yi to be a vector of responses in longitudinal or clustered data analysis in
Chapter 4. The predictors associated with Yi are collected in a vector of dimension
p, Xi. The design matrix for all the observations is then the p × N matrix X = (X1, . . . , XN). The response vector is Y = (Y1, . . . , YN)>. Note that Yi is also referred to as the dependent variable, output variable, or endogenous variable in econometrics, and the Xi vector as the independent variables, input variables, explanatory variables, or exogenous variables.
factors. When we are interested in investigating the relationship between Y and X (causal or just predictive), we ask how Y changes when X changes. This is reflected in the conditional distribution function f (y|x). For example, let Y be the bodyweight of an individual from the Australian population. Its randomness is partially caused by the biological difference due to X = male or female; we may therefore be interested in the conditional distributions of Y given X, f (y|X = Male) and f (y|X = Female). In this case, if the normal family is used, we can allow a shift in the mean to reflect the difference between these two distributions.
However, the population with a larger mean may exhibit a larger variance as well. To
this end, we can either model variance just as we model the mean, or we can link the
variance as a function of the mean.
Each distribution family implies a functional relationship between the mean and the variance. If X is a continuous variable such as height in inches, f (y|X = x) is then a function of x ( f may not be continuous in
x). Intuitively, conditioning on X explains certain variations due to X, so the variance
of the residuals should shrink. As more and more useful predictors (Xi ) are used,
f (y|Xi ) should have smaller and smaller variance leading to more accurate prediction.
Of course, this accurate prediction relies on our accurate description of f (y|Xi ), which
is not easy especially when the dimension of Xi becomes large.
Exercise 1. If var(Y) exists, prove E(var(Y|X)) ≤ var(Y) by establishing the following identity:

var(Y) = E(var(Y|X)) + var(E(Y|X)).
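A numerical illustration of the identity in Exercise 1 on a small discrete joint distribution (the values below are illustrative):

```python
# Numerical check of var(Y) = E(var(Y|X)) + var(E(Y|X)) on a small
# discrete joint distribution (illustrative values).
# X takes 0 or 1 with equal probability; Y|X is a two-point distribution.
vals = {0: [(1.0, 0.5), (3.0, 0.5)],   # X = 0: Y is 1 or 3, each w.p. 1/2
        1: [(2.0, 0.5), (6.0, 0.5)]}   # X = 1: Y is 2 or 6, each w.p. 1/2
px = {0: 0.5, 1: 0.5}

def mean(d):  # mean of a list of (value, prob) pairs
    return sum(v * p for v, p in d)

def var(d):   # variance of a list of (value, prob) pairs
    m = mean(d)
    return sum(p * (v - m) ** 2 for v, p in d)

# marginal distribution of Y
marg = [(v, p * px[x]) for x, d in vals.items() for v, p in d]

lhs = var(marg)                                         # var(Y)
e_condvar = sum(px[x] * var(d) for x, d in vals.items())  # E(var(Y|X))
var_condmean = var([(mean(d), px[x]) for x, d in vals.items()])  # var(E(Y|X))
rhs = e_condvar + var_condmean
print(lhs, rhs)   # both equal 3.5 for these values
```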
Exercise 2. Generate 1,000 normal responses yi from a linear model in R, with mean µi = a + bxi and variance σ²i = (µi/75)², where xi is either 0 or 1 with 50% probability. Obtain the density plot for the two groups, with a = 165 and b = 5. This may represent male and female heights (cm) from 1,000 people. What if we wish to have a skewed distribution; what distribution could we use? Revise the R code accordingly.
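A minimal simulation sketch for Exercise 2 (written in Python rather than R; the density-plot step is omitted, and the seed is arbitrary):

```python
# Sketch of the simulation in Exercise 2: normal responses with
# group-dependent mean a + b*x and variance (mu/75)^2.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a, b = 165.0, 5.0
x = rng.integers(0, 2, size=n)     # x_i is 0 or 1 with 50% probability
mu = a + b * x                     # group means 165 and 170
sigma = mu / 75.0                  # sd chosen so that variance = (mu/75)^2
y = rng.normal(mu, sigma)

# For a skewed alternative one could instead draw from, e.g., a Gamma
# distribution with the same mean and variance.
print(y[x == 0].mean(), y[x == 1].mean())
```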
In classical regression with regressors Xi , we assume f (y|Xi ) is normal,
N(Xi> β, σ2 ). As Xi changes, the response mean is assumed to change in a linear fash-
ion. Here Xi can be random or fixed, and the marginal density f (y) is not of interest.
Table 3.1 Components of commonly used exponential families: φ is the dispersion parameter, b(θ) is the cumulant function, µ(θ) is the mean as a function of the natural parameter, V(µ) is the mean-variance relationship, and the final row gives the inverse function of µ(θ), the canonical link to η = X>β. If the nominated link function h(µ) = X>β is not the canonical one, then η = X>β is no longer the natural parameter θ.
        Normal     Poisson   Binomial          Gamma      IGaussian
        N(µ, σ²)   Poi(λ)    Binom(m, π)       Γ(µ, ν)    IG(µ, λ)
φ       σ²         1         1/m               1/ν        1/λ
b(θ)    θ²/2       e^θ       m log(1 + e^θ)    −log(−θ)   −√(−2θ)
µ(θ)    θ          e^θ       me^θ/(1 + e^θ)    −1/θ       1/√(−2θ)
V(µ)    1          µ         µ(m − µ)          µ²         µ³
θ       µ          log(µ)    log{µ/(m − µ)}    −1/µ       −1/(2µ²)
var(Y) = φV(µ),
f (y|µ, σ²) = (1/√(2πσ²)) exp{−(y − µ)²/(2σ²)} ∝ exp{(yµ − µ²/2)/σ²}.
Here the proportionality symbol ∝ is used because the constant terms free of the parameter of interest (i.e., θ in this case) are ignored. Clearly, we have b(θ) = θ²/2, and σ² is the dispersion parameter.
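The entries of Table 3.1 can be verified numerically from b(θ), since the mean is b′(θ) and the mean-variance relationship is V(µ) = b′′(θ); a sketch for the Poisson column (the step size and the θ value are arbitrary):

```python
# Numerical check that, for the Poisson column of Table 3.1 with
# b(theta) = exp(theta), the mean b'(theta) equals exp(theta) and the
# variance function b''(theta) equals the mean, i.e. V(mu) = mu.
import math

def num_deriv(f, t, h=1e-4):
    # central finite difference
    return (f(t + h) - f(t - h)) / (2 * h)

b = math.exp                 # cumulant function for the Poisson family
theta = 0.7
mu = num_deriv(b, theta)     # should be close to exp(theta)
V = num_deriv(lambda t: num_deriv(b, t), theta)  # should be close to mu
print(mu, V)
```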
Example 3. The inverse Gaussian distribution describes the time to the first passage of a Brownian motion and has pdf

f (y|µ, λ) = √{λ/(2πy³)} exp{−λ(y − µ)²/(2µ²y)} ∝ exp{λ(yθ + √(−2θ))},

in which the natural parameter θ = −1/(2µ²) is always negative and b(θ) = −√(−2θ). Therefore, under the canonical link, the linear score η = X>β must be negative.
Example 4. Let us now consider a general power function including two limiting cases, the Gamma and Poisson distributions:

b(θ) = {θ/(τ − 1)}^τ (τ − 1)/τ   if τ ≠ 0 or 1,
b(θ) = −log(−θ)                  if τ = 0 (Gamma distribution),
b(θ) = e^θ                       if τ → ∞ (Poisson distribution).

The normal and inverse Gaussian distributions are included when τ = 2 and τ = 1/2, respectively. The distributions corresponding to other τ values are known as the Tweedie exponential family (Tweedie, 1984; Jørgensen, 1997). The power relationship between the mean µ = b′(θ) and the variance b′′(θ) is seen from b(θ) = {θ/(τ − 1)}^τ (τ − 1)/τ, where τ ≠ 1 and can be positive or negative: µ = b′(θ) = {θ/(τ − 1)}^{τ−1}, so b′′(θ) = µ^{(τ−2)/(τ−1)}. Note that the limiting value of b(θ) is −log(−θ) as τ → 0. It is interesting to see that τ = 2, 0, and 1/2 correspond to the normal, Gamma, and inverse Gaussian distributions, respectively.
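The limiting Gamma case can be checked numerically: as τ → 0, the mean b′(θ) = {θ/(τ − 1)}^{τ−1} approaches −1/θ and the variance power (τ − 2)/(τ − 1) approaches 2, matching the Gamma column of Table 3.1. A small sketch (the θ value and the small τ are arbitrary):

```python
# Numerical check of the power family near the Gamma limit tau -> 0:
# b'(theta) = {theta/(tau-1)}^(tau-1) should approach -1/theta, and the
# variance power (tau-2)/(tau-1) should approach 2, i.e. V(mu) = mu^2.
tau = 1e-6          # close to the Gamma limit tau = 0
theta = -2.0        # natural parameter (negative)

mu = (theta / (tau - 1)) ** (tau - 1)   # b'(theta); base is positive here
power = (tau - 2) / (tau - 1)           # variance power, close to 2
print(mu, -1 / theta, power)
```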
nential family in the sense that f (y|X0>β) is unspecified and does not have to take the linear exponential form. Huang and Rathouz (2012) proposed a semiparametric empirical likelihood approach for estimating β.
This tilted model bears some similarity to the proportional hazards model (Cox, 1972), where the hazard function f (y)/(1 − F(y)) instead of the density f (y) is tilted. The exponential distribution family provides a family of distributions indexed by the canonical parameter values, but the family does not specify how these family members are linked. The GLM hinges on the exponential family with a link function specifying
how the canonical values are linked to the covariates. The variance is still the derivative of the mean with respect to θ, but no explicit forms exist in general for either the mean or the variance functions.
The tilted exponential family provides more than just a family of distributions,
and the “canonical” parameter value is implicitly defined by the desired marginal
mean via the link function. The “canonical” parameter here plays a tilting role so
that the reference distribution is tilted as the distribution at the new mean values.
Much more work is needed on this topic for analysis of longitudinal data.
3.2 Quasi-Likelihood
When the true likelihood is difficult to specify, we can rely on the quasi-likelihood (QL) function proposed by Wedderburn (1974). Instead of claiming that the true distribution of yi is known, we only need to model the first two moments of the distribution and construct a "likelihood"-like function (the so-called quasi-likelihood) to work with. In the regression context, we specify only the relationship between the mean and covariates, and the variance as a function of the mean (for weighting).
We have assumed that Y is a random variable with mean µ = h(X>β) and variance φV(µ). The QL (on the log scale) for Yi is defined as (Wedderburn, 1974)

Q(µi; Yi) = ∫_{Yi}^{µi} (Yi − t) / {φV(t)} dt.
The overall QL for all the independent observations, Y = (Y1, . . . , YN), is defined as

Q(µ; Y) = Σ_{i=1}^{N} Q(µi; Yi),

where µ is the mean vector.
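As a concrete check, for the Poisson-type variance V(t) = t and φ = 1 the integral has the closed form Q(µ; y) = y log(µ/y) − (µ − y), which matches the Poisson log-likelihood up to a term free of µ. A numerical sketch (the function names and the trial values are ours):

```python
# Numerical check of Q(mu; y) = int_y^mu (y - t)/(phi V(t)) dt for the
# Poisson-type variance V(t) = t, phi = 1, against the closed form
# y*log(mu/y) - (mu - y).
import math

def Q_numeric(mu, y, V=lambda t: t, phi=1.0, steps=100000):
    # simple trapezoidal rule on [y, mu]
    h = (mu - y) / steps
    f = lambda t: (y - t) / (phi * V(t))
    s = 0.5 * (f(y) + f(mu)) + sum(f(y + i * h) for i in range(1, steps))
    return s * h

y, mu = 3.0, 4.5
closed = y * math.log(mu / y) - (mu - y)
print(Q_numeric(mu, y), closed)   # the two values agree closely
```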
Remark 1: The QL is a likelihood function, but it is not meant to be the true
likelihood that generates the observed data.
Remark 2: The QL function is more than just a “working” likelihood in the sense
that it is meant to be an approximation to the true likelihood when only the first two
moments are matched.
Remark 3: The QL function can be treated as the true likelihood in the sense that the resultant score functions and the information matrix are valid.
Remark 4: The true score function may not be a linear combination of the data
(for example, y2i can be involved), and in that case QL is still valid for estimating
β and deriving the information matrix. Of course, if the true likelihood is used, the
MLE will be more efficient but at the cost that potential biases exist in β̂ when the
likelihood is misspecified although the first two moments are correctly matched.
Remark 5: If the variance function V(µ) is incorrect, the QL score function is still valid for estimating β, but the information matrix becomes invalid.
For dependent observations, multivariate analysis will be used. The QL can be defined in a similar way,

Q(µ; Y) = ∫_{t(s)} (Y − t)> V⁻¹(t) dt,

where t(s) is a smooth curve of dimension n joining Y and µ. For this integral to be meaningful as a likelihood, it needs to be path-independent (§9.3, McCullagh and Nelder, 1989). For longitudinal data analysis, the joint likelihood for the n random variables from one subject can be written in the form of n products as,
variables from some subject can be written in the form of n products as,
f (y1 , y2 , ..., yn ) = f (y1 ) f (y2 |y1 ) f (y3 |y1 , y2 )... f (yn |y1 , y2 , ..., yn−1 ). (3.5)
Wang (1996) constructed the quasi-likelihood (QL) for f (y j |y1 , y2 , ..., y j−1 ) in the
context of categorical data analysis. An interesting application was done by Wang
(1999a) for estimating a population size from removal data. The conditional quasi
likelihood was constructed via the conditional mean and conditional variance of the
catch given previous total catches.
Remark 6: This approach can be easily extended for any multivariate variables by
modeling the conditional mean and variance. The resultant QL function is, in general,
not a scaled version of the ordinary log-likelihood function.
The question is to what extent the estimation and inference are still valid. It is amazing how valid this approach can be, even for binary data. Whittle (1961) and Crowder (1985) introduced Gaussian estimation as a vehicle for estimation without requiring that the data be normally distributed. The function above may also be called pseudo-likelihood (Davidian and Carroll, 1987). For longitudinal data analysis, the Gaussian approach can also be applied (Crowder, 1985, 2001; Wang and Zhao, 2007). Recall that the scalar response yij is observed for cluster i (i = 1, . . . , N) at time j ( j = 1, . . . , ni). For the ith cluster, let Yi = (yi1, . . . , yini)T be an ni × 1 response vector, and let µi = E(Yi) be the corresponding ni × 1 mean vector. We denote Cov(Yi) by Σi, which has the general form φVi, where Vi = Ai^{1/2} Ri Ai^{1/2}, with Ai = diag{Var(yit)} and Ri the correlation matrix of Yi. For independent data, Σi is just φAi.
The Gaussian log-likelihood (multiplied by −2) for the data (Y1, . . . , YN) is

LG(β; τ) = Σ_{i=1}^{N} { log |Σi| + (Yi − µi)> Σi⁻¹ (Yi − µi) },   (3.7)
where β is the vector of regression coefficients governing the mean and τ is the vector
of additional parameters needed to model the covariance structure realistically. Later
on we will let θ collect all the parameters including both β and τ. Then, we can write
µi and Σi in a parametric form, respectively: µi = µi(β) and Σi = Σi(β, τ). The Gaussian estimation is performed by minimizing LG(θ) over θ.
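As a sanity check, in the simplest case of independent scalar responses with a common mean and known variance, LG reduces to Σi {log σ² + (yi − µ)²/σ²}, and minimizing over µ recovers the sample mean. A minimal sketch (the data values are illustrative):

```python
# Sketch of Gaussian estimation in the scalar independent case:
# L_G(mu) = sum_i { log(sigma2) + (y_i - mu)^2 / sigma2 } is minimized
# at the sample mean.
import math

y = [1.2, 0.7, 2.1, 1.6, 0.9]

def LG(mu, sigma2):
    return sum(math.log(sigma2) + (yi - mu) ** 2 / sigma2 for yi in y)

# coarse grid search over mu with sigma2 held fixed
grid = [i / 1000 for i in range(0, 3001)]
mu_hat = min(grid, key=lambda m: LG(m, 1.0))
print(mu_hat, sum(y) / len(y))   # both close to the sample mean 1.3
```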
Exercise 3. Suppose independent data (y1, . . . , yN) are generated from the distribution function (3.1). Prove that Σ_{i=1}^{N} T(yi) is a sufficient statistic for θ.
Exercise 4. For the exponential distribution family given by (3.2), show that E(Y) = b′(θ) and var(Y) = φb′′(θ).
Exercise 5. Suppose Y has a compound distribution in the sense that Y|θ ∼ Poisson(θ) and θ|λ is also Poisson(λ). Show that
(a) Var(Y) = 2λ.
(b) Work out the probability P(Y = 5).
(c) Verify Σ_{i=0}^{∞} P(Y = i) = 1.
(d) Verify Σ_{i=0}^{∞} iP(Y = i) = λ.
(e) What is Σ_{i=0}^{∞} i²P(Y = i)?
Denote the linear predictor Xi β by ηi and let G(ηi ) = ∂θ/∂ηi , which is 1 when the canonical
link is used. The corresponding score function is

∑_{i=1}^{N} Xi G(ηi )(Yi − µi ).        (3.8)
As we can see, under the canonical link the estimating functions take the following
simple form,

S(β) = ∑_{i=1}^{N} Xi (Yi − µi ),        (3.9)

which is a linear combination of the residuals Yi − µi .
When the link function is not canonical, the log-likelihood function may not be
convex in β and finding the MLE may incur numerical problems. This can also be
seen in (3.8) as the unknown parameters also appear in G(·), which can make the
estimating functions nonmonotonic even if µi is monotonic in β.
For likelihood-based estimates, the covariance of β̂ is given by the inverse of the
Fisher information matrix, −E(∂²L/∂β∂βᵀ ), regardless of whether the canonical link is used.
However, if (3.9) is used as the estimating function for solving β and the canonical link
is not used, because either a noncanonical function is used for the mean function or
we are not sure whether the data come from the linear exponential family, we will have

β̂ − β̃ ∼ N( 0, (∑_{i=1}^{N} Xiᵀ Di Xi )⁻¹ { ∑_{i=1}^{N} Xiᵀ var(Yi ) Xi } (∑_{i=1}^{N} Xiᵀ Di Xi )⁻¹ ).

Note that var(Yi ) is usually unknown and will be estimated either by a variance
model or by (Yi − µ̂i )². In the latter case, the sample size N must be large enough
that a limit theorem can be applied to approximate ∑_{i=1}^{N} Diᵀ (Yi − µ̂i )² Di /N by
∑_{i=1}^{N} E{Diᵀ (Yi − µi )² Di }/N.
Example 5. Poisson regression with a canonical or noncanonical link
Suppose Yi |Xi ∼ Poi(µi ), so the log-likelihood function is

∑_{i=1}^{N} {Yi log(µi ) − µi }.

If the canonical link is used, log(µi ) = Xi β, we have the following score functions for β,

∑_{i=1}^{N} Xi (Yi − e^{Xi β} ).

Now let us consider a noncanonical link, for example θ = log(µi ) = exp(Xi β). The
resultant score functions take the form

∑_{i=1}^{N} Xi e^{Xi β} {Yi − exp(e^{Xi β} )}.
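The canonical-link score equations above can be solved by Newton–Raphson (equivalently, Fisher scoring); a minimal numpy sketch with simulated data (the function names, simulation setup, and coefficient values are all ours):

```python
import numpy as np

def poisson_score(beta, X, y):
    # Score under the canonical log link: sum_i X_i (y_i - exp(X_i' beta))
    return X.T @ (y - np.exp(X @ beta))

def fit_poisson(X, y, n_iter=25):
    """Newton-Raphson for the canonical-link Poisson score equations."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        info = X.T @ (mu[:, None] * X)   # Fisher information sum_i mu_i X_i X_i'
        beta = beta + np.linalg.solve(info, X.T @ (y - mu))
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.poisson(np.exp(0.5 + 0.3 * X[:, 1]))
beta_hat = fit_poisson(X, y)
print(np.max(np.abs(poisson_score(beta_hat, X, y))))  # essentially zero at the solution
```

Under the canonical link the information matrix is guaranteed positive definite, which is why this iteration is well behaved; with a noncanonical link the score can be nonmonotone, as discussed below.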
Clearly, for a given α, f (y|α, β) belongs to the linear exponential family with the
natural parameter θ = −β−1 . On the other hand, if β is known (or does not change
with mean and treated as a nuisance parameter), f (y|α, β) belongs to the nonlinear
exponential distribution family with θ = (α − 1) and T(y) = log(y). For this gamma
distribution, we know µi = αβ and σ²i = αβ² = µi β = µ²i /α. This indicates that the
variance can be modeled as proportional to µi when fixing β (nonlinear exponential
distribution with T(y) = log(y)) or as proportional to µ²i when fixing α (linear
exponential distribution).
How about other power relationships? Suppose we desire σ²i ∝ µi^γ to
model variance heterogeneity. This can be achieved when α changes with β in such a way
that α = β^{(γ−2)/(γ−1)}.
Exercise 6. Prove that (3.8) coincides with the result when applying the Gauss-
Markov theorem to the residuals Yi − µi .
Exercise 7. Suppose that yi is generated from the normal distribution N(µi , σ²µ²i ),
where µi = Xi β. Obtain the score functions for β. Explain why they do not take the
form of (3.9) or (3.8). Are there any link functions µi = g(Xi β) such that the resultant
score functions take the form of (3.9) or (3.8)?
Unlike the classical linear regression model, which can only handle normally
distributed data, GLM extends the approach to count data, binary data, and continuous
data that need not be normal. Therefore, GLM is applicable to a wider range of
data analysis problems.
Under the assumption that E(Yi ) = µi , regardless of whether Yi is continuous, we can
consider the following regression model for estimating the parameters β in the mean
function µi ,

Yi = µi + εi .

Here εi is simply the difference between the observed Yi and its expectation µi . The
ordinary least squares approach leads to the following estimating function, in which
Di is the Jacobian matrix ∂µi /∂β,

∑_{i=1}^{N} Diᵀ {Yi − µi (β)}.        (3.10)
∑_{i=1}^{N} Diᵀ σi⁻²(β){Yi − µi (β)}.        (3.12)
Exercise 8. Suppose β̂w is obtained from (3.12) and β̂ora is an oracle estimator
obtained by solving

∑_{i=1}^{N} Diᵀ σi⁻²(β̃){Yi − µi (β)},
Exercise 9. Show that (3.12) is the score function if Yi is from the linear exponen-
tial distribution family.
3.5 Marginal Models
Marginal models aim to make inferences at the population level. This approach
models the marginal distributions f (yi j ) at each time point, j = 1, 2, . . . , ni ,
instead of the joint distribution of all the repeated measures.
The emphasis is often on the mean (and variance) of the univariate response yi j at
the population level, rather than on individual i at any given fixed covariates. A
covariance structure is often nominated to account for the within-subject correlations.
A feature of marginal models is that the models for the mean and the covariance
are specified separately. Marginal models are considered a natural approach when we
wish to extend the generalized linear model methods for the analysis of independent
observations to the setting of correlated responses.
Specification of the mean function is the primary task in the GEE regression model.
If the mean function is not correctly specified, the analysis will be meaningless.
Within the GLM framework, the link function connects the mean to a linear
combination of the covariates. The link function is called the canonical link if it
equals the canonical parameter. Different distribution models are associated with
different canonical links: for Normal, Poisson, Binomial, and Gamma random
components, the canonical links are the identity, log, logit, and inverse links,
respectively.
We assume that the response variable, yi j , has mean µi j and variance φσ²i j . Here φ
is an unknown scale parameter, µi j and σ²i j are known functions of the covariates,
and Xi j is a p × 1 vector. Let µi = (µi j ) be the marginal mean vector for subject
i. The covariance matrix of Yi , Σi , is assumed to have the form φAi^{1/2} Ri (α) Ai^{1/2} , in
which Ai = diag(σ²i j ) and Ri (α) is the correlation matrix parameterized by α, a q-vector.
We first consider the special case where ni = 1 and all observations are independent
of each other. In generalized linear models (GLMs), we have µi j = h(Xi jᵀ β), where
h(·) is the inverse of the link function, so the mean is a function of some linear
combination of the covariates, and the variance σ²i j is related to the covariates via
some known function of the mean.
The counterpart of the random effect model is a marginal model. A marginal
model is often used when inference about population averages is of interest. The
mean response modeled in a marginal model is conditional only on covariates and
not on random effects. In marginal models, the mean response and the covariance
structure are modeled separately.
We assume that the marginal density of yi j is from the exponential family (3.2);
that is, each yi j is assumed to have a distribution from the exponential family.
Specifically, with marginal models we make the following assumptions:
• The marginal expectation of the response, E(yi j ) = µi j , depends on the explanatory
variables, Xi j , through a known link function, the inverse function of h.
• The marginal variance depends on the marginal mean,
var(yi j ) = φν(µi j ),
in which ν(µi j ) is a known “variance function”, and φ is a scale parameter that may
need to be estimated.
• The correlation between yi j and yik is a function of some covariates (usually just
time) with a set of additional parameters, say α, that may also need to be estimated.
Here are some examples of marginal models:
• Continuous responses:
1. µi j = ηi j = Xi jᵀ β (i.e., linear regression), identity link
2. var(yi j ) = φ (i.e., homogeneous variance)
3. Corr(yi j , yik ) = α^{|k− j|} (i.e., autoregressive correlation)
• Binary response:
1. logit(µi j ) = ηi j = Xi jᵀ β (i.e., logistic regression), logit link
2. var(yi j ) = µi j (1 − µi j ) (i.e., Bernoulli variance)
3. Corr(yi j , yik ) = α jk (i.e., unstructured correlation)
• Count data:
1. log(µi j ) = ηi j = Xi jᵀ β (i.e., Poisson regression), log link
2. var(yi j ) = φµi j (i.e., extra-Poisson variance)
3. Corr(yi j , yik ) = α (i.e., compound symmetry correlation)
A widely used family of variance functions is the power form

V(µ) = µ^γ .

The most common values of γ are 0, 1, 2, and 3, which are associated with the Normal,
Poisson, Gamma, and Inverse Gaussian distributions, respectively. K. (1981)
also discussed distributions with this power variance function and showed that an
exponential family exists for γ = 0 and γ ≥ 1. Jørgensen (1997) summarized
the Tweedie exponential dispersion models and concluded that no such distributions
exist for 0 < γ < 1. For 1 < γ < 2 the distribution is compound Poisson; for 2 < γ < 3 and
γ > 3 it is a positive stable distribution. The Tweedie exponential dispersion model is
denoted Y ∼ Tw_γ (µ, φ). By definition, this model has mean µ and variance

Var(Y) = φµ^γ .
Now we try to find the exponential dispersion model corresponding to V(µ) = µ^γ . The
exponential dispersion model extends the natural exponential families and includes
many standard families of distributions. However, from the likelihood perspective,
it is interesting to ask which other likelihood functions can lead to such variance
functions. Denote the exponential dispersion model by ED(µ, φ); its density is defined
with respect to a given σ-finite measure υ on R (the set of all real numbers). The
parameter θ is called the canonical parameter, λ is called the index parameter (and
φ = 1/λ the dispersion parameter), and µ is called the mean value parameter.
The cumulant generating function of Y ∼ ED(µ, φ) can be expressed through the unit
cumulant function. Let κγ and τγ denote the corresponding unit cumulant function and
mean value mapping, respectively. For exponential dispersion models, we have the
following relations

∂τγ⁻¹/∂µ = 1/Vγ (µ)

and

κγ′(θ) = τγ (θ).

If the exponential dispersion model corresponding to Vγ exists, we must solve the
following two differential equations,

∂τγ⁻¹/∂µ = µ^{−γ} ,        (3.13)

and

κγ′(θ) = τγ (θ).        (3.14)

In both (3.13) and (3.14) we have ignored the constants in the solutions, which
does not affect the results.
If an exponential dispersion model corresponding to (3.6) exists, the cumulant
generating function of the corresponding convolution model is

Kγ (s; θ, λ) = λκγ (θ){(1 + s/(θλ))^ϕ − 1}   if γ ≠ 1, 2,
Kγ (s; θ, λ) = λe^θ {exp(s/λ) − 1}           if γ = 1,
Kγ (s; θ, λ) = −λ log{1 + s/(θλ)}            if γ = 2.
We now consider the case α < 0, corresponding to 1 < γ < 2; this shows that the Tweedie
model with 1 < γ < 2 is a compound Poisson distribution.
Let N and X1 , . . . , XN denote a sequence of independent random variables, such
that N is Poisson distributed, N ∼ Poisson(m), and the Xi are identically distributed.
Define

Z = ∑_{i=1}^{N} Xi ,        (3.17)

where Z is defined as 0 for N = 0. The distribution (3.17) is a compound Poisson
distribution. Now we assume that m = λκγ (θ) and Xi ∼ Γ(ϕ/θ, −ϕ). Note that, by
the convolution formula, we have Z|N = n ∼ Γ(ϕ/θ, −nϕ). The moment generating
function of Z is

E(e^{sZ} ) = exp[λκγ (θ){(1 + s/θ)^ϕ − 1}].
This shows that Z is a Tweedie model. We can obtain the joint density of Z and N,
for n ≥ 1 and z > 0,

pZ,N (z, n; θ, λ, ϕ) = {(−θ)^{−nϕ} m^n z^{−nϕ−1} / (Γ(−nϕ) n!)} exp{θz − m}
                    = {λ^n κγ^n (−1/z) / (Γ(−nϕ) n! z)} exp{θz − λκγ (θ)}.

The distribution of Z is continuous for z > 0, and summing over n, the density of Z is

p(z; θ, λ, ϕ) = (1/z) ∑_{n=1}^{∞} {λ^n κγ^n (−1/z) / (Γ(−nϕ) n!)} exp{θz − λκγ (θ)}.        (3.18)
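The compound Poisson construction (3.17) is easy to simulate, which gives a quick numerical check of two elementary facts: E(Z) = m E(X) (Wald's identity) and P(Z = 0) = P(N = 0) = e^{−m}. The gamma shape and scale values below are illustrative choices of ours, not tied to a particular (θ, λ, ϕ):

```python
import numpy as np

# Simulate Z = sum_{i=1}^{N} X_i with N ~ Poisson(m) and i.i.d. gamma jumps X_i.
rng = np.random.default_rng(1)
m, shape, scale = 3.0, 2.0, 1.0     # illustrative values (ours)
n_rep = 50_000
N = rng.poisson(m, size=n_rep)
Z = np.array([rng.gamma(shape, scale, size=n).sum() for n in N])

# E(Z) = m * E(X) = 3 * 2 = 6, and the point mass at zero has P(Z = 0) = exp(-m).
print(Z.mean())           # close to 6
print((Z == 0).mean())    # close to exp(-3) ~ 0.0498
```

The simulation also makes visible the mixed nature of the distribution: a point mass at zero together with a continuous density on z > 0, exactly as in (3.18).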
R =
| 1     α1    α2    ···  αm     0     ···  0 |
| α1    1     α1    ···  αm−1   αm    ···  0 |
| ···                                        |
| 0     0     ···   ···  α2     α1    1     |

The case of m = 1 corresponds to the moving average (MA) structure in time series
modeling. Note that α1 is the assumed or imposed lag-one correlation for every subject:
corr(y11 , y12 ) and corr(y12 , y13 ) for i = 1, corr(y21 , y22 ) and corr(y22 , y23 ) for i = 2,
and so on. This structure may be unreasonable if the observation times ti j are
1, 2, 4, . . . for i = 1 and 2, 3, 4 for i = 2.
• Exchangeable:

Corr(yi j , yik ) = 1 if j = k, and α if j ≠ k.
• Autoregressive correlation, AR(1):

corr(yi j , yi, j+t ) = α^t ,  t = 0, 1, 2, . . . , n − j.

The correlation matrix is

R =
| 1        α        α²       ···  α^{n−1} |
| α        1        α        ···  α^{n−2} |
| α²       α        1        ···  α^{n−3} |
| ···                                     |
| α^{n−1}  α^{n−2}  α^{n−3}  ···  1       |
• Toeplitz correlation (n − 1 parameters)
• Unstructured correlation:

Corr(yi j , yik ) = 1 if j = k, and α jk if j ≠ k.
Σ = σ²C,
where C = (ρ^{|i− j|} ), and σ² > 0 and −1 < ρ < 1 are unknown. Lee (1988) studied these
two correlation structures in the context of growth curves.
Note that these models are most suitable when all subjects have the same, equally
spaced observation times; for example, each subject is observed five times (Monday
to Friday). However, if Subject 1 is observed on Monday, Tuesday, and Friday, and
Subject 2 on Monday and Wednesday only, the corresponding two correlation
matrices under the AR(1) structure are

| 1   α   α⁴ |         | 1   α² |
| α   1   α³ |   and   | α²  1  | .
| α⁴  α³  1  |
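The unequal-spacing issue suggests indexing the correlation by the actual observation times, corr(yi j , yik ) = α^{|t j − tk |}; a small numpy sketch (the function name is ours) reproduces the two matrices above:

```python
import numpy as np

def ar1_corr_from_times(times, alpha):
    """AR(1)-type correlation indexed by actual observation times:
    corr(y_ij, y_ik) = alpha ** |t_j - t_k|."""
    t = np.asarray(times, dtype=float)
    return alpha ** np.abs(t[:, None] - t[None, :])

alpha = 0.6                                   # illustrative value (ours)
R1 = ar1_corr_from_times([1, 2, 5], alpha)    # e.g. Monday, Tuesday, Friday
R2 = ar1_corr_from_times([1, 3], alpha)       # e.g. Monday, Wednesday
print(np.isclose(R1[0, 2], alpha ** 4))       # True: lag 4 entry
print(np.isclose(R2[0, 1], alpha ** 2))       # True: lag 2 entry
```

Both subjects then share the same parameter α even though their matrices differ, which is the continuous-time view developed next.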
If the observation times are not of a lattice nature but rather lie in continuous time, an
exponential correlation structure is more appropriate (Diggle, 1988): ρ(|t j − ti |), where
ρ(u) = exp(−αu^c ), with c = 1 or 2. The case c = 1 is the continuous-time analog
of a first-order autoregressive process; the case c = 2 corresponds to an intrinsically
smoother process. This covariance structure can handle irregularly spaced time
sequences within experimental units that could arise through randomly missing
data or by design. Besides the aforementioned covariance structures, there are
other parametric families of covariance structures proposed to describe the correlation
of many types of repeated data. They can model quite parsimoniously a variety of
forms of dependence and accommodate arbitrary numbers and spacings of observation
times, which need not be the same for all subjects. Núñez-Antón and Woodworth
(1994) proposed a covariance model for analyzing unequally spaced data when the
error variance-covariance matrix has a structure that depends on the spacing between
observations. The covariance structure depends on the time intervals between
measurements rather than on the time order of the measurements. The main feature of the
structure is that it involves a power transformation of the times rather than of the time
intervals, and the power parameter is unknown.
The general form of the covariance matrix for a subject with k observations at
times 0 < t1 < t2 < . . . < tk is

(Σ)uv = (Σ)vu = σ² · α^{(tv^λ − tu^λ)/λ}   if λ ≠ 0,
(Σ)uv = (Σ)vu = σ² · α^{log(tv /tu )}       if λ = 0,

(1 ≤ u ≤ v ≤ k, 0 < α < 1). The covariance structure is characterized by the three-parameter
vector θ = (σ², α, λ). It differs from the uniform covariance structure with two
parameters, as well as from an unstructured multivariate normal distribution with
ni (ni − 1)/2 parameters. Modeling the covariance structure in continuous time removes
any requirement that the measurement sequences on the different units be made at a
common set of times.
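A sketch of building this covariance matrix for arbitrary positive times (the function name is ours; note that for u ≤ v the exponent (tv^λ − tu^λ)/λ is nonnegative for any λ ≠ 0, so the symmetric form can use its absolute value):

```python
import numpy as np

def power_time_cov(times, sigma2, alpha, lam):
    """(Sigma)_{uv} = sigma2 * alpha ** ((t_v**lam - t_u**lam)/lam) for lam != 0,
    and sigma2 * alpha ** log(t_v/t_u) for lam == 0 (times must be positive)."""
    t = np.asarray(times, dtype=float)
    if lam != 0:
        d = np.abs(t[:, None] ** lam - t[None, :] ** lam) / abs(lam)
    else:
        d = np.abs(np.log(t[:, None] / t[None, :]))
    return sigma2 * alpha ** d

S1 = power_time_cov([1.0, 2.0, 4.0], 1.0, 0.5, 1.0)
print(S1[0, 2])   # lam = 1 recovers the plain time-gap form: 0.5 ** (4 - 1) = 0.125
```

Setting λ = 1 recovers the ordinary time-gap exponent, while λ = 0 gives the log-time version, so the power parameter smoothly interpolates between the two regimes.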
Suppose there are five observations at times 0 < t1 < t2 < t3 < t4 < t5 . Denote

a = α^{(t2^λ − t1^λ)/λ} ,  b = α^{(t3^λ − t2^λ)/λ} ,  c = α^{(t4^λ − t3^λ)/λ} ,  d = α^{(t5^λ − t4^λ)/λ} .

Because the exponents telescope, the matrix can be written as

Σ = σ² ×
| 1     a     ab    abc   abcd |
| a     1     b     bc    bcd  |
| ab    b     1     c     cd   |
| abc   bc    c     1     d    |
| abcd  bcd   cd    d     1    |
The eigenvalues are 1 + 2α cos{ jπ/(n + 1)} for j = 1, 2, . . . , n, and hence the maximum
allowable α is min{1, 1/(2 cos(π/(n + 1)))}. The ( j, k)-th element ( j ≥ k) of its inverse
R⁻¹ is given by

b jk = (1 − b^{2n−2k+2} )(b^{ j+k+1} − b^{k− j+1} ) / {α(1 − b² )(1 − b^{2n+2} )},        (3.27)

where b = (√(1 − 4α²) − 1)/(2α), i.e., α = −b/(1 + b² ).
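The eigenvalue expression above can be checked numerically against a direct eigendecomposition of the tridiagonal (first-order moving-average) correlation matrix; a short numpy sketch (the function name and parameter values are ours):

```python
import numpy as np

def ma1_corr(n, alpha):
    """Tridiagonal (first-order moving-average) correlation matrix."""
    return np.eye(n) + alpha * (np.eye(n, k=1) + np.eye(n, k=-1))

n, alpha = 6, 0.4   # 0.4 < 1/(2 cos(pi/7)) ~ 0.555, so R is positive definite
R = ma1_corr(n, alpha)
j = np.arange(1, n + 1)
analytic = 1 + 2 * alpha * np.cos(j * np.pi / (n + 1))
print(np.allclose(np.sort(np.linalg.eigvalsh(R)), np.sort(analytic)))  # True
```

Checks like this are a cheap way to make sure an estimated α respects the positive-definiteness constraint discussed next.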
It is crucial to know these constraints so that we can ensure the resultant R matrix
is meaningful when the parameters are estimated from the data. More discussion
will be given towards the end of the next chapter.
Davidian and Giltinan (1995) provided a comprehensive description of mixed
effects models and their computational algorithms when different variance functions
and covariance structures are used.
Yi = Xi β + εi ,

Yi = Xi β + Zi bi + ηi ,        (3.28)

where
Xi is an ni × p “design matrix”;
β is a p × 1 vector of parameters referred to as fixed effects;
Zi is an ni × k “design matrix” that characterizes random variation in the response
attributable to among-unit sources;
bi is a k × 1 vector of unknown random effects;
and ηi is distributed as N(0, Ri ). Here Ri is an ni × ni positive-definite covariance
matrix reflecting “measurement” errors. In practice, Ri is often taken to be a diagonal
matrix.
Step 2. β is considered a vector of fixed parameters at the population level, while bi
contains unknown parameters specific to subject i. Here ηi is often assumed to be
independent. The bi values are distributed as N(0, G), independently of each other and
of the ηi . Here G is a k × k positive-definite covariance matrix.
The regression parameter vector β contains the fixed effects, which are assumed to be
the same for all individuals and have a population-averaged interpretation. In contrast
to β, the vector bi comprises subject-specific regression coefficients.
The conditional mean of Yi , given bi , is
E(Yi |bi ) = Xi β + Zi bi ,
which is the ith subject’s mean response profile. The marginal or population-averaged
mean of Yi is
E(Yi ) = Xi β.
Similarly,
var(Yi |bi ) = var(ηi ) = Ri
and
var(Yi ) = var(Zi bi ) + var(ηi ) = Zi G Ziᵀ + Ri .
Thus, the introduction of random effects, bi , induces correlation (marginally) among
the components of Yi . That is,
var(Yi ) = Σi = Zi G Ziᵀ + Ri ,
which has nonzero off-diagonal elements. Based on the assumptions on bi and ηi , we
have
Yi ∼ Nni (Xi β, Σi ).
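For the simplest special case, a random-intercept model with Zi = 1ni , the induced marginal covariance g J + σ²I can be computed directly; the numbers below are illustrative choices of ours:

```python
import numpy as np

# Random-intercept special case: Z_i = 1_{n_i}, G = [[g]], R_i = sigma2 * I,
# so var(Y_i) = Z_i G Z_i' + R_i = g * J + sigma2 * I.
ni, g, sigma2 = 4, 0.7, 1.2
Z = np.ones((ni, 1))
G = np.array([[g]])
R = sigma2 * np.eye(ni)
Sigma = Z @ G @ Z.T + R
print(Sigma[0, 1])  # 0.7: covariance induced between any two observations
print(Sigma[0, 0])  # 1.9 = g + sigma2
```

This makes explicit how a single random intercept induces the exchangeable (compound symmetry) correlation structure marginally: every off-diagonal covariance equals g.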
Chapter 4
Parameter Estimation
DOI: 10.1201/9781315153636-4
4.1 Likelihood Approach
We first introduce how β can be estimated when the within-subject correlations are
ignored. It is simple to estimate the regression parameters by adopting the GLM
approach when the data are independent and univariate. Consider the univariate
observations yi , i = 1, . . . , N, and the p × 1 covariate vector Xi . Let β be a p × 1 vector
of regression parameters and let the linear predictor be ηi = Xiᵀ β. Suppose Y =
(y1 , . . . , yN ) follows a distribution from a specific exponential family as given by (3.2),

f (y; θ, φ) = exp{ (yθ − b(θ))/φ + c(y, φ) },
with canonical parameter θ and dispersion parameter φ. For each yi , the log-likelihood is

Li (β, φ) = log f (yi ; θi , φ).

For y1 , . . . , yN , the joint log-likelihood is

L(β, φ) = ∑_{i=1}^{N} log f (yi ; θi , φ) = ∑_{i=1}^{N} Li (β, φ).
U(β; φ) = ∑_{i=1}^{N} Diᵀ Vi⁻¹ S i ,
The deviance function, which measures the discrepancy between the observation and
its expected value, is obtained from the analog of the log-likelihood-ratio statistic,

D(y; µ) = −2{Q(y; µ) − Q(y; y)} = −2 ∫_y^µ (y − u)/V(u) du.        (4.3)
Table 4.1 Quasi-likelihood and extended quasi-likelihood for a single observation yit . φ is
the dispersion parameter. The extended quasi-likelihood is Q+ (µit , yit ) = −0.5{φ⁻¹ D(µit , yit ) +
log{2πφA(µit )}}. Note that D(µit , yit ) and hence Q+ (µit , yit ) differ from those in Table 9.1 of
McCullagh and Nelder (1989).

Variance A(µit ) | Quasi-likelihood Q∗ (µit , yit ) | Deviance D(µit , yit ) | Canonical link ηit = g(µit )
1 | −(yit − µit )²/(2φ) | (yit − µit )² | µit
µit | (yit log µit − µit )/φ | 2{yit log(yit /µit ) − (yit − µit )} | log(µit )
µit (1 − µit ) | yit log{µit /(1 − µit )} + log(1 − µit ) | −2[yit log{µit /(1 − µit )} + log(1 − µit )] | log{µit /(1 − µit )}
µit + µ²it /k | yit log{µit /(k + µit )} + k log{k/(k + µit )} | −2[yit log{µit (k + yit )/(yit (k + µit ))} + k log{(k + yit )/(k + µit )}] | log{µit /(k + µit )}
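Definition (4.3) can be evaluated by numerical quadrature and compared with a known closed form; for the Poisson case V(u) = u the integral gives 2{y log(y/µ) − (y − µ)}. A sketch (the function name and grid size are ours):

```python
import numpy as np

def deviance_numeric(y, mu, V, n_grid=200_001):
    """Evaluate (4.3), D(y; mu) = -2 * int_y^mu (y - u)/V(u) du,
    by the trapezoidal rule on a uniform grid."""
    u = np.linspace(y, mu, n_grid)
    f = (y - u) / V(u)
    h = (mu - y) / (n_grid - 1)
    return -2.0 * h * (f.sum() - 0.5 * (f[0] + f[-1]))

y, mu = 3.0, 5.0
closed = 2 * (y * np.log(y / mu) - (y - mu))  # Poisson deviance, V(u) = u
print(deviance_numeric(y, mu, lambda u: u))   # ~0.935, matching the closed form
```

The same routine works for any variance function V, which is convenient when the quasi-likelihood integral has no simple closed form.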
Pan (2001b) derived an Akaike Information Criterion (AIC) based on the independence
QL for model selection in longitudinal data analysis.
Later, Hin and Wang (2009) discovered that this criterion is only valid for selecting
among the predictors (x1 , x2 , . . . , x p ), not for selecting correlation structures, because
it is derived under the independence model.
For estimating β, the QL estimating functions are again given by

∑_{i=1}^{N} Diᵀ Vi⁻¹ (Yi − µi ) = 0,        (4.7)

U(β) = ∑_{i=1}^{N} Diᵀ Vi⁻¹ (Yi − µi ) = 0,        (4.8)

which takes the same form as (4.7), but Vi is now a covariance matrix incorporating
a hypothesized correlation matrix.
As for the asymptotic variance of the estimators, because the true likelihood is
unavailable for further inference, we can rely on the approximate covariance of U(β). We
will provide more details when introducing the sandwich estimator. Heyde (1997)
provides a comprehensive description of quasi-likelihood theory and inference for
martingale families.
where

Uσi (β) = σi⁻² {1 − (yi − µi )²/σ²i } ∂σ²i /∂βᵀ .
For example, if we let g(µ) = µ^γ , where γ > 0, the above likelihood function does
not belong to the linear exponential family because yi and other power functions of yi
interact with µi . If both the mean and variance functions are correct, we have
E{∂LG (β, τ)/∂β} = 0. In ordinary regression, we rely only on linear combinations of
S i = (yi − µi ) to gain protection against misspecification of the variance model. We
can also achieve this by ignoring the second term Uσi (β) in ∂LG (β, τ)/∂β and using
only the first term for estimating β (with some notational abuse in UG ),

UG (β; τ) = ∑_{i=1}^{N} Diᵀ (yi − µi )/σ²i ,        (4.9)
UG (τ; β) = ∑_{i=1}^{N} σi⁻² {1 − (yi − µi )²/σ²i } ∂σ²i /∂τᵀ .        (4.10)
A key condition for consistency of the estimator is that the estimating equations
should be unbiased, at least asymptotically.
Again, to gain protection against misspecification of Σi = φVi , we drop the second
term in u(β j ; τ) and obtain the matrix form for β,

U(β; τ) = ∑_{i=1}^{N} Diᵀ Vi⁻¹ (Yi − µi ),        (4.13)

u(τ j ; β) = ∑_{i=1}^{N} tr[ {Σi⁻¹ (Yi − µi )(Yi − µi )ᵀ − I} Σi⁻¹ (∂Σi /∂τ j ) ].        (4.14)

We have E{U(β; τ)} = 0 when the mean assumption holds and E{u(τ j ; β)} = 0 when
the covariance assumption holds. This leads to consistency of the β estimates even
when the Σi matrices are incorrectly modeled.
Another perspective on these estimating functions uses the “decoupling” idea.
Write the mean µi explicitly as a function of β while writing Σi explicitly as a
function of β∗ (as if it were a new set of parameters),

LG∗ (β; τ) = ∑_{i=1}^{N} { log|Σi (β∗ )| + (Yi − µi (β))ᵀ Σi (β∗ )⁻¹ (Yi − µi (β)) }.

Minimization of LG∗ (β; τ) with respect to β leads to the estimating functions given
by (4.13).
Remark 2. log|Σi (β∗ )| is deemed a constant in this decoupling approach and can
be ignored.
Remark 3. In practice, one can simply replace β∗ by the previous β estimates when
minimizing LG∗ . For Gaussian estimation, under mild regularity conditions, when the
working correlation matrix is either correctly specified as the true one R̃i or chosen as
the identity matrix (the independence model), the Gaussian estimators of the regression
and variance parameters are consistent (Wang and Zhao, 2007). To see this, we check
when E{U(β; τ)} = 0 and E{u(τ j ; β)} = 0 hold. From equations (4.11) and (4.12), the
unbiasedness condition for θ j is

E( tr[ Σi⁻¹ (Yi − µi )(Yi − µi )ᵀ Σi⁻¹ (∂Σi /∂θ j ) ] ) − E( tr[ Σi⁻¹ (∂Σi /∂θ j ) ] ) = 0.        (4.15)
Now we transform (4.15) to see the condition more clearly. For notational
simplicity, let Σ̃i be the true covariance, so Σ̃i = E[(Yi − µi )(Yi − µi )ᵀ ] =
Ai^{1/2} R̃i Ai^{1/2} , where R̃i is the true correlation structure.
The left-hand side of (4.15) is

E( tr[ Σi⁻¹ (Yi − µi )(Yi − µi )ᵀ Σi⁻¹ (∂Σi /∂θ j ) ] ) − E( tr[ Σi⁻¹ (∂Σi /∂θ j ) ] )
= −tr{ (∂Σi⁻¹/∂θ j ) Σ̃i } + tr{ (∂Σi⁻¹/∂θ j ) Σi }
= −2 tr{ (∂Ai^{−1/2}/∂θ j ) Ri⁻¹ Ai^{−1/2} Ai^{1/2} R̃i Ai^{1/2} } + 2 tr{ (∂Ai^{−1/2}/∂θ j ) Ai^{1/2} }
= −2 tr{ (∂Ai^{−1/2}/∂θ j ) Ai^{1/2} Ri⁻¹ R̃i } + 2 tr{ (∂Ai^{−1/2}/∂θ j ) Ai^{1/2} }
= −2 tr{ (∂Ai^{−1/2}/∂θ j ) Ai^{1/2} (Ri⁻¹ R̃i − I) }.        (4.16)
It is clear that (4.16) will be 0 if Ri = R̃i . As both ∂Ai^{−1/2}/∂θ j and Ai are
diagonal matrices, (4.16) will also be 0 if the diagonal elements of Ri⁻¹ R̃i − I are
all 0. This will happen when Ri = I because the diagonal elements of R̃i are all 1.
Thus, we can conclude that under either of the two conditions, Ri = R̃i or Ri = I, the
Gaussian estimation will be consistent.
This implies that we can use the independence correlation structure if we have no idea
about the true one, and the resulting estimator will be consistent under mild regularity
conditions.
In general, (3.7) can be regarded as a “working” likelihood function that provides a
sensible way of supplying the working parameters needed for the mean parameter
estimation.
In longitudinal data analysis, the mean response is usually modeled as a function
of time and other covariates. Profile analysis (assuming each time point is a category
instead of a continuous variable) and parametric curves are the two popular strategies
for modeling the time trend. In a parametric approach, we model the mean as an
explicit function of time. If the profile means appear to change linearly over time, we
can fit a linear model in time; if they appear to change in a quadratic manner, we can
fit a quadratic model. Appropriate tests may be used to check which model is the
better choice. Clearly, profile analysis is only sensible if each subject has the same
set of observation times.
µi = h(Xiᵀ β),

or, more generally,

µi = h(Xi , β).

This mean model includes the nonlinear regression setup and multiple-index models,
as long as h is given a priori and the dimension p of β is fixed and usually much
smaller than n.
Remark 5. If a spline function is assumed to model h, the number of parameters,
p, will expand as n or ni increases. This will lead to a new area of semiparametric
modeling for longitudinal data analysis (Lin and Ying, 2001; Lin and Carroll, 2006;
Li et al., 2010; Hua and Zhang, 2012).
The inverse of h, or h itself, is often referred to as the “link” function. In quasi-likelihood,
the variance of Yi j , σ²i j , is expressed as a known function g of the expectation µi j , i.e.,

σ²i j = φg(µi j ),
where µi = (µi1 , . . . , µini )ᵀ and Di = ∂µi /∂βᵀ . In order to solve for β from the above
equations, both Di and Vi must be given up to the parameter vector β, and all other
parameters, such as α and γ (except for φ), must be supplied a priori.
For convenience, we let Ui be the contribution of subject i to UGEE , i.e.,
Ui = Diᵀ Vi⁻¹ S i , where S i = Yi − µi . The above GEE is not a stranger to us: its form is
the same as that of the estimating functions derived from the linear exponential family (4.1).
In particular, when Vi is diagonal (independence working correlation model), UGEE
becomes the estimating function derived from the QL approach (4.4), while a general
Vi corresponds to the multivariate version given by (4.8). The decoupled version from
Gaussian estimation also takes the same form; see (4.13).
Suppose that α̂ and γ̂ are √N-consistent estimators obtained via other
approaches; the GEE (4.18) then solves UGEE (β; α̂, γ̂) = 0 p×1 . In general, α̂ and
γ̂ will in turn require a √N-consistent estimator of β so that valid residuals can be
obtained for estimating α and γ. The GEE estimator of β, β̂GEE , is therefore essentially
obtained by solving UGEE (β; α̂(β), γ̂(β)) = 0 p×1 . We will omit the subscript indicating
the dimension when there is no confusion.
Under mild regularity conditions, and provided that the link function is correctly
specified, Liang and Zeger (1986) showed under minimal assumptions about the time
dependence that, as N → ∞, β̂GEE is a consistent estimator of β and that
√N(β̂GEE − β) is asymptotically multivariate Gaussian with covariance matrix given by

lim_{N→∞} N (∑_{i=1}^{N} Diᵀ Vi⁻¹ Di )⁻¹ { ∑_{i=1}^{N} Diᵀ Vi⁻¹ Cov(Yi ) Vi⁻¹ Di } (∑_{i=1}^{N} Diᵀ Vi⁻¹ Di )⁻¹ .        (4.19)
This matrix can be estimated consistently without any direct knowledge of Cov(Yi ),
because Cov(Yi ) can simply be replaced by the residual product Ŝ i Ŝ iᵀ , and α, β, and
φ by their estimates, in equation (4.20). This leads to the well-celebrated sandwich
estimator for the covariance of β̂GEE ,

V̂GEE = (∑_{i=1}^{N} Diᵀ Vi⁻¹ Di )⁻¹ { ∑_{i=1}^{N} Diᵀ Vi⁻¹ Ŝ i Ŝ iᵀ Vi⁻¹ Di } (∑_{i=1}^{N} Diᵀ Vi⁻¹ Di )⁻¹ ,        (4.20)

in which all the matrices Di , and Vi if needed, are evaluated at the final parameter
estimates β̂GEE , α̂, and γ̂.
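As a sketch of (4.20) in the simplest setting (identity link with Vi = I, so Di = Xi and the GEE reduces to cluster-pooled least squares; all names and the simulated data are ours):

```python
import numpy as np

def gee_independence_fit(X_list, Y_list):
    """Identity link with V_i = I (so D_i = X_i): beta solves
    sum_i X_i'(Y_i - X_i beta) = 0, and the sandwich covariance (4.20)
    is bread^{-1} * meat * bread^{-1}."""
    bread = sum(X.T @ X for X in X_list)
    beta = np.linalg.solve(bread, sum(X.T @ Y for X, Y in zip(X_list, Y_list)))
    meat = np.zeros_like(bread)
    for X, Y in zip(X_list, Y_list):
        S = Y - X @ beta                  # residual vector S_i for cluster i
        meat += X.T @ np.outer(S, S) @ X  # D_i' V_i^{-1} S_i S_i' V_i^{-1} D_i
    binv = np.linalg.inv(bread)
    return beta, binv @ meat @ binv

# Simulated clustered data with a shared within-cluster disturbance.
rng = np.random.default_rng(2)
X_list = [np.column_stack([np.ones(3), rng.normal(size=3)]) for _ in range(50)]
Y_list = [X @ np.array([1.0, 0.5]) + rng.normal() + rng.normal(size=3) for X in X_list]
beta_hat, V_hat = gee_independence_fit(X_list, Y_list)
print(beta_hat)  # roughly [1.0, 0.5]
```

Even though the working correlation (independence) is wrong for these data, the point estimate remains consistent, and the sandwich form corrects the standard errors for the within-cluster dependence.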
When Ri is correctly specified and α is known, the generalized estimating equations
given by (4.18) are optimal for β in the class of linear functions of Yi , according
to the Gauss-Markov theorem. In practice, the correlation matrix is often unknown,
and we rely on estimating α given β. This leads to simultaneous or iterative
estimation of β and α (Zhao and Prentice, 1990; Liang et al., 1992).
A key robustness property motivating the widespread use of generalized estimat-
ing equations is the consistency of the solution β̂ whether or not the working cor-
relation structure Ri is correctly specified. Careful modeling of Ri will improve the
efficiency of β̂, especially when the sample size is not large (Albert and McShane,
1995). The optimal asymptotic efficiency of β̂ is obtained when Ri coincides with the
true correlation matrix. In practice, Ri would be chosen to be the most reasonable for
the data based on either statistical criteria or biological background.
Although the GEE approach can provide consistent regression coefficient estima-
tors, the estimation efficiency may fluctuate greatly according to the specification of
the “working” covariance matrix. The “working” covariance has two parts: one is the
“working” correlation structure; the other is the variance function. The existing lit-
erature has been focused on the specification of the “working” correlation while the
variance function is often assumed to be correctly chosen, such as Poisson variance
and Gaussian variance function. In real data analysis, if the variance function is also
misspecified, the estimation efficiency will be low. It is, therefore, crucial to model
the variance function in order to improve the estimation efficiency of β. Relevant
work is abundant (Dempster, 1972; Davidian and Carroll, 1987; O'Hara-Hines, 1998;
Wang and Lin, 2005; Wang and Hin, 2009).
The GEE approach estimates β using only linear combinations of the data as
given by (4.18). In some cases, the variance and covariance can contain information
on µi j and hence β, and it then becomes sensible to construct estimating functions
based on the quadratic functions of the data. GEE2 proposed by Prentice and Zhao
(1991) aims to construct the “optimal” combination of both linear and quadratic
residuals for estimating β. Prentice and Zhao (1991) and Fitzmaurice et al. (1993)
also obtained this type of estimating functions using a likelihood-based method. The
GEE2 can, therefore, be regarded as driven by absorbing higher moment information.
The connection between the likelihood equations and the GEE2 is well described by
Fitzmaurice et al. (1993).
It is true that GEE2 may produce more efficient estimators of β when the
third and fourth moments can be correctly specified. However, as pointed out by
Prentice and Zhao (1991), models for means and covariances can typically be
sensibly specified and readily interpreted, whereas models for higher-order moments are
usually specified in an ad hoc fashion and are unlikely to accurately approximate the
real complexity.
Apart from being computationally intensive as ni gets large, the GEE2 approach also has
the following drawbacks: (i) the mean parameter estimates are not consistent under
covariance misspecification; (ii) no efficiency is gained if the third- or fourth-moment
functions are incorrect (Liang et al., 1992). In general, little can be gained by
incorporating the third and fourth moments in the estimation. Therefore, in practice,
one may wish to restrict the assumptions to the mean and covariance functions only.
As shown in §4.3, for either the true correlation R = R̃i or the naïve independence
structure R = I, the Gaussian estimator of τ is consistent under mild conditions.
If the specified variance function is far from the true one, the correct choice of
the working correlation matrix may no longer be the true one. In real data analysis,
it is difficult to determine the variance function, and we may well choose an
incorrect one. Thus, akin to the correlation parameters, the variance
parameters are also subject to the pitfalls discussed by Crowder (1995) under model
misspecification. Well-behaved estimators of these working parameters
are therefore highly desirable, both to avoid possible convergence problems and to yield
efficient estimators of β (Wang and Carey, 2003). The Gaussian working likelihood is
a convenient and useful approach to analyzing longitudinal data.
For normal data the squared residuals have approximate variance proportional to
σ4i j ; the generalized weighted least squares can be considered as well, minimizing
$$\sum_{i=1}^{N}\sum_{j=1}^{n_i}\{r_{ij}^{2}-\phi\, g(\hat\mu_{ij})\}^{2}/g^{2}(\hat\mu_{ij}),$$
leading to
$$\hat\phi_{Gau} = \frac{1}{\sum_{i=1}^{N} n_i}\sum_{i=1}^{N}\sum_{j=1}^{n_i}(y_{ij}-\hat\mu_{ij})^{2}/g(\hat\mu_{ij}).$$
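As a quick sketch of this moment estimator of the dispersion (all names here are ours, and the variance function is taken to be the Poisson-type choice g(μ) = μ, which is an assumption for illustration only):

```python
import numpy as np

def phi_gauss(y, mu, g):
    """Moment estimator of phi: average of (y_ij - mu_ij)^2 / g(mu_ij)
    over all observations, pooled across subjects."""
    total_n = sum(len(yi) for yi in y)
    num = sum(np.sum((yi - mui) ** 2 / g(mui)) for yi, mui in zip(y, mu))
    return num / total_n

rng = np.random.default_rng(0)
mu = [np.full(4, 5.0) for _ in range(200)]        # toy fitted means
y = [rng.poisson(m) for m in mu]                  # Poisson data: var = mu, so phi ~ 1
print(round(phi_gauss(y, mu, g=lambda m: m), 2))  # close to 1
```

With the correct variance function the estimate hovers near the true dispersion; with a badly chosen g it drifts, which is the sensitivity discussed in the surrounding text.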
• Independence
$$R = I_{n\times n},$$
the n × n identity matrix.
• Compound symmetry (exchangeable or equal correlation)
$$R = \begin{pmatrix} 1 & \alpha & \alpha & \cdots & \alpha\\ \alpha & 1 & \alpha & \cdots & \alpha\\ \vdots & & \ddots & & \vdots\\ \alpha & \alpha & \alpha & \cdots & 1 \end{pmatrix}_{n\times n}$$
• First order autoregressive: AR(1)
$$R = \begin{pmatrix} 1 & \alpha & \alpha^{2} & \cdots & \alpha^{n-1}\\ \alpha & 1 & \alpha & \cdots & \alpha^{n-2}\\ \alpha^{2} & \alpha & 1 & \cdots & \alpha^{n-3}\\ \vdots & & & \ddots & \vdots\\ \alpha^{n-1} & \alpha^{n-2} & \alpha^{n-3} & \cdots & 1 \end{pmatrix}_{n\times n}$$
• Moving average
$$R = \begin{pmatrix} 1 & \alpha & 0 & \cdots & 0 & 0\\ \alpha & 1 & \alpha & \cdots & 0 & 0\\ 0 & \alpha & 1 & \cdots & 0 & 0\\ \vdots & & & \ddots & & \vdots\\ 0 & 0 & 0 & \cdots & \alpha & 1 \end{pmatrix}_{n\times n}$$
• m-dependent correlation
This is a generalization of the above first-order moving average model to order m,
$$\mathrm{Corr}(y_{ij}, y_{i,j+t}) = \begin{cases} 1 & t = 0,\\ \alpha_t & t = 1, 2, \ldots, m,\\ 0 & t > m. \end{cases}$$
• Toeplitz
This is essentially a special case of the m-dependent model with the largest m possible,
m = n − 1. Note that this structure is also “stationary”, as the correlation depends on
the lag but not on the starting time.
$$R = \begin{pmatrix} 1 & \alpha_1 & \alpha_2 & \cdots & \alpha_{n-1}\\ \alpha_1 & 1 & \alpha_1 & \cdots & \alpha_{n-2}\\ \alpha_2 & \alpha_1 & 1 & \cdots & \alpha_{n-3}\\ \vdots & & & \ddots & \vdots\\ \alpha_{n-1} & \alpha_{n-2} & \alpha_{n-3} & \cdots & 1 \end{pmatrix}_{n\times n}$$
The above correlation structures are only appropriate for equally-spaced measure-
ments, and all subjects must have the same observation times in order to share the
same correlation structure.
If stationarity is in question, or observations are irregularly spaced, we may con-
sider
• Unstructured
$$R = \begin{pmatrix} 1 & \alpha_{12} & \alpha_{13} & \cdots & \alpha_{1n}\\ \alpha_{21} & 1 & \alpha_{23} & \cdots & \alpha_{2n}\\ \alpha_{31} & \alpha_{32} & 1 & \cdots & \alpha_{3n}\\ \vdots & & & \ddots & \vdots\\ \alpha_{n1} & \alpha_{n2} & \alpha_{n3} & \cdots & 1 \end{pmatrix}$$
Note that, again, all subjects must have the same observation times in order to
share the same correlation structure.
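These working structures are easy to generate numerically. A small sketch (the function names are ours, not from the text):

```python
import numpy as np

def ar1_corr(n, alpha):
    """AR(1): R[j, k] = alpha^{|j - k|}."""
    idx = np.arange(n)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

def exch_corr(n, alpha):
    """Compound symmetry (exchangeable): 1 on the diagonal, alpha elsewhere."""
    return np.full((n, n), alpha) + (1 - alpha) * np.eye(n)

def ma1_corr(n, alpha):
    """First-order moving average: alpha on the first off-diagonals only."""
    return np.eye(n) + alpha * (np.eye(n, k=1) + np.eye(n, k=-1))

R = ar1_corr(4, 0.5)
print(R[0])   # first row: 1, 0.5, 0.25, 0.125
```

Each helper returns a valid correlation matrix only for admissible α (e.g. α > −1/(n − 1) for the exchangeable case), echoing the positive-definiteness caveat discussed below.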
All these correlation structures can be easily implemented in the R package nlme.
For example, if ID is the factor for the n subject levels, the AR(1) covariance model
can be fitted in the gls function by using correlation = corAR1(form = ~ 1 | ID).
The moment estimators can be easily constructed for these models. The estima-
tors are all expressed in terms of the Pearson residuals. Denote the Pearson residual by
ê_{ij} = (y_{ij} − µ_{ij})/A_{ij}^{1/2}, j = 1, 2, ..., n, where A_{ij} is the jth diagonal element of the diago-
nal variance matrix A_i. Note that σ²_{ij} = φA_{ij}, and E(ê²_{ij}) ≈ φ.
For the m-dependent model, we will use only the t-lagged residuals for estimating
α_t, t = 1, 2, ..., m,
$$\hat\alpha_t = \frac{2\sum_{i=1}^{N}\sum_{j\le n_i-t}\hat e_{ij}\hat e_{i,j+t}}{\sum_{i=1}^{N}\sum_{j\le n_i-t}\left(\hat e_{ij}^{2}+\hat e_{i,j+t}^{2}\right)}. \qquad (4.22)$$
This estimator (with m = 1) is also applicable to both the moving average and the AR(1)
model.
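The lag-t estimator can be sketched as follows (a toy simulation under an AR(1) residual process; all names are ours):

```python
import numpy as np

def alpha_lag(resid, t=1):
    """Moment estimator (4.22): 2 * sum e_j e_{j+t} / sum (e_j^2 + e_{j+t}^2),
    pooling lag-t residual pairs over subjects."""
    num = sum(2 * np.sum(e[:-t] * e[t:]) for e in resid if len(e) > t)
    den = sum(np.sum(e[:-t] ** 2 + e[t:] ** 2) for e in resid if len(e) > t)
    return num / den

rng = np.random.default_rng(1)
alpha, n = 0.6, 6
resid = []
for _ in range(500):                      # simulate AR(1) Pearson residuals
    e = np.empty(n)
    e[0] = rng.normal()
    for j in range(1, n):
        e[j] = alpha * e[j - 1] + np.sqrt(1 - alpha**2) * rng.normal()
    resid.append(e)
print(round(alpha_lag(resid, t=1), 2))    # near 0.6
```

Because the denominator bounds twice the numerator in absolute value, the estimate always lies in (−1, 1), the feasibility property noted below.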
For the exchangeable structure, we can make use of all the lagged residual products,
$$\hat\alpha = \frac{2\sum_{i=1}^{N}\sum_{j<k}\hat e_{ij}\hat e_{ik}}{\sum_{i=1}^{N}\sum_{j<k}\left(\hat e_{ij}^{2}+\hat e_{ik}^{2}\right)}.$$
For the unstructured correlation, we will use all the pairwise products; the esti-
mator for the parameter α_{jk} relies on the residuals at time j and time k,
$$\hat\alpha_{jk} = \frac{2\sum_{i=1}^{N}\hat e_{ij}\hat e_{ik}}{\sum_{i=1}^{N}\left(\hat e_{ij}^{2}+\hat e_{ik}^{2}\right)}.$$
Note that all these correlation estimators are bounded between -1 and 1 according
to the Cauchy inequality. But this does not necessarily mean the resultant matrices
are always positive definite. This brings another issue as pointed out by Crowder
(1995). We will discuss this further towards the end of this section.
The correlation matrix R(α) has q = dim(α) parameters. For example, in a study
involving T equidistant follow-up visits, the unstructured correlation matrix for an
individual with complete data will have q = T (T − 1)/2 correlation parameters; if the
repeated observations are assumed exchangeable, R will have the compound symme-
try structure, and q = 1. A parsimonious parametrization of the correlation structure
is desired in order to optimize the efficiency of the estimation procedure.
Note that an underlying assumption here is that the correlation depends on j and k
but not on subject i. Sometimes this may not hold, and the covariance matrix varies
between subjects. Even the unstructured matrix will not be able to accommodate this,
because the subjects no longer share the same correlation matrix.
For observations at continuous and irregular times, t_{i1}, t_{i2}, t_{i3}, and t_{i4},
for example, we may wish to consider the first-order autoregressive structure with continuous
time, CAR(1), in which the correlation between two observations is α raised to the power of
their time separation, Corr(y_{ij}, y_{ik}) = α^{|t_{ij} − t_{ik}|}.
In this case each subject will have a different correlation matrix depending on the
specific observation times. Muñoz et al. (1992) proposed a damped exponential cor-
relation structure for modeling multivariate outcomes. The damped exponential
correlation structure applies a power transformation to the lag: two observations with
lag s are modeled as if the time lag were s^φ in the CAR(1) model, so the correlation
is α^{s^φ}, where φ is a damping parameter. The correlation structures of compound
symmetry, first-order autoregressive, and first-order moving average processes are
obtained by taking φ = 0, φ = 1, and φ → ∞, respectively.
Thus, the correlation structures with q = 2 can model quite parsimoniously a va-
riety of forms of dependence, including slowly decaying autocorrelation functions
and autocorrelation functions that decay faster than the commonly used first-order
autoregressive model.
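The damped exponential family is easy to tabulate; a minimal sketch (function name ours) showing how the limiting cases recover the familiar structures:

```python
import numpy as np

def damped_exp_corr(lags, alpha, phi):
    """Damped exponential: correlation at lag s is alpha^(s^phi);
    phi = 0 gives exchangeable, phi = 1 gives AR(1), phi -> inf gives MA(1)."""
    s = np.asarray(lags, dtype=float)
    return np.where(s == 0, 1.0, alpha ** (s ** phi))

s = np.arange(4)
print(damped_exp_corr(s, 0.5, 0.0))  # exchangeable: constant 0.5 for s > 0
print(damped_exp_corr(s, 0.5, 1.0))  # AR(1): 0.5^s
```

Varying φ between 0 and 1 interpolates between constant and geometric decay, which is the parsimonious flexibility the text refers to.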
However, estimation of α is not straightforward as the moment estimators cannot
be constructed easily. We will, therefore, need to construct supplementary estimat-
ing functions for α. Further complications may arise when sampling is involved in
nested, spatial, and temporal random effects. Intuitively, careful modeling of the cor-
relation structure should improve the efficiency of estimation. Diggle et al. (1994)
provides a comprehensive review of relevant techniques.
worth (1994). As one can see, this correlation function is not stationary in time any
more unless λ = 1. The generalized Markov correlation structure accommodates ir-
regular and nonlattice-valued observation times, which is quite the norm in practice
(Shults and Chaganty, 1998).
propriate in many practical cases. A case of particular interest is the genetic distance
among family members when data are clustered/coordinatized in pedigrees (Trègouët
et al., 1997).
In many cases, the correlation parameter describes the association between re-
sponses from the same cluster (such as family, parents, or area), and may be of sci-
entific interest. For example, intraclass correlation is commonly used to measure the
degree of similarity or resemblance siblings. In these cases, the efficient estimation
of the association parameters would be valuable. In genetic studies, we often rely on
associations for prediction. So we have to come up with a correlation structure and
estimate the correlation-related parameters (correlations may be governed by covari-
ates such as dam weight in developmental studies). Correlation and mean parameters
will be equally important in these cases.
Apart from simple moment estimators, the supplementary set of estimating func-
tions for α can be constructed more elegantly to enhance the estimation efficiency
and avoid possible pitfalls (Lipsitz et al., 1991; Prentice and Zhao, 1991; Liang et al.,
1992; Hall and Severini, 1998). In analysis of real data, misspecification is probably
the norm and the efficiency of β̂ depends on how close the working matrix is to the
true correlation. Because β̂ values obtained from (4.18) depend on the values of α
used, estimation methods for α are, therefore, of importance for improving the ef-
ficiency of estimation of β. On the other hand, in many cases, estimates of α may
be of scientific interest as well (Lipsitz et al., 1991). More details will be given in
Section 9.5.
Estimators of β, γ, and α are obtained from the iterative method or from the joint estimating
functions, (4.18), together with (4.25) and (4.26), in which P_{ij} = ∂R_i^{-1}/∂α_j, which can
also be written as P_{ij} = −R_i^{-1}(∂R_i/∂α_j)R_i^{-1}.
Shults and Chaganty (1998) also found that compared with the ad hoc methods in the
literature, it has smaller empirical risk of producing infeasible correlation parameter
estimates and smaller mean square error in the estimates of β when the correlation is
small or moderate. The bias in the QLS estimates can be easily removed according
to Chaganty and Shults (1999). More details, extensions and interesting applications
can be found in the book Shults and Hilbe (2014).
To avoid nonconvergence or the infeasibility problem pointed out by Crowder
(1995), explicit expressions of α̂ that are constrained within the sensible region are
attractive. This led Chaganty (1997) to consider the QLS method.
The problem is that the QLS estimators are inconsistent. The bias-corrected version
of Chaganty and Shults (1999) for the AR(1) working model is
$$\hat\alpha_{QLS1} = \frac{\sum_{k}\sum_{|i-j|=1}\hat\epsilon_{ki}\hat\epsilon_{kj}}{\sum_{k}\left(\hat\epsilon_{k1}^{2} + 2\sum_{i=2}^{n-1}\hat\epsilon_{ki}^{2} + \hat\epsilon_{kn}^{2}\right)}. \qquad (4.27)$$
Clearly, we have α̂_{QLS1} → α when the true correlation structure is either AR(1),
exchangeable, or MA(1) with parameter α.
If the working matrix is exchangeable instead of AR(1), and the corresponding QLS
estimate is α̂_{QLS}, the bias-corrected estimate is obtained from
$$\sum_{i,j}\alpha_{ij} = \frac{\{1+(n-1)\hat\alpha_{QLS}\}^{2}}{1+(n-1)\hat\alpha_{QLS}^{2}}.$$
If the true correlation is exchangeable, we can verify that the limit of the bias-cor-
rected estimate of α satisfies
$$\frac{\{2+(n-2)\hat\alpha_{QLS}\}\,\hat\alpha_{QLS}}{1+(n-1)\hat\alpha_{QLS}^{2}} = \alpha.$$
In order to obtain consistent estimates of α for other types of working matrices
(exchangeable and MA(1)), Chaganty and Shults (1999) suggested obtaining an ini-
tial QLS estimate using the AR(1) working model, α̂_{QLS}, and then adjusting the estimates
by assuming the correlation structure believed to be true is either exchangeable or
MA(1). The resulting estimate of α, α̂QLS 1 , has the same expression as (4.27). Note
that one needs to choose the working correlation matrices twice, and they do not
have to be of the same structure. In theory, the initial estimate α̂QLS can also be
based on exchangeable, MA(1) models, or any other correlation structures. But the
initial estimate from AR(1) model leads to the above simple expressions.
in which Γ_i is a working variance vector, which can be chosen as {var(y_{ik} | y_{ij})}. In
general, Γ_i may have to be chosen as the identity matrix, unless some mean-variance
relationship can be specified, as in the case of binary responses. It is easy to verify
that E{U_CR(α)} = 0 if either Γ_i is the identity matrix (and ξ_{ikj} does
not have to be the true conditional mean), or ξ_{ikj} is the true conditional mean (and Γ_i
may depend on ξ_i).
For multiple binary responses, Carey et al. (1993) found that using conditional
residuals produces much more efficient estimates of correlation parameters than using
unconditional residuals. This was further demonstrated by Lipsitz and Fitzmaurice
(1996).
4.4.3.7 Cholesky Decomposition
In the conditional residual method, the identity matrix may have to be used for Γi
when the conditional variance is unknown, which is often the case. Therefore, each
residual is treated equally. This may be reasonable if the correlation is the same
between each pair of observations. This motivates us to consider using all previous
responses rather than each individual previous response.
Wang and Carey (2004) proposed Cholesky decomposition method to improve
the estimation efficiency and guarantee feasibility of solutions. The basic idea was
first presented at the Eastern North American Region International Biometric Society
(ENAR) 1999 meeting and also at the Biostatistics Department Seminar, Harvard
School of Public Health.
Let R_i^{-1} = C_i^{\top} C_i, where C_i is a lower triangular matrix, which can be decomposed as
$$C_i = \begin{pmatrix} c_{11} & 0 & \cdots & 0\\ 0 & c_{22} & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 & \cdots & c_{nn} \end{pmatrix} + \begin{pmatrix} 0 & 0 & \cdots & 0\\ c_{21} & 0 & \cdots & 0\\ \vdots & & \ddots & \vdots\\ c_{n1} & c_{n2} & \cdots & 0 \end{pmatrix} = J_i + B_i.$$
We can now write H_i as
$$J_i A_i^{-1}(Y_i - \zeta_i),$$
in which ζ_i = µ_i − A_i J_i^{-1} B_i A_i^{-1}(Y_i − µ_i), which is a linear conditional mean vector.
The second term, A_i J_i^{-1} B_i A_i^{-1}(Y_i − µ_i), adjusts the mean based on previous responses
when correlations between responses exist.
Remark 8. Note that ζ_i are linear predictors using the first two moments and the
previous observations (their residuals). Therefore, when R_i is correctly specified,
we would expect the components of Y_i − ζ_i to be centered at 0 and orthogonal to each other.
It is easy to show that
$$E\left(\frac{\partial \zeta_i}{\partial \beta}\right) = A_i J_i^{-1} C_i A_i^{-1} D_i,$$
and
$$\mathrm{Cov}(Y_i - \zeta_i) = \mathrm{Cov}\{A_i(I_i + J_i^{-1}B_i)A_i^{-1}(Y_i - \mu_i)\} = A_i J_i^{-1}(J_i + B_i) R_i (J_i + B_i)^{\top} J_i^{-1} A_i = A_i^{2} J_i^{-2}.$$
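The last identity can be checked numerically. A small sketch (the reverse-order Cholesky helper and the toy A matrix are ours), verifying that Cov(Y_i − ζ_i) collapses to the diagonal matrix A_i²J_i⁻²:

```python
import numpy as np

def lower_C(M):
    """Lower-triangular C with C.T @ C = M, via reverse-order Cholesky."""
    E = np.eye(len(M))[::-1]
    L = np.linalg.cholesky(E @ M @ E)       # E M E = L L^T, L lower triangular
    return (E @ L @ E).T                    # C lower triangular, C^T C = M

n, alpha = 4, 0.5
idx = np.arange(n)
R = alpha ** np.abs(idx[:, None] - idx[None, :])   # AR(1) correlation
A = np.diag([1.0, 1.5, 2.0, 2.5])                  # toy diagonal scale matrix

C = lower_C(np.linalg.inv(R))
J = np.diag(np.diag(C))                            # diagonal part J_i
# Cov(Y - zeta) = A J^{-1} C R C^T J^{-1} A, and C R C^T = I, so it equals A^2 J^{-2}
Jinv = np.linalg.inv(J)
cov = A @ Jinv @ C @ R @ C.T @ Jinv @ A
print(np.allclose(cov, A @ A @ Jinv @ Jinv))       # True
```

The key step is C R C^T = I, which follows directly from R^{-1} = C^T C; the off-diagonal covariances vanish exactly.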
We will denote the diagonal matrix J_i^2 A_i^{-2} as W_i, for convenience. It is, therefore,
sensible to consider the following estimating functions for α,
$$U_{Chol}(\alpha; \beta) = \sum_{i=1}^{N}\left(\frac{\partial \zeta_i}{\partial \alpha}\right)^{\top} W_i (Y_i - \zeta_i), \qquad (4.29)$$
in which W_i is a weighting diagonal matrix. In fact, for any chosen diagonal matrix
W_i that is independent of the data, the estimating functions U_Chol(α) given by (4.29)
are unbiased from zero.
Remark 9. It is easy to show that E{U_Chol(α; β)} = 0.
We first rewrite Y_i − ζ_i as A_i(I_i + J_i^{-1}B_i)A_i^{-1}(Y_i − µ_i), and hence, for any component
α_k of α,
$$\frac{\partial \zeta_i}{\partial \alpha_k} = A_i F_{ik} A_i^{-1}(Y_i - \mu_i).$$
The kth component of the estimating function U_Chol(α; β) can, therefore, be written in the
quadratic form
$$\sum_{i=1}^{N}(Y_i - \mu_i)^{\top} A_i^{-1} M_{ik} A_i^{-1} (Y_i - \mu_i), \qquad (4.30)$$
whose expectation is proportional to tr(L_i F_{ik}), in which L_i = C_i^{-1} J_i^{-1} W_i A_i^2 is a lower
triangular matrix. Because F_{ik} is also a lower triangular matrix with all leading elements
being 0, we have tr(L_i F_{ik}) = 0, which completes the proof.
When C_i and B_i are taken to be upper triangular matrices (this is equivalent
to reversing the time order), we can also obtain a similar version of U_Chol(α; β). In
general, these two sets are very similar. The average or sum can be used as suggested
by Wang and Carey (2004). This will also symmetrize the residual appearance as we
will see later in AR(1) model. An advantage of the use of UChol over UG is that UChol
is free from the scale parameter φ.
After some algebra, we can rewrite (4.30) as
$$U_{Chol}(\alpha; \beta) = \sum_{i=1}^{N} E\left(\frac{\partial \zeta_i}{\partial \beta}\right)^{\top} W_i (Y_i - \zeta_i). \qquad (4.31)$$
So the same “working” matrix is used in U(β) and U_Chol(α; β). Note that there
is an expectation operator, E, in (4.31), so that quadratic terms are eliminated. A
similar relationship was found by Wang (1999a) between conditional (transitional)
models and marginal models.
If E(y_{ik} | y_{i1}, · · · , y_{i,k−1}) = ζ_i is a linear function of past responses, and V_i^* is the
corresponding conditional variance matrix (diagonal), the transitional or conditional
model would rely on
$$U_c(\beta) = \sum_{i=1}^{N}\left(\frac{\partial \zeta_i}{\partial \beta}\right)^{\top} V_i^{*-1}(Y_i - \zeta_i),$$
for estimating β. Note that Y_i − ζ_i has mean 0 even when the true conditional
means are not linear functions. To ensure E{U_c(β)} = 0, and hence robustness to mis-
specification of the conditional means, we may wish to replace the Jacobian matrix and
the conditional variance matrix with their expectations.
Remark 10. ζ_i are linear combinations of past responses only, and can, therefore, be
interpreted as linear predictions. As the covariance matrix of Y_i − ζ_i is diagonal,
namely A_i^2 J_i^{-2}, the components can be roughly regarded as independent.
Remark 11. In the case of W_i = J_i^2 A_i^{-2}, as we suggested, M_{ik} = F_{ik}^{\top} J_i C_i, where
$$F_{ik} = \frac{\partial (J_i^{-1} B_i)}{\partial \alpha_k}.$$
We now consider two widely used correlation structures, the first-order autore-
gression (AR(1)) and the equicorrelation structures. The first one is often used in
longitudinal studies, and the latter is often used to account for cluster settings (Lip-
sitz and Fitzmaurice, 1996).
For the AR(1) model with correlation parameter α, the (i, j)th element of R is
α^{|j−i|}, 1 ≤ i, j ≤ n, and det(R) = (1 − α²)^{n−1}. We have
$$R^{-1} = \frac{1}{1-\alpha^{2}}\begin{pmatrix} 1 & -\alpha & 0 & \cdots & 0\\ -\alpha & 1+\alpha^{2} & -\alpha & \cdots & 0\\ \vdots & & \ddots & & \vdots\\ 0 & 0 & \cdots & -\alpha & 1 \end{pmatrix},$$
$$J^{-1}B = \begin{pmatrix} 0 & 0 & \cdots & 0 & 0\\ -\alpha & 0 & \cdots & 0 & 0\\ 0 & -\alpha & \cdots & 0 & 0\\ \vdots & & \ddots & & \vdots\\ 0 & 0 & \cdots & -\alpha & 0 \end{pmatrix} \quad\text{and}\quad J^{2} = \frac{1}{1-\alpha^{2}}\,\mathrm{diag}(1-\alpha^{2}, 1, \ldots, 1).$$
We, therefore, have for the ith subject,
$$\zeta_i = \mu_i + \alpha_i\begin{pmatrix} 0\\ (\sigma_{i2}/\sigma_{i1})\epsilon_{i1}\\ (\sigma_{i3}/\sigma_{i2})\epsilon_{i2}\\ \vdots\\ (\sigma_{in}/\sigma_{i,n-1})\epsilon_{i,n-1} \end{pmatrix},$$
in which ε_{ik} = y_{ik} − µ_{ik} and α_i is the correlation parameter, which may depend on the
subject-specific covariates through parameters α. The contribution of the ith subject
to U_Chol(α; β), given by (4.29), is
$$\frac{1}{1-\alpha_i^{2}}\sum_{k=2}^{n}\frac{\epsilon_{i,k-1}}{\sigma_{i,k-1}}\left(\frac{\epsilon_{ik}}{\sigma_{ik}} - \alpha_i\frac{\epsilon_{i,k-1}}{\sigma_{i,k-1}}\right).$$
For the exchangeable structure, we have
$$J^{-1}B = -\begin{pmatrix} 0 & 0 & \cdots & 0 & 0\\ \alpha & 0 & \cdots & 0 & 0\\ \frac{\alpha}{1+\alpha} & \frac{\alpha}{1+\alpha} & \cdots & 0 & 0\\ \vdots & & \ddots & & \vdots\\ \frac{\alpha}{1+(n-2)\alpha} & \frac{\alpha}{1+(n-2)\alpha} & \cdots & \frac{\alpha}{1+(n-2)\alpha} & 0 \end{pmatrix}.$$
Therefore,
$$\zeta_i = \mu_i + \alpha\begin{pmatrix} 0\\ (\sigma_{i2}/\sigma_{i1})\epsilon_{i1}\\ \vdots\\ \dfrac{\sigma_{in}}{1+(n-2)\alpha}\displaystyle\sum_{j=1}^{n-1}\frac{\epsilon_{ij}}{\sigma_{ij}} \end{pmatrix},$$
in which
$$l_{ij} = \partial d_{ij}/\partial\lambda = \frac{t_{ij}^{\lambda}\log(t_{ij}) - t_{i,j-1}^{\lambda}\log(t_{i,j-1})}{\lambda} - d_{ij}/\lambda.$$
In the special case of the AR(1) model (λ = 1), with t_i = i and n_i = 3 for all subjects,
the solution to U_Chol(α) has an explicit form.
Another advantage of α̂Chol is that it always lies in the sensible range of (−1, 1).
Suppose the true parameters are denoted as β̃. The covariance of U(β̃) can be approx-
imated by
$$E\{U_{GEE}(\tilde\beta;\tau,\alpha)U_{GEE}(\tilde\beta;\tau,\alpha)^{\top}\} = \sum_{i=1}^{N} D_i^{\top} V_i^{-1} \tilde\Sigma_i V_i^{-1} D_i.$$
Suppose we have constructed U(α, τ; β) for all the variance and correlation pa-
rameters as supplementary estimating functions to U_GEE(β; τ, α) as given by (4.18);
the final estimates of (β, α, τ) are obtained from the joint estimating functions
$$U(\theta) = \begin{pmatrix} U_{GEE}(\beta;\tau)\\ U(\alpha,\tau;\beta) \end{pmatrix}.$$
Under mild regularity conditions, β̂_R is consistent as N → ∞, and √N(β̂_R − β)
is asymptotically multivariate Gaussian with covariance matrix V_R given by
$$V_R = \lim_{N\to\infty} N\left(\sum_{i=1}^{N} D_i^{\top}V_i^{-1}D_i\right)^{-1}\left(\sum_{i=1}^{N} D_i^{\top}V_i^{-1}\mathrm{Cov}(Y_i)V_i^{-1}D_i\right)\left(\sum_{i=1}^{N} D_i^{\top}V_i^{-1}D_i\right)^{-1}.$$
Since ε̂_i ε̂_i^⊤ is based on the data from only one subject, it is neither consistent nor
efficient, and hence it is not an optimal estimator of Cov(Y_i). Although ε̂_i ε̂_i^⊤ is a poor
estimator of Cov(Y_i), V_LZ is a consistent estimator of V_G.
An improved version of VR estimator under certain restrictions on the data struc-
ture is given by Pan (2001). The same asymptotic normality for the β estimator in
our case can be established as in Crowder (2001).
Assume that (a) the marginal variance var(y_{ik}) is modeled correctly, and (b)
there is a common correlation structure R_0 across all subjects; then R_0 can be esti-
mated by
$$\hat R_C = \frac{1}{\phi N}\sum_{i=1}^{N} A_i^{-1/2}\hat\epsilon_i\hat\epsilon_i^{\top}A_i^{-1/2},$$
where β, α, and φ are replaced by their estimators. Under these two assumptions, which
are often reasonable, Pan (2001) proposed estimating Cov(Y_i) by
$$W_i = \phi A_i^{1/2}\hat R_C A_i^{1/2} = A_i^{1/2}\left(\frac{1}{N}\sum_{j=1}^{N} A_j^{-1/2}\hat\epsilon_j\hat\epsilon_j^{\top}A_j^{-1/2}\right)A_i^{1/2}.$$
Replacing ε̂_i ε̂_i^⊤ in (4.35) by W_i, a new covariance matrix estimator V_P can be obtained
(Pan, 2001) and is given by
$$V_P = \left(\sum_{i=1}^{N}D_i^{\top}V_i^{-1}D_i\right)^{-1}\left(\sum_{i=1}^{N}D_i^{\top}V_i^{-1}W_iV_i^{-1}D_i\right)\left(\sum_{i=1}^{N}D_i^{\top}V_i^{-1}D_i\right)^{-1}. \qquad (4.36)$$
Let
$$M_{LZ} = \sum_{i=1}^{N}D_i^{\top}V_i^{-1}\hat\epsilon_i\hat\epsilon_i^{\top}V_i^{-1}D_i \quad\text{and}\quad M_P = \sum_{i=1}^{N}D_i^{\top}V_i^{-1}W_iV_i^{-1}D_i.$$
For any matrix B, define the operator vec(B) as that of stacking the columns of B
together to obtain a vector. Then, under mild regularity conditions, cov{vec(M_LZ)} −
cov{vec(M_P)} is nonnegative definite with probability 1 as N → +∞.
When the number of subjects N is small, V_LZ would be expected to under-
estimate var(β̂) (Mancl and DeRouen, 2001). Therefore, Mancl and DeRouen (2001)
proposed an alternative robust covariance estimator for β̂ to reduce the bias of the
residual estimator ε̂_i ε̂_i^⊤. The first-order Taylor series expansion of the residual vector
ε̂_i about β is given by
$$\hat\epsilon_i = \epsilon_i(\hat\beta) = \epsilon_i + \frac{\partial \epsilon_i}{\partial\beta^{\top}}(\hat\beta-\beta) = \epsilon_i - D_i(\hat\beta-\beta), \qquad (4.37)$$
where ε_i = ε_i(β) = Y_i − µ_i and i = 1, . . . , N. Therefore,
$$E(\hat\epsilon_i\hat\epsilon_i^{\top}) \approx E(\epsilon_i\epsilon_i^{\top}) - E[\epsilon_i(\hat\beta-\beta)^{\top}D_i^{\top}] - E[D_i(\hat\beta-\beta)\epsilon_i^{\top}] + E[D_i(\hat\beta-\beta)(\hat\beta-\beta)^{\top}D_i^{\top}].$$
Hence,
$$E(\hat\epsilon_i\hat\epsilon_i^{\top}) \approx \mathrm{Cov}(Y_i) - \mathrm{Cov}(Y_i)H_{ii}^{\top} - H_{ii}\mathrm{Cov}(Y_i) + \sum_{j=1}^{N}H_{ij}\mathrm{Cov}(Y_j)H_{ij}^{\top}$$
$$= (I-H_{ii})\mathrm{Cov}(Y_i)(I-H_{ii})^{\top} + \sum_{j\neq i}H_{ij}\mathrm{Cov}(Y_j)H_{ij}^{\top}, \qquad (4.38)$$
where $H_{ij} = D_i\left(\sum_{k=1}^{N}D_k^{\top}V_k^{-1}D_k\right)^{-1}D_j^{\top}V_j^{-1}$. Since the elements of H_{ij} are between
zero and one, and usually close to zero, it is reasonable to assume that the summation
makes only a small contribution to the bias (Mancl and DeRouen, 2001). Therefore,
E(ε̂_i ε̂_i^⊤) can be approximated by (I − H_{ii})Cov(Y_i)(I − H_{ii})^⊤.
Mancl and DeRouen (2001) proposed the bias-corrected covariance estimator of β̂:
$$V_{MD} = V_{MB}\left\{\sum_{i=1}^{N}D_i^{\top}V_i^{-1}(I-H_{ii})^{-1}\hat\epsilon_i\hat\epsilon_i^{\top}(I-H_{ii}^{\top})^{-1}V_i^{-1}D_i\right\}V_{MB},$$
where $V_{MB} = \left(\sum_{i=1}^{N}D_i^{\top}V_i^{-1}D_i\right)^{-1}$ is the model-based covariance estimator.
Because D_i V_{MD} D_i^⊤ is positive definite, the sandwich estimate V_LZ appears to be biased
downward. Kauermann and Carroll (2001) proposed using (I − H_{ii})^{-1/2} ε̂_i to replace ε̂_i
in V_LZ and gave the bias-reduced sandwich estimate
$$V_{KC} = V_{MB}\left\{\sum_{i=1}^{N}D_i^{\top}V_i^{-1}(I-H_{ii})^{-1/2}\hat\epsilon_i\hat\epsilon_i^{\top}(I-H_{ii}^{\top})^{-1/2}V_i^{-1}D_i\right\}V_{MB}.$$
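The three sandwich variants can be sketched for a toy identity-link model under working independence (this construction is ours, not the book's code; 'MD' and 'KC' denote the two leverage corrections above, and V_MB is the model-based covariance):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, p = 30, 4, 2
X = rng.normal(size=(N, n, p))                  # D_i = X_i for an identity link
beta = np.array([1.0, -0.5])
Y = np.einsum('inp,p->in', X, beta) + rng.normal(size=(N, n))

bread = sum(X[i].T @ X[i] for i in range(N))    # sum_i D_i' V_i^{-1} D_i with V_i = I
V_MB = np.linalg.inv(bread)                     # model-based covariance
bhat = V_MB @ sum(X[i].T @ Y[i] for i in range(N))
res = [Y[i] - X[i] @ bhat for i in range(N)]

def sandwich(adjust=None):
    """adjust=None: V_LZ; 'MD': (I-H_ii)^{-1} residuals; 'KC': (I-H_ii)^{-1/2}."""
    meat = np.zeros((p, p))
    for i in range(N):
        H = X[i] @ V_MB @ X[i].T                # leverage block H_ii (symmetric here)
        if adjust == 'MD':
            e = np.linalg.solve(np.eye(n) - H, res[i])
        elif adjust == 'KC':
            w, Q = np.linalg.eigh(np.eye(n) - H)
            e = Q @ np.diag(w ** -0.5) @ Q.T @ res[i]
        else:
            e = res[i]
        meat += X[i].T @ np.outer(e, e) @ X[i]
    return V_MB @ meat @ V_MB

V_LZ, V_KC, V_MD = sandwich(), sandwich('KC'), sandwich('MD')
print(np.diag(V_LZ), np.diag(V_KC), np.diag(V_MD))
```

With only N = 30 clusters the corrected diagonals typically exceed the uncorrected ones, reflecting the small-sample downward bias of V_LZ discussed above.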
Table 4.2 Mean and variance for the two-week seizure count within each group. The ratio of
variance to mean shows the extent of Poisson overdispersion (φ̂ = s2 /Ȳ).
Placebo (M1 = 28) Progabide (M2 = 31)
Visit Ȳ s2 φ̂ Ȳ s2 φ̂
1 9.36 102.76 10.98 8.58 332.72 38.78
2 8.29 66.66 8.04 8.42 140.65 16.70
3 8.79 215.29 23.09 8.13 193.05 23.75
4 7.96 58.18 7.31 6.71 126.88 18.91
function have a great impact on the estimates and especially on the standard error.
When the Bartlett and the power variance are used, the estimates have much smaller
standard errors. Estimates and their standard errors from the Bartlett variance model
differ little for different working correlation structures. However, across the nine co-
variance models, the standard errors ranged from 0.174 to 0.211 for the interaction
term, and 0.415 to 0.464 for the treatment. This demonstrates that attention in mod-
eling the variance function is necessary instead of just using the default variance
functions to achieve high estimation efficiency. In this example, further goodness-
of-fit development is needed for assessing different variance and correlation models.
Model selection criteria will be discussed in the next chapter.
4.4.6 Infeasibility
Crowder (1995) found that under misspecification of the correlation structure, the
designated working correlation matrix may not even converge to a correlation matrix.
This generates interest in searching for better estimates of correlation parameters or
alternatives to the working matrix method (Chaganty, 1997; Qu et al., 2000).
Crowder (1995) redefines GEE as seeking simultaneous solutions of estimating equa-
tions U_β(θ) = 0 and U_α(θ) = 0. Crowder shows that the “lack of a parametric status of α”
leads to indefiniteness of the underlying stochastic structure for the outcome vectors, so
that the √K-consistency of α̂ required to obtain favorable properties of solutions to
U_β(θ) = 0 may be meaningless. Crowder illustrates this indefiniteness by attempting
to solve a GEE with working structure R(α) when the data arise from a model with
true cluster correlation structure R̃ that is very different from R. Specifically, when
R̃_{jk} = α (exchangeable true correlation) but R_{jk} = α^{|j−k|} (autoregressive working
correlation), then for n_i = 3 and −1/2 ≤ α < −1/3, an obvious moment-based formu-
lation of U_α(θ) has no real-valued root between −1 and 1. This destroys prospects
for a general theory of existence and consistency of simultaneous solutions to the
estimating functions. Crowder argues that this indefiniteness also infects the GEE2
procedures of Prentice and Zhao (1991).
In a seminal study, Crowder (1995) indicates that while robustness to misspecifi-
cation of cov(Yi ) is a key attraction of the GEE approach, study of performance under
misspecification requires additional formalization. The ad hoc estimators of Liang
and Zeger are, therefore, re-expressed as supplemental estimating equations whose
solutions α̃ can be characterized under misspecified structure for cov(Yi ). Even if α
Table 4.3 Parameter estimates from models with different variance functions and correlation
structures for the epileptic data.

Poisson variance: σ²_ij = φ μ_ij
  Independence working model, φ̂ = exp(1.486)
             log(baseline)  log(age)    trt     intact
    β            0.950       0.897    -1.341    0.562
    Stderr       0.097       0.275     0.426    0.174
  AR(1) working model, φ̂ = exp(1.509), α̂ = 0.495
    β            0.941       0.994    -1.502    0.626
    Stderr       0.091       0.272     0.415    0.168
  Exchangeable working model, φ̂ = exp(1.486), α̂ = 0.347
    β            0.950       0.897    -1.341    0.562
    Stderr       0.097       0.275     0.426    0.174

Bartlett variance: σ²_ij = 3.424 μ_ij + 0.120 μ²_ij
  Independence working model, φ̂ = exp(1.402)
    β            0.943       0.762    -1.074    0.434
    Stderr       0.110       0.266     0.464    0.211
  AR(1) working model, φ̂ = exp(1.422), α̂ = 0.477
    β            0.933       0.850    -1.219    0.492
    Stderr       0.102       0.264     0.458    0.208
  Exchangeable working model, φ̂ = exp(1.33), α̂ = 0.294
    β            0.943       0.762    -1.074    0.434
    Stderr       0.110       0.266     0.464    0.211

Power function variance: σ²_ij = φ μ_ij^1.329
  Independence working model, φ̂ = exp(1.486)
    β            0.924       0.719    -1.091    0.449
    Stderr       0.109       0.267     0.431    0.198
  AR(1) working model, φ̂ = exp(1.348), α̂ = 0.438
    β            0.915       0.796    -1.210    0.496
    Stderr       0.102       0.265     0.429    0.198
  Exchangeable working model, φ̂ = exp(1.33), α̂ = 0.294
    β            0.924       0.719    -1.091    0.449
    Stderr       0.109       0.267     0.431    0.198
is regarded strictly as a nuisance parameter, the effects of misspecification of cov(Yi )
and the subsequent misweighting of the observations may propagate to properties of
β̂G . Specifically, Crowder shows that if cov(Yi ) has the AR(1) form, but the working
model is chosen to have the exchangeable (compound symmetric) form then (for a
balanced design) the limit of the moment estimator α̃ depends explicitly on the clus-
ter size. Additionally, if cov(Yi ) has the exchangeable form but the working model is
of AR(1) structure, then for ni ≡ 3, if −1/2 ≤ αexch ≤ −1/3 the estimating equation as-
sociated with the AR(1) moment estimator has no real solution, and neither does the
GEE. Crowder concludes that “there can be no general asymptotic theory support-
ing existence and consistency of (β̂, α̂)” ... “α has no underlying parametric identity
independent of the particular estimating equation chosen: α does not exist in any
fundamental sense.” Crowder concludes his discussion with the positive recommen-
dation to use only estimating equations that have guaranteed solution, or estimate α
by minimizing a well-behaved objective function. He also notes that in practice “sta-
tistical judgement would normally be employed in an attempt to avoid such hidden
pitfalls” as infeasible estimators obtained through misspecification. In a follow-up
to Crowder’s paper, Sutradhar and Das (1999) argued that solutions to GEEs using
the (typically frankly misspecified) working independence model can often be more
efficient than those obtained under misspecified nonindependence working models.
To support their argument they investigate a model for binary outcomes
$$\mathrm{logit}\, E(Y_{it}) = \beta_0 + \beta_1 t,$$
with balanced design n_i ≡ n. Let β̂_struct denote the solution to the GEE with work-
ing correlation structure struct ∈ {indep, exch, AR(1)}, and let eff(β̂_struct) denote
V_true/V_struct, where V_struct is the asymptotic variance of a component of the solution
to the GEE with working correlation structure struct. When the true cov(Y_i) is of AR(1)
form but the working model is exchangeable, they show that eff(β̂_indep) ≈ eff(β̂_exch).
When the true cov(Y_i) is of exchangeable form but the working model is AR(1),
they show that eff(β̂_indep) > eff(β̂_AR(1)). They conclude that the “use of working
GEE estimator β̂_G may be counterproductive”, and that their “results contradict the
recommendations made in LZ that one should use β̂_G for higher efficiency relative to
β̂_indep.”
Investigations of infeasibility and efficiency loss under misspecification under-
taken thus far do not acknowledge the additional complications of unbalanced de-
signs and within/between-cluster covariate dispersion. The latter phenomenon has
been studied by Fitzmaurice (1995), who showed that the relative efficiency of solu-
tions obtained under working independence depends critically on the within-cluster
correlation of covariates. Furthermore, no study of tools for data-based selection of
working correlation models has been undertaken to date. A reliable correlation struc-
ture selection method would greatly diminish the reservations concerning feasibility
and efficiency loss reviewed here.
To best illustrate the problem raised by Crowder (1995), let us revisit his example
in which the true correlation structure is exchangeable with n_i = 3 and the AR(1)
working correlation matrix is used. To be more specific, we use
$$R_i = \begin{pmatrix} 1 & \alpha & \alpha^{2}\\ \alpha & 1 & \alpha\\ \alpha^{2} & \alpha & 1 \end{pmatrix}.$$
If we use the simple moment estimator as given by (4.22) or the Cholesky esti-
mator (4.32), their limits converge to ρ, the true exchangeable correlation parameter
value; i.e., asymptotically, we will use an incorrect AR(1) correlation matrix, although
R_i will never be correct no matter what α is used (even if α = ρ). In these cases, there
is no problem. However, if we use a moment estimator that is based on all the residual
products, the estimating equation is
$$q_\alpha = \sum_{i=1}^{n}(\hat\epsilon_{i1}\hat\epsilon_{i2} + \hat\epsilon_{i2}\hat\epsilon_{i3} + \hat\epsilon_{i1}\hat\epsilon_{i3}) - n\phi(2\alpha+\alpha^{2}) = 0,$$
where ε̂_{ij} = (y_{ij} − µ_{ij})/(A_i)_{jj}^{1/2}, and (A_i)_{jj} is the jth diagonal element of the matrix A_i.
A reasonable estimator of φ is
$$\hat\phi = \frac{1}{n}\sum_{i=1}^{n}(\hat\epsilon_{i1}^{2} + \hat\epsilon_{i2}^{2} + \hat\epsilon_{i3}^{2})/3.$$
Clearly, ρ̂ is a random variable taking values between -1 and 1, and it is quite possible
that pr(ρ̂ < −1/3) > 0. In fact, if the true correlation structure is exchangeable with a
parameter ρ < −1/3, we have pr(ρ̂ < −1/3) → 1 as n → +∞, indicating that α̂ would
be undefined with probability 1.
Crowder (1995) further argued that the limiting value of α̂ may not exist, and
hence the asymptotic properties of the GEE estimators break down. The main con-
cern here is that we cannot guarantee the condition n^{1/2}(α̂ − α_0) = O_p(1) holds, which
is required for the existence of α̂ and its limiting value α_0. The implication of this
concern is that it is important to ensure that α̂ is asymptotically normal. Care-
fully chosen supplementary estimating functions will not have this issue, and the choice
of a “working” correlation should be as close to the truth as possible. In the above
example, even if the true model is AR(1) but with ρ ≤ −1/3, α̂ can still be problem-
atic. This indicates that the moment estimator using all pairwise products is statistically
problematic for the AR(1) model. As we mentioned, the simple lag-1 estimator
does not have the aforementioned infeasibility issue, regardless of whether the true model
is exchangeable or AR(1).
Of course, in cases when no sensible correlation matrix is produced by a nomi-
nated “working” model, we should switch to the other working models (including
the independence model).
In the above example, when we have ρ̂ ≤ −1/3, one can consider a different
estimator for α,
$$\hat\alpha_2 = \begin{cases} \sqrt{1+3\hat\rho}-1, & \hat\rho \ge -1/3,\\ \hat\gamma_1, & \hat\rho < -1/3, \end{cases} \qquad (4.40)$$
in which $\hat\gamma_1 = \sum_{i=1}^{n}(\hat\epsilon_{i1}\hat\epsilon_{i2}+\hat\epsilon_{i2}\hat\epsilon_{i3})/(2n\hat\phi)$. The infeasibility issue also disappears when
the working matrix is AR(1) with a parameter α̂_2.
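The feasible branch of (4.40) simply inverts the moment equation 2α + α² = 3ρ. A minimal sketch (function name ours):

```python
import numpy as np

def alpha_from_rho(rho):
    """Map the pooled pairwise moment estimate rho_hat to an AR(1) working
    parameter by solving 2a + a^2 = 3*rho, i.e. a = sqrt(1 + 3*rho) - 1;
    feasible only when rho >= -1/3."""
    if rho >= -1/3:
        return np.sqrt(1 + 3 * rho) - 1
    return np.nan   # infeasible region: fall back to gamma_hat_1 (lag-1 pairs only)

print(round(alpha_from_rho(0.5), 3))    # solves 2a + a^2 = 1.5
print(alpha_from_rho(-0.4))             # infeasible region -> nan
```

Since (1 + α)² = 1 + 3ρ, the solution exists exactly when ρ ≥ −1/3, which is the boundary discussed in Crowder's example.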
Alternatively, if we choose a different working structure, say, equicorrelated with
a parameter γ, it can be estimated by the products of all pairs,
$$\hat\gamma_2 = \sum_{i=1}^{n}(\hat\epsilon_{i1}\hat\epsilon_{i2}+\hat\epsilon_{i2}\hat\epsilon_{i3}+\hat\epsilon_{i1}\hat\epsilon_{i3})/(3n\hat\phi).$$
In this case, α̂ also takes the same form as (4.40) with γ̂_1 replaced by γ̂_2, but the corre-
sponding working matrix involves two different structures: AR(1) with the parameter
$\sqrt{1+3\hat\rho}-1$ if ρ̂ > −1/3, and the equicorrelated matrix with the parameter γ̂_2. In practice,
one can also consider independence (α̂ = 0) when the nominated parametric correla-
tion matrix is not positive definite.
The conclusion is that this infeasibility is a theoretical concern; it rarely, if ever,
arises in practice.
in which Ui has Kp components denoting the contributions from subject i. Note that
Ri(k) is symmetric with non-negative eigenvalues. Applying the generalized method
of moments, we can obtain optimal estimating functions for β of dimension p.
The generalized method of moments can be applied to combine all these together for
estimating β (Hansen, 1982). The quadratic inference function for β proposed by Qu
et al. (2000) is obtained by minimizing

    U⊤(β) {cov(U)}⁻¹ U(β).
Chapter 5

Model Selection

5.1 Introduction
Model selection is an important issue in almost any practical data analysis. For lon-
gitudinal data analysis, model selection commonly includes covariate selection in
regression and correlation structure selection in the GEE discussed in Chapter 4.
Covariate selection in regression means: given a large group of covariates (includ-
ing some higher-order terms or interactive effects), one needs to select a subset to
be included in the regression model. Correlation structure selection means: given
a pool of working correlation structure candidates, select the one that is closest to the
truth and thus yields more efficient parameter estimates. For example, in the study of
the National Longitudinal Survey of Labor Market Experience introduced in Chapter 1,
covariates including age, grade, and south were recorded on the participants
over the years. Even though the number of variables is not large in this example,
when various interaction effects are included, the total number of covariates in the
statistical model can be considerable. However, only a subset of them is relevant
to the response variable. Inclusion of redundant variables may hinder accuracy
and efficiency for both estimation and inference. Furthermore, a correlation structure
needs to be specified when using the GEE to estimate parameters. Appropriate
specification of correlation structures in longitudinal data analysis can improve estimation
efficiency and lead to more reliable statistical inferences. Thus, it is important to use
statistical methodology to select important covariates and an appropriate correlation
structure. There is a large model-selection literature in statistics (e.g., Miller, 1990;
Jiang et al., 2015, and references therein), but it is mainly concerned with covariate
selection in classical linear regression with independent data. Traditional model selection
criteria such as the Akaike information criterion (AIC) and the Bayesian information
criterion (BIC) may not be useful for correlation structure selection because the joint
density function of the response variables is usually unknown in longitudinal data.
Suppose that a longitudinal dataset is composed of outcome variables Yi =
(yi1 , . . . , yini )T and corresponding covariate variables Xi = (Xi1 , . . . , Xini )T, in which
Xi j is a p × 1 vector, i = 1, . . . , N. For the sake of simplicity, we assume that ni = n
for all i and the total number of observations is M = Nn. Assume that

(i) µi j = E(yi j | Xi j ) = g(Xi jT β),   (5.1)

where g(·) is a specified function and β is an unknown parameter vector, and that

(ii) σ²i j = φν(µi j ),   (5.2)

where ν(·) is a given function of µi j and φ is a scale parameter. Assume that the
covariance matrix of Yi is Vi = φΣi, in which Σi = Ai^{1/2} Ri (α) Ai^{1/2}, where Ai = diag(ν(µi j ))
and Ri (α) is the true correlation matrix of Yi with an unknown r × 1 parameter vector
α.
where

    Ω_R = ∑_{i=1}^{N} DiT Wi⁻¹ Di.

A consistent estimator of V_R is

    V̂_R = Ω_R⁻¹ ( ∑_{i=1}^{N} DiT Wi⁻¹ Si SiT Wi⁻¹ Di ) Ω_R⁻¹,

in which α̂_R and φ̂_R are consistent estimators of α and φ. In the following sections,
we mainly introduce several criteria for selecting covariates in (5.1) and for choosing
an appropriate correlation structure for Rw in (5.3) for longitudinal data analysis.
Based on the model specification µi j = E(yi j | xi j ) and σ²i j = φν(µi j ), the quasi-log-likelihood
function of yi j is

    Q(yi j ; µi j , φ) = ∫_{yi j}^{µi j} (yi j − t) / (φν(t)) dt.
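The integral can be evaluated numerically for any variance function ν(·). The following sketch is ours (not the book's code); it uses a simple trapezoidal rule and can be checked against the Poisson-type closed form y log(µ/y) − (µ − y), which holds when ν(t) = t and φ = 1.

```python
import numpy as np

def quasi_loglik(y, mu, phi, nu, m=20001):
    """Numerically evaluate Q(y; mu, phi) = integral from y to mu of
    (y - t) / (phi * nu(t)) dt, using the trapezoidal rule with m nodes."""
    t = np.linspace(y, mu, m)
    f = (y - t) / (phi * nu(t))
    h = (mu - y) / (m - 1)
    return h * (f.sum() - 0.5 * (f[0] + f[-1]))
```

For example, with y = 2, µ = 3, φ = 1, and ν(t) = t, the result agrees with 2 log(3/2) − 1 to high accuracy.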
Therefore, for a given working correlation matrix R, the QIC can be expressed as

    QIC(R) = −2Q(β̂_R ; I) + 2 tr(Q̂_I V̂_R),   (5.4)

where the first term is the estimated quasi-log-likelihood with β = β̂_R. The second
term is the trace of the product of Q̂_I = φ⁻¹ ∑_{i=1}^{N} DiT Ai⁻¹ Di, evaluated at β = β̂_R
and φ = φ̂_R, and V̂_R given in Section 5.1. Here Q̂_I and V̂_R are directly available from
the model fitting results in many statistical software packages, such as SAS, S-Plus,
and R. Besides selecting covariates in generalized linear models, the QIC can also be
applied to select a working correlation structure in GEE: one calculates the QIC for the
various candidate working correlation structures and then picks the one with the smallest QIC.
In practice, since φ is unknown, we estimate it by φ̂ = (M − p)⁻¹ ∑_{i=1}^{N} ∑_{k=1}^{n} (yik −
µ̂ik )²/ν(µ̂ik ), in which µ̂ik is estimated from the regression model including all
covariates. When all modeling specifications in GEE are correct, Q̂_I⁻¹ and V̂_R are
asymptotically equivalent and tr(Q̂_I V̂_R ) ≈ p. Then the QIC reduces to the AIC. In
GEE with longitudinal data, one may take QICu(R) = −2Q(β̂_R ; I) + 2p as an
approximation to QIC(R), and thus QICu(R) can be potentially useful in variable selection.
However, QICu(R) cannot be applied to select the working correlation matrix R,
because the value of Q(β̂_R ; I) does not depend on the correlation matrix R. It
is worth noting that the performance of the QICs with different correlation structures
is similar for selecting covariates, but QIC(I) performs the best (Pan, 2001).
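Given the fitted quantities, the criterion is a one-line computation. The schematic helper below is ours (it assumes Q̂_I and V̂_R have already been computed elsewhere); it also illustrates that the penalty term reduces to 2p when Q̂_I⁻¹ and V̂_R coincide.

```python
import numpy as np

def qic(quasi_lik, Q_I, V_R):
    """QIC(R) = -2 * Q(beta_hat_R; I) + 2 * tr(Q_I @ V_R), as stated in the text."""
    return -2.0 * quasi_lik + 2.0 * np.trace(Q_I @ V_R)

# When all GEE specifications are correct, Q_I^{-1} and V_R agree
# asymptotically, so the penalty reduces to 2p (an AIC-type criterion):
p = 3
Q_I = 2.0 * np.eye(p)
V_R = np.linalg.inv(Q_I)   # here trace(Q_I @ V_R) equals p exactly
```

With these illustrative matrices, `np.trace(Q_I @ V_R)` is exactly p, so the penalty is the familiar 2p.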
Hardin and Hilbe (2012) proposed a slightly different version of QIC(R):

    QICH(R) = −2Q(β̂_R ; φ̂; I) + 2 tr(Ω̂_I V̂_R),

where Q(β̂_R ; φ̂; I) and V̂_R are the same as those in (5.4). However, Ω̂_I is evaluated
using β̂_I and φ̂_I instead of β̂_R and φ̂_R.
For continuous responses, the normal log-likelihood is

    l(β, α, φ) = −(1/2)[ M log φ + log |V| + (Y − Xβ)T V⁻¹ (Y − Xβ)/φ ],

where V is an M × M block-diagonal matrix with n × n blocks Σi. Therefore, the
AIC and BIC for the longitudinal data are

where (β̂, α̂, φ̂) are the maximum likelihood estimators of (β, α, φ). Specifically,
β̂ = (X T V̂⁻¹ X)⁻¹ X T V̂⁻¹ Y and φ̂ = (Y − X β̂)T V̂⁻¹ (Y − X β̂)/M, in which V̂ is
evaluated at α̂, which is estimated by maximizing the profile log-likelihood l_p(α) =
−(1/2)[M log φ̂(α) + log |V(α)|]. When the sample size N is small, the corrected AIC
for continuous longitudinal data is

    AICc = −2l(β̂, α̂, φ̂) + 2p M/(M − p − 1).

When N tends to infinity with fixed p, the AICc approaches the AIC.
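A quick numerical sketch (ours, not from the book) of the corrected AIC and its large-M behavior:

```python
def aicc(loglik, M, p):
    """Corrected AIC: -2*l + 2*p*M/(M - p - 1); approaches the AIC as M grows."""
    return -2.0 * loglik + 2.0 * p * M / (M - p - 1)
```

For a large M the correction factor M/(M − p − 1) is essentially 1, so the value converges to the plain AIC, −2l + 2p.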
If Yi is a discrete variable, Yi does not follow a multivariate normal distribution.
However, Whittle (1961) and Crowder (1985) proposed using a normal log-likelihood
to estimate β and α without assuming that Yi is normally distributed. Hence, we can
utilize the following normal log-likelihood function as a pseudo-likelihood,

    lG(β, α, φ) = −(1/2) ∑_{i=1}^{N} [ log |2πVi| + (Yi − µi)T Vi⁻¹ (Yi − µi) ],

where µi = (g(Xi1T β), . . . , g(XinT β))T. Therefore, the AIC and BIC for the longitudinal
data are
Thus, QIC(R) = −2Q(β̂_R ; I) + 2CIC(R). Without the effect of the random error from
the first term in (5.4), the CIC can be more powerful than the QIC.
The theoretical underpinning for the biased first term in (5.4) was outlined in
Wang and Hin (2010). Note that for continuous responses with φ estimated by φ̂, we
have QIC(R) = (M − p) + 2CIC(R), which means that the CIC is equivalent to the
QIC for choosing the working correlation structure with continuous responses. If the true
correlation matrix is the identity matrix I, then Q̂_I⁻¹ and V̂_R are asymptotically equivalent,
and hence CIC(R) ≈ p.
The value of C(R) should be close to zero when Rw is accurately specified. However,
C(R) is appropriate only for balanced data.
where g(·) = (g1 (X, θ), . . . , gr (X, θ))T is an r-dimensional vector of functions with r ≥ p.
The empirical likelihood ratio function for θ is defined as (Qin and Lawless, 1994)

    R(θ) = sup{ ∏_{i=1}^{N} N pi : 0 ≤ pi ≤ 1, ∑_{i=1}^{N} pi = 1, ∑_{i=1}^{N} pi g(xi , θ) = 0 }.

An explicit expression for R(θ) can be derived by the Lagrange multiplier
method. The maximum empirical likelihood estimator of θ is θ̂EL = arg max_θ R(θ).
Qin and Lawless (1994) proved that −2 log(R(θ)/R(θ̂EL)) → χ²_p, and hence confidence
regions can be constructed for θ. Note that when r = p, it can be seen that θ̂EL is equal to
the solution to the estimating equations ∑_{i=1}^{N} g(xi , θ) = 0. When r > p, the empirical
likelihood method allows us to deal with the combination of several pieces of information
about θ. However, computational issues may arise in obtaining θ̂EL.
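A minimal sketch (ours, not the authors' implementation) of the Lagrange-multiplier computation for the simplest case, estimating a mean with g(x, θ) = x − θ; the root of the multiplier equation is located by bisection rather than a production-quality solver.

```python
import numpy as np

def el_logratio(x, theta):
    """log R(theta) for the simplest case g(x_i, theta) = x_i - theta (a mean).

    Solves sum_i g_i / (1 + lam * g_i) = 0 for the Lagrange multiplier lam
    by bisection; then p_i = 1/(N*(1 + lam*g_i)) and
    log R(theta) = -sum_i log(1 + lam * g_i)."""
    g = np.asarray(x, dtype=float) - theta
    if g.min() >= 0 or g.max() <= 0:
        return -np.inf          # theta outside the convex hull: R(theta) = 0
    lo, hi = -1.0 / g.max() + 1e-10, -1.0 / g.min() - 1e-10
    for _ in range(200):        # the multiplier equation is decreasing in lam
        mid = 0.5 * (lo + hi)
        if np.sum(g / (1.0 + mid * g)) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return -np.sum(np.log1p(lam * g))
```

At θ equal to the sample mean the multiplier is zero and log R(θ) = 0, i.e. R(θ) = 1; moving θ away from the mean makes R(θ) strictly smaller, mirroring the comparison of competing structures described below equation (5.3) in the chapter.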
Define the correlation matrix RF (α) as a Toeplitz matrix with an (n − 1)-dimensional
correlation parameter vector α = (α1 , . . . , αn−1 )T. That is, the jth off-diagonal element
of RF (α) is α j. Therefore, RF (α) is a more general structure, and the most commonly
used correlation structures (independence, exchangeable, AR(1), and MA(1)) are all
embedded in RF (α). We define the GEE model with RF (α) as the “full model”.
Furthermore, unbiased estimating functions can be constructed for β and α.

Note that this empirical likelihood is built on a very weak assumption, namely the
stationarity of the underlying correlation structure. R_F (β, α) serves as a
unified measure that can be applied to most of the competing GEE models (Chen and
Lazar, 2012).
To avoid the computational issues associated with the empirical likelihood, Chen
and Lazar (2012) proposed calculating R_F (β, α) at given values of β and α. If n = 3, the
correlation matrix RF (α) is parameterized by αT = (α1 , α2 ). Suppose that there are three
correlation structure candidates: exchangeable (RE ), AR(1) (RA ), and Toeplitz (RF ),
and the corresponding estimators for β and α are (β̂E , α̂E ), (β̂A , α̂A ), and (β̂F , α̂1F , α̂2F ),
each obtained with one of the three working correlation structures from
the same data. The corresponding empirical likelihood ratios are R_F (β̂E , α̂E , α̂E ),
R_F (β̂A , α̂A , α̂A² ), and R_F (β̂F , α̂1F , α̂2F ). It is easy to see that R_F (β̂F , α̂1F , α̂2F ) equals
one, while the values for the other models are smaller than one. Therefore, the different
competing structures embedded in the general correlation structure RF (α) give different
values, which can be used to compare the competing structures and then select the
best one.
Chen and Lazar (2012) modified the AIC and the BIC by substituting the empirical
likelihood for the parametric likelihood and gave empirical likelihood versions
of the AIC and the BIC:

where θ̂w = (β̂wT , α̂wT )T. It is worth mentioning that, when the EAIC and the EBIC are
calculated, β̂w is the GEE estimator with the working correlation structure Rw , and
α̂w is the method of moments estimator of α given β̂w and Rw. Note that the EAIC
and the EBIC involve p + n − 1 unknown parameters even for exchangeable and
AR(1) correlation matrices. Therefore, if the sample size N is small and n is large, the
performance of the EAIC and the EBIC is adversely affected.
5.4 Examples
In this section, we illustrate the aforementioned criteria for covariate selection and
correlation structure selection using several real datasets. The corresponding R code
is also provided.
Table 5.1 The values of the criteria QIC, EQIC, GAIC, and GBIC under two different working
correlation structures (CS) for Example 1.
CS            Covariates                                  QIC     EQIC     GAIC     GBIC
Independence  age, bmi, time                              781.25  2258.19  2442.94  2447.52
              age, bmi, time, time²                       772.76  2233.12  2428.36  2434.47
              age, bmi, time, time², time³                670.72  1951.60  2250.88  2258.51
              age, bmi, time, time², time³, time⁴         661.67  1922.09  2232.42  2241.57
              age, bmi, time, time², time³, time⁴, time⁵  663.27  1921.63  2232.35  2243.04
Exchangeable  age, bmi, time                              784.12  2261.78  2170.26  2174.84
              age, bmi, time, time²                       775.67  2236.77  2147.46  2153.56
              age, bmi, time, time², time³                673.45  1955.28  1874.48  1882.11
              age, bmi, time, time², time³, time⁴         664.40  1925.79  1843.43  1852.59
              age, bmi, time, time², time³, time⁴, time⁵  666.32  1925.65  1841.44  1852.12
Table 5.2 The results obtained by the GEE with independence (IN), exchangeable (EX), and
AR(1) correlation structures in Example 2.
Independence correlation matrix
Estimate Std.err Wald Pr(>|W|)
Intercept 0.7969 0.3180 6.2812 0.0122
MONTH -0.2627 0.0578 20.6750 0.0000
AGE 0.1967 0.6383 0.0950 0.7580
GENDER -0.7200 0.4643 2.4052 0.1209
MONTH.AGE -0.1006 0.0991 1.0311 0.3099
MONTH.GENDER -0.1274 0.1076 1.4029 0.2362
AGE.GENDER 1.0495 0.7170 2.1425 0.1433
Exchangeable correlation matrix
Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.8518 0.3232 6.9451 0.0084
MONTH -0.2779 0.0613 20.5433 0.0000
AGE 0.1369 0.6684 0.0419 0.8378
GENDER -1.2709 0.7256 3.0679 0.0799
MONTH.AGE -0.0692 0.1007 0.4722 0.4920
MONTH.GENDER -0.1643 0.1071 2.3555 0.1248
AGE.GENDER 1.7727 0.8968 3.9075 0.0481
AR(1) correlation matrix
Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.6988 0.3076 5.1593 0.0231
MONTH -0.2409 0.0538 20.0577 0.0000
AGE 0.0083 0.5962 0.0002 0.9889
GENDER -0.4578 0.4481 1.0436 0.3070
MONTH.AGE -0.0643 0.0894 0.5177 0.4718
MONTH.GENDER -0.1771 0.0979 3.2735 0.0704
AGE.GENDER 1.0809 0.7136 2.2943 0.1298
In the statistical software R, we can first use the geeglm function in the geepack
package to obtain the parameter estimates, and then use the QIC function in the
geepack and MESS packages to calculate the QIC and the CIC. The specific results
are as follows (Table 5.4).

Table 5.3 The values of the criteria under independence (IN), exchangeable (EX), and AR(1)
correlation structures in Example 2.

         IN       EX       AR(1)
QIC      954.12   971.91   954.85
QICH     949.46   972.28   948.88
CIC      23.09    25.79    22.69
CICH     20.75    25.97    19.71
GAIC     882.53   1069.70  477.69
GBIC     899.71   1089.34  497.32
EAIC     2932.02  2782.40  243.61
EBIC     2949.20  2802.03  263.24

Table 5.4 The results obtained via the QIC function in the MESS package.

     QIC     QICu    Quasi Lik  CIC    params  QICC
IN   949.46  921.95  -453.98    20.75  7.00    949.58
EX   940.81  934.34  -460.17    10.24  7.00    940.97
AR   923.25  923.46  -454.73    6.90   7.00    923.41
In the output, QIC and CIC correspond to the values of QICH and CICH,
respectively, and QICu corresponds to the value of QICu.
It is worth noting that the values of QIC and CIC obtained via the QIC function differ
from those given in Table 5.3, except those under the independence working correlation
matrix, because Ω_R = ∑_{i=1}^{N} DiT Vi⁻¹ Di is used in the QIC function instead of
Ω_I.
Table 5.5 The results obtained by the GEE with independence (IN), exchangeable (EX), and
AR(1) correlation structures in Example 3.
Independence correlation matrix
Estimate Std.err Wald Pr(>|W|)
Intercept 2.8887 1.6027 3.2487 0.0715
base 0.1118 0.1159 0.9306 0.3347
trt 0.0090 0.2089 0.0018 0.9658
log(age) -0.4618 0.4852 0.9059 0.3412
base:trt -0.1047 0.2134 0.2407 0.6237
Exchangeable correlation matrix
Estimate Std.err Wald Pr(>|W|)
Intercept 3.9705 1.2129 10.7160 0.0011
base 0.1118 0.1159 0.9306 0.3347
trt -0.0086 0.2147 0.0016 0.9681
log(age) -0.7876 0.3649 4.6583 0.0309
base:trt -0.1047 0.2134 0.2407 0.6237
AR(1) correlation matrix
Estimate Std.err Wald Pr(>|W|)
Intercept 4.5580 1.3991 10.6127 0.0011
base 0.1561 0.1137 1.8843 0.1699
trt -0.0312 0.2093 0.0223 0.8814
log(age) -0.9765 0.4186 5.4409 0.0197
base:trt -0.1315 0.2667 0.2431 0.6220
The results under the exchangeable and AR(1) correlation matrices indicate that the
intercept and log(age) are significant.
Patient number 207 appears to be very unusual (id = 49 in the dataset). He or
she had an extremely high seizure count at baseline (151 seizures in eight weeks),
and the count doubled after treatment (302 seizures in eight weeks). The GEE method
is sensitive to outliers, and hence we refit the models after removing the outliers and
present the new results in Table 5.6.

The results differ with and without the outliers. After removing the outliers,
the interaction term between base and treatment is significant, which indicates
that the difference in the logarithm of the post- to pre-treatment ratio between the
progabide and placebo groups is significant.
Next, we use the criteria to select the correlation structure. We also calculate
the criteria with and without the 49th patient. The results are presented in Table 5.7.
When the data contain the outliers, all the criteria select the exchangeable correlation
structure. However, when the data exclude the outliers, the QIC and QICH choose
the independence correlation structure, and the CIC chooses the AR(1) correlation
structure, which indicates that the QIC and the CIC are sensitive to outliers. We
believe that the exchangeable correlation structure is more reliable. We also calculate
the sample correlation matrix (see Table 5.8), which is close to the exchangeable
correlation structure.
Table 5.6 The results obtained by the GEE with independence (IN), exchangeable (EX), and
AR(1) correlation structures when removing the outliers in Example 3.
Independence correlation matrix
Estimate Std.err Wald Pr(>|W|)
Intercept 1.6165 1.3153 1.5104 0.2191
base 0.1118 0.1159 0.9306 0.3347
trt -0.1089 0.1909 0.3253 0.5684
log(age) -0.0804 0.3966 0.0411 0.8394
base:trt -0.3024 0.1711 3.1248 0.0771
Exchangeable correlation matrix
Intercept 3.2250 1.2461 6.6977 0.0097
base 0.1118 0.1159 0.9306 0.3347
trt -0.1265 0.1884 0.4505 0.5021
log(age) -0.5629 0.3753 2.2497 0.1336
base:trt -0.3024 0.1711 3.1248 0.0771
AR(1) correlation matrix
Intercept 4.1477 1.3083 10.0503 0.0015
base 0.1521 0.1114 1.8654 0.1720
trt -0.1143 0.1943 0.3457 0.5566
log(age) -0.8515 0.3918 4.7236 0.0298
base:trt -0.4018 0.1757 5.2314 0.0222
Table 5.7 The results of the criteria for the data with and without the outliers in Example 3.
Complete data Outlier deleted
IN EX AR IN EX AR
QIC -695.76 -701.26 -678.12 -1062.05 -1014.30 -938.22
QICH -695.76 -701.42 -676.64 -1062.05 -1013.18 -934.95
CIC 11.79 10.62 11.00 11.03 10.32 9.47
CICH 11.79 10.54 11.73 11.03 10.88 11.11
C(R) 13.57 2.18 12.38 9.61 2.74 10.66
GAIC 2409.43 2142.65 2256.92 2158.03 2017.64 2098.56
GBIC 2419.81 2155.12 2269.39 2168.33 2030.00 2110.92
EAIC 142.68 13.33 60.13 1894.41 17.74 26.56
EBIC 153.07 25.79 72.59 1904.71 30.10 38.92
Table 5.8 The correlation matrix of the data with and without the outliers in Example 3.
Complete data Outlier deleted
0 1.00 1.00
1 0.79 1.00 0.68 1.00
2 0.83 0.87 1.00 0.73 0.69 1.00
3 0.67 0.74 0.80 1.00 0.49 0.54 0.67 1.00
4 0.84 0.89 0.89 0.82 0.75 0.72 0.76 0.71
Chapter 6
Robust Approaches
6.1 Introduction
The GEE method is robust against the misspecification of correlation structures, but
is sensitive to outliers because it is essentially a generalized weighted least squares
approach (Jung and Ying, 2003; Wang and Zhao, 2008). In this chapter, we will
introduce several robust methods for parameter estimation in analysis of longitudinal
data. In §6.2, we introduce the rank-based method, which is distribution-free, robust,
and highly efficient (Hettmansperger, 1984). In §6.3, we introduce quantile
regression, which gives a global assessment of covariate effects on the distribution
of the response variable, provides a complete description of that distribution, and is
robust against outliers (Koenker and Bassett, 1978). In §6.4, we introduce other
methods based on Huber's function and the exponential squared loss function, which
are robust not only against outliers in the response variable but also against outliers
in the covariates.
where β is the parameter vector corresponding to the covariate vector Xik of dimension
p. Assume that the median of εik − εjl is zero. To avoid complications caused
by ties, we assume that the error terms εik are continuous variables. Define residuals
eik (β̂) = yik − XikT β̂, where β̂ is a given consistent estimator of β.
Jung and Ying (2003) proposed estimating β using the following estimating functions:

    U_I (β) = M⁻¹ ∑_{i=1}^{N} ∑_{k=1}^{n_i} (Xik − X̄) rik (β).   (6.2)
Let β̂_I be the resultant estimator from (6.2), which can also be obtained by minimizing
the following loss function

    L(β) = M⁻² ∑_{i=1}^{N} ∑_{k=1}^{n_i} ∑_{j=1}^{N} ∑_{l=1}^{n_j} |eik (β) − ejl (β)|.   (6.3)
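The loss (6.3) is easy to evaluate directly. The following sketch is ours (the function and variable names are illustrative); it computes the loss from cluster-wise responses and covariates, including the pairs within the same subject, exactly as the quadruple sum does.

```python
import numpy as np

def rank_loss(beta, X_list, y_list):
    """Loss (6.3): L(beta) = M^{-2} * sum over ALL observation pairs of
    |e_ik(beta) - e_jl(beta)|, with e_ik = y_ik - X_ik' beta and M the
    total number of observations."""
    e = np.concatenate([y - X @ beta for X, y in zip(X_list, y_list)])
    M = e.size
    return np.abs(e[:, None] - e[None, :]).sum() / M**2
```

For instance, with a single cluster whose residuals are 0 and 1, the quadruple sum is 2 and M² is 4, giving a loss of 0.5.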
To make inferences about the regression coefficients of model (6.1), such as testing
the null hypothesis H0 : β = β0 , Jung and Ying (2003) proposed a test statistic
N U_I^T (β) V⁻¹ U_I (β), which bypasses density estimation and bandwidth selection,
where V is the covariance matrix of N⁻¹ᐟ² U_I (β). A consistent estimator of V is
V̂ = N⁻¹ ∑_{i=1}^{N} ξ̂i ξ̂iT, where

    ξ̂i = ∑_{k=1}^{n_i} [ (Xik − X̄){N⁻¹ rik (β̂_I) − 1/2} − N⁻¹ ∑_{i′=1}^{N} ∑_{k′=1}^{n_{i′}} (Xi′k′ − X̄) I(ei′k′ (β̂_I) ≤ eik (β̂_I)) ].

Under the null hypothesis, the test statistic approximately follows a χ² distribution with
p degrees of freedom, and the null hypothesis is rejected for a large observed value
of the test statistic.
The function (6.3) is based on the independence working model assumption, and thus
the efficiency of β̂_I may be improved by taking account of the correlations and the
impacts of varying cluster sizes. To this end, Wang and Zhao (2008) introduced
weighted estimating functions.
To make use of all the observations, we would repeat this resampling process many
times. Conditional on sampling one observation from each cluster, the probability of
yik being sampled is 1/ni. Therefore, the limiting dispersion function is

    Lw (β) = M⁻² ∑_{i=1}^{N} ∑_{j≠i} (ni nj)⁻¹ ∑_{k=1}^{n_i} ∑_{l=1}^{n_j} |eik − ejl|.
This motivates Wang and Zhao (2008) to consider the following weighted loss
function for the estimation of β,

    Lwz (β) = M⁻² ∑_{i=1}^{N} ∑_{j≠i} ∑_{k=1}^{n_i} ∑_{l=1}^{n_j} ωi ωj |eik − ejl|,

where r̄ is the mean of the ranks rik for k = 1, . . . , ni and i = 1, . . . , N. When the
within-subject correlation is weak, it can be ignored in the analysis, and ωi = 1
corresponds to the classical rank regression.
Let fik (·) and Fik (·) be the probability density function and the cumulative
distribution function of εik, for k = 1, . . . , ni and i = 1, . . . , N. Let Vw = lim_{M→∞} ∑_{i=1}^{N} E(ηi ηiT),
where

    ηi = ∑_{j≠i} ∑_{k=1}^{n_i} ∑_{l=1}^{n_j} ωi ωj (xik − xjl) { Fjl (εik) − 1/2 }.

Furthermore, let

    D_N = (2/M²) ∑_{i=1}^{N} ∑_{j≠i} ∑_{k=1}^{n_i} ∑_{l=1}^{n_j} ωi ωj (xik − xjl)(xik − xjl)T ∫ fik dFik.

In practice, ηi can be estimated by

    η̂i = ∑_{j≠i} ∑_{k=1}^{n_i} ∑_{l=1}^{n_j} ωi ωj (xik − xjl) { I(eik ≥ ejl) − 1/2 }.
The proposed weighted rank method is simple and effective because it utilizes only
the weight function to incorporate correlations and cluster sizes for efficiency gain.
However, when the design is balanced, this method is equivalent to Jung and
Ying's method. Furthermore, this method may reduce the efficiency of the parameter
estimators because it weights only at the cluster level, while the covariates vary within
each subject (Neuhaus and Kalbfleisch, 1998).
6.2.3 Combined Method
Decompose L(β) given in (6.3) into two parts, as

    L(β) = M⁻² ∑_{i=1}^{N} ∑_{j≠i} ∑_{k=1}^{n_i} ∑_{l=1}^{n_j} |eik (β) − ejl (β)| + M⁻² ∑_{i=1}^{N} ∑_{k=1}^{n_i} ∑_{l=1}^{n_i} |eik (β) − eil (β)|
         = L_B (β) + M⁻¹ L_W (β),

where L_B (β) and L_W (β) stand for the between- and within-subject loss functions,
respectively. As M tends to infinity, L(β) ≈ L_B (β), and thus the information contained
in the within-subject loss function L_W (β) is ignored. Therefore, Wang and Zhu (2006)
proposed minimizing L_B (β) and L_W (β) separately and then combining the two
resulting estimators in some optimal sense.
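The decomposition can be verified numerically. In the following sketch (ours, with arbitrary toy residuals), the between-subject part obtained by subtracting M⁻¹L_W from L agrees with the between-cluster pairs computed directly.

```python
import numpy as np

# Residuals for two toy clusters (any numbers would do).
e = [np.array([0.0, 2.0]), np.array([1.0, 3.0, 5.0])]
M = sum(len(ei) for ei in e)
all_e = np.concatenate(e)

# L(beta): all observation pairs, within- and between-cluster alike.
L = np.abs(all_e[:, None] - all_e[None, :]).sum() / M**2
# L_W(beta): within-cluster pairs only, scaled by M^{-1}.
LW = sum(np.abs(ei[:, None] - ei[None, :]).sum() for ei in e) / M
LB = L - LW / M                      # between-subject part via the identity

# Direct between-cluster computation for comparison.
LB_direct = 0.0
for i in range(len(e)):
    for j in range(len(e)):
        if i != j:
            LB_direct += np.abs(e[i][:, None] - e[j][None, :]).sum()
LB_direct /= M**2
```

With these toy residuals, L = 48/25 = 1.92 and both routes give the same between-subject loss.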
The estimating functions based on the between- and within-subject ranks L_B (β)
and L_W (β) can be obtained as:

    S_B (β) = M⁻² ∑_{i=1}^{N} ∑_{j≠i} ∑_{k=1}^{n_i} ∑_{l=1}^{n_j} (Xik − Xjl) I(ejl ≤ eik),   (6.5)

    S_W (β) = M⁻¹ ∑_{i=1}^{N} ∑_{k≠l} (Xik − Xil) I(eil ≤ eik).   (6.6)
Because the median of eik − ejl is zero, (6.5) and (6.6) are unbiased and can be used
to estimate β. Suppose that β̂_B and β̂_W are the estimators derived from (6.5) and (6.6),
respectively. It seems natural to combine these two estimators to obtain a more efficient
estimator (Heyde, 1987). Let X_B = M⁻² ∑_{i=1}^{N} ∑_{j≠i} ∑_{k=1}^{n_i} ∑_{l=1}^{n_j} (Xik − Xjl)(Xik − Xjl)T and
X_W = M⁻¹ ∑_{i=1}^{N} ∑_{k=1}^{n_i} ∑_{l=1}^{n_i} (Xik − Xil)(Xik − Xil)T. Wang and Zhu (2006) proposed
finding an optimal estimator of β from an optimal linear combination of the estimating
functions for β:

    S_C (β) = (X_B , X_W) Σ⁻¹ (S_B (β)T, S_W (β)T)T,   (6.7)
where Σ is the covariance matrix of S_B (β) and S_W (β). Because Σ is unknown,
Wang and Zhu (2006) proposed using a resampling method to bypass density
estimation. Suppose that {zi}_{i=1}^{N} are independently sampled from the binomial
distribution B(N, 1/N). The bootstrap method of resampling the subjects with replacement
leads to the following perturbed estimating functions:

    S̃_B (β) = M⁻² ∑_{i=1}^{N} ∑_{j≠i} ∑_{k=1}^{n_i} ∑_{l=1}^{n_j} zi zj (Xik − Xjl) I(ejl ≤ eik),

    S̃_W (β) = M⁻¹ ∑_{i=1}^{N} ∑_{k≠l} zi (Xik − Xil) I(eil ≤ eik).
where τ₁² = var{I(eik ≤ eil )}, τ₂² = var{I(eik ≤ ejl )}, and ρ1 , . . . , ρ6 are six types of
correlation coefficients among {Sik (β), k = 1, . . . , ni ; i = 1, . . . , N}, given as follows:
ρ1 = corr{I(eik ≤ eil ), I(eik ≤ eil0 )}, ρ2 = corr{I(eik ≤ eil ), I(eik ≤ ers )},
ρ3 = corr{I(eik ≤ e jl ), I(eik ≤ e jl0 )}, ρ4 = corr{I(eik ≤ e jl ), I(eik ≤ ers )},
ρ5 = corr{I(eik ≤ e jl ), I(eik0 ≤ e jl0 )}, ρ6 = corr{I(eik ≤ e jl ), I(eik0 ≤ ers )}.
These six types of correlation coefficients describe the correlations between the
relative orderings of two correlated pairwise residuals. For example, ρ1
indicates the correlation between the relative orderings of two pairwise residuals from
the same subject, and ρ6 indicates the correlation between the relative ordering of
two different residuals from the same subject and that of another two distinct residuals
from different subjects.
Note that if there are no within-subject correlations, ρ5 = ρ6 = 0. If correlations
exist and the errors have an exchangeable structure, then the pairwise residuals in ρ1
and ρ4 are identically distributed, and hence ρ1 = ρ4 = 1/3. Furthermore, it can be seen
that σi² = O(1), σii = O(1), and σij = O(M⁻¹) for i ≠ j, which implies that the
correlations between rank residuals from different subjects can be ignored as M tends to
infinity. Utilizing the idea of the GEE, Fu and Wang (2012) proposed using a block-diagonal
matrix diag(V1 , · · · , VN) as a working covariance matrix for V, and obtained
an estimate of β using the estimating equations

    S_G (β) = ∑_{i=1}^{N} DiT Vi⁻¹ Si (β) = 0,   (6.9)
where Vi⁻¹ = (σi² − σii)⁻¹ { I_{ni} − σii [σi² + (ni − 1)σii]⁻¹ J_{ni×ni} } is the inverse matrix of
Cov(Si (β)), in which I is an identity matrix and J is a matrix with all elements
equal to one. Equation (6.9) is the generalized estimating equation for Si (β). Let
β̂_G be the resulting estimator from (6.9); then it can be shown that √N(β̂_G − β) is
asymptotically normal with mean zero and covariance matrix

    Σ_G = lim_{N→∞} N ( ∑_{i=1}^{N} DiT Vi⁻¹ D̄i )⁻¹ ( ∑_{i=1}^{N} DiT Vi⁻¹ Cov{Si (β0)} Vi⁻¹ Di ) ( ∑_{i=1}^{N} D̄iT Vi⁻¹ Di )⁻¹.
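The closed-form inverse quoted above is easy to verify numerically. The following sketch (ours, with hypothetical values of n, σi², and σii) checks that the formula inverts the compound-symmetric covariance matrix (σi² − σii)I + σii J:

```python
import numpy as np

# Hypothetical values; any n >= 2 with sigma2 > sigma_ii > -(sigma2)/(n-1) work.
n, sigma2, sigma_ii = 4, 2.0, 0.5

# Compound-symmetric Cov(S_i): (sigma2 - sigma_ii) * I + sigma_ii * J.
V = (sigma2 - sigma_ii) * np.eye(n) + sigma_ii * np.ones((n, n))

# Closed-form inverse from the text:
# V^{-1} = (sigma2 - sigma_ii)^{-1} {I - sigma_ii [sigma2 + (n-1)sigma_ii]^{-1} J}.
V_inv = (np.eye(n) - sigma_ii / (sigma2 + (n - 1) * sigma_ii) * np.ones((n, n))) / (sigma2 - sigma_ii)
```

Multiplying the two matrices returns the identity, confirming the algebra.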
To avoid estimating the joint density of the error terms, we use the resampling
method to estimate Σ_G. Let {zi}_{i=1}^{N} be sampled from a distribution with unit mean and
unit variance. We can derive an estimate of β by solving equation (6.10) for each
sequence {zi}_{i=1}^{N}. Therefore, independent sequences {zi}_{i=1}^{N} can result in many
estimates of β, and the covariance of these bootstrap estimates can serve as an estimate
of Σ_G.
Figure 6.1 Boxplots of the time in log2 seconds of pain tolerance in four trials in the
pediatric pain study. The two row panels represent the attender and distracter baseline groups,
and the three column panels represent the three treatments.
Let yik be the log2 of pain tolerance time of the kth trial for the ith subject, and
B be the baseline indicator taking 1 for attenders and 0 for distracters. The attend
treatment is denoted by A, the distract treatment by D, and the no-advice treatment
by F. Let A = 1 for the attend treatment and A = 0 otherwise,
Figure 6.2 Boxplots of the time in log2 seconds of pain tolerance for girls and boys in
four trials in the pediatric pain study.
where β0 is the mean of the distracter group (baseline), and β1 indicates the difference
between the attender and distracter groups. Parameters (β2 , β3 , β4 ) correspond to the
three treatment effects for the attender group, and (β5 , β6 , β7 ) correspond to the three
treatment effects for the distracter group. The standard errors for the estimates obtained
by Jung and Ying's method and Fu and Wang's method are based on 3000 resampling
estimates drawn from an exponential distribution with unit variance.
Parameter estimates and their standard errors obtained from the different methods are
given in Table 6.1. The results obtained via the GEE method depend on the selection
of the working correlation matrix. Compared to the GEE method, all the rank-based
methods indicate that the distract treatment given to distracters will help distracters
increase their pain tolerance (the estimate for β6 is significant), which may have
implications for medical treatments with painful procedures. In addition, except for the
GEE with an exchangeable working matrix, all the other methods indicate that
girls have much stronger pain tolerance than boys, as Figure 6.2 indicates.

Figure 6.3 Scatterplots of the time in log2 seconds of pain tolerance for the four trials in the
pediatric pain study. The diagonal panels show the densities of the responses in the four trials,
and the off-diagonal panels show the pairwise correlations (Trial 1 with Trials 2-4: 0.726,
0.836, 0.604; Trial 2 with Trials 3-4: 0.718, 0.662; Trial 3 with Trial 4: 0.764; all significant).
The plots were produced by ggpairs in GGally.
Table 6.1 Parameter estimates and their standard errors (SE) for the pediatric pain tolerance
study. JY: the method of Jung and Ying (2003); CM: the method of Wang and Zhu (2006); WZ:
the method by Wang and Zhao (2008); FW: the method by Fu and Wang (2012); GEEIN :
the GEE method with an independence working matrix; GEEEX : the GEE method with an
exchangeable working matrix; GEEAR(1) : the GEE method with an AR(1) working matrix;
GEEUN : the GEE method with an unstructured working matrix. ∗ indicates that the p-value
is less than 0.05.
JY CM WZ FW GEEIN GEEEX GEEAR(1) GEEUN
β1 -0.57∗ -0.47∗ -0.57∗ -0.59∗ -0.60∗ -0.59∗ -0.71 -0.65∗
(SE) (0.18) (0.24) (0.21) (0.18) (0.24) (0.25) (0.24) (0.24)
β2 0.31 0.24 0.34 0.27 0.11 0.07 0.11 0.21
(SE) (0.22) (0.22) (0.23) (0.22) (0.18) (0.13) (0.14) (0.23)
β3 -0.002 -0.06 0.03 0.15 0.04 0.04 0.13 -0.07
(SE) (0.23) (0.28) (0.25) (0.26) (0.25) (0.29) (0.29) (0.24)
β4 -0.25 -0.11 -0.25 -0.38 -0.11 -0.11 -0.01 -0.16
(SE) (0.33) (0.21) (0.26) (0.26) (0.12) (0.13) (0.13) (0.25)
β5 -0.36 -0.28 -0.39 -0.43∗ -0.35∗ -0.27∗ -0.10 -0.39
(SE) (0.28) (0.20) (0.26) (0.19) (0.16) (0.13) (0.22) (0.27)
β6 0.95∗ 0.87∗ 0.80∗ 1.07∗ 0.57 0.55 0.66 0.87∗
(SE) (0.43) (0.36) (0.42) (0.37) (0.30) (0.35) (0.31) (0.35)
β7 -0.69∗ -0.60∗ -0.71∗ -0.54∗ -0.48∗ -0.31∗ -0.38 -0.81∗
(SE) (0.33) (0.27) (0.35) (0.25) (0.20) (0.13) (0.16) (0.31)
β8 0.46∗ 0.40∗ 0.47∗ 0.59∗ 0.44 0.50∗ 0.59 0.43
(SE) (0.17) (0.22) (0.21) (0.16) (0.24) (0.23) (0.21) (0.23)
let εik = yik − XikT βτ be a continuous error term satisfying P(εik ≤ 0) = τ, with
an unspecified density function fik (·). The median regression is obtained by taking
τ = 0.5.
The resulting estimator β̂τI from (6.11) can also be derived by minimizing the
following loss function

    Lτ (β) = ∑_{i=1}^{N} ∑_{k=1}^{n_i} ρτ (yik − XikT β),   (6.12)

where ρτ (u) = u{τ − I(u ≤ 0)} (Koenker and Bassett, 1978). Koenker and D’Orey
(1987) developed an efficient algorithm to optimize Lτ (β), which is available in the
R package quantreg.
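As a small illustration (not the quantreg algorithm), minimizing the check loss over a one-dimensional grid of candidate locations recovers the corresponding sample quantile:

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check function rho_tau(u) = u * (tau - I(u <= 0))."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u <= 0))

def fit_scalar_quantile(y, tau, grid):
    """Minimize the total check loss over a grid of candidate locations."""
    losses = [check_loss(y - b, tau).sum() for b in grid]
    return grid[int(np.argmin(losses))]
```

With τ = 0.5 this yields the sample median, illustrating why (6.12) gives median regression at τ = 0.5.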
Using a similar argument to that given in Chamberlain (1994) for the case of
independent observations, the asymptotic distribution of N¹ᐟ² (β̂τI − βτ ) is normal,
N(0, Aτ⁻¹ var{Uτ (βτ )} Aτ⁻¹), as N → ∞, where Aτ is the expected value of the
derivative of Uτ (βτ ) with respect to βτ. It is difficult to estimate this covariance
matrix because Aτ may involve the unknown density functions.
A resampling method can be used to approximate the distribution of β̂_τ^I without
involving any complicated and subjective nonparametric functional estimation. Let

L̃_τ(β_τ) = Σ_{i=1}^N z_i Σ_{k=1}^{n_i} ρ_τ(y_ik − X_ik^T β_τ),   (6.13)
ρ_kl = [C(F_ik(0), F_il(0)) − τ²]/(τ − τ²)
     = [C_kl(τ, τ) − τ²]/(τ − τ²)
     = [Φ₂(Φ^{-1}(τ), Φ^{-1}(τ); γ_kl) − τ²]/(τ − τ²),

where Φ₂(·, ·; γ_kl) denotes the standardized bivariate normal distribution with
correlation coefficient γ_kl = corr(ε_ik, ε_il). Specifically, when γ_kl = 0, ε_ik and ε_il
are independent, and hence C_kl(τ, τ) = τ² and ρ_kl = 0, which indicates that s_ik
and s_il are uncorrelated. When τ = 0.5, C_kl(τ, τ) = 1/4 + (2π)^{-1} arcsin(γ_kl), and
hence ρ_kl = (2/π) arcsin(γ_kl).
To specify the correlation matrix R_i of (s_i1, . . . , s_in_i) for i = 1, . . . , N, we need to
specify the correlation structure of ε_i = (ε_i1, . . . , ε_in_i)^T. When the correlation structure
of ε_i is exchangeable, that is, P(ε_ik ≤ 0, ε_il ≤ 0) is a constant δ for any k ≠ l, the
correlation coefficient of s_ik and s_il equals ρ = (δ − τ²)/(τ − τ²), and hence the correlation
matrix of S_τi = (s_i1, . . . , s_in_i)^T is R_τi = (1 − ρ)I_{n_i} + ρJ_{n_i}, where I_{n_i} is the n_i × n_i identity
matrix and J_{n_i} is an n_i × n_i matrix of 1s. Therefore, the correlation structure of S_τi is
exchangeable. Similarly, when the correlation structure of ε_i is MA(1), R_i is MA(1).
When the correlation structure of ε_i is AR(1) or a Toeplitz matrix, R_i is a Toeplitz
matrix. Therefore, we can construct various correlation structures for R_i via Gaussian
copulas.
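The copula identity above is easy to verify numerically. The following sketch (illustrative only) evaluates C_kl(τ, τ) with SciPy's bivariate normal CDF and checks the closed form at τ = 0.5:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def rho_sign(tau, gamma):
    """Correlation of the sign statistics s_ik, s_il implied by a
    Gaussian copula with latent correlation gamma."""
    q = norm.ppf(tau)
    cov = [[1.0, gamma], [gamma, 1.0]]
    c = multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([q, q])  # C_kl(tau, tau)
    return (c - tau**2) / (tau - tau**2)

gamma = 0.6
lhs = rho_sign(0.5, gamma)
rhs = (2.0 / np.pi) * np.arcsin(gamma)  # closed form at tau = 0.5
```

The two quantities agree to numerical integration accuracy, confirming ρ_kl = (2/π) arcsin(γ_kl) at the median.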
If there is no correlation, ρ = 0 and R_i^{-1} = I_{n_i}. In this case, U_Gτ(β_τ) is equivalent to
U_τ(β).
Suppose that β̂_Gτ is the resulting estimator from U_Gτ(β_τ). Under some regularity
conditions, we can prove that N^{-1/2} U_τ(β_τ) → N(0, V_U), where

V_U = lim_{N→∞} N^{-1} Σ_{i=1}^N X_i^T V_i^{-1} Cov(S_i) V_i^{-1} X_i.

Furthermore, β̂_Gτ is a consistent estimator of β_τ for a given τ, and √N(β̂_Gτ − β_τ) →
N(0, Σ_Gτ), where

Σ_Gτ = lim_{N→∞} N D_τ^{-1}(β) Σ_{i=1}^N X_i^T V_τi^{-1} Cov(S_τi) V_τi^{-1} X_i {D_τ^{-1}(β)}^T,

where D_τ(β) = Σ_{i=1}^N X_i^T V_τi^{-1} Λ_i X_i, and Λ_i is an n_i × n_i diagonal matrix with the
k-th diagonal element f_ik(0). The covariance matrix Cov(S_τi) is unknown and can
be estimated empirically. These asymptotic properties can be derived following
Jung (1996) and Yin and Cai (2005).
where Φ(·) is the standard normal cumulative distribution function. Because Ũ_Gτ(β_τ)
is a smooth function of β_τ, we can calculate ∂Ũ_Gτ(β_τ)/∂β_τ, which can be used as an
approximation of D_τ. Let

D̃_τ(β_τ) = ∂Ũ_Gτ(β_τ)/∂β_τ = −Σ_{i=1}^N X_i^T V_i^{-1} Λ̃_i X_i,

where Λ̃_i is an n_i × n_i diagonal matrix with k-th diagonal element
σ_ik^{-1} φ((y_ik − X_ik^T β_τ)/σ_ik), φ(·) is the standard normal density function, and
σ_ik = (X_ik^T Γ X_ik)^{1/2}.
In general, the resulting estimator β̃_Gτ from Ũ_Gτ(β_τ) and its covariance matrix can
be obtained by iteration. Taking the exchangeable working correlation structure as
an example, we give the explicit stepwise procedure for the algorithm:

Step 1: Let the estimator obtained from equation (6.12) be the initial estimate, that is,
β̃_τ^(0) = β̂_τ^I, and set Γ^(0) = I_p/N.

Step 2: Given β̃_τ^(k−1) and Γ^(k−1) from the (k − 1)-th step, let ε̂_il = y_il − X_il^T β̃_τ^(k−1). Obtain
δ̂^(k−1) by

δ̂^(k−1) = [Σ_{i=1}^N Σ_{k=1}^{n_i} Σ_{l≠k} I(ε̂_ik ≤ 0, ε̂_il ≤ 0)] / [Σ_{i=1}^N n_i(n_i − 1)].

Step 3: Update

β̃_τ^(k) = β̃_τ^(k−1) + {−D̃_τ(δ̂^(k−1), β̃_τ^(k−1), Γ^(k−1))}^{-1} Ũ_Gτ(δ̂^(k−1), β̃_τ^(k−1), Γ^(k−1)),

Γ^(k) = D̃_τ^{-1}(δ̂^(k−1), β̃_τ^(k), Γ^(k−1)) V_U(δ̂^(k−1), β̃_τ^(k)) D̃_τ^{-1}(δ̂^(k−1), β̃_τ^(k), Γ^(k−1)).

Step 4: Repeat Steps 2–3 until the algorithm converges.

The final values of β̃_τ and Γ can be taken as the smoothed estimator β̂_Gτ and its
covariance matrix, respectively. Under some regularity conditions,
N^{-1/2}{Ũ_Gτ(β_τ) − U_Gτ(β_τ)} = o_p(1), the smoothing estimator β̃_Gτ → β₀ in probability,
and √N(β̃_Gτ − β_τ) converges in distribution to N(0, Σ_Gτ). Therefore, the
smoothed and unsmoothed estimating functions are asymptotically equivalent
uniformly in β, and the limiting distributions of the smoothed estimators coincide with
those of the unsmoothed estimators. The induced smoothing method can also be applied to the
rank-based methods introduced in Section 2. More details can be found in Fu et al.
(2010).
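A minimal, illustrative sketch of the induced-smoothing iteration, assuming an independence working structure (so the δ̂ and working-correlation steps drop out) and simulated data; the design, sample size, and variable names here are hypothetical, not from the book:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import fsolve

# simulated data: hypothetical design, true beta = (1, 2), tau = 0.5
rng = np.random.default_rng(1)
N, p, tau = 200, 2, 0.5
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.standard_normal(N)  # N(0,1) errors: median-zero at tau = 0.5

beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)  # Step 1: initial estimate
Gamma = np.eye(p) / N                                  # initial Gamma = I_p / N

for _ in range(5):                                     # Steps 2-4: iterate
    sigma = np.sqrt(np.einsum("ij,jk,ik->i", X, Gamma, X))  # sigma_ik

    def U_smooth(beta):
        # smoothed score: indicator I(y <= x'beta) - tau replaced by
        # Phi((x'beta - y)/sigma) - tau
        return X.T @ (norm.cdf((X @ beta - y) / sigma) - tau)

    beta_hat = fsolve(U_smooth, beta_hat)               # warm start at previous value
    lam = norm.pdf((X @ beta_hat - y) / sigma) / sigma  # diagonal of Lambda-tilde
    D = X.T @ (X * lam[:, None])                        # D-tilde
    s = norm.cdf((X @ beta_hat - y) / sigma) - tau
    V = X.T @ (X * (s**2)[:, None])                     # empirical var of the score
    Dinv = np.linalg.inv(D)
    Gamma = Dinv @ V @ Dinv                             # sandwich update of Gamma
```

At convergence `beta_hat` plays the role of β̃_Gτ and `Gamma` of its covariance estimate; the full method would additionally carry the exchangeable working matrix through D̃ and Ũ.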
6.3.4 Working Correlation Structure Selection
Suppose that there are J candidate working correlation matrices constructed via Gaussian
copulas: R_i^j, j = 1, . . . , J. Parameter estimates of β_τ can be obtained by solving the J
estimating equations

Σ_{i=1}^N X_i^T A_i^{-1/2} [R_i^j]^{-1} A_i^{-1/2} S_i(β) = 0,  j = 1, . . . , J.   (6.14)
For a given set of correlation structures, we obtain estimates of βτ and ρ and then
choose the final correlation structure corresponding to the minimum value of GAIC
or GBIC.
There are J × p equations in (6.14) based on J different working correlation matri-
ces. Therefore, the number of equations is larger than the number of parameters. We
can use the quadratic inference function (QIF) method proposed by Qu et al. (2000)
and the empirical likelihood method to combine these equations. More details can be
found in Leng and Zhang (2014) and Fu and Wang (2016).
Figure 6.4 The distance versus the measurement time for boys and girls.
The model considered is

y_ik = β₀ + β₁ Gender_i + β₂ Age_ik + β₃ Gender_i × Age_ik + ε_ik,

where y_ik and Age_ik are the distance and age for the i-th subject at time k, respectively,
and Gender_i takes −1 for girls and 1 for boys. Parameters β₀ and β₂ denote
the intercept and slope of the average growth curve for the entire group. Parameters
β₁ and β₃ denote the deviations from this average intercept and slope for the group
of girls and the group of boys, respectively. Therefore, the intercept and slope of
the average growth curve for girls are β₀ − β₁ and β₂ − β₃, respectively;
equivalently, the intercept and slope of the average growth curve for boys are
β₀ + β₁ and β₂ + β₃.
The parameter estimates and their standard errors are presented in Table 6.2. The
results indicate that β₂ is significant at three different quantiles, which indicates
a linear relationship between distance and age. The parameter β₁ is not
significant, which indicates that, on average, girls and boys do not differ significantly
in their initial dental distance. The parameter estimate of β₃ is positive and
significant at τ = 0.25 but not at τ = 0.5, 0.75, and 0.95, which indicates
that boys with low distances (bottom 25%) increase faster over time than girls at
τ = 0.25, while there is no significant difference between boys' and girls' distance
increments at τ = 0.5 and 0.75. The proposed criteria have their lowest values when the
correlation structure is exchangeable at τ = 0.25, 0.75, and 0.95. However, the GAIC
and GBIC criteria reach their minimum values when the correlation structure is AR(1) at
τ = 0.5. This indicates that kids at the top or bottom 25% have more steady correlations
OTHER ROBUST METHODS 105
Boy Girl
32
28
Distance (mm)
24
20
16
8 10 12 14 8 10 12 14
Age (year)
over time (hence exchangeable is appropriate), whereas correlations at the median decay
quickly over the years, which makes the AR(1) model appropriate. This plausible
explanation needs to be tested on a much larger dataset.
Table 6.2 Parameter estimates (Est) and their standard errors (SE) of the estimators β̂_I, β̂_EX,
β̂_AR, and β̂_MA obtained from estimating equations with independence, exchangeable, AR(1),
and MA(1) correlation structures, respectively, together with the values of the GAIC and
GBIC criteria for the dental dataset.
τ = 0.25
β0 β1 β2 β3 GAIC GBIC
Est (Sd) Est (Sd) Est (Sd) Est (Sd)
β̂I 20.690 (0.299) 0.513 (0.299) 0.543 (0.063) 0.175 (0.063) -64.790 -59.606
β̂EX 20.690 (0.299) 0.513 (0.299) 0.543 (0.063) 0.175 (0.063) -69.584 -63.105
β̂AR 20.726 (0.298) 0.542 (0.298) 0.540 (0.063) 0.172 (0.063) -65.797 -59.318
β̂ MA 20.718 (0.299) 0.547 (0.299) 0.541 (0.064) 0.173 (0.064) -65.189 -58.709
τ = 0.50
β0 β1 β2 β3 GAIC GBIC
Est (Sd) Est (Sd) Est (Sd) Est (Sd)
β̂I 21.877 (0.348) 0.729 (0.348) 0.591 (0.078) 0.095 (0.078) -33.720 -28.536
β̂EX 21.877 (0.348) 0.729 (0.348) 0.591 (0.078) 0.095 (0.078) -62.332 -55.852
β̂AR 21.918 (0.306) 0.795 (0.306) 0.595 (0.081) 0.088 (0.081) -64.971 -58.492
β̂ MA 21.896 (0.307) 0.803 (0.307) 0.599 (0.085) 0.078 (0.085) -56.289 -49.810
τ = 0.75
β0 β1 β2 β3 GAIC GBIC
Est (Sd) Est (Sd) Est (Sd) Est (Sd)
β̂I 23.306 (0.561) 0.704 (0.561) 0.657 (0.094) 0.133 (0.094) -72.790 -67.606
β̂EX 23.306 (0.561) 0.704 (0.561) 0.657 (0.094) 0.133 (0.094) -107.511 -101.032
β̂AR 23.390 (0.592) 0.802 (0.592) 0.660 (0.093) 0.119 (0.093) -101.206 -94.726
β̂ MA 23.344 (0.593) 0.817 (0.593) 0.660 (0.094) 0.106 (0.094) -88.079 -81.600
τ = 0.95
β0 β1 β2 β3 GAIC GBIC
Est (Sd) Est (Sd) Est (Sd) Est (Sd)
β̂I 25.719 (0.525) 1.335 (0.525) 0.740 (0.065) 0.068 (0.065) -22.167 -16.984
β̂EX 25.719 (0.525) 1.334 (0.525) 0.740 (0.065) 0.068 (0.065) -27.287 -20.808
β̂AR 25.698 (0.520) 1.327 (0.520) 0.749 (0.068) 0.070 (0.068) -20.386 -13.907
β̂ MA 25.719 (0.525) 1.334 (0.525) 0.740 (0.065) 0.068 (0.065) -20.166 -13.687
to ensure the Fisher consistency of the estimate. The matrix W_i = diag(w_i1, · · · , w_in_i)
is a diagonal weight matrix that downweights the effect of leverage points in the
covariates.

where d and κ are tuning constants, and µ̂_x and S_x are robust estimates of the
location and covariance of x_ik (Rousseeuw and van Zomeren, 1990). One can use
high-breakdown-point location and scatter estimators, such as the Minimum Covariance
Determinant and Minimum Volume Ellipsoid estimators. For the tuning parameters, we
can use κ = 2 and d = χ²_{0.95}(p), the 0.95 quantile of the χ²(p) distribution.
Note that if ψ(e) = e and w_ij = 1 for j = 1, . . . , n_i, i = 1, . . . , N, then U_R(β) reduces
to the generalized estimating equations introduced in Chapter 4, which can be used for
data without outliers.
7.1 Introduction
Suppose a population is divided into a number of groups. If any two subjects selected
at random from the same group are positively correlated then each group of subjects
form a cluster. Clustered data arise in varieties of practical data analytic situations,
such as, epidemiology, biostatistics and medical studies. A cluster may be a village in
an agricultural study, a hospital, a doctor’s practice, an animal giving birth to several
fetuses. For example, in a study concerned with an educational intervention program
on behavior change, the data are grouped into small classes (Lui et al., 2000).
In practice-based research, multiple patients are collected per clinician or per
practice. Each class or practice is a cluster. A group of genetically related members
from a familial pedigree is a cluster.
and

σ² = (1/M) Σ_{i=1}^N (n_i − 1) Σ_{j=1}^{n_i} (x_ij − µ)²,

where µ_i = (1/n_i) Σ_{j=1}^{n_i} x_ij. The intraclass correlation coefficient is defined as
ρ = υ₁₁/σ². Note that υ₁₁ is the covariance between any two observations within the same
cluster, the same for all clusters, and Var(X_ij) = Var(X_il) = σ².
Example 1. As an example, consider the data in Kendall and Stuart (1979, p. 322).
The heights in inches of three brothers in five families are 74, 70, 72; 70, 71, 72;
71, 72, 72; 68, 70, 70; 71, 72, 70.

For these data, M = 30, n₁ = n₂ = n₃ = n₄ = n₅ = 3, Σ_{j=1}^3 x_1j = 211, Σ_{j=1}^3 x_2j = 213,
Σ_{j=1}^3 x_3j = 215, Σ_{j=1}^3 x_4j = 208, Σ_{j=1}^3 x_5j = 216, µ₁ = 70.33, µ₂ = 71, µ₃ = 71.66,
µ₄ = 69.33, and µ₅ = 72. Using these we obtain

µ = (1/M) Σ_{i=1}^N (n_i − 1) Σ_{j=1}^{n_i} x_ij = 70.86,

Σ_{j=1}^3 (x_1j − µ)² = 5.4988,  Σ_{j=1}^3 (x_2j − µ)² = 2.0588,  Σ_{j=1}^3 (x_3j − µ)² = 2.6188,

Σ_{j=1}^3 (x_4j − µ)² = 9.6588,  Σ_{j=1}^3 (x_5j − µ)² = 5.8988.

Then,

σ² = (1/M) Σ_{i=1}^N (n_i − 1) Σ_{j=1}^{n_i} (x_ij − µ)² = (2 × 25.734)/30 = 1.7156

and

ρ = [Σ_{i=1}^N n_i²(µ_i − µ)² − Σ_{i=1}^N Σ_{j=1}^{n_i} (x_ij − µ)²] / [Σ_{i=1}^N (n_i − 1) Σ_{j=1}^{n_i} (x_ij − µ)²]
  = (41.229 − 25.734)/(2 × 25.734) = 0.301.
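These moment formulas are easy to code. The sketch below applies them to the heights exactly as transcribed above; note that this transcription does not reproduce the printed family totals to the last digit, so the printed intermediate values should be treated as authoritative, and the code is offered only as an illustration of the estimator:

```python
import numpy as np

# Brothers data as transcribed (rows = families); moment estimator with
# M = sum_i n_i (n_i - 1)
x = np.array([[74, 70, 72],
              [70, 71, 72],
              [71, 72, 72],
              [68, 70, 70],
              [71, 72, 70]], dtype=float)
n = np.full(5, 3)
M = np.sum(n * (n - 1))                    # M = 30

mu = np.sum((n - 1)[:, None] * x) / M      # weighted grand mean
mu_i = x.mean(axis=1)                      # family means mu_i
ssw = np.sum((x - mu) ** 2, axis=1)        # sum_j (x_ij - mu)^2 per family

sigma2 = np.sum((n - 1) * ssw) / M
rho = (np.sum(n**2 * (mu_i - mu) ** 2) - ssw.sum()) / np.sum((n - 1) * ssw)
```

With balanced clusters the weights (n_i − 1) are constant, so µ reduces to the ordinary grand mean.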
y_ij = µ + a_i + e_ij,   (7.1)

where µ is the grand mean of all the observations in the population, a_i is the random
effect of the i-th family, and e_ij is the error in observing y_ij. The random effects {a_i}
are identically distributed with mean 0 and variance σ_a², the residual errors e_ij are
identically distributed with mean 0 and variance σ_e², and the {a_i} and {e_ij} are completely
independent. The variance of y_ij is then

σ² = σ_a² + σ_e²,

and the intraclass correlation is ρ = σ_a²/(σ_a² + σ_e²). This implies that Cov(y_ij, y_il) = σ²ρ.
The quantities σ_a² and σ_e² are called the variance components. The above random effects
model can simply be written as

Y_i ∼ N(ν_i, Σ_i),

where N(·, ·) stands for a multivariate normal density, ν_i = (µ, . . . , µ)^T is a vector of
length n_i, and

Σ_i = σ²{(1 − ρ)I_i + ρJ_i}

is an n_i × n_i matrix, I_i denotes the n_i × n_i identity matrix, and J_i is an n_i × n_i matrix
containing only ones.
Let n₀ = (N* − Σ_{i=1}^N n_i²/N*)/(N − 1), where N* = Σ_{i=1}^N n_i is the total number of
observations and N is the number of families studied, SSB = Σ_{i=1}^N n_i(ȳ_i. − ȳ..)², and
SSW = Σ_{i=1}^N Σ_{j=1}^{n_i} (y_ij − ȳ_i.)². With MSB = SSB/(N − 1) and MSW = SSW/(N* − N),
following page 19 of Donner and Koval (1980), an unbiased estimate of σ_e² is MSW and
that of n₀σ_a² + σ_e² is MSB. Solving these two equations, an estimate of σ_a² is
σ̂_a² = (MSB − MSW)/n₀, and that of ρ is

ρ̂ = (MSB − MSW)/{MSB + (n₀ − 1)MSW}.
Now consider a hypothetical dataset of two judges each scoring 8 different wines from
0 to 9, y_ij, i = 1, 2, j = 1, 2, ..., 8, recorded as the pairs (4,2), (1,3), (3,8), (6,4), (6,5),
(7,5), (8,7), (9,7). We find the intraclass correlation of the judges' scores. For these
data we obtain MSB = 0.5625 and MSW = 5.776, hence

ρ̂ = (MSB − MSW)/{MSB + (n₀ − 1)MSW} = −0.127.
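These numbers can be reproduced directly, treating the two judges as clusters of size 8:

```python
import numpy as np

# Scores of two judges (clusters) on eight wines
scores = np.array([[4, 1, 3, 6, 6, 7, 8, 9],     # judge 1
                   [2, 3, 8, 4, 5, 5, 7, 7]],    # judge 2
                  dtype=float)
L, n = scores.shape            # L = 2 clusters, each of size n = 8
N_star = scores.size           # total observations
ybar_i = scores.mean(axis=1)   # cluster means
ybar = scores.mean()           # grand mean

SSB = n * np.sum((ybar_i - ybar) ** 2)
SSW = np.sum((scores - ybar_i[:, None]) ** 2)
MSB = SSB / (L - 1)                                 # between-cluster mean square
MSW = SSW / (N_star - L)                            # within-cluster mean square
n0 = (N_star - L * n**2 / N_star) / (L - 1)         # = n for equal cluster sizes

rho = (MSB - MSW) / (MSB + (n0 - 1) * MSW)          # ANOVA estimator of rho
```

The negative estimate simply reflects that the between-judge variability is smaller than the within-judge variability here.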
Y_i ∼ N(X_iβ, σ²Σ_i),   (7.4)

y_ij = X_ijβ + e_ij,    y_ij ∼ N(X_ijβ, σ²V_ij),   (7.5)
Then, following the outline above, the maximum likelihood estimates (or, more
accurately, a solution to the ML equations) of β_i and σ_i² given ρ_i are

β̂_i|ρ_i = (Σ_{j=1}^{m_i} X_ij^T W_ij X_ij)^{-1} Σ_{j=1}^{m_i} X_ij^T W_ij Y_ij   (7.7)

and

σ̂_i²|ρ_i = Σ_j (SS_ij − ρ_i n_ij SST_ij t_ij^{-1}) / {N_i(1 − ρ_i)},   (7.8)

and

SST_ij = (Y_ij − X_ijβ_i)^T J_ij (Y_ij − X_ijβ_i)/n_ij.
Further, equating −2∂l/∂ρ_i to zero, after some algebra the estimating equation for ρ_i is

N_i Σ_{j=1}^{m_i} [SS_ij − n_ij SST_ij t_ij^{-2}{1 + (n_ij − 1)ρ_i²}] / Σ_{j=1}^{m_i} (SS_ij − ρ_i n_ij SST_ij t_ij^{-1})
  − ρ_i Σ_{j=1}^{m_i} n_ij(n_ij − 1) t_ij^{-1} = 0,   (7.9)

where ρ_i ∈ ∩_j (−1/(n_ij − 1), 1). Note that this equation involves only ρ_i as an unknown,
while β_i in SS_ij and SST_ij involves the data and ρ_i. Thus, the maximum likelihood
estimate of ρ_i is obtained by solving equation (7.9) iteratively. Equation (7.9) is the
optimal estimating equation in the sense of Godambe (1960) and Bhapkar (1972).
That is, as a maximum likelihood equation, the estimating equation for ρ_i is unbiased
and fully efficient. Denote the estimate of ρ_i obtained by solving equation (7.9) by
ρ̂_i. Then, replacing ρ_i on the right-hand side of equations (7.7) and (7.8) by ρ̂_i, the
maximum likelihood estimates of β_i and σ_i² are obtained.
Maximum likelihood (ML) estimates of the parameters of models (7.2)–(7.5) can
be obtained as special cases of equations (7.7)–(7.9). So the ML estimates of β and
σ² of model (7.5) are

β̂ = (Σ_{i=1}^L Σ_{j=1}^{m_i} X_ij^T W_ij X_ij)^{-1} Σ_{i=1}^L Σ_{j=1}^{m_i} X_ij^T W_ij Y_ij   (7.10)

and

σ̂² = Σ_{i=1}^L Σ_{j=1}^{m_i} (SS_ij − ρ_i n_ij SST_ij t_ij^{-1}) / {Ñ(1 − ρ_i)},   (7.11)

and

σ̂² = Σ_i Σ_j (SS_ij − ρ n_ij SST_ij t_ij^{-1}) / {N(1 − ρ)}.   (7.14)
The estimating equation for ρ is

Ñ Σ_i Σ_j [SS_ij − n_ij SST_ij t_ij^{-2}{1 + (n_ij − 1)ρ²}] / Σ_i Σ_j (SS_ij − ρ n_ij SST_ij t_ij^{-1})
  − ρ Σ_i Σ_j n_ij(n_ij − 1) t_ij^{-1} = 0,   (7.15)
and

σ_i² = (SS_i − ρ n_i SST_i t_i^{-1}) / {n_i m_i(1 − ρ)},

where

SS_i = Σ_j Σ_k (y_ijk − ȳ_i..)²,  SST_i = Σ_j n_i(ȳ_ij. − ȳ_i..)²,  t_i = 1 + (n_i − 1)ρ.

Further, after simplification, the estimating equation for ρ can be expressed as

Σ_i m_i n_i(n_i − 1)(ρ − r_i) / [(1 − ρR_i){1 + (n_i − 1)ρ}] = 0,   (7.19)

where

r_i = (n_i SST_i − SS_i) / {(n_i − 1)SS_i}   (7.20)

is the sample intraclass correlation from the i-th population,
R_i = 1 − (n_i − 1)(1 − r_i), and ρ ∈ ∩_i (−1/(n_i − 1), 1).
7.2.4 Asymptotic Variance

It can easily be shown that for all the models the β parameters and the (σ², ρ) parameters
are orthogonal, i.e.,

−E(∂²l/∂β∂σ²) = −E(∂²l/∂β∂ρ) = 0.

The consequence of such orthogonality is that the estimates β̂ and (σ̂², ρ̂) are
asymptotically independent (Cox and Reid, 1987). The asymptotic variance of ρ̂ is
thus obtained from the inverse of the Fisher information matrix of (σ², ρ). For Model
(7.6), it can be shown that

−E{∂²l/∂(σ_i²)²} = N_i/(2σ_i⁴),  −E{∂²l/∂σ_i²∂ρ_i} = −A_i/{2σ_i²(1 − ρ_i)},
−E{∂²l/∂ρ_i²} = B_i/{2(1 − ρ_i)²},

where N_i = Σ_j n_ij,

A_i = Σ_j n_ij(n_ij − 1)ρ_i / {1 + (n_ij − 1)ρ_i}

and

B_i = Σ_j n_ij(n_ij − 1){1 + (n_ij − 1)ρ_i²} / {1 + (n_ij − 1)ρ_i}².

From the inverse of the Fisher information matrix of (σ_i², ρ_i) it can be seen that

var(ρ̂_i) = 2N_i(1 − ρ_i)² / (N_iB_i − A_i²).
Similarly,

var(ρ̂_i) = 2Ñ(1 − ρ_i)² / (ÑB_i − A_i²),

where Ñ = Σ_i N_i; for Model (7.4),

var(ρ̂) = 2Ñ(1 − ρ)² / (ÑB − A²),

where

A = Σ_i Σ_j n_ij(n_ij − 1)ρ / {1 + (n_ij − 1)ρ}

and

B = Σ_i Σ_j n_ij(n_ij − 1){1 + (n_ij − 1)ρ²} / {1 + (n_ij − 1)ρ}²;

for Model (7.2),

var(ρ̂) = 2N(1 − ρ)² / (ND − C²),

where N = Σ_i n_i, and

var(ρ̂) = 2(1 − ρ)² / [Σ_i m_i n_i(n_i − 1){1 + (n_i − 1)ρ}^{-2}].
The purpose of the example here is to illustrate that the estimating equation for
ρ produces a unique solution within the permitted range. The estimating equation was
solved using R, and the solution was obtained within 9 iterations.
We consider Model IV on page 550 of Paul (1990), where Y_i = (Y_i1, . . . , Y_in_i)^T,
ν = (µ, . . . , µ)^T, and Σ_i = (1 − ρ)I_i + ρJ_i, where I_i is the n_i × n_i identity matrix and
J_i is an n_i × n_i matrix of only ones.
Now, let t_i = 1 + (n_i − 1)ρ,

µ̂ = [Σ_i n_i ȳ_i/t_i] / [Σ_i n_i/t_i],

σ̂² = (SS − ρ Σ_i n_i SST_i/t_i) / {N(1 − ρ)},  N = Σ_i n_i,

where

SS = Σ_i Σ_k (y_ik − µ)²,  SST_i = n_i(ȳ_i − µ)²,

and the estimating equation for ρ is

N[SS − Σ_i n_i SST_i{1 + (n_i − 1)ρ²}/t_i²] / (SS − ρ Σ_i n_i SST_i/t_i)
  − ρ Σ_i n_i(n_i − 1)/{1 + (n_i − 1)ρ} = 0,

where ρ ∈ ∩_i (−1/(n_i − 1), 1) and

Var(ρ̂) = 2(1 − ρ̂)² / [Σ_i n_i(n_i − 1){1 + (n_i − 1)ρ̂}^{-2}].
The final solution is ρ̂ = 0.0266 with standard error 0.1596. The solution took 6
iterations.
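An equivalent route, sketched below on simulated data (the cluster counts, sizes, and true parameter values are hypothetical), is to maximize the Model IV normal log-likelihood numerically, using the closed-form determinant and inverse of the compound-symmetry matrix Σ_i; at the optimum this solves the same ML equations:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, n = 200, 4                     # 200 clusters of common size 4 (hypothetical)
rho_true, s2_true, mu_true = 0.3, 1.0, 5.0
b = rng.normal(0, np.sqrt(rho_true * s2_true), m)        # cluster effects
y = mu_true + b[:, None] + rng.normal(0, np.sqrt((1 - rho_true) * s2_true), (m, n))

def negloglik(theta):
    mu, log_s2, z = theta
    s2 = np.exp(log_s2)
    lo = -1.0 / (n - 1)
    rho = lo + (1 - lo) / (1 + np.exp(-z))    # map z to (-1/(n-1), 1)
    t = 1 + (n - 1) * rho
    r = y - mu
    rs = r.sum(axis=1)
    # |Sigma_i| = s2^n (1-rho)^(n-1) t;  Sigma_i^{-1} = (I - (rho/t) J)/(s2 (1-rho))
    quad = (np.sum(r**2) - rho / t * np.sum(rs**2)) / (s2 * (1 - rho))
    logdet = m * (n * np.log(s2) + (n - 1) * np.log(1 - rho) + np.log(t))
    return 0.5 * (logdet + quad)

res = minimize(negloglik, x0=[y.mean(), 0.0, 0.0], method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
lo = -1.0 / (n - 1)
rho_hat = lo + (1 - lo) / (1 + np.exp(-res.x[2]))
```

The logistic-type reparameterization keeps ρ inside its permitted range, mirroring the constraint ρ ∈ ∩_i(−1/(n_i − 1), 1) above.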
7.2.8 Estimation

A number of methods for the estimation of π and φ are available. Here we give two
of the most popular, namely the method of moments and maximum likelihood. For a
comprehensive description of the available methods, see Paul and Islam (1998) and
Paul (1982).

We first give the estimates by the method of moments. This method was first
given by Kleinman (1973), who equated the weighted average and weighted variance of
the sample proportions to their respective expected values to find method of moments
estimates of the parameters π and φ.
Define z_i = y_i/n_i and w_i = n_i/[π(1 − π){1 + (n_i − 1)φ}] for i = 1, 2, . . . , m, and
π̂ = Σ_{i=1}^m w_i z_i / Σ_{i=1}^m w_i. It can be seen that, given φ, E(π̂) = π. Further, define
S = Σ_{i=1}^m w_i(z_i − π)². Then it can be shown that

E(S) = Σ_{i=1}^m w_iπ(1 − π)/n_i + Σ_{i=1}^m w_iπ(1 − π)(n_i − 1)φ/n_i.

Then, the method of moments estimates of π and φ are obtained by solving the
estimating equations

Σ_{i=1}^m w_i(z_i − π) = 0

and

Σ_{i=1}^m w_i(z_i − π)² − Σ_{i=1}^m w_iπ(1 − π)/n_i − Σ_{i=1}^m w_iπ(1 − π)(n_i − 1)φ/n_i = 0

simultaneously.
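A practical observation (not stated in the text, but easy to verify from the definitions above): with these weights, π(1 − π) cancels in π̂, and the right-hand side of the second equation simplifies to m, so the two moment equations collapse to a one-dimensional root-finding problem in φ. A sketch on the medium-dose group of Table 7.2:

```python
import numpy as np
from scipy.optimize import brentq

# Medium-dose group from Table 7.2 (Paul, 1982): y_i affected of n_i live fetuses
y = np.array([2, 3, 2, 1, 2, 3, 0, 4, 0, 0, 4, 0, 0, 6, 6, 5], dtype=float)
n = np.array([4, 4, 9, 8, 9, 7, 8, 9, 6, 4, 6, 7, 3, 13, 6, 8], dtype=float)
z = y / n
m = len(y)

def pi_hat(phi):
    w = n / (1 + (n - 1) * phi)      # pi(1-pi) factor cancels in the ratio
    return np.sum(w * z) / np.sum(w)

def h(phi):
    p = pi_hat(phi)
    t = 1 + (n - 1) * phi
    S = np.sum(n * (z - p) ** 2 / (p * (1 - p) * t))
    return S - m                      # moment equation: S matches its expectation m

phi_mm = brentq(h, 1e-8, 0.999)       # root in (0, 1); data are overdispersed
pi_mm = pi_hat(phi_mm)
```

At φ = 0, h is a Pearson-type overdispersion statistic minus its binomial expectation, so a positive root exists exactly when the data are more variable than binomial.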
As is well known, the maximum likelihood estimates are obtained by maximizing
the likelihood function with respect to the parameters of interest. Using model (7.22),
the log-likelihood, apart from a constant, can be written as

l = Σ_{i=1}^m [ Σ_{r=0}^{y_i−1} log{(1 − φ)π + rφ} + Σ_{r=0}^{n_i−y_i−1} log{(1 − φ)(1 − π) + rφ}
      − Σ_{r=0}^{n_i−1} log{(1 − φ) + rφ} ],   (7.23)

and the maximum likelihood estimates of the parameters π and φ can be obtained by
solving the estimating equations

Σ_{i=1}^m [ Σ_{r=0}^{y_i−1} (1 − φ)/{(1 − φ)π + rφ} − Σ_{r=0}^{n_i−y_i−1} (1 − φ)/{(1 − φ)(1 − π) + rφ} ] = 0

and

Σ_{i=1}^m [ Σ_{r=0}^{y_i−1} (r − π)/{(1 − φ)π + rφ} + Σ_{r=0}^{n_i−y_i−1} (r + π − 1)/{(1 − φ)(1 − π) + rφ}
      − Σ_{r=0}^{n_i−1} (r − 1)/{(1 − φ) + rφ} ] = 0

simultaneously.
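Rather than solving the two score equations directly, the log-likelihood (7.23) can be maximized numerically; below is a sketch on the low-dose group of Table 7.2, restricting φ to (0, 1) for simplicity (the extended beta-binomial also allows small negative φ):

```python
import numpy as np
from scipy.optimize import minimize

# Low-dose group from Table 7.2 (Paul, 1982)
y = np.array([0, 1, 1, 0, 2, 0, 1, 0, 1, 0, 0, 3, 0, 0, 1, 5])
n = np.array([5, 11, 7, 9, 12, 8, 6, 7, 6, 4, 6, 9, 6, 7, 5, 9])

def loglik(theta):
    """Beta-binomial log-likelihood (7.23), up to an additive constant."""
    pi, phi = theta
    l = 0.0
    for yi, ni in zip(y, n):
        l += sum(np.log((1 - phi) * pi + r * phi) for r in range(yi))
        l += sum(np.log((1 - phi) * (1 - pi) + r * phi) for r in range(ni - yi))
        l -= sum(np.log((1 - phi) + r * phi) for r in range(ni))
    return l

res = minimize(lambda th: -loglik(th), x0=[0.15, 0.05],
               bounds=[(1e-4, 1 - 1e-4), (1e-4, 1 - 1e-4)], method="L-BFGS-B")
pi_ml, phi_ml = res.x
```

Each sum over r runs from 0 to the upper limit minus one, exactly matching the ranges in (7.23).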
7.2.9 Inference

After the parameters are estimated, attention generally shifts to hypothesis testing
and confidence interval construction. Recall that the beta-binomial and the extended
beta-binomial models are extensions of the simpler binomial model. It is then natural
to test whether the binomial model is good enough to model toxicological data in the
form of proportions. To this end we test the null hypothesis H₀: φ = 0 against the
alternative hypothesis H₁: φ ≠ 0.

The pioneering work in this is by Tarone (1979), who developed C(α) (Neyman,
1959) tests for testing the goodness of fit of the binomial model against the BB, CB, and
MB models. These tests are closely related to the binomial variance test. Paul (1982)
summarized these results as follows.
Let p̂ = y/n, where y = Σ_{i=1}^m y_i and n = Σ_{i=1}^m n_i, and q̂ = 1 − p̂. Further, let
A = Σ_{i=1}^m n_i(n_i − 1), B = Σ_{i=1}^m n_i(n_i − 1)², R = Σ_{i=1}^m y_i(n_i − y_i), D = (q̂ − p̂)A,
E = p̂q̂{B + p̂q̂(2A − 4B)}, F = n/(p̂q̂), and s = Σ_{i=1}^m (y_i − n_ip̂)²/(p̂q̂). Then the optimal
test statistic for testing the goodness of fit of the binomial distribution

(i) against the BB model is Z = (s − n)/(2A)^{1/2}, which under the null hypothesis that
the binomial model fits has an asymptotic standard normal distribution;

(ii) against the CB model is X_c² = Z², which under the null hypothesis has an
asymptotic χ² distribution with one degree of freedom; and

(iii) against the MB model is X_M² = (R − Ap̂q̂)²(E − D²/F)^{-1}, which under the null
hypothesis has an asymptotic χ² distribution with one degree
of freedom.
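The statistic in (i) is straightforward to compute; a sketch on the control group of Table 7.2:

```python
import numpy as np

# Control group of Table 7.2 (Paul, 1982): y_i affected of n_i live fetuses
y = np.array([1, 1, 4, 0, 0, 0, 0, 0, 1, 0, 2, 0, 5, 2, 1, 2, 0, 0, 1], dtype=float)
nn = np.array([12, 7, 6, 6, 7, 8, 10, 7, 8, 6, 11, 7, 8, 9, 2, 7, 9, 7, 11], dtype=float)

p = y.sum() / nn.sum()                 # pooled proportion p-hat
q = 1 - p
A = np.sum(nn * (nn - 1))
s = np.sum((y - nn * p) ** 2) / (p * q)  # binomial variance-test statistic
Z = (s - nn.sum()) / np.sqrt(2 * A)      # compare with N(0, 1) under H0: phi = 0
```

A large positive Z indicates more litter-to-litter variability than the binomial model allows.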
After an extensive simulation study to compare these three tests, Paul (1982)
found that the BB model is, in general, more sensitive to the departure from the
binomial model, and therefore, is a superior model for the analysis of the data in
Table 7.2 given in Section 7.3. The justification for superiority of one model over the
others is that a model which is more sensitive to the departure from the binomial will
characterize the data more accurately than others which are less sensitive.
Table 7.2 Data from a toxicological experiment (Paul, 1982). (i) Number of live fetuses affected
by treatment. (ii) Total number of live fetuses.

Control (C)  (i)  1 1 4 0 0 0 0 0 1 0 2 0 5 2 1 2 0 0 1
             (ii) 12 7 6 6 7 8 10 7 8 6 11 7 8 9 2 7 9 7 11
Low (L)      (i)  0 1 1 0 2 0 1 0 1 0 0 3 0 0 1 5
             (ii) 5 11 7 9 12 8 6 7 6 4 6 9 6 7 5 9
Medium (M)   (i)  2 3 2 1 2 3 0 4 0 0 4 0 0 6 6 5
             (ii) 4 4 9 8 9 7 8 9 6 4 6 7 3 13 6 8
High (H)     (i)  1 0 1 0 1 0 1 1 2 0 4 1 1 4 2
             (ii) 9 10 7 5 4 6 3 8 5 4 4 5 3 8 6
Y_ij = X_ijβ + Z_ijb_j + e_ij,   (7.24)

where Y_ij is the response on the i-th unit at level 1 within the j-th unit (cluster) at
level 2, X_ij is a 1 × p (row) vector of covariates, β is a p × 1 vector of fixed-effects
regression parameters, Z_ij is a design matrix for the random effects at level 2, b_j is
the random effect of the j-th unit (cluster) at level 2, and e_ij is the error. The
random effects, b_j, vary across level 2 units but, for a given level 2 unit, are the same
for all level 1 units. For example, Y_ij might be the outcome for the i-th patient in the
j-th clinic, where the clinics are a random sample of clinics from around the area.
The random effects are assumed to be independent across level 2 units, with mean zero
and covariance Cov(b_j) = G. The level 1 random components, e_ij, are assumed to
be independent across level 1 units, with mean zero and variance Var(e_ij) = σ². In
addition, the e_ij's are assumed to be independent of the b_j's, with Cov(e_ij, b_j) = 0.
That is, level 1 units are assumed to be conditionally independent given the level 2
random effects (and the covariates).
The regression parameters, β, are the fixed effects and describe the effects of
covariates on the mean response

E(Y_ij) = X_ijβ,   (7.25)

where the mean response is averaged over both level 1 and level 2 units. The two-level
model given by (7.24) also describes the effects of covariates on the conditional
mean response given the random effect b_j as

E(Y_ij|b_j) = X_ijβ + Z_ijb_j,   (7.26)

where the response is averaged over level 1 units only. The maximum likelihood
estimates of the parameters β, σ², and G for longitudinal data with no missing
responses are obtained in Chapter 9. Following these results, we obtain the maximum
likelihood estimates of the parameters β, σ², and G of model (7.24) as

β̂ = (Σ_{i=1}^N X_i^T Σ_i^{-1} X_i)^{-1} Σ_{i=1}^N X_i^T Σ_i^{-1} y_i   (7.27)

and

σ̂² = Σ_{i=1}^N e_i^T e_i / Σ_{i=1}^N n_i,  Ĝ = Σ_{i=1}^N b_i b_i^T / N,   (7.28)

where e_i = y_i − X_iβ − Z_ib_i (see Chapter 9). However, suppose the b_j's are assumed to be
independent, Var(b_j) = σ₂², and there is no covariate at the level 2 units. Since the data
are clustered, observations within the same cluster are correlated but are independent
across clusters. Then, letting Var(e_ij) = σ₁², the degree to which the observations
within the same cluster are correlated can be measured by the intra-cluster correlation

ρ = σ₂²/(σ₁² + σ₂²).   (7.29)
In Table 17.2, Fitzmaurice et al. (2004, p. 452) give the results of fitting the model
to the fetal weight data. The REML (restricted maximum likelihood) estimate of
the regression parameter for (transformed) dose indicates that mean fetal weight
decreases with increasing dose. The estimated decrease in weight, comparing the
highest dose group to the control group, is 0.27 (or 2 × −0.134, with the 95%
confidence interval: −0.316 to −0.220). Note that both model-based and empirical (or
sandwich) standard errors were calculated, and they were very similar, suggesting that
the simple random effect structure for the clustering of fetal weights is adequate. The
estimate of the intra-cluster correlation, ρ̂ = 0.57, indicates that there are moderate
litter effects.

Fitzmaurice et al. (2004) provided further analysis of the data to assess the
adequacy of the linear dose-response trend. They considered a model that included
a quadratic effect of (transformed) dose. Both Wald and likelihood ratio tests of the
quadratic effect of dose indicated that the linear trend is adequate for these data (Wald
W² = 1.38, with 1 df, p-value > 0.20; likelihood ratio G² = 1.37, with 1 df, p-value
> 0.20).
ηi j = Xi j β + Zi j b j , (7.32)
with
g{E(Yi j |b j )} = ηi j = Xi j β + Zi j b j , (7.33)
for some known link function, g(·).
(3) Finally, the random effects are assumed to have some probability distribution.
In principle, any multivariate distribution can be assumed for the b_j; in practice,
for computational convenience, the random effects are usually assumed to have a
multivariate normal distribution, with zero mean and covariance matrix G. It is also
common, for convenience, to assume that the b_j are independent, with Var(b_j) = σ_j²
and Cov(b_j, b_k) = 0 for j ≠ k.
These three components completely specify a broad class of two level generalized
linear models for different types of responses.
The two-level generalized linear model is generally referred to as generalized
linear mixed models (GLMMs, Jiang, 2007). Next, to clarify the main ideas, we
consider two examples of two level generalized linear models.
Example 1: Two-level Generalized Linear Model for counts
Consider a study comparing cross-national rates of skin cancer and the factors
(e.g., climate, economic and social factors, regional differences in diagnostic proce-
dures) that influence variability in the rates of disease. Suppose we have counts of the
number of cases of skin cancer in a set of well-defined regions, indexed by i, within
counties, indexed by j. Let Yi j be a count of the number of individuals who develop
skin cancer within the ith region of the jth county during a given period of time (say,
5 years). The resulting counts have a two level structure with regional units at the
lower level (level 1 units). Usually, the analysis of count data requires knowledge of
the population at risk. That is, the rate at which the disease occurs is of more direct
interest than the corresponding count.
Counts are often modeled as Poisson random variables using a log link function.
This motivates the following illustration of a two level generalized linear model for
Yi j given by the three part specification:
(1) Conditional on a vector of random effects b j , the Yi j are assumed to be inde-
pendent observations from a Poisson distribution, with Var(Yi j |b j ) = E(Yi j |b j ), (i.e.,
φ = 1).
(2) The conditional mean of Y_ij depends upon fixed and random effects via the
following log link function,

log{E(Y_ij|b_j)} = log(T_ij) + X_ijβ + Z_ijb_j,   (7.34)

where T_ij is the population at risk in the i-th region of the j-th county and log(T_ij)
is an offset.
(3) The random effects are assumed to have a multivariate normal distribution,
with zero mean and covariance matrix G.
This is an example of a two-level log-linear model that assumes a linear relationship
between the log rate of disease occurrence and the covariates.
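The three-part specification above can be turned into a small generative sketch (simulation only, no model fitting); all counts, rates, and dimensions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
# hypothetical structure: 10 counties (level 2), 20 regions each (level 1)
J, I = 10, 20
G = 0.04                                    # variance of county random effects
b = rng.normal(0.0, np.sqrt(G), J)          # b_j ~ N(0, G)
T = rng.integers(10_000, 100_000, (I, J))   # population at risk T_ij
x = rng.normal(size=(I, J))                 # a region-level covariate
beta0, beta1 = np.log(5e-4), 0.1            # baseline log rate and covariate effect

# log E(Y_ij | b_j) = log(T_ij) + X_ij beta + b_j, as in (7.34)
log_mu = np.log(T) + beta0 + beta1 * x + b[None, :]
Y = rng.poisson(np.exp(log_mu))             # Y_ij | b_j ~ Poisson, phi = 1
crude_rate = Y.sum() / T.sum()
```

Dividing out T_ij via the offset is what makes the model a model for rates rather than raw counts.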
Example 2: Two-level Generalized Linear Model for Binary Responses
Consider a study of men with newly diagnosed prostate cancer. The study is designed
to evaluate the factors that determine physician recommendations for surgery
(radical prostatectomy) versus radiation therapy. In particular, it is of interest to
determine the relative importance of patient factors (e.g., patient's age, level of
prostate-specific antigen) and physician factors (e.g., specialty training, years of
experience) on physician recommendations for treatment. Many patients in the study seek
the recommendation of the same physician. As a result, patients (level 1 units) are nested
within physicians (level 2 units). For each patient, we have a binary outcome
denoting the physician's recommendation (surgery versus radiation therapy).

Let Y_ij be the binary response, taking values 0 and 1 (e.g., denoting surgery or
radiation therapy) for the i-th patient of the j-th physician. An illustrative example of
a two-level logistic model for Y_ij is given by the following three-part specification:

(1) Conditional on a single random effect b_j, the Y_ij are independent and have a
Bernoulli distribution, with Var(Y_ij|b_j) = E(Y_ij|b_j){1 − E(Y_ij|b_j)} (i.e., φ = 1).
(2) The conditional mean of Y_ij depends upon fixed and random effects via the
following linear predictor:

η_ij = X_ijβ + b_j,   (7.35)

and

log{Pr(Y_ij = 1|b_j)/Pr(Y_ij = 0|b_j)} = η_ij = X_ijβ + b_j.   (7.36)

That is, the conditional mean of Y_ij is related to the linear predictor by a logit link
function.

(3) The single random effect b_j is assumed to have a univariate normal distribution,
with zero mean and variance σ₁².
In this example, the model is a simple two-level logistic regression model with
randomly varying intercepts.
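As with the count example, the specification can be sketched generatively; the numbers of physicians and patients, the fixed effects, and σ₁ below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(11)
J = 30                                  # physicians (level 2)
n_j = rng.integers(5, 15, J)            # patients per physician
sigma1 = 0.8
b = rng.normal(0, sigma1, J)            # random intercepts b_j ~ N(0, sigma1^2)
beta0, beta_age = -0.5, 0.04            # hypothetical fixed effects

groups = []
for j in range(J):
    age = rng.uniform(50, 80, n_j[j])
    eta = beta0 + beta_age * (age - 65) + b[j]   # eta_ij = X_ij beta + b_j, (7.35)
    p = 1 / (1 + np.exp(-eta))                    # inverse logit link, (7.36)
    groups.append(rng.binomial(1, p))             # surgery (1) vs radiation (0)

y = np.concatenate(groups)
```

Patients sharing a physician share b_j, which is exactly what induces the within-cluster correlation the model is designed to capture.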
where x_ik^T is the k-th row of the design matrix X_i. This function is equivalent to that
given by (6.3).
One feature of clustered data analysis is that the time order of the observations is often unimportant or not recorded, as is often the case in developmental studies. This implies that an exchangeable correlation structure may be appropriate. It also motivates us to avoid modeling the correlation structure altogether by subsampling one observation from each cluster and then applying the classical rank method to the resulting N independent observations, (y_(i)), 1 ≤ i ≤ N, say. If the corresponding residuals are (e_(i)), the dispersion function is

M^{-2} Σ_{i=1}^{N} Σ_{j≠i} |e_(i) − e_(j)|.
To make use of all the observations, we would repeat this resampling process
many times. Conditional on sampling one observation from each cluster, the proba-
bility of y_ij being sampled is 1/n_i. Therefore, the limiting dispersion function is

LDS(β) = M^{-2} Σ_{i=1}^{N} Σ_{j≠i} n_i^{-1} n_j^{-1} Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} |e_ik − e_jl|.
This within-cluster resampling was first considered by Hoffman et al. (2001) for
eliminating the bias when the cluster sizes are informative. This was further inves-
tigated by Williamson et al. (2003) in the context of the GEE setup. In the special
case of comparing two groups, the LDS becomes the testing statistic (S ) proposed by
Datta and Satten (2005).
This motivates us to consider the following weighted loss function for estimation of β:

L_W(β) = M^{-2} Σ_{i=1}^{N} Σ_{j≠i} w_i w_j Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} |e_ik(β) − e_jl(β)|,    (7.37)
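The loss function (7.37) can be sketched numerically on toy clustered residuals (a hypothetical example; we also assume M denotes the total number of observations). With w_i = 1/n_i it reduces to the limiting dispersion function LDS(β) above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical residuals for N = 4 clusters of unequal sizes n_i.
residuals = [rng.normal(size=n) for n in (3, 5, 2, 4)]
# Assumption: M denotes the total number of observations.
M = sum(len(e) for e in residuals)

def weighted_loss(residuals, weights, M):
    """L_W(beta) of (7.37): M^{-2} sum_i sum_{j != i} w_i w_j sum_{k,l} |e_ik - e_jl|."""
    total = 0.0
    for i, ei in enumerate(residuals):
        for j, ej in enumerate(residuals):
            if i != j:
                total += weights[i] * weights[j] * np.abs(ei[:, None] - ej[None, :]).sum()
    return total / M**2

# With w_i = 1/n_i (the method W_0 below), L_W reduces to LDS(beta).
w0 = [1.0 / len(e) for e in residuals]
lds = weighted_loss(residuals, w0, M)
```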
Y_ij = β_0 + β_1 T12 + β_2 T20 + β_3 T24 + β_4 G_i + ε_ij.

The method using w_i = 1/n_i is referred to as W_0. Using Jung and Ying's method, the estimates for β_1, β_2, β_3, and β_4 are 2.774, 12.647, 7.922, and 10.475, respectively. The hypotheses of no treatment or time effect can be formulated as H_0: β_i = 0, i = 1, ..., 4, respectively.
The average correlation ρ̄ is estimated as 0.49, which is quite large. We, therefore,
applied the weighted rank method with weights {1 + ρ̄(ni − 1)}−1 (W3 ) and obtained
β̂ = (3.143, 13.314, 7.751, 10.393).
Under H_0, Q = max_{β_i=0} M U_W(β) V̂_W^{-1} U_W(β)^T has an asymptotic chi-square distribution with one degree of freedom (Jung and Ying, 2003). The corresponding Q
values in weighted rank regression are obtained as 0.816, 13.864, 2.106, and 3.970
for βi , i = 1, ..., 4, respectively. For comparison, the Q values from Jung and Ying’s
method are 0.653, 16.100, 2.425, and 3.833 correspondingly. The 95% confidence in-
tervals for β1 , β2 , β3 , and β4 are (−3.176, 9.461), (6.251, 20.376), (−2.532, 18.034),
and (−0.057, 20.842). One-sided tests are also available based on the variance estimates of β̂_W; these may suggest a significant drug effect on increasing the serum cholesterol level.
Y_ij = β_0 + β_1 L + β_2 H + β_3 G + β_4 S + ε_ij.
The estimates from the independence rank regression are β̂_JY = (−0.448, −0.936, −0.236, −0.123)^T, from which we also obtain an estimate of ρ̄ as 0.456. The weighted rank regression using w_i = {1 + ρ̄(n_i − 1)}^{-1} produces the estimate β̂_W = (−0.460, −0.886, −0.242, −0.126)^T. Except for gender, these two methods give estimates similar to those from the random intercept model of Dempster et al. (1984), who obtained (−0.429, −0.859, −0.359, −0.129)^T. The standard deviations of the individual parameter estimates in β̂_W are (0.196, 0.241, 0.083, 0.025)^T, which are generally higher than those from the random effects model, (0.150, 0.182, 0.047, 0.019)^T, based on the normality assumption. The absence of any violation of normality in this data set may be a reason for this. The hypothesis tests based on β̂_W conclude that all the variables in the model have statistically significant effects on the pup weights.
To investigate the impact of informative cluster size on the regression models, we excluded litter size from the set of covariates. The estimates using weight 1/n_i are (−0.445, −0.441, −0.148)^T. It is interesting to note that similar estimates are obtained here except for the high dose group. Similar conclusions can also be drawn from the random effects model, where the estimates are (−0.375, −0.355, −0.361)^T. The hypothesis tests show that neither the low dose nor the high dose is significant in the random effects model. However, the standard normal test value of −2.432 for the effect of the low dose in the weighted rank regression with weights {1 + ρ̄(n_i − 1)}^{-1} indicates that the low dose significantly affects pup weights.

The resistance of the weighted rank regression to informative cluster size needs more research to explore. However, our weighted method including Size as a covariate should produce more efficient estimators.
Chapter 8

Missing Data Analysis

8.1 Introduction
Missing data or missing values occur when no information is available on the response, on some of the covariates, or on both, for some of the subjects who participate in a study of interest. There can be a variety of reasons for the occurrence of missing values. Nonresponse occurs when a respondent does not respond to certain questions due to stress, fatigue or lack of knowledge. Some individuals in the study may not respond because some questions are sensitive. Missing data can create difficulty in analysis because nearly all standard statistical methods presume complete information for all the variables included in the analysis. Even a small number of missing observations can dramatically affect a statistical analysis, resulting in biased and inefficient parameter estimates and confidence intervals that are too wide or too narrow.
Further, let X_i, Z_i, and W_i be design matrices for the fixed effects, the random effects (if any), and the missing-data process, respectively, and let θ and ψ be the vectors that parameterize the joint distribution of y_i and r_i = (r_i1, r_i2, ..., r_in_i)^T. Here θ = (β^T, α^T)^T parameterizes the measurement process and ψ the missingness process, where β is the fixed effects parameter vector and α represents the variance components and/or association parameters. The full data (y_i, r_i) consist of the complete data and the missing data indicators.
Now, when data are incomplete due to a stochastic mechanism, the full data density is

f(y_i, r_i | X_i, Z_i, W_i, θ, ψ),

which can be factorized as

f(y_i, r_i | X_i, Z_i, W_i, θ, ψ) = f(y_i | X_i, Z_i, θ) f(r_i | y_i, W_i, ψ).

Under MCAR, the missingness process does not depend on y_i, so that

f(y_i, r_i | X_i, Z_i, W_i, θ, ψ) ∝ f(y_i | X_i, Z_i, θ).

So, the data analysis can be based on only the complete cases.
Under MAR, the probability of an observation being missing is conditionally
independent of the unobserved data given the values of the observed data, which
implies that
f (ri |yi , Wi , ψ) = f (ri |yoi , Wi , ψ).
So the model for data analysis is based on the observed-data density

f(y_i^o | X_i, Z_i, θ) = ∫ f(y_i^o, y_i^m | X_i, Z_i, θ) dy_i^m.

However, in practice, the above integral is often intractable. So, some Monte Carlo method needs to be devised to replace the integral by a summation.
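The idea of replacing the integral by a summation can be illustrated with a toy bivariate normal example (hypothetical, for illustration only): the observed-data density f(y^o) = ∫ f(y^o | y^m) f(y^m) dy^m is approximated by averaging the conditional density over draws of the missing component:

```python
import numpy as np

def npdf(x, mean, var):
    """Normal density, written out to keep the sketch self-contained."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(1)
rho = 0.6        # correlation of a standard bivariate normal (y_obs, y_mis)
y_obs = 0.5

# Exact observed-data density: the marginal of a standard bivariate normal is N(0, 1).
exact = npdf(y_obs, 0.0, 1.0)

# Monte Carlo version of the integral f(y_o) = \int f(y_o | y_m) f(y_m) dy_m:
# draw y_m from its marginal and average the conditional density f(y_o | y_m).
y_mis = rng.normal(size=200_000)
mc = npdf(y_obs, rho * y_mis, 1.0 - rho**2).mean()
```

The Monte Carlo average converges to the exact marginal density; with 200,000 draws the two agree to two or three decimal places.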
A complete data pattern refers to the case with no missing values, as shown in Table 8.1 and in panel (a) of Figure 8.1. A univariate (response) missing pattern refers to the situation where missing values occur only at the last visit, as shown in Table 8.2 and panel (b) of Figure 8.1. This is a special case of the dropout pattern.

Table 8.3 and panel (c) of Figure 8.1 show a uniform missing pattern, in which missing values occur in a joint fashion. That is, the measurements at the last two visits are either both observed or both missing. This is a kind of dropout mechanism in which the dropout time is uniform across all subjects.
Table 8.4 and panel (d) of Figure 8.1 display a monotonic missing pattern, where
if one observation is missing, then all the observations after it will be unobserved.
This is a general and important kind of dropout mechanism that allows subjects to
have different dropout times. As a matter of fact, all the above cases (b)-(d) are mono-
tonic missing patterns.
An arbitrary missing pattern refers to the case in which missing values may occur in any fashion: an arbitrary combination of intermittent missing values and dropouts. Table 8.5 and panel (e) of Figure 8.1 demonstrate a possible scenario with a mixture of intermittent missingness (for Subject 5, at the second visit) and some dropouts (for both Subject 1 and Subject 2).
Maximization step (M step): Find the parameter that maximizes this quantity.

The MLE of π is obtained by cycling back and forth between (8.8) and (8.9). It can be seen that, starting with an initial value of π^(0) = 0.5, the algorithm converges in eight steps, as shown in Table 8.6. By substituting x_2^(p) from equation (8.8) into equation (8.9), and letting π^(∗) = π^(p) = π^(p+1), we can explicitly solve a quadratic equation for the maximum-likelihood estimate of π:

π^(∗) = (25 + √56929)/414 ≈ 0.6367101.    (8.10)
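A one-line arithmetic check of the closed form in (8.10):

```python
from math import sqrt

# Closed-form root from (8.10): pi* = (25 + sqrt(56929)) / 414.
pi_star = (25 + sqrt(56929)) / 414
```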
and (y_1, (x_2, y_2), ..., (x_n, y_n)) is the incomplete data (note that the summation over the likelihood (8.11) is equivalent to Σ_{x_1=0}^{∞} e^{−τ_1}(τ_1)^{x_1}/x_1!, which is 1). So we need to maximize the likelihood (8.13). Differentiation leads to the ML equations

β̂ = Σ_{i=1}^{n} y_i / Σ_{i=1}^{n} τ̂_i,   y_1 = τ̂_1 β̂,   x_j + y_j = τ̂_j(β̂ + 1), j = 2, ..., n.    (8.14)
where in the last equality we have grouped together terms involving β and τ_i, and terms that do not involve these parameters. Since we are calculating this expected log-likelihood for the purpose of maximizing it in β and τ_i, we can ignore the terms in the second set of parentheses, because

Σ_{i=1}^{n} log y_i! + Σ_{i=2}^{n} log x_i! + Σ_{x_1=0}^{∞} [e^{−τ_1^(r)} (τ_1^(r))^{x_1} / x_1!] log x_1!

is a constant. Thus, we have to maximize only the terms in the first set of parentheses. Now,

Σ_{x_1=0}^{∞} [e^{−τ_1^(r)} (τ_1^(r))^{x_1} / x_1!] (−τ_1 + x_1 log τ_1) = −τ_1 + (log τ_1) Σ_{x_1=0}^{∞} x_1 e^{−τ_1^(r)} (τ_1^(r))^{x_1} / x_1! = −τ_1 + τ_1^(r) log τ_1.
Substituting this back into (8.15), apart from a constant, the expected complete-data log-likelihood is

Σ_{i=1}^{n} [−βτ_i + y_i(log β + log τ_i)] + Σ_{i=1}^{n} (−τ_i + x_i log τ_i).

This is the same as the original complete data log-likelihood, with x_1 replaced by τ_1^(r). Thus, in the rth step, the MLEs are only a minor variation of (8.14) and are given by

β̂^(r+1) = Σ_{i=1}^{n} y_i / (τ_1^(r) + Σ_{i=2}^{n} x_i),   τ̂_1^(r+1) = (τ_1^(r) + y_1)/(β̂^(r+1) + 1),   τ̂_j^(r+1) = (x_j + y_j)/(β̂^(r+1) + 1), j = 2, ..., n.    (8.16)

This defines both the E-step (which results in the substitution of τ̂_1^(r) for x_1) and the M-step (which results in the calculation in (8.16) of the MLEs at the rth iteration). Note that the final solution is obtained by substituting τ̂_1^(r) in the first two equations in (8.16) by some initial value and then cycling back and forth.
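The cycle in (8.16) takes only a few lines of code. The sketch below (with hypothetical simulated counts) iterates the updates and confirms that the fixed point agrees with the direct incomplete-data solution (8.17):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical complete data: x_i ~ Poisson(tau_i), y_i ~ Poisson(beta * tau_i),
# with x_1 treated as missing.
n, beta_true = 10, 1.5
tau = rng.uniform(1.0, 5.0, size=n)
x = rng.poisson(tau)
y = rng.poisson(beta_true * tau)

# EM iteration (8.16): the E-step replaces the missing x_1 by tau_1^{(r)}.
tau1 = 1.0                      # initial value for tau_1
for _ in range(200):
    beta_hat = y.sum() / (tau1 + x[1:].sum())
    tau1 = (tau1 + y[0]) / (beta_hat + 1.0)

# Direct incomplete-data solution (8.17) for comparison.
beta_direct = y[1:].sum() / x[1:].sum()
tau1_direct = y[0] / beta_direct
```

At convergence, τ_1 β̂ = y_1 and β̂ = Σ_{i≥2} y_i / Σ_{i≥2} x_i, exactly as in (8.17).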
Exercise 1. Refer to Example 2 above. Show that

(a) the maximum likelihood estimators from the complete data likelihood (8.11) are given by

β̂ = Σ_{i=1}^{n} y_i / Σ_{i=1}^{n} x_i   and   τ̂_j = (x_j + y_j)/(β̂ + 1), j = 1, 2, ..., n;

and

(b) a direct solution of the original (incomplete-data) likelihood equations is possible. Show that the solution to (8.14) is given by

β̂ = Σ_{i=2}^{n} y_i / Σ_{i=2}^{n} x_i,   τ̂_1 = y_1/β̂,   τ̂_j = (x_j + y_j)/(β̂ + 1), j = 2, 3, ..., n.    (8.17)
Exercise 2. Use the model of Example 2 on the data in the following table
adapted from Lange et al. (1994). These are leukemia counts and the associated pop-
ulations for a number of areas in New York State.
(a) Fit the Poisson model to these data both for the full data set and for an “in-
complete” data set where we suppose that the first population count (x1 = 3540) is
missing.
(b) Suppose that instead of having an x value missing, we actually have lost a
leukemia count (assume that y1 = 3 is missing). Use the EM algorithm to find the
MLEs in this case, and compare your answer to those of part (a).
8.5 Analysis of Zero-inflated Count Data With Missing Values
Discrete data in the form of counts often exhibit extra variation that cannot be ex-
plained by a simple model, such as the binomial or the Poisson. Also, these data
sometimes show more zero counts than what can be predicted by a simple model.
Therefore, a discrete model (Poisson or binomial) may fail to fit a set of discrete data
either because of zero-inflation, or because of over-dispersion, or because there is
zero-inflation as well as over-dispersion in the data. Deng and Paul (2005) developed score tests for zero-inflation and over-dispersion in generalized linear models. Mian and Paul (2016) developed estimation procedures for a zero-inflated over-dispersed count data regression model with missing responses. Here we discuss these procedures in detail.
Let Y be a discrete count data random variable. The simplest model for such a random variable is the Poisson, which has the probability mass function

f(y; µ) = e^{−µ} µ^y / y!.    (8.18)
However, data may show evidence of over-dispersion (variance is larger than the
mean). A popular over-dispersed count data model is the two parameter negative
binomial model. Different authors have used different parameterizations for the negative binomial distribution (see, for example, Paul and Plackett, 1978; Barnwal and Paul, 1988; Paul and Banerjee, 1998; Piegorsch, 1990). Let Y be a negative binomial random variable with mean parameter µ and dispersion parameter c. Then, using the terminology of Paul and Plackett (1978), Y has the probability mass function

f(y; µ, c) = [Γ(y + c^{-1}) / (y! Γ(c^{-1}))] (cµ/(1 + cµ))^y (1/(1 + cµ))^{c^{-1}},    (8.19)

for y = 0, 1, ..., µ > 0. Now, for such a Y, Var(Y) = µ(1 + µc) and c > −1/µ. This is the extended negative binomial distribution of Prentice (1986), which accounts for over-dispersion as well as under-dispersion. Obviously, when c = 0, the variance of the NB(µ, c) distribution becomes that of the Poisson(µ) distribution. Moreover, it can be shown that the limiting distribution of the NB(µ, c) distribution, as c → 0, is the Poisson(µ).
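The limiting behaviour can be checked numerically. The sketch below evaluates the mass function (8.19) on the log scale via log-gamma functions and compares NB(µ, c) with a very small c against the Poisson(µ) mass function:

```python
import math

def nb_pmf(y, mu, c):
    """NB(mu, c) mass function (8.19), computed on the log scale for stability."""
    ic = 1.0 / c
    return math.exp(math.lgamma(y + ic) - math.lgamma(y + 1) - math.lgamma(ic)
                    + y * math.log(c * mu / (1 + c * mu)) - ic * math.log(1 + c * mu))

def poisson_pmf(y, mu):
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

mu = 2.5
# As c -> 0, the NB(mu, c) mass function approaches the Poisson(mu) mass function.
gap = max(abs(nb_pmf(y, mu, 1e-6) - poisson_pmf(y, mu)) for y in range(30))
```

Summing the same mass function also verifies the stated moments Σf = 1, E(Y) = µ, and Var(Y) = µ(1 + µc) for a moderate c.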
Using the mass function in equation (8.19), the zero-inflated negative binomial regression model (see Deng and Paul, 2005) can be written as

f(y_i | x_i; µ, c, ω) = ω + (1 − ω)(1/(1 + cµ))^{c^{-1}}   if y = 0,
f(y_i | x_i; µ, c, ω) = (1 − ω)[Γ(y + c^{-1})/(y! Γ(c^{-1}))] (cµ/(1 + cµ))^y (1/(1 + cµ))^{c^{-1}}   if y > 0,    (8.20)

with E(Y) = (1 − ω)µ and Var(Y) = (1 − ω)µ[1 + (c + ω)µ], where ω is the zero-inflation parameter. We denote this distribution as ZINB(µ, c, ω).
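The moment formulas for ZINB(µ, c, ω) can be verified by summing the mass function (8.20) directly (a numerical sketch with arbitrary illustrative parameter values):

```python
import math

def zinb_pmf(y, mu, c, omega):
    """ZINB(mu, c, omega) mass function (8.20)."""
    ic = 1.0 / c
    nb = math.exp(math.lgamma(y + ic) - math.lgamma(y + 1) - math.lgamma(ic)
                  + y * math.log(c * mu / (1 + c * mu)) - ic * math.log(1 + c * mu))
    return (omega if y == 0 else 0.0) + (1 - omega) * nb

# Arbitrary illustrative parameter values (not from any data set).
mu, c, omega = 3.0, 0.4, 0.2
probs = [zinb_pmf(y, mu, c, omega) for y in range(500)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum(y * y * p for y, p in enumerate(probs)) - mean**2
```

The sums reproduce E(Y) = (1 − ω)µ and Var(Y) = (1 − ω)µ[1 + (c + ω)µ] to numerical precision.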
Regression analysis of count data may be further complicated by the existence of missing values in the response variable and/or in the explanatory variables (covariates). Extensive work has been done on regression analysis of continuous response data with some missing responses under the normality assumption. See, for example, Rubin (1976), Little and Rubin (1987), Anderson and Taylor (1976), Geweke (1986), Raftery et al. (1997), Chen et al. (2001), Kelly (2007), Zhang and Huang (2008).
Some work on missing values has also been done on logistic regression analy-
sis of discrete data. See, for example, Ibrahim (1990), Lipsitz and Ibrahim (1996),
Ibrahim and Lipsitz (1996), Ibrahim et al. (1999, 2001), Ibrahim et al. (2005), Sinha
and Maiti (2008), Maiti and Pradhan (2009).
Exercise 3.
(a) Derive the negative binomial distribution as a Gamma(α, β) mixture of the
Poisson(µ), reparameterize, and show that it can be written as the probability mass
function given in equation (8.19) (see Paul and Plackett, 1978).
(b) Derive the mean and variance of the NB(µ, c) (hint: find the unconditional mean
and unconditional variance of a mixture distribution).
(c) Verify that the mean and variance of a zero-inflated negative binomial distribution
are those given in this chapter.
Suppose the data for the ith of n subjects are (y_i, x_i), i = 1, ..., n, which are realizations from ZINB(µ, c, ω), where y_i represents the response variable and x_i represents a p × 1 vector of covariates with regression parameter β = (β_1, β_2, ..., β_p), such that µ_i = exp(Σ_{j=1}^{p} X_ij β_j). Here β_1 is the intercept parameter, in which case X_i1 = 1 for all i. In Subsection 8.5.1 we show ML estimation of the parameters with no missing data. Subsection 8.5.2 deals with different scenarios of missingness.
Writing γ = ω/(1 − ω), the log-likelihood, apart from a constant, can be written as

l(β, c, γ | y) = Σ_{i=1}^{n} {−log(1 + γ) + log[γ + f(0; µ_i, c, ω)] I_{y_i=0} + log f(y_i; µ_i, c, ω) I_{y_i>0}}
 = Σ_{i=1}^{n} (−log(1 + γ) + log{γ + exp[−c^{-1} log(1 + µ_i c)]} I_{y_i=0}
 + [y_i log µ_i − (y_i + c^{-1}) log(1 + µ_i c) + Σ_{l=1}^{y_i} log{1 + (l − 1)c}] I_{y_i>0}).    (8.22)
The parameters β_j, c and γ can be estimated by directly maximizing the log-likelihood function (8.22) or by simultaneously solving the following estimating equations:

∂l/∂β_j = Σ_{i=1}^{n} { −(1 + µ_i c)^{-1} exp[−c^{-1} log(1 + µ_i c)] / {γ + exp[−c^{-1} log(1 + µ_i c)]} I_{y_i=0} + [y_i/µ_i − c(y_i + c^{-1})/(1 + µ_i c)] I_{y_i>0} } × ∂µ_i/∂β_j = 0,    (8.23)

∂l/∂c = Σ_{i=1}^{n} { [−µ_i c^{-1}(1 + µ_i c)^{-1} + c^{-2} log(1 + µ_i c)] exp[−c^{-1} log(1 + µ_i c)] / {γ + exp[−c^{-1} log(1 + µ_i c)]} I_{y_i=0} + [−µ_i(y_i + c^{-1})(1 + µ_i c)^{-1} + c^{-2} log(1 + µ_i c) + Σ_{l=1}^{y_i} (l − 1)/{1 + (l − 1)c}] I_{y_i>0} } = 0,    (8.24)

and

∂l/∂γ = Σ_{i=1}^{n} { −(1 + γ)^{-1} + {γ + exp[−c^{-1} log(1 + µ_i c)]}^{-1} I_{y_i=0} + 0 · I_{y_i>0} } = 0,    (8.25)

where

∂µ_i/∂β_j = X_ij exp(Σ_{k=1}^{p} X_ik β_k) = X_ij µ_i.
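The estimating equations (8.23)-(8.25) can be checked against a numerical gradient of (8.22). The sketch below (with a small hypothetical data set) implements the log-likelihood and the analytic score and compares them by central differences:

```python
import numpy as np

def loglik(beta, c, gamma, y, X):
    """Log-likelihood (8.22), apart from a constant, with gamma = omega/(1 - omega)."""
    mu = np.exp(X @ beta)
    ll = 0.0
    for yi, mi in zip(y, mu):
        ll -= np.log(1 + gamma)
        if yi == 0:
            ll += np.log(gamma + (1 + c * mi) ** (-1 / c))
        else:
            ll += (yi * np.log(mi) - (yi + 1 / c) * np.log(1 + c * mi)
                   + sum(np.log(1 + (l - 1) * c) for l in range(1, yi + 1)))
    return ll

def score(beta, c, gamma, y, X):
    """Analytic score, following (8.23)-(8.25)."""
    mu = np.exp(X @ beta)
    g_beta, g_c, g_gamma = np.zeros_like(beta), 0.0, 0.0
    for xi, yi, mi in zip(X, y, mu):
        A = (1 + c * mi) ** (-1 / c)          # exp[-c^{-1} log(1 + mu_i c)]
        L = np.log(1 + c * mi)
        g_gamma -= 1 / (1 + gamma)
        if yi == 0:
            g_beta += -A / ((1 + c * mi) * (gamma + A)) * mi * xi
            g_c += A * (L / c**2 - mi / (c * (1 + c * mi))) / (gamma + A)
            g_gamma += 1 / (gamma + A)
        else:
            g_beta += (yi / mi - c * (yi + 1 / c) / (1 + c * mi)) * mi * xi
            g_c += (L / c**2 - mi * (yi + 1 / c) / (1 + c * mi)
                    + sum((l - 1) / (1 + (l - 1) * c) for l in range(1, yi + 1)))
    return g_beta, g_c, g_gamma

# Hypothetical data: an intercept and one covariate, eight observations.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(8), rng.normal(size=8)])
y = np.array([0, 2, 0, 1, 4, 0, 3, 1])
beta, c, gamma = np.array([0.2, -0.4]), 0.5, 0.3
gb, gc, gg = score(beta, c, gamma, y, X)
```

At any parameter point, the analytic score matches the central-difference gradient of the log-likelihood, which is a useful sanity check before handing these functions to a root finder.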
Exercise 4.
(a) Obtain maximum likelihood estimates of the parameters µ and c of model
(8.19) for the modified DMFT index data given in Table 8A, 8B, 8C, 8D, 8E, 8F, 8G,
8H, 8I, and 8J.
(b) Obtain maximum likelihood estimates of the parameters µ, c and ω of model
(8.20) for the modified DMFT index data given in Table 8A, 8B, 8C, 8D, 8E, 8F, 8G,
8H, 8I, and 8J.
Note that l(ψ|Yo , X) is the log-likelihood when the missing data indicators, which are
also part of the observed data, are not used.
In the more general case where missing data are not MAR, this likelihood would
remain the same but a distribution defining the missing data mechanism needs to be
included in the model. This general case is explained in the following section.
Direct maximization of l(ψ; Y_o, X) is not, in general, straightforward. So, we use the EM algorithm.
As explained earlier, the EM algorithm uses an expectation-step (E-step) and a
maximization-step (M-step). Following Little and Rubin (1987), the E-step provides
the conditional expectation of the log-likelihood l(ψ|yo,i , ym,i , xi ) given the observed
data (yo,i , xi ) and current estimate of the parameters ψ.
Suppose A of the n responses are observed and B = n − A responses are missing, and let s index the iterations during maximization of the log-likelihood. Writing the E-step contribution for each of the B missing responses accordingly, the E-step of the EM algorithm for the (s + 1)th iteration over all the observations is

Q(ψ | ψ^(s)) = Σ_{i=1}^{A} l(ψ | y_i) + Σ_{i=1}^{B} Σ_{y_{m,i}} l(ψ | y_{o,i}, y_{m,i}, x_i) P(y_{m,i} | y_{o,i}, x_i, ψ^(s)).    (8.30)

Note that in the situation where there is no missing response, the EM algorithm requires only maximization of the first term on the right-hand side.
Here P(y_{m,i} | y_{o,i}, x_i, ψ^(s)) is the conditional distribution of the missing response given the observed data and the current (sth iteration) estimate of ψ. However, in many situations, P(y_{m,i} | y_{o,i}, x_i, ψ^(s)) may not be available. Following Ibrahim et al. (2001) and Sahu and Roberts (1999), we can write P(y_{m,i} | y_{o,i}, x_i, ψ^(s)) ∝ P(y_i | x_i, ψ^(s)) (the complete data distribution given in (8.20)). For the ith of the B missing responses, we take a sample a_i1, a_i2, ..., a_im_i from P(y_i | x_i, ψ^(s)) using the Gibbs sampler (see Casella and George (1992) for details). Then, following Ibrahim et al. (2001), Q(ψ | ψ^(s)) can be written as

Q(ψ | ψ^(s)) = Σ_{i=1}^{A} l(ψ | y_i) + Σ_{i=1}^{B} (1/m_i) Σ_{k=1}^{m_i} l(ψ | y_{o,i}, x_i, a_ik).    (8.31)
Writing A(µ_i, c) = exp[−c^{-1} log(1 + µ_i c)], the second derivatives are

∂²l/∂β_j∂c = Σ_{i=1}^{n} A(µ_i, c)/{(1 + µ_i c)[γ + A(µ_i, c)]²} [{γµ_i(1 + c) + µ_i c A(µ_i, c)}/{c(1 + µ_i c)} − γc^{-2} log(1 + µ_i c)] (∂µ_i/∂β_j) I_{y_i=0}
 − Σ_{i=1}^{n} [{c(1 + µ_i c)(y_i − c^{-2}) + (1 + 2µ_i c)(y_i + c^{-1})}/(1 + µ_i c)²] (∂µ_i/∂β_j) I_{y_i>0},

∂²l/∂c² = Σ_{i=1}^{n} A(µ_i, c)/[γ + A(µ_i, c)]² [{µ_i²(1 + µ_i c) + 2µ_i c}/{c²(1 + µ_i c)²} − 2c^{-3} log(1 + µ_i c)] I_{y_i=0}
 + Σ_{i=1}^{n} A(µ_i, c){1 − A(µ_i, c)}/[γ + A(µ_i, c)]² [µ_i/{c(1 + µ_i c)} − c^{-2} log(1 + µ_i c)]² I_{y_i=0}
 + Σ_{i=1}^{n} [−µ_i²(y_i + c^{-1})/(1 + µ_i c)² − 2µ_i/{c²(1 + µ_i c)} + 2c^{-3} log(1 + µ_i c)] I_{y_i>0},

and

∂²l/∂c∂γ = Σ_{i=1}^{n} { A(µ_i, c)/[γ + A(µ_i, c)]² [µ_i/{c(1 + µ_i c)} − c^{-2} log(1 + µ_i c)] I_{y_i=0} + 0 · I_{y_i>0} },

∂²l/∂β_j∂γ = Σ_{i=1}^{n} [ A(µ_i, c)/{(1 + µ_i c)[γ + A(µ_i, c)]²} (∂µ_i/∂β_j) I_{y_i=0} + 0 · I_{y_i>0} ],

∂²l/∂γ² = Σ_{i=1}^{n} { (1 + γ)^{-2} − [γ + A(µ_i, c)]^{-2} I_{y_i=0} + 0 · I_{y_i>0} }.
8.5.2.3 Estimation under MNAR

Under MNAR, the probability of a missing observation in the response variable depends on the covariates and on the values of the response that would have been observed. It is then necessary to incorporate this missing data mechanism into the likelihood. Missing observations that follow this missing data mechanism are known as nonignorable missing. To incorporate the missing data mechanism into the likelihood, we define a binary random variable r_i (i = 1, 2, ..., n) as

r_i = 0 if y_i is observed, and r_i = 1 if y_i is missing.    (8.33)
See Ibrahim et al. (2001). To model the probability of missingness in terms of the values of the responses that would have been observed and the covariates, a logit link,

log[ p(r_i = 1)/{1 − p(r_i = 1)} ] = ν_0 + ν_1 y_i + ν_2 x_i1 + ν_3 x_i2 + · · · + ν_{p+1} x_ip,    (8.35)

can be used, where y_i denotes the response, whether observed or not, and the x_ij (j = 1, 2, ..., p) are the covariates. Denote the (p + 2)-dimensional parameter vector as ν = (ν_0, ν_1, ν_2, ..., ν_{p+1}). Note that p(r_i = 1) can now be written as a logistic model:

p(r_i = 1) = exp(ν_0 + ν_1 y_i + ν_2 x_i1 + ν_3 x_i2 + · · · + ν_{p+1} x_ip) / {1 + exp(ν_0 + ν_1 y_i + ν_2 x_i1 + ν_3 x_i2 + · · · + ν_{p+1} x_ip)}.    (8.36)
Then, the log-likelihood function of the parameter ν can be written as

l(ν | r_i, y_i, x_ij) = Σ_{i=1}^{n} { r_i log[ p(r_i = 1)/{1 − p(r_i = 1)} ] + log{1 − p(r_i = 1)} }.    (8.37)
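Note that (8.37) is just the Bernoulli log-likelihood written in logit form. The sketch below (with a hypothetical two-parameter missingness model) verifies numerically that r log[p/(1 − p)] + log(1 − p) = r log p + (1 − r) log(1 − p):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical missingness model depending on the response only: nu = (nu0, nu1).
y = rng.poisson(3.0, size=50).astype(float)
nu0, nu1 = -1.0, 0.4
p = 1.0 / (1.0 + np.exp(-(nu0 + nu1 * y)))   # p(r_i = 1) from the logit model (8.36)
r = rng.binomial(1, p)

# Log-likelihood in the form of (8.37) ...
ll_837 = np.sum(r * np.log(p / (1 - p)) + np.log(1 - p))
# ... equals the familiar Bernoulli form.
ll_bern = np.sum(r * np.log(p) + (1 - r) * np.log(1 - p))
```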
Note that the choice of variables for the model of r_i is important. Often many variables in this model are not significant and, more importantly, the parameters in the model for r_i are not of primary interest. Detailed discussion on this can be found in Ibrahim, Lipsitz, and Chen (1999) and Ibrahim, Chen, and Lipsitz (2001).
Following Ibrahim, Lipsitz, and Chen (1999), after incorporating the model for the missingness mechanism in l(ν | r_i, y_i, x_ij), the log-likelihood for all the parameters involved is

l(ψ | Y, X_{o,j}, X_{m,j}) = Σ_{i=1}^{n} { −log(1 + γ) + log[γ + f(0; µ_i, c, ω)] I_{y_i=0} + log f(y_i; µ_i, c, ω) I_{y_i>0} }
 + Σ_{i=1}^{n} { r_i log[ p(r_i = 1)/{1 − p(r_i = 1)} ] + log{1 − p(r_i = 1)} }.    (8.38)
Note that the two parts of this log-likelihood are separate and their parameters are distinct. This characteristic of the log-likelihood facilitates separate maximization. The rest of the estimation procedure under MNAR remains exactly the same as the estimation procedure under MAR. Note that, as it stands, the log-likelihood in (8.38) is not computable; however, it becomes computable when it is plugged into the EM algorithm. Further, note that some of the covariates x_ij may also be missing; this has not been discussed here, as it needs separate theoretical development and has been left for future research.
The theoretical development here is flexible. If, through tests (Deng and Paul, 2005) or other data visualization procedures, no evidence of zero-inflation or over-dispersion is found in the data, then we can start with a simpler model, such as the Poisson model, the negative binomial model, or the zero-inflated Poisson model. Note that the development of missing data involved here is that of the response data y_i. However, some of the covariates x_ij can also be missing, which has not been dealt with here.
Example 3.
Mian and Paul (2016) analyzed a set of data from a prospective study of dental status
of school children from Bohning et al. (1999). Here we report the results of the
analysis. For more detailed information see, Mian and Paul (2016). The children
were all 7 years of age at the beginning of the study. Dental status was measured by the decayed, missing and filled teeth (DMFT) index. Only the eight deciduous
molars were considered so the smallest possible value of the DMFT index is 0 and
the largest is 8. The prospective study was for a period of two years. The DMFT
index was obtained at the beginning of the study and also at the end of the study.
The data also involved three categorical covariates: gender having two categories
(0 - female, 1 - male), ethnic group having three categories (1 - dark, 2 - white, 3 -
black) and school having six categories (1 - oral health education, 2 - all four methods
together, 3 - control school (no prevention measure), 4 - enrichment of the school diet
with ricebran, 5 - mouthrinse with 0.2% NaF-solution, 6 - oral hygiene).
For the purpose of illustration of estimation in a zero-inflated over-dispersed
count data model with missing responses, Mian and Paul (2016) use the DMFT index
data obtained at the beginning of the study (as in Deng and Paul, 2005). The DMFT
index data at the beginning of the study are: (index, frequency): (0,172), (1,73),
(2,96), (3,80), (4,95), (5,83), (6,85), (7,65), (8,48). They first fit a zero-inflated nega-
tive binomial model to the complete data and data with missing observations without
covariates. Data with missing observations were obtained by randomly deleting a
certain percentage (5%, 10%, 25%) of the observed responses.
The estimates of the mean parameter µ, the over-dispersion parameter c and the zero-inflation parameter ω based on the zero-inflated negative binomial model, under different percentages of missingness, and their corresponding standard errors are given in Table V of Mian and Paul (2016). Note that the estimates of the parameters µ, c and ω and the corresponding standard errors remain stable irrespective of the amount of missingness. Only for MAR and MNAR with 25% missing values are their values slightly different (slightly larger in the case of µ and c). For higher percentages of missingness these properties might deteriorate further.
For more insight, Ê(Y) = (1 − ω̂)µ̂ and V̂ar(Y) = (1 − ω̂)µ̂[1 + (ĉ + ω̂)µ̂] were calculated and are given in Table V of Mian and Paul (2016). These estimates also do not vary very much irrespective of the amount of missingness, except under MNAR with 25% missing, when V̂ar(Y) is slightly higher compared to the others.
Mian and Paul (2016) then fit a zero-inflated negative binomial model to the
complete data and data with missing observations and covariates. Response data
with missingness were obtained exactly the same way as in the situation without
covariates. The model fitted was

µ = exp{β + β_G(M) I(Gender = 1) + β_E(D) I(Ethnic = 1) + β_E(W) I(Ethnic = 2) + β_S(1) I(School = 1) + β_S(2) I(School = 2) + β_S(3) I(School = 3) + β_S(4) I(School = 4) + β_S(5) I(School = 5)},

where β represents the intercept parameter, β_G(M) represents the regression parameter for gender, β_E(D) and β_E(W) represent the regression parameters for ethnic groups 1 and 2, and β_S(1), β_S(2), β_S(3), β_S(4), and β_S(5) represent the regression parameters for school 1, school 2, school 3, school 4, and school 5, respectively. Estimates of the parameters can be found in Table VI of Mian and Paul (2016).
In this case the estimates differed (this is expected, as it depends on which observations have remained in the final data set). In general, the standard errors of the estimates are larger (in some cases much larger, for example, for SE(β̂_S(5))) than those under complete data. For MCAR, MAR, and MNAR with 25% missing responses, the standard errors for missing data are close to twice those for complete data. The estimates of Ê(Y) do not vary much irrespective of the percentage missing and the missing data mechanism. For complete data and smaller percentages of missingness, the behaviour of V̂ar(Y) is similar to that of Ê(Y). However, for MAR with 25% missing values, V̂ar(Y) is much larger (12.415) than in the other cases (where it varies between 8.26 and 9.5).
Here we create a subset of the DMFT data analysed by Mian and Paul (2016). These data are given in the Appendix under the title: DMFT data for the analysis in Chapter 8 (Table 8A, Table 8B, Table 8C, Table 8D, Table 8E, Table 8F, Table 8G, Table 8H, Table 8I, Table 8J, Table 8K). Further, we analyzed the modified DMFT data, and the results are given in Table 8.8 and Table 8.9. The analyses and conclusions of the results in Tables 8.8 and 8.9 are very similar to those given in Table V and Table VI of Mian and Paul (2016).
yi = Xi β + Zi bi + ei , i = 1, . . . , N, (8.39)
Table 8.8 Estimates and standard errors of the parameters for DMFT index data.

                Percentage missingness   µ̂       SE(µ̂)   ĉ       SE(ĉ)   ω̂       SE(ω̂)   Ê(y)     V̂ar(y)
Complete data   0%      4.1444  0.0916  0.0426  0.0195  0.1892  0.0150  3.3603  6.5887
MCAR            5%      4.1598  0.0942  0.0415  0.0199  0.1953  0.0155  3.3473  6.6445
                10%     4.1440  0.0976  0.0472  0.0210  0.1922  0.0159  3.3475  6.6685
                25%     4.1514  0.1066  0.0465  0.0230  0.1878  0.0173  3.3719  6.6514
MAR             5%      4.1739  0.0891  0.0291  0.0182  0.1862  0.0148  3.3967  6.4497
                10%     4.1730  0.0872  0.0224  0.0175  0.1743  0.0144  3.4457  6.2731
                25%     4.2395  0.0846  0.0127  0.0160  0.1500  0.0135  3.6035  6.0887
MNAR            5%      4.1562  0.0939  0.0397  0.0197  0.1945  0.0154  3.3476  6.6071
                10%     4.1488  0.0976  0.0473  0.0210  0.1934  0.0160  3.3464  6.6878
                25%     4.1532  0.1069  0.0479  0.0232  0.1877  0.0173  3.3735  6.6742
Table 8.9 Estimates and standard errors of the parameters for DMFT index data with covariates.

Percentage missingness   β̂   SE(β̂)   β̂_G   SE(β̂_G)   β̂_E(1)   SE(β̂_E(1))   β̂_E(2)   SE(β̂_E(2))
Percentage missingness   β̂_S(1)   SE(β̂_S(1))   β̂_S(2)   SE(β̂_S(2))   β̂_S(3)   SE(β̂_S(3))   β̂_S(4)   SE(β̂_S(4))
where Ini is the ni × ni identity matrix, and Nq (µ, Σ) denotes the q-dimensional multi-
variate normal distribution with mean µ and covariance matrix Σ. The positive defi-
nite matrix D is the covariance matrix of the random effects and is typically assumed
to be unstructured and unknown. However, in practice, if one is convinced that a sim-
pler correlation structure, such as the exchangeable correlation matrix (see Zhang and
Paul, 2013), is sufficient then the estimation procedure might be simpler. Under these
assumptions, the conditional model, where conditioning refers to the random effects,
takes the form
(yi |β, σ2 , bi ) ∼ Nni (Xi β + Zi bi , σ2 Ini ). (8.41)
The model in (8.41) assumes a distinct set of regression coefficients for each individual once the random effects are known. Integrating over the random effects, the marginal distribution of y_i is obtained as y_i ∼ N_{n_i}(X_i β, Σ_i), where Σ_i = Z_i D Z_i^T + σ² I_{n_i}.
As in Section 8.4, the EM algorithm has the E-step and the M-step. Because we are
dealing with a random effects model, the M-step itself needs two steps. Now, the
log-likelihood, apart from a constant, based on the observed data can be expressed as

ℓ(β, σ², D) = −(1/2) Σ_{i=1}^{N} log |Σ_i| − (1/2) Σ_{i=1}^{N} (y_i − X_i β)^T Σ_i^{-1} (y_i − X_i β).    (8.45)
This is the M-step for estimating β given θ and V. For the estimation of θ, given β and V, the complete data log-likelihood is given by

ℓ(β, σ², D) = Σ_{i=1}^{N} log f(y_i | β, σ², b_i) + Σ_{i=1}^{N} log f(b_i | D).    (8.47)
This completes the M-step. In the E-step we calculate the expected values of the sufficient statistics given the observed data and the current parameter estimates as

E(b_i b_i^T | y_i, β̂, σ̂², D̂) = E(b_i | y_i, β̂, σ̂², D̂) E(b_i | y_i, β̂, σ̂², D̂)^T + Var(b_i | y_i, β̂, σ̂², D̂)
 = D̂ Z_i^T Σ̂_i^{-1} (y_i − X_i β̂)(y_i − X_i β̂)^T Σ̂_i^{-1} Z_i D̂ + D̂ − D̂ Z_i^T Σ̂_i^{-1} Z_i D̂

and

E(e_i^T e_i | y_i, β̂, σ̂², D̂) = tr{E(e_i e_i^T | y_i, β̂, σ̂², D̂)}
 = E(e_i | y_i, β̂, σ̂², D̂)^T E(e_i | y_i, β̂, σ̂², D̂) + tr{Var(e_i | y_i, β̂, σ̂², D̂)}
 = tr{σ̂⁴ Σ̂_i^{-1} (y_i − X_i β̂)(y_i − X_i β̂)^T Σ̂_i^{-1} + σ̂² I_{n_i} − σ̂⁴ Σ̂_i^{-1}},

where Σ̂_i = Z_i D̂ Z_i^T + σ̂² I_{n_i} and e_i = y_i − X_i β − Z_i b_i. Then, the maximum likelihood
estimates of all the parameters are obtained as
Step 1: Given some initial estimates σ̂² and D̂ of σ² and D, obtain Σ̂_i = Z_i D̂ Z_i^T + σ̂² I_{n_i}. Use these to estimate β as

β̂ = (Σ_{i=1}^{N} X_i^T Σ̂_i^{-1} X_i)^{-1} Σ_{i=1}^{N} X_i^T Σ̂_i^{-1} y_i.    (8.50)
Step 2: Given β̂, update the variance components as

σ̂2 = ( Σ_{i=1}^N ni )^{-1} Σ_{i=1}^N E(ei^T ei | yi, β̂, σ̂2, D̂)  (8.51)

and

D̂ = (1/N) Σ_{i=1}^N E(bi bi^T | yi, β̂, σ̂2, D̂),  (8.52)

where the ith term in the numerator on the right-hand side of equations (8.51) and (8.52) is tr(σ̂4 Σ̂i^{-1}(yi − Xi β̂)(yi − Xi β̂)^T Σ̂i^{-1} + σ̂2 I_{ni} − σ̂4 Σ̂i^{-1}) and D̂ Zi^T Σ̂i^{-1}(yi − Xi β̂)(yi − Xi β̂)^T Σ̂i^{-1} Zi D̂ + D̂ − D̂ Zi^T Σ̂i^{-1} Zi D̂, respectively. The maximum likelihood estimates of the parameters β, σ2 and D are obtained by iterating between step 1 and step 2. For a discussion of convergence issues of the EM algorithm here, see Ibrahim and Molenberghs (2009).
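To make the two-step iteration concrete, the sketch below runs it in the simplest special case: a random-intercept model yij = µ + bi + eij with bi ∼ N(0, d) and eij ∼ N(0, σ2), so every matrix in the algorithm collapses to a scalar. It is a minimal illustration under that simplifying assumption, not the general matrix algorithm of the text; the function name and toy data are hypothetical.

```python
def em_random_intercept(y, n_iter=200):
    """EM for y_ij = mu + b_i + e_ij, b_i ~ N(0, d), e_ij ~ N(0, s2).

    Scalar analogue of steps 1 and 2 above (X_i and Z_i are vectors of
    ones); intended only as an illustration of the iteration."""
    N = len(y)                                  # number of subjects
    M = sum(len(yi) for yi in y)                # total number of observations
    mu = sum(sum(yi) for yi in y) / M           # start at the grand mean
    d, s2 = 1.0, 1.0                            # crude starting values
    for _ in range(n_iter):
        # E-step: posterior mean and variance of each b_i given (mu, d, s2)
        post = []
        for yi in y:
            v = 1.0 / (len(yi) / s2 + 1.0 / d)          # Var(b_i | y_i)
            m = v * sum(yij - mu for yij in yi) / s2    # E(b_i | y_i)
            post.append((m, v))
        # M-step: complete-data MLEs with b_i replaced by its expectation
        mu = sum(sum(yij - m for yij in yi)
                 for yi, (m, v) in zip(y, post)) / M
        d = sum(m * m + v for m, v in post) / N
        s2 = sum(sum((yij - mu - m) ** 2 for yij in yi) + len(yi) * v
                 for yi, (m, v) in zip(y, post)) / M
    return mu, d, s2
```

For balanced data the fixed-effect update leaves the grand mean unchanged, which matches the GLS estimate (8.50) in this special case.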
Example 4. The six cities study of air pollution and health was a longitudinal
study designed to characterize lung growth as measured by changes in pulmonary
function in children and adolescents, and the factors that influence lung function
growth. A cohort of 13,379 children born on or after 1967 was enrolled in six com-
munities across the U.S.: Watertown (Massachusetts), Kingston and Harriman (Ten-
nessee), a section of St. Louis (Missouri), Steubenville (Ohio), Portage (Wisconsin),
and Topeka (Kansas). Most children were enrolled in the first or second grade (be-
tween the ages of six and seven) and measurements of study participants were ob-
tained annually until graduation from high school or loss to follow-up. At each annual
examination, spirometry, the measurement of pulmonary function, was performed
and a respiratory health questionnaire was completed by a parent or guardian. The basic maneuver in simple spirometry is maximal inspiration (or breathing in) followed
by forced exhalation as rapidly as possible into a closed container. Many different
measures can be derived from the spirometric curve of volume exhaled versus time.
One widely used measure is the total volume of air exhaled in the first second of the
maneuver (FEV1 ).
Fitzmaurice et al. (2004) present an analysis of a subset of the pulmonary function data collected in the Six Cities Study. The data consist of all measurements of FEV1, height and age obtained from a randomly selected subset of the
female participants living in Topeka, Kansas. The random sample consists of 300
girls, with a minimum of one and a maximum of twelve observations over time. Data
for four selected girls are presented in Table 8.1 in Fitzmaurice et al. (2004). These
data were then analysed using the following regression model
8.6.3 Estimation with Nonignorable Missing Response Data (MAR and MNAR)
The method of estimation used for complete data will now be extended to missing data under the MAR and MNAR mechanisms introduced in equations (8.1) and (8.2) in Section 8.2. Under MAR, the conditional probability of missingness depends only on the observed data, and the parameters of the missingness mechanism are completely separate and distinct from the parameters of the model (8.39). In likelihood-based estimation under MAR, the missingness mechanism can therefore be ignored: because the part of the likelihood involving the ri does not depend on the parameters of interest, the MLE of the parameters of interest is unchanged if that part is dropped. Data that are missing at random are often described as ignorable missing or ignorable nonresponse, but the subjects with these missing observations cannot simply be deleted before the analysis. That is, under MAR the distribution of ri given in equation (8.1) is not required, whereas under MNAR we need both models given in (8.1) and (8.2). In what follows we therefore deal with the analysis under the MNAR mechanism.
To put things in perspective, define an ni × 1 random vector Ri, whose jth component has the binary distribution
ri j = 0 if yi j is observed, and ri j = 1 if yi j is missing.  (8.54)
where γ(t) = (β(t), σ2(t), D(t), φ(t)). Since both bi and ymis,i are unobserved, they must be integrated out. Thus, the E-step for the ith observation at the (t + 1)th iteration is

Qi(γ | γ(t)) = ∫∫ log[f(yi | β, σ2, bi)] f(ymis,i, bi | yobs,i, ri, γ(t)) dbi dymis,i
 + ∫∫ log[f(bi | D)] f(ymis,i, bi | yobs,i, ri, γ(t)) dbi dymis,i
 + ∫∫ log[f(ri | φ, yi)] f(ymis,i, bi | yobs,i, ri, γ(t)) dbi dymis,i
 ≡ I1 + I2 + I3,  (8.59)
where f (ymis,i , bi | yobs,i , ri , γt ) represents the conditional distribution of the data con-
sidered (missing), given the observed data.
Detailed calculations, which are omitted here but given in Ibrahim and Molenberghs (2009), lead to the evaluation of Qi(γ | γ(t)) as given below. Let ui1, . . . , uimi be a
sample of size mi from
f(ymis,i | yobs,i, ri, γ(t)) ∝ exp{ −(1/2) (yi − Xi β(t))^T (Zi D(t) Zi^T + σ2(t) I_{ni})^{-1} (yi − Xi β(t)) } × f(ri | ymis,i, yobs,i, γ(t)),  (8.60)
obtained via the Gibbs sampler along with the adaptive rejection algorithm of Gilks
and Wild (1992), where the specific model for f (ri | ymis,i , yobs,i , γ(t) ) is given in their
equation (5.43). Also see Ibrahim and Molenberghs (2009) for further discussion.
Now let yi^{(k)} = (uik^T, yobs,i^T)^T and bi^{(tk)} = Σi^{(t)} Zi^T (yi^{(k)} − Xi β(t)) / σ2(t). Then, the E-step for the ith observation at the (t + 1)th iteration can be simplified as
Qi(γ | γ(t)) = −(ni/2) log(2π) − (ni/2) log(σ2)
 − (1/(2σ2)) { tr(Zi^T Zi Σi^{(t)}) + (1/mi) Σ_{k=1}^{mi} (yi^{(k)} − Xi β − Zi bi^{(tk)})^T (yi^{(k)} − Xi β − Zi bi^{(tk)}) }
 − (q/2) log(2π) − (1/2) log|D| − (1/2) tr(D^{-1} Σi^{(t)})
 − (1/(2mi)) Σ_{k=1}^{mi} (bi^{(tk)})^T D^{-1} bi^{(tk)} + (1/mi) Σ_{k=1}^{mi} log[f(ri | φ, yi^{(k)})]  (8.61)
σ2(t+1) = (1/M) Σ_{i=1}^N [ (1/mi) Σ_{k=1}^{mi} (yi^{(k)} − Xi β(t+1) − Zi bi^{(tk)})^T (yi^{(k)} − Xi β(t+1) − Zi bi^{(tk)}) + tr(Zi^T Zi Σi^{(t)}) ]  (8.65)
and
D^{(t+1)} = (1/N) Σ_{i=1}^N [ (1/mi) Σ_{k=1}^{mi} bi^{(tk)} (bi^{(tk)})^T + Σi^{(t)} ],  (8.66)
where Vi(β, α) = φ Ai(µi)^{1/2} Ri(ρ) Ai(µi)^{1/2}, Ai = diag{var(yi1), . . . , var(yini)}, α = (φ, ρ), and Ri(ρ) is a working correlation matrix of Yi. For the choice of Ri(ρ), see Chapter 4. Denote the GEE estimate of β obtained by solving U1(β, α̂) = 0 by β̂1, where α̂ is any √N-consistent estimator. Usually, α̂ is estimated by the method of moments (Zhang and Paul, 2013). Liang and Zeger (1986) showed that β̂1 is consistent and asymptotically normal and that its variance can be consistently estimated by a sandwich-type estimator (see Chapter 4).
As earlier, let rit be the missing data indicator variable for the tth observation of the ith individual and assume monotone missingness with ri1 ≥ ri2 ≥ · · · ≥ ri,ni (that
is, an individual who leaves the study never comes back). Also, assume, ri1 = 1 for
all i; that is, all responses are completely observed at baseline.
Then, the weighted estimating equation (Robins and Rotnitzky 1995, Preisser,
Lohman, and Rathouz 2002) for obtaining unbiased GEE estimates of the regression
parameters under MAR is
U2(β, α, γ) = Σ_{i=1}^N (∂µi/∂β)^T Vi^{-1}(β, α) ∆i (Yi − µi) = 0,  (8.68)
where Ei = ∂µi/∂β, Wi is the variance of Yi, ri = ri1 + ri2, and θi = πi1 + (1 − πi1)πi2 is the probability of being observed. It can be shown that E(U_TLB) = 0, and hence U_TLB = 0 will produce consistent estimates of β under very general conditions. The weight here depends on when the response is observed (at time 1 or time 2), while U_TLB uses the overall probability of being observed either at time 1 or time 2.
Bias-Corrected Estimating Functions
The weighting GEE approach of Troxel et al. (1997) requires that (a) an in-
dividual will be further observed even if he or she has been observed earlier and
(b) return is not possible once a subject leaves the study. Both of these conditions
can be avoided if we use U_CC − E(U_CC) = 0 rather than U_CC = 0 for parameter estimation. Wang (1999b) showed that we can write the unbiased estimating equations U^{(1)} = U_CC − E(U_CC) as
U^{(1)} = Σ_{i=1}^N ri Ei^T Wi^{-1} (Yi − ηi),  (8.72)
where ηi = E(Yi | Xi, Ri = 1). This can also be expressed as a function of µi, with ηi = θi1 µi / {θi1 µi + (1 − µi) θi0}, as
U^{(1)} = Σ_{i=1}^N ri (∂ηi/∂β)^T (Yi − ηi) / Var(Yi | Ri = 1).
This implies that U^{(1)} may be derived by maximizing ∏_{ri=1} P(Yi | ri = 1). The form of U^{(1)} also suggests that these are the most efficient estimating functions (conditional on ri = 1) in terms of the asymptotic variance of the estimates (Godambe and Heyde, 1987).
Conditional Estimating Functions
The complete-case model produces biased estimates because the expectations of Yi for those who responded differ from µi. The estimating functions given by (8.69) and (8.70) correct for the biases due to the response-dependent selection. However, responses obtained at time 1 are not distinguished from those obtained at time 2 in U_TLB.
Let µi1 = E(Yi |Xi , ri1 = 1) and µi2 = E(Yi |Xi , ri1 = 0, ri2 = 1). Parameter estimation
can rely on the following unbiased estimating functions:
U^{(2)} = Σ_{i=1}^N ri1 Di1^T Wi1^{-1} (Yi − µi1) + Σ_{i=1}^N ri2 Di2^T Wi2^{-1} (Yi − µi2),  (8.73)
in which Dik = ∂µik /∂β and Wik = µik (1 − µik ) for k = 1 and 2.
We will write the missingness probabilities as πi1 (Yi ) and πi2 (Yi ) to express the
dependence on the response variable Yi explicitly. The relationships between µi1, µi2 and µi are given by

µi1 = πi1(1) µi / [ πi1(1) µi + πi1(0) (1 − µi) ]  (8.74)

and

µi2 = πi2(1){1 − πi1(1)} µi / [ πi2(1){1 − πi1(1)} µi + πi2(0){1 − πi1(0)} (1 − µi) ].  (8.75)
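As a numerical check on (8.74) and (8.75), the small helper below computes µi1 and µi2 from µi and the four missingness probabilities; when the probability of being observed does not depend on Y (πk(1) = πk(0)), both conditional means reduce to µi. The function and argument names are ours, not the book's.

```python
def respondent_means(mu, pi1_1, pi1_0, pi2_1, pi2_0):
    """Eqs. (8.74)-(8.75): means of a binary Y among wave-1 and wave-2
    respondents, where pik_y = P(observed at wave k | Y = y)."""
    # (8.74): mean among those observed at wave 1
    mu1 = pi1_1 * mu / (pi1_1 * mu + pi1_0 * (1.0 - mu))
    # (8.75): mean among those first observed at wave 2
    num = pi2_1 * (1.0 - pi1_1) * mu
    den = num + pi2_0 * (1.0 - pi1_0) * (1.0 - mu)
    return mu1, num / den
```

With response-dependent observation probabilities the respondent means shift away from µi, which is exactly the selection bias the estimating functions above correct for.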
Because Dik^T Wik^{-1} = Ei^T Wi^{-1}, (8.73) can be rewritten as
U^{(2)} = Σ_{i=1}^N Ei^T Wi^{-1} { ri1 (Yi − µi1) + ri2 (Yi − µi2) }.  (8.76)
Asymptotic Covariance
Wang (1999b) discussed the estimation of the asymptotic covariance of the GEE
estimates of the regression parameters β. Omitting details, the asymptotic covari-
ance of the estimates, β̂, obtained from the unbiased estimating equations U = 0
discussed above, can be approximated by V = A−1 B(AT )−1 , where A = E(∂U/∂β)
and B = E(UU^T), the covariance matrix of the estimating functions. Table 1 of Wang (1999b) gives the asymptotic covariances of the estimates.
Example 7.
Wang (1999b) analyzed a set of data given in Troxel et al. (1997) from a survey
study concerning medical practice guidelines. The purpose of the survey was to as-
sess the lawyers’ knowledge of medical practice guidelines and the effect of those
guidelines on their legal work. Overall, 578 responses were received (among the 960
recipients) from the two mailing waves. Because of missing values in the received
responses, there are only 524 subjects with complete information on four binary vari-
ables. The four binary variables are Years (= 1 if > 10 years of law practice), Firm
P (= 1 if > 20% of the firm’s work has involved malpractice), Number (= 1 if > 5
new cases a year), and the response variable, Aware (awareness of medical practice
guidelines). Wang (1999b) used the same model as in Troxel et al. (1997). To eliminate the complications arising from the missing values in the received responses, he only used observations from the 524 subjects and adjusted the total number of recipients to 960 × (524/578).
The variables used in the missingness model are Years, Firm P, Number, Aware,
Years × Aware, and Firm P × Number. For these variables, a logit regression model
was used which also included an intercept parameter and a parameter τ represent-
ing amounts of missingness. The parameters in the missingness model were esti-
mated as α = (−1.75, 0.72, 1.11, 1.65, 1.12, −1.14, −1.85, 0.15) with standard errors
(0.35, 0.65, 0.67, 0.50, 0.60, 0.79, 0.78, 0.15). The parameter estimates of the four estimating function methods are quite different from those of the CC analysis. For the intercept, Years, Firm P, and Number, the coefficient estimates from the CC analysis are (−0.19, 0.01, 0.25, 0.62). The corresponding estimates from the weighting methods of TLB and RRZ are (−0.75, 0.49, 0.30, 0.71) with standard errors (0.35, 0.42, 0.20, 0.22).
Estimates from the new methods U^{(1)} and U^{(2)} are similar to those from the weighting methods except for one variable (Years). The effects of Years estimated by U^{(1)} and U^{(2)} are close to zero, quite different from the TLB and RRZ estimates. Considering the large standard errors for the Years effect, we should not be surprised by the difference. Also notice that the missingness model plays an important role in estimating β, and the effect of Years in the missingness model is also very uncertain, which may have a serious impact on the estimate of the Years effect.
Estimation with ignorable missing response data
For estimation with ignorable missing response data the missingness mecha-
nism can be ignored. So the joint distribution would be f (yi , bi | β, σ2 , D) instead of
f(yi, bi, ri | β, σ2, D, φ), as in the case of nonignorable missing response data. For ignorable
(MAR) missing response data, selection models decompose the joint distribution as
An unbiased estimate of σ2e is σ̂2e = MSW and that of σ2a is σ̂2a = (MSB − MSW)/n0.
The random effects models or the more general mixed effects models are used
in longitudinal data analysis. Longitudinal data arise as continuous responses or dis-
crete responses. In what follows we first show methodologies for the analysis of
continuous data.
Yi ∼ N(Xi β, Vi ). (9.2)
If bi is random and is distributed as N(0, σ2b), then Vi becomes a matrix with σ2 + σ2b as the diagonal elements and σ2b as the off-diagonal elements, and the distribution of Yi becomes

Yi ∼ N(Xi β, Vi).  (9.4)
Model (9.4) is a mixed effects model. If there is no covariate, this model is identical to the random effects model in Section 9.1.2. However, because of the longitudinal nature of the model, bi could have covariates which themselves could be correlated. This results in models (4.3) and (4.4) in Molenberghs and Verbeke (2005).
where Wi equals Vi^{-1}. In practice, α is not known and can be replaced by its MLE α̂. However, one often also uses the so-called restricted maximum likelihood (REML) estimator for α (Thompson, 1962), which allows one to estimate α without having to estimate the fixed effects in β first. It is known from simpler models, such as linear regression models, that, although the classical ML estimators of the variance components are biased, the REML estimators avoid this bias (Verbeke and Molenberghs 2000, Section 5.3).
In practice, the fixed effects in β are often of primary interest, as they describe
the average evolution in the population. Conditionally on α, the maximum likelihood
(ML) estimate for β is given by (9.8), which is normally distributed with mean
E[β̂(α)] = ( Σ_{i=1}^N Xi^T Wi Xi )^{-1} Σ_{i=1}^N Xi^T Wi E[Yi] = β,  (9.9)
and covariance
Var[β̂(α)] = ( Σ_{i=1}^N Xi^T Wi Xi )^{-1} × Σ_{i=1}^N Xi^T Wi Var(Yi) Wi Xi × ( Σ_{i=1}^N Xi^T Wi Xi )^{-1}
 = ( Σ_{i=1}^N Xi^T Wi Xi )^{-1},  (9.10)
provided that the mean and covariance were correctly specified in our model, i.e., provided that E(Yi) = Xi β and Var(Yi) = Vi = Zi D Zi^T + σ2 I_{ni}.
The random effects model established by Laird and Ware (1982) explicitly models the individual subject effect. Once the distributions of the random effects are specified, inference can be based on maximum likelihood methods. Lindstrom and Bates
(1988) also provided details on computational methods for Newton-Raphson and EM
algorithms. For the linear random effects model, the generalized least squares based
on the marginal distributions of Yi (after integrating out bi ) is,
β̂ = ( Σ_{i=1}^N Xi^T Σi^{-1} Xi )^{-1} Σ_{i=1}^N Xi^T Σi^{-1} Yi.  (9.11)
The predictor b̂i = D̂ Zi^T Σ̂i^{-1} (Yi − Xi β̂) is often referred to as the “Empirical BLUP” or the “Empirical Bayes” (EB) estimator (see Harville and Jeske, 1992).
Furthermore, it can be shown that

var(b̂i) = D Zi^T Σi^{-1} Zi D − D Zi^T Σi^{-1} Xi ( Σ_{i=1}^N Xi^T Σi^{-1} Xi )^{-1} Xi^T Σi^{-1} Zi D.
Let Ĥi = Σ̂i − Zi D̂Zi> . The ith subject’s predicted response profile is,
Ŷi = Xi β̂ + Zi b̂i
 = Xi β̂ + Zi D̂ Zi^T Σ̂i^{-1} (Yi − Xi β̂)
 = (Ĥi Σ̂i^{-1}) Xi β̂ + (I − Ĥi Σ̂i^{-1}) Yi.
That is, the ith subject’s predicted response profile is a weighted combination of
the population-averaged mean response profile Xi β̂, and the ith subject’s observed
response profile Yi .
Note that Ĥi Σ̂i^{-1} measures the relative within-subject variability, while Σ̂i combines the within-subject and between-subject sources of variability. As a result, Ŷi assigns additional weight to Yi (the observation from subject i itself) due to the presence of random effects, whereas Xi β̂ is the estimate at the population level. If Ĥi is large, that is, the within-subject variability is greater than the between-subject variability, more weight is given to Xi β̂, the population-averaged mean response profile. When the between-subject variability is greater than the within-subject variability, more weight is given to the ith subject's observed data Yi. The lme function in the R package nlme is widely used here, and numerical solutions can be easily obtained.
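For a random-intercept model this weighting has a closed scalar form: b̂i shrinks the subject's mean residual by the factor ni d/(ni d + σ2), so subjects with more observations, or larger between-subject variance, borrow less from the population mean. The sketch below illustrates this under that simplifying assumption; the function name is hypothetical.

```python
def eb_profile(y_i, xbeta_i, d, s2):
    """Subject i's predicted profile under a random-intercept model:
    Yhat_ij = x_ij'beta + b_hat_i, with b_hat_i the EB shrinkage estimate
    (the scalar case of b_hat_i = D Z' Sigma^{-1} (Y_i - X_i beta))."""
    n = len(y_i)
    w = n * d / (n * d + s2)          # shrinkage weight on the subject's own data
    resid_mean = sum(y_i) / n - sum(xbeta_i) / n
    b_hat = w * resid_mean            # 0 when d = 0; resid_mean when n*d >> s2
    return [xb + b_hat for xb in xbeta_i]
```

Setting d = 0 returns the population profile unchanged, matching the limiting behaviour described above.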
The conditional likelihood for β given the sufficient statistics for the γi has the
form
∏_{i=1}^N [ exp( Σ_{j=1}^{ni} yi j Xi j^T β ) / Σ_{Ri} exp( Σ_{l=1}^{yi.} Xil^T β ) ],  (9.15)

where yi. = Σ_{j=1}^{ni} yi j and the index set Ri contains all the (ni choose yi.) ways of choosing yi. positive responses out of ni repeated observations. Conditional maximum likelihood
estimates of the parameters are to be obtained by maximizing (9.15) with respect to
the parameters β.
Maximum likelihood estimators of m and c are then obtained by solving the estimat-
ing equations
∂l/∂m = Σ_{i=1}^N [ yi/m − (1 + c yi)/(1 + c m) ] = 0,

and

∂l/∂c = Σ_{i=1}^N [ (1/c2) ln(1 + c m) + (yi − m)/(c(1 + c m)) − Σ_{j=0}^{yi−1} 1/(c(1 + c j)) ] = 0,
simultaneously (see Piegorsch, 1990). Solution of the first equation provides m̂ = ȳ.
Maximum likelihood estimator of the parameter c, denoted by ĉ ML , is obtained by
solving the second equation after replacing m by ȳ. It can be seen that the restriction
ĉ ML > −1/ymax , where ymax is the maximum of the data values Y1 , . . . , Yn , must be
imposed in solving the equation.
2. Method of Moments Estimator
The method of moments estimators of the parameters m and c obtained by equat-
ing the first two sample moments with the corresponding population moments are
m̂ = ȳ and
ĉMM = (s2 − m̂)/m̂2,  (9.17)

where ȳ is the sample mean and s2 = Σ_{i=1}^N (yi − ȳ)2/(N − 1) is the sample variance.
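Equation (9.17) is straightforward to code; the snippet below (function name ours) returns both moment estimates.

```python
def nb_moment_estimates(y):
    """Method-of-moments estimates: m_hat = sample mean and, from (9.17),
    c_hat = (s^2 - m_hat) / m_hat^2, with s^2 the usual N-1 sample variance."""
    N = len(y)
    m = sum(y) / N
    s2 = sum((yi - m) ** 2 for yi in y) / (N - 1)
    return m, (s2 - m) / m ** 2
```

A negative ĉMM simply signals underdispersion relative to the Poisson (s2 < m̂).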
3. The Extended Quasi-Likelihood Estimator
For joint estimation of the mean and dispersion parameters, Nelder and Pregibon (1987) suggested an extended quasi-likelihood, which assumes only the first two moments of the response variable. The extended quasi-likelihood (EQL) and the estimating equations for m and c, in the context of estimating the negative binomial parameter k by maximum quasi-likelihood, were given in Clark and Perry (1989). Note that our c is the same as α of Clark and Perry (1989). Without giving details, we denote the maximum extended quasi-likelihood estimate of c by ĉEQL.
4. The Double-Extended Quasi-Likelihood Estimator
Lee and Nelder (2001) introduced double-extended quasi-likelihood (DEQL) for
the joint estimation of the mean and the dispersion parameters. The DEQL method-
ology requires an EQL for Yi given some random effect ui and an EQL for ui from
a conjugate distribution given some mean parameter m and dispersion parameter c.
The DEQL then is obtained by combining these two EQLs. The random effects ui ’s
or some transformed variables vi are then replaced by their maximum likelihood es-
timates resulting in a profile DEQL. This profile DEQL is the same as the negative
binomial log-likelihood with the factorials replaced by the usual Stirling approxima-
tions (Lee and Nelder, 2001, Result 5, p. 996). They argued, however, that the Stirling
approximation may not be good for small z, so for non-normal random effects, they
suggested the modified Stirling approximation
ln Γ(z) ≈ (z − 1/2) ln(z) + (1/2) ln(2π) − z + 1/(12z).
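The quality of this modified approximation is easy to check numerically against an exact log-gamma routine; at z = 5 it agrees with math.lgamma to roughly 3 × 10⁻⁵, and even at z = 1 the error is only about 2 × 10⁻³. The helper name is ours.

```python
import math

def lgamma_modified_stirling(z):
    """Modified Stirling approximation displayed above:
    ln Gamma(z) ~ (z - 1/2) ln z + (1/2) ln(2 pi) - z + 1/(12 z)."""
    return ((z - 0.5) * math.log(z) + 0.5 * math.log(2.0 * math.pi)
            - z + 1.0 / (12.0 * z))
```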
Omitting details of the derivation, which can be obtained from the authors, the profile
DEQL, with the modified Stirling approximation, is
p*v(DEQ) = Σ_{i=1}^N { yi ln(m) + (yi + 1/c) ln[(1 + c yi)/(1 + c m)] − (1/2) ln[2π(1 + c yi)] − (yi + 1/2) ln(yi) − c2 yi/(12(1 + c yi)) − 1/(12 yi) }.

From this we obtain the maximum profile DEQL estimating equations for m and c as

Σ_{i=1}^N [ yi/(m(1 + c m)) − 1/(1 + c m) ] = 0

and

Σ_{i=1}^N [ (1/c2) ln((1 + c m)/(1 + c yi)) + (yi − m)/(c(1 + c m)) − (2 c yi + 2 − c)/(2 c2 (1 + c yi)) + (2 − c)/(2 c2) − c yi (2 + c yi)/(12 (1 + c yi)2) ] = 0,
respectively. The maximum DEQL estimate for m obtained from the first equation
above is m̂ = ȳ. The maximum DEQL estimate of c, denoted by ĉDEQL , is obtained
by iteratively solving the second equation above after replacing m by m̂ = ȳ. Saha
and Paul (2005) provided a detailed comparison, by a simulation study, of these
estimators.
9.4.4 Examples: Estimation for European Red Mites Data and the Ames
Salmonella Assay Data
The European red mites data and the Ames salmonella assay data are now analyzed.
The European red mites data do not have any covariate and comprise 150 observations in the form of a frequency distribution (see Table 1, Saha and Paul (2005)). The Ames salmonella assay dataset has one covariate and a total of 18 observations. The maximum likelihood estimates of the parameters m and c for the red mites data are 1.1467 (0.1273) and 0.976 (0.2629), respectively, with standard errors in parentheses.
Example 9: Ames Salmonella Assay Data (see Table 2, Saha and Paul, 2005)
The data in Table 2 of Saha and Paul (2005) were originally given by Margolin et al. (1981). The data from an Ames salmonella reverse mutagenicity assay have a response variable Y, the number of revertant colonies observed on each of three replicate plates, and a covariate x, the dose level of quinoline on the plate. We use the regression model given by

E(Yi | xi) = mi = exp[β0 + β1 xi + β2 ln(xi + 10)].
The maximum likelihood estimating equations for the parameters of a general negative binomial regression model are given by Lawless (1987). The maximum likelihood estimates of the parameters β0, β1, β2, and c are 2.197628 (0.324576), −0.000980 (0.000386), 0.312510 (0.087892), and 0.048763 (0.028143), respectively.
It is straightforward to generalize this framework to generalized linear models with random effects. Recall that the GLM relies on the link function h to specify the mean, µ = h^{-1}(X^T β). As an extension to generalized linear models, we can naturally incorporate random effects into the linear predictor. To be brief, we simply replace Xi β by Xi β + Zi bi, in which both Xi and Zi are covariates. Once the distribution of bi (usually multivariate normal) is specified, likelihood inference can be carried out in principle. Conditional on (Xi, Zi, bi), we have the same framework as the GLM. But the bi are “latent”, as in the linear mixed effects model. In general, approximations are needed in evaluating the marginal likelihood (after integrating out the random effects) because µi is a nonlinear function of θi = Xi β + Zi bi. How to evaluate or approximate this likelihood falls within computational statistics. More details can be found in the excellent book by McCulloch and Searle (2001).
Note that using an additive random effects model leads to a mixture of GLMs. McLachlan (1997) proposed the EM algorithm for the fitting of mixture distributions by maximum likelihood. This essentially opened a new direction for estimating dispersion parameters when analyzing overdispersed data.
yi1 = Xi1^T β + εi1,  (9.18)
yi j = Xi j^T β + α(yi, j−1 − Xi, j−1^T β) + εi j.  (9.19)

Assuming εi1 ∼ N(0, σ2) and εi j ∼ N(0, σ2(1 − α2)) for j > 1, after some simple algebra we can obtain var(Yi j) = σ2 and cov(Yi j, Yi j′) = α^{|j − j′|} σ2. In other words, this model produces a marginal multivariate normal model with AR(1) variance-covariance matrix.
It makes most sense for equally spaced outcomes, of course. Upon including random
effects into (9.1), and varying the assumptions about the autoregressive structure, it is
clear that the general linear mixed-effects model formulation with serial correlation
encompasses wide classes of transition models. There is a relatively large literature
on the direct formulation of Markov models for binary, categorical, and Poisson data as well (Diggle et al., 2002). For outcomes of a general type, generalized linear model
ideas can be followed to formulate transition models.
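The implied AR(1) covariance structure can be written down directly; the helper below (names ours) builds the marginal variance-covariance matrix cov(Yij, Yij′) = α^|j−j′| σ2 for equally spaced visits.

```python
def ar1_cov(j, k, sigma2, alpha):
    """Marginal covariance implied by the transition model:
    cov(Y_ij, Y_ik) = alpha^{|j-k|} * sigma^2."""
    return (alpha ** abs(j - k)) * sigma2

def ar1_cov_matrix(n, sigma2, alpha):
    """Full n x n AR(1) variance-covariance matrix for one subject."""
    return [[ar1_cov(j, k, sigma2, alpha) for k in range(n)]
            for j in range(n)]
```

The geometric decay in |j − k| is why this formulation only makes sense for equally spaced outcomes, as noted above.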
We extend the generalized linear models (GLMs) to describe the conditional distribution of each response yi j as a function of the past responses yi, j−1, . . . , yi1 and covariates Xi j. For example, the probability of a child having an obesity problem at time ti j depends not only on explanatory variables, but also on the obesity status at time ti, j−1. We will focus on the case where the observation times ti j are equally spaced. To simplify notation, we denote the history for subject i at visit j by Hi j = (yi1, yi2, . . . , yi, j−1). As above, we will continue to condition on the past and present values of the covariates without explicitly listing them.
The most useful transition models are Markov chains, for which the conditional distribution of yi j given Hi j depends only on the q prior observations yi, j−1, . . . , yi, j−q. The integer q is referred to as the model order.
A transition model specifies a GLM for the conditional distribution of yi j given
the past response, Hi j . The form of the conditional GLM is
n o
f (yi j |Hi j ) = exp [yi j θi j − ψ(θi j )]/φ + c(yi j , φ) (9.20)
for the known functions ψ(θi j) and c(yi j, φ). The conditional mean and variance are µci j = E(Yi j | Hi j) = ψ′(θi j) and υci j = Var(Yi j | Hi j) = ψ′′(θi j)φ, and the conditional mean is modelled through

ηi j(µci j) = Xi j^T β + κ(Hi j, β, α),  (9.21)

where κ is a function of the past responses. In the Gaussian linear case the εi j are independent, mean-zero innovations, and the present observation, Yi j, is a linear function of xi j and of the earlier deviations Yi, j−r − xi, j−r^T β, r = 1, . . . , q.
Logit link — an example of a logistic regression model for binary responses that comprises a first order Markov chain (see Cox and Snell, 1989; Korn and Whittemore, 1979), and which assumes that yi j (j > 1) is independent of earlier observations given the previous observation yi, j−1, is
The notation βq indicates that the value and interpretation of the regression coef-
ficients changes with the Markov order, q.
Log-link — with count data we assume a log-linear model where Yi j given Hi j follows a Poisson distribution. Zeger and Qaqish (1988) discussed a first order Markov chain with f = α{log(y*i, j−1) − Xi, j−1^T β}, where y*i j = max(yi j, c), 0 < c < 1. This leads to

µci j = E(Yi j | Hi j) = exp(Xi j^T β) [ y*i, j−1 / exp(Xi, j−1^T β) ]^α.  (9.25)
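Computing (9.25) for given covariates and a previous count is a one-liner; the sketch below (names ours) shows it, with the floor y* = max(y, c) guarding against a previous count of zero.

```python
import math

def zq_conditional_mean(x_now, x_prev, y_prev, beta, alpha, c=0.5):
    """Conditional mean (9.25) of the Zeger-Qaqish first-order model:
    mu = exp(x'beta) * (y*_prev / exp(x_prev'beta))^alpha,
    where y*_prev = max(y_prev, c), 0 < c < 1."""
    lp_now = sum(b * x for b, x in zip(beta, x_now))    # x_ij' beta
    lp_prev = sum(b * x for b, x in zip(beta, x_prev))  # x_i,j-1' beta
    return math.exp(lp_now) * (max(y_prev, c) / math.exp(lp_prev)) ** alpha
```

With α = 0 the past is ignored and the mean reduces to the ordinary log-linear exp(x'β).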
When maximizing (9.27) there are two distinct cases to consider. In the first, fr(Hi j; α, β) = αr fr(Hi j), so that h(µci j) = Xi j^T β + Σ_{r=1}^s αr fr(Hi j). Here, h(µci j) is a linear function of both β and α = (α1, . . . , αs), so that the estimation proceeds as in GLMs for independent data. We simply regress yi j on the (p + s)-dimensional vector of extended explanatory variables (xi j, f1(Hi j), . . . , fs(Hi j)).
The second case occurs when the functions of past responses include both α
and β. Examples are the linear and log-linear models discussed above. To derive
an estimation algorithm for this case, note that the derivative of the log conditional
likelihood or conditional score function has the form
U c(δ) = Σ_{i=1}^N Σ_{j=q+1}^{ni} (∂µci j/∂δ) (υci j)^{-1} (yi j − µci j) = 0,  (9.28)
where δ = (β, α). This equation is the conditional analog of the GLM score equation. The derivative ∂µci j/∂δ is analogous to Xi j, but it can depend on α and β. We can still formulate the estimation procedure as an iteratively reweighted least squares as follows. Let Yi be the (ni − q)-vector of responses for j = q + 1, . . . , ni and µci its expectation given the history. Let Xi* be an (ni − q) × (p + s) matrix with kth row ∂µci,q+k/∂δ, and let Wi = diag(1/υci,q+1, · · · , 1/υci,ni) be an (ni − q) × (ni − q) diagonal weighting matrix. Finally, let Zi = Xi* δ̂ + (Yi − µ̂ci). Then, an updated δ̂ can be obtained by iteratively regressing Z on X* using weights W.
When the correct model is assumed for the conditional mean and variance, the
solution δ̂ asymptotically, as N goes to infinity, follows a Gaussian distribution with
mean equal to the true value, δ, and (p + s) × (p + s) variance matrix
Vδ̂ = ( Σ_{i=1}^N Xi*^T Wi Xi* )^{-1}.  (9.29)
( π00  π01 )
( π10  π11 ),
where πab = P(Yi j = b | Yi, j−1 = a), a, b = 0, 1. For example, π01 is the probability that Yi j = 1 when the previous response is Yi, j−1 = 0. Note that each row of a transition matrix sums to one. As its name implies, the transition matrix records the probabilities of making each of the possible transitions from one visit to the next.
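Without covariates, the transition probabilities are estimated by the observed proportions of each a → b move across all subjects and visits; a minimal sketch (names ours):

```python
def estimate_transition_matrix(sequences):
    """Estimate the 2x2 first-order transition matrix
    pi_ab = P(Y_j = b | Y_{j-1} = a) from binary (0/1) sequences, using
    the observed proportions of each a -> b transition."""
    counts = [[0, 0], [0, 0]]
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    pi = []
    for a in (0, 1):
        row_total = counts[a][0] + counts[a][1]
        # each row is normalized, so it sums to one (NaN if state a never seen)
        pi.append([counts[a][b] / row_total if row_total else float("nan")
                   for b in (0, 1)])
    return pi
```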
In the regression setting, we model the transition probabilities as functions of covariates Xi j. A very general model uses a separate logistic regression for P(Yi j = 1 | Yi, j−1 = yi, j−1), yi, j−1 = 0, 1. That is, we assume that
and
where β0 and β1 may differ. In words, this model assumes that the effects of explana-
tory variables will differ depending on the previous response. A more concise form
for the same model is
Here, πabc = P(Yi j = c|Yi j−2 = a, Yi j−1 = b); for example π011 is the probability
that Yi j = 1 given Yi j−2 = 0 and Yi j−1 = 1. We could fit four separate logistic re-
gression models, one for each of the four possible histories (Yi j−2 , Yi j−1 ) namely
(0, 0), (0, 1), (1, 0), and (1, 1) with regression coefficients β00 , β01 , β10 , β11 respectively.
But it is again more convenient to write a single equation as follows. By plugging in the different values for yi, j−2 and yi, j−1, we obtain β00 = β, β01 = β + α1, β10 = β + α2, and β11 = β + α1 + α2 + α3. We would again hope that a more parsimonious model fits the data equally well, so that many of the components of α would be zero.
An important special case occurs when there are no interactions between the past
responses, yi j−1 and yi j−2 and the explanatory variables, that is, when all elements of
the αi are zero except the intercept term. In this case, the previous responses affect the
probability of a positive outcome but the effects of the explanatory variables are the
same regardless of the history. Even in this situation, we must still choose between
Markov models of different order. For example, we might start with a third order
model which can be written in the form
A second order model can be used if the data are consistent with α3 = α5 = α6 = α7 = 0; a first order model is implied if αj = 0 for j = 2, . . . , 7. As with any regression coefficients, the interpretation and value of β depend on the other explanatory variables in the model, in particular on which previous responses are included. When inferences about β are the scientific focus, it is essential to check their sensitivity to the assumed order of the Markov regression model. When the Markov transition model is correctly specified, the transition events are uncorrelated, so that ordinary logistic regression can be used to estimate the regression coefficients and their standard errors. However, there may be circumstances when we choose to model P(Yi j | Yi, j−1, · · · , Yi, j−q)
even though it does not equal P(Yi j |Hi j ). For example, suppose there is heterogeneity
across people in the transition matrix due to unobserved factors, so that a reasonable
model is
10.1 Introduction
Many variables are often collected in high-dimensional longitudinal data. The inclu-
sion of redundant variables may reduce accuracy and efficiency for both parameter
estimation and statistical inference. The traditional criteria introduced in Chapter 5
are all subset methods, which become computationally intensive when the dimension of the variables is moderately large. Therefore, it is important to find new methodology for variable selection in the analysis of high-dimensional longitudinal data. Penalized loss function (or −2 log-likelihood) methods have been widely used to select variables in regression models for independent data {(Xi, Yi) : i = 1, · · · , N}. A penalized loss function is composed of a loss function and a penalty function, that is, Q(β) = Σ_{i=1}^N li(β) + N Σ_{j=1}^p Pλ(|β j|), where Pλ(|β|) is a penalty function and λ is a tuning parameter which controls the sparseness of the regression parameters β. Commonly used penalty functions Pλ(|θ|) include:
• Hard thresholding penalty: P_λ(|θ|) = λ² − (|θ| − λ)² I(|θ| < λ).
• Least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996):
Pλ (|θ|) = λ|θ|.
• Smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001), defined through its derivative

P′_λ(|θ|) = λ { I(|θ| ≤ λ) + [(aλ − |θ|)_+ / ((a − 1)λ)] I(|θ| > λ) },

where a > 2; Fan and Li (2001) proposed taking a = 3.7.
• Adaptive LASSO (ALASSO) penalty (Zou, 2006): P_λ(|θ|) = λ|θ|/|θ̂|^γ, where θ̂ is a consistent estimator of θ and γ > 0.
• Elastic net (EN) penalty (Zou and Hastie, 2005), introduced as a mixed penalty to effectively select grouped variables: P_λ(|θ|) = λ_1 ‖θ‖_1 + λ_2 ‖θ‖_2².
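For reference, the penalties above are simple to evaluate numerically. The sketch below is our own illustration (the function names are ours; the SCAD entry implements the derivative P′_λ, which is how SCAD is defined above):

```python
import numpy as np

def hard_penalty(theta, lam):
    """Hard-thresholding penalty: P(|t|) = lam^2 - (|t| - lam)^2 I(|t| < lam)."""
    t = np.abs(theta)
    return lam ** 2 - (t - lam) ** 2 * (t < lam)

def lasso_penalty(theta, lam):
    """LASSO (Tibshirani, 1996): P(|t|) = lam |t|."""
    return lam * np.abs(theta)

def scad_derivative(theta, lam, a=3.7):
    """Derivative of the SCAD penalty (Fan and Li, 2001), with a = 3.7 by default."""
    t = np.abs(theta)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

def alasso_penalty(theta, lam, theta_hat, gamma=1.0):
    """Adaptive LASSO (Zou, 2006): P(|t|) = lam |t| / |theta_hat|^gamma."""
    return lam * np.abs(theta) / np.abs(theta_hat) ** gamma

def en_penalty(theta, lam1, lam2):
    """Elastic net (Zou and Hastie, 2005): lam1 ||t||_1 + lam2 ||t||_2^2."""
    theta = np.asarray(theta)
    return lam1 * np.sum(np.abs(theta)) + lam2 * np.sum(theta ** 2)
```

Note how SCAD behaves like the LASSO near zero (derivative λ) but its derivative decays to zero for large |θ|, which is what yields near-unbiased estimates of large coefficients.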
The penalized methods compress the coefficients of irrelevant variables toward zero and obtain the parameter estimates at the same time. Fan and Li (2001) proposed that a good penalty function should result in an estimator with three properties: unbiasedness, sparsity, and continuity.

where P′_λ(|β|) = (P′_λ(|β_1|), · · · , P′_λ(|β_p|))^T is the p-dimensional vector of derivatives of the penalty function P_λ(|β|) encouraging sparsity in β, and sign(β) = (sign(β_1), · · · , sign(β_p))^T is the vector of signs, with sign(t) = 1 for t > 0, sign(t) = −1 for t < 0, and sign(t) = 0 for t = 0. In Equation (10.1), the tuning parameter λ > 0 controls the complexity of the model.
Wang et al. (2012a) proposed combining the MM (majorize-minimize) algorithm of Hunter and Li (2005) with the Fisher-scoring algorithm to solve the penalized GEE (10.1). For a small ε > 0, the MM algorithm suggests that the penalized estimator β̂ approximately satisfies the following equation:

U_{Nj}(β̂) − N P′_λ(|β̂_j|) sign(β̂_j) |β̂_j| / (|β̂_j| + ε) = 0,   (10.2)
where

H(β̂^(k)) = Σ_{i=1}^N X_i^T A_i^{1/2}(β̂^(k)) R_i^{-1}(α) A_i^{1/2}(β̂^(k)) X_i,

E(β̂^(k)) = diag{ P′_λ(|β̂_1^(k)| + ε) / (|β̂_1^(k)| + ε), · · · , P′_λ(|β̂_p^(k)| + ε) / (|β̂_p^(k)| + ε) }.
Given a selected tuning parameter λ and an initial value of β, such as the estimate
obtained from the GEE with an independence working matrix under a full model,
update the estimate of β via equation (10.3) until the algorithm converges.
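The update can be sketched for the simplest case: a Gaussian, identity-link, working-independence GEE with a LASSO penalty, so that U(β) = X^T(y − Xβ), H = X^T X, and P′_λ(t) = λ. The helper below is our illustration of this scheme, not the authors' code:

```python
import numpy as np

def penalized_gee_update(beta, X, y, lam, eps=1e-6, n_iter=50):
    """Illustrative MM / Fisher-scoring iteration for a penalized GEE,
    specialized (as an assumption) to the Gaussian, identity-link,
    working-independence case with a LASSO penalty P'_lam(t) = lam:
        U(b)   = X'(y - X b),      H = X'X,
        E(b)   = diag( lam / (|b_j| + eps) ),
        b_new  = b + (H + N*E)^{-1} (U(b) - N*E b).
    The eps-perturbation is the MM device of Hunter and Li (2005)."""
    N, _ = X.shape
    H = X.T @ X
    for _ in range(n_iter):
        U = X.T @ (y - X @ beta)
        E = np.diag(lam / (np.abs(beta) + eps))
        beta = beta + np.linalg.solve(H + N * E, U - N * E @ beta)
    return beta
```

Starting from the unpenalized (independence-working-matrix) estimate, coefficients of irrelevant covariates are driven toward zero, while active coefficients incur only a small shrinkage bias of order Nλ/H_jj.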
The tuning parameter λ controls the sparseness of the regression parameters, so it is important to select an appropriate λ. The cross-validation method is very popular for this purpose. The k-fold cross-validation procedure is as follows. Denote the full dataset by T and randomly split it into k nonoverlapping subsets of approximately equal size. Denote the cross-validation training and test sets by T − T_v and T_v, respectively, and obtain the estimator β̂^(v) of β using the training set T − T_v. Form the cross-validation criterion

CV(λ) = Σ_{v=1}^k Σ_{(y_ij, x_ij) ∈ T_v} l(y_ij, x_ij, β̂^(v)),
where k usually takes a value of 5 or 10, and l(·) is a loss function. Wang et al. (2012) proposed taking the negative log-likelihood of the exponential family distribution under a working independence assumption as the loss function. The best tuning parameter is selected by minimizing CV(λ) over a fine grid of λ values. From the MM algorithm, a sandwich formula can be used to estimate the asymptotic covariance of β̂:

Cov̂(β̂) = [H(β̂) + NE(β̂)]^{-1} M(β̂) [H(β̂) + NE(β̂)]^{-1},
where

M(β̂) = Σ_{i=1}^N X_i^T A_i^{1/2}(β̂) R̂_i^{-1}(α) ε_i(β̂) ε_i^T(β̂) R̂_i^{-1} A_i^{1/2}(β̂) X_i,
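The k-fold cross-validation selection of λ described above can be sketched as follows. A closed-form ridge fit stands in for the penalized-GEE estimator, and squared error stands in for the Gaussian working-independence loss (both stand-ins are our assumptions to keep the sketch self-contained):

```python
import numpy as np

def cv_select_lambda(X, y, lambdas, k=5, seed=0):
    """Sketch of k-fold cross-validation for the tuning parameter:
    CV(lam) = sum over folds v of the loss on T_v of the estimator
    fitted on T - T_v.  A closed-form ridge fit is a stand-in for the
    penalized-GEE estimator, and squared error is the loss l(.)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    folds = rng.permutation(n) % k            # k nonoverlapping subsets
    best_lam, best_cv = None, np.inf
    for lam in lambdas:                       # fine grid of lambda values
        cv = 0.0
        for v in range(k):
            tr, te = folds != v, folds == v
            beta = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(p),
                                   X[tr].T @ y[tr])    # fit on T - T_v
            cv += np.sum((y[te] - X[te] @ beta) ** 2)  # loss on T_v
        if cv < best_cv:
            best_lam, best_cv = lam, cv
    return best_lam, best_cv
```

In practice the grid should bracket the interesting range of λ; CV(λ) is typically U-shaped in log λ, and k = 5 or 10 is the usual choice.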
where V_i = R_i(α) A_i^{1/2} and h_i(μ_i(β)) = W_i { ψ(A_i^{-1/2}(Y_i − μ_i)) − C_i }, in which C_i = E{ψ(A_i^{-1/2}(Y_i − μ_i))} is used to ensure Fisher consistency of the estimator. For the Gaussian distribution and the symmetric Huber function, the correction term C_i is exactly equal to zero.
The function ψ(·) is selected to downweight the influence of outliers in the response variable. Fan et al. (2012) chose the Huber function ψ_c(x) = min{c, max(−c, x)}, where the tuning constant c is chosen to give a certain level of asymptotic efficiency at the underlying distribution. Lv et al. (2015) proposed using a bounded exponential score function ψ(t) = exp(−t²/γ), where γ controls how strongly an outlier is downweighted in the estimation. The weight matrix W_i = diag(w_i1, w_i2, · · · , w_i,n_i) is used to downweight the effect of leverage points and can be chosen using the Mahalanobis distance:

w_ij = w_ij(X_ij) = min{ 1, [ b_0 / ((X_ij − m_x)^T S_x^{-1} (X_ij − m_x)) ]^{r/2} }.
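The score and weight functions just described are straightforward to implement. This sketch is our own; in particular, the 95% quantile cutoff for b_0 and the use of the coordinatewise median and sample covariance as stand-ins for the robust location and scatter estimates m_x and S_x are assumptions, not the authors' exact choices:

```python
import numpy as np

def huber_psi(x, c=2.0):
    """Huber score (Fan et al., 2012): psi_c(x) = min{c, max(-c, x)}."""
    return np.minimum(c, np.maximum(-c, x))

def exp_psi(t, gamma=2.0):
    """Bounded exponential score of Lv et al. (2015): psi(t) = exp(-t^2/gamma)."""
    return np.exp(-t ** 2 / gamma)

def mahalanobis_weights(X, b0=None, r=1.0):
    """Leverage weights w_ij = min{1, (b0 / d_ij)^{r/2}}, where d_ij is the
    squared Mahalanobis distance of covariate row X_ij from a centre m_x.
    The coordinatewise median and the sample covariance are stand-ins for
    the robust location/scatter estimates m_x and S_x."""
    m = np.median(X, axis=0)
    S = np.atleast_2d(np.cov(X, rowvar=False))
    Sinv = np.linalg.inv(S)
    d2 = np.einsum('ij,jk,ik->i', X - m, Sinv, X - m)
    if b0 is None:
        b0 = np.quantile(d2, 0.95)   # assumed cutoff: 95% quantile of distances
    return np.minimum(1.0, (b0 / d2) ** (r / 2))
```

Both ψ functions are bounded, so a single aberrant response contributes at most a bounded amount to the estimating equation; the Mahalanobis weights play the analogous role for aberrant covariate rows.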
The MM iterative algorithm can be used to solve Equation (10.5). The correlation parameters and the scale parameter φ need to be estimated before solving (10.5). Let ê_ij = (y_ij − μ_ij(β̂)) / √v(μ_ij(β̂)) be the Pearson residuals. We can obtain a robust estimate of φ through the median absolute deviation
where the summation is over all pairs (j, k), and H is the mean of ψ²(e_ij), j = 1, · · · , n_i, i = 1, · · · , N. When R(α) is an AR(1) correlation matrix, α can be estimated by

α̂ = (1/(NH)) Σ_{i=1}^N (1/(n_i − 1)) Σ_{j ≤ n_i − 1} ψ(e_ij) ψ(e_{i,j+1}).   (10.8)
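Estimate (10.8) can be sketched directly, assuming Huber's ψ (our choice) and residuals supplied as a list of per-subject arrays:

```python
import numpy as np

def ar1_alpha_robust(resid, c=2.0):
    """Robust lag-one AR(1) estimate in the spirit of (10.8):
        alpha_hat = (1/(N*H)) sum_i (n_i - 1)^{-1} sum_{j < n_i} psi(e_ij) psi(e_i,j+1),
    where H is the mean of psi^2 over all residuals and psi is Huber's score
    (an assumed choice).  `resid` is a list of per-subject residual arrays."""
    psi = lambda x: np.minimum(c, np.maximum(-c, np.asarray(x)))
    all_psi = np.concatenate([psi(e) for e in resid])
    H = np.mean(all_psi ** 2)                  # H = mean of psi^2(e_ij)
    N = len(resid)
    s = sum(np.sum(psi(e[:-1]) * psi(e[1:])) / (len(e) - 1) for e in resid)
    return s / (N * H)
```

Because ψ is bounded, a few gross outliers cannot inflate the lag-one products, which is the point of using ψ-residuals rather than raw Pearson residuals here.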
For the tuning parameter, Fan et al. (2012) proposed a robust generalized CV criterion to choose λ that reduces the impact of outliers. Define RSS_R(λ) as the sum of squares of the robustified residuals, that is,

RSS_R(λ) = Σ_{i=1}^N ‖ W_i ψ(A_i^{-1/2}(Y_i − μ_i(β̂_λ))) ‖²,

where β̂_λ is the solution of the penalized robust estimating equation (10.5) when λ is fixed. Select the penalty parameter λ by minimizing the robustified GCV statistic

GCV_R(λ) = [RSS_R(λ)/n] / {1 − d(λ)/n}²,   (10.9)

where d(λ) = tr{[D(β̂_λ) + NE(β̂_λ)]^{-1} D^T(β̂_λ)} is the effective number of parameters. Then λ_opt = argmin_λ GCV_R(λ) is chosen, and the corresponding β̂_λ is the robust penalized estimator of β.
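Given RSS_R(λ) and d(λ), the robustified GCV criterion (10.9) is a one-line computation; this sketch assumes those quantities have already been computed for each λ on the grid (the tuple bookkeeping format is ours):

```python
def robust_gcv(rss_r, d_lambda, n):
    """Robustified GCV statistic (10.9): GCV_R = (RSS_R/n) / (1 - d/n)^2,
    where d is the effective number of parameters."""
    return (rss_r / n) / (1.0 - d_lambda / n) ** 2

def select_by_gcv(grid):
    """Pick the lambda minimizing GCV_R over a grid of
    (lam, rss_r, d_lambda, n) tuples (a hypothetical bookkeeping format)."""
    return min(grid, key=lambda t: robust_gcv(t[1], t[2], t[3]))[0]
```

The denominator {1 − d(λ)/n}² penalizes models whose effective dimension d(λ) is large, so GCV_R trades off robustified fit against model complexity in the usual GCV fashion.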
The specific algorithm is as follows.
• Step 1. Use the estimator β̂_0 obtained from the GEE with an independence working matrix as an initial estimator of β.
• Step 2. Update the correlation parameter α and the scale parameter φ using (10.6)–(10.8).
• Step 3. For a given tuning parameter λ, use the MM algorithm to solve Equation (10.4). Let β̂^(k) be the kth iterate for the fixed λ; the MM iteration is

β̂^(k+1) = β̂^(k) − [D(β̂^(k)) − NE(β̂^(k))]^{-1} [U_R(β̂^(k)) − NE(β̂^(k)) β̂^(k)],   (10.10)

where D(β) = ∂U_R(β)/∂β. The MM algorithm continues until the change between successive iterates is less than a user-defined threshold.
• Step 4. Choose λ using the GCV criterion (10.9).
where β̂_{λ⋆} is the estimator of β for a given λ⋆, and df_{λ⋆} = Σ_{j=1}^p I(δ̂_j ≠ 1) is the number of nonzero elements of the estimator. The optimal tuning parameter is λ⋆_opt = argmin_{λ⋆} RBIC(λ⋆). The procedure for solving Equation (10.11) is as follows.
• Step 1. Give an initial estimate β̂^(0); for example, the estimate obtained via the generalized estimating equations with an independence working matrix can be used.
• Step 2. Estimate the scale parameter φ̂ using (10.6) with the current estimate β̂^(k). For a given working correlation matrix R_i(α), estimate the correlation parameter using (10.7) or (10.8) for the exchangeable or AR(1) structure, respectively, and obtain

V_i{μ_i(β̂^(k)), φ̂^(k)} = R_i(α̂^(k)) Â_i^{1/2}(β̂^(k), φ̂^(k)).
• Step 3. For a given λ, update the estimate of β via the following iterative formula:

β̂^(k+1) = β̂^(k) − [ Σ_{i=1}^n D_i^T Ω_i(μ_i(β)) D_i + Ĝ ]^{-1} { U_n(β) + Ĝ β } |_{β = β̂^(k)},   (10.13)

where Ĝ = (I_p − Δ̂)^{-1} Δ̂, with Δ̂ = diag(δ̂_1, · · · , δ̂_p), and Ω_i(μ_i(β)) = V_i^{-1}(μ_i(β)) Γ_i(μ_i(β)), in which

Γ_i(μ_i(β)) = E[ḣ_bi(μ_i(β))] = E[∂h_bi(μ_i(β))/∂μ_i] |_{μ_i = μ_i(β)},
where t_ik denotes time, and x_i1, · · · , x_i,96 are standardized to have mean zero and unit variance.

Figure 10.1 shows the log-transformed gene expression levels over time for the first twenty genes. The histogram (Figure 10.2) indicates that the distribution of the log-transformed gene expression levels is symmetric. The boxplot (Figure 10.3) indicates that the log-transformed gene expression levels may contain many underlying outliers. We use the penalized methods to identify the important TFs. Table 10.1 presents the number of TFs selected for the G1-stage yeast cell-cycle process using the penalized GEE, penalized Huber (c = 2), and penalized exponential squared loss with the SCAD penalty under three correlation structures (independence, exchangeable, and AR(1)). The results indicate that all three methods tend to select more TFs under the independence correlation structure. Table 10.2 lists the TFs selected by the three penalized methods under the independence, exchangeable, and AR(1) correlation structures. The TFs selected by all three methods can be treated as important TFs. It would be of great interest to further study the other, more controversial TFs and confirm their biological properties using genome-wide binding methods.
Figure 10.1 Plot of the log-transformed gene expression level with time for the first twenty genes.
Figure 10.2 Histogram of the log-transformed gene expression level in the yeast cell-cycle process.
Table 10.1 The number of TFs selected for the G1-stage yeast cell-cycle process with penalized GEE (PGEE), penalized Huber loss (PHUBER), and penalized exponential squared loss (PEXPO) with the SCAD penalty.

Correlation    PGEE  PHUBER  PEXPO
Independence    18     16      15
Exchangeable    13     12      11
AR(1)           14     14      11
Figure 10.3 Boxplot of the log-transformed gene expression level in the yeast cell-cycle process.
Table 10.2 List of selected TFs for the G1-stage yeast cell-cycle process with penalized GEE
(PGEE), penalized Exponential squared loss (PEXPO), and penalized Huber loss (PHUBER)
with SCAD penalty.
PGEE
Independence ABF1 FKH1 FKH2 GAT3 GCR2 MBP1 MSN4 NDD1
PHD1 RGM1 RLM1 SMP1 SRD1 STB1 SWI4 SWI6
Exchangeable FKH1 FKH2 GAT3 MBP1 NDD1 PHD1 RGM1 SMP1
STB1 SWI4 SWI6
AR(1) FKH1 FKH2 GAT3 MBP1 MSN4 NDD1 PHD1 RGM1
SMP1 STB1 SWI4 SWI6
PHUBER
Independence FKH1 FKH2 GAT3 GCR2 IXR1 MBP1 NDD1 NRG1
PDR1 ROX1 SRD1 STB1 STP1 SWI4 SWI6 YAP5
Exchangeable FKH2 GAT3 GCR2 MBP1 NDD1 PDR1 SRD1 STB1
STP1 SWI4 SWI6 YAP5
AR(1) FKH1 FKH2 GAT3 GCR2 MBP1 NDD1 NRG1 PDR1
SRD1 STB1 STP1 SWI4 SWI6 YAP5
PEXPO
Independence ABF1 FKH1 FKH2 GAT3 GCR2 MBP1 MSN4 NDD1
PDR1 PHD1 RGM1 RLM1 SRD1 SWI4 SWI6
Exchangeable ABF1 FKH2 GAT3 GCR2 MBP1 NDD1 RLM1 SRD1
SWI4 SWI6 YAP5
AR(1) ABF1 FKH2 GAT3 GCR2 MBP1 NDD1 RLM1 SRD1
SWI4 SWI6 YAP5
lation structures. Liu (2016) studied longitudinal partially linear models with ultrahigh-dimensional covariates and provided a two-stage variable selection procedure consisting of a quick screening stage and a post-screening refining stage. The proposed approach is based on the partial residual method for dealing with the nonparametric baseline function.

Li et al. (2018) proposed robust conditional quantile correlation and conditional distribution correlation screening procedures that reduce the dimension of the covariates to a moderate order, and then used the kernel smoothing technique to estimate the population conditional quantile correlation and the population conditional distribution correlation for varying coefficient models in ultrahigh-dimensional data analysis.
These studies are mainly about continuous longitudinal data; there are only a few studies on discrete longitudinal data. In addition, when variables are screened in ultrahigh-dimensional longitudinal data, the screening methods developed for independent ultrahigh-dimensional data cannot be applied directly. Zhu et al. (2017) proposed the projection correlation for measuring dependence between random vectors, which provides a new idea for studying correlation between vectors. However, many problems in this area remain to be resolved.
Bibliography
V. J. Carey, and Y-G. Wang. Working covariance model selection for generalized
estimating equations. Statistics in Medicine, 30:3117–3124, 2011.
V. J. Carey, S. L. Zeger, and P. Diggle. Modelling multivariate binary data with
alternating logistic regressions. Biometrika, 80:517–526, 1993.
J. P. Carpenter, S. Pocock, and C. J. Lamm. Coping with missing data in clinical
trials: a model-based approach applied to asthma trials. Statistics in Medicine,
21:43–66, 2002.
R. J. Carroll, and D. Ruppert. Robust estimation in heteroscedastic linear models.
The Annals of Statistics, 10(2):429–441, 1982.
G. Casella, and R. L. Berger. Statistical inference, volume 2. Duxbury Pacific Grove,
CA, 2002.
G. Casella, and E. I. George. Explaining the Gibbs sampler. The American Statistician, 46(3):167–174, 1992.
R. N. Chaganty. An alternative approach to the analysis of longitudinal data via
generalized estimating equations. Journal of Statistical Planning and Inference,
63:39–54, 1997.
R. N. Chaganty, and J. Shults. On eliminating the asymptotic bias in the quasi-least squares estimate of the correlation parameter. Journal of Statistical Planning and Inference, 76:145–161, 1999.
G. Chamberlain. Quantile regression, censoring and the structure of wages. In Pro-
ceedings of the Sixth World Congress of the Econometrics Society (eds. C. Sims
and J.J. Laffont), 2:171–209, 1994.
J. Chen, and N. A. Lazar. Selection of working correlation structure in generalized
estimating equations via empirical likelihood. Journal of Computational and
Graphical Statistics, 21:18–41, 2012.
J. Chen, S. Hubbard, and Y. Rubin. Estimating the hydraulic conductivity at the South Oyster site from geophysical tomographic data using Bayesian techniques based on the normal linear regression model. Water Resources Research, 37(6):1603–1613, 2001.
L. Chen, L. J. Wei, and M. I. Parzen. Quantile regression for correlated observa-
tions. Proceedings of the Second Seattle Symposium in Biostatistics: Analysis of
Correlated Data, 179:51–70, 2004.
M.-H. Chen, and J. G. Ibrahim. Maximum likelihood methods for cure rate models
with missing covariates. Biometrics, 57:43–52, 2002.
M-Y. Cheng, T. Honda, J. L. Li, and H. Peng. Nonparametric independence screen-
ing and structure identification for ultra-high dimensional longitudinal data. The
Annals of Statistics, 42:1819–1849, 2014.
S. J. Clark, and J. N. Perry. Estimation of the negative binomial parameter κ by
maximum quasi-likelihood. Biometrics, 45(1):309–316, 1989.
D. R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34(2):187–220, 1972.
D. R. Cox, and N. Reid. Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society, Series B, 49(1):1–39, 1987.
D. R. Cox, and E. J. Snell. Analysis of Binary Data, Second Edition. Chapman &
Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis,
1989.
M. Crowder. Gaussian estimation for correlated binomial data. Journal of the Royal
Statistical Society, Series B, 47:229–237, 1985.
M. Crowder. On the use of a working correlation matrix in using generalised linear
models for repeated measures. Biometrika, 82:407–410, 1995.
M. Crowder. On repeated measures analysis with misspecified covariance structure. Journal of the Royal Statistical Society, Series B, 63(1):55–62, 2001.
S. Datta, and G. A. Satten. Rank-sum tests for clustered data. Journal of the American Statistical Association, 100(471):908–915, 2005.
M. Davidian, and R. J. Carroll. Variance function estimation. Journal of the Ameri-
can Statistical Association, 82(400):1079–1091, 1987.
M. Davidian, and D. M. Giltinan. Nonlinear Models for Repeated Measurement
Data. Chapman & Hall, London, 1995.
C. S. Davis. Semi-parametric and non-parametric methods for the analysis of re-
peated measurements with applications to clinical trials. Statistics in Medicine,
10(12):1959–1980, 1991.
I. Deltour, S. Richardson, and J.-Y. Le Hesran. Stochastic algorithms for Markov models estimation with intermittent missing data. Biometrics, 55(2):565–573, 1999.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
A. P. Dempster. Covariance selection. Biometrics, 28:157–175, 1972.
A. P. Dempster, C. M. Patel, M. R. Selwyn, and A. J. Roth. Statistical and computational aspects of mixed model analysis. Journal of the Royal Statistical Society, Series C (Applied Statistics), 33(2):203–214, 1984.
D. Deng, and S. R. Paul. Score tests for zero-inflation and over-dispersion in gener-
alized linear models. Statistica Sinica, 15:257–276, 2005.
P. Diggle, and M. G. Kenward. Informative drop-out in longitudinal data analysis
(with discussion). Applied Statistics, 43:49–93, 1994.
P. J. Diggle. An approach to the analysis of repeated measurements. Biometrics, 44
(4):959–971, 1988.
P. J. Diggle, K-Y. Liang, and S. L. Zeger. Analysis of Longitudinal Data. Clarendon
Press, 1994.
P. J. Diggle, K-Y. Liang, and S. L. Zeger. Analysis of Longitudinal Data. Oxford
University Press, Oxford, 2002.
A. Donner. A review of inference procedures for the intraclass correlation coefficient
in the one-way random effects model. International Statistical Review/Revue
Internationale de Statistique, 54(1):67–82, 1986.
A. Donner, and S. Bull. Inferences concerning a common intraclass correlation co-
efficient. Biometrics, 39(3):771–775, 1983.
A. Donner, and J. J. Koval. The estimation of intraclass correlation in the analysis of
family data. Biometrics, 36(1):19–25, 1980.
T. M. Durairajan. Optimal estimating function for non-orthogonal model. Journal of Statistical Planning and Inference, 33:381–384, 1992.
F. Eicker. Asymptotic normality and consistency of the least squares estimators for
families of linear regressions. The Annals of Mathematical Statistics, 34(2):447–
456, 1963.
A. Ekholm, and C. Skinner. The Muscatine children's obesity data reanalysed using pattern mixture models. Journal of the Royal Statistical Society, Series C (Applied Statistics), 47(2):251–263, 1998.
J. Engel. Models for response data showing extra-Poisson variation. Statistica Neerlandica, 38(3):159–167, 1984.
J. Fan, and R. Li. Variable selection via nonconcave penalized likelihood and its
oracle properties. Journal of the American Statistical Association, 96:1348–
1360, 2001.
J. Fan, and J. Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society, Series B, 70:849–911, 2008.
J. Fan, Y. Ma, and W. Dai. Nonparametric independence screening in sparse ultra-
high-dimensional varying coefficient models. Journal of the American Statistical
Association, 109:1270–1284, 2014.
Y. L. Fan, G. Y. Qin, and Z. Y. Zhu. Variable selection in robust regression models
for longitudinal data. Journal of Multivariate Analysis, 109:156–167, 2012.
R. A. Fisher. Statistical methods for research workers. Genesis Publishing Pvt Ltd,
1925.
G. M. Fitzmaurice. A caveat concerning independence estimating equations with
multivariate binary data. Biometrics, 51:309–317, 1995.
G. M. Fitzmaurice, N. M. Laird, and J. H. Ware. Applied Longitudinal Analysis, 2nd Edition. Wiley, 2011.
G. M. Fitzmaurice, N. M. Laird, and A. G. Rotnitzky. Regression models for discrete longitudinal responses (with discussion). Statistical Science, 8:284–299, 1993.
G. M. Fitzmaurice, S. R. Lipsitz, J. G. Ibrahim, R. Gelber, and S. Lipshultz. Estimation in regression models for longitudinal binary data with outcome-dependent follow-up. Biostatistics, 7:469–485, 2006.
L. Y. Fu, and Y-G. Wang. Efficient estimation for rank-based regression with clus-
tered data. Biometrics, 68:1074–1082, 2012a.
L. Y. Fu, Y-G. Wang, and Z. D. Bai. Rank regression for analysis of clustered data: A
natural induced smoothing approach. Computational Statistics and Data Analy-
sis, 54:1036–1050, 2010.
L. Y. Fu, Y-G. Wang, and M. Zhu. A Gaussian pseudolikelihood approach for quantile regression with repeated measurements. Computational Statistics and Data Analysis, 84:41–53, 2015.
L. Y. Fu, Y-G. Wang, and F. Cai. A working likelihood approach for robust regres-
sion. Statistical Methods in Medical Research, 29(12):3641–3652, 2020.
L. Y. Fu, and Y-G. Wang. Quantile regression for longitudinal data with a working
correlation model. Computational Statistics and Data Analysis, 56:2526–2538,
2012b.
L. Y. Fu, and Y-G. Wang. Efficient parameter estimation via Gaussian copulas for quantile regression with longitudinal data. Journal of Multivariate Analysis, 143:492–502, 2016.
K. W. Fung, Z. Y. Zhu, B. C. Wei, and X. M. He. Influence diagnostics and outlier tests for semiparametric mixed models. Journal of the Royal Statistical Society, Series B, 64:565–579, 2002.
J. Geweke. Exact inference in the inequality constrained normal linear regression
model. Journal of Applied Econometrics, 1(2):127–141, 1986.
W. R. Gilks, and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41:337–348, 1992.
V. P. Godambe, and M. E. Thompson. An extension of quasi-likelihood estimation. Journal of Statistical Planning and Inference, 22:137–152, 1989a.
V. P. Godambe. An optimum property of regular maximum likelihood estimation
(Ack: V32 p1343). The Annals of Mathematical Statistics, 31:1208–1212, 1960.
V. P. Godambe, and C. C. Heyde. Quasi-likelihood and optimal estimation. Interna-
tional Statistical Review, 55:231–244, 1987.
V. P. Godambe, and Mary Elinore Thompson. An extension of quasi-likelihood
estimation. Journal of Statistical Planning and Inference, 22(2):137–152, 1989b.
D. Griffin, and R. Gonzalez. Correlational analysis of dyad-level data in the ex-
changeable case. Psychological Bulletin, 118(3):430, 1995.
D. Hall, and T. A. Severini. Extended generalized estimating equations for clustered data. Journal of the American Statistical Association, 93:1365–1375, 1998.
L. P. Hansen. Large sample properties of generalized method of moments estimators.
Econometrica, 50(4):1029–1054, 1982.
J. Hardin, and J. Hilbe. Generalized Estimating Equations. Chapman & Hall, CRC,
2012.
D. A. Harville, and D. R. Jeske. Mean squared error of estimation or prediction
under a general linear model. Journal of the American Statistical Association,
87(419):724–731, 1992.
J. K. Haseman, and L. L. Kupper. Analysis of dichotomous response data from
certain toxicological experiments. Biometrics, 35(1):281–293, 1979.
T. P. Hettmansperger. Statistical Inference Based on Ranks. New York: John Wiley
and Sons, 1984.
C. C. Heyde. Statistical Data Analysis and Inference. Amsterdam: Elsevier, 1987.
C. C. Heyde. Quasi-Likelihood and Its Application: A General Approach to Optimal
Parameter Estimation. New York: Springer, 1997.
L. Y. Hin, and Y-G. Wang. Working-correlation-structure identification in general-
ized estimating equations. Statistics in Medicine, 28(4):642–658, 2009.
D. D. Ho, A. U. Neumann, A. S. Perelson, W. Chen, J. M. Leonard, and M. Markowitz. Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection. Nature, 373:123–126, 1995.
E. B. Hoffman, P. K. Sen, and C. R. Weinberg. Within-cluster resampling.
Biometrika, 88(4):1121–1134, 2001.
L. Hua, and Y. Zhang. Spline-based semiparametric projected generalized estimating
equation method for panel count data. Biostatistics, 13(3):440–454, 2012.
A. Huang, and P. J. Rathouz. Proportional likelihood ratio models for mean regres-
sion. Biometrika, 99:223–229, 2012.
P. J. Huber. The behavior of maximum likelihood estimates under nonstandard con-
ditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statis-
tics and Probability, Volume 1: Statistics, 221–233. University of California
Press, 1967. https://projecteuclid.org/euclid.bsmsp/1200512988.
D. R. Hunter, and R. Li. Variable selection using MM algorithms. The Annals of Statistics, 33:1617–1642, 2005.
J. G. Ibrahim. Incomplete data in generalized linear models. Journal of the American
Statistical Association, 85:765–769, 1990.
J. G. Ibrahim, and S. R. Lipsitz. Parameter estimation from incomplete data in binomial regression when the missing data mechanism is nonignorable. Biometrics, 52(3):1071–1078, 1996.
J. G. Ibrahim, and G. Molenberghs. Missing data methods in longitudinal studies: a
review. Test, 18:1–43, 2009.
J. G. Ibrahim, M. H. Chen, and S. R. Lipsitz. Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable. Biometrika, 88:551–564, 2001.
J. G. Ibrahim, M. H. Chen, S. R. Lipsitz, and A. H. Herring. Missing-data methods for generalized linear models: A comparative review. Journal of the American Statistical Association, 100:332–346, 2005.
G. Inan, and L. Wang. PGEE: An R package for analysis of longitudinal data with high-dimensional covariates. The R Journal, 9:393–402, 2017. doi: 10.32614/RJ-2017-030.
G. Inan, J. H. Zhou, and L. Wang. PGEE: Penalized generalized estimating equations in high-dimension. R package version 1.5, https://CRAN.R-project.org/package=PGEE, 2016.
R. I. Jennrich, and M. D. Schluchter. Unbalanced repeated-measures models with
structured covariance matrices. Biometrics, 42:805–820, 1986.
J. M. Jiang. Linear and Generalized Linear Mixed Models and Their Applications.
Springer, New York, 2007.
J. M. Jiang, and T. Nguyen. The Fence Methods. World Scientific, 2015.
B. Jørgensen. The Theory of Dispersion Models. London: Chapman and Hall, 1997.
S. H. Jung. Quasi-likelihood for median regression models. Journal of the American
Statistical Association, 91:251–257, 1996.
S. H. Jung, and Z. Ying. Rank-based regression with repeated measurements data.
Biometrika, 90:732–740, 2003.
M. C. K. Tweedie. An index which distinguishes between some important exponential families. A4, 22 pp. Pre-print issued for the Golden Jubilee Conference at the Indian Statistical Institute, Calcutta, 1981.
G. Kauermann, and R. J. Carroll. A note on the efficiency of sandwich covariance matrix estimation. Journal of the American Statistical Association, 96:1387–1396, 2001.
B. C. Kelly. Some aspects of measurement error in linear regression of astronomical
data. The Astrophysical Journal, 665(2):1489, 2007.
A. I. Khuri, and H. Sahai. Variance components analysis: a selective literature survey.
International Statistical Review/Revue Internationale de Statistique, 53(3):279–
300, 1985.
J. C. Kleinman. Proportions with extraneous variance: single and independent sam-
ples. Journal of the American Statistical Association, 68(341):46–54, 1973.
R. Koenker, and G. Bassett Jr. Regression quantiles. Econometrica, 46:33–50, 1978.
R. Koenker, and V. D’Orey. Computing regression quantiles. Applied Statistics, 36:
383–393, 1987.
E. L. Korn, and A. S. Whittemore. Methods for analyzing panel studies of acute
health effects of air pollution. Biometrics, 35:795–802, 1979.
L. L. Kupper, and J. K. Haseman. The use of a correlated binomial model for the
analysis of certain toxicological experiments. Biometrics, 34(1):69–76, 1978.
N. M. Laird, and J. H. Ware. Random-effects models for longitudinal data. Biomet-
rics, 38:963–974, 1982.
N. Lange, L. Ryan, L. Billard, D. Brillinger, L. Conquest, and J. Greenhouse (eds.). Case Studies in Biometry. New York: Wiley-Interscience, 1994.
J. F. Lawless. Regression methods for Poisson process data. Journal of the American Statistical Association, 82(399):808–815, 1987.
E. W. Lee, and M. Y. Kim. The analysis of correlated panel data using a continuous-time Markov model. Biometrics, 54(4):1638–1644, 1998.
J. C. Lee. Prediction and estimation of growth curves with special covariance struc-
tures. Journal of the American Statistical Association, 83:432–440, 1988.
Y. Lee, and J. A. Nelder. Hierarchical generalised linear models: A synthesis
of generalised linear models, random-effect models and structured dispersions.
Biometrika, 88:987–1006, 2001.
C. L. Leng, and H. P. Zhang. Smoothing combined estimating equations in quantile
regression for longitudinal data. Statistics and Computing, 24:123–136, 2014.
G. Li, L. Zhu, L. Xue, and S. Feng. Empirical likelihood inference in partially linear
single-index models for longitudinal data. Journal of Multivariate Analysis, 101
(3):718–732, 2010.
X. J. Li, X. J. Ma, and J. X. Zhang. Conditional quantile correlation screening proce-
dure for ultrahigh-dimensional varying coefficient models. Journal of Statistical
Planning and Inference, 197:69–92, 2018.
K-Y. Liang, and S. L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22, 1986.
K-Y. Liang, S. L. Zeger, and B. Qaqish. Multivariate regression analyses for categorical data (with discussion). Journal of the Royal Statistical Society, Series B, 54:3–24, 1992.
D. Y. Lin, and Z. Ying. Semiparametric and nonparametric regression analysis of
longitudinal data. Journal of the American Statistical Association, 96(453):103–
126, 2001.
X. Lin, and R. J. Carroll. Semiparametric estimation in general repeated measures
problems. Journal of the Royal Statistical Society: Series B, 68(1):69–88, 2006.
M. J. Lindstrom, and D. M. Bates. Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83(404):1014–1022, 1988.
S. R. Lipsitz, and G. M. Fitzmaurice. Estimating equations for measures of associa-
tion between repeated binary responses. Biometrics, 52(3):903–912, 1996.
S. R. Lipsitz, N. M. Laird, and D. P. Harrington. Finding the design matrix for the
marginal homogeneity model. Biometrika, 77:353–358, 1990.
S. R. Lipsitz, N. M. Laird, and D. P. Harrington. Generalized estimating equations
for correlated binary data: Using the odds ratio as a measure of association.
Biometrika, 78:153–160, 1991.
S. R. Lipsitz, K. Kim, and L. Zhao. Analysis of repeated categorical data using generalized estimating equations. Statistics in Medicine, 13:1149–1163, 1994.
S. R. Lipsitz, J. G. Ibrahim, and G. Molenberghs. Using a Box-Cox transformation in the analysis of longitudinal data with incomplete responses. Applied Statistics, 49:287–296, 2000.
S. R. Lipsitz, G. M. Fitzmaurice, J. G. Ibrahim, and R. Gelber. Parameter estimation
in longitudinal studies with outcome-dependent follow-up. Biometrics, 58:621–
630, 2002.
R. J. A. Little. Pattern-mixture models for multivariate incomplete data. Journal of
the American Statistical Association, 88:125–134, 1993.
R. J. A. Little. Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association, 90:1113–1121, 1995.
R. J. A. Little, and D. B. Rubin. Statistical Analysis with Missing Data, 2nd Edition.
Wiley, 1987.
J. Liu. Feature screening and variable selection for partially linear models with
ultrahigh-dimensional longitudinal data. Neurocomputing, 195:202–210, 2016.
K-J. Lui, J. A. Mayer, and L. Eckhardt. Confidence intervals for the risk ratio under cluster sampling based on the beta-binomial model. Statistics in Medicine, 19(21):2933–2942, 2000.
J. Lv, H. Yang, and C. H. Guo. An efficient and robust variable selection method
for longitudinal generalized linear models. Computational Statistics and Data
Analysis, 82:74–88, 2015.
M. Gosho, C. Hamada, and I. Yoshimura. Selection of working correlation structure
in weighted generalized estimating equation method for incomplete longitudinal
data. Communications in Statistics - Simulation and Computation, 43:62–81,
2014.
T. Maiti, and V. Pradhan. Bias reduction and a solution for separation of logistic
regression with missing covariates. Biometrics, 65(4):1262–1269, 2009.
L. A. Mancl, and T. A. DeRouen. A covariance estimator for GEE with improved small-sample properties. Biometrics, 57:126–134, 2001.
L. A. Mancl, and B. G. Leroux. Efficiency of regression estimates for clustered data.
Biometrics, 52:500–511, 1996.
K. G. Manton, M. A. Woodbury, and E. Stallard. A variance components approach to
categorical data models with heterogeneous cell populations: Analysis of spatial
gradients in lung cancer mortality rates in North Carolina counties. Biometrics,
37(2):259–269, 1981.
B. H. Margolin, B. S. Kim, and K. J. Risko. The Ames Salmonella/microsome mu-
tagenicity assay: Issues of inference and validation. Journal of the American
Statistical Association, 84(407):651–661, 1989.
P. McCullagh, and J. A. Nelder. Generalized Linear Models (2nd Edition). Chapman
& Hall, 1989.
C. E. McCulloch, and S. R. Searle. Generalized, Linear and Mixed Models. John
Wiley & Sons, New York, USA, 2001.
G. J. McLachlan. On the EM algorithm for overdispersed count data. Statistical
Methods in Medical Research, 6(1):76–98, 1997.
R. Mian, and S. Paul. Estimation for zero-inflated over-dispersed count data model
with missing response. Statistics in Medicine, 35(30):5603–5624, 2016.
A. J. Miller. Subset Selection in Regression. London: Chapman and Hall, 1990.
G. Molenberghs, and G. Verbeke. Models for Discrete Longitudinal Data. Springer,
New York, 2005.
A. Muñoz, V. Carey, J. P. Schouten, M. Segal, and B. Rosner. A parametric family
of correlation structures for the analysis of longitudinal data. Biometrics, 48:
733–742, 1992.
A. Muñoz, B. Rosner, and V. Carey. Regression analysis in the presence of hetero-
geneous intraclass correlations. Biometrics, 42(3):653–658, 1986.
J. A. Nelder, and D. Pregibon. An extended quasi-likelihood function. Biometrika,
74:221–232, 1987.
J. M. Neuhaus, and J. D. Kalbfleisch. Between- and within-cluster covariate effects
in the analysis of clustered data. Biometrics, 54:638–645, 1998.
J. Neyman. Optimal asymptotic tests of composite statistical hypotheses. In Proba-
bility and Statistics: The Harald Cramér Volume, pages 213–234, 1959.
V. Núñez-Antón, and G. G. Woodworth. Analysis of longitudinal data with unequally
spaced observations and time-dependent correlated errors. Biometrics, 50:445–
456, 1994.
R. J. O’Hara-Hines. Comparison of two covariance structures in the analysis of
clustered polytomous data using generalized estimating equations. Biometrics,
54:312–316, 1998.
A. B. Owen. Empirical Likelihood. New York: Chapman and Hall/CRC, 2001.
M. C. Paik. The generalized estimating equation approach when data are not missing
completely at random. Journal of the American Statistical Association, 92:1320–
1329, 1997.
W. Pan. On the robust variance estimator in generalised estimating equations.
Biometrika, 88:901–906, 2001a.
W. Pan. Akaike’s information criterion in generalized estimating equations. Biomet-
rics, 57:120–125, 2001b.
H. D. Patterson, and R. Thompson. Recovery of inter-block information when block
sizes are unequal. Biometrika, 58:545–554, 1971.
S. R. Paul. Maximum likelihood estimation of intraclass correlation in the analysis of
familial data: Estimating equation approach. Biometrika, 77(3):549–555, 1990.
S. R. Paul. Quadratic estimating equations for the estimation of regression and dis-
persion parameters in the analysis of proportions. Sankhya B, 63:43–55, 2001.
S. R. Paul, and T. Banerjee. Analysis of two-way layout of count data involving
multiple counts in each cell. Journal of the American Statistical Association, 93
(444):1419–1429, 1998.
S. R. Paul, and R. L. Plackett. Inference sensitivity for poisson mixtures. Biometrika,
65(3):591–602, 1978.
S. R. Paul. Analysis of proportions of affected foetuses in teratological experiments.
Biometrics, 38:361–370, 1982.
S. R. Paul, and A. S. Islam. Analysis of proportions based on parametric and semi-
parametric models. Biometrics, 51:1400–1410, 1995.
M. S. Pepe, and G. L. Anderson. A cautionary note on inference for marginal regres-
sion models with longitudinal data and general correlated response data. Com-
munications in Statistics, Part B - Simulation and Computation, 23:939–951,
1994.
W. W. Piegorsch. Maximum likelihood estimation for the negative binomial disper-
sion parameter. Biometrics, 46(3):863–867, 1990.
J. C. Pinheiro, and D. M. Bates. Mixed-Effects Models in S and S-PLUS. Springer-
Verlag, New York, 2000.
J. S. Preisser, K. K. Lohman, and P. J. Rathouz. Performance of weighted estimating
equations for longitudinal binary data with drop-outs missing at random. Statis-
tics in Medicine, 21:3035–3054, 2002.
R. L. Prentice, and L. P. Zhao. Estimating equations for parameters in means and
covariances of multivariate discrete and continuous responses. Biometrics, 47:
825–839, 1991.
C. J. Price, C. A. Kimmel, J. D. George, and M. C. Marr. The developmental toxicity
of ethylene glycol in mice. Toxicology and Applied Pharmacology, 81:113–127,
1985.
J. Qin, and J. Lawless. Empirical likelihood and generalized estimating equations.
The Annals of Statistics, 22:300–325, 1994.
A. Qu, B. G. Lindsay, and B. Li. Improving generalised estimating equations
using quadratic inference functions. Biometrika, 87:823–836, 2000.
A. Qu, and P. X. K. Song. Assessing robustness of generalised estimating equations
and quadratic inference functions. Biometrika, 91(2):447–459, 2004.
A. Qu, J. J. Lee, and B. G. Lindsay. Model diagnostic tests for selecting informative
correlation structure in correlated data. Biometrika, 95(4):891–905, 2008.
A. E. Raftery, D. Madigan, and J. A. Hoeting. Bayesian model averaging for linear
regression models. Journal of the American Statistical Association, 92(437):
179–191, 1997.
J. N. K. Rao, and I. Molina. Small Area Estimation, 2nd ed. Wiley, New York, 2015.
C. R. Rao. Linear Statistical Inference and its Applications. Wiley, 1965.
P. J. Rathouz, and L. Gao. Generalized linear models with unspecified reference
distribution. Biostatistics, 10:205–218, 2009.
J. M. Robins, A. Rotnitzky, and L. P. Zhao. Analysis of semiparametric regression
models for repeated outcomes in the presence of missing data. Journal of the
American Statistical Association, 90:106–121, 1995.
J. M. Robins, and A. Rotnitzky. Semiparametric efficiency in multivariate regression
models with missing data. Journal of the American Statistical Association, 90
(429):122–129, 1995.
B. Rosner. Multivariate methods in ophthalmology with application to other paired-
data situations. Biometrics, 40:1025–1035, 1984.
G. J. S. Ross, and D. A. Preece. The negative binomial distribution. Journal of the
Royal Statistical Society: Series D (The Statistician), 34(3):323–335, 1985.
A. Rotnitzky, and N. P. Jewell. Hypothesis testing of regression parameters in semi-
parametric generalized linear models for cluster correlated data. Biometrika, 77:
485–497, 1990.
P. J. Rousseeuw, and B. C. van Zomeren. Unmasking multivariate outliers and lever-
age points. Journal of the American Statistical Association, 85:633–639, 1990.
D. B. Rubin. Inference and missing data. Biometrika, 63:581–592, 1976.
S. Galbraith, J. A. Daniel, and B. Vissel. A study of clustered data and approaches
to its analysis. Journal of Neuroscience, 30(32):10601–10608, 2010.
K. Saha, and S. Paul. Bias-corrected maximum likelihood estimator of the negative
binomial dispersion parameter. Biometrics, 61(1):179–185, 2005.
H. Sahai, A. Khuri, and C. H. Kapadia. A second bibliography on variance compo-
nents. Communications in Statistics-theory and Methods, 14:63–115, 1985.
H. Sahai. A bibliography on variance components. International Statistical Re-
view/Revue Internationale de Statistique, 47(2):177–222, 1979.
S. K. Sahu, and G. O. Roberts. On convergence of the EM algorithm and the Gibbs
sampler. Statistics and Computing, 9(1):55–64, 1999.
D. E. W. Schumann, and R. A. Bradley. The comparison of the sensitivities of similar
experiments: Theory. The Annals of Mathematical Statistics, 28(4):902–920,
1957.
S. R. Searle, G. Casella, and C. E. McCulloch. Variance Components, volume 391.
John Wiley & Sons, 2009.
P. E. Shrout, and J. L. Fleiss. Intraclass correlations: uses in assessing rater reliability.
Psychological Bulletin, 86(2):420–428, 1979.
J. Shults, and R. N. Chaganty. Analysis of serially correlated data using quasi-least
squares. Biometrics, 54:1622–1630, 1998.
J. Shults, and J. M. Hilbe. Quasi-Least Squares Regression. Chapman & Hall/CRC
Monographs on Statistics & Applied Probability. Taylor & Francis, 2014.
S. Sinha, and T. Maiti. Analysis of matched case-control data in presence of nonig-
norable missing exposure. Biometrics, 64(1):106–114, 2008.
A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de
l'Institut de Statistique de l'Université de Paris, 8:229–231, 1959.
P. X.-K. Song. Correlated Data Analysis: Modeling, Analytics, and Applications.
Springer, New York, 2007.
P. X.-K. Song, Z. C. Jiang, E. Park, and A. Qu. Quadratic inference functions in
marginal models for longitudinal data. Statistics in Medicine, 28(29):3683–
3696, 2009.
M. F. Sowers, M. Crutchfield, J. F. Randolph, B. Shapiro, B. Zhang, M. L. Pietra,
and M. A. Schork. Urinary ovarian and gonadotrophin hormone levels in pre-
menopausal women with low bone mass. Journal of Bone and Mineral Research,
13:1191–1202, 1998.
P. T. Spellman, G. Sherlock, M. Q. Zhang, et al. Comprehensive identification of
cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray
hybridization. Molecular Biology of the Cell, 9:3273–3297, 1998.
B. C. Sutradhar, and K. Das. On the efficiency of regression estimators in generalised
linear models for longitudinal data. Biometrika, 86(2):459–465, 1999.
P. F. Thall, and S. C. Vail. Some covariance models for longitudinal count data with
overdispersion. Biometrics, 46:657–671, 1990.
W. A. Thompson Jr. The problem of negative estimates of variance components. The
Annals of Mathematical Statistics, 33:273–289, 1962.
R. J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society, Ser. B, 58:267–288, 1996.
P. Tishler, A. Donner, J. O. Taylor, and E. H. Kass. Familial aggregation of blood
pressure in very young children. CVD Epidemiology Newsletter, 22(45), 1977.
D. Trégouët, P. Ducimetière, and L. Tiret. Testing association between candidate-
gene markers and phenotype in related individuals, by use of estimating equa-
tions. American Journal of Human Genetics, 61:189–199, 1997.
A. B. Troxel, S. L. Lipsitz, and T. A. Brennan. Weighted estimating equations with
nonignorably missing response data. Biometrics, 53:857–869, 1997.
R. S. Tsay. Regression models with time series errors. Journal of the American
Statistical Association, 79(385):118–124, 1984.
M. C. K. Tweedie. An index which distinguishes between some important exponen-
tial families. In Statistics: Applications and New Directions: Proc. Indian Statis-
tical Institute Golden Jubilee International Conference, pages 579–604, 1984.
G. Verbeke, and G. Molenberghs. Linear Mixed Models for Longitudinal Data.
Springer Series in Statistics. Springer, New York, 2000.
L. Wang, J. H. Zhou, and A. Qu. Penalized generalized estimating equations for
high-dimensional longitudinal data analysis. Biometrics, 68(1):353–360, 2012a.
L. Wang, and A. Qu. Consistent model selection and data-driven smooth tests for
longitudinal data in the estimating equations approach. Journal of the Royal
Statistical Society: Series B, 71(1):177–190, 2009.
P. Wang, G-F. Tsai, and A. Qu. Conditional inference functions for mixed-effects
models with unspecified random-effects distribution. Journal of the American
Statistical Association, 107(498):725–736, 2012b.
Y-G. Wang. A quasi-likelihood approach for ordered categorical data with overdis-
persion. Biometrics, 52(4):1252–1258, 1996.
Y-G. Wang. Estimating equations for removal data analysis. Biometrics, 55(4):
1263–1268, 1999a.
Y-G. Wang. Estimating equations with nonignorably missing response data. Biomet-
rics, 55(3):984–989, 1999b.
Y-G. Wang, and V. J. Carey. Working correlation structure misspecification, esti-
mation and covariate design: Implications for generalised estimating equations
performance. Biometrika, 90(1):29–41, 2003.
Y-G. Wang, and V. J. Carey. Unbiased estimating equations from working correla-
tion models for irregularly timed repeated measures. Journal of the American
Statistical Association, 99(467):845–853, 2004.
Y-G. Wang, and L-Y. Hin. Modeling strategies in longitudinal data analysis: Co-
variate, variance function and correlation structure selection. Computational
Statistics and Data Analysis, 54(5):3359–3370, 2009.
Y-G. Wang, and X. Lin. Effects of variance-function misspecification in analysis of
longitudinal data. Biometrics, 61(2):413–421, 2005.
Y-G. Wang, and Y. D. Zhao. Weighted rank regression for clustered data analysis.
Biometrics, 64:39–45, 2008.
Y-G. Wang, and Y. N. Zhao. A modified pseudolikelihood approach for analysis of
longitudinal data. Biometrics, 63(3):681–689, 2007.
Y-G. Wang, and M. Zhu. Rank-based regression for analysis of repeated measures.
Biometrika, 93:459–464, 2006.
Y-G. Wang, X. Lin, and M. Zhu. Robust estimating functions and bias correction for
longitudinal data analysis. Biometrics, 61(3):684–691, 2005.
Y-G. Wang, Q. Shao, and M. Zhu. Quantile regression without the curse of un-
smoothness. Computational Statistics and Data Analysis, 52:3696–3705, 2009.
J. H. Ware, S. Lipsitz, and F. E. Speizer. Issues in the analysis of repeated categorical
outcomes. Statistics in Medicine, 7:95–107, 1988.
R. W. M. Wedderburn. Quasi-likelihood functions, generalized linear models and
the Gauss-Newton method. Biometrika, 61:439–447, 1974.
L. J. Wei, and J. M. Lachin. Two-sample asymptotically distribution-free tests for
incomplete multivariate observations. Journal of the American Statistical Asso-
ciation, 79(387):653–661, 1984.
C. S. Weil. Selection of the valid number of sampling units and a consideration of
their combination in toxicological studies involving reproduction, teratogenesis
or carcinogenesis. Food and Cosmetics Toxicology, 8(2):177–182, 1970.
M. Cho, R. E. Weiss, and M. Yanuzzi. On Bayesian calculations for mixture priors
and likelihoods. Statistics in Medicine, 18:1555–1570, 1999.
H. White. Maximum likelihood estimation of misspecified models. Econometrica,
50(1):1–25, 1982.
P. Whittle. Gaussian estimation in stationary time series. Bulletin of the International
Statistical Institute, 39:1–26, 1961.
D. A. Williams. Dose-response models for teratological experiments. Biometrics,
43:1013–1016, 1987.
J. M. Williamson, S. Datta, and G. A. Satten. Marginal analyses of clustered data
when cluster size is informative. Biometrics, 59(1):36–42, 2003.
R. F. Woolson, and W. R. Clarke. Analysis of categorical incomplete longitudinal
data. Journal of the Royal Statistical Society. Series A (General), 147(1):87–99,
1984.
H. Wu, and A. A. Ding. Population HIV-1 dynamics in vivo: Applicable models and
inferential tools for virological data from AIDS clinical trials. Biometrics, 55:
410–418, 1999.
P. R. Xu, L. X. Zhu, and Y. Li. Ultrahigh dimensional time course feature selection.
Biometrics, 70:356–365, 2014.
G. Yin, and J. Cai. Quantile regression models with multivariate failure time data.
Biometrics, 61:151–161, 2005.
S. L. Zeger, and K-Y. Liang. Longitudinal data analysis for discrete and continuous
outcomes. Biometrics, 42(1):121–130, 1986.
S. L. Zeger, and B. Qaqish. Markov regression models for time series: A quasi-
likelihood approach. Biometrics, 44:1019–1031, 1988.
L. Zeng, and R. J. Cook. Transition models for multivariate longitudinal binary data.
Journal of the American Statistical Association, 102(477):211–223, 2007.
C-H. Zhang, and J. Huang. The sparsity and bias of the lasso selection in high-
dimensional linear regression. The Annals of Statistics, 36(4):1567–1594, 2008.
D. Zhang, X. H. Lin, J. Raz, and M. F. Sowers. Semiparametric stochastic mixed
models for longitudinal data. Journal of the American Statistical Association,
93:710–719, 1998.
W. P. Zhang, C. L. Leng, and C. Y. Tang. A joint modelling approach for longitudinal
studies. Journal of the Royal Statistical Society: Series B, 77(1):219–238, 2015.
X. M. Zhang, and S. Paul. Modified Gaussian estimation for correlated binary data.
Biometrical Journal, 55(6):885–898, 2013.
L. P. Zhao, and R. L. Prentice. Correlated binary regression using a quadratic expo-
nential model. Biometrika, 77:642–648, 1990.
L. P. Zhu, K. Xu, R. Li, and W. Zhong. Projection correlation between two random
vectors. Biometrika, 104:829–843, 2017.
A. Ziegler. Generalized Estimating Equations. Lecture Notes in Statistics. Springer,
2011.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statis-
tical Association, 101(476):1418–1429, 2006.
H. Zou, and T. Hastie. Regularization and variable selection via the elastic net. Jour-
nal of the Royal Statistical Society, Ser. B, 67:301–320, 2005.
S. J. Zyzanski, S. A. Flocke, and L. M. Dickinson. On the nature and analysis of
clustered data. The Annals of Family Medicine, 2(3):199–200, 2004.
Author Index
Eicker, F., 71
Engel, J., 182
Eric, S., 182

Fan, J., 193, 200
Fan, Y.L., 109, 196
Fisher, R.A., 115, 182
Fitzmaurice, G.M., 41, 50, 52, 63, 65, 74, 128–131
Fleiss, J.L., 116
Flocke, S.A., 114
Follmann, 192
Fu, L.Y., 95, 102, 104, 106
Fung, K.W., 8

Galbraith, S., 114
Gao, L., 23
George, E.I., 153
George, J.D., 129
Geweke, J., 150
Gilks, W.R., 165
Giltinan, D., 31, 52
Godambe, V.P., 119, 168
Gonzalez, R., 116
Greenhouse, J., 148
Griffin, D., 116
Guo, C.H., 109, 196

Hall, D., 59
Hardin, J., 82
Harrington, D., 59
Haseman, J.K., 124
Hastie, T., 193
He, X.M., 8
Hettmansperger, T.P., 91
Heyde, C., 45, 55, 94
Hilbe, J., 62, 82
Hin, L-Y., 44, 83
Ho, D.D., 5
Hoeting, J.A., 150
Hoffman, E.B., 134
Honda, T., 200
Huang, J., 150
Hubbard, S., 150
Huber, P.J., 71
Hunter, D.R., 194

Ibrahim, J.G., 143, 145, 150, 153
Inan, G., 199
Islam, A., 125, 168

Jennrich, R., 159
Jewell, N.P., 74, 110
Jiang, J.M., 79, 131, 184
Jørgensen, B., 22, 31
Jung, S.H., 91, 92, 104, 133, 135

Kalbfleisch, J.D., 94
Kapadia, C.H., 116
Kass, E.H., 123
Kauermann, G., 71
Kelly, B.C., 150
Kenward, M., 164
Khuri, A.I., 116
Kim, B.S., 182
Kim, K., 50
Kim, M.Y., 192
Kimmel, C.A., 129
Kirchner, U., 151
Kleinman, J.C., 125
Koenker, R., 91, 99, 102
Korn, E., 187
Koval, J.J., 117, 123
Kupper, L.L., 124

Lachin, J.M., 134
Laird, N.M., 38, 52, 59, 128, 144, 159, 179
Laird, S., 129
Lamm, C.J., 143
Lange, N., 148
Lawless, J.F., 84, 182
Lazar, N.A., 85
Le Hesran, J.-Y., 192
Lee, E.W., 192
Lee, J.C., 35, 199
Lee, Y., 168
Leng, C.L., 37, 106
Leroux, B., 50
Li, B., 72, 99
Li, J.L., 200
Li, R., 193, 194, 203
Li, X.J., 202
Li, Y., 201
Liang, K.-Y., 2, 11, 33, 48, 51, 58, 59, 71, 99, 168, 181, 184, 185
Lin, X.H., 8
Lindsay, B.G., 72, 99
Lindstrom, M.J., 180
Lipsitz, S.R., 50, 59, 63, 65, 150, 153, 155, 169, 184
Little, R.J.A., 138, 142, 143
Liu, J., 202
Lohman, K., 169
Lui, K.-J., 113
Lv, J., 109, 196, 200

Ma, X.J., 202
Ma, Y., 200
Madigan, D., 150
Maiti, T., 150, 153
Mancl, L., 50, 70
Manton, K.G., 182
Margolin, B.H., 182
Marr, M.C., 129
Mayer, J.A., 113
McCullagh, P., 44, 81, 131
McCulloch, C.E., 116, 184
McLachlan, G.J., 184
McShane, L.M., 51
Mendonca, L., 151
Mian, R., 149, 156
Miller, A.J., 79
Molenberghs, G., 142, 143, 178
Molina, I., 113
Muñoz, A., 118, 123

Nelder, J.A., 24, 43, 44, 81, 131, 168, 183
Neuhaus, J.M., 94
Neumann, A.U., 5
Neyman, J., 126
Nguyen, T., 79
Núñez-Antón, V., 35

O'Hara-Hines, R.J., 52
Owen, A.B., 84

Pan, W., 44, 69, 81, 82, 86
Parzen, M.I., 101
Patel, C.M., 135
Patterson, H., 83
Paul, S.R., 8, 72, 118, 123–127, 149, 156, 168, 169, 182
Pepe, M.S., 50
Perry, J.N., 182
Piegorsch, W.W., 149, 182
Pietra, M.L., 7
Pinheiro, J.C., 52
Plackett, R.L., 72, 149, 168, 182
Pocock, S., 143
Pradhan, V., 150, 153
Preece, D.A., 182
Pregibon, D., 24, 43, 168, 183
Preisser, J., 169
Prentice, R.L., 51, 52, 59, 72, 125, 149
Price, C.J., 129

Qaqish, B., 59, 184
Qin, G.Y., 109, 196
Qin, J., 84
Qu, A., 72, 99, 194

Raftery, A.E., 150
Randolph, J.F., 7
Rao, C.R., 144
Rao, J.N.K., 113
Rathouz, P.J., 23, 169
Raz, J., 8
Regier, M.H., 192
Reid, N., 122
Richardson, S., 192
Risko, K.J., 182
Robins, J.M., 169
Rosner, B., 117, 118, 123
Ross, G.J.S., 182
Roth, A.J., 135
Rotnitzky, A., 52, 74, 169
Rousseeuw, P.J., 110
Rubin, D.B., 138, 142–144, 168
Rubin, Y., 150
Ryan, L., 148

Saha, K., 182, 184
Sahai, H., 116
Satten, G.A., 134
Schlattmann, P., 151
Schluchter, M., 159
Schork, M.A., 7
Schumann, D., 114
Searle, S.R., 116, 184
Selwyn, M.R., 135
Sen, P.K., 134
Severini, T.A., 59
Shao, Q., 104
Shapiro, B., 7
Sherlock, G., 199
Shrout, P.E., 116
Shults, J., 59, 61, 62
Sinha, S., 150, 153
Snell, E.J., 187
Song, P.X.-K., 138, 139
Sowers, M.F., 7, 8
Speizer, F.E., 184
Spellman, P., 199
Sutradhar, B.C., 50, 74

Tang, C.Y., 37
Tarone, 126
Taylor, J.B., 150
Taylor, J.O., 123
Thall, P.F., 7, 71
Thompson, M.E., 168
Thompson, R., 83
Thompson, W.A.J., 179
Tibshirani, R.J., 193
Tiret, L., 59
Tishler, P., 123
Trégouët, D., 59
Troxel, A., 169
Tsay, R.S., 186
Tweedie, M.C.K., 31

Ueki, M., 198

Vail, S.C., 7, 71
Verbeke, G., 178

Wang, L., 194, 199
Wang, Y-G., 24, 25, 41, 44, 47, 53, 65, 83, 91–95, 102, 104, 106, 110, 111, 134, 170, 184
Ware, J.H., 38, 128, 129, 159, 179, 184
Wedderburn, R.W.M., 24, 168
Wei, B.C., 8
Wei, L.J., 101, 134
Weil, C.S., 124
Weinberg, C.R., 134
Weiss, R.E., 97
White, H., 42, 71
Whittemore, A., 187
Whittle, P., 25, 60
Wild, P., 165
Williams, D.A., 9, 124, 134
Woodbury, M.A., 182
Woodworth, G.G., 35
Wu, H., 5

Xu, K., 203
Xu, P.R., 201

Yang, H., 109, 196
Yanuzzi, M., 97
Yin, G., 104
Yin, Z., 135
Ying, Z., 91, 92, 133

Zeger, S.L., 2, 11, 33, 48, 51, 58, 59, 71, 99, 168, 181, 184, 185, 187
Zeng, L., 192
Zhang, B., 7
Zhang, C.-H., 150
Zhang, D., 8
Zhang, H.P., 106
Zhang, J.X., 202
Zhang, M.Q., 199
Zhang, W., 37, 169
Zhao, L., 50
Zhao, L.P., 51, 52, 59, 72
Zhao, Y., 47
Zhao, Y.D., 91–93, 111, 134
Zhao, Y.N., 25
Zhong, W., 203
Zhou, J.H., 194, 199
Zhu, L.P., 203
Zhu, L.X., 201
Zhu, M., 94, 104
Zhu, Z.Y., 8, 109, 196
Ziegler, A., 22
Zou, H., 193
Zyzanski, S.J., 114
Subject Index
GBIC, 83
GEE, 41, 48, 52, 74
GEE2, 52, 73
Generalized linear models, 2
Generalized weighted least squares, 54
GLM, 26, 41, 48, 49

Hard penalty function, 193
Heterogeneity, 2
Heteroscedasticity, 7, 78
HIV, 5
Hot deck imputation, 142
Huber's function, 91

Implantation, 11, 129
Imputation by substitution, 142
Indicator function, 101
Induced smoothing method, 104
Interactive effect, 79
Interclass correlation coefficient, 114
Intraclass correlation, 59, 114
Intracluster correlation coefficient, 123

Jacobian matrix, 65, 68

Labor Market Experience, 12
Lagrange multiplier method, 84
Least absolute shrinkage and selection operator, 193
Likelihood function, 81
Linear exponential family, 23
Link function, 26, 80
Logit link, 187

MA(1), 62
Madras Longitudinal Schizophrenia Study, 11
Malformation rates, 9
Mallows-based weighted function, 110
MAR, 137
Marginal mean vector, 1
Marginal models, 2, 65
Maximum likelihood, 179
Maximum likelihood estimator, 179, 182
MCAR, 137
Mean assumption, 46
Median, 133
Medical studies, 1
Minimum Covariance Determinant, 110
Minimum Volume Ellipsoid, 110
ML, 52
MLE, 42
MNAR, 137
Model misspecification, 53
Model selection, 79
Moment estimation, 55
Moment estimator, 74
Mortality rate, 9
Multiple logistic regression, 117
Multivariate normal distribution, 82

NASBA, 5
Non-parametric, 11
Normal log-likelihood, 82

Outlier, 89, 91
Overdispersion parameter, 49, 89

Patterned correlation matrix, 33
Penalized loss function, 193
PI, 5
Poisson variance, 41
Population averaged effects, 177
Power transformation, 58
Profile analysis, 48
Progesterone, 8
Proportional hazard, 23
Pseudolikelihood, 25
Psychiatric symptoms, 11

QIC, 83, 88
QL, 43
QLS, 61, 62
Quadratic inference function, 106
Quantile regression, 91
Quasi-likelihood, 23, 43, 46, 81
Quintic polynomial, 8

Temporal changes, 1
Tilted exponential family, 23
Toeplitz matrix, 85
Toxicological study, 123
Two-level generalized linear model, 131

Uniform correlation structure, 34
Utero, 11