
Analysis of
Longitudinal Data
with Examples

You-Gan Wang
Liya Fu
Sudhir Paul
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press


2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2022 Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and
publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged, please write and let us know so
we may rectify it in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA
01923, 978-750-8400. For works that are not available on CCC please contact
[email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.

ISBN: 978-1-4987-6460-5 (hbk)


ISBN: 978-1-032-19652-7 (pbk)
ISBN: 978-1-315-15363-6 (ebk)

DOI: 10.1201/9781315153636

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
To those who have a passion for longitudinal data analysis
Contents

List of Figures xiii

List of Tables xv

Preface xvii

Author Bios xxi

Contributors xxiii

Acknowledgment xxv

1 Introduction 1
1.1 Longitudinal Studies 1
1.2 Notation 2

2 Examples and Organization of The Book 5


2.1 Examples for Longitudinal Studies 5
2.1.1 HIV Study 5
2.1.2 Progabide Study 5
2.1.3 Hormone Study 7
2.1.4 Teratology Studies 8
2.1.5 Schizophrenia Study 11
2.1.6 Labor Pain Study 11
2.1.7 Labor Market Experience 11
2.1.8 Water Quality Data 13
2.2 Organization of the Book 14

3 Model Framework and Its Components 17


3.1 Distributional Theory 17
3.1.1 Linear Exponential Distribution Family 18
3.1.2 Quadratic Exponential Distribution Family 20
3.1.3 Tilted Exponential Family 20
3.2 Quasi-Likelihood 21
3.3 Gaussian Likelihood 22
3.4 GLM and Mean Functions 23
3.5 Marginal Models 27
3.6 Modeling the Variance 28
3.7 Modeling the Correlation 31
3.8 Random Effects Models 35

4 Parameter Estimation 37
4.1 Likelihood Approach 38
4.2 Quasi-likelihood Approach 39
4.3 Gaussian Approach 41
4.4 Generalized Estimating Equations (GEE) 44
4.4.1 Estimation of Mean Parameters β 45
4.4.2 Estimation of Variance Parameters τ 48
4.4.2.1 Gaussian Estimation 48
4.4.2.2 Extended Quasi-likelihood 49
4.4.2.3 Nonlinear Regression 50
4.4.2.4 Estimation of Scale Parameter φ 50
4.4.3 Estimation of Correlation Parameters 51
4.4.3.1 Stationary Correlation Structures 51
4.4.3.2 Generalized Markov Correlation Structure 54
4.4.3.3 Second Moment Method 55
4.4.3.4 Gaussian Estimation 55
4.4.3.5 Quasi Least-squares 57
4.4.3.6 Conditional Residual Method 58
4.4.3.7 Cholesky Decomposition 59
4.4.4 Covariance Matrix of β̂ 63
4.4.5 Example: Epileptic Data 66
4.4.6 Infeasibility 68
4.5 Quadratic Inference Function 72

5 Model Selection 75
5.1 Introduction 75
5.2 Selecting Covariates 76
5.2.1 Quasi-likelihood Criterion 76
5.2.2 Gaussian Likelihood Criterion 78
5.3 Selecting Correlation Structure 79
5.3.1 CIC Criterion 79
5.3.2 C(R) Criterion 79
5.3.3 Empirical Likelihood Criteria 80
5.4 Examples 82
5.4.1 Examples for Variable Selection 82
5.4.2 Examples for Correlation Structure Selection 82

6 Robust Approaches 89
6.1 Introduction 89
6.2 Rank-based Method 89
6.2.1 An Independence Working Model 89
6.2.2 A Weighted Method 90
6.2.3 Combined Method 92
6.2.4 A Method Based on GEE 93
6.2.5 Pediatric Pain Tolerance Study 94
6.3 Quantile Regression 97
6.3.1 An Independence Working Model 98
6.3.2 A Weighted Method Based on GEE 99
6.3.3 Modeling Correlation Matrix via Gaussian Copulas 99
6.3.3.1 Constructing Estimating Functions 100
6.3.3.2 Parameter and Covariance Matrix Estimation 101
6.3.4 Working Correlation Structure Selection 103
6.3.5 Analysis of Dental Data 103
6.4 Other Robust Methods 105
6.4.1 Score Function and Weighted Function 106
6.4.2 Main Algorithm 107
6.4.3 Choice of Tuning Parameters 108

7 Clustered Data Analysis 111


7.1 Introduction 111
7.1.1 Clustered Data 111
7.1.2 Intracluster Correlation 112
7.2 Analysis of Clustered Data: Continuous Responses 113
7.2.1 Inference for Intraclass Correlation from One-way Analysis
of Variance 113
7.2.2 Inference for Intracluster Correlation from More General
Settings 115
7.2.3 Maximum Likelihood Estimation of the Parameters 117
7.2.4 Asymptotic Variance 120
7.2.5 Inference for Intracluster Correlation Coefficient 121
7.2.6 Analysis of Clustered or Intralitter Data: Discrete Responses 122
7.2.7 The Models 122
7.2.8 Estimation 123
7.2.9 Inference 125
7.3 Some Examples 125
7.4 Regression Models for Multilevel Clustered Data 126
7.5 Two-Level Linear Models 126
7.6 An Example: Developmental Toxicity Study of Ethylene Glycol 127
7.7 Two-Level Generalized Linear Model 129
7.8 Rank Regression 131
7.8.1 National Cooperative Gallstone Study 132
7.8.2 Reproductive Study 133

8 Missing Data Analysis 135


8.1 Introduction 135
8.2 Missing Data Mechanism 135
8.3 Missing Data Patterns 137
8.4 Missing Data Methodologies 139
8.4.1 Missing Data Methodologies: The Methods of Imputation 140
8.4.1.1 Last Value Carried Forward Imputation 140
8.4.1.2 Imputation by Related Observation 140
8.4.1.3 Imputation by Unconditional Mean 140
8.4.1.4 Imputation by Conditional Mean 140
8.4.1.5 Hot Deck Imputation 140
8.4.1.6 Cold Deck Imputation 141
8.4.1.7 Imputation by Substitution 141
8.4.1.8 Regression Imputation 141
8.4.2 Missing Data Methodologies: Likelihood Methods 141
8.5 Analysis of Zero-inflated Count Data With Missing Values 147
8.5.1 Estimation of the Parameters with No Missing Data 148
8.5.2 Estimation of the Parameters with Missing Responses 149
8.5.2.1 Estimation under MCAR 149
8.5.2.2 Estimation under MAR 150
8.5.2.3 Estimation under MNAR 153
8.6 Analysis of Longitudinal Data With Missing Values 155
8.6.1 Normally Distributed Data 155
8.6.2 Complete-data Estimation via the EM 157
8.6.3 Estimation with Nonignorable Missing Response Data
(MAR and MNAR) 160
8.6.4 Generalized Estimating Equations 164
8.6.4.1 Introduction 164
8.6.4.2 Weighted GEE for MAR Data 164
8.6.5 Some Applications of the Weighted GEE 165
8.6.5.1 Weighted GEE for Binary Data 165
8.6.5.2 Two Modifications 166

9 Random Effects and Transitional Models 171


9.1 A General Discussion 171
9.2 Random Intercept Models 172
9.3 Linear Mixed Effects models 173
9.4 Generalized Linear Mixed Effects Models 176
9.4.1 The Logistic Random Effects Models 176
9.4.2 The Binomial Random Effects Models 177
9.4.3 The Poisson Random Effects Models 177
9.4.4 Examples: Estimation for European Red Mites Data and the
Ames Salmonella Assay Data 180
9.5 Transition Models 181
9.6 Fitting Transition Models 184
9.7 Transition Model for Categorical Data 185
9.8 Further reading 188
10 Handling High Dimensional Longitudinal Data 189
10.1 Introduction 189
10.2 Penalized Methods 190
10.2.1 Penalized GEE 190
10.2.2 Penalized Robust GEE-type Methods 192
10.3 Smooth-threshold Method 193
10.4 Yeast Data Study 195
10.5 Further Reading 196

Bibliography 201

Author Index 217

Subject Index 223


List of Figures

2.1 Parallel lines of log(RNA copies) in HIV study. 6


2.2 Boxplots of the number of seizures for epileptics at baseline and
four subsequent 2-week periods: “0” indicates baseline. 8
2.3 Scatterplot matrix for epileptics at baseline and four subsequent
2-week periods: solid circle for placebo; hollow circle for progabide. 9
2.4 Progesterone levels (on log scale) plot against the number of days in
a standardized menstrual cycle. 10
2.5 Parallel lines of pain scores for active and placebo groups. 12
2.6 Boxplots of pain scores for placebo and active groups. 12
2.7 Histograms of pain scores for active and placebo groups. 13
2.8 Scatter plot of log wage against education for south and other places. 14
2.9 The boxplots of the income of the women with different education
time. 15
2.10 Time series plot of the values of Total Cyanophyte from 4 sites. 16
2.11 Q-Q plot for total Cyanophyte counts from 20 sites. 16

6.1 The boxplots of the time in log2 seconds of pain tolerance in four
trials in the pediatric pain study. 95
6.2 The boxplots of the time in log2 seconds of pain tolerance for
girls and boys in four trials in the pediatric pain study. 96
6.3 Scatterplot of the time in log2 seconds of pain tolerance for four
trials in the pediatric pain study. 97
6.4 The distance versus the measurement time for boys and girls. 104
6.5 Boxplots of distances for girls and boys. 105

8.1 Graphic representation of five missing data patterns 137

10.1 Plot of the log-transformed gene expression level with time for the
first twenty genes. 196
10.2 Histogram of the log-transformed gene expression level in the yeast
cell-cycle process. 197
10.3 Boxplot of the log-transformed gene expression level in the yeast
cell-cycle process. 198

List of Tables

2.1 The seizure count data for 5 subjects assigned to placebo (0) and 5
subjects assigned to progabide (1). 7

3.1 Components of commonly used exponential families. 19

4.1 Quasi-likelihood and extended quasi-likelihood for a single
observation y_it 40
4.2 Mean and variance for the two-week seizure count within each
group. 68
4.3 Parameters estimates from models with different variance functions
and correlation structures for the epileptic data. 69

5.1 The values of the criteria QIC, EQIC, GAIC, and GBIC under two
different working correlation structures (CS) for Example 1. 83
5.2 The results obtained by the GEE with independence (IN), exchange-
able (EX), and AR(1) correlation structures in Example 2. 84
5.3 The values of the criteria under independence (IN), exchangeable
(EX), and AR(1) correlation structures in Example 2. 84
5.4 The results are obtained via the QIC function in MESS package. 85
5.5 The results obtained by the GEE with independence (IN), exchange-
able (EX), and AR(1) correlation structures in Example 3. 86
5.6 The results obtained by the GEE with independence (IN), exchange-
able (EX), and AR(1) correlation structures when removing the
outliers in Example 3. 87
5.7 The results of the criteria for the data with and without the outliers
in Example 3. 87
5.8 The correlation matrix of the data with and without the outliers in
Example 3. 87

6.1 Parameter estimates and their standard errors (SE) for the pediatric
pain tolerance study. 98
6.2 Parameter estimates (Est) and their standard errors (SE) of esti-
mators β̂I , β̂EX , β̂AR and β̂ MA obtained from estimating equations
with independence, exchangeable, AR(1), and MA(1) correlation
structures, respectively, and frequencies of the correlation structure
identified using GAIC and GBIC criteria for the dental dataset. 106

7.1 Artificial data of Systolic blood pressure of children of 10 families 121
7.2 Data from Toxicological experiment (Paul, 1982). (i) Number of live
fetuses affected by treatment. (ii) Total number of live fetuses. 126

8.1 Complete data pattern. 138


8.2 Univariate missing data pattern. 138
8.3 Uniform missing data pattern. 139
8.4 Monotonic missing data pattern. 139
8.5 Arbitrary missing data pattern. 140
8.6 The EM algorithm (Dempster et al., 1977) 144
8.7 Counts of leukemia cases 146
8.8 Estimates and standard errors of the parameters for DMFT index
data. 156
8.9 Estimates and standard errors of the parameters for DMFT index
data with covariates. 156

10.1 The number of TFs selected for the G1-stage yeast cell-cycle process
with penalized GEE (PGEE), penalized Exponential squared loss
(PEXPO), and penalized Huber loss (PHUBER) with SCAD penalty. 197
10.2 List of selected TFs for the G1-stage yeast cell-cycle process
with penalized GEE (PGEE), penalized Exponential squared loss
(PEXPO), and penalized Huber loss (PHUBER) with SCAD penalty. 199
Preface

Longitudinal data are ubiquitous in economics, medical studies, and environmental


research. The fundamental framework in statistics is regression, in which many problems
and solutions can be embedded. Generally, the solution hinges on a linear combination
of the data for predictors, parameter estimators, or estimating functions.
In a regression framework, each observation is regarded as a “replicate”, as the differences
are taken care of by the nonzero coefficients of the regressors.
Why do we need to model the correlations? A simple example is the paired t-test
scenario, where the paired data are correlated. Ignoring such correlation will
lead to misuse of the independent t-test, which will produce misleading inferences.
Analysis of correlated data needs to account for the correlations of each observation
with all the remaining observations. This defining correlation feature of longitudinal
data makes modeling dependence a key topic in longitudinal data analysis. The
dependence structure determines how observations borrow information from one another
for better prediction, and which technique applies depends on how we describe or
model the underlying correlations. Random effects models are the most intuitive, as
an extension of linear regression or generalized linear models, while marginal models
choose to model the correlation structures directly. The concept of correlation
herein is the same as in time series models. The unique issue is that the number of
observations for each subject is usually small, and the inference will need to rely on
independent “replicates” from many subjects.
In fact, correlation must be accounted for properly in order to obtain valid inferences.
For example, in longitudinal studies, the sequential nature of the measures
implies that certain types of correlation structures are likely to arise. Therefore,
careful modeling is needed to make the hypothesized model as close as possible to the
true one. Chapter 2 presents eight examples, and all the datasets are available for
readers to investigate further.
Chapter 3 introduces the basic components of the statistical models for longitudinal
data analysis. While correlation modeling is the key, such modeling and
estimation become meaningless when the other parts are not modeled properly.
All statistical models have explicit and implicit assumptions to work. This does
not necessarily mean the model does not work if certain assumptions are in doubt or
violated. It is, therefore, of great practical and theoretical interest to investigate the
implications of misspecification. This is why the GEE (Generalized Estimating
Equation) approach is elegant: the correlation structure is just a “working” correlation
model, and consistency is not affected when it is incorrectly specified, but the framework
allows you to reach optimality if the specified correlation model is very close to
the true one. Likelihood-based estimation is important as it lays the foundation for
estimation and inference. If we are interested in quantifying the uncertainties in the
resultant estimates (such as standard errors and confidence intervals), we would have
to investigate implications of any misspecified model components, i.e., the variance
and correlation parts in our cases. Chapter 4 presents various parameter estimation
approaches for the mean and variance functions and correlation structures.
The marginal model consists of three components: the mean function, the variance
function, and the correlation structure. Chapter 5 introduces criteria for selecting
each of these three key parts. Selecting the useful predictors (regressors) in the mean
function is a classical topic. We introduce only the quasi-likelihood and Gaussian
likelihood methods in this respect for longitudinal data analysis. In fact, the Lasso
approach using the L1 norm is also applicable, although it is often used when the number
of predictors is very high.
In longitudinal studies, the collected data often deviate from normality, and
the response variable and/or predictors may contain potential outliers, which
often cause serious problems for variable selection and parameter estimation.
Therefore, robust methods have attracted much attention in recent years. Robust
approaches using rank and quantile regression are given in Chapter 6.
Clustered data refer to a set of measurements collected from subjects that are
structured in clusters; such data arise in many biostatistical practices and environmental
studies. Responses from a group of genetically related members of a familial
pedigree constitute clustered data in which the responses are correlated. Chapter 7 is
devoted to the methodology for such data analysis.
Missing data or missing values occur when no information is available on the response,
some of the predictors, or both, for some subjects who participate in a study
of interest. There can be a variety of reasons for the occurrence of missing values.
Nonresponse occurs when the respondent does not respond to certain questions due
to stress, fatigue, or lack of knowledge. Some individuals in the study may not respond
because some questions are sensitive. Missing data can create difficulty in the
analysis because nearly all standard statistical methods presume complete information
for all the variables included in the analysis. Chapter 8 deals with methodologies
for the analysis of longitudinal data with missing values.
Traditional designed experiments involve data that need to be analyzed using
a fixed-effects model or a random-effects model. Central to variance components
models is the distinction between fixed and random effects. Each effect in a variance
components model must be classified as either a fixed or a random effect. Fixed
effects arise when the levels of an effect constitute the entire population of interest.
Chapter 9 develops the methodology for analyzing longitudinal data
using random effects and transitional models.
High-dimensional longitudinal data involving many variables are often collected.
When a large number of predictors are collected, as in phenotypical studies to
identify responsible genes, the inclusion of redundant variables can reduce the accuracy
and efficiency of estimation and prediction (inflated false discovery rate and reduced
power). Therefore, it is important to select
the appropriate covariates in analyzing longitudinal data. However, it is a challenge
to select significant variables in longitudinal data due to the underlying correlations
and the unavailability of a full likelihood. Chapter 10 presents how the lasso-type approach can be used
in longitudinal data analysis.
This book is written for (1) applied statisticians and scientists who are interested
in how dependence is taken care of in analyzing longitudinal data; (2) data analysts
who are interested in the development of techniques for longitudinal data analysis;
and (3) graduate students and researchers who are interested in research on correlated
data analysis. This book is also suitable as a graduate textbook to assist
students in learning advanced statistical thinking and seeking potential projects in
correlated data analysis.
Author Bios

Professor Wang obtained his Ph.D. in dynamic optimization in 1991 (University of


Oxford) and worked for CSIRO (2005–2010). Before returning to Australia, Professor
Wang worked for the National University of Singapore (2001–2005) and Harvard
University as Assistant Professor and Associate Professor (1998–2000) in biostatistics.
He joined the University of Queensland in April 2010 as Chair Professor of Applied
Statistics to lead the Centre for Applications in Natural Resource Mathematics
and to promote applied statistics and mathematics. Currently, he is Capacity Building
Professor in Data Science at Queensland University of Technology, Australia.
Professor Wang has developed a number of novel statistical methodologies in
longitudinal data analysis published in top statistical journals (Biometrika, Biometrics,
Statistics in Medicine, Journal of the American Statistical Association, Annals
of Statistics). His recent interests and successes include (1) a “working” likelihood
approach for hyperparameter estimation and model selection, (2) integrating statistical
learning and machine learning for dependent data analysis, and (3) a data-driven
approach for robust estimation. More recently, he has advocated “working” likelihood
approaches to parameter estimation while recognizing that a possibly different
likelihood may have generated the observed data when making inferences. This has been found very useful in
finding data-dependent tuning parameters in robust estimation and hyper-parameters
in machine learning algorithms. He has published over 175 papers in international
SCI journals.
Liya Fu obtained her Ph.D. in 2010 from Northeast Normal University. Currently,
she is an Associate Professor of Statistics at Xi’an Jiaotong University. She worked
briefly as a Postdoctoral Fellow at the University of Queensland, after two years as a
visiting student at CSIRO (2008–2010), Australia. Dr. Fu mainly focuses on
methodologies for the analysis of longitudinal data and has published more
than 20 papers in international journals, including Biometrics, Statistics in Medicine,
and the Journal of Multivariate Analysis.
Professor Sudhir Paul obtained his Ph.D. in 1976 (University of Wales). He
worked as a postdoctoral fellow (University of Newcastle Upon Tyne, 1976–1978)
and a lecturer (University of Kent at Canterbury, 1978–1982) before moving to
Canada in 1982. He started as Assistant Professor at the University of Windsor and
moved through all professorial ranks and finally, in 2005, became a distinguished
University Professor. He became a fellow of the Royal Statistical Society in 1982
and a fellow of the American Statistical Association in 1986.
Professor Paul has developed many methodologies for the analyses of overdispersed
and zero-inflated count data, longitudinal data, and familial data, and has
published in most of the top-tier journals in statistics (Journal of the Royal Statistical
Society, Biometrika, Biometrics, Journal of the American Statistical Association,
Technometrics). Professor Paul has supervised over 50 graduate students, including
16 Ph.D. students, and has published over 100 papers.
Contributors

You-Gan Wang
School of Mathematical Sciences,
Queensland University of Technology,
Brisbane, Australia

Liya Fu
School of Mathematics & Statistics,
Xi’an Jiaotong University,
Xi’an, China

Sudhir Paul
Department of Mathematics & Statistics,
University of Windsor,
Ontario, Canada

Acknowledgment

This book is partially supported by an Australian Research Council (ARC) Discovery
Project (DP160104292), the Australian Research Council Centre of Excellence
for Mathematical and Statistical Frontiers (ACEMS) under grant number
CE140100049, and the Natural Science Foundation of China (No. 11871390). We
also wish to thank Dr. Qibin Duan, Dr. Jiyuan An, Mr. Ryan (Jinran) Wu, and Miss
Jiaqi Li for their assistance with the LaTeX work.

Chapter 1

Introduction

1.1 Longitudinal Studies


Understanding the variance and covariance structure in the data is an imperative part
of statistical inference. Quantification of parameter values and their standard errors
critically depends on the underlying structure that generated the data. The challenge
for a modeler is to specify what the average or mean function should be and how to
describe most appropriately the variance/covariance structure in the data. Such mean
functions will be crucial in forecasting, and the variance/covariance functions are
often used for weighting and to describe the uncertainties in future observations.
Longitudinal data are routinely collected in this fashion in a broad range of applications,
including agriculture and the life sciences, medical and public health research,
and industrial applications. In longitudinal studies, the same experimental
units (such as patients, trees, and sites) are observed or measured multiple times
over a period of time. The experimental units can be patients in medical studies,
trees in forestry studies, and animals in biological studies; they can also be sites or
buildings where water or air quality data are collected. This repeating or clustering
nature exhibited in the data gives longitudinal data certain variance/covariance
structures that we need to reflect in our model.
Longitudinal data contain temporal changes from each individual. In
contrast to a cross-sectional study, in which a single outcome is measured for each
individual, the prime advantage of a longitudinal study is its effectiveness for studying
changes over time. Observations from the same patients are correlated,
and this correlation must be taken into account in statistical analysis. Thus, it is necessary
for a statistical model to reflect the way in which the data were collected in
order to address these questions.
We assume that the response variable from subject i at time j is represented as
y_ij, which has mean µ_ij and variance φσ²_ij for j = 1, ..., n_i and i = 1, ..., N. Here φ is
an unknown scale parameter, µ_ij and σ²_ij are known functions of the covariates with
unknown parameters, and X_ij is a p × 1 vector. Let µ_i = (µ_ij) be the marginal
mean vector for subject i, and let Y_i^T = (y_i1, ..., y_in_i). The covariance matrix of Y_i, V_i, is
assumed to have the form φ A_i^(1/2) R_i(α) A_i^(1/2), in which A_i = diag(σ²_ij) and R_i(α) is the
correlation matrix parameterized by α, a q-vector.
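As a small illustration (our sketch, not from the book), the following R code builds this working covariance matrix for one subject with n_i = 4 observations under an AR(1) working correlation; the numerical values of φ, σ²_ij, and α are arbitrary.

ni     <- 4
phi    <- 1.5                                # scale parameter phi (illustrative value)
sigma2 <- c(1.0, 1.2, 0.9, 1.1)              # sigma^2_ij, j = 1, ..., n_i (illustrative)
alpha  <- 0.6                                # AR(1) correlation parameter (illustrative)

A_half <- diag(sqrt(sigma2))                 # A_i^(1/2) = diag(sigma_ij)
R      <- alpha^abs(outer(1:ni, 1:ni, "-"))  # AR(1) structure: R[j, k] = alpha^|j - k|
Vi     <- phi * A_half %*% R %*% A_half      # V_i = phi A_i^(1/2) R_i(alpha) A_i^(1/2)
Vi                                           # the resulting 4 x 4 covariance matrix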

We first consider the special case when n_i = 1 and all observations are independent
of each other. This is basically the setup of the generalized linear models
(GLMs). In GLMs, we have µ_ij = h(X_ij^T β), a function of a linear combination
of the covariates, and the variance σ²_ij is related to the covariates via some known
function of the mean.
Later on, we will allow multiple observations from the same subjects, and correlations
will have to be allowed among these within-subject observations, while observations
from different subjects are assumed independent.
In longitudinal studies, a variety of models can be used to meet the different purposes
of the research. For example, some experiments focus on individual responses, while
others emphasize average characteristics. Two popular approaches have been developed
to accommodate different scientific objectives: the random effects model and
the marginal model (Liang et al., 1992).
The random-effects model is a subject-specific model that models the source of
the heterogeneity explicitly. The basic premise behind the random-effects model is
that we assume that there is natural heterogeneity across individuals in a subset of
the regression coefficients. That is, a subset of the regression coefficients is assumed
to vary across individuals according to some distribution. Thus the coefficients have
an interpretation for individuals.
The marginal model is a population-average model. When inferences about the
population average are the focus, marginal models are appropriate. For example,
in a clinical trial, the average difference between control and treatment is most
important, not the difference for a particular individual.
The main difference between the marginal and random-effects models is the
way in which the multivariate distribution of the responses is specified. In a marginal
model, the mean response is modeled conditional only on fixed covariates, while
in a random-effects model it is conditioned on both covariates and random effects.
The random-effects models can be seen as a likelihood-based approach, while the
marginal approach is semiparametric in the sense that only the mean function and
the covariance structure are modeled via some parametric functions (and the likelihood
function is avoided).
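To make the contrast concrete, here is a schematic illustration (ours, not taken from the book) using a linear random-intercept model:

E(Y_ij | b_i) = X_ij^T β + b_i,   b_i ~ N(0, σ_b²),

so that marginally

E(Y_ij) = X_ij^T β,   cov(Y_ij, Y_ik) = σ_b² + σ² 1{j = k},

i.e., the random intercept induces an exchangeable marginal correlation σ_b²/(σ_b² + σ²). In this linear case, the subject-specific and marginal mean models coincide; with a nonlinear link, they generally differ.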

1.2 Notation
In general, we will use capital letters to represent vectors and small letters to represent
scalars. Whether a letter represents a random variable or an observation will be
clear from the context.

N: the number of subjects.
(y_ij, X_ij, t_ij): data from subject i, 1 ≤ j ≤ n_i and i = 1, 2, ..., N.
y_ij: response from subject i at time t_ij.
µ_ij: mean of y_ij.
var(y_ij) = σ²_ij = φν(µ_ij).
g(·): the link function.
β = (β_1, ..., β_p)^T: parameter vector of dimension p in the mean function.
β̃: the true value of β.
α: parameter vector in the correlation matrix.
γ: parameter vector in the variance function.

Y_i = (y_i1, ..., y_in_i)^T,   X_i = (X_i1, ..., X_in_i)^T,   Y = (Y_1^T, ..., Y_N^T)^T,

in which X_i is the n_i × p matrix with (j, l) entry x_ijl, and X_ij = (x_ij1, ..., x_ijp)^T,
j = 1, 2, ..., n_i.

µ_i = E(Y_i) = (µ_i1, ..., µ_in_i)^T, and V_i = var(Y_i) is the n_i × n_i matrix with
diagonal entries var(Y_ij) and off-diagonal (j, k) entries Cov(Y_ij, Y_ik).
The true parameter vector is denoted by β̃; i.e., we can decompose Y_i as

Y_i = g(X_i β̃) + ε_i,

in which ε_i = (ε_i1, ..., ε_in_i)^T and E(ε_i) = 0.


Chapter 2

Examples and Organization


of The Book

We now introduce some longitudinal studies. Some of the datasets will be used for
illustration throughout this book.

2.1 Examples for Longitudinal Studies


2.1.1 HIV Study
In a medical study, a measure of viral load may be taken at monthly intervals from
patients with HIV infection. The objective was to establish viral dynamics (Kinetics
of HIV and CD4 turnover). For example, Ho et al. (1995) concluded a mean virus
half-life of about 2 days and a mean production of 0.68 ± 0.13 × 10⁹ virions per day.
Elucidation of HIV dynamics greatly advanced AIDS research since HIV was first
identified in 1983.
In the AIDS Clinical Trial Group (ACTG) Protocol 315 (Wu and Ding, 1999), 53
HIV-infected patients were treated with potent antiviral drugs (ritonavir monotherapy
for the first 10 days, with 3TC and AZT added on day 10), consisting of protease
inhibitor (PI) and reverse transcriptase inhibitor (RTI) drugs. Plasma HIV-1 RNA
copies were repeatedly measured on days 2, 7, 10, 14, 21, and 28 and weeks 8, 12, 24, and
48 after initiation of treatment. The nucleic acid sequence-based amplification assay
(NASBA) was used to measure plasma HIV-1 RNA. This assay offers a lower limit
of detection of 100 RNA copies/ml plasma. Five patients discontinued the study due
to drug intolerance and other problems. Therefore, the observations were collected
from only 48 patients. Figure 2.1 shows the HIV viral load measurements (plasma
HIV-1 RNA copies) for the 48 evaluable patients.

Figure 2.1 Parallel lines of log(RNA copies) in HIV study.

2.1.2 Progabide Study


A clinical trial was conducted in which 59 people with epilepsy suffering from simple
or partial seizures were assigned at random to receive either the anti-epileptic drug
progabide (subjects 29-59) or an inert substance (a placebo, subjects 1-28). Because
each individual might be prone to different rates of experiencing seizures, the inves-
tigators first tried to get a sense of this by recording the number of seizures suffered


by each subject over the 8-week period prior to the start of administration of the as-
signed treatment. It is common in such studies to record such baseline measurements
so that the effect of treatment for each subject may be measured relative to how that
subject behaved before treatment. Following the commencement of treatment, the
number of seizures for each subject was counted for each of four consecutive 2-week
periods. The age of each subject at the start of the study was also recorded, as it
was suspected that the age of the subject might be associated with the effect of the
treatment somehow. The primary objective of the study was to determine whether
progabide reduces the rate of seizures in subjects like those in the trial.
The data for the first five subjects in each treatment group are shown in Table 2.1.
The boxplots of the number of seizures for epileptics (see Figure 2.2) indicate that
there exist outliers. Thus, a robust method should be considered when analyzing this
dataset. Figure 2.3 indicates that there are also strong within-subject correlations in
the epileptic data, as reported in Thall and Vail (1990), who presented a number of
covariance models to account for overdispersion, heteroscedasticity, and dependence
among repeated observations.

Figure 2.2 Boxplots of the number of seizures for epileptics at baseline and four subsequent
2-week periods: “0” indicates baseline.

Figure 2.3 Scatterplot matrix for epileptics at baseline and four subsequent 2-week periods:
solid circle for placebo; hollow circle for progabide.

Table 2.1 The seizure count data for 5 subjects assigned to placebo (0) and 5 subjects assigned
to progabide (1).

                     Period
Subject     1     2     3     4    Trt   Baseline   Age
   1        5     3     3     3     0       11       31
   2        3     5     3     3     0       11       30
   3        2     4     0     5     0        6       25
   4        4     4     1     4     0        8       26
   5        7    18     9    21     0       66       22
  ...
  29       11    14     9     8     1       76       18
  30        8     7     9     4     1       38       32
  31        0     4     3     0     1       19       20
  32        3     6     1     3     1       10       30
  33        2     6     7     4     1       19       18

2.1.3 Hormone Study


In this study, a total of 492 urine samples were collected from 34 women (with men-
strual cycles) aged between 27 and 45 years, and urinary progesterone was assayed
on alternate days. Each woman contributed from 11 to 28 observations over a period
of time; hence the data are unbalanced. One purpose of the study was to test the effect
of age and body mass index (BMI) on women’s progesterone level after an appro-
priate adjustment of their menstrual cycles. More details can be found in Sowers et
al. (1998). Figure 2.4 indicates that the log-transformed progesterone level exhibits a
nonlinear effect in time. Zhang et al. (1998) and Fung et al. (2002) found that some
outliers exist in the data and fitted the data using a semiparametric mixed model,
which included the covariates age, body mass index (BMI), and nonlinear time effects.
In Chapter 5, we will use a quintic polynomial to fit the time effect. In addition to the
age, BMI, and time effects, we will also consider their interaction effects.

Figure 2.4 Progesterone levels (on log scale) plotted against the number of days in a
standardized menstrual cycle. Solid line indicates the estimated population mean curve.

2.1.4 Teratology Studies


Paul (1982) reported that the mean litter sizes for female banded Dutch rabbits
exposed to control, low, medium, and high levels of toxic doses are 7.96, 7.00, 7.19,
and 5.94, respectively. The corresponding observed malformation rates are 0.135,
0.135, 0.344, and 0.228, i.e., the normality rates are (0.865, 0.865, 0.656, 0.772). The
medium dose level, instead of the high dose level, appears to have the lowest survival
rate (Williams, 1987). As we will see, this is probably due to ignoring death in
utero.
However, the average numbers of normal fetuses produced by each dam are
6.889, 6.052, 4.714, and 4.588. The ratios to the control group are (0.878, 0.684,
0.666), which exhibit a decreasing trend. The dramatic reduction in the average

number of normal fetuses per dam clearly shows the adverse effects of this toxic
agent. By now, without any parametric modeling, we can see an obvious adverse ef-
fect from this monotonic trend. The adjusted proportions of normal fetuses for the
four dose levels are (0.865, 0.760, 0.592, 0.576), assuming N, the average initial
population size for each group, is 7.96, as observed in the control group.
These proportions of normal fetuses become (0.765, 0.673, 0.524, 0.509) if we as-
sume N = 9. Note that for any values assumed for N, the ratios to the control group
are still (0.878, 0.684, 0.666). This shows that roughly the excess risks for the low,
medium, and high dose groups are 0.122, 0.316, and 0.334 (the actual dose values
are not available).
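The following R lines (ours) reproduce the arithmetic above; the printed values agree with the quoted figures up to rounding.

normal_per_dam <- c(6.889, 6.052, 4.714, 4.588)  # control, low, medium, high
normal_per_dam[-1] / normal_per_dam[1]           # ratios to control: approx. 0.878, 0.684, 0.666
normal_per_dam / 7.96                            # adjusted proportions, N = 7.96: approx. 0.865, 0.760, 0.592, 0.576
normal_per_dam / 9                               # adjusted proportions, N = 9: approx. 0.765, 0.673, 0.524, 0.509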
Parametric models are usually used to quantify the relationship with dose levels
and other covariates. This will allow us to determine the virtually safe dose (Ryan,
1992). Traditional risk assessment is based on the probabilities of abnormal fetuses
within each litter. The risk of death in utero is not accounted for because the inference

is based on successful implantations. The additional risk of death in utero due to the
toxic agent has to be modeled to obtain the overall risk. Estimation of the overall
risk ratios for different dose levels will depend on modeling the average response
(including death in utero) numbers per dam, which may not be directly observed or
reliably recorded. If death in utero is ignored, the risk of the agent will be underestimated.
Development of offspring is subject to the risk of failing to form due to toxins, although
the number of such littermates may not always be fully observed. This risk
can also be interpreted as unsuccessful implantation. The offspring will be further
subject to adverse effects after successful implantation. It seems appropriate to take
into account death in utero as well as other observed abnormalities in fetuses in
risk assessment, although this is not currently required by regulatory agencies such
as the Environmental Protection Agency and the Food and Drug Administration.

2.1.5 Schizophrenia Study


We use the data set from the Madras Longitudinal Schizophrenia Study available
from Diggle et al. (2002) as an example for binary data. This study investigated
the course of positive and negative psychiatric symptoms over the first year after
initial hospitalization for schizophrenia. The response variable (Y) is binary, indi-
cating the presence of thought disorder. The covariates are (i) Month: duration
since hospitalization (in months); (ii) Age: age of patient at the onset of symptoms
(1 represents Age < 20; 0 otherwise); (iii) Gender: gender of patient (1 = female;
0 = male); (iv) Month×Age: interaction between Month and Age; and
(v) Month×Gender: interaction between Month and Gender.

2.1.6 Labor Pain Study


In this study, 83 women were randomized to either the placebo group (40 women) or
the active treatment group (43 women). Treatment was initiated when cervical dilation
was 8 cm. The self-reported amount of pain was recorded on a 100 mm line (0 = no
pain, 100 = very much pain) at 30-minute intervals over a period of 180 minutes. The
length of the line to the left of the subject’s perceived pain level was measured to the
nearest 0.5 mm; therefore, the outcome variable is essentially continuous. In the lat-
ter part of the study, there were many missing values. We assume the data are missing
completely at random. More details can be found in Davis (1991).
Parallel lines and boxplots (Figures 2.5 and 2.6) indicate that pain scores seem to
be higher in the placebo group than in the active group at each time point. The distribution
of the pain score is extremely non-normal (see Figure 2.7 for the placebo and
active treatment groups); thus, the use of a nonparametric method is indicated.

Figure 2.5 Parallel lines of pain scores for active and placebo groups.

Figure 2.6 Boxplots of pain scores for placebo and active groups.

Figure 2.7 Histograms of pain scores for active and placebo groups.

2.1.7 Labor Market Experience


We now consider a survey example from the National Longitudinal Survey
of Labor Market Experience (Center for Human Resource Research, 1989). Years of
education, age, income in each year, and location of the subjects were collected. A
subsample of 3,913 women between 14 and 26 years old who had completed their
education and were earning wages in excess of $1/hour but less than $700/hour is
used in the analysis. The incomes for each woman were collected.
The response variable in this example is the annualized income (assuming 2,000
working hours a year). The log-transformed annualized incomes for the cohort are
analyzed first, followed by a re-analysis using the untransformed incomes. We are
interested in how education affects the pay rate and whether there is a difference between
southerners and others. Figures 2.8 and 2.9 indicate that average

income increases with years of education, and that the average income in the south is
lower than elsewhere for women with the same education.
When modeling this dataset, the candidate covariates are (i) age: the age at the
time of the survey; (ii) grade: the current grade that has been completed; and (iii)
south: whether a person came from the south.

Figure 2.8 Scatter plot of log wage against education for south and other places.

Figure 2.9 Boxplots of the incomes of women with different years of education.

2.1.8 Water Quality Data


This dataset was collected from 32 sites from 1997 to 2007 at a dam in South East
Queensland, Australia. Twelve water quality indicators were monitored, including
Total Cyanophytes, Chlorophyll A, and Ammonia Nitrogen. An important water
quality indicator is the total amount of cyanophytes in the water because almost all
toxic freshwater blooms are caused by cyanophytes. Cyanophyte blooms form unpleasant
surface scums and cause odor in the water. Some cyanophytes are toxic and
can cause deaths. Cyanophyte levels are known to be affected by season and the rate
of flow of water. We present repeated measurements of the values of total cyanophytes
from four sites in Figure 2.10, which indicates that there exist some outliers. The
Q-Q plot (Figure 2.11) shows that the distribution of the total cyanophytes is skewed.
Therefore, robust nonparametric methods could be considered for these data.

Figure 2.10 Time series plot of the values of Total Cyanophytes from 4 sites.

Figure 2.11 Q-Q plot for total Cyanophyte counts from 20 sites.

2.2 Organization of the Book


The remainder of this book is organized as follows. Chapter 3 discusses the generalized
linear model (GLM) and the quasi-likelihood method and also introduces how
to model the variance and the correlation. Chapter 4 mainly discusses methods of
estimating parameters for longitudinal data. Criteria for variable selection and correlation
structure selection are introduced in Chapter 5. Chapter 6 discusses several
robust methods for parameter estimation. Chapter 7 mainly introduces clustered
data analysis and statistical inference for intra-cluster correlation. Chapter 8 mainly
focuses on missing data, including the patterns and mechanisms of missing data, and
how to analyze longitudinal data with missing values. Chapter 9 introduces another
two important models in longitudinal data analysis: random-effects models and transitional
models. The last chapter, Chapter 10, provides the “penalty” approach for
high-dimensional data. This lasso-type approach has been found very effective in
eliminating redundant genes and keeping useful predictors.
Chapter 3

Model Framework and Its Components

Suppose that data are collected from N subjects, and Y_i is the response from
subject i, where i = 1, ..., N. We first assume Y_i is univariate in this chapter, and
then allow Y_i to be a vector of responses in longitudinal or clustered data analysis in
Chapter 4. The predictors associated with Y_i are collected in a vector X_i of dimension
p. The design matrix for all the observations is then a matrix of dimension p × N,
X = (X_1, ..., X_N). The response vector is Y = (Y_1, ..., Y_N)^T. Note that Y_i is also
referred to as the dependent variable, output variable, or endogenous variable in
econometrics, and the X_i vector is also referred to as the independent variables, input
variables, explanatory variables, or exogenous variables.

3.1 Distributional Theory


We are familiar with distribution types such as the normal, Poisson, and binomial
families. They are the fundamental elements in distributional theory. They are
“static” in the sense that, for a given parameter value, the distribution is fixed,
describing the randomness of an observation at a given time. In regression or conditional
modeling, we describe how Y changes when X varies. If f(y|x_1)
is the probability density function (pdf) of the response given X = x_1, how should
we describe the distribution when X = x_2? If we believe a particular family of distributions
is rich enough, f(y|x_1) and f(y|x_2) will correspond to two members of the
distribution family associated with, or determined by, two parameter values. If we
think of a series of observations from a number of different subjects, we may impose
a set of parameter values, one for each subject. Each distribution family can
be regarded as a convenient mathematical description of randomness for a particular
type of data. Nevertheless, theoretically, it is possible that distributions from different
subjects may not come from the same family. This motivates the development of the
exponential distribution family, on which the generalized linear models (GLMs) hinge.
A different distribution can be obtained by simply using a different exponential
parameter that is determined by the linear combination of the predictors. Therefore,
we can think of each family as consisting of many different members, which can be
used to differentiate the distribution functions (including the mean and variance) as
the predictors change.
Let Y be a random variable representing any of the responses, which is characterized
by a distribution function f(y). Its randomness may be contributed by many
factors. When we are interested in investigating the relationship between Y and X
(causal or just a predictor), we ask how Y changes when X
changes. This is reflected in the conditional distribution function f(y|x). For example,
suppose Y is the body weight of an individual from the Australian population. Its
randomness is partially caused by biological differences due to X = male or female;
we therefore may be interested in investigating the conditional distributions of Y given
X, f(y|X = Male) and f(y|X = Female). In this case, if the normal family is used,
we can allow a shift in the mean to reflect the difference between these two distributions.
However, the population with a larger mean may exhibit a larger variance as well. To
this end, we can either model the variance just as we model the mean, or we can link
the variance to the mean via a function.
Distribution families all imply a functional relationship between the mean and the
variance functions due to probabilistic regularization. If X is a continuous variable,
such as height in inches, f(y|X = x) is then a function of x (f may not be continuous in
x). Intuitively, conditioning on X explains certain variations due to X, so the variance
of the residuals should shrink. As more and more useful predictors (X_i) are used,
f(y|X_i) should have smaller and smaller variance, leading to more accurate prediction.
Of course, this accurate prediction relies on our accurate description of f(y|X_i), which
is not easy, especially when the dimension of X_i becomes large.
Exercise 1. If var(Y) exists, prove E(var(Y|X)) ≤ var(Y) by establishing the following
identity:

var(Y) = E(var(Y|X)) + var(E(Y|X)).
Exercise 2. Generate 1,000 normal responses from a linear model in R: y_i with
mean µ_i = a + bx_i and variance σ²_i = (µ_i/75)², where x_i is either 0 or 1 with 50%
probability. Obtain the density plots for the two groups, with a = 165 and b = 5. This
may represent male and female heights (cm) from 1,000 people. If we wish to have a
skewed distribution instead, what distribution could we use? Revise the R code
accordingly.
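One possible R sketch for this exercise (our code; the Gamma choice for the skewed variant is just one option):

set.seed(1)
n  <- 1000
x  <- rbinom(n, 1, 0.5)                  # x_i is 0 or 1 with 50% probability
mu <- 165 + 5 * x                        # mu_i = a + b * x_i with a = 165, b = 5
y  <- rnorm(n, mean = mu, sd = mu / 75)  # sd_i = mu_i / 75

plot(density(y[x == 0]), main = "Heights by group", xlab = "y (cm)")
lines(density(y[x == 1]), lty = 2)       # density plots for the two groups

# A skewed alternative with the same mean and variance: a Gamma distribution
# with shape = (mu/sd)^2 = 75^2 and rate = shape / mu (the skew is mild here
# because the shape parameter is large).
y_skew <- rgamma(n, shape = 75^2, rate = 75^2 / mu)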
In classical regression with regressors X_i, we assume f(y|X_i) is normal,
N(X_i^T β, σ²). As X_i changes, the response mean is assumed to change in a linear
fashion. Here X_i can be random or fixed, and the marginal density f(y) is not of interest.

3.1.1 Linear Exponential Distribution Family


Suppose Y is a random variable whose distribution takes the form

f(y|θ) = exp{ [T(y)θ − b(θ)]/φ + c(y, φ) } dν(y),     (3.1)

where T(y) is a known function such as y or y^k with a known k, θ is the natural (also
known as canonical) parameter, φ is a dispersion parameter, c(y, φ) is a known
function of y and φ, and ν(y) is a measure function.
Example 1. For a given power parameter k, the Weibull distribution belongs to the
exponential family with T(y) = y^k:

f(y|θ) = exp{ [y^k θ − b(θ)]/φ + c(y, φ) },   y ∈ (0, ∞),

where θ = λ^(−k) and b(θ) = log(θ).

Table 3.1 Components of commonly used exponential families: φ is the dispersion parameter,
b(θ) is the cumulant function, µ(θ) is the mean as a function of the natural parameter, V(µ) is
the mean–variance relationship, and the final row is the inverse function of µ(θ), the canonical
link to η = X^T β. If the link function h(µ) = X^T β nominated is not the canonical one, η = X^T β
is not the natural parameter θ any more.

         Normal     Poisson   Binomial            Gamma      IGaussian
         N(µ, σ²)   Poi(λ)    Binom(m, π)         Γ(µ, ν)    IG(µ, λ)
φ        σ²         1         1/m                 1/ν        1/λ
b(θ)     θ²/2       e^θ       m log(1 + e^θ)      −log(−θ)   −√(−2θ)
µ(θ)     θ          e^θ       me^θ/(1 + e^θ)      −1/θ       1/√(−2θ)
V(µ)     1          µ         µ(m − µ)            µ²         µ³
θ        µ          log(µ)    log{µ/(m − µ)}      1/µ        −1/(2µ²)


The most commonly used exponential family is the linear exponential family
with T(y) = y. Consider a set of independent random variables Y_1, ..., Y_N, each with
a distribution from the exponential family,

f(y; θ, φ) = exp{ [yθ − b(θ)]/φ + c(y, φ) } dν(y),     (3.2)

in which θ is the natural (also known as canonical) parameter, φ is a dispersion
parameter, and ν(y) is a measure function. Here y can be continuous or categorical
(ν(y) is the Lebesgue measure for continuous y and the counting measure for discrete
y). Clearly, E(Y) = µ = b′(θ) and var(Y) = φb′′(θ), both functions of θ. It is also
common to write the variance as

var(Y) = φV(µ),

where V(·) is the so-called variance function.


In fact, b(θ) is the most important function: it determines the type of distribution
(Gaussian, Poisson, etc.); its first derivative b′(θ) is the mean, while its second
derivative determines the variance function (V = b′′(θ)) of the distribution. Table
3.1 lists the commonly used exponential families.
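As a quick worked check of the Poisson column (a standard derivation, included here for illustration): writing

f(y; λ) = e^(−λ) λ^y / y! = exp{y log λ − λ − log y!},

we identify θ = log λ, b(θ) = e^θ, φ = 1, and c(y, φ) = −log y!. Then E(Y) = b′(θ) = e^θ = µ and var(Y) = φ b′′(θ) = e^θ = µ, so V(µ) = µ, as in Table 3.1.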
Example 2. Suppose Y follows a normal distribution N(µ, σ²), where σ² does not
change with µ:

f(y|µ, σ²) = (2πσ²)^(−1/2) exp{−(y − µ)²/(2σ²)} ∝ exp{(yµ − µ²/2)/σ²}.

Here the proportionality symbol ∝ is used because the constant terms free of the
parameter of interest (i.e., θ in this case) are ignored. Clearly, we have b(θ) = θ²/2, and
σ² is the dispersion parameter.
Example 3. The inverse Gaussian distribution describes the time to the first passage
of a Brownian motion and has pdf

f(y|µ, λ) = √{λ/(2πy³)} exp{−λ(y − µ)²/(2µ²y)} ∝ exp{λ(yθ + √(−2θ))},

in which the natural parameter θ = −1/(2µ²) is always negative and b(θ) = −√(−2θ).
Therefore, under the canonical link, the linear score η = X^T β must be negative.
Example 4. Let us now consider a general power function including two limiting
cases for the Gamma and Poisson distributions:

b(θ) = [(τ − 1)/τ] {θ/(τ − 1)}^τ    if τ ≠ 0 or 1,
b(θ) = −log(−θ)                     if τ = 0 (Gamma distribution),
b(θ) = e^θ                          if τ → ∞ (Poisson distribution).

The normal and inverse Gaussian distributions are included when τ = 2 and
τ = −0.5, respectively. The distributions corresponding to other τ values are known as
the Tweedie exponential family (Tweedie, 1984; Jørgensen, 1997). The power relationship
between the mean µ = b′(θ) and the variance b′′(θ) is seen from
b(θ) = {θ/(τ − 1)}^τ (τ − 1)/τ, where τ ≠ 1 and can be positive or negative. Note that
the limiting value of b(θ) is −log(−θ) when τ = 0. It is interesting to see that τ = 2, 0,
and −0.5 correspond to the normal, Gamma, and inverse Gaussian distributions.

3.1.2 Quadratic Exponential Distribution Family


A straightforward generalization of the linear exponential family is to include a y²
term in (3.2):

f(y|θ, λ) = exp{ [yθ + y²λ − b(θ, λ)]/φ + c(y, φ) } dν(y),     (3.3)

in which both θ and λ are functions of the mean and variance, and

∂b(θ, λ)/∂θ = µ,   ∂b(θ, λ)/∂λ = E(Y²),
∂²b(θ, λ)/∂θ² = σ²,   ∂²b(θ, λ)/∂λ² = var(Y²).

More details can be found in the book by Ziegler (2011, Ch. 2).

3.1.3 Tilted Exponential Family


Consider a regression approach when we are interested in investigating the effect of
X_i — how Y_i changes as X_i changes. Suppose we use X_0 as the reference values (such
as the male group, or the average age, etc.), and the corresponding response variable
has the distribution f(y|X_0^T β), which is the reference distribution. To model the effect
of X_i, we only need to look at the ratio

f(y|X_i^T β) / f(y|X_0^T β).

Under the linear exponential family framework, this ratio is

f(y|X_i^T β) / f(y|X_0^T β) = exp{ [(θ_i − θ_0)y − b(θ_i) + b(θ_0)]/φ }.     (3.4)
This leads to the tilted distribution family (Rathouz and Gao, 2009),

f(y|X_i^T β) = exp(b_i + θ_i y) f(y|X_0^T β),

where b_i is the normalizing constant so that ∫ f(y|X_i^T β) dy = 1, and θ_i is chosen so
that the mean assumption is met, ∫ y f(y|X_i^T β) dy = µ_i. This generalizes the linear
exponential family in the sense that f(y|X_0^T β) is unspecified and does not have to take
the linear exponential form. Huang and Rathouz (2012) proposed semiparametric
and empirical likelihood approaches for estimating β.
This tilted model bears some similarity to the proportional hazards model (Cox,
1972), where the hazard function f(y)/(1 − F(y)), instead of f(y), is tilted. The exponential
distribution family provides a family of distributions indexed by the canonical
parameter values, but the family does not specify how these family members are
linked. The GLM hinges on the exponential family with a link function specifying
how the canonical values are linked to the covariates. The variance is still the derivative
of the mean with respect to θ, but no explicit forms exist in general for either the
mean or the variance function.
The tilted exponential family provides more than just a family of distributions,
and the “canonical” parameter value is implicitly defined by the desired marginal
mean via the link function. The “canonical” parameter here plays a tilting role so
that the reference distribution is tilted as the distribution at the new mean values.
Much more work is needed on this topic for analysis of longitudinal data.
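To make the tilting operation concrete, here is a minimal numerical sketch (the discrete reference distribution and all names are our own illustrative choices, not from Rathouz and Gao, 2009): starting from an arbitrary reference pmf, we solve for the tilting parameter θ that moves the mean to a desired target, renormalizing along the way.

```python
import numpy as np
from scipy.optimize import brentq

# An arbitrary discrete reference distribution on the support y = 0, 1, ..., 10
y = np.arange(11)
f0 = np.exp(-0.5 * (y - 3.0) ** 2 / 4.0)
f0 /= f0.sum()

def tilted_mean(theta):
    # Tilt the reference pmf by exp(theta * y) and renormalize
    w = f0 * np.exp(theta * y)
    w /= w.sum()
    return w @ y

# Choose theta so that the tilted distribution has mean 5 instead of about 3
theta_hat = brentq(lambda t: tilted_mean(t) - 5.0, -2.0, 2.0)
print(theta_hat, tilted_mean(theta_hat))
```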

3.2 Quasi-Likelihood
When the true likelihood is difficult to specify, we can rely on the quasi-likelihood (QL) function proposed by Wedderburn (1974). Instead of claiming that the true distribution of yi is known, we only need to model the first two moments of the distribution and construct a "likelihood"-like function (the so-called quasi-likelihood) to work with. In the regression context, we specify only the relationship between the mean and the covariates, and the variance as a function of the mean (for weighting).
We have assumed that Y is a random variable with mean µ = h(X⊤β) and variance φV(µ). The QL (on the log scale) for Yi is defined as (Wedderburn, 1974)
$$Q(\mu_i; y_i) = \int_{y_i}^{\mu_i} \frac{y_i - t}{\phi V(t)}\, dt.$$
The overall QL for all the independent observations, Y = (Y1, . . . , YN), is defined as
$$Q(\mu; Y) = \sum_{i=1}^{N} Q(\mu_i; Y_i),$$
where µ is the mean vector.
Remark 1: The QL behaves like a likelihood function, but it is not meant to be the true likelihood that generates the observed data.
Remark 2: The QL function is more than just a "working" likelihood in the sense that it is meant to be an approximation to the true likelihood when only the first two moments are matched.
Remark 3: The QL function can be treated as the true likelihood in the sense that the resultant score functions and the information matrix are valid.
Remark 4: The true score function may not be a linear combination of the data (for example, yᵢ² can be involved), and in that case the QL is still valid for estimating β and deriving the information matrix. Of course, if the true likelihood is used, the MLE will be more efficient, but at the cost that potential biases exist in β̂ when the likelihood is misspecified even though the first two moments are correctly matched.
Remark 5: In the case that the variance function V(µ) is incorrect, the QL score function is still valid for estimating β, but the information matrix becomes invalid.
For dependent observations, a multivariate analysis will be used. The QL can be defined in a similar way,
$$Q(\mu; Y) = \int_{t(s)} (Y - t)^{\top} V^{-1}(t)\, dt(s),$$
where t(s) is a smooth curve of dimension n joining Y and µ. For this integral to be meaningful as a likelihood, it needs to be path-independent (§9.3, McCullagh and Nelder, 1989). For longitudinal data analysis, the joint likelihood for the n random variables from some subject can be written in the form of n products as
$$f(y_1, y_2, \ldots, y_n) = f(y_1)\, f(y_2|y_1)\, f(y_3|y_1, y_2)\cdots f(y_n|y_1, y_2, \ldots, y_{n-1}). \tag{3.5}$$

Wang (1996) constructed the quasi-likelihood (QL) for f(yj|y1, y2, . . . , yj−1) in the context of categorical data analysis. An interesting application was given by Wang (1999a) for estimating a population size from removal data, where the conditional quasi-likelihood was constructed via the conditional mean and conditional variance of the catch given the previous total catches.
Remark 6: This approach can easily be extended to any multivariate variables by modeling the conditional means and variances. The resultant QL function is, in general, not a scaled version of the ordinary log-likelihood function.

3.3 Gaussian Likelihood


Let us first consider the independence case assuming ni = 1. In the absence of the likelihood function that generates the data, one may pretend that the yi were generated from a normal distribution N(µi, σi²), in which σi² = σ²g(µi), where µi = h(Xiβ). This is known as the pseudo-likelihood approach. The −2 log-likelihood function is then
$$L_{G0}(\beta; \gamma) = \sum_{i=1}^{N}\left(\frac{y_i - \mu_i}{\sigma_i}\right)^2 + \sum_{i=1}^{N}\log(\sigma_i^2). \tag{3.6}$$

The question is to what extent the estimation and inference remain valid. It is amazing how valid this approach can be, even for binary data. Whittle (1961) and Crowder (1985) introduced Gaussian estimation as a vehicle for estimation without requiring that the data be normally distributed. The function above may also be called a pseudo-likelihood (Davidian and Carroll, 1987). For longitudinal data analysis, the Gaussian approach can also be applied (Crowder, 1985, 2001; Wang and Zhao, 2007). Recall that the scalar response yij is observed for cluster i (i = 1, . . . , N) at time j (j = 1, . . . , ni). For the ith cluster, let Yi = (yi1, . . . , yini)⊤ be an ni × 1 response vector, and µi = E(Yi) is also an ni × 1 vector. We denote Cov(Yi) by Σi, which has the general form φVi, where Vi = Ai^{1/2}RiAi^{1/2}, with Ai = diag{Var(yit)} and Ri being the correlation matrix of Yi. For independent data, Σi is just φAi.
The Gaussian log-likelihood (multiplied by −2) for the data (Y1, . . . , YN) is
$$L_G(\beta; \tau) = \sum_{i=1}^{N}\left\{\log|\Sigma_i| + (Y_i - \mu_i)^{\top}\Sigma_i^{-1}(Y_i - \mu_i)\right\}, \tag{3.7}$$

where β is the vector of regression coefficients governing the mean, and τ is the vector of additional parameters needed to model the covariance structure realistically. Later on, we will let θ collect all the parameters, including both β and τ. Then we can write µi and Σi in parametric form: µi = µi(β) and Σi = Σi(β, τ). The Gaussian estimation is performed by minimizing LG(θ) over θ.
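As a minimal sketch of Gaussian estimation (assuming, for illustration only, a linear mean µi = Xiβ, unit variances, and a common AR(1) working correlation; all names are ours), one can code (3.7) directly and minimize it numerically:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, n, p = 50, 4, 2
X = rng.normal(size=(N, n, p))
beta_true = np.array([1.0, -0.5])

def ar1(alpha, n):
    idx = np.arange(n)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

# Simulate correlated Gaussian responses with AR(1) correlation 0.6
L = np.linalg.cholesky(ar1(0.6, n))
Y = X @ beta_true + rng.normal(size=(N, n)) @ L.T

def neg2_loglik(theta):
    beta, alpha = theta[:p], np.tanh(theta[p])  # keep |alpha| < 1
    R = ar1(alpha, n)
    _, logdet = np.linalg.slogdet(R)
    resid = Y - X @ beta
    quad = np.einsum('ij,jk,ik->', resid, np.linalg.inv(R), resid)
    return N * logdet + quad  # equation (3.7) with sigma^2 = 1

fit = minimize(neg2_loglik, x0=np.zeros(p + 1))
print("beta:", fit.x[:p], " alpha:", np.tanh(fit.x[p]))
```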
Exercise 3. Suppose independent data (y1, . . . , yN) are generated from the distribution function (3.1). Prove that $\sum_{i=1}^{N} T(y_i)$ is a sufficient statistic for θ.
Exercise 4. For the exponential distribution family given by (3.2), show that E(Y) = b′(θ) and var(Y) = φb″(θ).
Exercise 5. Suppose Y is a compound distribution in the sense that Y|θ ∼ Poisson(θ) and θ|λ is also Poisson(λ). Show that
(a). Var(Y) = 2λ.
(b). Work out the probability P(Y = 5).
(c). Verify $\sum_{i=0}^{\infty} P(Y = i) = 1$.
(d). Verify $\sum_{i=0}^{\infty} i\,P(Y = i) = \lambda$.
(e). What is $\sum_{i=0}^{\infty} i^2 P(Y = i)$?

3.4 GLM and Mean Functions


The exponential distribution family is quite general and flexible in describing a distribution. This is great news for modeling independent and identically distributed (i.i.d.) data, which, however, rarely come by in practice. For example, if we are interested in investigating the effect of Xi on the response variable Yi, it would be reasonable to impose the i.i.d. assumption on the responses from subjects who have the same Xi. However, it is critical to describe how the distribution changes when the Xi values differ.
To this end, a regression framework can be adopted. The GLM assumes a relationship between the mean µi and Xi via a link function, g(µi) = Xi⊤β. For given Xi values, the distribution of yi is then determined based on (3.2) by adjusting the natural parameter θ to match the desired mean via the inverse function of g, such that µi = g⁻¹(Xiβ).
The following diagram shows the skeleton of the GLM: the first mapping, from θ to b′(θ), specifies the distribution of each response yi, and the second mapping is the link function describing how the mean changes as the covariate Xi changes:
$$\theta_i \;\xrightarrow{\;b'(\theta_i)\;}\; \mu_i \;\xrightarrow{\;g(\mu_i)\;}\; X_i\beta.$$
If we wish to use the linear score Xiβ as the natural parameter θ, the composition of b′(θ) and g(µ) must be the identity function (i.e., g(b′(θ)) = θ) so that Xiβ = θ. In this case, the link g(·) is therefore automatically determined when b(θ) is known.
Now let us consider the overall log-likelihood for (Y, X),
$$L = \phi^{-1}\sum_{i=1}^{N}\{Y_i\theta_i - b(\theta_i)\} + \sum_{i=1}^{N} c(Y_i, \phi).$$

Denote the linear score Xiβ as ηi and G(ηi) = ∂θi/∂ηi, which is 1 when the canonical link is used. The corresponding score function is
$$\sum_{i=1}^{N} X_i G(\eta_i)(Y_i - \mu_i). \tag{3.8}$$

As we can see, under the canonical link, the estimating functions take the following simple form,
$$S(\beta) = \sum_{i=1}^{N} X_i (Y_i - \mu_i), \tag{3.9}$$
which is a linear combination of the residuals Yi − µi.
When the link function is not canonical, the log-likelihood function may not be
convex in β and finding the MLE may incur numerical problems. This can also be
seen in (3.8) as the unknown parameters also appear in G(·), which can make the
estimating functions nonmonotonic even if µi is monotonic in β.
For the likelihood-based estimates, the covariance of β̂ is given by the inverse of the Fisher information matrix, regardless of whether the canonical link is used or not,
$$-E(\partial^2 L/\partial\beta\,\partial\beta^{\top}).$$
However, if (3.9) is used as the estimating functions for solving β and the canonical link is not used, because either a noncanonical function is used for the mean function or we are not sure if the data come from the linear exponential family, we will have
$$\hat{\beta} - \tilde{\beta} \sim N\left(0,\ \left(\sum_{i=1}^{N} X_i D_i\right)^{-1}\left(\sum_{i=1}^{N} X_i\,\mathrm{var}(Y_i)\,X_i^{\top}\right)\left(\sum_{i=1}^{N} D_i^{\top} X_i^{\top}\right)^{-1}\right),$$
where Di = ∂µi/∂β⊤.
Note that var(Yi) is usually unknown and will be estimated either by a variance model or by (Yi − µ̂i)². In the latter case, the sample size N must be large enough so that the limit theorem can be applied to approximate
$$\sum_{i=1}^{N} D_i^{\top}\hat{\epsilon}_i^{\,2} D_i / N \approx \sum_{i=1}^{N} E(D_i^{\top}\epsilon_i^2 D_i)/N,$$
where εi = Yi − µi.
Example 5. Poisson regression with a canonical or noncanonical link.
Suppose Yi|Xi ∼ Poi(µi), and the log-likelihood function is
$$\sum_{i=1}^{N}\{Y_i\log(\mu_i) - \mu_i\}.$$
If the canonical link is used, log(µi) = Xiβ, we have the following score functions for β,
$$\sum_{i=1}^{N} X_i\left(Y_i - e^{X_i\beta}\right).$$
Now let us consider a noncanonical link, θi = log(µi) = exp(Xiβ), for example. The resultant score functions take the form
$$\sum_{i=1}^{N} X_i e^{X_i\beta}\left\{Y_i - \exp\left(e^{X_i\beta}\right)\right\}.$$
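As a small illustration (our own simulation, not from the text), the canonical-link score equations above can be solved by Newton-Raphson in a few lines:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.5, 0.3])
Y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(p)
for _ in range(25):
    mu = np.exp(X @ beta)
    score = X.T @ (Y - mu)           # sum_i X_i (Y_i - exp(X_i beta))
    info = X.T @ (mu[:, None] * X)   # Fisher information under the canonical link
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break
print(beta)
```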

Example 6. Suppose Y follows a gamma distribution Γ(α, β),
$$f(y\,|\,\alpha,\beta) \propto y^{\alpha-1} e^{-y/\beta} = \exp\{-y/\beta + (\alpha-1)\log(y)\}.$$
Clearly, for a given α, f(y|α, β) belongs to the linear exponential family with the natural parameter θ = −β⁻¹. On the other hand, if β is known (or does not change with the mean and is treated as a nuisance parameter), f(y|α, β) belongs to the nonlinear exponential distribution family with θ = (α − 1) and T(y) = log(y). For this gamma distribution, we know µi = αβ and σi² = αβ² = µiβ = µi²/α. This indicates that the variance can be modeled as proportional to µi when fixing β (nonlinear exponential distribution with T(y) = log(y)) or as proportional to µi² when fixing α (linear exponential distribution).
How about other power relationships? Suppose we desire to have σi² ∝ µi^γ to model variance heterogeneity. This can be achieved when α changes with β in such a way that α = β^{(γ−2)/(γ−1)}.

Exercise 6. Prove that (3.8) coincides with the result when applying the Gauss-Markov theorem to the residuals Yi − µi.
Exercise 7. Suppose that yi is generated from the normal distribution N(µi, σ²µi²), where µi = Xiβ. Obtain the score functions for β. Explain why they do not have the form of (3.9) or (3.8). Are there any link functions µi = g(Xiβ) such that the resultant score functions take the form of (3.9) or (3.8)?
Unlike the classical linear regression model, which can only handle normally distributed data, the GLM extends the approach to count data, binary data, and continuous data which need not be normal. Therefore, the GLM is applicable to a wider range of data analysis problems.
Under the assumption that E(Yi) = µi, regardless of whether Yi is continuous or not, we can consider the following regression model for estimating the parameters β in the mean function µi,
$$Y_i = \mu_i + \epsilon_i.$$
Here εi is simply the difference between the observed Yi and its expectation µi. The ordinary least squares approach leads to the following estimating function, in which Di is the Jacobian matrix ∂µi/∂β,
$$\sum_{i=1}^{N} D_i^{\top}\{Y_i - \mu_i(\beta)\}. \tag{3.10}$$

To account for possible variance heterogeneity in Yi, weighted least squares (WLS) can be adopted, but a known weight (wi, say) must be supplied, and the resultant estimating function is
$$\sum_{i=1}^{N} D_i^{\top} w_i\{Y_i - \mu_i(\beta)\}. \tag{3.11}$$

In general, wi is the reciprocal of the variance of Yi (up to a proportionality constant) and is unknown. In the GLM setting, we have σi² = h(µi) and may wish to use wi = 1/h(µ̂i) for some estimated µ̂i in the WLS approach. This is the key idea in iterative weighted least squares (IWLS), which iterates between estimating β̂ and updating σi² = h(µ̂i) for wi. At convergence, β̂ is the same as that from the following estimating function,
$$\sum_{i=1}^{N} D_i^{\top}\sigma_i^{-2}(\beta)\{Y_i - \mu_i(\beta)\}. \tag{3.12}$$
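A minimal IWLS sketch, under the purely illustrative assumptions of a log-link mean and variance function h(µ) = µ² (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 300, 2
X = np.column_stack([np.ones(N), rng.uniform(size=N)])
beta_true = np.array([1.0, 0.8])
# Responses with mean mu and variance proportional to mu^2
Y = np.exp(X @ beta_true) * rng.gamma(shape=4.0, scale=0.25, size=N)

beta = np.zeros(p)
for _ in range(50):
    mu = np.exp(X @ beta)
    D = mu[:, None] * X        # Jacobian d mu_i / d beta under the log link
    w = 1.0 / mu ** 2          # weights 1 / h(mu_hat)
    # One Gauss-Newton step on the weighted estimating function (3.12)
    step = np.linalg.solve(D.T @ (w[:, None] * D), D.T @ (w * (Y - mu)))
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break
print(beta)
```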

Exercise 8. Suppose β̂w is obtained from (3.12) and β̂ora is an oracle estimator obtained by solving
$$\sum_{i=1}^{N} D_i^{\top}\sigma_i^{-2}(\tilde{\beta})\{Y_i - \mu_i(\beta)\} = 0,$$
in which the weighting is based on the true β̃. Prove that
(a). β̂w − β̃ = O_p(N^{−1/2}).
(b). β̂ora − β̃ = O_p(N^{−1/2}).
(c). β̂w − β̂ora = o_p(N^{−1/2}).

Exercise 9. Show that (3.12) is the score function if Yi is from the linear exponen-
tial distribution family.
3.5 Marginal Models
The marginal models aim to make inferences at the population level. This approach models the marginal distributions f(yij) at each time point, for j = 1, 2, . . . , ni, instead of the joint distribution of all the repeated measures.
The emphasis is often on the mean (and variance) of the univariate response yij at the population level instead of on the individual i at any given fixed covariates. A covariance structure is often nominated to account for the within-subject correlations. A feature of marginal models is that the models for the mean and the covariance are specified separately. Marginal models are considered a natural approach when we wish to extend the generalized linear model methods for the analysis of independent observations to the setting of correlated responses.
Specification of a mean function is the premier task in the GEE regression model. If the mean function is not correctly specified, the analysis will be meaningless.
Within the GLM framework, the link function provides a link between the mean and a linear combination of the covariates. The link function is called the canonical link if the transformed mean g(µ) equals the canonical parameter. Different distribution models are associated with different canonical links. For normal, Poisson, binomial, and Gamma random components, the canonical links are the identity, log, logit, and inverse links, respectively.
We assume that the response variable, yij, has mean µij and variance φσij². Here φ is an unknown scale parameter, µij and σij² are known functions of the covariates, and Xij is a p × 1 vector. Let µi = (µij) be the marginal mean vector for subject i. The covariance matrix of Yi, Σi, is assumed to have the form φAi^{1/2}Ri(α)Ai^{1/2}, in which Ai = diag(σij²) and Ri(α) is the correlation matrix parameterized by α, a q-vector.
We first consider the special case when ni = 1 and all observations are independent of each other. In generalized linear models (GLMs), we have µij = h(Xij⊤β), where h(·) is the inverse function of the link function, so that the mean is a function of a linear combination of the covariates, and the variance σij² is related to the covariates via some known function of the mean.
The counterpart of the random effect model is a marginal model. A marginal
model is often used when inference about population averages is of interest. The
mean response modeled in a marginal model is conditional only on covariates and
not on random effects. In marginal models, the mean response and the covariance
structure are modeled separately.
We assume that the marginal density of yij is given by
$$f(y_{ij}) = \exp\left[\{y_{ij}\theta_{ij} - b(\theta_{ij})\}/\phi + c(y_{ij}, \phi)\right].$$
That is, each yij is assumed to have a distribution from the exponential family. Specifically, with marginal models we make the following assumptions:
• The marginal expectation of the response, E(yij) = µij, depends on explanatory variables, Xij, through a known link function, the inverse function of h,
$$h^{-1}(\mu_{ij}) = \eta_{ij} = X_{ij}^{\top}\beta.$$


• The marginal variance of yi j is assumed to be a function of the marginal mean,

var(yi j ) = φν(µi j ),

in which ν(µi j ) is a known “variance function”, and φ is a scale parameter that may
need to be estimated.
• The correlation between yi j and yik is a function of some covariates (usually just
time) with a set of additional parameters, say α, that may also need to be estimated.
Here are some examples of marginal models:
• Continuous responses:
1. µij = ηij = Xij⊤β (i.e., linear regression), identity link
2. var(yij) = φ (i.e., homogeneous variance)
3. Corr(yij, yik) = α^{|k−j|} (i.e., autoregressive correlation)
• Binary response:
1. logit(µij) = ηij = Xij⊤β (i.e., logistic regression), logit link
2. var(yij) = µij(1 − µij) (i.e., Bernoulli variance)
3. Corr(yij, yik) = αjk (i.e., unstructured correlation)
• Count data:
1. log(µij) = ηij = Xij⊤β (i.e., Poisson regression), log link
2. var(yij) = φµij (i.e., extra-Poisson variance)
3. Corr(yij, yik) = α (i.e., compound symmetry correlation)

3.6 Modeling the Variance


When analyzing count data, we often assume that the variance structure is that of the Poisson distribution, that is, Var(y) = E(y) = µ. But for some count data, such as the epileptic seizures data mentioned previously, the variance structure Var(y) = µ seems inappropriate, because the sample variance is much larger than the sample mean. Misspecification of the variance structure will lead to low efficiency of the regression parameter estimation in longitudinal data analysis. One sensible way is to use a different variance function according to the features of the data set. Many variance functions, such as the exponential, extra-Poisson, and powers of µ, have been proposed in Davidian and Giltinan (1995).
Here we consider the variance function as a power function of µ:
$$V(\mu) = \mu^{\gamma}.$$
The most common values of γ are 0, 1, 2, and 3, which are associated with the normal, Poisson, Gamma, and inverse Gaussian distributions, respectively. Distributions with this power variance function have also been discussed in the literature, where it is shown that an exponential family exists for γ = 0 and γ ≥ 1. In Jørgensen (1997), the author summarized the Tweedie exponential dispersion models and concluded that distributions do not exist for 0 < γ < 1. For 1 < γ < 2, it is a compound Poisson distribution; for 2 < γ < 3 and γ > 3, it is generated by a positive stable distribution. The Tweedie exponential dispersion model is denoted Y ∼ Twγ(µ, φ). By definition, this model has mean µ and variance
$$\mathrm{Var}(Y) = \phi\mu^{\gamma}.$$
Now we try to find the exponential dispersion model corresponding to V(µ) = µ^γ. The exponential dispersion model extends the natural exponential families and includes many standard families of distributions.
However, from the likelihood perspective, it is interesting to find what other likelihood functions can lead to such variance functions. Denote the exponential dispersion model by ED(µ, φ); it has the following distribution form:
$$\exp[\lambda\{y\theta - \kappa(\theta)\}]\,\upsilon_{\lambda}(dy),$$
where υ is a given σ-finite measure on R (the set of all real numbers). The parameter θ is called the canonical parameter, λ is called the index parameter (and φ = 1/λ is called the dispersion parameter). The parameter µ is called the mean value parameter. The cumulant generating function of Y ∼ ED(µ, φ) is
$$K(s; \theta, \lambda) = \lambda\{\kappa(\theta + s/\lambda) - \kappa(\theta)\}.$$
Let κγ and τγ denote the corresponding unit cumulant function and mean value mapping, respectively. For exponential dispersion models, we have the following relations:
$$\frac{\partial \tau_{\gamma}^{-1}}{\partial \mu} = \frac{1}{V_{\gamma}(\mu)} \quad\text{and}\quad \kappa_{\gamma}'(\theta) = \tau_{\gamma}(\theta).$$
If the exponential dispersion model corresponding to Vγ exists, we must solve the following two differential equations,
$$\frac{\partial \tau_{\gamma}^{-1}}{\partial \mu} = \mu^{-\gamma}, \tag{3.13}$$
and
$$\kappa_{\gamma}'(\theta) = \tau_{\gamma}(\theta). \tag{3.14}$$

It is convenient to introduce the parameter ϕ, defined by
$$\varphi = \frac{\gamma - 2}{\gamma - 1}, \tag{3.15}$$
with inverse relation
$$\gamma = \frac{\varphi - 2}{\varphi - 1}. \tag{3.16}$$
From (3.13) we find
$$\tau_{\gamma}(\theta) = \begin{cases} \left(\dfrac{\theta}{\varphi - 1}\right)^{\varphi - 1} & \text{if } \gamma \neq 1 \\[1ex] e^{\theta} & \text{if } \gamma = 1. \end{cases}$$
From τγ we find κγ by solving (3.14), which gives
$$\kappa_{\gamma}(\theta) = \begin{cases} \dfrac{\varphi - 1}{\varphi}\left(\dfrac{\theta}{\varphi - 1}\right)^{\varphi} & \text{if } \gamma \neq 1, 2 \\[1ex] e^{\theta} & \text{if } \gamma = 1 \\[1ex] -\log(-\theta) & \text{if } \gamma = 2. \end{cases}$$
In both (3.13) and (3.14), we have ignored the constants in the solutions, which does not affect the results.
If an exponential dispersion model corresponding to the power variance function exists, the cumulant generating function of the corresponding convolution model is
$$K_{\gamma}(s; \theta, \lambda) = \begin{cases} \lambda\kappa_{\gamma}(\theta)\left\{\left(1 + \dfrac{s}{\theta\lambda}\right)^{\varphi} - 1\right\} & \text{if } \gamma \neq 1, 2 \\[1ex] \lambda e^{\theta}\{\exp(s/\lambda) - 1\} & \text{if } \gamma = 1 \\[1ex] -\lambda\log\left(1 + \dfrac{s}{\theta\lambda}\right) & \text{if } \gamma = 2. \end{cases}$$
We now consider the case ϕ < 0, corresponding to 1 < γ < 2. It turns out that the Tweedie model with 1 < γ < 2 is a compound Poisson distribution.
Let N and X1, . . . , XN denote a sequence of independent random variables, such that N is Poisson distributed, N ∼ Poisson(m), and the Xi are identically distributed. Define
$$Z = \sum_{i=1}^{N} X_i, \tag{3.17}$$
where Z is defined as 0 for N = 0. The distribution of (3.17) is a compound Poisson distribution. Now we assume that m = λκγ(θ) and Xi ∼ Γ(ϕ/θ, −ϕ). Note that, by the convolution formula, we have Z|N = n ∼ Γ(ϕ/θ, −nϕ). The moment generating function of Z is
$$E(e^{sZ}) = \exp[\lambda\kappa_{\gamma}(\theta)\{(1 + s/\theta)^{\varphi} - 1\}].$$
This shows that Z is a Tweedie model. We can obtain the joint density of Z and N, for n ≥ 1 and z > 0,
$$p_{Z,N}(z, n; \theta, \lambda, \varphi) = \frac{(-\theta)^{-n\varphi}\, m^n\, z^{-n\varphi - 1}}{\Gamma(-n\varphi)\, n!}\exp\{\theta z - m\} = \frac{\lambda^n\,\kappa_{\gamma}^n(-1/z)}{\Gamma(-n\varphi)\, n!\, z}\exp\{\theta z - \lambda\kappa_{\gamma}(\theta)\}.$$
The distribution of Z is continuous for z > 0, and summing over n, the density of Z is
$$p(z; \theta, \lambda, \varphi) = \frac{1}{z}\sum_{n=1}^{\infty}\frac{\lambda^n\,\kappa_{\gamma}^n(-1/z)}{\Gamma(-n\varphi)\, n!}\exp\{\theta z - \lambda\kappa_{\gamma}(\theta)\}. \tag{3.18}$$

Let y = z/λ; then y has the probability density function given by
$$p(y; \theta, \lambda, \varphi) = c_{\gamma}(y; \lambda)\exp[\lambda\{\theta y - \kappa_{\gamma}(\theta)\}], \quad y \ge 0, \tag{3.19}$$
where
$$c_{\gamma}(y; \lambda) = \begin{cases} \dfrac{1}{y}\displaystyle\sum_{n=1}^{\infty}\dfrac{\lambda^n\,\kappa_{\gamma}^n\!\left(-\frac{1}{\lambda y}\right)}{\Gamma(-\varphi n)\, n!} & y > 0 \\[1ex] 1 & y = 0. \end{cases} \tag{3.20}$$
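Since the Tweedie model with 1 < γ < 2 is exactly the compound Poisson construction (3.17), it can be simulated directly and the relation Var(Y) = φµ^γ checked empirically. Below is a minimal sketch using the standard compound Poisson-Gamma parameterization (the parameter values and names are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, phi, gam = 2.0, 1.5, 1.4  # target mean, dispersion, and variance power

# Compound Poisson-Gamma representation of Tweedie(mu, phi, gam) for 1 < gam < 2
lam = mu ** (2 - gam) / (phi * (2 - gam))     # Poisson rate of N
shape = (2 - gam) / (gam - 1)                 # Gamma shape of each X_i
scale = phi * (gam - 1) * mu ** (gam - 1)     # Gamma scale of each X_i

n_draws = 100_000
N = rng.poisson(lam, size=n_draws)
# Sum of n i.i.d. Gamma(shape, scale) variables is Gamma(n * shape, scale)
Y = np.array([rng.gamma(shape * n, scale) if n > 0 else 0.0 for n in N])

print("sample mean:", Y.mean(), " target:", mu)
print("sample var :", Y.var(), " target:", phi * mu ** gam)
print("P(Y = 0)   :", (Y == 0).mean(), " target:", np.exp(-lam))
```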

It is not clear how valid the likelihood functions are in statistical inference when the data are generated by a different distribution meeting certain moment assumptions (mean and variance, for example). As one can see, it is not straightforward to develop multivariate versions of such likelihood functions in which a dependency structure or a correlation structure is embedded somehow. These are topics of future work.

3.7 Modeling the Correlation


Arguably, correlation modeling is the key in dependent data analysis such as time
series and longitudinal data analysis. The general approach to model dependence
in longitudinal studies takes the form of a patterned correlation matrix R(α) with
q = dim(α) correlation parameters.
The number of correlation parameters and the estimator of α vary from case to case in the literature. Most researchers follow Liang and Zeger (1986), who discussed a number of important special cases. We first assume that each subject has an equal number of observations (ni = n). The following are the typical "working" correlation structures and the estimators used to estimate the "working" correlations.
• m-dependent correlation:
$$\mathrm{Corr}(y_{ij}, y_{i,j+t}) = \begin{cases} 1 & t = 0 \\ \alpha_t & t = 1, \ldots, m \\ 0 & t > m \end{cases}$$
and
$$R = \begin{pmatrix} 1 & \alpha_1 & \alpha_2 & \cdots & \alpha_m & 0 & \cdots & 0 \\ \alpha_1 & 1 & \alpha_1 & \cdots & \alpha_{m-1} & \alpha_m & \cdots & 0 \\ \vdots & & & \ddots & & & & \vdots \\ 0 & 0 & \cdots & & \alpha_2 & \alpha_1 & \cdots & 1 \end{pmatrix}.$$
The case of m = 1 corresponds to the moving average (MA) structure in time series modeling. Note that α1 is the assumed or imposed correlation of corr(y11, y12) and corr(y12, y13) (i = 1), and also of corr(y21, y22) and corr(y22, y23) (i = 2), etc. This structure may be unreasonable if the observation times tij are 1, 2, 4, . . . for i = 1 and 2, 3, 4 for i = 2.
• Exchangeable:
$$\mathrm{Corr}(y_{ij}, y_{ik}) = \begin{cases} 1 & j = k \\ \alpha & j \neq k. \end{cases}$$
• Autoregressive correlation, AR(1):
$$\mathrm{corr}(Y_{ij}, Y_{i,j+t}) = \alpha^{t}, \quad t = 0, 1, 2, \ldots, n - j.$$
The correlation matrix is
$$R = \begin{pmatrix} 1 & \alpha & \alpha^2 & \cdots & \alpha^{n-1} \\ \alpha & 1 & \alpha & \cdots & \alpha^{n-2} \\ \alpha^2 & \alpha & 1 & \cdots & \alpha^{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \alpha^{n-1} & \alpha^{n-2} & \alpha^{n-3} & \cdots & 1 \end{pmatrix}.$$
• Toeplitz correlation (n − 1 parameters):
$$\mathrm{Corr}(y_{ij}, y_{ik}) = \begin{cases} 1 & j = k \\ \alpha_{|j-k|} & j \neq k \end{cases} \quad\text{and}\quad R = \begin{pmatrix} 1 & \alpha_1 & \alpha_2 & \cdots & \alpha_{n-1} \\ \alpha_1 & 1 & \alpha_1 & \cdots & \alpha_{n-2} \\ \alpha_2 & \alpha_1 & 1 & \cdots & \alpha_{n-3} \\ \vdots & & & \ddots & \vdots \\ \alpha_{n-1} & \alpha_{n-2} & \alpha_{n-3} & \cdots & 1 \end{pmatrix}.$$
• Unstructured correlation:
$$\mathrm{Corr}(y_{ij}, y_{ik}) = \begin{cases} 1 & j = k \\ \alpha_{jk} & j \neq k. \end{cases}$$
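For concreteness, here is a small sketch (the helper names are our own) that constructs these working correlation matrices for a given dimension n:

```python
import numpy as np

def exchangeable(alpha, n):
    return (1 - alpha) * np.eye(n) + alpha * np.ones((n, n))

def ar1(alpha, n):
    idx = np.arange(n)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

def m_dependent(alphas, n):
    # alphas = (alpha_1, ..., alpha_m); zero correlation beyond lag m
    R = np.eye(n)
    for t, a in enumerate(alphas, start=1):
        R += a * (np.eye(n, k=t) + np.eye(n, k=-t))
    return R

def toeplitz_corr(alphas, n):
    # alphas = (alpha_1, ..., alpha_{n-1}), one parameter per lag
    lag = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return np.concatenate(([1.0], np.asarray(alphas, float)))[lag]

print(ar1(0.5, 4))
```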

The number of correlation parameters varies according to the different "working" correlation structures. The exchangeable (uniform or compound symmetry) correlation structure has
$$R = (1 - \rho)I + \rho\, e e^{\top},$$
where −1/(ni − 1) < ρ < 1 is unknown, e = (1, 1, . . . , 1)⊤, and I is the identity matrix of order ni. The serial covariance structure has
$$\Sigma = \sigma^2 C,$$
where C = (ρ^{|i−j|}), and σ² > 0 and −1 < ρ < 1 are unknown. Lee (1988) studied these two correlation structures in the context of growth curves.
Note that these models are most suitable when all subjects have the same, equally spaced observation times; for example, each subject is observed five times (Monday to Friday). However, if Subject 1 is observed on Monday, Tuesday, and Friday, and Subject 2 is observed on Monday and Wednesday only, the corresponding two correlation matrices using the AR(1) structure are
$$\begin{pmatrix} 1 & \alpha & \alpha^4 \\ \alpha & 1 & \alpha^3 \\ \alpha^4 & \alpha^3 & 1 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} 1 & \alpha^2 \\ \alpha^2 & 1 \end{pmatrix}.$$
If the observation times are not of a lattice nature, but rather in continuous time, an exponential correlation structure is more appropriate (Diggle, 1988): ρ(|tj − ti|), where ρ(u) = exp(−αu^c), with c = 1 or 2. The case of c = 1 is the continuous-time analog of a first-order autoregressive process. The case of c = 2 corresponds to an intrinsically smoother process. The covariance structure can handle irregularly spaced time sequences within experimental units that could arise through randomly missing data or by design. Besides the aforementioned covariance structures, there are other parametric families of covariance structures proposed to describe the correlation of many types of repeated data. They can model quite parsimoniously a variety of forms of dependence and accommodate arbitrary numbers and spacings of observation times, which need not be the same for all subjects. Núñez-Antón and Woodworth (1994) proposed a covariance model to analyze unequally spaced data when the error variance-covariance matrix has a structure that depends on the spacing between observations. The covariance structure depends on the time intervals between measurements rather than the time order of the measurements. The main feature of the structure is that it involves a power transformation of the time rather than the time interval, and the power parameter is unknown.
The general form of the covariance matrix for a subject with k observations at times 0 < t1 < t2 < · · · < tk is
$$(\Sigma)_{uv} = (\Sigma)_{vu} = \begin{cases} \sigma^2\, \alpha^{(t_v^{\lambda} - t_u^{\lambda})/\lambda} & \text{if } \lambda \neq 0 \\ \sigma^2\, \alpha^{\log(t_v/t_u)} & \text{if } \lambda = 0 \end{cases}$$
(1 ≤ u ≤ v ≤ k, 0 < α < 1). The covariance structure consists of the three-parameter vector θ = (σ², α, λ). It is different from the uniform covariance structure with two parameters, as well as from an unstructured multivariate normal distribution with ni(ni − 1)/2 parameters. Modeling the covariance structure in continuous time removes any requirement that the sequences of measurements on the different units are made at a common set of times.
Suppose there are five observations at times 0 < t1 < t2 < t3 < t4 < t5. Denote
$$a = \alpha^{(t_2^{\lambda} - t_1^{\lambda})/\lambda}, \quad b = \alpha^{(t_3^{\lambda} - t_2^{\lambda})/\lambda}, \quad c = \alpha^{(t_4^{\lambda} - t_3^{\lambda})/\lambda}, \quad d = \alpha^{(t_5^{\lambda} - t_4^{\lambda})/\lambda}.$$
Consequently, the matrix can be written as
$$\Sigma = \sigma^2\begin{pmatrix} 1 & a & ab & abc & abcd \\ a & 1 & b & bc & bcd \\ ab & b & 1 & c & cd \\ abc & bc & c & 1 & d \\ abcd & bcd & cd & d & 1 \end{pmatrix} \tag{3.21}$$
and the inverse of this covariance matrix is
$$\Sigma^{-1} = \frac{1}{\sigma^2}\begin{pmatrix} \frac{1}{1-a^2} & \frac{-a}{1-a^2} & 0 & 0 & 0 \\ \frac{-a}{1-a^2} & \frac{1-a^2b^2}{(1-a^2)(1-b^2)} & \frac{-b}{1-b^2} & 0 & 0 \\ 0 & \frac{-b}{1-b^2} & \frac{1-b^2c^2}{(1-b^2)(1-c^2)} & \frac{-c}{1-c^2} & 0 \\ 0 & 0 & \frac{-c}{1-c^2} & \frac{1-c^2d^2}{(1-c^2)(1-d^2)} & \frac{-d}{1-d^2} \\ 0 & 0 & 0 & \frac{-d}{1-d^2} & \frac{1}{1-d^2} \end{pmatrix}. \tag{3.22}$$

The elements of the inverse covariance matrix are
$$\sigma^2(\Sigma^{-1})_{11} = \left[1 - \alpha^{2(t_2^{\lambda} - t_1^{\lambda})/\lambda}\right]^{-1},$$
$$\sigma^2(\Sigma^{-1})_{kk} = \left[1 - \alpha^{2(t_k^{\lambda} - t_{k-1}^{\lambda})/\lambda}\right]^{-1},$$
$$\sigma^2(\Sigma^{-1})_{j,j+1} = -\left[1 - \alpha^{2(t_{j+1}^{\lambda} - t_j^{\lambda})/\lambda}\right]^{-1}\alpha^{(t_{j+1}^{\lambda} - t_j^{\lambda})/\lambda}, \quad 1 \le j \le k - 1,$$
$$\sigma^2(\Sigma^{-1})_{jj} = \left\{\left[1 - \alpha^{2(t_j^{\lambda} - t_{j-1}^{\lambda})/\lambda}\right]\left[1 - \alpha^{2(t_{j+1}^{\lambda} - t_j^{\lambda})/\lambda}\right]\right\}^{-1}\left[1 - \alpha^{2(t_{j+1}^{\lambda} - t_{j-1}^{\lambda})/\lambda}\right], \quad 1 < j < k,$$
$$\sigma^2(\Sigma^{-1})_{jl} = 0, \quad |j - l| > 1.$$
In the case that the variances are different, we may write the more general form for the covariance matrix, Σ = A^{1/2}RA^{1/2}, where A = diag(σk²), k = 1, . . . , ni, and R is the correlation matrix.
We can also consider the damped exponential correlation structure here, introduced by Muñoz et al. (1992). The model can handle slowly decaying autocorrelation dependence as well as autocorrelation dependence that decays faster than the commonly used first-order autoregressive model. In addition, the covariance structure allows for nonequidistant and unbalanced observations, thus efficiently accommodating the occurrence of missing observations.
Let Yi = (yi1, . . . , yini)⊤ be an ni × 1 vector of responses at ni time points for the ith individual (i = 1, . . . , N). The covariate measurement Xi is an ni × p matrix. Denote by the ni-vector si the times elapsed from baseline to follow-up, with si1 = 0, si2 = time from baseline to the first follow-up visit on subject i, and si,ni = time from baseline to the last follow-up visit for subject i. The follow-up times can be scaled to keep the si small positive numbers of a size comparable to maxi{ni} so that we can avoid exponentiation with unnecessarily large numbers. We assume that the marginal density for the ith subject, i = 1, . . . , N, is
$$Y_i \sim \mathrm{MVN}(X_i\beta,\ \sigma^2 V_i(\alpha, \theta; s_i)), \quad 0 \le \alpha < 1,\ \theta \ge 0; \tag{3.23}$$

and the (j, k) (j < k) element of Vi is
$$\mathrm{corr}(Y_{ij}, Y_{ik}) = [V_i(\alpha, \theta; s_i)]_{jk} = \alpha^{(s_{ik} - s_{ij})^{\theta}}, \tag{3.24}$$
where α denotes the correlation between observations separated by one s-unit in time, and θ is the "scale parameter" which permits attenuation or acceleration of the exponential decay of the autocorrelation function defining an AR(1). As attenuation is the most common in practical applications, we refer to this model as the damped exponential (DE). Given that most longitudinal data exhibit positive correlation, it is sensible to limit α to nonnegative values.
For nonnegative α, the correlation structure given by (3.24) produces a variety of correlation structures upon fixing the scale parameter θ. Let I_B be the indicator function of the set B. If θ = 0, then corr(Yit, Yi,t+s) = I_{{s=0}} + αI_{{s>0}}, which is the compound symmetry model; if θ = 1, then corr(Yit, Yi,t+s) = α^{|s|}, yielding AR(1); and as θ → ∞, corr(Yit, Yi,t+s) → I_{{s=0}} + αI_{{s=1}}, yielding MA(1). If 0 < θ < 1, we obtain a family of correlation structures with decay rates between those of the compound symmetry and AR(1) models; for θ > 1, it is a correlation structure with a decay rate faster than that of AR(1).
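The damped exponential correlation (3.24) is equally easy to construct for a subject's observation times (a sketch; the function name is ours):

```python
import numpy as np

def damped_exponential(alpha, theta, s):
    # s: elapsed times from baseline; alpha in [0, 1), theta >= 0
    s = np.asarray(s, dtype=float)
    lag = np.abs(s[:, None] - s[None, :])
    R = alpha ** (lag ** theta)
    np.fill_diagonal(R, 1.0)  # force exact 1s on the diagonal (e.g., when theta = 0)
    return R

# theta = 0 and theta = 1 reproduce compound symmetry and AR(1), respectively
print(damped_exponential(0.5, 1.0, [0, 1, 2, 4]))
```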
As we know, any correlation matrix should be positive definite, with all eigenvalues positive. This means that the correlation parameters are constrained. For example, for the AR(1) model we must have |α| < 1; for the exchangeable model we must have 1 > α > −1/(ni − 1).
For the unstructured structure, it is less clear how the parameters αjk should be constrained. Zhang et al. (2015) proposed an unconstrained parameterization for any correlation matrix using hyperspherical coordinates (the angles are free parameters between 0 and π).
Exercise 10. For the exchangeable correlation structure R with correlation parameter α and dimension n, show that |R| = (1 − α)^{n−1}{1 + (n − 1)α} and
$$R^{-1} = \frac{1}{1-\alpha}\left(I - \frac{\alpha}{1 - \alpha + \alpha n}\, e e^{\top}\right), \tag{3.25}$$
in which e is a vector consisting of 1s, (1, 1, . . . , 1)⊤. The eigenvalues are 1 + (n − 1)α, with multiplicity one, and 1 − α, with multiplicity n − 1.
Exercise 11 (Graybill Theorem 8.15.4). If R has the MA(1) correlation structure with correlation parameter α, show that
$$|R| = \prod_{j=1}^{n}\left\{1 + 2\alpha\cos\left(\frac{j\pi}{n+1}\right)\right\}. \tag{3.26}$$
The eigenvalues are 1 + 2α cos{jπ/(n + 1)} for j = 1, 2, . . . , n, and the maximum α ≤ min{1, 1/(2 cos(π/(n + 1)))}. The (j, k)-th element (j ≥ k) of its inverse R⁻¹ is given by
$$b_{jk} = \frac{\left(1 - b^{2n-2j+2}\right)\left(b^{j+k+1} - b^{j-k+1}\right)}{\alpha\left(1 - b^2\right)\left(1 - b^{2n+2}\right)}, \tag{3.27}$$
where $b = (\sqrt{1 - 4\alpha^2} - 1)/(2\alpha)$, i.e., $\alpha = -b/(1 + b^2)$.
It is crucial to know the constraints so that we can make sure the resultant R matrix is meaningful when the parameters are estimated from the data. More discussion will be given towards the end of the next chapter.
Davidian and Giltinan (1995) provided a comprehensive description of the mixed effects models and their computational algorithms when different variance functions and covariance structures are used.

3.8 Random Effects Models


We briefly introduce the widely used random-effects model in longitudinal data analysis. More details will be given in Chapter 9.
The random-effects model can be regarded as a fully parametric and distributional approach. The widely used linear model assumes E(Yi|Xi) = Xiβ, i.e., we can decompose the observed vector Yi into two parts, a deterministic part and noise εi,
$$Y_i = X_i\beta + \epsilon_i,$$
in which E(εi|Xi) = 0.
As an alternative to directly modeling the correlations or the covariance of Yi, we can decompose/model εi further as Zibi + ηi, where bi is referred to as the random effects associated with individual i. As we shall see, the presence of the same bi vector in all observations from subject i induces correlations among these observations. This random-effects model explicitly identifies individual effects. In contrast to
full multivariate models, which are not able to fit unbalanced data, the random-effects model can handle the unbalanced situation. A model containing both fixed effects and random effects is often referred to as a mixed-effects model (not to be confused with a mixture model).
For multivariate normal data, the random-effects model can be described in two steps (Laird and Ware, 1982):
Step 1. For the ith experimental unit, i = 1, . . . , N,
$$Y_i = X_i\beta + Z_ib_i + \eta_i, \tag{3.28}$$
where
Xi is an ni × p "design matrix";
β is a p × 1 vector of parameters referred to as fixed effects;
Zi is an ni × k "design matrix" that characterizes random variation in the response attributable to among-unit sources;
bi is a k × 1 vector of unknown random effects;
and ηi is distributed as N(0, Ri). Here Ri is an ni × ni positive-definite covariance matrix reflecting "measurement" errors. In practice, Ri is often taken as a diagonal matrix.
Step 2. β is considered fixed parameters at the population level, and bi are also unknown parameters, but for subject i only. Here the ηi are often assumed to be independent. The bi are distributed as N(0, G), independently of each other and of the ηi. Here G is a k × k positive-definite covariance matrix.
The regression parameter vector β contains the fixed effects, which are assumed to be the same for all individuals and have a population-averaged interpretation. In contrast to β, the vector bi is comprised of subject-specific regression coefficients.
The conditional mean of Yi, given bi, is
$$E(Y_i|b_i) = X_i\beta + Z_ib_i,$$
which is the ith subject's mean response profile. The marginal or population-averaged mean of Yi is
$$E(Y_i) = X_i\beta.$$
Similarly,
$$\mathrm{var}(Y_i|b_i) = \mathrm{var}(\eta_i) = R_i$$
and
$$\mathrm{var}(Y_i) = \mathrm{var}(Z_ib_i) + \mathrm{var}(\eta_i) = Z_iGZ_i^{\top} + R_i.$$
Thus, the introduction of the random effects, bi, induces correlation (marginally) among the components of Yi. That is,
$$\mathrm{var}(Y_i) = \Sigma_i = Z_iGZ_i^{\top} + R_i,$$
which has nonzero off-diagonal elements. Based on the assumptions on bi and ηi, we have
$$Y_i \sim N_{n_i}(X_i\beta, \Sigma_i).$$
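For instance, with a random intercept and slope (an illustrative choice of Zi, G, and Ri; the values below are ours), the induced marginal covariance Σi = ZiGZi⊤ + Ri can be computed directly:

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0])         # observation times for one subject
Z = np.column_stack([np.ones_like(t), t])  # random intercept and slope design
G = np.array([[1.0, 0.3],
              [0.3, 0.5]])                 # covariance of the random effects b_i
R = 0.25 * np.eye(len(t))                  # measurement-error covariance R_i

Sigma = Z @ G @ Z.T + R                    # marginal covariance of Y_i
sd = np.sqrt(np.diag(Sigma))
print(Sigma / np.outer(sd, sd))            # induced within-subject correlations
```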
Chapter 4

Parameter Estimation

A longitudinal study is characterized by repeated measures from individuals through


time. Analysis of longitudinal data plays an important role in many research areas,
such as medical and biological research, economics, and finance as well. This leads
to interesting statistical research on methods to take account of possible correlations
among the repeated observations.
Let us first consider the independent data. For independent data, we only have
two types of parameters to estimate, namely, regression parameters, variance param-
eters including the scale parameter. In most research literature, when count data is
analyzed, the Poisson model is often used with Var(y) = φE(y) = φµ. However, the
real variance structures may be very different from the Poisson model. There are at
least two possible generalizations to the Poisson variance model,
(1). V(µ) = µγ , 1 ≤ γ ≤ 2;
(2). V(µ) = γ1 µ + γ2 µ2 ,
where γ, (γ1 , γ2 ) are unknown constants to be estimated. In previous chapter, we
have considered the variance function V(µ) = µγ and the corresponding likelihood
function.
Independent data can be classified into two types: univariate observations and
multivariate observations. For both of them, regression parameters can be estimated
by GLM approach; for the later one, if it is a special case of longitudinal data, then
GEE approach can also be employed. We use Gaussian, Quasi-likelihood, and other
approaches to estimate variance parameters, which we will introduce later on.
For longitudinal data, we may proceed with the independent data analysis pre-
tending the data were independent. This approach, in general, results in consistent
β parameter estimators, but the efficiency can be low (Wang and Carey, 2003; Fitz-
maurice, 1995), in other words, the standard errors of other estimators (incorporating
within-subject correlations) can be made much smaller. Once a consistent estimator
of β is available, the resultant residuals are valid for further analysis, i.e., we can then
estimate and carry out inferences on the variance functions and correlation structures
based on these residuals. Of course, β estimates can then be updated when better
covariance matrices for Yi are available.

4.1 Likelihood Approach
We first introduce how β can be estimated when ignoring the within-subject correlations. It is simple to estimate the regression parameters by adopting the GLM approach when the independent data are univariate. Consider the univariate observations yi, i = 1, . . . , N, and the p × 1 covariate vector Xi. Let β be a p × 1 vector of regression parameters and the linear predictor be ηi = Xi⊤β. Suppose Y = (y1, . . . , yN) follows a distribution from a specific exponential family as given by (3.2),
$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right\},$$
with canonical parameter θ and dispersion parameter φ. For each yi, the log-likelihood is
$$L_i(\beta, \phi) = \log f(y_i; \theta_i, \phi).$$
For y1, . . . , yN, the joint log-likelihood is
$$L(\beta, \phi) = \sum_{i=1}^{N}\log f(y_i; \theta_i, \phi) = \sum_{i=1}^{N} L_i(\beta, \phi).$$

The score estimating function for βj, j = 1, . . . , p, can be derived by applying the chain rule,
$$\frac{\partial L(\beta, \phi)}{\partial \beta_j} = \sum_{i=1}^{N}\frac{\partial L_i(\beta, \phi)}{\partial \beta_j} = \sum_{i=1}^{N}\left\{\frac{y_i - \mu_i}{\phi_i V(\mu_i)}\right\}\frac{\partial \mu_i}{\partial \eta_i}\, x_{ij}.$$

The estimating functions for β can be written as
$$U(\beta; \phi) = \sum_{i=1}^{N} D_i^{\top} V_i^{-1} S_i,$$
in which Vi is the variance function and Si = yi − µi. The MLE can be obtained by solving U(β; φ) = 0. Usually, we assume φ is a constant for all observations; alternatively, for subject or observation i, φ can be replaced by φ/mi, where the mi (i = 1, 2, . . . , N) are known weights. In these cases, φ can be removed from the estimating equations U(β; φ) = 0, and it does not contribute towards the estimation of β. However, φ does affect the variance of the β estimator as an over-dispersion parameter.
If we assume the likelihood L(β, φ) is the true likelihood that generates the data, we can rely on likelihood-based inferences. For example, the covariance of the resultant β̂ can be approximated by the inverse of the information matrix. Otherwise, asymptotic results can be derived from the estimating functions U(β; φ). An excellent reference on misspecified likelihood inference is White (1982).
4.2 Quasi-likelihood Approach
Wedderburn (1974) defined the quasi-likelihood Q for an observation y with mean µ and variance function V(µ) by the equation
$$Q(y; \mu) = \int_{y}^{\mu}\frac{y - u}{V(u)}\, du, \tag{4.1}$$
plus some function of y only, or equivalently by
$$\partial Q(y; \mu)/\partial \mu = (y - \mu)/V(\mu). \tag{4.2}$$
The deviance function, which measures the discrepancy between the observation and its expected value, is obtained from the analog of the log-likelihood-ratio statistic,
$$D(y; \mu) = -2\{Q(y; \mu) - Q(y; y)\} = -2\int_{y}^{\mu}\frac{y - u}{V(u)}\, du. \tag{4.3}$$

The QL estimating functions are given by
$$\sum_{i=1}^{N} D_i^{\top} V_i^{-1}(Y_i - \mu_i) = 0, \tag{4.4}$$
where Di = (∂µi/∂β⊤)_{ni×p} and Vi is the variance function.


If we are only concerned about parameter estimation, we can rely on estimating
functions taking forms similar to (4.4) or other functions we constructed, and there is
no need to worry about the likelihood functions because the known variance function
is sufficient and ready to be incorporated (as weighting) in our estimation.
Under mild conditions, the Wedderburn form of QL can be used as a valid 'likelihood' function for estimation and inference, and to compare different linear predictors or different link functions on the same data. It cannot, however, be used to compare different variance functions on the same data. To this end, Nelder and Pregibon (1987) proposed the extended quasi-likelihood (EQL) definition,
$$Q^{+}(y; \mu) = -\frac{1}{2}\log\{2\pi\phi V(y)\} - \frac{1}{2}D(y; \mu)/\phi, \tag{4.5}$$
where D(y, µ) is the deviance as defined in (4.3), φ is the dispersion parameter, and V(y) is the variance function applied to the observation. When there exists a distribution of the exponential family with a given variance function, it turns out that the EQL is the saddlepoint approximation to that distribution. Thus Q+, like Q, does not make a full distributional assumption but uses only the first two moments. The quasi-likelihood and extended quasi-likelihood for a single observation yit are listed in Table 4.1.
A distribution can be formed from an extended quasi-likelihood by normalizing
exp(Q+ ) with a suitable factor to make the sum or integral equal to unity. How-
ever, Nelder and Pregibon (1987) argued that the solution of the maximum quasi-
likelihood equations would be little affected by the omission of the normalizing fac-
tor because it was often found that the normalizing factor changed rather little with
those parameters.

Table 4.1 Quasi-likelihood and extended quasi-likelihood for a single observation yit. φ is the dispersion parameter. The extended quasi-likelihood is Q+(µit, yit) = −0.5{φ⁻¹D(µit, yit) + log(2πφA(µit))}. Note that D(µit, yit) and hence Q+(µit, yit) differ from those in Table 9.1 of McCullagh and Nelder (1989).

| Variance A(µit) | Quasi-likelihood Q*(µit, yit) | Deviance D(µit, yit) | Canonical link ηit = g(µit) |
|---|---|---|---|
| $1$ | $-(y_{it}-\mu_{it})^2/(2\phi)$ | $(y_{it}-\mu_{it})^2$ | $\mu_{it}$ |
| $\mu_{it}$ | $\{y_{it}\log(\mu_{it})-\mu_{it}\}/\phi$ | $2\{y_{it}\log(y_{it}/\mu_{it})-(y_{it}-\mu_{it})\}$ | $\log(\mu_{it})$ |
| $\mu_{it}^2$ | $-\{y_{it}/\mu_{it}+\log(\mu_{it})\}/\phi$ | $2\{y_{it}/\mu_{it}-\log(y_{it}/\mu_{it})-1\}$ | $-1/\mu_{it}$ |
| $\mu_{it}^{\zeta}$ | $\phi^{-1}\left\{\frac{y_{it}\mu_{it}^{1-\zeta}}{1-\zeta}-\frac{\mu_{it}^{2-\zeta}}{2-\zeta}\right\}$ | $2\,\frac{y_{it}^{2-\zeta}-(2-\zeta)y_{it}\mu_{it}^{1-\zeta}+(1-\zeta)\mu_{it}^{2-\zeta}}{(1-\zeta)(2-\zeta)}$ | $\mu_{it}^{1-\zeta}/(1-\zeta)$ |
| $\mu_{it}(1-\mu_{it})$ | $y_{it}\log\left\{\frac{\mu_{it}}{1-\mu_{it}}\right\}+\log(1-\mu_{it})$ | $-2\left[y_{it}\log\left\{\frac{\mu_{it}}{1-\mu_{it}}\right\}+\log(1-\mu_{it})\right]$ | $\log\{\mu_{it}/(1-\mu_{it})\}$ |
| $\mu_{it}+\mu_{it}^2/k$ | $y_{it}\log\left(\frac{\mu_{it}}{k+\mu_{it}}\right)+k\log\left(\frac{k}{k+\mu_{it}}\right)$ | $-2\left[y_{it}\log\left\{\frac{\mu_{it}(k+y_{it})}{y_{it}(k+\mu_{it})}\right\}+k\log\left(\frac{k+y_{it}}{k+\mu_{it}}\right)\right]$ | $\log\left(\frac{\mu_{it}}{k+\mu_{it}}\right)$ |

Remark 1. It is not clear how to improve this deviance function by incorporating a correlation structure.
If we assume independence, the sum of deviances for all observations in the data D is
$$D(\beta; I, \mathcal{D}) = \sum_{i=1}^{N}\sum_{t=1}^{n_i} D(y_{it}; \mu_{it}). \tag{4.6}$$

Pan (2001b) derived the Akaike Information Criterion (AIC) based on the independence QL for model selection in longitudinal data analysis. Later on, Hin and Wang (2009) discovered that this criterion is only valid for selecting useful predictors among (x1, x2, . . . , xp), but not for correlation structures, because it is derived assuming the independence model.
For estimating β, the QL estimating functions are again given by
$$\sum_{i=1}^{N} D_i^{\top} V_i^{-1}(Y_i - \mu_i) = 0, \tag{4.7}$$
where Vi is a diagonal variance matrix here.


As for the parameters in Vi and φ, the extended QL becomes useful: an iterative
approach between (4.7) and updating Vi can be adopted.
Note that Q(yit, yit) should be zero according to the definition. However, the Q(yit, yit) in Table 9.1 of McCullagh and Nelder (1989) does not become zero in general. For example, when A(µit) = µit², a constant Q*(yit, yit) = −{1 + log(yit)}/φ is missing. Therefore, Q*(yit, yit) needs to be subtracted in calculating D(β; I, D) (Wang and Hin, 2009), and the quasi-likelihood outlined in Table 9.1 of McCullagh and Nelder (1989) should be used with care. To be more specific, we should define Q(yit, µit) = Q*(yit, µit) − Q*(yit, yit) so that Q(yit, yit) = 0.
In Section 3.2, we also defined the QL when Yi is a correlated multivariate vector. Regardless of whether the integral in Q(µ; Y) is path-dependent or not, its derivatives with respect to β are unique,
$$U(\beta) = \sum_{i=1}^{N} D_i^{\top} V_i^{-1}(Y_i - \mu_i) = 0, \tag{4.8}$$
which takes the same form as (4.7), but Vi now is a covariance matrix incorporating a hypothesized correlation matrix.
As for the asymptotic variance of the estimators, due to the lack of the true likelihood for further inference, we can rely on the approximating covariance of U(β). We will provide more details when introducing the sandwich estimator. The book by Heyde (1997) has provided a comprehensive description of the quasi-likelihood theory and inferences for martingale families.

4.3 Gaussian Approach


For the independence case, we assume ni = 1 for simplicity. For the Gaussian likelihood given by (3.6), the score functions for the mean parameters β and all the other variance function parameters (denoted as τ, including φ) are
$$\partial L_G(\beta, \tau)/\partial\beta = \sum_{i=1}^{N} D_i^{\top}\frac{y_i - \mu_i}{\sigma_i^2} + \sum_{i=1}^{N} U_{\sigma_i}(\beta),$$
$$\partial L_G(\beta, \tau)/\partial\tau = \sum_{i=1}^{N}\sigma_i^{-2}\left\{1 - \frac{(y_i - \mu_i)^2}{\sigma_i^2}\right\}\frac{\partial\sigma_i^2}{\partial\tau},$$
where
$$U_{\sigma_i}(\beta) = \sigma_i^{-2}\left\{1 - \frac{(y_i - \mu_i)^2}{\sigma_i^2}\right\}\frac{\partial\sigma_i^2}{\partial\beta^{\top}}.$$
For example, if we let g(µ) = µ^γ, where γ > 0, the above likelihood function does not belong to the linear exponential distribution form because yi and other power functions of yi interact with µi. If both the mean and variance functions are correct, we have E{∂LG(β, τ)/∂β} = 0. In ordinary regression, we rely only on linear combinations of Si = (yi − µi) to gain protection against misspecification of the variance modeling. We can also achieve this by ignoring the second term Uσi(β) in ∂LG(β, τ)/∂β and using only the first term for estimating β (with some notation abuse in UG),
$$U_G(\beta; \tau) = \sum_{i=1}^{N} D_i^{\top}\frac{y_i - \mu_i}{\sigma_i^2}, \tag{4.9}$$
$$U_G(\tau; \beta) = \sum_{i=1}^{N}\sigma_i^{-2}\left\{1 - \frac{(y_i - \mu_i)^2}{\sigma_i^2}\right\}\frac{\partial\sigma_i^2}{\partial\tau^{\top}}. \tag{4.10}$$

This is precisely the iterative reweighted least-squares approach, obtained by iterating between (4.9) (updating the β estimates) and (4.10) (updating the σi weights). This makes one wonder if there is an underlying quasi-likelihood that produces such simpler estimating functions. We use "quasi" here because the "likelihood" we hypothesized is usually different from the true likelihood.
The Gaussian score function, obtained by differentiating equation (3.7) with respect to θ, for each component βj in β is
$$u(\beta_j; \tau) = \partial L_G/\partial\beta_j = \sum_{i=1}^{N}\left(\frac{\partial\mu_i}{\partial\beta_j}\right)^{\!\top}\Sigma_i^{-1}(Y_i - \mu_i) + \mathrm{tr}\left(\sum_{i=1}^{N}\left[\Sigma_i^{-1}(Y_i - \mu_i)(Y_i - \mu_i)^{\top} - I\right]\Sigma_i^{-1}(\partial\Sigma_i/\partial\beta_j)\right), \tag{4.11}$$
and for the variance parameter τj,
$$u(\tau_j; \beta) = \partial L_G/\partial\tau_j = \mathrm{tr}\left(\sum_{i=1}^{N}\left[\Sigma_i^{-1}(Y_i - \mu_i)(Y_i - \mu_i)^{\top} - I\right]\Sigma_i^{-1}(\partial\Sigma_i/\partial\tau_j)\right). \tag{4.12}$$

A key condition for consistency of the estimator is that the estimating equations
should be unbiased, at least asymptotically.
Again, to have protection against misspecification of Σi = φVi, we will drop the second term in u(βj; τ) and obtain the matrix form for β,
$$U(\beta; \tau) = \sum_{i=1}^{N} D_i^{\top} V_i^{-1}(Y_i - \mu_i), \tag{4.13}$$
$$u(\tau_j; \beta) = \mathrm{tr}\left(\sum_{i=1}^{N}\left[\Sigma_i^{-1}(Y_i - \mu_i)(Y_i - \mu_i)^{\top} - I\right]\Sigma_i^{-1}(\partial\Sigma_i/\partial\tau_j)\right). \tag{4.14}$$
We have E{U(β; τ)} = 0 when the mean assumption holds and E{u(τj; β)} = 0 when the covariance assumption holds. This will lead to consistency of the β estimates even when the Σi matrices are incorrectly modeled.
Another perspective on these estimating functions is the "decoupling" idea. If we write the mean µi explicitly as a function of β while writing Σi explicitly as a function of β* (as if it were a new set of parameters),
$$L_G^{*}(\beta; \tau) = \sum_{i=1}^{N}\left\{\log|\Sigma_i(\beta^*)| + (Y_i - \mu_i(\beta))^{\top}\Sigma_i(\beta^*)^{-1}(Y_i - \mu_i(\beta))\right\},$$
then minimization of LG*(β; τ) with respect to β will lead to the estimating functions given by (4.13).
Remark 2. log|Σi(β*)| is deemed a constant in this decoupling approach and can be ignored.
Remark 3. In practice, one can simply replace β* by the previous β estimates when minimizing LG*. For Gaussian estimation, under mild regularity conditions, when the working correlation matrix is either correctly specified as the true one R̃i or chosen as the identity matrix (the independence model), the Gaussian estimators of the regression and variance parameters are consistent (Wang and Zhao, 2007). To see this, we will check when we have E{U(β; τ)} = 0 and E{u(τj; β)} = 0. From equations (4.11) and (4.12), the unbiasedness condition for θj is
$$E\left(\mathrm{tr}\left[\Sigma_i^{-1}(Y_i - \mu_i)(Y_i - \mu_i)^{\top}\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\theta_j}\right]\right) - E\left(\mathrm{tr}\left[\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\theta_j}\right]\right) = 0. \tag{4.15}$$
Now we make some transformations of (4.15) to see the condition more clearly. For notational simplicity, let Σ̃i be the true covariance; thus Σ̃i = E[(Yi − µi)(Yi − µi)⊤] = Ai^{1/2}R̃iAi^{1/2}, where R̃i is the true correlation structure.
The left-hand side of (4.15) is
$$\begin{aligned}
& E\left(\mathrm{tr}\left[\Sigma_i^{-1}(Y_i - \mu_i)(Y_i - \mu_i)^{\top}\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\theta_j}\right]\right) - E\left(\mathrm{tr}\left[\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\theta_j}\right]\right) \\
&= -\mathrm{tr}\left(\frac{\partial\Sigma_i^{-1}}{\partial\theta_j}\,\tilde{\Sigma}_i\right) + \mathrm{tr}\left(\frac{\partial\Sigma_i^{-1}}{\partial\theta_j}\,\Sigma_i\right) \\
&= -2\,\mathrm{tr}\left(\frac{\partial A_i^{-1/2}}{\partial\theta_j}\, R_i^{-1} A_i^{-1/2} A_i^{1/2}\tilde{R}_i A_i^{1/2}\right) + 2\,\mathrm{tr}\left(\frac{\partial A_i^{-1/2}}{\partial\theta_j}\, A_i^{1/2}\right) \\
&= -2\,\mathrm{tr}\left(\frac{\partial A_i^{-1/2}}{\partial\theta_j}\, A_i^{1/2} R_i^{-1}\tilde{R}_i\right) + 2\,\mathrm{tr}\left(\frac{\partial A_i^{-1/2}}{\partial\theta_j}\, A_i^{1/2}\right) \\
&= -2\,\mathrm{tr}\left(\frac{\partial A_i^{-1/2}}{\partial\theta_j}\, A_i^{1/2}\left(R_i^{-1}\tilde{R}_i - I\right)\right).
\end{aligned} \tag{4.16}$$
It is clear that (4.16) will be 0 if Ri = R̃i. As both ∂Ai^{−1/2}/∂θj and Ai are diagonal matrices, (4.16) will also be 0 if the diagonal elements of {Ri⁻¹R̃i − I} are all 0. This will happen when Ri = I because the diagonal elements of R̃i are all 1. Thus, we can conclude that under either of the two conditions, Ri = R̃i or Ri = I, the Gaussian estimation will be consistent.
This implies that we can use the independence correlation structure if we have no idea about the true one, and the resulting estimator will be consistent under mild regularity conditions.
In general, (3.7) can be referred to as a "working" likelihood function that provides a sensible solution to supplying the working parameters needed for the mean parameter estimation.
In longitudinal data analysis, the mean response is usually modeled as a function of time and other covariates. Profile analysis (treating each time point as a category instead of a continuous variable) and parametric curves are the two popular strategies for modeling the time trend. In a parametric approach, we model the mean as an explicit function of time. If the profile means appear to change linearly over time, we can fit a linear model over time; if the profile means appear to change over time in a quadratic manner, we can fit a quadratic model over time. Appropriate tests may be used to check which model is the better choice. Clearly, profile analysis is only sensible if each subject has the same set of observation times.

4.4 Generalized Estimating Equations (GEE)


The main objective in longitudinal studies is to describe the marginal expectation of the outcome as a function of the predictor variables or covariates. As repeated observations are made on each subject, correlations among a subject's measurements may be generated. Thus the correlation should be accounted for to obtain an appropriate statistical analysis. However, the GLM only handles independent data.
When the outcome variable is approximately Gaussian with a constant variance, linear models are often used to fit the data, possibly after a power transformation. But for non-Gaussian longitudinal data, few models are available in the literature. The limited number of models for non-Gaussian longitudinal data is partly due to the lack of a rich class of multivariate distributions, such as the multivariate Gaussian, for the repeated observations on each subject. For this reason, likelihood-based methods are generally not popular.
Even for covariance modeling, it is usually a bitter pill we have to swallow if we have to assume the true covariance is a member of the hypothesized family, especially when the subjects are recruited from different areas and over an extended study period. Irregular observation times and missing data demand even more bravery when imposing such assumptions.
To this end, the generalized estimating equations (GEE) were developed by Liang and Zeger (1986). The framework of GEE is based on quasi-likelihood theory. In addition, a "working" correlation matrix for the repeated observations on each subject is nominated in GEE. We denote the "working" correlation matrix by Ri(α), which is a matrix with unknown parameters α. The nominated "working" correlation matrix need not be correct for GEE to produce consistent β estimators.
4.4.1 Estimation of Mean Parameters β
Recall that we assume the data are collected from N subjects indexed by the subscript i. The outcome vector for subject i consists of ni observations, Yi = (yi1, . . . , yini)⊤, and the covariates (predictors, features) are recorded in the matrix Xi, of dimension ni × p. The first column of Xi is usually a vector of 1s so that the intercept term is included unless otherwise specified. The corresponding observation times for Yi are in the vector Ti = (tij), of dimension ni.
The marginal mean of Yi is µi, and the fundamental assumption is the parametric relationship between Xi and E(Yi),
$$\mu_i = h(X_i\beta),$$
where β is a p × 1 vector of parameters and h is applied elementwise.


Akin to the GLM framework, we also need to specify the mean function in relation to the covariates. Apart from choosing the systematic components and the distribution family for the responses, we can also establish such a mean function h(Xi, β) as in nonlinear regression. For distributional assumptions, we can select normal, gamma, and inverse Gaussian random components for continuous data, and binary, multinomial, and Poisson components for discrete data.
Remark 4. The marginal mean µi depends on the p predictors via a single index θi = Xi⊤β. This assumption can easily be relaxed to be more general,
$$\mu_i = h(X_i, \beta).$$
This mean model will include the nonlinear regression setup and multiple index models as long as h is given a priori and the dimension of β, p, is fixed and usually much smaller than n.
Remark 5. If a spline function is assumed to model h, the number of parameters, p, will expand as n or ni increases. This leads to a new area of semiparametric modeling for longitudinal data analysis (Lin and Ying, 2001; Lin and Carroll, 2006; Li et al., 2010; Hua and Zhang, 2012).
The inverse of h, or h itself, is often referred to as the "link" function. In quasi-likelihood, the variance of yij, σij², is expressed as a known function g of the expectation µij, i.e.,
$$\sigma_{ij}^2 = \phi\, g(\mu_{ij}),$$
where φ is a scale parameter. Write Cov(Yi|Xi) as φAi^{1/2}R̃i(α)Ai^{1/2}, where R̃i(α) is the true correlation matrix and Ai = diag(σij²) is the diagonal variance matrix. Here φ is regarded as an overdispersion parameter in the GLM setting. The univariate specification is E(yij) = µij and var(yij) = φσij². The mean model is a familiar generalized linear model (GLM) specification, and the covariance factor Vi with elements (σij) is the GLM variance function or a generalized version with additional parameters to be more flexible.
The working covariance matrix for Yi is given by φVi, where
$$V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2}, \tag{4.17}$$
where Ai is an ni × ni diagonal matrix with var(yij) as the jth diagonal element.
Remark 6. The variances σij², and hence Ai, are parameterized by γ, which includes φ.
Remark 7. The correlation matrix Ri is parameterized by α.
Based on quasi-likelihood and the setup of a "working" correlation matrix, Liang and Zeger (1986) derived the generalized estimating equations, which produce consistent estimators of the regression coefficients under mild regularity conditions. The variances of the β estimators can also be estimated consistently by the so-called sandwich estimator without requiring that Vi be correctly modeled. Bear in mind that the covariance matrix of the β estimates depends on both the chosen working matrix and the true covariance matrix.
The “working independence” model has been advocated for a variety of robustness properties (Pepe and Anderson, 1994).
Sutradhar and Das (1999) concluded, on the basis of limited asymptotic studies with a fixed balanced design, that if there is a substantial risk of misspecification of the working correlation model, the analyst may be better off with the independence working model. In contrast, it has been suggested that efficiency gains can be obtained provided misspecification is not “too bad”. Note that independence models have been found to be fully efficient only when the covariate pattern is the same for all individuals (Lipsitz et al., 1994; Fitzmaurice, 1995; Mancl and Leroux, 1996). Fitzmaurice (1995) also considered more realistic cases in which the within-cluster covariates differ among clusters and found that independence models suffer substantial losses in efficiency. Mancl and Leroux (1996) provided a thorough study of the independence model when the true correlation structure is exchangeable.
Suppose τ collects all the variance parameters including the overdispersion parameter φ. For a given covariance model φVi, the generalized estimating equations (GEE) can be expressed as
\[
  U_{GEE}(\beta; \tau, \alpha) = \sum_{i=1}^{N} D_i^\top V_i^{-1} (Y_i - \mu_i) = 0_{p \times 1}, \tag{4.18}
\]
where µi = (µi1, . . . , µini)^⊤ and Di = ∂µi/∂β^⊤. In order to solve (4.18) for β, both Di and Vi must be given up to the parameter vector β, and all other parameters such as α and γ (except for φ) must be supplied a priori.
For convenience, we will let Ui be the contribution of subject i to UGEE, i.e., Ui = Di^⊤ Vi^{-1} Si, where Si = Yi − µi. The above GEE is already familiar: its form is the same as that of the estimating functions derived from the linear exponential family (4.1). In particular, when Vi is diagonal (independence working correlation model), UGEE reduces to the estimating functions derived from the QL approach (4.4), while a general Vi corresponds to the multivariate version given by (4.8). The decoupled version from Gaussian estimation also takes the same form; see (4.13).
Suppose that α̂ and γ̂ are √N-consistent estimators obtained via other approaches; the GEE (4.18) above will then solve UGEE(β; α̂, γ̂) = 0p×1. In general, α̂ and γ̂ will in turn require a √N-consistent estimator of β so that valid residuals can be obtained for estimating α and γ. The GEE estimator of β, β̂GEE, is therefore essentially obtained by solving UGEE(β; α̂(β), γ̂(β)) = 0p×1. We will omit the subscript indicating the dimension when there is no confusion.
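In practice this iterative scheme is automated by standard software. The following is a minimal sketch using the R package geepack; the data frame dat, with outcome y, covariate x and cluster identifier id, is a hypothetical example, not data from this book.

```r
## A minimal sketch, assuming a data frame `dat` with outcome y,
## covariate x and cluster identifier id (all hypothetical names).
library(geepack)
fit <- geeglm(y ~ x, id = id, data = dat,
              family = poisson, corstr = "exchangeable")
summary(fit)  # reports robust (sandwich) standard errors for beta
```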
Under mild regularity conditions, and provided that the link function is correctly specified under minimal assumptions about the time dependence, Liang and Zeger (1986) showed that as N → ∞, β̂GEE is a consistent estimator of β and that √N(β̂GEE − β) is asymptotically multivariate Gaussian with covariance matrix given by
\[
  \lim_{N\to\infty} N \Big(\sum_{i=1}^{N} D_i^\top V_i^{-1} D_i\Big)^{-1} \Big(\sum_{i=1}^{N} D_i^\top V_i^{-1} \mathrm{Cov}(Y_i) V_i^{-1} D_i\Big) \Big(\sum_{i=1}^{N} D_i^\top V_i^{-1} D_i\Big)^{-1}. \tag{4.19}
\]

This matrix can be estimated consistently without any direct knowledge of Cov(Yi), because Cov(Yi) can simply be replaced by the residual product Ŝi Ŝi^⊤, and α, β and φ by their estimates, in equation (4.20). This leads to the well-celebrated sandwich estimator for the covariance of β̂GEE,
\[
  \widehat{V}_{GEE} = \Big(\sum_{i=1}^{N} D_i^\top V_i^{-1} D_i\Big)^{-1} \Big(\sum_{i=1}^{N} D_i^\top V_i^{-1} \hat{S}_i \hat{S}_i^\top V_i^{-1} D_i\Big) \Big(\sum_{i=1}^{N} D_i^\top V_i^{-1} D_i\Big)^{-1}, \tag{4.20}
\]
in which all the matrices Di and Vi are evaluated at the final parameter estimates β̂GEE, α̂, and γ̂.
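The sandwich estimator (4.20) can also be assembled directly. The following is a minimal sketch, assuming lists D_list, V_list and S_list that hold Di, Vi and the residuals Ŝi for each subject, all evaluated at the final estimates (the list names are hypothetical).

```r
## A minimal sketch of (4.20); D_list, V_list, S_list are assumed lists of
## D_i (n_i x p), V_i (n_i x n_i) and residual vectors S_i per subject.
sandwich_cov <- function(D_list, V_list, S_list) {
  p <- ncol(D_list[[1]])
  bread <- matrix(0, p, p)  # running sum of D_i' V_i^{-1} D_i
  meat  <- matrix(0, p, p)  # running sum of D_i' V_i^{-1} S_i S_i' V_i^{-1} D_i
  for (i in seq_along(D_list)) {
    DtVinv <- t(D_list[[i]]) %*% solve(V_list[[i]])
    bread  <- bread + DtVinv %*% D_list[[i]]
    meat   <- meat + DtVinv %*% tcrossprod(S_list[[i]]) %*% t(DtVinv)
  }
  solve(bread) %*% meat %*% solve(bread)
}
```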
When Ri is correctly specified and α is known, the generalized estimating equations given by (4.18) are optimal for β in the class of linear functions of Yi, according to the Gauss–Markov theorem. In practice, the correlation matrix is often unknown and we rely on the estimation of α given β. This leads to simultaneous or iterative estimation of β and α (Zhao and Prentice, 1990; Liang et al., 1992).
A key robustness property motivating the widespread use of generalized estimating equations is the consistency of the solution β̂ whether or not the working correlation structure Ri is correctly specified. Careful modeling of Ri will improve the efficiency of β̂, especially when the sample size is not large (Albert and McShane, 1995). The optimal asymptotic efficiency of β̂ is obtained when Ri coincides with the true correlation matrix. In practice, Ri would be chosen to be the most reasonable for the data, based on either statistical criteria or biological background.
Although the GEE approach provides consistent regression coefficient estimators, the estimation efficiency may fluctuate greatly with the specification of the “working” covariance matrix. The “working” covariance has two parts: one is the “working” correlation structure; the other is the variance function. The existing literature has focused on the specification of the “working” correlation, while the variance function is often assumed to be correctly chosen, such as the Poisson or Gaussian variance function. In real data analysis, if the variance function is also misspecified, the estimation efficiency will be low. It is, therefore, crucial to model the variance function in order to improve the estimation efficiency of β. Relevant work is abundant (Dempster, 1972; Davidian and Carroll, 1987; O’Hara-Hines, 1998; Wang and Lin, 2005; Wang and Hin, 2009).
The GEE approach estimates β using only linear combinations of the data as
given by (4.18). In some cases, the variance and covariance can contain information
on µi j and hence β, and it then becomes sensible to construct estimating functions
based on the quadratic functions of the data. GEE2 proposed by Prentice and Zhao
(1991) aims to construct the “optimal” combination of both linear and quadratic
residuals for estimating β. Prentice and Zhao (1991) and Fitzmaurice et al. (1993)
also obtained this type of estimating functions using a likelihood-based method. The
GEE2 can, therefore, be regarded as driven by absorbing higher moment information.
The connection between the likelihood equations and the GEE2 is well described by
Fitzmaurice et al. (1993).
It is true that the GEE2 may produce more efficient estimators of β when the third and fourth moments can be correctly specified. However, as pointed out by Prentice and Zhao (1991), models for means and covariances can typically be sensibly specified and are readily interpreted. Models for higher order parameters are usually specified in an ad hoc fashion and are unlikely to accurately approximate the real complexity.
Apart from being computationally intensive as ni gets large, the GEE2 approach also has the following drawbacks: (i) the mean parameter estimates are no longer consistent under covariance misspecification; (ii) no efficiency is gained if the third or the fourth moment functions are incorrect (Liang et al., 1992). In general, little can be gained by incorporating the third and fourth moments in the estimation. Therefore, in practice, one may wish to restrict the assumptions to the mean and covariance functions only.

4.4.2 Estimation of Variance Parameters τ

As we can see, the GEE approach requires the input of a covariance matrix for each subject. We now present how the variance function can be estimated, and then proceed to the correlation matrix in §4.4.3. Recall that τ collects all the variance parameters including the overdispersion parameter φ.
In the GEE approach, much attention is put on the specification of the working correlation structure while ignoring the importance of modeling the variance function.
The existing approaches for estimating the variance parameters in correlated data analysis, such as maximum likelihood (ML) and restricted maximum likelihood (REML), do not “decouple” the variance and the correlation, in the sense that the estimator of the variance parameters always relies on the assumed correlation structure (Davidian and Giltinan, 1995; Pinheiro and Bates, 2000).

4.4.2.1 Gaussian Estimation

Suppose Ri is the nominated correlation matrix for estimating γ. Note that this choice does not have to be the same as the “working” correlation matrix. The Gaussian log-likelihood for the complete data vector Y is
\[
  \sum_{i=1}^{N} \log\Big[ \{\det(2\pi\Sigma_i)\}^{-1/2} \exp\Big\{ -\frac{1}{2} (Y_i - \mu_i)^\top \Sigma_i^{-1} (Y_i - \mu_i) \Big\} \Big],
\]
where Σi = φVi is the working covariance matrix. The corresponding Gaussian score function with respect to τj is obtained from the Gaussian −2 log-likelihood, as shown by expression (4.12),
\[
  u(\tau_j; \beta) = \mathrm{tr}\Big[ \sum_{i=1}^{N} \{\Sigma_i^{-1} (Y_i - \mu_i)(Y_i - \mu_i)^\top - I\} \Sigma_i^{-1} \frac{\partial \Sigma_i}{\partial \tau_j} \Big].
\]
As shown in §4.3, when either the true correlation R = R̃i or the naïve independence R = I is used, the Gaussian estimator of τ is consistent under mild conditions.
If the specified variance function is far from the true one, the “correct” choice of the working correlation matrix may no longer be the true one. In real data analysis, it is difficult to determine the variance function, and we may well choose an incorrect one. Therefore, akin to the correlation parameters, the variance parameters are also subject to the pitfalls discussed by Crowder (1995) under model misspecification. Well-behaved estimators for these working parameters are thus highly desirable to avoid possible convergence problems and to produce efficient estimators of β (Wang and Carey, 2003). The Gaussian working likelihood is a convenient and useful approach for analyzing longitudinal data.
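For illustration, nlme's gls function fits exactly such a working Gaussian likelihood with a nominated variance function and correlation structure, even when the data are not normal. The sketch below is minimal and assumes a hypothetical data frame dat with columns y, x and id.

```r
## A minimal sketch, assuming a data frame `dat` with y, x and id;
## gls maximizes a Gaussian likelihood regardless of the true distribution.
library(nlme)
fit <- gls(y ~ x, data = dat,
           weights = varPower(form = ~ fitted(.)),      # variance ~ |mu|^(2*delta)
           correlation = corCompSymm(form = ~ 1 | id))  # exchangeable R_i
summary(fit)
```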

4.4.2.2 Extended Quasi-likelihood

We can adopt the extended QL under independence for estimating the variance parameters. The variance function is V(µ) = µ^γ, and Var(y) = φV(µ). Under these settings, the quasi-likelihood contribution of a single observation is
\[
  Q_i^+(y_i; \mu_i, \gamma) = -\frac{1}{2} \log\{2\pi V(y_i)\} - \frac{1}{2} D(y_i; \mu_i), \tag{4.21}
\]
where D(yi; µi) is the deviance function given by
\[
  D(y_i; \mu_i) = -2 \int_{y_i}^{\mu_i} \frac{y_i - u}{V(u)}\, du.
\]

For the variance function V(µ) = µ^γ, the deviance function is
\[
  D(y; \mu) =
  \begin{cases}
    2\{y \log(y/\mu) - (y - \mu)\} & \gamma = 1, \\[4pt]
    2\{y/\mu - \log(y/\mu) - 1\} & \gamma = 2, \\[4pt]
    \dfrac{2\{y^{2-\gamma} - (2-\gamma)\, y \mu^{1-\gamma} + (1-\gamma)\, \mu^{2-\gamma}\}}{(1-\gamma)(2-\gamma)} & \text{otherwise}.
  \end{cases}
\]
When γ ≠ 1, 2, the score estimating equation with respect to γ can be expressed as Σ_{i=1}^N Σ_{j=1}^{ni} qij(yij; γ) = 0, where
\[
\begin{aligned}
  q_{ij}(y_{ij}; \gamma) = \frac{\partial Q_{ij}^+}{\partial \gamma}
  &= -\frac{1}{2} \log y_{ij} - \frac{y_{ij}^{2-\gamma} (2\gamma - 3)}{(1-\gamma)^2 (2-\gamma)^2} - \frac{y_{ij}^{2-\gamma} \log y_{ij}}{(1-\gamma)(2-\gamma)} \\
  &\quad + \frac{y_{ij}\, \mu_{ij}^{1-\gamma} \log \mu_{ij}}{1-\gamma} - \frac{y_{ij}\, \mu_{ij}^{1-\gamma}}{(1-\gamma)^2} - \frac{\mu_{ij}^{2-\gamma} \log \mu_{ij}}{2-\gamma} + \frac{\mu_{ij}^{2-\gamma}}{(2-\gamma)^2}.
\end{aligned}
\]
The success of the extended QL is nicely demonstrated by Nelder and Pregibon
(1987).
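Numerically, γ can be estimated by profiling the extended QL directly. The sketch below is a minimal illustration, assuming vectors y > 0 and fitted means mu are available from a working mean fit; it fixes φ = 1 and is not the book's own code.

```r
## A minimal sketch, assuming y > 0 and fitted means mu; phi is held
## fixed at 1 here, and gamma must avoid the values 1 and 2 exactly.
eql_power <- function(gamma, y, mu, phi = 1) {
  dev <- 2 * (y^(2 - gamma) - (2 - gamma) * y * mu^(1 - gamma) +
                (1 - gamma) * mu^(2 - gamma)) / ((1 - gamma) * (2 - gamma))
  sum(-0.5 * log(2 * pi * phi * y^gamma) - dev / (2 * phi))
}
## profile over an interval excluding 1 and 2:
# gamma_hat <- optimize(eql_power, c(1.01, 1.99), y = y, mu = mu,
#                       maximum = TRUE)$maximum
```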

4.4.2.3 Nonlinear Regression

Denote the residuals by rij = yij − µij. Clearly, the variance information lies in the squared residuals: r²ij has expectation σ²ij = φg(µij). This suggests a nonlinear regression of {r²ij} on φg(µij). The estimator τ̂ can be obtained by the least-squares approach (Davidian and Carroll, 1987), minimizing
\[
  \sum_{i=1}^{N} \sum_{j=1}^{n_i} \{r_{ij}^2 - \phi g(\hat{\mu}_{ij})\}^2.
\]
For normal data the squared residuals have approximate variance proportional to σ⁴ij; generalized weighted least squares can therefore be considered as well, minimizing
\[
  \sum_{i=1}^{N} \sum_{j=1}^{n_i} \{r_{ij}^2 - \phi g(\hat{\mu}_{ij})\}^2 / g^2(\hat{\mu}_{ij}),
\]
where the weights are evaluated at a preliminary estimator γ̂ of γ.
Alternatively, one can also apply the LS approach to the logarithmically transformed residuals, minimizing
\[
  \sum_{i=1}^{N} \sum_{j=1}^{n_i} \big[2 \log(|r_{ij}|) - \log\{\phi g(\hat{\mu}_{ij})\}\big]^2.
\]
One pitfall is that if any of the residuals are close to 0, the regression could be adversely affected by these large negative outliers.
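This regression is easy to carry out with nls. The sketch below, for the power variance g(µ) = µ^γ, is a minimal illustration assuming a response vector y and fitted means mu from a working mean fit; the weights use the preliminary value γ = 1.

```r
## A minimal sketch for sigma^2 = phi * mu^gamma, assuming vectors
## y and mu (fitted means) are available from a working mean fit.
r2   <- (y - mu)^2
vfit <- nls(r2 ~ phi * mu^gamma,
            start = list(phi = 1, gamma = 1),
            weights = 1 / mu^2)   # approximate 1/g^2(mu) weights at gamma = 1
coef(vfit)                        # estimates of (phi, gamma)
```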

4.4.2.4 Estimation of Scale Parameter φ

In the GLM setting, φ is regarded as the overdispersion parameter. In the GEE approach, this scale parameter does not affect the β estimation; however, the variance of the β estimates is, in fact, proportional to φ. Therefore, the final φ estimate can play a critical role in statistical inference (hypothesis testing and confidence intervals). The scale parameter is simply one of the variance parameters, so the Gaussian approach, the extended QL, and the regression approach are all applicable for estimating φ.
All the fundamental approaches, such as maximum likelihood, quasi-likelihood, Gaussian and moment estimation, can be considered for estimating φ, depending on the overall longitudinal model used in the analysis. Moment estimation is sensible and can be advantageous compared to the likelihood-based approaches. Moment estimation for φ is often based on the Pearson residuals,
\[
  \frac{1}{N - p} \sum_{i=1}^{N} \sum_{j=1}^{n_i} \frac{(y_{ij} - \hat{\mu}_{ij})^2}{g(\hat{\mu}_{ij})},
\]
where p is the number of regression parameters.
For the Gaussian likelihood approach and given β̂, we can write the independence log-likelihood as an explicit function of φ (Carroll and Ruppert, 1982),
\[
  l(\beta, \gamma, \phi) = -\frac{\log \phi}{2} \sum_{i=1}^{N} n_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{n_i} \log\{g(\hat{\mu}_{ij})\} - (2\phi)^{-1} \sum_{i=1}^{N} \sum_{j=1}^{n_i} \frac{(y_{ij} - \mu_{ij})^2}{g(\hat{\mu}_{ij})},
\]
leading to
\[
  \hat{\phi}_{Gau} = \frac{1}{\sum_{i=1}^{N} n_i} \sum_{i=1}^{N} \sum_{j=1}^{n_i} \frac{(y_{ij} - \mu_{ij})^2}{g(\hat{\mu}_{ij})}.
\]
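Both estimators above are one-liners given the pooled Pearson residuals. The sketch below is minimal and assumes pr is the vector of Pearson residuals (yij − µ̂ij)/√g(µ̂ij) over all observations and p the number of regression parameters; note that the moment version here divides by the total number of observations minus p, one common convention.

```r
## Minimal sketches; pr is an assumed pooled vector of Pearson residuals
## and p the number of regression parameters.
phi_moment <- function(pr, p) sum(pr^2) / (length(pr) - p)  # Pearson-based
phi_gauss  <- function(pr)    mean(pr^2)                    # Gaussian version
```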

4.4.3 Estimation of Correlation Parameters

The GEE approach is an elegant method for estimation of β, which can be regarded as the optimal linear combination of the data according to the Gauss–Markov theorem (Heyde, 1997).
We model Cov(Yi) by the working matrix φVi, which can also be written in the form φ Ai^{1/2} Ri(α) Ai^{1/2}, with Ai the diagonal matrix introduced earlier and Ri(α) the working correlation matrix of Yi. Apart from the variance parameters τ, there are also “nuisance” parameters α in Ri that we have to estimate. In the same way as the τ estimates are used in Ai, the α estimates for the correlation matrix are then used in the GEE for estimating β.
For estimation of the correlation parameters α, the estimation method often depends on the chosen correlation structure. Liang and Zeger (1986) illustrated these methods by several examples.
The number of correlation parameters and the estimator of α vary from case to case in the literature. Most researchers follow Liang and Zeger (1986), who discussed a number of important special cases. We now discuss how to model the correlation matrix. Commonly used correlation structures for Ri are stationary, in the sense that the correlation matrix does not change when all observations are shifted in time.

4.4.3.1 Stationary Correlation Structures

The following are the typical “working” correlation structures, together with estimators of the “working” correlation parameters. The number of correlation parameters varies with the “working” correlation structure, as introduced in Section 3.7.
• Independence model
\[
  R = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}_{n \times n}.
\]
• Compound symmetry (exchangeable or equal correlation)
\[
  R = \begin{pmatrix} 1 & \alpha & \cdots & \alpha \\ \alpha & 1 & \cdots & \alpha \\ \vdots & & \ddots & \vdots \\ \alpha & \alpha & \cdots & 1 \end{pmatrix}_{n \times n}.
\]
• First-order autoregressive: AR(1)
\[
  R = \begin{pmatrix} 1 & \alpha & \alpha^2 & \cdots & \alpha^{n-1} \\ \alpha & 1 & \alpha & \cdots & \alpha^{n-2} \\ \alpha^2 & \alpha & 1 & \cdots & \alpha^{n-3} \\ \vdots & & & \ddots & \vdots \\ \alpha^{n-1} & \alpha^{n-2} & \alpha^{n-3} & \cdots & 1 \end{pmatrix}_{n \times n}.
\]
• Moving average
\[
  R = \begin{pmatrix} 1 & \alpha & 0 & \cdots & 0 & 0 \\ \alpha & 1 & \alpha & \cdots & 0 & 0 \\ 0 & \alpha & 1 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & \alpha & 1 \end{pmatrix}_{n \times n}.
\]
• m-dependent correlation
This is a generalization of the above first-order moving average model to order m,
\[
  \mathrm{Corr}(y_{ij}, y_{i,j+t}) =
  \begin{cases}
    1 & t = 0, \\
    \alpha_t & t = 1, 2, \ldots, m, \\
    0 & t > m.
  \end{cases}
\]
• Toeplitz
This is essentially a special case of the m-dependent model with the largest m possible, m = n − 1. Note that this structure is also “stationary”, as the correlation depends on the lag but not on the starting time.
\[
  R = \begin{pmatrix} 1 & \alpha_1 & \alpha_2 & \cdots & \alpha_{n-1} \\ \alpha_1 & 1 & \alpha_1 & \cdots & \alpha_{n-2} \\ \alpha_2 & \alpha_1 & 1 & \cdots & \alpha_{n-3} \\ \vdots & & & \ddots & \vdots \\ \alpha_{n-1} & \alpha_{n-2} & \alpha_{n-3} & \cdots & 1 \end{pmatrix}_{n \times n}.
\]
The above correlation structures are only appropriate for equally-spaced measurements, and all subjects must have the same observation times in order to share the same correlation structure.
If stationarity is in question, or observations are irregularly spaced, we may consider
• Unstructured
\[
  R = \begin{pmatrix} 1 & \alpha_{12} & \alpha_{13} & \cdots & \alpha_{1n} \\ \alpha_{21} & 1 & \alpha_{23} & \cdots & \alpha_{2n} \\ \alpha_{31} & \alpha_{32} & 1 & \cdots & \alpha_{3n} \\ \vdots & & & \ddots & \vdots \\ \alpha_{n1} & \alpha_{n2} & \alpha_{n3} & \cdots & 1 \end{pmatrix},
\]
with αjk = αkj.
Note that, again, all subjects must have the same observation times in order to
share the same correlation structure.
All these correlation structures can easily be implemented in the R package nlme. For example, if ID is the factor defining the subject levels, the AR(1) covariance model can be fitted by passing correlation = corAR1(form = ~ 1 | ID) to the gls function.
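In full, a minimal sketch of the call (with a hypothetical data frame dat containing y, time and ID):

```r
## A minimal sketch; `dat` is a hypothetical data frame with columns
## y, time and the subject factor ID.
library(nlme)
fit <- gls(y ~ time, data = dat,
           correlation = corAR1(form = ~ 1 | ID))
summary(fit)  # the estimated AR(1) parameter is reported with the fit
```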
The moment estimators can easily be constructed for these models. The estimators are all expressed in terms of the Pearson residuals. Denote the Pearson residual by êij = (yij − µij)/Aij^{1/2}, j = 1, 2, . . . , ni, where Aij is the jth diagonal element of the diagonal variance matrix Ai. Note that σ²ij = φAij, so that E(ê²ij) ≈ φ.
For the m-dependent model, we use only the t-lagged residuals for estimating αt, t = 1, 2, . . . , m,
\[
  \hat{\alpha}_t = \frac{2 \sum_{i=1}^{N} \sum_{j \le n_i - t} \hat{e}_{ij}\, \hat{e}_{i,j+t}}{\sum_{i=1}^{N} \sum_{j \le n_i - t} (\hat{e}_{ij}^2 + \hat{e}_{i,j+t}^2)}. \tag{4.22}
\]

This estimator (with m = 1) is also applicable to both the moving average and the AR(1) models.
For the exchangeable structure, we can make use of all the lagged residual products,
\[
  \hat{\alpha} = \frac{2 \sum_{i=1}^{N} \sum_{j < k} \hat{e}_{ij}\, \hat{e}_{ik}}{\sum_{i=1}^{N} \sum_{j < k} (\hat{e}_{ij}^2 + \hat{e}_{ik}^2)}.
\]
For the unstructured correlation, we use all the pairwise products; the estimator for the parameter αjk relies on the residuals at times j and k,
\[
  \hat{\alpha}_{jk} = \frac{2 \sum_{i=1}^{N} \hat{e}_{ij}\, \hat{e}_{ik}}{\sum_{i=1}^{N} (\hat{e}_{ij}^2 + \hat{e}_{ik}^2)}.
\]
Note that all these correlation estimators are bounded between −1 and 1 by the Cauchy inequality. This does not necessarily mean, however, that the resulting matrices are always positive definite. This brings up another issue, pointed out by Crowder (1995), which we discuss further towards the end of this section.
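These moment estimators are simple to compute. The sketch below implements the lag-t estimator (4.22) from a list of Pearson residual vectors, one per subject; the input list is an assumed structure, not from the book.

```r
## A minimal sketch of (4.22); e_list is an assumed list of Pearson
## residual vectors, one vector per subject.
alpha_lag <- function(e_list, t = 1) {
  num <- 0; den <- 0
  for (e in e_list) {
    n <- length(e)
    if (n > t) {
      j <- seq_len(n - t)
      num <- num + 2 * sum(e[j] * e[j + t])
      den <- den + sum(e[j]^2 + e[j + t]^2)
    }
  }
  num / den  # always in (-1, 1) by the Cauchy inequality
}
```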
The correlation matrix R(α) has q = dim(α) parameters. For example, in a study involving T equidistant follow-up visits, the unstructured correlation matrix for an individual with complete data will have q = T(T − 1)/2 correlation parameters; if the repeated observations are assumed exchangeable, R will have the compound symmetry structure and q = 1. A parsimonious parametrization of the correlation structure is desired in order to optimize the efficiency of the estimation procedure.
Note that an underlying assumption here is that the correlations depend on j, k but not on the subject i. Sometimes this may not hold, and the covariance matrix varies between subjects. Even the unstructured matrix will not be able to accommodate this, because the subjects can no longer share the same correlation matrix.
For observations at continuous and irregular times, ti1, ti2, ti3, and ti4, for example, we may wish to consider the first-order autoregressive structure with continuous time, CAR(1),
\[
  R = \begin{pmatrix}
    1 & \alpha^{|t_{i1} - t_{i2}|} & \alpha^{|t_{i1} - t_{i3}|} & \cdots & \alpha^{|t_{i1} - t_{in}|} \\
    \alpha^{|t_{i2} - t_{i1}|} & 1 & \alpha^{|t_{i2} - t_{i3}|} & \cdots & \alpha^{|t_{i2} - t_{in}|} \\
    \alpha^{|t_{i3} - t_{i1}|} & \alpha^{|t_{i3} - t_{i2}|} & 1 & \cdots & \alpha^{|t_{i3} - t_{in}|} \\
    \vdots & & & \ddots & \vdots \\
    \alpha^{|t_{in} - t_{i1}|} & \alpha^{|t_{in} - t_{i2}|} & \alpha^{|t_{in} - t_{i3}|} & \cdots & 1
  \end{pmatrix}_{n \times n}.
\]

In this case each subject has a different correlation matrix, depending on the specific observation times. Muñoz et al. (1992) proposed a damped exponential correlation structure for modeling multivariate outcomes.
The damped exponential correlation structure applies a power transformation to the lag: two observations at lag s are modeled as if the time lag were s^ϕ in the CAR(1) model, so the correlation is α^{s^ϕ}, with ϕ a damping parameter. The correlation structures of compound symmetry, first-order autoregressive, and first-order moving average processes are obtained by setting ϕ = 0, ϕ = 1, and letting ϕ → ∞, respectively.
Thus, correlation structures with q = 2 can model quite parsimoniously a variety of forms of dependence, including slowly decaying autocorrelation functions and autocorrelation functions that decay faster than the commonly used first-order autoregressive model.
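The damped exponential family is also easy to generate directly, as in the minimal sketch below (the function name is hypothetical); ϕ = 1 recovers CAR(1) and ϕ = 0 gives a compound symmetry pattern off the diagonal.

```r
## A minimal sketch: damped-exponential working correlation for one
## subject with observation times t, correlation alpha and damping phi.
damped_cor <- function(t, alpha, phi = 1) {
  lag <- abs(outer(t, t, "-"))
  R <- alpha^(lag^phi)
  diag(R) <- 1  # guard the diagonal (0^0 = 1 in R when phi = 0)
  R
}
damped_cor(c(0, 1, 2.5, 4), alpha = 0.6, phi = 0.5)
```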
However, estimation of α is not straightforward here, as the moment estimators cannot be constructed easily. We will, therefore, need to construct supplementary estimating functions for α. Further complications may arise when sampling involves nested, spatial and temporal random effects. Intuitively, careful modeling of the correlation structure should improve the efficiency of estimation. Diggle et al. (1994) provide a comprehensive review of relevant techniques.

4.4.3.2 Generalized Markov Correlation Structure

The vector of observation times for subject i is (ti1, . . . , ti,ni). The generalized Markov correlation structure assumes the Markov correlation structure after applying a Box–Cox type of power transformation on time. The correlation between yij and yik is given, for j ≠ k, by α^{∆i(j,k;λ)}, where ∆i(j, k; λ) = |tij^λ − tik^λ|/λ (Núñez-Antón and Woodworth, 1994). As one can see, this correlation function is no longer stationary in time unless λ = 1. The generalized Markov correlation structure accommodates irregular and nonlattice-valued observation times, which is quite the norm in practice (Shults and Chaganty, 1998). This correlation is quite rich and flexible and is appropriate in many practical cases. A case of particular interest is the genetic distance among family members when data are clustered in pedigrees (Trègouët et al., 1997).
In many cases, the correlation parameter describes the association between responses from the same cluster (such as a family, parents, or area) and may be of scientific interest. For example, the intraclass correlation is commonly used to measure the degree of similarity or resemblance among siblings. In these cases, efficient estimation of the association parameters is valuable. In genetic studies, we often rely on associations for prediction, so we have to come up with a correlation structure and estimate the correlation-related parameters (correlations may be governed by covariates such as dam weight in developmental studies). Correlation and mean parameters are equally important in these cases.
Apart from simple moment estimators, the supplementary set of estimating functions for α can be constructed more elegantly to enhance the estimation efficiency and avoid possible pitfalls (Lipsitz et al., 1991; Prentice and Zhao, 1991; Liang et al., 1992; Hall and Severini, 1998). In the analysis of real data, misspecification is probably the norm, and the efficiency of β̂ depends on how close the working matrix is to the true correlation. Because the β̂ values obtained from (4.18) depend on the values of α used, estimation methods for α are, therefore, important for improving the efficiency of estimation of β. On the other hand, in many cases, estimates of α may be of scientific interest as well (Lipsitz et al., 1991). More details will be given in Section 9.5.

4.4.3.3 Second Moment Method

In fact, the moment estimators can be formulated using a linear combination of the residual products, which reflect the association between any two responses from the same subject. It is, therefore, quite natural to consider all pairwise residual products. Let zikj = (yik − µik)(yij − µij); following the GEE idea for estimating β, we have the following estimating functions for α,
\[
  U_M(\alpha; \beta) = \sum_{i=1}^{N} \Big(\frac{\partial \delta_i}{\partial \alpha}\Big)^{\!\top} Q_i^{-1} (Z_i - \delta_i),
\]
in which Zi is the vector of residual products (zikj), δi = E(Zi), and Qi is a working covariance matrix for Zi. Note that the dimension of Zi is $\binom{n_i}{2}$.
The working covariance Qi may be chosen as a diagonal weighting. For binary responses, var(zikj) = δikj(1 − δikj). In general, if var(zikj) is unknown, one may choose the diagonal elements of Qi as σ²ik σ²ij. Prentice and Zhao (1991) proposed a few specifications of Qi to account for the correlations between higher orders.

4.4.3.4 Gaussian Estimation

Whittle (1961) introduced Gaussian estimation, which uses a normal log-likelihood as a vehicle for estimation without assuming that the data are normally distributed. Crowder (1985) also proposed this approach for analyzing correlated binomial data. Later on, Crowder (2001) further proposed a modified Gaussian estimation for the analysis of repeated measures. The modification is to decouple the β parameters in the covariance matrix from the β in the mean, so that the estimating functions for β ignore the information on β in Vi and Vi plays the role of weighting. In our current setting, we have additional parameters γ in Ai and α in Ri.
Let θ be the vector collecting all the parameters, including β and τ = (γ, φ, α), where γ and α are the parameter vectors in Ai and the correlation structure, respectively. When necessary, we will write µi and Vi explicitly as µi = µi(β), Ai = Ai(β, γ), Ri = Ri(α) and Vi = Vi(β, τ). We shall be interested in consistent α estimators that offer protection against a misspecified Ri, so as to improve the asymptotic efficiency of the β estimators. The working Gaussian −2 log-likelihood for the data (Y1, . . . , YN) is
\[
  L_G(\theta) = \sum_{i=1}^{N} \big[ \log\{\det(2\pi\phi V_i)\} + \phi^{-1} (Y_i - \mu_i)^\top V_i^{-1} (Y_i - \mu_i) \big]. \tag{4.23}
\]

We can rely on Gaussian estimation to get consistent estimates of the correlation parameters in the same way as for the variance parameters. The score function with respect to αj has the same form as (4.12),
\[
  \mathrm{tr}\Big[ \sum_{i=1}^{N} \{V_i^{-1} (Y_i - \mu_i)(Y_i - \mu_i)^\top - I_{n_i}\} V_i^{-1} \frac{\partial V_i}{\partial \alpha_j} \Big]. \tag{4.24}
\]
Therefore, when the covariance is correctly specified or when the independence model is used, the estimating functions for γ from the Gaussian method will be unbiased from 0. A safe estimation approach is simply to use the independence model when it is difficult to model the true correlations, as the resulting estimators of (β, γ) will be consistent under mild regularity conditions. The full estimation procedure for θ = (β, γ, φ, α) can be described by the following iterative algorithm.
(i) Obtain initial estimates of β from the independence model or some other naive model.
(ii) For given β = β̂, apply the Gaussian approach under the independence model to obtain/update the estimator of γ, i.e., minimize LG0(β̂, φ, γ) with respect to (φ, γ), as also given by (3.6),
\[
  L_{G0}(\hat\beta, \phi, \gamma) = \sum_{i=1}^{N} \big[ \log\{\det(\phi \hat{A}_i^2)\} + \phi^{-1} (Y_i - \hat{\mu}_i)^\top \hat{A}_i^{-2} (Y_i - \hat{\mu}_i) \big],
\]
where µ̂i and Âi are evaluated at β = β̂.
(iii) For given (β, γ) = (β̂, γ̂), apply the Gaussian likelihood approach to obtain/update the estimator of α, i.e., minimize LG(β̂, γ̂, φ̂, α), given by (3.7), with respect to α. Alternatively, the estimator of α may be obtained by other existing methods (Wang and Carey, 2003).
(iv) For given τ = (γ̂, φ̂, α̂), update β̂ by minimizing the Gaussian objective L̂G(β) with respect to β, where L̂G(β) is LG(β, γ̂, φ̂, α̂) given by (4.23) with Vi evaluated at the previous estimate θ̂.
(v) Iterate between (ii) and (iv) until the desired convergence, as in ordinary iterative algorithms.
For the estimation of β, the β in Vi is decoupled from the β in µi. The resulting estimating equations for β therefore become the same as (4.18), which have consistency protection against misspecification of Vi. Recall that ei = Ai^{-1}(Yi − µi). The corresponding estimating functions for (γ, φ, α) can be written as
\[
  g(\gamma_j; \theta) = \mathrm{tr}\Big[ \sum_{i=1}^{N} (e_i e_i^\top - \phi I_{n_i}) A_i^{-1} \frac{\partial A_i}{\partial \gamma_j} \Big], \tag{4.25}
\]
\[
  \phi = \frac{1}{\sum_i n_i} \sum_{i=1}^{N} \sum_{j=1}^{n_i} (y_{ij} - \mu_{ij})^2 / \nu_{ij},
\]
\[
  g(\alpha_j; \theta) = \mathrm{tr}\Big[ \sum_{i=1}^{N} (e_i e_i^\top - \phi R_i) R_i^{-1} \frac{\partial R_i}{\partial \alpha_j} \Big]. \tag{4.26}
\]
Estimators of β, γ and α are obtained from the iterative method or from the joint estimating functions (4.18), together with (4.25) and (4.26).

4.4.3.5 Quasi Least-squares

Shults and Chaganty (1998) suggested quasi-least squares (QLS) for parameter estimation, which unifies the estimation of the mean and correlation parameters, but the correlation parameter estimates are biased even when the correlation structure is correctly specified. Their method is desirable mainly when the correlation parameters are of no direct interest but have to be used to account for dependence in estimating the mean parameters.
In our notation, the QLS estimates of α are obtained by minimizing Σi (Yi − µi)^⊤ Vi^{-1} (Yi − µi). The resulting estimating function for the parameter αj in α is
\[
  U_Q(\alpha_j; \beta) = \sum_{i=1}^{N} \epsilon_i^\top P_{ij}\, \epsilon_i,
\]
in which εi denotes the standardized residual vector and Pij = ∂Ri^{-1}/∂αj, which can also be written as Pij = −Ri^{-1} (∂Ri/∂αj) Ri^{-1}. Shults and Chaganty (1998) also found that, compared with the ad hoc methods in the literature, QLS has a smaller empirical risk of producing infeasible correlation parameter estimates and a smaller mean square error in the estimates of β when the correlation is small or moderate. The bias in the QLS estimates can easily be removed according to Chaganty and Shults (1999). More details, extensions and interesting applications can be found in the book by Shults and Hilbe (2014).
To avoid nonconvergence or the infeasibility problem pointed out by Crowder (1995), explicit expressions for α̂ that are constrained within the sensible region are attractive. This led Chaganty (1997) to consider the QLS method.
The problem is that the raw QLS estimators are inconsistent. The bias-corrected version of Chaganty and Shults (1999) for the AR(1) working model is
\[
  \hat{\alpha}_{QLS1} = \frac{\sum_{k} \sum_{|i-j|=1} \hat{\epsilon}_{ki}\, \hat{\epsilon}_{kj}}{\sum_{k} \big( \hat{\epsilon}_{k1}^2 + 2 \sum_{i=2}^{n-1} \hat{\epsilon}_{ki}^2 + \hat{\epsilon}_{kn}^2 \big)}. \tag{4.27}
\]
Clearly, we have α̂QLS1 → α when the true correlation structure is AR(1), exchangeable or MA(1) with parameter α.
If the working matrix is exchangeable instead of AR(1), with corresponding QLS estimate α̂QLS, the bias-corrected estimate is obtained from
\[
  \sum_{i,j} \alpha_{ij} = n\, \frac{\{1 + (n-1)\hat{\alpha}_{QLS}\}^2}{1 + (n-1)\hat{\alpha}_{QLS}^2},
\]
where the left-hand side sums the entries of the common correlation matrix. If the true correlation is exchangeable, we can verify that the limit of the bias-corrected estimate of α satisfies
\[
  \frac{\{2 + (n-2)\hat{\alpha}_{QLS}\}\, \hat{\alpha}_{QLS}}{1 + (n-1)\hat{\alpha}_{QLS}^2} = \alpha.
\]
In order to obtain consistent estimates of α for other types of working matrices (exchangeable and MA(1)), Chaganty and Shults (1999) suggested obtaining an initial QLS estimate using the AR(1) working model, α̂QLS, and then adjusting the estimate by assuming that the correlation structure believed to be true is either exchangeable or MA(1). The resulting estimate of α, α̂QLS1, has the same expression as (4.27). Note that one needs to choose the working correlation matrices twice, and they do not have to be of the same structure. In theory, the initial estimate α̂QLS can also be based on the exchangeable, MA(1), or any other correlation structure, but the initial estimate from the AR(1) model leads to the above simple expressions.

4.4.3.6 Conditional Residual Method

Carey et al. (1993) proposed using alternating logistic regression, based on conditional residuals, to estimate α. For any pair (k, j), k < j, let
\[
  \xi_{ikj} = \mu_{ik} + (\sigma_{ikj}/\sigma_{ijj})\, (y_{ij} - \mu_{ij}),
\]
and sikj = yik − ξikj. Note that ξikj has marginal mean µik, and in most cases, such as multivariate normal, binary and multinomial responses, ξikj is exactly the conditional mean E(yik | yij). In general, ξikj can be regarded as a linear approximation to the conditional mean E(yik | yij). Their idea can easily be extended to general cases.
Let Zi and ξi be the vectors (of length $\binom{n_i}{2}$) consisting of the sikj and ξikj, respectively. We have the following estimating equation for α,
\[
  U_{CR}(\alpha; \beta) = \sum_{i=1}^{N} \Big(\frac{\partial \xi_i}{\partial \alpha}\Big)^{\!\top} \mathrm{diag}(\Gamma_i^{-1})\, Z_i = 0, \tag{4.28}
\]
in which Γi is a working variance vector, which can be chosen as {var(yik | yij)}. In general, Γi may have to be chosen as the identity, unless some mean–variance relationship can be specified, as in the case of binary responses. It is easy to verify that UCR(α) is unbiased from 0 if either Γi is the identity (and ξikj does not have to be the true conditional mean), or ξikj is the true conditional mean (and Γi may depend on ξi).
For multiple binary responses, Carey et al. (1993) found that using conditional residuals produces much more efficient estimates of the correlation parameters than using unconditional residuals. This was further demonstrated by Lipsitz and Fitzmaurice (1996).
4.4.3.7 Cholesky Decomposition

In the conditional residual method, the identity matrix may have to be used for Γi when the conditional variance is unknown, which is often the case. Each residual is then treated equally. This may be reasonable if the correlation is the same between each pair of observations. This motivates us to consider using all previous responses rather than each single previous response.
Wang and Carey (2004) proposed a Cholesky decomposition method to improve the estimation efficiency and guarantee feasibility of the solutions. The basic idea was first presented at the Eastern North American Region International Biometric Society (ENAR) 1999 meeting and also at the Biostatistics Department Seminar, Harvard School of Public Health.
Let Ri^{-1} = Ci^⊤ Ci, where Ci is a lower triangular matrix obtained from the Cholesky decomposition. Consider the transformed variable Hi = Ci Ai^{-1} (Yi − µi), which has mean 0 and identity covariance matrix. Furthermore, we decompose Ci into its diagonal and strictly lower triangular parts,
\[
  C_i = \begin{pmatrix} c_{11} & 0 & \cdots & 0 \\ 0 & c_{22} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & c_{nn} \end{pmatrix}
      + \begin{pmatrix} 0 & 0 & \cdots & 0 \\ c_{21} & 0 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & 0 \end{pmatrix}
      = J_i + B_i.
\]
We can now write Hi as
\[
  H_i = J_i A_i^{-1} (Y_i - \zeta_i),
\]
in which ζi = µi − Ai Ji^{-1} Bi Ai^{-1} (Yi − µi) is a linear conditional mean vector. The second term, Ai Ji^{-1} Bi Ai^{-1} (Yi − µi), adjusts the mean based on the previous responses when correlations between responses exist.
Remark 8. Note that the components of ζi are linear predictors using the first two moments and the previous observations (their residuals). Therefore, when Ri is correctly specified, we would expect the components of Yi − ζi to be centered at 0 and orthogonal to each other.
It is easy to show that
\[
  E\Big(\frac{\partial \zeta_i}{\partial \beta}\Big) = A_i J_i^{-1} C_i A_i^{-1} D_i,
\]
and
\[
\begin{aligned}
  \mathrm{Cov}(Y_i - \zeta_i) &= \mathrm{Cov}\{A_i (I_i + J_i^{-1} B_i) A_i^{-1} (Y_i - \mu_i)\} \\
  &= A_i J_i^{-1} (J_i + B_i) R_i (J_i + B_i)^\top J_i^{-1} A_i \\
  &= A_i^2 J_i^{-2}.
\end{aligned}
\]

For convenience, we will denote the diagonal matrix Ji² Ai^{-2} by Wi. It is, therefore, sensible to consider the following estimating functions for α,
\[
  U_{Chol}(\alpha; \beta) = \sum_{i=1}^{N} \Big(\frac{\partial \zeta_i}{\partial \alpha}\Big)^{\!\top} W_i (Y_i - \zeta_i), \tag{4.29}
\]
in which Wi is a weighting diagonal matrix. In fact, for any chosen diagonal matrix Wi that is independent of the data, the estimating functions UChol(α) given by (4.29) are unbiased from zero.
Remark 9. It is easy to show that E(UChol(α; β)) = 0.
To see this, first rewrite Yi − ζi as Ai (Ii + Ji^{-1} Bi) Ai^{-1} (Yi − µi), and hence, for any component αk of α,
\[
  \frac{\partial \zeta_i}{\partial \alpha_k} = A_i F_{ik} A_i^{-1} (Y_i - \mu_i),
\]
where Fik = ∂(Ji^{-1} Bi)/∂αk. The estimating function UChol(α; β) can, therefore, be written in the quadratic form
\[
  \sum_{i=1}^{N} (Y_i - \mu_i)^\top A_i^{-1} M_{ik} A_i^{-1} (Y_i - \mu_i), \tag{4.30}
\]
where Mik = Fik^⊤ Ai² Wi^{-1} (I + Ji^{-1} Bi). We also have
\[
\begin{aligned}
  E\{(Y_i - \mu_i)^\top A_i^{-1} M_{ik} A_i^{-1} (Y_i - \mu_i)\}
  &= \mathrm{tr}\big\{ M_{ik}\, \mathrm{Cov}(A_i^{-1} (Y_i - \mu_i)) \big\} \\
  &= \mathrm{tr}\big\{ F_{ik}^\top A_i^2 W_i^{-1} (I + J_i^{-1} B_i) R_i \big\} \\
  &= \mathrm{tr}\big\{ F_{ik}^\top A_i^2 W_i^{-1} J_i^{-1} (J_i + B_i^\top)^{-1} \big\} \\
  &= \mathrm{tr}(L_i F_{ik}),
\end{aligned}
\]
in which Li = Ci^{-1} Ji^{-1} Wi^{-1} Ai² is a lower triangular matrix. Because Fik is also lower triangular with zero diagonal, tr(Li Fik) = 0, which completes the proof.
When Ci and Bi are taken to be upper triangular (which is equivalent to reversing the time order), we can obtain a similar version of UChol(α; β). In general, these two sets of estimating functions are very similar, and their average or sum can be used, as suggested by Wang and Carey (2004). This also symmetrizes the residual appearance, as we will see later for the AR(1) model. An advantage of using UChol over UG is that UChol is free of the scale parameter φ.
After some algebra, we can rewrite (4.30) as
\[
  U_{Chol}(\alpha; \beta) = \sum_{i=1}^{N} E\Big(\frac{\partial \zeta_i}{\partial \beta}\Big)^{\!\top} W_i (Y_i - \zeta_i). \tag{4.31}
\]
So the same “working” matrix is used in U(β) and UChol(α; β). Note that there is an expectation operator, E, in (4.31), so that quadratic terms are eliminated. A similar relationship was found by Wang (1999a) between conditional (transitional) models and marginal models.
If E(yik | yi1, · · · , yi,k−1) = ζik is a linear function of the past responses, and Vi* is the corresponding (diagonal) conditional variance matrix, the transitional or conditional model would rely on
\[
  U_c(\beta) = \sum_{i=1}^{N} \Big(\frac{\partial \zeta_i}{\partial \beta}\Big)^{\!\top} V_i^{*\,-1} (Y_i - \zeta_i)
\]
for estimating β. Note that Yi − ζi is unbiased from 0 even when the true conditional means are not linear functions. To ensure E{Uc(β)} = 0, and hence robustness to misspecification of the conditional means, we may wish to replace the Jacobian matrix and the conditional variance matrix with their expectations.
Remark 10. The adjustments in ζi are linear combinations of past responses only and can, therefore, be interpreted as linear predictions. As the covariance matrix of Yi − ζi is diagonal, namely Ai² Ji^{-2}, the components can roughly be regarded as independent.
Remark 11. In the case of Wi = Ji² Ai^{-2}, as we suggested, Mik = Fik^⊤ Ji Ci, where Fik = ∂(Ji^{-1} Bi)/∂αk.
We now consider two widely used correlation structures, the first-order autoregressive (AR(1)) and the equicorrelation structures. The first is often used in longitudinal studies, and the latter is often used to account for cluster settings (Lipsitz and Fitzmaurice, 1996).
For the AR(1) model with correlation parameter α, the (i, j)th element of R is α^{|j−i|}, 1 ≤ i, j ≤ n, and det(R) = (1 − α²)^{n−1}. We have
\[
  R^{-1} = \frac{1}{1 - \alpha^2}
  \begin{pmatrix}
    1 & -\alpha & & & \\
    -\alpha & 1 + \alpha^2 & -\alpha & & \\
    & \ddots & \ddots & \ddots & \\
    & & -\alpha & 1 + \alpha^2 & -\alpha \\
    & & & -\alpha & 1
  \end{pmatrix}.
\]
The Cholesky decomposition of R^{-1} leads to
\[
  C = \frac{1}{\sqrt{1 - \alpha^2}}
  \begin{pmatrix}
    \sqrt{1 - \alpha^2} & 0 & 0 & \cdots & 0 & 0 \\
    -\alpha & 1 & 0 & \cdots & 0 & 0 \\
    0 & -\alpha & 1 & \cdots & 0 & 0 \\
    \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
    0 & 0 & 0 & \cdots & -\alpha & 1
  \end{pmatrix},
\]

\[
  J^{-1} B = \begin{pmatrix}
    0 & 0 & 0 & \cdots & 0 & 0 \\
    -\alpha & 0 & 0 & \cdots & 0 & 0 \\
    0 & -\alpha & 0 & \cdots & 0 & 0 \\
    \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
    0 & 0 & 0 & \cdots & -\alpha & 0
  \end{pmatrix}
  \quad \text{and} \quad
  J^2 = \frac{1}{1 - \alpha^2}\, \mathrm{diag}\big(1 - \alpha^2,\, 1,\, \ldots,\, 1\big).
\]
We therefore have, for the ith subject,
\[
  \zeta_i = \mu_i + \alpha_i
  \begin{pmatrix}
    0 \\ (\sigma_{i2}/\sigma_{i1})\, \epsilon_{i1} \\ (\sigma_{i3}/\sigma_{i2})\, \epsilon_{i2} \\ \vdots \\ (\sigma_{in}/\sigma_{i,n-1})\, \epsilon_{i,n-1}
  \end{pmatrix},
\]

in which εik = yik − µik and αi is the correlation parameter for subject i, which may depend on subject-specific covariates through the parameters α. The contribution of the ith subject to UChol(α; β), given by (4.29), is
\[
  \frac{1}{1 - \alpha_i^2} \sum_{k=2}^{n} \Big( \frac{\epsilon_{ik}}{\sigma_{ik}} - \alpha_i \frac{\epsilon_{i,k-1}}{\sigma_{i,k-1}} \Big) \frac{\epsilon_{i,k-1}}{\sigma_{i,k-1}}.
\]
The resulting estimate of α from UChol(α; β) is
\[
  \hat{\alpha}_{Chol} = \frac{\sum_{i=1}^{N} \sum_{k=2}^{n_i} \hat{\epsilon}_{ik}\, \hat{\epsilon}_{i,k-1}}{\sum_{i=1}^{N} \sum_{k=1}^{n_i - 1} \hat{\epsilon}_{ik}^2},
\]
in which ε̂ik = εik/σik.
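In code, α̂Chol is a one-pass computation over the standardized residuals. The sketch below is minimal and assumes a list e_list of vectors ε̂i = (ε̂i1, . . . , ε̂i,ni), one per subject.

```r
## A minimal sketch of alpha_Chol; e_list is an assumed list of
## standardized residual vectors, one per subject.
alpha_chol <- function(e_list) {
  num <- sum(sapply(e_list, function(e) sum(e[-1] * e[-length(e)])))  # lag-1 products
  den <- sum(sapply(e_list, function(e) sum(e[-length(e)]^2)))        # squared residuals
  num / den
}
```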


If R is the equicorrelated (exchangeable) correlation matrix with constant correlation parameter α for subject i, we have det(R) = (1 − α)^{n−1}{1 + (n − 1)α}, and
\[
  R^{-1} = \frac{1}{1 - \alpha} \Big( I - \frac{\alpha}{1 - \alpha + \alpha n} \mathbf{1} \Big),
\]
in which 1 is the n × n matrix of ones. After some algebra, we have R^{-1} = C^⊤ C = (J + B)^⊤ (J + B), in which
\[
  J^2 = \mathrm{diag}\Big( 1,\; \frac{1}{1 - \alpha^2},\; \ldots,\; \frac{1 + (k-2)\alpha}{1 + (k-2)\alpha - (k-1)\alpha^2},\; \ldots,\; \frac{1 + (n-2)\alpha}{1 + (n-2)\alpha - (n-1)\alpha^2} \Big),
\]
and J^{-1} B is strictly lower triangular with
\[
  (J^{-1} B)_{kj} = -\frac{\alpha}{1 + (k-2)\alpha}, \qquad j < k.
\]
Therefore,
\[
  \zeta_i = \mu_i + \alpha
  \begin{pmatrix}
    0 \\[2pt] \dfrac{\sigma_{i2}}{\sigma_{i1}}\, \epsilon_{i1} \\[6pt] \vdots \\[2pt] \dfrac{\sigma_{in}}{1 + (n-2)\alpha} \displaystyle\sum_{j=1}^{n-1} \frac{\epsilon_{ij}}{\sigma_{ij}}
  \end{pmatrix}.
\]

The contribution of the ith subject to the estimating function is
\[
  \sum_{k=2}^{n} a_k^{-1} \Big( \sum_{j=1}^{k-1} \frac{\epsilon_{ij}}{\sigma_{ij}} \Big) \Big( \frac{\epsilon_{ik}}{\sigma_{ik}} - \frac{\alpha}{1 + (k-2)\alpha} \sum_{j=1}^{k-1} \frac{\epsilon_{ij}}{\sigma_{ij}} \Big),
\]
in which ak = (1 − α){1 + (k − 1)α}{1 + (k − 2)α}. It is easy to see that the above expression has expectation 0 under the equicorrelation assumption.
For the generalized Markov correlation, the corresponding estimating functions for λ are (Wang and Carey, 2004)
\[
  U_{Chol}(\lambda; \beta) = \sum_{i=1}^{N} \sum_{j=2}^{n_i} \frac{l_{ij}\, \alpha^{d_{ij}}}{1 - \alpha^{2 d_{ij}}} \big\{ (\alpha^{d_{ij}} \epsilon_{i,j-1} - \epsilon_{ij}) \epsilon_{i,j-1} + (\alpha^{d_{ij}} \epsilon_{ij} - \epsilon_{i,j-1}) \epsilon_{ij} \big\},
\]
\[
  U_G(\lambda; \beta) = U_{Chol}(\lambda; \beta) + \sum_{i=1}^{N} \sum_{j=2}^{n_i} \frac{l_{ij}\, \alpha^{2 d_{ij}}}{(1 - \alpha^{2 d_{ij}})^2} \big\{ (\epsilon_{ij} - \alpha^{d_{ij}} \epsilon_{i,j-1})^2 + (\epsilon_{i,j-1} - \alpha^{d_{ij}} \epsilon_{ij})^2 - 2\phi (1 - \alpha^{2 d_{ij}}) \big\},
\]
in which
\[
  l_{ij} = \partial d_{ij}/\partial \lambda = \frac{t_{ij}^{\lambda} \log(t_{ij}) - t_{i,j-1}^{\lambda} \log(t_{i,j-1})}{\lambda} - d_{ij}/\lambda.
\]
Consider the special case of the AR(1) model (λ = 1), with tj = j and ni = 3 for all subjects. The solution to UChol(α) is
\[
  \hat{\alpha}_{Chol} = \frac{\sum_{i=1}^{N} (\epsilon_{i1} \epsilon_{i2} + \epsilon_{i2} \epsilon_{i3})/2}{\sum_{i=1}^{N} (\epsilon_{i1}^2 + 2 \epsilon_{i2}^2 + \epsilon_{i3}^2)/4}. \tag{4.32}
\]
Another advantage of α̂Chol is that it always lies in the sensible range (−1, 1).

4.4.4 Covariance Matrix of β̂

Recall that the GEE estimator β̂ of β is obtained by solving the estimating equations
\[
  U_{GEE}(\beta; \tau, \alpha) = \sum_{i=1}^{N} D_i^\top V_i^{-1} (Y_i - \mu_i) = 0. \tag{4.33}
\]
Suppose the true parameters are denoted by β̃. The covariance of UGEE(β̃) can be approximated by
\[
  E\{U_{GEE}(\tilde\beta; \tau, \alpha)\, U_{GEE}(\tilde\beta; \tau, \alpha)^\top\} = \sum_{i=1}^{N} D_i^\top V_i^{-1} \tilde{\Sigma}_i V_i^{-1} D_i,
\]
where Σ̃i denotes the true covariance matrix of Yi.
Suppose we have constructed U(α, τ; β) for all the variance and correlation parameters as supplementary estimating functions to UGEE(β; τ, α) given by (4.18); the final estimates of (β, α, τ) are obtained from the joint estimating functions
\[
  U(\theta) = \begin{pmatrix} U_{GEE}(\beta; \tau) \\ U(\alpha, \tau; \beta) \end{pmatrix}.
\]
Using the delta method, we can approximate the covariance of the estimates as
\[
  \Delta^{-1}\, \mathrm{Cov}(U(\theta))\, (\Delta^\top)^{-1},
\]
in which ∆ is the Jacobian matrix ∂U(θ)/∂θ evaluated at the estimated values. The covariance of U(θ) can be estimated from the products of the subject residuals (contributions to the estimating functions).
For the proposed method, ∆ can be replaced by
\[
  \begin{pmatrix}
    \sum_{i=1}^{N} D_i^\top V_i^{-1} D_i & 0 \\
    E\{\partial U(\alpha, \tau; \beta)/\partial \beta^\top\} & E\{\partial U(\alpha, \tau; \beta)/\partial (\alpha, \tau)\}
  \end{pmatrix}.
\]
Under mild regularity conditions, β̂R is consistent as N → ∞, and √N(β̂R − β) is asymptotically multivariate Gaussian with covariance matrix VR given by
\[
  V_R = \lim_{N\to\infty} N \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} D_i \Big)^{-1} \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} \mathrm{Cov}(Y_i) V_i^{-1} D_i \Big) \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} D_i \Big)^{-1}.
\]

The covariance matrix of β̂ can be consistently estimated by the sandwich, or robust, estimator
\[
  V_G = \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} D_i \Big)^{-1} \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} \mathrm{Cov}(Y_i) V_i^{-1} D_i \Big) \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} D_i \Big)^{-1}, \tag{4.34}
\]
where β, α, and φ are replaced by their estimators.


Let ε̂i = Yi − µ̂i. Liang and Zeger (1986) proposed estimating Cov(Yi) in VG by ε̂i ε̂i^⊤, which gives
\[
  V_{LZ} = \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} D_i \Big)^{-1} \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} \hat{\epsilon}_i \hat{\epsilon}_i^\top V_i^{-1} D_i \Big) \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} D_i \Big)^{-1}. \tag{4.35}
\]

If the covariance matrix is correctly specified, then Cov(Yi) = Vi, and the covariance matrix estimator of β̂ becomes the so-called model-based (or naive) covariance estimator,
\[
  V_{MB} = \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} D_i \Big)^{-1}.
\]
Since ε̂i ε̂i^⊤ is based on the data from only one subject, it is neither consistent nor efficient, and hence it is not an optimal estimator of Cov(Yi). Nevertheless, although ε̂i ε̂i^⊤ is a poor estimator of Cov(Yi), VLZ is a consistent estimator of VG.
An improved version of the VR estimator, under certain restrictions on the data structure, is given by Pan (2001). The same asymptotic normality for the β estimator in our case can be established as in Crowder (2001).
Assume that (a) the marginal variance var(yik) is modeled correctly, and (b) there is a common correlation structure R0 across all subjects. Then R0 can be estimated by
\[
  \hat{R}_C = \frac{1}{\phi N} \sum_{i=1}^{N} A_i^{-1/2} \hat{\epsilon}_i \hat{\epsilon}_i^\top A_i^{-1/2},
\]
where β, α, and φ are replaced by their estimators. Under these two assumptions, which are often reasonable, Pan (2001) proposed estimating Cov(Yi) by
\[
  W_i = \phi A_i^{1/2} \hat{R}_C A_i^{1/2} = A_i^{1/2} \Big( \frac{1}{N} \sum_{j=1}^{N} A_j^{-1/2} \hat{\epsilon}_j \hat{\epsilon}_j^\top A_j^{-1/2} \Big) A_i^{1/2}.
\]
Replacing ε̂i ε̂i^⊤ in (4.35) by Wi, a new covariance matrix estimator VP is obtained (Pan, 2001), given by
\[
  V_P = \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} D_i \Big)^{-1} \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} W_i V_i^{-1} D_i \Big) \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} D_i \Big)^{-1}. \tag{4.36}
\]

Let
\[
  M_{LZ} = \sum_{i=1}^{N} D_i^\top V_i^{-1} \hat{\epsilon}_i \hat{\epsilon}_i^\top V_i^{-1} D_i \quad \text{and} \quad M_P = \sum_{i=1}^{N} D_i^\top V_i^{-1} W_i V_i^{-1} D_i.
\]
For any matrix B, define the operator vec(B) as that of stacking the columns of B together to obtain a vector. Then, under mild regularity conditions, cov{vec(MLZ)} − cov{vec(MP)} is nonnegative definite with probability 1 as N → +∞.
When the number of subjects N is small, VLZ is expected to underestimate var(β̂) (Mancl and DeRouen, 2001). Therefore, Mancl and DeRouen (2001) proposed an alternative robust covariance estimator for β̂ to reduce the bias of the residual estimator ε̂i ε̂i^⊤. The first-order Taylor series expansion of the residual vector ε̂i about β is given by
\[
  \hat{\epsilon}_i = \epsilon_i(\hat\beta) = \epsilon_i + \frac{\partial \epsilon_i}{\partial \beta^\top} (\hat\beta - \beta) = \epsilon_i - D_i (\hat\beta - \beta), \tag{4.37}
\]
where εi = εi(β) = Yi − µi and i = 1, . . . , N. Therefore,
\[
  E(\hat{\epsilon}_i \hat{\epsilon}_i^\top) \approx E(\epsilon_i \epsilon_i^\top) - E[\epsilon_i (\hat\beta - \beta)^\top D_i^\top] - E[D_i (\hat\beta - \beta) \epsilon_i^\top] + E[D_i (\hat\beta - \beta)(\hat\beta - \beta)^\top D_i^\top],
\]
in which β̂ − β can be approximated by the first-order Taylor series expansion of (4.18), that is,
\[
  \hat\beta - \beta \approx \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} D_i \Big)^{-1} \sum_{i=1}^{N} D_i^\top V_i^{-1} \epsilon_i.
\]
Hence,
\[
\begin{aligned}
  E(\hat{\epsilon}_i \hat{\epsilon}_i^\top) &\approx \mathrm{Cov}(Y_i) - \mathrm{Cov}(Y_i) H_{ii}^\top - H_{ii}\, \mathrm{Cov}(Y_i) + \sum_{j=1}^{N} H_{ij}\, \mathrm{Cov}(Y_j)\, H_{ij}^\top \\
  &= (I - H_{ii})\, \mathrm{Cov}(Y_i)\, (I - H_{ii})^\top + \sum_{j \ne i} H_{ij}\, \mathrm{Cov}(Y_j)\, H_{ij}^\top, \tag{4.38}
\end{aligned}
\]
where Hij = Di (Σ_{k=1}^N Dk^⊤ Vk^{-1} Dk)^{-1} Dj^⊤ Vj^{-1}. Since the elements of Hij are between zero and one, and usually close to zero, it is reasonable to assume that the summation makes only a small contribution to the bias (Mancl and DeRouen, 2001). Therefore, E(ε̂i ε̂i^⊤) can be approximated by
\[
  E(\hat{\epsilon}_i \hat{\epsilon}_i^\top) \approx (I - H_{ii})\, \mathrm{Cov}(Y_i)\, (I - H_{ii})^\top.
\]
The covariance matrix Cov(Yi) can then be estimated by
\[
  \widehat{\mathrm{Cov}}(Y_i) = (I - H_{ii})^{-1} \hat{\epsilon}_i \hat{\epsilon}_i^\top (I - H_{ii}^\top)^{-1}.
\]
Mancl and DeRouen (2001) proposed the bias-corrected covariance estimator of β̂:
\[
  V_{MD} = V_{MB} \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} (I - H_{ii})^{-1} \hat{\epsilon}_i \hat{\epsilon}_i^\top (I - H_{ii}^\top)^{-1} V_i^{-1} D_i \Big) V_{MB}.
\]

If the covariance matrix Vi is correctly specified, then (4.38) simplifies to
\[
  E(\hat{\epsilon}_i \hat{\epsilon}_i^\top) \approx \mathrm{Cov}(Y_i) - D_i V_{MD} D_i^\top. \tag{4.39}
\]
Because Di V_MD Di^⊤ is positive definite, the sandwich estimator VLZ appears to be biased downward. Kauermann and Carroll (2001) proposed using (I − Hii)^{-1/2} ε̂i to replace ε̂i in VLZ, giving the bias-reduced sandwich estimator
\[
  V_{KC} = V_{MB} \Big( \sum_{i=1}^{N} D_i^\top V_i^{-1} (I - H_{ii})^{-1/2} \hat{\epsilon}_i \hat{\epsilon}_i^\top (I - H_{ii}^\top)^{-1/2} V_i^{-1} D_i \Big) V_{MB}.
\]

In practice, it seems to be a plausible strategy to work with V_MD and V_KC instead of V_LZ, even if Vi is a working covariance and the true variance structure is unknown.
The basic idea of the three modified covariance matrix estimators is to seek a good estimator of the covariance matrix Cov(Yi). The sandwich estimator was proposed for the case of misspecified models, for which substantial theory has been established by Eicker (1963), Huber (1967), and White (1982).
The statistical properties of these improved estimators of the covariance matrix are not discussed here. More details can be found in Pan (2001), Kauermann and Carroll (2001), and Mancl and DeRouen (2001).
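Continuing the earlier sandwich sketch, the Mancl–DeRouen correction only changes the “meat”: each residual ε̂i is pre-multiplied by (I − Hii)^{-1}. The sketch below is minimal and reuses the assumed lists D_list, V_list and S_list of per-subject quantities.

```r
## A minimal sketch of V_MD, reusing the assumed lists D_list, V_list,
## S_list (per-subject D_i, V_i and residuals) from the earlier sketch.
md_cov <- function(D_list, V_list, S_list) {
  p <- ncol(D_list[[1]])
  bread <- matrix(0, p, p)
  for (i in seq_along(D_list))
    bread <- bread + t(D_list[[i]]) %*% solve(V_list[[i]], D_list[[i]])
  Vmb <- solve(bread)                      # model-based covariance V_MB
  meat <- matrix(0, p, p)
  for (i in seq_along(D_list)) {
    Di <- D_list[[i]]; Vinv <- solve(V_list[[i]])
    Hii <- Di %*% Vmb %*% t(Di) %*% Vinv   # leverage matrix H_ii
    ei  <- solve(diag(nrow(Di)) - Hii, S_list[[i]])  # (I - H_ii)^{-1} residual
    meat <- meat + t(Di) %*% Vinv %*% tcrossprod(ei) %*% Vinv %*% Di
  }
  Vmb %*% meat %*% Vmb                     # V_MD = V_MB * meat * V_MB
}
```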

4.4.5 Example: Epileptic Data

Thall and Vail (1990) analyzed data from a clinical trial of 59 epileptics using various variance functions and correlation structures. Patients were randomized to receive either the anti-epileptic drug progabide or a placebo. For each patient, the number of epileptic seizures was recorded during a baseline period of eight weeks. During the treatment, the number of seizures was then recorded in four consecutive two-week intervals. The data also include the age of each patient and a treatment indicator, with 1 for progabide and 0 for placebo. The medical interest is whether or not progabide reduces the rate of epileptic seizures.
The data show a high degree of extra-Poisson variation and within-subject correlation, as demonstrated in Thall and Vail (1990) and Diggle et al. (2002). We will apply the proposed approach assuming different variance functions, including the power function, and compare the results with those from other competing models.
We consider five covariates: intercept, treatment, baseline seizure rate, age of subject, and the interaction between treatment and baseline seizure rate. The mean vector for the ith subject is µi = exp(xi^⊤β), where xi is the design matrix for the covariates. Following Thall and Vail (1990), the baseline is transformed as the logarithm of 1/4 of the 8-week pre-randomization seizure count, and age is log-transformed. The treatment variable is a binary indicator for the progabide group.
The overdispersion values φ for the two treatment groups, estimated by the ratio of the sample variance to the mean, are far from 1, demonstrating the high degree of extra-Poisson variation. The sample mean–variance plot also exhibits a nonlinear trend. We therefore assume two variance functions for the data: one is the quadratic function σ²ij = γ1µij + γ2µ²ij, introduced by Bartlett (1936) and Morton (1987); the other is the power function σ²ij = φµij^γ, with γ estimated from the data. Paul and Plackett (1978) suggested both of these functions for overdispersed Poisson data. Two different working correlation structures, AR(1) and exchangeable, are used to account for the within-subject correlations. The proposed approach is applied to estimate the variance and regression parameters, while the correlation parameters are estimated by the moment methods. The final estimates are obtained after a few iterations among the regression, variance and correlation parameters. The asymptotic covariance matrix of β̂ is estimated by the sandwich estimator.
Table 4.2 shows the summary statistics for the two-week seizure counts. Overdispersion relative to the Poisson model is evident because the ratio of variance to mean is much larger than 1. We will further discuss this example in Chapter 5.
Once the quasi-Poisson (Poisson with overdispersion) model is fitted, we obtain the fitted values µ̂ik and residuals rik. The variance parameters are then simply obtained by minimizing Σ_{i,k} {r²ik − σ²ik}²/µ̂²ik for the nominated variance function (power or Bartlett). The resulting variance function can be used through the weight argument, weights = µik/σ²ik with family = poisson, to make use of the existing geese function in the geepack package. Further iterations can also be employed to update the variance parameters and the weights, as an exercise.
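A minimal sketch of this two-step fit is given below; the data frame epil, with seizure count y, covariates lbase (log baseline), lage (log age), trt and subject identifier id, is hypothetical, and geeglm is used in place of geese for convenience.

```r
## A minimal sketch of the two-step fit; `epil` is a hypothetical data
## frame with count y, covariates lbase, lage, trt and cluster id.
library(geepack)
fit0 <- glm(y ~ lbase * trt + lage, family = quasipoisson, data = epil)
mu <- fitted(fit0)
r2 <- (epil$y - mu)^2
## least-squares fit of the power variance sigma^2 = phi * mu^gamma
obj <- function(par) sum((r2 - par[1] * mu^par[2])^2 / mu^2)
est <- optim(c(1, 1), obj)$par             # est[1] = phi, est[2] = gamma
epil$w <- mu / (est[1] * mu^est[2])        # weights = mu / sigma^2
fit1 <- geeglm(y ~ lbase * trt + lage, id = id, data = epil,
               family = poisson, corstr = "ar1", weights = w)
summary(fit1)
```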
Table 4.3 shows the results for the different variance functions. The overdispersion parameter φ values obtained from different correlation structures are very similar for a given variance function.
It should be noted that in this example all covariates are at the cluster level, i.e., they do not change within a subject. In such cases, the constant correlation assumption (exchangeable structure) leads to the same estimates as the independence model; even the standard errors are the same, because they are obtained by the robust sandwich approach (instead of the naive one). It is clear that the choice of variance
Table 4.2 Mean and variance of the two-week seizure counts within each group. The ratio of variance to mean shows the extent of Poisson overdispersion (φ̂ = s²/Ȳ).

          Placebo (M1 = 28)          Progabide (M2 = 31)
Visit     Ȳ       s²       φ̂        Ȳ       s²       φ̂
1         9.36    102.76   10.98     8.58    332.72   38.78
2         8.29    66.66    8.04      8.42    140.65   16.70
3         8.79    215.29   23.09     8.13    193.05   23.75
4         7.96    58.18    7.31      6.71    126.88   18.91

function has a great impact on the estimates, and especially on the standard errors. When the Bartlett and the power variance functions are used, the estimates have much smaller standard errors. Estimates and their standard errors from the Bartlett variance model differ little across the working correlation structures. However, across the nine covariance models, the standard errors range from 0.174 to 0.211 for the interaction term, and from 0.415 to 0.464 for the treatment. This demonstrates that attention to modeling the variance function is necessary, instead of just using the default variance functions, in order to achieve high estimation efficiency. In this example, further goodness-of-fit development is needed for assessing the different variance and correlation models. Model selection criteria will be discussed in the coming chapter.

4.4.6 Infeasibility

Crowder (1995) found that under misspecification of the correlation structure, the designated working correlation matrix may not even converge to a correlation matrix. This has generated interest in the search for better estimates of the correlation parameters, or for alternatives to the working matrix method (Chaganty, 1997; Qu, Lindsay and Li, 2000). Crowder (1995) redefines GEE as seeking simultaneous solutions of the estimating equations Uβ(θ) = 0 and Uα(θ) = 0. Crowder shows that the “lack of a parametric status of α” leads to indefiniteness of the underlying stochastic structure for the outcome vectors, so that the √K-consistency of α̂ required to obtain favorable properties of solutions to Uβ(θ) = 0 may be meaningless. Crowder illustrates this indefiniteness by attempting to solve a GEE with working structure R(α) when the data arise from a model with true cluster correlation structure R̃ that is very different from R. Specifically, when R̃jk = α (exchangeable true correlation) but Rjk = α^{|j−k|} (autoregressive working correlation), then for ni = 3 and −1/2 ≤ α < −1/3, an obvious moment-based formulation of Uα(θ) has no real-valued root between −1 and 1. This destroys prospects for a general theory of existence and consistency of simultaneous solutions to the estimating functions. Crowder argues that this indefiniteness also infects the GEE2 procedures of Prentice and Zhao (1991).
In a seminal study, Crowder (1995) indicates that while robustness to misspecification of cov(Yi) is a key attraction of the GEE approach, the study of performance under misspecification requires additional formalization. The ad hoc estimators of Liang and Zeger are, therefore, re-expressed as supplemental estimating equations whose solutions α̃ can be characterized under a misspecified structure for cov(Yi). Even if α

Table 4.3 Parameters estimates from models with different variance functions and correlation
structures for the epileptic data.
Poisson Variance: σ2i j = φµi j .
Independence working model,φ̂ = exp(1.486)
log(baseline) log(age) trt intact
β 0.950 0.897 -1.341 0.562
Stderr 0.097 0.275 0.426 0.174
AR(1) working model, φ̂ = exp(1.509), α̂ = 0.495
log(baseline) log(age) trt intact
β 0.941 0.994 -1.502 0.626
Stderr 0.091 0.272 0.415 0.168
Exchangeable working model, φ̂ = exp(1.486), α̂ = 0.347
β 0.950 0.897 -1.341 0.562
Stderr 0.097 0.275 0.426 0.174
Bartlett variance: σ2i j = 3.424 ∗ µi j + 0.120 ∗ µ2i j .
Independence working model, φ̂ = exp(1.402)
log(baseline) log(age) trt intact
β 0.943 0.762 -1.074 0.434
Stderr 0.110 0.266 0.464 0.211
AR(1) working model, φ̂ = exp(1.422), α̂ = 0.477
log(baseline) log(age) trt intact
β 0.933 0.850 -1.219 0.492
Stderr 0.102 0.264 0.458 0.208
Exchangeable working model, φ̂ = exp(1.33), α̂ = 0.294
β 0.943 0.762 -1.074 0.434
Stderr 0.110 0.266 0.464 0.211
Power function variance: σ2i j = φµ1.329ij .
Independence working model, φ̂ = exp(1.486)
log(baseline) log(age) trt intact
β 0.924 0.719 -1.091 0.449
Stderr 0.109 0.267 0.431 0.198
AR(1) working model, φ̂ = exp(1.348), α̂ = 0.438
log(baseline) log(age) trt intact
β 0.915 0.796 -1.210 0.496
Stderr 0.102 0.265 0.429 0.198
Exchangeable working model, φ̂ = exp(1.33), α̂ = 0.294
β 0.924 0.719 -1.091 0.449
Stderr 0.109 0.267 0.431 0.198
is regarded strictly as a nuisance parameter, the effects of misspecification of cov(Yi )
and the subsequent misweighting of the observations may propagate to properties of
β̂_G. Specifically, Crowder shows that if cov(Y_i) has the AR(1) form but the working
model is chosen to have the exchangeable (compound symmetric) form, then (for a
balanced design) the limit of the moment estimator α̃ depends explicitly on the cluster
size. Additionally, if cov(Y_i) has the exchangeable form but the working model is
of AR(1) structure, then for n_i ≡ 3 and −1/2 ≤ α_exch ≤ −1/3, the estimating equation
associated with the AR(1) moment estimator has no real solution, and neither does the
GEE. Crowder concludes that "there can be no general asymptotic theory supporting
existence and consistency of (β̂, α̂)" and that "α has no underlying parametric identity
independent of the particular estimating equation chosen: α does not exist in any
fundamental sense." He closes his discussion with the positive recommendation to use
only estimating equations that have a guaranteed solution, or to estimate α by
minimizing a well-behaved objective function. He also notes that in practice
"statistical judgement would normally be employed in an attempt to avoid such hidden
pitfalls" as infeasible estimators obtained through misspecification. In a follow-up
to Crowder's paper, Sutradhar and Das (1999) argued that solutions to GEEs using
the (typically misspecified) working independence model can often be more efficient
than those obtained under misspecified nonindependence working models.
To support their argument they investigate a model for binary outcomes
  logit{E(Y_it)} = β_0 + β_1 t,

with balanced design n_i ≡ n. Let β̂_struct denote the solution to the GEE with working
correlation structure struct ∈ {indep, exch, AR(1)}, and let eff(β̂_struct) denote
V_true/V_struct, where V_struct is the asymptotic variance of a component of the solution
to the GEE with working correlation structure struct. When the true cov(Y_i) is of AR(1)
form but the working model is exchangeable, they show that eff(β̂_indep) ≈ eff(β̂_exch).
When the true cov(Y_i) is of exchangeable form but the working model is AR(1),
they show that eff(β̂_indep) ≫ eff(β̂_AR(1)). They conclude that the "use of working
GEE estimator β̂_G may be counterproductive", and that their "results contradict the
recommendations made in LZ that one should use β̂_G for higher efficiency relative to
β̂_indep."
Investigations of infeasibility and efficiency loss under misspecification under-
taken thus far do not acknowledge the additional complications of unbalanced de-
signs and within/between-cluster covariate dispersion. The latter phenomenon has
been studied by Fitzmaurice (1995), who showed that the relative efficiency of solu-
tions obtained under working independence depends critically on the within-cluster
correlation of covariates. Furthermore, no study of tools for data-based selection of
working correlation models has been undertaken to date. A reliable correlation struc-
ture selection method would greatly diminish the reservations concerning feasibility
and efficiency loss reviewed here.
To best illustrate the problem raised by Crowder (1995), let us revisit his example
in which the true correlation structure is exchangeable with ni = 3 and the AR(1)
working correlation matrix is used. To be more specific, we use

         ( 1    α    α² )
  R_i =  ( α    1    α  )
         ( α²   α    1  )

while the true correlation matrix is

         ( 1    ρ    ρ )
  R̃_i =  ( ρ    1    ρ ) .
         ( ρ    ρ    1 )
If we use the simple moment estimator as given by (4.22) or the Cholesky estimator
(4.32), its limit converges to ρ, the true exchangeable correlation parameter value;
that is, asymptotically we will use an incorrect AR(1) correlation matrix, although
R_i will never be correct no matter what α is used (even if α = ρ). In these cases,
there is no feasibility problem. However, if we use a moment estimator that is based
on all the residual products, the estimating equation is

  q_α = Σ_{i=1}^n (ε̂_i1 ε̂_i2 + ε̂_i2 ε̂_i3 + ε̂_i1 ε̂_i3) − nφ(2α + α²) = 0,

where ε̂_ij = (y_ij − μ_ij)/(A_i)_jj^{1/2} and (A_i)_jj is the jth diagonal element of
the matrix A_i. A reasonable estimator of φ is

  φ̂ = (1/n) Σ_{i=1}^n (ε̂_i1² + ε̂_i2² + ε̂_i3²)/3.
For convenience, let

  ρ̂ = Σ_{i=1}^n (ε̂_i1 ε̂_i2 + ε̂_i2 ε̂_i3 + ε̂_i1 ε̂_i3) / (3nφ̂)
     = Σ_{i=1}^n (ε̂_i1 ε̂_i2 + ε̂_i2 ε̂_i3 + ε̂_i1 ε̂_i3) / Σ_{i=1}^n (ε̂_i1² + ε̂_i2² + ε̂_i3²).
The solution of q_α = 0 is given by

  α̂_1 = √(1 + 3ρ̂) − 1   if ρ̂ > −1/3,   and   α̂_1 is undefined   if ρ̂ ≤ −1/3.

Clearly, ρ̂ is a random variable taking values between −1 and 1, and it is quite possible
that pr(ρ̂ < −1/3) > 0. In fact, if the true correlation structure is exchangeable with a
parameter ρ < −1/3, we have pr(ρ̂ < −1/3) → 1 as n → +∞, indicating that α̂ is
undefined with probability tending to 1.
Crowder (1995) further argued that the limiting value of α̂ may not exist and
hence the asymptotic properties of the GEE estimators break down. The main concern
here is that we cannot guarantee that the condition n^{1/2}(α̂ − α_0) = O_p(1) holds,
a condition that presupposes the existence of α̂ and of its limiting value α_0. The
implication is that it is important to ensure that α̂ is asymptotically normal. Carefully
chosen supplementary estimating functions will not have this issue, and the choice
of a "working" correlation should be as close to the truth as possible. In the above
example, even if the true model is AR(1) but with ρ ≤ −1/3, α̂ can still be problematic.
This indicates that the moment estimator using all pairwise products is statistically
problematic for the AR(1) model. As we mentioned, the simple lag-1 estimator

  2 Σ_{i=1}^n (ε̂_i1 ε̂_i2 + ε̂_i2 ε̂_i3) / Σ_{i=1}^n (ε̂_i1² + 2ε̂_i2² + ε̂_i3²)

does not have the aforementioned infeasibility issue, regardless of whether the true
model is exchangeable or AR(1).
Of course, in cases when no sensible correlation matrix is produced by a nominated
"working" model, we should fall back on other working models (including the
independence model).

In the above example, when ρ̂ ≤ −1/3, one can consider a different estimator for α,

  α̂_2 = √(1 + 3ρ̂) − 1   if ρ̂ ≥ −1/3,   and   α̂_2 = γ̂_1   if ρ̂ < −1/3,        (4.40)

in which γ̂_1 = Σ_{i=1}^n (ε̂_i1 ε̂_i2 + ε̂_i2 ε̂_i3)/(2nφ̂). The infeasibility issue also
disappears when the working matrix is AR(1) with the parameter α̂_2.
Alternatively, we can choose a different working structure, say an equicorrelated one
with a parameter γ, which can be estimated from the products of all pairs,

  γ̂_2 = Σ_{i=1}^n (ε̂_i1 ε̂_i2 + ε̂_i2 ε̂_i3 + ε̂_i1 ε̂_i3)/(3nφ̂).

In this case, α̂ takes the same form as (4.40) with γ̂_1 replaced by γ̂_2, but the
corresponding working matrix involves two different structures: the AR(1) matrix with
the parameter √(1 + 3ρ̂) − 1 if ρ̂ > −1/3, and the equicorrelated matrix with the
parameter γ̂_2 otherwise. In practice, one can also consider independence (α̂ = 0) when
the nominated parametric correlation matrix is not positive definite.

The conclusion is that this infeasibility is a theoretical concern: with safeguards
such as these in place, it essentially never arises in practice.
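
To make this recipe concrete, the following is a minimal R sketch (ours, not
Crowder's) that computes ρ̂, γ̂_1, and the fallback estimator α̂_2 of (4.40); the
function name and the residual-matrix input are illustrative assumptions.

  ## Sketch: moment estimators for a balanced three-wave design.
  ## E is an N x 3 matrix of standardized residuals (y_ij - mu_ij)/sd_ij.
  alpha2_hat <- function(E) {
    N <- nrow(E)
    phi_hat  <- mean(rowSums(E^2)) / 3                       # scale estimator phi-hat
    allpairs <- E[, 1] * E[, 2] + E[, 2] * E[, 3] + E[, 1] * E[, 3]
    rho_hat  <- sum(allpairs) / (3 * N * phi_hat)            # all-pairs estimator rho-hat
    gamma1   <- sum(E[, 1] * E[, 2] + E[, 2] * E[, 3]) / (2 * N * phi_hat)  # lag-1 only
    if (rho_hat >= -1/3) sqrt(1 + 3 * rho_hat) - 1 else gamma1  # fallback rule (4.40)
  }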

4.5 Quadratic Inference Function

As one can see, supplying a sensible "working" correlation matrix and efficient
estimating functions for the corresponding α parameters can be challenging. This
motivates us to consider other approaches, such as random effects models or the
transitional GEE approach, in which the "working" covariance matrix is always diagonal.
Alternatively, one can consider a series of "working" models and establish model
selection criteria to help identify the best models. We will provide more details on
this in the next chapter.

Apart from these considerations, one can also optimally combine a number of nominated
working models to bypass the need to specify one particular model each time. This
leads to the question: can we generalize the unstructured working
correlation models so that the estimating functions can automatically combine all
possible structures?
Suppose we have a set of candidate correlation models; the K sets of estimating
functions for estimating β are

  U(β) = Σ_{i=1}^N ( D_i^T A_i^{−1/2} R_i^{(1)} A_i^{−1/2} (Y_i − μ_i),
                     D_i^T A_i^{−1/2} R_i^{(2)} A_i^{−1/2} (Y_i − μ_i),
                     ...,
                     D_i^T A_i^{−1/2} R_i^{(K)} A_i^{−1/2} (Y_i − μ_i) )^T = Σ_{i=1}^N U_i(β),

in which U_i has Kp components denoting the contributions from subject i. Note that
R_i^{(k)} is symmetric with non-negative eigenvalues. Applying the generalized method
of moments (Hansen, 1982), we can combine all of these to obtain optimal estimating
functions for β of dimension p. The quadratic inference function approach proposed by
Qu et al. (2000) estimates β by minimizing

  U^T(β) ĉov(U)^{−1} U(β),

where ĉov(U) = Σ_{i=1}^N U_i(β) U_i^T(β) is the empirical estimate of the covariance
matrix of U(β). A number of interesting properties and extensions were established
by Qu et al. (2000), Qu and Song (2004), Qu et al. (2008), and Wang and Qu (2009). An
excellent review is given by Song et al. (2009). In particular, an interesting extension
to generalized mixed-effects models, with penalized random effects in the estimating
functions U_i, is made by Wang et al. (2012b).
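
As a rough illustration of these mechanics, here is a hedged R sketch (ours, not the
authors' implementation) of the QIF objective for a Gaussian marginal model with
identity link, using two basis matrices, the identity and the off-diagonal all-ones
matrix, whose span accommodates an exchangeable inverse working correlation; all names
are illustrative.

  ## Sketch: QIF objective for an identity-link Gaussian marginal model.
  ## Basis matrices: M1 = I (independence) and M2 = all-ones minus I.
  qif_objective <- function(beta, y, X, id) {
    U <- 0; C <- 0
    for (i in unique(id)) {
      Xi <- X[id == i, , drop = FALSE]
      ri <- y[id == i] - Xi %*% beta              # residuals for cluster i
      n  <- length(ri)
      M2 <- matrix(1, n, n) - diag(n)
      Ui <- rbind(crossprod(Xi, ri),              # t(Xi) %*% M1 %*% ri with M1 = I
                  crossprod(Xi, M2 %*% ri))       # t(Xi) %*% M2 %*% ri
      U <- U + Ui
      C <- C + tcrossprod(Ui)                     # empirical covariance (unnormalized)
    }
    drop(t(U) %*% solve(C) %*% U)                 # the quadratic inference function
  }
  ## beta_hat <- optim(rep(0, ncol(X)), qif_objective, y = y, X = X, id = id)$par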
In R, a wide class of correlation structures for R_i is available through the
model-fitting argument correlation = , and within-group heteroscedasticity structures
can be specified using weights = (as in the nlme package, for example).
Chapter 5

Model Selection

5.1 Introduction
Model selection is an important issue in almost any practical data analysis. For lon-
gitudinal data analysis, model selection commonly includes covariate selection in
regression and correlation structure selection in the GEE discussed in Chapter 4.
Covariate selection in regression means: given a large group of covariates (including
some higher-order terms or interaction effects), one needs to select a subset to
be included in the regression model. Correlation structure selection means: given
a pool of working correlation structure candidates, select the one that is closest
to the truth and thus results in more efficient parameter estimates. For example, in
the National Longitudinal Survey of Labor Market Experience introduced in Chapter 1,
covariates including age, grade, and south were recorded on the participants
over the years. Even though the number of variables is not large in this example,
when various interaction effects are included, the total number of covariates in the
statistical model can be considerably large. However, only a subset of them is
relevant to the response variable. Inclusion of redundant variables may hinder accuracy
and efficiency for both estimation and inference. Furthermore, a correlation structure
needs to be specified when using the GEE to estimate parameters. Appropriate
specification of correlation structures in longitudinal data analysis can improve
estimation efficiency and lead to more reliable statistical inferences. Thus, it is
important to use statistical methodology to select important covariates and an
appropriate correlation structure. There is a large model-selection literature in
statistics (e.g., Miller, 1990; Jiang et al., 2015, and references therein), but mainly
for covariate selection in classic linear regression with independent data. Traditional
model selection criteria such as the Akaike information criterion (AIC) and the
Bayesian information criterion (BIC) may not be useful for correlation structure
selection because the joint density function of the response variables is usually
unknown in longitudinal data.
Suppose that a longitudinal dataset is composed of outcome variables Y_i =
(y_i1, …, y_in_i)^T and corresponding covariates X_i = (X_i1, …, X_in_i)^T, in which
X_ij is a p × 1 vector, i = 1, …, N. For the sake of simplicity, we assume that n_i = n
for all i, so the total number of observations is M = Nn. Assume that

  (i) μ_ij = g(X_ij^T β),   (5.1)
where g(·) is a specified function and β is an unknown parameter vector, and that

  (ii) σ_ij² = φν(μ_ij),   (5.2)

where ν(·) is a given function of μ_ij and φ is a scale parameter. Assume that the
covariance matrix of Y_i is V_i = φΣ_i, in which Σ_i = A_i^{1/2} R_i(α) A_i^{1/2},
where A_i = diag(ν(μ_ij)) and R_i(α) is the true correlation matrix of Y_i with an
unknown r × 1 parameter vector α.

Based on the assumptions (5.1) and (5.2), the GEE is

  U(β) = Σ_{i=1}^N D_i^T W_i^{−1} (Y_i − μ_i) = 0,   (5.3)

where D_i = ∂μ_i/∂β and W_i = φ A_i^{1/2} R_w A_i^{1/2}, in which R_w is a working
correlation matrix for Y_i. Let β̂_R denote the estimator resulting from equation
(5.3); then the robust variance estimator of β̂_R is

  V_R = Ω_R^{−1} [ Σ_{i=1}^N D_i^T W_i^{−1} var(Y_i) W_i^{−1} D_i ] Ω_R^{−1},

where

  Ω_R = Σ_{i=1}^N D_i^T W_i^{−1} D_i,

and Ω_R^{−1} is the model-based estimator of var(β̂_R). Specifically, if we specify
R_w as the identity matrix, then Ω_R reduces to

  Ω_I = Σ_{i=1}^N D_i^T A_i^{−1} D_i /φ.

A consistent estimator of V_R is

  V̂_R = Ω_R^{−1} [ Σ_{i=1}^N D_i^T W_i^{−1} S_i S_i^T W_i^{−1} D_i ] Ω_R^{−1} |_{β=β̂_R, α=α̂_R, φ=φ̂_R},

in which S_i = Y_i − μ̂_i is the residual vector, and α̂_R and φ̂_R are consistent
estimators of α and φ. In the following sections, we mainly introduce several criteria
for selecting the covariates in (5.1) and choosing an appropriate correlation structure
for R_w in (5.3) in longitudinal data analysis.

5.2 Selecting Covariates


5.2.1 Quasi-likelihood Criterion
The AIC is based on the likelihood function and asymptotic properties of the maximum
likelihood estimator. Since no distribution is assumed in the GEE, no likelihood is
defined, and thus the AIC cannot be used directly. Pan (2001) proposed a criterion,
named the QIC, that extends the AIC using the quasi-likelihood, to choose the
appropriate mean model or working correlation structure.

Based on the model specification μ_ij = E(y_ij|x_ij) and σ_ij² = φν(μ_ij), the
quasi-log-likelihood function of y_ij is

  Q(y_ij; μ_ij, φ) = ∫_{y_ij}^{μ_ij} (y_ij − t)/{φν(t)} dt.

If y_ij is a continuous response with specified ν(μ_ij) = 1, then Q(y_ij; μ_ij, φ) =
−(y_ij − μ_ij)²/(2φ). If y_ij is a binary response with specified ν(μ_ij) =
μ_ij(1 − μ_ij), then Q(y_ij; μ_ij, φ) = y_ij log{μ_ij/(1 − μ_ij)} + log(1 − μ_ij). If
y_ij is a count response with specified ν(μ_ij) = μ_ij, then Q(y_ij; μ_ij, φ) =
(y_ij log μ_ij − μ_ij)/φ (McCullagh and Nelder, 1989). The dispersion parameter is
φ = 1 for a binary response. For other response types, φ is unknown, and allowing
φ > 1 is extremely useful in modeling the over-dispersion that commonly occurs in
practice.
If we assume that the working correlation matrix R_w is the identity matrix I, then
the quasi-log-likelihood is

  Q(β, φ; I) = Σ_{i=1}^N Σ_{j=1}^n Q(y_ij; μ_ij, φ).

Therefore, for a given working correlation matrix R, the QIC can be expressed as

  QIC(R) = −2Q(β̂_R, φ̂; I) + 2tr(Q̂_I V̂_R),   (5.4)

where the first term is the estimated quasi-log-likelihood with β = β̂_R, and the
second term is the trace of the product of Q̂_I = φ^{−1} Σ_{i=1}^N D_i^T A_i^{−1} D_i
|_{β=β̂_R, φ=φ̂_R} and the V̂_R given in Section 5.1. Here Q̂_I and V̂_R are directly
available from the model-fitting results in many statistical software packages, such
as SAS, S-Plus, and R. Besides selecting covariates in generalized linear models, the
QIC can also be applied to select a working correlation structure in the GEE: one
calculates the QIC for the various candidate working correlation structures and then
picks the one with the smallest QIC.

In practice, since φ is unknown, we estimate it using φ̂ = (M − p)^{−1} Σ_{i=1}^N
Σ_{k=1}^n (y_ik − μ̂_ik)²/ν(μ̂_ik), in which μ̂_ik is estimated from the regression
model including all covariates. When all modeling specifications in the GEE are
correct, Q̂_I^{−1} and V̂_R are asymptotically equivalent and tr(Q̂_I V̂_R) ≈ p; the
QIC then reduces to the AIC. In GEE with longitudinal data, one may take QICu(R) =
−2Q(β̂_R, φ̂; I) + 2p as an approximation to QIC(R), and thus QICu(R) can be
potentially useful in variable selection. However, QICu(R) cannot be applied to select
the working correlation matrix R, because the value of Q(β̂_R, φ̂; I) does not depend
directly on the correlation matrix R. It is worth noting that for covariate selection
the performance of the QIC is similar across working correlation structures, with
QIC(I) performing the best (Pan, 2001).
Hardin and Hilbe (2012) proposed a slightly different version of QIC(R):

  QIC_H(R) = −2Q(β̂_R, φ̂; I) + 2tr(Ω̂_I V̂_R),

where Q(β̂_R, φ̂; I) and V̂_R are the same as those in (5.4), but Ω̂_I is evaluated
using β̂_I and φ̂_I instead of β̂_R and φ̂_R.
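
As a hedged illustration, the pieces of QIC_H(R) can be assembled by hand from two
geepack fits, one under the candidate working structure and one under independence;
the sketch assumes a Poisson mean model and relies on the geese components vbeta
(robust variance) and vbeta.naiv (model-based variance).

  ## Sketch: QIC_H from two geepack fits (Poisson quasi-likelihood assumed).
  ## fit_R: geeglm fit under the candidate structure; fit_I: independence fit.
  qic_by_hand <- function(fit_R, fit_I, phi = 1) {
    mu <- fitted(fit_R)
    y  <- fit_R$y
    quasi   <- sum(y * log(mu) - mu) / phi       # Poisson quasi-log-likelihood
    omega_I <- solve(fit_I$geese$vbeta.naiv)     # model-based information, independence
    V_R     <- fit_R$geese$vbeta                 # robust (sandwich) variance
    trace   <- sum(diag(omega_I %*% V_R))        # tr(Omega_I-hat %*% V_R-hat)
    -2 * quasi + 2 * trace
  }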

5.2.2 Gaussian Likelihood Criterion

Assume that Y_i is a continuous variable and follows a multivariate normal distribution
with mean X_i β and covariance matrix φΣ_i for i = 1, …, N. Then the log-likelihood
function of Y = (Y_1^T, …, Y_N^T)^T (omitting constant terms) is

  l(β, α, φ) = −(1/2)[M log φ + log |V| + (Y − Xβ)^T V^{−1}(Y − Xβ)/φ],

where V is an M × M block-diagonal matrix with the N diagonal blocks Σ_i. Therefore,
the AIC and BIC for longitudinal data are

  AIC = −2l(β̂, α̂, φ̂) + 2p,
  BIC = −2l(β̂, α̂, φ̂) + p log M,

where (β̂, α̂, φ̂) are the maximum likelihood estimators of (β, α, φ). Specifically,
β̂ = (X^T V̂^{−1} X)^{−1} X^T V̂^{−1} Y and φ̂ = (Y − Xβ̂)^T V̂^{−1}(Y − Xβ̂)/M, in
which V̂ is evaluated at α̂, which is estimated by maximizing the profile
log-likelihood l_p(α) = −(1/2)[M log φ̂(α) + log |V(α)|]. When the sample size N is
small, the corrected AIC for continuous longitudinal data is

  AICc = −2l(β̂, α̂, φ̂) + 2p M/(M − p − 1).

When N tends to infinity with a fixed p, the AICc approaches the AIC.
If Y_i is a discrete variable, Y_i does not follow a multivariate normal distribution.
However, White (1961) and Crowder (1985) proposed using a normal log-likelihood
to estimate β and α without assuming that Y_i is normally distributed. Hence, we can
utilize the following normal log-likelihood function as a pseudo-likelihood,

  l_G(β, α, φ) = −(1/2) Σ_{i=1}^N [log |2πV_i| + (Y_i − μ_i)^T V_i^{−1}(Y_i − μ_i)],

where μ_i = (g(X_i1^T β), …, g(X_in^T β))^T. Therefore, the AIC and BIC for
longitudinal data are

  GAIC = −2l_G(β̂_G, α̂_G, φ̂_G) + 2p,
  GBIC = −2l_G(β̂_G, α̂_G, φ̂_G) + p log M,

where (β̂_G, α̂_G) are the Gaussian estimators of (β, α) introduced in Chapter 4, and
φ̂_G is obtained as Σ_{i=1}^N (Y_i − μ_i(β̂_G))^T Σ̂_i^{−1}(β̂_G, α̂_G)(Y_i −
μ_i(β̂_G))/M. Note that there is a subtle difference between the traditional AIC/BIC
and the GAIC/GBIC: the traditional AIC/BIC requires the likelihood function to be
correct, while the GAIC/GBIC does not; here l_G is a working likelihood. Furthermore,
the l_G in the GAIC and GBIC can be replaced by the restricted Gaussian likelihood
proposed by Patterson and Thompson (1971). When the GAIC and the GBIC are used to
select covariates, simulation results indicate that the GAIC and the GBIC with the
independence correlation structure perform the best.
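
For continuous responses, the Gaussian criteria can be read off directly from a
working-likelihood fit. A minimal sketch using nlme::gls (with a hypothetical
long-format data frame dat containing y, x1, x2, and a subject identifier id) is:

  ## Sketch: Gaussian (working-likelihood) AIC and BIC under an AR(1) structure.
  library(nlme)
  fit <- gls(y ~ x1 + x2, data = dat,
             correlation = corAR1(form = ~ 1 | id),
             method = "ML")     # use method = "REML" for the restricted version
  AIC(fit)                      # Gaussian AIC
  BIC(fit)                      # Gaussian BIC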

5.3 Selecting Correlation Structure


Selecting an appropriate working correlation structure is pertinent to longitudinal
data analysis using generalized estimating equations. An inappropriate correlation
structure will lead to inefficient parameter estimation in the GEE framework (Hin
and Wang, 2009). Besides the QIC, the GAIC, and the GBIC, we will introduce
another two criteria for working correlation structure selection. Note that the goal of
selecting a working correlation structure is to estimate parameters more efficiently.

5.3.1 CIC Criterion

Hin and Wang (2009) noted that the expectation of the first term in the QIC is free
of the working correlation matrix R_w and the true correlation matrix R_T. Therefore,
Q(β̂, φ̂; I) does not contain information about the hypothesized within-subject
correlation structure, and the random error in Q(β̂, φ̂; I) can affect the performance
of the QIC. Hin and Wang (2009) therefore proposed using only the second term of the
QIC as a correlation information criterion (CIC) for correlation structure selection,

  CIC(R) = tr(Q̂_I V̂_R).

Thus, QIC(R) = −2Q(β̂_R, φ̂; I) + 2CIC(R). Without the effect of the random error from
the first term in (5.4), the CIC can be more powerful than the QIC.

The theoretical underpinning for the biased first term in (5.4) was outlined in
Wang and Hin (2010). Note that for continuous responses with φ estimated by φ̂, we
have QIC(R) = (M − p) + 2CIC(R), which means that the CIC is equivalent to the QIC
for choosing the working correlation structure with continuous responses. If the true
correlation matrix is the identity matrix I, then Q̂_I^{−1} and V̂_R are asymptotically
equivalent, and hence CIC(R) ≈ p.

5.3.2 C(R) Criterion

From a theoretical point of view, when R_w is close to the true correlation matrix
R_T, the discrepancy between the covariance matrix estimator of Y_i and the specified
working covariance matrix W_i is small. Therefore, based on a statistic for testing
the hypothesis that a covariance matrix equals a given matrix, Gosho et al. (2011)
proposed the following criterion for choosing a working correlation structure:

  C(R) = tr[ { (N^{−1} Σ_{i=1}^N S_i S_i^T)(N^{−1} Σ_{i=1}^N W_i)^{−1} − I_n }² ].   (5.5)

The value of C(R) should be close to zero when R_w is accurately specified. However,
C(R) is appropriate only for balanced data.
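
A hedged sketch of (5.5) for balanced data follows; S is assumed to be an N × n
matrix whose ith row is the residual vector S_i, and W a list of the N fitted working
covariance matrices W_i (both hypothetical inputs).

  ## Sketch: the C(R) statistic of (5.5) for balanced data.
  C_R <- function(S, W) {
    N <- nrow(S); n <- ncol(S)
    emp  <- crossprod(S) / N          # (1/N) sum_i S_i S_i^T
    Wbar <- Reduce(`+`, W) / N        # (1/N) sum_i W_i
    M    <- emp %*% solve(Wbar) - diag(n)
    sum(diag(M %*% M))                # tr(M^2); near zero for a good structure
  }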

5.3.3 Empirical Likelihood Criteria

Suppose that x_1, …, x_N are independent random samples from a distribution F with
a parameter vector θ ∈ Θ ⊂ R^p. The empirical likelihood function is

  L(F) = Π_{i=1}^N F(x_i) = Π_{i=1}^N p_i,

where p_i = dF(x_i) = P(X = x_i). The empirical likelihood ratio is defined as (Owen,
2001)

  R(F) = L(F)/L(F_n) = Π_{i=1}^N N p_i,   (5.6)

where F_n(x) = (1/N) Σ_{i=1}^N I(x_i ≤ x).

Assume that θ satisfies the unbiased estimating equations

  E_F{g(X, θ)} = 0,   (5.7)

where g(X, θ) = (g_1(X, θ), …, g_r(X, θ))^T is an r-dimensional function with r ≥ p.
The empirical likelihood ratio function for θ is defined as (Qin and Lawless, 1994)

  R(θ) = sup{ Π_{i=1}^N N p_i : 0 ≤ p_i ≤ 1, Σ_{i=1}^N p_i = 1, Σ_{i=1}^N p_i g(x_i, θ) = 0 }.

An explicit expression for R(θ) can be derived by the Lagrange multiplier method.
The maximum empirical likelihood estimator of θ is

  θ̂_EL = argmax_θ R(θ).

Qin and Lawless (1994) proved that −2 log{R(θ)/R(θ̂)} → χ_p², and hence confidence
regions can be constructed for θ. Note that when r = p, θ̂_EL is equal to the solution
to the estimating equations Σ_{i=1}^N g(x_i, θ) = 0. When r > p, the empirical
likelihood method allows us to combine several pieces of information about θ. However,
computational issues may arise in obtaining θ̂_EL.
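
For a fixed θ, the profiled ratio R(θ) can be computed numerically. One hedged route
is the emplik package, whose el.test function profiles the empirical likelihood for a
mean vector; applied to the rows g(x_i, θ), assembled into a hypothetical N × r matrix
G, it returns −2 log R(θ).

  ## Sketch: -2 log R(theta) for estimating equations via emplik::el.test.
  ## G: N x r matrix whose ith row is g(x_i, theta) at a candidate theta.
  library(emplik)
  elr <- el.test(G, mu = rep(0, ncol(G)))
  elr$"-2LLR"   # -2 log R(theta); compare with qchisq(0.95, df = p)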
Define the correlation matrix R_F(α) as a Toeplitz matrix with an (n − 1)-dimensional
correlation parameter vector α = (α_1, …, α_{n−1})^T; that is, the jth off-diagonal
element of R_F(α) is α_j. Therefore, R_F(α) is a more general structure, and the most
commonly used correlation structures, independence, exchangeable, AR(1), and MA(1),
are all embedded in R_F(α). We define the GEE model with R_F(α) as the "full model".
Furthermore, unbiased estimating functions can be constructed for β and α:

  g(Y_i, X_i, β, α, R_F) = ( D_i^T A_i^{−1/2} R_F^{−1}(α) A_i^{−1/2} S_i,
                             Σ_{k=1}^{n−1} e_ik e_{i,k+1} − α_1 φ̂(n − 1 − p/N),
                             ...,
                             Σ_{k=1}^{1} e_ik e_{i,k+n−1} − α_{n−1} φ̂(1 − p/N) )^T,

where e_ij = (y_ij − μ_ij)/√ν(μ_ij). The empirical likelihood ratio based on
g(Y_i, X_i, β, α, R_F) is

  R^F(β, α) = sup{ Π_{i=1}^N N p_i : 0 ≤ p_i ≤ 1, Σ_{i=1}^N p_i = 1,
                   Σ_{i=1}^N p_i g(Y_i, X_i, β, α, R_F) = 0 }.   (5.8)

Note that this empirical likelihood is built on only very weak assumptions, such as
stationarity of the underlying correlation structure. R^F(β, α) serves as a unified
measure that can be applied to most of the competing GEE models (Chen and Lazar, 2012).

To avoid the computational issues associated with the empirical likelihood, Chen and
Lazar (2012) proposed calculating R^F(β, α) at given values of β and α. If n = 3, the
correlation matrix R_F(α) is parameterized by α = (α_1, α_2)^T. Suppose that there are
three correlation structure candidates, exchangeable (R_E), AR(1) (R_A), and Toeplitz
(R_F), with corresponding estimators (β̂_E, α̂_E), (β̂_A, α̂_A), and (β̂_F, α̂_1F,
α̂_2F), each obtained with one of the three working correlation structures on the same
data. The corresponding empirical likelihood ratios are R^F(β̂_E, α̂_E, α̂_E),
R^F(β̂_A, α̂_A, α̂_A²), and R^F(β̂_F, α̂_1F, α̂_2F). It is easy to see that
R^F(β̂_F, α̂_1F, α̂_2F) is equal to one, while the values for the other models are
smaller than one. Therefore, the different competing structures embedded in the general
correlation structure R_F(α) yield different values, which can be used to compare the
competing structures and select the best one.
Chen and Lazar (2012) modified the AIC and the BIC by substituting the empirical
likelihood for the parametric likelihood and gave empirical likelihood versions of
the AIC and the BIC:

  EAIC = −2 log R^F(θ̂_w) + 2 dim(θ),
  EBIC = −2 log R^F(θ̂_w) + log(N) dim(θ),

where θ̂_w = (β̂_w^T, α̂_w^T)^T. It is worth mentioning that, when the EAIC and the
EBIC are calculated, β̂_w is the GEE estimator under the working correlation structure
R_w, and α̂_w is the method-of-moments estimator of α given β̂_w and R_w. Note that
the EAIC and the EBIC involve p + n − 1 unknown parameters even for an exchangeable
or AR(1) correlation matrix. Therefore, if the sample size N is small and n is large,
the performance of the EAIC and the EBIC is affected.

5.4 Examples
In this section, we illustrate the aforementioned criteria for covariate selection and
correlation structure selection using some real datasets. The R code is also provided.

5.4.1 Examples for Variable Selection

In this subsection, we discuss the criteria for covariate selection under the
assumption that the variance function and the correlation structure are given.

Example 1. Hormone data


The hormone data were introduced in Chapter 2. The response variable is the
log-transformed progesterone level. In addition to the age and body mass index (BMI)
covariates, we also consider the time effect. To account for the highly nonlinear time
trend, we utilize a polynomial to fit the time effect. We use the QIC, the EQIC, the
GAIC, and the GBIC to select the degree of the polynomial. The results are presented
in Table 5.1. As we can see, the QIC selects the quartic polynomial under both working
correlation structures, whereas the EQIC and the GAIC both choose the quintic
polynomial. The GBIC chooses the quartic polynomial under the independence correlation
structure but selects the quintic polynomial under the exchangeable correlation
structure. According to Pan (2001), the QIC, the GAIC, and the GBIC with the
independence correlation structure perform better than the other cases. Furthermore,
the values of the EQIC, the GAIC, and the GBIC for the quartic and quintic polynomials
are close. In general, underfitting is more serious than overfitting; hence, we choose
the quintic polynomial, which is the more conservative choice.

5.4.2 Examples for Correlation Structure Selection


In this subsection, we discuss the criteria for correlation structure selection under
the assumption that the variance function and the covariates are given.

Example 2. Madras longitudinal schizophrenia study (binary data)


This study has been introduced in Chapter 2. To investigate the course of positive
and negative psychiatric symptoms over the first year after initial hospitalization for
schizophrenia, the response variable Y is binary, indicating the presence of thought
disorder. The covariates are (a) Month: duration since hospitalization (in months), (b)
Age: age of patient at the onset of symptom (1 represents Age <20; 0 otherwise), (c)
Gender: gender of patient (1 =female; 0 =male), (d) Month×Age is the interaction
term between variables Month and Age, (e) Month×Gender is the interaction term
between variables Month and Gender, and (f) Age×Gender is the interaction term

Table 5.1 The values of the criteria QIC, EQIC, GAIC, and GBIC under two different
working correlation structures (CS) for Example 1.

CS: Independence
  Covariates                                    QIC      EQIC     GAIC     GBIC
  age, bmi, time                                781.25   2258.19  2442.94  2447.52
  age, bmi, time, time²                         772.76   2233.12  2428.36  2434.47
  age, bmi, time, time², time³                  670.72   1951.60  2250.88  2258.51
  age, bmi, time, time², time³, time⁴           661.67   1922.09  2232.42  2241.57
  age, bmi, time, time², time³, time⁴, time⁵    663.27   1921.63  2232.35  2243.04

CS: Exchangeable
  Covariates                                    QIC      EQIC     GAIC     GBIC
  age, bmi, time                                784.12   2261.78  2170.26  2174.84
  age, bmi, time, time²                         775.67   2236.77  2147.46  2153.56
  age, bmi, time, time², time³                  673.45   1955.28  1874.48  1882.11
  age, bmi, time, time², time³, time⁴           664.40   1925.79  1843.43  1852.59
  age, bmi, time, time², time³, time⁴, time⁵    666.32   1925.65  1841.44  1852.12

between Age and Gender. We consider the logistic regression model

  logit(μ_ik) = β_0 + β_1 Month_ik + β_2 Age_i + β_3 Gender_i + β_4 Month_ik × Age_i
              + β_5 Month_ik × Gender_i + β_6 Age_i × Gender_i,  k = 1, …, n_i; i = 1, …, 86.

We utilize the function geeglm in the geepack package to estimate the parameters in
the model. The parameter estimates obtained from the GEE with the independence,
exchangeable, and AR(1) working correlation matrices are listed in Table 5.2, together
with the standard errors (Std.err) of the parameter estimators and the p-values.
When the working correlation structure is independence, the intercept and Month
have significant effects on the probability of the presence of thought disorder.
When the working correlation structure is exchangeable, besides the intercept and
Month, the interaction term between Age and Gender is also significant, which indicates
that age has a significantly different impact for females and males. The correlation
coefficient estimate is 0.258, which indicates that the correlation among the
observations is weak. When the correlation structure is AR(1), the significant
covariates are the same as those under the independence working correlation structure.
As we can see, the parameter estimates based on the three different correlation
structures are close, but the standard errors of the parameter estimators under the
AR(1) correlation structure are the smallest.
We next use the criteria to select a correlation structure for the GEE. The values
of the criteria are presented in Table 5.3. The results indicate that all the criteria select
the AR(1) correlation structure except the QIC. The QICH and the CICH correspond
to the QIC and the CIC in which Ω̂I is evaluated via the GEE estimates based on the
independence correlation structure.

Table 5.2 The results obtained by the GEE with independence (IN), exchangeable (EX), and
AR(1) correlation structures in Example 2.
Independence correlation matrix
Estimate Std.err Wald Pr(>|W|)
Intercept 0.7969 0.3180 6.2812 0.0122
MONTH -0.2627 0.0578 20.6750 0.0000
AGE 0.1967 0.6383 0.0950 0.7580
GENDER -0.7200 0.4643 2.4052 0.1209
MONTH.AGE -0.1006 0.0991 1.0311 0.3099
MONTH.GENDER -0.1274 0.1076 1.4029 0.2362
AGE.GENDER 1.0495 0.7170 2.1425 0.1433
Exchangeable correlation matrix
Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.8518 0.3232 6.9451 0.0084
MONTH -0.2779 0.0613 20.5433 0.0000
AGE 0.1369 0.6684 0.0419 0.8378
GENDER -1.2709 0.7256 3.0679 0.0799
MONTH.AGE -0.0692 0.1007 0.4722 0.4920
MONTH.GENDER -0.1643 0.1071 2.3555 0.1248
AGE.GENDER 1.7727 0.8968 3.9075 0.0481
AR(1) correlation matrix
Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.6988 0.3076 5.1593 0.0231
MONTH -0.2409 0.0538 20.0577 0.0000
AGE 0.0083 0.5962 0.0002 0.9889
GENDER -0.4578 0.4481 1.0436 0.3070
MONTH.AGE -0.0643 0.0894 0.5177 0.4718
MONTH.GENDER -0.1771 0.0979 3.2735 0.0704
AGE.GENDER 1.0809 0.7136 2.2943 0.1298

In the statistical software R, we can first use the geeglm function in the geepack
package to obtain the parameter estimates, and then use the QIC function in the
packages geepack and MESS to calculate the QIC and the CIC. The specific results are
shown in Table 5.4.

Table 5.3 The values of the criteria under independence (IN), exchangeable (EX), and AR(1)
correlation structures in Example 2.
IN EX AR(1)
QIC 954.12 971.91 954.85
QICH 949.46 972.28 948.88
CIC 23.09 25.79 22.69
CICH 20.75 25.97 19.71
GAIC 882.53 1069.70 477.69
GBIC 899.71 1089.34 497.32
EAIC 2932.02 2782.40 243.61
EBIC 2949.20 2802.03 263.24

Table 5.4 The results obtained via the QIC function in the MESS package.
QIC QICu Quasi Lik CIC params QICC
IN 949.46 921.95 -453.98 20.75 7.00 949.58
EX 940.81 934.34 -460.17 10.24 7.00 940.97
AR 923.25 923.46 -454.73 6.90 7.00 923.41

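A hedged sketch of these calls, with a hypothetical data frame madras containing the
variables y, month, age, gender, and id, is:

  ## Sketch: logistic GEE of Example 2 and the QIC/CIC output of Table 5.4.
  library(geepack)
  library(MESS)
  fit_ar1 <- geeglm(y ~ month * age + month * gender + age * gender,
                    id = id, family = binomial, corstr = "ar1",
                    data = madras)
  QIC(fit_ar1)   # QIC, QICu, quasi-likelihood, CIC, params, QICC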
In the output, QIC and CIC correspond to the values of QIC_H and CIC_H, respectively,
and QICu corresponds to the QICu defined in Section 5.2.1.
It is worth noting that the values of QIC and CIC obtained via the QIC function differ
from those given in Table 5.3, except under the independence working correlation
matrix, because Ω_R = Σ_{i=1}^N D_i^T V_i^{−1} D_i is used in the QIC function instead
of Ω_I.

Example 3. Progabide study


The Progabide study was introduced in Chapter 2. The response variable Y is a count
response: the number of epileptic seizures of each patient. A log-linear regression
model is considered:

  log(μ_ij) = log(t_ij) + β_0 + β_1 base_ij + β_2 trt_i
            + β_3 log(age_ij) + β_4 trt_i × base_ij,

where j = 0, …, 4; i = 1, …, 59; and log(t_ij) is an offset to take account of the
different interval lengths, with t_ij = 8 if j = 0 and t_ij = 2 if j = 1, 2, 3, 4. The
covariate base_ij = 0 at baseline and base_ij = 1 otherwise; trt_i = 1 if the ith
patient is in the progabide group and trt_i = 0 otherwise; age indicates the patient's
age; and the last term is the interaction effect between the treatment and the baseline
indicator. Therefore, the parameter β_1 is the logarithm of the ratio of the average
seizure rate after treatment to that before treatment for the placebo group. The
parameter β_2 represents the difference in the logarithm of the average seizure rate
at baseline between the progabide and placebo groups. The parameter β_4 is of interest
and indicates the difference in the logarithm of the post- to pre-treatment ratio
between the progabide and placebo groups.

To account for over-dispersion, which is not described by the Poisson distribution,
we assume that var(y_ij) = φE(y_ij) with φ > 1. The function summary in R is used to
produce summaries of the results of the various model fits, and the function update
is used to update and refit a model. The parameter estimates obtained from the GEE
with the independence, exchangeable, and AR(1) working correlation structures are
given in Table 5.5.
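
A hedged sketch of the fitting calls, with a hypothetical data frame epil containing
y, time, base, trt, age, and id, is:

  ## Sketch: Poisson GEE with an offset for Example 3; refit under other structures.
  library(geepack)
  fit_in <- geeglm(y ~ base + trt + log(age) + base:trt + offset(log(time)),
                   id = id, family = poisson, corstr = "independence",
                   data = epil)
  fit_ex <- update(fit_in, corstr = "exchangeable")   # refit with a new structure
  summary(fit_ex)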
The parameter estimates obtained from the GEE with the three correlation structures
are similar. The over-dispersion parameter φ is significantly larger than 1. The
estimate of β_4 is negative, which indicates that progabide can reduce the number of
epileptic seizures. The estimate of the correlation coefficient is larger than 0.7,
indicating strong correlation. The results obtained

Table 5.5 The results obtained by the GEE with independence (IN), exchangeable (EX), and
AR(1) correlation structures in Example 3.
Independence correlation matrix
Estimate Std.err Wald Pr(>|W|)
Intercept 2.8887 1.6027 3.2487 0.0715
base 0.1118 0.1159 0.9306 0.3347
trt 0.0090 0.2089 0.0018 0.9658
log(age) -0.4618 0.4852 0.9059 0.3412
base:trt -0.1047 0.2134 0.2407 0.6237
Exchangeable correlation matrix
Estimate Std.err Wald Pr(>|W|)
Intercept 3.9705 1.2129 10.7160 0.0011
base 0.1118 0.1159 0.9306 0.3347
trt -0.0086 0.2147 0.0016 0.9681
log(age) -0.7876 0.3649 4.6583 0.0309
base:trt -0.1047 0.2134 0.2407 0.6237
AR(1) correlation matrix
Estimate Std.err Wald Pr(>|W|)
Intercept 4.5580 1.3991 10.6127 0.0011
base 0.1561 0.1137 1.8843 0.1699
trt -0.0312 0.2093 0.0223 0.8814
log(age) -0.9765 0.4186 5.4409 0.0197
base:trt -0.1315 0.2667 0.2431 0.6220

under the exchangeable and AR(1) correlation matrices indicate that intercept and
log(age) are significant.
Patient number 207 appears to be very unusual (id = 49 in the dataset). This patient
had an extremely high seizure count at baseline (151 seizures in eight weeks), and the
count doubled after treatment (302 seizures in eight weeks). The GEE method is
sensitive to outliers, and hence we rerun the analysis after removing this outlier
and present the new results in Table 5.6.

The results are different with and without the outlier. After removing the outlier,
the interaction term between base and treatment is significant, which indicates that
the difference in the logarithm of the post- to pre-treatment ratio between the
progabide and placebo groups is significant.
Next, we use the criteria to select the correlation structure. We calculate the
criteria with and without the 49th patient; the results are presented in Table 5.7.
When the data contain the outlier, all the criteria select the exchangeable correlation
structure. However, when the outlier is excluded, the QIC and QIC_H choose the
independence correlation structure and the CIC chooses the AR(1) correlation structure,
which indicates that the QIC and the CIC are sensitive to outliers. We believe that the
exchangeable correlation structure is more reliable. We also calculate the sample
correlation matrix (see Table 5.8), which is close to the exchangeable correlation
structure.

Table 5.6 The results obtained by the GEE with independence (IN), exchangeable (EX), and
AR(1) correlation structures when removing the outliers in Example 3.
Independence correlation matrix
Estimate Std.err Wald Pr(>|W|)
Intercept 1.6165 1.3153 1.5104 0.2191
base 0.1118 0.1159 0.9306 0.3347
trt -0.1089 0.1909 0.3253 0.5684
log(age) -0.0804 0.3966 0.0411 0.8394
base:trt -0.3024 0.1711 3.1248 0.0771
Exchangeable correlation matrix
Intercept 3.2250 1.2461 6.6977 0.0097
base 0.1118 0.1159 0.9306 0.3347
trt -0.1265 0.1884 0.4505 0.5021
log(age) -0.5629 0.3753 2.2497 0.1336
base:trt -0.3024 0.1711 3.1248 0.0771
AR(1) correlation matrix
Intercept 4.1477 1.3083 10.0503 0.0015
base 0.1521 0.1114 1.8654 0.1720
trt -0.1143 0.1943 0.3457 0.5566
log(age) -0.8515 0.3918 4.7236 0.0298
base:trt -0.4018 0.1757 5.2314 0.0222

Table 5.7 The results of the criteria for the data with and without the outliers in Example 3.
Complete data Outlier deleted
IN EX AR IN EX AR
QIC -695.76 -701.26 -678.12 -1062.05 -1014.30 -938.22
QICH -695.76 -701.42 -676.64 -1062.05 -1013.18 -934.95
CIC 11.79 10.62 11.00 11.03 10.32 9.47
CICH 11.79 10.54 11.73 11.03 10.88 11.11
C(R) 13.57 2.18 12.38 9.61 2.74 10.66
GAIC 2409.43 2142.65 2256.92 2158.03 2017.64 2098.56
GBIC 2419.81 2155.12 2269.39 2168.33 2030.00 2110.92
EAIC 142.68 13.33 60.13 1894.41 17.74 26.56
EBIC 153.07 25.79 72.59 1904.71 30.10 38.92

Table 5.8 The correlation matrix of the data with and without the outliers in Example 3.

       Complete data                      Outlier deleted
  0    1.00                               1.00
  1    0.79  1.00                         0.68  1.00
  2    0.83  0.87  1.00                   0.73  0.69  1.00
  3    0.67  0.74  0.80  1.00             0.49  0.54  0.67  1.00
  4    0.84  0.89  0.89  0.82  1.00       0.75  0.72  0.76  0.71  1.00
Chapter 6

Robust Approaches

6.1 Introduction
The GEE method is robust against the misspecification of correlation structures, but
it is sensitive to outliers because it is essentially a generalized weighted least
squares approach (Jung and Ying, 2003; Wang and Zhao, 2008). In this chapter, we will
introduce several robust methods for parameter estimation in the analysis of
longitudinal data. In §6.2, we introduce the rank-based method, which is
distribution-free, robust, and highly efficient (Hettmansperger, 1984). In §6.3, we
will introduce quantile regression, which gives a global assessment of covariate
effects on the distribution of the response variable, provides a more complete
description of that distribution, and is robust against outliers (Koenker and Bassett,
1978). In §6.4, we will introduce other methods based on Huber's function and the
exponential squared loss function, which are robust not only against outliers in the
response variable but also against outliers in the covariates.

6.2 Rank-based Method


Rank-based methods are robust and highly efficient (Hettmansperger, 1984). Jung
and Ying (2003) extended the Wilcoxon–Mann–Whitney rank statistic to longitudinal
data under the independence assumption. Their method is simple and also has desirable
properties. In this section, we will introduce several rank-based methods for
parameter estimation in linear regression models.
Let y_ik, k = 1, …, n_i, denote the responses for subject i = 1, …, N. The model
for y_ik is

  y_ik = X_ik^T β + ε_ik,  k = 1, …, n_i, i = 1, …, N,   (6.1)

where β is the parameter vector corresponding to the covariate vector X_ik of dimension
p. Assume that the median of ε_ik − ε_jl is zero. To avoid complications caused by
ties, we assume that the error terms ε_ik are continuous variables. Define residuals
e_ik(β̂) = y_ik − X_ik^T β̂, where β̂ is a given consistent estimator of β.

6.2.1 An Independence Working Model

Let M be the total number of observations and X̄ = M^{−1} Σ_{i=1}^N Σ_{k=1}^{n_i} X_ik.
The corresponding ranks of the residuals e_ik(β) are r_ik(β) = Σ_{j=1}^N Σ_{l=1}^{n_j}
I(e_jl ≤ e_ik). Jung and Ying (2003) proposed estimating β using the following
estimating functions:

  U_I(β) = M^{−1} Σ_{i=1}^N Σ_{k=1}^{n_i} (X_ik − X̄) r_ik(β).   (6.2)

Let β̂_I be the resultant estimator from (6.2), which can also be obtained by
minimizing the following loss function:

  L(β) = M^{−2} Σ_{i=1}^N Σ_{k=1}^{n_i} Σ_{j=1}^N Σ_{l=1}^{n_j} |e_ik(β) − e_jl(β)|.   (6.3)
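
A hedged sketch of (6.3) and its direct numerical minimization (ours; practical only
for small M, since the double sum has O(M²) terms) is:

  ## Sketch: the Jung-Ying independence rank loss (6.3), minimized with optim.
  ## y: response vector of length M; X: M x p design matrix.
  jy_loss <- function(beta, y, X) {
    e <- as.vector(y - X %*% beta)
    mean(abs(outer(e, e, "-")))      # M^{-2} sum over all pairs |e_ik - e_jl|
  }
  ## beta_I <- optim(rep(0, ncol(X)), jy_loss, y = y, X = X)$par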

To make inferences about the regression coefficients of model (6.1), such as testing
the null hypothesis H_0: β = β_0, Jung and Ying (2003) proposed the test statistic
N U_I^T(β) V^{−1} U_I(β), which bypasses density estimation and bandwidth selection;
here V is the covariance matrix of N^{−1/2} U_I(β). A consistent estimator of V is
V̂ = N^{−1} Σ_{i=1}^N ξ̂_i ξ̂_i^T, where

  ξ̂_i = Σ_{k=1}^{n_i} [ (X_ik − X̄){N^{−1} r_ik(β̂_I) − 1/2}
         − N^{−1} Σ_{i′=1}^N Σ_{k′=1}^{n_{i′}} (X_{i′k′} − X̄) I(e_{i′k′}(β̂_I) ≤ e_ik(β̂_I)) ].

Under the null hypothesis, the test statistic approximately has a χ² distribution with
p degrees of freedom, and the null hypothesis is rejected for a large observed value
of the test statistic.
The function (6.3) is based on the independence working model assumption; thus the
efficiency of β̂_I may be improved by taking account of the correlations and the impact
of varying cluster sizes. To this end, Wang and Zhao (2008) introduced weighted
estimating functions.

6.2.2 A Weighted Method

To avoid modeling the correlations, we can resample one observation from each subject
and then apply the rank method to the N independent observations y_(i), 1 ≤ i ≤ N. If
the corresponding residual for the ith subject is e_(i), the dispersion function is

  Σ_{i=1}^N Σ_{j≠i} |e_(i) − e_(j)|.

To make use of all the observations, we would repeat this resampling process many
times. Conditional on sampling one observation from each cluster, the probability of
y_ik being sampled is 1/n_i. Therefore, the limiting dispersion function is

  L_w(β) = M^{−2} Σ_{i=1}^N Σ_{j≠i} (n_i n_j)^{−1} Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} |e_ik − e_jl|.
This motivates Wang and Zhao (2008) to consider the following weighted loss function
for the estimation of β,

  L_wz(β) = M^{−2} Σ_{i=1}^N Σ_{j≠i} Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} ω_i ω_j |e_ik − e_jl|,

where ω_i and ω_j are weight functions. The corresponding quasi-score functions for
β are

  U_wz(β) = M^{−2} Σ_{i=1}^N Σ_{j≠i} Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} ω_i ω_j (X_ik − X_jl) I(e_jl ≤ e_ik).   (6.4)
For the weight function ω_i, sensible choices include (i) ω_i = 1, (ii) ω_i = 1/n_i,
and (iii) ω_i = 1/{1 + (n_i − 1)ρ̄}, where ρ̄ is the "average" within-subject
correlation and can be estimated by

  ρ̂_0 = Σ_{i=1}^N Σ_{k≠l} (r_ik − r̄)(r_il − r̄) / Σ_{i=1}^N Σ_{k=1}^{n_i} (n_i − 1)(r_ik − r̄)²,

where r̄ is the mean of the ranks r_ik for k = 1, …, n_i and i = 1, …, N. When the
within-subject correlation is weak, it can be ignored in the analysis; ω_i = 1
corresponds to the classical rank regression.
Let f_ik(·) and F_ik(·) be the probability density function and the cumulative
distribution function of ε_ik, for k = 1, …, n_i and i = 1, …, N. Let V_w =
lim_{M→∞} Σ_{i=1}^N E(η_i η_i^T), where

  η_i = Σ_{j≠i} Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} ω_i ω_j (X_ik − X_jl) {F_jl(ε_ik) − 1/2}.

Furthermore, let

  D_N = (2/M²) Σ_{i=1}^N Σ_{j≠i} Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} ω_i ω_j (X_ik − X_jl)(X_ik − X_jl)^T ∫ f_ik dF_ik.

Assume that the limiting matrix of D_N as N → ∞ is D_w. If n_i is bounded and both
V_w and D_w are positive definite, then the resulting estimator β̂_wz from (6.4) has
the following asymptotic property: √M(β̂_wz − β) converges in distribution to
N(0, Λ_w), where Λ_w = D_w^{−1} V_w D_w^{−1}. The covariance matrix V_w can be replaced
with the consistent estimator V̂_w = (16/M³) Σ_{i=1}^N η̂_i η̂_i^T, in which

  η̂_i = Σ_{j≠i} Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} ω_i ω_j (X_ik − X_jl) {I(e_ik ≥ e_jl) − 1/2}.

The proposed weighted rank method is simple and effective because it utilizes only
the weight function to incorporate correlations and cluster sizes for efficiency gains.
However, when the design is balanced, this method is equivalent to Jung and Ying's
method. Furthermore, because the weighting is applied only at the cluster level, this
method can lose efficiency when covariates vary within each subject (Neuhaus and
Kalbfleisch, 1998).
6.2.3 Combined Method
Decompose the L(β) given in (6.3) into two parts, as

  L(β) = M^{−2} Σ_{i=1}^N Σ_{j≠i} Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} |e_ik(β) − e_jl(β)|
       + M^{−2} Σ_{i=1}^N Σ_{k=1}^{n_i} Σ_{l=1}^{n_i} |e_ik(β) − e_il(β)|
       = L_B(β) + M^{−1} L_W(β),

where L_B(β) and L_W(β) stand for the between- and within-subject loss functions. As M
tends to infinity, L(β) ≃ L_B(β) + O_p(1), and thus the information contained in the
within-subject loss function L_W(β) is ignored. Therefore, Wang and Zhu (2006) proposed
minimizing L_B(β) and L_W(β) separately and then combining the two resulting estimators
in some optimal sense.
The estimating functions based on the between- and within-subject ranks L_B(β) and
L_W(β) are

  S_B(β) = M^{−2} Σ_{i=1}^N Σ_{j≠i} Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} (X_ik − X_jl) I(e_jl ≤ e_ik),   (6.5)

  S_W(β) = M^{−1} Σ_{i=1}^N Σ_{k≠l} (X_ik − X_il) I(e_il ≤ e_ik).   (6.6)

Because the median of e_ik − e_jl is zero, (6.5) and (6.6) are unbiased and can be used
to estimate β. Suppose that β̂_B and β̂_W are the estimators derived from (6.5) and
(6.6), respectively. It seems natural to combine these two estimators to obtain a more
efficient estimator (Heyde, 1987). Let

  X_B = M^{−2} Σ_{i=1}^N Σ_{j≠i} Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} (X_ik − X_jl)(X_ik − X_jl)^T

and

  X_W = M^{−1} Σ_{i=1}^N Σ_{k=1}^{n_i} Σ_{l=1}^{n_i} (X_ik − X_il)(X_ik − X_il)^T.

Wang and Zhu (2006) proposed finding an optimal estimator of β from an optimal linear
combination of the estimating functions for β:

  S_C(β) = (X_B, X_W) Σ^{−1} (S_B(β)^T, S_W(β)^T)^T,   (6.7)
where Σ is the covariance matrix of S_B(β) and S_W(β). Because Σ is unknown, Wang and
Zhu (2006) proposed using a resampling method to bypass density estimation. Suppose
that {z_i}_{i=1}^N are independently sampled from the binomial distribution B(N, 1/N).
The bootstrap method of resampling the subjects with replacement leads to the
following perturbed estimating functions:

  S̃_B(β) = M^{−2} Σ_{i=1}^N Σ_{j≠i} Σ_{k=1}^{n_i} Σ_{l=1}^{n_j} z_i z_j (X_ik − X_jl) I(e_jl ≤ e_ik),

  S̃_W(β) = M^{−1} Σ_{i=1}^N Σ_{k≠l} z_i (X_ik − X_il) I(e_il ≤ e_ik).

The covariance matrix of S̃_B and S̃_W can be used as an estimate of Σ.


It is worth noting that in the case of cluster-level covariate designs, in which the
covariate has the same value for all the units in the same subject/cluster, the within-
subject estimating functions are identically equal to zero, and hence the combined
method is equivalent to the independence working model.

6.2.4 A Method Based on GEE

To improve the efficiency of the estimators by incorporating the within-subject
correlations, Fu and Wang (2012) derived an optimal rank-based estimating function,
in terms of the asymptotic variance of the regression parameter estimators (Heyde,
1987), under the GEE framework.

Let S_ik = M^{−1} Σ_{j=1}^N Σ_{l=1}^{n_j} {I(e_jl(β) ≤ e_ik(β)) − 1/2} be the average
rank of e_ik(β). Subtracting 1/2 from the indicator function I(e_jl(β) ≤ e_ik(β))
ensures that S_ik is unbiased. Let S_i(β) = (S_i1, …, S_in_i)^T and S(β) =
(S_1(β)^T, …, S_N(β)^T)^T. Because the sum of the elements of S(β) is a constant, the
covariance matrix V_G of S(β) is singular. Therefore, Fu and Wang (2012) proposed
randomly selecting M − 1 elements of S(β) to estimate β. We will still use S(β) and
V_G to denote the M − 1 average ranks and their covariance matrix, respectively.

Following Heyde (1987) and Durairajan (1992), Fu and Wang (2012) proposed an optimal
combination of all the elements of S(β), which takes the form

  S_O(β) = D̄^T V_G^{−1} S(β),   (6.8)

where D̄ = {D̄_ik, k = 1, …, n_i; i = 1, …, N} contains the derivatives of the expected
values of S(β), with D̄_ik^T = M^{−1} Σ_{j=1}^N Σ_{l=1}^{n_j} (X_jl − X_ik) f_ikjl(0),
where f_ikjl(·) is the density function of ε_ik − ε_jl. Note that D̄_ik involves the
unknown density function f_ikjl(·). To bypass estimating f_ikjl(0), Fu and Wang (2012)
assumed that f_ikjl(0) is constant, so that D̄_ik^T can be replaced by D_ik^T =
X̄ − X_ik in (6.8).
Due to the variety of correlation types induced by combinations of the between- and
within-subject ranks, it is difficult to specify the correlation structure of S(β).
Under the assumption of an exchangeable correlation structure for the within-subject
residuals, Fu and Wang (2012) derived V_G and used this matrix as a working matrix
in (6.8). Let σ_i² = Var(S_ik), σ_ii = Cov(S_ik, S_il), σ_ij = Cov(S_ik, S_jl),
r_1i = Σ_{j≠i} n_j, r_2i = Σ_{j≠i} n_j², and r_3ij = n_i + n_j − 1. Fu and Wang (2012)
obtained

  σ_i² = [{r_1i + (r_2i − r_1i)ρ_3 + (r_1i² − r_2i)ρ_4}τ_2²
          + (n_i − 1)τ_1{2r_1i ρ_2 τ_2 + [1 + (n_i − 2)ρ_1]τ_1}]/M²,
  σ_ii = {[r_1i ρ_3 + (r_2i − r_1i)ρ_5 + (r_1i² − r_2i)ρ_6]τ_2²
          − 2r_1i ρ_2 τ_1 τ_2 − [(n_i − 2)ρ_1 + 1]τ_1²}/M²,
  σ_ij = {[(r_2i − n_j²)(r_1i − n_j)r_3ij]ρ_6 − 1 − (r_1i − n_j)ρ_4
          − (r_3ij − 1)ρ_3 − (n_i n_j − r_3ij)ρ_5}τ_2²/M²,

where τ_1² = var{I(e_ik ≤ e_il)}, τ_2² = var{I(e_ik ≤ e_jl)}, and ρ_1, …, ρ_6 are six
types of correlation coefficients among {S_ik(β), k = 1, …, n_i; i = 1, …, N}, given
as follows:

  ρ_1 = corr{I(e_ik ≤ e_il), I(e_ik ≤ e_il′)},   ρ_2 = corr{I(e_ik ≤ e_il), I(e_ik ≤ e_rs)},
  ρ_3 = corr{I(e_ik ≤ e_jl), I(e_ik ≤ e_jl′)},   ρ_4 = corr{I(e_ik ≤ e_jl), I(e_ik ≤ e_rs)},
  ρ_5 = corr{I(e_ik ≤ e_jl), I(e_ik′ ≤ e_jl′)},  ρ_6 = corr{I(e_ik ≤ e_jl), I(e_ik′ ≤ e_rs)}.
These six types of correlation coefficients describe the correlations between the
relative orderings of two different within-correlated pairwise residuals. For example,
ρ_1 indicates the correlation between the relative orderings of two pairwise residuals
from the same subject, and ρ_6 indicates the correlation between the relative ordering
of two different residuals from the same subject and that of another two distinct
residuals from different subjects.

Note that if there are no within-subject correlations, then ρ_5 = ρ_6 = 0. If
correlations exist and the errors have an exchangeable structure, then the pairwise
residuals in ρ_1 and ρ_4 are identically distributed, and hence ρ_1 = ρ_4 = 1/3.
Furthermore, it can be seen that σ_i² = O(1), σ_ii = O(1), and σ_ij = O(M^{−1}) for
i ≠ j, which implies that the correlations between rank residuals from different
subjects can be ignored as M tends to infinity. Utilizing the idea of the GEE, Fu and
Wang (2012) proposed using a block diagonal matrix diag(V_1, …, V_N) as a working
covariance matrix for V_G, and obtained
an estimate of β using the estimating equations

  S_G(β) = Σ_{i=1}^N D_i^T V_i^{−1} S_i(β) = 0,   (6.9)

where V_i^{−1} = (σ_i² − σ_ii)^{−1} {I_{n_i} − σ_ii [σ_i² + (n_i − 1)σ_ii]^{−1}
J_{n_i×n_i}} is the inverse of Cov(S_i(β)), in which I is an identity matrix and J is
a matrix with all elements equal to one. Equation (6.9) is the generalized estimating
equation based on the S_i(β). Let β̂_G be the resulting estimator from (6.9); then it
can be shown that √N(β̂_G − β) is asymptotically normal with mean zero and covariance
matrix

  Σ_G = lim_{N→∞} N ( Σ_{i=1}^N D_i^T V_i^{−1} D̄_i )^{−1}
        ( Σ_{i=1}^N D_i^T V_i^{−1} Cov{S_i(β_0)} V_i^{−1} D_i )
        ( Σ_{i=1}^N D̄_i^T V_i^{−1} D_i )^{−1}.

To avoid estimating the joint density of the error terms, we use a resampling method
to estimate Σ_G. Let {z_i}_{i=1}^N be sampled from a distribution with unit mean and
unit variance, and define the perturbed estimating equation

  S̃_G(β) = Σ_{i=1}^N z_i D_i^T V_i^{−1} S_i(β) = 0.   (6.10)

We can derive an estimate of β by solving equation (6.10) for each sequence
{z_i}_{i=1}^N; independent sequences of {z_i}_{i=1}^N thus yield many estimates of β,
and the covariance of these bootstrap estimates can serve as an estimate of Σ_G.
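
As a hedged sketch of this resampling step, where solve_SG is a hypothetical solver
that returns the root of (6.10) for a given multiplier vector z:

  ## Sketch: perturbation resampling; solve_SG is a hypothetical solver for (6.10).
  B <- 1000                                         # number of resamples
  boots <- t(replicate(B, solve_SG(z = rexp(N))))   # rexp: unit mean, unit variance
  Sigma_G_hat <- var(boots)                         # covariance of bootstrap estimates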

6.2.5 Pediatric Pain Tolerance Study

In this subsection, we use a real data set to illustrate the rank-based methods. The
data are from a pediatric pain tolerance study (Weiss et al., 1999) of 64 children aged
8–10 years. The children attended four trials in which they kept their arms in very
cold water for as long as possible. The response variable is the log2 of the time in
seconds that a child could tolerate keeping his or her arm immersed in the cold water.
Two trials were made during a first visit, and two more trials were made during a
second visit two weeks later. The children were classified as attenders or distracters
based on their coping responses during the first two trials, that is, according to
whether or not they focused on the experiment. Three baseline trials were given, and
the children were then randomly assigned to one of three counseling interventions
before the last trial: advice to attend to the experiment, advice to distract
themselves from the experiment, or no advice. This study aims at comparing the two
baseline groups and examining whether the treatment effects differ for attenders and
distracters. There are missing data for six children because of arms in casts and
similar reasons. We assume that the data are missing completely at random.

Figure 6.1 shows that the distract treatment appears to improve the performance of
distracters, whereas the attend treatment appears to decrease the pain tolerance time
of distracters. Figure 6.2 shows that girls have relatively longer pain tolerance
times than boys over the whole trial period, and there are some outliers, as pointed
out by Weiss et al. (1998). Figure 6.3 indicates strong correlations among the
observations. Therefore, robust and efficient rank methods are desirable for analyzing
this dataset.

Figure 6.1 Boxplots of the time in log2 seconds of pain tolerance in the four trials
of the pediatric pain study. The two row panels represent the attender and distracter
baseline groups, and the three column panels represent the three treatments (attend,
distract, and no directions).

Let y_ik be the log2 pain tolerance time of the kth trial for the ith subject, and
let B be a baseline-group indicator taking the value 1 for attenders and 0 for
distracters. The attend treatment is denoted by A, the distract treatment by D, and
the no-advice treatment by F. Let A = 1 for the attend treatment and A = 0 otherwise,
Figure 6.2 Boxplots of the time in log2 seconds of pain tolerance for girls and boys
in the four trials of the pediatric pain study.

D = 1 for the distract treatment and D = 0 otherwise, and F = 1 for the no-intervention
treatment and F = 0 otherwise. G is a gender indicator taking the value 1 for girls and
0 for boys. We assume that the data are missing completely at random and further
consider the following model:

  y_ik = β_0 + β_1 B_i + (β_2 A_ik + β_3 D_ik + β_4 F_ik) B_i
       + (β_5 A_ik + β_6 D_ik + β_7 F_ik)(1 − B_i) + β_8 G_i + ε_ik,

where β_0 is the mean of the distracter group (baseline) and β_1 indicates the
difference between the attender and distracter groups. The parameters (β_2, β_3, β_4)
correspond to the three treatment effects for the attender group, and (β_5, β_6, β_7)
correspond to the three treatments for the distracter group. The standard errors for
the estimates obtained by Jung and Ying's (JY) method and Fu and Wang's (FW) method
are based on 3000 resampling estimates drawn from an exponential distribution with
unit variance.
Parameter estimates and their standard errors obtained from the different methods
are given in Table 6.1. The results obtained via the GEE method depend on the choice
of the working correlation matrix. Compared with the GEE method, all the rank-based
methods indicate that the distract treatment given to distracters will help distracters

Figure 6.3 Scatterplots of the time in log2 seconds of pain tolerance for the four
trials in the pediatric pain study. The diagonal panels show the densities of the
responses in the four trials; the pairwise correlations between trials range from
0.60 to 0.84. The plots were produced by ggpairs in GGally.

increase their pain tolerance (the estimate of β_6 is significant), which may have
implications for medical treatments involving painful procedures. In addition, all the
rank-based methods, together with the GEE with an exchangeable working matrix, indicate
that girls have much stronger pain tolerance than boys, as Figure 6.2 suggests.

6.3 Quantile Regression


Various methods have been developed to evaluate covariate effects on the mean of
a response variable in longitudinal data analysis (Liang and Zeger, 1986; Qu et
al., 2000). To give a global assessment about covariate effects on the distribution
98 ROBUST APPROACHES

Table 6.1 Parameter estimates and their standard errors (SE) for the pediatric pain tolerance
study. JY: the method of Jung and Ying (2003); CM: the method of Wang and Zhu (2006); WZ:
the method by Wang and Zhao (2008); FW: the method by Fu and Wang (2012); GEEAR(1) :
the GEE method with an independence working matrix; GEEEX : the GEE method with an
exchangeable working matrix; and GEEAR(1) : the GEE method with an AR(1) working matrix;
GEEUN : the GEE method with an unstructure working matrix. ∗ indicates that p-value is less
than 0.05.
JY CM WZ FW GEEIN GEEEX GEEAR(1) GEEUN
β1 -0.57∗ -0.47∗ -0.57∗ -0.59∗ -0.60∗ -0.59∗ -0.71 -0.65∗
(SE) (0.18) (0.24) (0.21) (0.18) (0.24) (0.25) (0.24) (0.24)
β2 0.31 0.24 0.34 0.27 0.11 0.07 0.11 0.21
(SE) (0.22) (0.22) (0.23) (0.22) (0.18) (0.13) (0.14) (0.23)
β3 -0.002 -0.06 0.03 0.15 0.04 0.04 0.13 -0.07
(SE) (0.23) (0.28) (0.25) (0.26) (0.25) (0.29) (0.29) (0.24)
β4 -0.25 -0.11 -0.25 -0.38 -0.11 -0.11 -0.01 -0.16
(SE) (0.33) (0.21) (0.26) (0.26) (0.12) (0.13) (0.13) (0.25)
β5 -0.36 -0.28 -0.39 -0.43∗ -0.35∗ -0.27∗ -0.10 -0.39
(SE) (0.28) (0.20) (0.26) (0.19) (0.16) (0.13) (0.22) (0.27)
β6 0.95∗ 0.87∗ 0.80∗ 1.07∗ 0.57 0.55 0.66 0.87∗
(SE) (0.43) (0.36) (0.42) (0.37) (0.30) (0.35) (0.31) (0.35)
β7 -0.69∗ -0.60∗ -0.71∗ -0.54∗ -0.48∗ -0.31∗ -0.38 -0.81∗
(SE) (0.33) (0.27) (0.35) (0.25) (0.20) (0.13) (0.16) (0.31)
β8 0.46∗ 0.40∗ 0.47∗ 0.59∗ 0.44 0.50∗ 0.59 0.43
(SE) (0.17) (0.22) (0.21) (0.16) (0.24) (0.23) (0.21) (0.23)

of the response variable, we will introduce a quantile regression model proposed


by Koenker and Bassett (1978), which can characterize a particular point of a
distribution and hence provide more complete description of the distribution. Fur-
thermore, quantile regression is robust against outliers and does not require specify-
ing any error distribution.
Assume that the 100τth percentile of yik is XikT βτ , where βτ is a p × 1 unknown
parameter vector. The conditional quantile functions of the response yik is:

Qτ (yik |Xik ) = XikT βτ ,

let ik = yik − XikT βτ be a continuous error term satisfying p(ik ≤ 0) = τ and with
an unspecified density function fik (·). The median regression is obtained by taking
τ = 0.5.

6.3.1 An Independence Working Model


Let sik = τ − I(yik − XikT βτ ≤ 0), where I(·) is an indicator function. Because i1 , . . . , ini
are correlated, si1 , . . . , sini are also correlated for i = 1, . . . , N. However, their corre-
lation structure is more complex. Under the independence working model assump-
tion, Chen et al. (2004) proposed using the following estimating functions to make
QUANTILE REGRESSION 99
inferences about βτ :
ni
N X
X
Uτ (β) = Xik sik . (6.11)
i=1 k=1

The resulting estimates β̂τI from (6.11) can be also derived by minimizing the
following loss function
ni
N X
X
Lτ (β) = ρτ (yik − XikT β), (6.12)
i=1 k=1

where ρτ (u) = u{τ − I(u ≤ 0)} (Koenker and Bassett, 1978). Koenker and D’Orey
(1987) developed an efficient algorithm to optimize Lτ (β), which is available in the
R package quantreg.
Using a similar argument given in Chamberlain (1994) for the case of indepen-
dent observations, the asymptotic distribution of N 1/2 (β̂τI − βτ ) is a normal distri-
bution N(0, A−1 −1
τ var{U τ (βτ )}Aτ ) as N → ∞, where Aτ is the expected value of the
derivative of Uτ (βτ ) with respect to βτ . It is difficult to estimate the covariance ma-
trix because Aτ may involve the unknown density functions.
Resampling method can be used to approximate the distribution of without in-
volving any complicated and subjective nonparametric functional estimation. Let
N
X ni
X
L̃τ (βτ ) = zi ρτ (yik − XikT βτ ), (6.13)
i=1 k=1

N is a random sample from the standard normal population. It is straight-


where {zi }i=1
forward to show that the unconditional distribution of N 1/2 (β̂τI − βτ ) can be approxi-
mated by the conditional distribution of N 1/2 (β̃τ − βτ ) given the data (yik , Xik ), where
β̃τ is the estimator obtained from (6.13). The covariance matrix of β̂ can be estimated
by the empirical covariance matrix of β̃τ , for example, {β̃τi , i = 1, . . . , m} is a sequence
estimates of βτ , and the empirical covariance matrix m ¯ ¯ T
i=1 (β̃τi − β̃τ )(β̃τi − β̃τ ) /m,
P
¯
where β̃τ is the sample mean of {β̃τi , i = 1, . . . , m}.
The estimating functions Uτ (β) are based on the independence working model
assumption, hence the efficiency of β̂τI derived from Uτ (β) can be enhanced if the
within correlations are incorporated.

6.3.2 A Weighted Method Based on GEE


In this section, we will utilize Gaussian copulas to characterize the within correla-
tion (Fu and Wang, 2016), which can describe various correlation structures, and
construct estimating functions via the GEE method.

6.3.3 Modeling Correlation Matrix via Gaussian Copulas


It is difficult to specify and model the underlying correlation structure of si1 , . . . , sini
for i = 1, . . . , N. We model the correlation via Gaussian copulas. Assume that Fi (·)
100 ROBUST APPROACHES
is the joint cumulative distribution function of i = (i1 , . . . , ini )T with marginal dis-
tributions Fi1 (·), . . . , Fini (·) for i = 1, . . . , N. Therefore, Uk = Fik (ik ) are uniformly
distributed random variables on [0,1]. For given u1 , . . . , uni , we have
Fi (u1 , · · · , uni ) = P(i1 ≤ u1 , · · · , ini ≤ uni )
= P(Fi1 (i1 ) ≤ Fi1 (u1 ), · · · , Fini (ini ) ≤ Fini (uni ))
= P(U1 ≤ Fi1 (u1 ), · · · , Uni ≤ Fini (uni ))
def
= C(Fi1 (u1 ), · · · , Fini (uni )),
where C is a copula, a multivariate distribution on [0, 1]ni with standard uniform
marginal functions. According to Sklar’s Theorem (Sklar, 1959), if Fi1 (·), · · · , Fini (·)
are absolutely continuous, C is uniquely determined on [0, 1]ni by Fi (·). Furthermore,
−1 (·) be the inverse function of F (·), and then the copula function can be rewrit-
let Fik ik
ten as C(u1 , · · · , uni ) = Fi (Fi1−1 (u ), . . . , F −1 (u )).
1 ini ni
Let ρkl be the correlation coefficient of sik and sil , and C be a Gaussian copula.
Then we have
1 h i
ρkl = P( ik ≤ 0,  il ≤ 0) − τ 2
(τ − τ2 )
1 h i
= P (Fik (ik ) ≤ Fik (0), Fil (il ) ≤ Fil (0)) − τ2
(τ − τ )
2

1 h i
= C(F ik (0), F il (0)) − τ2
(τ − τ2 )
Ckl (τ, τ) − τ2 Φ2 (Φ−1 (τ), Φ−1 (τ); ρkl ) − τ2
= = .
(τ − τ2 ) (τ − τ2 )
where Φ2 (·, ·; γkl ) denotes the standardized bivariate normal distribution with corre-
lation coefficient γkl = corr(ik , il ). Specifically, when γkl = 0, then ik and il are
independent, and hence we have Ckl (τ, τ) = τ2 and ρkl = 0, which indicates that sik
and sil are uncorrelated. When τ = 0.5, then Ckl (τ, τ) = 1/4 + (2π)−1 arcsin(γkl ), and
hence ρkl = 2/π arcsin(γkl ).
To specify the correlation matrix Ri of (si1 , . . . , sini ) for i = 1, . . . , N, we need to
specify the correlation structure of i = (i1 , . . . , ini )T . When the correlation structure
of i is exchangeable, that is p(ik ≤ 0, il ≤ 0) is a constant δ, for any k , l, the corre-
lation coefficient of sik and sil equals ρ = (δ − τ2 )/(τ − τ2 ), and hence the correlation
matrix of S τi = (si1 , . . . , sini )T is Rτi = (1 − ρ)Ini + ρJni , where Ini is the ni × ni identity
matrix, and Jni is an ni × ni matrix of 1s. Therefore, the correlation structure of S τi is
exchangeable. Similarly, when the correlation structure of i is MA(1), Ri is MA(1).
When the correlation structure of i is AR(1) or toeplitz matrix, then Ri is a toeplitz
matrix. Therefore, we can construct various correlation structures for Ri via Gaussian
copulas.

6.3.3.1 Constructing Estimating Functions


Let Xi = (Xi1 , . . . , Xini )T . To incorporate the within correlation and variation of
the number of repeated measurements for each subject, Fu and Wang (2012) and
QUANTILE REGRESSION 101
Fu et al. (2015) considered the following weighted estimating functions via the GEE
method:
N
X N
X
UGτ (βτ ) = XiT Vτi−1 S τi = XiT A−1/2
i R−1 −1/2
i Ai S τi ,
i=1 i=1

where Ai = diag{τ−τ2 , · · · , τ−τ2 } is a diagonal matrix, and Ri is a working correlation


matrix. When Ri is an exchangeable matrix, the inverse matrix can be written as

ρ1ni 1Tni 


 
1 
R−1 = nI −
(1 − ρ)  i 1 + (ni − 1)ρ 
i


If there is no correlation, ρ = 0 and R−1i = Ini . In this case, UGτ (βτ ) is equivalent to
Uτ (β).
Suppose that β̂Gτ is the resulting estimator from UGτ (βτ ). Under some regularity
conditions, we can prove that N −1/2 Uτ (βτ ) → N(0, VU ), where
N
X
VU = lim N −1 XiT Vi−1 Cov(S i )Vi−1 Xi .
N→∞
i=1

Furthermore, β̂Gτ is a consistent estimator of βτ for a given τ, and N(β̂Gτ − βτ ) →
N(0, ΣGτ ), where
 N 
X 
ΣGτ = lim NDτ (β)  Xi Vτi Cov(S τi )Vτi Xi  {D−1
−1  T −1 −1
τ (β)} ,
T
N→∞
i=1

where Dτ (β) = i=1 XiT Vτi−1 Λi Xi , and Λi is an ni × ni diagonal matrix with the
PN
k-th diagonal element fik (0). The covariance matrix Cov(S τi ) is unknown and can
be estimated empirically. These asymptotic properties can be derived according to
Jung (1996) and Yin and Cai (2005).

6.3.3.2 Parameter and Covariance Matrix Estimation


It is difficult to estimate the covariance matrix of parameter estimators in a quantile
regression because it involves the unknown error density functions. The resampling
method can be used to estimate the covariance matrix which has been described in
subsection 6.3.1. However, this method often adds computation burdens.
The induced smoothing method proposed by Brown and Wang (2005) can be
extended to the quantile regression (Wang et al., 2009, Fu and Wang, 2012), which
can estimate parameters and their covariance simultaneously and bypass estimating
the distribution function. Assume Z ∼ N(0, I p ), and approximate β̂Gτ by βτ + Γ1/2 Z,
where Γ = O(1/N) is a positive definite matrix. The smoothed estimating functions of
UGτ (βτ ) can be given as ŨGτ (βτ ) = EZ [UGτ (βτ + Γ1/2 Z)], where expectation is over
102 ROBUST APPROACHES
Z. The smoothed estimating function is:
   
 yi1 −X T βτ 
 τ − 1 + Φ  q T i1
  

 Xi1 ΓXi1 
N
..
X  
ŨGτ (βτ ) = EZ [UGτ (βτ + Γ1/2 Z)] = XiT Vi−1  .
 ,

  
i=1   yini −XinT βτ  
 τ − 1 + Φ  q i  
Xin ΓXini
 T 
i

where Φ(·) is the standard normal cumulative distributed function. Because ŨGτ (βτ )
are smoothing functions of βτ , we can calculate ∂ŨGτ (βτ )/∂βτ that can be used as an
approximation of Dτ . Let
N
∂ŨGτ (βτ ) X
D̃τ (βτ ) = =− XiT Vi−1 Λ̃i Xi ,
∂βτ i=1

where Λ̃i is an ni × ni diagonal matrix with the kth diagonal element σ−1
ik φ((yini −
q
T β )/σ ), where φ(·) is the standard normal density function, and σ =
Xin XikT ΓXik .
i
τ ik ik
In general, the resulting estimator β̃Gτ from Ũτ (βτ ) and its covariance matrix can
be obtained by iteration. Taking the exchangeable correlation working structure as
an example, we give the explicit stepwise procedures for the algorithm:
Step 1: Let the estimator obtained from equation (6.12) be an initial estimate, that is
β̃(0)
τ = β̂τI , and Γ
(0) = I /N.
p
(k−1)
Step 2: Given β̃τ and Γ(k−1) from the k − 1 step, and let ˆil = yil − XilT β̂(k−1)
τ . Obtain
(k−1)
δ̂τ by
PN Pni Pni
i=1 k=1 l,k I(ˆ ik ≤ 0, ˆil ≤ 0)
δ̂ (k−1)
= PN .
i=1 ni (ni − 1)

Step 3: Update β̃(k)


τ and Γ
(k) by

β̃(k) (k−1)
τ = β̃τ + {−D̃τ (δ̂(k−1) , β̃(k−1)
τ , Γ(k−1) )}−1 ŨGτ (δ̂(k−1) , β̃(k−1)
τ , Γ(k−1) ),
(k−1) (k) (k−1)
Γ(k) = D̃−1
τ (δ̂ , β̃τ , Γ )VU (δ̂(k−1) , β̃(k) −1 (k−1) (k) (k−1)
τ ) D̃τ (δ̂ , β̃τ , Γ ).

Step 4: Repeat the above iteration Steps 2-3 until the algorithm converges.
The finial values of β̃τ and Γ can be taken as the smoothed estimators of
β̂Gτ and its covariance matrix, respectively. Under some regularity conditions,
N −1/2 {ŨGτ (βτ ) − UGτ (βτ )} = o p (1), and the smoothing estimator β̃Gτ → β0 in prob-

ability, and N(β̃Gτ − βτ ) converges in distribution to N(0, ΣGτ ). Therefore, the
smoothed and unsmoothed estimating functions are asymptotically equivalent uni-
formly in β, and the limiting distributions of the smoothed estimators coincide with
the unsmoothed estimators. The induced smoothing method can be also used to the
rank-based methods introduced in Section 2. More details can be found in Fu et al.
(2010).
QUANTILE REGRESSION 103
6.3.4 Working Correlation Structure Selection
Suppose that there are J working correlation matrix candidates via Gaussian copulas:
j
Ri , j = 1, . . . , J. Parameter estimates of βτ can be obtained by solving J estimating
equations:
N
j −1
X
XiT A−1/2
i [Ri ] A−1/2
i S i (β) = 0, j = 1, . . . , J. (6.14)
i=1

Appropriate specification of Ri can improve the estimation efficiency of βτ . Wang


and Carey (2011) proposed a Gaussian pseudolikelihood criterion to select a work-
ing covariance model for marginal mean regression models. This criterion can be
extended to select a working correlation structure in marginal quantile regression.
Let dim(θ̂) be the effective dimension of θ but excluding the β̂ components being
zero. Substitute the pseudolikelihood Gaussian likelihood for parametric likelihood
in AIC and BIC and obtain versions of AIC and BIC,

GAIC = −2l(θ̂) + 2dim(θ̂),


GBIC = −2l(θ̂) + log(m)dim(θ̂),

where l(θ) is pseudolikelihood Gaussian likelihood for S i with unknown parameters


θ including βτ and correlation parameters ρ in Ri , and is given as
N
1 Xh i
l(θ) = − log |2πVi | + S iT Vi−1 S i .
2 i=1

For a given set of correlation structures, we obtain estimates of βτ and ρ and then
choose the final correlation structure corresponding to the minimum value of GAIC
or GBIC.
There are J × p equations in (6.14) based on J different working correlation matri-
ces. Therefore, the number of equations is larger than the number of parameters. We
can use the quadratic inference function (QIF) method proposed by Qu et al. (2000)
and the empirical likelihood method to combine these equations. More details can be
found in Leng and Zhang (2014) and Fu and Wang (2016).

6.3.5 Analysis of Dental Data


In this subsection, we will illustrate the method by a set of growth dental data for
11 girls and 16 boys. For each child, the distance (millimeter) from the center of the
pituitary gland to the pteryomaxillary fissure was measured every two years at ages
8, 10, 12, and 14. What of interest is the relation between the recorded distance and
age. Figures 6.4–6.5 indicate that the distance is linearly related to age. Furthermore,
the boys’ distances are larger than girls’ at the same age.
We use quantile regression to model the dental data at τ = 0.25, 0.5, 0.75, and
τ = 0.95,
Qτ (yik ) = β0 + β1 Genderi + β2 Ageik + β3 Genderi ∗ Ageik ,
104 ROBUST APPROACHES

8 9 10 11 12 13 14

Boy Girl

30
Distance (mm)

25

20

8 9 10 11 12 13 14

Age (year)

Figure 6.4 The distance versus the measurement time for boys and girls.

where yik and Ageik are the distance and age for the i-th subject at time k, respec-
tively, and Genderi takes −1 for girls and 1 for boys. Parameters β0 and β1 denote
the intercept and slope of the average growth curve for the entire group. Parameters
β1 and β3 denote the deviations from this average intercept and slope for the group
of girls and the group of boys, respectively. Therefore, the intercept and slope of
the average growth curve for girls are indicated by β0 − β1 and β2 − β3 , respectively.
Equivalently, the intercept and slope of the average growth curve for boys are given
by β0 + β1 and β2 + β3 , respectively.
The parameter estimates and their standard errors are presented in Table 6.2. The
results indicate that β2 is significant at three different quantiles, which indicates that
there is a linear relationship between the distance and age. The parameter β1 is not
significant, which indicates that, on average, neither girls nor boys differ significantly
concerning their initial dental distance. The parameter estimate of β3 is positive and
significant at τ = 0.25 but not significant at τ = 0.5, 0.75, and 0.95, which indicates
that boys with low distances (bottom 25%) increase faster over time than girls at
τ = 0.25, and there is no significant difference between boys’ and girls’ distance
increment at τ = 0.5 and 0.75. The proposed criteria have the lowest values when
correlation structure is exchangeable at τ = 0.25, 0.75, and 0.95. However, the GAIC
and GBIC criteria reach the minimum values when correlation structure is AR(1) at
τ = 0.5. This indicates that kids at top or bottom 25% have more steady correlations
OTHER ROBUST METHODS 105
Boy Girl
32

28
Distance (mm)

24

20

16
8 10 12 14 8 10 12 14
Age (year)

Figure 6.5 Boxplots of distances for girls and boys.

over time (hence exchangeable is appropriate) but correlations at medium decay fast
over the years which makes the AR(1) model appropriate. This plausible explana-
tions need to be tested on a much large dataset.

6.4 Other Robust Methods


The robust methods considered in the first two sections are based on linear models.
Next we will introduce two robust methods for analyzing generalized linear models
in longitudinal data. We consider the marginal model of yik , which satisfies the first
two marginal moment conditions: µi j = g(XiTj β) and σ2i j = φν(µi j ), where g(·) and ν(·)
are specified functions, and φ is a scale parameter.
Fan et al. (2012) and Lv et al. (2015) proposed the following robust generalized
estimating equations:
N
X
UR (β) = DTi Mi−1 hi (µi (β)) = 0, (6.15)
i=1

where Di = ∂µi /∂β is an ni × p matrix, Mi = Ri (α)A1/2 i , in which Ri (α) is a working


correlation matrix and Ai = φdiag(ν(µi1 ), · · · , ν(µini )), and hi (µi (β)) = Wi [ψ(µi (β)) −
Ci (µi (β))] with ψ(µi (β)) = ψ(A−1/2
i (Yi −µi )) and Ci (µi (β)) = E(ψ(µi (β))), which is used
106 ROBUST APPROACHES

Table 6.2 Parameter estimates (Est) and their standard errors (SE) of estimators β̂I , β̂EX ,
β̂AR and β̂ MA obtained from estimating equations with independence, exchangeable, AR(1),
and MA(1) correlation structures, respectively, and frequencies of the correlation structure
identified using GAIC and GBIC criteria for the dental dataset.

τ = 0.25
β0 β1 β2 β3 GAIC GBIC
Est (Sd) Est (Sd) Est (Sd) Est (Sd)
β̂I 20.690 (0.299) 0.513 (0.299) 0.543 (0.063) 0.175 (0.063) -64.790 -59.606
β̂EX 20.690 (0.299) 0.513 (0.299) 0.543 (0.063) 0.175 (0.063) -69.584 -63.105
β̂AR 20.726 (0.298) 0.542 (0.298) 0.540 (0.063) 0.172 (0.063) -65.797 -59.318
β̂ MA 20.718 (0.299) 0.547 (0.299) 0.541 (0.064) 0.173 (0.064) -65.189 -58.709
τ = 0.50
β0 β1 β2 β3 GAIC GBIC
Est (Sd) Est (Sd) Est (Sd) Est (Sd)
β̂I 21.877 (0.348) 0.729 (0.348) 0.591 (0.078) 0.095 (0.078) -33.720 -28.536
β̂EX 21.877 (0.348) 0.729 (0.348) 0.591 (0.078) 0.095 (0.078) -62.332 -55.852
β̂AR 21.918 (0.306) 0.795 (0.306) 0.595 (0.081) 0.088 (0.081) -64.971 -58.492
β̂ MA 21.896 (0.307) 0.803 (0.307) 0.599 (0.085) 0.078 (0.085) -56.289 -49.810
τ = 0.75
β0 β1 β2 β3 GAIC GBIC
Est (Sd) Est (Sd) Est (Sd) Est (Sd)
β̂I 23.306 (0.561) 0.704 (0.561) 0.657 (0.094) 0.133 (0.094) -72.790 -67.606
β̂EX 23.306 (0.561) 0.704 (0.561) 0.657 (0.094) 0.133 (0.094) -107.511 -101.032
β̂AR 23.390 (0.592) 0.802 (0.592) 0.660 (0.093) 0.119 (0.093) -101.206 -94.726
β̂ MA 23.344 (0.593) 0.817 (0.593) 0.660 (0.094) 0.106 (0.094) -88.079 -81.600
τ = 0.95
β0 β1 β2 β3 GAIC GBIC
Est (Sd) Est (Sd) Est (Sd) Est (Sd)
β̂I 25.719 (0.525) 1.335 (0.525) 0.740 (0.065) 0.068 (0.065) -22.167 -16.984
β̂EX 25.719 (0.525) 1.334 (0.525) 0.740 (0.065) 0.068 (0.065) -27.287 -20.808
β̂AR 25.698 (0.520) 1.327 (0.520) 0.749 (0.068) 0.070 (0.068) -20.386 -13.907
β̂ MA 25.719 (0.525) 1.334 (0.525) 0.740 (0.065) 0.068 (0.065) -20.166 -13.687

to ensure the Fisher consistency of the estimate. The matrix Wi = diag(wi1 , · · · , wini )
is a weighted diagonal matrix to downweight the effect of leverage points in the
covariates.

6.4.1 Score Function and Weighted Function


The ψ(·) function is selected to downweight the influence of outliers in the response
variable, and a common selection is robust Huber’s score function ψH (e) = e for |e| ≤ c
and ψH (e) = csign(e) for |e| ≥ c, where c is a tuning constant and is usually selected
between 1 and 2. Assume that the cumulative distribution function of e is Fe (·), then
we can calculate E(ψH (e)) = cE(eI(|e| ≤ c)) + c(1 − F(c) + F(−c)). Therefore, Ci is
exactly equal to zero for symmetric distributions.
OTHER ROBUST METHODS 107
Another selection of the ψ(·) function is the score function derived from the ex-
ponential squared loss function ϕ(e) = 1 − exp{−e2 /γ}, where γ > 0 determines the
degree of robustness of the estimation (Wang et al., 2013). If γ is large, we have
ϕ(e) = 1 − exp{−e2 /γ} ≈ e2 /γ, hence the exponential squared loss is similar to the
least squares loss function for large γ. Let the ψ(·) function be e/γ exp(−e2 /γ), which
is based on the first derivative of ϕ(e). Then the exponential squared loss function is
continuous and differentiable, and hence the calculation based on the exponential
squared loss function may be easier than that based on the Huber’s function.
For the weighted function wi j , the commonly used is Mallows-based weighted
function, that is,
  κ/2 
 d  
wi j = min  ,

  
1,

 (Xi j − µ̂ x )T S −1
x (Xik − µ̂ x )
 
  

where d and κ are tuning constants, and µ̂ x and S x are the robust estimates of the
location and covariance of xik (Rousseeuw and van Zomeren, 1990). One can use
high breakdown point location and scatter estimators, such as Minimum Covariance
Determinant and Minimum Volume Ellipsoid. For the tuning parameters κ and d, we
can use κ = 2 and c = χ20.95 (p), which is the 0.95 quantile of a χ2 (p) distribution.
Note that if ψ(e) = e and wi j = 1 for j = 1, . . . , ni , i = 1, . . . , N, UR (β) is the gen-
eralized estimating equations introduced in Chapter 4, which can be used for data
without outliers.

6.4.2 Main Algorithm


The equation (6.15) must be solved by iterative methods. Hence we need to estimate
the scale parameter φ and the correlation parameter α and then specify the initial
estimate of β. If we select ψ(e) = e, wi j = 1, and Ri = I, then (6.15) no longer depends
on the scale parameter and the correlation parameter. Therefore, the solution β̂(0) of
(6.15) in this special case can take as an initial estimate of β. The value of β̂(0) can be
obtained using the functions geese or gee in the packages geese and geepack in the
statistical software R.
Liang and Zeger (1986) proposed using the moment method to estimate φ and α
in the GEE, but the moment estimates are sensitive to the outliers. We can use the
following robust estimates of φ and α in (6.15). An robust estimate for φ is obtained
via the median absolute
q deviation φ̂ = [1.483 × median|êik − median{êik }|]2 , where
êik = (yik − µik (β̂))/ ν(µik (β̂)) are the Pearson residuals and β̂ is the current estimate
of β. For binary responses, φ = 1. An robust estimate of α can be obtained by the
moment method (Wang et al. 2005),
N
1 X 1 X
α̂ = ψ(eik )ψ(eik+1 )
NH i=1 ni − 1 k≤n −1
i
108 ROBUST APPROACHES
for the AR(1) working correlation matrix and
N
1 X 1 X
α̂ = ψ(eik )ψ(eil )
NH i=1 ni (ni − 1) k,l

for the exchangeable working correlation matrix, where H is the median of


{ψ(eik ), k = 1, . . . , ni ; i = 1, . . . , N}.
With an initial estimate of β, φ, and α, we solve (6.15) to find the estimate of β
using the Fisher’s scoring iterative procedure:
 N −1 N
X  X
β̂(k+1) = β̂(k) −  DTi Σi Di  DTi Mi−1 hi (µi (β)) (6.16)

i=1 i=1 β=β̂(k)

where Σi = Mi−1 Γi and Γi = E(∂hi (µi (β))/∂µi ) for i = 1, . . . , N.



Because the initial estimate is N consistent under the marginal model, the one-
step or the final estimator from (6.16) has the same asymptotic distribution. In prac-
tice, we may simply choose to use an asymptotically equivalent estimator by stopping
the iteration in a small number of steps without worrying about the convergence of
(6.16). The tuning parameters c in Huber’s function and γ in the exponential square
loss function can be given and then fixed in the iterative procedure. An alternative
way is using the data driven method to specify them, such as minimize the trace or
determination of the covariance matrix of the estimators. More details can be see
Wang et al. (2007) and Wang et al. (2013).

6.4.3 Choice of Tuning Parameters


The robust approach requires the specification of a loss function such as the Huber
and other loss functions, and more importantly, the associated tuning parameter that
determines the extent of robustness is needed. High robustness is often at the cost
of efficiency loss in parameter estimation. Therefore, it makes more sense to choose
a regularization (tuning) parameter according to the proportion of “outliers” in the
data. Wang et al. (2007) first proposed to choose the tuning parameter in the Huber
function by maximizing the asymptotical efficiency of the parameter estimation.
The data driven (or data dependent) approach leads to more efficient parameter
estimation because the regularization (tuning) parameter is so chosen to reflect the
proportion of outliers in the data.
More recently, this was extended to the exponential loss function and the gener-
alized liner models for time series with heterogeneity while accounting for possible
temporal correlations and irregular observations (Callens et al., 2020).
Another recent approach to determining the tuning parameters is the so called
working likelihood. This term was first proposed by (Wang and Zhao, 2007) for es-
timating the variance parameters while the correlation structure is misspecified. The
working likelihood function is deemed as a vehicle for efficient regression parameter
estimation while we recognize that the data may not be generated from this likelihood
function.
OTHER ROBUST METHODS 109
Treating the robust loss function as the negative log-likelihood, we can effectively
obtain an appropriate regularization parameter depending on the extend of contam-
ination in the data. Such constructed likelihood from the loss function of interest is
referred to as a working likelihood. (Fu et al., 2020) derived the score functions for
estimating the tuning parameters in the Huber function and bisquare functions.
Chapter 7

Clustered Data Analysis

7.1 Introduction
Suppose a population is divided into a number of groups. If any two subjects selected
at random from the same group are positively correlated then each group of subjects
form a cluster. Clustered data arise in varieties of practical data analytic situations,
such as, epidemiology, biostatistics and medical studies. A cluster may be a village in
an agricultural study, a hospital, a doctor’s practice, an animal giving birth to several
fetuses. For example, in a study concerned with an educational intervention program
on behavior change, the data are grouped into small classes (Lui et al., 2000).
In practice-based research, multiple patients are collected per clinician or per
practice. Each class or practice is a cluster. A group of genetically related members
from a familial pedigree is a cluster.

7.1.1 Clustered Data


Clustered data refers to a set of measurements collected from subjects that are struc-
tured in clusters. Responses from a group of genetically related members from a
familial pedigree constitute clustered data in which the responses are correlated. For
example, birth weight measurements of all fetuses born to all animals in a toxico-
logical study form a set of clustered data in which all measurements from the same
animal are correlated, but measurements from two different animals are independent.
Another well-known example of clustered data is small area estimation (e.g., Ran &
Molina, 2015), in which each small area is considered a cluster. Clustered data is
different from longitudinal data in which measurements are collected from each sub-
ject over time. Sometimes longitudinal data may be thought of as a special type of
clustered data by treating a subject as a cluster in which each subject’s responses over
time are correlated. In longitudinal data analysis, serial correlation within a subject’s
responses is commonly used, while equal pairwise within cluster correlation is used
in clustered data analysis. This pairwise within cluster correlation is referred to as
intracluster correlation.

DOI: 10.1201/9781315153636-7 111


112 CLUSTERED DATA ANALYSIS
7.1.2 Intracluster Correlation
A major issue in the analysis of clustered data is that observations within a cluster
are not independent, and the degree of similarity is typically measured by the intr-
acluster correlation coefficient ρ or ICC. Ignoring the intraclass correlation in the
analysis could lead to incorrect inference procedures, such as incorrect p-values in
hypothesis testing, confidence intervals that are too small, biased estimates and effect
sizes, all of which can lead to incorrect interpretation of results related intraclass cor-
relation coefficients or effects in designed experiments. For recent reviews of intra-
class correlation see Zyzanski et al. (2004) and Galbraith et al. (2010). The intraclass
correlation coefficient has a long history of application in several different fields of
research. In research in epidemiology, it is commonly used to measure the degree
of familial resemblance with respect to biological or environmental characteristics,
and in genetics it plays a central role in estimating the heritability of selected traits
in animal and plant populations. In family studies it is frequently used to measure
the degree of intra-family resemblance with respect to characteristics such as blood
pressure, weight and height. In psychology, it plays a fundamental role in reliability
theory, where observations may be collected on one or more sets of judges or asses-
sors. Another area of application is in sensitivity analysis, where it may be used to
measure the effectiveness of an experimental treatment (see Schumann and Bradley,
1957 and Donner, 1986).
Note, this is not ordinary correlation coefficient between two random variables.
It is the correlation between pairs of values within the same family. How do we
calculate the intraclass correlation coefficient? We illustrate in the context of familial
correlation in what follows.
Suppose we want to find correlation between heights of brothers of N families in
a village in which each family has ni brothers, i = 1, . . . , N. Let the heights of brothers
in the N families be x11 , . . . , x1n1 ; . . . ; xN1 , . . . , xNnN . The interclass correlation coeffi-
cient ρ is defined as the ordinary correlation coefficient between any two observations
xi j and xil in the same group. As there are M = i=1
PN
ni (ni − 1) pairs of values in the
population, the population mean µ and variance σ2 are defined as
N i n
1 X X
µ= (ni − 1) xi j
M i=1 j=1

and
N i n
1 X X
σ2 = (ni − 1) (xi j − µ)2 .
M i=1 j=1

The covariance between pairs of values is defined as


N ni
1 XX
ν11 = E(Xi j − 1)(Xil − 1) = (xi j − µ)(xil − µ).
M i=1 j,l=1
ANALYSIS OF CLUSTERED DATA: CONTINUOUS RESPONSES 113
This, after simplification, can be written as
 
N ni
N X
1 X 2 X 
υ11 =  ni (µi − µ)2 − (xi j − µ)2  ,
M i=1 i=1 j=1

Pni
where µi = 1
ni j=1 xi j . The intraclass correlation coefficient is defined as

i=1 ni (µi − µ) − i j (xi j − µ)


PN 2 2 P P 2
E(Xi j − 1)(Xil − 1) υ11
ρ= p = = .
Var(Xi j )Var(Xil ) σ2
Pni
i=1 (ni − 1) j=1 (xi j − µ)
PN 2

Note, υ11 is the covariance between any two observations within the same cluster for
all clusters, Var(Xi j ) = Var(Xil ) = σ2 .
Example 1. As an example, consider the data in Kendall and Stuart (1979, p322).
Data on heights in inches of three brothers in five families are 74, 70, 72; 70, 71, 72;
71, 72, 72; 68, 70, 70; 71, 72, 70.
For these data M = 30, n1 = n2 = n3 = n4 = n5 = 3, 3j=1 x1 j = 211, 3j=1 x2 j = 213,
P P

j=1 x3 j = 215, j=1 x34 j = 208, j=1 x5 j = 216, µ1 = 70.33, µ2 = 71, µ3 = 71.66,
P3 P3 P3
µ4 = 69.33, and µ5 = 72. Using these we obtain
N ni 3
1 X X X
µ= (ni − 1) xi j = 70.86, (x1 j − µ)2 = 5.4988,
M i=1 i=1 j=1

3
X 3
X
(x2 j − µ)2 = 2.0588, (x3 j − µ)2 = 2.6188,
j=1 j=1

3
X 3
X
(x4 j − µ)2 = 9.6588, (x5 j − µ)2 = 5.8988.
j=1 j=1

Then,
N i n
1 X X 2 ∗ 25.734
σ2 = (ni − 1) (xi j − µ)2 = = 1.7156
M i=1 j=1
30

and
2 PN Pni
i=1 ni (µi − µ) − i=1 j=1 (xi j − µ)
PN 2 2
41.229 − 25.734
ρ= Pni = = 0.301.
i=1 (ni − 1) j=1 (xi j − µ)
PN 2 2 ∗ 25.724

7.2 Analysis of Clustered Data: Continuous Responses


7.2.1 Inference for Intraclass Correlation from One-way Analysis of Variance
Inference procedures for ρ, as first noted by Fisher (1925), are closely related to the
more general problem of inference procedures for variance components. Variance
114 CLUSTERED DATA ANALYSIS
components estimation is not the subject in this chapter, so we will briefly mention
what are variance components in the context of estimating the interclass correlation.
For details regarding variance components we refer the reader to Sahai (1979),
Shrout and Fleiss (1979), Khuri and Sahai (1985), Sahai et al. (1985), Griffin and
Gonzalez (1995), and Searle et al. (2009).
Suppose that we have data on L groups or families, the ith family having ni ob-
servations. Let yi j be the observation on the jth member of the ith family. Then, yi j
can be represented by a random effects model

yi j = µ + ai + ei j , (7.1)

where µ is the grand mean of all the observations in the population, {ai } is the random
effect of the ith family, and {ei j } is the error in observing yi j . The random effects {ai }
are identically distributed with mean 0 and variance σ2a , the residual errors ei j are
identically distributed with mean 0 and variance σ2e , and the {ai }, {ei j } are completely
independent. The variance of yi j is then given by

σ2 = σ2a + σ2e ,

and the covariance between yi j and yil (l , j) is

Cov(µ + ai + ei j , µ + ai + eil ) = Var(ai ) = σ2a .

The intraclass correlation ρ is defined as

ρ = σ2a /(σ2a + σ2e ) = σ2a /σ2 .

This implies that Cov(yi j , yil ) = σ2 ρ. The quantities σ2a and σ2e are called the variance
components. The above random effects model can simply be written as

yi = (yi1 , yi2 , . . . , yini )T ∼ N(νi , Σi ), (7.2)

where N(, ) stands for a multivariate normal density, νi = (µ, . . . , µ)T is a vector of
length ni , and
Σi = σ2 {(1 − ρ)Ii + ρJi }
is a ni × ni matrix, Ii denotes a ni × ni identity matrix, Ji a ni × ni matrix containing
only ones.
Let n0 = (N ∗ − i=1 ni /N )/(L − 1), where N ∗ = i=1
PN 2 ∗ PN
ni is the total number of
observations and p is the number of families studied, SSB = i=1
PN
ni (yi j − ȳi. )2 , and
PN Pni
SSW = i=1 j=1 (yi j − ȳi. ) . Then, the following table shows the analysis of variance
2

corresponding to this model (see Donner, 1986)

Source Degrees of freedom Sum of squares Mean square E(MS )


Between families N − 1 SSB MSB n0 σ2a + σ2e
Within families ∗
N −N SSW MSW σ2e
ANALYSIS OF CLUSTERED DATA: CONTINUOUS RESPONSES 115
Now, an unbiased estimate of σ2e is MSW and that of n0 σ2a + σ2e is MSB. Solving
these two equations, an estimate of σ2a is σ̂2a = (MSB − MSW)/n0 and that of ρ is

MSB − MSW
ρ̂ = .
MSB + (n0 − 1)MSW

Example 2: Let n0 = (N ∗ − i=1 ni /N )/(L − 1), where N ∗ = i=1


PN 2 ∗ PN
ni is the total num-
ber of observations and L is the number of families studied, SSB = i=1
PN
ni (ȳi. − ȳ.. )2
PN Pni
following standard notation, and SSW = i=1 j=1 (yi j − ȳi. ) . Then, following table
2

on page 19 of Donner and Koval (1980) we have now, an unbiased estimate of σ2e is
MSW and that of n0 σ2a + σ2e is MSB. Solving these two equations, an estimate of σ2a
is σ̂2a = (MSB − MSW)/n0 and that of ρ is

MSB − MSW
ρ̂ = .
MSB + (n0 − 1)MSW
Now, consider a hypothetical data set of two judges testing 8 different wines assign-
ing scores from 0 to 9, yi j , i = 1, 2, j = 1, 2, ..., 8 as (4,2), (1,3), (3,8), (6, 4), (6,5),
(7,5), (8,7), (9,7). Find the intraclass correlation of the judges assigning the scores.
For these data we obtain MSB = 0.5625, MSW = 5.776, hence
MSB − MSW
ρ̂ = = −0.127
MSB + (n0 − 1)MSW

7.2.2 Inference for Intracluster Correlation from More General Settings


Donner and Koval (1980) deal with estimation of common interclass correlation in
samples of unequal size from a multivariate normal population with no covariances
in the framework of a random effects linear model (7.1).
Donner and Bull (1983) considered maximum likelihood estimation of a com-
mon interclass correlation in two independent samples drawn from multivariate nor-
mal populations. In particular they considered the situation with a different number
of classes per population but an equal number of observations per class within a
population.
Suppose we have mi classes of observations yi jk (k = 1, 2, . . . , ni ; j =
1, 2, . . . , mi ; i = 1, 2, . . . , L) with ni observations per class in the ith population. We
assume that the yi jk are distributed according to a multivariate normal distribution
about the same mean µi and same variance σ2i within the ith population, in such a
way that the observations, yi jk and yi jl , in the same class have a common correlation,
ρ.
Let Ii be a ni × ni identity matrix, Ji be a ni × ni unit matrix, Σi = {(1 − ρ)Ii + ρJi }
be a matrix of dimension ni × ni , and yi j = (yi j1 , yi j2 , . . . , yi jni )T . Then we may write

yi j ∼ N(νi , σ2i Σi ), (7.3)


116 CLUSTERED DATA ANALYSIS
where N(·) denotes the multivariate normal density, νi = (µi , µi , . . . , µi )T is a vector of
length ni . For L = 2 model (7.3) is the model considered by Donner and Bull (1983),
otherwise it is a generalization of their model for L groups or populations.
Rosner (1984) developed methods for performing multiple regression analyses
and multiple logistic regression analyses on ophthalmologic data with normally and
binomially distributed outcome variables, while accounting for the interclass corre-
lation between eyes. Here we deal with normally distributed outcome variables in
which we assume a nested data structure with L primary units of analysis, where
within the ith primary unit of analysis there are ti secondary units of analysis (or
subunits), i = 1, . . . , L. In ophthalmologic data the primary unit of analysis is a person
and the secondary units are the eyes with ti = 2.
Let yi j be the measure on the jth secondary unit of the ith primary unit, j =
1, . . . , ti , i = 1, . . . , L. Then Rosner (1984) considered the multiple regression model
p−1
X
yi j = β0 + βk xi jk + ei j ,
k=1

where variance of ei j is σ2 and covariance between ei j and eil is ρ, i = 1, · · · , L,


j , l = 1, . . . , ti . This model can conveniently be written as

Yi ∼ N(Xi β, σ2 Σi ), (7.4)

where Yi = (yi1 , . . . , yini )T , N follows the multivariate normal distribution, Xi is a ni × p


matrix of covariates, β is a p × 1 vector of regression coefficients and Σi = {(1 − ρ)Ii +
ρJi } is a matrix of dimension ni × ni .
Munoz, Rosner and Carey (1986) considered estimation of common interclass
correlation where the purpose was to study possible common effect of several inde-
pendent variables in a regression context after allowing for heterogeneous interclass
correlations. Their regression model is

yi j = Xi j β + ei j ,

where yi j is an ni j × 1 vector of values of the dependent variable of the individuals


belonging to the jth family of type i, Xi j is a ni j × p matrix of covariances, (p − 1)
is the number of covariates, β is the p × 1 vector of regression coefficients, and the
vectors of residuals ei j are assumed to be independent and to follow a multivariate
normal distribution with mean zero and covariance matrix σ2 Vi j , where Vi j is an
ni j × ni j matrix with 1 on the diagonal and ρi everywhere else, so that the interclass
correlations depend on the type of family. This model can conveniently be written as

yi j ∼ N(Xi j β, σ2 Vi j ), (7.5)

where Yi j = (yi j1 , . . . , yi jni j )T , Xi j is a ni j × p matrix of (p − 1) covariates, β is a p × 1


vector of regression coefficients and Vi j = {(1 − ρi )Ii j + ρi Ji j } is a matrix of dimension
ni j × ni j .
ANALYSIS OF CLUSTERED DATA: CONTINUOUS RESPONSES 117
Models (7.2) to (7.5) are all special cases of the generalized regression model
(Paul, 1990),

Yi j ∼ N(Xi j βi , σ2i Vi j ), (7.6)

where Yi j = (Yi j,1 , . . . , Yi jni j )T , Xi j is a ni j × p matrix of (p − 1) covariates, βi is a p × 1


vector of regression coefficients and Vi j = (1 − ρi )Ii j + ρi Ji j is a matrix of dimension
ni j × ni j .
Now, if we put σ2i = σ2 , βi = β = (β0 , β1 , · · · , β p−1 ) in Model (7.6), we obtain
model (7.5). In addition if we let ρi = ρ, we obtain model (7.4). Further, if we put
p = 1, ni j = ni , mi = 1, ρi = ρ, σ2i = σ2 , and βi = νi = ν in Model (7.6), we obtain
model (7.2). In Model (7.6), if we put p = 1, ρi = ρ, βi = νi , and ni j = ni , we obtain
model (7.3), where νi = (µi , µi , . . . , µi )T .

7.2.3 Maximum Likelihood Estimation of the Parameters


It is well known that maximum likelihood estimates of the parameters are obtained
by taking derivatives of the log likelihood with respect to the parameters of inter-
est, equating the resulting derivatives to zero, and finally solving these estimating
equations.
It can be seen that minus twice the log-likelihood for L independent samples from
Model (7.6) can be conveniently written as
X mi h
L X i
−2l = ni j log σ2i + (ni j − 1) log(1 − ρi ) + log{1 + (ni j − 1)ρi }
i=1 j=1
L Xmi 
 (Yi j − Xi j βi )T Ii j (Yi j − Xi j βi ) ρi (Yi j − Xi j βi )T Ji j (Yi j − Xi j βi ) 

X
+  −  .
i=1 j=1
(1 − ρi )σ2i (1 − ρi ){1 + (ni j − 1)ρi }σ2i

Then, following the outlines above the maximum likelihood estimates (or more ac-
curately, a solution to the ML equations) of βi and σ2i given ρi are
 −1  
mi
X mi
 X 
β̂i |ρi =  XiTj Wi j Xi j   XiTj Wi j Yi j  (7.7)
j=1 j=1

and
X S S i j − ρi ni j SSTi j ti−1
j
σ̂2i |ρi = , (7.8)
j
(1 − ρi )Ni j

respectively, where Ni = j ni j , Wi j is the inverse of Vi j with (ti j − ρi )/{(1 − ρi )ti j } on


P
the diagonal and −ρi /{(1 − ρi )ti j } everywhere else, ti j = 1 + (ni j − 1)ρi ,

SSi j = (Yi j − Xi j βi )T Ii j (Yi j − Xi j βi ),

and
SSTi j = (Yi j − Xi j βi )T Ji j (Yi j − Xi j βi )/ni j .
118 CLUSTERED DATA ANALYSIS
Further equating −2∂l/∂ρi to zero, and after some algebra the estimating equation
for ρi , is
Ni−1 mj=1 [SSi j − ni j SSTi j ti−2j 1 + (ni j − 1)ρi ]
2 mi
P i
X
Pmi − ρi j = 0,
ni j (ni j − 1)ti−1 (7.9)
j=1 (S S i j − ρi ni j SST i j t −1 )
ij j=1

where ρi  ∩ j (−1/(ni j − 1), 1). Note that this equation involves only ρi as unknown,
while βi in SSi j and SSTi j involves data and ρi . Thus, the maximum likelihood es-
timate of ρi is obtained by solving equation (7.9) iteratively. Equation (7.9) is the
optimal estimating equation in the sense of Godambe (1960) and Bhapkar (1972).
That is, as a maximum likelihood equation, the estimating equation for ρi is unbiased
and fully efficient. Denote the estimate of ρi obtained by solving equation (7.9) by
ρ̂i . Then, replacing ρi on the right hand side of equations (7.7) and (7.8) by ρ̂i , the
maximum likelihood estimates of βi and σ2i are obtained.
Maximum likelihood (ML) estimates of the parameters of models (7.2)–(7.5) can
be obtained as special cases of equations (7.7)–(7.9). So the ML estimates of β and
σ2 of model (7.5) are
 −1
X mi
L X mi
L X
 X
β̂ =  
 T
Xi j Wi j Xi j  
 XiTj Wi j Yi j (7.10)
i=1 j=1 i=1 j=1

and
mi SS − ρ n SST t−1
L X
X ij i ij ij ij
σ̂2 = , (7.11)
i=1 j=1
Ñ(1 − ρi )

respectively, where Ñ = i j ni j and the estimating equation for ρi is


PP

Ñ(1 − ρi )−1 j [SSi j − ni j SSTi j ti−2 j {1 + (Ni j − 1)ρi }]


2
P
X
− ρi j = 0,
ni j (ni j − 1)ti−1
ρ −1 ρ −1
P P
i (1 − i ) j (SS ij − n
i ij SST t
ij ij ) j
(7.12)

where ρi ∈ ∩ j (−1/(ni j − 1), 1), SSi j = (Yi j − Xi j βi )T Ii j (Yi j − Xi j βi ), SSTi j = (Yi j −


Xi j βi )T Ji j (Yi j − Xi j βi )/ni j , and ti j = 1 + (ni j − 1)ρi .
Now, let Ωi j be the inverse of Σi j with (ti j − ρ)/{(1 − ρ)ti j } in the diagonal and
−ρ/{(1 − ρ)ti j } everywhere else; ti j = 1 + (ni j − 1)ρ and SSi j and SSTi j are as defined
for equations (7.10) and (7.11) above. Then the maximum likelihood estimates of β
and σ2 of model (7.4) are respectively
 −1  
X X  X X 
β̂ =  XiTj Ωi j Xi j   XiTj Ωi j Yi j  (7.13)
i j i j

and
XX
σ̂2 = (SSi j − ρni j SSTi j ti−1
j )/{N(1 − ρ)}. (7.14)
i j
ANALYSIS OF CLUSTERED DATA: CONTINUOUS RESPONSES 119
The estimating equation for ρ is

j {1 + (Ni j − 1)ρi }]
Ñ i j [SSi j − ni j SSTi j ti−2
PP 2 XX
−ρ j = 0, (7.15)
ni j (ni j − 1)ti−1
ρn −1
PP
i j (SS ij − ij SST t
ij ij ) i j

where ρi ∈ ∩ j (−1/(ni j − 1), 1).


The maximum likelihood estimates of the parameters of Model (7.2) can be ob-
tained from equations (7.13)-(7.15) by taking mi = 1, that is subscript j is not re-
quired, so that ti j = ti = 1 + (ni − 1)ρ, ρi = ρ, Xi = (1, . . . , 1)T is a vector of dimension
ni and β = µ. Thus, for given ρ the maximum likelihood estimates of µ and σ2 are
P −1
i ni ȳi ti
µ̂ = P −1 (7.16)
i ni ti
and
SS − ρ i ni SSTi ti−1
P
σ̂2 = , (7.17)
N(1 − ρ)
where SS = k (yik − µ)
2 and SSTi = ni (ȳi. − µ)2 , and the estimating equation for ρ
PP
i
is
i ni SSTi {1 + (ni − 1)ρ }ti ]
Ñ[SS −
P 2 −2 X ni (ni − 1)
−ρ = 0, (7.18)
SS − ρ i ni SSTi ti−1 1 + (ni − 1)ρ
P
i
where ρ ∈ ∩i (−1/(ni − 1), 1).
Similarly, the ML estimates of µi and σ2i of model (7.3) are respectively
XX
µ̂i = yi jk /ni mi = ȳi..
j k

and
(SSi − ρni SSTi ti−1 )
σ2i = ,
ni mi (1 − ρ)
where XX
SSi = (yi jk − ȳi.. )2 ,
j k
and X
SSTi = ni (ȳi j. − ȳi.. )2 , ti = 1 + (ni − 1)ρ.
j
Further, after simplification, the estimating equation for ρ can be expressed as
X mi ni (ni − 1)(ρ − ri )
= 0, (7.19)
i
(1 − ρRi ){1 + (ni − 1)ρ}
where
ni SSTi − SSi
ri = (7.20)
(ni − 1)SSi
is the sample intraclass correlation from the ith population,
Ri = 1 − (ni − 1)(1 − ri ), and ρ ∈ ∩i (−1/(ni − 1), 1).
120 CLUSTERED DATA ANALYSIS
7.2.4 Asymptotic Variance
It can be easily found that for all the models the β parameters and the (σ2 , ρ) param-
eters are orthogonal, i.e.

∂2 l ∂2 l
! !
−E = −E = 0.
∂β∂σ2 ∂β∂ρ

The consequence of such orthogonality is that the estimates β̂ and (σ̂2 , ρ̂) are
asymptotically independent (Cox and Reid, 1987). The asymptotic variance of ρ̂ is
thus obtained from the inverse of the Fisher information matrix of (σ2 , ρ). For Model
(7.6), it can be shown that
     
 ∂2 l  Ni  ∂2 l  −Ai  ∂2 l  Bi
−E  4  =
  , − E  2  =
  , − E  2  =
 ,
∂σi 2σi4 ∂σi ∂ρi 2σi (1 − ρi )
2 ∂ρi 2(1 − ρi )2

where N j =
P
j ni j ,
X ni j (ni j − 1)ρi
Ai =
j
1 + (ni j − 1)ρi

and
X ni j (ni j − 1){1 + (ni j − 1)ρ2 }
i
Bi = .
j
{1 + (ni j − 1)ρi }2

From the inverse of the Fisher information matrix of (σ2i , ρ) it can be seen that

2Ni (1 − ρi )2
var(ρ̂i ) = .
Ni Bi − A2i

Similarly, it can be shown that for Model (7.5)

2Ñ(1 − ρi )2
var(ρ̂i ) = ,
ÑBi − A2i

where Ñ =
P
i Ni ; for Model (7.4),

2Ñ(1 − ρ)2
var(ρ̂) = ,
ÑB − A2
where
X X ni j (ni j − 1)ρ
A=
i j
1 + (ni j − 1)ρi

and
X X ni j (ni j − 1){1 + (ni j − 1)ρ2 }
B= ;
i j
{1 + (ni j − 1)ρ}2
ANALYSIS OF CLUSTERED DATA: CONTINUOUS RESPONSES 121
for Model (7.2),
2N(1 − ρ)2
var(ρ̂) = ,
ND − C 2
where n =
P
i ni ,

X ni (ni − 1)ρ X ni (ni − 1){1 + (ni − 1)ρ2 }


C= and D = ,
i
1 + (ni − 1)ρ i
{1 + (ni − 1)ρ}2

and finally for Model (7.3),

2(1 − ρ)2
var(ρ̂) = P {1 + (ni − 1)ρ}−2 .
i mi ni (ni − 1)

7.2.5 Inference for Intracluster Correlation Coefficient


Inference for the intraclass correlation can be done by constructing confidence inter-
vals or conducting hypothesis tests for ρ or ρi using the asymptotic standard errors.
The estimating equations for ρ can be solved by a subroutine in R.
Example 3

Table 7.1 Artificial data of Systolic blood pressure of children of 10 families


Families
1 2 3 4 5 6 7 8 9 10
Systolic 118 111 107 115 127 126 90 140 110 109
Blood 130 140 110 115 122 123 99 120 105
Pressure 125 125 140 109 138
100 134 103
88

The purpose of the example here is to illustrate that the estimating equation for
ρ produces unique solution within the permitted range. The estimating equation was
solved by using R and solution was obtained within 9 iterations.
We consider Model IV, page 550 of Paul, 1990, where Yi = (Yi1 , . . . , Yini )0 , ν =
(µ, . . . , µ) and Σi = (1 − ρ)Ii + ρJi , where Ii is a ni × ni identity matrix and Ji is a ni × ni
matrix of only ones.
Now, let ti = 1 + (ni − 1)ρ,
ni ȳi /ti
P
µ = Pi
i ni /ti

S S − ρ i ni S S T i /ti
P X
σ2 = , N= ni
N(1 − ρ) i
where, XX
SS = (yik − µ)2 S S T i = ni (ȳi − µ)2
i k
122 CLUSTERED DATA ANALYSIS
and the estimating equation for ρ is,

h i
N S S − i ni S S T i {1 + (ni − 1)ρ2 }/ti 2
P
X ni (ni − 1)
−ρ =0
S S − ρ i ni S S T i /ti 1 + (ni − 1)ρ
P
i
where,
 1 
ρ ∈ ∩i − , 1 and
ni − 1
2(1 − ρ̂)2
Var(ρ̂) = P {1 + (ni − 1)ρ}−2
i n i (ni − 1)
The final solution is ρ̂ = 0.0266 with standard error=0.1596. The solution took 6
iterations.

7.2.6 Analysis of Clustered or Intralitter Data: Discrete Responses


Discrete data in the form of proportions arise in toxicological experiments from lit-
termates. A litter is generally a pregnant animal and the littermates are usually the fe-
tuses within an animal. Investigators studying the teratogenic, mutagenic or carcino-
genic effects of chemical agents in laboratory animals frequently obtain data from lit-
termates. As Paul (1982) described: Data arise in teratological experiments in which
a number of pregnant female animals are assigned to different groups. These gener-
ally consist of a control group and a number of groups treated with varying doses of
a compound. Each fertilized egg results in either a resorption, an early foetal death, a
late foetal death or a live fetus. Treatments may affect the incidence of each of these
characteristics, but the principal aim of the experiment is to determine if treatments
affect the incidences of abnormalities in live fetuses as the littermates may receive
an indirect exposure by the treatment of one parent with the compound under study.
More details are given in a review paper by Haseman and Kupper (1979).

7.2.7 The Models


The simplest model to describe toxicological data in the form of proportions is the
binomial model. However, it happens quite often that these types of data show greater
variability than predicted by the simple binomial model and the reason for this vari-
ability depends on the form of study. Weil (1970) observed that if the experimental
units of the data are litters of animals then “litter effect”, that is, the tendency of
animals in the same litter to respond more similarly than animals from different lit-
ters contribute to greater variability than predicted by the simple model. This litter
effect is known as intra-litter correlation coefficient. Several models, such as the beta-
binomial model (BB) (Williams, 1975; Crowder, 1978), the additive binomial and
the multiplicative binomial (MB) models (Altham, 1978) and the correlated bino-
mial model (CB) (Kupper and Haseman, 1978) have been proposed in the literature
to account for this intra-litter correlation coefficient. The additive binomial and the
ANALYSIS OF CLUSTERED DATA: CONTINUOUS RESPONSES 123
correlated binomial models are identical (Paul, 1982). However, the beta-binomial
model is the most popular and widely used as it often fits the toxicological propor-
tion data much better than the other models (Paul, 1982).
Consider a toxicological experiment consisting of m animals or litters, treated by
a compound (treatment), the ith litter giving birth to ni fetuses of which yi fetuses are
affected (for example, malformed). Thus, the observed proportion of fetuses affected
by the treatment for the ith litter is yi /ni , i = 1, . . . , m. Let p be the overall proportion
of fetuses affected by the treatment. We assume that yi |p ∼ binomial(ni , p) and p is a
random variable having a beta distribution with probability density function
1
f (p) = pα−1 (1 − p)β−1 .
B(α, β)
Then, unconditionally, yi has a beta-binomial distribution with probability mass func-
tion
ni B(yi + α, ni + β − yi )
!
Pr(yi ) = . (7.21)
yi B(α, β)
α ni αβ(α+β+ni )
 
The mean and variance of yi are ni α+β and (α+β) 2 (α+β+1) , respectively. Now, define
α ω
π = α+β , ω = α+β
1
and φ = 1+ω . Then the mean and the variance of yi can be expressed
as ni π and ni π(1−π) 1 + (ni − 1)φ . The parameter φ is the extra-dispersion parameter
 
which is also known as the intra-litter correlation parameter. In practical contexts
the interest is to estimate the parameters π and φ. The beta-binomial distribution
assumes that the intra-litter correlation parameter φ is positive. However, Prentice
(1986) argued that φ can also assume negative values and proposed the extended
beta-binomial distribution where φ can assume positive as well as negative values
with the range given below. Using the above parametrization, further simplification,
φ can assume positive as well as negative values. The probability mass function of
the extended beta-binomial distribution is
i −1
yQ ni −y
Qi −1
! [π(1 − φ) + rφ] [(1 − π)(1 − φ) + rφ]
ni r=0 r=0
Pr(yi |π, φ) = (7.22)
yi i −1
nQ
[(1 − φ) + rφ]
r=0
h    i
with 0 ≤ πi ≤ 1, and φi ≥ max −πi / ni j − 1 , − (1 − πi ) / ni j − 1 . Then φ can take
positive as well as negative values, and can be justified from the fact that fetuses for
the same litter can compete for the same resources within the mother’s womb.

7.2.8 Estimation
A number of methods for the estimation of π and φ are available. Here we give two
of the most popular methods, namely, the method of moments and the maximum
likelihood estimates. For a comprehensive description of the methods available see
Paul and Islam (1998) and Paul (1982).
124 CLUSTERED DATA ANALYSIS
We first give the estimates by the method of moments. This method was first
given by Kleinman (1973), who used weighted average and weighted variance of
sample proportions to equate to their respective expected values to find method of
moments estimates of the parameters π and φ.
Define zi = nyii , wi = π(1−π){1+(n
ni
for i = 1, 2, . . . , m and π̂ = m i=1 wi zi / i=1 wi .
P Pm
i −1)φ}
It can be seen that given φ, E (π̂) = π. Further, define S = m i=1 wi (zi − π) . Then it can
2
P
be shown that
m m
X wi π (1 − π) X wi π (1 − π) (ni − 1) φ
E (S ) = + .
i=1
ni i=1
ni

Then, the method of moments estimates of π and φ are obtained by solving the
estimating equations
m
X
wi (zi − π) = 0,
i=1

and
m m m
X X wi π (1 − π) X wi π (1 − π) (ni − 1) φ
wi (zi − π)2 − − =0
i=1 i=1
ni i=1
ni

simultaneously.
As is well known, the maximum likelihood estimates are obtained by maximizing
the likelihood function with respect to the parameters of interest. Using model (7.22)
the log-likelihood, apart from a constant, can be written as

i −1
m " yX
X −yi −1
ni X
l= log {(1 − φ)π + rφ} + log {(1 − φ)(1 − π) + rφ}
i=1 r=0 r=0
i −1
nX #
− log {(1 − φ) + rφ} , (7.23)
r=0

and the maximum likelihood estimates of the parameters π and φ can be obtained by
solving the estimating equations
 −1
m yX −yi −1
ni X

X  i (1 − φ) (1 − φ) 
 = 0

(1 − φ)π + rφ (1 − φ)(1 − π) + rφ 

i=1 r=0 r=0

and
 −1
m yX −yi −1
ni X i −1
nX

X  i r−π r+π−1 r − 1 
+ −  = 0
(1 − φ)π + rφ (1 − φ)(1 − π) + rφ (1 − φ) + rφ 

i=1 r=0 r=0 r=0

simultaneously.
SOME EXAMPLES 125
7.2.9 Inference
After the parameters are estimated, attention generally shifts to hypothesis testing
and confidence interval construction. Recall that the beta-binomial and the extended
beta-binomial models are extensions of the simpler binomial model. It is then natural
to test whether the binomial model is good enough to model toxicological data in the
form of proportions. To this end we test the null hypothesis H0 : φ = 0 against the
alternative hypothesis H1 : φ , 0.
The pioneering work in this is by Tarone (1979) who developed C(α) (Neyman,
1959) tests for testing the goodness of the binomial model against the BB, CB and
MB models. These tests are closely related to the binomial variance test. Paul (1982)
summarized these results as follows.
Let p̂ = y/n, where y = m y and n = m n and q̂ = 1 − p̂. Further, let
P P
Pm i=1 i 2 Pi=1 i
A = i=1 ni (ni − 1), B = i=1 ni (ni − 1) , R = m i=1 yi (ni − yi ), D = (q̂ − p̂)A, E =
Pm
p̂q̂(B + p̂q̂(2A − 4B)), F = n/( p̂q̂) and s = m ni p̂)2 /( p̂q̂). Then the optimal test
P
i=1 (yi −
statistic for testing the goodness of fit of the binomial distribution
(i) against the BB model is $Z = (s - n)/(2A)^{1/2}$, which, under the null hypothesis that the binomial model has a good fit against the BB model alternative, has an asymptotic standard normal distribution;
(ii) against the CB model is $X_C^2 = Z^2$, which, under the null hypothesis that the binomial model has a good fit against the CB model alternative, has an asymptotic $\chi^2$ distribution with one degree of freedom; and
(iii) against the MB model is $X_M^2 = (R - A\hat p\hat q)^2 (E - D^2/F)^{-1}$, which, under the null hypothesis that the binomial model has a good fit against the MB model alternative, has an asymptotic $\chi^2$ distribution with one degree of freedom.
After an extensive simulation study to compare these three tests, Paul (1982)
found that the BB model is, in general, more sensitive to the departure from the
binomial model, and therefore, is a superior model for the analysis of the data in
Table 7.2 given in Section 7.3. The justification for superiority of one model over the
others is that a model which is more sensitive to the departure from the binomial will
characterize the data more accurately than others which are less sensitive.

7.3 Some Examples


Data given in Paul (1982, Table 1, p. 362) were obtained from a teratological experiment at the Shell Toxicology Laboratory, Sittingbourne Research Centre, Sittingbourne, Kent, England. There were four groups: a control (C) group, a low (L) dose group, a medium (M) dose group, and a high (H) dose group.
By analyzing these data, Paul (1982) concluded that the beta-binomial model is
the most sensitive to the departure from the binomial model for data sets of the type
met in teratological experiments.
Below we give a subset of the data for interested readers to analyze.

Table 7.2 Data from toxicological experiment (Paul, 1982). (i) Number of live fetuses affected by treatment. (ii) Total number of live fetuses.

Control (C)   (i)   1  1  4  0  0  0  0  0  1  0  2  0  5  2  1  2  0  0  1
              (ii) 12  7  6  6  7  8 10  7  8  6 11  7  8  9  2  7  9  7 11

Low (L)       (i)   0  1  1  0  2  0  1  0  1  0  0  3  0  0  1  5
              (ii)  5 11  7  9 12  8  6  7  6  4  6  9  6  7  5  9

Medium (M)    (i)   2  3  2  1  2  3  0  4  0  0  4  0  0  6  6  5
              (ii)  4  4  9  8  9  7  8  9  6  4  6  7  3 13  6  8

High (H)      (i)   1  0  1  0  1  0  1  1  2  0  4  1  1  4  2
              (ii)  9 10  7  5  4  6  3  8  5  4  4  5  3  8  6
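As an illustration of the statistic in (i) of Section 7.2.9, the following sketch (Python with NumPy) computes Tarone's Z for the control group of Table 7.2; the result is compared with the standard normal distribution:

```python
import numpy as np

# Control-group data from Table 7.2
y = np.array([1, 1, 4, 0, 0, 0, 0, 0, 1, 0, 2, 0, 5, 2, 1, 2, 0, 0, 1])
n = np.array([12, 7, 6, 6, 7, 8, 10, 7, 8, 6, 11, 7, 8, 9, 2, 7, 9, 7, 11])

p_hat = y.sum() / n.sum()
q_hat = 1 - p_hat
A = np.sum(n * (n - 1))
s = np.sum((y - n * p_hat) ** 2) / (p_hat * q_hat)
Z = (s - n.sum()) / np.sqrt(2 * A)   # compare with N(0, 1)
print(round(Z, 3))
```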

7.4 Regression Models for Multilevel Clustered Data


Cluster-correlated data arise when data are obtained in clusters where data within
clusters are correlated. Also data may be obtained where there is a natural structure
or hierarchy within each cluster. Some examples are:
Example 1: In studies of health care services we may be interested in assessing
quality of care for patients who are nested or grouped within different clinics. These
data are said to be hierarchical data having two levels: the patients within clinics are at level 1 and clinics are the level 2 units.
Example 2: In family studies children are at level 1, mother-father at level 2 and
families are at level 3.
Example 3: In educational studies children are nested within a classroom, classrooms are nested within schools, and schools are nested within a school district. The data obtained would be three-level data.
Often it is of interest to analyze these multilevel data. Fitzmaurice, Laird, and
Ware (2004) provide an excellent overview of multilevel data. Here we discuss mod-
els and inference procedures for two-level data.

7.5 Two-Level Linear Models


The two level linear model is given by

Yi j = Xi j β + Zi j b j + ei j , (7.24)

where Yi j is the response on the ith unit at level 1 within the jth unit (cluster) at
level 2, Xi j is a 1 × p (row) vector of covariates, β is a 1 × p (row) vector of fixed
effects regression parameters, Zi j is a design matrix for the random effects at level
2, b j is the random effect at the jth unit (cluster) at level 2, and ei j is the error. The
random effects, b j , vary across level 2 units but, for a given level 2 unit, are the same
for all level 1 units. For example, Yi j might be the outcome for the ith patient in the
AN EXAMPLE 127
jth clinic, where the clinics are a random sample of clinics from around the area.
The random effects are assumed to be independent across level 2 units, with mean zero and covariance $\mathrm{Cov}(b_j) = G$. The level 1 random components, $e_{ij}$, are assumed to
be independent across level 1 units, with mean zero and variance Var(ei j ) = σ2 . In
addition, the ei j ’s are assumed to be independent of the b j ’s, with Cov(ei j , b j ) = 0.
That is, level 1 units are assumed to be conditionally independent given the level 2
random effects (and the covariates).
The regression parameters, β, are the fixed effects and describe the effects of
covariates on the mean response

E(Yi j ) = Xi j β, (7.25)

where the mean response is averaged over both level 1 and level 2 units. The two
level model given by (7.24) also describes the effects of covariates on the conditional
mean response given the random effect b j as

E(Yi j |b j ) = Xi j β + Zi j b j , (7.26)

where the response is averaged over level 1 units only. The maximum likelihood es-
timates of the parameters β, σ², and G for longitudinal data when there are no missing responses are obtained in Chapter 9. Following these results we obtain maximum
likelihood estimates of the parameters β, σ2 and G of model (7.24) as
 N −1 N
X  X
β̂ =  Xi Σi Xi 
 T −1
XiT Σ−1
i yi (7.27)
i=1 i=1

PN T N
ei ei X bT bi
σ̂2 = Pi=1
N
, Ĝ = i
, (7.28)
i=1 ni i=1
N

where $e_i = y_i - X_i\beta - Z_i b_i$ (see Chapter 9). However, suppose the $b_j$'s are assumed to be independent with $\mathrm{var}(b_j) = \sigma_2^2$ and there is no covariate at the level 2 units. Since the data are clustered, observations within the same cluster are correlated, but are independent across clusters. Then, letting $\mathrm{Var}(e_{ij}) = \sigma_1^2$, the degree to which the observations within the same cluster are correlated can be measured by the intra-cluster correlation
$$\rho = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}. \qquad (7.29)$$
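A minimal sketch of the GLS computation (7.27) under a random-intercept covariance, $\Sigma_i = \sigma_1^2 I + \sigma_2^2 J$ with $J$ a matrix of ones, assuming the two variance components are given (in practice they would be estimated, e.g., by REML); the function name is our own:

```python
import numpy as np

def gls_beta(X_list, y_list, sigma1_sq, sigma2_sq):
    """GLS estimate (7.27): Sigma_i = sigma1^2 * I + sigma2^2 * J."""
    p = X_list[0].shape[1]
    XtSX = np.zeros((p, p))
    XtSy = np.zeros(p)
    for X, y in zip(X_list, y_list):
        ni = len(y)
        Sigma = sigma1_sq * np.eye(ni) + sigma2_sq * np.ones((ni, ni))
        Sinv = np.linalg.inv(Sigma)
        XtSX += X.T @ Sinv @ X
        XtSy += X.T @ Sinv @ y
    return np.linalg.solve(XtSX, XtSy)

# Intra-cluster correlation (7.29): rho = sigma2_sq / (sigma1_sq + sigma2_sq)
```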

7.6 An Example: Developmental Toxicity Study of Ethylene Glycol


We refer to the data in Price et al. (1985). Developmental toxicity studies of lab-
oratory animals play a crucial role in the testing and regulation of chemicals and
pharmaceutical compounds. Exposure to developmental toxicants typically causes
a variety of adverse effects, such as fetal malformations and reduced fetal weight
128 CLUSTERED DATA ANALYSIS
at term. In a typical developmental toxicity experiment, laboratory animals are as-
signed to increasing doses of a chemical or a test substance. Price et al. (1985) pro-
vided data from an developmental toxicity study of ethylene glycol (EG). Ethylene
glycol is a high volume industrial chemical used in many applications. It is used as
an antifreeze, as a solvent in the paint and plastic industries, and in the formulation
of various types of inks. In a study of laboratory mice conducted through the Na-
tional Toxicology Program (NTP), EG was administered at doses of 0, 750, 1500, or
3000 mg/kg/day to 94 pregnant mice or dams over the period of major organogen-
esis, beginning just after implantation. See Price et al. (1985) for additional details
concerning the study design. Following sacrifice, fetal weight and evidence of mal-
formations were recorded for each live fetus. Fitzmaurice, Laird, and Ware (2004)
provided an analysis of the data, focusing on the effects of dose on fetal weight. Summary statistics (ignoring clustering in the data) for fetal weight for the 94 litters (composed of a total of 1028 live fetuses) are presented in Table 17.1, p. 451 of Fitzmaurice et al. (2004). Fetal weight decreases monotonically with increasing dose, with the average weight ranging from 0.972 (gm) in the control group to 0.704 (gm) in the group administered the highest dose. The decrease in fetal weight is not linear in increasing dose, but is approximately linear in increasing $\sqrt{\text{dose}}$.
The data on fetal weight from this experiment are clustered, with observations on
the fetuses (level 1 units) nested within dams/litters (level 2 units). The litter sizes
range from 1 to 16. Letting $Y_{ij}$ denote the fetal weight of the $i$th live fetus from the $j$th litter, Fitzmaurice et al. (2004) considered the following model relating the fetal weight outcome to dose:
$$Y_{ij} = \beta_1 + \beta_2 d_j + b_j + e_{ij}, \qquad (7.30)$$
where $d_j = \sqrt{\mathrm{dose}_j/750}$ is the square-root transformed dose administered to the $j$th dam. The random effect $b_j$ is assumed to vary independently across litters, with $b_j \sim N(0, \sigma_2^2)$. The errors, $e_{ij}$, are assumed to vary independently across fetuses (within a litter), and $e_{ij} \sim N(0, \sigma_1^2)$. This model assumes that fetuses within a cluster are
exchangeable and the positive correlation among the fetal weights is accounted for
by their sharing a common random effect, b j . The degree of clustering of the data
can be expressed in terms of the intra-cluster (or intra-litter) correlation

$$\rho = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}. \qquad (7.31)$$

In Table 17.2, Fitzmaurice et al. (2004, p. 452) give the results of fitting the model to the fetal weight data. The REML (restricted maximum likelihood) estimate of the regression parameter for (transformed) dose indicates that the mean fetal weight decreases with increasing dose. The estimated decrease in weight, comparing the highest dose group to the control group, is 0.27 (or 2 × −0.134, with 95% confidence interval: −0.316 to −0.220). Note that both model-based and empirical (or sandwich) standard errors were calculated and they were very similar, suggesting that the simple random effect structure for the clustering of fetal weights is adequate. The
estimate of the intra-cluster correlation, ρ̂ = 0.57, indicates that there are moderate
litter effects.
Fitzmaurice et al. (2004) provided further analysis of the data to assess the ad-
equacy of the linear dose-response trend. They considered a model that included
a quadratic effect of (transformed) dose. Both Wald and likelihood ratio tests of the
quadratic effect of dose indicated that the linear trend is adequate for these data (Wald
W 2 = 1.38, with 1 df, p-value > 0.20; likelihood ratio G2 = 1.37, with 1 df, p-value
> 0.20).
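A hedged sketch of how such a fit could be carried out with the statsmodels package, whose MixedLM class fits linear mixed models by ML/REML. We simulate data with roughly the structure of the EG study rather than reproduce the actual data, so all numbers below are illustrative only:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_litters, pups = 20, 10
litter = np.repeat(np.arange(n_litters), pups)
dose = np.repeat(rng.choice([0, 750, 1500, 3000], n_litters), pups)
d = np.sqrt(dose / 750)                               # transformed dose d_j
b = rng.normal(0.0, 0.08, n_litters)[litter]          # litter random effects
weight = 0.97 - 0.13 * d + b + rng.normal(0.0, 0.08, n_litters * pups)
df = pd.DataFrame({"weight": weight, "d": d, "litter": litter})

# Random-intercept model (7.30), fitted by REML as in the text
result = smf.mixedlm("weight ~ d", df, groups=df["litter"]).fit(reml=True)
print(result.summary())
```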

7.7 Two-Level Generalized Linear Model


As for the two-level linear model, Fitzmaurice et al. (2004) provided an excellent review of the two-level generalized linear model.
Let Yi j be the response on the ith level 1 unit in the jth level 2 cluster. Let Xi j be a
1 × p (row) vector of covariates. The response can be continuous, binary, or a count.
The three-part specification of the two-level generalized linear model, following McCullagh and Nelder (1989), is given as follows.
(1) We assume that the conditional distribution of each Yi j , given a vector of
random effects b j (and the covariates), belongs to the exponential family of distri-
butions and that Var(Yi j |b j ) = v{E(Yi j |b j )}φ, where v(·) is a known variance function
and is a function of the conditional mean, E(Yi j |b j ); and φ is a scale or dispersion
parameter. In addition, given the random effects b j , it is assumed that the Yi j are
independent of one another.
(2) The conditional mean of Yi j is assumed to depend upon the fixed and random
effects via the following linear predictor,

ηi j = Xi j β + Zi j b j , (7.32)

with
g{E(Yi j |b j )} = ηi j = Xi j β + Zi j b j , (7.33)
for some known link function, g(·).
(3) Finally, the random effects are assumed to have some probability distribution.
In principle, any multivariate distribution can be assumed for the b j ; in practice,
for computational convenience, the random effects are usually assumed to have a
multivariate normal distribution, with zero mean and covariance matrix, G. It is also
common, for convenience, to assume that the $b_j$ are independent, with $\mathrm{Var}(b_j) = \sigma_j^2$ and $\mathrm{Cov}(b_j, b_k) = 0$ for $j \neq k$.
These three components completely specify a broad class of two level generalized
linear models for different types of responses.
The two-level generalized linear model is generally referred to as a generalized linear mixed model (GLMM; Jiang, 2007). Next, to clarify the main ideas, we consider two examples of two-level generalized linear models.
Example 1: Two-level Generalized Linear Model for counts
Consider a study comparing cross-national rates of skin cancer and the factors
(e.g., climate, economic and social factors, regional differences in diagnostic proce-
dures) that influence variability in the rates of disease. Suppose we have counts of the
number of cases of skin cancer in a set of well-defined regions, indexed by $i$, within countries, indexed by $j$. Let $Y_{ij}$ be a count of the number of individuals who develop skin cancer within the $i$th region of the $j$th country during a given period of time (say,
5 years). The resulting counts have a two level structure with regional units at the
lower level (level 1 units). Usually, the analysis of count data requires knowledge of
the population at risk. That is, the rate at which the disease occurs is of more direct
interest than the corresponding count.
Counts are often modeled as Poisson random variables using a log link function.
This motivates the following illustration of a two level generalized linear model for
Yi j given by the three part specification:
(1) Conditional on a vector of random effects b j , the Yi j are assumed to be inde-
pendent observations from a Poisson distribution, with Var(Yi j |b j ) = E(Yi j |b j ), (i.e.,
φ = 1).
(2) The conditional mean of Yi j depends upon fixed and random effects via the
following log link function,

log{E(Yi j |b j )} = log(T i j ) + Xi j β + Zi j b j , (7.34)

where $T_{ij}$ is the population at risk in the $i$th region of the $j$th country and $\log(T_{ij})$
is an offset.
(3) The random effects are assumed to have a multivariate normal distribution,
with zero mean and covariance matrix G.
This is an example of a two-level log linear model that assumes a linear relation-
ship between the log rate of disease occurrence and the covariates.
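The role of the offset in (7.34) is easy to see in a small simulation (Python with NumPy); all names and values here are illustrative, not from any real disease registry:

```python
import numpy as np

rng = np.random.default_rng(1)
n_country, n_region = 10, 8
country = np.repeat(np.arange(n_country), n_region)
T = rng.integers(5_000, 50_000, n_country * n_region)   # population at risk
x = rng.normal(size=n_country * n_region)               # a region-level covariate
b = rng.normal(0.0, 0.3, n_country)[country]            # country random intercepts

# Model (7.34): log E(Y|b) = log(T) + beta0 + beta1 * x + b
eta = np.log(T) + (-8.0) + 0.2 * x + b
y = rng.poisson(np.exp(eta))                            # simulated counts
print(y[:10])
```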
Example 2: Two-level Generalized Linear Model for Binary Responses
Consider a study of men with newly diagnosed prostate cancer. The study is de-
signed to evaluate the factors that determine physician recommendations for surgery
(radical prostatectomy) versus radiation therapy. In particular, it is of interest to de-
termine the relative importance of patient factors (e.g., the patient's age, level of prostate
specific antigen) and physician factors (e.g., speciality training, years of experience)
on physician recommendations for treatment. Many patients in the study seek the rec-
ommendation of the same physician. As a result, patients (level 1 units) are nested
within the physicians (level 2 units). For each patient, we have a binary outcome
denoting the physician's recommendation (surgery versus radiation therapy).
Let Yi j be the binary response, taking values 0 and 1 (e.g., denoting surgery or
radiation therapy) for the ith patient of the jth physician. An illustrative example of
a two level logistic model for Yi j is given by the following three part specification:
(1) Conditional on a single random effect b j , the Yi j are independent and have a
Bernoulli distribution, with Var(Yi j |b j ) = E(Yi j |b j ){1 − E(Yi j |b j )}, (i.e., φ = 1).
(2) The conditional mean of Yi j depends upon fixed and random effects via the
following linear predictor:
ηi j = Xi j β + b j , (7.35)
and
$$\log\left\{\frac{\Pr(Y_{ij} = 1 \mid b_j)}{\Pr(Y_{ij} = 0 \mid b_j)}\right\} = \eta_{ij} = X_{ij}\beta + b_j. \qquad (7.36)$$
That is, the conditional mean of Yi j is related to the linear predictor by a logit link
function.
(3) The single random effect b j is assumed to have a univariate normal distribu-
tion, with zero mean and variance $\sigma_1^2$.
In this example, the model is a simple two-level logistic regression model with
randomly varying intercepts.

7.8 Rank Regression


Finally, we consider rank regression for the linear model as in §6.2, $Y_i = \beta_0 \mathbf{1} + X_i\beta + \epsilon_i$. As in Jung and Ying (2003), we assume the $\epsilon_{ik}$ are continuous random variables with the fundamental assumption that the median of any pairwise difference $\epsilon_{ik} - \epsilon_{jl}$ is 0. Define the residual vector $e_i(\hat\beta) = Y_i - X_i\hat\beta$ for any vector $\hat\beta$ of dimension $p$. The components of $e_i(\hat\beta)$ are $e_{ik}(\hat\beta)$, $1 \le k \le n_i$.
To bypass modeling the correlations, Jung and Ying (2003) suggest ignoring the possible within-subject correlations and applying the usual rank estimation for $\beta$, i.e., minimizing the loss function given by (6.3), which leads to the following quasi-score functions for $\beta$,
$$U_{JY}(\beta) = M^{-2} \sum_{i=1}^N \sum_{j=1}^N \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} (x_{ik} - x_{jl})\,\mathrm{sgn}(e_{ik} - e_{jl}),$$
where $x_{ik}^\top$ is the $k$th row of the design matrix $X_i$. This function is equivalent to that given by (6.3).
One feature of cluster data analysis is that the time order of the observations is often not important or not recorded, which is often the case in developmental studies. This suggests that an exchangeable correlation structure may be appropriate. It also motivates us to avoid modeling the correlation structure by subsampling one observation from each cluster and then applying the classical rank method to the resulting $N$ independent observations, $(y_{(i)})$, $1 \le i \le N$, say. If the corresponding residuals are $(e_{(i)})$, the dispersion function is
$$M^{-2} \sum_{i=1}^N \sum_{j \ne i} |e_{(i)} - e_{(j)}|.$$

To make use of all the observations, we would repeat this resampling process many times. Conditional on sampling one observation from each cluster, the probability of $y_{ij}$ being sampled is $1/n_i$. Therefore, the limiting dispersion function is
$$L_{DS}(\beta) = M^{-2} \sum_{i=1}^N \sum_{j \ne i} n_i^{-1} n_j^{-1} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} |e_{ik} - e_{jl}|.$$
This within-cluster resampling was first considered by Hoffman et al. (2001) for
eliminating the bias when the cluster sizes are informative. This was further inves-
tigated by Williamson et al. (2003) in the context of the GEE setup. In the special
case of comparing two groups, the LDS becomes the testing statistic (S ) proposed by
Datta and Satten (2005).
This motivates us to consider the following weighted loss function for estimation
of β,
$$L_W(\beta) = M^{-2} \sum_{i=1}^N \sum_{j \ne i} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} w_i w_j |e_{ik}(\beta) - e_{jl}(\beta)|, \qquad (7.37)$$

where wi is a pre-chosen nonnegative and bounded sequence to take account of the


different cluster sizes. The corresponding quasi-score functions for β are
$$U_W(\beta) = M^{-2} \sum_{i=1}^N \sum_{j \ne i} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} w_i w_j (x_{ik} - x_{jl})\,\mathrm{sgn}(e_{ik} - e_{jl}).$$

If we regard $L_W$ as a generalization of $L_{DS}$, then $w_i/(\sum_j w_j)$ can be regarded as the probability of sampling from cluster $i$. This probability is chosen to be inversely proportional to the cluster size for the within-cluster sampling approach. Sensible choices include (i) $w_i = 1$; (ii) $w_i = 1/n_i$; (iii) $w_i = 1/\sqrt{n_i}$; and (iv) $w_i = \{1 + (n_i - 1)\bar\rho\}^{-1}$, where $\bar\rho$ is the "average" within-cluster correlation. Wang and Zhao (2008) provided more details on simulation and asymptotic results.
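A direct O(M²) sketch of the weighted quasi-score $U_W(\beta)$ (Python with NumPy; the function name is ours), for example with weights $w_i = \{1 + (n_i - 1)\bar\rho\}^{-1}$. The intercept is omitted because it cancels in the pairwise differences $e_{ik} - e_{jl}$:

```python
import numpy as np

def U_W(beta, X_list, y_list, weights):
    """Weighted rank quasi-score of Section 7.8; M = total observations.
    A plain double loop over clusters, adequate for small data sets."""
    M = sum(len(y) for y in y_list)
    p = X_list[0].shape[1]
    U = np.zeros(p)
    N = len(y_list)
    for i in range(N):
        e_i = y_list[i] - X_list[i] @ beta
        for j in range(N):
            if j == i:
                continue
            e_j = y_list[j] - X_list[j] @ beta
            sgn = np.sign(e_i[:, None] - e_j[None, :])          # n_i x n_j
            diff = X_list[i][:, None, :] - X_list[j][None, :, :]
            U += weights[i] * weights[j] * np.sum(sgn[..., None] * diff,
                                                  axis=(0, 1))
    return U / M**2
```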

7.8.1 National Cooperative Gallstone Study


In this study, 113 patients who had floating gallstones were randomly allocated either to the high dose group (750 mg per day; 65 patients) or to a placebo group (48 patients). The serum cholesterol was measured for each patient at baseline and at months 6, 12, 20, and 24 of follow-up. Many cholesterol measurements were missing because patient follow-up was terminated (Wei and Lachin, 1984). The response $Y_{ij}$ here is the increase in cholesterol level at months 6, 12, 20, and 24 above the baseline value. One of the major interests in this study was to investigate the safety of the drug chenodiol for the treatment of cholesterol gallstones. Let $G_i$ be the group indicator taking 0 for placebo and 1 for high dose. In this case, we treated time as a categorical variable expressed by three dummy variables $T_{12}$, $T_{20}$, and $T_{24}$, with the measurements at month 6 as the reference level. We consider the regression model,

$$Y_{ij} = \beta_0 + \beta_1 T_{12} + \beta_2 T_{20} + \beta_3 T_{24} + \beta_4 G_i + \epsilon_{ij}.$$

The method using $w_i = 1/n_i$ is referred to as $W_0$. Using Jung and Ying's method, the estimates for $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ are 2.774, 12.647, 7.922, and 10.475, respectively. The hypotheses of no treatment or time effect can be constructed as $H_0: \beta_i = 0$, $i = 1, \ldots, 4$, respectively.
The average correlation ρ̄ is estimated as 0.49, which is quite large. We, therefore,
applied the weighted rank method with weights {1 + ρ̄(ni − 1)}−1 (W3 ) and obtained
β̂ = (3.143, 13.314, 7.751, 10.393).
Under $H_0$, $Q = \max_{\{\beta_i = 0\}} M\, U_W(\beta)\, \hat V_W^{-1}\, U_W(\beta)^\top$ has an asymptotic chi-square distribution with one degree of freedom (Jung and Ying, 2003). The corresponding $Q$
values in weighted rank regression are obtained as 0.816, 13.864, 2.106, and 3.970
for βi , i = 1, ..., 4, respectively. For comparison, the Q values from Jung and Ying’s
method are 0.653, 16.100, 2.425, and 3.833 correspondingly. The 95% confidence in-
tervals for β1 , β2 , β3 , and β4 are (−3.176, 9.461), (6.251, 20.376), (−2.532, 18.034),
and (−0.057, 20.842). One-sided tests are also available based on the variance estimates of $\hat\beta_W$; these may suggest a significant drug effect of increasing the serum cholesterol level.

7.8.2 Reproductive Study


This study was conducted to evaluate the effects of an experimental compound on
general reproductive performance and postnatal measurements taken on pups. Alto-
gether 30 dams were randomly assigned to three equal size treatment groups: control,
a low dose, and a high dose of the compound. Only 7 litters were available in the high dose group, since one female did not conceive, one cannibalized her pups, and one delivered a dead litter (Dempster et al., 1984).
Denote the low dose group as L (reference group) and the high dose group as
H. The response variable $Y$ is pup weight, which is used to assess the treatment effects. The number of pups differs across dams and will be included in our model as $S$. An interesting complication is that the cluster size is informative, since pups from smaller litters tend to have a higher weight than those from larger litters. In addition, gender is also considered as a covariate, denoted as $G$. Therefore,
we have the following model:

$$Y_{ij} = \beta_0 + \beta_1 L + \beta_2 H + \beta_3 G + \beta_4 S + \epsilon_{ij}.$$

The estimates from the independence rank regression are $\hat\beta_{JY} = (-0.448, -0.936, -0.236, -0.123)^\top$, from which we also obtain an estimate of $\bar\rho$ as 0.456. The weighted rank regression using $w_i = \{1 + \bar\rho(n_i - 1)\}^{-1}$ produces the estimates of $\beta$ as $\hat\beta_W = (-0.460, -0.886, -0.242, -0.126)^\top$. Except for gender, these two methods give similar estimates to those from the random intercept model of Dempster et al. (1984), who obtained the estimates $(-0.429, -0.859, -0.359, -0.129)^\top$. The standard deviations of the individual parameter estimates $\hat\beta_W$ are $(0.196, 0.241, 0.083, 0.025)^\top$, which are generally higher than those from the random effect model, which are $(0.150, 0.182, 0.047, 0.019)^\top$ based on the normality assumption. The absence of any violation of normality in this data set may be a reason for this. The hypothesis tests based on $\hat\beta_W$ conclude that all the variables in the model
have statistically significant effects on the pup weights.
To investigate the impact of the informative cluster size on the regression models,
we excluded litter size from the set of covariates. The estimates using weight 1/ni
are $(-0.445, -0.441, -0.148)^\top$. It is interesting to note that similar estimates are obtained here except for the high dose group. The same pattern can also be seen in the random effect model, where the estimates are $(-0.375, -0.355, -0.361)^\top$. The
hypothesis tests show that neither the low dose nor the high dose is significant in the random effect model. However, the normal value of −2.432 for testing the effect of the low dose in the weighted rank regression with weights $\{1 + \bar\rho(n_i - 1)\}^{-1}$ indicates that the low dose significantly affects pup weights. The resistance of the weighted rank regression to informative cluster size needs more research to explore. However, our weighted method including litter size as a covariate should produce more efficient estimators.
Chapter 8

Missing Data Analysis

8.1 Introduction
Missing data or missing values occur when no information is available on the re-
sponse or some of the covariates or both the response and some covariates for some
subjects who participate in a study of interest. There can be a variety of reasons for
occurrence of missing values. Nonresponse occurs when the respondent does not
respond to certain questions due to stress, fatigue or lack of knowledge. Some indi-
viduals in the study may not respond because some questions are sensitive. Missing
data can create difficulty in analysis because nearly all standard statistical methods
presume complete information for all the variables included in the analysis. Even a
small number of missing observations can dramatically affect a statistical analysis, resulting in biased and inefficient parameter estimates and confidence intervals that are too wide or too narrow.

8.2 Missing Data Mechanism


How do missing values or observations occur? The way missingness occurs in a data set, the missing data mechanism, can be divided into three types: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). See the pioneering work in this area by Rubin (1976) and Little and Rubin (1987). MCAR refers to missingness that depends on neither the observed nor the unobserved data; under MCAR, the probability of missingness is the same for all observations. For example, if answering a question depends on obtaining a head when tossing a fair coin, then the missingness of that answer is completely random. MAR refers to missingness that depends only on observed values: the probability of missingness depends only on the available observations, not on the unobserved ones. For example, missing information about age or income may depend on other available information, such as the race and gender of the individual. When neither MCAR nor MAR holds, the data are said to be MNAR. In this case, the probability of missingness depends on both observed and unobserved observations. For example, dropouts in medical studies are MNAR if some individuals in a study do not like the previous results, are worried about future results of the study, and drop out (see Little and Rubin, 1987 and Song, 2007).



To put things in proper perspective, using notation similar to Song (2007), let
$y_i = (y_{i1}, y_{i2}, \ldots, y_{in_i})^T$ and
$$r_{ij} = \begin{cases} 1 & \text{if } y_{ij} \text{ is observed,} \\ 0 & \text{otherwise.} \end{cases} \qquad (8.1)$$

As some of the observations in the response may be missing, we write the response $y_i$ as
$$y_i = \begin{cases} y_i^o & \text{if } y_i \text{ is observed,} \\ y_i^m & \text{if } y_i \text{ is missing.} \end{cases} \qquad (8.2)$$

Further, let $X_i$, $Z_i$, and $W_i$ be design matrices for the fixed effects, the random effects (if any), and the missing data process, and let $\theta$ and $\psi$ be vectors that parameterize the joint distribution of $y_i$ and $r_i = (r_{i1}, r_{i2}, \ldots, r_{in_i})^T$, where $\theta = (\beta^T, \alpha^T)^T$ parameterizes the measurement process and $\psi$ the missingness process; $\beta$ is the fixed effects parameter vector and $\alpha$ represents the variance components and/or association parameters. The full data $(y_i, r_i)$ consist of the complete data and the missing data indicators.
Now, when data are incomplete due to a stochastic mechanism, the full data den-
sity is
f (yi , ri |Xi , Zi , Wi , θ, ψ),
which can be factorized as

f (yi , ri |Xi , Zi , Wi , θ, ψ) = f (yi |Xi , Zi , θ) f (ri |yi , Wi , ψ). (8.3)

Then, under an MCAR mechanism, the probability of an observation being missing is independent of the responses, and therefore
$$f(r_i|y_i, W_i, \psi) = f(r_i|y_i^o, y_i^m, W_i, \psi) = f(r_i|W_i, \psi),$$
so that
$$f(y_i, r_i|X_i, Z_i, W_i, \theta, \psi) \propto f(y_i|X_i, Z_i, \theta).$$
So, the data analysis is based on only the complete cases.
Under MAR, the probability of an observation being missing is conditionally
independent of the unobserved data given the values of the observed data, which
implies that
$$f(r_i|y_i, W_i, \psi) = f(r_i|y_i^o, W_i, \psi).$$
So the model for data analysis is
$$f(y_i, r_i|X_i, Z_i, W_i, \theta, \psi) = f(y_i|X_i, Z_i, \theta) f(r_i|y_i^o, W_i, \psi), \qquad (8.4)$$
and at the observed data level the model is
$$f(y_i^o, r_i|X_i, Z_i, W_i, \theta, \psi) = f(y_i^o|X_i, Z_i, \theta) f(r_i|y_i^o, W_i, \psi). \qquad (8.5)$$

Under MNAR, the probability of an observation being missing depends on observed and unobserved data, which implies that
$$f(r_i|y_i, W_i, \psi) = f(r_i|y_i^o, y_i^m, W_i, \psi).$$
Thus, at the observed data level, the model is
$$f(y_i^o, r_i|X_i, Z_i, W_i, \theta, \psi) = f(y_i^o|X_i, Z_i, \theta) \int f(r_i|y_i^o, y_i^m, W_i, \psi)\, dy_i^m. \qquad (8.6)$$

In practice, however, the above integral is often intractable, so some Monte Carlo method needs to be devised to approximate it, replacing the integral by a summation.

8.3 Missing Data Patterns


The best description of missing data patterns is by Song (2007), which we follow here. To illustrate some common missing data patterns encountered in practice, consider the following hypothetical example adapted from Song (2007): a longitudinal study involves eight subjects, each having three visits. Half of them are randomized into the standard treatment and the other half into the new treatment. Heart rate is the outcome variable of interest (Tables 8.1-8.5).

Figure 8.1 Graphic representation of five missing data patterns

A complete data pattern refers to the case with no missing values, as shown in
Table 8.1 and in panel (a) of Figure 8.1. A univariate (response) missing pattern refers
to the situation where missing values only occur at the last visit, as shown in Table
8.2 and panel (b) of Figure 8.1. This is a special case of dropout pattern.
Table 8.3 and panel (c) of Figure 8.1 show a uniform missing pattern, in which missing values occur in a joint fashion. That is, the measurements at the last two visits
are either both observed or both missing simultaneously. This is a kind of dropout
mechanism and the dropout time is uniform across all subjects.
Table 8.4 and panel (d) of Figure 8.1 display a monotonic missing pattern, where
if one observation is missing, then all the observations after it will be unobserved.
This is a general and important kind of dropout mechanism that allows subjects to
have different dropout times. As a matter of fact, all the above cases (b)-(d) are mono-
tonic missing patterns.
An arbitrary missing pattern refers to the case in which missing values may occur
in any fashion, an arbitrary combination of intermittent missing values and dropouts.
Table 8.5 and panel (e) of Figure 8.1 demonstrate a possible scenario for a mixture
of intermittent missing (on Subject 5) at the second visit and some dropouts (on both
Subject 1 and Subject 2).
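Whether a pattern is monotone is easy to check programmatically. A small sketch (Python with NumPy; the function name is ours), where rows are subjects, columns are visits, and 1 indicates an observed measurement:

```python
import numpy as np

def is_monotone(R):
    """R is a subjects-by-visits 0/1 matrix of missing-data indicators
    (1 = observed). The pattern is monotone (dropout) when, within every
    row, no observed value follows a missing one."""
    R = np.asarray(R)
    # Running minimum along visits: once a 0 appears, all later entries
    # must also be 0 for the pattern to be monotone.
    return bool(np.all(np.minimum.accumulate(R, axis=1) == R))

print(is_monotone([[1, 1, 0], [1, 0, 0]]))   # True: dropouts only
print(is_monotone([[1, 0, 1]]))              # False: intermittent missing
```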

Table 8.1 Complete data pattern.


Subject Visit Treatment Heart Rate Subject Visit Treatment Heart Rate
1 1 A 65 5 1 B 72
1 2 A 64 5 2 B 74
1 3 A 61 5 3 B 76
2 1 A 60 6 1 B 67
2 2 A 80 6 2 B 69
2 3 A 81 6 3 B 73
3 1 A 62 7 1 B 62
3 2 A 68 7 2 B 61
3 3 A 69 7 3 B 65
4 1 A 81 8 1 B 75
4 2 A 75 8 2 B 74
4 3 A 78 8 3 B 72

Table 8.2 Univariate missing data pattern.


Subject Visit Treatment Heart Rate Subject Visit Treatment Heart Rate
1 1 A 65 5 1 B 72
1 2 A 64 5 2 B 74
1 3 A ? 5 3 B 76
2 1 A 60 6 1 B 67
2 2 A 80 6 2 B 69
2 3 A ? 6 3 B ?
3 1 A 62 7 1 B 62
3 2 A 68 7 2 B 61
3 3 A 69 7 3 B 65
4 1 A 81 8 1 B 75
4 2 A 75 8 2 B 74
4 3 A 78 8 3 B ?

Table 8.3 Uniform missing data pattern.


Subject Visit Treatment Heart Rate Subject Visit Treatment Heart Rate
1 1 A 65 5 1 B 72
1 2 A ? 5 2 B 74
1 3 A ? 5 3 B 76
2 1 A 60 6 1 B 67
2 2 A ? 6 2 B ?
2 3 A ? 6 3 B ?
3 1 A 62 7 1 B 62
3 2 A 68 7 2 B 61
3 3 A 69 7 3 B 65
4 1 A 81 8 1 B 75
4 2 A 75 8 2 B ?
4 3 A 78 8 3 B ?

Table 8.4 Monotonic missing data pattern.


Subject Visit Treatment Heart Rate Subject Visit Treatment Heart Rate
1 1 A 65 5 1 B 72
1 2 A 64 5 2 B 74
1 3 A ? 5 3 B 76
2 1 A 60 6 1 B 67
2 2 A ? 6 2 B 69
2 3 A ? 6 3 B ?
3 1 A 62 7 1 B 62
3 2 A 68 7 2 B 61
3 3 A 69 7 3 B 65
4 1 A 81 8 1 B 75
4 2 A 75 8 2 B ?
4 3 A 78 8 3 B ?

8.4 Missing Data Methodologies


Missing data methodologies can be classified into imputation methodologies and likelihood methodologies. The imputation methodologies are quick but ad hoc, and inference based on them can be misleading. The likelihood-based procedures are computationally much more intensive but are formal data analytic procedures. The imputation procedures are given in Section 8.4.1 and the likelihood methods are discussed in Section 8.4.2. Section 8.6 deals with longitudinal data analysis with missing values.

Table 8.5 Arbitrary missing data pattern.


Subject Visit Treatment Heart Rate Subject Visit Treatment Heart Rate
1 1 A 65 5 1 B 72
1 2 A ? 5 2 B 74
1 3 A ? 5 3 B 76
2 1 A 60 6 1 B 67
2 2 A ? 6 2 B 69
2 3 A 81 6 3 B ?
3 1 A 62 7 1 B 62
3 2 A 68 7 2 B 61
3 3 A 69 7 3 B 65
4 1 A 81 8 1 B 75
4 2 A 75 8 2 B 74
4 3 A 78 8 3 B 72

8.4.1 Missing Data Methodologies: The Methods of Imputation


8.4.1.1 Last Value Carried Forward Imputation
In this method, the last observed value of a subject is carried over to the next missing
observation.

8.4.1.2 Imputation by Related Observation


Sometimes related observations are imputed to fill in the missing values. It may happen in a study that the mother's age and educational status for a child are missing, so the father's information is used to fill in the mother's missing information.

8.4.1.3 Imputation by Unconditional Mean


In this type of imputation procedure, the missing value of a subject is replaced by the
average of the available information of the same variable but from different subjects.

8.4.1.4 Imputation by Conditional Mean


This approach of imputation was discussed by Little and Rubin (1987). Following
Molenberghs et al. (2004), conditional mean imputation can be explained by con-
sidering a single normal sample. The mean and the covariance matrix are calculated
from the complete cases of the data in the first step, and then in the second step, infor-
mation from the first step is used to calculate the conditional mean from a regression
of missing values of a subject conditional on the actual observations. Conditional
mean from the second step is used to replace the missing value.

8.4.1.5 Hot Deck Imputation


Hot deck imputation procedure uses similar responding units from the sample to
replace the missing observations. This technique is one of the commonly used
techniques. For example, if the information about the total number of persons in
a household is missing then that information would be replaced by the total number
of persons in a similar household in that area.

8.4.1.6 Cold Deck Imputation


In this imputation technique, missing observations are replaced with a constant value
from the external sources, such as, a previous survey or study.

8.4.1.7 Imputation by Substitution


In this imputation technique, the missing observations or the nonresponses are substi-
tuted by the information from different sources or subjects which were not included
initially in the survey. This method is usually used in the data collection stage of the
survey. For example, if a previously selected subject was not found during the survey,
then the information would be collected from another subject who was not selected
initially in the survey.

8.4.1.8 Regression Imputation


In this method, predicted values are obtained from a regression of the missing variable on the observed values. For example, if height and weight are measured for 30 students in a class and the weight of one student is missing, then the weights of the other 29 students would be regressed on their heights and the estimated regression coefficients would be used to predict the missing weight for that student's height.
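A brief sketch of three of these imputation rules (Python with pandas and NumPy; the toy data are purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"subject": [1, 1, 1, 2, 2, 2],
                   "visit":   [1, 2, 3, 1, 2, 3],
                   "y":       [65.0, np.nan, 61.0, 72.0, 74.0, np.nan]})

# Last value carried forward, within subject (Section 8.4.1.1)
df["y_locf"] = df.groupby("subject")["y"].ffill()

# Unconditional mean imputation (Section 8.4.1.3)
df["y_mean"] = df["y"].fillna(df["y"].mean())

# Regression imputation (Section 8.4.1.8): regress the incomplete
# variable on an observed one using the complete cases, then predict.
x = np.array([1.60, 1.70, 1.75, 1.80])
w = np.array([55.0, 65.0, np.nan, 80.0])
obs = ~np.isnan(w)
slope, intercept = np.polyfit(x[obs], w[obs], 1)
w_imputed = np.where(obs, w, intercept + slope * x)
```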

8.4.2 Missing Data Methodologies: Likelihood Methods


Since the inception of the seminal paper by Rubin (1976) and the book by Little and Rubin (1987), a substantial amount of research (both theoretical and applied) has appeared in the statistical literature (see, for example, Carpenter et al., 2002, Chen and Ibrahim, 2002, Ibrahim et al., 2005 and Ibrahim and Molenberghs, 2009). Covering
all topics in missing data literature is beyond the scope of this book. Here we give
a general discussion of estimation based on likelihood methods and provide some
examples of application.
In the usual situation where data are assumed to come from a distribution with a parameter $\theta$, where $\theta$ can be vector valued, for example, normal$(\mu, \sigma^2)$, a likelihood $L$ or a log-likelihood $l$ is constructed and maximized to obtain the maximum likelihood estimate (MLE) of $\theta$. The situation becomes complicated when some observations are missing. In this situation the EM algorithm (Dempster et al., 1977) comes in handy. It is a general approach to iterative computation of maximum likelihood estimates when data are incomplete (or missing). The EM algorithm consists of an expectation step followed by a maximization step at each iteration and is guaranteed to converge to a maximum of the likelihood. A cautionary note is that the current EM theory only guarantees convergence to a local maximum of the likelihood, which is not necessarily the MLE.
To put things in perspective, let $X$ be a set of observed data and $Z$ be a set of unobserved data from a probability distribution $f(y|\theta)$. Then the likelihood function can be written as $L(\theta; X, Z) = f(X, Z|\theta)$. The maximum likelihood estimate (MLE) of the unknown parameters $\theta$ is obtained by maximizing the marginal likelihood of the observed data $X$,
$$L(\theta; X) = f(X|\theta) = \sum_Z f(X, Z|\theta).$$
Summation is replaced by integration for continuous data. However, in most situations the summation is either difficult or intractable. The MLE from the marginal likelihood is then obtained by iteratively applying the following two steps:

Expectation step (E step): Calculate the expected value of the log likelihood function with respect to the conditional distribution of $Z$ given $X$ under the current estimate of the parameters $\theta^{(t)}$:
$$Q(\theta|\theta^{(t)}) = E_{Z|X,\theta^{(t)}}\left[\log L(\theta; X, Z)\right].$$

Maximization step (M step): Find the parameter that maximizes this quantity:
$$\theta^{(t+1)} = \arg\max_\theta Q(\theta|\theta^{(t)}).$$

Example 1. (Modified data from Dempster et al., 1977)

This is a motivating and interesting example adapted from Dempster et al. (1977). The data used come from Rao (1965, pages 368-369) and refer to 207 animals that are distributed multinomially into four categories, so that the observed data consist of
$$y = (y_1, y_2, y_3, y_4) = (135, 18, 20, 34).$$
Rao (1965) used a genetic model for the population which specifies cell probabilities
$$\left(\frac{1}{2} + \frac{\pi}{4},\; \frac{1-\pi}{4},\; \frac{1-\pi}{4},\; \frac{\pi}{4}\right)$$
for some $\pi$, $0 \le \pi \le 1$, so that the multinomial likelihood can be written as
$$f(y|\pi) = \frac{(y_1+y_2+y_3+y_4)!}{y_1!\,y_2!\,y_3!\,y_4!} \left(\frac{1}{2} + \frac{\pi}{4}\right)^{y_1} \left(\frac{1-\pi}{4}\right)^{y_2} \left(\frac{1-\pi}{4}\right)^{y_3} \left(\frac{\pi}{4}\right)^{y_4}.$$

Rao (1965) reparameterized $\pi$ by $\pi = (1-\theta)^2$. He then developed a Fisher-scoring procedure for maximizing $f(y|(1-\theta)^2)$ given the observed $y$. Note that the first cell probability $1/2 + \pi/4$ indicates that two underlying categories have been put into one category having frequency 135. To use the EM algorithm we consider $y$ as incomplete data from a five-category multinomial population having cell probabilities $1/2$, $\pi/4$, $(1-\pi)/4$, $(1-\pi)/4$, and $\pi/4$. So, the complete data consist of $x = (x_1, x_2, x_3, x_4, x_5)$, where $y_1 = x_1 + x_2$, $y_2 = x_3$, $y_3 = x_4$, $y_4 = x_5$, and the complete data likelihood is
$$f(x|\pi) = \frac{(x_1+x_2+x_3+x_4+x_5)!}{x_1!\,x_2!\,x_3!\,x_4!\,x_5!} \left(\frac{1}{2}\right)^{x_1} \left(\frac{\pi}{4}\right)^{x_2} \left(\frac{1-\pi}{4}\right)^{x_3} \left(\frac{1-\pi}{4}\right)^{x_4} \left(\frac{\pi}{4}\right)^{x_5}. \qquad (8.7)$$
Note that the EM algorithm has two steps: the E-step and the M-step. Also $x_1$ and $x_2$ in (8.7) are unknown, and need to be estimated through the E-step. But we know that $x_1 + x_2 = 135$, and $P(X_1 = x_1)$ and $P(X_2 = x_2)$ are governed by the conditional cell probabilities
$$\pi_1 = \frac{\frac{1}{2}}{\frac{1}{2} + \frac{\pi}{4}} \quad\text{and}\quad \pi_2 = \frac{\frac{\pi}{4}}{\frac{1}{2} + \frac{\pi}{4}}.$$
Using the binomial distribution with 135 trials, the expected values of $X_1$ and $X_2$ are $x_1 = 135\pi_1$ and $x_2 = 135\pi_2$.

The M-step requires us to obtain the maximum likelihood estimate of $\pi$ from (8.7). By taking the derivative of the log of (8.7) (the log of the multinomial pmf) with respect to $\pi$ and equating it to zero, we obtain
$$\hat\pi = \frac{x_2 + 34}{x_2 + 18 + 20 + 34}.$$
The EM algorithm then proceeds as follows. Let $\pi^{(p)}$ be the value of $\pi$ at the $p$th iteration. Then the value of $x_2$, denoted by $x_2^{(p)}$ (the E-step), is
$$x_2^{(p)} = 135\,\frac{\pi^{(p)}/4}{\frac{1}{2} + \pi^{(p)}/4}, \qquad (8.8)$$
and the value of $\pi$ at the $(p+1)$th iteration (M-step) is
$$\pi^{(p+1)} = \frac{x_2^{(p)} + 34}{x_2^{(p)} + 18 + 20 + 34}. \qquad (8.9)$$

The MLE of $\pi$ is obtained by cycling back and forth between (8.8) and (8.9). It can be seen that, starting with an initial value of $\pi^{(0)} = 0.5$, the algorithm converges in eight steps, as can be seen in Table 8.6. By substituting $x_2^{(p)}$ from equation (8.8) into equation (8.9) and letting $\pi^{(*)} = \pi^{(p)} = \pi^{(p+1)}$, we can explicitly solve a quadratic equation for the maximum likelihood estimate of $\pi$:
$$\pi^{(*)} = (25 + \sqrt{56929})/414 \approx 0.6367101. \qquad (8.10)$$
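The iteration (8.8)-(8.9) is only a few lines of code. A minimal sketch in Python (the function name is ours); starting from π⁽⁰⁾ = 0.5 it reproduces the path shown in Table 8.6:

```python
def em_multinomial(pi0=0.5, tol=1e-10, max_iter=100):
    """EM iteration (8.8)-(8.9) for the modified genetic-linkage data
    y = (135, 18, 20, 34)."""
    pi = pi0
    for _ in range(max_iter):
        x2 = 135 * (pi / 4) / (0.5 + pi / 4)          # E-step (8.8)
        pi_new = (x2 + 34) / (x2 + 18 + 20 + 34)      # M-step (8.9)
        if abs(pi_new - pi) < tol:
            return pi_new
        pi = pi_new
    return pi

print(em_multinomial())   # converges to (25 + sqrt(56929))/414 ≈ 0.6367101
```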

Example 2. (Casella and Berger, 2002)


Modeling incidence of a disease: Suppose that we observe a random sample
X1 , . . . , Xn from Poisson (τi ) and a random sample Y1 , . . . , Yn from Poisson (βτi ). This
represents, for example, modeling incidence of a disease, Yi , where the underlying
rate is a function of an overall effect β and an additional factor τi . The parameter τi
could be a measure of population density or health status of the population in area i.
We do not see τi but get information on it through Xi .

Table 8.6 The EM algorithm (Dempster et al., 1977)


p    π(p)           π(p) − π(∗)     (π(p+1) − π(∗)) ÷ (π(p) − π(∗))
0    0.500000000    0.136710100     0.15031
1    0.616161600    0.020548500     0.13700
2    0.633895000    0.002815090     0.13517
3    0.636329600    0.000380520     0.13280
4    0.636658800    0.000051330     0.13492
5    0.636703200    0.000006930     0.13498
6    0.636709200    0.000000930     0.13420
7    0.636710000    0.000000130     0.13978
8    0.636710114    0.000000014     –

Suppose X1 is missing and we want to estimate the parameters β and τi based on


the remaining data. Now in case of no missing data, the joint distribution of (Xi , Yi ),
i = 1, . . . , n is
$$f((x_1, y_1), \cdots, (x_n, y_n)|\beta, \tau_1, \tau_2, \cdots, \tau_n) = \prod_{i=1}^n \frac{e^{-\beta\tau_i}(\beta\tau_i)^{y_i}}{y_i!}\,\frac{e^{-\tau_i}\tau_i^{x_i}}{x_i!}. \qquad (8.11)$$

Then the maximum likelihood estimates are


$$\hat\beta = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n x_i} \quad\text{and}\quad \hat\tau_j = \frac{x_j + y_j}{\hat\beta + 1}, \quad j = 1, 2, \ldots, n.$$
The joint distribution in (8.11) is the complete data likelihood, and
((x1 , y1 ), . . . , (xn , yn )) is called the complete data. One way to handle estimation of
the parameters when X1 is missing is to delete Y1 and estimate the parameters based
on the (n − 1) pairs (Xi , Yi ), i = 2, . . . , n. But this is ignoring the information in y1 .
Using this information would improve our estimates.
Now, the likelihood of the sample with $x_1$ missing,
$$\sum_{x_1=0}^{\infty} f((x_1, y_1), \ldots, (x_n, y_n)|\beta, \tau_1, \tau_2, \ldots, \tau_n), \qquad (8.12)$$
is the incomplete-data likelihood. This is the likelihood that we need to maximize.


The incomplete-data likelihood is obtained from (8.11) by summing over $x_1$. This gives
$$L(\beta, \tau_1, \tau_2, \cdots, \tau_n|(x_2, y_2), \ldots, (x_n, y_n)) = \prod_{i=1}^n \frac{e^{-\beta\tau_i}(\beta\tau_i)^{y_i}}{y_i!} \prod_{i=2}^n \frac{e^{-\tau_i}\tau_i^{x_i}}{x_i!}, \qquad (8.13)$$
and $(y_1, (x_2, y_2), \ldots, (x_n, y_n))$ is the incomplete data (note that the summation over the likelihood (8.11) is over the factor $\sum_{x_1=0}^{\infty} e^{-\tau_1}\tau_1^{x_1}/x_1!$, which is 1). So we need to
maximize the likelihood (8.13). Differentiation leads to the ML equations
$$\hat\beta = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n \hat\tau_i}, \qquad y_1 = \hat\tau_1\hat\beta, \qquad x_j + y_j = \hat\tau_j(\hat\beta + 1), \quad j = 2, \ldots, n. \qquad (8.14)$$

These equations can be solved iteratively.


We now use the EM algorithm to obtain the MLEs. Let $(x, y) = ((x_1, y_1), \ldots, (x_n, y_n))$ denote the complete data and $(x_{(-1)}, y) = (y_1, (x_2, y_2), \ldots, (x_n, y_n))$ denote the incomplete data. The expected (the expectation is over the missing data, that is, $x_1$ here) complete data log likelihood is
$$
\begin{aligned}
E[\log L(\beta, &\tau_1, \tau_2, \cdots, \tau_n|(x, y))\,|\,\tau^{(r)}, (x_{(-1)}, y)] \\
&= \sum_{x_1=0}^{\infty} \log\left[\prod_{i=1}^n \frac{e^{-\beta\tau_i}(\beta\tau_i)^{y_i}}{y_i!}\,\frac{e^{-\tau_i}\tau_i^{x_i}}{x_i!}\right]\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!} \\
&= \sum_{i=1}^n\left[-\beta\tau_i + y_i(\log\beta + \log\tau_i) - \log y_i!\right] + \sum_{i=2}^n\left[-\tau_i + x_i\log\tau_i - \log x_i!\right] \\
&\qquad + \sum_{x_1=0}^{\infty}\left[-\tau_1 + x_1\log\tau_1 - \log x_1!\right]\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!} \\
&= \left\{\sum_{i=1}^n\left[-\beta\tau_i + y_i(\log\beta + \log\tau_i)\right] + \sum_{i=2}^n\left[-\tau_i + x_i\log\tau_i\right] + \sum_{x_1=0}^{\infty}\left[-\tau_1 + x_1\log\tau_1\right]\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}\right\} \\
&\qquad - \left\{\sum_{i=1}^n\log y_i! + \sum_{i=2}^n\log x_i! + \sum_{x_1=0}^{\infty}\log x_1!\,\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}\right\}, \qquad (8.15)
\end{aligned}
$$
where in the last equality we have grouped together terms involving $\beta$ and $\tau_i$, and terms that do not involve these parameters. Since we are calculating this expected log likelihood for the purpose of maximizing it in $\beta$ and $\tau_i$, we can ignore the terms in the second set of braces, because
$$\sum_{i=1}^n\log y_i! + \sum_{i=2}^n\log x_i! + \sum_{x_1=0}^{\infty}\log x_1!\,\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!}$$
is a constant. Thus, we have to maximize only the terms in the first set of braces. Now,
$$\sum_{x_1=0}^{\infty}\left[-\tau_1 + x_1\log\tau_1\right]\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!} = -\tau_1 + \log\tau_1\sum_{x_1=0}^{\infty} x_1\,\frac{e^{-\tau_1^{(r)}}(\tau_1^{(r)})^{x_1}}{x_1!} = -\tau_1 + \tau_1^{(r)}\log\tau_1.$$
Substituting this back into (8.15), apart from a constant, the expected complete data log-likelihood is
$$\sum_{i=1}^n\left[-\beta\tau_i + y_i(\log\beta + \log\tau_i)\right] + \sum_{i=1}^n\left(-\tau_i + x_i\log\tau_i\right).$$
This is the same as the original complete data log-likelihood, with $x_1$ replaced by $\tau_1^{(r)}$. Thus, in the $r$th step, the MLEs are only a minor variation of (8.14) and are given by

$$\hat\beta^{(r+1)} = \frac{\sum_{i=1}^n y_i}{\tau_1^{(r)} + \sum_{i=2}^n x_i}, \qquad \hat\tau_1^{(r+1)} = \frac{\tau_1^{(r)} + y_1}{\hat\beta^{(r+1)} + 1}, \qquad \hat\tau_j^{(r+1)} = \frac{x_j + y_j}{\hat\beta^{(r+1)} + 1}, \quad j = 2, \ldots, n. \qquad (8.16)$$
This defines both the E-step (which results in the substitution of $\hat\tau_1^{(r)}$ for $x_1$) and the M-step (which results in the calculation in (8.16) for the MLEs at the $r$th iteration). Note that the final solution is obtained by substituting some initial value for $\hat\tau_1^{(r)}$ in the first two equations of (8.16) and then cycling back and forth.
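A minimal sketch of the cycle (8.16) in Python (NumPy; the function name is ours); at convergence the estimates satisfy the direct solution (8.17) in Exercise 1 below:

```python
import numpy as np

def em_poisson(x, y, tau1_init=1.0, tol=1e-10, max_iter=1000):
    """EM iteration (8.16) for the Poisson model of Example 2 when x[0]
    is missing; x[0] is ignored and x_1 is imputed by the current tau_1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    tau1 = tau1_init
    for _ in range(max_iter):
        beta = y.sum() / (tau1 + x[1:].sum())     # beta^(r+1) in (8.16)
        tau1_new = (tau1 + y[0]) / (beta + 1.0)   # tau_1^(r+1) in (8.16)
        if abs(tau1_new - tau1) < tol:
            tau1 = tau1_new
            break
        tau1 = tau1_new
    tau = (x + y) / (beta + 1.0)                  # tau_j^(r+1), j >= 2
    tau[0] = tau1
    return beta, tau
```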
Exercise 1. Refer to Example 2 above. Show that
(a) the maximum likelihood estimators from the complete data likelihood (8.11) are given by
$$\hat\beta = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n x_i} \quad\text{and}\quad \hat\tau_j = \frac{x_j + y_j}{\hat\beta + 1}, \quad j = 1, 2, \ldots, n;$$
and
(b) a direct solution of the original (incomplete-data) likelihood equations is possible. Show that the solution to (8.14) is given by
$$\hat\beta = \frac{\sum_{i=2}^n y_i}{\sum_{i=2}^n x_i}, \qquad \hat\tau_1 = \frac{y_1}{\hat\beta}, \qquad \hat\tau_j = \frac{x_j + y_j}{\hat\beta + 1}, \quad j = 2, 3, \ldots, n. \qquad (8.17)$$
Exercise 2. Use the model of Example 2 on the data in the following table
adapted from Lange et al. (1994). These are leukemia counts and the associated pop-
ulations for a number of areas in New York State.

Table 8.7 Counts of leukemia cases


Population          3540  3560  3739  2784  2571  2729  3952   993  1908
Number of cases        3     4     1     1     3     1     2     0     2
Population           948  1172  1047  3138  5485  5554  2943  4969  4828
Number of cases        0     1     3     5     4     6     2     5     4

(a) Fit the Poisson model to these data both for the full data set and for an “in-
complete” data set where we suppose that the first population count (x1 = 3540) is
missing.
(b) Suppose that instead of having an x value missing, we actually have lost a
leukemia count (assume that y1 = 3 is missing). Use the EM algorithm to find the
MLEs in this case, and compare your answer to those of part (a).
8.5 Analysis of Zero-inflated Count Data With Missing Values
Discrete data in the form of counts often exhibit extra variation that cannot be ex-
plained by a simple model, such as the binomial or the Poisson. Also, these data
sometimes show more zero counts than what can be predicted by a simple model.
Therefore, a discrete model (Poisson or binomial) may fail to fit a set of discrete data
either because of zero-inflation, or because of over-dispersion, or because there is
zero-inflation as well as over-dispersion in the data. Deng and Paul (2005) developed
score tests for zero-inflation and over-dispersion in generalized linear models. Mian
and Paul (2016) developed estimation procedures for zero-inflated over-dispersed
count data regression model with missing responses. Here we discuss these proce-
dures in detail.
Let Y be a discrete count data random variable. The simplest model for such a
random variable is the Poisson, which has the probability mass function

$$f(y; \mu) = \frac{e^{-\mu}\mu^y}{y!}. \qquad (8.18)$$
However, data may show evidence of over-dispersion (variance is larger than the
mean). A popular over-dispersed count data model is the two parameter negative
binomial model. Different authors have used different parameterizations for the neg-
ative binomial distribution (see, for example, Paul and Plackett, 1978; Barnwal and
Paul, 1988; Paul and Banerjee, 1998; Piegorsch, 1990). Let Y be a negative binomial
random variable with mean parameter µ and dispersion parameter c. Then, using the
terminology of Paul and Plackett (1978), Y has the probability mass function
$$f(y; \mu, c) = \frac{\Gamma(y + c^{-1})}{y!\,\Gamma(c^{-1})}\left(\frac{c\mu}{1 + c\mu}\right)^y\left(\frac{1}{1 + c\mu}\right)^{c^{-1}}, \qquad (8.19)$$
for $y = 0, 1, \ldots$, $\mu > 0$. Now, $\mathrm{Var}(Y) = \mu(1 + \mu c)$ and $c > -1/\mu$. This is
the extended negative binomial distribution of Prentice (1986) which takes account
of over-dispersion as well as under-dispersion. Obviously, when c = 0, variance of
the NB(µ, c) distribution becomes that of the Poisson(µ) distribution. Moreover, it
can be shown that the limiting distribution of the NB(µ, c) distribution, as c → 0, is
the Poisson(µ).
Using the mass function in equation (8.19), the zero-inflated negative binomial regression model (see Deng and Paul, 2005) can be written as
$$f(y_i|x_i; \mu, c, \omega) = \begin{cases} \omega + (1-\omega)\left(\dfrac{1}{1+c\mu}\right)^{c^{-1}} & \text{if } y = 0, \\[8pt] (1-\omega)\,\dfrac{\Gamma(y+c^{-1})}{y!\,\Gamma(c^{-1})}\left(\dfrac{c\mu}{1+c\mu}\right)^y\left(\dfrac{1}{1+c\mu}\right)^{c^{-1}} & \text{if } y > 0, \end{cases} \qquad (8.20)$$
with $E(Y) = (1-\omega)\mu$ and $\mathrm{Var}(Y) = (1-\omega)\mu[1 + (c+\omega)\mu]$, where $\omega$ is the zero-inflation parameter. We denote this distribution as ZINB$(\mu, c, \omega)$.
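For reference, a sketch of the ZINB(µ, c, ω) log mass function (Python with SciPy; gammaln is the log-gamma function and the function name is ours), valid for c > 0:

```python
import numpy as np
from scipy.special import gammaln

def zinb_logpmf(y, mu, c, omega):
    """Log mass function of ZINB(mu, c, omega) in (8.20), c > 0."""
    nb0 = -np.log1p(c * mu) / c                       # log NB(0; mu, c)
    if y == 0:
        return np.log(omega + (1 - omega) * np.exp(nb0))
    log_nb = (gammaln(y + 1 / c) - gammaln(y + 1) - gammaln(1 / c)
              + y * np.log(c * mu / (1 + c * mu)) + nb0)
    return np.log(1 - omega) + log_nb
```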
Regression analysis of count data may be further complicated by the existence
of missing values either in the response variable and/or in the explanatory variables
(covariates). Extensive work has been done on regression analysis of continuous re-
sponse data with some missing responses under normality assumption. See, for ex-
ample, Rubin (1976), Little and Rubin (1987), Anderson and Taylor (1976), Geweke
(1986), Raftery et al. (1997), Chen et al. (2001), Kelly (2007), Zhang and Huang
(2008).
Some work on missing values has also been done on logistic regression analy-
sis of discrete data. See, for example, Ibrahim (1990), Lipsitz and Ibrahim (1996),
Ibrahim and Lipsitz (1996), Ibrahim et al. (1999, 2001), Ibrahim et al. (2005), Sinha
and Maiti (2008), Maiti and Pradhan (2009).
Exercise 3.
(a) Derive the negative binomial distribution as a Gamma(α, β) mixture of the
Poisson(µ), reparameterize, and show that it can be written as the probability mass
function given in equation (8.19) (see Paul and Plackett, 1978).
(b) Derive the mean and variance of the NB(µ, c) (hint: find the unconditional mean
and unconditional variance of a mixture distribution).
(c) Verify that the mean and variance of a zero-inflated negative binomial distribution
are those given in this chapter.
Suppose data for the $i$th of $n$ subjects are $(y_i, x_i)$, $i = 1, \ldots, n$, which are realizations from ZINB$(\mu_i, c, \omega)$, where $y_i$ represents the response variable and $x_i$ represents a $p \times 1$ vector of covariates with the regression parameter $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$, such that $\mu_i = \exp(\sum_{j=1}^p X_{ij}\beta_j)$. Here $\beta_1$ is the intercept parameter, in which case $X_{i1} = 1$ for all $i$. In Subsection 8.5.1 we show ML estimation of the parameters with no missing data. Subsection 8.5.2 deals with different scenarios of missingness.

8.5.1 Estimation of the Parameters with No Missing Data


For complete data the likelihood function is
$$L(\beta, c, \omega|\mathbf{y}) = \prod_{i=1}^n \left[\{\omega + (1-\omega) f(0; \mu_i, c)\}\, I_{\{y_i=0\}} + (1-\omega) f(y_i; \mu_i, c)\, I_{\{y_i>0\}}\right], \qquad (8.21)$$
where $f(\cdot; \mu_i, c)$ is the negative binomial mass function (8.19).

Writing $\gamma = \omega/(1-\omega)$, the log likelihood, apart from a constant, can be written as
$$
\begin{aligned}
l(\beta, c, \gamma|\mathbf{y}) &= \sum_{i=1}^n \Big\{-\log(1+\gamma) + \log[\gamma + f(0; \mu_i, c)]\, I_{\{y_i=0\}} + \log f(y_i; \mu_i, c)\, I_{\{y_i>0\}}\Big\} \\
&= \sum_{i=1}^n \bigg(-\log(1+\gamma) + \log\big\{\gamma + \exp[-c^{-1}\log(1+\mu_i c)]\big\}\, I_{\{y_i=0\}} \\
&\qquad + \Big[y_i\log\mu_i - (y_i + c^{-1})\log(1+\mu_i c) + \sum_{l=1}^{y_i}\log\{1 + (l-1)c\}\Big]\, I_{\{y_i>0\}}\bigg). \qquad (8.22)
\end{aligned}
$$
The parameters $\beta_j$, $c$ and $\gamma$ can be estimated by directly maximizing the log-likelihood function (8.22) or by simultaneously solving the following estimating equations:
$$\frac{\partial l}{\partial \beta_j} = \sum_{i=1}^n \left\{\frac{-(1+\mu_i c)^{-1}\exp[-c^{-1}\log(1+\mu_i c)]}{\gamma + \exp[-c^{-1}\log(1+\mu_i c)]}\, I_{\{y_i=0\}} + \left[\frac{y_i}{\mu_i} - \frac{c(y_i + c^{-1})}{1+\mu_i c}\right] I_{\{y_i>0\}}\right\}\frac{\partial \mu_i}{\partial \beta_j} = 0, \qquad (8.23)$$

$$\frac{\partial l}{\partial c} = \sum_{i=1}^n \left\{\frac{[-\mu_i c^{-1}(1+\mu_i c)^{-1} + c^{-2}\log(1+\mu_i c)]\exp[-c^{-1}\log(1+\mu_i c)]}{\gamma + \exp[-c^{-1}\log(1+\mu_i c)]}\, I_{\{y_i=0\}} + \left[\mu_i(y_i + c^{-1})(1+\mu_i c)^{-1} - c^{-2}\log(1+\mu_i c) + \sum_{l=1}^{y_i}\frac{l-1}{1+(l-1)c}\right] I_{\{y_i>0\}}\right\} = 0, \qquad (8.24)$$

and
$$\frac{\partial l}{\partial \gamma} = \sum_{i=1}^n \left\{-(1+\gamma)^{-1} + \big[\gamma + \exp[-c^{-1}\log(1+\mu_i c)]\big]^{-1} I_{\{y_i=0\}} + 0\cdot I_{\{y_i>0\}}\right\} = 0, \qquad (8.25)$$
where
$$\frac{\partial \mu_i}{\partial \beta_j} = X_{ij}\exp\left(\sum_{j=1}^p X_{ij}\beta_j\right).$$
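Rather than solving (8.23)-(8.25) simultaneously, one can maximize (8.22) numerically. A sketch (Python with SciPy); we parameterize $c = \exp(\cdot)$ and $\gamma = \exp(\cdot)$ so that both stay positive, which restricts ω to (0, 1), and the function name is ours:

```python
import numpy as np
from scipy.optimize import minimize

def zinb_negloglik(theta, y, X):
    """Negative of (8.22); theta = (beta, log c, log gamma)."""
    p = X.shape[1]
    beta, c, gamma = theta[:p], np.exp(theta[p]), np.exp(theta[p + 1])
    mu = np.exp(X @ beta)
    nb0 = np.exp(-np.log1p(c * mu) / c)               # NB mass at zero
    ll = -np.log1p(gamma) * len(y)                    # -log(1+gamma) term
    zero = y == 0
    ll += np.sum(np.log(gamma + nb0[zero]))
    yi, mi = y[~zero], mu[~zero]
    ll += np.sum(yi * np.log(mi) - (yi + 1 / c) * np.log1p(c * mi))
    # sum_{l=1}^{y_i} log{1 + (l-1)c} for each positive count
    ll += sum(np.sum(np.log1p(np.arange(k) * c)) for k in yi.astype(int))
    return -ll

# fit = minimize(zinb_negloglik, x0=np.zeros(X.shape[1] + 2),
#                args=(y, X), method="BFGS")
```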

Exercise 4.
(a) Obtain maximum likelihood estimates of the parameters µ and c of model
(8.19) for the modified DMFT index data given in Table 8A, 8B, 8C, 8D, 8E, 8F, 8G,
8H, 8I, and 8J.
(b) Obtain maximum likelihood estimates of the parameters µ, c and ω of model
(8.20) for the modified DMFT index data given in Table 8A, 8B, 8C, 8D, 8E, 8F, 8G,
8H, 8I, and 8J.

8.5.2 Estimation of the Parameters with Missing Responses


8.5.2.1 Estimation under MCAR
In the case of MCAR, missingness of the data does not depend on the observed data, and the subjects having missing observations are deleted before the analysis. For the estimation procedure, the likelihood function remains the same as given in equation (8.21), with a reduced sample size comprising only the complete observations. Note that this is the so-called complete-data analysis, which results in loss of information. For example, if every subject has at least one missing data component, which could be a response or a covariate, then with the complete-data-only analysis there will be no data left to analyze.
8.5.2.2 Estimation under MAR
As some of the observations in the response may be missing, we write the response $y_i$ as
$$y_i = \begin{cases} y_{o,i} & \text{if } y_i \text{ is observed,} \\ y_{m,i} & \text{if } y_i \text{ is missing.} \end{cases} \qquad (8.26)$$

Using this in $f(y_i|x_i; \mu, c, \omega)$ given in equation (8.20), the log-likelihood is
$$l(\psi|Y_o, Y_m, X) = \sum_{i=1}^n \log f(y_i|x_i, \psi) = \sum_{i=1}^n\left\{\log\left(\frac{\gamma + f(0; \mu_i, c)}{1 + \gamma}\right) I_{\{y_i=0\}} + \log f(y_i; \mu_i, c)\, I_{\{y_i>0\}}\right\}, \qquad (8.27)$$
where $Y_o$ is the vector of observed values, $Y_m$ is the vector of missing values, $\psi = (\beta, c, \gamma)$ and $\mu_i = \exp(\sum_{j=1}^p X_{ij}\beta_j)$.
Under MAR, the conditional probability of missingness depends only on the observed data. The parameters of the missingness mechanism are completely separate and distinct from the parameters of model (8.20). In likelihood-based estimation under MAR, the missingness mechanism can therefore be ignored in the likelihood, and data that are missing at random are often said to be ignorably missing (ignorable nonresponse); however, the subjects having these missing observations cannot be deleted before the analysis (see Little and Rubin (1987) and Ibrahim et al. (2005) for detailed discussion).
In this scenario, our goal is to maximize the following log-likelihood with respect to the parameters $\psi$:
$$l(\psi|Y_o, X) = \sum_{Y_m} l(\psi|Y_o, Y_m, X). \qquad (8.28)$$

Note that l(ψ|Yo , X) is the log-likelihood when the missing data indicators, which are
also part of the observed data, are not used.
In the more general case where missing data are not MAR, this likelihood would
remain the same but a distribution defining the missing data mechanism needs to be
included in the model. This general case is explained in the following section.
Direct maximization of $l(\psi|Y_o, X)$ is not, in general, straightforward, so we use the EM algorithm. As explained earlier, the EM algorithm uses an expectation step (E-step) and a maximization step (M-step). Following Little and Rubin (1987), the E-step provides the conditional expectation of the log-likelihood $l(\psi|y_{o,i}, y_{m,i}, x_i)$ given the observed data $(y_{o,i}, x_i)$ and the current estimate of the parameters $\psi$. Suppose $A$ of the $n$ responses are observed and $B = n - A$ responses are missing, and let $s$ index the iterations of the maximization of the log-likelihood. Then the E-step of the EM algorithm for the $i$th missing response at the $(s+1)$th
iteration can be written as
$$Q_i(\psi|\psi^{(s)}) = E[l(\psi|y_{o,i}, y_{m,i}, x_i)\,|\,y_{o,i}, x_i, \psi^{(s)}] = \sum_{y_{m,i}} l(\psi|y_{o,i}, y_{m,i}, x_i)\, P(y_{m,i}|y_{o,i}, x_i, \psi^{(s)}). \qquad (8.29)$$

For all the observations, the E-step of the EM algorithm for the $(s+1)$th iteration is
$$Q(\psi|\psi^{(s)}) = \sum_{i=1}^A l(\psi|y_i) + \sum_{i=1}^B \sum_{y_{m,i}} l(\psi|y_{o,i}, y_{m,i}, x_i)\, P(y_{m,i}|y_{o,i}, x_i, \psi^{(s)}). \qquad (8.30)$$

Note that when there is no missing response, the EM algorithm requires only maximization of the first term on the right-hand side.
Here $P(y_{m,i}|y_{o,i}, x_i, \psi^{(s)})$ is the conditional distribution of the missing response given the observed data and the current ($s$th iteration) estimate of $\psi$. In many situations, however, $P(y_{m,i}|y_{o,i}, x_i, \psi^{(s)})$ may not be available in closed form. Following Ibrahim et al. (2001) and Sahu and Roberts (1999), we can write $P(y_{m,i}|y_{o,i}, x_i, \psi^{(s)}) \propto P(y_i|x_i, \psi^{(s)})$ (the complete-data distribution given in (8.20)). For the $i$th of the $B$ missing responses, we take a sample $a_{i1}, a_{i2}, \ldots, a_{im_i}$ from $P(y_i|x_i, \psi^{(s)})$ using the Gibbs sampler (see Casella and George (1992) for details). Then, following Ibrahim et al. (2001), $Q(\psi|\psi^{(s)})$ can be written as
$$Q(\psi|\psi^{(s)}) = \sum_{i=1}^{A} l(\psi|y_i) + \sum_{i=1}^{B} \frac{1}{m_i}\sum_{k=1}^{m_i} l(\psi|y_{o,i}, x_i, a_{ik}). \qquad (8.31)$$

In the M-step, $Q(\psi|\psi^{(s)})$ is maximized. Maximizing $Q(\psi|\psi^{(s)})$ is analogous to maximizing the complete-data log-likelihood in which each incomplete response is replaced by $m_i$ weighted observations. More details of the EM algorithm by the method of weights can be found in Ibrahim (1990), Lipsitz and Ibrahim (1996), Ibrahim and Lipsitz (1996), Ibrahim et al. (1999, 2001, 2005), Sinha and Maiti (2007), and Maiti and Pradhan (2009).
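To make the method of weights concrete, the following is a minimal R sketch of a Monte Carlo EM for an intercept-only zero-inflated negative binomial model with responses missing at random, parameterized directly by (µ, c, ω) rather than (β, c, γ); the function names, fixed iteration counts, and absence of convergence diagnostics are illustrative choices, not code from Mian and Paul (2016):

dzinb <- function(y, mu, c, omega) {
  # log density of the zero-inflated negative binomial:
  # P(0) = omega + (1 - omega) * NB(0); P(y) = (1 - omega) * NB(y), y > 0
  p0 <- omega + (1 - omega) * dnbinom(0, size = 1 / c, mu = mu)
  log(ifelse(y == 0, p0, (1 - omega) * dnbinom(y, size = 1 / c, mu = mu)))
}

mcem_zinb <- function(y, m = 50, n_iter = 30,
                      start = c(log(mean(y, na.rm = TRUE)), log(0.1), qlogis(0.1))) {
  obs <- !is.na(y)
  par <- start                                  # (log mu, log c, logit omega)
  for (s in seq_len(n_iter)) {
    mu <- exp(par[1]); cc <- exp(par[2]); om <- plogis(par[3])
    # E-step: m pseudo-responses per missing case, drawn from the current model
    nmis  <- sum(!obs)
    draws <- rbinom(nmis * m, 1, 1 - om) * rnbinom(nmis * m, size = 1 / cc, mu = mu)
    # M-step: maximize the weighted complete-data log-likelihood, as in (8.31)
    negQ <- function(p) {
      -(sum(dzinb(y[obs], exp(p[1]), exp(p[2]), plogis(p[3]))) +
          sum(dzinb(draws, exp(p[1]), exp(p[2]), plogis(p[3]))) / m)
    }
    par <- optim(par, negQ)$par
  }
  c(mu = exp(par[1]), c = exp(par[2]), omega = plogis(par[3]))
}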
The covariance matrix of the parameter estimators is calculated by inverting the observed information matrix at convergence (Efron and Hinkley, 1978), which is
$$H_{\psi\psi^T} = Q''(\psi|\psi^{(s)}) = \sum_{i=1}^{A} \frac{\partial^2 l(\psi|y_i)}{\partial\psi\,\partial\psi^T} + \sum_{i=1}^{B} \frac{1}{m_i}\sum_{k=1}^{m_i} \frac{\partial^2 l(\psi|y_{o,i}, x_i, a_{ik})}{\partial\psi\,\partial\psi^T}. \qquad (8.32)$$
Let $A(\mu, c) = \exp\{-c^{-1}\log(1+\mu c)\}$. Expressions for the elements of $H$ are
$$
\frac{\partial^2 l}{\partial\beta_j^2} = \sum_{i=1}^{n} \frac{A(\mu,c)[\gamma(c+1)+cA(\mu,c)]}{(1+\mu c)^2[\gamma+A(\mu,c)]^2}\left(\frac{\partial\mu_i}{\partial\beta_j}\right)^{\!2} I_{\{y_i=0\}}
- \sum_{i=1}^{n}\left[\frac{(1+\mu c)^{-1}A(\mu,c)}{\gamma+A(\mu,c)}\, I_{\{y_i=0\}} - \left(\frac{y_i}{\mu} - \frac{c(y_i+c^{-1})}{1+\mu c}\right) I_{\{y_i>0\}}\right]\frac{\partial^2\mu_i}{\partial\beta_j^2}
+ \sum_{i=1}^{n}\left[\frac{c^2(y_i+c^{-1})}{(1+\mu c)^2} - \frac{y_i}{\mu^2}\right]\left(\frac{\partial\mu_i}{\partial\beta_j}\right)^{\!2} I_{\{y_i>0\}},
$$
$$
\frac{\partial^2 l}{\partial\beta_j\,\partial\beta_j^T} = \sum_{i=1}^{n} \frac{A(\mu,c)[\gamma(c+1)+cA(\mu,c)]}{(1+\mu c)^2[\gamma+A(\mu,c)]^2}\,\frac{\partial\mu_i}{\partial\beta_j}\frac{\partial\mu_i}{\partial\beta_j^T}\, I_{\{y_i=0\}}
- \sum_{i=1}^{n}\left\{\frac{(1+\mu c)^{-1}A(\mu,c)}{\gamma+A(\mu,c)}\, I_{\{y_i=0\}} - \left(\frac{y_i}{\mu} - \frac{c(y_i+c^{-1})}{1+\mu c}\right) I_{\{y_i>0\}}\right\}\frac{\partial^2\mu_i}{\partial\beta_j\,\partial\beta_j^T}
+ \sum_{i=1}^{n}\left[\frac{c^2(y_i+c^{-1})}{(1+\mu c)^2} - \frac{y_i}{\mu^2}\right]\frac{\partial\mu_i}{\partial\beta_j}\frac{\partial\mu_i}{\partial\beta_j^T}\, I_{\{y_i>0\}},
$$
$$
\frac{\partial^2 l}{\partial\beta_j\,\partial c} = \sum_{i=1}^{n} \frac{A(\mu,c)}{(1+\mu c)[\gamma+A(\mu,c)]^2}\left[\frac{\gamma\mu(1+c)+\mu cA(\mu,c)}{c(1+\mu c)} - \gamma c^{-2}\log(1+\mu c)\right]\frac{\partial\mu_i}{\partial\beta_j}\, I_{\{y_i=0\}}
- \sum_{i=1}^{n} \frac{c(1+\mu c)(y_i-c^{-2}) + (1+2\mu c)(y_i+c^{-1})}{(1+\mu c)^2}\,\frac{\partial\mu_i}{\partial\beta_j}\, I_{\{y_i>0\}},
$$
$$
\frac{\partial^2 l}{\partial c^2} = \sum_{i=1}^{n} \frac{A(\mu,c)}{[\gamma+A(\mu,c)]^2}\left\{\frac{\mu^2(1+\mu c)+2\mu c}{c^2(1+\mu c)^2} - 2c^{-3}\log(1+\mu c)\right\} I_{\{y_i=0\}}
+ \sum_{i=1}^{n} \frac{A(\mu,c)[1-A(\mu,c)]}{[\gamma+A(\mu,c)]^2}\left(\frac{\mu}{c(1+\mu c)} - c^{-2}\log(1+\mu c)\right)^{\!2} I_{\{y_i=0\}}
+ \sum_{i=1}^{n}\left[-\frac{\mu^2(y_i+c^{-1})}{(1+\mu c)^2} - \frac{2\mu}{c^2(1+\mu c)} + 2c^{-3}\log(1+\mu c)\right] I_{\{y_i>0\}},
$$
and
$$
\frac{\partial^2 l}{\partial c\,\partial\gamma} = \sum_{i=1}^{n}\left\{\frac{A(\mu,c)}{[\gamma+A(\mu,c)]^2}\left[\frac{\mu}{c(1+\mu c)} - c^{-2}\log(1+\mu c)\right] I_{\{y_i=0\}} + 0\cdot I_{\{y_i>0\}}\right\},
$$
$$
\frac{\partial^2 l}{\partial\beta_j\,\partial\gamma} = \sum_{i=1}^{n}\left[\frac{\exp\{-c^{-1}\log(1+\mu c)\}}{(1+\mu c)[\gamma+A(\mu,c)]^2}\,\frac{\partial\mu_i}{\partial\beta_j}\, I_{\{y_i=0\}} + 0\cdot I_{\{y_i>0\}}\right],
$$
$$
\frac{\partial^2 l}{\partial\gamma^2} = \sum_{i=1}^{n}\left[\left\{(1+\gamma)^{-2} - [\gamma+A(\mu,c)]^{-2}\right\} I_{\{y_i=0\}} + 0\cdot I_{\{y_i>0\}}\right].
$$
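In practice these expressions can also be cross-checked numerically: the Hessian of the weighted complete-data negative log-likelihood at convergence estimates the same observed information. A short sketch, where negllw() denotes a user-written weighted negative log-likelihood and start a starting vector (both assumed names):

## Standard errors from the numerically evaluated observed information matrix
fit <- optim(start, negllw, hessian = TRUE)
se  <- sqrt(diag(solve(fit$hessian)))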
8.5.2.3 Estimation under MNAR
Under MNAR, the probability of a missing observation in the response variable depends on the covariates and on the values of the response that would have been observed. It is then necessary to incorporate this missing data mechanism into the likelihood. Missing observations that follow this mechanism are known as nonignorably missing. To incorporate the missing data mechanism into the likelihood, we define a binary random variable $r_i$ ($i = 1, 2, \ldots, n$) as
$$r_i = \begin{cases} 0 & \text{if } y_i \text{ is observed},\\ 1 & \text{if } y_i \text{ is missing}. \end{cases} \qquad (8.33)$$

Obviously, as a binary random variable, $r_i$ follows
$$p(r_i|y_i, x_{ij}) = [p(r_i=1)]^{r_i}\,[1-p(r_i=1)]^{1-r_i}. \qquad (8.34)$$

See Ibrahim et al. (2001). To model the probability of missingness in terms of the values of the responses that would have been observed and the covariates, a logit link,
$$\log\!\left[\frac{p(r_i=1)}{1-p(r_i=1)}\right] = \nu_0 + \nu_1 y_i + \nu_2 x_{i1} + \nu_3 x_{i2} + \cdots + \nu_{p+1} x_{ip}, \qquad (8.35)$$
can be used, where $y_i$ denotes the response, whether observed or not, and $x_{ij}$ ($j = 1, 2, \ldots, p$) are the covariates. Denote the $(p+2)$-dimensional parameter vector as $\nu = (\nu_0, \nu_1, \nu_2, \ldots, \nu_{p+1})$. Note that $p(r_i=1)$ can now be written as a logistic model
$$p(r_i=1) = \frac{\exp(\nu_0 + \nu_1 y_i + \nu_2 x_{i1} + \nu_3 x_{i2} + \cdots + \nu_{p+1} x_{ip})}{1 + \exp(\nu_0 + \nu_1 y_i + \nu_2 x_{i1} + \nu_3 x_{i2} + \cdots + \nu_{p+1} x_{ip})}. \qquad (8.36)$$
Then, the log-likelihood function of the parameter $\nu$ can be written as
$$l(\nu|r_i, y_i, x_{ij}) = \sum_{i=1}^{n}\left\{ r_i \log\!\left[\frac{p(r_i=1)}{1-p(r_i=1)}\right] + \log(1-p(r_i=1)) \right\}. \qquad (8.37)$$

Note that the choice of variables for the model of $r_i$ is important. Many of the variables in this model will not necessarily be significant and, more importantly, the parameters in the model for $r_i$ are not of primary interest. Detailed discussion can be found in Ibrahim, Lipsitz, and Chen (1999) and Ibrahim, Chen, and Lipsitz (2001).
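For reference, fitting the logistic missingness model (8.35)-(8.37) is a standard logistic regression once values of $y_i$ are available (observed values, or imputed draws within the EM under MNAR); a minimal sketch with illustrative variable names r, y, x1, x2 in a data frame dat:

miss_fit <- glm(r ~ y + x1 + x2, family = binomial, data = dat)
coef(miss_fit)                                   # estimates of the nu parameters
p_miss <- predict(miss_fit, type = "response")   # fitted P(r_i = 1)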
Following Ibrahim, Lipsitz, and Chen (1999), after incorporating the model for the missingness mechanism, the log-likelihood for all the parameters involved is
$$
l(\psi,\nu|Y, X, r) = \sum_{i=1}^{n}\Big\{-\log(1+\gamma) + \log[\gamma+f(0;\mu_i,c,\omega)]\, I_{\{y_i=0\}} + \log f(y_i;\mu_i,c,\omega)\, I_{\{y_i>0\}}\Big\}
+ \sum_{i=1}^{n}\left\{ r_i \log\!\left[\frac{p(r_i=1)}{1-p(r_i=1)}\right] + \log(1-p(r_i=1)) \right\}. \qquad (8.38)
$$
Note that the two parts of this log-likelihood are separate and their parameters are distinct. This characteristic of the log-likelihood facilitates separate maximization. The rest of the estimation procedure under MNAR is exactly the same as under MAR. Note that, as it stands, the log-likelihood in (8.38) is not computable; it becomes computable once it is plugged into the EM algorithm.
The theoretical development here is flexible. If tests (Deng and Paul, 2005) or other data visualization procedures show no evidence of zero-inflation or over-dispersion in the data, then we can start with a simpler model, such as the Poisson model, the negative binomial model, or the zero-inflated Poisson model. Note that the development here concerns missing response data $y_i$ only; some of the covariates $x_{ij}$ may also be missing, which has not been dealt with here as it needs separate theoretical development, and has been left for future research.
Example 3.
Mian and Paul (2016) analyzed a set of data from a prospective study of the dental status of school children from Böhning et al. (1999). Here we report the results of that analysis; for more detailed information, see Mian and Paul (2016). The children were all 7 years of age at the beginning of the study. Dental status was measured by the decayed, missing and filled teeth (DMFT) index. Only the eight deciduous molars were considered, so the smallest possible value of the DMFT index is 0 and the largest is 8. The prospective study ran for a period of two years; the DMFT index was obtained at the beginning of the study and also at the end. The data also involve three categorical covariates: gender, with two categories (0 = female, 1 = male); ethnic group, with three categories (1 = dark, 2 = white, 3 = black); and school, with six categories (1 = oral health education, 2 = all four methods together, 3 = control school (no prevention measure), 4 = enrichment of the school diet with ricebran, 5 = mouthrinse with 0.2% NaF-solution, 6 = oral hygiene).
For the purpose of illustration of estimation in a zero-inflated over-dispersed
count data model with missing responses, Mian and Paul (2016) use the DMFT index
data obtained at the beginning of the study (as in Deng and Paul, 2005). The DMFT
index data at the beginning of the study are: (index, frequency): (0,172), (1,73),
(2,96), (3,80), (4,95), (5,83), (6,85), (7,65), (8,48). They first fit a zero-inflated negative binomial model to the complete data and to data with missing observations, without covariates. Data with missing observations were obtained by randomly deleting a certain percentage (5%, 10%, 25%) of the observed responses.
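As a rough cross-check of the complete-data fit, a zero-inflated negative binomial can also be fitted to these frequencies with the pscl package (a real R package); note that zeroinfl() uses a different parameterization than (µ, c, ω), so its estimates must be transformed before comparison with the results below:

library(pscl)
## expand the reported (index, frequency) pairs into individual observations
dmft0 <- rep(0:8, times = c(172, 73, 96, 80, 95, 83, 85, 65, 48))
fit   <- zeroinfl(dmft0 ~ 1 | 1, dist = "negbin")
summary(fit)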
The estimates of the mean parameter µ, the over-dispersion parameter c and the zero-inflation parameter ω based on the zero-inflated negative binomial model, under different percentages of missingness, together with their standard errors, are given in Table V of Mian and Paul (2016). Note that the estimates of µ, c and ω and the corresponding standard errors remain stable irrespective of the amount of missingness; only under MAR and MNAR with 25% missing values are they slightly different (slightly larger in the case of µ and c). For higher percentages of missingness these properties might deteriorate further.
For more insight, Ê(Y) = (1 − ω̂)µ̂ and V̂ar(Y) = (1 − ω̂)µ̂[1 + (ĉ + ω̂)µ̂] were calculated and are given in Table V of Mian and Paul (2016). These estimates also do not vary much with the amount of missingness, except under MNAR with 25% missing, when V̂ar(Y) is slightly higher than the others.
Mian and Paul (2016) then fit a zero-inflated negative binomial model to the complete data and to data with missing observations, with covariates. Response data with missingness were obtained in exactly the same way as in the situation without covariates. The model fitted was
$$\mu = \exp\{\beta + \beta_{G(M)} I(\text{Gender}=1) + \beta_{E(D)} I(\text{Ethnic}=1) + \beta_{E(W)} I(\text{Ethnic}=2) + \beta_{S(1)} I(\text{School}=1) + \beta_{S(2)} I(\text{School}=2) + \beta_{S(3)} I(\text{School}=3) + \beta_{S(4)} I(\text{School}=4) + \beta_{S(5)} I(\text{School}=5)\},$$
where $\beta$ represents the intercept parameter, $\beta_{G(M)}$ the regression parameter for gender, $\beta_{E(D)}$ and $\beta_{E(W)}$ the regression parameters for ethnic groups 1 and 2, and $\beta_{S(1)}, \beta_{S(2)}, \beta_{S(3)}, \beta_{S(4)}$ and $\beta_{S(5)}$ the regression parameters for schools 1 to 5, respectively. Estimates of the parameters can be found in Table VI of Mian and Paul (2016).
In this case the estimates differed (this is expected, as it depends on which observations remain in the final data set). In general, the standard errors of the estimates are larger than those under complete data, in some cases much larger, for example for SE(β̂_S(5)). For MCAR, MAR, and MNAR with 25% missing responses, the standard errors are close to twice those for complete data. The estimates of Ê(Y) do not vary much with the percentage missing or the missing data mechanism. For complete data and smaller percentages of missingness, V̂ar(Y) behaves similarly to Ê(Y). However, for MAR with 25% missing values, V̂ar(Y) is much larger (12.415) than in the other cases (which vary between 8.26 and 9.5).
Here we create a subset of the DMFT data analysed by Mian and Paul (2016). These data are given in the Appendix under the title "DMFT data for the analysis in Chapter 8" (Tables 8A, 8B, 8C, 8D, 8E, 8F, 8G, 8H, 8I, 8J, and 8K). We analyzed the modified DMFT data, and the results are given in Tables 8.8 and 8.9. The analyses and conclusions from Tables 8.8 and 8.9 are very similar to those from Tables V and VI of Mian and Paul (2016).

8.6 Analysis of Longitudinal Data With Missing Values


8.6.1 Normally Distributed Data
The concept of longitudinal data has been introduced earlier in this book. Here we
follow Ibrahim and Molenberghs (2009) who provided an excellent review of missing
data methods in longitudinal studies. We start with the normal random-effects model
(Laird and Ware, 1982) with missing data (MAR or MNAR)

yi = Xi β + Zi bi + ei , i = 1, . . . , N, (8.39)

where yi is ni × 1, Xi is an ni × p known matrix of fixed-effects covariates, β is a p × 1


vector of unknown regression parameters, commonly referred to as fixed effects, Zi

Table 8.8 Estimates and standard errors of the parameters for DMFT index data.

Mechanism       % missing    µ̂        SE(µ̂)    ĉ        SE(ĉ)    ω̂        SE(ω̂)    Ê(y)     V̂ar(y)
Complete data      0%        4.1444   0.0916   0.0426   0.0195   0.1892   0.0150   3.3603   6.5887
MCAR               5%        4.1598   0.0942   0.0415   0.0199   0.1953   0.0155   3.3473   6.6445
MCAR              10%        4.1440   0.0976   0.0472   0.0210   0.1922   0.0159   3.3475   6.6685
MCAR              25%        4.1514   0.1066   0.0465   0.0230   0.1878   0.0173   3.3719   6.6514
MAR                5%        4.1739   0.0891   0.0291   0.0182   0.1862   0.0148   3.3967   6.4497
MAR               10%        4.1730   0.0872   0.0224   0.0175   0.1743   0.0144   3.4457   6.2731
MAR               25%        4.2395   0.0846   0.0127   0.0160   0.1500   0.0135   3.6035   6.0887
MNAR               5%        4.1562   0.0939   0.0397   0.0197   0.1945   0.0154   3.3476   6.6071
MNAR              10%        4.1488   0.0976   0.0473   0.0210   0.1934   0.0160   3.3464   6.6878
MNAR              25%        4.1532   0.1069   0.0479   0.0232   0.1877   0.0173   3.3735   6.6742

Table 8.9 Estimates and standard errors of the parameters for DMFT index data with covariates.

Mechanism       % missing    β̂        SE(β̂)    β̂G       SE(β̂G)   β̂E(1)     SE(β̂E(1)) β̂E(2)     SE(β̂E(2))
Complete data      0%        1.3000   0.0891   0.2551   0.0462   −0.0496   0.0731   0.0598    0.0688
MCAR               5%        1.5621   0.1006   0.0738   0.0495   −0.1391   0.0818   −0.0140   0.0760
MCAR              10%        1.1953   0.1068   0.3302   0.0554   −0.1125   0.0869   0.1454    0.0815
MCAR              25%        1.1401   0.1324   0.4094   0.0786   −0.2894   0.1106   0.2243    0.1031
MAR                5%        1.4724   0.1073   0.1756   0.0534   −0.2948   0.0932   0.0998    0.0813
MAR               10%        1.3841   0.1258   0.2629   0.0656   −0.1999   0.1161   −0.1479   0.1040
MAR               25%        1.3072   0.1164   0.5014   0.0730   −0.2073   0.0967   −0.1625   0.0901
MNAR               5%        1.5621   0.1006   0.0738   0.0495   −0.1391   0.0818   −0.0140   0.0760
MNAR              10%        1.1953   0.1068   0.3302   0.0553   −0.1125   0.0870   0.1454    0.0815
MNAR              25%        1.1401   0.1324   0.4094   0.0786   −0.2894   0.1106   0.2243    0.1031

Mechanism       % missing    β̂S(1)    SE(β̂S(1)) β̂S(2)   SE(β̂S(2)) β̂S(3)   SE(β̂S(3)) β̂S(4)    SE(β̂S(4))
Complete data      0%        0.1252   0.0739   −0.0955  0.0861   −0.1308   0.0788   −0.1093   0.0824
MCAR               5%        0.0372   0.0822   −0.2767  0.0992   −0.3295   0.0932   −0.2309   0.0943
MCAR              10%        0.1276   0.0882   0.0712   0.0975   −0.1676   0.0945   −0.0338   0.0948
MCAR              25%        −0.0873  0.1184   0.0786   0.1244   −0.3042   0.1325   0.1355    0.1212
MAR                5%        0.0716   0.0901   −0.1583  0.1045   −0.2423   0.0995   −0.0857   0.0976
MAR               10%        −0.1402  0.1150   0.1079   0.1150   −0.3748   0.1401   0.1177    0.1056
MAR               25%        −0.0585  0.1040   −0.1015  0.1145   −0.3191   0.1152   −0.0719   0.1108
MNAR               5%        0.0372   0.0822   −0.2767  0.0992   −0.3295   0.0932   −0.2309   0.0943
MNAR              10%        0.1276   0.0882   0.0712   0.0975   −0.1676   0.0945   −0.0338   0.0948
MNAR              25%        −0.0873  0.1184   0.0786   0.1244   −0.3042   0.1325   0.1355    0.1212

Mechanism       % missing    β̂S(5)    SE(β̂S(5)) ĉ       SE(ĉ)    ω̂        SE(ω̂)    Ê(y)     V̂ar(y)
Complete data      0%        −0.1425  0.0823   0.0398   0.0203   0.1413   0.0132   3.4502   5.9609
MCAR               5%        −0.1504  0.0900   0.0873   0.0300   0.1544   0.0155   3.3980   6.7001
MCAR              10%        −0.3075  0.1015   0.0986   0.0322   0.1731   0.0176   3.2278   6.6513
MCAR              25%        −0.0492  0.1193   0.2154   0.0759   0.1538   0.0242   3.2800   7.9741
MAR                5%        −0.3911  0.1109   0.1594   0.0470   0.1567   0.0183   3.3008   8.0543
MAR               10%        0.0105   0.1090   0.2525   0.1309   0.1598   0.0295   3.2994   8.0491
MAR               25%        −0.0524  0.1047   0.1501   0.0479   0.2043   0.0219   3.0300   7.1204
MNAR               5%        −0.1504  0.0900   0.0873   0.0300   0.1544   0.0155   3.3980   6.7001
MNAR              10%        −0.3075  0.1015   0.0985   0.0322   0.1731   0.0176   3.2278   6.6513
MNAR              25%        −0.0492  0.1193   0.2154   0.0759   0.1538   0.0242   3.2800   7.9741
is a known ni × q matrix of covariates for the q × 1 vector of random effects bi , and
ei is an ni × 1 vector of errors. The columns of Zi are usually a subset of Xi , allowing
for fixed effects as well as random intercepts and/or slopes. It is typically assumed
that the ei are independent, the bi are i.i.d., the bi are independent of the ei , and

ei ∼ Nni (0, σ2 Ini ), bi ∼ Nq (0, D), (8.40)

where Ini is the ni × ni identity matrix, and Nq (µ, Σ) denotes the q-dimensional multi-
variate normal distribution with mean µ and covariance matrix Σ. The positive defi-
nite matrix D is the covariance matrix of the random effects and is typically assumed
to be unstructured and unknown. However, in practice, if one is convinced that a sim-
pler correlation structure, such as the exchangeable correlation matrix (see Zhang and
Paul, 2013), is sufficient then the estimation procedure might be simpler. Under these
assumptions, the conditional model, where conditioning refers to the random effects,
takes the form
(yi |β, σ2 , bi ) ∼ Nni (Xi β + Zi bi , σ2 Ini ). (8.41)
The model in (8.41) assumes a distinct set of regression coefficients for each indi-
vidual once the random effects are known. Integrating over the random effects, the
marginal distribution of yi is obtained as

(yi |β, σ2 , D) ∼ Nni (Xi β, Zi DZiT + σ2 Ini ). (8.42)


For the MLE of the parameters of the model (8.42) where there is no missing data,
see Laird and Ware (1982) and Jennrich and Schluchter (1986). Here, as in Section
8.5 and Ibrahim and Molenberghs (2009), we first show how to use the EM algorithm
to estimate parameters of this model for complete data. Although the EM algorithm
is not necessary here, it simplifies the estimation procedure.

8.6.2 Complete-data Estimation via the EM


In model (8.39), denote the observed data, the covariance parameters, and the complete data by $Y = (y_1, \ldots, y_N)$, $\theta = (\sigma^2, D)$ and $V = (y_1, b_1, \ldots, y_N, b_N)$, respectively. Then, the likelihood of $\beta$ and $\theta$ given $Y$ is
$$L(\beta,\theta|Y) = \prod_{i=1}^N f(y_i\,|\,\beta,\sigma^2,D), \qquad (8.43)$$
and the likelihood of $\beta$ and $\theta$ given $V$ is
$$L(\beta,\theta|V) = \prod_{i=1}^N f(y_i, b_i\,|\,\beta,\sigma^2,D) = \prod_{i=1}^N f(y_i\,|\,\beta,\sigma^2,b_i)\,f(b_i\,|\,D). \qquad (8.44)$$

As in Section 8.4, the EM algorithm has the E-step and the M-step. Because we are
dealing with a random effects model, the M-step itself needs two steps. Now, the
log-likelihood based on the observed data, apart from a constant, can be expressed as
$$\ell(\beta,\sigma^2,D) = -\frac12\sum_{i=1}^N\log|\Sigma_i| - \frac12\sum_{i=1}^N(y_i - X_i\beta)^T\Sigma_i^{-1}(y_i - X_i\beta), \qquad (8.45)$$
where $\Sigma_i = Z_iDZ_i^T + \sigma^2 I_{n_i}$. Then, given $\theta = (\sigma^2, D)$ and $V = (y_1, b_1, \ldots, y_N, b_N)$, the ML estimate of $\beta$, obtained by equating $d\ell/d\beta$ to zero and solving for $\beta$, is
$$\hat\beta = \left(\sum_{i=1}^N X_i^T\Sigma_i^{-1}X_i\right)^{-1}\sum_{i=1}^N X_i^T\Sigma_i^{-1}y_i. \qquad (8.46)$$

This is the M-step for estimating $\beta$ given $\theta$ and $V$. For the estimation of $\theta$, given $\beta$ and $V$, the complete-data log-likelihood is
$$\ell(\beta,\sigma^2,D) = \sum_{i=1}^N\log f(y_i\,|\,\beta,\sigma^2,b_i) + \sum_{i=1}^N\log f(b_i\,|\,D), \qquad (8.47)$$
which, apart from a constant, can be expressed as
$$\ell(\beta,\sigma^2,D) = -\sum_{i=1}^N\left[\frac{n_i\log\sigma^2}{2} + \frac{1}{2\sigma^2}(y_i - X_i\beta - Z_ib_i)^T(y_i - X_i\beta - Z_ib_i)\right] - \sum_{i=1}^N\left[\frac{q\log(2\pi)}{2} + \frac12\log|D| + \frac12 b_i^TD^{-1}b_i\right]. \qquad (8.48)$$

Since $(y_i - X_i\beta - Z_ib_i)^T(y_i - X_i\beta - Z_ib_i) \equiv e_i^Te_i$, this expression establishes that $\sum_{i=1}^N e_i^Te_i$ and $\sum_{i=1}^N b_ib_i^T$ are the complete-data sufficient statistics for $\sigma^2$ and $D$, respectively. Then it can be seen that
$$\hat\sigma^2 = \frac{\sum_{i=1}^N e_i^Te_i}{\sum_{i=1}^N n_i}, \quad\text{and}\quad \hat D = \frac{1}{N}\sum_{i=1}^N b_ib_i^T. \qquad (8.49)$$

This completes the M-step. In the E-step we calculate the expected values of the sufficient statistics given the observed data and the current parameter estimates:
$$E(b_ib_i^T\,|\,y_i,\hat\beta,\hat\sigma^2,\hat D) = E(b_i\,|\,y_i,\hat\beta,\hat\sigma^2,\hat D)\,E(b_i^T\,|\,y_i,\hat\beta,\hat\sigma^2,\hat D) + \mathrm{Var}(b_i\,|\,y_i,\hat\beta,\hat\sigma^2,\hat D)$$
$$= \hat DZ_i^T\hat\Sigma_i^{-1}(y_i - X_i\hat\beta)(y_i - X_i\hat\beta)^T\hat\Sigma_i^{-1}Z_i\hat D + \hat D - \hat DZ_i^T\hat\Sigma_i^{-1}Z_i\hat D$$
and
$$E(e_i^Te_i\,|\,y_i,\hat\beta,\hat\sigma^2,\hat D) = \mathrm{tr}\{E(e_ie_i^T\,|\,y_i,\hat\beta,\hat\sigma^2,\hat D)\} = \mathrm{tr}\{\hat\sigma^4\hat\Sigma_i^{-1}(y_i - X_i\hat\beta)(y_i - X_i\hat\beta)^T\hat\Sigma_i^{-1} + \hat\sigma^2 I_{n_i} - \hat\sigma^4\hat\Sigma_i^{-1}\},$$
where $\hat\Sigma_i = Z_i\hat DZ_i^T + \hat\sigma^2 I_{n_i}$ and $e_i = y_i - X_i\beta - Z_ib_i$. Then, the maximum likelihood estimates of all the parameters are obtained as follows.
Step 1: Given initial estimates $\hat\sigma^2$ and $\hat D$ of $\sigma^2$ and $D$, obtain $\hat\Sigma_i = Z_i\hat DZ_i^T + \hat\sigma^2 I_{n_i}$. Use these to estimate $\beta$ as
$$\hat\beta = \left(\sum_{i=1}^N X_i^T\hat\Sigma_i^{-1}X_i\right)^{-1}\sum_{i=1}^N X_i^T\hat\Sigma_i^{-1}y_i. \qquad (8.50)$$
Step 2: Use $\hat\sigma^2$, $\hat D$, $\hat\Sigma_i$, and $\hat\beta$ from Step 1 to obtain
$$\hat\sigma^2 = \frac{\sum_{i=1}^N \widehat{e_i^Te_i}}{\sum_{i=1}^N n_i} \qquad (8.51)$$
and
$$\hat D = \frac{1}{N}\sum_{i=1}^N \widehat{b_ib_i^T}, \qquad (8.52)$$
where the $i$th terms in the numerators on the right-hand sides of equations (8.51) and (8.52) are $\mathrm{tr}\{\hat\sigma^4\hat\Sigma_i^{-1}(y_i - X_i\hat\beta)(y_i - X_i\hat\beta)^T\hat\Sigma_i^{-1} + \hat\sigma^2 I_{n_i} - \hat\sigma^4\hat\Sigma_i^{-1}\}$ and $\hat DZ_i^T\hat\Sigma_i^{-1}(y_i - X_i\hat\beta)(y_i - X_i\hat\beta)^T\hat\Sigma_i^{-1}Z_i\hat D + \hat D - \hat DZ_i^T\hat\Sigma_i^{-1}Z_i\hat D$, respectively. Finally, the maximum likelihood estimates of the parameters $\beta$, $\sigma^2$ and $D$ are obtained by iterating between Step 1 and Step 2. For a discussion of convergence issues of the EM algorithm here, see Ibrahim and Molenberghs (2009).
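The two steps translate directly into code. The following is a minimal R sketch of the complete-data EM, assuming the data are stored as per-subject lists yl, Xl, Zl of response vectors and design matrices (illustrative names; no convergence check is included):

em_lmm <- function(yl, Xl, Zl, n_iter = 100) {
  N <- length(yl); q <- ncol(Zl[[1]])
  sig2 <- 1; D <- diag(q)                      # crude starting values
  for (s in seq_len(n_iter)) {
    Sinv <- lapply(1:N, function(i)
      solve(Zl[[i]] %*% D %*% t(Zl[[i]]) + sig2 * diag(length(yl[[i]]))))
    ## Step 1: GLS update for beta, equation (8.50)
    XtSX <- Reduce(`+`, lapply(1:N, function(i) t(Xl[[i]]) %*% Sinv[[i]] %*% Xl[[i]]))
    XtSy <- Reduce(`+`, lapply(1:N, function(i) t(Xl[[i]]) %*% Sinv[[i]] %*% yl[[i]]))
    beta <- solve(XtSX, XtSy)
    ## Step 2: E-step expectations of the sufficient statistics, then (8.51)-(8.52)
    ee <- 0; M <- 0; Dnew <- matrix(0, q, q)
    for (i in 1:N) {
      ri <- yl[[i]] - Xl[[i]] %*% beta; ni <- length(yl[[i]]); M <- M + ni
      ee <- ee + sum(diag(sig2^2 * Sinv[[i]] %*% ri %*% t(ri) %*% Sinv[[i]] +
                          sig2 * diag(ni) - sig2^2 * Sinv[[i]]))
      DZt  <- D %*% t(Zl[[i]])
      Dnew <- Dnew + DZt %*% Sinv[[i]] %*% ri %*% t(ri) %*% Sinv[[i]] %*% t(DZt) +
              D - DZt %*% Sinv[[i]] %*% t(DZt)
    }
    sig2 <- ee / M; D <- Dnew / N
  }
  list(beta = beta, sigma2 = sig2, D = D)
}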
Example 4. The six cities study of air pollution and health was a longitudinal
study designed to characterize lung growth as measured by changes in pulmonary
function in children and adolescents, and the factors that influence lung function
growth. A cohort of 13,379 children born on or after 1967 was enrolled in six com-
munities across the U.S.: Watertown (Massachusetts), Kingston and Harriman (Ten-
nessee), a section of St. Louis (Missouri), Steubenville (Ohio), Portage (Wisconsin),
and Topeka (Kansas). Most children were enrolled in the first or second grade (be-
tween the ages of six and seven) and measurements of study participants were ob-
tained annually until graduation from high school or loss to follow-up. At each annual
examination, spirometry, the measurement of pulmonary function, was performed
and a respiratory health questionnaire was completed by a parent or guardian. The ba-
sic maneuver in simple spirometry is maximal inspiration (or breathing in) followed
by forced exhalation as rapidly as possible into a closed container. Many different
measures can be derived from the spirometric curve of volume exhaled versus time.
One widely used measure is the total volume of air exhaled in the first second of the
maneuver (FEV1 ).
Fitzmaurice et al. (2004) present an analysis of a subset of the pulmonary function data collected in the Six Cities Study. The data consist of all measurements of FEV1, height and age obtained from a randomly selected subset of the female participants living in Topeka, Kansas. The random sample consists of 300 girls, with a minimum of one and a maximum of twelve observations over time. Data for four selected girls are presented in Table 8.1 of Fitzmaurice et al. (2004). These data were then analysed using the following regression model for $Y_{ij} = \log(\text{FEV}_1)_{ij}$:
$$Y_{ij} = \beta_1 + \beta_2\,\text{Age}_{ij} + \beta_3\log(\text{Height}_{ij}) + \beta_4\,\text{Age}_{i1} + \beta_5\log(\text{Height}_{i1}) + b_{1i} + b_{2i}\,\text{Age}_{ij} + e_{ij}. \qquad (8.53)$$
The estimates of the regression coefficients (fixed effects) and standard errors for log(FEV1) are given in their Table 8.2 (page 213).
Exercise 5.
Obtain maximum likelihood estimates of the parameters of the model on page 213 for the Six Cities Study of Air Pollution and Health data given on page 211 of Fitzmaurice et al. (2004), by the EM algorithm described above, using R.
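As a check on the EM results, the same mixed model can be fitted with standard software. A sketch using the lme4 package, where the data frame topeka and its variable names (logfev1, age, height, age.bl, height.bl, id) are assumptions about how the Fitzmaurice et al. (2004) data have been prepared:

library(lme4)
fit <- lmer(logfev1 ~ age + log(height) + age.bl + log(height.bl) +
              (1 + age | id), data = topeka)
fixef(fit)    # fixed effects, comparable to Table 8.2 of Fitzmaurice et al. (2004)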

8.6.3 Estimation with Nonignorable Missing Response Data (MAR and MNAR)
The method of estimation used for complete data will now be extended to missing data under the MAR and MNAR mechanisms of equations (8.1) and (8.2) in Section 8.2. Under MAR, the conditional probability of missingness depends only on the observed data, and the parameters of the missingness mechanism are completely separate and distinct from the parameters of the model (8.39). In likelihood-based estimation under MAR, the missingness mechanism can therefore be ignored in the likelihood; such data are often said to be ignorably missing or to constitute ignorable nonresponse, although the subjects with missing observations cannot be deleted before the analysis. That is, under MAR the distribution of $r_i$ given in equation (8.1) is not required: because the part of the likelihood involving $r_i$ does not depend on the parameters of interest, the MLE of the parameters of interest is the same with or without it, and so for MAR the distribution of $r_i$ is dropped. Under MNAR we need both models given in (8.1) and (8.2), so in what follows we deal with the analysis under the MNAR mechanism.
To put things in perspective, define an $n_i \times 1$ random vector $R_i$ whose $j$th component has the binary distribution
$$r_{ij} = \begin{cases} 0 & \text{if } y_{ij} \text{ is observed},\\ 1 & \text{if } y_{ij} \text{ is missing}. \end{cases} \qquad (8.54)$$
The distribution of $R_i$ follows a multinomial distribution with $2^{n_i}$ cell probabilities indexed by the parameter vector $\phi$. We assume that the covariates are fully observed. Under the normal mixed model, the complete data density of $(y_i, b_i, r_i)$ for subject $i$ is then given by $f(y_i, b_i, r_i\,|\,\beta, \sigma^2, D, \phi)$. Little (1993, 1995) suggested that this can be factored as
Under the normal mixed model, the complete data density of (yi , bi , ri ) for subject i
is then given by f (yi , bi , ri | β, σ2 , D, φ). Little (1993, 1995) suggested that this can be
factored as

f (yi , bi , ri | β, σ2 , D, φ) = f (yi | β, σ2 , bi ) f (bi | D) f (ri | φ, yi ) (8.55)

(the selection model) or

f (yi , bi , ri | β, σ2 , D, φ) = f (yi | β, σ2 , bi , ri ) f (bi | D) f (ri | φ) (8.56)


(the pattern mixture model). Here we consider only the selection model. Now, let $\gamma = (\beta, \sigma^2, D, \phi)$. Then, the complete data log-likelihood is
$$\ell(\gamma) = \log\prod_{i=1}^N f(y_i, b_i, r_i\,|\,\beta,\sigma^2,D,\phi) = \sum_{i=1}^N\log f(y_i\,|\,\beta,\sigma^2,b_i) + \sum_{i=1}^N\log f(b_i\,|\,D) + \sum_{i=1}^N\log f(r_i\,|\,\phi,y_i). \qquad (8.57)$$

The parameters $\beta$, $\sigma^2$, and $D$ are usually of interest, while $b_i$ and $\phi$ are treated as nuisance parameters. Diggle and Kenward (1994) discussed estimation methods assuming monotone missing data. However, as Ibrahim and Molenberghs (2009) comment, these methods are not easily extended to the analysis of nonmonotone missing data, where a subject may be observed again after a missing value occurs. A weighted EM method, proposed by Ibrahim (1990), which applies to both monotone and nonmonotone missing data, is given below.
Writing $y_i = (y_{mis,i}, y_{obs,i})$, where $y_{mis,i}$ is the $s_i \times 1$ vector of missing components of $y_i$, the E-step calculates the expected value of the complete data log-likelihood given the observed data and current parameter estimates as
$$Q_i(\gamma\,|\,\gamma^{(t)}) = E\!\left(l(\gamma; y_i, b_i, r_i)\,\middle|\,y_{obs,i}, r_i, \gamma^{(t)}\right), \qquad (8.58)$$

where $\gamma^{(t)} = (\beta^{(t)}, \sigma^{2(t)}, D^{(t)}, \phi^{(t)})$. Since both $b_i$ and $y_{mis,i}$ are unobserved, they must be integrated out. Thus, the E-step for the $i$th observation at the $(t+1)$th iteration is
$$
Q_i(\gamma\,|\,\gamma^{(t)}) = \int\!\!\int \log[f(y_i\,|\,\beta,\sigma^2,b_i)]\, f(y_{mis,i}, b_i\,|\,y_{obs,i}, r_i, \gamma^{(t)})\, db_i\, dy_{mis,i}
+ \int\!\!\int \log[f(b_i\,|\,D)]\, f(y_{mis,i}, b_i\,|\,y_{obs,i}, r_i, \gamma^{(t)})\, db_i\, dy_{mis,i}
+ \int\!\!\int \log[f(r_i\,|\,\phi,y_i)]\, f(y_{mis,i}, b_i\,|\,y_{obs,i}, r_i, \gamma^{(t)})\, db_i\, dy_{mis,i}
\equiv I_1 + I_2 + I_3, \qquad (8.59)
$$

where $f(y_{mis,i}, b_i\,|\,y_{obs,i}, r_i, \gamma^{(t)})$ represents the conditional distribution of the missing data, given the observed data.
Detailed calculations, which are omitted here but given in Ibrahim and Molenberghs (2009), lead to the evaluation of $Q_i(\gamma\,|\,\gamma^{(t)})$ as follows. Let $u_{i1}, \ldots, u_{im_i}$ be a sample of size $m_i$ from
$$f(y_{mis,i}\,|\,y_{obs,i}, r_i, \gamma^{(t)}) \propto \exp\left\{-\frac12(y_i - X_i\beta^{(t)})^T(Z_iD^{(t)}Z_i^T + \sigma^{2(t)}I_{n_i})^{-1}(y_i - X_i\beta^{(t)})\right\} \times f(r_i\,|\,y_{mis,i}, y_{obs,i}, \gamma^{(t)}), \qquad (8.60)$$
obtained via the Gibbs sampler along with the adaptive rejection algorithm of Gilks and Wild (1992), where the specific model for $f(r_i\,|\,y_{mis,i}, y_{obs,i}, \gamma^{(t)})$ is given in their equation (5.43). See also Ibrahim and Molenberghs (2009) for further discussion.
Now let $y_i^{(k)} = (u_{ik}^T, y_{obs,i}^T)^T$ and $b_i^{(tk)} = \Sigma_i^{(t)}Z_i^T(y_i^{(k)} - X_i\beta^{(t)})/\sigma^{2(t)}$. Then, the E-step for the $i$th observation at the $(t+1)$th iteration can be simplified as

$$
Q_i(\gamma\,|\,\gamma^{(t)}) = -\frac{n_i\log(2\pi)}{2} - \frac{n_i\log(\sigma^2)}{2} - \frac{1}{2\sigma^2}\left\{\mathrm{tr}(Z_i^TZ_i\Sigma_i^{(t)}) + \frac{1}{m_i}\sum_{k=1}^{m_i}(y_i^{(k)} - X_i\beta - Z_ib_i^{(tk)})^T(y_i^{(k)} - X_i\beta - Z_ib_i^{(tk)})\right\}
$$
$$
- \frac{q\log(2\pi)}{2} - \frac{\log|D|}{2} - \frac12\mathrm{tr}(D^{-1}\Sigma_i^{(t)}) - \frac{1}{2m_i}\sum_{k=1}^{m_i}(b_i^{(tk)})^TD^{-1}b_i^{(tk)} + \frac{1}{m_i}\sum_{k=1}^{m_i}\log[f(r_i\,|\,\phi, y_i^{(k)})] \qquad (8.61)
$$

and hence for all $N$ observations it is
$$Q(\gamma\,|\,\gamma^{(t)}) = \sum_{i=1}^N Q_i(\gamma\,|\,\gamma^{(t)}). \qquad (8.62)$$

In the M-step, $Q(\gamma\,|\,\gamma^{(t)})$ is maximized to obtain estimates of $\beta$, $\sigma^2$, and $D$. It can be shown that $\phi^{(t+1)}$ is obtained by maximizing
$$Q_\phi = \sum_{i=1}^N \frac{1}{m_i}\sum_{k=1}^{m_i}\log[f(r_i\,|\,\phi, y_i^{(k)})] \qquad (8.63)$$

and that closed-form estimates for $\beta$, $\sigma^2$, and $D$ are
$$\beta^{(t+1)} = \left(\sum_{i=1}^N X_i^TX_i\right)^{-1}\sum_{i=1}^N\frac{1}{m_i}\sum_{k=1}^{m_i}X_i^T(y_i^{(k)} - Z_ib_i^{(tk)}), \qquad (8.64)$$
$$\sigma^{2(t+1)} = \frac{1}{M}\sum_{i=1}^N\left\{\frac{1}{m_i}\sum_{k=1}^{m_i}(y_i^{(k)} - X_i\beta^{(t+1)} - Z_ib_i^{(tk)})^T(y_i^{(k)} - X_i\beta^{(t+1)} - Z_ib_i^{(tk)}) + \mathrm{tr}(Z_i^TZ_i\Sigma_i^{(t)})\right\} \qquad (8.65)$$
and
$$D^{(t+1)} = \frac{1}{N}\sum_{i=1}^N\left[\frac{1}{m_i}\sum_{k=1}^{m_i}b_i^{(tk)}(b_i^{(tk)})^T + \Sigma_i^{(t)}\right], \qquad (8.66)$$
where $M = \sum_{i=1}^N n_i$. MLEs of $\beta$, $\sigma^2$, and $D$ are obtained by iterating between the E-step and the M-step.
Exercise 6. Delete at random 10% of the response data from the data given in Example 4. For these new data, obtain maximum likelihood estimates of the parameters of the model on page 213 for the data given on page 211 of Fitzmaurice et al. (2004), using R.
In biostatistics or similar fields many longitudinal studies involve nonignorable
missing data. Here we present two data sets.
Example 5. (IBCSG data)
Consider a data set concerning the quality of life among breast cancer patients in
a clinical trial comparing four different chemotherapy regimens conducted by the
International Breast Cancer Study Group (IBCSG Trial VI; Ibrahim et al. 2001).
The main outcomes of the trial were time until relapse and death, but patients were
also asked to complete quality of life questionnaires at baseline and at three-month
intervals. Some patients did refuse, on occasion, to complete the questionnaire. How-
ever, even when they refused, the patients were asked to complete an assessment at
their next follow-up visit. Thus, the structure of this trial resulted in nonmonotone
patterns of missing data. One longitudinal quality of life outcome was the patient’s
self-assessment of her mood, measured on a continuous scale from 0 (best) to 100
(worst). The three covariates of interest included a dichotomous covariate for lan-
guage (Italian or Swedish), a continuous covariate for age, and three dichotomous
covariates for the treatment regimen (4 regimens). Data from the first 18 months of
the study were used, implying that each questionnaire was filled out at most seven
times, i.e., at baseline plus at six follow-up visits.
There are 397 observations in the data set, and mood is missing at least one time
for 71% of the cases, resulting in 116 (29%) complete cases. The amount of missing
data is minimal at baseline (2%) and ranges between 24% and 31% at the other six
times: 26.2% at the second, 24.2% at the third, 29% at the fourth, 24.9% at the fifth,
28.2% at the sixth, and 30.5% at the seventh occasion. For details of the data and
analysis, see Ibrahim et al. (2001). All patients were alive at the end of 18 months,
so no missingness is due to death. However, it is reasonable to conjecture that the
mood of the patient affected their decision to fill out the questionnaire. In this case,
the missingness would be MNAR, and an analysis that does not include the missing
data mechanism would be biased.
Example 6. (Muscatine children’s obesity data)
The Muscatine Coronary Risk Factor Study (MCRFS) was a longitudinal study
of coronary risk factors in school children (Woolson and Clarke 1984; Ekholm and
Skinner 1998). Five cohorts of children were measured for height and weight in 1977,
1979, and 1981. Relative weight was calculated as the ratio of a child’s observed
weight to the median weight for their age-sex-height group. Children with a relative
weight greater than 110% of the median weight for their respective stratum were
classified as obese. The analysis of this study involves binary data (1 = obese, 0 =
not obese) collected at successive time points. For every cohort, each of the following
seven response patterns occurs: (p,p,p), (p,p,m), (p,m,p), (m,p,p), (p,m,m), (m,p,m),
and (m,m,p), where a p (m) denotes that the child was present (missing) for the
corresponding measurement. The distribution over the patterns is shown in Table 1
of Ekholm and Skinner (1998).
The statistical problem is to estimate the obesity rate as a function of age and sex. However, as can be seen in Table 1 of Ekholm and Skinner (1998), many data records are incomplete, since many children participated in only one or two occasions of the survey. Ekholm and Skinner (1998) report that the two main reasons for nonresponse were: (i) no consent form signed by the parents was received, and (ii) the child was not in school on the day of the examination. If the parent did not sign the consent form because they did not want their child to be labeled as obese, or if the child did not attend school on the day of the survey because of their weight, then the missingness would not be ignorable and would likely be MNAR. In that case, an analysis that ignores the missing data mechanism would be biased. However, since the outcome is binary, these data cannot be modeled using the normal random effects model. Instead, a general method based on generalized estimating equations is useful (see Section 8.6.4).

8.6.4 Generalized Estimating Equations


8.6.4.1 Introduction
In Sections 8.6.2 and 8.6.3 we showed how to analyze missing data when the data in question follow a particular model. In practice, however, model departure is often a challenging issue. For example, when there is evidence of over-dispersion in count data, a negative binomial distribution is used. Similarly, when data arise in the form of proportions, a binomial model may not be correct, as over-dispersion may be present; in this case a beta-binomial distribution is often used (see, for example, Paul and Plackett (1978) and Paul and Islam (1995)). In all these situations a particular model for the data is assumed, but in practice the data may follow a different distribution than what is assumed. For example, over-dispersed count data may have come from a different over-dispersed count model than the negative binomial, such as a log-normal mixture of the Poisson distribution; theoretical analysis with a log-normal Poisson mixture is complex at best, and the assumption may not be valid anyway. For this reason, robust data-analytic methods have been developed, such as the quasi-likelihood (Wedderburn, 1974), the extended quasi-likelihood (Nelder and Pregibon, 1987; Godambe and Thompson, 1989), and the double extended quasi-likelihood (Lee and Nelder, 2001). However, these methods are for data in which the observations can be assumed to be independent. For the analysis of longitudinal data, a robust procedure using generalized estimating equations (GEE) has been developed by Liang and Zeger (1986) and Zeger and Liang (1986).
The standard GEE approach to estimating the regression parameters requires either complete data or missingness that does not depend on the responses. No general theory is available for GEE estimation of the regression parameters in the presence of missing data; however, there are a few very useful special cases, which are given below.

8.6.4.2 Weighted GEE for MAR Data


Robins et al. (1995) proposed a weighted GEE method for the analysis of data that are MAR. Let $y_{it}$ and $x_{it}$ be the response and a vector of covariates for individual $i$ at time $t$, $t = 1, \ldots, n_i$, where $n_i$ denotes the number of repeated observations on the $i$th subject. Denote the complete longitudinal outcomes and covariates by $Y_i = (y_{i1}, \ldots, y_{in_i})^T$ and $X_i = (x_{i1}, \ldots, x_{in_i})^T$, and let $E(Y_i|x_i) = \mu_i = (\mu_{i1}, \mu_{i2}, \ldots, \mu_{in_i})^T$. For completely observed longitudinal data, the GEE approach of Liang and Zeger (1986) allows regression modeling of the data specifying only the mean and variance of the outcome variables. The mean for the complete data is assumed to satisfy $\mu_{it} = g^{-1}(\eta_{it}) = g^{-1}(x_{it}\beta)$, where $g(\cdot)$ is a link function, and the estimating equation for $\beta$ is
$$U_1(\beta,\alpha) = \sum_{i=1}^N\left(\frac{\partial\mu_i}{\partial\beta}\right)^{\!T} V_i^{-1}(\beta,\alpha)(Y_i - \mu_i) = 0, \qquad (8.67)$$
where $V_i(\beta,\alpha) = \phi A_i(\mu_i)^{1/2}R_i(\rho)A_i(\mu_i)^{1/2}$, $A_i = \mathrm{diag}\{\mathrm{var}(y_{i1}), \ldots, \mathrm{var}(y_{in_i})\}$, $\alpha = (\phi,\rho)$, and $R_i(\rho)$ is a working correlation matrix for $Y_i$. For the choice of $R_i(\rho)$, see Chapter 4. Denote the GEE estimate of $\beta$ obtained by solving $U_1(\beta,\hat\alpha)=0$ by $\hat\beta_1$, where $\hat\alpha$ is any $\sqrt{N}$-consistent estimator; usually $\hat\alpha$ is estimated by the method of moments (Zhang and Paul, 2013). Liang and Zeger (1986) showed that $\hat\beta_1$ is consistent and asymptotically normal and that its variance can be consistently estimated by a sandwich-type estimator (see Chapter 4).
As earlier, let $r_{it}$ be the missing data indicator for the $t$th observation on the $i$th individual; note that here, in contrast with (8.54), $r_{it} = 1$ denotes an observed response. Assume monotone missingness with $r_{i1} \ge r_{i2} \ge \cdots \ge r_{i,n_i}$ (that is, an individual who leaves the study never comes back), and assume $r_{i1} = 1$ for all $i$; that is, all responses are completely observed at baseline.
Then, the weighted estimating equation (Robins and Rotnitzky 1995; Preisser, Lohman, and Rathouz 2002) for obtaining unbiased GEE estimates of the regression parameters under MAR is
$$U_2(\beta,\alpha,\gamma) = \sum_{i=1}^N\left(\frac{\partial\mu_i}{\partial\beta}\right)^{\!T} V_i^{-1}(\beta,\alpha)\,\Delta_i\,(Y_i - \mu_i) = 0, \qquad (8.68)$$
where the $j$th diagonal element of $\Delta_i$ is $r_{ij}w_{ij}$, $w_{ij} = 1/P(r_{ij}=1|X_i, Y_i)$, and $\gamma$ is a parameter from the distribution of the $r_{ij}$.
The weights $w_{ij}$ are generally estimated using a logistic regression model under the MAR assumption; see Lin and Rodrigues (2015) for further details.
Robins et al. (1995) showed that if $\Delta_i$ is estimated consistently, then $\hat\beta_2$, the solution of (8.68), is consistent and asymptotically normal under MAR and monotone missingness patterns.
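A sketch of how such weights can be constructed and used in practice: the observation probabilities are estimated by a sequential logistic dropout model and their cumulative products inverted to give the $w_{ij}$, which are then passed to geeglm() from the geepack package (a real R package). The data frame long and its variables (id, time, observed, prev_y, x, y) are illustrative, and rows are assumed sorted by time within subject; this shows the idea, not a complete weighted-GEE implementation:

library(geepack)
## logistic model for remaining under observation after baseline
drop_fit <- glm(observed ~ prev_y + x, family = binomial,
                data = subset(long, time > 1))
long$p <- 1                                           # everyone observed at baseline
long$p[long$time > 1] <- predict(drop_fit, newdata = subset(long, time > 1),
                                 type = "response")
long$w <- 1 / ave(long$p, long$id, FUN = cumprod)     # w_ij = 1/P(observed up to j)
fit <- geeglm(y ~ x + time, id = id, data = subset(long, observed == 1),
              family = binomial, corstr = "independence", weights = w)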

8.6.5 Some Applications of the Weighted GEE


8.6.5.1 Weighted GEE for Binary Data
Following Robins et al. (1995), Troxel et al. (1997) developed a weighted GEE method for nonignorably missing binary response survey data. Let $Y = (Y_1, \ldots, Y_N)$ represent the underlying binary (zero-one) response vector for the $N$ independent subjects, and let $X_i$ be the vector of $p$ covariates for the $i$th subject. Suppose that the initial nonresponders are followed up and given a second opportunity to respond to the survey. The indicator vector $r_1 = (r_{11}, \ldots, r_{N1})$ indicates whether $(Y_i, X_i)$ is observed at the first occasion and, similarly, $r_2$ is the indicator vector for the second occasion for those who did not respond at the first. The missingness probabilities are assumed to depend on the unobserved response variable $Y_i$ and a set of covariates $X_i^*$. Let $\pi_{i1} = P(r_{i1}=1|Y_i, X_i^*)$ and $\pi_{i2} = P(r_{i2}=1|Y_i, X_i^*, r_{i1}=0)$. Suppose $\mu_i = E(Y_i|X_i) = g(X_i\beta)$ and that $\beta$ is the parameter of interest. Based on the available data from subjects who responded either at the first or at the second occasion, $\beta$ can be estimated by solving the estimating equations
$$\sum_{i=1}^N r_iE_i^TW_i^{-1}(Y_i - \mu_i) = 0, \qquad (8.69)$$
where $E_i = \partial\mu_i/\partial\beta$, $W_i$ is the variance of $Y_i$, and $r_i = r_{i1} + r_{i2}$.


As pointed out by Troxel et al. (1997), if $r_{i1}$ or $r_{i2}$ depends on $Y_i$, the above estimating equations are biased away from zero and hence result in biased estimates. To this end, Troxel et al. (1997) proposed the weighted estimating functions
$$U_{TLB} = \sum_{i=1}^N \frac{r_i}{\theta_i}E_i^TW_i^{-1}(Y_i - \mu_i), \qquad (8.70)$$
where $\theta_i = \pi_{i1} + (1-\pi_{i1})\pi_{i2}$ is the probability of being observed. It can be shown that $E(U_{TLB}) = 0$, and hence $U_{TLB} = 0$ will produce consistent estimates of $\beta$ under very general conditions.

8.6.5.2 Two Modifications


Wang (1999b) modified the estimating functions suggested by Robins et al. (1995)
and suggested two alternative modifications for unbiased estimation of regression
parameters when a binary outcome is potentially observed at successive time points.
In particular, he proposed two new ways of constructing estimating functions, one
of which is based on direct bias correction and the other is motivated by condition-
ing. Unlike the weighting methods, the two estimating functions proposed by Wang
(1999b) are score-like functions and are, therefore, the most efficient. Also, they are
unbiased even for nonignorably missing data.
Modified Approach of Robins et al. (1995)
A simple modification of the approach of Robins et al. (1995) leads to the following unbiased estimating functions:
$$U_{RRZ} = \sum_{i=1}^N \frac{r_{i1}}{\pi_{i1}}E_i^TW_i^{-1}(Y_i - \mu_i) + \sum_{i=1}^N \frac{r_{i2}}{(1-\pi_{i1})\pi_{i2}}E_i^TW_i^{-1}(Y_i - \mu_i). \qquad (8.71)$$

The weight here depends on when the response is observed (at time 1 or time 2), while $U_{TLB}$ uses the overall probability of being observed at either time 1 or time 2.
Bias-Corrected Estimating Functions
The weighting GEE approach of Troxel et al. (1997) requires (a) that an individual will be observed further even if he or she has been observed earlier, and (b) that return is not possible once a subject leaves the study. Both of these conditions can be avoided if we use $U_{CC} - E(U_{CC}) = 0$ rather than $U_{CC} = 0$ for parameter estimation. Wang (1999b) showed that the unbiased estimating equations $U^{(1)} = U_{CC} - E(U_{CC})$ can be written as
$$U^{(1)} = \sum_{i=1}^N r_iE_i^TW_i^{-1}(Y_i - \eta_i), \qquad (8.72)$$
where $\eta_i = P(Y_i = 1|X_i, R_i = 1)$. This can also be expressed, as a function of $\mu_i$ with $\eta_i = \theta_{i1}\mu_i/\{\theta_{i1}\mu_i + (1-\mu_i)\theta_{i0}\}$, as
$$U^{(1)} = \sum_{i=1}^N r_i\left(\frac{\partial\eta_i}{\partial\beta}\right)^{\!T}\frac{Y_i - \eta_i}{\mathrm{Var}(Y_i|R_i=1)}.$$
This implies that $U^{(1)}$ may be derived by maximizing $\prod_{r_i=1}P(Y_i|r_i=1)$. The form of $U^{(1)}$ also suggests that these are the most efficient estimating functions (conditional on $r_i = 1$) in terms of the asymptotic variance of the estimates (Godambe and Heyde, 1987).
Conditional Estimating Functions
The complete-case model produces biased estimates because the expectations of $Y_i$ for those who responded differ from $\mu_i$. The estimating functions given by (8.70) and (8.71) correct for the biases due to response-dependent selection. However, responses obtained at time 1 are not distinguished from those obtained at time 2 in $U_{TLB}$.
Let $\mu_{i1} = E(Y_i|X_i, r_{i1}=1)$ and $\mu_{i2} = E(Y_i|X_i, r_{i1}=0, r_{i2}=1)$. Parameter estimation can rely on the following unbiased estimating functions:
$$U^{(2)} = \sum_{i=1}^N r_{i1}D_{i1}^TW_{i1}^{-1}(Y_i - \mu_{i1}) + \sum_{i=1}^N r_{i2}D_{i2}^TW_{i2}^{-1}(Y_i - \mu_{i2}), \qquad (8.73)$$
in which $D_{ik} = \partial\mu_{ik}/\partial\beta$ and $W_{ik} = \mu_{ik}(1-\mu_{ik})$ for $k = 1$ and 2.
We will write the missingness probabilities as $\pi_{i1}(Y_i)$ and $\pi_{i2}(Y_i)$ to express the dependence on the response variable $Y_i$ explicitly. The relationships between $\mu_{i1}$, $\mu_{i2}$ and $\mu_i$ are given by
$$\mu_{i1} = \frac{\pi_{i1}(1)\mu_i}{\pi_{i1}(1)\mu_i + \pi_{i1}(0)(1-\mu_i)} \qquad (8.74)$$
and
$$\mu_{i2} = \frac{\pi_{i2}(1)\{1-\pi_{i1}(1)\}\mu_i}{\pi_{i2}(1)\{1-\pi_{i1}(1)\}\mu_i + \pi_{i2}(0)\{1-\pi_{i1}(0)\}(1-\mu_i)}. \qquad (8.75)$$
Because $D_{ik}^TW_{ik}^{-1} = E_i^TW_i^{-1}$, (8.73) can be rewritten as
$$U^{(2)} = \sum_{i=1}^N E_i^TW_i^{-1}\left\{r_{i1}(Y_i - \mu_{i1}) + r_{i2}(Y_i - \mu_{i2})\right\}. \qquad (8.76)$$
Asymptotic Covariance
Wang (1999b) discussed the estimation of the asymptotic covariance of the GEE estimates of the regression parameters $\beta$. Omitting details, the asymptotic covariance of the estimates $\hat\beta$ obtained from the unbiased estimating equations $U = 0$ discussed above can be approximated by $V = A^{-1}B(A^T)^{-1}$, where $A = E(\partial U/\partial\beta)$ and $B = E(UU^T)$ is the covariance matrix of the estimating functions. Table 1 of Wang (1999b) gives the asymptotic covariances of the estimates.
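Given per-subject estimating function contributions, the sandwich formula is a few lines of R; a sketch in which Ui (an N × p matrix whose rows are the $U_i^T$) and A (the estimated derivative matrix) are assumed to have been computed already:

B  <- crossprod(Ui)                    # B-hat = sum of U_i U_i^T
V  <- solve(A) %*% B %*% t(solve(A))   # sandwich covariance V = A^{-1} B (A^T)^{-1}
se <- sqrt(diag(V))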
Example 7.
Wang (1999b) analyzed a set of data given in Troxel et al. (1997) from a survey
study concerning medical practice guidelines. The purpose of the survey was to as-
sess the lawyers’ knowledge of medical practice guidelines and the effect of those
guidelines on their legal work. Overall, 578 responses were received (among the 960
recipients) from the two mailing waves. Because of missing values in the received
responses, there are only 524 subjects with complete information on four binary vari-
ables. The four binary variables are Years (= 1 if > 10 years of law practice), Firm
P (= 1 if > 20% of the firm’s work has involved malpractice), Number (= 1 if > 5
new cases a year), and the response variable, Aware (awareness of medical practice
guidelines). Wang (1999b) used the same model as in Troxel et al. (1997). To eliminate the complications arising from the missing values in the received responses, he used only the observations from the 524 complete subjects and adjusted the total number of recipients to 960 × (524/578) ≈ 870.
The variables used in the missingness model are Years, Firm P, Number, Aware,
Years × Aware, and Firm P × Number. For these variables, a logit regression model
was used which also included an intercept parameter and a parameter τ represent-
ing the amount of missingness. The parameters in the missingness model were estimated as α = (−1.75, 0.72, 1.11, 1.65, 1.12, −1.14, −1.85, 0.15), with standard errors (0.35, 0.65, 0.67, 0.50, 0.60, 0.79, 0.78, 0.15). The parameter estimates from the four estimating-function methods are quite different from those of the complete-case (CC) analysis. For the intercept, Years, Firm P, and Number, the coefficient estimates from the CC analysis are (−0.19, 0.01, 0.25, 0.62). The corresponding estimates from the weighting methods of TLB and RRZ are (−0.75, 0.49, 0.30, 0.71), with standard errors (0.35, 0.42, 0.20, 0.22). Estimates from the new methods U(1) and U(2) are similar to those from the weighting methods except for one variable (Years): the effects of Years estimated by U(1) and U(2) are close to zero, quite different from the TLB and RRZ estimates. Considering the large standard errors for the Years effect, we should not be surprised by the difference. Also notice that the missingness model plays an important role in estimating β, and the effect of Years in the missingness model is also very uncertain, which may have a serious impact on the estimate of the Years effect.
Estimation with ignorable missing response data
For estimation with ignorable missing response data the missingness mecha-
nism can be ignored. So the joint distribution would be f (yi , bi | β, σ2 , D) instead of
f (yi , bi , ri | β, σ2 , D, φ) in case of nonignorable missing response data. For ignorable
(MAR) missing response data, selection models decompose the joint distribution as

f (yi , bi | β, σ2 , D) = f (yi | β, σ2 , bi ) f (bi | D). (8.77)


ANALYSIS OF LONGITUDINAL DATA WITH MISSING VALUES 169
The rest of the estimation procedure would remain similar to nonignorable missing
response data.
Using the same framework, methods for the analysis of longitudinal (or other) data with missing values can be developed when the data follow other continuous or discrete models. For example, Ibrahim and Lipsitz (1996) developed estimation procedures for the parameters of a binomial regression in the presence of nonignorable missing responses.
Chapter 9

Random Effects and Transitional Models

9.1 A General Discussion


Traditional designed experiments involve data that need to be analyzed using a fixed
effects model or a random effects model. Central to the idea of variance components
models is the idea of fixed and random effects. Each effect in a variance components
model must be classified as either a fixed or a random effect. Fixed effects arise when
the levels of an effect constitute the entire population about which one is interested.
For example, if a plant scientist is comparing the yields of three varieties of soybeans,
then Variety would be a fixed effect, provided that the scientist is concerned about
making inferences on only these three varieties of soybeans. Similarly, if an indus-
trial experiment focused on the effectiveness of two brands of a machine, machine
would be a fixed effect only if the experimenter’s interest does not go beyond the two
machine brands.
On the other hand, an effect is classified as a random effect when one wants to
make inferences on an entire population, and the levels in your experiment represent
only a sample from that population. Psychologists comparing test results between
different groups of subjects would consider Subject as a random effect. Depending
on the psychologists’ particular interest, the group effect might be either fixed or
random. For example, if the groups are based on the sex of the subject, then sex
would be a fixed effect. But if the psychologists are interested in the variability in test
scores due to different teachers, then they might choose a random sample of teachers
as being representative of the total population of teachers, and teacher would be a
random effect. Note that, in the soybean example presented earlier, if the scientists
are interested in making inferences on the entire population of soybean varieties and
randomly choose three varieties for testing, then variety would be a random effect.
If all the effects in a model (except for the intercept) are considered random
effects, then the model is called a random effects model; likewise, a model with only
fixed effects is called a fixed-effects model. The more common case, where some
factors are fixed and others are random, is called a mixed model.
In the linear model, each level of a fixed effect contributes a fixed amount to the
expected value of the dependent variable. What makes a random effect different is
that each level of a random effect contributes an amount that is viewed as a sample



from a population of normally distributed variables, each with mean 0, and an un-
known variance, much like the usual random error term that is a part of all linear
models. The variance associated with the random effect is known as the variance
component because it is measuring the part of the overall variance contributed by
that effect.

9.2 Random Intercept Models


Suppose m large elementary schools are chosen randomly from among thousands in
a large country. Suppose also that n pupils of the same age are chosen randomly at
each selected school. Their scores on a standard aptitude test are ascertained. Let Yi j
be the score of the jth pupil at the ith school. A simple way to model the relationships
of these quantities is
Yi j = µ + αi + ηi j ,
where µ is the average test score for the entire population. In this model αi is the
school-specific random effect: it measures the difference between the average score
at school i and the average score in the entire country. The term ηi j is the individual-
specific effect, i.e., it’s the deviation of the jth pupil score from the average for the
ith school.
The model can be augmented by including additional explanatory variables, which would capture differences in scores among different groups. For example:
Yi j = µ + β1 Sexi j + β2 Racei j + β3 Parentedui j + αi + ηi j ,
where Sexi j is the dummy variable for boys/girls, Racei j is the dummy variable for white/black pupils, and Parentedui j records the average education level of the child's parents. This is a mixed model, not a purely random effects model, as it introduces fixed-effects terms for Sex, Race, and Parents' Education (a sketch of fitting such a model in R follows below). A simple one-way random effects model was given in Section 7.2.1, which will be briefly discussed here.
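As mentioned above, a model of this form can be fitted with the lme4 package, treating school as a random intercept; the data frame pupils and its variable names are illustrative:

library(lme4)
fit <- lmer(score ~ sex + race + parentedu + (1 | school), data = pupils)
summary(fit)    # fixed effects and the school-level variance component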
Suppose that we have data on L groups or families, the ith family having ni ob-
servations. Let yi j be the observation on the jth member of the ith family. Then, yi j
can be represented by a random effects model
yi j = µ + ai + ei j , (9.1)
where µ is the grand mean of all the observations in the population, ai is the ran-
dom effect of the ith family, ei j is the error in observing yi j . The random effects ai
are identically distributed with mean 0 and variance σ2a , the random errors ei j are
identically distributed with mean 0 and variance σ2e , and the ai , ei j are completely
independent. The variance of yi j is then given by σ2 = σ2a + σ2e , and the covariance
between yi j and yil is

Cov{(µ + ai + ei j )(µ + ai + eil )} = Var(ai ) = σ2a .


The quantities σ2a and σ2e are called the variance components. These variance com-
ponents can be estimated from the analysis of variance table (see Section 7.2.1) as
follows.
Let $n_0 = \left(M - \sum_{i=1}^{p} n_i^2/M\right)/(p-1)$, where $M = \sum_{i=1}^{p} n_i$ and $p$ is the number of families studied. Then, the following table shows the analysis of variance corresponding to this model.

Source              Degrees of freedom   Sum of squares   Mean square   E(MS)
Between families    p − 1                SSB              MSB           n0 σ²a + σ²e
Within families     M − p                SSW              MSW           σ²e

An unbiased estimate of $\sigma_e^2$ is $\hat\sigma_e^2 = \mathrm{MSW}$, and that of $\sigma_a^2$ is $\hat\sigma_a^2 = (\mathrm{MSB} - \mathrm{MSW})/n_0$.
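These moment estimates are easy to compute from a one-way ANOVA fit; a minimal R sketch with illustrative variable names y (response vector) and fam (family indicator):

fit <- aov(y ~ factor(fam))
ms  <- summary(fit)[[1]][["Mean Sq"]]    # c(MSB, MSW)
ni  <- table(fam); M <- sum(ni); p <- length(ni)
n0  <- (M - sum(ni^2) / M) / (p - 1)
sigma2_e <- ms[2]                        # MSW
sigma2_a <- (ms[1] - ms[2]) / n0         # (MSB - MSW) / n0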
The random effects models or the more general mixed effects models are used
in longitudinal data analysis. Longitudinal data arise as continuous responses or dis-
crete responses. In what follows we first show methodologies for the analysis of
continuous data.

9.3 Linear Mixed Effects models


Longitudinal continuous response data with covariates are usually analyzed on the
assumption that the error terms in the regression model are normally distributed.
Consider a longitudinal study involving N subjects, each subject producing a
varying number of observations at different time points. Let Yi = (Yi1 , . . . , Yini )> be the
ni -dimensional vector of observations for the ith subject, i = 1, . . . , N. Assuming an
average linear trend for Y as a function of time, a multivariate regression model can
be obtained by assuming that the elements Yi j in Yi satisfy Yi j = β0 + β1 ti j + εi j , with
the assumption that the error components εi j , j = 1, . . . , ni , are normally distributed
with mean zero. In vector notation, we can write the regression model as Yi = Xi β + εi
for an appropriate design matrix Xi , with β> = (β0 , β1 ) and ε> i = (εi1 , εi2 , . . . , εini ).
Using an appropriate variance-covariance matrix Vi for εi , we obtain

Yi ∼ N(Xi β, Vi ). (9.2)

Note that if the repeated measurements Yi j are assumed to be independent, then Vi = σ² Ini , where Ini is the identity matrix of dimension ni . The above model is based on the assumption that the N subjects are chosen in advance, and our purpose is to estimate the regression coefficients β0 and β1 . This is a fixed effects model, and the parameters β0 and β1 are called population-averaged effects.
However, if the N subjects are chosen at random from a population of subjects
the fixed effects model fails to account for the differences between the subjects. So,
we introduce subject-specific effects, say bi for the ith subject, into the model. Since
the subjects are randomly chosen, bi is called a random effect having a probability
distribution which is typically taken as N(0, σ2b ) (see Section 9.2). The model for Yi j
then becomes Yi j = β0 + β1 ti j + bi + εi j and the model for the vector of observations
become Yi = Xi β + bi + εi .
For fixed bi , the distribution of Yi is

Yi |bi ∼ N(Xi β + bi , Vi ). (9.3)

If bi is random and is distributed as N(0, σ²b), then Vi becomes a matrix with σ² + σ²b as the diagonal elements and σ²b as the off-diagonal elements, and the distribution of Yi becomes

Yi ∼ N(Xi β, Vi ). (9.4)

Model (9.4) is a mixed effects model. If there is no covariate, this model is identical to the random effects model in Section 9.2. However, because of the longitudinal nature of the model, bi could have covariates which themselves could be correlated. This results in models (4.3) and (4.4) of Molenberghs and Verbeke (2005), which are

Yi |bi ∼ N(Xi β + Zi bi , Σi ) (9.5)

and

bi ∼ N(0, D), (9.6)

where Xi and Zi are ni × p and ni × q known design matrices, respectively, for a p-dimensional vector β of unknown regression coefficients and a q-dimensional vector bi of subject-specific regression coefficients, assumed to be random samples from the q-dimensional normal distribution with mean zero and covariance D, and with Σi a covariance matrix parameterized through a set of unknown parameters. The components of β are called fixed effects and the components of bi are called random effects; the fact that the model contains fixed as well as random effects motivates the term mixed effects model (Molenberghs and Verbeke, 2005). Unless the model is fitted in a Bayesian framework (Gelman et al., 1995), estimation and inference are based
on the marginal distribution for the response vector Yi . Let fi (yi |bi ) and f (bi ) be the
density functions corresponding to (9.5) and (9.6), respectively. The marginal density
function of Yi is
Z
fi (yi ) = fi (yi |bi ) f (bi )dbi ,

which can easily be shown to be the density function of an ni -dimensional normal


distribution with mean vector Xi β and with covariance matrix Vi = Zi DZi> + Σi .
Estimation of the parameters of model (9.4) and (9.5) is based on the marginal
model that, for subject i, Yi is multivariate normal with mean Xi β and covariance
Vi (α) = Zi DZi> + Σi , where α represents the parameters in the covariance matrices D
and Σi . The classical approach to estimation is based on the method of maximum
likelihood (ML). Assuming independence across subjects, the likelihood takes the
form
N ( " #)
1
(2π)−ni /2 |Vi (α)|− 2 × exp − (Yi − Xi β)> Vi−1 (α)(Yi − Xi β) .
1
Y
L(θ) = (9.7)
i=1
2
Estimation of θ> = (β> , α> ) requires joint maximization of (9.7) with respect to all
elements in θ. In general, no analytic solutions are available. So, the only direct
method of finding the maximum likelihood estimates of the parameters is by using
some numerical optimization routines.
Conditionally on α, the maximum likelihood estimator (MLE) of β is given by
(Laird and Ware 1982):
 N −1 N
X  X
β(α) =  Xi Wi Xi 
b  >
Xi> Wi Yi , (9.8)
i=1 i=1

where Wi equals Vi−1 . In practice, α is not known and can be replaced by its MLE b α.
However, one often also uses the so-called restricted maximum likelihood (REML)
estimator for α (Thompson, 1962), which allows one to estimate α without having to
estimate the fixed effects in β first. It is known from simpler models, such as linear regression models, that although the classical ML estimators of the variance components are biased, the REML estimators avoid the bias (Verbeke and Molenberghs 2000, Section 5.3).
In practice, the fixed effects in β are often of primary interest, as they describe
the average evolution in the population. Conditionally on α, the maximum likelihood
(ML) estimate for β is given by (9.8), which is normally distributed with mean
 N −1 N
h i X  X
E bβ(α) =  Xi Wi Xi 
>
Xi> Wi E [Yi ] = β, (9.9)
i=1 i=1

and covariance

Var[β̂(α)] = ( Σ_{i=1}^N Xi^T Wi Xi )^{−1} ( Σ_{i=1}^N Xi^T Wi Var(Yi) Wi Xi ) ( Σ_{i=1}^N Xi^T Wi Xi )^{−1}
          = ( Σ_{i=1}^N Xi^T Wi Xi )^{−1},   (9.10)

provided that the mean and covariance were correctly specified in our model, i.e., provided that E(Yi) = Xi β and Var(Yi) = Vi = Zi D Zi^T + Σi.
The random effects model established by Laird and Ware (1982) explicitly models the individual subject effect. Once the distributions of the random effects are speci-
fied, inference can be based on maximum likelihood methods. Lindstrom and Bates
(1988) also provided details on computational methods for Newton-Raphson and EM
algorithms. For the linear random effects model, the generalized least squares based
on the marginal distributions of Yi (after integrating out bi ) is,
 N −1 N
X  X
β̂ =  Xi Σi Xi 
T −1
XiT Σ−1
i Yi . (9.11)
i=1 i=1

As for bi , the Best Linear Unbiased Predictor (BLUP) for bi is


b̂i = DZiT Σ−1 (Yi − Xi β̂). (9.12)
When the unknown covariance parameters are replaced by their ML or REML
estimates, the resulting predictor,
b̂i = D̂ Zi^T Σ̂i^{−1} (Yi − Xi β̂),

is often referred to as the “Empirical BLUP” or the “Empirical Bayes” (EB) estimator
(see Harville and Jeske, 1992).
Furthermore, it can be shown that

var(b̂i) = D Zi^T Σi^{−1} Zi D − D Zi^T Σi^{−1} Xi ( Σ_{i=1}^N Xi^T Σi^{−1} Xi )^{−1} Xi^T Σi^{−1} Zi D.

Let Ĥi = Σ̂i − Zi D̂ Zi^T. The ith subject's predicted response profile is

Ŷi = Xi β̂ + Zi b̂i
   = Xi β̂ + Zi D̂ Zi^T Σ̂i^{−1} (Yi − Xi β̂)
   = (Ĥi Σ̂i^{−1}) Xi β̂ + (I − Ĥi Σ̂i^{−1}) Yi.

That is, the ith subject’s predicted response profile is a weighted combination of
the population-averaged mean response profile Xi β̂, and the ith subject’s observed
response profile Yi .
Note that Hi Σi^{−1} measures the relative within-subject variability, while Σi combines the within-subject and between-subject sources of variability. As a result, Ŷi assigns additional weight to Yi (the observation from subject i itself) due to the presence of random effects. Note that Xi β̂ is the estimate at the population level. If Ĥi is large, so that the within-subject variability is greater than the between-subject variability, more weight is given to Xi β̂, the population-averaged mean response profile. When the between-subject variability is greater than the within-subject variability, more weight is given to the ith subject's observed data Yi. The lme function in the R package nlme is widely used and numerical solutions can be easily obtained.
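As a minimal sketch (not from the original analysis), the random-intercept model above can be fitted in R with nlme; the data frame df and the column names y, time and id are hypothetical placeholders for a long-format dataset.

library(nlme)

# Random-intercept model: Y_ij = beta0 + beta1*t_ij + b_i + e_ij
fit <- lme(fixed = y ~ time,    # fixed (population-averaged) effects
           random = ~ 1 | id,   # subject-specific random intercept b_i
           data = df)

fixef(fit)    # estimates of beta0 and beta1
ranef(fit)    # empirical Bayes (BLUP) predictions of the b_i
fitted(fit)   # subject-level predicted profiles X_i beta_hat + Z_i b_hat_i

The ranef() output corresponds to the empirical BLUP b̂i in (9.12), with the estimated covariance parameters plugged in; REML estimation is the default in lme.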

9.4 Generalized Linear Mixed Effects Models


We first focus on random effects models for discrete response data with covariates; such data usually arise as binary responses, proportions, or counts.

9.4.1 The Logistic Random Effects Models


First we discuss random effects models for binary response data given in Diggle et al.
(1994). We consider a random intercept logistic regression model for binary response
data with a random effect αi given by

logit P(Yij = 1 | αi) = β0 + αi + Xij^T β.   (9.13)


Now, write γi = β0 + αi and assume that Xij does not include an intercept term. Then the joint likelihood function for β and the γi is proportional to

∏_{i=1}^N exp{ γi Σ_{j=1}^{ni} yij + ( Σ_{j=1}^{ni} yij Xij^T ) β − Σ_{j=1}^{ni} log[1 + exp(γi + Xij^T β)] }.   (9.14)

The conditional likelihood for β given the sufficient statistics for the γi has the form

∏_{i=1}^N  exp( Σ_{j=1}^{ni} yij Xij^T β ) / Σ_{Ri} exp( Σ_{l=1}^{yi.} Xil^T β ),   (9.15)

where yi. = Σ_{j=1}^{ni} yij and the index set Ri contains all C(ni, yi.) ways of choosing yi. positive responses out of the ni repeated observations. Conditional maximum likelihood estimates of the parameters are obtained by maximizing (9.15) with respect to β.
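Maximizing (9.15) is exactly a conditional logistic regression with each subject as a stratum, so standard software can be used; the following sketch assumes a long-format data frame df with a binary response y, covariates x1 and x2, and a subject identifier id (hypothetical names).

library(survival)

# Conditioning on the subject totals y_i. eliminates the intercepts gamma_i;
# clogit() maximizes the conditional likelihood (9.15).
fit <- clogit(y ~ x1 + x2 + strata(id), data = df)
summary(fit)   # conditional ML estimates of beta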

9.4.2 The Binomial Random Effects Models


Modeling of random effects for binary response data can be extended to sums of independent binary responses as follows. Let (Xi1, . . . , Xini) be an ni-vector of independent Bernoulli(pi) distributed responses. Then, given pi, Yi = Σ_{j=1}^{ni} Xij has a binomial(ni, pi) distribution. If we assume that pi is a random variable (effect) having a beta distribution with certain parameters, say α and β, then, with reparameterization, the unconditional distribution of Yi is a beta-binomial distribution with mean π and overdispersion parameter φ. This was discussed extensively in Chapter 7 (Section 7.2.10).
Exercise 7. Using the results given in Chapter 7, find maximum likelihood estimates of the parameters π and φ of the beta-binomial model for the Medium (M) group data given in Table 7.1, and test whether there is significant overdispersion in the data.

9.4.3 The Poisson Random Effects Models


Count data arise in numerous biological, biomedical and epidemiological investigations; see, for example, Anscombe (1949), Ross and Preece (1985), and Manton et al. (1981). The analysis of such count data is often based upon the assumption of some form of Poisson model. However, when using a Poisson distribution to model such datasets, it often occurs that the variance of the data exceeds what would normally be expected. This phenomenon of overdispersion in count data is quite common in practice; see, for example, Bliss and Fisher (1953) and Saha and Paul (2005).
To capture this overdispersion, let Y | µ ∼ Poisson(µ), where the Poisson parameter µ itself varies from experiment to experiment according to a continuous distribution f(·), typically a two-parameter density function. Thus,
given µ, Y ∼ Poisson(µ) and µ ∼ f. Then the unconditional distribution of Y is a two-parameter distribution, one parameter being the mean m, say, and the other an overdispersion parameter c.
For modeling count data with overdispersion, a popular and convenient model is
the negative binomial distribution (see Manton et al., 1981; Breslow, 1984; Engel,
1984; Lawless, 1987; Margolin et al. 1989). For more references on applications
of the negative binomial distribution see Clark and Perry (1989). Different authors
have used different parameterizations for the negative binomial distribution (see, for
example, Paul and Plackett, 1978; Barnwal and Paul, 1988; Piegorsch, 1990; Paul and
Banerjee, 1998). Let Y be a negative binomial random variable with mean parameter
m and dispersion parameter c. We write Y ∼ NB(m, c), which has probability mass
function

f(y | m, c) = Pr(Y = y | m, c)
            = [ Γ(y + c^{−1}) / ( y! Γ(c^{−1}) ) ] ( cm/(1 + cm) )^y ( 1/(1 + cm) )^{c^{−1}},   (9.16)
for y = 0, 1, · · · and m > 0. Now, Var(Y) = m(1 + mc) and c > −1/m. Since c can take
a positive as well as a negative value, it is called a dispersion parameter, rather than
an overdispersion parameter, and with this range of c, f (y|m, c) is a valid probability
function. Obviously, when c = 0, the variance of the NB(m, c) distribution becomes that of
the Poisson(m) distribution. Moreover, it can be shown that the limiting distribution
of the NB(m, c) distribution, as c → 0, is the Poisson(m). The parameter c and its
efficient estimation are, therefore, important in practice.
A variety of estimation procedures for the parameters m and c have been developed over the years. Here we give a few of them.
1. The Maximum Likelihood Estimator
Let Y1, . . . , YN be a random sample from the negative binomial distribution (9.16).
Then the log likelihood, apart from a constant, can be written as

l = Σ_{i=1}^N [ yi ln(m) − ( yi + 1/c ) ln(1 + cm) + Σ_{j=0}^{yi−1} ln(1 + cj) ].

Maximum likelihood estimators of m and c are then obtained by solving the estimating equations

∂l/∂m = Σ_{i=1}^N [ yi/m − (1 + c yi)/(1 + cm) ] = 0

and

∂l/∂c = Σ_{i=1}^N [ (1/c²) ln(1 + cm) + (yi − m)/( c(1 + cm) ) − Σ_{j=0}^{yi−1} 1/( c(1 + cj) ) ] = 0,

simultaneously (see Piegorsch, 1990). Solution of the first equation provides m̂ = ȳ.
Maximum likelihood estimator of the parameter c, denoted by ĉ ML , is obtained by
solving the second equation after replacing m by ȳ. It can be seen that the restriction
ĉML > −1/ymax, where ymax is the maximum of the data values Y1, . . . , YN, must be imposed in solving the equation.
2. Method of Moments Estimator
The method of moments estimators of the parameters m and c obtained by equat-
ing the first two sample moments with the corresponding population moments are
m̂ = ȳ and

ĉMM = (s² − m̂)/m̂²,   (9.17)

where ȳ is the sample mean and s² = Σ_{i=1}^N (yi − ȳ)²/(N − 1) is the sample variance.
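As a small illustration of the two estimators (with made-up counts, not the data analyzed below), the moment estimator (9.17) is a one-liner in R, and ĉML can be obtained by numerically maximizing the profile log-likelihood l(c; m̂ = ȳ) given above; for simplicity the search is restricted to c > 0.

y <- c(0, 1, 1, 2, 0, 3, 1, 0, 2, 5, 1, 0)   # illustrative counts

m_hat <- mean(y)
c_mm  <- (var(y) - m_hat) / m_hat^2           # moment estimator (9.17)

# Profile log-likelihood in c with m fixed at ybar (constants dropped):
loglik_c <- function(c, y, m) {
  sum(vapply(y, function(yi) {
    yi * log(m) - (yi + 1 / c) * log(1 + c * m) +
      if (yi > 0) sum(log(1 + c * (0:(yi - 1)))) else 0
  }, numeric(1)))
}

c_ml <- optimize(loglik_c, interval = c(1e-6, 10), y = y, m = m_hat,
                 maximum = TRUE)$maximum
c(m_hat = m_hat, c_mm = c_mm, c_ml = c_ml)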
3. The Extended Quasi-Likelihood Estimator
Clark and Perry (1989) considered estimation of the negative binomial dispersion parameter by maximum quasi-likelihood. For joint estimation of the mean and dispersion parameters, Nelder and Pregibon (1987) suggested an extended quasi-likelihood, which assumes only the first two moments of the response variable. The extended quasi-likelihood (EQL) and the estimating equations for m and c were given in Clark and Perry (1989); note that our c is the same as α of Clark and Perry (1989). Without giving details, we denote the maximum extended quasi-likelihood estimate of c by ĉEQL.
4. The Double-Extended Quasi-Likelihood Estimator
Lee and Nelder (2001) introduced double-extended quasi-likelihood (DEQL) for
the joint estimation of the mean and the dispersion parameters. The DEQL method-
ology requires an EQL for Yi given some random effect ui and an EQL for ui from
a conjugate distribution given some mean parameter m and dispersion parameter c.
The DEQL then is obtained by combining these two EQLs. The random effects ui ’s
or some transformed variables vi are then replaced by their maximum likelihood es-
timates resulting in a profile DEQL. This profile DEQL is the same as the negative
binomial log-likelihood with the factorials replaced by the usual Stirling approxima-
tions (Lee and Nelder, 2001, Result 5, p. 996). They argued, however, that the Stirling
approximation may not be good for small z, so for non-normal random effects, they
suggested the modified Stirling approximation

ln Γ(z) ≈ ( z − 1/2 ) ln(z) + (1/2) ln(2π) − z + 1/(12z).
Omitting details of the derivation, which can be obtained from the authors, the profile DEQL, with the modified Stirling approximation, is

p*v(DEQ) = Σ_{i=1}^N [ yi ln(m) + ( yi + 1/c ) ln( (1 + cyi)/(1 + cm) ) − (1/2) ln{ 2π(1 + cyi) } − ( yi + 1/2 ) ln(yi) − c² yi /( 12(1 + cyi) ) − 1/(12 yi) ].
From this we obtain the maximum profile DEQL estimating equations for m and c as

Σ_{i=1}^N [ yi /( m(1 + cm) ) − 1/(1 + cm) ] = 0
and

Σ_{i=1}^N [ (1/c²) ln( (1 + cm)/(1 + cyi) ) + (yi − m)/( c(1 + cm) ) − (2cyi + 2 − c)/( 2c²(1 + cyi) ) + (2 − c)/(2c²) − cyi(2 + cyi)/( 12(1 + cyi)² ) ] = 0,
respectively. The maximum DEQL estimate for m obtained from the first equation
above is m̂ = ȳ. The maximum DEQL estimate of c, denoted by ĉDEQL , is obtained
by iteratively solving the second equation above after replacing m by m̂ = ȳ. Saha
and Paul (2005) provided a detailed comparison, by a simulation study, of these
estimators.

9.4.4 Examples: Estimation for European Red Mites Data and the Ames
Salmonella Assay Data
The European red mites data and the Ames salmonella assay data are now analyzed. The European red mites data do not have any covariate and have 150 observations in the form of a frequency distribution (see Table 1 of Saha and Paul, 2005). The Ames salmonella assay dataset has one covariate and a total of 18 observations. The maximum likelihood estimates of the parameters m and c for the red mites data, with standard errors in parentheses, are 1.1467 (0.1273) and 0.976 (0.2629), respectively.
Example 9: Ames Salmonella Assay Data (see Table 2 of Saha and Paul, 2005)
The data in Table 2 of Saha and Paul (2005) were originally given by Margolin et al. (1981). The data from an Ames salmonella reverse mutagenicity assay have a response variable Y, the number of revertant colonies observed on each of three replicate plates, and a covariate x, the dose level of quinoline on the plate. We use the regression model given by

E(Yi | xi) = mi = exp[ β0 + β1 xi + β2 ln(xi + 10) ].

The maximum likelihood estimating equations for the parameters of a general negative binomial regression model are given by Lawless (1987). The maximum likelihood estimates of the parameters β0, β1, β2, and c are 2.197628 (0.324576), −0.000980 (0.000386), 0.312510 (0.087892), and 0.048763 (0.028143), respectively.
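A fit of essentially this model can be reproduced with glm.nb from the MASS package; the sketch below assumes a data frame ames with columns y (revertant colonies) and x (quinoline dose), which are hypothetical names.

library(MASS)

fit <- glm.nb(y ~ x + log(x + 10), data = ames)
summary(fit)

# glm.nb parameterizes the dispersion as theta = 1/c, so
c_hat <- 1 / fit$theta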
It is straightforward to generalize this framework to generalized linear models with random effects. Recall that the GLM relies on the link function h to specify the mean, µ = h^{−1}(X^T β). As an extension to generalized linear models, we can naturally incorporate random effects into the linear predictor. To be brief, we simply replace Xi β by Xi β + Zi bi, in which both Xi and Zi are covariates. Once the distribution of bi (usually multivariate normal) is specified (assumed), likelihood inference can be carried out in principle. Conditional on (Xi, Zi, bi), we have the same framework as the GLM. But the bi are "latent", as in the linear mixed effects model. In general, approximations are needed in evaluating the marginal likelihood (after integrating out the random effects) because µi is a nonlinear function of θi = Xi β + Zi bi. Evaluating, or approximating, this likelihood falls into the realm of computational statistics. More details can be found in the excellent book by McCulloch and Searle (2001).
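For instance, the random-intercept logistic model (9.13) can be fitted by (approximate) maximum likelihood with lme4, which handles the intractable integral by adaptive Gauss–Hermite quadrature or the Laplace approximation; df, y, x1, x2 and id are hypothetical names.

library(lme4)

fit <- glmer(y ~ x1 + x2 + (1 | id),
             family = binomial,
             data = df,
             nAGQ = 10)   # 10-point adaptive Gauss-Hermite quadrature;
                          # nAGQ = 1 gives the Laplace approximation
summary(fit)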
Note that using an additive random effects model leads to a mixture of GLMs. McLachlan (1997) proposed the EM algorithm for the fitting of mixture distributions by maximum likelihood. This essentially opens a new direction for estimating dispersion parameters when analyzing overdispersed data.

9.5 Transition Models


The marginal approach models the effects on the population rather than on individuals. In many cases, it is more appropriate to model individual changes to fully utilize the information in longitudinal data (see Ware et al., 1988). Several authors have therefore extended the marginal GEE approach to a conditional counterpart known as transition models (Zeger and Qaqish, 1988; Liang et al., 1992; Wang, 1996). Although based on the framework of a marginal model, this approach adjusts the mean using linear functions of the past responses (residuals).
Transitional models and other conditional quasi-likelihood methods are generally more efficient (Wang 1996). But the efficiency gain hinges on the correct specification of higher moments or, equivalently, correct specification of the conditional mean and conditional variance structure. However, in many cases, we are only confident about the second-order moment assumption, and it is not possible to specify the full conditional mean. If the conditional means are indeed linear functions of past responses, the transitional or conditional model has a close connection with the ordinary GEE approach (Wang, 1999c). The model here can, therefore, be treated as a robust version of transitional models.
The parameter estimates of α from a consistent estimation method will converge to α if the correlation is correctly specified. But under misspecification, α̂ will converge to some function of α, which will be denoted as γ, and in general γ ≠ α. The limiting working correlation matrix, R(γ), determines the asymptotic efficiency of Uβ.
In fact, the working matrix does not even have to converge to a correlation ma-
trix. As long as the limiting matrix is nonsingular, the asymptotic properties of β̂ still
hold. For example, we may use α = 2 in AR(1) as a working matrix. But the resulting
estimates of β may be very inefficient. This is because the working matrix only plays
a role of “weighting” in (4.18). To obtain efficient parameter estimates β̂, it is desir-
able to have computationally reliable and statistically robust methods of producing
appropriate “working” covariance models.
A very specific class of conditional models are so-called transition models. In a
transition model, a measurement yi j in a longitudinal sequence is described as a func-
tion of previous outcomes, or history Hi j = (yi1 , . . . , yi, j−1 ) (Diggle et al 2002, p.190).
One can write a regression model for the outcome yi j in terms of Hi j , or alternatively
the error term εi j can be written in terms of previous error terms. In the case of linear
models for Gaussian outcomes, one formulation can be translated easily into another
one and specific choices give rise to well-known marginal covariance structures such
as, for example, AR(1). Specific classes of transition models are also called Markov
models (Feller 1968). The order of a transition model is the number of previous
measurements that are still considered to influence the current one. A model is called stationary if the functional form of the dependence is the same regardless of the actual time at which it occurs. An example of a stationary first-order autoregressive
model for continuous data is:

yi1 = Xi1^T β + εi1,   (9.18)
yij = Xij^T β + α yi,j−1 + εij.   (9.19)

Assuming εi1 ∼ N(0, σ²) and εij ∼ N(0, σ²(1 − α²)), after some simple algebra we can obtain var(Yij) = σ² and cov(Yij, Yij′) = α^{|j−j′|} σ². In other words, this model produces a marginal multivariate normal model with AR(1) variance-covariance matrix.
It makes most sense for equally spaced outcomes, of course. Upon including random effects into (9.18)–(9.19), and varying the assumptions about the autoregressive structure, it is
clear that the general linear mixed-effects model formulation with serial correlation
encompasses wide classes of transition models. There is a relatively large literature
on the direct formulation of Markov models for binary, categorical, and Poisson data
as well (Diggle et al 2002). For outcomes of a general type, generalized linear model
ideas can be followed to formulate transition models.
We extend the generalized linear models (GLMs) to describe the conditional distribution of each response yij as a function of the past responses yi,j−1, . . . , yi1 and covariates Xij. For example, the probability of a child having an obesity problem at time tij depends not only on explanatory variables, but also on the obesity status at time ti,j−1. We will focus on the case where the observation times tij are equally spaced. To simplify notation, we denote the history for subject i at visit j by Hij = (yi1, yi2, . . . , yi,j−1). As above, we condition on the past and present values of the covariates without explicitly listing them.
The most useful transition models are Markov chains for which the conditional distribution of yij given Hij depends only on the q prior observations yi,j−1, . . . , yi,j−q. The integer q is referred to as the model order.
A transition model specifies a GLM for the conditional distribution of yi j given
the past response, Hi j . The form of the conditional GLM is
f(yij | Hij) = exp{ [yij θij − ψ(θij)]/φ + c(yij, φ) }   (9.20)

for the known functions ψ(θij) and c(yij, φ). The conditional mean and variance are

µ^c_ij = E(Yij | Hij) = ψ′(θij) and v^c_ij = Var(Yij | Hij) = ψ″(θij) φ,

where φ is an overdispersion parameter. The only difference from an ordinary GLM is that, by including Hij, an outcome is described in terms of its predecessors. A function of the conditional mean is equated to a linear function of the predictors:

η(µ^c_ij) = Xij^T β + κ(Hij, β, α),   (9.21)

where κ(·) is a function, often linear, of the history.


In words, the transition model expresses the conditional mean µ^c_ij as a function of both the covariates Xij and the past responses yi,j−1, . . . , yi,j−q. Past responses or functions thereof are simply treated as additional explanatory variables. The following examples with different link functions illustrate the range of transition models which are available.
Linear link—a linear regression with autoregressive errors for Gaussian data
(Tsay, 1984) is a Markov model. It has the form

Yij = xij^T β + Σ_{r=1}^q αr ( Yi,j−r − xi,j−r^T β ) + εij,   (9.22)

where the εij are independent, mean-zero Gaussian innovations. Note that the present observation, Yij, is a linear function of xij and of the earlier deviations Yi,j−r − xi,j−r^T β, r = 1, . . . , q.
Logit link—an example of a logistic regression model for binary responses that comprises a first-order Markov chain (see Cox and Snell, 1989; Korn and Whittemore, 1979), which assumes that yij (j > 1) is independent of earlier observations given the previous observation yi,j−1, is

logit P(Yij = 1 | Hij) = Xij^T β + yi,j−1 α.   (9.23)

This model is of the stationary first-order autoregressive type. Evaluating (9.23) at yi,j−1 = 0 and yi,j−1 = 1, respectively, produces the so-called transition probabilities between occasions j − 1 and j. In this model, if there were no covariates, these would be constant across the population. When there are time-independent covariates only, the transition probabilities change in a relatively straightforward way with the level of the covariate. For example, a different transition structure may apply to the standard and experimental arms in a two-armed clinical study. A simple extension to a model of order q has the form

logit P(Yij = 1 | Hij) = Xij^T βq + Σ_{r=1}^q yi,j−r αr.   (9.24)

The notation βq indicates that the value and interpretation of the regression coefficients change with the Markov order, q.
Log link—with count data we assume a log-linear model where Yij given Hij follows a Poisson distribution. Zeger and Qaqish (1988) discussed a first-order Markov chain with f = α{ log(y*_{i,j−1}) − X_{i,j−1}^T β }, where y*_ij = max(yij, c), 0 < c < 1. This leads to

µ^c_ij = E(Yij | Hij) = exp(Xij^T β) [ y*_{i,j−1} / exp(X_{i,j−1}^T β) ]^α.   (9.25)
The constant c prevents yij = 0 from being an absorbing state; otherwise yij = 0 would force all future responses to be 0. Note that when α > 0, we have an increased expectation, µ^c_ij, when the previous outcome, yi,j−1, exceeds exp(X_{i,j−1}^T β). When α < 0, a higher value at ti,j−1 causes a lower value at tij.
In the linear regression model, the transition model can be formulated with fr = αr ( yi,j−r − X_{i,j−r}^T β ) so that E(Yij) = Xij^T β for different values of q. In the logistic and log-linear cases, it is difficult to formulate models in such a way that β has the same meaning for different assumptions about the time dependence. When β is the scientific focus, the careful data analyst should examine the sensitivity of the substantive findings to the choice of time dependence model.

9.6 Fitting Transition Models


The contribution to the likelihood for the ith subject can be written as

Li(yi1, · · · , yini) = f(yi1) ∏_{j=2}^{ni} f(yij | Hij)
                    = f(yi1, · · · , yiq) ∏_{j=q+1}^{ni} f(yij | yi,j−1, · · · , yi,j−q).   (9.26)

The latter decomposition is relevant as the history Hij contains the q immediately preceding measurements. It is now clear that the product in (9.26) yields ni − q independent univariate GLM contributions. Clearly, a separate model may need to be
considered for the first q measurements, as these are left undescribed by the con-
ditional GLM. In the linear model we assume that yi j given Hi j follows a normal
distribution. If yi1 , . . . , yiq are also multivariate normal and the covariance structure
for the yij is weakly stationary, the marginal distribution f(yi1, · · · , yiq) can be fully
determined from the conditional distribution model without additional unknown pa-
rameters. Hence, full maximum likelihood estimation can be used to fit normal au-
toregressive models. See Tsay (1984) and references therein for details.
In the logistic and log-linear cases, f(yi1, · · · , yiq) is not determined by the GLM assumption about the conditional model, and the full likelihood is unavailable. An alternative is to estimate β and α by maximizing the conditional likelihood

∏_{i=1}^N f(yi,q+1, · · · , yini | yi1, · · · , yiq) = ∏_{i=1}^N ∏_{j=q+1}^{ni} f(yij | Hij).   (9.27)

When maximizing (9.27) there are two distinct cases to consider. In the first, fr(Hij; α, β) = αr fr(Hij), so that h(µ^c_ij) = Xij^T β + Σ_{r=1}^s αr fr(Hij). Here, h(µ^c_ij) is a linear function of both β and α = (α1, . . . , αs), so the estimation proceeds as in GLMs for independent data. We simply regress yij on the (p + s)-dimensional vector of extended explanatory variables (xij, f1(Hij), . . . , fs(Hij)).
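In this first case, a transition model is fitted with ordinary GLM software once the lagged responses are added as columns; a sketch for the first-order logistic model (9.23), assuming a long-format data frame df (columns y, x1, id, hypothetical names) sorted by time within subject:

# Lag the response within each subject (first visit gets NA):
df$ylag <- ave(df$y, df$id, FUN = function(v) c(NA, v[-length(v)]))

# First-order Markov logistic regression; the first observation of each
# subject is dropped because its lag is undefined.
fit <- glm(y ~ x1 + ylag, family = binomial, data = df,
           subset = !is.na(ylag))
summary(fit)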
The second case occurs when the functions of past responses include both α and β. Examples are the linear and log-linear models discussed above. To derive an estimation algorithm for this case, note that the derivative of the log conditional likelihood, or conditional score function, has the form

U^c(δ) = Σ_{i=1}^N Σ_{j=q+1}^{ni} ( ∂µ^c_ij/∂δ ) ( yij − µ^c_ij )/v^c_ij = 0,   (9.28)
where δ = (β, α). This equation is the conditional analog of the GLM score equation. The derivative ∂µ^c_ij/∂δ is analogous to Xij but it can depend on α and β. We can still formulate the estimation procedure as an iteratively reweighted least squares, as follows. Let Yi be the (ni − q)-vector of responses for j = q + 1, . . . , ni and µ^c_i its expectation given the history. Let X*_i be an (ni − q) × (p + s) matrix with kth row ∂µ^c_{i,q+k}/∂δ, and Wi = diag(1/v^c_{i,q+1}, · · · , 1/v^c_{i,ni}) an (ni − q) × (ni − q) diagonal weighting matrix. Finally, let Zi = X*_i δ̂ + (Yi − µ̂^c_i). Then an updated δ̂ can be obtained by iteratively regressing Z on X* using weights W.
When the correct model is assumed for the conditional mean and variance, the solution δ̂ asymptotically, as N goes to infinity, follows a Gaussian distribution with mean equal to the true value, δ, and (p + s) × (p + s) variance matrix

Vδ̂ = ( Σ_{i=1}^N X*_i^T Wi X*_i )^{−1}.   (9.29)

The variance Vδ̂ depends on β and α. A consistent estimate, V̂δ̂, is obtained by replacing β and α by their estimates β̂ and α̂. Hence, a 95 percent confidence interval for β1 is β̂1 ± 2√(V̂δ̂,11), where V̂δ̂,11 is the element in the first row and column of V̂δ̂.
If the conditional mean is correctly specified but the conditional variance is not, we still obtain consistent inferences about δ by using the robust variance from equation (A.6.1) in Diggle et al. (1994, p. 245). Here it takes the form

VR = ( Σ_{i=1}^N X*_i^T Wi X*_i )^{−1} ( Σ_{i=1}^N X*_i^T Wi VTi Wi X*_i ) ( Σ_{i=1}^N X*_i^T Wi X*_i )^{−1}.   (9.30)

A consistent estimate V̂R is obtained by replacing VTi = Var(Yi | Hi) in the equation above by its estimate (Yi − µ̂^c_i)(Yi − µ̂^c_i)^T.
Interestingly, use of the robust variance will often give consistent confidence in-
tervals for δ̂ even when the Markov assumption is violated. However, in that situation,
the interpretation of δ̂ is questionable since µcij (δ̂) is not the conditional mean of Yi j
given Hi j .

9.7 Transition Model for Categorical Data


This section discusses Markov chain regression models for categorical responses observed at equally spaced intervals. We begin with logistic models for binary responses and then briefly consider extensions to multinomial and ordered categorical outcomes.
A first-order binary Markov chain is characterized by the transition matrix

( π00  π01 )
( π10  π11 ),

where πab = P(Yij = b | Yi,j−1 = a), a, b = 0, 1. For example, π01 is the probability that
Yi j = 1 when the previous response is Yi j−1 = 0. Note that each row of a transition ma-
trix sums to one. As its name implies, the transition matrix records the probabilities
of making each of the possible transitions from one visit to the next.
In the regression setting, we model the transition probabilities as functions of covariates Xij. A very general model uses a separate logistic regression for P(Yij = 1 | Yi,j−1 = yi,j−1), yi,j−1 = 0, 1. That is, we assume that

logit P(Yij = 1 | Yi,j−1 = 0) = Xij^T β0

and

logit P(Yij = 1 | Yi,j−1 = 1) = Xij^T β1,

where β0 and β1 may differ. In words, this model assumes that the effects of the explanatory variables differ depending on the previous response. A more concise form for the same model is

logit P(Yij = 1 | Yi,j−1 = yi,j−1) = Xij^T β0 + yi,j−1 Xij^T α,   (9.31)

so that β1 = β0 + α. An advantage of this form is that we can easily test whether simpler models fit the data equally well. For example, we can test whether α = (α0, 0), which would indicate that the covariates have the same effect on the response probability whether yi,j−1 = 0 or yi,j−1 = 1. Alternatively, we can test whether a more limited subset of α is zero, indicating that the associated covariates can be dropped from the model. Each of these alternatives is nested within the saturated model so that standard statistical methods for nested models can be applied.
In many problems, a higher-order Markov chain may be needed. The second-order model has transition matrix

                    Yij
Yi,j−2   Yi,j−1      0       1
  0        0       π000    π001
  0        1       π010    π011
  1        0       π100    π101
  1        1       π110    π111

Here, πabc = P(Yij = c | Yi,j−2 = a, Yi,j−1 = b); for example, π011 is the probability that Yij = 1 given Yi,j−2 = 0 and Yi,j−1 = 1. We could fit four separate logistic regression models, one for each of the four possible histories (Yi,j−2, Yi,j−1), namely (0, 0), (0, 1), (1, 0), and (1, 1), with regression coefficients β00, β01, β10, β11, respectively. But it is again more convenient to write a single equation as follows:

logit P(Yij = 1 | Yi,j−2 = yi,j−2, Yi,j−1 = yi,j−1)
  = Xij^T β + yi,j−1 Xij^T α1 + yi,j−2 Xij^T α2 + yi,j−1 yi,j−2 Xij^T α3.   (9.32)

By plugging in the different values for yi,j−2 and yi,j−1, we obtain β00 = β, β01 = β + α1, β10 = β + α2, and β11 = β + α1 + α2 + α3. We would again hope that a more parsimonious model fits the data equally well, so that many of the components of the αi would be zero.
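For a simplified version in which the past responses enter only through their main effects and their interaction (scalar α's rather than the covariate-specific vectors in (9.32)), the nested models can be compared with a likelihood-ratio test; ylag1 and ylag2 are assumed constructed as above, with df2 restricted to rows where both lags exist.

fit1 <- glm(y ~ x1 + ylag1, family = binomial, data = df2)
fit2 <- glm(y ~ x1 + ylag1 + ylag2 + ylag1:ylag2,
            family = binomial, data = df2)

# Likelihood-ratio test of the extra second-order terms:
anova(fit1, fit2, test = "Chisq")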
An important special case occurs when there are no interactions between the past
responses, yi j−1 and yi j−2 and the explanatory variables, that is, when all elements of
the αi are zero except the intercept term. In this case, the previous responses affect the
probability of a positive outcome but the effects of the explanatory variables are the
same regardless of the history. Even in this situation, we must still choose between
Markov models of different order. For example, we might start with a third order
model which can be written in the form

logit P(Yij = 1 | Yi,j−3 = yi,j−3, Yi,j−2 = yi,j−2, Yi,j−1 = yi,j−1)
  = xij^T β + yi,j−1 α1 + yi,j−2 α2 + yi,j−3 α3 + yi,j−1 yi,j−2 α4 + yi,j−1 yi,j−3 α5
    + yi,j−2 yi,j−3 α6 + yi,j−1 yi,j−2 yi,j−3 α7.   (9.33)

A second-order model can be used if the data are consistent with α3 = α5 = α6 = α7 = 0; a first-order model is implied if αj = 0 for j = 2, . . . , 7. As with any regression coefficients, the interpretation and value of β depend on the other explanatory variables in the model, in particular on which previous responses are included. When inferences about β are the scientific focus, it is essential to check their sensitivity to the assumed order of the Markov regression model. When the Markov transition model is correctly specified, the transition events are uncorrelated, so that ordinary logistic regression can be used to estimate the regression coefficients and their standard errors. How-
ever, there may be circumstances when we choose to model P(Yi j |Yi j−1 , · · · , Yi j−q )
even though it does not equal P(Yi j |Hi j ). For example, suppose there is heterogeneity
across people in the transition matrix due to unobserved factors, so that a reasonable
model is

logit P(Yij = 1 | Yi,j−1 = yi,j−1, Ui) = (β0 + Ui) + Xij^T β + α yi,j−1,   (9.34)

where Ui ∼ N(0, σ²). We may still wish to estimate the population-averaged transition matrix, P(Yij | Yi,j−1 = yi,j−1). But here the random intercept Ui makes the transitions for a person correlated. Correct inference about the population-averaged coefficients can be drawn using the GEE approach.
To capture the baseline heterogeneity across subjects in a first-order binary Markov chain, Albert and Follmann (2003) used a separate random-intercept logistic regression model for each type of transition:

logit(P01(ui)) = logit[P(Yij = 1 | Yi,j−1 = 0, ui)] = Xij^T β01 + ui,
logit(P10(ui)) = logit[P(Yij = 0 | Yi,j−1 = 1, ui)] = Xij^T β10 + ν ui,

from which we obtain

P01(ui) = exp(Xij^T β01 + ui) / [1 + exp(Xij^T β01 + ui)],
P10(ui) = exp(Xij^T β10 + ν ui) / [1 + exp(Xij^T β10 + ν ui)],
where β01 and β10 are regression parameters. The parameter ν represents the association between P01(ui) and P10(ui). Note that P00 + P01 = 1 and P10 + P11 = 1. To see the effect of ν for some fixed parameters, see Figure 3 of Albert and Follmann (2003, p. 104). Thus, the model for yij given Xij, zij, yi,j−1, ui, and the parameters θ can be written as

f(yij | Xij, zij, yi,j−1, ui, θ) = { P01(ui)^{yij} (1 − P01(ui))^{1−yij}   if yi,j−1 = 0,
                                  { P10(ui)^{1−yij} (1 − P10(ui))^{yij}   if yi,j−1 = 1.   (9.35)

9.8 Further reading


Markov transition models have been studied by several authors. For example, Korn and Whittemore (1979) modeled the probability of occupying the current state using the previous state. Wu and Ware (1979) assumed one binary event (e.g., death), with covariate information accumulating as time passes before the event. Regier (1968) reparametrized the two-state transition matrix to include a parameter which is the ratio of the odds of staying in state 0 to the odds of staying in state 1. Zeger and Qaqish (1988) discussed a class of Markov regression models for time series using a quasi-likelihood approach. Lee and Kim (1998) analysed correlated panel data using a continuous-time Markov model. Zeng and Cook (2007) proposed an estimation method based on joint transitional models for multivariate longitudinal binary data using GEE2. Deltour, Richardson, and Le Hesran (1999) used stochastic algorithms for the estimation of Markov models with intermittent missing data. Albert (2000) developed a transitional model for longitudinal binary data subject to nonignorable missing data and proposed an EM algorithm for parameter estimation. In Albert and Follmann (2003), an extended version of the Markov transition model was proposed to handle nonignorable missing values in binary longitudinal data.
Chapter 10

Handling High-Dimensional Longitudinal Data

10.1 Introduction
Many variables are often collected in high-dimensional longitudinal data. The inclusion of redundant variables may reduce accuracy and efficiency for both parameter estimation and statistical inference. The traditional criteria introduced in Chapter 5 are all-subset methods, and these methods become computationally intensive when the dimension of the variables is moderately large. Therefore, it is important to find new methodology for variable selection in the analysis of high-dimensional longitudinal data. Penalized loss function (or −2 log-likelihood) methods have been widely used to select variables in regression models for independent data {(Xi, Yi) : i = 1, · · · , N}. The penalized loss function is composed of a loss function and a penalty function, that is, Q(β) = Σ_{i=1}^N li(β) + N Σ_{j=1}^p Pλ(|βj|), where Pλ(|β|) is a penalty function and λ is a tuning parameter which controls the sparseness of the regression parameters β. The commonly used penalty functions Pλ(|θ|) include:
• Hard penalty function: Pλ (|θ|) = λ2 − (|θ| − λ)2 I(|θ| < λ).
• Least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996):
Pλ (|θ|) = λ|θ|.
• Smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001), specified through its derivative

P′λ(|θ|) = λ{ I(|θ| ≤ λ) + (aλ − |θ|)+ / ((a − 1)λ) I(|θ| > λ) },

where a > 2; Fan and Li (2001) proposed taking a = 3.7 (a small sketch of this derivative follows the list).
• Adaptive penalty function (ALASSO) (Zou, 2006): Pλ (|θ|) = λ|θ|/|θ̂|γ , where θ̂ is
a consistent estimator of θ and γ > 0.
• Elastic net (EN) penalty (Zou and Hastie, 2005) introduced as a mixing penalty
to effectively select grouped variables: Pλ (|θ|) = λ1 ||θ||1 + λ2 ||θ||22 .
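As a sketch, the SCAD derivative above is easy to code and inspect in R (the function name is ours):

# SCAD penalty derivative of Fan and Li (2001), vectorized over theta:
scad_deriv <- function(theta, lambda, a = 3.7) {
  th <- abs(theta)
  lambda * ifelse(th <= lambda,
                  1,
                  pmax(a * lambda - th, 0) / ((a - 1) * lambda))
}

scad_deriv(seq(0, 4, by = 0.5), lambda = 1)  # constant, then decaying to zero

The derivative is constant (like LASSO) for small |θ|, decays linearly, and vanishes for |θ| > aλ, which is what leaves large coefficients unbiased.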
The penalized methods can compress the coefficients of unrelated variables to-
ward zero and obtain the parameter estimates at the same time. Fan and Li (2001)
proposed that a good penalty function should result in an estimator with three prop-
erties: unbiasedness, sparsity, and continuity. Their studies found that the excessive



compression of the coefficients by LASSO would cause large biases, and they then proposed the smoothly clipped absolute deviation (SCAD) penalty function to solve this problem. SCAD is more stable than LASSO and can reduce the computational complexity. Furthermore, the adaptive LASSO permits different weights for different coefficients and has oracle properties.
In this chapter, we will introduce several advanced methods for simultaneous parameter estimation and variable selection in marginal models with longitudinal data. Assume that µij = g(Xij^T β), where g(·) is a specified function and β is an unknown parameter vector. We assume that β is sparse and partition it as β = (β1^T, β2^T)^T with β1 ∈ R^d and β2 ∈ R^{p−d}. Suppose that the true value of the parameter is β* = (β1*^T, 0^T)^T. We aim to identify the variables with zero coefficients β2 consistently, and simultaneously estimate the nonzero coefficients β1.

10.2 Penalized Methods


In this section, we will introduce several penalized methods for variable selection in
marginal models.

10.2.1 Penalized GEE


The GEE method introduced in Chapter 4 is the most popular method for estimating parameters in marginal models. In this section, we introduce the penalized GEE proposed by Wang et al. (2012a) for variable selection. Let

UN(β) = Σ_{i=1}^N Di^T Ai^{−1/2} Ri^{−1}(α) Ai^{−1/2} (Yi − µi).

The penalized GEE is

QN(β) = UN(β) − N P′λ(|β|) sign(β),   (10.1)

where P′λ(|β|) = (P′λ(|β1|), · · · , P′λ(|βp|))^T is the p-dimensional vector of derivatives of the penalty function Pλ(|β|) encouraging sparsity in β, and sign(β) = (sign(β1), · · · , sign(βp))^T is an indicator vector with sign(t) = 1 for t > 0, sign(t) = −1 for t < 0 and sign(t) = 0 for t = 0. In Equation (10.1), the tuning parameter λ > 0 controls the complexity of the model.
Wang et al. (2012a) proposed combining the MM algorithm of Hunter and Li (2005) with the Fisher-scoring algorithm to solve the penalized GEE (10.1). For a small ε > 0, the MM algorithm suggests that the penalized estimator β̂ approximately satisfies the following equation:

UNj(β̂) − N P′λ(|β̂j|) sign(β̂j) |β̂j| / ( |β̂j| + ε ) = 0,   (10.2)

where UNj(β̂) is the jth element of UN(β̂), and β̂j is an estimator of βj for j = 1, · · · , p.


Applying the Fisher-scoring algorithm to (10.2), we can obtain the following iterative formula for the penalized GEE:

β̂^(k+1) = β̂^(k) + [ H(β̂^(k)) + N E(β̂^(k)) ]^{−1} [ UN(β̂^(k)) − N E(β̂^(k)) β̂^(k) ],   (10.3)

where

H(β̂^(k)) = Σ_{i=1}^N Xi^T Ai^{1/2}(β̂^(k)) Ri^{−1}(α) Ai^{1/2}(β̂^(k)) Xi,

E(β̂^(k)) = diag( P′λ(|β̂1^(k)| + ε)/(|β̂1^(k)| + ε), · · · , P′λ(|β̂p^(k)| + ε)/(|β̂p^(k)| + ε) ).

Given a selected tuning parameter λ and an initial value of β, such as the estimate
obtained from the GEE with an independence working matrix under a full model,
update the estimate of β via equation (10.3) until the algorithm converges.
The tuning parameter λ controls the sparseness of the regression parameters, and it is important to select an appropriate λ. The cross-validation method is very popular for selecting the tuning parameter. The k-fold cross-validation procedure is as follows. Denote the full dataset by T and randomly split the data into k nonoverlapping subsets of approximately equal size. Denote the cross-validation training and test sets by T − Tv and Tv, respectively. Obtain the estimator β̂(v) of β using the training set T − Tv. Form the cross-validation criterion

CV(λ) = Σ_{v=1}^k Σ_{(yij, xij) ∈ Tv} l(yij, xij, β̂(v)),

where k usually takes a value of 5 or 10, and l(·) is a loss function. Wang et al. (2012)
proposed taking the negative log likelihood of exponential family distribution under
a working independence assumption as the loss function. The best tuning parameter
is selected by minimizing CV(λ) over a fine grid of λ. From the MM algorithm, a
sandwich covariance formula can be used to estimate the asymptotic covariance of
β̂:
Ĉov(β̂) = [ H(β̂) + N E(β̂) ]^{−1} M(β̂) [ H(β̂) + N E(β̂) ]^{−1},

where

M(β̂) = Σ_{i=1}^N Xi^T Ai^{1/2}(β̂) R̂i^{−1}(α) εi(β̂) εi^T(β̂) R̂i^{−1}(α) Ai^{1/2}(β̂) Xi,

in which εi(β̂) = Ai^{−1/2}(β̂)(Yi − µi(β̂)). The tuning parameter can also be selected by minimizing the determinant or trace of the covariance matrix Ĉov(β̂). The correlation parameters α in the working correlation matrix can be estimated using the moment method introduced in Chapter 4.
The penalty function Pλ(|β|) can be SCAD or ALASSO. The resulting estimators from (10.1) are unbiased, consistent, and have oracle properties (Wang et al., 2012). The PGEE package in the statistical software R can be used to fit penalized generalized estimating equations with the SCAD penalty to longitudinal data with high-dimensional covariates.
10.2.2 Penalized Robust GEE-type Methods
The penalized GEE is sensitive to outliers and, therefore, some researchers have proposed penalized robust GEE-type methods. For example, Fan et al. (2012) and Lv et al. (2015) proposed a robust variable selection approach based on robust generalized estimating functions that incorporate the correlation structure of longitudinal data. Consider the following robust estimating equations:

UR(β) = Σ_{i=1}^N Di^T Vi^{−1} hi(µi(β)) = 0,   (10.4)

where Vi = Ai^{1/2} Ri(α) Ai^{1/2} and hi(µi(β)) = Wi( ψ(Ai^{−1/2}(Yi − µi)) − Ci ), in which Ci = E[ψ(Ai^{−1/2}(Yi − µi))] is used to ensure Fisher consistency of the estimator. For the Gaussian distribution and the symmetric Huber function, the correction term Ci is exactly equal to zero.
The function ψ(·) is selected to downweight the influence of outliers in the response variable. Fan et al. (2012) chose the Huber function ψc(x) = min{c, max(−c, x)}, where the tuning constant c is chosen to give a certain level of asymptotic efficiency at the underlying distribution. Lv et al. (2015) proposed using a bounded exponential score function ψ(t) = exp(−t²/γ), where γ downweights the influence of an outlier on the estimators. The weight matrix Wi = diag(wi1, wi2, · · · , wini) is used to downweight the effect of leverage points and can be chosen using the Mahalanobis distance function:

wij = wij(Xij) = min{ 1, [ b0 / ( (Xij − mx)^T Sx^{−1} (Xij − mx) ) ]^{r/2} },

where r ≥ 1, b0 is the 0.95 quantile of a χ² distribution with p degrees of freedom, and mx and Sx are robust estimates of the location and scatter of the Xij, such as the median and the median absolute deviation of the Xij. If we set ψ(x) = x and wij = 1, then UR(β) in (10.4) will be equal to UN(β).
To select important covariates and estimate their coefficients simultaneously, the penalized method introduced in §10.2.1 can be used by adding a penalty function to UR(β). The penalized robust estimating equations are given as follows:

UPR(β) = UR(β) − N P′λ(|β|) sign(β) = 0.   (10.5)

The MM iterative algorithm can be used to solve Equation (10.5). The correlation parameters and the scale parameter φ need to be estimated before solving (10.5). Let êij = (yij − µij(β̂)) / √v(µij(β̂)) be the Pearson residuals. We can obtain a robust estimate of φ through the median absolute deviation:

φ̂ = { 1.483 median|êij − median{êij}| }².   (10.6)


For a given score function ψ(·), we obtain the estimates of α by the method of Wang et al. (2005). When R(α) is an exchangeable correlation matrix, α can be estimated by

α̂ = (1/(N H)) Σ_{i=1}^N [ 1/( ni(ni − 1) ) ] Σ_{j≠k} ψ(eij) ψ(eik),   (10.7)

where the summation is over all pairs (j ≠ k) and H is the mean of the ψ²(eij), j = 1, · · · , ni, i = 1, · · · , N. When R(α) is an AR(1) correlation matrix, α can be estimated by

α̂ = (1/(N H)) Σ_{i=1}^N [ 1/(ni − 1) ] Σ_{j≤ni−1} ψ(eij) ψ(ei,j+1).   (10.8)
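A sketch of these robust moment estimates in R, assuming resid_list is a list of per-subject Pearson residual vectors (a hypothetical structure, each with at least two observations) and using the Huber function:

psi_huber <- function(x, c = 1.345) pmin(c, pmax(-c, x))

e_all   <- unlist(resid_list)
phi_hat <- (1.483 * median(abs(e_all - median(e_all))))^2   # (10.6)

H <- mean(psi_huber(e_all)^2)
alpha_hat <- mean(vapply(resid_list, function(e) {
  ni <- length(e)
  p  <- psi_huber(e)
  (sum(p)^2 - sum(p^2)) / (ni * (ni - 1))   # sum over all pairs j != k
}, numeric(1))) / H                          # exchangeable estimator (10.7)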

For the tuning parameter, Fan et al. (2012) proposed a robust generalized CV to choose λ, to reduce the impact of outliers. Define RSSR(λ) as the sum of squares of the robustified residuals, that is,

RSSR(λ) = Σ_{i=1}^N || Wi ψ( Ai^{−1/2}(Yi − µi(β̂λ)) ) ||²,

where β̂λ is the solution of the penalized robust estimating equation (10.5) when λ is fixed. Select the penalty parameter λ by minimizing the robustified GCV statistic:

GCVR(λ) = [ RSSR(λ)/n ] / { 1 − d(λ)/n }²,   (10.9)

where d(λ) = tr{ [D(β̂λ) + N E(β̂λ)]^{−1} D^T(β̂λ) } is the effective number of parameters. Then λopt = argminλ GCVR(λ) is chosen, and the corresponding β̂λ is the robust penalized estimator of β.
The specific algorithm is given as follows.
• Step 1. Use the estimator β̂(0) obtained from the GEE with an independence working matrix as an initial estimator of β.
• Step 2. Update the correlation parameter α and the scale parameter φ using (10.6)–(10.8).
• Step 3. For a given tuning parameter λ, utilize the MM algorithm to solve Equation (10.5). Let β̂(k) be the kth iterate for fixed λ; the iterative MM algorithm is

β̂(k+1) = β̂(k) − [ D(β̂(k)) − N E(β̂(k)) ]^{−1} [ UR(β̂(k)) − N E(β̂(k)) β̂(k) ],   (10.10)

where D(β) = ∂UR(β)/∂β. The MM algorithm continues until the change between successive iterates is less than a user-defined threshold.
• Step 4. Choose λ using the GCV criterion.

10.3 Smooth-threshold Method


Ueki (2009) developed smooth-threshold estimating equations that can automatically eliminate irrelevant parameters by setting them to zero; this approach can be extended to the marginal model. Consider the following equations:

(Ip − ∆) UR(β) + ∆β = 0p,   (10.11)
where Ip is a p-dimensional identity matrix and ∆ is a diagonal matrix with diagonal elements (δ̂1, · · · , δ̂p), in which δ̂j = min{ 1, λ*/|β̂j^(0)|^{1+τ} } with a √N-consistent estimator β̂j^(0) (j = 1, · · · , p), which can be obtained by solving equation (10.4). Here τ and λ* are tuning parameters. If λ* takes a large value or τ takes a small value, then δ̂j = 1, and hence the corresponding parameter is shrunk to zero and the corresponding covariate can be removed from the model. Before solving the equation, we need to specify the parameters τ and λ*. According to Zou (2006), τ can be selected among (0.5, 1, 2). The parameter λ* controls the sparsity of the parameters, and must be specified before solving equation (10.11). The following criterion function can be used to select a proper tuning parameter λ*:
RBIC(λ*) = Σ_{i=1}^N hi^T(µi(β̂λ*)) Ri^{−1} hi(µi(β̂λ*)) + dfλ* log N,   (10.12)

where β̂λ* is an estimator of β for a given λ*, and dfλ* = Σ_{j=1}^p I(δ̂j ≠ 1) is the number of nonzero elements of the estimator. The optimal tuning parameter is λ*opt = argminλ* RBIC(λ*). The procedure for solving equation (10.11) follows.
• Step 1. Give an initial estimate β̂(0). For example, the estimate obtained via the generalized estimating equations with an independence working matrix can be used as the initial estimate.
• Step 2. Estimate the scale parameter φ̂ using (10.6) with the current estimate β̂(k). For a given working correlation matrix Ri(α), estimate the correlation parameter using (10.7) or (10.8) for the exchangeable and AR(1) structures, respectively, and obtain

Vi( µi(β̂(k)), φ̂(k) ) = Âi^{1/2}(β̂(k), φ̂(k)) Ri(α̂(k)) Âi^{1/2}(β̂(k), φ̂(k)).

• Step 3. For a given λ*, update the estimate of β via the following iterative formula:

β̂(k+1) = β̂(k) − [ Σ_{i=1}^n Di^T Ωi(µi(β)) Di + Ĝ ]^{−1} [ Un(β) + Ĝ β ] |_{β = β̂(k)},   (10.13)

where Ĝ = (Ip − ∆̂)^{−1} ∆̂ and Ωi(µi(β)) = Vi^{−1}(µi(β)) Γi(µi(β)), in which

Γi(µi(β)) = E[ ∂hi(µi(β))/∂µi ] |_{µi = µi(β)}.

• Step 4. Repeat Steps 2–3 until the algorithm converges.


According to the iterative algorithm mentioned above, we obtain a sandwich formula to estimate the asymptotic covariance matrix of β̂λ:

Cov(β̂λ) ≈ [ Σ̂n(µi(β̂λ)) ]^{−1} Ĥn(µi(β̂λ)) [ Σ̂n(µi(β̂λ)) ]^{−1},   (10.14)

where

Ĥn(µi(β̂λ)) = Σ_{i=1}^n Di^T Vi^{−1}(µi(β̂λ)) hi(µi(β̂λ)) { hi(µi(β̂λ)) }^T Vi^{−1}(µi(β̂λ)) Di

and

Σ̂n(µi(β̂λ)) = Σ_{i=1}^n Di^T Vi^{−1}(µi(β̂λ)) Γi(µi(β̂λ)) Di.

The tuning parameter λ* can also be selected by minimizing the determinant of the covariance matrix of β̂λ, that is, λ*opt = argminλ* |Cov(β̂λ)|.

10.4 Yeast Data Study


In this section, we apply the variable selection methods to analyze a yeast dataset. A yeast cell-cycle gene expression dataset was collected in the CDC15 experiment, where genome-wide mRNA levels of 6178 yeast open reading frames (ORFs) over a two cell-cycle period were measured at the M/G1-G1-S-G2-M stages (Spellman et al., 1998). To better understand the phenomenon underlying the cell-cycle process, it is important to identify the transcription factors (TFs) that regulate the expression levels of cell cycle-regulated genes. In this section, we use a subset of 283 cell cycle-regulated genes observed over 4 time points at the G1 stage, which is available in the R package PGEE (Inan et al., 2016).
The response variable Yik is the log-transformed gene expression level of gene i measured at time point k. The covariates xij, j = 1, · · · , 96, are the standardized binding probabilities of a total of 96 TFs obtained from a mixture model approach of Wang et al. (2007) based on the ChIP data of Lee et al. (2002). The full model considered is

Yik = β0 + β1 tik + Σ_{j=1}^{96} βj+1 xij + εik,

where tik denotes time, and xi1, · · · , xi96 are standardized to have mean zero and unit variance.
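A sketch of this analysis with the PGEE package follows; the dataset yeastG1 ships with the package, and the function and argument names below follow the package's documented interface (treat the exact settings as assumptions to be checked against the installed version).

library(PGEE)
data(yeastG1)   # 283 genes x 4 time points: response y, time, 96 TF covariates

# Select the tuning parameter lambda by fold-wise cross-validation:
cv <- CVfit(formula = y ~ . - id, id = id, data = yeastG1,
            family = gaussian(link = "identity"),
            fold = 4, lambda.vec = seq(0.01, 0.30, by = 0.01),
            pindex = c(1, 2),          # leave intercept and time unpenalized
            scale.fix = TRUE, scale.value = 1,
            eps = 1e-6, maxiter = 30, tol = 1e-3)

# Penalized GEE with the SCAD penalty at the selected lambda:
fit <- PGEE(formula = y ~ . - id, id = id, data = yeastG1,
            family = gaussian(link = "identity"),
            corstr = "exchangeable", lambda = cv$lam.opt,
            pindex = c(1, 2), scale.fix = TRUE, scale.value = 1,
            eps = 1e-6, maxiter = 30, tol = 1e-3)

# TFs whose coefficients are not shrunk to (numerically) zero:
names(which(abs(fit$coefficients) > 1e-3))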
Figure 10.1 shows the change of the log-transformed gene expression level with time for the first twenty genes. The histogram (see Figure 10.2) indicates that the distribution of the log-transformed gene expression level is symmetric. The boxplot (see Figure 10.3) indicates that the log-transformed gene expression levels may contain many underlying outliers. We utilize the penalized methods to identify the important TFs. Table 10.1 presents the number of TFs selected for the G1-stage yeast cell-cycle process using the penalized GEE, penalized Huber (c = 2) and penalized exponential squared loss with the SCAD penalty under three correlation structures (independence, exchangeable and AR(1)). The results indicate that these three methods tend to select more TFs under the independence correlation structure. Table 10.2 lists the selected TFs using the three penalized methods under the independence, exchangeable and AR(1) correlation structures. The overlaps of the TFs selected by the three methods can be treated as important TFs. It would be of great interest to further study the other, controversial TFs and confirm their biological properties using the genome-wide binding method.

Figure 10.1 Plot of the log-transformed gene expression level with time for the first twenty genes.

10.5 Further Reading


The aforementioned methods in Sections 10.2 and 10.3 can only deal with data in which the number of covariates p is no larger than the number of subjects N. Therefore, when p grows at an exponential rate in N, that is, when the longitudinal data are

Figure 10.2 Histogram of the log-transformed gene expression level in the yeast cell-cycle process.

Table 10.1 The number of TFs selected for the G1-stage yeast cell-cycle process with penal-
ized GEE (PGEE), penalized Exponential squared loss (PEXPO), and penalized Huber loss
(PHUBER) with SCAD penalty.
Correlation PGEE PHUBER PEXPO
Independence 18 16 15
Exchangeable 13 12 11
AR(1) 14 14 11

ultra-high dimensional, new methods need to be developed. Ultra-high dimensional


longitudinal data are increasingly common and the analysis is challenging both the-
oretically and methodologically.
Cheng et al. (2014) and Fan et al. (2014) studied the varying coefficient model un-
der ultra-high dimensional longitudinal data. They proposed first reducing the num-

Figure 10.3 Boxplot of the log-transformed gene expression level in the yeast cell-cycle process.

ber of covariates to a moderate order by employing a screening procedure proposed


by Fan and Lv (2008), and then identified both the varying and constant coefficients
using a group SCAD. Their method is based on B-spline marginal models and a working independence assumption, and the correlations are ignored.
Xu et al. (2014) proposed a novel GEE-based screening procedure, which only
pertains to the specifications of the first two marginal moments and a working cor-
relation structure. Different from existing methods that either fit separate marginal
models or compute pairwise correlation measures, their method merely involves
making a single evaluation of estimating functions and thus is extremely compu-
tationally efficient. The new method is robust against the misspecification of corre-

Table 10.2 List of selected TFs for the G1-stage yeast cell-cycle process with penalized GEE
(PGEE), penalized Exponential squared loss (PEXPO), and penalized Huber loss (PHUBER)
with SCAD penalty.
PGEE
Independence ABF1 FKH1 FKH2 GAT3 GCR2 MBP1 MSN4 NDD1
PHD1 RGM1 RLM1 SMP1 SRD1 STB1 SWI4 SWI6
Exchangeable FKH1 FKH2 GAT3 MBP1 NDD1 PHD1 RGM1 SMP1
STB1 SWI4 SWI6
AR(1) FKH1 FKH2 GAT3 MBP1 MSN4 NDD1 PHD1 RGM1
SMP1 STB1 SWI4 SWI6
PHUBER
Independence FKH1 FKH2 GAT3 GCR2 IXR1 MBP1 NDD1 NRG1
PDR1 ROX1 SRD1 STB1 STP1 SWI4 SWI6 YAP5
Exchangeable FKH2 GAT3 GCR2 MBP1 NDD1 PDR1 SRD1 STB1
STP1 SWI4 SWI6 YAP5
AR(1) FKH1 FKH2 GAT3 GCR2 MBP1 NDD1 NRG1 PDR1
SRD1 STB1 STP1 SWI4 SWI6 YAP5
PEXPO
Independence ABF1 FKH1 FKH2 GAT3 GCR2 MBP1 MSN4 NDD1
PDR1 PHD1 RGM1 RLM1 SRD1 SWI4 SWI6
Exchangeable ABF1 FKH2 GAT3 GCR2 MBP1 NDD1 RLM1 SRD1
SWI4 SWI6 YAP5
AR(1) ABF1 FKH2 GAT3 GCR2 MBP1 NDD1 RLM1 SRD1
SWI4 SWI6 YAP5

lation structures. Liu (2016) studied longitudinal partially linear models with ultra
high-dimensional covariates, and provided a two-stage variable selection procedure
that consists of a quick screening stage and a post-screening refining stage. The pro-
posed approach is based on the partial residual method for dealing with the nonpara-
metric baseline function.
Li et al. (2018) proposed using robust conditional quantile correlation or conditional distribution correlation screening procedures to reduce the dimension of the covariates to a moderate order, and then utilized the kernel smoothing technique to estimate the population conditional quantile correlation and the population conditional distribution correlation for varying coefficient models in ultra-high-dimensional data analysis.
These studies are mainly about continuous longitudinal data, and there are only a few studies on discrete longitudinal data. In addition, when variables are screened in ultra-high-dimensional longitudinal data, the screening methods developed for independent ultra-high-dimensional data cannot be directly applied. Zhu et al. (2017) proposed the projection correlation for measuring dependence between random vectors, which provides a new idea for the study of correlation between vectors. However, there are still many problems to be resolved in the correlation between vectors.
Bibliography

P. S. Albert, and L. M. McShane. A generalized estimating equations approach for


spatially correlated binary data: Applications to the analysis of neuroimaging
data. Biometrics, 51:627–638, 1995.
P. S. Albert. A transitional model for longitudinal binary data subject to nonignorable
missing data. Biometrics, 56(2):602–608, 2000.
P. S. Albert, and D. A. Follmann. A random effects transition model for longitudinal
binary data with informative missingness. Statistica Neerlandica, 57(1):100–
111, 2003.
P. M. E. Altham. Two generalizations of the binomial distribution. Journal of the
Royal Statistical Society. Series C (Applied Statistics), 27(2):162–167, 1978.
T. W. Anderson, and J. B. Taylor. Strong consistency of least squares estimates in
normal linear regression. The Annals of Statistics, 4(4):788–790, 1976.
F. J. Anscombe. The statistical analysis of insect counts based on the negative bino-
mial distribution. Biometrics, 5(2):165–173, 1949.
R. K. Barnwal, and S. R. Paul. Analysis of one-way layout of count data with nega-
tive binomial variation. Biometrika, 75(2):215–222, 1988.
V. P. Bhapkar. On a measure of efficiency of an estimating equation. Sankhyā: The
Indian Journal of Statistics, Series A, 34(4):467–472, 1972.
C. I. Bliss, and R. A. Fisher. Fitting the negative binomial distribution to biological
data. Biometrics, 9(2):176–200, 1953.
D. Böhning, E. Dietz, P. Schlattmann, L. Mendonca, and U. Kirchner. The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. Journal of the Royal Statistical Society: Series A (Statistics in Society), 162(2):195–209, 1999.
N. E. Breslow. Extra-Poisson variation in log-linear models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 33(1):38–44, 1984.
B. M. Brown, and Y-G. Wang. Standard errors and covariance matrices for smoothed
rank estimators. Biometrika, 92:149–158, 2005.
A. Callens, Y. Wang, L. Y. Fu, and B. Liquet. Robust estimation procedure for autore-
gressive models with heterogeneity. Environmental Modeling and Assessment,
2020. doi: 10.1007/s10666-020-09730-w.

V. J. Carey, and Y-G. Wang. Working covariance model selection for generalized
estimating equations. Statistics in Medicine, 30:3117–3124, 2011.
V. J. Carey, S. L. Zeger, and P. Diggle. Modelling multivariate binary data with
alternating logistic regressions. Biometrika, 80:517–526, 1993.
J. P. Carpenter, S. Pocock, and C. J. Lamm. Coping with missing data in clinical
trials: a model-based approach applied to asthma trials. Statistics in Medicine,
21:43–66, 2002.
R. J. Carroll, and D. Ruppert. Robust estimation in heteroscedastic linear models.
The Annals of Statistics, 10(2):429–441, 1982.
G. Casella, and R. L. Berger. Statistical inference, volume 2. Duxbury Pacific Grove,
CA, 2002.
G. Casella, and E. I. George. Explaining the Gibbs sampler. The American Statistician, 46(3):167–174, 1992.
R. N. Chaganty. An alternative approach to the analysis of longitudinal data via
generalized estimating equations. Journal of Statistical Planning and Inference,
63:39–54, 1997.
R. N. Chaganty, and J. Shults. On eliminating the asymptotic bias in the quasi-least squares estimate of the correlation parameter. Journal of Statistical Planning and Inference, 76:145–161, 1999.
G. Chamberlain. Quantile regression, censoring and the structure of wages. In Pro-
ceedings of the Sixth World Congress of the Econometrics Society (eds. C. Sims
and J.J. Laffont), 2:171–209, 1994.
J. Chen, and N. A. Lazar. Selection of working correlation structure in generalized
estimating equations via empirical likelihood. Journal of Computational and
Graphical Statistics, 21:18–41, 2012.
J. Chen, S. Hubbard, and Y. Rubin. Estimating the hydraulic conductivity at the South Oyster site from geophysical tomographic data using Bayesian techniques based on the normal linear regression model. Water Resources Research, 37(6):1603–1613, 2001.
L. Chen, L. J. Wei, and M. I. Parzen. Quantile regression for correlated observa-
tions. Proceedings of the Second Seattle Symposium in Biostatistics: Analysis of
Correlated Data, 179:51–70, 2004.
M.-H. Chen, and J. G. Ibrahim. Maximum likelihood methods for cure rate models
with missing covariates. Biometrics, 57:43–52, 2002.
M-Y. Cheng, T. Honda, J. L. Li, and H. Peng. Nonparametric independence screen-
ing and structure identification for ultra-high dimensional longitudinal data. The
Annals of Statistics, 42:1819–1849, 2014.
S. J. Clark, and J. N. Perry. Estimation of the negative binomial parameter κ by
maximum quasi-likelihood. Biometrics, 45(1):309–316, 1989.
D. R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34(2):187–220, 1972.
D. R. Cox, and N. Reid. Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society, Series B, 49(1):1–39, 1987.
D. R. Cox, and E. J. Snell. Analysis of Binary Data, Second Edition. Chapman &
Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis,
1989.
M. Crowder. Gaussian estimation for correlated binomial data. Journal of the Royal
Statistical Society, Series B, 47:229–237, 1985.
M. Crowder. On the use of a working correlation matrix in using generalised linear
models for repeated measures. Biometrika, 82:407–410, 1995.
M. Crowder. On repeated measures analysis with misspecified covariance structure. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 63(1):55–62, 2001.
S. Datta, and A. Satten, Glen. Rank-sum tests for clustered data. Journal of the
American Statistical Association, 100(471):908–915, 2005.
M. Davidian, and R. J. Carroll. Variance function estimation. Journal of the Ameri-
can Statistical Association, 82(400):1079–1091, 1987.
M. Davidian, and D. M. Giltinan. Nonlinear Models for Repeated Measurement
Data. Chapman & Hall, London, 1995.
C. S. Davis. Semi-parametric and non-parametric methods for the analysis of re-
peated measurements with applications to clinical trials. Statistics in Medicine,
10(12):1959–1980, 1991.
I. Deltour, S. Richardson, and J.-Y. Le Hesran. Stochastic algorithms for Markov models estimation with intermittent missing data. Biometrics, 55(2):565–573, 1999.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
A. P. Dempster. Covariance selection. Biometrics, 28:157–175, 1972.
A. P. Dempster, C. M. Patel, M. R. Selwyn, and A. J. Roth. Statistical and computational aspects of mixed model analysis. Journal of the Royal Statistical Society, Series C (Applied Statistics), 33(2):203–214, 1984.
D. Deng, and S. R. Paul. Score tests for zero-inflation and over-dispersion in gener-
alized linear models. Statistica Sinica, 15:257–276, 2005.
P. Diggle, and M. G. Kenward. Informative drop-out in longitudinal data analysis
(with discussion). Applied Statistics, 43:49–93, 1994.
P. J. Diggle. An approach to the analysis of repeated measurements. Biometrics, 44
(4):959–971, 1988.
P. J. Diggle, K-Y. Liang, and S. L. Zeger. Analysis of Longitudinal Data. Clarendon
Press, 1994.
P. J. Diggle, K-Y. Liang, and S. L. Zeger. Analysis of Longitudinal Data. Oxford
University Press, Oxford, 2002.
A. Donner. A review of inference procedures for the intraclass correlation coefficient
in the one-way random effects model. International Statistical Review/Revue
Internationale de Statistique, 54(1):67–82, 1986.
A. Donner, and S. Bull. Inferences concerning a common intraclass correlation co-
efficient. Biometrics, 39(3):771–775, 1983.
A. Donner, and J. J. Koval. The estimation of intraclass correlation in the analysis of
family data. Biometrics, 36(1):19–25, 1980.
T. M. Durairajan. Optimal estimating function for non-orthogonal model. Journal
of Statistical Planning Inference, 33:381–384, 1992.
F. Eicker. Asymptotic normality and consistency of the least squares estimators for
families of linear regressions. The Annals of Mathematical Statistics, 34(2):447–
456, 1963.
A. Ekholm, and C. Skinner. The muscatine children’s obesity data reanalysed us-
ing pattern mixture models. Journal of the Royal Statistical Society: Series C
(Applied Statistics), 47(2):251–263, 1998.
J. Engel. Models for response data showing extra-Poisson variation. Statistica Neerlandica, 38(3):159–167, 1984.
J. Fan, and R. Li. Variable selection via nonconcave penalized likelihood and its
oracle properties. Journal of the American Statistical Association, 96:1348–
1360, 2001.
J. Fan, and J. Lv. Sure independence screening for ultrahigh dimensional feature
space. Journal of the Royal Statistical Society, Ser. B, 70:849–911, 2008.
J. Fan, Y. Ma, and W. Dai. Nonparametric independence screening in sparse ultra-
high-dimensional varying coefficient models. Journal of the American Statistical
Association, 109:1270–1284, 2014.
Y. L. Fan, G. Y. Qin, and Z. Y. Zhu. Variable selection in robust regression models
for longitudinal data. Journal of Multivariate Analysis, 109:156–167, 2012.
R. A. Fisher. Statistical methods for research workers. Genesis Publishing Pvt Ltd,
1925.
G. M. Fitzmaurice. A caveat concerning independence estimating equations with
multivariate binary data. Biometrics, 51:309–317, 1995.
G. M. Fitzmaurice, N. M. Laird, and J. H. Ware. Applied Longitudinal Analysis, 2nd Edition. Wiley, 2011.
G. M. Fitzmaurice, N. M. Laird, and A. G. Rotnitzky. Regression models for discrete longitudinal responses (with discussion). Statistical Science, 8:284–299, 1993.
G. M. Fitzmaurice, S. R. Lipsitz, J. G. Ibrahim, R. Gelber, and S. Lipshultz. Estimation in regression models for longitudinal binary data with outcome-dependent follow-up. Biostatistics, 7:469–485, 2006.
L. Y. Fu, and Y-G. Wang. Efficient estimation for rank-based regression with clus-
tered data. Biometrics, 68:1074–1082, 2012a.
L. Y. Fu, Y-G. Wang, and Z. D. Bai. Rank regression for analysis of clustered data: A
natural induced smoothing approach. Computational Statistics and Data Analy-
sis, 54:1036–1050, 2010.
L. Y. Fu, Y-G. Wang, and M. Zhu. A Gaussian pseudolikelihood approach for quantile regression with repeated measurements. Computational Statistics and Data Analysis, 84:41–53, 2015.
L. Y. Fu, Y-G. Wang, and F. Cai. A working likelihood approach for robust regres-
sion. Statistical Methods in Medical Research, 29(12):3641–3652, 2020.
L. Y. Fu, and Y-G. Wang. Quantile regression for longitudinal data with a working
correlation model. Computational Statistics and Data Analysis, 56:2526–2538,
2012b.
L. Y. Fu, and Y-G. Wang. Efficient parameter estimation via Gaussian copulas for quantile regression with longitudinal data. Journal of Multivariate Analysis, 143:492–502, 2016.
K. W. Fung, Z. Y. Zhu, B. C. Wei, and X. M. He. Inference diagnostics and outlier tests for semiparametric mixed models. Journal of the Royal Statistical Society, Series B, 64:565–579, 2002.
J. Geweke. Exact inference in the inequality constrained normal linear regression
model. Journal of Applied Econometrics, 1(2):127–141, 1986.
W. R. Gilks, and P. Wild. Adaptive rejection sampling for gibbs sampling. Applied
Statistics, 41:337–348, 1992.
V. P. Godambe, and M. E. Thompson. An extension of quasi-likelihood estimation. Journal of Statistical Planning and Inference, 22:137–152, 1989a.
V. P. Godambe. An optimum property of regular maximum likelihood estimation. The Annals of Mathematical Statistics, 31:1208–1212, 1960.
V. P. Godambe, and C. C. Heyde. Quasi-likelihood and optimal estimation. Interna-
tional Statistical Review, 55:231–244, 1987.
V. P. Godambe, and M. E. Thompson. An extension of quasi-likelihood estimation. Journal of Statistical Planning and Inference, 22(2):137–152, 1989b.
D. Griffin, and R. Gonzalez. Correlational analysis of dyad-level data in the ex-
changeable case. Psychological Bulletin, 118(3):430, 1995.
D. Hall, and T. A. Severini. Extended generalized estimating equations for clustered data. Journal of the American Statistical Association, 93:1365–1375, 1998.
L. P. Hansen. Large sample properties of generalized method of moments estimators.
Econometrica, 50(4):1029–1054, 1982.
J. Hardin, and J. Hilbe. Generalized Estimating Equations. Chapman & Hall, CRC,
2012.
D. A. Harville, and D. R. Jeske. Mean squared error of estimation or prediction
under a general linear model. Journal of the American Statistical Association,
87(419):724–731, 1992.
J. K. Haseman, and L. L. Kupper. Analysis of dichotomous response data from
certain toxicological experiments. Biometrics, 35(1):281–293, 1979.
T. P. Hettmansperger. Statistical Inference Based on Ranks. New York: John Wiley
and Sons, 1984.
C. C. Heyde. Statistical Data Analysis and Inference. Amsterdam: Elsevier, 1987.
C. C. Heyde. Quasi-Likelihood and Its Application: A General Approach to Optimal
Parameter Estimation. New York: Springer, 1997.
L. Y. Hin, and Y-G. Wang. Working-correlation-structure identification in general-
ized estimating equations. Statistics in Medicine, 28(4):642–658, 2009.
D. D. Ho, A. U. Neumann, A. S. Perelson, W. Chen, J. M. Leonard, and M. Markowitz. Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection. Nature, 373:123–126, 1995.
E. B. Hoffman, P. K. Sen, and C. R. Weinberg. Within-cluster resampling.
Biometrika, 88(4):1121–1134, 2001.
L. Hua, and Y. Zhang. Spline-based semiparametric projected generalized estimating
equation method for panel count data. Biostatistics, 13(3):440–454, 2012.
A. Huang, and P. J. Rathouz. Proportional likelihood ratio models for mean regres-
sion. Biometrika, 99:223–229, 2012.
P. J. Huber. The behavior of maximum likelihood estimates under nonstandard con-
ditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statis-
tics and Probability, Volume 1: Statistics, 221–233. University of California
Press, 1967. https://projecteuclid.org/euclid.bsmsp/1200512988.
D. R. Hunter, and R. Li. Variable selection using MM algorithms. The Annals of Statistics, 33:1617–1642, 2005.
J. G. Ibrahim, and G. Molenberghs. Missing data methods in longitudinal studies: a
review. Test, 18:1–43, 2009.
J. G. Ibrahim, M. H. Chen, and S. R. Lipsitz. Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable. Biometrika, 88:551–564, 2001.
J. G. Ibrahim, M. H. Chen, S. R. Lipsitz, and A. H. Herring. Missing-data methods for generalized linear models: A comparative review. Journal of the American Statistical Association, 100:332–346, 2005.
J. G. Ibrahim. Incomplete data in generalized linear models. Journal of the American
Statistical Association, 85(411):765–769, 1990.
J. G. Ibrahim, and S. R. Lipsitz. Parameter estimation from incomplete data in bino-
mial regression when the missing data mechanism is nonignorable. Biometrics,
52(3):1071–1078, 1996.
G. Inan, and L. Wang. PGEE: An R package for analysis of longitudinal data with high-dimensional covariates. The R Journal, 9:393–402, 2017. doi: 10.32614/RJ-2017-030.
G. Inan, J. H. Zhou, and L. Wang. PGEE: Penalized generalized estimating equations in high-dimension. R package version 1.5, https://CRAN.R-project.org/package=PGEE, 2016.
R. I. Jennrich, and M. D. Schluchter. Unbalanced repeated-measures models with
structured covariance matrices. Biometrics, 42:805–820, 1986.
J. M. Jiang. Linear and Generalized Linear Mixed Models and Their Applications.
Springer, New York, 2007.
J. M. Jiang, and T. Nguyen. The Fence Methods. World Scientific, 2015.
B. Jørgensen. The Theory of Dispersion Models. London: Chapman and Hall, 1997.
S. H. Jung. Quasi-likelihood for median regression models. Journal of the American
Statistical Association, 91:251–257, 1996.
S. H. Jung, and Z. Ying. Rank-based regression with repeated measurements data.
Biometrika, 90:732–740, 2003.
M. C. K. Tweedie. An index which distinguishes between some important exponential families. Pre-print issued for the Golden Jubilee Conference at the Indian Statistical Institute, Calcutta, 1981.
G. Kauermann, and R. J. Carroll. A note on the efficiency of sandwich covariance matrix estimation. Journal of the American Statistical Association, 96:1387–1396, 2001.
B. C. Kelly. Some aspects of measurement error in linear regression of astronomical
data. The Astrophysical Journal, 665(2):1489, 2007.
A. I. Khuri, and H. Sahai. Variance components analysis: a selective literature survey.
International Statistical Review/Revue Internationale de Statistique, 53(3):279–
300, 1985.
J. C. Kleinman. Proportions with extraneous variance: single and independent sam-
ples. Journal of the American Statistical Association, 68(341):46–54, 1973.
R. Koenker, and G. Bassett Jr. Regression quantiles. Econometrica, 46:33–50, 1978.
R. Koenker, and V. D’Orey. Computing regression quantiles. Applied Statistics, 36:
383–393, 1987.
E. L. Korn, and A. S. Whittemore. Methods for analyzing panel studies of acute
health effects of air pollution. Biometrics, 35:795–802, 1979.
L. L. Kupper, and J. K. Haseman. The use of a correlated binomial model for the
analysis of certain toxicological experiments. Biometrics, 34(1):69–76, 1978.
N. M. Laird, and J. H. Ware. Random-effects models for longitudinal data. Biomet-
rics, 38:963–974, 1982.
N. Lange, L. Ryan, L. Billard, D. Brillinger, L. Conquest, and J. Greenhouse (eds.). Case Studies in Biometry. New York: Wiley-Interscience, 1994.
J. F. Lawless. Regression methods for Poisson process data. Journal of the American Statistical Association, 82(399):808–815, 1987.
E. W. Lee, and M. Y. Kim. The analysis of correlated panel data using a continuous-time Markov model. Biometrics, 54(4):1638–1644, 1998.
J. C. Lee. Prediction and estimation of growth curves with special covariance struc-
tures. Journal of the American Statistical Association, 83:432–440, 1988.
Y. Lee, and J. A. Nelder. Hierarchical generalised linear models: A synthesis
of generalised linear models, random-effect models and structured dispersions.
Biometrika, 88:987–1006, 2001.
C. L. Leng, and H. P. Zhang. Smoothing combined estimating equations in quantile
regression for longitudinal data. Statistics and Computing, 24:123–136, 2014.
G. Li, L. Zhu, L. Xue, and S. Feng. Empirical likelihood inference in partially linear
single-index models for longitudinal data. Journal of Multivariate Analysis, 101
(3):718–732, 2010.
X. J. Li, X. J. Ma, and J. X. Zhang. Conditional quantile correlation screening proce-
dure for ultrahigh-dimensional varying coefficient models. Journal of Statistical
Planning and Inference, 197:69–92, 2018.
K-Y. Liang, and S. L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22, 1986.
K-Y. Liang, S. L. Zeger, and B. Qaqish. Multivariate regression analyses for categorical data (with discussion). Journal of the Royal Statistical Society, Series B, 54:3–24, 1992.
D. Y. Lin, and Z. Ying. Semiparametric and nonparametric regression analysis of
longitudinal data. Journal of the American Statistical Association, 96(453):103–
126, 2001.
X. Lin, and R. J. Carroll. Semiparametric estimation in general repeated measures
problems. Journal of the Royal Statistical Society: Series B, 68(1):69–88, 2006.
M. J. Lindstrom, and D. M. Bates. Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83(404):1014–1022, 1988.
S. R. Lipsitz, and G. M. Fitzmaurice. Estimating equations for measures of associa-
tion between repeated binary responses. Biometrics, 52(3):903–912, 1996.
S. R. Lipsitz, N. M. Laird, and D. P. Harrington. Finding the design matrix for the
marginal homogeneity model. Biometrika, 77:353–358, 1990.
S. R. Lipsitz, N. M. Laird, and D. P. Harrington. Generalized estimating equations
for correlated binary data: Using the odds ratio as a measure of association.
Biometrika, 78:153–160, 1991.
S. R. Lipsitz, K. Kim, and L. Zhao. Analysis of repeated categorical data using generalized estimating equations. Statistics in Medicine, 13:1149–1163, 1994.
S. R. Lipsitz, J. G. Ibrahim, and G. Molenberghs. Using a Box-Cox transformation in the analysis of longitudinal data with incomplete responses. Applied Statistics, 49:287–296, 2000.
S. R. Lipsitz, G. M. Fitzmaurice, J. G. Ibrahim, and R. Gelber. Parameter estimation
in longitudinal studies with outcome-dependent follow-up. Biometrics, 58:621–
630, 2002.
R. J. A. Little. Pattern-mixture models for multivariate incomplete data. Journal of
the American Statistical Association, 88:125–134, 1993.
R. J. A. Little. Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association, 90:1113–1121, 1995.
R. J. A. Little, and D. B. Rubin. Statistical Analysis with Missing Data, 2nd Edition.
Wiley, 1987.
J. Liu. Feature screening and variable selection for partially linear models with
ultrahigh-dimensional longitudinal data. Neurocomputing, 195:202–210, 2016.
K-J. Lui, J. A. Mayer, and L. Eckhardt. Confidence intervals for the risk ratio under cluster sampling based on the beta-binomial model. Statistics in Medicine, 19(21):2933–2942, 2000.
J. Lv, H. Yang, and C. H. Guo. An efficient and robust variable selection method
for longitudinal generalized linear models. Computational Statistics and Data
Analysis, 82:74–88, 2015.
M. Gosho, C. Hamada, and I. Yoshimura. Selection of working correlation structure
in weighted generalized estimating equation method for incomplete longitudinal
data. Communications in Statistics - Simulation and Computation, 43:62–81,
2014.
T. Maiti, and V. Pradhan. Bias reduction and a solution for separation of logistic
regression with missing covariates. Biometrics, 65(4):1262–1269, 2009.
L. A. Mancl, and T. A. DeRouen. A covariance estimator for GEE with improved small-sample properties. Biometrics, 57:126–134, 2001.
L. A. Mancl, and B. G. Leroux. Efficiency of regression estimates for clustered data.
Biometrics, 52:500–511, 1996.
K. G. Manton, M. A. Woodbury, and E. Stallard. A variance components approach to categorical data models with heterogeneous cell populations: Analysis of spatial gradients in lung cancer mortality rates in North Carolina counties. Biometrics, 37(2):259–269, 1981.
B. H. Margolin, B. S. Kim, and K. J. Risko. The Ames Salmonella/microsome mutagenicity assay: Issues of inference and validation. Journal of the American Statistical Association, 84(407):651–661, 1989.
P. McCullagh, and J. A. Nelder. Generalized Linear Models (2nd Edition). Chapman
& Hall, 1989.
C. E. McCulloch, and S. R. Searle. Generalized, Linear and Mixed Models. John
Wiley & Sons, New York, USA, 2001.
G. J. McLachlan. On the EM algorithm for overdispersed count data. Statistical Methods in Medical Research, 6(1):76–98, 1997.
R. Mian, and S. Paul. Estimation for zero-inflated over-dispersed count data model with missing response. Statistics in Medicine, 35(30):5603–5624, 2016.
A. J. Miller. Subset Selection in Regression. London: Chapman and Hall, 1990.
G. Molenberghs, and G. Verbeke. Models for Discrete Longitudinal Data. Springer, New York, 2005.
A. Muñoz, V. Carey, J. P. Schouten, M. Segal, and B. Rosner. A parametric family
of correlation structures for the analysis of longitudinal data. Biometrics, 48:
733–742, 1992.
A. Munoz, B. Rosner, and V. Carey. Regression analysis in the presence of hetero-
geneous intraclass correlations. Biometrics, 42(3):653–658, 1986.
J. A. Nelder, and D. Pregibon. An extended quasi-likelihood function. Biometrika,
74:221–232, 1987.
J. M. Neuhaus, and J. D. Kalbfleisch. Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics, 54:638–645, 1998.
J. Neyman. Optimal asymptotic tests of composite statistical hypotheses. Probability
and statistics, 57:213–234, 1959.
V. Núñez Anton, and G. G. Woodworth. Analysis of longitudinal data with unequally
spaced observations and time-dependent correlated errors. Biometrics, 50:445–
456, 1994.
R. J. O’Hara-Hines. Comparison of two covariance structures in the analysis of clustered polytomous data using generalized estimating equations. Biometrics, 54:312–316, 1998.
A. B. Owen. Empirical likelihood. New York: Chapman and Hall, CRC, 2001.
M. C. Paik. The generalized estimating equation approach when data are not missing
completely at random. Journal of the American Statistical Association, 92:1320–
1329, 1997.
W. Pan. On the robust variance estimator in generalised estimating equations. Biometrika, 88:901–906, 2001a.
W. Pan. Akaike’s information criterion in generalized estimating equations. Biomet-
rics, 57:120–125, 2001b.
H. D. Patterson, and R. Thompson. Recovery of inter-block information when block sizes are unequal. Biometrika, 58:545–554, 1971.
S. R. Paul. Maximum likelihood estimation of intraclass correlation in the analysis of
familial data: Estimating equation approach. Biometrika, 77(3):549–555, 1990.
S. R. Paul. Quadratic estimating equations for the estimation of regression and dis-
persion parameters in the analysis of proportions. Sankhya B, 63:43–55, 2001.
S. R. Paul, and T. Banerjee. Analysis of two-way layout of count data involving
multiple counts in each cell. Journal of the American Statistical Association, 93
(444):1419–1429, 1998.
S. R. Paul, and R. L. Plackett. Inference sensitivity for Poisson mixtures. Biometrika, 65(3):591–602, 1978.
S. R. Paul. Analysis of proportions of affected foetuses in teratological experiments.
Biometrics, 38:361–370, 1982.
S. R. Paul, and A. S. Islam. Analysis of proportions based on parametric and semi-
parametric models. Biometrics, 51:1400–1410, 1995.
M. S. Pepe, and G. L. Anderson. A cautionary note on inference for marginal regres-
sion models with longitudinal data and general correlated response data. Com-
munications in Statistics, Part B - Simulation and Computation, 23:939–951,
1994.
W. W. Piegorsch. Maximum likelihood estimation for the negative binomial disper-
sion parameter. Biometrics, 46(3):863–867, 1990.
J. C. Pinheiro, and D. M. Bates. Mixed-Effects Models in S and S-PLUS. Springer-
Verlag, New York, 2000.
J. S. Preisser, K. K. Lohman, and P. J. Rathouz. Performance of weighted estimating
equations for longitudinal binary data with drop-outs missing at random. Statis-
tics in Medicine, 21:3035–3054, 2002.
R. L. Prentice, and L. P. Zhao. Estimating equations for parameters in means and
covariances of multivariate discrete and continuous responses. Biometrics, 47:
825–839, 1991.
C. J. Price, C. A. Kimmel, J. D. George, and M. C. Marr. The developmental toxicity
of ethylene glycol in mice. Fundamental and Applied Toxicology, 81:113–127,
1985.
J. Qin, and J. Lawless. Empirical likelihood and generalized estimating equations.
The Annals of Statistics, 22:300–325, 1994.
A. Qu, B. G. Lindsay, and B. Li. Improving generalised estimating equations using quadratic inference functions. Biometrika, 87:823–836, 2000.
A. Qu, and P. X. K. Song. Assessing robustness of generalised estimating equations
and quadratic inference functions. Biometrika, 91(2):447–459, 2004.
A. Qu, J. J. Lee, and B. G. Lindsay. Model diagnostic tests for selecting informative
correlation structure in correlated data. Biometrika, 95(4):891–905, 12 2008.
A. E. Raftery, D. Madigan, and J. A. Hoeting. Bayesian model averaging for linear
regression models. Journal of the American Statistical Association, 92(437):
179–191, 1997.
J. N. K. Rao, and I. Molina. Small Area Estimation, 2nd ed. Wiley, New York, 2015.
C. R. Rao. Linear Statistical Inference and its Applications. Wiley, 1965.
P. J. Rathouz, and L. Gao. Generalized linear models with unspecified reference
distribution. Biostatistics, 10:205–218, 2009.
J. M. Robins, A. Rotnitzky, and L. P. Zhao. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90:106–121, 1995.
J. M. Robins, and A. Rotnitzky. Semiparametric efficiency in multivariate regression
models with missing data. Journal of the American Statistical Association, 90
(429):122–129, 1995.
B. Rosner. Multivariate methods in ophthalmology with application to other paired-
data situations (c/r: V46 p523-531). Biometrics, 40:1025–1035, 1984.
G. J. S. Ross, and D. A. Preece. The negative binomial distribution. Journal of the
Royal Statistical Society: Series D (The Statistician), 34(3):323–335, 1985.
A. Rotnitzky, and N. P. Jewell. Hypothesis testing of regression parameters in semi-
parametric generalized linear models for cluster correlated data. Biometrika, 77:
485–497, 1990.
P. J. Rousseeuw, and B. C. van Zomeren. Unmasking multivariate outliers and lever-
age points. Journal of the American Statistical Association, 85:633–639, 1990.
D. B. Rubin. Inference and missing data. Biometrika, 63:581–592, 1976.
S. Galbraith, J. A. Daniel, and B. Vissel. A study of clustered data and approaches
to its analysis. Journal of Neuroscience, 30(32):10601–10608, 2010.
K. Saha, and S. Paul. Bias-corrected maximum likelihood estimator of the negative
binomial dispersion parameter. Biometrics, 61(1):179–185, 2005.
H. Sahai, A. Khuri, and C. H. Kapadia. A second bibliography on variance compo-
nents. Communications in Statistics-theory and Methods, 14:63–115, 1985.
H. Sahai. A bibliography on variance components. International Statistical Re-
view/Revue Internationale de Statistique, 47(2):177–222, 1979.
S. K. Sahu, and G. O. Roberts. On convergence of the EM algorithm and the Gibbs sampler. Statistics and Computing, 9(1):55–64, 1999.
D. E. W. Schumann, and R. A. Bradley. The comparison of the sensitivities of similar
experiments: Theory. The Annals of Mathematical Statistics, 13(4):902–920,
1957.
S. R. Searle, G. Casella, and C. E. McCulloch. Variance Components, volume 391.
John Wiley & Sons, 2009.
P. E. Shrout, and J. L. Fleiss. Intraclass correlations: uses in assessing rater reliability.
Psychological Bulletin, 86(2):420, 1979.
J. Shults, and R. N. Chaganty. Analysis of serially correlated data using quasi-least squares. Biometrics, 54:1622–1630, 1998.
J. Shults, and J. M. Hilbe. Quasi-Least Squares Regression. Chapman & Hall/CRC
Monographs on Statistics & Applied Probability. Taylor & Francis, 2014.
S. Sinha, and T. Maiti. Analysis of matched case-control data in presence of nonig-
norable missing exposure. Biometrics, 64(1):106–114, 2008.
A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique de l’Université de Paris, 8:229–231, 1959.
P. X.-K. Song. Correlated Data Analysis: Modeling, Analytics, and Applications. Springer, New York, 2007.
P. X.-K. Song, Z. C. Jiang, E. Park, and A. Qu. Quadratic inference functions in
marginal models for longitudinal data. Statistics in Medicine, 28(29):3683–
3696, 2009.
M. F. Sowers, M. Crutchfield, J. F. Randolph, B. Shapiro, B. Zhang, M. L. Pietra, and M. A. Schork. Urinary ovarian and gonadotrophin hormone levels in premenopausal women with low bone mass. Journal of Bone and Mineral Research, 13:1191–1202, 1998.
P. T. Spellman, G. Sherlock, M. Q. Zhang, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9:3273–3297, 1998.
B. C. Sutradhar, and K. Das. On the efficiency of regression estimators in generalised
linear models for longitudinal data. Biometrika, 86(2):459–465, 1999.
P. F. Thall, and S. C. Vail. Some covariance models for longitudinal count data with
overdispersion. Biometrics, 46:657–671, 1990.
W. A. Thompson Jr. The problem of negative estimates of variance components. The
Annals of Mathematical Statistics, 33:273–289, 1962.
R. J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society, Ser. B, 58:267–288, 1996.
P. Tishler, A. Donner, J. O. Taylor, and E. H. Kass. Familial aggregation of blood
pressure in very young children. CVD Epidemiology Newsletter, 22(45), 1977.
D. Trégouët, P. Ducimetière, and L. Tiret. Testing association between candidate-gene markers and phenotype in related individuals, by use of estimating equations. American Journal of Human Genetics, 61:189–199, 1997.
A. B. Troxel, S. L. Lipsitz, and T. A. Brennan. Weighted estimating equations with
nonignorably missing response data. Biometrics, 53:857–869, 1997.
R. S. Tsay. Regression models with time series errors. Journal of the American
Statistical Association, 79(385):118–124, 1984.
M. C. K. Tweedie. An index which distinguishes between some important exponen-
tial families. In Statistics: Applications and new directions: Proc. Indian Statis-
tical Institute Golden Jubilee International Conference, volume 579, 579–604,
1984.
G. Verbeke. Models for Discrete Longitudinal Data. Springer Series in Statistics.
Springer, 2005.
L. Wang, J. H. Zhou, and A. Qu. Penalized generalized estimating equations for
high-dimensional longitudinal data analysis. Biometrics, 68(1):353–360, 2012a.
L. Wang, and A. Qu. Consistent model selection and data-driven smooth tests for
longitudinal data in the estimating equations approach. Journal of the Royal
Statistical Society: Series B, 71(1):177–190, 2009.
P. Wang, G-F. Tsai, and A. Qu. Conditional inference functions for mixed-effects
models with unspecified random-effects distribution. Journal of the American
Statistical Association, 107(498):725–736, 2012b.
Y-G. Wang. A quasi-likelihood approach for ordered categorical data with overdis-
persion. Biometrics, 52(4):1252–1258, 1996.
Y-G. Wang. Estimating equations for removal data analysis. Biometrics, 55(4):
1263–1268, 1999a.
Y-G. Wang. Estimating equations with nonignorably missing response data. Biomet-
rics, 55(3):984–989, 1999b.
Y-G. Wang, and V. J. Carey. Working correlation structure misspecification, esti-
mation and covariate design: Implications for generalised estimating equations
performance. Biometrika, 90(1):29–41, 2003.
Y-G. Wang, and V. J. Carey. Unbiased estimating equations from working correla-
tion models for irregularly timed repeated measures. Journal of the American
Statistical Association, 99(467):845–853, 2004.
Y-G. Wang, and L-Y. Hin. Modeling strategies in longitudinal data analysis: Covariate, variance function and correlation structure selection. Computational Statistics and Data Analysis, 54(5):3359–3370, 2009.
Y-G. Wang, and X. Lin. Effects of variance-function misspecification in analysis of
longitudinal data. Biometrics, 61(2):413–421, 2005.
Y-G. Wang, and Y. D. Zhao. Weighted rank regression for clustered data analysis.
Biometrics, 64:39–45, 2008.
Y-G. Wang, and Y. N. Zhao. A modified pseudolikelihood approach for analysis of
longitudinal data. Biometrics, 63(3):681–689, 2007.
Y-G. Wang, and M. Zhu. Rank-based regression for analysis of repeated measures.
Biometrika, 93:459–464, 2006.
Y-G. Wang, X. Lin, and M. Zhu. Robust estimating functions and bias correction for
longitudinal data analysis. Biometrics, 61(3):684–691, 2005.
Y-G. Wang, Q. Shao, and M. Zhu. Quantile regression without the curse of un-
smoothness. Computational Statistics and Data Analysis, 52:3696–3705, 2009.
J. H. Ware, S. Lipsitz, and F. E. Speizer. Issues in the analysis of repeated categorical outcomes. Statistics in Medicine, 7:95–107, 1988.
R. W. M. Wedderburn. Quasi-likelihood functions, generalized linear models and the Gauss-Newton method. Biometrika, 61:439–447, 1974.
L. J. Wei, and J. M. Lachin. Two-sample asymptotically distribution-free tests for
incomplete multivariate observations. Journal of the American Statistical Asso-
ciation, 79(387):653–661, 1984.
C. S. Weil. Selection of the valid number of sampling units and a consideration of
their combination in toxicological studies involving reproduction, teratogenesis
or carcinogenesis. Food and cosmetics toxicology, 8(2):177–182, 1970.
M. Cho, R. E. Weiss, and M. Yanuzzi. On bayesian calculations for mixture priors
and likelihoods. Statistics in Medicine, 18:1555–1570, 1999.
H. White. Maximum likelihood estimation of misspecified models. Econometrica,
50(1):1–25, 1982.
P. Whittle. Gaussian estimation in stationary time series. Bulletin of the International
Statistical Institute, 39:1–26, 1961.
D. A. Williams. Dose-response models for teratological experiments. Biometrics,
43:1013–1016, 1987.
J. M. Williamson, S. Datta, and G. A. Satten. Marginal analyses of clustered data
when cluster size is informative. Biometrics, 59(1):36–42, 2003.
R. F. Woolson, and W. R. Clarke. Analysis of categorical incomplete longitudinal
data. Journal of the Royal Statistical Society. Series A (General), 147(1):87–99,
1984.
H. Wu, and A. A. Ding. Population HIV-1 dynamics in vivo: applicable models and inferential tools for virological data from AIDS clinical trials. Biometrics, 55:410–418, 1999.
P. R. Xu, L. X. Zhu, and Y. Li. Ultrahigh dimensional time course feature selection. Biometrics, 70:356–365, 2014.
G. Yin, and J. Cai. Quantile regression models with multivariate failure time data.
Biometrics, 61:151–161, 2005.
S. L. Zeger, and K-Y. Liang. Longitudinal data analysis for discrete and continuous
outcomes. Biometrics, 42(1):121–130, 1986.
S. L. Zeger, and B. Qaqish. Markov regression models for time series: A quasi-
likelihood approach. Biometrics, 44:1019–1031, 1988.
L. Zeng, and R. J. Cook. Transition models for multivariate longitudinal binary data.
Journal of the American Statistical Association, 102(477):211–223, 2007.
C-H. Zhang, and J. Huang. The sparsity and bias of the lasso selection in high-
dimensional linear regression. The Annals of Statistics, 36(4):1567–1594, 2008.
D. Zhang, X. H. Lin, J. Raz, and M. F. Sowers. Semiparametric stochastic mixed
models for longitudinal data. Journal of the American Statistical Association,
93:710–719, 1998.
W. P. Zhang, C. L. Leng, and C. Y. Tang. A joint modelling approach for longitudinal
studies. Journal of the Royal Statistical Society: Series B, 77(1):219–238, 2015.
X. M. Zhang, and S. Paul. Modified Gaussian estimation for correlated binary data. Biometrical Journal, 55(6):885–898, 2013.
L. P. Zhao, and R. L. Prentice. Correlated binary regression using a quadratic expo-
nential model. Biometrika, 77:642–648, 1990.
L. P. Zhu, K. Xu, R. Li, and W. Zhong. Projection correlation between two random
vectors. Biometrika, 104:829–843, 2017.
A. Ziegler. Generalized Estimating Equations (Lecture Notes in Statistics). Springer, 2011.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
H. Zou, and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Ser. B, 67:301–320, 2005.
S. J. Zyzanski, S. A. Flocke, and L. M. Dickinson. On the nature and analysis of
clustered data. The Annals of Family Medicine, 2(3):199–200, 2004.
Author Index

Albert, P.S., 51, 192 Chen, M.H., 155


Altham, P.M.E., 124 Cheng, M.-Y., 200
Anderson, G., 50 Cho, M., 97
Anderson, T.W., 150 Clark, S.J., 182
Anscombe, F., 182 Conquest, L., 148
Cook, R.J., 192
Bai, Z.D., 106 Cox, D.R., 23, 122, 187
Banerjee, T., 149, 182 Crowder, M., 25, 53, 62, 69, 72, 75
Barnwal, 149 Crowder, P.M.E., 124
Barnwal, R.K., 182 Crowder, W., 82
Bassett, G.J., 91, 99 Crutchfield, M., 7
Bates, D.M., 52, 180
Berger, R.L., 145 D’Orey, V., 102
Bhapkar, M.E., 119 Dai, W., 200
Billard, L., 148 Das, K., 50, 74
Bliss, C.I., 182 Datta, S., 134
Bohning, D. , 151 Davidian, M., 25, 31, 52, 54
Bradley, R., 114 Deltour, I., 192
Brennan, T. , 169 Dempster, A., 52
Breslow, N.E., 182 Dempster, A.P., 135, 144
Brillinger, D., 148 Demster, A., 144
Brown, D.M., 104 Deng, D., 149
Bull, S., 117 Deng, D. , 156
DeRouen, T.A., 70
Cai, J., 104 Dickinson, L.M., 114
Carey, V., 118, 123 Dietz, E., 151
Carey, V.J., 41, 53, 106 Diggle P., 185
Carpenter, J.P., 143 Diggle, P., 11, 71, 181
Carroll, R., 71 Diggle, P. J., 35
Carroll, R.J., 25, 52, 54 Diggle, P.J., 58
Casella, G., 116, 145 Diggle, P.J. , 164
Casella, G. , 153 Ding, A., 5
Chaganty, R., 59, 61 Ding, A.A., 5
Chaganty, R.N., 62, 72 Donner, A., 114, 116, 117, 123
Chamberlain, G., 102 Ducimetiere, P., 59
Chen, J., 85, 150 Durairajan, T.M., 95
Chen, L., 101
Chen, M.-H., 143, 145 Eckhardt, L., 113

Eicker, F., 71 Huang, J. , 150
Engel, J., 182 Hubbard, S. , 150
Eric, S., 182 Huber, P.J., 71
Hunter, D.R., 194
Fan, J., 193, 200
Fan, Y.L., 109, 196 Ibrahim, J.G., 143, 145, 150, 153
Fisher, R.A., 115, 182 Inan, G., 199
Fitzmaurice, 131 Islam, A., 125
Fitzmaurice, G., 41, 50, 74, 128–130 Islam, A. , 168
Fitzmaurice, G.M., 63, 65
Fitzmaurice,G., 52 Jennrich, R. , 159
Fleiss, J.L., 116 Jewell, N.P., 74, 110
Flocke, S.A., 114 Jiang, J. M., 131
Follmann, 192 Jiang, J.M., 79, 184
Fu, L.Y., 95, 102, 104, 106 Jorgensen, B., 22
Fung, K.W., 8 Jung, S.H., 91, 92, 104, 133, 135
Jørgensen, B., 31
Galbraith, 114
Gao, L., 23 Kalbfleish, J.D., 94
George, E.I., 153 Kass, E.H., 123
George, J.D., 129 Kauerman, G., 71
Geweke, J., 150 Kelly, B.C., 150
Gilks, W.R. , 165 Kenward, M., 164
Giltinan, D., 31, 52 Khuri, A.I., 116
Godambe, V.P., 119 Kim, B.S., 182
Gonzalez, R., 116 Kim, K., 50
Goodambe, V., 168 Kim, M.Y., 192
Greenhouse, J., 148 Kimmel, C.A., 129
Griffin, D., 116 Kirchner, U. , 151
Guo, C.H., 109, 196 Kleinman, J. C., 125
Koenker, R., 91, 99, 102
Hall, D., 59 Kopadia,, 116
Hardin, J., 82 Korn, E., 187
Harrington, D., 59 Koval, J.J., 117, 123
Haseman, J., 124 Kupper, L., 124
Haseman, J.K., 124 Kupper, L.L., 124
Hastie, T., 193
He, X.M., 8 Lachin, J.M., 134
Hettmansperger, T.P., 91 Laird, N., 52, 59, 128
Heyde, C., 45, 55, 94 Laird, N. M., 38
Hilbe, J., 62, 82 Laird, N.M., 144, 159, 179
Hin, L., 44, 83 Laird, S., 129
Ho, D.D., 5 Lamm, C.J., 143
Hoeting, J. A., 150 Lange, N., 148
Hoffman, E.B., 134 Lawless, J., 84
Honda, T., 200 Lawless, J.F., 182
Lazar, N.A., 85 Mian, R., 149
Le Hesran, J.-Y., 192 Mian, R. , 156
Lee, E.W., 192 Miller, A.J., 79
Lee, J. C., 35 Molenberghs, G., 142, 143, 178
Lee, J.C., 199 Molina, I., 113
Lee, Y., 168 Munoz, A., 118, 123
Leng, C., 37
Leng, C.L., 106 Núñez-Anton , V., 35
Leroux, B., 50 Nelder, J., 24, 43, 168, 183
Li, B., 72, 99 Nelder, J.A., 44, 81, 131
Li, J.L., 200 Neuhaus, J.M., 94
Li, R., 193, 194, 203 Neumann, A.U., 5
Li, X.J., 202 Neyman, J., 126
Li, Y., 201 Nguyen, T., 79
Liang, K-Y., 2, 48, 51, 58, 59, 71, 184
Liang, K-Y. , 168 O’Hara-Hines, R., 52
Liang, K.-Y., 11, 33, 99, 181, 185 Owen, A.B., 84
Lin, X.H., 8
Lindsay, B.G., 72, 99 Pan, W., 44, 69, 81, 82, 86
Lindstrom, M.J., 180 Parzen, M.I., 101
Lipsitz, S., 50, 59, 184 Patel, C.M., 135
Lipsitz, S., 150, 153, 155 Patterson, H., 83
Lipsitz, S.R., 63, 65 Paul, S., 8, 72, 123–127, 149, 169, 182
Lipsitz, S.R., 169 Paul, S.R., 118, 149, 156, 168
Little, R.J.A., 138, 142, 143 Pepe, M., 50
Liu, J., 202 Perry, J.N., 182
Lohman, K., 169 Piegorsch, W.W., 149, 182
Lui, K.-J., 113
Lv, J., 109, 196, 200
Pietra, M. L., 7
Ma, X.J., 202 Pinheiro, J.C., 52
Ma, Y., 200 Plackett, R., 72, 149, 168, 182
Madigan, D., 150 Pocock, S., 143
Maiti, T., 150, 153 Pradhan, V. , 150, 153
Mancl, L., 50, 70 Preece, D.A., 182
Manton, K.G., 182 Pregibon, D., 24, 43, 168, 183
Margolin, B., 182 Preisser, J., 169
Margolin, B.H., 182 Prentice, R. L., 125, 149
Marr, M.C., 129 Prentice, R.L., 51, 52, 59, 72
Mayer, J.A., 113 Price, C.J., 129
McCullagh, P., 44, 81, 131
Qaqish, B., 59, 184
McCulloch, C.E., 116, 184
Qin, G.Y., 109, 196
McLachlan, G.J., 184
Qin, J., 84
McShane, L.M., 51
Qu, A., 72, 99, 194
Mendonca, L., 151
Raftery, A.E., 150 Spellman, P., 199
Ran, J.N.K., 113 Sutradhar, B.C., 50, 74
Randolph, J. F., 7
Rao, C.R., 144 Tang, C. y., 37
Rathouz, C. R., 23 Tarone, 126
Rathouz, P., 169 Taylor, J.B., 150
Raz, J., 8 Taylor, J.O., 123
Regier, M.H., 192 Thall, P.F., 7, 71
Reid, N., 122 Thompson, M.E., 168
Richardson, S., 192 Thompson, R., 83
Risko, K.J., 182 Thompson, W.A.J., 179
Robin, D.B., 168 Tibshirani, R.J., 193
Robin, J.M. , 169 Tiret, L., 59
Rosner, B., 117, 118, 123 Tishler, R.J., 123
Ross, G.J.S., 182 Tregouet, D., 59
Roth, A.J., 135 Troxel, A., 169
Rotnitzky, A., 52, 74, 169 Tsay, R.S., 186
Rousseeuw, A., 110 Tweedie, M.C.K., 31
Rubin, D.B., 138, 142–144
Rubin, Y., 150 Ueki, M., 198
Ryan, L., 148
Vail, S.C., 7, 71
Saha, K., 182, 184 Verbeke, G., 178
Sahai, H., 116
Wang, L., 194, 199
Satten, G.A., 134
Wang, Y-G., 24, 25, 41, 44, 47, 53, 65,
Schlattmann, P. , 151
83, 91–95, 102, 104, 106,
Schluchter, M., 159
110, 111, 134, 170, 184
Schork, M. A., 7
Ware, 129
Schumann, D., 114
Ware, , 128
Searle, S.R., 116, 184
Ware, J. H., 38
Selwyn, M.R., 135
Ware, J.H, 184
Sen, P.K., 134
Ware, J.H., 159, 179
Severini, T.A., 59
Wedderburn, R. W. M., 24
Shao, Q., 104
Wedderburn, R.W.M., 168
Shapiro, B., 7
Wei, B.C., 8
Sherlock, G., 199
Wei, L.J., 101, 134
Shrout, P.E., 116
Weil, C.S., 124
Shults, J., 59, 61, 62
Weinberg, C.R., 134
Sinha, S. , 150, 153
Weiss, R.E., 97
Snell, E.J., 187
White, H., 42, 71
Song, P. X.-K., 139
Whittemore, A., 187
Song, P.X., 138
Whittle, P., 25, 60
Sowers, M. F., 7
Wild, P., 165
Sowers, M.F., 8
Williams, D.A., 9, 124, 134
Speizer, F.E., 184
Woodbury, M.A., 182
Woodworth , G. G., 35 Zhang, W. , 169
Wu, H., 5 Zhao, L., 50
Zhao, L.P., 51, 52, 59, 72
Xu, K., 203 Zhao, Y., 47
Xu, P.R., 201 Zhao, Y.D., 91–93, 111, 134
Zhao, Y.N., 25
Yang, H., 109, 196 Zhong, W., 203
Yanuzzi, M., 97 Zhou, J., 194
Yin, G., 104 Zhou, J.H., 199
Yin, Z., 135 Zhu, L., 203
Ying, Z., 91, 92, 133 Zhu, L.X., 201
Zhu, M., 94, 104
Zeger, S., 11, 71
Zhu, Z.Y., 8, 109, 196
Zeger, S. L., 33
Ziegler, A., 22
Zeger, S.L., 2, 48, 51, 58, 99, 168,
Zou, H., 193
181, 184, 185, 187
Zyzanski, S.J., 114
Zeger, S.Y., 59
Zeng, L., 192
Zhang, B., 7
Zhang, C.-H., 150
Zhang, D., 8
Zhang, H.P., 106
Zhang, J.X., 202
Zhang, M.Q., 199
Zhang, W., 37
Subject Index

Abnormalities, 11 Dispersion function, 93


ACTG, 5 Dropout mechanism, 139
AIC, 44, 79
Ammonia nitrogen, 13 EAIC, 86
Anti-epileptic drug progabide, 6 EBIC, 86
AR(1), 62, 65, 74 EM, 144
Asymptotic covariance, 172 Empirical likelihood ratio function,
Asymptotic variance, 45, 121 84
ENAR, 63
Beta-binomial, 125 EQL, 43
BIC, 79 Equicorrelation structures, 65
Biological studies, 1 EXCH, 62
BMI, 8 Exch, 62
Exchangeable, 34
Canonical, 23 Exchangeable working correlation,
Canonical link, 27 87
Cervical dilation, 11 Exponential distribution, 46
Chlorophyll a, 13 Exponential distribution family, 26
Cholesky decomposition, 63 Exponential square loss function, 91
CIC, 83 Exponential squared loss function,
Clinical trial, 2 110
Cold deck imputation, 142 Extended QL, 54
Compound correlation Structure, 34 Extra-Poisson variation, 71
Conditional estimating function, 171
Correlation matrix, 2 Fisher scoring iterative procedure,
Correlation structure selection, 79 111
Covariance, 1, 7 Fisher-scoring procedure, 144
Covariance assumption, 46 Fixed effect, 175
Covariance matrix, 1 Foetuses, 10, 11
Covariate selection, 79 Forestry studies, 1
Covariates, 1, 12
Cyanophyte, 13 GAIC, 83
Gaussian copulas, 102
Damped exponential correlation struc- Gaussian likelihood, 45
ture, 36 Gaussian pseudolikelihood criterion,
Density function, 79 106
Dependent data analysis, 33 Gaussian score function, 46
Dispersion, 7 Gaussian working likelihood, 53

GBIC, 83 Maximum likelihood, 179
GEE, 41, 48, 52, 74 Maximum likelihood estimator, 179,
GEE2, 52, 73 182
Generalized linear models, 2 MCAR, 137
Generalized weighted least squares, Mean assumption, 46
54 Median, 133
GLM, 26, 41, 48, 49 Medical studies, 1
Minimum Covariance Determinant,
Hard penalty function, 193 110
Heterogeneity, 2 Minimum Volume Ellipsoid, 110
Heteroscedasticity, 7, 78 ML, 52
HIV, 5 MLE, 42
Hot deck imputation, 142 MNAR, 137
Huber’s function, 91 Model misspecification, 53
Model selection, 79
Implantation, 129 Moment estimation, 55
Implantations, 11 Moment estimator, 74
Imputation by substitution, 142 Mortality rate, 9
Indicator function, 101 Multiple logistic regression, 117
Induced smoothing method, 104 Multivariate normal distribution, 82
Interactive effect, 79
Interclass correlation coefficient, 114 NASBA, 5
Intraclass correlation, 59 Non-parametric, 11
Intracluster correlation coefficient, 123 Normal log-likelihood, 82
Intracoastal correlation, 114
Outlier, 89, 91
Jacobian matrix, 65, 68 Over-dispersion parameter, 89
Overdispersion parameter, 49
Labor Market Experience, 12
Lagrange multiplier method, 84 Patterned correlation matrix, 33
Least absolute shrinkage and selec- Penalized loss function , 193
tion operator, 193 PI, 5
Likelihood function, 81 Poisson variance, 41
Linear exponential family, 23 Population averaged effects, 177
Link function, 26, 80 Power transformation, 58
logit link, 187 Profile analysis, 48
Progesterone, 8
MA(1), 62 Proportional hazard, 23
Madras Longitudinal Schizophrenia pseudolikelihood, 25
Study, 11 Psychiatric symptoms, 11
Malformation rates, 9
Mallows-based weighted function, 110 QIC, 83, 88
MAR, 137 QL, 43
Marginal mean vector, 1 QLS, 61, 62
Marginal model, 2 Quadratic inference function, 106
Marginal models, 65 Quantile regression, 91
Quasi-likelihood, 23, 43, 46, 81 Uniform correlation structure, 34
Quintic polynomial, 8 Utero, 11

Random effect, 2, 175 Variance, 1


Random effects model, 116 Variance function, 86
Random intercept logistic regression Variance interaction effect, 79
model, 181 Viral load, 5
Random-effects model, 159
Rank regression, 133 Wald test, 130
Rank-based method, 91 Wedderburn form, 43
Regression imputation, 142 Weighted GEE method, 168
REML, 52, 130 Wilcoxon-Mann-Whitney rank statis-
Repeated observations, 41 tic, 91
Resampling method, 97, 102 Within-subject correlations, 41, 133
Restricted maximum likelihood, 179 Working correlation, 83
Robust generalized estimating equa- Working correlation matrix, 47, 48,
tion, 109 80, 106
RTI, 5 Working correlation structure, 34, 51,
55
Sandwich estimator, 71 Working covariance, 51, 59
Scale parameter, 1 Working covariance function, 49
Schizophrenia, 11 Working covariance matrix, 53
Semiparametric, 2, 8 Working independence model, 50
Smoothly clipped absolute deviation Working likelihood, 24, 60
penalty, 193 Working likelihood function, 48
Standard errors, 1 Working parameters, 48, 53
Subsampling, 133
Sufficient statistic, 25

Temporal changes, 1
Tilted exponential family, 23
Toeplitz matrix, 85
Toxicological study, 123
Two-level generalized linear model,
131
