Analysis of Correlated Data with SAS and R
Third Edition

Mohamed M. Shoukri
Mohammad A. Chaudhary

Assuming a working knowledge of SAS and R, this text provides the necessary concepts and applications for analyzing clustered and correlated data from epidemiologic and medical investigations. An accompanying CD-ROM contains all the data sets in the book along with the SAS and R codes.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
Preface to the First Edition
M.M. Shoukri
V.L. Edge
Guelph, Ontario
July 1995
Preface to the Second Edition
The main structure of the book has been kept similar to that of the first edition. To keep pace with recent advances in the science of statistics, more topics have been covered. In Chapter 2 we introduce the coefficient of variation as a measure of reproducibility, and the comparison of two dependent reliability coefficients. Testing for trend using the Cochran-Armitage chi-square under cluster randomization has been introduced in Chapter 4. In this chapter we also discuss the application of PROC GENMOD in SAS, which implements the GEE approach, and "multilevel analysis'' of clustered binary data under the "generalized linear mixed effects models,'' using Schall's algorithm and the GLIMMIX SAS macro. In Chapter 5 we added two new sections on modeling seasonal time series: one uses a combination of polynomials to describe the trend component and trigonometric functions to describe seasonality, while the other is devoted to modeling seasonality using the more sophisticated ARIMA models. Chapter 6 has been expanded to include the analysis of repeated measures experiments under the "linear mixed effects models,'' using PROC MIXED in SAS. We added Chapter 7 to cover the topic of survival analysis, including a brief discussion on the analysis of correlated survival data.
An important feature of the second edition is that all the examples are solved using the SAS package. We also provide all the SAS programs that are needed to understand the material in each chapter.
Preface to the Third Edition
It was brought to our attention by many of our colleagues that the title of the
previous edition did not reflect the focus of the book, which was the analysis
of correlated data. We therefore decided to change the title of the third edition
to Analysis of Correlated Data with SAS and R. We believe that the change in the
title is appropriate and reflects the main focus of the book.
The fundamental objectives of the new edition have been kept similar to those of the previous two editions. However, this edition contains major structural changes. The first chapter of the previous editions has been deleted and replaced with a new chapter devoted to the issue of modeling and analyzing normally distributed variables under clustered sampling designs. A separate chapter is devoted to the analysis of correlated count data, with extensive discussion of the issue of overdispersion. Multilevel analyses of clustered data using the "generalized linear mixed effects models'' fitted by PROC GLIMMIX in SAS are emphasized. Chapter 6 has been expanded to include the analysis of repeated measures and longitudinal data when the response variable is normally distributed, binary, or count. The "linear mixed effects models'' are fitted using PROC MIXED and PROC GLIMMIX in SAS. An important feature of the third edition is the introduction of R code for almost all the examples solved with SAS. The freeware R package can be downloaded from The Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/ or from any of its mirrors. The reader of this book is expected to have prior working knowledge of the SAS and R packages. Readers who have experience with S-PLUS will have no problem working with R, and readers completely new to R will benefit from the many tutorials available on the R web site.
The important features of the third edition are
Mohamed M. Shoukri received his MSc and PhD degrees from the Depart-
ment of Mathematics and Statistics, University of Calgary, Alberta, Canada.
He has held faculty positions at various Canadian universities: he taught applied statistics at Simon Fraser University, the University of British Columbia, and the University of Windsor, and was a full professor with tenure at the University of Guelph, Ontario, Canada. His papers have been published
in the Journal of the Royal Statistical Society (series C), Biometrics, Journal of Sta-
tistical Planning and Inference, The Canadian Journal of Statistics, Statistics in
Medicine, Statistical Methods in Medical Research, and many other journals. He
is a fellow of the Royal Statistical Society of London and an elected member
of the International Statistical Institute. He is now principal scientist and the
acting chairman of the Department of Biostatistics and Epidemiology at the
Research Center of King Faisal Specialist Hospital.
CONTENTS
1.1 Introduction ................................................................................................. 1
1.1.1 The Basic Feature of Cluster Data ................................................ 2
1.1.2 Sample and Design Issues ............................................................. 7
1.2 Regression Analysis for Clustered Data .................................................. 11
1.3 Generalized Linear Models ....................................................................... 15
1.3.1 Marginal Models (Population Average Models) ........................ 16
1.3.2 Random Effects Models ................................................................. 17
1.3.3 Generalized Estimating Equation (GEE) ..................................... 17
1.4 Fitting Alternative Models for Clustered Data ....................................... 19
1.1 Introduction
Clusters are aggregates of individuals or items that are the subject of inves-
tigation. A cluster may be a family, school, herd of animals, flock of birds,
hospital, medical practice, or an entire community. Data obtained from clus-
ters may be the result of an intervention in a randomized clinical or a field
trial. Sometimes interventions in randomized clinical trials are allocated to
groups of patients rather than to individual patients. This is called cluster ran-
domization or cluster allocation, and is particularly common in human and
animal health research. There are several reasons why investigators may wish to randomize clusters rather than individual study subjects. The first is that the intervention may be naturally applicable to clusters. For example, Murray
et al. (1992) evaluated the effect of school-based interventions in reducing ado-
lescent tobacco use. A total of 24 schools (of various sizes) were randomized
to an intervention condition (SFG = smoke-free generation) or to a control
condition (EC = existing curriculum). The number (and proportion) of chil-
dren in each school who continued to use smokeless tobacco after 2 years of
follow-up is given in Table 1.1.
It would be impossible to assign students to intervention and control groups
because the intervention is through the educational program that is received
by all students in a school.
TABLE 1.1
Smokeless Tobacco Use among Schoolchildren
Control (EC) Intervention (SFG)
5/103 0/42
3/174 1/84
6/83 9/149
6/75 11/136
2/152 4/58
7/102 1/55
7/104 10/219
3/74 4/160
1/55 2/63
23/225 5/85
16/125 1/96
12/207 10/194
There are two sources of variation in the observations: the first is the variation between subjects within a cluster, and the second is the variability among clusters. These two sources of variation inflate the overall variance and must be taken into account in the analysis.
The effect of the increased variability due to clustering is to increase the
standard error of the effect measure, and thus widen the confidence inter-
val and inflate the type I error rate, compared to a study of the same size
using individual randomization. In other words, the effective sample size
and consequently the power are reduced. Conversely, failing to account for
clustering in the analysis will result in confidence intervals that are falsely
narrow and the p-values falsely small. Randomization by cluster accompa-
nied by an analysis appropriate to randomization by individual is an exercise
in self-deception (Cornfield, 1978).
Failing to account for clustering in the analysis is similar to another error
that relates to the definition of the unit of analysis. The unit of analysis error
occurs when several observations are taken on the same subject. For exam-
ple, a patient may have multiple observations (e.g., systolic blood pressure)
repeated over time. In this case, the repeated data points cannot be regarded
as independent, since measurements are taken on the same subject. For example, if five readings of systolic blood pressure are taken from each of 15 patients, treating the resulting 75 observations as independent is wrong. Here, the patient is the appropriate unit of analysis and is considered as a cluster.
To recognize the problem associated with the unit of analysis, let us assume
that we have k clusters each of size n units (the assumption of equal cluster
size will be relaxed later on). The data layout (Table 1.2) may take the form:
TABLE 1.2
Data Layout

Columns index the clusters 1, 2, ..., i, ..., k; rows index the units j = 1, 2, ..., n within each cluster, with entry y_{ij} denoting the response of unit j in cluster i.
The grand sample mean is denoted by \bar{y} = \frac{1}{nk}\sum_{i=1}^{k}\sum_{j=1}^{n} y_{ij}.
If the observations within a cluster are independent, then the variance of
the ith cluster mean is
V(\bar{y}_i) = \frac{\sigma_y^2}{n}   (1.1)
where \sigma_y^2 = E(y_{ij} - \mu)^2 and \mu is the population mean. Assuming that the variance is constant across clusters, the variance of the grand mean is

V(\bar{y}) = \frac{\sigma_y^2}{nk}   (1.2)
Under the one-way random effects model y_{ij} = \mu + b_i + e_{ij}, we assume E(b_i) = 0, V(b_i) = \sigma_b^2, E(e_{ij}) = 0, V(e_{ij}) = \sigma_e^2, and that b_i and e_{ij} are independent of each other. Under this setup, we can show that

\mathrm{Corr}(y_{ij}, y_{il}) = \rho = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2}   (1.5)
Equation 1.4 shows that if the observations within a cluster are not correlated (\rho = 0), then the within-cluster variance is identical to the variance of randomly selected individuals. Since 0 \le \rho < 1, the within-cluster variance \sigma_e^2 never exceeds \sigma_y^2.
Simple algebra shows that

V(\bar{Y}_i) = \frac{\sigma_y^2}{n}\,[1 + (n-1)\rho]   (1.7)

and

V(\bar{Y}) = \frac{\sigma_y^2}{nk}\,[1 + (n-1)\rho]   (1.8)
Note that for a binary response, \sigma^2 is replaced with \pi(1-\pi). The quantity [1 + (n-1)\rho] is called the variance inflation factor (VIF) or the "design effect'' (DEFF) by Kerry and Bland (1998). It can also be interpreted as the efficiency of the cluster design relative to simple random sampling of subjects, being the ratio of Equation 1.8 to Equation 1.2:

\mathrm{DEFF} = 1 + (n-1)\rho
The DEFF represents the factor by which the total sample size must be increased if a cluster design is to have the same statistical power as a design in which individuals are sampled or randomized. If the cluster sizes are not equal, which is commonly the case, the cluster size n should be replaced with \tilde{n} = \sum_{i=1}^{k} n_i^2 / N, where N = \sum_{i=1}^{k} n_i.
Since \rho is unknown, we estimate its value from the one-way ANOVA layout (Table 1.3):

TABLE 1.3
ANOVA Table

SOV               DF      Sum of squares   Mean square
Between clusters  k - 1   BSS              BMS = BSS/(k - 1)
Within clusters   N - k   WSS              WMS = WSS/(N - k)
Total             N - 1
where BSS = \sum_{i=1}^{k} n_i (\bar{y}_i - \hat{\mu})^2 and WSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2, and the ICC is estimated by

\hat{\rho} = \frac{\mathrm{BMS} - \mathrm{WMS}}{\mathrm{BMS} + (n_0 - 1)\mathrm{WMS}} = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_b^2 + \hat{\sigma}_e^2}

where \hat{\sigma}_b^2 = (\mathrm{BMS} - \mathrm{WMS})/n_0 and \hat{\sigma}_e^2 = \mathrm{WMS} are the sample estimates of \sigma_b^2 and \sigma_e^2, respectively.
n_0 = \bar{n} - \frac{\sum_{i=1}^{k} (n_i - \bar{n})^2}{k(k-1)\bar{n}}, \qquad \bar{n} = \frac{N}{k}   (1.10)

Note that when n_1 = n_2 = \cdots = n_k, then n_0 = \bar{n} = n.
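The ANOVA-based estimator is easy to compute directly. The following is an illustrative Python sketch (the book itself works in SAS and R; the function name and the toy data below are my own), implementing BMS, WMS, n_0, and \hat{\rho} exactly as defined above:

```python
def icc_anova(clusters):
    """Estimate the intraclass correlation from a one-way ANOVA layout.

    `clusters` is a list of lists: the observations y_ij of each cluster.
    Implements rho_hat = (BMS - WMS) / (BMS + (n0 - 1) * WMS).
    """
    k = len(clusters)
    N = sum(len(c) for c in clusters)
    grand = sum(sum(c) for c in clusters) / N
    means = [sum(c) / len(c) for c in clusters]
    bss = sum(len(c) * (m - grand) ** 2 for c, m in zip(clusters, means))
    wss = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)
    bms = bss / (k - 1)          # between-cluster mean square
    wms = wss / (N - k)          # within-cluster mean square
    nbar = N / k
    n0 = nbar - sum((len(c) - nbar) ** 2 for c in clusters) / (k * (k - 1) * nbar)
    sigma_b2 = (bms - wms) / n0  # estimate of sigma_b^2
    return sigma_b2 / (sigma_b2 + wms)
```

For balanced data (n_1 = ... = n_k), n_0 reduces to the common cluster size n, as noted above.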
If \hat{\rho} > 0, then the variance in a cluster design will always be greater than in a design in which subjects are randomly assigned, so that, conditional on the cluster size, the confidence intervals will be wider and the p-values larger. We note further that if the cluster size (n) is large, the DEFF can be large even for small values of \hat{\rho}. For example, an average cluster size of 40 and an ICC of 0.02 give a DEFF of about 1.8, implying that the sample size of a cluster-randomized design should be 180% of the estimated sample size of an individually randomized design to achieve the same statistical power.
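As a check on this arithmetic, the design effect can be sketched in a few lines of Python (illustrative only, not from the book; the function name is mine), using \tilde{n} = \sum n_i^2 / N for unequal cluster sizes:

```python
def deff(cluster_sizes, icc):
    """Design effect 1 + (n_tilde - 1) * icc, with
    n_tilde = sum(n_i^2) / N replacing n for unequal cluster sizes."""
    N = sum(cluster_sizes)
    n_tilde = sum(n * n for n in cluster_sizes) / N
    return 1 + (n_tilde - 1) * icc

# The example from the text: clusters of size 40 and ICC = 0.02
d = deff([40] * 24, 0.02)  # 1 + 39 * 0.02 ≈ 1.78
```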
There are several approaches that can be used to allow for clustering
ranging from simple to quite sophisticated:
The focus in this book will be on the statistical analysis of correlated data
using the first three approaches.
data milk;
input farm milk size $;
cards;
1 32.33 L
1 29.47 L
1 30.19 L
···
10 24.12 S
;
proc sort data=milk; by size; run;
The ANOVA results from the SAS output give

\hat{\sigma}_e^2 = 6.67, \qquad \hat{\sigma}_b^2 = \frac{100.57 - 6.67}{12} = 7.83

Therefore, the estimated ICC is \hat{\rho} = 7.83/(7.83 + 6.67) = 0.54.
An important objective of this study was to test whether the average milk yield in large farms differs significantly from the average milk yield in small farms, that is, to test H_0: \mu_s = \mu_l versus H_1: \mu_s \neq \mu_l.
The ANOVA results separately for each farm size (small and large) are now
produced. The SAS code is
proc glm data = milk;
class farm;
model milk = farm;
by size;
run; quit;
Large farm size

(The ANOVA table, with columns Source, DF, Sum of Squares, and Mean Square, is not reproduced here; it gives \bar{y}_l = 28 and \hat{V}(\bar{y}_l) = 1.17.)

For the small farms,

\bar{y}_s = 25.32, \quad k = 5, \quad n = 12, \quad N = 60

\hat{\sigma}_{es}^2 = 5.94, \qquad \hat{\sigma}_{bs}^2 = \frac{94.81 - 5.94}{12} = 7.4, \qquad \frac{\hat{\sigma}_{bs}^2}{\hat{\sigma}_{es}^2} = 1.24

\hat{V}(\bar{y}_s) = \frac{5.94}{60}\,[1 + (12)(1.24)] = 1.57

Hence

Z = \frac{28 - 25.32}{(1.17 + 1.57)^{1/2}} = 1.61
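The small-farm variance and the resulting Z statistic can be reproduced as follows. This is an illustrative Python check of the arithmetic, not code from the book; it uses the identity (\sigma_y^2/N)[1 + (n-1)\rho] = (\sigma_e^2/N)(1 + n\,\sigma_b^2/\sigma_e^2):

```python
import math

def var_grand_mean(sigma_e2, sigma_b2, n, N):
    """Variance of the grand mean under clustering:
    (sigma_e^2 / N) * (1 + n * sigma_b^2 / sigma_e^2)."""
    return (sigma_e2 / N) * (1 + n * sigma_b2 / sigma_e2)

v_small = var_grand_mean(5.94, 7.4, 12, 60)   # ≈ 1.58 (the text rounds the ratio to 1.24 and gets 1.57)
z = (28 - 25.32) / math.sqrt(1.17 + v_small)  # ≈ 1.61
```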
model that characterizes the main features of the behavior of the response
variable.
Regression analysis is among the most commonly used methods of sta-
tistical analysis to model the relationship between variables. Its objective
is to describe the relationship of response with independent or explanatory
variables. In its very general form, a regression model is written as
Y = Xβ + e (1.16)
Under the above conditions and provided that (X T X)−1 exists, the least
squares estimate of β is
β̂ = (X T X)−1 X T Y (1.17)
Equation 1.17 is important in regression analysis since it provides the esti-
mates of β once we are sure that conditions (a, b, c) are satisfied and the
matrix X is specified.
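For a single covariate with an intercept, Equation 1.17 reduces to the familiar two-parameter normal equations. A minimal pure-Python sketch (illustrative only, with hypothetical data):

```python
def ols(x, y):
    """Least squares via the normal equations (Equation 1.17),
    specialized to one covariate plus an intercept."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx           # determinant of X'X
    b0 = (sxx * sy - sx * sxy) / det  # intercept
    b1 = (n * sxy - sx * sy) / det    # slope
    return b0, b1

b0, b1 = ols([0, 1, 2, 3], [1, 3, 5, 7])  # data on the exact line y = 1 + 2x
```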
In addition to the linear models (Equation 1.16), regression models include
logistic models for binary responses, log-linear model for counts, and survival
analysis for time to events. In this chapter, we discuss the linear-normal model
for continuous responses when the basic assumption that all the observations
are independent, or at least uncorrelated, is violated. Recall that the assump-
tion of zero correlation would mean that knowing one subject’s response
provides no information regarding the status of another subject in the same
study. However, the assumption of independence may not hold if the subjects
belong to the same cluster, as has already been demonstrated. An example of a regression problem in which clusters of subjects are sampled together is Miall and Oldham's family study of arterial blood pressure levels. Owing to
their common household environment and their shared genetic makeup, we
would expect a family member to have a greater chance of having elevated
blood pressure levels if his/her sibling had the same. Data from this study can
be usefully thought of as being “clustered’’ into families. Blood pressure lev-
els from different families are likely to be independent; those from the same
family are not. This dependence among observations from the same cluster
must be accounted for in assessing the relationships between risk factors and
health outcomes.
Another example cited by Liang and Zeger (1993) is the growth study
of Hmong refugee children. In this example, 1000 Hmong refugee children
receiving health care at two Minnesota clinics between 1976 and 1985 were
examined for their growth patterns. The objective was to study the patterns
of growth and its association with age at entry into the United States. It is
believed that stature is influenced by both genetic and environmental factors.
When the offending environmental factors are removed, the growth process
progresses at a faster rate. To study the growth, repeated measurements of
height of each child were recorded. The number of visits per child ranged
from 1 to 15 and averaged 5. The correlation between repeated observations
on height for each child may be a nuisance but cannot be ignored in regression
analysis.
The above two examples have common features, although they address
questions with different scientific objectives. Data in both studies are organized in clusters: in the family study the clusters are formed by families, and in the growth study a cluster comprises the repeated observations for a child. Another aspect of similarity between the two studies
is that one can safely assume that the response variables (blood pressure
in the family study and height in the growth study) are normally dis-
tributed. The two studies also differ in the structure of the within-cluster
correlation. For example, in the family study one may assume that the cor-
relation between the pairs of sibs within the family is equal, that is, we
may assume a constant within-cluster correlation. For repeated measures
longitudinal study, the situation is different. Although the repeated observa-
tions are correlated, this correlation may not be constant across time (cluster
units). It is common sense to assume that observations taken at adjacent time
points are more correlated than observations that are taken at separated time
points.
In the remainder of this chapter, we shall focus on regression analysis of
clustered data assuming common or fixed within-cluster correlation and the
response variable is normally distributed. Other types of response variables
and different correlation structures will be discussed in detail in subsequent
chapters.
Within the framework of linear regression, we illustrate an answer to the
question: what happens when the conventional linear regression is used to
analyze clustered data?
Let Yij denote the score on the jth offspring in the ith family; Xi the score of
the ith parent, where j = 1, 2, . . . , ni , i = 1, 2, . . . , k; ni the number of offspring
in the ith family; and k the total number of families. We assume that the
regression of Y on X is given by
and

V(b_1) = \frac{\sigma^2 (1 - \rho)}{\sum_i w_i (x_i - \hat{x})^2}

where w_i = n_i/(1 + n_i \rho_e), \rho_e = (\rho - \beta^2)/(1 - \beta^2), \bar{y}_i = \sum_j y_{ij}/n_i, and \hat{x} = \sum_i w_i x_i / \sum_i w_i.
The most widely used estimator of \beta ignores the ICC \rho and is given by the usual estimator

b = \frac{\sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{k} n_i (x_i - \bar{x})^2}   (1.20)

where \bar{y} = \sum_i n_i \bar{y}_i / N, \bar{x} = \sum_i n_i x_i / N, and N = n_1 + n_2 + \cdots + n_k.
Its variance is

V(b) = \frac{\sigma^2 (1 - \rho) \sum_i n_i (1 + n_i \rho_e)(x_i - \bar{x})^2}{\left[\sum_i n_i (x_i - \bar{x})^2\right]^2}   (1.21)

If \rho_e = 0, then w_i = n_i and V(b) = V(b_1) = \sigma^2(1-\rho)/\sum_i n_i (x_i - \bar{x})^2, which means that b is fully efficient. Therefore,

\frac{V(b)}{V(b_1)} = \frac{\sum_i n_i (1 + n_i \rho_e)(x_i - \bar{x})^2}{\sum_i n_i (x_i - \bar{x})^2}

or equivalently

\frac{V(b)}{V(b_1)} = 1 + \frac{\rho_e \sum_i n_i^2 (x_i - \bar{x})^2}{\sum_i n_i (x_i - \bar{x})^2}   (1.22)
Assuming ρe > 0, b is always less efficient.
The most important message of Equation 1.22 is that ignoring within-
cluster correlation can lead to a loss of power when both within-cluster and
cluster-level covariate information are being used to estimate the regression
coefficient.
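The variance inflation in Equation 1.22 can be evaluated numerically. An illustrative Python sketch (the cluster sizes and cluster-level covariate values below are hypothetical):

```python
def efficiency_ratio(n_sizes, x, rho_e):
    """V(b)/V(b1) from Equation 1.22:
    1 + rho_e * sum(n_i^2 (x_i - xbar)^2) / sum(n_i (x_i - xbar)^2),
    where xbar is the n_i-weighted mean of the cluster-level covariate."""
    N = sum(n_sizes)
    xbar = sum(n * xi for n, xi in zip(n_sizes, x)) / N
    num = sum(n * n * (xi - xbar) ** 2 for n, xi in zip(n_sizes, x))
    den = sum(n * (xi - xbar) ** 2 for n, xi in zip(n_sizes, x))
    return 1 + rho_e * num / den

# With equal cluster sizes n, the ratio reduces to 1 + n * rho_e
r = efficiency_ratio([5, 5, 5], [1.0, 2.0, 3.0], 0.1)  # 1 + 5 * 0.1 = 1.5
```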
To analyze clustered data, one must therefore model both the regression
of Y on X and the within-cluster dependence. If the responses are indepen-
dent of each other, then ordinary least squares can be used, which produces
regression estimators that are identical to the maximum likelihood in the case
of normally distributed responses. In this chapter we consider two differ-
ent modeling approaches: marginal and random effects. In marginal models,
the regression of Y on X and the within-cluster dependence are modeled
separately. The random effects models attempt to address both issues simul-
taneously through a single model. We shall explore both modeling strategies
for a much wider class of distributions named “generalized linear models’’
(GLM) that includes the normal distribution as a special case.
g(\mu) = X^T \beta   (1.23)
Here, g(.) is a specified function known as the “link function.’’ The normal
regression model for continuous data is a special case of Equation 1.23, where
g(.) is the identity link. For binary response variable Y, the logistic regression
is a special case of Equation 1.23 with the logit transformation as the link.
That is,
g(\mu) = \log\left(\frac{\mu}{1-\mu}\right) = \log\left(\frac{P(Y=1)}{P(Y=0)}\right) = X^T \beta
When the response variable is count, we assume that
\log E(Y) = X^T \beta
To account for the variability in the response variable that is not accounted for by the systematic component, the GLM assumes that Y has a probability density function in the exponential family,

f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right\}

For the normal distribution, \theta = \mu (identity link), b(\theta) = \mu^2/2, and \phi = \sigma^2. For the Poisson distribution, \theta = \ln \mu (log link), b(\theta) = e^\theta = \mu, \phi = 1, and b'(\theta) = e^\theta = \mu.
U = \sum_{i=1}^{p} \left(\frac{\partial \mu_i(\beta)}{\partial \beta}\right)^T V^{-1}(Y_i)\,[Y_i - \mu_i(\beta)] = 0   (1.25)
The above equation provides valid estimates when the responses are indepen-
dently distributed. For clustered data, the GLM and hence Equation 1.25 are
not sufficient, since the issue of within-cluster correlation is not addressed.
We now discuss the two modeling approaches commonly used to analyze
clustered data.
U_1(\beta, \alpha) = \sum_{i=1}^{k} \left(\frac{\partial \mu_i}{\partial \beta}\right)^T \mathrm{Cov}^{-1}(Y_i; \beta, \alpha)\,[y_i - \mu_i(\beta)] = 0   (1.27)
where µi (β) = E(Yi ), the marginal mean of Yi . One should note that U1 (.) has
the same form of U(.) in Equation 1.25, except that Yi is now ni × 1 vector,
which consists of the ni observations of the ith cluster, and the covariance
matrix Cov(Yi ), which depends on β and α, a parameter that characterizes
the within-cluster dependence.
For a given \alpha, the solution \hat{\beta} to Equation 1.27 can be obtained by an iteratively reweighted least squares calculation. The solution to these equations
is asymptotically normal, with a covariance matrix that can be consistently estimated by the sandwich estimator A^{-1} M A^{-1}, where

A = \sum_{i=1}^{k} \hat{D}_i^T \hat{V}_i^{-1} \hat{D}_i

M = \sum_{i=1}^{k} \hat{D}_i^T \hat{V}_i^{-1} \widehat{\mathrm{Cov}}(Y_i)\, \hat{V}_i^{-1} \hat{D}_i

\widehat{\mathrm{Cov}}(Y_i) = \big(Y_i - \mu_i(\hat{\beta})\big)\big(Y_i - \mu_i(\hat{\beta})\big)^T
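For intuition about the structure of A^{-1} M A^{-1}, here is a deliberately minimal Python sketch for a no-intercept linear marginal model with a working-independence covariance, where \beta, A, and M are all scalars. This illustrates the form of the estimator only; it is not the book's SAS/R implementation, and the function name and data layout are mine:

```python
def gee_sandwich_scalar(clusters):
    """clusters: list of (xs, ys) pairs, one per cluster.
    Returns (beta_hat, robust variance A^{-1} M A^{-1}) for the model
    E(y) = beta * x under working independence."""
    beta = (sum(x * y for xs, ys in clusters for x, y in zip(xs, ys))
            / sum(x * x for xs, _ in clusters for x in xs))
    A = sum(x * x for xs, _ in clusters for x in xs)   # the D'V^{-1}D term
    # M sums the squared cluster-level score contributions
    M = sum(sum(x * (y - beta * x) for x, y in zip(xs, ys)) ** 2
            for xs, ys in clusters)
    return beta, M / (A * A)
```

Because residuals are aggregated within each cluster before squaring, the resulting standard error is robust to within-cluster correlation.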
U_2(\beta, \alpha) = \sum_{i=1}^{k} \left(\frac{\partial \mu_i^*}{\partial \beta}\right)^T [\mathrm{Cov}(Z_i; \delta)]^{-1} \big(Z_i - \mu_i^*(\theta)\big) = 0

where

Z_i = \big(Y_{i1}, \ldots, Y_{in_i}, Y_{i1}^2, \ldots, Y_{in_i}^2, Y_{i1}Y_{i2}, Y_{i1}Y_{i3}, \ldots, Y_{i,n_i-1}Y_{in_i}\big)^T

and \mu_i^* = E(Z_i; \theta), which is completely specified by the modeling assumptions of the GLM. They called this expanded procedure, which uses both the Y_{ij}'s and the products Y_{ij}Y_{il}, the GEE2.
GEE2 seems to improve the efficiency of the estimates of both \beta and \alpha. On the other hand, the robustness property that GEE enjoys for \beta no longer holds: correct inferences about \beta require correct specification of the within-cluster dependence structure. The same authors suggest a sensitivity analysis when making inferences on \beta; that is, one may repeat the procedure with different models for the within-cluster dependence structure to examine the sensitivity of \hat{\beta} to the choice of dependence structure.
familyid: Family ID
subjid: Sibling ID
sbp: Sibling systolic blood pressure
msbp: Mother systolic blood pressure
age: Sibling age
armgirth: Sibling arm girth
cenmsbp: Mother systolic blood pressure centered
cenage: Sibling age centered within the family
cengirth: Sibling arm girth centered within the family
The records with missing values of sibling age, mother's systolic blood pressure, sibling arm girth, and sibling systolic blood pressure are deleted. The dataset consists of 488 observations on 154 families. The family size ranges from 1 to 10. We begin by first analyzing the data using the GEE approach. This is followed by a series of models using the multilevel modeling approach. All models are fitted using the SAS procedures GENMOD for the GEE approach and MIXED for the multilevel approach. The equivalent R code (R 2.3.1; R Development Core Team, 2006) is also provided.
The SAS code to read in the data and fit this model is
data fam;
input familyid subjid sbp age armgirth msbp;
datalines;
1 1 85 5 5.75 120
1 2 105 15 8.50 120
......
The following is the partial output showing the analysis of the GEE parameter estimates (empirical standard error estimates). The table columns are Parameter, Estimate, Standard Error, 95% Confidence Limits, Z, and Pr > |Z|; the numerical entries are not reproduced here.
yij = µ + bi + eij
The model has one fixed effect (µ) and two variance components—one rep-
resenting the variation between family means (σb2 ) and the other variation
among siblings within families (σe2 ).
The SAS code to fit this model is
The fitted covariance parameter estimates are familyid 106.98 and Residual 166.22, giving

\hat{\rho} = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_b^2 + \hat{\sigma}_e^2} = \frac{106.98}{106.98 + 166.22} = 0.39
This tells us that there is a great deal of clustering of systolic blood pressure
levels of siblings within families.
There is another approach that generalizes more easily to data with multiple
levels. This approach expresses the subject-level outcome yij using a pair of
linked models: one at the subject level (level 1) and another at the family level
(level 2). At level 1, we express yij as the sum of an intercept for the subject’s
family (βi0 ) and random error (eij ):
The parameter estimates under this model are the same as in the previous
model. The purpose of the covtest option in the “proc mixed’’ statement is to
test the hypothesis for the variance components.
Note that there are two fixed effects to be estimated: the intercept and the covariate (MSBP). The null hypothesis, which states that there is no relationship between the mother's systolic blood pressure levels and those of the siblings, is not supported by the data. Also note that the variance component estimates are
67.47 and 163.89 for τ0 and σe2 , respectively. These estimates under the present
model have different interpretations. In model 1, there were no covariates,
so these were unconditional components. After adding the mother’s blood
pressure as a covariate, these are now conditional components. Note that
the conditional within-family component is slightly reduced (from 166.22 to
163.89). The variance component representing variation between families τ0
or σb2 has diminished markedly (from 106.98 to 67.47). This tells us that the
cluster- or family-level covariate (mother’s systolic blood pressure) explains
a large percentage of the family-to-family variation. One way of measuring
how much variation exists in family mean blood pressures as explained by the
mother’s blood pressure levels is to compute how much the variance compo-
nent for this term τ0 has diminished between the two models. Following Bryk
and Raudenbush (1992), we compute this as (106.98 − 67.47)/106.98 = 36.9%.
This is interpreted by saying that about 36% of the explainable variation in
the sibling’s mean systolic blood pressure levels is explained by the mother’s
systolic blood pressure levels.
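This proportion-of-variance calculation is a one-liner; the following is an illustrative Python check using the variance components quoted above:

```python
# Drop in the between-family variance component after adding the
# level-2 covariate (mother's SBP), following Bryk and Raudenbush (1992)
tau0_unconditional = 106.98   # model with no covariates
tau0_conditional = 67.47      # after adding the mother's SBP
explained = (tau0_unconditional - tau0_conditional) / tau0_unconditional
print(round(100 * explained, 1))  # 36.9 (percent)
```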
Here, Zij is the age of the jth subject within the ith family centered at its mean
value. The other terms are defined as before. Let
Hence,
where e_{ij} \sim N(0, \sigma_e^2), u_i = (u_{i0}, u_{i1})^T \sim \mathrm{BVN}(0, \Sigma), and e_{ij} is independent of the bivariate normal random vector u_i. Here \Sigma is a 2 × 2 symmetric matrix whose elements are \delta_{00} = V(u_{i0}), \delta_{11} = V(u_{i1}), and \delta_{01} = \mathrm{Cov}(u_{i0}, u_{i1}). The SAS code to fit this model is
Notice that the random statement has two random effects—one for intercept
and one for the Z-slope. There is also an additional option in the random
statement, “type=un,’’ indicating that an unstructured specification for the
covariance of ui is assumed. Partial SAS output is shown below.
We shall start by first interpreting the fixed effects. The estimate of the inter-
cept 118.43 indicates the estimated average sibling systolic blood pressure
levels after controlling for the mother’s systolic blood pressure. The estimate
of the cenmsbp indicates that the average slope representing the relationship
between siblings’ blood pressure and the mother’s systolic blood pressure is
0.20. The standard errors of these estimates are very small, resulting in small p-values. We conclude that, on average, there is a statistically significant
where Zij is the centered arm girth of the jth subject within the ith family, aij
the centered age of the jth subject within the ith family, and xi the centered
mother’s systolic blood pressure in the ith family.
Hence,
yij = γ00 + γ01 xi + ui0 + Zij (γ10 + γ11 xi + ui1 ) + aij (γ20 + γ21 xi + ui2 ) + eij
Simplifying, we get
yij = (γ00 + γ01 xi + γ10 Zij + γ20 aij + γ11 xi Zij + γ21 xi aij) + (ui0 + Zij ui1 + aij ui2 + eij)
Terms in the first bracket should appear in the model statement, while those
in the second bracket should appear in the random statement.
The SAS code to fit the model is (the model and random statements follow the
two bracketed groups of terms above, using the variable names of the R code
below):
proc mixed data=family covtest noclprint noitprint;
  class familyid;
  model sbp = cenmsbp cengirth cenage cenmsbp*cengirth cenmsbp*cenage / solution;
  random intercept cengirth cenage / subject=familyid type=un;
run;
Interpretation of the above output has been left as an exercise to the reader.
The R code reads the data; computes the centered variables cenage, cen-
girth, and cenmsbp; and fits the alternative models discussed in this example.
Note that the packages “nlme” and “gee” should be installed and loaded for
the functions “lme” and “gee”, respectively, to run.
library(nlme)  # for lme()
library(gee)   # for gee()
fam <- read.table("x:/xxx/familydata.txt", header = TRUE)
# Center age and girth within each family; center mother's SBP at its grand mean
cenage   <- unlist(tapply(fam[, 4], fam[, 1], scale, scale = FALSE))
cengirth <- unlist(tapply(fam[, 5], fam[, 1], scale, scale = FALSE))
cenmsbp  <- scale(fam[, 6], center = TRUE, scale = FALSE)
family   <- data.frame(fam, cenage, cengirth, cenmsbp)
# Generalized estimating equations (GEE)
fam.gee <- gee(sbp ~ cenmsbp + cenage + cengirth, id = familyid, data = family,
               family = gaussian, corstr = "exchangeable")
summary(fam.gee)
# Unconditional means model—Mixed Model 1
fam.lme1 <- lme(fixed = sbp ~ 1, random = ~ 1 | familyid, data = family)
summary(fam.lme1)
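The listing above ends with the unconditional means model; the remaining mixed models discussed in this example can be fitted with calls along the following lines (the object names fam.lme2 and fam.lme3 are illustrative):

```r
# Mixed Model 2: mother's centered SBP as a family-level predictor,
# with a random intercept and a random (centered) age slope per family
fam.lme2 <- lme(fixed = sbp ~ cenmsbp + cenage,
                random = ~ cenage | familyid, data = family)
summary(fam.lme2)

# Full model: random intercept, girth slope, and age slope,
# each allowed to depend on the mother's centered SBP
fam.lme3 <- lme(fixed = sbp ~ cenmsbp * (cengirth + cenage),
                random = ~ cengirth + cenage | familyid, data = family)
summary(fam.lme3)
```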
Appendix
1. Linear combinations of random variables
Let x = (x1, x2, . . . , xk) be a set of random variables such that E(xi) = µi,
V(xi) = σi2, and Cov(xi, xj) = cij. We define a linear combination of the
random variables x as

y = Σ_{i=1}^{k} wi xi

Then

E(y) = Σ_{i=1}^{k} wi µi                                  (A.1)

and

V(y) = Σ_{i=1}^{k} wi2 σi2 + Σ_{i=1}^{k} Σ_{j=1, j≠i}^{k} wi wj cij      (A.2)

2. Consider two linear combinations ys = Σ_{i=1}^{k} asi xi and yt = Σ_{i=1}^{k} ati xi. Then
the covariance between ys and yt is

Cov(ys, yt) = Σ_{i=1}^{k} Σ_{j=1}^{k} asi atj Cov(xi, xj)        (A.3)
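Result (A.2) is the matrix identity V(w′x) = w′Cw, where C is the covariance matrix of x. A small R check, with illustrative numbers:

```r
# Illustrative 3x3 covariance matrix C (cij off the diagonal) and weights w
C <- matrix(c(4.0, 1.5,  0.5,
              1.5, 9.0, -1.0,
              0.5, -1.0, 1.0), nrow = 3, byrow = TRUE)
w <- c(0.2, 0.5, 0.3)

# Right-hand side of (A.2): variance terms plus the i != j cross terms
a2 <- sum(w^2 * diag(C)) + sum((w %o% w * C)[row(C) != col(C)])

# Matrix form V(y) = w' C w
vy <- drop(t(w) %*% C %*% w)

all.equal(a2, vy)  # TRUE
```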