Analysis of Correlated Data with SAS and R
Third Edition

Mohamed M. Shoukri
Mohammad A. Chaudhary

Assuming a working knowledge of SAS and R, this text provides the necessary concepts and applications for analyzing clustered and correlated data from epidemiologic and medical investigations. An accompanying CD-ROM contains all the data sets in the book along with the SAS and R codes.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
Preface to the First Edition
M.M. Shoukri
V.L. Edge
Guelph, Ontario
July 1995
Preface to the Second Edition
The main structure of the book has been kept similar to that of the first edition. To keep pace with recent advances in the science of statistics, more topics have been covered. In Chapter 2 we introduce the coefficient of variation as a measure of reproducibility, and the comparison of two dependent reliability coefficients. Testing for trend using the Cochran-Armitage chi-square under cluster randomization has been introduced in Chapter 4. In this chapter we also discuss the application of PROC GENMOD in SAS, which implements the GEE approach, and "multilevel analysis'' of clustered binary data under the "generalized linear mixed effects models,'' using Schall's algorithm and the GLIMMIX SAS macro. In Chapter 5 we added two new sections on modeling seasonal time series: one uses a combination of polynomials to describe the trend component and trigonometric functions to describe seasonality, while the other is devoted to modeling seasonality using the more sophisticated ARIMA models. Chapter 6 has been expanded to include the analysis of repeated measures experiments under the "linear mixed effects models,'' using PROC MIXED in SAS. We added Chapter 7 to cover the topic of survival analysis, including a brief discussion on the analysis of correlated survival data.
An important feature of the second edition is that all the examples are solved using the SAS package. We also provide all the SAS programs that are needed to understand the material in each chapter.
Preface to the Third Edition
It was brought to our attention by many of our colleagues that the title of the
previous edition did not reflect the focus of the book, which was the analysis
of correlated data. We therefore decided to change the title of the third edition
to Analysis of Correlated Data with SAS and R. We believe that the change in the
title is appropriate and reflects the main focus of the book.
The fundamental objectives of the new edition have been kept similar to those of the previous two editions. However, this edition contains major structural changes. The first chapter of the previous editions has been deleted and replaced with a new chapter devoted to the issue of modeling and analyzing normally distributed variables under clustered sampling designs. A separate chapter is devoted to the analysis of correlated count data, with extensive discussion of the issue of overdispersion. Multilevel analyses of clustered data using the "generalized linear mixed effects models'' fitted by PROC GLIMMIX in SAS are emphasized. Chapter 6 has been expanded to include the analysis of repeated measures and longitudinal data when the response variable is normally distributed, binary, or count. The "linear mixed effects models'' are fitted using PROC MIXED and PROC GLIMMIX in SAS. An important feature of the third edition is the introduction of R code for almost all the examples solved with SAS. The freeware R package can be downloaded from The Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/ or from any of its mirrors. The reader of this book is expected to have prior working knowledge of the SAS and R packages. Readers who have experience with S-PLUS will have no problem working with R, and readers completely new to R will benefit from the many tutorials available on the R web site.
The important features of the third edition are
Mohamed M. Shoukri received his MSc and PhD degrees from the Depart-
ment of Mathematics and Statistics, University of Calgary, Alberta, Canada.
He has held faculty positions at various Canadian universities: he taught applied statistics at Simon Fraser University, the University of British Columbia, and the University of Windsor, and was a full professor with tenure at the University of Guelph, Ontario, Canada. His papers have been published
in the Journal of the Royal Statistical Society (series C), Biometrics, Journal of Sta-
tistical Planning and Inference, The Canadian Journal of Statistics, Statistics in
Medicine, Statistical Methods in Medical Research, and many other journals. He
is a fellow of the Royal Statistical Society of London and an elected member
of the International Statistical Institute. He is now principal scientist and the
acting chairman of the Department of Biostatistics and Epidemiology at the
Research Center of King Faisal Specialist Hospital.
CONTENTS
1.1 Introduction ................................................................................................. 1
1.1.1 The Basic Feature of Cluster Data ................................................ 2
1.1.2 Sample and Design Issues ............................................................. 7
1.2 Regression Analysis for Clustered Data .................................................. 11
1.3 Generalized Linear Models ....................................................................... 15
1.3.1 Marginal Models (Population Average Models) ........................ 16
1.3.2 Random Effects Models ................................................................. 17
1.3.3 Generalized Estimating Equation (GEE) ..................................... 17
1.4 Fitting Alternative Models for Clustered Data ....................................... 19
1.1 Introduction
Clusters are aggregates of individuals or items that are the subject of inves-
tigation. A cluster may be a family, school, herd of animals, flock of birds,
hospital, medical practice, or an entire community. Data obtained from clus-
ters may be the result of an intervention in a randomized clinical or a field
trial. Sometimes interventions in randomized clinical trials are allocated to
groups of patients rather than to individual patients. This is called cluster ran-
domization or cluster allocation, and is particularly common in human and
animal health research. There are several reasons why investigators may wish to randomize clusters rather than individual study subjects. The first is that the intervention may be naturally applicable to clusters. For example, Murray
et al. (1992) evaluated the effect of school-based interventions in reducing ado-
lescent tobacco use. A total of 24 schools (of various sizes) were randomized
to an intervention condition (SFG = smoke-free generation) or to a control
condition (EC = existing curriculum). The number (and proportion) of chil-
dren in each school who continued to use smokeless tobacco after 2 years of
follow-up is given in Table 1.1.
It would be impossible to assign students to intervention and control groups
because the intervention is through the educational program that is received
by all students in a school.
TABLE 1.1
Smokeless Tobacco Use among Schoolchildren
Control (EC) Intervention (SFG)
5/103 0/42
3/174 1/84
6/83 9/149
6/75 11/136
2/152 4/58
7/102 1/55
7/104 10/219
3/74 4/160
1/55 2/63
23/225 5/85
16/125 1/96
12/207 10/194
There are two sources of variation in the observations: the first is the variation between subjects within a cluster, and the second is the variability among clusters. These two sources of variation inflate the overall variance and must be taken into account in the analysis.
The effect of the increased variability due to clustering is to increase the
standard error of the effect measure, and thus widen the confidence inter-
val and inflate the type I error rate, compared to a study of the same size
using individual randomization. In other words, the effective sample size
and consequently the power are reduced. Conversely, failing to account for
clustering in the analysis will result in confidence intervals that are falsely
narrow and the p-values falsely small. Randomization by cluster accompa-
nied by an analysis appropriate to randomization by individual is an exercise
in self-deception (Cornfield, 1978).
Failing to account for clustering in the analysis is similar to another error
that relates to the definition of the unit of analysis. The unit of analysis error
occurs when several observations are taken on the same subject. For exam-
ple, a patient may have multiple observations (e.g., systolic blood pressure)
repeated over time. In this case, the repeated data points cannot be regarded
as independent, since measurements are taken on the same subject. For example, if five readings of systolic blood pressure are taken from each of 15 patients, treating the resulting 75 observations as independent is wrong. Here, the patient is the appropriate unit of analysis and is considered as a cluster.
To recognize the problem associated with the unit of analysis, let us assume
that we have k clusters each of size n units (the assumption of equal cluster
size will be relaxed later on). The data layout (Table 1.2) may take the form:
TABLE 1.2
Data Layout

Columns index the clusters 1, 2, ..., i, ..., k; rows index the units j = 1, 2, ..., n within each cluster, with entry y_{ij} denoting the response of unit j in cluster i.
The grand sample mean is denoted by \bar{y} = \frac{1}{nk}\sum_{i=1}^{k}\sum_{j=1}^{n} y_{ij}.
If the observations within a cluster are independent, then the variance of
the ith cluster mean is
V(\bar{y}_i) = \frac{\sigma_y^2}{n}   (1.1)
where \sigma_y^2 = E(y_{ij} - \mu)^2 and \mu is the population mean. Assuming that the variance is constant across clusters, the variance of the grand mean is

V(\bar{y}) = \frac{\sigma_y^2}{nk}   (1.2)
Under the one-way random effects model y_{ij} = \mu + b_i + e_{ij}, we assume E(b_i) = 0, V(b_i) = \sigma_b^2, E(e_{ij}) = 0, V(e_{ij}) = \sigma_e^2, and that b_i and e_{ij} are independent of each other. Under this setup, we can show that

\mathrm{Corr}(y_{ij}, y_{il}) = \rho = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2}   (1.5)
Equation 1.4 shows that if the observations within a cluster are not correlated (\rho = 0), then the within-cluster variance is identical to the variance of randomly selected individuals. Since 0 \le \rho < 1, the within-cluster variance \sigma_e^2 never exceeds \sigma_y^2.
Simple algebra shows that

V(\bar{Y}_i) = \frac{\sigma_y^2}{n}\,[1 + (n-1)\rho]   (1.7)

and

V(\bar{Y}) = \frac{\sigma_y^2}{nk}\,[1 + (n-1)\rho]   (1.8)
Note that for a binary response, \sigma^2 is replaced with \pi(1-\pi). The quantity [1 + (n-1)\rho] is called the variance inflation factor (VIF) or the "design effect'' (DEFF) by Kerry and Bland (1998). It can also be interpreted as the efficiency of the cluster design relative to simple random sampling of subjects, being the ratio of Equation 1.8 to Equation 1.2:

\mathrm{DEFF} = 1 + (n-1)\rho
The DEFF represents the factor by which the total sample size must be increased if a cluster design is to have the same statistical power as a design in which individuals are sampled or randomized. If the cluster sizes are not equal, which is commonly the case, the cluster size n should be replaced with \tilde{n} = \sum_{i=1}^{k} n_i^2 / N, where N = \sum_{i=1}^{k} n_i.
Since \rho is unknown, we estimate its value from the one-way ANOVA layout (Table 1.3):

TABLE 1.3
ANOVA Table

SOV               DF      Sum of squares   Mean square
Between clusters  k - 1   BSS              BMS = BSS/(k - 1)
Within clusters   N - k   WSS              WMS = WSS/(N - k)
Total             N - 1
where BSS = \sum_{i=1}^{k} n_i (\bar{y}_i - \hat{\mu})^2 and WSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2, and the ICC is estimated by

\hat{\rho} = \frac{\mathrm{BMS} - \mathrm{WMS}}{\mathrm{BMS} + (n_0 - 1)\mathrm{WMS}} = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_b^2 + \hat{\sigma}_e^2}

where \hat{\sigma}_b^2 = (\mathrm{BMS} - \mathrm{WMS})/n_0 and \hat{\sigma}_e^2 = \mathrm{WMS} are the sample estimates of \sigma_b^2 and \sigma_e^2, respectively.
n_0 = \bar{n} - \frac{\sum_{i=1}^{k} (n_i - \bar{n})^2}{k(k-1)\bar{n}}, \qquad \bar{n} = \frac{N}{k}   (1.10)

Note that when n_1 = n_2 = \cdots = n_k, then n_0 = \bar{n} = n.
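The ANOVA-based estimator is easy to compute directly. The following is an illustrative Python sketch (the book itself works in SAS and R; the function name and the toy data below are my own), implementing BMS, WMS, n_0, and \hat{\rho} exactly as defined above:

```python
def icc_anova(clusters):
    """Estimate the intraclass correlation from a one-way ANOVA layout.

    `clusters` is a list of lists: the observations y_ij of each cluster.
    Implements rho_hat = (BMS - WMS) / (BMS + (n0 - 1) * WMS).
    """
    k = len(clusters)
    N = sum(len(c) for c in clusters)
    grand = sum(sum(c) for c in clusters) / N
    means = [sum(c) / len(c) for c in clusters]
    bss = sum(len(c) * (m - grand) ** 2 for c, m in zip(clusters, means))
    wss = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)
    bms = bss / (k - 1)          # between-cluster mean square
    wms = wss / (N - k)          # within-cluster mean square
    nbar = N / k
    n0 = nbar - sum((len(c) - nbar) ** 2 for c in clusters) / (k * (k - 1) * nbar)
    sigma_b2 = (bms - wms) / n0  # estimate of sigma_b^2
    return sigma_b2 / (sigma_b2 + wms)
```

For balanced data (n_1 = ... = n_k), n_0 reduces to the common cluster size n, as noted above.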
If \hat{\rho} > 0, then the variance in a cluster design will always be greater than in a design in which subjects are randomly assigned, so that, conditional on the cluster size, the confidence intervals will be wider and the p-values larger. We note further that if the cluster size (n) is large, the DEFF can be large even for small values of \hat{\rho}. For example, an average cluster size of 40 and an ICC of 0.02 give a DEFF of about 1.8, implying that the sample size of a cluster-randomized design should be 180% of the estimated sample size of an individually randomized design to achieve the same statistical power.
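As a check on this arithmetic, the design effect can be sketched in a few lines of Python (illustrative only, not from the book; the function name is mine), using \tilde{n} = \sum n_i^2 / N for unequal cluster sizes:

```python
def deff(cluster_sizes, icc):
    """Design effect 1 + (n_tilde - 1) * icc, with
    n_tilde = sum(n_i^2) / N replacing n for unequal cluster sizes."""
    N = sum(cluster_sizes)
    n_tilde = sum(n * n for n in cluster_sizes) / N
    return 1 + (n_tilde - 1) * icc

# The example from the text: clusters of size 40 and ICC = 0.02
d = deff([40] * 24, 0.02)  # 1 + 39 * 0.02 ≈ 1.78
```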
There are several approaches that can be used to allow for clustering
ranging from simple to quite sophisticated:
The focus in this book will be on the statistical analysis of correlated data
using the first three approaches.
data milk;
input farm milk size $;
cards;
1 32.33 L
1 29.47 L
1 30.19 L
···
10 24.12 S
;
proc sort data=milk; by size; run;
The ANOVA results from the SAS output give

\hat{\sigma}_e^2 = 6.67, \qquad \hat{\sigma}_b^2 = \frac{100.57 - 6.67}{12} = 7.83

Therefore, the estimated ICC is \hat{\rho} = 7.83/(7.83 + 6.67) = 0.54.
An important objective of this study was to test whether the average milk yield in large farms differs significantly from the average milk yield in small farms, that is, to test H_0: \mu_s = \mu_l versus H_1: \mu_s \neq \mu_l.
The ANOVA results separately for each farm size (small and large) are now
produced. The SAS code is
proc glm data = milk;
class farm;
model milk = farm;
by size;
run; quit;
Large farm size

(The ANOVA table, with columns Source, DF, Sum of Squares, and Mean Square, is not reproduced here; it gives \bar{y}_l = 28 and \hat{V}(\bar{y}_l) = 1.17.)

For the small farms,

\bar{y}_s = 25.32, \quad k = 5, \quad n = 12, \quad N = 60

\hat{\sigma}_{es}^2 = 5.94, \qquad \hat{\sigma}_{bs}^2 = \frac{94.81 - 5.94}{12} = 7.4, \qquad \frac{\hat{\sigma}_{bs}^2}{\hat{\sigma}_{es}^2} = 1.24

\hat{V}(\bar{y}_s) = \frac{5.94}{60}\,[1 + (12)(1.24)] = 1.57

Hence

Z = \frac{28 - 25.32}{(1.17 + 1.57)^{1/2}} = 1.61
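The small-farm variance and the resulting Z statistic can be reproduced as follows. This is an illustrative Python check of the arithmetic, not code from the book; it uses the identity (\sigma_y^2/N)[1 + (n-1)\rho] = (\sigma_e^2/N)(1 + n\,\sigma_b^2/\sigma_e^2):

```python
import math

def var_grand_mean(sigma_e2, sigma_b2, n, N):
    """Variance of the grand mean under clustering:
    (sigma_e^2 / N) * (1 + n * sigma_b^2 / sigma_e^2)."""
    return (sigma_e2 / N) * (1 + n * sigma_b2 / sigma_e2)

v_small = var_grand_mean(5.94, 7.4, 12, 60)   # ≈ 1.58 (the text rounds the ratio to 1.24 and gets 1.57)
z = (28 - 25.32) / math.sqrt(1.17 + v_small)  # ≈ 1.61
```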
model that characterizes the main features of the behavior of the response
variable.
Regression analysis is among the most commonly used methods of sta-
tistical analysis to model the relationship between variables. Its objective
is to describe the relationship of response with independent or explanatory
variables. In its very general form, a regression model is written as
Y = Xβ + e (1.16)
Under the above conditions and provided that (X T X)−1 exists, the least
squares estimate of β is
β̂ = (X T X)−1 X T Y (1.17)
Equation 1.17 is important in regression analysis since it provides the esti-
mates of β once we are sure that conditions (a, b, c) are satisfied and the
matrix X is specified.
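For a single covariate with an intercept, Equation 1.17 reduces to the familiar two-parameter normal equations. A minimal pure-Python sketch (illustrative only, with hypothetical data):

```python
def ols(x, y):
    """Least squares via the normal equations (Equation 1.17),
    specialized to one covariate plus an intercept."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx           # determinant of X'X
    b0 = (sxx * sy - sx * sxy) / det  # intercept
    b1 = (n * sxy - sx * sy) / det    # slope
    return b0, b1

b0, b1 = ols([0, 1, 2, 3], [1, 3, 5, 7])  # data on the exact line y = 1 + 2x
```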
In addition to the linear models (Equation 1.16), regression models include
logistic models for binary responses, log-linear model for counts, and survival
analysis for time to events. In this chapter, we discuss the linear-normal model
for continuous responses when the basic assumption that all the observations
are independent, or at least uncorrelated, is violated. Recall that the assump-
tion of zero correlation would mean that knowing one subject’s response
provides no information regarding the status of another subject in the same
study. However, the assumption of independence may not hold if the subjects
belong to the same cluster, as has already been demonstrated. An example of a regression problem in which clusters of subjects are sampled together is Miall and Oldham's family study of arterial blood pressure levels. Owing to
their common household environment and their shared genetic makeup, we
would expect a family member to have a greater chance of having elevated
blood pressure levels if his/her sibling had the same. Data from this study can
be usefully thought of as being “clustered’’ into families. Blood pressure lev-
els from different families are likely to be independent; those from the same
family are not. This dependence among observations from the same cluster
must be accounted for in assessing the relationships between risk factors and
health outcomes.
Another example cited by Liang and Zeger (1993) is the growth study
of Hmong refugee children. In this example, 1000 Hmong refugee children
receiving health care at two Minnesota clinics between 1976 and 1985 were
examined for their growth patterns. The objective was to study the patterns
of growth and its association with age at entry into the United States. It is
believed that stature is influenced by both genetic and environmental factors.
When the offending environmental factors are removed, the growth process
progresses at a faster rate. To study the growth, repeated measurements of
height of each child were recorded. The number of visits per child ranged
from 1 to 15 and averaged 5. The correlation between repeated observations
on height for each child may be a nuisance but cannot be ignored in regression
analysis.
The above two examples have common features, although they address
questions with different scientific objectives. Data in both studies are organized in clusters: in the family study the clusters are formed by families, and in the growth study a cluster comprises the repeated observations for a child. Another aspect of similarity between the two studies
is that one can safely assume that the response variables (blood pressure
in the family study and height in the growth study) are normally dis-
tributed. The two studies also differ in the structure of the within-cluster
correlation. For example, in the family study one may assume that the cor-
relation between the pairs of sibs within the family is equal, that is, we
may assume a constant within-cluster correlation. For repeated measures
longitudinal study, the situation is different. Although the repeated observa-
tions are correlated, this correlation may not be constant across time (cluster
units). It is common sense to assume that observations taken at adjacent time
points are more correlated than observations that are taken at separated time
points.
In the remainder of this chapter, we shall focus on regression analysis of
clustered data assuming common or fixed within-cluster correlation and the
response variable is normally distributed. Other types of response variables
and different correlation structures will be discussed in detail in subsequent
chapters.
Within the framework of linear regression, we illustrate an answer to the
question: what happens when the conventional linear regression is used to
analyze clustered data?
Let Yij denote the score on the jth offspring in the ith family; Xi the score of
the ith parent, where j = 1, 2, . . . , ni , i = 1, 2, . . . , k; ni the number of offspring
in the ith family; and k the total number of families. We assume that the
regression of Y on X is given by
and

V(b_1) = \frac{\sigma^2 (1 - \rho)}{\sum_i w_i (x_i - \hat{x})^2}

where w_i = n_i/(1 + n_i \rho_e), \rho_e = (\rho - \beta^2)/(1 - \beta^2), \bar{y}_i = \sum_j y_{ij}/n_i, and \hat{x} = \sum_i w_i x_i / \sum_i w_i.
The most widely used estimator of \beta ignores the ICC \rho and is given by the usual estimator

b = \frac{\sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{k} n_i (x_i - \bar{x})^2}   (1.20)

where \bar{y} = \sum_i n_i \bar{y}_i / N, \bar{x} = \sum_i n_i x_i / N, and N = n_1 + n_2 + \cdots + n_k.
Its variance is

V(b) = \frac{\sigma^2 (1 - \rho) \sum_i n_i (1 + n_i \rho_e)(x_i - \bar{x})^2}{\left[\sum_i n_i (x_i - \bar{x})^2\right]^2}   (1.21)

If \rho_e = 0, then w_i = n_i and V(b) = V(b_1) = \sigma^2(1-\rho)/\sum_i n_i (x_i - \bar{x})^2, which means that b is fully efficient. Therefore,

\frac{V(b)}{V(b_1)} = \frac{\sum_i n_i (1 + n_i \rho_e)(x_i - \bar{x})^2}{\sum_i n_i (x_i - \bar{x})^2}

or equivalently

\frac{V(b)}{V(b_1)} = 1 + \frac{\rho_e \sum_i n_i^2 (x_i - \bar{x})^2}{\sum_i n_i (x_i - \bar{x})^2}   (1.22)
Assuming ρe > 0, b is always less efficient.
The most important message of Equation 1.22 is that ignoring within-
cluster correlation can lead to a loss of power when both within-cluster and
cluster-level covariate information are being used to estimate the regression
coefficient.
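The variance inflation in Equation 1.22 can be evaluated numerically. An illustrative Python sketch (the cluster sizes and cluster-level covariate values below are hypothetical):

```python
def efficiency_ratio(n_sizes, x, rho_e):
    """V(b)/V(b1) from Equation 1.22:
    1 + rho_e * sum(n_i^2 (x_i - xbar)^2) / sum(n_i (x_i - xbar)^2),
    where xbar is the n_i-weighted mean of the cluster-level covariate."""
    N = sum(n_sizes)
    xbar = sum(n * xi for n, xi in zip(n_sizes, x)) / N
    num = sum(n * n * (xi - xbar) ** 2 for n, xi in zip(n_sizes, x))
    den = sum(n * (xi - xbar) ** 2 for n, xi in zip(n_sizes, x))
    return 1 + rho_e * num / den

# With equal cluster sizes n, the ratio reduces to 1 + n * rho_e
r = efficiency_ratio([5, 5, 5], [1.0, 2.0, 3.0], 0.1)  # 1 + 5 * 0.1 = 1.5
```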
To analyze clustered data, one must therefore model both the regression
of Y on X and the within-cluster dependence. If the responses are indepen-
dent of each other, then ordinary least squares can be used, which produces
regression estimators that are identical to the maximum likelihood in the case
of normally distributed responses. In this chapter we consider two differ-
ent modeling approaches: marginal and random effects. In marginal models,
the regression of Y on X and the within-cluster dependence are modeled
separately. The random effects models attempt to address both issues simul-
taneously through a single model. We shall explore both modeling strategies
for a much wider class of distributions named “generalized linear models’’
(GLM) that includes the normal distribution as a special case.
g(\mu) = X^T \beta   (1.23)
Here, g(.) is a specified function known as the “link function.’’ The normal
regression model for continuous data is a special case of Equation 1.23, where
g(.) is the identity link. For binary response variable Y, the logistic regression
is a special case of Equation 1.23 with the logit transformation as the link.
That is,
g(\mu) = \log\left(\frac{\mu}{1-\mu}\right) = \log\left(\frac{P(Y=1)}{P(Y=0)}\right) = X^T \beta
When the response variable is count, we assume that
\log E(Y) = X^T \beta
To account for the variability in the response variable that is not accounted for by the systematic component, the GLM assumes that Y has a probability density function in the exponential family,

f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right\}

For the normal distribution, \theta = \mu (identity link), b(\theta) = \mu^2/2, and \phi = \sigma^2. For the Poisson distribution, \theta = \ln \mu (log link), b(\theta) = e^\theta = \mu, \phi = 1, and b'(\theta) = e^\theta = \mu.
U = \sum_{i=1}^{p} \left(\frac{\partial \mu_i(\beta)}{\partial \beta}\right)^T V^{-1}(Y_i)\,[Y_i - \mu_i(\beta)] = 0   (1.25)
The above equation provides valid estimates when the responses are indepen-
dently distributed. For clustered data, the GLM and hence Equation 1.25 are
not sufficient, since the issue of within-cluster correlation is not addressed.
We now discuss the two modeling approaches commonly used to analyze
clustered data.
U_1(\beta, \alpha) = \sum_{i=1}^{k} \left(\frac{\partial \mu_i}{\partial \beta}\right)^T \mathrm{Cov}^{-1}(Y_i; \beta, \alpha)\,[y_i - \mu_i(\beta)] = 0   (1.27)
where µi (β) = E(Yi ), the marginal mean of Yi . One should note that U1 (.) has
the same form of U(.) in Equation 1.25, except that Yi is now ni × 1 vector,
which consists of the ni observations of the ith cluster, and the covariance
matrix Cov(Yi ), which depends on β and α, a parameter that characterizes
the within-cluster dependence.
For a given \alpha, the solution \hat{\beta} to Equation 1.27 can be obtained by an iteratively reweighted least squares calculation. The solution to these equations
is asymptotically normal, with a covariance matrix that can be consistently estimated by the sandwich estimator A^{-1} M A^{-1}, where

A = \sum_{i=1}^{k} \hat{D}_i^T \hat{V}_i^{-1} \hat{D}_i

M = \sum_{i=1}^{k} \hat{D}_i^T \hat{V}_i^{-1} \widehat{\mathrm{Cov}}(Y_i)\, \hat{V}_i^{-1} \hat{D}_i

\widehat{\mathrm{Cov}}(Y_i) = \big(Y_i - \mu_i(\hat{\beta})\big)\big(Y_i - \mu_i(\hat{\beta})\big)^T
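For intuition about the structure of A^{-1} M A^{-1}, here is a deliberately minimal Python sketch for a no-intercept linear marginal model with a working-independence covariance, where \beta, A, and M are all scalars. This illustrates the form of the estimator only; it is not the book's SAS/R implementation, and the function name and data layout are mine:

```python
def gee_sandwich_scalar(clusters):
    """clusters: list of (xs, ys) pairs, one per cluster.
    Returns (beta_hat, robust variance A^{-1} M A^{-1}) for the model
    E(y) = beta * x under working independence."""
    beta = (sum(x * y for xs, ys in clusters for x, y in zip(xs, ys))
            / sum(x * x for xs, _ in clusters for x in xs))
    A = sum(x * x for xs, _ in clusters for x in xs)   # the D'V^{-1}D term
    # M sums the squared cluster-level score contributions
    M = sum(sum(x * (y - beta * x) for x, y in zip(xs, ys)) ** 2
            for xs, ys in clusters)
    return beta, M / (A * A)
```

Because residuals are aggregated within each cluster before squaring, the resulting standard error is robust to within-cluster correlation.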
U_2(\beta, \alpha) = \sum_{i=1}^{k} \left(\frac{\partial \mu_i^*}{\partial \beta}\right)^T [\mathrm{Cov}(Z_i; \delta)]^{-1} \big(Z_i - \mu_i^*(\theta)\big) = 0

where

Z_i = \big(Y_{i1}, \ldots, Y_{in_i}, Y_{i1}^2, \ldots, Y_{in_i}^2, Y_{i1}Y_{i2}, Y_{i1}Y_{i3}, \ldots, Y_{i,n_i-1}Y_{in_i}\big)^T

and \mu_i^* = E(Z_i; \theta), which is completely specified by the modeling assumptions of the GLM. They called this expanded procedure, which uses both the Y_{ij}'s and the products Y_{ij}Y_{il}, the GEE2.
GEE2 seems to improve the efficiency of the estimates of both \beta and \alpha. On the other hand, the robustness property that GEE enjoys for \beta no longer holds: correct inferences about \beta require correct specification of the within-cluster dependence structure. The same authors suggest a sensitivity analysis when making inferences on \beta; that is, one may repeat the procedure with different models for the within-cluster dependence structure to examine the sensitivity of \hat{\beta} to the choice of dependence structure.
familyid: Family ID
subjid: Sibling ID
sbp: Sibling systolic blood pressure
msbp: Mother systolic blood pressure
age: Sibling age
armgirth: Sibling arm girth
cenmsbp: Mother systolic blood pressure centered
cenage: Sibling age centered within the family
cengirth: Sibling arm girth centered within the family
The records with missing values of sibling age, mother's systolic blood pressure, sibling arm girth, and sibling systolic blood pressure are deleted. The dataset consists of 488 observations on 154 families. The family size ranges from 1 to 10. We begin by first analyzing the data using the GEE approach. This is followed by a series of models using the multilevel modeling approach. All models are fitted using the SAS procedures GENMOD for the GEE approach and MIXED for the multilevel approach. The equivalent R code (R 2.3.1; R Development Core Team, 2006) is also provided.
The SAS code to read in the data and fit this model is
data fam;
input familyid subjid sbp age armgirth msbp;
datalines;
1 1 85 5 5.75 120
1 2 105 15 8.50 120
......
The following is the partial output showing the analysis of the GEE parameter estimates (empirical standard error estimates). The table columns are Parameter, Estimate, Standard Error, 95% Confidence Limits, Z, and Pr > |Z|; the numerical entries are not reproduced here.
yij = µ + bi + eij
The model has one fixed effect (µ) and two variance components—one rep-
resenting the variation between family means (σb2 ) and the other variation
among siblings within families (σe2 ).
The SAS code to fit this model is
The fitted covariance parameter estimates are familyid 106.98 and Residual 166.22, giving

\hat{\rho} = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_b^2 + \hat{\sigma}_e^2} = \frac{106.98}{106.98 + 166.22} = 0.39
This tells us that there is a great deal of clustering of systolic blood pressure
levels of siblings within families.
There is another approach that generalizes more easily to data with multiple
levels. This approach expresses the subject-level outcome yij using a pair of
linked models: one at the subject level (level 1) and another at the family level
(level 2). At level 1, we express yij as the sum of an intercept for the subject’s
family (βi0 ) and random error (eij ):
The parameter estimates under this model are the same as in the previous
model. The purpose of the covtest option in the “proc mixed’’ statement is to
test the hypothesis for the variance components.
Note that there are two fixed effects to be estimated: the intercept and the covariate (MSBP). The null hypothesis, which states that there is no relationship between the mother's systolic blood pressure levels and those of the siblings, is not supported by the data. Also note that the variance component estimates are
67.47 and 163.89 for τ0 and σe2 , respectively. These estimates under the present
model have different interpretations. In model 1, there were no covariates,
so these were unconditional components. After adding the mother’s blood
pressure as a covariate, these are now conditional components. Note that
the conditional within-family component is slightly reduced (from 166.22 to
163.89). The variance component representing variation between families τ0
or σb2 has diminished markedly (from 106.98 to 67.47). This tells us that the
cluster- or family-level covariate (mother’s systolic blood pressure) explains
a large percentage of the family-to-family variation. One way of measuring
how much variation exists in family mean blood pressures as explained by the
mother’s blood pressure levels is to compute how much the variance compo-
nent for this term τ0 has diminished between the two models. Following Bryk
and Raudenbush (1992), we compute this as (106.98 − 67.47)/106.98 = 36.9%.
This is interpreted by saying that about 36% of the explainable variation in
the sibling’s mean systolic blood pressure levels is explained by the mother’s
systolic blood pressure levels.
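This proportion-of-variance calculation is a one-liner; the following is an illustrative Python check using the variance components quoted above:

```python
# Drop in the between-family variance component after adding the
# level-2 covariate (mother's SBP), following Bryk and Raudenbush (1992)
tau0_unconditional = 106.98   # model with no covariates
tau0_conditional = 67.47      # after adding the mother's SBP
explained = (tau0_unconditional - tau0_conditional) / tau0_unconditional
print(round(100 * explained, 1))  # 36.9 (percent)
```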
Here, Zij is the age of the jth subject within the ith family centered at its mean
value. The other terms are defined as before. Let
Hence,
where e_{ij} \sim N(0, \sigma_e^2), u_i = (u_{i0}, u_{i1})^T \sim \mathrm{BVN}(0, \Sigma), and e_{ij} is independent of the bivariate normal random vector u_i. Here \Sigma is a 2 × 2 symmetric matrix whose elements are \delta_{00} = V(u_{i0}), \delta_{11} = V(u_{i1}), and \delta_{01} = \mathrm{Cov}(u_{i0}, u_{i1}). The SAS code to fit this model is
Notice that the random statement has two random effects—one for intercept
and one for the Z-slope. There is also an additional option in the random
statement, “type=un,’’ indicating that an unstructured specification for the
covariance of ui is assumed. Partial SAS output is shown below.
We shall start by first interpreting the fixed effects. The estimate of the inter-
cept 118.43 indicates the estimated average sibling systolic blood pressure
levels after controlling for the mother’s systolic blood pressure. The estimate
of the cenmsbp indicates that the average slope representing the relationship
between siblings’ blood pressure and the mother’s systolic blood pressure is
0.20. The standard errors of these estimates are very small, resulting in small p-values. We conclude that, on average, there is a statistically significant
where Zij is the centered arm girth of the jth subject within the ith family, aij
the centered age of the jth subject within the ith family, and xi the centered
mother’s systolic blood pressure in the ith family.
Hence,
yij = γ00 + γ01 xi + ui0 + Zij (γ10 + γ11 xi + ui1 ) + aij (γ20 + γ21 xi + ui2 ) + eij
Simplifying, we get
yij = (γ00 + γ01 xi + γ10 Zij + γ20 aij + γ11 xi Zij + γ21 xi aij) + (ui0 + Zij ui1 + aij ui2 + eij)
Terms in the first bracket should appear in the model statement, while those
in the second bracket should appear in the random statement.
The SAS code to fit the model is (the model and random statements follow the
two bracketed groups of terms above, using the variable names of the R code
below):
proc mixed data=family covtest noclprint noitprint;
  class familyid;
  model sbp = cenmsbp cengirth cenage cenmsbp*cengirth cenmsbp*cenage / solution;
  random intercept cengirth cenage / subject=familyid type=un;
run;
Interpretation of the above output has been left as an exercise to the reader.
The R code reads the data; computes the centered variables cenage, cen-
girth, and cenmsbp; and fits the alternative models discussed in this example.
Note that the packages “nlme” and “gee” should be installed and loaded for
the functions “lme” and “gee”, respectively, to run.
library(nlme)  # for lme()
library(gee)   # for gee()
fam <- read.table("x:/xxx/familydata.txt", header = TRUE)
# Center age and girth within each family; center mother's SBP at its grand mean
cenage   <- unlist(tapply(fam[, 4], fam[, 1], scale, scale = FALSE))
cengirth <- unlist(tapply(fam[, 5], fam[, 1], scale, scale = FALSE))
cenmsbp  <- scale(fam[, 6], center = TRUE, scale = FALSE)
family   <- data.frame(fam, cenage, cengirth, cenmsbp)
# Generalized estimating equations (GEE)
fam.gee <- gee(sbp ~ cenmsbp + cenage + cengirth, id = familyid, data = family,
               family = gaussian, corstr = "exchangeable")
summary(fam.gee)
# Unconditional means model—Mixed Model 1
fam.lme1 <- lme(fixed = sbp ~ 1, random = ~ 1 | familyid, data = family)
summary(fam.lme1)
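The listing above ends with the unconditional means model; the remaining mixed models discussed in this example can be fitted with calls along the following lines (the object names fam.lme2 and fam.lme3 are illustrative):

```r
# Mixed Model 2: mother's centered SBP as a family-level predictor,
# with a random intercept and a random (centered) age slope per family
fam.lme2 <- lme(fixed = sbp ~ cenmsbp + cenage,
                random = ~ cenage | familyid, data = family)
summary(fam.lme2)

# Full model: random intercept, girth slope, and age slope,
# each allowed to depend on the mother's centered SBP
fam.lme3 <- lme(fixed = sbp ~ cenmsbp * (cengirth + cenage),
                random = ~ cengirth + cenage | familyid, data = family)
summary(fam.lme3)
```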
Appendix
1. Linear combinations of random variables
Let x = (x1, x2, . . . , xk) be a set of random variables such that E(xi) = µi,
V(xi) = σi2, and Cov(xi, xj) = cij. We define a linear combination of the
random variables x as

y = Σ_{i=1}^{k} wi xi

Then

E(y) = Σ_{i=1}^{k} wi µi                                  (A.1)

and

V(y) = Σ_{i=1}^{k} wi2 σi2 + Σ_{i=1}^{k} Σ_{j=1, j≠i}^{k} wi wj cij      (A.2)

2. Consider two linear combinations ys = Σ_{i=1}^{k} asi xi and yt = Σ_{i=1}^{k} ati xi. Then
the covariance between ys and yt is

Cov(ys, yt) = Σ_{i=1}^{k} Σ_{j=1}^{k} asi atj Cov(xi, xj)        (A.3)
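Result (A.2) is the matrix identity V(w′x) = w′Cw, where C is the covariance matrix of x. A small R check, with illustrative numbers:

```r
# Illustrative 3x3 covariance matrix C (cij off the diagonal) and weights w
C <- matrix(c(4.0, 1.5,  0.5,
              1.5, 9.0, -1.0,
              0.5, -1.0, 1.0), nrow = 3, byrow = TRUE)
w <- c(0.2, 0.5, 0.3)

# Right-hand side of (A.2): variance terms plus the i != j cross terms
a2 <- sum(w^2 * diag(C)) + sum((w %o% w * C)[row(C) != col(C)])

# Matrix form V(y) = w' C w
vy <- drop(t(w) %*% C %*% w)

all.equal(a2, vy)  # TRUE
```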