
Lecture 5: One-Way ANOVA (Review) and Experimental Design

Samuels and Witmer Chapter 11 - all sections except 6.


The one-way analysis of variance (ANOVA) is a generalization of the two sample t−test
to k ≥ 2 groups. Assume that the populations of interest have the following (unknown)
population means and standard deviations:

             population 1   population 2   · · ·   population k

mean         µ1             µ2             · · ·   µk
std dev      σ1             σ2             · · ·   σk

A usual interest in ANOVA is whether µ1 = µ2 = · · · = µk . If not, then we wish to know
which means differ, and by how much. To answer these questions we select samples from
each of the k populations, leading to the following data summary:

             sample 1   sample 2   · · ·   sample k

size         n1         n2         · · ·   nk
mean         Ȳ1         Ȳ2         · · ·   Ȳk
std dev      s1         s2         · · ·   sk

A little more notation is needed for the discussion. Let Yij denote the j th observation in the
ith sample and define the total sample size n∗ = n1 + n2 + · · · + nk . Finally, let Ȳ¯ be the
average response over all samples (combined), that is
    Ȳ¯ = (Σij Yij)/n∗ = (Σi ni Ȳi)/n∗ .
Note that Ȳ¯ is not the average of the sample means, unless the sample sizes ni are equal.
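A quick numerical check of this point (a Python sketch with made-up samples; the lecture's own computations are in Stata):

```python
# Grand mean of the pooled data vs. (weighted) average of the group means.
# Illustrative made-up samples with unequal sizes n1 = 4, n2 = 2.
sample1 = [2.0, 4.0, 6.0, 8.0]   # mean Ybar1 = 5.0
sample2 = [10.0, 20.0]           # mean Ybar2 = 15.0

pooled = sample1 + sample2
n_star = len(pooled)                       # n* = 6
grand_mean = sum(pooled) / n_star          # sum of Yij over n*

means = [sum(s) / len(s) for s in (sample1, sample2)]
weighted = sum(len(s) * m for s, m in zip((sample1, sample2), means)) / n_star

print(grand_mean)       # equals the weighted average of the group means
print(weighted)
print(sum(means) / 2)   # 10.0, the unweighted average of the means, differs
```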
An F −statistic is used to test H0 : µ1 = µ2 = · · · = µk against HA : not H0 . The
assumptions needed for the standard ANOVA F −test are analogous to the independent
two-sample t−test assumptions: (1) Independent random samples from each population.
(2) The population frequency curves are normal. (3) The populations have equal standard
deviations, σ1 = σ2 = · · · = σk .
The F −test is computed from the ANOVA table, which breaks the spread in the combined
data set into two components, or Sums of Squares (SS). The Within SS, often called the
Residual SS or the Error SS, is the portion of the total spread due to variability within
samples:
    SS(Within) = (n1 − 1)s1² + (n2 − 1)s2² + · · · + (nk − 1)sk² = Σij (Yij − Ȳi)² .

The Between SS, often called the Model SS, measures the spread between (actually among!)
the sample means
    SS(Between) = n1(Ȳ1 − Ȳ¯)² + n2(Ȳ2 − Ȳ¯)² + · · · + nk(Ȳk − Ȳ¯)² = Σi ni(Ȳi − Ȳ¯)² ,

weighted by the sample sizes. These two SS add to give


    SS(Total) = SS(Between) + SS(Within) = Σij (Yij − Ȳ¯)² .

Each SS has its own degrees of freedom (df ). The df (Between) is the number of groups
minus one, k − 1. The df (Within) is the total number of observations minus the number
of groups: (n1 − 1) + (n2 − 1) + · · · + (nk − 1) = n∗ − k. These two df add to give df (Total)
= (k − 1) + (n∗ − k) = n∗ − 1.
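The SS identities can be verified numerically; a minimal Python sketch with illustrative data (not the CHDS values):

```python
# One-way ANOVA sums of squares on k = 3 illustrative samples.
samples = [[3.0, 5.0, 7.0], [6.0, 8.0], [1.0, 2.0, 3.0, 4.0]]

k = len(samples)
sizes = [len(s) for s in samples]
n_star = sum(sizes)
means = [sum(s) / len(s) for s in samples]
grand = sum(sum(s) for s in samples) / n_star

# Within SS: spread of observations about their own group mean.
ss_within = sum((y - m) ** 2 for s, m in zip(samples, means) for y in s)
# Between SS: spread of group means about the grand mean, weighted by n_i.
ss_between = sum(n * (m - grand) ** 2 for n, m in zip(sizes, means))
# Total SS: spread of all observations about the grand mean.
ss_total = sum((y - grand) ** 2 for s in samples for y in s)

assert abs(ss_total - (ss_between + ss_within)) < 1e-9   # the SS add up
assert (k - 1) + (n_star - k) == n_star - 1              # and so do the df
print(ss_between, ss_within, ss_total)
```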
The Sums of Squares and df are neatly arranged in a table, called the ANOVA table:

Source            df         SS                          MS
Between Groups    k − 1      Σi ni (Ȳi − Ȳ¯)²
Within Groups     n∗ − k     Σi (ni − 1)si²
Total             n∗ − 1     Σij (Yij − Ȳ¯)²

The ANOVA table often gives a Mean Squares (MS) column, left blank here. The
Mean Square for each source of variation is the corresponding SS divided by its df . The
Mean Squares can be easily interpreted.
The MS(Within),

    s²pooled = [(n1 − 1)s1² + (n2 − 1)s2² + · · · + (nk − 1)sk²] / (n∗ − k),
is a weighted average of the sample variances. The MS(Within) is known as the pooled
estimator of variance, and estimates the assumed common population variance. If all the
sample sizes are equal, the MS(Within) is the average sample variance. The MS(Within) is
identical to the pooled variance estimator in a two-sample problem when k = 2.

The MS(Between),

    Σi ni (Ȳi − Ȳ¯)² / (k − 1),
is a measure of variability among the sample means. This MS is a multiple of the sample
variance of Ȳ1 , Ȳ2 , ..., Ȳk when all the sample sizes are equal.
The MS(Total),

    Σij (Yij − Ȳ¯)² / (n∗ − 1),
is the variance in the combined data set.
The decision on whether to reject H0 : µ1 = µ2 = · · · = µk is based on the ratio of the
MS(Between) and the MS(Within):

    Fs = MS(Between) / MS(Within) .

Large values of Fs indicate large variability among the sample means Ȳ1 , Ȳ2 , ..., Ȳk relative
to the spread of the data within samples. That is, large values of Fs suggest that H0 is false.
Formally, for a size α test, reject H0 if Fs ≥ Fcrit , where Fcrit is the upper-α percentile
from an F distribution with numerator degrees of freedom k − 1 and denominator degrees
of freedom n∗ − k (i.e. the df for the numerator and denominator of the F −ratio). An
F distribution table is given on pages 654-663 of SW. The p-value for the test is the area
under the F − probability curve to the right of Fs :

[Figure: F distribution with 3 and 20 df; the p-value is the area under the curve to the right of Fs, with Fcrit marked on the axis.]
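The computations above, from sums of squares to Fs, can be sketched as a small Python helper (illustrative data; the lecture itself uses Stata):

```python
def oneway_f(samples):
    """Return (F statistic, df between, df within) for a one-way ANOVA."""
    k = len(samples)
    sizes = [len(s) for s in samples]
    n_star = sum(sizes)
    means = [sum(s) / len(s) for s in samples]
    grand = sum(sum(s) for s in samples) / n_star
    ss_between = sum(n * (m - grand) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((y - m) ** 2 for s, m in zip(samples, means) for y in s)
    ms_between = ss_between / (k - 1)       # MS(Between)
    ms_within = ss_within / (n_star - k)    # MS(Within) = s^2 pooled
    return ms_between / ms_within, k - 1, n_star - k

# Reject H0 at level alpha when Fs >= Fcrit(k-1, n*-k) from an F table.
Fs, df1, df2 = oneway_f([[3.0, 5.0, 7.0], [6.0, 8.0], [1.0, 2.0, 3.0, 4.0]])
print(Fs, df1, df2)
```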

Stata summarizes the ANOVA F −test with a p-value. In Stata, use the anova or oneway
commands to perform 1-way ANOVA. The data should be in the form of a variable containing
the response Yij and a grouping variable. For k = 2, the test is equivalent to the pooled
two-sample t−test.
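The equivalence for k = 2 can be checked numerically; in this Python sketch (made-up data), the ANOVA F statistic equals the square of the pooled t statistic:

```python
import math

a = [1.0, 2.0, 3.0, 4.0]
b = [3.0, 5.0, 7.0]

# Pooled two-sample t statistic.
na, nb = len(a), len(b)
ma, mb = sum(a) / na, sum(b) / nb
va = sum((y - ma) ** 2 for y in a) / (na - 1)
vb = sum((y - mb) ** 2 for y in b) / (nb - 1)
sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)   # pooled variance
t = (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

# One-way ANOVA F statistic on the same two groups.
grand = (sum(a) + sum(b)) / (na + nb)
ss_between = na * (ma - grand) ** 2 + nb * (mb - grand) ** 2
ss_within = (na - 1) * va + (nb - 1) * vb
F = (ss_between / 1) / (ss_within / (na + nb - 2))

print(F, t ** 2)   # identical, up to rounding
```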

Example from the Child Health and Development Study (CHDS)

We consider data from the birth records of 680 live-born white male infants. The infants
were born to mothers who reported for pre-natal care to three clinics of the Kaiser hospitals
in northern California. As an initial analysis, we will examine whether maternal smoking has
an effect on the birth weights of these children. To answer this question, we define 3 groups
based on mother’s smoking history: (1) mother does not currently smoke or never smoked (2)
mother smoked less than one pack of cigarettes a day during pregnancy (3) mother smoked
at least one pack of cigarettes a day during pregnancy.
Let µi = pop mean birth weight (in lbs) for children in group i, (i = 1, 2, 3). We wish to
test H0 : µ1 = µ2 = µ3 against HA : not H0 .
The side-by-side boxplots of the data show roughly the same spread among groups and
little evidence of skew:
[Figure: side-by-side boxplots of child's birth weight (lbs) for groups 1, 2, and 3.]

There is no strong evidence against normality here. Furthermore the sample standard
deviations are close (see the following output). We may formally test the equality of variances
across the three groups (remember - the F-test is not valid if its assumptions are not met)
using Stata’s robvar command. In this example we obtain a set of three robust tests for
the hypothesis H0 : σ1 = σ2 = σ3 where σi is the population standard deviation of weight in
group i, i = 1, 2, 3. What robust means in this context is that the test still works reasonably
well if assumptions are not quite met. The classical test of this hypothesis is Bartlett’s test,
and that test is well known to be extraordinarily sensitive to the assumption of normality
of all the distributions. There are two ways a test may not work well when assumptions are
violated - the level may not be correct, or the power may be poor. For Bartlett’s test, the
problem is the level may not be accurate, which in this case means that you may see a small
p-value that does not reflect unequal variances but instead reflects non-normality. A test
with this property is known as liberal because it rejects H0 too often (relative to the nominal
α).
Stata output follows; we do not reject that the variances are equal across the three
groups at any reasonable significance level using any of the three test statistics:

| Summary of weight
group | Mean Std. Dev. Freq.
------------+------------------------------------
1 | 7.7328084 1.0523406 381
2 | 7.2213018 1.0777604 169
3 | 7.2661539 1.0909461 130
------------+------------------------------------
Total | 7.5164706 1.0923455 680
W0 = .82007944 df(2, 677) Pr > F = .44083367
W50 = .75912861 df(2, 677) Pr > F = .46847213
W10 = .77842523 df(2, 677) Pr > F = .45953896

There are multiple ways to get the ANOVA table here, the most common being the
command anova weight group or the more specialized oneway weight group. Bartlett’s
test for equal variances is given when using the latter. In the following output, I also gave
the ,b option to get Bonferroni multiple comparisons, discussed after Fisher’s method
(next section). The ANOVA table is:

. oneway weight group,b


Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 40.7012466 2 20.3506233 17.90 0.0000
Within groups 769.4943 677 1.13662378
------------------------------------------------------------------------
Total 810.195546 679 1.19321877
Bartlett’s test for equal variances: chi2(2) = 0.3055 Prob>chi2 = 0.858
Comparison of Child’s birth weight (lbs) by group

(Bonferroni)
Row Mean-|
Col Mean | 1 2
---------+----------------------
2 | -.511507
| 0.000
|
3 | -.466655 .044852
| 0.000 1.000

The p-value for the F −test is less than .0001. We would reject H0 at any of the usual
test levels (i.e. .05 or .01), concluding that the population mean birth weights differ in some
way across smoking status groups. The data (boxplots) suggest that the mean birth weights
are higher for children born to mothers that did not smoke during pregnancy, but that is
not a legal conclusion based upon the F-test alone.
The Stata commands to obtain this analysis are:

infile id head ...et cetera... pheight using c:/chds.txt


generate group = 1 if msmoke == 0
replace group = 2 if msmoke >= 1 & msmoke <= 20
replace group = 3 if msmoke > 20
graph box weight, medtype(line) over(group)
robvar weight, by(group)
oneway weight group,b

Multiple Comparison Methods: Fisher’s Method


The ANOVA F −test checks whether all the population means are equal. Multiple com-
parisons are often used as a follow-up to a significant ANOVA F −test to determine which
population means are different. I will discuss Fisher, Bonferroni, and Tukey methods for
comparing all pairs of means. Fisher’s and Tukey’s approaches are implemented in Stata
using Stata’s prcomp command. This command is not automatically installed in Stata 8.0.
You will have to search for “pairwise comparisons” under Help > Search... and click on the
blue Sg101 link. Click on [Click here to install] (your computer must be connected to
the internet to do this) and you will then have access to this command. (This should have
been done in the lab last year, but just in case . . . ).
Fisher’s Least significant difference method (LSD or FSD) is a two-step process:

1. Carry out the ANOVA F −test of H0 : µ1 = µ2 = · · · = µk at the α level. If H0 is
not rejected, stop and conclude that there is insufficient evidence to claim differences
among population means. If H0 is rejected, go to step 2.

2. Compare each pair of means using a pooled two sample t−test at the α level. Use spooled
from the ANOVA table and df = df (Residual). Using this denominator is different
from just doing all the possible pair-wise t-tests.

To see where the name LSD originated, consider the t−test of H0 : µi = µj (i.e. populations
i and j have same mean). The t−statistic is

    ts = (Ȳi − Ȳj) / ( spooled √(1/ni + 1/nj) ) .

You reject H0 if |ts | ≥ tcrit , or equivalently, if


    |Ȳi − Ȳj | ≥ tcrit spooled √(1/ni + 1/nj) .

The minimum absolute difference between Ȳi and Ȳj needed to reject H0 is the LSD, the
quantity on the right hand side of this inequality.
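The LSD itself is easy to compute; a hedged Python sketch using the pooled SD and sample sizes from the output below, with t_crit ≈ 1.9634 (my reading of the upper 2.5% point of a t distribution with 677 df, not a value from the lecture):

```python
import math

def lsd(t_crit, s_pooled, ni, nj):
    """Least significant difference: smallest |Ybar_i - Ybar_j| rejecting H0."""
    return t_crit * s_pooled * math.sqrt(1 / ni + 1 / nj)

# Birth-weight data, groups 1 (n = 381) and 2 (n = 169); s_pooled = 1.066126.
d = lsd(1.9634, 1.066126, 381, 169)
print(round(d, 4))   # about 0.193; |Ybar1 - Ybar2| = .5115 exceeds this
```

Note that this value reproduces the half-width of the 95% confidence limits for the group 2 vs. 1 difference in the prcomp output.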
Stata gives all possible comparisons between pairs of populations means. The error level
(i.e. α) can be set to an arbitrary value using the level() option, with 0.05 being
the standard. Looking at the CI’s in the Stata output, we conclude that the mean birth
weights for children born to non-smoking mothers (group 1) is significantly different from
the mean birth weights for each of the other two groups (2 and 3), since confidence intervals
do not contain 0. The Stata command prcomp weight group produced the output; the
default output includes CIs for differences in means. Alternatively, one obtains the p-values
for testing the hypotheses that the population means are equal using the test option.
This is illustrated in the section on Tukey’s method. Examining the output from the prcomp
command, we see the FSD method is called the t method by Stata.
. prcomp weight group
Pairwise Comparisons of Means
Response variable (Y): weight
Group variable (X): group

Group variable (X): group Response variable (Y): weight
------------------------------- -------------------------------
Level n Mean S.E.
------------------------------------------------------------------
1 381 7.732808 .053913
2 169 7.221302 .0829046
3 130 7.266154 .0956823
------------------------------------------------------------------
Individual confidence level: 95% (t method)
Homogeneous error SD = 1.066126, degrees of freedom = 677
95%
Level(X) Mean(Y) Level(X) Mean(Y) Diff Mean Confidence Limits
-------------------------------------------------------------------------------
2 7.221302 1 7.732808 -.5115066 -.7049746 -.3180387
3 7.266154 1 7.732808 -.4666546 -.6792774 -.2540318
2 7.221302 .0448521 -.1993527 .2890568
-------------------------------------------------------------------------------

Discussion of the FSD Method


With k groups, there are c = (k choose 2) = k(k − 1)/2 pairs of means to compare in the second step
of the FSD method. Each comparison is done at the α level, where for a generic comparison
of the ith and j th populations

α = probability of rejecting H0 : µi = µj when H0 is true.

This probability is called the comparison error rate or the individual error rate.
The individual error rate is not the only error rate that is important in multiple compar-
isons. The family error rate (FER), or the experimentwise error rate, is defined to be
the probability of at least one false rejection of a true hypothesis H0 : µi = µj over all com-
parisons. When many comparisons are made, you may have a large probability of making
one or more false rejections of true null hypotheses. In particular, when all c comparisons
of two population means are performed, each at the α level, then α ≤ F ER ≤ cα.
For example, in the birth weight problem where k = 3, there are c = .5 ∗ 3 ∗ 2 = 3
possible comparisons of two groups. If each comparison is carried out at the 5% level, then
.05 ≤ F ER ≤ .15. At the second step of the FSD method, you could have up to a 15%
chance of claiming one or more pairs of population means are different if no differences
existed between population means.
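The α ≤ FER ≤ cα bookkeeping can be sketched in Python (an illustrative helper; the function name is mine):

```python
def fer_bounds(alpha, k):
    """Bounds on the family error rate for all pairwise comparisons of k means."""
    c = k * (k - 1) // 2          # number of pairwise comparisons
    return alpha, min(c * alpha, 1.0), c

lo, hi, c = fer_bounds(0.05, 3)
print(c, lo, hi)   # with k = 3: c = 3, so 0.05 <= FER <= 0.15

# If the c tests were independent (they are not, but it is a useful reference
# point), the FER would be 1 - (1 - alpha)^c:
print(round(1 - (1 - 0.05) ** 3, 4))
```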
The first step of the FSD method is the ANOVA “screening” test. The multiple compar-
isons are carried out only if the F −test suggests that not all population means are equal.

This screening test tends to deflate the FER for the two-step FSD procedure. However, the
FSD method is commonly criticized for being extremely liberal (too many false rejections
of true null hypotheses) when some, but not many, differences exist - especially when the
number of comparisons is large. This conclusion is fairly intuitive. When you do a large
number of tests, each, say, at the 5% level, then sampling variation alone will suggest differ-
ences in 5% of the comparisons where the H0 is true. The number of false rejections could be
enormous with a large number of comparisons. For example, chance variation alone would
account for an average of 50 significant differences in 1000 comparisons each at the 5% level.

The Bonferroni Multiple Comparison Method


The Bonferroni method goes directly after the preceding relationship, α ≤ F ER ≤ cα. To
keep the FER below level α, do the individual tests at level α/c, or equivalently multiply
each of the reported p-values by c. This is, in practice, extremely conservative but it does
guarantee the FER is below α. If you have, for instance, c = 3 comparisons to make, and
a reported p-value from a t-test is .02 then the Bonferroni p-value is 3(.02) = .06 and the
difference would not be judged significant. With more comparisons it becomes extremely
hard for the Bonferroni method to find anything. The FSD method tends to have a too-high
FER, the Bonferroni method a too-low FER. Very often they agree.
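The adjustment itself is one line; a Python sketch (the function name is mine), using the .02 example from above:

```python
def bonferroni(p_values):
    """Bonferroni-adjust p-values: multiply by the number of tests, cap at 1."""
    c = len(p_values)
    return [min(c * p, 1.0) for p in p_values]

# Three pairwise tests; a raw p-value of .02 is no longer significant at .05.
adj = bonferroni([0.02, 0.0001, 0.6])
print([round(p, 4) for p in adj])   # [0.06, 0.0003, 1.0]
```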
Earlier we looked at the ANOVA output following the oneway weight group,b
command. Examining that output we see p-values of 0 for testing H0 : µ1 = µ2 and
H0 : µ1 = µ3 , and a p-value of 1 for testing H0 : µ2 = µ3 using the Bonferroni method. The
Bonferroni tests see group 1 differing from both 2 and 3, and no difference between 2 and 3,
in complete agreement with FSD.

Tukey’s Multiple Comparison Method


One commonly used alternative to FSD and Bonferroni is Tukey’s honest significant differ-
ence method (HSD). Unlike FSD (but similar to Bonferroni), Tukey’s method allows you
to prespecify the FER, at the cost of making the individual comparisons more conservative
than in FSD (but less conservative than Bonferroni).

To implement Tukey’s method with a FER of α, reject H0 : µi = µj when
    |Ȳi − Ȳj | ≥ (qcrit/√2) spooled √(1/ni + 1/nj) ,
where qcrit is the α level critical value of the studentized range distribution (tables not in
SW). The right hand side of this equation is called the HSD. For the birth weight data,
the groupings based on the Tukey and Fisher methods are identical. We obtain Tukey’s
groupings via the Stata command prcomp weight group, tukey test. The differences
with an asterisk next to them are significant (the |numerator| is larger than the denominator):
Pairwise Comparisons of Means
Response variable (Y): weight
Group variable (X): group
Group variable (X): group Response variable (Y): weight
------------------------------- -------------------------------
Level n Mean S.E.
------------------------------------------------------------------
1 381 7.732808 .053913
2 169 7.221302 .0829046
3 130 7.266154 .0956823
------------------------------------------------------------------
Simultaneous significance level: 5% (Tukey wsd method)
Homogeneous error SD = 1.066126, degrees of freedom = 677
(Row Mean - Column Mean) / (Critical Diff)
Mean(Y) | 7.7328 7.2213
Level(X)| 1 2
--------+--------------------
7.2213| -.51151*
2| .23145
|
7.2662| -.46665* .04485
3| .25436 .29214
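Tukey's rule differs from the LSD only in its critical multiplier, qcrit/√2 in place of tcrit. A Python sketch (q_crit ≈ 3.32, my reading of a studentized-range table for 3 groups and 677 df, is an assumption rather than a value from the lecture):

```python
import math

def hsd(q_crit, s_pooled, ni, nj):
    """Tukey honest significant difference for comparing groups i and j."""
    return (q_crit / math.sqrt(2)) * s_pooled * math.sqrt(1 / ni + 1 / nj)

# Birth-weight data, groups 1 (n = 381) and 2 (n = 169); s_pooled = 1.066126.
d = hsd(3.32, 1.066126, 381, 169)
print(round(d, 3))   # close to the .23145 "Critical Diff" in the output above
```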

Stata does not provide, as built-in commands or options, very many multiple comparison
procedures. The one-way ANOVA problem we have been looking at is relatively simple, and
the Tukey method appears as something of an afterthought for it. For more complicated
multi-factor models, about all Stata offers is Bonferroni and two other methods (Holm and
Sidak) that adjust the p-value similarly, using slightly different principles to control FER,
but less conservatively than Bonferroni. The help file on mtest has details. FSD is always
available, since that amounts to no adjustment. In response to questions on the www about
doing multiple comparisons, Stata has pointed out how easy it is to program whatever you
want in do files (probably the right answer for experts). Some packages like SAS offer a
larger number of options. What Stata offers is adequate for many areas of research, but for
some others it will be necessary to go beyond the built-in offerings of Stata (a reviewer on
your paper will let you know!).

Checking Assumptions in ANOVA Problems


The classical ANOVA assumes that the populations have normal frequency curves and the
populations have equal variances (or spreads). You can test the normality assumption using
multiple Wilk-Shapiro tests (i.e. one for each sample). You discussed the Wilk-Shapiro, or
normal quantile test in Lab. In addition, you can save (to the worksheet) the centered data
values, which are the observations minus the mean for the group from which each observation
comes. These centered values, or residuals, should behave as a single sample from a normal
population. A boxplot and normal quantile test of the residuals gives an overall assessment
of normality. The commands predict r, res and then swilk r indicate that, although
not significant at the 5% level, normality may be suspect:
Shapiro-Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+-------------------------------------------------
r | 680 0.99580 1.866 1.520 0.06425

The command qnorm r yields the following normal probability plot:


[Figure: normal probability plot (qnorm) of the residuals: Residuals vs. Inverse Normal.]

There are several alternative procedures that can be used when either the normality or
equal variance assumption is not satisfied. Welch’s ANOVA method (available in JMP-In,
not directly available in Stata) is appropriate for normal populations with unequal variances.
The test is a generalization of Satterthwaite’s two-sample test discussed last semester. Most
statisticians probably would use weighted least squares or transformations to deal with the
unequal variance problem (we will discuss this if time permits this semester). The Wilcoxon
or Kruskal-Wallis non-parametric ANOVA is appropriate with non-normal populations with
similar spreads.
For the birth weight data, recall that formal tests of equal variances are not significant
(p-values > .4). Thus, there is insufficient evidence that the population variances differ.
Given that the distributions are fairly symmetric, with no extreme values, the standard
ANOVA appears to be the method of choice. As an illustration of an alternative method,
though, the summary from the Kruskal-Wallis (command: kwallis weight, by(group))
approach follows, leading to the same conclusion as the standard ANOVA. One weakness
of Stata is that it does not directly provide for non-parametric multiple comparisons. One
could do all the pair-wise Mann-Whitney two-sample tests and use a Bonferroni adjustment
(the ranksum command implements this two sample version of the Kruskal-Wallis test). The
Bonferroni adjustment just multiplies all the p-values by 3 (the number of comparisons). If
you do this, you find the same conclusions as with the normal-theory procedures: Group 1
differs from the other two, and groups 2 and 3 are not significantly different. Recall from
last semester the Kruskal-Wallis and the Mann-Whitney amount to little more than one-
way ANOVA and two-sample t-tests, respectively, on ranks in the combined samples (this
controls for outliers).
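As a sketch of what kwallis computes, the following Python function (my own illustration, ignoring Stata's tie correction for the chi-squared) forms the H statistic from midranks of the combined sample:

```python
def kruskal_wallis(samples):
    """Kruskal-Wallis H (no tie correction); compare to chi-squared, k-1 df."""
    pooled = sorted(y for s in samples for y in s)
    n = len(pooled)
    # Midrank of each distinct value in the pooled sample (ties share a rank).
    rank, i = {}, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2   # average of ranks i+1, ..., j
        i = j
    rank_sums = [sum(rank[y] for y in s) for s in samples]
    h = sum(rs * rs / len(s) for rs, s in zip(rank_sums, samples))
    return 12.0 / (n * (n + 1)) * h - 3 * (n + 1)

print(round(kruskal_wallis([[1.0, 2.0, 5.0], [3.0, 4.0, 6.0]]), 4))   # 1.1905
```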

Test: Equality of populations (Kruskal-Wallis test)


+-------------------------+
| group | Obs | Rank Sum |
|-------+-----+-----------|
| 1 | 381 | 144979.00 |
| 2 | 169 | 47591.00 |
| 3 | 130 | 38970.00 |
+-------------------------+
chi-squared = 36.594 with 2 d.f.
probability = 0.0001
chi-squared with ties = 36.637 with 2 d.f.
probability = 0.0001

Basics of Experimental Design
This section describes an experimental design to compare the effectiveness of four insecticides
to eradicate beetles. The primary interest is determining which treatment is most effective,
in the sense of providing the lowest typical survival time.
In a completely randomized design (CRD), the scientist might select a sample of
genetically identical beetles for the experiment, and then randomly assign a predetermined
number of beetles to the treatment groups (insecticides). The sample sizes for the groups
need not be equal. A power analysis is often conducted to determine sample sizes for the
treatments. For simplicity, assume that 48 beetles will be used in the experiment, with 12
beetles assigned to each group.
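The random assignment step can be sketched in Python (an illustration of a CRD allocation, not part of the text; the function name is mine):

```python
import random

def randomize_crd(units, n_groups, seed=None):
    """Randomly split experimental units into equal-sized treatment groups."""
    rng = random.Random(seed)
    shuffled = units[:]          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    size = len(units) // n_groups
    return [shuffled[i * size:(i + 1) * size] for i in range(n_groups)]

# 48 beetles, randomly assigned 12 per insecticide.
groups = randomize_crd(list(range(1, 49)), 4, seed=1)
print([len(g) for g in groups])   # [12, 12, 12, 12]
```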
After assigning the beetles to the four groups, the insecticide is applied (uniformly to
all experimental units or beetles), and the individual survival times recorded. A natural
analysis of the data would be to compare the survival times using a one-way ANOVA.
There are several important controls that should be built into this experiment. The
same strain of beetles should be used to ensure that the four treatment groups are as alike as
possible, so that differences in survival times are attributable to the insecticides, and not
due to genetic differences among beetles. Other factors that may influence the survival time,
say the concentration of the insecticide or the age of the beetles, would be held constant, or
fixed by the experimenter, if possible. Thus, the same concentration would be used with the
four insecticides.
In complex experiments, there are always potential influences that go unrecognized or are
thought to be unimportant, and that you do not or cannot control. The randomization of
beetles to groups ensures that there is no systematic dependence of the observed treatment
differences on the uncontrolled influences. This is extremely important in studies where
genetic and environmental influences can not be easily controlled (as in humans, more so
than in bugs or mice). The randomization of beetles to insecticides tends to diffuse or greatly
reduce the effect of the uncontrolled influences on the comparison of insecticides, in the sense
that these effects become part of the uncontrolled or error variation of the experiment.

Suppose yij is the response for the j th experimental unit in the ith treatment group, where
i = 1, 2, ..., I. The statistical model for a completely randomized one-factor design that
leads to a one-way ANOVA is given by:

yij = µi + eij ,

where µi is the (unknown) population mean for all potential responses to the ith treatment,
and eij is the residual or deviation of the response from the population mean. The responses
within and across treatments are assumed to be independent, normal random variables with
constant variance.
For the insecticide experiment, yij is the survival time for the j th beetle given the ith
insecticide, where i = 1, 2, 3, 4 and j = 1, 2, ..., 12. The random selection of beetles coupled
with the randomization of beetles to groups ensures the independence assumptions. The
assumed population distributions of responses for the I = 4 insecticides can be represented
as follows:

[Figure: assumed population distributions of responses for Insecticides 1-4: normal curves with common spread.]

Let µ = (1/I) Σi µi be the grand mean, or average of the population means. Let αi = µi − µ
be the ith treatment group effect. The treatment effects add to zero, α1 +α2 +· · ·+αI = 0,
and measure the difference between the treatment population means and the grand mean.
Given this notation, the one-way ANOVA model is

yij = µ + αi + eij .

The model specifies that the

Response = Grand Mean + Treatment Effect + Residual.

A hypothesis of interest is whether the population means are equal: H0 : µ1 = · · · = µI ,
which is equivalent to the hypothesis of no treatment effects: H0 : α1 = · · · = αI = 0. If H0
is true, then the one-way model is
yij = µ + eij ,

where µ is the common population mean. We know how to test H0 and do multiple com-
parisons of the treatments, so I will skip this material.
Most epidemiological studies are observational studies where the groups to be com-
pared ideally consist of individuals that are similar on all characteristics that influence the
response, except for the feature that defines the groups. In a designed experiment, the
groups to be compared are defined by treatments randomly assigned to individuals. If, in an
observational study we can not define the groups to be homogeneous on important factors
that might influence the response, then we should adjust for these factors in the analysis. I
will discuss this more completely in the next 2 weeks. In the analysis we just did on smoking
and birth weight, we were not able to randomize with respect to several factors that might
influence the response, and will need to adjust for them.

Paired Experiments and Randomized Block Experiments


A randomized block design is often used instead of a completely randomized design in
studies where there is extraneous variation among the experimental units that may influence
the response. A significant amount of the extraneous variation may be removed from the
comparison of treatments by partitioning the experimental units into fairly homogeneous
subgroups or blocks.
For example, suppose you are interested in comparing the effectiveness of four antibiotics
for a bacterial infection. The recovery time after administering an antibiotic may be
influenced by the patient’s general health, the extent of their infection, or their age. Randomly
allocating experimental subjects to the treatments (and then comparing them using
a one-way ANOVA) may produce one treatment having a “favorable” sample of patients
with features that naturally lead to a speedy recovery. Additionally, if the characteristics
that affect the recovery time are spread across treatments, then the variation within samples
due to these uncontrolled features can dominate the effects of the treatment, leading to an
inconclusive result.
A better way to design this experiment would be to block the subjects into groups of
four patients who are as alike as possible on factors other than the treatment that influence
the recovery time. The four treatments are then randomly assigned to the patients (one per
patient) within a block, and the recovery time measured. The blocking of patients usually
produces a more sensitive comparison of treatments than does a completely randomized
design because the variation in recovery times due to the blocks is eliminated from the
comparison of treatments.
A randomized block design is a paired experiment when two treatments are com-
pared. The usual analysis for a paired experiment is a parametric or non-parametric paired
comparison. In certain experiments, each experimental unit receives each treatment. The
experimental units are “natural” blocks for the analysis.

Example: Comparison of Treatments to Relieve Itching


Ten male volunteers between 20 and 30 years old were used as a study group to compare
seven treatments (5 drugs, a placebo, and no drug) to relieve itching. Each subject was
given a different treatment on seven study days. The time ordering of the treatments was
randomized across days. Except on the no-drug day, the subjects were given the treatment
intravenously, and then itching was induced on their forearms using an effective itch stimulus
called cowage. The subjects recorded the duration of itching, in seconds. The data are given
in the table below. From left to right the drugs are: papaverine, morphine, aminophylline,
pentobarbital, tripelennamine.

Patient Nodrug Placebo Papv Morp Amino Pento Tripel


1 174 263 105 199 141 108 141
2 224 213 103 143 168 341 184
3 260 231 145 113 78 159 125
4 255 291 103 225 164 135 227
5 165 168 144 176 127 239 194
6 237 121 94 144 114 136 155
7 191 137 35 87 96 140 121
8 100 102 133 120 222 134 129
9 115 89 83 100 165 185 79
10 189 433 237 173 168 188 317

The volunteers in the study were treated as blocks in the analysis. At best, the volunteers
might be considered a representative sample of males between the ages of 20 and 30. This
limits the extent of inferences from the experiment. The scientists can not, without sound
medical justification, extrapolate the results to children or to senior citizens.

The Analysis of a Randomized Block Design


Assume that you designed a randomized block experiment with I blocks and J treatments,
where each treatment occurs once in each block. Let yij be the response for the j th treatment
within the ith block. The model for the experiment is

yij = µij + eij ,

where µij is the population mean response for the j th treatment in the ith block and eij is
the deviation of the response from the mean. The population means are assumed to satisfy
the additive model
µij = µ + αi + βj

where µ is a grand mean, αi is the effect for the ith block, and βj is the effect for the j th
treatment. The responses are assumed to be independent across blocks, normally distributed
and with constant variance. The randomized block model does not require the observations
within a block to be independent, but does assume that the correlation between responses
within a block is identical for each pair of treatments. This is a reasonable working
assumption in many analyses. For it to hold, however, the order in which the treatments
are administered to each subject must be randomized.
The model is sometimes written as

Response = Grand Mean + Treatment Effect + Block Effect + Residual.

Given the data, let ȳi· be the ith block sample mean (the average of the responses in the
ith block), ȳ·j be the j th treatment sample mean (the average of the responses on the j th
treatment), and ȳ·· be the average response of all IJ observations in the experiment.
An ANOVA table for the randomized block experiment partitions the Model SS into SS
for Blocks and Treatments.

Source    df               SS                                MS
Blocks    I − 1            J Σi (ȳi· − ȳ··)²                 SS/df
Treats    J − 1            I Σj (ȳ·j − ȳ··)²                 SS/df
Error     (I − 1)(J − 1)   Σij (yij − ȳi· − ȳ·j + ȳ··)²      SS/df
Total     IJ − 1           Σij (yij − ȳ··)²

A primary interest is testing whether the treatment effects are zero: H0 : β1 = · · · = βJ = 0.
The treatment effects are zero if the population mean responses are identical for each
treatment. A formal test of no treatment effects is based on the p-value from the F-statistic
Fobs = MS Treat/MS Error. The p-value is evaluated in the usual way (i.e. as an upper tail
area from an F-distribution with J − 1 and (I − 1)(J − 1) df.) This H0 is rejected when the
treatment averages ȳ·j vary significantly relative to the error variation.
A test for no block effects (H0 : α1 = · · · = αI = 0) is often a secondary interest, because,
if the experiment is designed well, the blocks will be, by construction, noticeably different.
There are no block effects if the block population means are identical. A formal test of no
block effects is based on the p-value from the F-statistic Fobs = MS Blocks/MS Error.
This H0 is rejected when the block averages ȳi· vary significantly relative to the error
variation.
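The sums of squares and F-statistics above can be computed directly from their definitions. The following sketch (in Python rather than Stata) does so for the itching data; rows are the I = 10 patients (blocks) and columns are the J = 7 treatments in the order Nodrug, Placebo, Papv, Morp, Amino, Pento, Tripel:

```python
# Randomized block ANOVA from first principles, using the itching data.
data = [
    [174, 263, 105, 199, 141, 108, 141],
    [224, 213, 103, 143, 168, 341, 184],
    [260, 231, 145, 113,  78, 159, 125],
    [255, 291, 103, 225, 164, 135, 227],
    [165, 168, 144, 176, 127, 239, 194],
    [237, 121,  94, 144, 114, 136, 155],
    [191, 137,  35,  87,  96, 140, 121],
    [100, 102, 133, 120, 222, 134, 129],
    [115,  89,  83, 100, 165, 185,  79],
    [189, 433, 237, 173, 168, 188, 317],
]
I, J = len(data), len(data[0])

grand = sum(sum(row) for row in data) / (I * J)                      # ybar..
block_means = [sum(row) / J for row in data]                         # ybar_i.
treat_means = [sum(data[i][j] for i in range(I)) / I
               for j in range(J)]                                    # ybar_.j

# Sums of squares, exactly as in the ANOVA table above.
ss_block = J * sum((m - grand) ** 2 for m in block_means)
ss_treat = I * sum((m - grand) ** 2 for m in treat_means)
ss_error = sum((data[i][j] - block_means[i] - treat_means[j] + grand) ** 2
               for i in range(I) for j in range(J))
ss_total = sum((data[i][j] - grand) ** 2
               for i in range(I) for j in range(J))

# Mean squares and F-statistics.
ms_block = ss_block / (I - 1)
ms_treat = ss_treat / (J - 1)
ms_error = ss_error / ((I - 1) * (J - 1))
f_block = ms_block / ms_error
f_treat = ms_treat / ms_error
```

The partition SS Total = SS Blocks + SS Treats + SS Error holds exactly, and the computed sums of squares reproduce the Stata `anova` output shown in the next section (person SS ≈ 103279.7, treatment SS ≈ 53012.9).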

A Randomized Block Analysis of the Itching Data


The anova command is used to get the randomized block analysis. You will be shown the
steps in Thursday’s Lab, but I will mention a few important points.

• The data are comprised of three variables: itchtime, person (ranges from 1-10), and
treatment (ranges from 1-7). A data file called itch.txt was created with these three
variables to be read into Stata.

• In the anova table, persons play the role of Blocks in this analysis. Using the com-
mands infile itchtime person treatment using c:/itch.txt and then anova itchtime
person treatment we obtain the following output:
Number of obs = 70 R-squared = 0.4832
Root MSE = 55.6327 Adj R-squared = 0.3397
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 156292.6 15 10419.5067 3.37 0.0005
|
person | 103279.714 9 11475.5238 3.71 0.0011
treatment | 53012.8857 6 8835.48095 2.85 0.0173
|
Residual | 167129.686 54 3094.99418
-----------+----------------------------------------------------
Total | 323422.286 69 4687.2795

• The Model SS is the sum of the person SS and treatment SS; check that they add
up. The F-test in the Model row of the ANOVA table checks whether Treatments or
Persons, or both, are significant, i.e. it provides an overall test of all effects in the model.

• Next comes the SS for Persons and Treatments, and the corresponding F-statistics and
p-values.

• It is possible in JMP-IN (but not directly in Stata) to obtain Tukey multiple compar-
isons of the treatments (but not for persons). These are options in the analysis of the
individual effects.

• In Stata, we obtain the results of testing differences in the treatments (averaged over
persons) using Fisher’s method from the test command. You will cover this in more
detail in Thursday’s lab. To obtain Bonferroni’s adjusted p-values, simply multiply the
p-value for each of Fisher’s tests by the number of comparisons you are making; in the
itching time example this is 7 choose 2 = 21 paired comparisons. We obtain, for example,
the results of Fisher’s method for comparing treatment 1 with treatment 2 (no drug
versus placebo) and treatment 1 with treatment 3 (no drug versus papaverine) with
the following commands:

test _b[treatment[1]] = _b[treatment[2]]


test _b[treatment[1]] = _b[treatment[3]]

We obtain the output:

( 1) treatment[1] - treatment[2] = 0
F( 1, 54) = 0.31
Prob > F = 0.5814
( 1) treatment[1] - treatment[3] = 0
F( 1, 54) = 8.56
Prob > F = 0.0050

We see that using Fisher’s method, treatments 1 and 2 do not significantly differ, but
treatments 1 and 3 do significantly differ at the 5% level. The corresponding Bonferroni
p-values are 0.58(2) > 1 (so capped at 1) and 0.005(2) = 0.01 if only these two comparisons
are made. They are 0.58(21) > 1 and 0.005(21) = 0.105 if all 21 paired comparisons are
to be made, in which case we would conclude that there is no significant difference in mean
itching time for either pair of treatments. The tabulated p-values
resulting from Fisher’s method are:

Treatment 1 2 3 4 5 6
2 0.58
3 0.01 0.00
4 0.09 0.03 0.23
5 0.07 0.02 0.30 0.88
6 0.56 0.26 0.02 0.26 0.20
7 0.34 0.13 0.05 0.44 0.36 0.71

We have the following groupings:

3 5 4 7 6 1 2
-------
--------- Fisher’s
-------

-----------
----------- Bonferroni’s [and Tukey’s obtained in JMP-IN]
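The Bonferroni adjustment used above is just multiplication of each p-value by the number of comparisons, capped at 1. A minimal sketch in Python (not Stata):

```python
def bonferroni(p, m):
    """Bonferroni-adjusted p-value: m * p, capped at 1."""
    return min(1.0, m * p)

# Fisher p-values from the Stata tests above (treatment 1 vs 2 and 1 vs 3):
p12, p13 = 0.5814, 0.0050

adj_two = [bonferroni(p, 2) for p in (p12, p13)]    # only 2 comparisons
adj_all = [bonferroni(p, 21) for p in (p12, p13)]   # all 21 paired comparisons
```

With all 21 comparisons, both adjusted p-values exceed 0.05, matching the conclusion in the text.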

Looking at the means for each treatment averaged over persons, we see that each of the
five drugs appears to have an effect, compared to the placebo and to no drug, which have
similar means. Papaverine appears to be the most effective drug, whereas placebo is the least
effective treatment. A formal F-test shows significant differences among the treatments (p-
value=0.017), and among patients (p-value=0.001). The only significant pairwise difference
in treatments is between papaverine and placebo using Bonferroni (or Tukey) adjustments.
The residual plot, from the commands (1) predict res, res, (2) predict fitted, xb,
(3) twoway (scatter res fitted), shows no gross deficiencies with the model:

[Figure: scatterplot of residuals (vertical axis, roughly −100 to 150) versus the linear
prediction (fitted values, roughly 50 to 300).]

The command bysort treatment: summarize itchtime produces the following mod-
ified output:

itchtime | Obs Mean


-------------------+----------------------
no drug | 10 191.0
placebo | 10 204.8
papaverine | 10 118.2
morphine | 10 148.0
aminophylline | 10 144.3
pentobarbitol | 10 176.5
tripelenamine | 10 167.2
