Cabral 2008 Multiple Comparisons Procedures
Howard J. Cabral
Circulation. 2008;117:698-701
doi: 10.1161/CIRCULATIONAHA.107.700971
Circulation is published by the American Heart Association, 7272 Greenville Avenue, Dallas, TX 75231
Copyright © 2008 American Heart Association, Inc. All rights reserved.
Print ISSN: 0009-7322. Online ISSN: 1524-4539
The online version of this article, along with updated information and services, is located on the World Wide Web at:
http://circ.ahajournals.org/content/117/5/698
From the Department of Biostatistics, Boston University School of Public Health, Boston, Mass.
Correspondence to Howard J. Cabral, PhD, MPH, Department of Biostatistics, Boston University School of Public Health, 715 Albany St, Crosstown
Center, 3rd Floor, Boston, MA 02118. E-mail [email protected]
α-level either (1) in an a priori fashion when specific preselected comparisons are of interest or (2) in a post hoc fashion when the data suggest that specific groups be compared statistically. Commonly, this α-level is set at 0.05 for the experiment or study and is applied to the global test whether or not a priori or post hoc MCPs are conducted.

In determining how to maintain the overall α-level, an investigator must decide whether to use an MCP that reduces the α-level for each MCP test or one that applies the same α-level to each MCP test as was applied in the global test. In mathematical terms, each statistical test performed in addition to the global test in such a situation will actually increase the overall α-level for the study. For example, with k groups and interest in comparing each pair of the k mean values, k(k−1)/2 possible comparisons exist. If a separate α-level is applied here to each test of hypothesis, the actual α-level for the set of comparisons could be as large as α[k(k−1)/2]. Thus, when k=4 and α=0.05 for each test, the overall α-level for the set of tests could be as large as 0.30: ie, 0.05[4(3)/2]. Furthermore, if interest exists in comparisons beyond those between pairs of means, the number of multiple comparisons tests will increase. For example, in a study of 3 groups (A, B, and C), one might have interest in the following comparisons: A versus B, A versus C, B versus C, A and B versus C, A and C versus B, and B and C versus A (6 total, 3 pairwise comparisons). MCPs that maintain the overall α-level for the set of tests are said to control the “experimentwise” error rate; a related type, called “familywise” error rate control procedures, also effectively reduces the α-level for each post hoc test. We will simplify matters by referring to these 2 classes of procedures as “experimentwise,” although technical differences between the 2 must be acknowledged and have been left to the reader to investigate independently.4

In contrast, MCPs that apply a separate α-level for each test are called “comparisonwise” error control procedures. In the case of the study with groups A, B, and C, the use of comparisonwise error control after the global null hypothesis has been rejected would entail the performance of 6 individual tests and application of an α-level of 0.05 for each test. Statistically oriented overviews of MCPs,3 as well as general biostatistical and research texts,5–8 cover in more detail the distinction between these classes of MCPs and their relative strengths and weaknesses. In addition, the use of MCPs in additional analytical frameworks, such as multiway ANOVA and repeated-measures analysis, is beyond the scope of this article.

Experimentwise Error Control Procedures
As noted above, when interest exists in maintaining the overall α-level for the experiment or study, an investigator may choose an MCP that controls the experimentwise error. Many MCPs have been developed to maintain this overall α-level (the term “experiment” stems from the early development of ANOVA in the context of experimental research). Most statistical software packages that offer applications for general linear models analyses such as ANOVA have also implemented MCPs with an array of choices for these tests. Among the more commonly used procedures in this class are the Tukey (John W. Tukey, PhD, unpublished data, 1953), Dunnett,9 Scheffé,10 and Bonferroni (Dunn)11 tests. The options available for MCPs vary by software package. Investigators should consider their scientific question when choosing an MCP and not limit their choice by the availability of MCPs in the statistical software used for their global test.

The Tukey test is appropriate for the comparison of pairs of group means; originally developed for equal sample sizes per group, a modified version accommodates unequal sample sizes. A Dunnett test is applied in situations in which contrasts are limited to comparisons with a control group and not, for example, between the means of active treatment groups. The Scheffé test is applicable for more general comparisons than the comparison of pairs of group means and is more appropriate than the Tukey test if sample sizes per group differ markedly. It is considered to be more conservative than the Tukey test when pairs of group means are being compared. The Scheffé test can be computed for a specific contrast of group means by first determining the critical value of the F statistic for the α-level of interest, with k−1 degrees of freedom (df) for the numerator and N−k degrees of freedom for the denominator when k groups are present and N subjects overall. This F value is then multiplied by k−1 to yield a new critical F value for the multiple comparisons contrast.

The Bonferroni (Dunn) procedure takes into account the number of comparisons to be made and is more conservative (less likely to find a significant difference) than the Tukey or Scheffé test in comparisons of pairs of group means, and it is considered to be the most conservative option among MCPs in most situations. It can be applied to general hypothesis tests in addition to ANOVA. The Bonferroni (Dunn) procedure is implemented by computing a new α-level for each multiple comparisons test based only on the overall α-level for the study and the number of comparisons to be made. In this approach, the new α, α′, is equal to α/C, where C is the number of post hoc tests to be performed. Thus, in the previously discussed example with 3 groups, A, B, and C, and interest in all possible comparisons, the new alpha, α′, would be equal to α/6. Modifications have been made to the Bonferroni procedure with the goal of improving statistical power and include the Holm12 and Hochberg13,14 procedures among the more prominent methods.

Comparisonwise Error Control Procedures
When an investigator has a limited number of comparisons to be made after the rejection of the global null hypothesis, especially if these were prespecified before this test was conducted, it may be of interest to employ an MCP that controls the comparisonwise error. As noted previously, MCPs that control the comparisonwise error rate typically use the same α-level for each test that is applied in the test of the global null hypothesis for the study. Thus, the likelihood of falsely rejecting at least one null hypothesis increases with additional tests. They also, however, provide the benefit to the investigator of being more powerful, ie, more likely to reject the null hypothesis of each test when it should be rejected. Examples of such procedures include the application of Fisher’s LSD (least significant difference) test and linear contrasts.6 In the LSD test, a variant of the standard 2-sample
t test is used in which the within-groups mean square from the global test on the full data set is used as an estimate of the pooled variance, as opposed to using the variance estimates only from the 2 groups being compared. Linear contrasts can be used for more general comparisons, for example, when a set of groups is to be compared with another set of groups or when the study groups represent different dose levels of a treatment or exposure.

Worked Example
We now present an example of an analysis of data from a study of heart size in mice exposed to different conditions of physical exercise15 that will illustrate the use of 1-factor ANOVA with supplementary MCPs. The design used 2 randomly assigned factors, long-term exercise versus no exercise (control), and exercise duration at 7 different ages after baseline. The sample included 30 mice in total, 21 assigned to the 7 different durations of long-term exercise (swimming) and 9 assigned to 3 durations in the control group (8, 12, and 13 weeks of age). The heart weight–to–body weight ratio at the time of euthanasia was examined as the outcome of interest in this study. We will limit our analyses here to the long-term exercise group to illustrate the use of 1-factor ANOVA. In Table 1, we show the data for this sample.

Table 1. Heart Weight/Body Weight Ratio by Exercise-Duration Group

Exercise Duration
10 min   2.5 d   1 wk   2 wk   3 wk   4 wk   4 wk, 1-wk Rest
4.29     4.49    5.38   5.44   5.50   5.54   4.66
4.43     4.54    5.18   5.59   6.47   5.70   4.90
4.17     4.65    4.83   5.64   6.03   5.47   4.91
Values are heart weight/body weight ratios.

For these 21 animals, the global F statistic for differences in mean heart weight–to–body weight ratio in the 1-factor ANOVA was 20.59 (numerator df=6, denominator df=14; R²=0.90) with P<0.0001. Thus, we reject the global null hypothesis of no difference in population mean heart weight–to–body weight ratio between groups using α=0.05 and are next interested in where significant differences can be found between subgroups. Although one should in practice restrict the choice of multiple comparisons to those that are substantively meaningful, we will assume for the purpose of illustration in this case that all pairs of means are of interest and will examine results of contrasts between pairs of means using MCPs among those discussed earlier.

In Table 2, we present the means and SDs for the outcome of interest, heart weight–to–body weight ratio, together with a summary of the results of the application of 3 procedures that control the experimentwise type I error rate (Tukey test, Scheffé test, and the Bonferroni [Dunn] procedure) and one that controls the comparisonwise type I error rate (Fisher LSD test). To aid in the interpretation of differences between groups, means have been arranged so that the highest is presented at the top of the table and the lowest at the bottom of the table (higher values indicating greater cardiac hypertrophy as a result of exercise).

In Table 2, we adopt a system used in the statistical software, SAS,16 to identify statistically significant differences between groups. We apply a level of α=0.05 to denote statistical significance. In this system, means of groups that share a letter are not significantly different, whereas the means of any groups that do not share a letter are significantly different. For example, for the results of the Tukey test, the “3 weeks” group is significantly different from all groups except “4 weeks” and “2 weeks,” because they all share the letter “A” in the display. Likewise, the “10 minutes” group is significantly different from all groups except the “2.5 days” group, because it shares the letter “D” only with that group.

In the table overall, we see a similar set of results for comparisons of all pairs of means among the procedures that control the experimentwise error rate. The null hypotheses of these tests, however, are rejected at the 0.05 level less frequently than those of Fisher’s LSD test, which is expected given that the LSD test controls the comparisonwise error rate and should be more powerful. We note, however, that the actual type I error rate for the set of all pairwise comparisons here could be as large as 0.05[(7×6)/2]=1.05, which as a probability is capped at 1.00 (k=7). In interpreting these results, however, one should keep in mind that the results of hypothesis tests are highly dependent on sample size, and only 3 mice in each of the 7 groups were examined in this sample. Greater distinction between the findings of these procedures may be observed in larger samples.
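The worst-case arithmetic quoted in this article (k(k−1)/2 pairwise comparisons, an overall α-level as large as α times the number of tests, and the Bonferroni per-test level α/C) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical and do not come from the article, and the bound is capped at 1 because it is a probability.

```python
def n_pairwise(k: int) -> int:
    """Number of distinct pairs among k group means: k(k-1)/2."""
    return k * (k - 1) // 2


def worst_case_alpha(alpha: float, n_tests: int) -> float:
    """Upper bound on the overall type I error rate when each of
    n_tests comparisons is tested at level alpha (capped at 1)."""
    return min(1.0, alpha * n_tests)


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test level alpha/C used by the Bonferroni (Dunn) procedure."""
    return alpha / n_tests


# k=4 is the article's introductory example; k=7 is the worked example.
for k in (4, 7):
    c = n_pairwise(k)
    print(
        f"k={k}: {c} pairwise tests, "
        f"worst-case overall alpha={worst_case_alpha(0.05, c):.2f}, "
        f"Bonferroni per-test alpha={bonferroni_alpha(0.05, c):.5f}"
    )
```

For k=4 this gives 6 tests and a bound of 0.30; for k=7 it gives 21 tests, and the uncapped product 0.05×21=1.05 is reported as 1.00.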
Table 2. Means and SDs of Heart Weight/Body Weight Ratio With Results of Multiple Comparisons Procedures (n=21 Mice, 3 per Group)

                                         Multiple Comparisons Procedure
Exercise-Duration Group   Mean (SD)     Tukey   Scheffé   Bonferroni (Dunn)   Fisher LSD
3 wk                      6.00 (0.49)   A       A         A                   A
4 wk                      5.57 (0.12)   A B     A B       A B                 B
2 wk                      5.56 (0.10)   A B     A B       A B                 B
1 wk                      5.13 (0.28)   B C     B C       B C                 C
4 wk, 1 wk of rest        4.82 (0.14)   B C     B C D     C D                 C D
2.5 d                     4.56 (0.08)   C D     C D       C D                 D E
10 min                    4.30 (0.13)   D       D         D                   E
Means are sorted from largest to smallest, and means of groups that do not share a letter are significantly different.
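As a rough check on Table 2, the group means, SDs, global F statistic, and Fisher LSD comparisons can be recomputed from the Table 1 data. The sketch below uses only the Python standard library and is not the SAS analysis reported in the article. The variable names are illustrative, and the critical value t(0.975; 14 df)≈2.145 is an assumption hard-coded from standard t tables rather than a figure given in the article.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

# Table 1 data: heart weight/body weight ratios, 3 mice per group.
groups = {
    "10 min": [4.29, 4.43, 4.17],
    "2.5 d": [4.49, 4.54, 4.65],
    "1 wk": [5.38, 5.18, 4.83],
    "2 wk": [5.44, 5.59, 5.64],
    "3 wk": [5.50, 6.47, 6.03],
    "4 wk": [5.54, 5.70, 5.47],
    "4 wk, 1-wk rest": [4.66, 4.90, 4.91],
}

k = len(groups)                                   # number of groups
n_total = sum(len(v) for v in groups.values())    # N subjects overall
all_values = [x for v in groups.values() for x in v]
grand_mean = mean(all_values)
means = {g: mean(v) for g, v in groups.items()}
sds = {g: stdev(v) for g, v in groups.items()}    # sample SDs

# One-factor ANOVA sums of squares and the global F statistic.
ss_between = sum(len(v) * (means[g] - grand_mean) ** 2 for g, v in groups.items())
ss_within = sum((x - means[g]) ** 2 for g, v in groups.items() for x in v)
df_between, df_within = k - 1, n_total - k
ms_within = ss_within / df_within                 # within-groups mean square
f_stat = (ss_between / df_between) / ms_within
r_squared = ss_between / (ss_between + ss_within)

# Fisher LSD: 2-sample t test using ms_within as the pooled variance.
T_CRIT = 2.145  # assumed two-sided 5% critical value of t with 14 df
not_significant = []
for a, b in combinations(groups, 2):
    se = sqrt(ms_within * (1 / len(groups[a]) + 1 / len(groups[b])))
    if abs(means[a] - means[b]) <= T_CRIT * se:
        not_significant.append((a, b))

print(f"F({df_between}, {df_within}) = {f_stat:.2f}, R^2 = {r_squared:.2f}")
for g in sorted(means, key=means.get, reverse=True):
    print(f"{g:<16} {means[g]:.2f} ({sds[g]:.2f})")
print("Pairs not significantly different by LSD:", not_significant)
```

Under these assumptions the sketch reproduces F≈20.59 on 6 and 14 df with R²≈0.90. It leaves 4 pairs not significantly different by the LSD criterion (4 wk and 2 wk; 1 wk and 4 wk with 1 wk of rest; 4 wk with 1 wk of rest and 2.5 d; 2.5 d and 10 min), which agrees with the letters shared in the Fisher LSD column.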