
Cabral 2008 Multiple Comparisons Procedures

This document discusses multiple comparisons procedures (MCPs) which are used when making multiple comparisons between groups after rejecting the global null hypothesis in an analysis of variance (ANOVA). It describes how performing multiple tests can increase the type I error rate and the need to control this. The two main types of MCPs are those that reduce the significance level for each individual test, like the Bonferroni correction, and those that maintain the same significance level, like the Tukey and Dunnett tests. An example is provided to illustrate using 1-factor ANOVA with MCPs to analyze differences between groups while controlling for multiple comparisons.

Uploaded by

Carlos Andrade

Multiple Comparisons Procedures

Howard J. Cabral

Circulation. 2008;117:698-701
doi: 10.1161/CIRCULATIONAHA.107.700971
Circulation is published by the American Heart Association, 7272 Greenville Avenue, Dallas, TX 75231
Copyright © 2008 American Heart Association, Inc. All rights reserved.
Print ISSN: 0009-7322. Online ISSN: 1524-4539

The online version of this article, along with updated information and services, is located on the
World Wide Web at:
http://circ.ahajournals.org/content/117/5/698

Permissions: Requests for permissions to reproduce figures, tables, or portions of articles originally published
in Circulation can be obtained via RightsLink, a service of the Copyright Clearance Center, not the Editorial
Office. Once the online version of the published article for which permission is being requested is located,
click Request Permissions in the middle column of the Web page under Services. Further information about
this process is available in the Permissions and Rights Question and Answer document.

Reprints: Information about reprints can be found online at:


http://www.lww.com/reprints

Subscriptions: Information about subscribing to Circulation is online at:


http://circ.ahajournals.org//subscriptions/

Downloaded from http://circ.ahajournals.org/ at RHODE ISLAND HOSP on December 17, 2012


Statistical Primer for Cardiovascular Research

Multiple Comparisons Procedures


Howard J. Cabral, PhD, MPH

In biomedical research, a common question posed by investigators is whether or not an outcome of interest differs significantly between multiple independent groups of subjects in the study sample. For example, in a randomized clinical trial focusing on differences in a parameter of cardiovascular health such as systolic blood pressure or heart rate that is measured on a continuum, one might make the comparison of those who received a placebo, those who received a particular active drug, and those who received a different active drug. Another example of a multiple group comparison might arise in an observational study when comparisons between categories of race or ethnicity are of interest.

The statistical problem that arises from the use of multiple comparisons tests is that any subsequent tests of hypotheses will be performed on the outcome with the same data on which the global test was performed. This can result in an uncontrolled type I error rate (the rate of rejecting the null hypothesis when it should not be rejected). The problem can be encountered in analyses of multiple treatment or exposure groups, multiple end points, or multiple interim analyses, and it has been addressed from a broad perspective [1]. The present report, however, will focus on the statistical analysis strategies used when the global or omnibus test of differences on a continuous outcome across the multiple groups has been performed and statistical tests contrasting subgroups are then conducted. It serves as a follow-up to an earlier article [2] in the series of statistical tutorials in Circulation that addressed the use of the ANOVA in performing the global test of hypothesis for a continuous outcome. These statistical tests are often referred to as multiple comparisons procedures (MCPs). We will first present a brief review of the statistical foundations of 1-factor ANOVA and then will describe the 2 main types of MCPs with specific reference to the more commonly used MCPs. Finally, we will show a worked example of an analysis of data from a study of heart size in animals exposed to different conditions of physical exercise that will illustrate the use of 1-factor ANOVA with supplementary MCPs.

Review of 1-Factor ANOVA
The 1-factor ANOVA is used to compare mean values for a continuous outcome of interest across 2 or more independent groups, ie, groups in which subjects belong to only 1 group. In this analysis, the outcome, or dependent variable, is compared between the categories of the grouping, or independent, variable, which is referred to as the “factor.” The categories of the factor are referred to as “levels” (whether or not they are ordered in some fashion). We will henceforth refer to these categories as “groups.” In the population to which statistical inference is to be made, the outcome is assumed to be measured on a continuum and to follow a gaussian distribution for each group, with statistically independent values across individual subjects. Furthermore, the variances of the outcome are assumed to be equal across the groups. For k groups, the null hypothesis (H0) is that the population means (μ) of the outcome are the same for all of the groups. This is commonly written in symbolic form as:

H0: μ1 = μ2 = . . . = μk

versus the alternative hypothesis, H1, that the k population means are not all equal. This null hypothesis is referred to as the “global” hypothesis and its statistical testing as an “omnibus” test. Note that if the null hypothesis is rejected in the face of sufficient sample data, the question of where particular differences were present between the mean values is not addressed.

In situations such as tests of treatment efficacy in a phase-III clinical trial in which a placebo control is used, for example, it has been argued that rejection of the global null hypothesis is required before one proceeds with additional analyses to identify specific differences between subgroups of the factor of interest [3]. In contrast, it can also be argued that with a limited number of a priori (ie, preplanned) multiple comparisons (for example, in a confirmatory study), one should not have to reject the global null hypothesis to perform the preplanned MCPs [4]. We will adopt the former strategy and thus will next discuss the use of MCPs to answer the question of the significance of differences between specific subgroups assuming that the global test has been rejected at a given α, or “significance,” level.

Multiple Comparisons Procedures
As noted above, the performance of multiple hypothesis tests subsequent to the global test can result in an uncontrolled type I error rate (the rate of rejecting the null hypothesis when it should not be rejected). MCPs are applied when the global null hypothesis of the study has been rejected at a given

From the Department of Biostatistics, Boston University School of Public Health, Boston, Mass.
Correspondence to Howard J. Cabral, PhD, MPH, Department of Biostatistics, Boston University School of Public Health, 715 Albany St, Crosstown
Center, 3rd Floor, Boston, MA 02118. E-mail [email protected]
(Circulation. 2008;117:698-701.)
© 2008 American Heart Association, Inc.
Circulation is available at http://circ.ahajournals.org DOI: 10.1161/CIRCULATIONAHA.107.700971


α-level either (1) in an a priori fashion when specific preselected comparisons are of interest or (2) in a post hoc fashion when the data suggest that specific groups be compared statistically. Commonly, this α-level is set at 0.05 for the experiment or study and is applied to the global test whether or not a priori or post hoc MCPs are conducted.

In determining how to maintain the overall α-level, an investigator must choose between an MCP that reduces the α-level for each MCP test and one that applies to each MCP test the same α-level as was applied in the global test. In mathematical terms, each statistical test performed in addition to the global test in such a situation will actually increase the overall α-level for the study. For example, with k groups and interest in comparing each pair of the k mean values, k(k − 1)/2 possible comparisons exist. If a separate α-level is applied here to each test of hypothesis, the actual α-level for the set of comparisons could be as large as α[k(k − 1)/2]. Thus, when k = 4 and α = 0.05 for each test, the overall α-level for the set of tests could be as large as 0.30: ie, 0.05[4(3)/2]. Furthermore, if interest exists in comparisons beyond those between pairs of means, the number of multiple comparisons tests will increase. For example, in a study of 3 groups (A, B, and C), one might have interest in the following comparisons: A versus B, A versus C, B versus C, A and B versus C, A and C versus B, and B and C versus A (6 total, 3 pairwise comparisons). MCPs that maintain the overall α-level for the set of tests are said to control the “experimentwise” error rate; a related type, called “familywise” error rate control procedures, also effectively reduce the α-level for each post hoc test. We will simplify matters by referring to these 2 classes of procedures as “experimentwise,” although technical differences between the 2 must be acknowledged and have been left to the reader to investigate independently [4].

In contrast, MCPs that apply a separate α-level for each test are called “comparisonwise” error control procedures. In the case of the study with groups A, B, and C, the use of comparisonwise error control after the global null hypothesis has been rejected would entail the performance of 6 individual tests and application of an α-level of 0.05 for each test. Statistically oriented overviews of MCPs [3], as well as general biostatistical and research texts [5-8], cover in more detail the distinction between these classes of MCPs and their relative strengths and weaknesses. In addition, the use of MCPs in additional analytical frameworks, such as multiway ANOVA and repeated-measures analysis, is beyond the scope of this article.

Experimentwise Error Control Procedures
As noted above, when interest exists in maintaining the overall α-level for the experiment or study, an investigator may choose an MCP that controls the experimentwise error. Many MCPs have been developed to maintain this overall α-level (the term “experiment” stems from the early development of ANOVA in the context of experimental research). Most statistical software packages that offer applications for general linear models analyses such as ANOVA have also implemented MCPs with an array of choices for these tests. Among the more commonly used procedures in this class are the Tukey (John W. Tukey, PhD, unpublished data, 1953), Dunnett [9], Scheffé [10], and Bonferroni (Dunn) [11] tests. The options available for MCPs vary by software package. Investigators should consider their scientific question when choosing an MCP and not limit their choice by the availability of MCPs in the statistical software used for their global test.

The Tukey test is appropriate for the comparison of pairs of group means; originally developed for equal sample sizes per group, a modified version accommodates unequal sample sizes. A Dunnett test is applied in situations in which contrasts are limited to comparisons with a control group and not, for example, between the means of active treatment groups. The Scheffé test is applicable for more general comparisons than the comparison of pairs of group means and is more appropriate than the Tukey test if sample sizes per group differ markedly. It is considered to be more conservative than the Tukey test when pairs of group means are being compared. The Scheffé test can be computed for a specific contrast of group means by first determining the critical value of the F statistic for the α-level of interest, with k − 1 degrees of freedom (df) for the numerator and N − k degrees of freedom for the denominator when k groups are present and N subjects overall. This F value is then multiplied by k − 1 to yield a new critical F value for the multiple comparisons contrast.

The Bonferroni (Dunn) procedure takes into account the number of comparisons to be made and is more conservative (less likely to find a significant difference) than the Tukey or Scheffé test in comparisons of pairs of group means, and it is considered to be the most conservative option among MCPs in most situations. It can be applied to general hypothesis tests in addition to ANOVA. The Bonferroni (Dunn) procedure is implemented by computing a new α-level for each multiple comparisons test based only on the overall α-level for the study and the number of comparisons to be made. In this approach, the new α, α′, is equal to α/C, where C is the number of post hoc tests to be performed. Thus, in the previously discussed example with 3 groups, A, B, and C, and interest in all possible comparisons, the new alpha, α′, would be equal to α/6. Modifications have been made to the Bonferroni procedure with the goal of improving statistical power and include the Holm [12] and Hochberg [13,14] procedures among the more prominent methods.

Comparisonwise Error Control Procedures
When an investigator has a limited number of comparisons to be made after the rejection of the global null hypothesis, especially if these were prespecified before this test was conducted, it may be of interest to employ an MCP that controls the comparisonwise error. As noted previously, MCPs that control the comparisonwise error rate typically use the same α-level for each test that is applied in the test of the global null hypothesis for the study. Thus, the likelihood of falsely rejecting at least one null hypothesis increases with additional tests. They also, however, provide the benefit to the investigator of being more powerful, ie, more likely to reject the null hypothesis of each test when it should be rejected. Examples of such procedures include the application of Fisher’s LSD (least significant difference) test and linear contrasts [6]. In the LSD test, a variant of the standard 2-sample
700 Circulation February 5, 2008

t test is used in which the within-groups (error) mean square from the global test on the full data set is used as an estimate of the pooled variance, as opposed to using the variance estimates only from the 2 groups being compared. Linear contrasts can be used for more general comparisons, for example, when a set of groups is to be compared with another set of groups or when the study groups represent different dose levels of a treatment or exposure.

Worked Example
We now present an example of an analysis of data from a study of heart size in mice exposed to different conditions of physical exercise [15] that will illustrate the use of 1-factor ANOVA with supplementary MCPs. The design used 2 randomly assigned factors, long-term exercise versus no exercise (control), and exercise duration at 7 different ages after baseline. The sample included 30 mice in total, 21 assigned to the 7 different durations of long-term exercise (swimming) and 9 assigned to 3 durations in the control group (8, 12, and 13 weeks of age). The heart weight-to-body weight ratio at the time of euthanasia was examined as the outcome of interest in this study. We will limit our analyses here to the long-term exercise group to illustrate the use of 1-factor ANOVA. In Table 1, we show the data for this sample.

Table 1. Heart Weight/Body Weight Ratio by Exercise-Duration Group

Exercise Duration
10 min | 2.5 d | 1 wk | 2 wk | 3 wk | 4 wk | 4 wk, 1-wk Rest
4.29 | 4.49 | 5.38 | 5.44 | 5.50 | 5.54 | 4.66
4.43 | 4.54 | 5.18 | 5.59 | 6.47 | 5.70 | 4.90
4.17 | 4.65 | 4.83 | 5.64 | 6.03 | 5.47 | 4.91
Values are heart weight/body weight ratios.

For these 21 animals, the global F statistic for differences in mean heart weight-to-body weight ratio in the 1-factor ANOVA was 20.59 (numerator df = 6, denominator df = 14; R² = 0.90) with P < 0.0001. Thus, we reject the global null hypothesis of no difference in population mean heart weight-to-body weight ratio between groups using α = 0.05 and are next interested in where significant differences can be found between subgroups. Although one should in practice restrict the choice of multiple comparisons to those that are substantively meaningful, we will assume for the purpose of illustration in this case that all pairs of means are of interest and will examine results of contrasts between pairs of means using MCPs among those discussed earlier.

In Table 2, we present the means and SDs for the outcome of interest, heart weight-to-body weight ratio, together with a summary of the results of the application of 3 procedures that control the experimentwise type I error rate (Tukey test, Scheffé test, and the Bonferroni [Dunn] procedure) and one that controls the comparisonwise type I error rate (Fisher LSD test). To aid in the interpretation of differences between groups, means have been arranged so that the highest is presented at the top of the table and the lowest at the bottom of the table (higher values indicating greater cardiac hypertrophy as a result of exercise).

In Table 2, we adopt a system used in the statistical software, SAS [16], to identify statistically significant differences between groups. We apply a level of α = 0.05 to denote statistical significance. In this system, means of groups that share a letter are not significantly different, whereas the means of any groups that do not share a letter are significantly different. For example, for the results of the Tukey test, the “3 weeks” group is significantly different from all groups except “4 weeks” and “2 weeks,” because they all share the letter “A” in the display. Likewise, the “10 minutes” group is significantly different from all groups except the “2.5 days” group, because it shares the letter “D” only with that group.

In the table overall, we see a similar set of results for comparisons of all pairs of means among the procedures that control the experimentwise error rate. The null hypotheses of these tests, however, are rejected at the 0.05 level less frequently than those of Fisher’s LSD test, which is expected given that the LSD test controls the comparisonwise error rate and should be more powerful. We note, however, that the actual type I error rate for the set of all pairwise comparisons here could be as large as min{0.05[(7 × 6)/2], 1.00} = 1.00 (k = 7). In interpreting these results, however, one should keep in mind that the results of hypothesis tests are highly dependent on sample size, and only 3 mice in each of the 7 groups were examined in this sample. Greater distinction between the findings of these procedures may be observed in larger samples.

Table 2. Means and SDs of Heart Weight/Body Weight Ratio With Results of Multiple Comparisons Procedures (n = 21 Mice, 3 per Group)

Exercise-Duration Group | Mean (SD) | Tukey | Scheffé | Bonferroni (Dunn) | Fisher LSD
3 wk | 6.00 (0.49) | A | A | A | A
4 wk | 5.57 (0.12) | A B | A B | A B | B
2 wk | 5.56 (0.10) | A B | A B | A B | B
1 wk | 5.13 (0.28) | B C | B C | B C | C
4 wk, 1 wk of rest | 4.82 (0.14) | B C | B C D | C D | C D
2.5 d | 4.56 (0.08) | C D | C D | C D | D E
10 min | 4.30 (0.13) | D | D | D | E
Means are sorted from largest to smallest, and means of groups that do not share a letter are significantly different.
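
The letter display in Table 2 can be read mechanically: two groups differ significantly exactly when their letter sets have no letter in common. The sketch below encodes the Tukey column of Table 2 and applies that rule; the function names `significantly_different` and `not_different_from` are illustrative and are not part of SAS or any other package.

```python
# Tukey column of Table 2: the letters assigned to each exercise-duration group.
tukey_letters = {
    "3 wk": "A",
    "4 wk": "AB",
    "2 wk": "AB",
    "1 wk": "BC",
    "4 wk, 1 wk of rest": "BC",
    "2.5 d": "CD",
    "10 min": "D",
}


def significantly_different(letters, group1, group2):
    """Two groups differ significantly when they share no letter."""
    return not (set(letters[group1]) & set(letters[group2]))


def not_different_from(letters, group):
    """List the other groups that share at least one letter with `group`."""
    return [
        other
        for other in letters
        if other != group and not significantly_different(letters, group, other)
    ]


print(not_different_from(tukey_letters, "3 wk"))    # ['4 wk', '2 wk']
print(not_different_from(tukey_letters, "10 min"))  # ['2.5 d']
```

The two printed lists match the reading given in the text: under the Tukey test, the "3 weeks" group differs from every group except "4 weeks" and "2 weeks," and the "10 minutes" group differs from every group except "2.5 days."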


Summary
We have presented information that should be helpful to investigators in biomedical research who have an interest in addressing study questions in which multiple comparisons of study groups are appropriate. We have reviewed the statistical framework for 1-factor ANOVA and have discussed how multiple comparisons after the rejection of the global null hypothesis are conceptually linked to the global test. We have described selected, commonly applied MCPs and have utilized them in the analysis of data from an animal study. Several of the procedures discussed here can be applied in common statistical software in the context of ANCOVA models when variables are present that need to be controlled statistically to obtain valid inferences about differences between treatment or exposure groups (eg, Bonferroni, Scheffé, Tukey, and Dunnett tests in SAS). We have not covered this situation and leave this to the reader to investigate within their statistical software of choice. Researchers without extensive statistical analysis experience should be able to use this information to work with a professional statistician to better design and analyze data from studies in which multiple comparisons are of interest.

Disclosures
None.

References
1. Koch GG, Gansky SA. Statistical considerations for multiplicity in confirmatory protocols. Drug Inf J. 1996;30:523-533.
2. Larson MG. Analysis of variance. Circulation. 2008;117:115-121.
3. D’Agostino RB, Heeren TC. Multiple comparisons in over-the-counter drug clinical trials with both positive and placebo controls. Stat Med. 1991;10:1-6.
4. D’Agostino RB, Massaro J, Kwan H, Cabral H. Strategies for dealing with multiple treatment comparisons in confirmatory clinical trials. Drug Inf J. 1993;27:625-641.
5. D’Agostino RB, Sullivan L, Beiser A. Introductory Applied Biostatistics. Belmont, Calif: Duxbury-Brooks/Cole; 2004.
6. Rosner B. Fundamentals of Biostatistics. 6th ed. Boston, Mass: Duxbury Press; 2005.
7. Kleinbaum DG, Kupper LL, Mueller KE, Nizam A. Applied Regression Analysis and Multivariable Methods. 5th ed. Boston, Mass: Duxbury Press; 1997.
8. Toothaker L. Multiple Comparisons for Researchers. New York, NY: Sage Publications; 1991.
9. Dunnett CW. New tables for multiple comparisons with a control. Biometrics. 1964;20:482-491.
10. Scheffé H. The Analysis of Variance. New York, NY: Wiley; 1959.
11. Dunn OJ. Multiple comparisons among means. J Am Stat Assoc. 1961;56:52-64.
12. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6:65-70.
13. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988;75:800-803.
14. Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Stat Med. 1990;9:811-818.
15. Available at: http://cardiogenomics.med.harvard.edu/groups/proj1/pages/swim_home.html. Accessed January 15, 2008.
16. Statistical Analysis System (SAS), Version 9.1. Cary, NC: SAS Institute; 2007.

KEY WORDS: analysis of variance • models, statistical • statistics
