Special Article
What Your Statistician Never Told You about P-Values
Jeffrey Blume, Ph.D., and Jeffrey F. Peipert, M.D., MPH
We provide a nontechnical overview of what P-values are and what they are not. To determine how P-values ought to be used, reported, and interpreted, we must first clarify the often-overlooked differences between, and proper usages of, significance testing and hypothesis testing. Several clinical examples are given to illustrate these differences, and failure to distinguish between them is seen to be problematic. Common misinterpretations of P-values are explained. Confidence intervals provide essential information where P-values are deficient in doing so, and they therefore play an essential role in reporting and interpreting study results.
Has this ever happened to you? After completing an interesting clinical study, you meet with a statistician to review the data analysis. Among other things, he tells you that the P-value for your primary comparison is very small, say 0.004. “Great,” you say with some confidence, “so the likelihood that my findings are due to chance is 0.004.” “Not really,” says the statistician without even looking up from his printouts. “Well, you mean that there is a 99.6% chance that there is a real difference between the study groups, or a 0.4% chance that the null hypothesis is true. Right?” “Nope,” says the statistician, again without looking up. So you try a different tack: “In the recent Jones et al study they reported a P-value of only 0.03 for the same comparison, so at least I have found stronger evidence of the effect. Correct?” “Not exactly,” he says, “that study was much larger, so they probably had more evidence of an effect than you do, even though your P-value is smaller. I really can’t be sure without looking at their results.” “But their P-value is larger,” you exclaim. “Yes, I know,” he says after a big sigh, “but P-values are weird like that. You can’t compare their magnitude unless the sample sizes are equal.” You leave the meeting feeling confused, frustrated, and disappointed.

Exchanges like this are unfortunately commonplace. The P-value is the most commonly used and, perhaps, most misunderstood statistical concept in clinical research. To the researcher, it is critical; it measures the strength of evidence and provides a means of communicating study results quickly and objectively. To the statistician, it is something of a sore spot; its proper interpretation and use are widely misunderstood, and its adequacy as such a measure is still subject to debate.1–4 The resulting tension between researchers and statisticians surrounding the proper use and interpretation of P-values is understandable, but it is also avoidable. What is often left unsaid in introductory textbooks and courses on statistics is that the discipline of statistics is itself conflicted about the proper use and interpretation of P-values.4–7 This debate has dragged on since the 1930s without a resolution in sight, despite numerous attempts to resolve the conflict.3,4,8 As a result, many statisticians simply shun or ignore the issue, sensitive to the fact that the discipline has not yet come to agreement. In fact, several articles have advocated replacing P-values with effect sizes and confidence intervals (CIs) altogether.9–13

We believe a nontechnical roadmap of what P-values are and what they are not will help clinicians, researchers, scientists, and statisticians alike. Guidelines will enable authors and readers to interpret, evaluate, and communicate these values in everyday research, while avoiding common pitfalls. To help in the process, we illustrate with two examples from the clinical research setting.
From the Division of Research in Women’s Health and the George Anderson Outcomes Measurement Unit, Department of Obstetrics and Gynecology, Women and Infants Hospital (both authors), and the Center for Statistical Sciences, Brown University, Providence, Rhode Island (Dr. Blume).

Corresponding author: Jeffrey D. Blume, Ph.D., Box G-H, Center for Statistical Sciences, Brown University, Providence, RI 02912.

Supported in part by National Institutes of Health grant K24 HD01298-03, a Midcareer Investigator Award in Women’s Health Research from the National Institute of Child Health and Human Development.

Submitted July 1, 2003. Accepted for publication August 1, 2003.
Clinical Illustrations

Our first illustration is a small clinical trial evaluating a new analgesic for chronic pelvic pain. Sixty women are enrolled in the placebo arm and 60 in the experimental therapy arm. Thirty percent of women in the treatment arm experience marked improvement of pain, compared with only 15% of women in the placebo arm. Thus we have an estimated relative risk of 2.0 for the improvement of pain. The investigators report that the P-value under the null hypothesis that the regimens are the same is 0.079 (Fisher’s exact test). They interpret this as marginal evidence that the analgesic works better than placebo.

The second example concerns a larger study evaluating a new technique for intrauterine insemination. Two thousand women are recruited and randomized to either the new technique or the standard one. Implantation rates are 55% and 50%, respectively. The investigators report that, at the 5% significance level, they rejected the null hypothesis that the two techniques have equal implantation rates. Based on these data alone, they recommend that the new technique be adopted.
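The figures quoted in these two illustrations can be checked with standard software. The sketch below (Python with scipy, not the authors’ code) tabulates each trial as a 2 × 2 table and applies Fisher’s exact test; the 1,000-per-arm split in the second trial is an assumption, since only the total enrollment of 2,000 is stated. It should return a relative risk of 2.0 with P ≈ 0.079 for the first trial and a relative risk of 1.1 with a P-value close to the reported 0.028 for the second.

```python
# A minimal sketch (not the authors' code) of the two illustrative analyses.
# Assumption: the second trial randomized 1,000 women to each arm.
from scipy.stats import fisher_exact

# Illustration 1: 60 women per arm; 30% vs 15% with marked improvement.
table1 = [[18, 42],   # treatment: improved, not improved
          [9, 51]]    # placebo:   improved, not improved
_, p1 = fisher_exact(table1)                  # two-sided by default
rr1 = (18 / 60) / (9 / 60)
print(f"Trial 1: RR = {rr1:.1f}, Fisher exact P = {p1:.3f}")   # RR = 2.0, P ≈ 0.079

# Illustration 2: implantation rates of 55% vs 50% (assumed 1,000 per arm).
table2 = [[550, 450],
          [500, 500]]
_, p2 = fisher_exact(table2)
rr2 = (550 / 1000) / (500 / 1000)
print(f"Trial 2: RR = {rr2:.2f}, Fisher exact P = {p2:.3f}")   # RR = 1.10, P ≈ 0.03
```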
Notice the difference in reporting results in the two illustrations. Both are correct, but the goals are different. In the first example the investigators are trying to communicate the strength of evidence in the data by quoting the P-value, but in the second example the investigators are reporting how they (and others) should behave. Also, investigators in the second example did not report the strength of evidence in the data (although its magnitude is implied, because rejection at the 5% level implies that the probability is below 0.05; indeed, the P-value is 0.028). The first is an example of significance testing and the second of hypothesis testing. These two procedures are commonly merged into one grand statistical procedure, causing confusion and misinterpretation of results.

Significance Testing, P-Values, and Statistical Evidence

The use of significance tests to interpret the data as statistical evidence was advocated by R. A. Fisher beginning in the early 1920s.14,15 A significance test is conducted by specifying a null hypothesis, calculating the P-value under that hypothesis, and reporting the numerical P-value. In clinical research, the null hypothesis would likely state the status quo (there is no difference between treatment groups, the intervention is ineffective, etc.). The P-value is a tail area probability based on the observed effect; it is calculated as the probability of an effect as large as or larger than the observed effect (more extreme in the tails of the distribution), assuming that the null hypothesis is true.
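To make the tail-area definition concrete, the short sketch below (our illustration, with an invented z statistic of 1.76 that does not come from either trial) computes a two-sided P-value as the probability, under the null hypothesis, of a standardized effect at least as extreme as the one observed.

```python
# A generic sketch of the tail-area calculation; the z statistic of 1.76 is invented.
from scipy.stats import norm

z_observed = 1.76                            # assumed standardized observed effect
p_two_sided = 2 * norm.sf(abs(z_observed))   # area in both tails beyond |z_observed| under the null
print(f"two-sided P-value = {p_two_sided:.3f}")   # ≈ 0.078
```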
Fisher claimed that the P-value was a measure of the strength of evidence against the null hypothesis, and he used it to assess the degree of concordance between the data and the null hypothesis.16 Accordingly, smaller P-values indicated stronger evidence against the null hypothesis; the smaller the P-value, the more inconsistent the data are with the null hypothesis.17 There was just one caveat: Fisher did not interpret the P-value as a probability.4,8,16 Rather, he claimed it had no particular (probabilistic) interpretation per se, and was only an index of evidence in the same sense that, say, a foot is an index of length or a pound is an index of weight. So when reporting results, one should simply note the magnitude of the P-value. No particular interpretation is necessary (just as we do not “interpret” a measurement of someone’s height).

The first clinical illustration is an example of reporting findings by way of a significance test. The researchers reported the P-value to communicate the strength of the evidence in the data and characterized the evidence as marginal. Additional recommendations for modifying clinical practice based on these data would be discussed later and would have to be weighed in light of possible side effects and costs.

Once reported, researchers are supposed to ruminate on the magnitude of the P-value and consider the scientific context. Fisher suggested that the 5% level (p <0.05) could be used as a scientific benchmark for concluding that fairly strong evidence exists against the null hypothesis.16 However, he never intended this level to be an absolute threshold. The strength of evidence does not jump from one category to the next (weak to moderate to strong, etc.). Rather, it is on a continuum that gradually and smoothly increases in the same way that measurements of height and weight do. Whereas a P-value of 0.049 represents fairly strong evidence, so too does one of 0.055 or 0.07, albeit to a lesser degree. Fisher also maintained that scientific context was critical. A P-value of 0.05, for example, might lead to the recommendation that additional experiments be performed in one circumstance, whereas that same value could be taken as ironclad evidence of an effect in another situation.

Fisher referred to the process of drawing conclusions from data as inductive inference.18 His significance tests were to be used to measure and summarize the strength of statistical evidence in the data against a particular hypothesis. Unfortunately, significance tests are seldom distinguished from hypothesis tests, leading to confusion and inaccuracies in reporting and interpreting P-values.

Hypothesis Testing, Types I and II Errors, and Making Decisions

The hypothesis test is an altogether different animal from the significance test. In 1933, J. Neyman and E. Pearson introduced hypothesis testing as an alternative to significance testing.19 Their idea was to take a mathematical rule for choosing between two hypotheses, say a null and an alternative hypothesis, and determine how often this rule would lead researchers astray. They were then able to find the one rule that was optimal in the sense that it led researchers astray least often.

A hypothesis test is formulated in terms of two hypotheses and two error rates. Unlike significance testing, an alternative hypothesis must be specified. If the null hypothesis is true but the test tells us to choose the alternative, we have made an error of the first kind (type I); if the alternative is true and the test tells us to choose the null hypothesis, we have made an error of the second kind (type II). The type I and type II error rates are the probabilities of making these errors. Neyman and Pearson were able to identify the one decision rule that minimized the type II error rate when all of the rules under consideration had the same type I error rate.19 Therefore researchers need only specify the type I error rate, after which they can construct the Neyman–Pearson rejection rule that tells them which hypothesis to reject, with the knowledge that this rule naturally keeps the type II error rate as small as possible. Note that the type I error rate is also represented by the tail area probability under the null hypothesis.
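These error rates have a simple operational meaning that a small simulation can make concrete. The sketch below is our construction, not anything from the paper: it assumes normally distributed data with a known standard deviation of 1, a true mean of 0 under the null and 0.5 under the alternative, and 25 observations per study, and it applies the fixed rule “reject when |z| > 1.96” repeatedly. Under the null it rejects about 5% of the time (type I error), and under this particular alternative it fails to reject about 30% of the time (type II error).

```python
# A simulation sketch (assumed normal data, known SD of 1, n = 25 per study) of the
# Neyman–Pearson rule "reject the null when |z| > 1.96".
import numpy as np

rng = np.random.default_rng(0)
n, sims, crit = 25, 100_000, 1.96

def rejects(true_mean):
    # z statistic for testing H0: mean = 0 when the data really have the given mean
    data = rng.normal(true_mean, 1.0, size=(sims, n))
    z = data.mean(axis=1) * np.sqrt(n)
    return np.abs(z) > crit

type_I = rejects(0.0).mean()        # null true, but we reject anyway
type_II = 1 - rejects(0.5).mean()   # alternative true, but we fail to reject
print(f"type I ≈ {type_I:.3f}, type II ≈ {type_II:.3f}")   # ≈ 0.05 and ≈ 0.30
```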
A hypothesis test is classically conducted in two stages. The first stage occurs before data are collected and the second stage occurs after data are collected. In the first stage, null and alternative hypotheses are specified, the type I error rate is chosen, and the Neyman–Pearson rejection rule is determined. After collecting data, one checks to see whether the null hypothesis is rejected. One then reports only whether or not the null hypothesis is rejected and what type I error level was used (important: the magnitude of the P-value is irrelevant here; all that matters is the particular action taken). Because this procedure is concerned only with taking appropriate action, it is a model of inductive behavior, not inductive inference.

The second clinical research illustration is an example of how hypothesis tests are reported. Only rejection of the null hypothesis and the 5% significance level (type I error rate) are reported. The concept of statistical evidence is absent here. For the error rates to have their intended meanings, investigators (and all clinicians) should accept the null hypothesis as false and modify their behavior accordingly. There is no ruminating over the results here! Of course, this does not happen, because the reporting of scientific results is not about making decisions but about collecting, summarizing, and reevaluating evidence. However, the ability to design experiments that controlled certain types of errors was so attractive that researchers and scientists found another way to take advantage of this framework. What they did is use the hypothesis-testing framework to design a study, and the significance-testing framework to report and interpret study results.

The problem with this approach is that the interpretation of error rates is meaningless because researchers no longer act in direct accordance with the hypothesis test. If researchers and scientists always acted in accordance with the test (they always accepted the hypothesis that was not rejected), they would be led to accept incorrect hypotheses only as often as the type I and II error rates specify. But once a hypothesis test is conducted, researchers are not supposed to reevaluate those hypotheses. When they do, the true error rates become inflated and uncontrollable. There are ways to avoid this, but they require determining how many times the hypothesis under consideration will be tested. Once that is known, the type I and II errors can be adjusted proactively (Goodman provides a nice overview of this issue).20 Unfortunately, this is often impossible to determine in practice.

Confusion Reigns as Significance and Hypothesis Testing Are Married

Fisher was the first to see the writing on the wall. He understood that these two testing procedures, so similar in mechanics and terminology yet so drastically different in purpose, could easily be confused as one. Neyman and Pearson also understood what was at stake, and all three wrote extensively about how their procedures were different. But much of what they said went unheeded because of acrimony between Neyman and Fisher.

In retrospect, it was probably only a matter of time before these two frameworks were merged. Why? Because each one has a key concept that the other lacks. Significance testing provides a measure of the strength of evidence, but does not address how often that measure may be misleading. Hypothesis testing does just the opposite: error rates provide a sense of how often one may be misled, but they do not represent or measure the strength of evidence in the data. So it seems only natural that researchers would want to use both frameworks to design good experiments and characterize evidence in the data.

The end result of this marriage is an ad hoc methodology based loosely on partial definitions, choice principles, and key quantities from both testing procedures. Because of this, the methodology gives rise to irresolvable controversies such as those surrounding adjustment for multiple comparisons and multiple looks at accumulating data.4,20 Moreover, the union of these two testing frameworks invites misinterpretation of key quantities, such as interpreting the P-value as a post hoc type I error rate.4,8

To Adjust or Not to Adjust

One example of the confusion that this merger has created concerns adjustments of the tail area probability for multiple comparisons or multiple looks at accumulating data.4,20 Should error rates or P-values be adjusted for multiple comparisons or repeated testing of accumulating data? Following the line of reasoning laid down here, the answer is that error rates should be adjusted, but P-values should not (as long as they are not given a probabilistic interpretation).
Adjustment of error rates for multiple comparisons or repeated testing is necessary to keep that error rate controlled. Otherwise it inflates with each examination of the data (with each opportunity to make an error). But P-values are different; they measure the strength of evidence in the data, which should depend only on the data at hand and not on how many other examinations were conducted or are planned. The key idea is this: repeated testing increases the propensity for some results along the way to be misleading, so we attempt to control that propensity by adjusting the error rate. But repeated testing does not change how we interpret what the data themselves represent in terms of statistical evidence. Why should it? If two experiments happen to collect exactly the same data, they should have exactly the same amount of statistical evidence. But the propensity for each set of data to be misleading could be different depending on the experimental design.
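A small simulation (our construction, with assumed normal data and five equally spaced looks of 20 observations each) illustrates the inflation being described: testing the accumulating data at the nominal 5% level after every look pushes the overall chance of at least one false rejection well above 5% when the null hypothesis is in fact true.

```python
# A simulation sketch of the inflation described above: five looks at accumulating data
# (20 new observations per look, normal data, null hypothesis true), each tested at the
# nominal 5% level.
import numpy as np

rng = np.random.default_rng(1)
sims, looks, n_per_look = 50_000, 5, 20

data = rng.normal(0.0, 1.0, size=(sims, looks * n_per_look))   # the null is true
any_rejection = np.zeros(sims, dtype=bool)
for k in range(1, looks + 1):
    m = k * n_per_look
    z = data[:, :m].mean(axis=1) * np.sqrt(m)   # z statistic at the k-th look
    any_rejection |= np.abs(z) > 1.96

print(f"overall type I error ≈ {any_rejection.mean():.2f}")   # ≈ 0.14, not 0.05
```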
We should note that this is a hotly debated topic. Statisticians sometimes point out that in certain situations P-values can be very misleading if they are not adjusted for repeated testing,21,22 while others note that adjusting P-values can lead to statistical conclusions that violate common sense (two researchers with exactly the same data can then end up with different P-values and therefore different assessments of the strength of evidence in the data).4,23 Again, following our line of reasoning, P-values are an index of the strength of evidence in the data and should therefore not be adjusted.
P-Values, Sample Size, and Statistical versus Clinical Significance

Implicit in the significance testing framework is the concept that equal P-values from two different experiments represent the same strength of evidence against the null hypothesis, regardless of sample sizes. Fisher strongly believed this,24 and many others supported this idea.25 This concept was named the α postulate.23

But the α postulate is wrong. A given P-value does not have a fixed meaning independent of sample size. When two studies have equal P-values, some experts held that evidence from the one with the smaller sample was actually stronger,26 but others came to the exact opposite conclusion.27 In fact, both interpretations are correct, depending on whether the exact P-value is reported (p = 0.003) or whether it is reported only to be less than some fixed threshold (p <0.01).28 Such discrepancies are certainly not reassuring, and they indicate that P-values may not be the best measure of statistical evidence available. Alternative viewpoints are published elsewhere.2,7,29

The take-home message is that statistical significance depends critically on sample size, and clinical significance should always be considered. What is absent from our two illustrations is exactly this: a definition of what differences are clinically important for the comparison. The first example shows a 2-fold improvement, but the difference is not statistically significant. The second example has lots of statistical power due to its large sample, and a statistically significant difference, but the difference is not likely to be clinically significant. Although the P-value provides a measure of statistical significance, it fails to connect that measure directly to the estimated magnitude of the effect. That is why many experts prefer the use of effect measures (relative risk, odds ratio, etc.) and CIs instead.9–13 In this sense, P-values tell only half the story.
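A hypothetical calculation makes the point about sample size concrete. In the sketch below (invented numbers, normal-approximation test for two proportions), a small trial and a very large trial produce essentially the same P-value, yet the difference generating it is an 18-percentage-point gap (RR ≈ 1.9) in one case and a 2-percentage-point gap (RR = 1.04) in the other.

```python
# A sketch with invented numbers: two trials with nearly identical P-values but very
# different sample sizes, and therefore very different effect sizes.
from math import sqrt
from scipy.stats import norm

def two_sided_p(p1, p2, n_per_arm):
    """Normal-approximation P-value for comparing two proportions."""
    pooled = (p1 + p2) / 2
    se = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    return 2 * norm.sf(abs(p1 - p2) / se)

print(two_sided_p(0.38, 0.20, n_per_arm=50))     # small trial, RR ≈ 1.9:  P ≈ 0.047
print(two_sided_p(0.52, 0.50, n_per_arm=5000))   # huge trial,  RR = 1.04: P ≈ 0.046
```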
Many experts have argued that researchers put too much emphasis on P-values because researchers tend to focus on achieving statistical significance with limited regard to clinical significance. Instead, one should focus on estimation and CIs. The P-value has limited value in terms of assessing the magnitude of an effect because it depends heavily on sample size. Hence the common warning that statistical significance does not imply clinical significance. For this reason, leaders in the field have suggested that P-values be replaced by, or at least augmented with, effect estimates and CIs.

How has this been translated into practice in medical journals? As a result of the marriage of significance and hypothesis testing, journals and researchers often divide results into two categories: statistically significant (positive) studies and results that are not statistically significant (negative). This has resulted in two common and potentially serious consequences: clinically significant differences noted in small studies are considered nonsignificant and are ignored, and all statistically significant findings are assumed to result from real treatment effects and are assumed to be important (but may be clinically insignificant).30,31 It is very important to consider sample size, and in this sense CIs can be extremely helpful. We will return to this, after first clearing up some misinterpretations of P-values.

Common Misinterpretations of P-Values

P-values are routinely misused and misinterpreted in clinical research. In fact, they are easy to misinterpret because their seemingly natural or desired interpretation is often technically incorrect. As discussed earlier, they do not require a specific interpretation other than as an index of the strength of evidence in the data. However, they are often confused with type I errors, resulting in some of the following mistaken interpretations:

1. The P-value is the likelihood that findings are due to chance.
2. The P-value is 0.06; therefore, there is a 94% chance that a real difference exists.
3. With a low P-value (p <0.001), the findings must be true.
4. The lower the P-value, the stronger the evidence for an effect.
5. Equal P-values represent the same amount of evidence against the null hypothesis.
6. The P-value is the probability that the null hypothesis is true given the data.

Interpretations 1, 2, and 6 attempt to interpret the P-value as a probability. As we noted earlier, if the P-value is an index of the evidence, it should not be interpreted in this way. Interpretation 6 is wrong because the P-value is calculated assuming that the null hypothesis is true, so it cannot also represent the uncertainty about the null hypothesis.32 Interpretations 4 and 5 both rest on the validity of the α postulate. These statements are true only when sample sizes are the same among experiments. Otherwise they may or may not be true. Finally, interpretation 3 is also not correct. Here context is very important, and even with very low P-values we will never be absolutely sure that the null hypothesis is false.

Effect Sizes and Confidence Intervals

As mentioned, many experts and journal editors prefer effect sizes and CIs to P-values. The P-value is often taken to split findings into two groups, statistically significant (p <0.05) and insignificant (p >0.05), because of confusion with hypothesis testing. This makes little sense from an evidential perspective and promotes superficial thinking about research findings.33 For example, does it make scientific sense to adopt a new therapy because the P-value of a single study was 0.049, and at the same time ignore the results of another therapy because the P-value was 0.051? Certainly not, although that is exactly what one does under the hypothesis-testing framework.
Effect sizes and confidence intervals can provide an indirect assessment of the strength of evidence in the data. However, they do not suffer from problems with sample size, as P-values do, and for that reason alone are a welcome improvement.

As Grimes succinctly stated, “the fundamental problem with the P-value is that it conveys no information about the size of differences or associations found in studies.”9 Reporting an effect size, such as a relative risk (RR) or odds ratio, provides a measure of the strength of the association. A relative risk of 4.0 implies a 4-fold increase, and a relative risk of 30 is a 30-fold increase. Consider our clinical examples: in the first example we had a 2-fold difference (RR = 2.0) that was statistically insignificant. In the second, the difference was a modest 10% improvement (RR = 1.1), but was statistically significant. Surely the reader will agree that an RR of 2.0 is more clinically meaningful than an RR of 1.1. But how good is this estimate? To answer this question, we use a CI.

The effect measure, or point estimate (RR), is the best estimate for the true but unknown effect under investigation. A CI for the effect (usually at the 95% level) indicates how good that estimate is by providing a range of uncertainty in the point estimate. The width of the interval indicates the precision in our data; wide CIs are less precise and narrow ones more precise. Thus we have an indication of clinical significance because the interval will include only clinically important effects, only clinically unimportant effects, or both types of effects. The first and second cases are self-explanatory; these intervals clearly indicate what the data say. But the third case is not so clear, and it indicates that more data should be collected. The problem with looking only at P-values is that a small figure could be associated with any one of these three examples depending on the sample size and null hypothesis being tested.

The connection between P-values and CIs is as follows. If a 95% CI includes the null effect, the P-value is greater than 0.05 and a test of the null hypothesis would fail to reject. (The null effect is that specified under the null hypothesis. In our example the null effect would be that the relative risk is unity.) If a 95% CI excludes the null effect (RR = 1.0 in our examples), the associated P-value for testing that null hypothesis will be less than 0.05 and a test of the null hypothesis will reject. In fact, a 95% CI can be thought of as a collection of all null hypotheses that would not reject at the 5% level. It is most easily interpreted as a collection of hypotheses that are best supported by, or consistent with, the data at the 95% level.

In our first illustration, the relative risk was 2.0 and the 95% CI was 0.98 to 4.1. From this we know that a hypothesis test would not reject at the 5% level and that the P-value would be greater than 0.05 (we know this because the CI includes 1.0). Here the CI includes values that are not clinically important. In the second illustration, where the null hypothesis was rejected and the result deemed statistically significant, the RR was only 1.1 and the 95% CI was 1.01 to 1.2. This CI indicates that if there is any effect at all, it is likely to be a small improvement. Clearly, presenting a CI greatly improves the dissemination and characterization of results.
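The intervals quoted above can be reproduced with the usual large-sample confidence interval for a relative risk, computed on the log scale; the sketch below is one way to do so (again assuming 1,000 women per arm in the second trial, since only the total enrollment is stated). Note how the first interval straddles RR = 1 while the second excludes it, matching the P-value statements above.

```python
# A sketch of the quoted intervals: large-sample 95% CI for a relative risk on the log scale.
# Assumption: 1,000 women per arm in the second trial.
from math import exp, log, sqrt

def rr_ci(events1, n1, events2, n2, z=1.96):
    """Point estimate and approximate 95% CI for the relative risk (events1/n1)/(events2/n2)."""
    rr = (events1 / n1) / (events2 / n2)
    se_log_rr = sqrt(1 / events1 - 1 / n1 + 1 / events2 - 1 / n2)
    return rr, exp(log(rr) - z * se_log_rr), exp(log(rr) + z * se_log_rr)

print(rr_ci(18, 60, 9, 60))          # ≈ (2.0, 0.98, 4.1): includes RR = 1, so P > 0.05
print(rr_ci(550, 1000, 500, 1000))   # ≈ (1.1, 1.01, 1.20): excludes RR = 1, so P < 0.05
```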
Should I Measure the Evidence or Make a Decision?

The natural tendency, we believe, is for scientists and researchers to want to measure and summarize evidence. Hypothesis testing puts too much emphasis on a single statistically significant study, without regard to the costs and benefits of the therapy under consideration. Moreover, published medical research reports often provide no firm evidence for decision making. Each report contributes incrementally to an existing body of knowledge. Increasing recognition of this fact is reflected in the growth of formal methods of research synthesis,8 including presentation of an updated meta-analysis in the discussion section of original research reports.9 When presented in this manner, prior evidence is based on the results of reports addressing the same issue, and new data are added to the body of evidence. All forms of evidence (animal studies, different epidemiologic study designs, etc.) should be considered as one weighs the evidence, and this is not done in the hypothesis-testing framework. The discussion section of a research report should put the results in the context of other evidence in the medical literature to arrive at a logical conclusion.

In our opinion, conclusions and recommendations for decision making should not be based on the results of a hypothesis test alone. Evidence for clinical practice should be based on all available evidence, the strength of the association, the precision of this estimate, potential public health benefits (and harms), and economic considerations. The P-value plays an important role in this respect, and that is why it will just not disappear, despite its deficiencies. Science needs a way to measure and summarize the strength of the evidence in data objectively. Without a better option, the P-value is here to stay; however, we can avoid some of the problems associated with it by presenting effect sizes and CIs for the effect under investigation.

Summary

The arbitrary dichotomies of research findings as statistically significant and insignificant, which result from the pure hypothesis-testing approach, are often not helpful scientifically and can lead to problems. Researchers want to measure and summarize the strength of evidence in the data. This is what P-values are used for, not for making decisions. And although they may not be the best measure available, they are the standard of statistical care at this time. However, researchers, experts, and journal editors are increasingly encouraging the use of effect measures (point estimates) and CI estimation over both hypothesis and significance testing. Effect sizes and CIs often provide critical information that should not be overlooked and can help to reduce the dependence on P-values in cases where they are likely to be problematic. New research findings must be put into the context of existing knowledge. Medical evidence for decision making should rely on a synthesis of existing research studies, and the contribution of the new data presented to the current evidence in the literature.
References

1. Goodman SN: Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med 130:995–1004, 1999
2. Goodman SN: Toward evidence-based medical statistics. 2: The Bayes factor. Ann Intern Med 130:1005–1013, 1999
3. Berger J: Could Fisher, Jeffreys and Neyman have agreed on testing? Stat Sci 18:1–32, 2003
4. Royall RM: Statistical Evidence: A Likelihood Paradigm. London, Chapman & Hall, 1997
5. Hubbard R, Bayarri MJ: Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing. Am Statistician 57:171–182, 2003
6. Goodman SN, Royall RM: Evidence and scientific research. Am J Public Health 78:1568–1574, 1988
7. Goodman SN: Of p-values and Bayes: A modest proposal. Epidemiology 12:295–297, 2001
8. Goodman SN: P-values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. Am J Epidemiol 137:485–496, 1993
9. Grimes DA: The case for confidence intervals. Obstet Gynecol 80:865–866, 1992
10. Altman DG: Confidence intervals in research evaluation. ACP Journal Club 116(suppl 2):A28–9, 1992
11. Berry G: Statistical significance and confidence intervals [editorial]. Med J Aust 144:618–619, 1986
12. Braitman LE: Confidence intervals extract clinically useful information from data [editorial]. Ann Intern Med 108:296–298, 1988
13. Simon R: Confidence interval for reporting the results of clinical trials. Ann Intern Med 105:429–435, 1986
14. Fisher RA: Statistical Methods for Research Workers. Edinburgh, Oliver & Boyd, 1925
15. Fisher RA: The Design of Experiments. Edinburgh, Oliver & Boyd, 1935
16. Fisher RA: Statistical Methods for Research Workers, 13th ed. New York, Hafner, 1958
17. Fisher RA: Statistical Methods and Scientific Inference, 2nd ed. New York, Hafner, 1959
18. Fisher RA: The logic of inductive inference. J R Stat Soc 98:39–54, 1935
19. Neyman J, Pearson E: On the problem of the most efficient tests of statistical hypotheses. Phil Trans R Soc A 231:289–337, 1933
20. Goodman SN: Multiple comparisons, explained. Am J Epidemiol 147:807–812, 1998
21. Armitage P: Some developments in the theory and practice of sequential medical trials. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability 4:791–804, 1967
22. Armitage P, McPherson CK, Rowe BC: Repeated significance tests on accumulating data. J R Stat Soc 132:235–244, 1969
23. Cornfield J: Sequential trials, sequential analysis, and the likelihood principle. Am Statistician 29:18–23, 1966
24. Fisher RA: Statistical Methods for Research Workers, 5th ed. London, Oliver & Boyd, 1934
25. Berkson J: Tests of significance considered as evidence. J Am Stat Assoc 37:325–335, 1942
26. Lindley DV, Scott WF: New Cambridge Elementary Statistical Tables. London, Cambridge University Press, 1984
27. Peto R, Pike MC, Armitage P, et al: Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br Med J 34:585–612, 1976
28. Royall RM: The effect of sample size on the meaning of significance tests. Am Statistician 40:313–315, 1986
29. Blume JD: Likelihood methods for measuring statistical evidence. Stat Med 21:2563–2599, 2002
30. Sterne JAC, Smith GD: Sifting the evidence—What's wrong with significance tests? BMJ 322:226–231, 2001
31. Freiman JA, Chalmers TC, Smith H, et al: The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial. N Engl J Med 299:690–694, 1978
32. Cohen J: The earth is round (p<0.05). Am Psychol 49:997–1003, 1994
33. Poole C: Beyond the confidence interval. Am J Public Health 77:195–199, 1987