www.pmrjournal.org
Statistically Speaking
Statistical Power
Regina L. Nuzzo, PhD
1934-1482/$ - see front matter © 2016 by the American Academy of Physical Medicine and Rehabilitation
http://dx.doi.org/10.1016/j.pmrj.2016.08.004
More formally, the power of a statistical test for a given effect size d is the probability before the study that the test will return a result of “statistically significant” (eg, P < .05) when the true effect size is d or larger. This is the probability of getting a “true positive” when the population effect is no smaller than d. The formula for power can depend on the individual study design and require a number of crucial assumptions, so researchers often consult a statistician for help or use available software, such as G*Power [1,2].

It is important to note that it is incomplete to simply discuss the power of a study or a test without discussing a specific effect size. A calculated power is always dependent on the hypothesized effect size in the population that is being sought; different effect size values will yield different estimates of power. Thus, instead of writing simply about the “power of a study,” we should write, for example, about the “power to detect a true mean difference of at least 5 pounds,” or the “power against a mean change of 2 points.”

Power and Significance, as Related to Sensitivity and Specificity

Power is akin to the sensitivity of a medical test. A true screening test with high sensitivity for a certain disease has a high probability of detecting a patient’s disease if it is present and will therefore correctly flag a large fraction of diseased patients. Likewise, a statistical test with 80% power for effect size d is sensitive enough to pick up true effects as small as d about 80% of the time over many repetitions. In other words, a highly sensitive medical test will overlook very few diseased patients; a high-powered statistical test will miss very few true effects of a given size.

Specificity also has an analog in statistical testing. A highly specific medical test has a better chance of correctly giving a healthy person a clean bill of health and therefore not incorrectly flagging too many healthy patients with a false positive. In statistical testing, specificity is controlled by the significance threshold, or α, which is established before a study is done. A statistical test with a significance threshold of .01 is specific enough to ignore truly nonsignificant results 99% of the time over many repetitions. In other words, a highly specific test will produce very few false positives.

Just as is the aim in medical testing, the crucial hypothesis test in a study will ideally have both high sensitivity and specificity. Yet, in general, increasing a test’s specificity will decrease its sensitivity. Thus, tradeoffs must be balanced, and costs of errors in specific situations must be weighed. In a research setting, high sensitivity helps prevent truly effective interventions or results from being overlooked, whereas high specificity helps prevent ineffectual interventions from being promoted and spurious associations from being published.

Figure 1. (A) This illustrates the outcomes of a set of hypotheses in a given research situation, in which half of the hypotheses correspond to real effects, the power of the statistical test to detect these effects is 80%, and α is set to the standard .05 level. The test has a greater specificity (95%) than it does sensitivity (80%), so although half of the hypotheses were true, more than half of the results were negative. Of the results that were statistically significant (the “positives”), 94% corresponded to true effects, which gives a positive predictive value of 94%. Likewise, 82% of the negative results were accurate (giving the negative predictive value). (B) This is the same research situation as in (A), but now power is 30% instead of 80%. (Recall that even with an identical sample size, a study may have lower power due to increased population variance in the outcome measure, or a smaller effect size in the population.) This decreased sensitivity results in more false-negative results and fewer true-positive results. Now only 86% of positive results are correct, and only 58% of negative results are correct. (C) This illustrates a research situation testing more exploratory or “long-shot” hypotheses, in which only a quarter of the hypotheses correspond to real effects. As in (A), the power to detect these effects is 80% and the significance level is the standard 5%. With more null effects being tested, the negative predictive value increases to 93%, but the positive predictive value decreases to 84%. This situation is analogous to using a medical test to screen a general population: as diseased individuals become rarer in the screened population, the test’s positive results are more likely to be flukes. (D) This is the same research situation as in (C), now with power to detect the true effects at only 30%. With high specificity (95%) and low sensitivity (30%), this test more effectively guards against false positives and essentially errs on the side of overlooking true effects. Combined with fewer true effects to be caught in the first place, this drops the fraction of significant results that are correct (positive predictive value) to only 67%. The negative predictive value similarly drops to 80%.
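For readers who want to check the predictive values quoted in the Figure 1 caption, here is a minimal Python sketch (standard library only; the panel settings are simply the caption's own assumptions) that computes the positive and negative predictive values from the fraction of true hypotheses, the power, and the significance threshold. The printed values agree with the caption's figures to within rounding.

```python
def predictive_values(prior_true, power, alpha):
    """Return (PPV, NPV) for a large batch of hypothesis tests.

    prior_true: fraction of tested hypotheses that correspond to real effects
    power:      sensitivity of the test, P(significant | real effect)
    alpha:      significance threshold, P(significant | no effect)
    """
    true_pos = prior_true * power              # real effects flagged significant
    false_neg = prior_true * (1 - power)       # real effects missed
    false_pos = (1 - prior_true) * alpha       # null effects flagged significant
    true_neg = (1 - prior_true) * (1 - alpha)  # null effects correctly nonsignificant
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# The 4 panels of Figure 1: (fraction of true hypotheses, power), with alpha = .05
for label, prior, power in [("A", 0.50, 0.80), ("B", 0.50, 0.30),
                            ("C", 0.25, 0.80), ("D", 0.25, 0.30)]:
    ppv, npv = predictive_values(prior, power, alpha=0.05)
    print(f"Panel {label}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
```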
Difficulties in Increasing Power

Although great attention has been paid to the problems of false positives [3], there is an increasing awareness that the problems stemming from low power also deserve serious consideration, especially in light of increasing evidence of low power in many studies [4]. In challenging research environments, however, adequate power in a study can be difficult to achieve. It is standard in much of biomedical research to design a study so that the primary hypothesis test will have 80% power for the smallest plausible effect that will have clinical relevance; in other words, it is desired that a true effect of this given size would have an 80% chance of being flagged as statistically significant. When designing a study, researchers should keep in mind the 4 study components that will affect its power:

1. Sample size: Increasing a study’s sample size will increase how precisely we can estimate the true effect size, which will in turn increase the study’s power.

2. Significance threshold: All other things being equal, power and significance thresholds work in opposition to one another. Keeping the sample size identical but setting a less stringent threshold for significance (eg, increasing alpha from .05 to .10) will increase a study’s power and thus decrease its false negatives, but it will do so at the cost of increasing the number of false positives. On the other hand, a stricter threshold for significance (decreasing alpha from .05 to .01, for example) will have the advantage of producing fewer false positives, but it will also decrease power and thus increase false negatives. Designing a study with 80% power and a significance threshold of .05 essentially places 4 times the priority on minimizing false positives compared with minimizing false negatives. In such a study, a null result will have a 5% chance of incorrectly coming up significant, whereas a true result will have a 20% chance of incorrectly coming up nonsignificant.

3. Population variance: For research situations fortunate enough to have in the study population a low variability in the effect being measured, a study with a particular sample size will have greater power to detect the effect than a study with an identical size but greater population variance. In other words, when researchers expect a high population variance in the results, their study will need a greater sample size to achieve the standard 80% power than would a study in which low variance is expected.

4. Effect size: All other factors being equal, the power to detect a large effect is greater than the power to detect a small effect; that is, it is more difficult to detect subtle effects. Thus, if researchers want the ability to detect small as well as large effects at 80% power, they need to collect a greater sample size than if they were only looking for large effects.

Problems Stemming From Low Power

In ideal situations, researchers could conduct a study with a sample size large enough to achieve very high power. In reality, however, it often can be difficult or impossible to collect a large number of observations, as the result of cost, limited time, patient scarcity or dropouts, or other constraints, so it’s often not practically possible to achieve the desired level of power, a fact that is leading some researchers to explore alternative methods for determining an appropriate sample size for a study [5]. Although changing the significance level is an easy way to increase power, it is difficult to do in practice; an alpha of .05 is standard in much of research, and in fact some researchers are calling for even more stringent thresholds such as .005 or .001 [6]. In light of the difficulties in achieving high power, it’s important to look at how low power affects study outcomes.

The first important impact is the most widely known: More true effects are overlooked with a low-powered study than with a high-powered one. For example, for a set of hypotheses all tested at 80% power for a given effect size, only 20% of true effects of that size will be incorrectly classified as nonsignificant. If the hypotheses were tested at 25% power, however, three-quarters of true effects will be overlooked.

A second effect is less widely appreciated: In low-powered studies that test many comparisons and conditions, a greater fraction of statistically significant results may in fact be false positives. This problem becomes even worse in research situations in which more hypotheses are “exploratory” and fewer correspond to true effects. This situation is analogous to using a medical test with low sensitivity to screen the general population versus screening a targeted high-risk group: the general population includes more asymptomatic individuals unlikely to have the disease, so a higher proportion of patients flagged by the test as having the disease will in fact be healthy. In this situation, the positive predictive value of the test will be low, meaning that a greater fraction of positives will be false positives [7]. The relationship among exploratory hypotheses, power, positive predictive value, and negative predictive value is illustrated in Figure 1.

A third effect of low power is perhaps even less well known but is gaining wider attention: Effect sizes of statistically significant results in low-powered studies can be exaggerated, so that significant effects appear larger than they really are [4,8]. This problem has obvious implications for follow-up studies and reproducibility of research; it is sometimes referred to as a “Type M (magnitude) error” or “the winner’s curse,” the latter referencing the common phenomenon of promising early findings not panning out in later studies. When initial studies are underpowered to detect the true effect of an intervention, it is often only those studies “lucky” enough to have drawn a sample with an unusually large effect that will produce statistically significant results and be published. Thus, the effects in these studies that pass the “statistical significance filter” are often inflated in publication. This is a problem for follow-up studies because they are usually designed to replicate these too-large effects, and since more power is needed to detect smaller effects, these follow-up studies are often underpowered and thus often will fail to find a significant effect.
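A small simulation can make this “winner’s curse” concrete. The settings below (a true 5-point difference between groups, a standard deviation of 25, 20 patients per group, and a z test with the standard deviation treated as known) are arbitrary choices for illustration only; they give a power of roughly 10%, and the studies that happen to reach significance overstate the true difference several times over.

```python
import random
import statistics
from statistics import NormalDist

random.seed(0)
norm = NormalDist()

TRUE_DIFF, SIGMA, N_PER_GROUP, ALPHA = 5.0, 25.0, 20, 0.05
z_crit = norm.inv_cdf(1 - ALPHA / 2)            # 1.96 for a two-sided .05 test
se = SIGMA * (2 / N_PER_GROUP) ** 0.5           # SE of the mean difference

significant_diffs = []
n_sims = 20_000
for _ in range(n_sims):
    treat = [random.gauss(TRUE_DIFF, SIGMA) for _ in range(N_PER_GROUP)]
    control = [random.gauss(0.0, SIGMA) for _ in range(N_PER_GROUP)]
    diff = statistics.mean(treat) - statistics.mean(control)
    if abs(diff / se) > z_crit:                 # the "statistically significant" studies
        significant_diffs.append(abs(diff))

print(f"empirical power: {len(significant_diffs) / n_sims:.0%}")
print(f"true difference: {TRUE_DIFF}")
print(f"mean |difference| among significant results: {statistics.mean(significant_diffs):.1f}")
```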
Misconceptions About Prestudy and Poststudy Power

Power should be used to guide the design of a study. When performing power calculations before data collection to aid in study design, researchers should choose the smallest effect size that is both plausible and clinically relevant. Although it may be difficult to estimate the true effect size, researchers should consider the size below which an effect would be considered negligible. This is often a difficult task, but as statistician Stephen Senn points out, “An astronomer does not know the magnitude of new stars until he has found them, but the magnitude of star he is looking for determines how much he has to spend on a telescope” [9].

Power provides very little information after a study is completed. Power is the probability before the study that the statistical test will correctly pick up on a hypothesized effect size if it is present. After a study, therefore, prestudy power calculations cannot reveal how well the hypothesized effects fit with data actually observed in the study.

After a study, some researchers instead use the observed data to calculate power, in what is known as “observed power” or “post hoc power.” This post hoc power, however, is simply a restatement of the P value and therefore contains no new information [10]. For example, after finding a nonsignificant effect, sometimes researchers will point to a low post hoc power to make the argument that the study was simply underpowered to detect the observed effect and that the results are likely to be a false negative. Yet for any nonsignificant result, it can be shown that the observed power (with very few exceptions) will necessarily be <50%. So using post hoc power does not provide more help in interpreting nonsignificant results.
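To see why, consider a two-sided z test: the observed power is a direct function of the observed test statistic, and therefore of the P value, so it adds nothing to what the P value already says. The following minimal sketch (assuming the usual .05 threshold) shows that any nonsignificant result corresponds to an observed power below roughly 50%.

```python
from statistics import NormalDist

norm = NormalDist()
ALPHA = 0.05
z_crit = norm.inv_cdf(1 - ALPHA / 2)   # 1.96 for a two-sided .05 test

def p_value(z_obs):
    """Two-sided P value for an observed z statistic."""
    return 2 * (1 - norm.cdf(abs(z_obs)))

def observed_power(z_obs):
    """Post hoc 'observed power': the power the test would have had if the true
    standardized effect were exactly the effect observed in the data."""
    z = abs(z_obs)
    return (1 - norm.cdf(z_crit - z)) + norm.cdf(-z_crit - z)

# Nonsignificant results (P > .05) always map to observed power below about 50%.
for z in [0.5, 1.0, 1.5, 1.96, 2.5]:
    print(f"z = {z:4.2f}   P = {p_value(z):.3f}   observed power = {observed_power(z):.2f}")
```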
Power Calculation: Example Calculations for Power and Sample Size
Software such as G*Power will easily perform the following calculations; they are provided here for readers seeking a
slightly deeper conceptual understanding. Note that the Z test statistic rather than the more accurate t test statistic
is used in the interest of providing a simple illustration.
Suppose researchers want to use the Functional Independence Measure to investigate the difference between a treatment group and a control group of patients. They would like to have enough power in their study to detect a difference |μ₁ − μ₂| of at least 22 points, which is the Minimum Clinically Important Difference. From previous studies they believe the observations are roughly normally distributed and that the population standard deviation σ for the Functional Independence Measure score is about 25 points for each group. They plan to do a standard 2-sided hypothesis test of no difference between groups, with a significance threshold α of .05 and equal sample sizes for each group.
What sample size do the researchers need to have a study with 80% power to detect a difference of 22 points?
Solution:
n = minimum sample size for each group = 2σ²(zα/2 + zβ)² / (μ₁ − μ₂)²

where

σ = population standard deviation (assumed to be equal for both groups)

zα/2 = standard normal value for which 100·(α/2)% of the values fall in the upper tail (for the typical significance threshold of α = .05, zα/2 = 1.96)

zβ = standard normal value for which 100·β% of the values fall in the upper tail (for the typical power of 80%, β = 1 − power = .20 and zβ = 0.841)

μ₁ − μ₂ = the difference that the study will be powered to detect

So in this example, σ = 25, α = .05, zα/2 = 1.96, β = .20, zβ = 0.841, and |μ₁ − μ₂| = 22.

Therefore, n = 2 · 25² · (1.96 + 0.841)² / 22² ≈ 20.26. So a minimum of 21 patients in each group would be necessary for the study to have 80% power to detect a difference of 22 points, given the assumptions.
Note: Closer inspection of the formula for the sample size shows the relationship between power and the 4 elements: sample size, population standard deviation, significance threshold, and expected effect size. A larger population standard deviation σ will increase the sample size necessary to achieve the desired power; a larger α (less strict threshold) will decrease zα/2 and therefore decrease the necessary sample size; accepting a lower power will increase β, decrease zβ, and therefore decrease the necessary sample size; searching for a larger mean difference |μ₁ − μ₂| will decrease the necessary sample size.
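The same arithmetic is easy to script. The sketch below is a minimal Python implementation of the normal-approximation formula from this box (an exact t-based calculation, as performed by software such as G*Power, would typically require a slightly larger n); the extra calls illustrate how each of the 4 elements moves the required sample size.

```python
import math
from statistics import NormalDist

norm = NormalDist()

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Minimum sample size per group for a two-sided, two-sample comparison of
    means, using the normal (z) approximation from the box above."""
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = .05
    z_beta = norm.inv_cdf(power)            # 0.841 for 80% power
    n = 2 * sigma**2 * (z_alpha + z_beta)**2 / delta**2
    return math.ceil(n)                     # round up to whole patients

# The worked example: sigma = 25 points, smallest important difference = 22 points
print(n_per_group(sigma=25, delta=22))                # -> 21

# How the 4 elements move the required n:
print(n_per_group(sigma=35, delta=22))                # larger population SD -> larger n
print(n_per_group(sigma=25, delta=22, alpha=0.01))    # stricter threshold -> larger n
print(n_per_group(sigma=25, delta=22, power=0.90))    # more power -> larger n
print(n_per_group(sigma=25, delta=10))                # smaller difference sought -> larger n
```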
Disclosure