
PM R 8 (2016) 907-912

www.pmrjournal.org

Statistically Speaking

Statistical Power
Regina L. Nuzzo, PhD

Terminology

Sensitivity: The "true positive" rate for a test; the proportion of "positives" that the test correctly flags as being positive.

Specificity: The "true negative" rate for a test; the proportion of "negatives" that the test correctly excludes as being negative.

Type I error rate: The "false-positive" rate for a statistical test; the proportion of "negatives" that will be incorrectly flagged as being positive. Also known as α, or the significance threshold.

Type II error rate: The "false-negative" rate for a statistical test; the proportion of "positives" that will be incorrectly excluded as being negative.

Positive predictive value: Of the cases flagged by the test as being positives, the proportion that truly are "positives."

Negative predictive value: Of the cases excluded by the test as being negatives, the proportion that truly are "negatives."

Effect size: A general term for the quantitative magnitude or strength of the effect being studied. Specific examples include the difference in outcome measure between 2 groups; the correlation between 2 measures; the change in an outcome measure over time.

Power: The prestudy probability that a statistical test will correctly return a result of "statistically significant" for true effects that are no smaller than a certain size.

Statistical power is a prevalent but widely misunderstood concept used in planning studies and interpreting results. This article will discuss the concept of power, its relationship to sensitivity and specificity, the difficulties in achieving high power, implications of low power, and common misconceptions in interpreting power.

Concept of Power

Statistical power often is thought of solely in the context of determining sample size for a planned study, but power is in fact more relevant and useful beyond this context, and thus it's useful to understand it from a statistician's perspective. It's important to remember that power is a theoretical property of a statistical test in a given study situation, in much the same way that sensitivity and specificity are properties of medical tests, as will be discussed herein. When we speak of "power calculations," we usually are either using our knowledge of a specific study situation to estimate how strong its power will be, or we have a target power level in mind for a situation and are estimating how large a sample size we will need to achieve our goal.

There are 4 main components that affect the power of a study: sample size, significance threshold, population variance of the effect, and effect size. Knowing these 4 values will allow a researcher to calculate the theoretical power of a given study design. Of these 4 factors, sample size is the component most under a researcher's control; the latter 3 factors typically are dictated by the situation. Thus, researchers usually use power calculations to work backwards: for a given significance threshold, population variance, and effect size, they calculate what sample size is necessary to design a study with 80% power. These 4 factors will be discussed in more detail herein.

Conceptually, the power of a statistical test is similar to the power of a telescope: it is the ability to separate a true pattern from its background. The smaller the object to be detected, or the more distracting the environment is, the more magnification power is needed to detect a true signal. A statistically high-powered, placebo-controlled randomized study, for instance, is more likely to spot even subtle differences between the treatment and control groups than a low-powered study would, even when there is high person-to-person variability ("noise") in the results. Thus, designing a study to have high power reduces the chance of overlooking true findings (ie, reducing false negatives, or Type II errors), and it also increases the chances that a significant finding in fact represents a true effect (a true positive).
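This frequency interpretation of power can be made concrete with a small simulation: generate many hypothetical studies in which a true effect of a given size exists, run the test on each, and count how often the result comes up significant. Below is a minimal Monte Carlo sketch (not part of the original article; the inputs are illustrative and happen to match the worked example at the end of this article):

```python
# A minimal Monte Carlo sketch of the definition of power: simulate many
# studies in which a true effect of the stated size exists, and count how
# often a two-sample test declares significance at the chosen threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma, true_diff, alpha = 21, 25.0, 22.0, 0.05  # per-group n, SD, effect, threshold

n_sims = 10_000
hits = sum(
    stats.ttest_ind(rng.normal(true_diff, sigma, n), rng.normal(0.0, sigma, n)).pvalue < alpha
    for _ in range(n_sims)
)
print(f"Estimated power: {hits / n_sims:.2f}")  # close to 0.80 for these inputs
```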


More formally, the power of a statistical test for a given effect size d is the probability before the study that the test will return a result of "statistically significant" (eg, P < .05) when the true effect size is d or larger. This is the probability of getting a "true positive" when the population effect is no smaller than d. The formula for power can depend on the individual study design and require a number of crucial assumptions, so researchers often consult a statistician for help or use available software, such as G*Power [1,2].

It is important to note that it is incomplete to simply discuss the power of a study or a test without discussing a specific effect size. A calculated power is always dependent on the hypothesized effect size in the population that is being sought; different effect size values will yield different estimates of power. Thus, instead of writing simply about the "power of a study," we should write, for example, about the "power to detect a true mean difference of at least 5 pounds," or the "power against a mean change of 2 points."

Power and Significance, as Related to Sensitivity and Specificity

Power is akin to the sensitivity of a medical test. A screening test with high sensitivity for a certain disease has a high probability of detecting a patient's disease if it is present and will therefore correctly flag a large fraction of diseased patients. Likewise, a statistical test with 80% power for effect size d is sensitive enough to pick up true effects as small as d about 80% of the time over many repetitions. In other words, a highly sensitive medical test will overlook very few diseased patients; a high-powered statistical test will miss very few true effects of a given size.

Specificity also has an analog in statistical testing. A highly specific medical test has a better chance of correctly giving a healthy person a clean bill of health and therefore not incorrectly flagging too many healthy patients with a false positive. In statistical testing, specificity is controlled by the significance threshold, or α, which is established before a study is done. A statistical test with a significance threshold of .01 is specific enough to ignore truly nonsignificant results 99% of the time over many repetitions. In other words, a highly specific test will produce very few false positives.

Just as is the aim in medical testing, the crucial hypothesis test in a study will ideally have both high sensitivity and specificity. Yet, in general, increasing a test's specificity will decrease its sensitivity. Thus, tradeoffs must be balanced, and costs of errors in specific situations must be weighed. In a research setting, high sensitivity helps prevent truly effective interventions or results from being overlooked, whereas high specificity helps prevent ineffectual interventions from being promoted and spurious associations from being published.

Figure 1. (A) This illustrates the outcomes of a set of hypotheses in a given research situation, in which half of the hypotheses correspond to real effects, the power of the statistical test to detect these effects is 80%, and α is set to the standard .05 level. The test has a greater specificity (95%) than it does sensitivity (80%), so although half of the hypotheses were true, more than half of the results were negative. Of the results that were statistically significant (the "positives"), 94% corresponded to true effects, which gives a positive predictive value of 94%. Likewise, 82% of the negative results were accurate (giving the negative predictive value). (B) This is the same research situation as in (A), but now power is 30% instead of 80%. (Recall that even with an identical sample size, a study may have lower power due to increased population variance in the outcome measure, or a smaller effect size in the population.) This decreased sensitivity results in more false-negative results and fewer true-positive results. Now only 86% of positive results are correct, and only 58% of negative results are correct. (C) This illustrates a research situation testing more exploratory or "long-shot" hypotheses, in which only a quarter of the hypotheses correspond to real effects. As in (A), the power to detect these effects is 80% and the significance level is the standard 5%. With more null effects being tested, the negative predictive value increases to 93%, but the positive predictive value decreases to 84%. This situation is analogous to using a medical test to screen a general population: as diseased individuals become rarer in the screened population, the test's positive results are more likely to be flukes. (D) This is the same research situation as in (C), now with power to detect the true effects at only 30%. With high specificity (95%) and low sensitivity (30%), this test more effectively guards against false positives and essentially errs on the side of overlooking true effects. Combined with fewer true effects to be caught in the first place, this drops the fraction of significant results that are correct (positive predictive value) to only 67%. The negative predictive value similarly drops to 80%.
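The arithmetic behind the caption's predictive values is simple enough to verify directly. Below is a brief Python sketch, not part of the original article (the function name and layout are my own), that reproduces the four panels' numbers from power, α, and the fraction of hypotheses that correspond to real effects:

```python
# A sketch of the arithmetic behind Figure 1: positive and negative predictive
# values follow directly from power (sensitivity), alpha (1 - specificity),
# and the prior fraction of tested hypotheses that are truly real effects.
def predictive_values(power: float, alpha: float, prior: float) -> tuple[float, float]:
    tp = prior * power              # real effects correctly flagged significant
    fn = prior * (1 - power)        # real effects missed (false negatives)
    fp = (1 - prior) * alpha        # null effects flagged significant (false positives)
    tn = (1 - prior) * (1 - alpha)  # null effects correctly nonsignificant
    return tp / (tp + fp), tn / (tn + fn)

for panel, power, prior in [("A", .80, .50), ("B", .30, .50), ("C", .80, .25), ("D", .30, .25)]:
    ppv, npv = predictive_values(power, alpha=.05, prior=prior)
    print(f"Panel {panel}: PPV = {ppv:.0%}, NPV = {npv:.0%}")
# Matches the caption to within rounding: 94%/83%, 86%/58%, 84%/93%, 67%/80%.
```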

Difficulties in Increasing Power

Although great attention has been paid to the problems of false positives [3], there is an increasing awareness that the problems stemming from low power also deserve serious consideration, especially in light of increasing evidence of low power in many studies [4]. In challenging research environments, however, adequate power in a study can be difficult to achieve. It is standard in much of biomedical research to design a study so that the primary hypothesis test will have 80% power for the smallest plausible effect that will have clinical relevance; in other words, it is desired that a true effect of this given size would have an 80% chance of being flagged as statistically significant. When designing a study, researchers should keep in mind the 4 study components that will affect its power:

1. Sample size: Increasing a study's sample size will increase how precisely we can estimate the true effect size, which will in turn increase the study's power.

2. Significance threshold: All other things being equal, power and significance thresholds work in opposition to one another. Keeping the sample size identical but setting a less stringent threshold for significance (eg, increasing alpha from .05 to .10) will increase a study's power and thus decrease its false negatives, but it will do so at the cost of increasing the number of false positives. On the other hand, a stricter threshold for significance (decreasing alpha from .05 to .01, for example) will have the advantage of producing fewer false positives, but it will also decrease power and thus increase false negatives. Designing a study with 80% power and a significance threshold of .05 essentially places 4 times the priority on minimizing false positives compared with minimizing false negatives. In such a study, a null result will have a 5% chance of incorrectly coming up significant, whereas a true result will have a 20% chance of incorrectly coming up nonsignificant.

3. Population variance: For research situations fortunate enough to have in the study population a low variability in the effect being measured, a study with a particular sample size will have greater power to detect the effect than a study with an identical size but greater population variance. In other words, when researchers expect a high population variance in the results, their study will need a greater sample size to achieve the standard 80% power than would a study in which low variance is expected.

4. Effect size: All other factors being equal, the power to detect a large effect is greater than the power to detect a small effect; that is, it is more difficult to detect subtle effects. Thus, if researchers want the ability to detect small as well as large effects at 80% power, they need to collect a greater sample size than if they were only looking for large effects.

Problems Stemming From Low Power

In ideal situations, researchers could conduct a study with a sample size large enough to achieve very high power. In reality, however, it often can be difficult or impossible to collect a large number of observations, as the result of cost, limited time, patient scarcity or dropouts, or other constraints, so it's often not practically possible to achieve the desired level of power, a fact that is leading some researchers to explore alternative methods of determining an appropriate sample size for a study [5]. Although changing the significance level is an easy way to increase power, it is difficult to do in practice; an alpha of .05 is standard in much of research, and in fact some researchers are calling for even more stringent thresholds such as .005 or .001 [6].

In light of the difficulties in achieving high power, it's important to look at how low power affects study outcomes. The first important impact is the most widely known: More true effects are overlooked with a low-powered study than with a high-powered one. For example, for a set of hypotheses all tested at 80% power for a given effect size, only 20% of true effects of that size will be incorrectly classified as nonsignificant. If the hypotheses were tested at 25% power, however, three-quarters of true effects will be overlooked.

A second effect is less widely appreciated: In low-powered studies that test many comparisons and conditions, a greater fraction of statistically significant results may in fact be false positives. This problem becomes even worse in research situations in which more hypotheses are "exploratory" and fewer correspond to true effects. This situation is analogous to using a medical test with low sensitivity to screen the general population versus screening a targeted high-risk group: the general population includes more asymptomatic individuals unlikely to have the disease, so a higher proportion of patients flagged by the test as having the disease will in fact be healthy. In this situation, the positive predictive value of the test will be low, meaning that a greater fraction of positives will be false positives [7]. The relationship among exploratory hypotheses, power, positive predictive value, and negative predictive value is illustrated in Figure 1.

A third effect of low power is perhaps even less well known but is gaining wider attention: Effect sizes of statistically significant results in low-powered studies can be exaggerated, so that significant effects appear larger than they really are [4,8]. This problem has obvious implications for follow-up studies and reproducibility of research; it is sometimes referred to as a "Type M (magnitude) error" or "the winner's curse," the latter referencing the common phenomenon of promising early findings not panning out in later studies. When initial studies are underpowered to detect the true effect of an intervention, it is often only those studies "lucky" enough to have drawn a sample with an unusually large effect that will produce statistically significant results and be published. Thus, the effects in these studies that pass the "statistical significance filter" are often inflated in publication. This is a problem for follow-up studies because they are usually designed to replicate these too-large effects, and since more power is needed to detect smaller effects, these follow-up studies are often underpowered and thus often will fail to find a significant effect.
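A small simulation makes the winner's curse tangible. The sketch below is my own illustration with assumed numbers, not data from any real study: in a deliberately underpowered design, the average effect estimate among only the runs that reach significance is far larger than the true effect.

```python
# A sketch of the "winner's curse" (Type M error): in an underpowered design,
# the effect estimates that survive the significance filter overstate the
# true effect. All numbers here are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sigma, true_diff, alpha = 10, 25.0, 10.0, 0.05  # deliberately low power (~14%)

significant_estimates = []
for _ in range(20_000):
    control = rng.normal(0.0, sigma, n)
    treated = rng.normal(true_diff, sigma, n)
    if stats.ttest_ind(treated, control).pvalue < alpha:
        significant_estimates.append(treated.mean() - control.mean())

print(f"True difference: {true_diff}")
print(f"Average estimate among significant runs: {np.mean(significant_estimates):.1f}")
# Prints roughly 28: the "published" effects run nearly 3 times the true effect.
```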

Misconceptions About Prestudy and Poststudy Power

Power should be used to guide the design of a study. When performing power calculations before data collection to aid in study design, researchers should choose the smallest effect size that is both plausible and clinically relevant. Although it may be difficult to estimate the true effect size, researchers should consider the size below which an effect would be considered negligible. This is often a difficult task, but as statistician Stephen Senn points out, "An astronomer does not know the magnitude of new stars until he has found them, but the magnitude of star he is looking for determines how much he has to spend on a telescope" [9].

Power provides very little information after a study is completed. Power is the probability before the study that the statistical test will correctly pick up on a hypothesized effect size if it is present. After a study, therefore, prestudy power calculations cannot reveal how well the hypothesized effects fit with data actually observed in the study.

After a study, some researchers instead use the observed data to calculate power, in what is known as "observed power" or "post hoc power." This post hoc power, however, is simply a restatement of the P value and therefore contains no new information [10]. For example, after finding a nonsignificant effect, sometimes researchers will point to a low post hoc power to make the argument that the study was simply underpowered to detect the observed effect, and the results are likely to be a false negative. Yet for any nonsignificant result, it can be shown that the power (with very few exceptions) will necessarily be <50%. So using post hoc power does not provide more help in interpreting nonsignificant results.
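The claim that observed power merely restates the P value can be seen with a normal-approximation sketch (my own illustration, not the article's): for a two-sided z-style test, post hoc power evaluated at the observed effect depends only on the observed P value, and it crosses 50% exactly at P = α.

```python
# A normal-approximation sketch of why "observed power" is the P value in
# disguise. The tiny opposite-tail rejection probability is ignored here.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = .05

def observed_power(p_value: float) -> float:
    z_obs = norm.ppf(1 - p_value / 2)  # recover |z| from the two-sided P value
    return norm.sf(z_crit - z_obs)     # P(significant) if the true effect equaled the estimate

for p in (0.01, 0.05, 0.20, 0.80):
    print(f"P = {p:.2f} -> observed power = {observed_power(p):.2f}")
# P = .05 maps to exactly 50%; every nonsignificant P maps below 50%.
```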

Power Calculation: Example Calculations for Power and Sample Size

Software such as G*Power will easily perform the following calculations; they are provided here for readers seeking a slightly deeper conceptual understanding. Note that the Z test statistic rather than the more accurate t test statistic is used in the interest of providing a simple illustration.

Suppose researchers want to use the Functional Independence Measure to investigate the difference between a treatment group and a control group of patients. They would like to have enough power in their study to detect a difference |μ₁ − μ₂| of at least 22 points, which is the Minimum Clinically Important Difference. From previous studies they believe the observations are roughly normally distributed and the population standard deviation σ for the Functional Independence Measure score is about 25 points for each group. They plan to do a standard 2-sided hypothesis test of no difference between groups, with a significance threshold α of .05 and equal sample sizes for each group.

What sample size do the researchers need to have a study with 80% power to detect a difference of 22 points?

Solution:

n = minimum sample size for each group = \frac{2\sigma^2 (z_{\alpha/2} + z_\beta)^2}{(\mu_1 - \mu_2)^2}

where
σ = population standard deviation (assumed to be equal for both groups)
z_{α/2} = standard normal value for which 100(α/2)% of the values fall in the upper tail (for the typical significance threshold of α = .05, z_{α/2} = 1.96)
z_β = standard normal value for which 100β% of the values fall in the upper tail (for the typical power of 80%, β = 1 − power = .20 and z_β = 0.841)
μ₁ − μ₂ = the difference that the study will be powered to detect

So in this example, σ = 25, α = .05, z_{α/2} = 1.96, β = .20, z_β = 0.841, and |μ₁ − μ₂| = 22.

Therefore, n = 2 × 25² × (1.96 + 0.841)² / 22² = 20.26. So a minimum of 21 patients in each group would be necessary for the study to have 80% power to detect a difference of 22 points, given the assumptions.

Note: Closer inspection of the formula for the sample size shows the relationship between power and the 4 elements: sample size, population standard deviation, significance threshold, and expected effect size. A larger population standard deviation σ will increase the sample size necessary to achieve the desired power; a larger α (less strict threshold) will decrease z_{α/2} and therefore decrease the necessary sample size; accepting a lower power will increase β, decrease z_β, and therefore decrease the necessary sample size; searching for a larger mean difference |μ₁ − μ₂| will decrease the necessary sample size.
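For readers who prefer to verify the arithmetic in software, the short sketch below recomputes the example. The statsmodels cross-check is an assumption of convenience (the article itself cites G*Power); statsmodels uses the more accurate t distribution, so its answer runs slightly higher than the z-based formula.

```python
# Numeric check of the sample size formula above, with a t-based cross-check.
import math
from scipy.stats import norm
from statsmodels.stats.power import TTestIndPower

sigma, alpha, power, diff = 25.0, 0.05, 0.80, 22.0  # the example's inputs
z_a2 = norm.ppf(1 - alpha / 2)  # 1.960
z_b = norm.ppf(power)           # 0.841

n = 2 * sigma**2 * (z_a2 + z_b)**2 / diff**2
print(f"z-based n per group: {n:.2f} -> round up to {math.ceil(n)}")  # 20.26 -> 21

n_t = TTestIndPower().solve_power(effect_size=diff / sigma, alpha=alpha, power=power)
print(f"t-based n per group: {n_t:.1f}")  # about 21
```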

References

1. Faul F, Erdfelder E, Lang AG, Buchner A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods 2007;39:175-191.
2. Noordzij M, Tripepi G, Dekker FW, Zoccali C, Tanck MW, Jager KJ. Sample size calculations: Basic principles and common pitfalls. Nephrol Dial Transplant 2010;25:1388-1393.
3. Ioannidis JP. Why most published research findings are false. PLoS Med 2005;2:e124.
4. Button K, Ioannidis JPA, Mokrysz C, et al. Power failure: Why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 2013;14:365-376.
5. Tavernier E, Trinquart L, Giraudeau B. Finding alternatives to the dogma of power based sample size calculation: Is a fixed sample size prospective meta-experiment a potential alternative? PLoS One 2016;11:e0158604.
6. Johnson VE. Revised standards for statistical evidence. Proc Natl Acad Sci U S A 2013;110:19313-19317.
7. Krzywinski M, Altman N. Points of significance: Power and sample size. Nat Methods 2013;10:1139-1140.
8. Gelman A, Carlin J. Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspect Psychol Sci 2014;9:641-651.
9. Senn SJ. Power is indeed irrelevant in interpreting completed studies. BMJ 2002;325:1304.
10. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994;121:200-206.

Disclosure

R.L.N. Department of Science, Technology, and Mathematics, Gallaudet University, 800 Florida Avenue, NE, Washington, DC 20002-3695. Address correspondence to: R.L.N.; e-mail: [email protected]

Disclosure: nothing to disclose

Accepted August 11, 2016.
