Statistical Power in Software Engineering (SE) Experiments
08/2024
https://www.sciencedirect.com/science/article/pii/S0950584905001333
Cited by 361 in Google Scholar
Context
Statistical power is an inherent part of empirical studies employing significance
testing and is essential for planning the study, for interpreting the results of a
study, and for the validity of the conclusions.
This paper reports on a quantitative assessment of the statistical power of
empirical software engineering research based on the 103 controlled experiment
papers (out of a total of 5453 papers) published in 9 major software engineering
journals and 3 conference proceedings in the decade 1993-2002.
Introduction
An important application of statistical significance testing in empirical software
engineering (ESE) research is hypothesis testing in controlled experiments.
A key component of such testing is the notion of statistical power, which is defined as
the probability that a statistical test will correctly reject the null hypothesis.
A test without sufficient statistical power will fail to provide the researcher with the
information needed to draw conclusions about accepting or rejecting the null
hypothesis.
Knowledge of statistical power can influence the planning, execution, and outcomes of empirical research.
“If resources are limited and prevent achieving a satisfactory level of statistical power,
the research is probably not worth the time, effort, and cost of inferential statistics.”
Introduction
The authors suggest that statistical power is not reported or given due attention in the ESE
literature, leading to potentially flawed research designs and questionable validity of
results.
Objective
(1) conduct a systematic review and quantitative assessment of the statistical
power of ESE research in a sample of published controlled experiments,
(2) discuss the implications of these findings, and
(3) discuss techniques that ESE researchers can use to increase the statistical
power of their studies in order to improve the quality and validity of ESE research.
Power and errors in statistical inference
According to the Neyman-Pearson* method of statistical inference, testing hypotheses
requires that we specify an acceptable level of statistical error, that is, the risk we are
willing to take regarding the correctness of our decisions. Regardless of the decision rule
we choose, there are generally two ways to be right and two ways to make a mistake in
choosing between the null hypothesis (H₀) and the alternative hypothesis (Hₐ).
*J. Neyman, E.S. Pearson. On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika,
20A (1928), pp. 175–240, 263–294
Type I error
It is the error committed when H₀ (the hypothesis being tested) is wrongly rejected.
That is, a type I error is committed whenever the sample results fall into the rejection
region even though H₀ is true.
Conventionally, the probability of committing a type I error is represented by the level of
statistical significance, denoted by the lowercase Greek letter alpha ( α ).
Conversely, the probability of being correct, given that H₀ is true, equals 1 − α.
Type II error
The probability of making a type II error, also known as beta ( β ), is the probability of
failing to reject the null hypothesis when it is actually false.
Thus, when a sample result does not fall into the rejection region even though Hₐ is true,
we commit a type II error.
Consequently, the probability of correctly rejecting the null hypothesis, that is, the
probability of making a correct decision given that Hₐ is true, is 1 − β: the power of the
statistical test.
It is literally the probability of discovering that H₀ is incorrect, given the decision rule
and the true Hₐ.
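To make these quantities concrete, here is a minimal Python sketch (illustrative numbers, not taken from the paper) that computes α, β, and power (1 − β) for a one-sided z-test of H₀: μ = 0 against Hₐ: μ = δ, with known σ and sample size n:

```python
# Minimal sketch (assumed values): alpha, beta, and power for a one-sided z-test.
from scipy.stats import norm

alpha, delta, sigma, n = 0.05, 0.5, 1.0, 25   # assumed significance level, true effect, SD, sample size
se = sigma / n ** 0.5                         # standard error of the sample mean
crit = norm.ppf(1 - alpha) * se               # rejection threshold for the sample mean under H0

beta = norm.cdf(crit, loc=delta, scale=se)    # P(fail to reject H0 | Ha true) = Type II risk
power = 1 - beta                              # P(correctly reject H0 | Ha true)

print(f"alpha (Type I risk)  = {alpha}")
print(f"beta  (Type II risk) = {beta:.3f}")   # about 0.20 with these numbers
print(f"power (1 - beta)     = {power:.3f}")  # about 0.80
```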
Statistical power
Statistical power is very important when there is a real difference in the population.
When the phenomenon really exists, the statistical test must be powerful enough to
detect it.
If the test reveals a non-significant result in this case, the conclusion of "no effect" would
be misleading and we would be committing a type II error.
Statistical power
Traditionally, α is set to 0.05 to protect against type I error, while β is set to 0.20 to
protect against type II error.
Accepting these conventions also means that we are four times more protected
against type I errors than type II errors.
However, the distribution of risk between type I and type II errors must be appropriate
for the situation at hand.
Illustrative example
Space Shuttle Challenger, 1986 (its 10th mission)
*J. Cohen Statistical Power Analysis for the Behavioral Sciences (second ed.), Laurence Erlbaum, Hillsdale, New Jersey (1988)
Determinants of statistical power
Distribution curves:
The curve on the left represents the distribution
under the null hypothesis (H₀).
The curve on the right represents the distribution
under the alternative hypothesis (Hₐ).
Significance criterion (α):
The shaded region under the curve to the right
of the vertical line represents the area where the
null hypothesis is rejected.
α is the probability of making a Type I error, that is, of rejecting H₀ when it is actually
true: the probability of incorrectly rejecting the null hypothesis.
Determinants of statistical power
Statistical power (1 - β):
The shaded region under the curve to the
right of the vertical line shows the statistical
power of the test.
Statistical power (1 − β) is the probability of
correctly rejecting H₀ when Hₐ is true.
Type II error (β):
The shaded region under the curve to the left
of the vertical line represents the probability of
making a Type II error (β), that is, failing to reject H₀
when Hₐ is true.
Power increases with larger α. A small α will result in relatively small power.
Determinants of statistical power
Summary:
A two-sided test is more conservative and can detect effects in both directions, but with
less power.
A one-sided test is more powerful in the predicted direction, but cannot detect effects in
the opposite direction.
The researcher must choose the type of test based on the nature of the hypothesis and the
context of the experiment.
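As a small illustration of this trade-off (assumed parameters, not from the paper), the power of one- and two-sided independent-samples t-tests can be compared at the same α and sample size, e.g. with statsmodels:

```python
# Minimal sketch (assumed values): one-sided vs. two-sided power at equal alpha and n.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
common = dict(effect_size=0.5, nobs1=30, alpha=0.05)   # assumed: medium effect, 30 subjects per group

power_two_sided = analysis.power(alternative='two-sided', **common)
power_one_sided = analysis.power(alternative='larger', **common)   # effect predicted in one direction

print(f"Two-sided power: {power_two_sided:.2f}")   # lower, but detects effects in either direction
print(f"One-sided power: {power_one_sided:.2f}")   # higher, but blind to the opposite direction
```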
Sample size
At any given α level, a larger sample size reduces the standard deviations of the
sampling distributions for H₀ and Hₐ.
This reduction results in less overlap of the distributions, higher precision, and therefore
higher power.
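A minimal sketch of this relationship (assumed values): holding α = 0.05 and a medium effect size (d = 0.5) fixed, power grows steadily with the number of subjects per group:

```python
# Minimal sketch (assumed values): power as a function of sample size.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in (10, 20, 40, 80, 160):
    p = analysis.power(effect_size=0.5, nobs1=n, alpha=0.05, alternative='two-sided')
    print(f"n per group = {n:3d} -> power = {p:.2f}")   # power rises toward 1 as n increases
```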
Population effect size (ES)
The actual size of the difference between H₀ and Hₐ (the null hypothesis is that the effect size
is 0), i.e., the degree to which the phenomenon is present in the population.
The larger the effect size, the greater the likelihood that the effect will be detected and
the null hypothesis rejected.
The nature of the effect size will vary from one statistical procedure to another (e.g., a
standardized mean difference or a correlation coefficient), but its role in power analysis
is the same across procedures.
Therefore, each statistical test has its own continuous, unscaled effect size index, ranging
upward from zero.
Thus, while p-values reveal whether a finding is statistically significant, effect size indices
are measures of practical significance or relevance.
The interpretation of effect sizes is critical, because it is possible for a finding to be
statistically significant but not relevant, and vice versa.
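As an illustration (standard textbook definitions, not specific to this paper), two widely used effect size indices are Cohen's d for a standardized mean difference and Pearson's r for a correlation; both are zero under the null hypothesis of no effect and grow as the phenomenon becomes more pronounced:

```latex
% Standard effect size indices (illustrative):
% Cohen's d: standardized difference between two population means.
% Pearson's r: strength of the linear association between X and Y.
d = \frac{\mu_1 - \mu_2}{\sigma}
\qquad\qquad
r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
```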
A finding may be "statistically significant but not
relevant"
It refers to the difference between statistical significance and practical relevance or
importance of the effect.
Suppose you are a researcher studying the impact of a new drug on lowering
cholesterol. You conduct an experiment with a large sample of people and find the
following:
Null hypothesis (H₀): The new drug does not lower cholesterol more than placebo.
Alternative hypothesis (Hₐ): The new drug lowers cholesterol more than placebo.
After analyzing the data, you get a very small p-value (for example, p=0.0001). This
indicates that the results are statistically significant and that you can reject the null
hypothesis. However, when you calculate the effect size (the magnitude of the
cholesterol reduction), you find that the drug only lowers cholesterol by 0.5 mg/dL on
average.
Let's interpret the results:
Statistical significance:
The low p-value means that the observed reduction in cholesterol is very unlikely to
be due to chance. Therefore, the result is statistically significant.
Practical relevance:
However, an average reduction of 0.5 mg/dL in cholesterol levels is very small. From a
clinical perspective, this decrease might not have a significant impact on patients'
health. Therefore, although the result is statistically significant, it is not clinically
relevant.
On the other hand:
Situation where the effect is large (e.g. a 20 mg/dL reduction in cholesterol levels), but
because the sample is small, the p-value does not meet the threshold for statistical
significance (e.g. p=0.06). In this case, the effect is practically relevant but not
statistically significant due to lack of statistical power.
Summary:
Effect size: Indicates the magnitude of the difference or impact of a treatment or
intervention. A large effect size suggests that the result is practically relevant or
important.
Statistical significance: Indicates the likelihood that a result is due to chance. A low
p-value suggests that the result is unlikely to be an accident.
The key is that a result can be statistically significant but not practically important (as in
the first example), or it can be practically important but not statistically significant if the
study is underpowered (as in the second example). That is why it is crucial to interpret
both statistical significance and effect size to draw complete and accurate conclusions.
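The cholesterol example can be reproduced with a short simulation (hypothetical numbers: a true reduction of 0.5 mg/dL, an assumed SD of 30 mg/dL, and a very large sample), showing a "significant" p-value alongside a negligible effect size:

```python
# Minimal simulation sketch (hypothetical numbers) of "significant but not relevant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, sd = 200_000, 30.0                  # assumed: subjects per group, SD of cholesterol change (mg/dL)
placebo = rng.normal(0.0, sd, n)
drug = rng.normal(-0.5, sd, n)         # true mean reduction of only 0.5 mg/dL

t_stat, p_value = stats.ttest_ind(drug, placebo)
pooled_sd = np.sqrt((placebo.var(ddof=1) + drug.var(ddof=1)) / 2)
cohens_d = (placebo.mean() - drug.mean()) / pooled_sd

print(f"p-value   = {p_value:.2g}")    # almost always well below 0.05 at this sample size
print(f"Cohen's d = {cohens_d:.3f}")   # around 0.02, far below even a 'small' effect (0.2)
```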
Effect size indices and their values for the most common
statistical tests
*D.I.K. Sjøberg, et al. A survey of controlled experiments in software engineering. IEEE Transactions on Software Engineering (2005)
ESE studies using controlled experiments
Procedure
Five experiments were reported in more than one article; in these cases, the most recent
article was included.
This evaluation resulted in 78 articles. From these articles, they identified 459 statistical
tests corresponding to the main hypotheses or research questions of 92 experiments.
For comparison, the mean sample size of all tests in Rademacher's* study of power in IS
research was 179 subjects (with a standard deviation of 196).
Cumulative percentage and power frequency
distribution
Small effect size: The average
statistical power of the tests when
assuming small effect sizes was as low
as 0.11. This means that if the
phenomena being investigated exhibit
a small effect size, then on average the
studies examined have only a one in 10
chance of detecting them.
The table shows that only one test is
above the conventional power level of
0.80 and that 97% have less than a 50%
chance of detecting significant
findings.
Cumulative percentage and power frequency
distribution
Medium effect size: When assuming
medium effect sizes, the average
statistical power of the tests increases
to 0.36.
On average, studies have only about a
one-in-three chance of detecting
phenomena that exhibit a medium
effect size.
The table indicates that only 6% of the
tests examined achieve the
conventional power level of 0.80 or
better, and that 78% of the tests have
less than a 50% chance of detecting
significant results.
Cumulative percentage and power frequency
distribution
Large effect size: Assuming large effect
sizes, the average statistical power of
the tests increases further, to 0.63.
On average, studies still have a slightly
less than 2/3 chance of detecting their
phenomena.
Cumulative percentage and power frequency
distribution
The table shows that 31% of the tests
meet or exceed the .80 power level, and
70% have a greater than 50% chance of
correctly rejecting their null hypotheses.
Thus, even when we assume that the
effect being studied is so large as to
make statistical testing unnecessary, as
many as 69% of the tests fall below the
.80 level.
Power analysis by type of statistical test
Of the 78 articles in the sample, 12 analyzed the statistical power associated with
null hypothesis testing.
Of these studies, 9 elaborated on the specific procedures for determining the
statistical power of the tests.
3 of the 9 performed an a priori power analysis, while 6 performed a post hoc (a
posteriori) analysis.
Only 1 of the articles that performed an a priori power analysis used it to guide the
choice of sample size. In this case, the authors explicitly stated that they were
only interested in large effect sizes and that they considered a power level of 0.5
to be sufficient. Even so, they included so few subjects in the experiment that the
average power to detect a large effect size from their statistical tests was as low
as 0.28.
Results. Qualitative evaluation
They compare the results of the current study with two corresponding reviews of
statistical power levels in IS research.
* J. Baroudi, W. Orlikowski. The problem of statistical power in MIS research. MIS Quarterly, (1989)
** R.A. Rademacher. Statistical power in information systems research: application and impact on the discipline. Journal of
Computer Information Systems, (1999)
Comparison with IS research
The results of the current study show that the power of experimental research in
SE is far below the levels achieved by IS research.
One reason for this difference could be that the IS field has benefited from
Baroudi and Orlikowski's early review of power, and thus explicit attention has
been paid to statistical power, which has paid off with contemporary research
showing improved power levels.
Comparison with IS research
What is particularly worrying for SE research is that the power level shown by the
current study not only falls markedly below the level of Rademacher's 1999 study,
but also falls markedly below the level of Baroudi and Orlikowski's 1989 study.
Comparison with IS research
Medium effect sizes are considered the target level in IS research**, and the
average power to detect these effect sizes is .81 in IS research; only 6% of the tests
examined in the current research reach this level, and up to 78% of the tests in the
current research have less than a 50% chance of detecting significant results for
medium effects.
Comparison with IS research
Unless it can be shown that medium (and large) effect sizes are irrelevant for SE
research, this should be a matter of concern for SE researchers and practitioners.
Consequently, we should explore in more depth what constitutes meaningful
effect sizes within SE research, in order to establish specific SE conventions.
The results show that, on average, IS research employs sample sizes that are
twice as large as those found in SE research for these tests. In fact, the situation is
slightly worse than that, as observations are used as the sample size in the current
study, whereas the IS studies refer to subjects. Furthermore, the power levels of the
current study to detect medium effect sizes are only about half of the
corresponding power levels of IS research.
Implications for the interpretation of experimental SE
research
Consideration of power issues in experimental SE research is very limited.
15.4% of the articles discussed statistical power in relation to their null hypothesis
test, but only in one article did the authors perform an a priori power analysis.
Post hoc power analyses showed that, overall, the studies examined had low
statistical power.
Even for large effect sizes, up to 69% of the tests fell below the .80 level. This
implies that statistical power considerations are underemphasized in
experimental SE research.
A test without sufficient statistical power will not provide the researcher with
enough information to draw conclusions about accepting or rejecting the null
hypothesis.
Implications for the interpretation of experimental SE
research
If no effects are detected in this situation, researchers should not conclude that
the phenomenon does not exist. Rather, they should report that no significant
findings were demonstrated in their study and that this may be due to the low
statistical power associated with their tests.
Problem with underpowered studies, especially when multiple tests are
performed: Although the probability of any individual test being significant is low,
the probability of obtaining at least one significant result increases with the
number of tests. However, this significant result could be misleading because it
does not reflect the true power of the study to detect specific effects. Therefore, it
is important for researchers to distinguish between main (primary) and
additional (secondary) tests in order to correctly interpret the results.
Implications for the interpretation of experimental SE
research
Low statistical power also has a substantial impact on the ability to replicate
experimental studies based on null hypothesis testing.
…the more we are guided by theory and prior observation but conduct an
underpowered study, the more we decrease the likelihood of replication. Thus, an
underpowered literature is not just making a passive mistake, but may actually
contribute to diverting attention and resources in unproductive directions
(Ottenbacher*).
Consequently, the tendency to under-power SE studies makes replication and
meta-analysis difficult and will tend to produce an inconsistent body of literature,
thus hindering the advancement of knowledge.
* K.J. Ottenbacher. The power of replications and the replications of power. The American Statistician. (1996)
Interpretation of studies with very high power levels
Some of the studies in this review employed large sample sizes, ranging from 400
to 800 observations.
This poses a problem of interpretation, because virtually any study can show
significant results if the sample size is large enough, regardless of how small the
actual effect size may be.
It is therefore of particular importance that researchers who report statistically
significant results from studies with very large sample sizes, or with very large
power levels, also report the corresponding effect sizes. This will put the reader in
a better position to interpret the results and judge whether the statistically
significant findings have practical importance.
Ways to increase statistical power
1: Increase the sample size.
2: Relax the significance criterion. This approach is not common because of the
widespread concern to keep type I errors at a low, fixed level of, say, 0.01 or 0.05.
However, the significance criterion and power level should be determined by the
relative severity of type I and type II errors.
J. Cohen. Statistical Power Analysis for the Behavioral Sciences (second ed.), Laurence Erlbaum, Hillsdale, New Jersey (1988)
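A minimal sketch of an a priori power analysis along these lines (illustrative values): solving for the sample size needed to reach the conventional 0.80 power level at α = 0.05 for a medium effect size:

```python
# Minimal sketch (assumed values): a priori power analysis to choose the sample size.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_required = analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05,
                                  alternative='two-sided')
print(f"Required subjects per group for d = 0.5: {n_required:.0f}")   # roughly 64 per group
```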
Recommendations for future research
Analyze the implications of the relative severity of type I and type II errors for the
specific treatment situation being investigated.
Unless there are specific circumstances, they do not recommend that
researchers relax the commonly accepted standard of setting alpha at 0.05. They
recommend that SE researchers plan for a power level of at least 0.80 and
perform power analyses accordingly.
Therefore, rather than relaxing alpha, they generally recommend increasing
power to better balance the odds of making type I and type II errors.
Recommendations for future research
They recommend that significance tests from experimental studies be
accompanied by effect size measures and confidence intervals to better inform
readers.
In addition, studies should report the data needed for secondary analyses, such as
sample sizes, the alpha level, means, standard deviations, the statistical tests used,
whether the tests were one- or two-tailed, and the values of the test statistics.
They recommend that journal editors and reviewers pay more attention to the
issue of statistical power.
In this way, readers will be better able to make informed decisions about the
validity of the results and meta-analysts will be better able to perform secondary
analyses.
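A minimal sketch (hypothetical data) of the kind of reporting recommended here: accompanying the significance test with an effect size (Cohen's d) and a confidence interval for the mean difference:

```python
# Minimal sketch (hypothetical data): report effect size and CI alongside the test.
import numpy as np
from scipy import stats

group_a = np.array([12.1, 10.4, 13.8, 11.5, 12.9, 10.8, 13.1, 11.9])
group_b = np.array([ 9.7,  8.9, 11.2,  9.4, 10.6,  9.1, 10.9,  9.8])
n1, n2 = len(group_a), len(group_b)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

diff = group_a.mean() - group_b.mean()
pooled_sd = np.sqrt(((n1 - 1) * group_a.var(ddof=1) + (n2 - 1) * group_b.var(ddof=1))
                    / (n1 + n2 - 2))
cohens_d = diff / pooled_sd                          # standardized mean difference

se_diff = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, n1 + n2 - 2)             # two-sided 95% critical value
ci_low, ci_high = diff - t_crit * se_diff, diff + t_crit * se_diff

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"Cohen's d = {cohens_d:.2f}")
print(f"95% CI for the mean difference: [{ci_low:.2f}, {ci_high:.2f}]")
```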
Conclusions
Since this is the first study of its kind in SE research, it was not possible to
compare the power data of the current study with previous experimental SE
research. They therefore found it useful to turn to the related discipline of IS
research, which gave them convenient reference data to measure and validate
the results of the power analysis.
The results showed that power issues are generally not given due attention in SE
research and that the level of statistical power is substantially below accepted
norms as well as levels found in the related discipline of IS research.
Only 6% of the studies included in this analysis had a power of 0.80 or more to
detect a medium effect size, which most IS researchers consider to be the target
level.
Conclusions
Attention should be paid to the adequacy of sample sizes and research designs
in experimental investigations that employ statistical significance testing, to ensure
acceptable levels of power (i.e., 1 − β ≥ .80), assuming that type I errors are controlled at
α = .05.
At a minimum, reporting of significance tests should be improved by reporting
effect sizes and confidence intervals to allow for secondary analysis and to afford
the reader a richer understanding of and greater confidence in the results and
implications of a study.
Thank You
For Your Attention