Statistical Power in Software Engineering Experiments

This systematic review assesses the statistical power of empirical software engineering research based on 103 controlled experiment papers published between 1993 and 2002. It highlights the importance of statistical power for valid hypothesis testing and the planning of studies, revealing that many studies lack sufficient power, leading to potentially flawed conclusions. The paper discusses techniques to improve statistical power and emphasizes the need for researchers to consider both statistical significance and effect size in their analyses.


A systematic review of statistical power

in software engineering experiments


Tore Dybå, Vigdis By Kampenes, Dag I.K. Sjøberg.
Information and Software Technology, Volume 48, Issue 8, 2006.

Danelys Brito González

08/2024

https://www.sciencedirect.com/science/article/pii/S0950584905001333
Cited by 361 in Google Scholar
Context
Statistical power is an inherent part of empirical studies employing significance
testing and is essential for planning the study, for interpreting the results of a
study, and for the validity of the conclusions.
This paper reports on a quantitative assessment of the statistical power of
empirical software engineering research based on the 103 controlled experiment
papers (out of a total of 5453 papers) published in 9 major software engineering
journals and 3 conference proceedings in the decade 1993-2002.
Introduction
An important application of statistical significance testing in empirical software
engineering (ESE) research is hypothesis testing in controlled experiments.
A key component of such testing is the notion of statistical power, which is defined as
the probability that a statistical test will correctly reject the null hypothesis.
A test without sufficient statistical power will fail to provide the researcher with the
information needed to draw conclusions about accepting or rejecting the null
hypothesis.
Knowledge of statistical power can influence the planning, execution, and outcomes of
empirical research.
“If resources are limited and prevent achieving a satisfactory level of statistical power,
the research is probably not worth the time, effort, and cost of inferential statistics.”
Introduction
The authors suggest that statistical power is not reported or given due attention in the ESE
literature, leading to potentially flawed research designs and questionable validity of
results.
Objective
(1) conduct a systematic review and quantitative assessment of the statistical
power of ESE research in a sample of published controlled experiments,
(2) discuss the implications of these findings, and
(3) discuss techniques that ESE researchers can use to increase the statistical
power of their studies in order to improve the quality and validity of ESE research.
Power and errors in statistical inference
According to the Neyman-Pearson* method of statistical inference, testing hypotheses
requires that we specify an acceptable level of statistical error, that is, the risk we are
willing to take regarding the correctness of our decisions. Regardless of the decision rule
we choose, there are generally two ways to be right and two ways to make a mistake in
choosing between the null hypothesis (H₀) and the alternative hypothesis (H₁).

*J. Neyman, E.S. Pearson. On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika,
20A (1928), pp. 175–240 and 263–294.
Type I error
It is the error that is committed when H₀ (the hypothesis being tested) is wrongly rejected.
That is, a type I error is committed whenever the sample results fall into the rejection
region, even if H₀ is true.
Conventionally, the probability of committing a type I error is represented by the level of
statistical significance, denoted by the lowercase Greek letter alpha ( α ).
On the contrary, the probability of being correct, given that H₀ is true, is equal to 1−α .
Type II error
The probability of making a type II error, also known as beta ( β ), is the probability of
failing to reject the null hypothesis when it is actually false.
Thus, when a sample result does not fall into the rejection region, even though H₁ is true,
we are induced to make a type II error.
Consequently, the probability of correctly rejecting the null hypothesis, that is, the
probability of making a correct decision given that H₁ is true, is 1− β ; the power of the
statistical test.
It is literally the probability of discovering that H₀ is incorrect, given the decision rule
and the true H₁.
Statistical power

Statistical power is very important when there is a real difference in the population.
When the phenomenon really exists, the statistical test must be powerful enough to
detect it.
If the test reveals a non-significant result in this case, the conclusion of "no effect" would
be misleading and we would be committing a type II error.
Statistical power

Traditionally, α is set to 0.05 to protect against type I error, while β is set to 0.20 to
protect against type II error.
Accepting these conventions also means that we are four times more protected
against type I errors than type II errors.
However, the distribution of risk between type I and type II errors must be appropriate
for the situation at hand.
Illustrative example
Space Shuttle Challenger in 1986 (10th mission)

NASA officials faced a choice between two types of assumptions, each with a distinctive cost:
1) the shuttle was not safe to fly because the performance of the O-ring used in the
booster rocket was different from that used in previous missions.
2) the shuttle was safe to fly because there would be no difference in O-ring performance
between this mission and previous ones.

If the mission had been aborted and the O-ring had in fact worked, a Type I error would
have been committed.
Obviously, the cost of the Type II error, launching with a faulty O-ring, was far greater
than the cost that would have been incurred with a Type I error.
Illustrative example
Disintegrated on its tenth mission
Determinants of statistical power
The fundamental approach to statistical power analysis was established by Cohen*,
who described the relationships among the four variables involved in statistical
inference:
significance criterion ( α ),
sample size ( N ),
population effect size (ES), and
statistical power (1− β ).
For any statistical model, these relationships are such that each is a function of the
other three.
Therefore, we can determine the power for any statistical test, given α , N , and ES

*J. Cohen Statistical Power Analysis for the Behavioral Sciences (second ed.), Laurence Erlbaum, Hillsdale, New Jersey (1988)
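The paper determines power from Cohen's tables; purely as an illustration of the same relationship (power as a function of α, N, and ES), here is a minimal Python sketch for an independent two-sample t-test using the noncentral t distribution. The sample sizes and effect sizes below are illustrative assumptions, not values prescribed by the paper.

```python
# Minimal sketch (not from the paper): post hoc power of a two-sample t-test,
# computed from the three other determinants (alpha, sample size, effect size).
import numpy as np
from scipy.stats import t, nct

def two_sample_t_power(d, n_per_group, alpha=0.05, two_sided=True):
    """Power of an independent two-sample t-test for Cohen's d."""
    df = 2 * n_per_group - 2                    # degrees of freedom
    ncp = d * np.sqrt(n_per_group / 2)          # noncentrality parameter
    if two_sided:
        t_crit = t.ppf(1 - alpha / 2, df)
        # probability of landing in either rejection region when H1 is true
        return (1 - nct.cdf(t_crit, df, ncp)) + nct.cdf(-t_crit, df, ncp)
    t_crit = t.ppf(1 - alpha, df)
    return 1 - nct.cdf(t_crit, df, ncp)

# Example: the review's median sample size was ~34 observations; assume 17 per group.
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(label, round(two_sample_t_power(d, n_per_group=17), 2))
```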
Determinants of statistical power
Distribution curves:
The curve on the left represents the distribution
under the null hypothesis (H₀).
The curve on the right represents the distribution
under the alternative hypothesis (H₁).
Significance criterion (α):
The shaded region under the H₀ curve to the right
of the vertical line represents the area where the
null hypothesis is rejected.
α is the probability of making a Type I error, that
is, rejecting H₀ when it is actually true. That is, the
probability of incorrectly rejecting the null
hypothesis.
Determinants of statistical power
Statistical power (1− β):
The shaded region under the H₁ curve to the
right of the vertical line shows the statistical
power of the test.
Statistical power (1− β) is the probability of
correctly rejecting H₀ when H₁ is true.
Type II error (β):
The shaded region under the H₁ curve to the left
of the vertical line represents the probability of
making a Type II error (β), that is, failing to reject H₀
when H₁ is true.

Power increases with larger α. A small α will result in relatively small power.
Determinants of statistical power

The directionality of the significance criterion
also affects the power of a statistical test.
A two-tailed nondirectional test will have lower
power than a one-tailed directional test with
the same α, provided the sample result is in the
predicted direction.
A directional test has no power to detect effects
in the opposite direction than predicted.
Example of the effect of the directionality of the
significance criterion on the power of a statistical test
Suppose a researcher wants to test whether a new drug lowers blood pressure more than a
placebo. The researcher has two options for formulating his hypothesis:
1. Two-sided (non-directional) test:
a. Null hypothesis (H₀): The drug has no effect (difference in blood pressure is zero).
b. Alternative hypothesis (H₁): The drug has an effect (difference in blood pressure is
non-zero; it may be higher or lower).
2. One-sided (directional) test:
a. Null hypothesis (H₀): The drug does not reduce blood pressure (difference in blood
pressure is zero or higher, i.e. the drug might even increase blood pressure).
b. Alternative hypothesis (H₁): The drug reduces blood pressure (difference in blood
pressure is negative).
Effect of directionality on power:
Two-sided (non-directional) test:
In this case, the researcher is considering the possibility that the drug may have an
effect in either direction (either increasing or decreasing blood pressure). The
significance level α is split between the two tails of the distribution. This means that
the area that is considered "significant" is smaller at each end, which reduces the
probability of rejecting H₀ when H₁ is true, i.e. the power of the test is lower.
One-sided (directional) test:
Here, the researcher is interested in only one direction (the reduction in blood
pressure). All of the significance α is concentrated in a single tail of the distribution.
This means that the critical region is larger in that specific direction, which increases
the power of the test, as it is more likely to reject H₀ if H₁ is true and the drug does
indeed reduce blood pressure.
Limitation of directional testing:
If the drug's effect is in the opposite direction than predicted (for example, if it actually
increases blood pressure rather than lowers it), a directional test will not detect this
effect because it has no power in that direction.
This is what it means that a directional test has no power to detect effects in the
opposite direction than predicted.

Summary:
A two-sided test is more conservative and can detect effects in both directions, but with
less power.
A one-sided test is more powerful in the predicted direction, but cannot detect effects in
the opposite direction.

The researcher must choose the type of test based on the nature of the hypothesis and the
context of the experiment.
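As a rough numerical illustration of the directionality effect described above (not code from the paper), the following sketch compares one-sided and two-sided power at the same α for a hypothetical medium effect and 30 subjects per group.

```python
# Illustrative sketch: power of a one-sided vs. two-sided two-sample t-test at the
# same alpha, assuming the effect is in the predicted direction. Numbers are invented.
import numpy as np
from scipy.stats import t, nct

d, n, alpha = 0.5, 30, 0.05                  # hypothetical medium effect, 30 per group
df, ncp = 2 * n - 2, d * np.sqrt(n / 2)

two_sided = (1 - nct.cdf(t.ppf(1 - alpha / 2, df), df, ncp)) \
            + nct.cdf(-t.ppf(1 - alpha / 2, df), df, ncp)
one_sided = 1 - nct.cdf(t.ppf(1 - alpha, df), df, ncp)

print(f"two-sided power: {two_sided:.2f}")   # lower: alpha is split between the tails
print(f"one-sided power: {one_sided:.2f}")   # higher, but blind to the opposite direction
```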
Sample size
At any given α level, a larger sample size reduces the standard deviations of the
sampling distributions under H₀ and H₁.
This reduction results in less overlap of the distributions, higher precision, and therefore
higher power.
Population effect size (ES)
Actual size of the difference between H₀ and H₁ (the null hypothesis is that the effect size
is 0), i.e. the degree to which the phenomenon is present in the population.
The larger the effect size, the greater the likelihood that the effect will be detected and
the null hypothesis rejected.
The nature of the effect size will vary from one statistical procedure to another (e.g., a
standardized mean difference or a correlation coefficient), but its role in power analysis
is the same across procedures.
Therefore, each statistical test has its own continuous, unscaled effect size index, ranging
from zero.
Thus, while p-values reveal whether a finding is statistically significant, effect size indices
are measures of practical significance or relevance.
The interpretation of effect sizes is critical, because it is possible for a finding to be
statistically significant but not relevant, and vice versa.
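As an illustration of one common effect size index mentioned above (not code from the paper), a minimal sketch for computing Cohen's d, the standardized mean difference, from two samples; the data are invented.

```python
# Hedged sketch: Cohen's d (standardized mean difference) using the pooled
# standard deviation of two independent groups. Example data are illustrative only.
import numpy as np

def cohens_d(group_a, group_b):
    """Standardized mean difference between two independent samples."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy usage with invented scores
print(round(cohens_d([12, 15, 14, 13, 16], [10, 11, 13, 9, 12]), 2))
```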
A finding may be "statistically significant but not
relevant"
It refers to the difference between statistical significance and practical relevance or
importance of the effect.
Suppose you are a researcher studying the impact of a new drug on lowering
cholesterol. You conduct an experiment with a large sample of people and find the
following:
Null hypothesis ( ): The new drug does not lower cholesterol more than placebo.
Alternative hypothesis ( ): The new drug lowers cholesterol more than placebo.
After analyzing the data, you get a very small p-value (for example, p=0.0001). This
indicates that the results are statistically significant and that you can reject the null
hypothesis. However, when you calculate the effect size (the magnitude of the
cholesterol reduction), you find that the drug only lowers cholesterol by 0.5 mg/dL on
average.
Let's interpret the results:

Statistical significance:
The low p-value means that the observed reduction in cholesterol is very unlikely to
be due to chance. Therefore, the result is statistically significant.
Practical relevance:
However, an average reduction of 0.5 mg/dL in cholesterol levels is very small. From a
clinical perspective, this decrease might not have a significant impact on patients'
health. Therefore, although the result is statistically significant, it is not clinically
relevant.
On the other hand:
Situation where the effect is large (e.g. a 20 mg/dL reduction in cholesterol levels), but
because the sample is small, the p-value does not meet the threshold for statistical
significance (e.g. p=0.06). In this case, the effect is practically relevant but not
statistically significant due to lack of statistical power.
Summary:
Effect size: Indicates the magnitude of the difference or impact of a treatment or
intervention. A large effect size suggests that the result is practically relevant or
important.
Statistical significance: Indicates the likelihood that a result is due to chance. A low
p-value suggests that the result is unlikely to be an accident.
The key is that a result can be statistically significant but not practically important (as in
the first example), or it can be practically important but not statistically significant if the
study is underpowered (as in the second example). That is why it is crucial to interpret
both statistical significance and effect size to draw complete and accurate conclusions.
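To make the first scenario concrete, here is a small simulation sketch (numbers invented, not from the paper) showing how a very large sample can make a negligible effect statistically significant, which is why the effect size must be read alongside the p-value.

```python
# Illustrative simulation: with a very large sample, even a tiny true effect tends to
# produce a small p-value, while the standardized effect size stays negligible.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50_000                                   # very large hypothetical sample per group
placebo = rng.normal(200.0, 30.0, n)         # cholesterol-like scores (invented)
drug = rng.normal(199.5, 30.0, n)            # true difference of only 0.5 units

t_stat, p_value = stats.ttest_ind(drug, placebo)
pooled_sd = np.sqrt((drug.var(ddof=1) + placebo.var(ddof=1)) / 2)
d = (drug.mean() - placebo.mean()) / pooled_sd

print(f"p = {p_value:.4f}")   # typically below 0.05 with samples this large
print(f"d = {d:.3f}")         # yet the standardized effect is negligible
```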
Effect size indices and their values for the most common statistical tests

Cohen provided such an estimate of effect size.
Based on a review of previous
behavioral research,
he developed operational
definitions of three levels of effect
sizes (small, medium, and large)
with different quantitative levels
for different types of statistical
tests.
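The slide's table of effect size values did not survive this export; for reference, the widely cited Cohen (1988) conventions can be summarized as follows (a compact summary only; see Cohen for the exact definitions and caveats).

```python
# Cohen's (1988) conventional small/medium/large benchmarks for common effect
# size indices, summarized as a lookup table.
COHEN_BENCHMARKS = {
    "d  (t-test, standardized mean difference)": {"small": 0.20, "medium": 0.50, "large": 0.80},
    "r  (correlation)":                          {"small": 0.10, "medium": 0.30, "large": 0.50},
    "f  (one-way ANOVA)":                        {"small": 0.10, "medium": 0.25, "large": 0.40},
    "w  (chi-square)":                           {"small": 0.10, "medium": 0.30, "large": 0.50},
    "f2 (multiple regression)":                  {"small": 0.02, "medium": 0.15, "large": 0.35},
}

for index, levels in COHEN_BENCHMARKS.items():
    print(index, levels)
```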
Research method
They evaluated the 103 articles on controlled experiments (out of a total of 5,453
articles)*, published in 9 major software engineering journals and 3 conference
proceedings during the decade 1993-2002.
These journals and conference proceedings were chosen because they were considered
to be representative of ESE research.
Furthermore, since controlled experiments are empirical studies employing inferential
statistics, they were considered a relevant sample in this study.

*D.I.K. Sjøberg, et al. A survey of controlled experiments in software engineering. IEEE Transactions on Software Engineering (2005)
ESE studies using controlled experiments
Procedure
Five experiments were reported in more than 1 article. In these cases, they included the
most recent one.
This evaluation resulted in 78 articles. From these articles, they identified 459 statistical
tests corresponding to the main hypotheses or research questions of 92 experiments.

Twenty-five articles were excluded due to duplicate reporting, lack of statistical
analysis, or unspecified statistical tests.
Statistical tests used
Both parametric and nonparametric tests of the main hypotheses were included.
Distribution of the 459 statistical tests in the final sample for which post hoc statistical
power could be determined.
The main parametric tests were analysis of variance (ANOVA) and t tests.
The main nonparametric tests were Wilcoxon, Mann-Whitney, Fisher's exact test,
Chi-square, and Kruskal-Wallis.
Other tests include Tukey's pairwise comparison (18), nonparametric rank sum test (6),
Poisson (3), regression (3), Mood's median test (2), proportion (2), and Spearman's rank
correlation (2).
Statistical tests used in 92 controlled SE experiments
Results
The 78 articles selected for this study, with data available to calculate power, yielded 459
statistical tests of the main hypotheses investigated in the 92 reported controlled
experiments.
Sample sizes in 92 controlled SE experiments

Thus, while the mean sample size of the 459 statistical tests was 55 observations,
with a standard deviation of 87, the median sample size was as low as 34 observations.
Consequently, the mean number
of subjects in the experiments
studied was 48, with a standard
deviation of 51 and a median of 30.

For comparison, the mean sample size of all tests in Rademacher's* study of power in IS
research was 179 subjects (with a standard deviation of 196).
Cumulative percentage and power frequency
distribution
Small effect size: The average
statistical power of the tests when
assuming small effect sizes was as low
as 0.11. This means that if the
phenomena being investigated exhibit
a small effect size, then on average the
studies examined have only a one in 10
chance of detecting them.
The table shows that only one test is
above the conventional power level of
0.80 and that 97% have less than a 50%
chance of detecting significant
findings.
Cumulative percentage and power frequency
distribution
Medium effect size: When assuming
medium effect sizes, the average
statistical power of the tests increases
to 0.36.
Studies only have, on average, about
1/3 the chance of detecting
phenomena that exhibit a medium
effect size.
The table indicates that only 6% of the
tests examined achieve the
conventional power level of 0.80 or
better, and that 78% of the tests have
less than a 50% chance of detecting
significant results.
Cumulative percentage and power frequency
distribution
Large effect size: Assuming large effect
sizes, the average statistical power of
the tests increases further, to 0.63.
On average, studies still have a slightly
less than 2/3 chance of detecting their
phenomena.
Cumulative percentage and power frequency
distribution
The table shows that 31% of the tests
meet or exceed the .80 power level, and
70% get a greater than 50% chance of
correctly rejecting their null hypotheses.
Thus, even when we assume that the
effect being studied is so large as to
make statistical testing unnecessary, as
many as 69% of the tests fall below the
.80 level.
Power analysis by type of statistical test

None of the tests achieve the conventional power level of .80 on average,
not even when assuming large effect sizes.
ANOVAs and t-tests account for nearly
2/3 of all statistical analyses in
experiments, yet their average power
level for detecting large effect sizes is
only .67 and .61, respectively, while the
corresponding power levels assuming
medium effect sizes are as low as .40
and .33.
Results. Quantitative evaluation

The quantitative evaluation revealed that controlled SE experiments, on average, have
only a 2/3 chance of detecting phenomena with large effect sizes.
The corresponding probability of detecting phenomena with medium effect sizes
is about one in three (1/3),
While there is only a one in ten (1/10) chance of detecting small effect sizes.
Results. Qualitative evaluation

Of the 78 articles in the sample, 12 analyzed the statistical power associated with
null hypothesis testing.
Of these studies, 9 elaborated on the specific procedures for determining the
statistical power of the tests.
3 of the 9 performed an a priori power analysis, while 6 performed a post hoc
(a posteriori) analysis.
Only 1 of the articles that performed an a priori power analysis used it to guide the
choice of sample size. In this case, the authors explicitly stated that they were
only interested in large effect sizes and that they considered a power level of 0.5
to be sufficient. Even so, they included so few subjects in the experiment that the
average power to detect a large effect size from their statistical tests was as low
as 0.28.
Results. Qualitative evaluation

Of the 6 articles that performed a post hoc power analysis, 2 gave recommendations
on the sample sizes needed for future replication studies.
Thus, overall, 84.6% of the experimental studies included in the sample made no
reference to the statistical power of their significance tests.
Comparison with IS research

They compare the results of the current study with two corresponding reviews of
statistical power levels in IS research.

* J. Baroudi, W. Orlikowski. The problem of statistical power in MIS research. MIS Quarterly, (1989)
** R.A. Rademacher. Statistical power in information systems research: application and impact on the discipline. Journal of
Computer Information Systems, (1999)
Comparison with IS research

The results of the current study show that the power of experimental research in
SE is far below the levels achieved by IS research.
One reason for this difference could be that the IS field has benefited from
Baroudi and Orlikowski's early review of power, and thus explicit attention has
been paid to statistical power, which has paid off with contemporary research
showing improved power levels.
Comparison with IS research

What is particularly worrying for SE research is that the power level shown by the
current study not only falls markedly below the level of Rademacher's 1999 study,
but also falls markedly below the level of Baroudi and Orlikowski's 1989 study.
Comparison with IS research

Medium effect sizes are considered the target level in IS research**, and the
average power to detect these effect sizes is .81 in IS research; only 6% of the tests
examined in the current research reach this level, and up to 78% of the tests in the
current research have less than a 50% chance of detecting significant results for
medium effects.
Comparison with IS research
Unless it can be shown that medium (and large) effect sizes are irrelevant for SE
research, this should be a matter of concern for SE researchers and practitioners.
Consequently, we should explore in more depth what constitutes meaningful
effect sizes within SE research, in order to establish specific SE conventions.
The results show that, on average, IS research employs sample sizes that are
twice as large as those found in SE research for these tests. In fact, the situation is
slightly worse than that, as observations are used as the sample size in the current
study, whereas the IS studies refer to subjects. Furthermore, the power levels of the
current study to detect medium effect sizes are only about half of the
corresponding power levels of IS research.
Implications for the interpretation of experimental SE
research
Attention to power issues in experimental SE research is very limited.
15.4% of the articles discussed statistical power in relation to their null hypothesis
test, but only in one article did the authors perform an a priori power analysis.
Post hoc power analyses showed that, overall, the studies examined had low
statistical power.
Even for large effect sizes, up to 69% of the tests fell below the .80 level. This
implies that statistical power considerations are underemphasized in
experimental SE research.
A test without sufficient statistical power will not provide the researcher with
enough information to draw conclusions about accepting or rejecting the null
hypothesis.
Implications for the interpretation of experimental SE
research
If no effects are detected in this situation, researchers should not conclude that
the phenomenon does not exist. Rather, they should report that no significant
findings were demonstrated in their study and that this may be due to the low
statistical power associated with their tests.
Problem with underpowered studies, especially when multiple tests are
performed: Although the probability of any individual test being significant is low,
the probability of obtaining at least one significant result increases with the
number of tests. However, this significant result could be misleading because it
does not reflect the true power of the study to detect specific effects. Therefore, it
is important for researchers to distinguish between main (primary) and
additional (secondary) tests in order to correctly interpret the results.
Implications for the interpretation of experimental SE
research
Low statistical power also has a substantial impact on the ability to replicate
experimental studies based on null hypothesis testing.
…the more we are guided by theory and prior observation but conduct an
underpowered study, the more we decrease the likelihood of replication. Thus, an
underpowered literature is not just making a passive mistake, but may actually
contribute to diverting attention and resources in unproductive directions
(Ottenbacher*).
Consequently, the tendency to under-power SE studies makes replication and
meta-analysis difficult and will tend to produce an inconsistent body of literature,
thus hindering the advancement of knowledge.

* K.J. Ottenbacher. The power of replications and the replications of power. The American Statistician. (1996)
Interpretation of studies with very high power levels

Some of the studies in this review employed large sample sizes, ranging from 400
to 800 observations.
This poses a problem of interpretation, because virtually any study can show
significant results if the sample size is large enough, regardless of how small the
actual effect size may be.
It is therefore of particular importance that researchers who report statistically
significant results from studies with very large sample sizes, or with very large
power levels, also report the corresponding effect sizes. This will put the reader in
a better position to interpret the results and judge whether the statistically
significant findings have practical importance.
Ways to increase statistical power
1: Increase the sample size.

2: Relax the significance criterion. This approach is not common because of the
widespread concern to keep type I errors at a low, fixed level of, say, 0.01 or 0.05.
However, the significance criterion and power level should be determined by the
relative severity of type I and type II errors.

3: Choose powerful statistical tests: In general, parametric tests are more powerful
than their nonparametric counterparts. Researchers are encouraged to use the
parametric test most appropriate for their study and to resort to nonparametric
procedures only in the rare case of extreme violations of assumptions
(Baroudi and Orlikowski).
Ways to increase statistical power
4: Retain as much information as possible about the dependent variable:
In general, tests that compare data categorized into groups are less powerful
than tests that use data measured along a continuum.
“Statistics that allow continuous data to be analyzed in a continuous manner,
such as regression, should be used instead of those that require dividing the
data into groups, such as analysis of variance” (Baroudi and Orlikowski).
The direction of the significance criterion also affects the power of a statistical
test.
Ways to increase statistical power
5: Reduce measurement error and subject heterogeneity:
The greater the variance in scores within the treatment and control groups, the
smaller the effect size and power.
One source of such variance is measurement error, that is, variability in scores
that is unrelated to the characteristic being measured. Another source is subject
heterogeneity in the measurement.
Anything that makes the population standard deviation small will increase power,
all else being equal.
Ways to increase statistical power
Subject heterogeneity can be reduced by selecting or developing measures that
do not discriminate strongly among subjects.
If the measure nevertheless accounts for differences among subjects
substantially, these differences could be statistically reduced during data
analysis. To reduce such variance and thereby increase statistical power, the
researcher can use a repeated measures or paired-subjects design, or a factorial
design that employs blocking, stratification, or matching criteria.
Researchers can also reduce subject heterogeneity by employing a research
design that covaries a pretest measure with the dependent variable.
Measurement error can be reduced by exercising careful control over subjects
and experimental conditions.
Ways to increase statistical power
The researcher may use some form of aggregation or averaging of multiple
measures that contain errors individually, to reduce the influence of error on the
composite scores.
Therefore, whenever applicable, the researcher should use reliable multi-item
measures to increase power.
Ways to increase statistical power
6: Balance groups:
The statistical power of a study is based less on the total number of subjects
involved than on the number in each group or cell within the design.
Since the power of a test with unequal group sizes is estimated using the
harmonic mean, the "effective" group size is biased toward the size of the group
with fewer subjects.
Thus, with a fixed number of subjects, maximum statistical power is achieved
when subjects are evenly divided between treatment and control groups.
Researchers should therefore aim for equal, or in the case of factorial designs,
proportional, group sizes, rather than obtaining a large sample size that results in
unequal or disproportionate groups.
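As a sketch of why balance matters (not code from the paper): with the total sample size fixed, power computed via the harmonic-mean "effective" group size drops as the split becomes more unequal. The effect size and group sizes below are illustrative assumptions.

```python
# Sketch: power of a two-sample t-test for balanced vs. unbalanced groups,
# using the harmonic mean of the group sizes as the "effective" group size.
import numpy as np
from scipy.stats import t, nct

def power_two_groups(d, n1, n2, alpha=0.05):
    n_h = 2 * n1 * n2 / (n1 + n2)            # harmonic mean = effective group size
    df = n1 + n2 - 2
    ncp = d * np.sqrt(n_h / 2)
    t_crit = t.ppf(1 - alpha / 2, df)
    return (1 - nct.cdf(t_crit, df, ncp)) + nct.cdf(-t_crit, df, ncp)

d = 0.5                                      # hypothetical medium effect, 60 subjects total
print("30/30 split:", round(power_two_groups(d, 30, 30), 2))
print("50/10 split:", round(power_two_groups(d, 50, 10), 2))   # noticeably lower power
```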
Ways to increase statistical power
7: Investigate only the relevant variables:
Use theory and prior research to identify those variables that are most likely to
have an effect.
Careful selection of which independent variables to include and which to exclude
is crucial to increasing the power of a study.
Kraemer and Thiemann suggested that only those factors that are absolutely
necessary to the research question or that have a strong, documented
relationship to the answer should be included in a study.
Accordingly, they recommended “Choose a few predictor variables and choose
them carefully”
Or as McClelland put it “Doubling the thinking is probably much more productive
than doubling the sample size”
Limitations
Publication selection bias. The recent survey of controlled SE experiments was
used as a basis, so this study shares its publication selection bias. If the main
study had also included grey literature (theses, technical reports, working papers,
etc.) on controlled SE experiments, the current study could in principle provide
more data and possibly allow more general conclusions to be drawn.
Imprecision in data extraction: Because it was not always clear from the study
reports which hypotheses were actually tested, which significance tests
corresponded to which hypothesis, or how many observations were included for
each test, the extraction process may have resulted in some inaccuracy in the
data.
Recommendations for future research
Recommendations for SE researchers performing null hypothesis testing.
Plan for acceptable power based on attention to effect size, either by assessing
previous empirical research in the area and using the effect sizes found in these
studies as a guide, or by looking at your own studies and pilot studies as a guide.
If this is difficult, due to the limited number of empirical studies in SE, you can
alternatively use a judgement-based approach to decide what effect size you
are interested in detecting. They recommend the same general target level of
medium effect sizes as is used in IS research, determined according to Cohen's
definitions*.

J. Cohen. Statistical Power Analysis for the Behavioral Sciences (second ed.), Laurence Erlbaum, Hillsdale, New Jersey (1988)
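As an illustrative a priori calculation in the spirit of this recommendation (assumed values: medium effect d = 0.5, α = 0.05, target power 0.80; not a procedure prescribed by the paper), a minimal sketch that searches for the required per-group sample size.

```python
# A priori power analysis sketch: smallest per-group n giving power >= 0.80
# for a two-sided, two-sample t-test with a medium effect (d = 0.5) at alpha = 0.05.
import numpy as np
from scipy.stats import t, nct

def power_two_sample_t(d, n, alpha=0.05):
    df, ncp = 2 * n - 2, d * np.sqrt(n / 2)
    t_crit = t.ppf(1 - alpha / 2, df)
    return (1 - nct.cdf(t_crit, df, ncp)) + nct.cdf(-t_crit, df, ncp)

n = 2
while power_two_sample_t(0.5, n) < 0.80:
    n += 1
print(n)   # roughly 64 per group under these assumptions
```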
Recommendations for future research
Analyze the implications of the relative severity of type I and type II errors for the
specific treatment situation being investigated.
Unless there are specific circumstances, they do not recommend that
researchers relax the commonly accepted standard of setting alpha at 0.05. They
recommend that SE researchers plan for a power level of at least 0.80 and
perform power analyses accordingly.
Therefore, rather than relaxing alpha, they generally recommend increasing
power to better balance the odds of making type I and type II errors.
Recommendations for future research
They recommend that significance tests from experimental studies be
accompanied by effect size measures and confidence intervals to better inform
readers.
In addition, studies should report data to calculate items such as sample sizes,
alpha level, means, standard deviations, statistical tests, the tails of the tests, and
the value of the statistics.
They recommend that journal editors and reviewers pay more attention to the
issue of statistical power.
In this way, readers will be better able to make informed decisions about the
validity of the results and meta-analysts will be better able to perform secondary
analyses.
Conclusions
Since this is the first study of its kind in SE research, it was not possible to
compare the power data of the current study with previous experimental SE
research. They therefore found it useful to turn to the related discipline of IS
research, which gave them convenient reference data to measure and validate
the results of the power analysis.
The results showed that power issues are generally not given due attention in SE
research and that the level of statistical power is substantially below accepted
norms as well as levels found in the related discipline of IS research.
Only 6% of the studies included in this analysis had a power of 0.80 or more to
detect a medium effect size, which most IS researchers consider to be the target
level.
Conclusions
Attention should be paid to the adequacy of sample sizes and research designs
in experimental investigation of statistical significance to ensure acceptable
levels of power (i.e., 1− β ≥.80), assuming that type I errors should be controlled at
α =.05.
At a minimum, reporting of significance tests should be improved by reporting
effect sizes and confidence intervals to allow for secondary analysis and to afford
the reader a richer understanding of and greater confidence in the results and
implications of a study.
Thank You
For Your Attention
