Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses

Daniël Lakens
Human Technology Interaction Group, Eindhoven University of Technology, Eindhoven, the Netherlands

Corresponding Author: Daniël Lakens, Human Technology Interaction Group, Eindhoven University of Technology, IPO 1.24, PO Box 513, 5600 MB, Eindhoven, the Netherlands. Email: [email protected]
Abstract
Scientists should be able to provide support for the absence of a meaningful effect. Currently, researchers often incorrectly
conclude an effect is absent based on a nonsignificant result. A widely recommended approach within a frequentist framework is to
test for equivalence. In equivalence tests, such as the two one-sided tests (TOST) procedure discussed in this article, an upper and
lower equivalence bound is specified based on the smallest effect size of interest. The TOST procedure can be used to statistically
reject the presence of effects large enough to be considered worthwhile. This practical primer with accompanying spreadsheet
and R package enables psychologists to easily perform equivalence tests (and power analyses) by setting equivalence bounds based
on standardized effect sizes and provides recommendations to prespecify equivalence bounds. Extending your statistical tool kit
with equivalence tests is an easy way to improve your statistical and theoretical inferences.
Keywords
research methods, equivalence testing, null hypothesis significance testing, power analysis
Scientists should be able to provide support for the null hypothesis. A limitation of the widespread use of traditional significance tests, where the null hypothesis is that the true effect size is zero, is that the absence of an effect can be rejected, but not statistically supported. When you perform a statistical test, and the outcome is a p value larger than the α level (e.g., p > .05), the only formally correct conclusion is that the data are not surprising, assuming the null hypothesis is true. It is not possible to conclude there is no effect when p > α—our test might simply have lacked the statistical power to detect a true effect.

It is statistically impossible to support the hypothesis that a true effect size is exactly zero. What is possible in a frequentist hypothesis testing framework is to statistically reject effects large enough to be deemed worthwhile. When researchers want to argue for the absence of an effect that is large enough to be worthwhile to examine, they can test for equivalence (Wellek, 2010). By rejecting an effect (indicated in this article by Δ) more extreme than predetermined lower and upper equivalence bounds (ΔL and ΔU, e.g., effect sizes of Cohen's d = −.3 and d = .3), we can act as if the true effect is close enough to zero for our practical purposes. Equivalence testing originates from the field of pharmacokinetics (Hauck & Anderson, 1984), where researchers sometimes want to show that a new cheaper drug works just as well as an existing drug (for an overview, see Senn, 2007, Chapters 15 and 22). A very simple equivalence testing approach is the "two one-sided tests" (TOST) procedure (Schuirmann, 1987). In the TOST procedure, an upper (ΔU) and lower (ΔL) equivalence bound is specified based on the smallest effect size of interest (SESOI; e.g., a positive or negative difference of d = .3). Two composite null hypotheses are tested: H01: Δ ≤ ΔL and H02: Δ ≥ ΔU. When both these one-sided tests can be statistically rejected, we can conclude that ΔL < Δ < ΔU, or that the observed effect falls within the equivalence bounds and is close enough to zero to be practically equivalent (Seaman & Serlin, 1998).

Psychologists often incorrectly conclude there is no effect based on a nonsignificant test result. For example, the words "no effect" had been used in 108 articles published in Social Psychological and Personality Science up to August 2016. Manual inspection revealed that in almost all of these articles, the conclusion of "no effect" was based on statistical nonsignificance. Finch, Cumming, and Thomason (2001) reported that in the Journal of Applied Psychology, a stable average of around 38% of articles with nonsignificant results accept the null hypothesis. This practice is problematic. With small sample sizes, nonsignificant test results are hardly indicative of the absence of a true effect, and with huge sample sizes, effects can be statistically significant but practically and theoretically irrelevant. Equivalence tests, which are conceptually
straightforward, easy to perform, and highly similar to widely used hypothesis significance tests that aim to reject a null effect, are a simple but underused approach to reject the possibility that an effect more extreme than the SESOI exists (Anderson & Maxwell, 2016).

Psychologists would gain a lot by embracing equivalence tests. First, researchers often incorrectly use nonsignificance to claim the absence of an effect (e.g., "there were no gender effects, p > .10"). This incorrect interpretation of p values would be more easily recognized and should become less common in the scientific literature if equivalence tests were better known and more widely used. Second, where traditional significance tests only allow researchers to reject the null hypothesis, science needs statistical approaches that allow us to conclude meaningful effects are absent (Dienes, 2016). Finally, the strong reliance on hypothesis significance tests that merely aim to reject a null effect does not require researchers to think about the effect size under the alternative hypothesis. Exclusively focusing on rejecting a null effect has been argued to lead to imprecise hypotheses (Gigerenzer, 1998). Equivalence testing invites researchers to make more specific predictions about the effect size they find worthwhile to examine. Bayesian methods can also be used to test a null effect (e.g., Dienes, 2014), but equivalence tests do not require researchers to switch between statistical philosophies to test the absence of a meaningful effect, and the availability of power analyses for equivalence tests allows researchers to easily design informative experiments.

There have been previous attempts to introduce equivalence testing to psychology (Quertemont, 2011; Rogers, Howard, & Vessey, 1993; Seaman & Serlin, 1998). I believe there are four reasons why previous attempts have largely failed. First, there is a lack of easily accessible software to perform equivalence tests. To solve this problem, I've created an easy-to-use spreadsheet and R package to perform equivalence tests for independent and dependent t tests, correlations, and meta-analyses (see https://osf.io/q253c/) based on summary statistics. Second, in pharmacokinetics, the equivalence bounds are often defined in raw scores, whereas it might be more intuitive for researchers in psychology to express equivalence bounds in standardized effect sizes. This makes it easier to perform power analyses for equivalence tests (which can also be done with the accompanying spreadsheet and R package) and to compare equivalence bounds across studies in which different measures are used. Third, there is no single article that discusses both power analyses and statistical tests for one-sample, dependent, and independent t tests, correlations, and meta-analyses, which are all common in psychology. Finally, guidance on how to set equivalence boundaries has been absent for psychologists, given that there are often no specific theoretical limitations on how small effects are predicted to be (Morey & Lakens, 2017) nor cost–benefit boundaries of when effects are too small to be practically meaningful. This is a chicken-and-egg problem, since using equivalence tests will likely stimulate researchers to specify which effect sizes are predicted by a theory (Weber & Popova, 2012). To bootstrap the specification of equivalence bounds in psychology, I propose that when theoretical or practical boundaries on meaningful effect sizes are absent, researchers set the bounds to the smallest effect size they have sufficient power to detect, which is determined by the resources they have available to study an effect.

Testing for Equivalence

In this article, I will focus on the TOST procedure (Schuirmann, 1987) of testing for equivalence because of its simplicity and widespread use in other scientific disciplines. The goal in the TOST approach is to specify a lower and upper bound, such that results falling within this range are deemed equivalent to the absence of an effect that is worthwhile to examine (e.g., ΔL = −.3 to ΔU = .3, where Δ is a difference that can be defined by either standardized differences such as Cohen's d or raw differences such as .3 scale points on a 5-point scale). In the TOST procedure, the null hypothesis is the presence of a true effect of ΔL or ΔU, and the alternative hypothesis is an effect that falls within the equivalence bounds, or the absence of an effect that is worthwhile to examine. The observed data are compared against ΔL and ΔU in two one-sided tests. If the p value for both tests indicates the observed data are surprising, assuming ΔL or ΔU are true, we can follow a Neyman–Pearson approach to statistical inferences and reject effect sizes larger than the equivalence bounds. When making such a statement, we will not be wrong more often, in the long run, than our Type 1 error rate (e.g., 5%). It is also possible to test for inferiority, or the hypothesis that the effect is smaller than an upper equivalence bound, by setting the lower equivalence bound to −∞ (see Note 1). Furthermore, equivalence bounds can be symmetric around zero (ΔL = −.3 to ΔU = .3) or asymmetric (ΔL = −.2 to ΔU = .4).

When both null hypothesis significance tests (NHST) and equivalence tests are used, there are four possible outcomes of a study: The effect can be statistically equivalent (larger than ΔL, smaller than ΔU) and not statistically different from zero, statistically different from zero but not statistically equivalent, statistically different from zero and statistically equivalent, or undetermined (neither statistically different from zero nor statistically equivalent). In Figure 1, mean differences (black squares) and their 90% (thick lines) and 95% confidence intervals (CIs; thin lines) are illustrated for four scenarios. To conclude equivalence (Scenario A), the 90% CI around the observed mean difference should exclude the ΔL and ΔU values of −.5 and .5 (indicated by black vertical dashed lines; see Note 2).

The traditional two-sided null hypothesis significance test is rejected (Scenario B) when the CI around the mean difference does not include 0 (the vertical gray dotted line). Effects can be statistically different from zero and statistically equivalent (Scenario C) when the 90% CI excludes the equivalence bounds and the 95% CI excludes zero. Finally, an effect can be undetermined, or not statistically different from zero and not statistically equivalent (Scenario D), when the 90% CI includes one of the equivalence bounds and the 95% CI includes zero.
Figure 1. Mean differences (black squares) and 90% CIs (thick horizontal lines) and 95% CIs (thin horizontal lines) with equivalence bounds ΔL = −.5 and ΔU = .5 for four combinations of test results that are statistically equivalent or not and statistically different from zero or not. [Figure not reproduced.]

In this article, the focus lies on the TOST procedure, where two p values are calculated. Readers are free to replace decisions based on p values by decisions based on 90% CIs if they wish. Formally, hypothesis testing and estimation are distinct approaches (Cumming & Finch, 2001). For example, while sample size planning based on CIs focusses on the width of CIs, sample size planning for hypothesis testing uses power analysis to estimate the probability of observing a significant result (Maxwell, Kelley, & Rausch, 2008). Since the TOST procedure is based on a Neyman–Pearson hypothesis testing approach to statistics, and I'll explain how to calculate the tests as well as how to perform power analysis, I'll focus on the calculation of p values for conceptual consistency.

Equivalence Tests for Differences Between Two Independent Means

The TOST procedure entails performing two one-sided tests to examine whether the observed data are surprisingly larger than an equivalence boundary lower than zero (ΔL) or surprisingly smaller than an equivalence boundary larger than zero (ΔU). The equivalence test assuming equal variances is based on:

t_L = \frac{M_1 - M_2 - \Delta_L}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \quad \text{and} \quad t_U = \frac{M_1 - M_2 - \Delta_U}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad (1)

where M1 and M2 indicate the means of each sample, n1 and n2 are the sample sizes in each group, and s is the pooled standard deviation (SD):

s = \sqrt{\frac{(n_1 - 1)SD_1^2 + (n_2 - 1)SD_2^2}{n_1 + n_2 - 2}}. \qquad (2)

Even though Student's t test is by far the most popular t test in psychology, there is general agreement that whenever the number of observations is unequal across both conditions, Welch's t test (1947), which does not rely on the assumption of equal variances, is a better default (Delacre, Lakens, & Leys, 2017; Ruxton, 2006). The equivalence test based on Welch's t test replaces the pooled error term with the separate-variances error term:

t_L = \frac{M_1 - M_2 - \Delta_L}{\sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}} \quad \text{and} \quad t_U = \frac{M_1 - M_2 - \Delta_U}{\sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}}, \qquad (3)

where the degrees of freedom (df) for Welch's t test are based on the Satterthwaite (1946) correction:

df_w = \frac{\left(\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}\right)^2}{\frac{(SD_1^2/n_1)^2}{n_1 - 1} + \frac{(SD_2^2/n_2)^2}{n_2 - 1}}. \qquad (4)

These equations are highly similar to the Student's and Welch's t statistics for traditional significance tests. The only difference is that the lower equivalence bound ΔL and the upper equivalence bound ΔU are subtracted from the mean difference between groups. These bounds can be defined in raw scores or in a standardized difference, where Δ = Cohen's d × s, or Cohen's d = Δ/s. The two one-sided tests are rejected if t_U ≤ −t(df, α) and t_L ≥ t(df, α), where t(α, df) is the upper 100α percentile of a t distribution (Berger & Hsu, 1996). The spreadsheet and R package can be used to perform this test, but some commercial software such as Minitab (Minitab 17 Statistical Software, 2010) also includes the option to perform equivalence tests for t tests.

As an example, Eskine (2013) showed that participants who had been exposed to organic food were substantially harsher in their moral judgments relative to those in the control condition (d = .81, 95% CI [0.19, 1.45]). A replication by Moery and Calin-Jageman (2016, Study 2) did not observe a significant effect (control: n = 95, M = 5.25, SD = .95; organic food: n = 89, M = 5.22, SD = .83). The authors followed Simonsohn's (2015) recommendation to set the equivalence bound to the effect size the original study had 33% power to detect. With n = 21 in each condition of the original study, this means the equivalence bound is d = .48, which equals a difference of .384 on a 7-point scale given the sample sizes and a pooled SD of .894. We can calculate the TOST equivalence test t values:

t_L = \frac{5.25 - 5.22 - (-0.384)}{0.894\sqrt{\frac{1}{95} + \frac{1}{89}}} = 3.14 \quad \text{and} \quad t_U = \frac{5.25 - 5.22 - 0.384}{0.894\sqrt{\frac{1}{95} + \frac{1}{89}}} = -2.69,

which correspond to p values of .001 and .004. If α = .05, and assuming equal variances, the equivalence test is significant, t(182) = −2.69, p = .004. We can reject effects larger than .384 scale points. Note that both one-sided tests need to be significant to declare equivalence; but for efficiency, only the one-sided test with the highest p value is reported in TOST results (given that if this test is significant, so is the other). Alternatively, because Moery and Calin-Jageman's (2016) main prediction seems to be whether the effect is smaller than the upper equivalence bound (a test for inferiority), only the one-sided t test against the upper equivalence bound could be performed and reported. Note that the spreadsheet and R package allow you to either directly specify the equivalence bounds in Cohen's d or set the equivalence bound in raw units.
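To make the procedure concrete, here is a minimal base R sketch of Equations 1 and 2 for the equal-variances case; the function name and output format are my own illustration, not the interface of the accompanying spreadsheet or R package:

# TOST for two independent means (Equations 1 and 2), assuming equal variances.
tost_two <- function(m1, m2, sd1, sd2, n1, n2, low_bound, high_bound) {
  s_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))  # Equation 2
  se <- s_pooled * sqrt(1 / n1 + 1 / n2)
  df <- n1 + n2 - 2
  t_l <- (m1 - m2 - low_bound) / se        # Equation 1, against the lower bound
  t_u <- (m1 - m2 - high_bound) / se       # Equation 1, against the upper bound
  p_l <- pt(t_l, df, lower.tail = FALSE)   # rejects H01: effect <= low_bound
  p_u <- pt(t_u, df, lower.tail = TRUE)    # rejects H02: effect >= high_bound
  list(t_l = t_l, t_u = t_u, df = df, p_tost = max(p_l, p_u))
}

# Moery and Calin-Jageman (2016, Study 2), with raw bounds of -0.384 and 0.384:
tost_two(5.25, 5.22, .95, .83, 95, 89, -0.384, 0.384)
# t_l = 3.14 (p = .001) and t_u = -2.69 (p = .004), matching the values above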
Table 1. Sample Sizes (for the Number of Observations in Each Group) for Equivalence Tests for Independent Means, as a Function of the Desired Power, α Level, and Equivalence Bound Δ (in Cohen's d), Based on Exact Calculations and the Approximation. [Table body not reproduced.]

An a priori power analysis for equivalence tests can be performed by calculating the required sample sizes to declare equivalence for two one-sided tests based on the lower equivalence bound and upper equivalence bound. When equivalence bounds are symmetric around zero (e.g., ΔL = −.5 and ΔU = .5), the required sample sizes (referred to as nL and nU in Equation 5) will be identical. Following Chow, Shao, and Wang (2002), the normal approximation of the power equation for equivalence tests (for each independent group of an independent t test) given a specific α level and desired level of statistical power (1 − β) is:

n_L = \frac{2(z_{\alpha} + z_{\beta/2})^2}{\Delta_L^2}, \quad n_U = \frac{2(z_{\alpha} + z_{\beta/2})^2}{\Delta_U^2}, \qquad (5)

where ΔL and ΔU are the standardized mean difference equivalence bounds (in Cohen's d). This equation calculates the required sample sizes based on the assumption that the true effect size is zero (see Table 1). If a nonzero true effect size is expected, an iterative procedure must be used. A highly accessible overview of power analysis for equivalence, superiority, and noninferiority designs, with power tables for a wide range of standardized mean differences and expected true mean differences that can be used to decide upon the sample size in your study, is available in Julious (2004).

The narrower the equivalence bounds, or the smaller the effect sizes one tries to reject, the larger the sample size that is required. Large sample sizes are required to achieve high power when equivalence bounds are close to zero. This is comparable to the large sample sizes that are required to reject a true but small effect when the null hypothesis is a null effect. Equivalence tests require slightly larger sample sizes than traditional null hypothesis tests.
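A minimal base R sketch of Equation 5 (the function name is illustrative):

# Normal-approximation sample size per group for a TOST equivalence test with
# symmetric bounds, assuming the true effect is zero (Equation 5).
tost_n_approx <- function(bound_d, alpha = .05, power = .80) {
  beta <- 1 - power
  ceiling(2 * (qnorm(1 - alpha) + qnorm(1 - beta / 2))^2 / bound_d^2)
}

tost_n_approx(0.5)               # 69 per group for bounds of d = -.5 and .5
tost_n_approx(0.3, power = .90)  # narrower bounds and higher power demand far more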
Equivalence Tests for Differences Between Dependent Means

When comparing dependent means, the correlation between the observations has to be taken into account, and the effect size directly related to the statistical significance of the test (and thus used in power analysis) is Cohen's dz (see Lakens, 2013). The t values for the two one-sided tests are:

t_L = \frac{M_1 - M_2 - \Delta_L}{\frac{\sqrt{SD_1^2 + SD_2^2 - 2 r SD_1 SD_2}}{\sqrt{N}}} \quad \text{and} \quad t_U = \frac{M_1 - M_2 - \Delta_U}{\frac{\sqrt{SD_1^2 + SD_2^2 - 2 r SD_1 SD_2}}{\sqrt{N}}}. \qquad (6)

The bounds ΔL and ΔU can be defined in raw scores, or in a standardized bound based on Cohen's dz, where Δ = dz × SDdiff, or dz = Δ/SDdiff. Equation 5 can be used for a priori power analyses by inserting Cohen's dz instead of Cohen's d. The number of pairs needed to achieve a desired level of power when using Cohen's dz is half the number of observations needed in each between-subject condition specified in Table 1.
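A minimal base R sketch of Equation 6; the function name and input values below are hypothetical:

# TOST for two dependent means (Equation 6).
tost_paired <- function(m1, m2, sd1, sd2, r, n, low_bound, high_bound) {
  sd_diff <- sqrt(sd1^2 + sd2^2 - 2 * r * sd1 * sd2)  # SD of the difference scores
  se <- sd_diff / sqrt(n)
  df <- n - 1
  c(p_l = pt((m1 - m2 - low_bound) / se, df, lower.tail = FALSE),
    p_u = pt((m1 - m2 - high_bound) / se, df, lower.tail = TRUE))
}

# Hypothetical paired data with raw equivalence bounds of -0.5 and 0.5:
tost_paired(4.1, 4.0, 1.2, 1.1, r = .5, n = 80, low_bound = -0.5, high_bound = 0.5)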
There are no suggested benchmarks of small, medium, and large effects for Cohen's dz. We can consider two approaches to determining benchmarks. The first is to use the same benchmarks for Cohen's d as for Cohen's dz. This assumes r = .5, when Cohen's d and Cohen's dz are identical (see Note 3). A second approach is to scale the benchmarks for Cohen's dz based on the sample size we need to reliably detect an effect. For example, in an independent t test, 176 participants are required in each condition to achieve 80% power for d = .3 and α = .05. With 176 pairs of observations and α = .05, a study has 80% power for a Cohen's dz of .212. The relationship between d and dz is a factor of √2, which means we can translate the benchmarks for Cohen's d for small (.2), medium (.5), and large (.8) effects into benchmarks for Cohen's dz of small (.14), medium (.35), and large (.57). There is no objectively correct way to set benchmarks for Cohen's dz. I leave it up
to the reader to determine whether either of these approaches is useful.
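The sample size claims above can be checked with base R's power.t.test (traditional NHST power, not the TOST power of Equation 5):

# 80% power for d = .3 in an independent t test: ~176 per group
power.t.test(delta = .3, sd = 1, sig.level = .05, power = .80)
# 80% power for dz = .3 / sqrt(2) = .212 in a paired t test: ~176 pairs
power.t.test(delta = .212, sd = 1, sig.level = .05, power = .80, type = "paired")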
Equivalence Tests for One-Sample t Tests

The t values for the two one-sided tests for a one-sample t test are:

t_L = \frac{M - \mu - \Delta_L}{\frac{SD}{\sqrt{N}}} \quad \text{and} \quad t_U = \frac{M - \mu - \Delta_U}{\frac{SD}{\sqrt{N}}}, \qquad (7)

where M is the observed mean, SD is the observed standard deviation, N is the sample size, ΔL and ΔU are lower and upper equivalence bounds, and μ is the value that the mean is tested against.
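A minimal base R sketch of Equation 7, with hypothetical input values:

# TOST for a one-sample t test (Equation 7). Function name and data are illustrative.
tost_one <- function(m, mu, sd, n, low_bound, high_bound) {
  se <- sd / sqrt(n)
  df <- n - 1
  c(p_l = pt((m - mu - low_bound) / se, df, lower.tail = FALSE),
    p_u = pt((m - mu - high_bound) / se, df, lower.tail = TRUE))
}

tost_one(m = 0.1, mu = 0, sd = 1, n = 100, low_bound = -0.3, high_bound = 0.3)
# both p values < .05, so effects more extreme than -0.3 and 0.3 can be rejected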
Equivalence Tests for Correlations

Equivalence tests can also be performed on correlations, where the two one-sided tests aim to reject correlations larger than a lower equivalence bound (r_L) and smaller than an upper equivalence bound (r_U). I follow Goertzen and Cribbie (2010), who use Fisher's z transformation on the correlations, after which critical values are calculated that can be compared against the normal distribution:

Z_L = \frac{\frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) - \frac{1}{2}\ln\left(\frac{1+r_L}{1-r_L}\right)}{\sqrt{\frac{1}{N-3}}} \quad \text{and} \quad Z_U = \frac{\frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) - \frac{1}{2}\ln\left(\frac{1+r_U}{1-r_U}\right)}{\sqrt{\frac{1}{N-3}}}, \qquad (8)

where r is the observed correlation and N is the sample size.
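A minimal base R sketch of this test (atanh is R's built-in Fisher z transformation); the bounds and data below are hypothetical:

# Equivalence test for a correlation via Fisher's z (Equation 8).
tost_cor <- function(r, n, low_bound, high_bound) {
  se <- sqrt(1 / (n - 3))                    # standard error of Fisher's z
  z_l <- (atanh(r) - atanh(low_bound)) / se  # atanh(x) = 0.5 * log((1 + x)/(1 - x))
  z_u <- (atanh(r) - atanh(high_bound)) / se
  c(p_l = pnorm(z_l, lower.tail = FALSE), p_u = pnorm(z_u, lower.tail = TRUE))
}

tost_cor(r = .01, n = 500, low_bound = -.1, high_bound = .1)
# both p values < .05: correlations more extreme than -.1 and .1 can be rejected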
independent samples, assuming a true effect size of 0, 80%
LN
1þrL
LN
1þrU power is achieved when DL ¼ 0.414 and DU ¼ 0.414. As
LN ð1rÞ
1þr 1rL LN ð1rÞ
1þr 1rU
Simonsohn (2015) proposes to test for inferiority for replication studies (an equivalence test where the lower bound is set to −∞). He suggests setting the upper equivalence bound in a replication study to the effect size that would have given an original study 33% power. For example, an original study with 60 participants divided equally across two independent groups has 33% power to detect an effect of d = .4, so ΔU is set to d = .4. This approach limits the sample size required to test for equivalence to 2.5 times the sample size of the original study. The goal is not to show the effect is too small to be feasible to study, but too small to have been reliably detected by the original experiment, thus casting doubt on the original observation.
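A base R check of this 33% power bound (omitting delta makes power.t.test solve for the detectable effect):

# Effect size an original study with 30 participants per group had 33% power
# to detect (Simonsohn, 2015):
power.t.test(n = 30, power = 1/3, sig.level = .05)$delta
# ~ .40, the inferiority bound suggested for a replication of such a study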
If feasibility constraints are practically absent (e.g., in online studies), another starting point to set equivalence bounds is by setting bounds based on benchmarks for small, medium, and large effects. Although using these benchmarks to interpret effect sizes is typically recommended as a last resort (e.g., Lakens, 2013), their use in setting equivalence bounds seems warranted by the lack of other clear-cut recommendations. By far the best solution would be for researchers to specify their SESOI when they publish an original result or describe a theoretical idea (Morey & Lakens, 2017). The use of equivalence testing will no doubt lead to a discussion about which effect sizes are too small to be worthwhile to examine in specific research lines in psychology, which in itself is progress.

Discussion

Equivalence tests are a simple adaptation of traditional significance tests that allow researchers to design studies that reject effects larger than prespecified equivalence bounds. They allow researchers to reject effects large enough to be considered worthwhile. Adopting equivalence tests will prevent the common misinterpretations of nonsignificant p values as the absence of an effect and nudge researchers toward specifying which effects they find worthwhile. By providing a simple spreadsheet and R package to perform power calculations and equivalence tests for common statistical tests in psychology, researchers should be able to easily improve their research practices.

Rejecting effects more extreme than the equivalence bounds implies that we can conclude equivalence for a specific operationalization of a hypothesis. It is possible that a meaningful effect would be observed with a different manipulation or measure. Confounds can underlie observed equivalent effects. An additional nonstatistical challenge in interpreting equivalence concerns the issue of whether an experiment was performed competently (Senn, 2007). Complete transparency (sharing all materials) is a partial solution, since it allows peers to evaluate whether the experiment was well designed (Morey et al., 2016), but this issue is not easily resolved when the actions of an experimenter might influence the data. In such experiments, even blinding the experimenter to conditions is no solution, since an experimenter can interfere with the data quality of all conditions. This is an inherent asymmetry between demonstrating an effect and demonstrating the absence of a worthwhile effect. The only solution for anyone skeptical about studies demonstrating equivalence is to perform an independent replication.

Equivalence testing is based on a Neyman–Pearson hypothesis testing approach that allows researchers to control error rates in the long run and design studies based on a desired level of statistical power. Error rates in equivalence tests are controlled at the α level when the true effect equals the equivalence bound. When the true effect is more extreme than the equivalence bounds, error rates are smaller than the α level. It is important to take statistical power into account when determining the equivalence bounds because, in small samples (where CIs are wide), a study might have no statistical power (i.e., the CI will always be so wide that it is necessarily wider than the equivalence bounds).

There are alternative approaches to the TOST procedure. Updated versions of equivalence tests exist, but their added complexity does not seem to be justified by the small gain in power (for a discussion, see Meyners, 2012). There are also alternative approaches to providing statistical support for a small or null effect, such as estimation (calculating effect sizes and CIs), specifying a region of practical equivalence (Kruschke, 2010), or calculating Bayes factors (Dienes, 2014; Rouder, Speckman, Sun, Morey, & Iverson, 2009). Researchers should report effect size estimates in addition to hypothesis tests. Since Bayesian and frequentist tests answer complementary questions, with Bayesian statistics quantifying posterior beliefs and frequentist statistics controlling Type 1 and Type 2 error rates, these tests can be reported side by side.

Other fields are able to use raw measures due to the widespread use of identical measurements (e.g., the number of deaths, the amount of money spent), but in some subfields in psychology, the variability in the measures that are collected requires standardized effect sizes to make comparisons across studies (Cumming & Fidler, 2009). A consideration when using standardized effect sizes as equivalence bounds is that in two studies with the same mean difference and CIs in raw scale units (e.g., a difference of 0.2 on a 7-point scale with 90% CI [0.13, 0.17]), the same standardized equivalence bounds can lead to different significance levels in an equivalence test. The reason for this is that the pooled SD can differ across the studies, and as a consequence, the same equivalence bounds in standardized scores imply different equivalence bounds in raw scores. If this is undesirable, researchers should specify equivalence bounds in raw scores instead.

Ideally, psychologists could specify equivalence bounds in raw mean differences based on theoretical predictions or cost–benefit analyses, instead of setting equivalence bounds based on standardized benchmarks. My hope is that as equivalence tests become more common in psychology, researchers will start to discuss which effect sizes are theoretically expected while setting equivalence bounds. When theories do not specify which effect sizes are too small to be meaningful, theories can't be falsified. Whenever a study yields no statistically significant effect, one can always argue that there is a true
effect that is smaller than the study could reliably detect (Morey & Lakens, 2017). Maxwell, Lau, and Howard (2015) suggest that replication studies demonstrate the absence of an effect by using equivalence bounds of ΔL = −.1 and ΔU = .1, or even ΔL = −.05 and ΔU = .05. I believe this creates an imbalance where we condone original studies that fail to make specific predictions, while replication studies are expected to test extremely specific predictions that can only be confirmed by collecting huge numbers of observations.

Extending your statistical tool kit with equivalence tests is an easy way to improve your statistical and theoretical inferences. The TOST procedure provides a straightforward approach to reject effect sizes that one considers large enough to be worthwhile to examine.
Author's Note

The TOSTER spreadsheet is available from https://osf.io/q253c/. The TOSTER R package can be installed from CRAN using install.packages("TOSTER"). Detailed example vignettes are available from: https://cran.rstudio.com/web/packages/TOSTER/vignettes/IntroductionToTOSTER.html

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

1. As Wellek (2010, p. 30) notes, for all practical purposes, one can simply specify a very large value for the infinite equivalence bound.
2. A 90% confidence interval (CI; 1 − 2α) is used instead of a 95% CI (1 − α) because two one-sided tests (each with an α of 5%) are performed.
3. The author would like to thank Jake Westfall for this suggestion.
References

Anderson, S. F., & Maxwell, S. E. (2016). There's more than one way to conduct a replication study: Beyond statistical significance. Psychological Methods, 21, 1–12. doi:10.1037/met0000051
Berger, R. L., & Hsu, J. C. (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science, 11, 283–302.
Chow, S.-C., Shao, J., & Wang, H. (2002). A note on sample size calculation for mean comparisons based on noncentral t-statistics. Journal of Biopharmaceutical Statistics, 12, 441–456.
Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie/Journal of Psychology, 217, 15–26. doi:10.1027/0044-3409.217.1.15
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532–574. doi:10.1177/0013164401614002
Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch's t-test instead of Student's t-test with unequal group sizes. International Review of Social Psychology, 30, 92–101. doi:10.5334/irsp.82
Dienes, Z. (2014). Using Bayes to get the most out of nonsignificant results. Frontiers in Psychology, 5, 781. doi:10.3389/fpsyg.2014.00781
Dienes, Z. (2016). How Bayes factors change scientific practice. Journal of Mathematical Psychology. doi:10.1016/j.jmp.2015.10.003
Eskine, K. J. (2013). Wholesome foods and wholesome morals? Organic foods reduce prosocial behavior and harshen moral judgments. Social Psychological and Personality Science, 4, 251–254. doi:10.1177/1948550612447114
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175–191.
Finch, S., Cumming, G., & Thomason, N. (2001). Colloquium on effect sizes: The roles of editors, textbook authors, and the publication manual: Reporting of statistical inference in the Journal of Applied Psychology: Little evidence of reform. Educational and Psychological Measurement, 61, 181–210. doi:10.1177/0013164401612001
Gigerenzer, G. (1998). Surrogates for theories. Theory and Psychology, 8, 195–204.
Goertzen, J. R., & Cribbie, R. A. (2010). Detecting a lack of association: An equivalence testing approach. British Journal of Mathematical and Statistical Psychology, 63, 527–537. doi:10.1348/000711009X475853
Hauck, D. W. W., & Anderson, S. (1984). A new statistical procedure for testing equivalence in two-group comparative bioavailability trials. Journal of Pharmacokinetics and Biopharmaceutics, 12, 83–91. doi:10.1007/BF01063612
Julious, S. A. (2004). Sample sizes for clinical trials with normal data. Statistics in Medicine, 23, 1921–1986. doi:10.1002/sim.1783
Kruschke, J. (2010). Doing Bayesian data analysis: A tutorial introduction with R. Burlington, MA: Academic Press.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863. doi:10.3389/fpsyg.2013.00863
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44, 701–710. doi:10.1002/ejsp.2023
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563. doi:10.1146/annurev.psych.59.103006.093735
Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does "failure to replicate" really mean? American Psychologist, 70, 487–498. doi:10.1037/a0039400
Meyners, M. (2012). Equivalence tests—A review. Food Quality and Preference, 26, 231–245. doi:10.1016/j.foodqual.2012.05.003
Minitab 17 Statistical Software. (2010). [Computer software]. State College, PA: Minitab, Inc.
Moery, E., & Calin-Jageman, R. J. (2016). Direct and conceptual replications of Eskine (2013): Organic food exposure has little to no effect on moral judgments and prosocial behavior. Social Psychological and Personality Science, 7, 312–319. doi:10.1177/1948550616639649
Morey, R. D., Chambers, C. D., Etchells, P. J., Harris, C. R., Hoekstra, R., Lakens, D., . . . Zwaan, R. A. (2016). The peer reviewers' openness initiative: Incentivizing open research practices through peer review. Royal Society Open Science, 3, 150547.
Morey, R. D., & Lakens, D. (2017). Why most of psychology is statistically unfalsifiable. Manuscript submitted for publication.
Piaggio, G., Elbourne, D. R., Altman, D. G., Pocock, S. J., Evans, S. J., & CONSORT Group. (2006). Reporting of noninferiority and equivalence randomized trials: An extension of the CONSORT statement. Journal of the American Medical Association, 295, 1152–1160.
Quertemont, E. (2011). How to statistically show the absence of an effect. Psychologica Belgica, 51, 109. doi:10.5334/pb-51-2-109
Rogers, J. L., Howard, K. I., & Vessey, J. T. (1993). Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin, 113, 553.
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225–237. doi:10.3758/PBR.16.2.225
Ruxton, G. D. (2006). The unequal variance t-test is an underused alternative to Student's t-test and the Mann-Whitney U test. Behavioral Ecology, 17, 688–690. doi:10.1093/beheco/ark016
Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin, 2, 110–114. doi:10.2307/3002019
Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15, 657–680.
Seaman, M. A., & Serlin, R. C. (1998). Equivalence confidence intervals for two-group comparisons of means. Psychological Methods, 3, 403–411. doi:10.1037/1082-989X.3.4.403
Senn, S. (2007). Statistical issues in drug development (2nd ed.). Hoboken, NJ: Wiley.
Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results. Psychological Science, 26, 559–569. doi:10.1177/0956797614567341
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36, 1–48.
Weber, R., & Popova, L. (2012). Testing equivalence in communication research: Theory and application. Communication Methods and Measures, 6, 190–213. doi:10.1080/19312458.2012.703834
Welch, B. L. (1947). The generalization of "Student's" problem when several different population variances are involved. Biometrika, 34, 28–35. doi:10.2307/2332510
Wellek, S. (2010). Testing statistical hypotheses of equivalence and noninferiority (2nd ed.). Boca Raton, FL: CRC Press.

Author Biography

Daniël Lakens is an assistant professor at the School of Innovation Sciences at Eindhoven University of Technology. He is interested in improving research practices, drawing better statistical inferences, and reducing publication bias.

Handling Editor: Dominique Muller