
Reflections on Experience

Journal of Management Inquiry, 1–5
© The Author(s) 2024
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/10564926241257164
journals.sagepub.com/home/jmi

How Muriel’s Tea Stained Management Research Through Statistical Significance Tests

Andreas Schwab and William H. Starbuck

Abstract
Ronald Fisher created statistical significance tests to provide an easy method anyone could perform. Their simplicity and general applicability spurred adoption: they became universal in statistical training, and universal training made these tests universal in social science. Editors and reviewers expected to see statistical significance in every paper. But the method has serious deficiencies. Today’s more advanced computational capabilities have created opportunities to address these deficiencies and to use statistical analyses that provide better information. This essay introduces four lessons we have learned during our two-decade effort to inform management scholars about the limitations of statistical significance tests. First, methodological change is generational and benefits from a focus on doctoral students. Second, criticizing the status quo is not enough: introducing and teaching alternative approaches is essential. Third, in a publish-or-perish world, change initiatives must address publication. Fourth, to speed up progress, leadership by academic organizations and journal editors is essential.

Keywords
statistical significance tests, research methods in management, management research history, doctoral student training

Andreas Schwab, Ivy College of Business, Iowa State University, Ames, IA, USA
William H. Starbuck, University of Oregon, Eugene, OR, USA
Corresponding author: Andreas Schwab, Ivy College of Business, Iowa State University, 3129 Gerdin, 2167 Union Drive, Ames, IA 50010, USA. Email: [email protected]

During a tea break around 1920, Ronald Fisher offered Muriel Bristol, his coworker, a cup of tea he had just poured (Fisher, 1935; Salsburg, 2002). But Muriel refused the tea, explaining that she strongly preferred to have milk poured into her cup before tea was added. Ronald scoffed at Muriel’s claim that she could distinguish whether milk or tea had been poured first. Overhearing them, Muriel’s fiancé, William Roach, proposed, “Let’s test her.” The men offered Muriel eight cups of tea in random order, four prepared each way. She then shocked her skeptics by correctly identifying how all eight cups had been prepared. Ronald reasoned that if Muriel could perceive no differences between the preparations of tea, she would merely be guessing which four of the eight cups had milk poured first, and the probability of her identifying all eight cups correctly by chance was only 1 in 70, about 1.43%.

This experiment has influenced social science research methods greatly for over a century because it led Ronald to his idea of testing a “null hypothesis.”

Significance Tests Created Their Own Revolutionary Success

Ronald Fisher exerted great influence on statistical analyses in social science. Statistical analysis had developed greatly during the 1800s, and his books (1925, 1935) codified and extended this development. More importantly, the books introduced a kind of statistical analysis—null hypothesis significance tests (NHST)—that requires very little knowledge about random processes or probability distributions.

Firstly, Fisher designed this test to answer only a narrow question: Do observed data differ sufficiently from the data likely when a null hypothesis is true and only random sampling affects results—yes or no? He embellished answers to this question by describing unlikely data as “significant.”

Secondly, he minimized computations by appending statistical tables to his books. In the 1920s and 1930s, high-speed computers did not yet exist, so people who made statistical computations had to use mechanical calculators. Statistical tables avoided laborious calculations.

Thirdly, he simplified his tables by assuming that researchers are satisfied to study averages. Focusing on averages allowed Ronald to assume the tested data have Normal distributions (Fischer, 2010). A Normal distribution is fully defined by its mean and standard deviation, so users must
estimate only those two parameters, and a Normal distribution is symmetric, so the tables in his books had to describe only one half of the distribution.

The resulting design for NHSTs succeeded impressively. NHSTs were so easy they became universal in statistical training, and universal training made NHSTs universal in medical and social-science research (Stang et al., 2016). Professional statisticians liked teaching a method that required little mathematical training and attracted students to their courses. Editors and reviewers expected to find NHSTs in every empirical paper. Universities publicized “significant” research findings by their professors and students.

How Simplicity and Broad Applicability Turned into Deficiencies

Emphasizing very easy computation may leave students with insufficient understanding. They learn how to execute NHSTs but do not fully understand what “statistical significance” means. Misinterpretation of NHST results has been documented by Amrhein et al. (2019), Armstrong (2007), Cohen (1994), Fidler et al. (2004), Greenland et al. (2016), Oakes (1986), and Wasserstein and Lazar (2016).

Researchers misinterpret NHSTs, and reviewers for journals misinterpret NHSTs, and these are not the only users who have trouble with NHSTs. Haller and Krauss (2002) and Hubbard and Armstrong (2006) found that statistics teachers and professional statisticians also misinterpret NHSTs. Clearly, statistical education does not conquer the logical challenges posed by NHSTs.

Perhaps the most troublesome logical challenge is that when data are judged statistically significant and the null hypothesis is rejected, this judgment also rejects a key assumption that led to this conclusion. The estimation of statistical significance assumes that the null hypothesis is correct: if the null hypothesis is not correct, the probability computed in the NHST has an unknown error. But that finding of statistical significance is also a warning that this finding is based on an invalidated assumption.

Further logical issues arise when null hypotheses ignore prior research or contradict common sense, which they often do. Imagine a study to verify that the average height of women is less than the average height of men. The conventional null hypothesis would say that the two average heights are exactly the same—not merely similar, or nearly equal, but exactly equal to many decimal places. But researchers have seen thousands of men and thousands of women, so even without measuring anyone, the researchers know that if they indeed find equal average heights, something is wrong with their data. Exact equality would be practically impossible since the data are supposed to include random elements, including arbitrary choices of specific men and women to measure.

Furthermore, because it is not possible to measure the heights of every man and woman on earth, the available data must be a partial sample, so statements about the complete populations of men and women should allow for possible sampling errors. An unconditional statement that the average height of all men is greater than the average height of all women should incorporate some recognition that the data are not 100% complete.

Null hypotheses that demand exact equalities also cause NHSTs to misclassify probable random errors as “significant” findings (Mayo, 2006, pp. 808–809). An exact equality is an infinitesimally small point, and there are many ways that the measured quantities might differ from their correct values. Indeed, when samples are large, NHSTs are likely to interpret probable random errors as “significant” findings.

Looking back over his research career in 2016, a distinguished professor of psychology told a friend, “I have the fear that the field of experimental psychology over the past few decades is little more than Type 1 errors. I don’t know what is and isn’t real.”

Another deficiency is that an NHST gives no information about hypotheses other than its null. A clear yes-no answer to a simple hypothesis suppresses further questions about contingencies and does not encourage further research or practical use. Ronald and William learned that Muriel had more skill than they had expected, but they learned nothing about the limits of her skill, and their statistical analysis generated no suggestions for further research. Did Muriel’s judgments about the tea depend on the temperature of the teacup before anything was poured into it?

Neyman and Pearson (1933) reacted to concentration on a single null hypothesis by proposing that statistical analyses should compare competing plausible hypotheses (Sapra & Nundy, 2018). Considering alternative perspectives is often an essential component of progress toward greater understanding of complex phenomena because it is possible for two theories to both be useful in somewhat different contexts. Starbuck (2004) has encouraged a few doctoral students to frame their dissertations in two theoretical traditions—for example, a psychological theory versus a sociological theory, or a psychological theory versus an economic theory. The students who have done this have ended up arguing that each academic tradition contributed a part of “truth.”

Perhaps the greatest weakness of NHSTs is their disregard of previous research. For scientific knowledge to increase, new studies must build upon or around previous studies, but null hypotheses typically ignore previous studies. Webster and Starbuck (1988) analyzed nine relationships among variables that researchers have deemed to be important for job performance and job satisfaction. Surprisingly, later studies of these relationships did not produce evidence of clearer understanding than earlier studies had done, even though research for some of the relationships had continued
for as long as 55 years and had involved as many as 4000 studies. Throughout their research histories, effect sizes for four of these nine relationships had been approximately constant, and for the other five relationships, effect sizes had gradually decreased toward zero.

Webster and Starbuck proposed five reasons for these disappointing histories. (1) Researchers may have clung to incorrect hypotheses. (2) Researchers may have used familiar research methods after these methods ceased to add knowledge. (3) Characteristics of jobs or people may have changed faster than researchers’ theories improved. (4) Earlier studies that had large effect sizes may have had unreported or unobserved idiosyncrasies. (5) Researchers may have used confirmatory data-gathering strategies and attributed effects to the relations they had expected to see.

As indicated above, scholars have discussed and documented the limitations of statistical significance tests for decades. This topic, however, has received renewed attention after large-scale replication efforts in social psychology reported that only 36% of the statistically significant findings in the original studies were supported by statistically significant results in replications (Open Science Collaboration, 2015; Wasserstein et al., 2019).

What Can Researchers Do Instead of NHSTs?

Researchers have tried to eliminate NHSTs from journals in ecology, epidemiology, medicine, and psychology (Fidler et al., 2004; Sapra & Nundy, 2018), but professional associations and journal editors have not supported such reforms, so only in epidemiology, medicine, and public health have these efforts suppressed many NHSTs.

Inspired by efforts in other fields, Starbuck organized a 2005 symposium that reviewed the efforts to ban NHSTs in various fields and proposed that the Academy of Management should refuse to publish NHSTs in its journals. The symposium attracted a large audience, suggesting that many management researchers saw a need for reform. But after hearing about failed efforts to persuade other professional associations to ban NHSTs (Fidler et al., 2004), we guessed that the Academy of Management would be unlikely to ban NHSTs.

So, we began offering workshops to support researchers who might be seeking alternatives to NHSTs, and we gradually changed our workshops to reflect the questions and ideas of our audiences. Our initial sessions concentrated on describing the deficiencies of NHSTs, but our audiences reacted by asking, “What should we do instead of NHSTs?” So, we added speakers who described other ways to analyze data as well as how to deal with editors and reviewers who were expecting NHSTs. This process has led us to promote the following four approaches to extend the analysis of empirical data beyond the traditional application of NHSTs.

Effect Sizes

As a result of psychologists’ discussions about NHSTs, an increasing number of journals now require researchers to report effect sizes in addition to NHSTs. These reports may limit misunderstandings when “statistically significant findings” are so small that they have no practical significance, and the reports do not require researchers to cease familiar practices. However, reporting effect sizes does not counteract other detrimental consequences of NHSTs, such as oversimplification of subtle differences.

Null Models

Bioecologists began to compare data with “null models” after Connor and Simberloff (1983, 1986) argued that interactions within ecological communities make simple no-effect null hypotheses very unrealistic. Connor and Simberloff advised bioecologists to replace null hypotheses with non-causal random distributions for the studied variables, which they called null models. For example, bioecologists had debated for decades various theories about the numbers of species on each of the Galapagos Islands. Connor and Simberloff (1983) showed that the species data looked much like a random distribution that only took account of islands’ areas—larger islands had proportionately more species than smaller islands, and more complex theories added little.

Baseline Models

Baseline models are a large category of common-sense explanations that make weak or vague statements about causality. For example, Elliott (1973) showed that complex economic theories to predict short-run changes in the US national economy were no more accurate than two simpler explanations: (1) the economy three months from now will be the same as it is today, and (2) economic trends during the recent three months will continue for three more months. In another example from macroeconomics, Peach and Webb (1983, p. 697) compared calculations typical of complex macroeconomic theories with linear regression calculations using randomly chosen variables. The randomly chosen variables produced fits to the data as good as the “evidence supporting theoretical propositions in the literature.”

Bayesian Analyses

In the mid-1700s, Thomas Bayes created a formula that explains how researchers can revise their current expectations (stated as probability distributions) to take account of new data (Bayes & Price, 1763). Such revisions support the accumulation of knowledge by integrating new studies with previous research.
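As a minimal sketch of the revision Bayes’ formula supports (our illustration, not part of the original experiment), the following Python snippet updates a Beta prior over Muriel’s probability of classifying a cup correctly after she identifies eight cups in a row. The Beta-Binomial model, the skeptical prior, and the treatment of the eight cups as independent guesses are simplifying assumptions chosen for clarity.

```python
def beta_binomial_update(a, b, successes, failures):
    """Conjugate Bayesian update: a Beta(a, b) prior combined with
    binomial data yields a Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures

# Skeptical prior centered on pure guessing (mean 0.5); its weight is
# an assumption, roughly as informative as eight earlier guessed cups.
a0, b0 = 4.0, 4.0

# Muriel identifies all eight cups correctly.
a1, b1 = beta_binomial_update(a0, b0, successes=8, failures=0)

prior_mean = a0 / (a0 + b0)        # 0.5: no skill expected beforehand
posterior_mean = a1 / (a1 + b1)    # 12/16 = 0.75: belief revised upward

print(f"prior mean of p:     {prior_mean:.2f}")
print(f"posterior mean of p: {posterior_mean:.2f}")
```

Unlike the yes-or-no verdict of an NHST, the posterior is a full probability distribution that a later study can adopt as its prior, which is how Bayesian analyses integrate new studies with previous research.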

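Elliott’s two common-sense baselines lend themselves to a short sketch. The snippet below is our hypothetical illustration with made-up numbers, not Elliott’s data: it scores a no-change forecast and a continued-trend forecast on a quarterly series, the threshold a theory-based model would need to beat.

```python
def no_change_forecasts(series):
    """Baseline 1: next quarter equals the current quarter."""
    return [series[t] for t in range(len(series) - 1)]

def trend_forecasts(series):
    """Baseline 2: the most recent quarterly change continues."""
    return [series[t] + (series[t] - series[t - 1])
            for t in range(1, len(series) - 1)]

def mean_abs_error(forecasts, actuals):
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(forecasts)

# Hypothetical quarterly GNP-like series (made-up numbers).
gnp = [100.0, 102.0, 103.0, 105.0, 104.0, 106.0]

mae_no_change = mean_abs_error(no_change_forecasts(gnp), gnp[1:])
mae_trend = mean_abs_error(trend_forecasts(gnp), gnp[2:])

print(f"no-change baseline MAE: {mae_no_change:.2f}")
print(f"trend baseline MAE:     {mae_trend:.2f}")
# A substantive model earns credibility only if its error is clearly
# below these common-sense benchmarks.
```

Because these baselines require no theory, a theoretical model that fails to beat them has, in effect, explained nothing beyond common sense.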
Table 1. Alternatives to NHSTs.

Effect sizes. Advantages: evaluation of the substantive strength of effects. Disadvantages: often highly context-specific. References: Aguinis et al., 2010; Cumming, 2012.

Null models. Advantages: assume random effects instead of no effects to establish an effect threshold. Disadvantages: may exaggerate the explanatory power of randomness. References: Connor & Simberloff, 1983, 1986; Denrell et al., 2015.

Baseline models. Advantages: assume rudimentary common-sense explanations (e.g., stable effects or trends) to establish an effect threshold. Disadvantages: often highly context-specific with limited guidance from theory or prior research. References: Schwab & Starbuck, 2012, 2013.

Bayesian analyses. Advantages: provide probability statements for hypothesized effects and incorporate prior knowledge. Disadvantages: require researchers to learn new complex techniques. References: Gelman et al., 2014; McElreath, 2020.

Bayesian analysis attracted supporters, including distinguished mathematicians, but there were arguments. Opponents of Bayesian analysis said that probability distributions should only be based on large amounts of data, whereas proponents of Bayesian analysis said that probabilities are abstract concepts in people’s heads and could be based on theories or personal experiences. Also, Bayes’ formula required some very difficult algebra, which made Bayesian analysis an impractical option.

World War II brought important changes in the practicality of Bayesian analyses. The War stimulated the creation of large computing machines that were capable of making Bayesian analyses, and military analysts used Bayesian analyses to break the secret codes of Germany and Japan and to predict the locations of enemy resources (McGrayne, 2011; Simpson, 2010). These practical successes stimulated interest in Bayesian analyses during the postwar period, and decades of technological development produced fast electronic computers and user-friendly software that now make Bayesian analyses much easier (McElreath, 2020). In the 200 years from 1769 to 1969, only 15 books discussed Bayesian analyses. The 20 years from 1970 to 1989 brought 30 additional books, and the 10 years from 1990 to 1999 brought 60 more books. Bayesian analyses have become prevalent in the fields of education, electrical engineering, epidemiology, insurance, marketing, medicine, and psychology. Unfortunately, management research has lagged other fields in this respect.

Table 1 summarizes the ways to improve on NHSTs that we have been promoting.

What Will the Future Bring?

Because established scholars have vested interests and limited time to retool, substantive future improvements likely depend on young scholars. Dutton and Starbuck (1971) analyzed the changes that had occurred through a century of studies that used machines to simulate human behavior. They found that simulation studies had grown more rigorous and had gained stronger empirical support, but these changes had generally not been visible in the series of studies by established researchers, who showed strong tendencies to repeat the methodologies they had used previously. Methodological improvements had occurred because younger researchers had set higher standards for their studies than their predecessors.

A similar pattern has emerged in our symposia and workshops about the limitations of statistical significance tests. These sessions have been populated mainly by younger scholars, some of whom searched out courses about Bayesian statistics being taught in schools of education or engineering.

In the end, moving beyond NHSTs would benefit from corresponding institutional adjustments. Doctoral programs have started to teach more than just NHSTs. Journal editors can adjust publication guidelines to promote alternative approaches to interpreting empirical data. We all, when acting as reviewers, can encourage authors to provide more substantial evidence than just NHSTs. The more we all contribute, the faster these desired and needed methodological changes will arrive.

Acknowledgement
We thank Richard Stackman, David Hannah, and Simon Pek for useful ideas and helpful comments.

Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs
Andreas Schwab https://orcid.org/0000-0002-7968-1907
William H. Starbuck https://orcid.org/0009-0006-4555-0999

References
Aguinis, H., Werner, S., Lanza Abbott, J., Angert, C., Park, J. H., & Kohlhausen, D. (2010). Customer-centric science: Reporting significant research results with rigor, relevance, and practical impact in mind. Organizational Research Methods, 13(3), 515–539. https://doi.org/10.1177/1094428109333339
Amrhein, V., Greenland, S., McShane, B., & more than 800 signatories. (2019). Retire statistical significance. Nature, 567, 306–307. https://doi.org/10.1038/d41586-019-00857-9
Armstrong, J. S. (2007). Significance tests harm progress in forecasting. International Journal of Forecasting, 23(2), 321–327. https://doi.org/10.1016/j.ijforecast.2007.03.004
Bayes, T., & Price, R. (1763). An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S. Philosophical Transactions of the Royal Society of London, 53, 370–418. https://doi.org/10.1098/rstl.1763.0053
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997
Connor, E. F., & Simberloff, D. (1983). Interspecific competition and species co-occurrence patterns on islands: Null models and the evaluation of evidence. Oikos, 41, 455–465. https://doi.org/10.2307/3544105
Connor, E. F., & Simberloff, D. (1986). Competition, scientific method, and null models in ecology. American Scientist, 74(2), 155–162. http://www.jstor.org/stable/27854031
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge/Taylor & Francis Group.
Denrell, J., Fang, C., & Liu, C. (2015). Chance explanations in the management sciences. Organization Science, 26(3), 923–940. https://doi.org/10.1287/orsc.2014.0946
Dutton, J. M., & Starbuck, W. H. (1971). The history of simulation models. In J. M. Dutton & W. Starbuck (Eds.), Computer simulation of human behavior (pp. 9–102). Wiley.
Elliott, J. W. (1973). A direct comparison of short-run GNP forecasting models. Journal of Business, 46(1), 33–60. https://doi.org/10.1086/295506
Fidler, F., Cumming, G., Burgman, M., & Thomason, N. (2004). Statistical reform in medicine, psychology, and ecology. Journal of Socio-Economics, 33(5), 615–630. https://doi.org/10.1016/j.socec.2004.09.035
Fischer, H. (2010). A history of the central limit theorem: From classical to modern probability theory. Springer.
Fisher, R. A. (1925). Statistical methods for research workers. Oliver and Boyd.
Fisher, R. A. (1935). The design of experiments. Oliver and Boyd.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). Chapman & Hall/CRC.
Greenland, S., Senn, S. J., & Rothman, K. J. (2016). Statistical tests, p-values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31, 337–350. https://doi.org/10.1007/s10654-016-0149-3
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research, 7(1), 1–20.
Hubbard, R., & Armstrong, J. S. (2006). Why we don’t really know what statistical significance means: Implications for educators. Journal of Marketing Education, 28(2), 114–120. https://doi.org/10.1177/0273475306288399
Mayo, D. (2006). Statistics, philosophy of. In S. Sarkar & J. Pfeifer (Eds.), The philosophy of science: An encyclopedia (pp. 802–815). Routledge.
McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan (2nd ed.). CRC Press.
McGrayne, S. B. (2011). The theory that would not die: How Bayes’ rule cracked the Enigma code, hunted down Russian submarines, and emerged triumphant from two centuries of controversy. Yale University Press.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009
Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. Wiley.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Peach, J. T., & Webb, J. L. (1983). Randomly specified macroeconomic models: Some implications for model selection. Journal of Economic Issues, 17(3), 697–720. https://doi.org/10.1080/00213624.1983.11504150
Salsburg, D. (2002). The lady tasting tea: How statistics revolutionized science in the twentieth century. Henry Holt.
Sapra, R. L., & Nundy, S. (2018). Why the p-value is under fire? Current Medicine Research and Practice, 8(6), 222–229. https://doi.org/10.1016/j.cmrp.2018.10.003
Schwab, A., & Starbuck, W. H. (2012). Using baseline models to improve theories about emerging markets. In C. Wang, D. Bergh, & D. Ketchen (Eds.), Research methodology in strategy and management (Vol. 7, pp. 3–33). Emerald.
Schwab, A., & Starbuck, W. H. (2013). Why baseline modeling is better than null-hypothesis testing: Examples from research about international management, developing countries, and emerging markets. In T. Devinney, T. Pedersen, & L. Tihanyi (Eds.), Advances in international management (Vol. 26, pp. 171–195). Emerald.
Simpson, E. (2010). Bayes at Bletchley Park. Significance, 7(2), 76–80. https://doi.org/10.1111/j.1740-9713.2010.00424.x
Stang, A., Deckert, M., Poole, C., & Rothman, K. J. (2016). Statistical inference in abstracts of major medical and epidemiology journals 1975–2014: A systematic review. European Journal of Epidemiology, 32, 21–29. https://doi.org/10.1007/s10654-016-0211-1
Starbuck, W. H. (2004). Why I stopped trying to understand the real world. Organization Studies, 25(7), 1233–1254. https://doi.org/10.1177/0170840604046361
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05”. The American Statistician, 73(sup1), 1–19. https://doi.org/10.1080/00031305.2019.1583913
Webster, J., & Starbuck, W. H. (1988). Theory building in industrial and organizational psychology. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (pp. 93–138). Wiley.
