Statistical Science

2023, Vol. 38, No. 4, 655–671


https://doi.org/10.1214/23-STS899
© Institute of Mathematical Statistics, 2023

Tracking Truth Through Measurement and the Spyglass of Statistics

Antonio Possolo

Abstract. The measurement of a quantity is reproducible when mutually independent, multiple measurements made of it yield mutually consistent measurement results, that is, when the measured values, after due allowance for their associated uncertainties, do not differ significantly from one another. Interlaboratory comparisons organized deliberately for the purpose, and meta-analyses that are structured so as to be fit for the same purpose, are procedures of choice to ascertain measurement reproducibility.

The realistic evaluation of measurement uncertainty is a key preliminary to the assessment of reproducibility because lack of reproducibility manifests itself as dispersion or variability of measured values in excess of what their associated uncertainties suggest that they should exhibit. For this reason, we review the distinctive traits of measurement in the physical sciences and technologies, including medicine, and discuss the meaning and expression of measurement uncertainty.

This contribution illustrates the application of statistical models and methods to quantify measurement uncertainty and to assess reproducibility in four concrete, real-life examples, in the process revealing that lack of reproducibility can be a consequence of one or more of the following: intrinsic differences between laboratories making measurements; the choice of statistical model and of procedure for data reduction; or causes yet to be identified.

Despite the instances of lack of reproducibility that we review, and many others like them, the outlook is optimistic. First, because “lack of reproducibility is not necessarily bad news; it may herald new discoveries and signal scientific progress” (Nat. Phys. 16 (2020) 117–119). Second, and as the example about the measurement of the Newtonian constant of gravitation, G, illustrates, when faced with a reproducibility crisis the scientific community often engages in cooperative efforts to understand the root causes of the lack of reproducibility, leading to advances in scientific knowledge.

Key words and phrases: Avandia, common mean, fixed effect, COVID-19, Newtonian constant of gravitation, Rosiglitazone, dark uncertainty, heterogeneity, interlaboratory study, meta-analysis, random effects, repeatability, replicability, reproducibility, reproduction number, W boson.

Antonio Possolo is NIST Fellow and Chief Statistician, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, U.S.A. (e-mail: [email protected]).

1. INTRODUCTION

This contribution reviews how organized comparisons (interlaboratory studies), and meta-analyses of measurement results obtained in different studies or experiments, and the evaluation of measurement uncertainty that underlies them, can contribute to gauging and improving reproducibility in the physical sciences and in medicine. The nature and role of measurement in the social and behavioral sciences, including the education sciences, and attendant issues of reproducibility, lie beyond the scope of this review.

The use of statistical models and of methods of statistical data analysis is illustrated in several examples involving uncertainty evaluations and the intercomparison of measurement results, highlighting the characterization of reproducibility and indicating the role that the evaluation of measurement uncertainty plays in the process.

The article is intended for statisticians concerned with the assessment of reproducibility in measurement as practiced in national metrology institutes like the National Institute of Standards and Technology (NIST) of the U.S., as well as in many other laboratories where measurements are made that support the practice of medicine, engineering, environmental studies and forensic investigations, and that ensure the quality of food, therapies and industrial products.

The article is also intended for physical scientists, medical doctors, engineers, laboratory technicians and others who make measurements and employ statistical methods to assess reproducibility via interlaboratory studies and meta-analyses, and who also wish to gain some appreciation for how the evaluation of measurement uncertainty underlies the assessment of reproducibility.

Section 2 uses the Newtonian constant of gravitation as an example to explain the meaning of notational conventions that are widely used in metrology but that statisticians may be unfamiliar with, and which are used throughout this contribution.

Since measurement plays a key role in science and technology, both the credibility of scientific results and the reliability of technologies hinge on measurement quality, which is the topic of Section 3.

Section 4 discusses the meaning of “reproducibility” and of related concepts. Section 5 presents a reanalysis, employing contemporary techniques, of a historical data set that John Mandel used to illustrate his pioneering approach to characterize measurement reproducibility and repeatability.

Sections 6 (assessment of the risks of a particular therapy), 7 (estimation of the reproduction number of COVID-19) and 8 (measurement of the Newtonian constant of gravitation) provide additional illustrations of how the statistical intercomparison of measurement results contributes to the assessment of reproducibility.

Section 9 gathers some lessons learned about how the application of statistical models and methods can quantify the reproducibility of the conclusions of scientific studies, and in the process increase their trustworthiness, thereby advancing scientific knowledge.

The title chosen for this contribution alludes to the tracking theory of knowledge developed by Nozick [64] and by Roush [79], at the same time as it evokes the dynamic nature of exploratory and confirmatory statistical data analysis, as they “track” the scent of truth in empirical data, thus fulfilling the allegorical role of a spyglass that delivers reliable knowledge built upon reproducible findings.

2. NOTATIONAL CONVENTIONS

The term standard uncertainty, and the notation used to denote it, occur repeatedly throughout this contribution, as in u(G) = 0.000122 × 10⁻¹¹ m³ kg⁻¹ s⁻², which Schlamminger et al. [84] reported as the standard uncertainty associated with a measurement of the Newtonian constant of gravitation made at the University of Zürich, Switzerland, G = (6.674252 ± 0.000122) × 10⁻¹¹ m³ kg⁻¹ s⁻² (this is the result labeled UZur-06 in Figure 8). There are three different conventions in play here:

• Since the true value of G is unknown, G is modeled as a random variable whose probability distribution characterizes the uncertainty surrounding its true value, yet without impugning the fact that, according to current understanding, G has had a unique, essentially invariant true value throughout most of the history of the universe [59, 25].
• The standard uncertainty, u(G), is the standard deviation of G’s distribution. However, since this distribution also comprises uncertainty contributions that are not expressed in the data, for example, uncertainty in the calibration of measuring instruments, metrology uses a term conceived to be more inclusive than “standard error.”
• The expression for the value of G includes the parenthetic notation “6.674252(122),” which is shorthand for “6.674252 ± 0.000122,” indicating that the digits between parentheses express the standard uncertainty and affect the same number of trailing digits of the value of G while disregarding the location of the decimal point. This parenthetic notation is commonly employed to report measurement results concisely in the scientific literature, as well as in Sections 4, 7 and 8 of this contribution.

3. MEASUREMENT AND MEASUREMENT QUALITY

3.1 Measurement

Measurement, the same as science generally, aims “to find out something” ([34], p. 287) based on empirical evidence, employing methods, both to obtain this evidence and to analyze it, that peer review determines to be sound and that enable empirically verifiable predictions, yielding results that can be essentially reproduced by others.

In practice, our measured values are approximations to the true values of the properties that we intend to measure. These estimates, alone, are of little value because they provide no assurances about their quality. For this reason, a bona fide measurement result comprises both a measured value and an evaluation of measurement uncertainty.

Broadly conceived, measurement is an experimental or computational process that produces an estimate of the true value of a property of a material or virtual object or collection of objects, or of a process, event or series of events, and satisfies these requirements [93, 70]:

(a) The estimate (measured value) is based on a comparison of the property of interest with a property of the same kind realized in a standard that is recognized as a common reference by the community of producers and users of the measurement result;
(b) The measured value is qualified with an evaluation of measurement uncertainty;
(c) The measurement result (measured value together with its associated measurement uncertainty) is used to inform an action or decision.

As an example of the comparison mentioned in (a), consider the Eiffel Tower: saying that it is 330 m tall means that its height is 330 times the length of the meter, which is the unit of length in the International System of Units (SI) [6].

The property that is measured (measurand) can be qualitative or quantitative. The species of the plant in NIST Standard Reference Material (SRM) 3246, Ginkgo biloba (Leaves), is a qualitative property. The mass fraction of tin in NIST SRM 3161a, Tin Standard Solution (Lot No. 140917), is a quantitative property whose certified value is 10.011 mg/g.

To satisfy requirement (b), the aforementioned estimate of the mass fraction of tin is qualified with an expression of measurement uncertainty, in the form of a confidence interval ranging from 9.986 mg/g to 10.036 mg/g.

Requirement (c) is exemplified by the decision to accept or reject a shipment of boxes of breakfast cereal, which depends on a measurement result for the mass of cereal in the boxes. This can be the average mass of cereal per box, for example, qualified with an evaluation of the associated measurement uncertainty.

3.2 Measurement Quality

The quality of a measurement is its trustworthiness: the extent to which measured values approximate the corresponding true values sufficiently closely for the purpose they are intended to serve, and do so with assuredly high confidence [71].

Such trustworthiness requires that measurement results be metrologically traceable to appropriate, widely recognized standards of reference, and that the associated uncertainty be small enough to warrant using the measured value in practice as a proxy for the corresponding true value.

Traceability is a property of a measurement result consisting of a documented series of comparisons that relate the measured value to a standard of reference, with each comparison being qualified by an evaluation of the associated measurement uncertainty [73]. Traceability thus guarantees that 1 kg of coffee weighed and sold in a supermarket in Cali, Colombia, has the same mass as 1 kg of coffee bought in Coimbra, Portugal, up to their respective, associated uncertainties.

Measurement uncertainty is the doubt about the true value of the measurand that remains after making a measurement ([75], p. 14). Bell [5] points out that to characterize the margin of this doubt, we need to answer two questions: “How big is the margin?” and “How bad is the doubt?” For NIST SRM 3161a, the size of the margin is gauged by half the length of the aforementioned confidence interval, and the severity of the doubt is expressed by the probability (5% in this case) that said interval does not include the true value of that mass fraction.

Confidence in measurement results can be strengthened by introducing known measurands in the measurement workflow that are indistinguishable from the materials or products that are being measured. Such check standards ([63], 2.1.2) were first used in mass measurement [69]. In general, they can be reference materials or calibrated devices delivering certified values whose associated uncertainty has been evaluated reliably.

The convergence toward a particular value as the same measurand is measured repeatedly over time, in independent experiments, is another indication that knowledge about it is solidifying. The histories of the measurements of the speed of light and of the Planck constant are notable examples of such convergence [52, 61].

Confidence in a measurement result is bolstered appreciably if one or several so-called primary methods of measurement are employed, and they produce measurement results that are essentially in agreement with one another. A primary measurement procedure is one that does not require calibration with a reference that delivers the same property that one intends to measure. Digital polymerase chain reaction (dPCR) is a primary measurement method for viral loads in samples of bodily fluids [85], and for many other measurements in molecular biology [67].

Coulometry ([35], Section 17-3) can be a primary method for determining the amount of a substance in a solution, which involves counting the number of electrons consumed in a chemical reaction involving that substance. This measurement method involves reference to standards of time and electrical current, but not to standards for the concentration of the substance ([4], 2.9.5).

In summary, measurement provides estimates of values of properties of interest to science and technology using recognized standards as references. Both measurement uncertainty and traceability, which characterize measurement’s reliability and validity, are attributes of measurement quality. The demonstration of mutual consistency between measurement results for the same measurand obtained independently of one another, that is, reproducibility (which we turn to next), is another quality attribute of measurement that bolsters the trustworthiness of measurement results.

4. REPRODUCIBILITY

A search for articles listed in the Web of Science that were published between January 1, 2020, and January 31, 2023, and that include the word “reproducibility” in their titles yielded 2524 results (retrieved on February 2, 2023).
These articles are from a very wide range of fields of science and technology, with the largest numbers relating to medical ethics and brain imaging, which together account for almost 14% of the total.

The epistemic value of reproducibility has long been recognized. Referring to measurement standards, Herschel [36] suggested that they ought to possess the “qualities of invariability, indestructibility and identical reproducibility,” as well as “some obvious claim to general acceptation as of common interest to all mankind.”
Viewing the issue from a different angle, Munafò et al. [60] argue that the debate around reproducibility, rather than a crisis, is an opportunity “to reflect on which aspects of relevant working practices continue to be effective, which can be improved, and which new ways of working can beneficially be introduced.” Similarly, Milton and Possolo [52] point out that “lack of reproducibility is not necessarily bad news; it may herald new discoveries and signal scientific progress.”

For example, the CDF Collaboration’s reanalysis of observations made at Fermilab’s Tevatron collider yielded 80 433(9) MeV/c² [19] as an estimate of the mass of the W boson, while the corresponding, previous result based on observations made at CERN’s Large Hadron Collider had yielded 80 370(19) MeV/c² [18]: their standardized difference is 3σ, which suggests a significant difference. However, an even more dramatic difference is obtained when the latest measurement result obtained by the CDF Collaboration is compared against the prediction that the Standard Model of particle physics makes for the mass of the W boson, 80 357(6) MeV/c² [32]: once standardized, this difference amounts to 7σ, and Science declared it to be “an upset to the standard model” [15].

4.1 Terminology

The Oxford English Dictionary defines reproducibility as “the extent to which consistent results are obtained when an experiment is repeated.” The meaning of “repeated,” or the sense in which repetition suffices to warrant reproducibility, requires clarification because it can have different flavors, and also because it encompasses a very wide spectrum of modalities.

Concerning its flavors: “repeating” can mean obtaining the same results again and again, or it can mean obtaining essentially the same results, even if not necessarily exactly the same results, where “essentially” means that the results from different repetitions cannot be distinguished once their respective uncertainties are taken into account.

This jibes with the understanding of replicability expressed by Fineberg et al. ([30], p. 3): “Two studies may be considered to have replicated if they obtain consistent results given the level of uncertainty inherent in the system under study.”

For example, the DELPHI Collaboration et al. [27] determined the mass of the W boson as 80 336(67) MeV/c², while the corresponding determination made by the L3 Collaboration [20] was 80 270(55) MeV/c². These measurement results are not identical, but they do not differ significantly, either statistically (their standardized difference is 0.8σ) or substantively.

Concerning the spectrum of modalities: at one end, we have repetition of the same experiment involving the same materials, apparatuses, methods and procedures, experimenters and place of execution; at the other end, the intended repetition is not of the experiment itself, but of reaching essentially the same conclusions that the original experiment had reached.

This second option in the spectrum of modalities involves measuring the same property, or more generally studying the same phenomenon, but using altogether different approaches, methods and procedures, applied by different experimenters working independently of the original ones, in different laboratories. It is generally agreed that this form of replication has greater epistemic value than the former, because it widens the realm of conditions under which essentially the same conclusions are reached.

FIG. 1. Mutually consistent set of three measurement results for the universal gas constant, R, obtained using different measurement methods, from the National Institute of Standards and Technology (NIST) of the United States (NIST-88), the Physikalisch-Technische Bundesanstalt of Germany (PTB-17) and jointly by the National Institute of Metrology of China and NIST (NIST/NIM-17). The diamonds indicate the measured values, and each vertical line segment represents a measured value plus or minus the reported standard uncertainty (1σ).

For example, Figure 1 shows three measurement results for the universal gas constant, $R = k N_A$, obtained independently of one another and using different measurement methods: $k$ denotes the Boltzmann constant and $N_A$ denotes the Avogadro constant. Two of these measurements were made shortly before the values of $k$ and $N_A$ were fixed as part of the 2019 redefinition of the International System of Units [6]. The result labeled PTB-17 was obtained using a dielectric-constant gas thermometer [31], and NIST/NIM-17 was obtained using a Johnson noise thermometer [76]. The result labeled NIST-88 was obtained much earlier, via acoustic gas thermometry [57].
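The standardized differences quoted in this section for the mass of the W boson (3σ, 7σ and 0.8σ) can be reproduced from the results as written in the parenthetic notation of Section 2. The following is a minimal sketch in Python, not taken from the article's Supplementary Material. It assumes that the two results being compared are independent, so that their standard uncertainties combine in quadrature, which the text does not state explicitly. The function names `parse_concise` and `standardized_difference` are illustrative choices.

```python
import math
import re


def parse_concise(text):
    """Return (value, standard uncertainty) from parenthetic notation.

    Handles forms such as '80 433(9)' and '6.674252(122)': the digits in
    parentheses line up with the trailing digits of the value.
    """
    match = re.fullmatch(r"(\d[\d ]*)(?:\.(\d+))?\((\d+)\)", text.strip())
    if match is None:
        raise ValueError(f"not in parenthetic notation: {text!r}")
    integer_part = match.group(1).replace(" ", "")
    decimals = match.group(2) or ""
    value = float(integer_part + ("." + decimals if decimals else ""))
    # The uncertainty digits share the decimal scale of the value's last digit.
    uncertainty = int(match.group(3)) / 10 ** len(decimals)
    return value, uncertainty


def standardized_difference(a, b):
    """Absolute difference divided by the combined standard uncertainty.

    Assumption: the two results are independent, so their standard
    uncertainties are combined in quadrature.
    """
    (xa, ua), (xb, ub) = parse_concise(a), parse_concise(b)
    return abs(xa - xb) / math.hypot(ua, ub)


# W boson mass results quoted in the text, in MeV/c^2.
comparisons = [
    ("CDF vs LHC", "80 433(9)", "80 370(19)"),
    ("CDF vs Standard Model", "80 433(9)", "80 357(6)"),
    ("DELPHI vs L3", "80 336(67)", "80 270(55)"),
]
for label, a, b in comparisons:
    print(f"{label}: {standardized_difference(a, b):.1f} sigma")
# CDF vs LHC: 3.0 sigma
# CDF vs Standard Model: 7.0 sigma
# DELPHI vs L3: 0.8 sigma
```

Under the independence assumption, the three printed values agree with the 3σ, 7σ and 0.8σ reported in the text.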
The meaning of reproducibility varies considerably across the scientific literature. Gundersen ([33], Table 1) mentions no fewer than sixteen published, different definitions of reproducibility, recognizes that there are different types and levels of reproducibility, and proposes this definition: “the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators.”
The National Academies of Sciences, Engineering, and Medicine (NASEM) use reproducibility as synonymous with computational reproducibility ([30], p. 4) and define it as “obtaining consistent results using the same input data, computational steps, methods and code, and conditions of analysis.” In this sense, reproducibility is less demanding than replicability, which NASEM defines as “obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.”

Plesser [68] emphasizes the terminology prevailing in chemistry and in measurement science, which inspired the understanding of repeatability, reproducibility and replicability originally adopted by the Association for Computing Machinery (ACM).

5. QUANTIFYING REPRODUCIBILITY AND REPEATABILITY

Well before the reproducibility “crisis” became a topic of conversation, for example, in a briefing entitled “Trouble at the lab,” which The Economist published on October 18, 2013, John Mandel [46], a statistician working at the National Bureau of Standards (which became the National Institute of Standards and Technology in 1988), defined repeatability as “the variability (or rather smallness of variability) between replicate results obtained on the same material within a single laboratory,” and reproducibility as “the (smallness of) variability between results obtained on the same material in different laboratories,” adding that “more exact definitions are needed.”

We will review Mandel’s concept of these “more exact definitions” in a reanalysis of the results of an interlaboratory study employing contemporary models and methods of statistical data analysis. The study produced 364 determinations of the stress at 600% elongation, of I = 7 different specimens of natural rubber, obtained by J = 13 laboratories, each of which made K = 4 replicated determinations for each specimen ([46], Table 1). These determinations, and the R code used to analyze them, are listed in the Supplementary Material for this article [72].

The model we shall adopt for these determinations is a linear, mixed-effects model,

\[ y_{ijk} = \mu_i + \lambda_j + \varepsilon_{ijk}, \tag{1} \]

where $\mu_i$ denotes the true mean value of the stress for material $i$, the $\{\lambda_j\}$ denote laboratory (“random”) effects with mean 0 and standard deviation τ and the $\{\varepsilon_{ijk}\}$ denote measurement errors with mean 0 and standard deviation σ, for material $i = 1, \dots, I$, laboratory $j = 1, \dots, J$ and replicate $k = 1, \dots, K$. Owing to the marked heteroscedasticity of the raw values of stress (Figure 2), we will conduct all the analyses using the logarithms of the observed values of stress.

FIG. 2. Left panel: Boxplots for the raw values of stress at 600% elongation of the 7 rubber specimens (A, ..., G), determined by 13 laboratories. Each boxplot summarizes 13 × 4 = 52 determinations of stress. Right panel: Corresponding boxplots of the residuals from the linear, Gaussian mixed effects model fitted to the determinations using R function lmer defined in package lme4 [3].

Discussing the presence of apparently outlying observations in interlaboratory studies, Mandel ([47], p. 111) points out that “There is a great temptation to reject such outliers, that is, to discard them from the data prior to calculating precision or accuracy parameters,” and adds: “We do not recommend rejection on the basis of purely statistical considerations. Our main reason is that while such rejection procedures always improve the appearance of the data, for example, by drastically reducing the standard deviations, they do nothing in terms of avoiding future instances of outlying results. They have simply sharply reduced the field to which the inferences from the study apply. [...] It is our opinion that the blind application of tests of significance to interlaboratory data for the purpose of rejecting outliers is logically invalid and practically harmful.”

We have expressed similar reservations about rejecting measurement results based on “purely statistical considerations” [41, 74]. The Analytical Methods Committee of the Royal Society of Chemistry considered the issue at length more than 30 years ago, and issued recommendations for how not to reject outliers [21, 22].

For the experiment concerned with rubber elongation, in the absence of a substantive reason to reject any of the observations under consideration, we will replace the assumption that the measurement errors $\{\varepsilon_{ijk}\}$ are Gaussian,
with the assumption that they are a sample from a Student’s t-distribution whose number of degrees of freedom will be estimated in the course of fitting the model to the data, in the version of the model where the $\{y_{ijk}\}$ in equation (1) denote the logarithms of the observed values of stress.

Figure 3 shows boxplots of the residuals and a QQ-plot for the posterior means of the laboratory effects corresponding to the aforementioned mixed effects model fitted using a Bayesian procedure implemented using R function brm defined in package brms [13, 14], with the student family specification, using Stan [16, 87] and R [86] codes listed in the Supplementary Material [72].

FIG. 3. Left panel: Boxplots of the residuals from fitting a Bayesian linear mixed-effects model to the logarithms of stress, with Gaussian laboratory effects and Student’s t measurement errors, using R function brm defined in package brms. Right panel: QQ-plot of the posterior means for the laboratory effects in the Bayesian mixed effects model with Gaussian random laboratory effects and Student’s t measurement errors.

The prior distributions for the (fixed) effects attributable to differences between rubber specimens were essentially noninformative Gaussian distributions. The priors for τ and σ were half-Cauchy distributions. A single σ as standard deviation for all the measurement errors seems justified by the sufficient homoscedasticity apparent in the left panel of Figure 3, and the assumption of Gaussian laboratory effects is justified by the QQ-plot in the right panel of the same figure. The prior distribution for the number of degrees of freedom, ν, of the Student’s t-distribution for the $\{\varepsilon_{ijk}\}$ was gamma such that with 95% prior probability, 1 < ν < 45.

Figure 4 shows posterior probability density estimates of the laboratory effects, indicating that several of the laboratory effects differ significantly from 0, hence that there is significant heterogeneity (between-laboratory variability), or dark uncertainty [89], that is, the laboratory averages, once adjusted for the effects of the different rubbers, are more dispersed than they should be considering the variability of the replicated determinations that individual laboratories made on each rubber.

FIG. 4. Posterior probability density estimates of the laboratory effects, $\{\lambda_j\}$. Each shaded area amounts to 95% of the posterior probability.

Figure 5 shows estimates of the posterior densities of the number of degrees of freedom (ν) for the Student’s t-distribution of the measurement errors, and also of the between-laboratory (τ) and within-laboratory (σ) standard deviations, which we will use next to quantify the repeatability and reproducibility achieved in this study. The mean of the posterior distribution of the number of degrees of freedom of the Student’s t-distribution adopted for the measurement errors, $\{\varepsilon_{ijk}\}$, was 2.7(5).

FIG. 5. Posterior probability density estimates of the between-laboratory (τ) and within-laboratory (σ) standard deviations, and of the number of degrees of freedom (ν) for the Student’s t-distribution of the measurement errors. The dots indicate the posterior medians.

Mandel ([46], p. 78) quantified repeatability in terms of “a quantity that will be exceeded only about 5 percent of the time by the difference, taken in absolute value, of two randomly selected test results obtained in the same laboratory on a given material.” Here, test result means an average of 4 replicated determinations that a laboratory makes for a rubber specimen. In this conformity, (lack of) repeatability is quantified as

\[ r = 2\sqrt{2}\,\hat{\sigma}/\sqrt{K} = 2\sqrt{2}\,(0.072)/\sqrt{4} = 0.10, \]

and (lack of) reproducibility is quantified as

\[ R = 2\sqrt{2}\,\sqrt{\hat{\tau}^2 + \hat{\sigma}^2/K} = 2\sqrt{2}\,\sqrt{0.185^2 + 0.072^2/4} = 0.53. \]

In the expressions for both r and R, the first “2” is the rounded value of the 97.5th percentile of the standard Gaussian distribution. $\hat{\sigma}$ and $\hat{\tau}$ denote the medians of
the respective posterior distributions; they are unitless be- This reanalysis shows that contemporary tools for sta-
cause the analysis is being done using the logarithms of tistical modeling and data analysis, which were not avail-
the values of stress, and the logarithm “swallows” units able in John Mandel’s time, afford great flexibility for ac-
as can be seen by its series expansion presented in [65], curate modeling. For example, replacing the assumption
4.6.4. that measurement errors are Gaussian with the assump-
It is important to realize that these quantifications of re- tion that they follow a Student’s t-distribution can be han-
peatability and of reproducibility are supported by differ- dled easily in the context of a Bayesian model owing to
ent amounts of evidence. In fact, the evaluation of repeata- the availability of Markov chain Monte Carlo sampling.
bility is based on the variability of 13 groups of 28 in- Also, suitably chosen reexpression (which in this case
dividual determinations of stress each (whose logarithms is as simple as taking logarithms) can go a long way to-
have approximately constant variance), while the evalu- ward simplifying the analysis and increasing the adequacy
ation of reproducibility is based on the variability of 91 of statistical models to data ([58], Chapter 5). However,
averages (13 for each of 7 rubber specimens). the fundamental insights and specific proposals that John
Mandel ([46], p. 79) noticed that the different amounts Mandel offered 50 years ago, about how to quantify re-
of evidence that support the evaluations of repeatability peatability and reproducibility, withstood the test of time,
and reproducibility can be captured using the following and continue to be valuable.
fact pointed out by Blackman and Tukey ([8], p. 208): if
V is a multiple of a chi-square random variable with m 6. ROSIGLITAZONE
degrees of freedom, for example, when V is an estimate
of a variance component, then its coefficient of variation, On July 22, 2007, The New York Times reported that Dr.
√ Steven Nissen’s “questioning of the safety of the Avandia
CV, is 2/m. For this reason, Blackman and Tukey [8]
propose 2/(CV)2 as an equivalent number of degrees of diabetes medication in late May” had “prompted a federal
freedom (also called degrees of firmness [9], p. 290) sup- safety alert and led to a sales decline of about 30 percent
porting V . for the drug,” which had earned GlaxoSmithKline (GSK)
To compute the degrees of firmness of the repeatabil- $3.2 billion in 2006.
ity, r, and of the reproducibility, R, one can either simply The basis for that questioning was a meta-analysis [62]
compute their respective coefficients of variation based of 42 clinical studies of the risk of myocardial infarc-
on the MCMC samples drawn from the posterior distri- tion and death from cardiovascular causes seemingly as-
butions of σ and τ , or possibly better, employ an analog sociated with the use of rosiglitazone, which is the ac-
of the coefficient of variation that may be less sensitive tive ingredient of Avandia. The results of each of these
to the asymmetry of these posterior distributions, whose studies can be summarized in a 2 × 2 table, for exam-
densities are depicted in Figure 5. In this particular case, ple, Table 1 for the ADOPT study [91, 39], which was a
the two options produce very similar assessments of the randomized, double-blind, parallel-group study involving
degrees of firmness of r and of R. 4351 patients with recently diagnosed type 2 diabetes.
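The Blackman and Tukey rule, 2/(CV)^2, can be sketched in a few lines. The example below is only illustrative: the function names are hypothetical, the draws are synthetic (a scaled chi-square variable standing in for MCMC samples of a variance component), and the robust variant assumes one particular reading of "an analog of the coefficient of variation", namely the half-length of the interval centered at the median that holds 68% of the samples, divided by the median.

```python
import random
import statistics


def degrees_of_firmness(samples):
    """Equivalent degrees of freedom 2 / CV^2, with CV = sd / mean."""
    cv = statistics.stdev(samples) / statistics.mean(samples)
    return 2.0 / cv ** 2


def robust_degrees_of_firmness(samples, coverage=0.68):
    """Same rule with a robust CV analog (an assumption of this sketch):
    half-length of the median-centered interval containing `coverage`
    of the samples, divided by the median."""
    med = statistics.median(samples)
    deviations = sorted(abs(x - med) for x in samples)
    half_length = deviations[int(coverage * (len(deviations) - 1))]
    return 2.0 / (half_length / med) ** 2


# Synthetic stand-in for posterior draws: chi-square(m) / m, whose CV is sqrt(2/m),
# so both functions should return values near m.
random.seed(1)
m = 50
draws = [random.gammavariate(m / 2.0, 2.0) / m for _ in range(100_000)]
firmness = degrees_of_firmness(draws)
robust_firmness = robust_degrees_of_firmness(draws)
print(round(firmness, 1), round(robust_firmness, 1))
```

With real MCMC output, `draws` would be replaced by the posterior samples of the quantity of interest; the two versions are expected to agree closely only when the posterior is not strongly asymmetric.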
The robust version of the degree of firmness for r is All together, the 42 studies whose results are listed
computed as the ratio between half the length of a 68% in Nissen and Wolski ([62], Table 3) involved 27 833
credible interval for σ centered at the posterior median patients. The prevalence of myocardial infarction was
of σ , and this posterior median. The value of this ratio around 0.6% in both the rosiglitazone and control groups.
is 338. The robust version of the degree of firmness for In four of these studies, there were no cases of my-
R, defined similarly, is 48. Hence, and not unexpectedly, ocardial infarction either in the rosiglitazone group or in
the evaluation of repeatability has about 7 times greater the control group. These four were therefore excluded
firmness than the evaluation of reproducibility. from consideration by those methods of data reduction
In general, repeatability depends both on the measur-
and and on the particular laboratory making the measure- TABLE 1
ments, while reproducibility depends on the measurand Results of the ADOPT study, where patients were randomized to
receive double-blinded rosiglitazone, glyburide or metformin, and
and on the class of laboratories that the laboratories par-
were treated for periods of 4 years median duration, as originally
ticipating in the study actually represent. reported by Kahn et al. [39], Table 2, and transcribed by Nissen and
Also in this case, the logarithmic transformation of the Wolski [62], Table 3
values of stress, together with the adjustment for differ-
ences between the rubber specimens accomplished by the Myocardial infarction
mixed effects model, achieved sufficient homoscedastic-
Yes No Total
ity within-laboratories, and also enabled using a single τ
to quantify the between-laboratories variability, so as to Rosiglitazone Group 27 1429 1456
justify pooling the results and producing single evalua- Control Group 41 2854 2895
tions of repeatability and reproducibility.

TABLE 2
Estimates and lower (LWR) and upper (UPR) endpoints of 95%
confidence intervals for the odds ratio (OR) comparing the effects of
rosiglitazone and control on myocardial infarction

                     OR      LWR     UPR
Peto                 1.428   1.031   1.979
Mantel-Haenszel      1.427   1.029   1.978
Weighted Median      1.300   1.001   2.014
DerSimonian-Laird    1.286   0.940   1.759
REML                 1.286   0.940   1.759
Bayes                1.280   0.928   1.762

which we have employed for this reanalysis that take estimates of log odds ratios, and their associated uncertainties, as inputs: DerSimonian–Laird [28], REML [81] and a Bayesian procedure detailed below. Since neither Peto’s [95] nor Mantel-Haenszel’s [49] procedures require the calculation of log odds ratios, they used the results from all 42 studies.
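To make concrete why studies without events cannot feed procedures that take log odds ratios as inputs, the following minimal sketch computes the log odds ratio and its usual large-sample standard error for the ADOPT counts of Table 1. The function name is hypothetical, and the second study (no events in either group) is made up for illustration. Returning `None` for any table with a zero cell is a simplification of this sketch: a continuity correction is a common alternative that is not shown here.

```python
import math


def log_odds_ratio(events_trt, n_trt, events_ctl, n_ctl):
    """Log odds ratio and large-sample standard error from a 2 x 2 table.

    Returns None when any cell is zero, because the log odds ratio or its
    standard error is then undefined without some correction.
    """
    a, b = events_trt, n_trt - events_trt   # treatment: events, non-events
    c, d = events_ctl, n_ctl - events_ctl   # control: events, non-events
    if min(a, b, c, d) == 0:
        return None
    log_or = math.log((a * d) / (b * c))
    std_err = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, std_err


# ADOPT study (Table 1): 27 of 1456 versus 41 of 2895 myocardial infarctions.
adopt = log_odds_ratio(27, 1456, 41, 2895)
# Hypothetical study with no events in either group.
no_events = log_odds_ratio(0, 100, 0, 100)
print(adopt, no_events)
```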
FIG. 6. Forest plot showing 95% confidence intervals (thick, horizontal (light blue) bars) for the log odds ratios for rosiglitazone versus control in the 38 studies listed in Nissen and Wolski [62], Table 3, that had at least one death in the control group.

Nissen and Wolski [62] chose Peto’s method for their data reductions, which was a very reasonable choice considering the findings reported by Bradburn et al. [12]: that, in a comparative evaluation of the performance of 12 methods for pooling rare events (with event rates below 1%), Peto’s method was the least biased and most powerful method, and provided the best confidence interval coverage, provided there was no substantial imbalance between treatment and control group sizes within trials, and treatment effects were not exceptionally large, which is generally the case for these trials that involved rosiglitazone.
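Peto’s one-step estimator can be sketched as follows, under the textbook formulation in which each table contributes the difference between observed and expected events in the treatment group, O − E, and a hypergeometric variance, V. The function name and the second, event-free table are hypothetical, and the implementation in package metafor may differ in details. The sketch also shows why studies with no events in either group can stay in the data set: they add zero to both sums.

```python
import math


def peto_pooled_odds_ratio(tables):
    """Peto one-step pooled odds ratio with an approximate 95% interval.

    tables: iterable of (events_trt, n_trt, events_ctl, n_ctl).
    """
    sum_o_minus_e = 0.0
    sum_v = 0.0
    for events_trt, n_trt, events_ctl, n_ctl in tables:
        n = n_trt + n_ctl
        m1 = events_trt + events_ctl        # total events
        m2 = n - m1                         # total non-events
        sum_o_minus_e += events_trt - n_trt * m1 / n
        sum_v += n_trt * n_ctl * m1 * m2 / (n ** 2 * (n - 1))
    log_or = sum_o_minus_e / sum_v
    std_err = 1.0 / math.sqrt(sum_v)
    return (math.exp(log_or),
            math.exp(log_or - 1.96 * std_err),
            math.exp(log_or + 1.96 * std_err))


adopt_only = peto_pooled_odds_ratio([(27, 1456, 41, 2895)])
# Adding a hypothetical study without any events leaves the result unchanged.
with_empty = peto_pooled_odds_ratio([(27, 1456, 41, 2895), (0, 100, 0, 100)])
print(adopt_only, with_empty)
```

Applied to ADOPT alone this is a single-study illustration, not a reproduction of the pooled values in Table 2, which require all the studies listed by Nissen and Wolski.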
Table 2 lists the estimates of the odds ratio, and corresponding 95% confidence intervals, resulting from pooling the results from the trials listed in Nissen and Wolski ([62], Table 3) using six different statistical procedures.
The methods of Peto, Mantel-Haenszel, DerSimonian-Laird and REML were applied as implemented in R function rma of package metafor [92]. Figure 6 depicts the log odds ratios for the different studies and the consensus log odds ratio corresponding to Peto’s method.
The model used in the Bayesian procedure corresponding to the last line of Table 2 modeled the log odds ratios as outcomes of Gaussian random variables, with the usual large sample approximation for their standard errors ([38], 9.2). The prior distribution for the mean log odds ratio was centered at 0 and had a large standard deviation (5), and the between-study standard deviation, τ, had a half-Cauchy prior distribution with median 0.05. The posterior distribution of τ had median 0.04. A 95% credible interval for τ ranged from 0.002 to 0.3. The model was implemented using R function brm defined in package brms [13], as detailed in the Supplementary Material [72].
The results for Peto’s method (first line in Table 2) reproduce the corresponding results in Nissen and Wolski ([62], Table 4), and the results from the Mantel-Haenszel procedure are in close agreement with Peto’s. For neither method does the 95% confidence interval straddle 1. However, the results from the last three procedures (DerSimonian-Laird, REML and Bayes) do not unequivocally corroborate the conclusion of the first two. The NIST Decision Tree [74] recommends that the results from the individual studies be combined using the weighted median, which produces the consensus value and 95% confidence interval (based on the non-parametric bootstrap) listed in the third line of Table 2.
The apparently increased risk of cardiovascular events associated with the use of rosiglitazone has been reexamined repeatedly since Nissen and Wolski [62] first rang the alarm bell in 2007, both via critical reanalyses [29] of the same data, and also considering the results of subsequent studies, for example, the RECORD study [37].
Following a recommendation that the European Medicines Agency made on September 23, 2010, to suspend the marketing authorizations for medications containing rosiglitazone, Avandia has been withdrawn from

use throughout the European Union (https://www.ema.europa.eu/en/medicines/human/EPAR/avandia). On July
2, 2012, The New York Times reported that “Glaxo-
SmithKline agreed to plead guilty to criminal charges and
pay $3 billion in fines for promoting its best-selling an-
tidepressants for unapproved uses and failing to report
safety data about a top diabetes drug” [88]—the diabetes
drug was Avandia.
The principal lesson that can be drawn from this example is that different statistical models and methods of data analysis, which may all be comparably adequate for the task at hand, can lead to markedly different conclusions when they are applied to the same data. In this case, three out of the six methods whose results are summarized in Table 2 suggest that the use of rosiglitazone induces a significant risk of myocardial infarction, while the other three do not corroborate such a conclusion. Therefore, differences between models and between methods of data reduction can pose a challenge to the reproducibility of research results impacting an issue of the greatest interest in public health.

7. REPRODUCTION NUMBER

The UK Health Security Agency (UKHSA) has been publishing consensus values weekly, since May 2020, for the reproduction number, R, of COVID-19. The UKHSA explains it thus: “the reproduction number (R) is the average number of secondary infections produced by a single infected person. An R value of 1 means that on average every person who is infected will infect 1 other person, meaning the total number of infections is stable. If R is 2, on average, each infected person infects 2 more people. If R is 0.5, then on average for each 2 infected people, there will be only 1 new infection. If R is greater than 1, the epidemic is growing, if R is less than 1 the epidemic is shrinking.”
The consensus estimate results from blending estimates produced by different research groups, mostly from British universities, working independently of one another and using different models. Blending is done as an exercise in meta-analysis [45].
However, each research group reports several quantiles of the probability distribution that expresses the uncertainty surrounding R, while most procedures used for meta-analysis expect the mean and the standard deviation of R’s distribution as inputs. Maishman et al. ([45], Table 1) list the 5th, 25th, 50th, 75th and 95th percentiles for R’s distribution, as produced by each of eleven models for a particular (but unspecified) date and region of the UK.
For model 3 in Table 1 of [45], these percentiles are 0.64, 0.70, 0.74, 0.79 and 0.87, respectively. The procedure that Maishman et al. [45] use to derive estimates of the mean and of the standard deviation of R involves consideration of an estimate of the skewness of the distribution based on these percentiles. For these particular percentiles, the procedure reduces to modeling R’s distribution as being Gaussian with mean R = 0.74 and standard deviation u(R) = 0.079.
Considering that the eleven sets of percentiles listed in Maishman et al. ([45], Table 1) exhibit fairly mild skewness, we have adopted an alternative modeling approach that approximates the sample percentiles by corresponding percentiles of a skew-normal distribution [1].
The approach involves finding values of the parameters of the skew-normal distribution, ξ (location), ω (scale) and α (shape), that minimize $\sum_{i=1}^{5} (q_i - \theta_i)^2$, where the $\{q_i\}$ are the aforementioned sample percentiles, the $\{\theta_i = Q(p_i \mid \xi, \omega, \alpha)\}$ are the corresponding skew-normal percentiles, and Q denotes the quantile function of the skew-normal distribution. Once estimates of ξ, ω and α are in hand, the mean and standard deviation are computed as $\xi + \omega\delta\sqrt{2/\pi}$ and $\omega\sqrt{1 - 2\delta^2/\pi}$, where $\delta = \alpha/\sqrt{1 + \alpha^2}$.
Figure 7 shows the Gaussian cumulative distribution function and its skew-normal counterpart fitted to the percentiles that Maishman et al. ([45], Table 1) list for model 3, showing that the skew-normal model is appreciably more accurate than the Gaussian model. Table 3 lists the means and standard deviations imputed by Maishman et al. [45] for the eleven models, and their counterparts obtained using the skew-normal approximation.

FIG. 7. Gaussian (green) and skew-normal (orange) approximations to the sample quantiles Q(0.05) = 0.64, Q(0.25) = 0.70, Q(0.50) = 0.74, Q(0.75) = 0.79 and Q(0.95) = 0.87, which are represented by the (blue) dots. F(R) denotes the probability that the true value of the reproduction number will be less than or equal to R.

Table 4 reveals details of the differences induced by the two different methods used to impute the mean and standard deviation based on sample percentiles, and also the differences attributable to four different statistical models used to reduce the data to obtain a consensus value and to evaluate the associated uncertainty.
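The last step of the skew-normal approach, converting fitted parameters into a mean and a standard deviation, is simple enough to sketch directly. The function name is hypothetical and the parameter values below are made up for illustration; they are not the fitted values behind Table 3. The least-squares fit to the percentiles is not shown because it needs a skew-normal quantile function, which the Python standard library does not provide.

```python
import math


def skew_normal_mean_sd(xi, omega, alpha):
    """Mean and standard deviation of a skew-normal distribution with
    location xi, scale omega and shape alpha."""
    delta = alpha / math.sqrt(1.0 + alpha ** 2)
    mean = xi + omega * delta * math.sqrt(2.0 / math.pi)
    sd = omega * math.sqrt(1.0 - 2.0 * delta ** 2 / math.pi)
    return mean, sd


# With alpha = 0 the skew-normal reduces to a Gaussian: mean xi, sd omega.
print(skew_normal_mean_sd(0.74, 0.079, 0.0))   # (0.74, 0.079)

# Hypothetical right-skewed case: the mean moves above xi and the sd drops below omega.
print(skew_normal_mean_sd(0.68, 0.10, 2.0))
```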

TABLE 3
Estimates and standard uncertainties, $\{R_{G,j}\}$ and $\{u(R_{G,j})\}$, for the values of the reproduction number listed in Maishman et al. [45], Table 1, which are based on a Gaussian model, and their counterparts, $\{R_{SN,j}\}$ and $\{u(R_{SN,j})\}$, based on the skew-normal model, for epidemic models j = 1, ..., 12. Model 8 did not produce results for this reporting period

 j    R_G,j    u(R_G,j)   R_SN,j   u(R_SN,j)
 1    0.74     0.079      0.7435   0.0858
 2    0.7045   0.0742     0.7123   0.0388
 3    0.74     0.079      0.7466   0.0491
 4    0.75     0.2371     0.7576   0.2068
 5    0.7954   0.0028     0.7949   0.0020
 6    0.8329   0.0256     0.8361   0.0136
 7    0.7862   0.1233     0.7895   0.1142
 9    0.9382   0.1351     0.9437   0.0899
10    0.8302   0.0077     0.8302   0.0076
11    0.9293   0.0637     0.9314   0.0570
12    0.76     0.0608     0.7572   0.0636

The results listed in Table 4 for the DerSimonian–Laird procedure (DL) [28], the Mandel–Paule procedure (MP) [48] and the restricted maximum likelihood procedure (REML) [81] were all obtained using R function rma defined in package metafor [92]. The results for the hierarchical Bayesian procedure with Gaussian laboratory effects and Gaussian errors (HGG) were obtained using the NIST Decision Tree [74].
Even though the consensus values derived from the $\{(R_{G,j}, u(R_{G,j}))\}$ or from the $\{(R_{SN,j}, u(R_{SN,j}))\}$, using the different blending procedures (DL, HGG, MP, REML), do not differ significantly from one another, Table 4 does reveal differences worth noting from the viewpoint of reproducibility.
The estimates of the dark uncertainty, τ, in particular, are rather sensitive to the model employed to impute the mean and the standard deviation that correspond to a particular set of percentiles. This is not surprising because it simply expresses the fact that the values of the standard uncertainty, u(R), based on the skew-normal model are generally smaller than their counterparts that are based on the Gaussian model (Table 3).
The estimates of τ also are fairly sensitive to the statistical procedure used for the purpose: for example, HGG’s estimate of τ is 1.7 times as large as DL’s estimate (first two lines of the lower panel of Table 4). Even though this is not surprising either, considering that τ is a particularly challenging estimand [44, 43], it also influences the evaluation of u(R) [42], thus impacting reproducibility.
The foregoing retrospective of the development of a consensus estimate for the reproduction number of the COVID-19 pandemic reveals that apparently minor differences between fairly simple choices about how to prepare the data for an assessment of reproducibility can have their effects amplified when different procedures are then used to blend the results in a meta-analysis. In addition, those differences also impact the extent to which the reproducibility of the conclusions depends on the particular procedure employed for the meta-analysis.
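As an illustration of one of the blending procedures, the sketch below applies the standard method-of-moments form of the DerSimonian–Laird estimator to the Gaussian-based pairs of Table 3. It is a plain reimplementation written for this example, not the code of package metafor, and the function and variable names are hypothetical. Under those assumptions the output should come out close to the DL row in the upper section of Table 4.

```python
import math

# (estimate, standard uncertainty) pairs: Gaussian-based columns of Table 3.
TABLE3_GAUSSIAN = [
    (0.74, 0.079), (0.7045, 0.0742), (0.74, 0.079), (0.75, 0.2371),
    (0.7954, 0.0028), (0.8329, 0.0256), (0.7862, 0.1233), (0.9382, 0.1351),
    (0.8302, 0.0077), (0.9293, 0.0637), (0.76, 0.0608),
]


def dersimonian_laird(results):
    """Return (consensus value, its standard uncertainty, tau)."""
    values = [x for x, _ in results]
    weights = [1.0 / u ** 2 for _, u in results]
    sum_w = sum(weights)
    fixed = sum(w * x for w, x in zip(weights, values)) / sum_w
    # Cochran's Q measures dispersion in excess of the stated uncertainties.
    q = sum(w * (x - fixed) ** 2 for w, x in zip(weights, values))
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - (len(values) - 1)) / c)
    # Random-effects weights add the dark uncertainty to each variance.
    star = [1.0 / (u ** 2 + tau2) for _, u in results]
    consensus = sum(w * x for w, x in zip(star, values)) / sum(star)
    return consensus, 1.0 / math.sqrt(sum(star)), math.sqrt(tau2)


consensus, u_consensus, tau = dersimonian_laird(TABLE3_GAUSSIAN)
print(round(consensus, 4), round(u_consensus, 4), round(tau, 4))
```

Replacing `TABLE3_GAUSSIAN` with the skew-normal columns of Table 3 gives the corresponding input for the lower section of Table 4.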
TABLE 4
The upper section of the table lists the results of four alternative meta-analyses applied to the means and standard errors imputed using the method described by Maishman et al. [45]. The lower section lists their counterparts for the method that uses the skew-normal distribution. The four procedures (DL, HGG, MP, REML) used to blend the results in Table 3 are referenced in the text. R denotes the consensus estimate of the reproduction number and u(R) denotes the associated standard uncertainty. LWR and UPR are the endpoints of 95% confidence or credible intervals for the true value of R, and τ (dark uncertainty) is an estimate of the standard deviation of the (random) effects attributable to the different models for the epidemic

         R        u(R)     LWR      UPR      τ
Pooling {(R_G,j, u(R_G,j))} from Table 3
DL       0.8112   0.0135   0.7848   0.8377   0.0229
HGG      0.8092   0.0184   0.7717   0.8467   0.0334
MP       0.8114   0.0125   0.7869   0.8360   0.0206
REML     0.8114   0.0126   0.7868   0.8361   0.0207
Pooling {(R_SN,j, u(R_SN,j))} from Table 3
DL       0.8088   0.0137   0.7819   0.8356   0.0269
HGG      0.8072   0.0209   0.7647   0.8497   0.0453
MP       0.8062   0.0194   0.7682   0.8442   0.0441
REML     0.8065   0.0185   0.7702   0.8427   0.0412

8. BIG G

Newton’s law of universal gravitation states that two massive objects attract one another with a force that is directly proportional to the product of their masses, and inversely proportional to the square of the distance between their centers of mass: the constant of proportionality is the Newtonian constant of gravitation, G, also informally called “Big G” in contradistinction to “small g,” which refers to g, the acceleration of a massive body in free fall toward the Earth.
G is believed to have the same value everywhere throughout the universe, and figures not only in Newton’s law of universal gravitation, but also in the equations of Einstein’s theory of general relativity [53]. Big G’s lofty status notwithstanding, its relative standard uncertainty, of about 22 parts per million, is much larger than the relative uncertainties of most other fundamental constants [90].
The uncertainty surrounding G is relatively large for three principal reasons: (i) it is not possible to leverage knowledge of the values of other fundamental constants to reduce the uncertainty associated with the estimate of G because there is no known relation between G

and the other fundamental constants; (ii) measuring G is very challenging because it involves measuring extremely small forces and (iii) the measured values of G are appreciably more dispersed than their individual measurement uncertainties intimate.
Reason (iii) is a manifestation of lack of reproducibility, as independent experiments, relying either on different physical principles or on different implementations of the same principle, have historically yielded mutually inconsistent measurement results.
Figure 8 shows the measurement results that CODATA (Committee on Data of the International Science Council) took into account for the 2018 release of the recommended values of the fundamental physical constants [90], and the results of two alternative statistical measurement models and data reductions for them.
Two kinds of statistical models have been used for measurement results such as these, depending on how one addresses their mutual inconsistency. The model discussed in Section 8.1 is based on Birge’s [7] suggestion whereby the reported uncertainties are magnified by a factor (Birge ratio) sufficiently large to achieve mutual consistency. The model discussed in Section 8.2, which we call the laboratory effects model, is a conventional mixed effects model [50], where G is the fixed effect and the experiment effects are the random effects. Both models will be fitted taking into account the three nonnull correlations between the measured values $\{G_j\}$ listed in the caption of Table XXIX in Tiesinga et al. [90].
Baker and Jackson [2], Koepke et al. [41], Merkatas et al. [51] all compare and discuss these two kinds of models, and point out that the preference for one or for the other seems to be mostly cultural, with CODATA and the Particle Data Group (pdg.lbl.gov) [32] favoring the Birge ratio, while medical meta-analysis [23] and interlaboratory studies in measurement science [80] generally opting for the additive mixed effects model.
The 16 measurement results for G are mutually inconsistent as judged by Cochran’s Q test [17], which yields an exceedingly small p-value. Figure 8 also shows the value of G recommended by CODATA in 2018 [90], and the estimates of G obtained by application of the multiplicative and additive models that address such mutual inconsistency, as detailed in the following two subsections.

8.1 Common Mean Model for G

The multiplicative model is a heteroscedastic, Gaussian, common mean model [11] (also called “fixed effect” model; note the singular in “effect,” hence a different model from the conventional fixed effects model), which amplifies the standard uncertainties multiplicatively with the inflation factor κ > 0:

(2) $G_j = G + \kappa\,\varepsilon_j$.

The measurement errors $\{\varepsilon_j\}$ are assumed to have a joint multivariate Gaussian distribution with mean 0 and the same units as G, whose covariance matrix has the $\{u^2(G_j)\}$ along the main diagonal, and all the off-diagonal entries are 0 except for those that involve the correlations listed in the caption of Table XXIX of Tiesinga et al. [90]: 0.351 between NIST-82 and LANL-97; 0.134 between HUST-05 and HUST-09 and 0.068 between HUST-09 and HUST$_T$-18.
Both the 2014 [56] and 2018 [90] releases of the values recommended by CODATA for the fundamental constants employ an ad hoc procedure to assign a value to κ, as the smallest positive number such that the resulting, standardized residuals (which Tiesinga et al. [90] call normalized residuals) all have absolute values no larger than 2. This choice, which Merkatas et al. ([51], Section 3.2) show is overly conservative, yields 3.9 as estimate of κ.
Both maximum likelihood estimation (MLE) and the Bayesian alternative described by Bodnar and Elster [10] are model-based alternatives preferable to the aforementioned ad hoc procedure to estimate κ.
The maximum likelihood estimates of G and κ in equation (2) are $\hat{G} = 6.67430(13) \times 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$ and $\hat{\kappa} = 3.5(6)$. Note that the maximum likelihood estimate of κ is qualified with an evaluation of the associated uncertainty, which is neither recognized nor propagated for the ad hoc estimate used by Tiesinga et al. [90]. The corresponding results are depicted in the left panel of Figure 8.

8.2 Laboratory Effects Model for G

The NIST Decision Tree [74] (which ignores the three correlations aforementioned) recommends a Bayesian hierarchical model with Gaussian random effects and Gaussian measurement errors for these 16 measurement results, similar to the model in equation (1):

(3) $G_j = G + \lambda_j + \varepsilon_j$,

where the $\{\varepsilon_j\}$ are assumed to be independent and Gaussian, all with mean zero and standard deviations equal to the reported standard uncertainties, $\{u(G_j)\}$, all of which are also assumed to be based on very large numbers of degrees of freedom, which is likely an unrealistic assumption.
The experiment effects, $\{\lambda_j\}$, are assumed to be Gaussian, centered at 0 $\mathrm{m^3\,kg^{-1}\,s^{-2}}$ and with a covariance matrix all of whose elements are zero, except for $\tau^2$ along the main diagonal, and the same three elements in the upper and lower triangles that correspond to the three nonnull correlations mentioned above in Section 8.1.
This model is identifiable because the data are the pairs $\{(G_j, u(G_j))\}$: since the $\{\varepsilon_j\}$ should be consistent with the $\{u(G_j)\}$, the $\{G_j\}$ being overdispersed relative to the reported uncertainties suggests that the $\{\lambda_j\}$ cannot all be zero.
A Bayesian version of the model in equation (3), taking the aforementioned correlations into account, was fitted to

FIG. 8. Measurement results for G, and results from two alternative statistical models and corresponding data reductions. The labels at the bottom are the same that are used by Tiesinga et al. ([90], Table XXIX), where the corresponding references are listed. The diamonds represent the measured values. The (green) thick vertical line segments represent the measurement results $\{G_j \pm u(G_j)\}$. The (dark blue) thin horizontal line segment, and the light blue band centered on it, represent the 2018 CODATA recommended value for G and the associated standard uncertainty [90], Section XIX. Left panel: The (dark brown) thin horizontal line segment and the yellow band centered on it represent the consensus value computed using the common mean model of equation (2) fitted by maximum likelihood, and taking into account the correlations between experiments listed in the caption of Tiesinga et al. ([90], Table XXIX). The (purple) thin vertical line segments represent the $\{G_j \pm \kappa\,u(G_j)\}$. Right panel: Counterpart of the left panel for the mixed effects, Bayesian hierarchical model with Gaussian experiment effects and Gaussian measurement errors, also taking into account the correlations aforementioned. The (purple) thin vertical line segments represent the $\{G_j \pm (\hat{\tau}^2 + u^2(G_j))^{1/2}\}$ where $\hat{\tau}$ denotes τ’s posterior mean.
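For the common mean model of equation (2), a simplified fit can be sketched as follows. The sketch assumes independent measurement errors (the three nonnull correlations are ignored), reads "standardized residual" as the difference from the weighted mean divided by κ u(G_j), which may differ from the normalized residuals used by CODATA, and uses made-up measured values; the function names are hypothetical. Under independence, the maximum likelihood estimate of the common value is the weighted mean and the estimate of κ is the square root of Cochran’s Q divided by the number of results.

```python
import math


def fit_common_mean(values, uncertainties):
    """MLE for x_j = mu + kappa * e_j with independent e_j ~ N(0, u_j^2).

    Returns (mu, plug-in standard uncertainty of mu, kappa).
    """
    weights = [1.0 / u ** 2 for u in uncertainties]
    sum_w = sum(weights)
    mu = sum(w * x for w, x in zip(weights, values)) / sum_w
    q = sum(w * (x - mu) ** 2 for w, x in zip(weights, values))
    kappa = math.sqrt(q / len(values))      # MLE divides by n, not n - 1
    return mu, kappa / math.sqrt(sum_w), kappa


def ad_hoc_kappa(values, uncertainties, bound=2.0):
    """Smallest kappa for which all |x_j - mu| / (kappa * u_j) <= bound."""
    mu, _, _ = fit_common_mean(values, uncertainties)
    return max(abs(x - mu) / u for x, u in zip(values, uncertainties)) / bound


# Hypothetical results, in units of 1e-11 m^3 kg^-1 s^-2 (not the CODATA data).
g_values = [6.6740, 6.6745, 6.6735, 6.6750]
g_uncertainties = [0.0002, 0.0001, 0.0003, 0.0002]
print(fit_common_mean(g_values, g_uncertainties))
print(ad_hoc_kappa(g_values, g_uncertainties))
```

Because this sketch omits the correlations and uses invented inputs, its outputs are not comparable to the values 3.5 and 3.9 quoted for the actual data.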

the data listed in Table XXIX of Tiesinga et al. [90] using Stan [16, 87] and R [86] codes listed in the Supplementary Material [72], with the results depicted in the right panel of Figure 8.
The prior distribution chosen for G was Gaussian with mean set equal to the 2014 CODATA recommended value for G [55], and with standard deviation set equal to the corresponding standard uncertainty. The prior distribution chosen for τ was half-Cauchy with median set equal to the MAD (as defined in the R environment for statistical computing and graphics [86]) of the measured values.
The posterior mean of G is $6.67399(20) \times 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$, which is not statistically significantly different from the 2018 CODATA [90] recommended value because the absolute value of their difference amounts to 1.24 times the standard error of their difference (0.00031 versus $\sqrt{0.00015^2 + 0.00020^2} = 0.00025$, both in units of $10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$). The dark uncertainty, τ, had posterior mean $0.00096 \times 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$, which is 3.8 times larger than the median of the standard uncertainties associated with the 16 measured values of G.
Figure 8 reveals that the laboratory effects model entails generally smaller, more equitable increases to the effective uncertainties of the measured values than the common mean model, which involves multiplicative inflation of the reported uncertainties. Note that both panels of Figure 8 have the same scale in their vertical axes.

8.3 Evaluating Reproducibility

Table 5 summarizes the estimates of G and of other relevant quantities from Sections 8.1 and 8.2, alongside the CODATA 2018 recommended value of G and associated standard uncertainty [90]. These three estimates of G do not differ significantly from one another once their uncertainties are taken into account.
Schlamminger [82] notes that “the various measurements of G seem not to converge on a value; it seems that the convergence gets worse with each additional data point.” He concludes that “adding more data points from isolated experiments has not been the best strategy to improve the situation,” and supports the idea of “forming an international consortium to coordinate these demanding experiments.”
Such an international consortium [54] has meanwhile been formed, and in consequence the MARK-2 torsion balance that Quinn et al. [77, 78] built and used at the BIPM (International Bureau of Weights and Measures, Sèvres, France) was disassembled and shipped to NIST, in Gaithersburg, Maryland, U.S., where it was reassem-

TABLE 5 9. RECAPITULATION AND CONCLUSIONS


CODATA 2018 recommended value of G [90], maximum likelihood
estimate of G for the common mean model, and estimate of G from This contribution entertains a broad concept of repro-
the Bayesian laboratory effects model described in Section 8.2. The ducibility that is consistent with how this term has tra-
corresponding standard uncertainties are listed under u(G). The
ditionally been understood in measurement science; the
value for the dark uncertainty, τ , is the mean of its posterior
distribution. The estimates of κ, which figures in equation (2), are the essential agreement between results when measuring the
ad hoc estimate from Tiesinga et al. [90], and the maximum likelihood same property, or more generally studying the same phe-
estimate. Only the latter is qualified with the associated standard nomenon, while using different approaches, methods and
uncertainty, u(κ) procedures, applied by different experimenters working
independently of one another in different laboratories and
G u(G) τ possibly at different times.
/(10−11 m3 kg−1 s−2 ) κ u(κ) The illustrative examples show the key role that the
evaluation of measurement uncertainty plays in identify-
CODATA 2018 6.67430 0.00015 3.9 ing the seriousness of reproducibility crises, and in flesh-
Common Mean 6.67430 0.00013 3.5 0.6 ing out, and quantifying, the impact that different causes,
Lab Effects 6.67399 0.00020 0.00096
or sources of uncertainty, can have upon the lack of repro-
ducibility.
The process of learning from experience through mea-
bled and mounted on a coordinate measurement machine; surement is best done as a collective, collaborative enter-
it became operational in August of 2016. prise, where different participants address the same prob-
During the April 2022 meeting of the American Phys- lem and not only compare their results but also blend
them into a consensus estimate. Such consensus estimate
ical Society, Schlamminger et al. [83] described the new
typically has smaller uncertainty than the uncertainty of
setup for the MARK-2 balance, explained how an inde-
the individual estimates taken separately, and is also sup-
pendent, blind measurement of G was performed and an-
ported by a richer, more varied basis of empirical evi-
nounced that the first reproducibility test result should be dence. The consensus estimate can be of interest in itself,
revealed soon. as it is for the risk of rosiglitazone (Section 6) and for the
There are other avenues being explored to resolve this reproduction number of a pandemic (Section 7), or it can
reproducibility crisis. One, merely data analytical, which provide a reference against which to compare individual
in fact affords no resolution but only makes the consen- measurement results, as it does in the measurement of G
sus building more palatable, involves the use of a model (Section 8).
with shades of dark uncertainty, which entertains not a The conclusions are most reliable when the methods
single value of τ but several, which “penalize” different results differently [51]. Another, theoretical, employs non-classical physics models to explain the discrepancies between at least some of the historical results [40], and to “adjust” the affected results, thereby reducing the level of mutual inconsistency of the ensemble.

In brief, this review of the recent history of the measurement of the Newtonian constant of gravitation, G, and the corresponding quest for reproducibility provides yet another illustration of the extent to which the choice of statistical model (here between a common effect model and a laboratory effects model) impacts the assessment of reproducibility.

Maybe more importantly, it also shows that a reproducibility crisis can stimulate further research and encourage novel approaches to evaluate and improve reproducibility; in this case, the disassembly, transport across an ocean and reassembly at the destination of a delicate measuring instrument of great electromechanical complexity, as a radical and risky step taken on a wing and a prayer, hoping to identify reasons for the lack of reproducibility.

variously employed by the participants are fundamentally different, possibly relying on different physical principles, and also when at least some of them are primary methods, in the sense explained in Section 3.2. In such cases, as Milton and Possolo [52] put it, “they achieve consilience” [94].

The precise nature of the aforementioned collective enterprise varies between meta-analysis in medicine and interlaboratory studies in measurement science. The former typically do not involve a preliminary agreement about methods and materials to be used by the participants, the onus of selecting the results to be compared and merged falling on the researcher conducting the meta-analysis. The latter usually are fairly structured procedures, involving a specified schedule and common protocols to be used for measurement.

The conventional understanding of reproducibility and repeatability in measurement science lends itself to the quantification of these attributes via some form or another of estimating variance components, as was illustrated for an interlaboratory study of the stress required to achieve a particular relative elongation of rubber samples (Section 5).
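
The variance-components view of repeatability and reproducibility described above can be sketched numerically. The Python fragment below is a minimal illustration only, and it is not the analysis carried out in Section 5: it assumes a balanced design (every laboratory reports the same number of replicates), it uses the classical one-way analysis-of-variance (method-of-moments) estimator rather than a fitted mixed-effects model, and it runs on made-up numbers. The function name `variance_components` and the example data are hypothetical.

```python
from statistics import mean


def variance_components(labs):
    """One-way ANOVA (method-of-moments) variance components, balanced design.

    labs: list of equal-length lists, one list of replicate values per laboratory.
    Returns (repeatability variance, between-laboratory variance,
    reproducibility variance), the last being the sum of the first two.
    """
    p = len(labs)      # number of laboratories
    n = len(labs[0])   # replicates per laboratory
    if p < 2 or n < 2 or any(len(lab) != n for lab in labs):
        raise ValueError("need a balanced design with >= 2 labs and >= 2 replicates")

    lab_means = [mean(lab) for lab in labs]
    grand_mean = mean(lab_means)

    # Within-laboratory mean square: pooled scatter of replicates around lab means.
    ms_within = sum(
        (x - m) ** 2 for lab, m in zip(labs, lab_means) for x in lab
    ) / (p * (n - 1))
    # Between-laboratory mean square: scatter of lab means around the grand mean.
    ms_between = n * sum((m - grand_mean) ** 2 for m in lab_means) / (p - 1)

    var_repeat = ms_within
    # Truncate at zero when the between-lab mean square is below the within-lab one.
    var_between = max(0.0, (ms_between - ms_within) / n)
    return var_repeat, var_between, var_repeat + var_between


# Hypothetical (made-up) replicate values: three laboratories, two replicates each.
example = [[1.0, 1.2], [2.0, 2.2], [3.0, 3.2]]
var_r, var_L, var_R = variance_components(example)
print(f"repeatability sd   = {var_r ** 0.5:.3f}")
print(f"reproducibility sd = {var_R ** 0.5:.3f}")
```

Truncating the between-laboratory estimate at zero is a common convention for this estimator; likelihood-based or Bayesian fits of a laboratory effects model treat that case differently and would generally be preferred for unbalanced or heavy-tailed data.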
The examples also show that a meaningful data analysis can require a preliminary choice of reexpression for the measurement results, in particular to facilitate and legitimize the use of a statistical model that is demonstrably adequate for the data, and that is also fit for purpose. This was the case for the values of stress in the interlaboratory study of rubber elongation (Section 5), where a logarithmic reexpression was very helpful, and also for the meta-analysis for the effects of rosiglitazone (Section 6), with the traditional focus on log odds.

In interlaboratory studies and meta-analyses, there often arise results that deviate markedly from the bulk of the others; either because the measured value is rather different from most of the others, or because the uncertainty reported in a result is very different from the uncertainties reported in the other results, or both.

In general, and concerning very different reported uncertainties, it is the smallest uncertainties that are particularly influential, especially when the measurement results are mutually inconsistent, because they tend to pull the consensus value toward their corresponding measured values. Such unusually small uncertainties can then be said to be influential “inliers.”

Faced with mutually inconsistent measurement results, the temptation is great to set “discrepant” values aside, thereby appearing to resolve the lack of reproducibility; Cox [24] describes one manner of succumbing to such temptation. However, unless there is a substantive, identifiable cause to do so, no “discrepant” result should be set aside, for the simple reason that in the absence of such cause there would be no logical basis whereon to reject discrepant values as being invalid: the most discrepant value can very well be the one closest to the true value of the measurand [26].

Statistical diagnostics are most valuable aids in identifying unusual measurement results, but statistical considerations alone are insufficient to reject a measurement result. Faced by challenges posed by “discrepant” but credible measurement results, one should tune the model to fit all credible results rather than set credible but “inconvenient” results aside. The example in Section 5 illustrated ways of accomplishing this, including by replacing the assumption that measurement errors are Gaussian with the assumption that they follow a Student’s t-distribution with a small number of degrees of freedom, similar to [66].

The roller coaster that has been the history of the use of rosiglitazone as a therapy (Section 6) shows that, even when starting from the same set of data, one can reach rather different conclusions owing to different statistical models and methods of data reduction; in other words, the issue of lack of reproducibility raised its head when the results of alternative but comparably tenable models and data reductions were compared.

When blending independent estimates of the reproduction number for COVID-19 in the UK (Section 7), it so turned out that the mere exercise of preparing the inputs for analysis can be quite influential upon the level of reproducibility of the results, above and beyond the differences between the epidemiological models that provided those inputs, and also above and beyond the methods used to determine a consensus value. This serves as a warning about the fact that fairly simple matters often relegated to routine work can impact reproducibility, or the lack thereof, substantially.

The history of the measurements of the least accessible of the fundamental constants of nature, the Newtonian constant of gravitation, G, shows that alternative treatments of the same data, even when they produce results that are in fair agreement, involve very different assumptions that effectively establish dividing lines in the interested community; in particular, and in this case, whether one adopts the approach first proposed by Raymond Birge and faithfully followed mostly by the physics community, or opts instead for the approach that is prevalent in medical meta-analysis and in measurement science.

But the most important lesson one can draw from the recent history of the measurement of G is a lesson of optimism and empowerment; that, when faced with a considerable, genuine reproducibility crisis, the scientific community is ready to engage in extraordinary, cooperative efforts to understand the root causes of the lack of reproducibility, and to do so with the resolve needed to move heaven and earth, and with the creativity to match, of which Stephan Schlamminger (NIST) and his collaborators provide paradigmatic examples.

ACKNOWLEDGMENTS

The author is immensely grateful to Stephan Schlamminger (NIST) for all that he has taught him over the years about the measurement of G. The author is also much indebted to Olha Bodnar (Örebro University, Sweden), David Newton (NIST) and Mikela Waldman (NIST and Georgetown University, Washington, DC) for their most valuable and extensive suggestions for improvement of a draft of this contribution. The author thanks David Woods (Univ. of Southampton, UK) for an exchange of eMails about the measurement of the reproduction number of COVID-19 in the United Kingdom.

The author thanks the organizers of the special issue of Statistical Science dedicated to the issue of reproducibility for the invitation to contribute to it, and acknowledges the very helpful criticism and guidance that the guest editors, the journal’s editor and a referee provided throughout the revision process, which led to considerable improvements.

Some specific commercial entities, equipment or materials may be identified in this document in order to describe or illustrate an experimental or statistical procedure or concept adequately. Such identification is not intended
to imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor is it intended to imply that the entities, equipment or materials mentioned are necessarily the best available for the purpose.

SUPPLEMENTARY MATERIAL

Data and R Code (DOI: 10.1214/23-STS899SUPP; .zip). The supplementary information file Possolo 2023-TrackingTruth-Supplement.R contains data and R code that facilitate reproducing the numerical results listed in this contribution.

REFERENCES

[1] Azzalini, A. (2014). The Skew-Normal and Related Families. Institute of Mathematical Statistics (IMS) Monographs 3. Cambridge Univ. Press, Cambridge. With the collaboration of Antonella Capitanio. MR3468021 https://doi.org/10.1017/cbo9781139248891
[2] Baker, R. and Jackson, D. (2015). New models for describing outliers in meta-analysis. Res. Synth. Methods 7 314–328. https://doi.org/10.1002/jrsm.1191
[3] Bates, D., Mächler, M., Bolker, B. and Walker, S. (2015). Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67 1–48. https://doi.org/10.18637/jss.v067.i01
[4] Beauchamp, C. R., Camara, J. E., Carney, J., Choquette, S. J., Cole, K. D., DeRose, P. C., Duewer, D. L., Epstein, M. S., Kline, M. C. et al. (2021). Metrological Tools for the Reference Materials and Reference Instruments of the NIST Materials Measurement Laboratory. NIST Special Publication 260-136 (2021 Edition). National Institute of Standards and Technology, Gaithersburg, MD. https://doi.org/10.6028/NIST.SP.260-136-2021
[5] Bell, S. (1999). A Beginner’s Guide to Uncertainty of Measurement. Measurement Good Practice Guide 11 (Issue 2). National Physical Laboratory, Teddington, Middlesex, United Kingdom. Amendments March 2001.
[6] BIPM (2019). The International System of Units (SI), 9th ed. International Bureau of Weights and Measures (BIPM), Sèvres, France.
[7] Birge, R. T. (1932). The calculation of errors by the method of least squares. Phys. Rev. 40 207–227. https://doi.org/10.1103/PhysRev.40.207
[8] Blackman, R. B. and Tukey, J. W. (1958). The measurement of power spectra from the point of view of communications engineering. I. Bell Syst. Tech. J. 37 185–282. MR0102897 https://doi.org/10.1002/j.1538-7305.1958.tb03874.x
[9] Blackwell, T., Brown, C. and Mosteller, F. (1991). Which denominator? In Fundamentals of Exploratory Analysis of Variance (D. C. Hoaglin, F. Mosteller and J. W. Tukey, eds.) 10 252–294. Wiley, New York, NY.
[10] Bodnar, O. and Elster, C. (2014). On the adjustment of inconsistent data using the Birge ratio. Metrologia 51 516–521. https://doi.org/10.1088/0026-1394/51/5/516
[11] Borenstein, M., Hedges, L. V., Higgins, J. P. T. and Rothstein, H. R. (2010). A basic introduction to fixed-effect and random-effects models for meta-analysis. Res. Synth. Methods 1 97–111. https://doi.org/10.1002/jrsm.12
[12] Bradburn, M. J., Deeks, J. J., Berlin, J. A. and Localio, A. R. (2007). Much ado about nothing: A comparison of the performance of meta-analytical methods with rare events. Stat. Med. 26 53–77. MR2312699 https://doi.org/10.1002/sim.2528
[13] Bürkner, P. C. (2017). brms: An R package for Bayesian multilevel models using Stan. J. Stat. Softw. 80 1–28. https://doi.org/10.18637/jss.v080.i01
[14] Bürkner, P. C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal 10 395–411. https://doi.org/10.32614/RJ-2018-017
[15] Campagnari, C. and Mulders, M. (2022). An upset to the standard model. Science 376 136–136. https://doi.org/10.1126/science.abm0101
[16] Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. et al. (2017). Stan: A probabilistic programming language. J. Stat. Softw. 76 1–32. https://doi.org/10.18637/jss.v076.i01
[17] Cochran, W. G. (1954). The combination of estimates from different experiments. Biometrics 10 101–129. https://doi.org/10.2307/3001666
[18] ATLAS Collaboration, Aaboud, M. (2018). Measurement of the W-boson mass in pp collisions at √s = 7 TeV with the ATLAS detector. European Physical Journal C 78 110. https://doi.org/10.1140/epjc/s10052-017-5475-4
[19] CDF Collaboration (2022). High-precision measurement of the W boson mass with the CDF II detector. Science 376 170–176. https://doi.org/10.1126/science.abk1781
[20] L3 Collaboration (2006). Measurement of the mass and the width of the W boson at LEP. Eur. Phys. J. C 45 569–587. https://doi.org/10.1140/epjc/s2005-02459-6
[21] Analytical Methods Committee (1989a). Robust statistics—how not to reject outliers. Part 1. Basic concepts. Analyst 114 1693–1697. https://doi.org/10.1039/AN9891401693
[22] Analytical Methods Committee (1989b). Robust statistics—how not to reject outliers. Part 2. Inter-laboratory trials. Analyst 114 1699–1702.
[23] Cooper, H., Hedges, L. V. and Valentine, J. C., eds. (2019) The Handbook of Research Synthesis and Meta-Analysis, 3rd ed. Russell Sage Foundation Publications, New York, NY.
[24] Cox, M. G. (2007). The evaluation of key comparison data: Determining the largest consistent subset. Metrologia 44 187–200. https://doi.org/10.1088/0026-1394/44/3/005
[25] Dai, D. C. (2021). Variance of Newtonian constant from local gravitational acceleration measurements. Phys. Rev. D 103 064059. https://doi.org/10.1103/PhysRevD.103.064059
[26] De Bièvre, P. (2007). Statistics and measurement results in chemistry. Accredit. Qual. Assur. 12 333–334. https://doi.org/10.1007/s00769-007-0294-1
[27] DELPHI Collaboration, Abdallah, J. et al. Measurement of the mass and width of the W boson in e+e− collisions at √s = 161–209 GeV. Eur. Phys. J. C 55 1. https://doi.org/10.1140/epjc/s10052-008-0585-7
[28] DerSimonian, R. and Laird, N. (1986). Meta-analysis in clinical trials. Control. Clin. Trials 7 177–188. https://doi.org/10.1016/0197-2456(86)90046-2
[29] Diamond, G. A., Bax, L. and Kaul, S. (2007). Uncertain effects of rosiglitazone on the risk for myocardial infarction and cardiovascular death. Ann. Intern. Med. 147 578–581. https://doi.org/10.7326/0003-4819-147-8-200710160-00182
[30] Fineberg, H. V., Allison, D. B., Barba, L. A., Chong, D., Donoho, D., Freire, J., Gabrielse, G., Gatsonis, C., Hall, E. et al. (2019). Reproducibility and Replicability in Science. Committee on Reproducibility and Replicability in Science, the National Academies of Sciences, Engineering,
and Medicine. The National Academies Press, Washington, DC. https://doi.org/10.17226/25303
[31] Gaiser, C., Fellmuth, B., Haft, N., Kuhn, A., Thiele-Krivoi, B., Zandt, T., Fischer, J., Jusko, O. and Sabuga, W. (2017). Final determination of the Boltzmann constant by dielectric-constant gas thermometry. Metrologia 54 280–289. https://doi.org/10.1088/1681-7575/aa62e3
[32] Particle Data Group, Zyla, P. A. et al. (2020). Review of Particle Physics. Progress of Theoretical and Experimental Physics 083C01. https://doi.org/10.1093/ptep/ptaa104
[33] Gundersen, O. E. (2021). The fundamental principles of reproducibility. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 379 20200210. https://doi.org/10.1098/rsta.2020.0210
[34] Michell, J. (2005). The logic of measurement: A realist overview. Measurement 38 285–294. https://doi.org/10.1016/j.measurement.2005.09.004
[35] Harris, D. C. and Lucy, C. A. (2020). Quantitative Chemical Analysis, 10th ed. Macmillan Learning, New York, NY.
[36] Herschel, J. F. W. (1866). Familiar Lectures on Scientific Subjects X. The Yard, the Pendulum, and the Metre 419–451, London Alexander Strahan.
[37] Home, P. D., Pocock, S. J., Beck-Nielsen, H., Curtis, P. S., Gomis, R., Hanefeld, M., Jones, N. P., Komajda, M. and McMurray, J. J. V. (2009). Rosiglitazone evaluated for cardiovascular outcomes in oral agent combination therapy for type 2 diabetes (RECORD): A multicentre, randomised, open-label trial. Lancet 373 2125–2135. https://doi.org/10.1016/S0140-6736(09)60953-3
[38] Jewell, N. P. (2004). Statistics for Epidemiology. CRC Press/CRC, Boca Raton, FL.
[39] Kahn, S. E., Haffner, S. M., Heise, M. A., Herman, W. H., Holman, R. R., Jones, N. P., Kravitz, B. G., Lachin, J. M., O’Neill, M. C. et al. (2006). Glycemic durability of rosiglitazone, metformin, or glyburide monotherapy. N. Engl. J. Med. 355 2427–2443. https://doi.org/10.1056/NEJMoa066224
[40] Klein, N. (2020). Evidence for modified Newtonian dynamics from Cavendish-type gravitational constant experiments. Classical Quantum Gravity 37 065002, 21. MR4086686 https://doi.org/10.1088/1361-6382/ab6cab
[41] Koepke, A., Lafarge, T., Possolo, A. and Toman, B. (2017). Consensus building for interlaboratory studies, key comparisons, and meta-analysis. Metrologia 54 S34–S62. https://doi.org/10.1088/1681-7575/aa6c0e
[42] Koetse, M. J., Florax, R. J. G. M. and de Groot, H. L. F. (2010). Consequences of effect size heterogeneity for meta-analysis: A Monte Carlo study. Stat. Methods Appl. 19 217–236. MR2651450 https://doi.org/10.1007/s10260-009-0125-0
[43] Langan, D., Higgins, J. P. T., Jackson, D., Bowden, J., Veroniki, A. A., Kontopantelis, E., Viechtbauer, W. and Simmonds, M. (2019). A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Res. Synth. Methods 10 83–98. https://doi.org/10.1002/jrsm.1316
[44] Langan, D., Higgins, J. P. T. and Simmonds, M. (2017). Comparative performance of heterogeneity variance estimators in meta-analysis: A review of simulation studies. Res. Synth. Methods 8 181–198. https://doi.org/10.1002/jrsm.1198
[45] Maishman, T., Schaap, S., Silk, D. S., Nevitt, S. J., Woods, D. C. and Bowman, V. E. (2022). Statistical methods used to combine the effective reproduction number, R(t), and other related measures of COVID-19 in the UK. Stat. Methods Med. Res. 31 1757–1777. MR4478307 https://doi.org/10.1177/09622802221109506
[46] Mandel, J. (1972). Repeatability and reproducibility. J. Qual. Technol. 4 74–85. https://doi.org/10.1080/00224065.1972.11980520
[47] Mandel, J. (1991). The validation of measurement through interlaboratory studies. Chemom. Intell. Lab. Syst. 11 109–119. https://doi.org/10.1016/0169-7439(91)80058-X
[48] Mandel, J. and Paule, R. (1970). Interlaboratory evaluation of a material with unequal numbers of replicates. Anal. Chem. 42 1194–1197. https://doi.org/10.1021/ac60293a019
[49] Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst. 22 719–748. https://doi.org/10.1093/jnci/22.4.719
[50] McCulloch, C. E., Searle, S. R. and Neuhaus, J. M. (2008). Generalized, Linear, and Mixed Models, 2nd ed. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ. MR2431553
[51] Merkatas, C., Toman, B., Possolo, A. and Schlamminger, S. (2019). Shades of dark uncertainty and consensus value for the Newtonian constant of gravitation. Metrologia 56 054001. https://doi.org/10.1088/1681-7575/ab3365
[52] Milton, M. J. T. and Possolo, A. (2020). Trustworthy data underpin reproducible research. Nat. Phys. 16 117–119. https://doi.org/10.1038/s41567-019-0780-5
[53] Misner, C. W., Thorne, K. S. and Wheeler, J. A. (2017). Gravitation. Princeton University Press, Princeton, NJ.
[54] Mohr, P. (2014). Newtonian constant of gravitation international consortium. https://www.nist.gov/programs-projects/newtonian-constant-gravitation-international-consortium. NIST Physical Measurement Laboratory.
[55] Mohr, P. J., Newell, D. B. and Taylor, B. N. (2015). CODATA Recommended Values of the Fundamental Physical Constants: 2014. CODATA Zenodo Collection. https://doi.org/10.5281/zenodo.22826
[56] Mohr, P. J., Newell, D. B. and Taylor, B. N. (2016). CODATA recommended values of the fundamental physical constants: 2014. Rev. Modern Phys. 88 035009. https://doi.org/10.1103/RevModPhys.88.035009
[57] Moldover, M. R., Trusler, J. P. M., Edwards, T. J., Mehl, J. B. and Davis, R. S. (1988). Measurement of the universal gas constant R using a spherical acoustic resonator. J. Res. Natl. Bur. Stand. 93 85–144. https://doi.org/10.6028/jres.093.010
[58] Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley Company, Reading, MA.
[59] Mould, J. and Uddin, S. A. (2014). Constraining a possible variation of G with type Ia supernovae. Publ. Astron. Soc. Austral. 31 e015. https://doi.org/10.1017/pasa.2014.9
[60] Munafò, M. R., Chambers, C., Collins, A., Fortunato, L. and Macleod, M. (2022). The Reproducibility Debate Is an Opportunity, Not a Crisis. BMC Research Notes 15 43. https://doi.org/10.1186/s13104-022-05942-3
[61] Newell, D. B. (2014). A more fundamental international system of units. Phys. Today 67 35–41. https://doi.org/10.1063/PT.3.2448
[62] Nissen, S. E. and Wolski, K. (2007). Effect of rosiglitazone on the risk of myocardial infarction and death from cardiovascular causes. N. Engl. J. Med. 356 2457–2471. https://doi.org/10.1056/NEJMoa072761
[63] NIST/SEMATECH (2012). NIST/SEMATECH E-Handbook of Statistical Methods. National Institute of Standards and Technology, U.S. Department of Commerce, Gaithersburg, MD. https://doi.org/10.18434/M32189
[64] Nozick, R. (1981). Philosophical Explanations. Harvard Univ. Press, Cambridge, MA.
[65] Olver, F. W. J., Lozier, D. W., Boisvert, R. F. and Clark, C. W., eds. (2010) NIST Handbook of Mathematical Functions. Cambridge Univ. Press, Cambridge. MR2723248
[66] Pinheiro, J. C., Liu, C. and Wu, Y. N. (2001). Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. J. Comput. Graph. Statist. 10 249–276. MR1939700 https://doi.org/10.1198/10618600152628059
[67] Pinheiro, L. and Emslie, K. R. (2018). Basic concepts and validation of digital PCR measurements. In Digital PCR: Methods and Protocols 11–24 Springer, New York, NY. https://doi.org/10.1007/978-1-4939-7778-9_2
[68] Plesser, H. E. (2018). Reproducibility vs. replicability: A brief history of a confused terminology. Front. Neuroinform. 11 76. https://doi.org/10.3389/fninf.2017.00076
[69] Pontius, P. E. (1966). Measurement philosophy of the pilot program for mass calibration. National Bureau of Standards, Washington, DC. NBS Technical Note 288, Reprinted 1968, with minor corrections.
[70] Possolo, A. (2018). Measurement. In Advanced Mathematical and Computational Tools in Metrology and Testing: AMCTM XI (A. B. Forbes, N. F. Zhang, A. Chunovkina, S. Eichstädt and F. Pavese, eds.). Series on Advances in Mathematics for Applied Sciences 89 273–285. World Scientific Company, Singapore. https://doi.org/10.1142/9789813274303_0027
[71] Possolo, A. (2021). Concepts, methods, and tools enabling measurement quality. In Frontiers in Statistical Quality Control 13 (S. Knoth and W. Schmid, eds.) 19 339–357. Springer, Cham, Switzerland. https://doi.org/10.1007/978-3-030-67856-2_19
[72] Possolo, A. (2023). Supplement to “Tracking truth through measurement and the spyglass of statistics.” https://doi.org/10.1214/23-STS899SUPP
[73] Possolo, A., Bruce, S. S. and Watters, R. L. Jr. (2021). Metrological Traceability Frequently Asked Questions and NIST Policy. National Institute of Standards and Technology, Gaithersburg, MD. NIST Technical Note 2156. https://doi.org/10.6028/NIST.TN.2156
[74] Possolo, A., Koepke, A., Newton, D. and Winchester, M. R. (2021). Decision tree for key comparisons. J. Res. Natl. Inst. Stand. Technol. 126 126007. https://doi.org/10.6028/jres.126.007
[75] Possolo, A. and Meija, J. (2022). Measurement Uncertainty: A Reintroduction, 2nd ed. Sistema Interamericano de Metrologia (SIM), Montevideo, Uruguay. https://doi.org/10.4224/1tqz-b038
[76] Qu, J., Benz, S. P., Coakley, K., Rogalla, H., Tew, W. L., White, R., Zhou, K. and Zhou, Z. (2017). An improved electronic determination of the Boltzmann constant by Johnson noise thermometry. Metrologia 54 549–558. https://doi.org/10.1088/1681-7575/aa781e
[77] Quinn, T., Parks, H., Speake, C. and Davis, R. (2013). Improved determination of G using two methods. Phys. Rev. Lett. 111 101102. https://doi.org/10.1103/PhysRevLett.111.101102
[78] Quinn, T., Speake, C., Parks, H. and Davis, R. (2014). The BIPM measurements of the Newtonian constant of gravitation, G. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 372 0032. https://doi.org/10.1098/rsta.2014.0032
[79] Roush, S. (2005). Tracking Truth: Knowledge, Evidence, and Science. Oxford Univ. Press, New York, NY.
[80] Rukhin, A. L. (2009). Weighted means statistics in interlaboratory studies. Metrologia 46 323–331. https://doi.org/10.1088/0026-1394/46/3/021
[81] Rukhin, A. L., Biggerstaff, B. J. and Vangel, M. G. (2000). Restricted maximum likelihood estimation of a common mean and the Mandel-Paule algorithm. J. Statist. Plann. Inference 83 319–330. MR1748018 https://doi.org/10.1016/S0378-3758(99)00098-1
[82] Schlamminger, S. (2014). A cool way to measure big G. Nature 510 478–480. https://doi.org/10.1038/nature13507
[83] Schlamminger, S., Chao, L. S., Lee, V., Speake, C. C. and Newell, D. B. (2022). Measurement of Newton’s gravitational constant with the BIPM torsion balance. In American Physical Society April Meeting 2022 Session S16: Lab Experiments and Detector Characterization S16.00002.
[84] Schlamminger, S., Holzschuh, E., Kündig, W., Nolting, F., Pixley, R. E., Schurr, J. and Straumann, U. (2006). Measurement of Newton’s gravitational constant. Phys. Rev. D 74 082001. https://doi.org/10.1103/PhysRevD.74.082001
[85] Strain, M. C., Lada, S. M., Luong, T., Rought, S. E., Gianella, S., Terry, V. H., Spina, C. A., Woelk, C. H. and Richman, D. D. (2013). Highly precise measurement of HIV DNA by droplet digital PCR. PLoS ONE 8 1–8. https://doi.org/10.1371/journal.pone.0055943
[86] R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[87] Stan Development Team (2022). RStan: the R interface to Stan. R package version 2.21.7.
[88] Thomas, K. and Schmidt, M. S. (2012). Glaxo Agrees to Pay $3 Billion in Fraud Settlement. The New York Times July 2.
[89] Thompson, M. and Ellison, S. L. R. (2011). Dark uncertainty. Accredit. Qual. Assur. 16 483–487. https://doi.org/10.1007/s00769-011-0803-0
[90] Tiesinga, E., Mohr, P. J., Newell, D. B. and Taylor, B. N. (2021). CODATA recommended values of the fundamental physical constants: 2018. Rev. Modern Phys. 93 025010. https://doi.org/10.1103/RevModPhys.93.025010
[91] Viberti, G., Kahn, S. E., Greene, D. A., Herman, W. H., Zinman, B., Holman, R. R., Haffner, S. M., Levy, D., Lachin, J. M. et al. (2002). A Diabetes Outcome Progression Trial (ADOPT): An international multicenter study of the comparative efficacy of rosiglitazone, glyburide, and metformin in recently diagnosed type 2 diabetes. Diabetes Care 25 1737–1743. https://doi.org/10.2337/diacare.25.10.1737
[92] Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 36 1–48. https://doi.org/10.18637/jss.v036.i03
[93] White, R. (2011). The meaning of measurement in metrology. Accredit. Qual. Assur. 16 31–41. https://doi.org/10.1007/s00769-010-0698-1
[94] Wilson, E. O. (1998). Consilience: The Unity of Knowledge. Alfred A. Knopf, New York, NY.
[95] Yusuf, S., Peto, R., Lewis, J., Collins, R. and Sleight, P. (1985). Beta blockade during and after myocardial infarction: An overview of the randomized trials. Prog. Cardiovasc. Dis. 27 335–371. https://doi.org/10.1016/s0033-0620(85)80003-7
