23 STS899
656 A. POSSOLO
ticed in national metrology institutes like the National Institute of Standards and Technology (NIST) of the U.S., as well as in many other laboratories where measurements are made that support the practice of medicine, engineering, environmental studies, forensic investigations, and that ensure the quality of food, therapies and industrial products.

The article is also intended for physical scientists, medical doctors, engineers, laboratory technicians and others who make measurements and employ statistical methods to assess reproducibility via interlaboratory studies and meta-analyses, and who also wish to gain some appreciation for how the evaluation of measurement uncertainty underlies the assessment of reproducibility.

Section 2 uses the Newtonian constant of gravitation as an example to explain the meaning of notational conventions that are widely used in metrology but that statisticians may be unfamiliar with, and which are used throughout this contribution.

Since measurement plays a key role in science and technology, both the credibility of scientific results and the reliability of technologies hinge on measurement quality, which is the topic of Section 3.

Section 4 discusses the meaning of “reproducibility” and of related concepts. Section 5 presents a reanalysis, employing contemporary techniques, of a historical data set that John Mandel used to illustrate his pioneering approach to characterize measurement reproducibility and repeatability.

Sections 6 (assessment of the risks of a particular therapy), 7 (estimation of the reproduction number of COVID-19) and 8 (measurement of the Newtonian constant of gravitation) provide additional illustrations of how the statistical intercomparison of measurement results contributes to the assessment of reproducibility.

Section 9 gathers some lessons learned about how the application of statistical models and methods can quantify the reproducibility of the conclusions of scientific studies, and in the process increase their trustworthiness, thereby advancing scientific knowledge.

The title chosen for this contribution alludes to the tracking theory of knowledge developed by Nozick [64] and by Roush [79], at the same time as it evokes the dynamic nature of exploratory and confirmatory statistical data analysis, as they “track” the scent of truth in empirical data, thus fulfilling the allegorical role of a spyglass that delivers reliable knowledge built upon reproducible findings.

2. NOTATIONAL CONVENTIONS

The term standard uncertainty, and the notation used to denote it, occur repeatedly throughout this contribution, as in $u(G) = 0.000122 \times 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$, which Schlamminger et al. [84] reported as the standard uncertainty associated with a measurement of the Newtonian constant of gravitation made at the University of Zürich, Switzerland, $G = (6.674252 \pm 0.000122) \times 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$ (this is the result labeled UZur-06 in Figure 8). There are three different conventions in play here:

• Since the true value of G is unknown, G is modeled as a random variable whose probability distribution characterizes the uncertainty surrounding its true value, yet without impugning the fact that, according to current understanding, G has had a unique, essentially invariant true value throughout most of the history of the universe [59, 25].
• The standard uncertainty, u(G), is the standard deviation of G’s distribution. However, since this distribution also comprises uncertainty contributions that are not expressed in the data (for example, uncertainty in the calibration of measuring instruments), metrology uses a term conceived to be more inclusive than “standard error.”
• The expression for the value of G includes the parenthetic notation “6.674252(122),” which is shorthand for “6.674252 ± 0.000122,” indicating that the digits between parentheses express the standard uncertainty and affect the same number of trailing digits of the value of G while disregarding the location of the decimal point. This parenthetic notation is commonly employed to report measurement results concisely in the scientific literature, as well as in Sections 4, 7 and 8 of this contribution.

3. MEASUREMENT AND MEASUREMENT QUALITY

3.1 Measurement

Measurement, like science generally, aims “to find out something” ([34], p. 287) based on empirical evidence. It obtains and analyzes this evidence using methods that peer review determines to be sound and that enable empirically verifiable predictions, yielding results that others can essentially reproduce.

In practice, our measured values are approximations to the true values of the properties that we intend to measure. These estimates, alone, are of little value because they provide no assurances about their quality. For this reason, a bona fide measurement result comprises both a measured value and an evaluation of measurement uncertainty.

Broadly conceived, measurement is an experimental or computational process that produces an estimate of the true value of a property of a material or virtual object or collection of objects, or of a process, event or series of events, and satisfies these requirements [93, 70]:

(a) The estimate (measured value) is based on a comparison of the property of interest with a property of the same kind realized in a standard that is recognized as a common reference by the community of producers and users of the measurement result;
TRACKING TRUTH THROUGH MEASUREMENT 657
(b) The measured value is qualified with an evaluation of measurement uncertainty;
(c) The measurement result (measured value together with its associated measurement uncertainty) is used to inform an action or decision.

As an example of the comparison mentioned in (a), consider the Eiffel Tower: saying that it is 330 m tall means that its height is 330 times the length of the meter, which is the unit of length in the International System of Units (SI) [6].

The property that is measured (measurand) can be qualitative or quantitative. The species of the plant in NIST Standard Reference Material (SRM) 3246, Ginkgo biloba (Leaves), is a qualitative property. The mass fraction of tin in NIST SRM 3161a, Tin Standard Solution (Lot No. 140917), is a quantitative property whose certified value is 10.011 mg/g.

To satisfy requirement (b), the aforementioned estimate of the mass fraction of tin is qualified with an expression of measurement uncertainty, in the form of a confidence interval ranging from 9.986 mg/g to 10.036 mg/g.

Requirement (c) is exemplified by the decision to accept or reject a shipment of boxes of breakfast cereal, which depends on a measurement result for the mass of cereal in the boxes. This can be the average mass of cereal per box, for example, qualified with an evaluation of the associated measurement uncertainty.

3.2 Measurement Quality

Measurement quality is its trustworthiness: the extent to which measured values approximate the corresponding true values sufficiently closely for the purpose they are intended to serve, and do so with assuredly high confidence [71].

Such trustworthiness requires that measurement results be metrologically traceable to appropriate, widely recognized standards of reference, and that the associated uncertainty be small enough to warrant using the measured value in practice as a proxy for the corresponding true value.

Traceability is a property of a measurement result consisting of a documented series of comparisons that relate the measured value to a standard of reference, with each comparison being qualified by an evaluation of the associated measurement uncertainty [73]. Traceability thus guarantees that 1 kg of coffee weighed and sold in a supermarket in Cali, Colombia, has the same mass as 1 kg of coffee bought in Coimbra, Portugal, up to their respective, associated uncertainties.

Measurement uncertainty is the doubt about the true value of the measurand that remains after making a measurement ([75], p. 14). Bell [5] points out that to characterize the margin of this doubt, we need to answer two questions: “How big is the margin?” and “How bad is the doubt?” For NIST SRM 3161a, the size of the margin is gauged by half the length of the aforementioned confidence interval, and the severity of the doubt is expressed by the probability (5% in this case) that said interval does not include the true value of that mass fraction.

Confidence in measurement results can be strengthened by introducing known measurands in the measurement workflow that are indistinguishable from the materials or products that are being measured. Such check standards ([63], 2.1.2) were first used in mass measurement [69]. In general, they can be reference materials or calibrated devices delivering certified values whose associated uncertainty has been evaluated reliably.

The convergence toward a particular value as the same measurand is measured repeatedly over time, in independent experiments, is another indication that knowledge about it is solidifying. The histories of the measurements of the speed of light and of the Planck constant are notable examples of such convergence [52, 61].

Confidence in a measurement result is bolstered appreciably if one or several so-called primary methods of measurement are employed, and they produce measurement results that are essentially in agreement with one another. A primary measurement procedure is one that does not require calibration with a reference that delivers the same property that one intends to measure. Digital polymerase chain reaction (dPCR) is a primary measurement method for viral loads in samples of bodily fluids [85], and for many other measurements in molecular biology [67].

Coulometry ([35], Section 17-3) can be a primary method for determining the amount of a substance in a solution, which involves counting the number of electrons consumed in a chemical reaction involving that substance. This measurement method involves reference to standards of time and electrical current, but not to standards for the concentration of the substance ([4], 2.9.5).

In summary, measurement provides estimates of values of properties of interest to science and technology using recognized standards as references. Both measurement uncertainty and traceability, which characterize measurement’s reliability and validity, are attributes of measurement quality. The demonstration of mutual consistency between measurement results for the same measurand obtained independently of one another, that is, reproducibility (which we turn to next), is another quality attribute of measurement that bolsters the trustworthiness of measurement results.

4. REPRODUCIBILITY

A search for articles listed in the Web of Science that were published between January 1, 2020, and January 31, 2023, and that include the word “reproducibility” in their titles yielded 2524 results (retrieved on February 2, 2023).
the respective posterior distributions; they are unitless because the analysis is being done using the logarithms of the values of stress, and the logarithm “swallows” units as can be seen by its series expansion presented in [65], 4.6.4.

It is important to realize that these quantifications of repeatability and of reproducibility are supported by different amounts of evidence. In fact, the evaluation of repeatability is based on the variability of 13 groups of 28 individual determinations of stress each (whose logarithms have approximately constant variance), while the evaluation of reproducibility is based on the variability of 91 averages (13 for each of 7 rubber specimens).

Mandel ([46], p. 79) noticed that the different amounts of evidence that support the evaluations of repeatability and reproducibility can be captured using the following fact pointed out by Blackman and Tukey ([8], p. 208): if V is a multiple of a chi-square random variable with m degrees of freedom, for example, when V is an estimate of a variance component, then its coefficient of variation, CV, is $\sqrt{2/m}$. For this reason, Blackman and Tukey [8] propose $2/(\mathrm{CV})^2$ as an equivalent number of degrees of freedom (also called degrees of firmness [9], p. 290) supporting V.

To compute the degrees of firmness of the repeatability, r, and of the reproducibility, R, one can either simply compute their respective coefficients of variation based on the MCMC samples drawn from the posterior distributions of σ and τ, or possibly better, employ an analog of the coefficient of variation that may be less sensitive to the asymmetry of these posterior distributions, whose densities are depicted in Figure 5. In this particular case, the two options produce very similar assessments of the degrees of firmness of r and of R.

The robust version of the degree of firmness for r is computed by using, in place of CV, the ratio between half the length of a 68% credible interval for σ centered at the posterior median of σ, and this posterior median. The resulting degree of firmness is 338. The robust version of the degree of firmness for R, defined similarly, is 48. Hence, and not unexpectedly, the evaluation of repeatability has about 7 times greater firmness than the evaluation of reproducibility.

In general, repeatability depends both on the measurand and on the particular laboratory making the measurements, while reproducibility depends on the measurand and on the class of laboratories that the laboratories participating in the study actually represent.

Also in this case, the logarithmic transformation of the values of stress, together with the adjustment for differences between the rubber specimens accomplished by the mixed effects model, achieved sufficient homoscedasticity within laboratories, and also enabled using a single τ to quantify the between-laboratories variability, so as to justify pooling the results and producing single evaluations of repeatability and reproducibility.

This reanalysis shows that contemporary tools for statistical modeling and data analysis, which were not available in John Mandel’s time, afford great flexibility for accurate modeling. For example, replacing the assumption that measurement errors are Gaussian with the assumption that they follow a Student’s t-distribution can be handled easily in the context of a Bayesian model owing to the availability of Markov chain Monte Carlo sampling. Also, suitably chosen reexpression (which in this case is as simple as taking logarithms) can go a long way toward simplifying the analysis and increasing the adequacy of statistical models to data ([58], Chapter 5). However, the fundamental insights and specific proposals that John Mandel offered 50 years ago, about how to quantify repeatability and reproducibility, have withstood the test of time and continue to be valuable.

6. ROSIGLITAZONE

On July 22, 2007, The New York Times reported that Dr. Steven Nissen’s “questioning of the safety of the Avandia diabetes medication in late May” had “prompted a federal safety alert and led to a sales decline of about 30 percent for the drug,” which had earned GlaxoSmithKline (GSK) $3.2 billion in 2006.

The basis for that questioning was a meta-analysis [62] of 42 clinical studies of the risk of myocardial infarction and death from cardiovascular causes seemingly associated with the use of rosiglitazone, which is the active ingredient of Avandia. The results of each of these studies can be summarized in a 2 × 2 table, for example, Table 1 for the ADOPT study [91, 39], which was a randomized, double-blind, parallel-group study involving 4351 patients with recently diagnosed type 2 diabetes.

All together, the 42 studies whose results are listed in Nissen and Wolski ([62], Table 3) involved 27 833 patients. The prevalence of myocardial infarction was around 0.6% in both the rosiglitazone and control groups.

In four of these studies, there were no cases of myocardial infarction either in the rosiglitazone group or in the control group. These four were therefore excluded from consideration by those methods of data reduction

TABLE 1
Results of the ADOPT study, where patients were randomized to receive double-blinded rosiglitazone, glyburide or metformin, and were treated for periods of 4 years median duration, as originally reported by Kahn et al. ([39], Table 2), and transcribed by Nissen and Wolski ([62], Table 3)

                       Myocardial infarction
                       Yes     No      Total
Rosiglitazone Group    27      1429    1456
Control Group          41      2854    2895
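The counts in Table 1 suffice to illustrate how a single 2 × 2 table is reduced to an odds ratio with a confidence interval. The sketch below uses the textbook Wald interval on the log odds scale; it is offered as an illustration under that assumption, not as one of the meta-analytic methods compared in this section, and the function name `odds_ratio_ci` and its argument names are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.959964):
    """Odds ratio and Wald confidence interval for a 2 x 2 table.

    a, b: events and non-events in the treatment group
    c, d: events and non-events in the control group
    z:    standard Gaussian quantile (the default gives a 95% interval)
    """
    log_or = math.log((a * d) / (b * c))
    # Large-sample standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (
        math.exp(log_or),
        math.exp(log_or - z * se),
        math.exp(log_or + z * se),
    )


# ADOPT study counts from Table 1: myocardial infarction yes/no per group.
or_hat, lwr, upr = odds_ratio_ci(27, 1429, 41, 2854)
print(f"OR = {or_hat:.2f}, 95% CI = ({lwr:.2f}, {upr:.2f})")
```

For the ADOPT counts this gives an odds ratio of about 1.32 with an interval from about 0.81 to 2.15. The Wald formula breaks down when any cell is zero, which is the situation of the four studies with no events in either group.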
TABLE 2
Estimates and lower (LWR) and upper (UPR) endpoints of 95%
confidence intervals for the odds ratio (OR) comparing the effects of
rosiglitazone and control on myocardial infarction
OR LWR UPR
other and using different models. Blending is done as an exercise in meta-analysis [45].

However, each research group reports several quantiles of the probability distribution that expresses the uncertainty surrounding R, while most procedures used for meta-analysis expect the mean and the standard deviation of R’s distribution as inputs. Maishman et al. ([45], Table 1) list the 5th, 25th, 50th, 75th and 95th percentiles for R’s distribution, as produced by each of eleven models for a particular (but unspecified) date and region of the UK.

For model 3 in Table 1 of [45], these percentiles are 0.64, 0.70, 0.74, 0.79 and 0.87, respectively. The procedure that Maishman et al. [45] use to derive estimates of

Figure 7 shows the Gaussian cumulative distribution function and its skew-normal counterpart fitted to the percentiles that Maishman et al. ([45], Table 1) list for model 3, showing that the skew-normal model is appreciably more accurate than the Gaussian model. Table 3 lists the means and standard deviations imputed by Maishman et al. [45] for the eleven models, and their counterparts obtained using the skew-normal approximation.

Table 4 reveals details of the differences induced by the two different methods used to impute the mean and standard deviation based on sample percentiles, and also the differences attributable to four different statistical models used to reduce the data to obtain a consensus value and to evaluate the associated uncertainty.
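To make the imputation problem concrete, the following sketch fits a Gaussian distribution to the five percentiles quoted above for model 3, by regressing the reported percentiles on the corresponding standard Gaussian quantiles. This least-squares matching is an assumption made for illustration only: it is neither the procedure of Maishman et al. [45] nor the skew-normal fit shown in Figure 7.

```python
from statistics import NormalDist

probs = [0.05, 0.25, 0.50, 0.75, 0.95]
quantiles = [0.64, 0.70, 0.74, 0.79, 0.87]  # model 3, as quoted in the text

# A Gaussian with mean mu and standard deviation sigma has quantiles
# mu + sigma * z_p, where z_p is the standard Gaussian quantile.
z = [NormalDist().inv_cdf(p) for p in probs]

n = len(z)
z_bar = sum(z) / n
q_bar = sum(quantiles) / n

# Ordinary least squares of the reported quantiles on z_p:
# the slope estimates sigma and the intercept estimates mu.
sxy = sum((zi - z_bar) * (qi - q_bar) for zi, qi in zip(z, quantiles))
sxx = sum((zi - z_bar) ** 2 for zi in z)
sigma = sxy / sxx
mu = q_bar - sigma * z_bar

# Differences between reported and fitted Gaussian quantiles.
residuals = [qi - (mu + sigma * zi) for zi, qi in zip(z, quantiles)]

print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
print("residuals:", [round(r, 4) for r in residuals])
```

Under these assumptions the fitted mean is about 0.748 and the fitted standard deviation about 0.069. The residuals are positive at the 5th and 95th percentiles and negative in between, which is the pattern one would expect from a right-skewed distribution and is consistent with the skew-normal model fitting these percentiles more closely.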
and the other fundamental constants; (ii) measuring G is very challenging because it involves measuring extremely small forces; and (iii) the measured values of G are appreciably more dispersed than their individual measurement uncertainties intimate.

Reason (iii) is a manifestation of lack of reproducibility, as independent experiments, relying either on different physical principles or on different implementations of the same principle, have historically yielded mutually inconsistent measurement results.

Figure 8 shows the measurement results that CODATA (Committee on Data of the International Science Council) took into account for the 2018 release of the recommended values of the fundamental physical constants [90], and the results of two alternative statistical measurement models and data reductions for them.

Two kinds of statistical models have been used for measurement results such as these, depending on how one addresses their mutual inconsistency. The model discussed in Section 8.1 is based on Birge’s [7] suggestion whereby the reported uncertainties are magnified by a factor (Birge ratio) sufficiently large to achieve mutual consistency. The model discussed in Section 8.2, which we call the laboratory effects model, is a conventional mixed effects model [50], where G is the fixed effect and the experiment effects are the random effects. Both models will be fitted taking into account the three nonnull correlations between the measured values $\{G_j\}$ listed in the caption of Table XXIX in Tiesinga et al. [90].

Baker and Jackson [2], Koepke et al. [41], Merkatas et al. [51] all compare and discuss these two kinds of models, and point out that the preference for one or for the other seems to be mostly cultural, with CODATA and the Particle Data Group (pdg.lbl.gov) [32] favoring the Birge ratio, while medical meta-analysis [23] and interlaboratory studies in measurement science [80] generally opt for the additive mixed effects model.

The 16 measurement results for G are mutually inconsistent as judged by Cochran’s Q test [17], which yields an exceedingly small p-value. Figure 8 also shows the value of G recommended by CODATA in 2018 [90], and the estimates of G obtained by application of the multiplicative and additive models that address such mutual inconsistency, as detailed in the following two subsections.

8.1 Common Mean Model for G

The multiplicative model is a heteroscedastic, Gaussian, common mean model [11] (also called the “fixed effect” model; note the singular in “effect,” which marks it as a different model from the conventional fixed effects model), which amplifies the standard uncertainties multiplicatively with the inflation factor κ > 0:

(2)  $G_j = G + \kappa\,\varepsilon_j.$

The measurement errors $\{\varepsilon_j\}$ are assumed to have a joint multivariate Gaussian distribution with mean 0 and the same units as G, whose covariance matrix has the $\{u^2(G_j)\}$ along the main diagonal, and all the off-diagonal entries are 0 except for those that involve the correlations listed in the caption of Table XXIX of Tiesinga et al. [90]: 0.351 between NIST-82 and LANL-97; 0.134 between HUST-05 and HUST-09; and 0.068 between HUST-09 and HUSTT-18.

Both the 2014 [56] and 2018 [90] releases of the values recommended by CODATA for the fundamental constants employ an ad hoc procedure to assign a value to κ, as the smallest positive number such that the resulting standardized residuals (which Tiesinga et al. [90] call normalized residuals) all have absolute values no larger than 2. This choice, which Merkatas et al. ([51], Section 3.2) show is overly conservative, yields 3.9 as the estimate of κ.

Both maximum likelihood estimation (MLE) and the Bayesian alternative described by Bodnar and Elster [10] are model-based alternatives preferable to the aforementioned ad hoc procedure to estimate κ.

The maximum likelihood estimates of G and κ in equation (2) are $\hat{G} = 6.67430(13) \times 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$ and $\hat{\kappa} = 3.5(6)$. Note that the maximum likelihood estimate of κ is qualified with an evaluation of the associated uncertainty, which is neither recognized nor propagated for the ad hoc estimate used by Tiesinga et al. [90]. The corresponding results are depicted in the left panel of Figure 8.

8.2 Laboratory Effects Model for G

The NIST Decision Tree [74] (which ignores the three aforementioned correlations) recommends a Bayesian hierarchical model with Gaussian random effects and Gaussian measurement errors for these 16 measurement results, similar to the model in equation (1):

(3)  $G_j = G + \lambda_j + \varepsilon_j,$

where the $\{\varepsilon_j\}$ are assumed to be independent and Gaussian, all with mean zero and standard deviations equal to the reported standard uncertainties, $\{u(G_j)\}$, all of which are also assumed to be based on very large numbers of degrees of freedom, which is likely an unrealistic assumption.

The experiment effects, $\{\lambda_j\}$, are assumed to be Gaussian, centered at $0\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$ and with a covariance matrix all of whose elements are zero, except for $\tau^2$ along the main diagonal, and the same three elements in the upper and lower triangles that correspond to the three nonnull correlations mentioned above in Section 8.1.

This model is identifiable because the data are the pairs $\{(G_j, u(G_j))\}$: since the $\{\varepsilon_j\}$ should be consistent with the $\{u(G_j)\}$, the $\{G_j\}$ being overdispersed relative to the reported uncertainties suggests that the $\{\lambda_j\}$ cannot all be zero.

A Bayesian version of the model in equation (3), taking the aforementioned correlations into account, was fitted to
the data listed in Table XXIX of Tiesinga et al. [90] using Stan [16, 87] and R [86] codes listed in the Supplementary Material [72], with the results depicted in the right panel of Figure 8.

The prior distribution chosen for G was Gaussian with mean set equal to the 2014 CODATA recommended value for G [55], and with standard deviation set equal to the corresponding standard uncertainty. The prior distribution chosen for τ was half-Cauchy with median set equal to the MAD (as defined in the R environment for statistical computing and graphics [86]) of the measured values.

The posterior mean of G is $6.67399(20) \times 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$, which is not statistically significantly different from the 2018 CODATA [90] recommended value because the absolute value of their difference amounts to 1.24 times the standard error of their difference.

The dark uncertainty, τ, had posterior mean $0.00096 \times 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$, which is 3.8 times larger than the median of the standard uncertainties associated with the 16 measured values of G.

Figure 8 reveals that the laboratory effects model entails generally smaller, more equitable increases to the effective uncertainties of the measured values than the common mean model, which involves multiplicative inflation of the reported uncertainties. Note that both panels of Figure 8 have the same scale in their vertical axes.

8.3 Evaluating Reproducibility

Table 5 summarizes the estimates of G and of other relevant quantities from Sections 8.1 and 8.2, alongside the CODATA 2018 recommended value of G and associated standard uncertainty [90]. These three estimates of G do not differ significantly from one another once their uncertainties are taken into account.

Schlamminger [82] notes that not only do “the various measurements of G seem not to converge on a value; it seems that the convergence gets worse with each additional data point.” He concludes that “adding more data points from isolated experiments has not been the best strategy to improve the situation,” and supports the idea of “forming an international consortium to coordinate these demanding experiments.”

Such an international consortium [54] has meanwhile been formed, and in consequence the MARK-2 torsion balance that Quinn et al. [77, 78] built and used at the BIPM (International Bureau of Weights and Measures, Sèvres, France) was disassembled and shipped to NIST, in Gaithersburg, Maryland, U.S., where it was reassem-

FIG. 8. Measurement results for G, and results from two alternative statistical models and corresponding data reductions. The labels at the bottom are the same that are used by Tiesinga et al. ([90], Table XXIX), where the corresponding references are listed. The diamonds represent the measured values. The (green) thick vertical line segments represent the measurement results $\{G_j \pm u(G_j)\}$. The (dark blue) thin horizontal line segment, and the light blue band centered on it, represent the 2018 CODATA recommended value for G and the associated standard uncertainty ([90], Section XIX). Left panel: The (dark brown) thin horizontal line segment and the yellow band centered on it represent the consensus value computed using the common mean model of equation (2) fitted by maximum likelihood, and taking into account the correlations between experiments listed in the caption of Tiesinga et al. ([90], Table XXIX). The (purple) thin vertical line segments represent the $\{G_j \pm \kappa\, u(G_j)\}$. Right panel: Counterpart of the left panel for the mixed effects, Bayesian hierarchical model with Gaussian experiment effects and Gaussian measurement errors, also taking into account the aforementioned correlations. The (purple) thin vertical line segments represent the $\{G_j \pm (\hat{\tau}^2 + u^2(G_j))^{1/2}\}$, where $\hat{\tau}$ denotes τ’s posterior mean.
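Two pieces of arithmetic from Section 8 can be checked with a few lines of code that also exercise the parenthetic notation explained in Section 2. This is a minimal sketch under stated assumptions: the helper `parse_parenthetic` is hypothetical; the CODATA 2018 value 6.67430(15) is taken from the published CODATA adjustment rather than from the text above; and the standard error of the difference treats the two estimates as uncorrelated. The second part computes the effective uncertainties that the two models of Figure 8 assign to the UZur-06 result.

```python
import math
import re


def parse_parenthetic(text):
    """Return (value, standard uncertainty) from a string like '6.674252(122)'."""
    match = re.fullmatch(r"([+-]?\d+)\.(\d+)\((\d+)\)", text.strip())
    if match is None:
        raise ValueError(f"not in parenthetic notation: {text!r}")
    integer_part, decimals, unc_digits = match.groups()
    value = float(f"{integer_part}.{decimals}")
    # The digits in parentheses line up with the last decimals of the value.
    uncertainty = int(unc_digits) / 10 ** len(decimals)
    return value, uncertainty


# All values below are in units of 1e-11 m^3 kg^-1 s^-2.
g_post, u_post = parse_parenthetic("6.67399(20)")      # posterior mean, Section 8.2
g_codata, u_codata = parse_parenthetic("6.67430(15)")  # CODATA 2018 (assumed, see lead-in)

# Difference relative to its standard error, treating the estimates as uncorrelated.
ratio = abs(g_post - g_codata) / math.hypot(u_post, u_codata)

# Effective uncertainties for UZur-06 under the two models of Figure 8.
kappa_hat = 3.5    # maximum likelihood estimate of kappa, Section 8.1
tau_hat = 0.00096  # posterior mean of tau, Section 8.2
g_uzur, u_uzur = parse_parenthetic("6.674252(122)")
u_multiplicative = kappa_hat * u_uzur     # common mean model: kappa * u(G_j)
u_additive = math.hypot(tau_hat, u_uzur)  # laboratory effects: sqrt(tau^2 + u^2(G_j))

print(f"ratio = {ratio:.2f}")
print(f"UZur-06: kappa*u = {u_multiplicative:.6f}, sqrt(tau^2 + u^2) = {u_additive:.6f}")
```

Under these assumptions the ratio reproduces the 1.24 quoted in Section 8.2. With these values of κ and τ the two effective uncertainties coincide when u(G_j) equals τ/√(κ² − 1), roughly 0.00029 in the same units: results with larger reported uncertainties are inflated more by the multiplicative model, while results with smaller reported uncertainties, such as UZur-06, are inflated more by the additive model.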
The examples also show that a meaningful data analysis can require a preliminary choice of reexpression for the measurement results, in particular to facilitate and legitimize the use of a statistical model that is demonstrably adequate for the data, and that is also fit for purpose. This was the case for the values of stress in the interlaboratory study of rubber elongation (Section 5), where a logarithmic reexpression was very helpful, and also for the meta-analysis for the effects of rosiglitazone (Section 6), with the traditional focus on log odds.

In interlaboratory studies and meta-analyses, there often arise results that deviate markedly from the bulk of the others, either because the measured value is rather different from most of the others, or because the uncertainty reported in a result is very different from the uncertainties reported in the other results, or both.

In general, and concerning very different reported uncertainties, it is the smallest uncertainties that are particularly influential, especially when the measurement results are mutually inconsistent, because they tend to pull the consensus value toward their corresponding measured values. Such unusually small uncertainties can then be said to be influential “inliers.”

Faced with mutually inconsistent measurement results, the temptation is great to set “discrepant” values aside, thereby appearing to resolve the lack of reproducibility (Cox [24] describes one manner of succumbing to such temptation). However, unless there is a substantive, identifiable cause to do so, no “discrepant” result should be set aside, for the simple reason that in the absence of such cause there would be no logical basis whereon to reject discrepant values as being invalid: the most discrepant value can very well be the one closest to the true value of the measurand [26].

Statistical diagnostics are most valuable aids in identifying unusual measurement results, but statistical considerations alone are insufficient to reject a measurement result. Faced with challenges posed by “discrepant” but credible measurement results, one should tune the model to fit all credible results rather than set credible but “inconvenient” results aside. The example in Section 5 illustrated ways of accomplishing this, including by replacing the assumption that measurement errors are Gaussian with the assumption that they follow a Student’s t-distribution with a small number of degrees of freedom, similar to [66].

The roller coaster that has been the history of the use of rosiglitazone as a therapy (Section 6) shows that, even when starting from the same set of data, one can reach rather different conclusions owing to different statistical models and methods of data reduction; in other words, the issue of lack of reproducibility raised its head when the results of alternative but comparably tenable models and data reductions were compared.

When blending independent estimates of the reproduction number for COVID-19 in the UK (Section 7), it so turned out that the mere exercise of preparing the inputs for analysis can be quite influential upon the level of reproducibility of the results, above and beyond the differences between the epidemiological models that provided those inputs, and also above and beyond the methods used to determine a consensus value. This serves as a warning about the fact that fairly simple matters often relegated to routine work can impact reproducibility, or the lack thereof, substantially.

The history of the measurements of the least accessible of the fundamental constants of nature, the Newtonian constant of gravitation, G, shows that alternative treatments of the same data, even when they produce results that are in fair agreement, involve very different assumptions that effectively establish dividing lines in the interested community; in particular, and in this case, whether one adopts the approach first proposed by Raymond Birge and faithfully followed mostly by the physics community, or opts instead for the approach that is prevalent in medical meta-analysis and in measurement science.

But the most important lesson one can draw from the recent history of the measurement of G is a lesson of optimism and empowerment: when faced with a considerable, genuine reproducibility crisis, the scientific community is ready to engage in extraordinary, cooperative efforts to understand the root causes of the lack of reproducibility, and to do so with the resolve needed to move heaven and earth, and with the creativity to match, of which Stephan Schlamminger (NIST) and his collaborators provide paradigmatic examples.

ACKNOWLEDGMENTS

The author is immensely grateful to Stephan Schlamminger (NIST) for all that he has taught him over the years about the measurement of G. The author is also much indebted to Olha Bodnar (Örebro University, Sweden), David Newton (NIST) and Mikela Waldman (NIST and Georgetown University, Washington, DC) for their most valuable and extensive suggestions for improvement of a draft of this contribution. The author thanks David Woods (Univ. of Southampton, UK) for an exchange of emails about the measurement of the reproduction number of COVID-19 in the United Kingdom.

The author thanks the organizers of the special issue of Statistical Science dedicated to the issue of reproducibility for the invitation to contribute to it, and acknowledges the very helpful criticism and guidance that the guest editors, the journal’s editor and a referee provided throughout the revision process, which led to considerable improvements.

Some specific commercial entities, equipment or materials may be identified in this document in order to describe or illustrate an experimental or statistical procedure or concept adequately. Such identification is not intended
to imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor is it intended to imply that the entities, equipment or materials mentioned are necessarily the best available for the purpose.

SUPPLEMENTARY MATERIAL

Data and R Code (DOI: 10.1214/23-STS899SUPP; .zip). The supplementary information file Possolo2023-TrackingTruth-Supplement.R contains data and R code that facilitate reproducing the numerical results listed in this contribution.

REFERENCES

[1] Azzalini, A. (2014). The Skew-Normal and Related Families. Institute of Mathematical Statistics (IMS) Monographs 3. Cambridge Univ. Press, Cambridge. With the collaboration of Antonella Capitanio. MR3468021 https://doi.org/10.1017/cbo9781139248891
[2] Baker, R. and Jackson, D. (2015). New models for describing outliers in meta-analysis. Res. Synth. Methods 7 314–328. https://doi.org/10.1002/jrsm.1191
[3] Bates, D., Mächler, M., Bolker, B. and Walker, S. (2015). Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67 1–48. https://doi.org/10.18637/jss.v067.i01
[4] Beauchamp, C. R., Camara, J. E., Carney, J., Choquette, S. J., Cole, K. D., DeRose, P. C., Duewer, D. L., Epstein, M. S., Kline, M. C. et al. (2021). Metrological Tools for the Reference Materials and Reference Instruments of the NIST Materials Measurement Laboratory. NIST Special Publication 260-136 (2021 Edition). National Institute of Standards and Technology, Gaithersburg, MD. https://doi.org/10.6028/NIST.SP.260-136-2021
[5] Bell, S. (1999). A Beginner’s Guide to Uncertainty of Measurement. Measurement Good Practice Guide 11 (Issue 2). National Physical Laboratory, Teddington, Middlesex, United Kingdom. Amendments March 2001.
[6] BIPM (2019). The International System of Units (SI), 9th ed. International Bureau of Weights and Measures (BIPM), Sèvres, France.
[7] Birge, R. T. (1932). The calculation of errors by the method of least squares. Phys. Rev. 40 207–227. https://doi.org/10.1103/PhysRev.40.207
[8] Blackman, R. B. and Tukey, J. W. (1958). The measurement of power spectra from the point of view of communications engineering. I. Bell Syst. Tech. J. 37 185–282. MR0102897 https://doi.org/10.1002/j.1538-7305.1958.tb03874.x
[9] Blackwell, T., Brown, C. and Mosteller, F. (1991). Which denominator? In Fundamentals of Exploratory Analysis of Variance (D. C. Hoaglin, F. Mosteller and J. W. Tukey, eds.) 10 252–294. Wiley, New York, NY.
[10] Bodnar, O. and Elster, C. (2014). On the adjustment of inconsistent data using the Birge ratio. Metrologia 51 516–521. https://doi.org/10.1088/0026-1394/51/5/516
[11] Borenstein, M., Hedges, L. V., Higgins, J. P. T. and Rothstein, H. R. (2010). A basic introduction to fixed-effect and random-effects models for meta-analysis. Res. Synth. Methods 1 97–111. https://doi.org/10.1002/jrsm.12
[12] Bradburn, M. J., Deeks, J. J., Berlin, J. A. and Localio, A. R. (2007). Much ado about nothing: A comparison of the performance of meta-analytical methods with rare events. Stat. Med. 26 53–77. MR2312699 https://doi.org/10.1002/sim.2528
[13] Bürkner, P. C. (2017). brms: An R package for Bayesian multilevel models using Stan. J. Stat. Softw. 80 1–28. https://doi.org/10.18637/jss.v080.i01
[14] Bürkner, P. C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal 10 395–411. https://doi.org/10.32614/RJ-2018-017
[15] Campagnari, C. and Mulders, M. (2022). An upset to the standard model. Science 376 136–136. https://doi.org/10.1126/science.abm0101
[16] Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. et al. (2017). Stan: A probabilistic programming language. J. Stat. Softw. 76 1–32. https://doi.org/10.18637/jss.v076.i01
[17] Cochran, W. G. (1954). The combination of estimates from different experiments. Biometrics 10 101–129. https://doi.org/10.2307/3001666
[18] ATLAS Collaboration, Aaboud, M. (2018). Measurement of the W-boson mass in pp collisions at √s = 7 TeV with the ATLAS detector. European Physical Journal C 78 110. https://doi.org/10.1140/epjc/s10052-017-5475-4
[19] CDF Collaboration (2022). High-precision measurement of the W boson mass with the CDF II detector. Science 376 170–176. https://doi.org/10.1126/science.abk1781
[20] L3 Collaboration (2006). Measurement of the mass and the width of the W boson at LEP. Eur. Phys. J. C 45 569–587. https://doi.org/10.1140/epjc/s2005-02459-6
[21] Analytical Methods Committee (1989a). Robust statistics—how not to reject outliers. Part 1. Basic concepts. Analyst 114 1693–1697. https://doi.org/10.1039/AN9891401693
[22] Analytical Methods Committee (1989b). Robust statistics—how not to reject outliers. Part 2. Inter-laboratory trials. Analyst 114 1699–1702.
[23] Cooper, H., Hedges, L. V. and Valentine, J. C., eds. (2019). The Handbook of Research Synthesis and Meta-Analysis, 3rd ed. Russell Sage Foundation Publications, New York, NY.
[24] Cox, M. G. (2007). The evaluation of key comparison data: Determining the largest consistent subset. Metrologia 44 187–200. https://doi.org/10.1088/0026-1394/44/3/005
[25] Dai, D. C. (2021). Variance of Newtonian constant from local gravitational acceleration measurements. Phys. Rev. D 103 064059. https://doi.org/10.1103/PhysRevD.103.064059
[26] De Bièvre, P. (2007). Statistics and measurement results in chemistry. Accredit. Qual. Assur. 12 333–334. https://doi.org/10.1007/s00769-007-0294-1
[27] DELPHI Collaboration, Abdallah, J. et al. Measurement of the mass and width of the W boson in e+e− collisions at √s = 161-209 GeV. Eur. Phys. J. C 55 1. https://doi.org/10.1140/epjc/s10052-008-0585-7
[28] DerSimonian, R. and Laird, N. (1986). Meta-analysis in clinical trials. Control. Clin. Trials 7 177–188. https://doi.org/10.1016/0197-2456(86)90046-2
[29] Diamond, G. A., Bax, L. and Kaul, S. (2007). Uncertain effects of rosiglitazone on the risk for myocardial infarction and cardiovascular death. Ann. Intern. Med. 147 578–581. https://doi.org/10.7326/0003-4819-147-8-200710160-00182
[30] Fineberg, H. V., Allison, D. B., Barba, L. A., Chong, D., Donoho, D., Freire, J., Gabrielse, G., Gatsonis, C., Hall, E. et al. (2019). Reproducibility and Replicability in Science. Committee on Reproducibility and Replicability in Science, the National Academies of Sciences, Engineering, and Medicine. The National Academies Press, Washington, DC. https://doi.org/10.17226/25303
[31] Gaiser, C., Fellmuth, B., Haft, N., Kuhn, A., Thiele-Krivoi, B., Zandt, T., Fischer, J., Jusko, O. and Sabuga, W. (2017). Final determination of the Boltzmann constant by dielectric-constant gas thermometry. Metrologia 54 280–289. https://doi.org/10.1088/1681-7575/aa62e3
[32] Particle Data Group, Zyla, P. A. et al. (2020). Review of Particle Physics. Progress of Theoretical and Experimental Physics 083C01. https://doi.org/10.1093/ptep/ptaa104
[33] Gundersen, O. E. (2021). The fundamental principles of reproducibility. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 379 20200210. https://doi.org/10.1098/rsta.2020.0210
[34] Michell, J. (2005). The logic of measurement: A realist overview. Measurement 38 285–294. https://doi.org/10.1016/j.measurement.2005.09.004
[35] Harris, D. C. and Lucy, C. A. (2020). Quantitative Chemical Analysis, 10th ed. Macmillan Learning, New York, NY.
[36] Herschel, J. F. W. (1866). Familiar Lectures on Scientific Subjects X. The Yard, the Pendulum, and the Metre 419–451. Alexander Strahan, London.
[37] Home, P. D., Pocock, S. J., Beck-Nielsen, H., Curtis, P. S., Gomis, R., Hanefeld, M., Jones, N. P., Komajda, M. and McMurray, J. J. V. (2009). Rosiglitazone evaluated for cardiovascular outcomes in oral agent combination therapy for type 2 diabetes (RECORD): A multicentre, randomised, open-label trial. Lancet 373 2125–2135. https://doi.org/10.1016/S0140-6736(09)60953-3
[38] Jewell, N. P. (2004). Statistics for Epidemiology. CRC Press/CRC, Boca Raton, FL.
[39] Kahn, S. E., Haffner, S. M., Heise, M. A., Herman, W. H., Holman, R. R., Jones, N. P., Kravitz, B. G., Lachin, J. M., O’Neill, M. C. et al. (2006). Glycemic durability of rosiglitazone, metformin, or glyburide monotherapy. N. Engl. J. Med. 355 2427–2443. https://doi.org/10.1056/NEJMoa066224
[40] Klein, N. (2020). Evidence for modified Newtonian dynamics from Cavendish-type gravitational constant experiments. Classical Quantum Gravity 37 065002, 21. MR4086686 https://doi.org/10.1088/1361-6382/ab6cab
[41] Koepke, A., Lafarge, T., Possolo, A. and Toman, B. (2017). Consensus building for interlaboratory studies, key comparisons, and meta-analysis. Metrologia 54 S34–S62. https://doi.org/10.1088/1681-7575/aa6c0e
[42] Koetse, M. J., Florax, R. J. G. M. and de Groot, H. L. F. (2010). Consequences of effect size heterogeneity for meta-analysis: A Monte Carlo study. Stat. Methods Appl. 19 217–236. MR2651450 https://doi.org/10.1007/s10260-009-0125-0
[43] Langan, D., Higgins, J. P. T., Jackson, D., Bowden, J., Veroniki, A. A., Kontopantelis, E., Viechtbauer, W. and Simmonds, M. (2019). A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Res. Synth. Methods 10 83–98. https://doi.org/10.1002/jrsm.1316
[44] Langan, D., Higgins, J. P. T. and Simmonds, M. (2017). Comparative performance of heterogeneity variance estimators in meta-analysis: A review of simulation studies. Res. Synth. Methods 8 181–198. https://doi.org/10.1002/jrsm.1198
[45] Maishman, T., Schaap, S., Silk, D. S., Nevitt, S. J., Woods, D. C. and Bowman, V. E. (2022). Statistical methods used to combine the effective reproduction number, R(t), and other related measures of COVID-19 in the UK. Stat. Methods Med. Res. 31 1757–1777. MR4478307 https://doi.org/10.1177/09622802221109506
[46] Mandel, J. (1972). Repeatability and reproducibility. J. Qual. Technol. 4 74–85. https://doi.org/10.1080/00224065.1972.11980520
[47] Mandel, J. (1991). The validation of measurement through interlaboratory studies. Chemom. Intell. Lab. Syst. 11 109–119. https://doi.org/10.1016/0169-7439(91)80058-X
[48] Mandel, J. and Paule, R. (1970). Interlaboratory evaluation of a material with unequal numbers of replicates. Anal. Chem. 42 1194–1197. https://doi.org/10.1021/ac60293a019
[49] Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst. 22 719–748. https://doi.org/10.1093/jnci/22.4.719
[50] McCulloch, C. E., Searle, S. R. and Neuhaus, J. M. (2008). Generalized, Linear, and Mixed Models, 2nd ed. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ. MR2431553
[51] Merkatas, C., Toman, B., Possolo, A. and Schlamminger, S. (2019). Shades of dark uncertainty and consensus value for the Newtonian constant of gravitation. Metrologia 56 054001. https://doi.org/10.1088/1681-7575/ab3365
[52] Milton, M. J. T. and Possolo, A. (2020). Trustworthy data underpin reproducible research. Nat. Phys. 16 117–119. https://doi.org/10.1038/s41567-019-0780-5
[53] Misner, C. W., Thorne, K. S. and Wheeler, J. A. (2017). Gravitation. Princeton University Press, Princeton, NJ.
[54] Mohr, P. (2014). Newtonian constant of gravitation international consortium. https://www.nist.gov/programs-projects/newtonian-constant-gravitation-international-consortium. NIST Physical Measurement Laboratory.
[55] Mohr, P. J., Newell, D. B. and Taylor, B. N. (2015). CODATA Recommended Values of the Fundamental Physical Constants: 2014. CODATA Zenodo Collection. https://doi.org/10.5281/zenodo.22826
[56] Mohr, P. J., Newell, D. B. and Taylor, B. N. (2016). CODATA recommended values of the fundamental physical constants: 2014. Rev. Modern Phys. 88 035009. https://doi.org/10.1103/RevModPhys.88.035009
[57] Moldover, M. R., Trusler, J. P. M., Edwards, T. J., Mehl, J. B. and Davis, R. S. (1988). Measurement of the universal gas constant R using a spherical acoustic resonator. J. Res. Natl. Bur. Stand. 93 85–144. https://doi.org/10.6028/jres.093.010
[58] Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley Company, Reading, MA.
[59] Mould, J. and Uddin, S. A. (2014). Constraining a possible variation of G with type Ia supernovae. Publ. Astron. Soc. Austral. 31 e015. https://doi.org/10.1017/pasa.2014.9
[60] Munafò, M. R., Chambers, C., Collins, A., Fortunato, L. and Macleod, M. (2022). The Reproducibility Debate Is an Opportunity, Not a Crisis. BMC Research Notes 15 43. https://doi.org/10.1186/s13104-022-05942-3
[61] Newell, D. B. (2014). A more fundamental international system of units. Phys. Today 67 35–41. https://doi.org/10.1063/PT.3.2448
[62] Nissen, S. E. and Wolski, K. (2007). Effect of rosiglitazone on the risk of myocardial infarction and death from cardiovascular causes. N. Engl. J. Med. 356 2457–2471. https://doi.org/10.1056/NEJMoa072761
[63] NIST/SEMATECH (2012). NIST/SEMATECH E-Handbook of Statistical Methods. National Institute of Standards and Technology, U.S. Department of Commerce, Gaithersburg, MD. https://doi.org/10.18434/M32189
[64] Nozick, R. (1981). Philosophical Explanations. Harvard Univ. Press, Cambridge, MA.
[65] Olver, F. W. J., Lozier, D. W., Boisvert, R. F. and Clark, C. W., eds. (2010). NIST Handbook of Mathematical Functions. Cambridge Univ. Press, Cambridge. MR2723248
[66] Pinheiro, J. C., Liu, C. and Wu, Y. N. (2001). Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. J. Comput. Graph. Statist. 10 249–276. MR1939700 https://doi.org/10.1198/10618600152628059
[67] Pinheiro, L. and Emslie, K. R. (2018). Basic concepts and validation of digital PCR measurements. In Digital PCR: Methods and Protocols 11–24. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-7778-9_2
[68] Plesser, H. E. (2018). Reproducibility vs. replicability: A brief history of a confused terminology. Front. Neuroinform. 11 76. https://doi.org/10.3389/fninf.2017.00076
[69] Pontius, P. E. (1966). Measurement philosophy of the pilot program for mass calibration. National Bureau of Standards, Washington, DC. NBS Technical Note 288, Reprinted 1968, with minor corrections.
[70] Possolo, A. (2018). Measurement. In Advanced Mathematical and Computational Tools in Metrology and Testing: AMCTM XI (A. B. Forbes, N. F. Zhang, A. Chunovkina, S. Eichstädt and F. Pavese, eds.). Series on Advances in Mathematics for Applied Sciences 89 273–285. World Scientific Company, Singapore. https://doi.org/10.1142/9789813274303_0027
[71] Possolo, A. (2021). Concepts, methods, and tools enabling measurement quality. In Frontiers in Statistical Quality Control 13 (S. Knoth and W. Schmid, eds.) 19 339–357. Springer, Cham, Switzerland. https://doi.org/10.1007/978-3-030-67856-2_19
[72] Possolo, A. (2023). Supplement to “Tracking truth through measurement and the spyglass of statistics.” https://doi.org/10.1214/23-STS899SUPP
[73] Possolo, A., Bruce, S. S. and Watters, R. L. Jr. (2021). Metrological Traceability Frequently Asked Questions and NIST Policy. National Institute of Standards and Technology, Gaithersburg, MD. NIST Technical Note 2156. https://doi.org/10.6028/NIST.TN.2156
[74] Possolo, A., Koepke, A., Newton, D. and Winchester, M. R. (2021). Decision tree for key comparisons. J. Res. Natl. Inst. Stand. Technol. 126 126007. https://doi.org/10.6028/jres.126.007
[75] Possolo, A. and Meija, J. (2022). Measurement Uncertainty: A Reintroduction, 2nd ed. Sistema Interamericano de Metrologia (SIM), Montevideo, Uruguay. https://doi.org/10.4224/1tqz-b038
[76] Qu, J., Benz, S. P., Coakley, K., Rogalla, H., Tew, W. L., White, R., Zhou, K. and Zhou, Z. (2017). An improved electronic determination of the Boltzmann constant by Johnson noise thermometry. Metrologia 54 549–558. https://doi.org/10.1088/1681-7575/aa781e
[77] Quinn, T., Parks, H., Speake, C. and Davis, R. (2013). Improved determination of G using two methods. Phys. Rev. Lett. 111 101102. https://doi.org/10.1103/PhysRevLett.111.101102
[78] Quinn, T., Speake, C., Parks, H. and Davis, R. (2014). The BIPM measurements of the Newtonian constant of gravitation, G. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 372 0032. https://doi.org/10.1098/rsta.2014.0032
[79] Roush, S. (2005). Tracking Truth: Knowledge, Evidence, and Science. Oxford Univ. Press, New York, NY.
[80] Rukhin, A. L. (2009). Weighted means statistics in interlaboratory studies. Metrologia 46 323–331. https://doi.org/10.1088/0026-1394/46/3/021
[81] Rukhin, A. L., Biggerstaff, B. J. and Vangel, M. G. (2000). Restricted maximum likelihood estimation of a common mean and the Mandel-Paule algorithm. J. Statist. Plann. Inference 83 319–330. MR1748018 https://doi.org/10.1016/S0378-3758(99)00098-1
[82] Schlamminger, S. (2014). A cool way to measure big G. Nature 510 478–480. https://doi.org/10.1038/nature13507
[83] Schlamminger, S., Chao, L. S., Lee, V., Speake, C. C. and Newell, D. B. (2022). Measurement of Newton’s gravitational constant with the BIPM torsion balance. In American Physical Society April Meeting 2022 Session S16: Lab Experiments and Detector Characterization S16.00002.
[84] Schlamminger, S., Holzschuh, E., Kündig, W., Nolting, F., Pixley, R. E., Schurr, J. and Straumann, U. (2006). Measurement of Newton’s gravitational constant. Phys. Rev. D 74 082001. https://doi.org/10.1103/PhysRevD.74.082001
[85] Strain, M. C., Lada, S. M., Luong, T., Rought, S. E., Gianella, S., Terry, V. H., Spina, C. A., Woelk, C. H. and Richman, D. D. (2013). Highly precise measurement of HIV DNA by droplet digital PCR. PLoS ONE 8 1–8. https://doi.org/10.1371/journal.pone.0055943
[86] R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[87] Stan Development Team (2022). RStan: the R interface to Stan. R package version 2.21.7.
[88] Thomas, K. and Schmidt, M. S. (2012). Glaxo Agrees to Pay $3 Billion in Fraud Settlement. The New York Times July 2.
[89] Thompson, M. and Ellison, S. L. R. (2011). Dark uncertainty. Accredit. Qual. Assur. 16 483–487. https://doi.org/10.1007/s00769-011-0803-0
[90] Tiesinga, E., Mohr, P. J., Newell, D. B. and Taylor, B. N. (2021). CODATA recommended values of the fundamental physical constants: 2018. Rev. Modern Phys. 93 025010. https://doi.org/10.1103/RevModPhys.93.025010
[91] Viberti, G., Kahn, S. E., Greene, D. A., Herman, W. H., Zinman, B., Holman, R. R., Haffner, S. M., Levy, D., Lachin, J. M. et al. (2002). A Diabetes Outcome Progression Trial (ADOPT): An international multicenter study of the comparative efficacy of rosiglitazone, glyburide, and metformin in recently diagnosed type 2 diabetes. Diabetes Care 25 1737–1743. https://doi.org/10.2337/diacare.25.10.1737
[92] Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 36 1–48. https://doi.org/10.18637/jss.v036.i03
[93] White, R. (2011). The meaning of measurement in metrology. Accredit. Qual. Assur. 16 31–41. https://doi.org/10.1007/s00769-010-0698-1
[94] Wilson, E. O. (1998). Consilience: The Unity of Knowledge. Alfred A. Knopf, New York, NY.
[95] Yusuf, S., Peto, R., Lewis, J., Collins, R. and Sleight, P. (1985). Beta blockade during and after myocardial infarction: An overview of the randomized trials. Prog. Cardiovasc. Dis. 27 335–371. https://doi.org/10.1016/s0033-0620(85)80003-7