1.01 Quality of Analytical Measurements_Statistical Methods for Internal Validation

The document discusses statistical methods for internal validation of analytical measurements, emphasizing the importance of accurate results in various fields such as healthcare and quality control. It covers topics like confidence intervals, hypothesis testing, and the evaluation of method performance characteristics, including trueness, precision, and accuracy. The article serves as an update to previous work, incorporating MATLAB live-scripts and correcting estimates and references.


Quality of Analytical Measurements: Statistical Methods for Internal Validation☆

M Cruz Ortiz, Departamento de Química, Facultad de Ciencias, Universidad de Burgos, Burgos, Spain
Luis A Sarabia and M Sagrario Sánchez, Departamento de Matemáticas y Computación, Facultad de Ciencias, Universidad de Burgos,
Burgos, Spain
Ana Herrero, Departamento de Química, Facultad de Ciencias, Universidad de Burgos, Burgos, Spain
© 2019 Elsevier Inc. All rights reserved.

This article is an update of M.C. Ortiz, L.A. Sarabia, M.S. Sánchez, A. Herrero, 1.02—Quality of Analytical Measurements: Statistical Methods for Internal
Validation, Editor(s): Steven D. Brown, Romá Tauler, Beata Walczak, Comprehensive Chemometrics, Elsevier, 2009, pp. 17–76.

Introduction 3
Confidence and Tolerance Intervals 7
Confidence Interval 8
Confidence Interval on the Mean of a Normal Distribution 8
Case 1: Known variance 8
Case 2: Unknown variance 10
Confidence Interval on the Variance of a Normal Distribution 10
Confidence Interval on the Difference in Two Means 11
Case 1: Known variances 11
Case 2: Unknown variances 11
Case 3: Confidence interval for paired samples 12
Confidence Interval on the Ratio of Variances of Two Normal Distributions 12
Confidence Interval on the Median 12
Joint Confidence Intervals 13
Tolerance Intervals 13
Case 1: β-content tolerance interval 13
Case 2: β-expectation tolerance interval 14
Case 3: Distribution free intervals 14
Hypothesis Tests 15
Elements of a Hypothesis Test 15
Hypothesis Test on the Mean of a Normal Distribution 18
Case 1: Known variance 18
Case 2: Unknown variance 18
Case 3: The paired t-test 19
Hypothesis Test on the Variance of a Normal Distribution 19
Hypothesis Test on the Difference in Two Means 20
Case 1: Known variances 20
Case 2: Unknown variances 21
Test Based on Intervals 22
Hypothesis Test on the Variances of Two Normal Distributions 22
Hypothesis Test on the Comparison of Several Independent Variances 24
Case 1: Cochran’s test 24
Case 2: Bartlett’s test 25
Case 3: Levene’s test 25
Goodness-of-Fit Tests: Normality Tests 26
Case 1: Chi-square test 26
Case 2: D’Agostino normality test 26
One-Way Analysis of Variance 27
The Fixed Effects Model 28
Power of the Fixed Effects ANOVA model 30
Uncertainty and Testing of the Estimated Parameters in the Fixed Effects Model 31
Case 1: Orthogonal contrasts 32
Case 2: Comparison of several means 32
The Random Effects Model 34
Power of the Random Effects ANOVA model 35
Confidence Intervals for the Estimated Parameters in the Random Effects Model 35


Change History: October 2019: M. Cruz Ortiz, Luis A. Sarabia, M. Sagrario Sánchez, Ana Herrero added MATLAB live-scripts for the computations; re-written
introduction to tolerance intervals; corrected estimates in Table 13; updated texts; corrected mistakes and updated references.

Comprehensive Chemometrics 2nd edition: Chemical and Biochemical Data Analysis https://doi.org/10.1016/B978-0-12-409547-2.14746-8 1

Statistical Inference and Validation 36


Trueness 36
Precision 37
Statistical Aspects of the Experiments to Determine Precision 39
Consistency Analysis and Incompatibility of Data 39
Case 1: Elimination of data 39
Case 2: Robust methods 41
Accuracy 43
Ruggedness 43
Appendix 45
Some Basic Elements of Statistics 45
The Normal Distribution 46
Student’s t Distribution 46
The χ² (Chi-square) Distribution 47
The F Distribution 48
Convergence of Random Variables 48
Some Computational Aspects 48
Normal distribution 49
Student’s t distribution with ν degrees of freedom 49
χ² distribution with ν degrees of freedom 49
F(ν1, ν2) distribution with ν1 and ν2 degrees of freedom 49
Power for the z-test, Eq. 49
Power for the t-test, Eq. 50
Power for the chi-square test, Eq. 50
Power for the F-test, Eq. 50
Power for fixed effects ANOVA, Eq. 50
Power for random effects ANOVA, Eq. 50
References 50

Nomenclature
1 − α Confidence level
1 − β Power
CCα Limit of decision
CCβ Capability of detection
F(ν1, ν2) F distribution with ν1 and ν2 degrees of freedom (d.f.)
H0 Null hypothesis
H1 Alternative hypothesis
N(μ,σ) Normal distribution with mean μ and standard deviation σ
NID(μ,σ) (Normally and Independently Distributed) independent random variables equally distributed as normal with mean μ and standard deviation σ
s Sample standard deviation
s² Sample variance
tν Student’s t distribution with ν degrees of freedom (d.f.)
x̄ Sample mean
V(X) Variance of the random variable X
α Significance level, probability of type I error
β Probability of type II error
Δ Bias (systematic error)
ε Random error
μ Mean
ν Degree(s) of freedom, d.f.
σ Standard deviation
σ² Variance
σR Reproducibility (as standard deviation)
σr Repeatability (as standard deviation)
χ²ν χ² (chi-square) distribution with ν degrees of freedom

Introduction

Every day, millions of analytical determinations are made in thousands of laboratories around the world. These measurements
are needed to assess merchandise in commercial exchanges, to support health care, to maintain security, for quality control of
water and the environment, for characterization of raw materials and manufactured products, and for forensic analyses.
Practically every aspect of contemporary social activity relies in some way on analytical measurements. The cost of these
measurements is high, but the cost of decisions based on incorrect results is much greater. For example, a test that wrongly
shows the presence of a forbidden substance in a food destined for human consumption can result in an expensive claim,
confirmation of the presence of a drug of abuse can lead to a serious judicial sentence, and doping in sport may result in severe
sanctions. The importance of providing a correct result is evident, but it is equally important to be able to prove that the result
is correct.
Once an analytical problem is posed to a laboratory and the analytical method is selected, the next step is the in-house validation
of the method. This is the process of defining the analytical requirements to respond to the problem and to confirm that the
considered method has performance characteristics consistent with those required. The results of the validation experiments must
be evaluated to ensure that the method meets the required measurement specifications.
The set of operations carried out to determine the value of a suitably defined quantity (the measurand) is called the measurement.
The method of measurement is the sequence of operations used when conducting the measurements. It is documented in enough
detail that the measurement may be done without additional information.
Once a method is designed or selected, it is necessary to evaluate its performance characteristics and to identify the factors that
can change these characteristics and to what extent they can change. If, in addition, the method is developed to solve a particular
analytical problem, it is necessary to verify that the method is fit for purpose.1 This process of evaluation is called validation of the
method. It implies the determination of several parameters that characterize the method performance: decision limit, capability of
detection, selectivity, specificity, ruggedness, and accuracy (trueness and precision). In any case, it is the measurements
themselves that allow evaluation of the performance characteristics of the method and its fitness for purpose. In addition, when
using the method, the obtained measurements are also the ones that will be used to make decisions on the analyzed sample, for
example, whether the amount of an analyte fulfills a legal specification. Therefore, it is necessary to suitably model the data that a
method provides. In what follows, we will consider that the data provided by the analytical method are real numbers. Other
possibilities exist: counts of bacteria or of impacts on a detector take only (discrete) natural values, and sometimes the data
resulting from an analysis are qualitative, for example, the identification of an analyte through its m/z ratios in a mass
spectrometry-chromatography analysis.
With regard to the analytical measurement, it is accepted that the value, x, provided by the method of analysis consists of three
terms, the true value of the measurand μ, a systematic error (bias) Δ, and a random error ε with zero mean, combined additively as
expressed in Eq. (1):

x = μ + Δ + ε (1)

All the possible measurements that a method can provide when analyzing a sample constitute the population of the measurements.
This is a theoretical construct, because it assumes that there are infinitely many samples and that the method of
analysis remains unaltered. Under these conditions, the model of the analytical method, Eq. (1), is mathematically a random variable,
X, with mathematical expectation μ + Δ and variance equal to the variance of ε; in statistical notation, E(X) = μ + Δ and V(X) = V(ε), respectively.
A random variable, and thus the analytical method, is described by its cumulative distribution function FX(x), that is, the
probability that the method provides measurements less than or equal to x for any value x. Symbolically, this is written as
FX(x) = pr{X ≤ x} for any real value x. In most applications, it is assumed that FX(x) is differentiable, which implies, among other
things, that the probability of obtaining exactly a specific value is zero. In the case of a differentiable cumulative distribution
function, the derivative of FX(x) is the probability density function (pdf) fX(x). Any function f(x) such that it is non-negative, f(x) ≥ 0, and
the area under it is 1, ∫ℝ f(x)dx = 1, is the pdf of a random variable. The probability that the random variable X takes
values in the interval [a, b] is the area under the pdf over the interval [a, b], that is,

pr{X ∈ [a, b]} = ∫ₐᵇ f(x) dx (2)

and the mean and variance of X are written as in Eqs. (3) and (4), respectively,

E(X) = ∫ℝ x f(x) dx (3)

V(X) = ∫ℝ (x − E(X))² f(x) dx (4)

In general, the mean and variance do not uniquely characterize a random variable, and therefore they do not uniquely characterize
the method of analysis either. Fig. 1 shows the pdf of four random variables with the same mean 6.00 and standard deviation 0.61.

Fig. 1 Probability density functions of four random variables with mean 6 and variance 0.375. (A) Uniform in [4.94, 7.06]; (B) Symmetric triangular in [4.5, 7.5];
(C) Normal N(6, 0.61); (D) Weibull with shape 1.103 and scale 0.7 shifted to give a mean of 6. Dotted vertical lines mark the interval [5.0, 7.0].

These four distributions, uniform or rectangular (Fig. 1A), triangular (Fig. 1B), normal (Fig. 1C), and Weibull (Fig. 1D), are
frequent in the scope of analytical determinations; they appear in Appendix E of the EURACHEM/CITAC Guide1 and are also
used in metrology.2
If the only available information regarding a quantity X is the lower limit, l, and the upper limit, u, but the quantity could be
anywhere in between, with no idea of whether any part of the range is more likely, then a rectangular distribution in the interval [l, u]
would be assigned to X. This is so because it is the pdf that maximizes the “information entropy” of Shannon, in other words the pdf
that adequately characterizes the incomplete knowledge about X. Frequently, for a reference material, the certified concentration is
expressed as a number with unqualified limits (e.g., 1000 ± 2 mg L−1). In this case, a rectangular distribution should be
used (Fig. 1A).
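As a numerical aside, the standard deviation of a rectangular distribution of half-width a is a/√3, which is how an unqualified ±2 mg L−1 limit is converted into a standard uncertainty. A minimal Python sketch (the article's own scripts are MATLAB live-scripts):

```python
import math

# Certified value 1000 ± 2 mg/L with no further information: model as a
# rectangular (uniform) distribution on [998, 1002], the maximum-entropy choice
a = 2.0                      # half-width of the interval
u = a / math.sqrt(3)         # standard deviation of the rectangular pdf
print(round(u, 2))           # ≈ 1.15 mg/L
```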
When the available information concerning X includes the knowledge that values close to c (between l and u) are more likely
than those near the bounds, the adequate distribution is a triangular one (Fig. 1B), with the maximum of its pdf in c.
If a good location estimate, μ, and a scale estimate, σ, are the only information available regarding X, then, according to the
principle of maximum entropy, a normal probability distribution N(μ,σ) (Fig. 1C) would be assigned to X (remember that μ and σ
may have been obtained from repeated applications of a measurement method).
Finally, the Weibull distribution (Fig. 1D) is very versatile; it can mimic the behavior of other distributions such as the normal or
exponential. It is adequate for the analysis of reliability of processes, and in chemical analysis it is useful in describing the behavior
of the figures of merit of a long-term procedure. For example, the distribution of the capability of detection CCβ is a Weibull one,3 as is
the distribution of the determinations of ammonia in water by UV-vis spectroscopy during 350 different days in Aldama.4
In the four cases given in Fig. 1, the probability of obtaining values between 5 and 7 has been computed with Eq. (2). For the
uniform distribution (Fig. 1A) this probability is 0.94, whereas for the triangular distribution (Fig. 1B) it is 0.88, for the normal
distribution (Fig. 1C) it is 0.90, and for the Weibull distribution (Fig. 1D), 0.93. Sorting in decreasing order of the proportion of values
that each distribution accumulates in the interval [5.0, 7.0] gives uniform, Weibull, normal, and triangular, although the triangular
and normal distributions give values symmetrically around the mean and the Weibull distribution does not. If another interval
is considered, say [5.4, 6.6], the distributions accumulate probabilities of 0.57, 0.64, 0.67, and 0.54, respectively; the
differences among values are larger than before and the order of the distributions becomes normal, triangular, uniform, and Weibull.

Table 1 Values of b such that p = pr{X < b}, where X is each of the random variables defined in the caption of Fig. 1.

p       Uniform   Triangular   Normal   Weibull
0.01    4.96      4.71         4.58ᵃ    5.34ᵇ
0.05    5.05      4.97ᵃ        5.00     5.37ᵇ
0.50    6.00ᵇ     6.00ᵇ        6.00ᵇ    5.83ᵃ
0.95    6.95ᵃ     7.03         7.01     7.22ᵇ
0.99    7.04ᵃ     7.29         7.42     8.12ᵇ

ᵃ Minimum b among the four distributions; ᵇ maximum b among the four distributions.

If for each of those variables the value b is determined so that there is a fixed probability, p, of obtaining values below b (i.e., the
value b such that p = pr{X < b} for each distribution X), the results of Table 1 are obtained. For example (second row), 5% of the
time the uniform distribution at hand gives values less than b = 5.05, and less than 4.97 if it is the triangular distribution, and so on.
In the table, the extreme values among the four distributions for each probability p have been identified, and large differences are
observed caused by the form in which the values far from 6 are distributed (notice the differences in Fig. 1 for the normal, the
triangular, or the uniform distribution) and also due to the asymmetry of the Weibull distribution.
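The quantiles of Table 1 are the inverse of the cdf evaluated at each p (`ppf` in SciPy); a sketch assuming the same four distributions as in Fig. 1:

```python
from scipy import stats
from scipy.special import gamma

shape, scale = 1.103, 0.7
loc = 6 - scale * gamma(1 + 1 / shape)        # Weibull shifted so the mean is 6
dists = {
    "uniform": stats.uniform(loc=4.94, scale=2.12),
    "triangular": stats.triang(c=0.5, loc=4.5, scale=3.0),
    "normal": stats.norm(loc=6, scale=0.61),
    "Weibull": stats.weibull_min(c=shape, loc=loc, scale=scale),
}

# b such that p = pr{X < b}: the inverse cdf evaluated at each p of Table 1
for p in (0.01, 0.05, 0.50, 0.95, 0.99):
    row = {name: round(d.ppf(p), 2) for name, d in dists.items()}
    print(p, row)
```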
Therefore, the mean and variance of a random variable give very limited information on the values provided by the random
variable, unless additional information is at hand about the form of its density (pdf ). For example, if one knows that the
distribution is uniform or symmetrical triangular or normal, the random variable is completely characterized by its mean and
variance.
In practice, the pdf of a method of analysis is unknown. We only have a finite number, n, of measurements, which are the
outcomes obtained when applying repeatedly (n times) the same method to the same sample. These n measurements constitute a
statistical sample of the random variable X defined by the method of analysis.
Fig. 2 shows histograms of 100 results obtained when applying four methods of analysis, named A, B, C, and D, to aliquot parts
of a sample to determine an analyte. Clearly, the four methods behave differently.
From the experimental data, the (sample) mean and variance are computed as

x̄ = (Σᵢ₌₁ⁿ xᵢ)/n (5)

s² = Σᵢ₌₁ⁿ (xᵢ − x̄)²/(n − 1) (6)

x̄ and s² are estimates of the mean and variance of the distribution of X. These estimates for the data in Fig. 2 are shown in Table 2.
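Eqs. (5) and (6) written out in code, with hypothetical replicate data (the supplementary material does this in MATLAB; this is an equivalent Python sketch):

```python
# Hypothetical replicate determinations (mg/L) on aliquots of one sample
x = [6.1, 5.9, 6.3, 6.0, 6.2]
n = len(x)

mean = sum(x) / n                                   # Eq. (5)
var = sum((xi - mean) ** 2 for xi in x) / (n - 1)   # Eq. (6), n - 1 d.f.
print(mean, round(var, 3))
```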
According to the model of Eq. (1), E(X) = μ + Δ ≈ x̄, that is, the sample mean estimates the true value μ plus the bias Δ. Assuming
that the true value is μ = 6 and subtracting it from the sample means in the first row of Table 2, the estimated bias would be 0.66 for
methods A and B and 0.16 for methods C and D. The bias of a method is one of its performance characteristics and must be evaluated
during the validation of the method. In fact, technical guides, for example, the one by the International Organization for
Standardization (ISO), state that, for a method, better trueness means less bias. To estimate the bias, it is necessary to have samples
with known concentration μ (e.g., certified material, spiked samples).
The value of the variance is independent of the true content, μ, of the sample. For this reason, to estimate the variance it is only
necessary to have replicated measurements on aliquot parts of the same sample. The second row of Table 2 shows that methods
B and C have the same variance, 1.26, which is five times greater than that of methods A and D, 0.25. The dispersion of the data
obtained with a method is the precision of the method and constitutes another performance characteristic to be determined in the
validation of the method. In agreement with the model in Eq. (1), a measure of the dispersion is the variance V(X), which is estimated
by means of s².
On some occasions, for evaluating trueness and precision, it is more descriptive to use statistics other than the mean and variance. For
example, when the distribution is rather asymmetric, as in Fig. 1D, it is more reasonable to use the median than the mean. The
median is the value at which the distribution accumulates 50% of the probability: 5.83 for the pdf in Fig. 1D and 6.00 for the other
three distributions, which are symmetric around their mean. In practice, anomalous data (outliers) are frequent; they influence the
mean and, above all, the variance, which is improperly inflated. In these cases, it is advisable to use robust estimates of central
tendency and spread (dispersion).5–7 Details can be found in the chapter of the present book devoted to robust procedures.
Fig. 2 and Table 2 show that the two characteristics of a measurement method, trueness and precision, are independent of one
another, in the sense that a method with better trueness (less bias), methods C and D, can be more (case D) or less (case C) precise.
Analogously, methods A and B have an appreciable bias, but A is more precise than B. A method is said to be accurate when it is
precise and fulfills trueness.
Histograms are estimates of the pdf and allow evaluation of the performance of each method in a more detailed way than when
only considering trueness and precision. For example, the probability of obtaining values in any interval can be estimated with the

Fig. 2 Frequency histograms of 100 measures obtained with four different analytical methods, named (A), (B), (C), and (D), on aliquot parts of a sample. Dotted
vertical lines mark the interval [5.0, 7.0].

Table 2 Some characteristics of the distributions in Fig. 2.

                          Method
                      A       B       C       D
Mean, x̄              6.66    6.66    6.16    6.16
Variance, s²          0.25    1.26    1.26    0.25
fr{5 < X < 7}         0.70    0.56    0.58    0.98
fr{X < 6}             0.08    0.29    0.49    0.39
pr{5 < N(x̄, s) < 7}   0.75    0.55    0.62    0.94
pr{N(x̄, s) < 6}       0.09    0.28    0.44    0.37

fr, frequencies; pr, probabilities.

histogram. The third row in Table 2 shows the frequencies for the interval [5.0, 7.0]. Method D (best trueness and precision among
the four) provides 98% of the values in the interval, whereas method B (worst trueness and precision) provides only 56% of the
values in the interval. Nonetheless, trueness and precision should be considered jointly. According to the data in Table 2, the
effect of increasing the precision when the bias is “high” (using method A instead of B) is an increase of 14% in the proportion of
results in the interval [5.0, 7.0], whereas when the bias is small (C and D), the increase is 40%. This
behavior should be taken into account when optimizing a method and also in the ruggedness analysis, which is another
performance characteristic to be validated according to most of the guides. As can be seen in the fourth row of Table 2, if the
method that provides the most results below 6 is needed, C would be the method selected.
The previous explanations show the usefulness of knowing the pdf of the results of a method of analysis. As in practice we have
only a limited number of results, two basic strategies are possible to estimate the pdf: (1) to verify that the experimental data are

compatible with a known distribution (e.g., normal) and then use the corresponding pdf; (2) to estimate the pdf by a data-driven
technique based on a computer-intensive method such as the kernel method8 or by using other methods such as adaptive or
penalized likelihood.9,10 The data of Fig. 2 can be adequately modeled by a normal distribution, according to normality hypothesis
tests whose details are explained later in Section “Goodness-of-Fit Tests: Normality Tests”. The fitted normal distributions are used
to compute the probabilities of obtaining values in the interval [5.0, 7.0] or less than 6, last two rows in Table 2. When comparing
these values with those computed with the empirical histograms (compare rows 3 and 5, and rows 4 and 6), there are no appreciable
differences and the normal pdf can be used instead.
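For instance, for method D (x̄ = 6.16, s² = 0.25) the last two rows of Table 2 follow directly from the fitted normal distribution:

```python
from scipy import stats

# Method D of Table 2: normal fitted with the sample mean and standard deviation
mean, s = 6.16, 0.25 ** 0.5          # s = 0.5
d = stats.norm(loc=mean, scale=s)

p_interval = d.cdf(7) - d.cdf(5)     # pr{5 < N(x̄, s) < 7}
p_below_6 = d.cdf(6)                 # pr{N(x̄, s) < 6}
print(round(p_interval, 2), round(p_below_6, 2))
```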
In the validation of an analytical method and during its later use, statistical methodological strategies are needed to make
decisions from the available experimental data. Knowledge of these strategies implies a way of thinking and acting that,
subordinated to the chemical knowledge, makes objective both the analytical results and their comparison with those of other
researchers and/or other analytical methods.
Ultimately, a good method of analysis is a serious attempt to come close to the true value of the measurement, which is always
unknown. For this reason, the result of a measurement has to be accompanied by an evaluation of its uncertainty or degree of
reliability. This is done by means of a confidence interval. When the requirement is to establish the quality of an analytical method,
its capability of detection, precision, etc. must be compared with those corresponding to other methods. This is formalized with a
hypothesis test. Confidence intervals and hypothesis tests are the basic tools in the validation of analytical methods.
In this introduction, the word sample has been used with two different meanings. Usually, there is no confusion because the
context allows one to distinguish whether it is a sample in the statistical or chemical sense.
In Chemistry, according to the International Union of Pure and Applied Chemistry (IUPAC) (Page 50 in Section 18.3.2 of
Inczédy et al.11), “sample” should be used only when it refers to a portion selected from a larger amount of material. This meaning
coincides with that of a statistical sample and implies the existence of sampling error, that is, error caused by the fact that the sample
may be more or less representative of the material as a whole. For example, suppose that we want to measure the amount of
pesticide that remains in the soil of an arable plot after a certain time. We take several samples “representative” of the soil of
the plot (statistical sampling), and this introduces an uncertainty in the results characterized by a (theoretical) variance σ²s.
Afterward, the quantity of pesticide in each chemical sample is determined by an analytical method, which has its own uncertainty,
characterized by σ²m, in such a way that the uncertainty in the quantity of pesticide in the plot is σ²s + σ²m, provided that the method
gives results independent of the location of the sample. Sometimes, when evaluating whether a method is adequate for a task, the
sampling error can be an important part of the uncertainty in the result and, of course, should be taken into account when planning
the experimentation.
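The combination σ²s + σ²m reflects that independent errors add in variance, not in standard deviation; a small sketch with hypothetical values for the two contributions:

```python
import math

# Hypothetical standard deviations for sampling and for the analytical method
sigma_s = 0.30   # sampling, mg/kg
sigma_m = 0.15   # measurement, mg/kg

# Independent errors add in variance, not in standard deviation
var_total = sigma_s ** 2 + sigma_m ** 2
sd_total = math.sqrt(var_total)
print(round(var_total, 4), round(sd_total, 3))
```

Note that the total standard deviation (≈ 0.335 here) is well below the sum of the two standard deviations (0.45).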
When the sampling error is negligible, for example, when a portion is taken from a homogeneous solution, the IUPAC
recommends using words such as test portion, aliquot, or specimen.
In summary, there is a clear link between a measurement method and a random variable, which is why probability is the
natural form of expressing experimental uncertainty. This is thus the focus of the present article, which is organized as follows:
Section “Confidence and Tolerance Intervals” describes confidence intervals to measure bias and precision under the normality
hypothesis and tolerance intervals, useful in evaluating the fitness for purpose of a method. Also, a nonparametric interval on the
median is described.
Section “Hypothesis Tests” is devoted to making decisions based on experimental data that, as such, are affected by uncertainty.
In this section, the computation of the power of a test is systematically proposed as a key element to evaluate the quality of the
decision at the desired significance level. A brief incursion into tests based on intervals is also made as they solve the problem of
deciding whether an interval of values is acceptable, for example, a relative error less than 10% in absolute value. The section ends
with some goodness-of-fit tests to evaluate the compatibility of a theoretical probability distribution with some experimental data.
Section “One-Way Analysis of Variance” is dedicated to the analysis of variance (ANOVA) for both fixed and random effects, and
in Section “Statistical Inference and Validation” some more specific questions related to the usual parameters of the analytical
method validation and their relation with the developed statistical methodologies are analyzed.
Mathematical proofs are not covered in this article; to be operative from a practical point of view, several examples have
been included so that the reader can verify the understanding of the formulas and the argumentation for their thoughtful use. This
aspect is completed with an Appendix where some essential aspects related to the effectiveness of the statistical
models and the limit laws are described. The Appendix also contains the statements, in MATLAB code, needed to repeat all the
calculations proposed along the article. The same statements are also available as supplementary material in the form of
MATLAB .mlx live scripts (at least release R2016a is needed to read and execute them).

Confidence and Tolerance Intervals

There are some important questions when evaluating a method, for example, “in a given sample, what is the maximum value that it
provides?” that, due to the random character of the results, cannot be answered with just a number.
In order to include the degree of certainty in the answer, the question should be reformulated as: What is the maximum value, U,
that will be obtained 95% of the times that the method is used on the sample? The answer to the question thus posed would be a
tolerance interval, and to build it the probability distribution must be known. For instance, let us suppose that it is a N(μ,σ) and
denote by z0.05 the critical value of a N(0,1) = Z distribution, the one that accumulates probability 0.95. Then, a possible answer is

U = μ + z0.05σ, because then the probability that the analytical method gives values greater than U is pr{method > U} = pr
{N(μ,σ) > μ + z0.05σ}, which, according to the result in the Appendix, is equal to pr{Z > z0.05} = 0.05. In general, for any percentage
of results 100(1 − α)%, the maximum value provided by the method would be

U = μ + zα σ (7)

with a probability α that the aforementioned assertion is false.

If, instead, the interest was in the value L such that 100(1 − α)% of the results are greater than L, then the answer would be

L = μ − zα σ (8)

Finally, the interval [L, U] that contains 100(1 − α)% of the values obtained with the method would be

[L, U] = [μ − zα/2 σ, μ + zα/2 σ] (9)

An analytical example where one of these tolerance intervals with a normal distribution N(μ,σ) needs to be computed would be: an
analytical method gives values (mg L−1) that follow a N(9, 0.5) distribution when measuring a standard with 9 mg L−1. To assess
whether the method is still working properly, ten standards are included in the daily sequence of determinations. The probability
distribution of the mean of these ten values is a N(9, 0.5/√10). Following Eq. (9), the tolerance interval at the 95% level is
9 ± 1.96 × 0.5/√10 = 9 ± 0.31 mg L−1. Consequently, if one day a mean of, say, 9.5 mg L−1 is obtained, the method is not
working properly, because 9.5 does not belong to the tolerance interval, and the method should be revised, at the risk of doing
this revision uselessly 5% of the times. Notice that the tolerance interval is always the same, built at the desired confidence level
100(1 − α)% with the distribution N(9, 0.5/√10); it is not updated daily with the new samples.
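The daily-check example can be sketched as follows (Eq. (9) with μ = 9 and σ = 0.5/√10):

```python
import math
from scipy import stats

# Daily check: mean of 10 standards, each following N(9, 0.5) mg/L
mu, sigma, n = 9.0, 0.5, 10
sem = sigma / math.sqrt(n)               # std. dev. of the mean, 0.5/sqrt(10)

z = stats.norm.ppf(0.975)                # 1.96 for a two-sided 95% interval
low, high = mu - z * sem, mu + z * sem   # Eq. (9)
print(round(low, 2), round(high, 2))     # interval ≈ [8.69, 9.31]

day_mean = 9.5
print(low <= day_mean <= high)           # False: the method should be revised
```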
Different from Eq. (9), two variants of tolerance intervals, namely the β-content and the β-expectation tolerance intervals, are
explained in Section “Tolerance Intervals” due to their relevance in the context of validation of analytical methods. In any case,
either of them is completely different from the confidence intervals introduced and developed in the following sections (from
Section “Confidence Interval” to Section “Joint Confidence Intervals”).
After explaining all the studied cases, the section finishes with a comparative analysis of both concepts (tolerance and confidence
intervals).

Confidence Interval
We have already remarked that estimation of solely the mean, x̄, and variance, s², from n independent results provides very limited information on the method performance. The objective now is to make affirmations of the type "in the sample, the amount of the analyte μ, estimated by x̄, is between L and U (μ ∈ [L, U])" with a certain probability that the statement is true. Following this particular example, we should consider that x̄ is a value taken by the random variable X̄ (sample mean) and use its distribution to answer the new question. Its distribution function is obtained mathematically from that of X, F_X(x), and thus depends on the information we have about F_X(x) (e.g., whether the variance is known or should also be estimated, etc.).
In the general case, with a random variable X, obtaining a confidence interval for X from a sample x1, x2, . . ., xn consists of obtaining two functions l(x1, x2, . . ., xn) and u(x1, x2, . . ., xn) such that

pr{X ∈ [l, u]} = pr{l ≤ X ≤ u} = 1 − α   (10)

1 − α is the confidence level and α is the significance level, meaning that the statement that the value of X is between l and u will be false 100α% of the times.
In the next sections this idea will be particularized for some different cases, according to the random variable X of interest. Fig. 3
is a diagram that summarizes the cases studied in the following sections. All the examples are written in MATLAB live-script file
Intervals_section1022_live.mlx, in the supplementary material, so that they can be easily repeated or adapted for the reader’s
own data.

Confidence Interval on the Mean of a Normal Distribution


Case 1: Known variance
Suppose that we have a random variable that follows a normal distribution with known variance. This will be the case, for example, of using an already validated method of analysis. The assumption means that we know that E in Eq. (1) is normally distributed and we also know its variance. If we are using samples of size n and taking into account the properties of the normal distribution (see Appendix), the sample mean, X̄, is a random variable N(μ, σ/√n); thus, the particular expression of Eq. (10) for this random variable is

pr{μ − z_{α/2} σ/√n ≤ X̄ ≤ μ + z_{α/2} σ/√n} = 1 − α   (11)

that is, 100(1 − α)% of the values of the sample mean are in the interval in Eq. (11). A simple algebraic manipulation (subtract μ and X̄, and multiply by −1) gives

[Fig. 3 shows a tree of the cases for confidence intervals under normal distribution(s) — one sample: on the mean μ0 (known or unknown variance) and on the standard deviation σ0; two independent samples: on the difference in means μ1 − μ2 (known variances; unknown variances, equal or unequal) and on the ratio of standard deviations σ1/σ2.]
Fig. 3 Diagram summarizing the different cases for computing confidence intervals.

pr{X̄ − z_{α/2} σ/√n ≤ μ ≤ X̄ + z_{α/2} σ/√n} = 1 − α   (12)

Therefore, according to Eq. (10), the confidence interval on the mean that is obtained from Eq. (12) is

[X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n]   (13)

Analogously, the confidence intervals at the 100(1 − α)% confidence level for the maximum and minimum values of the mean are computed from Eqs. (14), (15), respectively

pr{μ ≤ X̄ + z_α σ/√n} = 1 − α   (14)

pr{X̄ − z_α σ/√n ≤ μ} = 1 − α   (15)

and, thus, the corresponding one-sided intervals would be (−∞, X̄ + z_α σ/√n] and [X̄ − z_α σ/√n, +∞).
In an experimental context, when measuring n aliquot parts of a test sample, we obtain n values x1, x2, . . ., xn. Their sample mean x̄ is the particular value taken by the random variable X̄ and is also an estimate of the true value μ.
Example 1: Suppose that an analytical method follows a N(μ,4) and we have a sample of size 10 with values 98.87, 92.54, 99.42, 105.66, 98.70, 97.23, 98.44, 103.73, 94.45, and 101.08. With this sample, the mean is 99.01 and, using Eq. (13), the interval at the 95% confidence level is [99.01 − 1.96 × 4/√10, 99.01 + 1.96 × 4/√10] = [96.53, 101.49].
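Example 1 can be reproduced with a few lines of code. The article provides MATLAB live-scripts; here is an equivalent Python sketch with scipy.stats (variable names are ours):

```python
import numpy as np
from scipy import stats

x = np.array([98.87, 92.54, 99.42, 105.66, 98.70,
              97.23, 98.44, 103.73, 94.45, 101.08])
sigma = 4.0                              # known population standard deviation
alpha = 0.05
xbar = x.mean()                          # 99.01
z = stats.norm.ppf(1 - alpha / 2)        # z_{alpha/2} = 1.96
half = z * sigma / np.sqrt(len(x))       # half-length of the interval
print(round(xbar - half, 2), round(xbar + half, 2))   # -> 96.53 101.49
```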
For the interpretation of this interval, notice that with different samples of size 10 (same analytical method), different intervals
will be obtained at the same 95% confidence level. The endpoints of these intervals are nonrandom values, and the unknown mean
value, which is also a specific value, will or will not belong to the interval. Therefore, the affirmation “the interval contains the
mean” is a deterministic assertion that is true or false for each of the intervals. What one knows is that it is true for 100(1 − a)% of
those intervals. In our case, as 95% of the constructed intervals will contain the true value, we say, at 95% confidence level, that the
interval [96.53, 101.49] contains m.
This is the interpretation under the frequentist approach adopted in this article, which is to say that the information on random variables is obtained by means of samples of them and that the parameters to be estimated are not known but are fixed amounts (e.g., the amount of analyte in a sample, μ, is estimated by the measurement results obtained by analyzing it n times). With a Bayesian approach to the problem, a probability distribution is attributed to the amount of analyte μ and, once an interval of interest [a,b] is fixed, the "a priori" distribution of μ, the experimental results, and Bayes' theorem are used to calculate the posterior probability that μ belongs to the interval [a,b]. It has been shown that, although in most practical cases the uncertainty intervals obtained from repeated measurements using either theory may be similar, their interpretation is completely different. The works by Lira and Wöger12 and Zech13 are devoted to comparing both approaches from the point of view of the experimental data and their uncertainty. Also, an introduction to Bayesian methods for analyzing chemical data can be found in Armstrong and Hibbert.14,15

Case 2: Unknown variance


Suppose a normally distributed random variable with unknown variance that must be estimated, together with the mean, from n experimental data. The confidence interval is computed as in Case 1, but now the standardized sample mean, (X̄ − μ)/(S/√n), follows (see Appendix) a Student's t distribution with n − 1 degrees of freedom (d.f.); thus, the interval at the 100(1 − α)% confidence level is obtained from

pr{X̄ − t_{α/2,ν} s/√n ≤ μ ≤ X̄ + t_{α/2,ν} s/√n} = 1 − α   (16)

where t_{α/2,ν} is the upper percentage point (100α/2 %) of the Student's t distribution with ν = n − 1 d.f. and s is the sample standard deviation. Analogously, the one-sided intervals at the 100(1 − α)% confidence level come from

pr{μ ≤ X̄ + t_{α,ν} s/√n} = 1 − α   (17)

pr{X̄ − t_{α,ν} s/√n ≤ μ} = 1 − α   (18)
Example 2: Suppose that the probability distribution of an analytical method is normal, but its standard deviation is unknown. With the data of Example 1, the sample standard deviation, s, is computed as 3.90. As t_{0.025,9} = 2.262 (see Appendix), the confidence interval at the 95% level is [99.01 − 2.26 × 1.24, 99.01 + 2.26 × 1.24] = [96.21, 101.81]. The 95% confidence interval on the minimum of the mean (i.e., the 95% lower confidence bound) is made up, according to Eq. (18), of all the values greater than 96.74 = 99.01 − 1.83 × 1.24. The corresponding interval on the maximum (upper confidence bound for the mean), Eq. (17), is made up of the values less than 101.28 = 99.01 + 1.83 × 1.24.
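A Python/scipy sketch of the same computation (the small discrepancies with the text's 96.21 and 96.74 come from the text's intermediate rounding of t and s):

```python
import numpy as np
from scipy import stats

x = np.array([98.87, 92.54, 99.42, 105.66, 98.70,
              97.23, 98.44, 103.73, 94.45, 101.08])
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)     # 99.01 and 3.90
se = s / np.sqrt(n)                   # standard error of the mean
t2 = stats.t.ppf(0.975, df=n - 1)     # t_{0.025,9} = 2.262
print(round(xbar - t2 * se, 2), round(xbar + t2 * se, 2))  # -> 96.22 101.81
t1 = stats.t.ppf(0.95, df=n - 1)      # t_{0.05,9} = 1.833
print(round(xbar - t1 * se, 2))       # one-sided lower 95% bound -> 96.75
```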
The length of the confidence intervals from Eqs. (12)–(15) is a function of the sample size and tends towards zero when the sample size tends to infinity. This functional relation permits the computation of the sample size needed to obtain an interval of given length, d. It suffices to consider d/2 = z_{α/2} σ/√n and take as n the nearest integer greater than (2 z_{α/2} σ/d)². For example, if we want a 95% confidence interval with length d less than 2, under the hypothesis of Example 1, we will need a sample size greater than or equal to 62.
The same argument can be applied when the standard deviation is unknown. However, in this case, to compute n by (2 t_{α/2,ν} s/d)² it is necessary to have an initial estimate of σ, which, in general, is obtained in a pilot study of size n₀, in such a way that in the previous expression the d.f. are ν = n₀ − 1. An alternative is to define the desired length of the interval in standard deviation units (remember that the standard deviation is unknown). For instance, in Example 2, if we want d = 0.5σ, we will need a sample size greater than (4 z_{α/2})² = 61.5; note the substitution of t_{α/2,ν} by z_{α/2}, which is mandatory because we do not have the sample size needed to compute t_{α/2,ν}, which is precisely what we want to estimate.
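The known-σ sample-size rule can be wrapped in a small helper function (`n_for_ci_length` is our illustrative name, not from the article):

```python
import math
from scipy import stats

def n_for_ci_length(sigma, d, alpha=0.05):
    """Smallest n so a two-sided CI on the mean has length <= d (known sigma)."""
    z = stats.norm.ppf(1 - alpha / 2)
    return math.ceil((2 * z * sigma / d) ** 2)

print(n_for_ci_length(sigma=4, d=2))   # conditions of Example 1 -> 62
```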

Confidence Interval on the Variance of a Normal Distribution


In this case, the data come from a N(μ,σ) distribution with μ and σ unknown, and we have a sample with values x1, x2, . . ., xn. The distribution of the random variable "sample variance", S², is related to the chi-square distribution, χ² (see Appendix). As a consequence, the 100(1 − α)% confidence interval for the variance σ² is obtained from

pr{(n − 1)S²/χ²_{α/2,ν} ≤ σ² ≤ (n − 1)S²/χ²_{1−α/2,ν}} = 1 − α   (19)

where χ²_{α/2,ν} is the critical value of a χ² distribution with ν = n − 1 d.f. at significance level α/2. As in the previous case for the sample mean, we should distinguish between the random variable sample variance, S², and one of its values, s², computed with Eq. (6) from the sample x1, x2, . . ., xn.
The intervals for the maximum and minimum of the variance at the 100(1 − α)% confidence level are obtained from Eqs. (20), (21), respectively.

pr{σ² ≤ (n − 1)S²/χ²_{1−α,ν}} = 1 − α   (20)

pr{(n − 1)S²/χ²_{α,ν} ≤ σ²} = 1 − α   (21)
Example 3: Knowing that the n = 10 data of Example 2 come from a normal distribution with both mean and variance unknown, the 95% confidence interval on σ² is found from Eq. (19) as [7.21, 50.81] because s² = 15.25, χ²_{0.025,9} = 19.02, and χ²_{0.975,9} = 2.70.
If the analyst is interested in obtaining a confidence interval for the maximum variance, the 95% upper confidence interval is found from Eq. (20) as [0, 41.27] because χ²_{0.95,9} = 3.33; that is, the upper bound for the variance is 41.27 with 95% confidence. Notice the lower bound at 0. To obtain confidence intervals on the standard deviation, it suffices to take the square root of the aforementioned intervals because this operation is a monotonically increasing transformation; therefore, the intervals at the 95% confidence level on the standard deviation are [2.69, 7.13] and [0, 6.42], respectively.
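Eq. (19) for Example 3, sketched in Python with scipy.stats (our variable names):

```python
import numpy as np
from scipy import stats

x = np.array([98.87, 92.54, 99.42, 105.66, 98.70,
              97.23, 98.44, 103.73, 94.45, 101.08])
n = len(x)
s2 = x.var(ddof=1)                     # sample variance, 15.25
lo = (n - 1) * s2 / stats.chi2.ppf(0.975, df=n - 1)   # chi2_{0.025,9} upper tail
hi = (n - 1) * s2 / stats.chi2.ppf(0.025, df=n - 1)   # chi2_{0.975,9} upper tail
print(round(lo, 2), round(hi, 2))      # -> 7.21 50.81
```

Note that scipy's `ppf` takes lower-tail probabilities, whereas the text's χ² subscripts denote upper-tail critical values; the two calls above account for that.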
The sample size, n, needed so that s²/σ² is between 1 − k and 1 + k is given by the nearest integer greater than 1 + ½[z_{α/2}(√(1 + k) + 1)/k]². For example, for k = 0.5, such that the length of the confidence interval verifies 0.5 < s²/σ² < 1.5, we would need n = 40 data (at least). Just for comparative purposes, we will admit in the example that with the sample of size 40 we obtain the same variance s² = 15.25. As χ²_{0.025,39} = 58.12 and χ²_{0.975,39} = 23.65, the two-sided interval at the 95% confidence level is now [10.23, 25.15], which verifies the required specifications.

Confidence Interval on the Difference in Two Means


Case 1: Known variances
Consider two independent random variables, N1 and N2, distributed as N(μ1,σ1) and N(μ2,σ2) with unknown means and known variances σ1² and σ2². We wish to find a 100(1 − α)% confidence interval on the difference in means μ1 − μ2. With a random sample of n1 observations from the first distribution, x11, x12, . . ., x1n1, and n2 observations from the second one, x21, x22, . . ., x2n2, the 100(1 − α)% confidence interval on μ1 − μ2 is obtained from the equation

pr{(X̄1 − X̄2) − z_{α/2} √(σ1²/n1 + σ2²/n2) ≤ μ1 − μ2 ≤ (X̄1 − X̄2) + z_{α/2} √(σ1²/n1 + σ2²/n2)} = 1 − α   (22)

where X̄1 and X̄2 are the random variables of the sample means, which take the values x̄1 and x̄2. The reader can easily write the expressions analogous to Eqs. (14), (15) for the one-sided intervals.

Case 2: Unknown variances


The approach to this topic is similar to the previous case, but here even the variances σ1² and σ2² are unknown. However, it can be reasonable to assume that they are equal, σ1² = σ2² = σ², and that the differences observed in their estimates from the samples, s1² and s2², are not significant. The methodology to decide whether this can be assumed, or not, is explained later, in Section "Hypothesis Tests".
An estimate of the common variance σ² is given by the pooled sample variance in Eq. (23), which is an arithmetic average of both variances weighted by the corresponding d.f.,

s_p² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2)   (23)

The 100(1 − α)% confidence interval is obtained from the following equation:

pr{X̄1 − X̄2 − t_{α/2,ν} s_p √(1/n1 + 1/n2) ≤ μ1 − μ2 ≤ X̄1 − X̄2 + t_{α/2,ν} s_p √(1/n1 + 1/n2)} = 1 − α   (24)

where ν = n1 + n2 − 2 are the d.f. of the Student's t distribution. The one-sided intervals at the 100(1 − α)% confidence level have the analogous expressions deduced from Eq. (24) by substituting t_{α/2,ν} by t_{α,ν}. If a fixed length is desired for the confidence interval, the computation explained in Section "Confidence Interval on the Mean of a Normal Distribution" can be immediately adapted to obtain the needed sample size.

where n ¼ n1 + n2 − 2 are the d.f. of the Student’s t distribution. The one-sided intervals at 100(1 − a)% confidence level have the
analogous expressions deduced from Eq. (24) by substituting ta/2,n for ta,n. If a fixed length is desired for the confidence interval, the
computation explained in Section “Confidence Interval on the Mean of a Normal Distribution” can be immediately adapted to
obtain the needed sample size.
Example 4: We want to study the stability of a substance after being stored for a month. Here stability means that the content of the substance remains unchanged. Two series of measurements (n1 = n2 = 8) were carried out before and after the storage period, and we will estimate the difference in means by a 95% confidence interval. The results were x̄1 = 90.8, s1² = 3.89 and x̄2 = 92.7, s2² = 4.02, respectively. Therefore, the two-sided interval when assuming equal variances (s_p² = 3.96, Eq. (23)) is (90.8 − 92.7) ± 2.1448 × √3.96 × √(1/8 + 1/8), that is, [−4.03, 0.23]. Therefore, at the 95% confidence level, the difference of the means belongs to this interval, which includes the null difference; that is, the substance is stable.
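Eqs. (23)–(24) applied to Example 4, sketched in Python with scipy.stats (our variable names):

```python
import numpy as np
from scipy import stats

n1 = n2 = 8
xbar1, s2_1 = 90.8, 3.89      # before storage
xbar2, s2_2 = 92.7, 4.02      # after storage
# Eq. (23): pooled variance, d.f.-weighted average of the two sample variances
sp2 = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / (n1 + n2 - 2)
t = stats.t.ppf(0.975, df=n1 + n2 - 2)                 # t_{0.025,14} = 2.1448
half = t * np.sqrt(sp2) * np.sqrt(1 / n1 + 1 / n2)
d = xbar1 - xbar2
print(round(d - half, 2), round(d + half, 2))          # -> -4.03 0.23
```

Since the interval contains 0, the null difference is plausible at the 95% level.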
When the assumption σ1² = σ2² is not reasonable, we can still obtain an interval on the difference μ1 − μ2 by using the fact that the statistic [X̄1 − X̄2 − (μ1 − μ2)]/√(S1²/n1 + S2²/n2) is distributed approximately as a t with d.f. given by

ν = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)]   (25)

The 100(1 − α)% confidence interval is obtained from the following equation:

pr{X̄1 − X̄2 − t_{α/2,ν} √(s1²/n1 + s2²/n2) ≤ μ1 − μ2 ≤ X̄1 − X̄2 + t_{α/2,ν} √(s1²/n1 + s2²/n2)} = 1 − α   (26)

Example 5: We want to compute a confidence interval on the difference of two means with unknown and unequal variances, with results that come from an experiment carried out with four aliquot samples by two different analysts. The first analyst obtains x̄1 = 3.285, and the second x̄2 = 3.257. The variances were s1² = 3.33 × 10⁻⁵ and s2² = 9.17 × 10⁻⁵, respectively. Assuming that σ1² ≠ σ2², Eq. (25) gives ν = 4.9, so the d.f. to apply Eq. (26) are 5 and t_{0.025,5} = 2.571. Thus, the 95% confidence interval is (3.285 − 3.257) ± 2.571 × √(3.33 × 10⁻⁵/4 + 9.17 × 10⁻⁵/4), that is, [0.014, 0.042]. So, at 95% confidence, the two analysts provide unequal measurements because zero is not in the interval.
The confidence intervals for the maximum and the minimum are obtained by considering the last or the first term, respectively, in Eq. (26) and replacing t_{α/2,ν} by t_{α,ν}.
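The Welch-type interval of Eqs. (25)–(26) for Example 5, sketched in Python (our variable names; like the text, we round ν to the nearest integer before taking the t critical value):

```python
import numpy as np
from scipy import stats

n1 = n2 = 4
xbar1, xbar2 = 3.285, 3.257
s2_1, s2_2 = 3.33e-5, 9.17e-5
v1, v2 = s2_1 / n1, s2_2 / n2
# Eq. (25): approximate degrees of freedom
nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
t = stats.t.ppf(0.975, df=round(nu))          # text rounds nu = 4.9 to 5
half = t * np.sqrt(v1 + v2)
d = xbar1 - xbar2
print(round(nu, 1), round(d - half, 3), round(d + half, 3))  # -> 4.9 0.014 0.042
```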

Case 3: Confidence interval for paired samples


Sometimes we are interested in evaluating an effect (e.g., the reduction of a polluting agent in an industrial spill by means of a catalyst), but it is impossible to have two homogeneous populations of samples, without and with treatment, from which to obtain the two means of the recoveries, because the amount of polluting agent may change, for example, over time. In these cases, the solution is to determine the polluting agent before and after applying the procedure to the same spill. The difference between both determinations is a measure of the effect of the catalyst. The (statistical) samples obtained in this way are known as paired samples. Formally, with the two paired samples of size n, x11, x12, . . ., x1n and x21, x22, . . ., x2n, we compute the differences between each pair of data, di = x1i − x2i, i = 1, 2, . . ., n. If these differences follow a normal distribution, the 100(1 − α)% confidence interval is obtained from

pr{d̄ − t_{α/2,ν} s_d/√n ≤ μ_d ≤ d̄ + t_{α/2,ν} s_d/√n} = 1 − α   (27)

where d̄ and s_d are the mean and standard deviation of the differences di, μ_d = μ1 − μ2 is the mean difference, and ν = n − 1 are the d.f. of the t distribution.
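Eq. (27) in Python, with made-up illustrative data (the article gives no numerical paired example here; the pollutant values below are ours):

```python
import numpy as np
from scipy import stats

# Illustrative (made-up) paired data: pollutant level before/after treatment
before = np.array([12.1, 10.8, 13.5, 11.2, 12.9, 10.4])
after  = np.array([11.4, 10.1, 12.6, 10.9, 12.0, 10.0])
d = before - after                      # differences d_i = x1i - x2i
n = len(d)
t = stats.t.ppf(0.975, df=n - 1)
half = t * d.std(ddof=1) / np.sqrt(n)
print(round(d.mean() - half, 2), round(d.mean() + half, 2))
```

With these values the interval lies entirely above zero, so the treatment effect would be judged significant at the 95% level.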

Confidence Interval on the Ratio of Variances of Two Normal Distributions


This section approaches the question of giving a confidence interval on the ratio σ1²/σ2² of the variances of two distributions N1 ≡ N(μ1,σ1) and N2 ≡ N(μ2,σ2) with unknown means and variances. Let x11, x12, . . ., x1n1 be a random sample of n1 observations from N1 and x21, x22, . . ., x2n2 be a random sample of n2 observations from N2. The sample variances obtained with these two samples, s1² and s2², are the particular values of the random variables S1² and S2², and the 100(1 − α)% confidence interval on the ratio of variances is computed from the following equation:

pr{F_{1−α/2,ν1,ν2} S1²/S2² ≤ σ1²/σ2² ≤ F_{α/2,ν1,ν2} S1²/S2²} = 1 − α   (28)

where F_{1−α/2,ν1,ν2} and F_{α/2,ν1,ν2} are the critical values (upper tail) of an F distribution with ν1 = n2 − 1 d.f. in the numerator and ν2 = n1 − 1 d.f. in the denominator. The Appendix contains a description of some relevant properties of the F distribution.
We can also compute one-sided confidence intervals. The 100(1 − α)% upper or lower confidence bound on σ1²/σ2² is obtained from Eqs. (29), (30), respectively. Remember that, when computing the intervals by using Eq. (29), the lower bound is always 0.

pr{σ1²/σ2² ≤ F_{α,ν1,ν2} S1²/S2²} = 1 − α   (29)

pr{F_{1−α,ν1,ν2} S1²/S2² ≤ σ1²/σ2²} = 1 − α   (30)
Example 6: In this example, we compute a two-sided 95% confidence interval for the ratio of the variances in Example 4 (n1 = n2 = 8, s1² = 3.89, s2² = 4.02). The resulting interval is [0.20 × (3.89/4.02), 4.99 × (3.89/4.02)] = [0.19, 4.83]. As 1 belongs to this interval, we can admit that both variances are equal.
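Eq. (28) for Example 6, sketched in Python with scipy.stats (our variable names; recall the text's d.f. convention ν1 = n2 − 1, ν2 = n1 − 1):

```python
from scipy import stats

n1 = n2 = 8
s2_1, s2_2 = 3.89, 4.02
ratio = s2_1 / s2_2
f_lo = stats.f.ppf(0.025, dfn=n2 - 1, dfd=n1 - 1)   # F_{1-alpha/2,7,7} = 0.20
f_hi = stats.f.ppf(0.975, dfn=n2 - 1, dfd=n1 - 1)   # F_{alpha/2,7,7}   = 4.99
print(round(f_lo * ratio, 2), round(f_hi * ratio, 2))   # -> 0.19 4.83
```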

Confidence Interval on the Median


This case is different from the previous ones because the confidence interval is a "distribution-free" interval; that is, no distribution is assumed for the data. As is known, a percentile (pct) is the value x_pct such that 100 pct% of the values are less than or equal to x_pct. It is possible to compute confidence intervals on any pct, but for values of pct near one or zero we need very large sample sizes, n, because the values n × pct and n × (1 − pct) must be greater than 5. For the median (pct = 0.5), it suffices to consider samples of size 10 or more.
The fundamentals of these confidence intervals are based on the binomial distribution, whose details are outside the scope of this article and can be found in Sprent.16 We use the data of Example 1 to show step by step how a 100(1 − α)% confidence interval on the median is computed (the guided example is for α = 0.05 with z_{α/2} = 1.96). The procedure consists of three steps:

1. To sort the data in ascending order. In our case, 92.54, 94.45, 97.23, 98.44, 98.70, 98.87, 99.42, 101.08, 103.73, and 105.66. The
rank of each datum is the position that it occupies in the sorted list, for example, the rank of 98.44 is four.
2. To calculate the rank, rl, of the value that will be the lower endpoint of the interval. It is the nearest integer less than ½(n − z_{α/2}√n + 1). In our case, this value is 0.5 × (10 − 1.96√10 + 1) = 2.40, thus rl = 2.
3. To calculate the rank, ru, of the value that will be the upper endpoint of the interval, which is the nearest integer greater than ½(n + z_{α/2}√n − 1). In our case, this value is 0.5 × (10 + 1.96√10 − 1) = 7.60, then ru = 8.
Hence, the 95% confidence interval on the median is made by the values that are between the values in position 2 and 8, that is,
[94.45, 101.08].
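The three steps above can be sketched in Python (our variable names; `floor`/`ceil` implement "nearest integer below/above"):

```python
import math
import numpy as np
from scipy import stats

x = np.array([98.87, 92.54, 99.42, 105.66, 98.70,
              97.23, 98.44, 103.73, 94.45, 101.08])
n = len(x)
z = stats.norm.ppf(0.975)                           # 1.96
xs = np.sort(x)                                     # step 1: sort ascending
rl = math.floor(0.5 * (n - z * math.sqrt(n) + 1))   # step 2: below 2.40 -> 2
ru = math.ceil(0.5 * (n + z * math.sqrt(n) - 1))    # step 3: above 7.60 -> 8
print(xs[rl - 1], xs[ru - 1])                       # -> 94.45 101.08
```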

Joint Confidence Intervals


Sometimes it is necessary to compute confidence intervals for several parameters while maintaining a 100(1 − α)% confidence that all of them contain the true value of the corresponding parameter. For example, for two statistically independent parameters, we can assure a 100(1 − α)% joint confidence level by taking separately the corresponding 100(1 − α)^{1/2}% confidence intervals, because (1 − α)^{1/2} × (1 − α)^{1/2} = (1 − α). In general, if there are k parameters, we will compute the 100(1 − α)^{1/k}% confidence interval for each of them.
However, if the sample statistics used are not independent of one another, the above computation is not valid. The Bonferroni inequality states that the probability that all the affirmations are true at the 100(1 − α)% confidence level is greater than or equal to 1 − Σ_{i=1}^{k} α_i, where 1 − α_i is the confidence level of the i-th interval (usually α_i = α/k). For example, if a joint 90% confidence interval is needed for the means of two distributions, according to the Bonferroni inequality α_i = α/2 = 0.10/2 = 0.05; thus, each individual interval should be the corresponding 95% confidence interval.

Tolerance Intervals
In the introduction to the present Section "Confidence and Tolerance Intervals", the tolerance intervals of a normal distribution have been calculated knowing its mean and variance. Remember that the tolerance interval [l, u] contains 100(1 − α)% of the values of the distribution of X or, equivalently, pr{X ∉ [l, u]} = α. Actually, the values of the parameters that define the probability distribution are unknown, and this uncertainty should be transferred into the endpoints of the interval. There are several types of tolerance regions, but in this article we restrict ourselves to two common cases.

Case 1: β-content tolerance interval

Given a random variable X, an interval [l, u] is a β-content tolerance interval at the γ confidence level if the following holds:

pr{pr{X ∈ [l, u]} ≥ β} ≥ γ   (31)

Expressed in words, [l, u] contains at least 100β% of the values of X with γ confidence level. For the case of an analytical method, this is to say that we have to determine, based on a sample of size n, for instance, the interval that will contain 95% (β = 0.95) of the results, and this assertion must be true 90% of the times (γ = 0.90). Evidently, β-content tolerance intervals can be one-sided, which means that the procedure will provide 95% of its results above l (respectively, below u) 90% of the times. We leave to the reader the corresponding formal definitions.
One-sided and two-sided β-content tolerance intervals can be computed either by controlling the center or by controlling the tails, and for both continuous and discrete random variables (a review can be seen in Patel17 and applications in Analytical Chemistry in Meléndez et al.18 and Reguera et al.19).
Here we will only describe the case of a normally distributed X with unknown mean and variance. From this distribution, we have a sample of size n that is used to compute the mean x̄ and standard deviation s. We want to obtain a two-sided β-content tolerance interval controlling the center, that is, an interval such that

pr{pr{X ∈ [x̄ − ks, x̄ + ks]} ≥ β} ≥ γ   (32)

To determine k, several approximations have been reported; consult Patel17 for a discussion of them. The approach by Wald and Wolfowitz20 is based on determining k1 such that

pr{N(0,1) ≤ 1/√n + k1} − pr{N(0,1) ≤ 1/√n − k1} = β   (33)

Therefore

k = k1 √[(n − 1)/χ²_{γ,n−1}]   (34)

where χ²_{γ,n−1} is the point exceeded with probability γ when using the χ² distribution with n − 1 d.f.
Example 7: With the data in Example 1, and β = γ = 0.95, we have x̄ = 99.01, s = 3.91, k1 = 2.054, and χ²_{0.95,9} = 3.33; thus, according to Eq. (34), k = 3.379 and, as a consequence, the interval [99.01 − 3.38 × 3.91, 99.01 + 3.38 × 3.91] = [85.79, 112.23] contains 95% of the results of the method 95% of the times that the procedure is repeated with a sample of size 10.
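Eqs. (33)–(34) for Example 7, sketched in Python; solving Eq. (33) by numerical root-finding is our route to the tabled Wald–Wolfowitz value k1 (variable names are ours):

```python
import numpy as np
from scipy import stats, optimize

x = np.array([98.87, 92.54, 99.42, 105.66, 98.70,
              97.23, 98.44, 103.73, 94.45, 101.08])
n, beta, gamma = len(x), 0.95, 0.95
xbar, s = x.mean(), x.std(ddof=1)

# Eq. (33): find k1 such that the central band captures beta of the N(0,1) mass
g = lambda k1: (stats.norm.cdf(1 / np.sqrt(n) + k1)
                - stats.norm.cdf(1 / np.sqrt(n) - k1) - beta)
k1 = optimize.brentq(g, 0.5, 6.0)              # approx. 2.054
# Eq. (34): chi-square point exceeded with probability gamma (lower quantile)
chi2 = stats.chi2.ppf(1 - gamma, df=n - 1)     # approx. 3.33
k = k1 * np.sqrt((n - 1) / chi2)               # approx. 3.38
print(round(xbar - k * s, 2), round(xbar + k * s, 2))  # close to [85.79, 112.23]
```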

Case 2: β-expectation tolerance interval

The interval [l, u] is called a β-expectation tolerance interval if

E(pr{X ∈ [l, u]}) = β   (35)

Unlike the β-content tolerance interval, the condition in Eq. (35) only demands that, on average, the probability that the random variable takes values between l and u is β.
As in the previous case, we limit ourselves to intervals of the form [x̄ − ks, x̄ + ks]. When the distribution of the random variable is normal and we have a sample of size n, the solution was obtained for the first time by Wilks21 and is

k = t_{(1−β)/2,ν} √[(n + 1)/n]   (36)

where t_{(1−β)/2,ν} is the upper (1 − β)/2 point of the t distribution with ν = n − 1 d.f.
Example 7 (continuation): With the same data, the 95% expectation tolerance interval would be [99.01 − 2.37 × 3.91, 99.01 + 2.37 × 3.91] = [89.74, 108.28], as now k is directly computed with the critical value t_{0.025,9} = 2.262.
This interval is shorter than the β-content tolerance interval because it only assures the expected value (the mean) of the probabilities that the individual values belong to the interval. In fact, the interval [89.74, 108.28] contains 95% of the values of X only 64% of the times, a conclusion drawn by applying Eq. (32) with k = 2.37. Also, note that when the sample size tends to infinity, the value of k in Eq. (36) tends towards z_{(1−β)/2}, which yields the theoretical interval that, in our example, would be [91.35, 106.67], obtained by substituting k by z_{0.025} = 1.96.
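Eq. (36) for the continuation of Example 7, sketched in Python (our variable names; the text's [89.74, 108.28] uses the rounded values k = 2.37 and s = 3.91):

```python
import numpy as np
from scipy import stats

x = np.array([98.87, 92.54, 99.42, 105.66, 98.70,
              97.23, 98.44, 103.73, 94.45, 101.08])
n, beta = len(x), 0.95
xbar, s = x.mean(), x.std(ddof=1)
# Eq. (36): Wilks' factor for a beta-expectation tolerance interval
k = stats.t.ppf(1 - (1 - beta) / 2, df=n - 1) * np.sqrt((n + 1) / n)  # 2.37
print(round(xbar - k * s, 2), round(xbar + k * s, 2))   # -> 89.75 108.28
```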

Case 3: Distribution-free intervals

It is also possible to obtain tolerance intervals independent of the distribution (provided it is continuous) of the variable X. These intervals are based on the ranks of the observations, but they demand very large sample sizes, which makes them quite useless in practice. For example, for the β-content tolerance interval [l, u] to be [x(1), x(n)] (i.e., for the endpoints to be the smallest and the greatest values in the sample), it is necessary that n approximately fulfills the equation log(n) + (n − 1) log(β) = log(1 − γ) − log(1 − β).22 If we need, as in Example 7, β = γ = 0.95, the value of n has to be 89. Nevertheless, Willinks23 used the Monte Carlo method to compute the shorter "distribution-free" uncertainty intervals proposed in Draft Supplement2, but it still requires sample sizes that are rather large in the scope of chemical analysis. A complete theoretical development on tolerance intervals (including their estimation by means of Bayesian methods) is in the book by Guttman.24
Tolerance intervals are of interest to show that a method is fit for purpose because, when establishing that the interval [x̄ − ks, x̄ + ks] will contain, on average, 100β% of the values provided by the method (or 100β% of the values with γ confidence level), we are including both precision and trueness. To assess that the method is "fit for purpose", it suffices that the tolerance interval [x̄ − ks, x̄ + ks] is included in the specifications that the method should fulfill. Note that a method with high precision (small value of s) but with a significant bias can still fulfill the specifications in the sense that a high proportion of its values are within them.
In addition, in the estimation of s, the repeatability, the intermediate precision, or the reproducibility can be used so as to cover the scope of application of the method. The use of a tolerance interval thus solves the problem of introducing the bias as a component of the uncertainty.
With the aim of developing fit-for-purpose analytical methods, the Société Française des Sciences et Techniques Pharmaceutiques (SFSTP) proposed25–28 the use of β-expectation tolerance intervals in the validation of quantitative methods. In four case studies, it has shown the validity of β-expectation tolerance intervals as an adequate way to reconcile the objectives of the analytical method in routine analysis with those of the validation step, and it proposes them29 as a criterion to select the calibration curve. Also, it has analyzed30 their adequacy to the guides that establish the performance criteria that should be validated and their usefulness31 in the problem of the transfer of an analytical method. González and Herrador32 have proposed their computation for the estimation of the uncertainty of an analytical assay. In all these cases, β-expectation tolerance intervals based on the normality of the data are used, that is, using Eq. (36). To avoid dependence on the underlying distribution and the use of the classic distribution-free methods, Rebafka et al.33 proposed the use of a bootstrap technique to calculate β-expectation tolerance intervals, whereas Fernholz and Gillespie34 studied the estimation of β-content tolerance intervals by using the bootstrap.
To summarize this whole section about tolerance and confidence intervals, it is worth pointing out some comparative aspects, because there is a tendency to confuse two concepts that have nothing in common but the word "interval". The difference between them is clear: the confidence interval is the set that is supposed to contain (with 100(1 − α)% confidence) the true value of the unknown parameter; the tolerance interval is the set that contains a given proportion β of the values taken by the random variable, with a given confidence γ.
In particular, confidence intervals must be used in the process of evaluating the trueness and precision of a method when there is no need to fulfill external requirements but just to compare with other methods or to quantify the uncertainty and bias of the results obtained with it.
A usual error is to mistakenly take a confidence interval for a tolerance interval, when the difference between them is important. For instance, with the data of Example 7, notice that to compute the confidence interval, the standard deviation of the mean is estimated as s/√n = 1.24, whereas the standard deviation of the individual results of the method is estimated as s = 3.91, which is very different.
Also, it is important to remember that when the sample size n tends to infinity, the length of a confidence interval tends towards zero, independently of the chosen confidence level. For example, with the confidence intervals for the mean, in the limit we will have x̄ = μ; thus, the estimator and the true parameter will be equal for sure (1 − α = 1). On the contrary, the length of a β-content tolerance interval does not tend towards zero when increasing the sample size, but towards the length of the interval that contains for sure (γ = 1) 100β% of the values.
There are other aspects of the determination of the uncertainty that are of practical interest, for example, the problem that arises from the fact that any uncertainty interval, particularly an expanded uncertainty interval, should be restricted to the range of feasible values of the measurand. Cowen and Ellison35 analyzed how to modify the interval when the data are close to a natural limit of the feasible range, such as 0 or 100% mass or mole fraction.

Hypothesis Tests

This section is devoted to the introduction of a statistical methodology to decide whether an affirmation is false, for example, the affirmation "this method of analysis applied to this reference sample provides the certified value". If, on the basis of the experimental results, it is decided that it is false, we will conclude that the method has bias. The affirmation is customarily called a hypothesis and the procedure of decision making is called hypothesis testing. A statistical hypothesis is an assertion about the probability distribution that a random variable follows. Sometimes one has to decide on a parameter, for example, whether the mean of a normal distribution is a specific value. On other occasions it may be required to decide on other characteristics of the distribution, for example, whether the experimental data are compatible with the hypothesis that they come from a normal or uniform distribution.

Elements of a Hypothesis Test


As the results obtained with analytical methods are modeled by a probability distribution, it is evident that both the validation of a
method and its routine use involve making decisions that are naturally formulated as problems of hypothesis testing. In order to
describe the elements of a hypothesis test, we will use a concrete case. Like in the case of intervals, all the examples can be followed
with live-script in the supplementary material entitled Tests_section1023_live.mlx.
Example 8: For an experimental procedure, we need solutions with pH values less than 2. The preparation of these solutions provides pH values that follow a normal distribution with σ = 0.55. The pH values obtained from 10 measurements were 2.09, 1.53, 1.70, 1.65, 2.00, 1.68, 1.52, 1.71, 1.62, and 1.58. The question to be answered is whether the pH of the resulting solution is adequate to proceed with the experiment.
We express this formally as

H0: μ = 2.00 (inadequate solution)
H1: μ < 2.00 (valid solution)     (37)

The statement "μ = 2.00" in Eq. (37) is called the null hypothesis, denoted H0, and the statement "μ < 2.00" is called the alternative hypothesis, H1. As the alternative hypothesis specifies values of μ that are less than 2.00, it is called a one-sided alternative. In some situations, we may wish to formulate a two-sided alternative hypothesis to specify values of μ that could be either greater or less than 2.00, as in

H0: μ = 2.00
H1: μ ≠ 2.00     (38)

The hypotheses are not affirmations about the sample but about the distribution from which those values come; that is to say, μ is the unknown value of the pH of the solution, which will be the same as the value provided by the procedure if the bias is zero (see the model of Eq. (1)). In general, to test a hypothesis, the analyst must consider the experimental goal and define, accordingly, the null
hypothesis for the test, as in Eq. (37). Hypothesis-testing procedures rely on using the information in a random sample; if this
information is inconsistent with the null hypothesis, we would conclude that the hypothesis is false. If there is not enough evidence
to prove falseness, the test defaults to the decision of not rejecting the null hypothesis though this does not actually prove that it is
correct. It is therefore critical to choose carefully the null hypothesis in each problem.

Table 3 Decisions in hypothesis testing.

Researcher's decision   H0 is true     H0 is false
Accept H0               No error       Type II error
Reject H0               Type I error   No error

Table 4 Some parametric hypothesis tests.

     Null hypothesis   Alternative hypothesis   Statistic                                    Critical region
1    μ = μ0            μ ≠ μ0                   Zcalc = (x̄ − μ0)/(σ/√n)                      {Zcalc < −z_α/2} ∪ {Zcalc > z_α/2}
2    μ = μ0            μ < μ0                                                                {Zcalc < −z_α}
3    μ = μ0            μ > μ0                                                                {Zcalc > z_α}
4    μ = μ0            μ ≠ μ0                   tcalc = (x̄ − μ0)/(s/√n)                      {tcalc < −t_α/2,n−1} ∪ {tcalc > t_α/2,n−1}
5    μ = μ0            μ < μ0                                                                {tcalc < −t_α,n−1}
6    μ = μ0            μ > μ0                                                                {tcalc > t_α,n−1}
7    μ1 = μ2           μ1 ≠ μ2                  Zcalc = (x̄1 − x̄2)/√(σ1²/n1 + σ2²/n2)        {Zcalc < −z_α/2} ∪ {Zcalc > z_α/2}
8    μ1 = μ2           μ1 > μ2                                                               {Zcalc > z_α}
9    μ1 = μ2           μ1 ≠ μ2                  tcalc = (x̄1 − x̄2)/(sp·√(1/n1 + 1/n2))       {tcalc < −t_α/2,n1+n2−2} ∪ {tcalc > t_α/2,n1+n2−2}
10   μ1 = μ2           μ1 > μ2                                                               {tcalc > t_α,n1+n2−2}
11   μd = 0            μd ≠ 0                   tcalc = d̄/(sd/√n)                            {tcalc < −t_α/2,n−1} ∪ {tcalc > t_α/2,n−1}
12   μd = 0            μd > 0                                                                {tcalc > t_α,n−1}
13   σ² = σ0²          σ² ≠ σ0²                 χ²calc = (n − 1)s²/σ0²                       {χ²calc < χ²_1−α/2,n−1} ∪ {χ²calc > χ²_α/2,n−1}
14   σ² = σ0²          σ² > σ0²                                                              {χ²calc > χ²_α,n−1}
15   σ1² = σ2²         σ1² ≠ σ2²                Fcalc = s1²/s2²                              {Fcalc < F_1−α/2,n1−1,n2−1} ∪ {Fcalc > F_α/2,n1−1,n2−1}
16   σ1² = σ2²         σ1² > σ2²                                                             {Fcalc > F_α,n1−1,n2−1}

The values z_α are the percentiles of a standard normal distribution such that α = pr{N(0,1) > z_α}. The values t_α,ν are the percentiles of a Student's t distribution with ν degrees of freedom such that α = pr{t > t_α,ν}. The values χ²_α,ν are the percentiles of a χ² distribution with ν degrees of freedom such that α = pr{χ² > χ²_α,ν}. The values F_α,ν1,ν2 are the percentiles of an F distribution with ν1 degrees of freedom for the numerator and ν2 degrees of freedom for the denominator, such that α = pr{F > F_α,ν1,ν2}. sp² is the pooled variance defined in Eq. (23). d̄ is the mean of the differences di = x1i − x2i between the paired samples; sd is their standard deviation.

In practice, to test a hypothesis, we must take a random sample, compute an appropriate test statistic from the sample data, and then use the information contained in this statistic to make a decision. However, as the decision is based on a random sample, it is subject to error. Two kinds of potential errors may be made when testing hypotheses. If the null hypothesis is rejected when it is true, then a type I error has been made. A type II error occurs when the researcher accepts the null hypothesis when it is false. The situation is described in Table 3.
In Example 8, if the experimental data lead to rejection of the null hypothesis when H0 is actually true, our (wrong) conclusion is that the pH of the solution is less than 2. A type I error has been made and the analyst will use the solution in the procedure when in fact it is not chemically valid. If, on the contrary, the experimental data lead to acceptance of the null hypothesis when it is false, the analyst will not use the solution when in fact the pH is less than 2, and a type II error has been made. Note that both types of error have to be considered because their consequences are very different. In the case of a type I error, an unsuitable solution is accepted, the procedure will be inadequate, and the analytical result will be wrong, with the subsequent damages that it may cause (e.g., the loss of a client, or a mistaken environmental diagnosis). On the contrary, a type II error implies that a valid solution is not used, with the corresponding extra cost of the analysis. It is clear that the analyst has to specify the assumable risk of making these errors, and this is done in terms of the probability that they will occur.
The probabilities of occurrence of type I and type II errors are denoted by specific symbols, defined in Eq. (39). The probability α of the test is called the significance level, and the power of the test is 1 − β, which measures the probability of correctly rejecting the null hypothesis.

α = pr{type I error} = pr{reject H0 | H0 is true}
β = pr{type II error} = pr{accept H0 | H0 is false}     (39)

In Eq. (39), the symbol "|" indicates that the probability is calculated under that condition. In the example we are following, α will be calculated with the normal distribution of mean 2 and standard deviation 0.55.
Statistically expressed, with the n = 10 results in Example 8 (sample mean x̄ = 1.708), one wants to decide about the value of the mean of a normal distribution with known variance and a one-sided alternative hypothesis (a one-tail test).

With these premises, the related statistic is written in Table 4 (second row) and gives Zcalc = (x̄ − μ0)/(σ/√n) = (1.708 − 2.0)/(0.55/√10) = −1.679.
In addition, the analyst must assume the risk α, say 0.05. This means that the decision rule that is going to be applied to the experimental results will accept an inadequate (chemical) solution 5% of the times. Therefore, the critical or rejection region is written in Table 4, second row, as CR = {Zcalc < −1.645}, meaning that the null hypothesis will be rejected for the samples of size 10 that provide values of the statistic less than −1.645. In the example, the actual value Zcalc = −1.679 belongs to the critical region; thus, the decision is to reject the null hypothesis at the 5% significance level.
Given the present facilities of computation, instead of the CR, the available statistical software calculates the so-called P-value, which is the probability of obtaining the current value of the statistic under the null hypothesis H0. In our case, P-value = pr{Z ≤ −1.679} = 0.0466. When the P-value is less than the significance level α, the null hypothesis is rejected, because this is the same as saying that the value of the statistic belongs to the critical region.
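The whole decision for Example 8 can be reproduced with a few lines of code; here is a minimal Python sketch (the article's own worked examples use the MATLAB live-script Tests_section1023_live.mlx) that relies only on the standard library's `statistics.NormalDist` for the normal percentiles.

```python
from math import sqrt
from statistics import NormalDist

# Example 8: H0: mu = 2.00 vs H1: mu < 2.00, known sigma = 0.55
x = [2.09, 1.53, 1.70, 1.65, 2.00, 1.68, 1.52, 1.71, 1.62, 1.58]
sigma, mu0, alpha = 0.55, 2.00, 0.05
n, xbar = len(x), sum(x) / len(x)

z_calc = (xbar - mu0) / (sigma / sqrt(n))   # statistic of row 2, Table 4
z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper 5% point, 1.645
p_value = NormalDist().cdf(z_calc)          # lower-tail P-value

print(round(z_calc, 3))    # -1.679
print(round(p_value, 4))   # 0.0466
print(z_calc < -z_alpha)   # True: reject H0 at the 5% level
```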
The next question that immediately arises is about the power of the applied decision rule (statistic and critical region).
To calculate β, defined in Eq. (39), it is necessary to specify exactly the meaning of the alternative hypothesis; in our case, what is meant by a pH smaller than 2. From a mathematical point of view, the answer is clear: any number less than 2, for example, 1.9999, which clearly does not make sense from the point of view of the analyst. In this context, sometimes due to previous knowledge, in other cases because of regulatory stipulations or simply by the detail of the standardized work procedure, the analyst can decide the value of pH that is considered to be less than 2.00, for example, a pH less than 1.60. This is the same as assuming that "pH equal to 2" is any smaller value whose distance to 2 is less than 0.40. In these conditions,

β = pr{N(0,1) < z_α − (|δ|/σ)√n}     (40)

where |δ| = 0.40 in our problem; when replacing it in Eq. (40), we have β = 0.26 (calculations can be seen in Example A9 of the Appendix). That is to say, whatever the decision made, the decision rule leads to throwing a valid chemical solution away 26% of the times. Evidently, this is an inadequate rule.
A simple examination of Eq. (40) explains the situation. To decrease β, we should decrease the value z_α − (|δ|/σ)√n. This may be done by decreasing z_α (i.e., increasing the significance level α) or by increasing (|δ|/σ)√n. As both the procedure precision, σ, and the difference of pH that we wish to detect are fixed, the only possibility left is to increase the sample size n. Solving Eq. (40) for n, we have

n ≅ (z_α + z_β)² / (|δ|/σ)²     (41)

The values of β and α for sample sizes of 10, 15, 20, and 25, maintaining δ and σ fixed, are drawn in Fig. 4. As can be seen, α and β exhibit opposite behavior and, unless the sample size is increased, it is not possible to simultaneously decrease the probability of both errors. In our case, Eq. (41) gives n = 20.5 for α = β = 0.05, thus n = 21 because the sample size must be an integer. The dotted lines in Fig. 4 intersect at values of β of 0.263, 0.126, 0.058, and 0.025 when increasing the sample size while maintaining the significance level α = 0.05. Again, we see that for a given α, the risk β decreases with the increase in n.
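Eqs. (40) and (41) are easy to explore numerically. The following is a hedged Python sketch using only the standard library (the exact figures quoted in the text come from the Appendix's worked MATLAB examples):

```python
from math import ceil, sqrt
from statistics import NormalDist

nd = NormalDist()
z = lambda p: nd.inv_cdf(1 - p)          # upper percentile z_p
delta, sigma, alpha = 0.40, 0.55, 0.05   # values of Example 8

# Eq. (40): beta of the one-tail z-test as a function of the sample size n
beta = lambda n: nd.cdf(z(alpha) - abs(delta) / sigma * sqrt(n))
print(round(beta(10), 2))                # 0.26, as in Example A9

# Eq. (41): sample size needed for alpha = beta = 0.05
n_req = (z(0.05) + z(0.05)) ** 2 / (abs(delta) / sigma) ** 2
print(round(n_req, 1), ceil(n_req))      # 20.5, so n = 21
```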
Eq. (40) also allows the analyst to decide the standard deviation (precision) necessary to obtain a decision rule according to the risks α and β that the analyst is willing to admit. For example, if one must decide on the validity of the prepared solution with

Fig. 4 Simultaneous (opposite) behavior of α and β for different sample sizes, n = 10, 15, 20 and 25, maintaining δ and σ fixed at 0.4 and 0.55, respectively. Dotted lines intersect the different curves for α = 0.05.

Fig. 5 Diagram for selecting the appropriate hypothesis test under normal distribution(s). For one sample: the mean μ0 (Z-test if the variance is known, t-test if unknown) or the standard deviation σ0 (χ²-test). For two independent samples: the difference in means μ1 − μ2 (Z-test with known variances; t-test with unknown variances, whether equal or unequal) or the ratio of standard deviations σ1/σ2 (F-test).

10 results and the analyst states α = β = 0.05, the only option according to Eq. (40) is to increase |δ|/σ. By solving 0.05 = pr{N(0,1) < 1.645 − (0.40/σ)√10}, one obtains σ = 0.3845. This means that the procedure should be improved from the current value of 0.55 to 0.38. If only five results were allowed, the standard deviation would have to decrease to 0.27 to maintain both the significance level and the power of the test.
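Both precision requirements follow from solving Eq. (40) for σ, which gives σ = |δ|√n/(z_α + z_β). A small Python check (illustrative, standard library only):

```python
from math import sqrt
from statistics import NormalDist

z = lambda p: NormalDist().inv_cdf(1 - p)

def required_sigma(delta, n, alpha=0.05, beta=0.05):
    # From Eq. (40): z_alpha - (|delta|/sigma)*sqrt(n) = -z_beta
    return abs(delta) * sqrt(n) / (z(alpha) + z(beta))

print(round(required_sigma(0.40, 10), 4))   # 0.3845
print(round(required_sigma(0.40, 5), 2))    # 0.27
```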
Finally, there is an aspect of Eq. (40) that should not go unnoticed. Maintaining α, β, and n fixed, it is possible to reduce δ (the pH value that can be distinguished from 2.00) if the analyst simultaneously increases the precision of the method, provided that the ratio |δ|/σ remains constant. Said otherwise, without changing any of the specifications of the hypothesis test, by diminishing σ we can discriminate a value of pH nearer to 2. Qualitatively this argument is clear: if a procedure is more precise, similar results are easier to distinguish, so a more precise procedure will tell apart values that would be considered equal with a less precise one. Eq. (40) quantifies this relation for the hypothesis test we are conducting.
In summary, a hypothesis test includes the following steps: (1) defining the null, H0, and alternative, H1, hypotheses according to the purpose of the test and the properties of the distribution of the random variable, which, according to Eq. (1), is the distribution of the values provided by a method of measurement; (2) deciding on the probabilities α and β, that is, the risks for the two types of error that will be assumed for the decision; (3) computing the needed sample size; (4) obtaining the results, computing the corresponding test statistic, and evaluating whether it belongs to the critical region CR; and, finally, drawing the analytical conclusion, which should entail more than reporting the pure statistical test decision. The conclusion should include the elements of the statistical test: the assumed distribution, α, β, and n. Care must be taken in writing the conclusion; for example, it is more adequate to say "there is no experimental evidence to reject the null hypothesis" than "the null hypothesis is accepted".
Table 4 summarizes the tests most frequently used in the validation of analytical procedures and in the analysis of their results.
Fig. 5 is the diagram equivalent to the one in Fig. 3 but for hypothesis testing.

Hypothesis Test on the Mean of a Normal Distribution


Case 1: Known variance
Admit that the data follow a normal N(μ,σ) distribution with unknown μ, as in the worked Example 8. The corresponding tests are in rows 1 to 3 of Table 4. The test statistic is always the same, but, depending on whether the alternative hypothesis is two-sided (row 1 in Table 4) or one-sided (rows 2 and 3), the critical region is different. The value z_α/2 verifies α/2 = pr{Z > z_α/2}, and analogously for z_α. For the two-tail test, the relation among n, α, and β is given by

n ≅ (z_α/2 + z_β)² / (|δ|/σ)²     (42)

whereas for the one-tail tests, Eq. (41) must be used.

Case 2: Unknown variance


In this case, both the mean, μ, and the standard deviation, σ, of the normal distribution are unknown. The hypothesis tests are in row 4 of Table 4 for the two-tail case and in rows 5 and 6 for the one-tail tests. The statistic tcalc should be compared to the values t_α,n−1 or t_α/2,n−1 of the Student's t distribution with n − 1 d.f. The equation that relates α, β, and n is


β = pr{−t_α/2,n−1 ≤ t_n−1(Δ) ≤ t_α/2,n−1}     (43)

where Δ = (|δ|/σ)√n is the noncentrality parameter of a noncentral t(Δ) distribution, which in Eq. (43) has n − 1 d.f. Note the analogy with the "shift" of the N(0,1) in Eq. (40). The discussion about the relative effect of sample size and precision is similar to the case in which the variance is known. The corresponding equations for the one-tail tests are β = pr{−t_α,n−1 ≤ t_n−1(Δ)} if H1: μ < μ0 and β = pr{t_n−1(Δ) ≤ t_α,n−1} if H1: μ > μ0.
To compute n from Eq. (43), the standard deviation is needed. To overcome this additional difficulty, the comments in Case 2 of Section "Confidence Interval on the Mean of a Normal Distribution" are valid and can be applied here as well. Usually, δ = 2σ or 3σ. Let us compare the solutions with known and unknown variance with the same data of Example 8, but supposing that the variance is unknown. We wish to detect differences in pH of 0.73σ (the same δ/σ as in Example 8). By using a sample of size 10, the probability β is 0.31 instead of the previous 0.26 (calculations can be seen in Example A10 of the Appendix). This increase in the probability of type II error is due to the lesser information we have about the problem; now the standard deviation is unknown.

Case 3: The paired t-test


In Case 3 of Section "Confidence Interval on the Difference in Two Means", the experimental procedure and the reasons for considering paired samples have already been explained. To decide on the effect of a treatment, the null hypothesis is that the mean of the differences is zero, that is, H0: μd = 0, with the two-sided alternative H1: μd ≠ 0. This is the test shown in row 11 of Table 4; there is only one one-tail test (row 12) because, if needed, it suffices to consider the opposite differences di = x2i − x1i instead of di = x1i − x2i, i = 1,...,n. The statistic and the critical region are analogous to those of Case 2 (test on the mean with unknown variance).
Example 9: Table 5 shows the recovery rates obtained with two solid-phase extraction (SPE) cartridges after fortification of wastewater samples with a sulfonamide. The samples came from 10 different locations. We want to decide whether cartridge A is more efficient than cartridge B and to compute the β risk of the test. To answer these questions, it is important to specify that we consider "different" those differences between the means of recovery rates that are greater than 2%.
We use a paired t-test on the mean of the differences between the recovery rates obtained with the two cartridges on the same sample (those of cartridge A minus those of cartridge B). By considering these differences, we eliminate the effect of the location of the wastewater samples on the performance of the two cartridges. The test is carried out as follows:

H0: μd = 0 (no difference in recoveries)
H1: μd > 0 (cartridge A gives recoveries greater than cartridge B)     (44)

Following row 12 of Table 4, the critical region is CR = {tcalc > t_α,n−1}, and the value of the statistic is tcalc = d̄/(sd/√n) = 2.69/(3.526/√10) = 2.412.
The critical value t_0.05,9 is equal to 1.833; thus, the actual tcalc belongs to the critical region. Therefore, the null hypothesis is rejected at α = 0.05 and we can conclude that cartridge A is more efficient than cartridge B, because the mean of the differences is positive.
To evaluate the power (1 − β) of the test, the equation β = pr{t_n−1(Δ) ≤ t_α,n−1} with Δ = d√n, for d = |μ − μ0|/σ = |δ|/σ = 2/3.53 = 0.57, provides 1 − β = 1 − 0.492 = 0.508 for α = 0.05 and n = 10 (calculations can be seen in Example A11 of the Appendix). Hence, 50% of the times the conclusion of accepting that there is no difference between recovery rates is wrong. In this case, the risk of a type II error is very large; in other words, the power is very poor when we want to discriminate differences of 2% in recovery because the ratio d is small.
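The numbers of Example 9 can be reproduced as follows; this is an illustrative Python sketch using only the standard library, with the tabulated critical value t_0.05,9 = 1.833 taken from the text (the stdlib offers no Student's t quantiles).

```python
from math import sqrt
from statistics import mean, stdev

# Example 9: recoveries (%) with cartridges A and B on the same 10 samples
a = [77.2, 74.0, 75.6, 80.0, 75.2, 69.2, 75.4, 74.0, 71.6, 60.4]
b = [74.4, 70.0, 70.2, 77.2, 75.9, 60.0, 77.0, 76.0, 70.0, 55.0]
d = [ai - bi for ai, bi in zip(a, b)]

n = len(d)
t_calc = mean(d) / (stdev(d) / sqrt(n))   # paired t statistic, row 12 of Table 4
t_crit = 1.833                            # tabulated t_{0.05,9}

print(round(mean(d), 2), round(stdev(d), 3))   # 2.69 3.526
print(round(t_calc, 3), t_calc > t_crit)       # 2.412 True: reject H0
```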

Hypothesis Test on the Variance of a Normal Distribution


The variance is a measure of the dispersion of the data used to evaluate the precision of a procedure of analysis; thus, decisions on this parameter must be made frequently. The corresponding hypothesis tests are in rows 13 and 14 of Table 4.
Example 10: A validated procedure has a repeatability of σ0 = 1.40 mg L−1 when measuring concentrations around 400 mg L−1. After a technical revision of the instrument, the laboratory is interested in testing the hypothesis

H0: σ² = σ0² = 1.96 (the repeatability did not change)
H1: σ² > 1.96 (the repeatability got worse)     (45)

Table 5 Recovery rates obtained by using two different extraction cartridges for a sulfonamide spiked in wastewater.

Location 1 2 3 4 5 6 7 8 9 10

Cartridge A (%) 77.2 74.0 75.6 80.0 75.2 69.2 75.4 74.0 71.6 60.4
Cartridge B (%) 74.4 70.0 70.2 77.2 75.9 60.0 77.0 76.0 70.0 55.0

See Example 9 for more details.



The analyst decides that a repeatability of up to 2.0 times the initial one, 1.40 mg L−1, is admissible, and assumes the risks α = β = 0.05. The sample size needed to guarantee the requirements of the analyst, which is formally a one-tail hypothesis test on the variance, is obtained from Eq. (46).

β = pr{χ²_n−1 < k/λ²}     (46)

k is the value such that α = pr{χ²_n−1 > k} and λ = σ/σ0. As λ = 2.0, Eq. (46) gives β = 0.0402 for n = 14, whereas for n = 13, β = 0.0511 (calculations can be seen in Example A12 of the Appendix). Therefore, the analyst decides to carry out 14 determinations on aliquot parts of a sample with 400 mg L−1, obtaining a variance of 3.10 (s = 1.76 mg L−1).
The statistic related to the decision (row 14 in Table 4) is χ²calc = (14 − 1) × 3.10/1.96 = 20.56. As the critical region is CR = {χ²calc > χ²_0.05,13 = 22.36}, the conclusion is that there is not enough experimental evidence to conclude that the precision has worsened. In this case, the acceptance of the null hypothesis, that is, maintaining the repeatability below 2.0 times the initial one, will be erroneous 5% of the times because β was fixed at 5%. The decision rule is equally protected against type I and type II errors.
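The decision of Example 10 reduces to a one-line computation; here is a Python sketch (the critical value χ²_0.05,13 = 22.36 is the tabulated one quoted in the text, since the standard library has no χ² quantiles):

```python
# Example 10: H0: sigma^2 = 1.96 vs H1: sigma^2 > 1.96, n = 14 determinations
n, s2, sigma0_sq = 14, 3.10, 1.96

chi2_calc = (n - 1) * s2 / sigma0_sq   # statistic of row 14, Table 4
chi2_crit = 22.36                      # tabulated chi^2_{0.05,13}

print(round(chi2_calc, 2))             # 20.56
print(chi2_calc > chi2_crit)           # False: do not reject H0
```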

Hypothesis Test on the Difference in Two Means


Case 1: Known variances
We assume that X1 is normal with unknown mean μ1 and known variance σ1², and that X2 is also normal with unknown mean μ2 and known variance σ2². We will be concerned with testing the hypothesis that the means μ1 and μ2 are equal. The two-sided alternative hypothesis is in row 7 of Table 4 and the one-sided in row 8, when we have a random sample of size n1 of X1 and another random sample of size n2 of X2.
Example 11: A solid-phase microextraction (SPME) procedure to extract triazines from wastewater was carried out. The results must be compared with previous ones where extraction was done by means of SPE. The repeatability of both procedures is known: 5.36% for the SPME procedure and 3.12% for SPE. The mean recovery rate for 10 replicated samples (Table 6) is 85.9% for SPME and 81.8% for SPE. We want to decide, at a 0.05 significance level, whether the recovery rate of the SPME procedure is greater than that of SPE.
As the repeatability of both procedures is known, a test to compare two means with normal distribution and known variances is adequate. The hypotheses are

H0: μSPME = μSPE (recovery rates are the same for both procedures)
H1: μSPME > μSPE (the recovery using the SPME procedure is greater than the one using SPE)     (47)

Following row 8 of Table 4, for a significance level α = 0.05, CR = {Zcalc > z_α = 1.645}. The statistic is

Zcalc = (x̄SPME − x̄SPE)/√(σ²SPME/n1 + σ²SPE/n2) = (85.9 − 81.8)/√(28.73/10 + 9.73/10) = 2.091.

As the value of the statistic, 2.091, belongs to CR, the null hypothesis is rejected, and we conclude that the mean recovery rate with SPME is greater than with SPE.
The next question could be related to the risk β for this hypothesis test, provided a difference in recovery of 3% is meaningful in the analysis. To answer this question, a simple modification of Eq. (40) shows that

β = pr{N(0,1) < z_α − |δ|/√(σ1²/n1 + σ2²/n2)}     (48)

By substituting our data in Eq. (48), one obtains β = 0.55. That means that in 55% of the cases we will incorrectly accept that the recovery is the same for both procedures.
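Both the test of Example 11 and the β of Eq. (48) can be checked with a short Python sketch (standard library only, illustrative):

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()
# Example 11: known repeatabilities; H0: mu_SPME = mu_SPE vs H1: mu_SPME > mu_SPE
xbar1, xbar2 = 85.9, 81.8
var1, var2 = 5.36 ** 2, 3.12 ** 2      # approx. 28.73 and 9.73
n1 = n2 = 10
alpha, delta = 0.05, 3.0               # delta: difference worth detecting (3%)

se = sqrt(var1 / n1 + var2 / n2)
z_calc = (xbar1 - xbar2) / se          # row 8 of Table 4
z_alpha = nd.inv_cdf(1 - alpha)
print(round(z_calc, 3), z_calc > z_alpha)   # 2.091 True: reject H0

beta = nd.cdf(z_alpha - delta / se)         # Eq. (48)
print(round(beta, 2))                       # 0.55
```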

Table 6 Recovery rates for triazines in wastewater using solid-phase microextraction (SPME) and solid-phase
extraction (SPE).

Recovery rate (%)

SPME 91 85 90 81 79 78 84 87 93 91
SPE 86 82 85 86 79 82 80 77 79 82

See Example 11 for more details.



It is also possible to derive formulas to estimate the sample size required to obtain a specified β for given δ and α. For the one-sided alternative, the sample size n = n1 = n2 is

n ≅ (z_α + z_β)² / (δ²/(σ1² + σ2²))     (49)

Again, with the data of Example 11 and β = 0.05, Eq. (49) gives 46.25; that is, 47 aliquot samples should be analyzed with each procedure so that α = β = 0.05.
For the two-sided alternative, the sample size n = n1 = n2 is

n ≅ (z_α/2 + z_β)² / (δ²/(σ1² + σ2²))     (50)
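A quick check of Eq. (49) with the data of Example 11 (Python sketch, standard library only):

```python
from math import ceil
from statistics import NormalDist

z = lambda p: NormalDist().inv_cdf(1 - p)

def n_one_sided(delta, var1, var2, alpha=0.05, beta=0.05):
    # Eq. (49): per-group sample size, one-sided alternative
    return (z(alpha) + z(beta)) ** 2 / (delta ** 2 / (var1 + var2))

n = n_one_sided(3.0, 5.36 ** 2, 3.12 ** 2)
print(round(n, 2), ceil(n))   # 46.25 47
```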

Case 2: Unknown variances
As in Section "Confidence Interval on the Difference in Two Means" (Case 2), there are two possibilities: (1) the unknown variances are equal, σ1² = σ2² = σ², although the numerical values differ just by chance, and (2) the variances are different, σ1² ≠ σ2². The question of deciding between (1) and (2) will be approached in Section "Hypothesis Test on the Variances of Two Normal Distributions".
Let X1 and X2 be two independent normal random variables with unknown but equal variances. The statistic and critical region for the two-tail test are in row 9 of Table 4, and in row 10 we can see the one-tail case.
For the two-sided alternative, with risks α and β, we additionally consider that the two means are different if their difference is at least a quantity δ = |μ1 − μ2|. As the variances are unknown, the comments in Case 2 of Section "Confidence Interval on the Mean of a Normal Distribution" are also applicable. If we have samples from a pilot experiment with respective sizes n1′ and n2′, and sp² is the pooled variance as defined in Eq. (23), the sample size needed, n = n1 = n2, is

n ≅ (t_α/2,ν + t_β,ν)² / (δ²/(2sp²))     (51)

where ν = n1′ + n2′ − 2 are the d.f. of the Student's t distribution. If this is not possible, the difference to be detected should be expressed in standard deviation units, that is, δ = |μ1 − μ2| = kσ, and the following expression applies:

n ≅ (z_α/2 + z_β)² / (k²/2)     (52)

where z_α/2 and z_β are the corresponding upper percentage points of the standard normal distribution Z.
Example 12: An experimenter wishes to compare the means of two procedures, stating that they should be considered different if they differ by 2 or more standard deviations (k = 2), and defining assumable risks of, at most, α = 0.05 and β = 0.10.
As z_0.025 = 1.960 and z_0.10 = 1.282, Eq. (52) gives n = 5.25, so six samples must be determined using each procedure. If the experimenter had wanted to distinguish 1 standard deviation (k = 1), then n = 21.01; that is, 22 determinations with each procedure would have been necessary.
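Example 12 is a direct application of Eq. (52); as a sketch in Python (standard library only):

```python
from math import ceil
from statistics import NormalDist

z = lambda p: NormalDist().inv_cdf(1 - p)

def n_two_sided_k(k, alpha=0.05, beta=0.10):
    # Eq. (52): per-group size when delta = k standard deviations (two-sided)
    return (z(alpha / 2) + z(beta)) ** 2 / (k ** 2 / 2)

print(round(n_two_sided_k(2), 2), ceil(n_two_sided_k(2)))   # 5.25 6
print(round(n_two_sided_k(1), 2), ceil(n_two_sided_k(1)))   # 21.01 22
```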
Although it is preferable to always take equal sample sizes, it may be more expensive or laborious to collect data from X1 than
from X2. In this case, there are weighted sample sizes to be considered.36
In case the equality of variances σ1² and σ2² cannot be admitted, there is no completely justified solution for the test. However, approximations with good power and easy-to-use tests exist, such as the Welch test. This method consists of substituting the known variances in the expression of Zcalc in rows 7 and 8 of Table 4 by their sample estimates, in such a way that the statistic becomes

tcalc = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2)     (53)

which follows a Student's t with the degrees of freedom ν in Eq. (25). The critical region for the two-tail test is CR = {tcalc < −t_α/2,ν} ∪ {tcalc > t_α/2,ν}. The critical region for the one-tail test is CR = {tcalc < −t_α,ν} if H1: μ1 < μ2.
As the variances are different, it seems reasonable to take the sample sizes, n1 and n2, also different. If σ2 = r·σ1, similarly to Eq. (52) one obtains the expression in Eq. (54):

n1 ≅ (z_α/2 + z_β)² / (k²/(r + 1))     (54)

Once n1 is determined, n2 is obtained as n2 = r·n1. The computation of the sample sizes with different variances when pilot samples are at hand can be found in Schouten.36
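The Welch statistic of Eq. (53), together with its approximate degrees of freedom (the Welch–Satterthwaite formula, which is what Eq. (25) of this article gives), can be sketched as follows. The snippet reuses the recoveries of Table 6 purely as illustration; in Example 11 those variances were treated as known, so this is not the analysis performed in the text.

```python
from math import sqrt

def welch(x1, x2):
    # Eq. (53) with sample variances, plus Welch-Satterthwaite d.f.
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2
    t_calc = (m1 - m2) / sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t_calc, df

t_calc, df = welch([91, 85, 90, 81, 79, 78, 84, 87, 93, 91],
                   [86, 82, 85, 86, 79, 82, 80, 77, 79, 82])
print(round(t_calc, 2), round(df, 1))   # 2.09 14.5
```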

Test Based on Intervals


The problem of deciding "the equality" of the means of two distributions discussed in the previous section highlights the fact that the result of interest (the two means are equal) is obtained by accepting the null hypothesis. Hence, the type II error becomes very important. To compute it, it is necessary to decide the least difference between the means that is to be detected, δ = |μ1 − μ2|. A more natural framework is to define the null and alternative hypotheses in such a way that the decision of accepting the equality of means is made by rejecting the null hypothesis; that is, the test should be posed as

H0: |μ1 − μ2| ≥ δ (the means are different)
H1: |μ1 − μ2| < δ (the means are considered to be equal)

Contrary to the tests so far, the hypotheses of this test, called interval hypotheses, are not made of one point but of an interval. The two one-sided tests (TOST) procedure consists of decomposing the interval hypotheses H0 and H1 into two sets of one-sided hypotheses:

H01: μ1 − μ2 ≤ −δ
H11: μ1 − μ2 > −δ

and

H02: μ1 − μ2 ≥ δ
H12: μ1 − μ2 < δ

The TOST procedure consists of rejecting the interval hypothesis H0 (and thus concluding equality of μ1 and μ2) if and only if both H01 and H02 are rejected at a chosen level of significance α.
If two normal distributions with the same unknown variance, σ², are assumed and two samples of sizes n1 and n2 are taken from each one, the two sets of one-sided hypotheses are tested with the ordinary one-tail test (row 10 of Table 4). Thus, the critical region is

CR = { (x̄1 − x̄2 + δ)/(sp√(1/n1 + 1/n2)) ≥ t_α,ν  and  (δ − (x̄1 − x̄2))/(sp√(1/n1 + 1/n2)) ≥ t_α,ν }     (55)

where sp² is the pooled variance and ν = n1 + n2 − 2 its d.f.


The TOST procedure turns out to be operationally identical to the procedure of declaring equality only if the usual confidence
interval at 100(1 − 2a)% on m1 − m2 is completely contained in the interval [−d, d].
As we are supposing that the variances are equal, the expression that relates the sample sizes n1 ¼n2 ¼n with a and b is written in
Eq. (56)
z + z 2
a b=2
n’ 2s2 (56)
d

Again, s is unknown in Eq. (56), so it should be adapted as in Case 2 of Section “Hypothesis Test on the Difference in Two Means”.
When comparing Eq. (56) with those corresponding to the two-tail t-test on the difference of means, one observes that it is
completely analogous by exchanging the two risks (see Eqs. (50), (52)). That is, the significance level and the power of the t-test
become the power and significance level, respectively, of the TOST procedure, which completely agrees with the exchange of the
hypotheses.
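Eq. (56) is immediate to evaluate. In this Python sketch the normal quantiles z_0.05 = 1.645 and z_0.025 = 1.960 are hard-coded assumptions taken from standard tables:

```python
import math

def tost_sample_size(sigma, delta, z_alpha=1.645, z_beta_half=1.960):
    """Approximate n per group for the TOST procedure, Eq. (56):
    n ~ 2 * sigma**2 * ((z_alpha + z_{beta/2}) / delta)**2,
    rounded up to the next integer."""
    n = 2 * sigma ** 2 * ((z_alpha + z_beta_half) / delta) ** 2
    return math.ceil(n)

# To distinguish a difference of one standard deviation (delta = sigma)
# with alpha = beta = 0.05:
n = tost_sample_size(sigma=1.0, delta=1.0)
```

With α = β = 0.05 the numeric value coincides with that of the two-tail t-test formula with the risks exchanged, as discussed above.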
The tests based on intervals have a long tradition in Statistics; see for example the book (very technical) by Lehmann.37 The TOST
procedure is a particular case that has also been used under the name bioequivalence test.38 Mehring39 has proposed some technical
improvements to obtain optimal interval hypothesis tests, including equivalence testing. It is shown that TOST is always biased; in
particular, the power tends to zero for increasing variances independently of the difference in means. As a result, an unbiased test40
and a suitable compromise between the most powerful test and the shape of its critical region41 have been proposed. In Chemistry
the use of TOST has been suggested to verify the equality of two procedures.42,43 Kuttatharmmakull et al.44 provide a detailed
analysis of the sample sizes necessary in a TOST procedure to compare methods of measurement. There are different versions of
TOST for ratio of variables and for proportions; the details of the equations for these cases can be consulted in Section 8.13 of the
book by Martin Andrés and Luna del Castillo45 and in the book by Wellek.46 The latter is a comprehensive review of inferential
procedures that enable one to “prove the null hypothesis” for many areas of applied statistical data analysis.

Hypothesis Test on the Variances of Two Normal Distributions


Suppose that two procedures follow normal distributions X1 and X2 with unknown means and variances. We wish to test the
hypothesis of the equality of the two variances, that is, H_0: σ1² = σ2². In practice, this is a relevant problem because this hypothesis is
Quality of Analytical Measurements: Statistical Methods for Internal Validation 23

Table 7 Data for analysis of stability (arbitrary units).

Control sample 46.31 44.90 44.12 36.07 39.20 36.39 50.71 47.85 45.60
Test sample 43.12 43.00 44.75 39.66 37.74 37.50 54.79 53.08 55.07

See Example 13 for more details.

related to the equality of the precision of the two procedures, and also as a previous step to decide about the equality of variances
before applying the test on the equality of means (Case 2 of Section “Hypothesis Test on the Difference in Two Means”) or
to compute a confidence interval on the difference of means (Case 2 of Section “Confidence Interval on the Difference in
Two Means”).
Assume that two random samples of size n1 of X1 and of size n2 of X2 are available, and let s1² and s2² be the sample variances.
To test the two-sided alternative, we use the statistic and CR of row 15 of Table 4. The probability β can be computed as a function of
the ratio of variances λ² = σ1²/σ2² that is to be detected, by Eq. (57)

β = pr{ F_{1−α/2, n1−1, n2−1} / λ² < F_{n1−1, n2−1} < F_{α/2, n1−1, n2−1} / λ² }     (57)

where F_{n1−1, n2−1} denotes an F-distribution with n1 − 1 and n2 − 1 d.f. and F_{α/2, n1−1, n2−1} its upper α/2 point, so that pr{F_{n1−1, n2−1} > F_{α/2, n1−1, n2−1}} = α/2. Similarly, F_{1−α/2, n1−1, n2−1} is the upper 1 − α/2 point.
Example 13: Aliquot samples have been analyzed in random order under the same experimental conditions to carry out a stability
test. The results are given in Table 7 and must be compared for assessing the test material stability. Different questions can be asked:
(1) Is there experimental evidence of instability in the material? (2) Taking into account that the analyst considers that the material
is not stable if the mean of the test sample differs from the mean of the control sample by two standard deviations, what is the
probability of accepting the null hypothesis when it is in fact wrong? (3) What should the sample size be if just one standard
deviation is needed for fitness for purpose of this analysis (with α = β = 0.05)?
The answers would be:

(1) Stability: As we only know the estimates of the variance, we should use a t-test to compare means.
The first step is to test if the variances can be considered equal by using a two-tail F-test:

H_0: σ1² = σ2²
H_1: σ1² ≠ σ2²

Following row 15 in Table 4, CR = {F_calc < F_{1−α/2, n1−1, n2−1} or F_calc > F_{α/2, n1−1, n2−1}}, with F_{α/2, n1−1, n2−1} = F_{0.025, 8, 8} = 4.43 and
F_{1−α/2, n1−1, n2−1} = F_{0.975, 8, 8} = 1/F_{0.025, 8, 8} = 1/4.43 = 0.23. As F_calc = s1²/s2² = 50.76/26.22 ≅ 1.94 lies between these two values, there is no experimental
evidence to conclude that the variances differ.
Therefore, a hypothesis t-test on the difference of the two means with equal variances is being formulated (Case 2 of
Section “Hypothesis Test on the Difference in Two Means”). The statistic and the CR are given in row 9 of Table 4.

H_0: μ1 = μ2   (the test material is stable)
H_1: μ1 ≠ μ2   (the test material is not stable)

The pooled variance is s_p² = 38.49, so s_p = 6.20 with 9 + 9 − 2 = 16 d.f., and

t_calc = (x̄1 − x̄2) / (s_p √(1/n1 + 1/n2)) = (45.41 − 43.46) / (6.20 √(1/9 + 1/9)) = 0.67

with t_{0.025, 16} = 2.12. Therefore, the critical region is the set of values of t_calc less
than −2.12 or greater than 2.12, which does not contain 0.67. Hence, there is no evidence to reject the null hypothesis, that is,
with these data there is no experimental evidence of instability.
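The numbers of part (1) can be reproduced in a few lines (a Python sketch of the same calculation; the article's own computations are in the MATLAB live-scripts):

```python
import math
import statistics

# Table 7 data (arbitrary units)
control = [46.31, 44.90, 44.12, 36.07, 39.20, 36.39, 50.71, 47.85, 45.60]
test    = [43.12, 43.00, 44.75, 39.66, 37.74, 37.50, 54.79, 53.08, 55.07]

# two-tail F-test on the variances: compare with F_{0.025,8,8} = 4.43
s2_test, s2_ctrl = statistics.variance(test), statistics.variance(control)
F_calc = s2_test / s2_ctrl

# pooled variance and t statistic for the comparison of means
n1, n2 = len(test), len(control)
sp2 = ((n1 - 1) * s2_test + (n2 - 1) * s2_ctrl) / (n1 + n2 - 2)
t_calc = ((statistics.mean(test) - statistics.mean(control))
          / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)))
# compare with t_{0.025,16} = 2.12: t_calc = 0.67 is inside (-2.12, 2.12)
```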
(2) Power of the test: With the condition imposed by the analyst, Eq. (52) with k = 2 gives β ≤ 0.05, so the power is greater than 0.95.
(3) Sample size: In this case, the analyst is interested in computing the sample size under the assumption that only 1 standard
deviation is admissible for fitness for purpose of this analysis. Therefore, k = 1 and Eq. (52) gives n = 25.99, so n1 = n2 = 26.
Sample sizes are greater than in point (2) reflecting the fact that the analyst is now interested in distinguishing a much smaller
quantity.

Regarding the sample size of the F-test in the answer to question (1), when the aim is to detect a standard deviation that is twice
the one of the control samples, Eq. (57) gives a probability β for this test of 0.56. That means that 56% of the times the null
hypothesis will be wrongly accepted, and in this case we have accepted the null hypothesis. When the F-test is used as a previous
step to the one on equality of means, and we decided β = 0.05 for the latter (t-test), it is common to use β = 0.10 for the former
(F-test). Maintaining that a change of 2 times the standard deviation of the control samples is to be detected, Eq. (57) gives
n = n1 = n2 = 24 with β = 0.098, the closest value to the intended 0.10 (all calculations of β can be seen in Example A13 of Appendix).
In general, the F-tests on the equality of variances are very conservative and large sample sizes are needed to assure an adequately
small probability of type II error.

Table 8 Determination of the degree of acidity of a vinegar by means of an acid-base titration.

Group 1 Group 2 Group 3 Group 4 Group 5

6.028 5.974 5.886 6.132 5.916


6.028 6.004 5.970 6.120 6.123
5.998 6.005 5.880 6.131 6.034
6.089 5.852 5.910 6.072 6.004
6.059 5.944 5.910 6.071 6.152
Mean x̄_i 6.0404 5.9558 5.9112 6.1052 6.0458
Variance s_i² 1.203 × 10⁻³ 3.997 × 10⁻³ 1.267 × 10⁻³ 0.969 × 10⁻³ 8.993 × 10⁻³

See Example 14 for more details.

Fig. 6 pH obtained by the different groups of students (Example 14), depicted to visually inspect the equality of variances (vertical axis: pH from 5.8 to 6.2; horizontal axis: group 1 to 5).

Hypothesis Test on the Comparison of Several Independent Variances


When the hypothesis of equality of variances of several groups of data coming from normal and independent distributions is to be
tested, a good practice is to plot the data for a visual inspection of their dispersion.
Example 14: Table 8 shows the results of the determination of the degree of acidity by means of acid-base titrimetry, employing
sodium hydroxide as the titrant. These data are adapted from the practice “Analysis and comparison of the acetic grade of a vinegar”
included in Ortiz et al.,47 and each series is a replicated determination carried out by a group of students on the same vinegar
sample. The means and variances obtained by each group are also included in Table 8.
Fig. 6 shows the plot of the results obtained by the five different groups of students. Some differences are observed.
The most commonly used tests to compare several variances are Cochran’s, Bartlett’s, and Levene’s tests. In all the cases, the
hypotheses to be tested are
H_0: σ1² = … = σ_i² = … = σ_k²
H_1: at least one σ_i² is different     (58)

The sample size of each group is denoted by n_i, i = 1, 2, …, k, and N = Σ_{i=1}^{k} n_i.

Case 1: Cochran’s test


The null hypothesis is that the variances of the k groups of data are the same. This test detects whether one variance is greater
than the rest. The statistic is

G_calc = max(s_i²) / Σ_{i=1}^{k} s_i²     (59)

The critical region at significance level α is given by

CR = { G_calc > G_{α, k, ν} }     (60)

where G_{α, k, ν} is the value tabulated in Table 9 for ν d.f. In the case n_i = n for all i, ν = n − 1.
With the data of Example 14 in Table 8, G_calc = 8.993 × 10⁻³/(16.429 × 10⁻³) = 0.5474 and G_{0.05, 5, 4} = 0.5441 (Table 9).
Thus, at the 0.05 significance level, the null hypothesis should be rejected and the variance of group 5 should be considered different
from the rest.

Table 9 Critical values for Cochran’s test for testing homogeneity of several variances at 5% significance level.

k n

1 2 3 4 5 6 7 8 9 10

2 0.9985 0.9750 0.9392 0.9057 0.8772 0.8534 0.8332 0.8159 0.8010 0.7880
3 0.9669 0.8709 0.7977 0.7457 0.7071 0.6771 0.6530 0.6333 0.6167 0.6025
4 0.9065 0.7679 0.6841 0.6287 0.5895 0.5598 0.5365 0.5175 0.5017 0.4884
5 0.8412 0.6838 0.5981 0.5441 0.5065 0.4783 0.4564 0.4387 0.4241 0.4118
6 0.7808 0.6161 0.5321 0.4803 0.4447 0.4184 0.3980 0.3817 0.3682 0.3568
7 0.7271 0.5612 0.4800 0.4307 0.3974 0.3726 0.3535 0.3384 0.3259 0.3154
8 0.6798 0.5157 0.4377 0.3910 0.3595 0.3362 0.3185 0.3043 0.2926 0.2820
9 0.6385 0.4775 0.4027 0.3584 0.3286 0.3067 0.2901 0.2768 0.2659 0.2568
10 0.6020 0.4450 0.3733 0.3311 0.3029 0.2823 0.2666 0.2541 0.2439 0.2353

k, number of levels; n, degrees of freedom.


Adapted from Sachs, L. Applied Statistics. A Handbook of Techniques; Springer-Verlag: New York, 1982.

Case 2: Bartlett’s test


This test is appropriate to detect when variances are similar within groups but differ from one group to another. The statistic is
defined by the following equations:

χ²_calc = 2.3026 q / c     (61)

q = (N − k) log10(s_p²) − Σ_{i=1}^{k} (n_i − 1) log10(s_i²)     (62)

c = 1 + [ Σ_{i=1}^{k} 1/(n_i − 1) − 1/(N − k) ] / (3(k − 1))     (63)

In Eq. (62), “log10” means the decimal logarithm and s_p² is the pooled variance that, extending Eq. (23) for k variances, is
s_p² = [ Σ_{i=1}^{k} (n_i − 1) s_i² ] / (N − k).

The critical region is

CR = { χ²_calc > χ²_{α, k−1} }     (64)

In Example 14, c = 1.10, q = 3.43, and χ²_calc = 7.19, which does not belong to the critical region defined in Eq. (64) because
χ²_{0.05, 4} = 9.49. Consequently, we have no evidence of difference in variances.
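Eqs. (61)-(63) can be verified with the data of Table 8 (a Python sketch, standard library only):

```python
import math
import statistics

groups = [
    [6.028, 6.028, 5.998, 6.089, 6.059],
    [5.974, 6.004, 6.005, 5.852, 5.944],
    [5.886, 5.970, 5.880, 5.910, 5.910],
    [6.132, 6.120, 6.131, 6.072, 6.071],
    [5.916, 6.123, 6.034, 6.004, 6.152],
]
k = len(groups)
n_i = [len(g) for g in groups]
N = sum(n_i)
s2 = [statistics.variance(g) for g in groups]
# pooled variance s_p^2 = sum (n_i - 1) s_i^2 / (N - k)
sp2 = sum((n - 1) * v for n, v in zip(n_i, s2)) / (N - k)
q = (N - k) * math.log10(sp2) - sum((n - 1) * math.log10(v)
                                    for n, v in zip(n_i, s2))          # Eq. (62)
c = 1 + (sum(1 / (n - 1) for n in n_i) - 1 / (N - k)) / (3 * (k - 1))  # Eq. (63)
chi2_calc = 2.3026 * q / c                                             # Eq. (61)
# compare with chi2_{0.05, 4} = 9.49
```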
Cochran’s and Bartlett’s tests are very sensitive to the normality assumption. Levene’s test, particularly when it is based on the
medians of each group, is more robust to the lack of normality of data.

Case 3: Levene’s test


For the i-th group of replicates, compute the absolute deviations of the values x_ij from the corresponding group mean:

l_ij = |x_ij − x̄_i|,  j = 1, 2, …, n_i     (65)

Consider the data arranged as in Table 8 and compute the usual F statistic for the deviations l_ij:

F_calc = [ Σ_{i=1}^{k} n_i (l̄_i − l̄)² / (k − 1) ] / [ Σ_{i=1}^{k} Σ_{j=1}^{n_i} (l_ij − l̄_i)² / (N − k) ]     (66)

where l̄_i is the mean of the deviations of the i-th group and l̄ is their overall mean. Note that Eq. (66) is the usual one-way ANOVA
statistic applied to the deviations: the numerator measures the variability between the group means of the deviations and the
denominator is the pooled within-group variance of these deviations. The critical region at 100(1 − α)% confidence level is

CR = { F_calc > F_{α, k−1, N−k} }     (67)

Computing the differences in Eq. (65) with the data of Table 8, F_calc = (2.205 × 10⁻³)/(0.905 × 10⁻³) = 2.44. As F_{0.05, 4, 20} = 2.866,
there is no evidence to reject the null hypothesis (the variances are equal).
Levene’s test is more robust when group medians are used instead of group means. The adaptation is simple: one computes the
absolute value of the differences from the median, x̃_i, of each group

l_ij = |x_ij − x̃_i|,  j = 1, 2, …, n_i     (68)

The statistic is again the one of Eq. (66), applied in the same way.
With the same data of Table 8, but using the deviations of Eq. (68), one obtains F_calc = (2.146 × 10⁻³)/(1.360 × 10⁻³) = 1.58, and the
conclusion is the same: the variances of the five groups should be considered equal.
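Both versions of Levene's test reduce to an ordinary one-way F statistic on the deviations; a Python sketch for the data of Table 8 (the helper name levene_F is ours):

```python
import statistics

groups = [
    [6.028, 6.028, 5.998, 6.089, 6.059],
    [5.974, 6.004, 6.005, 5.852, 5.944],
    [5.886, 5.970, 5.880, 5.910, 5.910],
    [6.132, 6.120, 6.131, 6.072, 6.071],
    [5.916, 6.123, 6.034, 6.004, 6.152],
]

def levene_F(groups, center):
    """F statistic of Eq. (66) on absolute deviations from `center`."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    dev = [[abs(x - center(g)) for x in g] for g in groups]
    dbar = [statistics.mean(d) for d in dev]            # per-group mean deviation
    grand = sum(sum(d) for d in dev) / N                # overall mean deviation
    between = sum(len(d) * (m - grand) ** 2
                  for d, m in zip(dev, dbar)) / (k - 1)
    within = sum((x - m) ** 2
                 for d, m in zip(dev, dbar) for x in d) / (N - k)
    return between / within   # compare with F_{0.05,4,20} = 2.866

F_mean = levene_F(groups, statistics.mean)       # deviations of Eq. (65)
F_median = levene_F(groups, statistics.median)   # deviations of Eq. (68)
```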
It often happens that the three tests do not agree, as is the case here, but a joint interpretation clarifies the situation.
In the data of Example 14, the variance of group 5 is greater than the variances of the other groups, as Cochran’s test shows. When
Levene’s test is applied, a large difference between the two statistics is observed when the median is used instead of the mean. This
suggests that the increase in the variance of the last group is caused by some data differing from the others, which is graphically seen in Fig. 6.

Goodness-of-Fit Tests: Normality Tests


The tests on distributions, or goodness-of-fit tests, are designed to decide whether the experimental data are compatible with a
predetermined probability distribution, generally characterized by one or several parameters, such as the normal, Student’s t,
F, or uniform distributions. Almost all the inferential procedures proposed in this article are based on normality; thus, in most cases,
it is necessary to check whether the data are compatible with this assumption. In this section, we will show the chi-square test,
which can be used for any distribution, and the D’Agostino test, which is advised for evaluating the normality of a data set.

Case 1: Chi-square test


The test is designed to detect frequencies inadequate for a specified probability distribution F0. Given a sample x1, x2, . . ., xn from a
random variable, one is interested in testing the hypothesis

H_0: The distribution of the random variable is F_0
H_1: This is not the case     (69)

To compute the statistic, the n sample values are grouped into k classes (intervals). Denote by O_i, i = 1, …, k, the frequency observed
in each class and by E_i the expected frequency for the same class provided the distribution is exactly F_0. Then, the statistic in Eq. (70)

χ²_calc = Σ_{i=1}^{k} (O_i − E_i)² / E_i     (70)

follows a χ²_{k−p−1} distribution, which is used to define the critical region at (1 − α)100% confidence level as

CR = { χ²_calc > χ²_{α, k−p−1} }     (71)

where χ²_{α, k−p−1} is the value such that pr{χ²_{k−p−1} > χ²_{α, k−p−1}} = α, and p is a number that depends on the distribution F_0; for instance,
p = 2 for a normal, p = 1 for a Poisson, and p = 0 for a uniform distribution. The test requires that the expected frequencies are not
too small; if they are, the data are regrouped into bigger classes. In the practice of chemical analysis, the sample sizes are not large
and, when the data are grouped, the d.f. of the chi-square statistic are few, the critical value of Eq. (71) becomes large, and a large
discrepancy between the expected and observed frequencies is necessary to reject the null hypothesis. That means that the test is
very conservative.
Example 15: To show the validity of the use of crystal violet (CV) as an internal standard in the determination of malachite green
(MG) in trout by LC-MS/MS, a sample of trout was spiked with 1 mg kg⁻¹ of CV and increasing concentrations of MG between 0.5
and 5.0 mg kg⁻¹. The areas of the CV-specific peak (transition 372 > 356) in these calibration standards were: 1326, 1384, 1419,
1464, 1425, 1409, 1387, 1449, 1311, 1338, 1350 and 1345. To verify whether the signal of CV is constant and independent of the
concentration of MG in every standard, we can test the null hypothesis, H0,

H0 : The distribution of the random variable is uniform


H1 : This is not the case

Table 10 shows the calculation of both the observed and the expected frequencies under the uniform distribution in the interval [1311,
1464], the endpoints being respectively the minimum and maximum values in the sample.
By summing up the values of the last column of Table 10, the statistic is χ²_calc = 0.51, which does not belong to the critical region
because it is not greater than χ²_{0.05, 5−0−1} = χ²_{0.05, 4} = 9.49. Therefore, there is no evidence to reject the hypothesis that the data come from a
uniform distribution.
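The computation of Table 10 can be scripted (a Python sketch; the last class is taken as closed so that the maximum value 1464 is counted in it; summing the unrounded terms gives χ²_calc = 0.50, the 0.51 quoted above coming from adding the two-decimal entries of Table 10):

```python
areas = [1326, 1384, 1419, 1464, 1425, 1409, 1387, 1449, 1311, 1338, 1350, 1345]
k = 5
lo, hi = min(areas), max(areas)      # [1311, 1464]
width = (hi - lo) / k                # 30.6

# observed frequencies; the maximum is assigned to the last (closed) class
observed = [0] * k
for x in areas:
    idx = min(int((x - lo) / width), k - 1)
    observed[idx] += 1

expected = len(areas) / k            # 2.4 under the uniform hypothesis
chi2_calc = sum((o - expected) ** 2 / expected for o in observed)  # Eq. (70)
# compare with chi2_{0.05, 5-0-1} = 9.49
```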

Case 2: D’Agostino normality test


The problem of checking the normality of a set of data has been extensively treated. When the empirical and theoretical histograms
are compared, the most commonly used tests are the chi-square and the Kolmogorov–Smirnov tests. However, there are several

Table 10 χ² goodness-of-fit test to a uniform distribution applied to assess the validity of crystal violet as
internal standard; data of Example 15.

Class | Observed frequency (O_i) | Expected frequency (E_i) | (O_i − E_i)²/E_i

[1311.0, 1341.6) 3 2.40 0.15


[1341.6, 1372.2) 2 2.40 0.07
[1372.2, 1402.8) 2 2.40 0.07
[1402.8, 1433.4) 3 2.40 0.15
[1433.4, 1464.0) 2 2.40 0.07

Table 11 Significance limits for the D’Agostino normality test.

Significance level

a ¼ 0.05 a ¼ 0.01

Sample size DL DU DL DU

10 0.2513 0.2849 0.2379 0.2857


12 0.2544 0.2854 0.2420 0.2862
14 0.2568 0.2858 0.2455 0.2865
16 0.2587 0.2860 0.2482 0.2867
18 0.2603 0.2862 0.2505 0.2868
20 0.2617 0.2863 0.2525 0.2869
22 0.2629 0.2864 0.2542 0.2870
24 0.2639 0.2865 0.2557 0.2871
26 0.2647 0.2866 0.2570 0.2872
28 0.2655 0.2866 0.2581 0.2873
30 0.2662 0.2866 0.2592 0.2873

Adapted from Martín Andrés, A.; Luna del Castillo, J. D. Bioestadística para las ciencias de la salud; Spain: Norma Capitel Madrid, 2004.

characteristics that are specific to the pdf of a normal distribution, for example, the skewness and the kurtosis, which are statistics
related to moments of order higher than two of the normal pdf. A very powerful test is D’Agostino’s test, with hypotheses

H0 : The distribution of the random variable is normal


H1 : This is not the case

To apply the test, the data are sorted in increasing order, so that x_1 ≤ x_2 ≤ ⋯ ≤ x_n. The statistic is

D_calc = [ Σ_{i=1}^{n} i x_i − ((n + 1)/2) Σ_{i=1}^{n} x_i ] / √( n³ [ Σ_{i=1}^{n} x_i² − (Σ_{i=1}^{n} x_i)²/n ] )     (72)
Index i in Eq. (72) refers to the ordered data. Table 11 shows some of the critical values of the statistic, with the two values,
D_{L,α,n} and D_{U,α,n}, for each sample size n and significance level α. The critical region of the test is

CR = { D_calc < D_{L,α,n} } ∪ { D_calc > D_{U,α,n} }     (73)

For further details, consult the work by D’Agostino and Stephens.48
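Eq. (72) is simple to implement. As an illustration (a Python sketch; applying it to the CV areas of Example 15 is our own choice, not a calculation made in the text), D = 0.2856 for those n = 12 values, marginally above the tabulated D_U(0.05, 12) = 0.2854 of Table 11, which is coherent with those data looking closer to uniform than to normal:

```python
import math

def dagostino_D(data):
    """D statistic of Eq. (72); the data need not be pre-sorted."""
    x = sorted(data)
    n = len(x)
    # numerator: sum of i * x_(i) minus (n + 1)/2 times the sum of the data
    T = sum((i + 1) * xi for i, xi in enumerate(x)) - (n + 1) / 2 * sum(x)
    # sum of squared deviations from the mean
    ss = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    return T / math.sqrt(n ** 3 * ss)

# CV peak areas of Example 15 (n = 12)
areas = [1326, 1384, 1419, 1464, 1425, 1409, 1387, 1449, 1311, 1338, 1350, 1345]
D = dagostino_D(areas)
# compare with D_L = 0.2544 and D_U = 0.2854 for n = 12, alpha = 0.05 (Table 11)
```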


As with the confidence intervals, a Bayesian approach exists for the construction of the hypothesis tests that several statisticians
prefer because of its internal coherence. For a recent comparative analysis of both approaches, see Moreno and Girón.49

One-Way Analysis of Variance

Sometimes, more than two means must be compared. One could think of comparing, say, five means by applying the test of comparison
of two means of Section “Hypothesis Test on the Difference in Two Means” to each of the 10 pairs of means that can be formed by
taking them two by two. This option has a serious drawback: it requires enormous sample sizes because, to test the null hypothesis
“the five means are equal” with α = 0.05 and assuming that the 10 tests are independent, each one of the hypotheses “the means xi

Table 12 Arrangement of data for an ANOVA.

Factor

Level 1 Level 2 Level 3 Level k

x11 x21 x31 ⋯ xk1


x12 x22 x32 ⋯ xk2
x13 x23 x33 ⋯ xk3
⋮ ⋮ ⋮ ⋱ ⋮
x1n x2n x3n ⋯ xkn

and xj are equal” should be tested with a significance level of 0.0051 to obtain a confidence equal to (1 − 0.0051)¹⁰ ≅ 0.95. The
appropriate procedure for testing the equality of several means is the analysis of variance (ANOVA).
The ANOVA has many more applications; it is particularly useful in the validation of a model fit to some experimental data and,
hence, in an analytical calibration or in the analysis of response surfaces as can be seen in the corresponding chapters of the
present book.
Table 12 shows how the data are usually arranged in a general case: in columns, the k levels of a factor (e.g., five different
extraction cartridges) and, in rows, the n data obtained (e.g., four determinations with each cartridge). Each of the N = k × n values x_ij
(i = 1, 2, …, k, j = 1, 2, …, n) is the result obtained, in our example, when using the i-th cartridge with the j-th aliquot sample.
In general, in each level i a different number of replicates n_i may be available, with N = Σ_{i=1}^{k} n_i. To make the notation easier, we will
suppose that all n_i are equal, that is, n_i = n for each level.
Suppose that the data in Table 12 can be described by the model

x_ij = μ + τ_i + ε_ij  with  i = 1, 2, …, k;  j = 1, 2, …, n     (74)

where μ is a parameter common to all treatments, called the overall mean, τ_i is a parameter associated with the i-th level, called the
factor effect, and ε_ij is the random error component. In our example, μ is the content of the sample and τ_i is the variation in this
quantity caused by the use of the i-th cartridge. Note that in the model of Eq. (74) the effect of the factor is additive; this is an
assumption that may be unacceptable in some practical situations.
The ANOVA is posed to test some hypotheses about the treatment effects and to estimate them. In order to support the
conclusions when testing the hypotheses, the model errors ε_ij are assumed to be normally and independently distributed random
variables, with mean zero and variance σ², NID(0, σ). Besides, the variance σ² is assumed to be constant for every level of the factor.
The model of Eq. (74) is called the one-way ANOVA, because only one factor is studied. The analysis for two or more factors can
be seen in the chapter about factorial techniques in this book. Furthermore, the data of Table 12 are required to be obtained in
random order to reduce the effect of other uncontrolled factors.
There are two ways of choosing the k levels of the factor in the experiment. In the first case, the k levels are specifically chosen by
the researcher, as the cartridges in our example. In this case, we wish to test hypotheses about the magnitude of τ_i, and the
conclusions will apply only to the levels of the factor explicitly considered in the analysis; they cannot be extended to similar
levels that were not considered. This is called the “fixed effects model”.
Alternatively, the k levels could be a random sample from a larger population of levels. In this case, we would like to be able to
extend the conclusions based on the sample to all levels in the population, whether or not they have been explicitly
considered in the analysis. Here, each τ_i is a random variable and information about the specific values included in the
analysis is useless; instead, we test hypotheses about the variability. This is called the “random effects model”. This model is used
to evaluate the repeatability and reproducibility of a method, and also the laboratory bias when the method of analysis is being
tested by a proficiency test. In the same experiment, and provided there are at least two factors, fixed and random effects can
appear simultaneously.50,51 The reader can easily combine them, when appropriate, from the explanations and examples in the
following subsections.
Also for this section, all the computations for the examples can be followed with the live-script in the supplementary material
named ANOVA_section1024_live.mlx.

The Fixed Effects Model


In this model, the effect of the factor, τ_i, is defined as the difference between the mean in each level and the overall mean, with the
constraint:

Σ_{i=1}^{k} τ_i = 0     (75)

From the individual data, the mean value per level is defined as

x̄_i = ( Σ_{j=1}^{n} x_ij ) / n,  i = 1, 2, …, k     (76)

and the overall mean is

x̄ = ( Σ_{i=1}^{k} Σ_{j=1}^{n} x_ij ) / N     (77)

A simple calculation gives

Σ_{i=1}^{k} Σ_{j=1}^{n} (x_ij − x̄)² = n Σ_{i=1}^{k} (x̄_i − x̄)² + Σ_{i=1}^{k} Σ_{j=1}^{n} (x_ij − x̄_i)²     (78)

Eq. (78) shows that the total variability of the data, measured by the sum of squares of the difference of each datum and the overall
mean, can be partitioned into a sum of squares of differences between level means and the overall mean and a sum of squares of
differences of individual values and their level mean. The term n Σ_{i=1}^{k} (x̄_i − x̄)² measures the differences between levels, whereas
Σ_{i=1}^{k} Σ_{j=1}^{n} (x_ij − x̄_i)² is due to random error alone. It is common to write Eq. (78) as

SST = SSF + SSE     (79)

where SST is the total sum of squares, SSF is the sum of squares due to changing the levels of the factor, called the sum of squares
between levels, and SSE is the sum of squares due to random error, called the sum of squares within levels. There are N
individual values, thus SST has N − 1 d.f. Similarly, as there are k levels of the factor, SSF has k − 1 d.f. Finally, SSE has N − k d.f. We are
interested in testing

H_0: τ_1 = τ_2 = τ_3 = … = τ_k = 0   (there is no effect due to the factor)
H_1: τ_i ≠ 0 for at least one i

Because of the assumption that the errors ε_ij are NID(0, σ), the values x_ij are NID(μ + τ_i, σ), and therefore SST/σ² is distributed as a
χ²_{N−1}. Cochran’s theorem guarantees that, under the null hypothesis, SSF/σ² and SSE/σ² are independent chi-square distributions
with k − 1 and N − k d.f., respectively. Therefore, under the null hypothesis, the statistic
F_calc = [ SSF/(k − 1) ] / [ SSE/(N − k) ] = MSF/MSE     (80)

follows an F_{k−1, N−k} distribution, whereas under the alternative hypothesis it follows a noncentral F with the same d.f.50 The
quantities MSF and MSE are called mean squares. Their expected values are E(MSF) = σ² + n(Σ_{i=1}^{k} τ_i²)/(k − 1) and E(MSE) = σ²,
respectively. Therefore, under the null hypothesis, both are unbiased estimators of the residual variance σ², whereas under the
alternative hypothesis the expected value of MSF is greater than σ². The critical region of the test at significance level α is in Eq. (81)
and reflects the idea that, if the null hypothesis is false, the numerator of Eq. (80) is significantly greater than the denominator.

CR = { F_calc > F_{α, k−1, N−k} }     (81)

Usually, the test procedure is summarized in a table (called ANOVA table) like the one in Table 13, except that we have added a
column, the one corresponding to E(MS), just to emphasize the values each MS estimates and their relation with the previous
discussion.
Example 16: To investigate the influence of the composition of some fibers on a SPME procedure, an experiment was performed
using five different fibers. The data shown in Table 14 are the results of four replicated analyses carried out after extraction with each

Table 13 Skeleton of an ANOVA of fixed effects.

Source of variation      Sum of squares  d.f.   Mean squares  E(MS)                              F_calc
Factor (between levels)  SSF             k − 1  MSF           σ² + n Σ_{i=1}^{k} τ_i²/(k − 1)    MSF/MSE
Error (within levels)    SSE             N − k  MSE           σ²
Total                    SST             N − 1

Table 14 Experimental results (mg L−1 of triazine), means and variances obtained in the study of the effect of
the type of fiber in a SPME procedure.

Type of fiber

Fiber 1 Fiber 2 Fiber 3 Fiber 4 Fiber 5

Replicates 490 612 509 620 490


478 609 496 601 502
492 599 489 580 495
499 589 500 603 479
Mean xi 489.75 602.25 498.50 601.00 491.50
Variance s2i 76.25 108.92 69.67 268.67 93.67

Table 15 Results of the ANOVA for data of Table 14.

Source Sum of squares d.f. Mean squares Fcalc

Between fibers 56 551.3 4 14 137.8 114.54


Error (within fibers) 1 851.5 15 123.4
Total 58 402.8 19

fiber on a sample spiked with 1000 mg L−1 of triazine. All the analyses were carried out in random order maintaining the rest of
experimental conditions controlled.
In the last two rows of Table 14, the means and variances for each fiber are given. Before conducting the ANOVA, the hypothesis
of equality of variances should be tested:

H_0: σ1² = … = σ_i² = … = σ_k²
H_1: at least one σ_i² is different

With the variances in Table 14, the statistic of Cochran’s test (Eq. (59)) is G_calc = 268.67/617.18 = 0.435. As G_{0.05, 5, 3} = 0.5981
(see Table 9), the statistic does not belong to the critical region (Eq. (60)) and there is no evidence to reject the null hypothesis at the 5%
significance level.
The statistic of Bartlett’s test is χ²_calc = 1.792 (Eq. (61)) and the critical value is χ²_{0.05, 4} = 9.488, so there is no evidence to reject the
null hypothesis either (Eq. (64)). The same happens with Levene’s test; computing the absolute deviations, according to Eq. (65), of
the data of Table 14, F_calc = 14.70/44.01 = 0.33, and F_{0.05, 4, 15} = 3.06, so there is no evidence to reject the null hypothesis on the
equality of variances. By using the median instead of the mean (Eq. (68)), F_calc = 15.13/46.23 = 0.33, and the conclusion is the
same. From the analysis of the equality of variances, we can conclude that the variances of the five levels should be considered
equal.
The ANOVA of the experimental data gives the results in Table 15. Considering the critical region defined in Eq. (81), as
F_calc = 114.54 is greater than the critical value F_{0.05, 4, 15} = 3.06, we reject the null hypothesis, and hence the conclusion is that there is
a significant effect of the “fiber composition” on the extracted amount.
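The entries of Table 15 follow directly from Eqs. (78)-(80) (a Python sketch with the data of Table 14; the article's own computations are in the MATLAB live-scripts):

```python
import statistics

# Table 14: mg/L of triazine recovered with each fiber, four replicates
fibers = [
    [490, 478, 492, 499],   # fiber 1
    [612, 609, 599, 589],   # fiber 2
    [509, 496, 489, 500],   # fiber 3
    [620, 601, 580, 603],   # fiber 4
    [490, 502, 495, 479],   # fiber 5
]
k = len(fibers)
n = len(fibers[0])
N = k * n
grand = sum(sum(g) for g in fibers) / N
# sum of squares between levels and within levels, Eq. (78)
SSF = n * sum((statistics.mean(g) - grand) ** 2 for g in fibers)
SSE = sum((x - statistics.mean(g)) ** 2 for g in fibers for x in g)
F_calc = (SSF / (k - 1)) / (SSE / (N - k))   # Eq. (80); F_{0.05,4,15} = 3.06
```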

Power of the Fixed Effects ANOVA model


The power of the ANOVA is computed by the following expression:

1 − β = pr{ F*_{k−1, N−k, δ} > F_{α, k−1, N−k} }     (82)

where F_{α, k−1, N−k} is the critical value of Eq. (81), F*_{k−1, N−k, δ} is a noncentral F distribution with k − 1 and N − k d.f. in the numerator
and denominator, respectively, and δ is the noncentrality parameter, whose value is given by

δ = n ( Σ_{i=1}^{k} τ_i² ) / σ²     (83)

The noncentrality parameter δ depends on the number of replicates n and also on the difference in means that we wish to detect in
terms of Σ_{i=1}^{k} τ_i². When the error variance is unknown, which is usually the case, we must define the differences to be detected in
terms of the ratio Σ_{i=1}^{k} τ_i²/σ². As the power, 1 − β, of the test increases with δ, the next question would be about the minimum δ needed

Table 16 Probability of type II error, b, as a function of the number n of replicates in the ANOVA for comparing
fiber types.

n 4 5 6 7 8 9

b 0.347 0.203 0.111 0.058 0.029 0.014

(for a given β) to distinguish differences of at least D in two of the τ’s. This minimum δ can be computed, provided that two of the τ_i
differ by D and the remaining k − 2 are kept at the mean of these two,50 and is given by

Σ_{i=1}^{k} τ_i² = D²/2     (84)

For example, with the data of Example 16 (Table 14), we are now interested in the risk of falsely affirming that the type of fiber is not
significant for the recovery.
The answer consists of evaluating the probability β by Eq. (82). Suppose that we want to discriminate effects greater than twice
the MSE, that is, Σ_{i=1}^{k} τ_i²/σ² ≥ 2, and thus δ = n × 2 = 8. Notice that, by substituting Eq. (84) into Eq. (83), this value of δ means
that we want to discriminate a difference D between two types of fiber of at least 2σ. In these conditions, F_{0.05, 4, 15} = 3.06 and
β = 0.54 (calculations can be seen in Example A14 of Appendix and in the live-script ANOVA_section1024_live.mlx in the
supplementary material). In other words, 54 out of 100 times we will accept the null hypothesis (there is no effect of the composition of
the fiber) when it is wrong. This is not good enough for a suitable decision rule.
Eq. (82) can also be used to determine the sample size before starting an experiment, so that risks a and b are both good enough.
For example, we want to know how many replicates we need to carry out in the experiment for a ¼ b ¼ 0.05 and maintaining the
P
ratio ki¼1 t2i /s2 3. Note that, in this case, the analyst considers “effect of fiber type” if 2
pffiffiffiit is greater than 3 times s , which is
equivalent, using Eq. (84), to detect a difference between two fibers at least equal to D ¼ 6s 2:5 s.
To calculate the sample size, a table must be made to write b as a function of n in Eq. (82) with k, a, and d fixed at 5, 0.05, and
3  n, respectively. Following the results shown in Table 16, computed with the code in the mentioned live-script, we need n ¼ 8
replicates with each fiber to achieve b  0.05, but in practice n ¼ 7 would be enough.
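The calculations above are done exactly in the MATLAB live-script of the supplementary material. As a language-neutral illustration, β = Pr{F*_{k−1,N−k,δ} < F_{α,k−1,N−k}} can also be estimated by simulating the noncentral F distribution with nothing but the Python standard library (the Monte Carlo approach, the function name, and its defaults are ours; the critical values 3.06 = F_{0.05,4,15} and 2.64 = F_{0.05,4,35} are tabulated values, the first quoted in the text):

```python
import math
import random

def beta_fixed_effects(k, n, ratio, f_crit, draws=100_000, seed=7):
    """Monte Carlo estimate of beta = Pr{F* < f_crit}, where F* is a
    noncentral F with k-1 and k*(n-1) d.f. and noncentrality
    delta = n*ratio, with ratio = sum(tau_i^2)/sigma^2 (Eq. (83))."""
    rng = random.Random(seed)
    df1, df2 = k - 1, k * (n - 1)
    delta = n * ratio
    mu = math.sqrt(delta / df1)   # spread the noncentrality over df1 normals
    hits = 0
    for _ in range(draws):
        # noncentral chi-square with df1 d.f. and noncentrality delta
        chi_nc = sum((rng.gauss(0.0, 1.0) + mu) ** 2 for _ in range(df1))
        # central chi-square with df2 d.f.: a Gamma(df2/2, 2) draw
        chi_c = rng.gammavariate(df2 / 2.0, 2.0)
        if (chi_nc / df1) / (chi_c / df2) < f_crit:
            hits += 1
    return hits / draws

# Example 16: k = 5, n = 4, ratio = 2, so delta = 8; the text gives beta = 0.54
print(beta_fixed_effects(k=5, n=4, ratio=2.0, f_crit=3.06))
# Sample-size question: n = 8, ratio = 3 (delta = 24); Table 16 gives beta = 0.029
print(beta_fixed_effects(k=5, n=8, ratio=3.0, f_crit=2.64))
```

With 100 000 draws the Monte Carlo error is a few thousandths, enough to reproduce β = 0.54 for the example and the entries of Table 16.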

Uncertainty and Testing of the Estimated Parameters in the Fixed Effects Model
It is possible to derive estimators for the parameters μ and τ_i (i = 1,. . .,k) in the one-way ANOVA modeled by Eq. (74). The normality assumption on the errors is not needed to obtain an estimate by least squares; however, the solution is not unique, so the constraint of Eq. (75) is imposed. Using this constraint, we obtain the following estimates:

μ̂ = x̄ and τ̂_i = x̄_i − x̄, i = 1, . . . , k   (85)

where x̄_i and x̄ have been defined in Eqs. (76), (77), respectively. If the number of replicates, n_i, in each level is not equal (unbalanced ANOVA), then the constraint in Eq. (75) should be changed to Σ_{i=1}^{k} n_i τ_i = 0 and the weighted average of the x̄_i should be used instead of the unweighted average in Eq. (85).
Now, if we assume that the errors are NID(0,σ) and n_i = n, i = 1,. . .,k, the estimates of Eq. (85) are also the maximum likelihood ones. For unbalanced designs, the maximum likelihood solution is preferable because the least squares solution is biased. The reader interested in this subject should consult statistical monographs that treat the matter at an advanced level, such as Milliken and Johnson52 and Searle.53
The mean of the i-th level is μ_i = μ + τ_i, i = 1,. . .,k. In our case, with a balanced design, an estimator of μ_i would be μ̂_i = μ̂ + τ̂_i = x̄_i and, as the errors are NID(0,σ), x̄_i is NID(μ_i, σ/√n). Using MSE as an estimator of σ², Eq. (16) gives the confidence interval at the (1 − α)100% level:

[ x̄_i − t_{α/2, N−k} √(MSE/n) ; x̄_i + t_{α/2, N−k} √(MSE/n) ]   (86)

A (1 − α)100% confidence interval on the difference in the means of any two levels, say μ_i − μ_j, would be

[ (x̄_i − x̄_j) − t_{α/2, N−k} √(2 MSE/n) ; (x̄_i − x̄_j) + t_{α/2, N−k} √(2 MSE/n) ]   (87)

With the data in Example 16 (Table 14), a 95% confidence interval on the difference between fibers 1 and 2 is given by (489.75 − 602.25) ± 2.131 √(2 × 123.43/4), which is [−129.24, −95.76].
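With the numbers of the example (means from Table 14, MSE = 123.43 with 15 d.f., and t_{0.025,15} = 2.131 as quoted above), Eqs. (86) and (87) reduce to a few lines of arithmetic. The following is an illustrative Python sketch (function names are ours), not the MATLAB code of the supplementary material:

```python
import math

# Means per fiber (Table 14), n = 4 replicates, MSE = 123.43 with 15 d.f.
means = {1: 489.75, 2: 602.25, 3: 498.50, 4: 601.00, 5: 491.50}
n, mse = 4, 123.43
t_crit = 2.131          # t_{0.025,15}, as quoted in the text

def ci_level(i):
    """95% confidence interval on mu_i, Eq. (86)."""
    h = t_crit * math.sqrt(mse / n)
    return (means[i] - h, means[i] + h)

def ci_difference(i, j):
    """95% confidence interval on mu_i - mu_j, Eq. (87)."""
    d = means[i] - means[j]
    h = t_crit * math.sqrt(2 * mse / n)
    return (d - h, d + h)

lo, hi = ci_difference(1, 2)
print(f"[{lo:.2f}, {hi:.2f}]")
```

which reproduces the interval [−129.24, −95.76] computed above.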

Finally, the (1 − α)100% confidence interval on the overall mean is

[ x̄ − t_{α/2, N−k} √(MSE/(nk)) ; x̄ + t_{α/2, N−k} √(MSE/(nk)) ]   (88)

In the example, as there is an effect of the type of fiber, it makes no sense to compute this interval.
Rejecting the null hypothesis in the fixed effects model of the ANOVA implies that there are differences between the k levels, but the exact nature of the differences is not specified. To address this question, two procedures are used: orthogonal contrasts and multiple tests.

Case 1: Orthogonal contrasts


For example, with the data of Table 14, we would like to test the hypothesis H0: μ4 = μ5. A contrast is a linear combination of the means; in this case, the linear relation related to this hypothesis is x̄4 − x̄5 = 0. The contrast is then tested by comparing its sum of squares to the mean square error by a statistic that follows an F distribution with 1 and N − k d.f.
Each contrast is defined by the coefficients of the linear combination, in the previous case (0,0,0,1,−1). Two contrasts C = (c1,c2,. . .,ck) and D = (d1,d2,. . .,dk) are orthogonal if Σ_{i=1}^{k} c_i d_i = 0. There are numerous ways to choose the orthogonal contrast coefficients for a set of levels. Usually, something in the experiment should suggest which comparison(s) will be of interest.
To illustrate the procedure, and with a purely didactic purpose, we pose a fictitious case with the data of Table 14. In each problem, its peculiarities and the previous knowledge of the analyst will suggest the contrasts to be studied. The comparisons between the means per fiber type and the associated orthogonal contrasts proposed are

H0: μ4 = μ5                     C1 = x̄4 − x̄5
H0: μ1 + μ3 = μ4 + μ5           C2 = x̄1 + x̄3 − x̄4 − x̄5
H0: μ1 = μ3                     C3 = x̄1 − x̄3
H0: 4μ2 = μ1 + μ3 + μ4 + μ5     C4 = −x̄1 + 4x̄2 − x̄3 − x̄4 − x̄5

The sum of squares associated with each contrast C is

SS_C = n (Σ_{i=1}^{k} c_i x̄_i)² / Σ_{i=1}^{k} c_i²   (89)

For example, SS_C1 = 4(−1 × 601.00 + 1 × 491.50)²/2 = 23981, with 1 d.f. These sums of squares are incorporated into the ANOVA table as shown in Table 17.


Now, to test each of the hypotheses, it suffices to compare the corresponding Fcalc in Table 17 with the critical value F_{0.05,1,15} = 4.54. The conclusion is that, except for C3, the contrasts are significant according to Eq. (81). Thus, we should reject the hypothesis that fibers 4 and 5 give the same recovery. The hypothesis that the mean of the recovery rates for fibers 1 and 3 is the same as for fibers 4 and 5 is also rejected. Also, fiber 2 differs significantly from the mean of the other four, whereas there is no experimental evidence to reject that fibers 1 and 3 provide the same recovery. This is just an example of the wide range of possibilities for analyzing experimental results.
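Eq. (89) is easy to check numerically. The following Python sketch (names are ours; the means, contrast coefficients, and MSE are those of the example) reproduces the contrast sums of squares of Table 17:

```python
# Means per fiber (Table 14) and the contrast coefficients defined in the text;
# n = 4 replicates, MSE = 123.43.
means = [489.75, 602.25, 498.50, 601.00, 491.50]
n, mse = 4, 123.43

contrasts = {
    "C1": [0, 0, 0, 1, -1],
    "C2": [1, 0, 1, -1, -1],
    "C3": [1, 0, -1, 0, 0],
    "C4": [-1, 4, -1, -1, -1],
}

def ss_contrast(c):
    """Sum of squares of a contrast, Eq. (89): n*(sum c_i*xbar_i)^2 / sum c_i^2."""
    num = sum(ci * xi for ci, xi in zip(c, means)) ** 2
    return n * num / sum(ci * ci for ci in c)

for name, c in contrasts.items():
    ss = ss_contrast(c)
    print(name, round(ss, 1), "F =", round(ss / mse, 2))
```

Each F value is then compared with F_{0.05,1,15} = 4.54, as in the text.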

Case 2: Comparison of several means


Many different methods have been described that were specifically designed for the comparison of several means. Here, we will
describe the method of Newman-Keuls. The hypothesis test is the following:

H0: All the differences two by two are equal to zero


H1: At least one difference is non-null

Table 17 ANOVA table with orthogonal contrasts for composition of fibers for SPME.

Source                             Sum of squares   d.f.   Mean squares   Fcalc
Between fibers                     56551.3          4      14137.8        114.54
  C1: μ4 = μ5                      23981.0          1      23981.0        194.33
  C2: μ1 + μ3 = μ4 + μ5            10868.0          1      10868.0        88.07
  C3: μ1 = μ3                      153.1            1      153.1          1.24
  C4: 4μ2 = μ1 + μ3 + μ4 + μ5      21550.0          1      21550.0        174.63
Error (within fibers)              1851.5           15     123.4
Total                              58402.8          19

Table 18 Results of Newman-Keuls for multiple comparison test; data of SPME fibers.

Levels   Rank   Mean     Homogeneous groups
2        1      602.25   *
4        2      601.00   *
3        3      498.50       *
5        4      491.50       *
1        5      489.75       *

The symbols "*" aligned in columns indicate that the corresponding means are all equal two by two.

Table 19 Skeleton for using the corresponding tabulated values for the Newman-Keuls procedure.

t = k                  t = k − 1                ...   t = 2
x̄_r(1) − x̄_r(k)        x̄_r(1) − x̄_r(k−1)        ...   x̄_r(1) − x̄_r(2)
                       x̄_r(2) − x̄_r(k)          ...   x̄_r(2) − x̄_r(3)
                                                ⋱     ⋮
                                                      x̄_r(k−1) − x̄_r(k)
q_α(k, k(n − 1))       q_α(k − 1, k(n − 1))     ...   q_α(2, k(n − 1))

t denotes the difference of ranks plus one; the subscript r(i) indicates the i-th rank. k is the number of levels in the ANOVA, and the q_α are the tabulated values at significance level α.

The procedure consists of the following steps:

1. Sort the means per level, x̄_i, i = 1,2,. . .,k, in decreasing order, x̄_r(1) ≥ x̄_r(2) ≥ ⋯ ≥ x̄_r(k). The subindex r(i) refers to the rank of the corresponding mean, that is, the position that it occupies in the ordered list. For example, the means of Table 14 have the following ranks: r(1) = 2, r(2) = 4, r(3) = 3, r(4) = 5, and r(5) = 1. That means that the first one, which is 489.75, has rank 5, that is, it occupies the fifth position in the decreasing ordered list. Table 18 shows the ordered means and the ranks in the second column.
2. Create a table with the differences between the means from greatest to lowest, in columns identified by t, which is equal to the difference of ranks plus one. Table 19 contains all the possible contrasts two by two of the means.
3. Finally, each of the following hypotheses is tested:

H0: x̄_r(i) − x̄_r(i+t−1) = 0

H1: x̄_r(i) − x̄_r(i+t−1) > 0

with the statistic in Eq. (90)

R_t = q_α(t, k(n − 1)) √(MSE/n)   (90)

The values q_α(t, k(n − 1)) in Eq. (90) are tabulated; Table 20 shows some of them. They depend, as usual, on the significance level α, on t, and on the d.f. N − k of MSE. Further, the first term in R_t changes with the difference of ranks, t. The corresponding values are written in the last row of Table 19.
The critical region is made up by

CR = { x̄_r(i) − x̄_r(i+t−1) ≥ R_t }   (91)

The results obtained when applying the method of Newman-Keuls to the data of Example 16 are given in Table 21. The first column contains the means to be compared; for example, 1–2 indicates that the comparison is between x̄1 and x̄2. The second column contains the differences (without sign) between the means. The values of t (difference of ranks plus one) are in the third column; for example, t = 5 in the first row because, with the ranks in Table 18, x̄1 has rank 5 and the rank of x̄2 is 1. The next column contains the critical value (R_t) computed with the value of q in Table 20 and Eq. (90). The critical value R_t defines the critical region so that the analyst can decide whether the estimated difference is significant or not. At the 5% significance level, the resulting decision of rejecting or not rejecting the null hypothesis is shown in the last column of Table 21.
Usually, the result of this multiple comparison is presented as in the last column of Table 18, which is more graphic. The columns with the symbols "*" aligned indicate that the corresponding means are all equal two by two: in our example, on the one hand, the means x̄2 and x̄4 and, on the other hand, any pair among x̄1, x̄3, and x̄5, according to the decisions in Table 21. It is possible to conclude that there are two groups of fibers; as far as the recovery is concerned, fibers 2 and 4 provide results that are significantly equal to each other and greater than the recovery rates obtained with the other three fibers, which are in turn similar to each other.
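The whole procedure for the example can be sketched in a few lines (Python, standard library only; the function name is ours; the q values are taken from the n = 15 row of Table 20). It reproduces the decisions of Table 21:

```python
import math

# Fiber means (Table 14); n = 4 replicates, MSE = 123.43 with 15 d.f.;
# q_0.05(t, 15) for t = 2..5, read from Table 20.
means = {1: 489.75, 2: 602.25, 3: 498.50, 4: 601.00, 5: 491.50}
n, mse = 4, 123.43
q = {2: 3.01, 3: 3.67, 4: 4.08, 5: 4.37}

def newman_keuls():
    """Return the pairs of levels whose absolute difference exceeds
    R_t = q_alpha(t, 15) * sqrt(MSE/n), Eqs. (90)-(91)."""
    ranked = sorted(means, key=means.get, reverse=True)   # levels by decreasing mean
    rank = {lvl: pos for pos, lvl in enumerate(ranked)}   # 0-based ranks
    rejected = []
    for i in means:
        for j in means:
            if i < j:
                t = abs(rank[i] - rank[j]) + 1            # difference of ranks + 1
                r_t = q[t] * math.sqrt(mse / n)
                if abs(means[i] - means[j]) >= r_t:
                    rejected.append((i, j))
    return rejected

print(newman_keuls())
```

The rejected pairs are exactly those marked "Reject H0" in Table 21, leaving the two homogeneous groups {2, 4} and {1, 3, 5} of Table 18.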

Table 20 Values of q_α(t,n), the upper percentage points of the studentized range for α = 0.05.

n      t = 2    t = 3   t = 4   t = 5   t = 6   t = 7   t = 8   t = 9   t = 10
1 17.969 26.98 32.82 37.08 40.41 43.12 45.50 47.36 49.07


2 6.085 8.33 9.80 10.88 11.74 12.44 13.03 13.54 13.99
3 4.501 5.91 6.82 7.50 8.04 8.48 8.85 9.18 9.46
4 3.926 5.04 5.76 6.29 6.71 7.05 7.35 7.60 7.83
5 3.635 4.60 5.22 5.67 6.03 6.33 6.58 6.80 6.99
6 3.460 4.34 4.90 5.30 5.63 5.90 6.12 6.32 6.49
7 3.344 4.16 4.68 5.06 5.36 5.61 5.82 6.00 6.16
8 3.261 4.04 4.53 4.89 5.17 5.40 5.60 5.77 5.92
9 3.199 3.95 4.41 4.76 5.02 5.24 5.43 5.59 5.74
10 3.151 3.88 4.33 4.66 4.91 5.12 5.30 5.46 5.60
11 3.113 3.82 4.26 4.57 4.82 5.03 5.20 5.35 5.49
12 3.081 3.77 4.20 4.51 4.75 4.95 5.12 5.27 5.39
13 3.055 3.73 4.15 4.45 4.69 4.88 5.05 5.19 5.32
14 3.033 3.70 4.11 4.41 4.64 4.83 4.99 5.13 5.25
15 3.014 3.67 4.08 4.37 4.59 4.78 4.94 5.08 5.20
16 2.998 3.65 4.05 4.33 4.56 4.74 4.90 5.03 5.15
17 2.984 3.63 4.02 4.30 4.52 4.70 4.86 4.99 5.11
18 2.971 3.61 4.00 4.28 4.49 4.67 4.82 4.96 5.07
19 2.960 3.59 3.98 4.25 4.47 4.65 4.79 4.92 5.04
20 2.950 3.58 3.96 4.23 4.45 4.62 4.77 4.90 5.01

t, difference of ranks plus one; n, degrees of freedom of MSE.


Adapted from Sachs, L. Applied Statistics. A Handbook of Techniques; Springer-Verlag: New York, 1982.

Table 21 Results of the Newman-Keuls test applied to data of SPME fibers.

Contrast levels   Difference |x̄_i − x̄_j|   t   q_0.05(t,15)   Critical value, R_t     Decision according to Eq. (91)
1–2               112.50                   5   4.37           4.37 × 5.555 = 24.27    Reject H0
1–3               8.75                     3   3.67           3.67 × 5.555 = 20.39    No evidence to reject H0
1–4               111.25                   4   4.08           4.08 × 5.555 = 22.66    Reject H0
1–5               1.75                     2   3.01           3.01 × 5.555 = 16.72    No evidence to reject H0
2–3               103.75                   3   3.67           3.67 × 5.555 = 20.39    Reject H0
2–4               1.25                     2   3.01           3.01 × 5.555 = 16.72    No evidence to reject H0
2–5               110.75                   4   4.08           4.08 × 5.555 = 22.66    Reject H0
3–4               102.50                   2   3.01           3.01 × 5.555 = 16.72    Reject H0
3–5               7.00                     2   3.01           3.01 × 5.555 = 16.72    No evidence to reject H0
4–5               109.50                   3   3.67           3.67 × 5.555 = 20.39    Reject H0

H0: the difference is null, x̄_i − x̄_j = 0; H1: x̄_i − x̄_j ≠ 0; α = 0.05.

The Random Effects Model


In many cases, the factor of interest is a random variable as well, so that the chosen levels are in fact a sample of this random
variable and we want to extract conclusions about the population from which the sample comes. For example, in the case of
validating an analytical method, several laboratories will apply it to aliquot samples so that it is possible to decide what part of
the variability of the results is attributable to the change of laboratory and what part is due to the repetition of the procedure
inside the same laboratory. These are the concepts of reproducibility and repeatability. The same happens in the analytical control
of processes: It is necessary to split the variability observed between the one due to the measurement procedure and the one
assignable to the process.
The linear statistical model is

x_ij = μ + τ_i + ε_ij, with i = 1, 2, . . . , k; j = 1, 2, . . . , n   (92)

where τ_i and ε_ij are independent random variables. Note that the model is identical in structure to the fixed effects case (Eq. (74)), but the parameters have a different interpretation. If V(τ_i) = σ_τ², then the variance of any observation is

V(x_ij) = σ_τ² + σ²   (93)

The variances in Eq. (93) are called variance components, and the model, Eq. (92), is called the components of variance or random effects model. To test hypotheses in this model, we require that the ε_ij are NID(0,σ), that the τ_i are NID(0,σ_τ), and that τ_i and ε_ij are independent of one another.
The sum of squares equality SST = SSF + SSE still holds. However, instead of testing hypotheses about individual level effects, we test the hypothesis

H0: σ_τ² = 0
H1: σ_τ² > 0

If σ_τ² = 0, all levels are identical; if σ_τ² > 0, then there is variability between levels. Thus, under the null hypothesis, the ratio

F_calc = (SSF/(k − 1)) / (SSE/(N − k)) = MSF/MSE   (94)

is distributed as an F with k − 1 and N − k d.f. The expected values (means) of MSF and MSE are

E(MSF) = σ² + n σ_τ²   (95)

and

E(MSE) = σ²   (96)

Therefore, the critical region is

CR = { F_calc > F_{α, k−1, N−k} }   (97)

Power of the Random Effects ANOVA model


The power of the random effects ANOVA model is obtained from

1 − β = Pr{ F_{k−1, N−k} > F_{α, k−1, N−k} / λ² }   (98)

where λ² = 1 + nσ_τ²/σ². As σ² is usually unknown, we may either use a prior estimate or define the value of σ_τ² that we are interested in detecting in terms of the ratio σ_τ²/σ². An application to determine the number of replicates in a proficiency test can be seen in Example A15 of the Appendix and in ANOVA_section1024_live.mlx in the supplementary material.
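Eq. (98) involves only a central F distribution, so it too can be checked by Monte Carlo. In the following Python sketch (the function name and the design values are ours, purely illustrative; F_{0.05,4,15} = 3.06 as quoted earlier), setting σ_τ²/σ² = 0 recovers, as it must, a "power" equal to the significance level α:

```python
import random

def power_random_effects(k, n, ratio, f_crit, draws=100_000, seed=11):
    """Monte Carlo version of Eq. (98): 1 - beta = Pr{F_{k-1,N-k} > f_crit/lambda^2}
    with lambda^2 = 1 + n*ratio and ratio = sigma_tau^2/sigma^2."""
    rng = random.Random(seed)
    df1, df2 = k - 1, k * (n - 1)
    threshold = f_crit / (1.0 + n * ratio)
    hits = 0
    for _ in range(draws):
        # a central F draw as a ratio of scaled chi-square (Gamma) draws
        chi1 = rng.gammavariate(df1 / 2.0, 2.0)
        chi2 = rng.gammavariate(df2 / 2.0, 2.0)
        if (chi1 / df1) / (chi2 / df2) > threshold:
            hits += 1
    return hits / draws

# Consistency check: with sigma_tau^2 = 0, the "power" must equal alpha = 0.05
print(power_random_effects(k=5, n=4, ratio=0.0, f_crit=3.06))
# Hypothetical design: sigma_tau^2 = 2*sigma^2 gives a much larger power
print(power_random_effects(k=5, n=4, ratio=2.0, f_crit=3.06))
```

Scanning k and n in this way is one route to sizing a proficiency test, in the spirit of Example A15.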

Confidence Intervals for the Estimated Parameters in the Random Effects Model
In general, the mean value per level x̄_i has no more statistical meaning than being a sample of the random factor. But sometimes, as in the case of proficiency tests, this mean value is of interest for each participating laboratory. The variance of the mean value per level is theoretically equal to V(x̄_i) = σ_τ² + σ²/n. From Eqs. (95), (96), MSF/n (with k − 1 d.f.) estimates the variance of the mean per level. As a consequence, the 100(1 − α)% confidence interval is

[ x̄_i − t_{α/2, k−1} √(MSF/n) ; x̄_i + t_{α/2, k−1} √(MSF/n) ]   (99)

When calculating the variance of the overall mean, it is necessary to consider the variability provided by the factor, as the factor always acts. For example, when evaluating an analytical method, results without the variability attributable to the factor laboratory are not conceivable. The variance of the overall mean is V(x̄) = Σ_{i=1}^{k} V(x̄_i)/k², which is estimated by MSF/(nk), with k − 1 d.f., so that the 100(1 − α)% confidence interval is

[ x̄ − t_{α/2, k−1} √(MSF/(nk)) ; x̄ + t_{α/2, k−1} √(MSF/(nk)) ]   (100)

The random effects ANOVA is a model of practical interest because it allows attributing real meaning to many statements that seem
evident. For example, the samples distributed to laboratories in a proficiency test must be homogeneous. Strictly speaking, in most
of the occasions, it is impossible to assure homogeneity, but it is enough that the variability attributable to the change of sample is
significantly smaller than the one attributable to the procedure of analysis. This can be guaranteed by means of an ANOVA of
random effects.

Statistical Inference and Validation


Trueness
The trueness is a key concept; several international organizations are unifying its definition. For example, the definition “The
closeness of agreement between the average value obtained from a large series of test results and an accepted reference value” has
been adopted by the IUPAC (Inczédy et al.11, Chapter 18). The definition of the ISO7 exactly coincides with it, and it is the definition
accepted by the European Union in the Decision 2002/657/EC3 as far as the operation of the analytical methods and the
interpretation of results are concerned. The trueness is usually expressed in terms of bias, which combines all the components of
the systematic error, denoted by Δ in Eq. (1).
The decisions on the trueness of a method are made by hypothesis testing on the central value of a distribution; in case the random error can be assumed to have zero mean, they are in fact tests on the mean because, according to Eq. (1), the expected mean value for a series of measurements will be μ + Δ and the question reduces to testing whether Δ is zero or not (equivalently, whether x̄ is significantly equal to μ).
To use one or another test depends only on the information available about the distribution of the random error—its type
(normal, parametric, or unknown) and, in the case of normality, whether the variance is known or not. Some common cases are
given below:
1. To decide whether an analytical procedure fulfills trueness using a reference sample whose value is assumed to be true. If normal
data with known variance s2 are supposed, then the tests of Section “Hypothesis Test on the Mean of a Normal Distribution”
will be used.
2. To decide whether an analytical procedure has bias specifically positive (or negative) by using a reference sample whose value is
assumed to be true. If normal data are assumed, the one-tail test versions of Cases 1 and 2 of Section “Hypothesis Test on the
Mean of a Normal Distribution” will be of use.
3. In other occasions, the question of trueness is considered comparatively between two methods: “To decide if the difference in
means between them is significant or not, when they are applied to the same reference sample”. It is the two-tail test. The one-tail
case is “to decide whether one has bias of specific sign against the other”. In these tests (Section “Hypothesis Test on the
Difference in Two Means”), two experimental means are compared, one coming from applying n1 times the first method to the
reference sample (aliquot parts) and the other one coming from applying n2 times the second method. Under the normality
assumption, we will have to know whether the variances of both methods are known (for applying tests in rows 7 and 8 of
Table 4) or it is necessary to estimate them from the samples. In this second case we have to decide, with the test in row 15 of
Table 4, whether they are equal or different, for applying test in rows 9 or 10 in Table 4, or the statistic in Eq. (53), respectively.
4. Sometimes it is impossible to use similar enough samples. A solution is the use of the “test on the difference of means with
paired data” (Case 3 of Section “Hypothesis Test on the Mean of a Normal Distribution”). For example, in an on-line system, say
in flow analysis, before introducing a new faster method to indirectly determine the content of an analyte in wastewater, the test
can be used to decide if the new method maintains the trueness at the same level as the previous one. Once the new method is
ready and validated with reference samples, real samples must be measured. The difficulty is that we cannot be sure about the amount that is to be found because this may vary from day to day. In order to eliminate the effect of the sample (the factor "sample"), paired designs are used: Both methods are applied to aliquot parts of the same sample, and two series of paired results x_1i and x_2i are obtained when applying the old and new methods, respectively. Individual means make no sense here because we would be introducing variability due to the change of sample in each series. The correct procedure is to compute the differences d_i = x_1i − x_2i so that the differences are caused exclusively by the change of method and their mean estimates the bias attributable to the new method. It then suffices to apply a test on the mean; thus, the normality and independence hypotheses must be evaluated on the differences, which are also used to estimate the standard deviation.
This test for paired data is frequently used to evaluate the improvement achieved in a procedure by a technical variation, as is
the case of Example 9. The effect of the change on the trueness must always be evaluated in the range of concentrations in which
the procedure will be used. An alternative to the use of this test is the analysis of the pairs of data by a linear regression; in this
case, the regression method used should consider the existence of error in both axes.

The hypotheses of normality (Section “Goodness-of-Fit Tests: Normality Tests”), and the equality of variances, when applicable,
will have to be tested with the appropriate tests (Section “Hypothesis Test on the Comparison of Several Independent Variances”).
When a hypothesis test is to be posed, one may consider omitting some known information, for example, the variance. The effect is a loss of power, that is, with the same value of α and the same sample size, there is a greater probability of type II error. In other words, to maintain the power, larger sample sizes are needed to obtain the same experimental evidence; a calculation on this matter is in Case 2 of Section "Hypothesis Test on the Mean of a Normal Distribution". The same applies to the use of two-tail tests in place of the corresponding one-tail tests, and to the use of nonparametric tests that do not impose any type of distribution a priori.
Also, it is important to remember that the presence of outlier data tends to greatly increase the variance, so the tests become insensitive; that is to say, stronger experimental evidence is needed to reject the null hypothesis. The nonparametric alternative has, in general, a high cost in terms of power for the same significance level (or in terms of sample size). For this reason, its use is not advised unless strictly necessary. In addition, some nonparametric tests also assume hypotheses on the distribution of the values, for example, symmetry or unimodality.

Precision
The other very important criterion in the validation of a method is the precision. In the ISO 5725,7 the IUPAC (Inczédy et al.11,
Sections 2 and 3), and the 2002/657/EC European Decision,3 we can read “Precision, the closeness of agreement between
independent test results obtained under stipulated conditions”.
The precision is usually expressed as imprecision: the smaller the dispersion of the random component in Eq. (1), the more precise the procedure. It must be remembered that the precision depends solely on the distribution of the random errors and is not related to the reference value or the value assigned to the sample. In a first approach, it is computed as a standard deviation of the results; nevertheless, even the ISO 5725-5 recommends the use of a robust estimation.
Two measures, limits in a certain sense, of the precision of an analytical method are the reproducibility and the repeatability.
Repeatability is defined as precision under repeatability conditions. Repeatability conditions means conditions where indepen-
dent test results are obtained with the same method on identical test items in the same laboratory by the same operator using the
same equipment in short intervals of time. Repeatability as standard deviation is denoted as sr.
The repeatability limit, r, is the value below which lies, with a probability of (1 − α)100%, the absolute value of the difference between two test results obtained under repeatability conditions. The repeatability limit is given by

r = z_{α/2} √2 s_r   (101)

where z_{α/2} is the upper α/2 percentage point of the standard normal distribution.
Reproducibility is defined as precision under reproducibility conditions. Reproducibility conditions means conditions where test results are obtained with the same method on identical test items in different laboratories with different operators using different equipment. Reproducibility as standard deviation is denoted s_R.
The reproducibility limit, R, defined in Eq. (102), is the value below which lies, with a probability of (1 − α)100%, the absolute value of the difference between two test results obtained under reproducibility conditions:

R = z_{α/2} √2 s_R   (102)

When estimating s_r (or s_R) with n < 10, a correction factor7 should be applied to Eq. (6).
Notice that both R and r define, in fact, two-sided tolerance intervals for the difference of two measurements.
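For α = 0.05, z_{α/2} = 1.96 and Eqs. (101), (102) give the familiar factor 1.96 × √2 ≈ 2.77 ≈ 2.8 used in ISO 5725. A minimal sketch (function names are ours):

```python
import math

Z_975 = 1.96   # upper 2.5% point of the standard normal (alpha = 0.05)

def repeatability_limit(s_r, z=Z_975):
    """Eq. (101): r = z_{alpha/2} * sqrt(2) * s_r."""
    return z * math.sqrt(2.0) * s_r

def reproducibility_limit(s_R, z=Z_975):
    """Eq. (102): R = z_{alpha/2} * sqrt(2) * s_R."""
    return z * math.sqrt(2.0) * s_R

# The "2.8 s_r" rule of thumb comes from 1.96 * sqrt(2) = 2.77
print(round(repeatability_limit(1.0), 2))
```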
The ISO introduces the concept of intermediate precision when only some of the factors described in the reproducibility conditions are varied. A particularly interesting case is when the "internal" factors of the laboratory (analyst, instrument, day) are varied, which in the Commission Decision3 is called intralaboratory reproducibility.
One of the causes of the ambiguity when defining precision is the laboratory bias. When the method is applied only in a
laboratory, the laboratory bias is a systematic error of that laboratory. If the analytical method is evaluated in general, the laboratory
bias becomes a part of the random error: to change the laboratory contributes to the variance expected for a determination
conducted with that method in any laboratory.
The most eclectic position is the one described in the ISO 5725 that declares “The laboratory bias is considered constant when
the method is used under repeatability conditions but is considered as a random variable if series of applications of the method are
made under reproducibility conditions”.
With these premises, we can realize that to evaluate the precision of an analytical method is equivalent to estimating the variance
of the random error in the results and that the discrepancies that can appear when establishing the sources of variability must be
explicitly identified, for example, the laboratory bias.
The precision of two methods can be compared by a hypothesis test on the equality of variances, under the normality
assumption, that is, an F-test (Section “Hypothesis Test on the Variances of Two Normal Distributions”).
Another usual problem is to decide whether the variance observed can be considered significantly equal or not to an external
value, which is decided by using a w2 test (Section “Hypothesis Test on the Variance of a Normal Distribution”).
It is common that the lack of control of a concrete aspect of an analytical procedure is the origin of a great variability. If the experimental conditions are not stable, we will have additional variability in the determinations. The F-test permits deciding whether the precision improves significantly when an optimization is carried out.
In fact, many improvements in the procedures are the consequence of acting after the identification of some causes of variability
in the results and their quantification. More details about this aspect of control and improvement of the precision are given in the
section dedicated to the ruggedness of chemical analysis.
The technique used in the random effects ANOVA is also the adequate technique to split the variance of each experimental data
into addends, which in turn are specially adapted to estimate the repeatability and the reproducibility of an analytical method when
an interlaboratory test comparison has been carried out. In the following, the use of an ANOVA to estimate reproducibility and
repeatability in a proficiency test is briefly explained.
There is no doubt that a good analytical procedure has to be insensitive to the laboratory where it is conducted. To decide
whether the “change of laboratory” has any effect, k laboratories apply a procedure to aliquot samples; each laboratory makes n
determinations. In the terminology of the ANOVA, we have a random factor (the laboratory) at k levels and n replicates in each level.
It has already been said that in general, it is not necessary to have the same number of replicates in all the levels.

We denote by x_ij the experimental results, where i = 1,. . .,k identifies the laboratory and j = 1,. . .,n the replicate.
Fig. 7 is a skeleton of Eqs. (93)–(96) and shows how to compute an estimate of the variance of the random variable ε in Eq. (92). If the analytical procedure is well defined, the k estimates s_i² are expected to be approximately equal and to gather the variability due to the use of the analytical method by only one laboratory. In these conditions, the pooled variance s_p² is a joint ("pooled") estimate of that variance, which is, by definition, the repeatability of the method expressed as standard deviation (ISO 5725):

s_r = √V̂(ε) ≅ s_p   (103)

Fig. 7 ANOVA of random effects for an interlaboratory study.



From the same data we can obtain k estimates of the bias Δ_i (Fig. 7, top) and then the variance of the laboratory bias, considering this bias as a random variable. Taking into account the quantities estimated by the variances described in Fig. 7, one obtains the following expression for the interlaboratory variance:

V̂(Δ) ≅ s_x̄² − s_p²/n   (104)

which, linked to Eq. (1), provides the following estimate of the reproducibility as standard deviation (ISO 5725):

s_R ≅ √( V̂(Δ) + V̂(ε) )   (105)

In the ANOVA, the null hypothesis is that V(Δ) = 0 (i.e., there is no effect of the factor), and the alternative is that at least one laboratory has non-null bias (there is an effect of the factor).
The conclusion of the ANOVA is obtained by deciding whether the two variances, n s_x̄² and s_p², can be considered significantly equal. To decide it, an F-test is applied. The logic is clear: If there is no laboratory effect, V(Δ) should be significantly zero and, thus, both variances are equal or, in other words, they estimate the same quantity.
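The estimates of Eqs. (103)–(105) follow from a balanced one-way layout with a few lines of code. The following Python sketch (the function name and the small data set are ours, purely illustrative) computes s_r and s_R via MSE and MSF, truncating V̂(Δ) at zero when MSF < MSE:

```python
import math
from statistics import mean

def precision_from_anova(data):
    """Estimate repeatability s_r and reproducibility s_R from a balanced
    one-way (laboratory) layout: s_r^2 = MSE, s_R^2 = s_r^2 + (MSF - MSE)/n."""
    k = len(data)                       # number of laboratories
    n = len(data[0])                    # replicates per laboratory
    lab_means = [mean(lab) for lab in data]
    grand = mean(lab_means)
    msf = n * sum((m - grand) ** 2 for m in lab_means) / (k - 1)
    mse = sum((x - m) ** 2
              for lab, m in zip(data, lab_means) for x in lab) / (k * (n - 1))
    s_r2 = mse                          # V(eps), Eq. (103)
    s_l2 = max((msf - mse) / n, 0.0)    # V(Delta), Eq. (104), truncated at zero
    return math.sqrt(s_r2), math.sqrt(s_r2 + s_l2)   # Eq. (105)

# Hypothetical results from k = 3 laboratories, n = 2 replicates each:
data = [[10.1, 10.3], [10.8, 10.6], [9.9, 10.0]]
s_r, s_R = precision_from_anova(data)
print(round(s_r, 3), round(s_R, 3))
```

The between-laboratory component inflates s_R well above s_r whenever MSF greatly exceeds MSE, which is exactly the situation detected by the F-test just described.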
In practice, the expression for the computation of the power of the ANOVA with random effects (Eq. (98)) is useful in deciding
the number of laboratories that should participate, k, and the number of replicated determinations, n, that each one must conduct.
It is essential to remember that an ANOVA requires normal distribution of the residuals and equality of the variances s_1², s_2², …, s_k².
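In code, the estimates in Eqs. (103)–(105) reduce to a few lines for a balanced design. The following Python sketch (the function name and the data layout are our own choices, not from the article) reproduces the nonrobust values of the first column of Table 25 when applied to the five series of Table 23:

```python
from statistics import mean

def anova_precision(data):
    """Repeatability s_r and reproducibility s_R (Eqs. 103-105) from a
    balanced one-way random-effects layout: data[i] holds the n
    replicates of laboratory (or series) i."""
    k, n = len(data), len(data[0])
    lab_means = [mean(lab) for lab in data]
    grand = mean(lab_means)
    # Pooled within-laboratory variance: s_p^2 = MSE, so s_r = s_p (Eq. 103)
    s2_p = sum(sum((x - m) ** 2 for x in lab)
               for lab, m in zip(data, lab_means)) / (k * (n - 1))
    # Variance of the laboratory means and interlaboratory component (Eq. 104)
    s2_xbar = sum((m - grand) ** 2 for m in lab_means) / (k - 1)
    v_between = max(s2_xbar - s2_p / n, 0.0)  # truncated at 0 when negative
    return s2_p ** 0.5, (v_between + s2_p) ** 0.5  # s_r and s_R (Eq. 105)

# Illustrative data: the five series of Table 23 (k = 5, n = 4)
series = [[13.50, 13.40, 13.47, 13.49],
          [13.50, 13.51, 13.35, 13.35],
          [13.70, 13.71, 13.76, 13.80],
          [13.04, 13.03, 15.93, 13.04],
          [13.48, 13.47, 13.92, 13.46]]
s_r, s_R = anova_precision(series)  # both come out as 0.657 here
```

For unbalanced designs the ISO norm replaces n by an averaged number of replicates; this helper assumes a balanced layout.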
When the number of replicates is two (n = 2), a common way of carrying out the interlaboratory analysis is Youden's graph,54 which displays the trueness and precision of each laboratory. Actually, Youden's graph is nothing but the graphical representation of an ANOVA, as shown in Kateman and Pijpers.55 In addition to being used for comparing the quality of the laboratories, Youden's graph can be used to compare two methods of analysis in terms of their laboratory bias.
An approach for the comparison of two methods in the intralaboratory situation has been proposed by Kuttatharmmakul et al.56 Instead of the reproducibility, as included in Fig. 7 and the ISO guidelines, the (operator + instrument + time)-different intermediate precision is considered in the comparison.
In the case of precision, the effect of outlying data is devastating; hence, a careful analysis to detect those outliers is essential. In general, more than one test is needed (the usual ones are those of Dixon, Grubbs, and Cochran), especially to accept the hypotheses of the ANOVA made for the determination of repeatability and reproducibility. In view of these difficulties, the AMC5,6 advises the use of robust methods to evaluate precision and trueness and for proficiency testing. This path is also followed in the new ISO norm on reproducibility and repeatability.

Statistical Aspects of the Experiments to Determine Precision


The analysis of the data involves three steps:

1. Critical examination of the data, in order to identify outliers or other irregularities and to verify the suitability of the model.
2. Computation, for each level of concentration, of the preliminary values of precision and mean.
3. Establishment of the final values of precision and means, including a relation between precision and the level of concentration when the analysis indicates that such a relation may exist.

The analysis includes a systematic application of statistical tests for detecting outliers, and a great variety of such tests are available
from the literature and could be used for this task.

Consistency Analysis and Incompatibility of Data


From the data collected in a specific number of levels, a decision must be taken about certain individual results or values that seem
to be “different” from those of the rest of laboratories or that can modify the estimations. Specific tests are used for the detection of
these outlier numerical results.

Case 1: Elimination of data


This is the classic procedure based on detecting and, when appropriate, eliminating the outlying data. The tests are of two types. Cochran's test concerns the interlevel variability of the factor and should be applied first; its objective is to detect an anomalous variance in one or several of the levels of the factor. Cochran's test has already been described in Section "Hypothesis Test on the Comparison of Several Independent Variances".
Grubbs' test is applied afterward. It is basically a test on the intralevel variability to discover possible outlying individual data. It can be used (if n_i > 2) for those levels in which Cochran's test has led to the suspicion that the interlevel variation is attributable to an individual result. It is applied in two stages:

1. Detection of a unique outlying observation (single Grubbs’ test)


In a data set x_i (i = 1, 2, …, n) sorted in increasing order, to test whether the greatest observation, x_n, is incompatible with the rest, the following statistic is computed:

Table 22 Critical values for Grubbs’ test.

n       One largest or one smallest          Two largest or two smallest

        a = 0.05         a = 0.01            a = 0.05         a = 0.01
4 1.481 1.496 0.0002 0.0000


5 1.715 1.764 0.0090 0.0018
6 1.887 1.973 0.0349 0.0116
7 2.020 2.139 0.0708 0.0308
8 2.126 2.274 0.1101 0.0563
9 2.215 2.387 0.1492 0.0851
10 2.290 2.482 0.1864 0.1150
11 2.355 2.564 0.2213 0.1448
12 2.412 2.636 0.2537 0.1738
13 2.462 2.699 0.2836 0.2016
14 2.507 2.755 0.3112 0.2280
15 2.549 2.806 0.3367 0.2530
16 2.585 2.852 0.3603 0.2767
17 2.620 2.894 0.3822 0.2990
18 2.651 2.932 0.4025 0.3200
19 2.681 2.968 0.4214 0.3398
20 2.709 3.001 0.4391 0.3585

Adapted with permission from ISO 5725-2, Accuracy, Trueness and Precision of Measurement Methods and Results; Genève, 1994; p. 22.

G_{n,calc} = (x_n − x̄)/s    (106)

Conversely, to verify whether the smallest observation, x_1, is significantly different from the rest, the statistic G_1 is computed as

G_{1,calc} = (x̄ − x_1)/s    (107)

In Eqs. (106) and (107), x̄ and s are, respectively, the mean and the standard deviation of the x_i. To decide whether the greatest or smallest value is significantly different from the rest at the 100a% significance level, the values obtained in Eqs. (106) and (107) are compared with the corresponding critical values listed in Table 22.
The decision distinguishes two "anomaly levels":
(a) If G_{i,calc} < G_{0.05,i}, with i = 1 or i = n, accept that the corresponding x_1 or x_n is similar to the rest.
(b) If G_{0.05,i} < G_{i,calc} < G_{0.01,i}, with i = 1 or i = n, the corresponding x_1 or x_n is considered a straggler.
(c) If G_{0.01,i} < G_{i,calc}, with i = 1 or i = n, the corresponding x_1 or x_n is incompatible with the rest of the data of the same level (statistical outlier).
2. Detection of two outlying observations (double Grubbs’ test)

Sometimes it is necessary to verify that there are not two extreme values (very large or very small) incompatible with the others. In the case of the two greatest observations, x_n and x_{n−1}, the statistic G is computed as

G = s²_{n−1,n} / s_0²    (108)

where s_0² = Σ_{i=1}^{n} (x_i − x̄)² and s²_{n−1,n} = Σ_{i=1}^{n−2} (x_i − x̄_{n−1,n})², with x̄_{n−1,n} = (1/(n−2)) Σ_{i=1}^{n−2} x_i the mean of the n − 2 smallest observations.
Similarly, it is possible to decide jointly on the two smallest observations, x_1 and x_2, by means of the following statistic:

G = s²_{1,2} / s_0²    (109)

where s²_{1,2} = Σ_{i=3}^{n} (x_i − x̄_{1,2})², with x̄_{1,2} = (1/(n−2)) Σ_{i=3}^{n} x_i.
The decision rule is analogous to that of the single extreme value, but with the corresponding critical values in Table 22.
In general, standards such as ISO 57257 propose inspecting the origin of the anomalous results and, if no assignable cause exists, eliminating the incompatible ones and keeping the stragglers, indicating their condition with an asterisk.
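The statistics of Eqs. (106)–(109) are straightforward to compute. The following sketch (function names are ours) applies them to the data of Table 23 treated as a unique series, as in Example 17 below:

```python
from statistics import mean, stdev

def grubbs_single(x):
    """Single Grubbs' statistics, Eqs. (106) and (107)."""
    x = sorted(x)
    m, s = mean(x), stdev(x)
    return (m - x[0]) / s, (x[-1] - m) / s   # G_1 and G_n

def grubbs_double_high(x):
    """Double Grubbs' statistic for the two largest values, Eq. (108)."""
    x = sorted(x)
    m = mean(x)
    s0_sq = sum((xi - m) ** 2 for xi in x)
    head = x[:-2]                            # the n - 2 smallest observations
    mh = mean(head)
    return sum((xi - mh) ** 2 for xi in head) / s0_sq

# Data of Table 23 treated as a unique series of 20 results (Example 17)
data = [13.50, 13.50, 13.70, 13.04, 13.48,
        13.40, 13.51, 13.71, 13.03, 13.47,
        13.47, 13.35, 13.76, 15.93, 13.92,
        13.49, 13.35, 13.80, 13.04, 13.46]
g1, gn = grubbs_single(data)   # G_1 is about 0.94, G_n about 3.89
```

Since G_n exceeds the critical value G_{0.01,20} = 3.001 of Table 22, the value 15.93 is flagged as a statistical outlier.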

Table 23 Data of Example 17.

Series 1 Series 2 Series 3 Series 4 Series 5

13.50 13.50 13.70 13.04 13.48


13.40 13.51 13.71 13.03 13.47
13.47 13.35 13.76 15.93 13.92
13.49 13.35 13.80 13.04 13.46

Table 24 Robust and nonrobust estimates of the centrality and dispersion parameters (data of Table 23).

                                     With all data (n = 20)    Without 15.93 (n = 19)

Nonrobust procedures
Mean, x̄                             13.60                     13.47
Standard deviation, s                0.60                      0.25
Robust procedures
Median                               13.49                     13.48
H15, centrality parameter            13.50                     13.48
MAD/0.6745                           0.26                      0.21
H15, dispersion                      0.27                      0.24

Example 17: For didactic purposes, to apply Grubbs' test and to verify the effect of outliers, the data of Table 23 have been considered as a unique series of 20 results.
The greatest value is 15.93 and the smallest is 13.03, with s = 0.60 and x̄ = 13.60. Eq. (106) gives G_{20,calc} = 3.889 and Eq. (107) gives G_{1,calc} = 0.942. By consulting the critical values in Table 22, G_{0.05,20} = 2.709 and G_{0.01,20} = 3.001; therefore, according to the decision rule in Case 1 (single Grubbs' test), the value 15.93 should be considered different from the rest.
Applying the test again, with 19 data, the greatest value now is 13.92 and the smallest is still 13.03, with G_{19,calc} = 1.804 and G_{1,calc} = 1.785. As the tabulated values are G_{0.05,19} = 2.681 and G_{0.01,19} = 2.968, there is no evidence to say that either of the extreme values is different from the rest. Table 24 contains the mean and standard deviation, with and without the value 15.93. A large effect is observed on the standard deviation, which is reduced by more than 50% when the point is removed.
Grubbs’ test can also be applied to the mean values per level. In practice, Grubbs’ test is also used to restore the equality of
variances in the ANOVA when the homogeneity of variances is rejected (section “Hypothesis Test on the Comparison of Several
Independent Variances”). The work by Ortiz et al.47 contains a complete analysis with sequential application of Cochran’s, Bartlett’s,
and Grubbs’ tests.

Case 2: Robust methods


The procedure described in the previous section focuses on the detection of anomalous data within a set of results. Nevertheless, eliminating these data is not advisable when the variability of the analytical procedure is to be evaluated, because the procedure is sensitive to the values present, that is to say, it depends on the data that have been eliminated (Eqs. (106)–(109) can lead to elimination of data in successive stages because of the reduction of the variance), and because the actually attainable variance is underestimated.
As previously indicated, the values of repeatability (sr) and reproducibility (sR) are determined by means of an ANOVA whose
validity depends on whether the hypotheses of normality and homogeneity of variances are fulfilled. The robust methodology
proposed in this section avoids these limitations. Its technical details can be found in Hampel et al.57 and Huber.58
An alternative to the procedures based on the elimination of outlying data, as set out in the ISO 5725-5 norm, consists of using the H15 estimator proposed by Huber (c = 1.5 and "Proposal 2" scale, Huber58), recommended by the Analytical Methods Committee5,6 and accepted in the Harmonized Protocol.59 It is an estimator whose influence function is monotone and limits the influence of the anomalous data by "moving them" toward the position of the majority, while still granting them the maximum (but bounded) influence. This is carried out by transforming the original data by means of the function

Ψ_{m,s,c}(x) = max[m − cs, min(m + cs, x)]    (110)

where m and s are the centrality and dispersion parameters, which must be iteratively estimated. The function in Eq. (110) is
represented in Fig. 8.
The estimate is precisely a generalization of the maximum-likelihood estimate. It is asymptotically optimal for high-quality data, that is, data with little contamination and not very different from data following a Student's t distribution with three d.f. Remember that Hampel et al.57 have shown that Student's t distributions with between 3 and 9 d.f. reproduce high-quality experimental data, and that for t_3 the efficiency of the mean and the standard deviation is 50% and 0%, respectively. Therefore, in practice there is a need for robust estimates even for high-quality empirical data (such as those obtained with present analytical methods).


Fig. 8 Function Ψ_{m,s,c}(x).

Table 25 Robust and nonrobust estimates of the repeatability and reproducibility with data of Table 23.

ANOVA                      With all data (n = 20)    Without 15.93 (n = 19)    Without series 4 and 13.92 (n = 15)

Nonrobust procedure
Fcalc (P-value)            0.22 (0.92)               17.02 (<5 × 10⁻⁵)         24.91 (<5 × 10⁻⁵)
MSF (d.o.f.)               0.094 (4)                 0.229 (4)                 0.083 (3)
MSE (d.o.f.)               0.431 (15)                0.013 (14)                0.003 (11)
sR                         0.657                     0.260                     0.153
sr                         0.657                     0.116                     0.058
P-value, Cochran's test    8.9 × 10⁻⁹                0.001                     0.093
P-value, Bartlett's test   3.9 × 10⁻⁸                0.001                     0.100
P-value, Levene's test     0.53                      0.61                      0.005
Robust procedures
Robust sR                  0.281                                               0.172
Robust sr                  0.072                                               0.072

The H15 estimator provides enough protection against a high concentration of data that are abnormally large but near the correct data. Nevertheless, clearly anomalous data are not rejected by the H15 estimator; they maintain the maximum, though bounded, influence. This produces an avoidable loss of efficiency of the H15 estimator of between 5% and 15% when the proportion of anomalous data present is also between 5% and 15% (rather usual percentages in routine analyses). In order to avoid this limited weakness, robust estimators such as the median and the median of absolute deviations (MAD) (Eq. (111)) are necessary, at least in the first step of the calculation, to reliably identify most of the "suitable" data.

MAD = median{|x_i − median{x_i}|}    (111)

The robust procedure obtained when adapting the H15 estimator to the problem of estimating repeatability and reproducibility as posed in the ISO norm consists of two stages, and it has been followed here exactly as proposed in Sanz et al.60 As in the parametric procedure, it uses the mean and standard deviation of the data. Therefore, once the robust procedure is applied, the data necessary to estimate the reproducibility or the intermediate precision are at hand.
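A minimal sketch of the iterative H15 calculation, in the style of "Algorithm A" of ISO 5725-5 (the function name, tolerance, and iteration cap are our assumptions; 1.134 is the consistency factor associated with c = 1.5):

```python
from statistics import mean, median

def h15(x, c=1.5, tol=1e-6, max_iter=200):
    """Huber H15 robust centrality m and dispersion s: iterate the
    transformation of Eq. (110), starting from the median and MAD/0.6745."""
    m = median(x)
    s = median([abs(xi - m) for xi in x]) / 0.6745   # MAD-based start, Eq. (111)
    n = len(x)
    for _ in range(max_iter):
        lo, hi = m - c * s, m + c * s
        xw = [min(max(xi, lo), hi) for xi in x]      # pseudo-values, Eq. (110)
        m_new = mean(xw)
        s_new = 1.134 * (sum((xi - m_new) ** 2 for xi in xw) / (n - 1)) ** 0.5
        converged = abs(m_new - m) < tol and abs(s_new - s) < tol
        m, s = m_new, s_new
        if converged:
            break
    return m, s

# The 20 values of Table 23 as a single series
data = [13.50, 13.50, 13.70, 13.04, 13.48,
        13.40, 13.51, 13.71, 13.03, 13.47,
        13.47, 13.35, 13.76, 15.93, 13.92,
        13.49, 13.35, 13.80, 13.04, 13.46]
m_rob, s_rob = h15(data)
```

Applied to these data, the sketch converges near the H15 values of Table 24 (centrality about 13.50); small differences in the dispersion can arise from the exact variant of the algorithm used.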
In order to verify the utility of these robust procedures, with the same data of Table 23 considered as a unique series of 20 values, the median and the centrality parameter of the H15 estimator have been written down in Table 24. They are very similar, both with 20 and with 19 values, to the nonrobust estimates computed without the outlier. Likewise, the robust dispersion parameters, MAD/0.6745 and H15, hardly differ when considering 20 or 19 data and are similar to the standard deviation obtained after applying Grubbs' method and repeating the calculations without the outlier. For this reason, it is a good strategy to systematically apply robust procedures together with the classic ones. A difference between the results is an indication of the presence of outlying data, in which case the robust estimations will have to be used.
The effect, and therefore the advantage, of the robust procedures is much more remarkable when a random-effects ANOVA is evaluated, for example, to estimate the reproducibility and repeatability of a method by means of an interlaboratory test such as the one described in Fig. 7. To show this, we will use the data of Table 23, this time considering their structure of levels of the factor (k = 5) and replicates (n = 4).
The values of reproducibility and repeatability should not be accepted if the homogeneity-of-variances assumption of the ANOVA is not fulfilled. In this case, it is necessary to verify whether some of the levels contain outlying data. The first column of Table 25 shows that the ANOVA with all the data is not acceptable because the variances cannot be considered equal (rejection in the tests on variance homogeneity). In addition, it is observed that the anomaly in the data causes the estimates s_R and s_r to be equal and very different from the robust estimates.

Once the value 15.93 of series 4 is removed, the ANOVA (column 2 of Table 25) points to a significant effect, but a lack of variance homogeneity is still observed. Nevertheless, the new estimates of s_R and s_r are more similar to those obtained with the robust procedure.
The lack of equality of variances forces one to eliminate series 4, which has a very different variance (smaller than the others), and later the value 13.92 of series 5. The final result of this sequential process is shown in the third column of Table 25: the ANOVA shows a significant effect, the homogeneity of variances can be accepted, and the estimates of the reproducibility and the repeatability are 0.153 and 0.058, similar to those obtained with the robust procedure without series 4. The values s_R and s_r can be too small owing to the elimination of data, with the risk of underestimates that are not realistic and thus impossible for the laboratories to fulfill. For this reason, it is advised5–7 to avoid reducing the sample and to keep the initial robust estimates.
As the presence of outliers in experimental work is unavoidable, robust statistical methodology has become consolidated as an essential tool in chemical analysis. Further information can be found, for example, in the chapter of this book dedicated to robust statistical techniques.

Accuracy
According to the IUPAC (Inczédy et al.,11 Sections 2–3), the ISO,7 and the Directive of the European Union (Definition 1.1 of the Commission Decision3), accuracy is defined as the "closeness of agreement between a test result and the accepted reference value". It is estimated by determining trueness and precision. Evidently, this definition brings together the systematic and random errors, because for an individual determination x_i − m = (x_i − x̄) + (x̄ − m) = e + D.
In practice, it is unreasonable to think that an analytical procedure has no bias; experimentally, we can decide about the hypothesis of null bias. If the bias is significant, it is possible to correct the measurement by subtracting the value D. However, this implies an increase in the variance of the final result, because D is estimated from experimental replicates and therefore has uncertainty. For this reason, when the uncertainty of a measurement is expressed, it is usual to include a term that takes the bias into account, in a form similar to Eq. (105). For a detailed treatment of this question, consult the EURACHEM/CITAC guide.1

Ruggedness
The ruggedness of a method is defined as its capacity to maintain trueness and precision over time. The same applies to the robustness of a reference material or any other reagent.
Ruggedness also refers to the susceptibility of an analytical method to changes in experimental conditions, which can be expressed as a list of sample materials, analytes, storage conditions, environmental conditions, and/or sample preparation conditions under which the method can be applied as presented or with specified minor modifications.3
The study of ruggedness can be approached using two different statistical methodologies. One consists of using the well-known control charts (confidence intervals on the mean, the variance, or the range of the measured parameter) and continuously recording the results obtained on known samples over time. This type of "a posteriori" control is essential to maintain the quality (precision, trueness, capability of detection, etc.) of a measurement method and to establish alarm mechanisms when an observed drift can alter the quality of the procedure, affecting the value of the analytical results. There is also a chapter in this book that deals specifically with control charts.
The other approach to the problem of ruggedness involves evaluating “a priori” the variability expected in the analytical
procedure and identifying the sources of that variability.
Before routinely using a procedure, the effect of small changes in the reagents, in the working conditions, or in the specifications of its protocol must have been verified. It can happen that small changes in the volume of the extracting reagent do not lead to great variations in the response, whereas a small variation in, say, pH does. One way of knowing and controlling this quality criterion is to make small changes in the potentially influential factors and observe the effect on the response.
The influence of each factor should not be analyzed separately, since this is neither methodologically adequate nor realistic, because in practice unforeseeable combinations of all the factors will occur that can affect the results. Instead, the methodology of the design of experiments should be used; details are given in the corresponding chapters of this collection. As the number of factors that potentially affect the response is large, highly fractionated two-level factorial designs have to be used (to reduce, e.g., the 2⁷ = 128 different experiments needed in a complete factorial design for seven factors). Plackett–Burman designs and D-optimal designs have proven to be useful tools in ruggedness analysis.61–67 For more alternatives, consult the chapter dedicated to these strategies.
Example 18: An analysis of ruggedness of a procedure of extraction of three sulfonamides is carried out. The seven considered
factors are buffer solution, pH, methanol as extracting agent, extraction cycles, petroleum benzin, volume of elution, and
evaporation mode. A Plackett-Burman design has been proposed to estimate the effects of the factors by fitting a linear model for
each sulfonamide:

y = b_0 + b_1x_1 + b_2x_2 + … + b_7x_7    (112)

where xi denotes the i-th factor (Table 26) and y represents the response to be modeled, which is the chromatographic peak area
for each of the three sulfonamides. The details about the experimental domain can be seen in Table 26, where the nominal level is
codified as “−” and the extreme level as “+”.

Table 26 Experimental factors with nominal (−) and extreme (+) levels selected for a Plackett-Burman design
for seven factors (ruggedness analysis of an extraction procedure of sulfonamides).

Level

Factor (units) − +

x1: buffer solution (mL) 1.0 1.5


x2: pH 4.5 4.8
x3: extracting agent (mL) 20 25
x4: extraction cycles One Two
x5: petroleum benzin (mL) 20 25
x6: volume of elution (mL) 7.0 8.0
x7: evaporation mode To dryness Evaporated until 1 mL

Table 27 Plackett-Burman design for the seven factors of Table 26.

Factors Responses

Run   x1   x2   x3   x4   x5   x6   x7   SDZ            SMT            SMP

1     +1   +1   +1   +1   +1   +1   +1   10.50, 10.30   10.50, 10.30   10.50, 10.30
2     +1   +1   −1   +1   −1   −1   −1   5.87, 7.56     6.31, 7.74     8.91, 8.72
3     +1   −1   +1   −1   +1   −1   −1   7.55, 8.08     8.88, 11.01    7.06, 8.23
4     +1   −1   −1   −1   −1   +1   +1   3.82, 5.61     6.58, 7.20     6.11, 7.53
5     −1   +1   +1   −1   −1   +1   −1   6.74, 8.19     10.00, 10.63   9.61, 10.38
6     −1   +1   −1   −1   +1   −1   +1   6.89, 7.63     5.35, 6.99     4.29, 6.40
7     −1   −1   +1   +1   −1   −1   +1   8.21, 8.70     7.86, 8.56     10.24, 9.94
8     −1   −1   −1   +1   +1   +1   −1   8.56, 5.85     6.95, 8.11     7.02, 7.96

The values of the three responses are the areas under the chromatographic peak (in a.u.) of sulfadiazine (SDZ), sulfamethazine (SMT), and sulfamethoxypyridazine (SMP). All
experiments are replicated twice.

Table 28 Estimated coefficients of the linear model (Eq. (112)) fit for each sulfonamide by means of a Plackett-Burman design.

Sulfadiazine Sulfamethazine Sulfamethoxypyridazine

Coefficient P-value Coefficient P-value Coefficient P-value

b0 7.504 <0.0001a 8.309 <0.0001a 8.325 <0.0001a


b1 −0.093 0.726 0.252 0.274 0.095 0.635
b2 0.456 0.111 0.165 0.465 0.314 0.142
b3 1.030 0.004a 1.409 0.000a 1.208 0.000a
b4 0.690 0.020a −0.021 0.924 0.874 0.002a
b5 0.666 0.039a 0.203 0.374 −0.605 0.014a
b6 −0.057 0.820 0.475 0.058 0.351 0.105
b7 0.204 0.447 −0.391 0.106 −0.161 0.426
a
Significant factor at a 0.05 significance level.

Table 27 shows the experimental runs and the two values (replicates) of the three responses.
Finally, Table 28 contains the estimated coefficients of the model in Eq. (112) and their P-values. The conclusion is that only the extracting agent (x3), the number of extraction cycles (x4), and the volume of petroleum benzin (x5) are significant at the 5% level. Hence, special care should be taken with these factors, because small changes in any of them can cause large variations in the response.
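Because the design is orthogonal and coded ±1, each coefficient b_j of Eq. (112) is simply the average of sign(x_j) times the mean response over the eight runs. A sketch for the SDZ response (variable names are ours):

```python
from statistics import mean

# Plackett-Burman design of Table 27 (rows = runs, columns = x1..x7)
design = [
    [+1, +1, +1, +1, +1, +1, +1],
    [+1, +1, -1, +1, -1, -1, -1],
    [+1, -1, +1, -1, +1, -1, -1],
    [+1, -1, -1, -1, -1, +1, +1],
    [-1, +1, +1, -1, -1, +1, -1],
    [-1, +1, -1, -1, +1, -1, +1],
    [-1, -1, +1, +1, -1, -1, +1],
    [-1, -1, -1, +1, +1, +1, -1],
]
# Mean of the two replicate SDZ areas for each run (Table 27)
sdz = [mean(r) for r in [(10.50, 10.30), (5.87, 7.56), (7.55, 8.08),
                         (3.82, 5.61), (6.74, 8.19), (6.89, 7.63),
                         (8.21, 8.70), (8.56, 5.85)]]

b0 = mean(sdz)                      # intercept
# b[0] corresponds to b1, b[1] to b2, and so on
b = [mean(row[j] * y for row, y in zip(design, sdz)) for j in range(7)]
```

This reproduces the SDZ coefficients of Table 28, e.g., b0 = 7.504, b3 = 1.030, b4 = 0.690, and b5 = 0.666.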

Appendix
Some Basic Elements of Statistics
A distribution function (cumulative distribution function, cdf) on R is any function F such that:

1. F is a mapping from R to the interval [0, 1].
2. lim_{x→−∞} F(x) = 0.
3. lim_{x→+∞} F(x) = 1.
4. F is a monotonically increasing function, that is, a ≤ b implies F(a) ≤ F(b).
5. F is continuous on the left or on the right. For example, F is continuous on the left if lim_{x→a, x<a} F(x) = F(a) for each real number a.

Any probability defined on R corresponds to a distribution function and vice versa.
If p is the probability defined for intervals of real numbers, F(x) is defined as the probability accumulated up to x, that is, F(x) = p(−∞, x). It is easy to show that F(x) verifies the above definition of a distribution function.
If F is a cdf continuous on the left, its associated probability is defined by

pr[a, b] = pr{a ≤ x ≤ b} = F(b) − F(a)
pr(a, b] = pr{a < x ≤ b} = F(b) − lim_{x→a, x>a} F(x)
pr[a, b) = pr{a ≤ x < b} = F(b) − F(a)
pr(a, b) = pr{a < x < b} = F(b) − lim_{x→a, x>a} F(x)

If the distribution function is continuous, then the above limits coincide with the value of the function at the corresponding point.
The probability density function f(x), abbreviated pdf, if it exists, is the derivative of the cdf.
Each random variable X is characterized by a distribution function F_X(x).
When several random variables are handled, it is necessary to define the joint distribution function:

F_{X1,X2,…,Xk}(a_1, a_2, …, a_k) = pr{X1 ≤ a_1 and X2 ≤ a_2 … and Xk ≤ a_k}    (A1)

If the previous joint probability is equal to the product of the individual probabilities, it is said that the random variables are independent:

F_{X1,X2,…,Xk}(a_1, a_2, …, a_k) = pr{X1 ≤ a_1} · pr{X2 ≤ a_2} · … · pr{Xk ≤ a_k}    (A2)

The mean and variance of a continuous random variable whose pdf is f are defined as E(X) = ∫ x f(x) dx and V(X) = ∫ (x − E(X))² f(x) dx. Some basic properties are

E(aX + bY) = aE(X) + bE(Y) for any X and Y    (A3)

V(aX) = a²V(X) for any random variable X    (A4)

Given a random variable X, the standardized variable is obtained by subtracting the mean and dividing by the standard deviation, Y = (X − E(X))/√V(X). The standardized variable has E(Y) = 0 and V(Y) = 1.
For any two random variables, the variance of the sum is

V(X + Y) = V(X) + V(Y) + 2Cov(X, Y)    (A5)

and the covariance is defined as

Cov(X, Y) = ∬ (x − E(X))(y − E(Y)) f_{X,Y}(x, y) dx dy    (A6)

In the definition of the covariance (Eq. (A6)), f_{X,Y}(x, y) is the joint pdf of the random variables. When they are independent, the joint pdf is equal to the product f_X(x)f_Y(y) and the covariance is zero.
In general, E(XY) ≠ E(X)E(Y), except if the variables are independent, in which case the equality holds.
In applications in Analytical Chemistry, it is very common to use formulas to obtain the final measurement from other intermediate ones that have experimental variability. A strategy for the calculation of the uncertainty (variance) of the final result under two basic hypotheses has been developed: a linear approximation to the formula is made, and then the quadratic terms are assimilated to the variance of the random variable at hand (see, e.g., the "Guide to the Expression of Uncertainty in Measurement"2). This procedure, called in many texts the method of transmission (propagation) of errors, can lead to unacceptable results. Hence, an improvement based on Monte Carlo simulation has been suggested for the calculation of the uncertainty (see Supplement 1 to the aforementioned guide).
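As a toy illustration of the two strategies, consider a final result computed as a quotient c = m/V of two intermediate measurements; all numerical values below are hypothetical, chosen only so that the linear propagation and the Monte Carlo estimate can be compared:

```python
import random
from statistics import stdev

random.seed(12345)

# Hypothetical intermediate measurements: a mass m and a volume V
m0, sm = 100.0, 1.0   # mean and standard deviation of the mass
v0, sv = 10.0, 0.2    # mean and standard deviation of the volume

# Linear "transmission of errors" for c = m/V:
# (sc/c)^2 = (sm/m)^2 + (sv/V)^2
c0 = m0 / v0
sc_linear = c0 * ((sm / m0) ** 2 + (sv / v0) ** 2) ** 0.5

# Monte Carlo alternative: simulate the formula directly
sims = [random.gauss(m0, sm) / random.gauss(v0, sv) for _ in range(100_000)]
sc_mc = stdev(sims)
```

With these small relative uncertainties the two estimates agree closely; when the formula is strongly nonlinear or the relative uncertainties are large, the Monte Carlo route is the one to trust.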
A useful representation of the data is the so-called box and whisker plot (or simply box plot). It consists of a box built with the
first, Q1, and third, Q3, quartiles, that is, the 0.25 and 0.75 percentiles, so that the box contains half of the central data. The line in

Fig. A1 Box and whisker plots computed with A: data of method A in Fig. 2, and B: data of method A with an outlier.

between is the median (Q2, or 0.5 percentile). The whiskers extend on both sides of the box up to the maximum and minimum values, provided they are not further than 1.5 times the interquartile range Q3 − Q1.
Fig. A1 shows, first on the left, a box plot of the 100 values of method A of Fig. 2A. Two values appear as squares, "disconnected" at the bottom, meaning that these two values lie more than 1.5 times the interquartile range below Q1.
The advantage of using box plots is that the quartiles are practically insensitive to outliers. For example, suppose that the maximum value 7.86 is changed to 8.86; this change does not affect the median or the quartiles, and the box plot remains similar but with a datum outside the upper whisker, as can be seen in the second box plot, on the right of Fig. A1.
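The construction can be sketched as follows (the sample values are hypothetical, and the quartile interpolation convention varies between implementations):

```python
from statistics import quantiles

# Hypothetical sample (the construction, not the data, is the point here)
data = [5.6, 6.1, 6.3, 6.4, 6.6, 6.7, 6.9, 7.0, 7.2, 7.4, 7.5, 7.9]

q1, q2, q3 = quantiles(data, n=4)   # the three quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr        # whiskers stop at the last datum inside
upper_fence = q3 + 1.5 * iqr
# Points outside the fences would be drawn individually on the plot
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```

For this sample no point falls outside the fences, so the whiskers simply reach the minimum and maximum values.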

The Normal Distribution


A normal distribution with mean m and standard deviation s, N(m,s), has as pdf the following function defined for all real numbers:
f(x) = (1/(s√(2π))) exp[−(1/2)((x − m)/s)²]    (A7)

The normal distribution is a continuous random variable with E(N(m, s)) = m and V(N(m, s)) = s², and these two parameters completely define the distribution.
Particularly interesting is the N(0,1), usually called Z, because any other normal distribution N(m, s) is transformed into a Z when standardized, that is, Z = (N(m, s) − m)/s.
The distribution function of a normal random variable does not have an analytical expression; hence it is necessary to use tables or somewhat complex formulas to calculate the probabilities. As any normal distribution can be transformed into a N(0,1), it is customary to use only the table of this distribution. Table A1 contains some of its values which, in any case, cover the cases used in this article. For example, if z = 1.83, reading from row 1.8 and column 0.03, p = pr{N(0,1) > 1.83} = 0.0336.
The sum of independent normal random variables, Σ_{i=1}^{n} N(m_i, s_i), also follows a normal distribution, N(Σ_{i=1}^{n} m_i, √(Σ_{i=1}^{n} s_i²)).
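The entries of Table A1 can be checked numerically; one standard route is the complementary error function, since pr{N(0,1) > z} = (1/2)erfc(z/√2):

```python
from math import erfc, sqrt

def z_tail(z):
    """Upper-tail probability pr{N(0,1) > z}."""
    return 0.5 * erfc(z / sqrt(2))

p = z_tail(1.83)   # the entry of Table A1 at row 1.8, column 0.03
```

Rounded to four decimals, z_tail(1.83) gives 0.0336, the value used in the example above.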

Student’s t Distribution
If X is a random variable N(m, s) and X1, X2, …, Xn are n random variables, independent and with the same distribution as X, then the random variable (X̄ − m)/(s/√n) is a N(0,1), where X̄ denotes the random variable Σ_{i=1}^{n} X_i/n.
However, when the sample standard deviation S is used instead, the statistic t = (X̄ − m)/(S/√n) follows a t distribution with ν = n − 1 d.f. The mean and variance of a Student's t distribution are, respectively, E(t) = 0 and V(t) = ν/(ν − 2), ν > 2. The general shape of its pdf is similar to that of the standard normal distribution; both are symmetrical around zero, unimodal, and defined on (−∞, ∞). However, the t distribution has heavier tails than the normal, that is, it exhibits greater variability. As the number of d.f. tends to infinity, the

Table A1 Selected probabilities of the Z = N(0,1) distribution.

z     0.00      0.01      0.02      0.03      0.04      0.05      0.06      0.07      0.08      0.09

1.5   0.0668    0.0655    0.0643    0.0630    0.0618    0.0606    0.0594    0.0582    0.0571    0.0559
1.6   0.0548    0.0537    0.0526    0.0516    0.0505    0.0495    0.0485    0.0475    0.0465    0.0455
1.7   0.0446    0.0436    0.0427    0.0418    0.0409    0.0401    0.0392    0.0384    0.0375    0.0367
1.8   0.0359    0.0351    0.0344    0.0336    0.0329    0.0322    0.0314    0.0307    0.0301    0.0294
1.9   0.0287    0.0281    0.0274    0.0268    0.0262    0.0256    0.0250    0.0244    0.0239    0.0233
2.0   0.02275   0.02222   0.02169   0.02118   0.02068   0.02018   0.01970   0.01923   0.01876   0.01831

Values of p such that p = pr{N(0,1) > z}. Up to the first decimal of z in rows, second decimal in columns.

Table A2 Selected points of the t distribution with ν degrees of freedom.

ν       a
        0.25     0.10     0.05     0.025     0.01      0.005     0.0025     0.001      0.0005

1       1.000    3.078    6.314    12.706    31.821    63.657    127.321    318.309    636.619
2       0.816    1.886    2.920    4.303     6.965     9.925     14.089     22.327     31.598
3       0.765    1.638    2.353    3.182     4.541     5.841     7.453      10.214     12.924
4       0.741    1.533    2.132    2.776     3.747     4.604     5.598      7.173      8.610
5       0.727    1.476    2.015    2.571     3.365     4.032     4.773      5.893      6.869
6       0.718    1.440    1.943    2.447     3.143     3.707     4.317      5.208      5.959
7       0.711    1.415    1.895    2.365     2.998     3.499     4.029      4.785      5.408
8       0.706    1.397    1.860    2.306     2.896     3.355     3.833      4.501      5.041
9       0.703    1.383    1.833    2.262     2.821     3.250     3.690      4.297      4.781
10      0.700    1.372    1.812    2.228     2.764     3.169     3.581      4.144      4.587
20      0.687    1.325    1.725    2.086     2.528     2.845     3.153      3.552      3.850
100     0.677    1.290    1.660    1.984     2.364     2.626     2.871      3.174      3.390
1000    0.675    1.282    1.645    1.962     2.330     2.581     2.813      3.098      3.300
∞       0.675    1.282    1.645    1.960     2.326     2.576     2.807      3.090      3.290

Values of t such that a = pr{t_ν > t}.

Table A3 Selected a percentage points of the w2 distribution with n degrees of freedom.

ν α

0.990 0.975 0.950 0.900 0.100 0.050 0.025 0.010

1 0.00016 0.00098 0.0039 0.0158 2.71 3.84 5.02 6.63
2 0.0201 0.0506 0.1026 0.2107 4.61 5.99 7.38 9.21
3 0.115 0.216 0.352 0.584 6.25 7.81 9.35 11.34
4 0.297 0.484 0.711 1.064 7.78 9.49 11.14 13.28
5 0.554 0.831 1.15 1.61 9.24 11.07 12.83 15.09
6 0.872 1.24 1.64 2.20 10.64 12.59 14.45 16.81
7 1.24 1.69 2.17 2.83 12.02 14.07 16.01 18.48
8 1.65 2.18 2.73 3.49 13.36 15.51 17.53 20.09
9 2.09 2.70 3.33 4.17 14.68 16.92 19.02 21.67
10 2.56 3.25 3.94 4.87 15.99 18.31 20.48 23.21
100 70.06 74.22 77.93 82.36 118.50 124.34 129.56 135.81

Values of x such that α = pr{χν² > x}.

limiting distribution is the standard normal one. The family of t distributions only depends on one parameter, the degrees of
freedom.
Table A2 contains some values of the t distribution. For example, if ν = 5 and α = 0.025, the value t = 2.571 in the table is the
one such that 0.025 = pr{t5 > 2.571}. Compare with the value 1.96 in Table A1 that would correspond, under the same conditions, to a
N(0,1), that is, 0.025 = pr{N(0,1) > 1.96}.
Because of the symmetry, 0.025 = pr{t5 < −2.571} also holds and, consequently, 0.95 = pr{−2.571 < t5 < 2.571}.
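The tail probability pr{t5 > 2.571} = 0.025 quoted above can also be checked by simulation, drawing t variables as Z/√(W/ν) with W an independent chi-square. A stdlib-only Python sketch (not the article's MATLAB code; names are illustrative):

```python
import random
from math import sqrt

random.seed(1)

def t_sample(df):
    """One draw from Student's t with df d.f.: Z / sqrt(W/df), W chi-square with df d.f."""
    z = random.gauss(0.0, 1.0)
    w = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / sqrt(w / df)

n = 100_000
tail = sum(t_sample(5) > 2.571 for _ in range(n)) / n
print(tail)  # close to the tabulated 0.025
```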

The χ² (Chi-Square) Distribution


Under the conditions of the previous section for variables X1, X2, . . ., Xn, the random variable χ² = (n − 1)S²/σ² follows a chi-square
distribution with ν = n − 1 d.f.
The mean and variance of a χ² distribution with ν d.f. are E(χν²) = ν and V(χν²) = 2ν. The chi-square distribution is nonnegative
and its pdf is skewed to the right. However, as ν increases, the distribution becomes more symmetric. Some percentage points of the
chi-square distribution are given in Table A3.
For example, if ν = 6 and α = 0.025, the value of the chi-square distribution with 6 d.f. that leaves the probability 0.025 to its right
is χ²0.025,6 = 14.45. That is, 0.025 = pr{χ6² > 14.45}. Analogously, 0.975 = pr{χ6² > 1.24} and, consequently, 0.95 = pr{1.24 < χ6² < 14.45}.
The chi-square distribution has an important additivity property: let χ1², χ2², ⋯, χn² be independent chi-square random variables with ν1,
ν2, . . ., νn d.f., respectively. Then the random variable χ1² + χ2² + ⋯ + χn² follows a chi-square distribution with ν = ν1 + ν2 + ⋯ + νn d.f.
This property becomes apparent if we note that if Z1, Z2, . . ., Zν are independent standard normal random variables,
then the random variable Z1² + Z2² + ⋯ + Zν² follows a chi-square distribution with ν d.f.
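The additivity property is easy to verify by Monte Carlo: the sum of a chi-square with 3 d.f. and an independent chi-square with 4 d.f. should have mean 7 and variance 14, since E(χν²) = ν and V(χν²) = 2ν. A stdlib-only Python sketch (an illustration, not the article's code):

```python
import random

random.seed(7)

def chi2_sample(df):
    """One chi-square draw built as a sum of df squared standard normals."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))

n = 50_000
xs = [chi2_sample(3) + chi2_sample(4) for _ in range(n)]
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / (n - 1)
print(mean, var)  # both close to the chi-square(7) values: 7 and 14
```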

Table A4 Selected percentage points of the Fν1,ν2 distribution for α = 0.025.

ν2 ν1

1 2 3 4 5 6 7 8 9 10

1 647.79 799.50 864.16 899.58 921.85 937.11 948.22 956.66 963.28 968.63
2 38.506 39.000 39.165 39.248 39.298 39.331 39.355 39.373 39.387 39.398
3 17.443 16.044 15.439 15.101 14.885 14.735 14.624 14.540 14.473 14.419
4 12.218 10.649 9.979 9.605 9.365 9.197 9.074 8.980 8.905 8.844
5 10.007 8.434 7.764 7.388 7.146 6.978 6.853 6.757 6.681 6.619
6 8.813 7.260 6.599 6.227 5.988 5.820 5.696 5.600 5.523 5.461
7 8.073 6.542 5.890 5.523 5.285 5.119 4.995 4.899 4.823 4.761
8 7.571 6.060 5.416 5.053 4.817 4.652 4.529 4.433 4.357 4.295
9 7.209 5.715 5.078 4.718 4.484 4.320 4.197 4.102 4.026 3.964
10 6.937 5.456 4.826 4.468 4.236 4.072 3.950 3.855 3.779 3.717
11 6.724 5.256 4.630 4.275 4.044 3.881 3.759 3.664 3.588 3.526
12 6.554 5.096 4.474 4.121 3.891 3.728 3.607 3.512 3.436 3.374
13 6.414 4.965 4.347 3.996 3.767 3.604 3.483 3.388 3.312 3.250
14 6.298 4.857 4.242 3.892 3.663 3.501 3.380 3.285 3.209 3.147
15 6.200 4.765 4.153 3.804 3.576 3.415 3.293 3.199 3.123 3.060

Values of x such that 0.025 = pr{Fν1,ν2 > x}.

The F Distribution
Let X1 and X2 be independent chi-square random variables with ν1 and ν2 d.f., respectively. Then the ratio F = (X1/ν1)/(X2/ν2) follows an
F distribution with ν1 d.f. in the numerator and ν2 d.f. in the denominator, usually abbreviated as Fν1,ν2. The mean and variance
of Fν1,ν2 are E(Fν1,ν2) = ν2/(ν2 − 2) for ν2 > 2, and V(Fν1,ν2) = 2ν2²(ν1 + ν2 − 2)/[ν1(ν2 − 2)²(ν2 − 4)] for ν2 > 4.
The F distribution is nonnegative and skewed to the right. Some percentage points of the F distribution are given in Table A4 for
α = 0.025. For, say, ν1 = 5 and ν2 = 10, F0.025,5,10 = 4.24 is the value such that 0.025 = pr{F5,10 > 4.24}.
The lower percentage points can be found by taking into account that F1−α,ν1,ν2 = 1/Fα,ν2,ν1. For example, with
Table A4, F0.975,5,10 = 1/F0.025,10,5 = 1/6.62 = 0.15. Therefore, 0.95 = pr{0.15 < F5,10 < 4.24}.
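Both the tabulated point F0.025,5,10 = 4.24 and the reciprocal rule for lower percentage points can be checked by simulating the defining ratio of chi-squares. A stdlib-only Python sketch (an illustration under the definitions above, not the article's code):

```python
import random

random.seed(42)

def chi2_sample(df):
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))

def f_sample(n1, n2):
    """One draw of F = (X1/n1)/(X2/n2), X1 and X2 independent chi-squares."""
    return (chi2_sample(n1) / n1) / (chi2_sample(n2) / n2)

n = 100_000
xs = sorted(f_sample(5, 10) for _ in range(n))
upper = xs[int(0.975 * n)]  # empirical F(0.025; 5, 10), tabulated 4.24
lower = xs[int(0.025 * n)]  # empirical F(0.975; 5, 10), about 1/6.62 = 0.15
print(upper, lower)
```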

Convergence of Random Variables


Sometimes it is useful to think of how a sequence of random variables converges to another random variable. Let X1, X2, . . . be a
sequence of random variables and let F1(x), F2(x), . . . be the corresponding sequence of distribution functions.
If the distribution functions become more and more similar to the distribution function F of a random variable X as n → ∞, then
we say that Xn converges to X "in distribution". Formally, this means that limn→∞ Fn(x) = F(x) for each x.
We say that Xn converges to X "in probability" if, for every ε > 0, limn→∞ pr{|Xn − X| > ε} = 0. This means that the probability of the set
where Xn differs from X by more than ε becomes smaller and smaller.
Furthermore, we say that Xn converges to X "almost surely" if pr{limn→∞ |Xn − X| = 0} = 1. Almost sure convergence means
that the set of outcomes for which the realizations of Xn get closer and closer to those of X has probability one.
It can be proven that almost sure convergence implies convergence in probability, which in turn implies convergence in
distribution. The following three fundamental convergence results are the most widely used in practice.

• The "weak law of large numbers" states that if X1, X2, . . ., Xn, . . . are independent and identically distributed random variables with
finite mean μ, then (X1 + X2 + ⋯ + Xn)/n → μ in probability.
• If the random variables also have a finite variance (a weaker condition is also possible), then we have the "strong law of large
numbers", that is, (X1 + X2 + ⋯ + Xn)/n → μ almost surely.
• The "central limit theorem" states that for independent (or weakly correlated) random variables X1, X2, . . ., Xn with the same
distribution, (X̄ − μ)/(σ/√n) → Z = N(0,1) in distribution, where μ and σ² are the common mean and variance of the Xi.
This means that the distributional shape of X̄ is more and more like that of a standard normal random variable as n
increases.
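The central limit theorem in the last bullet can be illustrated numerically: standardized means of uniform(0,1) draws (μ = 0.5, σ² = 1/12) should behave like N(0,1), so about 5% of them fall outside ±1.96. A stdlib-only Python sketch (an illustration, not the article's code):

```python
import random
from math import sqrt

random.seed(0)

mu, sigma, n, reps = 0.5, sqrt(1 / 12), 30, 20_000
zs = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n  # mean of n uniform draws
    zs.append((xbar - mu) / (sigma / sqrt(n)))          # standardize the mean

frac = sum(abs(z) > 1.96 for z in zs) / reps
print(frac)  # close to 0.05 if the normal approximation holds
```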

Some Computational Aspects


The availability of personal computers makes it possible to do statistical calculations without tables. It is advisable to use software specifically
devoted to statistics, but in the initial stages of learning it is worthwhile to do the calculations by hand in order to acquire the
intuition needed to avoid the errors that come from a non-reflective, automatic use of the software.

The basic distributions (normal, Student’s t, F, chi-square) can be programmed with the algorithms in Abramowitz and Stegun.68
Appendices of the book by Meier and Zünd69 show the necessary numerical approximations and programs in BASIC for the same
distributions. To compute the noncentral F, the needed numerical approximation can be consulted in Johnson and Kotz,70 and
Evans et al.71
All the calculations in this article have been made with the Statistics Toolbox for MATLAB.72 What follows is a list of the basic
commands used, which the reader can also find in the live scripts Appendix_1probDistr_live.mlx and Appendix_2power_live.mlx (in
the form of MATLAB mlx-files) in the supplementary material.
Note that all the MATLAB commands referring to cumulative distribution functions, Eqs. (A8)–(A11), compute the
cumulative probability α up to the corresponding value of the distribution. However, throughout the text and in Tables A1–A4,
the calculated probability α is always the upper percentage point, that is, the probability above the corresponding
value.

Normal distribution
α = pr{N(μ, σ) < zα} (A8)

• z = norminv(α, μ, σ)

Example A1: α = 0.05, μ = 0, σ = 1; then norminv(0.05, 0, 1) gives z = −1.645.

• α = normcdf(z, μ, σ)

Example A2: z = 1.645, μ = 0, σ = 1; then normcdf(1.645, 0, 1) gives α = 0.95.

Student's t distribution with ν degrees of freedom

α = pr{tν < tα,ν} (A9)

• t = tinv(α, ν)

Example A3: α = 0.05, ν = 5; then tinv(0.05,5) gives t = −2.015.

• α = tcdf(t, ν)

Example A4: t = 1.645, ν = 5; then tcdf(1.645,5) gives α = 0.9196.

χ² distribution with ν degrees of freedom

α = pr{χν² < χ²α,ν} (A10)

• x = chi2inv(α, ν)

Example A5: α = 0.05, ν = 5; then chi2inv(0.05,5) gives x = 1.1455.

• α = chi2cdf(x, ν)

Example A6: x = 9.24, ν = 5; then chi2cdf(9.24,5) gives α = 0.9001.

Fν1,ν2 distribution with ν1 and ν2 degrees of freedom

α = pr{Fν1,ν2 < Fα,ν1,ν2} (A11)

• x = finv(α, ν1, ν2)

Example A7: α = 0.95, ν1 = 5, ν2 = 15; then finv(0.95,5,15) gives x = 2.9013.

• α = fcdf(x, ν1, ν2)

Example A8: x = 2.90, ν1 = 5, ν2 = 15; then fcdf(2.90,5,15) gives α = 0.9499.

Power for the z-test, Eq. (40)


Example A9: With the data of Example 8, |δ| = 0.40, s = 0.55, n = 10, α = 0.05; then
normcdf(norminv(0.95,0,1) - 0.40*sqrt(10)/0.55) gives 0.2562.
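For readers without the Statistics Toolbox, Example A9 can be reproduced with the Python standard library alone: β = Φ(z0.95 − |δ|√n/s), with Φ built from erfc and its inverse obtained by bisection (a sketch; the helper names are my own):

```python
from math import erfc, sqrt

def norm_cdf(z):
    """Standard normal cdf via the complementary error function."""
    return 0.5 * erfc(-z / sqrt(2.0))

def norm_inv(p):
    """Inverse standard normal cdf by bisection on norm_cdf."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Example A9: |delta| = 0.40, s = 0.55, n = 10, alpha = 0.05
beta = norm_cdf(norm_inv(0.95) - 0.40 * sqrt(10) / 0.55)
print(f"{beta:.4f}")  # prints 0.2562, matching the MATLAB one-liner
```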

Power for the t-test, Eq. (43)


Example A10: With the same data as for the z-test, the sentence uses the noncentral t distribution "nctcdf" and the t
distribution "tinv", both with n − 1 degrees of freedom, and the noncentrality parameter 0.73√10:
nctcdf(tinv(0.95,9), 9, 0.73*sqrt(10)) gives β = 0.3137.
Example A11: With the data of Example 9, α = 0.05, n = 10, and the noncentrality parameter is 0.57√10:
nctcdf(tinv(0.95,9), 9, 0.57*sqrt(10)) gives β = 0.4918.

Power for the chi-square test, Eq. (46)


Example A12: We have λ = 2, α = 0.05, and n = 14 or n = 13 to obtain a value of β ≈ 0.05:
chi2cdf(chi2inv(0.95,13)/(2*2),13) gives β = 0.0402.
chi2cdf(chi2inv(0.95,12)/(2*2),12) gives β = 0.0511.
Notice that the d.f. equal 14 − 1 = 13 or 13 − 1 = 12.

Power for the F-test, Eq. (57)


Example A13: For α = 0.05, n1 = n2 = 9, λ = σ1/σ2 = 2, which are those of question (3) in Example 13:
fcdf(finv(0.975,8,8)/(2*2),8,8) - fcdf(finv(0.025,8,8)/(2*2),8,8) gives β = 0.5558.
If we look for the sample size n = n1 = n2 such that β ≤ 0.10, trying some values, we get
fcdf(finv(0.975,22,22)/(2*2),22,22) - fcdf(finv(0.025,22,22)/(2*2),22,22), which gives β = 0.1115, and
fcdf(finv(0.975,23,23)/(2*2),23,23) - fcdf(finv(0.025,23,23)/(2*2),23,23), which gives β = 0.0981.
Consequently, n = 24.
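Example A13's β can also be cross-checked by simulation, without fcdf/finv: with λ = 2, β is the probability that an F8,8 variable falls between F0.975,8,8/λ² and F0.025,8,8/λ², and Table A4 gives F0.025,8,8 = 4.433 (so F0.975,8,8 = 1/4.433). A stdlib-only Python sketch (an illustration, not the article's code):

```python
import random

random.seed(3)

def chi2_sample(df):
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))

def f_sample(n1, n2):
    return (chi2_sample(n1) / n1) / (chi2_sample(n2) / n2)

# beta = pr{ (1/4.433)/lambda^2 < F(8,8) < 4.433/lambda^2 } with lambda^2 = 4
lo, hi = (1 / 4.433) / 4, 4.433 / 4
n = 100_000
beta = sum(lo < f_sample(8, 8) < hi for _ in range(n)) / n
print(beta)  # close to the exact 0.5558
```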

Power for fixed effects ANOVA, Eq. (82)


Example A14: For α = 0.05, ν1 = 4, ν2 = 15, and noncentrality parameter δ = nΣτi²/σ² = 4 × 2 = 8, the command
ncfcdf(finv(0.95,4,15),4,15,8) gives β = 0.5364.
Notice that in the ANOVA model the factor has k = 5 levels and there are n = 4 replicates per level.

Power for random effects ANOVA, Eq. (98)


Example A15: Suppose that 10 laboratories participate in a proficiency test to evaluate a method. The assumed risks are α = β = 0.05,
and it is desirable to detect an interlaboratory variance at least equal to the intralaboratory variance, that is, σt²/σ² = 1, so that
λ² = 1 + nσt²/σ² = 1 + n. With these data:
k = 10; n = 4; fcdf(finv(0.95,k-1,k*(n-1))/(1+1*n),k-1,k*(n-1)) gives β = 0.099.
k = 10; n = 5; fcdf(finv(0.95,k-1,k*(n-1))/(1+1*n),k-1,k*(n-1)) gives β = 0.050.
Thus, each laboratory must make five determinations.

References

1. EURACHEM/CITAC, Guide CG4. In Quantifying Uncertainty in Analytical Measurement, 2nd ed.; Ellison, S. L. R., Rosslein, M., Williams, A., Eds.; 2000. ISBN: 0-948926-15-5.
Available from the Eurachem Secretariat. See http://www.eurachem.org.
2. Evaluation of Measurement Data, Supplement 1 to the ‘Guide to the Expression of Uncertainty in Measurement’—Propagation of Distributions Using a Monte Carlo Method; Joint
Committee for Guides in Metrology, 2008. JCGM 101.
3. Commission Decision (EC), No 2002/657/EC of 12 August 2002 Implementing Council Directive 96/23/EC Concerning the Performance of Analytical Methods and the
Interpretation of Results. Off. J. Eur. Commun. 2002, L221, 8–36.
4. Aldama, J. M. Practicum of Master in Advanced Chemistry; University of Burgos: Burgos, Spain, 2007.
5. Analytical Methods Committee, Robust Statistics-How Not to Reject Outliers, Part 1. Basic Concepts. Analyst 1989, 114, 1693–1697.
6. Analytical Methods Committee, Robust Statistics-How Not to Reject Outliers, Part 2. Inter-laboratory Trials. Analyst 1989, 114, 1699–1702.
7. ISO 5725, Accuracy Trueness and Precision of Measurement Methods and Results, Part 1. General Principles and Definitions, Part 2. Basic Method for the Determination of
Repeatability and Reproducibility of a Standard Measurement Method, Part 3. Intermediate Measures of the Precision of a Standard Measurement Method, Part 4. Basic Methods
for the Determination of the Trueness of a Standard Measurement Method, Part 5. Alternative Methods for the Determination of the Precision of a Standard Measurement Method,
Part 6. Use in Practice of Accuracy Values. Genève, 1994.
8. Analytical Methods Committee, In Technical Brief No 4; Thompson, M., Ed.; 2006. www.rsc.org/amc/.
9. Silverman, B. W. Density Estimation for Statistics and Data Analysis; Chapman and Hall: London, Great Britain, 1986.
10. Wand, M. P.; Jones, M. C. Kernel Smoothing; Chapman and Hall: London, Great Britain, 1995.
11. Inczédy, J.; Lengyel, T.; Ure, A. M.; Gelencsér, A.; Hulanicki, A. Compendium of Analytical Nomenclature IUPAC, 3rd ed.; Port City Press Inc.: Baltimore, 2nd printing, 2000.
12. Lira, I.; Wöger, W. Comparison Between the Conventional and Bayesian Approaches to Evaluate Measurement Data. Metrologia 2006, 43, S249–S259.
13. Zech, G. Frequentist and Bayesian confidence intervals. Eur. Phys. J. Direct 2002, C12, 1–81.
14. Armstrong, N.; Hibbert, D. B. An Introduction to Bayesian Methods for Analyzing Chemistry Data, Part 1: An Introduction to Bayesian Theory and Methods. Chemom. Intel. Lab.
Syst. 2009, 97, 194–210.
15. Armstrong, N.; Hibbert, D. B. An Introduction to Bayesian Methods for Analyzing Chemistry Data, Part II: A Review of Applications of Bayesian Methods in Chemistry. Chemom.
Intel. Lab. Syst. 2009, 97, 211–220.
16. Sprent, P.; Smeeton, N. C. Applied Nonparametric Statistical Methods, 4th ed.; Chapman & Hall/CRC: Boca Raton, 2007.
17. Patel, J. K. Tolerance Limits. A Review. Commun. Stat. Theory Methods 1986, 15 (9), 2716–2762.
18. Meléndez, M. E.; Sarabia, L. A.; Ortiz, M. C. Distribution Free Methods to Model the Content of Biogenic Amines in Spanish Wines. Chemom. Intel. Lab. Syst. 2016, 155,
191–199.

19. Reguera, C.; Sanllorente, S.; Herrero, A.; Sarabia, L. A.; Ortiz, M. C. Study of the Effect of the Presence of Silver Nanoparticles on Migration of Bisphenol A From Polycarbonate
Glasses into Food Simulants. Chemom. Intel. Lab. Syst. 2018, 176, 66–73.
20. Wald, A.; Wolfowitz, J. Tolerance Limits for a Normal Distribution. Ann. Math. Stat. 1946, 17, 208–215.
21. Wilks, S. S. Determination of Sample Sizes for Setting Tolerance Limits. Ann. Math. Stat. 1941, 12, 91–96.
22. Kendall, M.; Stuart, A. The Advanced Theory of Statistics, Inference and Relationship; Charles Griffin & Company Ltd.: London, 1979; Vol. 2, Section 32.11, pp 547–548.
23. Willink, R. On using the Monte Carlo Method to Calculate Uncertainty Intervals. Metrologia 2006, 43, L39–L42.
24. Guttman, I. Statistical Tolerance Regions; Charles Griffin and Company: London, 1970.
25. Huber, P.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Chiap, P.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.; Lallier, M.; Laurentie, M.; Mercier, N.; Muzard, G.;
Nivet, C.; Valat, L. Harmonization of Strategies for the Validation of Quantitative Analytical Procedures. A SFSTP Proposal – Part I. J. Pharm. Biomed. Anal. 2004, 36, 579–586.
26. Huber, P.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Chiap, P.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.; Lallier, M.; Laurentie, M.; Mercier, N.; Muzard, G.;
Nivet, C.; Valat, L.; Rozet, E. Harmonization of Strategies for the Validation of Quantitative Analytical Procedures. A SFSTP Proposal—Part II. J. Pharm. Biomed. Anal. 2007, 45,
70–81.
27. Huber, P.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.; Laurentie, M.; Mercier, N.; Muzard, G.; Valat, L.; Rozet, E.
Harmonization of Strategies for the Validation of Quantitative Analytical Procedures. A SFSTP Proposal—Part III. J. Pharm. Biomed. Anal. 2007, 45, 82–86.
28. Feinberg, M. Validation of Analytical Methods Based on Accuracy Profiles. J. Chromatogr. A 2007, 1158, 174–183.
29. Rozet, E.; Hubert, C.; Ceccato, A.; Dewé, W.; Ziemons, E.; Moonen, F.; Michail, K.; Wintersteiger, R.; Streel, B.; Boulanger, B.; Hubert, P. Using Tolerance Intervals in Pre-Study
Validation of Analytical Methods to Predict In-Study Results. The Fit-for-Future-Purpose Concept. J. Chromatogr. A 2007, 1158, 126–137.
30. Rozet, E.; Ceccato, A.; Hubert, C.; Ziemons, E.; Oprean, R.; Rudaz, S.; Boulanger, B.; Hubert, P. Analysis of Recent Pharmaceutical Regulatory Documents on Analytical Method
Validation. J. Chromatogr. A 2007, 1158, 111–125.
31. Dewé, W.; Govaerts, B.; Boulanger, B.; Rozet, E.; Chiap, P.; Hubert, P. Using Total Error as Decision Criterion in Analytical Method Transfer. Chemom. Intel. Lab. Syst. 2007, 85,
262–268.
32. González, A. G.; Herrador, M. A. Accuracy Profiles from Uncertainty Measurements. Talanta 2006, 70, 896–901.
33. Rebafka, T.; Clémençon, S.; Feinberg, M. Bootstrap-Based Tolerance Intervals for Application to Method Validation. Chemom. Intel. Lab. Syst. 2007, 89, 69–81.
34. Fernholz, L. T.; Gillespie, J. A. Content-Correct Tolerance Limits Based on the Bootstrap. Technometrics 2001, 43 (2), 147–155.
35. Cowen, S.; Ellison, S. L. R. Reporting Measurement Uncertainty and Coverage Intervals Near Natural Limits. Analyst 2006, 131, 710–717.
36. Schouten, H. J. A. Sample Size Formulae with a Continuous Outcome for Unequal Group Sizes and Unequal Variances. Stat. Med. 1999, 18, 87–91.
37. Lehmann, E. L. Testing Statistical Hypothesis; Wiley & Sons: New York, 1959.
38. Schuirmann, D. J. A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability. J. Pharmacokinet.
Biopharm. 1987, 15, 657–680.
39. Mehring, G. H. On Optimal Tests for General Interval Hypothesis. Commun. Stat. Theory Methods 1993, 22 (5), 1257–1297.
40. Brown, L. D.; Hwang, J. T. G.; Munk, A. An Unbiased Test for the Bioequivalence Problem. Ann. Stat. 1998, 25, 2345–2367.
41. Munk, A.; Hwang, J. T. G.; Brown, L. D. Testing Average Equivalence. Finding a Compromise Between Theory and Practice. Biom. J. 2000, 42 (5), 531–552.
42. Hartmann, C.; Smeyers-Verbeke, J.; Penninckx, W.; Vander Heyden, Y.; Vankeerberghen, P.; Massart, D. L. Reappraisal of Hypothesis Testing for Method Validation: Detection of
Systematic Error by Comparing the Means of Two Methods or of Two Laboratories. Anal. Chem. 1995, 67, 4491–4499.
43. Limentani, G. B.; Ringo, M. C.; Ye, F.; Bergquist, M. L.; McSorley, E. O. Beyond the t-Test. Statistical Equivalence Testing. Anal. Chem. 2005, 77, 221A–226A.
44. Kuttatharmmakull, S.; Massart, D. L.; Smeyers-Verbeke, J. Comparison of Alternative Measurement Methods: Determination of the Minimal Number of Measurements Required
for the Evaluation of the Bias by Means of Interval Hypothesis Testing. Chemom. Intel. Lab. Syst. 2000, 52, 61–73.
45. Martín Andrés, A.; Luna del Castillo, J. D. Bioestadística para las ciencias de la salud; Ediciones Norma-Capitel: Madrid, 2004.
46. Wellek, S. Testing Statistical Hypotheses of Equivalence; Chapman & May/CRC Press LLC: Boca Raton, FL, 2003.
47. Ortiz, M. C.; Herrero, A.; Sanllorente, S.; Reguera, C. The Quality of the Information Contained in Chemical Measures; Servicio de Publicaciones Universidad de Burgos: Burgos,
2005. (Electronic Book).
48. D’Agostino, R. B., Stephens, M. A., Eds.; In Goodness-of-Fit Techniques; Marcel Dekker Inc.: New York, 1986.
49. Moreno, E.; Girón, F. J. On the Frequentist and Bayesian Approaches to Hypothesis Testing (with discussion). Stat. Oper. Res. Trans. 2006, 30 (1), 3–28.
50. Scheffé, H. The Analysis of Variance; Wiley & Sons: New York, 1959.
51. Anderson, V. L.; MacLean, R. A. Design of Experiments. A Realistic Approach; Marcel Dekker Inc.: New York, 1974.
52. Milliken, G. A.; Johnson, D. E. Analysis of Messy Data: Designed Experiments; Wadsworth Publishing Co.: Belmont, NJ, 1984; vol. I.
53. Searle, S. R. Linear Models; Wiley & Sons, Inc.: New York, 1971.
54. Youden, W. J. Statistical Techniques for Collaborative Tests; Association of Official Analytical Chemists: Washington, DC, 1972.
55. Kateman, G.; Pijpers, F. W. Quality Control in Analytical Chemistry; Wiley & Sons: New York, 1981.
56. Kuttatharmmakull, S.; Massart, D. L.; Smeyers-Verbeke, J. Comparison of Alternative Measurement Methods. Anal. Chim. Acta 1999, 391, 203–225.
57. Hampel, F. R.; Ronchetti, E. M.; Rousseeuw, P. J.; Stahel, W. A. Robust Statistics. The Approach Based on Influence Functions; Wiley-Interscience: Zurich, 1985.
58. Huber, P. J. Robust Statistics; Wiley & Sons: New York, 1981.
59. Thompson, M.; Wood, R. J. Assoc. Off. Anal. Chem. Int. 1993, 76, 926–940.
60. Sanz, M. B.; Ortiz, M. C.; Herrero, A.; Sarabia, L. A. Robust and Non Parametric Statistic in the Validation of Chemical Analysis Methods. Quím. Anal. 1999, 18, 91–97.
61. García, I.; Sarabia, L.; Ortiz, M. C.; Aldama, J. M. Usefulness of D-optimal Designs and Multicriteria Optimization in Laborious Analytical Procedures. Application to the Extraction
of Quinolones From Eggs. J. Chromatogr. A 2005, 1085, 190–198.
62. García, I.; Sarabia, L. A.; Ortiz, M. C.; Aldama, J. M. Robustness of the Extraction Step When Parallel Factor Analysis (PARAFAC) is Used to Quantify Sulfonamides in Kidney by
High Performance Liquid Chromatography-Diode Array Detection (HPLC-DAD). Analyst 2004, 129 (8), 766–771.
63. Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; de Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of Chemometrics and Qualimetrics: Part A; Elsevier:
Amsterdam, 1997.
64. Herrero, A.; Reguera, C.; Ortiz, M. C.; Sarabia, L. A.; Sánchez, M. S. Ad-Hoc Blocked Design for the Robustness Study in the Determination of Dichlobenil and 2,6-
Dichlorobenzamide in Onions by Programmed Temperature Vaporization-Gas Chromatography–Mass Spectrometry. J. Chromatogr. A 2014, 1370, 187–199.
65. Arce, M. M.; Sanllorente, S.; Ortiz, M. C.; Sarabia, L. A. Easy-To-Use Procedure to Optimise a Chromatographic Method. Application in the Determination of Bisphenol-A and
Phenol in Toys by Means of Liquid Chromatography with Fluorescence Detection. J. Chromatogr. A 2018, 1534, 93–100.
66. Oca, M. L.; Rubio, L.; Ortiz, M. C.; Sarabia, L. A.; García, I. Robustness Testing in the Determination of Seven Drugs in Animal Muscle by Liquid Chromatography–Tandem Mass
Spectrometry. Chemom. Intel. Lab. Syst. 2016, 151, 172–180.
67. Rodríguez, N.; Ortiz, M. C.; Sarabia, L. A. Study of Robustness Based on N-Way Models in the Spectrofluorimetric Determination of Tetracyclines in Milk When Quenching Exists.
Anal. Chim. Acta 2009, 651, 149–158.
68. Abramowitz, M.; Stegun, I. A. Handbook of Mathematical Functions; Government Printing Office, 1964.
69. Meier, P. C.; Zünd, R. E. Statistical Methods in Analytical Chemistry, 2nd ed.; Wiley & Sons: New York, 2000.
70. Johnson, N.; Kotz, S. Distributions in Statistics: Continuous Univariate Distributions—2; Wiley & Sons: New York, 1970;191. (Equation (5)).
71. Evans, M.; Hastings, N.; Peacock, B. Statistical Distributions, 2nd ed.; Wiley & Sons: New York, 1993; 73–74.
72. The MathWorks, Inc. Statistics and Machine Learning Toolbox for Use with MATLAB®, version 11.4 (R2018b); The MathWorks, Inc., 2018.
