1.01 Quality of Analytical Measurements: Statistical Methods for Internal Validation
M Cruz Ortiz, Departamento de Química, Facultad de Ciencias, Universidad de Burgos, Burgos, Spain
Luis A Sarabia and M Sagrario Sánchez, Departamento de Matemáticas y Computación, Facultad de Ciencias, Universidad de Burgos,
Burgos, Spain
Ana Herrero, Departamento de Química, Facultad de Ciencias, Universidad de Burgos, Burgos, Spain
© 2019 Elsevier Inc. All rights reserved.
This article is an update of M.C. Ortiz, L.A. Sarabia, M.S. Sánchez, A. Herrero, 1.02—Quality of Analytical Measurements: Statistical Methods for Internal
Validation, Editor(s): Steven D. Brown, Romá Tauler, Beata Walczak, Comprehensive Chemometrics, Elsevier, 2009, pp. 17–76.
Introduction
Confidence and Tolerance Intervals
Confidence Interval
Confidence Interval on the Mean of a Normal Distribution
Case 1: Known variance
Case 2: Unknown variance
Confidence Interval on the Variance of a Normal Distribution
Confidence Interval on the Difference in Two Means
Case 1: Known variances
Case 2: Unknown variances
Case 3: Confidence interval for paired samples
Confidence Interval on the Ratio of Variances of Two Normal Distributions
Confidence Interval on the Median
Joint Confidence Intervals
Tolerance Intervals
Case 1: β-content tolerance interval
Case 2: β-expectation tolerance interval
Case 3: Distribution-free intervals
Hypothesis Tests
Elements of a Hypothesis Test
Hypothesis Test on the Mean of a Normal Distribution
Case 1: Known variance
Case 2: Unknown variance
Case 3: The paired t-test
Hypothesis Test on the Variance of a Normal Distribution
Hypothesis Test on the Difference in Two Means
Case 1: Known variances
Case 2: Unknown variances
Test Based on Intervals
Hypothesis Test on the Variances of Two Normal Distributions
Hypothesis Test on the Comparison of Several Independent Variances
Case 1: Cochran's test
Case 2: Bartlett's test
Case 3: Levene's test
Goodness-of-Fit Tests: Normality Tests
Case 1: Chi-square test
Case 2: D'Agostino normality test
One-Way Analysis of Variance
The Fixed Effects Model
Power of the Fixed Effects ANOVA model
Uncertainty and Testing of the Estimated Parameters in the Fixed Effects Model
Case 1: Orthogonal contrasts
Case 2: Comparison of several means
The Random Effects Model
Power of the Random Effects ANOVA model
Confidence Intervals for the Estimated Parameters in the Random Effects Model
☆ Change History: October 2019: M. Cruz Ortiz, Luis A. Sarabia, M. Sagrario Sánchez, Ana Herrero added MATLAB live-scripts for the computations; rewrote the introduction to tolerance intervals; corrected estimates in Table 13; updated texts; corrected mistakes and updated references.
Comprehensive Chemometrics 2nd edition: Chemical and Biochemical Data Analysis https://doi.org/10.1016/B978-0-12-409547-2.14746-8
Nomenclature
1 − α Confidence level
1 − β Power
CCα Limit of decision
CCβ Capability of detection
Fν1,ν2 F distribution with ν1 and ν2 degrees of freedom (d.f.)
H0 Null hypothesis
H1 Alternative hypothesis
N(μ,σ) Normal distribution with mean μ and standard deviation σ
NID(μ,σ) (Normally and Independently Distributed) independent random variables equally distributed as normal with mean μ and standard deviation σ
s Sample standard deviation
s² Sample variance
tν Student's t distribution with ν degrees of freedom (d.f.)
x̄ Sample mean
V(X) Variance of the random variable X
α Significance level, probability of type I error
β Probability of type II error
Δ Bias (systematic error)
ε Random error
μ Mean
ν Degree(s) of freedom, d.f.
σ Standard deviation
σ² Variance
σR Reproducibility (as standard deviation)
σr Repeatability (as standard deviation)
χ²ν χ² (chi-square) distribution with ν degrees of freedom
Introduction
Every day, millions of analytical determinations are made in thousands of laboratories all around the world. These measurements are necessary for the assessment of merchandise in commercial exchanges, for supporting health care, for maintaining security, for the quality control of water and the environment, for the characterization of raw materials and manufactured products, and for forensic analyses. Practically every aspect of contemporary social activity relies in some way on analytical measurements. The cost of these measurements is high, but the cost of decisions made on the basis of incorrect results is much greater. For example, a test that wrongly shows the presence of a forbidden substance in a food destined for human consumption can result in an expensive claim, the confirmation of the presence of a drug of abuse can lead to serious judicial sentences, and doping in sport may result in severe sanctions. The importance of providing a correct result is evident, but it is equally important to be able to prove that the result is correct.
Once an analytical problem is posed to a laboratory and the analytical method is selected, the next step is the in-house validation
of the method. This is the process of defining the analytical requirements to respond to the problem and to confirm that the
considered method has performance characteristics consistent with those required. The results of the validation experiments must be evaluated in order to ensure that the method meets the required measurement specification.
The set of operations carried out to determine the value of a suitably defined quantity (the measurand) is called the measurement. The method of measurement is the sequence of operations that is used when conducting the measurements. It is documented in enough detail so that the measurement may be done without additional information.
Once a method is designed or selected, it is necessary to evaluate its performance characteristics and to identify the factors that
can change these characteristics and to what extent they can change. If, in addition, the method is developed to solve a particular
analytical problem, it is necessary to verify that the method is fit for purpose.1 This process of evaluation is called validation of the
method. It implies the determination of several parameters that characterize the method performance: decision limit, capability of
detection, selectivity, specificity, ruggedness, and accuracy (trueness and precision). In any case, it is the measurements themselves that allow evaluation of the performance characteristics of the method and its fit for purpose. In addition, when
using the method, the obtained measurements are also the ones that will be used to make decisions on the analyzed sample, for
example, whether the amount of an analyte fulfills a legal specification. Therefore, it is necessary to suitably model the data that a
method provides. In what follows we will consider that the data provided by the analytical method are real numbers; other
possibilities exist, for example, the count of bacteria or impacts in a detector take only (discrete) natural values, or also, sometimes,
the data resulting from an analysis are qualitative, for example, the identification of an analyte through its m/z ratios in a mass
spectrometry-chromatography analysis.
With regard to the analytical measurement, it is admitted that the value, x, provided by the method of analysis consists of three terms, the true value of the parameter μ, a systematic error (bias) Δ, and a random error ε with zero mean, in an additive way as expressed in Eq. (1):

x = μ + Δ + ε (1)

All the possible measurements that a method can provide when analyzing a sample constitute the population of the measurements. This is indeed a theoretical situation because it is being assumed that there are infinitely many samples and that the method of analysis remains unaltered. Under these conditions, the model of the analytical method, Eq. (1), is mathematically a random variable, X, with mathematical expectation μ + Δ and variance equal to the variance of ε; in statistical notation, E(X) = μ + Δ and V(X) = V(ε), respectively.
A random variable, and thus the analytical method, is described by its cumulative distribution function FX(x), that is, the probability that the method provides measurements less than or equal to x for any value x. Symbolically, this is written as FX(x) = pr{X ≤ x} for any real value x. In most applications, it is assumed that FX(x) is differentiable, which implies, among other things, that the probability of obtaining exactly a specific value is zero. In the case of a differentiable cumulative distribution function, the derivative of FX(x) is the probability density function (pdf) fX(x). Any function f(x) that is non-negative, f(x) ≥ 0, and whose area under the curve is 1, ∫_ℝ f(x) dx = 1, is the pdf of a random variable. The probability that the random variable X takes values in the interval [a, b] is the area under the pdf over the interval [a, b], that is,

pr{X ∈ [a, b]} = ∫_a^b f(x) dx (2)

and the mean and variance of X are written as in Eqs. (3), (4), respectively,

E(X) = ∫_ℝ x f(x) dx (3)

V(X) = ∫_ℝ (x − E(X))² f(x) dx (4)
In general, mean and variance do not characterize in a unique way a random variable and therefore neither the method of analysis.
Fig. 1 shows the pdf of four random variables with the same mean 6.00 and standard deviation 0.61.
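As a numerical check of Eqs. (2)-(4), the following sketch integrates the rectangular pdf of Fig. 1A, whose limits [4.94, 7.06] are taken from the figure caption. Python with SciPy is used here purely for illustration; the article's supplementary computations are MATLAB live scripts.

```python
from scipy.integrate import quad

# Uniform (rectangular) pdf on [l, u], as in Fig. 1A
l, u = 4.94, 7.06
f = lambda x: 1.0 / (u - l)  # constant density inside [l, u], zero outside

area, _ = quad(f, l, u)                                 # total area, must be 1
mean, _ = quad(lambda x: x * f(x), l, u)                # Eq. (3)
var, _  = quad(lambda x: (x - mean) ** 2 * f(x), l, u)  # Eq. (4)
p, _    = quad(f, 5.0, 7.0)                             # Eq. (2) on [5.0, 7.0]
print(area, mean, var, p)
```

The integration recovers mean 6, variance (u − l)²/12 ≈ 0.375, and the probability 0.94 for the interval [5.0, 7.0] quoted below.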
Fig. 1 Probability density functions of four random variables with mean 6 and variance 0.375. (A) Uniform in [4.94, 7.06]; (B) Symmetric triangular in [4.5, 7.5];
(C) Normal N(6, 0.61); (D) Weibull with shape 1.103 and scale 0.7 shifted to give a mean of 6. Dotted vertical lines mark the interval [5.0, 7.0].
These four distributions, uniform or rectangular (Fig. 1A), triangular (Fig. 1B), normal (Fig. 1C), and Weibull (Fig. 1D), are
frequent in the scope of analytical determinations, and they appear in Appendix E of the EURACHEM/CITAC Guide1 and also they
are used in metrology.2
If the only available information regarding a quantity X is the lower limit, l, and the upper limit, u, but the quantity could be
anywhere in between, with no idea of whether any part of the range is more likely, then a rectangular distribution in the interval [l, u]
would be assigned to X. This is so because it is the pdf that maximizes the “information entropy” of Shannon, in other words the pdf
that adequately characterizes the incomplete knowledge about X. Frequently, in reference materials, the certified concentration is expressed in terms of a number and unqualified limits (e.g., 1000 ± 2 mg L−1). In this case, a rectangular distribution should be used (Fig. 1A).
When the available information concerning X includes the knowledge that values close to c (between l and u) are more likely
than those near the bounds, the adequate distribution is a triangular one (Fig. 1B), with the maximum of its pdf in c.
If a good location estimate, m, and a scale estimate, s, are the only information available regarding X, then, according to the
principle of maximum entropy, a normal probability distribution N(m,s) (Fig. 1C) would be assigned to X (remember that m and s
may have been obtained from repeated applications of a measurement method).
Finally, the Weibull distribution (Fig. 1D) is very versatile; it can mimic the behavior of other distributions such as the normal or the exponential. It is adequate for the analysis of the reliability of processes, and in chemical analysis it is useful for describing the behavior of the figures of merit of a long-term procedure. For example, the distribution of the capability of detection CCβ is a Weibull one,3 as is the distribution of the determinations of ammonia in water by UV-vis spectroscopy during 350 different days in Aldama.4
In the four cases given in Fig. 1, the probability of obtaining values between 5 and 7 has been computed with Eq. (2). For the uniform distribution (Fig. 1A) this probability is 0.94, whereas for the triangular distribution (Fig. 1B) it is 0.88, for the normal distribution (Fig. 1C) it is 0.90, and for the Weibull distribution (Fig. 1D), 0.93. Sorting in decreasing order of the proportion of values that each distribution accumulates in the interval [5.0, 7.0], we have uniform, Weibull, normal, and triangular, although the triangular and normal distributions tend to give values symmetrically around the mean and the Weibull distribution does not. If another interval is considered, say [5.4, 6.6], the distributions accumulate probabilities of 0.57, 0.64, 0.67, and 0.54, respectively; now the difference among the values is larger than before and, in addition, the order of the distributions becomes normal, triangular, uniform, and Weibull.
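The probabilities for the interval [5.0, 7.0] can be reproduced from the distributions as parameterized in the caption of Fig. 1; the location of the shifted Weibull is solved so that its mean equals 6, an assumption taken from the caption. Again, this is a Python/SciPy sketch for illustration rather than the article's own MATLAB scripts.

```python
from scipy import stats

# The four distributions of Fig. 1, parameterized as in its caption
unif = stats.uniform(loc=4.94, scale=7.06 - 4.94)
tri  = stats.triang(c=0.5, loc=4.5, scale=7.5 - 4.5)   # symmetric, peak at 6
norm = stats.norm(loc=6, scale=0.61)
shift = 6 - stats.weibull_min(c=1.103, scale=0.7).mean()  # place the mean at 6
weib = stats.weibull_min(c=1.103, scale=0.7, loc=shift)

# Probability of the interval [5.0, 7.0] via the cdf, i.e., Eq. (2)
for d in (unif, tri, norm, weib):
    print(d.cdf(7.0) - d.cdf(5.0))
```

Up to rounding of the caption parameters, the four printed values agree with the 0.94, 0.88, 0.90, and 0.93 quoted above.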
Table 1 Values of b such that p = pr{X < b}, where X is each one of the random variables defined in the caption of Fig. 1 (columns: probability p, and the value of b for each random variable).
n, minimum b among the four distributions; m, maximum b among the four distributions.
If for each of those variables a value b is determined so that there is a fixed probability, p, of obtaining values below b (i.e., the value b such that p = pr{X < b} for each distribution X), the results of Table 1 are obtained. For example (second row), 5% of the time the uniform distribution at hand gives values less than b = 5.05, and less than 4.97 if it is the triangular distribution, and so on.
In the table, the extreme values among the four distributions for each probability p have been identified, and large differences are
observed caused by the form in which the values far from 6 are distributed (notice the differences in Fig. 1 for the normal, the
triangular, or the uniform distribution) and also due to the asymmetry of the Weibull distribution.
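The quantiles b of Table 1 correspond to the inverse cdf of each distribution. A sketch of the computation for p = 0.05 (Python/SciPy for illustration, with the Fig. 1 parameterizations assumed as before):

```python
from scipy import stats

# Distributions of Fig. 1 (Weibull location chosen so that the mean is 6)
unif = stats.uniform(loc=4.94, scale=2.12)
tri  = stats.triang(c=0.5, loc=4.5, scale=3.0)
norm = stats.norm(loc=6, scale=0.61)
shift = 6 - stats.weibull_min(c=1.103, scale=0.7).mean()
weib = stats.weibull_min(c=1.103, scale=0.7, loc=shift)

# b such that p = pr{X < b}: the quantile function (inverse cdf), p = 0.05
for d in (unif, tri, norm, weib):
    print(d.ppf(0.05))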
Therefore, the mean and variance of a random variable give very limited information on the values provided by the random
variable, unless additional information is at hand about the form of its density (pdf ). For example, if one knows that the
distribution is uniform or symmetrical triangular or normal, the random variable is completely characterized by its mean and
variance.
In practice, the pdf of a method of analysis is unknown. We only have a finite number, n, of measurements, which are the
outcomes obtained when applying repeatedly (n times) the same method to the same sample. These n measurements constitute a
statistical sample of the random variable X defined by the method of analysis.
Fig. 2 shows histograms of 100 results obtained when applying four methods of analysis, named A, B, C, and D, to aliquot parts
of a sample to determine an analyte. Clearly, the four methods behave differently.
From the experimental data, the (sample) mean and variance are computed as

x̄ = (Σᵢ₌₁ⁿ xᵢ)/n (5)

s² = Σᵢ₌₁ⁿ (xᵢ − x̄)²/(n − 1) (6)

x̄ and s² are estimates of the mean and variance of the distribution of X. These estimates with the data in Fig. 2 are shown in Table 2.
According to the model of Eq. (1), E(X) = μ + Δ ≈ x̄, that is, the sample mean estimates the true value μ plus the bias Δ. Assuming that the true value is μ = 6 and subtracting it from the sample means in the first row of Table 2, the bias estimated for methods A and B would be 0.66, and 0.16 for methods C and D. The bias of a method is one of its performance characteristics and must be evaluated during the validation of the method. In fact, technical guides, for example, the one by the International Organization for Standardization (ISO), state that, for a method, better trueness means less bias. To estimate the bias, it is necessary to have samples with known concentration μ (e.g., certified material, spiked samples).
The value of the variance is independent of the true content, μ, of the sample. For this reason, to estimate the variance, it is only
necessary to have replicated measurements on aliquot parts of the same sample. The second row of Table 2 shows that methods
B and C have the same variance, 1.26, which is 5 times greater than the one of methods A and D, 0.25. The dispersion of the data
obtained with a method is the precision of the method and constitutes another performance characteristic to be determined in the
validation of the method. In agreement with model in Eq. (1), a measure of the dispersion is the variance V(X), which is estimated
by means of s2.
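Eqs. (5) and (6) can be sketched directly; the five replicate values below are made up purely for illustration:

```python
# Sample mean (Eq. (5)) and sample variance with the n - 1 denominator (Eq. (6))
x = [5.9, 6.1, 6.0, 6.2, 5.8]   # illustrative replicate measurements
n = len(x)
mean = sum(x) / n
s2 = sum((xi - mean) ** 2 for xi in x) / (n - 1)
print(round(mean, 3), round(s2, 3))  # 6.0 0.025
```

Note the n − 1 denominator in Eq. (6), which makes s² an unbiased estimate of the population variance.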
In some occasions, for evaluating trueness and precision, it is more descriptive to use statistics other than mean and variance. For
example, when the distribution is rather asymmetric, as in Fig. 1D, it is more reasonable to use the median than the mean. The
median is the value in which the distribution accumulates 50% of the probability, 5.83 for the pdf in Fig. 1D and 6.00 for the other
three distributions, which are symmetric around their mean. In practice, it is frequent to see the presence of anomalous data
(outliers) that influence the mean and above all the variance, which is improperly increased; in these cases, it is advisable to use
robust estimates of central tendency and spread (dispersion).5–7 Details can be found in the chapter of the present book devoted to
robust procedures.
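A small made-up illustration of this point: a single anomalous result shifts the mean and greatly inflates the variance, while the median barely moves.

```python
import statistics

# Replicates with one anomalous result (illustrative data only)
clean = [6.1, 5.9, 6.0, 6.2, 5.8]
with_outlier = clean + [9.5]

print(statistics.mean(clean), statistics.median(clean))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
# The outlier improperly increases the (n - 1) variance by orders of magnitude
print(statistics.variance(clean), statistics.variance(with_outlier))
```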
Fig. 2 and Table 2 show that the two characteristics of a measurement method, trueness and precision, are independent of one another, in the sense that a method with better trueness (less bias), methods C and D, can be more (case D) or less (case C) precise.
Analogously, methods A and B have an appreciable bias but A is more precise than B. A method is said to be accurate when it is
precise and fulfills trueness.
Histograms are estimates of the pdf and allow evaluation of the performance of each method in a more detailed way than when
only considering trueness and precision. For example, the probability of obtaining values in any interval can be estimated with the
Fig. 2 Frequency histograms of 100 measures obtained with four different analytical methods, named (A), (B), (C), and (D), on aliquot parts of a sample. Dotted
vertical lines mark the interval [5.0, 7.0].
histogram. The third row in Table 2 shows the frequencies for the interval [5.0, 7.0]. Method D (best trueness and precision among
the four) provides 98% of the values in the interval, whereas method B (worst trueness and precision) provides only 56% of the
values in the interval. Nonetheless, trueness and precision should be considered jointly. According to the data in Table 2, the effect of increasing the precision by using method A instead of B when the bias is "high" is an increase of 14% in the proportion of results of the measurement method lying in the interval [5.0, 7.0], whereas when the bias is small (C and D), the increase is 40%. This behavior should be taken into account when optimizing a method and also in the ruggedness analysis, which is another performance characteristic to be validated according to most of the guides. As can be seen in the fourth row of Table 2, if the method that provides more results below 6 is needed, C would be the method selected.
The previous explanations show the usefulness of knowing the pdf of the results of a method of analysis. As in practice we have only a limited number of results, two basic strategies are possible to estimate the pdf: (1) to assess whether the experimental data are
compatible with a known distribution (e.g., normal) and then use the corresponding pdf; (2) to estimate the pdf by a data-driven
technique based on a computer-intensive method such as the kernel method8 or by using other methods such as adaptive or
penalized likelihood.9,10 The data of Fig. 2 can be adequately modeled by a normal distribution, according to normality hypothesis
tests whose details are explained later in Section “Goodness-of-Fit Tests: Normality Tests”. The fitted normal distributions are used
to compute the probabilities of obtaining values in the interval [5.0, 7.0] or less than 6, last two rows in Table 2. When comparing
these values with those computed with the empirical histograms (compare rows 3 and 5, and rows 4 and 6), there are no appreciable
differences and the normal pdf can be used instead.
In the validation of an analytical method and during its later use, statistical methodological strategies are needed to make decisions from the available experimental data. Knowledge of these strategies supposes a way of thinking and acting that, subordinated to the chemical knowledge, makes objective both the analytical results and their comparison with those of other researchers and/or other analytical methods.
Ultimately, a good method of analysis is a serious attempt to come close to the true value of the measurement, always unknown.
For this reason, the result of a measurement has to be accompanied by an evaluation of uncertainty or its degree of reliability. This is
done by means of a confidence interval. When the requirement is to establish the quality of an analytical method, then its capability
of detection, precision, etc. must be compared with those corresponding to other methods. This is formalized with a hypothesis test.
Confidence intervals and hypothesis tests are the basic tools in the validation of analytical methods.
In this introduction, the word sample has been used with two different meanings. Usually, there is no confusion because the
context allows one to distinguish whether it is a sample in the statistical or chemical sense.
In Chemistry, according to the International Union of Pure and Applied Chemistry (IUPAC) (Page 50 in Section 18.3.2 of Inczédy et al.11), "sample" should be used only when it is a portion selected from a large amount of material. This meaning coincides with that of a statistical sample and implies the existence of sampling error, that is, error caused by the fact that the sample
can be more or less representative of the amount in the material. For example, suppose that we want to measure the amount of
pesticide that remains in the ground of an arable land after a certain time. We take several samples “representative” of the ground of
the parcel (statistical sampling) and this introduces an uncertainty in the results characterized by a (theoretical) variance σ²s. Afterward, the quantity of pesticide in each chemical sample is determined by an analytical method, which has its own uncertainty, characterized by σ²m, in such a way that the uncertainty in the quantity of pesticide in the parcel is σ²s + σ²m, provided that the method
gives results independent of the location of the sample. Sometimes, when evaluating whether a method is adequate for a task, the
sampling error can be an important part of the uncertainty in the result and, of course, should be taken into account to plan the
experimentation.
When the sampling error is negligible, for example, when a portion is taken from a homogeneous solution, the IUPAC
recommends using words such as test portion, aliquot, or specimen.
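For instance, with hypothetical standard deviations σs = 0.30 (sampling) and σm = 0.20 (method), the variances of the two independent error sources add:

```python
import math

# Hypothetical (illustrative) standard deviations of the two error sources
sigma_s = 0.30   # sampling uncertainty (sd)
sigma_m = 0.20   # analytical method uncertainty (sd)

# Variances add for independent error sources: sigma_s^2 + sigma_m^2
total_var = sigma_s ** 2 + sigma_m ** 2
total_sd = math.sqrt(total_var)
print(round(total_var, 2), round(total_sd, 3))  # 0.13 0.361
```

Note that the standard deviations do not add; the combined spread (0.36) is well below σs + σm = 0.50.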
In summary, there is a clear link between a measurement method and a random variable, which is why probability is the natural form of expressing experimental uncertainty. This is thus the focus of the present article, which is organized as follows:
Section “Confidence and Tolerance Intervals” describes confidence intervals to measure bias and precision under the normality
hypothesis and tolerance intervals, useful in evaluating the fit for purpose of a method. Also, a nonparametric interval on the
median is described.
Section “Hypothesis Tests” is devoted to making decisions based on experimental data that, as such, are affected by uncertainty.
In this section, the computation of the power of a test is systematically proposed as a key element to evaluate the quality of the
decision at the desired significance level. A brief incursion into tests based on intervals is also made as they solve the problem of
deciding whether an interval of values is acceptable, for example, a relative error less than 10% in absolute value. The section ends
with some goodness-of-fit tests to evaluate the compatibility of a theoretical probability distribution with some experimental data.
Section “One-Way Analysis of Variance” is dedicated to the analysis of variance (ANOVA) for both fixed and random effects, and
in Section “Statistical Inference and Validation” some more specific questions related to the usual parameters of the analytical
method validation and their relation with the developed statistical methodologies are analyzed.
Mathematical proofs are not covered in this article and, to be operative from a practical point of view, several examples have been included so that the reader can verify the understanding of the formulas and the argumentation for their thoughtful use. This aspect is completed with the inclusion of an Appendix where some essential aspects related to the effectiveness of the statistical models and the limit laws are described. The Appendix also contains the necessary statements, in MATLAB code, to repeat all the calculations proposed along the article. The same statements are also available as supplementary material in the form of MATLAB .mlx live scripts (at least release R2016a is needed to read and execute them).
Confidence and Tolerance Intervals

There are some important questions when evaluating a method, for example, "in a given sample, what is the maximum value that it provides?" that, due to the random character of the results, cannot be answered with just a number.
In order to include the degree of certainty in the answer, the question should be reformulated as: What is the maximum value, U, that will be obtained 95% of the times that the method is used on the sample? The answer to the question thus posed would be a tolerance interval, and to build it the probability distribution must be known. For instance, let us suppose that it is a N(μ,σ) and we denote by z0.05 the critical value of a N(0,1) = Z distribution, the one that accumulates probability 0.95. Then, a possible answer is
U = μ + z0.05σ because then the probability that the analytical method gives values greater than U is pr{method > U} = pr{N(μ,σ) > μ + z0.05σ}, which, according to the result in the Appendix, is equal to pr{Z > z0.05} = 0.05. In general, for any percentage of results 100(1 − α)%, the maximum value provided by the method would be

U = μ + zα σ (7)

and the minimum value would be

L = μ − zα σ (8)

Finally, the interval [L, U] that contains 100(1 − α)% of the values obtained with the method would be

[L, U] = [μ − zα/2 σ, μ + zα/2 σ] (9)
An analytical example where one of these tolerance intervals with a normal distribution N(μ,σ) needs to be computed would be: An analytical method gives values (mg L−1) that follow a N(9, 0.5) distribution when measuring a standard with 9 mg L−1. To assess whether the method is still working properly, ten standards are included in the daily sequence of determinations. The probability distribution of the mean of these ten values is a N(9, 0.5/√10). Following Eq. (9), the tolerance interval at the 95% level is 9 ± 1.96 × 0.5/√10 = 9 ± 0.31 mg L−1. Consequently, if one day a mean of, say, 9.5 mg L−1 is obtained, the method does not work properly, because 9.5 does not belong to the tolerance interval and the method should be revised, at the risk of doing this revision uselessly 5% of the times. Notice that the tolerance interval is always the same, built at the desired confidence level 100(1 − α)% with the distribution N(9, 0.5/√10), and it is not updated daily with the new samples.
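The computation in this daily-check example can be sketched as follows (Python for illustration; the article's own scripts are MATLAB live scripts):

```python
import math

# Tolerance interval for the daily mean of 10 standards from a N(9, 0.5) method
mu, sigma, n = 9.0, 0.5, 10
z = 1.96                       # z_{alpha/2} for a two-sided 95% interval
half = z * sigma / math.sqrt(n)
lo, hi = mu - half, mu + half
print(lo, hi)                  # approximately [8.69, 9.31]

# A daily mean of 9.5 mg/L falls outside, so the method is flagged for revision
print(lo <= 9.5 <= hi)         # False
```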
Unlike the interval of Eq. (9), two variants of tolerance intervals, namely the β-content and the β-expectation tolerance intervals, are explained in Section "Tolerance Intervals" due to their relevance in the context of validation of analytical methods. In any case, any of them is completely different from the confidence intervals introduced and developed in the following sections (from Section "Confidence Interval" to Section "Joint Confidence Intervals").
After explaining all the studied cases, the section finishes with a comparative analysis of both concepts (tolerance and confidence
intervals).
Confidence Interval
We have already remarked that estimation of solely the mean, x̄, and variance, s², from n independent results provides very limited information on the method performance. The objective now is to make affirmations of the type "in the sample, the amount of the analyte μ, estimated by x̄, is between L and U (μ ∈ [L, U])" with a certain probability that the statement is true. Following this particular example, we should consider that x̄ is a value taken by the random variable X̄ (sample mean) and use its distribution to answer the new question. Its distribution function is obtained mathematically from the one of X, FX(x), and thus depends on the information we have about FX(x) (e.g., whether the variance is known or should also be estimated, etc.).
In the general case, with a random variable X, obtaining a confidence interval for X from a sample x1, x2, . . ., xn consists of obtaining two functions l(x1, x2, . . ., xn) and u(x1, x2, . . ., xn) such that

pr{X ∈ [l, u]} = pr{l ≤ X ≤ u} = 1 − α (10)

1 − α is the confidence level and α is the significance level, meaning that the statement that the value of X is between l and u will be false 100α% of the times.
In the next sections this idea will be particularized for some different cases, according to the random variable X of interest. Fig. 3
is a diagram that summarizes the cases studied in the following sections. All the examples are written in MATLAB live-script file
Intervals_section1022_live.mlx, in the supplementary material, so that they can be easily repeated or adapted for the reader’s
own data.
Confidence Interval on the Mean of a Normal Distribution
Case 1: Known variance
If X follows a N(μ,σ) distribution, the sample mean X̄ of n values follows a N(μ, σ/√n) distribution, so that

pr{μ − zα/2 σ/√n ≤ X̄ ≤ μ + zα/2 σ/√n} = 1 − α (11)

that is, 100(1 − α)% of the values of the sample mean are in the interval in Eq. (11). A simple algebraic manipulation (subtract μ and X̄, multiply by −1) gives
Fig. 3 Diagram summarizing the different cases for computing confidence intervals.
pr{X̄ − zα/2 σ/√n ≤ μ ≤ X̄ + zα/2 σ/√n} = 1 − α (12)

Therefore, according to Eq. (10), the confidence interval on the mean that is obtained from Eq. (12) is

[X̄ − zα/2 σ/√n, X̄ + zα/2 σ/√n] (13)
Analogously, the confidence intervals at confidence level 100(1 − α)% for the maximum and minimum values of the mean are computed from Eqs. (14), (15), respectively,

pr{μ ≤ X̄ + zα σ/√n} = 1 − α (14)

pr{X̄ − zα σ/√n ≤ μ} = 1 − α (15)

and, thus, the corresponding one-sided intervals would be (−∞, X̄ + zα σ/√n] and [X̄ − zα σ/√n, +∞).
In an experimental context, when measuring n aliquot parts of a test sample, we obtain n values x1, x2, . . ., xn. Their sample mean x̄ is the particular value taken by the random variable X̄ and is also an estimate of the true value μ.
Example 1: Suppose that an analytical method follows a N(μ,4) and we have a sample of size 10 with values 98.87, 92.54, 99.42, 105.66, 98.70, 97.23, 98.44, 103.73, 94.45 and 101.08. With this sample, the mean is 99.01 and, using Eq. (13), the interval at the 95% confidence level is [99.01 − 1.96 × 4/√10, 99.01 + 1.96 × 4/√10] = [96.53, 101.49].
For the interpretation of this interval, notice that with different samples of size 10 (same analytical method), different intervals
will be obtained at the same 95% confidence level. The endpoints of these intervals are nonrandom values, and the unknown mean
value, which is also a specific value, will or will not belong to the interval. Therefore, the affirmation “the interval contains the
mean” is a deterministic assertion that is true or false for each of the intervals. What one knows is that it is true for 100(1 − α)% of those intervals. In our case, as 95% of the constructed intervals will contain the true value, we say, at the 95% confidence level, that the interval [96.53, 101.49] contains μ.
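The interval of Example 1 is a one-line computation. The following sketch reproduces it in Python (the chapter's own examples are provided as the MATLAB live script Intervals_section1022_live.mlx); the function name is ours and z_{0.025} = 1.96 is hardcoded from the tables.

```python
from math import sqrt

# Two-sided confidence interval on the mean with known sigma, Eq. (13).
def ci_mean_known_sigma(data, sigma, z_half_alpha=1.96):
    n = len(data)
    xbar = sum(data) / n
    half_width = z_half_alpha * sigma / sqrt(n)  # z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

data = [98.87, 92.54, 99.42, 105.66, 98.70, 97.23, 98.44, 103.73, 94.45, 101.08]
lo, hi = ci_mean_known_sigma(data, sigma=4)
print(round(lo, 2), round(hi, 2))  # 96.53 101.49
```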
This is the interpretation of the frequentist approach adopted in this article; that is to say, the information on random variables is obtained by means of samples of them, and the parameters to be estimated are unknown but fixed amounts (e.g., the amount of analyte in a sample, μ, is estimated by the measurement results obtained by analyzing it n times). With a Bayesian approach to the problem, a probability distribution is attributed to the amount of analyte μ and, once an interval of interest [a,b] is fixed, the "a priori" distribution of μ, the experimental results, and Bayes' theorem are used to calculate the "a posteriori" probability that μ belongs to the interval [a,b]. It has been shown that, although in most practical cases the uncertainty intervals obtained from repeated measurements using either theory may be similar, their interpretation is completely different. The works by Lira and Wöger12 and Zech13 are devoted to comparing both approaches from the point of view of the experimental data and their uncertainty. Also, an introduction to Bayesian methods for analyzing chemical data can be found in Armstrong and Hibbert.14,15
pr{X̄ − t_{α/2,ν} s/√n ≤ μ ≤ X̄ + t_{α/2,ν} s/√n} = 1 − α   (16)

where t_{α/2,ν} is the upper percentage point (100α/2 %) of the Student t distribution with ν = n − 1 d.f. and s is the sample standard deviation. Analogously, the one-sided intervals at the 100(1 − α)% confidence level come from

pr{μ ≤ X̄ + t_{α,ν} s/√n} = 1 − α   (17)

pr{X̄ − t_{α,ν} s/√n ≤ μ} = 1 − α   (18)
Example 2: Suppose that the probability distribution of an analytical method is normal, but its standard deviation is unknown. With the data of Example 1, the sample standard deviation, s, is computed as 3.90. As t_{0.025,9} = 2.262 (see Appendix), the confidence interval at 95% level is [99.01 − 2.262 × 1.24, 99.01 + 2.262 × 1.24] = [96.21, 101.81]. The 95% confidence interval on the minimum of the mean (i.e., the 95.0% lower confidence bound) is made up, according to Eq. (18), of all the values greater than 96.74 = 99.01 − 1.833 × 1.24. The corresponding interval on the maximum (upper confidence bound for the mean), Eq. (17), is made up of the values less than 101.28 = 99.01 + 1.833 × 1.24.
The length of the confidence intervals from Eqs. (12)–(15) is a function of the sample size and tends towards zero when the sample size tends to infinity. This functional relation permits the computation of the sample size needed to obtain an interval of given length, d. It suffices to consider d/2 = z_{α/2} σ/√n and take as n the nearest integer greater than (2 z_{α/2} σ/d)². For example, if we want a 95% confidence interval with length d less than 2, in the hypothesis of Example 1, we will need a sample size greater than or equal to 62.
The same argument can be applied when the standard deviation is unknown. However, in this case, to compute n by (2 t_{α/2,ν} s/d)² it is necessary to have an initial estimation of s, which, in general, is obtained in a pilot study of size n₀, in such a way that in the previous expression the d.f., ν, are n₀ − 1. An alternative is to define the desired length of the interval in standard deviation units (remember that the standard deviation is unknown). For instance, in Example 2, if we want d = 0.5s, we will need a sample size greater than (4 z_{α/2})² = 61.5; note the substitution of t_{α/2,ν} by z_{α/2}, which is mandatory because we do not have the sample size needed to compute t_{α/2,ν}, which is precisely what we want to estimate.
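The sample-size rule above can be sketched as follows; this is a minimal Python illustration (the function name is ours), using z_{0.025} = 1.96 and the two cases just discussed.

```python
from math import ceil

# Sample size for a two-sided CI of total length at most d:
# d/2 = z * sigma / sqrt(n)  =>  n = (2 * z * sigma / d)^2, rounded up.
def n_for_ci_length(sigma, d, z_half_alpha=1.96):
    return ceil((2 * z_half_alpha * sigma / d) ** 2)

print(n_for_ci_length(sigma=4, d=2))    # 62, the Example 1 case
# Length expressed in sigma units, d = 0.5*s: n > (4 z)^2 = 61.5
print(n_for_ci_length(sigma=1, d=0.5))  # 62
```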
where χ²_{α/2,ν} is the critical value of a χ² distribution with ν = n − 1 d.f. at significance level α/2. As in the previous case for the sample mean, we should distinguish between the random variable sample variance S² and one of its values, s², computed with Eq. (6) from sample x1, x2, . . ., xn.
The intervals for the maximum and minimum of the variance at 100(1 − α)% confidence level are obtained from Eqs. (20), (21), respectively.

pr{σ² ≤ (n − 1)S²/χ²_{1−α,ν}} = 1 − α   (20)

pr{(n − 1)S²/χ²_{α,ν} ≤ σ²} = 1 − α   (21)
Example 3: Knowing that the n = 10 data of Example 2 come from a normal distribution with both mean and variance unknown, the 95% confidence interval on σ² is found from Eq. (19) to be [7.21, 50.81] because s² = 15.25, χ²_{0.025,9} = 19.02, and χ²_{0.975,9} = 2.70.
If the analyst is interested in obtaining a confidence interval for the maximum variance, the 95% upper confidence interval is found from Eq. (20) to be [0, 41.27] because χ²_{0.95,9} = 3.33, that is, the upper bound for the variance is 41.27 with 95% confidence. Notice the lower bound at 0. To obtain confidence intervals on the standard deviation, it suffices to take the square root of the aforementioned intervals because this operation is a monotonically increasing transformation; therefore, the intervals at 95% confidence level on the standard deviation are [2.69, 7.13] and [0, 6.42], respectively.
The sample size, n, needed so that s²/σ² is between 1 − k and 1 + k is given by the nearest integer greater than 1 + ½[z_{α/2}(√(1 + k) + 1)/k]². For example, for k = 0.5, such that the length of the confidence interval verifies 0.5 < s²/σ² < 1.5, we would need n = 40 data (at least). Just for comparative purposes, we will admit in the example that with the sample of size 40 we obtain the same variance s² = 15.25. As χ²_{0.025,39} = 58.12 and χ²_{0.975,39} = 23.65, the two-sided interval at 95% confidence level is now [10.23, 25.15], which verifies the required specifications.
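The χ²-based interval of Example 3 can be sketched in Python as below; the helper name is ours and the critical values χ²_{0.025,9} = 19.02 and χ²_{0.975,9} = 2.70 are taken from the Appendix tables (with the rounded s² = 15.25 the endpoints differ from the text's [7.21, 50.81] only in the last decimal).

```python
# Two-sided CI on the variance of a normal distribution (Eq. 19):
# [(n-1) s^2 / chi2_{alpha/2, n-1}, (n-1) s^2 / chi2_{1-alpha/2, n-1}]
def ci_variance(s2, n, chi2_upper_tail, chi2_lower_tail):
    return (n - 1) * s2 / chi2_upper_tail, (n - 1) * s2 / chi2_lower_tail

lo, hi = ci_variance(s2=15.25, n=10, chi2_upper_tail=19.02, chi2_lower_tail=2.70)
print(round(lo, 2), round(hi, 2))  # close to [7.21, 50.81] quoted in the text
```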
where X̄₁ and X̄₂ are the random variables of the sample mean, which take the values x̄₁ and x̄₂. The reader can easily write the expressions analogous to Eqs. (14), (15) for the one-sided intervals.
The 100(1 − α)% confidence interval is obtained from the following equation:

pr{X̄₁ − X̄₂ − t_{α/2,ν} s_p √(1/n₁ + 1/n₂) ≤ μ₁ − μ₂ ≤ X̄₁ − X̄₂ + t_{α/2,ν} s_p √(1/n₁ + 1/n₂)} = 1 − α   (24)
where ν = n₁ + n₂ − 2 are the d.f. of the Student's t distribution. The one-sided intervals at 100(1 − α)% confidence level have the analogous expressions deduced from Eq. (24) by substituting t_{α/2,ν} by t_{α,ν}. If a fixed length is desired for the confidence interval, the computation explained in Section "Confidence Interval on the Mean of a Normal Distribution" can be immediately adapted to obtain the needed sample size.
Example 4: We want to study the stability of a substance after being stored for a month. Here, stability means that the content of the substance remains unchanged. Two series of measurements (n₁ = n₂ = 8) were carried out before and after the storage period and we will estimate the difference in means by a 95% confidence interval. The results were x̄₁ = 90.8, s₁² = 3.89 and x̄₂ = 92.7, s₂² = 4.02, respectively. Therefore, the two-sided interval when assuming equal variances (s_p² = 3.96, Eq. (23)) is (90.8 − 92.7) ± 2.1448 × √3.96 × √(1/8 + 1/8), that is, [−4.03, 0.23]. Therefore, at 95% confidence level, the difference of the means belongs to the interval, which includes the null difference, that is, the substance is stable.
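Example 4 can be reproduced with a short Python sketch (function name ours); the pooled variance follows Eq. (23) and t_{0.025,14} = 2.1448 is the value quoted in the text.

```python
from math import sqrt

# Two-sample CI on mu1 - mu2 with equal (pooled) variances, Eq. (24).
def pooled_ci(x1, s2_1, n1, x2, s2_2, n2, t_half_alpha):
    # Pooled variance, Eq. (23)
    sp2 = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / (n1 + n2 - 2)
    half = t_half_alpha * sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
    diff = x1 - x2
    return diff - half, diff + half

lo, hi = pooled_ci(90.8, 3.89, 8, 92.7, 4.02, 8, t_half_alpha=2.1448)
print(round(lo, 2), round(hi, 2))  # -4.03 0.23
```

Since zero lies inside [−4.03, 0.23], the data are compatible with a null difference, the conclusion drawn in the example.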
When the assumption σ₁² = σ₂² is not reasonable, we can still obtain an interval on the difference μ₁ − μ₂ by using the fact that the statistic (X̄₁ − X̄₂ − (μ₁ − μ₂))/√(s₁²/n₁ + s₂²/n₂) is distributed approximately as a t with d.f. given by

ν = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1)]   (25)
The 100(1 − α)% confidence interval is obtained from the following equation:

pr{X̄₁ − X̄₂ − t_{α/2,ν} √(s₁²/n₁ + s₂²/n₂) ≤ μ₁ − μ₂ ≤ X̄₁ − X̄₂ + t_{α/2,ν} √(s₁²/n₁ + s₂²/n₂)} = 1 − α   (26)
Example 5: We want to compute a confidence interval on the difference of two means with unknown and unequal variances, with the results that come from an experiment carried out with four aliquot samples by two different analysts. The first analyst obtains x̄₁ = 3.285, and the second x̄₂ = 3.257. The variances were s₁² = 3.33 × 10⁻⁵ and s₂² = 9.17 × 10⁻⁵, respectively. Assuming that σ₁² ≠ σ₂², Eq. (25) gives ν = 4.9, so the d.f. to apply Eq. (26) are 5 and t_{0.025,5} = 2.571. Thus, the 95% confidence interval is (3.285 − 3.257) ± 2.571 × √(3.33 × 10⁻⁵/4 + 9.17 × 10⁻⁵/4), that is, [0.014, 0.042]. So, at 95% confidence, the two analysts provide unequal measurements because zero is not in the interval.
The confidence intervals for the maximum and the minimum are obtained by considering the last or the first term, respectively, in Eq. (26) and replacing t_{α/2,ν} by t_{α,ν}.
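The approximate d.f. of Eq. (25) and the interval of Example 5 can be checked with the following Python sketch (helper name ours; t_{0.025,5} = 2.571 from the text).

```python
from math import sqrt

# Welch-Satterthwaite degrees of freedom, Eq. (25).
def welch_df(s2_1, n1, s2_2, n2):
    num = (s2_1 / n1 + s2_2 / n2) ** 2
    den = (s2_1 / n1) ** 2 / (n1 - 1) + (s2_2 / n2) ** 2 / (n2 - 1)
    return num / den

nu = welch_df(3.33e-5, 4, 9.17e-5, 4)
half = 2.571 * sqrt(3.33e-5 / 4 + 9.17e-5 / 4)  # half-width, Eq. (26)
diff = 3.285 - 3.257
print(round(nu, 1), round(diff - half, 3), round(diff + half, 3))  # 4.9 0.014 0.042
```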
pr{d̄ − t_{α/2,ν} s_d/√n ≤ μ_d ≤ d̄ + t_{α/2,ν} s_d/√n} = 1 − α   (27)

where d̄ and s_d are the mean and standard deviation of the differences dᵢ and ν = n − 1 are the d.f. of the t distribution.
pr{F_{1−α/2,ν₁,ν₂} (S₁²/S₂²) ≤ σ₁²/σ₂² ≤ F_{α/2,ν₁,ν₂} (S₁²/S₂²)} = 1 − α   (28)

where F_{1−α/2,ν₁,ν₂} and F_{α/2,ν₁,ν₂} are the critical values (upper tail) of an F distribution with ν₁ = n₂ − 1 d.f. in the numerator and ν₂ = n₁ − 1 d.f. in the denominator. The Appendix contains a description of some relevant properties of the F distribution.
We can also compute one-sided confidence intervals. The 100(1 − α)% upper or lower confidence bound on σ₁²/σ₂² is obtained from Eqs. (29), (30), respectively. Remember that, when computing the intervals by using Eq. (29), the lower bound is always 0.

pr{σ₁²/σ₂² ≤ F_{α,ν₁,ν₂} (S₁²/S₂²)} = 1 − α   (29)

pr{F_{1−α,ν₁,ν₂} (S₁²/S₂²) ≤ σ₁²/σ₂²} = 1 − α   (30)
Example 6: In this example, we compute a two-sided 95% confidence interval for the ratio of the variances in Example 4 (n₁ = n₂ = 8, s₁² = 3.89, s₂² = 4.02). The resulting interval is [0.20 × (3.89/4.02), 4.99 × (3.89/4.02)] = [0.19, 4.83]. As 1 belongs to this interval, we can admit that both variances are equal.
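Example 6 reduces to scaling the observed variance ratio by the two F critical values; a minimal Python sketch follows, using F_{0.025,7,7} = 4.99 from the text and the reciprocal relation F_{1−α,ν₁,ν₂} = 1/F_{α,ν₂,ν₁} (here the d.f. are equal, so the reciprocal of 4.99 gives the 0.20 quoted in the text).

```python
# Two-sided CI on sigma1^2/sigma2^2, Eq. (28), with Example 6 data.
ratio = 3.89 / 4.02        # observed s1^2 / s2^2
f_upper = 4.99             # F_{0.025,7,7} (text value)
f_lower = 1 / f_upper      # F_{0.975,7,7} ~ 0.20 by the reciprocal property
lo, hi = f_lower * ratio, f_upper * ratio
print(round(lo, 2), round(hi, 2))  # 0.19 4.83
```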
1. To sort the data in ascending order. In our case, 92.54, 94.45, 97.23, 98.44, 98.70, 98.87, 99.42, 101.08, 103.73, and 105.66. The rank of each datum is the position that it occupies in the sorted list; for example, the rank of 98.44 is four.
2. To calculate the rank, r_l, of the value that will be the lower endpoint of the interval. It is the nearest integer less than ½(n − z_{α/2}√n + 1). In our case, this value is 0.5 × (10 − 1.96√10 + 1) = 2.40, thus r_l = 2.
Tolerance Intervals
In the introduction to the present Section "Confidence and Tolerance Intervals", the tolerance intervals of a normal distribution have been calculated knowing its mean and variance. Remember that the tolerance interval [l, u] contains 100(1 − α)% of the values of the distribution of X or, equivalently, pr{X ∉ [l, u]} = α. Actually, the values of the parameters that define the probability distribution are unknown and this uncertainty should be transferred into the endpoints of the interval. There are several types of tolerance regions, but in this article, we will restrict ourselves to two common cases.
Expressed in words, [l, u] contains at least 100β% of the values of X with confidence level γ. For the case of an analytical method, this is to say that we have to determine, based on a sample of size n, for instance, the interval that will contain 95% (β = 0.95) of the results, and this assertion must be true 90% of the times (γ = 0.90). Evidently, β-content tolerance intervals can be one-sided, which means that the procedure will provide 95% of its results above l (respectively, below u) 90% of the times. We leave to the reader the corresponding formal definitions.
One-sided and two-sided β-content tolerance intervals can be computed either by controlling the center or by controlling the tails, and for both continuous and discrete random variables (a review can be seen in Patel17 and applications in Analytical Chemistry in Meléndez et al.18 and Reguera et al.19).
Here we will only describe the case of a normally distributed X with unknown mean and variance. From this distribution, we have a sample of size n that is used to compute the mean x̄ and standard deviation s. We want to obtain a two-sided β-content tolerance interval controlling the center, that is, an interval such that
To determine k, several approximations have been reported; consult Patel17 for a discussion on them. The approach by Wald and Wolfowitz20 is based on determining k₁ such that

pr{N(0,1) ≤ 1/√n + k₁} − pr{N(0,1) ≤ 1/√n − k₁} = β   (33)

Therefore

k = k₁ √((n − 1)/χ²_{γ,n−1})   (34)

where χ²_{γ,n−1} is the point exceeded with probability γ when using the χ² distribution with n − 1 d.f.
Example 7: With the data in Example 1, and β = γ = 0.95, we have x̄ = 99.01, s = 3.91, k₁ = 2.054, and χ²_{0.95,9} = 3.33; thus, according to Eq. (34), k = 3.379 and, as a consequence, the interval [99.01 − 3.38 × 3.91, 99.01 + 3.38 × 3.91] = [85.79, 112.23] contains 95% of the results of the method 95% of the times that the procedure is repeated with a sample of size 10.
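The Wald-Wolfowitz factor of Eqs. (33)-(34) can be computed numerically; the following Python sketch (function names ours) solves Eq. (33) for k₁ by bisection using the standard normal CDF built from math.erf, and uses χ²_{0.95,9} = 3.325 (rounded to 3.33 in the text).

```python
from math import erf, sqrt

def phi(x):
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def wald_wolfowitz_k(n, beta, chi2_gamma):
    # Solve Eq. (33) for k1 by bisection, then scale by Eq. (34).
    a = 1.0 / sqrt(n)
    lo, hi = 0.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if phi(a + mid) - phi(a - mid) < beta:
            lo = mid
        else:
            hi = mid
    k1 = (lo + hi) / 2
    return k1 * sqrt((n - 1) / chi2_gamma)

k = wald_wolfowitz_k(n=10, beta=0.95, chi2_gamma=3.325)
print(round(k, 2))  # 3.38, matching k = 3.379 in Example 7
```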
Unlike the β-content tolerance interval, the condition in Eq. (35) only demands that, on average, the probability that the random variable takes values between l and u is β.
As in the previous case, we limit ourselves to obtaining intervals of the form [x̄ − ks, x̄ + ks]. When the distribution of the random variable is normal and we have a sample of size n, the solution was obtained for the first time by Wilks21 and is

k = t_{(1−β)/2,ν} √((n + 1)/n)   (36)

where t_{(1−β)/2,ν} is the upper (1 − β)/2 point of the t distribution with ν = n − 1 d.f.
Example 7 (continuation): With the same data, the 95% expectation tolerance interval would be [99.01 − 2.37 × 3.91, 99.01 + 2.37 × 3.91] = [89.74, 108.28] as now k is directly computed with the critical value t_{0.025,9} = 2.262.
This interval is shorter than the β-content tolerance interval because it only assures the expected value (the mean) of the probabilities that the individual values belong to the interval. In fact, the interval [89.74, 108.28] contains 95% of the values of X only 64% of the times, a conclusion drawn by applying Eq. (32) with k = 2.37. Also, note that when the sample size tends to infinity, the value of k in Eq. (36) tends towards z_{(1−β)/2}, which defines the theoretical interval that, in our example, would be [91.35, 106.67], obtained by substituting k by z_{0.025} = 1.96.
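Eq. (36) is a one-line calculation; the sketch below reproduces the continuation of Example 7 in Python, with t_{0.025,9} = 2.262 from the Appendix (the text's [89.74, 108.28] uses k rounded to 2.37, so the last decimals differ slightly).

```python
from math import sqrt

# beta-expectation tolerance interval, Eq. (36), Example 7 data.
n, xbar, s = 10, 99.01, 3.91
k = 2.262 * sqrt((n + 1) / n)   # t_{(1-beta)/2, n-1} * sqrt((n+1)/n)
lo, hi = xbar - k * s, xbar + k * s
print(round(k, 2))  # 2.37
```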
them is clear: the confidence interval is the set that is supposed to contain (with 100(1 − α)% confidence) the true value of the unknown parameter; the tolerance interval is the set that contains 100β% of the values taken by the random variable, with a given confidence γ.
In particular, confidence intervals must be used in the process of evaluating trueness and precision of a method when there is no
need to fulfill external requirements but just to compare with other methods or to quantify uncertainty and bias of the results
obtained with it.
A usual error is to mistakenly consider a confidence interval as a tolerance interval, although the difference between them is important. For instance, with the data of Example 7, notice that to compute the confidence interval, the standard deviation of the mean is estimated as s/√n = 1.24, whereas the standard deviation of the individual results of the method is estimated as s = 3.91, which is very different.
Also, it is important to remember that when the sample size n tends to infinity, the length of a confidence interval tends toward
zero, independently of the chosen confidence level. For example, with the confidence intervals for the mean, in the limit we will
have x̄ = μ, thus the estimator and the true parameter will be equal for sure (1 − α = 1). On the contrary, the length of a β-content tolerance interval does not tend towards zero when increasing the sample size but tends to the interval that contains for sure (γ = 1) 100β% of the values.
There are other aspects of the determination of the uncertainty that are of practical interest, for example, the problem that arises
by the fact that any uncertainty interval, particularly an expanded uncertainty interval, should be restricted to the range of feasible
values of the measurand. Cowen and Ellison35 analyzed how to modify the interval when the data are close to a natural limit in a
feasible range such as 0 or 100% mass or mole fraction.
Hypothesis Tests
This section is devoted to the introduction of a statistical methodology to decide whether an affirmation is false, for example, the affirmation "this method of analysis applied to this reference sample provides the certified value". If, on the basis of the experimental results, it is decided that the affirmation is false, we will conclude that the method has bias. The affirmation is customarily called a hypothesis and the procedure of decision making is called hypothesis testing. A statistical hypothesis is an assertion about the probability distribution followed by a random variable. Sometimes one has to decide on a parameter, for example, whether the mean of a normal distribution is a specific value. On other occasions it may be required to decide on other characteristics of the distribution, for example, whether the experimental data are compatible with the hypothesis that they come from a normal or uniform distribution.
The statement "μ = 2.00" in Eq. (37) is called the null hypothesis, denoted H₀, and the statement "μ < 2.00" is called the alternative hypothesis, H₁. As the alternative hypothesis specifies values of μ that are less than 2.00, it is called a one-sided alternative. In some situations, we may wish to formulate a two-sided alternative hypothesis to specify values of μ that could be either greater or less than 2.00, as in

H₀: μ = 2.00
H₁: μ ≠ 2.00   (38)
The hypotheses are not affirmations about the sample but about the distribution from which those values come; that is to say, μ is the unknown value of the pH of the solution, which will be the same as the value provided by the procedure if the bias is zero (see the model of Eq. (1)). In general, to test a hypothesis, the analyst must consider the experimental goal and define, accordingly, the null hypothesis for the test, as in Eq. (37). Hypothesis-testing procedures rely on using the information in a random sample; if this information is inconsistent with the null hypothesis, we conclude that the hypothesis is false. If there is not enough evidence to prove falseness, the test defaults to the decision of not rejecting the null hypothesis, though this does not actually prove that it is correct. It is therefore critical to choose the null hypothesis carefully in each problem.
9   H₀: μ₁ = μ₂    H₁: μ₁ ≠ μ₂    t_calc = (x̄₁ − x̄₂)/(s_p √(1/n₁ + 1/n₂))    {t_calc < −t_{α/2,n₁+n₂−2}} ∪ {t_calc > t_{α/2,n₁+n₂−2}}
10  H₀: μ₁ = μ₂    H₁: μ₁ > μ₂    (statistic as in row 9)    {t_calc > t_{α,n₁+n₂−2}}
11  H₀: μ_d = 0    H₁: μ_d ≠ 0    t_calc = d̄/(s_d/√n)    {t_calc < −t_{α/2,n−1}} ∪ {t_calc > t_{α/2,n−1}}
12  H₀: μ_d = 0    H₁: μ_d > 0    (statistic as in row 11)    {t_calc > t_{α,n−1}}
13  H₀: σ² = σ₀²   H₁: σ² ≠ σ₀²   χ²_calc = (n − 1)s²/σ₀²    {χ²_calc < χ²_{1−α/2,n−1}} ∪ {χ²_calc > χ²_{α/2,n−1}}
14  H₀: σ² = σ₀²   H₁: σ² > σ₀²   (statistic as in row 13)    {χ²_calc > χ²_{α,n−1}}
15  H₀: σ₁² = σ₂²  H₁: σ₁² ≠ σ₂²  F_calc = s₁²/s₂²    {F_calc < F_{1−α/2,n₁−1,n₂−1}} ∪ {F_calc > F_{α/2,n₁−1,n₂−1}}
16  H₀: σ₁² = σ₂²  H₁: σ₁² > σ₂²  (statistic as in row 15)    {F_calc > F_{α,n₁−1,n₂−1}}

The values z_α are the percentiles of a standard normal distribution such that α = pr{N(0,1) > z_α}. The values t_{α,ν} are the percentiles of a Student's t distribution with ν degrees of freedom such that α = pr{t > t_{α,ν}}. The values χ²_{α,ν} are the percentiles of a χ² distribution with ν degrees of freedom such that α = pr{χ² > χ²_{α,ν}}. The values F_{α,ν₁,ν₂} are the percentiles of an F distribution with ν₁ degrees of freedom for the numerator and ν₂ degrees of freedom for the denominator, such that α = pr{F > F_{α,ν₁,ν₂}}. s_p² is the pooled variance defined in Eq. (23). d̄ is the mean of the differences dᵢ = x₁ᵢ − x₂ᵢ between the paired samples; s_d is their standard deviation.
In practice, to test a hypothesis, we must take a random sample, compute an appropriate test statistic from the sample data, and then use the information contained in this statistic to make a decision. However, as the decision is based on a random sample, it is subject to error. Two kinds of potential errors may be made when testing hypotheses. If the null hypothesis is rejected when it is true, then a type I error has been made. A type II error occurs when the researcher accepts the null hypothesis when it is false. The situation is described in Table 3.
In Example 8, if the experimental data lead to rejecting the null hypothesis when it is true, our (wrong) conclusion is that the pH of the solution is less than 2. A type I error has been made and the analyst will use the solution in the procedure when in fact it is not chemically valid. If, on the contrary, the experimental data lead to accepting the null hypothesis when it is false, the analyst
will not use the solution when in fact the pH is less than 2 and a type II error has been made. Note that both types of error have to be
considered because their consequences are very different. In the case of type I error, an unsuitable solution is accepted, the procedure
will be inadequate, and the analytical result will be wrong with the subsequent damages that it may cause (e.g., the loss of a client, or
a mistaken environmental diagnosis). On the contrary, the type II error implies that a valid solution is not used with the
corresponding extra cost of the analysis. It is clear that the analyst has to specify the assumable risk of making these errors, and
this is done in terms of the probability that they will occur.
The probabilities of occurrence of type I and II errors are denoted by specific symbols, defined in Eq. (39). The probability α of the test is called the significance level, and the power of the test is 1 − β, which measures the probability of correctly rejecting the null hypothesis.

α = pr{type I error} = pr{reject H₀ | H₀ is true}
β = pr{type II error} = pr{accept H₀ | H₀ is false}   (39)
In Eq. (39), the symbol "|" indicates that the probability is calculated under that condition. In the example we are following, α will be calculated with the normal distribution of mean 2 and standard deviation 0.55.
Statistically expressed, with the n = 10 results in Example 8 (sample mean x̄ = 1.708), one wants to decide about the value of the mean of a normal distribution with known variance and one-sided alternative hypothesis (a one-tail test).
With these premises, the related statistic is written in Table 4 (second row) and gives Z_calc = (x̄ − μ₀)/(σ/√n) = (1.708 − 2.0)/(0.55/√10) = −1.679.
In addition, the analyst must assume the risk α, say 0.05. This means that the decision rule that is going to be applied to the experimental results will accept an inadequate (chemical) solution 5% of the times. Therefore, the critical or rejection region is written in Table 4, second row, as CR = {Z_calc < −1.645}, meaning that the null hypothesis will be rejected for the samples of size 10 that provide values of the statistic less than −1.645. In the example, the actual value Z_calc = −1.679 belongs to the critical region; thus, the decision is to reject the null hypothesis at the 5% significance level.
Given the present facilities of computation, instead of the CR, the available statistical software calculates the so-called P-value, which is the probability, under the null hypothesis H₀, of obtaining a value of the statistic at least as extreme as the one observed. In our case, P-value = pr{Z ≤ −1.679} = 0.0466. When the P-value is less than the significance level α, the null hypothesis is rejected because this is the same as saying that the value of the statistic belongs to the critical region.
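The statistic and P-value of this one-tail z-test can be computed with the standard normal CDF alone; the Python sketch below (function name ours) builds it from math.erf.

```python
from math import erf, sqrt

def phi(x):
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# One-tail z-test of Example 8: H0: mu = 2.00 vs H1: mu < 2.00
xbar, mu0, sigma, n = 1.708, 2.00, 0.55, 10
z_calc = (xbar - mu0) / (sigma / sqrt(n))
p_value = phi(z_calc)  # pr{Z <= z_calc} under H0 (lower-tail alternative)
print(round(z_calc, 3), round(p_value, 4))  # -1.679 0.0466
```

Since 0.0466 < 0.05, H₀ is rejected at the 5% significance level, as in the text.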
The next question that immediately arises is about the power of the applied decision rule (statistic and critical region).
To calculate b, defined in Eq. (39), it is necessary to exactly specify the meaning of the alternative hypothesis. In our case, what is
meant by pH smaller than 2. From a mathematical point of view, the answer is clear: any number less than 2, for example, 1.9999
which clearly does not make sense from the point of view of the analyst. In this context, sometimes due to previous knowledge, in
other cases because of the regulatory stipulations or simply by the detail of the standardized work procedure, the analyst can decide
the value of pH that is considered to be less than 2.00, for example, a pH less than 1.60. This is the same as assuming that “pH equal
to 2” is any smaller value whose distance to 2 is less than 0.40. In these conditions,
β = pr{N(0,1) < z_α − |d|√n/σ}   (40)
where |d| = 0.40 in our problem; replacing it in Eq. (40), we have β = 0.26 (calculations can be seen in Example A9 of the Appendix). That is to say, whatever the decision made, the decision rule leads to throwing a valid chemical solution away 26% of the times. Evidently, this is an inadequate rule.
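Eq. (40) can be evaluated directly with the standard normal CDF; the following Python sketch (function names ours, z_{0.05} = 1.645 hardcoded) reproduces the β = 0.26 just quoted.

```python
from math import erf, sqrt

def phi(x):
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Type II error probability for the one-tail test, Eq. (40):
# beta = Phi(z_alpha - |d| sqrt(n) / sigma)
def beta_one_tail(d, sigma, n, z_alpha=1.645):
    return phi(z_alpha - abs(d) * sqrt(n) / sigma)

print(round(beta_one_tail(d=0.40, sigma=0.55, n=10), 2))  # 0.26
```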
A simple examination of Eq. (40) explains the situation. To decrease β, we should decrease the value z_α − |d|√n/σ. This may be done by decreasing z_α (i.e., increasing the significance level α) or by increasing |d|√n/σ. As both the procedure precision, σ, and the difference of pH that we wish to detect are fixed, the only possibility left is to increase the sample size n. Solving Eq. (40) for n, we have

n ≅ (z_α + z_β)²/(|d|/σ)²   (41)
The values of β and α for sample sizes of 10, 15, 20, and 25, maintaining d and σ fixed, are drawn in Fig. 4. As can be seen, α and β exhibit opposite behavior and, unless the sample size is increased, it is not possible to simultaneously decrease the probability of both errors. In our case, Eq. (41) gives n = 20.5 for α = β = 0.05, thus n = 21 because the sample size must be an integer. The dotted lines in Fig. 4 intersect at values of β of 0.263, 0.126, 0.058 and 0.025 when increasing the sample size while maintaining the significance level α = 0.05. Again, we see that for a given α, the risk β decreases as n increases.
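The sample size from Eq. (41) is immediate to compute; the Python sketch below (function name ours) uses z_{0.05} = 1.645 for both risks and the |d| and σ of the pH example.

```python
from math import ceil

# Sample size from Eq. (41): n ~ (z_alpha + z_beta)^2 / (|d|/sigma)^2,
# rounded up to the next integer.
def n_for_risks(d, sigma, z_alpha=1.645, z_beta=1.645):
    return ceil((z_alpha + z_beta) ** 2 / (abs(d) / sigma) ** 2)

print(n_for_risks(d=0.40, sigma=0.55))  # 21
```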
Eq. (40) also allows the analyst to decide the standard deviation (precision) necessary to obtain a decision rule according to the risks α and β that the analyst is willing to admit. For example, if one must decide on the validity of the prepared solution with
Fig. 4 Simultaneous (opposite) behavior of α and β for different sample sizes, n = 10, 15, 20 and 25, maintaining d and σ fixed at 0.4 and 0.55, respectively. Dotted lines intersect the different curves at α = 0.05.
10 results and the analyst states α = β = 0.05, the only option according to Eq. (40) is to increase |d|/σ. By solving 0.05 = pr{N(0,1) < 1.645 − (0.40/σ)√10}, one obtains σ = 0.3845. This means that the procedure should be improved from the current value of 0.55 to 0.38. If only five results were allowed, the standard deviation would have to decrease to 0.27 to maintain both the significance level and the power of the test.
Finally, there is an aspect of Eq. (40) that should not go unnoticed. Maintaining α, β, and n fixed, it is possible to reduce d (the pH value that can be distinguished from 2.00) if the analyst simultaneously increases the precision of the method, provided that the ratio |d|/σ remains constant. In other words, without changing any of the specifications of the hypothesis test, by diminishing σ we can discriminate a value of pH nearer to 2. Qualitatively this argument is clear: if a procedure is more precise, more similar results are easier to distinguish, so with a more precise procedure different values will appear that would be considered equal with a less precise one. Eq. (40) quantifies this relation for the hypothesis test we are conducting.
In summary, a hypothesis test includes the following steps: (1) defining the null, H₀, and alternative, H₁, hypotheses according to the purpose of the test and the properties of the distribution of the random variable, which, according to Eq. (1), is the distribution of the values provided by a method of measurement; (2) deciding on the probabilities α and β, that is, the risks for the two types of error that will be assumed for the decision; (3) computing the needed sample size; (4) obtaining the results, computing the corresponding test statistic, and evaluating whether it belongs to the critical region CR; and, finally, drawing the analytical conclusion, which should entail more than reporting the pure statistical test decision. The conclusion should include the elements of the statistical test, the assumed distribution, α, β, and n. Care must be taken in writing the conclusion; for example, it is more adequate to say "there is no experimental evidence to reject the null hypothesis" than "the null hypothesis is accepted".
Table 4 summarizes the tests most frequently used in the validation of analytical procedures and in the analysis of their results. Fig. 5 is the diagram equivalent to the one in Fig. 3 but for hypothesis testing.
β = pr{−t_{α/2,n−1} ≤ t_{n−1}(Δ) ≤ t_{α/2,n−1}}   (43)

where Δ = (|d|/σ)√n is the noncentrality parameter of a noncentral t(Δ) distribution, which in Eq. (43) has n − 1 d.f. Note the analogy with the "shift" of the N(0,1) in Eq. (40). The discussion about the relative effect of sample size and precision is similar to the case in which the variance is known. The corresponding equations for one-tail tests are β = pr{−t_{α,n−1} ≤ t_{n−1}(Δ)} if H₁: μ < μ₀ and β = pr{t_{n−1}(Δ) ≤ t_{α,n−1}} if H₁: μ > μ₀.
To compute n from Eq. (43), the standard deviation is needed. To overcome this additional difficulty, the comments in Case 2 of Section "Confidence Interval on the Mean of a Normal Distribution" are valid and can be applied here also. Usually, d = 2s or 3s. Let us compare the solutions with known and unknown variance with the same data of Example 8, but supposing that the variance is unknown. We wish to detect differences in pH of 0.73σ (the same d/σ as in Example 8). By using a sample of size 10, the probability β is 0.31 instead of the previous 0.26 (calculations can be seen in Example A10 of the Appendix). This increase in the probability of type II error is due to having less information about the problem; now the standard deviation is unknown.
Following row 12 of Table 4, the critical region is CR = {t_calc > t_{α,n−1}} and the value of the statistic is t_calc = d̄/(s_d/√n) = 2.69/(3.526/√10) = 2.412.
The critical value t_{0.05, 9} is equal to 1.833; thus, t_calc belongs to the critical region. Therefore, the null hypothesis is rejected for α = 0.05 and we can conclude that cartridge A is more efficient than cartridge B, because the mean of the differences is positive.
To evaluate the power (1 − β) of the test, the equation β = pr{t_{n−1}(Δ) ≤ t_{α, n−1}} with Δ = d√n, for d = |μ − μ0|/σ = |d|/s = 2/3.53 = 0.57, provides 1 − β = 1 − 0.492 = 0.508 for α = 0.05 and n = 10 (calculations can be seen in Example A11 of Appendix). Hence, 50% of the times the conclusion of accepting that there is no difference between recovery rates is wrong. In this case, the risk of a type II error is very large; in other words, the power is very poor when we want to discriminate differences of 2% in recovery because the ratio d is small.
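The β just quoted can be reproduced numerically with the noncentral t distribution. The following sketch (assuming SciPy is available) mirrors the one-tail case of Eq. (43) for the data of this example:

```python
from scipy import stats
import math

# One-tail t-test (H1: mu > mu0) with alpha = 0.05 and n = 10;
# d = |difference to detect| / s = 2/3.53, as in the example
n, alpha = 10, 0.05
d = 2 / 3.53
delta = d * math.sqrt(n)                    # noncentrality parameter

t_crit = stats.t.ppf(1 - alpha, n - 1)      # t_{0.05,9} = 1.833
beta = stats.nct.cdf(t_crit, n - 1, delta)  # pr{t_9(delta) <= t_{0.05,9}}
power = 1 - beta                            # about 0.51
```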
Table 5 Recovery rates obtained by using two different extraction cartridges for a sulfonamide spiked in wastewater.
Location 1 2 3 4 5 6 7 8 9 10
Cartridge A (%) 77.2 74.0 75.6 80.0 75.2 69.2 75.4 74.0 71.6 60.4
Cartridge B (%) 74.4 70.0 70.2 77.2 75.9 60.0 77.0 76.0 70.0 55.0
The analyst decides that the repeatability is admissible up to 2.0 times the initial one, 1.40 mg L−1, and assumes the risks α = β = 0.05. The sample size needed to guarantee the analyst's requirements, which formally corresponds to a one-tail hypothesis test on the variance, is obtained from Eq. (46).
β = pr{χ²_{n−1} < k/λ²}    (46)
k is the value such that α = pr{χ²_{n−1} > k} and λ = σ/σ0. As λ = 2.0, Eq. (46) gives β = 0.0402 for n = 14, whereas for n = 13, β = 0.0511 (calculations can be seen in Example A12 of Appendix). Therefore, the analyst decides to carry out 14 determinations on aliquot parts of a sample with 400 mg L−1, obtaining a variance of 3.10 (s = 1.76 mg L−1).
The statistic related to the decision (row 14 in Table 4) is χ²_calc = (14 − 1) × 3.10/1.96 = 20.56. As the critical region is CR = {χ²_calc > χ²_{0.05, 13} = 22.36}, the conclusion is that there is not enough experimental evidence to conclude that the precision has worsened. In this case, the acceptance of the null hypothesis, that is, to maintain the repeatability below 2.0 times the initial one, will be erroneous 5% of the times because β was fixed at 5%. The decision rule is equally protected against type I and type II errors.
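The sample-size search of Example A12 can be sketched as a small loop over Eq. (46) (assuming SciPy is available):

```python
from scipy import stats

# Smallest n with beta = pr{chi2_{n-1} < k/lambda^2} <= 0.05, where
# k is the upper alpha point of chi2_{n-1} and lambda = sigma/sigma0 = 2.0
alpha, beta_max, lam = 0.05, 0.05, 2.0
n = 2
while True:
    k = stats.chi2.ppf(1 - alpha, n - 1)      # alpha = pr{chi2_{n-1} > k}
    beta = stats.chi2.cdf(k / lam**2, n - 1)  # Eq. (46)
    if beta <= beta_max:
        break
    n += 1
# n = 14 with beta = 0.0402 (for n = 13, beta = 0.0511)
```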
H0: μ_SPME = μ_SPE (recovery rates are the same for both procedures)
H1: μ_SPME > μ_SPE (the recovery using the SPME procedure is greater than the one using SPE)    (47)
Following row 8 of Table 4, for a significance level α = 0.05, CR = {Z_calc > z_α = 1.645}. The statistic is

Z_calc = (x̄_SPME − x̄_SPE)/√(s²_SPME/n1 + s²_SPE/n2) = (85.9 − 81.8)/√(28.73/10 + 9.73/10) = 2.091

As the value of the statistic, 2.091, belongs to CR, the null hypothesis is rejected, and we conclude that the mean recovery rate with SPME is greater than with SPE.
The next question could be related to the risk β for this hypothesis test, provided a difference in recovery of 3% is enough in the analysis. To answer this question, a simple modification of Eq. (40) shows that

β = pr{Z ≤ z_α − |d|/√(σ1²/n1 + σ2²/n2)}    (48)

By substituting our data in Eq. (48), one obtains β = 0.55. That means that in 55% of the cases we will incorrectly accept that the recovery is the same for both procedures.
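Eq. (48) is straightforward to evaluate numerically; a sketch with the figures of this example (SciPy assumed available):

```python
from scipy import stats
import math

# Risk beta of Eq. (48): d = 3 (% recovery), alpha = 0.05, n1 = n2 = 10;
# the sample variances of Table 6 are treated as the known variances
alpha, d = 0.05, 3.0
var1, var2, n1, n2 = 28.73, 9.73, 10, 10

z_alpha = stats.norm.ppf(1 - alpha)                # 1.645
shift = abs(d) / math.sqrt(var1/n1 + var2/n2)      # about 1.53
beta = stats.norm.cdf(z_alpha - shift)             # about 0.55
```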
Table 6 Recovery rates for triazines in wastewater using solid-phase microextraction (SPME) and solid-phase extraction (SPE).

SPME  91  85  90  81  79  78  84  87  93  91
SPE   86  82  85  86  79  82  80  77  79  82
It is also possible to derive formulas to estimate the sample size required to obtain a specified β for given d and α. For the one-sided alternative, the sample size n = n1 = n2 is

n ≅ (z_α + z_β)² (σ1² + σ2²)/d²    (49)

Again, with the data of Example 11 and β = 0.05, Eq. (49) gives 46.25; that is, 47 aliquot samples should be analyzed for each procedure so that α = β = 0.05.
For the two-sided alternative, the sample size n = n1 = n2 is

n ≅ (z_{α/2} + z_β)² (σ1² + σ2²)/d²    (50)

When the variances are unknown, they are estimated from pilot samples and the percentage points of a Student's t distribution are used instead (Eq. (51)), where ν = n1′ + n2′ − 2 are the d.f. of that Student's t distribution. If this is not possible, the difference to be detected should be expressed in standard deviation units, that is, d = |μ1 − μ2| = kσ, and the following expression applies:

n ≅ (z_{α/2} + z_β)²/(k²/2)    (52)

where z_{α/2} and z_β are the corresponding upper percentage points of the standard normal distribution Z.
Example 12: An experimenter wishes to compare the means of two procedures, stating that they should be considered different if they differ by 2 or more standard deviations (k = 2), and defining assumable risks of, at most, α = 0.05 and β = 0.10. As z_{0.025} = 1.960 and z_{0.10} = 1.282, Eq. (52) gives n = 5.25, so six samples must be determined using each procedure. If the experimenter had wanted to distinguish 1 standard deviation (k = 1), then n = 21.01, that is, 22 determinations with each procedure would have been necessary.
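A small helper (SciPy assumed available) reproduces the two computations of Example 12 from Eq. (52):

```python
from scipy import stats
import math

def sample_size_two_sided(k, alpha, beta):
    """Eq. (52): replicates per procedure to detect a difference of k
    standard deviations with two-sided risk alpha and type II risk beta."""
    z_a2 = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(1 - beta)
    return math.ceil((z_a2 + z_b) ** 2 / (k ** 2 / 2))

n_k2 = sample_size_two_sided(2, 0.05, 0.10)  # 6 (from 5.25)
n_k1 = sample_size_two_sided(1, 0.05, 0.10)  # 22 (from 21.01)
```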
Although it is preferable to always take equal sample sizes, it may be more expensive or laborious to collect data from X1 than
from X2. In this case, there are weighted sample sizes to be considered.36
In case the equality of variances σ1² and σ2² cannot be assumed, there is no completely justified solution for the test. However, approximations exist with good power and easy-to-use tests, such as the Welch test. This method consists of substituting the known variances in the expression of Z_calc in rows 7 and 8 of Table 4 by their sample estimates, in such a way that the statistic becomes

t_calc = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2)    (53)
which follows a Student's t with the degrees of freedom ν in Eq. (25). The critical region for the two-tail test is CR = {t_calc < −t_{α/2, ν}} ∪ {t_calc > t_{α/2, ν}}; for the one-tail test it is CR = {t_calc < −t_{α, ν}} if H1: μ1 < μ2, and CR = {t_calc > t_{α, ν}} if H1: μ1 > μ2.
As the variances are different, it seems reasonable to take the sample sizes, n1 and n2, also different. If σ2 = r·σ1, similarly to Eq. (52), one obtains the expression in Eq. (54)

n1 ≅ (z_{α/2} + z_β)²/(k²/(r + 1))    (54)

Once n1 is determined, n2 is obtained as n2 = r·n1. The computation of the sample sizes with different variances when pilot samples are at hand can be found in Schouten.36
Contrary to the tests so far, the hypotheses of this test, called interval hypotheses, are stated not as a single point but as an interval. The two one-sided tests (TOST) procedure consists of decomposing the interval hypotheses H0 and H1 into two sets of one-sided hypotheses:

H01: μ1 − μ2 ≤ −d
H11: μ1 − μ2 > −d

and

H02: μ1 − μ2 ≥ d
H12: μ1 − μ2 < d

The TOST procedure consists of rejecting the interval hypothesis H0 (and thus concluding equality of μ1 and μ2) if and only if both H01 and H02 are rejected at a chosen level of significance α.
If two normal distributions with the same unknown variance, σ², are assumed and two samples of sizes n1 and n2 are taken from each one, the two sets of one-sided hypotheses will be tested with the ordinary one-tail test (row 10 of Table 4). Thus, the critical region is

CR = {[(x̄1 − x̄2) + d]/[s_p √(1/n1 + 1/n2)] ≥ t_{α, ν} and [d − (x̄1 − x̄2)]/[s_p √(1/n1 + 1/n2)] ≥ t_{α, ν}}    (55)
Again, σ is unknown in Eq. (56), so it should be adapted as in Case 2 of Section "Hypothesis Test on the Difference in Two Means". When comparing Eq. (56) with the equations corresponding to the two-tail t-test on the difference of means, one observes that it is completely analogous once the two risks are exchanged (see Eqs. (50), (52)). That is, the significance level and the power of the t-test become the power and the significance level, respectively, of the TOST procedure, which completely agrees with the exchange of the hypotheses.
The tests based on intervals have a long tradition in Statistics; see, for example, the (very technical) book by Lehmann.37 The TOST procedure is a particular case that has also been used under the name bioequivalence test.38 Mehring39 has proposed some technical improvements to obtain optimal interval hypothesis tests, including equivalence testing. It has been shown that TOST is always biased; in particular, its power tends to zero for increasing variances independently of the difference in means. As a result, an unbiased test40 and a suitable compromise between the most powerful test and the shape of its critical region41 have been proposed. In Chemistry, the use of TOST has been suggested to verify the equality of two procedures.42,43 Kuttatharmmakul et al.44 provide a detailed analysis of the sample sizes necessary in a TOST procedure to compare methods of measurement. There are different versions of TOST for ratios of variables and for proportions; the details of the equations for these cases can be consulted in Section 8.13 of the book by Martín Andrés and Luna del Castillo45 and in the book by Wellek.46 The latter is a comprehensive review of inferential procedures that enable one to "prove the null hypothesis" in many areas of applied statistical data analysis.
Table 7 Results obtained in the stability test of Example 13.

Control sample  46.31  44.90  44.12  36.07  39.20  36.39  50.71  47.85  45.60
Test sample     43.12  43.00  44.75  39.66  37.74  37.50  54.79  53.08  55.07
This test is related to the equality of the precision of two procedures, and it is also used as a previous step to decide about the equality of variances before applying the test on the equality of means (Case 2 of Section "Hypothesis Test on the Difference in Two Means") or to compute a confidence interval on the difference of means (Case 2 of Section "Confidence Interval on the Difference in Two Means").
Assume that two random samples of size n1 of X1 and of size n2 of X2 are available, and let s1² and s2² be the sample variances. To test the two-sided alternative, we use the statistic and CR of row 15 of Table 4. The probability β can be computed, as a function of the ratio of variances λ² = σ1²/σ2² that is to be detected, by Eq. (57)

β = pr{F_{1−α/2, n1−1, n2−1}/λ² < F_{n1−1, n2−1} < F_{α/2, n1−1, n2−1}/λ²}    (57)

where F_{n1−1, n2−1} denotes an F-distribution with n1 − 1 and n2 − 1 d.f. and F_{α/2, n1−1, n2−1} its upper α/2 point, so that pr{F_{n1−1, n2−1} > F_{α/2, n1−1, n2−1}} = α/2. Similarly, F_{1−α/2, n1−1, n2−1} is the upper 1 − α/2 point.
Example 13: Aliquot samples have been analyzed in random order under the same experimental conditions to carry out a stability test. The results are given in Table 7 and must be compared to assess the stability of the test material. Different questions can be asked: (1) Is there experimental evidence of instability in the material? (2) Taking into account that the analyst considers that the material is not stable if the mean of the test sample differs from the mean of the control sample by two standard deviations, what is the probability of accepting the null hypothesis when it is in fact wrong? (3) What should the sample size be if just one standard deviation is needed for fitness for purpose of this analysis (with α = β = 0.05)?
The answers would be:
(1) Stability: As we only know the estimates of the variance, we should use a t-test to compare means.
The first step is to test whether the variances can be considered equal by using a two-tail F-test:

H0: σ1² = σ2²
H1: σ1² ≠ σ2²
Following row 15 in Table 4, CR = {F_calc < F_{1−α/2, n1−1, n2−1} or F_calc > F_{α/2, n1−1, n2−1}}, with F_{α/2, n1−1, n2−1} = F_{0.025, 8, 8} = 4.43 and F_{1−α/2, n1−1, n2−1} = F_{0.975, 8, 8} = 1/F_{0.025, 8, 8} = 1/4.43 = 0.23. As F_calc = s1²/s2² = 50.75/26.22 = 1.94, there is no experimental evidence to conclude that the variances differ.
Therefore, a hypothesis t-test on the difference of the two means with equal variances is formulated (Case 2 of Section "Hypothesis Test on the Difference in Two Means"). The statistic and the CR are given in row 9 of Table 4. The "pooled" variance is s_p² = 38.49, so s_p = 6.20 with 9 + 9 − 2 = 16 d.f., and

t_calc = (x̄1 − x̄2)/[s_p √(1/n1 + 1/n2)] = (45.41 − 43.46)/(6.20 √(1/9 + 1/9)) = 0.67

with t_{0.025, 16} = 2.12. Therefore, the critical region is the set of values of t_calc less than −2.12 or greater than 2.12, which does not contain 0.67. Hence, there is no evidence to reject the null hypothesis; that is, with these data there is no experimental evidence of instability.
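Both decisions of answer (1) can be checked with SciPy; a sketch with the data of Table 7:

```python
from scipy import stats

control = [46.31, 44.90, 44.12, 36.07, 39.20, 36.39, 50.71, 47.85, 45.60]
test = [43.12, 43.00, 44.75, 39.66, 37.74, 37.50, 54.79, 53.08, 55.07]

# Two-tail F-test on the variances (row 15 of Table 4)
f_calc = stats.tvar(test) / stats.tvar(control)   # about 1.9
f_upper = stats.f.ppf(0.975, 8, 8)                # 4.43
variances_equal = 1 / f_upper < f_calc < f_upper  # True

# Pooled two-sample t-test on the means (row 9 of Table 4)
t_calc, p_value = stats.ttest_ind(test, control, equal_var=True)
# t_calc about 0.67, p_value > 0.05: no evidence of instability
```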
(2) Power of the test: With the condition imposed by the analyst, Eq. (52) with k = 2 gives β < 0.05, so the power is greater than 0.95.
(3) Sample size: In this case, the analyst is interested in computing the sample size under the assumption that only 1 standard deviation is admissible for fitness for purpose of this analysis. Therefore, k = 1 and Eq. (52) gives n = 25.99, so n1 = n2 = 26. The sample sizes are greater than in point (2), reflecting the fact that the analyst is now interested in distinguishing a much smaller quantity.
Regarding the sample size of the F-test in the answer to question (1), when the aim is to detect a standard deviation that is twice that of the control samples, Eq. (57) gives a probability β for this test of 0.56. That means that 56% of the times the null hypothesis will be wrongly accepted, and in this case we have accepted the null hypothesis. When the F-test is used as a previous step to the test on the equality of means, and β = 0.05 was decided for the latter (t-test), it is common to use β = 0.10 for the former (F-test). Eq. (57) gives n = n1 = n2 = 24 with β = 0.098 (the closest to the intended 0.1), maintaining that a change of 2 times the standard deviation of the control samples is to be detected (all calculations of β can be seen in Example A13 of Appendix). In general, the F-tests on the equality of variances are very conservative, and large sample sizes are needed to assure an adequately small probability of type II error.
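Eq. (57) can be evaluated directly; a sketch (SciPy assumed available) reproducing the two β values just quoted:

```python
from scipy import stats

def beta_f_test(lam, n1, n2, alpha=0.05):
    """Eq. (57): type II error of the two-tail F-test when the true
    ratio of standard deviations is lam = sigma1/sigma2."""
    d1, d2 = n1 - 1, n2 - 1
    f_up = stats.f.ppf(1 - alpha / 2, d1, d2)
    f_low = stats.f.ppf(alpha / 2, d1, d2)
    return stats.f.cdf(f_up / lam**2, d1, d2) - stats.f.cdf(f_low / lam**2, d1, d2)

beta_9 = beta_f_test(2.0, 9, 9)     # about 0.56 with the 9 + 9 data
beta_24 = beta_f_test(2.0, 24, 24)  # about 0.098, close to the intended 0.10
```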
Fig. 6 pH values obtained by the different groups of students (Example 14), depicted to visually inspect the equality of variances.
The sample size of each group is denoted as ni, i = 1, 2, ..., k, and N = Σ_{i=1}^k ni. The statistic of Cochran's test is

G_calc = max(s_i²)/Σ_{i=1}^k s_i²    (59)

and the critical region is

CR = {G_calc > G_{α, k, ν}}    (60)

where G_{α, k, ν} is the value tabulated in Table 9 for ν d.f. In the case ni = n for all i, ν = n − 1.
With the data of Example 14 in Table 8, G_calc = 8.993 × 10⁻³/(16.429 × 10⁻³) = 0.5474 and G_{0.05, 5, 4} = 0.5441 (Table 9). Thus, at the 0.05 significance level, the null hypothesis should be rejected and the variance of group 5 should be considered different from the rest.
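Eq. (59) is a one-line computation; the variances below are illustrative (hypothetical), chosen so that, as in Example 14, one group dominates:

```python
def cochran_g(variances):
    """G statistic of Eq. (59): largest variance over the sum of all."""
    return max(variances) / sum(variances)

# Five hypothetical group variances with nu = 4 d.f. each; the critical
# value G_{0.05,5,4} = 0.5441 is read from Table 9
g = cochran_g([2.1e-3, 1.5e-3, 1.8e-3, 2.0e-3, 8.99e-3])
reject = g > 0.5441   # True: the largest variance is declared different
```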
Table 9 Critical values for Cochran's test for testing homogeneity of several variances at 5% significance level.

k\ν     1       2       3       4       5       6       7       8       9       10
2    0.9985  0.9750  0.9392  0.9057  0.8772  0.8534  0.8332  0.8159  0.8010  0.7880
3    0.9669  0.8709  0.7977  0.7457  0.7071  0.6771  0.6530  0.6333  0.6167  0.6025
4    0.9065  0.7679  0.6841  0.6287  0.5895  0.5598  0.5365  0.5175  0.5017  0.4884
5    0.8412  0.6838  0.5981  0.5441  0.5065  0.4783  0.4564  0.4387  0.4241  0.4118
6    0.7808  0.6161  0.5321  0.4803  0.4447  0.4184  0.3980  0.3817  0.3682  0.3568
7    0.7271  0.5612  0.4800  0.4307  0.3974  0.3726  0.3535  0.3384  0.3259  0.3154
8    0.6798  0.5157  0.4377  0.3910  0.3595  0.3362  0.3185  0.3043  0.2926  0.2820
9    0.6385  0.4775  0.4027  0.3584  0.3286  0.3067  0.2901  0.2768  0.2659  0.2568
10   0.6020  0.4450  0.3733  0.3311  0.3029  0.2823  0.2666  0.2541  0.2439  0.2353
In Eq. (62), "log10" means the decimal logarithm and s_p² is the pooled variance that, extending Eq. (23) to k variances, is s_p² = [Σ_{i=1}^k (n_i − 1)s_i²]/(N − k).

In Example 14, c = 1.10, q = 3.43, and χ²_calc = 7.19, which does not belong to the critical region defined in Eq. (64) because χ²_{0.05, 4} = 9.49. Consequently, we have no evidence of difference in variances.
Cochran’s and Bartlett’s tests are very sensitive to the normality assumption. Levene’s test, particularly when it is based on the
medians of each group, is more robust to the lack of normality of data.
Consider the data arranged as in Table 8 and compute the usual F statistic for the deviations l_ij

F_calc = [Σ_{i=1}^k n_i(l̄_i − l̄)²/(k − 1)] / [Σ_{i=1}^k Σ_{j=1}^{n_i} (l_ij − l̄_i)²/(N − k)]    (66)

where l̄_i is the mean of the i-th group and l̄ is the overall mean. Note that the numerator of Eq. (66) measures the variability between the group means of the deviations, whereas the denominator is the pooled within-group variance of these deviations. The critical region at the (1 − α)100% confidence level is

CR = {F_calc > F_{α, k−1, N−k}}    (67)
Computing the differences in Eq. (65) with the data of Table 8, F_calc = (2.205 × 10⁻³)/(0.905 × 10⁻³) = 2.44. As F_{0.05, 4, 20} = 2.866, there is no evidence to reject the null hypothesis (the variances are equal).
Levene’s test is more recommendable using group medians instead of group means. The adaptation is simple; one has to
consider the absolute value of the differences but to the median, x~i , of each group
l_ij = |x_ij − x̃_i|, i = 1, 2, ..., k, j = 1, 2, ..., n_i    (68)

The statistic is again the one of Eq. (66) and it is applied similarly. With the same data of Table 8, but with the values in Eq. (68), one obtains F_calc = (2.146 × 10⁻³)/(1.360 × 10⁻³) = 1.58, and the conclusion is the same: the variances of the five groups should be considered equal.
It often happens that the three tests do not agree in their results, as is the case here. But the joint interpretation clarifies the situation: in the data of Example 14, the variance of group 5 is greater than the variance of the other groups, as Cochran's test shows. When Levene's test is applied, a large difference between both statistics is observed when using the median. This suggests that the increase in the variance of the last group is caused by some data being different from the others, which can be seen graphically in Fig. 6.
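SciPy implements Levene's test with both centerings, so the mean/median comparison made above is easy to reproduce; the five groups below are illustrative (hypothetical) pH values, not the data of Table 8:

```python
from scipy import stats

groups = [
    [5.95, 6.02, 5.98, 6.00, 6.01],
    [5.90, 5.97, 5.95, 5.99, 5.96],
    [6.05, 6.00, 6.02, 6.04, 6.03],
    [5.98, 5.96, 6.00, 5.97, 5.99],
    [5.80, 6.10, 5.92, 6.15, 5.85],  # one more disperse group, cf. Fig. 6
]
f_mean, p_mean = stats.levene(*groups, center='mean')
f_med, p_med = stats.levene(*groups, center='median')
# A large gap between f_mean and f_med hints at a few atypical values
```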
To compute the statistic, the n sample values are grouped into k classes (intervals). Denote by O_i, i = 1, ..., k, the frequency observed in each class and by E_i the expected frequency for the same class provided the distribution is exactly F0. Then, the statistic in Eq. (70)

χ²_calc = Σ_{i=1}^k (O_i − E_i)²/E_i    (70)
follows a χ²_{k−p−1} distribution, which is used to define the critical region at the (1 − α)100% confidence level as

CR = {χ²_calc > χ²_{α, k−p−1}}    (71)

where χ²_{α, k−p−1} is the value such that pr{χ²_{k−p−1} > χ²_{α, k−p−1}} = α and p is a number that depends on the distribution F0; for instance,
p = 2 for a normal, p = 1 for a Poisson, and p = 0 for a uniform distribution. The test requires that the expected frequencies not be too small; if some are, the data are regrouped into bigger classes. In the practice of chemical analysis, the sample sizes are not large, and when grouping the data the d.f. of the chi-square statistic are few, the critical value of Eq. (71) becomes large, and a large discrepancy between the expected and observed frequencies is necessary to reject the null hypothesis. That means that the test is very conservative.
Example 15: To show the validity of the use of crystal violet (CV) as an internal standard in the determination by LC-MS-MS of malachite green (MG) in trout, a sample of trout was spiked with 1 mg kg−1 of CV and increasing concentrations of MG between 0.5 and 5.0 mg kg−1. The areas of the CV-specific peak (transition 372 > 356) in these calibration standards were: 1326, 1384, 1419, 1464, 1425, 1409, 1387, 1449, 1311, 1338, 1350, and 1345. To verify whether the signal of CV is constant and independent of the concentration of MG in every standard, we can test the null hypothesis, H0, that the areas follow a uniform distribution.
Table 10 shows the calculation of both observed and expected frequencies under the uniform distribution in the interval [1311, 1464], the endpoints being, respectively, the minimum and maximum values in the sample. By summing the values of the last column of Table 10, the statistic is χ²_calc = 0.51, which does not belong to the critical region because it is not greater than χ²_{0.05, 5−0−1} = 9.49. Therefore, there is no evidence to reject the hypothesis that the data come from a uniform distribution.
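The whole of Example 15 fits in a few lines (NumPy and SciPy assumed available):

```python
import numpy as np
from scipy import stats

# Areas of the CV-specific peak, tested against a uniform distribution
# on [1311, 1464] using k = 5 classes (p = 0 for a uniform F0)
areas = [1326, 1384, 1419, 1464, 1425, 1409, 1387, 1449,
         1311, 1338, 1350, 1345]
observed, _ = np.histogram(areas, bins=5, range=(1311, 1464))
expected = np.full(5, len(areas) / 5)        # 2.4 counts per class under H0

chi2_calc = float(((observed - expected) ** 2 / expected).sum())  # about 0.5
chi2_crit = stats.chi2.ppf(0.95, 5 - 0 - 1)                       # 9.49
reject = chi2_calc > chi2_crit                                    # False
```

(The small difference from the 0.51 of Table 10 presumably comes from rounding of the expected frequencies.)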
Table 10 χ² goodness-of-fit test to a uniform distribution applied to assess the validity of crystal violet as internal standard; data of Example 15.

Class    Observed frequency (Oi)    Expected frequency (Ei)    (Oi − Ei)²/Ei
Table 11 Critical values, DL and DU, of D'Agostino's test as a function of the sample size, for significance levels α = 0.05 and α = 0.01. Adapted from Martín Andrés, A.; Luna del Castillo, J. D. Bioestadística para las ciencias de la salud; Norma Capitel: Madrid, Spain, 2004.
characteristics that are specific to the pdf of a normal distribution, for example, the skewness and the kurtosis, which are statistics related to moments of order higher than two of the normal pdf. A very powerful test is D'Agostino's test, with hypotheses H0: the sample comes from a normal distribution, against H1: it does not.
To apply the test, the data are sorted in increasing order, so that x1 ≤ x2 ≤ ⋯ ≤ xn. The statistic is

D_calc = [Σ_{i=1}^n i·x_i − ((n + 1)/2) Σ_{i=1}^n x_i] / √(n³ [Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)²/n])    (72)

Index i in Eq. (72) refers to the ordered data. Table 11 shows some of the critical values of the statistic, D_{α,n}, with the two values, DL_{α,n} and DU_{α,n}, for each sample size n and significance level α. The critical region of the test is

CR = {D_calc < DL_{α,n} or D_calc > DU_{α,n}}    (73)
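Eq. (72) can be coded directly; a sketch of the statistic (the decision still requires the tabulated interval (DL, DU) of Table 11):

```python
import math

def dagostino_d(data):
    """D statistic of Eq. (72); the data are sorted internally."""
    x = sorted(data)
    n = len(x)
    s1 = sum(x)
    s2 = sum(v * v for v in x)
    t = sum((i + 1) * v for i, v in enumerate(x))   # sum of i * x_(i)
    num = t - (n + 1) / 2 * s1
    den = math.sqrt(n ** 3 * (s2 - s1 ** 2 / n))
    return num / den

# For equally spaced data D = sqrt((n^2 - 1)/(12 n^2)); with n = 10 this
# gives 0.2872 (for normal data D tends to 1/(2 sqrt(pi)) = 0.2821)
d_stat = dagostino_d(list(range(1, 11)))
```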
Sometimes, more than two means must be compared. One could think of comparing, say, five means by applying the test of comparison of two means of Section "Hypothesis Test on the Difference in Two Means" to each of the 10 pairs of means that can be formed by taking them two by two. This option has a serious drawback: it requires enormous sample sizes, because to test the null hypothesis "the five means are equal" with α = 0.05, and assuming that the 10 tests are independent, each one of the hypotheses "the means x̄i and x̄j are equal" should be tested at a significance level of 0.0051 to obtain a confidence equal to (1 − 0.0051)¹⁰ ≈ 0.95. The appropriate procedure for testing the equality of several means is the analysis of variance (ANOVA).
The ANOVA has many more applications; it is particularly useful in the validation of a model fit to some experimental data and,
hence, in an analytical calibration or in the analysis of response surfaces as can be seen in the corresponding chapters of the
present book.
Table 12 shows how the data are usually arranged in a general case: in columns, the k levels of a factor (e.g., five different extraction cartridges), and in rows, the n data obtained (e.g., four determinations with each cartridge). Each of the N = k·n values x_ij (i = 1, 2, ..., k, j = 1, 2, ..., n) is the result obtained, in our example, when using the i-th cartridge with the j-th aliquot sample. In general, a different number of replicates n_i may be available in each level i, with N = Σ_{i=1}^k n_i. To make the notation easier, we will suppose that all n_i are equal, that is, n_i = n for each level.
Suppose that the data in Table 12 can be described by the model

x_ij = μ + τ_i + ε_ij, i = 1, 2, ..., k, j = 1, 2, ..., n    (74)
where μ is a parameter common to all treatments, called the overall mean, τ_i is a parameter associated with the i-th level, called the factor effect, and ε_ij is the random error component. In our example, μ is the content of the sample and τ_i is the variation in this quantity caused by the use of the i-th cartridge. Note that in the model of Eq. (74) the effect of the factor is additive; this is an assumption that may be unacceptable in some practical situations.
The ANOVA is posed to test some hypotheses about the treatment effects and to estimate them. To support the conclusions when testing the hypotheses, the model errors ε_ij are assumed to be normally and independently distributed random variables with mean zero and variance σ², NID(0, σ). Besides, the variance σ² is assumed to be constant for every level of the factor.
The model of Eq. (74) is called the one-way ANOVA, because only one factor is studied. The analysis for two or more factors can
be seen in the chapter about factorial techniques in this book. Furthermore, the data of Table 12 are required to be obtained in
random order to reduce the effect of other uncontrolled factors.
There are two ways for choosing the k levels of the factor in the experiment. In the first case, the k levels are specifically chosen by
the researcher, as the cartridges in our example. In this case, we wish to test the hypothesis about the magnitude of ti and
conclusions will apply only to the levels of the factor explicitly considered in the analysis and they cannot be extended to similar
levels that were not considered. This is called the “fixed effects model”.
Alternatively, the k levels could be a random sample from a larger population of levels. In this case, we would like to be able to
extend the conclusions based on the sample to all levels in the population, regardless of whether they have been explicitly
considered in the analysis or not. Hence, each ti is a random variable and information about the specific values included in the
analysis is useless. Instead, we test the hypothesis about the variability. This is called the “random effects model”. This model is used
to evaluate the repeatability and reproducibility of a method and also the laboratory bias when the method of analysis is being
tested by a proficiency test. In the same experiment, and provided there are at least two factors, fixed and random effects can
simultaneously appear.50,51 The reader can easily combine them, when appropriate, from the explanations and examples in the
following subsections.
Also for this section, all the computations for the examples can be followed with the live-script in the supplementary material
named ANOVA_section1024_live.mlx.
Σ_{i=1}^k τ_i = 0    (75)
From the individual data, the mean value per level is defined as

x̄_i = (Σ_{j=1}^n x_ij)/n, i = 1, 2, ..., k    (76)

and the overall mean as

x̄ = (Σ_{i=1}^k Σ_{j=1}^n x_ij)/N    (77)
The total variability can be decomposed as

Σ_{i=1}^k Σ_{j=1}^n (x_ij − x̄)² = n Σ_{i=1}^k (x̄_i − x̄)² + Σ_{i=1}^k Σ_{j=1}^n (x_ij − x̄_i)²    (78)

Eq. (78) shows that the total variability of the data, measured by the sum of squares of the differences between each datum and the overall mean, can be partitioned into a sum of squares of differences between level means and the overall mean, plus a sum of squares of differences between individual values and their level mean. The term n Σ_{i=1}^k (x̄_i − x̄)² measures the differences between levels, whereas Σ_{i=1}^k Σ_{j=1}^n (x_ij − x̄_i)² is due to random error alone. It is common to write Eq. (78) as

SST = SSF + SSE    (79)
where SST is the total sum of squares, SSF is the sum of squares due to changing the levels of the factor, called the sum of squares between levels, and SSE is the sum of squares due to random error, called the sum of squares within levels. There are N individual values, thus SST has N − 1 d.f. Similarly, as there are k levels of the factor, SSF has k − 1 d.f. Finally, SSE has N − k d.f. We are interested in testing

H0: τ1 = τ2 = ⋯ = τk = 0
H1: τi ≠ 0 for at least one i
Because of the assumption that the errors ε_ij are NID(0, σ), the values x_ij are NID(μ + τ_i, σ), and therefore SST/σ² is distributed as a χ²_{N−1}. Cochran's theorem guarantees that, under the null hypothesis, SSF/σ² and SSE/σ² are independent chi-square distributions with k − 1 and N − k d.f., respectively. Therefore, under the null hypothesis, the statistic
F_calc = [SSF/(k − 1)]/[SSE/(N − k)] = MSF/MSE    (80)
follows an F_{k−1, N−k} distribution, whereas under the alternative hypothesis it follows a noncentral F with the same d.f.50 The quantities MSF and MSE are called mean squares. Their expected values are E(MSF) = σ² + n(Σ_{i=1}^k τ_i²)/(k − 1) and E(MSE) = σ², respectively. Therefore, under the null hypothesis, both are unbiased estimators of the residual variance, σ², whereas under the alternative hypothesis the expected value of MSF is greater than σ². The critical region of the test at significance level α is given in Eq. (81) and reflects the idea that, if the null hypothesis is false, the numerator of Eq. (80) is significantly greater than the denominator.

CR = {F_calc > F_{α, k−1, N−k}}    (81)
Usually, the test procedure is summarized in a table (called the ANOVA table) like the one in Table 13, except that we have added a column, the one corresponding to E(MS), just to emphasize the values that each MS estimates and their relation to the previous discussion.
Example 16: To investigate the influence of the composition of some fibers on a SPME procedure, an experiment was performed using five different fibers. The data shown in Table 14 are the results of four replicated analyses carried out after extraction with each fiber on a sample spiked with 1000 mg L−1 of triazine. All the analyses were carried out in random order, maintaining the rest of the experimental conditions controlled.

Table 14 Experimental results (mg L−1 of triazine), means and variances obtained in the study of the effect of the type of fiber in a SPME procedure (columns: type of fiber).
In the last two rows of Table 14, the means and variances for each fiber are given. Before conducting the ANOVA, the hypothesis of equality of variances should be tested.
With the variances in Table 14, the statistic of Cochran's test (Eq. (59)) is G_calc = 268.67/617.168 = 0.435. As G_{0.05, 5, 3} = 0.5981 (see Table 9), the statistic does not belong to the critical region (Eq. (60)) and there is no evidence to reject the null hypothesis at the 5% significance level.
The statistic of Bartlett's test is χ²_calc = 1.792 (Eq. (61)) and the critical value is χ²_{0.05, 4} = 9.488, so there is no evidence to reject the null hypothesis either (Eq. (64)). The same happens with Levene's test; computing the absolute values of the deviations according to Eq. (65) with the data of Table 14, F_calc = 14.70/44.01 = 0.33, and F_{0.05, 4, 15} = 3.06, so there is no evidence to reject the null hypothesis on the equality of variances. By using the median instead of the mean (Eq. (68)), F_calc = 15.13/46.23 = 0.33, and the conclusion is the same. From the analysis of the equality of variances, we can conclude that the variances of the five levels should be considered equal.
The ANOVA of the experimental data gives the results in Table 15. Considering the critical region defined in Eq. (81), as F_calc = 114.54 is greater than the critical value F_{0.05, 4, 15} = 3.06, we reject the null hypothesis; hence, the conclusion is that there is a significant effect of "fiber composition" on the extracted amount.
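The F decision of Eqs. (80) and (81) can be reproduced with scipy.stats.f_oneway. The replicates below are illustrative (hypothetical) values for five fibers, not the data of Table 14:

```python
from scipy import stats

fibers = [
    [965, 972, 958, 970],
    [880, 875, 890, 884],
    [910, 905, 915, 909],
    [995, 990, 1001, 997],
    [940, 935, 948, 943],
]
f_calc, p_value = stats.f_oneway(*fibers)     # F = MSF/MSE, Eq. (80)
k = len(fibers)
N = sum(len(g) for g in fibers)
f_crit = stats.f.ppf(0.95, k - 1, N - k)      # F_{0.05,4,15} = 3.06
significant = f_calc > f_crit                 # Eq. (81)
```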
The probability β of a type II error is given by

β = pr{F_{k−1, N−k, δ} ≤ F_{α, k−1, N−k}}    (82)

where F_{α, k−1, N−k} is the critical value of Eq. (81), F_{k−1, N−k, δ} is a noncentral F distribution with k − 1 and N − k d.f. in the numerator and denominator, respectively, and δ is the noncentrality parameter, whose value is given by

δ = n (Σ_{i=1}^k τ_i²)/σ²    (83)

The noncentrality parameter δ depends on the number of replicates n and also on the difference in means that we wish to detect in terms of Σ_{i=1}^k τ_i². When the error variance is unknown, which is usually the case, we must define the differences to be detected in terms of the ratio Σ_{i=1}^k τ_i²/σ². As the power, 1 − β, of the test increases with δ, the next question would be about the minimum δ needed
Table 16 Probability of type II error, b, as a function of the number n of replicates in the ANOVA for comparing
fiber types.
n 4 5 6 7 8 9
(for a given β) to distinguish differences of at least D in two of the τ's. This minimum δ can be computed, provided that two of the τ_i differ by D and the remaining k − 2 are kept at the mean of these two,50 and is given by

Σ_{i=1}^k τ_i² = D²/2    (84)
For example, with the data of Example 16 (Table 14), we are now interested in the risk of falsely affirming that the type of fiber is not significant for the recovery. The answer consists of evaluating the probability β by Eq. (82). Suppose that we want to discriminate effects greater than twice the MSE, that is, Σ_{i=1}^k τ_i²/σ² ≥ 2, and thus δ = n × 2 = 8. Notice that, by substituting Eq. (84) into Eq. (83), this value of δ means that we want to discriminate a difference D between two types of fiber of at least 2σ. In these conditions, F_{0.05, 4, 15} = 3.06 and β = 0.54 (calculations can be seen in Example A14 of Appendix and in the live-script ANOVA_section1024_live.mlx in the supplementary material). In other words, 54 out of 100 times we will accept the null hypothesis (there is no effect of the composition of the fiber) when it is wrong. This is not good enough for a suitable decision rule.
Eq. (82) can also be used to determine the sample size before starting an experiment, so that the risks α and β are both good enough.
For example, we want to know how many replicates we need to carry out in the experiment for α = β = 0.05 while maintaining the
ratio Σ τi²/σ² = 3. Note that, in this case, the analyst considers that there is an "effect of fiber type" if it is greater than 3 times σ², which is
equivalent, using Eq. (84), to detecting a difference between two fibers at least equal to D = √6 σ ≈ 2.5σ.
To calculate the sample size, a table must be made to write β as a function of n in Eq. (82) with k, α, and δ fixed at 5, 0.05, and
3n, respectively. Following the results shown in Table 16, computed with the code in the mentioned live-script, we need n = 8
replicates with each fiber to achieve β ≤ 0.05, but in practice n = 7 would be enough.
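The β values in Table 16 can be reproduced with any statistical library that implements the noncentral F distribution; the following Python sketch (an illustration, not the chapter's MATLAB live-script) evaluates the probability of Eq. (82):

```python
# Type II error beta of the one-way fixed-effects ANOVA via the
# noncentral F distribution (Eq. (82)): beta = P(F' <= F_crit),
# where F' is noncentral F(k-1, N-k) with noncentrality delta.
from scipy.stats import f, ncf

def beta_anova(k, n, delta, alpha=0.05):
    """Probability of type II error for k levels and n replicates."""
    dfn, dfd = k - 1, k * (n - 1)          # N - k = k(n - 1) for a balanced design
    f_crit = f.ppf(1 - alpha, dfn, dfd)    # critical value of Eq. (81)
    return ncf.cdf(f_crit, dfn, dfd, delta)

# Example 16: k = 5, n = 4, delta = n * 2 = 8; the text reports beta = 0.54
print(round(beta_anova(5, 4, 8), 2))

# Sample-size table as in Table 16: delta = 3n
for n in range(4, 10):
    print(n, round(beta_anova(5, n, 3 * n), 3))
```

As in the text, β decreases as n grows, so the table can be scanned for the smallest n with β below the target risk.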
Uncertainty and Testing of the Estimated Parameters in the Fixed Effects Model
It is possible to derive estimators for the parameters μ and τi (i = 1,. . .,k) in the one-way ANOVA modeled by Eq. (74). The
normality assumption on the errors is not needed to obtain an estimate by least squares; however, the solution is not unique, so the
constraint of Eq. (75) is imposed. Using this constraint, we obtain the following estimates

μ̂ = x̄,  τ̂i = x̄i − x̄, i = 1,. . .,k    (85)

where x̄i and x̄ have been defined in Eqs. (76), (77), respectively. If the number of replicates, ni, in each level is not equal
(unbalanced ANOVA), then the constraint in Eq. (75) should be changed to Σ niτi = 0 and the weighted average of the x̄i should
be used instead of the unweighted average in Eq. (85).
Now, if we assume that the errors are NID(0,σ) and ni = n, i = 1,. . .,k, the estimates of Eq. (85) are also the maximum likelihood
ones. For unbalanced designs, the maximum likelihood solution is better because the least squares solution is biased. The reader
interested in this subject should consult statistical monographs that describe this matter at a high level, such as Milliken and
Johnson52 and Searle.53
The mean of the i-th level is μi = μ + τi, i = 1,. . .,k. In our case, with a balanced design, an estimator of μi would be μ̂i = μ̂ + τ̂i = x̄i
and, as the errors are NID(0,σ), x̄i is NID(μi, σ/√n). Using MSE as an estimator of σ², Eq. (16) gives the confidence interval at the
(1 − α)100% level:

[ x̄i − tα/2,N−k √(MSE/n) ; x̄i + tα/2,N−k √(MSE/n) ]    (86)
A (1 − α)100% confidence interval on the difference in the means of any two levels, say μi − μj, would be

[ x̄i − x̄j − tα/2,N−k √(2MSE/n) ; x̄i − x̄j + tα/2,N−k √(2MSE/n) ]    (87)

With the data in Example 16 (Table 14), a 95% confidence interval on the difference between fibers 1 and 2 is given by
(489.75 − 602.25) ∓ 2.131 √(2 × 123.43/4), which is [−129.24, −95.76].
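Eq. (87) is easy to check numerically; a Python sketch with the values of the example (MSE = 123.43, N − k = 15 d.f.):

```python
# 95% confidence interval on mu_i - mu_j (Eq. (87)) with the example's values:
# mean recoveries 489.75 (fiber 1) and 602.25 (fiber 2), MSE = 123.43, n = 4.
from math import sqrt
from scipy.stats import t

xbar_i, xbar_j, mse, n, df = 489.75, 602.25, 123.43, 4, 15
half = t.ppf(0.975, df) * sqrt(2 * mse / n)   # t_{0.025,15} = 2.131
lo, hi = xbar_i - xbar_j - half, xbar_i - xbar_j + half
print(round(lo, 2), round(hi, 2))   # reproduces [-129.24, -95.76]
```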
In the example, as there is an effect of the type of fiber, it makes no sense to compute this interval.
Rejecting the null hypothesis in the fixed effect model of the ANOVA implies that there are differences between the k levels, but
the exact nature of the differences is not specified. To address this question, two procedures are used: orthogonal contrasts and
multiple tests.
For example, with the five fibers of Example 16, a possible set of orthogonal contrasts is:

H0: μ4 = μ5                      C1 = x̄4 − x̄5
H0: μ1 + μ3 = μ4 + μ5            C2 = x̄1 + x̄3 − x̄4 − x̄5
H0: μ1 = μ3                      C3 = x̄1 − x̄3
H0: 4μ2 = μ1 + μ3 + μ4 + μ5      C4 = −x̄1 + 4x̄2 − x̄3 − x̄4 − x̄5
For example, SSC1 = 4(−1 × 601.00 + 1 × 491.50)²/2 = 23,981 with 1 d.f. These sums of squares are incorporated into the ANOVA
table (Table 17).
Table 17 ANOVA table with orthogonal contrasts for composition of fibers for SPME.
Table 18 Results of Newman-Keuls for multiple comparison test; data of SPME fibers.

Fiber   Rank   Mean
2       1      602.25
4       2      601.00
3       3      498.50
5       4      491.50
1       5      489.75

The symbols aligned in columns indicate that the corresponding means are all equal two by two.
Table 19 Skeleton for using the corresponding tabulated values for the Newman-Keuls procedure.

x̄r(1) − x̄r(k)    x̄r(1) − x̄r(k−1)    ...    x̄r(1) − x̄r(2)
x̄r(2) − x̄r(k)    ...                x̄r(2) − x̄r(3)
⋱                ⋮
x̄r(k−1) − x̄r(k)
qα(k, k(n − 1))   qα(k − 1, k(n − 1))   ...   qα(2, k(n − 1))

t denotes the difference of ranks plus one; the subscript r(i) indicates the i-th rank. k is the number of levels in the ANOVA and the qα
are the tabulated values at significance level α.
The values qα(t, k(n − 1)) in Eq. (90) are tabulated. Table 20 shows some of them. They depend, as usual, on the significance level α,
on t, and on the d.f. N − k of MSE. Further, the first term in Rt changes with the difference of ranks, t. The corresponding values are
written in the last row in Table 19.
The critical region is made up by

CR = { x̄r(i) − x̄r(i+t−1) ≥ Rt }    (91)
The results obtained when applying the method of Newman-Keuls to the data of Example 16 are given in Table 21. The first column
contains the means to be compared; for example, 1–2 indicates that the comparison is between x̄1 and x̄2. The second column
contains the differences (without sign) between the means. The values of t (difference of ranks plus one) are in the third column; for
example, t = 5 in the first row because, with the ranks in Table 18, x̄1 has rank 5 and the rank of x̄2 is 1. The next column contains the
critical value (Rt) computed with the value of q in Table 20 and Eq. (90). The critical value Rt defines the critical region so that the
analyst can decide whether the estimated difference is significant or not. At the 5% significance level, the resulting decision of rejecting
or not rejecting the null hypothesis is shown in the last column of Table 21.
Usually, the result of this multiple comparison is presented as in the last column of Table 18, which is more graphic. The columns
with aligned symbols indicate that the corresponding means are all equal two by two. In our example, on the one hand, the
means x̄2 and x̄4 are equal to each other and, on the other hand, so is any pair among x̄1, x̄3, and x̄5, according to the decisions in
Table 21. It is possible to conclude that there are two groups of fibers; as far as the recovery is concerned, fibers 2 and 4 provide
results that are significantly equal and greater than the recovery obtained with the other three fibers, which are similar to each other.
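The tabulated values qα(t, ν) used for Rt can also be obtained from the studentized range distribution; a Python sketch, assuming SciPy ≥ 1.7 (which provides scipy.stats.studentized_range):

```python
# Critical values R_t for the Newman-Keuls procedure (Eq. (90)),
# computed from the studentized range instead of Table 20.
from math import sqrt
from scipy.stats import studentized_range

mse, n, df, alpha = 123.43, 4, 15, 0.05   # values from Example 16
for t_span in range(2, 6):                # t = difference of ranks plus one
    q = studentized_range.ppf(1 - alpha, t_span, df)
    r_t = q * sqrt(mse / n)               # R_t = q_alpha(t, N-k) * sqrt(MSE/n)
    print(t_span, round(q, 3), round(r_t, 2))
```

Each estimated difference of means is then declared significant when it exceeds the Rt for its value of t.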
Table 20 Values of qα(t, n), the upper percentage points of the studentized range for α = 0.05.

n \ t    2   3   4   5   6   7   8   9   10
In the random effects model, the observations are written as

xij = μ + τi + εij, i = 1,. . .,k; j = 1,. . .,n    (92)

where τi and εij are independent random variables. Note that the model is identical in structure to the fixed effect case (Eq. (74)), but
the parameters have a different interpretation. If V(τi) = στ², then the variance of any observation is

V(xij) = στ² + σ²    (93)

The variances of Eq. (93) are called variance components, and the model, Eq. (92), is called the components of variance or the
random effects model. To test hypotheses in this model, we require that the εij are NID(0,σ), all of the τi are NID(0,στ), and τi and εij
are independent of one another.
The sum of squares equality SST = SSF + SSE still holds. However, instead of testing the hypothesis about individual level effects,
we test the hypothesis

H0: στ² = 0
H1: στ² > 0

If στ² = 0, all levels are identical; if στ² > 0, then there is variability between levels. Thus, under the null hypothesis, the ratio

Fcalc = [SSF/(k − 1)] / [SSE/(N − k)] = MSF/MSE    (94)

is distributed as an F with k − 1 and N − k d.f. The expected values (means) of MSF and MSE are

E(MSF) = σ² + n στ²    (95)

and

E(MSE) = σ²    (96)
1 − β = pr{ Fk−1,N−k > Fα,k−1,N−k / λ² }    (98)

where λ² = 1 + nστ²/σ². As σ² is usually unknown, we may either use a prior estimate or define the value of στ² that we are interested
in detecting in terms of the ratio στ²/σ². An application to determine the number of replicates in a proficiency test can be seen in
Example A15 of Appendix and in ANOVA_section1024_live.mlx in the supplementary material.
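Because Eq. (98) only involves the central F distribution, the power is straightforward to compute; a Python sketch in which the values of k, n, and the variance ratio are illustrative, not taken from the chapter:

```python
# Power of the random-effects ANOVA, Eq. (98):
# 1 - beta = P( F_{k-1,N-k} > F_crit / lambda^2 ), lambda^2 = 1 + n*ratio.
from scipy.stats import f

def power_random_effects(k, n, ratio, alpha=0.05):
    """ratio = sigma_tau^2 / sigma^2, the variance ratio to be detected."""
    dfn, dfd = k - 1, k * (n - 1)
    f_crit = f.ppf(1 - alpha, dfn, dfd)
    lam2 = 1 + n * ratio
    return f.sf(f_crit / lam2, dfn, dfd)

# e.g. k = 10 laboratories, n = 2 replicates, sigma_tau^2 = sigma^2
print(round(power_random_effects(10, 2, 1.0), 3))
```

Scanning k and n with this function gives the number of laboratories and replicates needed for a target power.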
Confidence Intervals for the Estimated Parameters in the Random Effects Model
In general, the mean value per level x̄i does not have more statistical meaning than being a sample of the random factor. But
sometimes, as in the case of proficiency tests, this mean value is of interest for each participating laboratory. The variance of the
mean value per level is theoretically equal to V(x̄i) = στ² + σ²/n. From Eqs. (95), (96), MSF/n (with k − 1 d.f.) estimates the variance
of the mean per level. As a consequence, the 100(1 − α)% confidence interval is

[ x̄i − tα/2,k−1 √(MSF/n) ; x̄i + tα/2,k−1 √(MSF/n) ]    (99)
When calculating the variance of the overall mean, it is necessary to consider the variability provided by the factor, as the factor
always acts. For example, when evaluating an analytical method, the results without the variability attributable to the factor
laboratory are not conceivable. The variance of the overall mean is V(x̄) = Σ V(x̄i)/k², which is estimated by MSF/(nk), with k − 1
d.f., so that the 100(1 − α)% confidence interval is

[ x̄ − tα/2,k−1 √(MSF/(nk)) ; x̄ + tα/2,k−1 √(MSF/(nk)) ]    (100)
The random effects ANOVA is a model of practical interest because it allows attributing real meaning to many statements that seem
evident. For example, the samples distributed to laboratories in a proficiency test must be homogeneous. Strictly speaking, on most
occasions it is impossible to assure homogeneity, but it is enough that the variability attributable to the change of sample is
significantly smaller than the one attributable to the procedure of analysis. This can be guaranteed by means of a random effects
ANOVA.
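A Python sketch of the intervals of Eqs. (99) and (100); the values of MSF, the means, k, and n below are illustrative numbers, not data from the chapter:

```python
# Confidence intervals of Eqs. (99) and (100) from the MSF of a
# random-effects ANOVA (illustrative values).
from math import sqrt
from scipy.stats import t

k, n = 8, 3                       # 8 laboratories, 3 replicates each
msf, xbar_i, xbar = 0.90, 13.55, 13.48
t_val = t.ppf(0.975, k - 1)       # t_{alpha/2, k-1}
# per-laboratory mean, Eq. (99)
ci_lab = (xbar_i - t_val * sqrt(msf / n), xbar_i + t_val * sqrt(msf / n))
# overall mean, Eq. (100)
ci_all = (xbar - t_val * sqrt(msf / (n * k)), xbar + t_val * sqrt(msf / (n * k)))
print(ci_lab, ci_all)
```

Note that both intervals use MSF with k − 1 d.f., so the interval on the overall mean is narrower by a factor of √k.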
The hypotheses of normality (Section “Goodness-of-Fit Tests: Normality Tests”), and the equality of variances, when applicable,
will have to be tested with the appropriate tests (Section “Hypothesis Test on the Comparison of Several Independent Variances”).
When a hypothesis test is to be posed, one can think about omitting some known data, for example, the variance. The effect is a
loss of power; that is, with the same value of α and the same sample size, there is a greater probability of type II error. Said otherwise,
to maintain power, larger sample sizes are needed to get the same experimental evidence; a calculation on this matter is in Case 2 of
Section "Hypothesis Test on the Mean of a Normal Distribution". The same applies to the use of one-tail tests with respect to the
corresponding two-tail tests, or to the use of nonparametric tests that do not impose any type of distribution a priori.
Also, it is important to remember that the presence of outlier data tends to greatly increase the variance, so that the tests become
insensitive; that is to say, stronger experimental evidence is needed to reject the null hypothesis. The nonparametric alternative has, in
general, a high cost in terms of power for the same significance level (or in terms of sample size). For this reason, its use is not
advised unless it is strictly necessary. In addition, some nonparametric tests also assume hypotheses on the distribution of the values,
for example, that it is symmetric or unimodal.
Precision
The other very important criterion in the validation of a method is the precision. In the ISO 5725,7 the IUPAC (Inczédy et al.11,
Sections 2 and 3), and the 2002/657/EC European Decision,3 we can read “Precision, the closeness of agreement between
independent test results obtained under stipulated conditions”.
The precision is usually expressed as imprecision. The smaller the dispersion of the random component in Eq. (1), the more
precise the procedure. It must be remembered that the precision depends solely on the distribution of the random errors and is not
related to the reference value or the value assigned to the sample. In a first approach, it is computed as a standard deviation of the
results; nevertheless, even the ISO 5725-5 recommends the use of a robust estimation.
Two measures, limits in a certain sense, of the precision of an analytical method are the reproducibility and the repeatability.
Repeatability is defined as precision under repeatability conditions. Repeatability conditions means conditions where independent
test results are obtained with the same method on identical test items in the same laboratory by the same operator using the
same equipment in short intervals of time. Repeatability as standard deviation is denoted as sr.
The repeatability limit, r, is the value below which lies, with a probability of (1 − α)100%, the absolute value of the difference
between two results of a test obtained under repeatability conditions. The repeatability limit is given by

r = zα/2 √2 sr    (101)

where zα/2 is the α/2 upper percentage point of the standard normal distribution.
Reproducibility is defined as precision under reproducibility conditions. Reproducibility conditions means conditions where
test results are obtained with the same method on identical test items in different laboratories with different operators using
different equipment. Reproducibility as standard deviation is named sR.
The reproducibility limit, R, defined in Eq. (102), is the value below which lies, with a probability of (1 − α)100%, the absolute
value of the difference between two results of a test, results obtained under reproducibility conditions.

R = zα/2 √2 sR    (102)
When estimating sr (or sR) with n < 10, a correction factor7 should be applied to Eq. (6).
Notice that both R and r define, in fact, two-sided tolerance intervals for the difference of two measurements.
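At α = 0.05, zα/2 √2 ≈ 2.77, which gives the familiar rules r ≈ 2.77 sr and R ≈ 2.77 sR; a Python sketch in which the standard deviations are illustrative values:

```python
# Repeatability and reproducibility limits, Eqs. (101)-(102).
from math import sqrt
from scipy.stats import norm

def limit(s, alpha=0.05):
    """z_{alpha/2} * sqrt(2) * s; at alpha = 0.05 this is about 2.77*s."""
    return norm.ppf(1 - alpha / 2) * sqrt(2) * s

s_r, s_R = 0.07, 0.28        # illustrative standard deviations
print(round(limit(s_r), 3), round(limit(s_R), 3))
```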
The ISO introduces the concept of intermediate precision when only some of the factors described in the reproducibility
conditions are varied. A particularly interesting case is when the "internal" factors of the laboratory (analyst, instrument, day) are
varied, which in the Commission Decision3 is called intralaboratory reproducibility.
One of the causes of the ambiguity when defining precision is the laboratory bias. When the method is applied in a single
laboratory, the laboratory bias is a systematic error of that laboratory. If the analytical method is evaluated in general, the laboratory
bias becomes a part of the random error: the change of laboratory contributes to the variance expected for a determination
conducted with that method in any laboratory.
The most eclectic position is the one described in the ISO 5725 that declares “The laboratory bias is considered constant when
the method is used under repeatability conditions but is considered as a random variable if series of applications of the method are
made under reproducibility conditions”.
With these premises, we can realize that to evaluate the precision of an analytical method is equivalent to estimating the variance
of the random error in the results and that the discrepancies that can appear when establishing the sources of variability must be
explicitly identified, for example, the laboratory bias.
The precision of two methods can be compared by a hypothesis test on the equality of variances, under the normality
assumption, that is, an F-test (Section “Hypothesis Test on the Variances of Two Normal Distributions”).
Another usual problem is to decide whether the variance observed can be considered significantly equal or not to an external
value, which is decided by using a w2 test (Section “Hypothesis Test on the Variance of a Normal Distribution”).
It is common that the lack of control of a concrete aspect of an analytical procedure is the origin of a great variability. If the
experimental conditions are not stable, we will have additional variability in the determinations. The F-test makes it possible to
decide whether the precision improves significantly when an optimization is carried out.
In fact, many improvements in the procedures are the consequence of acting after the identification of some causes of variability
in the results and their quantification. More details about this aspect of control and improvement of the precision are given in the
section dedicated to the ruggedness of chemical analysis.
The technique used in the random effects ANOVA is also the adequate technique to split the variance of the experimental data
into addends, and it is thus specially adapted to estimate the repeatability and the reproducibility of an analytical method when
an interlaboratory test comparison has been carried out. In the following, the use of an ANOVA to estimate reproducibility and
repeatability in a proficiency test is briefly explained.
There is no doubt that a good analytical procedure has to be insensitive to the laboratory where it is conducted. To decide
whether the “change of laboratory” has any effect, k laboratories apply a procedure to aliquot samples; each laboratory makes n
determinations. In the terminology of the ANOVA, we have a random factor (the laboratory) at k levels and n replicates in each level.
It has already been said that in general, it is not necessary to have the same number of replicates in all the levels.
We denote by xij the experimental results, where i ¼ 1,. . .,k identifies the laboratory and j ¼ 1,. . .,n the replicate.
Fig. 7 is a skeleton of Eqs. (93)–(96) and shows how to compute an estimate of the variance of the random variable E in Eq. (92).
If the analytical procedure is well defined, the k estimates s2i are expected to be approximately equal and to gather the variability due
to the use of the analytical method by only one laboratory. In these conditions, the pooled variance s2p is a joint (“pooled”) estimate
of the same variance, that is, by definition, the repeatability of the method expressed as standard deviation (ISO 5725)
sr = √(V̂(ε)) ≈ sp    (103)
From the same data we can obtain k estimates of the bias Di (Fig. 7, top) and then the variance of the laboratory bias, considering
this bias as a random variable. Taking into account the quantities estimated by the variances described in Fig. 7, one obtains the
following expression for the interlaboratory variance:
V̂(D) ≈ sx̄² − sp²/n    (104)

which, linked to Eq. (1), provides the following estimate of the reproducibility as standard deviation (ISO 5725):

sR ≈ √( V̂(D) + V̂(ε) )    (105)
In the ANOVA, the null hypothesis is that V(D) ¼ 0 (i.e., there is no effect of the factor), and the alternative is that at least one
laboratory has non-null bias (there is effect of the factor).
The conclusion of the ANOVA is obtained by deciding whether both variances, n sx̄² and sp², can be considered significantly equal.
To decide this, an F-test is applied. The logic is clear: if there is no laboratory effect, V(D) should be significantly zero and, thus,
both variances are equal or, in other words, they estimate the same quantity.
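The estimates of Eqs. (103)-(105) follow directly from the mean squares of the one-way random-effects ANOVA; a Python sketch on synthetic interlaboratory data (k = 5 laboratories, n = 4 replicates; the data generation is illustrative):

```python
# Repeatability s_r and reproducibility s_R (Eqs. (103)-(105)) from an
# interlaboratory layout: k laboratories, n replicates each.
import numpy as np

def sr_sR(x):
    """x: array of shape (k, n), one row of results per laboratory."""
    k, n = x.shape
    lab_means = x.mean(axis=1)
    mse = sum(((row - row.mean()) ** 2).sum() for row in x) / (k * (n - 1))
    msf = n * ((lab_means - lab_means.mean()) ** 2).sum() / (k - 1)
    s_r = np.sqrt(mse)                       # Eq. (103): pooled within-lab sd
    var_d = max((msf - mse) / n, 0.0)        # Eq. (104), truncated at zero
    s_R = np.sqrt(var_d + mse)               # Eq. (105)
    return s_r, s_R

rng = np.random.default_rng(0)
labs = rng.normal(0.0, 0.5, size=5)[:, None]          # laboratory biases D_i
x = 13.5 + labs + rng.normal(0.0, 0.1, size=(5, 4))   # replicate errors
print(sr_sR(x))
```

The truncation of V̂(D) at zero handles the case MSF < MSE, in which the between-laboratory component is estimated as null.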
In practice, the expression for the computation of the power of the ANOVA with random effects (Eq. (98)) is useful in deciding
the number of laboratories that should participate, k, and the number of replicated determinations, n, that each one must conduct.
It is essential to remember that an ANOVA requires normal distribution of the residuals and equality of the variances s21, s22, . . ., s2k .
When the number of replicates is two (n = 2), a common way of performing the interlaboratory analysis is to use Youden's graph54
to show the trueness and precision of each laboratory. Actually, Youden's graph is nothing but the graphical representation of an
ANOVA, as shown in Kateman and Pijpers.55 In addition to being used for comparing the quality of the laboratories, Youden's graph
can be used to compare two methods of analysis in terms of the laboratory bias they have.
An approach for the comparison of two methods in the intralaboratory situation has been proposed by Kuttatharmmakul
et al.56 Instead of the reproducibility, as included in Fig. 7 and the ISO guidelines, the (operator + instrument + time)-different
intermediate precision is considered in the comparison.
In the case of precision, the effect of outlier data is really devastating; hence, a very delicate task of analysis to detect those outlier
data is essential. In general, more than one test is needed (usual ones are those of Dixon, and Grubbs and Cochran), especially to
accept the hypotheses of the ANOVA made for the determination of repeatability and reproducibility. In view of the difficulties, the
AMC5,6 advises the use of robust methods to evaluate the precision and trueness and for proficiency testing. This path is also
followed in the new ISO norm about reproducibility and repeatability.
1. Critical examination of the data, in order to identify outliers or other irregularities, and to verify the suitability of the model.
2. To compute for each level of concentration the preliminary values of precision and mean.
3. To establish the final values of precision and means, including the establishment of a relation between precision and the level of
concentration when the analysis indicates that such relation may exist.
The analysis includes a systematic application of statistical tests for detecting outliers, and a great variety of such tests are available
from the literature and could be used for this task.
Adapted with permission from ISO-5725–2. Accuracy, Trueness and Precision of Measurement Methods and Results; Gèneve, 1994; p. 22.
1. Detection of one outlying observation (single Grubbs' test)
To verify whether the greatest observation, xn, is significantly different from the rest, the statistic Gn is computed as

Gn,calc = (xn − x̄)/s    (106)
On the contrary, to verify whether the smallest observation, x1, is significantly different from the rest, the statistic G1 is
computed as

G1,calc = (x̄ − x1)/s    (107)

In Eqs. (106), (107), x̄ and s are, respectively, the mean and standard deviation of the xi.
To decide whether the greatest or smallest value is significantly different from the rest at 100a% significance level, the values
obtained in Eqs. (106), (107) are compared to the corresponding critical values written down in Table 22.
The decision includes two "anomaly levels":
(a) If Gi,calc < G0.05,i, with i = 1 or i = n, accept that the corresponding x1 or xn is similar to the rest.
(b) If G0.05,i < Gi,calc < G0.01,i, with i = 1 or i = n, the corresponding x1 or xn is considered a straggler.
(c) If G0.01,i < Gi,calc, with i = 1 or i = n, the corresponding x1 or xn is incompatible with the rest of the data of the same level
(statistical outlier).
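A Python sketch of the single Grubbs' statistics of Eqs. (106) and (107); the critical values must still be taken from Table 22, and the data below are illustrative:

```python
# Single Grubbs' statistics, Eqs. (106)-(107). The decision against the
# critical values G_{0.05,n} and G_{0.01,n} is not automated here.
import statistics

def grubbs_single(data):
    """Return (G1, Gn): statistics for the smallest and greatest values."""
    xs = sorted(data)
    m = statistics.mean(xs)
    s = statistics.stdev(xs)          # sample standard deviation
    g_low = (m - xs[0]) / s           # Eq. (107), smallest observation
    g_high = (xs[-1] - m) / s         # Eq. (106), greatest observation
    return g_low, g_high

print(grubbs_single([1, 2, 3, 4, 10]))   # the large g_high flags 10
```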
2. Detection of two outlying observations (double Grubbs’ test)
Sometimes it is necessary to verify that two extreme data (very large or very small) incompatible with the others do not exist.
In the case of the two greatest observations, xn and xn−1, the statistic G is computed as

G = s²n−1,n / s²0    (108)

where s²0 = Σi=1..n (xi − x̄)² and s²n−1,n = Σi=1..n−2 (xi − (1/(n − 2)) Σi=1..n−2 xi)².
Similarly, it is possible to jointly decide on the two smallest observations, x1 and x2, by means of the following statistic:

G = s²1,2 / s²0    (109)

where s²1,2 = Σi=3..n (xi − (1/(n − 2)) Σi=3..n xi)².
The decision rule is analogous to the one of the case of an extreme value but with the corresponding critical values in
Table 22.
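The reduced sums of squares of Eq. (108) can be coded directly; a Python sketch for the two greatest observations, on illustrative data (a small G points to a pair of upper outliers, since dropping them shrinks the sum of squares dramatically):

```python
# Double Grubbs' statistic for the two greatest observations, Eq. (108):
# G = s^2_{n-1,n} / s^2_0, where the reduced sum of squares drops the
# two largest values.
def grubbs_double_high(data):
    xs = sorted(data)
    n = len(xs)
    m_all = sum(xs) / n
    s2_0 = sum((x - m_all) ** 2 for x in xs)      # full sum of squares
    reduced = xs[:n - 2]                          # without the two largest
    m_red = sum(reduced) / (n - 2)
    s2_red = sum((x - m_red) ** 2 for x in reduced)
    return s2_red / s2_0

print(grubbs_double_high([1, 2, 3, 4, 50, 60]))   # much smaller than for
print(grubbs_double_high([1, 2, 3, 4, 5, 6]))     # the clean series
```

The same function applied to the mirrored data (negated values) gives the statistic of Eq. (109) for the two smallest observations.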
In general, norms like ISO 5725,7 propose inspecting the origin of the anomalous results and, if no assignable cause exists,
eliminating the incompatible ones and leaving the stragglers, indicating their condition with an asterisk.
Table 24 Robust and nonrobust estimates of the centrality and dispersion parameters (data of Table 23).

                               All data (n = 20)   Without 15.93 (n = 19)
Nonrobust procedures
  Mean, x̄                      13.60               13.47
  Standard deviation, s        0.60                0.25
Robust procedures
  Median                       13.49               13.48
  H15, centrality parameter    13.50               13.48
  MAD/0.6745                   0.26                0.21
  H15, dispersion              0.27                0.24
Example 17: With a didactic purpose, to apply Grubbs' test and to verify the effect of outliers, the data of Table 23 have been
considered as a unique series of 20 results.
The greatest value is 15.93 and the lowest is 13.03, with s = 0.60 and x̄ = 13.60. Eq. (106) gives G20,calc = 3.889 and Eq. (107)
gives G1,calc = 0.942. By consulting the critical values in Table 22, G0.05,20 = 2.709 and G0.01,20 = 3.001; therefore, according to the
decision rule in Case 1 (single Grubbs' test), the value 15.93 should be considered different from the rest.
Applying the test again, with 19 data, the greatest value is now 13.92 and the lowest is still 13.03, with G19,calc = 1.804 and
G1,calc = 1.785. As the tabulated values are G0.05,19 = 2.681 and G0.01,19 = 2.968, there is no evidence to say that either of the extreme
values is different from the rest. Table 24 contains the mean and standard deviation, with and without the value 15.93. A large effect
is observed on the standard deviation, which is reduced by more than 50% when removing the point.
Grubbs’ test can also be applied to the mean values per level. In practice, Grubbs’ test is also used to restore the equality of
variances in the ANOVA when the homogeneity of variances is rejected (section “Hypothesis Test on the Comparison of Several
Independent Variances”). The work by Ortiz et al.47 contains a complete analysis with sequential application of Cochran’s, Bartlett’s,
and Grubbs’ tests.
Huber's H15 estimator is based on the function

Ψ(x) = x if |x − m| ≤ cs; Ψ(x) = m + cs if x > m + cs; Ψ(x) = m − cs if x < m − cs, with c = 1.5    (110)

where m and s are the centrality and dispersion parameters, which must be iteratively estimated. The function in Eq. (110) is
represented in Fig. 8.
The estimate is exactly the generalization of the maximum likelihood estimate. It is asymptotically optimal for high-quality
data, that is, data with little contamination and not very different from data following a Student's t distribution with three d.f.
Remember that Hampel et al.57 have shown that Student's t distributions with between 3 and 9 d.f. reproduce high-quality
experimental data, and that for t3 the efficiency of the mean and standard deviation is 50% and 0%, respectively. Therefore, in
practice there is a need for robust estimates even for high-quality empirical data (such as those obtained with present analytical
methods).
Fig. 8 Huber's Ψ function: linear between m − cs and m + cs, and constant outside this interval.
Table 25 Robust and nonrobust estimates of the repeatability and reproducibility with data of Table 23.

ANOVA                        All data (n = 20)   Without 15.93 (n = 19)   Without series 4 and 13.92 (n = 15)
Nonrobust procedure
  Fcalc (P-value)            0.22 (0.92)         17.02 (<5 × 10⁻⁵)        24.91 (<5 × 10⁻⁵)
  SSF (d.f.)                 0.094 (4)           0.229 (4)                0.083 (3)
  SSE (d.f.)                 0.431 (15)          0.013 (14)               0.003 (11)
  sR                         0.657               0.260                    0.153
  sr                         0.657               0.116                    0.058
  P-value, Cochran's test    8.9 × 10⁻⁹          0.001                    0.093
  P-value, Bartlett's test   3.9 × 10⁻⁸          0.001                    0.100
  P-value, Levene's test     0.53                0.61                     0.005
Robust procedures
  Robust sR                  0.281                                        0.172
  Robust sr                  0.072                                        0.072
The H15 estimator provides enough protection against a high concentration of data that are abnormally large but near to the
correct data. Nevertheless, clearly anomalous data are not rejected by the H15 estimator; they maintain the maximum, though
bounded, influence. This produces an avoidable loss of efficiency of the H15 estimator of between 5% and 15% when the
proportion of anomalous data present is also between 5% and 15% (rather usual percentages in routine analyses). In order to avoid
this limited weakness, robust estimators such as the median and the median of absolute deviations (MAD) (Eq. (111)) are necessary,
at least in the first step of the calculation, to surely identify most of the "suitable" data.
The robust procedure obtained when adapting the H15 estimator to the problem of estimating repeatability and reproducibility
as posed in the ISO norm consists of two stages; it has been applied here identically to the proposal in Sanz et al.60 As in
the parametric procedure, it uses the mean and standard deviation of the data. Therefore, once the robust procedure is applied, the
data necessary to estimate the reproducibility or the intermediate precision are at hand.
In order to verify the utility of these robust procedures, with the same data of Table 23 considered as a unique series of 20 values,
the median and the centrality parameter of the H15 estimator have been written down in Table 24. These are very similar to the
nonrobust estimates, for both 20 and 19 values. Nevertheless, the robust parameters of dispersion, MAD/0.6745 and H15, do not
differ when considering 20 or 19 data and are similar to the standard deviation obtained after applying the method of Grubbs and
repeating the calculations without the outlier. For this reason, it is a good strategy to systematically apply robust procedures together
with the classic ones. A difference in the results is an indication of the presence of outlier data, in which case the robust estimations
will have to be used.
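A Python sketch of the robust estimates used in Table 24, the median and MAD/0.6745; the H15 iteration is not reproduced here, and the data are illustrative:

```python
# Robust centrality and dispersion: the median, and MAD/0.6745 as a
# robust analogue of the standard deviation.
import numpy as np

def robust_estimates(data):
    x = np.asarray(data, float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))   # median of absolute deviations
    return med, mad / 0.6745           # 0.6745 rescales MAD to sigma

# A single outlier barely moves the robust estimates:
clean = [13.4, 13.5, 13.5, 13.6, 13.6, 13.7]
dirty = clean + [15.9]
print(robust_estimates(clean))
print(robust_estimates(dirty))
```

Comparing these estimates with the mean and standard deviation, as suggested above, flags the presence of outliers without any prior test.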
The effect, and therefore the advantage, of the robust procedures is much more remarkable when a random effects ANOVA is
evaluated, for example, to estimate the reproducibility and repeatability of a method by means of an interlaboratory test as the one
described in Fig. 7. To show this, we will use the data of Table 23, this time considering its structure of levels of the factor (k ¼ 5) and
replicates (n ¼ 4).
The values of reproducibility and repeatability should not be accepted if the homogeneity of variances in the ANOVA
assumption is not fulfilled. In this case, it is necessary to verify whether some of the levels have outlier data. The first column of
Table 25 shows that the ANOVA with all the data is not acceptable because the variances cannot be considered equal (rejection in
the tests on variance homogeneity). In addition, it is observed that the anomaly in the data causes the estimates sR and sr to be
equal and very different from the robust estimates.
Once the value 15.93 of series 5 is removed, the ANOVA (column 2 of Table 25) points to a significant effect, but a lack of
variance homogeneity is still observed. Nevertheless, the new estimates of sR and sr are more similar to those obtained with the robust
procedure.
The lack of equality of variances forces one to eliminate series 4, which has a very different variance (smaller than the others), and
later the value 13.92 of series 5. The final result of this sequential process is in the third column of Table 25: there is a significant
effect in the ANOVA, the homogeneity of variances can be accepted, and the estimates of the reproducibility and the repeatability are
0.153 and 0.058, similar to the ones obtained with the robust procedure without series 4. The values sR and sr can be too small due
to the elimination of data, with the risk of underestimations that are not realistic and thus impossible to fulfil by the
laboratories. For this reason, it is advised5–7 to avoid reducing the sample and to maintain the initial robust estimates.
As the presence of outliers in experimental work is unavoidable, robust statistical methodology has become consolidated as an
essential tool in chemical analysis. Further information can be found, for example, in the chapter of this book dedicated to robust
statistical techniques.
Accuracy
According to the IUPAC (Inczédy et al.11, Sections 2–3), the ISO,7 and the Directive of the European Union (Definition 1.1 of the Commission Decision3), accuracy is defined as the “closeness of agreement between a test result and the accepted reference value”. It is estimated by determining trueness and precision. Evidently, this definition brings together the systematic and random errors, because for an individual determination xi − μ = (xi − x̄) + (x̄ − μ) = ε + Δ.
In practice, it is unreasonable to assume that an analytical procedure has no bias; what can be done experimentally is to decide about the hypothesis of null bias. If the bias is significant, it is possible to correct the measurement by subtracting the value Δ. However, this implies an increase in the variance of the final result, because Δ is estimated from experimental replicates and therefore has uncertainty itself. For this reason, when the uncertainty of a measurement is expressed, it is usual to include a term that accounts for the bias, in a form similar to Eq. (105). For a detailed treatment of this question, consult the EURACHEM/CITAC guide.1
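As a numerical illustration of why correcting by Δ increases the variance of the result, consider the following sketch (Python; all numbers are made up, and the quadratic combination used here is one common choice, not necessarily Eq. (105)):

```python
import math

# Hypothetical certified reference material with value mu = 5.00,
# measured n = 10 times with mean xbar and standard deviation s.
mu, n = 5.00, 10
xbar, s = 5.08, 0.12

delta = xbar - mu            # estimated bias
u_delta = s / math.sqrt(n)   # uncertainty of the bias estimate

# If results are corrected by subtracting delta, the uncertainty of a future
# single corrected result includes both the repeatability and u(delta):
u_corrected = math.sqrt(s ** 2 + u_delta ** 2)
print(f"bias = {delta:.3f}, u(bias) = {u_delta:.4f}, u(corrected) = {u_corrected:.4f}")
```

The corrected result is always somewhat more uncertain than the raw repeatability alone, exactly as the paragraph above argues.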
Ruggedness
The ruggedness of a method is defined as its capacity to maintain trueness and precision over time. The same applies to the robustness of a reference material or any other reagent.
Ruggedness refers to the susceptibility of an analytical method to changes in experimental conditions, which can be expressed as a list of sample materials, analytes, storage conditions, environmental conditions, and/or sample preparation conditions under which the method can be applied as presented or with specified minor modifications.3
The study of ruggedness can be approached with two different statistical methodologies. One of them consists of using the well-known control charts (confidence intervals on the mean, the variance, or the range of the measured parameter) and continuously recording the results obtained on known samples over time. This type of “a posteriori” control is essential to maintain the quality (precision, trueness, capability of detection, etc.) of a measurement method and to establish alarm mechanisms when an observed drift can degrade the quality of the procedure, affecting the value of the analytical results. There is also a chapter in this book that deals specifically with control charts.
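As a minimal illustration of such an “a posteriori” control, the limits of a classical Shewhart chart for the mean can be computed as a confidence-type interval around the in-control value (the numbers below are hypothetical):

```python
import math

# Hypothetical reference-sample control: in-control mean m, repeatability
# standard deviation s, subgroup size n.
m, s, n = 10.0, 0.15, 4

# Classical 3-sigma Shewhart limits for the mean chart
# (approx. 99.7% coverage under normality)
ucl = m + 3 * s / math.sqrt(n)
lcl = m - 3 * s / math.sqrt(n)

new_result = 10.28   # a new subgroup mean to be checked
in_control = lcl <= new_result <= ucl
print(f"LCL = {lcl:.3f}, UCL = {ucl:.3f}, in control: {in_control}")
```

A point outside (LCL, UCL) triggers the alarm mechanism mentioned in the text; here the new result falls above the upper limit.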
The other approach to the problem of ruggedness involves evaluating “a priori” the variability expected in the analytical
procedure and identifying the sources of that variability.
Before routinely using a procedure, the effect of small changes in the reagents, in the conditions of work, or in the specifications
of its protocol must have been verified. It can happen that small changes in the volume of extracting reagent do not lead to great
variations in the response, whereas a small variation in, say, pH does. One way of knowing and controlling this quality criterion is
by making small changes in the potentially influential factors and observing the effect on the response.
The influence of each factor should not be analyzed separately: doing so is not methodologically adequate and, in addition, it is not realistic, because in practice unforeseeable combinations of all the factors will occur that can affect the results. Instead, the methodology of the design of experiments should be used; details can be found in the corresponding chapters of this collection. As the number of factors that potentially affect the response is large, highly fractionated two-level factorial designs have to be used (to reduce, e.g., the 2⁷ = 128 different experiments needed for a complete factorial design with seven factors). Plackett–Burman designs and D-optimal designs have proven to be useful tools in ruggedness analysis.61–67 For more alternatives, consult the chapter dedicated to these strategies.
Example 18: An analysis of the ruggedness of an extraction procedure for three sulfonamides is carried out. The seven factors considered are buffer solution, pH, methanol as extracting agent, extraction cycles, petroleum benzin, volume of elution, and evaporation mode. A Plackett–Burman design has been proposed to estimate the effects of the factors by fitting a linear model for each sulfonamide:

y = b0 + b1x1 + b2x2 + ⋯ + b7x7 (112)

where xi denotes the ith factor (Table 26) and y represents the response to be modeled, which is the chromatographic peak area for each of the three sulfonamides. The details about the experimental domain can be seen in Table 26, where the nominal level is codified as “−” and the extreme level as “+”.
Table 26 Experimental factors with nominal (−) and extreme (+) levels selected for a Plackett–Burman design for seven factors (ruggedness analysis of an extraction procedure of sulfonamides).

Table 27 shows the experimental runs and the two values (replicates) of the three responses, which are the areas under the chromatographic peak (in a.u.) of sulfadiazine (SDZ), sulfamethazine (SMT), and sulfamethoxypyridazine (SMP). All experiments are replicated twice.

Table 28 Estimated coefficients of the linear model (Eq. (112)) fit for each sulfonamide by means of a Plackett–Burman design.
Finally, Table 28 contains the estimated coefficients of the model in Eq. (112) and their P-values. The conclusion is that only the extracting agent (x3), the number of cycles in the extraction (x4), and the volume of petroleum benzin (x5) are significant at the 5% level. Hence, special care should be taken with these factors, because small changes in any of them can cause large variations in the response.
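A sketch of how such a design is built and fitted follows (Python/NumPy; the generator row and the response values are illustrative, not those of Tables 26–28). With 8 runs and 8 coefficients the model is saturated, which is why Example 18 relies on replicates for the error estimate:

```python
import numpy as np

# 8-run Plackett-Burman design for 7 two-level factors (columns x1..x7),
# built from cyclic shifts of the generator + + + - + - - plus a row of minuses.
gen = np.array([1, 1, 1, -1, 1, -1, -1])
X = np.array([np.roll(gen, i) for i in range(7)] + [-np.ones(7, dtype=int)])

# Orthogonality check: X'X = 8 I for a proper Plackett-Burman design
assert np.allclose(X.T @ X, 8 * np.eye(7))

# Synthetic response in which only x3, x4, x5 truly matter (cf. Example 18)
rng = np.random.default_rng(1)
y = 100 + 6 * X[:, 2] + 4 * X[:, 3] - 5 * X[:, 4] + rng.normal(0, 0.5, 8)

# Fit y = b0 + b1 x1 + ... + b7 x7 (Eq. (112)) by least squares
A = np.column_stack([np.ones(8), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(b, 2))
```

The estimated coefficients recover the three active factors; in a real ruggedness study the replicates provide the standard error against which each coefficient is tested.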
Appendix
Some Basic Elements of Statistics
A distribution function (cumulative distribution function, cdf ) in ℝ is any function F that is nondecreasing, satisfies lim x→−∞ F(x) = 0 and lim x→+∞ F(x) = 1, and is right-continuous, that is, lim h→0+ F(x + h) = F(x) at every point x.
If the distribution function is continuous, then the lateral limits coincide with the value of the function at the corresponding point.
The probability density function f(x), abbreviated pdf, if it exists, is the derivative of the cdf.
Each random variable X is characterized by a distribution function FX(x).
When several random variables are handled, it is necessary to define the joint distribution function, FX,Y(x, y) = pr{X ≤ x, Y ≤ y}.
If this joint probability is equal to the product of the individual probabilities, the random variables are said to be independent: FX,Y(x, y) = FX(x)FY(y).
Eqs. (A3), (A4) define the mean and variance of a continuous random variable whose pdf is f. Some basic properties are

V(X + Y) = V(X) + V(Y) + 2Cov(X, Y) (A5)

Cov(X, Y) = ∬ (x − E(X))(y − E(Y)) fX,Y(x, y) dx dy (A6)

In the definition of the covariance (Eq. (A6)), fX,Y(x, y) is the joint pdf of the random variables. When they are independent, the joint pdf equals the product fX(x)fY(y) and the covariance is zero.
In general, E(XY) ≠ E(X)E(Y), except when the variables are independent, in which case the equality holds.
In applications in Analytical Chemistry, it is very frequent to use formulas to obtain the final measurement from other intermediate ones that have experimental variability. A strategy for the calculation of the uncertainty (variance) in the final result under two basic hypotheses has been developed: a linear approximation of the formula is made and the quadratic terms are then assimilated to the variance of the random variable at hand (see, e.g., the “Guide to the Expression of Uncertainty in Measurement”2). This procedure, called in many texts the method of transmission of errors, can lead to unacceptable results. Hence, an improvement based on Monte Carlo simulation has been suggested for the calculation of the uncertainty (see the Supplement 1 to the aforementioned guide).
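The Monte Carlo alternative of Supplement 1 can be sketched in a few lines (Python; the measurand c = m/V and all the numbers are hypothetical):

```python
import numpy as np

# Monte Carlo propagation of distributions, GUM Supplement 1 style:
# a concentration c = m/V from a mass m and a volume V, each with a
# (normal) uncertainty. All values are made up for illustration.
rng = np.random.default_rng(42)
N = 200_000
m = rng.normal(100.0, 0.5, N)    # mass in mg, u(m) = 0.5 mg
V = rng.normal(25.0, 0.1, N)     # volume in mL, u(V) = 0.1 mL
c = m / V                        # mg/mL, full nonlinear propagation

# Linear "transmission of errors" approximation, for comparison
u_lin = (100.0 / 25.0) * np.sqrt((0.5 / 100.0) ** 2 + (0.1 / 25.0) ** 2)

print(f"MC: mean = {c.mean():.4f}, u = {c.std(ddof=1):.4f}; linear u = {u_lin:.4f}")
```

For this nearly linear formula the two approaches agree closely; the guide's point is that for strongly nonlinear formulas, or results near natural limits, they can differ and the Monte Carlo result is the safer one.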
A useful representation of the data is the so-called box and whisker plot (or simply box plot). It consists of a box built with the first, Q1, and third, Q3, quartiles, that is, the 0.25 and 0.75 percentiles, so that the box contains half of the central data. The line in between is the median (Q2 or 0.5 percentile). Then, the whiskers extend on both sides of the box up to the maximum and minimum values, provided they are not further than 1.5 times the interquartile range Q3 − Q1.

Fig. A1 Box and whisker plots computed with A: data of method A in Fig. 2, and B: data of method A with an outlier.
Fig. A1 shows, first on the left, a box plot of the 100 values of method A in Fig. 2A. Two values appear as squares, “disconnected” at the bottom, meaning that these two values are more than 1.5 times the interquartile range below Q1.
The advantage of using box plots is that the quartiles are practically insensitive to outliers. For example, suppose that the maximum value, 7.86, is changed to 8.86; this change affects neither the median nor the quartiles, and the box plot remains similar but with a datum outside the upper whisker, as can be seen in the second box plot, on the right of Fig. A1.
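The insensitivity of the quartiles to a change in an extreme value can be checked numerically (Python/NumPy; illustrative data, not the 100 values of Fig. 2A):

```python
import numpy as np

# 100 illustrative values; quartiles define the box of a box plot
rng = np.random.default_rng(7)
x = rng.normal(7.0, 0.5, 100)
q1, q2, q3 = np.percentile(x, [25, 50, 75])

# Push the maximum up by 1, mimicking the change 7.86 -> 8.86 in the text
y = x.copy()
y[np.argmax(y)] += 1.0
r1, r2, r3 = np.percentile(y, [25, 50, 75])

# The box (Q1, Q3) and the median are exactly unchanged
assert (q1, q2, q3) == (r1, r2, r3)

# Whisker rule: points beyond 1.5 * IQR from the box are drawn individually
iqr = q3 - q1
outliers = y[(y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)]
print(len(outliers))
```

The bumped maximum now falls outside the upper whisker, while the box itself does not move.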
The normal distribution is a continuous random variable with E(N(μ, σ)) = μ and V(N(μ, σ)) = σ²; these two parameters completely define the distribution.
Particularly interesting is the N(0, 1), usually called Z, because any other normal distribution N(μ, σ) is transformed into a Z when standardized, that is, Z = (N(μ, σ) − μ)/σ.
The distribution function of a normal random variable does not have an analytical expression; hence it is necessary to use tables or somewhat complex formulas to calculate the probabilities. As any normal distribution can be transformed into a N(0, 1), it is customary to use only the table of this distribution. Table A1 contains some of its values that, in any case, cover the cases used in this article. For example, if z = 1.83, from the reading in row 1.8 and column 0.03, p = pr{N(0, 1) > 1.83} = 0.0336.
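The same lookup can be reproduced with statistical software; for instance, in Python's scipy.stats the survival function sf(z) returns pr{N(0, 1) > z} directly:

```python
from scipy.stats import norm

# Reproduce the Table A1 lookup for z = 1.83 (row 1.8, column 0.03)
p = norm.sf(1.83)        # upper-tail probability pr{N(0,1) > 1.83}
print(round(p, 4))       # 0.0336
```

The inverse lookup works as well: norm.isf(p) recovers z ≈ 1.83.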
The sum of n normal and independent random variables, ∑i=1n N(μi, σi), also follows a normal distribution, N(∑i=1n μi, √(∑i=1n σi²)).

Table A1 Values of p such that p = pr{N(0, 1) > z}, up to the first decimal of z in rows, second decimal in columns.

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
2.0 0.02275 0.02222 0.02169 0.02118 0.02068 0.02018 0.01970 0.01923 0.01876 0.01831

Student’s t Distribution

If X is a random variable N(μ, σ) and X1, X2, …, Xn are n random variables, independent and with the same distribution as X, then the random variable (X̄ − μ)/(σ/√n) is a N(0, 1), where X̄ denotes the random variable ∑i=1n Xi/n.
However, with the sample standard deviation S instead of σ, the statistic t = (X̄ − μ)/(S/√n) follows a t distribution with ν = n − 1 d.f. The mean and variance of a Student’s t distribution are, respectively, E(t) = 0 and V(t) = ν/(ν − 2), ν > 2. The general shape of its pdf is similar to that of the standard normal distribution: both are symmetrical around zero, unimodal, and defined on (−∞, ∞). However, the t distribution has heavier tails than the normal; that is, it exhibits greater variability. As the number of d.f. tends to infinity, the limiting distribution is the standard normal one. The family of t distributions depends on only one parameter, the degrees of freedom.
Table A2 contains some values of the t distribution. For example, if ν = 5, for α = 0.025, the value t = 2.571 in the table is the one such that 0.025 = pr{t5 > 2.571}. Compare with the value 1.96 in Table A1 that would correspond, in the same conditions, to a N(0, 1), that is, 0.025 = pr{N(0, 1) > 1.96}.
Because of the symmetry, 0.025 = pr{t5 < −2.571} also holds and, consequently, 0.95 = pr{−2.571 < t5 < 2.571}.
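These t and normal critical values can be checked with scipy.stats, whose isf gives the upper-tail quantile:

```python
from scipy.stats import norm, t

# Critical values for upper-tail probability 0.025
t_crit = t.isf(0.025, df=5)   # Table A2 value for nu = 5, approx. 2.571
z_crit = norm.isf(0.025)      # N(0,1) counterpart, approx. 1.96
print(round(t_crit, 3), round(z_crit, 2))

# By symmetry, the central interval (-t_crit, t_crit) carries probability 0.95
p = t.cdf(t_crit, df=5) - t.cdf(-t_crit, df=5)
```

The heavier tails of the t distribution show up directly: 2.571 > 1.96 for the same tail probability.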
Table A4 Upper percentage points F0.025,ν1,ν2 of the F distribution, such that 0.025 = pr{Fν1,ν2 > F0.025,ν1,ν2}; ν1 in columns, ν2 in rows.

ν2\ν1 1 2 3 4 5 6 7 8 9 10
1 647.79 799.50 864.16 899.58 921.85 937.11 948.22 956.66 963.28 968.63
2 38.506 39.000 39.165 39.248 39.298 39.331 39.355 39.373 39.387 39.398
3 17.443 16.044 15.439 15.101 14.885 14.735 14.624 14.540 14.473 14.419
4 12.218 10.649 9.979 9.605 9.365 9.197 9.074 8.980 8.905 8.844
5 10.007 8.434 7.764 7.388 7.146 6.978 6.853 6.757 6.681 6.619
6 8.813 7.260 6.599 6.227 5.988 5.820 5.696 5.600 5.523 5.461
7 8.073 6.542 5.890 5.523 5.285 5.119 4.995 4.899 4.823 4.761
8 7.571 6.060 5.416 5.053 4.817 4.652 4.529 4.433 4.357 4.295
9 7.209 5.715 5.078 4.718 4.484 4.320 4.197 4.102 4.026 3.964
10 6.937 5.456 4.826 4.468 4.236 4.072 3.950 3.855 3.779 3.717
11 6.724 5.256 4.630 4.275 4.044 3.881 3.759 3.664 3.588 3.526
12 6.554 5.096 4.474 4.121 3.891 3.728 3.607 3.512 3.436 3.374
13 6.414 4.965 4.347 3.996 3.767 3.604 3.483 3.388 3.312 3.250
14 6.298 4.857 4.242 3.892 3.663 3.501 3.380 3.285 3.209 3.147
15 6.200 4.765 4.153 3.804 3.576 3.415 3.293 3.199 3.123 3.060
The F Distribution
Let X1 and X2 be independent chi-square random variables with ν1 and ν2 d.f., respectively. Then, the ratio F = (X1/ν1)/(X2/ν2) is an F distribution with ν1 d.f. in the numerator and ν2 d.f. in the denominator. It is usually abbreviated as Fν1,ν2. The mean and variance of Fν1,ν2 are E(Fν1,ν2) = ν2/(ν2 − 2), ν2 > 2, and V(Fν1,ν2) = 2ν2²(ν1 + ν2 − 2)/[ν1(ν2 − 2)²(ν2 − 4)], ν2 > 4.
The F distribution is nonnegative and skewed to the right. Some percentage points of the F distribution are given in Table A4 for α = 0.025. For, say, ν1 = 5 and ν2 = 10, F0.025,5,10 = 4.24 is the value such that 0.025 = pr{F5,10 > 4.24}.
The lower percentage points can be found taking into account that F1−α,ν1,ν2 = 1/Fα,ν2,ν1. For example, to find F0.975,5,10 with Table A4, F0.975,5,10 = 1/F0.025,10,5 = 1/6.62 = 0.15. Therefore, 0.95 = pr{0.15 < F5,10 < 4.24}.
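Both the tabulated upper point and the reciprocal rule for the lower point can be verified with scipy.stats:

```python
from scipy.stats import f

# Upper 0.025 point of F(5, 10), as read in Table A4
f_hi = f.isf(0.025, 5, 10)        # approx. 4.24

# Lower point via the reciprocal relation F(1-a; v1, v2) = 1 / F(a; v2, v1)
f_lo = 1.0 / f.isf(0.025, 10, 5)  # approx. 0.15

print(round(f_hi, 2), round(f_lo, 2))
```

Together the two points bracket a central 95% interval for F5,10, as stated in the text.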
• The “weak law of large numbers” states that if X1, X2, …, Xn, … are independent and identically distributed random variables with finite mean μ, then (X1 + X2 + ⋯ + Xn)/n → μ in probability.
• If the random variables also have a finite variance (a weaker condition is also possible), then the “strong law of large numbers” holds, that is, (X1 + X2 + ⋯ + Xn)/n → μ almost surely.
• The “central limit theorem” states that for independent (or weakly correlated) random variables X1, X2, …, Xn with the same distribution, (X̄ − μ)/(σ/√n) → Z = N(0, 1) in distribution, where μ and σ² are the common mean and variance of the random variables Xn. This means that the distributional shape of the standardized mean is closer and closer to that of a standard normal random variable as n increases.
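The central limit theorem is easy to see by simulation, even starting from a strongly skewed variable (Python/NumPy sketch; the exponential distribution has mean and standard deviation 1, and the skewness of the mean of n of them is 2/√n):

```python
import numpy as np

# Standardized means of an Exponential(1) variable approach N(0,1) as n grows
rng = np.random.default_rng(3)
mu, sigma = 1.0, 1.0       # mean and standard deviation of Exponential(1)

skews = []
for n in (2, 30, 500):
    xbar = rng.exponential(1.0, size=(50_000, n)).mean(axis=1)
    z = (xbar - mu) / (sigma / np.sqrt(n))
    # sample skewness of z; 0 is the value for a normal distribution
    skew = np.mean(((z - z.mean()) / z.std()) ** 3)
    skews.append(skew)
    print(n, round(skew, 2))
```

The skewness shrinks toward the normal value 0 as n increases, which is the content of the theorem.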
The basic distributions (normal, Student’s t, F, chi-square) can be programmed with the algorithms in Abramowitz and Stegun.68
Appendices of the book by Meier and Zünd69 show the necessary numerical approximations and programs in BASIC for the same
distributions. To compute the noncentral F, the needed numerical approximation can be consulted in Johnson and Kotz,70 and
Evans et al.71
All the calculations in this article have been made with the Statistics Toolbox for MATLAB.72 What follows is a list of the basic commands used, which the reader can also find in the live scripts Appendix_1probDistr_live.mlx and Appendix_2power_live.mlx (MATLAB mlx-files) in the supplementary material.
Note that all the MATLAB commands referring to cumulative distribution functions, Eqs. (A8)–(A11), compute the cumulative probability α up to the corresponding value of the distribution. However, throughout the text and in Tables A1–A4, the calculated probability α is always the upper percentage point, that is, the probability above the corresponding value.
Normal distribution
α = pr{N(μ, σ) < zα} (A8)
• z = norminv(a, m, s)
• a = normcdf(z, m, s)
Student’s t distribution with ν d.f.
α = pr{tν < tα} (A9)
• t = tinv(a, v)
• a = tcdf(t, v)
Chi-square distribution with ν d.f.
α = pr{χν² < xα} (A10)
• x = chi2inv(a, v)
Example A5: α = 0.05, ν = 5; then chi2inv(0.05,5) gives x = 1.1455.
• a = chi2cdf(x, v)
F distribution with ν1 and ν2 d.f.
α = pr{Fν1,ν2 < xα} (A11)
• x = finv(a, v1, v2)
Example A7: α = 0.95, ν1 = 5, ν2 = 15; then finv(0.95,5,15) gives x = 2.9013.
• a = fcdf(x, v1, v2)
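For readers without MATLAB, the equivalent calls in Python's scipy.stats are ppf (lower-tail quantile, like the *inv commands) and cdf; the examples above can be verified as follows:

```python
from scipy.stats import norm, t, chi2, f

# scipy.stats counterparts of the MATLAB commands: *inv -> ppf, *cdf -> cdf,
# and sf gives the upper-tail probability used in Tables A1-A4.
assert abs(chi2.ppf(0.05, 5) - 1.1455) < 1e-4    # chi2inv(0.05,5), Example A5
assert abs(f.ppf(0.95, 5, 15) - 2.9013) < 1e-4   # finv(0.95,5,15), Example A7
assert abs(norm.ppf(0.975) - 1.96) < 1e-2        # norminv(0.975,0,1)
assert abs(t.ppf(0.975, 5) - 2.571) < 1e-3       # tinv(0.975,5)
print("all quantiles match the tables")
```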
References
1. EURACHEM/CITAC, Guide CG4. In Quantifying Uncertainty in Analytical Measurement, 2nd ed.; Ellison, S. L. R., Rosslein, M., Williams, A., Eds.; 2000. ISBN: 0-948926-15-5. Available from the Eurachem Secretariat. See http://www.eurochem.org.
2. Evaluation of Measurement Data, Supplement 1 to the ‘Guide to the Expression of Uncertainty in Measurement’—Propagation of Distributions Using a Monte Carlo Method; Joint
Committee for Guides in Metrology, 2008. JCGM 101.
3. Commission Decision (EC), No 2002/657/EC of 12 August 2002 Implementing Council Directive 96/23/EC Concerning the Performance of Analytical Methods and the
Interpretation of Results. Off. J. Eur. Commun. 2002, L221, 8–36.
4. Aldama, J. M. Practicum of Master in Advanced Chemistry; University of Burgos: Burgos, Spain, 2007.
5. Analytical Methods Committee, Robust Statistics-How Not to Reject Outliers, Part 1. Basic Concepts. Analyst 1989, 114, 1693–1697.
6. Analytical Methods Committee, Robust Statistics-How Not to Reject Outliers, Part 2. Inter-laboratory Trials. Analyst 1989, 114, 1699–1702.
7. ISO 5725, Accuracy (Trueness and Precision) of Measurement Methods and Results. Part 1: General Principles and Definitions; Part 2: Basic Method for the Determination of Repeatability and Reproducibility of a Standard Measurement Method; Part 3: Intermediate Measures of the Precision of a Standard Measurement Method; Part 4: Basic Methods for the Determination of the Trueness of a Standard Measurement Method; Part 5: Alternative Methods for the Determination of the Precision of a Standard Measurement Method; Part 6: Use in Practice of Accuracy Values. ISO: Genève, 1994.
8. Analytical Methods Committee, In Technical Brief No 4; Thompson, M., Ed.; 2006. www.rsc.org/amc/.
9. Silverman, B. W. Density Estimation for Statistics and Data Analysis; Chapman and Hall: London, Great Britain, 1986.
10. Wand, M. P.; Jones, M. C. Kernel Smoothing; Chapman and Hall: London, Great Britain, 1995.
11. Inczédy, J.; Lengyel, T.; Ure, A. M.; Gelencsér, A.; Hulanicki, A. Compendium of Analytical Nomenclature IUPAC, 3rd ed.; Port City Press Inc.: Baltimore, 2nd printing, 2000.
12. Lira, I.; Wöger, W. Comparison Between the Conventional and Bayesian Approaches to Evaluate Measurement Data. Metrologia 2006, 43, S249–S259.
13. Zech, G. Frequentist and Bayesian confidence intervals. Eur. Phys. J. Direct 2002, C12, 1–81.
14. Armstrong, N.; Hibbert, D. B. An Introduction to Bayesian Methods for Analyzing Chemistry Data, Part 1: An Introduction to Bayesian Theory and Methods. Chemom. Intel. Lab.
Syst. 2009, 97, 194–210.
15. Armstrong, N.; Hibbert, D. B. An Introduction to Bayesian Methods for Analyzing Chemistry Data, Part II: A Review of Applications of Bayesian Methods in Chemistry. Chemom.
Intel. Lab. Syst. 2009, 97, 211–220.
16. Sprent, P.; Smeeton, N. C. Applied Nonparametric Statistical Methods, 4th ed.; Chapman & Hall/CRC: Boca Raton, 2007.
17. Patel, J. K. Tolerance Limits. A Review. Commun. Stat. Theory Methods 1986, 15 (9), 2716–2762.
18. Meléndez, M. E.; Sarabia, L. A.; Ortiz, M. C. Distribution Free Methods to Model the Content of Biogenic Amines in Spanish Wines. Chemom. Intel. Lab. Syst. 2016, 155,
191–199.
19. Reguera, C.; Sanllorente, S.; Herrero, A.; Sarabia, L. A.; Ortiz, M. C. Study of the Effect of the Presence of Silver Nanoparticles on Migration of Bisphenol A From Polycarbonate
Glasses into Food Simulants. Chemom. Intel. Lab. Syst. 2018, 176, 66–73.
20. Wald, A.; Wolfowitz, J. Tolerance Limits for a Normal Distribution. Ann. Math. Stat. 1946, 17, 208–215.
21. Wilks, S. S. Determination of Sample Sizes for Setting Tolerance Limits. Ann. Math. Stat. 1941, 12, 91–96.
22. Kendall, M.; Stuart, A. The Advanced Theory of Statistics, Vol. 2: Inference and Relationship; Charles Griffin & Company Ltd.: London, 1979; pp 547–548, Section 32.11.
23. Willink, R. On using the Monte Carlo Method to Calculate Uncertainty Intervals. Metrologia 2006, 43, L39–L42.
24. Guttman, I. Statistical Tolerance Regions; Charles Griffin and Company: London, 1970.
25. Huber, P.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Chiap, P.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.; Lallier, M.; Laurentie, M.; Mercier, N.; Muzard, G.;
Nivet, C.; Valat, L. Harmonization of Strategies for the Validation of Quantitative Analytical Procedures. A SFSTP Proposal – Part I. J. Pharm. Biomed. Anal. 2004, 36, 579–586.
26. Huber, P.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Chiap, P.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.; Lallier, M.; Laurentie, M.; Mercier, N.; Muzard, G.;
Nivet, C.; Valat, L.; Rozet, E. Harmonization of Strategies for the Validation of Quantitative Analytical Procedures. A SFSTP Proposal—Part II. J. Pharm. Biomed. Anal. 2007, 45,
70–81.
27. Huber, P.; Nguyen-Huu, J. J.; Boulanger, B.; Chapuzet, E.; Cohen, N.; Compagnon, P. A.; Dewé, W.; Feinberg, M.; Laurentie, M.; Mercier, N.; Muzard, G.; Valat, L.; Rozet, E.
Harmonization of Strategies for the Validation of Quantitative Analytical Procedures. A SFSTP Proposal—Part III. J. Pharm. Biomed. Anal. 2007, 45, 82–86.
28. Feinberg, M. Validation of Analytical Methods Based on Accuracy Profiles. J. Chromatogr. A 2007, 1158, 174–183.
29. Rozet, E.; Hubert, C.; Ceccato, A.; Dewé, W.; Ziemons, E.; Moonen, F.; Michail, K.; Wintersteiger, R.; Streel, B.; Boulanger, B.; Hubert, P. Using Tolerance Intervals in Pre-Study
Validation of Analytical Methods to Predict In-Study Results. The Fit-for-Future-Purpose Concept. J. Chromatogr. A 2007, 1158, 126–137.
30. Rozet, E.; Ceccato, A.; Hubert, C.; Ziemons, E.; Oprean, R.; Rudaz, S.; Boulanger, B.; Hubert, P. Analysis of Recent Pharmaceutical Regulatory Documents on Analytical Method
Validation. J. Chromatogr. A 2007, 1158, 111–125.
31. Dewé, W.; Govaerts, B.; Boulanger, B.; Rozet, E.; Chiap, P.; Hubert, P. Using Total Error as Decision Criterion in Analytical Method Transfer. Chemom. Intel. Lab. Syst. 2007, 85,
262–268.
32. González, A. G.; Herrador, M. A. Accuracy Profiles from Uncertainty Measurements. Talanta 2006, 70, 896–901.
33. Rebafka, T.; Clémençon, S.; Feinberg, M. Bootstrap-Based Tolerance Intervals for Application to Method Validation. Chemom. Intel. Lab. Syst. 2007, 89, 69–81.
34. Fernholz, L. T.; Gillespie, J. A. Content-Correct Tolerance Limits Based on the Bootstrap. Technometrics 2001, 43 (2), 147–155.
35. Cowen, S.; Ellison, S. L. R. Reporting Measurement Uncertainty and Coverage Intervals Near Natural Limits. Analyst 2006, 131, 710–717.
36. Schouten, H. J. A. Sample Size Formulae with a Continuous Outcome for Unequal Group Sizes and Unequal Variances. Stat. Med. 1999, 18, 87–91.
37. Lehmann, E. L. Testing Statistical Hypothesis; Wiley & Sons: New York, 1959.
38. Schuirmann, D. J. A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability. J. Pharmacokinet.
Biopharm. 1987, 15, 657–680.
39. Mehring, G. H. On Optimal Tests for General Interval Hypothesis. Commun. Stat. Theory Methods 1993, 22 (5), 1257–1297.
40. Brown, L. D.; Hwang, J. T. G.; Munk, A. An Unbiased Test for the Bioequivalence Problem. Ann. Stat. 1998, 25, 2345–2367.
41. Munk, A.; Hwang, J. T. G.; Brown, L. D. Testing Average Equivalence. Finding a Compromise Between Theory and Practice. Biom. J. 2000, 42 (5), 531–552.
42. Hartmann, C.; Smeyers-Verbeke, J.; Penninckx, W.; Vander Heyden, Y.; Vankeerberghen, P.; Massart, D. L. Reappraisal of Hypothesis Testing for Method Validation: Detection of
Systematic Error by Comparing the Means of Two Methods or of Two Laboratories. Anal. Chem. 1995, 67, 4491–4499.
43. Limentani, G. B.; Ringo, M. C.; Ye, F.; Bergquist, M. L.; McSorley, E. O. Beyond the t-Test. Statistical Equivalence Testing. Anal. Chem. 2005, 77, 221A–226A.
44. Kuttatharmmakull, S.; Massart, D. L.; Smeyers-Verbeke, J. Comparison of Alternative Measurement Methods: Determination of the Minimal Number of Measurements Required
for the Evaluation of the Bias by Means of Interval Hypothesis Testing. Chemom. Intel. Lab. Syst. 2000, 52, 61–73.
45. Martín Andrés, A.; Luna del Castillo, J. D. Bioestadística para las ciencias de la salud; Ediciones Norma-Capitel: Madrid, 2004.
46. Wellek, S. Testing Statistical Hypotheses of Equivalence; Chapman & May/CRC Press LLC: Boca Raton, FL, 2003.
47. Ortiz, M. C.; Herrero, A.; Sanllorente, S.; Reguera, C. The Quality of the Information Contained in Chemical Measures; Servicio de Publicaciones Universidad de Burgos: Burgos,
2005. (Electronic Book).
48. D’Agostino, R. B., Stephens, M. A., Eds.; In Goodness-of-Fit Techniques; Marcel Dekker Inc.: New York, 1986.
49. Moreno, E.; Girón, F. J. On the Frequentist and Bayesian Approaches to Hypothesis Testing (with discussion). Stat. Oper. Res. Trans. 2006, 30 (1), 3–28.
50. Scheffé, H. The Analysis of Variance; Wiley & Sons: New York, 1959.
51. Anderson, V. L.; MacLean, R. A. Design of Experiments. A Realistic Approach; Marcel Dekker Inc.: New York, 1974.
52. Milliken, G. A.; Johnson, D. E. Analysis of Messy Data: Designed Experiments; Wadsworth Publishing Co.: Belmont, NJ, 1984; vol. I.
53. Searle, S. R. Linear Models; Wiley & Sons, Inc.: New York, 1971.
54. Youden, W. J. Statistical Techniques for Collaborative Tests; Association of Official Analytical Chemists: Washington, DC, 1972.
55. Kateman, G.; Pijpers, F. W. Quality Control in Analytical Chemistry; Wiley & Sons: New York, 1981.
56. Kuttatharmmakull, S.; Massart, D. L.; Smeyers-Verbeke, J. Comparison of Alternative Measurement Methods. Anal. Chim. Acta 1999, 391, 203–225.
57. Hampel, F. R.; Ronchetti, E. M.; Rousseeuw, P. J.; Stahel, W. A. Robust Statistics. The Approach Based on Influence Functions; Wiley-Interscience: Zurich, 1985.
58. Huber, P. J. Robust Statistics; Wiley & Sons: New York, 1981.
59. Thompson, M.; Wood, R. J. Assoc. Off. Anal. Chem. Int. 1993, 76, 926–940.
60. Sanz, M. B.; Ortiz, M. C.; Herrero, A.; Sarabia, L. A. Robust and Non Parametric Statistic in the Validation of Chemical Analysis Methods. Quím. Anal. 1999, 18, 91–97.
61. García, I.; Sarabia, L.; Ortiz, M. C.; Aldama, J. M. Usefulness of D-optimal Designs and Multicriteria Optimization in Laborious Analytical Procedures. Application to the Extraction
of Quinolones From Eggs. J. Chromatogr. A 2005, 1085, 190–198.
62. García, I.; Sarabia, L. A.; Ortiz, M. C.; Aldama, J. M. Robustness of the Extraction Step When Parallel Factor Analysis (PARAFAC) is Used to Quantify Sulfonamides in Kidney by
High Performance Liquid Chromatography-Diode Array Detection (HPLC-DAD). Analyst 2004, 129 (8), 766–771.
63. Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; de Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of Chemometrics and Qualimetrics: Part A; Elsevier:
Amsterdam, 1997.
64. Herrero, A.; Reguera, C.; Ortiz, M. C.; Sarabia, L. A.; Sánchez, M. S. Ad-Hoc Blocked Design for the Robustness Study in the Determination of Dichlobenil and 2,6-
Dichlorobenzamide in Onions by Programmed Temperature Vaporization-Gas Chromatography–Mass Spectrometry. J. Chromatogr. A 2014, 1370, 187–199.
65. Arce, M. M.; Sanllorente, S.; Ortiz, M. C.; Sarabia, L. A. Easy-To-Use Procedure to Optimise a Chromatographic Method. Application in the Determination of Bisphenol-A and
Phenol in Toys by Means of Liquid Chromatography with Fluorescence Detection. J. Chromatogr. A 2018, 1534, 93–100.
66. Oca, M. L.; Rubio, L.; Ortiz, M. C.; Sarabia, L. A.; García, I. Robustness Testing in the Determination of Seven Drugs in Animal Muscle by Liquid Chromatography–Tandem Mass
Spectrometry. Chemom. Intel. Lab. Syst. 2016, 151, 172–180.
67. Rodríguez, N.; Ortiz, M. C.; Sarabia, L. A. Study of Robustness Based on N-Way Models in the Spectrofluorimetric Determination of Tetracyclines in Milk When Quenching Exists.
Anal. Chim. Acta 2009, 651, 149–158.
68. Abramowitz, M.; Stegun, I. A. Handbook of Mathematical Functions; Government Printing Office, 1964.
69. Meier, P. C.; Zünd, R. E. Statistical Methods in Analytical Chemistry, 2nd ed.; Wiley & Sons: New York, 2000.
70. Johnson, N.; Kotz, S. Distributions in Statistics: Continuous Univariate Distributions—2; Wiley & Sons: New York, 1970; p 191, Equation (5).
71. Evans, M.; Hastings, N.; Peacock, B. Statistical Distributions, 2nd ed.; Wiley & Sons: New York, 1993; 73–74.
72. The MathWorks, Inc. Statistics and Machine Learning Toolbox for Use with MATLAB®, Version 11.4 (R2018b); The MathWorks, Inc., 2018.