Evaluation of Proficiency Test Data by Different Statistical Methods Comparison
Pedro Rosario, José Luis Martínez and José Miguel Silván
Sinaia, România
11th − 13th October, 2007
Abstract
Key words
1 INTRODUCTION
2 EXPERIMENTAL
2.1 Materials
The test samples consisted of ten coded pieces of gold alloy of approximately 585 ‰ fineness, each one coming from the same ingot.
2.2 Procedures
The laboratories carried out the tests following the instructions given by the organizer and processed the samples as routine ones. They were asked to test at least eight samples.
This programme was designed to evaluate the quality of the tests and the technical competence of the participating laboratories [5].
Participant identities were kept secret, thus guaranteeing data confidentiality: each participant was assigned a laboratory identification code number in order to keep the relevant information confidential. The coordinator had the right to exclude from the programme any laboratory that unjustifiably failed to meet the terms for submitting results.
2.2.1 Homogeneity
Although the variability of the production run of the ingots was not exactly known, the degree of homogeneity of the test samples was determined and checked using statistical criteria based on analysis of variance (ISO 13528 [2] and IUPAC [4]).
Once this verification had been performed successfully, the suitability of the samples was considered satisfactory for the purpose of this proficiency programme: the test samples to be delivered were both sufficiently homogeneous and stable.
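An ISO 13528-style homogeneity check on duplicate measurements can be sketched as follows. The duplicate values and the standard deviation for proficiency assessment (sigma) used below are purely illustrative assumptions, not the actual programme data:

```python
import numpy as np

# Illustrative duplicate results (fineness, per mil) for five test samples;
# the real data from this PT round are not reproduced here.
duplicates = np.array([
    [585.2, 585.4],
    [585.1, 585.3],
    [585.3, 585.2],
    [585.4, 585.5],
    [585.2, 585.1],
])

sample_means = duplicates.mean(axis=1)
ranges = np.abs(duplicates[:, 0] - duplicates[:, 1])

# Within-sample standard deviation estimated from the duplicate ranges
s_w = np.sqrt(np.mean(ranges ** 2) / 2.0)

# Between-sample standard deviation (ANOVA-type estimate, clipped at zero)
s_x = sample_means.std(ddof=1)
s_s = np.sqrt(max(s_x ** 2 - s_w ** 2 / 2.0, 0.0))

# ISO 13528 criterion: samples adequately homogeneous if s_s <= 0.3 * sigma,
# where sigma is the standard deviation for proficiency assessment
# (the value below is an assumption for illustration).
sigma = 1.0
print(s_s, s_s <= 0.3 * sigma)
```

With these invented duplicates the between-sample component is well below the 0.3·sigma limit, which is the situation described in the text.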
The statistical handling was planned in compliance with ISO 13528 [2], ISO Guide 43 [6] and the ISO 5725 standards [7], with a number of sequential steps:
The Cochran outlier test [7] is used to check the assumption that only small differences exist between the within-laboratory variances. The statistic C is calculated from the maximum variance of all results and the sum of all variances:
C = S_max² / Σ_{i=1..p} S_i²   (1)

where:
p = total number of standard deviations (one per laboratory)
S_i = standard deviation of laboratory i
S_max = maximum standard deviation of all results
The result of this test was that there was one outlier (Lab. 31).
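Equation (1) can be sketched in a few lines. The within-laboratory standard deviations and the critical value C_crit below are illustrative assumptions; the real critical value depends on the number of laboratories and replicates and is read from the Cochran tables:

```python
import numpy as np

# Illustrative within-laboratory standard deviations (one per laboratory);
# the real PT data are not reproduced here.
s = np.array([0.12, 0.15, 0.10, 0.45, 0.13, 0.11, 0.14])

variances = s ** 2
C = variances.max() / variances.sum()   # equation (1)

# Critical value from the Cochran tables for p laboratories and n replicates
# (the value below is illustrative only).
C_crit = 0.480
print(C, C > C_crit)  # C > C_crit flags the laboratory with S_max as an outlier
```

With this invented data set the fourth laboratory dominates the sum of variances and would be flagged, mirroring the single outlier reported above.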
For this aim, Grubbs' statistics are calculated for the largest value (Gh) and the smallest value (Gl) as follows:
G_h = (x_h − x̄) / s   (2)

G_l = (x̄ − x_l) / s   (3)
The values obtained are compared with the respective critical values from the Grubbs' test tables.
The result of this test was that there was one outlier (Lab. 31).
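The Grubbs statistics of equations (2) and (3) can be sketched likewise; the laboratory means and the critical value below are again illustrative assumptions, not the programme data:

```python
import numpy as np

# Illustrative laboratory means (fineness, per mil); real data not reproduced.
x = np.array([585.1, 585.3, 585.2, 585.4, 583.8, 585.2, 585.3])

mean = x.mean()
s = x.std(ddof=1)

G_h = (x.max() - mean) / s   # equation (2), largest value
G_l = (mean - x.min()) / s   # equation (3), smallest value

# Critical value from the Grubbs tables for p = len(x) observations
# (the value below is illustrative only).
G_crit = 2.020
print(G_h > G_crit, G_l > G_crit)
```

Here only the low value 583.8 exceeds the assumed critical value, i.e. a single outlier on the low side, as in the result reported above.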
In order to achieve more confidence in dealing with outliers, the double Grubbs' test [7] was used to determine whether the two largest or the two smallest values might be outliers. In our statistical study we evaluated both the two largest and the two smallest values from the participating laboratories.
To determine the assigned value, this programme took into account the following criteria:
• The value obtained from the robust mean of all laboratory results, without previous exclusion (ISO 13528) [2]
Z_i = (X_i − Vc) / σ   (4)

where X_i is the result reported by laboratory i, Vc is the assigned value and σ is the standard deviation for proficiency assessment.
Note: a single "action signal", or "warning signals" in two successive rounds, shall be taken as evidence that an anomaly has occurred and requires investigation.
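As an illustration of equation (4) and of the usual interpretation (|Z| ≤ 2 acceptable, 2 < |Z| < 3 warning signal, |Z| ≥ 3 action signal), the sketch below uses assumed values for the assigned value, σ and the laboratory results:

```python
# Assigned value Vc and standard deviation sigma are assumptions for
# illustration, not the values actually used in the programme.
Vc = 585.2
sigma = 0.30
results = {"Lab05": 585.3, "Lab14": 584.1, "Lab22": 585.9, "Lab31": 583.6}

for lab, x in results.items():
    z = (x - Vc) / sigma          # equation (4)
    if abs(z) <= 2:
        signal = "acceptable"
    elif abs(z) < 3:
        signal = "warning signal"
    else:
        signal = "action signal"
    print(lab, round(z, 2), signal)
```

With these invented inputs one laboratory earns a warning signal and two earn action signals, which is the kind of pattern discussed later in the comparison of protocols.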
2.3 Participants
The results of the participating laboratories are shown in Table 1. The Z-score values are given in the corresponding tables and figures, together with the general statistical parameters of this programme.
3 RESULTS AND DISCUSSION
This section presents the wider range of possibilities that the PT-scheme provider might consider when evaluating the data submitted by the participants; a number of cases with different approaches were examined to cover these statistical principles [8].
On the whole, in order to assess each laboratory's performance by interpreting Z-score values, the results were compared using four methods: the traditional approach according to ISO 5725, including outlier detection; two robust statistical methods, based respectively on the median and NIQR and on the Huber-type algorithms detailed in ISO 13528; and finally a practical approach that sets a specified target value according to a fit-for-purpose criterion.
The values used in the expression of the Z-score (assigned value and standard deviation) were calculated as follows in each of the four statistical approaches:
1. ISO 5725: general mean and reproducibility standard deviation, with outlier
detection [7]
2. Median and NIQR method: median of the whole data and normalized
interquartile range [9]
3. ISO 13528: robust average and robust standard deviation calculated
according to algorithms A and S, without outlier detection [10]
4. Fit-for-purpose criterion: robust average and a target reproducibility
standard deviation value according to a fixed %RSD from appropriate past
PT-rounds at this level of concentration [11]
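The four estimators of the assigned value and standard deviation listed above can be sketched as follows. The data, the simplified Algorithm A iteration and the target %RSD are assumptions for illustration only:

```python
import numpy as np

# Illustrative laboratory means; the real PT data are not reproduced here.
x = np.array([585.1, 585.3, 585.2, 585.4, 583.6, 585.2, 585.3, 585.0, 585.5])

# 1. Classical (ISO 5725 style): mean and SD; the Cochran/Grubbs outlier
#    removal step is omitted here for brevity.
classical = (x.mean(), x.std(ddof=1))

# 2. Median and NIQR: NIQR = 0.7413 * interquartile range.
q1, q3 = np.percentile(x, [25, 75])
median_niqr = (np.median(x), 0.7413 * (q3 - q1))

# 3. ISO 13528 Algorithm A (simplified iteration): robust mean and SD.
x_star = np.median(x)
s_star = 1.483 * np.median(np.abs(x - x_star))
for _ in range(20):
    delta = 1.5 * s_star
    w = np.clip(x, x_star - delta, x_star + delta)  # winsorize extremes
    x_star = w.mean()
    s_star = 1.134 * w.std(ddof=1)
robust = (x_star, s_star)

# 4. Fit-for-purpose: robust average with a target SD fixed from a %RSD
#    chosen by the provider (0.05 %RSD here is an assumption).
fit_for_purpose = (x_star, 0.0005 * x_star)

for name, (v, s) in [("ISO 5725", classical), ("median&NIQR", median_niqr),
                     ("ISO 13528", robust), ("fit-for-purpose", fit_for_purpose)]:
    print(f"{name:15s} assigned={v:.2f} sigma={s:.3f}")
```

Note how the low value 583.6 inflates the classical standard deviation, while the median & NIQR and Algorithm A estimates stay close together; this is precisely the behaviour of the Z-scores compared in Table 3.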
As a result, in view of the laboratory performances under each of the four protocols considered (Table 3), it can be stated that fourteen participants show Z-score values considered acceptable (|Z| ≤ 2) regardless of the statistical method applied. The reason for this behaviour is that these laboratories provide the most balanced results, with smaller deviations in the data spread owing to common analytical techniques, so the statistical protocol has no relevant influence on their performance.
Furthermore, in terms of the distribution of the Z-score values corresponding to the other five laboratories, a certain trend in the spread is revealed. Thus, according to ISO 5725, the Z-score values comply with the acceptance criteria, particularly because one laboratory's data were rejected as an outlier.
Table 3. Summary of the overall Z-score results obtained by the participant laboratories under the different statistical protocols
On the other hand, in order to avoid the influence of extreme results, the application of robust statistical methods [9], [10] brings about significantly larger Z-score values
since no outlier elimination is applied. In this line, when the calculation is performed using the median & NIQR method, the Z-score values are slightly smaller than the corresponding ones estimated with the robust method based on ISO 13528. In this case, one laboratory is given a warning signal, whereas four laboratories show Z-score values that give action signals, so special investigation is required, including for the laboratory previously identified as an outlier in the parametric approach.
[Figure: Z-score values (vertical axis, from 2.00 down to −6.00) per participant laboratory, plotted for the four protocols ISO 5725, median & NIQR, ISO 13528 and fit-for-purpose. Participant order: Lab31, Lab06, Lab14, Lab08, Lab13, Lab27, Lab22, Lab05, Lab03, Lab20, Lab15, Lab21, Lab29, Lab12, Lab10, Lab07, Lab16, Lab24, Lab30.]
4 CONCLUSIONS
On balance, proficiency testing participant data are usually heavy-tailed and the presence of outliers is very common. This can lead to overestimation of the standard deviation, under which more Z-score values are considered satisfactory, unless robust statistical methods are applied, since they are less sensitive to anomalies.
Such robust protocols are particularly applicable to data that look normally distributed, unimodal and roughly symmetric, with no more than 10% of outliers, apart from cases in which it is assumed that not all participants have the same analytical performance. However, the median & NIQR method is observed to be more robust for asymmetric data, whereas in the case of multimodal or skewed data distributions the application of mixture models and kernel density functions should be considered.
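Where a multimodal distribution is suspected, a kernel density estimate offers a quick check. The sketch below uses a plain Gaussian kernel with Silverman's rule-of-thumb bandwidth on invented bimodal data, not the programme results:

```python
import numpy as np

# Illustrative bimodal laboratory results; real data not reproduced.
x = np.array([584.6, 584.7, 584.7, 584.8, 585.3, 585.4, 585.4, 585.5, 585.6])

# Silverman's rule-of-thumb bandwidth for a Gaussian kernel
n = len(x)
h = 1.06 * x.std(ddof=1) * n ** (-1 / 5)

def kde(t):
    # average of Gaussian kernels centred on each observation
    return np.exp(-0.5 * ((t - x[:, None]) / h) ** 2).sum(axis=0) / (n * h * np.sqrt(2 * np.pi))

grid = np.linspace(584.2, 586.0, 200)
density = kde(grid)
# two separated humps in `density` would argue against summarising the
# round with a single robust location estimate
print(grid[np.argmax(density)])
```

A plot of `density` against `grid` (or simply counting its local maxima) reveals whether one mode or two are present before a robust mean is trusted.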
In this report, the application of both classical and robust statistical methods to proficiency test data clearly shows that the mean values are quite similar, whereas significant differences in the standard deviation values have been found, in some cases too large to fit the objective of this interlaboratory programme. Furthermore, it is quite important to obtain an appropriate estimate of the overall standard deviation parameter, one that not only describes the analytical method in terms of precision but also provides a performance assessment compatible with the intercomparison requirements.
Finally, in this line, the application of a fit-for-purpose criterion should reflect the end-user requirement and must be consistent from round to round, so that scores in successive rounds are comparable. The specification of a target value in terms of relative standard deviation represents a quality goal that the data should meet to reflect fitness for purpose, rather than a simple description of the data.
REFERENCES