Chemometrics
$\sum_{i=1}^{N} x_i = x_1 + x_2 + x_3 + \cdots + x_N$
The Mean
(average)
The mean is a measure of the
centrality of a set of data.
Mean (arithmetical)
$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
Root mean square (RMS)
$x_{rms} = \sqrt{\frac{x_1^2 + x_2^2 + x_3^2 + \cdots + x_N^2}{N}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2}$
For the data set:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10:
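As a quick check of the mean and RMS definitions, here is a minimal Python sketch that evaluates both for this data set. The function names `mean` and `rms` are illustrative and do not come from the slides.

```python
import math

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def mean(values):
    """Arithmetic mean: sum of the values divided by N."""
    return sum(values) / len(values)

def rms(values):
    """Root mean square: square root of the mean of the squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))

print(mean(data))           # 5.5
print(round(rms(data), 3))  # 6.205
```

The RMS is larger than the mean here because squaring gives the larger values more weight.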
Mode
Midrange
The Median
The median is the value for which half of the remaining
values are above and half are below it. I.e., in an ordered
array of 15 values, the 8th value is the median. If the
array has 16 values, the median is the mean of the 8th
and 9th values.
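The odd and even cases can be sketched in a few lines of Python. The helper name `median` and the two test arrays are illustrative assumptions; the arrays are simply the integers 1 to 15 and 1 to 16.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if N is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd N: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even N: mean of the two middle values

fifteen = list(range(1, 16))  # 15 ordered values
sixteen = list(range(1, 17))  # 16 ordered values

print(median(fifteen))  # 8   (the 8th value)
print(median(sixteen))  # 8.5 (mean of the 8th and 9th values)
```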
Measuring variance
$V = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$
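The Variance
$V = \frac{1}{N}\sum x_i^2 - \frac{1}{N}\sum 2 x_i \bar{x} + \frac{1}{N}\sum \bar{x}^2$
$= \frac{1}{N}\sum x_i^2 - 2\bar{x}\,\frac{1}{N}\sum x_i + \bar{x}^2\,\frac{1}{N}\sum (1)$
$= \overline{x^2} - 2\bar{x}^2 + \bar{x}^2$
$= \overline{x^2} - \bar{x}^2$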
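The identity "mean of the squares minus the square of the mean" can be checked numerically. The sketch below is only an illustration; the function names are made up and the data set is the integers 1 to 10.

```python
def variance(values):
    """Variance from the definition: mean squared deviation from the mean (divisor N)."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / n

def variance_shortcut(values):
    """Variance as (mean of the squares) minus (square of the mean)."""
    n = len(values)
    mean_of_squares = sum(v * v for v in values) / n
    square_of_mean = (sum(values) / n) ** 2
    return mean_of_squares - square_of_mean

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(variance(data))           # 8.25
print(variance_shortcut(data))  # 8.25
```

With floating-point data the shortcut form can lose precision when the mean is large compared with the spread, so the definition form is usually the safer one to compute.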
The Variance
Is that a problem?
The Standard Deviation
$\sigma = \sqrt{V} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2}$
The Standard Deviation
1 1
N
x x
N
(x x) 2
The Standard Deviation
Is that a problem?
The Coefficient of
Variation*
$CV = \frac{\sigma}{\bar{x}} \times 100$
* Sometimes called the Relative Standard Deviation (RSD
or %RSD)
Standard Deviation (or
Error) of the Mean
$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}}$
The standard deviation of an average decreases by the
reciprocal of the square root of the number of data
points used to calculate the average.
Exercises
$\frac{1}{2} = 0.5 = \frac{1}{\sqrt{N}}$
$\sqrt{N} = \frac{1}{0.5} = 2$
$N = 2^2 = 4$ (quadruplicate)
Exercises
$\frac{1}{10} = 0.1 = \frac{1}{\sqrt{N}}$
$\sqrt{N} = \frac{1}{0.1} = 10$
$N = 10^2 = 100$ times!
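The two exercises can be verified with a small Python sketch of the standard error of the mean. The function name `standard_error_of_mean` and the unit standard deviation are assumptions made for the illustration.

```python
import math

def standard_error_of_mean(sd, n):
    """Standard deviation of an average of n data points with standard deviation sd."""
    return sd / math.sqrt(n)

sd = 1.0  # hypothetical standard deviation of a single measurement

print(standard_error_of_mean(sd, 4))    # 0.5: four replicates halve the SD
print(standard_error_of_mean(sd, 100))  # 0.1: a tenfold reduction needs 100 replicates
```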
Population vs. Sample
standard deviation
When we speak of a population, we’re referring to the
entire data set, which will have a mean $\mu$:
Population mean $\mu = \frac{1}{N}\sum_i x_i$
When we speak of a sample, we’re referring to a subset
of the population, whose mean is customarily designated “x-bar” ($\bar{x}$).
Which is used to calculate the standard deviation?
Population vs. Sample
standard deviation
$\sigma = \sqrt{\frac{1}{N}\sum_i (x_i - \mu)^2}$
$s = \sqrt{\frac{1}{N-1}\sum_i (x_i - \bar{x})^2}$
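The difference between the two divisors is easy to see numerically. The sketch below uses Python's standard `statistics` module, where `pstdev` uses the divisor N and `stdev` uses N - 1; the data values are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

sigma = statistics.pstdev(data)  # population form: divisor N
s = statistics.stdev(data)       # sample form: divisor N - 1

print(round(sigma, 3))  # 2.0
print(round(s, 3))      # 2.138
```

The sample form is always slightly larger, and the gap shrinks as N grows.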
Distributions
Definition
Statistical (probability)
Distribution
A statistical distribution is a mathematically-derived
probability function that can be used to predict the
characteristics of certain applicable real populations
Statistical methods based on probability distributions
are parametric, since certain assumptions are made
about the data
Distributions
Definition
Examples
Binomial distribution
$P(r; p, n) = \frac{n!}{r!(n-r)!}\, p^r (1-p)^{n-r}$
Example
$P(10; 0.5, 12) = \frac{12!}{10!(12-10)!}\,(0.5)^{10} (1-0.5)^{12-10}$
$= 0.016$ or 1.6%
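The example can be reproduced with a short Python sketch. The function name `binomial_probability` is illustrative; `math.comb` supplies the n!/(r!(n-r)!) term.

```python
import math

def binomial_probability(r, p, n):
    """Probability of exactly r successes in n trials, each with success probability p."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

prob = binomial_probability(10, 0.5, 12)
print(round(prob, 3))  # 0.016, i.e. about 1.6%
```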
Distributions
Definition
Examples
Binomial
“God does arithmetic”
KARL FRIEDRICH GAUSS (1777-1855)
The Gaussian
Distribution
Number (1-100)   Number (1-100)   Sum (2-200)
      63               22              85
      81               73             152
      36               54              90
      12               33              45
      28               99             127
       7                5              12
      79               61             140
      52               28              70
      96               58             154
      17               24              41
      22               16              38
       4               77              81
      61               43             104
      85                8              93
. . . etc.
[Figure: frequency (F) of each sum over the range 2 to 200; probability vs. x]
The Gaussian Probability
Function
The probability of x in a Gaussian distribution with mean $\mu$
and standard deviation $\sigma$ is given by:
$P(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}$
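A minimal Python sketch of this probability function follows. The name `gaussian_pdf` and the example arguments are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density at x for mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma**2))

# The density peaks at the mean and is symmetric about it.
print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))  # 0.3989
print(gaussian_pdf(1.0, 0.0, 1.0) == gaussian_pdf(-1.0, 0.0, 1.0))  # True
```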
The Gaussian
Distribution
What is the Gaussian distribution?
What types of data fit a Gaussian distribution?
“Like the ski resort full of
girls hunting for husbands and husbands
hunting for girls, the
situation is not as symmetrical as it might seem.”
Human height
Outside temperature
Raindrop size
Blood glucose concentration
Serum activity
QC results
Proficiency results
The Gaussian
Distribution
What is the Gaussian distribution?
What types of data fit a Gaussian distribution?
What is the advantage of using a Gaussian
distribution?
Gaussian probability
distribution
[Figure: Gaussian probability curve, with regions containing 0.67 and 0.95 of the probability marked]
Definition
Examples
Binomial
Gaussian
The Poisson Distribution
$P(r; \mu) = \frac{\mu^r e^{-\mu}}{r!}$
Examples of events described
by a Poisson distribution
?
Lightning
Accidents
Laboratory?
A very useful property of
the Poisson distribution
$V(r) = \mu$
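Using the Poisson
distribution
Since the variance equals $\mu$, $\sigma = \sqrt{\mu}$ and
$CV = \frac{\sigma}{\bar{x}}(100) = \frac{\sqrt{\mu}}{\mu}(100) = \frac{100}{\sqrt{\mu}}$
For a CV of 5%, $\frac{1}{\sqrt{\mu}} = 0.05$, so $\mu = 400$ counts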
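The relation between the number of counts and the CV can be sketched in Python. Both function names are hypothetical; the only assumption is the Poisson property that the variance equals the mean.

```python
import math

def poisson_cv_percent(mean_counts):
    """CV (%) of a Poisson-distributed count: 100 * sqrt(mean) / mean."""
    return 100.0 * math.sqrt(mean_counts) / mean_counts

def counts_for_cv(cv_percent):
    """Mean number of counts needed to reach a target CV (%)."""
    return (100.0 / cv_percent) ** 2

print(poisson_cv_percent(400))  # 5.0
print(counts_for_cv(5))         # 400.0
```

Halving the CV therefore requires four times as many counts.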
Distributions
Definition
Examples
Binomial
Gaussian
Poisson
The Student’s t
Distribution
When a small sample is selected from a large population,
we sometimes have to make certain assumptions in
order to apply statistical methods
Questions about our sample
Is the mean of our sample, x bar, the same as the mean of the
population, $\mu$?
Is the standard deviation of our sample, s, the same as the
standard deviation for the population, $\sigma$?
Unless we can answer both of these questions affirmatively, we
don’t know whether our sample has the same distribution as
the population from which it was drawn.
Recall that the Gaussian distribution is defined by the
probability function:
$P(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}$
Note that the exponential factor contains both $\mu$ and $\sigma$, both
population parameters. The factor is often simplified by
making the substitution:
$z = \frac{(x - \mu)}{\sigma}$
The variable z in this equation is distributed according to a unit Gaussian, since it
has a mean of zero and a standard deviation of 1
Gaussian probability
distribution
[Figure: probability vs. z from -3 to 3, with regions containing 0.67 and 0.95 of the probability marked]
But if we use the sample mean and standard
deviation instead, we get:
$t = \frac{(x - \bar{x})}{s}$
and we’ve defined a new quantity, t, which is not
distributed according to the unit Gaussian. It is
distributed according to the Student’s t
distribution.
Important features of the
Student’s t distribution
Use of the t statistic assumes that the parent
distribution is Gaussian
The degree to which the t distribution approximates a
gaussian distribution depends on N (the degrees of
freedom)
As N gets larger (above 30 or so), the differences
between t and z become negligible
Application of Student’s t
distribution to a sample mean
The Student’s t statistic can also be used to analyze
differences between the sample mean and the
population mean:
$t = \frac{(\bar{x} - \mu)}{s / \sqrt{N}}$
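A minimal Python sketch of this statistic follows. The function name `one_sample_t` and all of the numbers in the example are hypothetical; they are not taken from the slides.

```python
import math

def one_sample_t(sample_mean, population_mean, s, n):
    """t for the difference between a sample mean and the population mean."""
    return (sample_mean - population_mean) / (s / math.sqrt(n))

# Hypothetical values: sample mean 29, population mean 27, s = 6, N = 20.
t = one_sample_t(29.0, 27.0, 6.0, 20)
print(round(t, 3))  # 1.491
```

The resulting t would then be compared with a tabulated critical value for the appropriate degrees of freedom.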
Comparison of Student’s t
and Gaussian distributions
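$t = \frac{(\bar{x}_1 - \bar{x}_2)}{\sqrt{s_{\bar{x}_1}^2 + s_{\bar{x}_2}^2}} = \frac{(\bar{x}_1 - \bar{x}_2)}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}}$
$= \frac{(29 - 27)}{\sqrt{\frac{6^2}{20} + \frac{4^2}{16}}} = 1.19$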
Solution, cont.
$N_{df} = N_1 + N_2 - 2$
$= 16 + 20 - 2$
$= 34$
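The worked solution can be reproduced with a short Python sketch. The function name `two_sample_t` is illustrative; the inputs (means 29 and 27, standard deviations 6 and 4, N of 20 and 16) are the values that appear in the calculation above.

```python
import math

def two_sample_t(mean1, s1, n1, mean2, s2, n2):
    """t for the difference between two sample means (variances not pooled)."""
    standard_error = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (mean1 - mean2) / standard_error

t = two_sample_t(29, 6, 20, 27, 4, 16)
degrees_of_freedom = 20 + 16 - 2

print(f"{t:.3f}")          # 1.195, shown on the slide as 1.19
print(degrees_of_freedom)  # 34
```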
Statistical Tables
we use:
$\overline{(x_1 - x_2)} = \frac{1}{N}\sum_{i=1}^{N} (x_{1i} - x_{2i})$
to calculate t:
$t = \frac{\overline{(x_1 - x_2)}}{\sqrt{s_d^2 / N}}$
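The paired calculation can be sketched as follows. The function name `paired_t` and the two small data sets are hypothetical, and the sketch assumes that $s_d$ is the sample standard deviation (divisor N - 1) of the paired differences.

```python
import math
import statistics

def paired_t(x1, x2):
    """Paired t: mean of the pairwise differences over its standard error."""
    diffs = [a - b for a, b in zip(x1, x2)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    s_d = statistics.stdev(diffs)  # assumed: sample SD of the differences
    return mean_diff / (s_d / math.sqrt(n))

# Hypothetical paired results, e.g. the same four specimens run by two methods.
method_1 = [10, 12, 9, 11]
method_2 = [9, 10, 8, 10]

t = paired_t(method_1, method_2)
print(round(t, 2))  # 5.0
```

Because only the differences enter the calculation, variation between specimens does not inflate the denominator.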
Advantage of the Paired t
Why?
Applications of the Paired t
Method correlation
Comparison of therapies
Distributions
Definition
Examples
Binomial
Gaussian
Poisson
Student’s t
The $\chi^2$ (Chi-square)
Distribution
There is a general formula that relates actual
measurements to their predicted values
$\chi^2 = \sum_{i=1}^{N} \frac{[y_i - f(x_i)]^2}{\sigma_i^2}$
The $\chi^2$ (Chi-square)
Distribution
$\chi^2 = \sum_{i=1}^{N} \frac{(n_i - f_i)^2}{f_i}$
Exercise
$\frac{83 + 35}{725 + 416} = \frac{118}{1141} = 0.1034$
Calculating $\chi^2$
$f_1 = 725 \times 0.1034 = 75$ cases
$f_2 = 416 \times 0.1034 = 43$ cases
Calculating $\chi^2$
$\chi^2 = \sum_i \frac{(n_i - f_i)^2}{f_i}$
$= \frac{(83 - 75)^2}{75} + \frac{(35 - 43)^2}{43}$
$= 2.34$
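The same arithmetic can be followed step by step in Python. Variable names are illustrative, and the expected frequencies are rounded to whole cases as in the calculation above; without that rounding the result differs slightly in the second decimal.

```python
observed = [83, 35]   # observed cases in each group
totals = [725, 416]   # group sizes

# Overall rate across both groups: 118 / 1141
overall_rate = sum(observed) / sum(totals)

# Expected cases per group, rounded to whole cases (75 and 43)
expected = [round(total * overall_rate) for total in totals]

chi_square = sum((n - f) ** 2 / f for n, f in zip(observed, expected))

print(round(overall_rate, 4))  # 0.1034
print(expected)                # [75, 43]
print(round(chi_square, 2))    # 2.34
```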
Degrees of freedom
Definition
Examples
Binomial
Gaussian
Poisson
Student’s t
$\chi^2$
The F distribution
$F = \frac{V_1}{V_2}$
(by convention, the larger V is the numerator)
Applications of the F
distribution
There are several ways the F distribution can be used.
Applications of the F statistic are part of a more general
type of statistical analysis called analysis of variance
(ANOVA). We’ll see more about ANOVA later.
Example
$\frac{70 + 85 + 76}{3} = 77$
Analysis, cont.
$V_{\bar{x}} = \frac{(70 - 77)^2 + (85 - 77)^2 + (76 - 77)^2}{3}$
$= 38$
Analysis, cont.
$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}}$
Analysis, cont.
$V_{\bar{x}} = \sigma_{\bar{x}}^2 = \frac{\sigma_x^2}{N}$
$\sigma_x^2 = N \sigma_{\bar{x}}^2 = 4 \times 38 = 152$
Analysis, cont.
$\frac{V_1 + V_2 + V_3}{3} = 14.4$
Analysis, cont.
$F = \frac{152}{14.4} = 10.6$
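The chain of calculations in this analysis can be sketched in Python. Variable names are illustrative. The individual within-group variances are not given in the text, so their average (14.4) is taken as stated, and the group size of 4 is read from the calculation above.

```python
group_means = [70, 85, 76]
n_per_group = 4                     # group size used in the calculation above
mean_within_group_variance = 14.4   # average of V1, V2, V3, taken as given

grand_mean = sum(group_means) / len(group_means)

# Variance of the group means (divisor 3, as in the worked analysis)
variance_of_means = sum((m - grand_mean) ** 2 for m in group_means) / len(group_means)

# Variance of individual values implied by the spread of the means
between_group_estimate = n_per_group * variance_of_means

f_ratio = between_group_estimate / mean_within_group_variance

print(grand_mean)              # 77.0
print(variance_of_means)       # 38.0
print(between_group_estimate)  # 152.0
print(round(f_ratio, 1))       # 10.6
```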
Conclusion
Definition
Examples
Binomial
Gaussian
Poisson
Student’s t
$\chi^2$
F
Unknown or irregular
distribution
Transform
[Figure: probability vs. x]
Log transform
[Figure: probability vs. log x]
Unknown or irregular
distribution
Transform
Non-parametric methods
Non-parametric methods
Non-parametric methods make no assumptions about
the distribution of the data
There are non-parametric methods for characterizing
data, as well as for comparing data sets
These methods are also called distribution-free, robust,
or sometimes non-metric tests
Application to Reference
Ranges
The concentrations of most clinical analytes are not usually
distributed in a Gaussian manner. Why?
How do we do this?
“Everything should be made as
simple as possible, but not simpler.”
ALBERT EINSTEIN
Solution #1: Simple
comparison
Suppose we just do a small internal reference range study,
and compare our results to the manufacturer’s range.
Ux = Uy = 1/2NxNy
$A_{sq} = xy = x^2$ (since $y = x$)
$A_{cir} = \pi r^2 = \pi\left(\frac{x}{2}\right)^2$
$\frac{A_{cir}}{A_{sq}} = \frac{\pi}{4}$
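The area ratio suggests a classic Monte Carlo illustration: scatter random points over the square and count the fraction that lands inside the inscribed circle. The sketch below is one possible version; the function name `estimate_pi`, the unit square, and the fixed seed are assumptions made so the example is reproducible.

```python
import random

def estimate_pi(n_points, seed=0):
    """Estimate pi from the fraction of random points in a unit square
    that fall inside the inscribed circle of radius 0.5."""
    rng = random.Random(seed)  # seeded generator for repeatable results
    inside = 0
    for _ in range(n_points):
        x = rng.uniform(-0.5, 0.5)
        y = rng.uniform(-0.5, 0.5)
        if x * x + y * y <= 0.25:  # point lies within the circle
            inside += 1
    # fraction inside approximates A_cir / A_sq = pi / 4
    return 4.0 * inside / n_points

print(estimate_pi(100_000))
```

The estimate improves only as the square root of the number of points, which is the same behavior seen for the standard error of a mean.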
The Monte Carlo method
[Figure: a reference population from which repeated samples of size N are drawn, each with its own mean and SD]
The Monte Carlo method
[Figure: plot with axes running from 0 to 45 (vertical) and 0 to 50 (horizontal)]
Linear regression (least squares)
[Figure: scatter plot with least-squares line y = 1.031x - 0.024; axes run from 0 to 45 (vertical) and 0 to 50 (horizontal)]
Covariance
$\mathrm{cov}(x, y) = \frac{1}{N}\sum_i (y_i - \bar{y})(x_i - \bar{x})$
x y y x
1 1
The Correlation
Coefficient
The correlation coefficient is a unitless quantity that
roughly indicates the degree to which x and y vary in
the same direction.
It is useful for detecting relationships between
parameters, but it is not a very sensitive measure of the
spread.
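A minimal Python sketch of the covariance and the correlation coefficient follows. The function names and data are hypothetical, and the sketch assumes the usual definition of the correlation coefficient as the covariance divided by the product of the two standard deviations (both with divisor N).

```python
import math

def covariance(x, y):
    """Mean product of the deviations of x and y from their means (divisor N)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((yi - mean_y) * (xi - mean_x) for xi, yi in zip(x, y)) / n

def correlation(x, y):
    """Covariance scaled by both standard deviations, so the result is unitless."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises with x
y_down = [10, 8, 6, 4, 2]  # falls as x rises

print(covariance(x, y_up))                # 4.0
print(round(correlation(x, y_up), 6))     # 1.0
print(round(correlation(x, y_down), 6))   # -1.0
```

Note that `covariance(x, x)` is simply the variance of x.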
Correlation
[Figure: correlation plot; fitted line y = 1.031x - 0.024, correlation coefficient = 0.9986; axes run from 0 to 50]
Correlation
[Figure: correlation plot; fitted line y = 1.031x - 0.024, correlation coefficient = 0.9894; axes run from 0 to 50]
Standard Error of the
Estimate
The linear regression equation gives us a way to calculate
an “estimated” y for any given x value, given the
symbol ŷ (y-hat):
$\hat{y} = mx + b$
Standard Error of the
Estimate
Now what we are interested in is the average difference
between the measured y and its estimate, ŷ :
$s_{y/x} = \sqrt{\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2}$
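The least-squares fit and the standard error of the estimate can be sketched together in Python. Function names and data are hypothetical. The divisor N follows the formula above; many texts use N - 2 instead, which gives a slightly larger value.

```python
import math

def least_squares(x, y):
    """Slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def standard_error_of_estimate(x, y, m, b):
    """Root mean squared difference between measured y and estimated y-hat (divisor N)."""
    n = len(x)
    squared_residuals = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(squared_residuals / n)

x = [1, 2, 3, 4]  # hypothetical x values
y = [2, 4, 5, 9]  # hypothetical measured y values

m, b = least_squares(x, y)
s_yx = standard_error_of_estimate(x, y, m, b)
print(round(m, 3), round(b, 3))  # 2.2 -0.5
print(round(s_yx, 3))            # 0.671
```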
Correlation
[Figure: correlation plot; fitted line y = 1.031x - 0.024, correlation coefficient = 0.9986, $s_{y/x}$ = 1.83; axes run from 0 to 50]
Correlation
[Figure: correlation plot; fitted line y = 1.031x - 0.024, correlation coefficient = 0.9894, $s_{y/x}$ = 5.32; axes run from 0 to 50]
Standard Error of the
Estimate
If we assume that the errors in the y measurements are
Gaussian (is that a safe assumption?), then the standard
error of the estimate gives us the boundaries within
which 67% of the y values will fall.
Within-run: 10 or 20 replicates
What types of errors does within-run precision reflect?
Day-to-day: NCCLS recommends evaluation over 20
days
What types of errors does day-to-day precision reflect?
Evaluating method
performance
Precision
Sensitivity
Method Sensitivity
time
Other measures of
sensitivity
Limit of Detection (LOD) is sometimes defined as the
concentration producing an S/N > 3.
In drug testing, LOD is customarily defined as the lowest
concentration that meets all identification criteria.
Limit of Quantitation (LOQ) is sometimes defined as
the concentration producing an S/N >5.
In drug testing, LOQ is customarily defined as the lowest
concentration that can be measured within ±20%.
Question
Concentration
Outliers
Example: 4, 5, 6, 13
(13 - 4) x 0.765 = 6.89
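The example shows only the product of the data range and the factor 0.765. One plausible reading, assumed here, is a gap test of the Dixon Q type: the suspect value is flagged when its distance from its nearest neighbor exceeds the range multiplied by a tabulated critical ratio. The function name `gap_test` is hypothetical, and 0.765 is simply taken from the example.

```python
def gap_test(values, critical_ratio):
    """Compare the gap between the highest value and its neighbor
    with (range * critical_ratio). Returns (gap, limit, flagged)."""
    ordered = sorted(values)
    data_range = ordered[-1] - ordered[0]
    limit = data_range * critical_ratio
    gap = ordered[-1] - ordered[-2]  # suspect value is assumed to be the highest
    return gap, limit, gap > limit

gap, limit, flagged = gap_test([4, 5, 6, 13], 0.765)
print(gap)              # 7
print(round(limit, 3))  # 6.885, shown in the example as 6.89
print(flagged)          # True: under this reading, 13 would be treated as an outlier
```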
Limitation of linear
regression method
If the analytical method has a high variance (CV), it is
likely that small deviations from linearity will not be
detected due to the high standard error of the estimate
[Figure: signal vs. concentration]
Ways to evaluate
linearity
Visual/linear regression
Quadratic regression
Quadratic regression
y = f(x) = a + bx
Quadratic regression
$y = f(x) = a + bx + cx^2$
[Figure: fitted curve vs. concentration]
Lack-of-fit analysis
$\frac{V_L}{\sum_i V_i} = 0.5981 \quad (p > 0.05)$
Lack-of-fit method
calculations
Total sum of the squares: the variance calculated from
all of the y values
Linear regression sum of the squares: the variance of y
values from the regression line
Residual sum of the squares: difference between TSS
and LSS
Lack of fit sum of the squares: the RSS minus the pure
error (sum of variances)
Lack-of-fit analysis
$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \times 100$
Example
$\frac{23}{23 + 2} \times 100 = 92\%$
Evaluating Clinical Performance
of laboratory tests
$\mathrm{Specificity} = \frac{TN}{TN + FP} \times 100$
Example
-
Disease
+
Sensitivity vs. Specificity
$PV_{+} = \frac{TP}{TP + FP} \times 100$
$= \frac{98}{98 + 2000} \times 100$
$= 4.7\%$
What about the negative
predictive value?
TN = 999,900 - 2000 = 997,900
FN = 100 * 0.002 = 0 (or 1)
$PV_{-} = \frac{TN}{TN + FN} \times 100$
$= \frac{997{,}900}{997{,}900 + 1} \times 100$
$= 100\%$
Summary of predictive
value
Predictive value describes the usefulness of a clinical
laboratory test in the real world.
Or does it?
Lessons about predictive
value
Even when you have a very good test, it is generally not
cost effective to screen for diseases which have low
incidence in the general population. Exception?
The higher the clinical suspicion, the better the predictive
value of the test. Why?
Efficiency
$\mathrm{Efficiency} = \frac{TP + TN}{TP + FP + TN + FN} \times 100$
The efficiency is the percentage of all patients that
are classified correctly by the test result.
Efficiency of our Addison
screen
$\frac{98 + 997{,}900}{98 + 2000 + 997{,}900 + 2} \times 100 = 99.8\%$
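All five performance measures can be computed from the same four counts. The sketch below is illustrative: the function name and dictionary keys are hypothetical, and the counts (TP = 98, FP = 2000, TN = 997,900, FN = 2) are the ones used in the efficiency calculation for the screening example.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Clinical performance measures, in percent, from the four outcome counts."""
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "positive_predictive_value": 100 * tp / (tp + fp),
        "negative_predictive_value": 100 * tn / (tn + fn),
        "efficiency": 100 * (tp + tn) / (tp + fp + tn + fn),
    }

metrics = diagnostic_metrics(tp=98, fp=2000, tn=997_900, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```

With these counts the test is both sensitive and specific, yet fewer than 5% of positive results are true positives, because the disease is rare in the screened population.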
“To call in the statistician after the experiment is done may be no more than asking
him to perform a postmortem examination: he may be able to say what the experiment
died of.”
$1_{2s}$    1 in 20
$1_{3s}$    1 in 300
$2_{2s}$    1 in 400
$R_{4s}$    1 in 800
$4_{1s}$    1 in 600
$10_{x}$    1 in 1000
Some examples
[Figure: four control charts, each with levels marked at the mean and at ±1 SD, ±2 SD, and ±3 SD]
“In science one tries to tell people, in such a way as to be understood by everyone,
something that no one ever knew before. But in poetry, it's the exact opposite.”