Basic Statistics
For data or observations that contain random noise, the distribution of the observations approaches a normal (Gaussian) distribution as the number of observations approaches infinity; this is a consequence of the central limit theorem when the noise is the sum of many small independent contributions.
Properties of a Gaussian function
The normal distribution is centered on the mean $\mu$; its variance is $\sigma^2$ and its standard deviation is $\sigma$.
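The Gaussian (normal) probability density with mean $\mu$ and variance $\sigma^2$ is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$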
95%-confidence limit
The 95% confidence limit is defined in terms of the area underneath the
Gaussian (normal) distribution function. 95% of the population will be
observed within the limits set by this definition.
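For a normal distribution, 95% of the population falls within approximately 1.96 standard deviations of the mean:

$$\mu - 1.96\,\sigma \;\le\; x \;\le\; \mu + 1.96\,\sigma$$

(the factor 1.96 is often rounded to 2).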
If our guess a is poor, SS will be large; a good guess will give a small value of SS. By minimizing the SS function we find the least squares estimate (LSE) for the average, $a_{LSE}$. We can easily find the LSE value for $a$ by setting the derivative $d(SS)/da = 0$. With $SS = \sum_{i=1}^{N}(x_i - a)^2$ we find:

$$\frac{d(SS)}{da} = -2\sum_{i=1}^{N}(x_i - a) = 0$$

We can divide both sides by $-2$ and solve for $a$, which gives the definition of the mean:

$$a_{LSE} = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x}$$
In other words, the sample average (or mean) indeed minimizes the sum of squares. The median, by contrast, does not have this nice property (it minimizes the sum of absolute deviations instead).
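A quick numerical check of this result (a minimal sketch; the data values are invented for illustration):

```python
import numpy as np

# Hypothetical replicate measurements (illustration only)
x = np.array([9.8, 10.0, 10.1, 10.3, 10.9])

def sum_of_squares(a):
    """Sum of squared deviations of the data from a guess a."""
    return np.sum((x - a) ** 2)

# Scan a fine grid of guesses and find the one minimizing SS
grid = np.linspace(x.min(), x.max(), 100001)
best = grid[np.argmin([sum_of_squares(a) for a in grid])]

print(f"grid minimizer : {best:.4f}")      # matches the mean
print(f"mean           : {x.mean():.4f}")
print(f"SS at mean     : {sum_of_squares(x.mean()):.4f}")
print(f"SS at median   : {sum_of_squares(np.median(x)):.4f}")  # >= SS at mean
```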
Ordinary Least Squares
Linear data are no longer pure replicates, because we vary the value of x. For linear data we guess the slope b and intercept a, calculate the deviations, and form SS. To minimize SS we must now take two derivatives (dSS/da and dSS/db) and set them to zero simultaneously. Matrix notation is a great help when dealing with this kind of problem. We can write the above model as:

$$y_i = a + b\,x_i + \varepsilon_i$$

Or:

$$Y = X\beta + \varepsilon \qquad \text{with} \qquad X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix}, \quad \beta = \begin{pmatrix} a \\ b \end{pmatrix}$$
The X matrix records for what values of x we choose to take a
measurement. We generally assume that there is no error in these set
points or independent variables. Y contains the dependent variable, the
measured values. The matrix ε contains the random errors that we
assume to be normal. The matrix β contains the parameters we wish to
estimate, the slope b and intercept a of our line.
Finding the LSE for β can be done quite elegantly in matrix notation.
Notice that the only unknowns left are in β. The X and Y matrices are known, because they are either set or measured. Solving for β now requires some simple matrix algebra:

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$$
This regression formula minimizes the sum of squares for a great many different models: point, line, circle, parabola, or polynomial. It is one of the most powerful equations in statistics. Let's first look at a simple straight line. To construct the X matrix we take the derivative of the model with respect to each of the two parameters: for the line $y = a + bx$, $\partial y/\partial a = 1$ and $\partial y/\partial b = x$, so each row of X is $(1, x_i)$.
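A minimal sketch of this recipe in Python (the synthetic data and names are invented for illustration):

```python
import numpy as np

# Synthetic illustration: y = 1.0 + 2.0*x plus normal noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 10)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

# Design matrix X: one column per parameter, rows (1, x_i)
X = np.column_stack([np.ones_like(x), x])

# Least squares estimate: beta = (X^T X)^{-1} X^T Y
beta = np.linalg.solve(X.T @ X, X.T @ y)
a_hat, b_hat = beta
print(f"intercept a = {a_hat:.3f}, slope b = {b_hat:.3f}")
```

The same two lines of linear algebra handle any model that is linear in its parameters; only the columns of X change.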
[Figure: measured data points with the fitted calibration line.]
Then we plot and fit the data using ordinary linear least squares to obtain the slope and intercept of a calibration line. We can then use this line to determine the concentration or other property of an unknown.
Linear response theory
For calibration we consider the instrumental response R to be a linear function of the variable V that is to be measured:

$$R = sV + b$$

where s is the slope and b the intercept of the calibration line.
Concentration (mM)   Absorbance
0.5                  0.02
1                    0.0423
1.5                  0.0557
2                    0.0821
2.5                  0.0956
3                    0.115
3.5                  0.13634
4                    0.1602
4.5                  0.1756
5                    0.205
Example: an absorbance calibration line
Fitting the calibration line to these data gives a slope m = 0.04003 and an intercept b = -0.00130. We can now predict: if we measure an absorbance of A = 0.123, we can use the calibration line to determine the concentration, C = (A - b)/m ≈ 3.11 mM in this case.
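This fit is easy to reproduce (a sketch using the table above; numpy's polyfit is one of several equivalent routes):

```python
import numpy as np

conc = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])  # mM
absb = np.array([0.02, 0.0423, 0.0557, 0.0821, 0.0956,
                 0.115, 0.13634, 0.1602, 0.1756, 0.205])

# Fit A = m*C + b by ordinary least squares
m, b = np.polyfit(conc, absb, deg=1)
print(f"m = {m:.5f}, b = {b:.5f}")   # m = 0.04003, b = -0.00131

# Invert the calibration line for an unknown with A = 0.123
C_unknown = (0.123 - b) / m
print(f"C = {C_unknown:.2f} mM")     # ~3.11 mM
```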
Procedure
To obtain a calibrated value for an unknown sample, we follow this procedure:
• We measure a set of $R_{cal}$ values for a number of standards with known values $V_{cal}$.
• We construct a regression line, i.e. determine the best s and b values.
• We measure $R_{unk}$ for the unknown sample.
• We calculate $V_{unk} = (R_{unk} - b)/s$.
Of course the calibrated value $V_{unk}$ is subject to error. In fact its value is subject to two kinds of error:
• The random error due to the measurement: $\varepsilon_{unk}/s$
• Whatever residual systematic calibration error is left despite our calibration
Trumpets: the confidence limits of a line
The calibration error can be represented statistically by drawing the 95% confidence limits around the calibration line. These limits form the two branches of a hyperbolic function:

$$y = b + mx \pm t\,s_e\sqrt{\frac{1}{N} + \frac{(x-\bar{x})^2}{\sum_{i=1}^{N}(x_i-\bar{x})^2}}$$

The total error (calibration + random measurement of the unknown) is given by the prediction limits. They also form a somewhat wider set of hyperbolic branches, given by:

$$y = b + mx \pm t\,s_e\sqrt{1 + \frac{1}{N} + \frac{(x-\bar{x})^2}{\sum_{i=1}^{N}(x_i-\bar{x})^2}}$$

Here m and b are the fitted slope and intercept, $s_e$ is the standard deviation of the fit residuals (written $s_e$ to avoid confusion with the calibration slope s), and t is the t-value at N - 2 degrees of freedom.
[Figure: calibration (regression) line through the standard values, with the narrower 95% confidence limits (for the line) and the wider prediction limits (for one point); x-axis: standard values, y-axis: response.]
Replicates and the significance of the t-value
If we take n replicate measurements of the unknown, the (outer) hyperbola becomes gradually narrower (the leading 1 under the square root becomes 1/n), eventually converging to the (inner) confidence limit as the 1/n term goes to zero. The inner limits represent the error due to calibration and can only be improved by doing a better calibration job. The quantity D is actually the determinant of the $(X^{T}X)$ matrix. The value of N represents the number of calibration standards used. The value of t represents the appropriate t-value at the given number of degrees of freedom (N - 2) and the confidence level desired (usually 95%, i.e. p = 0.05). The standard values are denoted by x. The center of the calibration set is given by the average $\bar{x}$ of all x values; this is the narrowest point, where the error in the slope does not contribute.
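A sketch of how these limits can be computed (scipy supplies the t-value; the function name and structure are my own):

```python
import numpy as np
from scipy import stats

def trumpet_limits(x, y, x_new, confidence=0.95, n_rep=1):
    """Confidence and prediction limits around a regression line.

    x, y   : arrays of calibration standards and measured responses
    x_new  : value(s) at which to evaluate the limits
    n_rep  : number of replicate measurements of the unknown
    """
    N = len(x)
    m, b = np.polyfit(x, y, deg=1)
    resid = y - (m * x + b)
    s_e = np.sqrt(np.sum(resid**2) / (N - 2))   # residual std. deviation
    t = stats.t.ppf(1 - (1 - confidence) / 2, df=N - 2)

    core = 1 / N + (x_new - x.mean())**2 / np.sum((x - x.mean())**2)
    y_hat = m * x_new + b
    conf = t * s_e * np.sqrt(core)              # inner hyperbola (line)
    pred = t * s_e * np.sqrt(1 / n_rep + core)  # outer hyperbola (points)
    return y_hat, conf, pred
```

As n_rep grows, the 1/n term vanishes and the prediction limit collapses onto the confidence limit, exactly as described above.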
Inverting the process: using the calibration line
For a given calibration line $R = sV + b$, as we saw above, we obtain a calibrated value for the unknown by taking the inverse function of the calibration line, using the best estimates for s and b:

$$V_{unk} = \frac{R_{unk} - b}{s}$$
Graphically we can represent that as 'reading back' a value on the Y-axis (the measured R values) towards the X-axis (representing the calibrated V values). Let us assume that the random error in each individual measurement is the same for all measurements (calibration and unknown alike). We can then predict with, say, 95% confidence that a subsequent measurement of an unknown substance must fall within the outer hyperbolas. Since we know the response R (on the Y-axis), we can use the corresponding V values on the X-axis as confidence limits for our unknown V value.
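This read-back can be done numerically by finding where the outer hyperbolas cross the measured response (a sketch reusing the hypothetical trumpet_limits helper above; it assumes a positive slope and a response within the calibrated range):

```python
import numpy as np
from scipy.optimize import brentq

def read_back(R_unknown, x, y, confidence=0.95):
    """Confidence interval for V given a measured response R_unknown."""
    lo, hi = x.min() - 2 * np.ptp(x), x.max() + 2 * np.ptp(x)

    def upper(v):  # upper prediction limit minus the measured response
        y_hat, _, pred = trumpet_limits(x, y, np.asarray(v), confidence)
        return (y_hat + pred) - R_unknown

    def lower(v):  # lower prediction limit minus the measured response
        y_hat, _, pred = trumpet_limits(x, y, np.asarray(v), confidence)
        return (y_hat - pred) - R_unknown

    # The upper hyperbola crossing gives the lower V limit and vice versa
    return brentq(upper, lo, hi), brentq(lower, lo, hi)

# e.g. read_back(0.123, conc, absb) for the absorbance example above
```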
[Figure: reading an unknown R back through the 95% prediction limits to obtain the calibrated value V and its confidence interval; a marginal unknown R near the LOD is also indicated.]
The outer prediction limits ('trumpets') around the calibration line fix the limit of detection (LOD) and the confidence limits of the calibrated value V.
Example from the Trumpets worksheet 2
Added   Measured
0       0.121835
0       0.122289
0       0.12266
0.1     0.214666
0.2     0.30311
0.5     0.573356
0.7     0.7528
0.7     0.75219
1       1.022785

The worksheet draws the inner (confidence) and outer (prediction) hyperbolas as:

$$\text{inner} = b + mx \pm t\,s_e\sqrt{\frac{1}{df+2} + \frac{(df+2)(x-\bar{x})^2}{(df+2)\sum_{i=1}^{N}x_i^2 - \left(\sum_{i=1}^{N}x_i\right)^2}}$$

$$\text{outer} = b + mx \pm t\,s_e\sqrt{1 + \frac{1}{df+2} + \frac{(df+2)(x-\bar{x})^2}{(df+2)\sum_{i=1}^{N}x_i^2 - \left(\sum_{i=1}^{N}x_i\right)^2}}$$

where df + 2 = N, the number of calibration points; these are the same formulas given earlier.
[Figure: the worksheet data with the calibration line and its outer prediction limits ('trumpets'), which fix the LOD and the confidence limits of the calibrated value V.]