Basic Statistical Analysis

Overview of statistical applications


Three major applications of statistics in chemistry:

1. Determining the mean and the standard deviation for a measured value. Confidence testing can be applied to determine significance.

2. Linear regression as a means to calibrate instrumentation and determine linear correlations. Error analysis is expressed in terms of 95% confidence and prediction limits, which are determined by the analytical method of linear least squares.

3. Non-linear regression as a means to fit data to theoretical functions. Error analysis is carried out in terms of non-linear least squares. The figure of merit in this type of fitting is known as χ².
Using a normal distribution: mean and standard deviation
Repeated measurements: the mean and RMSE
When measuring a value using repeated trials we can tabulate a column of values. In principle these values should be the same, i.e. we hypothesize that there is a specific value for the quantity we are measuring. In the common case where we do not know what the value should be, we can obtain our first estimate by calculating the average of the N measured values:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$
If we assume that this value is the peak of the normal (Gaussian) distribution,
then we can estimate the error as the standard deviation, which we interpret
as the root-mean-square error.
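As a quick illustration, here is a minimal Python sketch (the replicate data are hypothetical) that computes the average and the standard deviation interpreted as the root-mean-square error:

```python
# A minimal sketch of the sample mean and the standard deviation
# interpreted as the root-mean-square error (RMSE).
import numpy as np

x = np.array([10.12, 10.08, 10.15, 10.09, 10.11])  # hypothetical replicates

mean = x.sum() / len(x)                    # <x> = (sum of x_i) / N
rmse = np.sqrt(((x - mean) ** 2).mean())   # root-mean-square deviation from <x>

print(f"mean = {mean:.4f}, RMSE = {rmse:.4f}")
# np.std(x) reproduces this RMSE form; np.std(x, ddof=1) is the sample estimate
```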
What is the distribution of continuous probabilities?
The distribution of continuous probabilities will also be approximately Gaussian. Random fluctuations will tend to cluster near the average (mean).

We call the distribution of the random errors a normal distribution. It is given by a Gaussian function according to the central limit theorem.

Just as for the discrete variable, the approach to a Gaussian becomes clearer the more data we obtain.
A Gaussian function describes a normal distribution
The Central Limit Theorem
The normal distribution is approached in the limit of an increasing number of measurements: for data or observations that contain random noise, the distribution will approach a normal (Gaussian) distribution as the number of observations approaches infinity.
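A short simulation sketch illustrates this convergence (the uniform noise model and sample sizes are arbitrary choices for the demonstration): averages of random noise cluster with the spread predicted for a Gaussian as the number of observations per average grows.

```python
# Central limit theorem demo: averages of uniform random noise approach
# the predicted Gaussian spread as n grows. Sample sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

for n in (1, 5, 50):
    means = rng.uniform(0, 1, size=(10_000, n)).mean(axis=1)
    # For uniform noise on [0, 1] the predicted spread of the mean is 1/sqrt(12 n)
    print(f"n = {n:3d}: std of means = {means.std():.4f}, "
          f"predicted = {1 / np.sqrt(12 * n):.4f}")
```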
Properties of a Gaussian function
Normal Distribution

The normal distribution is described by the Gaussian function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \langle x \rangle)^2 / 2\sigma^2}$$

Least squares definitions:
The mean is ⟨x⟩.
The variance is σ² (σ is the standard deviation).
95%-confidence limit
The 95% confidence limit is defined in terms of the area underneath the
Gaussian (normal) distribution function. 95% of the population will be
observed within the limits set by this definition.

Where did this number come from?


The Student's t-test
Student's t-test deals with the problems associated with inference based on "small" samples: the calculated mean (⟨x⟩) and standard deviation (σ) may by chance deviate from the "real" mean and standard deviation (i.e., what you would measure if you had many more data points: a "large" sample). For example, it is likely that the true mean size of maple leaves is "close" to the mean calculated from a sample of N randomly collected leaves. The 95% confidence interval is:

If N = 5: ⟨x⟩ ± 2.776 σ/√N
If N = 10: ⟨x⟩ ± 2.262 σ/√N
If N = 20: ⟨x⟩ ± 2.093 σ/√N
If N = 40: ⟨x⟩ ± 2.023 σ/√N
For "large" N: ⟨x⟩ ± 1.960 σ/√N
p-test = significance testing
Significance testing is an important aspect of statistical analysis. The idea is to make an initial hypothesis (the null hypothesis) and then to use statistical observations to test that hypothesis. The p-value measures how far an observed statistic lies from the peak of a normal distribution. If the statistic is near the peak (i.e. near the average), then p ≈ 1; as one moves further from the average, the p-value decreases toward 0. The p-value is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true. If p < 0.05, there is less than a 5% chance of such an extreme result under the null hypothesis, and at this level of significance we reject the null hypothesis.
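A minimal sketch of significance testing, assuming SciPy and using a hypothetical data set and null-hypothesis mean, via a one-sample t-test:

```python
# One-sample t-test: is the true mean equal to mu0? Data are hypothetical.
import numpy as np
from scipy import stats

x = np.array([10.12, 10.08, 10.15, 10.09, 10.11])
mu0 = 10.00                                  # null-hypothesis mean

t_stat, p_value = stats.ttest_1samp(x, mu0)  # two-sided p-value
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis at the 5% significance level.")
```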
Linear Regression
and Calibration
The Sum of Squares Function
Ordinary Least Squares
Definition of the Sum of Squares Function
Start with a set of replicate values x_i and make a guess, a, for the mean µ of the distribution. We can now compute the deviations (residuals) δ_i = x_i − a. We take the squares and add them up; this produces the sum of squares:

$$SS(a) = \sum_{i=1}^{N} (x_i - a)^2$$

If our guess is poor then SS will be large. A good guess will give a small value of SS. By minimizing the SS function we find the least squares estimate (LSE) for the average, a_LSE. We can easily find the LSE value for a by setting the derivative d(SS)/da = 0. We find:

$$\frac{d(SS)}{da} = -2\sum_{i=1}^{N}(x_i - a) = 0$$

We can divide both sides by −2 and solve for a to give the definition of the mean:

$$a_{LSE} = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x}$$

In other words, the sample average (or mean) indeed minimizes the sum of squares. The median, by contrast, does not have this nice property. A numerical check of this result follows below.
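A quick numerical check, using hypothetical replicate values: the grid minimum of SS(a) coincides with the sample mean.

```python
# Numerical check: the sum of squares SS(a) is smallest at the sample mean.
import numpy as np

x = np.array([10.12, 10.08, 10.15, 10.09, 10.11])  # hypothetical replicates

def SS(a):
    return ((x - a) ** 2).sum()   # sum of squared deviations from the guess a

a_grid = np.linspace(x.min(), x.max(), 1001)
a_best = a_grid[np.argmin([SS(a) for a in a_grid])]
print(f"grid minimum at a = {a_best:.4f}, sample mean = {x.mean():.4f}")
```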
Ordinary Least Squares
Linear data are no longer pure replicates, because we vary the value of x. For linear data we guess the slope b and intercept a, calculate the deviations, and form SS. To minimize SS we must now take two derivatives (dSS/da and dSS/db) and set them to zero simultaneously. Matrix notation is a great help when dealing with this kind of problem. We can write the above model as:

$$y_i = a + b x_i + \varepsilon_i$$

Or, in matrix form:

$$Y = X\beta + \varepsilon$$
Ordinary Least Squares
The X matrix records the values of x at which we choose to take a measurement. We generally assume that there is no error in these set points, or independent variables. Y contains the dependent variable: the measured values. The vector ε contains the random errors, which we assume to be normally distributed. The vector β contains the parameters we wish to estimate, the slope b and intercept a of our line.
Finding the LSE for β can be done quite elegantly in matrix notation.
Ordinary Least Squares
Notice that the only unknowns left are in β. The X and Y matrices are known because they are either set or measured. Solving for β now requires some simple matrix algebra:

$$\beta = (X^T X)^{-1} X^T Y$$

The regression formula minimizes the sum of squares for a great many different models: point, line, circle, parabola, or polynomial. It is one of the most powerful equations in statistics. Let's first look at a simple straight line. To construct the X matrix we take the derivative of the equation for a line with respect to each of the two parameters: the derivative with respect to a gives a column of ones, and the derivative with respect to b gives a column of x values.
Ordinary Least Squares
[Figure: plotted data points with the fitted line.] In Excel, use the Trendline function to obtain a fit to a plotted line.
Coefficient of Determination
Another measure of the goodness of fit is the coefficient of determination, R², the fraction of the variance in Y that is explained by the fit:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
Ordinary Least Squares

The LINEST function
In Excel: LINEST(Y-values, X-values, 1, 1), entered as an array formula with Ctrl-Shift-Enter.
Here the slope is 0.0876(0.005) and the intercept is 0.53(0.03), with uncertainties in parentheses.

The matrix solution proceeds in these steps (a sketch follows below):
1. Define X
2. Define Xᵀ
3. Calculate XᵀX
4. Define Y
5. Calculate XᵀY
6. Calculate (XᵀX)⁻¹
7. Calculate (XᵀX)⁻¹XᵀY = β
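A NumPy sketch of these matrix steps, using hypothetical x and y data; R² is computed from the residuals as defined above.

```python
# Matrix least squares: beta = (X^T X)^-1 X^T Y for a straight line.
# Columns of X are the derivatives of a + bx with respect to a and b,
# i.e. a column of ones and the x values.
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([0.53, 0.62, 0.66, 0.71, 0.75])     # hypothetical responses

X = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
XtX = X.T @ X                                    # X^T X
XtY = X.T @ y                                    # X^T Y
beta = np.linalg.inv(XtX) @ XtY                  # (X^T X)^-1 X^T Y
intercept, slope = beta

resid = y - X @ beta
R2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}, R^2 = {R2:.4f}")
```

In production code np.linalg.lstsq(X, y) is preferred over forming the explicit inverse, but the explicit form mirrors the steps listed above.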
Determining a calibration line
To make a calibration line we make a series of measurements as a function of concentration or some other systematic parameter. Then we plot the data and fit them using ordinary linear least squares in order to obtain the slope and intercept of a calibration line. We can then use this line to determine the concentration or other property of an unknown.
Linear response theory
For calibration we consider the instrumental response R to be a linear function of the variable V that is to be measured:

$$R = sV + b$$

The slope s is known as the sensitivity of the measurement and the intercept, b, is known as the bias. To obtain high-quality data the sensitivity should be large compared to both the random error and any residual bias remaining after calibration.
Example: an absorbance calibration line

Concentration (mM)    Absorbance
0.5                   0.02
1.0                   0.0423
1.5                   0.0557
2.0                   0.0821
2.5                   0.0956
3.0                   0.115
3.5                   0.13634
4.0                   0.1602
4.5                   0.1756
5.0                   0.205
The least squares fit to these data gives

m = 0.04003
b = −0.00130

We can then predict the concentration of an unknown from a measured absorbance. If we measure an absorbance of A = 0.123, we can use the calibration line to determine the concentration: C = (A − b)/m = (0.123 + 0.00130)/0.04003 ≈ 3.11 mM in this case.
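A sketch that fits the tabulated calibration data (np.polyfit stands in here for the Trendline/LINEST fit) and inverts the line for the unknown:

```python
# Fit the absorbance calibration data and invert the line for an unknown.
import numpy as np

conc = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])  # mM
absb = np.array([0.02, 0.0423, 0.0557, 0.0821, 0.0956, 0.115,
                 0.13634, 0.1602, 0.1756, 0.205])

m, b = np.polyfit(conc, absb, 1)          # slope ~ 0.04003, intercept ~ -0.00130
print(f"m = {m:.5f}, b = {b:.5f}")

A_unknown = 0.123
c_unknown = (A_unknown - b) / m           # invert the calibration line
print(f"concentration = {c_unknown:.2f} mM")   # ~ 3.11 mM
```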
Procedure
To obtain a calibrated value for an unknown sample, we follow this procedure:
• We measure a set of R_cal values for a number of standards with known values V_cal
• We construct a regression line, i.e. determine the best s and b values
• We measure R_unk for the unknown sample
• We calculate V_unk = (R_unk − b)/s
Of course the calibrated value V_unk is subject to error. In fact it is subject to two kinds of error:
• The random error due to the measurement: ε_unk/s
• Whatever residual systematic calibration error is left despite our calibration
Trumpets: the confidence limits of a line
The calibration error can be represented statistically by drawing the 95% confidence limits around the calibration line. These limits form the two branches of a hyperbolic function. The total error (calibration + random measurement of the unknown) is given by the prediction limits. These also form a somewhat wider set of hyperbolic branches, given by:

$$y_{\text{pred}} = b + mx \pm t\,s_y\sqrt{1 + \frac{1}{N} + \frac{N(x-\bar{x})^2}{N\sum_{i=1}^{N}x_i^2 - \left(\sum_{i=1}^{N}x_i\right)^2}}$$

These are the trumpets.
Limits of calibration
[Figure: response vs. standard values, showing the calibration (regression) line, the 95% confidence limits (for the line), and the wider 95% prediction limits (for one point).]
Replicates and the significance of the t-value
If we take n replicate measurements of the unknown, the (outer) hyperbola gradually becomes narrower, eventually converging to the (inner) confidence limit as the 1/n term goes to zero. The inner limits represent the error due to calibration and can only be improved by doing a better calibration job. The quantity D is the determinant of the (XᵀX) matrix.

The value of N represents the number of calibration standards used. The value of t is the appropriate t-value at the given number of degrees of freedom (N − 2) and the desired confidence level (usually 95%, i.e. p = 0.05). The standard values are denoted by X. The center of the calibration set is given by the average of all X values; this is the narrowest point, where the error in the slope does not contribute.
Inverting the process: using the calibration line
For a given calibration line R = sV + b, as we saw above, we obtain a calibrated value for the unknown by taking the inverse function of the calibration line using the best estimates for s and b:

$$V_{unk} = \frac{R_{unk} - b}{s}$$

Graphically we can represent this as 'reading back' a value on the Y-axis (the measured R values) toward the X-axis (representing the calibrated V values). Let us assume that the random error in each individual measurement is the same for all measurements (calibration and unknown alike). We can then predict with, say, 95% confidence that a subsequent measurement of an unknown substance must fall within the outer hyperbolas. Since we know the response R (on the Y-axis), we can use the corresponding V values on the X-axis as confidence limits for our unknown V value.
[Figure: calibration line with 95% confidence and prediction bands; an unknown R is read back through the bands to give the calibrated value V and its confidence limits, and a marginal unknown R defines the LOD.]
The outer prediction limits ('trumpets') around the calibration line fix the LOD and the confidence limits of the calibrated value V.
Example from the Trumpets worksheet 2

added    meas
0        0.121835
0        0.122289
0        0.12266
0.1      0.214666
0.2      0.30311
0.5      0.573356
0.7      0.7528
0.7      0.75219
1        1.022785

The inner (confidence) and outer (prediction) limits are, with N = df + 2 calibration points (df = N − 2 degrees of freedom):

$$\text{inner} = b + mx \pm t\,s_y \sqrt{\frac{1}{N} + \frac{N(x - \bar{x})^2}{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}}$$

$$\text{outer} = b + mx \pm t\,s_y \sqrt{1 + \frac{1}{N} + \frac{N(x - \bar{x})^2}{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}}$$
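A sketch, assuming NumPy and SciPy, that implements the inner and outer hyperbolas above and reproduces the confidence and prediction columns of the table that follows:

```python
# Confidence ("inner") and prediction ("outer") limits around a calibration
# line, using the Trumpets worksheet data; t is evaluated at df = N - 2.
import numpy as np
from scipy import stats

x = np.array([0, 0, 0, 0.1, 0.2, 0.5, 0.7, 0.7, 1.0])
y = np.array([0.121835, 0.122289, 0.12266, 0.214666, 0.30311,
              0.573356, 0.7528, 0.75219, 1.022785])

N = len(x)
m, b = np.polyfit(x, y, 1)                    # slope ~ 0.8999, intercept ~ 0.1229
resid = y - (b + m * x)
s_y = np.sqrt((resid ** 2).sum() / (N - 2))   # RMSE with df = N - 2
t = stats.t.ppf(0.975, df=N - 2)              # ~ 2.3646 for df = 7

xs = np.unique(x)                             # evaluate at the standard values
Sxx = N * (x ** 2).sum() - x.sum() ** 2       # N*sum(x^2) - (sum x)^2 = 10.28
half_conf = t * s_y * np.sqrt(1 / N + N * (xs - x.mean()) ** 2 / Sxx)
half_pred = t * s_y * np.sqrt(1 + 1 / N + N * (xs - x.mean()) ** 2 / Sxx)

for xi, hc, hp in zip(xs, half_conf, half_pred):
    print(f"x = {xi:.2f}: fit = {b + m * xi:.6f}, "
          f"conf +/- {hc:.6f}, pred +/- {hp:.6f}")
```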
Example from the Trumpets worksheet 2

added   meas      fit       Conf95%+  Conf95%−  Pred95%+  Pred95%−
0       0.121835  0.122883  0.123864  0.121901  0.125186  0.120579
0       0.122289  0.122883  0.123864  0.121901  0.125186  0.120579
0       0.12266   0.122883  0.123864  0.121901  0.125186  0.120579
0.1     0.214666  0.212875  0.21373   0.21202   0.215127  0.210622
0.2     0.30311   0.302867  0.303625  0.302109  0.305084  0.300649
0.5     0.573356  0.572843  0.573593  0.572094  0.575058  0.570629
0.7     0.7528    0.752827  0.753794  0.751861  0.755124  0.75053
0.7     0.75219   0.752827  0.753794  0.751861  0.755124  0.75053
1       1.022785  1.022804  1.02424   1.021368  1.025334  1.020273

Regression statistics (LINEST output plus auxiliary quantities):
slope = 0.899921          intercept = 0.122883
se(slope) = 0.000825      se(intercept) = 0.000415
R² = 0.999994             RMSE (s_y) = 0.000881
F = 1191074               df = 7
SS_reg = 0.925038         SS_resid = 5.44E-06
avg(x) = 0.355556
sum(x) = 3.2
sum(x²) = 2.28
t-value = 2.364624
N·sum(x²) − (sum(x))² = 10.28
x       y-calc    95% conf+  95% conf−  95% pred+  95% pred−
0.348   0.436055  0.43675    0.43536    0.438252   0.433858
0.349   0.436955  0.43765    0.43626    0.439152   0.434758
0.35    0.437855  0.43855    0.43716    0.440052   0.435658
0.351   0.438755  0.43945    0.43806    0.440952   0.436558
0.352   0.439655  0.440349   0.43896    0.441851   0.437458
0.353   0.440555  0.441249   0.43986    0.442751   0.438358
0.354   0.441455  0.442149   0.44076    0.443651   0.439258
0.355   0.442355  0.443049   0.44166    0.444551   0.440158
0.356   0.443255  0.443949   0.44256    0.445451   0.441058
0.357   0.444154  0.444849   0.44346    0.446351   0.441958
