Basic Statistical Analysis

Overview of statistical applications


Three major applications of statistics in chemistry:

1. Determining the mean and the standard deviation for a measured value. Confidence testing can be applied to determine significance.

2. Linear regression as a means to calibrate instrumentation and determine linear correlations. Error analysis is expressed in terms of 95% confidence and prediction limits, which are determined by the analytical method of linear least squares.

3. Non-linear regression as a means to fit data to theoretical functions. Error analysis is carried out in terms of non-linear least squares. The figure of merit in this type of fitting is known as χ².
Using a normal distribution: mean and standard deviation
Repeated measurements: the mean and RMSE
When measuring a value using repeated trials we can tabulate a column of values. In principle these values should be the same, i.e. we hypothesize that there is a specific value for the quantity we are measuring. In the common case where we do not know what the value should be, we can obtain our first estimate by calculating the average of the N measured values:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$
If we assume that this value is the peak of the normal (Gaussian) distribution,
then we can estimate the error as the standard deviation, which we interpret
as the root-mean-square error.
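As a quick illustration, here is a minimal Python sketch (the replicate data are hypothetical) that computes the average and the standard deviation interpreted as the root-mean-square error:

```python
# A minimal sketch of the sample mean and the standard deviation
# interpreted as the root-mean-square error (RMSE).
import numpy as np

x = np.array([10.12, 10.08, 10.15, 10.09, 10.11])  # hypothetical replicates

mean = x.sum() / len(x)                    # <x> = (sum of x_i) / N
rmse = np.sqrt(((x - mean) ** 2).mean())   # root-mean-square deviation from <x>

print(f"mean = {mean:.4f}, RMSE = {rmse:.4f}")
# np.std(x) reproduces this RMSE form; np.std(x, ddof=1) is the sample estimate
```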
What is the distribution of continuous probabilities?
The distribution of continuous probabilities will also be approximately Gaussian. Random fluctuations will tend to cluster near the average (mean).

We call the distribution of the random errors a normal distribution. It is given by a Gaussian function according to the central limit theorem.

Just as for the discrete variable, the approach to a Gaussian becomes clearer the more data we obtain.
A Gaussian function describes a normal distribution
The Central Limit Theorem
The normal distribution is approached in the limit of an increasing number of measurements: for data or observations that contain random noise, the distribution will approach a normal (Gaussian) distribution as the number of observations approaches infinity.
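A short simulation sketch illustrates this convergence (the uniform noise model and sample sizes are arbitrary choices for the demonstration): averages of random noise cluster with the spread predicted for a Gaussian as the number of observations per average grows.

```python
# Central limit theorem demo: averages of uniform random noise approach
# the predicted Gaussian spread as n grows. Sample sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

for n in (1, 5, 50):
    means = rng.uniform(0, 1, size=(10_000, n)).mean(axis=1)
    # For uniform noise on [0, 1] the predicted spread of the mean is 1/sqrt(12 n)
    print(f"n = {n:3d}: std of means = {means.std():.4f}, "
          f"predicted = {1 / np.sqrt(12 * n):.4f}")
```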
Properties of a Gaussian function
Normal Distribution

The normal distribution is described by the Gaussian function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \langle x \rangle)^2 / 2\sigma^2}$$

Least squares definitions:
The mean is ⟨x⟩.
The variance is σ² (σ is the standard deviation).
95%-confidence limit
The 95% confidence limit is defined in terms of the area underneath the
Gaussian (normal) distribution function. 95% of the population will be
observed within the limits set by this definition.

Where did this number come from?


The Student's t-test
Student's t-test deals with the problems associated with inference based on "small" samples: the calculated mean (⟨x⟩) and standard deviation (σ) may by chance deviate from the "real" mean and standard deviation (i.e., what you would measure if you had many more data points: a "large" sample). For example, it is likely that the true mean size of maple leaves is "close" to the mean calculated from a sample of N randomly collected leaves. The 95% confidence interval is:

If N = 5: ⟨x⟩ ± 2.776 σ/√N
If N = 10: ⟨x⟩ ± 2.262 σ/√N
If N = 20: ⟨x⟩ ± 2.093 σ/√N
If N = 40: ⟨x⟩ ± 2.023 σ/√N
For "large" N: ⟨x⟩ ± 1.960 σ/√N
p-test = significance testing
Significance testing is an important aspect of statistical analysis. The idea is to make an initial hypothesis (the null hypothesis) and then to use statistical observations to test that hypothesis. The p-value measures how far an observed statistic lies from the peak of a normal distribution. If the statistic is near the peak (i.e. near the average), then p ≈ 1; as one moves further from the average, the p-value decreases toward 0. The p-value is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true. If p < 0.05, there is less than a 5% chance of such an extreme result under the null hypothesis, and at this level of significance we reject the null hypothesis.
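A minimal sketch of significance testing, assuming SciPy and using a hypothetical data set and null-hypothesis mean, via a one-sample t-test:

```python
# One-sample t-test: is the true mean equal to mu0? Data are hypothetical.
import numpy as np
from scipy import stats

x = np.array([10.12, 10.08, 10.15, 10.09, 10.11])
mu0 = 10.00                                  # null-hypothesis mean

t_stat, p_value = stats.ttest_1samp(x, mu0)  # two-sided p-value
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis at the 5% significance level.")
```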
Linear Regression
and Calibration
The Sum of Squares Function
Ordinary Least Squares
Definition of the Sum of Squares Function
Start with a set of replicate values x_i and make a guess, a, for the mean µ of the distribution. We can now compute the deviations (residuals) δ_i = x_i − a. We take the squares and add them up; this produces the sum of squares:

$$SS(a) = \sum_{i=1}^{N} (x_i - a)^2$$

If our guess is poor then SS will be large. A good guess will give a small value of SS. By minimizing the SS function we find the least squares estimate (LSE) for the average, a_LSE. We can easily find the LSE value for a by setting the derivative d(SS)/da = 0. We find:

$$\frac{d(SS)}{da} = -2\sum_{i=1}^{N}(x_i - a) = 0$$

We can divide both sides by −2 and solve for a to give the definition of the mean:

$$a_{LSE} = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x}$$

In other words, the sample average (or mean) indeed minimizes the sum of squares. The median, by contrast, does not have this nice property. A numerical check of this result follows below.
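A quick numerical check, using hypothetical replicate values: the grid minimum of SS(a) coincides with the sample mean.

```python
# Numerical check: the sum of squares SS(a) is smallest at the sample mean.
import numpy as np

x = np.array([10.12, 10.08, 10.15, 10.09, 10.11])  # hypothetical replicates

def SS(a):
    return ((x - a) ** 2).sum()   # sum of squared deviations from the guess a

a_grid = np.linspace(x.min(), x.max(), 1001)
a_best = a_grid[np.argmin([SS(a) for a in a_grid])]
print(f"grid minimum at a = {a_best:.4f}, sample mean = {x.mean():.4f}")
```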
Ordinary Least Squares
Linear data are no longer pure replicates, because we vary the value of x. For linear data we guess the slope b and intercept a, calculate the deviations, and form SS. To minimize SS we must now take two derivatives (dSS/da and dSS/db) and set them to zero simultaneously. Matrix notation is a great help when dealing with this kind of problem. We can write the above model as:

$$y_i = a + b x_i + \varepsilon_i$$

Or, in matrix form:

$$Y = X\beta + \varepsilon$$
Ordinary Least Squares
The X matrix records the values of x at which we choose to take a measurement. We generally assume that there is no error in these set points, or independent variables. Y contains the dependent variable: the measured values. The vector ε contains the random errors, which we assume to be normally distributed. The vector β contains the parameters we wish to estimate, the slope b and intercept a of our line.
Finding the LSE for β can be done quite elegantly in matrix notation.
Ordinary Least Squares
Notice that the only unknowns left are in β. The X and Y matrices are known because they are either set or measured. Solving for β now requires some simple matrix algebra:

$$\beta = (X^T X)^{-1} X^T Y$$

The regression formula minimizes the sum of squares for a great many different models: point, line, circle, parabola, or polynomial. It is one of the most powerful equations in statistics. Let's first look at a simple straight line. To construct the X matrix we take the derivative of the equation for a line with respect to each of the two parameters: the derivative with respect to a gives a column of ones, and the derivative with respect to b gives a column of x values.
Ordinary Least Squares
[Figure: plotted data points with the fitted line.] In Excel, use the Trendline function to obtain a fit to a plotted line.
Coefficient of Determination
Another measure of the goodness of fit is the coefficient of determination, R², the fraction of the variance in Y that is explained by the fit:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
Ordinary Least Squares

The LINEST function
In Excel: LINEST(Y-values, X-values, 1, 1), entered as an array formula with Ctrl-Shift-Enter.
Here the slope is 0.0876(0.005) and the intercept is 0.53(0.03), with uncertainties in parentheses.

The matrix solution proceeds in these steps (a sketch follows below):
1. Define X
2. Define Xᵀ
3. Calculate XᵀX
4. Define Y
5. Calculate XᵀY
6. Calculate (XᵀX)⁻¹
7. Calculate (XᵀX)⁻¹XᵀY = β
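A NumPy sketch of these matrix steps, using hypothetical x and y data; R² is computed from the residuals as defined above.

```python
# Matrix least squares: beta = (X^T X)^-1 X^T Y for a straight line.
# Columns of X are the derivatives of a + bx with respect to a and b,
# i.e. a column of ones and the x values.
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([0.53, 0.62, 0.66, 0.71, 0.75])     # hypothetical responses

X = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
XtX = X.T @ X                                    # X^T X
XtY = X.T @ y                                    # X^T Y
beta = np.linalg.inv(XtX) @ XtY                  # (X^T X)^-1 X^T Y
intercept, slope = beta

resid = y - X @ beta
R2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}, R^2 = {R2:.4f}")
```

In production code np.linalg.lstsq(X, y) is preferred over forming the explicit inverse, but the explicit form mirrors the steps listed above.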
Determining a calibration line
To make a calibration line we make a series of measurements as a function of concentration or some other systematic parameter. Then we plot the data and fit them using ordinary linear least squares in order to obtain the slope and intercept of a calibration line. We can then use this line to determine the concentration or other property of an unknown.
Linear response theory
For calibration we consider the instrumental response R to be a linear function of the variable V that is to be measured:

$$R = sV + b$$

The slope s is known as the sensitivity of the measurement and the intercept, b, is known as the bias. To obtain high-quality data the sensitivity should be large compared to both the random error and any residual bias remaining after calibration.
Example: an absorbance calibration line

Concentration (mM)    Absorbance
0.5                   0.02
1.0                   0.0423
1.5                   0.0557
2.0                   0.0821
2.5                   0.0956
3.0                   0.115
3.5                   0.13634
4.0                   0.1602
4.5                   0.1756
5.0                   0.205
The least squares fit to these data gives

m = 0.04003
b = −0.00130

We can then predict the concentration of an unknown from a measured absorbance. If we measure an absorbance of A = 0.123, we can use the calibration line to determine the concentration: C = (A − b)/m = (0.123 + 0.00130)/0.04003 ≈ 3.11 mM in this case.
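A sketch that fits the tabulated calibration data (np.polyfit stands in here for the Trendline/LINEST fit) and inverts the line for the unknown:

```python
# Fit the absorbance calibration data and invert the line for an unknown.
import numpy as np

conc = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])  # mM
absb = np.array([0.02, 0.0423, 0.0557, 0.0821, 0.0956, 0.115,
                 0.13634, 0.1602, 0.1756, 0.205])

m, b = np.polyfit(conc, absb, 1)          # slope ~ 0.04003, intercept ~ -0.00130
print(f"m = {m:.5f}, b = {b:.5f}")

A_unknown = 0.123
c_unknown = (A_unknown - b) / m           # invert the calibration line
print(f"concentration = {c_unknown:.2f} mM")   # ~ 3.11 mM
```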
Procedure
To obtain a calibrated value for an unknown sample, we follow this procedure:
• We measure a set of R_cal values for a number of standards with known values V_cal
• We construct a regression line, i.e. determine the best s and b values
• We measure R_unk for the unknown sample
• We calculate V_unk = (R_unk − b)/s
Of course the calibrated value V_unk is subject to error. In fact it is subject to two kinds of error:
• The random error due to the measurement: ε_unk/s
• Whatever residual systematic calibration error is left despite our calibration
Trumpets: the confidence limits of a line
The calibration error can be represented statistically by drawing the 95% confidence limits around the calibration line. These limits form the two branches of a hyperbolic function. The total error (calibration + random measurement of the unknown) is given by the prediction limits. These also form a somewhat wider set of hyperbolic branches, given by:

$$y_{\text{pred}} = b + mx \pm t\,s_y\sqrt{1 + \frac{1}{N} + \frac{N(x-\bar{x})^2}{N\sum_{i=1}^{N}x_i^2 - \left(\sum_{i=1}^{N}x_i\right)^2}}$$

These are the trumpets.
Limits of calibration
[Figure: response vs. standard values, showing the calibration (regression) line, the 95% confidence limits (for the line), and the wider 95% prediction limits (for one point).]
Replicates and the significance of the t-value
If we take n replicate measurements of the unknown, the (outer) hyperbola gradually becomes narrower, eventually converging to the (inner) confidence limit as the 1/n term goes to zero. The inner limits represent the error due to calibration and can only be improved by doing a better calibration job. The quantity D is the determinant of the (XᵀX) matrix.

The value of N represents the number of calibration standards used. The value of t is the appropriate t-value at the given number of degrees of freedom (N − 2) and the desired confidence level (usually 95%, i.e. p = 0.05). The standard values are denoted by X. The center of the calibration set is given by the average of all X values; this is the narrowest point, where the error in the slope does not contribute.
Inverting the process: using the calibration line
For a given calibration line R = sV + b, as we saw above, we obtain a calibrated value for the unknown by taking the inverse function of the calibration line using the best estimates for s and b:

$$V_{unk} = \frac{R_{unk} - b}{s}$$

Graphically we can represent this as 'reading back' a value on the Y-axis (the measured R values) toward the X-axis (representing the calibrated V values). Let us assume that the random error in each individual measurement is the same for all measurements (calibration and unknown alike). We can then predict with, say, 95% confidence that a subsequent measurement of an unknown substance must fall within the outer hyperbolas. Since we know the response R (on the Y-axis), we can use the corresponding V values on the X-axis as confidence limits for our unknown V value.
[Figure: calibration line with 95% confidence and prediction bands; an unknown R is read back through the bands to give the calibrated value V and its confidence limits, and a marginal unknown R defines the LOD.]
The outer prediction limits ('trumpets') around the calibration line fix the LOD and the confidence limits of the calibrated value V.
Example from the Trumpets worksheet 2

added    meas
0        0.121835
0        0.122289
0        0.12266
0.1      0.214666
0.2      0.30311
0.5      0.573356
0.7      0.7528
0.7      0.75219
1        1.022785

The inner (confidence) and outer (prediction) limits are, with N = df + 2 calibration points (df = N − 2 degrees of freedom):

$$\text{inner} = b + mx \pm t\,s_y \sqrt{\frac{1}{N} + \frac{N(x - \bar{x})^2}{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}}$$

$$\text{outer} = b + mx \pm t\,s_y \sqrt{1 + \frac{1}{N} + \frac{N(x - \bar{x})^2}{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}}$$
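A sketch, assuming NumPy and SciPy, that implements the inner and outer hyperbolas above and reproduces the confidence and prediction columns of the table that follows:

```python
# Confidence ("inner") and prediction ("outer") limits around a calibration
# line, using the Trumpets worksheet data; t is evaluated at df = N - 2.
import numpy as np
from scipy import stats

x = np.array([0, 0, 0, 0.1, 0.2, 0.5, 0.7, 0.7, 1.0])
y = np.array([0.121835, 0.122289, 0.12266, 0.214666, 0.30311,
              0.573356, 0.7528, 0.75219, 1.022785])

N = len(x)
m, b = np.polyfit(x, y, 1)                    # slope ~ 0.8999, intercept ~ 0.1229
resid = y - (b + m * x)
s_y = np.sqrt((resid ** 2).sum() / (N - 2))   # RMSE with df = N - 2
t = stats.t.ppf(0.975, df=N - 2)              # ~ 2.3646 for df = 7

xs = np.unique(x)                             # evaluate at the standard values
Sxx = N * (x ** 2).sum() - x.sum() ** 2       # N*sum(x^2) - (sum x)^2 = 10.28
half_conf = t * s_y * np.sqrt(1 / N + N * (xs - x.mean()) ** 2 / Sxx)
half_pred = t * s_y * np.sqrt(1 + 1 / N + N * (xs - x.mean()) ** 2 / Sxx)

for xi, hc, hp in zip(xs, half_conf, half_pred):
    print(f"x = {xi:.2f}: fit = {b + m * xi:.6f}, "
          f"conf +/- {hc:.6f}, pred +/- {hp:.6f}")
```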
Example from the Trumpets worksheet 2

added   meas      fit       Conf95%+  Conf95%−  Pred95%+  Pred95%−
0       0.121835  0.122883  0.123864  0.121901  0.125186  0.120579
0       0.122289  0.122883  0.123864  0.121901  0.125186  0.120579
0       0.12266   0.122883  0.123864  0.121901  0.125186  0.120579
0.1     0.214666  0.212875  0.21373   0.21202   0.215127  0.210622
0.2     0.30311   0.302867  0.303625  0.302109  0.305084  0.300649
0.5     0.573356  0.572843  0.573593  0.572094  0.575058  0.570629
0.7     0.7528    0.752827  0.753794  0.751861  0.755124  0.75053
0.7     0.75219   0.752827  0.753794  0.751861  0.755124  0.75053
1       1.022785  1.022804  1.02424   1.021368  1.025334  1.020273

Regression statistics (LINEST output plus auxiliary quantities):
slope = 0.899921          intercept = 0.122883
se(slope) = 0.000825      se(intercept) = 0.000415
R² = 0.999994             RMSE (s_y) = 0.000881
F = 1191074               df = 7
SS_reg = 0.925038         SS_resid = 5.44E-06
avg(x) = 0.355556
sum(x) = 3.2
sum(x²) = 2.28
t-value = 2.364624
N·sum(x²) − (sum(x))² = 10.28
x       y-calc    95% conf+  95% conf−  95% pred+  95% pred−
0.348   0.436055  0.43675    0.43536    0.438252   0.433858
0.349   0.436955  0.43765    0.43626    0.439152   0.434758
0.35    0.437855  0.43855    0.43716    0.440052   0.435658
0.351   0.438755  0.43945    0.43806    0.440952   0.436558
0.352   0.439655  0.440349   0.43896    0.441851   0.437458
0.353   0.440555  0.441249   0.43986    0.442751   0.438358
0.354   0.441455  0.442149   0.44076    0.443651   0.439258
0.355   0.442355  0.443049   0.44166    0.444551   0.440158
0.356   0.443255  0.443949   0.44256    0.445451   0.441058
0.357   0.444154  0.444849   0.44346    0.446351   0.441958
