
Lecture 2.2 - Statistics - Desc Stat and Distrib

This document provides an overview of descriptive statistics concepts including measures of central tendency (mean, median, mode), dispersion (range, standard deviation, interquartile range), skewness, accuracy, and precision. Key points covered include how to calculate and interpret the mean, median, range, standard deviation, quartiles, and using accuracy and precision to evaluate data quality. Descriptive statistics are used to assess and describe the characteristics of data through quantitative measures.

Chemometrics

Lecture notes

Part 1. Statistics (cont'd)


2.4. Descriptive statistics
• How do you assess the total error?
- One way to assess total error is to treat a reference standard as a sample.
- The reference standard is carried through the entire process to see how close the results are to the reference value.
Analysis of Data
1. Population and samples
2. Mean, range, median, standard deviation, relative standard deviation
3. The distributions of repeated measurements (normal and log-normal distributions)
4. Expressing data using plots and box plots
5. Confidence limits of the mean and the geometric mean
1. Population vs. sample
• Population = the entire collection of items
e.g. + all 100 mg vitamin C tablets produced
+ all water in a lake.
+ soil collected in a fixed area.
• Sample = a portion of the population
e.g. + a bottle of vitamin C pills
+ 500 mL of surface water
+ 0.5 kg of soil at a place

Generally, only data for samples are available, since it is usually impossible to obtain data for the whole population.
Mean of the population and sample

• Population (N objects): $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
• Sample, part of the population (n objects): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

In practice, $\bar{x} \to \mu$ rapidly after 20-30 measurements.



2. Central tendency measures: range, mean, median, mode
• Range: R = x_max - x_min
• Mode: the value that appears most often in a set of data.

For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.
Given the list of data [1, 1, 2, 4, 4], the mode is not unique: the dataset may be said to be bimodal, while a set with more than two modes may be described as multimodal.
For a normal distribution: mean = median = mode
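The two mode examples and the range can be checked with a short sketch. It uses only Python's standard library; the variable names are illustrative and not part of the lecture.

```python
from statistics import multimode

sample = [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]
bimodal = [1, 1, 2, 4, 4]

# multimode returns every value tied for the highest count,
# so a bimodal data set yields two entries.
modes_sample = multimode(sample)
modes_bimodal = multimode(bimodal)

# Range: R = x_max - x_min
data_range = max(sample) - min(sample)

print(modes_sample, modes_bimodal, data_range)
```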
2a. Mean, range, median, standard deviation, relative standard deviation

Mean versus median

Definition
• Mean: the arithmetic average of a set of numbers, or of a distribution.
• Median: the numeric value separating the higher half of a sample, a population, or a probability distribution from the lower half.

Applicability
• Mean: used for normal distributions.
• Median: generally used for skewed distributions.

Robustness
• Mean: not a robust tool, since it is strongly influenced by outliers.
• Median: better suited to skewed distributions as a measure of central tendency, since it is much more robust.

Expression
• Mean: the mean of a set of numbers x1, x2, ..., xn is usually denoted by x̅.
• Median: there is no generally accepted standard symbol for the median; it is commonly denoted by the letter ‘m’.
3. Dispersion measures: range, bias, standard deviation, relative standard deviation, quartiles
• Bias: a systematic (not random) deviation from the true value.
  Bias = measured value - true value
• Quartiles: the three points that divide the data set into four equal groups, each representing a fourth of the population being sampled.
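As a minimal sketch of the bias definition, the snippet below treats a reference standard as a sample, as suggested at the start of this section. The replicate values and the reference value are invented for illustration, and averaging the replicates before subtracting is an assumption rather than something the slides specify.

```python
from statistics import mean

def bias(measured_values, true_value):
    """Bias = mean measured value - true (reference) value."""
    return mean(measured_values) - true_value

# Hypothetical replicate results for a reference standard (illustrative only).
replicates = [10.2, 10.4, 10.3, 10.5, 10.1]
certified_value = 10.0  # hypothetical reference value

b = bias(replicates, certified_value)
print(f"bias = {b:+.2f}")
```

A positive result means the method reads high on average; a negative one means it reads low.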
Example of quartiles
A five-number summary divides your data into four quarters (1st, 2nd, 3rd and 4th).

3 7 9 12 14 15 17 18 40

The lower quartile (Q1) is the second number in the five-number summary. 25% of all the numbers in the set are smaller than Q1. Here Q1 = 9.

The upper quartile (Q3) is the fourth number in the five-number summary. 25% of all the numbers in the set are larger than Q3. Here Q3 = 17.

What percent of all the numbers are between Q1 and Q3?
50% of all the numbers are between Q1 and Q3. This is called the inter-quartile range (IQR).

The size of the IQR is the distance between Q1 and Q3:

17 - 9 = 8
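The quartile example can be reproduced with the standard library. Several quartile conventions exist; the "inclusive" method used here is an assumption, chosen because it matches the values Q1 = 9 and Q3 = 17 used above. Other conventions give slightly different quartiles for the same data.

```python
from statistics import quantiles

data = [3, 7, 9, 12, 14, 15, 17, 18, 40]

# method="inclusive" treats the data as including its own minimum and
# maximum; the default "exclusive" method would give different quartiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary, iqr)
```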
Find the mean and median of the following set of numbers, with and without the outlier (40):
3 12 7 40 9 14 18 15 17

With all nine values: mean is 15, median is 14

With no outliers (40 removed): mean is 11.875, median is 13
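A short check of these numbers, showing that removing the single high value moves the mean much more than the median. Only the standard library is used; the list names are illustrative.

```python
from statistics import mean, median

values = [3, 12, 7, 40, 9, 14, 18, 15, 17]
without_outlier = [v for v in values if v != 40]

mean_all, median_all = mean(values), median(values)
mean_trim, median_trim = mean(without_outlier), median(without_outlier)

# The mean shifts by more than 3 units, the median by only 1.
print(mean_all, median_all)
print(mean_trim, median_trim)
```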
Standard deviation of the population and of a sample

• Population: the actual variation in the population.
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}}$$
• Sample (part of the population): estimates the variation in the population; the sample may not be representative.
$$s = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N-1}}$$
Why divide by N-1 when calculating “s”?

• N-1 = degrees of freedom (Df) of sample


• number of independent values on which a
result is based, or the number of values in
the final calculation of a statistic that are free
to vary
• for a population Df = N
• for a sample Df = N-1
• one Df lost when calculating the Average of a
sample
More on Dfs
To calculate the std. dev. of a random sample, we must first calculate
the mean of that sample and then compute the sum of the several
squared deviations from that mean.
While there will be n such squared deviations only (n - 1) of them are, in
fact, free to assume any value whatsoever.
This is because the final squared deviation from the mean must include the
one value of X such that the sum of all the Xs divided by n will equal the
obtained mean of the sample.
All of the other (n - 1) squared deviations from the mean can, theoretically,
have any values whatsoever.
For these reasons, std. dev. of a sample is said to have only (n - 1) degrees
of freedom.
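The degrees-of-freedom argument can be illustrated numerically. This sketch reuses the nine values from the quartile example and shows that the deviations from the sample mean sum to zero, so the last deviation is fixed once the other n - 1 are known. It also contrasts the N - 1 and N divisors.

```python
from statistics import mean, pstdev, stdev

sample = [3, 7, 9, 12, 14, 15, 17, 18, 40]
x_bar = mean(sample)
deviations = [x - x_bar for x in sample]

# The deviations always sum to zero, so the last one can be
# recovered from the other n - 1 and is not free to vary.
last_from_others = -sum(deviations[:-1])

s = stdev(sample)       # divides by n - 1 (sample standard deviation)
sigma = pstdev(sample)  # divides by n (population standard deviation)

print(sum(deviations), last_from_others, deviations[-1])
print(s, sigma)
```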
Standard deviation of the mean
(standard error)
• When several mean values are calculated, each from a set of N data points, the standard deviation between those mean values is smaller than the standard deviation between individual values by a factor equal to the square root of N.
• s = standard deviation between individual values
• s_m = standard deviation between mean values
$$s_m = \frac{s}{\sqrt{N}}$$
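A minimal sketch of the standard error calculation, applied to the five lab A results from the aluminum table in section 2.5. The function name is illustrative.

```python
from math import sqrt
from statistics import stdev

def standard_error(values):
    """s_m = s / sqrt(N), with s the sample standard deviation."""
    return stdev(values) / sqrt(len(values))

lab_a = [0.016, 0.015, 0.017, 0.016, 0.019]  # lab A results, section 2.5
s = stdev(lab_a)
s_m = standard_error(lab_a)
print(s, s_m)
```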
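Standard deviation of reproducibility (S_R)
(for k analysts and n replicate measurements by each)

With $\bar{x}_j$ the mean of analyst j and $\bar{X}$ the grand mean:
$$\bar{X} = \frac{1}{k}\sum_{j=1}^{k} \bar{x}_j$$
$$S_R^2 = MS_{within} + MS_{between}$$
$$MS_{within} = \frac{\sum_{j=1}^{k}\sum_{i=1}^{n}(x_{ji}-\bar{x}_j)^2}{k(n-1)}$$
$$MS_{between} = \frac{n\sum_{j=1}^{k}(\bar{x}_j-\bar{X})^2}{k-1}$$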
Other ways of expressing the precision of the data:

• Range: $R = x_{max} - x_{min}$
• Median: the "middle number" in a sorted list of numbers
• Variance: $s^2$
• Relative standard deviation: $RSD = \frac{s}{\bar{x}}$
• Percent RSD / coefficient of variation: $\%RSD = \frac{s}{\bar{x}} \times 100$
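These precision measures can be computed directly. The sketch below uses the lab D results from the aluminum table in section 2.5; the function name is illustrative.

```python
from statistics import mean, stdev, variance

def percent_rsd(values):
    """%RSD (coefficient of variation) = s / mean * 100."""
    return stdev(values) / mean(values) * 100

lab_d = [0.011, 0.007, 0.008, 0.010, 0.009]  # lab D results, section 2.5

data_range = max(lab_d) - min(lab_d)
var = variance(lab_d)      # s squared, with the N - 1 divisor
cv = percent_rsd(lab_d)
print(data_range, var, cv)
```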
Measure of Asymmetry
(skewness)
• Skewness is the measure of how asymmetric a distribution is.
• Skewed to the right: the mean and the median are both greater than the mode.
• Skewed to the left: the mean is less than the median.
To express accuracy and precision
The center of the target is the true value.

Both accurate and precise
• Mathematical comments: small standard deviation or %CV; small %error.
• Scientific comments: very small error in measurement. All measurements cluster around the true value. Remember that a standard or true value is needed.

Precise only
• Mathematical comments: small standard deviation or %CV; large %error.
• Scientific comments: clustered multiple measurements, but consistently off from the true value. The calibration of the probe or other measuring device is off, or there is an unknown systematic error.

Neither accurate nor precise
• Mathematical comments: large standard deviation or %CV; large %error.
• Scientific comments: the shot-gun effect. Get a new measurement system or operator.
Expressing accuracy and precision

• Accuracy: mean (average), percent error
• Precision: range, deviation, standard deviation, percent coefficient of variation

Expressing data using plots and box plots
Calculating Statistical Uncertainty
• Mean and standard deviation of a set of N independent measurements (unknown errors, assumed uniform):
$$x_0 = \frac{1}{N}\sum_i x_i = \bar{x}; \qquad \sigma^2 = \frac{1}{N-1}\sum_i (x_i-\bar{x})^2$$
• The standard deviation estimates the likely error of any one measurement.
• The uncertainty in the mean is what is quoted:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}} = \left[\frac{1}{N(N-1)}\sum_i (x_i-\bar{x})^2\right]^{1/2}$$
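Propagating Uncertainties
• Functions of one variable, F = f(x) (general formula):
$$\sigma_F = \frac{df}{dx}\,\sigma_x$$
• Specific cases:
$$\sigma_{x^2} = 2x\,\sigma_x \quad\text{or}\quad \frac{\sigma_{x^2}}{x^2} = 2\,\frac{\sigma_x}{x}$$
$$\sigma_{x^n} = n x^{n-1}\,\sigma_x \quad\text{or}\quad \frac{\sigma_{x^n}}{x^n} = n\,\frac{\sigma_x}{x}$$
$$\sigma_{\sin x} = \cos x\;\sigma_x$$
$$\sigma_{\ln x} = \frac{1}{x}\,\sigma_x$$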
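The single-variable rule can be checked numerically. This sketch estimates df/dx with a central difference and compares the result with the analytic special cases listed above. The measurement value, its uncertainty, and the step size h are hypothetical, and the absolute value is taken so that the uncertainty is always reported as positive.

```python
import math

def propagate_1d(f, x, sigma_x, h=1e-6):
    """Approximate |df/dx| * sigma_x using a central difference."""
    dfdx = (f(x + h) - f(x - h)) / (2 * h)
    return abs(dfdx) * sigma_x

x, sigma_x = 3.0, 0.1  # hypothetical measurement and its uncertainty

numeric_sq = propagate_1d(lambda v: v ** 2, x, sigma_x)
analytic_sq = 2 * x * sigma_x       # uncertainty of x^2

numeric_ln = propagate_1d(math.log, x, sigma_x)
analytic_ln = sigma_x / x           # uncertainty of ln x

print(numeric_sq, analytic_sq, numeric_ln, analytic_ln)
```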
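Propagating Uncertainties
• Functions of more than one variable (general formula):
$$\sigma_f^2 = \left(\frac{\partial f}{\partial x}\right)^2\sigma_x^2 + \left(\frac{\partial f}{\partial y}\right)^2\sigma_y^2$$
• Specific cases:

f = x ± y
  Apply equation: $\sigma_f^2 = \sigma_x^2 + \sigma_y^2$
  Simplify: $\sigma_f^2 = \sigma_x^2 + \sigma_y^2$

f = xy
  Apply equation: $\sigma_f^2 = y^2\sigma_x^2 + x^2\sigma_y^2$
  Simplify: $\left(\frac{\sigma_f}{f}\right)^2 = \left(\frac{\sigma_x}{x}\right)^2 + \left(\frac{\sigma_y}{y}\right)^2$

f = x/y
  Apply equation: $\sigma_f^2 = \frac{\sigma_x^2}{y^2} + \frac{x^2\sigma_y^2}{y^4}$
  Simplify: $\left(\frac{\sigma_f}{f}\right)^2 = \left(\frac{\sigma_x}{x}\right)^2 + \left(\frac{\sigma_y}{y}\right)^2$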
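A small sketch of the three special cases, confirming that the product and the quotient share the same relative uncertainty. The input values are hypothetical and the function names are illustrative.

```python
from math import hypot, isclose

def sigma_sum(sigma_x, sigma_y):
    """f = x + y or f = x - y: absolute uncertainties add in quadrature."""
    return hypot(sigma_x, sigma_y)

def sigma_product(x, sigma_x, y, sigma_y):
    """f = x * y: sigma_f^2 = y^2 sigma_x^2 + x^2 sigma_y^2."""
    return hypot(y * sigma_x, x * sigma_y)

def sigma_quotient(x, sigma_x, y, sigma_y):
    """f = x / y: sigma_f^2 = sigma_x^2 / y^2 + x^2 sigma_y^2 / y^4."""
    return hypot(sigma_x / y, x * sigma_y / y ** 2)

x, sx = 10.0, 0.3  # hypothetical value and uncertainty
y, sy = 4.0, 0.2   # hypothetical value and uncertainty

# Simplified form: relative uncertainties add in quadrature.
relative = hypot(sx / x, sy / y)
rel_product = sigma_product(x, sx, y, sy) / (x * y)
rel_quotient = sigma_quotient(x, sx, y, sy) / (x / y)
print(relative, rel_product, rel_quotient)
```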
Combining Uncertainties
• What if we have two or more measurements of the same quantity, with different uncertainties?
• Obtain the combined mean and uncertainty with:
$$\bar{x} = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}; \qquad \frac{1}{\sigma_{\bar{x}}^2} = \sum_i \frac{1}{\sigma_i^2}$$
• Remember that we are using the uncertainty in the mean here:
$$\sigma_i = \frac{\sigma}{\sqrt{N}}$$
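The weighted combination can be sketched as follows. Each measurement is weighted by the inverse square of its uncertainty, so the more precise measurement dominates. The example values are hypothetical.

```python
from math import isclose, sqrt

def combine(values, sigmas):
    """Inverse-variance weighted mean and its combined uncertainty."""
    weights = [1 / s ** 2 for s in sigmas]
    mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    sigma = 1 / sqrt(sum(weights))
    return mean, sigma

# Two hypothetical measurements of the same quantity.
combined_mean, combined_sigma = combine([10.0, 12.0], [1.0, 2.0])
print(combined_mean, combined_sigma)
```

The combined uncertainty is smaller than either input uncertainty, which is the motivation for combining the measurements at all.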
2.5. Distribution of repeated measurements
• The pattern of variation of a variable is called its
distribution, which can be described both
mathematically and graphically.
• In essence, the distribution records all possible
numerical values of a variable and how often each
value occurs (its frequency).
• Can be either discrete or continuous
• Which statistical test is appropriate will depend
upon the distribution of your data.
Types of Distributions
Note that distributions can be either discrete or continuous.

Univariate data set:
• Binomial distribution
• Normal distribution
• Poisson distribution
• Exponential distribution
• Logistic distribution
• t-distribution

Two or more independent data sets:
• Chi-squared distribution
• F-distribution
• Gamma distribution
• Hypergeometric distribution
• Laplace distribution
Expressing the distribution of a replicated data set
Example: analytical results for the aluminum content of steel from different laboratories (in %)

No.  Lab  X1     X2     X3     X4     X5
1    A    0.016  0.015  0.017  0.016  0.019
2    B    0.017  0.016  0.016  0.016  0.018
3    C    0.015  0.014  0.014  0.014  0.015
4    D    0.011  0.007  0.008  0.010  0.009
5    E    0.011  0.011  0.013  0.012  0.012
6    F    0.012  0.014  0.013  0.013  0.015
7    G    0.011  0.009  0.012  0.010  0.012
8    H    0.011  0.011  0.012  0.014  0.013
9    I    0.012  0.014  0.015  0.013  0.014
10   K    0.015  0.018  0.016  0.017  0.016
11   L    0.015  0.014  0.013  0.014  0.014
12   M    0.012  0.014  0.012  0.013  0.012

Histogram of the data set (x-axis: 8 to 20, in units of 10^-3 %)
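Since the histogram figure itself is not available here, the frequencies can be rebuilt from the table. This sketch stores each result in units of 10^-3 % (so 0.016 % becomes 16) and prints a simple text histogram; the layout of the printout is an illustrative choice.

```python
from collections import Counter

# Results from the table above, in units of 10^-3 % Al (0.016 % -> 16).
results = {
    "A": [16, 15, 17, 16, 19], "B": [17, 16, 16, 16, 18],
    "C": [15, 14, 14, 14, 15], "D": [11, 7, 8, 10, 9],
    "E": [11, 11, 13, 12, 12], "F": [12, 14, 13, 13, 15],
    "G": [11, 9, 12, 10, 12],  "H": [11, 11, 12, 14, 13],
    "I": [12, 14, 15, 13, 14], "K": [15, 18, 16, 17, 16],
    "L": [15, 14, 13, 14, 14], "M": [12, 14, 12, 13, 12],
}

all_values = [v for lab in results.values() for v in lab]
counts = Counter(all_values)  # frequency of each distinct result

# One row per value, one star per occurrence.
for value in range(min(all_values), max(all_values) + 1):
    print(f"{value:>2} | {'*' * counts[value]}")
```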
The distributions of repeated measurements
For an infinite set of data, as N → ∞, x̄ → µ (the population mean) and s → σ (the population standard deviation).

The experiment that produces a smaller standard deviation is more precise.
Remember, greater precision does not imply greater accuracy.
Experimental results are commonly expressed in the form:
mean ± standard deviation, that is, $\bar{x} \pm s$
Normal Distribution
For a large number of experimental replicates, the results approach an ideal smooth curve called the GAUSSIAN or NORMAL DISTRIBUTION CURVE.

Characterised by:
• The mean value x̄, which gives the center of the distribution
• The standard deviation s, which measures the width of the distribution
Theoretical distribution of data: normal and t-distribution
Gaussian Distribution of Random Errors
Another way to represent a Gaussian distribution is to relate it to a new variable, z, on the x-axis:

$$z = \frac{x-\mu}{\sigma} \approx \frac{x-\bar{x}}{s}$$

where z = the deviation of a data point from the mean, stated in units of standard deviation.
Normal distribution
The standard deviation measures the width of the Gaussian curve.
(The larger the value of σ, the broader the curve.)

Range      Percentage of measurements
µ ± 1σ     68.3
µ ± 2σ     95.5
µ ± 3σ     99.7

The more times you measure, the more confident you are that your average value is approaching the “true” value.
The uncertainty decreases in proportion to $1/\sqrt{n}$.
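The percentages in the table follow from the normal distribution itself: the fraction of values within µ ± kσ equals erf(k/√2). A minimal sketch using only the standard library:

```python
from math import erf, sqrt

def fraction_within(k):
    """Fraction of a normal population lying within mu +/- k*sigma."""
    return erf(k / sqrt(2))

percentages = {k: 100 * fraction_within(k) for k in (1, 2, 3)}
for k, pct in percentages.items():
    print(f"mu +/- {k} sigma: {pct:.1f} %")
```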
t-distributions

[Figure: t-distribution curves compared with the normal distribution]

- As N (DF) increases, the t-distribution becomes less spread out.
- At large N, the t-distribution approaches the shape of the Gaussian distribution.
t-distribution ( 1-sided)
Data Transformation
• What do you do if your data are not normally distributed?
  - Use a non-parametric test
  - Transform your data
• Logarithmic transformation:
  Variable x → log(Variable x + 1)
• Power transformation:
  e.g. Variable x → √(Variable x)
• Angular transformation:
  e.g. Variable x → arcsine(√(Variable x))
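The three transformations can be written as small helper functions. The function names are illustrative, and base 10 for the logarithm is an assumption, since the slide only writes "log". The angular transformation assumes the input is a proportion between 0 and 1 and returns radians.

```python
from math import asin, log10, sqrt

def log_transform(x):
    """x -> log(x + 1); base 10 is assumed here."""
    return log10(x + 1)

def power_transform(x):
    """x -> square root of x (one example of a power transformation)."""
    return sqrt(x)

def angular_transform(p):
    """p -> arcsine(sqrt(p)), for a proportion p in [0, 1]; radians."""
    return asin(sqrt(p))

print(log_transform(9), power_transform(16), angular_transform(0.5))
```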
Log transformation
1. Compute the logarithm of each data value and
then analyze the resulting data.
2. Effect of log transformation
- If your data are skewed to the right, a log
transformation can sometimes produce a data set
that is closer to symmetric.
- If the data are symmetric or skewed to the left, a log
transformation could actually make things worse.
- If your data has outliers on the high end, a log
transformation can sometimes help.
Log transformation
When should you consider a log
transformation?
- Is your data bounded below by zero?
- Is your data defined as a ratio?
- Is the largest value in your data more than
three times larger than the smallest value?
Exponential Distribution
• Describes a sample where y = a^x
• Messy to work with, but can (sometimes) be transformed, or you can use a non-parametric test
Logistic Distribution
• Typically describes a sample that fits y = log(x)
• Again, (sometimes) messy to work with, but can be transformed, or you can use a non-parametric test
2.6. Confidence limits of the mean and the geometric mean

• The geometric mean is a measure of central tendency, just like the median.
• It uses multiplication rather than addition to summarize data values.
• The logarithm of the geometric mean is the arithmetic mean of the logs of the data, $\frac{1}{n}\sum \log x_i$, so the geometric mean is obtained by taking logs, averaging, and transforming back.
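The two descriptions of the geometric mean (multiplication, and the arithmetic mean of logs) give the same number. A short sketch with illustrative positive data:

```python
from math import exp, isclose, log, prod
from statistics import geometric_mean

values = [1.0, 10.0, 100.0]  # illustrative positive data
n = len(values)

gm_product = prod(values) ** (1 / n)             # n-th root of the product
gm_logs = exp(sum(log(v) for v in values) / n)   # back-transformed mean of logs
gm_stdlib = geometric_mean(values)               # standard-library equivalent

print(gm_product, gm_logs, gm_stdlib)
```

The geometric mean is only defined for positive values, which is one reason it pairs naturally with log-transformed data.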
Estimating Random Error
• The random error (Δx) in a set of data can be estimated by multiplying s_m by a statistical function called the Student's t distribution function:
$$\Delta x = t_{p,f}\, s_m = \frac{t_{p,f}\, s}{\sqrt{N}}$$
Confidence intervals
• Δx at a given confidence level (say 95%) implies that the true value µ will be found within ±Δx of the calculated mean:
$$\mu = \bar{x} \pm \frac{t_{p,f}\, s}{\sqrt{N}}$$
$$\bar{x} - \frac{t_{p,f}\, s}{\sqrt{N}} < \mu < \bar{x} + \frac{t_{p,f}\, s}{\sqrt{N}}$$
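A minimal sketch of the confidence interval calculation for the five lab A results from section 2.5. The t value is passed in by hand: 2.776 is the two-sided 95% Student's t value for 4 degrees of freedom from standard tables. Whether the lecture intends a one-sided or two-sided value is not stated, so the two-sided choice is an assumption.

```python
from math import sqrt
from statistics import mean, stdev

def confidence_interval(values, t_value):
    """Return (low, high) = mean -/+ t * s / sqrt(N)."""
    n = len(values)
    x_bar = mean(values)
    half_width = t_value * stdev(values) / sqrt(n)
    return x_bar - half_width, x_bar + half_width

lab_a = [0.016, 0.015, 0.017, 0.016, 0.019]  # lab A results, section 2.5
T_95_F4 = 2.776  # two-sided 95 % t value for f = N - 1 = 4 (standard tables)

low, high = confidence_interval(lab_a, T_95_F4)
print(f"{low:.4f} < mu < {high:.4f}")
```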
How many samples/replicates to analyze?
Rearranging Student's t equation:
$$\mu = \bar{x} \pm \frac{t s}{\sqrt{n}}, \qquad e = \frac{t s}{\sqrt{n}}$$
Required number of replicate analyses:
$$n = \frac{t^2 s^2}{e^2}$$
µ = true population mean
x̄ = measured mean
n = number of samples needed
s² = variance of the sampling operation
e = sought-for uncertainty

Since the number of degrees of freedom is not known at this stage, the value t = 1.96 for n → ∞ is used to estimate n. The process is then repeated a few times, using the t value for the new n each time, until a constant value for n is found.
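The iteration described above can be sketched as follows. The t values are two-sided 95% values from standard tables for 1 to 30 degrees of freedom; falling back to 1.96 beyond the table is a simplifying assumption. The values of s and e in the example are hypothetical, and the iteration cap guards against the estimate alternating between two neighbouring values.

```python
from math import ceil

# Two-sided 95 % Student's t values for 1..30 degrees of freedom (standard tables).
T_95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
        2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
        2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042]

def t_95(df):
    """Look up t; beyond the table, fall back to the n -> infinity value."""
    return T_95[df - 1] if df <= len(T_95) else 1.96

def required_replicates(s, e, max_iter=20):
    """Iterate n = (t * s / e)^2, starting from t = 1.96."""
    n = max(2, ceil((1.96 * s / e) ** 2))
    for _ in range(max_iter):
        n_new = max(2, ceil((t_95(n - 1) * s / e) ** 2))
        if n_new == n:  # constant value reached
            break
        n = n_new
    return n

n_needed = required_replicates(s=2.0, e=1.0)  # hypothetical s and e
print(n_needed)
```

With these inputs the successive estimates are 16, 19 and 18, after which n stays constant.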
