Week 8

This document discusses analysis of variance (ANOVA) and assumptions for parametric statistical tests including normality and outliers. It covers checking for normality, one-way and two-way ANOVA, applications of ANOVA in Excel, outliers in analytical data, and robust and non-parametric statistics. Graphical and statistical tests for verifying data distribution are also presented, such as histograms, normal quantile plots, and the Kolmogorov-Smirnov test.

Uploaded by

Reza Joia
MODULE 4. Analysis of variance. Assumptions for parametric statistical tests: normality and outliers. Non-parametric and robust statistics

Week 8
L 8. Checking normality. Analysis of variance. One-way and two-way ANOVA.    1
PC 8. Applications of ANOVA in Excel using various chemical data.    2    5
IWST 4. Exercises and problems regarding ANOVA and normality tests.

Week 9
L 9. Outliers in analytical data. Detecting outliers using Dixon and Grubbs tests. Robust and non-parametric statistics.    1
PC 9. Practical application involving outlier detection and the non-parametric Wilcoxon and Mann-Whitney tests.    2    10
IWS 3. Individual work with exercises and problems regarding outlier testing, ANOVA, non-parametric and robust statistics.    10

Learning resources:
1. Stephen Kokoska, Introductory Statistics: A Problem-Solving Approach, Publisher: WH Freeman; 3rd edition, chapters 6, 11
2. James Miller, Jane Miller, Robert Miller, Statistics and Chemometrics for Analytical Chemistry, Publisher: Pearson Education; 7th edition, chapter 3
3. Stephen L. R. Ellison, Vicki J. Barwick, Trevor J. Duguid Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Publisher: Royal Society of Chemistry; 2nd edition, chapter 6

Data symmetry based on descriptive statistics

[Figure: three box plots with quartiles Q1, Q2, Q3 marked]

• Positively skewed distribution (right tailed): (Q3-Q2) > (Q2-Q1)
• Symmetric distribution: (Q3-Q2) = (Q2-Q1)
• Negatively skewed distribution (left tailed): (Q3-Q2) < (Q2-Q1)
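
The quartile comparison can be sketched in a few lines of Python. This is only an illustration: the function name `quartile_shape` and the tolerance are assumptions, and `statistics.quantiles` uses its default "exclusive" method, which may give slightly different quartiles than Excel or Statistica.

```python
import math
from statistics import quantiles


def quartile_shape(data, rel_tol=1e-9):
    """Classify symmetry by comparing (Q3 - Q2) with (Q2 - Q1)."""
    q1, q2, q3 = quantiles(data, n=4)  # three cut points: Q1, Q2 (median), Q3
    upper, lower = q3 - q2, q2 - q1
    if math.isclose(upper, lower, rel_tol=rel_tol, abs_tol=1e-12):
        return "symmetric"
    return "positively skewed" if upper > lower else "negatively skewed"


# Hypothetical data sets, chosen only to illustrate the three cases.
print(quartile_shape([1, 2, 3, 4, 5, 6, 7]))
print(quartile_shape([1, 1, 2, 2, 3, 5, 9]))
print(quartile_shape([1, 5, 7, 8, 8, 9, 9]))
```

With real measurements the two differences are rarely exactly equal, so in practice the comparison indicates a tendency rather than a strict classification.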


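kth raw moment: $\mu'_k = E[X^k]$
kth moment about a value $a$: $E[(X - a)^k]$

• $a = 0$: raw moment
• $a = \mu$: moment about the mean (central moment), $\mu_k = E[(X - \mu)^k]$
• standardized central moment: $\mu_k / \sigma^k$

Skewness

$$b_1 = \frac{n}{(n-1)(n-2)} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}$$

Kurtosis ...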
Skewness
If:
• β1 = 0, the data are symmetric
• β1 > 0, positive asymmetry
• β1 < 0, negative asymmetry

[Figure: distribution curves with negative asymmetry (β1 < 0), symmetry (β1 = 0) and positive asymmetry (β1 > 0); example curves for β1 = 0.5, 1 and 1.5]

Kurtosis (Peakedness)
For a normal distribution, β2 = 3. Therefore, we define β2' = β2 - 3, the excess kurtosis, which takes values in [-2, ∞).
If:
• β2' = 0, the distribution is mesokurtic (as for a normal distribution)
• β2' > 0, leptokurtic distribution
• β2' < 0, platykurtic distribution

[Figure: leptokurtic (β2' > 0), mesokurtic (β2' = 0) and platykurtic (β2' < 0) curves]
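
A minimal Python sketch of these two shape statistics follows. The skewness function implements the b1 formula given above. The kurtosis formula is not shown in the source, so the sketch assumes the analogous small-sample adjusted estimator (the one Excel's KURT function uses). The standard-error formulas are also an addition, valid for samples from a normal population.

```python
import math


def sample_skewness(x):
    """b1 = n / ((n-1)(n-2)) * sum((xi - mean)^3) / s^3."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum((v - mean) ** 3 for v in x) / s ** 3


def sample_excess_kurtosis(x):
    """Adjusted excess kurtosis (assumed estimator, not taken from the source)."""
    n = len(x)
    mean = sum(x) / n
    s2 = sum((v - mean) ** 2 for v in x) / (n - 1)
    fourth = sum((v - mean) ** 4 for v in x) / s2 ** 2
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * fourth \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


def standard_errors(n):
    """Standard errors of skewness and excess kurtosis under normality (assumption)."""
    ses = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    sek = 2 * ses * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return ses, sek


values = [1, 2, 3, 4, 5]  # hypothetical, perfectly symmetric data
print(sample_skewness(values), sample_excess_kurtosis(values))
print(standard_errors(371))
```

A statistic that is small compared with its standard error gives no evidence against symmetry (for β1) or against mesokurtic shape (for β2').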
Verification of data distribution through graphical representations
i. Histogram
ii. Normal quantile plot (normal quantile-quantile plot, QQ plot)
iii. Stem and leaf plot

Verification of data distribution by statistical tests
i. χ2 test
ii. Kolmogorov-Smirnov test
iii. Shapiro-Wilk test

i. Histogram

• The interval containing the values, ordered ascendingly, is divided into subintervals (classes) of equal width.
• The optimal number of subintervals is established by Sturges' formula: $k = 1 + \log_2 n \approx 1 + 3.322 \log_{10} n$, where n is the number of observations.
• On the width of these intervals, rectangles are constructed with height proportional to the relative frequency.

[Figure: histogram of rice length (mm); x-axis: rice length from 3.5 to 9.5 mm; y-axis: number of observations, 0 to 100; β1 = 0.243 ± 0.127, β2' = -0.592 ± 0.253]
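
The construction steps above can be sketched in Python. The sketch assumes that the rule for the number of classes is Sturges' rule, k = 1 + log2(n), rounded up, and that all classes have equal width. The function names are illustrative, and the data must not all be identical (the class width would be zero).

```python
import math


def sturges_classes(n):
    """Number of classes k = 1 + log2(n), rounded up (rounding rule is an assumption)."""
    return math.ceil(1 + math.log2(n))


def relative_frequencies(values, k=None):
    """Relative frequency of each of k equal-width classes spanning min..max."""
    k = k or sturges_classes(len(values))
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # The maximum value falls on the upper boundary; assign it to the last class.
        idx = min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


print(sturges_classes(100))
print(relative_frequencies([1, 2, 3, 4, 5, 6, 7, 8], k=4))
```

The heights of the histogram rectangles are then proportional to the returned relative frequencies.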
ii. Normal quantile-quantile plot

Plot the i-th ordered value versus the corresponding quantile of the standard normal distribution (the expected z-score), or vice versa. The plot allows the identification of potential atypical points. A more general formula can be used to determine the quantile in the corresponding normal distribution.

[Figure: normal quantile-quantile plot; x-axis: rice length (mm), 3 to 9; y-axis: expected z-scores, -4 to 4, with reference lines at ±2.6 and ±3.0]

iii. Stem and leaf plot

A stem and leaf plot is a way of organizing data into a form that allows for an easy visual perception of frequencies for different types of values. Such a presentation allows easy determination of quantiles as well as of the data distribution profile. It also allows the identification of potential atypical points. It requires establishing a code for the stem and leaves.

    3 |
    4 | 034
    4 | 6788899999
    5 | 00011111112222222233333333344444444444
    5 | 55555556666666666677777777777888888888899999999
    6 | 000000011111122222233333333444444
    6 | 5555555666666777778888889999999
    7 | 000000001111111122222233444
    7 | 555667
    8 |
    8 | 7
    9 |

4 | 6 means 4.6; min = 4.0, max = 8.7, total n: 371

Stem | leaf (leaf unit = 0.1, e.g., 6 | 5 = 6.5)


Results from Statistica software.
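
A stem and leaf display of this kind can be produced with a short Python sketch. It assumes positive values recorded to one decimal place, with the integer part as the stem and the first decimal as the leaf, and it splits each stem into two rows (leaves 0 to 4 and 5 to 9) as in the display above. The function name and the sample values are illustrative only.

```python
from collections import defaultdict


def stem_and_leaf(values, split=True):
    """Return display rows 'stem | leaves' for positive one-decimal values."""
    rows = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)           # 4.6 -> 46
        stem, leaf = divmod(tenths, 10)  # 46 -> stem 4, leaf 6
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))
    return [f"{stem} | {''.join(leaves)}"
            for (stem, half), leaves in sorted(rows.items())]


# A few values read off the display, for illustration.
for row in stem_and_leaf([4.0, 4.3, 4.4, 4.6, 8.7]):
    print(row)
```

Unlike the Statistica output, this sketch prints only stems that contain at least one value.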
Histogram: Rice length (mm)
K-S d = .07410, p < .05; Lilliefors p < .01
Shapiro-Wilk W = .97839, p = .00002

[Figure: histogram of rice length (mm); x-axis: "X <= Category Boundary", 3 to 9; y-axis: "No. of obs.", 0 to 220]

Normality/Symmetry Graphs | Real Statistics Using Excel (real-statistics.com)

Example using Statistica soft:
Using a QQ plot, determine whether the data set with 8 elements {-5.2, -3.9, -2.1, 0.2, 1.1, 2.7, 4.9, 5.3} is normally distributed.
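
The points of a QQ plot for the 8-element data set can be computed with the Python standard library. This sketch assumes the plotting position (i - 0.5)/n for the i-th ordered value, which is one common convention. The formula used in the original slides is not shown, so the expected z-scores may differ slightly from those produced by Statistica or Excel.

```python
from statistics import NormalDist, mean


def normal_qq_points(data):
    """Pair each ordered value with its expected z-score, using p_i = (i - 0.5) / n."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(x, std_normal.inv_cdf((i - 0.5) / n)) for i, x in enumerate(xs, start=1)]


def qq_correlation(points):
    """Pearson correlation between ordered values and expected z-scores."""
    xs = [p[0] for p in points]
    zs = [p[1] for p in points]
    mx, mz = mean(xs), mean(zs)
    sxz = sum((x - mx) * (z - mz) for x, z in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    return sxz / (sxx * szz) ** 0.5


data = [-5.2, -3.9, -2.1, 0.2, 1.1, 2.7, 4.9, 5.3]
points = normal_qq_points(data)
for x, z in points:
    print(f"{x:6.1f}  {z:7.3f}")
print("correlation:", round(qq_correlation(points), 3))
```

Points lying close to a straight line (a correlation close to 1) are consistent with normality, although with only 8 values the plot cannot give strong evidence either way.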
One-way Analysis of Variance (ANOVA)
From a practical point of view, ANOVA extends the t-test for comparing two independent samples (when the variances are unknown) to more than two samples. Basically, ANOVA tests the effect of a single factor (an independent variable) on a dependent variable for more than two samples (at several levels). For two or more factors, two-way (bifactorial) or multi-factor ANOVA is used; for several dependent variables, MANOVA is applied.
Examples of factors tested:
• qualitative (catalyst, operator, a particular analytical method, etc.)
• quantitative (pH, temperature, pressure, etc.)
The dependent variable can be any quantity, measured or quantitatively assessed, at the different levels of the tested factor.

Thus, the statistical hypotheses are:

H0: there is no difference between the population means, μA = μB = μC = ...
H1: at least one mean differs, μp ≠ μq, for at least one pair p ≠ q
There are k groups (treatments, methods, etc.), that is, k levels of the same factor. Each group contains n_j values. The j-th group is T_j, and x_ij is the i-th measurement from the j-th group (i = 1, ..., n_j; j = 1, ..., k).
j: index for the position of a group
i: index for the position of a value in a group

    T1      T2      ...  Tj      ...  Tk
    x11     x12     ...  x1j     ...  x1k
    x21     x22     ...  x2j     ...  x2k
    .       .       ...  .       ...  .
    .       .       ...  xij     ...  .
    .       .       ...  .       ...  .
    x(n1)1  x(n2)2  ...  x(nj)j  ...  x(nk)k
Sums of squares, with group means $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_j, \dots, \bar{x}_k$ and overall mean $\bar{x}$:

$$SS_j = \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$$

$$SS_W = \sum_{j=1}^{k} SS_j = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$$

$$SS_B = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2$$

$$SS_T = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2$$

Degrees of freedom:

$$\nu_W = \sum_{j=1}^{k} (n_j - 1) = n - k, \qquad \nu_B = k - 1, \qquad \nu_T = n - 1$$
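
The sums of squares and degrees of freedom defined here are additive, which is what allows the total variation to be split into a within-group and a between-group part. A brief sketch of why, using the same symbols (the cross term vanishes because the deviations from a group mean sum to zero within each group):

```latex
\begin{aligned}
SS_T &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij}-\bar{x})^2
      = \sum_{j=1}^{k}\sum_{i=1}^{n_j} \big[(x_{ij}-\bar{x}_j) + (\bar{x}_j-\bar{x})\big]^2 \\
     &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij}-\bar{x}_j)^2
      + \sum_{j=1}^{k} n_j(\bar{x}_j-\bar{x})^2
      + 2\sum_{j=1}^{k}(\bar{x}_j-\bar{x})\underbrace{\sum_{i=1}^{n_j} (x_{ij}-\bar{x}_j)}_{=\,0} \\
     &= SS_W + SS_B, \\
\nu_T &= n-1 = (n-k)+(k-1) = \nu_W+\nu_B .
\end{aligned}
```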
• there is a variance within the groups (Within): internal, residual
• we suspect a variance between the groups (Between): external, explained
• if the factor has no effect, there is no difference between the two variances
• we define an overall average (Total), $\bar{x}$

Mean for the j-th group:

$$\bar{x}_j = \frac{\sum_{i=1}^{n_j} x_{ij}}{n_j}, \qquad j = 1, \dots, k$$

    Source    index   degrees of freedom (ν)   sum of squares (SS)   mean squares (MS)
    Within    W       ν_W = n - k              SS_W                  MS_W = SS_W / ν_W
    Between   B       ν_B = k - 1              SS_B                  MS_B = SS_B / ν_B
    Total     T       ν_T = n - 1              SS_T
If the null hypothesis is true, $MS_B$ and $MS_W$ are both a measure of random errors and we expect that $F = MS_B / MS_W \approx 1$, and if it is false, $MS_B > MS_W$, so $F > 1$.

Decision:
• If $F \le F_{crit}(\alpha; \nu_B, \nu_W)$, the null hypothesis is not rejected (the tested factor has no significant effect, μA = μB = μC = ...)
• If $F > F_{crit}(\alpha; \nu_B, \nu_W)$, reject the null hypothesis and accept the alternative hypothesis (the tested factor has a significant effect, at least one mean differs, μp ≠ μq, for a certain p ≠ q)
Example: The table below shows the results obtained in a stability study of a fluorescent reagent stored under different conditions. The values given are fluorescence signals (in arbitrary units) from solutions diluted to the same concentration. Three measurements were made on each sample. The table shows that the average values for the four samples are different. However, because of random error, the sample average may vary from sample to sample even if the true value we are trying to measure is unchanged. Using ANOVA, test (α = 0.05) whether the difference between the sample means is too large to be explained by random errors.
    A                  B                       C                        D
    (freshly diluted)  (after 1h in the dark)  (after 1h in the shade)  (after 1h in light)
    102                101                     97                       90
    100                101                     95                       92
    101                104                     99                       94

Overall mean: $\bar{x} = 98$

ANOVA: Single Factor

SUMMARY
    Groups     Count   Sum   Average   Variance
    Column 1   3       303   101       1
    Column 2   3       306   102       3
    Column 3   3       291   97        4
    Column 4   3       276   92        4

ANOVA
    Source of Variation   SS    df   MS   F       P-value   F crit
    Between Groups        186   3    62   20.67   0.0004    4.07
    Within Groups         24    8    3
    Total                 210   11

Since F = 20.67 > F crit = 4.07 (P-value = 0.0004 < 0.05), we reject H0; the tested effect is strongly significant.
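
The Excel output for this example can be cross-checked with a short Python sketch of the one-way ANOVA formulas. The function name `one_way_anova` is illustrative, and the critical value 4.07 is taken from the Excel output above, not computed.

```python
def one_way_anova(groups):
    """One-way ANOVA table entries from a list of groups of measurements."""
    n = sum(len(g) for g in groups)           # total number of values
    k = len(groups)                           # number of groups (levels)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "SS_B": ss_between, "SS_W": ss_within, "SS_T": ss_between + ss_within,
        "df_B": df_between, "df_W": df_within,
        "MS_B": ms_between, "MS_W": ms_within,
        "F": ms_between / ms_within,
    }


# Fluorescence signals for samples A, B, C and D from the example.
fluorescence = [[102, 100, 101], [101, 101, 104], [97, 95, 99], [90, 92, 94]]
result = one_way_anova(fluorescence)
f_crit = 4.07  # F crit for alpha = 0.05, df = (3, 8), as reported by Excel
print(result)
print("reject H0" if result["F"] > f_crit else "do not reject H0")
```

The computed SS, df, MS and F values agree with the Excel table (SS between = 186, SS within = 24, F ≈ 20.67).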


IWST

P1. The following results show the percentage of total interstitial water that was recovered by centrifuging samples taken at different depths in stone sediment. Choose a statistical test and show (with 95% probability) that the percentage of recovered water differs significantly at different depths.

    Depth (m)   Water recovered (%)
    7           33.3   33.3   35.7   38.1   31.0   33.3
    8           43.6   45.2   47.7   45.4   43.8   46.5
    16          73.2   68.7   73.6   70.9   72.5   74.5
    23          72.5   70.4   65.2   66.7   77.6   69.8

P2. Six analysts each made six determinations of paracetamol content in the same batch of tablets. The results are presented below. Test if there is any significant difference (α = 0.05) between the averages obtained by the six analysts.

    Analyst   Paracetamol (% m/m)
    A         84.32   84.51   84.63   84.61   84.64   84.51
    B         84.24   84.25   84.41   84.13   84.00   84.30
    C         84.29   84.40   84.68   84.28   84.40   84.36
    D         84.14   84.22   84.02   84.48   84.27   84.33
    E         84.50   83.88   84.49   83.91   84.11   84.06
    F         84.70   84.17   84.11   84.36   84.61   83.81
