
Normality and Data Transformations
Definition – (Non)Parametric
Parametric statistics assume that data come from a normal distribution, and make inferences about the parameters of that distribution. These statistical tests compare the means (central tendency) of the distributions as a function of their variability (spread).
Non-parametric statistics do not depend on fitting a parameterized distribution based on normality. These statistical tests compare the medians (the 50% point of the data distributions) and the ranks of the observations among the samples.
The Normal Distribution
X ~ N(µ, σ)
Every normal distribution can be described using only two parameters: the mean and the S.D.
The normal distribution is the basis of parametric statistics: parametric statistical methods require that numerical variables approximate a normal distribution, and they compare the means and S.D.s.
In a normal distribution:
• ~ 68% of observations fall within 1 standard deviation of the mean
• ~ 95% within 2 standard deviations
• ~ 99.7% within 3 standard deviations
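These percentages follow directly from the normal cumulative distribution function; a minimal sketch in base R verifies them:

    # P(|X - mu| < k * sigma) for k = 1, 2, 3 standard deviations
    sapply(1:3, function(k) pnorm(k) - pnorm(-k))
    # 0.6826895 0.9544997 0.9973002  -- the 68 / 95 / 99.7 rule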
Assessing Normality
Three ways to assess the normality of the data:
• 1) Graphical displays
– Histogram, density plot, boxplot, Q-Q plot
• 2) Skewness / kurtosis
– Are they different from 0 (their value for a normal distribution)?
– Rule of thumb: a value that is too large (> 1) or too small (< -1) indicates non-normality
• 3) Shapiro-Wilk test
– Tests whether the data differ from a normal distribution
– Significant = non-normal data
– Non-significant = data consistent with normality
Assessing Normality
• 1) Graphical displays
[Example histogram, density plot, and boxplot of the data]
Assessing Normality
• 1) More graphical displays
– Q-Q plot: quantile / quantile plot. Compares the observed data against theoretical data from a normal distribution.
OPTIONS tab: select the type and the parameters of the theoretical data distribution. Default: “Normal”
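Outside the Rcmdr menus, the same kind of plot can be drawn in base R; a minimal sketch with illustrative data (not the class dataset):

    set.seed(1)
    x <- rnorm(50, mean = 40, sd = 3)   # illustrative data
    qqnorm(x)                           # observed vs. theoretical normal quantiles
    qqline(x)                           # reference line; points close to it suggest normality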
Assessing Normality
Q-Q plot: quantile / quantile plot
Things to look for:
– How many points are plotted?
– Are there any outliers?


Quantifying Distributions
2) Skewness: distribution symmetry (skew)
Skew: a measure of the symmetry of a distribution. Symmetric distributions have a skew = 0.
– Positive skew: the mean is larger than the median; skewness > 0.
– Negative skew: the mean is smaller than the median; skewness < 0.
Quantifying Distributions
2) Kurtosis: distribution of data in the peak / tails
Kurtosis: a measure of the degree to which observations cluster in the tails or the center of the distribution.
– Positive kurtosis: heavier tails and a sharper peak than a normal distribution. Leptokurtic.
– Negative kurtosis: lighter tails and a flatter peak than a normal distribution. Platykurtic.
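Skewness and kurtosis are not in base R; a minimal sketch using the e1071 package (one common choice; the moments package is another):

    # install.packages("e1071")
    library(e1071)
    set.seed(1)
    x <- rnorm(200)   # illustrative data
    skewness(x)       # ~0 for symmetric data; |value| > 1 suggests non-normality
    kurtosis(x)       # excess kurtosis; ~0 for normal data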
Assessing Normality – Example
• Use the “NormalityExample.xlsx” dataset (posted on the class web-site)
• Follow along with this example using Rcmdr
• Open RStudio and activate Rcmdr
• Import the dataset and start exploring (see the sketch below)
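A minimal sketch of that setup in script form (the readxl call is one option for the import; the same step is also available from Rcmdr's own Data menus):

    # install.packages(c("Rcmdr", "readxl"))
    library(Rcmdr)   # opens the R Commander window on top of RStudio
    # Dataset <- readxl::read_excel("NormalityExample.xlsx")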


An Example in Estimation
How old is your professor?
N = 18 guesses
Age (yrs): 34, 36, 37, 37, 38, 38, 38, 38, 39, 40, 40, 41, 41, 42, 42, 42, 42, 48
Range = 34 – 48
What is the midpoint value?
An Example in Estimation
N = 18 guesses; Mean = 39.6; Median = 39.5; S.D. = 3.1

value   frequency   relative frequency
34      1           0.056
35      0           0.000
36      1           0.056
37      2           0.111
38      4           0.222
39      1           0.056
40      2           0.111
41      2           0.111
42      4           0.222
43      0           0.000
44      0           0.000
45      0           0.000
46      0           0.000
47      0           0.000
48      1           0.056
sum     18          1.000
An Example in Estimation
N = 18 guesses

value   relative freq.   cumulative freq.
34      0.056            0.056
35      0.000            0.056
36      0.056            0.111
37      0.111            0.222
38      0.222            0.444
39      0.056            0.500
40      0.111            0.611
41      0.111            0.722
42      0.222            0.944
43      0.000            0.944
44      0.000            0.944
45      0.000            0.944
46      0.000            0.944
47      0.000            0.944
48      0.056            1.000
sum     1.000

Percentiles read from the cumulative frequencies: 5% = 34; 25% = 38; 50% = 39.5; 75% = 42; 95% = 48
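The same summary statistics can be reproduced in R from the raw guesses (note that quantile() interpolates by default, so its percentiles can differ slightly from the table's values):

    age <- c(34, 36, 37, 37, 38, 38, 38, 38, 39,
             40, 40, 41, 41, 42, 42, 42, 42, 48)
    mean(age)     # 39.6
    median(age)   # 39.5
    sd(age)       # 3.1
    quantile(age, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))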
Data Summary with Rcmdr
Summaries:
- Active data set
- Numerical summaries
Normality Test with Rcmdr
Test of normality:
- Select the data
- Use Shapiro-Wilk
- Test multiple data using “by groups”
Normality Test with Rcmdr
Test of normality: SW (Shapiro-Wilk) test
- Null hypothesis: data ARE normal
- Alternative hypothesis: data ARE NOT normal
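A minimal sketch of the same test in base R, using the 18 age guesses; shapiro.test() performs the same Shapiro-Wilk test as the Rcmdr menu item:

    age <- c(34, 36, 37, 37, 38, 38, 38, 38, 39,
             40, 40, 41, 41, 42, 42, 42, 42, 48)
    shapiro.test(age)   # returns the W statistic and the p value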


Normality Test with Rcmdr
Test of normality: SW (Shapiro-Wilk) test
Is this result significant? How can you tell?
- P value > 0.05 (alpha): the result is NOT significant.
- The null is not rejected: the data are consistent with a normal distribution.
What do you need to report?
- Test name, sample size (n OR df), test statistic, p value
Confidence Intervals – Many Tests
Formulation for 95% confidence intervals:
- Lower bound: Mean – (1.96 × SE)
- Upper bound: Mean + (1.96 × SE)
By definition, 95% of the confidence intervals (from different experiments) will overlap the real parameter µ.
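A minimal sketch computing this interval for the age guesses:

    age <- c(34, 36, 37, 37, 38, 38, 38, 38, 39,
             40, 40, 41, 41, 42, 42, 42, 42, 48)
    se <- sd(age) / sqrt(length(age))   # 3.127466 / sqrt(18) = 0.737151
    mean(age) + c(-1.96, 1.96) * se     # lower and upper 95% CI bounds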
NOTE: Estimates Depend on Sample Size
C.I. formulation: Mean +/- (Z score × SE) = Mean +/- (1.96 × SE)
S.E. = S.D. / sqrt(n) = 3.127466 / sqrt(18) = 0.737151

n    mean   SD    sqrt(n)   SE    95% CI half-width
3    38.3   1.5   1.7       0.9   1.7
6    40.2   4.4   2.4       1.8   3.5
9    40.1   3.5   3.0       1.2   2.3
12   39.9   3.2   3.5       0.9   1.8
15   39.7   3.0   3.9       0.8   1.5
18   39.6   3.1   4.2       0.7   1.4
NOTE: Estimates Are Influenced by Chance
Age estimate: 39.6 years (SD = 3.1)
C.I. formulation: Mean +/- (1.96 × SE); S.E. = S.D. / sqrt(n)

n   mean   SD    sqrt(n)   SE    95% CI half-width   lower   upper
9   40.1   3.5   3.0       1.2   2.3                 37.8    42.4
9   39.1   2.8   3.0       0.9   1.8                 37.3    40.9

Are these two samples from the same population?


Interpreting Confidence Intervals
The confidence interval (CI) is the interval that includes the estimated parameter, with a probability determined by the confidence level (usually 95%).
Interpreting Confidence Intervals
Case 1. The two samples are indistinguishable (their confidence intervals overlap). They are from the same population.
Case 2. The two samples are different (their confidence intervals do not overlap). They are not from the same population.
Summary – Parametric Statistics
Benefits and costs:
- Parametric methods make more assumptions than non-parametric methods. If the extra assumptions are correct, parametric methods have more statistical power (they produce more accurate and precise estimates).
- However, if those assumptions are incorrect, parametric methods can be very misleading. They can cause false positives (Type I errors). Thus, they are often not considered robust.
Summary – Normality
Indicators of a normal (Gaussian) distribution:
A. Mean = Median = Mode
B. Skewness: measures the asymmetry of the distribution. A value of zero indicates symmetry; an absolute value > 1 indicates a non-normal, skewed distribution.
C. Kurtosis: measures the distribution of mass between the peak and the tails. A value of zero indicates a normal distribution; an absolute value > 1 indicates non-normal peak / tail behavior.
Summary – Approach
Suggested approach:
- Use parametric tests whenever possible.
- Take care to examine diagnostic statistics and to determine whether the extra assumptions are met.
- If you are in doubt, perform the matching non-parametric test and compare the results.
  - If they agree: go with the results of the parametric test.
  - If they disagree: determine what caused the disagreement.


Data Distributions and Transformations

[Figure: Chl concentration distribution before and after log transformation]


Approach
• We can transform the data to achieve normality.
• We need to implement monotonic transformations:
– Actual values change
– Ranks do not change


Presence / Absence Transformation
NON-MONOTONIC TRANSFORMATIONS
[Plot: x vs. f(x) for the presence/absence (P/A) recoding]
Note: the 0-power transformation is NOT monotonic. It recodes data as Presence / Absence (0 / 1).
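A minimal sketch of this recoding in R (the data values are hypothetical):

    x  <- c(0, 0, 0.2, 1.5, 0, 7.3)   # hypothetical density data
    pa <- as.integer(x > 0)           # 0 stays 0; any positive value becomes 1
    pa                                # 0 0 1 1 0 1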
Log Transformation
Logarithmic transformation: f(x) = ln(x) OR log(x)
[Plot: x vs. f(x) for the log transformation]
This transformation is useful when:
• There is a high degree of variation within samples (e.g., Chl Conc.)
• There are large outliers (tails) and lots of zeros
Note: to log-transform data containing zeros, a small number should be added to all data points.
• With count data, add one, so that: f(0) = log(0 + 1) = 0
• With density data, add a constant smaller than the smallest possible sample, so that: f(0) = log(0 + 0.001) = -3
Log Transformation
Log transformation (log(Xi)): reduces positive skew.
[Figure: distribution before and after log transformation]
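A minimal sketch of both zero-handling rules from the previous slide (the data values are hypothetical):

    counts <- c(0, 1, 4, 12, 150)   # hypothetical count data
    log10(counts + 1)               # add 1, so that log10(0 + 1) = 0
    dens <- c(0, 0.02, 0.5, 3.1)    # hypothetical density data
    log10(dens + 0.001)             # constant below the smallest possible value; log10(0.001) = -3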
Square Root Transformations
MONOTONIC TRANSFORMATIONS
[Plot: x vs. f(x) for power exponents; the ½ power is the square root]
The square-root transform deals with positive skew by bringing in large tails.
Special treatment of zeros is not necessary.
Square Root Transformation
Square root transformation (√Xi): reduces positive skew. Useful for stabilizing variance.
[Figure: distribution before and after square-root transformation]
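A minimal sketch (the data values are hypothetical); note that zeros pass through unchanged:

    x <- c(0, 1, 4, 9, 100)   # hypothetical positively skewed data
    sqrt(x)                   # 0 1 2 3 10 -- the largest values are pulled in most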
Data Transformations – For Proportions
Arcsine / arcsine-square-root transformation
This transformation is useful when dealing with proportional data (e.g., Percent Cover).
Note: data must range between 0 and 1, inclusive.
The constant 2 / pi scales the result of arcsin(x) [in radians] to range from 0 to 1, assuming that 0 ≤ x ≤ 1.
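A minimal sketch of the arcsine-square-root version with the 2 / pi scaling (the data values are hypothetical):

    p <- c(0, 0.1, 0.5, 0.9, 1)   # hypothetical percent-cover proportions
    asin(sqrt(p)) * 2 / pi        # scaled to range from 0 to 1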
So, the Data are not Normal… Now What? Transform the Data in Rcmdr to Fix the Problem.
Menu path: Data → Manage variables in active data set → Compute new variable
Example: Numeracy has positive skew (Min = 1, Max = 14); compute Log(Numeracy) and check its summary.
NOTE: In R, Log = Ln and Log10 = Log base 10.
Hints for Computing New Variables
- log = the natural log (ln)
- log10 = the log base 10
- asin = arcsine
- sqrt = square root
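A minimal sketch of what the Compute-new-variable step amounts to in code (the data frame and variable names here are hypothetical; Rcmdr generates similar with() expressions):

    Dataset <- data.frame(Numeracy = c(1, 2, 2, 3, 5, 14))   # hypothetical data
    Dataset$logNumeracy <- with(Dataset, log10(Numeracy))    # base-10 log
    Dataset$lnNumeracy  <- with(Dataset, log(Numeracy))      # natural log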


Summary
Rules for Data Transformations
- Most important rule: do not reverse the order of the values (larger remains larger, smaller remains smaller).
- Monotonic: change values but retain ranks.
- Non-monotonic: change values and ranks (for example: adding a random number, multiplying by (-1)).
Summary
Take-home Lessons
• Parametric tests are more powerful, but are based on the assumption of normally distributed data.
• Determine the normality criteria and undertake data transformations, if needed.
• If you are unsure, data transformations can always be attempted, comparing the results of the same test on transformed and un-transformed data.
• Test normality before / after data transformations.
• If transformations do not work, use non-parametric tests.
