
Normality and Data Transformations
Definition – (Non)Parametric
Parametric statistics assume that data come from a normal distribution, and make inferences about the parameters of that distribution. These statistical tests compare the means (central tendency) of the distributions as a function of their variability (spread).
Non-parametric statistics do not depend on fitting a parameterized distribution based on normality. These statistical tests compare the medians (the 50% point of the data distributions) and the ranks of the observations among the samples.
The Normal Distribution
X ~ N(µ, σ)
Every normal distribution can be described using only two parameters: the mean and the S.D.
The normal distribution is the basis of parametric statistics: parametric statistical methods require that numerical variables approximate a normal distribution, and they compare the means and S.D.s.
In a normal distribution:
• ~ 68% of observations fall within 1 standard deviation of the mean
• ~ 95% within 2 standard deviations
• ~ 99.7% within 3 standard deviations
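These percentages follow directly from the normal cumulative distribution function; a minimal sketch in base R verifies them:

    # P(|X - mu| < k * sigma) for k = 1, 2, 3 standard deviations
    sapply(1:3, function(k) pnorm(k) - pnorm(-k))
    # 0.6826895 0.9544997 0.9973002  -- the 68 / 95 / 99.7 rule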
Assessing Normality
Three ways to assess the normality of the data:
• 1) Graphical displays
– Histogram, density plot, boxplot, Q-Q plot
• 2) Skewness / kurtosis
– Are they different from 0 (their value for a normal distribution)?
– Rule of thumb: a value that is too large (> 1) or too small (< -1) indicates non-normality
• 3) Shapiro-Wilk test
– Tests whether the data differ from a normal distribution
– Significant = non-normal data
– Non-significant = data consistent with normality
Assessing Normality
• 1) Graphical displays
[Example histogram, density plot, and boxplot of the data]
Assessing Normality
• 1) More graphical displays
– Q-Q plot: quantile / quantile plot. Compares the observed data against theoretical data from a normal distribution.
OPTIONS tab: select the type and the parameters of the theoretical data distribution. Default: “Normal”
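Outside the Rcmdr menus, the same kind of plot can be drawn in base R; a minimal sketch with illustrative data (not the class dataset):

    set.seed(1)
    x <- rnorm(50, mean = 40, sd = 3)   # illustrative data
    qqnorm(x)                           # observed vs. theoretical normal quantiles
    qqline(x)                           # reference line; points close to it suggest normality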
Assessing Normality
Q-Q plot: quantile / quantile plot
Things to look for:
– How many points are plotted?
– Are there any outliers?


Quantifying Distributions
2) Skewness: distribution symmetry (skew)
Skew: a measure of the symmetry of a distribution. Symmetric distributions have a skew = 0.
– Positive skew: the mean is larger than the median; skewness > 0.
– Negative skew: the mean is smaller than the median; skewness < 0.
Quantifying Distributions
2) Kurtosis: distribution of data in the peak / tails
Kurtosis: a measure of the degree to which observations cluster in the tails or the center of the distribution.
– Positive kurtosis: heavier tails and a sharper peak than a normal distribution. Leptokurtic.
– Negative kurtosis: lighter tails and a flatter peak than a normal distribution. Platykurtic.
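Skewness and kurtosis are not in base R; a minimal sketch using the e1071 package (one common choice; the moments package is another):

    # install.packages("e1071")
    library(e1071)
    set.seed(1)
    x <- rnorm(200)   # illustrative data
    skewness(x)       # ~0 for symmetric data; |value| > 1 suggests non-normality
    kurtosis(x)       # excess kurtosis; ~0 for normal data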
Assessing Normality – Example
• Use the “NormalityExample.xlsx” dataset (posted on the class web-site)
• Follow along with this example using Rcmdr
• Open RStudio and activate Rcmdr
• Import the dataset and start exploring (see the sketch below)
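A minimal sketch of that setup in script form (the readxl call is one option for the import; the same step is also available from Rcmdr's own Data menus):

    # install.packages(c("Rcmdr", "readxl"))
    library(Rcmdr)   # opens the R Commander window on top of RStudio
    # Dataset <- readxl::read_excel("NormalityExample.xlsx")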


An Example in Estimation
How old is your professor?
N = 18 guesses
Age (yrs): 34, 36, 37, 37, 38, 38, 38, 38, 39, 40, 40, 41, 41, 42, 42, 42, 42, 48
Range = 34 – 48
What is the midpoint value?
An Example in Estimation
N = 18 guesses; Mean = 39.6; Median = 39.5; S.D. = 3.1

value   frequency   relative frequency
34      1           0.056
35      0           0.000
36      1           0.056
37      2           0.111
38      4           0.222
39      1           0.056
40      2           0.111
41      2           0.111
42      4           0.222
43      0           0.000
44      0           0.000
45      0           0.000
46      0           0.000
47      0           0.000
48      1           0.056
sum     18          1.000
An Example in Estimation
N = 18 guesses

value   relative freq.   cumulative freq.
34      0.056            0.056
35      0.000            0.056
36      0.056            0.111
37      0.111            0.222
38      0.222            0.444
39      0.056            0.500
40      0.111            0.611
41      0.111            0.722
42      0.222            0.944
43      0.000            0.944
44      0.000            0.944
45      0.000            0.944
46      0.000            0.944
47      0.000            0.944
48      0.056            1.000
sum     1.000

Percentiles read from the cumulative frequencies: 5% = 34; 25% = 38; 50% = 39.5; 75% = 42; 95% = 48
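The same summary statistics can be reproduced in R from the raw guesses (note that quantile() interpolates by default, so its percentiles can differ slightly from the table's values):

    age <- c(34, 36, 37, 37, 38, 38, 38, 38, 39,
             40, 40, 41, 41, 42, 42, 42, 42, 48)
    mean(age)     # 39.6
    median(age)   # 39.5
    sd(age)       # 3.1
    quantile(age, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))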
Data Summary with Rcmdr
Summaries:
- Active data set
- Numerical summaries
Normality Test with Rcmdr
Test of normality:
- Select the data
- Use Shapiro-Wilk
- Test multiple data using “by groups”
Normality Test with Rcmdr
Test of normality: SW (Shapiro-Wilk) test
- Null hypothesis: data ARE normal
- Alternative hypothesis: data ARE NOT normal
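A minimal sketch of the same test in base R, using the 18 age guesses; shapiro.test() performs the same Shapiro-Wilk test as the Rcmdr menu item:

    age <- c(34, 36, 37, 37, 38, 38, 38, 38, 39,
             40, 40, 41, 41, 42, 42, 42, 42, 48)
    shapiro.test(age)   # returns the W statistic and the p value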


Normality Test with Rcmdr
Test of normality: SW (Shapiro-Wilk) test
Is this result significant? How can you tell?
- P value > 0.05 (alpha): the result is NOT significant.
- The null is not rejected: the data are consistent with a normal distribution.
What do you need to report?
- Test name, sample size (n OR df), test statistic, p value
Confidence Intervals – Many Tests
Formulation for 95% confidence intervals:
- Lower bound: Mean – (1.96 × SE)
- Upper bound: Mean + (1.96 × SE)
By definition, 95% of the confidence intervals (from different experiments) will overlap the real parameter µ.
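A minimal sketch computing this interval for the age guesses:

    age <- c(34, 36, 37, 37, 38, 38, 38, 38, 39,
             40, 40, 41, 41, 42, 42, 42, 42, 48)
    se <- sd(age) / sqrt(length(age))   # 3.127466 / sqrt(18) = 0.737151
    mean(age) + c(-1.96, 1.96) * se     # lower and upper 95% CI bounds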
NOTE: Estimates Depend on Sample Size
C.I. formulation: Mean +/- (Z score × SE) = Mean +/- (1.96 × SE)
S.E. = S.D. / sqrt(n) = 3.127466 / sqrt(18) = 0.737151

n    mean   SD    sqrt(n)   SE    95% CI half-width
3    38.3   1.5   1.7       0.9   1.7
6    40.2   4.4   2.4       1.8   3.5
9    40.1   3.5   3.0       1.2   2.3
12   39.9   3.2   3.5       0.9   1.8
15   39.7   3.0   3.9       0.8   1.5
18   39.6   3.1   4.2       0.7   1.4
NOTE: Estimates Are Influenced by Chance
Age estimate: 39.6 years (SD = 3.1)
C.I. formulation: Mean +/- (1.96 × SE); S.E. = S.D. / sqrt(n)

n   mean   SD    sqrt(n)   SE    95% CI half-width   lower   upper
9   40.1   3.5   3.0       1.2   2.3                 37.8    42.4
9   39.1   2.8   3.0       0.9   1.8                 37.3    40.9

Are these two samples from the same population?


Interpreting Confidence Intervals
The confidence interval (CI) is the interval that includes the estimated parameter, with a probability determined by the confidence level (usually 95%).
Interpreting Confidence Intervals
Case 1. The two samples are indistinguishable (their confidence intervals overlap). They are from the same population.
Case 2. The two samples are different (their confidence intervals do not overlap). They are not from the same population.
Summary – Parametric Statistics
Benefits and costs:
- Parametric methods make more assumptions than non-parametric methods. If the extra assumptions are correct, parametric methods have more statistical power (they produce more accurate and precise estimates).
- However, if those assumptions are incorrect, parametric methods can be very misleading. They can cause false positives (Type I errors). Thus, they are often not considered robust.
Summary – Normality
Indicators of a normal (Gaussian) distribution:
A. Mean = Median = Mode
B. Skewness: measures the asymmetry of the distribution. A value of zero indicates symmetry; an absolute value > 1 indicates a non-normal, skewed distribution.
C. Kurtosis: measures the distribution of mass between the peak and the tails. A value of zero indicates a normal distribution; an absolute value > 1 indicates non-normal peak / tail behavior.
Summary – Approach
Suggested approach:
- Use parametric tests whenever possible.
- Take care to examine diagnostic statistics and to determine whether the extra assumptions are met.
- If you are in doubt, perform the matching non-parametric test and compare the results.
  - If they agree: go with the results of the parametric test.
  - If they disagree: determine what caused the disagreement.


Data Distributions and Transformations

[Figure: Chl concentration distribution before and after log transformation]


Approach
• We can transform the data to achieve normality.
• We need to implement monotonic transformations:
– Actual values change
– Ranks do not change


Presence / Absence Transformation
NON-MONOTONIC TRANSFORMATIONS
[Plot: x vs. f(x) for the presence/absence (P/A) recoding]
Note: the 0-power transformation is NOT monotonic. It recodes data as Presence / Absence (0 / 1).
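A minimal sketch of this recoding in R (the data values are hypothetical):

    x  <- c(0, 0, 0.2, 1.5, 0, 7.3)   # hypothetical density data
    pa <- as.integer(x > 0)           # 0 stays 0; any positive value becomes 1
    pa                                # 0 0 1 1 0 1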
Log Transformation
Logarithmic transformation: f(x) = ln(x) OR log(x)
[Plot: x vs. f(x) for the log transformation]
This transformation is useful when:
• There is a high degree of variation within samples (e.g., Chl Conc.)
• There are large outliers (tails) and lots of zeros
Note: to log-transform data containing zeros, a small number should be added to all data points.
• With count data, add one, so that: f(0) = log(0 + 1) = 0
• With density data, add a constant smaller than the smallest possible sample, so that: f(0) = log(0 + 0.001) = -3
Log Transformation
Log transformation (log(Xi)): reduces positive skew.
[Figure: distribution before and after log transformation]
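A minimal sketch of both zero-handling rules from the previous slide (the data values are hypothetical):

    counts <- c(0, 1, 4, 12, 150)   # hypothetical count data
    log10(counts + 1)               # add 1, so that log10(0 + 1) = 0
    dens <- c(0, 0.02, 0.5, 3.1)    # hypothetical density data
    log10(dens + 0.001)             # constant below the smallest possible value; log10(0.001) = -3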
Square Root Transformations
MONOTONIC TRANSFORMATIONS
[Plot: x vs. f(x) for power exponents; the ½ power is the square root]
The square-root transform deals with positive skew by bringing in large tails.
Special treatment of zeros is not necessary.
Square Root Transformation
Square root transformation (√Xi): reduces positive skew. Useful for stabilizing variance.
[Figure: distribution before and after square-root transformation]
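A minimal sketch (the data values are hypothetical); note that zeros pass through unchanged:

    x <- c(0, 1, 4, 9, 100)   # hypothetical positively skewed data
    sqrt(x)                   # 0 1 2 3 10 -- the largest values are pulled in most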
Data Transformations – For Proportions
Arcsine / arcsine-square-root transformation
This transformation is useful when dealing with proportional data (e.g., Percent Cover).
Note: data must range between 0 and 1, inclusive.
The constant 2 / pi scales the result of arcsin(x) [in radians] to range from 0 to 1, assuming that 0 ≤ x ≤ 1.
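A minimal sketch of the arcsine-square-root version with the 2 / pi scaling (the data values are hypothetical):

    p <- c(0, 0.1, 0.5, 0.9, 1)   # hypothetical percent-cover proportions
    asin(sqrt(p)) * 2 / pi        # scaled to range from 0 to 1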
So, the Data are not Normal… Now What? Transform the Data in Rcmdr to Fix the Problem.
Menu path: Data → Manage variables in active data set → Compute new variable
Example: Numeracy has positive skew (Min = 1, Max = 14); compute Log(Numeracy) and check its summary.
NOTE: In R, Log = Ln and Log10 = Log base 10.
Hints for Computing New Variables
- log = the natural log (ln)
- log10 = the log base 10
- asin = arcsine
- sqrt = square root
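A minimal sketch of what the Compute-new-variable step amounts to in code (the data frame and variable names here are hypothetical; Rcmdr generates similar with() expressions):

    Dataset <- data.frame(Numeracy = c(1, 2, 2, 3, 5, 14))   # hypothetical data
    Dataset$logNumeracy <- with(Dataset, log10(Numeracy))    # base-10 log
    Dataset$lnNumeracy  <- with(Dataset, log(Numeracy))      # natural log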


Summary
Rules for Data Transformations
- Most important rule: do not reverse the order of the values (larger remains larger, smaller remains smaller).
- Monotonic: change values but retain ranks.
- Non-monotonic: change values and ranks (for example: adding a random number, multiplying by (-1)).
Summary
Take-home Lessons
• Parametric tests are more powerful, but are based on the assumption of normally distributed data.
• Determine the normality criteria and undertake data transformations, if needed.
• If you are unsure, data transformations can always be attempted, comparing the results of the same test on transformed and un-transformed data.
• Test normality before / after data transformations.
• If transformations do not work, use non-parametric tests.
