
Checking the normality of a dataset

The document outlines methods for checking the normality of a dataset, which is crucial for parametric statistical analyses. It details visual methods like histograms and Q-Q plots, as well as statistical tests such as the Shapiro-Wilk and Kolmogorov-Smirnov tests. Additionally, it provides guidance on performing these checks in Python, R, and SAS, along with recommendations for data transformations if normality is not met.

Uploaded by

ritisnatanayak2

By-Ritisnata Nayak

Checking the normality of a dataset


Checking the normality of a dataset is important when you're performing statistical analyses that
assume normally distributed data, such as parametric tests (e.g., t-tests, ANOVA, linear
regression). Strictly speaking, these methods assume normality of the errors (residuals) or of the
data within each group, not of the pooled raw values. Methods for checking normality range from
visual inspection to formal statistical tests.
Here's how you can check for normality:
1. Visual Methods:
a) Histogram:
• A histogram is a simple way to visualize the distribution of your data. If your data is normally
distributed, the histogram should be roughly bell-shaped: symmetric, with a single peak.
b) Q-Q Plot (Quantile-Quantile Plot):
• A Q-Q plot compares the quantiles of your data against the quantiles of a normal
distribution. If your data is normally distributed, the points on the plot should fall roughly
along a straight line.
c) Box Plot:
• A box plot doesn't show normality directly, but it helps reveal asymmetry and outliers. Normally
distributed data gives a roughly symmetric box with only a few points beyond the whiskers.
d) Density Plot:
• A kernel density estimate (KDE) plot provides a smoothed version of the histogram. You can
compare it to the normal distribution curve to visually inspect normality.
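The "straight line" idea behind the Q-Q plot can be made concrete without any plotting. The sketch below is only an illustration using the Python standard library: the helper names `qq_points` and `qq_correlation`, the plotting positions (i - 0.5)/n, and the two simulated samples are assumptions made for this example, not part of the original document. It pairs each sorted observation with the matching standard normal quantile and reports how close the pairs are to a line.

```python
import random
from statistics import NormalDist, mean

def qq_points(sample):
    """Return (theoretical, observed) quantiles for a normal Q-Q plot."""
    n = len(sample)
    observed = sorted(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed

def qq_correlation(sample):
    """Pearson correlation between theoretical and observed quantiles."""
    t, o = qq_points(sample)
    mt, mo = mean(t), mean(o)
    cov = sum((a - mt) * (b - mo) for a, b in zip(t, o))
    var_t = sum((a - mt) ** 2 for a in t)
    var_o = sum((b - mo) ** 2 for b in o)
    return cov / (var_t * var_o) ** 0.5

rng = random.Random(0)  # fixed seed so the illustration is repeatable
normal_sample = [rng.gauss(10, 2) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"Q-Q correlation, normal sample:      {r_normal:.4f}")
print(f"Q-Q correlation, exponential sample: {r_skewed:.4f}")
```

A correlation very close to 1 corresponds to points hugging the reference line; the skewed sample should score visibly lower. This number is a rough summary, not a substitute for looking at the plot itself.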
2. Statistical Tests:
a) Shapiro-Wilk Test:
• One of the most popular tests for normality, especially for smaller datasets.
• Null hypothesis (H0): The data is normally distributed.
• If the p-value is less than 0.05, the null hypothesis is rejected, indicating that the data is not
normally distributed. A larger p-value does not prove normality; it only means the test found no
evidence against it.
b) Kolmogorov-Smirnov Test:
• Another test to compare your data with a normal distribution. It is generally used for larger
datasets and is more sensitive to deviations near the center of the distribution than in the tails.
When the mean and standard deviation are estimated from the same data, its standard p-value is too
conservative; the Lilliefors correction addresses this.
c) Anderson-Darling Test:
• A more powerful test than the Kolmogorov-Smirnov test for checking normality because it gives
more weight to the tails; also useful for smaller datasets.
d) D'Agostino's K-squared Test:
• This test combines the skewness and kurtosis of the data into a single statistic to check for
departures from normality.
e) Jarque-Bera Test:
• A goodness-of-fit test that compares the skewness and kurtosis of the sample data with those of a
normal distribution. It relies on a large-sample approximation, so it is best suited to larger
datasets.
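3. Skewness and Kurtosis:
• Skewness measures the asymmetry of the data. For a normal distribution, skewness should
be close to 0.
• Kurtosis measures the "tailedness" of the distribution. For a normal distribution, kurtosis
should be close to 3. Many packages report excess kurtosis (kurtosis minus 3) instead, in which
case the value should be close to 0, so check which convention your software uses.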
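To make the two kurtosis conventions explicit, here is a minimal standard-library sketch. The function names and the simulated samples are assumptions for illustration only, and the moments use the simple divide-by-n definitions, which can differ slightly from the bias-corrected values some packages print.

```python
import random

def central_moment(sample, k):
    """k-th central moment using the divide-by-n convention."""
    n = len(sample)
    m = sum(sample) / n
    return sum((x - m) ** k for x in sample) / n

def skewness(sample):
    """Third standardized moment: about 0 for symmetric data."""
    return central_moment(sample, 3) / central_moment(sample, 2) ** 1.5

def pearson_kurtosis(sample):
    """Fourth standardized moment: about 3 for normal data."""
    return central_moment(sample, 4) / central_moment(sample, 2) ** 2

def excess_kurtosis(sample):
    """Pearson kurtosis minus 3: about 0 for normal data."""
    return pearson_kurtosis(sample) - 3.0

rng = random.Random(1)  # fixed seed so the illustration is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(2000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(2000)]

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    print(
        f"{name:12s} skewness={skewness(sample):6.3f}  "
        f"kurtosis={pearson_kurtosis(sample):6.3f}  "
        f"excess kurtosis={excess_kurtosis(sample):6.3f}"
    )
```

For the normal sample the three values should land near 0, 3 and 0 respectively, while the exponential sample should show clearly positive skewness and heavier tails.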
4. Transformations (If Data is Not Normally Distributed):
If the data is not normally distributed, transformations such as logarithmic, square root, or Box-Cox
transformations can sometimes help make the data more normal.
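The effect of these transformations can be sketched on simulated right-skewed data. Everything below is illustrative: the lognormal sample, the helper names `skewness` and `box_cox`, and the fixed lambda values are assumptions, and in practice the Box-Cox lambda is usually estimated from the data (for example with `scipy.stats.boxcox`) rather than chosen by hand.

```python
import math
import random

def skewness(sample):
    """Third standardized moment (divide-by-n convention)."""
    n = len(sample)
    m = sum(sample) / n
    m2 = sum((x - m) ** 2 for x in sample) / n
    m3 = sum((x - m) ** 3 for x in sample) / n
    return m3 / m2 ** 1.5

def box_cox(x, lam):
    """Box-Cox transform of a positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)  # limiting case as lambda approaches 0
    return (x ** lam - 1.0) / lam

rng = random.Random(42)  # fixed seed so the illustration is repeatable
raw = [rng.lognormvariate(0.0, 0.8) for _ in range(2000)]  # right-skewed, positive

logged = [box_cox(x, 0.0) for x in raw]   # same as a log transformation
rooted = [box_cox(x, 0.5) for x in raw]   # affine function of the square root

skew_raw = skewness(raw)
skew_sqrt = skewness(rooted)
skew_log = skewness(logged)
print(f"skewness raw:         {skew_raw:.3f}")
print(f"skewness square root: {skew_sqrt:.3f}")
print(f"skewness log:         {skew_log:.3f}")
```

Because this sample is lognormal by construction, the log transformation should bring skewness close to 0 and the square root should only reduce it. Real data will rarely behave this cleanly, so re-check normality after transforming.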
How to Perform in Python:
Code

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

# Sample data
data = np.random.normal(loc=0, scale=1, size=1000) # Normal distribution

# Histogram with a kernel density estimate (KDE) overlaid
sns.histplot(data, bins=30, kde=True)
plt.title('Histogram with KDE')
plt.show()

# Q-Q plot
stats.probplot(data, dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.show()

# Shapiro-Wilk Test
shapiro_test = stats.shapiro(data)
print(f"Shapiro-Wilk Test p-value: {shapiro_test.pvalue}")

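# Kolmogorov-Smirnov Test
# 'norm' on its own means N(0, 1), so pass the sample mean and standard deviation.
# With parameters estimated from the data, the p-value is only approximate (too conservative).
ks_test = stats.kstest(data, 'norm', args=(np.mean(data), np.std(data, ddof=1)))
print(f"Kolmogorov-Smirnov Test p-value: {ks_test.pvalue}")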

# Anderson-Darling Test
anderson_test = stats.anderson(data, dist='norm')
print(anderson_test)

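# D'Agostino's K-squared Test
dagostino_test = stats.normaltest(data)
print(f"D'Agostino Test p-value: {dagostino_test.pvalue}")

# Skewness and Kurtosis
print(f"Skewness: {stats.skew(data)}")
# stats.kurtosis returns excess kurtosis by default (about 0 for normal data);
# pass fisher=False for the convention where normal data gives about 3.
print(f"Kurtosis: {stats.kurtosis(data)}")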

How to Perform in R:
Code
# Sample data
data <- rnorm(1000) # Normal distribution

# Histogram
hist(data, breaks=30, col="lightblue", main="Histogram")

# Q-Q plot
qqnorm(data)
qqline(data, col = "red")

# Shapiro-Wilk Test
shapiro.test(data)

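# Kolmogorov-Smirnov Test
# Note: estimating mean and sd from the same data makes this p-value conservative;
# lillie.test() in the 'nortest' package applies the Lilliefors correction.
ks.test(data, "pnorm", mean=mean(data), sd=sd(data))

# Anderson-Darling Test (using 'nortest' package)
library(nortest)
ad.test(data)

# Skewness and Kurtosis (using 'moments' package)
library(moments)
skewness(data)
kurtosis(data)  # Pearson kurtosis: about 3 for normal data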

How to Perform in SAS:


In SAS, you can check for normality using a combination of visual plots and statistical tests like the
Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests.
Here’s how you can perform normality checks in SAS:
1. Visual Methods:
You can use histograms and Q-Q plots to visually inspect the distribution of your data.
a) Histogram and Q-Q Plot:
proc univariate data=your_dataset;
var your_variable;
histogram / normal;
qqplot / normal(mu=est sigma=est);
run;
• The histogram / normal; statement generates a histogram with a fitted normal
distribution curve.
• The qqplot / normal(mu=est sigma=est); statement generates a Q-Q plot with a normal
reference line.
2. Statistical Tests for Normality:
You can run statistical tests like the Shapiro-Wilk test and others using the PROC UNIVARIATE
procedure.
proc univariate data=your_dataset normal;
var your_variable;
histogram / normal;
probplot / normal(mu=est sigma=est);
run;
• This code will generate the following:
o Shapiro-Wilk test (reported only for sample sizes ≤ 2000).
o Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests (reported for any
sample size).
o A normal probability plot of the variable.
The output will provide p-values for these tests. If the p-value is less than 0.05, you reject the null
hypothesis, indicating that the data is not normally distributed.
Explanation of Key Tests:
• Shapiro-Wilk Test: Most commonly used for normality; accurate for smaller datasets.
• Kolmogorov-Smirnov Test: Compares the empirical distribution function of the sample with
the cumulative distribution function of the specified distribution (here, normal), using the
largest vertical gap between the two.
• Anderson-Darling Test: A more sensitive test than the K-S test, giving more weight to the
tails of the distribution.
• Cramer-von Mises Test: Similar to Anderson-Darling, but it weights deviations equally across
the whole distribution instead of emphasizing the tails.
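The Kolmogorov-Smirnov idea of comparing the empirical distribution function with a normal cumulative distribution function can be sketched in a few lines of standard-library Python. The helper name `ks_statistic` and the simulated samples are assumptions for this illustration. Only the statistic D is computed; no p-value is given, because the usual K-S p-value does not apply when the mean and standard deviation are estimated from the same sample.

```python
import random
from statistics import NormalDist, mean, stdev

def ks_statistic(sample, dist):
    """Largest vertical gap between the empirical CDF and dist.cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = dist.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

rng = random.Random(7)  # fixed seed so the illustration is repeatable
normal_sample = [rng.gauss(50, 5) for _ in range(300)]
skewed_sample = [rng.expovariate(0.02) for _ in range(300)]

# Fit a normal distribution to each sample, then measure the gap.
d_normal = ks_statistic(normal_sample, NormalDist(mean(normal_sample), stdev(normal_sample)))
d_skewed = ks_statistic(skewed_sample, NormalDist(mean(skewed_sample), stdev(skewed_sample)))
print(f"D for normal sample:      {d_normal:.4f}")
print(f"D for exponential sample: {d_skewed:.4f}")
```

A small D means the fitted normal curve tracks the data closely. The Cramer-von Mises and Anderson-Darling statistics start from the same two curves but accumulate squared gaps over the whole range instead of taking the single largest one.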
Example:
Assume your dataset is named fish_data and you want to check the normality of a variable
body_weight.
proc univariate data=fish_data normal;
var body_weight;
histogram / normal;
qqplot / normal(mu=est sigma=est);
run;

3. Skewness and Kurtosis:


To check skewness and kurtosis in PROC UNIVARIATE:
proc univariate data=fish_data;
var body_weight;
output out=stats skewness=sk kurtosis=ku;
run;

This code will output the skewness and kurtosis into the stats dataset, where:
• Skewness (sk): Measures the asymmetry of the data.
• Kurtosis (ku): Measures the "tailedness" of the distribution.
A skewness value near 0 indicates a symmetric distribution. SAS reports excess kurtosis (kurtosis
minus 3), so a kurtosis value near 0 is what a normal distribution gives here.
4. Checking Normality in a Report:
PROC MEANS and PROC SUMMARY can report skewness and kurtosis in a statistical summary report, but
they do not run normality tests; use PROC UNIVARIATE for the tests and more detailed output.

5. Transformations (If Data is Not Normally Distributed):


If your data is not normally distributed, you can try transforming it:
Example: Log Transformation:
data transformed_data;
set fish_data;
log_body_weight = log(body_weight);
run;

proc univariate data=transformed_data normal;
var log_body_weight;
histogram / normal;
qqplot / normal(mu=est sigma=est);
run;

Summary of Steps in SAS:


1. Run PROC UNIVARIATE with the normal option to check normality using visual plots and
statistical tests.
2. Use histograms and Q-Q plots for visual inspection.
3. Look at p-values from tests like Shapiro-Wilk to formally check normality.
4. Skewness and kurtosis can also provide insights into normality.
5. Apply transformations like log, square root, or Box-Cox if the data is not normally
distributed.
This method provides a comprehensive approach to normality testing in SAS.
Guidelines for Choosing a Method:
• Small datasets (n < 50): Prefer the Shapiro-Wilk test.
• Large datasets (n > 2000): Visual methods and Q-Q plots are generally more reliable, as most
tests will flag even minor deviations from normality as significant.
• Moderate datasets: Combine visual inspection with a test like the Shapiro-Wilk or
D'Agostino's test.
Conclusion:
• Use visual methods for an intuitive understanding of the data's distribution.
• Use statistical tests for formal checks, but be cautious in large samples, as even minor
deviations from normality can lead to rejection of the null hypothesis.
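The caution about large samples can be illustrated with a small simulation. This is a sketch under stated assumptions: the "mildly skewed" data model (a standard normal plus a small centered exponential term), the sample sizes, and the helper names are invented for the example. The Jarque-Bera statistic is used because its large-sample p-value has a closed form, exp(-JB/2), from the chi-square distribution with 2 degrees of freedom.

```python
import math
import random

def jarque_bera_p(sample):
    """Jarque-Bera statistic and its large-sample p-value."""
    n = len(sample)
    m = sum(sample) / n
    m2 = sum((x - m) ** 2 for x in sample) / n
    m3 = sum((x - m) ** 3 for x in sample) / n
    m4 = sum((x - m) ** 4 for x in sample) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # Chi-square survival function with 2 degrees of freedom.
    return jb, math.exp(-jb / 2.0)

def mildly_skewed(rng, n):
    """Toy data: standard normal plus a small centered exponential term."""
    return [rng.gauss(0, 1) + 0.5 * (rng.expovariate(1.0) - 1.0) for _ in range(n)]

rng = random.Random(123)  # fixed seed so the illustration is repeatable
p_by_n = {}
for n in (50, 500, 20000):
    jb, p = jarque_bera_p(mildly_skewed(rng, n))
    p_by_n[n] = p
    print(f"n={n:6d}  JB={jb:9.2f}  p-value={p:.4g}")
```

The same small departure from normality that a test often misses at n = 50 is rejected decisively at n = 20000, even though a histogram of the data would still look close to bell-shaped. This is why the guidelines lean on visual checks for large datasets.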
