By-Ritisnata Nayak

Checking the normality of a dataset


Checking the normality of a dataset is important when you perform statistical analyses that
assume normally distributed data, such as parametric tests (e.g., t-tests, ANOVA, linear
regression). For ANOVA and linear regression, the assumption applies to the residuals rather
than to the raw values. Methods for checking normality range from visual inspection to
formal statistical tests.
Here's how you can check for normality:
1. Visual Methods:
a) Histogram:
• A histogram is a simple way to visualize the distribution of your data. If your data is normally
distributed, the histogram should be roughly bell-shaped.
b) Q-Q Plot (Quantile-Quantile Plot):
• A Q-Q plot compares the quantiles of your data against the quantiles of a normal
distribution. If your data is normally distributed, the points on the plot should fall roughly
along a straight line.
c) Box Plot:
• Although a box plot doesn't directly show normality, it can help identify outliers and
asymmetry. Normally distributed data will have a roughly symmetric box plot with few extreme
outliers.
d) Density Plot:
• A kernel density estimate (KDE) plot provides a smoothed version of the histogram. You can
compare it to the normal distribution curve to visually inspect normality.
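The idea behind the Q-Q plot can be sketched without any plotting library. The snippet below is a minimal illustration using only the Python standard library: it pairs the sorted sample with standard normal quantiles and summarizes how straight the resulting line is with a correlation coefficient. The function names `qq_points` and `pearson_r`, the plotting positions (i - 0.5)/n, and the two simulated samples are assumptions made for this example, not part of the original document.

```python
import random
from statistics import NormalDist, mean

def qq_points(sample):
    """Return (theoretical, observed) quantiles for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # standard normal: mean 0, SD 1
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, ordered

def pearson_r(xs, ys):
    """Pearson correlation, used here to measure how straight the Q-Q line is."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the illustration is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = pearson_r(*qq_points(normal_sample))
r_skewed = pearson_r(*qq_points(skewed_sample))
print(f"Q-Q correlation, normal sample: {r_normal:.4f}")
print(f"Q-Q correlation, skewed sample: {r_skewed:.4f}")
```

A correlation very close to 1 corresponds to points that hug the straight line; the skewed sample should give a visibly lower value. This is a summary of the plot, not a replacement for looking at it.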
2. Statistical Tests:
a) Shapiro-Wilk Test:
• One of the most popular tests for normality, especially for smaller datasets.
• Null hypothesis (H0): The data is normally distributed.
• If the p-value is less than 0.05, the null hypothesis is rejected, indicating that the data is not
normally distributed. A larger p-value means only that the test found no evidence against
normality, not that normality is proven.
b) Kolmogorov-Smirnov Test:
• Another test that compares your data with a normal distribution. It is generally used for
larger datasets. Its standard p-value assumes that the mean and standard deviation were
specified in advance rather than estimated from the sample.
c) Anderson-Darling Test:
• A more powerful test than the Kolmogorov-Smirnov test for checking normality, particularly
useful for smaller datasets.
d) D'Agostino's K-squared Test:
• This test combines the skewness and kurtosis of the data to check for departures from
normality.
e) Jarque-Bera Test:
• A goodness-of-fit test that checks whether the skewness and kurtosis of the sample data
match those of a normal distribution.
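To make the Kolmogorov-Smirnov comparison concrete, the following is a minimal standard-library sketch of the test statistic D, the largest gap between the empirical distribution of the sample and a normal distribution fitted to it. The function name `ks_statistic_normal` and the simulated samples are assumptions for illustration. No p-value is computed, because the usual Kolmogorov-Smirnov table does not apply once the mean and standard deviation are estimated from the same data.

```python
import random
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(sample):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x; check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

rng = random.Random(1)  # fixed seed so the illustration is repeatable
d_normal = ks_statistic_normal([rng.gauss(10, 2) for _ in range(1000)])
d_skewed = ks_statistic_normal([rng.expovariate(1.0) for _ in range(1000)])
print(f"D for normal sample: {d_normal:.4f}")
print(f"D for skewed sample: {d_skewed:.4f}")
```

A small D means the sample tracks the fitted normal curve closely; the skewed sample should produce a clearly larger value.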
3. Skewness and Kurtosis:
• Skewness measures the asymmetry of the data. For a normal distribution, skewness should
be close to 0.
• Kurtosis measures the "tailedness" of the distribution. For a normal distribution, kurtosis
should be close to 3. Many tools report excess kurtosis (kurtosis minus 3) instead, in which
case the value should be close to 0.
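The two kurtosis conventions, and the way the Jarque-Bera test combines skewness and kurtosis, can be sketched in a few lines of standard-library Python. This is a hand-rolled illustration using simple moment estimators (dividing by n); the function names are assumptions for this example, and the p-value uses the asymptotic chi-square approximation with 2 degrees of freedom, which is rough for small samples.

```python
from math import exp

def central_moment(data, k):
    """k-th central moment, dividing by n (no small-sample correction)."""
    n = len(data)
    m = sum(data) / n
    return sum((x - m) ** k for x in data) / n

def skewness(data):
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data, excess=False):
    """Pearson kurtosis (normal = 3) or, with excess=True, excess kurtosis (normal = 0)."""
    k = central_moment(data, 4) / central_moment(data, 2) ** 2
    return k - 3.0 if excess else k

def jarque_bera(data):
    """Return (JB statistic, asymptotic p-value)."""
    n = len(data)
    s = skewness(data)
    k_excess = kurtosis(data, excess=True)
    jb = n / 6.0 * (s ** 2 + k_excess ** 2 / 4.0)
    p_value = exp(-jb / 2.0)  # chi-square survival function for 2 degrees of freedom
    return jb, p_value

values = [1, 2, 3, 4, 5]
print(skewness(values))                # symmetric data: 0.0
print(kurtosis(values))                # Pearson kurtosis, about 1.7
print(kurtosis(values, excess=True))   # excess kurtosis, about -1.3
print(jarque_bera(values))
```

When comparing results across tools, check which kurtosis convention each one reports before judging whether a value is "close to normal".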
4. Transformations (If Data is Not Normally Distributed):
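If the data is not normally distributed, transformations such as logarithmic, square root, or Box-Cox
transformations can sometimes help make the data more normal. Logarithmic and Box-Cox
transformations require strictly positive values, and the square root requires non-negative values.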
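As a rough illustration of how a transformation can reduce skewness, the sketch below draws right-skewed (lognormal) data, applies a log transformation, and compares the skewness before and after. It also includes the Box-Cox formula for a given lambda. The simulated data, the seed, and the function names are assumptions for this example; choosing lambda from the data is a separate step not shown here.

```python
import math
import random

def skewness(data):
    """Moment-based skewness (dividing by n)."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

rng = random.Random(42)  # fixed seed so the illustration is repeatable
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]  # right-skewed, all positive
logged = [math.log(x) for x in raw]

skew_before = skewness(raw)
skew_after = skewness(logged)
print(f"Skewness before log transform: {skew_before:.3f}")
print(f"Skewness after log transform:  {skew_after:.3f}")
```

After transforming, rerun the same visual checks and tests on the transformed variable rather than assuming the transformation worked.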
How to Perform in Python:
Code

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

# Sample data
data = np.random.normal(loc=0, scale=1, size=1000) # Normal distribution

# Histogram
plt.hist(data, bins=30, edgecolor='k')
plt.title('Histogram')
plt.show()

# Q-Q plot
stats.probplot(data, dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.show()

# Shapiro-Wilk Test
shapiro_test = stats.shapiro(data)
print(f"Shapiro-Wilk Test p-value: {shapiro_test.pvalue}")

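# Kolmogorov-Smirnov Test
# Without args, kstest compares against the standard normal N(0, 1), so pass
# the sample mean and SD. Because they are estimated from the same data,
# the resulting p-value is only approximate.
ks_test = stats.kstest(data, 'norm', args=(data.mean(), data.std(ddof=1)))
print(f"Kolmogorov-Smirnov Test p-value: {ks_test.pvalue}")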

# Anderson-Darling Test
anderson_test = stats.anderson(data, dist='norm')
print(anderson_test)

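# D'Agostino's K-squared Test
dagostino_test = stats.normaltest(data)
print(f"D'Agostino Test p-value: {dagostino_test.pvalue}")

# Skewness and Kurtosis
print(f"Skewness: {stats.skew(data)}")
# stats.kurtosis returns excess kurtosis by default (about 0 for normal data);
# pass fisher=False for the Pearson definition (about 3 for normal data).
print(f"Kurtosis: {stats.kurtosis(data)}")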

How to Perform in R:
Code
# Sample data
data <- rnorm(1000) # Normal distribution

# Histogram
hist(data, breaks=30, col="lightblue", main="Histogram")

# Q-Q plot
qqnorm(data)
qqline(data, col = "red")

# Shapiro-Wilk Test
shapiro.test(data)

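# Kolmogorov-Smirnov Test
# The p-value is only approximate because the mean and SD are estimated from
# the same data; lillie.test() in 'nortest' applies the Lilliefors correction.
ks.test(data, "pnorm", mean=mean(data), sd=sd(data))

# Anderson-Darling Test (using 'nortest' package)
library(nortest)
ad.test(data)

# Skewness and Kurtosis (using 'moments' package)
library(moments)
skewness(data)
kurtosis(data) # Pearson kurtosis: about 3 for normal data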

How to Perform in SAS:


In SAS, you can check for normality using a combination of visual plots and statistical tests like the
Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests.
Here’s how you can perform normality checks in SAS:
1. Visual Methods:
You can use histograms and Q-Q plots to visually inspect the distribution of your data.
a) Histogram and Q-Q Plot:
proc univariate data=your_dataset;
var your_variable;
histogram / normal;
qqplot / normal(mu=est sigma=est);
run;
• The histogram / normal; statement generates a histogram with a fitted normal
distribution curve.
• The qqplot / normal(mu=est sigma=est); statement generates a Q-Q plot with a normal
reference line based on the estimated mean and standard deviation.
2. Statistical Tests for Normality:
You can run statistical tests like the Shapiro-Wilk test and others using the PROC UNIVARIATE
procedure.
proc univariate data=your_dataset normal;
var your_variable;
histogram / normal;
probplot / normal(mu=est sigma=est);
run;
• This code will generate the following:
  o Shapiro-Wilk test (reported only for sample sizes ≤ 2000).
  o Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests (reported for any
    sample size, so these are the tests available when n > 2000).
  o A normal probability plot of the variable (from the probplot statement).
The output will provide p-values for these tests. If the p-value is less than 0.05, you reject the null
hypothesis, indicating that the data is not normally distributed.
Explanation of Key Tests:
• Shapiro-Wilk Test: Most commonly used for normality; accurate for smaller datasets.
• Kolmogorov-Smirnov Test: Compares the empirical distribution function of the sample with
the cumulative distribution function of the specified distribution (here, normal), using the
largest gap between the two.
• Anderson-Darling Test: A more sensitive test than the K-S test, giving more weight to the
tails of the distribution.
• Cramer-von Mises Test: Similar to Anderson-Darling, but it weights deviations equally
across the whole distribution instead of emphasizing the tails.
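The tail weighting of the Anderson-Darling test can be seen in its formula, where each ordered observation contributes through the logarithm of the fitted distribution function in both tails. The following Python sketch, using only the standard library, computes the A-squared statistic against a normal distribution fitted to the sample. The function name, the simulated samples, and the small clamp on the probabilities (a numerical safeguard) are assumptions for this example; it does not reproduce the p-values that SAS reports.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(sample):
    """A-squared statistic against a normal distribution fitted to the sample."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    eps = 1e-12  # keep probabilities away from 0 and 1 so log() stays finite
    cdf = [min(max(fitted.cdf(x), eps), 1.0 - eps) for x in ordered]
    total = 0.0
    for i in range(1, n + 1):
        # log F(x_i) grows large in the lower tail and
        # log(1 - F(x_{n+1-i})) grows large in the upper tail.
        total += (2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1.0 - cdf[n - i]))
    return -n - total / n

rng = random.Random(3)  # fixed seed so the illustration is repeatable
a2_normal = anderson_darling_normal([rng.gauss(0, 1) for _ in range(500)])
a2_skewed = anderson_darling_normal([rng.expovariate(1.0) for _ in range(500)])
print(f"A-squared, normal sample: {a2_normal:.3f}")
print(f"A-squared, skewed sample: {a2_skewed:.3f}")
```

Larger values indicate a worse fit; the heavy right tail of the skewed sample should push its statistic well above that of the normal sample.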
Example:
Assume your dataset is named fish_data and you want to check the normality of a variable
body_weight.
proc univariate data=fish_data normal;
var body_weight;
histogram / normal;
qqplot / normal(mu=est sigma=est);
run;

3. Skewness and Kurtosis:


To check skewness and kurtosis in PROC UNIVARIATE:
proc univariate data=fish_data;
var body_weight;
output out=stats skewness=sk kurtosis=ku;
run;

This code will output the skewness and kurtosis into the stats dataset, where:
• Skewness (sk): Measures the asymmetry of the data.
• Kurtosis (ku): Measures the "tailedness" of the distribution.
A skewness value near 0 indicates a symmetric distribution. SAS reports excess kurtosis, so a
kurtosis value near 0 (not 3) is consistent with a normal distribution.
4. Checking Normality in a Report:
PROC MEANS and PROC SUMMARY can include skewness and kurtosis in a statistical summary
report, but they do not run normality tests; use PROC UNIVARIATE for the tests and for more
detailed output.

5. Transformations (If Data is Not Normally Distributed):


If your data is not normally distributed, you can try transforming it:
Example: Log Transformation:
data transformed_data;
set fish_data;
log_body_weight = log(body_weight);
run;

proc univariate data=transformed_data normal;
var log_body_weight;
histogram / normal;
qqplot / normal(mu=est sigma=est);
run;

Summary of Steps in SAS:


1. Run PROC UNIVARIATE with the normal option to check normality using visual plots and
statistical tests.
2. Use histograms and Q-Q plots for visual inspection.
3. Look at p-values from tests like Shapiro-Wilk to formally check normality.
4. Skewness and kurtosis can also provide insights into normality.
5. Apply transformations like log, square root, or Box-Cox if the data is not normally
distributed.
This method provides a comprehensive approach to normality testing in SAS.
Guidelines for Choosing a Method:
• Small datasets (n < 50): Prefer the Shapiro-Wilk test.
• Large datasets (n > 2000): Visual methods and Q-Q plots are generally more reliable, as most
tests will flag even minor deviations from normality as significant.
• Moderate datasets: Combine visual inspection with a test like the Shapiro-Wilk or
D'Agostino's test.
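The large-sample caveat can be illustrated with a small simulation. The sketch below draws data from a distribution that is only mildly non-normal (a mixture that is 90% N(0, 1) and 10% N(0, 1.5), an assumption chosen for this example) and applies a hand-rolled Jarque-Bera test at two sample sizes. The same deviation that is hard to detect in a small sample is flagged decisively in a large one. The function names, seed, and sample sizes are illustrative assumptions.

```python
import math
import random

def jarque_bera(data):
    """Return (JB statistic, asymptotic p-value) from moment-based skewness and kurtosis."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    return jb, math.exp(-jb / 2.0)  # chi-square survival function, 2 degrees of freedom

def mixture_sample(n, rng):
    """Mostly N(0, 1), with 10% of values drawn from a slightly wider N(0, 1.5)."""
    return [rng.gauss(0, 1.5) if rng.random() < 0.10 else rng.gauss(0, 1)
            for _ in range(n)]

rng = random.Random(7)  # fixed seed so the illustration is repeatable
jb_small, p_small = jarque_bera(mixture_sample(100, rng))
jb_large, p_large = jarque_bera(mixture_sample(50_000, rng))
print(f"n = 100:    JB = {jb_small:.2f}, p = {p_small:.4f}")
print(f"n = 50000:  JB = {jb_large:.2f}, p = {p_large:.3g}")
```

A tiny p-value in the large sample says the deviation is detectable, not that it is large enough to matter for the analysis, which is why a Q-Q plot is the better guide at that scale.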
Conclusion:
• Use visual methods for an intuitive understanding of the data's distribution.
• Use statistical tests for formal checks, but be cautious in large samples, as even minor
deviations from normality can lead to rejection of the null hypothesis.
