Checking the normality of a dataset
A goodness-of-fit test that compares the skewness and kurtosis of the sample data to a
normal distribution.
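The name of the test described above is cut off in the source. Two well-known tests fit the description, Jarque-Bera and D'Agostino-Pearson, so the sketch below runs both as an illustration rather than as the specific test the text had in mind. The sample data and the helper name moment_based_tests are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)               # fixed seed so the run is repeatable
normal_sample = rng.normal(size=2000)        # illustrative data, not from the source
skewed_sample = rng.exponential(size=2000)   # clearly non-normal for contrast


def moment_based_tests(x):
    """Return p-values of two tests built on sample skewness and kurtosis."""
    _, jb_pvalue = stats.jarque_bera(x)   # Jarque-Bera test
    _, k2_pvalue = stats.normaltest(x)    # D'Agostino-Pearson K^2 test
    return {"jarque_bera": jb_pvalue, "dagostino_pearson": k2_pvalue}


print("normal sample:", moment_based_tests(normal_sample))
print("skewed sample:", moment_based_tests(skewed_sample))
```

A small p-value is evidence against normality. A large p-value only means the test found no evidence against it.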
3. Skewness and Kurtosis:
Skewness measures the asymmetry of the data. For a normal distribution, skewness should
be close to 0.
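Kurtosis measures the "tailedness" of the distribution. For a normal distribution, kurtosis
should be close to 3. Many software packages report excess kurtosis (kurtosis minus 3) instead,
which should be close to 0 for normal data.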
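To make the two kurtosis conventions concrete, here is a minimal sketch that computes skewness and kurtosis from central moments and compares them with SciPy's results. The sample (mean 5, standard deviation 2) is an assumption chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.0, scale=2.0, size=100_000)  # illustrative normal data

# Central moments of the sample
centered = sample - sample.mean()
m2 = np.mean(centered**2)
m3 = np.mean(centered**3)
m4 = np.mean(centered**4)

skewness = m3 / m2**1.5            # about 0 for normal data
kurtosis = m4 / m2**2              # Pearson kurtosis, about 3 for normal data
excess_kurtosis = kurtosis - 3.0   # Fisher (excess) kurtosis, about 0

print(f"skewness        = {skewness:.4f}")
print(f"kurtosis        = {kurtosis:.4f}")
print(f"excess kurtosis = {excess_kurtosis:.4f}")

# SciPy's default estimators use the same (biased) moment formulas
print(np.isclose(skewness, stats.skew(sample)))
print(np.isclose(excess_kurtosis, stats.kurtosis(sample)))           # default: excess
print(np.isclose(kurtosis, stats.kurtosis(sample, fisher=False)))    # Pearson
```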
4. Transformations (If Data is Not Normally Distributed):
If the data is not normally distributed, transformations such as logarithmic, square root, or Box-Cox
transformations can sometimes help make the data more normal.
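The effect of these transformations can be sketched on right-skewed data. The lognormal sample below is an assumption for illustration. Box-Cox and the logarithm both require strictly positive values, so real data containing zeros or negatives would need a shift or a different transformation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
raw = rng.lognormal(mean=0.0, sigma=0.8, size=5000)  # right-skewed, strictly positive

log_data = np.log(raw)                            # logarithmic transformation
sqrt_data = np.sqrt(raw)                          # square-root transformation
boxcox_data, fitted_lambda = stats.boxcox(raw)    # lambda chosen by maximum likelihood

for name, values in [("raw", raw), ("log", log_data),
                     ("sqrt", sqrt_data), ("box-cox", boxcox_data)]:
    print(f"{name:8s} skewness = {stats.skew(values): .3f}")
print(f"Box-Cox lambda: {fitted_lambda:.3f}")
```

A skewness closer to 0 after the transformation suggests a more symmetric shape, but the transformed data should still be rechecked with the plots and tests described in this section.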
How to Perform in Python:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
# Sample data
data = np.random.normal(loc=0, scale=1, size=1000) # Normal distribution
# Histogram
plt.hist(data, bins=30, edgecolor='k')
plt.title('Histogram')
plt.show()
# Q-Q plot
stats.probplot(data, dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.show()
# Shapiro-Wilk Test
shapiro_test = stats.shapiro(data)
print(f"Shapiro-Wilk Test p-value: {shapiro_test.pvalue}")
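# Kolmogorov-Smirnov Test against a normal with the sample's own mean and SD.
# Without args, 'norm' means the standard normal N(0, 1). Because the parameters
# are estimated from the data, the p-value is only approximate.
ks_test = stats.kstest(data, 'norm', args=(np.mean(data), np.std(data, ddof=1)))
print(f"Kolmogorov-Smirnov Test p-value: {ks_test.pvalue}")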
# Anderson-Darling Test
anderson_test = stats.anderson(data, dist='norm')
print(anderson_test)
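# Skewness and Kurtosis
print(f"Skewness: {stats.skew(data)}")
# fisher=False reports Pearson kurtosis (about 3 for normal data);
# the default, fisher=True, reports excess kurtosis (about 0)
print(f"Kurtosis: {stats.kurtosis(data, fisher=False)}")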
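Printing the Anderson-Darling result alone does not give a decision. The sketch below assumes the long-standing result interface of stats.anderson (statistic, critical_values, significance_level) and compares the statistic with each critical value. The exponential sample is an assumption, chosen to be clearly non-normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.exponential(size=500)   # deliberately non-normal illustrative data

result = stats.anderson(sample, dist="norm")
print(f"A-D statistic: {result.statistic:.3f}")

# Normality is rejected at a given level when the statistic exceeds
# the critical value for that level.
for critical_value, level in zip(result.critical_values, result.significance_level):
    decision = "reject" if result.statistic > critical_value else "fail to reject"
    print(f"{level:4.1f}% level: critical value {critical_value:.3f} -> {decision} normality")
```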
How to Perform in R:
# Sample data
data <- rnorm(1000) # Normal distribution
# Histogram
hist(data, breaks=30, col="lightblue", main="Histogram")
# Q-Q plot
qqnorm(data)
qqline(data, col = "red")
# Shapiro-Wilk Test
shapiro.test(data)
# Kolmogorov-Smirnov Test
ks.test(data, "pnorm", mean=mean(data), sd=sd(data))
In SAS, the qqplot / normal(mu=est sigma=est); statement of PROC UNIVARIATE will generate a Q-Q plot
with a normal reference line whose mean and standard deviation are estimated from the data.
2. Statistical Tests for Normality:
You can run statistical tests like the Shapiro-Wilk test and others using the PROC UNIVARIATE
procedure.
proc univariate data=your_dataset normal;
var your_variable;
histogram / normal;
probplot / normal(mu=est sigma=est);
run;
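This code will generate a table of normality tests (including Shapiro-Wilk, Kolmogorov-Smirnov, and
Anderson-Darling), a histogram with a fitted normal curve, and a normal probability plot.
To save the skewness and kurtosis to a dataset, use an OUTPUT statement: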
proc univariate data=fish_data;
var body_weight;
output out=stats skewness=sk kurtosis=ku;
run;
This code will output the skewness and kurtosis into the stats dataset, where:
Skewness (sk): Measures the symmetry of the data.
Kurtosis (ku): Measures the "tailedness" of the distribution.
A skewness value near 0 indicates a symmetric distribution. SAS reports excess kurtosis (kurtosis
minus 3), so a kurtosis value near 0, not 3, is consistent with a normal distribution.
4. Checking Normality in a Report:
PROC MEANS and PROC SUMMARY can include skewness and kurtosis in a statistical summary report, but
they do not run formal normality tests. PROC UNIVARIATE provides the tests and more detailed output
for normality.