2071 TC2AILab5
Aim:
The purpose of this lab is to introduce hypothesis testing using statistical methods in Python, focusing
on hypothesis tests like the t-test, ANOVA, and chi-square test. By applying these techniques to the
well-known Iris dataset, you will learn how to test assumptions about population means and
relationships between categorical variables.
Hypothesis testing is a statistical method used to make inferences or draw conclusions about a
population based on a sample of data. It helps in determining whether there is enough evidence in a
sample of data to infer that a certain condition is true for the entire population.
• Null Hypothesis (H₀): The statement that there is no effect or no difference. It is assumed to
be true until the data provide enough evidence to reject it.
• Alternative Hypothesis (H₁): The statement that there is an effect or a difference. It is the
claim the test looks for evidence to support.
• p-value: The probability of observing results at least as extreme as those in the sample,
assuming the null hypothesis is true. A small p-value (< 0.05) indicates evidence against the
null hypothesis.
• Significance Level (α): A threshold (commonly 0.05) used to decide whether to reject the null
hypothesis.
• Test Statistic: A value calculated from the data used to determine whether to reject the null
hypothesis.
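The relationship between the p-value and the significance level can be sketched as a small decision rule. This is only an illustration: the function name decide and the example p-values are hypothetical and do not come from any dataset.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a given p-value and significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # Reject H0 only when the p-value falls below the threshold alpha.
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical p-values, chosen only to show both outcomes.
print(decide(0.003))  # small p-value: evidence against H0
print(decide(0.40))   # large p-value: not enough evidence against H0
```

Note that "fail to reject H0" is not the same as proving H0; it only means the sample did not provide enough evidence against it at the chosen α.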
The Iris dataset is one of the most famous datasets in the field of machine learning. It consists of 150
observations, with the following features:
▪ Sepal length (cm)
▪ Sepal width (cm)
▪ Petal length (cm)
▪ Petal width (cm)
▪ Species (Iris-setosa, Iris-versicolor, and Iris-virginica)
Each observation represents a different iris flower from one of the three species, and the dataset
contains measurements for each flower's sepals and petals.
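One possible way to load the dataset in Python is sketched below. It assumes scikit-learn and pandas are installed and uses the copy of Iris bundled with scikit-learn; the lab may supply the data differently (for example as a CSV file). In the scikit-learn copy the species are named setosa, versicolor, and virginica, without the "Iris-" prefix.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays.
iris = load_iris(as_frame=True)
df = iris.frame.copy()

# Replace the integer target codes (0, 1, 2) with species names.
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
df = df.drop(columns="target")

print(df.shape)                      # (rows, columns)
print(df["species"].value_counts())  # observations per species
print(df.head())                     # first few rows
```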
Objective: To test if there is a significant difference in the sepal lengths between the species
Iris-setosa and Iris-versicolor.
Hypotheses:
▪ Null Hypothesis (H₀): There is no significant difference between the mean sepal lengths of
setosa and versicolor species. (μ₁ = μ₂)
▪ Alternative Hypothesis (H₁): There is a significant difference between the mean sepal lengths of
setosa and versicolor species. (μ₁ ≠ μ₂)
Steps:
1. Select the data for the two species (setosa and versicolor).
2. Calculate the mean and standard deviation of the sepal lengths for both species.
3. Use a two-sample t-test to determine if the difference in means is statistically significant.
4. Calculate the t-statistic and p-value.
5. Compare the p-value with the significance level (α = 0.05) to decide whether to reject or fail to
reject the null hypothesis.
Interpretation:
- If the p-value is less than 0.05, reject the null hypothesis, meaning there is a statistically
significant difference in sepal lengths between setosa and versicolor.
- If the p-value is greater than or equal to 0.05, fail to reject the null hypothesis, meaning the
data do not provide sufficient evidence of a difference in mean sepal lengths.
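The t-test steps above can be sketched with SciPy as follows. This is a minimal sketch, assuming scikit-learn's bundled copy of Iris (target codes 0 = setosa, 1 = versicolor). The handout does not say which variant of the two-sample t-test to use; the sketch uses Welch's t-test (equal_var=False), which does not assume equal variances. Setting equal_var=True gives the classic Student's t-test instead.

```python
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
col = "sepal length (cm)"

# Step 1: select the two species (target codes 0 and 1 in scikit-learn's copy).
setosa = df.loc[df["target"] == 0, col]
versicolor = df.loc[df["target"] == 1, col]

# Step 2: descriptive statistics (pandas std() uses the sample formula, ddof=1).
for name, sample in [("setosa", setosa), ("versicolor", versicolor)]:
    print(f"{name}: mean={sample.mean():.3f}, sd={sample.std():.3f}, n={len(sample)}")

# Steps 3-4: Welch's two-sample t-test (an assumption; see the note above).
t_stat, p_value = stats.ttest_ind(setosa, versicolor, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")

# Step 5: compare the p-value with the significance level.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(decision)
```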
Objective:
To test if there is a significant difference in the sepal lengths across all three species (setosa, versicolor,
and virginica).
Hypotheses:
▪ Null Hypothesis (H₀): The means of sepal lengths are equal for all species. (μ₁ = μ₂ = μ₃)
▪ Alternative Hypothesis (H₁): At least one species has a different mean sepal length. (Not all of
μ₁, μ₂, μ₃ are equal.)
Steps:
1. Group the data by species and calculate the means and standard deviations for sepal
lengths.
2. Use the one-way ANOVA test to compare the means of sepal lengths across the three
species.
3. Calculate the F-statistic and p-value.
4. Compare the p-value with the significance level (α = 0.05) to decide whether to reject or
fail to reject the null hypothesis.
Interpretation:
▪ If the p-value is less than 0.05, reject the null hypothesis, indicating that at least one species has
a significantly different mean sepal length.
▪ If the p-value is greater than or equal to 0.05, fail to reject the null hypothesis, suggesting
that the data do not show a significant difference in mean sepal length across species.
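A minimal sketch of the ANOVA steps above, again assuming scikit-learn's bundled copy of Iris and SciPy's f_oneway function. A significant result only says that at least one mean differs; it does not say which one, so a follow-up (post hoc) comparison would be needed for that.

```python
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
col = "sepal length (cm)"

# Step 1: group by species and summarise sepal length.
print(df.groupby("species")[col].agg(["mean", "std", "count"]))

# Steps 2-3: one-way ANOVA across the three species.
groups = [g[col].to_numpy() for _, g in df.groupby("species")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3g}")

# Step 4: compare the p-value with the significance level.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(decision)
```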
Objective:
To test whether there is a relationship between species and different categories of sepal width (e.g.,
narrow, medium, wide).
Hypotheses:
- Null Hypothesis (H₀): There is no relationship between species and sepal width categories (i.e.,
the two variables are independent).
- Alternative Hypothesis (H₁): There is a relationship between species and sepal width categories
(i.e., the two variables are dependent).
Steps:
1. Divide the sepal width data into categories (e.g., narrow, medium, wide).
2. Create a contingency table showing the frequency of species across these categories.
3. Perform a chi-square test to determine if the distribution of species is independent of
sepal width categories.
4. Calculate the chi-square statistic and p-value.
5. Compare the p-value with the significance level (α = 0.05) to decide whether to reject or
fail to reject the null hypothesis.
Interpretation:
- If the p-value is less than 0.05, reject the null hypothesis, indicating that sepal width and species
are related (dependent).
- If the p-value is greater than or equal to 0.05, fail to reject the null hypothesis, suggesting
that the data show no significant association between sepal width and species.
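The chi-square steps above can be sketched as follows. The handout does not give cut points for the narrow, medium, and wide categories, so this sketch assumes a split at the terciles of sepal width using pandas qcut; any other binning is equally valid, and the column name width_cat is hypothetical. It also assumes scikit-learn's bundled copy of Iris.

```python
import pandas as pd
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Step 1: bin sepal width into three categories (tercile cut points are an assumption).
df["width_cat"] = pd.qcut(
    df["sepal width (cm)"], q=3, labels=["narrow", "medium", "wide"]
)

# Step 2: contingency table of species by width category.
table = pd.crosstab(df["species"], df["width_cat"])
print(table)

# Steps 3-4: chi-square test of independence.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3g}")
print(f"smallest expected count = {expected.min():.2f}")

# Step 5: compare the p-value with the significance level.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(decision)
```

The smallest expected count is printed because the chi-square approximation is usually considered reliable only when the expected frequencies are not too small (a common rule of thumb is at least 5 per cell).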
Conclusion