Chi-Square Test in ML
The Chi-Square test is a statistical test commonly used in machine learning for feature selection,
particularly with categorical data. It determines whether there is a significant association between
two categorical variables, i.e., whether the observed frequencies deviate significantly from the
frequencies expected under the assumption of independence.
1. Hypothesis Testing:
Null Hypothesis (H₀): The variables are independent (no association).
Alternative Hypothesis (H₁): The variables are dependent (association exists).
2. Chi-Square Statistic Formula:
χ² = Σ [ (Oᵢ − Eᵢ)² / Eᵢ ]
Where:
Oᵢ: Observed frequency.
Eᵢ: Expected frequency ( (row total × column total) / grand total ).
3. Degrees of Freedom:
df = (rows − 1) × (columns − 1)
4. P-value:
The calculated χ² statistic is compared with the critical value from the Chi-Square distribution
(with the degrees of freedom above) to obtain a p-value.
If p ≤ α (significance level, often 0.05), reject the null hypothesis; otherwise, fail to reject it.
A worked example of these four steps follows below.
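As a quick illustration of steps 1–4, the sketch below runs the test on a small, made-up 2×2 contingency table using scipy.stats.chi2_contingency; the counts are purely hypothetical.
python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table (e.g., rows = group, columns = outcome yes/no)
observed = np.array([[30, 10],
                     [20, 40]])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected frequencies computed as (row total * column total) / grand total
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"Chi-Square statistic: {chi2_stat:.3f}")
print(f"Degrees of freedom: {dof}")   # (2 - 1) * (2 - 1) = 1
print(f"P-value: {p_value:.4f}")
print("Expected frequencies:\n", expected)

# Decision at alpha = 0.05
if p_value <= 0.05:
    print("Reject H0: the variables appear to be associated.")
else:
    print("Fail to reject H0: no significant association detected.")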
In machine learning, the Chi-Square test is often used for feature selection to identify features that
have the strongest relationship with the target variable. This is particularly useful for classification
problems with categorical data.
python
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

# Sample dataset
data = {
    'Feature1': [1, 0, 1, 1, 0, 1, 0],
    'Feature2': [0, 1, 0, 0, 1, 1, 0],
    'Target': [1, 0, 1, 1, 0, 0, 1]
}
df = pd.DataFrame(data)

# Split into features and target
X = df[['Feature1', 'Feature2']]
y = df['Target']

# Compute Chi-Square scores and p-values for each feature
chi_scores, p_values = chi2(X, y)

# Keep the single best feature according to the Chi-Square test
selector = SelectKBest(score_func=chi2, k=1)
X_new = selector.fit_transform(X, y)

# Display results
for i, col in enumerate(X.columns):
    print(f"Feature: {col}, Chi-Square Score: {chi_scores[i]}, P-value: {p_values[i]}")
Output Interpretation
1. Chi-Square Scores: Indicate the strength of the association between each feature and the target.
2. P-values: Indicate statistical significance. A low p-value (e.g., < 0.05) suggests a statistically
significant association between the feature and the target.
3. Selected Features: The features retained by SelectKBest, i.e., the most relevant ones according to the test (see the snippet below for retrieving their names).
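One way to recover which columns were kept, assuming the selector and X objects from the snippet above, is via get_support(), which returns a boolean mask over the input columns.
python
# Names of the features retained by SelectKBest (boolean mask over X's columns)
selected_mask = selector.get_support()
selected_features = X.columns[selected_mask]
print("Selected features:", list(selected_features))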
When to Use the Chi-Square Test
Categorical Data: When the dataset includes categorical features and a categorical target.
Feature Selection: To eliminate irrelevant or less significant features.
Preprocessing: Before training classification models like Naive Bayes, Decision Trees, etc. (a pipeline sketch follows below).
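As a sketch of the preprocessing use case, the following combines Chi-Square feature selection with a classifier in a single scikit-learn Pipeline; the choice of k=1 and the DecisionTreeClassifier are illustrative assumptions, not a prescribed setup, and X and y are reused from the earlier snippet.
python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

# Chi-Square selection followed by a classifier; k and the model are illustrative choices
pipeline = Pipeline([
    ('select', SelectKBest(score_func=chi2, k=1)),
    ('clf', DecisionTreeClassifier(random_state=0))
])

# Reuses X and y from the earlier snippet
pipeline.fit(X, y)
print("Training accuracy:", pipeline.score(X, y))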
Limitations
Requires categorical data.
Assumes independence between observations.
Not suitable for continuous features without discretization.
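If you do want to apply the test to continuous features, one common workaround is to bin them first; the sketch below uses scikit-learn's KBinsDiscretizer with illustrative bin settings and made-up data.
python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import chi2

# Made-up continuous feature and binary target
X_cont = np.array([[1.2], [3.4], [2.2], [5.1], [0.7], [4.8]])
y_bin = np.array([0, 1, 0, 1, 0, 1])

# Discretize into ordinal bins so the Chi-Square test can be applied
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X_cont)

scores, pvals = chi2(X_binned, y_bin)
print("Chi-Square score:", scores[0], "p-value:", pvals[0])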
By leveraging the Chi-Square test effectively, you can streamline your feature selection process and
improve your model's performance.