
Chi-Square Test in ML

The Chi-Square test is a statistical method used in machine learning for feature selection, particularly with categorical data, to assess the association between two categorical variables. It involves hypothesis testing, calculating the Chi-Square statistic, and interpreting p-values to determine the relevance of features. The test is beneficial for identifying significant features in classification problems but has limitations, such as requiring categorical data and assuming independence between observations.

Uploaded by

Omkar Jethe
Copyright
© All Rights Reserved

Chi-Square Test in Machine Learning

The Chi-Square test is a statistical test commonly used in machine learning for feature selection,
particularly with categorical data. It helps to determine whether there is a significant association between
two categorical variables, that is, whether the observed frequencies deviate significantly from the
frequencies expected under the assumption of independence.

Key Concepts of the Chi-Square Test

1. Hypothesis Testing:
Null Hypothesis (H₀): The variables are independent (no association).
Alternative Hypothesis (H₁): The variables are dependent (association exists).
2. Chi-Square Statistic Formula:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Where:
$O_i$: Observed frequency in cell $i$.
$E_i$: Expected frequency in cell $i$ (calculated as $E_i = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$).

3. Degrees of Freedom:
df = (rows − 1) × (columns − 1)
4. P-value:
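The probability, under the null hypothesis, of obtaining a χ² statistic at least as large as the
one calculated, given the degrees of freedom. Equivalently, the calculated χ² statistic can be
compared with the critical value from the Chi-Square distribution table.
If p ≤ α (significance level, often 0.05), reject the null hypothesis.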
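
The formula, degrees of freedom, and p-value can be traced by hand on a small table. The following is a minimal sketch, assuming a made-up 2×2 contingency table (the counts and the variable names are illustrative only) and using SciPy's Chi-Square distribution to turn the statistic into a p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table (illustrative counts only):
# rows = feature categories, columns = target classes.
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# E = (row total x column total) / grand total, for every cell at once
expected = row_totals * col_totals / grand_total

# Chi-Square statistic: sum over cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# df = (rows - 1) x (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper-tail probability of the Chi-Square distribution
p_value = stats.chi2.sf(chi2_stat, dof)

print(f"chi2 = {chi2_stat:.4f}, df = {dof}, p = {p_value:.6f}")
```

For this table the expected counts are 20, 20, 30, and 30, so the statistic is 5 + 5 + 10/3 + 10/3 ≈ 16.67 with 1 degree of freedom, and the null hypothesis of independence would be rejected at α = 0.05.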

Application in Machine Learning

In machine learning, the Chi-Square test is often used for feature selection to identify features that
have the strongest relationship with the target variable. This is particularly useful for classification
problems with categorical data.

Steps to Perform Chi-Square Test in Feature Selection

1. Prepare the Data:


Ensure the data has categorical features and a categorical target variable.
Convert continuous variables into categorical ones (if needed).
Printed using ChatGPT to PDF, powered by PDFCrowd HTML to PDF API. 1/3
2. Create a Contingency Table:
Create a table that summarizes the relationship between a feature and the target variable.
3. Calculate Chi-Square Statistic:
Use the formula or a library function.
4. Interpret the Results:
Features with a higher Chi-Square statistic (and lower p-value) show stronger evidence of association with the target and are ranked as more relevant.
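
The steps above can be sketched for a single feature as follows. This is an illustration under assumptions: the `colour` and `label` columns are hypothetical toy data, `pd.crosstab` builds the contingency table, and `scipy.stats.chi2_contingency` computes the statistic. Passing `correction=False` turns off the Yates continuity correction that SciPy applies by default to 2×2 tables, so the result matches the plain formula.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Step 1: hypothetical categorical feature and categorical target
df_demo = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue", "red", "blue", "red", "blue"],
    "label":  ["yes", "yes", "no", "no", "yes", "no", "no", "yes"],
})

# Step 2: contingency table of feature vs. target
table = pd.crosstab(df_demo["colour"], df_demo["label"])

# Step 3: Chi-Square statistic, p-value, degrees of freedom, expected counts
stat, p, dof, expected = chi2_contingency(table, correction=False)

# Step 4: interpret against a chosen significance level
alpha = 0.05
print(table)
print(f"chi2 = {stat:.3f}, df = {dof}, p = {p:.4f}")
print("Reject H0" if p <= alpha else "Fail to reject H0")
```

With only eight rows, every expected count here is 2, well below the common rule of thumb of at least 5 per cell, so the p-value from such a small table should be treated as illustrative rather than reliable.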

Demo Code Using Python

python

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Sample dataset: two binary features and a binary target
data = {
    'Feature1': [1, 0, 1, 1, 0, 1, 0],
    'Feature2': [0, 1, 0, 0, 1, 1, 0],
    'Target': [1, 0, 1, 1, 0, 0, 1]
}
df = pd.DataFrame(data)

# Separate features and target
X = df[['Feature1', 'Feature2']]
y = df['Target']

# Perform the Chi-Square test (sklearn's chi2 requires non-negative feature values)
chi_scores, p_values = chi2(X, y)

# Display results
for i, col in enumerate(X.columns):
    print(f"Feature: {col}, Chi-Square Score: {chi_scores[i]}, P-value: {p_values[i]}")

# Feature selection using SelectKBest
selector = SelectKBest(chi2, k=1)  # Select top 1 feature
X_new = selector.fit_transform(X, y)
print("\nSelected Features:\n", X.columns[selector.get_support()])

Output Interpretation

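1. Chi-Square Scores: Measure how far the observed counts deviate from those expected under independence; a higher score means stronger evidence of an association between the feature and the target.
2. P-values: Help decide statistical significance. A low p-value (e.g., < 0.05) indicates that the
association is unlikely to be due to chance alone; it does not by itself measure how strong the relationship is.
3. Selected Features: The k highest-scoring features according to the test.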
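
One way to read these three outputs together is to collect them in a single table and filter by a significance level. The sketch below reuses the toy data from the demo; the `results` table, the `alpha` threshold, and the `significant` list are illustrative names, and `k="all"` is used only so that scores for every feature are kept.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Same toy data as the demo above
X = pd.DataFrame({
    "Feature1": [1, 0, 1, 1, 0, 1, 0],
    "Feature2": [0, 1, 0, 0, 1, 1, 0],
})
y = pd.Series([1, 0, 1, 1, 0, 0, 1], name="Target")

# Fit without dropping anything so every feature keeps its score
selector = SelectKBest(chi2, k="all").fit(X, y)

# One row per feature, most significant first
results = pd.DataFrame({
    "feature": X.columns,
    "chi2": selector.scores_,
    "p_value": selector.pvalues_,
}).sort_values("p_value").reset_index(drop=True)

alpha = 0.05  # assumed significance level
significant = results.loc[results["p_value"] <= alpha, "feature"].tolist()

print(results)
print("Significant at alpha = 0.05:", significant)
```

On this data only Feature2 falls below the 0.05 threshold, which agrees with the feature that `SelectKBest(chi2, k=1)` picks. With seven rows this is a demonstration of the mechanics, not a trustworthy significance result.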

When to Use the Chi-Square Test?

Categorical Data: When the dataset includes categorical features and a categorical target.
Feature Selection: To eliminate irrelevant or less significant features.
Preprocessing: Before training classification models like Naive Bayes, Decision Trees, etc.

Limitations

Requires categorical data.
Assumes independence between observations.
Not suitable for continuous features without discretization.
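
A continuous feature can still be tested after it is binned into categories. The following is a rough sketch, assuming a hypothetical `ages` feature, a hypothetical binary `bought` target, and arbitrarily chosen bin edges; in practice the choice of bins affects the result.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical continuous feature and binary target
ages = pd.Series([22, 25, 31, 38, 45, 52, 58, 63, 29, 47])
bought = pd.Series([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

# Discretize into three illustrative bins: (0, 35], (35, 50], (50, 100]
age_group = pd.cut(ages, bins=[0, 35, 50, 100],
                   labels=["young", "middle", "senior"])

# Contingency table of the binned feature against the target
table = pd.crosstab(age_group, bought)

# 3 rows x 2 columns gives df = (3 - 1) x (2 - 1) = 2
stat, p, dof, _ = chi2_contingency(table)

print(table)
print(f"chi2 = {stat:.3f}, df = {dof}, p = {p:.4f}")
```

scikit-learn's `KBinsDiscretizer` is an alternative when the binning should be part of a preprocessing pipeline.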

By applying the Chi-Square test to rank categorical features against the target, you can streamline
your feature selection process, reduce the number of inputs, and potentially improve your model's performance.
