Maths report (2)

The document provides a comprehensive overview of the Chi-Square distribution, including its definition, formula, and applications in statistics. It outlines the steps for conducting a Chi-Square test, including hypothesis formulation, data organization, and result interpretation. Additionally, it discusses the advantages and limitations of the Chi-Square test, along with practical examples and a C program for computation.

Uploaded by

patilchinmay510
Copyright
© © All Rights Reserved

Table of Contents

Chi-Square Distribution
What is a Chi-Square Test?
Chi-Square Test Formula
Degrees of Freedom
Steps for Chi-Square Test
Example for Chi-Square Test
C Program
Applications of Chi-Square Distribution
Advantages
Limitations
Conclusion
References

CHI SQUARE DISTRIBUTION

The chi-square distribution is a continuous probability distribution
commonly used in statistics, especially in hypothesis testing and inferential
statistics. It is denoted by χ² and arises as the distribution of the sum of
the squares of k independent standard normal random variables, where k is the
number of degrees of freedom.

WHAT IS A CHI-SQUARE TEST?

The Chi-Square test is a statistical procedure for determining whether the
difference between observed and expected data is significant. It can also be
used to decide whether two categorical variables in our data are related: it
helps to determine whether an observed association between two categorical
variables is due to chance or reflects a real relationship between them.
Handling data often involves testing hypotheses to extract useful information.
In categorical analysis, chi-square tests are used to determine whether
observed patterns are likely to be purely random.
A chi-square test or a comparable nonparametric test is required to test a
hypothesis regarding the distribution of a categorical variable. Categorical
variables, which indicate categories such as animals or countries, can be
nominal or ordinal. They cannot follow a normal distribution since they take
only a few particular values.

CHI-SQUARE TEST FORMULA

χ² = Σ (Oi − Ei)² / Ei

Where,
• Σ (sigma): The summation symbol. The term that follows it is computed for
each cell of the contingency table and the results are added together.
• Oi: The observed frequency, i.e. the number of observations actually
counted in a given cell of the contingency table.
• Ei: The expected frequency, i.e. the number of observations you would
expect in that cell if the hypothesis of no association (the null
hypothesis) were true.
• (Oi - Ei): The difference between the observed and expected frequencies
in a cell.
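
As a minimal sketch of how these terms combine, the function below adds up (Oi − Ei)² / Ei over all cells. The function name and the convention of returning -1.0 for a non-positive expected frequency are assumptions made for this illustration, not part of the report.

```c
#include <stddef.h>

/* Pearson chi-square statistic: sum over all cells of (O - E)^2 / E.
 * Returns -1.0 if any expected frequency is not positive, because the
 * corresponding term would be undefined (assumed error convention). */
double chi_square_statistic(const double *observed,
                            const double *expected, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (expected[i] <= 0.0)
            return -1.0;
        double diff = observed[i] - expected[i];
        sum += diff * diff / expected[i];
    }
    return sum;
}
```

Calling it with observed counts 15, 6, 4, 7, 11, 17 and an expected count of 10 per cell gives a value of about 13.6, which matches the die example worked out later in this report.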

Degrees of Freedom (df)


The degrees of freedom in a Chi-square test represent the number of
independent values that can vary while still satisfying given constraints (e.g.,
row and column totals in a contingency table).
Formula:
• For a contingency table (Test of Independence):
df = (r−1) (c−1)
Where, r is the number of rows and c is the number of columns.
• For a goodness-of-fit test:
df = k − 1
Where, k is the number of categories.
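
The two formulas translate directly into code. The helper names below are hypothetical and the sketch assumes at least two rows, two columns, or two categories are passed in.

```c
/* Degrees of freedom for a test of independence on an r x c table. */
int df_independence(int rows, int cols)
{
    return (rows - 1) * (cols - 1);
}

/* Degrees of freedom for a goodness-of-fit test with k categories. */
int df_goodness_of_fit(int categories)
{
    return categories - 1;
}
```

For example, a table with 2 rows and 3 columns has (2 − 1)(3 − 1) = 2 degrees of freedom, and a die with 6 faces has 6 − 1 = 5.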

Observed Value (Oi)


The observed values are the actual data or frequencies collected in a
study. These are typically arranged in a table format.

Expected Value (Ei)
The expected values are the theoretical frequencies that would occur if
the null hypothesis were true. They are calculated based on the assumption of
no association (for independence tests) or a specific distribution (for
goodness-of-fit tests).

Steps for Chi-Square Test-

Step 1 : Defining Hypothesis

• Null Hypothesis (H₀): There is no relationship between the two
categorical variables under study. Any differences or patterns observed
are assumed to be the result of random chance. Starting from this
assumption guards the analysis against bias.

• Alternative Hypothesis (H₁): There is a relationship between the two
categorical variables under study, so the observed pattern reflects an
actual association rather than mere coincidence.

Step 2 : Gather and Organize Data

• Before conducting a chi-square test, collect data on the two categorical
variables you want to analyze. For instance, if you are interested in the
relationship between gender and preferred ice cream flavor, you must
record each person's sex (male or female) and favorite flavor (e.g.,
chocolate, vanilla, strawberry).

• Once this information is collected, it can be entered into a contingency
table.

• A contingency table captures every combination of the values of the two
variables. The values of one variable appear as columns, while the values
of the other variable appear as rows. For instance, the table shows how
many females preferred vanilla ice cream.

The hypothesis is that men prefer vanilla while women prefer chocolate, so
we need to record how many male respondents chose vanilla and how many
female respondents chose chocolate.

In the contingency table for this example:
• The table has two dimensions, gender and ice cream flavor. The row
headings are the male and female categories, and the column headings are
the chocolate, vanilla and strawberry flavors. Each cell contains the
count for one combination of categories. A chi-square test is conducted
on this table to examine the association between the two categorical
variables.

Step 3 : Calculate Expected Frequencies

• Expected Frequency: For any specific cell, the expected frequency is the
number of occurrences that would be expected if the two variables were
independent.

• Expected Frequency Calculation: To compute the expected frequency of a
cell, multiply the total of its row by the total of its column, then
divide by the total number of observations in the table:
Ei = (row total × column total) / grand total
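
The calculation (row total × column total divided by the grand total) can be sketched as follows. The function name, the row-by-row array layout and the -1 error return are assumptions for illustration.

```c
#include <stddef.h>

/* Fill expected[] for a rows x cols table stored row by row:
 * E(i, j) = (row i total * column j total) / grand total.
 * Returns 0 on success, -1 if the grand total is not positive. */
int expected_frequencies(const double *observed, double *expected,
                         size_t rows, size_t cols)
{
    double grand_total = 0.0;
    for (size_t i = 0; i < rows * cols; i++)
        grand_total += observed[i];
    if (grand_total <= 0.0)
        return -1;

    for (size_t i = 0; i < rows; i++) {
        double row_total = 0.0;
        for (size_t j = 0; j < cols; j++)
            row_total += observed[i * cols + j];

        for (size_t j = 0; j < cols; j++) {
            double col_total = 0.0;
            for (size_t r = 0; r < rows; r++)
                col_total += observed[r * cols + j];
            expected[i * cols + j] = row_total * col_total / grand_total;
        }
    }
    return 0;
}
```

With purely hypothetical counts (not taken from this report) of 20, 30, 10 in the first row and 30, 20, 10 in the second, both row totals are 60, the column totals are 50, 50 and 20, and the grand total is 120. The expected frequencies are then 25, 25 and 10 in each row.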

Step 4 : Perform Chi-Square Test

Formula : χ² = Σ (Oi − Ei)² / Ei

Step 5 : Determine Degrees of Freedom (df)

df = (number of rows − 1) × (number of columns − 1)   (test of independence)
or
df = k − 1   (goodness-of-fit test with k categories)

Step 6: Find p-value

• A chi-square table can be used to get the p-value for the calculated
chi-square statistic (χ²) with the given degrees of freedom (df). The
table lists the probabilities of various values of the chi-square
statistic for different degrees of freedom.

• The p-value is the probability of obtaining a chi-square statistic at
least as large as the one observed if the null hypothesis is correct. In
other words, it shows how likely the observed data would be if there were
no association between the variables.

Step 7: Interpret Results

• If the p-value is less than the chosen significance level, commonly
denoted by α (e.g., 0.05), then we reject the null hypothesis. This means
there is a statistically significant association between the categorical
variables.

• If the p-value is greater than α, we cannot reject the null hypothesis,
so there is insufficient evidence to establish a relationship between
these variables.
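
Comparing the p-value with α is equivalent to comparing the calculated statistic with the critical value for the same α and df. The sketch below takes that route for α = 0.05 only, using standard chi-square table values for df = 1 to 10. The function name and the -1 return for an unsupported df are assumptions for illustration.

```c
/* Critical values of the chi-square distribution at alpha = 0.05
 * for df = 1..10 (standard table values). */
static const double CRITICAL_05[] = {
    3.841, 5.991, 7.815, 9.488, 11.070,
    12.592, 14.067, 15.507, 16.919, 18.307
};

/* Returns 1 if H0 is rejected at the 5% level, 0 if it is not,
 * and -1 if df is outside the range covered by the table above. */
int reject_null_at_5_percent(double chi_square, int df)
{
    if (df < 1 || df > 10)
        return -1;
    return chi_square > CRITICAL_05[df - 1];
}
```

For a statistic of 13.6 with 5 degrees of freedom the function returns 1, because 13.6 exceeds the critical value 11.070.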

EXAMPLE-
1. A die is thrown 60 times and the frequency distribution for the number
appearing on the face x is given by the following table:

Faces       1    2    3    4    5    6
Frequency   15   6    4    7    11   17

Test whether the die is unbiased at the 5% level of significance.

SOLUTION -

Calculate the expected frequency for each face under the null hypothesis
H₀ that the die is unbiased. If the die is unbiased, each face should
appear equally often. Therefore, the expected frequency for each face is
60/6 = 10 times.

Ei = N × P(Xi) = 60 × 1/6 = 10

Observed frequencies (Oi): 15, 6, 4, 7, 11, 17

Expected frequencies (Ei): 10, 10, 10, 10, 10, 10

Xi      Oi    Ei    (Oi − Ei)²    (Oi − Ei)²/Ei
1       15    10    25            2.5
2       6     10    16            1.6
3       4     10    36            3.6
4       7     10    9             0.9
5       11    10    1             0.1
6       17    10    49            4.9
Total                             13.6

Level of Significance = α = 0.05
Degrees of Freedom = n − 1 = 6 − 1 = 5
Calculated value = 13.6
Critical value = 11.070 (from the chi-square table)
So, Calculated value > Critical value
Conclusion: H₀ is rejected, i.e. the die is biased.

C Program to Compute the Chi-Square Statistic -

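#include <stdio.h>

int main(void)
{
    int n, i;
    double chi_square = 0.0;

    // Input number of categories (upper bound of 1000 is an arbitrary
    // limit that keeps the variable-length arrays below small)
    printf("Enter the number of categories: ");
    if (scanf("%d", &n) != 1 || n < 1 || n > 1000)
    {
        printf("Invalid number of categories.\n");
        return 1;
    }

    double observed[n], expected[n];

    // Input observed frequencies
    printf("Enter the observed frequencies:\n");
    for (i = 0; i < n; i++)
    {
        printf("Observed[%d]: ", i + 1);
        if (scanf("%lf", &observed[i]) != 1)
        {
            printf("Invalid input.\n");
            return 1;
        }
    }

    // Input expected frequencies (each must be positive to avoid
    // division by zero)
    printf("Enter the expected frequencies:\n");
    for (i = 0; i < n; i++)
    {
        printf("Expected[%d]: ", i + 1);
        if (scanf("%lf", &expected[i]) != 1 || expected[i] <= 0.0)
        {
            printf("Invalid input.\n");
            return 1;
        }
    }

    // Calculate Chi-Square value: sum of (O - E)^2 / E
    for (i = 0; i < n; i++)
    {
        double diff = observed[i] - expected[i];
        chi_square += diff * diff / expected[i];
    }

    // Output the result
    printf("\nChi-Square Calculated Value = %.4f\n", chi_square);

    return 0;
}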

OUTPUT -

Applications of Chi-Square Distribution in Computer Science-

The Chi-Square distribution is widely used in computer science for various
statistical and machine learning applications. Here are some of the primary
applications:

1. Feature Selection in Machine Learning

• The Chi-Square test is commonly used to select features in classification
problems.
• It evaluates the relationship between features (independent variables)
and target classes (dependent variables).
• Features that show higher Chi-Square values (greater dependency with
the target class) are considered significant and are selected for model
training.
• Example: In text classification, Chi-Square can help determine which
words (features) are most relevant to the target category.

2. Goodness of Fit Testing

• Chi-Square tests are used to check whether a sample fits a given
theoretical distribution (like normal, uniform, or Poisson).
• In computer science, this can be useful for validating random number
generators or other probabilistic models.
• Example: Testing whether a pseudo-random number generator produces
numbers that match the desired uniform distribution.
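
A rough sketch of such a uniformity check is given below. It assumes the generator's outputs have already been mapped to bucket indices in the range 0 to buckets − 1. The function name and the -1.0 error return are hypothetical choices for this illustration.

```c
#include <stdlib.h>

/* Chi-square statistic for testing whether bucket indices are uniform.
 * samples[] holds n values, each assumed to lie in [0, buckets).
 * Returns -1.0 on invalid input or allocation failure. */
double uniformity_chi_square(const int *samples, size_t n, size_t buckets)
{
    if (n == 0 || buckets == 0)
        return -1.0;

    size_t *counts = calloc(buckets, sizeof *counts);
    if (counts == NULL)
        return -1.0;

    /* Count how many samples fall into each bucket. */
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < 0 || (size_t)samples[i] >= buckets) {
            free(counts);
            return -1.0;
        }
        counts[samples[i]]++;
    }

    /* Under uniformity every bucket expects n / buckets samples. */
    double expected = (double)n / (double)buckets;
    double sum = 0.0;
    for (size_t b = 0; b < buckets; b++) {
        double diff = (double)counts[b] - expected;
        sum += diff * diff / expected;
    }
    free(counts);
    return sum;
}
```

The returned statistic would then be compared with the critical value for buckets − 1 degrees of freedom. A perfectly even spread gives 0, and larger values indicate a stronger departure from uniformity.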

3. Testing Independence in Data Mining

• Chi-Square tests are used to determine whether two categorical variables
are independent of each other in large datasets.
• This is useful in association rule mining (e.g., market basket analysis) to
identify dependencies between items.
• Example: Finding relationships between products in transaction data to
discover patterns like "people who buy X often buy Y."

4. Anomaly Detection

• Chi-Square tests can help detect outliers or anomalies in datasets.
• Observed data that significantly deviate from the expected frequencies
can be flagged as anomalies.
• Example: In cybersecurity, Chi-Square tests can identify unusual
patterns in network traffic or log data.

5. Natural Language Processing (NLP)

• In text mining and NLP, the Chi-Square test helps identify significant
terms for document classification.
• Chi-Square evaluates the dependency between words (features) and
document classes.
• Example: Determining which words are most significant for categorizing
emails as spam or non-spam.

6. Image Processing

• In image classification and pattern recognition, Chi-Square distances can
be used as a similarity metric.
• Chi-Square distance measures the difference between two distributions,
such as the pixel-intensity histograms of two images.
• Example: Comparing histograms of images to detect changes or classify
image types.
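
Several variants of the chi-square distance are in use. The sketch below implements one common form, half the sum of (a − b)² / (a + b) over the bins of two histograms, and skips bins that are empty in both. The function name is hypothetical.

```c
#include <stddef.h>

/* One common form of the chi-square distance between two histograms:
 * 0.5 * sum of (a - b)^2 / (a + b), skipping bins empty in both. */
double chi_square_distance(const double *a, const double *b, size_t bins)
{
    double sum = 0.0;
    for (size_t i = 0; i < bins; i++) {
        double total = a[i] + b[i];
        if (total > 0.0) {
            double diff = a[i] - b[i];
            sum += diff * diff / total;
        }
    }
    return 0.5 * sum;
}
```

Identical histograms give a distance of 0, and the distance grows as the two histograms put their mass in different bins.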

7. Quality of Software Testing

• In software engineering, Chi-Square tests are applied to software
reliability and error detection models.
• It helps to check whether software failures or defects follow a specific
statistical distribution.
• Example: Evaluating whether software errors are randomly distributed
over modules.

8. Bioinformatics and Data Analysis

• Chi-Square tests are used in biological datasets to check statistical
relationships.
• In computer science, this applies to fields like bioinformatics, where
large datasets of gene expression or protein interactions are analyzed.

Advantages -

• Simplicity : The chi-square test is simple to calculate and understand.
• Versatility : It can be applied to various research questions using
categorical data.
• Non-parametric Nature : It is a non-parametric test and does not require
assumptions about the population distribution.

Limitations -

• Sensitivity to Sample Size : It can be influenced by sample size,
potentially losing power with small samples or showing significance for
trivial differences in large samples.
• Limited Information : It mainly shows if an association exists without
detailing its strength or direction.
• Assumption of Independence : It assumes independent observations,
which may not always be true in real-world data.

Conclusion -
The conclusion for a Chi-Square distribution analysis depends on the
context in which it is applied. However, a general conclusion for this
statistical method includes the following points:
1. Purpose:
The Chi-Square test evaluates whether there is a significant association
or difference between observed and expected frequencies in categorical
data.
2. Key Outcomes:

• If the p-value is less than the significance level (e.g., 0.05), we
reject the null hypothesis, indicating that there is a significant
association or difference.
• If the p-value is greater than the significance level, we fail to reject
the null hypothesis, meaning there is insufficient evidence to
suggest a significant association or difference.

3. Interpretation:

• For a Goodness-of-Fit Test, the conclusion determines whether the
observed data fit the expected distribution.
• For a Test of Independence, the conclusion reveals whether two
categorical variables are independent or related.

4. Limitations:

• The test assumes a sufficiently large sample size and that expected
frequencies in each category are generally ≥ 5. If these assumptions
are violated, the results may not be reliable.

References :

1. GeeksforGeeks
https://www.geeksforgeeks.org/chi-square-test/

2. Wikipedia
https://en.wikipedia.org/wiki/Chi-squared_distribution

3. https://math.arizona.edu/~jwatkins/chi-square-table.pdf

4. https://www.simplilearn.com/tutorials/statistics-tutorial/chi-square-test

5. https://www.w3schools.com/python/numpy/numpy_random_chisquare.asp

6. Higher Engineering Mathematics by Dr. B.S. Grewal
