Module 5a Chi Square - Introduction - Goodness of Fit Test
https://www.youtube.com/watch?v=7_cs1YlZoug
The Chi-square Distribution
The chi-square distribution is a continuous probability distribution. It is the distribution of the sum
of the squares of k independent standard normal random variables, where k is the number of degrees of freedom.
As the degrees of freedom increase, the chi-square distribution approaches a normal distribution.
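The definition above can be checked with a small simulation. The sketch below uses only Python's standard library; the function name `chi_square_sample`, the seed, and the sample sizes are illustrative choices, not part of the module. It draws sums of k squared standard normal values and compares their average with k, which is the mean of a chi-square distribution with k degrees of freedom.

```python
import random

def chi_square_sample(k, rng):
    """Draw one value as the sum of k squared standard normal variates."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

rng = random.Random(42)          # fixed seed so the run is repeatable
k = 3                            # degrees of freedom (illustrative choice)
draws = [chi_square_sample(k, rng) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
print(f"sample mean = {sample_mean:.2f} (theoretical mean = {k})")
```

The sample mean should land close to k, and every draw is non-negative because it is a sum of squares.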
Characteristics of a Chi-square Distribution:
A chi-square test
▪ can also be used to test the homogeneity of proportions.
▪ is used in this way to determine whether the proportions for a variable are equal when several samples
are selected from different populations.
▪ also uses the chi-square distribution and a contingency table.
For example,
You would like to see if the proportions of students who play online games are
equal across programs, say the proportions of accountancy students,
engineering students, and architecture students who play online games.
You may want to see if the proportions of employees who invest in the stock market are equal
across professions (IT, Medicine, Accounting, Engineering).
The two main types of chi-square tests discussed here are:
Goodness-of-fit tests, which focus on one categorical variable.
Tests of independence, which focus on the relationship between two categorical variables.
Thus, a contingency table (or cross-tabulation table) will be used to present the data values.
To illustrate the use of the chi-square test:
If, according to Mendel's laws, you expect 10 of 20 offspring to
be male and the actual observed number was 8 males, then you
might want to know about the "goodness of fit" between the
observed and expected data.
Were the deviations (differences between observed and expected
values) the result of chance, or were they due to other factors?
How much deviation can occur before we conclude that
something other than chance is at work, causing the observed values to
differ from the expected values?
The chi-square test always tests what scientists call the
null hypothesis, which states that there is no significant
difference between the expected and observed results.
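As a rough sketch of the Mendel illustration, the statistic can be computed in a few lines of Python. The female counts (12 observed, 10 expected) are inferred from the 20 offspring, and the 0.05 significance level with its table value 3.841 for df = 1 is an assumption, since the illustration does not state one.

```python
# Observed and expected counts for the 20 offspring: [males, females].
# The female counts are inferred as 20 minus the male counts.
observed = [8, 12]
expected = [10, 10]

# Chi-square statistic: sum of (O - E)^2 / E over the categories.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumed setup: alpha = 0.05 and df = 2 - 1 = 1, table value 3.841.
critical_value = 3.841
reject_null = chi_sq > critical_value

print(f"chi-square = {chi_sq:.2f}, reject Ho: {reject_null}")
```

Under these assumptions, a deviation of 2 from the expected count is well within what chance alone could produce.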
Test for Goodness-of-Fit
Definition:
The chi-square goodness-of-fit test is used to test the
claim that an observed frequency distribution fits some
given expected frequency distribution.
The observed frequencies will almost always differ from the expected frequencies due to
sampling error; that is, the values differ from sample to sample. But the question is: are these
differences significant (for example, is there a real difference in the life span of laptop batteries
across categories), or are they due to chance only? Two opposing statements, the null and
alternative hypotheses, are therefore needed before computing the test value. Here, the null hypothesis
states that there is no difference or change among the categories.
Ho: There is no difference in the life span of laptop batteries among the three categories.
H1: There is a difference in the life span of laptop batteries among the three categories.
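Summary of Procedures in Conducting a Chi-Square Goodness-of-Fit Test:
Step 1: State the hypotheses and identify the claim.
Step 2: Find the critical value from the chi-square table, using df = k − 1 for k categories.
Step 3: Compute the test value,
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O is the observed frequency and E is the expected frequency of each category.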
Step 4: Make the decision.
Reject the null hypothesis if the test value is greater than the critical
value.
Do not reject the null hypothesis if the test value is less than the critical
value.
Step 5: Summarize the results.
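The computation and decision steps of this procedure can be sketched as a small Python function. The name `goodness_of_fit` and the sample counts are hypothetical, chosen only for illustration; the critical value still has to be read from a chi-square table, as in Step 2.

```python
def goodness_of_fit(observed, expected, critical_value):
    """Return the chi-square test value and the Step 4 decision."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Test value: sum of (O - E)^2 / E over all categories.
    test_value = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Decision rule: reject Ho only when the test value exceeds the critical value.
    decision = "reject Ho" if test_value > critical_value else "do not reject Ho"
    return test_value, decision

# Hypothetical data: 60 observations over three categories, equal expected counts.
# 5.991 is the table value for alpha = 0.05 and df = 3 - 1 = 2.
value, decision = goodness_of_fit([18, 22, 20], [20, 20, 20], critical_value=5.991)
print(f"test value = {value:.2f}: {decision}")
```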
Example 1:
A quality control officer of a laptop manufacturing company would like to see
if the life spans of laptop batteries are equally distributed among three categories.
A sample of 45 student laptop owners is selected. The table below shows the
distribution of the life span of laptop batteries in years. At α = 0.05, can it be
concluded that the life spans of laptop batteries are equally distributed among the
three categories?

Category              4 years and below    More than 4 years and below 10 years    Above 10 years
Observed frequency    12                   19                                      14

Note that this problem involves only one categorical variable, the life span of laptop batteries, classified into
three categories (4 years and below, more than 4 years and below 10 years, above 10 years), so we use the
goodness-of-fit test.
Solution:
Step 1: State the hypotheses and identify the claim.
Ho: The life spans of laptop batteries are equally distributed over the three
categories. (claim)
(Which is the same as saying that “There is no difference in
the life span of laptop batteries among the three categories.”)
H1: The life spans of laptop batteries are NOT equally distributed.
(Which is the same as saying that “There is a difference in the
life span of laptop batteries among the three categories.”)
Step 2: Find the critical value. At α = 0.05 and df = 3-1 = 2, locate the
critical value from the chi-square table. Thus, the critical value
is 5.991.
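For df = 2 specifically, the table value can be cross-checked without a table, because the right-tail area of the chi-square distribution with 2 degrees of freedom is exp(−x/2). Setting this equal to α and solving gives x = −2 ln α. This shortcut applies to the df = 2 case only; the variable names below are illustrative.

```python
import math

alpha = 0.05
# For df = 2, P(X > x) = exp(-x / 2), so the critical value solves exp(-x / 2) = alpha.
critical_value = -2.0 * math.log(alpha)
print(f"critical value = {critical_value:.3f}")
```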
Step 3: Compute the test value
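To compute the test value, we first solve for the expected value E, where n is the sample size and k is the number of categories:
$$E = \frac{n}{k} = \frac{45}{3} = 15$$

Category              4 years and below    More than 4 years and below 10 years    Above 10 years
Observed frequency    12                   19                                      14
Expected frequency    15                   15                                      15

Then the test value is
$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(12 - 15)^2}{15} + \frac{(19 - 15)^2}{15} + \frac{(14 - 15)^2}{15} = \frac{26}{15} \approx 1.73$$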
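Step 4: Make the decision. Do not reject the null hypothesis, since the test value
1.73 is less than the critical value 5.991 (1.73 < 5.991).
Step 5: Summarize the results. There is not enough evidence to reject the claim that the
life spans of laptop batteries are equally distributed among the three categories.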
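The Example 1 arithmetic can be reproduced with a short Python sketch. The variable names are illustrative; the counts and the critical value are the ones given in the example.

```python
observed = [12, 19, 14]           # Example 1 observed frequencies
n, k = sum(observed), len(observed)
expected = [n / k] * k            # equal expected frequency, E = n / k = 15

# Test value: sum of (O - E)^2 / E over the three categories.
test_value = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical_value = 5.991            # from Step 2 (alpha = 0.05, df = 2)

print(f"E = {expected[0]:.0f}, test value = {test_value:.2f}")
print("reject Ho" if test_value > critical_value else "do not reject Ho")
```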
[Figure: three bar charts, (A), (B), and (C), each comparing observed and expected frequencies across categories 1, 2, and 3]
From (A), the observed values and the expected values are close together, indicating that the chi-square test value will be small. The decision will be “do not reject the null hypothesis”; hence, there is “a good fit”.
From (B), the observed values and the expected values are far apart, so the chi-square test value will be large. Then “the null hypothesis will be rejected”; hence, there is “not a good fit”.
From (C), the observed values and the expected values are far apart, so the chi-square test value will be large. Then “the null hypothesis will be rejected”; hence, there is “not a good fit”.
Example 2:
A financial analyst wants to determine whether investors have any preference for the type of
investment. A sample of 93 investors was interviewed and provided the information shown in the
table below. At the 0.10 level of significance, is there a difference in investment preferences among
the investors?
Note that this problem involves only one categorical variable, the type of investment, classified into four
categories (stocks, mutual funds, bonds, index funds), so we use the goodness-of-fit test.
Solution:
Step 1: State the hypotheses and identify the claim.
Ho: Investors show no preferences.
(Which is the same as saying that “There is no difference in the
preferences for the type of investment among investors.”)
H1: Investors show preferences. (claim)
(Which is the same as saying that “There is a difference in the
preferences for the type of investment among investors.”)
Step 2: Find the critical value. At α = 0.10 and df = 4-1 = 3, locate the
critical value from the chi-square table. Thus, the critical value
is 6.251.
Step 3: Compute the test value
To compute the test value, we first solve for the expected value E:
$$E = \frac{n}{k} = \frac{93}{4} = 23.25$$

Types of Investment    Observed Frequency    Expected Frequency
Stocks                 35                    23.25
Mutual Funds           18                    23.25
Bonds                  30                    23.25
Index Funds            10                    23.25

Then the test value $\chi^2$ is
$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(35 - 23.25)^2}{23.25} + \frac{(18 - 23.25)^2}{23.25} + \frac{(30 - 23.25)^2}{23.25} + \frac{(10 - 23.25)^2}{23.25} = \frac{386.75}{23.25} \approx 16.63$$
$\chi^2 \approx 16.63$ (test value, computed value, or test statistic)
Step 4: Make the decision. Reject the null hypothesis, since the test value 16.63
is greater than the critical value 6.251 (16.63 > 6.251).
Step 5: Summarize the results. There is a difference in the preferences for the type
of investment among investors. The investors do in fact show preferences.
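A short Python sketch can break the Example 2 test value into per-category contributions, which shows which investment types drive the result. It uses the exact expected count n/k = 93/4 = 23.25, which gives a test value of about 16.63 (rounding E up to 24 would give about 16.21; the decision is the same either way). The dictionary layout and variable names are illustrative.

```python
observed = {"Stocks": 35, "Mutual Funds": 18, "Bonds": 30, "Index Funds": 10}
n, k = sum(observed.values()), len(observed)
expected = n / k                  # 93 / 4 = 23.25 for every category

# Contribution of each category to the chi-square test value: (O - E)^2 / E.
contributions = {name: (o - expected) ** 2 / expected for name, o in observed.items()}
test_value = sum(contributions.values())

for name, part in contributions.items():
    print(f"{name:<13}{part:6.2f}")
# 6.251 is the table value for alpha = 0.10 and df = 3 from Step 2.
print(f"test value = {test_value:.2f}, reject Ho: {test_value > 6.251}")
```

Stocks and index funds account for most of the test value, since their observed counts lie farthest from 23.25.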
Example 3:
An article shows statistics on online orders of a particular product from different online stores
within the city. The data are based on the last six months of the previous year, as follows:
July 17%, August 11%, September 8%, October 14%, November 27%, and December 23%.
The CECT online store manager wants to compare the orders made with his store with
the data reported in the article. The manager listed the number of orders in his store
for the same product stated in the article. The table below shows the data collected by
the manager for the last six months of the previous year.

Months       Number of Orders made with CECT store
July         27
August       17
September    22
October      45
November     30
December     59

At the 0.01 level of significance, can we support the claim that
the proportions of orders with the CECT online store are the same as
those of the rest of the online stores within the city?
Note that this problem involves only one categorical variable, the month of the order, so we use the
goodness-of-fit test.
Solution:
Step 1: State the hypotheses and identify the claim.
Ho: The orders made for a particular product in different online
stores within the city for the last six months of the year are
distributed as follows: July 17%, August 11%, September 8%,
October 14%, November 27%, and December 23%.
(or “There is no difference between the orders made with the
CECT online store and those made with the rest of the online stores within
the city.”) (claim)
H1: The distribution is not the same as stated in the null hypothesis.
(or “There is a difference between the orders made with the
CECT online store and those made with the rest of the online stores within
the city.”)
Step 2: Find the critical value. At α = 0.01 and df = 6-1 = 5, locate
the critical value from the chi-square table. Thus, the critical
value is 15.086.
Step 3: Compute the test value
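With n = 27 + 17 + 22 + 45 + 30 + 59 = 200 orders in total, the expected frequency for each month is E = np, where p is the proportion reported in the article.

Months       Number of Orders made with CECT store (O)    p       E = np
July         27                                           0.17    34
August       17                                           0.11    22
September    22                                           0.08    16
October      45                                           0.14    28
November     30                                           0.27    54
December     59                                           0.23    46

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(27 - 34)^2}{34} + \frac{(17 - 22)^2}{22} + \frac{(22 - 16)^2}{16} + \frac{(45 - 28)^2}{28} + \frac{(30 - 54)^2}{54} + \frac{(59 - 46)^2}{46} \approx 29.49$$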
Step 4: Make the decision. Reject the null hypothesis, since the test value 29.49 is
greater than the critical value 15.086 (29.49 > 15.086).
[Figure: chi-square curve with α = 0.01 and the critical value 15.086 marked]
Step 5: Summarize the results. There is a significant difference between the orders
made with the CECT online store and those made with the rest of the online stores
within the city.
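When the expected proportions are unequal, as in Example 3, each expected frequency is E = np rather than n/k. The Python sketch below recomputes the test value from the order counts and the article's percentages; the variable names are illustrative.

```python
observed = [27, 17, 22, 45, 30, 59]                 # CECT orders, July to December
proportions = [0.17, 0.11, 0.08, 0.14, 0.27, 0.23]  # shares reported in the article

n = sum(observed)                                   # total number of orders
expected = [n * p for p in proportions]             # E = n * p for each month

# Test value: sum of (O - E)^2 / E over the six months.
test_value = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical_value = 15.086                             # alpha = 0.01, df = 5

print(f"n = {n}, test value = {test_value:.2f}")
print("reject Ho" if test_value > critical_value else "do not reject Ho")
```

October and November contribute the most to the test value, since their observed counts (45 and 30) are farthest from the expected 28 and 54.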
Exercise 1: