10 The Chi-Square Tests
10.1 Introduction
A test of significance might be based on the assumption that the sample values were
drawn from universes having the same variance. The testing procedures also assumed
that the unknown values of the parameters, about which statistical inferences were to
be made, could be estimated from statistics obtained from random samples. This approach
to inferential statistics is called parametric methods, since the concern is with the value of
a parameter.
There are many situations in which it is not possible for the statistician to make a rigid
assumption about the shape of the population from which samples are being drawn. This
limitation has led to the development of a group of alternate techniques known as non-
parametric or distribution-free methods.
A non-parametric method may be defined as a statistical test in which no hypothesis is
made about specific values of parameters. Distribution-free tests may be defined as meth-
ods for testing a hypothesis that does not depend on assumptions concerning the form of
the underlying distribution.
10.2 Chi-Square, χ2
The χ2-test (pronounced ‘chi-square test’) is one of the simplest and most widely used
non-parametric tests in statistical work. The χ2-distribution has very many applications
in situations that involve the testing of hypotheses concerning discrete or qualitative
data. The Greek letter χ was first used to describe this statistic by Karl Pearson in 1900.
The quantity χ2 describes the magnitude of discrepancy between theory and observa-
tion; that is, with the help of the χ2-test, we are in a position to know whether a given
discrepancy between theory and observation may be attributed to chance or whether it
results from the inadequacy of the theory to fit the observed facts. If χ2 is zero, it means
that the observed and expected frequencies completely coincide. The greater the dis-
crepancy between the observed and expected frequencies, the greater is the value of χ2.
The square of a standard normal variable is called a chi-square variate with 1 degree
of freedom (d.f.). For example, if X is a random variable following a normal distribution
with a mean µ and standard deviation σ, then (X − µ)/σ is a standard normal variate and
((X − µ)/σ)² is a chi-square variate with 1 degree of freedom.
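The relationship above can be checked numerically. The following is a minimal sketch, assuming illustrative values of µ = 50 and σ = 10 (these values and the function name are hypothetical, not taken from the text). It relies on the standard result that a chi-square variate with ν degrees of freedom has mean ν and variance 2ν, so the squared standard normal variate should have a mean close to 1 and a variance close to 2.

```python
import random


def simulate_squared_standard_normal(mu, sigma, n_draws, seed=0):
    """Draw X ~ N(mu, sigma), standardise each draw and return the squares."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    squares = []
    for _ in range(n_draws):
        x = rng.gauss(mu, sigma)
        z = (x - mu) / sigma   # standard normal variate
        squares.append(z * z)  # chi-square variate with 1 d.f.
    return squares


# Hypothetical parameters, chosen only for illustration.
squares = simulate_squared_standard_normal(mu=50.0, sigma=10.0, n_draws=100_000)
sample_mean = sum(squares) / len(squares)
sample_var = sum((s - sample_mean) ** 2 for s in squares) / (len(squares) - 1)

print(f"mean of squared variate     = {sample_mean:.3f} (theory: 1)")
print(f"variance of squared variate = {sample_var:.3f} (theory: 2)")
```

Because the variate is a square, every simulated value is zero or greater, which is why the χ2-distribution has no negative values.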
246 Quantitative Techniques in Business, Management and Finance
1. N, the total frequency (that is, the sample size), should be large enough: N > 50.
2. The expected frequency, fe, of each cell should be greater than 5, that is fe > 5.
3. The data should be given in original numbers, that is as natural numbers (counts).
4. The sample observations should be random and independent.
1. While comparing the table value of χ2 with the calculated value, we have to deter-
mine the degrees of freedom.
2. Degrees of freedom is the number of classes to which the values are assigned arbi-
trarily or at will without violating the restrictions.
3. For example, suppose we have to choose four numbers whose total is 40. We are free
to select any three numbers, say 10, 5 and 20, but the fourth number must then be 5,
that is [40 – (10 + 5 + 20)].
Thus, the condition that the total is 40 reduces our freedom of choice by 1.
Therefore, the restriction placed on the freedom is 1 and the degrees of freedom is
3. As the number of restrictions increases, the degrees of freedom is reduced.
$$\nu = n - k = 4 - 1 = 3$$
where:
ν = degrees of freedom
n = number of values (classes)
k = number of restrictions
4. If more restrictions are placed, our freedom of choice is curtailed still further.
For example, if there are 10 classes and we want our frequencies to be distributed
in such a manner that the number of cases, the mean and the standard deviation
agree with the original distribution, we have three constraints:
$$\nu = n - k = 10 - 3 = 7$$
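In general, with one, two and three restrictions, the degrees of freedom are, respectively,
$$\nu = n - 1, \qquad \nu = n - 2, \qquad \nu = n - 3$$
and for a contingency table
$$\nu = (c - 1)(r - 1)$$
where:
c = number of columns
r = number of rows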
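The degrees-of-freedom rules above can be sketched as two small helper functions. The function names are hypothetical and are introduced only for illustration; the arithmetic follows ν = n − k and ν = (c − 1)(r − 1) as stated in the text.

```python
def df_with_restrictions(n_classes, n_restrictions):
    """Degrees of freedom when n_restrictions constraints apply to n_classes classes."""
    if n_restrictions >= n_classes:
        raise ValueError("the number of restrictions must be smaller than the number of classes")
    return n_classes - n_restrictions


def df_contingency(n_rows, n_cols):
    """Degrees of freedom for a contingency table with n_rows rows and n_cols columns."""
    if n_rows < 2 or n_cols < 2:
        raise ValueError("a contingency table needs at least two rows and two columns")
    return (n_cols - 1) * (n_rows - 1)


# Four numbers constrained only by their total: one restriction.
print(df_with_restrictions(4, 1))   # 3
# Ten classes constrained by number of cases, mean and standard deviation.
print(df_with_restrictions(10, 3))  # 7
# A 2 x 2 contingency table.
print(df_contingency(2, 2))         # 1
```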
1. Tests can be performed on the actual numbers but not on the percentages and pro-
portions. If the data is in percent or proportion, then it needs to be converted into
absolute numbers before performing the χ2-test.
2. No theoretical cell frequency should be small. It is hard to say exactly what
constitutes smallness, but 5 should be regarded as the very minimum and 10 is
better. In practice, data not infrequently contain cell frequencies below these
limits. As a rule, the difficulty may be met by amalgamating such cells into a single
cell, for example one entitled ‘10 or greater than 10’.
3. The chi-square test works only when the sample size is large enough; usually, the
sample size needs to be >50.
4. Observations drawn need to be random and independent.
1. χ2 is never negative. The computed value of χ2 cannot be negative because the
differences between Eij and Oij are squared, that is (Eij − Oij)².
2. The shape of the χ2-distribution depends on the number of cells. That is, the number
of degrees of freedom is determined by (n – 1), where n is the number of categories.
Therefore, the shape of the χ2-distribution does not depend on the size of the
sample.
For example, if 300 employees of an airline were classified into one of three
categories, that is flight personnel, ground support and administrative personnel,
there would be n – 1 = 3 – 1 = 2 degrees of freedom.
3. The χ2-distribution is positively skewed. However, as the number of degrees of
freedom increases, the distribution begins to approximate the normal distribution.
4. The greater the discrepancy between the observed and expected frequencies, the
greater is the value of χ2.
5. Large values of χ2 indicate disagreement between the observed frequency (Oi) and
the expected frequency (Ei) under the null hypothesis.
6. The critical region lies in the extreme right tail of the χ2-distribution.
Therefore, the test is called a right-tailed test.
7. It depends only on the set of observed and expected frequencies and on degrees of
freedom.
8. It does not make any assumptions regarding the parent population.
9. It does not involve any population parameters and is therefore known
as a non-parametric test.
10. It is a distribution-free test because it makes no assumptions concerning the form of the underlying distribution.
11. If χ2 is 0, this means that the observed and expected frequencies completely
coincide.
12. It is used for qualitative data. Other distributions generally cannot be used for
qualitative data, for which a non-parametric test such as χ2 is preferred.
13. It is used for testing goodness of fit, that is, for checking or confirming how well
a theory fits the observed data. In goodness of fit, the null hypothesis, H0, is the
hypothesis proposed for acceptance.
1. If there are only two cells, the expected frequency (Ei) in each cell should be five or
more.
2. For more than two cells, χ2 should not be applied if more than 20% of the cells have
an expected frequency (Ei) of less than five.
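The two conditions above can be sketched as a simple check on a list of expected frequencies. The function name and default thresholds are assumptions introduced for illustration; they encode only the "five or more" and "20%" rules stated in the list.

```python
def expected_frequencies_ok(expected, minimum=5, max_small_share=0.20):
    """Return True if the expected frequencies satisfy the two conditions above.

    Two cells: every expected frequency must be at least `minimum`.
    More than two cells: at most `max_small_share` of the cells may have
    an expected frequency below `minimum`.
    """
    if len(expected) < 2:
        raise ValueError("at least two cells are required")
    small_cells = sum(1 for e in expected if e < minimum)
    if len(expected) == 2:
        return small_cells == 0
    return small_cells / len(expected) <= max_small_share


# Hypothetical expected frequencies, used only to exercise the rule.
print(expected_frequencies_ok([3, 50]))                # False: two cells, one below 5
print(expected_frequencies_ok([4, 10, 10, 10, 10, 10]))  # True: 1 of 6 cells below 5
print(expected_frequencies_ok([4, 4, 10, 10, 10]))     # False: 2 of 5 cells below 5
```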
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
where:
n = number of categories
Oi = observed frequency in a particular category
Ei = expected frequency in a particular category
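The formula can be sketched in a few lines of code. The counts below are hypothetical: they imagine the 300 airline employees mentioned earlier falling into the three categories as 110, 100 and 90, against a null hypothesis of equal expected frequencies of 100 each. Neither these counts nor the function name come from the text.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of categories")
    if any(e <= 0 for e in expected):
        raise ValueError("expected frequencies must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for three categories (flight, ground support, administrative).
observed = [110, 100, 90]
expected = [100, 100, 100]

chi2 = chi_square_statistic(observed, expected)
print(chi2)  # 2.0
```

When the observed and expected frequencies coincide exactly, every term in the sum is zero and the statistic is 0.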
1. The decision rule is that if large differences between the expected and observed
frequencies result in a computed χ2 greater than the table value, the null hypothesis
should be rejected. That is, if the calculated value of chi-square
(χ2calculated) is greater than the tabulated value of chi-square (χ2tabulated), we reject H0
and accept the alternate hypothesis, H1.
2. If the differences between Eij and Oij are small, the computed χ2-value will be
small and the null hypothesis should not be rejected, because such small differences
between the expected and observed frequencies are probably due to chance. Do not
reject the null hypothesis, H0, if the calculated value of χ2 is less than or equal
to the χ2 table value.
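The decision rule can be sketched as a lookup followed by a comparison. The small table of critical values below is an assumption added for illustration: it holds standard χ2 table values for 1 to 5 degrees of freedom at the 5% and 1% levels only, and the function name is hypothetical. The calculated values passed in are also hypothetical.

```python
# Standard chi-square table values (right-tail critical values), limited to
# 1-5 degrees of freedom at the 5% and 1% levels of significance.
CRITICAL_VALUES = {
    0.05: {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070},
    0.01: {1: 6.635, 2: 9.210, 3: 11.345, 4: 13.277, 5: 15.086},
}


def decide(chi2_calculated, df, alpha=0.05):
    """Apply the decision rule: reject H0 only if calculated > tabulated."""
    try:
        tabulated = CRITICAL_VALUES[alpha][df]
    except KeyError:
        raise ValueError("no tabulated value stored for this alpha and df") from None
    if chi2_calculated > tabulated:
        return "reject H0"
    return "do not reject H0"


# Hypothetical calculated values with 2 degrees of freedom at the 5% level.
print(decide(7.2, df=2))  # reject H0
print(decide(2.0, df=2))  # do not reject H0
```

A calculated value exactly equal to the tabulated value falls under "do not reject H0", in line with the rule stated above.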
$$E_{ij} = \frac{n_i \times n_j}{n}$$
where:
Eij = expected frequency of a cell corresponding to a particular row and column
ni = row total
nj = column total
n = total sample size
$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}$$
where:
Oij = observed frequency in ith row and jth column
Eij = expected frequency in ith row and jth column
n = number of rows
k = number of columns
Exercise
Table 10.1 shows the data obtained during an epidemic of dengue.
Test the efficiency of inoculation in preventing the attack of dengue.
Solution
1. Formulate H0 and H1.
H0: Inoculation is not effective in preventing the attack of dengue.
H1: Inoculation is effective in preventing the attack of dengue.
2. Calculate the expected frequency.
$$E_{ij} = \frac{n_i \times n_j}{n}$$
where:
Eij = expected frequency of a cell corresponding to a particular row and column
ni = row total
nj = column total
n = total sample size
3. Calculate χ2.
$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}$$
where:
Oij = observed frequency in ith row and jth column
Eij = expected frequency in ith row and jth column
n = number of rows
k = number of columns
TABLE 10.1
Data of an Epidemic of Dengue
                  Attacked   Not Attacked   Total
Inoculated              31            469     500
Not inoculated         185          1,315   1,500
Total                  216          1,784   2,000
TABLE 10.2
Observed and Expected Frequencies (expected frequencies in parentheses)
                  Attacked                        Not Attacked                       Total
Inoculated        31  (216 × 500/2000 = 54)       469   (1784 × 500/2000 = 446)      500
Not inoculated    185 (216 × 1500/2000 = 162)     1315  (1784 × 1500/2000 = 1338)    1500
Total             216                             1784                               2000
TABLE 10.3
Calculation of χ2
Oi      Ei      (Oi – Ei)   (Oi – Ei)²   (Oi – Ei)²/Ei
31      54      –23         529          9.80
185     162     23          529          3.26
469     446     23          529          1.18
1315    1338    –23         529          0.39
∑Oi = 2000   ∑Ei = 2000   ∑(Oi – Ei)²/Ei = 14.63
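χ2Calculated = 14.63
For ν = (2 – 1)(2 – 1) = 1 d.f., the tabulated value of χ2 at the 5% level of significance is 3.84.
Since the calculated value (14.63) is greater than the tabulated value (3.84), H0 is
rejected: inoculation is effective in preventing the attack of dengue.
FIGURE 10.1
Chi-square test of independence. (The calculated test statistic, 14.63, lies in the rejection region to the right of the critical value, 3.84.)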
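As a cross-check on the dengue calculation, a 2 × 2 table can also be evaluated with the standard shortcut formula χ2 = N(ad − bc)²/[(a + b)(c + d)(a + c)(b + d)], where a, b, c and d are the four observed cell frequencies. This shortcut is not used in the text; it is introduced here only as an independent check, and the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]] using the shortcut formula."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Observed frequencies from Table 10.1:
# inoculated (attacked, not attacked), not inoculated (attacked, not attacked).
chi2 = chi_square_2x2(31, 469, 185, 1315)
print(round(chi2, 2))
```

The result is about 14.64. The value 14.63 in Table 10.3 differs in the second decimal place only because each term in the table was cut to two decimals before summing. Either way the statistic is well above the tabulated value of 3.84.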
Exercise
Channel viewership is segregated according to age group. Evaluate the statistical sig-
nificance of association among the variables involved in the cross-tabulation.
Table 10.4 shows the channel viewership distribution according to age group.
Solution
Observation: The 15–35 age group respondents prefer Channel A compared with the
other channels. But, it is not certain whether this observation is representative of the
entire population or whether it is due to sampling error. This dilemma can be resolved
by subjecting the data to a χ2-test.
TABLE 10.4
Channel Viewership Distribution according to Age Group
Age Group ↓ / Channel →     A      B      C    Total
10–15                      20     30     30      80
15–35                      80     70     50     200
≥35                        60     40     20     120
Total                     160    140    100     400
Steps
1. Formulate H0 and H1.
H0: There is no significant association between age groups and channel
viewership.
H1: There is a significant association between age groups and channel
viewership.
2. Calculate the expected frequency.
The expected frequency value for each category (age group) can be calculated
by using the formula
$$E_{ij} = \frac{n_i \times n_j}{n}$$
where:
Eij = expected frequency of a cell corresponding to a particular age group and a
particular channel
ni = row total
nj = column total
n = total sample size
Table 10.5 shows the observed frequencies and corresponding expected fre-
quencies in parentheses.
3. Calculate
$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where
Oij = observed frequency in ith row and jth column
Eij = expected frequency in ith row and jth column
4. Determine the degrees of freedom.
$$\nu = (n - 1)(k - 1)$$
where
ν = degrees of freedom
n = number of rows
k = number of columns
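TABLE 10.5
Observed and Expected Frequencies (expected frequencies in parentheses)
Age Group ↓ / Channel →   A                           B                           C                           Total
10–15                     20 (80 × 160/400 = 32)      30 (80 × 140/400 = 28)      30 (80 × 100/400 = 20)      80
15–35                     80 (200 × 160/400 = 80)     70 (200 × 140/400 = 70)     50 (200 × 100/400 = 50)     200
≥35                       60 (120 × 160/400 = 48)     40 (120 × 140/400 = 42)     20 (120 × 100/400 = 30)     120
Total                     160                         140                         100                         400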
TABLE 10.6
Calculation of χ2
Oij    Eij    (Oij – Eij)   (Oij – Eij)²   (Oij – Eij)²/Eij
20     32     –12           144            4.5
30     28     2             4              0.1428571
30     20     10            100            5.0
80     80     0             0              0
70     70     0             0              0
50     50     0             0              0
60     48     12            144            3
40     42     –2            4              0.095238
20     30     –10           100            3.33333
∑Oij = 400   ∑Eij = 400   χ2 ≅ 16.08
Here, n = 3, k = 3 and
ν = (3 – 1) (3 – 1) = 2 × 2 = 4
5. Determine the critical value and compare it with the calculated value.
From the χ2-distribution table,
χ2Tabulated for 4 d.f. at the 1% level of significance = 13.28
χ2Calculated = 16.08
FIGURE 10.2
Chi-square test of independence. (The calculated test statistic, 16.08, lies in the rejection region to the right of the critical value, 13.28.)
Conclusion
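Since the calculated value of χ2 (16.08) is greater than the tabulated value (13.28), H0 is rejected: channel viewership is significantly dependent on the age group at the 1% level of significance.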
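The steps of this exercise can be sketched in code: expected frequencies from the row and column totals, the χ2 statistic, the degrees of freedom and the comparison with the tabulated value. The function names are hypothetical; the observed frequencies are those of Table 10.4 and the critical value 13.28 is the one quoted in step 5.

```python
def expected_frequencies(observed):
    """E_ij = (row total x column total) / grand total for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom, expected frequencies)."""
    expected = expected_frequencies(observed)
    chi2 = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df, expected


# Observed frequencies from Table 10.4 (rows: age groups; columns: channels A, B, C).
observed = [
    [20, 30, 30],  # 10-15
    [80, 70, 50],  # 15-35
    [60, 40, 20],  # >=35
]

chi2, df, expected = chi_square_independence(observed)
critical_value = 13.28  # tabulated value for 4 d.f. at the 1% level, as quoted in the text

print(expected[0])        # expected frequencies for the 10-15 age group
print(round(chi2, 2), df)
print("reject H0" if chi2 > critical_value else "do not reject H0")
```

Summed without intermediate rounding, the statistic comes to about 16.07, marginally below the 16.08 shown in Table 10.6. The difference does not affect the conclusion.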
10.6 Phi-Coefficient
In order to measure the strength of association, the phi-coefficient is used. It is denoted
by Φ. The phi-coefficient measures the strength of association only for a table with two
rows and two columns (i.e. a 2 × 2 table). It fails to measure the strength of association
in tables with more than two rows or columns.
A formula for measuring the strength of association between two variables is
$$\Phi = \sqrt{\frac{\chi^2}{n}}$$
where
χ2 = calculated chi-square value
n = sample size
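The contingency coefficient, denoted by C, is given by
$$C = \sqrt{\frac{\chi^2}{\chi^2 + n}}$$
where:
χ2 = calculated chi-square value
n = sample size
For example, with χ2 = 16.08 and n = 200,
$$C = \sqrt{\frac{\chi^2}{\chi^2 + n}} = \sqrt{\frac{16.08}{16.08 + 200}} = 0.2727945796$$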
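Both measures can be sketched as one-line functions. The function names are hypothetical. The first call reproduces the worked value of C above; the second applies Φ to the 2 × 2 dengue table (χ2 = 14.63, n = 2000), a pairing chosen here only for illustration since the text does not compute it.

```python
import math


def phi_coefficient(chi2, n):
    """Phi = sqrt(chi-square / n); intended for 2 x 2 tables only."""
    if n <= 0 or chi2 < 0:
        raise ValueError("n must be positive and chi2 must not be negative")
    return math.sqrt(chi2 / n)


def contingency_coefficient(chi2, n):
    """C = sqrt(chi-square / (chi-square + n))."""
    if n <= 0 or chi2 < 0:
        raise ValueError("n must be positive and chi2 must not be negative")
    return math.sqrt(chi2 / (chi2 + n))


c_value = contingency_coefficient(16.08, 200)
phi_value = phi_coefficient(14.63, 2000)

print(round(c_value, 4))    # 0.2728
print(round(phi_value, 4))
```

C is always less than 1 because the denominator exceeds the numerator whenever n is positive.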
10.8 Summary
In this chapter, we have looked at some situations where we can develop tests based on the
chi-square distribution. We started by testing the variance of a normal population where
the test statistic used was (n − 1)s²/σ², since the distribution of the sample variance s² was
not known directly. We found that such tests could be one-tailed or two-tailed, depending
on our null and alternative hypotheses.
We then described a multinomial experiment and found that if we have data that classify
observations into k different categories, and if the conditions for the multinomial
experiment are satisfied, then a test statistic called the chi-square statistic, defined as
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
will have a chi-square distribution with specified degrees of freedom. Here, Oi refers to
the observed frequency of the ith category and Ei to the expected frequency of the ith cat-
egory, and the degrees of freedom is equal to the number of categories minus 1, minus the
number of independent parameters estimated from the data to calculate the Ei values. This
concept was used to develop tests concerning the goodness of fit of the observed data to
any hypothesised distribution, and also to test whether two criteria for classification are
independent.
The chapter also discussed the concepts of the chi-square test and its uses and testing
for goodness of fit. The various tests of hypothesis in the earlier sections are based on the
assumption that the sampling distribution follows a normal distribution curve. However,
it is not always possible to assume the distribution pattern from which the samples are
drawn. To overcome this difficulty, many managers use chi-square tests. Chi-square is a
test statistic used to test a hypothesis that provides a set of theoretical frequencies with
which observed frequencies are compared.
REVIEW QUESTIONS
1. What is the χ2-test of goodness of fit? What precautions are necessary in using this
test?
2. Describe the χ2-test of significance and state the various uses to which it can
be put.
3. Illustrate with examples the usefulness of the χ2-test as a test for independence.
4. Describe the use of the χ2-test in testing the independence of attributes in a 2 × 2
contingency table.
5. Explain how the χ2-distribution can be used to (a) test the goodness of fit and
(b) test the independence of the cell frequencies in a 2 × 2 contingency table.
SELF-PRACTICE PROBLEMS
1. The following data relates to the sales in a time of trade depression of a certain
proprietary article in wide demand. Does the data suggest that the sales are sig-
nificantly affected by the depression?
          Votes for
Area      A       B      Total
Rural     620     380    1000
Urban     550     450    1000
Total     1170    830    2000
[Answer: χ2 = 10.09; the table value of χ2 at the 5% level for ν = 1 is 3.84. The
hypothesis does not hold well.]