
10 The Chi-Square Tests - 25 - 02 - 28 - 23 - 16 - 48

The document discusses the Chi-Square test, a widely used non-parametric statistical method for testing hypotheses concerning discrete or qualitative data. It outlines the need for the test, conditions for its validity, characteristics of the Chi-Square distribution, and its applications, including tests of goodness of fit and independence. Additionally, it details the procedures for conducting these tests and the necessary steps for hypothesis formulation and decision-making.

Uploaded by

Mwangi Mbugua
Copyright
© © All Rights Reserved

10

The Chi-Square Tests

10.1 Introduction
A test of significance might be based on the assumption that the sample values were
drawn from universes having the same variance. The testing procedures also assumed
that the unknown values of the parameters, about which statistical inferences were to
be made, could be estimated from statistics obtained from random samples. This approach
to inferential statistics is called parametric methods, since the concern is with the value of
a parameter.
There are many situations in which it is not possible for the statistician to make a rigid
assumption about the shape of the population from which samples are being drawn. This
limitation has led to the development of a group of alternate techniques known as non-
parametric or distribution-free methods.
A non-parametric method may be defined as a statistical test in which no hypothesis is
made about specific values of parameters. Distribution-free tests may be defined as meth-
ods for testing a hypothesis that does not depend on assumptions concerning the form of
the underlying distribution.

10.2 Chi-Square, χ2
The χ2-test (pronounced ‘chi-square test’) is one of the simplest and most widely used
non-parametric tests in statistical work. The χ2-distribution has very many applications
in situations that involve the testing of hypotheses concerning discrete or qualitative
data. The Greek letter χ was first used to describe this statistic by Karl Pearson in 1900.
The quantity χ2 describes the magnitude of discrepancy between theory and observa-
tion; that is, with the help of the χ2-test, we are in a position to know whether a given
discrepancy between theory and observation may be attributed to chance or whether it
results from the inadequacy of the theory to fit the observed facts. If χ2 is zero, it means
that the observed and expected frequencies completely coincide. The greater the dis-
crepancy between the observed and expected frequencies, the greater is the value of χ2.
The square of a standard normal variable is called a chi-square variate with 1 degree
of freedom (d.f.). For example, if X is a random variable following a normal distribution
with a mean µ and standard deviation σ, then (X − µ)/ σ is a standard normal variate and
((X − µ)/ σ)2 is a chi-square variate with 1 degree of freedom.
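The definition above can be checked numerically. The following is a minimal sketch, not part of the original text: the values of µ and σ, the seed and the function name are arbitrary assumptions chosen for illustration. It standardises normal draws, squares them, and compares the sample mean and variance with the values 1 and 2 that the chapter later states for 1 degree of freedom (mean equal to the degrees of freedom, variance equal to twice that number).

```python
import random


def simulate_squared_standard_normal(n_draws, mu=10.0, sigma=2.0, seed=0):
    """Draw X ~ N(mu, sigma), standardise it and square it.

    mu, sigma and seed are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_draws):
        x = rng.gauss(mu, sigma)
        z = (x - mu) / sigma      # standard normal variate
        values.append(z * z)      # chi-square variate with 1 d.f.
    return values


values = simulate_squared_standard_normal(100_000)
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
print(f"sample mean = {mean:.3f} (theory: 1), sample variance = {variance:.3f} (theory: 2)")
```

Because the draws are random, the printed figures only approximate 1 and 2; a larger number of draws brings them closer.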

245
246 Quantitative Techniques in Business, Management and Finance

10.2.1 Need for the χ2-Test


To test hypotheses about a population mean, or about two or more population means, it was
assumed that the population was normal and that the data were measured on an interval scale.
The χ2-test needs no assumption about the shape of the parent population, and it can be
conducted when the data are not interval scale but nominal or ordinal.

10.2.2 Conditions for the Validity of χ2


10.2.2.1 Assumptions

1. N, the total frequency, that is sample size, should be large enough, N > 50.
2. The expected frequency should be greater than 5, that is fe > 5.
3. The data should be given in original numbers, that is natural numbers.
4. The sample observation should be random and independent.

10.2.2.2 Interval Scale


Parametric tests deal with data measured on at least an interval scale, such as weights,
incomes and wages.

10.2.2.3 Nominal-Level or Nominal-Scale Data


This type of data can only be classified into categories, such as male and female, literate
and illiterate.

10.2.2.4 Ordinal-Level Data


The data measurement assumes that one category is ranked higher than the next one. For
example, a ranking of outstanding is higher than good, good is higher than fair, brighter
is higher than lighter and so on.

10.2.3 Degrees of Freedom

1. While comparing the table value of χ2 with the calculated value, we have to deter-
mine the degrees of freedom.
2. Degrees of freedom is the number of classes to which the values are assigned arbi-
trarily or at will without violating the restrictions.
3. For example, we choose any four numbers whose total is 40. Here, we have a choice
to select any three numbers, say 10, 5 and 20, and the fourth number is 5, that is
[40 – (10 + 5 + 20)].
Thus, our choice of freedom is reduced by 1, on the condition that the total is 40.
Therefore, the restriction placed on the freedom is 1 and the degrees of freedom is
3. As the restriction increases, the degrees of freedom is reduced.

ν = n − k = 4 − 1 = 3

where:
ν = degrees of freedom
n = number of frequency classes
k = number of independent constraints

4. If more restrictions are placed, our freedom of choice will be still curtailed.
For example, if there are 10 classes and we want our frequencies to be distrib-
uted in such a manner that the number of cases, the mean and the standard devia-
tion agree with the original distribution, now we have three constraints:

ν = n − k = 10 − 3 = 7

Thus, the number of degrees of freedom is obtained by subtracting the number
of constraints from the number of frequency classes.

10.2.3.1 In Binomial Distribution


The number of degrees of freedom is one less than the number of classes:

ν = n − 1

10.2.3.2 In Poisson Distribution


The number of degrees of freedom is two less than the number of classes. (We use total
frequency and arithmetic mean.)

ν = n − 2

10.2.3.3 In Normal Distribution


The number of degrees of freedom is three less than the number of classes. (We use total
frequency, mean and standard deviation.)

ν = n − 3

10.2.3.4 For a Contingency Table

ν = (c − 1)(r − 1)

where:
c = number of columns
r = number of rows
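The degrees-of-freedom rules in this section can be collected in two small helper functions. This is only a sketch; the function names are hypothetical and not taken from the text.

```python
def df_goodness_of_fit(n_classes, n_constraints=1):
    """Degrees of freedom nu = n - k for n frequency classes and k constraints."""
    if n_constraints >= n_classes:
        raise ValueError("constraints must be fewer than frequency classes")
    return n_classes - n_constraints


def df_contingency(n_rows, n_cols):
    """Degrees of freedom nu = (c - 1)(r - 1) for an r x c contingency table."""
    return (n_cols - 1) * (n_rows - 1)


print(df_goodness_of_fit(4, 1))    # four numbers with a fixed total -> 3
print(df_goodness_of_fit(10, 3))   # ten classes, three constraints -> 7
print(df_contingency(2, 2))        # 2 x 2 table -> 1
```

The binomial, Poisson and normal cases above correspond to calling `df_goodness_of_fit` with 1, 2 and 3 constraints respectively.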

10.2.3.5 Important Characteristics of Degrees of Freedom (ν)

1. The distribution of χ2 depends on the degrees of freedom.


2. There is a different χ2-distribution for each number of degrees of freedom.
3. The distribution is much skewed to the right for small degrees of freedom.
4. As the degrees of freedom increases, the curve becomes more and more
symmetric.
5. The mean of the χ2-distribution is equal to the number of degrees of freedom.
6. The variance is equal to twice the number of degrees of freedom (2ν).

10.2.4 General Aspects of χ2

1. Tests can be performed on the actual numbers but not on the percentages and pro-
portions. If the data is in percent or proportion, then it needs to be converted into
absolute numbers before performing the χ2-test.
2. No theoretical cell frequency should be small. Here, again it is hard to say what
constitutes smallness, but 5 should be regarded as the very minimum and 10 is
better. In practice, data not infrequently contain cell frequencies below these limits.
As a rule, the difficulty may be met by amalgamating such cells with adjacent cells into
a single cell, for example one entitled ‘10 or greater than 10’.
3. The chi-square test works only when the sample size is large enough; usually, the
sample size needs to be >50.
4. Observations drawn need to be random and independent.

10.2.5 Characteristics of the Chi-Square Distribution

1. χ2 is never negative. The computed value of χ2 cannot be negative because the
differences between Eij and Oij are squared, that is (Eij − Oij)².

2. The shape of the χ2-distribution depends on number of cells. That is, the number
of degrees of freedom is determined by (n – 1), where n is the number of samples
(categories).
Therefore, the shape of the χ2-distribution does not depend on the size of the
sample.
For example, if 300 employees of an airline were classified into one of three
categories, that is flight personnel, ground support and administrative personnel,
there would be n – 1 = 3 – 1 = 2 degrees of freedom.
3. The χ2-distribution is positively skewed. However, as the number of degrees of
freedom increases, the distribution begins to approximate the normal distribution.
4. The greater the discrepancy between the observed and expected frequencies, the
greater is the value of χ2.
5. Large values of χ2 indicate disagreement between the observed frequency (Oi) and
the expected frequency (Ei) under the null hypothesis.
6. The critical region lies in the extreme right tail of the χ2-distribution, which is
positively skewed. Therefore, the χ2-test is called a right-tailed test.
7. It depends only on the set of observed and expected frequencies and on degrees of
freedom.
8. It does not make any assumptions regarding the parent population.
9. It does not involve any population parameters (statistics) and is therefore known
as a non-parametric test.
10. It is a distribution-free test because there are no assumptions.
11. If χ2 is 0, this means that the observed and expected frequencies completely
coincide.
12. It is used for qualitative data. Most other distributions cannot be used for qualitative
data. Generally, for qualitative data a non-parametric test is preferred, and χ2 is the
main test available for qualitative data.
13. It is used for goodness of fit, that is, for checking or confirming how well a theory
fits the observed data. In goodness of fit, the null hypothesis, H0, is proposed
for acceptance.

10.2.6 Application of Chi-Square

1. It is applicable for testing the significance of hypotheses concerning discrete or
qualitative data.
2. It describes the magnitude of discrepancy between an observation (experiment)
and theory, which may be attributed to chance (fluctuation of sampling) or may result
from the inadequacy of the theory.
3. It can evaluate the relationship between two or more variables.
4. We will not always be interested in means and proportions. There are many man-
agerial situations where we will be concerned with the variability in a population,
instead of means and proportions.
5. The χ2-test enables us to test whether more than two population proportions can
be considered equal.
6. We can use the χ2-test to determine if the two attributes are independent of each
other. The χ2-test is a test of independence that determines whether the difference
between the proportions representing more than two samples is significant.
7. It determines whether the two attributes according to which a population is cat-
egorised are independent of each other, and it also serves as a test for goodness
of fit.
8. It is useful when analysing cross-tabulation.
9. It helps to determine whether there is any significant association between the vari-
ables involved in the research problem.

10.2.7 Limitations of Chi-Square

1. If there are only two cells, the expected frequency (Ei) in each cell should be five or
more.
2. For more than two cells, χ2 should not be applied if more than 20% of Ei cells have
less than five frequencies.

10.3 Chi-Square Test of Goodness of Fit


This is a non-parametric test developed by Karl Pearson in the early 1900s. χ2 is a test
statistic used to test a hypothesis that provides a set of theoretical frequencies with which
observed frequencies are compared; a good fit means close agreement between the two sets.
It is used to determine how well an observed set of data fits an expected set, by using the
formula

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where:
n = number of categories
Oi = observed frequency in a particular category
Ei = expected frequency in a particular category
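The formula can be written as a short function. The sketch below is illustrative only: the function name is hypothetical, and the die-rolling counts are invented data, not figures from the text.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical example: 60 rolls of a die, 10 expected in each of 6 categories.
observed = [8, 12, 9, 11, 10, 10]
expected = [10, 10, 10, 10, 10, 10]
print(chi_square_statistic(observed, expected))   # (4 + 4 + 1 + 1 + 0 + 0) / 10 = 1.0
```

When the observed and expected frequencies coincide exactly, every term is zero and the function returns 0, in line with the remark made earlier about χ2 = 0.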

10.3.1 Procedure for χ2-Test of Goodness of Fit – Steps

1. Formulate H0 and H1.
Null hypothesis H0: There is no difference between the set of observed frequencies
and the set of expected frequencies; that is, any difference can be attributed to
chance.
Alternate hypothesis H1: There is a difference between the two sets of
frequencies.
2. Calculate the expected frequency.
3. The test statistic. Calculate χ2 by using the formula.
4. Decide on the level of significance of the test. Alpha (α) = 5% is a commonly used
level of significance; it is the probability of a Type I error. Thus, the probability is
0.05 that a true null hypothesis will be rejected. The degrees of freedom = (n – 1),
where n = number of categories.
5. The decision rule and the critical value. Determine the critical (tabulated) value of
χ2 and then compare it with the calculated value.

10.3.2 Critical Value


The decision rule in hypothesis testing requires finding a number that separates the
region where we do not reject H0 from the region of rejection. This number is called the
critical value.

6. Deduce the conclusion.

10.3.3 Decision Rules

1. The decision rule indicates that if there are large differences between the expected
and observed frequencies, resulting in a computed χ2 of more than the table
value, the null hypothesis should be rejected. If the calculated value of chi-square
(χ2calculated) is greater than the tabulated value of chi-square (χ2tabulated), we reject H0
and accept the alternate hypothesis, H1.
2. If the difference between the observed and expected frequencies is small, then H0
should not be rejected. Clearly, if the differences between Eij and Oij are small, the
computed χ2-value will be small and the null hypothesis should not be rejected,
because such small differences between the expected and observed frequencies are
probably due to chance. Do not reject the null hypothesis, H0, if the calculated value
of χ2 is less than or equal to the χ2 table value.
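The two decision rules amount to a single comparison. The sketch below uses a hypothetical function name; the critical value must still be read from a χ2 table for the chosen level of significance and degrees of freedom.

```python
def decide(chi2_calculated, chi2_tabulated):
    """Apply the decision rule: reject H0 only if the calculated value exceeds the table value."""
    if chi2_calculated > chi2_tabulated:
        return "reject H0"
    return "do not reject H0"


# 3.84 is the table value for 1 d.f. at the 5% level of significance.
print(decide(14.63, 3.84))   # reject H0
print(decide(3.84, 3.84))    # do not reject H0 (equal to the table value)
print(decide(2.10, 3.84))    # do not reject H0
```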

10.4 Chi-Square Test – Test of Independence


10.4.1 Characteristics

1. It is used to evaluate the relationship between two or more variables.


2. It is useful when analysing cross-tabulation.
3. It helps in determining whether there is any significant association between the
variables involved in the research problem.
For example, suppose N observations are classified according to two criteria.
We may ask whether the criteria are related or independent, for example

1. Whether a particular drug is effective in controlling fever


2. Whether there is any association between petrol consumption and air pollution

10.4.2 Procedure for χ2-Test of Independence – Steps

1. Formulate the null (H0) and alternate (H1) hypotheses.


2. Calculate the expected frequency.

$$E_{ij} = \frac{n_i \times n_j}{n}$$

where:
Eij = expected frequency of a cell corresponding to a particular row and column
ni = row total
nj = column total
n = total sample size
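As a sketch of this step, the function below computes every expected frequency from the row and column totals. The function name is hypothetical; the counts are those of the dengue data given in Table 10.1 of this chapter.

```python
def expected_frequencies(observed):
    """Return the table of E_ij = (row total x column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: inoculated, not inoculated. Columns: attacked, not attacked.
dengue = [[31, 469], [185, 1315]]
print(expected_frequencies(dengue))   # [[54.0, 446.0], [162.0, 1338.0]]
```

The expected frequencies always reproduce the row and column totals of the observed table, which is a quick check on the arithmetic.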

3. Calculate χ2 by using the formula

$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}$$

where:
Oij = observed frequency in ith row and jth column
Eij = expected frequency in ith row and jth column
n = number of rows
k = number of columns

4. Decide on the test level of significance and degrees of freedom.
Degrees of freedom = [Number of rows – 1] [Number of columns – 1]
5. Determine the critical value of χ2 and then compare it with the calculated value.

6. Deduce the conclusion.


For example,

H0: Given data follows a Poisson distribution.


H0: X batch is a successful batch.

Exercise
Table 10.1 shows the data obtained during an epidemic of dengue.
Test the efficiency of inoculation in preventing the attack of dengue.

Solution
1. Formulate H0 and H1.
H0: Inoculation is not effective in preventing the attack of dengue.
H1: Inoculation is effective in preventing the attack of dengue.
2. Calculate the expected frequency.

$$E_{ij} = \frac{n_i \times n_j}{n}$$

where:
Eij = expected frequency of a cell corresponding to a particular row and column
ni = row total
nj = column total
n = total sample size

The observed frequencies and the corresponding expected frequencies (in parentheses)
are shown in Table 10.2.
3. Calculate χ2 by using the formula

$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}$$

where:
Oij = observed frequency in ith row and jth column
Eij = expected frequency in ith row and jth column
n = number of rows
k = number of columns
TABLE 10.1
Data of an Epidemic of Dengue
                  Attacked    Not Attacked    Total
Inoculated        31          469             500
Not inoculated    185         1315            1500
Total             216         1784            2000

TABLE 10.2
Observed and Expected Frequencies (expected values in parentheses)
                  Attacked                       Not Attacked                      Total
Inoculated        31  (216 × 500/2000 = 54)      469  (1784 × 500/2000 = 446)      500
Not inoculated    185 (216 × 1500/2000 = 162)    1315 (1784 × 1500/2000 = 1338)    1500
Total             216                            1784                              2000

TABLE 10.3
Calculation of χ2
Oi      Ei      (Oi – Ei)    (Oi – Ei)2    (Oi – Ei)2/Ei
31      54      –23          529           9.80
185     162     23           529           3.26
469     446     23           529           1.18
1315    1338    –23          529           0.39
∑Oi = 2000    ∑Ei = 2000    ∑(Oi – Ei)2/Ei = 14.63

The calculation is shown in Table 10.3.


4. Decide on the test level of significance and degrees of freedom.
Degrees of freedom = [Number of rows – 1] [Number of columns – 1]
= [2 – 1] [2 – 1]
=1
The level of significance is 5%.
5. Determine the critical value of χ2 and then compare it with the calculated
value.
From the χ2-distribution table,

χ2Tabulated (1 d.f. at 5%) = 3.84

χ2Calculated = 14.63

χ2Calculated (14.63) > χ2Tabulated (3.84)

See the graph in Figure 10.1.


[FIGURE 10.1: Chi-square test of independence. The rejection region lies to the right of
the critical value 3.84; the chi-square test statistic, 14.63, falls inside it.]

Conclusion: There is evidence to reject the null hypothesis, because

χ2Calculated (14.63) > χ2Tabulated (3.84)

Therefore, we conclude that inoculation is effective in preventing the attack
of dengue.

Exercise
Channel viewership is segregated according to age group. Evaluate the statistical sig-
nificance of association among the variables involved in the cross-tabulation.
Table 10.4 shows the channel viewership distribution according to age group.

Solution
Observation: The 15–35 age group respondents prefer Channel A compared with the
other channels. But, it is not certain whether this observation is representative of the
entire population or whether it is due to sampling error. This dilemma can be resolved
by subjecting the data to a χ2-test.

TABLE 10.4
Channel Viewership Distribution according to Age Group
Age Group ↓ / Channel →    A      B      C      Total
10–15                      20     30     30     80
15–35                      80     70     50     200
≥35                        60     40     20     120
Total                      160    140    100    400

Steps
1. Formulate H0 and H1.
H0: There is no significant association between age groups and channel
viewership.
H1: There is a significant association between age groups and channel
viewership.
2. Calculate the expected frequency.
The expected frequency value for each category (age group) can be calcu-
lated by using the formula

ni × n j
E ij =
n

where:
Eij = expected frequency of a cell corresponding to a particular age group and a
particular channel
ni = row total
nj = column total
n = total sample size

Table 10.5 shows the observed frequencies and corresponding expected fre-
quencies in parentheses.
3. Calculate

$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$


where
Oij = observed frequency in ith row and jth column
Eij = expected frequency in ith row and jth column

The calculation of χ2 is shown in Table 10.6.


4. Decide on the level of significance and degrees of freedom.
Let us set the level of significance at 1%.
Degrees of freedom ν = (n – 1) (k – 1),

where
ν = degrees of freedom
n = number of rows
k = number of columns

TABLE 10.5
Observed and Expected Frequencies (expected values in parentheses)
Age Group ↓ / Channel →    A                           B                           C                           Total
10–15                      20 (80 × 160/400 = 32)      30 (80 × 140/400 = 28)      30 (80 × 100/400 = 20)      80
15–35                      80 (200 × 160/400 = 80)     70 (200 × 140/400 = 70)     50 (200 × 100/400 = 50)     200
≥35                        60 (120 × 160/400 = 48)     40 (120 × 140/400 = 42)     20 (120 × 100/400 = 30)     120
Total                      160                         140                         100                         400
TABLE 10.6
Calculation of χ2
Oij     Eij     (Oij – Eij)    (Oij – Eij)2    (Oij – Eij)2/Eij
20      32      –12            144             4.5
30      28      2              4               0.1428571
30      20      10             100             5.0
80      80      0              0               0
70      70      0              0               0
50      50      0              0               0
60      48      12             144             3
40      42      –2             4               0.095238
20      30      –10            100             3.33333
∑Oij = 400    ∑Eij = 400    χ2 ≅ 16.08

Here, n = 3, k = 3 and
ν = (3 – 1) (3 – 1) = 2 × 2 = 4
5. Determine the critical value and compare it with the calculated value.
From the χ2-distribution table,
χ2Tabulated for 4 d.f. at the 1% level of significance = 13.28
χ2Calculated = 16.08

χ2Tabulated (13.28) < χ2Calculated (16.08)

Graph the χ2-test, for 4 degrees of freedom, at the 1% level of significance
showing the region of rejection (Figure 10.2).

[FIGURE 10.2: Chi-square test of independence. The rejection region lies to the right of
the critical value 13.28; the chi-square test statistic, 16.08, falls inside it.]


6. Deduce the conclusion: evidence to reject the null hypothesis (H0).

χ2Calculated (16.08) > χ2tabulated (13.28)

Therefore, the null hypothesis that there is no significant association
between the age group and the channel viewership is rejected. This implies
that the channel viewership is significantly (1% level) dependent on the age
group.

Conclusion
The channel viewership is significantly (1% level) dependent on the age group.
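Both worked exercises follow the same steps, so they can be reproduced with one routine. The following is a minimal sketch with a hypothetical function name; the observed counts are those of Tables 10.1 and 10.4, and the critical values 3.84 and 13.28 are the table values quoted in the solutions.

```python
def chi_square_independence(observed):
    """Return (chi2, degrees of freedom) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n   # expected frequency
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return chi2, df


dengue = [[31, 469], [185, 1315]]                           # Table 10.1
channels = [[20, 30, 30], [80, 70, 50], [60, 40, 20]]       # Table 10.4

for name, table, critical in [("dengue", dengue, 3.84), ("channels", channels, 13.28)]:
    chi2, df = chi_square_independence(table)
    verdict = "reject H0" if chi2 > critical else "do not reject H0"
    print(f"{name}: chi2 = {chi2:.2f}, d.f. = {df}, critical value = {critical}: {verdict}")
```

Computed without intermediate rounding, the statistics come out at about 14.64 and 16.07, marginally different from the hand-calculated 14.63 and 16.08 in Tables 10.3 and 10.6; the conclusions are the same in both cases.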

10.5 Strength of Association


1. The test of independence will only enable the researcher to identify whether there
is an association between the two variables.
2. The test will not describe the strength or magnitude of the association.
3. The strength of the association can be evaluated using two key techniques:
a. Phi-coefficient
b. Coefficient of contingency

10.6 Phi-Coefficient
In order to measure the strength, the phi-coefficient is used. It is denoted by Φ . The phi-
coefficient measures the strength of association between only two variables (i.e. with two
rows and two columns). It fails to measure the strength of association between more than
two variables.
A formula for measuring the strength of association between two variables is

$$\Phi = \sqrt{\frac{\chi^2}{n}}$$

where
χ2 = calculated chi-square value
n = sample size

A limitation is that the Φ measure is suitable only for 2 × 2 tables.
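As an illustration, the phi-coefficient can be computed as the square root of χ2 divided by the sample size. The sketch below uses a hypothetical function name and, as an assumed example, the χ2 value and sample size of the 2 × 2 dengue table from earlier in the chapter.

```python
import math


def phi_coefficient(chi2, n):
    """Phi = sqrt(chi2 / n); intended for 2 x 2 tables only."""
    if n <= 0 or chi2 < 0:
        raise ValueError("n must be positive and chi2 non-negative")
    return math.sqrt(chi2 / n)


# Dengue example: calculated chi-square 14.63 with 2000 observations.
print(round(phi_coefficient(14.63, 2000), 4))   # approximately 0.0855
```

A value this close to 0 suggests that, although the association is statistically significant, its strength is weak.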



10.7 Coefficient of Contingency


This technique can measure the strength of association between more than two variables.
It can be calculated for tables of any size. The coefficient of contingency (C) can be calcu-
lated by using the given formula

$$C = \sqrt{\frac{\chi^2}{\chi^2 + n}}$$

where:
χ2 = calculated chi-square value
n = sample size

1. The coefficient varies from 0 to 1.


2. Value 0 indicates there is no association between the variables.
3. Value 1 indicates the maximum strength.
For example, the calculated χ2-value is 16.08, and the sample size is 200. The
coefficient of contingency is given by

$$C = \sqrt{\frac{\chi^2}{\chi^2 + n}} = \sqrt{\frac{16.08}{16.08 + 200}} = 0.2727945796$$
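The same calculation can be sketched in a few lines. The function name is hypothetical; the inputs are the χ2 value and sample size used in the example above.

```python
import math


def contingency_coefficient(chi2, n):
    """C = sqrt(chi2 / (chi2 + n)); usable for tables of any size."""
    if n <= 0 or chi2 < 0:
        raise ValueError("n must be positive and chi2 non-negative")
    return math.sqrt(chi2 / (chi2 + n))


print(round(contingency_coefficient(16.08, 200), 4))   # approximately 0.2728
```

Because χ2 + n is always larger than χ2 for any positive n, this ratio stays strictly below 1 for finite data, and it equals 0 only when χ2 is 0.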

10.8 Summary
In this chapter, we have looked at some situations where we can develop tests based on the
chi-square distribution. We started by testing the variance of a normal population where
the test statistic used was (n − 1)s 2/σ 2 since the distribution of the sample variance s2 was
not known directly. We found that such tests could be one-tailed depending on our null
and alternative hypotheses.
We then described a multinomial experiment and found that if we have data that classify
observations into k different categories, and if the conditions for the multinomial
experiment are satisfied, then a test statistic called the chi-square statistic, defined as

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

will have a chi-square distribution with specified degrees of freedom. Here, Oi refers to
the observed frequency of the ith category and Ei to the expected frequency of the ith cat-
egory, and the degrees of freedom is equal to the number of categories minus 1, minus the
number of independent parameters estimated from the data to calculate the Ei values. This
concept was used to develop tests concerning the goodness of fit of the observed data to
any hypothesised distribution, and also to test whether two criteria for classification are
independent.
The chapter also discussed the concepts of the chi-square test and its uses and testing
for goodness of fit. The various tests of hypothesis in the earlier sections are based on the
assumption that the sampling distribution follows a normal distribution curve. However,
it is not always possible to assume the distribution pattern from which the samples are
drawn. To overcome this difficulty, many managers use chi-square tests. Chi-square is a
test statistic used to test a hypothesis that provides a set of theoretical frequencies with
which observed frequencies are compared.

REVIEW QUESTIONS

1. What is the χ2-test of goodness of fit? What precautions are necessary in using this
test?
2. Describe the χ2-test of significance and state the various uses to which it can
be put.
3. Illustrate with examples the usefulness of the χ2-test as a test for independence.
4. Describe the use of the χ2-test in testing the independence of attributes in a 2 × 2
contingency table.
5. Explain how the χ2-distribution can be used to (a) test the goodness of fit and
(b) test the independence of the cell frequencies in a 2 × 2 contingency table.

SELF-PRACTICE PROBLEMS

1. The following data relates to the sales in a time of trade depression of a certain
proprietary article in wide demand. Does the data suggest that the sales are sig-
nificantly affected by the depression?

Districts Where     Districts Not Hit    Districts Hit
Sales Are           by Depression        by Depression    Total
Satisfactory        250                  80               330
Not satisfactory    140                  30               170
Total               390                  110              500

[Answer: χ2 = 2.84; χ2 0.05 = 3.84. The hypothesis holds well.]


2. Two sample polls of votes for two candidates A and B for a public office are taken,
one from among residents of urban areas and the other from residents of rural
areas. The results are given below. Examine whether the nature of the area is
related to voting preferences in the election.

          Votes for
Area      A       B       Total
Rural     620     380     1000
Urban     550     450     1000
Total     1170    830     2000

[Answer: χ2 = 10.09; the table value of χ2 at the 5% level for ν = 1 is 3.84. The
hypothesis does not hold well.]
