Chi Square
NOYON KUMAR DA
Reg. No. 2016334044
Session: 2016-17
3rd Year 2nd Semester
Dept. of Industrial & Production Engineering
Shahjalal University of Science & Technology, Sylhet.
Date of Experiment: 27/08/19
Date of Submission: 17/09/19
OBJECTIVES
i. To study the chi-square test.
ii. To check the normality of data using the test.
INTRODUCTION:
A chi-squared test, also written as χ² test, is any statistical hypothesis test where
the sampling distribution of the test statistic is a chi-squared distribution when
the null hypothesis is true. Without other qualification, 'chi-squared test' often is
used as short for Pearson's chi-squared test. The chi-squared test is used to
determine whether there is a significant difference between the expected
frequencies and the observed frequencies in one or more categories.
In the standard applications of this test, the observations are classified into
mutually exclusive classes, and there is some theory, or say null hypothesis, which
gives the probability that any observation falls into the corresponding class. The
purpose of the test is to evaluate how likely the observations that are made
would be, assuming the null hypothesis is true.
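The comparison of observed and expected frequencies described above can be sketched as follows. This is a minimal illustration, not part of the experiment: the die-roll counts and the function name `goodness_of_fit_statistic` are hypothetical, chosen only to show how the null-hypothesis probabilities turn into expected counts.

```python
def goodness_of_fit_statistic(observed, probabilities):
    """Pearson chi-square statistic for observed counts against null probabilities."""
    n = sum(observed)
    # Under the null hypothesis, each class is expected to hold n * p observations.
    expected = [n * p for p in probabilities]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, null hypothesis "the die is fair".
rolls = [8, 12, 9, 11, 10, 10]
fair = [1 / 6] * 6
statistic = goodness_of_fit_statistic(rolls, fair)
print(f"chi-square = {statistic:.2f} with {len(rolls) - 1} degrees of freedom")
```

A small statistic means the observed counts sit close to what the null hypothesis predicts; a large one counts as evidence against it.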
Chi-squared tests are often constructed from a sum of squared errors, or through
the sample variance. Test statistics that follow a chi-squared distribution arise
from an assumption of independent normally distributed data, which is valid in
many cases due to the central limit theorem. A chi-squared test can be used to
attempt rejection of the null hypothesis that the data are independent.
DESCRIPTION:
2. The sample sizes of the study groups are unequal; for the χ2 the groups may be
of equal size or unequal size, whereas some parametric tests require groups of
equal or approximately equal size.
3. The original data were measured at an interval or ratio level, but violate one of
the following assumptions of a parametric test:
   c. For any of a number of reasons (1), the continuous data were collapsed into
   a small number of categories, and thus the data are no longer interval or ratio.
The χ2 test also rests on the following assumptions:
1. The data in the cells should be frequencies, or counts of cases, rather than
percentages or some other transformation of the data.
2. The levels (or categories) of the variables are mutually exclusive. That is, a
particular subject fits into one and only one level of each of the variables.
3. Each subject may contribute data to one and only one cell in the χ2. If, for
example, the same subjects are tested over time such that the comparisons are of
the same subjects at Time 1, Time 2, Time 3, etc., then χ2 may not be used.
4. The study groups must be independent. This means that a different test must be
used if the two groups are related. For example, a different test must be used if
the researcher’s data consist of paired samples, such as in studies in which a
parent is paired with his or her child.
5. There are 2 variables, and both are measured as categories, usually at the
nominal level. However, the data may be ordinal. Interval or ratio data that have
been collapsed into ordinal categories may also be used. While Chi-square has no
rule about limiting the number of cells (by limiting the number of categories for
each variable), a very large number of cells (over 20) can make it difficult to meet
assumption #6 below, and to interpret the meaning of the results.
6. The expected value should be 5 or more in at least 80% of the cells, and no
cell should have an expected value of less than one (3). This assumption is most
likely to be met if the sample size equals at least the number of cells multiplied by
5. Essentially, this assumption specifies the number of cases (sample size) needed
to use the χ2 for any number of cells in that χ2.
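The rule on expected values can be checked mechanically. The sketch below is one possible way to do so; the function name `expected_counts_ok`, its parameter names and the example tables are hypothetical and not taken from the experiment.

```python
def expected_counts_ok(expected, min_count=5.0, min_share=0.8, floor=1.0):
    """Check the rule of thumb: no expected value below `floor`, and at least
    `min_share` of the cells with an expected value of `min_count` or more."""
    cells = [e for row in expected for e in row]
    if any(e < floor for e in cells):
        return False
    share = sum(e >= min_count for e in cells) / len(cells)
    return share >= min_share


# Hypothetical tables of expected values.
mostly_large = [[12.0, 8.0], [6.0, 4.0], [30.0, 20.0]]   # 5 of 6 cells are >= 5
one_tiny_cell = [[12.0, 0.5], [6.0, 7.0]]                 # one cell is below 1
print(expected_counts_ok(mostly_large), expected_counts_ok(one_tiny_cell))
```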
Case study
To illustrate the calculation and interpretation of the χ2 statistic, the following
case example will be used:
The owner of a laboratory wants to keep sick leave as low as possible by keeping
employees healthy through disease prevention programs. Many employees have
contracted pneumonia leading to productivity problems due to sick leave from
the disease. There is a vaccine for pneumococcal pneumonia, and the owner
believes that it is important to get as many employees vaccinated as possible. Due
to a production problem at the company that produces the vaccine, there is only
enough vaccine for half the employees. In effect, there are two groups;
employees who received the vaccine and employees who did not receive the
vaccine. The company sent a nurse to every employee who contracted
pneumonia to provide home health care and to take a sputum sample for culture
to determine the causative agent. They kept track of the number of employees
who contracted pneumonia and which type of pneumonia each had. The data
were organized as follows:
Group 1: Not provided with the vaccine (unvaccinated control group, N = 92)
Group 2: Provided with the vaccine (vaccinated group, N = 92)
Table 1
Results of the vaccination program.
Calculating Chi-square
With the data in table form, the researcher can proceed with calculating the χ2
statistic to find out if the vaccination program made any difference in the health
outcomes of the employees. The formula for calculating a Chi-Square is:
$$\sum \chi^2_{i-j} = \sum \frac{(O - E)^2}{E}$$
Where:
$O$ = Observed (the actual count of cases in each cell of the table)
$E$ = Expected value (calculated below)
$\chi^2$ = The cell Chi-square value
$\sum \chi^2$ = Formula instruction to sum all the cell Chi-square values
$\chi^2_{i-j}$ = i−j is the correct notation to represent all the cells, from the first
cell (i) to the last cell (j); in this case Cell 1 (i) through Cell 6 (j).
The first step in calculating a χ2 is to calculate the sum of each row and the sum
of each column. These sums are called the “marginals”, and there are row marginal
values and column marginal values. The marginal values for the case study data
are presented in Table 2.
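The marginal calculation can be sketched in a few lines. The counts below are hypothetical placeholders for a 3 × 2 table (they are not the case study data), and `marginals` is an illustrative helper name.

```python
def marginals(table):
    """Return (row marginals, column marginals, total sample size) for a count table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    return row_totals, column_totals, sum(row_totals)


# Hypothetical 3 x 2 table of counts (rows = outcomes, columns = groups).
counts = [[30, 10], [20, 20], [10, 30]]
row_totals, column_totals, n = marginals(counts)
print(row_totals, column_totals, n)
```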
Table 2
Calculation of marginals.
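The expected value for each cell is calculated with the following formula:
$$E = \frac{M_R \times M_C}{n}$$
Where:
$E$ represents the cell expected value, $M_R$ represents the row marginal for that
cell, $M_C$ represents the column marginal for that cell, and $n$ represents the total
sample size.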
Specifically, for each cell, its row marginal is multiplied by its column marginal,
and that product is divided by the sample size. For Cell 1, the math is as follows:
(28 × 92)/184 = 14.00, although Table 3 reports 13.92 for this cell and the
calculations below use that reported value. Table 3 provides the results of this
calculation for each cell.
Once the expected values have been calculated, the cell χ2 values are calculated
with the following formula:
$$\chi^2 = \frac{(O - E)^2}{E}$$
The cell χ2 for the first cell in the case study data is calculated as follows:
(23 − 13.92)²/13.92 = 5.92. The cell χ2 value for each cell is the value in
parentheses in each of the cells in Table 3.
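The cell formula can be checked against the two cells for which the report quotes both an observed and an expected value (Cell 1: 23 and 13.92; Cell 2: 5 and 12.57). This is a small sketch; `cell_chi_square` is an illustrative name.

```python
def cell_chi_square(observed, expected):
    """Contribution of a single cell to the chi-square statistic."""
    return (observed - expected) ** 2 / expected


cell_1 = cell_chi_square(23, 13.92)
cell_2 = cell_chi_square(5, 12.57)
print(round(cell_1, 2), round(cell_2, 2))
```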
Table 3
Cell expected values and (cell Chi-square values).
For the sample table with 3 rows and 2 columns, df = (3−1) × (2−1) = 2 × 1 = 2. A
Chi-square table of significances is available in many elementary statistics texts
and on many Internet sites. Using a χ2 table, the significance of a Chi-square value
of 12.35 with 2 df equals P < 0.005. This value may be rounded to P < 0.01 for
convenience. The exact significance when the Chi-square is calculated through a
statistical program is found to be P = 0.0011.
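The degrees of freedom and the comparison against the table thresholds can be sketched as below. The sketch relies on one fact that holds only for 2 degrees of freedom: the upper-tail probability of the chi-square distribution is then exp(−χ²/2). For other degrees of freedom a statistical table or library is needed. The function names are illustrative.

```python
import math


def degrees_of_freedom(n_rows, n_columns):
    """Degrees of freedom of a contingency table: (rows - 1) x (columns - 1)."""
    return (n_rows - 1) * (n_columns - 1)


def p_value_two_df(chi_square):
    """Upper-tail probability of a chi-square distribution with exactly 2 df."""
    return math.exp(-chi_square / 2.0)


df = degrees_of_freedom(3, 2)
p = p_value_two_df(12.35)
print(f"df = {df}, P < 0.005: {p < 0.005}, P < 0.05: {p < 0.05}")
```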
As the P-value for the table is less than 0.05, the researcher rejects the null
hypothesis and accepts the alternative hypothesis: “There is a difference in
occurrence of pneumococcal pneumonia between the vaccinated and
unvaccinated groups.” However, this result does not specify what that difference
might be. To fully interpret the result, it is useful to look at the cell χ2 values.
It can be seen in Table 3 that the largest cell χ2 value of 5.92 occurs in Cell 1. This
is a result of the observed value being 23 while only 13.92 were expected.
Therefore, this cell has a much larger number of observed cases than would be
expected by chance. Cell 1 reflects the number of unvaccinated employees who
contracted pneumococcal pneumonia. This means that the number of
unvaccinated people who contracted pneumococcal pneumonia was significantly
greater than expected. The second largest cell χ2 value of 4.56 is located in Cell 2.
However, in this cell we discover that the number of observed cases was much
lower than expected (Observed = 5, Expected = 12.57). This means that a
significantly lower number of vaccinated subjects contracted pneumococcal
pneumonia than would be expected if the vaccine had no effect. No other cell has
a cell χ2 value greater than 0.99.
A cell χ2 value less than 1.0 should be interpreted as the number of observed
cases being approximately equal to the number of expected cases, meaning there
is no vaccination effect on any of the other cells. In the case study example, all
other cells produced cell χ2 values below 1.0. Therefore, the company can
conclude that there was no difference between the two groups in the incidence of
non-pneumococcal pneumonia. It can be seen that for both groups, the majority
of employees stayed healthy. The meaningful result was that there were
significantly fewer cases of pneumococcal pneumonia among the vaccinated
employees and significantly more cases among the unvaccinated employees. As a
result, the company should conclude that the vaccination program did reduce the
incidence of pneumococcal pneumonia.
Very few statistical programs provide tables of cell expected and cell χ2 values as
part of the default output. Some programs will produce those tables as an option,
and that option should be used to examine the cell χ2 values. If the program
provides an option to print out only the cell χ2 value (but not cell expected), the
sign attached to the value provides information. Because (O−E)²/E can never be
negative, a sign on a reported cell value only indicates direction: a positive value
means that the observed value is higher than the expected value, and a negative
value (e.g. −12.45) means the observed cases are fewer than the expected number of
cases. When the program does not provide either option, all the researcher can
conclude is this: the overall table either provides evidence that the two groups
differ significantly (P < 0.05), that is, the outcome is not independent of group
membership, or provides no such evidence (P > 0.05). Most researchers inspect the
table to estimate which cells are overrepresented with a large number of cases
versus those which have a small number of cases. However, without access to cell
expected or cell χ2 values, the interpretation of the direction of the group
differences is less precise. Given the ease of calculating the cell expected and χ2
values, researchers may want to hand calculate those values to enhance
interpretation.
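When a program reports only the overall result, the cell expected values, cell χ2 values and the direction of each difference can be reproduced by hand or with a short script. The following is a sketch under stated assumptions: the counts are hypothetical (not the case study data) and `cell_breakdown` is an illustrative helper.

```python
def cell_breakdown(table):
    """Return, per cell, (expected count, cell chi-square, direction of O versus E)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    breakdown = []
    for r, row in enumerate(table):
        cells = []
        for c, observed in enumerate(row):
            expected = row_totals[r] * column_totals[c] / n
            chi_square = (observed - expected) ** 2 / expected
            if observed > expected:
                direction = "above expected"
            elif observed < expected:
                direction = "below expected"
            else:
                direction = "as expected"
            cells.append((expected, chi_square, direction))
        breakdown.append(cells)
    return breakdown


# Hypothetical 3 x 2 table of counts.
counts = [[30, 10], [20, 20], [10, 30]]
breakdown = cell_breakdown(counts)
total = sum(chi for cells in breakdown for _, chi, _ in cells)
for cells in breakdown:
    print(cells)
print("table chi-square:", total)
```

Reporting the direction as a separate label keeps the cell χ2 value itself non-negative, which matches the formula.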
DISCUSSION:
The chi square test for independence is an extremely flexible and useful test. The
test can be used to examine the relationship between any two variables, with any
type of measurement: nominal, ordinal, interval or ratio, and discrete or
continuous. While chi square tests of independence are very flexible, and can be
used with any cross classification, a researcher must be careful not to either
overemphasize or hide a relationship between two variables. In doing this, the chi
square test itself is not the problem. The methodological problem is the difficulty
of deciding the proper approach to the grouping of the data. There are no strict
guidelines concerning how data is to be grouped properly. In many cases, the
researcher may try several groupings, and observe what happens as these
groupings change. If the results change little as the groupings change, then the
relationship is likely to be quite apparent. Where the relationship seems to
change as the grouping of the data changes, considerably more effort may have
to be made to discern the exact nature of the relationship between the variables.
In our study, we measured the diameters of 120 small rollers, plotted the data,
and used the chi square test to check whether the diameters follow a normal
distribution, although we did not obtain an exact value.
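One common way to carry out such a normality check is a chi-square goodness-of-fit test against a fitted normal curve. The sketch below is only an illustration of that idea and is not the procedure or the data of this experiment: the diameters are synthetic random numbers, the choice of six equal-probability bins is an assumption, and `normality_chi_square` is a hypothetical helper. It requires Python 3.8 or later for `statistics.NormalDist`.

```python
import random
import statistics


def normality_chi_square(data, n_bins=6):
    """Chi-square goodness-of-fit statistic of `data` against a fitted normal curve."""
    n = len(data)
    fitted = statistics.NormalDist(statistics.fmean(data), statistics.stdev(data))
    # Interior bin edges at equal-probability quantiles of the fitted normal,
    # so every bin has the same expected count n / n_bins.
    edges = [fitted.inv_cdf(k / n_bins) for k in range(1, n_bins)]
    observed = [0] * n_bins
    for x in data:
        observed[sum(x > edge for edge in edges)] += 1
    expected = n / n_bins
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    # One df is lost to the fixed total and two to the estimated mean and sd.
    df = n_bins - 1 - 2
    return statistic, df, observed


# Synthetic stand-in for 120 roller diameters (mm); NOT the measured data.
rng = random.Random(0)
diameters = [rng.gauss(10.0, 0.05) for _ in range(120)]
statistic, df, observed = normality_chi_square(diameters)
CRITICAL_5_PERCENT_3_DF = 7.815  # standard chi-square table value for 3 df
print(observed, round(statistic, 2), df)
if statistic > CRITICAL_5_PERCENT_3_DF:
    print("reject normality at the 5% level")
else:
    print("no evidence against normality at the 5% level")
```

With 120 observations and six bins, each bin has an expected count of 20, which satisfies the expected-value rule of at least 5 per cell.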
CONCLUSION:
The chi square distribution is a theoretical or mathematical distribution which has
wide applicability in statistical work. The term ‘chi square’ (pronounced with a
hard ‘ch’) is used because the Greek letter χ is used to define this distribution. It
will be seen that the elements on which this distribution is based are squared, so
that the symbol χ2 is used to denote the distribution. Each χ2 distribution has a
number of degrees of freedom associated with it, so that there are many different
chi squared distributions. The χ2 statistic appears quite different from the other
statistics which have been used in the previous hypothesis tests. It also appears to
bear little resemblance to the theoretical chi square distribution just described.
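The link between squared elements and degrees of freedom can be illustrated by simulation: a chi-square variable with k degrees of freedom is the sum of k squared independent standard normal values, and its mean equals k. The sketch below uses an arbitrary seed and sample size, and `simulate_chi_square` is an illustrative name.

```python
import random


def simulate_chi_square(df, n_draws, seed=0):
    """Draw chi-square values as sums of `df` squared standard normal values."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(n_draws)]


draws = simulate_chi_square(df=3, n_draws=20000)
sample_mean = sum(draws) / len(draws)
print(f"sample mean of simulated values: {sample_mean:.2f} (theoretical mean: 3)")
```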