Goodness of Fit Tests Contingency Tables
Statistics Lecture 9
Chi-Square Goodness of Fit Tests for Two-Way Tables
Virginia Tech
Categorical Data - Relationships
▶ We worked with categorical data in the previous lecture and introduced the Chi-square goodness-of-fit test.
▶ We tested population proportions based on the observed counts (or observed frequencies) within each category.
▶ We are going to extend our methods to the more advanced scenario where we have multiple variables that may (or may not) be dependent on each other.
An Example from Wikipedia
A sample of 100 individuals was drawn at random from the population. The individuals were asked about
their handedness. The sex of the participants was also recorded.
                   Handedness
Sex        Right-handed   Left-handed   Total
Males                43             9      52
Females              44             4      48
Totals               87            13     100
The row totals are the marginal frequencies for the sex categories.
The column totals are the marginal frequencies for the handedness categories.
We could just use the marginals to study the 1-way categories for Goodness of Fit. *Hint* for some
homework problems.
Pr{E|C} = Probability that event E occurs given that event C has already occurred.
We can ask the question, what proportion of the population is a “lefty” given that they are male?
So Pr{L|M} = ?
Of course, we estimate this using the sample proportion $\widehat{\Pr}\{L \mid M\}$:
$$\widehat{\Pr}\{L \mid M\} = \frac{\#\text{ of left-handed males}}{\#\text{ of males}} = \frac{9}{52} \approx 0.173$$
$$\widehat{\Pr}\{L \mid F\} = \frac{4}{48} \approx 0.083$$
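The two estimates above can be reproduced in R. This is a minimal sketch, not part of the original slides: the object names `handed` and `cond_by_sex` are made up for illustration, and the counts are typed in directly from the handedness table.

```r
# Handedness table from the example (rows: sex, columns: handedness)
handed <- matrix(c(43, 9,
                   44, 4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Sex = c("Males", "Females"),
                                 Handedness = c("Right", "Left")))

# Divide each row by its row total: entry [i, j] estimates Pr{column j | row i}
cond_by_sex <- prop.table(handed, margin = 1)

print(cond_by_sex["Males", "Left"])    # estimate of Pr{L|M} = 9/52
print(cond_by_sex["Females", "Left"])  # estimate of Pr{L|F} = 4/48
```

Using `margin = 1` conditions on the row variable (sex); `margin = 2` would condition on handedness instead.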
We could test the hypothesis in the following manner.
H0 : Pr{L|M} = Pr{L|F}
or equivalently
H0 : Pr{L|M} = Pr{L|M^c}
Recall that M^c means the complement of M, interpreted as “not M”. We can do this because sex is a
dichotomous variable here.
HA : Pr{L|M} ≠ Pr{L|F}
The method that we use is essentially the Chi-square goodness-of-fit test that we discussed in the
previous lecture.
The Chi-square statistic from the previous lecture was
$$\chi^2_s = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$
We modify this test statistic just slightly (or we are careful about how we use it in the older form):
$$\chi^2_s = \sum_{\text{all } i,j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
So now we just need to make sure that we understand what $o_{ij}$ and $e_{ij}$ are.
For our example, the interior consists of cells that contain our observed frequencies:
         R     L   Total
M       43     9      52
F       44     4      48
Total   87    13     100
We define
$$O = \begin{pmatrix} o_{11} & o_{12} \\ o_{21} & o_{22} \end{pmatrix} = \begin{pmatrix} 43 & 9 \\ 44 & 4 \end{pmatrix}$$
these are the observed frequencies.
What is different now is how we calculate the expected frequencies corresponding to H0. (The theory
is discussed in the textbook.)
$$E = \begin{pmatrix} e_{11} & e_{12} \\ e_{21} & e_{22} \end{pmatrix}$$
Computing Expected Frequencies
Each individual expected frequency is found by the formula:
$$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{Grand Total}}$$
Therefore,
$$O = \begin{pmatrix} 43 & 9 \\ 44 & 4 \end{pmatrix} \quad \text{and} \quad E = \begin{pmatrix} 45.24 & 6.76 \\ 41.76 & 6.24 \end{pmatrix}$$
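As a check on the arithmetic, the expected frequencies and the test statistic for the handedness table can be computed in a few lines of R. This is a sketch under the assumption that the counts are entered by hand; the names `O`, `E`, and `chi_sq` simply mirror the notation above.

```r
# Observed counts from the handedness example (rows: sex, columns: handedness)
O <- matrix(c(43, 9,
              44, 4),
            nrow = 2, byrow = TRUE,
            dimnames = list(Sex = c("Males", "Females"),
                            Handedness = c("Right", "Left")))

# e_ij = (row i total) * (column j total) / grand total
E <- outer(rowSums(O), colSums(O)) / sum(O)

# Chi-square statistic: sum over all cells of (o_ij - e_ij)^2 / e_ij
chi_sq <- sum((O - E)^2 / E)

print(round(E, 2))
print(chi_sq)
```

`outer()` forms every product of a row total with a column total, which is exactly the numerator of the expected-frequency formula.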
Example 1
Experimental studies of cancer often use strains of animals that have a naturally high incidence of
tumors.
In one such experiment, tumor-prone mice were kept in a sterile environment; one group was kept
germ free, while the other was exposed to E. coli.
Let “p” denote the probability of a tumor. Let subscript “1” denote the germ-free treatment, and “2”
denote the E. coli treatment.
(a) How strong is the evidence that tumor incidence is higher in mice exposed to E. coli? Test at the
5% level.
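The expected table uses:
$$e_{ij} = \frac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{\text{grand total}}$$
$$e_{11} = \frac{(27)(49)}{62} = 21.34$$
$$e_{12} = \frac{(\quad)(\quad)}{62} = \;?$$
$$e_{21} = \frac{(\quad)(\quad)}{62} = \;?$$
$$e_{22} = \frac{(\quad)(\quad)}{62} = \;?$$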
Observed (Expected)
We can use the Chi-square table from the textbook/Canvas. We see that our test statistic is between the
following critical values:
However, since our test is directional, 0.05 < P-value < 0.10. (Just like we did in the previous
lecture, we need to divide the P-value by 2 for directional tests.)
We do not reject H0, since the P-value > 0.05 (P-value = 0.0703 exactly). There is insufficient evidence to
conclude that E. coli increases tumor incidence.
Independence and Association in 2 × 2 Contingency Tables
There are two viewpoints, or contexts, in which we can use a 2 × 2 contingency table.
1. Two independent samples with a dichotomous observed variable.
Example: Two groups, those who receive treatment 1 and those who receive treatment 2.
▶ Individuals must be randomly assigned to the treatments.
▶ The observed variable is the success (or failure) of each treatment.
2. One sample with two dichotomous observed variables.
The cell counts consist of the frequency of occurrence of the intersection of the categories, e.g.
the number of HIV-positive males in a study.
Recall last semester, where we discussed some rules of probability.
In particular, we saw contingency tables and we calculated things like conditional probabilities.
When a data set is viewed as a single sample with two observed variables, the Chi-square test can
be used as a “test of independence” or a “test of association”.
Use the following example as a guide. The most difficult aspect is making sure that your
verbal description of the association is correct and unambiguous.
Example 2
Men with prostate cancer were randomly assigned to undergo surgery (n = 347) or “watchful waiting”
(no surgery, n = 348).
Over the next several years there were 83 deaths in the first group and 106 deaths in the second group.
                        Treatment
Survival     Surgery (S)   Watchful Waiting (WW)   Total
Died (D)              83                     106     189
Alive (A)            264                     242     506
Total                347                     348     695
a.) Let D and A represent died and alive, respectively, and let S and WW represent surgery and
watchful waiting.
Estimate Pr{D|S}, the “probability a patient dies given that they had surgery”, and Pr{D|WW}, the
“probability a patient dies given that they did not have surgery”.
$$\widehat{\Pr}\{D \mid S\} = \frac{83}{347} = 0.239 \quad \text{and} \quad \widehat{\Pr}\{D \mid WW\} = \frac{106}{348} = 0.3046$$
We can also estimate these other conditional probabilities:
The “probability that a patient had surgery, given that the patient died”:
$$\widehat{\Pr}\{S \mid D\} = \frac{83}{189} = 0.4392$$
The “probability that a patient had surgery, given that the patient did not die”:
$$\widehat{\Pr}\{S \mid A\} = \frac{264}{506} = 0.5217$$
b.) The value of the contingency table Chi-square statistic for these data is $\chi^2_s = 3.75$. Test for a
relationship between the treatment and survival. Use a non-directional alternative and let α = 0.05.
H0 : Treatment and survival are independent (that is, the probability of death is the same for surgery
as it is for watchful waiting).
HA : Treatment and survival are not independent (that is, the probability of death may depend on
whether surgery was performed or not).
Each of the following is a valid statement of H0 using proportions.
1. H0 : Pr{D|S} = Pr{D|WW}
2. H0 : Pr{A|S} = Pr{A|WW}
3. H0 : Pr{S|D} = Pr{S|A}
4. H0 : Pr{WW|D} = Pr{WW|A}
H0 : Pr{D|S} = Pr{D|WW}
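                Treatment
Status       S      WW    Total
D           83     106      189
A          264     242      506
Total      347     348      695
For H0 : Pr{D|S} = Pr{D|WW}, we estimate the left-hand side using
$$\widehat{\Pr}\{D \mid S\} = \frac{83}{347} = 0.239$$
and we estimate the right-hand side using
$$\widehat{\Pr}\{D \mid WW\} = \frac{106}{348} = 0.3046$$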
Notice that we are doing this column-wise. Does this mean we are ignoring the possibility of an
association (dependency) in a row-wise manner? (Read the book for further detail.)
Consider the general 2 × 2 table
                       total
          a      b      a+b
          c      d      c+d
total    a+c    b+d
then
$$\frac{a}{c} = \frac{b}{d} \quad \text{if and only if} \quad \frac{a}{b} = \frac{c}{d}$$
additionally
$$\frac{a}{a+c} = \frac{b}{b+d} \quad \text{if and only if} \quad \frac{a}{a+b} = \frac{c}{c+d}$$
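A short derivation shows why the column-wise and row-wise statements are the same condition. It is a sketch that assumes every marginal total is positive, so that each cross-multiplication is reversible; both comparisons reduce to ad = bc.

```latex
\begin{align*}
\frac{a}{a+c} = \frac{b}{b+d}
  &\iff a(b+d) = b(a+c) \iff ad = bc, \\
\frac{a}{a+b} = \frac{c}{c+d}
  &\iff a(c+d) = c(a+b) \iff ad = bc.
\end{align*}
```

So equality of the column-wise proportions holds exactly when equality of the row-wise proportions holds, and testing one is testing the other.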
Using our test statistic, $\chi^2_s = 3.75$ with df = 1, we look at Table 9, which gives the critical values
2.71 (tail probability 0.10) and 3.84 (tail probability 0.05).
So 0.05 < P-value < 0.10. Therefore we fail to reject H0 at the 5% level.
We lack statistical evidence to conclude that treatment and survival are related (P-value = 0.0528).
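The statistic and exact P-value quoted above can be reproduced with R's built-in `chisq.test()`. This is a sketch, not the lecture's own code: the object name `prostate` is made up, and `correct = FALSE` is used because the hand formula in these notes has no continuity correction.

```r
# Prostate cancer trial: rows are survival status, columns are treatment
prostate <- matrix(c(83, 106,
                     264, 242),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Survival = c("Died", "Alive"),
                                   Treatment = c("Surgery", "WatchfulWaiting")))

# correct = FALSE turns off the Yates continuity correction so that the
# statistic matches the hand formula sum((O - E)^2 / E)
res <- chisq.test(prostate, correct = FALSE)

print(res$statistic)   # chi-square statistic
print(res$parameter)   # degrees of freedom
print(res$p.value)     # non-directional P-value
```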
The Chi-square test and the z-test for Proportions
It’s clear by now that the chi-square test can be used to compare the proportions of success in 2
groups. For example, consider data that is summarized in a 2 × 2 table.
Success Failure
Group 1
Group 2
We have the option to conduct a chi-square test with df = 1 for this 2 × 2 table.
H0 : p1 = p2
Recall, however, that we can also use a z-test to test the same null hypothesis (see Chapter 20).
H0 : p1 = p2
Furthermore, with z-test methods we can obtain confidence intervals and conduct directional tests.
Fact: The Chi-square test statistic, $\chi^2_s$, is the square of the z-statistic, $z_s$, and the P-value for $\chi^2_s$ is
exactly the same as the two-sided P-value for $z_s$.
For simple cases, the choice of which method to use is up to you. However, the z-test method is often
easier for directional tests and the appropriate method to use when you want confidence intervals for
the difference in the population proportions.
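The fact above can be checked numerically with the prostate cancer data. The sketch below assumes the z-statistic is the usual pooled two-sample version (standard error built from the pooled proportion under H0); all variable names are illustrative.

```r
# Prostate cancer example: deaths out of n in each treatment group
deaths <- c(surgery = 83, watchful = 106)
n      <- c(surgery = 347, watchful = 348)

# Pooled two-sample z statistic for H0: p1 = p2
p_hat  <- deaths / n
p_pool <- sum(deaths) / sum(n)
z_s    <- unname((p_hat[1] - p_hat[2]) /
                 sqrt(p_pool * (1 - p_pool) * sum(1 / n)))
p_z    <- 2 * pnorm(-abs(z_s))          # two-sided P-value

# Chi-square statistic without continuity correction on the same 2 x 2 table
tab <- rbind(died = deaths, alive = n - deaths)
res <- chisq.test(tab, correct = FALSE)

print(c(z_squared = z_s^2, chi_square = unname(res$statistic)))
print(c(p_from_z = p_z, p_from_chi_square = res$p.value))
```

The sign of `z_s` carries the direction of the difference, which is why the z-test is the more convenient form for directional alternatives.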
The r × k Contingency Table
Instead of dichotomous variables, let us now consider a larger and more general situation.
We are going to focus on the case where we have a number of categories, say r categories for variable 1
and k categories for variable 2.
Another context that we can use is that we have k independent samples, with an observed variable
with r categories.
For example, in this context we can have an SRS (simple random sample) of k patients. The observations can correspond to
different health-related categories that are observed.
As we did with 2 × 2 tables, the test statistic is still calculated the same way, but we have more cells.
The only real change other than the number of cells is
df = (r − 1)(k − 1)
Finally, the most extreme change is in the number of items in the null hypothesis.
Example: At an assembly plant for light trucks, routine monitoring of the quality of welds yields the
following data:
                                 Number of Welds
                      High Quality   Moderate Quality   Low Quality
                          (H)              (M)              (L)       Total
(D) Day Shift             467              191               42         700
(E) Evening Shift         445              171               34         650
(N) Night Shift           254              129               17         400
Total                    1166              491               93        1750
Another acceptable null hypothesis is
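$$H_0: \begin{cases} \Pr\{H \mid D\} = \Pr\{H \mid E\} = \Pr\{H \mid N\} \\ \Pr\{M \mid D\} = \Pr\{M \mid E\} = \Pr\{M \mid N\} \\ \Pr\{L \mid D\} = \Pr\{L \mid E\} = \Pr\{L \mid N\} \end{cases}$$
Why so complicated?
Before, in the 2 × 2 case, suppose we had categories Success (S) and Failure (F) for variable 1 and
categories A and B for variable 2. The null hypothesis was
$$H_0: \Pr\{S \mid A\} = \Pr\{S \mid B\}$$
We did not have to write the second line, Pr{F|A} = Pr{F|B}, because it was implied.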
Getting back to our current problem, we can easily show that the table of expected values under the
null hypothesis is:
Expected Values
                       (H)       (M)      (L)
(D) Day Shift        466.40    196.40    37.20
(E) Evening Shift    433.09    182.37    34.54
(N) Night Shift      266.51    112.23    21.26
▶ Finally, with df = (3 − 1)(3 − 1) = 4, the table shows that the P-value > 0.10. R gives an exact P-value = 0.218.
We conclude that there is no statistical evidence that weld quality differs among the shifts (P-value =
0.218).
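The weld example can be run directly in R. This is a sketch that types in the observed counts from the table (the object name `welds` is made up); `chisq.test()` returns the expected counts, the degrees of freedom, and the P-value.

```r
# Observed weld counts: rows are shifts, columns are quality levels
welds <- matrix(c(467, 191, 42,
                  445, 171, 34,
                  254, 129, 17),
                nrow = 3, byrow = TRUE,
                dimnames = list(Shift = c("Day", "Evening", "Night"),
                                Quality = c("High", "Moderate", "Low")))

# The continuity correction only applies to 2 x 2 tables, so no extra
# argument is needed for this 3 x 3 table
res <- chisq.test(welds)

print(round(res$expected, 2))  # expected counts under H0
print(res$parameter)           # df = (3 - 1)(3 - 1) = 4
print(res$p.value)
```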
Applicability of Methods
1. Design Conditions. The data must be able to be viewed in one of two ways.
1.1 As two or more independent random samples, observed with respect to a categorical
variable.
1.2 As one random sample, observed with respect to two categorical variables.
2. Sample size conditions. For the Chi-square test, the expected values must all be ≥ 5.
3. Form of H0 . While the form of H0 can be somewhat complex, the most generic form of H0 is:
If the data arise from experiments with random assignments, we can begin to make causal
inferences.
However, if the data come from purely observational studies, then we can only infer that any
observed association implied by a small P-value is not due to chance.
Practice
A group of patients with a binge-eating disorder were randomly assigned to take either the
experimental drug fluvoxamine or a placebo in a nine-week long double-blind clinical trial. At the end
of the trial the condition of each patient was classified into one of four categories: no response,
moderate response, marked response, or remission. The following table shows a cross classification of
the data. Is there statistically significant evidence, at the 0.10 level, to conclude that there is an
association between treatment group (fluvoxamine versus placebo) and condition? Show all necessary
work, state the hypothesis statements, compute the test statistic, P-value, and provide a concluding
statement in the context of this problem.
               No         Moderate   Marked
               Response   Response   Response   Remission   Total
Fluvoxamine    15         7          13         15
Placebo        23         7          8          11
Total
Using R for Tests of Association/Independence of Two Variables:
Chi-square contingency tests for an (r × k) table
In class and the book, you learned how to conduct a hypothesis test for the difference of population
proportions by using the Chi-square goodness-of-fit test.
The general formulation of the test relies on you constructing a (r × k) contingency table. You then
use the marginal frequencies and the expected frequencies to carry out the chi-square test.
Real data do not arrive already arranged in contingency tables; we must build the tables
ourselves.
Let’s use the victims data that you loaded at the very beginning of the assignment and consider an
example of how to do hypothesis tests for (2 × 2) contingency tables.
The victims data set is from the National Crime Victimization Survey from 1996-2005.
We can express categorical data in a number of ways, using names as well as indicators (i.e. numbers,
such as 1=“low”, 2=“medium”, 3=“high”).
MSA is the location where the incident occurred. To view the categories of the MSA variable, use the
following function:
levels(victims$MSA)
levels(victims$Police)
Police is a categorical variable with categories (Police, No Police) indicating whether the incident was
reported to the police.
ER is a categorical variable with categories (ER, No ER) indicating whether the victim received
treatment at the ER.
Stranger is a categorical indicator variable which indicates the offender was a stranger (indicated with
a 1), or not (indicated with a 0).
Private is a categorical indicator meaning that the location was private (indicated with a 1), or public
(indicated by a 0).
Suppose that we hypothesize that victims who call the police go to the ER more often than those who
don’t.
Then the hypotheses would look like the following:
H0 : Pr{ER|Police} = Pr{ER|No Police}
HA : Pr{ER|Police} > Pr{ER|No Police}
First, let’s do this by hand (the long way). We first need to construct a contingency table.
To construct a table of absolute frequencies in R we will use the following function
No Police Police
ER 95 675
No ER 2201 2532
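The command that produced this table is not shown in these notes. One plausible form, assuming the data frame columns are `victims$ER` and `victims$Police`, is given as a comment in the sketch below; since the data file is not available here, the sketch enters the printed counts directly so that it runs on its own.

```r
# Assumption: with the victims data frame loaded, the table could be built as
#   tbl1 <- table(victims$ER, victims$Police)
# Here the same counts are entered directly so the sketch is self-contained.
tbl1 <- as.table(matrix(c(95, 675,
                          2201, 2532),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(c("ER", "No ER"),
                                        c("No Police", "Police"))))
print(tbl1)
```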
rS <- rowSums(tbl1)
rS
ER No ER
770 4733
# To get the column sums of this table
cS <- colSums(tbl1)
cS
No Police Police
2296 3207
totS
[1] 5503
We now need the expected frequencies in a contingency table.
expected.freqs.for.tbl1
No Police Police
[1,] 321.2648 448.7352
[2,] 1974.7352 2758.2648
# Hence our Chi-squared test statistic, xs, for this (2x2) table is
xs.test.tbl1
[1] 317.9321
For a 2x2 table, the degrees of freedom is df = 1. To find the P-value, we simply use the command:
[1] 2.043284e-71
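The commands behind `totS`, `expected.freqs.for.tbl1`, `xs.test.tbl1`, and the P-value are not shown in these notes. The sketch below is one way to reproduce the printed numbers; it is an assumption, not the original code. It rebuilds the table from the printed counts, and it halves the upper-tail area because the alternative is directional, which matches the printed P-value.

```r
# Rebuild the table from the counts printed above (the victims data frame
# itself is not needed for this sketch)
tbl1 <- matrix(c(95, 675,
                 2201, 2532),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("ER", "No ER"), c("No Police", "Police")))

rS   <- rowSums(tbl1)
cS   <- colSums(tbl1)
totS <- sum(tbl1)                       # grand total

# Expected counts: (row total) x (column total) / grand total
expected.freqs.for.tbl1 <- rS %*% t(cS) / totS

# Chi-square statistic, summed over all four cells
xs.test.tbl1 <- sum((tbl1 - expected.freqs.for.tbl1)^2 / expected.freqs.for.tbl1)

# Upper-tail area for df = 1, halved because the alternative is directional
p.directional <- pchisq(xs.test.tbl1, df = 1, lower.tail = FALSE) / 2

print(xs.test.tbl1)
print(p.directional)
```

Halving is only justified when the observed difference is in the hypothesized direction; here 675/3207 of the Police group went to the ER versus 95/2296 of the No Police group.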
We conclude that victims who call the police go to the ER more often than victims who do not call the
police.
Chi-square tests for Tables in R (The Short Way)
Now, for what you’ve been waiting for, the short way.
If we already have our data in a table, then the short way simply uses the following function:
data: tbl1
X-squared = 317.93, df = 1, p-value < 2.2e-16
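The function call itself is missing from these notes. The printed statistic (317.93) equals the uncorrected Pearson value, so a call of the form `chisq.test(tbl1, correct = FALSE)` is a reasonable guess; treat that as an assumption. The sketch below compares it with the default, which applies the Yates continuity correction to 2 × 2 tables.

```r
tbl1 <- matrix(c(95, 675, 2201, 2532), nrow = 2, byrow = TRUE,
               dimnames = list(c("ER", "No ER"), c("No Police", "Police")))

# Default for 2 x 2 tables: Yates continuity correction is applied
with.correction <- chisq.test(tbl1)

# correct = FALSE gives the uncorrected Pearson statistic
no.correction <- chisq.test(tbl1, correct = FALSE)

print(with.correction$statistic)
print(no.correction$statistic)
```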
Caution: This P-value is for the 2-sided test; we would have to divide it by 2 to get the same answer we
got when we coded it ourselves. In this case, it’s essentially zero, so it doesn’t matter much.
What changes when the contingency table is not (2 × 2) but some other more general (r × k) table?
For any (r × k) contingency table, the degrees of freedom is given by the formula df = (r - 1)(k - 1).
Otherwise, all other steps are identical.
We know that we can make a table from separate columns of a data frame using the table()
function, but what if we are not given the raw data and instead only given the already summarized
data in the bivariate frequency table/contingency table?
Suppose we have a table that looks like (ignoring the row and column names for now)
                Burrito
Salsa    Beef   Bean   Cheese
Hot        42     10       27
Mild        9     39       13
Recall that if we want to input a vector into R, we use the c() function:
x <- c(1,2,3,4)
print(x)
[1] 1 2 3 4
Let’s explore how to use the matrix() function with c() to enter a table.
matrix(c(1,2,3,4))
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
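# This makes a 2 x 2 matrix, but R fills it column by column. If we want the
# first row to be [1 2], then this would be incorrect.
matrix(c(1,2,3,4),nrow=2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
# Setting byrow = TRUE fills the matrix row by row instead.
matrix(c(1,2,3,4),nrow=2,byrow=TRUE)
     [,1] [,2]
[1,]    1    2
[2,]    3    4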
Now that we have obtained the table that we wanted, we can customize it a little bit.
y <- c(42,10,27,9,39,13)
lunch.data <- matrix(y, nrow = 2, byrow = TRUE)
print(lunch.data)
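One customization is to attach row and column labels so that the matrix reads like the salsa-by-burrito table. This is a sketch; the label text is taken from the table above, and the use of `dimnames()` and `addmargins()` is a suggestion rather than the lecture's own code.

```r
y <- c(42, 10, 27, 9, 39, 13)
lunch.data <- matrix(y, nrow = 2, byrow = TRUE)

# Label rows and columns to match the salsa-by-burrito table
dimnames(lunch.data) <- list(Salsa   = c("Hot", "Mild"),
                             Burrito = c("Beef", "Bean", "Cheese"))
print(lunch.data)

# addmargins() appends row and column totals for a quick check of the entry
print(addmargins(as.table(lunch.data)))
```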
Note that there are other ways to do the above; they all work well.
Now that we have our table, we can finally run our chi-square test on this table.
chisq.test(lunch.data)
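Storing the result makes it easy to inspect the pieces of the test, including the expected counts needed for the sample size condition. A minimal sketch, with `res` as an illustrative name:

```r
lunch.data <- matrix(c(42, 10, 27, 9, 39, 13), nrow = 2, byrow = TRUE)

res <- chisq.test(lunch.data)

print(res$parameter)            # df = (2 - 1)(3 - 1) = 2
print(round(res$expected, 2))   # check that every expected count is >= 5
print(res$p.value)
```

Because this table is 2 × 3 rather than 2 × 2, no continuity correction is involved, so the statistic agrees with the hand formula.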