
CMDA-2006: Integrated Quantitative Science II

Statistics Lecture 9
Chi-Square Goodness of Fit Tests for Two-Way Tables

Dr. Christian Lucero

Virginia Tech
Categorical Data - Relationships

▶ We worked with categorical data in the previous lecture and introduced the Chi-square
Goodness-of-fit test.

▶ We did tests for the population proportions based upon the observed counts (or observed
frequencies) within each category.

▶ We are going to extend our methods to the more advanced scenario where we have multiple
variables that may (or may not) be dependent on each other.

▶ Now, instead of a single column of counts, we obtain a contingency table.

2 / 49
An Example from Wikipedia

A sample of 100 individuals was drawn at random from the population. The individuals were asked
about their handedness, and the sex of each participant was also recorded.

Handedness
Right-handed Left-handed Total
Males 43 9 52
Sex
Females 44 4 48
Totals 87 13 100

The column and row totals are called marginal frequencies.

The row totals are marginal frequencies for the sex categories.

The column totals are the marginal totals for the “handedness” category.

3 / 49
We could just use the marginals to study the 1-way categories for Goodness of Fit. *Hint* for some
homework problems.

Instead we are interested in the probabilities implied by the cells.

We will start with 2 × 2 contingency tables and generalize to (r × k) contingency tables.

Let’s recall conditional probability.

Pr{E|C} = Probability that event E occurs given that event C has already occurred.

We can ask the question, what proportion of the population is a “lefty” given that they are male?

So Pr{L|M} =?

Of course, we estimate this using P̂r{L|M}.

4 / 49
P̂r{L|M} = (# of left-handed males) / (# of males) = 9/52

Of course, an individual does not have to be male to be a lefty.

P̂r{L|F} = 4/48

So maybe a more interesting question is:

Is the proportion of “lefties” in a population different between men and women?

5 / 49
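As an aside, the two conditional-probability estimates above are simple cell-count ratios, so they are easy to check by hand. A tiny Python sketch (ours, not the lecture's R):

```python
# Quick numeric check of the estimated conditional probabilities;
# the counts come straight from the handedness table.
left_male, males = 9, 52
left_female, females = 4, 48

p_left_given_male = left_male / males        # estimate of Pr{L|M}
p_left_given_female = left_female / females  # estimate of Pr{L|F}

print(round(p_left_given_male, 4), round(p_left_given_female, 4))
# → 0.1731 0.0833
```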
We could test the hypothesis in the following manner.

H0 : Pr{L|M} = Pr{L|F}

or equivalently
H0 : Pr{L|M} = Pr{L|Mᶜ}
Recall that Mᶜ means the complement of M, interpreted as “not M”. We can do this because sex is a
dichotomous variable.

The alternative for “equality of proportions” is the two-sided/non-directional alternative.

HA : Pr{L|M} ̸= Pr{L|F}

The method that we use is essentially the Chi-square goodness-of-fit test that we discussed in the
previous lecture.

6 / 49
The Chi-square statistic from the previous lecture was

χ²_s = Σ_{i=1}^{k} (o_i − e_i)² / e_i

where k is the number of cells (categories) in the table.

We modify this test statistic just slightly (or we are careful about how we use it in the older form):

χ²_s = Σ_{all i,j} (o_ij − e_ij)² / e_ij

So now we just need to make sure that we understand what oij and eij are.

7 / 49
For our example, the interior consists of cells that contain our observed frequencies. We define

        R    L    Total
M      43    9     52
F      44    4     48
Total  87   13    100

⇒  O = | o11  o12 | = | 43  9 |
       | o21  o22 |   | 44  4 |

these are the observed frequencies.

What is different now is how we calculate the expected frequencies corresponding to H0 . (The theory
is discussed in the textbook.)

Goal: We need to make a new table for the expected frequencies.

E = | e11  e12 |
    | e21  e22 |

8 / 49
Computing Expected Frequencies
Each individual expected frequency is found by the formula:

e_ij = (row i total) × (column j total) / Grand Total

e11 = (row 1 total) × (column 1 total) / Grand Total = (52)(87)/100 = 45.24

e21 = (row 2 total) × (column 1 total) / Grand Total = (48)(87)/100 = 41.76

e12 = (row 1 total) × (column 2 total) / Grand Total = (52)(13)/100 = 6.76

e22 = (row 2 total) × (column 2 total) / Grand Total = (48)(13)/100 = 6.24

9 / 49
Therefore,

O = | 43  9 |   and   E = | 45.24  6.76 |
    | 44  4 |             | 41.76  6.24 |

So to find the test statistic


χ²_s = Σ_{all i,j} (o_ij − e_ij)² / e_ij

χ²_s = (43 − 45.24)²/45.24 + (9 − 6.76)²/6.76 + (44 − 41.76)²/41.76 + (4 − 6.24)²/6.24 = 1.7774

The degrees of freedom is always df = 1 for 2 × 2 tables.

Our P-value is computed using:

P-value = P{χ²_df > χ²_s} = P{χ²_1 > 1.7774} = 0.1825, using R.

The command used: 1 - pchisq(1.7774, 1) or pchisq(1.7774, 1, lower.tail=F)

10 / 49
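The whole slide's calculation can also be reproduced outside R. Here is a minimal Python sketch (ours, standard library only) that rebuilds the expected table, the statistic, and the df = 1 P-value, using the identity P{χ²_1 > x} = erfc(√(x/2)):

```python
# Hand-rolled chi-square test for the handedness table (a cross-check,
# not the lecture's R code).
import math

obs = [[43, 9],   # males:   right-handed, left-handed
       [44, 4]]   # females: right-handed, left-handed

row = [sum(r) for r in obs]          # row totals: 52, 48
col = [sum(c) for c in zip(*obs)]    # column totals: 87, 13
grand = sum(row)                     # grand total: 100

# e_ij = (row i total)(column j total) / grand total
exp_ = [[row[i] * col[j] / grand for j in range(2)] for i in range(2)]

chi2 = sum((obs[i][j] - exp_[i][j]) ** 2 / exp_[i][j]
           for i in range(2) for j in range(2))

# For df = 1, the chi-square survival function reduces to erfc(sqrt(x/2)),
# so no chi-square table (or R's pchisq) is needed.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 4), round(p_value, 3))   # → 1.7774 0.182
```

The result matches the slide: χ²_s ≈ 1.7774 with P-value ≈ 0.1825, so we would not reject H0 at the 5% level.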
Example 1

Experimental studies of cancer often use strains of animals that have a naturally high incidence of
tumors.

In one such experiment, tumor-prone mice were kept in a sterile environment; one group was kept
germ free, while the other was exposed to E. Coli.

Mice with Liver Tumors


Treatment Total # of Mice Number Percent
Germ Free 49 19 39%
E. Coli 13 8 62%

11 / 49
Let “p” denote the probability of a tumor. Let subscript “1” denote the germ free treatment, and “2”
denote the E. Coli treatment.

Then

p̂1 = P̂r{Tumor | Germ-Free} = 19/49 ≈ 0.39

p̂2 = P̂r{Tumor | E. Coli} = 8/13 ≈ 0.62

We can make a contingency table:

Germ-Free E. Coli Total


Tumors 19 8 27
No Tumors 30 5 35
Total 49 13 62

12 / 49
(a) How strong is the evidence that tumor incidence is higher in mice exposed to E. Coli? Test at the
5% level.

H0 : p1 = p2 ( E. Coli does not affect tumor incidence.)


or
(The proportion of mice with tumors is the same between those
who were exposed to E. Coli and those who weren’t.)

HA : p1 < p2 ( E. Coli increases tumor incidence.)


or
(The proportion of mice with tumors is higher in mice who were
exposed to E. Coli versus those in a sterile environment.)

13 / 49
The expected table uses:  e_ij = (row i total)(column j total) / grand total

e11 = (27)(49)/62 = 21.34

e12 = ( )( )/62 = ?

e21 = ( )( )/62 = ?

e22 = ( )( )/62 = ?

Observed (Expected)

Germ-Free E. Coli Total


Tumors 19 (21.34 ) 8 (5.66 ) 27
No Tumors 30 (27.66 ) 5 (7.34 ) 35
Total 49 13 62
14 / 49
Our test statistic is

χ²_s = (19 − 21.34)²/21.34 + (8 − 5.66)²/5.66 + (30 − 27.66)²/27.66 + (5 − 7.34)²/7.34 = 2.17

The non-directional P-value is computed using

P-value = P{χ²_df > χ²_s} = P{χ²_1 > 2.17}

We can use the Chi-square table from the textbook/Canvas. We see that our test statistic is between the
following critical values:

χ²_{1,0.20} = 1.64 and χ²_{1,0.10} = 2.71

However, since our test is directional: 0.05 < P-value < 0.10. (Just like we did in the previous
lecture, we divide the non-directional P-value by 2 for directional tests.)

We do not reject H0 with P-value > 0.05 (P-value = 0.0703 exactly). There is insufficient evidence to
conclude that E. Coli increases tumor incidence.

15 / 49
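Example 1 can be re-computed the same way. A hedged Python sketch (ours; the lecture works from R and the chi-square table), including the halving step for the directional alternative:

```python
# Directional 2x2 chi-square test for the tumor data.
import math

obs = [[19, 8],   # tumors:    germ-free, E. coli
       [30, 5]]   # no tumors: germ-free, E. coli

row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
grand = sum(row)   # 62

exp_ = [[row[i] * col[j] / grand for j in range(2)] for i in range(2)]
chi2 = sum((obs[i][j] - exp_[i][j]) ** 2 / exp_[i][j]
           for i in range(2) for j in range(2))

# Two-sided P-value for df = 1, then halved for HA: p1 < p2
# (valid here because the observed difference is in the direction of HA).
p_two_sided = math.erfc(math.sqrt(chi2 / 2))
p_directional = p_two_sided / 2

print(round(chi2, 2), round(p_directional, 2))   # → 2.17 0.07
```

With the directional P-value above 0.05, this agrees with the slide's conclusion: we do not reject H0 at the 5% level.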
Independence and Association in 2 × 2 Contingency Tables

There are two viewpoints, or contexts, in which we can use a 2 × 2 contingency table.
1. Two independent samples with a dichotomous observed variable.
Example: Two groups, those who receive treatment 1 and those who receive treatment 2.
▶ Individuals must be randomly assigned to the treatments.
▶ The observed variable is the success (or failure) of each treatment.

2. One sample with two dichotomous observed variables.

Example: In a sample we can observe two aspects of each experimental unit.

Sex = { male or female } HIV Status = { Positive or Negative }

The cell counts consist of the frequency of occurrence of the intersection of the categories, e.g.
the number of HIV-positive males in a study.

16 / 49
Recall last semester, where we discussed some rules of probability.

In particular, we saw contingency tables and we calculated things like conditional probabilities.

We could also determine if two events were independent.

Now we are going to do something more sophisticated.

When a data set is viewed as a single sample with two observed variables, then the Chi-square test can
be used as a “test of independence” or a “test of association”.

* Read the textbooks carefully for the deeper theory.

Look at the following example as a training guide; the most difficult aspect is making sure that your
verbal description of association is correct and not ambiguous.

17 / 49
Example 2

Men with prostate cancer were randomly assigned to undergo surgery (n = 347) or “watchful waiting”
(no surgery, n = 348).

Over the next several years there were 83 deaths in the first group and 106 deaths in the second group.

Treatment
Surgery (S) Watchful Waiting (WW) Total
Died (D) 83 106 189
Survival
Alive (A) 264 242 506
Total 347 348 695

18 / 49
a.) Let D and A represent died and alive, respectively, and let S and WW represent surgery and
watchful waiting.

Estimate Pr{D|S}, the “probability a patient dies given that they had surgery”, and Pr{D|WW}, the
“probability a patient dies given that they did not have surgery”.

P̂r{D|S} = 83/347 = 0.239   and   P̂r{D|WW} = 106/348 = 0.3046

19 / 49
We can also estimate these other conditional probabilities:

The “probability that a patient had surgery, given that a patient died”:

P̂r{S|D} = 83/189 = 0.4392.

The “probability that a patient had surgery, given that a patient did not die”:

P̂r{S|A} = 264/506 = 0.5217.

20 / 49
b.) The value of the contingency table Chi-square statistic for this data is χ2s = 3.75. Test for a
relationship between the treatment and survival. Use a non-directional alternative and let α = 0.05.

H0 : There is no relationship between treatment and survival.

(or treatment and survival are independent)

(furthermore, the probability of death is the same for surgery as it is for watchful waiting).

HA : There is a relationship between treatment and survival.

(or treatment and survival are associated {or dependent})

(furthermore, the probability of death may depend on whether surgery was performed or
not).

21 / 49
Each of the following is a valid statement of H0 using proportions.
1. H0 : Pr{D|S} = Pr{D|WW}
2. H0 : Pr{A|S} = Pr{A|WW}
3. H0 : Pr{S|D} = Pr{S|A}
4. H0 : Pr{WW|D} = Pr{WW|A}

An example of a statement that does not describe a correct H0 is


H0 : Pr{D|S} = Pr{A|S}.

All of the statements (1-4) are equivalent, so let’s pick one.

H0 : Pr{D|S} = Pr{D|WW}

22 / 49
Treatment
S WW Total
D 83 106 189
Status
A 264 242 506
Total 347 348 695

H0 : Pr{D|S} = Pr{D|WW} vs HA : Pr{D|S} ̸= Pr{D|WW}

We estimate the left-hand side using

P̂r{D|S} = 83/347 = 0.239

and we estimate the right-hand side using

P̂r{D|WW} = 106/348 = 0.3046
Using these observations, can we conclude that the proportions are statistically significantly different?

23 / 49
Notice that we are doing this column-wise; does this mean we are ignoring the possibility of an
association (dependency) in a row-wise manner? (Read the book for further detail.)

As it turns out, if

            total
a      b    a+b
c      d    c+d
total: a+c  b+d

then

a/c = b/d   if and only if   a/b = c/d

additionally

a/(a+c) = b/(b+d)   if and only if   a/(a+b) = c/(c+d)

24 / 49
Using our test statistic, χ²_s = 3.75 with df = 1, we look at Table 9, which gives

χ²_{1,0.10} = 2.71 and χ²_{1,0.05} = 3.84

So 0.05 < P-value < 0.10. Therefore we fail to reject H0 at the 5% level.

We lack statistical evidence to conclude that treatment and survival are related (P-value = 0.0528).

25 / 49
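The χ²_s = 3.75 given in part (b) can be verified directly from the table. A Python cross-check (ours, not the slides' R):

```python
# Chi-square statistic and non-directional P-value for the
# surgery vs. watchful-waiting table.
import math

obs = [[83, 106],    # died:  surgery, watchful waiting
       [264, 242]]   # alive: surgery, watchful waiting

row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
grand = sum(row)   # 695

exp_ = [[row[i] * col[j] / grand for j in range(2)] for i in range(2)]
chi2 = sum((obs[i][j] - exp_[i][j]) ** 2 / exp_[i][j]
           for i in range(2) for j in range(2))

# df = 1: P{chi2_1 > x} = erfc(sqrt(x/2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), round(p_value, 3))   # → 3.75 0.053
```

This reproduces the statistic and a P-value of about 0.053, consistent with the slide's bracketing 0.05 < P-value < 0.10 and the reported P-value ≈ 0.0528.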
The Chi-square test and the z-test for Proportions
It’s clear by now that the chi-square test can be used to compare the proportions of success in 2
groups. For example, consider data that is summarized in a 2 × 2 table.

Success Failure
Group 1
Group 2

We have the option to conduct a chi-square test with df = 1 for this 2 × 2 table.

We should know that testing the null hypothesis

H0 : there is no relationship between the row and column variables

is the same as testing

H0 : p1 = p2

where p1 and p2 are the proportion of successes in each group.

26 / 49
Recall however, that we can use a z-test to carry out the following hypothesis tests with the following
null (see Chapter 20).

H0 : p1 = p2

Furthermore, we can obtain confidence intervals, and conduct directional tests with each using z-test
methods.

Fact: The Chi-square test statistic, χ2s , is the square of the z-statistic, zs , and the P-value for χ2s is
exactly the same as the two-sided P-value for zs .

For simple cases, the choice of which method to use is up to you. However, the z-test method is often
easier for directional tests and the appropriate method to use when you want confidence intervals for
the difference in the population proportions.

27 / 49
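The Fact above can be checked numerically. In this sketch (ours, not from the slides) we use the Example 2 counts and the pooled two-proportion z statistic, z = (p̂1 − p̂2)/√(p̄(1 − p̄)(1/n1 + 1/n2)), where p̄ is the pooled proportion:

```python
# Check that the pooled two-proportion z statistic squared equals the
# 2x2 Pearson chi-square statistic, using the Example 2 data.
import math

x1, n1 = 83, 347    # deaths, surgery group
x2, n2 = 106, 348   # deaths, watchful-waiting group

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se   # pooled z statistic for H0: p1 = p2

# Pearson chi-square from the same 2x2 table
obs = [[x1, x2], [n1 - x1, n2 - x2]]
row = [sum(rw) for rw in obs]
col = [sum(c) for c in zip(*obs)]
grand = sum(row)
exp_ = [[row[i] * col[j] / grand for j in range(2)] for i in range(2)]
chi2 = sum((obs[i][j] - exp_[i][j]) ** 2 / exp_[i][j]
           for i in range(2) for j in range(2))

print(round(z * z, 3), round(chi2, 3))   # → 3.754 3.754
```

The two numbers agree (up to floating-point error), illustrating that χ²_s = z²_s for a 2 × 2 table.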
The r × k Contingency Table

Instead of dichotomous variables, let us consider a larger and more general situation.

We are going to focus on the case that we have a number of categories, say r categories for variable 1
and k categories for variable 2.

Instead of a 2 × 2 contingency table we now have an r × k table.

The table will have r rows and k columns.

Therefore the context of this scenario is that we have:


▶ One sample; two categorical observed variables - one with k categories and one with r categories.

28 / 49
Another context that we can use is that we have k independent samples, with an observed variable
with r categories.

Sample 1 ··· Sample k


Category 1
..
.
Category r

For example, in this context we could have k independent samples of patients, where the observations
record which of r health-related categories each patient falls into.

As we did with 2 × 2 tables, the test statistic is still calculated the same way but we have more cells.

χ²_s = Σ_{all cells} (o_ij − e_ij)² / e_ij ,   where   e_ij = (row i total)(column j total) / grand total

29 / 49
The only real change other than the number of cells is

df = (r − 1)(k − 1)

Note that for 2 × 2, r = 2, k = 2, so df = 1 which is the same as we have always used!

Finally, the most extreme change is in the number of items in the null hypothesis.

To understand this last change, let’s look at an example.

30 / 49
Example: At an assembly plant for light trucks, routine monitoring of the quality of welds yields the
following data:

Number of Welds
High Moderate Low
Quality Quality Quality Total
(H) (M) (L)
(D) Day Shift 467 191 42 700
(E) Evening Shift 445 171 34 650
(N) Night Shift 254 129 17 400
Total 1166 491 93 1750

Can you conclude that the quality varies among shifts?

One acceptable null hypothesis is


 
Pr{D|H} = Pr{D|M} = Pr{D|L}
H0 : Pr{E|H} = Pr{E|M} = Pr{E|L}
 
Pr{N|H} = Pr{N|M} = Pr{N|L}

31 / 49
Another acceptable null hypothesis is

H0 : Pr{H|D} = Pr{H|E} = Pr{H|N}
     Pr{M|D} = Pr{M|E} = Pr{M|N}
     Pr{L|D} = Pr{L|E} = Pr{L|N}

Why so complicated?

Before, in the 2 × 2 case, suppose we had categories Success (S) and Failure (F) for variable 1 and,
say, categories A and B for variable 2.

Then one of the acceptable null hypotheses looked like:

H0 : Pr{S|A} = Pr{S|B}
     Pr{F|A} = Pr{F|B}

We did not have to write the second line because it was implied.

32 / 49
Getting back to our current problem, we can easily show that the table of expected values under the
null hypothesis is:

Expected Values

High Moderate Low


Quality Quality Quality
(H) (M) (L)
(D) Day Shift 466.40 196.40 37.20
(E) Evening Shift 433.086 182.371 34.543
(N) Night Shift 266.514 112.229 21.257

▶ The test statistic is χ2s = 5.760

▶ The degrees of freedom: df = (3 − 1)(3 − 1) = 4.

▶ Finally, by the table we can show that the P-value > 0.10. R gives an exact P-value = 0.218.

We conclude that there is no statistical evidence that weld quality differs among the shifts (P-value =
0.218).

33 / 49
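The full 3 × 3 computation above can be reproduced in a few lines. A Python sketch (ours, standard library only); for even df the chi-square survival function has the closed form exp(−x/2) Σ_{j<df/2} (x/2)^j/j!, which we use instead of a table:

```python
# r x k chi-square test for the weld-quality data.
import math

obs = [[467, 191, 42],   # day shift:     high, moderate, low quality
       [445, 171, 34],   # evening shift
       [254, 129, 17]]   # night shift

r, k = len(obs), len(obs[0])
row = [sum(rw) for rw in obs]
col = [sum(c) for c in zip(*obs)]
grand = sum(row)   # 1750

exp_ = [[row[i] * col[j] / grand for j in range(k)] for i in range(r)]
chi2 = sum((obs[i][j] - exp_[i][j]) ** 2 / exp_[i][j]
           for i in range(r) for j in range(k))

df = (r - 1) * (k - 1)   # = 4

def chi2_sf_even_df(x, df):
    """P{chi2_df > x} for even df: exp(-x/2) * sum_{j<df/2} (x/2)^j / j!"""
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

p_value = chi2_sf_even_df(chi2, df)
print(round(chi2, 3), df, round(p_value, 3))   # → 5.76 4 0.218
```

This reproduces the slide's χ²_s = 5.760, df = 4, and exact P-value of 0.218.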
Applicability of Methods

Conditions for the Validity of the Chi-Square Test.

1. Design Conditions. The data must be able to be viewed in one of two ways.

1.1 As two or more independent random samples, observed with respect to a categorical
variable.

1.2 As one random sample, observed with respect to two categorical variables.

In either case, the observations must be independent of each other.

2. Sample size conditions. For the Chi-square test, the expected values must all be ≥ 5.

34 / 49
3. Form of H0 . While the form of H0 can be somewhat complex, the most generic form of H0 is:

H0 : The row variable and the column variable are independent.

4. Understanding the scope of the inference.

If the data arise from experiments with random assignments, we can begin to make causal
inferences.

However, if the data come from purely observational studies, then we can only infer that any
observed association implied by a small P-value is not due to chance.

The underlying cause(s) are confounded and cannot be readily identified.

35 / 49
Practice

A group of patients with a binge-eating disorder were randomly assigned to take either the
experimental drug fluvoxamine or a placebo in a nine-week long double-blind clinical trial. At the end
of the trial the condition of each patient was classified into one of four categories: no response,
moderate response, marked response, or remission. The following table shows a cross classification of
the data. Is there statistically significant evidence, at the 0.10 level, to conclude that there is an
association between treatment group (fluvoxamine versus placebo) and condition? Show all necessary
work, state the hypothesis statements, compute the test statistic, P-value, and provide a concluding
statement in the context of this problem.

No Moderate Marked
Response Response Response Remission Total
Fluvoxamine 15 7 13 15
Placebo 23 7 8 11
Total

36 / 49
Using R for Tests of Association / Independence of Two Variables:
Chi-square contingency tests for an (r × k) table

In class and the book, you learned how to conduct a hypothesis test for the difference of population
proportions by using the Chi-square goodness-of-fit test.

The general formulation of the test relies on you constructing an (r × k) contingency table. You then
use the marginal frequencies and the expected frequencies to carry out the chi-square test.

Real data do not end up in those types of contingency tables on their own; we must build the tables
ourselves.

Let’s use the victims data that you loaded at the very beginning of the assignment and consider an
example of how to do hypothesis tests for (2 × 2) contingency tables.

The victims data set is from the National Crime Victimization Survey from 1996-2005.

The data are found in assaultvictims.csv

37 / 49
We can express categorical data a number of ways, using names as well as indicators (i.e. numbers,
such as 1=“low”, 2=“medium”, 3=“high”).

A short summary of the variables is as follows:

The year and the victim’s age are self-explanatory.

MSA is the location where the incident occurred. To view the categories of the MSA variable, use the
following function

levels(victims$MSA)

[1] "Rural" "Suburban" "Urban"

levels(victims$Police)

[1] "No Police" "Police"

Police is a categorical variable with categories (Police, No Police) which indicates that the incident was
reported to the police or not.

ER is a categorical variable with categories (ER, No ER) which indicates that the victim received
treatment at the ER or not.
38 / 49
Stranger is a categorical indicator variable which indicates the offender was a stranger (indicated with
a 1), or not (indicated with a 0).

Private is a categorical indicator meaning that the location was private (indicated with a 1), or public
(indicated by a 0).

Income is categorical (lowest, low, middle, high), indicated by (1,2,3,4) respectively.

For now, we will focus on the ER and Police categories.

Suppose that we hypothesize that victims who call the police go to the ER more often than those who
don’t.
Then the hypotheses would look like the following:

H0 : Pr( ER | Police ) = Pr( ER | No Police )


HA : Pr( ER | Police ) > Pr( ER | No Police )

First, let’s do this by hand (the long way); we first need to construct a contingency table.

39 / 49
To construct a table of absolute frequencies in R we will use the following function

tbl1 <- table(victims$ER, victims$Police)

# This is how we create r x k tables using real data.


# To view the table
tbl1

No Police Police
ER 95 675
No ER 2201 2532

# To get the row sums of this table

rS <- rowSums(tbl1)

rS

ER No ER
770 4733

40 / 49
# To get the column sums of this table

cS <- colSums(tbl1)

cS

No Police Police
2296 3207

# The overall sum is given by

totS <- sum(tbl1)

totS

[1] 5503

41 / 49
We now need the expected frequencies in a contingency table.

A clever way to generate this table is to use the following syntax:

expected.freqs.for.tbl1 <- (rS %*% t(cS))/totS

expected.freqs.for.tbl1

No Police Police
[1,] 321.2648 448.7352
[2,] 1974.7352 2758.2648

To construct the test statistic, we use the regular formula.

xs.test.tbl1 <- sum((tbl1 - expected.freqs.for.tbl1)^2
                    /expected.freqs.for.tbl1)

# Hence our Chi-squared test statistic, xs, for this (2x2) table is
xs.test.tbl1

[1] 317.9321

42 / 49
For a 2x2 table, the degrees of freedom is df = 1. To find the P-value, we simply use the command:

p.value.tbl1.test <- pchisq(xs.test.tbl1, df=1, lower.tail=FALSE)/2


# We divide by 2 because it's a 1-sided test
p.value.tbl1.test

[1] 2.043284e-71

Since our P-value is extremely tiny (it is essentially zero), we reject H0 .

We conclude that victims who call the police go to the ER more often than those who do not call the
police.

43 / 49
Chi-square tests for Tables in R (The Short Way)

Now, for what you’ve been waiting for, the short way.

If we already have our data in a table, then the short way simply uses the following function:

chisq.test(tbl1, correct = FALSE)

Pearson's Chi-squared test

data: tbl1
X-squared = 317.93, df = 1, p-value < 2.2e-16

Caution: This p-value is for the 2-sided test; we would have to divide it by 2 to get the same answer we
got when we coded it ourselves. In this case, it’s essentially zero so it doesn’t matter as much.

Note, the correct=FALSE option (which disables Yates’ continuity correction) only matters for 2 × 2 tables.

44 / 49
What changes when the contingency table is not (2 × 2) but some other more general (r × k) table?

For any (r × k) contingency table, the degrees of freedom is given by the formula df = (r - 1)(k - 1).
Otherwise, all other steps are identical.

We know that we can make a table from separate columns of a data frame using the table()
function, but what if we are not given the raw data and instead only given the already summarized
data in the bivariate frequency table/contingency table?

If we want to manually input a contingency table, we can do the following steps.

Suppose we have a table that looks like (ignoring the row and column names for now)

Burrito
Beef Bean Cheese
Hot 42 10 27
Salsa
Mild 9 39 13

Note that this is a (2 × 3) (nrows × ncolumns) table.

45 / 49
Recall that if we want to input an array into R, we use the following functions:

x <- c(1,2,3,4)
print(x)

[1] 1 2 3 4

Let’s explore how to use the matrix() function with c() to enter a table.

# This makes a column vector of length 4


matrix(c(1,2,3,4))

[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4

46 / 49
# this makes a 2 x 2 matrix but say we want the first row to be [1 2],
# then this would be incorrect
matrix(c(1,2,3,4),nrow=2)

[,1] [,2]
[1,] 1 3
[2,] 2 4

# this makes a 2 x 2 matrix using c(1,2,3,4) putting the values


# in left-to-right, a row at a time.
matrix(c(1,2,3,4),nrow=2,byrow=T)

[,1] [,2]
[1,] 1 2
[2,] 3 4

# So for our more complicated example above


matrix(c(42,10,27,9,39,13),nrow = 2,byrow = T)

[,1] [,2] [,3]


[1,] 42 10 27
[2,] 9 39 13

47 / 49
Now that we have obtained the table we wanted, we can customize it a little bit.

y <- c(42,10,27,9,39,13)
lunch.data <-matrix( y ,nrow = 2,byrow = T)
print(lunch.data)

[,1] [,2] [,3]


[1,] 42 10 27
[2,] 9 39 13

# Let's add names to the rows and columns


colnames(lunch.data) <- c("Beef","Bean","Cheese")
rownames(lunch.data) <- c("Hot","Mild")
print(lunch.data)

Beef Bean Cheese


Hot 42 10 27
Mild 9 39 13

Note there are other ways to do the above, all methods work pretty well.

48 / 49
Now that we have our table, we can finally run our chi-square test on this table.

chisq.test(lunch.data)

Pearson's Chi-squared test


data: lunch.data
X-squared = 41.793, df = 2, p-value = 8.41e-10

49 / 49
