Chapter 9 (1)
Chi Squared
Concepts: So far we have looked at statistical tests where the independent variable was
categorical and the dependent variable was numerical. The χ2 (chi squared) test allows
you to look for relationships when both variables are categorical. For example, you might
hypothesize that nationality will have an impact on people’s favorite colors because
people will be influenced by colors that are important to their nation (such as the colors
of their flag or the colors of their national sports teams). Nationality and favorite color
are categorical variables. It is more or less meaningless to say that your favorite color is
13, or that you are from the nation of 7, and even if you labeled all nations with numbers,
it would not be the case that nation number 5 was more of a nation than nation number 2.
As with the other tests we have used, a significant result means that there is a relationship
while a non-significant result means that we have no evidence that one variable
influences the other. So if we tested the hypothesis about nationality and favorite color
and we got a significant result, we could conclude that nationality does influence
someone’s choice of favorite color.
Application: In order to apply a χ2 test you need to arrange the data into a table where
one axis is the independent variable and the other is the dependent variable. The test is
symmetric so it doesn’t matter which you put where. Strictly fictional data for our
hypothesis above might look like this:
                          Red  Orange  Yellow  Green  Blue  Purple
United States of America  251     149     123    158   350      59
Australia                 231     102     240    263   203      49
The Netherlands           124     406      89    117   204      65
Germany                   306     189     257    145   132      57
Brazil                    134     113     163    351   223      64
Notice that the number of people surveyed in each country is different. This will not
affect the test.
Using our usual plan, the first two steps can be taken for granted. For the third step
we use the formula df = (#R-1)(#C-1), where #R is the number of rows and #C is the
number of columns. The example here has 5 rows and 6 columns, so it has
(5-1)(6-1) = 20 degrees of freedom. For the fifth step we will use the function
CHISQ.DIST.RT(χ2, df).
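The degrees-of-freedom formula can also be checked outside of Excel. The following is a minimal Python sketch; the function name degrees_of_freedom is our own choice for illustration and does not come from any library.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a chi squared test on an n_rows x n_cols table."""
    return (n_rows - 1) * (n_cols - 1)


# The example table has 5 countries (rows) and 6 colors (columns).
df = degrees_of_freedom(5, 6)
print(df)  # 20
```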
The first step of the calculation is to take the sum of each column, each row, and the
total of all of the data. This will result in a spreadsheet that looks like this:
Observed                   Red  Orange  Yellow  Green  Blue  Purple
United States of America   251     149     123    158   350      59   1090  R1
Australia                  231     102     240    263   203      49   1088  R2
The Netherlands            124     406      89    117   204      65   1005  R3
Germany                    306     189     257    145   132      57   1086  R4
Brazil                     134     113     163    351   223      64   1048  R5
                          1046     959     872   1034  1112     294   5317
                            C1      C2      C3     C4    C5      C6  Total
We will call this our ‘observed’ table since it is the table of our actual observations.
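For readers who want to check the sums without a spreadsheet, here is a short Python sketch of the same step. The variable names (observed, row_sums, col_sums, total) are our own; the numbers are the fictional survey counts from the table.

```python
# Observed counts from the fictional survey: one row per country,
# columns in the order Red, Orange, Yellow, Green, Blue, Purple.
observed = [
    [251, 149, 123, 158, 350, 59],  # United States of America
    [231, 102, 240, 263, 203, 49],  # Australia
    [124, 406, 89, 117, 204, 65],   # The Netherlands
    [306, 189, 257, 145, 132, 57],  # Germany
    [134, 113, 163, 351, 223, 64],  # Brazil
]

row_sums = [sum(row) for row in observed]        # R1 .. R5
col_sums = [sum(col) for col in zip(*observed)]  # C1 .. C6
total = sum(row_sums)                            # Total

print(row_sums)  # [1090, 1088, 1005, 1086, 1048]
print(col_sums)  # [1046, 959, 872, 1034, 1112, 294]
print(total)     # 5317
```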
Next we will generate a table of expected values. This is the table that would
have generated the sums for the rows and columns if there were absolutely no
relationship between the variables. That is what the table would look like if all of the
variation was due to distributions of the two variables without either variable influencing
the other. Notice the labels that are next to the sums on the observed table. We will use
these labels to represent the sums they are next to. Each cell in the table can be uniquely
identified by its row and column, for example the USA response to red is row 1, column
1 while the German response to blue is row 4, column 5. The expected value for each
cell is found by taking the product of the row sum and column sum that correspond with
that cell, and dividing the result by the total. So, the expected value for the row 1,
column 1 cell is R1xC1/T, where T is the total. It is useful here to remember that the ‘$’ sign in Excel fixes a
value, and that the letters and numbers of the cell labels can be fixed independently. This
allows us to write one expression in the first cell and then copy it over the rest of the
cells. In our example, the row sums are in the ‘H’ column and the column sums are in
the ‘7’ row, so in the first cell of our ‘expected’ table we can write ‘=$H2*B$7/$H$7’,
copy that expression to the other cells in the table and we will get the following table:
Expected                      Red   Orange   Yellow    Green     Blue   Purple
United States of America  214.433  196.598  178.762  211.973  227.963  60.2708
Australia                 214.039  196.237  178.434  211.584  227.545  60.1602
The Netherlands           197.711  181.267  164.822  195.443  210.186  55.5708
Germany                   213.646  195.876  178.106  211.195  227.127  60.0497
Brazil                     206.17  189.022  171.874  203.805  219.179  57.9485
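The expected table can be reproduced with a few lines of Python, which may help when checking a spreadsheet for copy errors. This is only a sketch of the row sum times column sum over total rule described above; the variable names are our own.

```python
# Observed counts: rows are countries, columns are
# Red, Orange, Yellow, Green, Blue, Purple.
observed = [
    [251, 149, 123, 158, 350, 59],  # United States of America
    [231, 102, 240, 263, 203, 49],  # Australia
    [124, 406, 89, 117, 204, 65],   # The Netherlands
    [306, 189, 257, 145, 132, 57],  # Germany
    [134, 113, 163, 351, 223, 64],  # Brazil
]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Expected value for each cell: (row sum) x (column sum) / total.
expected = [[r * c / total for c in col_sums] for r in row_sums]

print(round(expected[0][0], 3))  # USA, Red: 214.433
print(round(expected[3][4], 3))  # Germany, Blue: 227.127
```

Each row of the expected table adds up to the same row sum as the observed table, which is a convenient sanity check.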
Now we are ready to generate a χ2 table. The value we will compare to the critical value
will be the sum of all of the cells in this table. The values in those cells are generated by
inserting the values from the corresponding cells in the observed (O) and expected (E)
tables in the following expression:
(O - E)^2 / E
For example, the row 1, column 1 cell is (251 - 214.433)^2 / 214.433 = 6.2357.
In our example, the end of the spreadsheet will look like this:
                              Red   Orange   Yellow    Green     Blue   Purple
United States of America  6.23574  11.5237  17.3943  13.7427  65.3307   0.0268
Australia                 1.34395  45.2545  21.2421  12.4944  2.64761  2.07032
The Netherlands           27.4811  278.623  34.8801  31.4838  0.18207  1.59993
Germany                   39.9224  0.24139  34.9465  20.7476  39.8415  0.15488
Brazil                    25.2634  30.5752  0.45821  106.309   0.0666  0.63196

Chi Squared   872.7154
Probability   5.0131E-172
Since the probability is much smaller than .05, there is a real relationship in this fictional
data.
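The whole calculation can be sketched in Python as a cross-check on the spreadsheet. Two assumptions are worth flagging: the helper chi2_right_tail_even_df is our own stand-in for Excel's CHISQ.DIST.RT, and it uses a closed-form sum that is only valid when the degrees of freedom are even (as they are here, with df = 20).

```python
import math

# Observed counts: rows are countries, columns are
# Red, Orange, Yellow, Green, Blue, Purple.
observed = [
    [251, 149, 123, 158, 350, 59],  # United States of America
    [231, 102, 240, 263, 203, 49],  # Australia
    [124, 406, 89, 117, 204, 65],   # The Netherlands
    [306, 189, 257, 145, 132, 57],  # Germany
    [134, 113, 163, 351, 223, 64],  # Brazil
]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Sum (O - E)^2 / E over every cell of the table.
chi_squared = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / total  # expected value for this cell
        chi_squared += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_right_tail_even_df(x, df):
    """Right-tail chi squared probability; hypothetical helper, even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    term = 1.0       # (x/2)^k / k! for k = 0
    series = 1.0
    for k in range(1, df // 2):
        term *= half / k
        series += term
    return math.exp(-half) * series


p_value = chi2_right_tail_even_df(chi_squared, df)
print(df, chi_squared, p_value)
```

Run on the example data, this should give a χ2 value of about 872.7 and a probability on the order of 1E-172, in line with the spreadsheet. If SciPy is installed, scipy.stats.chi2.sf(chi_squared, df) handles odd degrees of freedom as well.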
Homework #9
Apply the χ2 test to each of the following data sets, and write a brief explanation of what
the results mean.
1) This data set is the incident rate for several types of crimes in a sample of states. The
numbers represent number of reports per 100,000 people.
               Murder  Rape  Robbery  Assault  Burglary  Larceny  Vehicle
Alabama           8.2  34.5    141.4    247.8     953.8     2650    288.3
Colorado          3.7  43.4     84.6    264.7     744.8   2735.2    559.5
Hawaii            1.9  26.9     78.5    147.8     767.9   3308.4    716.4
Kansas            3.7   384     65.3      280     689.2   2758.1    339.6
Massachusetts     2.7  27.1      119    308.1     541.1   1527.4    295.1
Montana           1.9  32.2     18.9    228.5     389.2     2543    210.7
New Mexico        7.4  54.1     98.7    541.9    1093.9   2639.9    414.5
Oklahoma          5.3  41.7       91    370.5      1006   2644.2    391.8
South Dakota      2.3  46.7     18.6    108.1     324.4   1343.7    108.4
Virginia          6.1  22.7     99.2    154.8     392.1     2035    211.1
2) This data set is from an experiment on ant navigation. The categories of the
independent variable are ants that are new to the foraging arena (recruits) and ants that
have been to the food source before (experienced). The categories of the dependent
variable are whether the ant went the direction indicated by the odor cue, the direction
indicated by the light cue, or simply returned to the entrance of the maze.
Light Odor Back
Recruited 6 43 2
Experienced 60 38 7
3) This data set represents the marital status of American women during each of the last
five censuses.
Women 1960 1970 1980 1990 2000
Married 66.7 61.3 55.9 53.4 52.1
Never Married 16.6 20.7 22.4 22.3 23.5
Sep/Divorced 4.9 5.9 9.4 11.9 13.2
Widowed 11.8 12.1 12.3 11.4 11.1
4) This data set is the same as the last except it is for American men.
Men 1960 1970 1980 1990 2000
Married 71.8 67.8 62.0 60.0 57.0
Never Married 20.9 25.0 28.3 28.1 29.2
Sep/Divorced 3.6 4.0 7.0 9.2 11.1
Widowed 3.6 3.1 2.6 2.7 2.7
5) One possible explanation of the great diversity of hair color in Europeans is that there
was a strong sexual selection pressure on women in the tundra environment that covered
Europe in the last ice age. One prediction of this hypothesis is that hair color
distributions should be different between men and women. Does the following sample
support this?
        Blonde  Light Brown  Brown  Black  Red  Gray
Men         12           19     40     24    1     5
Women       33           14     40      7    5     1
6) The same issues discussed in question 5 also apply to eye color. This data is a similar
sample of eye colors.
Blue Gray Green Hazel Brown Black
Men 30 3 12 14 35 5
Women 22 15 26 8 28 1