Chi-Square Distribution
Very few real-world observations follow a chi-square distribution. The main purpose of chi-square distributions is hypothesis testing, not describing real-world distributions.
In contrast, most other widely used distributions, like normal distributions or Poisson
distributions, can describe useful things such as newborns’ birth weights or disease cases
per year, respectively.
Relationship to the standard normal distribution
Chi-square distributions are useful for hypothesis testing because of their close relationship to
the standard normal distribution. The standard normal distribution, which is a normal
distribution with a mean of zero and a variance of one, is central to many important statistical
tests and theories.
Imagine taking a random sample of a standard normal distribution (Z). If you squared all the
values in the sample, you would have the chi-square distribution with k = 1.
\chi^2_1 = Z^2
Now imagine taking samples from two standard normal distributions (Z₁ and Z₂). If each time
you sampled a pair of values, you squared them and added them together, you would have
the chi-square distribution with k = 2.
\chi^2_2 = Z_1^2 + Z_2^2
More generally, if you sample from k independent standard normal distributions and then
square and sum the values, you’ll produce a chi-square distribution with k degrees of
freedom.
\chi^2_k = Z_1^2 + Z_2^2 + \dots + Z_k^2
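To see this construction concretely, here is a minimal Python sketch (it assumes NumPy and SciPy are available, which the text above does not mention) that sums k squared standard normal draws and compares the result against a chi-square distribution with k degrees of freedom:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
k = 3          # number of independent standard normal variables
n = 100_000    # number of simulated values

# Square k standard normal draws and sum them, n times over
z = rng.standard_normal(size=(n, k))
samples = (z ** 2).sum(axis=1)

# The simulated sums should match a chi-square distribution with k degrees of freedom
print(samples.mean(), stats.chi2.mean(df=k))   # both close to k
print(np.quantile(samples, [0.5, 0.95]))       # simulated quantiles
print(stats.chi2.ppf([0.5, 0.95], df=k))       # theoretical chi-square(k) quantiles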
Chi-square test statistics (formula)
Chi-square tests are hypothesis tests with test statistics that follow a chi-square
distribution under the null hypothesis. Pearson’s chi-square test was the first chi-square
test to be discovered and is the most widely used.
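For reference, the Pearson chi-square test statistic, which matches the (O − E)² / E calculation carried out step by step in the goodness of fit example later in this section, is

\chi^2 = \sum \frac{(O - E)^2}{E}

where O is an observed frequency and E is the corresponding expected frequency, summed over all categories.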
The mean (μ) of the chi-square distribution is its degrees of freedom, k. Because the chi-square distribution is right-skewed, the mean is greater than the median and mode. The variance of the chi-square distribution is 2k.
Property                  Value
Continuous or discrete    Continuous
Mean                      k
Mode                      k − 2 (when k > 2)
Variance                  2k
Standard deviation        √(2k)
Range                     0 to ∞
Symmetry                  Asymmetrical (right-skewed), but increasingly symmetrical as k increases
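These properties can be checked numerically. The sketch below is one way to do it, assuming SciPy and NumPy are installed and picking k = 10 arbitrarily:

import numpy as np
from scipy import stats

k = 10                      # an arbitrary choice of degrees of freedom
dist = stats.chi2(df=k)

print(dist.mean())          # k        -> 10.0
print(dist.var())           # 2k       -> 20.0
print(dist.std())           # sqrt(2k) -> about 4.47

# SciPy has no mode() for continuous distributions, so locate the density peak
# numerically; for k > 2 it should sit at k - 2.
x = np.linspace(0, 5 * k, 200_001)
print(x[np.argmax(dist.pdf(x))])   # about 8.0 (= k - 2)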
Example applications of chi-square distributions
The chi-square distribution makes an appearance in many statistical tests and
theories. The following are a few of the most common applications of the chi-square
distribution.
A chi-square (Χ²) goodness of fit test is a type of Pearson’s chi-square test. You can use it
to test whether the observed distribution of a categorical variable differs from your
expectations.
Example: Chi-square goodness of fit test You're hired by a dog food company to help them test three new dog
food flavors.
You recruit a random sample of 75 dogs and offer each dog a choice between the three flavors by placing
bowls in front of them. You expect that the flavors will be equally popular among the dogs, with about 25 dogs
choosing each flavor.
Once you have your experimental results, you plan to use a chi-square goodness of fit test to figure out
whether the distribution of the dogs’ flavor choices is significantly different from your expectations.
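As a minimal sketch of the expected counts for this design, using only the numbers stated above (75 dogs and three equally popular flavors):

# Expected counts for the dog food study: 75 dogs, three flavors assumed to be
# equally popular (flavor names taken from the tables later in this section).
n_dogs = 75
flavors = ["Garlic Blast", "Blueberry Delight", "Minty Munch"]

expected = {flavor: n_dogs / len(flavors) for flavor in flavors}
print(expected)   # 25.0 expected choices per flavor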
What is the chi-square goodness of fit test?
A chi-square (Χ²) goodness of fit test is a goodness of fit test for a categorical variable.
Goodness of fit is a measure of how well a statistical model fits a set of observations.
•When goodness of fit is high, the values expected based on the model are close to the
observed values.
•When goodness of fit is low, the values expected based on the model are far from the
observed values.
The statistical models analyzed by chi-square goodness of fit tests are distributions. They can range from something as simple as equal probability for all groups to something as complex as a probability distribution with many parameters, as sketched below.
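For instance, the expected frequencies for a goodness of fit test come directly from whatever distribution the model specifies. The sketch below is purely illustrative, using a hypothetical Poisson model and made-up numbers rather than anything from the text:

from scipy import stats

# Hypothetical example: expected counts for the categories "0, 1, 2, 3+ events"
# under a Poisson model with an assumed mean of 1.2, for 200 observations.
# All of these numbers are illustrative only.
n = 200
lam = 1.2

probs = [stats.poisson.pmf(count, lam) for count in range(3)]   # P(0), P(1), P(2)
probs.append(1 - sum(probs))                                    # P(3 or more)
expected = [n * p for p in probs]
print(expected)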
Hypothesis testing
You explain that your observations were a bit different from what
you expected, but the differences aren’t dramatic. They could be
the result of a real flavor preference or they could be due to
chance.
To put it another way: You have a sample of 75 dogs, but what you
really want to understand is the population of all dogs. Was this
sample drawn from a population of dogs that choose the three
flavors equally often?
Step 1: Create a table
Create a table with the observed and expected frequencies in two
columns.
Example: Step 1
Flavor Observed Expected
Garlic Blast 22 25
Blueberry Delight 30 25
Minty Munch 23 25
Step 2: Calculate O − E
Add a new column called “O − E”. Subtract the expected frequencies from the observed frequencies.
Example: Step 2
Flavor Observed Expected O−E
Garlic Blast 22 25 22 − 25 = −3
Blueberry Delight 30 25 5
Minty Munch 23 25 −2
Step 3: Calculate (O − E)²
Add a new column called “(O − E)²”. Square the values in the previous column.
Example: Step 3
Flavor Observed Expected O−E (O − E)²
Garlic Blast 22 25 −3 (−3)² = 9
Blueberry Delight 30 25 5 25
Minty Munch 23 25 −2 4
Step 4: Calculate (O − E)² / E
Add a final column called “(O − E)² / E”. Divide the squared differences by the expected frequencies.
Example: Step 4
Flavor Observed Expected O−E (O − E)² (O − E)² / E
Garlic Blast 22 25 −3 9 9/25 = 0.36
Blueberry Delight 30 25 5 25 1
Minty Munch 23 25 −2 4 0.16
Step 5: Calculate the test statistic Χ²
Add up the values in the final column to get the chi-square test statistic: Χ² = 0.36 + 1 + 0.16 = 1.52.
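Putting the five steps together, the following Python sketch (assuming NumPy and SciPy are installed) reproduces the table above and checks the result with scipy.stats.chisquare:

import numpy as np
from scipy import stats

observed = np.array([22, 30, 23])   # Garlic Blast, Blueberry Delight, Minty Munch
expected = np.array([25, 25, 25])   # equal popularity among the 75 dogs

# Steps 2-4: (O - E), (O - E)^2, and (O - E)^2 / E for each flavor
contributions = (observed - expected) ** 2 / expected
print(contributions)                # [0.36 1.   0.16]

# Step 5: sum the contributions to get the test statistic
print(contributions.sum())          # 1.52

# The same test in one call; with 3 - 1 = 2 degrees of freedom the p-value is
# about 0.47, so the observed choices are consistent with equal popularity.
stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)

With a p-value around 0.47, the data give no evidence that the flavors differ in popularity, which matches the informal reading above that the differences are not dramatic.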
Example: Chi-square test of independence
Question: Are gender and education level dependent at a 5% level of significance? In other words, given observed counts of gender and highest education level for a sample of individuals, is there a relationship between the gender of an individual and the level of education that they have obtained?
Formula for the expected frequency in each cell of the contingency table:
Expected frequency = (row total × column total) / sample size
The critical value of χ² with 3 degrees of freedom at the 5% significance level is 7.815. Since the calculated test statistic of 8.006 is greater than 7.815, we reject the null hypothesis and conclude that education level depends on gender at a 5% level of significance.
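The contingency table behind these numbers is not reproduced in this section, so the Python sketch below uses illustrative counts only to show the mechanics: scipy.stats.chi2_contingency applies the expected-frequency formula above, and scipy.stats.chi2.ppf gives the 7.815 critical value for 3 degrees of freedom.

from scipy import stats

# Hypothetical 2 x 4 contingency table (gender x education level). The original
# data table is not shown in this section, so these counts are illustrative.
observed = [
    [60, 54, 46, 41],   # e.g. counts for female respondents per education level
    [40, 44, 53, 57],   # e.g. counts for male respondents per education level
]

# chi2_contingency computes the expected counts with
# (row total * column total) / sample size, then the test statistic and p-value.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2_stat, p_value, dof)
print(expected)

# Critical value used above: chi-square with 3 degrees of freedom at the 5% level
print(stats.chi2.ppf(0.95, df=3))   # about 7.815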