Chapter 3 Test of Difference Between Means
Learning Outcomes:
At the end of the unit, the students must have:
1. discussed the conditions imposed by the different tests of difference between
means;
2. discussed the assumptions in ANOVA; and
3. performed each test and discussed the results.
Prepared by:
Prof. Jeanne Valerie Agbayani-Agpaoa
Statistical Methods I
Dr. Virgilio Julius P. Manzano, Jr.
Engr. Lawrence John C. Tagata
CHAPTER III: TEST OF DIFFERENCE BETWEEN MEANS
The z-test is used when the population standard deviation is known. If the population standard deviation
is unknown, a z-test is still applicable provided that the sample is sufficiently large. Sufficiently
large means a sample size of at least 30 (n ≥ 30) if the distribution of the variable is normal, and at least 50
(n ≥ 50) for any distribution.
Critical Values of z
For convenience, the critical values of z for commonly used levels of significance are
summarized in the table below. These values were taken from Table A.3: Areas under
the Normal Curve.
Level of        Confidence    Type of Test
Significance    Level         2-tailed    1-tailed, left    1-tailed, right
0.10            0.90          ±1.645      −1.282            +1.282
0.05            0.95          ±1.96       −1.645            +1.645
0.01            0.99          ±2.575      −2.326            +2.326
0.001           0.999         ±3.29       −3.08             +3.08
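The tabulated critical values can be cross-checked numerically. The sketch below is only an illustration: it uses Python's standard-library `statistics.NormalDist`, and the helper name `z_critical` is hypothetical (it does not come from the handout). Exact values may differ from a table lookup in the last digit (for example 2.576 instead of 2.575, and 3.090 instead of 3.08).

```python
from statistics import NormalDist


def z_critical(alpha, tail="two"):
    """Critical z value for significance level alpha.

    tail = "two" returns the positive member of the +/- pair,
    tail = "left" or "right" returns the signed one-tailed value.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        return std_normal.inv_cdf(1 - alpha / 2)
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)
    if tail == "left":
        return std_normal.inv_cdf(alpha)
    raise ValueError("tail must be 'two', 'left', or 'right'")


# Print one row per level of significance, mirroring the table layout.
for alpha in (0.10, 0.05, 0.01, 0.001):
    print(
        alpha,
        round(z_critical(alpha, "two"), 3),
        round(z_critical(alpha, "left"), 3),
        round(z_critical(alpha, "right"), 3),
    )
```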
The P-values and critical regions for these situations are shown in the figure above.
Example 1: A random sample of 100 recorded deaths in the United States during the past year showed an
average life span of 71.8 years. Assuming a population standard deviation of 8.9 years, does
this seem to indicate that the mean life span today is greater than 70 years? Use a 0.05 level
of significance.
Solutions:
Given: 𝜇 = 70 years
n = 100 recorded deaths
𝑥̅ = 71.8 years
𝜎 = 8.9 years
𝛼 = 0.05
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: The mean life span today is 70 years. 𝜇 = 70
Ha: The mean life span today is greater than 70 years. 𝜇 > 70
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
Since Ha is expressed in terms of “greater than” expression, it would be appropriate to use
a 1–tail, right test.
Step 4: Determine the tabular value for the test from the table above.
zcrit = +1.645.
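Step 5: Compute for the required statistical test.
\[ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{71.8 - 70}{8.9 / \sqrt{100}} \]
zcomp = 2.02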
Step 6: Decide.
Decision: Since the computed value of z is greater than the critical value, zcomp (2.02) > zcrit
(1.645), it falls in the rejection region; Ho is rejected.
The P-value corresponding to z = 2.02 is given by the area of the shaded region in the figure
below.
Using Table A.3, we have 𝑃 = 𝑃(𝑍 > 2.02) = 0.0217. As a result, the evidence in favor of Ha
is even stronger than that suggested by a 0.05 level of significance.
Step 7: Conclusion.
Since Ho was rejected, conclude:
“The mean life span today is greater than 70 years.”
Example 2: A manufacturer of sports equipment has developed a new synthetic fishing line that the
company claims has a mean breaking strength of 8 kilograms with a standard deviation of
0.5 kilogram. Test the hypothesis that μ = 8 kilograms against the alternative that μ ≠ 8
kilograms if a random sample of 50 lines is tested and found to have a mean breaking strength
of 7.8 kilograms. Use a 0.01 level of significance.
Solutions:
Given: 𝜇 = 8 kilograms
n = 50 lines
𝑥̅ = 7.8 kilograms
𝜎 = 0.5 kilograms
𝛼 = 0.01
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: The mean breaking strength of the newly-developed synthetic fishing line is 8
kilograms. 𝜇 = 8 𝑘𝑔
Ha: The mean breaking strength of the newly-developed synthetic fishing line is not
equal to 8 kilograms. 𝜇 ≠ 8 𝑘𝑔
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
Since Ha is expressed in terms of a non-directional expression (not equal to), it would be
appropriate to use a 2–tailed test.
STAT 201: Statistical Methods I
Step 4: Determine the tabular value for the test from the table.
zcrit = ±2.575.
Step 6: Decide.
Decision: Since the computed value of z is less than the lower critical value, zcomp (–2.83) < zcrit
(–2.575), it falls in the rejection region; Ho is rejected.
Step 7: Conclusion.
Since Ho was rejected, the data do not support the manufacturer’s claim that the
mean breaking strength of their newly-developed synthetic fishing line is 8
kilograms.
Conclude Ha: “The average breaking strength is not equal to 8 kilograms.”
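The two worked z-test examples can be verified with a few lines of code. This is a minimal sketch using only the Python standard library; the function name `one_sample_z` and its `tail` argument are hypothetical conveniences, not part of the handout.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(xbar, mu0, sigma, n, tail):
    """Return (z, P-value) for a one-sample z-test with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    cdf = NormalDist().cdf(z)  # area to the left of z
    if tail == "right":
        p_value = 1 - cdf
    elif tail == "left":
        p_value = cdf
    elif tail == "two":
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("tail must be 'two', 'left', or 'right'")
    return z, p_value


# Example 1: life span, Ha: mu > 70, alpha = 0.05
z1, p1 = one_sample_z(71.8, 70, 8.9, 100, "right")
# Example 2: fishing line, Ha: mu != 8, alpha = 0.01
z2, p2 = one_sample_z(7.8, 8, 0.5, 50, "two")

print(round(z1, 2), round(p1, 4), "reject Ho" if p1 < 0.05 else "fail to reject Ho")
print(round(z2, 2), round(p2, 4), "reject Ho" if p2 < 0.01 else "fail to reject Ho")
```

Comparing the P-value with α leads to the same decision as comparing zcomp with zcrit.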
For example, suppose that the current treatment for a disease cures 62% of all cases. A new treatment
method has been proposed and studied. In a sample of 80 subjects with the disease that were treated with the new
method, 63 were cured. Do the results of this study support the claim that the new method has a higher response
rate than the existing method?
This procedure calculates sample size and statistical power for testing a single proportion using either
the exact test or other approximate z-tests. Exact test results are based on calculations using the binomial (and
hypergeometric) distributions. Because the analysis of several different test statistics is available, their statistical
power may be compared to find the most appropriate test for a given situation.
This procedure has the capability for computing power using both the normal approximation and
binomial enumeration for all tests. Some sample size programs use only the normal approximation to the binomial
distribution for power and sample size estimates. The normal approximation is accurate for large sample sizes and
for proportions between 0.2 and 0.8, roughly. When the sample sizes are small or the proportions are extreme (i.e.
less than 0.2 or greater than 0.8) the binomial calculations are much more accurate.
Proportion where:
p is the parameter of the binomial distribution.
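To illustrate the difference between the exact binomial calculation and the normal approximation, the sketch below applies both to the disease example above (63 of 80 subjects cured, against a current cure rate of 62%). It assumes a one-sided alternative (the new method has a higher cure rate); the function names are hypothetical and only the Python standard library is used.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_upper_tail(x, n, p0):
    """Exact P(X >= x) when X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(x, n + 1))


def normal_approx_upper_tail(x, n, p0):
    """z statistic and approximate P(X >= x) from the normal approximation."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return z, 1 - NormalDist().cdf(z)


x, n, p0 = 63, 80, 0.62
p_exact = binomial_upper_tail(x, n, p0)
z, p_normal = normal_approx_upper_tail(x, n, p0)

print("z =", round(z, 2))
print("exact binomial P-value =", p_exact)
print("normal approximation P-value =", p_normal)
```

Both P-values fall well below 0.05 here, so the two approaches agree on the decision; for small samples or extreme proportions they can differ more noticeably.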
Example 1: A builder claims that heat pumps are installed in 70% of all homes being constructed today
in the city of Richmond, Virginia. Would you agree with this claim if a random survey of
new homes in this city showed that 8 out of 15 had heat pumps installed? Use a 0.10 level of
significance.
Solutions:
Given: 𝑝̂ = 8/15
p0 = 0.70
n = 15
𝛼 = 0.10
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: 70% of all homes being constructed today in the city of Richmond, Virginia are
installed with heat pumps. (𝑝 = 0.70)
Ha: The proportion of homes being constructed today in the city of Richmond,
Virginia installed with heat pumps is not equal to 70%. (𝑝 ≠ 0.70)
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
Since Ha is expressed in terms of non-directional expression “not equal to”, it would be
appropriate to use a 2–tailed test.
Step 4: Determine the tabular value for the test from the table above.
zcrit = ±1.645.
Step 5: Compute for the required statistical test.
\[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}} = \frac{\frac{8}{15} - 0.70}{\sqrt{\frac{(0.70)(1 - 0.70)}{15}}} \]
zcomp = −1.41
Step 6: Decide.
Decision: Since the computed value of z lies between the two critical values,
−1.645 < zcomp (−1.41) < +1.645, it falls in the non-rejection region; fail to reject Ho.
Step 7: Conclusion.
Since Ho was not rejected, conclude that there is insufficient reason to doubt the builder’s
claim.
“70% of all homes being constructed today in the city of Richmond, Virginia are installed
with heat pumps.”
Example 2: A commonly prescribed drug for relieving nervous tension is believed to be only 60%
effective. Experimental results with a new drug administered to a random sample of 100
adults who were suffering from nervous tension show that 70 received relief. Is this sufficient
evidence to conclude that the new drug is superior to the one commonly prescribed? Use a
0.05 level of significance.
Solutions:
Given: 𝑝̂ = 70/100
p0 = 0.60
n = 100
𝛼 = 0.05
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: There is no significant difference between the effectivity of the new and commonly
prescribed drugs. 𝑝 = 0.60
Ha: The new drug is superior to the one commonly prescribed. 𝑝 > 0.60
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
Since Ha is expressed in terms of a directional expression (greater than), it would be
appropriate to use a 1–tailed, right test.
Step 4: Determine the tabular value for the test from the table.
zcrit = +1.645.
Step 6: Decide.
Decision: Since the computed value of z is greater than the critical value, zcomp (+2.04) > zcrit
(+1.645), it falls in the rejection region; Ho is rejected.
Step 7: Conclusion.
Since Ho was rejected, conclude that the new drug is superior.
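The test statistic for a single proportion in the two examples above can be reproduced as follows. This is a sketch under the same normal approximation the worked solutions use; `one_proportion_z` is a hypothetical helper name, and the critical values are the tabulated ones.

```python
from math import sqrt


def one_proportion_z(successes, n, p0):
    """z statistic for testing a single proportion against p0."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


# Example 1: heat pumps, two-tailed, alpha = 0.10, zcrit = +/-1.645
z_heat = one_proportion_z(8, 15, 0.70)
reject_heat = abs(z_heat) > 1.645

# Example 2: new drug, right-tailed, alpha = 0.05, zcrit = +1.645
z_drug = one_proportion_z(70, 100, 0.60)
reject_drug = z_drug > 1.645

print(round(z_heat, 2), "reject Ho" if reject_heat else "fail to reject Ho")
print(round(z_drug, 2), "reject Ho" if reject_drug else "fail to reject Ho")
```

With n = 15 the normal approximation is rough, so an exact binomial calculation would be a reasonable cross-check for Example 1.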
Example 1: A nutrition teacher wants to compare the food values of the nutrition and dietetics students
with those of the engineering students. She constructed a questionnaire
composed of 15 items. The teacher administered the questionnaire to 75 engineering students
and obtained a mean of 3.98, while the 40 nutrition students had a mean of 4.12. If the
population standard deviation is 0.27, what conclusion can the nutrition teacher draw about
the food value of the students? Use 0.05 level of significance.
Solutions:
Given:    Engineering Students    Nutrition and Dietetics Students
𝑥̅         3.98                    4.12
𝜎         0.27                    0.27
𝑛         75 students             40 students
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: There is no significant difference in the food values of the nutrition and dietetics
students with those of the engineering students. 𝜇𝐸 − 𝜇𝑁𝐷 = 0 or 𝜇𝐸 = 𝜇𝑁𝐷
Ha: There is a significant difference in the food values of the nutrition and dietetics
students with those of the engineering students. 𝜇𝐸 − 𝜇𝑁𝐷 ≠ 0 or 𝜇𝐸 ≠ 𝜇𝑁𝐷
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
Since Ha is expressed in terms of a non-directional expression, it would be appropriate to
use a 2-tail test.
Step 4: Determine the tabular value for the test from the table above.
zcrit = ±1.96.
Step 6: Decide.
Decision: Since the computed value of z is less than the lower critical value, zcomp (–2.65) < zcrit
(–1.96), it falls in the rejection region; reject Ho.
Step 7: Conclusion.
Since Ho was rejected, conclude “There is a significant difference in the food values of the
nutrition and dietetics students with those of the engineering students.”
Example 2: A manufacturer claims that the average tensile strength of thread A exceeds the average
tensile strength of thread B by at least 12 kilograms. To test this claim, 50 pieces of each type
of thread were tested under similar conditions. Type A thread had an average tensile strength
of 86.7 kilograms with a standard deviation of 6.28 kilograms, while type B thread had an
average tensile strength of 77.8 kilograms with a standard deviation of 5.61 kilograms. Test
the manufacturer’s claim using a 0.05 level of significance.
Solutions:
Given: Thread A Thread B
𝑥̅ 86.7 kilograms 77.8 kilograms
𝜎 6.28 kilograms 5.61 kilograms
𝑛 50 pieces 50 pieces
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: The average tensile strength of thread A exceeds the average tensile strength of
thread B by at least 12 kilograms. (𝜇𝐴 − 𝜇𝐵 ) ≥ 12
Ha: The average tensile strength of thread A exceeds the average tensile strength of
thread B by less than 12 kilograms. (𝜇𝐴 − 𝜇𝐵 ) < 12
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
Since Ha is expressed in terms of a directional expression (less than), it would be appropriate
to use a 1–tailed, left test.
Step 4: Determine the tabular value for the test from the table.
zcrit = −1.645.
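Step 5: Compute for the required statistical test.
\[ z = \frac{(\bar{X}_A - \bar{X}_B) - (\mu_A - \mu_B)}{\sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}} = \frac{(86.7 - 77.8) - 12}{\sqrt{\frac{6.28^2}{50} + \frac{5.61^2}{50}}} \]
zcomp = −2.60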
Step 6: Decide.
Decision: Since the computed value of z is less than the critical value, zcomp (–2.60) < zcrit
(–1.645), it falls in the rejection region; Ho is rejected.
Step 7: Conclusion.
Since Ho was rejected, the data do not support the manufacturer’s claim that the
average tensile strength of thread A exceeds the average tensile strength of thread B by at
least 12 kilograms.
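Both two-sample z-test examples above follow the same computation, sketched below. The helper `two_sample_z` and its parameter `d0` (the hypothesized difference between the population means) are hypothetical names introduced for illustration.

```python
from math import sqrt


def two_sample_z(xbar1, xbar2, sigma1, sigma2, n1, n2, d0=0.0):
    """z statistic for the difference between two means (sigmas known)."""
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return ((xbar1 - xbar2) - d0) / standard_error


# Example 1: food values, engineering vs nutrition students, two-tailed
z_food = two_sample_z(3.98, 4.12, 0.27, 0.27, 75, 40)
reject_food = abs(z_food) > 1.96

# Example 2: thread A vs thread B, hypothesized difference of 12 kg, left-tailed
z_thread = two_sample_z(86.7, 77.8, 6.28, 5.61, 50, 50, d0=12)
reject_thread = z_thread < -1.645

print(round(z_food, 2), "reject Ho" if reject_food else "fail to reject Ho")
print(round(z_thread, 2), "reject Ho" if reject_thread else "fail to reject Ho")
```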
For example, suppose you want to compare two methods for treating cancer. Your experimental design
might be as follows. Select a sample of patients and randomly assign half to one method and half to the other.
After five years, determine the proportion surviving in each group and test whether the difference in the
proportions is significantly different from zero.
Suppose you have two populations from which dichotomous (binary) responses will be recorded. The
probability (or risk) of obtaining the event of interest in population 1 (the treatment group) is p1 and in population
2 (the control group) is p2. The corresponding failure proportions are given by q1 = 1 − p1 and q2 = 1 – p2.
𝑞1 = 1 − 𝑝1 and 𝑞2 = 1 − 𝑝2
𝑛1 and 𝑛2 are the number of cases or observation/sample sizes in each
group
Example 1: A vote is to be taken among the residents of a town and the surrounding county to determine
whether a proposed chemical plant should be constructed. The construction site is within the
town limits, and for this reason many voters in the county believe that the proposal will pass
because of the large proportion of town voters who favor the construction. To determine if
there is a significant difference in the proportions of town voters and county voters favoring
the proposal, a poll is taken. If 120 of 200 town voters favor the proposal and 240 of 500
county residents favor it, would you agree that the proportion of town voters favoring the
proposal is higher than the proportion of county voters? Use an α = 0.05 level of significance.
Solutions:
Given: Let T and C denote town voters and county voters, respectively.
120
p𝑇 =
200
240
p𝐶 =
500
𝑛 𝑇 = 200
𝑛𝐶 = 500
𝛼 = 0.05
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: There is no significant difference in the proportion of town voters favoring the
proposal and the proportion of county voters. (pT = pC )
Ha: The proportion of town voters favoring the proposal is higher than the proportion
of county voters. (pT > pC )
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
Since Ha is expressed in terms of a directional expression “greater than”, it would be
appropriate to use a 1–tail, right test.
Step 4: Determine the tabular value for the test from the table above.
zcrit = +1.645.
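Step 5: Compute for the required statistical test.
\[ z = \frac{p_T - p_C}{\sqrt{\frac{p_T q_T}{n_T} + \frac{p_C q_C}{n_C}}} = \frac{\frac{120}{200} - \frac{240}{500}}{\sqrt{\frac{\left(\frac{120}{200}\right)\left(\frac{80}{200}\right)}{200} + \frac{\left(\frac{240}{500}\right)\left(\frac{260}{500}\right)}{500}}} \]
zcomp = 2.91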
Step 6: Decide.
Rejection region
Decision: Since the computed value of z is greater than (figuratively, it lies within the
rejection region) the critical value Zcomp (+2.91) > Zcrit (+1.645), reject Ho.
Step 7: Conclusion.
Since Ho was rejected, agree to the claim that the proportion of town voters favoring the
proposal is higher than the proportion of county voters.
Example 2: Two hundred fifty AIDS victims during the year 2000 had only 1% chance of survival. After
3 years, a new medication was found and tested. A total of 500 AIDS victims have been
treated and 15 have survived. Does this result show that the new medication is more
successful than the old one? Use the 0.05 level of significance.
Solutions:
Given: Let N and O denote new medication and old medication, respectively.
p𝑂 = 0.01
𝑛𝑂 = 250
15
p𝑁 =
500
𝑛𝑁 = 500
𝛼 = 0.05
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: There is no significant difference in the effectivity of the new and old medications.
(p𝑁 = p𝑂 )
Ha: The new medication is more successful than the old one. (p𝑁 > p𝑂 )
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
Since Ha is expressed in terms of a directional expression (greater than), it would be
appropriate to use a 1–tailed, right test.
Step 4: Determine the tabular value for the test from the table.
zcrit = +1.645.
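Step 5: Compute for the required statistical test.
\[ z = \frac{p_N - p_O}{\sqrt{\frac{p_N q_N}{n_N} + \frac{p_O q_O}{n_O}}} = \frac{\frac{15}{500} - 0.01}{\sqrt{\frac{\left(\frac{15}{500}\right)\left(\frac{485}{500}\right)}{500} + \frac{(0.01)(0.99)}{250}}} \]
zcomp = 2.02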
Step 6: Decide.
Decision: Since the computed value of z is greater than the critical value, zcomp (+2.02) > zcrit
(+1.645), it falls in the rejection region; Ho is rejected.
Step 7: Conclusion.
Since Ho was rejected, conclude that the new medication is more successful than the old one.
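The two-proportion computations above use a separate (unpooled) variance term for each sample. The sketch below reproduces them and, for comparison only, also shows the pooled-proportion variant that many textbooks use when Ho states equal proportions. Both function names are hypothetical.

```python
from math import sqrt


def two_proportion_z_unpooled(p1, n1, p2, n2):
    """z statistic with separate variance terms, as in the worked examples."""
    standard_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / standard_error


def two_proportion_z_pooled(p1, n1, p2, n2):
    """Alternative: z statistic using one pooled proportion for both samples."""
    p_pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Example 1: town (120 of 200) vs county (240 of 500), right-tailed
z_vote = two_proportion_z_unpooled(120 / 200, 200, 240 / 500, 500)
z_vote_pooled = two_proportion_z_pooled(120 / 200, 200, 240 / 500, 500)

# Example 2: new medication (15 of 500) vs old medication (1% of 250), right-tailed
z_med = two_proportion_z_unpooled(15 / 500, 500, 0.01, 250)

print(round(z_vote, 2), round(z_vote_pooled, 2), round(z_med, 2))
```

In Example 1 both variants exceed zcrit = +1.645, so the decision does not depend on the choice.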
Activity 05:
Perform hypothesis testing to the following problems. Use a 0.05 level of significance.
1. A group of biology students wish to determine whether an insect population found only in one location of a
forest belongs to a certain species. The only morphological characteristic which appeared to differ from that of the
known members of the species was wing length. The mean wing length of the species was 15.4 mm with a
standard deviation of 2.3 mm. The students measured the wing length of 50 insects and obtained a mean of 17.4
mm. Can the students conclude that the insects are of a different species?
2. A study at the University of Colorado at Boulder shows that running increases the percent resting metabolic
rate (RMR) in older women. The average RMR of 50 elderly women runners was 34.0% higher than the
average RMR of 50 sedentary elderly women, and the population standard deviations were reported to be
10.5 and 10.2%, respectively. Was there a significant increase in RMR of the women runners over the
sedentary women?
3. At a certain college, it is estimated that at most 25% of the students ride bicycles to class. Does this seem to
be a valid estimate if, in a random sample of 90 college students, 28 are found to ride bicycles to class?
4. A 2003 New York Times/CBS News poll sampled 523 adults who were planning a vacation during the next
six months and found that 141 were expecting to travel by airplane (New York Times News Service, March
2, 2003). A similar survey question in a May 1993 New York Times/CBS News poll found that of 477 adults
who were planning a vacation in the next six months, 81 were expecting to travel by airplane. Did a
significant change occur in the population proportion planning to travel by airplane over the 10-year
period?
Hypothesis testing means finding out if the mean difference is statistically significant or not. A t-test or
a z-test may be used for the particular purpose.
A very simple example: Let’s say you have a cold and you try a naturopathic remedy. Your cold lasts a couple of
days. The next time you have a cold, you buy an over-the-counter pharmaceutical and the cold lasts a week. You
survey your friends and they all tell you that their colds were of a shorter duration (an average of 3 days) when
they took the naturopathic remedy. What you really want to know is, are these results repeatable? A t test can tell
you by comparing the means of the two groups and letting you know the probability of those results happening
by chance.
Another example: Student’s T-tests can be used in real life to compare means. For example, a drug company may
want to test a new cancer drug to find out if it improves life expectancy. In an experiment, there’s always a control
group (a group who are given a placebo, or “sugar pill”). The control group may show an average life expectancy
of +5 years, while the group taking the new drug might have a life expectancy of +6 years. It would seem that the
drug might work. But it could be due to a fluke. To test this, researchers would use a Student’s t-test to find out if
the results are repeatable for an entire population.
THE T-SCORE
The t-score is a ratio between the difference between two groups and the difference within the groups.
The larger the t score, the more difference there is between groups. The smaller the t score, the more similarity
there is between groups. A t score of 3 means that the groups are three times as different from each other as they
are within each other. When you run a t test, the bigger the t-value, the more likely it is that the results are
repeatable.
A large t-score tells you that the groups are different.
A small t-score tells you that the groups are similar.
The one-sample t-test determines whether the sample mean is statistically different from a known or
hypothesized population mean. The One Sample t Test is a parametric test. This test is also known as single-
sample t-test. In a one-sample t-test, the test variable is compared against a "test value", which is a known or
hypothesized value of the mean in the population.
time points for the original measures. If the mean change score is not significantly different
from zero, no significant change occurred.
Note: The one-sample t-test can only compare a single sample mean to a specified constant. It cannot compare
sample means between two or more groups. If you wish to compare the means of multiple groups to each other,
you will likely want to run an Independent Samples t-test (to compare the means of two groups) or a One-Way
ANOVA (to compare the means of two or more groups).
\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]
where:
𝑥̅ = sample mean
𝜇 = population mean
𝑠 = sample standard deviation
𝑛 = sample size/number of observations
Degrees of freedom: 𝑑𝑓 = 𝑛 − 1
The P-values and critical regions for these situations are shown in the figure above.
Example 1: Joan’s Nursery specializes in custom-designed landscaping for residential areas. The
estimated labor cost associated with a particular landscaping proposal is based on the number
of plantings of trees, shrubs, and so on to be used for the project. For cost-estimating
purposes, managers use two hours of labor time for the planting of a medium-sized tree.
Actual times from a sample of 10 plantings during the past month follow (times in hours).
1.7 1.5 2.6 2.2 2.4
2.3 2.6 3.0 1.4 2.3
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: The mean tree-planting time is two hours.
𝜇 = 2.0
Ha: The mean tree-planting time is not two hours.
𝜇 ≠ 2.0
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
2–tail
Step 4: Determine the tabular value for the test from the table.
tcrit(0.05, 9) = ±2.262.
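Step 6: Decide.
Decision: Since the computed value of t lies between the two critical values,
−2.262 < tcomp (+1.2163) < +2.262, it falls in the non-rejection region; fail to reject Ho.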
Step 7: Conclusion.
Since Ho was not rejected, there is not sufficient evidence to say that the mean tree-planting
time differs from two hours; Ho is retained:
“The mean tree-planting time is two hours.”
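The tree-planting example can be recomputed directly from the ten observed times. The sketch below uses only the standard library; the critical value 2.262 is the tabular value quoted in Step 4. Note that with an unrounded standard deviation (s ≈ 0.5164) the statistic comes out as about 1.22; the handout's 1.2163 is what results if s is first rounded to 0.52. The decision is the same either way.

```python
from math import sqrt
from statistics import mean, stdev

# Planting times in hours, copied from the example
times = [1.7, 1.5, 2.6, 2.2, 2.4, 2.3, 2.6, 3.0, 1.4, 2.3]
mu0 = 2.0        # hypothesized mean planting time
t_crit = 2.262   # two-tailed, alpha = 0.05, df = 9 (value from the handout)

n = len(times)
xbar = mean(times)
s = stdev(times)  # sample standard deviation (divides by n - 1)
t = (xbar - mu0) / (s / sqrt(n))
reject = abs(t) > t_crit

print("mean =", round(xbar, 2), "s =", round(s, 4), "t =", round(t, 4))
print("reject Ho" if reject else "fail to reject Ho")
```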
Example 2: The Edison Electric Institute has published figures on the number of kilowatt hours used annually
by various home appliances. It is claimed that a vacuum cleaner uses an average of 46 kilowatt
hours per year. If a random sample of 12 homes included in a planned study indicates that vacuum
cleaners use an average of 42 kilowatt hours per year with a standard deviation of 11.9 kilowatt
hours, does this suggest at the 0.05 level of significance that vacuum cleaners use, on average,
less than 46 kilowatt hours annually? Assume the population of kilowatt hours to be normal.
Solutions:
Given: 𝜇 = 46 kilowatt hours
n = 12 homes
𝑥̅ = 42 kilowatt hours
S = 11.9 kilowatt hours
𝛼 = 0.05
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: Vacuum cleaners use, on average, at least 46 kilowatt hours annually.
𝜇 ≥ 46
Ha: Vacuum cleaners use, on average, less than 46 kilowatt hours annually.
𝜇 < 46
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
1–tail, left
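Step 4: Determine the tabular value for the test from the table.
tcrit(0.05, 11) = –1.796.
Step 5: Compute for the required statistical test.
\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{42 - 46}{11.9 / \sqrt{12}} \]
tcomp = –1.1644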
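Step 6: Decide.
Decision: Since the computed value of t is greater than the critical value, tcomp (–1.1644) > tcrit
(–1.796), it falls in the non-rejection region; fail to reject Ho.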
Example 3: A perfume company claims that the best-selling perfume contains at most 25% alcohol. Twenty
bottles were selected and found to have a mean of 29.7% and a standard deviation of 4.8%. Test
the claim of the perfume company at the 0.05 level of significance.
Solutions:
Given: 𝜇 = 25%
n = 20
𝑥̅ = 29.7%
S = 4.8%
𝛼 = 0.05
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: The best-selling perfume contains at most 25% alcohol.
𝜇 ≤ 25
Ha: The best-selling perfume contains more than 25% alcohol.
𝜇 > 25
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
1–tail, right
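Step 4: Determine the tabular value for the test from the table.
tcrit(0.05, 19) = +1.729.
Step 5: Compute for the required statistical test.
\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{29.7 - 25}{4.8 / \sqrt{20}} \]
tcomp = +4.3790
Step 6: Decide.
Decision: Since the computed value of t is greater than the critical value, tcomp (+4.3790) > tcrit
(+1.729), it falls in the rejection region; Ho is rejected.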
Step 7: Conclusion.
Since Ho was rejected, there is no evidence to support the perfume company’s claim that the best-
selling perfume contains at most 25% alcohol.
Conclude Ha: “The best-selling perfume contains more than 25% alcohol.”
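When only summary statistics are given, as in Examples 2 and 3, the one-sample t statistic and the decision rule can be sketched as below. The helpers `one_sample_t` and `rejects_null` are hypothetical names; the critical values are the tabular ones quoted in the examples, passed as positive numbers.

```python
from math import sqrt


def one_sample_t(xbar, mu0, s, n):
    """t statistic for a one-sample t-test from summary statistics."""
    return (xbar - mu0) / (s / sqrt(n))


def rejects_null(t, t_crit, tail):
    """Apply the critical-value rule; t_crit is the positive table value."""
    if tail == "left":
        return t < -t_crit
    if tail == "right":
        return t > t_crit
    if tail == "two":
        return abs(t) > t_crit
    raise ValueError("tail must be 'two', 'left', or 'right'")


# Example 2: vacuum cleaners, left-tailed, df = 11
t_vacuum = one_sample_t(42, 46, 11.9, 12)
reject_vacuum = rejects_null(t_vacuum, 1.796, "left")

# Example 3: perfume, right-tailed, df = 19
t_perfume = one_sample_t(29.7, 25, 4.8, 20)
reject_perfume = rejects_null(t_perfume, 1.729, "right")

print(round(t_vacuum, 4), "reject Ho" if reject_vacuum else "fail to reject Ho")
print(round(t_perfume, 4), "reject Ho" if reject_perfume else "fail to reject Ho")
```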
Note: The Independent Samples t-Test can only compare the means for two (and only two) groups. It cannot make
comparisons among more than two groups. If you wish to compare the means across more than two groups, you
will likely want to run an ANOVA.
The t-test can also be used to compare two sample means by determining if the samples
were obtained from normal populations with the same means. However, certain assumptions are required.
1. The populations must be at least approximately normally distributed.
2. The samples must be independent.
3. The population variances must be equal. Tests for equality of variances are taken up in advance.
Degrees of freedom: 𝑑𝑓 = 𝑛1 + 𝑛2 − 2
Example 1: An experiment was performed to compare the abrasive wear of two different laminated
materials. Twelve pieces of material 1 were tested by exposing each piece to a machine
measuring wear. Ten pieces of material 2 were similarly tested. In each case, the depth of wear
was observed. The samples of material 1 gave an average (coded) wear of 85 units with a
sample standard deviation of 4, while the samples of material 2 gave an average of 81 with a
sample standard deviation of 5. Can we conclude at the 0.05 level of significance that the
abrasive wear of material 1 exceeds that of material 2 by more than 2 units?
Solutions:
Given:    Material 1    Material 2
𝑥̅         85            81
n         12            10
s         4             5
𝛼 = 0.05
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: The abrasive wear of material 1 exceeds that of material 2 by at most 2 units.
(𝜇𝑀1 − 𝜇𝑀2 ) ≤ 2
Ha: The abrasive wear of material 1 exceeds that of material 2 by more than 2 units.
(𝜇𝑀1 − 𝜇𝑀2 ) > 2
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
1–tail, right
Step 4: Determine the tabular value for the test from the table.
tcrit(0.05, 20) = +1.725.
Step 5: Compute for the required statistical test.
\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_{M1} - \mu_{M2})}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{(85 - 81) - 2}{\sqrt{\frac{4^2}{12} + \frac{5^2}{10}}} \]
tcomp = +1.0215
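Step 6: Decide.
Decision: Since the computed value of t is less than the critical value, tcomp (+1.0215) < tcrit
(+1.725), it falls in the non-rejection region; fail to reject Ho.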
Step 7: Conclusion.
Since Ho was not rejected, there was no evidence to support that the abrasive wear of material 1
exceeds that of material 2 by more than 2 units.
Conclude: “The abrasive wear of material 1 exceeds that of material 2 by at most 2 units.”
Example 2: To find out whether a new drug will reduce the spread of cancer, 9 mice which have all reached
an advanced stage of the disease are selected. Five mice received the treatment and four did not.
The survival periods, in months from the time that the experiment commenced are as follows:
At a 0.05 level of significance is there evidence to say that the new drug is effective?
Solutions:
Given:    Treatment    No Treatment
𝑥̅         2.92         2.30
n         5            4
s         1.95         1.24
𝛼 = 0.05
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
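Ho: There is no significant difference in the mean survival periods of the treated and not
treated mice.
𝜇𝑇 = 𝜇𝑁𝑇
Ha: The mean survival period of the treated mice is longer than that of the not treated mice.
𝜇𝑇 > 𝜇𝑁𝑇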
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
1–tail, right
Step 4: Determine the tabular value for the test from the table.
tcrit(0.05, 7) = +1.895.
Step 6: Decide.
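Decision: Since the computed value of t is less than the critical value, tcomp (+0.5794) < tcrit
(+1.895), it falls in the non-rejection region; fail to reject Ho.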
Step 7: Conclusion.
Since Ho was not rejected, there is no sufficient evidence to say that the new drug is effective.
That is, we can say that
“There is no significant difference in the mean survival periods of the treated and not treated mice.”
Example 3: Engineers at a large automobile manufacturing company are trying to decide whether to purchase
brand A or brand B tires for the company’s new models. To help them arrive at a decision, an
experiment is conducted using 50 of each brand. The tires are run until they wear out. The results
are as follows:
Brand A: 𝑥̅1 = 37,900 kilometers, s1 = 5,100 kilometers.
Brand B: 𝑥̅2 = 39,800 kilometers, s2 = 5,900 kilometers.
Test the hypothesis that there is no difference in the average wear of the two brands of tires.
Assume the populations to be approximately normally distributed
Solutions:
Given:    Brand A    Brand B
𝑥̅         37,900     39,800
n         50         50
s         5,100      5,900
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: There is no significant difference in the average wear of the two brands of tires.
𝜇𝐴 = 𝜇𝐵
Ha: There is a significant difference in the average wear of the two brands of tires.
𝜇𝐴 ≠ 𝜇𝐵
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
2–tail
Step 4: Determine the tabular value for the test from the table.
tcrit(0.05, ∞) = ±1.960
Step 6: Decide.
Step 7: Conclusion.
Since Ho was not rejected, conclude Ho:
“There is no significant difference in the average wear of the two brands of tires.”
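The three independent-samples examples can be checked from their summary statistics. The sketch below is an illustration with a hypothetical helper `two_sample_t`. By default it uses the separate-variance standard error, which reproduces the tcomp values printed in the examples (+1.0215 and +0.5794); setting `pooled=True` instead uses the pooled variance that corresponds to the equal-variance assumption and df = n1 + n2 − 2. For Example 3 the handout does not print tcomp, so the value computed here is this sketch's own result.

```python
from math import sqrt


def two_sample_t(xbar1, s1, n1, xbar2, s2, n2, d0=0.0, pooled=False):
    """t statistic for the difference between two independent sample means."""
    if pooled:
        # Pooled variance under the equal-variance assumption
        sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
        standard_error = sqrt(sp2 * (1 / n1 + 1 / n2))
    else:
        standard_error = sqrt(s1**2 / n1 + s2**2 / n2)
    return ((xbar1 - xbar2) - d0) / standard_error


# Example 1: laminated materials, hypothesized difference of 2 units
t_wear = two_sample_t(85, 4, 12, 81, 5, 10, d0=2)
t_wear_pooled = two_sample_t(85, 4, 12, 81, 5, 10, d0=2, pooled=True)

# Example 2: treated vs untreated mice
t_mice = two_sample_t(2.92, 1.95, 5, 2.30, 1.24, 4)

# Example 3: brand A vs brand B tires (equal n, so pooled and unpooled agree)
t_tires = two_sample_t(37900, 5100, 50, 39800, 5900, 50)

print(round(t_wear, 4), round(t_wear_pooled, 2), round(t_mice, 4), round(t_tires, 2))
```

In all three examples the computed t does not reach the tabular critical value, so Ho is not rejected under either form of the standard error.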
\[ t = \frac{\bar{d}}{s_d / \sqrt{n}} \]
where:
𝑑̅ = mean difference between two observations on each pair (𝑑𝑖 = 𝑦𝑖 − 𝑥𝑖)
𝑠𝑑 = standard deviation of the differences
𝑛 = sample size/number of observations
Degrees of freedom: 𝑑𝑓 = 𝑛 − 1
Example: Five samples of a ferrous-type substance were used to determine if there is a difference between
a laboratory chemical analysis and an X-ray fluorescence analysis of the iron content. Each
sample was split into two subsamples and the two types of analysis were applied. Following are
the coded data showing the iron content analysis:
Analysis    Sample 1    Sample 2    Sample 3    Sample 4    Sample 5
X-ray       2.0         2.0         2.3         2.1         2.4
Chemical    2.2         1.9         2.5         2.3         2.4
Assuming that the populations are normal, test at the 0.05 level of significance whether the two
methods of analysis give, on the average, the same result.
Solutions:
Given: Brand A Brand B
̅
𝒙 37,900 39,800
n 50 50
s 5,100 5,900
Step 1: Formulate a null hypothesis (Ho) and the alternative hypothesis (Ha).
Ho: There is no significant difference in the average wear of the two brands of tires.
62
CHAPTER III: TEST OF DIFFERENCE BETWEEN MEANS
𝜇 𝐴 = 𝜇𝐵
Ha: There is a significant difference in the average wear of the two brands of tires.
𝜇 𝐴 ≠ 𝜇𝐵
Step 3: Identify the type of statistical test as either one-tailed test or two-tailed test.
2–tail
Step 4: Determine the tabular value for the test from the table.
tcrit(0.05, ∞) = ±1.960
Step 6: Decide.
Step 7: Conclusion.
Since Ho was not rejected, conclude Ho:
“There is no significant difference in the average wear of the two brands of tires.”
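The decision reached above can be checked numerically. The following is a minimal sketch, assuming the large-sample statistic (𝑥̅A − 𝑥̅B) / √(sA²/nA + sB²/nB) and the tabular value ±1.960 quoted in Step 4; the variable names are illustrative.

```python
import math

# Summary statistics from the "Given" table
mean_a, mean_b = 37_900, 39_800
n_a, n_b = 50, 50
s_a, s_b = 5_100, 5_900

# Standard error of the difference between the two sample means
se = math.sqrt(s_a**2 / n_a + s_b**2 / n_b)
t_comp = (mean_a - mean_b) / se

t_crit = 1.960                       # two-tailed tabular value from Step 4
reject_h0 = abs(t_comp) >= t_crit    # two-tailed decision rule
print(f"t = {t_comp:.3f}, reject Ho: {reject_h0}")
```

Because the two sample sizes are equal, pooling the variances would give the same standard error, so the sketch does not depend on that choice.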
2. The College Board provided comparisons of Scholastic Aptitude Test (SAT) scores based on the highest level
of education attained by the test taker’s parents. A research hypothesis was that students whose parents had
attained a higher level of education would on average score higher on the SAT. During 2003, the overall mean
SAT verbal score was 507 (The World Almanac, 2004). SAT verbal scores for independent samples of students
follow. The first sample shows the SAT verbal test scores for students whose parents are college graduates
with a bachelor’s degree. The second sample shows the SAT verbal test scores for students whose parents are
high school graduates but do not have a college degree.
              Student’s Parents
College Grads        High School Grads
485    487           442    492
534    533           580    478
650    526           479    425
554    410           486    485
550    515           528    390
572    578           524    535
497    448
592    469
At 0.05 level of significance, determine whether the sample data support the hypothesis that students show a
higher mean verbal score on the SAT if their parents attained a higher level of education.
3. Suppose a sample of 20 students were given a diagnostic test before studying a particular module and then
took the same test after completing the module. We want to find out if, in general, our teaching leads to
improvements in students’ knowledge/skills (i.e. test scores). We can use the results from our sample of
students to draw conclusions about the impact of this module in general.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
x 18 21 16 22 19 24 17 21 23 18 14 16 16 19 18 20 12 22 15 17
y 22 25 17 24 16 29 20 23 19 20 15 15 18 26 18 24 18 25 19 16
x: test score before the module; y: test score after the module
For example, suppose that three methods could be used to evaluate the strength readings on steel plate
girders. We may think of these as three treatments, say t1, t2, and t3. If we use four girders as the experimental
units, a randomized complete block design would appear as shown below:
The design is called a randomized complete block design because each block is large enough to hold all
the treatments and because the actual assignment of each of the three treatments within each block is done
randomly.
Once the experiment has been conducted, the data are recorded in a table such as the one shown below.
The observations in this table, say yij, represent the response obtained when method i is used on girder j.
The general procedure for a randomized complete block design consists of selecting b blocks and running
a complete replicate of the experiment in each block. The data that result from running a randomized complete
block design for investigating a single factor with a levels and b blocks are shown below.
                          Blocks
Treatments    1       2       ⋯     b       Totals    Averages
1             𝑦11     𝑦12     ⋯     𝑦1𝑏     𝑦1∙       𝑦̅1∙
2             𝑦21     𝑦22     ⋯     𝑦2𝑏     𝑦2∙       𝑦̅2∙
⋮             ⋮       ⋮             ⋮       ⋮         ⋮
a             𝑦𝑎1     𝑦𝑎2     ⋯     𝑦𝑎𝑏     𝑦𝑎∙       𝑦̅𝑎∙
Totals        𝑦∙1     𝑦∙2     ⋯     𝑦∙𝑏     𝑦∙∙
Averages      𝑦̅∙1     𝑦̅∙2     ⋯     𝑦̅∙𝑏               𝑦̅∙∙
There will be a observations (one per factor level) in each block, and the order in which these
observations are run is randomly assigned within the block.
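The within-block randomization described above can be sketched as follows. This is only an illustration: the treatment labels t1 to t3 and the four girders come from the steel plate girder example, while the function name `rcbd_layout` and the fixed seed are assumptions made so that the sketch is reproducible.

```python
import random

def rcbd_layout(treatments, n_blocks, seed=None):
    """Return one independently shuffled run order of all treatments per block."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        order = list(treatments)   # every block receives every treatment (complete block)
        rng.shuffle(order)         # run order is randomized within the block
        layout.append(order)
    return layout

layout = rcbd_layout(["t1", "t2", "t3"], n_blocks=4, seed=1)
for block, order in enumerate(layout, start=1):
    print(f"Block (girder) {block}: {order}")
```

Each block is shuffled separately, which is what distinguishes this layout from randomizing all runs at once.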
Example: A manufacturer of paper used for making grocery bags is interested in improving the
tensile strength of the product. Product engineering thinks that tensile strength is a
function of the hardwood concentration in the pulp and that the range of hardwood
concentrations of practical interest is between 5 and 20%. A team of engineers responsible
for the study decides to investigate four levels of hardwood concentration: 5%, 10%, 15%,
and 20%. They decide to make up six test specimens at each concentration level, using a
pilot plant. All 24 specimens are tested on a laboratory tensile tester, in random order. The
data from this experiment are shown below:
This is an example of a completely randomized single-factor experiment with four levels of the factor.
The levels of the factor are sometimes called treatments, and each treatment has six observations or replicates.
The role of randomization in this experiment is extremely important. By randomizing the order of the 24 runs,
the effect of any nuisance variable that may influence the observed tensile strength is approximately balanced out.
For example, suppose that there is a warm-up effect on the tensile testing machine; that is, the longer the machine
is on, the greater the observed tensile strength. If all 24 runs are made in order of increasing hardwood
concentration (that is, all six 5% concentration specimens are tested first, followed by all six 10% concentration
specimens, etc.), any observed differences in tensile strength could also be due to the warm-up effect.
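Randomizing the order of the 24 runs can be sketched in a few lines. This is a hedged illustration rather than the procedure the engineers actually used; the seed value is an arbitrary assumption that only makes the output repeatable.

```python
import random

concentrations = [5, 10, 15, 20]    # hardwood concentration levels, in percent
replicates = 6                      # test specimens per concentration level

# One entry per specimen: 4 levels x 6 replicates = 24 runs
runs = [c for c in concentrations for _ in range(replicates)]

rng = random.Random(2024)           # fixed seed only so the sketch is reproducible
rng.shuffle(runs)                   # random test order across all 24 specimens

for position, c in enumerate(runs, start=1):
    print(f"run {position:2d}: {c}% hardwood")
```

With the order shuffled this way, a warm-up effect on the testing machine is spread across all four concentration levels instead of lining up with one of them.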
The analysis of variance (ANOVA) F-test is a method for partitioning the observed variation into
different parts, each part assignable to a known source, cause, or factor. The ANOVA was developed by R.A.
Fisher and reported by him in 1923. Simply stated, it is used when we wish to test the significance of the difference
between two or more means obtained from independent samples.
The F-test is a parametric test, so it carries the assumptions of a parametric test:
1. Subjects are randomly selected from normal populations with equal variances;
2. Samples or groups are independent; and
3. The data being analyzed must be at least interval level.
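FORMULA
\[ SS_T = \Sigma X^2 - \frac{\left( (\Sigma X)_T \right)^2}{N} \]
\[ SS_B = \left\{ \sum_{i=1}^{k} \frac{(\Sigma X_i)^2}{n_i} \right\} - \frac{\left( (\Sigma X)_T \right)^2}{N} \]
\[ SS_W = SS_T - SS_B \]
\[ df_b = k - 1, \qquad df_w = N - k \]
\[ MS_b = \frac{SS_B}{df_b}, \qquad MS_w = \frac{SS_W}{df_w} \]
\[ F = \frac{MS_b}{MS_w} \]
where:
Σ𝑋² = sum of all the squared scores
(Σ𝑋)𝑇 = sum of all the scores
𝑁 = total number of scores
Σ𝑋𝑖 = sum of the scores in any group
𝑛𝑖 = number of scores in any group
𝑘 = the number of groups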
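The computing formulas above can be sketched in Python. This is a minimal sketch, assuming the within-groups degrees of freedom are N − k and that each mean square is its sum of squares divided by its degrees of freedom; the function name `one_way_anova` and the three small groups of scores are hypothetical and are not taken from the module.

```python
def one_way_anova(groups):
    """Compute one-way ANOVA sums of squares and the F-ratio from raw scores."""
    all_scores = [x for g in groups for x in g]
    n_total = len(all_scores)                         # N, total number of scores
    k = len(groups)                                   # number of groups
    correction = sum(all_scores) ** 2 / n_total       # ((sum X)_T)^2 / N

    ss_total = sum(x * x for x in all_scores) - correction
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_within = ss_total - ss_between

    df_between, df_within = k - 1, n_total - k        # df_within = N - k is assumed
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "SST": ss_total, "SSB": ss_between, "SSW": ss_within,
        "df_b": df_between, "df_w": df_within,
        "MS_b": ms_between, "MS_w": ms_within,
        "F": ms_between / ms_within,
    }

# Hypothetical scores for three groups (illustration only)
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for name, value in result.items():
    print(f"{name} = {value}")
```

The computed F is then compared with the tabular F for (df_b, df_w) degrees of freedom at the chosen level of significance.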
(1.) Under “Tools”, select “Data Analysis”. In the window that appears, select “ANOVA: One factor” and click
“OK.”
(2.) Using your mouse, highlight the cells containing the data.
(3.) Select “Columns” if each treatment is its own column or “Row” if each treatment is its own row.
(4.) Set your level of significance. (The default is 5% or 0.05.)
(5.) Click “OK” and the ANOVA output will appear on a new worksheet.
Example: You wish to test the effects of a number of experimental treatments (Counseling Approaches):
group counseling, peer counseling, and individual counseling on the self-concept of students.
In this case, the independent variable, counseling approaches, has three levels. Necessarily,
there should be three groups randomly selected from the school population which will be
exposed to the three different counseling approaches. The dependent variable, self-concept,
may be measured through a standardized self-concept instrument which yields interval scores
for the subject.
Since the computed value of F (Fcomp=6.70) is greater than the critical value of F (Fcrit=3.68),
Fcomp>Fcrit, reject the null hypothesis.
Step 7: Interpretation:
The significant F-ratio leads to the rejection of the null hypothesis. We may now accept the
alternative hypothesis that there is an effect of the counseling approaches on the self-concept
of the students. The group means show that the individual counseling sample registered
the highest mean. At this point, it may be correct to say that the individual counseling approach
is significantly more effective than the group counseling approach in enhancing the
self-concepts of the students. However, the data also reveal that there is only a small difference between
the means of group P and group I.
The Analysis of Variance (ANOVA) test has long been an important tool for researchers conducting
studies on multiple experimental groups and one or more control groups. It is a powerful procedure for
testing the homogeneity of a set of means. However, if we reject the null hypothesis and accept the
stated alternative (that the means are not all equal), we still do not know which of the population means are equal
and which are different. ANOVA cannot provide detailed information on differences among the various study
groups, or on complex combinations of study groups. Methods for investigating this issue are called multiple
comparisons methods.
To fully understand group differences in an ANOVA, researchers must conduct tests of the differences
between particular pairs of experimental and control groups. Tests conducted on subsets of data tested previously
in another analysis are called post hoc tests. Post hoc (Latin for “after this”) means that the analysis is carried out
after the experimental results have been obtained. A class of post hoc tests that provides this type of detailed information for ANOVA
results is called "multiple comparison analysis" tests. Different multiple comparison analyses have specific
uses, advantages, and disadvantages. Some are best used for testing theory while others are useful in generating
new theory. Selection of the appropriate post hoc test will provide researchers with the most detailed information
while limiting Type I errors due to alpha inflation.
Tip: The LSD test only makes sense if you have a significant result from ANOVA (i.e., if you reject the null
hypothesis). Therefore, you should not run the test if the ANOVA result is not significant.
General Steps:
1. Assuming a two-sided alternative hypothesis, find tcrit.
2. Compute the LSD.
3. Calculate the absolute difference between the means of the two groups, |𝑥̅𝐴 − 𝑥̅𝐵|.
4. If |𝑥̅𝐴 − 𝑥̅𝐵| ≥ 𝐿𝑆𝐷, reject the null hypothesis. The pair of means 𝑥̅𝐴 and 𝑥̅𝐵 would be declared
significantly different.
Example: You wish to test the effects of a number of experimental treatments (Counseling Approaches): group
counseling, peer counseling, and individual counseling on the self-concept of students. In this case, the
independent variable, counseling approaches, has three levels. Necessarily, there should be three
groups randomly selected from the school population which will be exposed to the three different
counseling approaches. The dependent variable, self-concept, may be measured through a standardized
self-concept instrument which yields interval scores for the subject.
Summary of Means:
Counseling Approach    Mean
Group                  61
Peer                   82
Individual             85
F-test: Reject Ho.
Post-hoc test using Fisher’s Least Significant Difference Test will be performed to identify
which pairs of means are statistically different.
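\[ LSD = t_{crit} \sqrt{MS_W \left( \frac{1}{n_A} + \frac{1}{n_B} \right)} \]
Pairwise Comparison    Mean Difference (MD)    LSD
|G–P|                  |61 − 82| = 21          2.131 ∗ √(153.07 (1/6 + 1/6)) = 15.2217
|G–I|                  |61 − 85| = 24          2.131 ∗ √(153.07 (1/6 + 1/6)) = 15.2217
|P–I|                  |82 − 85| = 3           2.131 ∗ √(153.07 (1/6 + 1/6)) = 15.2217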
Step 3: If |𝑥̅𝐴 − 𝑥̅𝐵| ≥ 𝐿𝑆𝐷, reject the null hypothesis. The pair of means 𝑥̅𝐴 and 𝑥̅𝐵 would be declared
significantly different.
Pairwise Comparison    MD    LSD        Remarks
|G–P|                  21    15.2217    G and P are significantly different.
|G–I|                  24    15.2217    G and I are significantly different.
|P–I|                  3     15.2217    P and I are not significantly different.
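The pairwise comparisons in the table above can be reproduced with a short script. This is a sketch under the values used in the example (tabular t = 2.131, MSW = 153.07, six scores per group); the function name `fisher_lsd` is illustrative.

```python
import math
from itertools import combinations

def fisher_lsd(t_crit, ms_within, n_a, n_b):
    """Least significant difference for one pair of group means."""
    return t_crit * math.sqrt(ms_within * (1 / n_a + 1 / n_b))

means = {"G": 61, "P": 82, "I": 85}    # group, peer, individual counseling
n = 6                                   # scores per group
ms_within = 153.07                      # mean square within, as used in the example
t_crit = 2.131                          # tabular t used in the example

results = {}
for a, b in combinations(means, 2):
    md = abs(means[a] - means[b])                   # absolute mean difference
    lsd = fisher_lsd(t_crit, ms_within, n, n)
    results[(a, b)] = (md, lsd, md >= lsd)          # True means significantly different
    print(f"|{a}-{b}|: MD = {md}, LSD = {lsd:.4f}, significant = {md >= lsd}")
```

The LSD printed here may differ from 15.2217 in the last decimal places because no intermediate rounding is applied.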
To test all pairwise comparisons among means using the Tukey HSD, calculate HSD for each pair of means using
the following formula:
\[ HSD = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{MS_w}{n}}} \]
where:
𝑥̅𝐴 − 𝑥̅ 𝐵 is the difference between the pair of means (𝑥̅𝐴 should be larger than 𝑥̅𝐵 )
MSW is the Mean Square Within
n is the number in the group or treatment
Tukey-Kramer Method
If you have unequal sample sizes, you have to calculate the estimated standard deviation for each pairwise
comparison. This is called the Tukey-Kramer Method.
\[ HSD = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{MS_w}{2} \left( \dfrac{1}{n_A} + \dfrac{1}{n_B} \right)}} \]
General Steps:
1. Perform the ANOVA test. If your F value is significant, you can run the post hoc test.
2. Calculate the HSD statistic for the Tukey test using the formula.
3. Find the score in Tukey’s critical value table.
4. If the calculated value of HSD is greater than the critical value, the null hypothesis is rejected. This
implies that the two means are significantly different.
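Steps 2 to 4 can be sketched as follows. The sketch uses the Tukey-Kramer form of the statistic, which reduces to the equal-sample-size formula when nA = nB. The counseling means and MSW = 153.07 are reused from the LSD example purely for illustration; the critical value is left as a parameter because it must be read from Tukey’s table, and the function names are hypothetical.

```python
import math

def tukey_statistic(mean_a, mean_b, ms_within, n_a, n_b):
    """HSD statistic for one pair of means (Tukey-Kramer form)."""
    larger, smaller = max(mean_a, mean_b), min(mean_a, mean_b)
    se = math.sqrt((ms_within / 2) * (1 / n_a + 1 / n_b))
    return (larger - smaller) / se

def is_significant(hsd, q_crit):
    """Step 4: reject Ho when the statistic exceeds the tabular critical value."""
    return hsd > q_crit

# Counseling example values, reused here only as an illustration
hsd_gi = tukey_statistic(61, 85, 153.07, 6, 6)   # group vs. individual
hsd_pi = tukey_statistic(82, 85, 153.07, 6, 6)   # peer vs. individual
print(f"HSD(G, I) = {hsd_gi:.4f}")
print(f"HSD(P, I) = {hsd_pi:.4f}")
```

Calling `is_significant(hsd_gi, q_crit)` with the value of `q_crit` looked up in the table for the number of groups and the within-groups degrees of freedom completes the test.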
SCHEFFÉ’S METHOD
• The Scheffé Test (also called Scheffé’s procedure or Scheffé’s method) is a post-hoc test used in Analysis
of Variance. It is named for the American statistician Henry Scheffé. After you have run ANOVA and obtained
a significant F-statistic (i.e., you have rejected the null hypothesis that the means are the same), you
run Scheffé’s test to find out which pairs of means are significantly different.
• It is used when you want to look at post-hoc comparisons in general (as opposed to just pairwise
comparisons). Scheffé’s method controls the overall confidence level. It is customarily used with unequal
sample sizes.
• Of the three mean-comparison tests you can run (the other two are Fisher’s LSD and Tukey’s HSD),
the Scheffé test is the most flexible, but it is also the test with the lowest statistical power. Deciding
which test to run largely depends on what comparisons you’re interested in:
o If you only want to make pairwise comparisons, run the Tukey procedure because it will have a
narrower confidence interval.
o If you want to compare all possible simple and complex pairs of means, run the Scheffé test as
it will have a narrower confidence interval.
Note: Only perform this test if you have rejected the null hypothesis in an ANOVA test, indicating that the means
are not all the same. Otherwise, there is no evidence that the means differ, so there is no point in running this test.
Reject the null hypothesis if the Scheffé test statistic is greater than the critical value.
General Steps:
1. Calculate the absolute values of the pairwise differences between sample means, |𝑥̅𝐴 − 𝑥̅𝐵|.
2. Use the following formula to find a set of Scheffé values:
\[ F_S = \sqrt{df_B \cdot F \cdot MS_W \cdot \left( \frac{1}{n_A} + \frac{1}{n_B} \right)} \]
where:
𝑑𝑓𝐵 is the between samples degrees of freedom.
F is the critical F-value used in the ANOVA (the tabular value at the chosen alpha level for 𝑑𝑓𝐵 and 𝑑𝑓𝑊).
MSW is the mean square error (MS within groups from ANOVA).
3. For all values |𝑥̅𝐴 − 𝑥̅𝐵| ≥ 𝐹𝑆, the null hypothesis is rejected. This implies that the difference is statistically
significant at your chosen alpha level.
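The three steps can be sketched in Python. This sketch rests on one assumption that should be checked against your own tables: the F in the formula is taken to be the tabular (critical) F, here the 3.68 quoted for the counseling example, with dfB = 2, MSW = 153.07 and six scores per group. The function name `scheffe_value` is illustrative.

```python
import math

def scheffe_value(df_between, f_value, ms_within, n_a, n_b):
    """Scheffe comparison value F_S for one pair of group means."""
    return math.sqrt(df_between * f_value * ms_within * (1 / n_a + 1 / n_b))

# Counseling example values; f_value = 3.68 is assumed to be the critical F
fs = scheffe_value(df_between=2, f_value=3.68, ms_within=153.07, n_a=6, n_b=6)

# Step 1: absolute pairwise differences between the sample means
differences = {"G-P": abs(61 - 82), "G-I": abs(61 - 85), "P-I": abs(82 - 85)}

# Step 3: a pair is declared significantly different when its difference is at least F_S
significant = {pair: diff >= fs for pair, diff in differences.items()}

print(f"F_S = {fs:.4f}")
for pair, flag in significant.items():
    print(f"{pair}: difference = {differences[pair]}, significant = {flag}")
```

With unequal sample sizes, `scheffe_value` would be called once per pair, since F_S then changes from one comparison to the next.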
DUNNETT’S CORRECTION
• Dunnett’s Test (also called Dunnett’s Method or Dunnett’s Multiple Comparison) compares means from
several experimental groups against a control group mean to see if there is a difference.
• This post-hoc test is used to compare means, but unlike Tukey, it compares every mean to a control mean.
• One fixed “control” group is compared to all of the other samples, so it should only be used when you
have a control group. If you don’t have a control group, use Tukey’s Test.
As Dunnett’s test compares two groups at a time, it acts similarly to a t-test. The following formula gives you a value that
you can use to compare mean differences. The formula is:
\[ D_{Dunnett} = t_{Dunnett} \sqrt{\frac{2 MS_W}{n}} \]
Steps:
1. Look up the tDunnett critical value in the Dunnett critical value table (using α, n, and dfw).
2. If the absolute difference between the control group mean and an experimental group mean is greater than
DDunnett, then that difference is significant.
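The comparison against a control can be sketched as follows. Every number in this sketch is hypothetical, including the value passed as `t_dunnett`, which in practice must be read from a Dunnett critical value table; the function names are illustrative, and each difference is compared with the critical difference given by the formula above.

```python
import math

def dunnett_critical_difference(t_dunnett, ms_within, n):
    """Critical difference D = t_Dunnett * sqrt(2 * MS_W / n) for equal group sizes."""
    return t_dunnett * math.sqrt(2 * ms_within / n)

def compare_to_control(control_mean, treatment_means, d_crit):
    """Flag each treatment whose mean differs from the control by more than d_crit."""
    return {
        name: abs(mean - control_mean) > d_crit
        for name, mean in treatment_means.items()
    }

# Hypothetical inputs for illustration only (not taken from a Dunnett table)
d_crit = dunnett_critical_difference(t_dunnett=2.5, ms_within=20.0, n=10)
flags = compare_to_control(50, {"A": 57, "B": 53}, d_crit)

print(f"critical difference = {d_crit}")
print(flags)
```

Only treatment-versus-control differences are examined, which is what separates this procedure from an all-pairs method such as Tukey’s.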
This procedure is also based on the general notion of studentized range. The
range of any subset of p sample means must exceed a certain value before any
of the p means are found to be different. This value is called the least significant
range for the p means and is denoted by Rp, where
\[ R_p = r_p \sqrt{\frac{2 MS_w}{n}} \]
The values of the quantity rp, called the least significant studentized range,
depend on the desired level of significance and the number of degrees of
freedom of the mean square error. These values may be obtained from Q table.
Dunn’s Multiple Comparison Test
Dunn’s Test can be used to pinpoint which specific means are significantly different from
the others. Dunn’s Multiple Comparison Test is a post hoc non-parametric
test (a “distribution free” test that doesn’t assume your data comes from a
particular distribution). It is one of the least powerful of the multiple
comparisons tests and can be a very conservative test, especially for larger
numbers of comparisons.
Dunn vs. Tukey and Dunnett
The Dunn test is an alternative to the Tukey test when you only want to test for
differences in a small subset of all possible pairs; for larger numbers of pairwise
comparisons, use Tukey’s instead. Use Dunn’s when you choose to test a
specific number of comparisons before you run the ANOVA and when you are
not comparing to controls. If you are comparing to a control group, use
the Dunnett test instead.

Bonferroni Procedure (Bonferroni Correction)
This multiple-comparison post-hoc correction is used when you are performing
many independent or dependent statistical tests at the same time. The problem
with running many simultaneous tests is that the probability of obtaining
a significant result by chance increases with each test run. This post-hoc test sets
the significance cut off at α/n, where n is the number of tests. For example, if you are running 20 simultaneous
tests at α=0.05, the corrected cut off would be 0.0025. The Bonferroni procedure does suffer
from a loss of power. This is due to several reasons, including the fact that Type
II error rates are high for each test. In other words, it overcorrects for Type I
errors.

Holm-Bonferroni Procedure
The ordinary Bonferroni method is sometimes viewed as too conservative.
Holm’s sequential Bonferroni post-hoc test is a less strict correction for multiple
comparisons.

Newman-Keuls
Like Tukey’s, this post-hoc test identifies sample means that are different from
each other. Newman-Keuls uses different critical values for comparing pairs
of means. Therefore, it is more likely to find significant differences.

Rodger’s Method
Considered by some to be the most powerful post-hoc test for detecting
differences among groups. This test protects against loss of statistical power as
the degrees of freedom increase.

Benjamini-Hochberg (BH) Procedure
If you perform a very large number of tests, one or more of the tests will have
a significant result purely by chance alone. This post-hoc test accounts for
that by controlling the false discovery rate.
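The three p-value corrections described above (Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg) can be compared side by side. This is a minimal sketch using the standard textbook form of each procedure; the function names and the five p-values are hypothetical and serve only to show how the decisions can differ.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p is at or below the single cut off alpha / (number of tests)."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down: test the sorted p-values against alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                      # stop at the first non-rejection
    return reject

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up: find the largest rank k with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            largest = rank
    reject = [False] * m
    for i in order[:largest]:          # reject the 'largest' smallest p-values
        reject[i] = True
    return reject

p = [0.005, 0.011, 0.02, 0.039, 0.5]   # hypothetical p-values
print("Bonferroni:", bonferroni(p))
print("Holm:      ", holm(p))
print("BH:        ", benjamini_hochberg(p))
print("Cut off for 20 tests at alpha = 0.05:", 0.05 / 20)
```

On these hypothetical values Bonferroni rejects the fewest hypotheses, Holm rejects one more, and Benjamini-Hochberg rejects the most, which matches the ordering from most to least conservative described in the text.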
ACTIVITY 07: ONE-FACTOR ANOVA. Perform One-way ANOVA, and if found to have significant difference,
perform post-hoc test.
1. Auditors must make judgments about various aspects of an audit on the basis of their own direct
experience, indirect experience, or a combination of the two. In a study, auditors were asked to make
judgments about the frequency of errors to be found in an audit. The judgments by the auditors were then
compared to the actual results. Suppose the following data were obtained from a similar study; lower
scores indicate better judgments.
Use α = 0.05 to test to see whether the basis for the judgment affects the quality of the judgment. What
is your conclusion?
2. A researcher used different laboratory tests in an experiment involving volunteer patients, with the fast
results, in hours, given below. Test the hypothesis that the different laboratory test results have the same
mean at the 0.01 level of significance.
Test A 5 7 10 8 6 3
Test B 8 7 5 5 8
Test C 10 8 9 7 10 8 11 13
The two-way ANOVA compares the mean differences between groups that have been split on two
independent variables (called factors). The primary purpose of a two-way ANOVA is to understand if there is an
interaction between the two independent variables on the dependent variable. For example, you could use a two-
way ANOVA to understand whether there is an interaction between gender and educational level on test anxiety
amongst university students, where gender (males/females) and education level (undergraduate/postgraduate) are
your independent variables, and test anxiety is your dependent variable. Alternately, you may want to determine
whether there is an interaction between physical activity level and gender on blood cholesterol concentration in
children, where physical activity (low/moderate/high) and gender (male/female) are your independent variables,
and cholesterol concentration is your dependent variable.
The interaction term in a two-way ANOVA informs you whether the effect of one of your independent
variables on the dependent variable is the same for all values of your other independent variable (and vice versa).
For example, is the effect of gender (male/female) on test anxiety influenced by educational level
(undergraduate/postgraduate)? Additionally, if a statistically significant interaction is found, you need to
determine whether there are any "simple main effects", and if there are, what these effects are.
Assumptions:
1. Dependent variable should be measured at the continuous level (i.e., they are interval or ratio variables).
2. Two independent variables should each consist of two or more categorical, independent groups.
3. Independence of observations, which means that there is no relationship between the observations in
each group or between the groups themselves.
4. No significant outliers.
5. Dependent variable should be approximately normally distributed for each combination of the groups of
the two independent variables.
6. Variances for each combination of the groups of the two independent variables are homogeneous.
The simplest type of factorial experiment involves only two factors, say A and B. There are a levels of
factor A and b levels of factor B, laid out as in the following table.
                                       Factor B
Factor A    1                      2                      ⋯    b                      Totals    Averages
1           𝑦111, 𝑦112, ⋯, 𝑦11𝑛    𝑦121, 𝑦122, ⋯, 𝑦12𝑛    ⋯    𝑦1𝑏1, 𝑦1𝑏2, ⋯, 𝑦1𝑏𝑛    𝑦1∙∙      𝑦̅1∙∙
2           𝑦211, 𝑦212, ⋯, 𝑦21𝑛    𝑦221, 𝑦222, ⋯, 𝑦22𝑛    ⋯    𝑦2𝑏1, 𝑦2𝑏2, ⋯, 𝑦2𝑏𝑛    𝑦2∙∙      𝑦̅2∙∙
⋮           ⋮                      ⋮                           ⋮                      ⋮         ⋮
a           𝑦𝑎11, 𝑦𝑎12, ⋯, 𝑦𝑎1𝑛    𝑦𝑎21, 𝑦𝑎22, ⋯, 𝑦𝑎2𝑛    ⋯    𝑦𝑎𝑏1, 𝑦𝑎𝑏2, ⋯, 𝑦𝑎𝑏𝑛    𝑦𝑎∙∙      𝑦̅𝑎∙∙
Totals      𝑦∙1∙                   𝑦∙2∙                   ⋯    𝑦∙𝑏∙                   𝑦∙∙∙
Averages    𝑦̅∙1∙                   𝑦̅∙2∙                   ⋯    𝑦̅∙𝑏∙                             𝑦̅∙∙∙
The experiment has n replicates, and each replicate contains all ab treatment combinations. The
observation in the ijth cell for the kth replicate is denoted by yijk. In performing the experiment, the abn observations
would be run in random order. Thus, like the single-factor experiment, the two-factor factorial is a completely
randomized design and the analysis of variance (ANOVA) will be used to test its hypotheses. Since there are two
factors in the experiment, the test procedure is sometimes called the two-way analysis of variance.
As before, the ANOVA tests these hypotheses by decomposing the total variability in the data into
component parts and then comparing the various elements in this decomposition.
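Computing Formulas for ANOVA: Two Factors
\[ SS_T = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} y_{ijk}^2 - \frac{y_{\cdot\cdot\cdot}^2}{abn} \]
\[ SS_A = \sum_{i=1}^{a} \frac{y_{i\cdot\cdot}^2}{bn} - \frac{y_{\cdot\cdot\cdot}^2}{abn} \]
\[ SS_B = \sum_{j=1}^{b} \frac{y_{\cdot j\cdot}^2}{an} - \frac{y_{\cdot\cdot\cdot}^2}{abn} \]
\[ SS_{AB} = \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{y_{ij\cdot}^2}{n} - \frac{y_{\cdot\cdot\cdot}^2}{abn} - SS_A - SS_B \]
\[ SS_E = SS_T - SS_{AB} - SS_A - SS_B \]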
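The computing formulas for the two-factor ANOVA can be sketched in Python. This is a minimal sketch assuming a balanced design stored as a nested list in which `y[i][j]` holds the n replicates for level i of factor A and level j of factor B; the function name `two_way_ss` and the small 2x2 data set are hypothetical.

```python
def two_way_ss(y):
    """Sums of squares for a balanced two-factor factorial with n replicates per cell."""
    a, b, n = len(y), len(y[0]), len(y[0][0])
    grand_total = sum(obs for row in y for cell in row for obs in cell)   # y...
    correction = grand_total ** 2 / (a * b * n)                           # y...^2 / abn

    ss_t = sum(obs ** 2 for row in y for cell in row for obs in cell) - correction
    # Row totals y_i.. squared, divided by bn
    ss_a = sum(sum(sum(cell) for cell in row) ** 2 for row in y) / (b * n) - correction
    # Column totals y_.j. squared, divided by an
    ss_b = (sum(sum(sum(y[i][j]) for i in range(a)) ** 2 for j in range(b))
            / (a * n) - correction)
    # Cell totals y_ij. squared, divided by n, minus the two main effects
    ss_ab = sum(sum(cell) ** 2 for row in y for cell in row) / n - correction - ss_a - ss_b
    ss_e = ss_t - ss_ab - ss_a - ss_b
    return {"SST": ss_t, "SSA": ss_a, "SSB": ss_b, "SSAB": ss_ab, "SSE": ss_e}

# Hypothetical 2x2 design with n = 2 replicates per cell (illustration only)
y = [
    [[1, 2], [3, 4]],   # level 1 of A: cells for B = 1 and B = 2
    [[5, 6], [7, 8]],   # level 2 of A
]
ss = two_way_ss(y)
for name, value in ss.items():
    print(f"{name} = {value}")
```

In this made-up data the effect of B is the same at both levels of A, so the interaction sum of squares comes out as zero.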
Example: To illustrate, a researcher wishes to investigate the effects of outreach activities (EOA) and
socio-economic status (SES) on the social responsibility of teachers. There are two
independent variables: outreach activities and SES. The teachers may be classified into two
groups, one exposed to outreach activities and the other group not exposed to the same. This
independent variable, therefore, has two levels: with exposure and without exposure. The
SES factor has three levels: High SES, Average SES, and Low SES. If the criterion variable
(social responsibility) is of interval type (i.e., the instrument yields score points for the
subjects), then two-way Analysis of Variance may be applied to the data. The factorial design
is known as a 2x3 ANOVA.
Ho: There is no significant effect of the exposure to outreach activities on the social responsibility
of the teachers. (main effect of factor A)
There is no significant effect of the socio-economic status on the social responsibility of the
teachers. (main effect of factor B)
There is no interaction effect of the exposure to outreach activities and socio-economic
status on the social responsibility of teachers. (AxB interaction)
Ha: There is a significant effect of the exposure to outreach activities on the social responsibility
of the teachers.
There is a significant effect of the socio-economic status on the social responsibility of the
teachers.
There is an effect of the interaction of the exposure to outreach activities and socio-economic
status on the social responsibility of teachers.
                                 Exposure to Outreach Activities
Socio-Economic Status            With exposure    Without exposure    Totals
High          ΣX                 15               11                  26
              ΣX²                89               41                  130
Average       ΣX                 51               53                  104
              ΣX²                881              1,005               1,886
Low           ΣX                 35               23                  58
              ΣX²                553              189                 742
Totals        (ΣX)T              101              87                  188
              (ΣX²)T             1,523            1,235               2,758
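\[ SS_{SES} = \left( \sum \frac{(\Sigma X_{A_i})^2}{N_{A_i}} \right) - \frac{(\Sigma X_t)^2}{\Sigma N_t} \]
\[ SS_{SES} = \frac{(26)^2}{6} + \frac{(104)^2}{6} + \frac{(58)^2}{6} - \frac{(188)^2}{18} \]
\[ SS_{SES} \approx 513 \]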
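e. Compute the Mean Squares (MS) by dividing each SS by its df.
\[ MS_{EOA} = \frac{SS_{EOA}}{df_{EOA}} = \frac{10}{1} = 10 \]
\[ MS_{SES} = \frac{SS_{SES}}{df_{SES}} = \frac{513}{2} = 256.5 \]
\[ MS_{A \times B} = \frac{SS_{A \times B}}{df_{A \times B}} = \frac{15}{2} = 7.5 \]
\[ MS_W = \frac{SS_W}{df_W} = \frac{256}{12} = 21.33 \]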
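The mean squares and the F-ratios that follow from them can be checked with a few lines of Python. This sketch simply takes the SS and df values reported in the worked example as given and assumes each F-ratio is the effect mean square divided by MSW; the dictionary layout and names are illustrative.

```python
# Sums of squares and degrees of freedom as reported in the worked example
sources = {
    "EOA": (10, 1),     # exposure to outreach activities
    "SES": (513, 2),    # socio-economic status
    "AxB": (15, 2),     # interaction
}
ss_within, df_within = 256, 12

ms_within = ss_within / df_within
f_ratios = {}
for name, (ss, df) in sources.items():
    ms = ss / df                        # mean square = SS / df
    f_ratios[name] = ms / ms_within     # F-ratio against the within mean square
    print(f"{name}: MS = {ms:.2f}, F = {f_ratios[name]:.4f}")
print(f"MS_W = {ms_within:.2f}")
```

Because MSW is not rounded to 21.33 before dividing, the F-ratio for SES prints as roughly 12.02 rather than 12.025; the decisions are unaffected.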
Step 5: Determine the significance of the computed F-ratios with dF associated with the
numerator and denominator of each F.
Step 7: Interpretation:
• There is no significant effect of the exposure to outreach activities on the social
responsibility of the teachers (F(1, 12) = 0.469, p > 0.05).
• The socio-economic status, however, is found to have a significant effect, with the
Average SES showing the highest level of social responsibility (𝑥̅ = 17.33) among the
three groups (F(2, 12) = 12.025, p < 0.01).
• Lastly, there is no significant interaction effect of the exposure to outreach activities and
socio-economic status on the social responsibility of teachers (F(2, 12) = 0.3516, p > 0.05).