08 Test of Significance
We toss a coin 10 times and get 7 tails. Is this sufficient evidence to conclude that the
coin is biased?
The null hypothesis, H0 , states that "nothing extraordinary is going on". So in this
case
H0: P(T) = 1/2
The alternative hypothesis, HA , states that there is a different chance process that
generates the data. Here we can take
HA: P(T) ≠ 1/2
Hypothesis testing proceeds by collecting data and evaluating whether the data are
compatible with H0 or not (in which case one rejects H0 ).
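To make the coin example concrete, here is a minimal sketch (in Python, not part of the original slides) that computes the exact two-sided binomial probability of a result as lopsided as 7 tails in 10 tosses, assuming H0: P(T) = 1/2:

```python
from math import comb

def binom_pvalue_two_sided(k, n, p=0.5):
    """Exact two-sided p-value: total probability of all outcomes
    at least as far from the expected count n*p as the observed k."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    dist = abs(k - n * p)
    return sum(pr for i, pr in enumerate(probs) if abs(i - n * p) >= dist)

p_value = binom_pvalue_two_sided(7, 10)   # 7 tails in 10 tosses
print(round(p_value, 4))                  # 0.3438
```

An exact p-value of about 34% shows that 7 tails out of 10 is quite compatible with a fair coin, so H0 is not rejected.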
The logic behind testing hypotheses
A different example: A company develops a new drug to lower blood pressure. It tests
it with an experiment involving 1,000 patients.
In this case "nothing extraordinary going on" means that the drug has no effect. So
H0: no change in blood pressure        HA: blood pressure drops
Note that in this case the company would like to reject H0 !
So the logic of testing is typically indirect: One assumes that nothing extraordinary is
happening and then hopes to reject this assumption H0 .
Setting up a test statistic
A test statistic measures how far away the data are from what we would expect if H0
were true.
The most common test statistic is the z-statistic:
z = (observed − expected)/SE
The z-statistic is converted into a p-value: the chance, computed assuming H0 is true, of getting a z at least as extreme as the one observed. The smaller the p-value, the stronger the evidence against H0. Often the criterion for rejecting H0 is a p-value smaller than 5%. Then the result is called statistically significant.
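As a sketch of how the z-statistic and its p-value fit together (the helper name z_test is illustrative; the numbers are the ones from the coin example):

```python
from statistics import NormalDist

def z_test(observed, expected, se):
    """z-statistic and two-sided p-value under the normal approximation."""
    z = (observed - expected) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Coin example: 7 tails observed, 5 expected,
# SE = sqrt(10 * 0.5 * 0.5) for the sum of 10 tosses.
z, p = z_test(7, 5, (10 * 0.25) ** 0.5)
print(round(z, 2), round(p, 3))
```

The normal approximation gives a p-value of about 21%, close to the exact binomial answer.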
p-values measure the evidence against H0
Note that the p-value does not give the probability that H0 is true: H0 is either true or not, so there are no chances involved. Rather, the p-value gives the probability of seeing a statistic as extreme as, or more extreme than, the observed one, assuming H0 is true.
Distinguishing Coke and Pepsi by taste
It has been said that it is difficult to distinguish Coke and Pepsi by taste alone, without
the visual cue of the bottle or can.
In an experiment that I did in a class at Stanford, 10 cups were filled at random with
either Coke or Pepsi. A student volunteer tasted each of the 10 cups and correctly
named the contents of seven. Is this sufficient evidence to conclude that the student can
tell apart Coke and Pepsi?
"Nothing extraordinary is going on" means that the student does not have any special
ability to tell them apart and is just guessing.
To write this down formally we introduce 0/1 labels since we are counting correct
answers: 1 = correct answer, 0 = wrong answer
H0: P(0) = P(1) = 1/2        HA: P(1) > 1/2
This is a one-sided test: the alternative hypothesis we are interested in puts P(1) on one side of 1/2.
Distinguishing Coke and Pepsi by taste
Since we are looking at the sum of ten 0/1 labels, the z-statistic is the same that we had for coin-tossing:

z = (7 − 5)/√(10 × 1/2 × 1/2) ≈ 1.26,

which gives a one-sided p-value of about 10.2%.
Since 10.2% is not smaller than 5%, we don't reject H0: we are not convinced that the student can distinguish Coke and Pepsi.
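The calculation on this slide can be reproduced with a few lines of Python (the normal approximation gives a one-sided p-value of about 10%, in line with the 10.2% quoted above):

```python
from statistics import NormalDist

n, correct = 10, 7
expected = n * 0.5                       # 5 correct expected under H0
se = (n * 0.5 * 0.5) ** 0.5              # SE of the sum of ten 0/1 labels
z = (correct - expected) / se
p_one_sided = 1 - NormalDist().cdf(z)    # one-sided: HA is P(1) > 1/2
print(round(z, 2), round(p_one_sided, 3))
```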
The t-test
The p-value of the t-statistic is computed from Student's t-distribution with n − 1 degrees of freedom. Its fatter tails account for the additional uncertainty introduced by estimating σ by

s = √( Σᵢ(xᵢ − x̄)²/(n − 1) ).
Using the t-test in place of the z-test is only necessary for small samples: n ≤ 20 (say).
In that case it is also better to replace the confidence interval x̄ ± z SE by

x̄ ± t_{n−1} SE
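A small sketch of the difference, using the standard table values 1.960 (the 97.5% point of the normal curve) and 2.776 (the 97.5% point of the t-distribution with 4 degrees of freedom); the sample mean and SE here are made-up numbers:

```python
# Hypothetical sample mean and SE for a sample of n = 5.
xbar, se = 10.0, 0.5
z_crit, t_crit = 1.960, 2.776
ci_z = (xbar - z_crit * se, xbar + z_crit * se)   # xbar +/- z * SE
ci_t = (xbar - t_crit * se, xbar + t_crit * se)   # xbar +/- t_{n-1} * SE
print(ci_z, ci_t)   # the t-interval is wider
```

The wider t-interval reflects the extra uncertainty from estimating σ by s.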
More on testing
Statistically significant does not mean that the effect size is important:
Suppose the sample average shows a lead concentration that is only slightly above the health standard of 15 ppb: say the sample average is 15.05 ppb.
That may not be of practical concern, even though the test may be highly significant: statistical significance convinces us that there is an effect, but it doesn't say how big the effect is.
Reason: A large sample size n makes SE = σ/√n small, so even a small exceedance over the limit by (say) 0.05 ppb may give a statistically significant result.
Therefore it is helpful to complement a test with a confidence interval: In the above case a 95% confidence interval for µ might be [15.02 ppb, 15.08 ppb].
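The lead example can be reproduced under assumed values σ = 0.5 ppb and n = 1000, which are not given on the slide but are chosen so that the resulting 95% interval matches the one quoted above:

```python
from statistics import NormalDist

# Assumed values (not from the slide): sigma = 0.5 ppb, n = 1000.
xbar, sigma, n, limit = 15.05, 0.5, 1000, 15.0
se = sigma / n ** 0.5                      # large n makes SE small
z = (xbar - limit) / se
p = 1 - NormalDist().cdf(z)                # one-sided test of H0: mu = 15
ci = (xbar - 1.96 * se, xbar + 1.96 * se)
print(round(z, 2), round(p, 4), tuple(round(c, 2) for c in ci))
```

The test is highly significant (p below 0.1%), yet the interval shows the exceedance is tiny in practical terms.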
More on testing
We can improve the estimate of SE(p̂2 − p̂1) somewhat by using the fact that p1 = p2 under H0. Since there is a common proportion, we can estimate it by pooling the samples: 0.55 × 1000 = 550 voters approve in the first sample, 870 in the second, so in total there are 1420 approvals out of 2500. So the pooled estimate of p1 = p2 is 1420/2500 = 56.8%.
So we estimate SE(p̂2 − p̂1) by

√( 0.568(1 − 0.568)/1000 + 0.568(1 − 0.568)/1500 ) = 0.02022,

which essentially gives the same answer in this case.
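The pooled computation, using the approval counts from the text:

```python
x1, n1 = 550, 1000     # approvals in the first sample
x2, n2 = 870, 1500     # approvals in the second sample
p_pool = (x1 + x2) / (n1 + n2)                      # 1420/2500 = 0.568
se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
print(p_pool, round(se, 5))                         # 0.568 0.02022
```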
The two-sample z-test
The two-sample z-test is applicable in the same way to the difference of two sample
means in order to test for equality of two population means.
If the two samples are independent, then again

SE(x̄2 − x̄1) = √( (SE(x̄1))² + (SE(x̄2))² ),

and SE(x̄1) = σ1/√n1 is estimated by s1/√n1.
If the sample sizes n1, n2 are not large, then the p-value needs to be computed from the t-distribution.
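A sketch of the two-sample z-statistic for means, with made-up samples (for samples this small, the p-value would in practice come from the t-distribution, as noted above):

```python
from statistics import mean, stdev

def se_diff_means(x, y):
    """Estimated SE of (mean(y) - mean(x)) for two independent samples."""
    return (stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y)) ** 0.5

# Hypothetical measurements:
a = [5.1, 4.9, 5.3, 5.0, 5.2]
b = [5.6, 5.4, 5.9, 5.5]
z = (mean(b) - mean(a)) / se_diff_means(a, b)
print(round(z, 2))
```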
The pooled standard deviation
If one has reason to assume that σ1 = σ2 (or if this has been checked), then one may use the pooled estimate of σ1 = σ2 given by

s²_pooled = ( (n1 − 1)s1² + (n2 − 1)s2² ) / (n1 + n2 − 2).

However, the advantages of using s²_pooled are small, and the analysis rests on the assumption that σ1 = σ2. For these reasons the pooled t-test is usually avoided.
All of the above two-sample tests require that the two samples are independent. They
are also applicable in special situations where the samples are dependent, e.g. to
compare the treatment effect when subjects are randomized into treatment and control
groups.
The paired-difference test
SE(d̄) = σ_d/√n. Estimate σ_d by s_d = 0.55. Then

t = (1.4 − 0)/(0.55/√5) = 5.69
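The t-statistic above can be checked directly:

```python
# Numbers from the text: mean difference 1.4, s_d = 0.55, n = 5 pairs.
d_bar, s_d, n = 1.4, 0.55, 5
se = s_d / n ** 0.5
t = (d_bar - 0) / se
print(round(t, 2))   # 5.69, referred to the t-distribution with n-1 = 4 df
```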
The p-value of this sign test is less significant than that of the paired t-test. This is
because the latter uses more information, namely the size of the differences. On the
other hand, the sign test has the virtue of easy interpretation due to the analogy to coin
tossing.
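A sign test simply counts the positive differences and refers the count to the coin-tossing null distribution. As a hypothetical illustration (the slide does not give the individual differences), suppose all 5 differences were positive:

```python
from math import comb

def sign_test_pvalue(n_pos, n):
    """One-sided sign test: probability of at least n_pos positive
    differences out of n when signs are 50/50 under H0."""
    return sum(comb(n, k) for k in range(n_pos, n + 1)) / 2 ** n

print(sign_test_pvalue(5, 5))   # 0.03125
```

Even this most extreme outcome gives a p-value of about 3.1%, larger (less significant) than the paired t-test's p-value, because the sign test ignores the sizes of the differences.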