
08 Test of Significance

The document discusses hypothesis testing and setting up statistical tests. It provides examples of null and alternative hypotheses for situations like coin tossing, drug testing, and distinguishing between Coke and Pepsi. It explains how to calculate test statistics like the z-statistic and p-values to evaluate the evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against the null. It also discusses when to use a t-test instead of a z-test, and how to interpret statistical significance and effect sizes.


The logic behind testing hypotheses

We toss a coin 10 times and get 7 tails. Is this sufficient evidence to conclude that the
coin is biased?
The null hypothesis, H0 , states that "nothing extraordinary is going on". So in this
case
H0 : P(T) = 1/2
The alternative hypothesis, HA , states that there is a different chance process that
generates the data. Here we can take
HA : P(T) ≠ 1/2
Hypothesis testing proceeds by collecting data and evaluating whether the data are
compatible with H0 or not (in which case one rejects H0 ).
The logic behind testing hypotheses

A different example: A company develops a new drug to lower blood pressure. It tests
it with an experiment involving 1,000 patients.
In this case "nothing extraordinary going on" means that the drug has no effect. So
H0 : no change in blood pressure HA : blood pressure drops
Note that in this case the company would like to reject H0 !
So the logic of testing is typically indirect: One assumes that nothing extraordinary is
happening and then hopes to reject this assumption H0 .
Setting up a test statistic
A test statistic measures how far away the data are from what we would expect if H0
were true.
The most common test statistic is the z-statistic:

z = (observed − expected) / SE

‘Observed’ is a statistic that is appropriate for assessing H0 . In the example of the 10


coin tosses, appropriate statistics would be the number of tails or the percent of tails.
‘Expected’ and SE are the expected value and the SE of this statistic, computed under
the assumption that H0 is true.
In the example: Using the formulas for the sum of 0/1 labels we get
‘expected’ = 10 × 1/2 = 5 and SE = √10 × √(1/2 × 1/2) ≈ 1.58. So

z = (7 − 5) / 1.58 = 1.27
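As a check, the z-statistic for the coin example can be computed directly. A minimal sketch (the 1.27 in the text comes from rounding SE to 1.58):

```python
import math

n = 10          # number of tosses
observed = 7    # observed number of tails
p0 = 0.5        # P(T) under the null hypothesis

expected = n * p0                               # 10 * 1/2 = 5
se = math.sqrt(n) * math.sqrt(p0 * (1 - p0))    # sqrt(10) * 0.5 ≈ 1.58
z = (observed - expected) / se                  # ≈ 1.26 (1.27 in the text, which rounds SE first)
print(round(se, 2), round(z, 2))
```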
p-values measure the evidence against H0
Large values of |z| are evidence against H0 : The larger |z| is, the stronger the evidence.
The strength of the evidence is measured by the
p-value (or: observed significance level):
The p-value is the probability of getting a value of z as extreme or more extreme than
the observed z, assuming H0 is true.
But if H0 is true, then z approximately follows the standard normal curve, according to the central
limit theorem, so the p-value can be computed with the normal approximation:

The smaller the p-value, the stronger the evidence against H0 . Often the criterion for
rejecting H0 is a p-value smaller than 5%. Then the result is called statistically
significant.
p-values measure the evidence against H0

In the example, the two-sided p-value is P(|Z| ≥ 1.27) ≈ 20%.

Note that the p-value does not give the probability that H0 is true, as H0 is either true
or not - there are no chances involved. Rather, it gives the probability of seeing a
statistic as extreme, or more extreme, than the observed one, assuming H0 is true.
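Under H0 the normal approximation gives the two-sided p-value directly. A sketch using only the standard library:

```python
from statistics import NormalDist

z = 1.27  # z-statistic from the coin example
# Two-sided p-value: probability of |Z| >= 1.27 under the standard normal curve
p_value = 2 * (1 - NormalDist().cdf(z))
print(round(p_value, 3))   # ≈ 0.204, i.e. about 20%
```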
Distinguishing Coke and Pepsi by taste
It has been said that it is difficult to distinguish Coke and Pepsi by taste alone, without
the visual cue of the bottle or can.
In an experiment that I did in a class at Stanford, 10 cups were filled at random with
either Coke or Pepsi. A student volunteer tasted each of the 10 cups and correctly
named the contents of seven. Is this sufficient evidence to conclude that the student can
tell apart Coke and Pepsi?
"Nothing extraordinary is going on" means that the student does not have any special
ability to tell them apart and is just guessing.
To write this down formally we introduce 0/1 labels since we are counting correct
answers: 1 = correct answer, 0 = wrong answer
H0 : P(0) = P(1) = 1/2        HA : P(1) > 1/2
This is a one-sided test: the alternative hypothesis we are interested in puts P(1) on
one side of 1/2.
Distinguishing Coke and Pepsi by taste
Since we are looking at the sum of ten 0/1 labels, the z-statistic is the same that we
had for coin-tossing:

z = (observed sum − expected sum) / (SE of sum) = (7 − 5) / 1.58 = 1.27
But since we do a one-sided test instead of a two-sided test, the p-value is only half as
large: P(Z ≥ 1.27) ≈ 10.2%.
Since 10.2% is not smaller than 5%, we don’t reject H0 : We are not convinced that the
student can distinguish Coke and Pepsi.
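Because there are only ten trials, the normal curve is only an approximation here. As a cross-check (not done in the text), the exact binomial tail probability can be computed with the standard library:

```python
import math

# Exact one-sided tail: P(at least 7 correct out of 10) when guessing (p = 1/2).
# This exact computation is not in the text; it is shown as a cross-check on the
# normal approximation, which gave about 10.2%. The exact value is larger mainly
# because the approximation above omits a continuity correction.
p_exact = sum(math.comb(10, k) for k in range(7, 11)) / 2**10
print(round(p_exact, 3))   # 0.172 -> also not below 5%, so the conclusion is the same
```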
Distinguishing Coke and Pepsi

A two-sided alternative might also be appropriate:


HA : P(1) ≠ 1/2
This HA also covers a student who can tell the two drinks apart but tends to mislabel
them, so that P(1) < 1/2. Such a student might get only one correct answer (say).
One has to carefully consider whether the alternative should be one-sided or two-sided,
as the p-value gets doubled in the latter case.
It is not ok to change the alternative afterwards in order to get the p-value below 5%.
The t-test
The health guideline for lead in drinking water is a concentration of not more than
15 parts per billion (ppb).
Five independent samples from a reservoir average 15.6 ppb. Is this sufficient evidence
to conclude that the concentration µ in the reservoir is above the standard of 15 ppb?
Recall our model for measurements:
measurement = µ + measurement error
So it may be that the concentration µ is below 15 ppb, but measurement error results
in an average of 15.6 ppb.
H0 : µ = 15 ppb HA : µ > 15 ppb
We can try a z-test for the average of the measurements:

observed average − expected average 15.6 ppb − 15 ppb


z = =
SE of average SE of average
since the measurement error has expected value zero.
The t-test
SE of average = σ/√n, but the standard deviation σ of the measurement error is
unknown.
We can estimate σ by s, the sample standard deviation of the measurements. However:
If we estimate σ and n is small (n ≤ 20), then the normal curve is not a good enough
approximation to the distribution of the z-statistic. Rather, an appropriate
approximation is Student’s t-distribution with n − 1 degrees of freedom:
The t-test

The fatter tails account for the additional uncertainty introduced by estimating σ by
s = √( Σⁿᵢ₌₁ (xᵢ − x̄)² / (n − 1) ).

Using the t-test in place of the z-test is only necessary for small samples: n ≤ 20 (say).
In that case it is also better to replace the confidence interval x̄ ± z SE by

x̄ ± t(n−1) SE
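A sketch of the t-statistic for the lead example. The text gives only the average of the five measurements (15.6 ppb); the individual values below are made up for illustration:

```python
import math
from statistics import mean, stdev

# Hypothetical measurements (ppb) averaging 15.6 -- the individual values are
# invented for illustration; the text reports only the average.
x = [15.2, 15.9, 15.6, 15.3, 16.0]
n = len(x)
x_bar = mean(x)            # 15.6
s = stdev(x)               # sample SD, the estimate of sigma
se = s / math.sqrt(n)      # estimated SE of the average
t = (x_bar - 15) / se      # t-statistic with n - 1 = 4 degrees of freedom
print(round(t, 2))
```

With 4 degrees of freedom the 5% one-sided critical value of Student's t-distribution is about 2.13, so these (made-up) measurements would give a statistically significant result.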
More on testing

• Statistically significant does not mean that the effect size is important:
Suppose the sample average shows a lead concentration that is only slightly above
the health standard of 15 ppb: say the sample average is 15.05 ppb.
That may not be of practical concern, even though the test may be highly
significant: Statistical significance convinces us that there is an effect, but it
doesn’t say how big the effect is.
Reason: A large sample size n makes SE = σ/√n small, so even a small exceedance
over the limit by (say) 0.05 ppb may give a statistically significant result.
Therefore it is helpful to complement a test with a confidence interval: In the
above case a 95% confidence interval for µ might be [15.02 ppb, 15.08 ppb].
More on testing

• There is a general connection between confidence intervals and tests:


A 95% confidence interval contains all values for the null hypothesis that will not
be rejected by a two-sided test at a 5% significance level.
(A 5% significance level means that the threshold for the p-value is 5%).
• There are two ways that a test can result in a wrong decision:
H0 is true, but was erroneously rejected → Type I error (‘false positive’)
H0 is false, but we fail to reject it → Type II error
Rejecting H0 if the p-value is smaller than 5% means P(type I error) ≤ 5%
The two-sample z-test
Last month, the President’s approval rating in a sample of 1,000 likely voters was 55%.
This month, a poll of 1,500 likely voters resulted in a rating of 58%. Is this sufficient
evidence to conclude that the rating has changed?
We want to assess whether
p1 =proportion of all likely voters approving last month
is equal to
p2 =proportion of all likely voters approving this month
"nothing unusual is going on" means p1 = p2 . It’s common to look at the difference
p2 − p1 instead:
H0 : p2 − p1 = 0        HA : p2 − p1 ≠ 0
p1 is estimated by p̂1 = 55%, p2 by p̂2 = 58%. The central limit theorem applies to the
difference p̂2 − p̂1 just as it does to p̂1 and p̂2 . So we can use a z-test:
The two-sample z-test
We can use a z-test for the difference p̂2 − p̂1 :

z = (observed difference − expected difference) / (SE of difference)
  = ((p̂2 − p̂1) − (p2 − p1)) / (SE of difference)

An important fact is that if p̂1 and p̂2 are independent, then
SE(p̂2 − p̂1) = √( (SE(p̂1))² + (SE(p̂2))² ). So

z = ((p̂2 − p̂1) − 0) / √( p̂1(1 − p̂1)/1000 + p̂2(1 − p̂2)/1500 ) = 0.03 / 0.0202 = 1.48
The two-sample z-test

The confidence interval for p2 − p1 is

(p̂2 − p̂1 ) ± z SE(p̂2 − p̂1 ) = [−1%, 7%] when z = 2

We can improve the estimate of SE(p̂2 − p̂1) somewhat by using the fact that p1 = p2
on H0. Since there is a common proportion, we can estimate it by pooling the samples:
0.55 × 1000 = 550 voters approve in the first sample and 0.58 × 1500 = 870 in the
second, so in total there are 1,420 approvals out of 2,500. So the pooled estimate of
p1 = p2 is 1420/2500 = 56.8%.
So we estimate SE(p̂2 − p̂1) by √( 0.568(1 − 0.568)/1000 + 0.568(1 − 0.568)/1500 ) = 0.0202,
which essentially gives the same answer in this case.
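Both SE estimates (unpooled and pooled) and the resulting z-statistics can be checked directly. A sketch:

```python
import math

n1, n2 = 1000, 1500
p1_hat, p2_hat = 0.55, 0.58

# Unpooled SE: plug each sample's own proportion into its SE term
se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
z = (p2_hat - p1_hat) / se

# Pooled SE: under H0 both proportions equal the pooled estimate 1420/2500
p_pool = (p1_hat * n1 + p2_hat * n2) / (n1 + n2)    # 0.568
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z_pooled = (p2_hat - p1_hat) / se_pooled

print(round(z, 2), round(z_pooled, 2))   # essentially the same answer
```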
The two-sample z-test

The two-sample z-test is applicable in the same way to the difference of two sample
means in order to test for equality of two population means.
If the two samples are independent, then again
SE(x̄2 − x̄1) = √( (SE(x̄1))² + (SE(x̄2))² )
and SE(x̄1) = σ1/√n1 is estimated by s1/√n1.

If the sample sizes n1, n2 are not large, then the p-value needs to be computed from the
t-distribution.
The pooled standard deviation

If one has reason to assume that σ1 = σ2 (or if this has been checked), then one may
use the pooled estimate for σ1 = σ2 given by

s²pooled = ( (n1 − 1)s1² + (n2 − 1)s2² ) / (n1 + n2 − 2)

However, the advantages of using s²pooled are small, and the analysis rests on the
assumption that σ1 = σ2. For these reasons the pooled t-test is usually avoided.
All of the above two-sample tests require that the two samples are independent. They
are also applicable in special situations where the samples are dependent, e.g. to
compare the treatment effect when subjects are randomized into treatment and control
groups.
The paired-difference test

Do husbands tend to be older than their wives?


The ages of five couples:
Husband’s age   Wife’s age   Age difference
43              41           2
71              70           1
32              31           1
68              66           2
27              26           1
The two-sample t-test is not applicable since the two samples are not independent.
Even if they were independent, the small differences in ages would not be significant,
since the standard deviations of both the husbands’ ages and the wives’ ages are large.
The paired-difference test
Since we have paired data, we can simply analyze the differences obtained from each
pair with a regular t-test, which in this context of matched pairs is called
paired t-test:
H0 : the population average of the age differences is zero

t = (d̄ − 0) / SE(d̄), where di is the age difference of the ith couple.

SE(d̄) = σd/√n. Estimate σd by sd = 0.55. Then t = (1.4 − 0) / (0.55/√5) = 5.69

The independence assumption is in the sampling of the couples.
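The paired t-statistic can be computed from the differences alone. A sketch (the text's 5.69 comes from rounding sd to 0.55):

```python
import math
from statistics import mean, stdev

diffs = [2, 1, 1, 2, 1]     # husband's age minus wife's age, per couple
n = len(diffs)
d_bar = mean(diffs)          # 1.4
s_d = stdev(diffs)           # sample SD of the differences, ≈ 0.55
t = (d_bar - 0) / (s_d / math.sqrt(n))
print(round(t, 2))   # ≈ 5.72 (5.69 in the text, which rounds s_d first)
```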


The sign test
What if we didn’t know the age difference di but only whether the husband was older or not?
We can test
H0 : half the husbands in the population are older than their wives
using 0/1 labels and a z-test, just as we tested whether a coin is fair:
z = (sum of 1s − n/2) / (SE of sum) = (5 − 5/2) / (√5 × 1/2) = 2.24, since σ = 1/2 on H0.

The p-value of this sign test is less significant than that of the paired t-test. This is
because the latter uses more information, namely the size of the differences. On the
other hand, the sign test has the virtue of easy interpretation due to the analogy to coin
tossing.
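Since the sign test is just the coin-tossing z-test applied to the 0/1 labels, it reduces to a few lines. A sketch:

```python
import math

n = 5        # number of couples
ones = 5     # couples where the husband is older
# Under H0 the labels behave like fair coin tosses: expected sum is n/2,
# and the SD of a single 0/1 label is 1/2.
se_sum = math.sqrt(n) * 0.5
z = (ones - n / 2) / se_sum
print(round(z, 2))   # ≈ 2.24
```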
