Null Hypothesis Vs
The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically
involves measuring one or more variables for a sample and computing descriptive statistics for
that sample. In general, however, the researcher’s goal is not to draw conclusions about that
sample but to draw conclusions about the population that the sample was selected from. Thus
researchers must use sample statistics to draw conclusions about the corresponding values in the
population. These corresponding values in the population are called parameters. Imagine, for
example, that a researcher measures the number of depressive symptoms exhibited by each of
50 clinically depressed adults and computes the mean number of symptoms. The researcher
probably wants to use this sample statistic (the mean number of symptoms for the sample) to
draw conclusions about the corresponding population parameter (the mean number of symptoms
for clinically depressed adults). Unfortunately, sample statistics are not perfect estimates of their
corresponding population parameters. This is because there is a certain amount of random
variability in any statistic from sample to sample. The mean number of depressive symptoms
might be 8.73 in one sample of clinically depressed adults, 6.45 in a second sample, and 9.44 in
a third—even though these samples are selected randomly from the same population. Similarly,
the correlation (Pearson’s r) between two variables might be +.24 in one sample, −.04 in a
second sample, and +.15 in a third—again, even though these samples are selected randomly
from the same population. This random variability in a statistic from sample to sample is called
sampling error. (Note that the term error here refers to random variability and does not imply
that anyone has made a mistake. No one “commits a sampling error.”) One implication of this is
that when there is a statistical relationship in a sample, it is not always clear that there is a
statistical relationship in the population. A small difference between two group means in a
sample might indicate that there is a small difference between the two group means in the
population. But it could also be that there is no difference between the means in the population
and that the difference in the sample is just a matter of sampling error. Similarly, a Pearson’s r
value of −.29 in a sample might mean that there is a negative relationship in the population. But
it could also be that there is no relationship in the population and that the relationship in the
sample is just a matter of sampling error. In fact, any statistical relationship in a sample can be interpreted in two ways:

• There is a relationship in the population, and the relationship in the sample reflects this.
• There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.
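The idea of sampling error can be made concrete with a short simulation. The sketch below uses invented numbers (it is not real clinical data): it builds one simulated population of symptom scores and draws three random samples of 50 from it, showing that the sample means differ even though the population never changes.

```python
import random

random.seed(42)

# Simulated population of depressive-symptom scores (invented numbers;
# the true population mean is 8.0 by construction).
population = [random.gauss(8.0, 3.0) for _ in range(100_000)]

def sample_mean(pop, n):
    """Mean of one random sample of size n from the population."""
    return sum(random.sample(pop, n)) / n

# Three random samples of 50 adults from the SAME population yield
# three different sample means -- that spread is sampling error.
means = [round(sample_mean(population, 50), 2) for _ in range(3)]
print(means)
```

Each run of the loop plays the role of one study; the variation among the three printed means is exactly the sample-to-sample variability the text describes.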
The Logic of Null Hypothesis Testing
Again, every statistical relationship in a sample can be interpreted in either of these two ways: It
might have occurred by chance, or it might reflect a relationship in the population. So
researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

• Assume for the moment that the null hypothesis is true: there is no relationship between the variables in the population.
• Determine how likely the sample relationship would be if the null hypothesis were true.
• If the sample relationship would be extremely unlikely, then reject the null hypothesis in favour of the alternative hypothesis. If it would not be extremely unlikely, then retain the null hypothesis.

Following this logic, we can begin to
understand why Mehl and his colleagues concluded that there is no difference in talkativeness
between women and men in the population. In essence, they asked the following question: “If
there were no difference in the population, how likely is it that we would find a small difference
of d = 0.06 in our sample?” Their answer to this question was that this sample relationship
would be fairly likely if the null hypothesis were true. Therefore, they retained the null
hypothesis—concluding that there is no evidence of a sex difference in the population. We can
also see why Kanner and his colleagues concluded that there is a correlation between hassles
and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it
that we would find a strong correlation of +.60 in our sample?” Their answer to this question
was that this sample relationship would be fairly unlikely if the null hypothesis were true.
Therefore, they rejected the null hypothesis in favour of the alternative hypothesis—concluding
that there is a positive correlation between these variables in the population.
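The three steps above can be sketched as a simple permutation test, one common way of estimating how likely a sample difference would be if the null hypothesis were true. The word counts below are invented for illustration (they are not Mehl and colleagues' data); because the observed difference between the two groups is tiny, the estimated p value comes out well above .05 and the null hypothesis is retained.

```python
import random

random.seed(0)

# Hypothetical daily word counts for two small groups (made-up numbers).
group_a = [16200, 15400, 17100, 15900, 16600, 15800]
group_b = [15900, 16800, 15600, 16300, 16100, 16500]

observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

# Step 1: assume the null hypothesis -- the group labels are arbitrary.
# Step 2: ask how often shuffled labels produce a difference at least
# this large (the permutation estimate of the p value).
pooled = group_a + group_b
count = 0
n_perm = 10_000
for _ in range(n_perm):
    random.shuffle(pooled)
    a, b = pooled[:6], pooled[6:]
    if abs(sum(a) / 6 - sum(b) / 6) >= observed:
        count += 1

p_value = count / n_perm

# Step 3: reject the null hypothesis only if the sample result would be
# extremely unlikely under it; otherwise retain it.
print(round(p_value, 3))
```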
A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null
hypothesis were true. This probability is called the p value. A low p value means that the sample
result would be unlikely if the null hypothesis were true and leads to the rejection of the null
hypothesis. A high p value means that the sample result would be likely if the null hypothesis
were true and leads to the retention of the null hypothesis. But how low must the p value be
before the sample result is considered unlikely enough to reject the null hypothesis? In null
hypothesis testing, this criterion is called α (alpha) and is almost always set to .05. If there is
less than a 5% chance of a result as extreme as the sample result if the null hypothesis were true,
then the null hypothesis is rejected. When this happens, the result is said to be statistically
significant. If there is greater than a 5% chance of a result as extreme as the sample result when
the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true—only that there is not currently enough evidence to reject it. Researchers often use the expression “fail to reject the null
hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept
the null hypothesis.”
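Once the p value is in hand, the decision rule itself is mechanical. A minimal sketch using the conventional α = .05 (the function name and wording are ours, for illustration):

```python
ALPHA = 0.05  # the conventional criterion for statistical significance

def decide(p_value, alpha=ALPHA):
    """Apply the standard null hypothesis testing decision rule."""
    if p_value < alpha:
        return "reject the null hypothesis (statistically significant)"
    return "fail to reject the null hypothesis"

print(decide(0.02))  # p below .05 -> reject
print(decide(0.37))  # p above .05 -> fail to reject
```

Note that the rule never outputs “accept the null hypothesis”: a high p value only means there is not enough evidence to reject it.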
The p value is one of the most misunderstood quantities in psychological research (Cohen, 1994). Even professional researchers misinterpret it, and it is not unusual for such
misinterpretations to appear in statistics textbooks! The most common misinterpretation is that
the p value is the probability that the null hypothesis is true—that the sample result occurred by
chance. For example, a misguided researcher might say that because the p value is .02, there is
only a 2% chance that the result is due to chance and a 98% chance that it reflects a real
relationship in the population. But this is incorrect. The p value is really the probability of a
result at least as extreme as the sample result if the null hypothesis were true. So a p value of .02
means that if the null hypothesis were true, a sample result this extreme would occur only 2% of
the time. You can avoid this misunderstanding by remembering that the p value is not the
probability that any particular hypothesis is true or false. Instead, it is the probability of
obtaining the sample result if the null hypothesis were true.
Rejecting the null hypothesis when it is true is called a Type I error. This error means that we
have concluded that there is a relationship in the population when in fact there is not. Type I
errors occur because even when there is no relationship in the population, sampling error alone
will occasionally produce an extreme result. In fact, when the null hypothesis is true and α
is .05, we will mistakenly reject the null hypothesis 5% of the time. (This possibility is why α is
sometimes referred to as the “Type I error rate.”) Retaining the null hypothesis when it is false is
called a Type II error. This error means that we have concluded that there is no relationship in
the population when in fact there is. In practice, Type II errors occur primarily because the
research design lacks adequate statistical power to detect the relationship (e.g., the sample is too
small). We will have more to say about statistical power shortly.