Quantitative Methods: Module 2: Scientific Method
The empirical cycle captures the process of coming up with hypotheses about how stuff
works and testing these hypotheses against empirical data in a systematic and rigorous
way. It characterizes the hypothetico-deductive approach to science. Being familiar
with the five different phases of this cycle will really help you to keep the big picture in
mind, especially when we get into the specifics of things like experimental design and
sampling.
So we'll start with the observation phase, obviously. This is where the magic happens. It's
where an observation, again obviously, sparks the idea for a new research hypothesis.
We might observe an intriguing pattern, an unexpected event; anything that we find
interesting and that we want to explain. How we make the observation really doesn't
matter. It can be a personal observation, an experience that someone else shares with
us, even an imaginary observation that takes place entirely in your head.
Of course observations generally come from previous research findings, which are
systematically obtained, but in principle anything goes.
Ok, so let's take as an example a personal observation of mine: I have a horrible
mother-in-law. I've talked to some friends and they also complain about their mother-
in-law. So this looks like an interesting pattern to me. A pattern between type of person
and likeability! Ok, so the observation phase is about observing a relation in one or
more specific instances.
In the induction phase this relation, observed in specific instances, is turned into a
general rule. That's what induction means: taking a statement that is true in specific
cases and inferring that the statement is true in all cases, always. For example, from the
observation that my friends and I have horrible mothers-in-law I can induce the general
rule that all mothers-in-law are horrible.
Of course this rule, or hypothesis, is not necessarily true, it could be wrong. That's
what the rest of the cycle is about: Testing our hypothesis. In the induction phase
inductive reasoning is used to transform specific observations into a general rule or
hypothesis.
In the deduction phase we deduce that the relation specified in the general rule should
also hold in new, specific instances. From our hypothesis we deduce an explicit
expectation or prediction about new observations. For example, if all mothers-in-law
are indeed horrible, then if I ask ten of my colleagues to rate their mother-in-law as
either 'likable', 'neutral' or 'horrible', then they should all choose the category
'horrible'.
Now in order to make such a prediction, we need to determine the research setup. We
need to decide on a definition of the relevant concepts and on how we will measure them.
In the testing phase the hypothesis is actually tested by collecting new data and
comparing them to the prediction. Now this almost always requires statistical
processing, using descriptive statistics to summarize the observations for us, and
inferential statistics to help us decide if the prediction was correct.
In our simple example we don't need statistics. Let's say that eight out of ten colleagues
rate their mother-in-law as horrible but two rate her as neutral. We can see right away
that our prediction didn’t come true, it was refuted! All ten mothers-in-law should have
been rated as horrible. So in the testing phase new empirical data is collected and -
with the aid of statistics - the prediction is confirmed or disconfirmed.
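To make that concrete, here is a minimal sketch in Python of what this testing step could look like for the mother-in-law example. The ratings are made up; the code simply tallies them (descriptive statistics) and checks the prediction that every single rating is 'horrible'.

```python
from collections import Counter

# Hypothetical ratings collected from ten colleagues
ratings = ["horrible", "horrible", "neutral", "horrible", "horrible",
           "horrible", "neutral", "horrible", "horrible", "horrible"]

# Descriptive statistics: summarize the observations
print(Counter(ratings))  # Counter({'horrible': 8, 'neutral': 2})

# Compare the data to the prediction deduced from the hypothesis
# "all mothers-in-law are horrible": every rating should be 'horrible'
prediction_confirmed = all(rating == "horrible" for rating in ratings)
print("Prediction confirmed:", prediction_confirmed)  # False: the prediction is refuted
```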
In the evaluation phase we interpret the results in terms of our hypothesis. If the
prediction was confirmed this only provides provisional support for a hypothesis. It
doesn’t mean we've definitively proven the hypothesis. Because it's always possible
that in the future we will find somebody who just loves their mother-in-law.
In our example the prediction was actually refuted. This doesn't mean we should reject
our hypothesis outright. In many cases there are plausible explanations for our failure to
confirm.
Now if these explanations have to do with the research setup, the hypothesis is
preserved and investigated again, but with a better research design. In other cases the
hypothesis is adjusted, based on the results. The hypothesis is rejected and discarded
only in very rare cases. In the evaluation phase the results are interpreted in terms of
the hypothesis, which is provisionally supported, adjusted or rejected.
The observations collected in the testing phase can serve as new, specific observations
in the observation phase. This is why the process is described as a cycle. New
empirical data obtained in the testing phase give rise to new insights that lead to a new
run-through. And that's what empirical science comes down to: We try to home in on
the best hypotheses and build our understanding of the world as we go through the
cycle again and again.
We're going to take a look at how we should interpret results that confirm or disconfirm
our predictions and whether we should retain or reject our hypothesis accordingly.
Let's consider the hypothesis that all mothers-in-law are horrible. I formulated this
hypothesis based on personal observations.
Ok, let's look at confirmation first. Suppose all ten colleagues rated their mother-in-law
as horrible. The prediction is confirmed, but this doesn't mean the hypothesis has been
proven! It's easily conceivable that we will be proven wrong in the future. If we were to
repeat the study we might find a person that simply adores their mother-in-law.
The point is that confirmation is never conclusive. The only thing we can say is that our
hypothesis is provisionally supported. The more support from different studies, the
more credence we afford a hypothesis, but we can never prove it. Let me repeat that:
No scientific empirical statement can ever be proven once and for all. The best we can
do is produce overwhelming support for a hypothesis.
Ok, now let's turn to disconfirmation. Suppose only eight out of ten colleagues rate
their mother-in-law as horrible and two actually rate her as neutral. Obviously in this
case our prediction turned out to be false. Logically speaking, empirical findings that
contradict the hypothesis should lead to its rejection. If our hypothesis states that all
swans are white and we then find black swans in Australia, we can very conclusively
reject our hypothesis.
In practice however, especially in the social sciences, there are often plausible
alternative explanations for our failure to confirm. These are in fact so easy to find that
we rarely reject the hypothesis outright. In many cases these explanations have to do
with methodological issues. The research design or the measurement instrument wasn't
appropriate, maybe relevant background variables weren't controlled for; etcetera,
etcetera.
Coming back to our mother-in-law example: I could have made a procedural error
while collecting responses from the two colleagues who rated their mother-in-law as
neutral. Maybe I forgot to tell them their responses were confidential, making them too uncomfortable to choose the most negative category.
If there are plausible methodological explanations for the failure to confirm we preserve
the hypothesis and instead choose to reject the auxiliary, implicit assumptions
concerning the research design and the measurement. We investigate the original
hypothesis again, only with a better research setup.
Sometimes results do give rise to a modification of the hypothesis. Suppose that the
eight colleagues who did have horrible mothers-in-law were all women and the other
two were men. Perhaps all mothers-in-law are indeed horrible, but only to their
daughters-in-law!
If we alter the hypothesis slightly by adding additional clauses (it only applies to
daughters-in-law), then strictly speaking we are rejecting the
original hypothesis, sort of. Of course the new hypothesis is essentially the same as the
original one, just not as general and therefore not as strong. An outright rejection or
radical adjustment of a hypothesis is actually very rare in the social sciences. Progress
is made in very small increments not giant leaps.
We follow the empirical cycle to come up with hypotheses and to test and evaluate
them against observations. But once the results are in, a confirmation doesn't mean a
hypothesis has been proven and a disconfirmation doesn't automatically mean we
reject it.
So how do we decide whether we find a study convincing? Well, there are two main
criteria for evaluation: reliability and validity.
Reliability is very closely related to replicability. A study is replicable if independent
researchers are in principle able to repeat it. A research finding is reliable if we actually
repeat the study and then find consistent results.
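One simple way to picture reliability as consistency over repetitions is a test-retest check: measure the same people twice and see how strongly the two sets of scores agree. Here is a small simulated sketch in Python; the data and the error sizes are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 'true' depression levels for 30 participants
true_scores = rng.normal(50, 10, size=30)

# Two administrations of the same questionnaire, each with some measurement error
first_measurement = true_scores + rng.normal(0, 3, size=30)
second_measurement = true_scores + rng.normal(0, 3, size=30)

# A high correlation between the repetitions indicates consistent, reliable scores
test_retest_r = np.corrcoef(first_measurement, second_measurement)[0, 1]
print(f"Test-retest correlation: {test_retest_r:.2f}")
```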
Validity is more complicated. A study is valid if the conclusion about the hypothesized
relation between properties accurately reflects reality. In short, a study is valid if the
conclusion based on the results is 'true'.
Construct validity is an important prerequisite for internal and external validity. A study
has high construct validity if the properties or constructs that appear in the hypothesis
are measured and manipulated accurately. In other words, our methods have high
construct validity if they actually measure and manipulate the properties that we
intended them to.
Suppose I accidentally measured an entirely different construct with, for example, my
depression questionnaire. What if it measures feelings of social exclusion instead of
depression? Or suppose taking care of the cat didn’t affect loneliness at all, but instead
increased feelings of responsibility and self-worth. What if loneliness remained the
same?
Well, then the results only seem to support the hypothesis that loneliness caused
depression, when in reality we've manipulated a different cause and measured a
different effect. Developing accurate measurement and manipulation methods is a topic in its own right.
For now I'll move on to internal validity. Internal validity is relevant when our
hypothesis describes a causal relationship. A study is internally valid if the observed
effect is actually due to the hypothesized cause.
Let's assume our measurement and manipulation methods are valid for a second. Can
we conclude depression went down because the elderly felt less lonely? Well... maybe
something else caused the decrease in depression. For example, if the study started in
the winter and ended in spring, then maybe the change in season lowered depression.
Or maybe it wasn't the cat's company but the increased physical exercise from
cleaning the litter box and feeding bowl?
Alternative explanations like these threaten internal validity. If there's a plausible
alternative explanation, internal validity is low. Now, there are many different types of
threats to internal validity that I will discuss in much more detail in later videos.
OK, let's look at external validity. A study is externally valid if the hypothesized
relationship, supported by our findings, also holds in other settings and other groups. In
other words, if the results generalize to different people, groups, environments and
times.
Let’s return to our example. Will taking care of a cat decrease depression in teenagers
and middle-aged people too? Will the effect be the same for men and women? What
about people from different cultures? Will a dog be as effective as a cat?
Of course this is hard to say based on the results of only elderly people and cats. If we
had included younger people, people from different cultural backgrounds and used
other animals, we might have been more confident about this study's external validity.
I'll come back to external validity, and how it can be threatened, when we come to the
subject of sampling.
So to summarize: Construct validity relates to whether our methods actually reflect the
properties we intended to manipulate and measure. Internal validity relates to whether
our hypothesized cause is the actual cause for the observed effect. Internal validity is
threatened by alternative explanations. External validity or generalizability relates to
whether the hypothesized relation holds in other settings.
The most interesting hypotheses are the ones that describe a causal relationship. If we
know what causes an effect we can predict it, influence it and understand it better.
So how do we identify a causal relationship? Well, it was David Hume, with a little
help from John Stuart Mill, who first listed the criteria that we still use today. These are
the four essential criteria: the cause and effect need to be connected, the cause needs to precede the effect, the cause and effect need to occur together consistently, and alternative explanations need to be ruled out.
Ok, so let me illustrate these criteria with an example. Suppose I hypothesize that
loneliness causes feelings of depression. I give some lonely, depressed people a cat to
take care of. Now they're no longer lonely. If my hypothesis is correct, then we would
expect this to lower their depression.
The cause and effect, loneliness and depression, are in close proximity, they happen in
the same persons, and fairly close together in time, so we can show they're connected.
The cause, a decrease in loneliness, needs to happen before the effect, a decrease in
depression. We can show this because we can control the presence of the cause,
loneliness.
The cause and effect should occur together consistently. This means that less loneliness
should go together with lower depression. I could find a comparison group of lonely,
depressed people who do not get a cat. Since the cause is absent (their loneliness doesn't change), there should be no effect.
This was all easy; the real difficulty lies in the last criterion, excluding any alternative
explanations, other possible causes. Let's look for an alternative explanation in our
example. Maybe the increased physical activity, required to take care of a cat, actually
caused lower depression, instead of the reduction in loneliness. Alternative
explanations form threats to the internal validity of a research study. An important part
of methodology is developing and choosing research designs that minimize these
threats.
Ok, there is one more point I want to make about causation. Causation requires
correlation; the cause and effect have to occur consistently. But correlation doesn't
imply causation!
I'll give you an example: If we consistently observe aggressive behavior after children
play a violent videogame, this doesn’t mean the game caused the aggressive behavior.
It could be that aggressive children seek out more aggressive stimuli, reversing the
causal direction. Or maybe children whose parents allow them to play violent games
aren’t supervised as closely. Maybe they are just as aggressive as other children, they
just feel less inhibited to show this aggressive behavior. So remember: correlation does
not imply causation!
Fortunately there is a way to eliminate this alternative explanation of natural change that
we refer to as maturation. We can introduce a control group that is measured at the
same times but is not exposed to the hypothesized cause. Both groups should 'mature'
or change to the same degree. Any difference between the groups can now be
attributed to the hypothesized cause and not natural change.
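As a rough sketch of that logic, suppose we had mean depression scores before and after the study for both groups; the numbers below are made up. Comparing the change in the cat group to the change in the control group separates natural change from the effect of the hypothesized cause.

```python
# Hypothetical mean depression scores (higher = more depressed)
cat_pre, cat_post = 30.0, 18.0          # group that received a cat
control_pre, control_post = 30.0, 26.0  # measured at the same times, no cat

# Both groups may change naturally over time (maturation)
cat_change = cat_post - cat_pre              # -12
control_change = control_post - control_pre  # -4, natural change only

# The difference between the two changes is attributed to the hypothesized cause
effect_estimate = cat_change - control_change
print("Estimated effect of cat companionship:", effect_estimate)  # -8.0
```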
The threat of maturation is eliminated, but unfortunately a study that includes a control
group is still vulnerable to other threats to internal validity. This brings us to the threat
of selection. Selection refers to any systematic difference in subject characteristics
between groups, other than the manipulated cause.
Suppose in my study I included a control group that didn’t take care of a cat. What if
assignment of elderly participants to the groups was based on mobility? Suppose people
who weren’t able to bend over and clean the litter box were put in the control group.
Now suppose the experimental group was in fact less depressed than the control group.
This might be caused, not by the company of the cat, but because the people in the
experimental group were just more physically fit.
A solution to this threat is to use a method of assignment to groups that ensures that a
systematic difference on subject characteristics is highly unlikely. This method is called
randomization. I’ll discuss it in much more detail when we cover research designs.
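Just to give a flavor of the mechanics: randomization can be as simple as shuffling the list of participants and splitting it in two, as in this small Python sketch (the participant labels are hypothetical).

```python
import random

random.seed(42)  # only to make this example reproducible

participants = [f"participant_{i}" for i in range(1, 21)]  # hypothetical sample

# Random assignment: shuffle and split, so that systematic differences in subject
# characteristics (such as mobility) between the groups are highly unlikely
random.shuffle(participants)
experimental_group = participants[:10]  # will take care of a cat
control_group = participants[10:]       # no cat

print(experimental_group)
print(control_group)
```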
The last threat to internal validity related to participants is the combined threat of
maturation and selection. We call this a selection by maturation threat. This happens
when groups systematically differ in their rate of maturation. For example, suppose the more open-minded elderly people ended up in the experimental cat-therapy group and the more conservative people in the control group. Participants were selected so that both groups had a similar level of depression at the
start of the study. But what if we find that the experimental group shows lower
depression? Well perhaps the lower rate of depression in the experimental, cat-therapy
group is simply due to the fact that open-minded people tend to naturally “get over”
their depressive feelings more quickly than conservative people do. Just like selection
on its own, the threat of selection by maturation can be eliminated by randomized
assignment to groups.
So you see that the research design we choose - for example adding a control group -
and the methods we use - for example random assignment - can help to minimize
threats to internal validity.
Another category of threats to internal validity is associated with the instruments that
are used to measure and manipulate the constructs in our hypothesis. The threats of
low construct validity, instrumentation and testing fall into this category.
I'll start with low construct validity. Construct validity is low if our instruments contain
a systematic bias or measure another construct or property entirely. In this case there’s
not much point in further considering the internal validity of a study. As I discussed in
an earlier video, construct validity is a prerequisite for internal validity. If our
measurement instruments or manipulation methods are of poor quality, then we can’t
infer anything about the relation between the hypothesized constructs.
The second threat that relates to measurement methods is instrumentation. The threat
of instrumentation occurs when an instrument is changed during the course of the
study. Suppose I use a self-report questionnaire to measure depression at the start of the
study, but I switch to a different questionnaire, or maybe to an entirely different measurement method, halfway through the study.
Of course it seems rather stupid to change your methods or instruments halfway, but
sometimes a researcher has to depend on others for data, for example when using tests
that are constructed by national testing agencies or polling agencies. A good example is
the use of the standardized diagnostic tool called the DSM. This 'Diagnostic and
Statistical Manual' is used to classify mental disorders such as depression and autism, and
is updated every ten to fifteen years.
Now you can imagine the problems this can cause, for example for a researcher who is
doing a long-term study on schizophrenia. In the recently updated DSM, several
subtypes of schizophrenia are no longer recognized. Now if we see a decline in
schizophrenia in the coming years, is this a 'real' effect or is it due to the change in the
measurement system?
The last threat I want to discuss here is testing, also referred to as sensitization.
Administering a test or measurement procedure can affect people’s behavior. A testing
threat occurs if this sensitizing effect of measuring provides an alternative explanation
for our results. For example, taking the depression pre-test, at the start of the study,
might alert people to their feelings of depression. This might cause them to be more
proactive about improving their emotional state, for example by being more social. Of
course this threat can be eliminated by introducing a control group; both groups will be
similarly affected by the testing effect. Their depression scores will both go down, but
hopefully more so in the 'cat-companionship' group.
Adding a control group is not always enough though. In some cases there’s a risk that
the pre-test sensitizes people in the control group differently than people in the
experimental group. For example, in our cat-study, the pre-test in combination with
getting a cat could alert people in the experimental 'cat'-group to the purpose of the
study. They might report lower depression on the post-test, not because they’re less
depressed, but to ensure the study seems successful, so they can keep the cat!
Suppose people in the control group don't figure out the purpose of the study, because
they don't get a cat. They're not sensitized and not motivated to change their scores on
the post-test. This difference in sensitization provides an alternative explanation. One
solution is to add an experimental and control group that weren’t given a pre-test. I’ll
discuss this solution in more detail when we consider different experimental designs.
If the researcher’s expectations have a biasing effect, we call this threat to internal
validity an experimenter expectancy effect. Experimenter expectancy refers to an
unconscious change in a researcher's behavior, caused by expectations about the
outcome, that influences a participant’s responses. Of course this becomes an even
bigger problem if a researcher unconsciously treats people in the control group
differently from people in the experimental group.
Fortunately there is a solution. If the experimenter who interacts with the participants
doesn’t know what behavior is expected of them, the experimenter will not be able to
unconsciously influence participants. We call this an experimenter-blind design.
Of course the participant can also have expectations about the purpose of the study.
Participants who are aware that they are subjects in a study will look for cues to figure
out what the study is about. Demand characteristics refers to a change in participant
behavior due to their expectations about the study. Demand characteristics are
especially problematic if people respond to cues differently in the control and in the
experimental group.
But the cues don’t even have to be accurately interpreted by the participants. As long as
participants in the same group interpret the cues in the same manner and change their
behavior in the same way, demand characteristics form a real problem.
This is why it’s always a good idea to leave participants unaware of the actual purpose
of the study or at least leave them unaware of which group they are in, the
experimental or the control group. If both the subject and the experimenter are
unaware, we call this a double-blind research design.
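Here is one way the bookkeeping of a double-blind setup could be handled, sketched in Python with hypothetical labels: a third party keeps the key linking neutral condition codes to the experimental and control conditions, so neither the experimenter nor the participants can tell who is in which group until the study is over.

```python
import random

random.seed(7)  # only to make this example reproducible

participants = [f"participant_{i}" for i in range(1, 11)]

# A third party decides which code stands for which condition and keeps the key
codes = ["A", "B"]
random.shuffle(codes)
key = {"experimental": codes[0], "control": codes[1]}  # not shown to anyone else

# Balanced, random assignment of participants to the two coded conditions
labels = ["A"] * 5 + ["B"] * 5
random.shuffle(labels)
assignment = dict(zip(participants, labels))

# The experimenter and the participants only ever see the codes 'A' and 'B'
print(assignment)
```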
Because any cue can lead to a bias in behavior, researchers usually come up with a
cover story. A cover story is a plausible explanation of the purpose of the study. It
should provide participants with cues that are unlikely to bias their behavior. Of course
this temporary deception needs to be necessary: the risk of bias due to demand
characteristics needs to be real. And of course you need to debrief participants
afterwards and inform them about the real purpose of the study.
The last category of threats to internal validity is a bit of a remainder category. Very
generally the three types of threats in this category are related to the research procedure
or set-up. The threats are ambiguous temporal precedence, history and mortality.
Ok, to give an example of a history effect on a small scale, imagine that during the last
session in the control group, the experimenter faints. Of course participants are shaken
and upset about this. And this might translate into a more general negative attitude in
the control group, which also makes the control group's attitude towards the minority
more negative. The treatment might look effective, because the experimental group is
more positive, but the difference is due not to the discussion technique but to the
fainting incident.
Let’s consider the history effect on a larger scale. Suppose that during the study a
horrific murder is committed, allegedly by a member of the minority. The crime gets an
enormous amount of media attention, reinforcing the negative stereotype about the
minority group. Any positive effect of the intervention could be undone by this event.
The threat of history is hard to eliminate. Large-scale events, well they can’t be
avoided. Small-scale events that happen during the study can be avoided, at least to
some extent, by testing subjects separately, if this is possible. This way, if something
goes wrong, the results of only one or maybe a few subjects will have to be discarded.
The final threat to discuss is mortality. Mortality refers to participant dropout from the
study. If groups are compared and dropout is different in these groups, then this could
provide an alternative explanation. For example, suppose we’re investigating the
effectiveness of a drug for depression. Suppose the drug is actually not very effective,
and has a very strong side effect. It causes extreme flatulence!
Of course this can be so uncomfortable and embarrassing that it wouldn't be strange for
people to drop out of the study because of this side effect. Suppose 80% of people in
the experimental group dropped out. In the control group participants are administered
a placebo, with no active ingredient, so also, no side effects. Dropout in this group is
only 10%. It’s obvious that the groups are no longer comparable.
Suppose that for the remaining 20% of participants in the experimental group the drug
is effective enough to outweigh the negative side effect. This wasn’t the case for the
80% who dropped out. Based on the remaining subjects we might conclude that the
drug is very effective. But if all subjects had remained in the study the conclusion
would have been very different.
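A tiny simulation, with made-up numbers, shows how selective dropout distorts the picture: if only the participants who respond best to the drug stay in the study, the average improvement among the remaining subjects looks far more impressive than the average over everyone who started.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100  # hypothetical participants starting in the experimental (drug) group

# Change in depression score if everyone had stayed (negative = improvement)
improvement = rng.normal(-2, 3, size=n)  # on average only a small improvement

# Suppose the 80% who benefit least are exactly the ones who drop out
stayers = np.sort(improvement)[: int(0.2 * n)]  # the 20% with the largest improvement

print(f"Mean change if nobody dropped out: {improvement.mean():.1f}")
print(f"Mean change among the remaining 20%: {stayers.mean():.1f}")
```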
I want to go into research designs and give you a more concrete idea of how a research
design can minimize threats to internal validity. But before I can do that you have to
become familiar with the terms construct, variable, constant and independent and
dependent variable.
Loneliness and depression are constructs: abstract properties that we can't observe directly. Of course such constructs can be expressed in many different ways. The term
variable refers to an operationalized version of a construct. A variable is a specific,
concrete expression of the construct and is measurable or manipulable. For example, I
could operationalize loneliness in a group of elderly people in a nursing home by using a self-report questionnaire. I could administer the UCLA Loneliness Scale. This scale is a 20-item questionnaire consisting of items like “I have nobody to talk to”. The variable loneliness now refers to loneliness as expressed through scores on the UCLA scale.
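As a minimal sketch of what such an operationalization yields in practice, here is how item responses could be turned into a single loneliness score in Python. The response coding and the numbers are invented for illustration and are not the actual scoring instructions of the UCLA scale.

```python
# Hypothetical responses of one participant to a 20-item loneliness questionnaire
# (1 = never, 2 = rarely, 3 = sometimes, 4 = often; coding assumed for illustration)
item_responses = [3, 4, 2, 3, 4, 3, 3, 2, 4, 3,
                  3, 4, 3, 2, 3, 4, 3, 3, 2, 4]
assert len(item_responses) == 20

# The variable 'loneliness' is the total score on the questionnaire:
# a concrete, measurable expression of the abstract construct
loneliness_score = sum(item_responses)
print("Loneliness score:", loneliness_score)  # would range from 20 to 80
```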
Ok, so variables are measured or manipulated properties that take on different values.
This last bit is important: a variable's values need to vary, otherwise the property isn’t
very interesting. Suppose the nursing home is so horrible that all residents get the
maximum depression score. Well then we cannot show a relation between loneliness
and depression, at least not in this group of subjects. Both lonely and less lonely people
will be equally depressed. Depression is a constant.
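A quick check like the one below, with made-up scores, makes the point: if a measured property takes on only one value in our sample, it is a constant and cannot covary with anything.

```python
depression_scores = [40, 40, 40, 40, 40]  # hypothetical: everyone at the maximum

# A property is only a variable in this sample if its values actually vary;
# here depression is a constant, so it cannot covary with loneliness
is_constant = len(set(depression_scores)) == 1
print("Depression is a constant in this sample:", is_constant)  # True
```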
So the variables central to our hypothesis should be variable, they should show
variation. Of course it is a good idea to keep other, extraneous variables constant, so that they can't provide alternative explanations for our results.
Ok now that I’ve defined what a variable is, let’s look at different types of variables
according to the role they play in describing or explaining a phenomenon. I’ll refer to
the variables that are central to our hypothesis as variables of interest.
When a cause and effect can't be identified or when a causal direction isn’t obvious or
even of interest, our variables are ‘on equal footing’. Then we just refer to them as
variables. But when our hypothesis is causal, we can identify a cause and an effect.
And we then refer to the cause variable as the independent variable and to the effect
variable as the dependent variable.
Now if you’re having trouble telling the terms independent and dependent apart, try to
remember that the independent variable is what the researcher would like to be in
control of, it's the cause that comes first.
I’ve discussed variables that are the focus of our hypothesis: the variables of interest. Of
course in any study there will be other, extraneous properties associated with the
participants and research setting that vary between participants.
These properties are not the main focus of our study but they might be associated with
our variables of interest, providing possible alternative explanations. Such variables of
disinterest come in three flavors: confounders, control variables and background
variables.
Suppose physical activity is also causally related to depression - being more active
lowers depression. Well, then physical activity, and not loneliness, may account for a
lower depression score in the experimental cat group. The relation between loneliness
and depression is said to be spurious. The relation can be explained by the
confounding variable, physical activity.
An important thing to note about confounders is that they are not included in the
hypothesis and they’re generally not measured. This makes it impossible to determine
what the actual effect of a confounder was. The only thing to do is to repeat the study
and control the confounder by making sure it takes on the same value for all
participants. For example, if all elderly people in both groups are required to be equally
active, then physical activity cannot explain differences in depression.
Alternatively, we could measure physical activity and include it in the study as a control variable. We then consider the difference in depression between the cat-therapy and control
group, first for the active people, and then for the inactive people. We control for
activity, in fact holding it constant by considering each activity level separately.
If the relationship between loneliness and depression disappears when we look at each
activity level separately, then activity ‘explains away’ the spurious relation between
loneliness and depression. But if the relationship still shows at each activity level, then
we have eliminated physical activity as an alternative explanation for the drop in
depression.
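Here is a small sketch, with made-up post-study depression scores, of what "considering each activity level separately" amounts to: we compare the cat-therapy and control groups within each level of the control variable.

```python
# Hypothetical post-study depression scores by condition and activity level
scores = {
    ("cat", "active"): [12, 14, 11, 13],
    ("cat", "inactive"): [18, 17, 19, 16],
    ("control", "active"): [15, 16, 14, 15],
    ("control", "inactive"): [21, 20, 22, 19],
}

def mean(values):
    return sum(values) / len(values)

# Compare the cat-therapy and control groups within each activity level,
# i.e. while holding physical activity constant
for activity in ("active", "inactive"):
    difference = mean(scores[("cat", activity)]) - mean(scores[("control", activity)])
    print(f"{activity}: cat group minus control group = {difference:.1f}")
```

In these invented numbers the cat group scores lower at both activity levels, so physical activity would be ruled out as the alternative explanation.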
The last type of variable of disinterest is a background variable. This type of variable is
not immediately relevant to the relation between the variables of interest, but it is
relevant to determine how representative the participants in our study are for a larger
group, maybe all elderly everywhere, even all people, of any age. For this reason it is
interesting to know how many men and women participated, what their mean age was,
their ethnic or cultural background, socioeconomic status, education level, and
whatever else is relevant to the study at hand.