Internal and External Validity
By Arlin Cuncic
Internal and external validity are concepts that reflect whether or not the results of a study are
trustworthy and meaningful. While internal validity relates to how well a study is conducted (its
structure), external validity relates to how applicable the findings are to the real world.
Internal Validity
Internal validity is the extent to which a study establishes a trustworthy cause-and-effect relationship
between a treatment and an outcome. It also reflects that a given study makes it possible to eliminate
alternative explanations for a finding. For example, if you implement a smoking cessation program with a
group of individuals, how sure can you be that any improvement seen in the treatment group is due to
the treatment that you administered?
Internal validity depends largely on the procedures of a study and how rigorously it is performed.
Internal validity is not a "yes or no" type of concept. Instead, we consider how confident we can be with
the findings of a study, based on whether it avoids traps that may make the findings questionable.
The less chance there is for "confounding" in a study, the higher its internal validity and the more confident we can be in the findings. Confounding refers to a situation in which other factors come into play that confuse the outcome of a study, leaving us unsure whether we have really identified the cause-and-effect relationship described above.
In short, you can only be confident that your study is internally valid if you can rule out alternative
explanations for your findings. As a brief summary, you can only assume cause-and-effect when you
meet the following three criteria in your study:
The cause preceded the effect in time.
The cause and the effect vary together.
There are no other likely explanations for the relationship that you have observed.
If you are looking to improve the internal validity of a study, you will want to consider aspects of your
research design that will make it more likely that you can reject alternative hypotheses. There are many
factors that can improve internal validity.
Randomization refers to randomly assigning participants to treatment and control groups, and it ensures that there is no systematic bias between the groups (see the sketch after these descriptions).
Random selection of participants refers to choosing your participants at random or in a manner in which
they are representative of the population that you wish to study.
Study protocol refers to following specific procedures for administering the treatment so that you do not introduce unintended effects by, for example, doing things differently with one group of people than with another.
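To make randomization concrete, here is a minimal Python sketch of random assignment; the participant labels and group sizes are hypothetical:

```python
# A minimal sketch of randomization: shuffling participants and splitting the
# list in half, so that assignment to treatment or control is left to chance
# and no systematic bias is introduced. Labels are hypothetical.
import random

participants = [f"P{i:02d}" for i in range(1, 21)]  # 20 made-up participant IDs
random.shuffle(participants)

half = len(participants) // 2
treatment_group = participants[:half]
control_group = participants[half:]

print("Treatment:", treatment_group)
print("Control:  ", control_group)
```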
Just as there are many ways to ensure that a study is internally valid, there is also a list of potential
threats to internal validity that should be considered when planning a study.
Confounding refers to a situation in which changes in an outcome variable can be thought to have
resulted from some third variable that is related to the treatment that you administered.
Historical events may influence the outcome of studies that occur over a period of time. Examples might include a change in political leadership or a natural disaster that influences how study participants feel and act.
Maturation refers to the impact of time as a variable in a study. If a study takes place over a period of
time in which it is possible that participants naturally changed in some way (grew older, became tired),
then it may be impossible to rule out whether effects seen in the study were simply due to the effect of
time.
Testing refers to the effect of repeatedly testing participants using the same measures. If you give
someone the same test three times, isn't it likely that they will do better as they learn the test or
become used to the testing process so that they answer differently?
Instrumentation refers to the impact of the actual testing instruments used in a study on how
participants respond. While it may sound unusual, it's possible to "prime" participants in a study in
certain ways with the measures that you use, which causes them to react in a way that is different than
they would have otherwise.
Statistical regression (regression to the mean) refers to the tendency of participants selected for extreme scores on a measure to score closer to the average when measured again, simply by chance rather than because of an intervention (a toy simulation appears after these threat descriptions).
Attrition refers to participants dropping out or leaving a study, which means that the results are based on
a biased sample of only the people who did not choose to leave (and possibly who all have something in
common, such as higher motivation).
Diffusion refers to the treatment in a study spreading from the treatment group to the control group as the groups interact, talk with, or observe one another. This can also lead to a related problem called resentful demoralization, in which the control group tries less hard because its members feel resentful about the group they have been assigned to.
Experimenter bias refers to an experimenter behaving differently with different groups in a study, which affects the results of the study (and is addressed through blinding).
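The statistical regression threat can be shown with a toy simulation: with no intervention at all, participants selected for extreme Time 1 scores still average closer to the mean at Time 2, because part of any extreme score is chance. All numbers below are invented.

```python
# Toy simulation of regression to the mean. Each observed score is a stable
# "true level" plus random measurement noise; selecting on extreme Time 1
# scores selects, in part, on lucky noise that does not repeat at Time 2.
import random

random.seed(1)
true_level = [random.gauss(50, 10) for _ in range(1000)]
time1 = [t + random.gauss(0, 10) for t in true_level]  # true level + noise
time2 = [t + random.gauss(0, 10) for t in true_level]  # fresh noise, no treatment

extreme = [i for i, score in enumerate(time1) if score > 70]
t1_mean = sum(time1[i] for i in extreme) / len(extreme)
t2_mean = sum(time2[i] for i in extreme) / len(extreme)
print(f"Extreme group: Time 1 mean = {t1_mean:.1f}, Time 2 mean = {t2_mean:.1f}")
```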
External Validity
External validity refers to how well the outcome of a study can be expected to apply to other settings. In
other words, this type of validity refers to how generalizable the findings are. For instance, do the
findings apply to other people, settings, situations, and time periods?
Ecological validity, an aspect of external validity, refers to whether a study's findings can be generalized
to the real world.
While rigorous research methods can ensure internal validity, the same tight controls may limit external validity.
A related term, transferability, comes from qualitative research design and refers to whether results transfer to situations with similar characteristics.
Inclusion and exclusion criteria should be used to ensure that you have clearly defined the population
that you are studying in your research.
Psychological realism refers to making sure that participants experience the events of a study as real, which can be achieved by telling them a "cover story" about the aim of the study. Otherwise, participants might behave differently than they would in real life because they know what to expect or what the aim of the study is.
Replication refers to conducting the study again with different samples or in different settings to see if
you get the same results. When many studies have been conducted, meta-analysis can also be used to
determine if the effect of an independent variable is reliable (based on examining the findings of a large
number of studies on one topic).
Field experiments can also be used in which you conduct a study outside the laboratory in a natural
setting.
Reprocessing or calibration refers to using statistical methods to adjust for problems related to external validity. For example, if a study had uneven groups for some characteristic (such as age), reweighting might be used, as in the sketch below.
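As a rough sketch of what such reweighting can look like, the snippet below applies post-stratification weights so that a sample with uneven age groups matches assumed population shares; all counts and shares are hypothetical.

```python
# Post-stratification reweighting sketch: each respondent is weighted by
# (population share of their group) / (sample share of their group), so that
# over-represented groups count for less and under-represented groups for more.
sample_counts = {"18-34": 60, "35-54": 30, "55+": 10}      # uneven sample
population_share = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}

n = sum(sample_counts.values())
weights = {
    group: population_share[group] / (count / n)
    for group, count in sample_counts.items()
}
print(weights)  # 18-34 respondents get weight 0.5; 55+ respondents get weight 3.0
```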
Factors That Threaten External Validity
External validity is threatened when a study does not take into account the interactions of variables in
the real world.
Situational factors such as time of day, location, noise, researcher characteristics, and how many
measures are used may affect the generalizability of findings.
Pre- and post-test effects refer to the situation in which the pre- or post-test is in some way related to
the effect seen in the study, such that the cause-and-effect relationship disappears without these added
tests.
Sample features refer to the situation in which some feature of the particular sample was responsible for
the effect (or partially responsible), leading to limited generalizability of the findings.
Selection bias refers to differences between groups in a study that may relate to the independent variable, such as differences in motivation or willingness to take part, or specific demographics being more likely to take part in an online survey. This can also be considered a threat to internal validity.
Internal and external validity are like two sides of the same coin. You can have a study with good internal
validity, but overall it could be irrelevant to the real world. On the other hand, you could conduct a field
study that is highly relevant to the real world, but that doesn't have trustworthy results in terms of
knowing what variables caused the outcomes that you see.
Similarities
What are the similarities between internal and external validity? They are both factors that should be
considered when designing a study, and both have implications in terms of whether the results of a
study have meaning. Neither is an "either/or" concept, so you will always be deciding to what degree your study performs on each type of validity.
Each of these concepts is typically reported in a research article that is published in a scholarly journal.
This is so that other researchers can evaluate the study and make decisions about whether the results
are useful and valid.
Differences
The essential difference between internal and external validity is that internal validity refers to the
structure of a study and its variables while external validity relates to how universal the results are.
There are further differences between the two as well.
Internal validity focuses on showing that a difference is due to the independent variable alone, whereas external validity concerns whether the results can be translated to the world at large.
Examples
An example of a study with good internal validity would be if a researcher hypothesizes that using a
particular mindfulness app will reduce negative mood. To test this hypothesis, the researcher randomly
assigns a sample of participants to one of two groups: those who will use the app over a defined period,
and those who engage in a control task.
The researcher ensures that there is no systematic bias in how participants are assigned to the groups and also blinds the research assistants to participants' group assignments during the experiment.
A strict study protocol is used that outlines the procedures of the study. Potential confounding variables, such as participants' socioeconomic status, gender, and age, are measured along with mood. If participants drop out of the study, their characteristics are examined to make sure there is no systematic bias in terms of who stays in the study.
To give the study good external validity as well, the researcher has participants use the app at home rather than in the laboratory, clearly defines the population of interest and chooses a representative sample, and replicates the study with different technological devices.
Setting up an experiment so that it has sound internal and external validity involves being mindful from
the start about factors that can influence each aspect of your research. It's best to spend extra time
designing a structurally sound study that has far-reaching implications rather than to quickly rush
through the design phase only to discover problems later on. Only when both internal and external
validity are high can strong conclusions be made about your results.
Written by Colin Phelan and Julie Wren, Graduate Assistants, UNI Office of Academic Assessment (2005-06)
Reliability is the degree to which an assessment tool produces stable and consistent results.
Types of Reliability
Test-retest reliability is a measure of reliability obtained by administering the same test twice over a
period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in
order to evaluate the test for stability over time.
Example: A test designed to assess student learning in psychology could be given to a group of students
twice, with the second administration perhaps coming a week after the first. The obtained correlation
coefficient would indicate the stability of the scores.
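As a small illustration, the Python sketch below correlates invented Time 1 and Time 2 scores; the Pearson coefficient serves as the test-retest reliability estimate.

```python
# Test-retest reliability sketch: the same (made-up) individuals' scores from
# two administrations are correlated; a high Pearson r indicates stable scores.
import numpy as np

time1 = np.array([78, 85, 62, 90, 71, 88, 67, 80])
time2 = np.array([75, 88, 65, 92, 70, 85, 70, 78])

r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability (Pearson r): {r:.2f}")
```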
Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions containing items that probe the same construct, skill, or knowledge base) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions.
Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions into two sets, which would represent the parallel forms.
Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or
raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will
not necessarily interpret answers the same way; raters may disagree as to how well certain responses or
material demonstrate knowledge of the construct or skill being assessed.
Example: Inter-rater reliability might be employed when different judges are evaluating the degree to
which art portfolios meet certain standards. Inter-rater reliability is especially useful when judgments
can be considered relatively subjective. Thus, the use of this type of reliability would probably be more
likely when evaluating artwork as opposed to math problems.
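One common agreement statistic is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below computes it for two hypothetical raters making pass/fail judgments on ten portfolios.

```python
# Cohen's kappa sketch for two raters and made-up pass/fail judgments.
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail",
           "pass", "pass", "fail", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail",
           "pass", "pass", "pass", "pass", "fail"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement, from each rater's marginal category frequencies.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
expected = sum(freq_a[c] * freq_b[c] for c in ("pass", "fail")) / n**2

kappa = (observed - expected) / (1 - expected)
print(f"Observed agreement: {observed:.2f}, Cohen's kappa: {kappa:.2f}")
```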
Internal consistency reliability is a measure of reliability used to evaluate the degree to which different
test items that probe the same construct produce similar results.
Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all
of the items on a test that probe the same construct (e.g., reading comprehension), determining the
correlation coefficient for each pair of items, and finally taking the average of all of these correlation
coefficients. This final step yields the average inter-item correlation.
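That procedure translates directly into code. The sketch below, using invented item scores, correlates every pair of items and averages the coefficients.

```python
# Average inter-item correlation sketch: correlate each pair of items that
# probe the same construct, then average the pairwise coefficients.
from itertools import combinations
import numpy as np

items = np.array([      # rows: respondents; columns: four made-up items
    [4, 5, 3, 4],
    [2, 2, 1, 2],
    [5, 4, 4, 5],
    [3, 3, 2, 3],
    [4, 4, 5, 4],
])

pairs = combinations(range(items.shape[1]), 2)
correlations = [np.corrcoef(items[:, i], items[:, j])[0, 1] for i, j in pairs]
print(f"Average inter-item correlation: {np.mean(correlations):.2f}")
```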
Split-half reliability is another subtype of internal consistency reliability. The process of obtaining split-
half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area
of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to
a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is
obtained by determining the correlation between the two total “set” scores.
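Here is a minimal sketch of that process with invented item responses. It also applies the Spearman-Brown correction, a standard adjustment (not described above, so treat it as an added assumption) for estimating full-test reliability from the two half-tests.

```python
# Split-half reliability sketch: split the items into two halves, total each
# half per respondent, and correlate the two sets of totals.
import numpy as np

scores = np.array([     # rows: respondents; columns: eight made-up items (1 = correct)
    [1, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 1],
])

half1 = scores[:, ::2].sum(axis=1)   # odd-numbered items
half2 = scores[:, 1::2].sum(axis=1)  # even-numbered items

r_half = np.corrcoef(half1, half2)[0, 1]
r_full = (2 * r_half) / (1 + r_half)  # Spearman-Brown correction
print(f"Split-half r: {r_half:.2f}, corrected full-test estimate: {r_full:.2f}")
```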
Why is it necessary?
While reliability is necessary, it alone is not sufficient. A measure can be reliable without being valid, but it cannot be valid unless it is also reliable.
For example, if your scale is off by 5 lbs, it reads your weight every day as 5 lbs more than your true weight. The scale is reliable because it consistently reports the same weight every day, but it is not valid because it adds 5 lbs to your true weight. It is not a valid measure of your weight.
Types of Validity
1. Face Validity ascertains that the measure appears to be assessing the intended construct under study.
The stakeholders can easily assess face validity. Although this is not a very “scientific” type of validity, it
may be an essential component in enlisting motivation of stakeholders. If the stakeholders do not believe
the measure is an accurate assessment of the ability, they may become disengaged with the task.
Example: If a measure of art appreciation is created, all of the items should be related to the different
components and types of art. If the questions are regarding historical time periods, with no reference to
any artistic movement, stakeholders may not be motivated to give their best effort or invest in this
measure because they do not believe it is a true assessment of art appreciation.
2. Construct Validity is used to ensure that the measure is actually measuring what it is intended to
measure (i.e. the construct), and not other variables. Using a panel of “experts” familiar with the
construct is a way in which this type of validity can be assessed. The experts can examine the items and
decide what that specific item is intended to measure. Students can be involved in this process to obtain
their feedback.
Example: A women’s studies program may design a cumulative assessment of learning throughout the
major. The questions are written with complicated wording and phrasing. This can cause the test to inadvertently become a test of reading comprehension rather than a test of women's studies. It is
important that the measure is actually assessing the intended construct, rather than an extraneous
factor.
3. Criterion-Related Validity is used to predict future or current performance - it correlates test results
with another criterion of interest.
Example: If a physics program designed a measure to assess cumulative student learning throughout the major, the new measure could be correlated with a standardized measure of ability in this discipline,
such as an ETS field test or the GRE subject test. The higher the correlation between the established
measure and new measure, the more faith stakeholders can have in the new assessment tool.
4. Formative Validity, when applied to outcomes assessment, is used to assess how well a measure provides information that can help improve the program under study.
Example: When designing a rubric for history, one could assess students' knowledge across the discipline. If the measure can show that students are lacking knowledge in a certain area,
for instance the Civil Rights Movement, then that assessment tool is providing meaningful information
that can be used to improve the course or program requirements.
5. Sampling Validity (similar to content validity) ensures that the measure covers the broad range of
areas within the concept under study. Not everything can be covered, so items need to be sampled from
all of the domains. This may need to be completed using a panel of “experts” to ensure that the content
area is adequately sampled. Additionally, a panel can help limit “expert” bias (i.e. a test reflecting what
an individual personally feels are the most important or relevant areas).
Example: When designing an assessment of learning in the theatre department, it would not be
sufficient to cover only issues related to acting. Other areas of theatre, such as lighting, sound, and the functions of stage managers, should all be included. The assessment should reflect the content area in its entirety.
Ways to Improve Validity
Make sure your goals and objectives are clearly defined and operationalized. Expectations of students should be written down.
Match your assessment measure to your goals and objectives. Additionally, have the test reviewed by
faculty at other schools to obtain feedback from an outside party who is less invested in the instrument.
Get students involved; have the students look over the assessment for troublesome wording, or other
difficulties.
If possible, compare your measure with other measures, or data that may be available.
Test reliability
Reliability refers to how dependably or consistently a test measures a characteristic. If a person takes the
test again, will he or she get a similar test score, or a much different score? A test that yields similar
scores for a person who repeats the test is said to measure a characteristic reliably.
How do we account for an individual who does not get exactly the same test score every time he or she
takes the test? Some possible reasons are the following:
Test taker's temporary psychological or physical state. Test performance can be influenced by a person's
psychological or physical state at the time of testing. For example, differing levels of anxiety, fatigue, or
motivation may affect the applicant's test results.
Environmental factors. Differences in the testing environment, such as room temperature, lighting, noise,
or even the test administrator, can influence an individual's test performance.
Test form. Many tests have more than one version or form. Items differ on each form, but each form is
supposed to measure the same thing. Different forms of a test are known as parallel forms or alternate
forms. These forms are designed to have similar measurement characteristics, but they contain different
items. Because the forms are not exactly the same, a test taker might do better on one form than on
another.
Multiple raters. In certain tests, scoring is determined by a rater's judgments of the test taker's
performance or responses. Differences in training, experience, and frame of reference among raters can
produce different test scores for the test taker.