Lesson 5.1 - Validity
Concept of Validity
Validity can be defined as the agreement between a test score or measure and the quality it is
believed to measure. Validity is sometimes defined as the answer to the question, “Does the test measure
what it is supposed to measure?” To address this question, we use systematic studies to determine whether
the conclusions from test results are justified by evidence. Validity is a judgment or estimate of how well a
test measures what it purports to measure in a particular context. What this means is that the test has been
shown to be valid for a particular use with a particular population of testtakers at a particular time. No test
or measurement technique is “universally valid” for all time, for all uses, with all types of testtaker
populations. Rather, tests may be shown to be valid within what we would characterize as reasonable
boundaries of a contemplated usage. If those boundaries are exceeded, the validity of the test may be called
into question. The validity of a test may change as times change or as the testtaker populations change.
Validation is the process of gathering and evaluating evidence about validity. Both the test
developer and the test user may play a role in the validation of a test for a specific purpose. It is the test
developer’s responsibility to supply validity evidence in the test manual. It may sometimes be appropriate
for test users to conduct their own validation studies with their own groups of testtakers. Such local
validation studies may yield insights regarding a particular population of testtakers as compared to the
norming sample described in a test manual. Local validation studies are absolutely necessary when the test
user plans to alter in some way the format, instructions, language, or content of the test. For example, a
local validation study would be necessary if the test user sought to transform a nationally standardized test
into Braille for administration to blind and visually impaired testtakers. Local validation studies would also
be necessary if a test user sought to use a test with a population of testtakers that differed in some significant
way from the population on which the test was standardized.
Face Validity
Before we move on to the different types of validity, we must first discuss face validity. It is not technically a form or type of validity, but the term is commonly used in the testing literature. Face validity relates
more to what a test appears to measure to the person being tested than to what the test actually measures. It
is the mere appearance that a measure has validity. A paper-and-pencil personality test labeled The
Introversion/Extraversion Test, with items that ask respondents whether they have acted in an introverted
or an extraverted way in particular situations, may be perceived by respondents as a highly face-valid test.
On the other hand, a personality test in which respondents are asked to report what they see in inkblots may
be perceived as a test with low face validity. Many respondents would be left wondering how what they
said they saw in the inkblots really had anything at all to do with personality.
These appearances can help motivate test takers because they can see that the test is relevant. For
example, suppose you developed a test to screen applicants for a training program in accounting. Items that
ask about balance sheets and ledgers might make applicants more motivated than items about fuel
consumption. A lack of face validity will contribute to a lack of confidence in the perceived effectiveness
of the test. When this happens, testtakers may not cooperate or feel motivated, test users may not buy tests from developers, and parents or guardians may not allow their children to take the test because they do not believe in it.
Types of Validity
The figure below summarizes the three main types of validity. This figure is based on the classic conception of validity referred to as the trinitarian view (Guion, 1980).
Figure: The trinitarian view of validity comprises content validity, criterion-related validity, and construct validity.
Content Validity
How many times have you studied for an examination and known almost everything only to find
that the professor has come up with some strange items that do not represent the content of the course? If
this has happened, you may have encountered a test with poor content validity. Content validity describes
a judgment of how adequately a test samples behavior representative of the universe of behavior that the
test was designed to sample. In our example, the test has low content validity because the items included in
the test are not about the concepts discussed in the course. As another example, the universe of behavior referred to as assertiveness is very wide-ranging. A content-valid, paper-and-pencil test of assertiveness would
be one that is adequately representative of this wide range. We might expect that such a test would contain
items sampling from hypothetical situations at home (such as whether the respondent has difficulty in
making her or his views known to fellow family members), on the job (such as whether the respondent has
difficulty in asking subordinates to do what is required of them), and in social situations (such as whether
the respondent would send back a steak not done to order in a fancy restaurant).
Ideally, test developers have a clear (as opposed to “fuzzy”) vision of the construct being measured,
and the clarity of this vision can be reflected in the content validity of the test (Haynes et al., 1995). In the
interest of ensuring content validity, test developers strive to include key components of the construct
targeted for measurement and exclude content irrelevant to the construct targeted for measurement.
In psychological and educational tests, much research goes into exploring the topics or subjects the test is designed to cover. Researchers could use course syllabi, course textbooks, teachers of the course, specialists who
develop curricula, and professors and supervisors who train teachers in the particular subject area as sources
of information. From the pooled information (along with the judgment of the test developer), there emerges
a test blueprint for the “structure” of the evaluation—that is, a plan regarding the types of information to be
covered by the items, the number of items tapping each area of coverage, the organization of the items in
the test, and so forth. In many instances the test blueprint represents the culmination of efforts to adequately
sample the universe of content areas that conceivably could be sampled in such a test.
For an employment test to be content-valid, its content must be a representative sample of the job-
related skills required for employment. Behavioral observation is one technique frequently used in
blueprinting the content areas to be covered in certain types of employment tests. The test developer will
observe successful veterans on that job, note the behaviors necessary for success on the job, and design the
test to include a representative sample of those behaviors. Those same workers (as well as their supervisors
and others) may subsequently be called on to act as experts or judges in rating the degree to which the
content of the test is a representative sample of the required job-related skills.
An example of a test blueprint providing evidence for content validity can be seen below.
Criterion-related Validity
Criterion validity evidence tells us just how well a test corresponds with a particular criterion. It is
a judgment of how adequately a test score can be used to infer an individual’s most probable standing on
some measure of interest—the criterion. That is, to establish criterion-related validity evidence, correlations
between the test and the criterion must be determined. Criterion-related validity indicates the effectiveness
of a test in predicting an individual's behavior in specified situations. For this purpose, performance on the
test is checked against a criterion. A criterion is the standard against which the test is compared. For a mechanical aptitude test, the criterion could be subsequent job performance as a machinist; for a scholastic aptitude test, it might be college grades. As a specific example, a test might be used to predict which
engaged couples will have successful marriages and which ones will get divorced. Marital success is the
criterion, but it cannot be known at the time the couples take the premarital test. The reason for gathering
criterion validity evidence is that the test or measure is to serve as a “stand-in” for the measure we are really
interested in. In the marital example, the premarital test serves as a stand-in for estimating future marital
happiness. As another example, if a test purports to measure the trait of athleticism, we might expect to employ “membership in a health club” or any generally accepted measure of physical fitness as a criterion in evaluating whether the athleticism test truly measures athleticism. Operationally, a criterion can be almost anything: pilot performance in flying a Boeing 767, a grade on an examination in Advanced Hairweaving, the
number of days spent in psychiatric hospitalization; the list is endless. There are no hard-and-fast rules for
what constitutes a criterion. It can be a test score, a specific behavior or group of behaviors, an amount of
time, a rating, a psychiatric diagnosis, a training cost, an index of absenteeism, an index of alcohol
intoxication, and so on. Whatever the criterion, ideally it is relevant, valid, and uncontaminated.
Characteristics of a Criterion
An adequate criterion is relevant. By this we mean that it is pertinent or applicable to the matter at
hand. We would expect, for example, a test purporting to advise testtakers whether they share the same interests as successful actors to have been validated using the interests of successful actors as a criterion.
An adequate criterion measure must also be valid for the purpose for which it is being used. For example:
If one test (X) is being used as the criterion to validate a second test (Y), then evidence should exist that
test X is valid. If you are developing an intelligence test, for example, you might use an established, repeatedly studied, valid test such as the SB5 to help establish the validity of your own test. If the criterion used is a rating made by a judge or a panel, then evidence should exist that the rating is valid. How? One might investigate the background of the panel or judge to see whether they have the proper credentials to rate. Lastly, the criterion
should be uncontaminated. Criterion contamination is a term used to describe a criterion measure that has
been based on predictor measures. As an example, consider a hypothetical “Inmate Violence Potential Test”
(IVPT) designed to predict a prisoner’s potential for violence in the cell block. In part, this evaluation entails
ratings from fellow inmates, guards, and other staff in order to come up with a number that represents each
inmate’s violence potential. After all of the inmates in the study have been given scores on this test, the
study authors then attempt to validate the test by asking guards to rate each inmate on their violence
potential. Because the guards’ opinions were used to formulate the inmates’ test scores in the first place (the
predictor variable), the guards’ opinions cannot be used as a criterion against which to judge the soundness
of the test. If the guards’ opinions were used both as a predictor and as a criterion, then we would say that
criterion contamination had occurred. When criterion contamination does occur, the results of the validation
study cannot be taken seriously.
Types of Criterion-related Validity
1. Predictive validity
The predictive nature of criterion-related validity is heavily emphasized in predictive
validity. Predictive validity is how accurately scores on the test predict some criterion measure. It
describes the relationship between the test scores and a criterion measure obtained at a future time.
The information provided by predictive validity is most relevant to tests used in the selection and
classification of personnel. Predictive validity can aid decision-makers in selecting who will be
successful. Hiring job applicants, selecting students for admission to college or professional
schools, and assigning military personnel to occupational training programs represent examples of
the sort of decisions requiring a knowledge of the predictive validity of tests. Other examples
include the use of tests to screen out applicants likely to develop emotional disorders in stressful
environments and the use of tests to identify psychiatric patients most likely to benefit from a
particular therapy.
Let’s look at a specific example. Let’s say you are developing an entrance exam for junior
high students. If you are using predictive validity, your predictor would be your test scores and a
criterion could be the grades of grade 7 students after the school year. What will you do? First, you
will give your test to grade 6 students and as they move on and finish grade 7, you get their grades.
Once you have their grades, you can correlate them with their test scores. If the correlation is high, your entrance exam is valid; if it is low, it is not. Notice that the criterion by which you validated your test, the grades of the grade 7 students, was obtained only in the future, a year after you developed your test. If you were able to demonstrate predictive validity, that is, a high correlation between test scores (predictor) and grades (criterion), you can now use your entrance exam to estimate who will and will not be successful in junior high. You can use your exam to filter applicants, accepting those you expect to succeed and rejecting those you do not.
When evaluating the predictive validity of a test, researchers must take into consideration
the base rate of the occurrence of the variable in question, both as that variable exists in the general
population and as it exists in the sample being studied. Generally, a base rate is the extent to which
a particular trait, behavior, characteristic, or attribute exists in the population (expressed as a
proportion). A hit rate may be defined as the proportion of people a test accurately identifies as
possessing or exhibiting a particular trait, behavior, characteristic, or attribute. For example, hit rate
could refer to the proportion of people accurately predicted to be able to perform work at the
graduate school level or to the proportion of neurological patients accurately identified as having a
brain tumor. In like fashion, a miss rate may be defined as the proportion of people the test fails to
identify as having, or not having, a particular characteristic or attribute. Here, a miss amounts to an
inaccurate prediction. The category of misses may be further subdivided. A false positive is a miss
wherein the test predicted that the testtaker did possess the particular characteristic or attribute
being measured when in fact the testtaker did not. A false negative is a miss wherein the test
predicted that the testtaker did not possess the particular characteristic or attribute being measured
when the testtaker actually did. An example can be found below: ABC Company is trying to establish the predictive validity of its FERT test by comparing FERT scores with the applicants’ on-the-job ratings (OTJSR), gathered over a three-month period in which their performance is rated. (A worked sketch of these hit and miss rates appears after this list.)
2. Concurrent validity
Given the time constraint of predictive validity, concurrent validity offers a more readily
available form of evidence. In this type, the test scores are obtained at about the same time as the
criterion measures are obtained. Similarly, measures of the relationship between the test scores and
the criterion provide evidence of concurrent validity. Statements of concurrent validity indicate the
extent to which test scores may be used to estimate an individual’s present standing on a criterion.
The logical distinction between predictive and concurrent validity is based, not on time, but on the
objectives of testing. Concurrent validity is relevant to tests employed for diagnosis of existing
status, rather than prediction of future outcomes. The difference can be illustrated by asking: “Is
Smith neurotic?” (Concurrent validity) and “Is Smith likely to become neurotic?” (Predictive
validity).
An example of concurrent validity evidence would be validating scores (or classifications) made on the basis of a psychodiagnostic test against a criterion of already diagnosed psychiatric patients. You can give your test to diagnosed patients and check whether it produces the same results as their existing diagnoses. Alternatively, your criterion could include behavior samples from diagnosed patients: is their behavior similar to that of the people your test classifies with the same diagnosis? Job samples can also be used to provide concurrent validity evidence. Companies can test applicants on a sample of behaviors that represent the tasks that will be required of them. During these samples, managers may rate the applicants, and the ratings can serve as the criterion against which the test is validated. A more common and popular way of establishing concurrent validity evidence is to use another test; that test then becomes the validating criterion. For example, to assess whether the ABC test is a valid intelligence test for adolescents, test users may correlate respondents’ scores with their scores on an established, validated intelligence test such as the WAIS-IV and check whether the results are similar.
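To make the hit-and-miss terminology concrete, here is the worked sketch promised under predictive validity above. It is a minimal Python example for the hypothetical FERT (predictor) and OTJSR (criterion) scenario; the scores and cutoff values are invented purely for illustration.

```python
# Minimal sketch: base rate, hit rate, miss rate, false positives, and false
# negatives for the hypothetical FERT/OTJSR example. All numbers, including
# the cutoff scores, are illustrative assumptions, not real data.
import numpy as np

fert = np.array([72, 85, 60, 90, 55, 78, 92, 65, 70, 88])             # test scores
otjsr = np.array([3.1, 4.2, 2.5, 4.5, 3.8, 2.9, 4.7, 2.2, 3.9, 4.1])  # ratings

fert_cutoff = 75    # predicted to succeed at or above this test score (assumed)
otjsr_cutoff = 3.5  # counted as actual success at or above this rating (assumed)

predicted = fert >= fert_cutoff
actual = otjsr >= otjsr_cutoff

n = len(fert)
base_rate = actual.mean()                 # proportion who actually succeed
hit_rate = (predicted == actual).mean()   # proportion of accurate predictions
false_pos = (predicted & ~actual).sum()   # predicted success, did not succeed
false_neg = (~predicted & actual).sum()   # predicted failure, actually succeeded
miss_rate = (false_pos + false_neg) / n   # proportion of inaccurate predictions

print(f"base rate = {base_rate:.2f}, hit rate = {hit_rate:.2f}, miss rate = {miss_rate:.2f}")
print(f"false positives = {false_pos}, false negatives = {false_neg}")
```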
Validity Coefficient
The validity coefficient is a correlation coefficient that provides a measure of the relationship
between test scores and scores on the criterion measure. The correlation coefficient computed from a score
(or classification) on a psychodiagnostic test, and the criterion score (or classification) assigned by
psychodiagnosticians is one example of a validity coefficient. Typically, the Pearson correlation coefficient
is used to determine the relationship between the two measures. However, depending on variables such as the
type of data, the sample size, and the shape of the distribution, other correlation coefficients could be used.
For example, in correlating self-rankings of performance on some job with rankings made by job
supervisors, the formula for the Spearman rho rank-order correlation would be employed.
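For illustration, the sketch below computes a validity coefficient two ways with SciPy, using a small invented set of test scores and criterion values; the numbers are assumptions, not real data.

```python
# A small sketch computing a validity coefficient two ways.
# The test and criterion scores below are invented for illustration.
import numpy as np
from scipy import stats

test_scores = np.array([48, 52, 61, 55, 70, 66, 59, 73, 50, 68])          # predictor
criterion = np.array([2.4, 2.9, 3.1, 2.7, 3.8, 3.3, 3.0, 3.9, 2.5, 3.4])  # e.g., GPA

# Pearson r is the usual choice for interval-level scores.
pearson_r, p_value = stats.pearsonr(test_scores, criterion)
print(f"Pearson validity coefficient r = {pearson_r:.2f} (p = {p_value:.3f})")

# If both variables were ranks (e.g., self-rankings vs. supervisor rankings),
# Spearman's rho would be the appropriate coefficient instead.
rho, p_rho = stats.spearmanr(test_scores, criterion)
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```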
There are no hard-and-fast rules about how large a validity coefficient must be to be meaningful.
In practice, one rarely sees a validity coefficient larger than .60, and validity coefficients in the range of .30
to .40 are commonly considered high. College students differ in their academic performance for many
reasons. You probably could easily list a dozen. Because there are so many factors that contribute to college
performance, it would be too much to expect an entrance exam to explain all of the variation. The question
we must ask is “How much of the variation in college performance will we be able to predict on the basis
of entrance exam scores?” The validity coefficient squared is the percentage of variation in the criterion
that we can expect to know in advance because of our knowledge of the test scores. Thus, we will know
.40 squared, or 16%, of the variation in college performance because of the information we have from the
entrance exam. The remainder of the variation in college performance is actually the greater proportion:
84% of the total variation is still unexplained. In other words, when students arrive at college, most of the
reasons they perform differently will be a mystery to college administrators and professors. Even so, seemingly low validity coefficients can still be useful, especially when translated into real-life decisions. A validity coefficient of .30 can translate into increased earnings of millions of pesos.
To be an informed consumer of testing information, you should learn to review carefully any
information offered by a test developer. Because not all validity coefficients of .40 have the same meaning,
you should watch for several things in evaluating such information.
1. Look for changes; review the subject population in the validity study
The population may have changed between the original validity study and the current testtakers. You need to be cautious about validity coefficients because the validity study might have been done on a population that does not represent the group to which inferences will
be made. For example, a test might be used and shown to be valid for selecting supervisors
in industry; however, the validity study may have been done at a time when all the
employees were men, making the test valid for selecting supervisors for male employees.
If the company hires female employees, then the test may no longer be valid for selecting
supervisors because it may not consider the abilities necessary to supervise a sexually
mixed group of employees.
2. What does the criterion mean?
Criterion-related validity studies mean nothing at all unless the criterion is valid
and reliable. Some test constructors attempt to correlate their tests with other tests that have
unknown validity. A meaningless group of items that correlates well with another
meaningless group remains meaningless.
3. Be sure the sample size was adequate
Information based on a small sample size can be misleading. The smaller the sample, the more likely it is that chance variation in the data will affect the correlation.
4. Never confuse the criterion with the predictor
For example, an entrance exam would be your predictor, while success in college (graduation) would be your criterion. This means that the entrance exam is used to predict whether you will be successful in college or not. A low score on this exam may suggest you will not be successful, leading the university to deny your entry. It should not work the other way around, where having already graduated from college is taken to mean you should have received a high score on the entrance exam.
5. Check for restricted range on both predictor and criterion
A variable has a “restricted range” if all scores for that variable fall very close
together. For example, the grade point averages (GPAs) of graduate students in Ph.D.
programs tend to fall within a limited range of the scale—let’s say 95 - 99. The problem
this creates is that correlation depends on variability. If all the people in your class have a GPA of about 95, then you cannot predict variability in graduate-school GPA because the available values are so limited. Correlation requires that there be variability in both the predictor and the criterion (see the sketch after this list).
6. Review evidence for validity generalization
Criterion-related validity evidence obtained in one situation may not be
generalized to other similar situations. Generalizability refers to the evidence that the
findings obtained in one situation can be generalized—that is, applied to other situations.
This is an issue of empirical study rather than judgment. In other words, we must prove
that the results obtained in a validity study are not specific to the original situation. There
are many reasons why results may not be generalized. For example, there may be
differences in the way the predictor construct is measured or in the type of job or curriculum
involved—in the actual criterion measure—between the groups of people who take the test;
there may also be differences in the time period—year or month—when the test is
administered. Because of these problems, we cannot always be certain that the validity
coefficient reported by a test developer will be the same for our particular situation. An
employer, for example, might use a work-sample test based on information reported in the
manual, yet the situation in which he or she uses the test may differ from the situations of
the original validation studies. When using the test, the employer might be using different
demographic groups or different criterion measures or else predicting performance on a
similar but different task. Generalizations from the original validity studies to these other
situations should be made only on the basis of new evidence.
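The sketch below illustrates the restriction-of-range point from item 5 with simulated scores; the distributions and the selection cutoff are assumptions chosen only to show how the correlation shrinks when one variable's range is narrowed.

```python
# A minimal sketch of how restriction of range shrinks a correlation.
# The simulated scores and the selection cutoff are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
predictor = rng.normal(50, 10, n)                  # e.g., entrance exam scores
criterion = 0.7 * predictor + rng.normal(0, 7, n)  # e.g., later GPA-like outcome

full_r = np.corrcoef(predictor, criterion)[0, 1]

# Keep only the top scorers on the predictor (a restricted range),
# as happens when only admitted students' later grades are ever observed.
selected = predictor > np.percentile(predictor, 80)
restricted_r = np.corrcoef(predictor[selected], criterion[selected])[0, 1]

print(f"correlation on the full range:       {full_r:.2f}")
print(f"correlation on the restricted range: {restricted_r:.2f}")  # noticeably smaller
```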
Incremental Validity
Test users involved in predicting some criterion from test scores are often interested in the utility
of multiple predictors. The value of including more than one predictor depends on a couple of factors. First,
of course, each measure used as a predictor should have criterion-related predictive validity. Second,
additional predictors should possess incremental validity, defined here as the degree to which an additional
predictor explains something about the criterion measure that is not explained by predictors already in use.
Incremental validity may be used when predicting something like academic success in college.
GPA at the end of the first year may be used as a measure of academic success. A study of potential
predictors of GPA may reveal that time spent in the library and time spent studying are highly correlated
with GPA. How much sleep a student’s roommate allows the student to have during exam periods correlates
with GPA to a smaller extent. What is the most accurate but most efficient way to predict GPA? One
approach, employing the principles of incremental validity, is to start with the best predictor: the predictor
that is most highly correlated with GPA. This may be time spent studying. Then, using multiple regression
techniques, one would examine the usefulness of the other predictors.
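A rough sketch of that incremental-validity logic is given below, using simulated data and an ordinary least-squares fit; the variable names, effect sizes, and sample size are all assumptions for illustration.

```python
# A rough sketch of incremental validity: how much does a second predictor
# add to the variance in GPA explained by the first? Data are simulated.
import numpy as np

rng = np.random.default_rng(1)
n = 300
study_time = rng.normal(20, 5, n)                       # assumed best single predictor
library_time = 0.5 * study_time + rng.normal(0, 4, n)   # correlated with study time
gpa = 0.08 * study_time + 0.03 * library_time + rng.normal(0, 0.4, n)

def r_squared(X, y):
    """Proportion of variance in y explained by a least-squares fit on X."""
    X = np.column_stack([np.ones(len(y)), X])           # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_one = r_squared(study_time, gpa)
r2_two = r_squared(np.column_stack([study_time, library_time]), gpa)

print(f"R^2 with study time alone:         {r2_one:.3f}")
print(f"R^2 adding library time:           {r2_two:.3f}")
print(f"incremental validity (R^2 change): {r2_two - r2_one:.3f}")
```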
Construct Validity
Before 1950, most social scientists considered only criterion and content evidence for validity. By
the mid-1950s, investigators concluded that no clear criteria existed for most of the social and psychological
characteristics they wanted to measure. Developing a measure of intelligence, for example, was difficult
because no one could say for certain what intelligence was. Studies of criterion validity evidence would
require that a specific criterion of intelligence be established against which tests could be compared.
However, there was no criterion for intelligence because it is a hypothetical construct. A construct is defined
as something built by mental synthesis. It is an informed, scientific idea developed or hypothesized to
describe or explain behavior. Intelligence, something we cannot touch or feel, is a construct that may be
invoked to describe why a student performs well in school. Anxiety is a construct that may be invoked to
describe why a psychiatric patient paces the floor. Other examples of constructs are job satisfaction,
personality, bigotry, clerical aptitude, depression, motivation, self-esteem, emotional adjustment, potential
dangerousness, executive potential, creativity, and mechanical comprehension, to name but a few.
Construct validity evidence is established through a series of activities in which a researcher
simultaneously defines some construct and develops the instrumentation to measure it. Construct validation
involves assembling evidence about what a test means. This can be done by showing the relationship
between a test and other tests and measures. It could also be done through formulating hypotheses about
expected behavior of high scorers and low scorers on the test. If a test is valid, then high and low scorers
will behave as predicted by the test. The gathering of construct validity evidence is an ongoing process that
is similar to amassing support for a complex scientific theory. Although no single set of observations
provides crucial or critical evidence, many observations over time gradually clarify what the test means.
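One simple way to test such hypotheses about high and low scorers is sketched below: compare a behavioral criterion across the two groups with an independent-samples t test. The anxiety-and-pacing scenario, the group sizes, and the scores are invented for illustration only.

```python
# A minimal sketch: do high scorers and low scorers on a new anxiety test
# behave as the construct predicts (e.g., more pacing behavior)? Data invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
pacing_high_scorers = rng.normal(12, 3, 40)   # minutes of pacing, high test scorers
pacing_low_scorers = rng.normal(8, 3, 40)     # minutes of pacing, low test scorers

t, p = stats.ttest_ind(pacing_high_scorers, pacing_low_scorers)
print(f"t = {t:.2f}, p = {p:.4f}")
# A significant difference in the predicted direction is one piece of
# construct validity evidence; by itself it is never conclusive.
```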
3. Factor analysis
Both convergent and discriminant evidence of construct validity can be obtained
by the use of factor analysis. Factor analysis is a shorthand term for a class of mathematical
procedures designed to identify factors or specific variables that are typically attributes,
characteristics, or dimensions on which people may differ. In psychometric research, factor
analysis is frequently employed as a data reduction method in which several sets of scores
and the correlations between them are analyzed. In such studies, the purpose of the factor
analysis may be to identify the factor or factors in common between test scores on subscales
within a particular test, or the factors in common between scores on a series of tests. For
example, if 20 tests have been given to 300 persons, the first step is to compute the
correlations of each test with every other. An inspection of the resulting table of 190
correlations may itself reveal certain clusters among the tests, suggesting the location of
common traits. Thus, if such tests as vocabulary, analogies, opposites, and sentence
completion have high correlations with each other and low correlations with all other tests,
we could tentatively infer the presence of a verbal comprehension factor.
In general, factor analysis is conducted on either an exploratory or a confirmatory
basis. Exploratory factor analysis typically entails “estimating or extracting factors;
deciding how many factors to retain; and rotating factors to an interpretable orientation”
(Floyd & Widaman, 1995, p. 287). By contrast, in confirmatory factor analysis, researchers
test the degree to which a hypothetical model (which includes factors) fits the actual data.
In the process of factor analysis, the number of variables or categories in terms of which
each individual’s performance can be described is reduced from the number of original
tests to a relatively small number of factors, or common traits. In the example cited above,
five or six factors might suffice to account for the intercorrelations among the 20 tests.
Each individual might thus be described in terms of his scores in the five or six factors,
rather than in terms of the original 20 scores. A major purpose of factor analysis is to
simplify the description of behavior by reducing the number of categories from an initial
multiplicity of test variables to a few common factors, or traits.
In conducting factor analysis, a factor loading can be obtained. This describes the extent to which each test, or each item within a test, “loads” on or relates to a factor. For example, a
new test measuring bulimia can be factor-analyzed with other known measures of bulimia,
as well as with other kinds of related measures (such as measures of intelligence, self-
esteem, general anxiety, anorexia, or perfectionism). High factor loadings by the new test
on a “bulimia factor” would provide convergent evidence of construct validity. Moderate
to low factor loadings by the new test with respect to measures of other eating disorders
such as anorexia would provide discriminant evidence of construct validity.
4. Evidence of homogeneity
When describing a test and its items, homogeneity refers to how uniform a test is
in measuring a single concept. A test developer can increase test homogeneity in several
ways. Consider, for example, a test of academic achievement that contains subtests in areas
such as mathematics, spelling, and reading comprehension. The Pearson r could be used to
correlate average subtest scores with the average total test score. Subtests that in the test
developer’s judgment do not correlate very well with the test as a whole might have to be
reconstructed (or eliminated) lest the test not measure the construct academic achievement.
One way a test developer can improve the homogeneity of a test containing items that are
scored dichotomously (such as a true-false test) is by eliminating items that do not show
significant correlation coefficients with total test scores. If all test items show significant,
positive correlations with total test scores and if high scorers on the test tend to pass each
item more than low scorers do, then each item is probably measuring the same construct
as the total test. Each item is contributing to test homogeneity. The homogeneity of a test
in which items are scored on a multipoint scale can also be improved. For example, some
attitude and opinion questionnaires require respondents to indicate level of agreement with
specific statements by responding, for example, strongly agree, agree, disagree, or strongly
disagree. Each response is assigned a numerical score, and items that do not show
significant Spearman rank-order correlation coefficients are eliminated. If all test items
show significant, positive correlations with total test scores, then each item is most likely
measuring the same construct that the test as a whole is measuring (and is thereby
contributing to the test’s homogeneity). Coefficient alpha may also be used in estimating the homogeneity of a test composed of multiple-choice items (Novick & Lewis, 1967); a small computational sketch appears after this list.
5. Evidence of changes with age
If a test score purports to be a measure of a construct that could be expected to
change over time, then the test score, too, should show the same progressive changes with
age to be considered a valid measure of the construct. For example, if children in grades 6,
7, 8, and 9 took a test of 8th-grade vocabulary, then we would expect that the total number
of items scored as correct from all the test protocols would increase as a function of the
higher grade level of the testtakers.
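Here is the small computational sketch promised under evidence of homogeneity above: corrected item-total correlations and coefficient alpha, computed on an invented set of dichotomously scored (0/1) item responses.

```python
# A small sketch of two homogeneity checks on invented 0/1 item responses:
# corrected item-total correlations and coefficient (Cronbach's) alpha.
import numpy as np

rng = np.random.default_rng(3)
n_people, n_items = 200, 10
ability = rng.normal(0, 1, n_people)
# Simulate dichotomous responses loosely driven by a single underlying ability.
responses = (rng.normal(0, 1, (n_people, n_items)) + ability[:, None] > 0).astype(int)

total = responses.sum(axis=1)

# Item-total correlations: items that barely correlate with the rest of the
# test are candidates for revision or removal.
for j in range(n_items):
    rest = total - responses[:, j]                 # total score excluding this item
    r = np.corrcoef(responses[:, j], rest)[0, 1]
    print(f"item {j + 1:2d}: corrected item-total r = {r:.2f}")

# Coefficient alpha from item variances and total-score variance.
k = n_items
item_vars = responses.var(axis=0, ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total.var(ddof=1))
print(f"coefficient alpha = {alpha:.2f}")
```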
Test Bias
Bias has many definitions, ranging from prejudice and preferential treatment to a test being more difficult for one group than for another. In psychological testing, bias is a factor inherent in a test that systematically prevents accurate, impartial measurement. A test could be considered biased if, for example, an intelligence test was constructed so that people who had brown eyes consistently and systematically obtained higher scores than people with green eyes, assuming of course that, in reality, people with brown eyes are not generally more intelligent than people with green eyes. When a test is biased, it is commonly due to systematic error or variation; if you remember, this is consistent error that is rooted in the test itself. The best way to remedy test bias is through prevention during test development, for example by making sure that your sample is representative of your population, standardizing administration, and making sure the items of a test measure the same construct.
Test Fairness
This is the extent to which a test is used in an impartial, just, and equitable way. Some reasons why tests are labeled as unfair:
1. They discriminate among groups of people. Because many hold that all people are created equal, any test that differentiates among groups is seen as unfair; this objection cannot be fully remedied because it depends on people's beliefs.
2. Selection decisions are based on the test.
Remember, the tests themselves may not be unfair, but the USE of a test or of the data from it can be unfair.
*ALL VALID TESTS ARE RELIABLE, BUT A RELIABLE TEST IS NOT NECESSARILY A VALID TEST*