Chapter 6: Validity
Prepared by Sittie Ashya B. Ali (BS Psychology), MSU-Main Campus, Marawi City
Psychological Testing and Assessment (7th Ed.), Cohen-Swerdlik
Items that ask respondents whether they have acted in an introverted or an extraverted way in particular situations may be perceived by respondents as highly face-valid. On the other hand, a personality test in which respondents are asked to report what they see in inkblots may be perceived as a test with low face validity. A test's lack of face validity could contribute to a lack of confidence in the perceived effectiveness of the test, with a consequential decrease in the testtaker's cooperation or motivation to do his or her best.

In reality, a test that lacks face validity may still be relevant and useful. However, if it is not perceived as relevant and useful by testtakers, parents, legislators, and others, then negative consequences may result. These consequences may range from a poor attitude on the part of the testtaker to lawsuits filed by disgruntled parties against a test user and test publisher.

Content validity describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample. It was one of the first measures of validity.

One method of measuring content validity, developed by C. H. Lawshe, is essentially a method for gauging agreement among raters or judges regarding how essential a particular item is. Lawshe (1975) proposed that each rater respond to the following question for each item: "Is the skill or knowledge measured by this item essential, useful but not essential, or not necessary to the performance of the job?" According to Lawshe, if more than half the panelists indicate that an item is essential, that item has at least some content validity. Greater levels of content validity exist as larger numbers of panelists agree that a particular item is essential. Using these assumptions, Lawshe developed a formula termed the content validity ratio (CVR):

CVR = (n_e - N/2) / (N/2)

where CVR = content validity ratio, n_e = number of panelists indicating "essential," and N = total number of panelists.
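To make the arithmetic concrete, here is a minimal Python sketch of the CVR computation (the function name and the panel numbers are illustrative, not from the text):

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR: (n_e - N/2) / (N/2)."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical item rated by a panel of 10 judges, 8 of whom say "essential":
# CVR = (8 - 5) / 5 = 0.60
print(content_validity_ratio(8, 10))   # 0.6
print(content_validity_ratio(5, 10))   # 0.0  (exactly half say "essential")
print(content_validity_ratio(2, 10))   # -0.6 (fewer than half say "essential")
```

As the formula implies, CVR ranges from -1 to +1: it is 0 when exactly half the panelists rate the item essential, positive when more than half do, and negative when fewer than half do.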
Culture exerts its influence on aspects of test construction, scoring, interpretation, and validation. The influence of culture thus extends to judgments concerning the validity of tests and test items.

Criterion-Related Validity

Criterion-related validity is a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest, the measure of interest being the criterion. The criterion is decided depending on the purpose of the test, the research question, and so on.

Two types of Criterion-Related Validity:
1. Concurrent validity is an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently).
2. Predictive validity is an index of the degree to which a test score predicts some criterion measure (future performance).

What is a Criterion?

Here, in the context of our discussion of criterion-related validity, we will define a criterion just a bit more narrowly as the standard against which a test or a test score is evaluated.

Characteristics:
- An adequate criterion is relevant. By this we mean that it is pertinent or applicable to the matter at hand.
- An adequate criterion measure must also be valid for the purpose for which it is being used. If one test (X) is being used as the criterion to validate a second test (Y), then evidence should exist that test X is valid.
- Ideally, a criterion is also uncontaminated. Criterion contamination is the term applied to a criterion measure that has been based, at least in part, on predictor measures.

Concurrent Validity

If test scores are obtained at about the same time that the criterion measures are obtained, measures of the relationship between the test scores and the criterion provide evidence of concurrent validity.

Statements of concurrent validity indicate the extent to which test scores may be used to estimate an individual's present standing on a criterion. In general, once the validity of the inference from the test scores is established, the test may provide a faster, less expensive way to offer a diagnosis or a classification decision. A test with satisfactorily demonstrated concurrent validity may therefore be appealing to prospective users because it holds out the potential of savings of money and professional time.

Predictive Validity

Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place. The intervening event may take varied forms, such as training, experience, therapy, medication, or simply the passage of time.

Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test; that is, how accurately scores on the test predict some criterion measure.

In settings where tests might be employed, such as a personnel agency, a college admissions office, or a warden's office, a test's high predictive validity can be a useful aid to decision makers who must select successful students, productive workers, or good parole risks. Whether a test result is valuable in decision making depends on how well the test results improve selection decisions over decisions made without knowledge of test results.

Judgments of criterion-related validity, whether concurrent or predictive, are based on two types of statistical evidence: the validity coefficient and expectancy data.

The validity coefficient is a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure.

The correlation coefficient computed from a score (or classification) on a psychodiagnostic test and the criterion score (or classification) assigned by psychodiagnosticians is one example of a validity coefficient.
Typically, the Pearson correlation coefficient is used to determine the validity between the two measures. However, depending on variables such as the type of data, the sample size, and the shape of the distribution, other correlation coefficients could be used.
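As a sketch, the computation might look like this in Python (the scores below are hypothetical):

```python
import numpy as np

# Hypothetical data: test scores and criterion scores for 8 testtakers
test_scores = np.array([12, 15, 9, 20, 17, 11, 14, 18])
criterion_scores = np.array([3.1, 3.4, 2.5, 3.9, 3.6, 2.8, 3.2, 3.7])

# The validity coefficient as a Pearson r between test and criterion
r = np.corrcoef(test_scores, criterion_scores)[0, 1]
print(f"validity coefficient r = {r:.2f}")
```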
The validity coefficient is affected by restriction or inflation of range. And as in other correlational studies, a key issue is whether the range of scores employed is appropriate to the objective of the correlational analysis. The problem of restricted range can also occur through a self-selection process in the sample employed for the validation study.
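A small simulation (all values hypothetical) can illustrate how restricting the range of the predictor attenuates the observed correlation, as when only selected applicants appear in a validation study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a predictor and a criterion that correlate about .60 in the population
n = 1000
predictor = rng.normal(size=n)
criterion = 0.6 * predictor + 0.8 * rng.normal(size=n)

r_full = np.corrcoef(predictor, criterion)[0, 1]

# Restrict the range: keep only cases above the predictor median,
# mimicking self-selection into the validation sample
selected = predictor > np.median(predictor)
r_restricted = np.corrcoef(predictor[selected], criterion[selected])[0, 1]

print(f"r in full sample:       {r_full:.2f}")        # about .60
print(f"r in restricted sample: {r_restricted:.2f}")  # noticeably smaller
```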
Whereas it is the responsibility of the test developer to report validation data in the test manual, it is the responsibility of test users to read carefully the description of the validation study and then to evaluate the suitability of the test for their specific purposes.

How high should a validity coefficient be for a user or a test developer to infer that the test is valid? There are no rules for determining the minimum acceptable size of a validity coefficient. In fact, Cronbach and Gleser (1965) cautioned against the establishment of such rules. They argued that validity coefficients need to be large enough to enable the test user to make accurate decisions within the unique context in which a test is being used. Essentially, the validity coefficient should be high enough to result in the identification and differentiation of testtakers with respect to target attribute(s), such as employees who are likely to be more productive, police officers who are less likely to misuse their weapons, and students who are more likely to be successful in a particular course of study.
Test users involved in predicting some criterion from test scores are often interested in the utility of multiple predictors. The value of including more than one predictor depends on a couple of factors. First, of course, each measure used as a predictor should have criterion-related predictive validity. Second, additional predictors should possess incremental validity, defined here as the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.

One approach, employing the principles of incremental validity, is to start with the best predictor: the predictor that is most highly correlated with the criterion (here, GPA). This may be time spent studying. Then, using multiple regression techniques, one would examine the usefulness of the other predictors.
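A rough sketch of that approach (hypothetical data; scikit-learn's LinearRegression stands in for whatever regression software one might actually use) compares how much variance in GPA a second predictor explains beyond the first:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200

# Hypothetical predictors of GPA
study_hours = rng.normal(10, 3, n)                     # best single predictor
test_score = 0.5 * study_hours + rng.normal(0, 3, n)   # overlaps with study hours
gpa = 0.15 * study_hours + 0.05 * test_score + rng.normal(0, 0.4, n)

# R^2 with the best predictor alone
X_one = study_hours.reshape(-1, 1)
r2_one = LinearRegression().fit(X_one, gpa).score(X_one, gpa)

# R^2 after adding the second predictor
X_two = np.column_stack([study_hours, test_score])
r2_two = LinearRegression().fit(X_two, gpa).score(X_two, gpa)

print(f"R^2, study hours only:       {r2_one:.3f}")
print(f"R^2, plus test score:        {r2_two:.3f}")
print(f"incremental validity (gain): {r2_two - r2_one:.3f}")
```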
Expectancy data provide information that can be used in evaluating the criterion-related validity of a test. Using a score obtained on some test(s) or measure(s), expectancy tables illustrate the likelihood that the testtaker will score within some interval of scores on a criterion measure, an interval that may be seen as "passing," "acceptable," and so on.

An expectancy table shows the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion (for example, placed in the "passed" category or the "failed" category).

An expectancy chart is the graphic representation of an expectancy table.
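A minimal sketch of tabulating expectancy data with pandas (the score intervals and pass/fail outcomes are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300

# Hypothetical test scores and a pass/fail criterion related to them
scores = rng.integers(40, 100, n)
passed = rng.random(n) < (scores - 40) / 60  # higher score -> higher pass chance

df = pd.DataFrame({
    "score_interval": pd.cut(scores, bins=[39, 54, 69, 84, 99]),
    "criterion": np.where(passed, "passed", "failed"),
})

# Expectancy table: percentage passing/failing within each test-score interval
table = pd.crosstab(df["score_interval"], df["criterion"], normalize="index") * 100
print(table.round(1))
```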
Taylor-Russell tables provide an estimate of the extent to which inclusion of a particular test in the selection system will actually improve selection.

One limitation of the Taylor-Russell tables is that the relationship between the predictor (the test) and the criterion (rating of performance on the job) must be linear. Another limitation is the potential difficulty of identifying a criterion score that separates "successful" from "unsuccessful" employees. The potential problems of the Taylor-Russell tables were avoided by an alternative set of tables that provided an indication of the difference in average criterion scores for the selected group as compared with the original group.

Use of the Naylor-Shine tables entails obtaining the difference between the means of the selected and unselected groups to derive an index of what the test (or some other tool of assessment) is adding to already established procedures.
Both the Taylor-Russell and the Naylor-Shine tables can assist in judging the utility of a particular test, the former (Taylor-Russell) by determining the increase over current procedures and the latter (Naylor-Shine) by determining the increase in average score on some criterion measure. With both tables, the validity coefficient used must be one obtained by concurrent validation procedures, a fact that should not be surprising because it is obtained with respect to current employees hired by the selection process in effect at the time of the study.
Decision Theory and Test Utility

Perhaps the most oft-cited application of statistical decision theory to the field of psychological testing is Cronbach and Gleser's Psychological Tests and Personnel Decisions.

Stated generally, Cronbach and Gleser (1965) presented
(1) a classification of decision problems;
(2) various selection strategies ranging from single-stage processes to sequential analyses;
(3) a quantitative analysis of the relationship between test utility, the selection ratio, cost of the testing program, and expected value of the outcome; and
(4) a recommendation that in some instances job requirements be tailored to the applicant's ability instead of the other way around (a concept they refer to as adaptive treatment).
Generally, a base rate is the extent to which a particular trait, behavior, characteristic, or attribute exists in the population (expressed as a proportion). A hit rate is the proportion of people a test accurately identifies with respect to the characteristic being measured; a miss rate is the proportion it identifies inaccurately, and misses are categorized as false positives or false negatives. A false positive is a miss wherein the test predicted that the testtaker did possess the particular characteristic or attribute being measured when in fact the testtaker did not. A false negative is a miss wherein the test predicted that the testtaker did not possess the particular characteristic or attribute being measured when the testtaker actually did.

A test is obviously of no value if the hit rate is higher without using it. One measure of the value of a test lies in the extent to which its use improves on the hit rate that exists without its use. Decision theory provides guidelines for setting optimal cutoff scores. In setting such scores, the relative seriousness of making false-positive or false-negative selection decisions is frequently considered.
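To keep these terms straight, a small sketch (hypothetical screening data) tallies the quantities from predicted and actual classifications:

```python
# Hypothetical outcomes: did the test predict the characteristic, and was it present?
predictions = [True, True, False, False, True, False, True, False, False, True]
actual =      [True, False, False, True, True, False, True, False, False, False]

hits = sum(p == a for p, a in zip(predictions, actual))
false_positives = sum(p and not a for p, a in zip(predictions, actual))
false_negatives = sum(a and not p for p, a in zip(predictions, actual))

print(f"base rate:       {sum(actual) / len(actual):.2f}")  # proportion with the attribute
print(f"hit rate:        {hits / len(actual):.2f}")
print(f"false positives: {false_positives}")
print(f"false negatives: {false_negatives}")
```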
Construct Validity

Construct validity is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct; that is, the degree to which a test measures the theoretical construct it is intended to measure.

One way to gather such evidence is to compare the test with other measures of the same or a different construct in order to confirm or falsify the theory. Another way is to compare the score on a single item with the total test score to see whether there is a correlation.

A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior. It cannot be directly observed but is inferred from behavior. Constructs are unobservable, presupposed (underlying) traits that a test developer may invoke to describe test behavior or criterion performance.

One possible reason for obtaining results contrary to those predicted by the theory is that the test simply does not measure the construct. An alternative explanation could lie in the theory that generated hypotheses about the construct. The theory may need to be reexamined. Although confirming evidence contributes to a judgment that a test is a valid measure of a construct, evidence to the contrary can also be useful.

Increasingly, construct validity has been viewed as the unifying concept for all validity evidence.

Evidence of Construct Validity

A number of procedures may be used to provide different kinds of evidence that a test has construct validity. The various techniques of construct validation may provide evidence, for example, that:
- The test is homogeneous, measuring a single construct.
- Test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation as theoretically predicted.
- Test scores obtained after some event or the mere passage of time (that is, posttest scores) differ from pretest scores as theoretically predicted.
- Test scores obtained by people from distinct groups vary as predicted by the theory.
- Test scores correlate with scores on other tests in accordance with what would be predicted from a theory that covers the manifestation of the construct in question.

Homogeneity refers to how uniform a test is in measuring a single concept. A test developer can increase test homogeneity in several ways.
One item-analysis procedure focuses on the relationship between testtakers' scores on individual items and their score on the entire test.

One way a test developer can improve the homogeneity of a test containing items that are scored dichotomously (for example, true-false) is by eliminating items that do not show significant correlation coefficients with total test scores. If all test items show significant, positive correlations with total test scores, and if high scorers on the test tend to pass each item more than low scorers do, then each item is probably measuring the same construct as the total test. Each item is contributing to test homogeneity. Coefficient alpha may also be used in estimating the homogeneity of a test composed of multiple-choice items.

If it is an academic test, and if high scorers on the entire test for some reason tended to get a particular item wrong while low scorers on the test as a whole tended to get the item right, the item is obviously not a good one. The item should be eliminated in the interest of test homogeneity, among other considerations.
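A sketch of this item-analysis procedure on hypothetical dichotomous data, using the corrected item-total correlation (each item is excluded from the total it is correlated with; the 0.2 cutoff is illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
n_people, n_items = 100, 5

# Hypothetical 0/1 (wrong/right) responses; items 0-3 track a common ability,
# item 4 is pure noise and should show a weak item-total correlation
ability = rng.normal(size=n_people)
responses = np.zeros((n_people, n_items), dtype=int)
for j in range(4):
    responses[:, j] = (ability + rng.normal(0, 1, n_people) > 0).astype(int)
responses[:, 4] = rng.integers(0, 2, n_people)

total = responses.sum(axis=1)
for j in range(n_items):
    rest = total - responses[:, j]  # corrected total: exclude the item itself
    r = np.corrcoef(responses[:, j], rest)[0, 1]
    flag = "  <- candidate for elimination" if r < 0.2 else ""
    print(f"item {j}: item-total r = {r:.2f}{flag}")
```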
If a test score purports to be a measure of a construct that could be expected to change over time, then the test score, too, should show the same progressive changes with age to be considered a valid measure of the construct.

Evidence that test scores change as a result of some experience between a pretest and a posttest can be evidence of construct validity. Some of the more typical intervening experiences responsible for changes in test scores are formal education, a course of therapy or medication, and on-the-job experience.

The method of contrasted groups is one way of providing evidence for the validity of a test: demonstrating that scores on the test vary in a predictable way as a function of membership in some group.

The rationale here is that if a test is a valid measure of a particular construct, then groups of people presumed to differ with respect to that construct should have correspondingly different test scores.
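As an illustration of the method (hypothetical data; an independent-samples t-test is one common way to compare the groups, though the text does not prescribe a specific statistic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical scores on a new anxiety measure for two contrasted groups
clinical_group = rng.normal(65, 10, 40)  # presumed high on the construct
control_group = rng.normal(50, 10, 40)   # presumed low on the construct

t, p = stats.ttest_ind(clinical_group, control_group)
print(f"clinical mean = {clinical_group.mean():.1f}, "
      f"control mean = {control_group.mean():.1f}")
print(f"t = {t:.2f}, p = {p:.4f}")  # a predicted, significant difference
```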
If scores on the test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established, and already validated tests designed to measure the same (or a similar) construct, this would be an example of convergent evidence. Convergent evidence for validity may come not only from correlations with tests purporting to measure an identical construct but also from correlations with measures purporting to measure related constructs.

A validity coefficient showing little (that is, a statistically insignificant) relationship between test scores and other variables with which scores on the test being construct-validated should not theoretically be correlated provides discriminant evidence of construct validity (also known as discriminant validity).

Data indicating that a test measures the same construct as other tests purporting to measure the same construct are also referred to as evidence of convergent validity: the test scores correlate with other measures of the same construct.

One rather technical procedure for examining both kinds of evidence at once is the multitrait-multimethod matrix, the matrix or table that results from correlating variables (traits) within and between methods. Multitrait means "two or more traits" and multimethod means "two or more methods."
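A toy illustration of the expected convergent and discriminant pattern (all measures simulated): a new test should correlate highly with an established test of the same construct and near zero with a measure of an unrelated construct:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 150

construct = rng.normal(size=n)                   # the latent trait
new_test = construct + rng.normal(0, 0.5, n)     # test under validation
established = construct + rng.normal(0, 0.5, n)  # validated test, same construct
unrelated = rng.normal(size=n)                   # measure of a different construct

r_convergent = np.corrcoef(new_test, established)[0, 1]
r_discriminant = np.corrcoef(new_test, unrelated)[0, 1]
print(f"convergent r (same construct):        {r_convergent:.2f}")   # high
print(f"discriminant r (different construct): {r_discriminant:.2f}")  # near zero
```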
Both convergent and discriminant evidence of construct validity can be obtained by the use of factor analysis. Factor analysis is a shorthand term for a class of mathematical procedures designed to identify factors, or specific variables that are typically attributes, characteristics, or dimensions on which people may differ. In psychometric research, factor analysis is frequently employed as a data reduction method in which several sets of scores and the correlations between them are analyzed. The purpose of the factor analysis may be to identify the factor or factors in common between test scores on subscales within a particular test, or the factors in common between scores on a series of tests. In general, factor analysis is conducted on either an exploratory or a confirmatory basis.

Exploratory factor analysis typically entails "estimating or extracting factors; deciding how many factors to retain; and rotating factors to an interpretable orientation." By contrast, in confirmatory factor analysis, "a factor structure is explicitly hypothesized and is tested for its fit with the observed covariance structure of the measured variables."

A term commonly employed in factor analysis is factor loading, which is "a sort of metaphor. Each test is thought of as a vehicle carrying a certain amount of one or more abilities." Factor loading in a test conveys information about the extent to which the factor determines the test score or scores.

Factor analysis frequently involves technical procedures so complex that few contemporary researchers would attempt to conduct one without the aid of a prepackaged computer program.
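In that spirit, a minimal sketch using one such prepackaged routine (scikit-learn's FactorAnalysis; the six "subtests" are simulated and driven by two underlying abilities):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(6)
n = 500

# Simulate six subtest scores driven by two underlying abilities
verbal = rng.normal(size=n)
spatial = rng.normal(size=n)
scores = np.column_stack([
    verbal + rng.normal(0, 0.5, n),   # vocabulary
    verbal + rng.normal(0, 0.5, n),   # comprehension
    verbal + rng.normal(0, 0.5, n),   # similarities
    spatial + rng.normal(0, 0.5, n),  # block design
    spatial + rng.normal(0, 0.5, n),  # matrix reasoning
    spatial + rng.normal(0, 0.5, n),  # puzzles
])

fa = FactorAnalysis(n_components=2).fit(scores)
print(np.round(fa.components_, 2))  # factor loadings: each row is a factor,
                                    # each column a subtest it "carries"
```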
Validity, Bias, and Fairness

Test Bias

For psychometricians, bias is a factor inherent in a test that systematically prevents accurate, impartial measurement. Systematic is a key word in our definition of test bias. Bias implies systematic variation.

To begin with, three characteristics of the regression lines used to predict success on the criterion would have to be scrutinized: (1) the slope, (2) the intercept, and (3) the error of estimate.

Intercept bias is a term derived from the point where the regression line intersects the Y-axis; a test has intercept bias if the intercept of one group's regression line differs significantly from that of another group's. If a test systematically yields significantly different validity coefficients for members of different groups, then it has what is known as slope bias, so named because the slope of one group's regression line is different in a statistically significant way from the slope of another group's regression line.
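A sketch of how slope and intercept might be examined across groups (hypothetical data; np.polyfit fits each group's least-squares line):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Hypothetical test scores and job-performance criterion for two groups
x_a = rng.normal(50, 10, n)
y_a = 0.8 * x_a + 5 + rng.normal(0, 8, n)   # group A regression line
x_b = rng.normal(50, 10, n)
y_b = 0.5 * x_b + 15 + rng.normal(0, 8, n)  # group B: different slope/intercept

for label, x, y in [("A", x_a, y_a), ("B", x_b, y_b)]:
    slope, intercept = np.polyfit(x, y, 1)  # least-squares line for the group
    print(f"group {label}: slope = {slope:.2f}, intercept = {intercept:.2f}")
# Systematically different slopes suggest slope bias; different intercepts
# with similar slopes suggest intercept bias.
```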
One reason some tests have been found to be biased has more to do with the design of the research study than the design of the test.

A rating is a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptors known as a rating scale. Simply stated, a rating error is a judgment resulting from the intentional or unintentional misuse of a rating scale. A leniency error (also known as a generosity error) is, as its name implies, an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading. At the other extreme is a severity error. Another type of error might be termed a central tendency error, in which the rater's ratings tend to cluster in the middle of the rating continuum, avoiding both extremes.
Test Fairness