
Chapter 6: Validity

Validity, as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context.

An inference is a logical result or deduction.

Characterizations of the validity of tests and test scores are frequently phrased in terms such as "acceptable" or "weak." Inherent in a judgment of an instrument's validity is a judgment of how useful it is for a particular purpose with a particular population of people. However, what is really meant is that the test has been shown to be valid for a particular use with a particular population of testtakers at a particular time.

No test or measurement technique is "universally valid" for all time, for all uses, with all types of testtaker populations. Rather, tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage. If those boundaries are exceeded, the validity of the test may be called into question. The validity of a test must be proven again from time to time.

Validation is the process of gathering and evaluating evidence about validity.

It is the test developer's responsibility to supply validity evidence in the test manual. It may sometimes be appropriate for test users to conduct their own validation studies with their own groups of testtakers.

Local validation studies are absolutely necessary when the test user plans to alter in some way the format, instructions, language, or content of the test. Local validation studies would also be necessary if a test user sought to use a test with a population of testtakers that differed in some significant way from the population on which the test was standardized.

One way measurement specialists have traditionally conceptualized validity is according to three categories of evidence:
1. Content validity
2. Criterion-related validity
3. Construct validity

In this classic conception of validity, referred to as the trinitarian view, it might be useful to visualize construct validity as being "umbrella validity," since every other variety of validity falls under it.

Three approaches to assessing validity—associated, respectively, with content validity, criterion-related validity, and construct validity—are
1. Scrutinizing the test's content
2. Relating scores obtained on the test to other test scores or other measures
3. Executing a comprehensive analysis of
   a. How scores on the test relate to other test scores and measures
   b. How scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure

These three approaches to validity assessment are not mutually exclusive. Each should be thought of as one type of evidence that, with others, contributes to a judgment concerning the validity of the test. All three types of validity evidence contribute to a unified picture of a test's validity, though a test user may not need to know about all three. Depending on the use to which a test is being put, the three types of validity evidence may not be equally relevant.

The trinitarian model of validity is not without its critics. Messick (1995), for example, condemned this approach as fragmented and incomplete. He called for a unitary view of validity, one that considers everything from the implications of test scores in terms of societal values to the consequences of test use.

Face validity relates more to what a test appears to measure to the person being tested than to what the test actually measures. Face validity is a judgment concerning how relevant the test items appear to be. It is not statistically derived and is not an official measure of validity.

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik
A paper-and-pencil personality test labeled The Introversion/Extraversion Test, with items that ask respondents whether they have acted in an introverted or an extraverted way in particular situations, may be perceived by respondents as a highly face-valid test. On the other hand, a personality test in which respondents are asked to report what they see in inkblots may be perceived as a test with low face validity. A test's lack of face validity could contribute to a lack of confidence in the perceived effectiveness of the test—with a consequential decrease in the testtaker's cooperation or motivation to do his or her best.

In reality, a test that lacks face validity may still be relevant and useful. However, if it is not perceived as such by testtakers, parents, legislators, and others, then negative consequences may result. These consequences may range from a poor attitude on the part of the testtaker to lawsuits filed by disgruntled parties against a test user and test publisher.

Content validity describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample. It was one of the first measures of validity. Greater content validity means better representation of the subject. The evidence typically involves subject matter experts (SMEs) evaluating test items against the test specifications—for example, asking several experts to judge each item in a test, usually on a rating scale (1, 2, 3, ...).

With respect to educational achievement tests, it is customary to consider a test a content-valid measure when the proportion of material covered by the test approximates the proportion of material covered in the course.

From the pooled information (along with the judgment of the test developer), a test blueprint emerges for the "structure" of the evaluation; that is, a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, and so forth.

The Quantification of Content Validity

One method of measuring content validity, developed by C. H. Lawshe, is essentially a method for gauging agreement among raters or judges regarding how essential a particular item is. Lawshe (1975) proposed that each rater respond to the following question for each item: "Is the skill or knowledge measured by this item
• Essential
• Useful but not essential
• Not necessary
to the performance of the job?"

According to Lawshe, if more than half the panelists indicate that an item is essential, that item has at least some content validity. Greater levels of content validity exist as larger numbers of panelists agree that a particular item is essential. Using these assumptions, Lawshe developed a formula termed the content validity ratio (CVR):

    CVR = (n_e - N/2) / (N/2)

where n_e = the number of panelists indicating "essential" and N = the total number of panelists.

1. Negative CVR: When fewer than half the panelists indicate "essential," the CVR is negative.
2. Zero CVR: When exactly half the panelists indicate "essential," the CVR is zero.
3. Positive CVR: When more than half but not all the panelists indicate "essential," the CVR ranges between .00 and .99.

In validating a test, the content validity ratio is calculated for each item. Lawshe recommended that if the amount of agreement observed is more than 5% likely to occur by chance, then the item should be eliminated.

Culture and the Relativity of Content Validity

Tests are often thought of as either valid or not valid. A history test, for example, either does or does not accurately measure one's knowledge of historical fact. However, it is also true that what constitutes historical fact depends to some extent on who is writing the history.
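Lawshe's CVR formula is straightforward to compute. A minimal Python sketch, with a hypothetical panel of ten raters:

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR: ranges from -1.0 (no panelist rates the item
    'essential') to +1.0 (every panelist rates it 'essential')."""
    return (n_essential - n_panelists / 2) / (n_panelists / 2)

# Hypothetical item: 9 of 10 panelists rate it essential.
print(content_validity_ratio(9, 10))  # 0.8
```

A CVR of 0.8 reflects strong agreement; an item rated essential by exactly half the panel would score 0.0.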

This brings into stark relief the influence of culture on what is taught to students as well as on aspects of test construction, scoring, interpretation, and validation. The influence of culture thus extends to judgments concerning the validity of tests and test items.

Criterion-Related Validity

Criterion-related validity is a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest—the measure of interest being the criterion. The criterion is decided depending on the purpose of the test, the research, and so on.

Two types of criterion-related validity:
1. Concurrent validity is an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently).
2. Predictive validity is an index of the degree to which a test score predicts some criterion measure (future performance).

What Is a Criterion?

Here, in the context of our discussion of criterion-related validity, we will define a criterion just a bit more narrowly as the standard against which a test or a test score is evaluated.

Characteristics of a criterion:
• An adequate criterion is relevant. By this we mean that it is pertinent or applicable to the matter at hand.
• An adequate criterion measure must also be valid for the purpose for which it is being used. If one test (X) is being used as the criterion to validate a second test (Y), then evidence should exist that test X is valid.
• Ideally, a criterion is also uncontaminated. Criterion contamination is the term applied to a criterion measure that has been based, at least in part, on predictor measures.

Concurrent Validity

If test scores are obtained at about the same time that the criterion measures are obtained, measures of the relationship between the test scores and the criterion provide evidence of concurrent validity.

Statements of concurrent validity indicate the extent to which test scores may be used to estimate an individual's present standing on a criterion. In general, once the validity of the inference from the test scores is established, the test may provide a faster, less expensive way to offer a diagnosis or a classification decision. A test with satisfactorily demonstrated concurrent validity may therefore be appealing to prospective users because it holds out the potential of savings of money and professional time.

Predictive Validity

Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place. The intervening event may take varied forms, such as training, experience, therapy, medication, or simply the passage of time.

Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test; that is, how accurately scores on the test predict some criterion measure.

In settings where tests might be employed—such as a personnel agency, a college admissions office, or a warden's office—a test's high predictive validity can be a useful aid to decision makers who must select successful students, productive workers, or good parole risks. Whether a test result is valuable in decision making depends on how well the test results improve selection decisions over decisions made without knowledge of test results.

Judgments of criterion-related validity, whether concurrent or predictive, are based on two types of statistical evidence: the validity coefficient and expectancy data.

• The validity coefficient is a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure.

The correlation coefficient computed from a score (or classification) on a psychodiagnostic test and the criterion score (or classification) assigned by psychodiagnosticians is one example of a validity coefficient. Typically, the Pearson correlation coefficient is used to determine the validity between the two measures. However, depending on variables such as the type of data, the sample size, and the shape of the distribution, other correlation coefficients could be used.

The validity coefficient is affected by restriction or inflation of range. As in other correlational studies, a key issue is whether the range of scores employed is appropriate to the objective of the correlational analysis. The problem of restricted range can also occur through a self-selection process in the sample employed for the validation study.

Whereas it is the responsibility of the test developer to report validation data in the test manual, it is the responsibility of test users to read carefully the description of the validation study and then to evaluate the suitability of the test for their specific purposes.

How high should a validity coefficient be for a user or a test developer to infer that the test is valid? There are no rules for determining the minimum acceptable size of a validity coefficient. In fact, Cronbach and Gleser (1965) cautioned against the establishment of such rules. They argued that validity coefficients need to be large enough to enable the test user to make accurate decisions within the unique context in which a test is being used. Essentially, the validity coefficient should be high enough to result in the identification and differentiation of testtakers with respect to the target attribute(s), such as employees who are likely to be more productive, police officers who are less likely to misuse their weapons, and students who are more likely to be successful in a particular course of study.

Test users involved in predicting some criterion from test scores are often interested in the utility of multiple predictors. The value of including more than one predictor depends on a couple of factors. First, of course, each measure used as a predictor should have criterion-related predictive validity. Second, additional predictors should possess incremental validity, defined here as the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.

One approach, employing the principles of incremental validity, is to start with the best predictor: the predictor that is most highly correlated with the criterion (for example, GPA). This may be time spent studying. Then, using multiple regression techniques, one would examine the usefulness of the other predictors.

• Expectancy data provide information that can be used in evaluating the criterion-related validity of a test. Using a score obtained on some test(s) or measure(s), expectancy tables illustrate the likelihood that the testtaker will score within some interval of scores on a criterion measure—an interval that may be seen as "passing," "acceptable," and so on.

An expectancy table shows the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion (for example, placed in the "passed" category or the "failed" category).

An expectancy chart is the graphic representation of an expectancy table.

Taylor-Russell tables provide an estimate of the extent to which inclusion of a particular test in the selection system will actually improve selection.

One limitation of the Taylor-Russell tables is that the relationship between the predictor (the test) and the criterion (rating of performance on the job) must be linear. Another limitation is the potential difficulty of identifying a criterion score that separates "successful" from "unsuccessful" employees. The potential problems of the Taylor-Russell tables were avoided by an alternative set of tables that provided an indication of the difference in average criterion scores for the selected group as compared with the original group.

Use of the Naylor-Shine tables entails obtaining the difference between the means of the selected and unselected groups to derive an index of what the test (or some other tool of assessment) is adding to already established procedures.
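The construction of an expectancy table can be sketched in a few lines of Python; the score intervals and pass/fail outcomes below are invented for illustration:

```python
from collections import Counter

# Hypothetical records: (test-score interval, criterion outcome).
records = [
    ("80-100", "passed"), ("80-100", "passed"), ("80-100", "failed"),
    ("60-79",  "passed"), ("60-79",  "failed"), ("60-79",  "failed"),
    ("0-59",   "failed"), ("0-59",   "failed"), ("0-59",   "passed"),
]

counts = Counter(records)
intervals = ["80-100", "60-79", "0-59"]
for interval in intervals:
    total = sum(v for (i, _), v in counts.items() if i == interval)
    passed = counts[(interval, "passed")]
    print(f"{interval}: {100 * passed / total:.0f}% passed")
```

Each printed line is one row of the expectancy table: the percentage of people in that score interval who subsequently fell into the "passed" category. Plotting these percentages would give the corresponding expectancy chart.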

Both the Taylor-Russell and the Naylor-Shine tables can assist in judging the utility of a particular test, the former (Taylor-Russell) by determining the increase over current procedures and the latter (Naylor-Shine) by determining the increase in average score on some criterion measure. With both tables, the validity coefficient used must be one obtained by concurrent validation procedures—a fact that should not be surprising because it is obtained with respect to current employees hired by the selection process in effect at the time of the study.

Decision Theory and Test Utility

Perhaps the most oft-cited application of statistical decision theory to the field of psychological testing is Cronbach and Gleser's Psychological Tests and Personnel Decisions. Stated generally, Cronbach and Gleser (1965) presented
(1) a classification of decision problems;
(2) various selection strategies ranging from single-stage processes to sequential analyses;
(3) a quantitative analysis of the relationship between test utility, the selection ratio, cost of the testing program, and expected value of the outcome; and
(4) a recommendation that in some instances job requirements be tailored to the applicant's ability instead of the other way around (a concept they refer to as adaptive treatment).

Generally, a base rate is the extent to which a particular trait, behavior, characteristic, or attribute exists in the population (expressed as a proportion).

In psychometric parlance, a hit rate may be defined as the proportion of people a test accurately identifies as possessing or exhibiting a particular trait, behavior, characteristic, or attribute.

In like fashion, a miss rate may be defined as the proportion of people the test fails to identify as having, or not having, a particular characteristic or attribute. Here, a miss amounts to an inaccurate prediction. The category of misses may be further subdivided. A false positive is a miss wherein the test predicted that the testtaker did possess the particular characteristic or attribute being measured when in fact the testtaker did not. A false negative is a miss wherein the test predicted that the testtaker did not possess the particular characteristic or attribute being measured when the testtaker actually did.

A test is obviously of no value if the hit rate is higher without using it. One measure of the value of a test lies in the extent to which its use improves on the hit rate that exists without its use. Decision theory provides guidelines for setting optimal cutoff scores. In setting such scores, the relative seriousness of making false-positive or false-negative selection decisions is frequently considered.

Construct Validity

Construct validity is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct: the degree to which a test measures the theoretical construct it is intended to measure. One way to assess it is to compare the test to other measures of the same or a different construct, to confirm or falsify the theory. Another way is to compare the score on a single item to the total test score to see whether there is a correlation.

A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior. It cannot be directly observed but is inferred from behavior. Constructs are unobservable, presupposed (underlying) traits that a test developer may invoke to describe test behavior or criterion performance.

The researcher investigating a test's construct validity must formulate hypotheses about the expected behavior of high scorers and low scorers on the test. These hypotheses give rise to a tentative theory about the nature of the construct the test was designed to measure. If the test is a valid measure of the construct, then high scorers and low scorers will behave as predicted by the theory. If high scorers and low scorers on the test do not behave as predicted, the investigator will need to reexamine the nature of the construct itself, or the hypotheses made about it.
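The hit, false-positive, and false-negative bookkeeping described above can be sketched as follows; the cutoff score, test scores, and true statuses are all hypothetical:

```python
# Hypothetical data: each testtaker is classified as having the
# attribute when their score meets the cutoff; `truth` records
# whether they actually had the attribute.
cutoff = 50
scores = [62, 47, 55, 71, 38, 58, 44, 66]
truth  = [True, False, True, True, False, False, True, True]

predicted = [s >= cutoff for s in scores]

hits = sum(p == t for p, t in zip(predicted, truth))
false_positives = sum(p and not t for p, t in zip(predicted, truth))
false_negatives = sum((not p) and t for p, t in zip(predicted, truth))

print(f"hit rate: {hits / len(truth):.2f}")
print(f"false positives: {false_positives}, false negatives: {false_negatives}")
```

Rerunning this with different cutoff values shows the trade-off decision theory formalizes: raising the cutoff trades false positives for false negatives, so the optimal cutoff depends on which kind of miss is more serious.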

One possible reason for obtaining results contrary to those predicted by the theory is that the test simply does not measure the construct. An alternative explanation could lie in the theory that generated the hypotheses about the construct; the theory may need to be reexamined. Although confirming evidence contributes to a judgment that a test is a valid measure of a construct, evidence to the contrary can also be useful.

Increasingly, construct validity has been viewed as the unifying concept for all validity evidence.

Evidence of Construct Validity

A number of procedures may be used to provide different kinds of evidence that a test has construct validity. The various techniques of construct validation may provide evidence, for example, that
• The test is homogeneous, measuring a single construct.
• Test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation as theoretically predicted.
• Test scores obtained after some event or the mere passage of time (that is, posttest scores) differ from pretest scores as theoretically predicted.
• Test scores obtained by people from distinct groups vary as predicted by the theory.
• Test scores correlate with scores on other tests in accordance with what would be predicted from a theory that covers the manifestation of the construct in question.

Homogeneity refers to how uniform a test is in measuring a single concept. A test developer can increase test homogeneity in several ways.

One item-analysis procedure focuses on the relationship between testtakers' scores on individual items and their score on the entire test.

One way a test developer can improve the homogeneity of a test containing items that are scored dichotomously (for example, true–false) is by eliminating items that do not show significant correlation coefficients with total test scores. If all test items show significant, positive correlations with total test scores and if high scorers on the test tend to pass each item more than low scorers do, then each item is probably measuring the same construct as the total test, and each item is contributing to test homogeneity. Coefficient alpha may also be used in estimating the homogeneity of a test composed of multiple-choice items.

If it is an academic test and high scorers on the entire test for some reason tended to get a particular item wrong while low scorers on the test as a whole tended to get the item right, the item is obviously not a good one. The item should be eliminated in the interest of test homogeneity, among other considerations.

If a test score purports to be a measure of a construct that could be expected to change over time, then the test score, too, should show the same progressive changes with age to be considered a valid measure of the construct.

Evidence that test scores change as a result of some experience between a pretest and a posttest can be evidence of construct validity. Some of the more typical intervening experiences responsible for changes in test scores are formal education, a course of therapy or medication, and on-the-job experience.

The method of contrasted groups is one way of providing evidence for the validity of a test: demonstrating that scores on the test vary in a predictable way as a function of membership in some group. The rationale here is that if a test is a valid measure of a particular construct, then groups of people who would be presumed to differ with respect to that construct should have correspondingly different test scores.

If scores on the test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established, and already validated tests designed to measure the same (or a similar) construct, this would be an example of convergent evidence.

Convergent evidence for validity may come not only from correlations with tests purporting to measure an identical construct but also from correlations with measures purporting to measure related constructs.

A validity coefficient showing little (that is, a statistically insignificant) relationship between test scores and other variables with which scores on the test being construct-validated should not theoretically be correlated provides discriminant evidence of construct validity (also known as discriminant validity).

Data indicating that a test measures the same construct as other tests purporting to measure the same construct are also referred to as evidence of convergent validity: the test scores correlate with other measures of the same construct.

A rather technical procedure, the multitrait-multimethod matrix, is the matrix or table that results from correlating variables (traits) within and between methods. Multitrait means "two or more traits" and multimethod means "two or more methods."

Both convergent and discriminant evidence of construct validity can be obtained by the use of factor analysis. Factor analysis is a shorthand term for a class of mathematical procedures designed to identify factors: specific variables that are typically attributes, characteristics, or dimensions on which people may differ. In psychometric research, factor analysis is frequently employed as a data reduction method in which several sets of scores and the correlations between them are analyzed. The purpose of the factor analysis may be to identify the factor or factors in common between test scores on subscales within a particular test, or the factors in common between scores on a series of tests. In general, factor analysis is conducted on either an exploratory or a confirmatory basis.

Exploratory factor analysis typically entails "estimating or extracting factors; deciding how many factors to retain; and rotating factors to an interpretable orientation." By contrast, in confirmatory factor analysis, "a factor structure is explicitly hypothesized and is tested for its fit with the observed covariance structure of the measured variables." A term commonly employed in factor analysis is factor loading, which is "a sort of metaphor. Each test is thought of as a vehicle carrying a certain amount of one or more abilities." The factor loading in a test conveys information about the extent to which the factor determines the test score or scores.

Factor analysis frequently involves technical procedures so complex that few contemporary researchers would attempt to conduct one without the aid of a prepackaged computer program.

Validity, Bias, and Fairness

Test Bias

For psychometricians, bias is a factor inherent in a test that systematically prevents accurate, impartial measurement. Systematic is a key word in our definition of test bias: bias implies systematic variation.

To begin with, three characteristics of the regression lines used to predict success on the criterion would have to be scrutinized: (1) the slope, (2) the intercept, and (3) the error of estimate.

Intercept bias is a term derived from the point where the regression line intersects the Y-axis. If a test systematically yields significantly different validity coefficients for members of different groups, then it has what is known as slope bias—so named because the slope of one group's regression line differs in a statistically significant way from the slope of another group's regression line.

One reason some tests have been found to be biased has more to do with the design of the research study than the design of the test.

A rating is a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptors known as a rating scale. Simply stated, a rating error is a judgment resulting from the intentional or unintentional misuse of a rating scale. A leniency error (also known as a generosity error) is, as its name implies, an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading. At the other extreme is a severity error.

Another type of error might be termed a central tendency error.

One way to overcome what might be termed restriction-of-range rating errors (central tendency, leniency, and severity errors) is to use rankings, a procedure that requires the rater to measure individuals against one another instead of against an absolute scale.
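Converting one rater's absolute ratings into rankings can be sketched briefly; the names and ratings below are hypothetical, from a lenient rater whose ratings all cluster near the top of the scale:

```python
# A lenient rater's absolute ratings on a 1-10 scale (all high),
# converted into rankings so the ratees are still spread out.
ratings = {"Avi": 9, "Bea": 10, "Cal": 9.5, "Dee": 8.5}

# Rank ratees against one another (1 = highest rated).
ranked = sorted(ratings, key=ratings.get, reverse=True)
for rank, name in enumerate(ranked, start=1):
    print(f"{rank}. {name}")
```

Even though the absolute ratings are compressed into the top of the scale, the rankings still discriminate among the four ratees.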

Halo effect describes the fact that, for some raters, some ratees can do no wrong. More specifically, a halo effect may also be defined as a tendency to give a particular ratee a higher rating than he or she objectively deserves because of the rater's failure to discriminate among conceptually distinct and potentially independent aspects of a ratee's behavior. Except in highly integrated situations, ratees tend to receive higher ratings from raters of the same race.

Test Fairness

Issues of test fairness tend to be rooted more in thorny issues involving values than in statistics. Fairness, in a psychometric context, is the extent to which a test is used in an impartial, just, and equitable way.

Another misunderstanding of what constitutes an unfair or biased test is that it is unfair to administer to a particular population a standardized test that did not include members of that population in the standardization sample.

A final source of misunderstanding is the complex problem of remedying situations where bias or unfair test usage has been found to occur.

