
Psychological Assessment

Reliability, Validity, Utility


Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013)
Reliability
o Dependability or consistency
o Consistency of the instrument
o A test may be reliable in one context and unreliable in another
o Reliability Coefficient – an index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance
o Classical Test Theory – a score on an ability test is presumed to reflect not only the testtaker's true score on the ability being measured but also error
  ▪ Errors of measurement are random
o Error – refers to the component of the observed test score that does not have to do with the testtaker's ability
o Sources of Error Variance:
  a. Item Sampling/Content Sampling – refers to variation among items within a test as well as to variation among items between tests
  ▪ The extent to which a testtaker's score is affected by the content sampled on a test, and by the way the content is sampled, is a source of error variance
  b. Test Administration
  ▪ Testtaker's motivation or attention, environment, etc.
  ▪ Testtaker variables and examiner-related variables
  ▪ Type I error – "false positive"; an investigator rejects a null hypothesis that is true
  ▪ Type II error – "false negative"; an investigator fails to reject a null hypothesis that is false in the population
  ▪ The likelihood of Type I and Type II errors can be reduced by increasing the sample size
o Variance – useful in describing sources of test score variability
  ▪ True Variance – variance from true differences
  ▪ Error Variance – variance from irrelevant, random sources
o Reliability refers to the proportion of total variance attributed to true variance
o The greater the proportion of the total variance attributed to true variance, the more reliable the test
o Error variance may increase or decrease a test score by varying amounts, so the consistency of the test score – and thus the reliability – can be affected
o Measurement Error – all of the factors associated with the process of measuring some variable, other than the variable being measured
o Random Error – a source of error in measuring a targeted variable, caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process
  ▪ "Noise"
  ▪ E.g., physical events that happen while the test is being taken
o Systematic Error – a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
  c. Test Scoring and Interpretation
  ▪ May employ objective-type items amenable to computer scoring of well-documented reliability
  ▪ If subjectivity is involved in scoring, the scorer or rater can be a source of error variance
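Note: in symbols (standard classical test theory notation, not something specific to these notes), the model above decomposes an observed score X into a true score T and error E, and defines the reliability coefficient as the proportion of total variance that is true variance:

    X = T + E
    r_{xx} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}

For example, if the true score variance is 80 and the error variance is 20, then r_xx = 80 / (80 + 20) = .80.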
Reliability Estimates

Test-Retest Reliability
o Time Sampling
o An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the test
o Appropriate when evaluating the reliability of a test that purports to measure something relatively stable, such as a personality trait
o The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower
o Coefficient of Stability – when the interval between testings is greater than six months

Parallel Forms and Alternate Forms Reliability
o Item Sampling
o Coefficient of Equivalence – the degree of relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability
o Parallel Forms – for each form of the test, the means and the variances are equal
  ▪ Same items, different positionings/numberings
  ▪ Parallel Forms Reliability – an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal
o Alternate Forms – simply different versions of a test that have been constructed so as to be parallel
  ▪ Alternate Forms Reliability – an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error
o Two administrations with the same group are required
o Test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy
o Some testtakers might do better on a specific form of a test, not as a function of their true ability but simply because of the particular items that were selected for inclusion in that form
o Minimizes the effect of memory for the content of a previously administered form of the test
o Certain traits are presumed to be relatively stable in people
o The means and the variances of the observed scores are equal for the two forms

Internal Consistency

Split-Half Reliability
o Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once
o Useful when it is impractical or undesirable to assess reliability with two tests or to administer a test twice
o Simply dividing the test in the middle is not recommended because it is likely that this procedure would spuriously raise or lower the reliability coefficient
o Randomly assign items to one or the other half of the test, or assign odd-numbered items to one half and even-numbered items to the other half (odd-even reliability)
o Divide the test by content so that each half contains items equivalent with respect to content and difficulty
o Spearman-Brown Formula – allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test
o The reliability of a test is affected by its length; usually, reliability increases as length increases
o Spearman-Brown may be used to estimate the effect of shortening a test on its reliability
o The Spearman-Brown formula may also be used to determine the number of items needed to attain a desired level of reliability
o If the reliability of the original test is relatively low, it may be impractical to increase the number of items, so a suitable alternative measure should be developed instead
o Reliability may also be increased by creating new items, clarifying the test instructions, or simplifying the scoring rules

Inter-item Consistency
o Refers to the degree of correlation among all the items on a scale
o Calculated from a single administration of a single form of a test
o Useful in assessing homogeneity
  ▪ Homogeneity – the extent to which a test contains items that measure a single trait (unifactorial)
  ▪ Heterogeneity – the degree to which a test measures different factors (more than one trait); a source of error variance
o More homogeneous = higher inter-item consistency
o KR-20 – used for the inter-item consistency of dichotomous items
o KR-21 – used if all the items have the same degree of difficulty (speed tests)
o Coefficient Alpha – appropriate for use on tests containing non-dichotomous items
  ▪ Helps answer questions about how similar sets of data are
  ▪ Checks consistency across the items of an instrument whose responses earn varying credit
o Average Proportional Distance – a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
  ▪ Not connected to the number of items on a measure
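Note: a minimal numerical sketch of the estimates above – split-half with the Spearman-Brown correction, and coefficient alpha (which reduces to KR-20 for dichotomous items) – using an invented response matrix; all values are hypothetical:

    import numpy as np

    # Invented 0/1 item responses: 6 testtakers x 4 items.
    scores = np.array([[1, 1, 1, 0],
                       [1, 0, 1, 1],
                       [0, 0, 1, 0],
                       [1, 1, 1, 1],
                       [0, 1, 0, 0],
                       [1, 1, 0, 1]])

    # Odd-even split-half: correlate the two half-test scores.
    odd = scores[:, 0::2].sum(axis=1)
    even = scores[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]

    # Spearman-Brown correction: estimate reliability of the full-length test.
    r_sb = 2 * r_half / (1 + r_half)

    # Coefficient alpha (= KR-20 here, since items are dichotomous).
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    alpha = (k / (k - 1)) * (1 - item_var / total_var)

    print(round(r_half, 2), round(r_sb, 2), round(alpha, 2))

The general Spearman-Brown form is r_SB = n·r / (1 + (n - 1)·r); the split-half case uses n = 2 because correlating two half-tests estimates the reliability of a test only half as long as the real one.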
Inter-scorer Reliability
o The degree of agreement or consistency between two or more scorers with regard to a particular measure
o Used for coding nonverbal behavior
o Coefficient of Inter-scorer Reliability
o Observer differences
o Kappa statistics are used
  ▪ Fleiss Kappa – determines the level of agreement between two or more raters when the method of assessment is measured on a categorical scale; the best choice when there are more than 2 raters
  ▪ Cohen's Kappa – two raters each classify N items into C mutually exclusive categories while rating the same thing; corrected for how often the raters may agree by chance; only 2 raters

Using and Interpreting a Coefficient of Reliability
o Tests designed to measure one factor (homogeneous) are expected to have a high degree of internal consistency, and vice versa
o Dynamic – a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experience
o Static – barely changing or relatively unchanging
o Restriction of range or restriction of variance – if the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower
o Power Tests – when the time limit is long enough to allow testtakers to attempt all items
o Speed Tests – generally contain items of uniform difficulty with a time limit
  ▪ Reliability should be based on performance from two independent testing periods, using test-retest, alternate-forms, or split-half reliability
o Criterion-Referenced Tests – designed to provide an indication of where a testtaker stands with respect to some variable or criterion
  ▪ As individual differences decrease, a traditional measure of reliability would also decrease, regardless of the stability of individual performance
o Classical Test Theory – everyone has a "true score" on a test
  ▪ True Score – genuinely reflects an individual's ability level as measured by a particular test
o Domain Sampling Theory – estimates the extent to which specific sources of variation under defined conditions contribute to the test score
  ▪ Considers the problem created by using a limited number of items to represent a larger and more complicated construct
  ▪ Test reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample
  ▪ Generalizability Theory – based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation
  ▪ Universe – the test situation
  ▪ Facets – the number of items in the test, the amount of review, and the purpose of test administration
  ▪ According to Generalizability Theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained (universe score)
  ▪ Decision Study – developers examine the usefulness of test scores in helping the test user make decisions
o Item Response Theory – the probability that a person with X ability will be able to perform at a level of Y on a test
  ▪ Latent-Trait Theory
  ▪ The computer is used to focus on the range of item difficulty that helps assess an individual's ability level
  ▪ Difficulty – the attribute of not being easily accomplished, solved, or comprehended
  ▪ Discrimination – the degree to which an item differentiates among people with higher or lower levels of the trait, ability, etc.
  ▪ Dichotomous – can be answered with only one of two alternative responses
  ▪ Polytomous – 3 or more alternative responses
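Note: as one concrete instance of how difficulty and discrimination enter an IRT model – the two-parameter logistic model, a standard formulation rather than one singled out by these notes – the probability that a person with ability θ answers item i correctly is:

    P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}

Here b_i is the item's difficulty (the ability level at which the probability of success is .50) and a_i is its discrimination (how sharply the probability rises around b_i).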
Reliability and Individual Scores
o Standard Error of Measurement – provides a measure of the precision of an observed test score
  ▪ The standard deviation of errors is used as the basic measure of error
  ▪ Provides an estimate of the amount of error inherent in an observed score or measurement
  ▪ Higher reliability, lower SEM
  ▪ Used to estimate or infer the extent to which an observed score deviates from a true score
  ▪ Also known as the Standard Error of a Score
  ▪ Confidence Interval – a range or band of test scores that is likely to contain the true score
o Standard Error of the Difference – can aid a test user in determining how large a difference between two scores should be before it is considered statistically significant
o Standard Error of Estimate – refers to the standard error of the difference between predicted and observed values

Validity
o Validity – a judgment or estimate of how well a test measures what it purports to measure
  ▪ Evidence about the appropriateness of inferences drawn from test scores
  ▪ Inferences – a logical result or deduction
  ▪ May diminish as the culture or times change
o Validation – the process of gathering and evaluating evidence about validity
o Validation Studies – yield insights regarding a particular population of testtakers as compared to the norming sample described in a test manual
o Face Validity – what a test appears to measure to the person being tested, rather than what the test actually measures

Content Validity
o Content Validity – describes a judgement of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
o When the proportion of material covered by the test approximates the proportion of material covered in the course
o Test Blueprint – a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items, and so forth

Criterion-Related Validity
o Criterion-Related Validity – a judgement of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest – the measure of interest being the criterion
o Criterion – the standard against which a judgement or decision may be made
  ▪ Characteristics: relevant, valid, uncontaminated
  ▪ Criterion Contamination – occurs when the criterion measure includes aspects of performance that are not part of the job, or when the measure is affected by "construct-irrelevant" factors that are not part of the criterion construct (Messick, 1989)

Concurrent Validity
o If the test scores are obtained at about the same time as the criterion measures are obtained
o The extent to which test scores may be used to estimate an individual's present standing on a criterion
o Economically efficient

Predictive Validity
o Measures of the relationship between test scores and a criterion measure obtained at a future time
o Researchers must take into consideration the base rate of the occurrence of the variable, both as that variable exists in the general population and as it exists in the sample being studied
o Base Rate – the extent to which a particular trait, behavior, characteristic, or attribute exists in the population
o Hit Rate – the proportion of people a test accurately identifies as possessing a particular trait, behavior, etc.
o Miss Rate – the proportion of people a test fails to identify as having that particular characteristic
o False Positive – a miss; the test predicted that the testtaker possessed the particular trait, but the testtaker actually did not
o False Negative – a miss; the test predicted that the testtaker did not possess the particular trait, but the testtaker actually did
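Note: the hit-and-miss vocabulary above can be made concrete with a small classification table; the counts below are invented for illustration, and hits are counted here as all correct identifications (true positives plus true negatives):

    import numpy as np

    # Invented screening outcomes: [test positive, test negative] counts.
    has_trait = np.array([40, 10])    # 10 false negatives (misses)
    lacks_trait = np.array([15, 35])  # 15 false positives (misses)

    total = has_trait.sum() + lacks_trait.sum()
    hit_rate = (has_trait[0] + lacks_trait[1]) / total          # 0.75
    false_positive_rate = lacks_trait[0] / lacks_trait.sum()    # 0.30
    false_negative_rate = has_trait[1] / has_trait.sum()        # 0.20
    print(hit_rate, false_positive_rate, false_negative_rate)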
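Note: returning to the Standard Error of Measurement under "Reliability and Individual Scores" above, the standard formula ties it directly to reliability:

    SEM = \sigma_X \sqrt{1 - r_{xx}}

For example, with SD = 15 and r_xx = .91, SEM = 15 × √.09 = 4.5, so a 95% confidence interval around an observed score of 100 is roughly 100 ± 1.96 × 4.5, i.e., about 91 to 109 – higher reliability means a lower SEM and a narrower band.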
o Validity Coefficient – a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure
  ▪ Usually the Pearson r is used; however, other correlation coefficients can be used depending on the type of data
  ▪ Affected by restriction or inflation of range
  ▪ The validity coefficient needs to be large enough to enable the test user to make accurate decisions within the unique context in which a test is being used
o Incremental Validity – the degree to which an
additional predictor explains something about
the criterion measure that is not explained by
predictors already in use
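Note: one common way to quantify incremental validity (a standard operationalization, not one prescribed by these notes) is the gain in squared multiple correlation when the new predictor is added to a regression that already contains the existing predictors; the data below are simulated:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 200
    existing = rng.normal(size=n)                  # predictor already in use
    new_test = rng.normal(size=n)                  # candidate test
    criterion = existing + 0.5 * new_test + rng.normal(size=n)

    x_old = existing.reshape(-1, 1)
    x_new = np.column_stack([existing, new_test])
    r2_old = LinearRegression().fit(x_old, criterion).score(x_old, criterion)
    r2_new = LinearRegression().fit(x_new, criterion).score(x_new, criterion)
    print(round(r2_new - r2_old, 3))  # gain in explained criterion variance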
Construct Validity
o Construct Validity – judgement about the
appropriateness of inferences drawn from test
scores regarding individual standing on
variable called construct
o Construct – an informed, scientific idea developed or hypothesized to describe or explain behavior
  ▪ Unobservable, presupposed traits that may be invoked to describe test behavior or criterion performance
o One way a test developer can improve the homogeneity of a test containing dichotomous items is by eliminating items that do not show significant correlation coefficients with total test scores (see the sketch after this list)
o If it is an academic test and high scorers on
the entire test for some reason tended to get
that particular item wrong while low scorers
got it right, then the item is obviously not a
good one
o Some constructs lend themselves more
readily than others to predictions of change
over time
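Note: a minimal sketch of the item-elimination idea above – compute each item's corrected item-total correlation and flag items that correlate weakly or negatively with the rest of the test. The response matrix is invented:

    import numpy as np

    # Invented 0/1 responses: 8 testtakers x 5 items.
    scores = np.array([[1, 1, 1, 0, 1],
                       [1, 0, 1, 0, 1],
                       [0, 0, 1, 1, 0],
                       [1, 1, 1, 0, 1],
                       [0, 1, 0, 1, 0],
                       [1, 1, 0, 0, 1],
                       [0, 0, 0, 1, 0],
                       [1, 1, 1, 0, 1]])

    total = scores.sum(axis=1)
    for i in range(scores.shape[1]):
        rest = total - scores[:, i]  # exclude the item from its own total
        r = np.corrcoef(scores[:, i], rest)[0, 1]
        print(f"item {i}: corrected item-total r = {r:+.2f}")

The fourth item (index 3) is keyed so that high scorers tend to miss it, so its item-total correlation comes out negative – exactly the kind of item the bullet above would flag for elimination.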
o Method of Contrasted Groups – demonstrates that scores on the test vary in a predictable way as a function of membership in a group
  ▪ If a test is a valid measure of a particular construct, then the scores of a group of people who do not have that construct would differ from the scores of those who really possess that construct
o Convergent Evidence – if scores on the test undergoing construct validation tend to be highly correlated with scores on another established, validated test that measures the same construct
o Discriminant Evidence – a validity
coefficient showing little relationship
between test scores and/or other variables
with which scores on the test being
construct-validated should not be
correlated
o Factor Analysis – designed to identify
factors or specific variables that are
typically attributes, characteristics, or
dimensions on which people may differ
▪ Employed as data reduction method
▪ Identify the factor or factors in common
between test scores on subscales within
a particular test
▪ Exploratory FA – estimating or extracting factors; deciding how many factors must be retained
▪ Confirmatory FA – researchers test the
degree to which a hypothetical model
fits the actual data
▪ Factor Loading – conveys information about the extent to which the factor determines the test score or scores
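Note: a compact sketch of exploratory factor analysis as data reduction, using scikit-learn's FactorAnalysis (one of several possible tools); the six "subscales" are simulated so that three load on each of two latent traits:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(1)
    n = 300
    verbal = rng.normal(size=n)    # simulated latent trait 1
    spatial = rng.normal(size=n)   # simulated latent trait 2

    def noise():
        return rng.normal(scale=0.5, size=n)

    data = np.column_stack([verbal + noise(), verbal + noise(), verbal + noise(),
                            spatial + noise(), spatial + noise(), spatial + noise()])

    fa = FactorAnalysis(n_components=2).fit(data)
    # components_ holds the loadings: how strongly each factor
    # determines each observed subscale score.
    print(np.round(fa.components_, 2))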
Validity, Bias, and Fairness
o Bias – factor inherent in a test that
systematically prevents accurate, impartial
measurement
▪ Prejudice, preferential treatment
▪ Prevented during test development through a procedure called Estimated True Score Transformation
o Rating – numerical or verbal judgement
that places a person or an attribute along a
continuum identified by a scale of
numerical or word descriptors known as
Rating Scale
▪ Rating Error – intentional or
unintentional misuse of the scale
▪ Leniency Error – rater is lenient in
scoring (Generosity Error)
▪ Severity Error – rater is strict in scoring
▪ Central Tendency Error – rater’s rating
would tend to cluster in the middle of
the rating scale
Psychological Assessment
Reliability, Validity, Utility
Source: Cohen & Swerdlik (2018), Kaplan & Saccuzzo (2018), Gravetter & Wallnau (2013)
  ▪ One way to overcome rating errors is to use rankings
  ▪ Halo Effect – the tendency to give a high score due to failure to discriminate among conceptually distinct and potentially independent aspects of a ratee's behavior
o Fairness – the extent to which a test is used in an impartial, just, and equitable way
o Attempting to define the validity of a test will be futile if the test is NOT reliable

Utility
o Utility – the usefulness or practical value of testing to improve efficiency
o Can tell us something about the practical value of the information derived from scores on the test
o Helps us make better decisions
o Higher criterion-related validity = higher utility
o One of the most basic elements in utility analysis is the financial cost of the selection device
o Cost – disadvantages, losses, or expenses in both economic and noneconomic terms
o Benefit – profits, gains, or advantages
o The cost of test administration can be well worth it if the result is certain noneconomic benefits

Utility Analysis
o Utility Analysis – a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment

How is Utility Analysis Conducted?

Expectancy Data
o Expectancy table – provides an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure – passing, acceptable, failing
o Might indicate future behaviors; if the predictions are successful, the test is working as it should
o Taylor-Russell Tables – provide an estimate of the extent to which inclusion of a particular test in the selection system will improve selection
o Selection Ratio – a numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired
o Base Rate – the percentage of people hired under the existing system for a particular position
o One limitation of the Taylor-Russell Tables is that the relationship between the predictor (test) and the criterion must be linear
o Naylor-Shine Tables – entail obtaining the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures

Brogden-Cronbach-Gleser Formula
o Used to calculate the dollar amount of a utility gain resulting from the use of a particular selection instrument
o Utility Gain – an estimate of the benefit of using a particular test
o Productivity Gains – an estimated increase in work output

Some Practical Considerations
o High-performing applicants may have been offered positions at other companies as well
o The more complex the job, the more people differ on how well or poorly they do that job
o Cut Score – a reference point derived as a result of a judgement and used to divide a set of data into two or more classifications
  ▪ Relative Cut Score – a reference point based on norm-related considerations (norm-referenced); e.g., NMAT
  ▪ Fixed Cut Scores – set with reference to a judgement concerning the minimum level of proficiency required; e.g., board exams
  ▪ Multiple Cut Scores – refers to the use of two or more cut scores with reference to one predictor for the purpose of categorization
▪ Multiple Hurdle – multi-stage selection
process, a cut score is in place for each
predictor
▪ Compensatory Model of Selection –
assumption that high scores on one
attribute can compensate for lower
scores
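Note: for the Brogden-Cronbach-Gleser formula introduced earlier, one common statement (the symbols follow the usual textbook presentation; treat the exact form as a sketch rather than a definitive citation) is:

    \text{utility gain} = (N)(T)(r_{xy})(SD_y)(\bar{Z}_m) - (N)(C)

where N is the number of applicants selected, T their average tenure, r_xy the validity coefficient, SD_y the standard deviation of job performance in dollar terms, Z̄_m the mean standardized test score of those selected, and C the cost of testing one applicant. With the hypothetical values N = 10, T = 2 years, r_xy = .40, SD_y = $10,000, Z̄_m = 1.0, and C = $200, the estimated gain is 10 × 2 × .40 × 10,000 × 1.0 − 10 × 200 = $78,000.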
Methods for Setting Cut Scores
o Angoff Method – a method for setting fixed cut scores
  ▪ A limitation is low interrater reliability
o Known Groups Method – collection of data on the predictor of interest from groups known to possess, and not to possess, a trait of interest
  ▪ The determination of where to set the cutoff score is inherently affected by the composition of the contrasting groups
o IRT-Based Methods – cut scores are typically set based on testtakers' performance across all the items on the test
  ▪ Item-Mapping Method – arrangement of items in a histogram, with each column containing items deemed to be of equivalent value
  ▪ Bookmark Method – an expert places a "bookmark" between the two pages deemed to separate testtakers who have acquired the minimal knowledge, skills, and/or abilities from those who have not
o Method of Predictive Yield – takes into account the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores
o Discriminant Analysis – sheds light on the relationship between identified variables and two naturally occurring groups
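Note: a sketch of discriminant analysis in this two-group sense, using scikit-learn's LinearDiscriminantAnalysis; the group labels and score distributions are invented:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(2)
    # Invented scores on two predictors for two naturally occurring groups.
    passers = rng.normal(loc=[60, 55], scale=8, size=(50, 2))
    failers = rng.normal(loc=[48, 45], scale=8, size=(50, 2))
    x = np.vstack([passers, failers])
    y = np.array([1] * 50 + [0] * 50)

    lda = LinearDiscriminantAnalysis().fit(x, y)
    # coef_ shows each predictor's contribution to separating the groups;
    # score() is the proportion of cases classified into the correct group.
    print(np.round(lda.coef_, 3), round(lda.score(x, y), 2))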
end
