Test Construction Slides
Test Construction
Introduction
Outline of Topics
• Item Analysis
• Reliability
• Methods for Assessing Reliability
• Factors that Affect the Reliability Coefficient
• Standard Error of Measurement
• Validity
• Content Validity
• Construct Validity
• Criterion-Related Validity
Study Strategies
• Study Emphasis:
• master basic terms and concepts first
• if you have time, become familiar with some of the more advanced
concepts
• Memorization Strategies:
• use strategies that will ensure that information is adequately encoded
• use multiple modalities and multiple study strategies
• schedule ample time for review
• study only one content domain at a time
Test Construction
Item Analysis
Item Difficulty
• Description:
• refers to the proportion of examinees in the tryout sample who
answered the item correctly
• is of concern for tests designed to measure an examinee’s knowledge
or skill level
Item Difficulty
• Item Difficulty Index (cont.):
• for most tests, a test developer wants items with p values close to .50
• if the goal of testing is to choose a certain number of top performers, the
optimal p value corresponds to the proportion of examinees to be
chosen
• the optimal value is also affected by the likelihood that examinees can
select the correct answer by guessing, with the preferred difficulty level
being halfway between 100% of examinees answering the item correctly
and the probability of answering correctly by guessing
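To make the p value and the guessing adjustment concrete, here is a minimal Python sketch; the response data, function names, and the four-option example are illustrative, not from the slides.

```python
# Minimal sketch: item difficulty (p value) and guessing-adjusted optimal p.
# Responses are coded 1 = correct, 0 = incorrect; all data are made up.

def item_difficulty(responses):
    """p value: proportion of examinees who answered the item correctly."""
    return sum(responses) / len(responses)

def optimal_p(chance_level):
    """Halfway between 1.00 and the probability of a correct guess."""
    return (1.0 + chance_level) / 2

tryout = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]  # one item, ten examinees
print(item_difficulty(tryout))           # 0.6
print(optimal_p(0.25))                   # 0.625 for a four-option multiple-choice item
```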
Item Difficulty
• Item Difficulty Index (cont.):
• the optimal p value also depends on the test's ceiling and floor
• a test has adequate ceiling when it can distinguish between examinees
with high levels of the attribute being measured
• ceiling is maximized by including a large proportion of items with a low p
value
• a test has adequate floor when it can distinguish between examinees with
low levels of the attribute being measured
• floor is maximized by including a large proportion of items with a high p
value
Item Discrimination
• Description:
• refers to the extent to which an item discriminates between examinees
who obtain low or high scores on the test or an external criterion
Item Discrimination
• Item Discrimination Index (cont.):
• D ranges in value from -1 to +1
• when D equals +1, all examinees in the upper-scoring group answered the item
correctly while all examinees in the lower-scoring group answered the item
incorrectly
• when D equals 0, the same percent of examinees in both groups answered the
item correctly
• when D equals -1, all examinees in the lower-scoring group answered the item
correctly while all examinees in the upper-scoring group answered the item
incorrectly
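The slides describe D's endpoints but not its computation; the sketch below uses the standard formula consistent with those endpoints (D = proportion correct in the upper group minus proportion correct in the lower group), with made-up data.

```python
# Sketch of the item discrimination index: D = p(upper) - p(lower).
# A positive D means high scorers pass the item more often than low scorers.

def discrimination_index(upper, lower):
    p_upper = sum(upper) / len(upper)  # proportion of upper group answering correctly
    p_lower = sum(lower) / len(lower)  # proportion of lower group answering correctly
    return p_upper - p_lower

upper_group = [1, 1, 1, 1, 0]  # top-scoring examinees' responses (1 = correct)
lower_group = [1, 0, 0, 0, 0]  # bottom-scoring examinees' responses
print(round(discrimination_index(upper_group, lower_group), 2))  # 0.8 - 0.2 = 0.6
```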
Test Construction
Test Reliability
(Session #1)
Reliability
• Classical Test Theory:
• variability in test scores reflects a combination of true score variability
and variability due to measurement (random) error
X = T + E
Reliability
• Estimating Reliability:
• reliability is estimated by evaluating consistency in scores over time or
across different forms of the test, different test items, or different raters
• most methods for estimating reliability produce a reliability coefficient
• the reliability coefficient is symbolized as rxx
• reliability coefficients range in value from 0 to +1
• they are interpreted directly (without being squared) as the proportion of
observed score variability that reflects true score variability
Reliability
• Test-Retest Reliability:
• provides a measure of test score consistency or stability over time
• is calculated by administering the test to the same examinees on two
occasions and correlating the two sets of scores
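As a quick sketch, the coefficient is just the Pearson correlation between the two administrations; the scores below are invented.

```python
# Sketch: test-retest reliability as the Pearson r between two administrations.
import numpy as np

time1 = np.array([12, 15, 9, 20, 17, 11])   # scores at the first administration
time2 = np.array([13, 14, 10, 19, 18, 12])  # the same examinees retested later
r_xx = np.corrcoef(time1, time2)[0, 1]
print(round(r_xx, 2))  # values near +1 indicate stable scores over time
```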
Reliability
• Alternate Forms Reliability:
• provides a measure of test score consistency over two forms of the test
• is calculated by correlating the scores obtained by a sample of examinees
on the two forms
Reliability
• Internal Consistency Reliability:
• indicates the degree of consistency across different test items
• is appropriate for tests that measure a single content or behavior
domain
• is useful for estimating the reliability of tests that measure
characteristics that fluctuate over time or are susceptible to memory or
practice effects
Reliability
• Internal Consistency Reliability (cont.):
• split-half reliability involves splitting the test in half and correlating examinees'
scores on the two halves
• tends to underestimate the test's reliability
• consequently, the split-half reliability coefficient is corrected using the Spearman-
Brown prophecy formula
• the Spearman-Brown formula can also be used more generally to estimate the
effect of shortening or lengthening a test on its reliability coefficient
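A sketch of the prophecy formula follows; with n = 2 it corrects a split-half coefficient to full-test length, and other values of n estimate the effect of lengthening or shortening the test. The numbers are illustrative.

```python
# Spearman-Brown prophecy formula: projected reliability when test length
# changes by a factor of n (n = 2 corrects a split-half coefficient).

def spearman_brown(r, n):
    return (n * r) / (1 + (n - 1) * r)

split_half = 0.70
print(round(spearman_brown(split_half, 2), 2))  # 0.82: full-length estimate
print(round(spearman_brown(0.82, 0.5), 2))      # 0.69: halving the test lowers reliability
```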
Reliability
• Internal Consistency Reliability (cont.):
• Cronbach's coefficient alpha is the “mean of all possible split-half
correlation coefficients”
• Kuder-Richardson Formula 20 (KR-20) can be used as a substitute for
coefficient alpha when test items are scored dichotomously
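Here is a sketch of the coefficient alpha computation on a small made-up item-score matrix; with dichotomous (0/1) items like these, the result matches KR-20.

```python
# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total-score variance).
import numpy as np

scores = np.array([  # rows = examinees, columns = dichotomously scored items
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
k = scores.shape[1]
item_vars = scores.var(axis=0, ddof=1)      # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)  # variance of examinees' total scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))  # 0.79 for this sample matrix
```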
Reliability
• Inter-Rater Reliability:
• is important for measures that are subjectively scored, such as essay
and projective tests
• can be evaluated using percent agreement, but this tends to
overestimate inter-rater reliability
• alternatively, a special correlation coefficient can be used
• Cohen’s kappa statistic is used to measure agreement between two raters
when scores represent a nominal scale
• Kendall’s coefficient of concordance is used to measure agreement
between three or more raters when scores are reported as ranks
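Below is a from-scratch sketch of Cohen's kappa for two raters and nominal categories; it shows how kappa corrects observed percent agreement for chance agreement. The ratings are invented.

```python
# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    chance = sum(counts1[c] * counts2[c] for c in counts1) / (n * n)
    return (observed - chance) / (1 - chance)

r1 = ["yes", "yes", "no", "no", "yes", "no"]
r2 = ["yes", "no",  "no", "no", "yes", "yes"]
print(round(cohens_kappa(r1, r2), 2))  # 0.33: agreement is modest once chance is removed
```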
Test Construction
Test Reliability
(Session #2)
Reliability
• Factors that Affect the Reliability Coefficient:
• longer tests are generally more reliable than shorter tests
• a wide range of scores increases the size of the reliability coefficient
• the more homogeneous a test is with regard to content, the higher its
reliability coefficient
• the more difficult it is to pick the right answer by guessing, the larger the
reliability coefficient
Reliability
• Confidence Intervals:
• because tests are not totally reliable, an examinee's obtained score may
or may not be his/her true score
• consequently, it’s always best to interpret an examinee’s obtained score
in terms of a confidence interval
• a confidence interval indicates the range within which an examinee’s true
score is likely to fall given his/her obtained score
• it is derived using the standard error of measurement (SEM)
Reliability
• Confidence Intervals (cont.):
• for the 68% confidence interval, one SEM is added to and subtracted
from the obtained score
• for the 95% confidence interval, two SEMs are added to and subtracted
from the obtained score
• for the 99% confidence interval, three SEMs are added to and
subtracted from the obtained score
Reliability
• Standard Error of Measurement:
SEM = SDx √(1 - rxx)

Example:
SEM = 10 √(1 - .91)
    = 10(.3)
    = 3
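The sketch below ties the formula to the confidence-interval rules from the previous slide, reproducing the slide's numbers (SDx = 10, rxx = .91); the obtained score of 100 is hypothetical.

```python
# Standard error of measurement and the confidence intervals built from it.
import math

def sem(sd_x, r_xx):
    return sd_x * math.sqrt(1 - r_xx)  # SEM = SDx * sqrt(1 - rxx)

s = sem(10, 0.91)                                        # 10 * 0.3 = 3.0
obtained = 100                                           # hypothetical obtained score
print(round(obtained - s, 1), round(obtained + s, 1))            # 97.0 103.0 -> 68% CI
print(round(obtained - 2 * s, 1), round(obtained + 2 * s, 1))    # 94.0 106.0 -> 95% CI
```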
Test Construction
Content and Construct Validity
(Session #1)
Validity
• Definition:
• refers to a test's accuracy in terms of the extent to which the test
measures what it was designed to measure
• Types of Validity:
• content validity is important for tests designed to measure a specific
content or behavior domain
• construct validity is important for tests designed to measure a
hypothetical trait or construct
• criterion-related validity is important for tests that will be used to predict
or estimate an examinee’s status on an external criterion
Validity
• Content Validity:
• is of concern when a test is designed to measure a content or behavior
domain
• is built into the test while it’s being constructed
• after a test has been developed, content validity is evaluated by subject
matter experts who determine if test items are an adequate and
representative sample of the content or behavior domain
• content validity is not the same as face validity
• face validity refers to whether or not test items “look like” they’re
measuring what the test is designed to measure
Validity
• Construct Validity:
• is important for tests designed to measure a hypothetical trait or
construct
• several methods are used to evaluate construct validity
• the multitrait-multimethod matrix is a table of correlation coefficients
that provides information about a test's convergent and divergent
(discriminant) validity
• factor analysis also provides information about convergent and
divergent validity but is a more complex technique
Validity
• Multitrait-Multimethod Matrix:
• its use requires a minimum of four measures – the measure being
validated; a measure of the same trait using a different method; a
measure of an unrelated trait using the same method; and a measure of
the same unrelated trait using a different method
• the correlation between the test we're validating and the measure of the
same trait using a different method provides information about the test's
convergent validity
• the correlations between the test we’re validating and the measures of
unrelated traits provide information about the test’s divergent validity
Validity
• Multitrait-Multimethod Matrix (cont.):
[matrix of correlations among the Assertiveness Test, Assertiveness Rating,
Aggressiveness Test, and Aggressiveness Rating]
Test Construction
Content and Construct Validity
(Session #2)
Validity
• Steps in Factor Analysis:
1. Administer tests to a sample of examinees
2. Derive and interpret the correlation matrix
3. Extract the initial factor matrix
4. Rotate the factor matrix
5. Name the factors
Validity
• Rotated Factor Matrix:
                                   Factor I   Factor II   Communality
Interpersonal Assertiveness Test     .78        .16          .64
Global Assertiveness Rating          .69        .14          .49
Behavioral Assertiveness Scale       .59        .12          .36
Aggressiveness Self-Rating           .14        .69          .49
Global Aggressiveness Rating         .12        .59          .36
Social Aggressiveness Scale          .10        .49          .25
Validity
• Rotated Factor Matrix (cont.):
                                   Factor I   Factor II   Communality
Interpersonal Assertiveness Test     .78        .16            ?
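To answer the "?" above: a variable's communality is the sum of its squared factor loadings, so the missing value can be recomputed directly (the tabled .64 reflects rounding in the loadings).

```python
# Communality = sum of squared factor loadings for a variable.
loadings = [0.78, 0.16]                      # Factor I and Factor II loadings
communality = sum(l ** 2 for l in loadings)  # 0.6084 + 0.0256 = 0.634
print(round(communality, 2))                 # 0.63, i.e., the tabled .64 within rounding
```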
Validity
• Factor Analysis:
• the rotation of a factor matrix can be orthogonal or oblique
• orthogonal means uncorrelated, while oblique means correlated
• a researcher decides which is appropriate based on his/her theory about
the characteristics measured by the tests included in the analysis
Test Construction
Criterion-Related Validity
(Session #1)
Validity
• Criterion-Related Validity:
• is important when test scores will be used to predict or estimate status
on a criterion
• is evaluated by correlating scores on the test (predictor) with scores on
the criterion for a sample of examinees to obtain a criterion-related
validity coefficient
• a concurrent validity study involves obtaining scores on the predictor
and criterion at about the same time
• a predictive validity study involves obtaining predictor scores prior to
obtaining criterion scores
Validity
• Confidence Intervals:
• because the relationship between a predictor and criterion is never
perfect, there's some degree of error whenever a predictor is used to
predict or estimate status on a criterion
• consequently, the standard error of estimate is used to construct a
confidence interval around a predicted criterion score
• the procedure for constructing a confidence interval around a
predicted criterion score is the same as the procedure for
constructing a confidence interval around an obtained test score
Validity
• Standard Error of Estimate:
SEest = SDy √(1 - rxy²)

Example:
SEest = 10 √(1 - .60²)
      = 10(.8)
      = 8
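A sketch paralleling the SEM example: the standard error of estimate uses the criterion's standard deviation and the squared validity coefficient. The predicted criterion score of 70 is hypothetical.

```python
# Standard error of estimate: SEest = SDy * sqrt(1 - rxy**2).
import math

def se_est(sd_y, r_xy):
    return sd_y * math.sqrt(1 - r_xy ** 2)

s = se_est(10, 0.60)  # 10 * 0.8 = 8.0
predicted = 70        # hypothetical predicted criterion score
print(round(predicted - s, 1), round(predicted + s, 1))  # 62.0 78.0 -> 68% CI
```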
Validity
• Relationship Between Reliability and Validity:
• reliability is a necessary but not sufficient condition for validity
• as indicated by the following formula, reliability places an upper limit on
validity:
rxy ≤ √rxx

Example:
if rxx = .81, the validity coefficient can be no larger than √.81 = .90
Test Construction
Criterion-Related Validity
(Session #2)
Validity
• Steps in Validating a Predictor:
1. Conduct a job analysis
2. Select/develop the predictor and criterion
3. Obtain and correlate scores on the predictor and criterion
4. Check for adverse impact
5. Evaluate incremental validity
6. Cross-validate
Validity
• Incremental Validity:
• refers to the increase in decision-making accuracy that use of a
predictor provides
• even when a predictor has a large validity coefficient, it may not
increase decision-making accuracy beyond the current level
• is evaluated by comparing the number of correct decisions made with
and without the new predictor
Validity
• Incremental Validity (cont.):
• calculated by subtracting the base rate from the positive hit rate
Incremental Validity = Positive Hit Rate – Base Rate
• Example:
Positive Hit Rate = 9/10 = 90%
Base Rate = 15/30 = 50%
Incremental Validity = 90% - 50% = 40%
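The slide's numbers as a one-line computation (the variable names are illustrative):

```python
# Incremental validity = positive hit rate - base rate, using the slide's example.
positive_hit_rate = 9 / 10   # 90% of predictor-selected examinees succeed
base_rate = 15 / 30          # 50% succeed without the new predictor
print(positive_hit_rate - base_rate)  # 0.4 -> a 40% gain in decision accuracy
```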
Test Construction
Test Score Interpretation
• Norm-Referenced Interpretation:
• involves comparing an examinee's test score to scores obtained in a
standardization sample or other comparison group
• an examinee’s raw score is converted to a score that indicates his/her
relative standing in the comparison group
Norm-Referenced Interpretation
• Percentile Ranks:
• range from 1 to 99 and express an examinee's score in terms of the
percentage of examinees who achieved lower scores
• the distribution of percentile ranks is always flat (rectangular) regardless
of the shape of the raw score distribution
• because the transformation changes the shape of the original raw score
distribution, it is categorized as a nonlinear transformation
• a limitation of percentile ranks is that they indicate an examinee’s
relative position in a distribution but do not provide information about
absolute differences between examinees in terms of their raw scores
Norm-Referenced Interpretation
• Standard Scores:
• indicate the examinee's relative standing in the comparison group in
terms of standard deviations from the mean
• the z-score distribution has a mean of 0 and standard deviation of 1
• a z-score is calculated by subtracting the mean of the distribution from the
examinee's score to obtain a deviation score and dividing the deviation
score by the distribution's standard deviation
• if an examinee obtains a score of 110 on a test that has a mean of 100 and
standard deviation of 10, his/her z-score is +1.0
Norm-Referenced Interpretation
• Standard Scores (cont.):
• the T-score distribution has a mean of 50 and a standard deviation of 10
• an examinee whose raw score is one standard deviation above the mean
will have a T-score of 60
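A sketch combining the two transformations above, using the slide's example (raw score 110, mean 100, SD 10); since the T distribution has mean 50 and SD 10, T = 50 + 10z.

```python
# Convert a raw score to a z-score, then to a T-score (mean 50, SD 10).
def z_score(raw, mean, sd):
    return (raw - mean) / sd

def t_score(z):
    return 50 + 10 * z  # T = 50 + 10z

z = z_score(110, 100, 10)
print(z, t_score(z))  # 1.0 60.0
```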
Criterion-Referenced Interpretation
• Description:
• involves interpreting an examinee's score in terms of a predefined
standard
• a percent correct (percentage) score indicates the percent of test
content the examinee answered correctly
• when percentage scores are used as the method of score interpretation,
a cutoff score is usually set
• another method involves interpreting an examinee’s score in terms of
his/her likely status on an external criterion, which might involve using a
regression equation or expectancy table
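As a final sketch, estimating an examinee's criterion status from a test score with a simple regression equation; the predictor and criterion data are made up.

```python
# Predict criterion status from a test score with a fitted regression line.
import numpy as np

test_scores = np.array([55, 60, 70, 80, 90])  # predictor (test) scores
criterion = np.array([50, 58, 65, 78, 88])    # external criterion scores
slope, intercept = np.polyfit(test_scores, criterion, 1)
print(round(intercept + slope * 75, 1))       # predicted criterion score for a test score of 75
```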