
Association For Advanced Training

Test Construction
Introduction

Test Construction Questions

• 8 to 12 questions will be from this area


• at least two-thirds of the questions will cover basic or introductory
information
• the remaining questions will address more advanced material


Outline of Topics
• Item Analysis
• Reliability
• Methods for Assessing Reliability
• Factors that Affect the Reliability Coefficient
• Standard Error of Measurement

• Validity
• Content Validity
• Construct Validity
• Criterion-Related Validity

• Test Score Interpretation

Study Strategies
• Study Emphasis:
• master basic terms and concepts first
• if you have time, become familiar with some of the more advanced
concepts
• Memorization Strategies:
• use strategies that will ensure that information is adequately encoded
• use multiple modalities and multiple study strategies
• schedule ample time for review
• study only one content domain at a time


Test Construction
Item Analysis

Item Difficulty
• Description:
• refers to the proportion of examinees in the tryout sample who
answered the item correctly
• is of concern for tests designed to measure an examinee’s knowledge
or skill level

• Item Difficulty Index:


• symbolized with the letter “p” and calculated by dividing the number of
examinees who answered the item correctly by the total number of
examinees
• ranges in value from 0 to 1, with larger values indicating an easier item
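To make the p calculation concrete, here is a minimal Python sketch (the function and variable names are illustrative, not from the slides):

```python
def item_difficulty(responses):
    """Item difficulty index p: the proportion of examinees in the
    tryout sample who answered the item correctly."""
    # `responses` is a list of 0/1 item scores (1 = correct)
    return sum(responses) / len(responses)

# 7 of 10 examinees answered correctly -> p = 0.7 (a relatively easy item)
print(item_difficulty([1, 1, 1, 0, 1, 0, 1, 1, 0, 1]))  # 0.7
```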


Item Difficulty
• Item Difficulty Index (cont.):
• for most tests, a test developer wants items with p values close to .50
• if the goal of testing is to choose a certain number of top performers, the
optimal p value corresponds to the proportion of examinees to be
chosen
• the optimal value is also affected by the likelihood that examinees can
select the correct answer by guessing, with the preferred difficulty level
being halfway between 100% of examinees answering the item correctly
and the probability of answering correctly by guessing (e.g., for a
four-option multiple-choice item, halfway between 1.0 and .25, or about .63)

Item Difficulty
• Item Difficulty Index (cont.):
• the optimal p value also depends on the test’s ceiling and floor
• a test has adequate ceiling when it can distinguish between examinees
with high levels of the attribute being measured
• ceiling is maximized by including a large proportion of items with a low p
value
• a test has adequate floor when it can distinguish between examinees with
low levels of the attribute being measured
• floor is maximized by including a large proportion of items with a high p
value


Item Discrimination
• Description:
• refers to the extent to which an item discriminates between examinees
who obtain low or high scores on the test or an external criterion

• Item Discrimination Index:


• symbolized with the letter “D”
• is calculated by subtracting the percent of examinees in the lower-
scoring group who answered the item correctly from the percent of
examinees in the upper-scoring group who answered the item correctly

Item Discrimination
• Item Discrimination Index (cont.):
• D ranges in value from -1 to +1
• when D equals +1, all examinees in the upper-scoring group answered the item
correctly while all examinees in the lower-scoring group answered the item
incorrectly
• when D equals 0, the same percent of examinees in both groups answered the
item correctly
• when D equals -1, all examinees in the lower-scoring group answered the item
correctly while all examinees in the upper-scoring group answered the item
incorrectly
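A minimal Python sketch of the D calculation (names are illustrative):

```python
def item_discrimination(upper_scores, lower_scores):
    """Item discrimination index D: proportion correct in the upper-scoring
    group minus proportion correct in the lower-scoring group."""
    p_upper = sum(upper_scores) / len(upper_scores)
    p_lower = sum(lower_scores) / len(lower_scores)
    return p_upper - p_lower

# 90% of the upper group vs. 30% of the lower group answered correctly
print(round(item_discrimination([1]*9 + [0]*1, [1]*3 + [0]*7), 2))  # 0.6
```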


Test Construction
Test Reliability
(Session #1)

Reliability
• Classical Test Theory:
• variability in test scores reflects a combination of true score variability
and variability due to measurement (random) error

X = T + E

where X is the total variability in obtained test scores, T is true score
variability, and E is variability due to measurement (random) error
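A small illustrative simulation of the classical model (all numbers are invented for demonstration): if true scores vary with SD 15 and random error adds variance with SD 5, the proportion of observed variance due to true scores is 225/250 = .90.

```python
import random

random.seed(0)

# X = T + E: each observed score is a stable true score plus random error
true_scores = [random.gauss(100, 15) for _ in range(100_000)]
observed = [t + random.gauss(0, 5) for t in true_scores]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Proportion of observed-score variance attributable to true scores;
# with these parameters it converges on 225 / 250 = .90
print(round(variance(true_scores) / variance(observed), 2))
```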


Reliability
• Estimating Reliability:
• reliability is estimated by evaluating consistency in scores over time or
across different forms of the test, different test items, or different raters
• most methods for estimating reliability produce a reliability coefficient
• the reliability coefficient is symbolized as rxx
• reliability coefficients range in value from 0 to +1
• they are interpreted directly as the proportion of observed score variability
that is attributable to true score variability (e.g., rxx = .80 means 80% of
the variability in scores reflects true score differences)

Reliability
• Test-Retest Reliability:
• provides a measure of test score consistency or stability over time
• is calculated by administering the test to the same examinees on two
occasions and correlating the two sets of scores

• is appropriate for tests designed to measure a characteristic that is stable
over time
• is not appropriate for tests that measure characteristics that fluctuate
over time or are likely to be affected in a random way by taking the test
more than once


Reliability
• Alternate Forms Reliability:
• provides a measure of test score consistency over two forms of the test
• is calculated by correlating the scores obtained by a sample of examinees
on the two forms

• is appropriate for tests that measure a characteristic that is stable over
time
• is not appropriate for tests that measure characteristics that fluctuate
over time or when exposure to one form is likely to affect performance
on the other form in an unsystematic way

Reliability
• Internal Consistency Reliability:
• indicates the degree of consistency across different test items
• is appropriate for tests that measure a single content or behavior
domain
• is useful for estimating the reliability of tests that measure
characteristics that fluctuate over time or are susceptible to memory or
practice effects


Reliability
• Internal Consistency Reliability (cont.):
• split-half reliability involves splitting the test in half and correlating
examinees’ scores on the two halves
• tends to underestimate the test’s reliability
• consequently, the split-half reliability coefficient is corrected using the Spearman-
Brown prophecy formula
• the Spearman-Brown formula can also be used more generally to estimate the
effect of shortening or lengthening a test on its reliability coefficient

• split-half reliability is not appropriate for speeded tests
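The prophecy formula itself is rnew = n·r / (1 + (n − 1)·r), where r is the current reliability and n is the factor by which the test’s length changes. A minimal Python sketch:

```python
def spearman_brown(r, n):
    """Spearman-Brown prophecy formula: estimated reliability of a test
    whose length is changed by a factor of n, given current reliability r."""
    return (n * r) / (1 + (n - 1) * r)

# Correcting a split-half coefficient uses n = 2: a half-test correlation
# of .70 projects to a full-length reliability of 2(.70)/(1 + .70) = .82
print(round(spearman_brown(0.70, 2), 2))  # 0.82

# Doubling the length of a test with rxx = .80 -> about .89
print(round(spearman_brown(0.80, 2), 2))  # 0.89
```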

Reliability
• Internal Consistency Reliability (cont.):
• Cronbach’s coefficient alpha is the “mean of all possible split-half
correlation coefficients”
• Kuder-Richardson Formula 20 (KR-20) can be used as a substitute for
coefficient alpha when test items are scored dichotomously
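A minimal Python sketch of coefficient alpha computed from an item-score matrix; with dichotomously scored (0/1) items, the same computation yields KR-20 (the data below are invented):

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def coefficient_alpha(item_scores):
    """Cronbach's alpha for item_scores[i][j] = examinee i's score on item j.
    With dichotomous (0/1) items this equals KR-20."""
    k = len(item_scores[0])                                   # number of items
    item_variances = [variance([row[j] for row in item_scores])
                      for j in range(k)]
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Invented 0/1 scores for five examinees on a four-item test
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(coefficient_alpha(scores), 2))  # 0.8
```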


Reliability
• Inter-Rater Reliability:
• is important for measures that are subjectively scored, such as essay
and projective tests
• can be evaluated using percent agreement, but this tends to
overestimate inter-rater reliability
• alternatively, a special correlation coefficient can be used
• Cohen’s kappa statistic is used to measure agreement between two raters
when scores represent a nominal scale
• Kendall’s coefficient of concordance is used to measure agreement
between three or more raters when scores are reported as ranks
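A minimal Python sketch of Cohen’s kappa for two raters assigning nominal categories (the ratings are invented for illustration):

```python
def cohens_kappa(rater1, rater2):
    """Cohen's kappa: chance-corrected agreement between two raters
    assigning nominal categories to the same cases."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected chance agreement: sum over categories of the product of
    # each rater's marginal proportion for that category
    expected = sum(
        (rater1.count(c) / n) * (rater2.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

r1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
r2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]
print(round(cohens_kappa(r1, r2), 2))  # 0.5
```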


Test Construction
Test Reliability
(Session #2)

Reliability
• Factors that Affect the Reliability Coefficient:
• longer tests are generally more reliable than shorter tests
• a wide range of scores increases the size of the reliability coefficient
• the more homogeneous a test is with regard to content, the higher its
reliability coefficient
• the more difficult it is to pick the right answer by guessing, the larger the
reliability coefficient


Reliability
• Confidence Intervals:
• because tests are not totally reliable, an examinee’s obtained score may
or may not be his/her true score
• consequently, it’s always best to interpret an examinee’s obtained score
in terms of a confidence interval
• a confidence interval indicates the range within which an examinee’s true
score is likely to fall given his/her obtained score
• it is derived using the standard error of measurement (SEM)

Reliability
• Confidence Intervals (cont.):
• for the 68% confidence interval, one SEM is added to and subtracted
from the obtained score
• for the 95% confidence interval, two SEMs are added to and subtracted
from the obtained score
• for the 99% confidence interval, three SEMs are added to and
subtracted from the obtained score

© AATBS. All Rights Reserved.


Association For Advanced Training

Reliability
• Standard Error of Measurement:
SEM = SDx √(1 - rxx)

Example:

SEM = 10 √(1 - .91)
    = 10(.3)
    = 3
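A minimal Python sketch tying the SEM formula to the confidence intervals described above (values from the slide’s example):

```python
def sem(sd, rxx):
    """Standard error of measurement: SEM = SD * sqrt(1 - rxx)."""
    return sd * (1 - rxx) ** 0.5

def confidence_interval(obtained_score, sd, rxx, n_sems):
    """CI around an obtained score: 1 SEM ~ 68%, 2 SEMs ~ 95%, 3 SEMs ~ 99%."""
    half_width = n_sems * sem(sd, rxx)
    return obtained_score - half_width, obtained_score + half_width

# SD = 10 and rxx = .91 give SEM = 3; the 95% CI around an obtained
# score of 100 is therefore 100 +/- 2(3), i.e., 94 to 106
print(round(sem(10, 0.91), 2))                       # 3.0
print(confidence_interval(100, 10, 0.91, n_sems=2))  # approx. (94.0, 106.0)
```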


Test Construction
Content and Construct Validity
(Session #1)

Validity
• Definition:
• refers to a test’s accuracy in terms of the extent to which the test
measures what it was designed to measure
• Types of Validity:
• content validity is important for tests designed to measure a specific
content or behavior domain
• construct validity is important for tests designed to measure a
hypothetical trait or construct
• criterion-related validity is important for tests that will be used to predict
or estimate an examinee’s status on an external criterion


Validity
• Content Validity:
• is of concern when a test is designed to measure a content or behavior
domain
• is built into the test while it’s being constructed
• after a test has been developed, content validity is evaluated by subject
matter experts who determine if test items are an adequate and
representative sample of the content or behavior domain
• content validity is not the same as face validity
• face validity refers to whether or not test items “look like” they’re
measuring what the test is designed to measure

Validity
• Construct Validity:
• is important for tests designed to measure a hypothetical trait or
construct
• several methods are used to evaluate construct validity
• the multitrait-multimethod matrix is a table of correlation coefficients
that provide information about a test’s convergent and divergent
(discriminant) validity
• factor analysis also provides information about convergent and
divergent validity but is a more complex technique


Validity
• Multitrait-Multimethod Matrix:
• its use requires a minimum of four measures – the measure being
validated; a measure of the same trait using a different method; a
measure of an unrelated trait using the same method; and a measure of
the same unrelated trait using a different method
• the correlation between the test we’re validating and the measure of the
same trait using a different method provides information about the test’s
convergent validity
• the correlations between the test we’re validating and the measures of
unrelated traits provide information about the test’s divergent validity

Validity
• Multitrait-Multimethod Matrix (cont.):
                     Assertive   Aggressive   Assertive   Aggressive
                     Test        Test         Rating      Rating
Assertive Test       .93
Aggressive Test      .13         .91
Assertive Rating     .71         .09          .86
Aggressive Rating    .04         .68          .16         .89

(values on the main diagonal are the reliability coefficients of the four
measures; .71 and .68 are same-trait/different-method correlations that
indicate convergent validity, and the remaining small off-diagonal values
indicate divergent validity)


Test Construction
Content and Construct Validity
(Session #2)

Validity
• Steps in Factor Analysis:
1. Administer tests to a sample of examinees
2. Derive and interpret the correlation matrix
3. Extract the initial factor matrix
4. Rotate the factor matrix
5. Name the factors


Validity
• Rotated Factor Matrix:
Factor I Factor II Communality
Interpersonal Assertiveness Test .78 .16 .64
Global Assertiveness Rating .69 .14 .49
Behavioral Assertiveness Scale .59 .12 .36
Aggressiveness Self-Rating .14 .69 .49
Global Aggressiveness Rating .12 .59 .36
Social Aggressiveness Scale .10 .49 .25

Validity
• Rotated Factor Matrix (cont.):
Factor I Factor II Communality
Interpersonal Assertiveness Test .78 .16 ?

Communality = .78² + .16²
            = .61 + .03
            = .64
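A minimal Python sketch of the communality calculation (summing the unrounded squared loadings gives .63; the slide’s .64 results from rounding each squared loading before adding):

```python
def communality(loadings):
    """Communality: the proportion of a measure's variance explained by
    the retained factors -- the sum of its squared factor loadings."""
    return sum(loading ** 2 for loading in loadings)

# Interpersonal Assertiveness Test: loadings of .78 and .16
print(round(communality([0.78, 0.16]), 2))  # 0.63 without intermediate rounding
```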


Validity
• Factor Analysis:
• the rotation of a factor matrix can be orthogonal or oblique
• orthogonal means uncorrelated, while oblique means correlated
• a researcher decides which is appropriate based on his/her theory about
the characteristics measured by the tests included in the analysis


Test Construction
Criterion-Related Validity
(Session #1)

Validity
• Criterion-Related Validity:
• is important when test scores will be used to predict or estimate status
on a criterion
• is evaluated by correlating scores on the test (predictor) with scores on
the criterion for a sample of examinees to obtain a criterion-related
validity coefficient
• a concurrent validity study involves obtaining scores on the predictor
and criterion at about the same time
• a predictive validity study involves obtaining predictor scores prior to
obtaining criterion scores


Validity
• Confidence Intervals:
• because the relationship between a predictor and criterion is never
perfect, there’s some degree of error whenever a predictor is used to
predict or estimate status on a criterion
• consequently, the standard error of estimate is used to construct a
confidence interval around a predicted criterion score
• the procedure for constructing a confidence interval around a
predicted criterion score is the same as the procedure for
constructing a confidence interval around an obtained test score

Validity
• Standard Error of Estimate:

SEest = SDy √(1 - rxy²)

Example:

SEest = 10 √(1 - .60²)
      = 10(.8)
      = 8
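A minimal Python sketch of the standard error of estimate (values from the slide’s example):

```python
def se_estimate(sd_y, rxy):
    """Standard error of estimate: SEest = SDy * sqrt(1 - rxy^2)."""
    return sd_y * (1 - rxy ** 2) ** 0.5

# SDy = 10 and rxy = .60 give SEest = 10(.8) = 8, so a 95% confidence
# interval around a predicted criterion score of 50 would be
# 50 +/- 2(8), i.e., 34 to 66
print(round(se_estimate(10, 0.60), 2))  # 8.0
```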


Validity
• Relationship Between Reliability and Validity:
• reliability is a necessary but not sufficient condition for validity
• as indicated by the following formula, reliability places an upper limit on
validity

rxy ≤ √rxx

Example:

rxy ≤ √.81 = .90


Test Construction
Criterion-Related Validity
(Session #2)

Validity
• Steps in Validating a Predictor:
1. Conduct a job analysis
2. Select/develop the predictor and criterion
3. Obtain and correlate scores on the predictor and criterion
4. Check for adverse impact
5. Evaluate incremental validity
6. Cross-validate


Validity
• Incremental Validity:
• refers to the increase in decision-making accuracy that use of a
predictor provides
• even when a predictor has a large validity coefficient, it may not
increase decision-making accuracy beyond the current level
• is evaluated by comparing the number of correct decisions made with
and without the new predictor


Validity
• Incremental Validity (cont.):
• calculated by subtracting the base rate from the positive hit rate
Incremental Validity = Positive Hit Rate – Base Rate

• Example:
Positive Hit Rate = 9/10 = 90%
Base Rate = 15/30 = 50%

Incremental Validity = 90% - 50% = 40%
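A minimal Python sketch of the incremental validity calculation (parameter names are illustrative; the numbers are the slide’s example):

```python
def incremental_validity(hits_among_selected, n_selected,
                         successes_without_predictor, n_total):
    """Incremental validity = positive hit rate - base rate."""
    positive_hit_rate = hits_among_selected / n_selected   # e.g., 9/10 = 90%
    base_rate = successes_without_predictor / n_total      # e.g., 15/30 = 50%
    return positive_hit_rate - base_rate

# Positive hit rate of 90% against a base rate of 50% -> a 40% increase
# in decision-making accuracy
print(round(incremental_validity(9, 10, 15, 30), 2))  # 0.4 (i.e., 40%)
```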


Test Construction
Test Score Interpretation

Test Score Interpretation


• Introduction:
• an examinee’s raw score is often difficult to interpret unless it’s
anchored to the performance of other examinees or a predefined
standard of performance

• Norm-Referenced Interpretation:
• involves comparing an examinee’s test score to scores obtained in a
standardization sample or other comparison group
• an examinee’s raw score is converted to a score that indicates his/her
relative standing in the comparison group


Norm-Referenced Interpretation
• Percentile Ranks:
• range from 1 to 99 and express an examinee’s score in terms of the
percentage of examinees who achieved lower scores
• distribution is always flat (rectangular) regardless of the shape of the
raw score distribution
• because the transformation changes the shape of the original raw score
distribution, it is categorized as a nonlinear transformation
• a limitation of percentile ranks is that they indicate an examinee’s
relative position in a distribution but do not provide information about
absolute differences between examinees in terms of their raw scores
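A minimal Python sketch of a percentile rank, using the definition above (percentage of examinees scoring lower; some texts also count half of the tied scores, which this sketch omits):

```python
def percentile_rank(raw, all_scores):
    """Percentile rank: percentage of examinees who scored below `raw`."""
    below = sum(score < raw for score in all_scores)
    return 100 * below / len(all_scores)

# Invented comparison-group scores: 6 of 10 fall below a raw score of 110
scores = [85, 90, 90, 95, 100, 105, 110, 115, 120, 130]
print(percentile_rank(110, scores))  # 60.0 -> a percentile rank of 60
```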

Norm-Referenced Interpretation
• Standard Scores:
• indicate the examinee’s relative standing in the comparison group in
terms of standard deviations from the mean
• the z-score distribution has a mean of 0 and standard deviation of 1
• a z-score is calculated by subtracting the mean of the distribution from the
examinee’s score to obtain a deviation score and dividing the deviation
score by the distribution’s standard deviation
• if an examinee obtains a score of 110 on a test that has a mean of 100 and
standard deviation of 10, his/her z-score is +1.0


Norm-Referenced Interpretation
• Standard Scores (cont.):
• the T-score distribution has a mean of 50 and standard deviation of 10
• an examinee whose raw score is one standard deviation above the mean
will have a T-score of 60

• deviation IQ scores have a mean of 100 and standard deviation of 15


• an examinee whose raw score is one standard deviation above the mean
will have a deviation IQ score of 115
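A minimal Python sketch converting a raw score to the standard scores described above (function names are illustrative):

```python
def z_score(raw, mean, sd):
    """z-score: the examinee's distance from the mean in SD units."""
    return (raw - mean) / sd

def t_score(z):
    """T-score distribution: mean 50, standard deviation 10."""
    return 50 + 10 * z

def deviation_iq(z):
    """Deviation IQ distribution: mean 100, standard deviation 15."""
    return 100 + 15 * z

# A raw score of 110 on a test with mean 100 and SD 10 is one standard
# deviation above the mean: z = +1.0, T = 60, deviation IQ = 115
z = z_score(110, 100, 10)
print(z, t_score(z), deviation_iq(z))  # 1.0 60.0 115.0
```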

Criterion-Referenced Interpretation
• Description:
• involves interpreting an examinee’s score in terms of a predefined
standard
• a percent correct (percentage) score indicates the percent of test
content the examinee answered correctly
• when used as the method of score interpretation, a cutoff score is usually
set
• another method involves interpreting an examinee’s score in terms of
his/her likely status on an external criterion, which might involve using a
regression equation or expectancy table

© AATBS. All Rights Reserved.
