Item Banks: What, Why, How
Bruce Howard Choppin, who died in Chile on July 15, 1983, was one of the earliest and most persistent item bankers. He was our student, colleague, and friend. May his pioneering work be long appreciated.
each item is written to represent an element of the strand at a particular point on the
achievement variable, then each item exemplifies the knowledge, skill, and behavior that
defines achievement at that point. The calibration that becomes attached to each item
puts this definition of the strand on an underlying continuum. Items with low calibrations
describe easy tasks that define the elementary end of the strand. Items with high
calibrations describe difficult tasks that define the advanced end of the strand. The
progressionthrough the items in the orderof their calibrations from easy to hard describes
the path that most students follow as they learn.
Item calibrations are obtained by applying a probabilistic model for what ought to
happen when a student attempts an item (Rasch, 1960/1980). The probabilities allow for
give and take between what is intended and what occurs. This is necessary because some
students do not follow the expected path. The model tries to impose an orderly response
process on the data. Evaluation of the extent to which the data can be understood and used
in this way is an essential part of item banking.
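For concreteness, the model gives the probability of success as a function of the difference between student ability and item difficulty, both in logits. The short Python sketch below illustrates this relationship; the function name and sample values are illustrative only and are not part of any of the programs described in this article.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability that a student at 'ability' succeeds on an item at
    'difficulty' under the dichotomous Rasch model (both in logits)."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

# A student one logit above an item's difficulty is expected to succeed
# about 73% of the time; one logit below, about 27% of the time.
print(round(rasch_probability(1.0, 0.0), 2))   # 0.73
print(round(rasch_probability(-1.0, 0.0), 2))  # 0.27
```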
To develop confidence in the structure of the bank, it is necessary to assess the extent of
agreement among student performances and between student performances and teacher
expectations. If there is much disagreement among students as to which items are hard
and which are easy, then there may not be a common basis for describing progress. It may
become impossible to say that one student has achieved more than another or that a
student's change in position on the bank variable indicates development. If the empirical
ordering of the items surprises the teachers who designed the bank, then they may not
understand what their items measure. Fortunately, students tend to agree with one
another and with their teachers on the relative difficulty of most test items so that item
banking usually succeeds.
Reviewing the bank arrangement of items from easy to hard promotes a new kind of
communication about the strands a curriculum contains and what can be done to teach
them. This is a kind of communication that does not occur when off-the-cuff tests are
patched together by teachers or off-the-shelf tests are brought in by publishers. When
teachers review together the confirmations and contradictions that emerge as they
compare the observed item hierarchy in an item bank with their intended one, they
discover details and teaching sequences that they were unaware they shared with one
another.
Psychometric Implications
Item banks return the development and control of testing to the local level without loss
of comparability. The quantitative basis for objective comparisons among performances
of students or between performances of the same student at different times is achieved
through the itemization of the curriculum strand defined by the bank. This requires
nothing more in item quality or relevance than is already taken for granted in most
fixed-item, norm-referenced tests.
When items that share a common content are calibrated onto a common variable, each
item represents a position on the variable that is also represented by other items of
comparable difficulty. This makes it possible to infer a student's mastery with respect to
the basic variable that the items share regardless of which items are administered or
who else has been tested. The idea that items might be exchangeable with respect to
their contribution to a measure and hence to a general idea as to how much of a particular
curriculum strand a particular student has learned may seem surprising. But this idea is a
basic requirement of any measuring system in which many responses are collected but
only one score is reported, including all of the fixed-item, norm-referenced tests so widely
used. It is the isolation of this exchangeable part of each item by its calibration on the
common curriculum strand which frees the item's unique content for diagnostic use.
The summary information about a tested student should begin with the validity of the
student's pattern of performance (Smith, 1982, 1984; Wright & Stone, 1979, chaps. 4 &
7). If the performance pattern is valid, then the measure estimates the student's level of
mastery in terms of all the items that define the bank rather than merely the few items
taken. This provides a criterion reference for the student's performance which is as
detailed as the items in the bank and as broad as their implications.
The student's position on the variable also places that student among whoever else
has ever taken items from the bank, rather than merely among those who have taken the
same items. This provides a norm reference for the student's performance with respect to
every other student who has produced a valid test performance with items from the
bank.
Because items can be written, administered, and scored locally and because, when
items are plentiful, there is no need for item secrecy, it becomes possible to analyze,
report, and use the individual interactions between student and item immediately in the
teaching-learning process. The calibration of the items facilitates this analysis because it
enables an immediate evaluation of the consistency of each response. This focuses the
teacher's attention on the particular responses that are most pertinent to a particular
student's education. The teacher can go beyond the criterion and norm referencing
produced by a student's position on the curriculum strand into an itemized diagnosis of
the details of each student's particular performance.
Curricular Implications
An item bank can accept new items without large scale pretests. All that is needed is an
analysis of the extent to which the pattern of student responses elicited by each new item
is consistent with these students' estimated positions on the curriculum strand. New items
that share the common content can be added as the curriculumdevelops. When these new
items are administeredwith items already in the bank, their consistency with the bank can
be evaluated au courant and, if satisfactory, the new items can be calibrated onto the bank
and used immediately. This means that the contents of the bank can follow the curriculum
strand as it develops. Freed from the constraint of a fixed list of items that must be
administered as a complete set, teachers can teach to their curriculum. Then they can use
their curriculum strand banks dynamically not only to assess how well their teaching is
succeeding with individual students but also to build objective maps of the direction their
curriculum is taking.
Test results, however individualized, are not restricted to single teachers' assessments
of their own teaching methods. Because all of the items drawn from a particular bank are
calibrated onto one common scale, teachers can compare their test results with one
another, even when their tests contain no common items. This opportunity to compare
results quantitatively enables teachers to examine how the same topic is learned by
different students working with different teachers and hence to evaluate alternative
teaching strategies. With common curriculum strands as the frames of reference, it
becomes possible to recognize subtle differences in the way school subjects are mastered.
The investigation of which teaching methods are most effective in which circumstances
can become an ongoing, routine part of the educational process. Tests constructed from
item banks can promote an exchange of ideas, not only about assessment, but also about
curricula. The organization of curricular content provided by item bank calibrations can
also supply an objective basis for the development and revision of curricular theory.
estimated from the subscore on the "relevant" items remaining. The wisest reaction will
depend on the reason for testing and on what has disturbed the testing session. Routine
analysis of student performance consistency can help teachers make the best choice by
calling performance problems to their attention and suggesting their nature.
Figure 1. How to build an item bank. (Flow diagram: a bank plan and the items to be banked are assembled into test forms; test administration yields student responses; bank building proceeds through the FORCAL calibrator and the SHIFT linker; the output is a set of item reports (lists and maps) and student reports (KID LIST and KID MAP).)
1983; Wright & Panchapakesan, 1969; Wright & Stone, 1979). The bank building
equations are given in this article. Table 1 lists some sites where computer programs like
this have been used.
Analyzing Fit
The first estimation of item and form difficulties is based on all data and the
expectation that these data can be used to approximate additive conjoint measurement
(Brogden, 1977; Wright & Masters, 1982, chap. 1). The estimates of item and form
difficulties are sample-free to the extent of this approximation. The empirical criterion is
the degree of consistency between observation and expectation and the extent to which
provocative subdivisions of data, by ability group, grade level, sex, and so on, produce
statistically equivalent item and form calibrations (Ludlow, 1983; Mead, 1975).
Item-Within-Form Fit Analysis
The first check as to whether item difficulties are approximately sample-free is done
during form calibration. If item estimates are invariant with respect to student abilities,
student sample subdivisions will give statistically equivalent item difficulties.
One way to test this is to divide the student sample into subgroups by raw score r (the
sufficient statistic for ability) and to compare the observed successes on each item i in
each ability subgroup g with the number predicted for that subgroup. If parameter
estimates are adequate for describing group g, then the observed number correct in group
g will be near the estimated model expectation

$$R_{gi} = \sum_{r \in g} N_r P_{ri} \qquad (1)$$

with model variance

$$s_{gi}^2 = \sum_{r \in g} N_r P_{ri} (1 - P_{ri}) \qquad (2)$$
where N_r is the number of students with score r, and P_ri is the estimated probability of
success for a student with score r on item i (Rasch, 1960/1980). If observed and expected are
comparable, given the model variance of the observed, then there is no evidence against
the conclusion that subgroups concur on the estimated difficulty, and the confidence to be
placed in this estimate can be specified with its modeled standard error. Similar analyses
can be done for student subgroups defined in other ways.
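As an illustration, the comparison of Equations 1 and 2 for one item in one score group can be sketched in a few lines of Python. The function and argument names are assumptions of the sketch, and the score-group ability estimates are taken as given.

```python
import math

def p_success(ability, difficulty):
    # Dichotomous Rasch probability of a correct response
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def score_group_check(observed_correct, group, counts, abilities, difficulty):
    """Compare the observed number correct on one item within score group g
    with the expectation R_gi (Equation 1) and model variance s_gi^2 (Equation 2).
    group: the raw scores r belonging to g; counts[r]: N_r; abilities[r]: the
    ability estimate for score r; difficulty: the item's estimated difficulty."""
    expected = sum(counts[r] * p_success(abilities[r], difficulty) for r in group)
    variance = sum(counts[r] * p_success(abilities[r], difficulty)
                   * (1.0 - p_success(abilities[r], difficulty)) for r in group)
    # A standardized difference far beyond 2 in absolute value suggests that
    # this score group does not concur on the item's estimated difficulty.
    z = (observed_correct - expected) / math.sqrt(variance)
    return expected, variance, z
```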
Another way to check within form item fit is to evaluate the agreement between the
variable manifested by item i and the variable defined by the other items. A useful
statistic for this is a mean square in which the standard squared residual of observation x
from its expectation p, z^2 = (x - p)^2 / [p(1 - p)], for each student n's response to item i,
is weighted by the information in the observation, u_ni = p(1 - p), and summed over N
students:

$$v_i = \left[ \frac{\sum_{n}^{N} z_{ni}^2 u_{ni}}{\sum_{n}^{N} u_{ni}} \right] \left[ \frac{N}{N - 1} \right] \qquad (3)$$
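One possible Python transcription of Equation 3 for a single item, with illustrative names, is:

```python
def fit_mean_square(responses, probabilities):
    """Information-weighted mean square of Equation 3 for one item.
    responses: 0/1 scores for N students; probabilities: the model
    probabilities P_ni of success for those students on this item.
    Note that z_ni^2 * u_ni reduces to (x_ni - P_ni)^2."""
    N = len(responses)
    weighted_residuals = sum((x - p) ** 2 for x, p in zip(responses, probabilities))
    information = sum(p * (1.0 - p) for p in probabilities)
    return (weighted_residuals / information) * (N / (N - 1))
```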
where d_ik and d_ij are the estimated difficulties of linking item i in forms k and j, n is the
number of items in this link, and 1/w_ikj = 1/[se_ik^2 + se_ij^2] is an information weight based on
the calibration standard errors, se_ik and se_ij. The standard error of t_kj is

$$se(t_{kj}) = \left[ \sum_{i}^{n} 1/w_{ikj} \right]^{-1/2} \qquad (5)$$
Shift t_kj estimates the difference in origins of forms k and j. A shift is calculated for
every pair of forms linked by common items. The difficulty T_k of form k is the average
shift for form k over all forms,

$$T_k = \frac{\sum_{j}^{M} t_{kj}}{M} \qquad (6)$$

where M is the number of forms and t_kk = 0. The standard error of T_k is

$$se(T_k) = \frac{\left[ \sum_{j}^{M} se^2(t_{kj}) \right]^{1/2}}{M} \qquad (7)$$
Equations 4 through 7 assume every form is linked to every other form. When links are
missing between some forms, as is usually the case, empty cells can be started at
$$t_{kj} = 0, \qquad (8)$$

and improved by calculating form difficulties with Equation 6 and adjusting empty cells
to

$$t_{kj} = T_k - T_j \qquad (9)$$

until the T_k stabilize. This process works as long as every form can be reached from every
other form by some chain of links.
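A minimal Python sketch of this iteration, assuming the observed link shifts are held in a dictionary keyed by form pairs, is given below; the names are illustrative.

```python
def form_difficulties(observed_shifts, max_iter=100, tol=1e-6):
    """Estimate form difficulties T_k from an incomplete table of link
    shifts t_kj (Equations 6, 8, and 9).
    observed_shifts: dict mapping (k, j) -> t_kj for directly linked pairs."""
    forms = sorted({k for k, _ in observed_shifts} | {j for _, j in observed_shifts})
    M = len(forms)
    # Equation 8: start missing cells (and the diagonal t_kk) at zero.
    t = {(k, j): observed_shifts.get((k, j), 0.0) for k in forms for j in forms}
    T = {k: 0.0 for k in forms}
    for _ in range(max_iter):
        # Equation 6: form difficulty is the average shift over all M forms.
        new_T = {k: sum(t[(k, j)] for j in forms) / M for k in forms}
        # Equation 9: refine only the cells that were not observed directly.
        for k in forms:
            for j in forms:
                if k != j and (k, j) not in observed_shifts:
                    t[(k, j)] = new_T[k] - new_T[j]
        if max(abs(new_T[k] - T[k]) for k in forms) < tol:
            return new_T
        T = new_T
    return T
```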
The bank origin is at the center of the forms so that form difficulty T_k is the difference
between the center of form k and the center of the bank.
$$\text{Between link fit} = \frac{\sum_i (d'_{ik} - d'_{ij})^2}{\sum_i w_{ikj}} \qquad (11)$$
where the within form item difficulties, d_ik, have been translated to their bank values d'_ik
by

$$d'_{ik} = d_{ik} + T_k \qquad (12)$$

and w_ikj = [se_ik^2 + se_ij^2]. Values substantially greater than one, given expected variance
2/(n - 1), signify that some items operate differently in the two forms. A plot of d'_ik
versus d'_ij over i facilitates the evaluation of link status and the identification of aberrant
items (Wright & Masters, 1982, pp. 114-117; Wright & Stone, 1979, pp. 92-95).
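The translation of Equation 12 and the between link fit of Equation 11 can be sketched as follows, assuming the linking items' difficulties and standard errors are supplied as parallel lists; the names are illustrative.

```python
def between_link_fit(d_k, d_j, se_k, se_j, T_k, T_j):
    """Between-link fit (Equation 11) for the items linking forms k and j.
    d_k, d_j: within-form difficulties of the common items; se_k, se_j:
    their calibration standard errors; T_k, T_j: the form difficulties."""
    # Equation 12: translate within-form difficulties to bank values.
    bank_k = [d + T_k for d in d_k]
    bank_j = [d + T_j for d in d_j]
    numerator = sum((bk - bj) ** 2 for bk, bj in zip(bank_k, bank_j))
    denominator = sum(sk ** 2 + sj ** 2 for sk, sj in zip(se_k, se_j))  # sum of w_ikj
    # Values well above one (expected variance 2/(n - 1)) suggest that some
    # linking items operate differently in the two forms.
    return numerator / denominator
```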
Bank difficulty is the average of the item's difficulties in the forms in which it was
calibrated, adjusted for these forms' difficulties. The between difficulty root mean square
is the square root of the average squared difference between an item's bank equated
difficulties in each form and its bank difficulty. It is useful to tag items with between
difficulty root mean squares greater than 0.5 logits for examination because they are
frequently found to have been miskeyed or misprinted in one of the forms in which they
appear.
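A sketch of this rule in Python, with the 0.5 logit threshold taken from the text and all names illustrative, is:

```python
def bank_difficulty(form_values, form_shifts, tag_threshold=0.5):
    """Bank difficulty of one item as the average of its within-form
    difficulties adjusted by the corresponding form difficulties, with the
    between-difficulty root mean square used to tag suspect items.
    form_values: the item's within-form difficulties; form_shifts: the T_k
    of the forms in which it was calibrated."""
    adjusted = [d + T for d, T in zip(form_values, form_shifts)]
    bank = sum(adjusted) / len(adjusted)
    rms = (sum((a - bank) ** 2 for a in adjusted) / len(adjusted)) ** 0.5
    # Items with RMS above the threshold are frequently miskeyed or misprinted.
    return bank, rms, rms > tag_threshold
```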
The within form item fit mean square of Equation 3 is standardized to mean zero and
variance one and the average square of these standardized within form fits is used to
summarize item performances within forms. Its sign is taken from the sign of the
standardized fit with the largest absolute value to distinguish between misfit caused by
unexpected disorder, indicated by large positive standardized fits, and misfit caused by
unexpected within form inter-item dependence, indicated by large negative standardized
fits. It is useful to tag items producing values greater than 2 or less than -2 for further
examination.
Program ITEMMAP is used to display the variable graphically by plotting the items,
according to their bank difficulties, along the line of the variable which they define. This
enables teachers to examine the relationship between the content of the items and their
bank difficulties in order to review the extent to which the item order defines a curriculum
strand that agrees with their expectations and so has construct validity. It also provides a
framework for writing new items to fill gaps that appear in the definition of the
curriculum strand and for choosing items for new tests.
Program FORMLIST is used to list each form by form number, name, number of
items, and bank difficulty. Each item is listed by form position, item name, key, within
form difficulty and standard error, total within form standardized fit, and bank difficulty.
This facilitates the review of each form as a whole and the identification of form specific
anomalies.
Program KIDLIST is used to list each student by identification, ability measure, error,
and fit statistic. KIDLIST indicates students who misfit by displaying their response
string and its residuals from expectation, so that teacher and student can see the specific
sources of misfit.
Program KIDMAP is used to provide a graphical representation of each student's
performance. KIDMAP makes an item response map for each student which shows where
the student and the items taken stand on the curriculum strand, which items were
answered correctly, the probability of each response, and the student's percent mastery at
each item. This provides teacher and student with a picture of the student's performance
which combines specification of criteria mastery with the identification of unexpected
strengths and weaknesses.
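A rough text-only sketch in the spirit of such a map, not the KIDMAP program itself, might order the items taken by bank difficulty and mark each response with its model probability; all names and layout choices below are illustrative.

```python
import math

def print_response_map(items, responses, ability):
    """Print items from hardest to easiest with the student's response and
    the model probability of success, flagging unexpected responses.
    items: list of (name, bank_difficulty); responses: dict name -> 0/1."""
    for name, difficulty in sorted(items, key=lambda it: it[1], reverse=True):
        p = 1.0 / (1.0 + math.exp(difficulty - ability))
        mark = "right" if responses[name] == 1 else "wrong"
        surprising = (responses[name] == 1) != (p >= 0.5)
        flag = "  <-- unexpected" if surprising else ""
        print(f"{difficulty:6.2f}  {name:<12}  {mark}  p={p:.2f}{flag}")
```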
Discriminations less than one imply noise in item use, outbreaks of guessing or carelessness, or the
presence of secondary variables correlated negatively with the intended variable.
Discriminations greater than one imply items unreached by or not yet taught to low
scoring students, response formats or item contents that introduce inter-item dependen-
cies, or the presence of secondary variables correlated positively with the intended
variable.
The between fit statistic in Table 2 is a between score group mean square calculated
from Equations 1 and 2 accumulated over score groups (Wright & Panchapakesan, 1969,
pp. 44-46) and standardized to mean zero and variance one (Wright & Masters, 1982, p.
101). The total fit statistic is the mean square of Equation 3 similarly standardized.
Miskeying usually produces a characteristic misfit pattern. The item appears more
difficult than anticipated, the between fit is large, the discrimination low. Item No. 277 in
Table 2 illustrates this. Its calibration implies that it is very difficult, but it requires an
easy task, "What does the symbol '-' mean?" The other items that deal with the
recognition of addition and subtraction symbols are easy. Investigation revealed that
division rather than subtraction was the keyed right answer. Correcting the key and
rescoring rescued item No. 277 from its misfit status and gave it a new difficulty which
placed it among other items of its type.
Misfit caused by student behavior, such as guessing and carelessness, is not diagnosed
well by item fit statistics because item statistics lump together students behaving
differently. Disturbances that are the consequences of individual student behavior are
best detected and best dealt with through the fit analysis of individual students (Smith,
1982, 1984; Wright & Stone, 1979, chaps. 4 & 7). But item statistics can call attention to
items that tend to provoke irregular behavior.
TABLE 2

Math 277  310   4.41  0.52  -0.06  -0.50   9.50   0.38  Miskey/Misprint
          324   4.04  0.39  -0.11  -0.19  10.59   1.09
Math 256  314   2.77  0.30   0.11   0.38   6.61   0.82  Guessing
          321   3.02  0.34   0.02   0.05  10.23   0.77
Read 339  112   2.07  0.28   0.10   0.11   6.81   1.85  Guessing
Math 258  304  -2.83  0.72  -0.03   0.24   6.48  -0.05  Careless
Guessing is only a problem when some low-ability students are provoked to guess on
items that are too difficult for them. The characteristic item statistic pattern is high
difficulty, high between fit, and low discrimination. Item Nos. 256, 339, and 23 in Table 2
illustrate this.
Item No. 256 shows a map and asks: "Bill followed the path and went from home to the
mountains for a picnic. How far was his round trip if he went the shortest way possible?"
Except for the requirement that students know that "round trip" means the same as
"from X to Y and back" this item is similar to the other items in its skill. Perhaps
uncertainty concerning the meaning of "round trip" provoked some low-scoring students
to guess on this item.
Item No. 339 appeared to be one of the most difficult items in the reading bank. Item
No. 23 appeared to be the hardest item in its forms. The item characteristic curves for
these items were flat implying that low-scoring students answered them correctly about as
often as high-scoring students. These items were found to share an ambiguous correct
alternative. Item No. 339 reads: "The word that has the same sound as the 'e' in 'problem'
is:" with alternatives: "ago," "eat," "out," "ink." Item No. 23 reads: "The word that has
the same sound as the 'ou' in 'famous' is:" with alternatives: "own," "you," "ago," "odd."
"Ago," the correct answer to both, is the only alternative in this skill containing two vowel
sounds. Only three or four students in each ability group responded "ago." It seems
reasonable to infer that many of these few successes were guesses. Which students
actually guessed, however, can only be determined by examining each student's individual
performance pattern and evaluating the extent of improbable correct answers to these
(and other) items.
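A simple screen for such improbable successes, given a student's measure and the bank difficulties of the items taken, might look like the following Python sketch; the 0.10 probability threshold is an illustrative choice, not a value from the text.

```python
import math

def improbable_successes(responses, difficulties, ability, threshold=0.10):
    """List items answered correctly despite a low model probability of
    success -- candidates for lucky guesses worth a closer look.
    responses: item -> 0/1; difficulties: item -> bank difficulty (logits);
    ability: the student's estimated measure (logits)."""
    flagged = []
    for item, score in responses.items():
        p = 1.0 / (1.0 + math.exp(difficulties[item] - ability))
        if score == 1 and p < threshold:
            flagged.append((item, round(p, 3)))
    return flagged
```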
Carelessness occurs when some high-ability students fail easy items. The pattern in
item statistics is low difficulty, high between fit, and low discrimination. Item Nos. 258, 9,
and 92 in Table 2 illustrate this.
Item No. 92 reads: "In this story, 'don't' means the same as:" with alternatives: "do
no," "do not," "did not," "does not." Misfit was traced to a high-scoring group in which
more students than expected chose the incorrect alternative "do no." As this error is
particularly glaring, and this item was easy for low-scoring students, carelessness is
implied. Perhaps the distractor "do no" was misread as "do not" by some able students in
a hurry. The identification of which particular students were careless, however, requires
the examination of each student's individual performance pattern and an evaluation of the
extent of improbable wrong answers to these (and other) items.
When the disturbance in a misfitting item is not mechanical or clerical, the cause can
often be traced to special knowledge such as knowing that multiplication by zero is
different than multiplication by other numbers. Interactions with exposure can also affect
the shape of the response curve. Dependence on a skill that only high-ability students have
been taught can make an item unfairly easier for these high-ability students. This will
cause the item to have a discrimination index larger than one and a fit statistic that is too
low. On the other hand, dependence on a skill that is negatively related to instruction, so
that low-ability students tend to possess more of it, can make an item unfairly easier for
low-ability students and, hence, give it a discrimination index smaller than one and a fit
statistic that is too high. Either way, the interaction disqualifies the item for use with
students who are unequal in their exposure to the special skill. If discrimination is too low,
the item is unfair to more able students. If discrimination is too high, the item is unfair to
less able students.
REFERENCES
BROGDEN, H. E. (1977). The Rasch model, the law of comparative judgement and additive
conjoint measurement. Psychometrika, 42, 631-634.
CHOPPIN, B. (1968). An item bank using sample-free calibration. Nature, 219, 870-872.
CHOPPIN, B. (1976). Recent developments in item banking. In Advances in psychological and
educational measurement. New York: Wiley.
CHOPPIN, B. (1978). Item banking and the monitoring of achievement (Research in Progress
Series No. 1). Slough, England: National Foundation for Educational Research.
CHOPPIN, B. (1981). Educational measurement and the item bank model. In C. Lacey & D.
Lawton (Eds.), Issues in evaluation and accountability. London.
CONNOLLY, A. J., NACHTMAN, W., & PRITCHETT, E. M. (1971). Key math: Diagnostic
arithmetic test. Circle Pines, MN: American Guidance Service.
CORNISH, G., & WINES, R. (1977). Mathematics profile series. Hawthorn, Victoria:
Australian Council for Educational Research.
ELLIOTT, C. D. (1983). British ability scales, manuals 1-4. Windsor, Berks: NFER-Nelson.
ENGLEHARD, G., & OSBERG, D. (1983). Constructing a test network with a Rasch
measurement model. Applied Psychological Measurement, 7, 283-294.
IZARD, J., FARISH, S., WILSON, M., WARD, G., & VAN DER WERF, A. (1983). RAPT in
subtraction: Manualfor administration and interpretation. Melbourne:Australian Council for
Educational Research.
KOSLIN, B., KOSLIN, S., ZENO, S., & WAINER, H. (1977). The validity and reliability of the
degrees of reading power test. Elmsford, NY: Touchstone Applied Science Associates.
LUDLOW, L. H. (1983). The analysis of Rasch model residuals. Unpublished doctoral
dissertation, University of Chicago.
MEAD, R. J. (1975). Analysis of fit to the Rasch model. Unpublished doctoral dissertation,
University of Chicago.
RASCH, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago:
University of Chicago Press. (Original work published 1960)
RENTZ, R., & BASHAW, L. (1977). The national reference scale for reading: An application of
the Rasch model. Journal of Educational Measurement, 14, 161-179.
SMITH, R. M. (1982). Detecting measurement disturbances with the Rasch model. Unpublished
doctoral dissertation, University of Chicago.
SMITH, R. M. (1984). Validation of individual response patterns. In International Encyclopedia
of Education. Oxford: Pergamon Press.
STONE, M. H., & WRIGHT, B. D. (1981). Knox's cube test. Chicago: Stoelting.
WOODCOCK, R. W. (1973). Woodcock Reading Mastery Test. Circle Pines, MN: American
Guidance Service.
WRIGHT, B. D. (1968). Sample-free test calibration and person measurement. In Proceedings of
the 1967 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing
Service.
WRIGHT, B. D. (1977). Solving measurement problems with the Rasch model. Journal of
Educational Measurement, 14, 97-116.
WRIGHT, B. D. (1983). Fundamental measurement in social science and education (Research
Memorandum No. 33). Chicago: University of Chicago, Department of Education, MESA
Psychometric Laboratory.
WRIGHT, B. D., & MASTERS, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
WRIGHT, B. D., & PANCHAPAKESAN, N. (1969). A procedure for sample-free item analysis.
Educational and Psychological Measurement, 29(1), 23-48.
WRIGHT, B. D., & STONE, M. H. (1979). Best test design. Chicago: MESA Press.
AUTHORS
BENJAMIN D. WRIGHT, Professor of Education and Behavioral Science, Chair, MESA Special Field, Director, MESA Psychometric Laboratory, University of Chicago, 5835 Kimbark Avenue, Chicago, IL 60637. Degrees: BS, Cornell University; PhD, University of Chicago. Specializations: Measurement, psychoanalytic psychology.
SUSAN R. BELL, Research Associate, MESA Psychometric Laboratory, University of Chicago, 5835 Kimbark Avenue, Chicago, IL 60637.