Interpretation of Discrimination Data From Multiple-Choice Test Items
RELIABILITY INTERPRETATION
.90 and above: Excellent reliability; at the level of the best standardized tests.
.80 - .90: Very good for a classroom test.
.70 - .80: Good for a classroom test; in the range of most classroom tests. There are probably a few items
that could be improved.
.60 - .70: Somewhat low. This test should be supplemented by other measures to determine grades.
There are probably some items that could be improved.
.50 - .60: Suggests a need to revise the test, unless it is quite short (ten or fewer items). The test must
be supplemented by other measures for grading.
Below .50: Questionable reliability. This test should not contribute heavily to the course grade, and
it needs revision.
DISTRACTOR EVALUATION
Another useful item review technique is distractor evaluation.
Consider each distractor an important part of the item: nearly 50 years of research shows a relationship
between the distractors students choose and their total test scores. The quality of the distractors
influences student performance on a test item.
Although correct answers must be truly correct, it is just as important that distractors be clearly incorrect
and appeal to low scorers who have not mastered the material rather than to high scorers. Review all item
options to anticipate likely errors of judgment and weak performance so you can revise, replace, or remove
poor distractors.
One way to study responses to distractors is with a frequency table that shows the proportion of students
who selected each option. Remove or replace distractors selected by few or no students, because students
find them implausible.
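As a minimal illustration (not from the source handout), the sketch below builds such a frequency table for
a single four-option item and flags distractors that few or no students chose. The responses and the keyed
answer are fabricated.

```python
# Minimal sketch: distractor frequency table for one item, with fabricated data.
from collections import Counter

options = ["A", "B", "C", "D"]
key = "A"  # keyed (correct) answer, assumed for this example
responses = ["A", "A", "C", "A", "B", "A", "A", "C", "A", "A", "B", "A"]

counts = Counter(responses)
n = len(responses)
for opt in options:
    share = counts[opt] / n
    label = "keyed answer" if opt == key else "distractor"
    # Flag distractors chosen by at most one student as candidates for replacement.
    flag = "  <- rarely/never chosen; consider replacing" if opt != key and counts[opt] <= 1 else ""
    print(f"{opt} ({label}): {counts[opt]:2d} of {n} ({share:.0%}){flag}")
```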
Standards of Acceptance
Review the item statistics for the exams and answer the following questions when possible:
Which item(s) would you remove altogether from the test? Why?
Which distractor(s) would you revise? Why?
Which items are working well?
What does the pattern of responses for the correct and incorrect alternatives across the various splits (for
example, upper- and lower-scoring groups) tell you about the item? (A sketch of such a split follows this list.)
How can you use the frequency counts (and related percentages) to determine how the class did as a
whole?
How would you use the standard scores to compare 1) the same students across different tests or 2) the
overall scores between tests?
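As context for the questions about response patterns across splits and about standard scores, here is a
minimal sketch; the student names, total scores, and option choices are fabricated, and splitting on the
median total score is only one common way to form the groups (upper/lower thirds or 27% groups are
frequent alternatives).

```python
# Minimal sketch with fabricated data: 10 students, their total test scores,
# and the option each chose on one item whose keyed answer is assumed to be "B".
from statistics import mean, pstdev
from collections import Counter

students = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10"]
totals   = [42,   35,   28,   47,   31,   25,   44,   38,   22,   40]
choices  = ["B",  "B",  "C",  "B",  "A",  "C",  "B",  "B",  "A",  "B"]

# Split students into upper and lower halves by total score.
order = sorted(range(len(students)), key=lambda i: totals[i], reverse=True)
half = len(order) // 2
upper, lower = order[:half], order[-half:]

for name, group in (("upper", upper), ("lower", lower)):
    counts = Counter(choices[i] for i in group)
    print(f"{name} group option counts: {dict(counts)}")
# A healthy item: the correct option dominates in the upper group, while
# distractors draw most of their choices from the lower group.

# Standard (z) scores put totals from different tests on a common scale, so a
# student's relative standing, or the overall distributions, can be compared.
mu, sigma = mean(totals), pstdev(totals)
z_scores = [(t - mu) / sigma for t in totals]
print(f"z-score for s1: {z_scores[0]:+.2f}")
```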
References
The University of Texas at Austin, Faculty Innovation Center. https://facultyinnovate.utexas.edu/. Last
accessed November 2016.
DeVellis, R. F. (1991). Scale development: Theory and applications. Newbury Park: Sage Publications.
Haladyna, T. M. (1999). Developing and validating multiple-choice test items (2nd ed.). Mahwah, NJ:
Lawrence Erlbaum Associates.
Lord, F. M. (1952). The relationship of the reliability of multiple-choice tests to the distribution of item
difficulties. Psychometrika, 18, 181-194.
Mehrens, W. A., & Lehmann, I. J. (1973). Measurement and evaluation in education and psychology. New
York: Holt, Rinehart and Winston, 333-334.
Suen, H. K. (1990). Principles of test theories. Hillsdale, NJ: Lawrence Erlbaum Associates.