
Teaching & Learning Seminar Series

College of Health Sciences


Midwestern University

Interpretation of Discrimination Data from Multiple-Choice Test Items


Understanding how to interpret three useful statistics concerning your students' multiple-choice test scores
will help you construct well-designed tests and improve instruction.

1. Item Difficulty (P): the percentage of students who correctly answered an item.
 Also referred to as the p-value
 Ranges from 0% to 100%, or more typically written as a proportion 0.00 to 1.00
 The higher the value, the easier the item
 P-values above 0.90 indicate very easy items that you should not reuse on subsequent tests. If almost
all students responded correctly, the item probably addresses a concept that is not worth testing.
 P-values below 0.20 indicate very difficult items. If almost all students responded incorrectly, either
the item is flawed or students did not understand the concept. Consider revising confusing
language, removing the item from subsequent tests, or targeting the concept for re-instruction.
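
As a quick illustration, item difficulty can be computed directly from a scored response matrix. The sketch below is not part of the original handout; the data and array names are made up, and it assumes NumPy is available.

import numpy as np

# Rows are students, columns are items; 1 = correct, 0 = incorrect (made-up data).
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 1],
])

# Item difficulty (P) is the proportion of students who answered each item correctly.
p_values = scores.mean(axis=0)
print(p_values)  # one proportion per item; here 0.8, 0.8, 0.0, 1.0

# Flag items against the guidelines above (P > 0.90 very easy, P < 0.20 very difficult).
for i, p in enumerate(p_values, start=1):
    if p > 0.90:
        print(f"Item {i}: very easy (P = {p:.2f}); consider dropping from future tests")
    elif p < 0.20:
        print(f"Item {i}: very difficult (P = {p:.2f}); review wording or re-teach the concept")
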
For maximum discrimination potential, desirable difficulty levels are slightly higher than midway between
chance (1.00 divided by the number of choices) and perfect scores (1.00) for an item:

ITEM FORMAT AND RESPECTIVE IDEAL DIFFICULTY

 Five-response multiple-choice .60
 Four-response multiple-choice .63
 Three-response multiple-choice .66
 True-false (two-response multiple-choice) .75
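
The values above are essentially the midpoints just described; a quick check (illustrative code, not part of the handout):

def ideal_difficulty(num_choices: int) -> float:
    """Midway between chance (1.00 / num_choices) and a perfect score (1.00)."""
    chance = 1.00 / num_choices
    return (chance + 1.00) / 2

for k in (5, 4, 3, 2):
    print(f"{k}-response item: midway point = {ideal_difficulty(k):.3f}")
# 5 -> 0.600, 4 -> 0.625, 3 -> 0.667, 2 -> 0.750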

2. Point-Biserial Correlation Coefficient (PBCC) for Item Discrimination, or R(IT): the relationship between how well students performed on the item and their total test score.
 Ranges from -1.00 to +1.00
 The higher the value, the more discriminating the item
 A highly discriminating item indicates that students with high test scores responded correctly
whereas students with low test scores responded incorrectly.
Remove items with discrimination values near or less than zero, because such values indicate that students who
performed poorly on the test did better on the item than students who performed well on the test. The
item is confusing your better-scoring students in some way.
EVALUATE ITEMS USING FOUR GUIDELINES FOR CLASSROOM TEST
DISCRIMINATION VALUES:
 0.40 or higher very good items
 0.30 to 0.39 good items
 0.20 to 0.29 fairly good items
 0.19 or less poor items
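
A minimal sketch of this item-total relationship, computed here as the correlation between each item and the total of the remaining items (the corrected item-total form); the response data are made up, and NumPy is assumed:

import numpy as np

# Rows are students, columns are items; 1 = correct, 0 = incorrect (made-up data).
scores = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
])

total = scores.sum(axis=1)

for i in range(scores.shape[1]):
    item = scores[:, i]
    rest = total - item  # exclude the item itself so it does not inflate the correlation
    r_it = np.corrcoef(item, rest)[0, 1]
    if r_it >= 0.40:
        label = "very good"
    elif r_it >= 0.30:
        label = "good"
    elif r_it >= 0.20:
        label = "fairly good"
    else:
        label = "poor; consider removing"
    print(f"Item {i + 1}: R(IT) = {r_it:.2f} ({label})")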

3. Reliability Coefficient (ALPHA): a measure of the amount of error associated with a test score.
 Ranges from 0.00 to 1.00
 The higher the value, the more reliable the test score
 Typically, a measure of internal consistency, indicating how well items are correlated with one
another
 High reliability indicates that items are measuring the same construct (e.g., knowledge of how to
calculate integrals)
 Two ways to improve test reliability: 1) increase the number of items or 2) use items with high
discrimination values
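
For tests scored 0/1, the usual internal-consistency estimate is Cronbach's alpha, which reduces to KR-20 for dichotomous items. A minimal sketch with made-up data (not from the handout):

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a students-by-items score matrix (equals KR-20 for 0/1 items)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Rows are students, columns are items; 1 = correct, 0 = incorrect (made-up data).
scores = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")
# Adding more items or replacing low-discrimination items raises this value.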

RELIABILITY INTERPRETATION
 .90 and above: Excellent reliability; at the level of the best standardized tests
 .80 - .90: Very good for a classroom test
 .70 - .80: Good for a classroom test; in the range of most classroom tests. There are probably a few
items that could be improved.
 .60 - .70: Somewhat low. This test should be supplemented by other measures to determine grades.
There are probably some items that could be improved.
 .50 - .60: Suggests need to revise the test, unless it is quite short (ten or fewer items). The test must
be supplemented by other measures for grading.
 .50 or below: Questionable reliability. This test should not contribute heavily to the course grade, and
it needs revision.

DISTRACTOR EVALUATION
Another useful item review technique is distractor evaluation.
You should consider each distractor an important part of an item: nearly 50 years of research shows a
relationship between the distractors students choose and total test score. The quality of the distractors
influences student performance on a test item.
Although correct answers must be truly correct, it is just as important that distractors be clearly incorrect,
appealing to low scorers who have not mastered the material rather than to high scorers. You should review
all item options to anticipate potential errors of judgment and inadequate performance so you can revise,
replace, or remove poor distractors.
One way to study responses to distractors is with a frequency table that shows the proportion of students
who selected each distractor. Remove or replace distractors selected by few or no students, because
students find them implausible.
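
A minimal sketch of such a frequency table; the option letters, responses, and answer key below are made up for illustration:

from collections import Counter

# Choices made by each student on one item (made-up data); "C" is the keyed answer.
responses = ["C", "C", "B", "C", "A", "C", "C", "B", "C", "D", "C", "C"]
key = "C"

counts = Counter(responses)
n = len(responses)
for option in sorted(counts):
    flag = " (key)" if option == key else ""
    print(f"{option}{flag}: {counts[option]:2d} ({counts[option] / n:.0%})")
# Distractors chosen by few or no students (here, A and D) are candidates for
# revision or replacement because students find them implausible.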

CAUTION WHEN INTERPRETING ITEM ANALYSIS RESULTS


Mehrens and Lehmann (1973) offer three cautions about using the results of item analysis:
 Item analysis data are not synonymous with item validity. An external criterion is required to
accurately judge the validity of test items. By using the internal criterion of total test score, item
analyses reflect internal consistency of items rather than validity.
 The discrimination index is not always a measure of item quality. There are a variety of reasons why
an item may have low discrimination power:
o extremely difficult or easy items will have low ability to discriminate, but such items are often needed to
adequately sample course content and objectives.
o an item may show low discrimination if the test measures many content areas and cognitive skills. For
example, if the majority of the test measures "knowledge of facts," then an item assessing "ability to apply
principles" may have a low correlation with total test score, yet both types of items are needed to measure
attainment of course objectives.
 Item analysis data are tentative. Such data are influenced by the type and number of students being
tested, instructional procedures employed, and chance errors. If repeated use of items is possible,
statistics should be recorded for each administration of each item.

Standards of Acceptance

 Item difficulty: 30% - 90%
 Item discrimination ratio: 25% and above
 PBCC: 0.20 and above
 KR-20: 0.70 and above
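
A short sketch of screening items against these standards; the item numbers and statistics are hypothetical:

# Hypothetical per-item statistics from an item-analysis report.
items = {
    1: {"difficulty": 0.95, "pbcc": 0.05},
    2: {"difficulty": 0.62, "pbcc": 0.41},
    3: {"difficulty": 0.25, "pbcc": 0.18},
}

def check_item(stats: dict) -> list[str]:
    """Return the acceptance standards (listed above) that an item fails."""
    problems = []
    if not 0.30 <= stats["difficulty"] <= 0.90:
        problems.append(f"difficulty {stats['difficulty']:.2f} outside 0.30-0.90")
    if stats["pbcc"] < 0.20:
        problems.append(f"PBCC {stats['pbcc']:.2f} below 0.20")
    return problems

for item_id, stats in items.items():
    issues = check_item(stats)
    print(f"Item {item_id}: {'meets standards' if not issues else '; '.join(issues)}")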

Activity: Item Analysis

Review the item statistics for the exams and answer the following questions when possible:

Which item(s) would you remove altogether from the test? Why?
Which distractor(s) would you revise? Why?
Which items are working well?
What does the pattern of responses for the correct and incorrect alternatives across the various splits tell
you about the item?
How can you use the frequency counts (and related percentages) to determine how the class did as a
whole?
How would you use the standard scores to compare 1) the same students across different tests or 2) the
overall scores between tests?
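
For the standard-score question, a z-score puts results from tests with different scales on a common footing; a minimal sketch with made-up class scores:

import statistics

# Made-up raw scores for one class on two exams with different point totals.
exam1 = [72, 65, 80, 58, 90, 77, 69]
exam2 = [40, 35, 44, 30, 48, 42, 38]

def z_score(x: float, scores: list[float]) -> float:
    """How many standard deviations x lies above the class mean."""
    return (x - statistics.mean(scores)) / statistics.stdev(scores)

# A raw 77 on exam 1 and a raw 42 on exam 2 become directly comparable once standardized.
print(f"Exam 1: z = {z_score(77, exam1):.2f}")
print(f"Exam 2: z = {z_score(42, exam2):.2f}")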

References
The University of Texas at Austin, Faculty Innovation Center. https://facultyinnovate.utexas.edu/. Last
accessed November 2016.
DeVellis, R. F. (1991). Scale development: Theory and applications. Newbury Park, CA: Sage Publications.
Haladyna, T. M. (1999). Developing and validating multiple-choice test items (2nd ed.). Mahwah, NJ:
Lawrence Erlbaum Associates.
Lord, F. M. (1952). The relationship of the reliability of multiple-choice tests to the distribution of item
difficulties. Psychometrika, 18, 181-194.
Mehrens, W. A., & Lehmann, I. J. (1973). Measurement and evaluation in education and psychology. New
York: Holt, Rinehart and Winston, pp. 333-334.
Suen, H. K. (1990). Principles of test theories. Hillsdale, NJ: Lawrence Erlbaum Associates.
