
Rashwan et al. BMC Medical Education (2024) 24:168
https://doi.org/10.1186/s12909-024-05153-3

RESEARCH    Open Access

Postexamination item analysis of undergraduate pediatric multiple-choice questions exam: implications for developing a validated question bank

Nagwan I. Rashwan1, Soha R. Aref2, Omnia A. Nayel3 and Mennatallah H. Rizk4*

Abstract

Introduction: Item analysis (IA) is widely used to assess the quality of multiple-choice questions (MCQs). The objective of this study was to perform a comprehensive quantitative and qualitative item analysis of two types of MCQs, single best answer (SBA) and extended matching questions (EMQs), currently in use in the final Pediatrics undergraduate exam.

Methodology: A descriptive cross-sectional study was conducted. We analyzed 42 SBA and 4 EMQ administered to 247 fifth-year medical students. The exam was held at the Pediatrics Department, Qena Faculty of Medicine, Egypt, in the 2020–2021 academic year. Quantitative item analysis included item difficulty (P), discrimination (D), distractor efficiency (DE), and test reliability. Qualitative item analysis included evaluation of the levels of cognitive skills assessed and the conformity of test items with item-writing guidelines.

Results: The mean score was 55.04 ± 9.8 out of 81. Approximately 76.2% of SBA items assessed low cognitive skills, whereas 75% of EMQ items assessed higher-order cognitive skills. The proportions of items with an acceptable range of difficulty (0.3–0.7) were 23.80% for SBA and 16.67% for EMQ. The proportions of SBA and EMQ with acceptable discrimination (> 0.2) were 83.3% and 75%, respectively. The reliability coefficient (KR-20) of the test was 0.84.

Conclusion: Our study will help medical teachers identify which SBA and EMQ items are of sufficient quality to be included in a validated question bank, as well as which questions need revision and remediation before subsequent use.

Keywords: Single best answer questions, Extended matching questions, Item analysis, Item writing flaws, Question bank

*Correspondence:
Mennatallah H. Rizk
[email protected]
Full list of author information is available at the end of the article


Introduction
"Assessment affects students learning in at least four ways: its content, format, timing, and any subsequent feedback given to the medical students" [1]. MCQs are a well-established format for undergraduate medical student assessment, given that they allow broad coverage of learning objectives. In addition, MCQs are objective and are scored easily and quickly with minimal human-related error or bias. Well-designed MCQs allow for the assessment of higher cognitive skills rather than low cognitive skills [2].

However, MCQs have some limitations. Constructing MCQs is difficult and time-consuming, even for well-trained staff members, and there is evidence that the basic item-writing principles are frequently not followed when MCQs are constructed. The presence of flawed MCQs can interfere with the accurate and meaningful interpretation of test scores and negatively affect student pass rates. Therefore, to develop reliable and valid tests, items must be constructed that are free of such flaws [3].

Item analysis (IA) is the set of qualitative and quantitative procedures used to evaluate the characteristics of test items before and after test development and construction. Quantitative item analysis uses statistical methods to help make judgments about which items need to be kept, reviewed, or discarded. Qualitative item analysis depends on the judgment of reviewers about whether guidelines for item writing have been followed [4].

In quantitative IA, three psychometric domains are assessed for each item: item difficulty (P), item discrimination (D), and distractor efficiency (DE) [5]. Item difficulty (P) refers to the proportion of students who correctly answered the item; it ranges from 0 to 1 [6]. Item discrimination (D) indicates the extent to which the item can differentiate between higher- and lower-achieving students; it ranges from −1.0 (perfect negative discrimination) to +1.0 (perfect positive discrimination) [6]. An item discrimination of more than 0.2 has been reported as evidence of item validity; any item with discrimination below 0.2 or negative discrimination should be reviewed or discarded [7, 8]. Distractor efficiency (DE) is determined for each item based on the number of nonfunctioning distractors (NFDs), that is, options selected by fewer than 5% of students [9].
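To make these three indices concrete, the sketch below computes them for a single item from raw responses. It is an illustration only, not the study's workflow (the study obtained its statistics from dedicated exam-analysis software, as described in the Methods), and the upper–lower 27% group method used here for D is one common convention, an assumption rather than the paper's stated algorithm; the data are hypothetical.

    import numpy as np

    def item_stats(correct, totals, choices, key, options, group_frac=0.27):
        """correct: 1/0 per student for this item; totals: total test score per student;
        choices: the option each student selected; key: the correct option."""
        correct = np.asarray(correct, dtype=float)
        totals = np.asarray(totals, dtype=float)
        choices = np.asarray(choices)

        p = correct.mean()                         # difficulty: proportion correct (0-1)

        # Discrimination via the upper-lower group method (top vs bottom 27% by total score).
        n = len(totals)
        k = max(1, int(round(group_frac * n)))
        order = np.argsort(totals)
        d = correct[order[-k:]].mean() - correct[order[:k]].mean()

        # Distractor efficiency: distractors chosen by < 5% of students are NFDs.
        distractors = [o for o in options if o != key]
        nfd = sum(1 for o in distractors if (choices == o).mean() < 0.05)
        de = 100.0 * (len(distractors) - nfd) / len(distractors)
        return p, d, nfd, de

    # Hypothetical data for one five-option item keyed 'B':
    print(item_stats(correct=[1, 1, 0, 1, 0, 0],
                     totals=[70, 65, 40, 60, 35, 30],
                     choices=['B', 'B', 'A', 'B', 'C', 'A'],
                     key='B',
                     options=['A', 'B', 'C', 'D', 'E']))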
Qualitative IA should be performed routinely before and after the exam to review test items' conformity with MCQ construction guidelines. The two most common threats to the quality of multiple-choice questions are item writing flaws (IWFs) and testing of lower cognitive function [10]. Item writing flaws are violations of MCQ construction guidelines meant to prevent testwiseness and irrelevant difficulty from influencing medical students' performance on multiple-choice exams. IWFs can either introduce unnecessary difficulty unrelated to the intended learning outcomes or provide cues that enable testwise students to guess the correct answer without necessarily understanding the content. Both types of flaws can skew the final test scores and compromise the validity of the assessment [8, 11]. Well-constructed MCQs allow the evaluation of higher-order cognitive skills such as application of knowledge, interpretation, or synthesis rather than testing lower cognitive skills. In practice, however, MCQs have mostly been used to test lower rather than higher cognitive skills, which can be considered a significant threat to the quality of multiple-choice questions [12]. In many medical schools, faculty members are not sufficiently trained to construct MCQs that examine high cognitive skills linked to authentic professional situations [13].

This study aimed to perform a postexamination quantitative and qualitative item analysis of two types of MCQs, SBA and EMQ, to provide guidance when making decisions regarding keeping, reviewing, or discarding questions from exams or question banks.

Methods
Participants
Data were collected from the summative exam of the Pediatrics course (PED502, a 7-credit-hour course), which was conducted at the Qena Faculty of Medicine, South Valley University, Qena, Egypt. The medical school implements a '6 + 1' medical curriculum, a comprehensive seven-year educational program that includes six years of foundational and clinical medical education followed by a year of practical training (internship). Qena Faculty of Medicine, South Valley University, was officially accredited by the National Authority for Quality Assurance and Accreditation of Education (NAQAAE) in 2021 (https://naqaae.eg/ar/accredited_organization/accredited_he). Approximately 247 fifth-year medical students were qualified to take the pediatric final exam during the second semester of the 2020–2021 academic year. All exam questions were authored by faculty members of the Pediatrics Department, Qena Faculty of Medicine, South Valley University, and each was intended to have one correct response.

Procedures
The exam papers and the relevant SBA and EMQ item analysis reports were collected and reviewed. Outputs of Remark Classic OMR® (MCQ test item analysis software) were used for scanning and analyzing the exam data. This software automates the collection and analysis of data from "fill in the bubble" forms. The information collected included the test item analysis report; the number of questions graded; students' responses (correct, incorrect, no response); item difficulty (P); item discrimination (D); and distractor efficiency (DE).

The qualitative item analysis was performed by three assessors, who were provided with an MCQ qualitative analysis checklist to review the exam (Additional file 1). Two types of multiple-choice questions were used in this exam: single best answer (SBA) and extended matching questions (EMQs). There were 42 SBA items with five options each, and four EMQ sets, each with three stems and eight options. A correct response was awarded one and a half marks, and an incorrect response was given zero marks. Each SBA and EMQ item was analyzed independently by the three assessors for the level of cognitive skill tested and the presence of item writing flaws. The assessors had content-area expertise and experience in preparing multiple-choice exams. Questions were categorized according to modified Bloom's taxonomy: Level I, Knowledge (recall of information); Level II, Comprehension and Application (ability to interpret data); and Level III, Problem solving (use of knowledge and understanding in new circumstances) [14]. Cohen's κ was run to determine the inter-rater reliability of the three assessors and was found to be substantial, with a kappa coefficient of 0.591 (p < 0.001), indicating agreement between the assessors beyond what would be expected by chance, according to the guidelines proposed by Landis and Koch (1977) [15].
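For readers who want to reproduce this kind of agreement check, the sketch below averages pairwise Cohen's κ across the three assessors using scikit-learn. The exact pooling used in the study is not stated, so treating the mean pairwise κ as the overall coefficient is an assumption, and the ratings shown are hypothetical.

    from itertools import combinations
    from sklearn.metrics import cohen_kappa_score

    def mean_pairwise_kappa(ratings_by_assessor):
        """ratings_by_assessor: equal-length lists, one per assessor, each giving
        the cognitive-level category assigned to every item."""
        kappas = [cohen_kappa_score(a, b)
                  for a, b in combinations(ratings_by_assessor, 2)]
        return sum(kappas) / len(kappas)

    # Hypothetical Bloom-level ratings (I-III coded 1-3) from three assessors:
    print(mean_pairwise_kappa([
        [1, 2, 3, 1, 2, 1],
        [1, 2, 3, 1, 1, 1],
        [1, 3, 3, 1, 2, 1],
    ]))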
SBA item writing flaws (IWFs) were taken from the NBME item writing guide (6th edition, 2020) [11]. IWFs were categorized and scored as stem flaws (1 = negatively phrased stem; 2 = logical/grammatical cue; 3 = vague, unclear term; 4 = tricky, unnecessarily complicated stem; 5 = no lead-in question/defective; 6 = poorly constructed, short) and option flaws (1 = long, complex options; 2 = inconsistent use of numeric data; 3 = "none of the above" option; 4 = nonhomogeneous options; 5 = collectively exhaustive options; 6 = absolute terms; 7 = grammatical/logical clues; 8 = correct answer stands out; 9 = word repeats (clang clue); 10 = convergence).

EMQ IWFs were taken from the work of Case and Swanson (1993), which highlighted the characteristics of well-written EMQs [16]. EMQ IWFs were categorized and scored as option flaws (1 = fewer than 6 or more than 25 options; 2 = not focused; 3 = no logical/alphabetical order; 4 = not homogeneous; 5 = overlapping/complex), lead-in question flaws (1 = not clear/focused; 2 = nonspecific), and stem flaws (1 = non-vignette; 2 = unclear/unfocused vignette; 3 = short, poorly constructed).
Data analysis
Descriptive methods were based on Classical Test Theory (CTT). CTT considers reliability, difficulty, discrimination, and distractor efficiency to check the appropriateness and plausibility of all distractors; the core of this theory rests on the functions of the true test score and random measurement error [17].

Item psychometric parameters were collected from the reported examination statistics, including item difficulty (P), item discrimination (D), distractor efficiency (DE), and internal consistency reliability for the whole test. The criteria for classification of item difficulty were as follows: P < 0.3 (too difficult), P between 0.3 and 0.7 (good/acceptable/average), P > 0.7 (too easy), with item difficulty between 0.5 and 0.6 considered excellent/ideal. The criteria for classification of item discrimination were as follows: D ≤ 0.20 (poor), 0.21 to 0.39 (good), and D ≥ 0.4 (excellent). Items were categorized on the basis of the number of NFDs in SBA and EMQ items: if a five-option SBA includes 4, 3, 2, 1, or 0 NFDs, the corresponding distractor efficiency (DE) is 0.00, 25, 50, 75, or 100%, respectively. In an EMQ, if the options include 7, 6, 5, 4, 3, 2, 1, or 0 NFDs, the corresponding DE is 0.00, 14.30, 28.50, 42.80, 57.10, 71.40, 85.70, or 100.00%, respectively.
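These classification rules can be expressed as a small helper. The sketch below is illustrative only, mirroring the bands stated above: it labels an item's difficulty and discrimination and derives DE from the NFD count and the number of options.

    def classify_item(p, d, nfd, n_options):
        """Return (difficulty label, discrimination label, DE%) using the bands above."""
        if p < 0.3:
            difficulty = "too difficult"
        elif p <= 0.7:
            difficulty = "good/acceptable"
        else:
            difficulty = "too easy"

        if d < 0:
            discrimination = "negative"
        elif d <= 0.20:
            discrimination = "poor"
        elif d < 0.40:
            discrimination = "good"
        else:
            discrimination = "excellent"

        n_distractors = n_options - 1        # 4 for a 5-option SBA, 7 for an 8-option EMQ
        de = 100.0 * (n_distractors - nfd) / n_distractors
        return difficulty, discrimination, round(de, 2)

    # A 5-option SBA with P = 0.55, D = 0.35 and two nonfunctioning distractors:
    print(classify_item(0.55, 0.35, nfd=2, n_options=5))   # ('good/acceptable', 'good', 50.0)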
were computed for all variables. Normality assumption
Data analysis testing involved the use of Q-Q plots, frequency histo-
Descriptive methods are based on Classical Test Theory grams (with normal curve overlaid) and Shapiro-Wilks
(CTT). The CTT considers reliability, difficulty, dis- Test of Normality. This testing found that Normality was
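As an illustration (not the study's own software workflow), KR-20 and the SEM defined above can be computed directly from a students × items matrix of 0/1 scores:

    import numpy as np

    def kr20_and_sem(responses):
        """responses: students x items matrix of 0/1 item scores."""
        x = np.asarray(responses, dtype=float)
        k = x.shape[1]                      # number of items
        p = x.mean(axis=0)                  # proportion correct per item
        q = 1.0 - p
        totals = x.sum(axis=1)              # total score per student
        var_total = totals.var(ddof=1)      # sample variance of total scores
        kr20 = (k / (k - 1.0)) * (1.0 - (p * q).sum() / var_total)
        sem = totals.std(ddof=1) * np.sqrt(1.0 - kr20)   # SEM = SD * sqrt(1 - reliability)
        return kr20, sem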
Basic frequency distributions and descriptive statistics were computed for all variables. Normality assumption testing involved Q–Q plots, frequency histograms (with a normal curve overlaid), and the Shapiro–Wilk test of normality. This testing found that normality was met for all analyses except one variable (difficulty level of SBA). This variable was subjected to a two-step normalization process to achieve a normal distribution, per the method outlined by Templeton (2011), ensuring a more accurate analysis of the data [19]. A parametric significance test, specifically the independent t-test, was used to compare the means of difficulty and discrimination for the SBA and EMQ formats; it allowed us to determine whether there were statistically significant differences in difficulty and discrimination between the two formats. All analyses were conducted as two-tailed, with p = 0.05 used as the threshold for statistical significance, using SPSS Statistics for Windows, Version 24 (IBM Corp.). The figure was generated in Microsoft Excel 2013.
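The two-step normalization and the format comparison can be sketched as follows. This is an illustrative reading of Templeton's (2011) approach (fractional ranks followed by an inverse-normal transform; Templeton's version additionally preserves the original mean and SD) and of a two-tailed independent t-test, not the study's SPSS syntax; the difficulty values shown are hypothetical.

    import numpy as np
    from scipy import stats

    def two_step_normalize(x):
        """Templeton-style two-step transform: fractional ranks, then inverse normal."""
        x = np.asarray(x, dtype=float)
        ranks = stats.rankdata(x)            # step 1: rank the values (ties averaged)
        pct = ranks / (len(x) + 1.0)         # fractional ranks strictly inside (0, 1)
        return stats.norm.ppf(pct)           # step 2: map the ranks onto a normal curve

    # Hypothetical per-item difficulty values for the two formats:
    sba_p = [0.81, 0.55, 0.92, 0.33, 0.74, 0.68]
    emq_p = [0.88, 0.64, 0.71]

    # Two-tailed independent t-test comparing mean difficulty of SBA and EMQ:
    t, p_value = stats.ttest_ind(sba_p, emq_p)
    print(round(t, 3), round(p_value, 3))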

Results
The final pediatrics exam was composed of 54 items, and the total score was 81 (1.5 marks for each question). The mean exam score was 55.04 ± 9.82. The value of KR-20 was 0.86, which is considered acceptable as it is greater than the commonly accepted threshold of 0.7; this suggests that the MCQ exam in this study is a reliable assessment tool [20]. Reliability depends both on the standard error of measurement (SEM) and on the ability range (standard deviation, SD) of the students taking an assessment. The standard error of measurement was 3.91; the smaller the SEM, the more accurate the assessments being made [21].
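As a worked example of the SEM formula given in the Methods: with SD = 9.82 and a reliability of 0.84 (the KR-20 value reported in the abstract), SEM = 9.82 × √(1 − 0.84) ≈ 3.93, in line with the reported value of 3.91.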
Quantitative item analysis
The difficulty level of items was easy (P > 0.7) for 61.9% of SBA and 66.69% of EMQ items, moderate (0.3 < P ≤ 0.7) for 23.8% of SBA and 16.67% of EMQ items, and difficult (P ≤ 0.3) for 14.3% of SBA and 16.67% of EMQ items. Item discrimination was > 0.2 for 83.3% of SBA and 75% of EMQ items, indicating good discriminating items. Three SBAs (7.10%) had poor discrimination (D ≤ 0.2), and four SBAs (9.5%) had negative discrimination. The mean DE of SBA was 37.69% ± 33.12; the percentage of functioning distractors was 36.9%, and the percentage of nonfunctioning distractors was 63.1%. Only 11.90% of SBA items had a distractor efficiency of 100.00%, while 26.20% had a distractor efficiency of 0.00%. The mean DE for EMQ was 13.09 ± 15.46. No EMQ had a DE of 100%, while 41.60% of EMQ had a DE of 0.00%; the percentage of functioning distractors was 13.1%, and the percentage of nonfunctioning distractors was 86.90%.

Table 1 Distribution of single best answer (SBA) items by difficulty and discrimination levels: frequencies and percentages (42 SBAs)

Item discrimination        Item difficulty
                           0.71–1 (Easy)    0.3–0.7 (Moderate)    < 0.3 (Difficult)
≥ 0.4 (Very good)          12 (28.6%)       4 (9.5%)              0
0.21–0.39 (Good)           13 (30.9%)       5 (11.9%)             1 (2.4%)
0–0.2 (Poor)               1 (2.4%)         1 (2.4%)              1 (2.4%)
Negative                   0                0                     4 (9.5%)

Table 2 Distribution of extended matching question (EMQ) items by difficulty and discrimination levels: frequencies and percentages (12 EMQ stems)

Item discrimination        Item difficulty
                           0.71–1 (Easy)    0.3–0.7 (Moderate)    < 0.3 (Difficult)
≥ 0.4 (Very good)          6 (50%)          0                     0
0.21–0.39 (Good)           1 (8.3%)         2 (16.7%)             0
0–0.2 (Poor)               1 (8.3%)         0                     2 (16.7%)
Negative                   0                0                     0

Table 1 indicates that only 9 SBAs (21.4%) met the recommended levels for difficulty and discrimination (P ranging from 0.3 to 0.7 and D > 0.2). Table 2 indicates that only 2 EMQ items (16.7%) met the recommended levels for difficulty and discrimination (P ranging from 0.3 to 0.7 and D > 0.2). These questions should be retained in the question bank, provided they are free of IWFs.

Table 3 Comparison of the mean difficulty (P) and discrimination (D) values of the SBA and EMQ

IA parameter    SBA mean ± SD    EMQ mean ± SD    t-test     P value
P               0.67 ± 0.28      0.70 ± 0.28      −0.405     0.686*
D               0.32 ± 0.16      0.35 ± 0.18      −0.557     0.620*

SD standard deviation; t: independent t-test; *: not statistically significant (p > 0.05)

Table 3 shows the comparison of the mean difficulty (P) and discrimination (D) values of single best answer (SBA) and extended matching questions (EMQ).

The mean difficulty for SBA was 0.67 (± 0.28) and for EMQ was 0.70 (± 0.28); the independent t-test showed no significant difference between the two formats (t = −0.405, p = 0.686). Similarly, the mean discrimination for SBA was 0.32 (± 0.16) and for EMQ was 0.35 (± 0.18); again, the independent t-test revealed no significant difference (t = −0.557, p = 0.620).

Qualitative item analysis
The prevalence of SBA items testing low cognitive skills was 76.19%; only 23.8% of SBAs tested higher cognitive skills. Conversely, most EMQs tested higher cognitive skills (75%), and 25% of EMQs tested low cognitive skills.

Stem flaws were found in 30 SBA questions (71.40%), and option flaws were found in 23 questions (54.76%); 15 questions (35.7%) had more than 2 IWFs. Poorly constructed stems were the most frequent stem flaw, with 15 questions (35.70%), followed by negatively phrased stems (33.30%), vague or unclear terms (21.40%), tricky or unnecessarily complicated stems (21.40%), no lead-in question (21.40%), and logical/grammatical cue flaws (7.10%). Regarding option flaws, a nonhomogeneous options list was the most frequent flaw (35.70%), followed by the correct answer standing out, long complex options, and inconsistent use of numeric data (9.50% each). Word repeats and convergence were found in 4.80% of cases each.

Among EMQs, option flaws with no logical/alphabetical order were the most frequent (100.00%), followed by nonhomogeneous and overlapping/complex options (25.00% each). Lead-in statement flaws included unclear/unfocused lead-in statements (75.00%) and nonspecific statements (50.00%). The stem flaws found were non-vignette stems and short, poorly constructed stems (25.00% each).

Figure 1 shows the four categories of MCQ based on the level of IA indices (P and D) and the presence or absence of IWFs. The four categories are as follows:

I. Acceptable IA indices with no IWFs: Questions have a difficulty level within the acceptable range (0.3–0.7) and a discrimination level > 0.2, and items are free of flaws.
II. Acceptable IA indices with IWFs: Questions have acceptable difficulty and discrimination levels, and items are flawed.
III. Nonacceptable IA indices with no IWFs: Questions have a difficulty level < 0.3 or > 0.7 and discrimination < 0.2, and items are free of flaws.
IV. Nonacceptable IA indices with IWFs: Questions have a difficulty level < 0.3 or > 0.7 and discrimination < 0.2, and items are flawed.

Fig. 1 The four categories of both SBA and EMQ formats based on the level of item analysis (IA) indices (difficulty P and discrimination D) and the presence or absence of IWFs

The prevalence of SBA and EMQ items with acceptable IA indices and no IWFs was 14.2% and 0%, respectively; these questions should be kept in the question bank without any modifications. The prevalence of SBA and EMQ items with acceptable IA indices but with IWFs was 23.8% and 33.3%, respectively; these questions need remediation before being kept in the question bank. Items with nonacceptable IA indices, with or without IWFs (which constitute more than 60% of the items), should be discarded from the question bank.
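The decision rule implied by these four categories can be written out explicitly. The sketch below is an illustrative translation of the keep/remediate/discard logic described above, not code used in the study.

    def triage_item(p, d, has_iwf):
        """Map an item to one of the four categories in Fig. 1 and to an action."""
        acceptable_indices = (0.3 <= p <= 0.7) and (d > 0.2)
        if acceptable_indices and not has_iwf:
            return "I", "keep in the question bank unchanged"
        if acceptable_indices and has_iwf:
            return "II", "remediate the flaw, then keep"
        if not has_iwf:
            return "III", "discard"
        return "IV", "discard"

    print(triage_item(p=0.55, d=0.35, has_iwf=False))   # ('I', 'keep in the question bank unchanged')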

Discussion
In this study, we performed both quantitative and qualitative postexamination item analysis of the summative undergraduate pediatrics MCQ exam. The quantitative analysis revealed a range of item difficulty and discrimination levels, highlighting the importance of a diverse question bank in assessing a broad spectrum of student abilities. Qualitative item analysis, on the other hand, involves a more subjective review of each item; it helped to identify issues with cognitive level, item clarity, and writing flaws. The qualitative analysis complemented the quantitative findings and provided additional insights into the quality of the items. The findings underlined the value of both quantitative and qualitative item analysis in ensuring the validity and reliability of the exam and in building a robust question bank. Both are crucial for making decisions about whether to keep, review, or remove questions from the test or question bank; these decisions enabled us to identify ideal questions and develop a valid and reliable question bank for future assessments that will enhance the quality of assessment in undergraduate pediatrics.

An ideal MCQ is clear, focused, and relevant to the intended learning outcomes. It should have a single best answer and distractors that are plausible but incorrect. In addition, an ideal MCQ should have an appropriate level of difficulty and discrimination power. The findings from this study suggest that the proportion of ideal questions, as defined by three criteria (difficulty level of 0.3–0.7, discrimination level > 0.2, and 100% distractor efficiency), is lower than has been reported in previous studies. Specifically, only 4.7% of single best answer (SBA) questions met these criteria, and none of the extended matching questions (EMQs) did. This contrasts with previous studies, which reported that 15–20% of MCQs fulfilled all three criteria [22, 23]. These findings highlight the importance of rigorous question development and review processes to ensure the quality of MCQs, which could include strategies such as regular postexamination item analysis, peer review of questions, and ongoing training for question authors [24, 25].

In this study, the mean P was higher for the EMQ than for the SBA, although the difference was not statistically significant (t = −0.405, p = 0.686). The mean D was also higher for the EMQ than for the SBA, although the difference was again not statistically significant (t = −0.557, p = 0.620). Therefore, both formats demonstrated comparable levels of difficulty and discrimination in the context of this study. This contrasts with previous studies, which have reported significant differences in difficulty levels between these two formats; increasing the number of options influenced difficulty, as questions with more options were harder [26, 27]. This discrepancy could be explained by the high number of nonfunctioning distractors (NFDs) in the extended matching questions, which had a significant impact on both the item difficulty and discrimination levels. First, the presence of NFDs leads to easier questions: a high number of NFDs can make it easier for examinees to identify the correct answer. Second, NFDs can also affect item discrimination: if a question has many NFDs, it may not effectively discriminate between higher- and lower-achieving students. In this study, the percentage of nonfunctioning distractors in EMQs was 86.90%. These findings underline the importance of careful distractor selection and review in the development of EMQs. By reducing the number of NFDs, it may be possible to increase the item difficulty and discrimination levels of the questions, thereby improving the overall quality of the assessment.

Distractor analysis of MCQs can enhance the quality of exam items. We can fix MCQ items by replacing or removing nonfunctioning distractors rather than eliminating the whole item, which would save energy and time for future exams [24]. In both the SBA and the EMQ, we found a considerable number of nonfunctioning distractors (63.10% and 86.90%, respectively). We found that our faculty members need training in the construction of plausible distractors to improve the quality of MCQ exams [28]. In addition, we should reduce the number of options to three-option items instead of five-option items [29, 30]. Tarrant and Ware showed that three-option items perform as well as four-option items and suggested writing three-option items, as they require less time to develop [31]. NFDs were more commonly encountered in EMQ than in SBA; the EMQ had more options (8 compared with 5), so it may be more difficult to create plausible distractors that draw students to respond to them. All EMQs with many NFDs should be revised or even converted to SBA instead [32].

The reliability coefficient (KR-20) of the test was 0.84, which shows an acceptable level of reliability. The standard error of measurement (SEM) was 3.91. The SEM estimates the amount of error built into a test taker's score and aids evaluators in determining how an individual's observed test score and true score differ. Test reliability and the SEM are interconnected: the SEM decreases as test reliability increases [5]. For a short test (fewer than 50 items), a KR-20 of 0.7 is acceptable, while for a longer test (more than 50 items), a KR-20 of 0.8 would be acceptable. Test reliability can be improved by removing flawed items or very easy or very difficult items; items with poor correlation should be revised or discarded from the test [7].

In our study, we analyzed the cognitive levels of SBA and EMQ based on modified Bloom's taxonomy [14]. We found that 76.19% of SBA items assessed low cognitive levels, while only 25% of EMQ items assessed low cognitive skills; conversely, 75% of EMQ items assessed higher cognitive skills.

These results are similar to other studies, which found that 60.47% and 90% of MCQs were at low cognitive levels [13]. EMQs are recommended for use in undergraduate medical examinations to test the higher cognitive skills of advanced medical students, or in high-stakes examinations [33]. A mixed examination format including SBA and EMQ was the best way to distinguish poor from moderate and excellent students [34].

In this study, we also aimed to find common technical flaws in the MCQ pediatrics exam. We found that only 26.20% of SBA questions followed all best practices of the item-writing construction guidelines. The prevalence of item writing flaws was 73.80% for SBA, and all EMQ sets were flawed. This high proportion of flawed items was similar to other studies, in which approximately half of the analyzed items were considered flawed [35]. The high prevalence of IWFs in our study exposed the lack of preparation and time devoted by evaluators to MCQ construction. The most prevalent types of flaws in SBA questions were poorly constructed, short stems (35.70%) and negatively phrased stems (33.3%). Furthermore, all EMQs had flaws, and option flaws were the dominant type (100.00% no logical order, 25.00% nonhomogeneous, and 25.00% complex options). These findings were consistent with other studies [13, 35].

The presence of IWFs had a negative effect on the performance of high-achieving students, giving an advantage to borderline students who probably relied on testwiseness [36]. According to Downing, MCQ tests are threatened by two factors: construct-irrelevant variance (CIV) and construct underrepresentation (CUR). Construct-irrelevant variance is the incorrect inflation or deflation of assessment scores caused by certain types of uncontrolled or systematic measurement error. Construct underrepresentation is the downsampling of the cognitive domain. Flawed MCQs tend to be ambiguous, unjustifiably difficult, or unjustifiably easy; this is directly related to the CIV added to a test by flawed MCQs. CUR takes place when many of the test items are written to assess low levels of the cognitive domain, such as recall of facts [37]. All defective items found by quantitative item analysis should be examined for the presence of item writing flaws; those defective items need to be correctly reconstructed and validated, and feedback should be given to the items' authors for corrective action. Both quantitative and qualitative item analysis are necessary for the validation of viable question banks in undergraduate medical education programs [38].

Limitations and delimitations
Limitations

1. Subjectivity in qualitative analysis: While the qualitative item analysis provided valuable insights, it is inherently subjective. Different assessors might have different interpretations of item clarity, cognitive level, and writing flaws. This subjectivity could potentially impact the consistency of the analysis.
2. Scope of the study: The study was limited to a single summative undergraduate pediatrics MCQ exam. Therefore, the findings may not be generalizable to other exams or disciplines.
3. Sample size: The study's conclusions are based on the analysis of a single exam. A larger sample, including multiple exams over a longer period, might provide more robust and reliable findings.

Delimitations

1. Focus on MCQs: The study was delimited to two types of multiple-choice questions. Other types of questions, such as short answer or essay questions, were not included in the analysis.
2. Single medical school: The study was conducted within a single medical school, which may limit the generalizability of the findings to other medical schools with different student populations or assessment practices.

Despite these limitations and delimitations, the study provides valuable insights into the importance of both quantitative and qualitative item analysis in ensuring the validity and reliability of exams and in building a robust question bank. Future research could aim to address these limitations and delimitations to further enhance the quality of MCQ assessment in undergraduate medical education.

Conclusions
In summary, item analysis is a vital procedure for ascertaining the quality of MCQ assessments in undergraduate medical education. We demonstrated that quantitative item analysis can yield valuable data about the psychometric properties of each item and can assist in selecting "ideal MCQs" for the question bank. Nevertheless, quantitative item analysis is insufficient by itself; qualitative item analysis is also required to detect and rectify flawed items. We discovered that numerous items had satisfactory indices but were inadequately constructed or had a low cognitive level. Hence, both quantitative and qualitative item analysis can enhance the validity of MCQ assessments by supporting informed judgments about each item and the assessment as a whole.

Abbreviations
MCQ: multiple-choice question

IA: item analysis
P: item difficulty
D: item discrimination
DE: distractor efficiency
SBA: single best answer
EMQs: extended matching questions
IWFs: item writing flaws

Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s12909-024-05153-3.

Supplementary Material 1.

Authors' contributions
NIR, SRA, OAN and MHR conceived and designed the study and wrote the manuscript. NIR and MHR undertook all statistical analyses. NIR and MHR designed and implemented the assessment and provided data for analysis. All authors have read and approved the final manuscript.

Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB). This study received no institutional or external funding.

Availability of data and materials
Primary data are available from the corresponding author upon reasonable request.

Declarations

Ethics approval and consent to participate
Ethical approval was sought from the Faculty of Medicine, Alexandria University Human Research Ethics Committee, and the study was granted exemption from human research ethics review. Students were not asked to provide consent to participate, as the study was exempt from human research ethics review.

Consent for publication
Not applicable.

Competing interests
The authors declare no competing interests.

Author details
1 Pediatrics, Qena Faculty of Medicine, South Valley University, Qena, Egypt. 2 Community Medicine, Faculty of Medicine, Alexandria University, Alexandria, Egypt. 3 Clinical Pharmacology, Faculty of Medicine, Alexandria University, Alexandria, Egypt. 4 Medical Education, Faculty of Medicine, Alexandria University, Alexandria, Egypt.

Received: 24 September 2023  Accepted: 8 February 2024

References
1. van der Vleuten CPM. The assessment of professional competence: developments, research and practical implications. Adv Health Sci Educ. 1996;1(1):41–67.
2. Kumar A, George C, Harry Campbell M, Krishnamurthy K, Michele Lashley P, Singh V, et al. Item analysis of multiple choice and extended matching questions in the final MBBS medicine and therapeutics examination. J Med Educ. 2022;21(1).
3. Salam A, Yousuf R, Bakar SMA. Multiple choice questions in medical education: how to construct high quality questions. Int J Human Health Sci (IJHHS). 2020;4(2):79.
4. Reynolds CR, Altmann RA, Allen DN. Mastering modern psychological testing. 2nd ed. Cham: Springer International Publishing; 2021.
5. Tavakol M, Dennick R. Post-examination analysis of objective tests. Med Teach. 2011;33(6):447–58.
6. Rahim Hingorjo M, Jaleel F. Analysis of one-best MCQs: the difficulty index, discrimination index and distractor efficiency. J Pak Med Assoc. 2012;62(2):142. Available from: https://www.researchgate.net/publication/228111127
7. Tavakol M, O'Brien D. Psychometrics for physicians: everything a clinician needs to know about assessments in medical education. Int J Med Educ. 2022;13:100–6.
8. Rush BR, Rankin DC, White BJ. The impact of item-writing flaws and item complexity on examination item difficulty and discrimination value. BMC Med Educ. 2016;16(1):250.
9. Kaur M, Singla S, Mahajan R. Item analysis of in use multiple choice questions in pharmacology. Int J Appl Basic Med Res. 2016;6(3):170.
10. Ali SH, Ruit KG. The impact of item flaws, testing at low cognitive level, and low distractor functioning on multiple-choice question quality. Perspect Med Educ. 2015;4(5):244–51.
11. Billings MS, Deruchie K, Hussie K, Kulesher A, Merrell J, Swygert KA, et al. Constructing written test questions for the health sciences. 6th ed. Philadelphia: National Board of Medical Examiners; 2020. Available from: www.nbme.org (accessed 22 August 2023).
12. Haladyna TM, Downing SM, Rodriguez MC. A review of multiple-choice item-writing guidelines for classroom assessment. Appl Meas Educ. 2002;15(3):309–34. https://doi.org/10.1207/S15324818AME1503_5
13. Tariq S, Tariq S, Maqsood S, Jawed S, Baig M. Evaluation of cognitive levels and item writing flaws in medical pharmacology internal assessment examinations. Pak J Med Sci. 2017;33(4):866–70.
14. Palmer EJ, Devitt PG. Assessment of higher order cognitive skills in undergraduate education: modified essay or multiple choice questions? Research paper. BMC Med Educ. 2007;7.
15. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74.
16. Case SM, Swanson DB. Extended-matching items: a practical alternative to free-response questions. Teach Learn Med. 1993;5(2):107–15.
17. Tavakol M, Dennick R. Post-examination interpretation of objective test data: monitoring and improving the quality of high-stakes examinations: AMEE Guide No. 66. Med Teach. 2012;34(3).
18. Downing SM. Reliability: on the reproducibility of assessment data. Med Educ. 2004;38:1006–12. Available from: https://asmepublications.onlinelibrary.wiley.com/doi/10.1111/j.1365-2929.2004.01932.x
19. Templeton GF. A two-step approach for transforming continuous variables to normal: implications and recommendations for IS research. Commun Assoc Inf Syst. 2011;28.
20. Tavakol M, Dennick R. Making sense of Cronbach's alpha. Int J Med Educ. 2011;27(2):53–5.
21. Tighe J, McManus I, Dewhurst NG, Chis L, Mucklow J. The standard error of measurement is a more appropriate measure of quality for postgraduate medical assessments than is reliability: an analysis of MRCP (UK) examinations. BMC Med Educ. 2010;10(1):40.
22. Kumar D, Jaipurkar R, Shekhar A, Sikri G, Srinivas V. Item analysis of multiple choice questions: a quality assurance test for an assessment tool. Med J Armed Forces India. 2021;1(77):S85–9.
23. Wajeeha D, Alam S, Hassan U, Zafar T, Butt R, Ansari S, et al. Difficulty index, discrimination index and distractor efficiency in multiple choice questions. Annals of PIMS. 2018;4. ISSN 1815-2287.
24. Tarrant M, Ware J, Mohammed AM. An assessment of functioning and non-functioning distractors in multiple-choice questions: a descriptive analysis. BMC Med Educ. 2009;9(1).
25. Tarrant M, Knierim A, Hayes SK, Ware J. The frequency of item writing flaws in multiple-choice questions used in high stakes nursing assessments. Nurse Educ Pract. 2006;6(6):354–63.
26. Swanson DB, Holtzman KZ, Allbee K, Clauser BE. Psychometric characteristics and response times for content-parallel extended-matching and one-best-answer items in relation to number of options. Acad Med. 2006;81(Suppl):S52–5.
27. Case SM, Swanson DB, Ripkey DR. Comparison of items in five-option and extended-matching formats for assessment of diagnostic skills. Acad Med. 1994;69(10):S1–3.

28. Naeem N, van der Vleuten C, Alfaris EA. Faculty development on item writing substantially improves item quality. Adv Health Sci Educ. 2012;17(3):369–76.
29. Raymond MR, Stevens C, Bucak SD. The optimal number of options for multiple-choice questions on high-stakes tests: application of a revised index for detecting nonfunctional distractors. Adv Health Sci Educ. 2019;24(1):141–50.
30. Kilgour JM, Tayyaba S. An investigation into the optimal number of distractors in single-best answer exams. Adv Health Sci Educ. 2016;21(3):571–85.
31. Tarrant M, Ware J. A comparison of the psychometric properties of three- and four-option multiple-choice questions in nursing assessments. Nurse Educ Today. 2010;30(6):539–43.
32. Vuma S, Sa B. A descriptive analysis of extended matching questions among third year medical students. Int J Res Med Sci. 2017;5(5):1913.
33. Frey A, Leutritz T, Backhaus J, Hörnlein A, König S. Item format statistics and readability of extended matching questions as an effective tool to assess medical students. Sci Rep. 2022;12(1).
34. Eijsvogels TMH, van den Brand TL, Hopman MTE. Multiple choice questions are superior to extended matching questions to identify medicine and biomedical sciences students who perform poorly. Perspect Med Educ. 2013;2(5–6):252–63.
35. Downing SM. The effects of violating standard item writing principles on tests and students: the consequences of using flawed test items on achievement examinations in medical education. Adv Health Sci Educ. 2005;10(2):133–43.
36. Tarrant M, Ware J. Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Med Educ. 2008;42(2):198–206.
37. Downing SM. Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct underrepresentation. Adv Health Sci Educ. 2002;7(3):235–41.
38. Bhat SK, Prasad KHL. Item analysis and optimizing multiple-choice questions for a viable question bank in ophthalmology: a cross-sectional study. Indian J Ophthalmol. 2021;69(2):343–6.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
